Forward–backward algorithm

The algorithm makes use of the principle of dynamic programming to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes.

The term forward–backward algorithm is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner; in this sense, the descriptions in the remainder of this article refer only to one specific instance of this class.

In the first pass, the forward–backward algorithm computes a set of forward probabilities which provide, for all $k \in \{1, \dots, t\}$, the probability of ending up in any particular state given the first $k$ observations in the sequence, i.e. $P(X_k \mid o_{1:k})$. In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point $k$, i.e. $P(o_{k+1:t} \mid X_k)$.

These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:

$$P(X_k \mid o_{1:t}) = P(X_k \mid o_{1:k}, o_{k+1:t}) \propto P(o_{k+1:t} \mid X_k)\, P(X_k \mid o_{1:k}).$$

The last step follows from an application of Bayes' rule and the conditional independence of $o_{k+1:t}$ and $o_{1:k}$ given $X_k$.

As outlined above, the algorithm involves three steps: computing forward probabilities, computing backward probabilities, and computing smoothed values. The forward and backward steps may also be called "forward message pass" and "backward message pass"; these terms are due to the message passing used in general belief propagation approaches.

This step allows the algorithm to take any past observations of the output into account when computing more accurate results.

We transform the probability distributions related to a given hidden Markov model into matrix notation as follows.

The transition probabilities $P(X_t \mid X_{t-1})$ of a given random variable $X_t$, representing all possible states in the hidden Markov model, are collected in the matrix $\mathbf{T}$, where the row index $i$ represents the start state and the column index $j$ the target state, so that a transition from the row-vector state $\mathbf{\pi}_t$ to the row-vector state $\mathbf{\pi}_{t+1}$ is written as $\mathbf{\pi}_{t+1} = \mathbf{\pi}_t \mathbf{T}$. In the example used throughout this article, the probability of staying in the same state after each step is 70% and the probability of transitioning to the other state is 30%, so the transition matrix is:

$$\mathbf{T} = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}$$

In a hidden Markov model the state itself is not observed; instead we observe events associated with the possible states. The event likelihoods are collected in a matrix $\mathbf{B}$ whose rows correspond to states and whose columns correspond to events. In the example, event 1 is observed 90% of the time in state 1 and only 20% of the time in state 2:

$$\mathbf{B} = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$$

Given an arbitrary row-vector describing the state of the system ($\mathbf{\pi}$), the probability of observing event $j$ is then:

$$P(O = j) = \mathbf{\pi}\,\mathbf{B}_{*,j}$$

The probability of a given state leading to the observed event $j$ can be represented in matrix form by multiplying the state row-vector ($\mathbf{\pi}$) with an observation matrix ($\mathbf{O}_j = \mathrm{diag}(\mathbf{B}_{*,j})$) containing only diagonal entries. Continuing the above example, the observation matrix for event 1 would be:

$$\mathbf{O}_1 = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.2 \end{pmatrix}$$

This allows us to calculate the new unnormalized probability state vector $\mathbf{\pi}'$, weighting by the likelihood that each element of $\mathbf{\pi}$ generated event 1, as:

$$\mathbf{\pi}' = \mathbf{\pi}\,\mathbf{O}_1$$
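As a small illustration of this update, the following NumPy snippet (a sketch added for illustration; the state row-vector chosen here is arbitrary) builds the diagonal observation matrix for event 1 from the event likelihood matrix above and applies it to a state vector:

```python
import numpy as np

# Event likelihood matrix B from the example: rows are states, columns are events.
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Observation matrix for event 1: the first column of B placed on a diagonal.
O1 = np.diag(B[:, 0])

# An arbitrary state row-vector.
pi = np.array([0.5, 0.5])

# Unnormalized state vector after weighting each state by the likelihood
# that it generated event 1.
print(pi @ O1)   # [0.45 0.1 ]
```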

We can now make this general procedure specific to our series of observations. Assuming an initial state vector $\mathbf{\pi}_0$ (which can be optimized as a parameter through repetitions of the forward–backward procedure), we begin with $\mathbf{f}_{0:0} = \mathbf{\pi}_0$, then update the state distribution and weight by the likelihood of the first observation:

$$\mathbf{f}_{0:1} = \mathbf{\pi}_0\,\mathbf{T}\,\mathbf{O}_{o_1}$$

This process can be carried forward with additional observations using:

$$\mathbf{f}_{0:t} = \mathbf{f}_{0:t-1}\,\mathbf{T}\,\mathbf{O}_{o_t}$$

This value is the forward unnormalized probability vector; its $i$-th entry gives the joint probability of the observations so far and of being in state $x_i$ at time $t$. Typically, the probability vector is normalized at each step so that its entries sum to 1, which introduces a scaling factor at each step:

$$\mathbf{\hat f}_{0:t} = c_t^{-1}\,\mathbf{\hat f}_{0:t-1}\,\mathbf{T}\,\mathbf{O}_{o_t}$$

where $\mathbf{\hat f}_{0:t-1}$ represents the scaled vector from the previous step and $c_t$ represents the scaling factor that causes the resulting vector's entries to sum to 1.
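This recursion is straightforward to express in code. The following is a short NumPy sketch (NumPy and the helper name forward_pass are choices made here for illustration; the observation indices in the usage line are arbitrary, with 0 standing for event 1 and 1 for event 2):

```python
import numpy as np

def forward_pass(pi_0, T, B, observations):
    """Scaled forward recursion: returns the vectors f_hat and the scaling factors c."""
    f_hat = [np.asarray(pi_0, dtype=float)]
    c = []
    for obs in observations:
        O = np.diag(B[:, obs])        # observation matrix O_{o_t}
        f = f_hat[-1] @ T @ O         # unnormalized update f_{0:t-1} T O_{o_t}
        c.append(f.sum())             # scaling factor c_t
        f_hat.append(f / c[-1])       # scaled forward vector
    return f_hat, c

# Example matrices from above and an arbitrary sequence of event indices.
T = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
f_hat, c = forward_pass([0.5, 0.5], T, B, [0, 0, 1])
```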

A similar procedure can be constructed to find the backward probabilities, which give, for each state, the probability of observing the remaining future events from that state. Since the initial state is assumed as given (i.e. the prior probability of this state = 100%), we begin with

$$\mathbf{b}_{T:T} = (1\ \ 1\ \ \dots\ \ 1)^{\mathsf{T}}$$

Notice that we are now using a column vector while the forward probabilities used row vectors. Working backwards through the observation sequence, the recursion is

$$\mathbf{b}_{t-1:T} = \mathbf{T}\,\mathbf{O}_{o_t}\,\mathbf{b}_{t:T}.$$

Noting that each entry contains the probability of the future event sequence given a particular initial state, normalizing this vector would be equivalent to applying Bayes' theorem to find the likelihood of each initial state given the future events (assuming uniform priors for the final state vector).
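A matching NumPy sketch of the backward recursion (again an illustrative helper, not part of the original text; the vectors are renormalized at each step only to keep their entries well scaled):

```python
import numpy as np

def backward_pass(T, B, observations):
    """Backward recursion: returns one vector per time step, latest time last."""
    n = T.shape[0]
    b = [np.ones(n)]                          # b_{T:T}: a vector of ones
    for obs in reversed(observations):
        v = T @ np.diag(B[:, obs]) @ b[0]     # b_{t-1:T} = T O_{o_t} b_{t:T}
        b.insert(0, v / v.sum())              # rescale to keep the entries well conditioned
    return b

T = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
b = backward_pass(T, B, [0, 0, 1])
```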

This example takes as its basis the umbrella world in Russell & Norvig 2010, Chapter 15, in which we would like to infer the weather from observations of another person either carrying or not carrying an umbrella. We assume two possible states for the weather (state 1 = rain, state 2 = no rain) together with the transition and event matrices given above, and we observe the event sequence {umbrella, umbrella, no umbrella, umbrella, umbrella}.

Performing these calculations and normalizing the results provides the filtered probability of each weather state for every day of the sequence. For the backward probabilities, we start with $\mathbf{b}_{5:5} = (1.0\ \ 1.0)^{\mathsf{T}}$ and are then able to compute the backward vectors for the earlier days (using the observations in reverse order and normalizing with different constants). Finally, we compute the smoothed probability values.

The backward probability vectors above thus actually represent the likelihood of each state at time t given the future observations.

Because these vectors are proportional to the actual backward probabilities, the result has to be scaled an additional time.
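A sketch of this combination step, assuming f_hat and b hold the per-step forward and backward vectors (as produced, for instance, by the sketches above); the element-wise product is rescaled one additional time so that each smoothed vector sums to 1:

```python
import numpy as np

def smooth(f_hat, b):
    """Combine forward and backward vectors into smoothed state probabilities."""
    smoothed = []
    for f_t, b_t in zip(f_hat, b):
        g = np.asarray(f_t) * np.asarray(b_t)   # element-wise product
        smoothed.append(g / g.sum())            # additional rescaling so entries sum to 1
    return smoothed
```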

Notice that the smoothed value for the final time step coincides with the scaled forward vector $\mathbf{\hat f}_{0:5}$, and the smoothed value for time 0 coincides with the scaled backward vector $\mathbf{\hat b}_{0:5}$: both of these vectors already begin with uniform priors over the initial and final state vectors (respectively) and take into account all of the observations. However, the smoothed value for time 0 is only equal to the scaled backward vector when our initial state vector represents a uniform prior (i.e. all entries are equal); when this is not the case, the backward vector needs to be combined with the initial state vector to find the most likely initial state.

We thus find that the forward probabilities by themselves are sufficient to calculate the most likely final state.

The calculations above reveal that the most probable weather state on every day except for the third one was "rain".
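The following self-contained NumPy sketch of this computation (an illustration following the scaled forward pass, the backward pass, and the smoothing step described above; event index 0 stands for an observed umbrella, 1 for no umbrella) prints the more probable weather state for each day:

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # transitions: state 1 = rain, state 2 = no rain
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # event likelihoods: umbrella, no umbrella
obs = [0, 0, 1, 0, 0]                    # umbrella, umbrella, no umbrella, umbrella, umbrella

# Forward pass with per-step normalization, starting from a uniform prior.
f = [np.array([0.5, 0.5])]
for o in obs:
    v = f[-1] @ T @ np.diag(B[:, o])
    f.append(v / v.sum())

# Backward pass, run over the observations in reverse order.
b = [np.ones(2)]
for o in reversed(obs):
    v = T @ np.diag(B[:, o]) @ b[0]
    b.insert(0, v / v.sum())

# Smoothed values: element-wise product of forward and backward vectors, rescaled.
for day in range(1, len(obs) + 1):
    g = f[day] * b[day]
    g /= g.sum()
    print(f"day {day}: {'rain' if g[0] > g[1] else 'no rain'} ({g.round(4)})")
```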

Perhaps most importantly, the smoothed value for the final time step quantifies our knowledge of the state vector at the end of the observation sequence, which can then be used to predict the probability of the various weather states on the following day.

The forward–backward algorithm runs with time complexity $O(S^{2} T)$ in space $O(S T)$, where $T$ is the length of the observation sequence and $S$ is the number of hidden states.[1] The algorithm can also run in constant space with time complexity $O(S^{2} T^{2})$ by recomputing values at each step.

For comparison, a brute-force procedure would enumerate all $S^{T}$ possible hidden-state sequences and evaluate the joint probability of each with the observed series of events. Brute force is intractable for realistic problems, as the number of possible hidden-state sequences is typically extremely high.

An enhancement to the general forward–backward algorithm, called the Island algorithm, trades smaller memory usage for longer running time: it needs only memory that grows logarithmically with the sequence length, at the cost of an additional logarithmic factor in running time. Furthermore, it is possible to invert the process model so that the algorithm requires storage independent of the sequence length while keeping linear running time, although the inverted process may not exist or be ill-conditioned.[4]

Given an HMM represented in the Python programming language (just as in the article on the Viterbi algorithm), we can write an implementation of the forward–backward algorithm as follows. The function fwd_bkw takes the following arguments: x is the sequence of observations, e.g. ['normal', 'cold', 'dizzy']; states is the set of hidden states; a_0 is the start probability; a are the transition probabilities; and e are the emission probabilities.

For simplicity of code, we assume that the observation sequence x is non-empty and that a[i][j] and e[i][j] are defined for all states i, j.
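A minimal sketch of such an implementation is given below. The toy HMM used at the end (the hidden states 'Healthy' and 'Fever' and the particular probability values) is an illustrative assumption in the style of the Viterbi-algorithm example, not data taken from this article; only the observation alphabet matches the example above.

```python
def fwd_bkw(x, states, a_0, a, e):
    """Forward-backward smoothing for a discrete HMM given as dictionaries.

    No per-step scaling is applied here; for long sequences the scaled
    recursions described above should be used to avoid underflow.
    """
    # Forward pass: fwd[t][st] is proportional to P(x_1..x_t, X_t = st).
    # a_0 is taken as the state distribution at the time of the first
    # observation, so no transition step is applied before the first emission.
    fwd = []
    for t, x_t in enumerate(x):
        f_curr = {}
        for st in states:
            if t == 0:
                prev_sum = a_0[st]
            else:
                prev_sum = sum(fwd[-1][k] * a[k][st] for k in states)
            f_curr[st] = e[st][x_t] * prev_sum
        fwd.append(f_curr)

    # Backward pass: bkw[t][st] is proportional to P(x_{t+1}..x_T | X_t = st).
    bkw = [{st: 1.0 for st in states}]        # all ones at the final time step
    for x_next in reversed(x[1:]):
        b_prev = bkw[0]
        b_curr = {st: sum(a[st][k] * e[k][x_next] * b_prev[k] for k in states)
                  for st in states}
        bkw.insert(0, b_curr)

    # Smoothed values: posterior[t][st] = P(X_t = st | x_1..x_T).
    posterior = []
    for t in range(len(x)):
        norm = sum(fwd[t][k] * bkw[t][k] for k in states)
        posterior.append({st: fwd[t][st] * bkw[t][st] / norm for st in states})
    return fwd, bkw, posterior


# Illustrative toy HMM (assumed values, in the style of the Viterbi example).
states = ('Healthy', 'Fever')
a_0 = {'Healthy': 0.6, 'Fever': 0.4}
a = {'Healthy': {'Healthy': 0.7, 'Fever': 0.3},
     'Fever':   {'Healthy': 0.4, 'Fever': 0.6}}
e = {'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
     'Fever':   {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}}

print(fwd_bkw(['normal', 'cold', 'dizzy'], states, a_0, a, e)[2])
```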