These are some notes on the Forward-Backward algorithm for Hidden Markov Models (HMMs). The focus of this post is on the derivations and on the variations of these algorithms. When possible I try to give an interpretation of the probabilities involved.
Some basic understanding of HMMs is probably needed before reading this text. Reference [1] provides an in-depth introduction, and good resources are also available online, for example Wikipedia’s pages on HMMs and the forward-backward algorithm.
Some code snippets are shown throughout. They are written in the Julia language and should be very readable, almost as simple as pseudo-code.
For further insights see the documentation of HMMGradients.jl.
Hidden Markov model notation
The states of a Hidden Markov Model (HMM) \(s_j \in \mathcal{N}\) for \(j=1,\dots,N_s\) are defined as integers. A state sequence is the vector \(\mathbf{s} \in \mathcal{N}^{N_t}\), where \(N_t\) is the number of time steps.
The parameters of the HMM are defined in \(\lambda = \{\mathbf{A}, \mathbf{a}, \mathcal{B} \}\).
Here \(\mathbf{A} \in \mathcal{R}^{N_s \times N_s}\) is the transition matrix, where \(p(s_t=j|s_{t-1}=i)= a_{ij}\) with \(a_{ij}\) being the \((i,j)\)-th element of \(\mathbf{A}\) (Markov process of order 1), and \(\mathbf{a} \in \mathcal{R}^{N_s}\) is the vector of initial state probabilities. \(\mathcal{B} = \{ b_1, \dots, b_{N_s} \}\) is a set of distributions, one per state, where \(p(y_t|s_t=j)=b_{j}(y_t)\) (observation independence, memoryless model) and \(y_t\) is the observation at time \(t\).
using Distributions, LinearAlgebra, Combinatorics
# define Markov model's parameters (λ)
Ns = 2 # number of states
a = [0.5; 0.5] # initial state probability
A = [0.7 0.3; 0.3 0.7] # transition matrix [a_11 a_12; a_21 a_22]
# Must be row stochastic: [sum(A[i,:]) for i in 1:size(A,1)] .== 1
B = [Categorical([0.9, 0.1]), Categorical([0.2, 0.8])]
# observation distribution [b_1; b_2]
# Gives the probability of an observation given a state: p(y_t|s_t=j)
# e.g. `pdf(B[1],2) == 0.1` gives the probability
# of observing y_t = 2 while being in state 1
Posterior likelihood
Say we have an observation sequence \(\mathbf{y} \in \mathcal{R}^{N_t}\), where \(N_t\) is the number of time steps.
Let’s compute the likelihood of the observations \(\mathbf{y}\) under our Markov model with parameters \(\lambda\).
Using the law of total probability, this likelihood can be defined as:
\[p_\lambda ( \mathbf{y} ) = \sum_{\mathbf{s} \in \mathcal{S}} p_{\lambda} (\mathbf{y} | \mathbf{s} ) p( \mathbf{s} )\]where \(p_{\lambda} (\mathbf{y} | \mathbf{s} )\) is the likelihood of the observations given a state sequence \(\mathbf{s} \in \mathcal{N}^{N_t}\), \(p( \mathbf{s} )\) is the probability of a particular state sequence and \(\mathcal{S}\) is the set containing all possible state sequences.
Obtaining \(p_{\lambda}(\mathbf{y})\) is necessary in different circumstances, for example when learning the parameters \(\lambda\).
In this case we want to maximize \(p_{\lambda}(\mathbf{y})\):
\[\lambda^\star = \arg \max _{\lambda} p_{\lambda} (\mathbf{y})\]where \(\lambda^\star\) is the set of parameters such that a local maximum is reached.
This is generally a nonconcave (nonconvex) optimization problem. Typically an expectation–maximization (EM) procedure is used to solve it.
A simplified summary of the EM algorithm is the following:
1. start from an initial guess \(\lambda^0\) and set the iteration counter \(k = 0\)
2. find a concave approximation (auxiliary function) of \(p_\lambda(\mathbf{y})\), call it \(Q(\lambda, \lambda^{k})\)
3. solve \(\lambda^{k+1} = \arg \max_{\lambda} Q(\lambda, \lambda^{k})\)
4. check if a stopping criterion is reached; otherwise set \(k \leftarrow k+1\) and go to step 2
More information about the EM algorithm can be found in [2].
Steps 2 and 3 can be performed efficiently with the help of the forward-backward algorithm.
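Before moving on, here is a minimal sketch of the shape of the EM loop. The `update` argument is a hypothetical helper standing in for steps 2 and 3 (it should build \(Q(\lambda, \lambda^k)\), maximize it, and return the new parameters together with the current likelihood):
# Generic EM loop skeleton (a sketch, not a full Baum-Welch implementation)
function em(update, λ, y; maxiter=100, tol=1e-8)
    p_old = -Inf
    for k in 1:maxiter
        λ, p = update(λ, y)            # steps 2-3: build Q(λ, λᵏ) and maximize it
        abs(p - p_old) < tol && break  # step 4: stopping criterion on the likelihood
        p_old = p
    end
    return λ
end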
To start with a baseline, let’s first have a look at the simplest approach for computing \(p_\lambda(\mathbf{y})\).
We can compute \(p_\lambda ( \mathbf{y} )\) by brute force, i.e. by summing over all the elements of \(\mathcal{S}\).
What we need is \(p_\lambda (\mathbf{y} | \mathbf{s} ) = \prod_{t = 1}^{N_t} b_{s_t} (y_t)\), where \(y_t\) is the \(t\)-th element of \(\mathbf{y}\) (observation independence), and \(p_\lambda (\mathbf{s}) = a_{s_1} \prod_{t = 2}^{N_t} a_{s_{t-1},s_t}\), where \(a_{s_1}\) is the \(s_1\)-th element of \(\mathbf{a}\) and \(a_{s_{t-1},s_t}\) is the \((s_{t-1},s_t)\)-th element of \(\mathbf{A}\) (Markov process of order 1).
Nt = 5 # time window length
y = [1;1;2;1;1] # observations
s_all = collect(multiset_permutations([1;2],[5;5],5))
# this has all possible state sequences e.g. [1;1;1;1;1], [1;1;1;1;2] ...
# these are Ns^Nt sequences
# Probability of sequence of states
Pλ_s = zeros(length(s_all))
for z = 1:length(s_all)
for t = 1:Nt
if t == 1
Pλ_s[z] = a[ s_all[z][t] ]
else
Pλ_s[z] *= A[ s_all[z][t-1], s_all[z][t] ]
end
end
end
@assert sum(Pλ_s) ≈ 1 # check it's a probability
# Likelihood of y given a sequence of states
pλ_y_s = ones(length(s_all))
for z = 1:length(s_all)
for t = 1:Nt
pλ_y_s[z] *= pdf(B[ s_all[z][t] ], y[t])
end
end
Pλ = sum(Pλ_s .* pλ_y_s)
0.03430370049999999
Baum’s forward
Brute force can easily become intractable, since we have to compute likelihoods and probabilities for all \(N_s^{N_t}\) state sequences.
It is possible to express \(p_\lambda (\mathbf{y})\) by marginalizing the likelihood at a specific time \(t\):
\[p_\lambda (\mathbf{y}) = \sum_{j = 1}^{N_s} p_{\lambda} (\mathbf{y}, s_t = j),\]using the conditional probability definition we can further decompose the summand into:
\[p_{\lambda} (\mathbf{y},s_t = j) = p_{\lambda} (y_1, \dots, y_t, s_t = j) \cdot p_{\lambda} (y_{t+1}, \dots, y_{N_t} | y_1, \dots, y_t, s_t = j)\]and drop the conditioning on \(\{ y_1, \dots, y_t \}\) in the second factor, since given \(s_t\) the future observations are independent of the past ones (Markov property and observation independence).
We can now define the two terms of the product as:
- \[\alpha_t(j) = p_{\lambda} (y_1, \dots, y_t, s_t = j)\]
- \[\beta_t (j) = p_{\lambda} (y_{t+1}, \dots, y_{N_t} | s_t = j)\]
We can manipulate \(\alpha_t(j)\) as follows:
- \(p_{\lambda} (y_1, \dots, y_t, s_t = j) = \sum_{i=1}^{N_s} p_{\lambda} (y_1, \dots, y_t, s_t = j, s_{t-1} = i)\) (marginalization)
- \(\sum_{i=1}^{N_s} p_{\lambda} (y_1, \dots, y_t, s_t = j, s_{t-1} = i) = \sum_{i=1}^{N_s} p_{\lambda} (y_1, \dots, y_{t-1}, s_{t-1} = i) \cdot p_{\lambda}(y_t, s_t = j | y_1, \dots, y_{t-1}, s_{t-1} = i)\) (conditional probability definition)
- \(\sum_{i=1}^{N_s} \alpha_{t-1}(i) \cdot p_{\lambda}(y_{t}, s_t = j | y_1, \dots, y_{t-1}, s_{t-1} = i) = \sum_{i=1}^{N_s} \alpha_{t-1}(i) \cdot p_{\lambda}(y_{t} | s_t = j, y_1, \dots, y_{t-1}, s_{t-1} = i) \cdot p_{\lambda}(s_t = j | y_1, \dots, y_{t-1}, s_{t-1} = i)\) (conditional probability definition again)
- \(\sum_{i=1}^{N_s} \alpha_{t-1}(i) \, p_\lambda(y_t | s_{t} = j) \, p_\lambda(s_t = j | s_{t-1} = i) = \sum_{i=1}^{N_s} \alpha_{t-1}(i) \, b_j(y_t) \, a_{ij}\) (dropping conditioning terms due to Markov order 1 and observation independence)
To summarize:
\(\alpha_t(j) = \sum_{i=1}^{N_s} \alpha_{t-1}(i) a_{ij} b_j(y_t)\) for \(t = 2, \dots, N_t\) and \(j = 1, \dots, N_s\),
with the initialization \(\alpha_1 (j) = a_j b_j(y_1)\) for \(j = 1, \dots, N_s\).
Notice that \(\alpha_{N_t}(j) = p_{\lambda} (y_1, \dots, y_{N_t}, s_{N_t} = j)\); comparing this with the definition of \(p_\lambda(\mathbf{y})\):
\[p_\lambda(\mathbf{y}) = \sum_{j = 1}^{N_s} \alpha_{N_t} (j)\]This means that the forward algorithm can be used to compute the value of the cost function.
function Baum_forward(A,a,B,y)
    Nt, Ns = length(y), size(A,1)
    alpha = zeros(Nt,Ns)
    # initialization: α_1(j) = a_j b_j(y_1)
    for j = 1:Ns
        alpha[1,j] = a[j]*pdf( B[j], y[1] )
    end
    # recursion: α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(y_t)
    for t = 2:Nt
        for j = 1:Ns
            for i = 1:Ns
                alpha[t,j] += alpha[t-1,i]*A[i,j]*pdf( B[j], y[t] )
            end
        end
    end
    return alpha
end
alpha = Baum_forward(A,a,B,y)
@assert sum(Pλ_s .* pλ_y_s) ≈ sum(alpha[end,:])
println(sum(alpha[end,:]))
0.034303700500000006
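The same recursion can also be written in vectorized form, \(\boldsymbol{\alpha}_t = (\mathbf{A}^\top \boldsymbol{\alpha}_{t-1}) \odot \mathbf{b}(y_t)\), where \(\odot\) is the element-wise product. Here is a minimal equivalent sketch (the name `Baum_forward_vec` is mine):
# Vectorized forward pass, equivalent to the triple loop of Baum_forward
function Baum_forward_vec(A,a,B,y)
    Nt, Ns = length(y), size(A,1)
    alpha = zeros(Nt,Ns)
    alpha[1,:] = a .* [pdf(B[j], y[1]) for j in 1:Ns]
    for t = 2:Nt
        alpha[t,:] = (A' * alpha[t-1,:]) .* [pdf(B[j], y[t]) for j in 1:Ns]
    end
    return alpha
end
@assert Baum_forward_vec(A,a,B,y) ≈ alpha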
Baum’s backward
We can also compute
\[\beta_t (j) = p_{\lambda} (y_{t+1}, \dots, y_{N_t} | s_t = j)\]in a similar fashion.
By applying similar tricks as before, it turns out that by setting
\[\beta_{N_t}(j) = 1 \ \forall j = 1, \dots, N_s\]\(\beta_t(j)\) can be computed going back through time (backwards):
\(\beta_t(j) = \sum_{k=1}^{N_s} \beta_{t+1}(k) a_{jk} b_k (y_{t+1})\) for \(t = N_t-1,\dots,1\).
Once this is computed, by definition we have:
\(p_{\lambda} (\mathbf{y}) = \sum_{j = 1}^{N_s} \alpha_t(j) \beta_t(j)\) \(\forall t = 1, \dots, N_t\)
function Baum_backward(A,B,y)
    Nt, Ns = length(y), size(A,1)
    beta = zeros(Nt,Ns)
    # initialization: β_{N_t}(j) = 1
    beta[end,:] .= 1
    # recursion (backwards): β_t(j) = Σ_k β_{t+1}(k) a_jk b_k(y_{t+1})
    for t = Nt-1:-1:1
        for j = 1:Ns
            for k = 1:Ns
                beta[t,j] += beta[t+1,k]*A[j,k]*pdf( B[k], y[t+1] )
            end
        end
    end
    return beta
end
beta = Baum_backward(A,B,y)
@assert all(sum(alpha.*beta,dims=2) .≈ sum(alpha[end,:]))
sum(alpha.*beta,dims=2)
5×1 Array{Float64,2}:
0.0343037005
0.034303700500000006
0.034303700500000006
0.034303700500000006
0.034303700500000006
Getting posterior probabilities
Baum’s forward and backward algorithms suffer from underflow: as \(N_t\) increases, \(\alpha_{N_t} (j) \rightarrow 0\) and \(\beta_1 (j) \rightarrow 0\), eventually falling below machine precision. Additionally, only likelihoods are computed, not posterior probabilities.
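A quick toy check of this, reusing the model defined above with a hypothetical long random observation sequence `ylong`:
ylong = rand(1:2, 5000)          # a long random observation sequence
alphalong = Baum_forward(A,a,B,ylong)
println(sum(alphalong[end,:]))   # prints 0.0: the likelihood has underflowed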
In what follows we show that it is possible to solve the underflow issue and actually obtain posterior probabilities [3].
The posterior probability is defined as:
\[p_{\lambda}(s_t = j | \mathbf{y}) = p_\lambda(s_t = j, y_1, \dots, y_{N_t}) / p_\lambda(\mathbf{y})\]we can split the numerator and denominator as before:
\[\left( p_\lambda(s_t = j, y_1, \dots, y_t) \cdot p_\lambda(y_{t+1}, \dots, y_{N_t} | s_t = j) \right) / \left( p_\lambda(y_1, \dots, y_t) \, p_\lambda(y_{t+1}, \dots, y_{N_t} | y_1, \dots, y_t) \right)\]then we have:
- \[\bar{\alpha}_t(j) = p_\lambda(s_t = j, y_1, \dots, y_t) / p_\lambda(y_1, \dots, y_t) = p_\lambda(s_t = j | y_1, \dots, y_t)\] which can be interpreted as the probability of being in state \(j\) given the observations up to time \(t\);
- \[\bar{\beta}_t(j) = p_\lambda(y_{t+1}, \dots, y_{N_t} | s_t = j) / p_\lambda(y_{t+1}, \dots, y_{N_t} | y_1, \dots, y_t)\] which has a more difficult interpretation.
Applying the conditional probability definition to the numerator and denominator of \(\bar{\alpha}_t(j)\):
\[\bar{\alpha}_t(j) = \frac{ p_\lambda (s_t = j, y_t | y_1, \dots, y_{t-1}) \, p_\lambda(y_1,\dots,y_{t-1}) }{ p_\lambda(y_{t} | y_{1}, \dots, y_{t-1}) \, p_\lambda(y_1,\dots,y_{t-1}) } = \frac{ p_\lambda (s_t = j, y_t | y_1, \dots, y_{t-1}) }{ p_\lambda(y_{t} | y_{1}, \dots, y_{t-1}) }\]
Expanding the numerator:
\[p_\lambda (s_t = j, y_t | y_1, \dots, y_{t-1}) = \sum_{i = 1}^{N_s} p_\lambda (s_{t-1} = i, s_t = j, y_t | y_1, \dots, y_{t-1}) = \sum_{i = 1}^{N_s} \bar{\alpha}_{t-1} (i) a_{ij} b_j (y_t)\]Expressing the denominator using the numerator expression:
\[p_\lambda(y_{t} | y_{1}, \dots, y_{t-1}) = \sum_{j = 1}^{N_s} p_{\lambda} (s_t = j, y_t | y_1, \dots, y_{t-1}) = \sum_{j=1}^{N_s} \sum_{i = 1}^{N_s} \bar{\alpha}_{t-1} (i) a_{ij} b_j (y_t)\]So we get \(\bar{\alpha}_t(j) = c_t^{-1} \sum_{i=1}^{N_s} \bar{\alpha}_{t-1}(i) a_{ij} b_j(y_t)\)
where \(c_t = p_\lambda(y_{t} | y_{1}, \dots, y_{t-1}) = \sum_{j=1}^{N_s} \sum_{i = 1}^{N_s} \bar{\alpha}_{t-1} (i) a_{ij} b_j (y_t)\) is the normalization constant, i.e. the unnormalized recursion summed over \(j\). So we have exactly the same recursion as Baum’s forward, the only difference being that a scaling by \(c_t\) is applied at each time step.
A similar derivation yields \(\bar{\beta}_t(j) = c_{t+1}^{-1} \sum_{k=1}^{N_s} \bar{\beta}_{t+1}(k) a_{jk} b_k (y_{t+1})\) with \(\bar{\beta}_{N_t}(j) = 1\); note that here the scaling constant is the one at time \(t+1\).
Posterior probabilities are then given by:
\[\gamma_{t}(j) = \bar{\alpha}_{t}(j) \bar{\beta}_{t}(j) = p_\lambda(s_t = j | \mathbf{y})\]This gives the probability of being in state \(j\) given both past and future observations.
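To wrap up, here is a minimal sketch of the scaled forward-backward pass that computes \(\gamma\) (the function name `scaled_forward_backward` is mine; it reuses `A`, `a`, `B` and `y` from the snippets above):
function scaled_forward_backward(A,a,B,y)
    Nt, Ns = length(y), size(A,1)
    alphab, betab = zeros(Nt,Ns), zeros(Nt,Ns)
    c = zeros(Nt)
    # scaled forward pass: same as Baum's forward plus a normalization by c_t
    for j = 1:Ns
        alphab[1,j] = a[j]*pdf( B[j], y[1] )
    end
    c[1] = sum(alphab[1,:]); alphab[1,:] ./= c[1]
    for t = 2:Nt
        for j = 1:Ns, i = 1:Ns
            alphab[t,j] += alphab[t-1,i]*A[i,j]*pdf( B[j], y[t] )
        end
        c[t] = sum(alphab[t,:])   # c_t = p(y_t | y_1, ..., y_{t-1})
        alphab[t,:] ./= c[t]
    end
    # scaled backward pass: note the scaling by c_{t+1}
    betab[end,:] .= 1
    for t = Nt-1:-1:1
        for j = 1:Ns, k = 1:Ns
            betab[t,j] += betab[t+1,k]*A[j,k]*pdf( B[k], y[t+1] )
        end
        betab[t,:] ./= c[t+1]
    end
    return alphab .* betab, c     # γ and the scaling factors
end
gamma, c = scaled_forward_backward(A,a,B,y)
@assert all(sum(gamma, dims=2) .≈ 1)          # each row is a probability distribution
@assert sum(log.(c)) ≈ log(sum(alpha[end,:])) # log-likelihood recovered from the scalings
Unlike the unscaled version, all intermediate quantities here are valid probabilities, so no underflow occurs even for very long sequences, and the log-likelihood can be recovered as \(\log p_\lambda(\mathbf{y}) = \sum_t \log c_t\).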
References
[1] Bilmes, Jeff A. “A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models.” International Computer Science Institute 4.510 (1998): 126.
[2] Liporace, L. “Maximum likelihood estimation for multivariate observations of Markov sources.” IEEE Transactions on Information Theory 28.5 (1982): 729-734.
[3] Devijver, Pierre A. “Baum’s forward-backward algorithm revisited.” Pattern Recognition Letters 3.6 (1985): 369-373.