55:148 Digital Image Processing
55:247 Image Analysis and Understanding
Chapter 8, Part VII
Image understanding: Hidden Markov Models
Hidden Markov Models
- It is often possible when attempting image understanding to model the patterns
being observed as a system that evolves through a sequence of transitions.
- Sometimes these are transitions in time, but they may also be transitions through
another pattern; for example, the patterns of individual characters when
connected in particular orders represent another pattern that is a word.
- If the transitions are well understood, and we know the system state at a certain
instant, they can be used to assist in determining the state at a subsequent point.
- One of the simplest examples is the Markov model.
- A Markov model assumes a system may occupy one of a finite number of states
X_1, X_2, X_3, ... , X_n at times t_1, t_2, ..., and that the
probability of occupying a state is determined solely by recent history.
- More specifically, a first-order Markov model assumes these probabilities
depend only on the preceding state; thus an n x n matrix A = [a_ij] will exist in which

      a_ij = P(system is in state X_j at time t+1 | system was in state X_i at time t)

- Thus 0 <= a_ij <= 1 and sum_j=1^n (a_ij) = 1 for all 1 <= i <= n.
- The important point is that these parameters are time independent -- the a_ij do
not vary with t.
- A second order model makes similar assumptions about probabilities depending on
the last two states, and the idea generalizes obviously to order k models
for k = 3,4, ...
- A trivial example might be to model weather forecasting: Suppose that the weather
on a given day may be sunny (1), cloudy (2) or rainy (3) and that
the day's weather depends probabilistically on the preceding day's weather only.
- We might be able to derive a transition matrix A such as (equation 8.47), in which rows
index today's weather and columns tomorrow's:

               sunny    cloudy   rainy
      sunny    0.500    0.375    0.125
      cloudy   0.250    0.125    0.625
      rainy    0.250    0.375    0.375

- so the probability of rain after a sunny day is 0.125, the probability of cloud
after a rainy day is 0.375, and so on (a small sketch of using this matrix follows below).
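- As a minimal sketch of how such a transition matrix is used (assuming Python with NumPy, neither
of which appears in these notes), the distribution over the following days' weather is obtained by
repeated multiplication with A:

      import numpy as np

      # Transition matrix A (equation 8.47); rows = today's weather,
      # columns = tomorrow's weather, ordered sunny, cloudy, rainy.
      A = np.array([[0.500, 0.375, 0.125],
                    [0.250, 0.125, 0.625],
                    [0.250, 0.375, 0.375]])

      assert np.allclose(A.sum(axis=1), 1.0)   # each row is a probability distribution

      p = np.array([1.0, 0.0, 0.0])            # today is known to be sunny
      for day in (1, 2):
          p = p @ A                            # first-order Markov step
          print(f"{day} day(s) ahead: sunny={p[0]:.3f} cloudy={p[1]:.3f} rainy={p[2]:.3f}")

- One day ahead this simply reads off the "sunny" row of A; further ahead, the prediction is
obtained by applying the same time-independent transition probabilities again.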
- In many practical applications, the states are not directly observable, and instead
we observe a different set of states Y_1, ..., Y_m (possibly n <= m), where
we can only guess the exact state of the system from the observation probabilities

      b_jk = P(observing Y_k | system is in state X_j)

- so 0 <= b_jk <= 1 and sum_k=1^m (b_jk) = 1 for all 1 <= j <= n.
- The n x m matrix B is time independent; that is, the observation probabilities
do not depend on anything except the current state, and in particular not on how
that state was achieved, or when.
- Extending the weather example,
- the moistness of a piece of seaweed is an indicator of weather;
- if we conjecture four observable states dry (1), dryish (2), damp (3) or soggy (4),
and that the actual weather is probabilistically connected to the seaweed state,
we might derive a matrix B such as (equation 8.48), in which rows index the weather
and columns the seaweed state:

               dry      dryish   damp     soggy
      sunny    0.60     0.20     0.15     0.05
      cloudy   0.25     0.25     0.25     0.25
      rainy    0.05     0.10     0.35     0.50

- so the probability of observing dry seaweed when the weather is sunny is 0.6,
and the probability of observing damp seaweed when the weather is cloudy is
0.25, and so on.
- A first-order Hidden Markov Model lambda = (pi, A, B) is specified by the
matrices A and B together with an n-dimensional vector pi describing the
probabilities of the various states at time t=1 (a small generative sketch of such a model follows below).
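- As a small generative sketch (Python with NumPy assumed; the state ordering and values follow
equations 8.47 and 8.48, with pi taken as uniform purely for illustration), the triple
lambda = (pi, A, B) can be sampled to produce a hidden weather sequence together with the seaweed
observations it emits:

      import numpy as np

      rng = np.random.default_rng(0)

      states = ["sunny", "cloudy", "rainy"]
      seaweed = ["dry", "dryish", "damp", "soggy"]

      pi = np.array([1/3, 1/3, 1/3])             # initial state probabilities (illustrative)
      A = np.array([[0.500, 0.375, 0.125],       # transition probabilities (equation 8.47)
                    [0.250, 0.125, 0.625],
                    [0.250, 0.375, 0.375]])
      B = np.array([[0.60, 0.20, 0.15, 0.05],    # observation probabilities (equation 8.48)
                    [0.25, 0.25, 0.25, 0.25],
                    [0.05, 0.10, 0.35, 0.50]])

      # The weather evolves according to A; each day a seaweed observation
      # is emitted according to the row of B for the current (hidden) state.
      x = rng.choice(3, p=pi)
      for t in range(5):
          y = rng.choice(4, p=B[x])
          print(f"t={t+1}: hidden={states[x]:6s} observed={seaweed[y]}")
          x = rng.choice(3, p=A[x])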
- The time-independence constraints are quite strict and in many cases unrealistic,
but HMMs have seen significant practical application.
- In particular, they are successful in the area of speech processing, wherein the
A matrix might represent the probability of a particular phoneme following
another phoneme, and the B matrix relates feature measurements of the spoken
signal to the underlying phoneme.
- The same ideas have seen wide application in optical character recognition (OCR).
- An HMM poses three questions:
- Evaluation
- Given a model, and a sequence of observations, what is the probability that the
model actually generated those observations?
- If we had two different models available, lambda_1 = (pi_1, A_1, B_1) and
lambda_2 = (pi_2, A_2, B_2), this question would indicate which one better
described some given observations.
- For example, given two candidate models and a sequence of seaweed observations,
which model better describes the observed data?
- Decoding
- Given a model lambda = (pi, A , B) and a sequence of observations, what is
the most likely underlying state sequence?
- For pattern analysis this is the most interesting question, since it permits an
optimal estimate of what is happening on the basis of a sequence of feature
measurements.
- For example, if we have a model and a sequence of seaweed observations, what is
most likely to have been the underlying weather sequence?
- Learning
- Given knowledge of the set X_1, X_2, X_3, ... , X_n and a sequence of
observations, what are the best parameters pi, A, B if the system is indeed an HMM?
- For example, given a known weather sequence and a known sequence of seaweed observations,
what model parameters best describe them?
- HMM Evaluation
- To determine the probability that a particular model generated an observed sequence,
it is in principle straightforward to enumerate all possible state sequences,
calculate the probability of each, multiply each by the probability that that
sequence generated the observations in hand, and sum.
- If Y^k = (Y_k_1, Y_k_2, ..., Y_k_T) is a T-long observation sequence, and
X^i = (X_i_1, X_i_2, ..., X_i_T) is a state sequence, we require P(Y^k | lambda),
the probability that the model generated the observed sequence.
- This quantity is given by summing over all possible sequences X^i, and for each
such, determining the probability of the given observations.
- These probabilities are available from the B matrix, while the transition probabilities
of X^i are available from the A matrix.
- Thus

      P(Y^k | lambda) = sum over all X^i of [ P(Y^k | X^i, lambda) P(X^i | lambda) ]
                      = sum over all X^i of [ pi_i_1 b_i_1 k_1  a_i_1 i_2 b_i_2 k_2  ...  a_i_(T-1) i_T b_i_T k_T ]
- Exhaustive evaluation over the X^i is possible since A , B , pi are all
available, but the load is exponential in T, and clearly not in general
computationally realistic.
- The assumptions of the model, however, permit a shortcut via a recursive
definition of partial, or intermediate, probabilities.
- Suppose

      alpha_t(j) = P(Y_k_1, Y_k_2, ..., Y_k_t  and the state at time t is X_j | lambda)

- Here t is between 1 and T, so this is an intermediate, or partial, probability.
- Time independence allows us to write

      alpha_(t+1)(j) = b_j k_(t+1)  sum_i=1^n ( alpha_t(i) a_ij )

- since a_ij represents the probability of moving to state j, and b_j k_(t+1) is the
probability of observing what we do at this time.
- Thus alpha is defined recursively; it may be initialized from our knowledge of
the initial states:

      alpha_1(j) = pi_j b_j k_1
- At time T, the individual quantities alpha_T (j) give the probability of the
observed sequence occurring, with the actual system terminating state being X_j.
- Therefore the total probability of the model generating the observed sequence Y^k
is

      P(Y^k | lambda) = sum_j=1^n alpha_T(j)

- A short code sketch of this forward computation appears below.
- In particular, in OCR word recognition the individual patterns may be features
extracted from characters, or groups of characters, and an individual model may
represent an individual word.
- We would determine which word was most likely to have generated an observed feature
sequence.
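- A minimal sketch of the forward (alpha) recursion above, assuming Python with NumPy and the
weather/seaweed model of equations 8.47 and 8.48 (the function name is an invention for this example):

      import numpy as np

      def forward_probability(pi, A, B, observations):
          """Return P(observations | lambda) for the discrete HMM lambda = (pi, A, B)."""
          alpha = pi * B[:, observations[0]]      # alpha_1(j) = pi_j b_j k_1
          for k in observations[1:]:
              alpha = (alpha @ A) * B[:, k]       # alpha_(t+1)(j) = b_j k_(t+1) sum_i alpha_t(i) a_ij
          return alpha.sum()                      # P(Y | lambda) = sum_j alpha_T(j)

      pi = np.array([1/3, 1/3, 1/3])
      A = np.array([[0.500, 0.375, 0.125],
                    [0.250, 0.125, 0.625],
                    [0.250, 0.375, 0.375]])
      B = np.array([[0.60, 0.20, 0.15, 0.05],
                    [0.25, 0.25, 0.25, 0.25],
                    [0.05, 0.10, 0.35, 0.50]])

      # Observation indices: 0 = dry, 1 = dryish, 2 = damp, 3 = soggy.
      print(forward_probability(pi, A, B, [0, 1, 3, 3]))

- The cost of this recursion is O(n^2 T), in contrast to the exponential cost of exhaustive
enumeration over state sequences.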
- HMM Decoding
- Given that a particular model (pi, A, B ) generated an observation sequence
of length T,
- it is often not obvious what precise states the system passed through,
- and we therefore need an algorithm that will determine the most probable (or
optimal in some sense) X^i given Y^k.
- A simple approach might be to start at time t=1 and ask what the most probable
X_i_1 would be, given the observation Y_k_1. Formally, we would choose

      i_1 = argmax_j P(X_j | Y_k_1) = argmax_j [ b_j k_1 P(X_j) / P(Y_k_1) ]

  which may be calculated given the probabilities of the X_j (or, more likely, some
estimate thereof), and similarly for each subsequent t.
- This approach will generate an answer, but in the event of one or more
observations being poor, a wrong decision may be taken for some t.
- It also has the possibility of generating illegal sequences (for example, a transition
for which a_ij = 0).
- This frequently occurs in observation of noisy patterns, where an isolated best
guess for a pattern may not be the same as the best guess taken in the context
of a stream of patterns.
- A better approach does not decide on the value of i_t during the examination of the t-th
observation, but instead records how likely it is that a particular state might
be reached and, if it were to be correct, which state was likely to have been
its predecessor.
- Then at the T-th column, a decision can be taken about the final state X_i_T
based on the entire history, and this decision is fed back through the earlier stages.
- This is the Viterbi algorithm.
- The approach is similar to that developed for dynamic programming.
- We reconstruct the system evolution by imagining an n x T lattice of states; at
time t we occupy one of the n possible X_i in the t-th column.
- States in neighboring columns are connected by transition probabilities
from the A matrix, but our view of this lattice is attenuated by the
observation probabilities B.
- The task is to find the route from the first to the T-th column of maximal probability,
given the observation set.
- Formally, the required quantities are

      delta_1(i) = pi_i b_i k_1                                    (8.50)
      delta_t(i) = max_j [ delta_(t-1)(j) a_ji ] b_i k_t           (8.51)
      phi_t(i) = argmax_j [ delta_(t-1)(j) a_ji ]                  (8.52)
      i_T = argmax_j [ delta_T(j) ]                                (8.53)
      i_(t-1) = phi_t(i_t),   t = T, T-1, ..., 2                   (8.54)

- Here, equation (8.50) initializes the first lattice column, combining the
pi vector with the first observation.
- Equation (8.51) is a recursion relation defining each subsequent column from its
predecessor, the transition probabilities and the observation; it gives
the i-th element of the t-th column, and informally is the probability of
the most likely way of being in that position, given events at time t-1.
- Equation (8.52) is a back pointer, indicating where one is most likely to
have come from at time t-1 if currently in state i at time t.
- Equation (8.53) indicates the most likely state at time T, given
the preceding T-1 states and the observations.
- Equation (8.54) traces the back pointers through the lattice, initializing from
the most likely final state.
- A simple example will illustrate this; considering the weather transition probabilities
(equation 8.47) and the seaweed observation probabilities (equation 8.48), we might
conjecture, without prior information, that the weather states on any given
start day have equal probabilities
- so pi = (1/3, 1/3, 1/3)
- Suppose now we imagine a weather observer in a closed, locked room with a piece
of seaweed; on four consecutive days the seaweed is observed to be dry, dryish,
soggy, soggy.
- The observer wishes to calculate the most likely sequence of weather states that
have caused these observations.
- Starting with the observation dry, the first column of probabilities becomes
(equation 8.50)

      delta_1(1) = 0.2,   delta_1(2) = 0.0833,   delta_1(3) = 0.0167

- The sunny state is most probable.
- Now reasoning about the second day, delta_2(1) gives the probability of observing
dryish seaweed on a sunny day, given the preceding day's information.
- For each of the 3 possible preceding states, we calculate the explicit probability
and select the largest (equation 8.51).
- Thus the most probable way of reaching the sunny state on day 2 is from
day 1 being sunny too;
- Accordingly, we record delta_2( 1 ) = 0.02 and store the back pointer phi_2
( 1 ) = 1 (equation 8.52).
- In a similar way, we find delta_2( 2 ) = .0188, phi_2 (2) = 1 and delta_2
( 3 ) = .00521, phi_2 (3) = 2.
- delta probabilities and back pointers may be computed similarly for the third
and fourth days; we discover delta_4(1) = 0.00007, delta_4(2) = 0.00055, delta_4(3)
= 0.0011
- -- thus the most probable final state, given all preceding information, is
rainy.
- We select this (equation 8.53), and follow the phi back pointers of most
probable predecessors to determine the optimal sequence (equation 8.54).
- In this case, it is sunny, cloudy, rainy, rainy, which accords well with
expectation given the model (the sketch below reproduces this calculation).
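- The worked example can be reproduced with a short Viterbi sketch (Python with NumPy assumed;
matrices from equations 8.47 and 8.48, pi uniform as above; the function name is an invention
for this example):

      import numpy as np

      def viterbi(pi, A, B, observations):
          """Return the most probable state sequence and the delta lattice."""
          T, n = len(observations), len(pi)
          delta = np.zeros((T, n))                    # equation 8.51
          phi = np.zeros((T, n), dtype=int)           # back pointers, equation 8.52
          delta[0] = pi * B[:, observations[0]]       # initialization, equation 8.50
          for t in range(1, T):
              trans = delta[t - 1][:, None] * A       # element [j, i] = delta_(t-1)(j) a_ji
              phi[t] = trans.argmax(axis=0)
              delta[t] = trans.max(axis=0) * B[:, observations[t]]
          path = [int(delta[-1].argmax())]            # most likely final state, equation 8.53
          for t in range(T - 1, 0, -1):               # trace the back pointers, equation 8.54
              path.append(int(phi[t][path[-1]]))
          return path[::-1], delta

      pi = np.array([1/3, 1/3, 1/3])
      A = np.array([[0.500, 0.375, 0.125],
                    [0.250, 0.125, 0.625],
                    [0.250, 0.375, 0.375]])
      B = np.array([[0.60, 0.20, 0.15, 0.05],
                    [0.25, 0.25, 0.25, 0.25],
                    [0.05, 0.10, 0.35, 0.50]])

      states = ["sunny", "cloudy", "rainy"]
      path, delta = viterbi(pi, A, B, [0, 1, 3, 3])   # dry, dryish, soggy, soggy
      print([states[i] for i in path])                # most probable weather sequence
      print(delta[-1].round(5))                       # final-column delta values

- The final-column values agree with the delta_4 figures quoted above, and tracing the back
pointers recovers the optimal sequence.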
- HMM Learning
- The task of learning the best model to fit a given observation sequence is the
hardest of the three associated with HMMs, but an estimate (often sub-optimal) can
be made.
- An initial model is guessed, and this is refined to give a higher probability of
generating the observations in hand via the forward-backward, or Baum-Welch,
algorithm.
- This is essentially a gradient descent of an error measure of the current best
model, and is a special case of the EM (expectation-maximization) algorithm
(a compact sketch follows below).
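- As an illustrative sketch only (Python with NumPy assumed; the names are inventions, and no
scaling is applied, so long observation sequences would underflow in practice), one unscaled
Baum-Welch re-estimation loop for a discrete HMM looks like this:

      import numpy as np

      def baum_welch(pi, A, B, observations, iterations=10):
          """Re-estimate (pi, A, B) to increase P(observations | lambda); also returns
          the likelihood history, which is non-decreasing across passes."""
          obs = np.asarray(observations)
          T, n, m = len(obs), A.shape[0], B.shape[1]
          history = []
          for _ in range(iterations):
              # Forward (alpha) and backward (beta) partial probabilities.
              alpha = np.zeros((T, n))
              beta = np.zeros((T, n))
              alpha[0] = pi * B[:, obs[0]]
              for t in range(1, T):
                  alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
              beta[-1] = 1.0
              for t in range(T - 2, -1, -1):
                  beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
              likelihood = alpha[-1].sum()
              history.append(likelihood)

              # Expected state occupancies (gamma) and expected transitions (xi).
              gamma = alpha * beta / likelihood
              xi = (alpha[:-1, :, None] * A[None, :, :]
                    * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood

              # Re-estimation (the maximization step of EM).
              pi = gamma[0]
              A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
              B = np.vstack([gamma[obs == k].sum(axis=0) for k in range(m)]).T
              B = B / gamma.sum(axis=0)[:, None]
          return pi, A, B, history

      # Illustrative usage: refine a random initial model on one short observation
      # sequence (indices 0 = dry, 1 = dryish, 2 = damp, 3 = soggy).
      rng = np.random.default_rng(1)
      pi0 = np.full(3, 1 / 3)
      A0 = rng.dirichlet(np.ones(3), size=3)
      B0 = rng.dirichlet(np.ones(4), size=3)
      pi_hat, A_hat, B_hat, history = baum_welch(pi0, A0, B0, [0, 1, 3, 3, 2, 0, 1, 3])
      print(["%.3e" % p for p in history])   # P(obs | lambda) does not decrease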
- Applications
- Early uses of the HMM approach were predominantly in speech recognition, where it
is not hard to see how a different model may be used to represent each word, how
features may be extracted, and how the global view of the Viterbi algorithm
would be necessary to recognize phoneme sequences correctly through noise and
garble.
- HMMs are actively used in commercial speech recognizers.
- Wider applications in natural language processing have also been seen.
- The same ideas translate naturally into the related language recognition domain
of OCR and handwriting recognition.
- One use has been to let the underlying state sequence be grammatical tags, while
the observations are features derived from segmented words in printed and
handwritten text.
- The patterns of English grammar (which are not, of course, a first-order
Markov model) closely restrict which words may follow which others, and this
reduction of the size of candidate sets can be seen to assist enormously in
recognition.
- Similarly, HMMs lend themselves to analysis of letter sequences in text.
- Here, the transition probabilities are empirically derived from letter frequencies and
patterns (so for example, q is nearly always followed by u, and very rarely
by j), and the observation probabilities are the output of an OCR system.
- This system is seen to improve in performance when a second-order Markov model is
deployed (a small sketch of estimating such letter-transition probabilities follows below).
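- As a small sketch of estimating letter-transition probabilities empirically (Python assumed;
the tiny sample text is purely illustrative, where a real system would use a large corpus):

      from collections import Counter

      corpus = "the quick brown fox jumps over the lazy dog while the queen quietly quit"

      pair_counts = Counter()
      first_counts = Counter()
      for word in corpus.split():
          for a, b in zip(word, word[1:]):      # letter bigrams within each word
              pair_counts[(a, b)] += 1
              first_counts[a] += 1

      def p_next(a, b):
          """Empirical first-order estimate of P(next letter = b | current letter = a)."""
          return pair_counts[(a, b)] / first_counts[a] if first_counts[a] else 0.0

      print(p_next("q", "u"))   # 1.0 here: every q in the sample is followed by u
      print(p_next("q", "j"))   # 0.0: q followed by j never occurs in the sample
      print(p_next("t", "h"))   # 0.75: 'th' is common in English text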
- At a lower level, HMMs can be used to recognize individual characters.
- This may be done by skeletonizing characters and considering the sequence of
stroke primitives to be a Markov process.
- Alternatively, vertical and horizontal projections of binarized character images may be
considered.
- Observed through noise, a Fourier transform of the projections is derived
as a feature vector, and an HMM for each possible character is trained using
the Baum-Welch algorithm.
- Unknown characters are then identified by determining the best-scoring
model for features derived from an unseen image (a small sketch of such
projection features follows below).
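- As a small sketch of such projection-based features (Python with NumPy assumed; the toy binary
image of a 'T' is purely illustrative):

      import numpy as np

      # Toy binarized character image (1 = ink); a real system would use
      # segmented character images from scanned text.
      img = np.zeros((8, 8), dtype=int)
      img[1, 1:7] = 1          # horizontal bar of a 'T'
      img[1:7, 3:5] = 1        # vertical stroke of a 'T'

      # Vertical and horizontal projections of the binary image.
      vertical = img.sum(axis=0)
      horizontal = img.sum(axis=1)

      # Magnitudes of the Fourier transforms of the projections, concatenated into a
      # feature vector; one such vector per character would feed HMM training (Baum-Welch)
      # and scoring, typically after quantizing the features into a discrete alphabet.
      feature = np.concatenate([np.abs(np.fft.rfft(vertical)),
                                np.abs(np.fft.rfft(horizontal))])
      print(feature.round(2))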
Last Modified: April 15, 1997