HMM and Viterbi notes

2017-09-26, Brendan O’Connor

What goes into POS taggers?

Usually there are three types of information that go into a POS tagger.

  1. The word itself. If you only do this (look at what the word is), that’s the “most common tag” baseline we talked about last time (see the sketch after this list). It works well for some words, but not all cases.

  2. Features! You can use, say, multiclass logistic regression as a feature-based classifier for each token. Each token is an independent MLR decision. Then you can incorporate lots more information with features, like prefixes, suffixes, and capitalization.

  3. Nearby POS tags: maybe if the token to the left is an adjective, you know that the current token can’t be a verb. This gives contextual clues that generalize beyond individual words. However, the inference problem will be trickier: to determine the best tagging for a sentence, the decisions about some tags might influence decisions for others.
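To make #1 concrete, here is a minimal sketch of the most-common-tag baseline. It assumes the training data comes as (word, tag) pairs; the function names and the NN fallback tag are made up for illustration.

    from collections import Counter, defaultdict

    def train_most_common_tag(tagged_corpus):
        """tagged_corpus: iterable of (word, tag) pairs from labeled training data."""
        counts = defaultdict(Counter)          # word -> Counter of tags seen with it
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        # For each word, remember its single most frequent tag.
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_tokens(word_to_tag, tokens, default_tag="NN"):
        # Unseen words need some fallback; defaulting to NN is just one common choice.
        return [word_to_tag.get(w, default_tag) for w in tokens]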

A Hidden Markov model (HMM) is a model that combines ideas #1 (what’s the word itself?) and #3 (what POS tags are to the left and right of this token?). Later we will cover even neater models that combine all of #1-#3.

By using context information, we might be able to correctly handle examples like

    The   attack
     D    V or N?

We’d like a model that knows that “D V” is very unlikely in English, but “D N” is sensible; this might help decide the correct tag here.

Sequence tagging

Sequence tagging is a type of structured prediction problem: given an input sequence, predict an output sequence. It’s trickier than classification, where you only have to make independent labeling decisions.

Examples

POS tagging: given input sentence, tokens \(w_1..w_N\), predict POS tag sequence \(y_1..y_N\).

Word segmentation: given input character sequence \(w_1..w_N\), predict word boundary sequence \(y_1..y_N\), where each \(y_t\) indicates whether that character starts a new word or continues the previous one.

Named entity recognition: given input sentence tokens \(w_1..w_N\), predict tags \(y_1..y_N\), where each tag is whether the token is part of a name or not.

Sequence tagging also shows up in speech recognition and many other areas. Outside of NLP: economic forecasting (is the economy in a recession or not? that kind of HMM is called a “regime switching” model), all sorts of time series problems, etc.

Hidden Markov model

For classifiers, we saw two probabilistic models: a generative multinomial model, Naive Bayes, and a discriminative feature-based model, multiclass logistic regression. For sequence tagging, we can also use probabilistic models. HMM’s are a special type of language model that can be used for tagging prediction.

A simple first-order Markov model says there are transitions between words in a sequence. Under this model’s assumptions, a word depends only on the word immediately before it; it is conditionally independent of the words two or three or more positions back, once you know the previous word.

\(P(\vec{w}) = \prod_t p(w_t \mid w_{t-1})\)
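For example, for a three-word sequence, reading the first factor as an initial-word probability, this says

\[ P(\text{the dog barks}) = p(\text{the})\ p(\text{dog} \mid \text{the})\ p(\text{barks} \mid \text{dog}) \]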

We can visually describe how the variables influence each other as something like this:

    w -> w -> w -> ...

In a first-order Hidden Markov model (HMM), we assume the existence of hidden states, called \(y\). There is a hidden state for every token. We assume a Markov process generated the state sequence, and then individual words were generated independently based on the states. Notation: \(\vec{w}=(w_1,w_2,..,w_N)\), and \(t \in \{1..N\}\).

\(P(\vec{w},\vec{y}) = \prod_t p(y_t|y_{t-1}) p(w_t|y_t)\)

    y -> y -> y -> ...
    |    |    |
    v    v    v
    w    w    w

For POS tagging, \(y\)’s are POS tags. For the PTB tagset, there are 45 possible tags. We say that’s \(K=45\) possible states. For the Eisner ice cream example, there are 2 possible states, \(K=2\) (either H or C).

There are two components to the probability distribution.

  1. Emissions, \(p(w_t|y_t)\): this is just, for each tag, a distribution over words. For example, if the parameter is called \(\theta\), it’s like \(\theta_{Noun}\) = {dog:.01, table:.008, …}, or \(\theta_{Det}\) = {the:.7, a:.2, …}. It’s kind of like a dictionary that says “here are all possible nouns (and their probs), here are all possible determiners (and their probs)”, etc. This is just like the word distributions in Model 1 or Naive Bayes, except we’re postulating some set of linguistic hidden states that determine the word distribution.

  2. Transitions, \(p(y_t | y_{t-1})\): we’ve seen this before in bigram LM’s; this is just the probability of the next state given the current one. Say the parameter is \(A\); then maybe \(A_{Adj} =\) {Adj:.3, Noun:.6, Verb:.01, …} or something like that, where perhaps “Adj Verb” subsequences are rare. (Both components are sketched in code right after this list.)
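To make the parameterization concrete, here is a minimal sketch that stores \(\theta\) and \(A\) as nested dictionaries and multiplies out \(\log P(\vec{w},\vec{y})\) from the equation above. The numbers are made up, only a handful of words and tags are included, and the explicit START state is just one convention for handling the first transition.

    import math

    # Made-up emission parameters theta[tag][word] and transition parameters A[prev_tag][tag].
    theta = {
        "Det":  {"the": 0.7, "a": 0.2},
        "Noun": {"dog": 0.01, "attack": 0.008},
        "Verb": {"barks": 0.005, "attack": 0.004},
    }
    A = {
        "START": {"Det": 0.5, "Noun": 0.3, "Verb": 0.2},
        "Det":   {"Noun": 0.8, "Det": 0.01, "Verb": 0.01},
        "Noun":  {"Verb": 0.4, "Noun": 0.2, "Det": 0.1},
    }

    def joint_logprob(words, tags):
        """log P(w, y) = sum_t [ log p(y_t | y_{t-1}) + log p(w_t | y_t) ]"""
        lp, prev = 0.0, "START"
        for w, y in zip(words, tags):
            lp += math.log(A[prev][y]) + math.log(theta[y][w])
            prev = y
        return lp

    print(joint_logprob(["the", "dog", "barks"], ["Det", "Noun", "Verb"]))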

An HMM makes two major assumptions.

  1. Markov property for the states: \(p(y_t | y_1..y_{t-1}) = p(y_t|y_{t-1})\).

  2. Output independence: \(p(w_t| w_{s \neq t}, \vec{y}) = p(w_t|y_t)\). The prob of a word, if you know its tag, depends ONLY on the tag and NOTHING else.

This corresponds to a really strong version of the linguistic substitution test. The substitution test is a conceptual way of defining what a word class is: namely, a set of words that can be substituted for one another in any sentence such that the sentence is still syntactically valid. “The {dog, cat, table, owl} was in the room” and also “I saw the {dog, cat, table, owl}” … those four words can substitute for each other in these sentences, so perhaps {dog, cat, table, owl} belong to a word class; let’s call them “nouns”. An HMM assumption says, given such a word class, you can do the substitution, and it doesn’t matter what all the other words are.

This is obviously invalid in general: consider just “I saw {a, an} {dog, cat, table, owl}” where “a” and “an” have tag DETERMINER and the last word is tag NOUN. An HMM with this simple tagset can’t capture the phonetic-ish constraint between “a”/“an” versus the next word. And of course there are longer distance syntactic and semantic constraints too. But, the HMM assumption is true enough, as unsatisfying as it is, to yield pretty good part-of-speech taggers.

“All models are wrong, but some are useful” (–Box) is the usual motto for statistical NLP.

Tasks with an HMM

OK we defined a fancy model \(p(\vec{w},\vec{y}; \theta,A)\). What do we do with it? Any time you see or make up a statistical model in NLP, immediately there are two types of interesting things to do: Inference and Learning.

Inference

Given that we already have the parameters, important things to do include

  1. Decoding, \(\arg\max_{\vec{y}} p(\vec{y} \mid \vec{w})\). Predict the most likely (highest posterior probability) tag sequence, given an input sentence. Efficient algorithm to do it: Viterbi algorithm.

  2. Marginal likelihood, \(P(\vec{w}) = \sum_{\vec{y}} P(\vec{w},\vec{y})\). This is what you want to compute if you wanted to use the HMM as a language model. Efficient algorithm to do it: Forward algorithm.

  3. Posterior tag marginals, \(P(y_t | \vec{w})\). We won’t worry about this right now. Efficient algorithm to do it: Forward-backward algorithm.

This document only concerns decoding.

Learning

We’d also like to learn the parameters. Today we’ll only worry about supervised learning: the \(\vec{y}\)’s are known at training time, like they’re manually-labeled tags from humans. At test time we don’t have \(\vec{y}\).

You already know how to do learning: (pseudocounted) relative frequency estimation! OK, we’re done with learning.
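Here is a minimal sketch of that count-and-normalize step, assuming the training data is a list of sentences, each a list of (word, tag) pairs; the add-\(\alpha\) pseudocount and the START convention are just illustrative choices.

    from collections import Counter, defaultdict

    def train_hmm(tagged_sentences, alpha=1.0):
        """Relative frequency estimation with add-alpha pseudocounts."""
        trans_counts = defaultdict(Counter)   # prev_tag -> Counter of next tags
        emit_counts = defaultdict(Counter)    # tag -> Counter of words
        for sent in tagged_sentences:
            prev = "START"
            for word, tag in sent:
                trans_counts[prev][tag] += 1
                emit_counts[tag][word] += 1
                prev = tag
        tags = sorted(emit_counts)
        vocab = sorted({w for c in emit_counts.values() for w in c})

        def normalize(counter, support):
            total = sum(counter.values()) + alpha * len(support)
            return {x: (counter[x] + alpha) / total for x in support}

        A = {prev: normalize(trans_counts[prev], tags) for prev in ["START"] + tags}
        theta = {tag: normalize(emit_counts[tag], vocab) for tag in tags}
        return A, theta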

Decoding

Assume you learned the parameters, or maybe someone gave them to you. Now predict tags on new text.

First note that \(p(\vec{y}|\vec{w})=p(\vec{w},\vec{y})/p(\vec{w})\), and the denominator doesn’t depend on \(\vec{y}\), so maximizing the posterior is equivalent to maximizing just the numerator; we can therefore rewrite the decoding problem as \(\arg\max_{\vec{y}} p(\vec{y},\vec{w})\). If we look at the structure of the problem, it’s apparent that it’s nontrivial:

\[ p(\vec{w},\vec{y}) = p(y_1) p(w_1|y_1) p(y_2|y_1) p(w_2|y_2) p(y_3|y_2) p(w_3|y_3) ...\] Writing out in log form, \[\log p(\vec{w},\vec{y}) = \log p(y_1) + \log p(w_1|y_1) + \log p(y_2|y_1) + \log p(w_2|y_2) + \log p(y_3|y_2) + \log p(w_3|y_3) ...\]

\[ = f(y_1,y_2) + g(y_2,y_3) + h(y_3,y_4) + i(y_4,y_5) + ...\]

The last line is sometimes called “factor notation”; I’ve defined \(g(y_2,y_3)=\log(p(y_3|y_2)p(w_3|y_3))\) and so on, to group together terms that depend on both \(y_2\) and \(y_3\) (or just one of them). Any single tag position participates in two different factor functions. The problem is that there is interdependence between tags: what your neighbors are influences you, but to figure out what your neighbors are, you have to figure out their neighbors, etc. It’s a joint optimization problem.

The trick is that there is only local dependence. \(y_2\) depends directly on \(y_3\), but it does not depend on far-away \(y_{10}\). At least, not directly: \(y_{10}\) does indirectly influence \(y_2\), but only through a chain culminating in \(y_3\). Thus we can use dynamic programming, a general class of algorithmic techniques that exploit cached solutions to shared subproblems. The dynamic programming algorithm that exactly solves the HMM decoding problem is called the Viterbi algorithm.

A few possible decoding algorithms

1… Naive enumeration: this is the most obvious approach to solving the decoding problem. Enumerate every possible solution, compute \(p(\vec{w},\vec{y})\) for each (which is very straightforward), and choose the most likely one. What’s the runtime? There are \(K^N\) possible tag sequences, so it’s exponential in \(N\), which is far too slow. At least this algorithm finds the optimal solution.
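Here’s a minimal sketch of the brute force, using the same hypothetical A/theta dictionary format as the earlier sketch; the itertools.product call is exactly the \(K^N\) blow-up.

    import itertools

    def enumerate_decode(words, tags, A, theta):
        """Score every one of the K^N tag sequences and keep the best. Exponential in N."""
        best_seq, best_prob = None, 0.0
        for seq in itertools.product(tags, repeat=len(words)):   # K^N candidate sequences
            prob, prev = 1.0, "START"
            for w, y in zip(words, seq):
                prob *= A.get(prev, {}).get(y, 0.0) * theta[y].get(w, 0.0)
                prev = y
            if prob > best_prob:
                best_seq, best_prob = seq, prob
        return best_seq, best_prob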

2… Greedy decoding: just go left-to-right and pick the highest-probability choice each time, \(\hat{y}_t = \arg\max_k\ p(y_t=k \mid \hat{y}_{t-1})\ p(w_t \mid y_t=k)\).

We can make each decision because we’ve already decided on the left, thus have a \(\hat{y}_{t-1}\) term to use there. We don’t look to the right at all (we haven’t decided it yet).

What’s the runtime?

This can make bad decisions. Here’s an example of what happens when you have to decide tags one at a time:

    Attack      ==> This is the first word, you have to decide now!  
                    OK, I guess it's a VERB.

    Attack
       V

    Attack it   ==> OK, this is a pronoun. makes sense after an imperative verb.
       V     

Versus a different sentence:

    Attack      ==> OK let's decide it's a verb.

    Attack
       V

    Attack was  ==> Uhoh V doesn't make sense, maybe it should have been a NOUN!
       V

In greedy decoding, you can’t go back to fix “Attack” any more.

Greedy decoding isn’t the worst thing in the world for POS tagging, though it is worse than other options, and for other problems it can be pretty bad. One option to enhance greedy decoding is to use backtracking search, best-first search, or other heuristic search techniques. In NLP, beam search is the most commonly used heuristic search for structured prediction. In some areas, like decoding for MT (a space of \(V^N\) possible translations, yikes!), decoding is super hard, so there’s lots of research into making these techniques more efficient.
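Setting those fancier search strategies aside, here is a minimal sketch of plain greedy decoding itself, again in the hypothetical A/theta dictionary format from the earlier sketches. Each position commits to the locally best tag given the previous choice, in \(O(NK)\) time.

    def greedy_decode(words, tags, A, theta):
        """Left-to-right greedy decoding: commit to the locally best tag at each position."""
        prev, out = "START", []
        for w in words:
            best = max(tags, key=lambda y: A.get(prev, {}).get(y, 0.0) * theta[y].get(w, 0.0))
            out.append(best)
            prev = best          # decisions to the left are frozen; no going back
        return out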

3… Viterbi decoding: this is optimal and its runtime is linear in \(N\) (and polynomial in \(K\))! (What’s its runtime complexity?)

The algorithm is to fill out a Viterbi table, a matrix of probabilities, where each entry \(V_t[k]\) is the probability of the most likely path from the start to \(t\) that ends with state \(k\).

In math, that means the Viterbi tables preserve this recurrence relation,

\[ V_t[k] = \max_{y_1..y_{t-1}} P(y_t=k,\ \ w_1..w_t,\ y_1..y_{t-1}) \]

If you take that formula and just write it out using the HMM equation, and do some nesting of the max operators, it’s easy to see that the following update recursively computes the Viterbi tables from left to right while preserving that recurrence relation:

\[ V_1[k] = P(\text{start} \rightarrow k)\, P(k \rightarrow w_1), \qquad V_t[k] = \max_{j \in \{1..K\}} V_{t-1}[j]\, P(j \rightarrow k)\, P(k \rightarrow w_t) \]

Here \(P(j \rightarrow k)\) denotes the transition probability \(p(y_t=k \mid y_{t-1}=j)\), and \(P(k \rightarrow w_t)\) the emission probability \(p(w_t \mid y_t=k)\).

I wrote the transition and emission probabilities in a slightly different format because it helps me at least see what’s going on. To figure out the most likely way to get state H at timestep 3, you need to consider two possibilities: either you came from H at \(t=2\), or from C at \(t=2\). If you came from H, you should factor in the prob of the H to H transition, times the likelihood of seeing whatever \(w_3\) is. If you came from C, you need to factor in the prob from the C to H transition, and again times the likelihood of \(w_3\). Then you pick the most likely predecessor, for getting to H at \(t=3\).

See the lattice diagram (Figure 7.10, second on the [handout][handout]). Step through it yourself to get an idea. Better yet, start from just the model specification and build up a new one yourself by stepping through the Viterbi algorithm.

This recurrence relation is nice because the last table gives the probability of the most likely sequence; specifically, with a final STOP state at position \(N+1\), \[V_{N+1}[STOP] = \max_{y_1..y_N} P(y_{N+1}=STOP, w_1..w_N, y_1..y_N)\]

That would answer the question \(\max_{\vec{y}} P(\vec{w},\vec{y})\). But we actually want \(\arg\max_{\vec{y}} p(\vec{w},\vec{y})\). To do that, you need to add another step inside the Viterbi inner loop: store the backpointer

\[ B_t[k] = \arg\max_{j \in \{1..K\}} V_{t-1}[j]\, P(j \rightarrow k)\, P(k \rightarrow w_t) \]

That’s just the argmax over the same \(V \times trans \times emit\) product as in the Viterbi table update. It says which state was the best way to get to \(k\) at \(t\). Once you’re done going from left to right and filling out the V and B tables, you trace backwards through the backpointers to get the most likely path.
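Putting the table update and the backpointers together, here is a minimal Viterbi sketch, once more in the hypothetical A/theta dictionary format from the earlier sketches (plain probabilities rather than log probabilities, and no explicit STOP state, just to keep it short):

    def viterbi_decode(words, tags, A, theta):
        """Exact decoding, argmax_y P(w, y), in O(N K^2) time."""
        V = [{} for _ in words]   # V[t][k]: prob of the best path ending in state k at position t
        B = [{} for _ in words]   # B[t][k]: that path's previous state (backpointer)

        for k in tags:            # base case: transition out of START, emit the first word
            V[0][k] = A.get("START", {}).get(k, 0.0) * theta[k].get(words[0], 0.0)
            B[0][k] = "START"

        for t in range(1, len(words)):
            for k in tags:
                # Best predecessor j, scored by V[t-1][j] * trans(j -> k) * emit(k -> w_t).
                scores = {j: V[t - 1][j] * A.get(j, {}).get(k, 0.0) * theta[k].get(words[t], 0.0)
                          for j in tags}
                best_j = max(scores, key=scores.get)
                V[t][k] = scores[best_j]
                B[t][k] = best_j

        # Trace the backpointers from the best final state to recover the argmax path.
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for t in range(len(words) - 1, 0, -1):
            path.append(B[t][path[-1]])
        return list(reversed(path)), V[-1][last]

With the made-up numbers from the earlier sketch, viterbi_decode(["the", "dog", "barks"], ["Det", "Noun", "Verb"], A, theta) should come back with the tag sequence Det, Noun, Verb.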

Why does Viterbi work? Here’s another way to think about it.

What’s the most likely state at \(t=1001\) … or rather, what’s the last state of the most likely path up through \(t=1001\)? (Given you only want to condition on \(w_1..w_{1001}\).) To figure this out, you need to maximize over all possible paths over the 1000 previous timesteps. There are \(2^{1000} \approx 10^{301}\), call it 10 zillion, possible previous paths to maximize over. But the Markov assumption says that, in order to decide whether \(y_{1001}=H\) or \(y_{1001}=C\), all that matters from all that history is whatever \(y_{1000}\) was. In other words: to decide on \(y_{1001}\), all that matters about those 10 zillion prefixes is that 5 zillion of them end with \(y_{1000}=H\) and 5 zillion of them end with \(y_{1000}=C\).

OK, consider how to maximize over the 5 zillion that end with \(y_{1000}=C\) in order to possibly get to \(y_{1001}=H\). If we knew the probability of the most likely of those 5 zillion paths, we would just multiply in the transition and emission probabilities to score getting to \(y_{1001}=H\) that way. That probability is in fact contained in the \(V_{t-1}\) table, specifically \(V_{1000}[C]\), and was recursively computed already by earlier iterations of Viterbi. Recursively speaking, we can use this same argument to compute the \(V_{1000}\) table based on the \(V_{999}\) table, and so on. This recursive reasoning justifies going all the way back to the base case of \(V_1\).

Any other ways to think about it?

Other references

For a broader view of decoding and how the Viterbi algorithm fits in, see Smith, Linguistic Structure Prediction, chapter 2 — especially 2.2 and 2.3.

For Viterbi versus Dijkstra, see these slides by Liang Huang.