# Recap: reasoning over time

Markov models:

    X1 -> X2 -> ...

    P(X1) ; P(Xt|Xt-1)

    Xt-1   P(Xt = rain | Xt-1)
    rain   0.7
    sun    0.1

(and an FSA-like graph)

HMMs:

    X1 -> X2 -> ...
    |     |
    V     V
    E1    E2

    X      P(E = umbrella | X)
    rain   0.9
    sun    0.2

# Inference

Given a base case X0.

Inductive-ish cases:
- observations
- passage of time

# Observations

Consider a single timeslice:

    X1
    |
    V
    E1

In other words, what is P(X1|e1)?

Look at the BN. We know P(X1) and P(e1|X1). Use laws of probability:

    P(X1|e1) = P(X1, e1) / P(e1)
             = P(e1|X1) P(X1) / P(e1)    [durr, Bayes' rule]

We know P(e1|X1) P(X1) and can normalize the result; in other words,

    P(X1|e1) = alpha P(e1|X1) P(X1)

(Or we can do this with the semantics of BNs; same result.)

# Passage of time

    X1 -> X2

One step of time passes. How do we get P(X2)?

Look at the BN. We know P(X1) and P(X2|X1).

    P(X2) = sum over x1 P(X2, x1)
          = sum over x1 P(X2|x1) P(x1)

(same as regular Markov chains)

That's it! That's all you need for inference in HMMs; you just need to put these pieces together the right way.

# Passage of time, another way

Assume we have a belief P(X|evidence so far):

    P(Xt|e1...et)

When another time step passes, to get P(X_(t+1)|e1...et), take the current beliefs:

    P(xt|e1...et)

multiply by the transition probabilities:

    P(X_(t+1)|xt) * P(xt|e1...et)

and sum over all states:

    P(X_(t+1)|e1...et) = sum over xt P(X_(t+1)|xt) * P(xt|e1...et)

This is exactly what we did before, but at an arbitrary step rather than X1 -> X2.

Or, as beliefs over time: we start with

    B(Xt) = P(Xt|e1...et)

B' are your beliefs after time passes but before an observation:

    B'(X_(t+1)) = sum over xt P(X_(t+1)|xt) B(xt)

"Tomorrow's probability that we're in state x is the sum over all the places we could have been: the probability that we were in that place, times the probability that being in that place led us to x."

# Example: UmbrellaWorld

If just time passes, we're taking a step toward the stationary distribution.

Let's say at time 0 you're unsure if it's raining: P(rain) = 0.5.

What do you think at time 1?
    B'(X1 = rain) = sum over x0 P(X1 = rain|x0) B(x0)
                  = P(rain at time 1 | rain at time 0) * 1/2 + P(rain at time 1 | sun at time 0) * 1/2
                  = 0.7 * 0.5 + 0.1 * 0.5
                  = 0.4

Q. What if another timestep passes?

    B'(X2 = rain) = P(rain at time 2 | rain at time 1) * 0.4 + P(rain at time 2 | sun at time 1) * 0.6
                  = 0.7 * 0.4 + 0.1 * 0.6
                  = 0.28 + 0.06
                  = 0.34

# Observations, another way

We have projected forward to "tomorrow":

    B'(X_(t+1)) = sum over xt P(X_(t+1)|xt) B(xt)

Now evidence comes in, and we need to incorporate e_(t+1):

    P(X_(t+1)|e1...e_(t+1)) = alpha P(e_(t+1)|X_(t+1)) * P(X_(t+1)|e1...et)

(Bayes' rule and normalization.) Take the current beliefs (the right-hand factor) and multiply by the likelihood (the left-hand factor), then renormalize. In other words:

    B(X_(t+1)) = alpha P(e_(t+1)|X_(t+1)) * B'(X_(t+1))

# Example: UmbrellaWorld, continued

B'(X1) = <0.4, 0.6>. Now we make an observation e1 = umbrella.

    B(X_(t+1)) = alpha P(e_(t+1)|X_(t+1)) * B'(X_(t+1))

so

    B(X1 = rain) = alpha P(umbrella|rain) * B'(X1 = rain) = alpha * 0.9 * 0.4 = alpha * 0.36 = 0.75
    B(X1 = sun)  = alpha P(umbrella|sun)  * B'(X1 = sun)  = alpha * 0.2 * 0.6 = alpha * 0.12 = 0.25

Q2. What if we make another independent observation of an umbrella? Our current beliefs are <0.75, 0.25>, so:

    B(X1 = rain) = alpha P(umbrella|rain) * 0.75 = alpha * 0.9 * 0.75 = alpha * 0.675 = ~0.93
    B(X1 = sun)  = alpha P(umbrella|sun)  * 0.25 = alpha * 0.2 * 0.25 = alpha * 0.05  = ~0.07

Time passing tends to blur information; observations tend to sharpen it.

# Putting it together

What if we start with no information (i.e. P(rain) = 0.5) and then make a series of observations?

    Rain0 -> Rain1 -> Rain2 -> ...
             |        |
             V        V
             Umb1     Umb2

Remember:

    B'(X_(t+1)) = sum over xt P(X_(t+1)|xt) B(xt)
    B(X_(t+1))  = alpha P(e_(t+1)|X_(t+1)) * B'(X_(t+1))

At Rain0, B(rain) = 0.5. First advance time to Rain1:

    B'(Rain1 = rain) = 0.7 * 0.5 + 0.1 * 0.5 = 0.4

Then factor in evidence:

    B(Rain1 = rain) = alpha * 0.9 * 0.4 = alpha * 0.36
    B(Rain1 = sun)  = alpha * 0.2 * 0.6 = alpha * 0.12

    B(Rain1 = rain) = 0.75

Q3. Continuing on, suppose you see no umbrella at time 2, and no umbrella at time 3. Compute B(Rain3).
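The two update steps above (pass time, then observe) can be sketched in Python with the UmbrellaWorld numbers from these notes; the function and variable names here are my own:

```python
# Transition model: T[prev][next] = P(X_(t+1) = next | X_t = prev)
T = {"rain": {"rain": 0.7, "sun": 0.3},
     "sun":  {"rain": 0.1, "sun": 0.9}}
# Emission model: E[x] = P(umbrella | X = x)
E = {"rain": 0.9, "sun": 0.2}

def elapse_time(b):
    """B'(X_(t+1)) = sum over xt of P(X_(t+1)|xt) * B(xt)."""
    return {x: sum(T[prev][x] * b[prev] for prev in b) for x in b}

def observe(b, saw_umbrella):
    """B(X) = alpha * P(e|X) * B'(X): weight by the likelihood, renormalize."""
    w = {x: (E[x] if saw_umbrella else 1 - E[x]) * b[x] for x in b}
    z = sum(w.values())
    return {x: w[x] / z for x in w}

b = {"rain": 0.5, "sun": 0.5}   # B(Rain0): no information
b = elapse_time(b)              # B'(Rain1) = <0.4, 0.6>
b = observe(b, True)            # B(Rain1 = rain) = 0.75, as above
```

Note that `elapse_time` needs no normalization (it maps a distribution to a distribution), while `observe` must renormalize, since multiplying by the likelihood breaks the sum-to-one property.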
# Forward algorithm

A dynamic programming algorithm for computing, at each timeslice, the distribution of X given the evidence up to time t. We don't care about anything else; marginalize it all out.

    Bt(X) = P(Xt|e1...et)

(Note: we can save normalizing to the end.)

    P(Xt|e1...et) = alpha P(Xt, e1...et)

in other words, the joint probability of X and the fixed values of the evidence. Sum over the previous time step:

    sum over x_(t-1) P(x_(t-1), Xt, e1...et)

Factor, thanks to HMM / BN semantics:

    sum over x_(t-1) P(x_(t-1), e1...e_(t-1)) * P(Xt|x_(t-1)) * P(et|Xt)

"something before * transition * evidence"

Rewrite as:

    P(et|Xt) * sum over x_(t-1) P(Xt|x_(t-1)) * P(x_(t-1), e1...e_(t-1))

The last term is a recurrence! So start at time 0 and work our way forward. These are exactly the two steps we just went through (pass time + observe), together!

# What about MLE?

Filtering determines the probability distribution of the hidden variable at the current time step, given a sequence of emissions. MLE (the most likely explanation) attempts to reconstruct the most likely sequence of hidden states corresponding to a sequence of emissions. Formally:

    argmax over (x1...xt) P(x1...xt|e1...et)

# Back to state trellises

    sun  --> sun  --> sun  --> ...
          X        X        X
    rain --> rain --> rain --> ...
     X1       X2       X3
     e1       e2       e3

Each edge represents a transition from x_(t-1) to xt. Each edge has weight P(xt|x_(t-1)) P(et|xt). Each path is a sequence of states. Multiply the values along a given path to get its probability, jointly with the evidence (though it still needs normalization).

Idea: check all paths, multiply the weights, and take the most likely one. Too many paths for brute force: 2^n paths for two states over n steps.

# Viterbi

We can efficiently compute the MLE using a procedure parallel to the forward algorithm. Remember, the forward algorithm computes:

    P(xt, e1...et) = P(et|xt) * sum over x_(t-1) P(xt|x_(t-1)) * P(x_(t-1), e1...e_(t-1))

    emission prob * sum over (transition prob * what came before)

Viterbi looks at each state and decides the most likely sequence of states to have led up to it, recursively.
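Before continuing with the Viterbi recurrence, the forward algorithm above can be sketched as code, using the UmbrellaWorld models (names are my own; evidence is a list of booleans, True meaning an umbrella was seen):

```python
STATES = ("rain", "sun")
# Transition model P(x_t | x_(t-1)) and emission model P(umbrella | x_t)
T = {"rain": {"rain": 0.7, "sun": 0.3},
     "sun":  {"rain": 0.1, "sun": 0.9}}
E = {"rain": 0.9, "sun": 0.2}

def forward(evidence, prior={"rain": 0.5, "sun": 0.5}):
    """Return B_t(X) = P(X_t | e_1...e_t) for t = 1..T."""
    f = dict(prior)
    beliefs = []
    for saw_umb in evidence:
        # unnormalized: P(e_t|x) * sum over x' of P(x|x') * f[x']
        f = {x: (E[x] if saw_umb else 1 - E[x])
                * sum(T[xp][x] * f[xp] for xp in STATES)
             for x in STATES}
        z = sum(f.values())              # normalize each step
        f = {x: f[x] / z for x in STATES}   # (could also save this to the end)
        beliefs.append(f)
    return beliefs

beliefs = forward([True])   # one umbrella observation
# beliefs[0]["rain"] == 0.75, matching B(Rain1) above
```

Each loop iteration is exactly "pass time + observe" rolled into one expression: the sum is the transition step, and the leading factor is the evidence step.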
In other words, it computes

    mt[xt] = max over x1...x_(t-1) P(x1...x_(t-1), xt, e1...et)

which, via the same rewrite as the forward algorithm, becomes

    mt[xt] = P(et|xt) * max over x_(t-1) P(xt|x_(t-1)) * m_(t-1)[x_(t-1)]

Another recurrence! (Same shape as forward, with max in place of sum.)

Look at the last state in the trellis. What's the most likely path that led to it? The most likely of its predecessors, and the most likely path to that predecessor. How do we compute each of those? By computing the paths to the previous states. Etc.

Start at the front and work your way forward, storing the back pointers and most likely states along the way. Follow the pointers back from the most likely final state to find the most likely sequence.

# Example

Suppose we saw:

    sun  --> sun  --> sun  --> sun  -->
          X        X        X
    rain --> rain --> rain --> rain -->
     X0       X1       X2       X3
         umb      no umb   no umb

Recall: the probability of starting in either state at X0 is 0.5.

Let's look at X1. (P(X0 = rain) = 0.5)

What's the top arrow?

    P(X1=sun|X0=sun) P(E1=umb|X1=sun) = 0.9 * 0.2 = 0.18

(Note: we don't need to normalize, because proportions propagate.)

What's the top-to-bottom arrow?

    P(X1=rain|X0=sun) P(E1=umb|X1=rain) = 0.1 * 0.9 = 0.09

What's the bottom-to-top arrow?

    P(X1=sun|X0=rain) P(E1=umb|X1=sun) = 0.3 * 0.2 = 0.06

What's the bottom arrow?

    P(X1=rain|X0=rain) P(E1=umb|X1=rain) = 0.7 * 0.9 = 0.63

What's the most likely way we reached each node? sun1 from sun0, p = 0.18; rain1 from rain0, p = 0.63. (These are path probabilities, not individual state probabilities, so they don't sum to 1, since we're not considering all paths.)

Which is most likely at X1? rain: (0.63 * 0.5) > (0.18 * 0.5). (Note we omitted the prior probabilities in the edge weights, but the node weights require them; they're not always 0.5/0.5.)

Compute the MLE for X2, X3: To be continued.
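The Viterbi recurrence above can be sketched as code: the forward algorithm with max in place of sum, plus back pointers. This uses the UmbrellaWorld numbers; the names are my own, and the answer to the X2/X3 exercise is left to the reader rather than printed here.

```python
STATES = ("rain", "sun")
T = {"rain": {"rain": 0.7, "sun": 0.3},   # P(x_t | x_(t-1))
     "sun":  {"rain": 0.1, "sun": 0.9}}
E = {"rain": 0.9, "sun": 0.2}             # P(umbrella | x_t)

def viterbi(evidence, prior={"rain": 0.5, "sun": 0.5}):
    """Most likely state sequence x_1...x_T given evidence (list of bools)."""
    m = dict(prior)        # m_t[x] = prob of the best path ending in x at time t
    back = []              # back pointers: one dict per timestep
    for saw_umb in evidence:
        likelihood = {x: E[x] if saw_umb else 1 - E[x] for x in STATES}
        # For each state, the best predecessor and the max-version of forward:
        best = {x: max(STATES, key=lambda xp: T[xp][x] * m[xp]) for x in STATES}
        m = {x: likelihood[x] * T[best[x]][x] * m[best[x]] for x in STATES}
        back.append(best)
    # Follow the pointers back from the most likely final state.
    path = [max(STATES, key=lambda x: m[x])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path[1:]        # drop X0; path[0] is the best starting state
```

On the example trellis, `viterbi([True])` reproduces the hand computation at X1: after one umbrella observation the node weights are 0.5 * 0.63 = 0.315 for rain and 0.5 * 0.18 = 0.09 for sun, so rain wins.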