# Today

- Markov Models

# Reasoning over time

Often, we want to reason about a sequence of observations:

- Speech recognition
- Robot localization
- User attention
- Medical monitoring

We need to introduce time (or space) into our models.

# Markov Models

The value of X at a given time is the state.

X1 -> X2 -> X3 -> ...

Parameters: $P(X_1)$ and $P(X_t \mid X_{t-1})$, called transition probabilities or dynamics; they specify how the state evolves over time (together with the initial state probabilities).

Stationarity assumption: the transition probabilities are the same at all times, i.e. $P(X_t \mid X_{t-1})$ is the same for all $t$.

# Joint distribution

$$P(X_1, X_2, X_3, X_4) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_2)\,P(X_4 \mid X_3)$$

More generally:

$$P(X_1, X_2, \ldots, X_T) = P(X_1) \prod_{t=2}^{T} P(X_t \mid X_{t-1})$$

Q. Can you prove this from just the chain rule and the conditional independence assumption?

Q. Can you prove that X1 is conditionally independent of X3 and X4 given X2?

# Conditional independence

Past and future are independent given the present; each step depends only on the previous one (a first-order Markov process).

This chain is a growable BN: we can always truncate it and use the inference techniques we already know.

# Example: Weather

States: X = {rain, sun}

Initial distribution: 1.0 sun

CPT $P(X_t \mid X_{t-1})$:

| Xt-1 | P(Xt = sun \| Xt-1) | P(Xt = rain \| Xt-1) |
|------|---------------------|----------------------|
| sun  | 0.9                 | 0.1                  |
| rain | 0.3                 | 0.7                  |

Two new ways to represent the same CPT:

- As a finite-state machine with transition probabilities on the edges: sun -> sun 0.9, sun -> rain 0.1, rain -> sun 0.3, rain -> rain 0.7
- As a chain of state nodes over time, one node per state per time step

# Moving through time

What's the probability distribution after one step?

$$P(X_2 = \text{sun}) = P(X_2 = \text{sun} \mid X_1 = \text{sun})\,P(X_1 = \text{sun}) + P(X_2 = \text{sun} \mid X_1 = \text{rain})\,P(X_1 = \text{rain})$$
$$= 0.9 \times 1.0 + 0.3 \times 0.0 = 0.9$$

# The (mini) Forward algorithm

What's P(X) on some day t? $P(X_1)$ is known; for later days, marginalize out the previous state (a code sketch follows the stationary-distribution discussion below):

$$P(x_t) = \sum_{x_{t-1}} P(x_{t-1}, x_t) = \sum_{x_{t-1}} P(x_t \mid x_{t-1})\,P(x_{t-1})$$

# Example

From an initial observation of sun:

| t       | 1   | 2   | 3    | 4     | ... | → ∞  |
|---------|-----|-----|------|-------|-----|------|
| P(sun)  | 1.0 | 0.9 | 0.84 | 0.804 | ... | 0.75 |
| P(rain) | 0.0 | 0.1 | 0.16 | 0.196 | ... | 0.25 |

From an initial observation of rain:

| t       | 1   | 2   | 3    | 4     | ... | → ∞  |
|---------|-----|-----|------|-------|-----|------|
| P(sun)  | 0.0 | 0.3 | 0.48 | 0.588 | ... | 0.75 |
| P(rain) | 1.0 | 0.7 | 0.52 | 0.412 | ... | 0.25 |

In fact, from any initial distribution (p, 1 - p), the chain converges to (0.75, 0.25)!

# Stationary distributions

For most chains:

- The influence of the initial distribution gets less and less over time.
- The distribution we end up in is independent of the initial distribution.

Stationary distribution:

- The distribution we end up with is called the stationary distribution $P_\infty$ of the chain.
- It satisfies
$$P_\infty(X) = P_{\infty+1}(X) = \sum_x P(X \mid x)\,P_\infty(x)$$

# Solving for stationary distributions

What's P(X) at time infinity? (Write $\pi = P_\infty$ for brevity.)

$$\pi(\text{sun}) = P(\text{sun} \mid \text{sun})\,\pi(\text{sun}) + P(\text{sun} \mid \text{rain})\,\pi(\text{rain})$$
$$\pi(\text{rain}) = P(\text{rain} \mid \text{sun})\,\pi(\text{sun}) + P(\text{rain} \mid \text{rain})\,\pi(\text{rain})$$

Plugging in the weather CPT:

$$\pi(\text{sun}) = 0.9\,\pi(\text{sun}) + 0.3\,\pi(\text{rain}) \qquad \pi(\text{rain}) = 0.1\,\pi(\text{sun}) + 0.7\,\pi(\text{rain})$$

which both simplify to $\pi(\text{sun}) = 3\,\pi(\text{rain})$.

Two equations and two unknowns? Not quite: the equations are not independent. But we also know that $\pi(\text{sun}) + \pi(\text{rain}) = 1$, which gives

$$\pi(\text{sun}) = \tfrac{3}{4} \qquad \pi(\text{rain}) = \tfrac{1}{4}$$
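To make the mini forward algorithm concrete, here is a minimal Python sketch (mine, not from the slides) that iterates the update $P(x_t) = \sum_{x_{t-1}} P(x_t \mid x_{t-1})\,P(x_{t-1})$ on the weather chain and reproduces the numbers in the example above:

```python
# Mini forward algorithm on the weather chain (a sketch, not from the slides).
# States are indexed 0 = sun, 1 = rain; T[i][j] = P(X_t = j | X_{t-1} = i).
T = [[0.9, 0.1],
     [0.3, 0.7]]

def forward_step(p):
    """One step of P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})."""
    return [sum(p[i] * T[i][j] for i in range(len(p))) for j in range(len(T[0]))]

p = [1.0, 0.0]  # initial observation of sun
for t in range(1, 11):
    print(f"t={t}: P(sun)={p[0]:.4f}, P(rain)={p[1]:.4f}")
    p = forward_step(p)
# The printed distribution approaches (0.75, 0.25), and it does so
# from any starting distribution, not just (1.0, 0.0).
```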
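The stationary distribution can also be found without iterating: the balance equations above say that $\pi$ is a left eigenvector of the transition matrix with eigenvalue 1, normalized to sum to 1. A short NumPy sketch (again my framing, not the slides'):

```python
import numpy as np

# Transition matrix: rows are X_{t-1}, columns are X_t, order (sun, rain).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# pi T = pi means pi is a left eigenvector of T with eigenvalue 1,
# i.e. a right eigenvector of T transposed.
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()  # normalize so that pi(sun) + pi(rain) = 1
print(pi)  # -> [0.75 0.25]
```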
# Application: PageRank

This is (roughly) how PageRank worked, back in the day:

- Each web page is a state.
- Initial distribution: uniform over pages.
- Transitions: with probability c, jump uniformly to a random page; with probability 1 - c, follow a random outlink.

Stationary distribution:

- Spends more time on highly reachable pages (e.g., there are many ways to get to the Acrobat download page).
- Somewhat robust to link spam (not so much these days with SEO).
- All search engines now use many other features in addition to an analytic "rank" like this one (now dominated by cross-site clickstreams).

# Application: Gibbs Sampling

Each joint instantiation over all hidden and query variables is a state: {X1, …, Xn} = H ∪ Q.

Transitions: with probability 1/n, resample variable Xj according to P(Xj | mb(Xj)), its distribution given its Markov blanket (a code sketch appears at the end of these notes).

Stationary distribution:

- The conditional distribution P(X1, X2, …, Xn | e1, …, em).
- This means that if we run Gibbs sampling long enough, we get a sample from the desired distribution.

# Hidden Markov Models

Markov models assume perfect knowledge of the world state. In practice, we often only have (unreliable) observations that are a proxy for the world state. HMMs model this reality:

- An underlying Markov chain over states X
- An observed output (effect) at each step

(on board)

# Example: Weather HMM

Rain(t-1) -> Rain(t) -> Rain(t+1), and at each step the hidden state emits an observation: Rain(t) -> Umbrella(t).

An HMM is defined by:

- an initial distribution P(X1)
- transitions P(Xt | Xt-1)
- emissions P(Et | Xt)

Example CPTs:

| Rt | P(Rt+1 = true \| Rt) |
|----|----------------------|
| T  | 0.7                  |
| F  | 0.3                  |

| Rt | P(Ut = true \| Rt) |
|----|--------------------|
| T  | 0.9                |
| F  | 0.2                |

# Joint distribution of an HMM

$$P(X_1, E_1, X_2, E_2, X_3, E_3) = P(X_1)\,P(E_1 \mid X_1)\,P(X_2 \mid X_1)\,P(E_2 \mid X_2)\,P(X_3 \mid X_2)\,P(E_3 \mid X_3)$$

More generally:

$$P(X_1, E_1, \ldots, X_N, E_N) = P(X_1)\,P(E_1 \mid X_1) \prod_{t=2}^{N} P(X_t \mid X_{t-1})\,P(E_t \mid X_t)$$

As before: can you prove this from just the chain rule and the conditional independence assumptions? (Yes; it also follows from the BN semantics.)

In fact, this implies all sorts of conditional independence; for example, each observation is conditionally independent of everything else, given its corresponding (hidden) state.

# Conditional independence

HMMs have two important independence properties:

- Markov hidden process: the future depends on the past only via the present.
- The current observation is independent of everything else given the current state.

Q. Does this mean that the evidence variables are guaranteed to be independent?

[No; they tend to be correlated by the hidden state.]

# Some Real World HMMs

Speech recognition HMMs:

- Observations are acoustic signals (continuous valued)
- States are specific positions in specific words (so, tens of thousands)

Machine translation HMMs:

- Observations are words (tens of thousands)
- States are translation options

Robot tracking:

- Observations are range readings (continuous)
- States are positions on a map (continuous)

# Filtering / Monitoring

Filtering, or monitoring, is the task of tracking the belief state

$$B_t(X) = P(X_t \mid e_1, \ldots, e_t)$$

over time. We start with B1(X) in an initial setting, usually uniform. As time passes, or as we get observations, we update B(X); a code sketch of this update follows below.

The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program.

Example: see pptx.
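To make the filtering update concrete, here is a minimal sketch (mine, using the umbrella CPTs from the example above; the observation sequence is made up). Each step first pushes the belief through the transition model (time elapse), then reweights by the evidence likelihood and normalizes (observation):

```python
# Filtering on the umbrella HMM (a sketch, using the CPTs from the example).
# State 0 = rain, 1 = not rain; observation True = umbrella seen.
T = [[0.7, 0.3],          # P(R_{t+1} | R_t = rain)
     [0.3, 0.7]]          # P(R_{t+1} | R_t = not rain)
E = {True: [0.9, 0.2],    # P(U_t = true | R_t)
     False: [0.1, 0.8]}   # P(U_t = false | R_t)

def filter_step(belief, umbrella):
    """One belief update: time elapse, then observation, then normalize."""
    # Time elapse: B'(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) B(x_{t-1})
    predicted = [sum(belief[i] * T[i][j] for i in range(2)) for j in range(2)]
    # Observation: B(x_t) is proportional to P(e_t | x_t) B'(x_t)
    unnormalized = [E[umbrella][j] * predicted[j] for j in range(2)]
    z = sum(unnormalized)
    return [w / z for w in unnormalized]

belief = [0.5, 0.5]  # uniform initial belief
for u in [True, True, False]:  # a hypothetical observation sequence
    belief = filter_step(belief, u)
    print(f"umbrella={u}: P(rain)={belief[0]:.3f}")
```

After the first umbrella observation this gives P(rain) ≈ 0.818, the standard umbrella-world result.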
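Finally, returning to the Gibbs sampling application earlier in these notes: a toy sketch (mine, not from the slides) on the made-up model X1 -> X2 -> X3 with X3 observed, reusing the weather CPT as the transition model. Each transition resamples one hidden variable from its distribution given its Markov blanket:

```python
import random

# Toy Gibbs sampler (a sketch) for the chain X1 -> X2 -> X3 with X3 observed.
# States: True = sun, False = rain.
P_X1 = {True: 0.5, False: 0.5}      # assumed uniform prior on X1
P_TRANS = {True: 0.9, False: 0.3}   # P(next = sun | current), weather CPT

def p_trans(prev, cur):
    p = P_TRANS[prev]
    return p if cur else 1 - p

def sample_bernoulli(p_true):
    return random.random() < p_true

def resample_x1(x2):
    # Markov blanket of X1 is {X2}: P(x1 | x2) is proportional to P(x1) P(x2 | x1)
    w_true = P_X1[True] * p_trans(True, x2)
    w_false = P_X1[False] * p_trans(False, x2)
    return sample_bernoulli(w_true / (w_true + w_false))

def resample_x2(x1, x3):
    # Markov blanket of X2 is {X1, X3}: P(x2 | x1, x3) prop. to P(x2 | x1) P(x3 | x2)
    w_true = p_trans(x1, True) * p_trans(True, x3)
    w_false = p_trans(x1, False) * p_trans(False, x3)
    return sample_bernoulli(w_true / (w_true + w_false))

x1, x2, x3 = True, True, True  # X3 = sun is the evidence; X1, X2 are hidden
counts, N = 0, 100_000
for _ in range(N):
    if random.random() < 0.5:  # pick a hidden variable with probability 1/n
        x1 = resample_x1(x2)
    else:
        x2 = resample_x2(x1, x3)
    counts += x1
print(f"Estimated P(X1 = sun | X3 = sun) ~ {counts / N:.3f}")
```

Run long enough, the samples come from the desired conditional distribution; here the estimate should settle near the exact posterior 0.84 / (0.84 + 0.48) ≈ 0.636.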