# Today

- Markov Models

# Reasoning over time

Often, we want to reason about a sequence of observations:

- Speech recognition
- Robot localization
- User attention
- Medical monitoring

We need to introduce time (or space) into our models.

# Markov Models

The value of X at a given time is the state.

X1 -> X2 -> X3 -> ...

Parameters: $P(X_1)$ and $P(X_t \mid X_{t-1})$, called transition probabilities or dynamics; they specify how the state evolves over time (together with the initial state probabilities).

Stationarity assumption: the transition probabilities are the same at all times, i.e. $P(X_t \mid X_{t-1})$ is the same for all $t$.

# Joint distribution

$$P(X_1, X_2, X_3, X_4) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_2)\,P(X_4 \mid X_3)$$

More generally:

$$P(X_1, X_2, \ldots, X_T) = P(X_1) \prod_{t=2}^{T} P(X_t \mid X_{t-1})$$

Q. Can you prove this from just the chain rule and the conditional independence assumption?

Q. Can you prove that X1 is conditionally independent of X3 and X4 given X2?

# Conditional independence

Past and future are independent given the present; each step depends only on the previous one (a first-order Markov process).

This chain is a growable BN: we can always truncate it and use the inference techniques we already know.

# Example: Weather

States: X = {rain, sun}

Initial distribution: 1.0 sun

CPT $P(X_t \mid X_{t-1})$:

| Xt-1 | P(Xt = sun \| Xt-1) | P(Xt = rain \| Xt-1) |
|------|---------------------|----------------------|
| sun  | 0.9                 | 0.1                  |
| rain | 0.3                 | 0.7                  |

Two new ways to represent the same CPT:

- As a finite-state machine with transition probabilities on the edges: sun -> sun 0.9, sun -> rain 0.1, rain -> sun 0.3, rain -> rain 0.7
- As a chain of state nodes over time, one node per state per time step

# Moving through time

What's the probability distribution after one step?

$$P(X_2 = \text{sun}) = P(X_2 = \text{sun} \mid X_1 = \text{sun})\,P(X_1 = \text{sun}) + P(X_2 = \text{sun} \mid X_1 = \text{rain})\,P(X_1 = \text{rain})$$
$$= 0.9 \times 1.0 + 0.3 \times 0.0 = 0.9$$

# The (mini) Forward algorithm

What's P(X) on some day t? $P(X_1)$ is known; for later days, marginalize out the previous state (a code sketch follows the stationary-distribution discussion below):

$$P(x_t) = \sum_{x_{t-1}} P(x_{t-1}, x_t) = \sum_{x_{t-1}} P(x_t \mid x_{t-1})\,P(x_{t-1})$$

# Example

From an initial observation of sun:

| t       | 1   | 2   | 3    | 4     | ... | → ∞  |
|---------|-----|-----|------|-------|-----|------|
| P(sun)  | 1.0 | 0.9 | 0.84 | 0.804 | ... | 0.75 |
| P(rain) | 0.0 | 0.1 | 0.16 | 0.196 | ... | 0.25 |

From an initial observation of rain:

| t       | 1   | 2   | 3    | 4     | ... | → ∞  |
|---------|-----|-----|------|-------|-----|------|
| P(sun)  | 0.0 | 0.3 | 0.48 | 0.588 | ... | 0.75 |
| P(rain) | 1.0 | 0.7 | 0.52 | 0.412 | ... | 0.25 |

In fact, from any initial distribution (p, 1 - p), the chain converges to (0.75, 0.25)!

# Stationary distributions

For most chains:

- The influence of the initial distribution gets less and less over time.
- The distribution we end up in is independent of the initial distribution.

Stationary distribution:

- The distribution we end up with is called the stationary distribution $P_\infty$ of the chain.
- It satisfies
$$P_\infty(X) = P_{\infty+1}(X) = \sum_x P(X \mid x)\,P_\infty(x)$$

# Solving for stationary distributions

What's P(X) at time infinity? (Write $\pi = P_\infty$ for brevity.)

$$\pi(\text{sun}) = P(\text{sun} \mid \text{sun})\,\pi(\text{sun}) + P(\text{sun} \mid \text{rain})\,\pi(\text{rain})$$
$$\pi(\text{rain}) = P(\text{rain} \mid \text{sun})\,\pi(\text{sun}) + P(\text{rain} \mid \text{rain})\,\pi(\text{rain})$$

Plugging in the weather CPT:

$$\pi(\text{sun}) = 0.9\,\pi(\text{sun}) + 0.3\,\pi(\text{rain}) \qquad \pi(\text{rain}) = 0.1\,\pi(\text{sun}) + 0.7\,\pi(\text{rain})$$

which both simplify to $\pi(\text{sun}) = 3\,\pi(\text{rain})$.

Two equations and two unknowns? Not quite: the equations are not independent. But we also know that $\pi(\text{sun}) + \pi(\text{rain}) = 1$, which gives

$$\pi(\text{sun}) = \tfrac{3}{4} \qquad \pi(\text{rain}) = \tfrac{1}{4}$$
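To make the mini forward algorithm concrete, here is a minimal Python sketch (mine, not from the slides) that iterates the update $P(x_t) = \sum_{x_{t-1}} P(x_t \mid x_{t-1})\,P(x_{t-1})$ on the weather chain and reproduces the numbers in the example above:

```python
# Mini forward algorithm on the weather chain (a sketch, not from the slides).
# States are indexed 0 = sun, 1 = rain; T[i][j] = P(X_t = j | X_{t-1} = i).
T = [[0.9, 0.1],
     [0.3, 0.7]]

def forward_step(p):
    """One step of P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})."""
    return [sum(p[i] * T[i][j] for i in range(len(p))) for j in range(len(T[0]))]

p = [1.0, 0.0]  # initial observation of sun
for t in range(1, 11):
    print(f"t={t}: P(sun)={p[0]:.4f}, P(rain)={p[1]:.4f}")
    p = forward_step(p)
# The printed distribution approaches (0.75, 0.25), and it does so
# from any starting distribution, not just (1.0, 0.0).
```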
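The stationary distribution can also be found without iterating: the balance equations above say that $\pi$ is a left eigenvector of the transition matrix with eigenvalue 1, normalized to sum to 1. A short NumPy sketch (again my framing, not the slides'):

```python
import numpy as np

# Transition matrix: rows are X_{t-1}, columns are X_t, order (sun, rain).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# pi T = pi means pi is a left eigenvector of T with eigenvalue 1,
# i.e. a right eigenvector of T transposed.
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()  # normalize so that pi(sun) + pi(rain) = 1
print(pi)  # -> [0.75 0.25]
```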
# Application: PageRank

This is (roughly) how PageRank worked, back in the day:

- Each web page is a state.
- Initial distribution: uniform over pages.
- Transitions: with probability c, jump uniformly to a random page; with probability 1 - c, follow a random outlink.

Stationary distribution:

- Spends more time on highly reachable pages (e.g., there are many ways to get to the Acrobat download page).
- Somewhat robust to link spam (not so much these days with SEO).
- All search engines now use many other features in addition to an analytic "rank" like this one (now dominated by cross-site clickstreams).

# Application: Gibbs Sampling

Each joint instantiation over all hidden and query variables is a state: {X1, …, Xn} = H ∪ Q.

Transitions: with probability 1/n, resample variable Xj according to P(Xj | mb(Xj)), its distribution given its Markov blanket (a code sketch appears at the end of these notes).

Stationary distribution:

- The conditional distribution P(X1, X2, …, Xn | e1, …, em).
- This means that if we run Gibbs sampling long enough, we get a sample from the desired distribution.

# Hidden Markov Models

Markov models assume perfect knowledge of the world state. In practice, we often only have (unreliable) observations that are a proxy for the world state. HMMs model this reality:

- An underlying Markov chain over states X
- An observed output (effect) at each step

(on board)

# Example: Weather HMM

Rain(t-1) -> Rain(t) -> Rain(t+1), and at each step the hidden state emits an observation: Rain(t) -> Umbrella(t).

An HMM is defined by:

- an initial distribution P(X1)
- transitions P(Xt | Xt-1)
- emissions P(Et | Xt)

Example CPTs:

| Rt | P(Rt+1 = true \| Rt) |
|----|----------------------|
| T  | 0.7                  |
| F  | 0.3                  |

| Rt | P(Ut = true \| Rt) |
|----|--------------------|
| T  | 0.9                |
| F  | 0.2                |

# Joint distribution of an HMM

$$P(X_1, E_1, X_2, E_2, X_3, E_3) = P(X_1)\,P(E_1 \mid X_1)\,P(X_2 \mid X_1)\,P(E_2 \mid X_2)\,P(X_3 \mid X_2)\,P(E_3 \mid X_3)$$

More generally:

$$P(X_1, E_1, \ldots, X_N, E_N) = P(X_1)\,P(E_1 \mid X_1) \prod_{t=2}^{N} P(X_t \mid X_{t-1})\,P(E_t \mid X_t)$$

As before: can you prove this from just the chain rule and the conditional independence assumptions? (Yes; it also follows from the BN semantics.)

In fact, this implies all sorts of conditional independence; for example, each observation is conditionally independent of everything else, given its corresponding (hidden) state.

# Conditional independence

HMMs have two important independence properties:

- Markov hidden process: the future depends on the past only via the present.
- The current observation is independent of everything else given the current state.

Q. Does this mean that the evidence variables are guaranteed to be independent?

[No; they tend to be correlated by the hidden state.]

# Some Real World HMMs

Speech recognition HMMs:

- Observations are acoustic signals (continuous valued)
- States are specific positions in specific words (so, tens of thousands)

Machine translation HMMs:

- Observations are words (tens of thousands)
- States are translation options

Robot tracking:

- Observations are range readings (continuous)
- States are positions on a map (continuous)

# Filtering / Monitoring

Filtering, or monitoring, is the task of tracking the belief state

$$B_t(X) = P(X_t \mid e_1, \ldots, e_t)$$

over time. We start with B1(X) in an initial setting, usually uniform. As time passes, or as we get observations, we update B(X); a code sketch of this update follows below.

The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program.

Example: see pptx.
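To make the filtering update concrete, here is a minimal sketch (mine, using the umbrella CPTs from the example above; the observation sequence is made up). Each step first pushes the belief through the transition model (time elapse), then reweights by the evidence likelihood and normalizes (observation):

```python
# Filtering on the umbrella HMM (a sketch, using the CPTs from the example).
# State 0 = rain, 1 = not rain; observation True = umbrella seen.
T = [[0.7, 0.3],          # P(R_{t+1} | R_t = rain)
     [0.3, 0.7]]          # P(R_{t+1} | R_t = not rain)
E = {True: [0.9, 0.2],    # P(U_t = true | R_t)
     False: [0.1, 0.8]}   # P(U_t = false | R_t)

def filter_step(belief, umbrella):
    """One belief update: time elapse, then observation, then normalize."""
    # Time elapse: B'(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) B(x_{t-1})
    predicted = [sum(belief[i] * T[i][j] for i in range(2)) for j in range(2)]
    # Observation: B(x_t) is proportional to P(e_t | x_t) B'(x_t)
    unnormalized = [E[umbrella][j] * predicted[j] for j in range(2)]
    z = sum(unnormalized)
    return [w / z for w in unnormalized]

belief = [0.5, 0.5]  # uniform initial belief
for u in [True, True, False]:  # a hypothetical observation sequence
    belief = filter_step(belief, u)
    print(f"umbrella={u}: P(rain)={belief[0]:.3f}")
```

After the first umbrella observation this gives P(rain) ≈ 0.818, the standard umbrella-world result.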
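Finally, returning to the Gibbs sampling application earlier in these notes: a toy sketch (mine, not from the slides) on the made-up model X1 -> X2 -> X3 with X3 observed, reusing the weather CPT as the transition model. Each transition resamples one hidden variable from its distribution given its Markov blanket:

```python
import random

# Toy Gibbs sampler (a sketch) for the chain X1 -> X2 -> X3 with X3 observed.
# States: True = sun, False = rain.
P_X1 = {True: 0.5, False: 0.5}      # assumed uniform prior on X1
P_TRANS = {True: 0.9, False: 0.3}   # P(next = sun | current), weather CPT

def p_trans(prev, cur):
    p = P_TRANS[prev]
    return p if cur else 1 - p

def sample_bernoulli(p_true):
    return random.random() < p_true

def resample_x1(x2):
    # Markov blanket of X1 is {X2}: P(x1 | x2) is proportional to P(x1) P(x2 | x1)
    w_true = P_X1[True] * p_trans(True, x2)
    w_false = P_X1[False] * p_trans(False, x2)
    return sample_bernoulli(w_true / (w_true + w_false))

def resample_x2(x1, x3):
    # Markov blanket of X2 is {X1, X3}: P(x2 | x1, x3) prop. to P(x2 | x1) P(x3 | x2)
    w_true = p_trans(x1, True) * p_trans(True, x3)
    w_false = p_trans(x1, False) * p_trans(False, x3)
    return sample_bernoulli(w_true / (w_true + w_false))

x1, x2, x3 = True, True, True  # X3 = sun is the evidence; X1, X2 are hidden
counts, N = 0, 100_000
for _ in range(N):
    if random.random() < 0.5:  # pick a hidden variable with probability 1/n
        x1 = resample_x1(x2)
    else:
        x2 = resample_x2(x1, x3)
    counts += x1
print(f"Estimated P(X1 = sun | X3 = sun) ~ {counts / N:.3f}")
```

Run long enough, the samples come from the desired conditional distribution; here the estimate should settle near the exact posterior 0.84 / (0.84 + 0.48) ≈ 0.636.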