CMPSCI 383: Artificial Intelligence

Fall 2014 (archived)

Assignment 10

This assignment is due at 1700 on Friday, 05 December.

The goal of this assignment is to implement filtering (using the forward algorithm) and most likely explanation determination (using the Viterbi algorithm) on a simple system that can be modeled using a Hidden Markov model.

I will be updating the assignment with questions (and their answers) as they are asked.

Problem

Suppose you are playing a game that involves coin flips. Each round, your opponent flips a coin and shows you the result (heads or tails).

In this problem, you will assume the opponent has two coins: a fair coin with P(heads)=0.5, and an unfair coin with P(heads) != 0.5. The opponent starts by flipping the fair coin. Occasionally, your opponent exchanges one coin for the other, but you’re not able to see them do so.

Given a sequence of observations, for example, t h t h h, there are two questions I’d like you to answer:

  • What is the probability that the last coin flipped was the fair coin? (This value can be found by using the forward algorithm.)
  • What is the most likely sequence of (fair, unfair) coins to have produced the observed set of flips? (This sequence can be found by using the Viterbi algorithm.)

In this scenario, the value of the hidden variable represents the coin (fair or unfair) that is used at time t, and the emission variable’s value (heads or tails) represents the flip at time t. The bias of the unfair coin, transition probabilities, and sequence of observed flips will be provided as input.
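
Concretely, you might represent the parameters with a pair of 2x2 arrays. Here is a minimal sketch in Java, using the sample values from the input format below; all names are illustrative, not required:

// Hidden states are indexed 0 = fair, 1 = unfair; emissions 0 = heads, 1 = tails.
double pUnfairHeads = 0.9;    // bias of the unfair coin (first input line)
double pFairToUnfair = 0.2;   // P(unfair at t | fair at t-1) (second input line)
double pUnfairToFair = 0.333; // P(fair at t | unfair at t-1) (third input line)

double[][] transition = {
    { 1.0 - pFairToUnfair, pFairToUnfair },  // from fair: stay fair, switch to unfair
    { pUnfairToFair, 1.0 - pUnfairToFair },  // from unfair: switch to fair, stay unfair
};
double[][] emission = {
    { 0.5, 0.5 },                            // fair coin: P(h), P(t)
    { pUnfairHeads, 1.0 - pUnfairHeads },    // unfair coin: P(h), P(t)
};
double[] initial = { 1.0, 0.0 };             // the opponent starts with the fair coin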

Input data format

Your program should expect input data in the following text-based format.

The first three lines will each consist of a single floating-point number:

  • The value on the first line is the probability that the unfair coin will come up heads.
  • The value on the second line is the probability that, if the fair coin was used for the flip at time t-1, the unfair coin will be used for the flip at time t.
  • The value on the third line is the probability that, if the unfair coin was used for the flip at time t-1, the fair coin will be used for the flip at time t.

The fourth line will be a series of whitespace-delimited single characters, either h or t, corresponding to a sequence of coin flips.

For example:

0.9
0.2
0.333
h t t h t h h h h h t h t t h t h h h h

represents an observation where:

  • the unfair coin comes up heads 9/10 of the time
  • the fair coin is swapped for the unfair coin 1/5 of the time (and retained the other 4/5 of the time)
  • the unfair coin is swapped for the fair coin about 1/3 of the time
  • twenty flips were observed
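
A minimal sketch of reading this format in Java (assuming, as described under "What to submit" below, that the path to the input file arrives as the first command-line argument; the class name is illustrative):

import java.io.File;
import java.util.Scanner;

public class ReadInput {
    public static void main(String[] args) throws Exception {
        Scanner in = new Scanner(new File(args[0]));
        double pUnfairHeads = Double.parseDouble(in.nextLine().trim());  // line 1
        double pFairToUnfair = Double.parseDouble(in.nextLine().trim()); // line 2
        double pUnfairToFair = Double.parseDouble(in.nextLine().trim()); // line 3
        String[] flips = in.nextLine().trim().split("\\s+");             // line 4: "h"/"t"
        in.close();
        System.out.println(flips.length + " flips read");
    }
}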

Output data format

Your program that filters (that is, determines the probability that the last coin flipped was fair) should output a single floating-point number to standard output, for example:

0.38096321545799494

Your program that determines the most likely explanation for a given sequence of flips should output a single line, consisting of exactly as many whitespace-separated characters (each either f or u) as there were flips on the fourth line of its input. These characters correspond to the state in the MLE at each step, either the fair coin (f) or the unfair coin (u), for example:

f f f f f f f f f f f f f f f f u u u u

Both of the above sample outputs are in the correct format.

Other items of import

If, while executing the Viterbi algorithm, your program encounters a tie (that is, for a given sequence of flips the fair and unfair coins are equiprobable), break the tie in favor of the fair coin.

This problem is a simplification of the Unfair Casino Problem, which is often used to illustrate HMMs. You might find it helpful to search the Internet for it, but remember, you should not copy code you find online. Read it, understand it, then close the browser window and re-implement it yourself.

Floating-point underflow may once again rear its ugly head. If you re-normalize at each step in the forward algorithm, you won’t have a problem. (In practice, you would only do so if it looked like you were getting close to underflowing.) When computing the MLE using the Viterbi algorithm, you can do the entire computation under a log transform (computing the so-called logprob). Just use the greater of the two logprobs when computing max(), since the logarithm is monotonically increasing and thus ordering-preserving. As usual, though, do not worry about underflow until you have the computation working correctly. It’s easier to debug simple examples without the added complexity.
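
Both tricks are small in code. A hedged Java fragment (the placeholder values here are just to make the point):

// Forward algorithm: re-normalize the two entries after each step so they
// never drift toward zero.
double pFair = 1e-300, pUnfair = 3e-300;  // about to underflow
double sum = pFair + pUnfair;
pFair /= sum;                             // 0.25
pUnfair /= sum;                           // 0.75

// Viterbi: compare sums of logs instead of products of probabilities; the
// ordering is preserved because the logarithm is monotonically increasing.
double viaFair = Math.log(1e-300) + Math.log(0.5);
double viaUnfair = Math.log(3e-300) + Math.log(0.9);
double best = Math.max(viaFair, viaUnfair);  // viaUnfair wins; no underflow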

What to submit

You should submit three things: a program to perform filtering, a program to compute the most likely explanation, and a readme.txt.

  • Each should use its first command line argument as the path to a file containing the input data. If, for example, your filterer’s main method is in a Java class named Filter, we should be able to use java Filter /Users/liberato/input to direct your program to read the input data in /Users/liberato/input.
  • Your program should output to standard output, in exactly the format described above.

Submit the source code of your programs, written in the language of your choice. Name the files containing the main() methods Filter.java and Viterbi.java or your language’s equivalent. If the file(s) you submit depend(s) upon other files, be sure to submit these other files as well.

As in the previous assignments, while you may use library calls for parsing, data structures and the like, you must implement the calculations yourself. Do not use a library for filtering or MLE calculation. We will consider it plagiarism if you do. Check with us if you think there’s any ambiguity.

Your readme.txt should contain the following items:

  • your name
  • if the language of your choice is not Java, Python, Ruby, node.js-compatible JavaScript, ANSI C or C++, or Mono-compatible C# (or if you’re concerned it’s not completely obvious to us how to compile and execute it), a description of how to compile and execute the submitted files
  • a description of what you got working, what is partially working and what is completely broken

If you’re using language features that require a specific version of your language or runtime, check for that version at program start and fail if it’s not present, emitting an understandable error message indicating this fact. Your program must compile and execute on the Edlab Linux machines.

If your program does not compile or execute, you will receive no credit. Check with us in advance if you’re concerned.

Grading

A correct filtering implementation will be worth 70% of the points for this assignment. Correctly computing the MLE will earn the remaining 30% of the points.

We will run your programs on a variety of test cases. The exact test cases will not be available to you before grading. You are welcome to write and distribute your own test cases.

If your readme.txt is missing or judged insufficient, your overall score may be penalized by up to ten percent.

We’re not going to feed your program incorrectly formatted input, so you need only concern yourself with handling input in the format described in the assignment.

We expect valid output. Generating output that is not in the format described in the assignment will result in a failed test case.

I do not expect anything in a solution to this assignment to be particularly memory or CPU intensive. But as usual, if your program exceeds available heap memory (which we’ll set to 1 GB in Java, using the -Xmx1024M argument if necessary), or if it does not terminate in twenty seconds, we will consider the test case failed.

Questions and answers

What should we do in our Viterbi programs if there are equally likely paths?

When computing the most likely path, break ties in favor of the hidden variable’s value being the fair coin.

Can’t we just use the forward algorithm at each time step to find the Viterbi path?

In general, no. The “most likely sequence of states given a sequence of observations” is not necessarily the same as the “sequence of most likely states given the evidence up to (but not after) each time step.” Make sure your implementation of the Viterbi algorithm isn’t just returning the most likely state at each time step!

As an aside, each is also different from the sequence of most likely states given all the evidence (both up to and after each time step) – that’s called smoothing, and it can be calculated with the forward-backward algorithm.

I am a bit curious about the percentage that the program is supposed to print out.

Will it ever be at 10% or a little bit above 10%? Because when I test my program with all h observations over 100 flips, it only gets as low as 13-something percent. Is it supposed to get closer to 10%? And the biased coin percentage only gets as high as 87%; is it supposed to get closer to 90%?

The answer depends upon the input. For the same probabilities as given in the sample input in the assignment, but with only h observations, the distribution for P(X=fair) converges on a value a little greater than 1/3 (assuming I’m doing my math correctly). That is, you shouldn’t see a value around 0.1 for P(Xn=fair | e1=h,e2=h,e3=h…en=h); it should be closer to 0.36. (If you think I’m wrong here please let me know, I am literally doing this on the back of an envelope.)
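
For the curious, here is the back-of-envelope fixed point that the number above comes from (not needed for the assignment). Writing p for the converged value of P(Xt=fair) under a constant stream of h observations, one elapse-then-observe step must map p back to itself:

q = 0.8p + 0.333(1 - p)           (elapse time)
p = 0.5q / (0.5q + 0.9(1 - q))    (observe h and normalize)

If my algebra is right, substituting the first equation into the second yields the quadratic 0.1868p^2 - 0.5333p + 0.1665 = 0, whose root in [0, 1] is p ≈ 0.357, consistent with the value claimed above.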

Do the two steps of the forward algorithm have to be done separately, or can I combine them? I found a video online where they use a single update step for each observation. It looks like the recursive equation you derived in class.

Yes, that’s fine. It’s mathematically equivalent. I like to present it as two steps, since each one is straightforward-ish in isolation. But you can do both steps simultaneously (as we did in Viterbi), multiplying the current estimate for Xt by the transition probability and by the emission probability then normalizing (or not, saving the normalization until the very last iteration).
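
For concreteness, one combined update step might look like this in Java (a sketch assuming the transition and emission arrays and the pFair/pUnfair estimates from the fragments above, with obs being 0 for h and 1 for t):

// Elapse time and weigh by the evidence in a single step, then normalize.
double newFair   = (pFair * transition[0][0] + pUnfair * transition[1][0]) * emission[0][obs];
double newUnfair = (pFair * transition[0][1] + pUnfair * transition[1][1]) * emission[1][obs];
double norm = newFair + newUnfair;  // or defer normalization to the last iteration
pFair = newFair / norm;
pUnfair = newUnfair / norm;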

(paraphrasing an in-class question) My output for filtering differs from yours.

Here is the stepwise output of my implementation of the forward algorithm on the assignment’s sample input. Please let me know if you think I screwed up – it’s entirely possible.

initial prob fair: 1.0
after elapsing time: 0.8
after observing t: 0.9523809523809523
after elapsing time: 0.7777619047619048
after observing t: 0.945941250057916
after elapsing time: 0.7747545637770468
after observing h: 0.6564624305504023
after elapsing time: 0.6395679550670379
after observing t: 0.8987058709568174
after elapsing time: 0.7526956417368338
after observing h: 0.6283756184764107
after elapsing time: 0.6264514138284838
after observing h: 0.4823164973043065
after elapsing time: 0.5582418042411111
after observing h: 0.4124716268854065
after elapsing time: 0.5256242497554848
after observing h: 0.3810250243319588
after elapsing time: 0.5109386863630248
after observing h: 0.36725177711997703
after elapsing time: 0.5045065799150292
after observing t: 0.8358220347989636
after elapsing time: 0.7233288902511161
after observing h: 0.5922435467863258
after elapsing time: 0.6095777363492142
after observing t: 0.8864494020898617
after elapsing time: 0.7469718707759654
after observing t: 0.9365508354244998
after elapsing time: 0.7703692401432415
after observing h: 0.6508120649453619
after elapsing time: 0.636929234329484
after observing t: 0.8976607288334011
after elapsing time: 0.7522075603651983
after observing h: 0.6277635175443642
after elapsing time: 0.6261655626932181
after observing h: 0.48201154956701425
after elapsing time: 0.5580993936477957
after observing h: 0.41233169338702846
after elapsing time: 0.5255589008117423
after observing h: 0.38096321545799494
0.38096321545799494

I was looking at your filtering output on the assignment page and I realized that your observation starts with t whereas the test data starts with h, and after counting the number of instances, there are 19 instances in your output but 20 in the test. Do we have to start from index 1 and not 0?

Sort of.

The value of the hidden variable is a given at the first time step (as part of the assignment text). The first flip is of the fair coin (that is, P(coin1 = fair) = 1.0), and it should not change regardless of the observation.

In other words, the first observation does not reveal anything, since you know the initial probability distribution with full certainty. So you must ignore it, and start by elapsing time to reach the second observation, then factoring in the evidence observed at that time, to determine P(coin2=fair).
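
Putting the pieces together, here is a hedged sketch of the whole filtering loop in Java (reusing the illustrative transition/emission arrays and the flips array from the earlier fragments); it reproduces the stepwise trace above on the sample input:

double pFair = 1.0, pUnfair = 0.0;        // coin 1 is fair, by assumption
for (int t = 1; t < flips.length; t++) {  // skip the first observation
    int obs = flips[t].equals("h") ? 0 : 1;
    // elapse time
    double qFair   = pFair * transition[0][0] + pUnfair * transition[1][0];
    double qUnfair = pFair * transition[0][1] + pUnfair * transition[1][1];
    // observe, then normalize
    pFair   = qFair * emission[0][obs];
    pUnfair = qUnfair * emission[1][obs];
    double sum = pFair + pUnfair;
    pFair /= sum;
    pUnfair /= sum;
}
System.out.println(pFair);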

(Paraphrasing a student question) I’m not sure my Viterbi algorithm works. Can you tell me the correct values and reasoning for some small test cases?

As a validation check, what do you get for each of:

h
h h
h h h
h h h h

given the same transition, emission, and prior probabilities as in the example? I believe you should get:

f
f f
f f f
f u u u

respectively. If all you have is one observation (h), the MLE is f regardless of the flip, since

P(most likely path to X1 | h) = <1.0, 0.0>

(I’m abusing notation a little here; the vector holds the probabilities that the most likely path ends on f or u respectively, and it won’t necessarily sum to 1.)

With two observations (h h), it’s

fair: max(1.0 * 0.8 * 0.5, 0.0 * 0.333 * 0.5) = max(0.4, 0) = 0.4 (preceded by f)
unfair: max(1.0 * 0.2 * 0.9, 0.0 * 0.667 * 0.9) = max(0.18, 0) = 0.18 (preceded by f)

P(most likely path to X2|h h) = <0.4, 0.18>

So f is the most likely final state, with predecessor f, for a MLE of f f.

With three (h h h), it’s

fair: max(0.4 * 0.8 * 0.5, 0.18 * 0.333 * 0.5) = max(0.16, 0.02997) = 0.16 (preceded by f)
unfair: max(0.4 * 0.2 * 0.9, 0.18 * 0.667 * 0.9) = max(0.072, 0.108054) = 0.108054 (preceded by u)

Fair is still more likely if three observations are all you get; the predecessor is fair, and its predecessor in turn is fair (f f f).

If you have one more observation, though (h h h h), it’s

fair: max(0.16 * 0.8 * 0.5, 0.108054 * 0.333 * 0.5) = max(0.064, 0.017991) = 0.064 (preceded by f)
unfair: max(0.16 * 0.2 * 0.9, 0.108054 * 0.667 * 0.9) = max(0.0288, 0.0648648…) = 0.0648648… (preceded by u)

Now unfair is the most likely state at time 4, preceded by u, preceded by u, preceded by f: in other words, f u u u.
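
The same computation in code: a hedged Java sketch of the Viterbi recurrence (again reusing the illustrative arrays from the earlier fragments; the >= comparisons implement the break-ties-in-favor-of-fair rule, and you can switch the products to sums of logs as discussed above):

int n = flips.length;
double[][] best = new double[n][2];  // best[t][s]: prob. of the likeliest path ending in state s
int[][] back = new int[n][2];        // back[t][s]: predecessor state on that path
best[0][0] = 1.0;                    // coin 1 is fair
best[0][1] = 0.0;
for (int t = 1; t < n; t++) {
    int obs = flips[t].equals("h") ? 0 : 1;
    for (int s = 0; s < 2; s++) {    // s = 0 (fair) or 1 (unfair)
        double viaFair   = best[t - 1][0] * transition[0][s] * emission[s][obs];
        double viaUnfair = best[t - 1][1] * transition[1][s] * emission[s][obs];
        if (viaFair >= viaUnfair) {  // tie goes to the fair predecessor
            best[t][s] = viaFair;
            back[t][s] = 0;
        } else {
            best[t][s] = viaUnfair;
            back[t][s] = 1;
        }
    }
}
// Recover the path: pick the better final state (ties favor fair), then walk backward.
int state = (best[n - 1][0] >= best[n - 1][1]) ? 0 : 1;
String[] path = new String[n];
for (int t = n - 1; t >= 0; t--) {
    path[t] = (state == 0) ? "f" : "u";
    state = back[t][state];
}
System.out.println(String.join(" ", path));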

I believe my implementation of the forward algorithm is working because my program outputs floating-point values very similar to the examples given on the assignment page, though they are off by a small but increasing amount at each iteration. This leads my final estimate to be off by about 0.015. The error seems to be coming from a loss of precision in the Math.log method. How are you dealing with underflow/loss of precision?

For the forward algorithm, there is no underflow problem since you can normalize after each step. I don’t use logprobs there, and you shouldn’t either.

It’s only an issue in Viterbi, because you cannot normalize the values at each step. But you can use logprobs at each step without issue, since you only care about the maximum value. Or you can just ignore the problem: I won’t hit your program with an input that a numerically naive implementation can’t handle. Only deal with it if you wish to pursue excellence for its own sake.

If you really want to get technical, all floating-point arithmetic can suffer from loss of precision. Architecture isn’t a required part of the program, so you don’t necessarily learn the details of floating-point before you hit 383, but they’re nontrivial. You’d also learn them if you took a numerical computation class like those in the applied math program, or maybe in some of the other electives we offer.

When performance isn’t critical, you can use a data structure representing rationals (fractions) to perform calculations with no loss of precision. Some languages provide such a facility intrinsically (for example, many Lisps include a full numeric tower); others provide it in the standard library (like Python’s fractions module); others require you to write it yourself or use a third-party library (like, say, the Apache Commons Math BigFraction class).
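
For instance, a hedged fragment using the Apache Commons Math class mentioned above (assuming the commons-math3 jar is on the classpath; not something this assignment requires):

import org.apache.commons.math3.fraction.BigFraction;

// One exact forward-style product: 1 * 4/5 * 1/2 = 2/5, with no rounding.
BigFraction p = BigFraction.ONE
        .multiply(new BigFraction(4, 5))
        .multiply(new BigFraction(1, 2));
System.out.println(p);  // the exact rational 2/5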