CMPSCI 591N : Computational Linguistics
Spring 2006
Homework #5: Part of Speech Tagging with HMMs
Out: Tuesday March 28, 2006
Due: Tuesday April 4, 2006, by 11:59pm, by email to compling@cs.umass.edu
In this homework assignment you will implement and experiment with a hidden Markov model (HMM) part-of-speech tagger, and write a short report about your experiences and findings.
After being trained, the hidden Markov model should take a sequence of words as input and produce a sequence of part-of-speech tags as output. As you did with the document classification (spam vs non-spam) exercise, you should report the accuracy of your HMM, that is, the percentage of tags predicted correctly.
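For concreteness, tag accuracy can be computed with a few lines of Python; this is a minimal sketch, and the function name and arguments are illustrative rather than taken from hmm.py:

    def accuracy(predicted_tags, true_tags):
        """Fraction of positions where the predicted tag matches the true one."""
        assert len(predicted_tags) == len(true_tags)
        correct = sum(1 for p, t in zip(predicted_tags, true_tags) if p == t)
        return correct / float(len(true_tags))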
Everyone should estimate the parameters of an HMM from counts (as we did in class on the board), and implement Viterbi, as described in the first bullet below. The additional bullets below describe further optional exercises. As usual, you need not be limited by these suggestions; you are free to come up with your own tasks.
Please re-check this page, as well as the homework column of the course Web site syllabus, for any updates and clarifications to this assignment.
Python and Data Infrastructure available
You may begin with hmm.py, which is available at http://www.cs.umass.edu/~mccallum/courses/cl2006/code. You are also welcome to develop your own Python programs from scratch, if you prefer.
For training data, you will use the same POS-tagged Wall Street Journal data that we used previously for the regular-expression exercises. It can be found at
http://www.cs.umass.edu/~mccallum/courses/cl2006/data/wsj15-18.pos
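If you load the file yourself, the following sketch may help. It assumes each line contains whitespace-separated word/TAG tokens, with one sentence per line; check the actual file, since this format is an assumption, not a guarantee:

    def read_tagged_sentences(path):
        """Read word/TAG tokens, one sentence per line (assumed format)."""
        sentences = []
        for line in open(path):
            tokens = line.split()
            if not tokens:
                continue
            pairs = []
            for token in tokens:
                if '/' not in token:
                    continue  # skip any stray non-token material
                # rsplit guards against words that themselves contain '/'
                word, tag = token.rsplit('/', 1)
                pairs.append((word, tag))
            sentences.append(pairs)
        return sentences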
Tasks
- (Required part.) Finish the implementation of the Python file hmm.py, so that it estimates HMM transition and emission probabilities using Laplace smoothing. Implement the Viterbi dynamic programming algorithm. This should be in the form of a function that takes a sequence of words as input and returns the most likely sequence of part-of-speech tags as output (and also, optionally, the probability of that sequence). Write additional code for measuring the accuracy of your tagger--being sure to test on data different from the data you trained on. Report on the accuracy and other insights. A sketch of the estimation and decoding steps appears after this list. (The remaining bullets are optional.)
- Try some of the different smoothing methods described in the textbook. See if they improve accuracy.
- Try changing the amount of training data. How quickly does the accuracy go down as you reduce the training data?
- Implement an augmented HMM that not only conditions on the current state to obtain the probability of the current word, but also conditions on the previous word; see the emission sketch after this list. (This one is more challenging than the other optional bullets.)
- Try reducing the number of part-of-speech tags, and measure the resulting accuracy. For example, collapse all the verb tags into one, all the noun tags into one, and so on; a small mapping function is sketched after this list.
- Sort the test sentences by their confidence. (One version of this would take the probability of the most likely path, take the Nth root of it, where N is the length of the sequence, and sort by this value; see the sketch after this list.) Do the low-confidence sentences tend to have more errors in them?
- Finish ngrams.py in order to turn the dictionary of word bigram counts into a dictionary of word bigram probabilities; a sketch appears after this list. Train it, and use it to generate some word sequences (with a random number generator sampling from the bigram distribution). Try this on several data sets and comment. For example, you could train it on a combination of Shakespeare and Mark Twain, and see if you observe "chunks of different language styles, with distinct phase changes". You could also train it on Wall Street Journal text and use it to evaluate a machine translation system: find an online machine translation system (e.g. http://www-306.ibm.com/software/pervasive/tech/demos/translation.shtml); translate a sentence from English to another language and back; then give the original sentence and the round-trip translation to your n-gram model. Which sentence gets higher likelihood? What if you normalized by length (taking the Nth root)?
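For the required part, here is a minimal sketch of the estimation and decoding steps: Laplace (add-alpha) smoothing of counts, and Viterbi in log space to avoid underflow. All names, the '<s>' start symbol, and the '<UNK>' unknown-word convention are illustrative assumptions, not part of hmm.py. Note that setting alpha to values other than 1 also gives you the add-k variant relevant to the smoothing bullet.

    import math
    from collections import defaultdict

    def train_hmm(tagged_sentences, alpha=1.0):
        """Estimate Laplace-smoothed transition and emission distributions."""
        transition_counts = defaultdict(lambda: defaultdict(int))
        emission_counts = defaultdict(lambda: defaultdict(int))
        tags, vocab = set(), set()
        for sentence in tagged_sentences:
            prev = '<s>'  # illustrative start symbol
            for word, tag in sentence:
                transition_counts[prev][tag] += 1
                emission_counts[tag][word] += 1
                tags.add(tag)
                vocab.add(word)
                prev = tag
        vocab.add('<UNK>')  # reserve smoothed mass for unseen words
        def smooth(counts, support):
            total = sum(counts.values()) + alpha * len(support)
            return dict((x, (counts[x] + alpha) / total) for x in support)
        transitions = dict((s, smooth(transition_counts[s], tags))
                           for s in list(tags) + ['<s>'])
        emissions = dict((t, smooth(emission_counts[t], vocab)) for t in tags)
        return transitions, emissions, tags, vocab

    def viterbi(words, transitions, emissions, tags, vocab):
        """Return the most likely tag sequence and its log-probability."""
        words = [w if w in vocab else '<UNK>' for w in words]
        # delta[i][t]: best log-probability of tagging words[:i+1] ending in t
        delta = [{}]
        backpointer = [{}]
        for t in tags:
            delta[0][t] = (math.log(transitions['<s>'][t])
                           + math.log(emissions[t][words[0]]))
        for i in range(1, len(words)):
            delta.append({})
            backpointer.append({})
            for t in tags:
                best_prev, best_score = None, float('-inf')
                for s in tags:
                    score = delta[i - 1][s] + math.log(transitions[s][t])
                    if score > best_score:
                        best_prev, best_score = s, score
                delta[i][t] = best_score + math.log(emissions[t][words[i]])
                backpointer[i][t] = best_prev
        last = max(delta[-1], key=lambda t: delta[-1][t])
        path = [last]
        for i in range(len(words) - 1, 0, -1):  # trace back
            path.append(backpointer[i][path[-1]])
        path.reverse()
        return path, delta[-1][last]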
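For the augmented HMM, only the emission model changes: the probability of the current word is conditioned on both the current tag and the previous word. A sketch of the count update, reusing the illustrative names above:

    # Inside the training loop, replace the emission update with one
    # that conditions on the (tag, previous word) pair:
    prev_word = '<s>'
    for word, tag in sentence:
        emission_counts[(tag, prev_word)][word] += 1
        prev_word = word
    # Smoothing matters even more here: most (tag, previous-word) pairs
    # are rare, so consider backing off to the plain P(word | tag)
    # estimate when a pair is unseen.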
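For the tag-collapsing bullet, a small mapping over Penn Treebank tag prefixes suffices (the coarse tag names here are arbitrary):

    def collapse(tag):
        """Collapse fine-grained Penn Treebank tags into coarse classes."""
        if tag.startswith('VB'):  # VB, VBD, VBG, VBN, VBP, VBZ
            return 'V'
        if tag.startswith('NN'):  # NN, NNS, NNP, NNPS
            return 'N'
        if tag.startswith('JJ'):  # JJ, JJR, JJS
            return 'J'
        return tag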
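For the confidence bullet: since the Viterbi sketch above returns a log-probability, the Nth root of the path probability is exp(log p / N), i.e. the per-word geometric mean:

    import math

    def confidence(log_prob, n):
        """Nth root of the path probability, computed in log space."""
        return math.exp(log_prob / float(n))

    # Sort test sentences from least to most confident, assuming `results`
    # holds (words, log_prob) pairs produced by viterbi() above:
    # ranked = sorted(results, key=lambda r: confidence(r[1], len(r[0])))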
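For the ngrams.py bullet, here is one way to normalize bigram counts and sample from the resulting distributions. The nested-dictionary shape and the '<s>'/'</s>' boundary symbols are assumptions about ngrams.py, not facts about it:

    import random

    def bigram_probs(bigram_counts):
        """Turn {prev: {word: count}} into {prev: {word: probability}}."""
        probs = {}
        for prev, counts in bigram_counts.items():
            total = float(sum(counts.values()))
            probs[prev] = dict((w, c / total) for w, c in counts.items())
        return probs

    def sample_next(dist):
        """Draw one word from a {word: probability} distribution."""
        r = random.random()
        cumulative = 0.0
        for word, p in dist.items():
            cumulative += p
            if r < cumulative:
                return word
        return word  # guard against floating-point rounding

    def generate(probs, start='<s>', max_len=20):
        """Sample a word sequence from the bigram model."""
        words, prev = [], start
        for _ in range(max_len):
            if prev not in probs:
                break  # no observed continuation for this context
            nxt = sample_next(probs[prev])
            if nxt == '</s>':
                break
            words.append(nxt)
            prev = nxt
        return words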
What to hand in, and how
The homework should be emailed to compling@cs.umass.edu before 11:59pm on Tuesday April 4, 2006.
In addition to writing your Python program, write a short report about your experiences. Feel free to suggest additional things you might like to do next that build on what you've done so far. This report should be clear and well-written, but it needn't be long--one page is fine. Also, there is no need for fancy formatting. In fact, we prefer to receive this report as the body of your email. Your program can also be included in the body, or attached to the email.
Grading
The assignment will be graded for (a) correctness of your implementation, (b) quality/clarity of your written report, and (c) creativity, effort, and success in the task(s) you choose.
Questions?
Feel free to ask! Send email to compling@cs.umass.edu, or if you'd like your classmates to be able to help answer your question, use compling-class@cs.umass.edu.