Introduction to Natural Language Processing

Computer Science 585 — Fall 2009

Homework #2: Naive Bayes Classification and Hidden Markov Models

This assignment is due at the start of class on Thursday, October 15. Mail it to dasmith@cs.umass.edu.

Language Identification

In this exercise, you will train probabilistic language models to distinguish between words in different languages. Rather than looking up whole words in a dictionary, you will build models of character sequences so you can make a guess about the language of unseen words. You will need to use NLTK and the Universal Declaration of Human Rights corpus.

Download the file udhr.py to your working directory. When you start python, run:

>>> execfile('udhr.py')
>>> dir()
['__builtins__', '__doc__', '__name__', 'english', 'english_dev', 'english_test', 'english_train', 'french', 'french_dev', 'french_test', 'french_train', 'italian', 'italian_dev', 'italian_test', 'italian_train', 'nltk', 'spanish', 'spanish_dev', 'spanish_test', 'spanish_train', 'udhr']
    
This shows that there are now training, development, and test samples for English, French, Italian, and Spanish. The training and development data are continuous strings of words; the test data are lists of individual word strings for easier testing.

Build unigram, bigram, and trigram character models for all four languages. You may find it convenient to use the NLTK classes FreqDist and ConditionalFreqDist, described in chapter 2 of the NLTK book. If your models have free parameters (smoothing weights, for example), tune them on the development data.
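
One way to organize the counts, sketched here for the bigram case only (unigrams and trigrams are analogous, conditioning on zero or two previous characters), is a ConditionalFreqDist indexed by the previous character. The function name train_char_bigrams is just an illustration, not a required interface:

import nltk

def train_char_bigrams(text):
    """Count character bigrams in a ConditionalFreqDist: cfd[prev][cur] = count."""
    # The training string already contains spaces between words, so the
    # word-boundary symbol (space) is counted automatically.
    cfd = nltk.ConditionalFreqDist()
    for prev, cur in zip(text, text[1:]):
        cfd[prev][cur] += 1        # older NLTK releases used cfd[prev].inc(cur)
    return cfd

english_bigrams = train_char_bigrams(english_train)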

For each word in the English test set, calculate the probability assigned to that string by the English vs. French unigram models, E vs. F bigram models, and E vs. F trigram models. Don't forget to include the implicit space before and after each word. The number of times each model predicts English, divided by the length of the word list (1000), is the accuracy of that model. Report the accuracies of the 1-, 2-, and 3-gram models.
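
A rough sketch of the bigram comparison, reusing train_char_bigrams from above and the variables defined by udhr.py. The add-one smoothing and the shared vocabulary size V here are placeholder choices, not requirements; tune whatever smoothing you use on the development data:

import math

V = len(set(english_train + french_train))   # shared character vocabulary (one choice)

def word_logprob(word, cfd, vocab_size):
    """Add-one-smoothed log probability of a word under a character bigram model."""
    padded = ' ' + word + ' '          # the implicit space before and after the word
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        fd = cfd[prev]
        logp += math.log((fd[cur] + 1.0) / (fd.N() + vocab_size))
    return logp

french_bigrams = train_char_bigrams(french_train)
correct = sum(1 for w in english_test
              if word_logprob(w, english_bigrams, V) > word_logprob(w, french_bigrams, V))
print(float(correct) / len(english_test))    # bigram-model accuracy on the English test set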

Perform the same experiment for Spanish vs. Italian. Which language pair is harder to distinguish?

Learning HMMs with EM

For this exercise, you will implement the dynamic programs that compute forward, backward, and Viterbi probabilities, and the EM algorithm for parameter estimation. You will use just the plain English text in the english_train variable.

Learn the parameters of a simple two-state HMM from the English training data. The hidden Markov process should be fully connected: each state can transition to the other state and back to itself.

Each of the states, 1 and 2, should be able to emit any character in the alphabet of 2 * 26 letters plus 4 punctuation and space characters, though with different probabilities. Randomly initialize the emission and transition probabilities to a uniform distribution plus a little noise. Without noise, EM will not be able to move in any direction: if the states start with identical parameters, they collect identical expected counts and remain identical after every re-estimation step.
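
One possible initialization is sketched below. The names ALPHABET, N_STATES, init, trans, and emit are assumptions carried through the later sketches, and deriving the alphabet from the training text is simply a convenient way to get exactly the characters that actually occur:

import random

ALPHABET = sorted(set(english_train))   # should contain the letters plus punctuation and space
N_STATES = 2

def noisy_uniform(n):
    """A uniform distribution over n outcomes, perturbed by a little noise and renormalized."""
    weights = [1.0 + random.uniform(-0.01, 0.01) for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

init = noisy_uniform(N_STATES)                               # P(first state)
trans = [noisy_uniform(N_STATES) for _ in range(N_STATES)]   # trans[i][j] = P(j | i)
emit = [dict(zip(ALPHABET, noisy_uniform(len(ALPHABET))))    # emit[i][c] = P(c | i)
        for _ in range(N_STATES)]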

Run the forward and backward procedures so that you can get the expected counts for each kind of transition and emission. Then re-estimate the probabilities of each using maximum-likelihood count-and-normalize. Stop when the probabilities at the current iteration are identical to those of the previous one, or when you reach 500 iterations of EM.
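
One way to structure a single EM iteration is sketched below, assuming the init/trans/emit structures from the initialization sketch and a states index list. It rescales the forward and backward probabilities at every position so they do not underflow on a long character sequence, and it returns the data log-likelihood, which is handy for watching convergence:

import math

def forward_backward_step(x, init, trans, emit, states):
    """One EM iteration: scaled forward-backward, then count-and-normalize."""
    T = len(x)
    n = len(states)

    # Forward pass, rescaling at each position to avoid underflow.
    alpha = [[0.0] * n for _ in range(T)]
    scale = [0.0] * T
    alpha[0] = [init[i] * emit[i][x[0]] for i in states]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        alpha[t] = [emit[j][x[t]] * sum(alpha[t-1][i] * trans[i][j] for i in states)
                    for j in states]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, reusing the same scaling factors.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(trans[i][j] * emit[j][x[t+1]] * beta[t+1][j] for j in states)
                   / scale[t+1]
                   for i in states]

    # Accumulate expected counts for emissions, transitions, and the initial state.
    emit_counts = [dict((c, 0.0) for c in emit[i]) for i in states]
    trans_counts = [[0.0] * n for _ in states]
    new_init = None
    for t in range(T):
        gamma = [alpha[t][i] * beta[t][i] for i in states]
        z = sum(gamma)
        gamma = [g / z for g in gamma]
        if t == 0:
            new_init = gamma
        for i in states:
            emit_counts[i][x[t]] += gamma[i]
        if t < T - 1:
            xi = [[alpha[t][i] * trans[i][j] * emit[j][x[t+1]] * beta[t+1][j]
                   for j in states] for i in states]
            z = sum(sum(row) for row in xi)
            for i in states:
                for j in states:
                    trans_counts[i][j] += xi[i][j] / z

    # Maximum-likelihood count-and-normalize re-estimation.
    new_emit = [dict((c, emit_counts[i][c] / sum(emit_counts[i].values()))
                     for c in emit_counts[i]) for i in states]
    new_trans = [[trans_counts[i][j] / sum(trans_counts[i]) for j in states]
                 for i in states]
    loglik = sum(math.log(s) for s in scale)   # log P(x) under the old parameters
    return new_init, new_trans, new_emit, loglik

states = list(range(N_STATES))
for iteration in range(500):
    new_init, new_trans, new_emit, loglik = forward_backward_step(
        english_train, init, trans, emit, states)
    if new_trans == trans and new_emit == emit:   # the stopping rule from the assignment
        break
    init, trans, emit = new_init, new_trans, new_emit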

After EM training converges, implement and run the Viterbi algorithm to assign a hidden state to each character in the training sequence. What patterns, if any, do you observe in this assignment?
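
A log-space Viterbi sketch, again assuming the init/trans/emit and states names from the EM sketch above (add a small floor before taking logs if any probability has gone to exactly zero):

import math

def viterbi(x, init, trans, emit, states):
    """Return the most probable hidden-state sequence for the character string x."""
    T = len(x)
    logp = [[0.0] * len(states) for _ in range(T)]   # best log-probability ending in each state
    back = [[0] * len(states) for _ in range(T)]     # backpointers
    logp[0] = [math.log(init[i]) + math.log(emit[i][x[0]]) for i in states]
    for t in range(1, T):
        for j in states:
            best_i = max(states, key=lambda i: logp[t-1][i] + math.log(trans[i][j]))
            back[t][j] = best_i
            logp[t][j] = (logp[t-1][best_i] + math.log(trans[best_i][j])
                          + math.log(emit[j][x[t]]))
    # Trace backpointers from the best final state.
    path = [max(states, key=lambda i: logp[T-1][i])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

state_sequence = viterbi(english_train, init, trans, emit, states)
# One way to look for patterns: which characters does each state tend to emit?
# sorted(set(c for c, s in zip(english_train, state_sequence) if s == 0))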

Train similar four-state and eight-state HMMs in the same way. What patterns do you observe in the Viterbi output?