CMPSCI 585 : Introduction to Natural Language Processing
Fall 2007
Homework #4: Language Modeling


In this homework assignment you will implement and experiment with some form of statistical language modeling, and write a short report about your experiences and findings. There are two basic choices: either a naive Bayes text classifier or an n-gram language model.

The naive Bayes text classifier takes a collection of words as input and predicts a category associated with this text. For example, it could take an email message, and predict if the message is Spam or NotSpam. It could take postings to an online discussion board, and predict to which discussion board it most appropriately belongs. It could take a word and the words surrounding it in a sentence, and predict the word's part of speech. In all cases, you will need labeled training data (in which the category is provided) in order to estimate the parameters of your model.
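As a rough illustration of the idea above, here is a minimal sketch of a naive Bayes classifier with add-one smoothing. It is not the course's provided code: the function names (train_naive_bayes, classify) and the four-message toy training set are made up for this example, and a real experiment would read the labeled email data instead.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents and words per category.

    labeled_docs is a list of (list_of_words, label) pairs.
    """
    doc_counts = Counter()              # number of training documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for words, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab


def classify(words, model):
    """Return the label with the highest log posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # log prior P(label)
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # assumption: ignore words never seen in training
            # add-one (Laplace) smoothed estimate of P(w | label)
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, only to show the calling convention.
toy_training = [
    ("win cash prize now".split(), "Spam"),
    ("cheap cash offer".split(), "Spam"),
    ("meeting agenda for monday".split(), "NotSpam"),
    ("lecture notes for class".split(), "NotSpam"),
]
model = train_naive_bayes(toy_training)
print(classify("cash prize offer".split(), model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from driving a category's probability to zero.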

The n-gram language model predicts the next word given the previous n-1 words. This model could generate text (as shown in the examples in class), or could be used to assess which sentence out of several (generated by OCR, machine translation, etc.) is most likely. The parameters of this model are also estimated from data.
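To make both uses concrete, here is a minimal sketch of a bigram (n = 2) model that scores candidate sentences and generates text. The names train_bigram, log_prob and generate, the boundary markers <s> and </s>, the add-one smoothing, and the three-sentence toy corpus are all assumptions made for this illustration, not part of the course materials.

```python
import math
import random
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Collect bigram counts; <s> and </s> mark sentence boundaries."""
    bigram_counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[prev][word] += 1
    return bigram_counts, vocab


def log_prob(sentence, bigram_counts, vocab):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        followers = bigram_counts.get(prev, Counter())
        context_total = sum(followers.values())
        total += math.log((followers[word] + 1) / (context_total + len(vocab)))
    return total


def generate(bigram_counts, rng, max_words=20):
    """Sample words one at a time from the unsmoothed bigram counts."""
    word, output = "<s>", []
    for _ in range(max_words):
        followers = bigram_counts[word]
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)


# Hypothetical toy corpus, only to show the calling convention.
corpus = [
    "the dog chased the cat",
    "the cat sat on the mat",
    "the dog sat on the rug",
]
counts, vocab = train_bigram(corpus)
candidates = ["the dog sat on the mat", "dog the on sat mat the"]
best = max(candidates, key=lambda s: log_prob(s, counts, vocab))
print(best)
print(generate(counts, random.Random(0)))
```

Scoring uses smoothed estimates so that an unseen bigram does not give a candidate zero probability, while generation samples from the raw counts so that it only ever produces word pairs observed in training.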

There are several suggested tasks below. You should do at least one task. As usual, you need not be limited by the suggestions in these extra bullets. You are free to come up with your own tasks.

Please re-check this page, as well as the homework column of the syllabus on the course Web site, for any updates and clarifications to this assignment.

Python and Data Infrastructure available

You may begin with, or, which are available at You are also welcome to develop your own Python programs from scratch, if you prefer.

There is a data set of Spam and NotSpam email at


What to hand in, and how

The homework should be emailed to

In addition to writing your Python program, write a short report about your experiences. Feel free to suggest additional things you might like to do next that build on what you've done so far. This report should be clear and well written, but it needn't be long; one page is fine. Also, there is no need for fancy formatting. In fact, we prefer to receive this report as the body of your email. Your program can also be included in the body, or sent as an email attachment.


The assignment will be graded for (a) correctness of your implementation, (b) quality/clarity of your written report, and (c) creativity, effort and success in the task(s) you choose.


Feel free to ask! Send email to, or if you'd like your classmates to be able to help answer your question, use