## CMPSCI 585 : Introduction to Natural Language Processing Fall 2007 Homework #6: Maximum Entropy Classifier

In this homework assignment you will implement and experiment with a maximum entropy classifier, and write a short report about your experiences and findings.

You may begin with source code provided by Prof. McCallum, or start from scratch if you prefer. If you begin with the provided code, the main task is to implement the gradient function, then train and test your classifier in various ways you find interesting. See the tasks below.

See the class slides. For significantly more detail, you might also want to see http://www.cs.berkeley.edu/~klein/papers/maxent-tutorial-slides.pdf. More pointers are available at http://homepages.inf.ed.ac.uk/s0450736/maxent.html.

Please re-check this page, as well as the homework column of the course Web site syllabus, for any updates and clarifications to this assignment.

### Python and Data Infrastructure available

You may begin with maxent.py and optimize.py, which are available at http://www.cs.umass.edu/~mccallum/courses/inlp2007/code.

The package optimize.py depends on the Python Numeric package, which you will also have to install if you don't have it already. (Numeric is deprecated in favor of NumPy, but the only version of optimize.py we could find depends on the old Numeric.) The package optimize.py also imports MLab, but note that this is provided by the Numeric installation.

As with HW#4, we are providing training and testing data in the form of spam and ham email, but you are welcome to find your own data.

• (Required part.) Finish an implementation of a maximum entropy classifier to classify documents. You should provide a function (and supporting functions) that takes a list of directories (one per class) containing textual documents, and returns a maximum entropy classifier trained to maximize the probabilities of the true class labels conditioned on the words. The two main pieces of missing functionality in the provided code are (1) the gradient function, and (2) the Gaussian prior on parameters in the value function. I do not guarantee, however, that the remainder of the code I supply is bug-free; fixing any bugs that may be there is part of the assignment. If you would rather not work with my possibly buggy code, you are welcome to write code from scratch. Demonstrate your maximum entropy classifier on some collection of labeled documents, and show off its interesting properties. How does its accuracy compare with your naive Bayes classifier? Do you notice any patterns in the relative differences among the parameters? Write a report about your findings.
• One reason that train_maxent is quite slow is that it re-parses the training data each time it calculates the value or the gradient. Rewrite it so that it is more efficient.
• Use more complex, non-independent features, such as word bigrams, or just selected word bigrams. Does accuracy go up? (You might need to do some feature selection to make it go up.) Try such features in naive Bayes also. Which classifier is better able to handle them?
• Use your maximum entropy classifier to build a sliding window part-of-speech tagger.
• On Thursday you will learn about linear-chain conditional random fields. Implement them and try them on part-of-speech tagging, named entity recognition, or some other tagging task.
• Implement a couple of feature selection methods, and measure their impact on your maximum entropy classifier. Make up some fancy features yourself and add them.
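For the required part, the value being maximized is the penalized conditional log-likelihood, and its gradient per feature weight is the familiar "empirical counts minus expected counts minus lambda/sigma^2" form. The sketch below illustrates that computation with NumPy; it uses its own dense-matrix interface (a documents-by-features count matrix and a classes-by-features weight matrix), which is an assumption for illustration and does not match the exact signatures in the provided maxent.py.

```python
import numpy as np

def maxent_value_and_gradient(lambdas, feats, labels, sigma2=10.0):
    # lambdas: (num_classes, num_feats) weights (illustrative interface)
    # feats:   (num_docs, num_feats) per-document feature counts
    # labels:  (num_docs,) true class indices
    scores = feats @ lambdas.T                    # (docs, classes)
    scores -= scores.max(axis=1, keepdims=True)   # stabilize exp()
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y | x_i)

    docs = np.arange(len(labels))
    # Penalized conditional log-likelihood: the "value" to maximize,
    # with a Gaussian prior term -sum_k lambda_k^2 / (2 sigma^2).
    value = np.log(probs[docs, labels]).sum() - (lambdas ** 2).sum() / (2 * sigma2)

    # Gradient = empirical counts - model-expected counts - lambda / sigma^2.
    empirical = np.zeros_like(lambdas)
    np.add.at(empirical, labels, feats)           # sum feature counts per true class
    expected = probs.T @ feats                    # (classes, feats)
    gradient = empirical - expected - lambdas / sigma2
    return value, gradient
```

A gradient-ascent (or conjugate-gradient) optimizer can then step the weights in the direction of this gradient until the value stops improving.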
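For the bigram task, one simple approach is to name each adjacent word pair as its own feature and count it like any other "word." The helper below is a hypothetical sketch (not part of the provided maxent.py); its output can be merged with unigram counts before training.

```python
from collections import Counter

def bigram_features(tokens):
    # Hypothetical helper: turn each adjacent word pair into a
    # feature name like "buy_cheap" and count its occurrences.
    return Counter(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
```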
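For the sliding-window tagger, each word position becomes a classification instance whose features describe the surrounding words. The sketch below shows one possible feature extractor; the feature-name scheme and the <S>/</S> padding tokens are illustrative assumptions, not part of the provided code.

```python
def window_features(words, i, w=2):
    # Features for tagging position i from a +/- w word window.
    # <S> and </S> pad positions that fall off either sentence end.
    feats = [f"word={words[i]}"]
    for d in range(1, w + 1):
        left = words[i - d] if i - d >= 0 else "<S>"
        right = words[i + d] if i + d < len(words) else "</S>"
        feats.append(f"word-{d}={left}")
        feats.append(f"word+{d}={right}")
    return feats
```

The classifier then predicts a tag for each position from its window features, sliding one word at a time across the sentence.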

### What to hand in, and how

The homework should be emailed to cs585-staff@cs.umass.edu.

In addition to writing your Python program, write a short report about your experiences. Feel free to suggest additional things you might like to do next that build on what you've done so far. This report should be clear and well-written, but needn't be long--one page is fine. Also, no need for fancy formatting; in fact, we prefer to receive the report as the body of your email. Your program can also be included in the body, or included as an email attachment.