CMPSCI 240
Programming Project #3: Classifying words using naive Bayes
Due Wednesday, April 15, 2009
Overview
In this assignment, you will program a naive Bayes classifier to distinguish two kinds
of words. The program will accept as input 2 files of training words (one for each
class/category/language) and 2 files of test words. For each test word, it will assign a
probability (and a most likely class), and then the program will report the
accuracy over all the test cases. The part you must write is the naive Bayes classifier;
it takes in words and their class labels and then, given a word, outputs a probability.
Simply getting the code working correctly gets you a C. For full credit, play with the system (as
directed below), and discuss your findings in the writeup.
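As a reminder of how the reported probability arises: naive Bayes treats the features as
independent given the class, so each class C is scored as
P(C) · P(feature 1 | C) · P(feature 2 | C) · ... · P(feature n | C),
and the two class scores, normalized to sum to 1, give the probability the program reports.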
Assignment
- Complete the provided code to create a classifier that uses 26 features: the presence of
each letter (A to Z) in a word.
To avoid getting any zero or infinite probabilities, smooth the probabilities by adding 1
to each count. (For more on the reasons and ways to smooth, see Wikipedia: Pseudocount, or
page 5 of this document. The second link discusses measuring not the presence of a letter
in a word, but the presence or count of a word in a document; there, tf(t, d) means the
"term frequency" of a word in a document--that is, its count.) A sketch of the add-1
estimate appears after this list.
- Using the provided data files (100-city files for training, 50-city files for testing), classify US vs.
Russian cities, Russian vs. other, and US vs. other. Attach the outputs to your writeup.
Note: the code reports accuracy, the number of test cases it gets correct, and also another
metric, mean squared error. Accuracy only cares whether the probability produced was above
or below 0.5. For mean squared error, we measure how far the probability estimate was from
the correct answer. If the classifier reports P = .9 and the answer is 1, it takes
(1 - .9)²; if the answer is 0, it takes (.9 - 0)². The average of those squared distances
is what's reported. Smaller is better. (A short sketch of this computation also appears
after this list.)
- Classify English vs. French words from Wikipedia articles. (Don't attach the whole output, just a
few lines and the accuracy.)
- For each of the 4 tasks, does the classifier do better when classifying the training data
(after it has processed all of it) or the test data? Why?
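To make the add-1 smoothing concrete, here is a minimal sketch in Java. The class and
method names are made up for illustration; they are not the names used in the provided
proj3 files:

    // Sketch of the add-1 ("pseudocount") estimate for one feature: the
    // presence of a given letter in words of a given class.
    public class SmoothingSketch {
        // wordsWithLetter: training words of this class that contain the letter
        // wordsInClass:    total training words of this class
        static double pLetterGivenClass(int wordsWithLetter, int wordsInClass) {
            // Add 1 to the count and 2 to the total (one imaginary word with
            // the letter, one without), so the estimate is never 0 or 1.
            return (wordsWithLetter + 1.0) / (wordsInClass + 2.0);
        }

        public static void main(String[] args) {
            System.out.println(pLetterGivenClass(0, 100));  // 1/102, not 0
            System.out.println(pLetterGivenClass(37, 100)); // 38/102
        }
    }

Without the +1, a letter never seen in one class's training words would give that class a
probability of exactly 0, wiping out every other feature in the product.

The mean squared error described above might be computed along these lines (again, the
names are illustrative, not the skeleton's):

    // Sketch: mean squared error of probability estimates against 0/1 answers.
    // probs[i] is the classifier's reported probability for test word i;
    // answers[i] is the correct class, 0 or 1.
    static double meanSquaredError(double[] probs, int[] answers) {
        double sum = 0.0;
        for (int i = 0; i < probs.length; i++) {
            double diff = probs[i] - answers[i];
            sum += diff * diff;
        }
        return sum / probs.length;
    }

For example, P = .9 with correct answer 1 contributes (1 - .9)² = 0.01 to the sum.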
Finally, pick any two of the following tasks:
- See if you can improve the performance (for the cities task, the language task, or another task you
create) by adding additional features; a sketch of one such feature appears after this list. (Some
ideas: 2-letter combinations, whether a word starts or ends with a given letter, other regular
expressions, whether a word has 3 or more vowels, the length of a word, the count of a letter in
a word...) Write up what you tried and show whether it helped.
- Code up a "correct" 3-way classifier. Compare its performance to the one that's provided in
CitiesClassifier.txt (which simply runs all three 1-class-vs-the-rest classifiers and picks the one
with the highest probability). What would be the accuracy for a classifier that used random
guessing (or always answered the same thing)?
- Using the original set of features (or a better one), evaluate how the accuracy on the test set
varies as a function of the amount of training data given. (If it looks as if it would help to
have yet more training data, grab some from the web and see.)
- Can we use this code to decide the language of a passage of text (without changing the training
instances)? Figure out how to implement that. Test on several passages of text in each language.
(Do the passages need to be of equal lengths for the probabilities to work out right?)
- Analyze the errors of the classifier on the original 3 tasks. Break them out into false positives
and false negatives. Is there a cutoff that works better than .5 for the boundary between
classes? Or, is there a zone in the middle where it would be better for the classifier not to guess
at all?
- Modify the classifier to be incremental. That is, after you classify an instance, learn from it
too. What kind of accuracy can we get on the training data, and what kind of improvement on the test?
- How important are the prior odds? Try changing them -- either by hard-coding that term, or by
giving the classifier training files of unequal size -- and see what happens on the test data. (Or,
make the test files have unequal size, and see if it's better to have the prior odds be 50:50 or to
have them match the ratio of the test files.)
- Think of another task this classifier could do, get data for it, and try it out.
- Try something else interesting.
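For the additional-features task above, a 2-letter-combination feature could be extracted
along these lines. This is only a sketch; the class and method names are made up and are
not part of the provided skeleton:

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: extracting 2-letter-combination ("bigram") features from a word.
    // Each bigram in the returned set could become a presence feature
    // alongside the 26 single-letter ones.
    public class BigramSketch {
        static Set<String> bigramsIn(String word) {
            Set<String> bigrams = new HashSet<String>();
            String w = word.toLowerCase();
            for (int i = 0; i + 1 < w.length(); i++) {
                bigrams.add(w.substring(i, i + 2));
            }
            return bigrams;
        }

        public static void main(String[] args) {
            System.out.println(bigramsIn("Moscow")); // e.g. [ow, co, sc, os, mo]
        }
    }

As with the letter features, each bigram's conditional probability would need the same
add-1 smoothing to avoid zeros.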
Logistics
The 3 Java files needed for this assignment are in a folder called proj3 in the cs240
directory of the edlab machines. To get to it, use this link to the ftp server, or see the
edlab main page for other ways to connect.
Submit your completed Java files along with your writeup to the cs240 folder of your own
home directory by midnight on Wednesday, April 15, 2009.