CMPSCI 240
Programming Project #3: Classifying words using naive Bayes
Due Wednesday, April 15, 2009
Overview
In this assignment, you will program a naive Bayes classifier to distinguish two kinds
of words. The program will accept as input 2 files of training words (one for each
class/category/language) and 2 files of test words. For each test word, it will assign a
probability (and a most likely class), and then the program will report the
accuracy over all the test cases. The part you must write is the naive Bayes classifier;
it takes in words and their class labels and then, given a word, outputs a probability.
Simply getting the code working correctly gets you a C. For full credit, play with the system (as
directed below), and discuss your findings in the writeup.
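As a reminder of how the reported probability arises: naive Bayes treats the features as
independent given the class, so each class C is scored as
P(C) · P(feature 1 | C) · P(feature 2 | C) · ... · P(feature n | C),
and the two class scores, normalized to sum to 1, give the probability the program reports.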
Assignment
- Complete the provided code to create a classifier that uses 26 features: the presence of
each letter (A to Z) in a word.
To avoid getting any zero or infinite probabilities, smooth the probabilities by adding 1
to each count. (For more on the reasons and ways to smooth, see Wikipedia: Pseudocount, or
page 5 of this document. The second link discusses measuring not the presence of a letter
in a word, but the presence or count of a word in a document; there, tf(t, d) means the
"term frequency" of a word in a document--that is, its count.) A sketch of the add-1
estimate appears after this list.
- Using the provided data files (100-city files for training, 50-city files for testing), classify US vs.
Russian cities, Russian vs. other, and US vs. other. Attach the outputs to your writeup.
Note: the code reports accuracy, the number of test cases it gets correct, and also another
metric, mean squared error. Accuracy only cares whether the probability produced was above
or below 0.5. For mean squared error, we measure how far the probability estimate was from
the correct answer. If the classifier reports P = .9 and the answer is 1, it takes
(1 - .9)²; if the answer is 0, it takes (.9 - 0)². The average of those squared distances
is what's reported. Smaller is better. (A short sketch of this computation also appears
after this list.)
- Classify English vs. French words from Wikipedia articles. (Don't attach the whole output, just a
few lines and the accuracy.)
- For each of the 4 tasks, does the classifier do better when classifying the training data
(after it has processed all of it) or the test data? Why?
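To make the add-1 smoothing concrete, here is a minimal sketch in Java. The class and
method names are made up for illustration; they are not the names used in the provided
proj3 files:

    // Sketch of the add-1 ("pseudocount") estimate for one feature: the
    // presence of a given letter in words of a given class.
    public class SmoothingSketch {
        // wordsWithLetter: training words of this class that contain the letter
        // wordsInClass:    total training words of this class
        static double pLetterGivenClass(int wordsWithLetter, int wordsInClass) {
            // Add 1 to the count and 2 to the total (one imaginary word with
            // the letter, one without), so the estimate is never 0 or 1.
            return (wordsWithLetter + 1.0) / (wordsInClass + 2.0);
        }

        public static void main(String[] args) {
            System.out.println(pLetterGivenClass(0, 100));  // 1/102, not 0
            System.out.println(pLetterGivenClass(37, 100)); // 38/102
        }
    }

Without the +1, a letter never seen in one class's training words would give that class a
probability of exactly 0, wiping out every other feature in the product.

The mean squared error described above might be computed along these lines (again, the
names are illustrative, not the skeleton's):

    // Sketch: mean squared error of probability estimates against 0/1 answers.
    // probs[i] is the classifier's reported probability for test word i;
    // answers[i] is the correct class, 0 or 1.
    static double meanSquaredError(double[] probs, int[] answers) {
        double sum = 0.0;
        for (int i = 0; i < probs.length; i++) {
            double diff = probs[i] - answers[i];
            sum += diff * diff;
        }
        return sum / probs.length;
    }

For example, P = .9 with correct answer 1 contributes (1 - .9)² = 0.01 to the sum.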
Finally, pick any two of the following tasks:
- See if you can improve the performance (for the cities task, the language task, or another task you
create) by adding additional features; a sketch of one such feature appears after this list. (Some
ideas: 2-letter combinations, whether a word starts or ends with a given letter, other regular
expressions, whether a word has 3 or more vowels, the length of a word, the count of a letter in
a word...) Write up what you tried and show whether it helped.
- Code up a "correct" 3-way classifier. Compare its performance to the one that's provided in
CitiesClassifier.txt (which simply runs all three 1-class-vs-the-rest classifiers and picks the one
with the highest probability). What would be the accuracy for a classifier that used random
guessing (or always answered the same thing)?
- Using the original set of features (or a better one), evaluate how the accuracy on the test set
varies as a function of the amount of training data given. (If it looks as if it would help to
have yet more training data, grab some from the web and see.)
- Can we use this code to decide the language of a passage of text (without changing the training
instances)? Figure out how to implement that. Test on several passages of text in each language.
(Do the passages need to be of equal lengths for the probabilities to work out right?)
- Analyze the errors of the classifier on the original 3 tasks. Break them out into false positives
and false negatives. Is there a cutoff that works better than .5 for the boundary between
classes? Or, is there a zone in the middle where it would be better for the classifier not to guess
at all?
- Modify the classifier to be incremental. That is, after you classify an instance, learn from it
too. What kind of accuracy can we get on the training data, and what kind of improvement on the test?
- How important are the prior odds? Try changing them -- either by hard-coding that term, or by
giving the classifier training files of unequal size -- and see what happens on the test data. (Or,
make the test files have unequal size, and see if it's better to have the prior odds be 50:50 or to
have them match the ratio of the test files.)
- Think of another task this classifier could do, get data for it, and try it out.
- Try something else interesting.
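For the additional-features task above, a 2-letter-combination feature could be extracted
along these lines. This is only a sketch; the class and method names are made up and are
not part of the provided skeleton:

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: extracting 2-letter-combination ("bigram") features from a word.
    // Each bigram in the returned set could become a presence feature
    // alongside the 26 single-letter ones.
    public class BigramSketch {
        static Set<String> bigramsIn(String word) {
            Set<String> bigrams = new HashSet<String>();
            String w = word.toLowerCase();
            for (int i = 0; i + 1 < w.length(); i++) {
                bigrams.add(w.substring(i, i + 2));
            }
            return bigrams;
        }

        public static void main(String[] args) {
            System.out.println(bigramsIn("Moscow")); // e.g. [ow, co, sc, os, mo]
        }
    }

As with the letter features, each bigram's conditional probability would need the same
add-1 smoothing to avoid zeros.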
Logistics
The 3 Java files needed for this assignment are in a folder called proj3 in the cs240
directory of the edlab machines. To get to it, use this link to the ftp server, or see the
edlab main page for other ways to connect.
Submit your completed Java files along with your writeup to the cs240 folder of your own
home directory by midnight on Wednesday, April 15, 2009.