In this assignment, you will program a naive Bayes classifier to distinguish two kinds of strings. The program will accept as input 2 files of training strings (one for each class/category/language) and 2 files of test strings. For each test string, it will assign a probability (and a most likely class), and then the program will report the accuracy over all the test cases. The part you must write is the naive Bayes classifier; it takes in strings and their class labels and then, given a string, outputs a probability.
This assignment consists of four parts: (1) implement the naive Bayes classifier, (2 & 3) see how your classifier performs on text examples, and (4) choose 2 interesting ways to extend the project. Start early on part 1, and make sure that you budget time for all four parts!
Simply getting the code working correctly gets you a C. For full credit, play with the system (as directed below), and discuss your findings in a writeup.
Complete the provided code to create a classifier that uses 26 features: the presence of each letter (A to Z) in a string.
You need to implement Feature.java and NaiveBayesClassifier.java before this code will work. You can run the classifier using the main method in TextClassifier.java (which has been implemented for you).
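The skeleton's exact interfaces are not reproduced here, but the computation those two files implement can be sketched in a self-contained way. The class layout, method names, and two-class restriction below are illustrative assumptions, not the skeleton's API:

```java
// Minimal two-class sketch of the naive Bayes computation over 26
// letter-presence features. The skeleton's Feature/NaiveBayesClassifier
// interfaces may differ; this only shows the math.
public class NaiveBayesSketch {
    // counts[c][i]: number of class-c training strings containing letter i
    private final int[][] counts = new int[2][26];
    private final int[] totals = new int[2];   // training strings per class

    // Record one labeled training string (label is 0 or 1).
    public void train(String s, int label) {
        totals[label]++;
        String lower = s.toLowerCase();
        for (int i = 0; i < 26; i++)
            if (lower.indexOf('a' + i) >= 0) counts[label][i]++;
    }

    // P(class 0 | s) via Bayes' rule over the 26 presence features,
    // with add-1 smoothing so no feature probability is 0 or 1.
    public double probClass0(String s) {
        String lower = s.toLowerCase();
        double[] logScore = new double[2];
        for (int c = 0; c < 2; c++) {
            // log prior, then a log-likelihood term per letter
            logScore[c] = Math.log((double) totals[c] / (totals[0] + totals[1]));
            for (int i = 0; i < 26; i++) {
                double p = (counts[c][i] + 1.0) / (totals[c] + 2.0); // smoothed
                boolean present = lower.indexOf('a' + i) >= 0;
                logScore[c] += Math.log(present ? p : 1.0 - p);
            }
        }
        // Normalize the two log scores into a probability for class 0.
        double m = Math.max(logScore[0], logScore[1]);
        double e0 = Math.exp(logScore[0] - m), e1 = Math.exp(logScore[1] - m);
        return e0 / (e0 + e1);
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("moscow", 0);
        nb.train("dallas", 1);
        System.out.println(nb.probClass0("omsk"));
    }
}
```

Working in log space, as above, avoids underflow when multiplying many small probabilities together.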
To avoid getting any zero or infinite probabilities, smooth the probabilities by adding 1 to each count. (For more on reasons and ways to smooth, see wikipedia: Pseudocount or page 5 of this document. Note that the second link measures not the presence of a letter in a string, but the presence or count of a word in a document; tf(t, d) means the "term frequency" of a word in a document--that is, its count.)
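The effect of add-1 smoothing on a single binary feature can be shown in a couple of lines; the denominator's +2 below is one common convention for a two-outcome (present/absent) feature, chosen here for illustration:

```java
public class SmoothingDemo {
    // Add-1 (Laplace) smoothed estimate of P(feature present | class):
    // (count + 1) / (total + 2). The +2 reflects the two possible
    // outcomes of a binary feature, present and absent.
    static double smoothed(int count, int total) {
        return (count + 1.0) / (total + 2.0);
    }

    public static void main(String[] args) {
        // A letter seen in 0 of 10 training strings: the raw estimate
        // would be 0, but the smoothed one is 1/12 -- small, not zero.
        System.out.println(smoothed(0, 10));
        System.out.println(smoothed(10, 10)); // likewise never exactly 1
    }
}
```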
Using the provided data files (100-city files for training, 50 for testing), classify US vs. Russian cities, Russian vs. other, and US vs. other. Attach the outputs to your writeup. (Run java TextClassifier with no arguments for some additional instructions.)
Note: the code reports accuracy, the number of test cases it classifies correctly, along with a second metric, mean squared error. Accuracy only cares whether the probability produced was above or below 0.5. Mean squared error measures how far each probability estimate was from the correct answer: if the classifier reports P = .9 and the answer is 1, it contributes (1 - .9)^2; if the answer is 0, it contributes (.9 - 0)^2. The average of those squared distances is what's reported. Smaller is better.
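The two metrics can be computed as below; the method names and the convention that each probability is P(class 1) are assumptions for illustration, not the skeleton's actual reporting code:

```java
public class Metrics {
    // Accuracy: fraction of test cases where the predicted probability
    // of class 1 lands on the correct side of 0.5.
    static double accuracy(double[] probs, int[] labels) {
        int correct = 0;
        for (int i = 0; i < probs.length; i++) {
            int predicted = probs[i] > 0.5 ? 1 : 0;
            if (predicted == labels[i]) correct++;
        }
        return (double) correct / probs.length;
    }

    // Mean squared error: average of (label - probability)^2. Unlike
    // accuracy, it rewards estimates that are close to the true answer,
    // not merely on the right side of 0.5.
    static double mse(double[] probs, int[] labels) {
        double sum = 0;
        for (int i = 0; i < probs.length; i++) {
            double d = labels[i] - probs[i];
            sum += d * d;
        }
        return sum / probs.length;
    }

    public static void main(String[] args) {
        // P = .9 twice: once the answer is 1 (right), once 0 (wrong).
        double[] probs = {0.9, 0.9};
        int[] labels = {1, 0};
        System.out.println(accuracy(probs, labels));
        System.out.println(mse(probs, labels));
    }
}
```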
Twitter provides a search ability [twitter.com] which supports filtering recent "tweets" by subject, language, and time (among other options). You are provided with a script that fetches recent tweets from Twitter, preprocesses them for the classifier, and organizes them into training and test sets. You specify the topic and languages, and the script puts together training and test sets for each language on that topic.
In the tweets directory you'll find some training and test sets of tweets organized according to language and topic. Using the hasselhoff data sets, classify English vs. German words on the topic of Hasselhoff. Report the accuracy.
Using the linux data sets, classify German vs. Finnish tweets on the topic of Linux. Report the accuracy.
For the hasselhoff and linux data sets, does the classifier do better when classifying the training data (after processing it all) or test data? Why?
We could also classify by topic. Using the h1n1 and linux data sets, classify English tweets on H1N1 vs. Linux. Report the accuracy.
Finally, pick a search topic and three languages and download some new tweets from Twitter. You can use the getTweets.rb ruby script to do this. For example, this will search for tweets on Star Trek in English, German, French, and Spanish:

$ ruby getTweets.rb startrek en de fr es

Pick a different topic and set of 3 languages, and classify these with the multi-way classifier (which is three iterations of 1-vs-other). Tweets are limited to 140 characters, but many tweets may be shorter. Do the passages need to be of equal length for the probabilities to work out right?
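The 1-vs-other scheme can be sketched as follows, assuming each round produces a probability of "this language vs. the pooled others"; the method name and array layout are illustrative assumptions, not part of the skeleton:

```java
public class MultiWay {
    // One-vs-other multi-way classification: run one binary classifier
    // per language (that language vs. everything else) and assign the
    // tweet to the language whose classifier is most confident.
    // oneVsOtherProbs[i] is P(language i | tweet) from round i.
    static int classify(double[] oneVsOtherProbs) {
        int best = 0;
        for (int i = 1; i < oneVsOtherProbs.length; i++)
            if (oneVsOtherProbs[i] > oneVsOtherProbs[best]) best = i;
        return best;
    }

    public static void main(String[] args) {
        // e.g. rounds for {en, de, fr} reported 0.3, 0.8, 0.2
        System.out.println(classify(new double[]{0.3, 0.8, 0.2})); // prints 1
    }
}
```

Note that the three rounds' probabilities need not sum to 1, since each comes from a separate binary classifier; this sketch simply takes the maximum.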
Pick any two of the following tasks:
Skeleton code: /courses/cs200/cs240/cs240/proj3/skeleton/
Submit your completed java files along with your writeup to the cs240 folder of your own home directory by midnight on Monday, November 16, 2009.