CMPSCI 240

Programming Project #3: Classifying Tweets using naive Bayes

Due Monday, November 16, 2009

Overview

In this assignment, you will program a naive Bayes classifier to distinguish two kinds of strings. The program will accept as input two files of training strings (one for each class/category/language) and two files of test strings. For each test string, it will assign a probability (and a most likely class), and then the program will report the accuracy over all the test cases. The part you must write is the naive Bayes classifier: it takes in strings and their class labels, and then, given a new string, outputs a probability.
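To make the combination step concrete, here is a simplified, illustrative sketch (not the skeleton's actual API) of how naive Bayes turns per-feature likelihoods into a posterior probability. It assumes equal priors and conditionally independent features, and works in log space to avoid numerical underflow:

```java
// Illustrative sketch of the naive Bayes combination step; the skeleton's
// actual classes (Feature.java, NaiveBayesClassifier.java) will differ.
public class NaiveBayesSketch {
    /**
     * Posterior probability of class A given per-feature likelihoods under
     * each class, assuming equal priors and conditionally independent
     * features (the "naive" assumption).
     */
    public static double posteriorOfA(double[] likelihoodsA, double[] likelihoodsB) {
        double logA = 0.0, logB = 0.0; // sum logs instead of multiplying
        for (int i = 0; i < likelihoodsA.length; i++) {
            logA += Math.log(likelihoodsA[i]);
            logB += Math.log(likelihoodsB[i]);
        }
        // Bayes' rule with equal priors: pA / (pA + pB).
        return 1.0 / (1.0 + Math.exp(logB - logA));
    }

    public static void main(String[] args) {
        // One feature that is 9x more likely under class A than class B
        // yields a posterior of 0.9 for class A.
        System.out.println(posteriorOfA(new double[]{0.9}, new double[]{0.1}));
    }
}
```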

Assignment

This assignment consists of four parts: (1) implement the naive Bayes classifier, (2 & 3) see how your classifier performs on text examples, and (4) choose two interesting ways to extend the project. Start early on part 1, and make sure that you budget time for all four parts!

Simply getting the code working correctly gets you a C. For full credit, play with the system (as directed below), and discuss your findings in a writeup.

Part 1: implement the classifier

Complete the provided code to create a classifier that uses 26 features: the presence of each letter (A to Z) in a string.

You need to implement Feature.java and NaiveBayesClassifier.java before this code will work. You can run the classifier using the main method in TextClassifier.java (which has been implemented for you).

To avoid getting any zero or infinite probabilities, smooth the probabilities by adding 1 to each count. (For more on reasons and ways to smooth, see wikipedia: Pseudocount or page 5 of this document. The second link is talking about measuring not the presence of a letter in a string, but the presence or count of a word in a document. tf(t, d) means "term frequency" of a word in a document--that is, the count.)
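The add-1 smoothing described above can be sketched in a few lines. The method name below is illustrative (it is not part of the skeleton); the point is that adding one imaginary "present" and one imaginary "absent" observation keeps every estimated probability strictly between 0 and 1:

```java
// Illustrative sketch of add-1 (pseudocount) smoothing for a presence
// feature; adapt the idea to the skeleton's actual classes.
public class SmoothedEstimate {
    /**
     * Estimated probability that a feature is present in a string of a
     * given class, with a pseudocount of 1 added to each outcome.
     */
    public static double presenceProbability(int presentCount, int totalStrings) {
        // Unsmoothed estimate presentCount / totalStrings can be 0 or 1,
        // which would make the combined probability 0 or infinite in odds.
        return (presentCount + 1.0) / (totalStrings + 2.0);
    }

    public static void main(String[] args) {
        // Even a feature never seen in training gets a nonzero probability:
        // (0 + 1) / (100 + 2) = 1/102, roughly 0.0098.
        System.out.println(presenceProbability(0, 100));
    }
}
```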

Part 2: classify cities

Using the provided data files (100-city files for training, 50 for testing), classify US vs. Russian cities, Russian vs. other, and US vs. other. Attach the outputs to your writeup. (Run java TextClassifier with no arguments for some additional instructions.)

Note: the code reports accuracy, the number of test cases it gets correct, along with a second metric, mean squared error. Accuracy only cares whether the probability produced was above or below 0.5. Mean squared error measures how far each probability estimate was from the correct answer: if the classifier reports P = .9 and the answer is 1, the error is (1 - .9)²; if the answer is 0, it is (.9 - 0)². The average of those squared distances is what's reported. Smaller is better.
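The mean squared error computation described above is short enough to spell out directly (names here are illustrative, not the skeleton's):

```java
// Illustrative computation of mean squared error between predicted
// probabilities and the true 0/1 labels.
public class MseDemo {
    /** Average of (predicted - actual)^2 over all test cases. */
    public static double meanSquaredError(double[] predicted, int[] actual) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sum += diff * diff;
        }
        return sum / predicted.length;
    }

    public static void main(String[] args) {
        // P = .9 with answer 1 contributes (1 - .9)^2 = .01;
        // P = .9 with answer 0 contributes (.9 - 0)^2 = .81;
        // the mean is (.01 + .81) / 2 = .41.
        double[] p = {0.9, 0.9};
        int[] y = {1, 0};
        System.out.println(meanSquaredError(p, y));
    }
}
```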

Part 3: classify tweets

Twitter provides a search ability [twitter.com] that supports filtering recent "tweets" by subject, language, and time (among other options). You are provided with a script that fetches recent tweets from Twitter, preprocesses them for the classifier, and organizes them into training and test sets. You specify the topic and languages, and the script puts together training and test sets for each language on that topic.

  1. In the tweets directory you'll find some training and test sets of tweets organized according to language and topic. Using the hasselhoff data sets, classify English vs. German words on the topic of Hasselhoff. Report the accuracy.

  2. Using the linux data sets, classify German vs. Finnish tweets on the topic of Linux. Report the accuracy.

  3. For the hasselhoff and linux data sets, does the classifier do better when classifying the training data (after processing it all) or test data? Why?

  4. We could also classify by topic. Using the h1n1 and linux data sets, classify English tweets on H1N1 vs. Linux. Report the accuracy.

  5. Finally, pick a search topic and three languages and download some new tweets from Twitter. You can use the getTweets.rb ruby script to do this. For example, this command searches for tweets on Star Trek in four languages (English, German, French, and Spanish):

    $ ruby getTweets.rb startrek en de fr es
    Pick a different topic and set of three languages, and classify these with the multi-way classifier (which is three iterations of 1-vs-other). Tweets are limited to 140 characters, but many tweets may be shorter. Do the passages need to be of equal length for the probabilities to work out right?

Part 4: extend these results

Pick any two of the following tasks:

  1. See if you can improve the performance by adding additional features. (Some ideas: 2-letter combinations, whether a string starts or ends with a given letter, other regular expressions, whether a string has 3 or more vowels, length of a string, count of a letter in a string . . .) Write up what you tried and show whether it helped.
  2. Code up a "correct" 3-way classifier. Compare its performance to the one that's provided in TextClassifier.java (which simply runs all three 1-class-vs-the-rest classifiers and picks the one with the highest probability). What would be the accuracy for a classifier that used random guessing (or always answered the same thing)?
  3. Using the original set of features (or a better one), evaluate how the accuracy on the test set varies as a function of the amount of training data given. (If it looks as if it would help to have yet more training data, use the ruby script to grab some more from the web and see. Don't let it overwrite the original training set!)
  4. Analyze the errors of the classifier on the original 3 tasks. Break them out into false positives and false negatives. Is there a cutoff that works better than .5 for the boundary between classes? Or, is there a zone in the middle where it would be better for the classifier not to guess at all?
  5. Modify the classifier to be incremental. That is, after you classify an instance, learn from it too. What kind of accuracy can we get on the training data, and what kind of improvement on the test?
  6. How important are the prior odds? Try changing them -- either by hard-coding that term, or by giving the classifier training files of unequal size -- and see what happens on the test data. (Or, make the test files have unequal size, and see if it's better to have the prior odds be 50:50 or to have them match the ratio of the test files.)
  7. Think of another task this classifier could do, get data for it, and try it out.
  8. Try something else interesting.
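For task 1 above, one of the simplest extra features to try is the presence of a given two-letter combination. The class below is a hypothetical sketch; the constructor and method names are illustrative, and you would adapt them to whatever Feature.java actually requires:

```java
// Hypothetical extra feature for Part 4, task 1: presence of a given
// two-letter combination in a string, ignoring case.
public class BigramFeature {
    private final String bigram;

    public BigramFeature(String bigram) {
        this.bigram = bigram.toLowerCase();
    }

    /** True if the string contains the two-letter combination, ignoring case. */
    public boolean isPresent(String s) {
        return s.toLowerCase().contains(bigram);
    }

    public static void main(String[] args) {
        BigramFeature ch = new BigramFeature("ch"); // frequent in German words
        System.out.println(ch.isPresent("Hasselhoff"));    // false
        System.out.println(ch.isPresent("Schadenfreude")); // true
    }
}
```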

Logistics

The java files needed for this assignment are in a directory called

/courses/cs200/cs240/cs240/proj3/skeleton/

See the edlab main page if you don't yet know how to access this directory.

Submit your completed java files along with your writeup to the cs240 folder of your own home directory by midnight on Monday, November 16, 2009.