CMPSCI 240

Programming Project #3: Classifying words using naive Bayes

Due Wednesday, April 15, 2009

Overview

In this assignment, you will program a naive Bayes classifier to distinguish two kinds of words. The program accepts as input two files of training words (one for each class, i.e. each category or language) and two files of test words. For each test word, it assigns a probability and a most likely class, and then it reports the accuracy over all the test cases. The part you must write is the naive Bayes classifier: it is trained on words and their class labels and then, given a word, outputs a probability.
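To make the train-then-classify flow concrete, here is a minimal sketch of a two-class naive Bayes classifier. It is not the interface of the provided Java files: the class name NaiveBayesSketch and all of its methods are hypothetical. It also makes several assumptions the assignment does not state: the features are the individual letters a-z of a word, counts use add-one smoothing, and scores are computed in log space to avoid underflow.

```java
import java.util.Locale;

/** Illustrative two-class naive Bayes over the letters of a word (all names hypothetical). */
class NaiveBayesSketch {
    private static final int NUM_CLASSES = 2;
    private static final int ALPHABET = 26; // assumption: only letters a-z are used as features

    private final int[] wordCounts = new int[NUM_CLASSES];             // training words per class
    private final int[][] letterCounts = new int[NUM_CLASSES][ALPHABET]; // letter counts per class
    private final int[] totalLetters = new int[NUM_CLASSES];           // total letters per class

    /** Record one training word with its class label (0 or 1). */
    void train(String word, int label) {
        wordCounts[label]++;
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                letterCounts[label][c - 'a']++;
                totalLetters[label]++;
            }
        }
    }

    /** log P(class) + sum over letters of log P(letter | class), with add-one smoothing. */
    private double logScore(String word, int label) {
        int totalWords = wordCounts[0] + wordCounts[1];
        double score = Math.log((wordCounts[label] + 1.0) / (totalWords + NUM_CLASSES));
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                score += Math.log((letterCounts[label][c - 'a'] + 1.0)
                        / (totalLetters[label] + ALPHABET));
            }
        }
        return score;
    }

    /** Posterior probability that the word belongs to class 0. */
    double probabilityOfClass0(String word) {
        double s0 = logScore(word, 0);
        double s1 = logScore(word, 1);
        // Normalizing exp(s0) / (exp(s0) + exp(s1)) simplifies to this form.
        return 1.0 / (1.0 + Math.exp(s1 - s0));
    }

    /** Most likely class for the word. */
    int classify(String word) {
        return probabilityOfClass0(word) >= 0.5 ? 0 : 1;
    }

    /** Fraction of the given test words (all of one true class) that are classified correctly. */
    double accuracy(String[] testWords, int trueLabel) {
        int correct = 0;
        for (String w : testWords) {
            if (classify(w) == trueLabel) correct++;
        }
        return testWords.length == 0 ? 0.0 : (double) correct / testWords.length;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("aaa", 0);
        nb.train("aba", 0);
        nb.train("zzz", 1);
        nb.train("zyz", 1);
        System.out.println("P(class 0 | \"aa\") = " + nb.probabilityOfClass0("aa"));
        System.out.println("class of \"zz\" = " + nb.classify("zz"));
    }
}
```

Treating each letter as an independent feature is only one possible choice; other features (for example, letter pairs or word endings) fit the same structure by changing what is counted in train and logScore.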

Simply getting the code working correctly gets you a C. For full credit, play with the system (as directed below), and discuss your findings in the writeup.

Assignment

Finally, pick any two of the following tasks:

Logistics

The three Java files needed for this assignment are in a folder called proj3 in the cs240 directory of the edlab machines. To get to it, use this link to the FTP server, or see the edlab main page for other ways to connect.

Submit your completed Java files along with your writeup to the cs240 folder of your own home directory by midnight on Wednesday, April 15, 2009.