Assignment 07
This assignment is due at 1700 on Wednesday, 12 November.
The goal of this assignment is to implement a naive Bayes classifier.
I will be updating the assignment with questions (and their answers) as they are asked.
Problem
The naive Bayes classifier predicts the probability of a query variable, given known evidence variables and a strong simplifying assumption. The simplifying assumption is that the single class variable directly influences all other (evidence) variables, and that each evidence variable is conditionally independent of all other evidence variables given the class variable.
That is,
$$P(\mathrm{Class} | X_1, \ldots, X_n) = \alpha P(\mathrm{Class}, X_1, \ldots, X_n) = \alpha P(\mathrm{Class})\prod_{i}P(X_i | \mathrm{Class})$$
(You should be generally comfortable with deriving the naive Bayes model above, either from your time in CMPSCI240 or using your recently-acquired knowledge of Bayes nets.)
In this assignment, you will write a program that constructs a naive Bayes model and sets its parameters according to a given set of training instances. It will estimate P(Class) based upon the distribution of Class in the training set. It will estimate P(Xi = xj | Class) by counting the fraction of instances of that Class for which Xi is equal to value xj. Your program will then output a prediction on one or more test instances, consisting of a most likely class and an associated probability of that class.
In particular, we will be using subsets of data from 16 votes taken in the U.S. House of Representatives in 1984. This data is described further here: Congressional Voting Records Data Set. Class, above, will correspond to either democrat or republican; each Xi corresponds to one of the votes. More details are below.
Input data format
Your program will receive two types of input: training and test data. Each will be provided in the following text-based format.
Training data
Training data will consist of one or more lines, each corresponding to an instance. Each instance will consist of seventeen comma-separated values. The first value is the class label, either democrat or republican, corresponding to the party of the politician with the given voting record. Each of the remaining sixteen values is either y, n, or ?.
A valid set of training data follows.
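(The five instances below are illustrative, constructed to match the format just described; they are not rows from the actual data set.)

republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,?,n,n,n,y,y
democrat,y,?,y,n,n,n,y,y,y,n,n,n,n,n,y,?
republican,n,y,n,y,y,y,n,?,n,y,n,y,y,y,n,n
republican,n,y,?,y,y,y,n,n,n,n,n,y,y,y,n,y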
When creating your model, treat all variables as binary. In particular, treat ? as an unknown value, and do not increment the counts in your model for that value when it is present. (There are more advanced techniques, beyond the scope of this assignment, to handle missing data.)
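As a concrete illustration of this counting scheme, here is a minimal sketch in Java. The Instance class and all of the names here are assumptions made for illustration, not part of the assignment; structure your own program however you like.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class CountingSketch {
        // Hypothetical representation of one parsed training line.
        static class Instance {
            String label;    // "democrat" or "republican"
            String[] votes;  // sixteen entries, each "y", "n", or "?"
        }

        static void countParameters(List<Instance> training) {
            Map<String, Integer> classCounts = new HashMap<>();
            // voteCounts.get(i) maps "class,value" keys to counts for variable X_i.
            List<Map<String, Integer>> voteCounts = new ArrayList<>();
            for (int i = 0; i < 16; i++) voteCounts.add(new HashMap<>());

            for (Instance inst : training) {
                classCounts.merge(inst.label, 1, Integer::sum);
                for (int i = 0; i < 16; i++) {
                    String v = inst.votes[i];
                    if (v.equals("?")) continue; // unknown value: leave the counts untouched
                    voteCounts.get(i).merge(inst.label + "," + v, 1, Integer::sum);
                }
            }
            // P(Class = c)           is then classCounts.get(c) / (double) training.size()
            // P(X_i = v | Class = c) is voteCounts.get(i).get(c + "," + v) / (double) classCounts.get(c)
        }
    }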
Test data
Test data is in the same format as training data, with one exception. It has only sixteen columns; the column corresponding to the class value (that is, containing the value democrat or republican) is not present.
For example:
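(Again, these two instances are illustrative.)

n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
y,?,y,n,n,n,y,y,y,n,n,n,n,n,y,y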
Like training data, test data may include instances with a marker for an unknown value, ?. In this case, omit from the probability estimate the term(s) in the product corresponding to the variable(s) with missing values.
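A minimal sketch of that omission follows, continuing the illustrative names from the counting sketch above; prior and condProb (and the shape of the map keys) stand in for whatever lookups into your trained model you end up writing.

    import java.util.Map;

    class ScoringSketch {
        // Illustrative: the unnormalized score of one test instance under class cls.
        static double score(String cls, String[] votes,
                            Map<String, Double> prior,
                            Map<String, Double> condProb) {
            double s = prior.get(cls);                             // P(Class = cls)
            for (int i = 0; i < 16; i++) {
                if (votes[i].equals("?")) continue;                // omit missing-value terms
                s *= condProb.get(i + "," + votes[i] + "," + cls); // P(X_i = v | Class = cls)
            }
            return s;
        }
    }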
Output data format
Your program should construct a naive Bayes model using its input data, then classify each instance in the test data using the model. The output of this classification should be written to standard output in the following text-based format.
The output should consist of a sequence of lines. There should be as many lines in the output as there are instances of test data. Each line should consist of two values, separated by a single comma. The first should be the predicted class (the more likely of P(democrat|votes) or P(republican|votes)). The second should be the probability estimate. In the event of a perfectly uniform distribution over the class variable, output republican as the class value.
For example (with illustrative values):
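democrat,0.9154
republican,0.5207
republican,0.9990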
is an output in the correct format for a test data set with three instances.
Other items of import
As we mentioned in class, naive Bayes runs into a problem when no instances of a given class were observed with a particular value of an evidence variable. Leaving the corresponding count at zero results in a zero conditional probability, which in turn zeroes out the entire product. Instead, you should regularize by smoothing: when a given class and value have zero examples in the training data, your classifier should set the conditional probability of the value given the class to the inverse of the number of instances of training data (in effect, a count of one over the whole training set).
For example, if P(vote7=y | class=democrat) = 0, don’t use zero for this term when computing P(class=democrat | votes). Instead use 1 / (number of instances in training data). This approach is not-quite-equivalent to Laplace smoothing.
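In code, the rule might look like the following sketch (the names are illustrative, not prescribed):

    class SmoothingSketch {
        // Illustrative: P(X_i = v | Class = c) with the zero-count fallback described
        // above, where numTraining is the total number of training instances.
        static double smoothedCondProb(int observedCount, int classCount, int numTraining) {
            if (observedCount == 0) {
                return 1.0 / numTraining; // smoothed stand-in for a zero count
            }
            return observedCount / (double) classCount;
        }
    }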
As mentioned in Assignment 06, you can run into trouble when multiplying together many small-magnitude floating-point (double) numbers. To mitigate this trouble, either use an arbitrary-precision numerical type (like Java’s BigDecimal), or perform your multiplications under a logarithmic transform. In other words, if you want to compute:
$$v = \alpha P(\mathrm{Class})\prod_{i}P(X_i | \mathrm{Class})$$
take the log of both sides:
$$ \log v = \log [\alpha P(\mathrm{Class})\prod_{i}P(X_i | \mathrm{Class})]$$
then take advantage of the fact that:
$$ \log xy = \log x + \log y $$
to get:
$$ \log v = \log \alpha + \log P(\mathrm{Class}) + \sum_{i} \log P(X_i | \mathrm{Class})$$
The sum of the logarithms won’t underflow: each term is negative, so the sum simply grows in magnitude. This behavior contrasts with that of a product of probabilities, whose magnitude shrinks toward zero with each additional term.
Once you’ve summed up the right-hand side, invert the log transform by raising the log’s base to the power of that sum. For example, if you used Java’s Math.log() method to transform to log space, use Math.exp() to invert it.
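Here is one way the whole transform might look, as a sketch. The names are again illustrative, and the max-subtraction in the normalization step is a standard extra guard against under- and overflow in exp(), not something the assignment requires.

    class LogSpaceSketch {
        // Illustrative: log v = log P(Class) + sum_i log P(X_i | Class),
        // ignoring alpha, which cancels during normalization below.
        static double logScore(double logPrior, double[] logCondTerms) {
            double s = logPrior;
            for (double t : logCondTerms) s += t;
            return s;
        }

        // Illustrative: recover P(democrat | votes) from the two classes' log scores.
        static double probDemocrat(double logDem, double logRep) {
            double m = Math.max(logDem, logRep); // guard before exponentiating
            double dem = Math.exp(logDem - m);
            double rep = Math.exp(logRep - m);
            return dem / (dem + rep);            // alpha is applied implicitly here
        }
    }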
20% or fewer of the test cases will rely upon either using BigDecimal or getting log transforms working. I suggest you start by just computing using double and getting it working correctly, then adding floating-point underflow prevention of either the BigDecimal or log transform variety.
What to submit
You should submit two things: a program to generate and use a naive Bayes classifier and a readme.txt.
Your classifier should use its first command line argument as the path to a file containing training data and its second command line argument as the path to a file containing test data. If, for example, your classifier’s main method is in a Java class named NaiveBayesClassifier, we should be able to use
java NaiveBayesClassifier /Users/liberato/training.data /Users/liberato/test.data
to direct your program to read the training data in /Users/liberato/training.data, build a model on it, and use the model to predict the probability of classes for each instance in /Users/liberato/test.data. Your program should print the predicted most likely classes and associated probabilities to standard output, in exactly the format described above.
Submit the source code of your programs, written in the language of your choice. Name the file containing the main() method NaiveBayesClassifier or your language’s equivalent. If the file(s) you submit depend(s) upon other files, be sure to submit these other files as well.
As in the previous assignments, while you may use library calls for parsing, data structures and the like, you must implement the classifier yourself. Do not use a library for classification. We will consider it plagiarism if you do. Check with us if you think there’s any ambiguity.
Your readme.txt should contain the following items:
- your name
- if the language of your choice is not Java, Python, Ruby, node.js-compatible JavaScript, ANSI C or C++ (or if you’re concerned it’s not completely obvious to us how to compile and execute it), a description of how to compile and execute the submitted files
- a description of what you got working, what is partially working and what is completely broken
If you’re using language features that require a specific version of your language or runtime, check for that version at program start and fail if it’s not present, emitting an understandable error message indicating this fact. Your program must compile and execute on the Edlab Linux machines.
If your program does not compile or execute, you will receive no credit. Check with us in advance if you’re concerned.
Grading
We will run your program on a variety of test cases. The exact test cases will not be available to you before grading. You are welcome to write and distribute your own test cases.
If your readme.txt is missing or judged insufficient, your overall score may be penalized by up to ten percent.
We’re not going to feed your program incorrectly formatted input, so you need only concern yourself with handling input in the format described in the assignment.
We expect valid output. Generating output that is not in the format described in the assignment will result in a failed test case. We will check that your output classes are correct, and that your output probabilities are reasonably close to the correct values (whether or not you use either BigDecimal or a log transform will influence their computed values).
I do not expect anything in a solution to this assignment to be particularly memory or CPU intensive. But as usual, if your program exceeds available heap memory (which we’ll set to 1 GB in Java, using the -Xmx1024M argument if necessary), or if it does not terminate in twenty seconds, we will consider the test case failed.
Questions and answers
In the data set description (https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.names) there are these two pieces of information, which appear to contradict each other:

5. Number of Instances: 435 (267 democrats, 168 republicans)

9. Class Distribution: (2 classes)
1. 45.2 percent are democrat
2. 54.8 percent are republican
Why is the class distribution not 267/435 = 61.3% democrat, 168/435 = 38.6% republican?
As far as I can tell the values listed in (9) appear to be erroneous.
In any case, the training data that’s given as input to your classifier won’t necessarily be exactly this data; it may be a subset, or synthetic data in the same format. For a given run, your program should compute the class distribution based on the training data it’s given.
Relatedly, I haven’t checked the computed conditional probabilities in the .names file, so it’s hard to say whether you can use them to validate your classifier.
To do smoothing, you said to use:
1 / (number of instances in training data)
Is the denominator the number of lines in the training set, or the total number of votes in the training set?
The former, which is the total number of instances in the training data. (In practice it doesn’t matter much exactly how you smooth so long as you do so, but for the purposes of this assignment please use the above method.)
Which of these is correct:
P(“republican”) = [# of lines that start with “republican”] / [total # of lines]
or
P(“republican”) = [# of votes made by a “republican”] / [total # of votes]
The former: P(“republican”) = [# of lines that start with “republican”] / [total # of lines]. Each line is an instance. P(“republican”) is the unconditional probability of P(Class=“republican”), which we estimate as the fraction of all instances for which it is the case.
By analogy, the P(isSpam) we discussed in class is (number of emails that were spam) / (total number of emails). Not (number of words observed in spam emails) over the (number of words observed across all emails).
You could formulate the problem in the latter way, but it’s not the way the problem is formulated in the assignment.
Will the input come from files whose paths are command line arguments? Or will the input be command line arguments?
The path to the file containing training data will be the first command line argument; the path to the file containing test data will be the second command line argument.
I have a question: given the Congressional Voting Records Data Set as training data, could you tell me if the following input/output is correct?
I cannot, as I haven’t coded up the solution yet.
I ask this because I want to check if my output is correct. If not, may I ask how could I verify my output?
I have a few suggestions. Generally, you should construct test examples that you can verify by hand. So, if you’ve set up your NBC such that it can work on instances with arbitrary numbers of variables rather than exactly 16, you could write up a few small examples with one or two variables and three or four training instances, and make sure it results in the correct values.
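For instance, here is a hypothetical one-variable example, small enough to verify by hand. Given the training data

democrat,y
democrat,y
democrat,n
republican,n

and the single test instance y: P(democrat) = 3/4, P(republican) = 1/4, P(X1=y|democrat) = 2/3, and P(X1=y|republican) is smoothed from 0 up to 1/4 (one over the four training instances). The unnormalized scores are 3/4 × 2/3 = 1/2 for democrat and 1/4 × 1/4 = 1/16 for republican, so P(democrat|y) = (1/2) / (1/2 + 1/16) = 8/9 ≈ 0.8889, and the correct output line is democrat,0.8889.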
Or you could compare tests with other students. One student posted a synthetic data set and his classifier’s results on it to Moodle; you could check your results against his and make sure they agree.
I had a question about the ? entries in the training data. Do we basically treat it as a missing row when calculating P(Xi = xj | Class) for a particular i?
So when we calculate a conditional probability, say P(X_7 = “yes”|republican), and there are 5 entries for republican, where two of them have “yes”, two “no”, and one “?” in the 7th column, do we just disregard the row with “?” and say P(X_7 = “yes”|republican) = 2/4?
Patrick answered “yes” to this question. When writing the assignment, my intention was the other way around. But I can see how it could be read either way.
In other words, I expected P(X_7 = “yes”|republican) = 2/5, as it’s the more well-defined interpretation. Under the “disregard” reading, what happens in your classifier when, for a particular variable, every value for one class is “?”? You’d have to compute 0/0, which is not well defined. In practice I’m assuming you’d just smooth this value as described above (1 / total number of instances).
We’ll accept output corresponding to either case (assuming everything else is correct).