Assignment 08
This assignment is due at 1700 on Wednesday, 19 November.
The goal of this assignment is to implement a decision-tree-based classifier.
I will be updating the assignment with questions (and their answers) as they are asked.
Update: To make life simpler, the assignment no longer requires you to deal with unknown values (?) at all. They will not be present in the training or test data.
Problem
According to Russell and Norvig,
A decision tree represents a function that takes as input a vector of attribute values and returns a “decision”—a single output value.
In this assignment, you will write a program that implements the Decision-Tree-Learning algorithm (Figure 18.5 in the text) to build a decision tree according to a given set of training instances. Your program will then output a prediction on one or more test instances, consisting of a most likely class and an associated probability of that class, as predicted by the learned tree.
As described in the text, compute the class label associated with a leaf as the most frequent class in the training instances that reach that leaf. As a minor enhancement, compute the probability of each class associated with a leaf as the fraction of training instances of each class that reach that leaf.
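As a concrete illustration, a minimal sketch of that leaf computation might look like the following (the class and method names are mine, not required by the assignment):

```java
import java.util.List;

public class LeafSketch {
    // Sketch: given the class labels of the training instances that
    // reach a leaf, return the majority class and its empirical
    // probability. A perfectly even split falls through to
    // "republican", matching the tie-breaking rule described below.
    static String[] leafPrediction(List<String> labelsAtLeaf) {
        long democrats = labelsAtLeaf.stream().filter("democrat"::equals).count();
        long republicans = labelsAtLeaf.size() - democrats;
        String majority = democrats > republicans ? "democrat" : "republican";
        double probability = (double) Math.max(democrats, republicans) / labelsAtLeaf.size();
        return new String[] { majority, Double.toString(probability) };
    }
}
```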
We will be using training and test data in the same format as in Assignment 07.
Input data format
Your program will receive two types of input: training and test data. Each will be provided in the following text-based format.
Training data
Training data will consist of one or more lines, each corresponding to an instance. Each instance will consist of seventeen comma-separated values. The first value is the class label, either democrat or republican, corresponding to the party of the politician with the given voting record. Each of the remaining sixteen values is either y or n.
A valid set of training data follows.
```
democrat,y,y,n,y,n,n,y,y,n,y,n,n,y,y,n,y
republican,n,y,n,y,y,n,n,n,y,y,n,y,y,n,n,y
democrat,y,n,y,y,n,y,y,n,n,y,y,n,y,n,y,n
republican,n,n,y,n,y,y,n,y,y,n,n,y,n,y,y,n
```
When creating your model, treat all variables as binary.
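For example, one way to parse a training line under that assumption (the Instance class and field names here are illustrative, not part of the assignment):

```java
// Sketch: parse one training line into a class label plus sixteen
// binary votes ("y" = true, "n" = false).
public class Instance {
    final String label;
    final boolean[] votes = new boolean[16];

    Instance(String line) {
        String[] fields = line.trim().split(",");
        label = fields[0];
        for (int i = 0; i < 16; i++) {
            votes[i] = fields[i + 1].equals("y");
        }
    }
}
```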
Test data
Test data is in the same format as training data, with one exception: it has only sixteen columns; the column corresponding to the class value (that is, containing the value democrat or republican) is not present.
For example:
```
y,y,n,y,n,n,y,y,n,y,n,n,y,y,n,y
n,n,y,n,y,y,n,y,y,n,n,y,n,y,y,n
```
Output data format
Your program should construct a decision tree using its input data, then classify each instance in the test data using the model. The output of this classification should be written to standard output in the following text-based format.
The output should consist of a sequence of lines. There should be as many lines in the output as there are instances of test data. Each line should consist of two values, separated by a single comma. The first should be the predicted class (the more likely of P(democrat|votes) or P(republican|votes)). The second should be the estimated probability of that class. In the event of a perfectly uniform distribution over the class variable, output republican as the class value.
For example:
```
democrat,1.0
republican,0.6
democrat,0.75
```
is an output in the correct format for a test data set with three instances.
Other items of import
Use “information gain” as described in Section 18.3.4 of the text to choose the attribute upon which to split.
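A sketch of that computation, specialized to binary attributes and the two-class case, follows (the method names are mine; p and n are the counts of the two classes, as in the text):

```java
public class InformationGain {
    // B(q): entropy of a Boolean variable that is true with probability q.
    static double booleanEntropy(double q) {
        if (q <= 0.0 || q >= 1.0) return 0.0;
        return -(q * log2(q) + (1 - q) * log2(1 - q));
    }

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Gain of splitting p democrats and n republicans on a binary
    // attribute, given the class counts in the attribute's y-branch
    // (pYes, nYes) and n-branch (pNo, nNo).
    static double gain(int pYes, int nYes, int pNo, int nNo) {
        int p = pYes + pNo;
        int n = nYes + nNo;
        double remainder =
              branchWeight(pYes, nYes, p + n) * booleanEntropy(fraction(pYes, pYes + nYes))
            + branchWeight(pNo, nNo, p + n) * booleanEntropy(fraction(pNo, pNo + nNo));
        return booleanEntropy(fraction(p, p + n)) - remainder;
    }

    static double branchWeight(int pk, int nk, int total) {
        return (double) (pk + nk) / total;
    }

    // Guard against 0/0 when a branch receives no instances.
    static double fraction(int num, int den) {
        return den == 0 ? 0.0 : (double) num / den;
    }
}
```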
As noted in Section 18.3.5 of the text, the algorithm as written can produce overfitted trees, especially on data with weak or non-existent patterns. This overfitting should come as no surprise. The effect will be quite noticeable if you generate fully random test data rather than using random subsets of the actual data: you’ll see large trees in which most leaves are associated with only a few training instances each, and few if any leaves have many instances. Training on real data (where you expect the variables to predict the class) will usually have the opposite effect, producing smaller trees and leaves with many associated instances.
If you want to build better trees, consider implementing a pruning algorithm (or, a weaker option, an early-stopping algorithm) of some sort. If you take this route, add an optional third command line argument. If present and set to enabled, this argument should enable the pruning or early-stopping algorithm in your program. If not present, your program should build and use the full, un-pruned tree.
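Handling that optional argument might be as simple as the following fragment (only the literal string enabled is dictated by the assignment):

```java
// args[0]: training data path; args[1]: test data path;
// args[2] (optional): "enabled" turns on pruning/early stopping.
boolean pruningEnabled = args.length >= 3 && "enabled".equals(args[2]);
```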
What to submit
You should submit two things: a program to generate and use a decision tree classifier, and a readme.txt.
Your classifier should use its first command line argument as the path to a file containing training data and its second command line argument as the path to a file containing test data. If, for example, your classifier’s main method is in a Java class named DecisionTree, we should be able to use

java DecisionTree /Users/liberato/training.data /Users/liberato/test.data

to direct your program to read the training data in /Users/liberato/training.data, build a model on it, and use the model to predict the probability of classes for each instance in /Users/liberato/test.data. Your program should print the predicted most likely classes and associated probabilities to standard output, in exactly the format described above.
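A top-level skeleton consistent with that interface might look like this (the comments mark where your own tree-building and classification code goes):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class DecisionTree {
    public static void main(String[] args) throws Exception {
        List<String> trainingLines = Files.readAllLines(Paths.get(args[0]));
        List<String> testLines = Files.readAllLines(Paths.get(args[1]));
        // ... build the decision tree from trainingLines here ...
        for (String testLine : testLines) {
            // ... classify testLine with the tree, then print one
            // output line per test instance, e.g.:
            // System.out.println(predictedClass + "," + probability);
        }
    }
}
```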
Submit the source code of your programs, written in the language of your choice. Name the file containing the main() method DecisionTree.java, or your language’s equivalent. If the file(s) you submit depend(s) upon other files, be sure to submit these other files as well.
As in the previous assignments, while you may use library calls for parsing, data structures and the like, you must implement the classifier yourself. Do not use a library for classification. We will consider it plagiarism if you do. Check with us if you think there’s any ambiguity.
Your readme.txt should contain the following items:
- your name
- if the language of your choice is not Java, Python, Ruby, node.js-compatible JavaScript, ANSI C or C++, or Mono-compatible C# (or if you’re concerned it’s not completely obvious to us how to compile and execute it), a description of how to compile and execute the submitted files
- a description of what you got working, what is partially working and what is completely broken
If you’re using language features that require a specific version of your language or runtime, check for that version at program start and fail if it’s not present, emitting an understandable error message indicating this fact. Your program must compile and execute on the Edlab Linux machines.
If your program does not compile or execute, you will receive no credit. Check with us in advance if you’re concerned.
Grading
We will run your program on a variety of test cases. The exact test cases will not be available to you before grading. You are welcome to write and distribute your own test cases.
If your readme.txt is missing or judged insufficient, your overall score may be penalized by up to ten percent.
We’re not going to feed your program incorrectly formatted input, so you need only concern yourself with handling input in the format described in the assignment.
We expect valid output. Generating output that is not in the format described in the assignment will result in a failed test case. We will check that your output classes are correct, and that your output probabilities are reasonably close to the correct values (implementation choices, such as whether you use BigDecimal or a log transform, will influence the exact computed values).
I do not expect anything in a solution to this assignment to be particularly memory or CPU intensive. But as usual, if your program exceeds available heap memory (which we’ll set to 1 GB in Java, using the -Xmx1024M argument if necessary), or if it does not terminate in twenty seconds, we will consider the test case failed.
Questions and answers
I think I have a pretty rough idea of how to implement the creation of the search tree, but I think I’m also getting a little stuck on information gain.
The quick bit at the start of class was really helpful, but since I sit in the back and the markers kinda suck, I’m not entirely sure the notes I copied down are correct. Is it possible to get a copy of that part of the lecture (if it’s not already posted somewhere)?
The notes from 11-18 include a translation of the textbook’s “information gain” for trees containing only binary attributes. You’re also welcome to drop by office hours if you need more specific help.
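For reference, here are the textbook’s definitions from Section 18.3.4, for a training set with p positive and n negative examples and an attribute A that splits it into d subsets (for this assignment, d = 2):

```latex
B(q) = -\left(q \log_2 q + (1 - q) \log_2 (1 - q)\right)

\mathrm{Remainder}(A) = \sum_{k=1}^{d} \frac{p_k + n_k}{p + n}\, B\!\left(\frac{p_k}{p_k + n_k}\right)

\mathrm{Gain}(A) = B\!\left(\frac{p}{p + n}\right) - \mathrm{Remainder}(A)
```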
I guess I am quite confused about which probability we should be outputting here. I can interpret it a few ways:
1: # of examples at that leaf / total number of examples in the file
2: # of examples at that leaf / total number of examples of that class in the file
3: # of examples at that leaf / total number of examples in its parent
Or maybe it’s something else?
Any clarification would help.
Some number of training examples will end up at each leaf. In the best case, they are all of the same class (e.g., 3 republicans, or 5 democrats). In that case, if a test instance ends up in that leaf, you’d output that one class, and 1.0 as the probability.
But if the training instances that end up in a leaf can’t be split further (because, for instance, you’re out of attributes), then you might end up with, say, 3 republicans and 2 democrats among the training instances in a leaf. If that happens, and a test instance ends up in that leaf, output the majority class (in this example, republican), with a probability equal to the fraction of the training instances of that class: here, 0.6.