Assignment 05
This assignment is due at 1700 on Friday, 24 October.
The goal of this problem is to implement a program to construct and perform exact inference on a full joint distribution.
I will be updating the assignment with questions (and their answers) as they are asked.
Problem
The Car Evaluation Database is a publicly available categorical data set, obtainable at https://archive.ics.uci.edu/ml/machine-learning-databases/car/. It consists of 1,728 instances, each of which has seven categorical values describing a car:
- buying: the purchase price
- maint: the maintenance costs
- doors (not door as previously written): the number of doors
- persons: passenger capacity
- lug_boot: trunk capacity
- safety: estimated safety
- car: whether a car is overall considered acceptable (this is also referred to as the “class variable”; it is a categorical value like any other; “class” means that it is the variable that ML algorithms are trying to predict)
and is further described at https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names. There is some ambiguity between the description and the actual data. For example, the actual data values do not contain hyphens: one value is vgood, not v-good. If there is a conflict, expect the values in the actual data, and output the same.
Empirically estimating full joint distributions
Given a sampled data set, you can construct an empirical estimate of the full joint distribution by determining the fraction of the instances that correspond to each possible atomic event.
As a simple example, consider a pair of coins (Coin1, Coin2) being flipped together 10 times. If the observed data are:
H,T
T,H
T,T
H,T
H,T
H,H
T,H
T,T
T,H
T,H
then the estimated full joint distribution would be:
              Coin1
              H     T
Coin2    H   0.1   0.4
         T   0.3   0.2
(We are deliberately ignoring the question of whether these coins are fair, whether they are independent, and whether we have enough data to have confidence that our empirical estimate is reasonably close to the true distribution. We will return to some of these issues later in the semester.)
Your program should construct the equivalent empirically estimated full joint distribution for the data in the Car Evaluation Database.
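To make the counting concrete, here is a minimal sketch of the estimation step, assuming car.data is a comma-separated file in the working directory; the class name and the printed format are illustrative, not the required ones.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JointEstimate {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("car.data"));
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;  // ignore any trailing blank line
            counts.merge(line, 1, Integer::sum);  // one count per atomic event
            total++;
        }
        // the estimate for each atomic event is its relative frequency
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + (double) e.getValue() / total);
        }
    }
}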
Inference on full joint distributions
Given a full joint distribution, it is a common task to extract the distribution of some subset of variables, or a single variable. For example, what is P(Coin1)? Recall from class and the reading that we can compute this by marginalizing, or summing out, the associated rows or columns in the table. Here, P(Coin1 = H) = P(Coin1 = H ^ Coin2 = H) + P(Coin1 = H ^ Coin2 = T) = 0.1 + 0.3 = 0.4 (and P(Coin1 = T) = 0.6, as you’d expect).
Similarly, using conditioning, we can compute conditional probabilities by looking at subsets of rows and columns in the table, e.g., P(Coin1 = H | Coin2 = H) = P(Coin1 = H ^ Coin2 = H) / (P(Coin1 = H ^ Coin2 = H) + P(Coin1 = T ^ Coin2 = H)) = 0.1 / (0.1 + 0.4) = 0.2.
Your program should be able to compute unconditional and conditional probabilities as described above, using its empirically estimated full joint distribution for the data in the Car Evaluation Database.
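As a sketch of both operations on the coin table above (the array layout and names here are mine, not part of the assignment):

public class CoinInference {
    public static void main(String[] args) {
        // joint[coin1][coin2], index 0 = H, 1 = T, values from the table above
        double[][] joint = { { 0.1, 0.3 },   // Coin1 = H
                             { 0.4, 0.2 } }; // Coin1 = T
        // marginalization: P(Coin1 = H) = sum over all values of Coin2
        double pC1H = joint[0][0] + joint[0][1];           // 0.4
        // conditioning: P(Coin1 = H | Coin2 = H)
        double pC2H = joint[0][0] + joint[1][0];           // 0.5
        System.out.println(pC1H + " " + joint[0][0] / pC2H);  // 0.4 0.2
    }
}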
Input data format
We will place a copy of the car.data file in the same directory as your program, exactly as it exists in the UCI Machine Learning Repository.
We will test your program using queries in the following text-based format.
The first line of the query will list one or more of the variables, as described above (buying, maint, etc.), separated by whitespace. These are the variables whose joint distribution must be calculated.
The remaining lines of the file are optional. If present, they specify a set of variables and values to condition the computed distribution on. Each such line starts with a single variable (buying, etc.), followed by one or more values of that variable (in the case of buying: vhigh, high, med, low; and so on), separated by whitespace. No line will contain all possible values of a variable. No two lines will start with the same variable name.
For example, to compute the distribution over car (that is, P(Car)), the query would consist only of:

car
To compute the distribution over safety and maint given low or medium buying prices and two or four person capacity (that is, P(Safety, Maint | buying ∈ {low, med} ^ persons ∈ {2, 4})), the query would consist of:

safety maint
buying low med
persons 2 4
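To make the format concrete, here is one way such a query file might be read (a sketch; the structure and names are my own, not a required design):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class QueryReader {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        // first line: the query variables
        List<String> queryVars = Arrays.asList(lines.get(0).trim().split("\\s+"));
        // remaining lines: an evidence variable followed by its allowed values
        Map<String, Set<String>> evidence = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) continue;
            String[] t = line.trim().split("\\s+");
            evidence.put(t[0], new HashSet<>(Arrays.asList(t).subList(1, t.length)));
        }
        System.out.println(queryVars + " | " + evidence);
    }
}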
Output data format
Your program’s output should be a series of lines. Each line will consist of n+1 values, separated by whitespace, where n is the number of variables in the distribution being computed. The ith value in the line should be a value for the ith variable of the distribution being computed; the (n+1)th value should be the probability of the given setting of variables. Variable values should be in the same order as they appeared in the query.
The probability should be output as a decimal number with at least three significant figures.
There should be as many lines in the output as there are distinct settings of the variables.
For example, given the input query:

car

a correct output would be:

unacc 0.700
acc 0.222
good 0.0399
vgood 0.0376
Given a more complex query, such as:

safety maint
buying low med
persons 2 4

a correctly-formatted output might resemble:

low vhigh 0.0833
low high 0.0833
low med 0.0833
low low 0.0833
med vhigh 0.0833
med high 0.0833
med med 0.0833
med low 0.0833
high vhigh 0.0833
high high 0.0833
high med 0.0833
high low 0.0833

(An earlier version of this page claimed these probabilities were incorrect; that disclaimer was itself incorrect. Thanks to a sharp-eyed student for catching it.)
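For the probability column, one simple option in Java is the %g conversion, whose precision counts significant digits (a sketch; safetyValue, maintValue, and probability are hypothetical variables):

// four significant figures, comfortably above the required three
System.out.printf("%s %s %.4g%n", safetyValue, maintValue, probability);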
What to submit
You should submit two things: a query-answering program and a readme.txt.
- Your program should use its first command line argument as the path to an input file. If, for example, your solver’s main method is in a Java class named FJDQuery, we should be able to use java FJDQuery /Users/liberato/testcase to direct your program to read the query in the file located at /Users/liberato/testcase. We will place a copy of the car.data file in the same directory as your program.
- Your program should print the computed distribution to standard output, in exactly the format described above.
Submit the source code of your program, written in the language of your choice. Name the file containing the main() method FJDQuery.java or your language’s equivalent. If the file you submit depends upon other files, be sure to submit these other files as well.
As in the previous assignment, while you may use library calls for data structures and the like, you must implement the marginalization method you use yourself. Do not use a library. We will consider it plagiarism if you do. Check with us if you think there’s any ambiguity.
Your readme.txt should contain the following items:
- your name
- if the language of your choice is not Java, Python, Ruby, node.js-compatible JavaScript, ANSI C or C++ (or if you’re concerned it’s not completely obvious to us how to compile and execute it), a description of how to compile and execute the submitted files
- a description of what you got working, what is partially working and what is completely broken
If you’re using language features that require a specific version of your language or runtime, check for that version at program start and fail if it’s not present, emitting an understandable error message indicating this fact. Your program must compile and execute on the Edlab Linux machines.
If your program does not compile or execute, you will receive no credit. Check with us in advance if you’re concerned.
Grading
We will run your program on a variety of test cases. The exact test cases will not be available to you before grading. You are welcome to write and distribute your own test cases.
If your readme.txt
is missing or judged insufficient, your overall score may be penalized by up to ten percent.
We’re not going to feed your program incorrectly formatted input, so you need only concern yourself with handling input in the format described in the assignment.
We expect valid output. Generating output that is not in the format described in the assignment will result in a failed test case. We will check that your output value is within 1% of the correct value. Given the data and queries in this assignment, storing all intermediate values as double-precision floating-point numbers (Java doubles) will stay well within this margin of error.
I do not expect anything in a solution to this assignment to be particularly memory- or CPU-intensive. But as usual, if your program exceeds available heap memory (which we’ll set to 1 GB in Java, using the -Xmx1024M argument if necessary), or if it does not terminate in twenty seconds, we will consider the test case failed.
Questions and answers
This assignment seems too easy. All you have to do is count things up and divide by other counts. Am I overlooking something?
I don’t think you are overlooking anything. This assignment is on the easier side of those that you’ve been asked to do. Next week’s (not yet posted) will involve the conceptually more difficult problem of approximate inference on Bayesian networks, but I wanted to get you started with an easier case.
One minor note, though. At least one student showed me their work in progress, which consisted of six or seven nested for loops to do the counting. This approach is OK in the sense that it works here, but it won’t generalize to data whose variables aren’t known in advance.
The copy-pasting involved is known as a “code smell,” and should immediately signal to you that you’re overlooking an abstraction. Leaving aside the theoretical benefits of abstraction, there’s a very concrete one: using the right abstraction results in much shorter code that’s easier to debug and test.
What’s the right abstraction? It depends. One approach is to read in the entire dataset, then, for each condition, remove the irrelevant (that is, retain only the relevant) rows of the table. If you’re using Java, you need only write this filtering call once (or use Java 8’s builtins, or Google’s Guava library, etc.). Another is to perform the filtering when reading in the query.
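For instance, the filtering might look something like this with Java 8 streams (a sketch; representing a row as a String[] and the evidence as allowed values per column index are my own assumptions):

import java.util.*;
import java.util.stream.Collectors;

public class Filtering {
    // keep only the rows consistent with every piece of evidence;
    // evidence maps a column index to the set of values allowed there
    static List<String[]> applyEvidence(List<String[]> rows,
                                        Map<Integer, Set<String>> evidence) {
        return rows.stream()
                   .filter(row -> evidence.entrySet().stream()
                       .allMatch(e -> e.getValue().contains(row[e.getKey()])))
                   .collect(Collectors.toList());
    }
}

Written this way, all of the evidence is checked in a single pass over the rows, a point that comes up again in the questions below.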
The text file will have only a single query, correct? (Even if it spans multiple lines.)
Yes. One input will have one query, consisting of a line giving the variables being queried, and zero or more lines describing the evidence variables and values.
We read the data into some format and then query that data for a distribution on a variable (or variables) given the evidence.
The hardest part is deciding how to store the data. I think most folks’ first thought is a multidimensional array, but that gets out of hand after about 3 levels in (and we need 7).
My idea was that it would be great if this were some sort of SQL database (I’m currently taking a database class).
You certainly could take this approach, either with an in-memory or on-disk database. Java has several options for such an approach, such as the Xerial sqlite wrapper, or Apache Derby. But it’s probably overkill since you don’t require the ACID properties of SQL – you’d only be using the query language, and minimally at that.
However, thinking about it, this is just one relatively big table. So my idea is to create a class that has fields for each attribute (buying, maint, doors, etc.), have each record of the data be an instance of it, and just keep them all in one big list structure (since we need to loop over the entire thing every time anyway).
Queries on the data would then just scan the entire list, checking whether each record belongs to a setting of the variables or not (skipping over data that doesn’t match the evidence).
My question is whether or not this is an okay way of looking at the problem and structuring everything?
In short, yes, this approach is reasonable (though there are other equally reasonable approaches).
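For concreteness, here is a sketch of such a record class (the names and the generic accessor are my own additions; nothing about this design is required):

public class CarRecord {
    final String buying, maint, doors, persons, lugBoot, safety, car;

    CarRecord(String line) {            // one comma-separated line of car.data
        String[] v = line.split(",");
        buying = v[0]; maint = v[1]; doors = v[2]; persons = v[3];
        lugBoot = v[4]; safety = v[5]; car = v[6];
    }

    // a generic accessor keeps query code from hard-coding attribute names,
    // avoiding the nested-loop code smell discussed above
    String get(String variable) {
        switch (variable) {
            case "buying":   return buying;
            case "maint":    return maint;
            case "doors":    return doors;
            case "persons":  return persons;
            case "lug_boot": return lugBoot;
            case "safety":   return safety;
            case "car":      return car;
            default: throw new IllegalArgumentException(variable);
        }
    }
}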
My main issue is that having more variables will require multiple passes over the data (4 and 12 passes on just the examples given).
Some comments:
- “Require” is perhaps too strong. You could store the list of conditions/evidence variables, and check all of them when you check a given row, rather than requiring a separate pass for each.
- Alternatively, you could pass over the data once and just store the FJD for each setting of the variables. Then it’s a much smaller table (though still nontrivial) to iterate over.
- Even if you do 12 passes, it’s still a (relatively) small constant on an otherwise linear algorithm. So it’s not too bad in this particular case.
Also, when evidence is given, the program can spend a bunch of time looking at data it doesn’t need to. It’s not such a big issue with the data size we have (I think), but I’m not sure.
It’s not a big issue here. The FJD is of a large but manageable size (as is the raw data, if you represent it efficiently and do a single or small number of passes over it).
But this is the crux of the problem with exact inference in general: the size of the FJD is exponential in the number of variables, and so iteration over it becomes infeasible.
Does the order of the output matter?
No, it does not.
Also, is there no upper limit on the number of significant figures? (I know you said a lower limit of 3.)
No. And in fact, we’ll accept fewer, so long as you’re within the 1% margin of error.
doors or door?
doors
Can the variables in the first line (the distribution variables) also be in the condition variables?
For example, is this a valid query:
persons
persons 4 more
I will not provide queries of this form. There’s nothing intrinsically paradoxical about such a query, though, and at least one other student already supports this functionality. Depending upon your implementation choices, you might get it “for free” (that is, without extra work).
Can the condition variables be repeated in more than one line?
For example, is this a valid query:
safety persons
maint high
maint med
No, they will not be repeated across lines.