CMPSCI 691GM : Graphical Models
Spring 2011
Homework #1: Directed Graphical Models
Due Dates:
Tuesday February 1, 2011: Email working source code
Thursday February 3, 2011: Email report and revised source code
In this homework assignment you will implement and
experiment with directed graphical models and write a short report
describing your experiences and findings. We will provide you with
simulated "medical
domain" data based on the "flu" example in class; however, you are
welcome to find and use your own data instead.
Medical Domain Data
We have provided you with a joint probability distribution of
symptoms, conditions, and diseases based on the "flu" example in class. Certain
diseases are more likely than others given certain symptoms, and a
model such as this can be used to help doctors make a diagnosis.
(Don't
actually use this for diagnosis, though!). The ground-truth joint
probability distribution consists of twelve binary random variables
and contains 2^12 possible configurations (numbered 0 to 4095), which
is small enough that
you can enumerate them exhaustively. The variables are as follows:
- (0) IsSummer true if it is the summer
season, false otherwise.
- (1) HasFlu true if the patient has the
flu.
- (2) HasFoodPoisoning true if the patient
has food poisoning.
- (3) HasHayFever true if patient has hay
fever.
- (4) HasPneumonia true if the patient has
pneumonia.
- (5) HasRespiratoryProblems true if the
patient has problems in the respiratory system.
- (6) HasGastricProblems true if the patient
has problems in the gastro-intestinal system.
- (7) HasRash true if the patient has a skin
rash.
- (8) Coughs true if the patient has a
cough.
- (9) IsFatigued true if the patient is
tired and fatigued.
- (10) Vomits true if the patient has
vomited.
- (11) HasFever true if the patient has a
high fever.
You can download all the data here.
The archive contains two files:
- joint.dat: The true joint probability distribution over
the twelve binary variables. Since each variable is binary, we can
represent a full variable assignment as a bitstring. This file lists
all 2^12 assignments (one in each line) as pairs "Integer Probability"
where "Integer" is an integer encoding of the bitstring. Specifically,
assuming false=0 and true=1, an assignment to all variables yields
a 12-bit binary number (with bit positions given by the variable indices shown in
parentheses above), which is converted to a decimal number. For example,
assignment 0 represents all variables are false, 1 represents only IsSummer
is true, 2 represents only HasFlu is true, and so on.
- dataset.dat: The dataset consists of samples from the
above probability distribution. Each line of the file contains a
complete assignment to all the variables, encoded as an integer (as
described above).
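To make the encoding concrete, here is one way (sketched in Python, though any language is fine) to convert between the integer codes used in both files and explicit assignments to the twelve variables:

```python
# Names in index order; variable i occupies bit i of the integer code
# (so code 1 -> only IsSummer true, code 2 -> only HasFlu true).
NAMES = ["IsSummer", "HasFlu", "HasFoodPoisoning", "HasHayFever",
         "HasPneumonia", "HasRespiratoryProblems", "HasGastricProblems",
         "HasRash", "Coughs", "IsFatigued", "Vomits", "HasFever"]

def decode(code):
    """Integer 0..4095 -> tuple of 12 booleans."""
    return tuple(bool((code >> i) & 1) for i in range(12))

def encode(assign):
    """Inverse of decode: tuple of 12 booleans -> integer 0..4095."""
    return sum(int(b) << i for i, b in enumerate(assign))
```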
Core Tasks (for everyone)
- Graphical Model: Use your intuition to design a directed
graphical model for the twelve variables outlined above.
Implement it in the programming language of your choice. You
could begin your implementation simply using randomly assigned
parameters. Given these parameters and an assignment to all 12
variables, your implementation should be able to return the probability
of the full assignment.
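As a sketch of what such an implementation might look like (the structure below is purely illustrative, not the one used to generate the data; designing a sensible structure is your task), one can store the model as a parent list per variable plus one Bernoulli parameter per parent configuration:

```python
import itertools
import random

# Illustrative structure ONLY -- parents[v] lists the parent indices of
# variable v, using the variable numbering given in the handout.
PARENTS = {
    0: [],           # IsSummer
    1: [0],          # e.g. flu may depend on season (an assumption)
    2: [], 3: [0], 4: [],
    5: [1, 3, 4],    # respiratory problems from flu/hay fever/pneumonia
    6: [2], 7: [3], 8: [5],
    9: [1, 4], 10: [6], 11: [1, 2, 4],
}

def random_cpts(parents, seed=0):
    """One Bernoulli parameter P(v=true | parent config) per configuration."""
    rng = random.Random(seed)
    return {(v, cfg): rng.random()
            for v, ps in parents.items()
            for cfg in itertools.product([False, True], repeat=len(ps))}

def joint_prob(assign, parents, cpts):
    """P(assign) = product over v of P(assign[v] | values of v's parents)."""
    p = 1.0
    for v, ps in parents.items():
        theta = cpts[(v, tuple(assign[u] for u in ps))]
        p *= theta if assign[v] else 1.0 - theta
    return p
```

A useful sanity check is that the probabilities of all 2^12 assignments sum to one, which holds for any acyclic structure and any parameter values.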
- Estimating Parameters: Use the dataset (i.e.
dataset.dat) to estimate the parameters of your graphical model. You
can do this by simply counting and normalizing: enumerate
all the assignments in the dataset and, for each variable v,
count the number of times v is true for each assignment to its
parents; then normalize the counts by the total number of times
the parents had that assignment.
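The count-and-normalize step might look like the following sketch. The add-alpha smoothing is an optional extra beyond plain counting; alpha=0 gives the pure maximum-likelihood estimate (which is undefined for parent configurations that never occur, handled here with a 0.5 default):

```python
import itertools
from collections import Counter

def estimate_cpts(samples, parents, alpha=1.0):
    """Estimate P(v=true | parent config) from decoded samples (bool tuples)."""
    true_counts, total_counts = Counter(), Counter()
    for a in samples:
        for v, ps in parents.items():
            cfg = tuple(a[u] for u in ps)
            total_counts[(v, cfg)] += 1
            if a[v]:
                true_counts[(v, cfg)] += 1
    cpts = {}
    for v, ps in parents.items():
        for cfg in itertools.product([False, True], repeat=len(ps)):
            n, k = total_counts[(v, cfg)], true_counts[(v, cfg)]
            denom = n + 2 * alpha  # add-alpha smoothing (optional)
            cpts[(v, cfg)] = (k + alpha) / denom if denom > 0 else 0.5
    return cpts
```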
- Model Accuracy: Measure the similarity of your model to
the true joint probability distribution (i.e., joint.dat). That is, for
each assignment, how similar are the probabilities returned by your
model to the true probability distribution. To keep things
simple, you can compare the distributions
based on their L1-distance. That is, for each assignment ai
to all the variables, obtain p(ai) from the true joint distribution
(the (i+1)th row in joint.dat) and q(ai) from your model. The
distance is defined as |p(a0)-q(a0)| + |p(a1)-q(a1)|
+ ... + |p(a4095)-q(a4095)|. An
alternative distance measure more appropriate to probability
distributions is KL-divergence. If you know what that is, and
want to use it, you can evaluate using KL-divergence also.
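Both measures are only a few lines of code; for example, assuming both distributions are given as length-4096 sequences indexed by the integer encoding:

```python
import math

def l1_distance(p_true, p_model):
    """Sum of |p(a) - q(a)| over all assignments."""
    return sum(abs(p - q) for p, q in zip(p_true, p_model))

def kl_divergence(p_true, p_model, eps=1e-12):
    """KL(p || q); eps guards against zero model probabilities."""
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(p_true, p_model) if p > 0)
```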
- Querying: Use the graphical model above to answer some
queries. A query consists of observed variables (for which we have an
assignment), and query variables over which we want the distribution.
The remaining variables need to be marginalized (by summing them out).
Since the domain is small you can implement this conditioning and
marginalizing process by
exhaustively enumerating all assignments (note that only assignments
that are consistent with the observed values should be taken into
account). Compare the results of these queries on your model to results
obtained from using the true joint probability distribution. Try
to think of some interesting queries that will demonstrate causal
reasoning, evidential reasoning, and inter-causal reasoning. To
get you started, here are
some examples of queries to consider (but also create new ones of your
own design):
- What is the probability a patient has flu given they are
coughing and have a high fever? (Observed Variables: HasFever=true,
Coughs=true; Query Variable: HasFlu)
- What is the probability distribution over the symptoms (HasRash,
Coughs, IsFatigued, Vomits,
and HasFever) given the patient has pneumonia?
- What is the probability of vomiting in summer?
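One possible sketch of the exhaustive conditioning-and-marginalization procedure, assuming the joint probability (true or model-based) is available as a function of a full assignment:

```python
def query(joint, observed, query_vars, n_vars=12):
    """Exhaustive conditional marginal, O(2^n_vars).
    joint: function from a full boolean assignment (tuple) to its probability.
    observed: {variable index: observed value}; query_vars: variable indices.
    Returns {assignment to query_vars: P(query_vars | observed)}."""
    totals, z = {}, 0.0
    for code in range(2 ** n_vars):
        a = tuple(bool((code >> i) & 1) for i in range(n_vars))
        if any(a[v] != val for v, val in observed.items()):
            continue  # skip assignments inconsistent with the evidence
        p = joint(a)
        z += p  # running normalizer P(observed)
        key = tuple(a[v] for v in query_vars)
        totals[key] = totals.get(key, 0.0) + p
    return {k: v / z for k, v in totals.items()}
```

For instance, the first example query above would be `query(joint, {11: True, 8: True}, [1])` under the variable numbering in this handout.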
Further Fun
Although not required, we hope you will be eager to experiment
further with your model. Here are some ideas for additional
things to try. Of course, you may come up with even more
exciting ideas of your own, and we encourage that. Whatever
you try, be sure to tell us what you did in your write-up.
- Varying the Structure: Experiment with multiple
different
graphical structures (try adding or
removing edges, or even changing the structure completely). Compare
these different structures to the true distribution table (using the
metric in 3). How close can you get? Do you think you can find
the structure that we used to generate P and the data? As a
baseline model, consider one
that assumes all the variables are independent. See also
"Structure Learning" below.
- Conditional Independencies: Using the definition of
statistical independence, try to identify as many conditional
independencies as you can using the true joint distribution
(joint.dat). Assume no variable has more than three parents. Try using
these discovered independencies to construct a model and compare it to
the one built using intuition (as in 1).
- More datasets: Find your own dataset and model it using
a directed graphical model. For example, you might like to work
with some data that has continuous-valued variables.
- Forward Sampling: Implement forward sampling and use
your model to generate your own data. Forward sampling starts by
sampling an assignment to variables that do not have any parents, and
then sampling the rest of the variables by making sure that the parents
of each variable already have an assigned value. Each iteration of this
sampling results in a single assignment to all variables, which can be
repeated to generate a dataset. Perhaps use this data to re-estimate
the
parameters of your model (as in 2), and compare this new model to the
true joint distribution (as in 3). Explore convergence of parameter
estimation by varying the
number of data points generated and used to re-estimate your model.
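A sketch of forward sampling, assuming the model is stored as parent lists and conditional probability tables keyed by (variable, parent configuration) as suggested in the core tasks:

```python
import random

def topological_order(parents):
    """Order the variables so every parent precedes its children
    (assumes the graph is acyclic)."""
    order, placed = [], set()
    while len(order) < len(parents):
        for v, ps in parents.items():
            if v not in placed and all(u in placed for u in ps):
                order.append(v)
                placed.add(v)
    return order

def forward_sample(parents, cpts, rng=None):
    """Draw one full assignment; cpts maps (v, parent config) to P(v=true)."""
    rng = rng or random.Random(0)
    assign = {}
    for v in topological_order(parents):
        cfg = tuple(assign[u] for u in parents[v])  # parents already sampled
        assign[v] = rng.random() < cpts[(v, cfg)]
    return tuple(assign[v] for v in sorted(assign))
```

Calling `forward_sample` repeatedly (with one shared `rng`) then yields a synthetic dataset.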
- Domain Expansion: Expand the domain of some variables in
your model to be discrete or real. For example, fever could be a
real-valued temperature. Augment your model with reasonable
probabilities for these variables. Demonstrate your implementation by
answering some queries (similar to 4 above).
- Likelihood: Write code to compute likelihood of a
dataset given a probability distribution. Use this to compute the
likelihood of the data (dataset.dat) according to the true distribution
(joint.dat) and your implemented model. For (2) above, split the
dataset into training and test sets, estimate the parameters using the
training set, and compute the likelihood of the test set using the
resulting model. Vary the size of the train/test split and explore how
the size of the train set affects the likelihood of the test set under
models with more or fewer dependencies. Can you detect
over-fitting with a complex model and a small amount of training data?
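The likelihood computation itself is short; working in log space avoids numerical underflow when the dataset is large. A simple split helper (an illustrative convenience, not a required interface) is included:

```python
import math
import random

def log_likelihood(samples, prob):
    """Sum of log p(sample); prob maps one full assignment to its probability."""
    return sum(math.log(prob(a)) for a in samples)

def train_test_split(samples, frac=0.8, seed=0):
    """Shuffle and split the dataset; frac is the training fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]
```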
- Structure Learning: Instead of using intuition or the
conditional independencies from the joint distribution, use search to
discover an appropriate structure from the data. The search can
be greedy: start with all variables completely independent, and
add an edge only if doing so increases the test set likelihood.
Note that this will involve parameter estimation on the training
data for each edge added. To make the search more efficient,
assume that no variable has more than three parents.
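A sketch of the greedy search loop. Here `fit` and `score` are placeholders standing in for the parameter-estimation and test-likelihood code from the earlier tasks, and the ancestor check keeps the graph acyclic:

```python
def is_ancestor(parents, a, b):
    """True if variable a is an ancestor of variable b."""
    stack, seen = list(parents[b]), set()
    while stack:
        u = stack.pop()
        if u == a:
            return True
        if u not in seen:
            seen.add(u)
            stack.extend(parents[u])
    return False

def greedy_structure(train, test, n_vars, fit, score, max_parents=3):
    """Greedily add single edges while held-out likelihood improves.
    fit(parents, train) -> model; score(model, test) -> test log-likelihood."""
    parents = {v: [] for v in range(n_vars)}
    best = score(fit(parents, train), test)
    improved = True
    while improved:
        improved = False
        for child in range(n_vars):
            for parent in range(n_vars):
                if (parent == child or parent in parents[child]
                        or len(parents[child]) >= max_parents
                        or is_ancestor(parents, child, parent)):
                    continue  # duplicate edge, fan-in limit, or would create a cycle
                parents[child].append(parent)  # tentatively add parent -> child
                s = score(fit(parents, train), test)
                if s > best:
                    best, improved = s, True  # keep the edge
                else:
                    parents[child].remove(parent)  # revert
    return parents
```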
- Your own ideas!...
What to hand in
The homework should be emailed to 691gm-staff@cs.umass.edu
before 5pm Eastern time on the due date.
- You should provide a short (1-3 page) report describing your
explorations and
results. Include a description of the graphical model
you implemented (e.g., a diagram, or a list of parents and conditional
probabilities of the form: p(a|b,c,..)). Discuss the compactness (how
many parameters does your model use?) and accuracy (how close were you
to the true joint probability distribution?) of your model. Discuss
your experiences in writing the query-implementing code and the
queries your model was used to answer. Also include details of the
optional tasks you did.
- Include the complete source code of your implementation.
- Description of external datasets you used (if any).
Grading
The assignment will be graded on (a) core task completion and correctness,
(b) effort and creativity in the optional extras, and (c) quality and
clarity of your written report.
Questions?
Please ask! Send email to 691gm-staff@cs.umass.edu or come to the
office hours. If you'd like your classmates to be able to help answer your question, feel free to use 691gm-all@cs.umass.edu.