CMPSCI 691GM : Graphical Models
Spring 2011
Homework #1: Directed Graphical Models
Due Dates:
Tuesday February 1, 2011: Email working source code
Thursday February 3, 2011: Email report and revised source code
In this homework assignment you will implement and
experiment with directed graphical models and write a short report
describing your experiences and findings. We will provide you with
simulated "medical domain" data based on the "flu" example in class; however, you are
welcome to find and use your own data instead.
Medical Domain Data
We have provided you with a joint probability distribution of
symptoms, conditions and diseases based on the "flu" example in class. Certain
diseases are more likely than others given certain symptoms, and a
model such as this can be used to help doctors make a diagnosis.
(Don't actually use this for diagnosis, though!) The ground-truth joint
probability distribution is defined over twelve binary random variables
and contains 2^12 = 4096 possible configurations (numbered 0 to 4095), which
is small enough that you can enumerate them exhaustively. The variables are as follows:
- (0) IsSummer true if it is the summer
season, false otherwise.
- (1) HasFlu true if the patient has the
flu.
- (2) HasFoodPoisoning true if the patient
has food poisoning.
- (3) HasHayFever true if patient has hay
fever.
- (4) HasPneumonia true if the patient has
pneumonia.
- (5) HasRespiratoryProblems true if the
patient has problems in the respiratory system.
- (6) HasGastricProblems true if the patient
has problems in the gastro-intestinal system.
- (7) HasRash true if the patient has a skin
rash.
- (8) Coughs true if the patient has a
cough.
- (9) IsFatigued true if the patient is
tired and fatigued.
- (10) Vomits true if the patient has
vomited.
- (11) HasFever true if the patient has a
high fever.
You can download all the data here.
The archive contains two files:
- joint.dat: The true joint probability distribution over
the twelve binary variables. Since each variable is binary, we can
represent a full variable assignment as a bitstring. This file lists
all 2^12 assignments (one in each line) as pairs "Integer Probability"
where "Integer" is an integer encoding of the bitstring. Specifically,
assuming false=0 and true=1, an assignment to all variables results in
a 12-bit binary number, in which the variable with index i (shown in
parentheses above) occupies bit i, counting from the least significant bit.
The file lists this number in decimal. For example,
assignment 0 means all variables are false, 1 means only IsSummer
is true, 2 means only HasFlu is true, 3 means only IsSummer and HasFlu
are true, and so on.
- dataset.dat: The dataset consists of samples from the
above probability distribution. Each line of the file contains a
complete assignment to all the variables, encoded as an integer (as
described above).
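The integer encoding described above can be sketched in Python as follows. This is only an illustration: the helper names (decode, encode, parse_joint, parse_dataset) are hypothetical, and the parsers assume the files are whitespace-separated text with one entry per line, as the description suggests.

```python
from typing import Dict, Iterable, List

# Variable names in index order, as listed in the assignment (bit i <-> variable i).
VARIABLES = [
    "IsSummer", "HasFlu", "HasFoodPoisoning", "HasHayFever", "HasPneumonia",
    "HasRespiratoryProblems", "HasGastricProblems", "HasRash", "Coughs",
    "IsFatigued", "Vomits", "HasFever",
]


def decode(code: int) -> Dict[str, bool]:
    """Integer encoding -> {variable name: truth value}; bit i holds variable i."""
    return {name: bool((code >> i) & 1) for i, name in enumerate(VARIABLES)}


def encode(assignment: Dict[str, bool]) -> int:
    """Inverse of decode: set bit i when variable i is true."""
    return sum(1 << i for i, name in enumerate(VARIABLES) if assignment[name])


def parse_joint(lines: Iterable[str]) -> List[float]:
    """Parse "Integer Probability" lines into a list indexed by the integer code.

    Assumes whitespace-separated fields (an assumption about the file format).
    """
    table = [0.0] * (1 << len(VARIABLES))
    for line in lines:
        if line.strip():
            code, prob = line.split()
            table[int(code)] = float(prob)
    return table


def parse_dataset(lines: Iterable[str]) -> List[int]:
    """Parse one integer-encoded sample per line, skipping blank lines."""
    return [int(line) for line in lines if line.strip()]
```

With the files downloaded, something like parse_joint(open("joint.dat")) and parse_dataset(open("dataset.dat")) would load them; decode(2) returns a dictionary in which only HasFlu is true.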
Core Tasks (for everyone)
- Graphical Model: Use your intuition to design a directed
graphical model for the twelve variables outlined above.
Implement it in the programming language of your choice. You
could begin your implementation work with randomly assigned
parameters. Given these parameters and an assignment to all 12
variables, your implementation should be able to return the probability
of the full assignment.
- Estimating Parameters: Use the dataset (i.e.,
dataset.dat) to estimate the parameters of your graphical model. You
can do this by simply counting and normalizing: enumerate
all the assignments in the dataset and,
for each variable v, count the number of times v is
true for each assignment to its parents; then normalize the counts
by the total number of times the parents had that assignment.
- Model Accuracy: Measure the similarity of your model to
the true joint probability distribution (i.e., joint.dat). That is, for
each assignment, how similar is the probability returned by your
model to the true probability? To keep things
simple, you can compare the distributions
based on their L1-distance. That is, for each assignment a_{i}
to all the
variables, obtain p_{true}(a_{i}) from the true joint distribution (the (i+1)^{th}
row in joint.dat) and p_{model}(a_{i}) from your model. The
distance is defined as |p_{true}(a_{0})-p_{model}(a_{0})| + |p_{true}(a_{1})-p_{model}(a_{1})|
+ ... + |p_{true}(a_{4095})-p_{model}(a_{4095})|. An
alternative distance measure more appropriate to probability
distributions is KL-divergence. If you know what that is and
want to use it, you can evaluate using KL-divergence also.
- Querying: Use the graphical model above to answer some
queries. A query consists of observed variables (for which we have an
assignment) and query variables over which we want the
distribution.
The remaining variables need to be marginalized (by summing them out).
Since the domain is small you can implement this conditioning and
marginalizing process by
exhaustively enumerating all assignments (note that only assignments
that are consistent with the observed values should be taken into
account). Compare the results of these queries on your model to results
obtained from using the true joint probability distribution. Try
to think of some interesting queries that will demonstrate causal
reasoning, evidential reasoning, and inter-causal reasoning. To
get you started, here are
some examples of queries to consider (but also create new ones of your
own design):
- What is the probability a patient has flu given they are
coughing and have a high fever? (Observed Variables: HasFever=true,
Coughs=true; Query Variable: HasFlu)
- What is the probability distribution over the symptoms (HasRash,
Coughs, IsFatigued, Vomits,
and HasFever) given the patient has pneumonia?
- What is the probability of vomiting in summer?
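The four core tasks above (factorized joint probability, counting and normalizing, L1-distance, and querying by enumeration) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference solution: the PARENTS structure is a hypothetical guess rather than the structure used to generate the data, all function names are made up for this sketch, and the 0.5 fallback for parent assignments that never occur in the data is our own choice.

```python
from collections import defaultdict
from itertools import product

N = 12  # number of binary variables; bit i of an integer code is variable i

# HYPOTHETICAL structure for illustration only (child index -> parent indices).
# It is one intuitive guess, not the structure that generated the data.
PARENTS = {
    0: [], 1: [0], 2: [0], 3: [0], 4: [0],
    5: [1, 3, 4], 6: [2], 7: [3], 8: [5], 9: [1, 4], 10: [6], 11: [1, 4],
}


def bit(code, i):
    """Value (0 or 1) of variable i in an integer-encoded assignment."""
    return (code >> i) & 1


def estimate_cpts(data, parents):
    """Count and normalize: cpts[v][parent_values] = P(v = true | parents)."""
    true_counts = {v: defaultdict(int) for v in parents}
    totals = {v: defaultdict(int) for v in parents}
    for code in data:
        for v, pa in parents.items():
            key = tuple(bit(code, p) for p in pa)
            totals[v][key] += 1
            true_counts[v][key] += bit(code, v)
    cpts = {}
    for v, pa in parents.items():
        cpts[v] = {}
        for key in product((0, 1), repeat=len(pa)):
            n = totals[v][key]
            # Parent assignments never seen in the data fall back to 0.5
            # (an assumption of this sketch, not part of the assignment).
            cpts[v][key] = true_counts[v][key] / n if n else 0.5
    return cpts


def joint_prob(code, parents, cpts):
    """P(full assignment) = product over variables of P(v | parents(v))."""
    prob = 1.0
    for v, pa in parents.items():
        p_true = cpts[v][tuple(bit(code, p) for p in pa)]
        prob *= p_true if bit(code, v) else 1.0 - p_true
    return prob


def model_table(parents, cpts):
    """Full joint table of the model, indexed by integer code."""
    return [joint_prob(code, parents, cpts) for code in range(1 << N)]


def l1_distance(p, q):
    """Sum of absolute differences between two joint tables."""
    return sum(abs(a - b) for a, b in zip(p, q))


def query(table, observed, query_vars):
    """P(query_vars | observed) by exhaustive enumeration over a joint table.

    observed maps variable index -> 0/1; the result maps each tuple of
    query-variable values to its conditional probability.
    """
    result = defaultdict(float)
    for code, prob in enumerate(table):
        # Only assignments consistent with the observed values contribute.
        if all(bit(code, v) == val for v, val in observed.items()):
            result[tuple(bit(code, v) for v in query_vars)] += prob
    z = sum(result.values())  # probability of the evidence
    if z == 0:
        raise ValueError("observed assignment has zero probability")
    return {key: val / z for key, val in result.items()}
```

For instance, query(table, {11: 1, 8: 1}, [1]) gives the distribution over HasFlu given HasFever=true and Coughs=true. Because query works on any table indexed by integer code, the same call can be run on the model's table and on the true table from joint.dat to compare the answers.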
Further Fun
Although not required, we hope you will be eager to experiment
further with your model. Here are some ideas for additional
things to try. You may well come up with even more
exciting ideas of your own, and we encourage that. Just
be sure to tell us what you did in your write-up.
- Varying the Structure: Experiment with multiple
different
graphical structures (try adding or
removing edges, or even changing the structure completely). Compare
these different structures to the true distribution table (using the
distance from the Model Accuracy task). How close can you get? Do you think you can find
the structure that we used to generate the joint distribution and the data? As a
baseline model, consider one
that assumes all the variables are independent. See also
"Structure Learning" below.
- Conditional Independencies: Using the definition of
statistical independence, try to identify as many conditional
independencies as you can using the true joint distribution
(joint.dat). Assume no variable has more than three parents. Try using
these discovered independencies to construct a model and compare it to
the one built using intuition (as in the Graphical Model task).
- More datasets: Find your own dataset and model it using
a directed graphical model. For example, you might like to work
with some data that has continuous-valued variables.
- Forward Sampling: Implement forward sampling and use
your model to generate your own data. Forward sampling starts by
sampling an assignment to variables that do not have any parents, and
then sampling the rest of the variables by making sure that the parents
of each variable already have an assigned value. Each iteration of this
sampling results in a single assignment to all variables, which can be
repeated to generate a dataset. Perhaps use this data to re-estimate
the
parameters of your model (as in the Estimating Parameters task), and compare this new model to the
true joint distribution (as in the Model Accuracy task). Explore convergence of parameter
estimation by varying the
number of data points generated and used to re-estimate your model.
- Domain Expansion: Expand the domain of some variables in
your model to be multi-valued discrete or real-valued. For example, fever could be a
real-valued temperature. Augment your model with reasonable
probabilities for these variables. Demonstrate your implementation by
answering some queries (similar to those in the Querying task above).
- Likelihood: Write code to compute the likelihood of a
dataset given a probability distribution. Use this to compute the
likelihood of the data (dataset.dat) according to the true distribution
(joint.dat) and according to your implemented model. For the Estimating
Parameters task above, split the
dataset into training and test sets, estimate the parameters using the
training set, and compute the likelihood of the test set using the
resulting model. Vary the size of the train/test split and explore how
the size of the training set affects the likelihood of the test set under
models with more or fewer dependencies. Can you detect
over-fitting with a complex model and a small amount of training data?
- Structure Learning: Instead of using intuition or the
conditional independencies from the joint distribution, use search to
discover the appropriate structure from data. The search can
be greedy: start with all variables completely
independent, and add an edge whenever adding it results in an
increase in the test set likelihood. Note that this will involve
parameter estimation using the training data for each edge added. To make the
search more efficient, assume that no variable has more than three
parents.
- Your own ideas!...
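Two of the ideas above, forward sampling and dataset likelihood, can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes the model is stored as a dictionary parents (child index -> list of parent indices) plus a dictionary cpts where cpts[v][parent_values] is P(v = true | parents), with assignments encoded as integers whose bit i is variable i. The likelihood is computed in log space, which is our own choice to avoid numerical underflow on large datasets.

```python
import math
import random


def topological_order(parents):
    """Order variables so that every variable comes after all of its parents."""
    order, placed = [], set()
    while len(order) < len(parents):
        ready = [v for v in parents
                 if v not in placed and all(p in placed for p in parents[v])]
        if not ready:
            raise ValueError("structure contains a cycle")
        order.extend(ready)
        placed.update(ready)
    return order


def forward_sample(parents, cpts, rng):
    """Draw one full assignment, returned as an integer code (bit i = variable i)."""
    values = {}
    for v in topological_order(parents):
        # All parents of v already have values because of the ordering.
        p_true = cpts[v][tuple(values[p] for p in parents[v])]
        values[v] = 1 if rng.random() < p_true else 0
    return sum(val << v for v, val in values.items())


def log_likelihood(data, table):
    """Sum of log-probabilities of the samples under a full joint table.

    table[code] is the probability of the assignment with that integer code.
    Returns -inf if any sample has probability zero under the table.
    """
    total = 0.0
    for code in data:
        if table[code] == 0.0:
            return float("-inf")
        total += math.log(table[code])
    return total
```

Calling forward_sample repeatedly with a shared random.Random instance yields a synthetic dataset in the same integer format as dataset.dat, so it can be fed straight back into parameter estimation or into log_likelihood.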
What to hand in
The homework should be emailed to 691gm-staff@cs.umass.edu
before 5pm Eastern time on the due date.
- You should provide a short (1-3 page) report describing your
explorations and
results. Include a description of the graphical model
you implemented (e.g., a diagram, or a list of parents and conditional
probabilities of the form: p(a|b,c,..)). Discuss the compactness (how
many parameters does your model use?) and accuracy (how close were you
to the true joint probability distribution?) of your model. Discuss
your experiences in writing the query-implementing code and the
queries your model was used to answer. Also include details of the
optional tasks you did.
- Include the complete source code of your implementation.
- Include a description of any external datasets you used.
Grading
The assignment will be graded on (a) core task completion and correctness,
(b) effort and creativity in the optional extras, and (c) quality and
clarity of your written report.
Questions?
Please ask! Send email to 691gm-staff@cs.umass.edu or come to the
office hours. If you'd like your classmates to be able to help answer your question, feel free to use 691gm-all@cs.umass.edu.