# Administrative notes

- A05 due Friday

# Remember Bayes Nets?

A simple, graphical notation for conditional independence assertions, and hence for compact specification of joint distributions.

Syntax:
- a set of nodes, one per variable
- a directed, acyclic graph (link ≈ "directly influences")
- a conditional distribution for each node given its parents: P(X_i | Parents(X_i))

In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values:

P(X_1, …, X_n) = ∏_i P(X_i | Parents(X_i))

# Independence

A node X is conditionally independent of its non-descendants given its parents.

A node X is conditionally independent of *all other* nodes given its Markov blanket (its parents, children, and children's parents).

# In-class exercise 14.11

In your local nuclear power station, there is an alarm that sounds when a temperature gauge exceeds a given threshold. The gauge measures the temperature of the core. Consider the Boolean variables A (alarm sounds), F_A (alarm is faulty), and F_G (gauge is faulty), and the multivalued nodes G (gauge reading) and T (actual core temperature).

a. Draw a Bayesian network for this domain, given that the gauge is more likely to fail when the core temperature gets high.

T → F_G; T → G; F_G → G; F_A → A; G → A

b. Is the network a polytree? What property do polytrees entail for exact inference in Bayes Nets?

No: T, F_G, and G form an undirected cycle. In a polytree, exact inference is linear in the size of the network.

c. Suppose there are just two possible actual and measured temperatures, normal and high; the probability that the gauge gives the correct temperature is x when it is working, but y when it is faulty. Give the CPT associated with G.

| F_G | T | P(G=h \| F_G, T) |
|-----|---|------------------|
| t   | n | 1 − y            |
| t   | h | y                |
| f   | n | 1 − x            |
| f   | h | x                |

d. Suppose the alarm works correctly unless it is faulty, in which case it never sounds. Give the CPT associated with A.

| F_A | G | P(A=t \| F_A, G) |
|-----|---|------------------|
| t   | * | 0                |
| f   | h | 1                |
| f   | n | 0                |

e. Suppose the alarm and gauge are working and the alarm sounds. Calculate an expression for the probability that the temperature of the core is too high, in terms of the various CPTs in the network.

P(T=h | F_A=f, F_G=f, A=t)
  = α P(T=h, F_A=f, F_G=f, A=t)
  = α Σ_g P(T=h, F_A=f, F_G=f, A=t, G=g)
  = α P(T=h) P(F_G=f | T=h) P(F_A=f) Σ_g P(G=g | T=h, F_G=f) P(A=t | G=g, F_A=f)
  = α P(T=h) P(F_G=f | T=h) P(F_A=f) [P(G=n | T=h, F_G=f) P(A=t | G=n, F_A=f) + P(G=h | T=h, F_G=f) P(A=t | G=h, F_A=f)]
  = α P(T=h) P(F_G=f | T=h) P(F_A=f) [(1 − x) · 0 + x · 1]
  = α P(T=h) P(F_G=f | T=h) P(F_A=f) x
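To make the expression concrete, here is a minimal numeric sketch in Python. The exercise gives no actual numbers, so the prior P(T=h), the fault probabilities, and x below are illustrative placeholders; only the structure of the computation, including the normalization α over T ∈ {h, n}, comes from the derivation above.

```python
# Numeric sketch of part (e): P(T=h | F_A=f, F_G=f, A=t).
# NOTE: the exercise specifies no numbers; every value below is an
# illustrative placeholder. Only the structure mirrors the derivation.

x = 0.95                                  # P(gauge reads correctly | working); placeholder
p_T_h = 0.01                              # P(T=h); placeholder
p_FG_f_given_T = {"h": 0.80, "n": 0.95}   # P(F_G=f | T); placeholder
p_FA_f = 0.99                             # P(F_A=f); placeholder

# Unnormalized term for T=h, from the last line of the derivation:
#   P(T=h) P(F_G=f|T=h) P(F_A=f) x
unnorm_h = p_T_h * p_FG_f_given_T["h"] * p_FA_f * x

# The analogous elimination for T=n: a working gauge misreads a normal
# core as high with probability 1 - x, and only G=h makes a working
# alarm sound, so the evidence term is (1 - x).
unnorm_n = (1 - p_T_h) * p_FG_f_given_T["n"] * p_FA_f * (1 - x)

alpha = 1.0 / (unnorm_h + unnorm_n)       # normalize over T in {h, n}
print("P(T=h | F_A=f, F_G=f, A=t) =", alpha * unnorm_h)
```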
# Today: Approximate Inference

Why approximate inference?
- inference in singly connected networks (aka polytrees) is linear! ... but many networks are not singly connected
- inference in multiply connected networks is exponential in the worst case, even when the number of parents per node is bounded
- we may be willing to trade some small error for more tractable inference

# Solitaire and Stanislaw Ulam

From Wikipedia:

> Late in the war, under the sponsorship of von Neumann, Frankel and Metropolis began to carry out calculations on the first general-purpose electronic computer, the ENIAC. Shortly after returning to Los Alamos, Ulam participated in a review of results from these calculations.[33] Earlier, while playing solitaire during his recovery from surgery, Ulam had thought about playing hundreds of games to estimate statistically the probability of a successful outcome.[34] With ENIAC in mind, he realized that the availability of computers made such statistical methods very practical. John von Neumann immediately saw the significance of this insight. In March 1947 he proposed a statistical approach to the problem of neutron diffusion in fissionable material.[35] Because Ulam had often mentioned his uncle, Michał Ulam, "who just had to go to Monte Carlo" to gamble, Metropolis dubbed the statistical approach "The Monte Carlo method".[33] Metropolis and Ulam published the first unclassified paper on the Monte Carlo method in 1949.[36]

# Stochastic simulation

Core idea:
- Draw samples from a sampling distribution defined by the network
- Compute an approximate posterior probability in a way that converges to the true probability

Methods:
- Simple sampling from an empty network
- Rejection sampling: reject samples that don't agree with the evidence
- Likelihood weighting: weight samples based on the evidence
- Markov chain Monte Carlo: sample from a stochastic process whose stationary distribution is the true posterior

# Population and samples

Monte Carlo methods vary, but tend to follow a particular pattern:
- Define a domain of possible inputs.
- Generate inputs randomly from a probability distribution over the domain.
- Perform a deterministic computation on the inputs.
- Aggregate the results.

General illustration: universe, sample(s), estimate over the universe.

Specific illustration: circle in square
- Draw a square on the ground, then inscribe a circle within it.
- Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.
- Count the number of objects inside the circle and the total number of objects.
- The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.

# Other examples

Consider our dentistry example (toothache, cavity, catch), where we have the full joint distribution (FJD). How could we generate samples? Weight each atomic event according to its probability in the FJD, then sample.

What about the Burglary / Earthquake / Alarm / JohnCalls / MaryCalls example?

P(B) = 0.001
P(E) = 0.002

| B | E | P(A \| B, E) |
|---|---|--------------|
| t | t | 0.95         |
| t | f | 0.94         |
| f | t | 0.29         |
| f | f | 0.001        |

P(J | a) = 0.9; P(J | !a) = 0.05
P(M | a) = 0.7; P(M | !a) = 0.01

# Simple sampling

Given an empty network (no evidence) and beginning with the nodes that have no parents, we can sample from each conditional distribution in turn and instantiate all nodes. This produces one atomic event. Doing this many times produces an empirical distribution that approximates the full joint distribution.

# Example of simple sampling

Cloudy, Sprinkler, Rain, WetGrass:

      C
     / \
    S   R
     \ /
      W

P(C) = 0.5
P(S | c) = 0.1; P(S | !c) = 0.5
P(R | c) = 0.8; P(R | !c) = 0.2

| S | R | P(W \| S, R) |
|---|---|--------------|
| t | t | 0.99         |
| t | f | 0.90         |
| f | t | 0.90         |
| f | f | 0.01         |

- Flip a coin for P(C): get C=T.
- Given C=T, flip a biased coin for S; get a value > 0.1, so S=F.
- Given C=T, flip a biased coin for R; get a value < 0.8, so R=T.
- Given S=F, R=T, flip a biased coin for W; get a value < 0.9, so W=T.

TFTT: a single atomic event. Do this many times to get an empirical distribution approximating the FJD.
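Here is a minimal sketch of this simple-sampling procedure in Python, using the CPT values from the example above. The helper names (flip, prior_sample) are ours, not from any particular library.

```python
import random
from collections import Counter

# Prior ("simple") sampling for the Cloudy/Sprinkler/Rain/WetGrass network.
# CPT values are the ones given in the example above.

P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}           # P(S=t | C)
P_R_given_C = {True: 0.8, False: 0.2}           # P(R=t | C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.01}  # P(W=t | S, R)

def flip(p):
    """Biased coin: True with probability p."""
    return random.random() < p

def prior_sample():
    """Sample one atomic event in topological order (parents first)."""
    c = flip(P_C)
    s = flip(P_S_given_C[c])
    r = flip(P_R_given_C[c])
    w = flip(P_W_given_SR[(s, r)])
    return c, s, r, w

# Repeating many times yields an empirical approximation of the FJD:
counts = Counter(prior_sample() for _ in range(100_000))
event = (True, False, True, True)   # the TFTT event sampled above
print("empirical P(C=t, S=f, R=t, W=t) ≈", counts[event] / 100_000)
```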
# Bayesian networks are generative

BNs can generate samples from the world they represent. Generating samples is efficient (linear in the number of nodes) even though general probabilistic inference is not. Thus, we will attempt to use the efficient procedure to approximate the inefficient one.

# Benefits and problems of simple sampling

- Works well for an empty network
- Simple
- In the limit (that means: many samples), the estimated distribution approaches the true distribution
- But in nearly all cases we have evidence, rather than an empty network. What can we do? For example, what if we have values for S and W, and want to estimate P(C | S, W)?

# Option one: rejection sampling

Sample the network as before... but discard samples that don't agree with the evidence. Similar to real-world estimation procedures, but the network is the stand-in for the world (much cheaper and easier). However, this is hopelessly expensive for large networks where P(e) is small. Why?

# Option two: likelihood weighting

Do simple sampling as before... but weight each sample by the likelihood of the evidence. The weight starts at one and decreases multiplicatively as each piece of evidence is accounted for.

# Example

Suppose we want to estimate P(C | S=F, W=T).

- C is not an evidence variable, so sample from P(C). Get C=T.
- S is an evidence variable (=F), so w = w × P(S=F | C=T) = 1 × 0.9 = 0.9.
- R is not an evidence variable, so sample from P(R | C=T). Get R=T.
- W is an evidence variable (=T), so w = w × P(W=T | S=F, R=T) = 0.9 × 0.9 = 0.81.

# Effects of weighting

Sampled values only pay attention to evidence among a variable's ancestors, not its children, producing estimates somewhere in between the prior and the posterior. The weighting helps make up the difference.

Problems:
- Performance degrades with many evidence variables. Why?
- Accuracy degrades as more evidence variables appear late in the variable ordering. Why?

# Option three: it's complicated

Markov chain Monte Carlo:
- Markov chain: a description of the state of a system at successive times
- Markov property: the state at time t+1 depends only on the state at time t, not on time t−i for i > 0
- Monte Carlo: a class of non-deterministic algorithms used to simulate the behavior of a system

# MCMC (this is Gibbs sampling, one approach)

The "state" of the system is the current assignment of all variables.

Algorithm:
- Initialize all non-evidence variables randomly; fix the evidence variables at their observed values
- Generate the next state by resampling one variable given its Markov blanket
- Sample each non-evidence variable in turn, keeping the evidence fixed, i.e., don't sample the evidence variables!
- Variable selection can be sequential or random

# What is the Markov chain for the previous query?

P(C | S=F, W=T)

To the board!

# Intuition behind MCMC

In the limit, the sampling process spends a proportion of time in each state exactly equal to its posterior probability.

# How to sample

We are sampling a variable X_i given its Markov blanket (= parents, children, children's parents):

P(x_i' | mb(X_i)) = α P(x_i' | parents(X_i)) × ∏_{Y_j ∈ Children(X_i)} P(y_j | parents(Y_j))

In practice in the Gibbs sampler (a minimal sketch appears at the end of these notes):
- There's an α. For a binary variable X_i, you need to compute both x_i and !x_i, then normalize.
- What value do you use for each variable in the Markov blanket of X_i? Its most recent value.
- X_i is a parent of each child Y_j. When computing the weight for x_i, set X_i's value to true inside P(y_j | parents(Y_j)); when computing the weight for !x_i, set it to false.

# MCMC problems

- May require a lengthy "burn-in" period to reach the stationary distribution.
- Difficult to tell whether it has converged (there is no deterministic test).
- Can be inefficient if the Markov blanket is large, because the conditional probabilities don't change much from state to state.
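To make the Gibbs sampler concrete, here is a minimal sketch for the running query P(C | S=F, W=T), reusing the sprinkler CPTs. The burn-in length, sample count, and helper names are our choices; the per-variable resampling follows the Markov-blanket formula from "How to sample".

```python
import random

# Gibbs sampling sketch for P(C=t | S=f, W=t) on the sprinkler network.
# Same CPTs as the prior-sampling sketch; repeated here so the block is
# self-contained. Burn-in and sample counts are arbitrary choices.

P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}           # P(S=t | C)
P_R_given_C = {True: 0.8, False: 0.2}           # P(R=t | C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.01}  # P(W=t | S, R)

def bern(p_true, value):
    """P(X=value) for a Boolean X with P(X=true) = p_true."""
    return p_true if value else 1.0 - p_true

S, W = False, True                 # evidence: clamped, never resampled
c = random.random() < 0.5          # initialize non-evidence variables randomly
r = random.random() < 0.5

N, burn_in = 100_000, 1_000
count_c_true = 0
for step in range(N + burn_in):
    # Resample C given its Markov blanket (children S and R):
    #   P(c' | mb(C)) = alpha * P(c') P(S=s | c') P(R=r | c')
    w_t = P_C       * bern(P_S_given_C[True],  S) * bern(P_R_given_C[True],  r)
    w_f = (1 - P_C) * bern(P_S_given_C[False], S) * bern(P_R_given_C[False], r)
    c = random.random() < w_t / (w_t + w_f)        # normalizing is the alpha

    # Resample R given its Markov blanket (parent C; child W; W's other parent S):
    #   P(r' | mb(R)) = alpha * P(r' | c) P(W=w | S, r')
    w_t = P_R_given_C[c]       * bern(P_W_given_SR[(S, True)],  W)
    w_f = (1 - P_R_given_C[c]) * bern(P_W_given_SR[(S, False)], W)
    r = random.random() < w_t / (w_t + w_f)

    if step >= burn_in and c:
        count_c_true += 1

print("Gibbs estimate of P(C=t | S=f, W=t) ≈", count_c_true / N)
```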