# Administrivia

# Continued from last class...

Recall that we were talking about inference by enumeration.

# Inference by Enumeration

Start with the joint probability distribution:

|         | toothache, catch | toothache, ¬catch | ¬toothache, catch | ¬toothache, ¬catch |
|---------|------------------|-------------------|-------------------|--------------------|
| cavity  | 0.108            | 0.012             | 0.072             | 0.008              |
| ¬cavity | 0.016            | 0.064             | 0.144             | 0.576              |

To evaluate a proposition, sum the atomic events where it is true. For example:

P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

We can also compute conditional probabilities:

P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
                       = (0.016 + 0.064) / (0.016 + 0.064 + 0.108 + 0.012) = 0.4

Q a. P(catch)
Q b. P(Cavity)
Q c. P(Toothache | cavity)
Q d. P(Cavity | toothache ∨ catch)

# Normalization

The same computation for P(cavity | toothache):

(0.108 + 0.012) / (0.016 + 0.064 + 0.108 + 0.012) = 0.6

The denominator is the same both times! Intuition: the denominator is whatever makes the distribution P(Cavity | toothache) add up to one. It is sometimes called a normalization constant, alpha. In other words:

P(Cavity | toothache) = alpha P(Cavity, toothache)
                      = alpha [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
                      = alpha [<0.108, 0.016> + <0.012, 0.064>]
                      = alpha <0.12, 0.08>
                      = <0.6, 0.4>

General idea: compute the distribution on the query variable by fixing the observed variables and summing over the unobserved variables.

# More on marginalization

Generally we are interested in the posterior joint distribution of the query variables Y, given specific values e for the evidence (observed) variables E. Let the hidden (unobserved) variables be H = X − Y − E. Then the required summation of joint entries is done by summing out the hidden variables:

P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)

The terms in the summation are joint entries, because Y, E, and H together exhaust the set of random variables.

Obvious problems:
- Worst-case time complexity is O(d^n), where d is the largest arity
- Space complexity is O(d^n) to store the joint distribution
- How do we find the values for O(d^n) entries in the first place?

# Independence

A and B are independent iff P(A | B) = P(A), or P(B | A) = P(B), or P(A, B) = P(A) P(B).

For example, P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather). (This can be observed in a joint distribution, or known as domain knowledge.)

Absolute independence is powerful but rare. Here it reduces 32 entries to 12; for n independent biased coins, the joint goes from O(2^n) entries to O(n): linear rather than exponential! But dentistry is a large field with hundreds of variables, none of which are independent. What to do?

# Conditional independence

|         | toothache, catch | toothache, ¬catch | ¬toothache, catch | ¬toothache, ¬catch |
|---------|------------------|-------------------|-------------------|--------------------|
| cavity  | 0.108            | 0.012             | 0.072             | 0.008              |
| ¬cavity | 0.016            | 0.064             | 0.144             | 0.576              |

P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries.

If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:

P(catch | toothache, cavity) = P(catch | cavity)
0.108 / (0.108 + 0.012) = 0.9
(0.108 + 0.072) / (0.108 + 0.012 + 0.072 + 0.008) = 0.9

The same independence holds if I haven't got a cavity:

P(catch | toothache, ¬cavity) = P(catch | ¬cavity)

We say Catch is conditionally independent of Toothache given Cavity:

P(Catch | Toothache, Cavity) = P(Catch | Cavity)

General statements (all equivalent) of X and Y being conditionally independent given Z:
- P(X, Y | Z) = P(X | Z) P(Y | Z)
- P(X | Y, Z) = P(X | Z)
- P(Y | X, Z) = P(Y | Z)

Equivalent statements for the dentistry variables:

P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

Like independence, conditional independence can be known or observed (or asserted).
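Pulling the enumeration, normalization, and conditional-independence checks above together, here is a minimal Python sketch over the dentistry joint table. The table values are the ones on the slides; the function and variable names are my own.

```python
# Dentistry joint distribution; keys are (cavity, toothache, catch) assignments.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(event):
    """Sum the atomic events where the proposition `event` holds."""
    return sum(p for (cav, tooth, catch), p in joint.items()
               if event(cav, tooth, catch))

# Evaluate a proposition by enumeration: P(toothache) = 0.2
print(prob(lambda cav, tooth, catch: tooth))

# Conditional query by normalization: P(Cavity | toothache) = <0.6, 0.4>
unnormalized = [prob(lambda cav, tooth, catch: tooth and cav),
                prob(lambda cav, tooth, catch: tooth and not cav)]
alpha = 1.0 / sum(unnormalized)              # normalization constant
print([alpha * x for x in unnormalized])

# Conditional independence check: both ratios below come out to 0.9,
# i.e. P(catch | toothache, cavity) = P(catch | cavity).
print(prob(lambda cav, tooth, catch: catch and tooth and cav) /
      prob(lambda cav, tooth, catch: tooth and cav))
print(prob(lambda cav, tooth, catch: catch and cav) /
      prob(lambda cav, tooth, catch: cav))
```

This brute-force pass over all the atomic events is exactly the O(d^n) enumeration whose cost the sections above warn about: fine for three Boolean variables, hopeless for many more.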
# CondIndep: awesomesauce

Write out the full joint distribution using the chain rule:

P(Toothache, Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)

then apply conditional independence:

  = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)

I.e., 2 + 2 + 1 = 5 independent numbers, only two fewer than the 7 we started with. But! In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n.

**Conditional independence is our most basic and robust form of knowledge about uncertain environments.**

# Bayes' Rule

Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)

Rewrite as Bayes' rule:

P(a | b) = P(b | a) P(a) / P(b)

or in distribution form:

P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)

Useful for assessing diagnostic probability from causal probability:

P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

E.g., let M be meningitis and S be stiff neck, with P(s) = 0.1, P(m) = 0.0001, P(s | m) = 0.8:

P(m | s) = P(s | m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008

The posterior probability of meningitis is still very small!

# Diagnostics and causality question

Q. Consider two tests for a virus, A and B. A is 95% effective at recognizing the virus when it is present, but has a 10% false-positive rate. B is 90% effective, but has a 5% false-positive rate. 1% of people carry the virus. Which test returning positive is more indicative of someone actually carrying the virus?

# Summary

- Probability is a rigorous formalism for uncertain knowledge
- The joint probability distribution specifies the probability of every atomic event
- Queries can be answered by summing over atomic events
- For nontrivial domains, we must find a way to reduce the joint size
- Independence and conditional independence provide the tools

# And now for something fairly similar

Recall: from a joint distribution over a set of variables, you can calculate
- the joint probability distribution of any subset of those variables
- the conditional probability distribution of any subset given any other subset

The joint distribution is "everything you need to know" about a set of variables. But life (and full joint distributions) aren't perfect.

# Curse of dimensionality

You will hear about the curse a lot in AI, but here it refers to the fact that
- joint probability tables grow exponentially with the number of variables
- they become essentially intractable once you have more than about 10 variables

Solution: exploit independence and conditional independence, as we discussed earlier. Generally: Bayes nets.

# Bayesian Networks

A simple, graphical notation for conditional independence assertions, and hence for compact specification of joint distributions.

Syntax:
- a set of nodes, one per variable
- a directed, acyclic graph (link ≈ "directly influences")
- a conditional distribution for each node given its parents: P(Xi | Parents(Xi))

In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values.

# Example

The topology of the network encodes conditional independence assertions:

    Weather      Cavity
                 /    \
                v      v
         Toothache    Catch

Weather is independent of the other variables; Toothache and Catch are conditionally independent, given Cavity.

Q. What is the conditional distribution associated with each of the four nodes in the above BN?
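Circling back to Bayes' rule for a moment, here is the meningitis calculation as a tiny sketch; the helper name `diagnostic` is mine, not from the slides.

```python
def diagnostic(p_effect_given_cause, p_cause, p_effect):
    """P(Cause | Effect) via Bayes' rule: P(Effect | Cause) P(Cause) / P(Effect)."""
    return p_effect_given_cause * p_cause / p_effect

# Meningitis example: P(s) = 0.1, P(m) = 0.0001, P(s | m) = 0.8.
print(diagnostic(p_effect_given_cause=0.8, p_cause=0.0001, p_effect=0.1))  # 0.0008
```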
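The factorization from the "CondIndep: awesomesauce" section is exactly what a Bayesian network like the one above stores: one small factor per node. Here is a sketch that multiplies those factors back together and recovers all eight joint entries; the five factor values are read off the dentistry joint table (they are not stated explicitly on the slides).

```python
from itertools import product

# Factors derived from the joint table earlier in the notes:
# P(cavity) = 0.2, P(toothache | cavity) = 0.6, P(toothache | ¬cavity) = 0.1,
# P(catch | cavity) = 0.9, P(catch | ¬cavity) = 0.2.
P_CAVITY = 0.2
P_TOOTHACHE_GIVEN = {True: 0.6, False: 0.1}
P_CATCH_GIVEN = {True: 0.9, False: 0.2}

def bernoulli(value, p_true):
    """P(variable = value) for a Boolean variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# Reconstruct all eight joint entries from just 2 + 2 + 1 = 5 numbers:
# P(Toothache, Catch, Cavity) = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity).
for cav, tooth, catch in product([True, False], repeat=3):
    entry = (bernoulli(cav, P_CAVITY)
             * bernoulli(tooth, P_TOOTHACHE_GIVEN[cav])
             * bernoulli(catch, P_CATCH_GIVEN[cav]))
    print(f"cavity={cav}, toothache={tooth}, catch={catch}: {entry:.3f}")
```

Five stored numbers reproduce the eight-entry table; the burglary network below makes the same compactness argument on a larger scale.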
# Example: Home security

I'm at work, and my neighbor John calls to say my alarm is ringing, but my neighbor Mary doesn't call. We live in California, and sometimes the alarm is set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

We have some probabilistic "causal" knowledge:
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call

# Home Security Net

Network topology: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls.

P(B) = 0.001, P(E) = 0.002

| B | E | P(A \| B, E) |
|---|---|--------------|
| T | T | 0.95         |
| T | F | 0.94         |
| F | T | 0.29         |
| F | F | 0.001        |

| A | P(J \| A) | P(M \| A) |
|---|-----------|-----------|
| T | 0.9       | 0.7       |
| F | 0.05      | 0.01      |

# Benefits: Compactness

A CPT for a Boolean variable Xi with k Boolean parents has 2^k rows, one for each combination of parent values. Each row requires one number p for Xi = true (the number for Xi = false is just 1 − p).

If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers. In other words, it grows linearly with n, versus O(2^n) for the full joint distribution.

For the burglary net: 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31).

# Semantics

The full joint distribution is defined as the product of the local conditional distributions:

P(X1, …, Xn) = Π_i P(Xi | Parents(Xi))

Example:

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
                        = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.000628

# Another BN exercise

Q. A bag contains biased coins a, b, and c with P(heads) = 0.2, 0.6, and 0.8 respectively. One coin is drawn at random (with equal probability), then flipped three times to generate X1, X2, X3.

a. Draw the BN corresponding to this setup; define the CPTs for each node.
b. Assume the coin came up heads twice and tails once. Calculate which coin was most likely to have been drawn from the bag.
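Returning to the burglary network: a small sketch that checks the Semantics calculation by multiplying the local CPT entries. The dictionary layout and helper names are mine; the numbers are the ones in the Home Security Net tables above.

```python
from itertools import product

# CPTs for the burglary network, using the numbers from the tables above.
P_B = 0.001
P_E = 0.002
P_A_GIVEN = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_J_GIVEN = {True: 0.9, False: 0.05}                        # P(j | A)
P_M_GIVEN = {True: 0.7, False: 0.01}                        # P(m | A)

def bern(value, p_true):
    """P(X = value) for a Boolean X with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """One full-joint entry as the product of the local conditional probabilities."""
    return (bern(b, P_B) * bern(e, P_E)
            * bern(a, P_A_GIVEN[(b, e)])
            * bern(j, P_J_GIVEN[a])
            * bern(m, P_M_GIVEN[a]))

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) ≈ 0.000628, matching the Semantics section.
print(joint(b=False, e=False, a=True, j=True, m=True))

# Sanity check: the 2^5 = 32 entries sum to 1 (up to floating-point error).
print(sum(joint(*bits) for bits in product([True, False], repeat=5)))
```

The final print confirms that the ten numbers counted in the Compactness section really do define a complete joint distribution over the five variables.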