# Question

Consider this Bayes net:

        /-> C
A -> B
        \-> D

(A -> B, with B -> C and B -> D)

1. Which of these statements are implied by the network structure?
   a. P(B | A,C,D) = P(B | A)  [claims B conditionally independent of C, D given A: no, C and D are children of B]
   b. P(D | B) = P(D | B,C)  [claims D conditionally independent of C given B: yes]
   c. P(B | A) = P(B)  [claims B fully independent of A: no!]

2. Suppose that we have evidence at B. Use Bayes rule to solve the query P(A | B) in terms of the probabilities directly available in the network. (You may assume the usual normalization constant alpha.)

   P(A | B) = P(B | A) P(A) / P(B) = alpha P(B | A) P(A)

   Only P(B) is not directly available; it is absorbed into alpha, which we get by normalizing over the values of A:

   P(B) = Σ_a P(B | a) P(a)

3. Suppose that we have evidence at C and D. Use Bayes rule followed by conditioning on B to solve the query P(A | C,D) in terms of probabilities directly available in the network.

   P(A | C,D) = alpha P(C,D | A) P(A)                      [Bayes rule]
              = alpha P(A) Σ_b P(C,D | b) P(b | A)         [condition on B; C,D independent of A given B]
              = alpha P(A) Σ_b P(C | b) P(D | b) P(b | A)  [C independent of D given B]

   Every factor -- P(A), P(B | A), P(C | B), P(D | B) -- is a CPT in the network.

# Markov Chain Monte Carlo and Gibbs

Markov chain Monte Carlo:
- Markov chain: a description of the state of a system at successive times
- Markov property: the state at time t+1 depends only on the state at time t, not on time t-i for i > 0
- Monte Carlo: a class of non-deterministic algorithms used to simulate the behavior of a system

# MCMC (this is Gibbs sampling, one approach)

The "state" of the system is the current assignment of all variables.

Algorithm:
- Initialize all variables randomly
- Generate the next state by sampling one variable given its Markov blanket
- Sample each variable in turn, keeping the evidence fixed, i.e., don't sample the evidence variables!
- Variable selection can be sequential or random

# What is the Markov chain for the previous query? P(C | S=F, W=T)

To the board!

# Intuition behind MCMC

In the limit, the sampling process spends a proportion of time in each state exactly equal to its posterior probability.

# How to sample

We are sampling a variable X_i given its Markov blanket (= parents, children, and children's parents):

P(x_i' | mb(X_i)) = alpha P(x_i' | parents(X_i)) × ∏_{Y_j ∈ Children(X_i)} P(y_j | parents(Y_j))

In practice in the Gibbs sampler:
- There's an alpha: for a binary variable X_i, compute the product for both x_i and ¬x_i, then normalize so the two values sum to 1.
- What value do you use for each variable in the Markov blanket of X_i? Its most recent value.
- X_i is itself a parent of each of its children Y_j: when computing the term for x_i, set X_i to true inside P(y_j | parents(Y_j)); when computing the term for ¬x_i, set it to false.
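Here is a minimal runnable sketch of this procedure for the earlier query P(C | S=F, W=T), assuming the standard Cloudy/Sprinkler/Rain/WetGrass network. The CPT numbers are the usual textbook values, not given in these notes, so treat them (and the helper names) as illustrative only.

```python
# Gibbs sampling sketch for P(C=T | S=F, W=T) on the sprinkler network.
# Assumed structure: C -> S, C -> R, S -> W, R -> W (textbook CPT values).
import random

P_C = 0.5
P_S = {True: 0.1, False: 0.5}                      # P(S=T | C), keyed by C
P_R = {True: 0.8, False: 0.2}                      # P(R=T | C), keyed by C
P_W = {(True, True): 0.99, (True, False): 0.90,    # P(W=T | S, R)
       (False, True): 0.90, (False, False): 0.01}

def p_c_given_mb(s, r):
    """P(C=T | Markov blanket) = alpha * P(C) * P(S|C) * P(R|C)."""
    def score(c):
        pc = P_C if c else 1 - P_C
        ps = P_S[c] if s else 1 - P_S[c]
        pr = P_R[c] if r else 1 - P_R[c]
        return pc * ps * pr
    t, f = score(True), score(False)     # compute both c and not-c ...
    return t / (t + f)                   # ... then normalize (the alpha)

def p_r_given_mb(c, s, w):
    """P(R=T | Markov blanket) = alpha * P(R|C) * P(W|S,R)."""
    def score(r):
        pr = P_R[c] if r else 1 - P_R[c]
        pw = P_W[(s, r)] if w else 1 - P_W[(s, r)]
        return pr * pw
    t, f = score(True), score(False)
    return t / (t + f)

def gibbs(n_samples=100_000, burn_in=1_000):
    s, w = False, True                   # evidence: S=F, W=T (never resampled)
    c = random.random() < 0.5            # non-evidence variables start random
    r = random.random() < 0.5
    count_c = 0
    for i in range(burn_in + n_samples):
        # Resample each non-evidence variable in turn, given its Markov
        # blanket, using the most recent values of the other variables.
        c = random.random() < p_c_given_mb(s, r)
        r = random.random() < p_r_given_mb(c, s, w)
        if i >= burn_in:
            count_c += c
    return count_c / n_samples           # estimate of P(C=T | S=F, W=T)

print(gibbs())   # approaches the exact posterior as n_samples grows
```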
# MCMC problems

- May require a lengthy "burn-in" to reach the stationary distribution.
- Difficult to tell whether it has converged (there is no deterministic test).
- Can be inefficient if the Markov blanket is large, because the sampled probabilities don't change much from step to step.

# Other stuff

I'd love to do DBNs and HMMs, but you'll see them if you take Robotics. Understanding BNs is a good first step.

# New topic: Machine Learning as Search

Bayes nets are a way to represent knowledge. But what if we have *data*, not *knowledge*?

# Types of learning tasks

- Supervised learning (aka inductive learning)
  - Given training data with the true class label, learn a mapping from data variables to class label.
  - Examples: rule learning, classification trees, neural networks
- Reinforcement learning
  - Learn a mapping from actions taken by an agent to rewards (reinforcement) gained upon completing the actions.
- Unsupervised learning
  - Learn a mapping from data instances to "natural groups" of instances.
  - Examples: clustering, density estimation

# Inductive learning

Inductive learning algorithms construct models that approximate a mapping from data instances to (unobserved) class labels (often: conditional or joint probability distributions).

- Combines search techniques, knowledge representation, and statistics
- Strictly speaking, this is *abductive* reasoning rather than inductive reasoning
- Also called "learning from examples" or "supervised learning"

# Real-world example: cell phone fraud detection

Analyze cellular telephone calls to automatically construct user profiles and detect fraud. (T. Fawcett and F. Provost (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery 1:291-316.)

Three-step process:
- *Learn rules*: "...we use a rule-learning program to uncover indicators of fraudulent behavior from a large database of customer transactions."
- For each rule, *produce a monitor*: "Then the indicators are used to create a set of monitors, which profile legitimate customer behavior and indicate anomalies."
- Combine multiple monitors to *produce alarms*: "...the outputs of the monitors are used as features in a system that learns to combine evidence to generate high-confidence alarms."

# Data format

Time series of call logs for 3,600 accounts. Each record contains:
- Date and time
- Day of week
- Duration
- Origin
- Destination
- Fraud?

# Supervised learning

Input: a set of data instances, characterized by variables.

Output: a model, often a function of the variables, such as:
- Class values (e.g., fraud status)
- Conditional probability distributions, P(Fraud | Call Activity)
- Joint probability distributions, P(Fraud, Call Activity)

Common types of supervised learning:
- Propositional (independent instances)
- Temporal/spatial
- Relational

# Machine learning in practice

Choose a data representation:
- Data rarely arrive in a usable form.
- Requires transformation from raw data into suitable input for your chosen learning algorithm.

Select a knowledge representation (e.g., a type of model).

Use a search technique to choose a "good" specific model:
- Search the space of model structures and/or parameters.
- Score models with an *evaluation function* to find the best fit to the data.

# Other examples

- Medical diagnosis
- Loan approval
- Document classification
- Spam detection, content filtering
- Image classification
- Computer security, load prediction
- Recommender systems

# Learning probabilistic rules

An example system. Probabilistic rules are one of the most widely used conditional models. Rules are represented as a combination of Boolean propositions; many algorithms use only conjunctions, e.g.,

A ^ B ^ C  ->  X
(antecedents)  (consequent)

"If A, B, and C, then X"

# Rules

Consider a grocery store learning about items bought together:

IF (cheese) THEN chocolate           [Conf=0.10, Supp=0.04]
IF (cheese & Sterno) THEN chocolate  [Conf=0.95, Supp=0.01]

*Support* is the fraction of all rows in which the antecedent and the consequent occur together. *Confidence* is the fraction of the rows satisfying the antecedent that also satisfy the consequent.

These can also be written as conditional probabilities:

P(chocolate | cheese) = 0.10
P(chocolate | cheese, Sterno) = 0.95

As rules get more specific (by including additional propositions), support decreases, but confidence may increase. (A small sketch of how these statistics are computed follows.)
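To make support and confidence concrete, here is a minimal sketch of computing them from transaction data. The rule_stats helper and the toy baskets are made up for illustration; they are not from the Fawcett & Provost system.

```python
# Compute support and confidence of a rule "IF antecedent THEN consequent"
# over a list of transactions (each transaction is a set of items).
def rule_stats(transactions, antecedent, consequent):
    ante = [t for t in transactions if antecedent <= t]   # antecedent holds
    both = [t for t in ante if consequent in t]           # ...and consequent too
    support = len(both) / len(transactions)               # P(antecedent AND consequent)
    confidence = len(both) / len(ante) if ante else 0.0   # P(consequent | antecedent)
    return support, confidence

baskets = [{"cheese", "bread"}, {"cheese", "Sterno", "chocolate"},
           {"cheese", "chocolate"}, {"milk"}]
print(rule_stats(baskets, {"cheese", "Sterno"}, "chocolate"))  # (0.25, 1.0)
```

...to be continued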