# Continuing on with NBCs...

# Spam classification
P(isSpam | email) = alpha * P(email | isSpam) * P(isSpam) = alpha * P(isSpam) * product over words w_i in the email of P(w_i | isSpam), where alpha is a normalization constant
P(isSpam): the proportion of email that is spam
P(w_i | isSpam): the proportion of spam emails that contain the word w_i

# Learning CPTs
You already did this (A05)! If the CPTs are unknown, you can just count occurrences in the training data!

# An example
Suppose we have a variable X = {low, med, high} and f(x) = {true, false}:

| f(x)  | X=low | X=med | X=high |
|-------|-------|-------|--------|
| true  | 10    | 13    | 17     |
| false | 2     | 13    | 0      |

P(X=low | f(x)=true) = 10 / (10 + 13 + 17)
P(f(x)=false) = (2 + 13 + 0) / (2 + 13 + 0 + 10 + 13 + 17)

# Zero counts are a problem in practice
P(X=high | f(x)=false) = 0 / (2 + 13 + 0) = 0 (???)
If an attribute value does not occur in any training example, we assign zero probability to that value. Why is this a problem?
Adjust for zero counts by "smoothing" the probability estimates.
In the spam example, we can discard words unseen in the training data, or set P(rareWord | isSpam) to a low but non-zero probability. (A code sketch of counting plus smoothing appears in the aside further below.)

# Pros and cons of Naive Bayes
Pros:
- Extremely simple to estimate
- Uses all data instances for every estimated CPT
- CPTs can be learned incrementally
Cons:
- Not selective (e.g., it gives weight to uninformative words like "the")
- Assumes independence

# Is assuming independence a problem?
What is the effect on the probability estimate? Over-counting of evidence, which leads to overly confident probability estimates.
What is the effect on classification? Less clear…
For a given input x, suppose f(x) = true. Naive Bayes will still classify x correctly as long as P(F(x)=true | x) > 0.5, so an overconfident estimate can yield the right label provided it falls on the correct side of 0.5.

# Measuring classifier error
Classifier error can be broken down into two components:
- Bias (systematic error): a measure of how the hypothesis h qualitatively differs from the true function f
- Variance (random error): a measure of how much h changes (varies) across different sets of examples [recall: the hypothesis h is learned from examples]
Suppose you're measuring student heights:
- Ruler too short or too long? Bias.
- Different random samples from the population? Variance.

# Bias-variance tradeoff
Attempts to reduce bias often increase variance.
Attempts to reduce variance often increase bias.
Naive Bayes has high bias, low variance -- why?

# Classification Trees
Input: x, a set of attributes (discrete or continuous)
Output: a label f(x), or P(F(x) | x)
Assigns a label by performing a sequence of "tests", propositional statements about the input.
Tests are arranged in a tree structure.

# Example: play tennis?
outlook -> (sunny, overcast, rain)
- sunny: humidity -> (high, normal); high: no; normal: yes
- overcast: yes
- rain: windspeed -> (> 7, <= 7); > 7: no; <= 7: yes
(Leaves could be probability distributions instead of labels.)
Drop the input in at the top and deterministically follow one path down to a leaf.
Leaves are mutually exclusive (only one choice is possible) and collectively exhaustive (all possible choices are represented).

# Learning a tree from data, attempt 1
Task: build a tree consistent with the examples.
For each example, construct a path from root to leaf that tests every attribute.
This "memorizes" the entire set of training examples.
Is this a good idea? It gets 100% accuracy on the examples.
But what label would this tree assign to an example it has not seen before? (The tree is not inductive.)

# Learning a tree from data, attempt 2
Task formulation revisited:
- Goal: a tree consistent with all examples
- Cost: the size of the tree
What kind of problem is this? Search!
What search algorithm should we use?
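# Aside: counting and smoothing in code
Here is a minimal sketch, in Python, of the Naive Bayes pieces above: CPTs estimated by counting, add-k (Laplace) smoothing so that zero counts don't force a class probability to zero, and classification via the product over words. The toy emails, the function names (`learn_cpts`, `classify`), and the choice of add-k smoothing are illustrative assumptions, not a reference implementation from the course.

```python
from collections import Counter

# Toy training data (made up): each email is a set of words plus an is-spam label.
emails = [
    ({"win", "money", "now"}, True),
    ({"meeting", "agenda", "now"}, False),
    ({"win", "prize"}, True),
    ({"lunch", "agenda"}, False),
]

def learn_cpts(data, k=1.0):
    """Estimate P(isSpam) and P(w | isSpam) by counting, with add-k (Laplace) smoothing."""
    vocab = set().union(*(words for words, _ in data))
    n_spam = sum(1 for _, is_spam in data if is_spam)
    n_ham = len(data) - n_spam
    p_spam = n_spam / len(data)  # proportion of email that is spam

    spam_counts = Counter(w for words, is_spam in data if is_spam for w in words)
    ham_counts = Counter(w for words, is_spam in data if not is_spam for w in words)

    # Smoothed "proportion of spam emails containing w":
    # (count + k) / (n_class + 2k) is never zero, so unseen values can't veto a class.
    p_w_spam = {w: (spam_counts[w] + k) / (n_spam + 2 * k) for w in vocab}
    p_w_ham = {w: (ham_counts[w] + k) / (n_ham + 2 * k) for w in vocab}
    return p_spam, p_w_spam, p_w_ham

def classify(words, p_spam, p_w_spam, p_w_ham):
    """Return P(isSpam | email) via the product over the email's known words."""
    spam_score, ham_score = p_spam, 1.0 - p_spam
    for w in words:
        if w in p_w_spam:  # words never seen in training are simply discarded
            spam_score *= p_w_spam[w]
            ham_score *= p_w_ham[w]
    # Normalizing here plays the role of the constant alpha in the formula above.
    return spam_score / (spam_score + ham_score)

p_spam, p_w_spam, p_w_ham = learn_cpts(emails)
print(classify({"win", "money", "tuesday"}, p_spam, p_w_spam, p_w_ham))
```

With k = 0 the smoothing disappears: an email containing a word that never appears in any spam example (e.g., "agenda" above) would get P(isSpam | email) = 0, which is exactly the zero-count problem described earlier.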
# Learning a tree using uniform-cost search
Let's try uniform cost…
How much time is required to search all binary trees of a given depth, assuming 20 binary variables and 2000 trees evaluated per second?

| Tree depth | No. of full trees       | Time at 2000 trees/s |
|------------|-------------------------|----------------------|
| 1          | 20                      | ms                   |
| 2          | 20 * 19^2               | s                    |
| 3          | 20 * 19^2 * 18^4        | days                 |
| 4          | 20 * 19^2 * 18^4 * 17^8 | 84M years            |

Looks like we'll be doing a heuristic search.

# Learning a tree from data, attempt 3
Input: examples
Output: a small(ish) tree consistent with the examples
Description:
1. Check the stopping criteria
2. Choose an attribute to test
3. Split the examples based on the attribute test
4. Repeat on each subset
How do we choose an attribute to test? Use the one that splits the data best. (A code sketch of this procedure appears in the aside at the end of these notes.)

# When to stop?
Stop if any of these conditions are met:
- The examples all have the same label
- There are no examples matching the condition on the attribute
- There are no attributes left that split the examples

# Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
Example on the board: patrons (none, some, full) vs. type (French, Thai, Italian, burger).
One option: information gain. Information gained is the reduction in entropy, where entropy is the uncertainty in a random variable, a.k.a. the unpredictability of its information content.

# Entropy, information and information gain
A 50/50 binary variable has, by definition, "1 bit" of entropy, corresponding exactly to its two possible outcomes; two independent 50/50 binary variables have 2 bits.
In general, H(X) = E[I(X)], where I(X) is the information (a.k.a. surprisal):
H(X) = sum over i of P(x_i) * I(x_i)
What is I(x)? Assumptions:
- I(p) ≥ 0 – information is a non-negative quantity
- I(1) = 0 – an event that always happens communicates no information
- I(p1 * p2) = I(p1) + I(p2) – additivity over independent events
The log function gives us exactly what we want:
I(p) = log(1/p); traditionally log_2 is used, but the base doesn't matter (why?)
H(X) = sum over k of P(x_k) * log_2(1/P(x_k))
(See the text for the full derivation of information gain -- which is actually mutual information, the expected KL divergence between the conditional distribution (a CPT) and the marginal.)
There is a bias-variance tradeoff here as well: greedily choosing the most discriminating attribute first may not yield the best tree (bias), but it does keep the tree small (low variance).

# Pros and cons of decision trees
Pros:
- Low bias (in theory, can represent any function)
- Efficient inference, once constructed
- Selective (uses some variables and not others)
- Simple, recursive algorithm for construction
Cons:
- Can have high variance if constructed improperly: repeatedly splitting the data into subsets produces small samples at the leaves
- Large trees can be complicated to understand
- Top nodes are extremely influential
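# Aside: a minimal tree learner in code
To tie together attempt 3, the stopping criteria, and information gain, here is a sketch in Python of a greedy tree learner in the style described above. The data format (a dict of attribute values plus a label), the function names, and the majority-label fallback when attributes run out are assumptions made for illustration; this is not the text's reference algorithm.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = sum over k of P(x_k) * log2(1 / P(x_k))."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def information_gain(examples, attr):
    """Reduction in label entropy obtained by splitting on attr."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [label for x, label in examples if x[attr] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder

def learn_tree(examples, attrs):
    """Greedy top-down construction: check stopping criteria, split on best attribute, recurse."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:        # stop: all examples have the same label
        return labels[0]
    if not attrs:                    # stop: no attributes left that split the examples
        return Counter(labels).most_common(1)[0][0]  # fall back to the majority label
    best = max(attrs, key=lambda a: information_gain(examples, a))
    # Branches are only built for values that occur in the examples, so the
    # "no examples match this value" case never arises here; a fuller version
    # would add a default (majority-label) branch for unseen values.
    subtree = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == value]
        subtree[value] = learn_tree(subset, [a for a in attrs if a != best])
    return (best, subtree)

# Toy "play tennis?"-style data (made up); attributes are dict keys.
data = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
    ({"outlook": "overcast", "humidity": "high"}, "yes"),
    ({"outlook": "rain", "humidity": "normal"}, "yes"),
]
print(learn_tree(data, ["outlook", "humidity"]))
```

On this toy data the two attributes tie on information gain at the root, so the learner falls back to list order and tests outlook first; real implementations break such ties more carefully.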