# Continuing on with NBCs...

# Spam classification
P(isSpam | email) = alpha * P(email | isSpam) * P(isSpam) = alpha * P(isSpam) * product over words w_i in the email of P(w_i | isSpam), where alpha is a normalization constant
P(isSpam): the proportion of email that is spam
P(w_i | isSpam): the proportion of spam emails that contain the word w_i

# Learning CPTs
You already did this (A05)! If the CPTs are unknown, you can just count occurrences in the training data!

# An example
Suppose we have a variable X = {low, med, high} and f(x) = {true, false}:

| f(x)  | X=low | X=med | X=high |
|-------|-------|-------|--------|
| true  | 10    | 13    | 17     |
| false | 2     | 13    | 0      |

P(X=low | f(x)=true) = 10 / (10 + 13 + 17)
P(f(x)=false) = (2 + 13 + 0) / (2 + 13 + 0 + 10 + 13 + 17)

# Zero counts are a problem in practice
P(X=high | f(x)=false) = 0 / (2 + 13 + 0) = 0 (???)
If an attribute value does not occur in any training example, we assign zero probability to that value. Why is this a problem?
Adjust for zero counts by "smoothing" the probability estimates.
In the spam example, we can discard words unseen in the training data, or set P(rareWord | isSpam) to a low but non-zero probability. (A code sketch of counting plus smoothing appears in the aside further below.)

# Pros and cons of Naive Bayes
Pros:
- Extremely simple to estimate
- Uses all data instances for every estimated CPT
- CPTs can be learned incrementally
Cons:
- Not selective (e.g., it gives weight to uninformative words like "the")
- Assumes independence

# Is assuming independence a problem?
What is the effect on the probability estimate? Over-counting of evidence, which leads to overly confident probability estimates.
What is the effect on classification? Less clear…
For a given input x, suppose f(x) = true. Naive Bayes will still classify x correctly as long as P(F(x)=true | x) > 0.5, so an overconfident estimate can yield the right label provided it falls on the correct side of 0.5.

# Measuring classifier error
Classifier error can be broken down into two components:
- Bias (systematic error): a measure of how the hypothesis h qualitatively differs from the true function f
- Variance (random error): a measure of how much h changes (varies) across different sets of examples [recall: the hypothesis h is learned from examples]
Suppose you're measuring student heights:
- Ruler too short or too long? Bias.
- Different random samples from the population? Variance.

# Bias-variance tradeoff
Attempts to reduce bias often increase variance.
Attempts to reduce variance often increase bias.
Naive Bayes has high bias, low variance -- why?

# Classification Trees
Input: x, a set of attributes (discrete or continuous)
Output: a label f(x), or P(F(x) | x)
Assigns a label by performing a sequence of "tests", propositional statements about the input.
Tests are arranged in a tree structure.

# Example: play tennis?
outlook -> (sunny, overcast, rain)
- sunny: humidity -> (high, normal); high: no; normal: yes
- overcast: yes
- rain: windspeed -> (> 7, <= 7); > 7: no; <= 7: yes
(Leaves could be probability distributions instead of labels.)
Drop the input in at the top and deterministically follow one path down to a leaf.
Leaves are mutually exclusive (only one choice is possible) and collectively exhaustive (all possible choices are represented).

# Learning a tree from data, attempt 1
Task: build a tree consistent with the examples.
For each example, construct a path from root to leaf that tests every attribute.
This "memorizes" the entire set of training examples.
Is this a good idea? It gets 100% accuracy on the examples.
But what label would this tree assign to an example it has not seen before? (The tree is not inductive.)

# Learning a tree from data, attempt 2
Task formulation revisited:
- Goal: a tree consistent with all examples
- Cost: the size of the tree
What kind of problem is this? Search!
What search algorithm should we use?
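# Aside: counting and smoothing in code
Here is a minimal sketch, in Python, of the Naive Bayes pieces above: CPTs estimated by counting, add-k (Laplace) smoothing so that zero counts don't force a class probability to zero, and classification via the product over words. The toy emails, the function names (`learn_cpts`, `classify`), and the choice of add-k smoothing are illustrative assumptions, not a reference implementation from the course.

```python
from collections import Counter

# Toy training data (made up): each email is a set of words plus an is-spam label.
emails = [
    ({"win", "money", "now"}, True),
    ({"meeting", "agenda", "now"}, False),
    ({"win", "prize"}, True),
    ({"lunch", "agenda"}, False),
]

def learn_cpts(data, k=1.0):
    """Estimate P(isSpam) and P(w | isSpam) by counting, with add-k (Laplace) smoothing."""
    vocab = set().union(*(words for words, _ in data))
    n_spam = sum(1 for _, is_spam in data if is_spam)
    n_ham = len(data) - n_spam
    p_spam = n_spam / len(data)  # proportion of email that is spam

    spam_counts = Counter(w for words, is_spam in data if is_spam for w in words)
    ham_counts = Counter(w for words, is_spam in data if not is_spam for w in words)

    # Smoothed "proportion of spam emails containing w":
    # (count + k) / (n_class + 2k) is never zero, so unseen values can't veto a class.
    p_w_spam = {w: (spam_counts[w] + k) / (n_spam + 2 * k) for w in vocab}
    p_w_ham = {w: (ham_counts[w] + k) / (n_ham + 2 * k) for w in vocab}
    return p_spam, p_w_spam, p_w_ham

def classify(words, p_spam, p_w_spam, p_w_ham):
    """Return P(isSpam | email) via the product over the email's known words."""
    spam_score, ham_score = p_spam, 1.0 - p_spam
    for w in words:
        if w in p_w_spam:  # words never seen in training are simply discarded
            spam_score *= p_w_spam[w]
            ham_score *= p_w_ham[w]
    # Normalizing here plays the role of the constant alpha in the formula above.
    return spam_score / (spam_score + ham_score)

p_spam, p_w_spam, p_w_ham = learn_cpts(emails)
print(classify({"win", "money", "tuesday"}, p_spam, p_w_spam, p_w_ham))
```

With k = 0 the smoothing disappears: an email containing a word that never appears in any spam example (e.g., "agenda" above) would get P(isSpam | email) = 0, which is exactly the zero-count problem described earlier.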
# Learning a tree using uniform-cost search
Let's try uniform cost…
How much time is required to search all binary trees of a given depth, assuming 20 binary variables and 2000 trees evaluated per second?

| Tree depth | No. of full trees       | Time at 2000 trees/s |
|------------|-------------------------|----------------------|
| 1          | 20                      | ms                   |
| 2          | 20 * 19^2               | s                    |
| 3          | 20 * 19^2 * 18^4        | days                 |
| 4          | 20 * 19^2 * 18^4 * 17^8 | 84M years            |

Looks like we'll be doing a heuristic search.

# Learning a tree from data, attempt 3
Input: examples
Output: a small(ish) tree consistent with the examples
Description:
1. Check the stopping criteria
2. Choose an attribute to test
3. Split the examples based on the attribute test
4. Repeat on each subset
How do we choose an attribute to test? Use the one that splits the data best. (A code sketch of this procedure appears in the aside at the end of these notes.)

# When to stop?
Stop if any of these conditions are met:
- The examples all have the same label
- There are no examples matching the condition on the attribute
- There are no attributes left that split the examples

# Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
Example on the board: patrons (none, some, full) vs. type (French, Thai, Italian, burger).
One option: information gain. Information gained is the reduction in entropy, where entropy is the uncertainty in a random variable, a.k.a. the unpredictability of its information content.

# Entropy, information and information gain
A 50/50 binary variable has, by definition, "1 bit" of entropy, corresponding exactly to its two possible outcomes; two independent 50/50 binary variables have 2 bits.
In general, H(X) = E[I(X)], where I(X) is the information (a.k.a. surprisal):
H(X) = sum over i of P(x_i) * I(x_i)
What is I(x)? Assumptions:
- I(p) ≥ 0 – information is a non-negative quantity
- I(1) = 0 – an event that always happens communicates no information
- I(p1 * p2) = I(p1) + I(p2) – additivity over independent events
The log function gives us exactly what we want:
I(p) = log(1/p); traditionally log_2 is used, but the base doesn't matter (why?)
H(X) = sum over k of P(x_k) * log_2(1/P(x_k))
(See the text for the full derivation of information gain -- which is actually mutual information, the expected KL divergence between the conditional distribution (a CPT) and the marginal.)
There is a bias-variance tradeoff here as well: greedily choosing the most discriminating attribute first may not yield the best tree (bias), but it does keep the tree small (low variance).

# Pros and cons of decision trees
Pros:
- Low bias (in theory, can represent any function)
- Efficient inference, once constructed
- Selective (uses some variables and not others)
- Simple, recursive algorithm for construction
Cons:
- Can have high variance if constructed improperly: repeatedly splitting the data into subsets produces small samples at the leaves
- Large trees can be complicated to understand
- Top nodes are extremely influential
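# Aside: a minimal tree learner in code
To tie together attempt 3, the stopping criteria, and information gain, here is a sketch in Python of a greedy tree learner in the style described above. The data format (a dict of attribute values plus a label), the function names, and the majority-label fallback when attributes run out are assumptions made for illustration; this is not the text's reference algorithm.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = sum over k of P(x_k) * log2(1 / P(x_k))."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def information_gain(examples, attr):
    """Reduction in label entropy obtained by splitting on attr."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [label for x, label in examples if x[attr] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder

def learn_tree(examples, attrs):
    """Greedy top-down construction: check stopping criteria, split on best attribute, recurse."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:        # stop: all examples have the same label
        return labels[0]
    if not attrs:                    # stop: no attributes left that split the examples
        return Counter(labels).most_common(1)[0][0]  # fall back to the majority label
    best = max(attrs, key=lambda a: information_gain(examples, a))
    # Branches are only built for values that occur in the examples, so the
    # "no examples match this value" case never arises here; a fuller version
    # would add a default (majority-label) branch for unseen values.
    subtree = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == value]
        subtree[value] = learn_tree(subset, [a for a in attrs if a != best])
    return (best, subtree)

# Toy "play tennis?"-style data (made up); attributes are dict keys.
data = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
    ({"outlook": "overcast", "humidity": "high"}, "yes"),
    ({"outlook": "rain", "humidity": "normal"}, "yes"),
]
print(learn_tree(data, ["outlook", "humidity"]))
```

On this toy data the two attributes tie on information gain at the root, so the learner falls back to list order and tests outlook first; real implementations break such ties more carefully.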