# Today's topics

- exam review
- classification as probability estimation
    - a special case of learning from examples (classification) 
    - a special case of classification (probability estimation)
- naive Bayes classifiers

# Classification defined

Input x={x1,x2,...xn} is a set of attributes, function f assigns a label y to input x, where y is a discrete variable with a finite number of values.

Function f defines a decision boundary: show 2d example on board

# Example: handwriting

Manmatha's IR on handwritten documents, part of which is recognizing words, e.g., George Washington's handwriting.

Alexandria vs Andrew Lewis

Other examples:

  - predicting blockbusters
  - spam detection
  - given a query, is a web page relevant
  - predicted outcome of a sports match

# Class labels, ranks, probabilities

Different classification tasks can require different kinds of output:

  - Class labels: Crisp class boundaries only 
  - Ranking: Allows for exploration of many potential class boundaries
  - Probabilities: Allows for more refined reasoning about sets of instances

Each requires progressively more accurate  models (e.g., a poor probability estimator can still produce an accurate ranking)

# Classification as probability estimation

Instead of learning a function h (h for hypothesis) that assigns labels, learn a conditional probability distribution over the output of function f

P(F(x)|x) = P(f(x)=Y|x1,x2,…, xn)

Can *use* probability for the other two tasks: classification, ranking

# Example: classifying spam

what does spam look like?

what does non-spam look like?

# Feature construction

How can we choose a representation to represent this knowledge, and to efficiently reason over?

class label: IsSpam
Attributes? One option: words in the email (so called bag-of-words)

# Naive Bayes classifier

Goal: a system that takes an input a set of attributes, x, and returns a probability distribution over labels: P[ F(x) | x]

For example:

  - x represents the email, one variable per word
  - F(x) is true (spam) or false (not spam)

Bayesian networks are a tool for representing uncertain knowledge… 

Can we apply them here? (of course!)  Label and each attribute is a variable.  

Two challenges arise…

# Challenge 1: Choosing a structure

How do we determine conditional independence relations among words x1...xn and F(x)?

Wish the problem away (no, really). That's why it's called "naive": we just assume attributes are conditionally independent given the class label.

What does the resulting network look like?

# How do we do inference?

Same as any other Bayes net.

P(F(x)|x) = alpha P(x | F(x)) P(F(x)) (by Bayes' rule)

Note that P(x | F(x)) P(F(x)) = P(F(x), x)  (by the product rule)

Then using repeated application of the chain aka multiplication rule (equivalently, the semantic property of this Bayes net):

P(F(x), x) = P(x_1|F(X)) P(x_2)|F(X)) ... P(x_n|F(x)) P(F(x)

so ultimately P(F(x)|x) =

product i=1 to n [ P(x_i|F(x)) ] P(F(x))

But where do we get the CPTs?

More Tuesday...