# Today's topics

- information gain defined
- k-nn classifiers
- ensemble classifiers / bagging / random forests
- unsupervised learning

# Entropy and decision trees

H(X) = sum over i [ P(x_i) I(x_i) ]

I(p) = log(1/p) = -log p

For binary variable true with prob q:

B(q) = -(q log q + (1-q) log(1-q))

N.B. x log x -> 0 as x -> 0, so treat 0 log 0 as 0 explicitly in your implementation

Gain is defined as the expected reduction in entropy

Let p be the # of positive examples, n the # of negative examples; A the attribute being split on

Entropy prior to split is B(p / (p + n))

Entropy remaining after the split is Remainder(A), defined (for a Boolean attribute A) as

(p_pos + n_pos) / (p + n) * B(p_pos / (p_pos + n_pos)) +
(p_neg + n_neg) / (p + n) * B(p_neg / (p_neg + n_neg))

where the _pos / _neg subscripts denote the counts in each branch after splitting on A.

Gain(A) = B(p / (p + n)) - Remainder(A)
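Putting the definitions together, a short sketch (base-2 logs; the `splits` argument, holding per-branch positive/negative counts, is a naming choice of mine):

```python
import math

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):          # x log x -> 0 as x -> 0
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gain(p, n, splits):
    """Information gain of splitting on an attribute.

    p, n: positive/negative counts before the split.
    splits: list of (p_k, n_k) counts, one pair per attribute value.
    """
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk))
                    for pk, nk in splits)
    return B(p / (p + n)) - remainder
```

For example, splitting 6 positive / 6 negative examples into branches with counts (0, 2), (4, 0), (2, 4) gives a gain of about 0.541 bits.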

# Another non-parametric model

Both a strength and a weakness: a non-parametric model's number of parameters can grow without limit as more data arrives

Simple example: Use the data as the model

# k-Nearest Neighbors

Widely used, esp. when training set is of manageable size

Algorithm:

  - Save all training examples
  - For classification: find the k nearest neighbors of the query among the training examples and take a majority vote
  - For regression: find the k nearest neighbors and take the mean of their values (or fit a linear regression over them)
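The two variants above are each a few lines; a minimal sketch, assuming training data is a list of (point, label-or-value) pairs with numeric tuple points:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training examples nearest to query."""
    by_dist = sorted(train, key=lambda ex: math.dist(ex[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Mean target value of the k training examples nearest to query."""
    by_dist = sorted(train, key=lambda ex: math.dist(ex[0], query))
    return sum(y for _, y in by_dist[:k]) / k
```

Note the "training" step really is just saving the examples; all the work happens at query time.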

# Example

On board, show cases

# Other supervised learning things

  - SVMs generally SOTA single classifier, if you know nothing and are willing to spend time tuning parameters (also: random forests)
  - ensemble learning (generalization of multiple classifiers)

# k-NN questions

How do we measure "nearest"? Depends on the data and the domain.

Minkowski distance (L^p norm): [ sum over i |x_{j,i} - x_{q,i}|^p ]^(1/p)

  - p = 2: Euclidean distance
  - p = 1: Manhattan distance

Hamming distance: # of Boolean attributes that differ
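These metrics are one-liners; a sketch, assuming `x` and `y` are equal-length sequences:

```python
def minkowski(x, y, p):
    """L^p distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def hamming(x, y):
    """Number of positions where two Boolean vectors differ."""
    return sum(a != b for a, b in zip(x, y))
```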
    
How do we choose k? Smaller k gives a more flexible, higher-variance fit; larger k gives a smoother, more general one (the bias/variance trade-off).

How do we efficiently find the K nearest? Read the book :) k-d trees, LSH, etc.

# Curse of Dimensionality

In high dimensions, everything is spread out further than you'd think

2D/3D intuitions don't hold
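One way to see this numerically: for random points in the unit hypercube, the relative spread of pairwise distances shrinks as the dimension grows, so the "nearest" and "farthest" neighbors become almost equally far away. A quick sketch (function name and setup are mine, for illustration):

```python
import math
import random
import statistics

def relative_spread(dim, n=150, seed=0):
    """Std dev / mean of pairwise distances between n random points
    in the dim-dimensional unit hypercube. As dim grows this ratio
    shrinks: distances concentrate around their mean."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n) for j in range(i + 1, n)]
    return statistics.stdev(dists) / statistics.mean(dists)
```

This is bad news for k-NN in high dimensions: when every point is roughly equidistant, "nearest" carries little information.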

# Now for something completely different

but similarly named

# Unsupervised learning

Learn patterns in the inputs

No explicit feedback given; some bias must be provided (e.g. # of clusters)

Clustering (no other good examples):

  - Clustering into "categories": K-Means
  - Probabilistic clustering (mixture models): EM for Gaussian MMs
 
# Applications for Clustering

Discretizing data (e.g., for input into decision trees)

Finding commonalities among shoppers, students, movie watchers, etc.

Determine which ‘things’ are most similar to other ‘things’

Learning behaviors from demonstration

# K-Means

Input: A set of data points x1, x2, ... xn; an integer K representing the number of clusters

Output: an assignment of data points to clusters

Algorithm: 

 1. Choose K data points from X at random; these are the initial cluster means m1 ... mK (note: Forgy vs. Random Partition initialization)
 2. Repeat until the assignments no longer change:

      - Assign each data point xi to the cluster with the closest mean
      - Update each cluster mean to the centroid of its assigned points
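The loop above can be sketched as follows (Forgy initialization; assumes X is a list of equal-length numeric tuples):

```python
import math
import random

def kmeans(X, k, seed=0):
    """Lloyd's algorithm for K-means (illustrative sketch).
    Returns the final cluster means and each point's cluster index."""
    rng = random.Random(seed)
    means = rng.sample(X, k)               # Forgy: k random data points
    assign = None
    while True:
        # Assignment step: each point joins the cluster with the closest mean.
        new = [min(range(k), key=lambda j: math.dist(x, means[j])) for x in X]
        if new == assign:                  # no changes -> converged
            return means, assign
        assign = new
        # Update step: each mean moves to the centroid of its members.
        for j in range(k):
            members = [x for x, a in zip(X, assign) if a == j]
            if members:                    # guard against an empty cluster
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
```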

# Properties

Each data point (example) only belongs to one cluster at a time

The sum of squared errors of the clustering monotonically decreases with each iteration

Can oscillate between multiple clusterings with same value (sum squared error), but this almost never happens in practice

# EM

K-means is an example of expectation maximization

E-Step (Expectation): Fill in the missing data using current model

M-Step: Update the model to better reflect the “complete” data

To wit:

E-Step: Assign each point to a cluster based on current model of cluster means

M-Step: Update the model (i.e. the cluster means) based on the complete data (the cluster membership)

# EM for GMMs

What if we want a probabilistic clustering? Assign each cluster a density function (example on board in 2D)

Each has mean and variance. After each step, update cluster membership probabilities and repeat until convergence

(continue example on board)
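The two EM steps for a Gaussian mixture can be sketched in one dimension (an illustrative sketch, not the board example; quantile initialization and the variance floor are choices of mine):

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, k=2, iters=50):
    """EM for a 1-D Gaussian mixture; returns per-component
    (weight, mean, variance) triples."""
    xs_sorted = sorted(xs)
    # Initialize means at evenly spaced quantiles of the data.
    mus = [xs_sorted[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    vars_ = [1.0] * k
    ws = [1.0 / k] * k
    for _ in range(iters):
        # E-step: each component's responsibility for each point.
        resp = []
        for x in xs:
            ps = [w * gaussian_pdf(x, m, v) for w, m, v in zip(ws, mus, vars_)]
            total = sum(ps)
            resp.append([p / total for p in ps])
        # M-step: re-estimate weights, means, variances from responsibilities.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            ws[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            vars_[j] = sum(r[j] * (x - mus[j]) ** 2
                           for r, x in zip(resp, xs)) / nj + 1e-6
    return list(zip(ws, mus, vars_))
```

Compare with K-means: the E-step assigns soft (probabilistic) memberships instead of hard ones, and the M-step re-fits weights and variances as well as means.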

# More on EM

EM can be used for a much wider variety of problems than just learning mixture models

Guaranteed to converge to a local maximum of the log likelihood

EM is incredibly useful but non-trivial to learn to use (take the Machine Learning course)

