# Today's topics

- Information gain defined
- k-NN classifiers
- Ensemble classifiers / bagging / random forests
- Unsupervised learning

# Entropy and decision trees

H(X) = sum over i [ P(x_i) I(x_i) ]

I(p) = log(1/p) = -log p

For a binary variable that is true with probability q:

B(q) = -(q log q + (1 - q) log (1 - q))

N.B. x log x -> 0 as x -> 0; hardcode this case in your implementation.

Gain is defined as the expected reduction in entropy.

Let p be the number of positive examples, n the number of negative examples, and A the attribute being split on.

Entropy prior to the split is B(p / (p + n)).

Entropy remaining after the split is Remainder(A), defined as

Remainder(A) = (p_pos + n_pos)/(p + n) * B(p_pos / (p_pos + n_pos)) + (p_neg + n_neg)/(p + n) * B(p_neg / (p_neg + n_neg))

where _pos / _neg denote the counts in each branch after splitting on A, so

Gain(A) = B(p / (p + n)) - Remainder(A)

(A code sketch of B and Gain appears after the curse-of-dimensionality section below.)

# Another non-parametric model

A strength and a weakness: a non-parametric model's parameters can grow without limit.

Simple example: use the data itself as the model.

# k-Nearest Neighbors

Widely used, especially when the training set is of manageable size.

Algorithm (see the sketch after the curse-of-dimensionality section):
- Save all training examples
- For classification: find the k nearest neighbors of the input (among the training examples) and take a vote
- For regression: find the k nearest neighbors and take their mean (or do a linear regression over them)

# Example

On board, show cases

# Other supervised learning things

- SVMs are generally the state-of-the-art single classifier if you know nothing about the domain and are willing to spend time tuning parameters (also: random forests)
- Ensemble learning (a generalization of combining multiple classifiers)

# k-NN questions

How do we measure "nearest"? It depends on the kind of dimensions (e.g., numeric vs. Boolean attributes).

Minkowski distance: L_p(x_j, x_q) = [ sum over i |x_j,i - x_q,i|^p ]^(1/p)
- p = 2: Euclidean distance
- p = 1: Manhattan distance
- Hamming distance: the number of Boolean attributes that differ

How do we choose k? Lower k is less general, higher k is more so. (Bias/variance?)

How do we efficiently find the k nearest? Read the book :) k-d trees, LSH, etc.

# Curse of Dimensionality

In high dimensions, everything is spread out further than you'd think.

2D/3D intuitions don't hold.
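To make the entropy and gain definitions above concrete, here is a minimal Python sketch; the function names `B` and `gain` and the example counts are illustrative, not from the lecture:

```python
import math

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    # Handle the x log x -> 0 as x -> 0 case explicitly, as noted above.
    if q == 0 or q == 1:
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gain(p, n, branches):
    """Information gain of splitting on an attribute A.

    p, n: positive / negative counts before the split.
    branches: list of (p_k, n_k) counts, one pair per value of A.
    """
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk))
                    for pk, nk in branches)
    return B(p / (p + n)) - remainder

# Example: 8 positive / 4 negative examples, split into two branches.
print(gain(8, 4, [(6, 1), (2, 3)]))
```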
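And a minimal sketch of the k-NN classifier and Minkowski distance described above, assuming numeric features and a simple majority vote (function names and the toy data are illustrative):

```python
from collections import Counter

def minkowski(x, y, p=2):
    """Minkowski distance; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def knn_classify(train, query, k=3, p=2):
    """train: list of (features, label) pairs; query: a feature vector.
    'Save all training examples' is just keeping the train list around."""
    # Find the k training examples closest to the query and take a vote.
    neighbors = sorted(train, key=lambda ex: minkowski(ex[0], query, p))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage
train = [((0, 0), 'a'), ((0, 1), 'a'), ((5, 5), 'b'), ((6, 5), 'b')]
print(knn_classify(train, (1, 1), k=3))  # 'a'
```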
# Now for something completely different but similarly named

# Unsupervised learning

Learn patterns in the inputs.

No explicit feedback is given; some bias must be provided (e.g., the number of clusters).

Clustering (there are few other good examples):
- Clustering into "categories": k-means
- Probabilistic clustering (mixture models): EM for Gaussian mixture models

# Applications for Clustering

- Discretizing data (e.g., for input into decision trees)
- Finding commonalities among shoppers, students, movie watchers, etc.
- Determining what 'things' are most similar to other 'things'
- Learning behaviors from demonstration

# K-Means

Input: a set of data points x1, x2, ..., xn and an integer K, the number of clusters.

Output: an assignment of data points to clusters.

Algorithm (a code sketch appears at the end of these notes):
1. Choose K data points from X at random; these are the initial cluster means m1 ... mK (note Forgy vs. Random Partition initialization).
2. Repeat until no assignments change:
   - Assign each data point xi to the cluster with the closest mean
   - Update each cluster mean

# Properties

Each data point (example) belongs to only one cluster at a time.

The sum of squared errors of the clustering monotonically decreases with each iteration.

It can oscillate between multiple clusterings with the same sum of squared errors, but this almost never happens in practice.

# EM

K-means is an example of expectation maximization.

E-step (Expectation): fill in the missing data using the current model.

M-step (Maximization): update the model to better reflect the "complete" data.

To wit:

E-step: assign each point to a cluster based on the current model of cluster means.

M-step: update the model (i.e., the cluster means) based on the completed data (the cluster memberships).

# EM for GMMs

What if we want a probabilistic clustering?

Assign each cluster a density function (example on board in 2D); each has a mean and a variance.

After each step, update the cluster membership probabilities and repeat until convergence (continue example on board). A code sketch appears at the end of these notes.

# More on EM

EM can be used for a much wider variety of problems than just learning mixture models.

It is guaranteed to converge to a local maximum (in terms of log likelihood).

EM is incredibly useful but non-trivial to learn to use (take the Machine Learning course).
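To make the k-means loop above concrete, here is a minimal sketch assuming Forgy initialization and Euclidean distance (function names and the toy data are illustrative):

```python
import random

def kmeans(X, K, max_iters=100):
    """X: list of points (tuples of floats); K: number of clusters.
    Returns (means, assignments)."""
    # Forgy initialization: pick K distinct data points as the initial means.
    means = random.sample(X, K)
    assignments = [None] * len(X)
    for _ in range(max_iters):
        # E-step: assign each point to the cluster with the closest mean.
        new_assignments = [
            min(range(K),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, means[j])))
            for x in X
        ]
        if new_assignments == assignments:  # no changes: converged
            break
        assignments = new_assignments
        # M-step: update each cluster mean to the average of its members.
        for j in range(K):
            members = [x for x, c in zip(X, assignments) if c == j]
            if members:  # keep the old mean if a cluster becomes empty
                means[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return means, assignments

X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
print(kmeans(X, K=2))
```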
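Finally, a minimal EM sketch for a Gaussian mixture, following the E-step/M-step pattern described above. This is a simplified 1-D illustrative version (the board example was 2-D); the initialization and names are assumptions, not the lecture's exact formulation:

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(X, K, iters=50):
    """Fit a K-component 1-D Gaussian mixture to data X with EM."""
    # Crude initialization: random means, unit variances, uniform weights.
    mus = random.sample(X, K)
    vars_ = [1.0] * K
    weights = [1.0 / K] * K
    for _ in range(iters):
        # E-step: soft cluster-membership probabilities (responsibilities).
        resp = []
        for x in X:
            p = [weights[j] * normal_pdf(x, mus[j], vars_[j]) for j in range(K)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(K):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(X)
            mus[j] = sum(r[j] * x for r, x in zip(resp, X)) / nj
            vars_[j] = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, X)) / nj
            vars_[j] = max(vars_[j], 1e-6)  # avoid collapsing to zero variance
    return weights, mus, vars_

X = [0.0, 0.2, -0.1, 5.0, 5.1, 4.9]
print(em_gmm_1d(X, K=2))
```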