# Rules

Consider a grocery store learning about items co-bought:

- IF (cheese) THEN chocolate [Conf=0.10, Supp=0.04]
- IF (cheese & Sterno) THEN chocolate [Conf=0.95, Supp=0.01]

Support indicates how frequently the antecedent and the consequent occur together in the data. Confidence indicates the fraction of the rows satisfying the antecedent that also satisfy the consequent. These can also be written as conditional probabilities:

- P(chocolate | cheese) = 0.10
- P(chocolate | cheese, Sterno) = 0.95

As rules get more specific (by including additional propositions), support decreases, but confidence may increase.

# Data representation

A data representation is an abstract data structure for representing individual measurements and collections of measurements.

Individual measurements are things such as:

- Results of a diagnostic test on a patient
- Population of a city
- Duration of a cellphone call

Collections of measurements characterize instances that represent persons, places, things, and events (e.g., a patient, city, or cellphone call).

# Representing data for rules

Rules are represented as a conjunction of binary propositions, but the data are not. Consider a time series of call logs for 3,600 accounts, with fields:

- Date and time
- Day of week
- Duration
- Origin
- Destination
- Fraud?

Now what?

# Common strategies

Discretize continuous variables:

- Duration <= 60 sec
- Time >= 22:00:00

Options for choosing cut points: domain knowledge, EDA, boundary points, ...

Frame multi-valued variables as yes/no questions:

- Does the call originate in Boston, MA?
- Does the origin equal the destination?
- Is the destination on the east coast?
- Is the call made during a weekday?
- Is the call made during business hours?
- Is the destination a restaurant?

This process is called feature construction.

# Whither search?

Given a particular knowledge representation, we would like to search a space of possible models to find one that represents the data well.

This requires a search technique: it defines the set of possible models and a method for systematically generating and examining members of that set. A search technique depends on both a data representation and an evaluation function.

Search techniques are the core algorithmic innovation in most learning techniques.

# Search space for rule learning

Search over possible antecedents. Antecedents are unordered. Example search operators:

- Add a condition to the antecedent (specialization)
- Remove a condition from the antecedent (generalization)
- Alter a condition (e.g., negation, if allowed)

# Generalization

Removing conditions increases the number of instances predicted to be true. A generalization-based learner starts with all conditions in the model.

E.g., consider the rule:

(Duration <= 60 sec) ^ (Origin = NYC) ^ (Time > 22:00) -> Bandit

If we observe a new positive example that falls outside this rule, remove a condition to include it (e.g., remove Origin = NYC).

# Specialization

Adding conditions limits the number of instances predicted to be true.

E.g., consider the rule:

(Duration <= 60 sec) ^ (Time > 22:00) -> Bandit

If we observe a negative example that falls inside this rule, add a condition to exclude it (e.g., add (Day = Weds)).
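To make the feature-construction step above concrete, here is a minimal Python sketch that turns raw call-log records into the kind of binary propositions a rule learner can use. The field names (`duration_sec`, `hour`, `origin`, `destination`, `day_of_week`) and the particular thresholds are illustrative assumptions, not part of the original data set.

```python
# Minimal feature-construction sketch: raw call-log records -> binary propositions.
# Field names, thresholds, and example records are illustrative assumptions.

RAW_CALLS = [
    {"duration_sec": 45, "hour": 23, "origin": "NYC", "destination": "BOS",
     "day_of_week": "Wed", "fraud": True},
    {"duration_sec": 300, "hour": 14, "origin": "BOS", "destination": "BOS",
     "day_of_week": "Mon", "fraud": False},
]

def construct_features(call):
    """Discretize continuous fields and frame multi-valued fields as yes/no questions."""
    return {
        "duration<=60s": call["duration_sec"] <= 60,       # discretized continuous variable
        "time>=22:00": call["hour"] >= 22,                  # discretized continuous variable
        "origin=NYC": call["origin"] == "NYC",              # yes/no question on a multi-valued variable
        "origin=destination": call["origin"] == call["destination"],
        "weekday": call["day_of_week"] in {"Mon", "Tue", "Wed", "Thu", "Fri"},
    }

if __name__ == "__main__":
    for call in RAW_CALLS:
        print(construct_features(call), "-> fraud:", call["fraud"])
```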
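Once the antecedent is stored as an unordered set of binary propositions, the specialization and generalization operators above become one-line set operations, and support and confidence can be measured directly against the binarized data. The sketch below is an illustrative toy, not the original course code; the data rows and proposition names are made up.

```python
# A rule is an unordered antecedent (a frozenset of binary propositions) plus a consequent.
# Data rows map proposition names to True/False; rows and names are illustrative only.

DATA = [
    {"duration<=60s": True,  "time>=22:00": True,  "origin=NYC": True,  "weekday": True,  "Bandit": True},
    {"duration<=60s": True,  "time>=22:00": True,  "origin=NYC": False, "weekday": False, "Bandit": True},
    {"duration<=60s": False, "time>=22:00": False, "origin=NYC": True,  "weekday": True,  "Bandit": False},
    {"duration<=60s": True,  "time>=22:00": False, "origin=NYC": False, "weekday": True,  "Bandit": False},
]

def covers(antecedent, row):
    """The rule fires on a row when every condition in the antecedent holds."""
    return all(row[cond] for cond in antecedent)

def support_confidence(antecedent, consequent, data):
    """Support: fraction of rows where antecedent and consequent both hold.
    Confidence: among rows where the antecedent holds, the fraction where the consequent also holds."""
    fires = [row for row in data if covers(antecedent, row)]
    both = [row for row in fires if row[consequent]]
    support = len(both) / len(data)
    confidence = len(both) / len(fires) if fires else 0.0
    return support, confidence

def specialize(antecedent, all_conditions):
    """Add one condition: each child covers no more rows than its parent."""
    return [antecedent | {c} for c in all_conditions - antecedent]

def generalize(antecedent):
    """Remove one condition: each child covers at least as many rows as its parent."""
    return [antecedent - {c} for c in antecedent]

if __name__ == "__main__":
    conditions = frozenset({"duration<=60s", "time>=22:00", "origin=NYC", "weekday"})
    rule = frozenset({"duration<=60s", "time>=22:00"})
    print("rule:", sorted(rule), support_confidence(rule, "Bandit", DATA))
    for child in specialize(rule, conditions):
        print("  specialization:", sorted(child), support_confidence(child, "Bandit", DATA))
```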
# Search techniques for rule learning

Learning can be cast as optimization: find the model that optimizes the evaluation function.

Typically, simple hill climbing is used, but local maxima are a problem, so:

- Random restarts
- Sideways moves and other hill-climbing variants
- Beam search

# Evaluation functions

An evaluation function associates a numeric score (or scores) with each possible model in a search space, given a data set. Examples include chi-square, G, information gain, a posteriori probability, R^2, the Gini index, and squared error.

Evaluation functions are an integral part of a search technique: they allow the technique to select profitable paths and the best final model.

Evaluation functions are statistics: estimates of a population parameter based on a data sample.

# Uses of evaluation functions

Inside algorithms, evaluation functions guide the search:

- Select which path to pursue in heuristic search
- Decide when to stop searching
- Identify which of the discovered models or patterns should be output

Outside algorithms, they evaluate the results:

- Inform analysts about the absolute quality of a discovered model or pattern
- Help analysts compare the relative quality of different algorithms or outputs

# Contingency tables

Suppose we have the rule:

(Duration <= 60 sec) ^ (Origin = NYC) ^ (Time >= 22:00:00) -> Bandit

How well does it work?

|             | Actual + | Actual - | Total |
|-------------|----------|----------|-------|
| Predicted + | 49       | 6        | 55    |
| Predicted - | 11       | 34       | 45    |
| Total       | 60       | 40       | 100   |

# Measures on contingency tables

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | TP       | FP       |
| Predicted - | FN       | TN       |

- True positive rate (TPR) = TP / (TP + FN)
- False positive rate (FPR) = FP / (FP + TN)
- Recall = TP / (TP + FN) = TPR
- Precision = TP / (TP + FP)
- Sensitivity = TPR
- Specificity = TN / (FP + TN)

# Chi-Square test

Calculates the normalized squared deviation of observed values from expected values:

X^2 = sum_{i=1}^{k} (o_i - e_i)^2 / e_i

The sampling distribution is known, given that cell counts are above minimum thresholds. The test is widely used to test independence between counts of categorical data.

Null hypothesis: the prediction is independent of the actual value. If we can reject the null, then our algorithm works. Look up cutoff values in a table (or just maximize).

# Expected value of cells

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | a        | b        |
| Predicted - | c        | d        |

N = a + b + c + d

e_a = P(predicted pos, actual pos) * N
    = P(predicted pos) * P(actual pos | predicted pos) * N
    = P(predicted pos) * P(actual pos) * N [assuming independence, which is exactly what chi-square is checking]
    = ((a + b) / N) * ((a + c) / N) * N

# Pruning techniques for rule learning

If we're searching, we might want to prune branches of the search tree.

- Unordered antecedents: A ^ B ^ C -> X == C ^ B ^ A -> X, so reorderings of the same antecedent need not be examined twice.
- Optimistic evaluation: use an admissible estimate to prune branches that cannot improve the current score (shades of alpha-beta pruning!).
- Minimum support pruning: only retain branches with enough support in the training set.

# Optimistic evaluation in practice

Consider two branches:

(Duration <= 60 sec) ^ (Origin = NYC) ^ (Time >= 22:00:00) -> Bandit

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | 49       | 6        |
| Predicted - | 11       | 34       |

(Day == Wed) -> Bandit

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | 26       | 19       |
| Predicted - | 28       | 27       |

Which has a higher / lower chi-square value? 40.45 for the first; 0.234 for the second.

Suppose we add a clause to the antecedent of the second:

(Day == Wed) ^ (*) -> Bandit

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | 26       | 0        |
| Predicted - | 28       | 46       |

This hypothetical clause is so good it turns all false positives into true negatives (by excluding them). The new chi-square is 27.48. We can kill this branch of the search. Why? Because even this best-case specialization (27.48) scores below the first branch (40.45), so no specialization of (Day == Wed) can ever beat it.

Generally, what happens to the table when we specialize rules? That is, which way do instances move? What is the ideal movement of instances in the table?
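Here is a minimal Python sketch of these evaluation ideas: the standard contingency-table measures and the chi-square statistic for the 2x2 tables above. The `correction` flag applies Yates' continuity correction; with it enabled, the (49, 6, 11, 34) table scores roughly 40.45, matching the value quoted above (the uncorrected formula gives about 43.1), which suggests the quoted figures use the corrected statistic. This is a sketch, not the original course code.

```python
# Contingency-table measures and the chi-square statistic for a 2x2 table.
# Sketch only; cell values are taken from the tables above.

def table_measures(tp, fp, fn, tn):
    """Standard measures on a 2x2 contingency table (predicted vs. actual)."""
    return {
        "TPR / recall / sensitivity": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "specificity": tn / (fp + tn),
    }

def chi_square(tp, fp, fn, tn, correction=True):
    """Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
    with expected counts computed under independence of predicted and actual.
    correction=True applies Yates' continuity correction, which appears to be
    what produces the 40.45 / 0.234 / 27.48 values quoted above."""
    n = tp + fp + fn + tn
    observed = [tp, fp, fn, tn]
    row = [tp + fp, tp + fp, fn + tn, fn + tn]        # row total for each cell
    col = [tp + fn, fp + tn, tp + fn, fp + tn]        # column total for each cell
    stat = 0.0
    for o, r, c in zip(observed, row, col):
        e = r * c / n                                  # e.g. e_a = ((a+b)/N) * ((a+c)/N) * N
        dev = abs(o - e)
        if correction:
            dev = max(dev - 0.5, 0.0)
        stat += dev ** 2 / e
    return stat

if __name__ == "__main__":
    print(table_measures(49, 6, 11, 34))           # TPR ~0.82, precision ~0.89, ...
    print(round(chi_square(49, 6, 11, 34), 2))     # ~40.45
    print(round(chi_square(26, 19, 28, 27), 3))    # ~0.234
    print(round(chi_square(26, 0, 28, 46), 2))     # ~27.48
```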
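The optimistic-evaluation idea can also be written as a small function. When a rule is specialized, instances can only leave the predicted-positive row (TP becomes FN, FP becomes TN), so the best any specialization can do is move every false positive to a true negative while keeping every true positive. The sketch below computes that admissible bound and shows why the (Day == Wed) branch can be killed; it repeats a chi-square helper equivalent to the one in the previous sketch so it runs on its own, and it is an illustration rather than the original course code.

```python
# Optimistic (admissible) bound for chi-square when specializing a rule.
# Specializing only shrinks the predicted-positive set: TP -> FN and FP -> TN.
# The best case keeps every TP and moves every FP to TN.

def chi_square(tp, fp, fn, tn, correction=True):
    """Yates-corrected chi-square on a 2x2 table (same helper as the previous sketch)."""
    n = tp + fp + fn + tn
    observed = [tp, fp, fn, tn]
    row = [tp + fp, tp + fp, fn + tn, fn + tn]
    col = [tp + fn, fp + tn, tp + fn, fp + tn]
    stat = 0.0
    for o, r, c in zip(observed, row, col):
        e = r * c / n
        dev = abs(o - e)
        if correction:
            dev = max(dev - 0.5, 0.0)
        stat += dev ** 2 / e
    return stat

def optimistic_bound(tp, fp, fn, tn):
    """Best score reachable by any specialization: all FPs excluded, all TPs kept."""
    return chi_square(tp, 0, fn, tn + fp)

if __name__ == "__main__":
    best_so_far = chi_square(49, 6, 11, 34)      # ~40.45, the current best rule
    bound = optimistic_bound(26, 19, 28, 27)     # ~27.48, best any (Day == Wed) ^ (*) rule can do
    print(round(best_so_far, 2), round(bound, 2))
    if bound < best_so_far:
        print("Prune: no specialization of this branch can beat the current best.")
```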
# Association rules in practice

Compute rules for many consequents at the same time; this is sometimes called market basket analysis. Find sets of variables that frequently occur together, called frequent item sets.

Used by Amazon.com and other online retailers ("Customers who bought these items also bought...") and by recommender systems generally.
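To make the frequent-item-set idea concrete, here is a small brute-force Python sketch over a toy basket data set. The baskets and thresholds are invented for illustration, and real market-basket miners (e.g., Apriori or FP-growth) avoid enumerating every candidate set; this sketch just finds all item sets above a minimum support and then emits the rules above a minimum confidence.

```python
from itertools import combinations

# Toy market-basket data; baskets and thresholds are invented for illustration.
BASKETS = [
    {"cheese", "bread", "wine"},
    {"cheese", "chocolate", "Sterno"},
    {"cheese", "Sterno", "chocolate", "crackers"},
    {"bread", "wine"},
    {"cheese", "bread"},
    {"chocolate", "wine"},
]

MIN_SUPPORT = 2 / len(BASKETS)   # item set must appear in at least 2 baskets
MIN_CONFIDENCE = 0.6

def support(itemset):
    """Fraction of baskets containing every item in the set."""
    return sum(1 for basket in BASKETS if itemset <= basket) / len(BASKETS)

def frequent_itemsets():
    """Brute force: enumerate all subsets of the item universe above MIN_SUPPORT."""
    items = sorted(set().union(*BASKETS))
    frequent = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo))
            if s >= MIN_SUPPORT:
                frequent.append((frozenset(combo), s))
    return frequent

def rules(frequent):
    """Split each frequent set into antecedent -> consequent and keep confident rules."""
    for itemset, supp in frequent:
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            conf = supp / support(antecedent)
            if conf >= MIN_CONFIDENCE:
                yield sorted(antecedent), consequent, supp, conf

if __name__ == "__main__":
    for antecedent, consequent, supp, conf in rules(frequent_itemsets()):
        print(f"IF {antecedent} THEN {consequent}  [Conf={conf:.2f}, Supp={supp:.2f}]")
```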