# Rules

Consider a grocery store learning about items co-bought:

- IF (cheese) THEN chocolate [Conf=0.10, Supp=0.04]
- IF (cheese & Sterno) THEN chocolate [Conf=0.95, Supp=0.01]

Support indicates how frequently the antecedent and the consequent occur together in the data. Confidence indicates the fraction of the rows satisfying the antecedent that also satisfy the consequent. These can also be written as conditional probabilities:

- P(chocolate | cheese) = 0.10
- P(chocolate | cheese, Sterno) = 0.95

As rules get more specific (by including additional propositions), support decreases, but confidence may increase.

# Data representation

A data representation is an abstract data structure for representing individual measurements and collections of measurements.

Individual measurements are things such as:

- Results of a diagnostic test on a patient
- Population of a city
- Duration of a cellphone call

Collections of measurements characterize instances that represent persons, places, things, and events (e.g., a patient, city, or cellphone call).

# Representing data for rules

Rules are represented as a conjunction of binary propositions, but the data are not. Consider a time series of call logs for 3,600 accounts, with fields:

- Date and time
- Day of week
- Duration
- Origin
- Destination
- Fraud?

Now what?

# Common strategies

Discretize continuous variables:

- Duration <= 60 sec
- Time >= 22:00:00

Options for choosing cut points: domain knowledge, EDA, boundary points, ...

Frame multi-valued variables as yes/no questions:

- Does the call originate in Boston, MA?
- Does the origin equal the destination?
- Is the destination on the east coast?
- Is the call made during a weekday?
- Is the call made during business hours?
- Is the destination a restaurant?

This process is called feature construction.

# Whither search?

Given a particular knowledge representation, we would like to search a space of possible models to find one that represents the data well.

This requires a search technique: it defines the set of possible models and a method for systematically generating and examining members of that set. A search technique depends on both a data representation and an evaluation function.

Search techniques are the core algorithmic innovation in most learning techniques.

# Search space for rule learning

Search over possible antecedents. Antecedents are unordered. Example search operators:

- Add a condition to the antecedent (specialization)
- Remove a condition from the antecedent (generalization)
- Alter a condition (e.g., negation, if allowed)

# Generalization

Removing conditions increases the number of instances predicted to be true. A generalization-based learner starts with all conditions in the model.

E.g., consider the rule:

(Duration <= 60 sec) ^ (Origin = NYC) ^ (Time > 22:00) -> Bandit

If we observe a new positive example that falls outside this rule, remove a condition to include it (e.g., remove Origin = NYC).

# Specialization

Adding conditions limits the number of instances predicted to be true.

E.g., consider the rule:

(Duration <= 60 sec) ^ (Time > 22:00) -> Bandit

If we observe a negative example that falls inside this rule, add a condition to exclude it (e.g., add (Day = Weds)).
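To make the feature-construction step above concrete, here is a minimal Python sketch that turns raw call-log records into the kind of binary propositions a rule learner can use. The field names (`duration_sec`, `hour`, `origin`, `destination`, `day_of_week`) and the particular thresholds are illustrative assumptions, not part of the original data set.

```python
# Minimal feature-construction sketch: raw call-log records -> binary propositions.
# Field names, thresholds, and example records are illustrative assumptions.

RAW_CALLS = [
    {"duration_sec": 45, "hour": 23, "origin": "NYC", "destination": "BOS",
     "day_of_week": "Wed", "fraud": True},
    {"duration_sec": 300, "hour": 14, "origin": "BOS", "destination": "BOS",
     "day_of_week": "Mon", "fraud": False},
]

def construct_features(call):
    """Discretize continuous fields and frame multi-valued fields as yes/no questions."""
    return {
        "duration<=60s": call["duration_sec"] <= 60,       # discretized continuous variable
        "time>=22:00": call["hour"] >= 22,                  # discretized continuous variable
        "origin=NYC": call["origin"] == "NYC",              # yes/no question on a multi-valued variable
        "origin=destination": call["origin"] == call["destination"],
        "weekday": call["day_of_week"] in {"Mon", "Tue", "Wed", "Thu", "Fri"},
    }

if __name__ == "__main__":
    for call in RAW_CALLS:
        print(construct_features(call), "-> fraud:", call["fraud"])
```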
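Once the antecedent is stored as an unordered set of binary propositions, the specialization and generalization operators above become one-line set operations, and support and confidence can be measured directly against the binarized data. The sketch below is an illustrative toy, not the original course code; the data rows and proposition names are made up.

```python
# A rule is an unordered antecedent (a frozenset of binary propositions) plus a consequent.
# Data rows map proposition names to True/False; rows and names are illustrative only.

DATA = [
    {"duration<=60s": True,  "time>=22:00": True,  "origin=NYC": True,  "weekday": True,  "Bandit": True},
    {"duration<=60s": True,  "time>=22:00": True,  "origin=NYC": False, "weekday": False, "Bandit": True},
    {"duration<=60s": False, "time>=22:00": False, "origin=NYC": True,  "weekday": True,  "Bandit": False},
    {"duration<=60s": True,  "time>=22:00": False, "origin=NYC": False, "weekday": True,  "Bandit": False},
]

def covers(antecedent, row):
    """The rule fires on a row when every condition in the antecedent holds."""
    return all(row[cond] for cond in antecedent)

def support_confidence(antecedent, consequent, data):
    """Support: fraction of rows where antecedent and consequent both hold.
    Confidence: among rows where the antecedent holds, the fraction where the consequent also holds."""
    fires = [row for row in data if covers(antecedent, row)]
    both = [row for row in fires if row[consequent]]
    support = len(both) / len(data)
    confidence = len(both) / len(fires) if fires else 0.0
    return support, confidence

def specialize(antecedent, all_conditions):
    """Add one condition: each child covers no more rows than its parent."""
    return [antecedent | {c} for c in all_conditions - antecedent]

def generalize(antecedent):
    """Remove one condition: each child covers at least as many rows as its parent."""
    return [antecedent - {c} for c in antecedent]

if __name__ == "__main__":
    conditions = frozenset({"duration<=60s", "time>=22:00", "origin=NYC", "weekday"})
    rule = frozenset({"duration<=60s", "time>=22:00"})
    print("rule:", sorted(rule), support_confidence(rule, "Bandit", DATA))
    for child in specialize(rule, conditions):
        print("  specialization:", sorted(child), support_confidence(child, "Bandit", DATA))
```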
# Search techniques for rule learning

Learning can be cast as optimization: find the model that optimizes the evaluation function.

Typically, simple hill climbing is used, but local maxima are a problem, so:

- Random restarts
- Sideways moves and other hill-climbing variants
- Beam search

# Evaluation functions

An evaluation function associates a numeric score (or scores) with each possible model in a search space, given a data set. Examples include chi-square, G, information gain, a posteriori probability, R^2, the Gini index, and squared error.

Evaluation functions are an integral part of a search technique: they allow the technique to select profitable paths and the best final model.

Evaluation functions are statistics: estimates of a population parameter based on a data sample.

# Uses of evaluation functions

Inside algorithms, evaluation functions guide the search:

- Select which path to pursue in heuristic search
- Decide when to stop searching
- Identify which of the discovered models or patterns should be output

Outside algorithms, they evaluate the results:

- Inform analysts about the absolute quality of a discovered model or pattern
- Help analysts compare the relative quality of different algorithms or outputs

# Contingency tables

Suppose we have the rule:

(Duration <= 60 sec) ^ (Origin = NYC) ^ (Time >= 22:00:00) -> Bandit

How well does it work?

|             | Actual + | Actual - | Total |
|-------------|----------|----------|-------|
| Predicted + | 49       | 6        | 55    |
| Predicted - | 11       | 34       | 45    |
| Total       | 60       | 40       | 100   |

# Measures on contingency tables

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | TP       | FP       |
| Predicted - | FN       | TN       |

- True positive rate (TPR) = TP / (TP + FN)
- False positive rate (FPR) = FP / (FP + TN)
- Recall = TP / (TP + FN) = TPR
- Precision = TP / (TP + FP)
- Sensitivity = TPR
- Specificity = TN / (FP + TN)

# Chi-Square test

Calculates the normalized squared deviation of observed values from expected values:

X^2 = sum_{i=1}^{k} (o_i - e_i)^2 / e_i

The sampling distribution is known, given that cell counts are above minimum thresholds. The test is widely used to test independence between counts of categorical data.

Null hypothesis: the prediction is independent of the actual value. If we can reject the null, then our algorithm works. Look up cutoff values in a table (or just maximize).

# Expected value of cells

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | a        | b        |
| Predicted - | c        | d        |

N = a + b + c + d

e_a = P(predicted pos, actual pos) * N
    = P(predicted pos) * P(actual pos | predicted pos) * N
    = P(predicted pos) * P(actual pos) * N [assuming independence, which is exactly what chi-square is checking]
    = ((a + b) / N) * ((a + c) / N) * N

# Pruning techniques for rule learning

If we're searching, we might want to prune branches of the search tree.

- Unordered antecedents: A ^ B ^ C -> X == C ^ B ^ A -> X, so reorderings of the same antecedent need not be examined twice.
- Optimistic evaluation: use an admissible estimate to prune branches that cannot improve the current score (shades of alpha-beta pruning!).
- Minimum support pruning: only retain branches with enough support in the training set.

# Optimistic evaluation in practice

Consider two branches:

(Duration <= 60 sec) ^ (Origin = NYC) ^ (Time >= 22:00:00) -> Bandit

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | 49       | 6        |
| Predicted - | 11       | 34       |

(Day == Wed) -> Bandit

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | 26       | 19       |
| Predicted - | 28       | 27       |

Which has a higher / lower chi-square value? 40.45 for the first; 0.234 for the second.

Suppose we add a clause to the antecedent of the second:

(Day == Wed) ^ (*) -> Bandit

|             | Actual + | Actual - |
|-------------|----------|----------|
| Predicted + | 26       | 0        |
| Predicted - | 28       | 46       |

This hypothetical clause is so good it turns all false positives into true negatives (by excluding them). The new chi-square is 27.48. We can kill this branch of the search. Why? Because even this best-case specialization (27.48) scores below the first branch (40.45), so no specialization of (Day == Wed) can ever beat it.

Generally, what happens to the table when we specialize rules? That is, which way do instances move? What is the ideal movement of instances in the table?
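Here is a minimal Python sketch of these evaluation ideas: the standard contingency-table measures and the chi-square statistic for the 2x2 tables above. The `correction` flag applies Yates' continuity correction; with it enabled, the (49, 6, 11, 34) table scores roughly 40.45, matching the value quoted above (the uncorrected formula gives about 43.1), which suggests the quoted figures use the corrected statistic. This is a sketch, not the original course code.

```python
# Contingency-table measures and the chi-square statistic for a 2x2 table.
# Sketch only; cell values are taken from the tables above.

def table_measures(tp, fp, fn, tn):
    """Standard measures on a 2x2 contingency table (predicted vs. actual)."""
    return {
        "TPR / recall / sensitivity": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "specificity": tn / (fp + tn),
    }

def chi_square(tp, fp, fn, tn, correction=True):
    """Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
    with expected counts computed under independence of predicted and actual.
    correction=True applies Yates' continuity correction, which appears to be
    what produces the 40.45 / 0.234 / 27.48 values quoted above."""
    n = tp + fp + fn + tn
    observed = [tp, fp, fn, tn]
    row = [tp + fp, tp + fp, fn + tn, fn + tn]        # row total for each cell
    col = [tp + fn, fp + tn, tp + fn, fp + tn]        # column total for each cell
    stat = 0.0
    for o, r, c in zip(observed, row, col):
        e = r * c / n                                  # e.g. e_a = ((a+b)/N) * ((a+c)/N) * N
        dev = abs(o - e)
        if correction:
            dev = max(dev - 0.5, 0.0)
        stat += dev ** 2 / e
    return stat

if __name__ == "__main__":
    print(table_measures(49, 6, 11, 34))           # TPR ~0.82, precision ~0.89, ...
    print(round(chi_square(49, 6, 11, 34), 2))     # ~40.45
    print(round(chi_square(26, 19, 28, 27), 3))    # ~0.234
    print(round(chi_square(26, 0, 28, 46), 2))     # ~27.48
```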
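The optimistic-evaluation idea can also be written as a small function. When a rule is specialized, instances can only leave the predicted-positive row (TP becomes FN, FP becomes TN), so the best any specialization can do is move every false positive to a true negative while keeping every true positive. The sketch below computes that admissible bound and shows why the (Day == Wed) branch can be killed; it repeats a chi-square helper equivalent to the one in the previous sketch so it runs on its own, and it is an illustration rather than the original course code.

```python
# Optimistic (admissible) bound for chi-square when specializing a rule.
# Specializing only shrinks the predicted-positive set: TP -> FN and FP -> TN.
# The best case keeps every TP and moves every FP to TN.

def chi_square(tp, fp, fn, tn, correction=True):
    """Yates-corrected chi-square on a 2x2 table (same helper as the previous sketch)."""
    n = tp + fp + fn + tn
    observed = [tp, fp, fn, tn]
    row = [tp + fp, tp + fp, fn + tn, fn + tn]
    col = [tp + fn, fp + tn, tp + fn, fp + tn]
    stat = 0.0
    for o, r, c in zip(observed, row, col):
        e = r * c / n
        dev = abs(o - e)
        if correction:
            dev = max(dev - 0.5, 0.0)
        stat += dev ** 2 / e
    return stat

def optimistic_bound(tp, fp, fn, tn):
    """Best score reachable by any specialization: all FPs excluded, all TPs kept."""
    return chi_square(tp, 0, fn, tn + fp)

if __name__ == "__main__":
    best_so_far = chi_square(49, 6, 11, 34)      # ~40.45, the current best rule
    bound = optimistic_bound(26, 19, 28, 27)     # ~27.48, best any (Day == Wed) ^ (*) rule can do
    print(round(best_so_far, 2), round(bound, 2))
    if bound < best_so_far:
        print("Prune: no specialization of this branch can beat the current best.")
```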
# Association rules in practice

Compute rules for many consequents at the same time; this is sometimes called market basket analysis. Find sets of variables that frequently occur together, called frequent item sets.

Used by Amazon.com and other online retailers ("Customers who bought these items also bought...") and by recommender systems generally.
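To make the frequent-item-set idea concrete, here is a small brute-force Python sketch over a toy basket data set. The baskets and thresholds are invented for illustration, and real market-basket miners (e.g., Apriori or FP-growth) avoid enumerating every candidate set; this sketch just finds all item sets above a minimum support and then emits the rules above a minimum confidence.

```python
from itertools import combinations

# Toy market-basket data; baskets and thresholds are invented for illustration.
BASKETS = [
    {"cheese", "bread", "wine"},
    {"cheese", "chocolate", "Sterno"},
    {"cheese", "Sterno", "chocolate", "crackers"},
    {"bread", "wine"},
    {"cheese", "bread"},
    {"chocolate", "wine"},
]

MIN_SUPPORT = 2 / len(BASKETS)   # item set must appear in at least 2 baskets
MIN_CONFIDENCE = 0.6

def support(itemset):
    """Fraction of baskets containing every item in the set."""
    return sum(1 for basket in BASKETS if itemset <= basket) / len(BASKETS)

def frequent_itemsets():
    """Brute force: enumerate all subsets of the item universe above MIN_SUPPORT."""
    items = sorted(set().union(*BASKETS))
    frequent = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo))
            if s >= MIN_SUPPORT:
                frequent.append((frozenset(combo), s))
    return frequent

def rules(frequent):
    """Split each frequent set into antecedent -> consequent and keep confident rules."""
    for itemset, supp in frequent:
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            conf = supp / support(antecedent)
            if conf >= MIN_CONFIDENCE:
                yield sorted(antecedent), consequent, supp, conf

if __name__ == "__main__":
    for antecedent, consequent, supp, conf in rules(frequent_itemsets()):
        print(f"IF {antecedent} THEN {consequent}  [Conf={conf:.2f}, Supp={supp:.2f}]")
```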