# Learning and overfitting

Today's topics:

- Pathologies of learning algorithms
- Properties of evaluation functions as statistical estimators
- How evaluation functions and search interact in learning
- Methods for solving the pathologies

# Some amazing facts

- Mutual funds that beat the market!
- Winning the lottery, twice!
- Hidden prophecies in text!
- Amazing coincidences in twins raised apart!

# Example problem

Consider the given training data and test instance. (ppt)

How would you perform supervised learning?

Q1. Features?
Q2. What model(s) might be appropriate?
Q3. Generate a small example model by hand (this can be informal); how well does it work (i.e., fit the data)?

# Example rules

- Trains with small closed cars do not carry toxic chemicals.
- Trains with only a small and a large car, or a jagged-top car, carry toxic chemicals.

# How did you devise your rules?

Did you:

- Look for characteristics present in one set but missing in the other?
- Examine several potential rules?
- Consider simple rules first?
- Reject potential rules that didn't perform well?

You searched!

# How did you evaluate your rules?

You examined how many trains were correctly and incorrectly predicted, and you assigned lower value to rules that predicted fewer trains correctly.

Did you do anything else? For any given train, how confident are you that the answer is correct? Do we have enough data to construct a reliable rule? How many data points are enough?

# Pathologies of learning algorithms

- Overfitting: adding components to models that reduce performance or leave it unchanged
- Oversearching: selecting models with lower performance as the size of the search space grows
- Attribute selection errors: preferring attributes with many possible values despite lower performance

(Jensen and Cohen 2000)

# Learning curves

Learning curves show the relationship between the amount of training data and the accuracy of the model (let's leave aside how we evaluate accuracy for now).

The idiom actually has it backwards: all else being equal, a steep learning curve is considered better, not worse.

# Biased evaluations

Say you evaluate a model (by accuracy, for example) on a sample from the training data in order to search for structure or parameters. When we then evaluate the model on a new sample from the population, we will often find its accuracy is lower than we expected. Why? We have *biased* our model by overfitting to the specific training data we had. The smaller the set of training data, the more pronounced this effect is.

On board: error graph showing total error, variance, and bias as a function of the number of examples.

Use training data to build the model, but keep training and test data separate to produce less-biased estimates of the model's accuracy. (More later.)

# Another case of overfitting

Some models grow with additional input data. ("Big Data," anyone?)

On board: learning curve for a tree; then add tree size: what we want, and what we get (naively).

Using more data naively can lead to bias!

# Oversearching

Evaluation is often heuristic and has error. Biased error! Search is often incomplete. If you do exhaustive search, it turns out that in many cases you will overfit *to your biased evaluation function* as a result. E.g., do you prioritize accuracy over size in your evaluation function for trees?

On board: search subspace vs. full space. Relative accuracy: better on the training set, but often worse on test sets.

# Attribute selection errors

Some models prefer certain attributes over others. E.g., how are decision-tree attribute scores calculated? All else being equal, will an attribute with fewer or more values be selected? (A small simulation sketch follows.)
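To make the many-values preference concrete, here is a minimal simulation sketch (an added illustration; the sample sizes and attribute arities are assumptions, not from the original notes). Both candidate attributes are pure noise with respect to the labels, yet the attribute with more distinct values almost always achieves the higher training-set information gain.

```python
# Minimal sketch: attribute-selection bias toward many-valued attributes.
# Assumed toy setting: labels are pure noise; one candidate attribute is
# binary, the other has 12 distinct values. Neither has any real signal.
import math
import random
from collections import defaultdict

random.seed(0)

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    counts = defaultdict(int)
    for y in labels:
        counts[y] += 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(attr_values, labels):
    """Information gain from splitting `labels` on `attr_values`."""
    n = len(labels)
    groups = defaultdict(list)
    for a, y in zip(attr_values, labels):
        groups[a].append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

n_trials, n_examples = 1000, 30
wins_many_valued = 0
for _ in range(n_trials):
    labels = [random.randint(0, 1) for _ in range(n_examples)]       # pure noise
    binary_attr = [random.randint(0, 1) for _ in range(n_examples)]  # 2 values
    many_attr = [random.randint(0, 11) for _ in range(n_examples)]   # 12 values
    if info_gain(many_attr, labels) > info_gain(binary_attr, labels):
        wins_many_valued += 1

print(f"many-valued attribute wins {wins_many_valued}/{n_trials} comparisons")
# Typically the many-valued attribute wins the overwhelming majority of
# comparisons: fragmenting a small sample into many cells inflates the
# apparent gain even though neither attribute is predictive.
```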
Because search checks every attribute, this preference biases your choice: accuracy on the training set will be higher for the many-valued attribute, but accuracy on new data will often be lower.

# Root cause of the errors above

These pathologies are all due to something called a "multiple comparison procedure" (MCP): you take the best of many options, but doing so biases your result. Let's talk statistics briefly to get more insight.

# Statistical inference

Given a population with some unknown aggregate *parameter* (e.g., mean height), how can we estimate that parameter?

Choose a subset (a sample) and estimate the parameter on that subset to get an estimate of the population's value. This estimate is a *statistic*.

# Sampling distributions

(on board) Consider a hypothetical population. Take many samples and compute the statistic on each. The statistic's values will have a distribution: a sampling distribution.

# Evaluation functions are estimators!

Evaluation functions are functions f(m, S) of models (m) and data samples (S). Samples vary in their "representativeness"; thus:

f(m, S1) = x1 ≠ x2 = f(m, S2)

Each score x is an estimate of some population parameter, ψ (psi).

# Variance

The best estimators produce values that differ only slightly from the population parameter. Such estimators are said to have low variance.

On board: flat vs. steep sampling distributions.

# Bias

The best estimators produce values that center around the population parameter. That is: E(X) = ψ. Such estimators are said to be unbiased.

On board: centered vs. offset sampling distributions.

# Multiple comparison procedures

In general:

- Generate multiple items
- Estimate a score for each
- Select the max-scoring item

In learning:

- Generate n models
- Using the training set and an evaluation function, calculate a score for each model
- Select the model with the maximum score

# MCPs are ubiquitous in learning

Used to select:

- Settings: A>1, A>2, A>4…
- Components: A>3, B=4, C>56.3…
- Models: Tree 1, Tree 2, Tree 3…
- Methods: trees, rules, networks…
- Parameters: depth=4, depth=5, depth=6…

# Example: dice rolling

For a fair die with six outcomes (H0: all outcomes are equally likely), what is the sampling distribution of Xi?

E(Xi | H0) = 3.5
p(Xi > 5 | H0) = 0.167

On board: flat distribution.

# What about MCP dice rolling?

For the maximum of ten dice (H0: all outcomes equally likely), what is the sampling distribution of Xmax?

E(Xmax | H0) ≈ 5.8
p(Xmax > 5 | H0) ≈ 0.838

On board: biased distribution.

# Sampling distributions are biased in MCPs!

The sampling distribution of Xmax differs from the sampling distribution of Xi. A direct analogy exists between dice rolling and searching over multiple models, model components, attributes, etc. The *interpretation* of any given score therefore varies with the number of models (or components, attributes, etc.) compared during search.

# Explaining the pathologies

- Parameter estimates: infer the value of a population parameter from a sample statistic ("What is the accuracy of m?"). The effect of bias is obvious here.
- Hypothesis tests: infer the answer to a question about a population parameter from a sample statistic ("Does m perform better than chance?").

# Hypothesis tests

Remember how hypothesis tests work: under H0, there is a non-zero probability that any model's score xi will exceed some critical value xcrit, and if xi exceeds that value, we reject the null. The probability that the maximum of n scores (xmax) will exceed xcrit is uniformly equal or higher.

# Overfitting as MCP

Many components are available to use in a given model. Algorithms select the component with the maximum score. (See the sketch below.)
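As a direct analogue of the dice example, the sketch below (an added illustration with assumed sizes: 50 candidate components, 40 training examples) scores chance-level "components" against random labels, keeps the one with the maximum training score, and then checks it on fresh data.

```python
# Minimal sketch of the multiple comparison effect in component selection,
# under an assumed worst case: every candidate component is pure noise, so
# its true (population) accuracy is 0.5. Selecting the component with the
# maximum training-set score still produces an impressive-looking number.
import random

random.seed(0)

def accuracy(predictions, labels):
    """Fraction of examples the component 'predicts' correctly."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

n_components, n_train, n_test = 50, 40, 10_000
train_labels = [random.randint(0, 1) for _ in range(n_train)]
test_labels = [random.randint(0, 1) for _ in range(n_test)]

# Each "component" is just a random binary prediction rule: its predictions
# on the training set and on the test set are independent coin flips.
components = [
    {"train": [random.randint(0, 1) for _ in range(n_train)],
     "test": [random.randint(0, 1) for _ in range(n_test)]}
    for _ in range(n_components)
]

best = max(components, key=lambda c: accuracy(c["train"], train_labels))
print("max training accuracy:", round(accuracy(best["train"], train_labels), 3))
print("same component on test:", round(accuracy(best["test"], test_labels), 3))
# Typical result: the selected component scores roughly 0.65-0.70 on the
# training set but near 0.5 on the test set. The max of 50 chance-level
# scores is a biased estimate of the selected component's true accuracy.
```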
The correct sampling distribution depends on the number of components evaluated, yet most learning algorithms do not adjust for that number. (Pruning trees is an a posteriori attempt to deal with this, but it can be biased itself!)

# Biased parameter estimates

Sample scores are routinely used as estimates of population parameters. Any individual score xi is often an unbiased estimator of the population score, but xmax is almost always a biased estimator.

# Oversearching as MCP

Two or more search spaces contain different numbers of models, so the maximum scores in each space are biased to differing degrees. Most algorithms nevertheless compare those scores directly. (Attribute selection errors can be explained in an analogous way.)

# Adjusting for multiple comparisons

Remove bias:

- New data (e.g., Oates & Jensen 1999)
- Cross-validation (e.g., Weiss & Kulikowski 1991)
- Very large data sets, which allow fresh selections of data to be used when finding parameters
- Model pruning

Estimate the sampling distribution:

- Randomization tests (Jensen 1992)

Adjust the probability calculation:

- Bonferroni adjustment (Jensen & Schmill 1997)

Alter the evaluation function itself:

- E.g., BIC (Bayesian information criterion), MDL (minimum description length)

# Strength of effect

The strength of the statistical effects of multiple comparison procedures increases as the:

- Number of items compared (n) increases
- Strength of relationships decreases
- Size of samples decreases
- Correlation among scores decreases

# Implications

- Several pathologies of learning algorithms are caused by a single statistical effect.
- A tradeoff exists between the size of an algorithm's search space and its sample complexity.
- Reducing the size of an algorithm's search space can increase its statistical power.
- Prior knowledge can increase power by reducing the number of models examined.

# Still amazing?

- Mutual funds that beat the market!
- Winning the lottery, twice!
- Hidden prophecies in text!
- Amazing coincidences in twins raised apart!

(See the closing sketch below for the mutual-fund case.)
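A closing sketch (an added illustration with made-up numbers: 1,000 funds, 10 years, each year modeled as a fair coin flip against the market) shows why the first "amazing fact" is less amazing than it looks: selecting the winners after the fact is itself a multiple comparison procedure.

```python
# Closing sketch: "mutual funds that beat the market" as an MCP.
# Assumed model: each of 1,000 funds has a 50% chance of beating the
# market in any given year, independently of everything else.
import random

random.seed(0)

n_funds, n_years = 1000, 10
streaks = [
    sum(random.random() < 0.5 for _ in range(n_years))  # years "beating the market"
    for _ in range(n_funds)
]

strong = sum(s >= 8 for s in streaks)
perfect = sum(s == n_years for s in streaks)
print(f"funds beating the market in 8+ of {n_years} years: {strong}")
print(f"funds beating the market every year: {perfect}")
# Under pure chance we expect roughly 1000 * (1/2)**10, i.e. about one
# fund with a perfect record, and around 55 funds at 8+ years. Picking the
# winners afterwards and calling them skilled is the same biased-maximum
# effect that produces overfitting and oversearching.
```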