# Exam

Summary statistics of the exam scores:

     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    34.00   63.75   73.50   74.48   88.25  100.00 
    
# Continuous variables

Mostly we've looked at discrete variables:

 - binary (true/false)
 - categorical (red/blue/green)
 - ordinal (d6)

Continuous variables were handled by discretization.

Today we'll talk about handling them in their native, continuous domains.

# Consider Naive Bayes

We've already seen how to use the naive Bayes classifier (NBC) to predict not just labels but (discrete) distributions.

P(class | vars) = alpha P(vars | class) P(class)

Under the naive independence assumption, P(vars | class) factors into a product of per-variable P(var | class) terms.

Each P(var | class) term is its own small distribution (bar graph on board).
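As a minimal sketch of the discrete case (Python; the class priors and probability tables are made-up illustrations, not real data):

```python
# Discrete NBC: each P(var | class) is a small lookup table (a "bar graph").
p_class = {"spam": 0.4, "ham": 0.6}
p_word = {
    "spam": {"offer": 0.30, "meeting": 0.05},
    "ham":  {"offer": 0.02, "meeting": 0.20},
}

def posterior(words):
    # P(class | vars) = alpha * P(vars | class) * P(class),
    # with P(vars | class) a product of per-word terms.
    scores = {c: p_class[c] for c in p_class}
    for c in scores:
        for w in words:
            scores[c] *= p_word[c][w]
    alpha = 1.0 / sum(scores.values())  # normalizing constant
    return {c: alpha * s for c, s in scores.items()}

print(posterior(["offer"]))  # e.g. {'spam': ~0.91, 'ham': ~0.09}
```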

What if, instead, the domain were continuous? (histogram)

# One approach: fit to a distribution

If the distribution "looks like" a well-known distribution, you can fit that parametric model to the data.

(Caution: looks like is a tricky one! See Anscombe's quartet.)

Parametric models have a finite number of parameters.

E.g., normal (Gaussian) distributions can be characterized by mean and variance (mu and sigma); exponential distributions by lambda.
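A sketch of fitting a Gaussian per class (Python with numpy/scipy; the score data are invented for illustration):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical exam scores for two classes.
scores_pass = np.array([67.0, 73.0, 80.0, 88.0, 91.0])
scores_fail = np.array([34.0, 45.0, 51.0, 60.0])

# Fit: estimate mu and sigma for each class.
mu_p, sigma_p = scores_pass.mean(), scores_pass.std(ddof=1)
mu_f, sigma_f = scores_fail.mean(), scores_fail.std(ddof=1)

# P(score | class) is now a density evaluated at a point,
# not an entry in a bar graph.
x = 70.0
print(norm.pdf(x, mu_p, sigma_p))  # likelihood under the "pass" Gaussian
print(norm.pdf(x, mu_f, sigma_f))  # likelihood under the "fail" Gaussian
```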

Extensions: mixture models (weighted sums of distributions), etc.

Parametric distributions can be "summed" or "multiplied" symbolically (e.g., the sum of two independent Gaussians is itself Gaussian).

See an ML or stats class for more details.

# Another approach: estimate density

A nonparametric approach, which is generalizable but computationally intensive, is to estimate the density of the distribution directly from the observed data (kernel density estimation).

(To board: sum of mini-distributions, one kernel centered on each data point.)

Different "kernels" can be used: triangular, uniform, normal. The normal kernel has nice mathematical properties, though it is not optimal in the mean-integrated-squared-error sense that the Epanechnikov kernel is.

# Another approach: Linear models

Treat the output as a linear function of one or more input variables.

In other words, each variable has a coefficient (or weight) that determines its importance.

In the simplest case, a single variable:

h(x) = w1 * x + w0.

# Example

Given a set of training data (points on board), finding the optimal weights is called linear regression.

Task: find [w0, w1] such that the error is minimized.

Often we use the sum of squared errors (L2 loss).

(No need to search; this can be done analytically.)
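A sketch of the analytic solution (Python with numpy; the (x, y) pairs are invented):

```python
import numpy as np

# Hypothetical training data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones for the intercept w0.
X = np.column_stack([np.ones_like(x), x])

# Least squares minimizes the sum of squared errors analytically.
(w0, w1), *_ = np.linalg.lstsq(X, y, rcond=None)
print(w0, w1)  # h(x) = w1 * x + w0
```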

# To generalize

Can add more variables and corresponding weights.

Note that the features need not be linear in the original variables (e.g., a feature could be t^2 or the like); the model only has to be linear in the weights.

So you could model the height of a ball thrown upward as:

h = w2 * t^2 + w1 * t + w0

which is the linear model

h = w2 * x2 + w1 * x1 + w0

where x2 = t^2 and x1 = t.
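A sketch of this feature trick (Python with numpy; the (t, h) measurements are made up):

```python
import numpy as np

# Hypothetical (time, height) measurements of a thrown ball.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
h = np.array([0.2, 8.9, 15.3, 19.2, 20.5])

# Nonlinear features, linear weights: columns [1, t, t^2].
X = np.column_stack([np.ones_like(t), t, t**2])
(w0, w1, w2), *_ = np.linalg.lstsq(X, h, rcond=None)
print(w0, w1, w2)  # h is approximately w2 * t^2 + w1 * t + w0
```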

Can also make each variable (including the predicted variable) a vector.

# Mixing and matching classification and regression

Learn a decision tree on a discretized version of the variables, then prune, then fit a regression on each leaf. This is called a regression tree.

Learn a line that divides the classes. This is called a linear classifier. (On board.)

We will return to this later, as it corresponds to perceptron updates.
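A sketch of the resulting decision rule (Python with numpy; the weights are fixed by hand here, since learning them is the perceptron update we'll return to):

```python
import numpy as np

def linear_classify(x, w, w0):
    """Label a point by which side of the line w . x + w0 = 0 it falls on."""
    return 1 if np.dot(w, x) + w0 > 0 else -1

# Hypothetical weights defining a dividing line in 2D.
w = np.array([1.0, -2.0])
w0 = 0.5
print(linear_classify(np.array([3.0, 1.0]), w, w0))  # prints 1
print(linear_classify(np.array([0.0, 2.0]), w, w0))  # prints -1
```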