# Today

- How the brain works, in five minutes or less
- Artificial Neural Networks
- Perceptrons
- Multilayer feed-forward nets
- (unrelated to NNs: k-NNs as the ultimate non-parametric method)

# The brain

I lied, it's a mystery!

But, basic components somewhat understood.

Neurons are cells; axons from other cells "reach toward" a neuron over distances of 1cm to 1m (!!) and touch it at synapses.

Signals appear to propagate through electrochemical reactions; these control short-term activations and shape long-term responses, and appear to play a major role in cognition and learning.

Human brains have about 10^11 neurons and 10^14 synapses

# Biological Neural Networks

Humans can recognize images in as little as 13 milliseconds

Our brain can distinguish between images of millions
of different objects in a fraction of a second

A completely different approach to classification (and to learning in general) than most AI methods: "massive parallelism"

# Artificial Neural Networks

Also called: neural networks, connectionist systems, parallel distributed processing

Networks of simple processing units that are *abstract* models of neurons

The network does the computation, not so much the individual neurons

# Simple model

A model for a neuron (on board)

Several inputs a_0 .. a_n; each input is weighted (w_0 .. w_n); note the "bias" term

Inputs are (perhaps) combined through an input function (e.g., sum)

The "activation function" produces output

The output is (perhaps) sent on to other units
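As a rough sketch in Python (assuming a weighted sum as the input function and a step activation; the names are illustrative):

```python
import numpy as np

def unit_output(inputs, weights, bias, activation):
    """One abstract neuron: weighted sum of inputs plus bias, then activation."""
    total = np.dot(weights, inputs) + bias   # input function: sum
    return activation(total)                 # activation function produces the output

# Example: two inputs, both weighted 1, bias -1.5, step activation
step = lambda x: 1 if x > 0 else 0
print(unit_output([1, 1], [1, 1], -1.5, step))  # -> 1
```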

# Common activation functions

Step function; sign function; sigmoid function (1/(1+e^-x))
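Sketches of these three, assuming numpy:

```python
import numpy as np

def step(x):
    """0/1 threshold at zero."""
    return np.where(x > 0, 1, 0)

def sign(x):
    """-1/+1 threshold at zero."""
    return np.where(x >= 0, 1, -1)

def sigmoid(x):
    """Smooth, differentiable squashing function: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))
```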

# Thresholds and the bias term

Remember from linear models:

y(x, w) = w_0 + w_1 * x_1 + ...

Consider the w_0 term (the "bias") as an adaptable threshold or weight on this neuron as a whole
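One way to see this (a small sketch, assuming we prepend a constant input x_0 = 1): the bias folds into the weight vector.

```python
import numpy as np

w = np.array([0.5, 2.0, -1.0])   # w_0 (bias), w_1, w_2
x = np.array([3.0, 4.0])         # the actual inputs x_1, x_2

# y(x, w) = w_0 + w_1*x_1 + w_2*x_2 ...
y_explicit = w[0] + np.dot(w[1:], x)

# ... is the same as a plain dot product with a constant input x_0 = 1
x_aug = np.concatenate(([1.0], x))
y_folded = np.dot(w, x_aug)

assert np.isclose(y_explicit, y_folded)
```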

# Neural Nets and Logic
McCulloch and Pitts, 1943, showed that whatever you can do with logic networks, you can do with networks of abstract neuron-like units.

With the step activation function: two inputs, each with weight W = 1, and a threshold of 1.5, make an AND gate!

Q. OR gate? NOT gate? (W = 1, 1, t = 0.5; W = -1, t = -0.5)
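A quick check of these gates in Python (a sketch using the weights and thresholds above):

```python
def gate(weights, threshold, inputs):
    """McCulloch-Pitts style unit: fire (1) if the weighted sum exceeds the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) > threshold)

AND = lambda a, b: gate([1, 1], 1.5, [a, b])
OR  = lambda a, b: gate([1, 1], 0.5, [a, b])
NOT = lambda a:    gate([-1], -0.5, [a])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```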

# Types of NNs

Perceptron: Simplest, but least powerful

Multilayer Feed-forward: Most widely used

Recurrent: Most powerful, but hard to train

# Perceptron

One layer of nodes with step activation function

inputs: w_0 (bias); w_1 * x_1 ... w_n * x_n

output: y = step (w^T x)

# What can they represent?

Linearly separable functions

x1 AND x2 (on board)

Q.

x1 OR x2

x1 XOR x2
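A quick illustration (not a proof), assuming a single unit with bias w_0 and weights w_1, w_2: a coarse grid search finds weights for AND and OR but none for XOR, since XOR is not linearly separable.

```python
import itertools
import numpy as np

def fits(truth_table, w0, w1, w2):
    """True if the unit step(w0 + w1*x1 + w2*x2) reproduces the whole truth table."""
    return all(int(w0 + w1 * x1 + w2 * x2 > 0) == t
               for (x1, x2), t in truth_table.items())

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
tables = {
    "AND": {x: int(all(x)) for x in inputs},
    "OR":  {x: int(any(x)) for x in inputs},
    "XOR": {x: int(sum(x) == 1) for x in inputs},
}

grid = np.linspace(-2, 2, 21)
for name, table in tables.items():
    found = any(fits(table, w0, w1, w2)
                for w0, w1, w2 in itertools.product(grid, repeat=3))
    print(name, "separable on this grid:", found)
```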

# Perceptron training

How do we set the weights?

If the perceptron correctly classifies the example, do nothing.

If it classifies the example incorrectly, move the input weights "a little bit" in the right direction:

w_i = w_i + a * (t - o) * x_i

t is the target value; o is the observed output; a is a small learning rate

Essentially, we're taking the derivative and using it to tune the network

This is basically gradient descent. (Note: if a is too big, we can "step over" a minimum by accident!)

We can also do it analytically, by taking the derivative of the output with respect to the input.
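A minimal sketch of the rule in Python (numpy assumed; the data and learning rate are illustrative):

```python
import numpy as np

def train_perceptron(X, targets, alpha=0.1, epochs=20):
    """Perceptron rule: w_i += alpha * (t - o) * x_i, with x_0 = 1 for the bias."""
    X = np.c_[np.ones(len(X)), X]              # prepend the bias input x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, targets):
            o = 1 if np.dot(w, x) > 0 else 0   # step activation
            w += alpha * (t - o) * x           # no change when t == o
    return w

# Learn AND, which is linearly separable
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
t = [0, 0, 0, 1]
w = train_perceptron(X, t)
print(w, [1 if np.dot(w, np.r_[1, x]) > 0 else 0 for x in X])
```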

# Example

Consider a unit that multiplies its two inputs

Want to increase the output? We can look at the gradient directly

f(x,y) = xy

df(x,y)/ dx = y

df(x,y)/ dy = x

and see that following the gradient will increase the output in proportion to those factors!
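Numerically (illustrative values): nudging each input along its gradient does increase the output.

```python
def f(x, y):
    return x * y

x, y = 3.0, -2.0
# Analytic gradient: df/dx = y, df/dy = x
dx, dy = y, x

step_size = 0.01
print(f(x, y))                                    # -6.0
print(f(x + step_size * dx, y + step_size * dy))  # -5.8706, slightly larger
```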

# Multilayered networks

(on board, L to R)

Two perceptrons, two inputs, two outputs

MLFF, two inputs, four perceptrons, two outputs

# Another view

Output unit (1)

Hidden layer (4) 

Input units (10)
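A sketch of the forward pass for this shape (10 inputs, 4 hidden units, 1 output), assuming sigmoid activations and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Weights and biases for a 10 -> 4 -> 1 feed-forward network
W1, b1 = rng.normal(size=(4, 10)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = rng.normal(size=10)            # input units (10)
h = sigmoid(W1 @ x + b1)           # hidden layer (4)
y = sigmoid(W2 @ h + b2)           # output unit (1)
print(y)
```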

# Training MLFF networks

Backpropagation is like perceptron training, but across many nodes simultaneously.

Intuition:

Each input to a node is "pushing" on the node in proportion to the derivative of the output with respect to the input. Tweak the weight accordingly.

You can think of each node in (relative) isolation; but note that the "push" or "pull" of each weight on the next node's output propagates back through the current node. We use this to update each weight locally.

# Example

Consider f(x,y,z)=(x+y)z ;  q = x+y

f(q,z) = qz

df(q,z)/dq = z

df(q,z)/dz = q

q(x,y) = x+y

dq(x,y)/dx = 1
dq(x,y)/dy = 1

Backpropagation is just using the chain rule; so

df(q,z)/dx = dq(x,y)/dx * df(q,z)/dq

Each "wire" has a weight forward, and a gradient backwards

# Notes on MLFFN

Expressiveness: Can approximate any continuous function arbitrarily well

Efficiency: Slow to train, but fast to use once trained

Generalization: Very successful in many real world problems

Sensitivity to noise: Very tolerant to noise in data

# Problems 

How many hidden nodes to use? No good answer exists; search can be expensive.

Local minima:

  - Very real problem
  - Can be partially mitigated with random restarts

Can be very very slow to train (can be improved with ‘second-order’ methods)

# Recurrent Neural Network

Feed-forward networks have no memory, so they cannot be used effectively for time-series data

Recurrent networks solve this problem: Feed the hidden node activations back in as additional inputs

Training becomes much harder; local minima become more common
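A minimal sketch of one recurrent step (tanh activation assumed; shapes are illustrative): the hidden activations are fed back in as extra inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5

W_x = rng.normal(size=(n_hidden, n_in))      # input -> hidden weights
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden (the recurrent loop)
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                       # hidden state ("memory")
for x_t in rng.normal(size=(4, n_in)):       # a short time series
    h = np.tanh(W_x @ x_t + W_h @ h + b)     # previous activations fed back in
print(h)
```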
