# Today
- How the brain works, in five minutes or less
- Artificial Neural Networks
- Perceptrons
- Multilayer feed-forward nets
- (unrelated to NNs: k-NNs as the ultimate non-parametric method)

# The brain
I lied, it's a mystery! But the basic components are somewhat understood.
Neurons are cells; axons from other cells "reach out" to another neuron (spanning 1 cm -- 1 m (!!)) and touch at synapses.
Signals appear to propagate through electrochemical reactions; these both control short-term activations and shape long-term responses.
Appears to have a major role in cognition and learning.
Human brains have about 10^11 neurons and 10^14 synapses.

# Biological Neural Networks
Humans can recognize images in as little as 13 milliseconds.
Our brain can distinguish between images of millions of different objects in a fraction of a second.
Completely different approach to classification (or learning in general) than most AI methods: "massive parallelism."

# Artificial Neural Networks
Also called: neural networks, connectionist systems, parallel distributed processing.
Networks of simple processing units that are *abstract* models of neurons.
The network does the computation, not so much the individual neurons.

# Simple model
A model for a neuron (on board):
- Several inputs a_0 .. a_n; each input is weighted (w_0 .. w_n); note the "bias" term
- Inputs are (perhaps) combined through an input function (e.g., sum)
- The "activation function" produces the output
- Output is (perhaps) sent to other units

# Common activation functions
Step function; sign function; sigmoid function (1/(1+e^-x)).

# Thresholds and the bias term
Remember from linear models: y(x, w) = w_0 + w_1 * x_1 + ...
Consider the w_0 term (the "bias") as an adaptable threshold, or as a weight on this neuron as a whole.

# Neural Nets and Logic
McCulloch and Pitts (1943) showed that whatever you can do with logic networks, you can do with networks of abstract neuron-like units.
With a step activation function: weights W = 1, 1 and threshold t = 1.5 give an AND gate!
Q. OR gate? NOT gate? (OR: W = 1, 1 with t = 0.5; NOT: W = -1 with t = -0.5)

# Types of NNs
- Perceptron: simplest, but least powerful
- Multilayer feed-forward: most widely used
- Recurrent: most powerful, but hard to train

# Perceptron
One layer of nodes with a step activation function.
Inputs (weighted): w_0; w_1 * x_1, ..., w_n * x_n
Output: y = step(w^T x)

# What can they represent?
Linearly separable functions.
x1 AND x2 (on board)
Q. x1 OR x2? x1 XOR x2?

# Perceptron training
How do we set the weights?
If the perceptron correctly classifies an example, do nothing.
If it classifies an example incorrectly, move the input weights "a little bit" in the right direction:
w_i = w_i + a(t - o)x_i
where t is the target value, o is the observed value, and a is a small step size.
Essentially, we're taking the derivative and using it to tune the network. This is basically gradient descent.
(Note: if a is too big, we can "step over" a minimum by accident!)
We can also do it analytically, by taking the derivative of the output with respect to the input.

# Example
Consider a multiplier perceptron. Want to increase the output? We can look at the gradient directly:
f(x,y) = xy
df(x,y)/dx = y
df(x,y)/dy = x
and see that following the gradient will increase the output by those factors!

# Multilayered networks
(on board, L to R)
- Two perceptrons, two inputs, two outputs
- MLFF: two inputs, four perceptrons, two outputs

# Another view
- Output unit (1)
- Hidden layer (4)
- Input units (10)
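To make the layered picture concrete, here is a minimal sketch (not from the original notes) of a forward pass through the 10-4-1 network above. The sigmoid activations at both layers and the random placeholder weights are assumptions; only the layer sizes come from the diagram.

```python
# Minimal forward-pass sketch for the 10-4-1 net above.
# Assumptions: sigmoid activations everywhere, random placeholder weights.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 10))   # hidden layer: 4 units, 10 inputs each
b_hidden = np.zeros(4)                # bias terms (the adaptable threshold w_0)
W_out = rng.normal(size=(1, 4))       # output unit: 1 unit, 4 hidden inputs
b_out = np.zeros(1)

x = rng.normal(size=10)               # one input example
hidden = sigmoid(W_hidden @ x + b_hidden)
y = sigmoid(W_out @ hidden + b_out)
print(hidden, y)
```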
# Training MLFF networks
Back propagation is like perceptron training, but across many nodes simultaneously.
Intuition: each input to a node is "pushing" on the node in proportion to the derivative of the output with respect to that input. Tweak the weight accordingly.
You can think of each node in (relative) isolation; but note that the "push" or "pull" of each weight from the next node's output propagates back through the current node. We use this to update weights locally.

# Example
Consider f(x,y,z) = (x+y)z; let q = x+y.
f(q,z) = qz
df(q,z)/dq = z
df(q,z)/dz = q
q(x,y) = x+y
dq(x,y)/dx = 1
dq(x,y)/dy = 1
Backpropagation is just using the chain rule; so
df/dx = df(q,z)/dq * dq(x,y)/dx = z * 1 = z
Each "wire" has a weight forward, and a gradient backwards.

# Notes on MLFFN
Expressiveness: can approximate any continuous function arbitrarily well.
Efficiency: slow to train, but fast to use once trained.
Generalization: very successful in many real-world problems.
Sensitivity to noise: very tolerant of noise in the data.

# Problems
How many hidden nodes to use? No good answer exists; search can be expensive.
Local minima:
- Very real problem
- Can be partially mitigated with random restarts
Can be very, very slow to train (can be improved with 'second-order' methods).

# Recurrent Neural Network
Feed-forward networks have no memory, so they cannot be used effectively for time-series data.
Recurrent networks solve this problem: feed the hidden node activations back in as additional inputs.
Training becomes much harder; local minima become more common.
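As a rough, hypothetical sketch of that last idea (not from the original notes): one recurrent step in which the previous hidden activations are concatenated onto the current input, so the hidden layer also "sees" its own past output. The sigmoid activation, layer sizes, and random weights are placeholder assumptions.

```python
# Sketch of recurrent steps: hidden activations are fed back in as
# additional inputs alongside each element of a time series.
# Assumptions: sigmoid activation, 3 inputs, 4 hidden units, random weights.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(size=(n_hidden, n_in + n_hidden))  # weights over [input; previous hidden]

hidden = np.zeros(n_hidden)                       # initial "memory"
sequence = rng.normal(size=(5, n_in))             # a toy time series of 5 steps
for x_t in sequence:
    hidden = sigmoid(W @ np.concatenate([x_t, hidden]))
print(hidden)
```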