University of Massachusetts Amherst
College of Information and Computer Sciences
COMPSCI 589
Machine Learning
Short description:
Introduction to core machine learning models and algorithms for
classification, regression, dimensionality reduction and clustering.
The course will cover the mathematical foundations behind the most common
machine learning algorithms, as well as their effective use in real-world applications.
Requires a strong mathematical background and knowledge
of a high-level programming language such as Python.
Detailed description:
This course will introduce core machine learning models and algorithms for
classification, regression, clustering, and dimensionality reduction.
On the theory side, the course will cover the mathematical foundations
behind the most common machine learning algorithms.
It will focus on understanding models and the relationships between them.
On the applied side, the course will focus on effectively using machine
learning methods to solve realworld problems with an emphasis on
model selection, regularization, design of experiments, and
presentation and interpretation of results.
The course will be held in a flipped-classroom manner, with students
assigned prerecorded videos and the lectures reserved for
discussion, including Q&A on the lecture topics, exercises,
connecting the lecture abstractions to real-world applications,
implementation considerations, and demos. The assignments will involve
both mathematical problems and implementation tasks.
Knowledge of a high-level programming language is absolutely necessary.
Python is most commonly used, but languages such as MATLAB, R, Scala, or Julia
would also be suitable.
Strong foundations in linear algebra, calculus, probability and statistics
are essential for the successful completion of this course.
Lectures: Monday & Wednesday, 2:30-3:45 pm.
Credit: 3 units
Instructor:
Teaching assistants:
Textbooks:
Grading:
- Homeworks: 50%
- Midterm: 30%. In class. Date TBA.
- Mini-Project: 10%. Assignment based on an open challenge.
- Checkpoint Quizzes: 10%
- Extra credit: participation, in class and on Piazza.
Class materials will be posted to the Moodle course.
Discussions will take place on Piazza or on Moodle.
Course topics:

Introduction to Machine Learning. Simple classifiers
- Definition of Machine Learning
- Relationship to other fields
- Course overview
- Learning problem formulation
- Regression vs. classification; supervised vs. unsupervised; parametric vs. nonparametric models
- K-nearest neighbor (KNN) classifiers (sketched in code below)
- Decision trees
Reading:
Bishop, Sections 1.2.1-1.2.4 Probability Theory. ESL Section 2.3.2. ESL Section 2.5.
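For concreteness, a minimal NumPy sketch of a KNN classifier (the toy data and choice of k are illustrative assumptions, not course material):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        """Classify x by majority vote among its k nearest training points."""
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]               # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]              # majority label

    X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([0, 0, 1, 1])
    print(knn_predict(X, y, np.array([0.8, 0.9])))    # -> 1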

Probability and estimation
- Random variable independence
- Bayes' rule
- Estimators
- Maximum likelihood estimator (MLE)
- Maximum a posteriori estimator (MAP); MLE and MAP are contrasted in the sketch below
Reading:
Bishop, 2.1 Binary Variables, 2.2 Multinomial Variables
Advanced: Mitchell, Estimating Probabilities
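A tiny worked example contrasting MLE and MAP for a Bernoulli parameter with a Beta prior (the counts and hyperparameters are illustrative assumptions):

    # Coin-flip estimation: MLE vs. MAP with a Beta(a, b) prior.
    heads, tails = 7, 3        # assumed toy observations
    a, b = 2, 2                # assumed prior hyperparameters

    theta_mle = heads / (heads + tails)                        # 0.7, argmax of the likelihood
    theta_map = (heads + a - 1) / (heads + tails + a + b - 2)  # 8/12 ~ 0.667, posterior mode
    # The prior pulls the MAP estimate toward 0.5; with more data the two agree.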

Naive Bayes
- Bayes Optimal Classifiers
- Conditional Independence
- Naive Bayes
- Learning for Naive Bayes
- Gaussian Naive Bayes
- Naive Bayes use case: the Bag of Words model (sketched below)
Reading:
Mitchell, 3.1 and 3.2, Naive Bayes
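A minimal NumPy sketch of multinomial Naive Bayes on bag-of-words counts, with Laplace smoothing (the toy counts are illustrative assumptions):

    import numpy as np

    def nb_fit(X, y, alpha=1.0):
        classes = np.unique(y)
        log_priors = np.log(np.array([np.mean(y == c) for c in classes]))
        counts = np.array([X[y == c].sum(axis=0) + alpha for c in classes])  # Laplace smoothing
        log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))         # log P(word | class)
        return classes, log_priors, log_lik

    def nb_predict(x, classes, log_priors, log_lik):
        # log P(c | x) is proportional to log P(c) + sum over words of count(w) * log P(w | c)
        return classes[np.argmax(log_priors + log_lik @ x)]

    X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 1], [1, 3, 0]])  # word counts per document
    y = np.array([0, 0, 1, 1])
    print(nb_predict(np.array([2, 0, 1]), *nb_fit(X, y)))       # -> 0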

Linear Discriminant Analysis (LDA)
- Fitting linear responses
- Fitting by least squares
- Maximizing conditional likelihood
- LDA: model class-conditional densities as multivariate Gaussians (see the sketch below)
Reading:
ESL 4.1-4.3 (p. 101-102, 106-110). Bishop 4.1.1-4.1.4 Discriminant Functions. Bishop 4.2 Probabilistic Generative Models.
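A minimal NumPy sketch of LDA with a shared (pooled) covariance (the toy data is an illustrative assumption):

    import numpy as np

    def lda_fit(X, y):
        classes = np.unique(y)
        means = np.array([X[y == c].mean(axis=0) for c in classes])
        priors = np.array([np.mean(y == c) for c in classes])
        centered = np.vstack([X[y == c] - means[i] for i, c in enumerate(classes)])
        cov = centered.T @ centered / (len(X) - len(classes))   # pooled covariance
        return classes, means, priors, np.linalg.inv(cov)

    def lda_predict(x, classes, means, priors, cov_inv):
        # Linear discriminant: x' S^-1 m_c - 0.5 m_c' S^-1 m_c + log pi_c
        scores = [x @ cov_inv @ m - 0.5 * m @ cov_inv @ m + np.log(p)
                  for m, p in zip(means, priors)]
        return classes[int(np.argmax(scores))]

    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
    y = np.array([0, 0, 1, 1])
    print(lda_predict(np.array([0.9, 1.0]), *lda_fit(X, y)))    # -> 1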

Logistic Regression (LR)
- Generative vs. discriminative classifiers
- Classification using the logistic function
- Gradient methods to solve LR: gradient descent, stochastic gradient descent (see the sketch below)
- MLE and MAP estimates for LR
Reading:
ESL Section 4.4 (p. 119-120, 127-132).
Advanced: Mitchell, 3.3, Logistic Regression
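A minimal NumPy sketch of logistic regression fit by batch gradient descent (the data, learning rate, and iteration count are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logreg_fit(X, y, lr=0.1, n_iters=1000):
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            p = sigmoid(X @ w)                # predicted P(y = 1 | x)
            w -= lr * X.T @ (p - y) / len(y)  # gradient of the average log loss
        return w

    X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])  # first column = bias
    y = np.array([0, 0, 1, 1])
    print(sigmoid(X @ logreg_fit(X, y)))  # probabilities increase with the feature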

Generalization and Evaluation
- Training error and generalization error
- Hypothesis space, model capacity
- Generalization, overfitting, underfitting, bias-variance tradeoff
- Regularization, model selection, cross-validation (sketched below)
(Optional) Deep dive: Machine Learning Theory
- Theoretical model of ML
- Generalization bounds
- Consistent learning
- PAC learning
- Agnostic learning. Relationship to the bias/variance tradeoff
- Infinite hypothesis spaces. VC dimension. Sauer's lemma
Reading:
Nina Balcan, Notes on generalization guarantees.
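A minimal sketch of k-fold cross-validation for model selection; `fit` and `error` are stand-ins for any training routine and evaluation metric (all names and the toy usage below are illustrative assumptions):

    import numpy as np

    def k_fold_cv(X, y, fit, error, k=5, seed=0):
        idx = np.random.default_rng(seed).permutation(len(X))
        folds = np.array_split(idx, k)
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(error(fit(X[train], y[train]), X[test], y[test]))
        return float(np.mean(scores))   # averaged held-out error

    fit = lambda X, y: y.mean()                    # trivial "model": predict the mean
    error = lambda m, X, y: np.mean((y - m) ** 2)  # mean squared error
    X = np.arange(20, dtype=float).reshape(-1, 1)
    print(k_fold_cv(X, 2.0 * X[:, 0], fit, error))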

Support Vector Machines
- Maximizing the margin
- Hinge loss vs. logistic loss (see the sketch below)
- Basis expansions and kernels
- The kernel trick
Reading:
ESL Section 12.3. ESL Section 12.3.6 (p. 434-438). Bishop 6.1, 6.2 (p. 291-299).
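A minimal NumPy sketch of a linear soft-margin SVM trained by subgradient descent on the regularized hinge loss, with a Pegasos-style step size (the toy data and lambda are illustrative assumptions):

    import numpy as np

    def svm_fit(X, y, lam=0.1, n_iters=1000):
        """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y_i * w.x_i)); y must be +/-1."""
        w = np.zeros(X.shape[1])
        for t in range(1, n_iters + 1):
            violators = y * (X @ w) < 1   # points inside the margin
            subgrad = lam * w - (y[violators, None] * X[violators]).sum(axis=0) / len(y)
            w -= subgrad / (lam * t)      # step size 1 / (lam * t)
        return w

    X = np.array([[-1.0, 0.5], [-2.0, -0.5], [1.0, 0.5], [2.0, -0.5]])
    y = np.array([-1, -1, 1, 1])
    print(np.sign(X @ svm_fit(X, y)))     # -> [-1. -1.  1.  1.]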

Ensemble Methods
- Introduction to ensembles
- Bagging (sketched below)
- Random forests
- Boosting. AdaBoost
- Stacking
- (Optional) Deep dive: analysis of AdaBoost
Reading:
ESL Chapter 16 (p. 605-622). Bishop Sections 14.3-14.4 (p. 657-665).
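A minimal bagging sketch: decision trees trained on bootstrap resamples, combined by majority vote (assumes scikit-learn is installed and integer class labels; the toy data is illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier   # assumed dependency

    def bagging_fit(X, y, n_estimators=25, seed=0):
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_estimators):
            idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        votes = np.array([m.predict(X) for m in models])
        return np.array([np.bincount(col).argmax() for col in votes.T])  # majority vote

    X = np.random.default_rng(1).standard_normal((100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    print(bagging_predict(bagging_fit(X, y), X[:5]))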

Linear Regression, Ridge, and Lasso
- Regression intro
- Linear regression
- Ordinary least squares (see the sketch below)
- Regularization
Reading:
ESL Sections 3.1, 3.2.1 (p. 43-51). ESL Sections 3.4.1-3.4.3 (p. 61-73).
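Closed-form OLS and ridge regression via the normal equations, as a minimal NumPy sketch (the toy data and lambda are illustrative assumptions):

    import numpy as np

    def ols(X, y):
        return np.linalg.solve(X.T @ X, X.T @ y)

    def ridge(X, y, lam=1.0):
        # Adding lam * I shrinks the coefficients and guarantees invertibility
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column = bias
    y = np.array([0.1, 1.1, 1.9, 3.2])
    print(ols(X, y), ridge(X, y))   # ridge coefficients are pulled toward zero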

Regression trees and smoothing
- Regression trees
- Feature selection
- Kernel smoothing (sketched below)
Reading:
ESL 6.1 and 6.2 (p. 191-200). ESL 9.2.1, 9.2.2 (p. 305-308).
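A minimal Nadaraya-Watson kernel smoother with a Gaussian kernel (the bandwidth and toy 1-D data are illustrative assumptions):

    import numpy as np

    def kernel_smooth(x_query, X, y, h=0.5):
        weights = np.exp(-0.5 * ((x_query - X) / h) ** 2)   # Gaussian kernel weights
        return np.sum(weights * y) / np.sum(weights)        # locally weighted average

    X = np.linspace(0, 5, 20)
    y = np.sin(X) + 0.1 * np.random.default_rng(0).standard_normal(20)
    print(kernel_smooth(2.0, X, y))   # smoothed estimate near sin(2.0)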

Neural Networks and Deep Learning
- The Multilayer Perceptron (MLP) (forward pass sketched below)
- Nonlinear Activations
- Universal Function Approximation
- Convolutional Neural Networks (CNNs) for vision
Reading:
ESL 11.3 Neural Networks
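The forward pass of a one-hidden-layer MLP with a ReLU activation, as a minimal NumPy sketch (the layer sizes and random weights are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # hidden layer: 3 inputs -> 4 units
    W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)   # output layer: 4 units -> 1 output

    def mlp_forward(x):
        h = np.maximum(0, W1 @ x + b1)   # ReLU nonlinearity
        return W2 @ h + b2               # linear output

    print(mlp_forward(np.array([0.5, -1.0, 2.0])))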

Backpropagation and Sequential Neural Networks
- Training neural networks
- Backpropagation (sketched below)
- Learning rates and acceleration
- Recurrent neural networks (RNNs)
- Long Short-Term Memory (LSTM)
Reading:
ESL 11.4 Fitting Neural Networks. ESL 11.5 Some Issues in Training Neural Networks.
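One backpropagation step for a one-hidden-layer ReLU MLP, with a squared loss on a single example (the learning rate and toy data are illustrative assumptions):

    import numpy as np

    def backprop_step(x, target, W1, b1, W2, b2, lr=0.05):
        z1 = W1 @ x + b1                  # forward pass, caching intermediates
        h = np.maximum(0, z1)
        y_hat = W2 @ h + b2
        d_out = y_hat - target            # dL/dy for L = 0.5 * (y_hat - target)^2
        dW2 = np.outer(d_out, h)          # chain rule, layer by layer
        dz1 = (W2.T @ d_out) * (z1 > 0)   # ReLU gradient mask
        dW1 = np.outer(dz1, x)
        return W1 - lr * dW1, b1 - lr * dz1, W2 - lr * dW2, b2 - lr * d_out

    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
    W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)
    x, target = np.array([0.5, -1.0, 2.0]), np.array([1.0])
    for _ in range(100):
        W1, b1, W2, b2 = backprop_step(x, target, W1, b1, W2, b2)
    print(W2 @ np.maximum(0, W1 @ x + b1) + b2)   # approaches the target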

Linear Dimensionality Reduction and SVD
- Dimensionality reduction overview
- Linear dimensionality reduction
- Singular Value Decomposition (SVD) (see the rank-k sketch below)
Reading:
ESL Section 14.5.1 (p. 534-536).
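A minimal sketch of low-rank approximation via the SVD (the random matrix and k are illustrative assumptions):

    import numpy as np

    X = np.random.default_rng(0).standard_normal((6, 4))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation (Frobenius norm)
    # The error equals the norm of the discarded singular values:
    print(np.linalg.norm(X - X_k), np.sqrt(np.sum(s[k:] ** 2)))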

Principal Components Analysis
- Eigenvalue decomposition
- Direction of maximum variance
- Principal Component Analysis (PCA)
- Connection between PCA and SVD (illustrated in the sketch below)
Reading:
Bishop 12.1 Principal Component Analysis (p. 559-569).
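A minimal NumPy sketch of PCA via the eigendecomposition of the sample covariance, and its agreement with the SVD of the centered data (the toy data is an illustrative assumption):

    import numpy as np

    X = np.random.default_rng(0).standard_normal((100, 5))
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :2]     # top-2 directions of maximum variance
    Z = Xc @ components                      # projected (reduced) data

    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    print(np.allclose(np.abs(components.T), np.abs(Vt[:2]), atol=1e-6))  # True, up to sign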

Sparse Coding, NMF, ICA and Kernel PCA
- Sparse coding
- Nonnegative matrix factorization (NMF) (sketched below)
- Independent Component Analysis (ICA)
- Kernel PCA
Reading:
ESL Section 14.6 (p. 553-557). ESL Section 14.7 (p. 557-570).
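A minimal NumPy sketch of NMF via the multiplicative updates of Lee and Seung for the Frobenius objective (the matrix sizes, rank, and iteration count are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.random((8, 6))                    # nonnegative data matrix
    k = 3
    W, H = rng.random((8, k)), rng.random((k, 6))
    eps = 1e-9                                # guards against division by zero
    for _ in range(200):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # updates keep W and H nonnegative
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    print(np.linalg.norm(V - W @ H))          # reconstruction error decreases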

Clustering I
- K-Means (Lloyd's algorithm sketched below)
- Mixture models
- Expectation Maximization (EM)
Reading:
ESL 14.3.4-14.3.11 (k-means). ESL 8.5 (EM).
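A minimal NumPy sketch of Lloyd's algorithm for k-means (the initialization scheme and toy data are illustrative assumptions):

    import numpy as np

    def kmeans(X, k=2, n_iters=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: each point goes to its nearest center
            labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
            # Update step: each center moves to the mean of its assigned points
            # (keep the old center if a cluster empties)
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        return labels, centers

    X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (5, 2)),
                   np.random.default_rng(2).normal(4, 0.3, (5, 2))])
    print(kmeans(X)[0])   # cluster labels for two well-separated groups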

Clustering II
- Exhaustive clustering
- Hierarchical clustering (sketched below)
- Spectral clustering
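A minimal hierarchical-clustering sketch using SciPy's agglomerative linkage (the SciPy dependency and toy data are illustrative assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage   # assumed dependency

    X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
    Z = linkage(X, method="average")               # merge history (the dendrogram)
    print(fcluster(Z, t=2, criterion="maxclust"))  # cut into 2 flat clusters, e.g. [1 1 2 2]
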
Exam exception policy: If you have any special needs or circumstances pertaining to an exam, you must talk to the instructor at least two weeks before the exam.
Late homework policy: If you cannot turn in a homework on time, you must discuss it with the instructor at least one day in advance.
Regrade policy: Any requests for regrading must be submitted within a week of receiving the grade and preferably discussed during office hours. Each TA will be responsible for a different part of the homework, as indicated when the assignment is issued, so please direct questions appropriately. Only contact the instructors after discussing the issue with the TAs.
Copyright/distribution notice:
Many of the materials created for this course are the intellectual property of the course instructors and of the professors whose courses served as a basis for some of the lectures. This includes, but is not limited to, the syllabus, lectures and course notes. Except to the extent not protected by copyright law, any use, distribution or sale of such materials requires the permission of the instructor. Please be aware that it is a violation of university policy to reproduce, for distribution or sale, class lectures or class notes, unless copyright has been explicitly waived by the faculty member.