Lecture (Date) | Day | Topic | Materials/Reading |
---|---|---|---|
1. 9/6 | Tue | Course overview. Probability review. Linearity of expectation. | Slides. Compressed slides. MIT short videos and exercises on probability. Khan Academy probability lessons (a bit more basic). |
Randomized Methods, Sketching & Streaming | |||
2. 9/8 | Thu | Estimating set size by counting duplicates. Concentration Bounds: Markov's inequality. Random hashing for efficient lookup. | Slides. Compressed slides. |
3. 9/13 | Tue | Analysis of random hashing. 2-level hashing. Markov's inequality. | Slides. Compressed slides. Reading: Chapter 2.2 of Foundations of Data Science with content on Markov's inequality and Chebyshev's inequality. Exercises 2.1-2.6. |
4. 9/15 | Thu | 2-universal and pairwise independent hashing. Hashing for load balancing. Chebyshev's inequality. The union bound. | Slides. Compressed slides. Some notes (Arora and Kothari at Princeton) proving that the ax+b mod p hash function described in class is 2-universal. A Python sketch of this hash family appears after the schedule. |
5. 9/20 | Tue | Exponential concentration bounds and the central limit theorem. | Slides. Compressed slides. Reading: Some notes (Goemans at MIT) showing how to prove exponential tail bounds using the moment generating function + Markov's inequality approach discussed in class. |
6. 9/22 | Thu | Finish up exponential concentration bounds. Bloom Filters. | Slides. Compressed slides. Reading: Chapter 4 of Mining of Massive Datasets, with content on Bloom filters. See here for a full Bloom filter analysis. See here for some explanation of why a version of a Bloom filter with no false negatives cannot be achieved without using a lot of space. See Wikipedia for a discussion of the many Bloom filter variants, including counting Bloom filters and Bloom filters with deletions. See Wikipedia again and these notes for an explanation of Cuckoo Hashing, a randomized hash table scheme which, like 2-level hashing, has O(1) query time, but also has expected O(1) insertion time. A Bloom filter sketch in Python appears after the schedule. |
7. 9/27 | Tue | Finish up Bloom Filters. Min-Hashing for Distinct Elements. | Slides. Compressed slides. Reading: Chapter 4 of Mining of Massive Datasets, with content on distinct elements counting. |
8. 9/29 | Thu | Distinct elements analysis. The median trick. Distinct elements in practice: Flajolet-Martin and HyperLogLog. | Slides. Compressed slides. Reading: Chapter 4 of Mining of Massive Datasets, with content on distinct elements counting. The 2007 paper introducing the popular HyperLogLog distinct elements algorithm. |
9. 10/4 | Tue | Jaccard similarity estimation with MinHash. Locality sensitive hashing for fast similarity search. | Slides. Compressed slides. Reading: Chapter 3 of Mining of Massive Datasets, with content on Jaccard similarity, MinHash, and locality sensitive hashing. A MinHash sketch in Python appears after the schedule. |
10. 10/6 | Thu | Finish up LSH. The frequent elements problem and count-min sketch. | Slides. Compressed slides. Reading: Notes (Amit Chakrabarti at Dartmouth) on streaming algorithms. See Chapters 2 and 4 for frequent elements. Some more notes on the frequent elements problem. A website with lots of resources, implementations, and example applications of count-min sketch. A count-min sketch example in Python appears after the schedule. |
11. 10/11 | Tue | Finish up frequent elements estimation. Dimensionality reduction, low-distortion embeddings, and the Johnson-Lindenstrauss Lemma. | Slides. Compressed slides. Reading: Chapter 2.7 of Foundations of Data Science on the Johnson-Lindenstrauss lemma. Notes on the JL-Lemma (Anupam Gupta at CMU). Sparse random projections, which can be applied more quickly. Linear Algebra Review: Khan Academy. |
12. 10/13 | Thu | Johnson-Lindenstrauss lemma proof. | Slides. Compressed slides. Reading: Chapter 2.7 of Foundations of Data Science on the Johnson-Lindenstrauss lemma. Notes on the JL-Lemma (Anupam Gupta at CMU). |
13. 10/18 | Tue | Midterm Review. | Slides. |
10/20 | Thu | Midterm | Study guide and review questions. |
Spectral Methods | |||
14. 10/25 | Tue | High-dimensional geometry and connections to the JL Lemma/dimensionality reduction. | Slides. Compressed slides. Reading: Chapters 2.3-2.6 of Foundations of Data Science on high-dimensional geometry. |
15. 10/27 | Thu | Intro to principal component analysis, low-rank approximation, data-dependent dimensionality reduction. Orthogonal bases and projection matrices. | Slides. Compressed slides. Reading: Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD. Some good videos for linear algebra review. Some other good videos. |
16. 11/01 | Tue | Best fit subspaces and optimal low-rank approximation via eigendecomposition. | Slides. Compressed slides. Reading: Proof that optimal low-rank approximation can be found greedily (see Section 1.1). Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation. |
17. 11/03 | Thu | Eigenvalues as a measure of low-rank approximation error. The singular value decomposition and connections to low-rank approximation. | Slides. Compressed slides. Reading: Notes on SVD and its connection to eigendecomposition/PCA (Roughgarden and Valiant at Stanford). Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD. |
18. 11/08 | Tue | Applications of low-rank approximation beyond compression. Matrix completion, entity embeddings, and non-linear dimensionality reduction. | Slides. Compressed slides. Reading: Notes on matrix completion, with proof of recovery under incoherence assumptions (Jelani Nelson at Harvard). Levy and Goldberg's paper on word embeddings as implicit low-rank approximation. |
19. 11/10 | Thu | Spectral graph theory and spectral clustering. | Slides. Compressed slides. Reading: Chapter 10.4 of Mining of Massive Datasets on spectral graph partitioning. For a lot more interesting material on spectral graph methods see Dan Spielman's lecture notes. Great notes on spectral graph methods (Roughgarden and Valiant at Stanford). |
20. 11/15 | Tue | The stochastic block model. | Slides. Compressed slides. Reading: Dan Spielman's lecture notes on the stochastic block model, including matrix concentration + Davis-Kahan perturbation analysis. Further stochastic block model notes (Alessandro Rinaldo at CMU). A survey of the vast literature on the stochastic block model, beyond the spectral methods discussed in class (Emmanuel Abbe at Princeton). |
21. 11/17 | Thu | Computing the SVD: power method. Krylov methods. Connection to random walks and Markov chains. | Slides. Compressed slides. Reading: Chapter 3.7 of Foundations of Data Science on the power method for SVD. Some notes on the power method (Roughgarden and Valiant at Stanford). A power method sketch in Python appears after the schedule. |
11/22 | Tue | No Class. Friday class schedule followed. | |
11/24 | Thu | No Class. Thanksgiving recess. | |
Optimization | |||
22. 11/29 | Tue | Class Canceled | |
23. 12/01 | Thu | Start on optimization and gradient descent. | Slides. Compressed slides. Reading: Chapters I and III of these notes (Hardt at Berkeley). Multivariable calculus review, e.g., through Khan Academy. |
24. 12/06 | Tue | Gradient descent analysis for convex Lipschitz functions. | Slides. Compressed slides. Reading: Chapters I and III of these notes (Hardt at Berkeley). A gradient descent sketch in Python appears after the schedule. |
25. 12/08 | Thu | Constrained optimization and projected gradient descent. Course conclusion/review. | Slides. Compressed slides. |
12/14 | Wed | Final Exam, 10:30am-12:30pm. | Study guide and review questions. See Moodle for practice exams. |
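
The short Python sketches below illustrate a few of the algorithms from the schedule. Each is a minimal illustration under the assumptions noted in its lead-in, written for this page rather than taken from lecture.

First, the ((ax + b) mod p) mod m hash family from Lecture 4, whose 2-universality is proved in the Arora-Kothari notes. The prime `P` and the range size `m` are arbitrary illustrative choices, and `make_hash` is a hypothetical helper name.

```python
import random

# Prime modulus for the hash family; assumed larger than the key universe.
# 2^31 - 1 is an arbitrary illustrative choice.
P = 2_147_483_647

def make_hash(m: int):
    """Draw a random h(x) = ((a*x + b) mod P) mod m from the 2-universal
    family: a uniform in {1, ..., P-1}, b uniform in {0, ..., P-1}."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = make_hash(100)
# Over the random choice of (a, b), any two distinct keys collide with
# probability roughly 1/m.
print(h(42), h(43))
```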
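A minimal Bloom filter sketch for Lecture 6. Salted SHA-256 digests stand in for the k independent hash functions, which is an implementation convenience rather than the construction analyzed in class; the class and method names (`BloomFilter`, `might_contain`) are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item: str):
        # Derive k pseudo-independent positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # True if every position is set; may err on items never inserted.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=4)
bf.add("massive datasets")
print(bf.might_contain("massive datasets"))  # True
print(bf.might_contain("not inserted"))      # almost surely False
```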
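A MinHash sketch for the Jaccard similarity estimation of Lecture 9: each signature coordinate is the minimum hash of a set under one random function, and the fraction of agreeing coordinates estimates the Jaccard similarity. Composing Python's built-in `hash` with a random (a*x + b) mod p map is an illustrative stand-in for the hash family used in class.

```python
import random

P = 2_147_483_647  # illustrative prime modulus

def random_hash():
    """A random map x -> (a*hash(x) + b) mod P, using Python's built-in hash."""
    a, b = random.randrange(1, P), random.randrange(0, P)
    return lambda x: (a * hash(x) + b) % P

def minhash_signature(s, hashes):
    """For each hash function, store its minimum value over the set."""
    return [min(h(x) for x in s) for h in hashes]

def estimate_jaccard(sig_a, sig_b):
    """Each coordinate agrees with probability equal to the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

hashes = [random_hash() for _ in range(200)]
A = {"the", "quick", "brown", "fox"}
B = {"the", "quick", "red", "fox"}
sig_A = minhash_signature(A, hashes)
sig_B = minhash_signature(B, hashes)
print(estimate_jaccard(sig_A, sig_B))  # true Jaccard similarity is 3/5 = 0.6
```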
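A count-min sketch example for the frequent elements material of Lecture 10, with width w and depth d as the only parameters. The per-row hash functions reuse the same illustrative (a*x + b) mod p construction; all names are hypothetical.

```python
import random

class CountMinSketch:
    """Count-min sketch: d rows of w counters. Each update increments one
    counter per row; a query takes the minimum over the d rows, which
    overestimates the true frequency but never underestimates it."""

    P = 2_147_483_647  # illustrative prime modulus

    def __init__(self, w: int, d: int):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.salts = [(random.randrange(1, self.P),
                       random.randrange(0, self.P)) for _ in range(d)]

    def _col(self, row: int, item) -> int:
        a, b = self.salts[row]
        return ((a * hash(item) + b) % self.P) % self.w

    def add(self, item):
        for r in range(self.d):
            self.table[r][self._col(r, item)] += 1

    def estimate(self, item) -> int:
        return min(self.table[r][self._col(r, item)] for r in range(self.d))

cms = CountMinSketch(w=1000, d=5)
for token in ["a", "b", "a", "c", "a"]:
    cms.add(token)
print(cms.estimate("a"))  # >= 3; equals 3 unless collisions inflate every row
```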
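A power method sketch for Lecture 21, estimating the top right singular vector of a matrix by repeatedly applying A^T A to a random start vector and re-normalizing. The fixed iteration count is an illustrative choice in place of a tolerance-based stopping rule.

```python
import numpy as np

def power_method(A: np.ndarray, iters: int = 100) -> np.ndarray:
    """Estimate the top right singular vector of A: each iteration applies
    A^T A to the current vector and re-normalizes, so the component along
    the top singular direction grows fastest."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[1])  # random start vector
    for _ in range(iters):
        v = A.T @ (A @ v)           # one step of the power iteration
        v /= np.linalg.norm(v)      # re-normalize to avoid overflow
    return v

A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 1.0]])
v = power_method(A)
_, _, Vt = np.linalg.svd(A)
print(np.abs(v @ Vt[0]))  # close to 1: matches the true top singular vector
```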
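Finally, a gradient descent sketch for Lectures 23-24, run on a small least-squares problem where the gradient of f(x) = ||Ax - b||^2 is 2 A^T(Ax - b). The step size and iteration count are hand-picked for this instance; the lectures' analysis for convex Lipschitz functions instead sets them from the Lipschitz constant and the target accuracy.

```python
import numpy as np

def gradient_descent(grad, x0, step: float, iters: int):
    """Plain gradient descent: x_{t+1} = x_t - step * grad(x_t)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# A small least-squares instance: f(x) = ||Ax - b||^2, grad f(x) = 2 A^T (Ax - b).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
grad = lambda x: 2 * A.T @ (A @ x - b)

x = gradient_descent(grad, np.zeros(2), step=0.005, iters=5000)
print(x)                                     # approaches [0, 0.5]
print(np.linalg.lstsq(A, b, rcond=None)[0])  # exact least-squares solution
```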