1. 9/3 |
Tue |
Course overview. Probability review. |
Slides. Compressed slides. Reading: MIT short videos and exercises on probability (go to Unit 4). Khan Academy probability lessons (a bit more basic). Chapters 1-3 of Probability and Computing with content and exercises on basic probability, expectation, variance, and concentration bounds. |
Randomized Methods, Sketching & Streaming |
2. 9/5 |
Thu |
Linearity of expectation and variance. Estimating set size by counting duplicates. Markov's inequality. Random hashing for efficient lookup. |
Slides. Compressed slides. Reading: Chapters 1-3 of Probability and Computing with content and exercises on basic probability, expectation, variance, and concentration bounds. |
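As a supplement, a minimal sketch of one way duplicate counting can estimate set size (the exact estimator from lecture may differ): sample m items uniformly with replacement, so each of the m(m-1)/2 pairs collides with probability 1/n; by linearity of expectation the expected number of duplicate pairs is m(m-1)/(2n), and inverting gives an estimate of n.

```python
import random

def estimate_set_size(sample_fn, m=2000):
    """Estimate |S| from m uniform-with-replacement samples by counting
    duplicate pairs: each pair collides with probability 1/|S|, so
    E[#duplicate pairs] = m(m-1) / (2|S|)."""
    counts = {}
    for _ in range(m):
        s = sample_fn()
        counts[s] = counts.get(s, 0) + 1
    dup_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    if dup_pairs == 0:
        return float("inf")  # too few samples to see any collision
    return m * (m - 1) / (2 * dup_pairs)

n = 100_000
est = estimate_set_size(lambda: random.randrange(n))
print(f"true size: {n}  estimate: {est:.0f}")
```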
3. 9/10 |
Tue |
Collision-free hashing. 2-level hashing. 2-universal and pairwise independent hashing. |
Slides. Compressed slides. Reading: Chapter 2.2 of Foundations of Data Science with content on Markov's inequality and Chebyshev's inequality. Exercises 2.1-2.6. Chapters 1-3 of Probability and Computing with content and exercises on basic probability, expectation, variance, and concentration bounds. Some notes (Arora and Kothari at Princeton) proving that the ax+b mod p hash function described in class is 2-universal. |
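A minimal Python sketch of the (ax+b) mod p family from class; the prime p and bucket count m below are arbitrary illustrative choices:

```python
import random

class UniversalHash:
    """The h(x) = ((a*x + b) mod p) mod m family from class: for a prime
    p at least the universe size, random a in {1,...,p-1} and b in
    {0,...,p-1}, any two distinct keys collide with probability <= 1/m."""
    def __init__(self, p, m):
        self.p, self.m = p, m
        self.a = random.randrange(1, p)
        self.b = random.randrange(p)

    def __call__(self, x):
        return ((self.a * x + self.b) % self.p) % self.m

h = UniversalHash(p=2_147_483_647, m=1000)  # p = 2^31 - 1 is prime
print(h(42), h(43))
```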
4. 9/12 |
Thu |
Hashing for load balancing. Chebyshev's inequality. The union bound. Maybe start on exponential concentration bounds. |
Slides. Compressed slides. Reading: Chapter 2.2 of Foundations of Data Science with content on Markov's inequality and Chebyshev's inequality. Exercises 2.1-2.6. Chapters 1-3 of Probability and Computing with content and exercises on basic probability, expectation, variance, and concentration bounds. |
5. 9/17 |
Tue |
Exponential concentration bounds and the central limit theorem. |
Slides. Compressed slides. Reading: Chapter 4 of Probability and Computing on exponential concentration bounds. Some notes (Goemans at MIT) showing how to prove exponential tail bounds using the moment generating function + Markov's inequality approach discussed in class. |
6. 9/19 |
Thu |
Finish up applications of exponential concentration bounds. Bloom Filters. |
Slides. Compressed slides. Reading: Chapter 4 of Mining of Massive Datasets, with content on Bloom filters. See here for full Bloom filter analysis. See Wikipedia for a discussion of the many Bloom filter variants, including counting Bloom filters and Bloom filters with deletions. |
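For concreteness, a toy Bloom filter sketch; salted SHA-256 stands in for the idealized independent random hash functions used in the analysis, and the parameters m and k are arbitrary:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted hash functions into an m-bit array.
    May return false positives, never false negatives."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def contains(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=10_000, k=5)
bf.insert("alice")
print(bf.contains("alice"), bf.contains("bob"))  # True False (w.h.p.)
```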
7. 9/24 |
Tue |
Finish up Bloom filters. Start on streaming algorithms and frequent elements estimation. |
Slides. Compressed slides. Reading: Chapter 4 of Mining of Massive Datasets, with content on Bloom filters. Notes (Amit Chakrabarti at Dartmouth) on streaming algorithms. See Chapters 1 and 5 for frequent elements. Some more notes on the frequent elements problem. |
8. 9/26 |
Thu |
Frequent elements estimation via Count-min sketch. Min-Hashing for distinct elements. |
Slides. Compressed slides. Reading: Notes (Amit Chakrabarti at Dartmouth) on streaming algorithms. See Chapters 1 and 5 for frequent elements. Some more notes on the frequent elements problem. A website with lots of resources, implementations, and example applications of count-min sketch. Chapter 4 of Mining of Massive Datasets, with content on distinct elements counting. |
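A minimal count-min sketch, again with salted SHA-256 standing in for the random hash functions; width w and depth d are illustrative:

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: d rows of w counters with one hash per row.
    estimate() takes the minimum over rows, so it only overestimates,
    by roughly (total count)/w per row in expectation."""
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _hash(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest, "big") % self.w

    def add(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._hash(item, r)] += count

    def estimate(self, item):
        return min(self.table[r][self._hash(item, r)] for r in range(self.d))

cms = CountMinSketch(w=2000, d=5)
for token in ["a"] * 100 + ["b"] * 5:
    cms.add(token)
print(cms.estimate("a"), cms.estimate("b"))  # 100 and 5 here, w.h.p.
```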
9. 10/1 |
Tue |
Finish up distinct elements counting. The median trick. Distinct elements in practice: Flajolet-Martin and HyperLogLog. |
Slides. Compressed slides. Reading: Chapter 4 of Mining of Massive Datasets, with content on distinct elements counting. The 2007 paper introducing the popular HyperLogLog distinct elements algorithm. |
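A rough sketch of the min-based distinct elements estimator with the median trick (the exact variant from lecture may differ; the hash below is a pseudo-random stand-in for an idealized uniform hash, and the group sizes are arbitrary):

```python
import random, statistics

def h01(seed, x, M=2**61 - 1):
    """Pseudo-random hash of x into (0, 1]; a stand-in for the idealized
    uniform random hash assumed in the analysis."""
    return (hash((seed, x)) % M + 1) / M

def distinct_elements(stream, groups=5, per_group=12):
    """Min-based distinct elements: with n distinct items, the expected
    minimum hash value is 1/(n+1), so 1/mean(min) - 1 estimates n.
    Averaging within each group reduces variance; the median across
    groups (the median trick) boosts the success probability."""
    k = groups * per_group
    mins = [1.0] * k
    for x in stream:
        for t in range(k):
            mins[t] = min(mins[t], h01(t, x))
    estimates = []
    for g in range(groups):
        mean_min = sum(mins[g * per_group:(g + 1) * per_group]) / per_group
        estimates.append(1 / mean_min - 1)
    return statistics.median(estimates)

stream = [random.randrange(5000) for _ in range(20_000)]
print(f"estimate: {distinct_elements(stream):.0f}  true: {len(set(stream))}")
```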
10. 10/3 |
Thu |
Start on Jaccard similarity, fast similarity search, and locality sensitive hashing. |
Slides. Compressed slides. Reading: Chapter 3 of Mining of Massive Datasets, with content on Jaccard similarity, MinHash, and locality sensitive hashing. |
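A small MinHash demonstration (the hash family below is an illustrative stand-in, not necessarily the one from class): the fraction of coordinates on which two signatures agree estimates the Jaccard similarity of the underlying sets.

```python
import random

def make_hashers(k, p=2_147_483_647):
    """k random (a*x + b) mod p functions, standing in for the random
    hash functions/permutations in the MinHash analysis."""
    coeffs = [(random.randrange(1, p), random.randrange(p)) for _ in range(k)]
    return [lambda x, a=a, b=b: (a * hash(x) + b) % p for a, b in coeffs]

def minhash_signature(items, hashers):
    """One coordinate per hash function: the minimum hash over the set.
    Pr[two signatures agree in a coordinate] = Jaccard similarity."""
    return [min(h(x) for x in items) for h in hashers]

hashers = make_hashers(200)
A, B = set(range(0, 1000)), set(range(200, 1200))
sigA, sigB = minhash_signature(A, hashers), minhash_signature(B, hashers)
est = sum(a == b for a, b in zip(sigA, sigB)) / len(hashers)
print(f"estimated Jaccard: {est:.2f}  true: {len(A & B) / len(A | B):.2f}")
```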
11. 10/8 |
Tue |
Finish up locality sensitive hashing. Start on compressing high dimensional data -- low-distortion embeddings and the Johnson-Lindenstrauss Lemma. |
Slides. Compressed slides. Reading: Chapter 3 of Mining of Massive Datasets, with content on Jaccard similarity, MinHash, and locality sensitive hashing. Chapter 2.7 of Foundations of Data Science on the Johnson-Lindenstrauss lemma. Notes on the JL-Lemma (Anupam Gupta at CMU). Linear Algebra Review: Khan Academy. |
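A sketch of the standard LSH banding scheme over MinHash signatures (toy signatures and arbitrary band/row choices): items that agree on all rows of any band become candidate pairs, so a pair with Jaccard similarity s is reported with probability 1 - (1 - s^r)^b.

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, bands, rows):
    """MinHash LSH via banding: split each length bands*rows signature
    into bands; items sharing any full band become candidate pairs."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for item_id, sig in signatures.items():
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[chunk].append(item_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add((min(ids[i], ids[j]), max(ids[i], ids[j])))
    return candidates

# toy signatures: docs 0 and 1 agree on most coordinates, doc 2 does not
sigs = {0: [7, 3, 5, 5, 1, 9], 1: [7, 3, 5, 5, 2, 9], 2: [4, 8, 0, 6, 2, 1]}
print(lsh_candidate_pairs(sigs, bands=3, rows=2))  # {(0, 1)}
```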
12. 10/10 |
Thu |
Proof of the Johnson-Lindenstrauss Lemma. Example application to clustering. |
Slides. Compressed slides. Reading: Chapter 2.7 of Foundations of Data Science on the Johnson-Lindenstrauss lemma. Notes on the JL-Lemma (Anupam Gupta at CMU). Sparse random projections, which can be multiplied by more quickly. |
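A quick numpy demonstration of the JL lemma using a scaled Gaussian projection (the dimensions are arbitrary illustrative choices): all pairwise distances are preserved up to small multiplicative distortion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 10_000, 400  # n points in d dimensions, target dimension k

X = rng.standard_normal((n, d))
Pi = rng.standard_normal((k, d)) / np.sqrt(k)  # scaled Gaussian JL map
Y = X @ Pi.T

# distortion of all pairwise Euclidean distances
ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i in range(n) for j in range(i + 1, n)]
print(f"distance ratios lie in [{min(ratios):.3f}, {max(ratios):.3f}]")
```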
10/15 |
Tue |
No Class. Monday class schedule followed. |
|
13. 10/17 |
Thu |
Midterm Review. Midterm in the evening. 7-9pm in ILCN 151, 211, 331. |
Study guide and review questions. |
Spectral Methods |
10/22 |
Tue |
No Class. Professor Traveling. |
|
14. 10/24 |
Thu |
Intro to principal component analysis, low-rank approximation, data-dependent dimensionality reduction. Orthogonal bases and projection matrices. Dual column/row view of low-rank approximation. |
Slides. Compressed slides. Reading: Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD. Some good videos for linear algebra review. Some other good videos overviewing the SVD and related topics (like orthogonal projection and low-rank approximation). |
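A small numpy illustration of the row view of low-rank approximation (matrix sizes are arbitrary): V_k V_k^T is an orthogonal projection matrix, and projecting each row of A onto the span of the top k right singular vectors gives the rank-k approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 50))
A += 0.01 * rng.standard_normal(A.shape)  # near-rank-20 matrix plus noise

k = 10
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k].T                 # top k right singular vectors (orthonormal)
P = Vk @ Vk.T                 # orthogonal projection onto their span
A_k = A @ P                   # project each row: the rank-k approximation
print("relative error:", np.linalg.norm(A - A_k) / np.linalg.norm(A))
```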
15. 10/29 |
Tue |
Best fit subspaces and optimal low-rank approximation via eigendecomposition. |
Slides. Compressed slides. Reading: Proof that optimal low-rank approximation can be found greedily (see Section 1.1). Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation. |
16. 10/31 |
Thu |
Finish up optimal low-rank approximation via eigendecomposition. Eigenvalues as a measure of low-rank approximation error. General linear algebra review. |
Slides. Compressed slides. Reading: Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation. |
11/05 |
Tue |
No Class. Election Day. |
|
17. 11/07 |
Thu |
The singular value decomposition and connections to low-rank approximation. Applications of low-rank approximation beyond compression. Matrix completion and entity embeddings. |
Slides. Compressed slides. Reading: Notes on SVD and its connection to eigendecomposition/PCA (Roughgarden and Valiant at Stanford). Notes on matrix completion, with proof of recovery under incoherence assumptions (Jelani Nelson at Harvard). Levy-Goldberg paper on word embeddings as implicit low-rank approximation. |
18. 11/12 |
Tue |
Spectral graph theory and spectral clustering. |
Slides. Compressed slides. Reading: Chapter 10.4 of Mining of Massive Datasets on spectral graph partitioning. For a lot more interesting material on spectral graph methods see Dan Spielman's lecture notes. Great notes on spectral graph methods (Roughgarden and Valiant at Stanford). |
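A toy sketch of unnormalized spectral partitioning, one simple variant of what is covered in the readings: build the Laplacian L = D - A and split the vertices by the sign of its second-smallest eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy graph: two dense 20-node clusters joined by a few cross edges
n = 40
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        same = (i < 20) == (j < 20)
        if rng.random() < (0.5 if same else 0.05):
            A[i, j] = A[j, i] = 1

L = np.diag(A.sum(axis=1)) - A        # graph Laplacian L = D - A
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
fiedler = eigvecs[:, 1]               # second-smallest eigenvector
cluster = fiedler > 0                 # partition by sign
print(cluster.astype(int))            # roughly splits nodes 0-19 vs 20-39
```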
19. 11/14 |
Thu |
The stochastic block model. |
Slides. Compressed slides. Reading: Dan Spielman's lecture notes on the stochastic block model, including matrix concentration + Davis-Kahan perturbation analysis. Further stochastic block model notes (Alessandro Rinaldo at CMU). A survey of the vast literature on the stochastic block model, beyond the spectral methods discussed in class (Emmanuel Abbe at Princeton). |
20. 11/19 |
Tue |
Computing the SVD: power method. |
Slides. Compressed slides. Reading: Chapter 3.7 of Foundations of Data Science on the power method for SVD. Some notes on the power method (Roughgarden and Valiant at Stanford). |
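A minimal power method sketch for the top right singular vector (the iteration count and matrix sizes are arbitrary):

```python
import numpy as np

def power_method(A, iters=100, seed=0):
    """Power method for the top right singular vector of A: repeatedly
    apply A^T A to a random start and normalize. Converges at a rate
    governed by the singular value gap sigma_1 / sigma_2."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v)
        v /= np.linalg.norm(v)
    return v

A = np.random.default_rng(1).standard_normal((200, 50))
v = power_method(A)
_, _, Vt = np.linalg.svd(A)
print("alignment with true top singular vector:", abs(v @ Vt[0]))
```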
Optimization |
21. 11/21 |
Thu |
Finish up power method analysis. Krylov methods. Connection to random walks and Markov chains. Brief intro to continuous optimization. |
Slides. Compressed slides. Reading: Chapter 3.7 of Foundations of Data Science on the power method for SVD. Some notes on the power method (Roughgarden and Valiant at Stanford). Multivariable calculus review, e.g., through Khan Academy. |
11/26 |
Tue |
No Class. |
|
11/28 |
Thu |
No Class. Thanksgiving recess. |
|
22. 12/03 |
Tue |
Intro to gradient descent and its analysis for convex Lipschitz functions. |
Reading: Chapters I and III of these notes (Hardt at Berkeley). |
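A minimal gradient descent sketch in the convex Lipschitz setting of the notes (the example objective and step size are illustrative); the classical analysis bounds the suboptimality of the average iterate, hence the averaging below.

```python
import numpy as np

def gradient_descent(grad, x0, eta, steps):
    """GD: x_{t+1} = x_t - eta * grad(x_t); returns the average iterate,
    which is what the convex Lipschitz analysis bounds (suboptimality
    O(RG/sqrt(steps)) with step size eta = R/(G*sqrt(steps)))."""
    x = x0.copy()
    total = x.copy()
    for _ in range(steps):
        x = x - eta * grad(x)
        total += x
    return total / (steps + 1)

# example: f(x) = ||x - b||_1 is convex and Lipschitz but not smooth;
# sign(x - b) is a subgradient
b = np.array([1.0, -2.0, 3.0])
grad = lambda x: np.sign(x - b)
print(gradient_descent(grad, np.zeros(3), eta=0.05, steps=2000))  # ~ b
```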
23. 12/05 |
Thu |
Constrained optimization and projected gradient descent. |
|
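A sketch of projected gradient descent (the constraint set, objective, and step size are illustrative): take a gradient step, then project back onto the feasible set; since Euclidean projection is a contraction, the standard GD analysis goes through essentially unchanged.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the L2 ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_gd(grad, project, x0, eta, steps):
    """Projected GD: gradient step, then project onto the constraint set."""
    x = project(x0)
    for _ in range(steps):
        x = project(x - eta * grad(x))
    return x

# example: minimize ||x - b||^2 over the unit ball, with b outside it
b = np.array([2.0, 2.0])
grad = lambda x: 2 * (x - b)
x = projected_gd(grad, project_ball, np.zeros(2), eta=0.1, steps=100)
print(x)  # ~ b / ||b|| = (0.707, 0.707), the closest point in the ball
```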
24. 12/10 |
Tue |
Online gradient descent and application to the analysis of stochastic gradient descent. Course conclusion/review. |
Reading: Short notes proving the regret bound for online gradient descent. A good book (by Elad Hazan) on online optimization, including online gradient descent and the connection to stochastic gradient descent. |
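A sketch of stochastic gradient descent viewed as online gradient descent on randomly drawn loss terms (the least squares example and step size are illustrative): the OGD regret bound translates into a convergence guarantee for the average iterate.

```python
import numpy as np

def sgd(grad_i, n, x0, eta, steps, seed=0):
    """SGD as online gradient descent: each step uses the gradient of a
    single randomly chosen component f_i; returns the average iterate."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    avg = np.zeros_like(x)
    for _ in range(steps):
        i = rng.integers(n)          # sample a random component
        x = x - eta * grad_i(x, i)
        avg += x
    return avg / steps

# example: least squares, f(x) = (1/n) * sum_i (a_i . x - y_i)^2
rng = np.random.default_rng(1)
n, d = 1000, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true
grad_i = lambda x, i: 2 * (A[i] @ x - y[i]) * A[i]
print(sgd(grad_i, n, np.zeros(d), eta=0.01, steps=20_000) - x_true)  # ~ 0
```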
12/18, 10:30am - 12:30pm |
Wed |
Final Exam. |
|