COMPSCI 514: Algorithms for Data Science

The lecture Zoom link is here. See announcements for password. Recordings for each class will be posted with the slides below.

Lecture	Day	Topic	Materials/Reading
1. 8/25	Tue	Course overview. Probability review.	Zoom Recording. Slides. Compressed slides. MIT short videos and exercises on probability. Khan academy probability lessons (a bit more basic).
Randomized Methods, Sketching & Streaming
2. 8/27	Thu	Estimating set size by counting duplicates. Concentration Bounds: Markov's inequality. Random hashing for efficient lookup.	Zoom Recording. Slides. Compressed slides.
3. 9/1	Tue	Finish up hash tables. 2-universal and pairwise independent hashing. Hashing for load balancing.	Zoom Recording. Slides. Compressed slides. Some notes (Arora and Kothari at Princeton) proving that the ax+b mod p hash function described in class in 2-universal.
4. 9/3	Thu	Concentration Bounds Continued: Chebyshev's inequality. The union bound. Exponential tail bounds (Bernstein's inequality).	Zoom Recording. Slides. Compressed slides. Some notes (Goemans at MIT) showing how to prove exponential tail bounds using the moment generating function + Markov's inequality approach discussed in class.
5. 9/8	Tue	Finish up exponential concentration bounds and the central limit theorem. Bloom filters and their applications.	Zoom Recording. Slides. Compressed slides. Reading: Chapter 4 of Mining of Massive Datasets, with content on bloom filters. See here for full Bloom filter analysis. See here for some explaination of why a version of a Bloom filter with no false negatives cannot be achieved without using a lot of space. See Wikipedia for a discussion of the many bloom filter variants, including counting Bloom filters, and Bloom filters with deletions. See Wikipedia again and these notes for an explaination of Cuckoo Hashing, a randomized hash table scheme which, like 2-level hashing, has O(1) query time, but also has expected O(1) insertion time.
6. 9/10	Thu	Finish up Bloom filters. Start on streaming algorithms. Min-Hashing for distinct elements.	Zoom Recording. Slides. Compressed slides. Reading: Chapter 4 of Mining of Massive Datasets, with content on distinct elements counting.
7. 9/15	Tue	Finish up distinct elements and the median trick. Flajolet-Martin and HyperLogLog. Jaccard similarity estimation with MinHash for audio fingerprinting, document comparision, etc. Start on locality sensitive hashing and nearest neighbor search.	Zoom Recording. Slides. Compressed slides. Reading: Chapter 3 of Mining of Massive Datasets, with content on Jaccard similarity, MinHash, and locality sensitive hashing. The 2007 paper introducing the popular HyperLogLog distinct elements algorithm.
8. 9/17	Thu	Finish up MinHash for Jaccard similarity and locality sensitive hashing. Similarity search. SimHash for Cosine similarity.	Zoom Recording. Slides. Compressed slides. Reading: Chapter 3 of Mining of Massive Datasets, with content on Jaccard similarity, MinHash, and locality sensitive hashing.
9. 9/22	Tue	The frequent elements problem and count-min sketch.	Zoom Recording. Slides. Compressed slides. Reading: Notes (Amit Chakrabarti at Dartmouth) on streaming algorithms. See Chapters 2 and 4 for frequent elements. Some more notes on the frequent elements problem. A website with lots of resources, implementations, and example applications of count-min sketch.
10. 9/24	Thu	Dimensionality reduction, low-distortion embeddings, and the Johnson Lindenstrauss Lemma.	Zoom Recording. Slides. Compressed slides. Reading: Chapter 2.7 of Foundations of Data Science on the Johnson-Lindenstrauss lemma. Notes on the JL-Lemma (Anupam Gupta CMU). Sparse random projections which can be multiplied by more quickly. JL type random projections for the l1 norm using Cauchy instead of Gaussian random matrices. Linear Algebra Review: Khan academy.
11. 9/29	Tue	Finish up the JL Lemma. Example application to clustering. Connections to high-dimensional geometry.	Zoom Recording. Slides. Compressed slides. Reading: Chapters 2.3-2.6 of Foundations of Data Science on high-dimensional geometry.
12. 10/1	Thu	Finish up high-dimensional geometry and connection to the JL Lemma.	Zoom Recording. Slides. Compressed slides. Reading: Chapters 2.3-2.6 of Foundations of Data Science on high-dimensional geometry.
Spectral Methods
13. 10/6	Tue	Intro to principal component analysis, low-rank approximation, data-dependent dimensionality reduction.	Zoom Recording. Slides. Compressed slides. Reading: Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD.
10/8-10/9	Thu-Fri	Midterm (2 hour take home, taken during 48 hour period)	Study guide and review questions.
14. 10/13	Tue	Projection matrices and best fit subspaces.	Zoom Recording. Slides. Compressed slides. Reading: Some notes on PCA and its connection to eigendecomposition (Roughgarden and Valiant at Stanford).
15. 10/15	Thu	Optimal low-rank approximation via eigendecomposition. Principal component analysis.	Zoom Recording. Slides. Compressed slides. Reading: Some notes on PCA and its connection to eigendecomposition. Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation. Proof that optimal low-rank approximation can be found greedily (see Section 1.1).
16. 10/20	Tue	Singular value decomposition (SVD) and connections to PCA/low-rank approximation.	Zoom Recording. Slides. Compressed slides. Reading: Notes on SVD and its connection to eigendecomposition/PCA (Roughgarden and Valiant at Stanford). Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD.
17. 10/22	Thu	Applications of low-rank approximation beyond compression. Matrix completion, LSA, and word embeddings.	Zoom recording. Slides. Compressed slides. Reading: Levy Goldberg paper on word embeddings as implicit low-rank approximation. Notes on matrix completion, with proof of recovery under incoherence assumptions (Jelani Nelson at Harvard).
18. 10/27	Tue	Linear algebraic view of graphs. Spectral graph partitioning.	Zoom recording. Slides. Compressed slides. Reading: Chapter 10.4 of Mining of Massive Datasets on spectral graph partitioning. For a lot more interesting material on spectral graph methods see Dan Spielman's lecture notes. Great notes on spectral graph methods (Roughgarden and Valiant at Stanford).
19. 10/29	Thu	Finish spectral clustering. The stochastic block model.	Zoom recording. Slides. Compressed slides. Reading: Stochastic block model notes (Alessandro Rinaldo at CMU). A survey of the vast literature on the stochastic block model, beyond the spectral methods discussed in class (Emmanuel Abbe at Princeton).
20. 11/03	Tue	Finish the stochastic block model.	Zoom recording. Slides. Compressed slides.
21. 11/05	Thu	Computing the SVD: power method, Krylov methods. Connection to random walks and Markov chains.	Zoom recording. Slides. Compressed slides. Reading: Chapter 3.7 of Foundations of Data Science on the power method for SVD. Some notes on the power method. (Roughgarden and Valiant at Stanford).
Optimization
22. 11/10	Tue	Start on optimization and gradient descent analysis for convex functions.	Zoom recording. Slides. Compressed slides. Reading: Chapters I and III of these notes (Hardt at Berkeley). Multivariable calc review, e.g., through: Khan academy.
23. 11/12	Thu	Finish gradient descent analysis. Constrained optimization and projected gradient descent.	Zoom Recording. Slides. Compressed slides. Reading: Chapters I and III of these notes (Hardt at Berkeley).
24. 11/17	Tue	Online learning and regret. Online gradient descent.	Zoom Recording. Slides. Compressed slides. Reading: Short notes, proving regret bound for online gradient descent. A good book, (by Elad Hazan) on online optimization, including online gradient descent and connection to stochastic gradient descent. Note that the analysis is close to, but slightly different than will be covered in class.
25. 11/19	Thu	Finish up online gradient descent and stochastic gradient descent analysis. Course conclusion/review.	Zoom Recording. Slides. Compressed slides.
12/3-12/4	Thu/Fri	Final (2 hour take home, taken during 48 hour period)	Study guide and review questions.

Course Schedule (Evolving)