1. 1/21 |
Tue |
Course overview. Probability review. Estimating set size by counting duplicates. |
Slides. Compressed slides. MIT short videos and exercises on probability. Khan academy probability lessons (a bit more basic). |
Randomized Methods, Sketching & Streaming |
2. 1/23 |
Thu |
Concentration Bounds: Markov's inequality and Chebyshev's inequality. Random hashing for efficient lookup and load balancing. 2-universal and pairwise independent hashing. |
Slides. Compressed slides. Some notes (Arora and Kothari at Princeton) proving that the ax+b mod p hash function described in class is 2-universal. |
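For intuition, a minimal sketch of the (ax+b) mod p family from the notes above; the prime p, table size m, and use of Python's random module are illustrative choices, not the exact parameters from class.

```python
import random

# 2-universal family h(x) = ((a*x + b) mod p) mod m, for integer keys x < p.
# p is a prime at least as large as the key universe; m is an assumed table size.
p = 2_147_483_647              # a Mersenne prime, larger than any 32-bit key
m = 1024                       # number of hash buckets (illustrative)

a = random.randrange(1, p)     # a chosen uniformly from {1, ..., p-1}
b = random.randrange(0, p)     # b chosen uniformly from {0, ..., p-1}

def h(x: int) -> int:
    """Hash a key to one of m buckets; any two distinct keys collide with probability roughly 1/m."""
    return ((a * x + b) % p) % m
```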
3. 1/28 |
Tue |
Union bound. Exponential tail bounds (Bernstein and Chernoff). Example applications. |
Slides. Compressed slides. Some notes (Goemans at MIT) showing how to prove the Chernoff bound using the moment generating function + Markov's inequality approach discussed in class. |
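For reference, a compressed version of the moment generating function + Markov's inequality argument from the notes above, for a sum X = X_1 + ... + X_n of independent indicator random variables with mean μ = E[X] (this is the standard multiplicative Chernoff bound; the constants in lecture may be stated slightly differently):

```latex
\Pr[X \ge (1+\delta)\mu]
  = \Pr\big[e^{tX} \ge e^{t(1+\delta)\mu}\big]
  \le \frac{\mathbb{E}[e^{tX}]}{e^{t(1+\delta)\mu}}            % Markov's inequality, any t > 0
  = \frac{\prod_i \mathbb{E}[e^{t X_i}]}{e^{t(1+\delta)\mu}}   % independence of the X_i
  \le \frac{e^{(e^t - 1)\mu}}{e^{t(1+\delta)\mu}}              % 1 + x \le e^x
```

Setting t = ln(1+δ) then gives Pr[X ≥ (1+δ)μ] ≤ (e^δ / (1+δ)^(1+δ))^μ.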
4. 1/30 |
Thu |
Hashing continued. Bloom filters and their applications. Hashing for distinct elements. |
Slides. Compressed slides, with Bernstein bound argument fixed. See here for a full Bloom filter analysis. See here for some explanation of why a version of a Bloom filter with no false negatives cannot be achieved without using a lot of space. See Wikipedia for a discussion of the many Bloom filter variants, including counting Bloom filters and Bloom filters with deletions. |
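A minimal Bloom filter sketch in the spirit of the lecture; the bit-array size m, number of hash functions k, and the salted built-in hash() are illustrative stand-ins for the parameters and hash functions analyzed in class.

```python
import random

class BloomFilter:
    """Approximate set membership: no false negatives, tunable false positive rate."""
    def __init__(self, m: int = 10_000, k: int = 5):
        self.m, self.k = m, k
        self.bits = [False] * m
        self.salts = [random.random() for _ in range(k)]   # stand-ins for k hash functions

    def _positions(self, item):
        return [hash((salt, item)) % self.m for salt in self.salts]

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def query(self, item) -> bool:
        # A True answer may be a false positive; a False answer is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.insert("example.com")
print(bf.query("example.com"))   # True
print(bf.query("umass.edu"))     # almost certainly False
```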
5. 2/4 |
Tue |
Distinct elements continued. Flajolet-Martin and HyperLogLog. Jaccard similarity for audio fingerprinting, document comparison, etc. The median trick. |
Slides. Compressed slides. The 2007 paper introducing the popular HyperLogLog distinct elements algorithm. Chapter 4 of Mining of Massive Datasets, with content on Bloom filters and distinct item counting. |
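A toy version of the hashing-based distinct elements idea with the median trick; hashing into [0,1) via a salted hash() and the number of repetitions are illustrative, and this is the simple minimum-based estimator rather than HyperLogLog itself.

```python
import random
import statistics

def distinct_estimate(stream, reps: int = 25) -> float:
    """Hash each item to [0, 1), track the minimum, and return the median estimate
    over independent repetitions (the median trick)."""
    estimates = []
    for _ in range(reps):
        salt = random.random()                              # fresh "hash function" per repetition
        m = min((hash((salt, x)) % 10**9) / 10**9 for x in stream)
        estimates.append(1.0 / m - 1.0)                     # E[min] is about 1/(d+1) for d distinct items
    return statistics.median(estimates)

stream = [i % 500 for i in range(10_000)]                   # 500 distinct items, many duplicates
print(distinct_estimate(stream))                            # roughly 500
```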
2/6 |
Thu |
No Class, Professor Away. |
|
6. 2/11 |
Tue |
Jaccard similarity search with MinHash. Locality sensitive hashing and nearest neighbor search. |
Slides. Compressed slides. Reading: Chapter 3 of Mining of Massive Datasets, with content on Jaccard similarity, MinHash, and locality sensitive hashing. |
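A small MinHash sketch for estimating Jaccard similarity, following the MinHash idea in the reading; the number of hash functions and the salting scheme are illustrative.

```python
import random

def minhash_signature(items: set, salts) -> list:
    """One signature entry per hash function: the minimum hash value over the set."""
    return [min(hash((s, x)) for x in items) for s in salts]

def estimate_jaccard(sig_a, sig_b) -> float:
    """Each coordinate matches with probability equal to the Jaccard similarity,
    so the fraction of matching coordinates is an unbiased estimate."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

salts = [random.random() for _ in range(200)]    # 200 hash functions (illustrative)
A = set(range(0, 800))
B = set(range(400, 1200))                        # |A ∩ B| / |A ∪ B| = 400 / 1200 ≈ 0.33
print(estimate_jaccard(minhash_signature(A, salts), minhash_signature(B, salts)))
```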
7. 2/13 |
Thu |
Finish up MinHash and LSH. SimHash for cosine similarity. Start on frequent elements problem. |
Slides. Compressed slides. Reading: Chapter 3 of Mining of Massive Datasets, with content on Jaccard similarity, MinHash, and locality sensitive hashing. Notes (Amit Chakrabarti at Dartmouth) on streaming algorithms. See Chapters 2 and 4 for frequent elements. Some more notes on the frequent elements problem. |
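A rough SimHash (random hyperplane hashing) sketch for cosine similarity, as covered in this lecture; numpy and the number of hyperplanes are assumptions.

```python
import numpy as np

def simhash(x: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """One bit per random hyperplane: the sign of the projection onto it."""
    return planes @ x >= 0

rng = np.random.default_rng(0)
d, n_bits = 100, 256
planes = rng.standard_normal((n_bits, d))      # random hyperplanes (Gaussian directions)

x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)           # a vector similar to x

# Two bits agree with probability 1 - angle(x, y) / pi, so the agreement rate
# tracks the cosine similarity between x and y.
print(np.mean(simhash(x, planes) == simhash(y, planes)))
```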
2/18 |
Tue |
No Class, Monday Schedule. |
|
8. 2/20 |
Thu |
The frequent elements problem. Misra-Gries summaries. Count-min sketch. |
Slides. Compressed slides. Reading: Notes (Amit Chakrabarti at Dartmouth) on streaming algorithms. See Chapters 2 and 4 for frequent elements. Some more notes on the frequent elements problem. A website with lots of resources, implementations, and example applications of count-min sketch.
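A minimal Misra-Gries summary for the frequent elements problem covered above; the parameter k (number of counters plus one) trades space for accuracy.

```python
def misra_gries(stream, k: int) -> dict:
    """Keep at most k-1 counters. Every item with frequency > n/k is retained,
    and each reported count undercounts the true count by at most n/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No room: decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 60 + ["b"] * 25 + list("cdefghij") * 2
print(misra_gries(stream, k=4))   # "a" is guaranteed to survive; its count is approximate
```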
|
9. 2/25 |
Tue |
Count-min sketch analysis. Start on dimensionality reduction and low-distortion embeddings. |
Slides. Compressed slides. Reading: Chapter 2.7 of Foundations of Data Science on the Johnson-Lindenstrauss lemma. Notes on the JL-Lemma (Anupam Gupta CMU). Linear Algebra Review: Khan academy. |
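A small count-min sketch along the lines of the lecture and the resources linked above; the width, depth, and salted hash() calls are illustrative choices.

```python
import random

class CountMinSketch:
    """Approximate frequency counts. Estimates only overcount: each row overcounts
    by about n/width in expectation, and taking the min over depth independent rows
    makes a large error exponentially unlikely."""
    def __init__(self, width: int = 2000, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.salts = [random.random() for _ in range(depth)]

    def add(self, item, count: int = 1):
        for row, salt in enumerate(self.salts):
            self.table[row][hash((salt, item)) % self.width] += count

    def estimate(self, item) -> int:
        return min(self.table[row][hash((salt, item)) % self.width]
                   for row, salt in enumerate(self.salts))

cms = CountMinSketch()
for word in ["the"] * 1000 + ["data"] * 50:
    cms.add(word)
print(cms.estimate("the"), cms.estimate("data"))   # at least the true counts, usually close
```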
10. 2/27 |
Thu |
The Johnson-Lindenstrauss Lemma proof. |
Slides. Compressed slides. Reading: Chapter 2.7 of Foundations of Data Science on the Johnson-Lindenstrauss lemma. Notes on the JL-Lemma (Anupam Gupta CMU). Sparse random projections, which can be applied more quickly. JL type random projections for the l1 norm using Cauchy instead of Gaussian random matrices. |
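A quick numerical illustration of the lemma using a dense Gaussian projection; numpy and the specific dimensions are assumptions, chosen only to show that pairwise distances are roughly preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 10_000, 500                      # n points in d dimensions, target dimension m

X = rng.standard_normal((n, d))                 # data matrix, one point per row
Pi = rng.standard_normal((d, m)) / np.sqrt(m)   # JL map: scaled i.i.d. Gaussian matrix
Y = X @ Pi                                      # projected points

# Distances after projection are within a small multiplicative factor of the originals.
for i, j in [(0, 1), (5, 17), (42, 99)]:
    ratio = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
    print(f"pair ({i},{j}): distortion {ratio:.3f}")   # close to 1
```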
11. 3/3 |
Tue |
Finish up the JL Lemma. Applications to clustering, classification, etc. Connections to high-dimensional geometry.
|
Slides. Compressed slides. |
12. 3/5 |
Thu |
Finish up high-dimensional geometry and connection to the JL Lemma.
|
Slides. Compressed slides. Reading: Chapters 2.3-2.6 of Foundations of Data Science on high-dimensional geometry. |
Spectral Methods |
13. 3/10 |
Tue |
Midterm Review. Intro to principal component analysis, low-rank approximation, data-dependent dimensionality reduction. |
Slides. Compressed slides. Reading: Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD. |
3/12 |
Thu |
Midterm (In Class) |
Study guide and review questions. |
3/17 |
Tue |
No Class, Spring Recess. |
|
3/19 |
Thu |
No Class, Spring Recess. |
|
14. 3/24 |
Tue |
Intro to low-rank approximation. Projection matrices and best fit subspaces. |
Slides. Compressed slides. Zoom Recording.
Reading: Some notes on PCA and its connection to eigendecomposition (Roughgarden and Valiant at Stanford). |
15. 3/26 |
Thu |
Optimal low-rank approximation via eigendecomposition. Principal component analysis. |
Slides. Compressed slides. Zoom Recording. Reading: Some notes on PCA and its connection to eigendecomposition and singular value decomposition (SVD) (Roughgarden and Valiant at Stanford). Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD. |
16. 3/31 |
Tue |
The singular value decomposition and connections to eigendecomposition and PCA. Applications of low-rank approximation beyond compression.
|
Slides. Compressed slides. Zoom Recording. Reading: Notes on SVD and its connection to eigendecomposition/PCA (Roughgarden and Valiant at Stanford). Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD. |
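A compact numpy illustration of optimal rank-k approximation via the SVD, matching the readings above; the matrix and the choice of k are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 80))   # a matrix of rank at most 20
A += 0.01 * rng.standard_normal(A.shape)                             # plus a little noise

k = 20
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]     # best rank-k approximation in Frobenius/spectral norm

print(np.linalg.norm(A - A_k) / np.linalg.norm(A))   # small relative error
# PCA connection: if A's rows are mean-centered data points, the rows of Vt[:k]
# are the top k principal components, i.e., the top eigenvectors of A^T A.
```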
17. 4/2 |
Thu |
Linear algebraic view of graphs. Applications to spectral clustering, community detection, network visualization.
|
Slides. Compressed slides. Zoom recording. Reading: Chapter 10.4 of Mining of Massive Datasets on spectral graph partitioning. Great notes on spectral graph methods (Roughgarden and Valiant at Stanford). |
18. 4/7 |
Tue |
Spectral graph partitioning. |
Slides. Compressed slides. Zoom recording. Reading: Chapter 10.4 of Mining of Massive Datasets on spectral graph partitioning. For a lot more interesting material on spectral graph methods see Dan Spielman's lecture notes. |
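A small spectral partitioning sketch using the second-smallest eigenvector of the graph Laplacian (the Fiedler vector), in the spirit of the readings; the planted two-community random graph and numpy are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
labels = np.array([0] * 20 + [1] * 20)                      # two planted groups of 20 nodes

# Random graph: dense within groups, sparse across them.
P = np.where(labels[:, None] == labels[None, :], 0.8, 0.05)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                                  # symmetric adjacency, no self-loops

L = np.diag(A.sum(axis=1)) - A                               # graph Laplacian L = D - A
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]                                      # eigenvector of the 2nd smallest eigenvalue

cut = (fiedler < 0).astype(int)                              # sign pattern gives the partition
print(cut)                                                   # should largely recover the two groups
```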
19. 4/9 |
Thu |
The stochastic block model. |
Slides. Compressed slides. Zoom recording. |
20. 4/14 |
Tue |
Computing the SVD: power method, Krylov methods. Connection to random walks and Markov chains. |
Slides. Compressed slides. Zoom recording.
Reading: Chapter 3.7 of Foundations of Data Science on the power method for SVD. Some notes on the power method (Roughgarden and Valiant at Stanford).
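A bare-bones power method for the top eigenvector, as described in the reading above; the iteration count and test matrix are illustrative.

```python
import numpy as np

def power_method(M: np.ndarray, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Approximate the top eigenvector of a symmetric PSD matrix M
    (e.g., M = A^T A to recover A's top right singular vector)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)          # renormalize each step to avoid overflow
    return v

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))
v = power_method(A.T @ A)               # top right singular vector of A
print(np.linalg.norm(A @ v))            # approximately the largest singular value of A
```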
|
Optimization |
21. 4/16 |
Thu |
Finish power method and Krylov methods. Start on continuous optimization. |
Slides. Compressed slides. Zoom recording. Reading: Chapters I and III of these notes (Hardt at Berkeley). Multivariable calculus review, e.g., through Khan academy. |
22. 4/21 |
Tue |
Gradient descent and analysis for convex functions. |
Slides. Compressed slides. Zoom recording. Reading: Chapters I and III of these notes (Hardt at Berkeley).
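A minimal gradient descent loop on a convex least-squares objective, in the setting of the analysis above; the 1/(smoothness) step size and numpy are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)

# f(x) = 0.5 * ||Ax - b||^2 is convex with gradient A^T (Ax - b).
grad = lambda x: A.T @ (A @ x - b)
eta = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/beta for a beta-smooth objective

x = np.zeros(10)
for _ in range(500):
    x = x - eta * grad(x)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # exact least-squares minimizer
print(np.linalg.norm(x - x_star))               # small: gradient descent converges
```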
|
23. 4/23 |
Thu |
Finish gradient descent analysis. Constrained optimization and projected gradient descent. |
Slides. Compressed slides. Zoom Recording. Reading: Chapters I and III of these notes (Hardt at Berkeley). |
24. 4/28 |
Tue |
Online gradient descent and application to the analysis of stochastic gradient descent. Class wrap-up. |
Slides. Compressed slides. Zoom Recording. Reading: Short notes proving the regret bound for online gradient descent. A good book (by Elad Hazan) on online optimization, including online gradient descent and its connection to stochastic gradient descent. Note that the analysis is close to, but slightly different from, what will be covered in class. |
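A toy stochastic gradient descent run on a least-squares problem, illustrating the single-example updates whose analysis (via online gradient descent) is discussed above; the decaying step size is one standard choice, and numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
b = A @ rng.standard_normal(10) + 0.1 * rng.standard_normal(1000)

x = np.zeros(10)
for t in range(1, 20_001):
    i = rng.integers(len(b))                     # sample one data point uniformly
    g = (A[i] @ x - b[i]) * A[i]                 # stochastic gradient of 0.5 * (A[i]·x - b[i])^2
    x -= (0.1 / np.sqrt(t)) * g                  # decaying step size ~ 1/sqrt(t)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_star))                # close to the least-squares solution
```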
5/6 |
Wed |
Final (2:00pm-4:00pm on Zoom) |
Study guide and review questions. |