Andrew McGregor

Associate Professor

Welcome to the Fall 2022 homepage for CMPSCI 514 - Algorithms for Data Science. See the Moodle page for links to lecture recordings, the syllabus, homework, etc. Slides for future lectures may be modified significantly depending on our progress. Slides for previous lectures may be updated if, e.g., we spot a typo during the lecture.

Lecture Topic Reading and Background
1 Course overview. Probability review. Slides, Probability Notes, MIT Short Probability Videos
2 Estimating set size by counting duplicates. Concentration Bounds: Markov's inequality. Random hashing for efficient lookup. Slides
3 Finish up hash tables. 2-universal and pairwise independent hashing. Hashing for load balancing. Slides, Proof of 2-universality. See the 2-universal hashing code sketch below.
4 Concentration Bounds Continued: Chebyshev's inequality. The union bound. Exponential tail bounds (Bernstein's inequality). Slides
5 Finish up exponential concentration bounds and the central limit theorem. Bloom filters and their applications. Slides
6 Finish up Bloom filters. Start on streaming algorithms. MinHash for distinct elements. Slides, Reading: Chapter 4 of Mining of Massive Datasets, with content on Bloom filters and counting distinct elements. See here for the full Bloom filter analysis. See here for some explanation of why a version of a Bloom filter with no false positives cannot be achieved without using a lot of space. See Wikipedia for a discussion of the many Bloom filter variants, including counting Bloom filters and Bloom filters with deletions. See Wikipedia again and these notes for an explanation of Cuckoo Hashing, a randomized hash table scheme which, like 2-level hashing, has O(1) query time, but also has expected O(1) insertion time. See also the Bloom filter code sketch below.
7 Finish up distinct elements and the median trick. Flajolet-Martin and HyperLogLog. Jaccard similarity estimation with MinHash for audio fingerprinting, document comparison, etc. Start on locality sensitive hashing and nearest neighbor search. Slides. See the MinHash code sketch below.
8 Finish up MinHash for Jaccard similarity and locality sensitive hashing. Similarity search. SimHash for cosine similarity. Slides
9 The frequent elements problem and count-min sketch. Slides. See the count-min code sketch below.
10 Dimensionality reduction, low-distortion embeddings, and the Johnson-Lindenstrauss Lemma. Slides. See the random projection code sketch below.
11 Finish up the JL Lemma. Example application to clustering. Connections to high-dimensional geometry. Slides
12 Finish up high-dimensional geometry and connection to the JL Lemma. Slides
13 Midterm review. Slides
14 Intro to principal component analysis, low-rank approximation, data-dependent dimensionality reduction. Slides
15 Projection matrices and best fit subspaces. Slides
16 Optimal low-rank approximation via eigendecomposition. Principal component analysis. Slides. See the SVD/PCA code sketch below.
17 SVD and applications of low-rank approximation beyond compression. Matrix completion, LSA, and word embeddings. Slides
18 Linear algebraic view of graphs. Spectral graph partitioning and clustering. Slides. See the spectral partitioning code sketch below.
19 Stochastic block model. Slides
20 Computing the SVD: power method, Krylov methods. Connection to random walks and Markov chains. Slides. See the power method code sketch below.
21 Optimization and gradient descent analysis for convex functions. Slides. See the gradient descent code sketch below.
22 Finish gradient descent analysis. Constrained optimization and projected gradient descent. Slides
23 Online learning and regret. Online gradient descent. Slides
24 Finish up online gradient descent and stochastic gradient descent analysis. Slides
25 Course conclusion/review. Slides
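
Below are a few unofficial Python code sketches for selected topics from the schedule above. They are illustrative toys rather than the course's reference implementations, and every parameter value in them is made up for the example. First, a minimal sketch of the classic ((ax + b) mod p) mod m construction of a 2-universal hash family from Lecture 3.

```python
import random

# A draw from the 2-universal family h(x) = ((a*x + b) mod P) mod m.
# P just needs to be a prime exceeding the key universe; this choice is
# illustrative.
P = 2**31 - 1  # a Mersenne prime


def make_hash(m):
    """Sample one hash function mapping {0, ..., P-1} to {0, ..., m-1}."""
    a = random.randint(1, P - 1)
    b = random.randint(0, P - 1)
    return lambda x: ((a * x + b) % P) % m


h = make_hash(m=100)
print([h(x) for x in [12, 1999, 345678]])  # any fixed pair collides w.p. ~1/m
```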
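
A minimal Bloom filter sketch for Lectures 5-6; using salted SHA-256 digests to stand in for the k hash functions is an illustrative choice, not how the lectures instantiate them.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash functions set bits in an m-bit array.
    Queries may return false positives but never false negatives."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _indices(self, item):
        # Simulate k hash functions with salted SHA-256 digests; the
        # lecture analysis models these as uniform random functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=1000, k=5)
bf.add("apple")
bf.add("banana")
print("apple" in bf, "cherry" in bf)  # True, (almost surely) False
```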
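
A sketch of Jaccard similarity estimation with MinHash from Lectures 7-8: the fraction of hash functions on which two sets have the same minimum value is an unbiased estimate of their Jaccard similarity. The sets and the number of repetitions t are illustrative.

```python
import random

P = 2**31 - 1

# t hash functions of the form (a*x + b) mod P, shared by both sets;
# seeded so the example is reproducible.
random.seed(0)
t = 200
seeds = [(random.randint(1, P - 1), random.randint(0, P - 1)) for _ in range(t)]


def signature(items):
    """MinHash signature: the set's minimum hash value under each function."""
    return [min((a * x + b) % P for x in items) for a, b in seeds]


A = set(range(0, 80))
B = set(range(40, 120))
matches = sum(x == y for x, y in zip(signature(A), signature(B)))
print(f"estimate: {matches / t:.2f}  true Jaccard: {len(A & B) / len(A | B):.2f}")
```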
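
A sketch of the count-min sketch from Lecture 9, assuming 2-universal row hashes; the width and depth below are illustrative.

```python
import random


class CountMin:
    """Count-min sketch: d rows of w counters, each row with its own
    2-universal hash. A point query returns the minimum counter over the
    rows, which never underestimates the true count and overestimates by
    more than roughly 2n/w (n = stream length) with probability at most
    (1/2)^d."""

    P = 2**31 - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.rows = [(rng.randint(1, self.P - 1), rng.randint(0, self.P - 1))
                     for _ in range(d)]

    def _col(self, r, x):
        a, b = self.rows[r]
        return ((a * hash(x) + b) % self.P) % self.w

    def add(self, x, count=1):
        for r in range(self.d):
            self.table[r][self._col(r, x)] += count

    def query(self, x):
        return min(self.table[r][self._col(r, x)] for r in range(self.d))


cm = CountMin(w=200, d=5)
for token in ["a"] * 100 + ["b"] * 10 + list("cdefg"):
    cm.add(token)
print(cm.query("a"), cm.query("b"))  # 100 and 10, up to small overestimates
```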
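
A sketch of Johnson-Lindenstrauss dimensionality reduction from Lectures 10-12, using a Gaussian projection matrix scaled by 1/sqrt(k); the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 10_000, 500  # n points, original and target dimensions

X = rng.normal(size=(n, d))                 # n points in d dimensions
Pi = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled Gaussian JL matrix
Y = X @ Pi                                  # embedded points in k dimensions

# One pairwise distance before and after; the JL lemma says all n*(n-1)/2
# such ratios are within (1 +/- eps) of 1 once k = O(log(n) / eps^2).
orig = np.linalg.norm(X[0] - X[1])
emb = np.linalg.norm(Y[0] - Y[1])
print(f"original: {orig:.1f}  embedded: {emb:.1f}  ratio: {emb / orig:.3f}")
```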
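
A sketch of optimal low-rank approximation via the SVD for Lectures 14-17, applied to a synthetic nearly-rank-3 matrix standing in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# A matrix that is nearly rank 3: a rank-3 product plus small noise.
A = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 50))
A += 0.01 * rng.normal(size=A.shape)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = (U[:, :k] * s[:k]) @ Vt[:k]  # best rank-k approximation (Eckart-Young)

err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"relative Frobenius error of the rank-{k} approximation: {err:.4f}")
# For a centered data matrix, the rows of Vt are the principal components.
```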
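
A sketch of spectral partitioning for Lectures 18-19: the signs of the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue) recover a planted two-cluster structure on a toy stochastic-block-model-style graph. The edge probabilities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40  # nodes 0..19 in one cluster, nodes 20..39 in the other

# Dense within clusters (p = 0.6), sparse across clusters (q = 0.05).
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        same = (i < n // 2) == (j < n // 2)
        if rng.random() < (0.6 if same else 0.05):
            A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
_, eigvecs = np.linalg.eigh(L)      # eigenvectors, ascending eigenvalues
fiedler = eigvecs[:, 1]             # second-smallest eigenvector
print((fiedler > 0).astype(int))    # signs should separate the two clusters
```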
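
A sketch of the power method from Lecture 20 for approximating the top singular vector and value; the fixed iteration count is an illustrative choice rather than a principled stopping rule.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))

# Power method on A^T A: converges to the top right singular vector at a
# rate governed by the gap between the top two singular values.
v = rng.normal(size=50)
v /= np.linalg.norm(v)
for _ in range(100):          # iteration count is illustrative
    v = A.T @ (A @ v)         # one multiplication by A^T A
    v /= np.linalg.norm(v)

sigma_est = np.linalg.norm(A @ v)
sigma_true = np.linalg.svd(A, compute_uv=False)[0]
print(f"power method: {sigma_est:.4f}  numpy SVD: {sigma_true:.4f}")
```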
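
Finally, a sketch of gradient descent on a convex function for Lectures 21-22, using least-squares regression as the objective; the problem sizes and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=100)

# Gradient descent on the smooth convex objective f(w) = ||Xw - y||^2,
# with fixed step size 1/L, where L = 2 * sigma_max(X)^2 is the smoothness.
w = np.zeros(5)
eta = 0.5 / np.linalg.norm(X, 2) ** 2
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y)  # gradient of ||Xw - y||^2
    w -= eta * grad

print(f"distance to the planted weights: {np.linalg.norm(w - w_true):.4f}")
```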