Andrew McGregor

Associate Professor

Welcome to the Fall 2022 homepage for CMPSCI 514 - Algorithms for Data Science. See the Moodle page for links to lecture recordings, the syllabus, homework, etc. Slides for future lectures may be modified significantly depending on our progress. Slides for previous lectures may be updated if, e.g., we spot a typo during the lecture.

Lecture Topic Reading and Background
1 Course overview. Probability review. Slides, Probability Notes, MIT Short Probability Videos
2 Estimating set size by counting duplicates. Concentration Bounds: Markov's inequality. Random hashing for efficient lookup. Slides
3 Finish up hash tables. 2-universal and pairwise independent hashing. Hashing for load balancing. Slides, Proof of 2-universality. See the 2-universal hashing code sketch below.
4 Concentration Bounds Continued: Chebyshev's inequality. The union bound. Exponential tail bounds (Bernstein's inequality). Slides
5 Finish up exponential concentration bounds and the central limit theorem. Bloom filters and their applications. Slides
6 Finish up Bloom filters. Start on streaming algorithms. MinHash for distinct elements. Slides, Reading: Chapter 4 of Mining of Massive Datasets, with content on Bloom filters and counting distinct elements. See here for the full Bloom filter analysis. See here for some explanation of why a version of a Bloom filter with no false positives cannot be achieved without using a lot of space. See Wikipedia for a discussion of the many Bloom filter variants, including counting Bloom filters and Bloom filters with deletions. See Wikipedia again and these notes for an explanation of Cuckoo Hashing, a randomized hash table scheme which, like 2-level hashing, has O(1) query time, but also has expected O(1) insertion time. See also the Bloom filter code sketch below.
7 Finish up distinct elements and the median trick. Flajolet-Martin and HyperLogLog. Jaccard similarity estimation with MinHash for audio fingerprinting, document comparison, etc. Start on locality sensitive hashing and nearest neighbor search. Slides. See the MinHash code sketch below.
8 Finish up MinHash for Jaccard similarity and locality sensitive hashing. Similarity search. SimHash for cosine similarity. Slides
9 The frequent elements problem and count-min sketch. Slides. See the count-min code sketch below.
10 Dimensionality reduction, low-distortion embeddings, and the Johnson-Lindenstrauss Lemma. Slides. See the random projection code sketch below.
11 Finish up the JL Lemma. Example application to clustering. Connections to high-dimensional geometry. Slides
12 Finish up high-dimensional geometry and connection to the JL Lemma. Slides
13 Midterm review. Slides
14 Intro to principal component analysis, low-rank approximation, data-dependent dimensionality reduction. Slides
15 Projection matrices and best fit subspaces. Slides
16 Optimal low-rank approximation via eigendecomposition. Principal component analysis. Slides. See the SVD/PCA code sketch below.
17 SVD and applications of low-rank approximation beyond compression. Matrix completion, LSA, and word embeddings. Slides
18 Linear algebraic view of graphs. Spectral graph partitioning and clustering. Slides. See the spectral partitioning code sketch below.
19 Stochastic block model. Slides
20 Computing the SVD: power method, Krylov methods. Connection to random walks and Markov chains. Slides. See the power method code sketch below.
21 Optimization and gradient descent analysis for convex functions. Slides. See the gradient descent code sketch below.
22 Finish gradient descent analysis. Constrained optimization and projected gradient descent. Slides
23 Online learning and regret. Online gradient descent. Slides
24 Finish up online gradient descent and stochastic gradient descent analysis. Slides
25 Course conclusion/review. Slides
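
Below are a few unofficial Python code sketches for selected topics from the schedule above. They are illustrative toys rather than the course's reference implementations, and every parameter value in them is made up for the example. First, a minimal sketch of the classic ((ax + b) mod p) mod m construction of a 2-universal hash family from Lecture 3.

```python
import random

# A draw from the 2-universal family h(x) = ((a*x + b) mod P) mod m.
# P just needs to be a prime exceeding the key universe; this choice is
# illustrative.
P = 2**31 - 1  # a Mersenne prime


def make_hash(m):
    """Sample one hash function mapping {0, ..., P-1} to {0, ..., m-1}."""
    a = random.randint(1, P - 1)
    b = random.randint(0, P - 1)
    return lambda x: ((a * x + b) % P) % m


h = make_hash(m=100)
print([h(x) for x in [12, 1999, 345678]])  # any fixed pair collides w.p. ~1/m
```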
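
A minimal Bloom filter sketch for Lectures 5-6; using salted SHA-256 digests to stand in for the k hash functions is an illustrative choice, not how the lectures instantiate them.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash functions set bits in an m-bit array.
    Queries may return false positives but never false negatives."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _indices(self, item):
        # Simulate k hash functions with salted SHA-256 digests; the
        # lecture analysis models these as uniform random functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=1000, k=5)
bf.add("apple")
bf.add("banana")
print("apple" in bf, "cherry" in bf)  # True, (almost surely) False
```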
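
A sketch of Jaccard similarity estimation with MinHash from Lectures 7-8: the fraction of hash functions on which two sets have the same minimum value is an unbiased estimate of their Jaccard similarity. The sets and the number of repetitions t are illustrative.

```python
import random

P = 2**31 - 1

# t hash functions of the form (a*x + b) mod P, shared by both sets;
# seeded so the example is reproducible.
random.seed(0)
t = 200
seeds = [(random.randint(1, P - 1), random.randint(0, P - 1)) for _ in range(t)]


def signature(items):
    """MinHash signature: the set's minimum hash value under each function."""
    return [min((a * x + b) % P for x in items) for a, b in seeds]


A = set(range(0, 80))
B = set(range(40, 120))
matches = sum(x == y for x, y in zip(signature(A), signature(B)))
print(f"estimate: {matches / t:.2f}  true Jaccard: {len(A & B) / len(A | B):.2f}")
```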
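
A sketch of the count-min sketch from Lecture 9, assuming 2-universal row hashes; the width and depth below are illustrative.

```python
import random


class CountMin:
    """Count-min sketch: d rows of w counters, each row with its own
    2-universal hash. A point query returns the minimum counter over the
    rows, which never underestimates the true count and overestimates by
    more than roughly 2n/w (n = stream length) with probability at most
    (1/2)^d."""

    P = 2**31 - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.rows = [(rng.randint(1, self.P - 1), rng.randint(0, self.P - 1))
                     for _ in range(d)]

    def _col(self, r, x):
        a, b = self.rows[r]
        return ((a * hash(x) + b) % self.P) % self.w

    def add(self, x, count=1):
        for r in range(self.d):
            self.table[r][self._col(r, x)] += count

    def query(self, x):
        return min(self.table[r][self._col(r, x)] for r in range(self.d))


cm = CountMin(w=200, d=5)
for token in ["a"] * 100 + ["b"] * 10 + list("cdefg"):
    cm.add(token)
print(cm.query("a"), cm.query("b"))  # 100 and 10, up to small overestimates
```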
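
A sketch of Johnson-Lindenstrauss dimensionality reduction from Lectures 10-12, using a Gaussian projection matrix scaled by 1/sqrt(k); the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 10_000, 500  # n points, original and target dimensions

X = rng.normal(size=(n, d))                 # n points in d dimensions
Pi = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled Gaussian JL matrix
Y = X @ Pi                                  # embedded points in k dimensions

# One pairwise distance before and after; the JL lemma says all n*(n-1)/2
# such ratios are within (1 +/- eps) of 1 once k = O(log(n) / eps^2).
orig = np.linalg.norm(X[0] - X[1])
emb = np.linalg.norm(Y[0] - Y[1])
print(f"original: {orig:.1f}  embedded: {emb:.1f}  ratio: {emb / orig:.3f}")
```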
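
A sketch of optimal low-rank approximation via the SVD for Lectures 14-17, applied to a synthetic nearly-rank-3 matrix standing in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# A matrix that is nearly rank 3: a rank-3 product plus small noise.
A = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 50))
A += 0.01 * rng.normal(size=A.shape)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = (U[:, :k] * s[:k]) @ Vt[:k]  # best rank-k approximation (Eckart-Young)

err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"relative Frobenius error of the rank-{k} approximation: {err:.4f}")
# For a centered data matrix, the rows of Vt are the principal components.
```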
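
A sketch of spectral partitioning for Lectures 18-19: the signs of the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue) recover a planted two-cluster structure on a toy stochastic-block-model-style graph. The edge probabilities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40  # nodes 0..19 in one cluster, nodes 20..39 in the other

# Dense within clusters (p = 0.6), sparse across clusters (q = 0.05).
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        same = (i < n // 2) == (j < n // 2)
        if rng.random() < (0.6 if same else 0.05):
            A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
_, eigvecs = np.linalg.eigh(L)      # eigenvectors, ascending eigenvalues
fiedler = eigvecs[:, 1]             # second-smallest eigenvector
print((fiedler > 0).astype(int))    # signs should separate the two clusters
```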
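
A sketch of the power method from Lecture 20 for approximating the top singular vector and value; the fixed iteration count is an illustrative choice rather than a principled stopping rule.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))

# Power method on A^T A: converges to the top right singular vector at a
# rate governed by the gap between the top two singular values.
v = rng.normal(size=50)
v /= np.linalg.norm(v)
for _ in range(100):          # iteration count is illustrative
    v = A.T @ (A @ v)         # one multiplication by A^T A
    v /= np.linalg.norm(v)

sigma_est = np.linalg.norm(A @ v)
sigma_true = np.linalg.svd(A, compute_uv=False)[0]
print(f"power method: {sigma_est:.4f}  numpy SVD: {sigma_true:.4f}")
```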
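
Finally, a sketch of gradient descent on a convex function for Lectures 21-22, using least-squares regression as the objective; the problem sizes and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=100)

# Gradient descent on the smooth convex objective f(w) = ||Xw - y||^2,
# with fixed step size 1/L, where L = 2 * sigma_max(X)^2 is the smoothness.
w = np.zeros(5)
eta = 0.5 / np.linalg.norm(X, 2) ** 2
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y)  # gradient of ||Xw - y||^2
    w -= eta * grad

print(f"distance to the planted weights: {np.linalg.norm(w - w_true):.4f}")
```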