Welcome to the Fall 2022 homepage for CMPSCI 514 - Algorithms for Data Science. See the Moodle page for links to lecture recordings, the syllabus, homework, etc. Slides for future lectures may be modified significantly depending on our progress. Slides for past lectures may be updated if, e.g., we spot a typo during the lecture.
Lecture | Topic | Reading and Background |
---|---|---|
1 | Course overview. Probability review. | Slides, Probability Notes, MIT Short Probability Videos |
2 | Estimating set size by counting duplicates. Concentration Bounds: Markov's inequality. Random hashing for efficient lookup. | Slides |
3 | Finish up hash tables. 2-universal and pairwise independent hashing. Hashing for load balancing. | Slides, Proof of 2-universality |
4 | Concentration Bounds Continued: Chebyshev's inequality. The union bound. Exponential tail bounds (Bernstein's inequality). | Slides |
5 | Finish up exponential concentration bounds and the central limit theorem. Bloom filters and their applications. | Slides |
6 | Finish up Bloom filters. Start on streaming algorithms. Min-hashing for distinct elements. | Slides, Reading: Chapter 4 of Mining of Massive Datasets, with content on Bloom filters and distinct elements counting. See here for the full Bloom filter analysis. See here for an explanation of why a version of a Bloom filter with no false positives cannot be achieved without using a lot of space. See Wikipedia for a discussion of the many Bloom filter variants, including counting Bloom filters and Bloom filters with deletions. See Wikipedia again and these notes for an explanation of Cuckoo hashing, a randomized hash table scheme which, like 2-level hashing, has O(1) query time, but also has expected O(1) insertion time. |
7 | Finish up distinct elements and the median trick. Flajolet-Martin and HyperLogLog. Jaccard similarity estimation with MinHash for audio fingerprinting, document comparison, etc. Start on locality sensitive hashing and nearest neighbor search. | Slides |
8 | Finish up MinHash for Jaccard similarity and locality sensitive hashing. Similarity search. SimHash for Cosine similarity. | Slides |
9 | The frequent elements problem and count-min sketch. | Slides |
10 | Dimensionality reduction, low-distortion embeddings, and the Johnson-Lindenstrauss Lemma. | Slides |
11 | Finish up the JL Lemma. Example application to clustering. Connections to high-dimensional geometry. | Slides |
12 | Finish up high-dimensional geometry and connection to the JL Lemma. | Slides |
13 | Midterm Review | Slides |
14 | Intro to principal component analysis, low-rank approximation, data-dependent dimensionality reduction. | Slides |
15 | Projection matrices and best fit subspaces. | Slides |
16 | Optimal low-rank approximation via eigendecomposition. Principal component analysis. | Slides |
17 | SVD and applications of low-rank approximation beyond compression. Matrix completion, LSA, and word embeddings. | Slides |
18 | Linear algebraic view of graphs. Spectral graph partitioning and clustering. | Slides |
19 | Stochastic block model. | Slides |
20 | Computing the SVD: power method, Krylov methods. Connection to random walks and Markov chains. | Slides |
21 | Optimization and gradient descent analysis for convex functions. | Slides |
22 | Finish gradient descent analysis. Constrained optimization and projected gradient descent. | Slides |
23 | Online learning and regret. Online gradient descent. | Slides |
24 | Finish up online gradient descent and stochastic gradient descent analysis. | Slides |
25 | Course conclusion/review. | Slides |
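As a taste of the data structures covered around lectures 5-6, here is a minimal illustrative Bloom filter sketch (not taken from the course materials; the parameters `m` and `k` and the salted-hash construction are assumptions for illustration). It shows the key property discussed in lecture: queries can return false positives, but never false negatives.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter sketch: k hash positions over an m-bit array.
    Membership queries may return false positives but never false negatives."""

    def __init__(self, m=1024, k=3):
        self.m = m              # number of bits in the filter
        self.k = k              # number of hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k hash positions by salting one hash function with an index,
        # a common stand-in for k independent random hashes.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def query(self, item):
        # True means "possibly in the set"; False means "definitely not in the set".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["alpha", "beta", "gamma"]:
    bf.insert(word)
print(bf.query("alpha"))  # True: inserted items are always reported present
print(bf.query("delta"))  # almost certainly False here, but a false positive is possible
```

With m much larger than the number of insertions times k, the false positive rate stays small; the full analysis of this trade-off is linked from the lecture 6 reading.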