9/3 |
Tue |
Course overview. Probability review, Markov's inequality. Estimating set size by counting duplicates. |
Slides. |
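To make the duplicate-counting idea concrete, here is a minimal Python sketch (not part of the course materials; it assumes we can draw uniform samples from the set): each of the m-choose-2 sampled pairs collides with probability 1/n, so the expected number of duplicate pairs is m(m-1)/(2n), and inverting this gives an estimator for n.

```python
import random

def estimate_set_size(universe, m=1000, seed=0):
    """Estimate n = |universe| from m uniform samples drawn with
    replacement, by counting duplicate pairs. Each pair of samples
    collides with probability 1/n, so E[#duplicate pairs] = m(m-1)/(2n)."""
    rng = random.Random(seed)
    sample = [rng.choice(universe) for _ in range(m)]
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0) + 1
    # An element seen c times contributes c-choose-2 duplicate pairs.
    dup_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    if dup_pairs == 0:
        return float("inf")  # m too small to see any collision
    return m * (m - 1) / (2 * dup_pairs)

print(estimate_set_size(range(50_000)))  # roughly 50000
```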
Randomized Methods, Sketching & Streaming |
9/5 |
Thu |
Chebyshev's inequality. Random hashing for efficient lookup and load balancing. 2-universal and pairwise independent hashing. |
Slides. Some notes (Arora and Kothari at Princeton) proving that the ax+b mod p hash function described in class is 2-universal. |
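For reference, a short Python sketch (mine, not from the notes) of drawing a function from this family; it assumes integer keys smaller than the prime p:

```python
import random

def make_hash(m, p=2_147_483_647, seed=None):
    """Draw h(x) = ((a*x + b) mod p) mod m from the 2-universal family:
    p is a prime larger than the key universe, a is uniform in
    {1, ..., p-1}, and b is uniform in {0, ..., p-1}."""
    rng = random.Random(seed)
    a = rng.randrange(1, p)
    b = rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m

h = make_hash(m=10, seed=42)
print([h(x) for x in range(20)])  # bucket assignments in {0, ..., 9}
```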
9/10 |
Tue |
Union bound. Exponential tail bounds (Bernstein and Chernoff). Example applications. |
Slides. Some notes (Goemans at MIT) showing how to prove the Chernoff bound using the moment generating function + Markov's inequality approach discussed in class. |
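The core step of that argument, sketched here for a sum X = X_1 + ... + X_n of independent random variables: since e^(λX) is nonnegative and increasing in X, Markov's inequality gives, for any λ > 0,

```latex
\Pr[X \ge t]
  = \Pr\!\left[e^{\lambda X} \ge e^{\lambda t}\right]
  \le \frac{\mathbb{E}\!\left[e^{\lambda X}\right]}{e^{\lambda t}}
  = e^{-\lambda t} \prod_{i=1}^{n} \mathbb{E}\!\left[e^{\lambda X_i}\right],
```

where the last equality uses independence. Bounding each moment generating function factor and optimizing over λ yields the Chernoff bound.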
9/12 |
Thu |
Hashing continued. Bloom filters and their applications. Hashing for distinct elements. |
Slides. I've added a sketch of the correct Bloom filter analysis. Also see here. See here for some explanation of why a version of a Bloom filter with no false negatives cannot be achieved without using a lot of space. |
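For a concrete picture of the data structure, here is a minimal Bloom filter in Python (my sketch, using salted SHA-256 in place of the idealized random hash functions from class):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash functions into a bit array of
    n_bits positions. insert() never errs; query() has no false
    negatives but may return false positives."""

    def __init__(self, n_bits=1 << 16, k=5):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.insert("massachusetts")
print(bf.query("massachusetts"), bf.query("amherst"))  # True, (almost surely) False
```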
9/17 |
Tue |
Distinct elements continued. Flajolet-Martin and HyperLogLog. Jaccard similarity for audio fingerprinting, document comparison, etc. The median trick. |
Slides. The 2007 paper introducing the popular HyperLogLog distinct elements algorithm. Chapter 4 of Mining of Massive Datasets, with content on Bloom filters and distinct item counting. |
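A toy version of the Flajolet-Martin idea plus the median trick in Python (my sketch, using salted SHA-256 as a stand-in for an ideal random hash into [0, 1)):

```python
import hashlib
import statistics

def _uhash(salt, x):
    """Deterministic 'random' hash of x into [0, 1), salted per copy."""
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def fm_estimate(stream, k=25):
    """Idealized Flajolet-Martin: with d distinct items, the minimum hash
    value is about 1/(d+1), so 1/min - 1 estimates d. Taking the median
    of k independent copies (the median trick) boosts the confidence."""
    mins = [1.0] * k
    for x in stream:
        for salt in range(k):
            mins[salt] = min(mins[salt], _uhash(salt, x))
    return statistics.median(1.0 / m - 1.0 for m in mins)

stream = [i % 1000 for i in range(20_000)]  # 1,000 distinct elements
print(fm_estimate(stream))  # roughly 1000
```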
9/19 |
Thu |
Jaccard similarity search with MinHash. Locality sensitive hashing and nearest neighbor search. |
Slides. Reading: Chapter 3 of Mining of Massive Datasets, with content on Jaccard similarity, MinHash, and locality sensitive hashing. |
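A small Python sketch (mine) of MinHash-based Jaccard estimation; it uses salted SHA-256 in place of the random permutations/hash functions from class:

```python
import hashlib

def _minhash_sig(items, k=100):
    """MinHash signature: for each of k salted hash functions, keep the
    minimum hash value over the set."""
    sig = []
    for salt in range(k):
        sig.append(min(
            int.from_bytes(hashlib.sha256(f"{salt}:{x}".encode()).digest()[:8], "big")
            for x in items))
    return sig

def jaccard_estimate(a, b, k=100):
    """Pr[min-hashes agree] equals the Jaccard similarity |A∩B|/|A∪B|,
    so the fraction of agreeing signature coordinates estimates it."""
    sa, sb = _minhash_sig(a, k), _minhash_sig(b, k)
    return sum(x == y for x, y in zip(sa, sb)) / k

A = set(range(0, 100))
B = set(range(50, 150))
print(jaccard_estimate(A, B))  # true Jaccard similarity is 50/150 ≈ 0.33
```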
9/24 |
Tue |
Finish up MinHash and LSH. SimHash for cosine similarity. |
Slides. Reading: Chapter 3 of Mining of Massive Datasets, with content on Jaccard similarity, MinHash, and locality sensitive hashing. |
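A short SimHash sketch in Python (mine): each bit records which side of a random Gaussian hyperplane the vector lies on, and Pr[bits agree] = 1 − angle(u, v)/π:

```python
import numpy as np

def simhash(v, planes):
    """One bit per random hyperplane: which side of the hyperplane v is on."""
    return planes @ v >= 0

rng = np.random.default_rng(0)
d, k = 100, 256
planes = rng.standard_normal((k, d))  # k random Gaussian hyperplanes

u = rng.standard_normal(d)
v = u + 0.5 * rng.standard_normal(d)  # a noisy copy of u

# The fraction of disagreeing bits estimates angle(u, v) / pi.
est_angle = np.mean(simhash(u, planes) != simhash(v, planes)) * np.pi
true_angle = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(est_angle, true_angle)  # should be close
```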
9/26 |
Thu |
The frequent elements problem. Misra-Gries summaries. Count-min sketch. |
Slides. Reading: Notes (Amit Chakrabarti at Dartmouth) on streaming algorithms. See Chapters 2 and 4 for frequent elements. Some more notes on the frequent elements problem. A website with lots of resources, implementations, and example applications of count-min sketch. |
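A compact count-min sketch in Python (my sketch, again with salted SHA-256 standing in for the pairwise independent hash functions from class):

```python
import hashlib
import numpy as np

class CountMinSketch:
    """Count-min sketch: d rows of w counters, one independent hash per
    row. Estimates are biased upward (collisions only add counts), so
    the minimum over the rows is the estimate."""

    def __init__(self, w=1000, d=5):
        self.w, self.d = w, d
        self.table = np.zeros((d, w), dtype=np.int64)

    def _cols(self, item):
        for row in range(self.d):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row, col] += count

    def estimate(self, item):
        return min(self.table[row, col] for row, col in enumerate(self._cols(item)))

cms = CountMinSketch()
for i in range(10_000):
    cms.add(i % 100)       # each of 100 items appears 100 times
print(cms.estimate(7))     # always >= 100, and close to 100 w.h.p.
```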
10/1 |
Tue |
Randomized dimensionality reduction and the Johnson-Lindenstrauss lemma. Applications to regression, clustering. |
Slides: Compressed/cleaned up, Raw from class. Reading: Chapter 2.7 of Foundations of Data Science on the Johnson-Lindenstrauss lemma. Notes on the JL-Lemma (Anupam Gupta CMU). Linear Algebra Review: Khan academy. |
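A minimal demonstration of the JL guarantee in Python (mine, using the scaled Gaussian construction from class on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 10_000, 400            # n points in d dims, target dimension k

X = rng.standard_normal((n, d))      # data matrix, one point per row
Pi = rng.standard_normal((d, k)) / np.sqrt(k)  # JL map: scaled Gaussian
Y = X @ Pi                           # projected points

# Compare a few pairwise distances before and after projection; the JL
# lemma says all are preserved to 1 +/- eps for k = O(log(n)/eps^2).
for i, j in [(0, 1), (2, 3), (4, 5)]:
    orig = np.linalg.norm(X[i] - X[j])
    proj = np.linalg.norm(Y[i] - Y[j])
    print(f"pair ({i},{j}): ratio = {proj / orig:.3f}")  # near 1.0
```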
10/3 |
Thu |
Finish up JL Lemma. |
Slides. The Fast JL transform: speeding up random projection with the Fast Fourier transform. Sparse random projections, which can be applied more quickly. JL-type random projections for the l1 norm using Cauchy instead of Gaussian random matrices. |
Spectral Methods |
10/8 |
Tue |
Principal component analysis, low-rank approximation, dimensionality reduction. |
Slides: Compressed/cleaned up, Raw from class. Reading: Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD. |
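A small Python/NumPy sketch (mine, on synthetic data) of rank-k approximation via the SVD, which is optimal by the Eckart-Young theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data close to rank 5: a random rank-5 matrix plus small noise.
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
A += 0.01 * rng.standard_normal(A.shape)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k]    # best rank-k approximation (Eckart-Young)
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))  # small relative error

# PCA view: rows of Vt[:k] are the top principal directions (after
# centering A), and A @ Vt[:k].T gives the k-dimensional embedding.
Z = A @ Vt[:k].T
print(Z.shape)  # (200, 5)
```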
10/10 |
Thu |
Eigendecomposition and its application to PCA and low-rank approximation. |
Slides: Cleaned up, Raw from class. Reading: Some notes on PCA and its connection to eigendecomposition (Roughgarden and Valiant at Stanford). |
10/15 |
Tue |
No Class, Monday Schedule. |
|
10/17 |
Thu |
Midterm (In Class) |
Study guide and review questions. |
10/22 |
Tue |
The singular value decomposition and its connection to eigendecomposition/PCA/low-rank approximation. Applications of low-rank approximation beyond compression. |
Slides (raw from class). Unannotated slides. Reading: Chapter 3 of Foundations of Data Science and Chapter 11 of Mining of Massive Datasets on low-rank approximation and the SVD. Some notes on the SVD and its connection to PCA (Roughgarden and Valiant at Stanford). |
10/24 |
Thu |
Linear algebraic view of graphs. Applications to spectral clustering, community detection, network visualization. |
Slides (raw from class). Unannotated slides. Reading: Chapter 10.4 of Mining of Massive Datasets on spectral graph partitioning. Great notes on spectral graph methods (Roughgarden and Valiant at Stanford). |
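A toy spectral partitioning example in Python (mine): on a random graph with two planted communities, the sign pattern of the Laplacian's second eigenvector (the Fiedler vector) should recover the communities:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40  # two planted communities: nodes 0-19 and 20-39
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        p = 0.5 if (i < 20) == (j < 20) else 0.05  # dense within, sparse across
        if rng.random() < p:
            A[i, j] = A[j, i] = 1

L = np.diag(A.sum(axis=1)) - A        # unnormalized graph Laplacian D - A
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
fiedler = eigvecs[:, 1]               # eigenvector of 2nd-smallest eigenvalue

# Partitioning by sign should recover the communities (up to relabeling).
print((fiedler < 0).astype(int))
```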
10/29 |
Tue |
Spectral graph theory, spectral clustering, and community detection continued. The stochastic block model. |
Slides (raw from class). Unannotated slides. Reading: Chapter 10.4 of Mining of Massive Datasets on spectral graph partitioning. For a lot more interesting material on spectral graph methods see Dan Spielman's lecture notes. |
10/31 |
Thu |
Finish up stochastic block model. Computing the SVD: power method, Krylov methods. |
Slides (raw from class). Unannotated slides. Reading: Chapter 3.7 of Foundations of Data Science on the power method for SVD. Some notes on the power method (Roughgarden and Valiant at Stanford). |
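A bare-bones power method in Python (my sketch): iterating v ← AᵀAv with re-normalization converges to the top right singular vector of A, given a random start and a singular value gap:

```python
import numpy as np

def power_method(A, iters=200, seed=0):
    """Power method for the top right singular vector of A: repeatedly
    apply A^T A to a random start vector and re-normalize."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v)
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 50))
v = power_method(A)
_, _, Vt = np.linalg.svd(A)
print(abs(v @ Vt[0]))  # close to 1: v aligns with the top right singular vector
```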
11/5 |
Tue |
Class Cancelled. |
|
11/7 |
Thu |
Finish up power method. Connection to random walks and Markov chains. |
Slides (raw from class). Unannotated slides. |
Optimization |
11/12 |
Tue |
Gradient descent and analysis for convex functions, example applications. |
Slides (raw from class). Unannotated slides. Reading: Chapters I and III of these notes (Hardt at Berkeley). Multivariable calc review, e.g., through: Khan academy. |
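A minimal gradient descent example in Python (mine): minimizing the least squares objective f(x) = ||Ax − b||², whose gradient is 2Aᵀ(Ax − b), with a step size set from the smoothness of f:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Gradient descent on f(x) = ||Ax - b||^2 with gradient 2 A^T (Ax - b).
# Step size 1/(2 ||A||_2^2) is safely below 2/L for smoothness L = 2 ||A||_2^2.
eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)
x = np.zeros(d)
for _ in range(500):
    x -= eta * 2 * A.T @ (A @ x - b)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_star))  # small: close to the least squares solution
```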
11/14 |
Thu |
Finish gradient descent. Projected gradient descent. |
Slides (raw from class). Unannotated slides. Chapters I and III of these notes (Hardt at Berkeley). |
11/19 |
Tue |
Stochastic gradient descent for large scale learning. Analysis via online gradient descent. |
Slides (raw from class). Unannotated slides. Reading: Short notes proving the regret bound for online gradient descent. A good book (by Elad Hazan) on online optimization, including online gradient descent and its connection to stochastic gradient descent. Note that the analysis is close to, but slightly different from, what was covered in class. |
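A matching SGD sketch in Python (mine): each step uses the gradient of a single randomly chosen term of the least squares sum, with a decaying step size in the spirit of the online analysis (the constants here are ad hoc):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# SGD for least squares: at each step, use the gradient of one random
# term (a_i^T x - b_i)^2 instead of the full sum over all n terms.
x = np.zeros(d)
for t in range(1, 50_000):
    i = rng.integers(n)
    grad_i = 2 * (A[i] @ x - b[i]) * A[i]  # unbiased estimate of grad f(x)/n
    x -= (0.01 / np.sqrt(t)) * grad_i      # decaying step size

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_star))  # small: SGD approaches the least squares solution
```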
11/21 |
Thu |
Finish up SGD. Gradient descent for least squares regression. Connections to advanced techniques: variance reduction, accelerated methods, adaptive gradient methods. |
Slides (raw from class). Unannotated slides. |
11/26 |
Tue |
No Class, Thanksgiving Recess. |
|
11/28 |
Thu |
No Class, Thanksgiving Recess. |
|
Assorted Topics |
12/3 |
Tue |
High-dimensional geometry, curse of dimensionality. |
Slides (raw from class). Unannotated slides. Reading: Chapters 2.3-2.6 of Foundations of Data Science on high-dimensional geometry. |
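A quick simulation (mine) of one curse-of-dimensionality phenomenon from this lecture: pairwise distances between random points concentrate around their mean as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1_000, 10_000]:
    X = rng.standard_normal((100, d)) / np.sqrt(d)  # 100 random points
    dists = [np.linalg.norm(X[i] - X[j])
             for i in range(100) for j in range(i + 1, 100)]
    # The relative spread of distances shrinks as d grows: all points
    # become nearly equidistant.
    print(d, np.std(dists) / np.mean(dists))
```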
12/5 |
Thu |
Compressed sensing, sparse recovery. |
Slides (raw from class). Unannotated slides. |
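To make sparse recovery concrete, here is a small Python sketch (mine) using iterative hard thresholding, a simple recovery algorithm distinct from the basis pursuit approach; it assumes the Gaussian measurement matrix satisfies a restricted isometry property, which holds with high probability at these sizes:

```python
import numpy as np

def iht(A, b, s, iters=100):
    """Iterative hard thresholding: take a gradient step on ||Ax - b||^2,
    then keep only the s largest-magnitude entries. Converges to the
    sparse signal when A satisfies a restricted isometry property."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x + A.T @ (b - A @ x)           # gradient step (step size 1)
        small = np.argsort(np.abs(x))[:-s]  # all but the s largest entries
        x[small] = 0                        # hard threshold to s-sparse
    return x

rng = np.random.default_rng(0)
n, d, s = 100, 400, 5                         # n measurements, s-sparse signal
A = rng.standard_normal((n, d)) / np.sqrt(n)  # columns have (near) unit norm
x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
b = A @ x_true                                # noiseless linear measurements

print(np.linalg.norm(iht(A, b, s) - x_true))  # near 0 when recovery succeeds
```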
12/10 |
Tue |
Finish up sparse recovery and basis pursuit. Class wrap-up. |
Slides (raw from class). Unannotated slides. |
12/19 |
Thu |
Final (10:30am-12:30pm in Thompson 104) |
Study guide and review questions. |