COMPSCI 514: Algorithms for Data Science (Fall 2019)
Time: Tue/Thurs 10am-11:15am, Fall Semester 2019
Location: Bartlett Hall, Room 65
Office Hours: TBD
With the advent of social networks, ubiquitous sensors, and large-scale computational science, data scientists must deal with data that is massive in size,
arrives at blinding speeds, and often must be processed within interactive or quasi-interactive time frames. This course studies the mathematical foundations
of big data processing, developing algorithms and learning how to analyze them. We explore methods for sampling, sketching, and distributed processing of
large-scale databases, graphs, and data streams for purposes of scalable statistical description, querying, pattern mining, and learning. This course was
previously offered as COMPSCI 590D. Undergraduate prerequisites: COMPSCI 240 and COMPSCI 311. 3 credits.
Website Under Construction!
Tentative List of Topics (14 weeks total)
- Introduction, probability review, and concentration inequalities (Markov, Chebyshev, Chernoff) (1 week)
- Streaming Algorithms and Hashing (3-4 weeks)
- Hashing for efficient lookup and load balancing.
- Frequent elements/heavy hitter identification. Misra-Gries summaries and Count-min sketch.
- Minhash for Jaccard similarity.
- Near neighbor search in high dimensions: locality sensitive hashing (LSH).
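As a taste of the streaming topics above, here is a minimal sketch of the Misra-Gries frequent elements summary (function and variable names are our own, not from the course):

```python
import random

def misra_gries(stream, k):
    """Misra-Gries summary using at most k-1 counters.
    Any element occurring more than n/k times in a stream of length n
    is guaranteed to survive in the returned dictionary; the stored
    counts may underestimate the true frequencies."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement every counter; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Stream of length n = 88; with k = 4, the heavy hitter threshold is n/k = 22.
stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
random.shuffle(stream)
summary = misra_gries(stream, k=4)
# "a" (50 occurrences) and "b" (30) both exceed 22, so both must appear.
```

The summary uses constant space regardless of stream length, which is the point of the streaming model.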
- Randomized Compression (1 week)
- The Johnson-Lindenstrauss lemma.
- Applications: Clustering, regression, etc.
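A quick numerical illustration of the Johnson-Lindenstrauss lemma (dimensions and variable names chosen for the example, not taken from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 300  # n points in d dimensions, projected down to k

X = rng.standard_normal((n, d))
# A random Gaussian matrix scaled by 1/sqrt(k) preserves pairwise
# Euclidean distances up to a (1 +/- eps) factor with high probability.
Pi = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ Pi

# Compare one pairwise distance before and after projection.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig  # should be close to 1
```

Note that the projection is oblivious: Pi is drawn without looking at the data, which is what makes it useful in streaming and distributed settings.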
- Spectral Methods (3 weeks)
- Singular value decomposition and eigendecomposition.
- Applications: low-rank approximation/PCA, compression, PageRank, etc.
- Computing the SVD: power method, Krylov methods, randomized and streaming methods.
- Applications: spectral graph clustering and community detection.
- Connections to the Johnson-Lindenstrauss lemma and randomized compression.
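A minimal sketch of the power method for the top singular vector, one of the SVD algorithms listed above (iteration count and names are illustrative):

```python
import numpy as np

def power_method(A, iters=300, seed=0):
    """Estimate the top right singular vector of A by power iteration on A^T A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v)  # one multiplication by A^T A
        v /= np.linalg.norm(v)
    return v

A = np.random.default_rng(1).standard_normal((200, 50))
v = power_method(A)
sigma = np.linalg.norm(A @ v)                       # estimated top singular value
true_sigma = np.linalg.svd(A, compute_uv=False)[0]  # exact value, for comparison
```

Each iteration costs only two matrix-vector products, so the method scales to matrices far too large for a full SVD.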
- Optimization (3 weeks)
- Gradient descent. Analysis for convex functions.
- Online and stochastic gradient descent.
- Applications: regression and linear models on massive datasets.
- Application: Gradient descent in neural networks. Backpropagation, nonconvex analysis.
- More advanced methods: acceleration, Newton's method, quasi-Newton methods.
- Alternating minimization and expectation-maximization.
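A small sketch of gradient descent on a convex least-squares objective, the first topic in the optimization unit (problem sizes and step-size choice are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true  # noiseless targets, so the optimum is exactly w_true

# Gradient descent on f(w) = (1/2n) ||Xw - y||^2 with step size 1/L,
# where L is the largest eigenvalue of the Hessian (1/n) X^T X.
L = np.linalg.eigvalsh(X.T @ X / n).max()
w = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n
    w -= grad / L

err = np.linalg.norm(w - w_true)  # converges toward 0
```

For a smooth, strongly convex objective like this one, the error contracts geometrically at each step, which the course analysis makes precise.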
- Fourier Methods and Compressed Sensing (2 weeks)
- The discrete Fourier transform. Applications to signal processing, compression, and filtering.
- The fast Fourier transform.
- The sparse Fourier transform and compressed sensing. RIP, l1 minimization, iterative thresholding.
- Connections to randomized dimensionality reduction and hashing.
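A quick demonstration of the discrete Fourier transform on a signal that is sparse in the frequency domain, the setting that motivates sparse FFTs and compressed sensing (signal parameters are illustrative):

```python
import numpy as np

n = 256
t = np.arange(n)
# A signal that is 2-sparse in frequency: two pure cosine tones.
x = np.cos(2 * np.pi * 5 * t / n) + 0.5 * np.cos(2 * np.pi * 40 * t / n)

X = np.fft.fft(x)  # computed in O(n log n) time via the FFT
mags = np.abs(X)
# Energy concentrates at bins 5 and 40 and their conjugates n-5 and n-40;
# every other bin is zero up to floating-point noise.
top = {int(i) for i in np.argsort(mags)[-4:]}
```

Because only a few coefficients are nonzero, the signal can in principle be recovered from far fewer than n samples, which is the starting point for compressed sensing.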
- Miscellaneous Topics (2 weeks, TBD possibly based on student interest)
- MapReduce, distributed graph processing. Linear algebraic view of graph/network processing.
- Clustering: k-means, k-medians, correlation clustering, k-means++, graph partitioning.
- Constrained optimization and linear programming.
- Support vector machines and kernel methods.
- Matrix completion.