We traditionally think of algorithms as running on data available in a single location, typically in main memory or at least on disk. However, in many modern applications, the data is too large to reside in a single location (terabyte and petabyte sized datasets are increasingly common), is arriving incrementally over time, is noisy and uncertain, or all of the above. Processing such data requires new algorithms and new models of computation. In recent years, practitioners have turned to MapReduce-based systems, such as Hadoop, for large data analysis, data stream analysis systems such as AT&T's Gigascope for making sense of fast arriving data, and systems such as Storm and S4 for real time distributed computation on streaming data.
These practical developments present huge opportunities for the theory community: what are the appropriate computational abstractions for these systems, and how should we go about designing algorithms that are efficient in these models? What are the opportunities for industrial impact? What should we teach our undergraduate and graduate students? Our goals in this workshop are to (a) survey the basic models that have been proposed, (b) present representative algorithmic results, and (c) highlight open problems and new directions of research.
Date: 19 May, 2012
Location: Room 101 in Warren Weaver Hall, 251 Mercer St, New York University
More: See here for further details and information about other STOC tutorials and workshops.
Schedule:

1:30-2:30   Sergei Vassilvitskii (Google): Distributed and Parallel Models (Survey) [Slides]
2:30-3:30   Andrew McGregor (UMass Amherst): Data Streams and Linear Sketches (Survey) [Slides]
3:30-4:00   Coffee Break
4:00-4:40   John Langford (Microsoft Research): Special Topics: Fun Machine Learning Problems on Big Data [Slides]
4:40-5:20   Piotr Indyk (MIT): Special Topics: CS on CS: Computer Science Insights into Compressive Sensing (and vice versa) [Slides]
5:20-6:00   Ashish Goel (Stanford and Twitter): Special Topics: Challenges in Industry and Education [Slides]