Online Data Mining: PCA and K-Means
Algorithms for data mining, unsupervised machine learning, and scientific computing were traditionally designed to minimize running time in the batch setting (random access to memory). In recent years, a significant amount of research has been devoted to producing scalable algorithms for the same problems. A scalable solution assumes some limitation on data access and/or the compute model. Well-known models include MapReduce, message passing, local computation, pass-efficient, and streaming, among others. In this talk we argue for the need to consider the online model in data mining tasks. In an online setting, the algorithm receives data points one by one and must make some decision immediately, without examining the rest of the input. The quality of the algorithm's decisions is compared to the best possible in hindsight. Note that no stochasticity assumption is made about the input. While practitioners are well aware of the need for such algorithms, this setting has mostly been overlooked by the academic community. Here, we will review new results on online k-means clustering and online Principal Component Analysis (PCA).
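To make the online model concrete, here is a minimal sketch of a point-by-point clustering loop. It is not the algorithm presented in the talk; it is the classic sequential (running-mean) k-means heuristic, chosen only because it shows the defining constraint: each point is labeled the moment it arrives, and that label is never revised. The function name `online_kmeans` and the choice to seed the centers with the first k points are assumptions made for illustration.

```python
def online_kmeans(stream, k):
    """Label each point on arrival; earlier labels are never revisited."""
    centers, counts, labels = [], [], []
    for x in stream:
        if len(centers) < k:
            # Assumption: the first k points become the initial centers.
            centers.append(list(x))
            counts.append(1)
            labels.append(len(centers) - 1)
            continue
        # Irrevocable decision: pick the nearest current center.
        j = min(
            range(k),
            key=lambda i: sum((a - c) ** 2 for a, c in zip(x, centers[i])),
        )
        labels.append(j)
        # Move the chosen center to the running mean of its points.
        counts[j] += 1
        centers[j] = [c + (a - c) / counts[j] for a, c in zip(x, centers[j])]
    return labels, centers


points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0)]
labels, centers = online_kmeans(points, k=2)
```

An online analysis would compare the cost of these labels with the cost of the best k-means clustering of the full input computed in hindsight. This simple heuristic carries no such guarantee for adversarial input order.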
Edo Liberty is a research director at Yahoo Labs and leads its Scalable Machine Learning group. He received his BSc in Computer Science and Physics from Tel Aviv University and his PhD in Computer Science from Yale. After his postdoctoral position in the Applied Math department at Yale, he co-founded a New York-based startup. Since 2009 he has been with Yahoo Labs. His research focuses on the theory and practice of large-scale data mining.