Machine Learning and Friends Lunch

Fast and Flexible Clustering: Variational Methods for the Dirichlet Process


David Blei
UC Berkeley

Abstract


Probabilistic clustering methods are widely used in machine learning across application areas such as vision, information retrieval, and bioinformatics. Such methods can be used both for dimensionality reduction and, more qualitatively, for automatically organizing large datasets into groups of similar elements. Lurking behind every clustering application, however, is the nagging question: how many clusters should there be?

The classical approach to answering this question is to use model-selection techniques to choose the best number of clusters for the data. However, realistic datasets grow and change over time; one shouldn't expect a fixed number of clusters to capture all of the variation in the observed and future data. The Dirichlet process (Ferguson, 1973), a cornerstone of Bayesian nonparametric statistics, provides a modeling technique with which new data can exhibit new clusters. It can be used as a flexible clustering tool that chooses the number of clusters but does not wholly commit to that choice.
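To see how a Dirichlet process model lets new data create new clusters, here is a minimal sketch (not from the talk) of the Chinese restaurant process, the predictive distribution over partitions induced by the Dirichlet process. The function name, the concentration parameter alpha, and the default values are illustrative choices, not anything specified in the abstract.

```python
import numpy as np

def crp_assignments(n_points, alpha=1.0, seed=None):
    """Sample cluster assignments from a Chinese restaurant process.

    Point i joins an existing cluster with probability proportional
    to that cluster's current size, or starts a brand-new cluster
    with probability proportional to alpha. The number of clusters
    is therefore not fixed in advance and can grow with the data.
    """
    rng = np.random.default_rng(seed)
    assignments = [0]   # the first point starts cluster 0
    counts = [1]        # current cluster sizes
    for _ in range(1, n_points):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):      # a new cluster is born
            counts.append(1)
        else:                     # join an existing cluster
            counts[k] += 1
        assignments.append(k)
    return assignments

# Larger alpha tends to produce more clusters for the same data size.
print(crp_assignments(20, alpha=1.0, seed=0))
```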

While conceptually attractive, Dirichlet process methods are limited in practice because they rely on Markov chain Monte Carlo (MCMC) sampling techniques for approximate inference (West and MacEachern, 1994). Such methods work well in univariate settings, but are often too slow for the complicated multivariate problems that are the focus of machine learning. In this talk, I will review Dirichlet process clustering techniques and develop a fast, deterministic method for approximate inference based on variational methods. This method is a practical alternative to MCMC for large-scale problems. I will give examples with text collections, image collections, and combined text and images.
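One standard ingredient in such variational approximations is a truncated stick-breaking representation of the Dirichlet process (Sethuraman, 1994): rather than sampling infinitely many mixture weights, one works with a finite number T of "sticks" and optimizes distributions over them. The sketch below, with illustrative function and parameter names, only constructs the truncated weights; it is not the talk's inference algorithm.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, seed=None):
    """Truncated stick-breaking construction of DP mixture weights.

    Draw V_k ~ Beta(1, alpha) for k = 1..T and set
        pi_k = V_k * prod_{j < k} (1 - V_j),
    with V_T fixed to 1 so the T weights sum exactly to one.
    """
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=truncation)
    v[-1] = 1.0  # absorb all remaining stick mass in the last weight
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

weights = stick_breaking_weights(alpha=2.0, truncation=10, seed=0)
print(weights, weights.sum())   # the weights sum to 1.0
```

Truncating the representation turns the infinite-dimensional posterior into a finite optimization problem, which is what makes a deterministic, coordinate-ascent-style alternative to MCMC sampling possible.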

(Joint work with Michael I. Jordan)
