Machine Learning and Friends Lunch

Density Allocation for Modeling Discrete Data

Abstract

The talk will discuss statistical techniques for modeling collections of unstructured or semi-structured data. We will begin with popular approaches to the problem, starting with simple unigram models, extending them to cluster-based mixture models, and moving on to two state-of-the-art latent aspect models: pLSI and LDA. We will then propose a simple generalization of these models: generative density allocation. The new formalism gives an intuition for the relative strengths and weaknesses of the popular models and, more importantly, allows us to develop a new generative model based on non-parametric density estimates. We will look at two variants of the model: one based on the Dirac-delta kernel (known as Relevance Models), and one based on the Dirichlet kernel. We will discuss how the new model can be applied to web search, cross-language retrieval, topic detection and tracking, recognition of objects in images, and recognition of handwritten words. In each case, the new model consistently outperforms state-of-the-art baselines.
