UMass Machine Learning and Friends Lunch: A Natural Extension Of The Dirichlet Process To Relational Data

Daniel Roy

The Dirichlet process (Ferguson, 1973) has, in a short time, become one of the most common tools employed by machine learning practitioners in the Bayesian setting. Its well-known "clustering" property makes it a natural choice as a mixing distribution for mixture modeling when the number of clusters is unknown a priori and is expected to grow with the amount of data.
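The clustering property can be illustrated with a small simulation. The sketch below uses the Chinese restaurant process, a standard sequential description of the partition induced by a Dirichlet process; it is not the construction presented in the talk. The function name `chinese_restaurant_process` and the parameters `n`, `alpha`, and `seed` are illustrative choices, not taken from the abstract.

```python
import random


def chinese_restaurant_process(n, alpha, seed=0):
    """Return cluster sizes after seating n items under a CRP.

    alpha is the concentration parameter (assumed > 0); larger values
    make new clusters more likely. All names here are hypothetical.
    """
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of items currently in cluster k
    for i in range(n):
        # Item i joins existing cluster k with probability sizes[k] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        r = rng.random() * (i + alpha)
        if r < alpha:
            sizes.append(1)
            continue
        r -= alpha
        for k, size in enumerate(sizes):
            if r < size:
                sizes[k] += 1
                break
            r -= size
        else:
            # Guard against floating-point round-off at the upper boundary.
            sizes[-1] += 1
    return sizes


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        sizes = chinese_restaurant_process(n, alpha=1.0)
        print(n, len(sizes))
```

Running the script prints the number of clusters for increasing `n`. Under this process the expected number of clusters grows roughly like `alpha * log(n)`, which is the sense in which the number of clusters grows with the amount of data.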

At the same time, relational data sets in information retrieval, social networks, biology, and cognitive psychology have started to gain the attention of the machine learning community. Modeling the interactions between proteins, groups of people, web sites, words in a document, and other entities is difficult, in part because modeling tools and representations for such data are lacking.

I will introduce a relatively obscure construction of the Dirichlet process originating in population genetics (Kingman, 1982) and show how this view suggests a natural extension of the Dirichlet process to relational data.

Joint work with Yee Whye Teh (UCL), Josh Tenenbaum, Vikash Mansinghka and Charles Kemp (MIT).