Linguistic Extensions To Topic Models
Topic models have been an active area of research in recent years, and models such as latent semantic analysis, probabilistic latent semantic indexing, and latent Dirichlet allocation (LDA) have been used successfully to detect opinions, find similar images, and retrieve relevant documents given a query. However, such models make very naive assumptions about the input data: words are treated as unrelated to each other, and the words in a document are completely exchangeable (the so-called bag-of-words model).
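To make the bag-of-words assumption concrete, here is a minimal sketch of LDA's generative story. The vocabulary, the two topic distributions, and the Dirichlet prior values are all toy numbers chosen for illustration, not values from any fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and two hand-built topics (hypothetical values).
vocab = ["bank", "loan", "river", "water", "money", "shore"]
topics = np.array([
    [0.30, 0.30, 0.02, 0.03, 0.30, 0.05],  # a "finance" topic
    [0.05, 0.02, 0.30, 0.30, 0.03, 0.30],  # a "nature" topic
])
alpha = np.array([1.0, 1.0])  # symmetric Dirichlet prior over topics

def generate_document(n_words):
    """LDA's generative process: draw a topic mixture for the document,
    then draw each word independently. Because every position is drawn
    the same way, word order carries no information (bag of words)."""
    theta = rng.dirichlet(alpha)                 # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)     # choose a topic
        w = rng.choice(len(vocab), p=topics[z])  # choose a word from it
        words.append(vocab[w])
    return words

doc = generate_document(8)
```

Note that shuffling `doc` yields an equally probable document under this model, which is exactly the exchangeability assumption the abstract refers to.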
In this work, we present algorithms that enhance the document-level knowledge provided by LDA with richer linguistic assumptions. First, we allow topic models to use words arranged in a tree (such as the WordNet ontology) rather than a simple flat list, and we derive MCMC inference for this model. One application made possible by this change is word sense disambiguation, which identifies the intended meaning of a word in context (e.g., distinguishing "bank" the financial institution from "bank" the landform). We show that incorporating topics into this model improves disambiguation accuracy.
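The intuition behind topic-aware disambiguation can be sketched with a small Bayesian calculation. The sense inventory and the per-topic sense probabilities below are assumed values for illustration only; in the actual model they would arise from paths through the WordNet tree:

```python
# Hypothetical senses of "bank"; each sense sits under a different
# branch of the word tree, and topics weight those branches differently.
senses = ["bank.finance", "bank.landform"]

# P(sense | topic): illustrative numbers, not learned parameters.
p_sense_given_topic = {
    "finance": [0.9, 0.1],
    "nature":  [0.2, 0.8],
}

def disambiguate(topic_mixture):
    """Posterior over senses given a document's topic mixture:
    P(sense) proportional to sum_k P(topic k) * P(sense | topic k)."""
    post = [0.0] * len(senses)
    for topic, weight in topic_mixture.items():
        for s, ps in enumerate(p_sense_given_topic[topic]):
            post[s] += weight * ps
    total = sum(post)
    return {senses[s]: post[s] / total for s in range(len(senses))}

# A document dominated by the finance topic favors the financial sense.
result = disambiguate({"finance": 0.8, "nature": 0.2})
```

The document's topics act as context: the same word receives a different sense posterior in a finance-heavy document than in a nature-heavy one.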
Second, we present a model that incorporates local syntactic information into topic models, allowing the algorithm to find groups of words that are both globally thematically consistent and locally syntactically consistent. We use a product-of-experts model to combine document-level and syntactic information and derive variational inference procedures. We show that this model predicts word usage better than previous models.
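The product-of-experts combination itself is simple to sketch: each expert contributes a distribution over the vocabulary, the distributions are multiplied elementwise, and the result is renormalized. The vocabulary and the two expert distributions below are hypothetical values chosen to make the behavior visible:

```python
import numpy as np

vocab = ["the", "bank", "ran", "quickly", "money"]

# Illustrative expert distributions: one thematic (from the document's
# topic), one syntactic (from the local syntactic state, e.g. a noun slot).
p_topic  = np.array([0.05, 0.40, 0.05, 0.05, 0.45])  # favors topical words
p_syntax = np.array([0.05, 0.50, 0.05, 0.05, 0.35])  # favors nouns here

def product_of_experts(dists):
    """Multiply the experts' distributions elementwise and renormalize.
    A word gets high probability only if every expert assigns it mass,
    so the combined distribution is both thematically and syntactically
    consistent."""
    p = np.prod(dists, axis=0)
    return p / p.sum()

p = product_of_experts([p_topic, p_syntax])
```

Here "bank" wins because both experts support it, whereas a word favored by only one expert is suppressed; this veto-like behavior is the reason a product, rather than a mixture, is used to combine the two sources of information.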
If time permits, other applications related to cross-language topic models and data mining will also be discussed. No background knowledge of topic models is assumed.