UMass Machine Learning and Friends Lunch | Modeling Differential Word Usage In Hierarchically Labeled Document Collections

Abstract: Topic models based on Latent Dirichlet Allocation (LDA) model word usage within topics, but the differential use of words across topics is often of greater substantive interest. We introduce Hierarchical Poisson Convolution (HPC), a generative model for hierarchically labeled document collections that allows researchers to infer which words have the highest mutual information with topic labels and which are most likely to be associated with a particular topic when observed. When available, HPC uses known hierarchical structure on the topics to make more informative comparisons by modeling differential usage separately on each branch of the tree. Specifically, the count for each word in a document is the sum of topic-specific Poisson variates whose rates are weighted by the document's membership in each topic. The log rates for a word across topics follow a Gaussian diffusion down the tree, which allows us to explicitly model the discriminatory power of each word with the variance parameters of the diffusion. Using a shrinkage prior on the variances, HPC induces soft, localized feature selection by forcing less discriminative and less common words to have similar rates locally in the tree. We develop a parallelized block Gibbs sampler using Hamiltonian Monte Carlo that allows for fast and scalable computation.
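
To make the generative story in the abstract concrete, here is a minimal simulation sketch for a single word, written in Python with NumPy. It is an illustration under stated assumptions, not the speaker's implementation: the function names, the toy tree, the per-node variances `tau2`, and the document-length `exposure` multiplier are all hypothetical, and the shrinkage prior and the Gibbs/HMC sampler are not shown.

```python
import numpy as np


def diffuse_log_rates(parent, root_log_rate, tau2, rng):
    """Simulate a Gaussian diffusion of one word's log rate down a topic tree.

    parent[k] is the index of node k's parent (-1 for the root); nodes are
    assumed to be ordered so that every parent precedes its children.
    tau2[p] is the (assumed) variance of the step from node p to its children.
    """
    parent = np.asarray(parent)
    mu = np.empty(len(parent))
    for k, p in enumerate(parent):
        if p < 0:
            mu[k] = root_log_rate
        else:
            # Each child's log rate is centered on its parent's log rate.
            mu[k] = rng.normal(loc=mu[p], scale=np.sqrt(tau2[p]))
    return mu


def sample_word_count(log_rates, theta, exposure, rng):
    """Draw one word's count in one document as a sum of Poisson variates.

    theta holds the document's membership weight in each leaf topic.
    exposure is a hypothetical document-length multiplier.
    """
    rates = exposure * np.asarray(theta) * np.exp(log_rates)
    per_topic = rng.poisson(rates)  # one Poisson variate per topic
    return per_topic, int(per_topic.sum())


rng = np.random.default_rng(0)

# Toy tree: node 0 is the root, nodes 1-2 are internal, nodes 3-5 are leaf topics.
parent = [-1, 0, 0, 1, 1, 2]
leaves = [3, 4, 5]

# A large variance under node 1 lets its children (topics 3 and 4) differ a lot;
# a small variance under node 2 keeps its child close to the parent rate.
tau2 = np.array([0.5, 2.0, 0.01, 0.0, 0.0, 0.0])

mu = diffuse_log_rates(parent, np.log(0.001), tau2, rng)
theta = np.array([0.7, 0.2, 0.1])  # document membership in the three leaf topics
per_topic, total = sample_word_count(mu[leaves], theta, exposure=500, rng=rng)
print("leaf log rates:", mu[leaves])
print("per-topic counts:", per_topic, "total:", total)
```

In this sketch, a variance near zero at a node forces the word's rates in the child topics to be nearly identical, which mirrors the abstract's description of shrinkage toward similar rates locally in the tree; a large variance marks a branch where the word discriminates strongly between topics.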

Bio: Jonathan Bischof is a third-year graduate student in Harvard's Department of Statistics. His interests include Bayesian statistics and computation, hierarchical models, and quantitative social science.