UMass Machine Learning and Friends Lunch | Main / Efficient Nearest-Neighbor Search In The Probability Simplex

Abstract: Document similarity tasks occur in wide variety of areas in information retrieval and natural language processing, but a fundamental question when comparing documents is which representation to use. Topic models, which have served as versatile tools for exploratory data analysis and visualization, represent documents as probability distributions over latent topics. Systems comparing topic distributions thus use measures of probability divergence such as Kullback-Leibler, Jensen-Shannon, or Hellinger. In this talk I will present novel analysis and applications of the reduction of Hellinger divergence to Euclidean distance computations. This reduction allows us to exploit approximate nearest-neighbor techniques, such as locality-sensitive hashing (LSH) and approximate search in kd-trees, for search in the probability simplex. I'll demonstrate the effectiveness and efficiency of this approach on three tasks with two probabilistic models: using latent Dirichlet allocation (LDA) to discover relationships between NIH grants, using LDA to aid in prior-art retrieval for patents, and using a polylingual topic model (PLTM) to find document translation pairs in a large unaligned bilingual corpus. As this is work in progress any feedback is welcome.

Bio: Kriste Krstovski is a Ph.D candidate in the Computer Science Department, at the University of Massachusetts Amherst, advised by David Smith. He is a member of the Information Retrieval (IR) and Information Extraction and Synthesis (IESL) Laboratories. Prior to starting the Ph.D. program at UMass Amherst, he worked as a Staff Scientist in the Speech and Language Processing Department at BBN Technologies for four years. He finished his B.S. and M.S. in Electrical Engineering at the University of New Hampshire while being a member of the Project54 team under the supervision of Andrew L. Kun and W. Tom Miller III.