Measuring Confidence in Temporal Topic Models with Posterior Predictive Checks
David Mimno, David Blei.
NIPS Workshop on Computational Social Science and the Wisdom of Crowds, 2010, Whistler, BC.
Rethinking LDA: Why Priors Matter
Hanna Wallach, David Mimno and Andrew McCallum.
NIPS, 2009, Vancouver, BC.
PDF
Supplementary Material
Empirically, we have found that optimizing Dirichlet hyperparameters
for document-topic distributions in topic models makes a huge difference:
topics are not dominated by very common words and topics are more stable
as the number of topics increases. In this paper we explore the effects
of Dirichlet priors on topic models. The best structure seems to be
an asymmetric prior over document-topic distributions and a symmetric
prior over topic-word distributions, currently implemented in the
MALLET toolkit.
Reconstructing Pompeian Households
David Mimno. Applications of Topic Models Workshop, NIPS 2009, Whistler, BC.
House data
Artifact data
PDF (selected for oral presentation)
Pompeii provides a unique view into daily life in a Roman city, but
the evidence is noisy and incomplete. This work applies statistical
data mining methods originating in text analysis to
a database of artifacts found in 30 houses in Pompeii.
Polylingual Topic Models
David Mimno, Hanna Wallach, Jason Naradowsky, David Smith and Andrew McCallum.
EMNLP, 2009, Singapore.
PDF
Standard statistical topic models do not handle multiple languages well,
but many important corpora -- particularly outside scientific publications --
contain a mix of many languages. We show that with simple modifications,
topic models can leverage not only direct translations but also comparable
collections like Wikipedia articles. We demonstrate the system on European
parliament proceedings in 12 languages and comparable Wikipedia articles
in 14 languages.
Evaluation Methods for Topic Models
Hanna Wallach, Iain Murray, Ruslan Salakhutdinov and David Mimno.
ICML, 2009, Montreal, Quebec.
PDF
Held-out likelihood experiments provide an important complement to
task-specific evaluations in topic models. We evaluate several methods
for calculating held-out likelihoods. Several previously used methods,
especially the harmonic mean method, show poor accuracy and high variance
compared to a "Chib-style" method and a particle filter-inspired method.
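For readers unfamiliar with it, the harmonic mean method estimates the probability of held-out words as the harmonic mean of the likelihoods under several posterior samples of topic assignments. The following is a minimal sketch in log space, assuming the per-sample log likelihoods have already been computed. The function name and the sample values are illustrative only and do not come from the paper or its code.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Return the log of the harmonic mean of exp(l) for each l given.

    harmonic mean = S / sum_s exp(-l_s), evaluated with a log-sum-exp
    so that very negative log likelihoods do not underflow.
    """
    negated = [-l for l in log_likelihoods]
    largest = max(negated)
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in negated))
    return math.log(len(log_likelihoods)) - log_sum


# Hypothetical log likelihoods from four posterior samples.
samples = [-1050.0, -1048.5, -1062.3, -1049.1]
estimate = log_harmonic_mean(samples)
```

The sum is dominated by the sample with the lowest likelihood, so a single poor sample can move the estimate a great deal, which is consistent with the high variance reported above.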
Efficient Methods for Topic Model Inference on Streaming Document Collections
Limin Yao, David Mimno and Andrew McCallum.
KDD, 2009, Paris, France.
PDF
slides on fast sampling
Statistical topic modeling has become popular in text processing,
but remains computationally intensive. It is often impossible to
run standard inference methods on collections because of limited space
(e.g., large IR corpora) and time (e.g., streaming corpora). In this paper
we evaluate a number of methods for lightweight online topic inference,
based on models trained from computationally expensive offline processes.
In addition, we present SparseLDA, a new data structure and algorithm
for Gibbs sampling in multinomial mixture models (such as LDA) that
offers substantial improvements in speed and memory usage. A parallelized
version of this algorithm is implemented in
MALLET.
Error: in section 3.4, the statement "The constant s only changes when we
update the hyperparameters α" is incorrect, because the word counts of
the old topic and the new topic each change by one. In fact, s
must be updated before and after sampling a topic for each token, but this update
takes a constant number of operations, regardless of the number of topics.
This problem was only in the paper; the MALLET implementation has always been correct.
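A minimal sketch of the corrected bookkeeping, assuming the smoothing-only term has the form s = Σ_t α_t β / (βV + n_t), where n_t is the number of tokens assigned to topic t and V is the vocabulary size. The function and variable names are illustrative and are not taken from MALLET.

```python
def smoothing_mass(alpha, beta, vocab_size, tokens_per_topic):
    """Recompute s from scratch; cost grows with the number of topics."""
    return sum(a * beta / (beta * vocab_size + n)
               for a, n in zip(alpha, tokens_per_topic))


def shift_topic_count(s, topic, delta, alpha, beta, vocab_size, tokens_per_topic):
    """Change one topic's token count by delta and patch s in constant time."""
    denom_before = beta * vocab_size + tokens_per_topic[topic]
    tokens_per_topic[topic] += delta
    denom_after = beta * vocab_size + tokens_per_topic[topic]
    # Only this topic's term of the sum changes, so swap the old term for the new.
    return s + alpha[topic] * beta * (1.0 / denom_after - 1.0 / denom_before)


# Hypothetical values for a three-topic model.
alpha = [0.1, 0.5, 0.2]
beta = 0.01
vocab_size = 1000
tokens_per_topic = [40, 25, 35]

s = smoothing_mass(alpha, beta, vocab_size, tokens_per_topic)
# Before sampling: remove the token from its old topic (topic 0).
s = shift_topic_count(s, 0, -1, alpha, beta, vocab_size, tokens_per_topic)
# After sampling: add the token to its new topic (topic 2, chosen arbitrarily here).
s = shift_topic_count(s, 2, +1, alpha, beta, vocab_size, tokens_per_topic)
```

Each patch touches a single term of the sum, so the cost per token does not depend on the number of topics.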
Polylingual Topic Models
David Mimno, Hanna Wallach, Limin Yao, Jason Naradowsky and Andrew McCallum.
Snowbird Learning Workshop, 2009, Clearwater, FL.
Classics in the Million Book Library
Gregory Crane, Alison Babeu, David Bamman, Thomas Breuel, Lisa Cerrato, Daniel Deckers, Anke Lüdeling, David Mimno, Rashmi Singhal, David A. Smith, Amir Zeldes. Digital Humanities Quarterly 3(1), Winter 2009.
HTML
In October 2008, Google announced a settlement that will provide access to seven million scanned books, while the number of books freely available under an open license from the Internet Archive exceeded one million. The collections and services that classicists have created over the past generation place them in a strategic position to exploit the potential of these collections. This paper concludes with research topics relevant to all humanists on converting page images to text, one language to another, and raw text into machine-actionable data.
Gibbs Sampling for Logistic Normal Topic Models with Graph-Based Priors
David Mimno, Hanna Wallach and Andrew McCallum.
NIPS Workshop on Analyzing Graphs, 2008, Whistler, BC. (One of five
papers, out of 22, selected for oral presentation.)
PDF
Dirichlet distributions are mathematically tractable priors
for mixing proportions in Bayesian mixture models, but their convenience
comes at the cost of flexibility and expressiveness. Previous work has
suggested alternative priors such as logistic normal distributions, extending
topic mixture models with covariance matrices and dynamic linear models,
but this work has been limited to variational approximations.
This paper presents a method for simple, robust
Gibbs sampling in logistic normal topic models using an auxiliary variable
scheme. Using this method, we extend previous models over linear chains
to Gaussian Markov random field priors with arbitrarily structured graphs.
Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression
David Mimno and Andrew McCallum.
UAI, 2008 (selected for plenary presentation)
PDF
Text documents are usually accompanied by metadata, such as the authors,
the publication venue, the date, and any references. Work in topic modeling
that has taken such information into account, such as Author-Topic,
Citation-Topic, and Topic-over-Time models, has generally focused on
constructing specific models that are suited only for one particular type
of metadata. This paper presents a simple, unified model for learning
topics from documents given arbitrary non-textual features, which can be
discrete, categorical, or continuous.
Modeling Career Path Trajectories
David Mimno and Andrew McCallum.
University of Massachusetts, Amherst Technical Report #2007-69, 2007.
PDF
Descriptions of previous work experience in resumes are a valuable source of
information about the structure of the job market and the economy. There is,
however, a high degree of variability in these documents.
Job titles are a particular problem, as they are often either overly sparse
or overly general:
85% of job titles in our corpus occur only once, while the most common titles, such as "Consultant", are so broad as to be virtually meaningless.
We use a hierarchical hidden state model to discover
clusters of words that correspond to distinct skills, clusters of skills
that correspond to jobs, and transition patterns between jobs.
Community-based Link Prediction with Text
David Mimno, Hanna Wallach, and Andrew McCallum.
Statistical Network Modeling Workshop, NIPS, 2007, Whistler, BC.
Expertise Modeling for Matching Papers with Reviewers
David Mimno and Andrew McCallum.
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2007, San Jose, CA.
PDF
Data
Science depends on peer review, but matching papers with reviewers is a
challenging and time-consuming task. We compare several automatic methods for
measuring the similarity between a submitted abstract and papers previously
written by reviewers. These include a novel topic model that automatically
divides an author's papers into topically coherent "personas".
Probabilistic Representations for Integrating Unreliable Data Sources
David Mimno, Andrew McCallum and Gerome Miklau.
IIWeb workshop at AAAI 2007, Vancouver, BC, Canada.
PDF
Mixtures of Hierarchical Topics with Pachinko Allocation.
David Mimno, Wei Li and Andrew McCallum.
International Conference on Machine Learning (ICML) 2007, Corvallis, OR.
PDF
The four-level pachinko allocation model
(PAM) (Li & McCallum, 2006) represents
correlations among topics using a DAG structure. It does not, however, represent a
nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more
specific topics. This paper presents hierarchical PAM, an enhancement that explicitly represents a topic hierarchy. This model
can be seen as combining the advantages of
hLDA's topical hierarchy representation with
PAM's ability to mix multiple leaves of the
topic hierarchy. Experimental results show
improvements in likelihood of held-out documents, as well as mutual information between
automatically-discovered topics and human-generated categories such as journals.
Mining a digital library for influential authors.
David Mimno and Andrew McCallum.
Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada.
PDF
Most digital libraries let you search for documents, but we often want to
search for people as well. We extract and disambiguate author names from
online research papers, weight papers using PageRank on the citation graph,
and expand queries using a topic model. We evaluate the system by comparing
people returned for the query "information retrieval" to recipients of
major awards in IR.
Organizing the OCA: Learning faceted subjects from a library of digital books.
David Mimno and Andrew McCallum.
Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada.
PDF
The Open Content Alliance is one of several large-scale digitization projects
currently producing huge numbers of digital books. Statistical topic models
are a natural choice for organizing and describing such large text corpora,
but scalability becomes a problem when we are dealing with multi-billion
word corpora. This paper presents a new method for topic modeling, DCM-LDA.
In this model, we train an independent topic model for every book, using
pages as "documents". We then gather the topics discovered, cluster them,
and then fit a Dirichlet prior for each topic cluster. Finally, we retrain
the individual book topic models using these new shared topics.
Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries.
Gregory Crane, David Bamman, Lisa Cerrato, Alison Jones, David Mimno, Adrian Packel, D. Sculley, and Gabriel Weaver.
European Conference on Digital Libraries (ECDL) 2006, Alicante, Spain.
PDF
Several groups are currently embarking on large-scale digitization projects,
but are they producing anything more than lots of raw text? This paper argues
that such an investment in digitization will be more valuable if accompanied
by a parallel investment in highly structured resources such as dictionaries.
Several examples, including some I worked on while at Perseus, illustrate
this effect.
Bibliometric Impact Measures Leveraging Topic Analysis.
Gideon Mann, David Mimno and Andrew McCallum.
Joint Conference on Digital Libraries (JCDL) 2006, Chapel Hill, NC.
PDF
Powerpoint
When evaluating the impact of research papers, it's important to compare
similar papers: a massively influential paper in Mathematics may be as
well cited as a middling paper in Molecular Biology. We present a system
that combines automatic citation analysis on spidered research papers
with a new automatic topic model that is aware of multi-word terms. This
system is capable of finding fine-grained sub-fields while scaling to the
exponential increase in open-access publishing. We evaluate papers from the
Rexa digital library using both
traditional bibliometric statistics (substituting topics for journals) and
several new metrics.
Hierarchical Catalog Records: Implementing a FRBR Catalog.
David Mimno, Alison Jones and Gregory Crane.
D-Lib Magazine, October 2005.
HTML
Finding a Catalog: Generating Analytical Catalog Records from Well-structured Digital Texts.
David Mimno, Alison Jones and Gregory Crane.
Joint Conference on Digital Libraries (JCDL) 2005, Denver, CO.
PDF
Services for a Customizable Authority Linking Environment.
Mark Patton and David Mimno.
Demonstration at Joint Conference on Digital Libraries (JCDL) 2004, Tucson, AZ.
Towards a Cultural Heritage Digital Library.
Gregory Crane, Clifford E. Wulfman, Lisa M. Cerrato, Anne Mahoney,
Thomas L. Milbank, David Mimno, Jeffrey A. Rydberg-Cox, David A.
Smith, and Christopher York. Joint Conference on Digital Libraries (JCDL) 2003, Houston, TX.