
I am a graduate student pursuing my M.S. in computer science at the University of Massachusetts Amherst (UMass). I am a research assistant in the Computational Social Science Lab, which is directed by Hanna Wallach and is part of the Center for Intelligent Information Retrieval. I am in the midst of my fourth semester in the M.S. program, and I intend to graduate in May 2012. In Spring 2010 I completed B.S. degrees in both computer science and mathematics along with a minor in Japanese, also at UMass.
Broadly, my research deals with data mining, information extraction for massive text data, unsupervised machine learning, Bayesian statistical models for text analysis, and cluster analysis.
My long-term
research interests and goals include improving the robustness,
usability, and interpretability of models for statistical text
analysis, and the software used to build such models. These kinds of
improvements are necessary if machine learning techniques are to be
made truly accessible to social scientists; this requires combining
excellent software engineering practices with innovative models.
My current
research seeks to quantify the variability of inference methods for
latent Dirichlet allocation (LDA). One goal of statistical topic models
like LDA is to obtain a topic assignment for each token in a corpus.
The quality of this partitioning of tokens affects all possible
subsequent uses of the topic model. Currently, much
statistical topic modeling work relies on collapsed Gibbs sampling, and
considers just a single set of topic assignments, perhaps those given as
the last sample, or those with the greatest likelihood. In effect,
these practices discard all information about the uncertainty of the
topic assignments. My current research aims to quantify the variability
of parameter estimates in LDA, and to develop a technique for producing
topic modeling data that is more nuanced and exhibits less variability
than the conventional use of a single random sample. This technique
permits aggregating information from multiple partitions from multiple
experimental runs in the interest of producing high-quality output that
preserves and expresses uncertainty, and thus permits wider use
with greater confidence in the data.
During the Fall 2010 semester I worked with Peter Krafft and Hanna
Wallach to compare several approaches to the problem of automated conversation thread disentanglement.
Some of my
research has been driven by the goal of using statistical topic
modeling techniques to enhance
multidisciplinary scientific collaboration by more
effectively identifying researchers who share common interests and
complementary
skills. Related to this is my undergraduate independent Capstone
project, "Application of Natural Language Processing Techniques to the
Organization of Biomedical Research Literature," completed with the
guidance of professors Andrew McCallum and
Hanna Wallach. For my Capstone project I investigated characteristics of Medical Subject
Headings, the manually curated publication classification system used by the National Library of Medicine,
in order to determine how statistical topic modeling techniques could be used to improve the system.
My computer
science-related interests beyond my research concentration include open
source software development, diverse software development processes (especially for web-based applications),
process representation and specification, modularity, design, and nonparametric models.
My interests and activities outside computer science include playing upright bass in a band (check out our split album released Dec. 2011, or our full-length studio album released Jan. 2011, via Bandcamp), barefoot running, constrained cooking, and aquarium keeping. My heroes include Malcolm Gladwell, Douglas Hofstadter, and Michael Pollan.