Overview

I am a graduate student pursuing my M.S. in computer science at the University of Massachusetts Amherst (UMass). I am a research assistant in the Computational Social Science Lab, which is directed by Hanna Wallach and is part of the Center for Intelligent Information Retrieval. I am in the midst of my fourth semester in the M.S. program, and I intend to graduate in May 2012. In Spring 2010 I completed B.S. degrees in both computer science and mathematics along with a minor in Japanese, also at UMass.

Research

My research deals most generally with data mining, information extraction for massive text data, unsupervised machine learning, Bayesian statistical models for text analysis, and clustering/cluster analysis.

My long-term research interests and goals include improving the robustness, usability, and interpretability of models for statistical text analysis, and the software used to build such models. These kinds of improvements are necessary if machine learning techniques are to be made truly accessible to social scientists; this requires combining excellent software engineering practices with innovative models.

My current research seeks to quantify the variability of inference methods for latent Dirichlet allocation (LDA). One goal of statistical topic models like LDA is to obtain a topic assignment for each token in a corpus, and the quality of this partitioning of tokens affects every subsequent use of the topic model. Currently, much statistical topic modeling work relies on collapsed Gibbs sampling and considers just a single set of topic assignments, perhaps those given by the last sample or those with the greatest likelihood. In effect, these practices discard all information about the uncertainty of the topic assignments. I aim to quantify the variability of parameter estimates in LDA and to develop a technique for producing topic modeling output that is more nuanced and exhibits less variability than the conventional use of a single random sample. This technique aggregates information from multiple partitions across multiple experimental runs in order to produce high-quality output that preserves and expresses uncertainty, and that can therefore be used more widely and with greater confidence.
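
To make the idea of aggregating partitions concrete, here is a minimal sketch in Python. It is not the technique under development in my research, only one simple, commonly used illustration: because topic IDs are generally not aligned across independent runs, it summarizes several partitions by how often each pair of tokens receives the same topic. The function name `co_assignment_frequencies` and the toy data are hypothetical.

```python
from itertools import combinations


def co_assignment_frequencies(partitions):
    """Map each token pair (i, j), i < j, to the fraction of partitions
    in which tokens i and j were assigned the same topic.

    `partitions` is a list of runs; each run is a list giving one topic
    ID per token. Topic IDs need not match across runs, since only
    within-run equality is compared.
    """
    if not partitions:
        raise ValueError("need at least one partition")
    n_tokens = len(partitions[0])
    if any(len(p) != n_tokens for p in partitions):
        raise ValueError("all partitions must cover the same tokens")

    counts = {pair: 0 for pair in combinations(range(n_tokens), 2)}
    for labels in partitions:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(partitions) for pair, c in counts.items()}


# Hypothetical toy data: three runs over four tokens.
# Runs 1 and 2 describe the same partition with swapped topic IDs.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
freqs = co_assignment_frequencies(runs)
for pair, freq in sorted(freqs.items()):
    print(pair, round(freq, 3))
```

Pairs with frequencies near 0 or 1 are grouped consistently across runs, while intermediate values flag tokens whose assignment is uncertain. This is exactly the information that a single sample discards. The pairwise table grows quadratically with the number of tokens, so this form is only practical for small illustrations.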

During the Fall 2010 semester I worked with Peter Krafft and Hanna Wallach to compare several approaches to the problem of automated conversation thread disentanglement.

Some of my research has been driven by the goal of using statistical topic modeling techniques to enhance multidisciplinary scientific collaboration by more effectively identifying researchers who share common interests and complementary skills. Related to this is my undergraduate independent Capstone project, "Application of Natural Language Processing Techniques to the Organization of Biomedical Research Literature," completed with the guidance of professors Andrew McCallum and Hanna Wallach. For my Capstone project I investigated characteristics of Medical Subject Headings, the manually curated publication classification system used by the National Library of Medicine, in order to determine how statistical topic modeling techniques could be used to improve the system.

Current Courses

Past Courses

Awards and Scholarships

Departmental Service

Extracurricular

My computer science-related interests beyond my research concentration include open source software development, diverse software development processes (especially for web-based applications), process representation and specification, modularity, design, and nonparametric models.

My non-computer science-related interests and activities include playing upright bass in a band (check out our split album released Dec. 2011, or our full-length studio album released Jan. 2011, via Bandcamp), barefoot running, constrained cooking, and aquarium keeping. My heroes include Malcolm Gladwell, Douglas Hofstadter, and Michael Pollan.

Last updated 15 February 2012.