Emma Strubell

I'm a Ph.D. candidate at UMass Amherst working in the Information Extraction and Synthesis Laboratory with Professor Andrew McCallum. Previously, I earned a B.S. in Computer Science from the University of Maine with a minor in math, where I applied models from mathematical biology to the spread of internet worms with Professor David Hiebeler in his Spatial Population Ecological and Epidemiological Dynamics Lab.

In summer 2016 I interned as a Research Scientist with Tom Kollar on the Alexa NLU team at Amazon Lab126. In 2017 I interned with Daniel Andor and David Weiss at Google Research (SAFT) in NYC. I am grateful to have been supported by an IBM PhD Fellowship Award for the 2017-2018 academic year.

Research Interests

I am interested in developing new machine learning techniques to facilitate fast (and accurate) natural language processing of text.

Techniques for low-level NLP tasks such as part-of-speech tagging, named entity recognition and syntactic dependency parsing are now accurate enough to be of use to practitioners who wish to extract structured information from unstructured text, such as blog posts and discussion forums on the web or the text of scientific research papers. Though we now wish to deploy these tools on billions of documents, many of the most accurate models were designed with no regard for computational cost. In response, our work aims to design machine learning algorithms that enable fast inference in NLP models while sacrificing as little accuracy as possible.

My research focuses on two avenues for improving the speed-accuracy trade-off. First, we develop models that quickly build up rich representations of tokens in context. These representations are used as features in a sequential prediction model, where sequence labeling is performed as a series of independent multi-class classifications. This approach allows for much faster inference than, e.g., structured prediction in a graphical model, while maintaining accuracy via high-quality feature representations that incorporate wide context and a concept of neighboring labels. Second, we unify related NLP tasks into a single end-to-end model which reasons in the joint space of output labels. With this approach we aim to increase accuracy by reducing cascading errors and leveraging shared statistics of co-occurring labels, while at the same time decreasing wall-clock runtime by sharing model parameters and computation across tasks.
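To make the speed difference in the first avenue concrete, here is a minimal sketch that contrasts independent per-token classification with joint decoding over a linear chain. It is an illustration only, not the models described above: the function names, the toy scores and the transition matrix are all assumptions made up for this example.

```python
from typing import List


def independent_decode(scores: List[List[float]]) -> List[int]:
    """Pick the best label for each token on its own: O(n * L) time.

    scores[t][k] is a hypothetical score for label k at token t.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def viterbi_decode(scores: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Jointly decode a linear chain with label-to-label scores: O(n * L^2).

    transitions[i][j] is a hypothetical score for label i followed by label j.
    """
    n_labels = len(transitions)
    best = list(scores[0])          # best path score ending in each label
    backptrs = []                   # best previous label, per step and label
    for row in scores[1:]:
        new_best, ptrs = [], []
        for j in range(n_labels):
            prev = max(range(n_labels),
                       key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + row[j])
            ptrs.append(prev)
        best = new_best
        backptrs.append(ptrs)
    # Follow back-pointers from the best final label.
    path = [max(range(n_labels), key=best.__getitem__)]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]


# Toy input: 3 tokens, 2 labels (made-up numbers).
toy_scores = [[2.0, 1.0], [0.5, 1.0], [3.0, 0.0]]
no_preference = [[0.0, 0.0], [0.0, 0.0]]
penalize_switches = [[0.0, -5.0], [-5.0, 0.0]]

independent_labels = independent_decode(toy_scores)
joint_labels_flat = viterbi_decode(toy_scores, no_preference)
joint_labels_sticky = viterbi_decode(toy_scores, penalize_switches)
```

With no transition preferences the two decoders agree. Once switching labels is penalized, only the joint decoder changes its answer, and it pays an extra factor of L per token for that. The approach described above aims to recover that kind of label interaction through richer token representations instead, so that the cheap per-token argmax remains accurate.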

In my spare time, I enjoy cooking (with a focus on making vegetables delicious), fermenting (kombucha, kimchi, yogurt), growing plants (especially succulents), and enjoying the outdoors (backpacking and rock climbing).

In search of a fast Scala lexer, I forked JFlex and added the ability to emit Scala code. JFlex-scala and its corresponding Maven and sbt plugins are available on Maven Central. For an example of its use, check out the tokenizer in FACTORIE.

I am also co-author of Plant Jones. He is a semi-intelligent plant who tweets negatively about water when he's thirsty, and positively when he's not. His code is available here.

In my junior year of college I wrote and presented a tutorial on quantum algorithms aimed at undergraduate students in computer science, available here, along with slides part 1 and part 2.

Gentoo Linux user since 2005.

Amherst, Massachusetts, USA


strubell [at] cs [dot] umass [dot] edu

curriculum vitae (PDF)