Hello! I'm a fourth-year graduate student in the
Computer Science Department at
UMass Amherst, and via cotutelle at Macquarie University, where I work in
natural language processing and machine learning. My advisors are
David Smith and
Mark Johnson.
I'm also occasionally in South College, pretending to be a
linguist.
My papers, and my cv.
My Erdős–Bacon number is arguably no greater than 8.
My research focuses on statistical models of natural language processing and acquisition, with an emphasis on joint inference, unsupervised learning and statistical relational learning. My primary interest stems from a fascination with how children are able to, with little exception, effortlessly acquire their first language. I develop models that explore this problem by examining how linguistic levels (morphology, phonology, etc.) interact, and how their relational structures can be leveraged to improve learning in the absence of labeled training instances and negative examples. Additionally, these models test different assumptions, biases, and learning mechanisms in an attempt to determine which best approximates human language learning, and which biases, if any, are linguistically universal.
Unsupervised Bilingual Morpheme Segmentation and Alignment with
Context-rich Hidden Semi-Markov Models
Jason Naradowsky and
Kristina Toutanova
ACL 2011
[abstract]
[paper]
[slides]
[bib]
This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.
A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
John Lee,
Jason Naradowsky, and
David Smith
ACL 2011
[abstract]
[paper]
[bib]
Most previous studies of morphological disambiguation and dependency parsing have been pursued independently. Morphological taggers operate on n-grams and do not take into account syntactic relations; parsers use the ``pipeline'' approach, assuming that morphological information has been separately obtained.
However, in morphologically-rich languages, there is often considerable interaction between morphology and syntax, such that neither can be disambiguated without the other. In this paper, we propose a discriminative model that jointly infers morphological properties and syntactic structures. In evaluations on various highly-inflected languages, this joint model outperforms both a baseline tagger in morphological disambiguation, and a pipeline parser in head selection.
Feature Induction for Online Constraint-based Phonology Acquisition
Jason Naradowsky,
Joe Pater, and
David Smith
Synthesis Project, Presented at NECPHON 2011
[abstract][(Contact me for draft)]
Log-linear models provide a convenient method for coupling existing machine learning methods to constraint-based linguistic formalisms like optimality theory and harmonic grammar. While the learning methods themselves have been well studied in this domain, the question of how these constraints originate is often left unanswered. We present a novel, error-driven approach to constraint induction that performs lightweight decisions based on local information. When evaluated on the task of reproducing human gradient phonotactic judgements, a model trained with this procedure can sometimes nearly match the performance of state-of-the-art methods that rely on global information and individual assessment of all possible constraints. We conclude by discussing methods for incorporating context and linguistic bias into the induction scheme to produce more accurate grammars.
Learning Hidden Metrical Structure with a Log-linear Model of Grammar
Jason Naradowsky,
Joe Pater,
David Smith, and
Robert Staubs
Workshop on Computational Modelling of Sound Pattern Acquisition 2010
Polylingual Topic Models
David Mimno,
Hanna Wallach,
Jason Naradowsky,
David Smith and
Andrew McCallum
EMNLP 2009
[abstract]
[paper]
[bib]
Topic models are a useful tool for analyzing large text collections, but have previously been applied in only monolingual, or at most bilingual, contexts. Meanwhile, massive collections of interlinked documents in dozens of languages, such as Wikipedia, are now widely available, calling for tools that can characterize content in many languages. We introduce a polylingual topic model that discovers topics aligned across multiple languages. We explore the model's characteristics using two large corpora, each with over ten different languages, and demonstrate its usefulness in supporting machine translation and tracking topic trends across languages.
Improving Morphology Induction by Learning Spelling Rules
Jason Naradowsky and
Sharon Goldwater
IJCAI 2009
[abstract]
[paper]
[slides]
[bib]
Unsupervised learning of morphology is an important task for human learners and in natural language processing systems. Previous systems focus on segmenting words into substrings (taking => tak.ing), but sometimes a segmentation-only analysis is insufficient (e.g., taking may be more appropriately analyzed as take.ing, with a spelling rule accounting for the deletion of the stem-final e). In this paper, we develop a Bayesian model for simultaneously inducing both morphology and spelling rules. We show that the addition of spelling rules improves performance over the baseline morphology-only model.
Polylingual Topic Models
David Mimno,
Hanna Wallach,
Limin Yao,
Jason Naradowsky and
Andrew McCallum
The Learning Workshop (Snowbird) 2009
Natural Language Toolkit (NLTK):
The Natural Language Toolkit is a collection of open source Python modules that can be used freely for research or pedagogical purposes. There's also a book out now documenting how to use the NTLK - it doubles as an introductory computational linguistics coursebook.
For the summer of 2008 I worked on the NLTK while sponsored under the Google Summer of Code program, during which time I implemented a suite of dependency parsers under the supervision of Sebastian Riedel and Jason Baldridge.