Extracting Knowledge From Informal Text
The internet has revolutionized the way we communicate, leading to a constant flood of informal text available in electronic format, including email, Twitter, SMS, and the clinical text found in electronic medical records. This presents a big opportunity for Natural Language Processing (NLP) and Information Extraction (IE) technology to enable new large-scale data-analysis applications by extracting machine-processable information from unstructured text.
In this talk I will discuss several challenges and opportunities that arise when applying NLP and IE to informal text, focusing specifically on Twitter, which has recently risen to prominence, challenging the mainstream news media as the dominant source of real-time information on current events. I will describe several NLP tools we have adapted to handle Twitter’s noisy style, and present a system that leverages these tools to automatically extract a calendar of popular events occurring in the near future (http://statuscalendar.cs.washington.edu).
I will further discuss fundamental challenges that arise when extracting meaning from such massive open-domain text corpora. I will present several probabilistic latent variable models that infer the semantics of large numbers of words and phrases and enable a principled, modular approach to extracting knowledge from these corpora.
Alan Ritter is a Ph.D. candidate in the Department of Computer Science and Engineering at the University of Washington. His interests include NLP in short informal messages (e.g., Twitter), modeling lexical semantics with latent variables, modeling conversations in social media, and paraphrasing between different styles of language (for example, translating Shakespeare’s plays, noisy Twitter text, or technical writing into standard English and vice versa). He was awarded an NDSEG fellowship and won the best student paper award at IUI in 2009.