UMass Machine Learning and Friends Lunch

Beating the News: Predicting Significant Societal Events from Open Source Data

This talk will describe BBN’s efforts in IARPA’s OSI (Open Source Indicators) program. OSI aimed to develop automated methods for continuously monitoring publicly available data sources to predict significant societal events with a lead time of at least seven days. OSI’s test region was Latin America and included both Spanish- and Portuguese-speaking countries. I will describe three forecasting components that we developed, one each for civil unrest, disease outbreaks, and election results. All three components are based on a common methodology that includes a) data acquisition components for real-time monitoring of data streams, b) feature extractors to convert data streams into time series, c) time series analysis to detect causal patterns, and d) statistical models for event prediction. The talk will present specific characteristics of the forecasting tasks, the methods used, results, and lessons learned.
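To make the four-stage methodology concrete, here is a minimal sketch of steps b) through d) on a toy stream of dated keyword mentions. It is not BBN's system: the function names, the trailing z-score used as the "analysis" step, and the threshold rule used as the "model" are all assumptions chosen for illustration, and data acquisition is replaced by an in-memory list of dates.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean, pstdev

LEAD_DAYS = 7  # minimum lead time named in the abstract


def daily_counts(mentions, start, end):
    """Feature extraction: turn dated mentions into a daily count series."""
    counts = Counter(mentions)
    days = (end - start).days + 1
    return [counts[start + timedelta(days=i)] for i in range(days)]


def surge_score(series, window=7):
    """Time series analysis (assumed): z-score of the latest day
    against the preceding `window` days."""
    if len(series) < window + 1:
        return 0.0  # not enough history to judge
    history, latest = series[-window - 1:-1], series[-1]
    spread = pstdev(history) or 1.0  # avoid dividing by zero on flat history
    return (latest - mean(history)) / spread


def forecast(mentions, start, today, threshold=3.0):
    """Event prediction (assumed rule): warn LEAD_DAYS ahead on a surge."""
    score = surge_score(daily_counts(mentions, start, today))
    if score >= threshold:
        return {"event_date": today + timedelta(days=LEAD_DAYS), "score": score}
    return None
```

Calling `forecast(mentions, start, today)` with a list of `datetime.date` values returns either `None` or a dictionary holding the forecast date and the surge score. A real component would replace the threshold rule with a trained statistical model and use many feature series rather than one.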

Sequence Recognition in Speech Lattices

Time permitting, I will present a short description of a novel algorithm for performing named entity recognition (NER) in word lattices such as those produced by speech recognition systems. Unlike the case of text, where a fixed input lets decoding skip the normalization term of models such as CRFs, a lattice contains many competing word sequences, so the normalization term cannot be ignored. I will describe a solution that uses locally normalized probability distributions and a pair of taggers, one working forward in time and the other backward, that are combined using dual decomposition. Accuracy is comparable to other state-of-the-art techniques, and the algorithm can identify names anywhere in the lattice, including those not in the one-best output of the recognizer.
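The idea of combining a forward and a backward tagger through dual decomposition can be sketched as follows. This is a simplified illustration, not the speaker's algorithm: it works on a plain word sequence rather than a lattice, and the score tables (`emit_f`, `trans_f`, `emit_b`, `trans_b`) are hypothetical log-scores that a pair of locally normalized taggers might produce. Each tagger is decoded separately with Viterbi, and Lagrange multipliers `u` are adjusted by subgradient steps until the two taggers agree.

```python
def viterbi(emit, trans):
    """Best tag sequence for a chain scored by emit[i][t] + trans[prev][t]."""
    n, k = len(emit), len(emit[0])
    best = [emit[0][:]]   # best[i][t]: top score of a path ending in tag t at i
    back = []             # back[i-1][t]: best previous tag for tag t at i
    for i in range(1, n):
        row, ptr = [], []
        for t in range(k):
            p = max(range(k), key=lambda s: best[-1][s] + trans[s][t])
            row.append(best[-1][p] + trans[p][t] + emit[i][t])
            ptr.append(p)
        best.append(row)
        back.append(ptr)
    t = max(range(k), key=lambda s: best[-1][s])
    tags = [t]
    for ptr in reversed(back):  # follow back-pointers to the start
        t = ptr[t]
        tags.append(t)
    return tags[::-1]


def dual_decompose(emit_f, trans_f, emit_b, trans_b, rounds=50, rate=1.0):
    """Return (tags, agreed). trans_b[next_tag][t] scores tag t given the
    tag that follows it, so the backward tagger is decoded right to left."""
    n, k = len(emit_f), len(emit_f[0])
    u = [[0.0] * k for _ in range(n)]  # one multiplier per position and tag
    fwd = []
    for r in range(1, rounds + 1):
        # Forward tagger sees scores plus u; backward tagger sees scores minus u.
        fwd = viterbi([[emit_f[i][t] + u[i][t] for t in range(k)]
                       for i in range(n)], trans_f)
        bwd = viterbi([[emit_b[i][t] - u[i][t] for t in range(k)]
                       for i in reversed(range(n))], trans_b)[::-1]
        if fwd == bwd:
            return fwd, True  # agreement certifies the jointly best sequence
        step = rate / r       # decaying subgradient step size
        for i in range(n):
            if fwd[i] != bwd[i]:
                u[i][fwd[i]] -= step  # discourage the forward tagger's choice
                u[i][bwd[i]] += step  # and nudge it toward the backward one
    return fwd, False         # no agreement within the round limit
```

When the two decoders agree, the shared sequence maximizes the sum of the two taggers' scores; when they do not agree within `rounds` iterations, the sketch simply falls back to the forward tagger's last output.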

Bio

Scott Miller is a Senior Technical Director and Lead Scientist in the Speech and Language Department at BBN. He currently leads an effort focused on extracting structured information from foreign language sources. Miller’s previous roles at BBN include Principal Investigator under IARPA’s OSI program and technical lead for BBN’s effort under JHU’s Center of Excellence. Miller previously served as Chief Scientist at Basis Technology Corporation, where he led efforts that developed machine-learning methods for syntax-based machine translation and multilingual text processing. At Basis, he also served as PI for a DARPA GALE subcontract. In 2004 Miller founded Translingual Technologies, a Massachusetts startup that created novel machine translation technology for making foreign-language content accessible to monolingual English speakers.

Miller served as a senior researcher at JHU SCALE 2009 (Summer Camp for Applied Language Exploration): Semantically Informed Machine Translation, which focused on Urdu-to-English translation. He co-led SCALE 2010: All-Source Knowledge Base Population, which focused on constructing knowledge bases from multiple sources, including conversational speech and informal text genres, in English and Arabic.

Miller’s technical innovations include:

IdentiFinder, the first successful statistical named-entity tagging algorithm (U.S. Patent 6,052,682, Miller, Bikel and Schwartz). IdentiFinder has been successfully applied to ASR (Automatic Speech Recognition), OCR (Optical Character Recognition) and text in multiple languages including English, Arabic, and Chinese. The work is widely cited.

SIFT (Statistical Information from Text), which was among the first successful joint-inference information extraction algorithms, achieving state-of-the-art performance in a U.S. Government evaluation (MUC-7).

BBN’s Quick Tagger, which dramatically improved the practicality of deploying information extraction capabilities by reducing annotation effort from many days to a few hours.

Miller has authored numerous technical articles and holds a Ph.D. from Northeastern University.