UMass Machine Learning and Friends Lunch | A Framework for Social Data Analysis of Text

Abstract: What can text analysis tell us about society? Enormous corpora of news, historical documents, books, and social media encode ideas, beliefs, and culture. While manual content analysis is a useful and established social science method, interest in automated text analysis has exploded in recent years, since it scales to massive data sets, and can assist in discovering patterns and themes.

I will present some case studies of using social media text analysis as a measurement instrument for social phenomena: sentiment analysis as a correlate of public opinion polls, geographic lexical variation as data for sociolinguistics, and characterization of Chinese online censorship. These examples, and other related work, suggest that "text-as-data" analysis techniques vary widely in their computational and statistical complexity and in the amount of domain knowledge they require. Many methods, from word statistics to sentiment lexicons to document classifiers to topic models, can be unified as "weighted lexicon" corpus analysis tools that span both of these spectrums and support both exploratory and confirmatory text data analysis.
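To make the "weighted lexicon" idea concrete, here is a minimal Python sketch, not the speaker's implementation. It assumes a lexicon is simply a mapping from words to numeric weights and that a document is scored by its average per-token weight. The function name `weighted_lexicon_score`, the example lexicons, and the normalization by document length are all hypothetical choices made for illustration.

```python
from collections import Counter


def weighted_lexicon_score(tokens, weights):
    """Return the average per-token weight of a document.

    tokens  -- list of word strings (already tokenized)
    weights -- dict mapping word -> numeric weight; words not in the
               lexicon contribute a weight of zero (an assumption)
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return total / len(tokens)


# Hypothetical hand-built sentiment lexicon: weights are +1 or -1.
sentiment_lexicon = {"good": 1.0, "great": 1.0, "bad": -1.0, "awful": -1.0}

# Hypothetical weights of the kind a linear classifier might learn:
# same interface, but the weights are real-valued.
learned_weights = {"good": 0.5, "jobs": 0.25, "bad": -1.0}

doc = "the economy is good but jobs are bad".split()
lexicon_score = weighted_lexicon_score(doc, sentiment_lexicon)
learned_score = weighted_lexicon_score(doc, learned_weights)
```

Under these assumptions, a hand-curated sentiment lexicon and a set of learned classifier weights plug into the same scoring function; they differ only in where the weights come from, which is one way to read the unification described above.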

Finally, depending on time and audience interest, I could briefly present (1) generative Bayesian models for frame learning, or (2) syntactic analysis of Twitter text (part-of-speech tagging and word clustering).

Bio: Brendan O'Connor (http://brenocon.com/) is a Ph.D. student in Carnegie Mellon University's Machine Learning Department, advised by Noah Smith. He is interested in machine learning and natural language processing, especially when informed by or applied to the social sciences. He has interned on the Facebook Data Science team and has worked on crowdsourcing at Crowdflower / Dolores Labs and on "semantic" search at Powerset. His undergraduate degree was in Symbolic Systems.