Example projects and datasets

Some example projects.

Movie summarization: summarize all reviews from IMDB.
Chat bot: build a conversational agent that can talk to you. Perhaps use Twitter or another corpus to compute language similarity or train language models.
Song lyrics or poetry generator.
Predict stuff from Twitter: flu outbreaks, financial movements, etc. This can be challenging but it is fun. We have some years of Twitter archives you can use here at UMass.
Identify person names in emails and text.
Movie revenues prediction: use movie reviews, or summary information, to predict movie revenues.
Sentiment analysis
Grammar and spelling correction. (Possible dataset; there may be more out there or you can make your own)

Some example already-annotated datasets

One example: Structured sentiment analysis. Identify not just the polarity of a message, but opinions about specific mentioned things in the text.

Many other shared tasks:

SemEval 2015 and SemEval 2014 shared task lists.
CoNLL shared tasks. Only certain years are easily available. For example, the CoNLL 2003 dataset on NER is one of the standard NER datasets.
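
The CoNLL 2003 NER data uses a simple column format: one token per line with whitespace-separated columns (token, POS tag, chunk tag, NER tag), a blank line between sentences, and `-DOCSTART-` lines marking document boundaries. Below is a minimal reader sketch assuming that layout. The function name `read_conll` and the sample text are made up for illustration; check the files you actually download, since other years use different columns.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (token, ner_tag) pairs.

    Assumes the token is the first column and the NER tag is the last one.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or a document marker ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in the assumed layout (not real corpus data).
sample = """-DOCSTART- -X- -X- O

Alice NNP B-NP B-PER
visited VBD B-VP O
Paris NNP B-NP B-LOC

She PRP B-NP O
left VBD B-VP O
"""

sentences = read_conll(sample.splitlines())
print(len(sentences))
print(sentences[0])
```

With a real file, pass the open file handle instead of `sample.splitlines()`.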

Other datasets

Social and review-like datasets:
- Yelp academic dataset. Has reviews, star ratings for each, and also links reviews to business and user objects. Lots of interesting work has been done with this dataset.
- Amazon movie reviews. 8 million reviews.
- Stack Exchange data dump.
Congressional speeches dataset
- Also several other cool datasets from Lillian Lee
NLTK corpora package. Lots of stuff.
Project Gutenberg – books, especially historical books
CMU Movie Summary corpus: plot summaries of movies, linked with information about movie revenues and actors in them, as well as coref/parse information.
Movie revenues and reviews corpus.

Really big datasets – these can be challenging to process, so we wouldn’t recommend them unless you are excited about it and prepared to dive in.

Reddit public comments dataset. See the link for an example of what information is in a comment. It’s very large (~200 GB download), but gets smaller if you pick out a specific month or something.
Wikipedia dumps … these are tricky to process. We can supply you with a plaintext version of English Wikipedia if you want it.
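
For a dump as large as the Reddit one, it helps to stream the file line by line and keep only the slice you need rather than loading everything into memory. The sketch below assumes one JSON object per line with `body`, `subreddit`, and `created_utc` (a Unix timestamp) fields; verify the field names against the example comment linked above. The helper name `comments_in_month` and the sample records are hypothetical.

```python
import io
import json
from datetime import datetime, timezone


def comments_in_month(lines, year, month):
    """Yield comment dicts whose created_utc falls in the given UTC month."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        comment = json.loads(line)
        # created_utc may be stored as a string or a number; int() handles both.
        created = datetime.fromtimestamp(int(comment["created_utc"]), tz=timezone.utc)
        if created.year == year and created.month == month:
            yield comment


# Two made-up records standing in for a real dump file.
fake_dump = io.StringIO(
    '{"subreddit": "nlp", "body": "first", "created_utc": "1420070400"}\n'
    '{"subreddit": "nlp", "body": "second", "created_utc": 1422748800}\n'
)

january = list(comments_in_month(fake_dump, 2015, 1))
print([c["body"] for c in january])
```

For a real download, replace `fake_dump` with an open file handle (or a decompressing reader if the file is compressed); because the function is a generator, only one line is held in memory at a time.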

NLP and ML tools

NLP software
- Stanford CoreNLP - one of the most popular open-source NLP stacks. For English it has: tokenizer, sentence segmenter, truecasing, POS, NER, constituent and dependency parsing, coreference. If you use the parser, spend some time figuring out the options to get a faster model. If you use it from Python, you may want Brendan’s Python wrapper for CoreNLP (or not; just use the XML that CoreNLP spits out).
- spaCy, which is supposed to be very fast.
- ARK TweetNLP has POS tagging and dependency parsing for English Twitter.
- GATE - a broad package. I’m not totally sure what it does, but they have a Twitter POS/NER system, I think, among other things.
- NLTK - a Python library. It doesn’t do much NLP itself, the last I checked, but it does include wrappers for various external NLP tools.
- KenLM for n-gram language modeling.
- There are a zillion other one-off tools and resources for various languages and types of analysis. For example, see the aclwiki’s Software page listing, or aclwiki’s List of Resources by Language.
Machine learning libraries. This is a big area, so this list tries to focus on ones that are at least somewhat geared for typical NLP.
- Scikit-Learn seems to be the emerging standard implementation of many standard ML techniques, at least for open-source Python. Its classifiers can handle sparse textual feature vectors fine.
- CRFsuite is a nice linear-chain CRF where you can implement your own features. There are Python bindings (or just use CRFsuite’s feature-file format). There are a bunch of other CRF implementations out there too.
- Gensim implements many latent-space, unsupervised learning algorithms used in NLP for distributional lexical semantics, including LSA/SVD, topic models, and similarity queries.
- Mallet from UMass - mostly known for its implementations of (1) CRFs and (2) topic models. The topic modeling can be used from the command line. I (brendan) don’t know much about its CRF implementation.
- Neural networks: this area is moving so fast that anything I link to will probably become obsolete quickly. The Torch/Lua-based stack described in Andrej Karpathy’s RNN demo is one of the popular ones right now. Theano is an alternative Python-based one.
- By the way, R has a much more extensive suite of ML algorithms than Python does. But most R packages aren’t designed for NLP-style data.
- Some people like Weka. I (brendan) don’t know anything about it.
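
To make "sparse textual feature vectors" concrete: a document is usually represented by storing only its nonzero features, for example as a dict from feature name to count. The sketch below uses only the standard library; the `word=` naming scheme and the tokenizing regex are arbitrary choices for illustration. Dicts of this shape are the kind of input that scikit-learn's `DictVectorizer` converts into a sparse matrix for its classifiers.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to a sparse {feature_name: count} dict."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter("word=" + tok for tok in tokens)


docs = ["The movie was great, really great.", "The plot was thin."]
features = [bag_of_words(d) for d in docs]

# The full vocabulary can be huge, but each document touches only a few entries.
vocabulary = sorted(set().union(*features))
print(features[0])
print(len(vocabulary), "distinct features across", len(docs), "documents")
```

Features that never occur in a document are simply absent from its dict, which is what keeps the representation small even when the vocabulary has hundreds of thousands of entries.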
Lexical resources. These are datasets about word types.
- WordNet. You probably want the NLTK interface to WordNet; search Google for information on how to use it.
- Word Embeddings: GloVe and word2vec both supply publicly available, pre-trained word embeddings which are relatively popular. (does NLTK have an interface for them yet?)
- Hierarchical word clusters: ARK Twitter clusters, any others?
- Sentiment lexicons: The OpinionFinder subjectivity lexicon is one. The most popular in research papers is the LIWC lexicon; it’s proprietary, but I’m told to try searching for LIWC2007_English080730.dic . For Twitter, the NRC sentiment lexicon might be worth looking at.
- CMU pronunciation dictionary (for English)
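
The pre-trained GloVe embeddings mentioned above are distributed as plain text, one word per line followed by its space-separated vector components, so they can be loaded without any special library. Here is a minimal sketch assuming that layout, with cosine similarity as the usual way to compare two words. The toy 3-dimensional vectors are invented for illustration; real files have tens to hundreds of dimensions.

```python
import io
import math


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (both must be nonzero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for a real embedding file.
toy_file = io.StringIO("cat 1.0 0.0 1.0\ndog 1.0 0.0 0.9\ncar 0.0 1.0 0.0\n")
vectors = load_vectors(toy_file)

print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["car"]))
```

Loading a full embedding file into Python lists this way is slow and memory-hungry; for anything beyond a quick experiment you would probably restrict the vocabulary to the words in your data or store the vectors in a numeric array.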