Example projects and datasets
Collections of previously created shared tasks:
Other datasets:
Really big datasets – these can be challenging to process, so we don’t recommend them unless you are excited about the data and prepared to dive in.
- Reddit public comments dataset. See the link for an example of what information is in a comment. It’s very large (~200 GB download), but gets smaller if you pick out a specific month or so.
- Wikipedia dumps … these are tricky to process. We can supply you a plaintext version of English Wikipedia if you want it.
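For the Reddit comment dumps, processing usually means streaming over one JSON object per line rather than loading everything into memory. Here is a minimal sketch in plain Python; the inline sample data is made up for illustration, and for a real dump you would open the (bz2/zstd-compressed) file instead, but field names like `subreddit` and `body` follow the comment schema linked above:

```python
import io
import json

# Made-up stand-in for a dump file; a real dump would be opened with
# e.g. bz2.open("RC_2015-01.bz2") and iterated the same way.
sample = (b'{"subreddit": "nlp", "body": "hello"}\n'
          b'{"subreddit": "aww", "body": "cute"}\n')

kept = []
for line in io.BytesIO(sample):
    comment = json.loads(line)          # one JSON comment per line
    if comment["subreddit"] == "nlp":   # filter down to what you need
        kept.append(comment["body"])

print(kept)
```

Filtering early like this (by subreddit, month, or language) is usually the difference between a tractable project and an intractable one.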
NLP software:
- Stanford CoreNLP - one of the most popular open-source NLP stacks. For English it provides a tokenizer, sentence segmenter, truecasing, POS tagging, NER, constituent and dependency parsing, and coreference. If you use the parser, spend some time figuring out the options to get a faster model.
- spaCy, which is fast and robust (mostly English only).
- ARK TweetNLP has POS tagging and dep. parsing for English Twitter.
- GATE - a broad package. I’m not totally sure what it does, but they do have a Twitter POS/NER system among other things.
- NLTK - a Python library. It does only a little NLP by itself, but it does include wrappers for various external NLP tools.
- KenLM for n-gram language modeling.
- There are a zillion other one-off tools and resources for various languages and types of analysis. For example, see the aclwiki’s Software page listing, or aclwiki’s List of Resources by Language.
- External APIs: if you want to send your data to someone else's server. IBM ("Alchemy/BlueMix/'Watson'"), Microsoft ("Azure text analytics"), and Google ("Cloud natural language/translation") are some of the offerings out there right now.
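To give a flavor of the n-gram language modeling that KenLM implements (far more efficiently, and with much better smoothing than this), here is a toy add-one-smoothed bigram model in plain Python; the three-sentence corpus is made up for illustration:

```python
from collections import Counter

# Tiny made-up corpus, just to show the counting.
corpus = ["the cat sat", "the dog sat", "the cat ran"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

vocab_size = len(unigrams)

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothed P(word | prev); real toolkits like
    # KenLM use modified Kneser-Ney smoothing instead.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_prob("the", "cat"))   # seen bigram: higher probability
print(bigram_prob("the", "fish"))  # unseen bigram: small but nonzero
```

The point of smoothing is visible even at this scale: an unseen bigram still gets nonzero probability, just much less than a seen one.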
Machine learning libraries: This is a big area, so this list focuses on ones that are at least somewhat geared toward typical NLP.
- Scikit-Learn seems to be the emerging standard implementation of many common ML techniques, at least for open-source Python. Its classifiers handle sparse textual feature vectors fine.
- CRFsuite is a nice linear-chain CRF implementation in which you can define your own features. It has Python bindings (or you can just use CRFsuite’s feature-file format). There are a bunch of other CRF implementations out there too.
- Gensim implements many latent-space, unsupervised learning algorithms used in NLP for distributional lexical semantics, including LSA/SVD, topic models, and similarity queries.
- Mallet from UMass - mostly known for its implementation of topic models. The topic modeling can be used from the command line.
- Neural networks: TensorFlow, PyTorch, and DyNet are some of the more popular libraries now; there are others that build on top of them as well. We don't teach neural nets in depth in this course, and it typically takes a lot of work to get up to speed with them; if you want to use them but are inexperienced in this area, be prepared for a large time investment.
- R has an extensive suite of ML packages, but most R packages aren't designed for NLP-style data (discrete and high-dimensional).
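As a small illustration of the Scikit-Learn workflow mentioned above, here is a sketch of a bag-of-words text classifier; the toy texts and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy sentiment data.
texts = ["great movie loved it", "terrible film hated it",
         "loved this great film", "hated it terrible movie"]
labels = ["pos", "neg", "pos", "neg"]

# CountVectorizer produces the sparse textual feature vectors that
# sklearn classifiers handle natively.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["a great film"]))
```

The same pipeline pattern works for any sklearn classifier (Naive Bayes, linear SVM, etc.), which makes it easy to compare models on the same features.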
Lexical resources: These are datasets about word types.