Example projects and datasets
Some example projects.
- Movie summarization: summarize all reviews from IMDB.
- Chat bot: build a conversational agent that can talk to you. Perhaps use a Twitter or other corpus for language similarity or language models.
- Song lyrics or poetry generator.
- Predict stuff from Twitter: flu outbreaks, financial movements, etc. This can be challenging but it is fun. We have some years of Twitter archives you can use here at UMass.
- Identify person names in emails and text.
- Movie revenues prediction: use movie reviews, or summary information, to predict movie revenues.
- Sentiment analysis
- Grammar and spelling correction. (Possible dataset; there may be more out there or you can make your own)
Some example already-annotated datasets
One example: Structured sentiment analysis. Identify not just the polarity of a message, but opinions about specific mentioned things in the text.
Many other shared tasks:
Other datasets
Really big datasets – these can be challenging to process so we wouldn’t recommend them, unless you are excited about it and prepared to dive in.
- Reddit public comments dataset. See the link for an example of what information is in a comment. It’s very large (~200 GB download), but gets smaller if you pick out a specific month or something.
- Wikipedia dumps … these are tricky to process. We can supply you a plaintext version of English Wikipedia if you want it.
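One way to make a big comment dump manageable is to stream it line by line and keep only the records you care about, rather than loading everything into memory. Here is a minimal sketch, assuming the dump is newline-delimited JSON with `subreddit` and `body` fields. Those field names, the helper `filter_comments`, and the sample records are assumptions for illustration, so check them against the actual file.

```python
import io
import json


def filter_comments(lines, subreddit):
    """Yield comment dicts from an iterable of JSON lines, keeping one subreddit."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        comment = json.loads(line)
        if comment.get("subreddit") == subreddit:
            yield comment


# Made-up sample standing in for a real dump; with a real file you would
# pass an open file handle instead of this in-memory buffer.
sample = io.StringIO(
    '{"subreddit": "movies", "body": "Loved it."}\n'
    '{"subreddit": "science", "body": "Interesting result."}\n'
    '{"subreddit": "movies", "body": "Too long."}\n'
)

kept = list(filter_comments(sample, "movies"))
print(len(kept), [c["body"] for c in kept])
```

Because the function is a generator, only one line is held in memory at a time, which matters at this scale.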
- NLP software
- Stanford CoreNLP - one of the most popular open-source NLP stacks. For English they have: tokenizer, sentence segmenter, truecasing, POS, NER, constituent and dependency parsing, coreference. If you use the parser, spend some time figuring out the options to get a faster model. If you use it from Python, you may want Brendan’s Python wrapper for CoreNLP (or not; just use the XML that CoreNLP spits out).
- spaCy, which is supposed to be very fast.
- ARK TweetNLP has POS tagging and dep. parsing for English Twitter.
- GATE - a broad package. I’m not totally sure what it does, but they have a Twitter POS/NER system, I think, among other things.
- NLTK - a Python library. It doesn’t do much NLP itself, the last I checked, but it does include wrappers for various external NLP tools.
- KenLM for n-gram language modeling.
- There are a zillion other one-off tools and resources for various languages and types of analysis. For example, see the aclwiki’s Software page listing, or aclwiki’s List of Resources by Language.
- Machine learning libraries. This is a big area, so this list tries to focus on ones that are at least somewhat geared for typical NLP.
- Scikit-Learn seems to be the emerging standard implementation of many standard ML techniques, at least for open-source Python. Its classifiers can handle sparse textual feature vectors fine.
- CRFsuite is a nice linear-chain CRF where you can implement your own features. There are Python bindings (or just use CRFsuite’s feature-file format). There are a bunch of other CRF implementations out there too.
- Gensim implements many latent-space, unsupervised learning algorithms used in NLP for distributional lexical semantics, including LSA/SVD, topic models, and similarity queries.
- Mallet from UMass - mostly known for its implementations of (1) CRFs and (2) topic models. The topic modeling can be used from the command line. I (brendan) don’t know much about its CRF implementation.
- Neural networks: this area is moving so fast that anything I link to will probably become obsolete quickly. The Torch/Lua-based stack described in Andrej Karpathy’s RNN demo is one of the popular ones right now. Theano is an alternative Python-based one.
- By the way, R has a much more extensive suite of ML algorithms than Python does. But most R packages aren’t designed for NLP-style data.
- Some people like Weka. I (brendan) don’t know anything about it.
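To make the point about sparse textual feature vectors concrete, here is a small sketch of a bag-of-words classifier in Scikit-Learn. The four training sentences and their labels are made up for illustration, and a real project would need far more data plus a held-out test set. It is one reasonable setup, not a recommended recipe.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for this example.
texts = [
    "a great and fun movie",
    "great acting, fun plot",
    "a dull and boring movie",
    "boring plot, dull acting",
]
labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer turns each document into a row of word counts,
# stored as a SciPy sparse matrix (most entries are zero).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# The classifier accepts the sparse matrix directly.
clf = LogisticRegression()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer.
prediction = clf.predict(vectorizer.transform(["fun and great"]))
print(issparse(X), X.shape, prediction[0])
```

With a realistic vocabulary of tens of thousands of words, keeping the matrix sparse is what makes this fit in memory.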
- Lexical resources. These are datasets about word types.