Example projects and datasets
Collections of previously created shared tasks:
Other datasets:
Really big datasets – these can be challenging to process, so we don’t recommend them unless you are excited about the data and prepared to dive in.
- Reddit public comments dataset. See the link for an example of what information is in a comment. It’s very large (~200 GB download), but gets smaller if you pick out a specific month or so.
- Wikipedia dumps … these are tricky to process. We can supply you a plaintext version of English Wikipedia if you want it.
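For the Reddit comment dumps, processing usually means streaming over one JSON object per line rather than loading everything into memory. Here is a minimal sketch in plain Python; the inline sample data is made up for illustration, and for a real dump you would open the (bz2/zstd-compressed) file instead, but field names like `subreddit` and `body` follow the comment schema linked above:

```python
import io
import json

# Made-up stand-in for a dump file; a real dump would be opened with
# e.g. bz2.open("RC_2015-01.bz2") and iterated the same way.
sample = (b'{"subreddit": "nlp", "body": "hello"}\n'
          b'{"subreddit": "aww", "body": "cute"}\n')

kept = []
for line in io.BytesIO(sample):
    comment = json.loads(line)          # one JSON comment per line
    if comment["subreddit"] == "nlp":   # filter down to what you need
        kept.append(comment["body"])

print(kept)
```

Filtering early like this (by subreddit, month, or language) is usually the difference between a tractable project and an intractable one.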
NLP software:
- Stanford CoreNLP - one of the most popular open-source NLP stacks. For English it provides a tokenizer, sentence segmenter, truecasing, POS tagging, NER, constituent and dependency parsing, and coreference. If you use the parser, spend some time figuring out the options to get a faster model.
- spaCy, which is fast and robust (mostly English only).
- ARK TweetNLP has POS tagging and dep. parsing for English Twitter.
- GATE - a broad package. I’m not totally sure what it does, but they do have a Twitter POS/NER system among other things.
- NLTK - a Python library. It does only a little NLP by itself, but it does include wrappers for various external NLP tools.
- KenLM for n-gram language modeling.
- There are a zillion other one-off tools and resources for various languages and types of analysis. For example, see the aclwiki’s Software page listing, or aclwiki’s List of Resources by Language.
- External APIs: if you want to send your data to someone else's server. IBM ("Alchemy/BlueMix/'Watson'"), Microsoft ("Azure text analytics"), and Google ("Cloud natural language/translation") are some of the offerings out there right now.
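To give a flavor of the n-gram language modeling that KenLM implements (far more efficiently, and with much better smoothing than this), here is a toy add-one-smoothed bigram model in plain Python; the three-sentence corpus is made up for illustration:

```python
from collections import Counter

# Tiny made-up corpus, just to show the counting.
corpus = ["the cat sat", "the dog sat", "the cat ran"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

vocab_size = len(unigrams)

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothed P(word | prev); real toolkits like
    # KenLM use modified Kneser-Ney smoothing instead.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_prob("the", "cat"))   # seen bigram: higher probability
print(bigram_prob("the", "fish"))  # unseen bigram: small but nonzero
```

The point of smoothing is visible even at this scale: an unseen bigram still gets nonzero probability, just much less than a seen one.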
Machine learning libraries: This is a big area, so this list focuses on ones that are at least somewhat geared toward typical NLP.
- Scikit-Learn seems to be the emerging standard implementation of many common ML techniques, at least for open-source Python. Its classifiers handle sparse textual feature vectors fine.
- CRFsuite is a nice linear-chain CRF implementation in which you can define your own features. It has Python bindings (or you can just use CRFsuite’s feature-file format). There are a bunch of other CRF implementations out there too.
- Gensim implements many latent-space, unsupervised learning algorithms used in NLP for distributional lexical semantics, including LSA/SVD, topic models, and similarity queries.
- Mallet from UMass - mostly known for its implementation of topic models. The topic modeling can be used from the command line.
- Neural networks: TensorFlow, PyTorch, and DyNet are some of the more popular libraries now; there are others that build on top of them as well. We don't teach neural nets in depth in this course, and it typically takes a lot of work to get up to speed with them; if you want to use them but are inexperienced in this area, be prepared for a large time investment.
- R has an extensive suite of ML packages, but most R packages aren't designed for NLP-style data (discrete and high-dimensional).
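As a small illustration of the Scikit-Learn workflow mentioned above, here is a sketch of a bag-of-words text classifier; the toy texts and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy sentiment data.
texts = ["great movie loved it", "terrible film hated it",
         "loved this great film", "hated it terrible movie"]
labels = ["pos", "neg", "pos", "neg"]

# CountVectorizer produces the sparse textual feature vectors that
# sklearn classifiers handle natively.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["a great film"]))
```

The same pipeline pattern works for any sklearn classifier (Naive Bayes, linear SVM, etc.), which makes it easy to compare models on the same features.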
Lexical resources: These are datasets about word types.