Projects - CS 690N, Spring 2017, UMass Amherst

(See the project page.)

NLP and ML tools

NLP software
- Stanford CoreNLP - a popular open-source NLP stack. For English it includes a tokenizer, sentence segmenter, truecaser, POS tagger, NER, constituency and dependency parsers, and coreference resolution. If you use the parser, spend some time figuring out the options to get a faster model. (Possibly helpful, but not necessary: my Python wrapper)
- spaCy - fast and robust (mostly English only).
- ARK TweetNLP - POS tagging and dependency parsing for English Twitter.
- GATE - a broad package. I’m not totally sure what it does, but they do have a Twitter POS/NER system among other things.
- NLTK - a Python library. It does only a little NLP by itself, but it does include wrappers for some external NLP tools.
- KenLM for n-gram language modeling.
- There are a zillion other one-off tools and resources for various languages and types of analysis. For example, see the aclwiki’s Software page listing, or aclwiki’s List of Resources by Language.
- External APIs, if you are willing to send your data to someone else's server. IBM Alchemy/BlueMix/"Watson" (it seems to go by several names) is one.
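To make concrete what an n-gram language model computes, here is a toy bigram model with add-one smoothing in plain Python. It is only an illustrative sketch: it does not use KenLM or reflect its API, add-one smoothing is much cruder than what real toolkits use, and the function names and the two training sentences are made up for this example. Words never seen in training are not handled properly here.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])                  # every token that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))  # (previous word, word) pairs
    # Words the model can predict: all training words plus the end marker.
    vocab = {w for tokens in sentences for w in tokens} | {"</s>"}
    return contexts, bigrams, vocab

def bigram_prob(prev, word, model):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def sentence_log_prob(tokens, model):
    """Sum of log bigram probabilities, including the sentence boundaries."""
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(math.log(bigram_prob(p, w, model))
               for p, w in zip(padded[:-1], padded[1:]))

# Toy training data, invented for illustration.
model = train_bigram_lm([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(sentence_log_prob(["the", "cat", "sat"], model))
print(sentence_log_prob(["sat", "cat", "the"], model))
```

The word order seen in training should score higher than the scrambled one. For real projects, use a proper toolkit rather than something like this.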
Machine learning libraries. This is a big area, so this list tries to focus on ones that are at least somewhat geared for typical NLP.
- Neural network frameworks: TensorFlow, DyNet, and many others are available.
- scikit-learn seems to be the emerging standard implementation of many standard ML techniques, at least for open-source Python. Its classifiers can handle sparse textual feature vectors fine.
- CRFsuite is a nice linear-chain CRF where you can implement your own features. Python bindings are available (or just use CRFsuite’s feature-file format). There are a bunch of other CRF implementations out there too.
- Gensim implements many latent-space, unsupervised learning algorithms used in NLP for distributional lexical semantics, including LSA/SVD, topic models, and similarity queries.
- Mallet from UMass - mostly known for its implementations of (1) CRFs and (2) topic models. The topic modeling can be used from the command line. I (brendan) don’t know much about its CRF implementation.
- By the way, R has a much more extensive suite of ML algorithms than Python does. But most R packages aren’t designed for NLP-style data.
- Some people like Weka. I (brendan) don’t know anything about it.
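In case "sparse textual feature vectors" is unfamiliar, here is a minimal plain-Python sketch of the idea: each document is stored as only its nonzero (column, count) pairs instead of a full vocabulary-length vector. The helper names and example documents are made up for illustration. In practice you would use a library's own vectorizer (scikit-learn has text vectorizers that produce SciPy sparse matrices) rather than this.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token a column index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_sparse_row(doc, vocab):
    """Return sorted (column, count) pairs; words outside the vocabulary are dropped."""
    counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())

# Example documents, invented for illustration.
docs = ["the parser is fast", "the tagger is robust and fast"]
vocab = build_vocab(docs)
rows = [to_sparse_row(d, vocab) for d in docs]
for row in rows:
    print(row)
```

With a realistic vocabulary of tens of thousands of word types, each row still holds only as many entries as the document has distinct words, which is why sparse representations matter for text.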
Lexical resources. These are datasets about word types.
- WordNet, or possibly the NLTK interface to WordNet.
- Word embeddings: GloVe and word2vec both supply publicly available, pre-trained word embeddings which are relatively popular; some Python libraries have interfaces to them.
- Hierarchical word clusters: ARK Twitter clusters, any others?
- Sentiment lexicons: OpinionFinder subjectivity lexicon, LIWC lexicon (LIWC2007_English080730.dic), NRC Twitter sentiment lexicon, VADER, and plenty of others.
- CMU pronunciation dictionary (for English)
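As a rough illustration of working with pre-trained word embeddings, the sketch below parses lines in the plain-text layout that the GloVe downloads use (a word followed by space-separated floats) and compares words by cosine similarity. The three-dimensional vectors are invented toy values, not real embeddings, and the function names are hypothetical. With a real file you would pass the open file object in place of `toy_lines`.

```python
import math

def parse_embeddings(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy vectors, invented for illustration only.
toy_lines = [
    "cat 0.9 0.1 0.0",
    "dog 0.8 0.2 0.1",
    "tax 0.0 0.1 0.9",
]
emb = parse_embeddings(toy_lines)
print(cosine(emb["cat"], emb["dog"]))
print(cosine(emb["cat"], emb["tax"]))
```

Real embedding files are large, so for anything beyond a quick experiment it is probably worth loading them into an array library or using one of the existing Python interfaces.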