[Intro to NLP, CMPSCI 585, Fall 2014]

NER tagging for Twitter

The final project will be to construct an NER tagger for Twitter. The task of named entity recognition is to take tokenized sentences as input, then recognize spans of text that correspond to a name. We will not require entity types to be predicted.

Here are two examples, showing the name spans in brackets [like this], followed by the BIO representation.

Football game tonight !!!!! [V-I-K-I-N-G-S] !!!!!!!

Football O
game O
tonight O
!!!!! O
V-I-K-I-N-G-S B
!!!!!!! O

@paulwalk It 's the view from where I 'm living for two weeks . [Empire State Building] = [ESB] . Pretty bad storm here last evening .

@paulwalk O
It O
's O
the O
view O
from O
where O
I O
'm O
living O
for O
two O
weeks O
. O
Empire B
State I
Building I
= O
ESB B
. O
Pretty O
bad O
storm O
here O
last O
evening O
. O
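
The relationship between the BIO tags and the name spans can be sketched in code. This is only an illustration: the function names bio_to_spans and spans_to_bio are hypothetical and not part of any starter code, spans are represented as (start, end) token offsets with an exclusive end, and an I tag with no preceding B is treated as opening a new span, which is an assumption rather than a rule of the task.

```python
def bio_to_spans(tags):
    """Convert a list of B/I/O tags into (start, end) spans, end exclusive."""
    spans = []
    start = None  # start index of the span currently open, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A B always opens a new span, closing any span still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            # Assumption: an I with no open span starts one anyway.
            if start is None:
                start = i
        else:
            # An O closes any open span.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def spans_to_bio(spans, length):
    """Convert (start, end) spans back into a list of B/I/O tags."""
    tags = ["O"] * length
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


tokens = [".", "Empire", "State", "Building", "="]
tags = ["O", "B", "I", "I", "O"]
spans = bio_to_spans(tags)
names = [" ".join(tokens[s:e]) for s, e in spans]
```

Here spans holds one span covering "Empire State Building", and spans_to_bio(spans, len(tags)) recovers the original tag list.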

This task is feasible but plenty hard: in current research, systems typically get F-scores in the 60% range. The task is described in Ritter et al. (2011), and we use their data for the training and development sets.
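
For reference, here is a minimal sketch of how a span-level F-score can be computed. It assumes exact-match scoring, where a predicted span counts as correct only if both of its boundaries match a gold span; whether the course's evaluation script scores in exactly this way is not confirmed here. The function name span_f1 and the (sentence index, start, end) span representation are hypothetical choices for illustration.

```python
def span_f1(gold_spans, pred_spans):
    """Return (precision, recall, f1) under exact-match span scoring.

    Each span is a tuple such as (sentence_index, start, end).
    """
    gold = set(gold_spans)
    pred = set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 1, 4), (0, 5, 6)]
pred = [(0, 1, 4), (1, 0, 2)]
p, r, f = span_f1(gold, pred)
```

In this toy case one of the two predicted spans is correct and one of the two gold spans is found, so precision, recall and F1 all come out to 0.5.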

Groups of 1 or 2, and collaboration policy

You can either do this project individually, or in a group of two. Please decide this by the milestone report date.

If you decide to do this as a group, when you submit the milestone and final reports/code, please submit a single copy of the materials, and make sure the names of both team members are on them. You both will receive the same grade.

This project is more open-ended than the problem sets. After building the base system, you are free to explore different types of features and analyses. Feel free to discuss the project with anyone else in the class. Discussions can help spur creativity for features or experiments to try. However, all code should be written individually or within your two-person group. The project reports should also be written only within the groups.

Data and Evaluation

This format is sometimes called a shared task or bakeoff competition (an obscure cultural reference, maybe).

Also, all students in the course will annotate a small amount of data as part of a problem set or exercise. These annotations will comprise part of the final test set. (Exercise 9)

Unlabeled tweets

We are also providing an external resource, a large corpus of unlabeled tweets, for possible use in features or additional data analysis. It consists of about 1 million English-language tweets that have been tokenized. We'll post the URL on Piazza (since we are not supposed to redistribute the data freely on the internet).

The 1-million-tweet version has 986,784 tweets sent over Jan-Sep 2014. If this is too much data for what you want to do, just take a subset of it. (The file is sorted in a random order.) The versions we have available include:

If you want more data than this let us know and we can give you as much as you want.


You can use whatever NLP/ML software or resources you like. We will provide a small bit of starter code to use CRFSuite, a software package that does first-order CRF sequence tagging. It requires you to run your own script to extract observation features as a text file. Then you tell it to train and predict with these feature files.
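
As a rough illustration of the feature extraction step, here is a minimal sketch that writes observations in the tab-separated layout CRFSuite reads: one token per line, the label first, then the attributes, with a blank line between sentences. The specific features and the function names (escape, token_features, format_sentence) are hypothetical and are not the provided starter code. Colons and backslashes are escaped because CRFSuite treats a colon inside an attribute as a name/value separator, which matters for tokens like URLs.

```python
def escape(attr):
    """Escape backslashes and colons so CRFSuite reads attr as one name."""
    return attr.replace("\\", "\\\\").replace(":", "\\:")


def token_features(tokens, i):
    """Return a list of attribute strings for the token at position i."""
    word = tokens[i]
    feats = ["word=" + word.lower()]
    if word[0].isupper():
        feats.append("is_capitalized")
    if i == 0:
        feats.append("BOS")  # beginning-of-sentence marker
    else:
        feats.append("prev_word=" + tokens[i - 1].lower())
    return feats


def format_sentence(tokens, tags):
    """Format one sentence as CRFSuite lines, ending with a blank line."""
    lines = []
    for i, tag in enumerate(tags):
        attrs = [escape(a) for a in token_features(tokens, i)]
        lines.append("\t".join([tag] + attrs))
    return "\n".join(lines) + "\n\n"


example = format_sentence(["Empire", "State"], ["B", "I"])
```

Writing the result of format_sentence for every training sentence into one file gives something that can be passed to the training command; the same function applied to the development sentences gives the file to tag.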

We provide starter code for

More details on the milestone page.


There are several points to turn things in. Update 11/30: see posts on Piazza for the current timeline; the list below reflects the updated dates.

  1. (Due Sunday, Nov 16) Milestone: create a very bare-bones tagger for the NER task using the training and development sets, submit predictions to the Kaggle board, and submit a short document about it.
  2. (Released Monday, 12/1; due 12/2 at noon) Test set evaluation for the competition: on Monday morning, we will provide the final test set, and it will be blind: no labels given! You have to run your system to create predictions and submit them. We will set up Kaggle to allow a small number of submissions if you want to try improving your model in the meantime. Only submit predictions that your system produced.
  3. (Due 12/12) The code submission and final project reports, turned in via Moodle. We will accept the final reports (and accompanying code) until 12/12, but please submit at least a basic draft on Friday 12/5 along with your code.


The project is, in total, 20% of your grade. The milestone is 5%; it is designed to set you up for success on the final result. The rest is derived from both:

  1. System building.
  2. Analysis and exploration in your project report, such as analyzing the model weights, or doing ablation tests.
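
One way to set up an ablation test is sketched below. It assumes your feature files use a tab-separated layout with the label first and one attribute per remaining field, and that related attributes share a name prefix such as "word="; both are assumptions about your own feature extractor. The function name ablate is hypothetical.

```python
def ablate(feature_lines, prefix):
    """Return copies of the feature lines with one feature group removed.

    Attributes whose names start with prefix are dropped; labels and
    blank sentence-separator lines are kept unchanged.
    """
    out = []
    for line in feature_lines:
        line = line.rstrip("\n")
        if not line:
            out.append("")  # keep the sentence boundary
            continue
        fields = line.split("\t")
        label, attrs = fields[0], fields[1:]
        kept = [a for a in attrs if not a.startswith(prefix)]
        out.append("\t".join([label] + kept))
    return out


full = ["B\tword=empire\tis_capitalized", "", "O\tword=the\tis_lowercase"]
without_words = ablate(full, "word=")
```

Retraining on each ablated file and comparing the development-set score against the full feature set shows how much each feature group contributes.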

Finally, there will be extra credit for the top performers on the final test set. Extra credit will also be available for doing additional extensions, or doing some sort of additional analysis project.

Random tips

When feature engineering, it will be useful to write a shell script that runs the entire pipeline of feature extraction, training, testing and evaluation. Here's one example, which you will need to customize for different feature extraction scripts or whatever you're using:

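# Stop on errors and unset variables; echo each command as it runs.
set -eux
# Extract features, then train on the training set.
python simple_fe.py
crfsuite learn -m mymodel train.feats > train.log
# Tag the development set and evaluate the predictions.
crfsuite tag -m mymodel dev.feats > predtags
python tageval.py dev.txt predtags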

When the test set comes out, you will want to train your system not just on train.txt, but on the concatenation of both train.txt and dev.txt, since that's more data to work with. Do not train on the test set, of course, since that's a form of cheating (and in order to prevent honest mistakes with this, we will distribute the test set with "?" labels for all the tags).
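
The concatenation can be sketched as follows. This assumes the data files hold one token and its tag per line, separated by whitespace, with a blank line between sentences; the function names and the output filename traindev.txt are hypothetical. The point of joining with a blank line is that the last sentence of one file should not run into the first sentence of the next, and the second helper is a guard against accidentally including unlabeled test data.

```python
def concat_datasets(texts):
    """Join dataset texts, keeping a blank line between files."""
    parts = [t.strip("\n") for t in texts if t.strip()]
    return "\n\n".join(parts) + "\n"


def has_unlabeled(text):
    """Return True if any line carries the placeholder tag "?"."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "?":
            return True
    return False


combined = concat_datasets(["a O\nb B\n", "c O\n\n"])

# Possible usage with the real files (not run here):
#   texts = [open(name).read() for name in ("train.txt", "dev.txt")]
#   merged = concat_datasets(texts)
#   assert not has_unlabeled(merged)
#   open("traindev.txt", "w").write(merged)
```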

Details on milestone

These are in milestone.html

Details on final project requirements

If you implement the minimum possible basic feature extractor with a basic analysis, that will earn you a B+. (A group of two additionally requires one major extension.) Both feature extensions and more analysis will earn you a higher grade. Especially high performance on the test set will earn extra credit (we are currently planning on awarding extra credit to the top two teams). All implementation code should be submitted along with the final report.

We expect the final project report to be at least 8 pages, but shorter than 20. (Once you start writing it, you'll find it's much easier to write more than you might have originally thought.) This is intended to include tables and graphs (which take up a lot of space). These page limits are not hard and fast, but are intended to give you an idea of how much analysis and detail we're expecting.

System building

We require a feature extractor that produces, at the very least, the following types of features. These are fairly typical features in NLP systems. There are many approaches to the following features; please report the approaches you used and whether you found them helpful, in your final report.

As for major extensions, many are possible. Some examples of what we will consider to be a major extension include:

Another idea (though we don't yet know how to grade this): you can annotate new data yourself to use for training. More annotated data always helps, and while this is sometimes a controversial point, some researchers believe it's more important to have more data than fancier machine learning or linguistic algorithms. It's useful to graph the learning curve before deciding to do this; see below.
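
To graph a learning curve you need training subsets of increasing size. Here is a minimal sketch, assuming the training data is already loaded as a list of sentences; the function name learning_curve_subsets, the fractions, and the fixed seed are hypothetical choices. The subsets are nested, so each larger one contains all of the smaller ones, which keeps the curve from jumping around due to different random samples.

```python
import random


def learning_curve_subsets(sentences, fractions=(0.1, 0.25, 0.5, 1.0), seed=0):
    """Return nested training subsets, one per fraction of the data."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    subsets = []
    for frac in fractions:
        n = max(1, int(round(frac * len(shuffled))))
        subsets.append(shuffled[:n])  # prefixes of one shuffle are nested
    return subsets


sentences = list(range(20))  # stand-ins for real training sentences
subsets = learning_curve_subsets(sentences)
sizes = [len(s) for s in subsets]
```

Train a model on each subset, score each on the development set, and plot the score against the subset size. If the curve is still rising steeply at the full training set, more annotation is likely to pay off.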

Analysis and exploration

Report your performance on the test set for the competition, plus any other results you made before or after the competition's result submission. (Typically, you may find interesting things to do after the competition is over.)

Please explain your features and system and the choices you made for it. Explain your reasoning and any experimental results you have. Things like "we tried X but it didn't work and here is why" are fantastic and will give you full credit for attempting X.

Your analysis/exploration section is expected to have at least one additional analysis component. Some examples of analysis components include:

Final project grading

Here are our guidelines for how grading will be done. Two-person teams require one major extension to make the B+. Therefore "extra extension" for a solo team would mean one extension, but for a duo team would mean two total.

The projects will also be judged by the quality of writing, the quality of results, the insightfulness of the analysis, and the amount of thought and work that went into the project.

Project report organization

Here's one possible organization of the final report. If you're not sure how to organize it, we strongly suggest using this outline.

This starts with the "WHAT" and "HOW" of your system. Then it moves into "WHY": explanations for what works or doesn't work.

In terms of organization, the "Results" and "Additional Experiments / Analysis" sections may blend into each other a little bit, depending on what you're doing. That's fine. The point is to present things in a way that is clear to the reader.

Remember that your analysis/exploration is expected to have at least one additional analysis component.

The page lengths are rough guidelines.

Title and Author Names

Abstract (0.5 pages)

Description of Implementation (1 page)

Major extension(s)

Results (2-3 pages)

Analysis / Exploration (4-5 pages)

Discussion and Future Work (1 page)