(Adapted from gt-nlp-class)
In this problem set, you will implement a deep transition dependency parser in PyTorch. PyTorch is a popular deep learning framework that provides a variety of components for constructing neural networks. You will see how network architectures more complicated than the simple feed-forward networks you learned about in earlier classes can be used to solve a structured prediction problem.
import gtnlplib.parsing as parsing
import gtnlplib.data_tools as data_tools
import gtnlplib.constants as consts
import gtnlplib.evaluation as evaluation
import gtnlplib.utils as utils
import gtnlplib.feat_extractors as feat_extractors
import gtnlplib.neural_net as neural_net
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd as ag
from collections import defaultdict
# Read in the dataset
dataset = data_tools.Dataset(consts.TRAIN_FILE, consts.DEV_FILE, consts.TEST_FILE)
# Assign each word a unique index, including the two special tokens
word_to_ix = { word: i for i, word in enumerate(dataset.vocab) }
# Some constants to keep around
LSTM_NUM_LAYERS = 1
TEST_EMBEDDING_DIM = 5
WORD_EMBEDDING_DIM = 64
STACK_EMBEDDING_DIM = 100
NUM_FEATURES = 3
# Hyperparameters
ETA_0 = 0.01
DROPOUT = 0.0
def make_dummy_parser_state(sentence):
    dummy_embeds = [ w + "-EMBEDDING" for w in sentence ] + [consts.END_OF_INPUT_TOK + "-EMBEDDING"]
    return parsing.ParserState(sentence + [consts.END_OF_INPUT_TOK], dummy_embeds, utils.DummyCombiner())
Be sure that you have reviewed the notes on transition-based dependency parsing, and are familiar with the relevant terminology. One small difference is that the text describes arc-left and arc-right actions, which create arcs between the top of the stack and the front of the buffer; in contrast, the parser you will implement here uses reduce-left and reduce-right actions, which create arcs between the top two items on the stack.
Parsing will proceed as illustrated below; the step-by-step parsing logic you will implement is spelled out in the TransitionParser deliverable later in this notebook.
The key classes you will fill in code for are:
- the ParserState class in parsing.py, which keeps track of the input buffer and parse stack, and offers a public interface for doing the parsing actions to update the state
- the TransitionParser class in parsing.py, which is a PyTorch module where the core parsing logic resides
along with the feature extractors in feat_extractors.py and the network components in neural_net.py.
The network components are compartmentalized as follows:
- TransitionParser is the base component that contains and coordinates the other substitutable components.
- VanillaWordEmbeddingLookup just gets embeddings from a lookup table, one per word in the sentence.
- BiLSTMWordEmbeddingLookup is more fancy, running a sequence model in both directions over the sentence. The hidden state at step t is the embedding for the t'th word of the sentence.
- MLPCombinerNetwork takes the two input embeddings and gives a dense output.
- LSTMCombinerNetwork does a sequence model, where the output embedding is the hidden state of the next timestep.

The following is how the input buffer and stack look at each step of a parse, up to the first reduction. The input sentence is "the dog ran away". Our action chooser network takes the top two elements of the stack plus a one-token lookahead in the input buffer. $C(x,y)$ refers to calling our combiner network on arguments $x, y$. Also let $A$ be the set of actions: $\{ \text{SHIFT}, \text{REDUCE-L}, \text{REDUCE-R} \}$, and let $q_w$ be the embedding for word $w$.
For each word $w_m$, the parser keeps track of: the embedding $q_{w_m}$, the word itself $w_m$, and the word's position in the sentence $m$. The combination action should store the word and the index for the head word in the relation. The combined embedding may be a function of the embeddings for both the head and modifier words.
Before beginning, I recommend completing the parse by hand, drawing the input buffer and stack at each step, and explicitly listing the arguments to the action chooser.
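Since the step-by-step illustration itself does not survive in this copy, here is a minimal sketch of the first few steps for "the dog ran away", up to the first reduction. "END" stands for END_OF_INPUT_TOK, "-" stands for whatever padding the feature extractor uses when the stack has fewer than two items, and the head/modifier convention is inferred from the gold actions used later in this notebook.

Stack: []                    Buffer: [the, dog, ran, away, END]   features: (-, -, q_the)           action: SHIFT
Stack: [q_the]               Buffer: [dog, ran, away, END]        features: (-, q_the, q_dog)       action: SHIFT
Stack: [q_the, q_dog]        Buffer: [ran, away, END]             features: (q_the, q_dog, q_ran)   action: REDUCE-L
Stack: [C(q_dog, q_the)]     Buffer: [ran, away, END]             first arc: dog is the head, the is the modifier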
In this part of the assignment, you will work with the ParserState class, which keeps track of the parser's input buffer and stack.
Implement the reduction operation of the ParserState in parsing.py, in the function _reduce.
The way reduction is done is slightly different from the notes. In the notes, reduction takes place between the top element of the stack and the first element of the input buffer. Here, reduction takes place between the top two elements of the stack.
At this point there are no real embeddings (the test below passes in None placeholders), but don't forget to make the call to the combiner network component.
Hints:
- StackEntry and DepGraphEdge tuples will be part of your solution, so take a look at how these are used elsewhere in the source.
- Push a new StackEntry onto the stack, and return a DepGraphEdge (a sketch follows below).
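For orientation, here is a hypothetical, self-contained sketch of the bookkeeping in a reduce-left step. The tuple fields and toy names below are invented for illustration; check parsing.py for how StackEntry and DepGraphEdge are actually defined. Note that, under the convention implied by the gold actions later in this notebook, reduce-left treats the top of the stack as the head.

from collections import namedtuple

# Invented stand-ins for the real tuples defined in parsing.py:
ToyStackEntry = namedtuple("ToyStackEntry", ["headword", "headword_pos", "embedding"])
ToyDepGraphEdge = namedtuple("ToyDepGraphEdge", ["head", "modifier"])

toy_stack = [ToyStackEntry("They", 0, "q_They"), ToyStackEntry("can", 1, "q_can")]
head, modifier = toy_stack.pop(), toy_stack.pop()   # reduce-left: top of stack is the head
combined = "C({}, {})".format(head.embedding, modifier.embedding)  # stands in for the combiner call
toy_stack.append(ToyStackEntry(head.headword, head.headword_pos, combined))  # push the combined entry
print ToyDepGraphEdge(head=(head.headword, head.headword_pos),
                      modifier=(modifier.headword, modifier.headword_pos))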
test_sentence = "They can fish".split() + [consts.END_OF_INPUT_TOK]
parser_state = parsing.ParserState(test_sentence, [None] * len(test_sentence), utils.DummyCombiner())
print parser_state
parser_state.shift()
parser_state.shift()
print parser_state
reduction = parser_state.reduce_left()
print "Reduction Made Edge: Head: {}, Modifier: {}".format(reduction[0], reduction[1]), "\n"
print parser_state
In this short (one-line) deliverable, implement done_parsing() in ParserState. Note that we add an END_OF_INPUT_TOK to the end of the sentence (this token could be a helpful feature). Think about what the input buffer and stack look like at the end of a parse.
print parser_state, parser_state.done_parsing(),'\n'
parser_state.shift()
print parser_state, parser_state.done_parsing(),'\n'
parser_state.reduce_right()
print parser_state, parser_state.done_parsing(),'\n'
In this part of the assignment, you will use PyTorch to create a neural network which examines the current state of the parse and makes the decision to either shift, reduce left, or reduce right.
Implement the class VanillaWordEmbeddingLookup in neural_net.py. This involves adding code to the __init__ and forward methods.
- In the __init__ method, make sure that instances of the class can store the embeddings.
- In the forward method, return a list of Torch Variables representing the looked-up embeddings for each word in the sequence.
If you didn't do the tutorial, you will want to read the docs on how to create a lookup table for your word embeddings.
Hint: You will have to turn the input, which is a list of strings (the words in the sentence), into a format that your embedding lookup table can take, which is a torch.LongTensor. So that we can automatically backprop, it is wrapped in a Variable. utils.sequence_to_variable takes care of this for you.
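If the lookup-table machinery is new to you, here is a minimal, generic sketch (not the assignment's class) of how a PyTorch embedding table maps word indices to embedding vectors; the toy names are made up, and the dict mirrors the test below.

toy_word_to_ix = { "William": 0, "Faulkner": 1 }
toy_table = nn.Embedding(len(toy_word_to_ix), TEST_EMBEDDING_DIM)  # the lookup table
toy_indices = ag.Variable(torch.LongTensor([toy_word_to_ix[w] for w in ["William", "Faulkner"]]))
toy_embeds = toy_table(toy_indices)  # Variable of shape (2, TEST_EMBEDDING_DIM)
print toy_embeds[0]  # the row for "William"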
torch.manual_seed(1) # DO NOT CHANGE
reload(neural_net)
test_sentence = "William Faulkner".split()
test_word_to_ix = { "William": 0, "Faulkner": 1 }
word_embedder = neural_net.VanillaWordEmbeddingLookup(test_word_to_ix, TEST_EMBEDDING_DIM)
embeds = word_embedder(test_sentence)
print type(embeds)
print len(embeds), "\n"
print "Embedding for William:\n {}".format(embeds[0])
Fill in the SimpleFeatureExtractor class in feat_extractors.py to give the following 3 features: the embedding of the word one below the top of the stack, the embedding of the word on top of the stack, and the embedding of the next word on the input buffer (the one-token lookahead).
If at this point you have not poked around ParserState to see how it stores the state, now would be a good time.
torch.manual_seed(1)
test_sentence = "The Sound and the Fury".split()
test_word_to_ix = { word: i for i, word in enumerate(set(test_sentence)) }
embedder = neural_net.VanillaWordEmbeddingLookup(test_word_to_ix, TEST_EMBEDDING_DIM)
embeds = embedder(test_sentence)
state = parsing.ParserState(test_sentence, embeds, utils.DummyCombiner())
state.shift()
state.shift()
feat_extractor = feat_extractors.SimpleFeatureExtractor()
feats = feat_extractor.get_features(state)
print "Embedding for 'The':\n {}".format(feats[0])
print "Embedding for 'Sound':\n {}".format(feats[1])
print "Embedding for 'and' (from buffer lookahead):\n {}".format(feats[2])
Implement the class neural_net.ActionChooserNetwork according to the specification in neural_net.py.

You will want to use the utils.concat_and_flatten function. We provide this function because the Tensor reshaping code can get somewhat tedious. It takes the list of embeddings passed in (that come from your feature extractor) and concatenates them into one long row vector.

This network takes as input the features from your feature extractor, concatenates them, runs them through an MLP, and outputs log probabilities over actions.
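For reference, here is a generic sketch of that kind of network, not the assignment's exact specification: the hidden size and the use of ReLU are arbitrary choices here, and only utils.concat_and_flatten is taken from the provided code.

class ToyActionMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim=32, num_actions=3):
        super(ToyActionMLP, self).__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, num_actions)
    def forward(self, feats):
        x = utils.concat_and_flatten(feats)         # one long (1, input_dim) row vector
        hidden = F.relu(self.linear1(x))
        return F.log_softmax(self.linear2(hidden))  # log probabilities over the 3 actions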
torch.manual_seed(1) # DO NOT CHANGE, you can compare my output below to yours
act_chooser = neural_net.ActionChooserNetwork(TEST_EMBEDDING_DIM * NUM_FEATURES)
feats = [ ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM)) for _ in xrange(NUM_FEATURES) ] # make some dummy feature embeddings
log_probs = act_chooser(feats)
print log_probs
Implement the class neural_net.MLPCombinerNetwork according to the specification in neural_net.py. Again, utils.concat_and_flatten will come in handy.
Recall that this component takes two embeddings, the head and the modifier, during a reduction and outputs a combined embedding, which is then pushed back onto the stack during parsing.
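As a generic illustration of that idea (again a sketch, not the assignment's specification; the tanh nonlinearity and single hidden layer are arbitrary choices):

class ToyCombiner(nn.Module):
    def __init__(self, embedding_dim):
        super(ToyCombiner, self).__init__()
        self.linear1 = nn.Linear(2 * embedding_dim, embedding_dim)
        self.linear2 = nn.Linear(embedding_dim, embedding_dim)
    def forward(self, head_embed, modifier_embed):
        x = utils.concat_and_flatten([head_embed, modifier_embed])  # (1, 2 * embedding_dim)
        return self.linear2(F.tanh(self.linear1(x)))                # (1, embedding_dim) combined embedding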
torch.manual_seed(1) # DO NOT CHANGE
combiner = neural_net.MLPCombinerNetwork(TEST_EMBEDDING_DIM)
# Again, make dummy inputs
head_feat = ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM))
modifier_feat = ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM))
combined = combiner(head_feat, modifier_feat)
print combined
Note: There are two unit tests for this deliverable, one worth 1 point, one worth 0.5.
You will implement the forward() function in gtnlplib.parsing.TransitionParser. It is important to understand the difference between the following two tasks: parsing when gold actions are provided (during training), where you simply perform the given actions while recording their log probabilities, and parsing when no gold actions are provided (during prediction), where you must choose each action from the network's log probabilities.
At this point, it is necessary to have all of the components in place for constructing the parser.
The parsing logic is roughly as follows (see the sketch after this list):
- At each step, if you are given gold_actions, do that action. Otherwise (when predicting), take the argmax of your log probabilities and do that.
- You cannot SHIFT when the next token in the input buffer is END_OF_INPUT_TOK (this token should NOT be shifted onto the stack), and you cannot reduce when the stack contains fewer than 2 elements. If your network chooses SHIFT when it is not legal, just do REDUCE_R.
- Make sure to keep track of the things that the function wants to keep track of: update the parser_state as you go, append the log-probability Variable from your action_chooser to the outputs list, and record each action taken in actions_done.
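Here is a sketch of the action-selection step only, with hypothetical helper names; the real bookkeeping belongs inside TransitionParser.forward, and the action ordering assumed for the log probabilities below is an illustration, not checked against constants.py.

def choose_action(log_probs, gold_actions, step, shift_is_legal):
    # Hypothetical helper for illustration only.
    if gold_actions is not None:
        return gold_actions[step]                  # training: just follow the gold action
    ix = log_probs.view(-1).max(0)[1].data[0]      # predicting: argmax of the log probabilities
    act = ["SHIFT", "REDUCE_L", "REDUCE_R"][ix]    # assumed ordering of the actions
    if act == "SHIFT" and not shift_is_legal:
        act = "REDUCE_R"                           # an illegal SHIFT falls back to REDUCE_R
    return act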
test_sentence = "The man ran away".split()
test_word_to_ix = { word: i for i, word in enumerate(set(test_sentence)) }
test_word_to_ix[consts.END_OF_INPUT_TOK] = len(test_word_to_ix)
test_sentence_vocab = set(test_sentence)
gold_actions = ["SHIFT", "SHIFT", "REDUCE_L", "SHIFT", "REDUCE_L", "SHIFT", "REDUCE_R"]
feat_extractor = feat_extractors.SimpleFeatureExtractor()
word_embedding_lookup = neural_net.VanillaWordEmbeddingLookup(test_word_to_ix, STACK_EMBEDDING_DIM)
action_chooser = neural_net.ActionChooserNetwork(STACK_EMBEDDING_DIM * NUM_FEATURES)
combiner_network = neural_net.MLPCombinerNetwork(STACK_EMBEDDING_DIM)
parser = parsing.TransitionParser(feat_extractor, word_embedding_lookup,
                                  action_chooser, combiner_network)
output, depgraph, actions_done = parser(test_sentence, gold_actions)
print depgraph
print actions_done
Training your parser may take some time. On the test below, I get about 5 seconds per loop (i7 6700k).
torch.manual_seed(1)
feat_extractor = feat_extractors.SimpleFeatureExtractor()
word_embedding_lookup = neural_net.VanillaWordEmbeddingLookup(word_to_ix, STACK_EMBEDDING_DIM)
action_chooser = neural_net.ActionChooserNetwork(STACK_EMBEDDING_DIM * NUM_FEATURES)
combiner_network = neural_net.MLPCombinerNetwork(STACK_EMBEDDING_DIM)
parser = parsing.TransitionParser(feat_extractor, word_embedding_lookup,
                                  action_chooser, combiner_network)
optimizer = optim.SGD(parser.parameters(), lr=ETA_0)
%%timeit
parsing.train(dataset.training_data[:100], parser, optimizer)
# if this call doesn't work, something is wrong with your parser's behavior when gold labels aren't provided
parser.predict(dataset.dev_data[0].sentence)
# train the thing for a while here.
# Shouldn't take too long, even on a laptop
for epoch in xrange(1):
    print "Epoch {}".format(epoch+1)
    parsing.train(dataset.training_data[:1000], parser, optimizer, verbose=True)
    print "Dev Evaluation"
    parsing.evaluate(dataset.dev_data, parser, verbose=True)
    print "F-Score: {}".format(evaluation.compute_metric(parser, dataset.dev_data, evaluation.fscore))
    print "Attachment Score: {}".format(evaluation.compute_attachment(parser, dataset.dev_data))
    print "\n"
Run the code below to output your predictions on the test data and dev data. You can run the dev test to verify you are correct up to this point. The test data evaluation is for us.
dev_sentences = [ sentence for sentence, _ in dataset.dev_data ]
evaluation.output_preds(consts.D3_2_DEV_FILENAME, parser, dev_sentences)
evaluation.output_preds(consts.D3_2_TEST_FILENAME, parser, dataset.test_data)
Implement the class BiLSTMWordEmbeddingLookup in neural_net.py. This class can replace your VanillaWordEmbeddingLookup. This class implements a sequence model over the sentence, where the t'th word's embedding is the hidden state at timestep t. This means that, rather than have our embeddings on the stack only include the semantics of a single word, our embeddings will contain information from all parts of the sentence (the LSTM will, in principle, learn what information is relevant).
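The following generic sketch (not the assignment's class) shows the standard PyTorch pattern: each direction is given half of STACK_EMBEDDING_DIM so that the concatenated per-token output has the full stack embedding size; the toy input here is just random noise standing in for word embeddings.

toy_bilstm = nn.LSTM(WORD_EMBEDDING_DIM, STACK_EMBEDDING_DIM // 2,
                     num_layers=LSTM_NUM_LAYERS, bidirectional=True)
toy_inputs = ag.Variable(torch.randn(4, 1, WORD_EMBEDDING_DIM))  # (seq_len=4, batch=1, WORD_EMBEDDING_DIM)
toy_outputs, _ = toy_bilstm(toy_inputs)
print toy_outputs.size()  # (4, 1, STACK_EMBEDDING_DIM): one embedding per word in the sentence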
torch.manual_seed(1) # DO NOT CHANGE
test_sentence = "Michael Collins".split()
test_word_to_ix = { "Michael": 0, "Collins": 1 }
lstm_word_embedder = neural_net.BiLSTMWordEmbeddingLookup(test_word_to_ix,
                                                          WORD_EMBEDDING_DIM,
                                                          STACK_EMBEDDING_DIM,
                                                          num_layers=LSTM_NUM_LAYERS,
                                                          dropout=DROPOUT)
lstm_embeds = lstm_word_embedder(test_sentence)
print type(lstm_embeds)
print len(lstm_embeds), "\n"
print "Embedding for Michael:\n {}".format(lstm_embeds[0])
Fill in the function initialize_with_pretrained in utils.py. It will take a word embedding lookup component and initialize its lookup table with pretrained embeddings. Note that you can create a Torch tensor from a list of floats using torch.Tensor(). Googling for more information about how Torch stores parameters is allowed; I don't think you'll find the exact answer online (corollary: do not post the answer online).
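As a generic illustration of the kind of operation involved (not the assignment's function; the toy table and vector below are made up):

toy_table = nn.Embedding(2, 5)
toy_pretrained_vec = [0.1, 0.2, 0.3, 0.4, 0.5]
toy_table.weight.data[0] = torch.Tensor(toy_pretrained_vec)  # overwrite row 0 with the pretrained values
print toy_table.weight.data[0]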
import cPickle
pretrained_embeds = cPickle.load(open(consts.PRETRAINED_EMBEDS_FILE))
print pretrained_embeds['four'][:5]
embedder = neural_net.VanillaWordEmbeddingLookup(word_to_ix,64)
embedder.forward(['four'])[0][0,:5]
reload(utils);
utils.initialize_with_pretrained(pretrained_embeds,embedder)
print embedder.forward(['four'])[0][0,:5]
Before, in order to combine two embeddings during a reduction, we just passed them through an MLP and got a dense output. Now, we will instead use a sequence model of the stack: the combined embedding from a reduction is the hidden state at the next timestep of an LSTM. Implement LSTMCombinerNetwork in neural_net.py.
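A generic sketch of the idea, with made-up toy names (the assignment's class manages its own hidden state; here the concatenated head/modifier pair is fed as a length-1 sequence and the hidden state is carried across calls by hand):

toy_lstm = nn.LSTM(2 * TEST_EMBEDDING_DIM, TEST_EMBEDDING_DIM, num_layers=1)
toy_pair = ag.Variable(torch.randn(1, 1, 2 * TEST_EMBEDDING_DIM))  # (seq_len=1, batch=1, 2 * dim)
toy_out, toy_hidden = toy_lstm(toy_pair)               # first reduction
toy_out, toy_hidden = toy_lstm(toy_pair, toy_hidden)   # a later reduction continues the same sequence
print toy_out.view(1, -1)  # the combined embedding for the latest reduction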
reload(neural_net);
TEST_EMBEDDING_DIM = 5
combiner = neural_net.LSTMCombinerNetwork(TEST_EMBEDDING_DIM, 1, 0.0)
head_feat = ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM))
modifier_feat = ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM))
utils.concat_and_flatten([head_feat,modifier_feat]).view(1,1,-1)
# note that the output keeps changing, because of the recurrent update
for _ in xrange(3):
    print combiner(head_feat, modifier_feat)
The code below retrains your parser using all the new components that you just wrote.
feat_extractor = feat_extractors.SimpleFeatureExtractor()
# BiLSTM over word embeddings
word_embedding_lookup = neural_net.BiLSTMWordEmbeddingLookup(word_to_ix,
                                                             WORD_EMBEDDING_DIM,
                                                             STACK_EMBEDDING_DIM,
                                                             num_layers=LSTM_NUM_LAYERS,
                                                             dropout=DROPOUT)
# pretrained inputs
utils.initialize_with_pretrained(pretrained_embeds, word_embedding_lookup)
action_chooser = neural_net.ActionChooserNetwork(STACK_EMBEDDING_DIM * NUM_FEATURES)
# LSTM reduction operations
combiner = neural_net.LSTMCombinerNetwork(STACK_EMBEDDING_DIM,
                                          num_layers=LSTM_NUM_LAYERS,
                                          dropout=DROPOUT)
parser = parsing.TransitionParser(feat_extractor, word_embedding_lookup,
                                  action_chooser, combiner)
optimizer = optim.SGD(parser.parameters(), lr=ETA_0)
%%timeit
# The LSTMs will make this take longer
parsing.train(dataset.training_data[:100], parser, optimizer)
for epoch in xrange(1):
    print "Epoch {}".format(epoch+1)
    parser.train() # turn on dropout layers if they are there
    parsing.train(dataset.training_data[:1000], parser, optimizer, verbose=True)
    print "Dev Evaluation"
    parser.eval() # turn them off for evaluation
    parsing.evaluate(dataset.dev_data, parser, verbose=True)
    print "F-Score: {}".format(evaluation.compute_metric(parser, dataset.dev_data, evaluation.fscore))
    print "Attachment Score: {}".format(evaluation.compute_attachment(parser, dataset.dev_data))
    print "\n"
Run the code below to generate your dev and test predictions.
dev_sentences = [ sentence for sentence, _ in dataset.dev_data ]
evaluation.output_preds(consts.D4_4_DEV_FILENAME, parser, dev_sentences)
evaluation.output_preds(consts.D4_4_TEST_FILENAME, parser, dataset.test_data)
You can use CUDA to train your network, and you should expect significant speedup if you have a decent GPU and the CUDA toolkit installed. If you want to use CUDA in this assignment, change the HAVE_CUDA variable to True in constants.py, and uncomment the .to_cuda() and .to_cpu() lines below.
We are not officially supporting CUDA though. If you have problems installing or running CUDA, please just use the CPU, we cannot help you debug it.
# training example for fun
NUM_EPOCHS = 1  # not defined elsewhere in this notebook; set as desired
for epoch in xrange(NUM_EPOCHS):
    print "Epoch {}".format(epoch+1)
    #parser.to_cuda() # uncomment to train on your GPU
    parser.train() # turn on dropout layers if they are there
    # train on full training data
    parsing.train(dataset.training_data, parser, optimizer, verbose=True)
    print "Dev Evaluation"
    #parser.to_cpu() #TODO fix evaluation so you don't have to ship everything back to the CPU
    parser.eval() # turn them off for evaluation
    parsing.evaluate(dataset.dev_data, parser, verbose=True)
    print "F-Score: {}".format(evaluation.compute_metric(parser, dataset.dev_data, evaluation.fscore))
    print "Attachment Score: {}".format(evaluation.compute_attachment(parser, dataset.dev_data))
    print "\n"