HW3: Alignments with IBM Model 1

You will calculate alignment posteriors and analyze them. We supply some code that visualizes the models' posteriors, but you will construct some new ones too. This kind of model and algorithm diagnosis is an invaluable skill when implementing and inventing NLP and ML algorithms.

Getting started

From your command-line terminal, go into the directory containing hw3.ipynb and type:

/path/to/ipython notebook

For example, on Mac and Linux, Anaconda installs into your home directory at ~/anaconda, so the command is:

~/anaconda/bin/ipython notebook

Please post on Piazza if you need help getting started. Brendan and Ari can't help much with Windows, sorry, but many of your fellow students can.

When you do this, a web browser should open with the IPython notebook GUI. See online tutorials for details on how to use it. You will have to execute code cells and do a small amount of editing.

Edit only in ANSWERME cells

Within the notebook, please add your answers where it says ANSWERME. You should not need to change anything elsewhere, but if you do, please make it clear where and how you've done so. (You will also modify ibm.py.)

Code logistics: External files

This notebook also depends on data files (most importantly ttable.json) and one code file: ibm.py. You'll see that cells whose code depends on ibm.py have import and reload statements in them to make this work.

Please submit two files: (1) your final hw3.ipynb file, and (2) your edited ibm.py file. Note that if you make visualizations and the like in the IPython notebook and then save the notebook, they are stored within the .ipynb file.

Please do not submit the data files; you will not be modifying them. (We also included parallel data there, but none of your code will use it. This is explained further below.)

TOC: 50 points total

  1. Calc Posterior Alignments (20 points)
  2. Viz basics (2 points. Does not depend on 1)
  3. Viz basics (2 points. Does not depend on 1)
  4. Trans table (10 points. Does not depend on 1)
  5. Comment on alignment (8 points. Depends on 1)
  6. Comment on alignment (8 points. Depends on 1)

Alignments in IBM Model 1

ENGLISH becomes FOREIGN: every FOREIGN word came from exactly one ENGLISH word.
Here is an example with Python-style (0-based) indexing.

e = ["<null>", "the", "book", "is", "red"]
f = ["rotes", "Buch"]
a = [4, 2]

e[0]=<null>   e[1]=the  e[2]=book  e[3]=is  e[4]=red
                         /                       |
           --------------------------------------
           |           / 
    f[0]=rotes  f[1]=Buch

    a[0]=4      a[1]=2
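
To make the indexing concrete, here is a minimal Python sketch (just using the example above) that reads off each Foreign word's source English word from the alignment vector:

e = ["<null>", "the", "book", "is", "red"]
f = ["rotes", "Buch"]
a = [4, 2]
# a[i] indexes into e: the English word that f[i] came from.
print [e[a[i]] for i in range(len(f))]   # prints ['red', 'book']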


Your job is to implement one part of IBM Model 1: calculating the posterior alignments. The function that calculates this will be given

  • (1) an "English" sentence $\vec{e}$,
  • (2) a "Foreign" sentence $\vec{f}$, and
  • (3) the $t(f|e)$ translation probabilities, which give the probability of a single English-word-to-Foreign-word transformation: $p(f_i | \vec{e}, a_i) = p(f_i | e_{a_i})$.

Recall that IBM Model 1 assumes there is a latent alignment variable $a_i$ for every position in the Foreign sentence, which can point to any position in the English sentence; you have to calculate the posterior distribution over the positions it could point to. That is, for every position $i$ in the Foreign sentence, you need to calculate the posterior distribution $P(a_i=j | \vec{f},\vec{e})$: a vector with one probability for each possible position $j$ in the English sentence. This vector describes the model's posterior belief about which English token $f_i$ "came from".

We're using the channel-model direction, modeling $\vec{f}$ given $\vec{e}$. (This is the same as the J&M reading and our Model 1 notes; recall that the lecture slides are inconsistent in the f/e directionality.)

More precisely: for each position $i$ of the output Foreign sentence, the posterior calculation problem is to compute this vector:

$$[P(a_i=j|\vec{f},\vec{e}) \text{ for each $j=1..|\vec{e}|$}]$$

The length of this vector is $|\vec{e}|$ and it sums to one. There is one such vector for each token in the output Foreign sentence.
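
As a reminder of the standard Model 1 derivation: since the model puts a uniform prior on each $a_i$ and the likelihood factorizes across Foreign positions, Bayes' rule reduces each of these vectors to a per-position renormalization of the translation probabilities:

$$P(a_i=j \mid \vec{f},\vec{e}) = \frac{t(f_i \mid e_j)}{\sum_{j'=1}^{|\vec{e}|} t(f_i \mid e_{j'})}$$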

Part A: calc_alignment_posterior

Question 1: Implement the function calc_alignment_posterior in ibm.py.

Sanity check: execute the following code for the Foreign sentence ['a','b','c'] and English sentence ['<null>','C','B'], given a toy translation model. You should be able to run this cell even with the unedited starter code; it will just return dummy values of -999 for everything.

In [ ]:
# DO NOT EDIT THIS CELL
from __future__ import division
import ibm; reload(ibm)
# Toy model for testing. "English" is upper-case letters, "Foreign" is lower-case letters.
# This is t(a|A), t(a|B), etc.
# Note that t(a|A)+t(b|A)+t(c|A) = 1, but t(a|A)+t(a|B)+t(a|C)+t(a|<null>) does not.
toymodel = {
    '<null>': {'a':1.0/3, 'b':1.0/3, 'c':1.0/3},
    'A': {'a':0.98, 'b':0.01, 'c':0.01},
    'B': {'a':0.1,  'b':0.8,  'c':0.1},
    'C': {'a':0.1,  'b':0.1,  'c':0.8},
}
ibm.calc_alignment_posterior(['a','b','c'], ['<null>','C','B'], toymodel)

When you've implemented the function correctly, it should return something like the following (we've truncated some numbers for readability):

[[0.625, 0.1875, 0.1875],
 [0.27, 0.08, 0.64],
 [0.27, 0.64, 0.08]]
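
As a quick hand check (our arithmetic, using the toy table above): the first row is just the 'a' column of the table, renormalized. The unnormalized scores for f[0]='a' are t(a|<null>)=1/3, t(a|C)=0.1, t(a|B)=0.1, which sum to 8/15, so the posterior is [(1/3)/(8/15), 0.1/(8/15), 0.1/(8/15)] = [0.625, 0.1875, 0.1875], matching the first row above.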

Part B: Visualization warmup

We have provided a function that visualizes a posterior alignment matrix. We have tested it in Chrome and Safari.

The next cell should work standalone, even before you implement the posterior calculation. Please execute it to see how the visualization works: it shows the table, with a blue box in each cell sized proportionally to that cell's probability. Here is a screenshot of what it should look like.

In [ ]:
import ibm; reload(ibm)

_post_alignments = [
     [0.625, 0.1875, 0.1875],
     [0.27, 0.08, 0.64],
     [0.27, 0.64, 0.08]]

ibm.show_alignment_posterior(_post_alignments, ['a','b','c'], ['<null>','C','B'])
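
If the HTML visualization doesn't render in your browser, a plain-text fallback is easy to write. Here is a minimal sketch (print_alignment_matrix is our hypothetical helper, not part of ibm.py):

def print_alignment_matrix(post, fsent, esent):
    # Hypothetical helper: plain-text view of a posterior alignment matrix.
    # Header row: the English words, one column per English position.
    print " ".join("%10s" % w for w in [""] + esent)
    # One row per Foreign word: its posterior over English positions.
    for i, f in enumerate(fsent):
        print " ".join(["%10s" % f] + ["%10.3f" % p for p in post[i]])

print_alignment_matrix(_post_alignments, ['a','b','c'], ['<null>','C','B'])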

Question 2

Where does the model think the word "c" came from?

Answer: ANSWERME

Question 3

Where does the model think the word "a" came from? Why does it think so, instead of the other options? (It may be helpful to print out the translation table, or write it out on paper, and look at it while you answer.)

Answer: ANSWERME

Part C: Analysis of a real model

We have provided for you a $t(f|e)$ translation table that we trained with the EM algorithm. It was trained on 10,000 German-English sentences from the Europarl dataset (from the EU Parliament). The sentences are in the plaintext file de-en.short_10k.lowtok included in this directory; you can view it with your favorite text editor. Our preprocessing had two steps: (1) tokenization (using nltk.word_tokenize), and (2) lowercasing all words. We ran EM for 30 iterations, which took just a few minutes. (Normally, machine translation training is done with much larger datasets, which yield higher-quality models.)

First, load the translation table. It is a dictionary with the same structure as toymodel above, except much bigger. The next cell loads it, and for one word, prints out the most probable translations, and their probabilities. This code should run without you having to edit it. Make sure that ttable.json is in the same directory as this ipynb file.

In [ ]:
import json
ttable = json.load(open("ttable.json"))
print "%d words in e vocab" % len(ttable.keys())
print "Most probable translations of ."
tprobs = ttable["."]
pairs = sorted(tprobs.items(), key=lambda (f,p): -p)[:10]
print u" ".join("%s:%.3f" % (f,p) for f,p in pairs)
# Result should look like:
#6732 words in e vocab
#Most probable translations of .
#.:0.905 !:0.016 herr:0.015 präsident:0.015 –:0.008 ,:0.006 die:0.005 frau:0.004 präsidentin:0.003 meine:0.003
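
For the analysis questions it helps to repeat this lookup for other words. A small helper like the following wraps the same pattern (top_translations is our name, not part of the starter code):

def top_translations(e_word, k=10):
    # Print the k most probable Foreign translations of an English word.
    pairs = sorted(ttable[e_word].items(), key=lambda fp: -fp[1])[:k]
    print u" ".join(u"%s:%.3f" % (f, p) for f, p in pairs)

top_translations("the")   # any lowercased English word in the vocabulary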

Question 4

Comment on the quality of this translation table (translations of an English period into German). Which of the translations are good? Which are bad? Why is it making the errors that it's making? It may be useful to open the training data text file in a text editor and search for particular words, to understand how they're used. You can also look up German words on http://dict.leo.org/ but the main ones here are: Herr = "Mr.", Frau = "Mrs.", Präsident = "president", die = "the", meine = "mine".
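
If you'd rather search the training text from Python than in a text editor, a quick check like the following works (a sketch, assuming the file is UTF-8 plain text; the token is space-padded to avoid substring matches):

import io
text = io.open("de-en.short_10k.lowtok", encoding="utf-8").read()
print text.count(u" herr ")   # rough count of the token "herr" in the data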

Answer: ANSWERME

Next, calculate alignments on the following sentence pairs, which come from the real dataset, using this translation model, and answer the questions about the model's analysis of them. You should not have to change the code cells; they should show visualizations of the posterior alignment matrices.

In [ ]:
import ibm; reload(ibm)
ee=u"over 60 % of poles support the constitutional treaty ."
ff=u"mehr als 60 % der polen sind für den verfassungsvertrag ."
ee,ff=ee.split(),ff.split()
post = ibm.calc_alignment_posterior(ff,ee,ttable)
ibm.show_alignment_posterior(post, ff,ee)

Question 5: Comment on the quality of this alignment. What's going on with the English word "Poles"?

Answer:

ANSWERME

In [ ]:
import ibm; reload(ibm)
ee=u"that is what they think about the european union ."
ff=u"so also denken die polen über die europäische union ."
ee,ff=ee.split(),ff.split()
post = ibm.calc_alignment_posterior(ff,ee,ttable)
ibm.show_alignment_posterior(post, ff,ee)

Question 6: Comment on the quality of this alignment. Are there aspects of the training data that are important here? (If you look at the data file, note that the sentences are actually randomly ordered.)

Answer:

ANSWERME