Homework 2: N-gram Language Models

This is due on Friday, Sept 23 (11:59pm), submitted electronically.

You will submit two files: (1) this Jupyter notebook with your answers, and (2) hw2.py. (Do not include any other files.)

We provide a starter version of hw2.py with stub functions that need to be completed. Much of the code in this notebook calls functions from the hw2.py module.

  • Your Name: write name here

  • List collaborators: list here

(see our grading and policies page for details on our collaboration policy).

Part A: Toy Example for Testing

When implementing algorithms, it's crucial to design test cases to make sure your code works. The best way to start is to run it on inputs where you know the correct answer in advance, or the range of correct answers. (If you automate these types of tests, they're called unit tests and are a standard technique in software engineering.)

We'll take the approach of having a tiny, synthetic "toy" dataset to experiment with. It's important to run tests on this before real-world data: toy datasets run more quickly, and outputs from real-world data might look plausible enough that you miss bugs.

Our toy example has a vocabulary of two word types, "A" and "B", plus the special **START** and **END** symbols. We'll calculate some quantities by hand to help verify the correctness of your implementation.

Here is a toy corpus that's eight tokens long (including start/end tokens).

  **START** A A A B B B **END**

And here are the bigram counts.

|                   | wnext = A | wnext = B | wnext = **END** |
|-------------------|-----------|-----------|-----------------|
| wprev = A         | 2         | 1         | 0               |
| wprev = B         | 0         | 2         | 1               |
| wprev = **START** | 1         | 0         | 0               |

And below is the same thing in Python dictionaries. Evaluate the cell below, since we'll use this data later.

In [ ]:
uni_counts_toy = {"A": 3, "B": 3, "**START**": 1, "**END**": 1}
bi_counts_toy = {"A": {"A": 2, "B": 1}, "B": {"B": 2, "**END**": 1}, "**START**": {"A": 1}}
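In the spirit of the unit-testing advice above, here is an optional sanity check (a sketch, not required for the assignment) that recomputes the bigram counts directly from the toy token sequence and compares them against the tables above:

```python
from collections import defaultdict

toy_tokens = ["**START**", "A", "A", "A", "B", "B", "B", "**END**"]

def count_bigrams(tokens):
    """Count (prev, next) pairs as a nested dict: counts[prev][next]."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # convert to plain dicts for easy comparison
    return {p: dict(n) for p, n in counts.items()}

assert count_bigrams(toy_tokens) == {
    "A": {"A": 2, "B": 1},
    "B": {"B": 2, "**END**": 1},
    "**START**": {"A": 1},
}
```

If the assertion passes silently, the hand counts and the code agree.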

Part B: Conditional Probability

For an ngram model with history length $k$, and vocabulary size $V$, we have:

$$ P(w_i | w_{i-k}..w_{i-1} ) = \frac{ c(w_{i-k}..w_i) + \alpha }{ c(w_{i-k}..w_{i-1}) + \alpha V }. $$

where $\alpha$ is the pseudocount added to every word type. In lecture we usually used $\alpha=1$; in this homework we'll just use $k=1$ and $\alpha=0$. The longest n-grams used (in the numerator) are $(k+1)$-grams.

We always assume $w_1=$ **START**, a special symbol denoting the start of a sentence; a sentence always ends with the special symbol **END**. In terms of the model's generative assumption: the model assumes a **START** symbol exists in the first position, then generates words one by one. When it generates an **END** symbol, the generative process stops.
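To make the formula concrete, here is a hedged sketch of how the conditional probability could be computed from the count tables. The function `cond_prob` and its signature are illustrative only (they are not part of the starter code), and to avoid giving away Question B-1 the example uses $\alpha=1$ rather than the assignment's $\alpha=0$:

```python
def cond_prob(wprev, wnext, uni_counts, bi_counts, alpha, V):
    """P(wnext | wprev) with add-alpha smoothing over a vocabulary of size V."""
    numer = bi_counts.get(wprev, {}).get(wnext, 0) + alpha
    denom = uni_counts.get(wprev, 0) + alpha * V
    return numer / float(denom)

# Toy counts from Part A, repeated here so this cell is self-contained.
uni_counts_toy = {"A": 3, "B": 3, "**START**": 1, "**END**": 1}
bi_counts_toy = {"A": {"A": 2, "B": 1}, "B": {"B": 2, "**END**": 1}, "**START**": {"A": 1}}

# With alpha=1, and assuming V=3 (possible next words: A, B, **END**):
# P(A | **START**) = (1 + 1) / (1 + 1*3) = 0.5
print(cond_prob("**START**", "A", uni_counts_toy, bi_counts_toy, alpha=1, V=3))  # -> 0.5
```

Note that the choice $V=3$ here is an assumption about which symbols count as possible next words; think carefully about what $V$ should be in your own setting.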

Question B-1 (10 points):

Please compute the entire conditional probability table $P(w_{next} \mid w_{prev})$ for $w_{prev} \in \{A, B, \text{START}\}$ and $w_{next} \in \{A, B, \text{END}\}$. Fill in your answers in the table below. (It might be easier to do this on paper first.)

ANSWER:

|                   | P(wnext = A \| wprev) | P(wnext = B \| wprev) | P(wnext = END \| wprev) |
|-------------------|-----------------------|-----------------------|-------------------------|
| wprev = A         |                       |                       |                         |
| wprev = B         |                       |                       |                         |
| wprev = START     |                       |                       |                         |

Part C: Draw samples from unigrams

Utility code

Please look at hw2.py, which contains weighted_draw_from_dict(prob_dict). You give it a dictionary where the keys are strings and the values are their probabilities, and it returns a single random sample from that distribution.

For example, run the following code a bunch of times. It randomly returns 'a' 75% of the time and 'b' 25% of the time.

(The import statement should work if hw2.py is in the same directory as this jupyter notebook file.)
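If you're curious how a function like this can work, one common approach is inverse-transform sampling: draw a uniform number in [0, 1) and walk through the cumulative probabilities until it is exceeded. A minimal sketch (this is an illustration, not the actual hw2.py implementation):

```python
import random

def weighted_draw_sketch(prob_dict):
    """Return one key, sampled in proportion to its probability."""
    r = random.random()  # uniform in [0, 1)
    cumulative = 0.0
    for key, prob in prob_dict.items():
        cumulative += prob
        if r < cumulative:
            return key
    # fall back to the last key in case of floating-point round-off
    return key

sample = weighted_draw_sketch({'a': 0.75, 'b': 0.25})
```

Each key "owns" a slice of [0, 1) equal to its probability, so over many draws the sample frequencies approach the given probabilities.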

In [ ]:
import hw2; reload(hw2)
hw2.weighted_draw_from_dict({'a': 0.75, 'b': 0.25})

Question C-1 (2 points):

If you drew from the above distribution 10,000 times, what is the expectation of the number of times 'a' will occur?

ANSWER:

Question C-2 (3 points):

Write a very small bit of test code to confirm weighted_draw_from_dict performs as advertised. Draw from the above distribution 10,000 times and check that the outcome 'a' occurs approximately the number of times it's expected to.

In [ ]:
import hw2
# add answer here

Part D: Language model sampling

In the following questions, we'll sample from a language model, based on ngram count tables, with a pseudocount of zero.

First we'll write draw_next_word_unigram_model (which samples from $P(w)$) and then draw_next_word_bigram_model (which samples from $P(w_i | w_{i-1})$). Finally we'll write the sample_sentence function to sample a sentence from the bigram model.

Throughout, make sure to test the code on the toy corpus: uni_counts_toy and bi_counts_toy from earlier in this notebook.

Question D-1: Draw from unigram distribution (10 points)

Please implement draw_next_word_unigram_model in hw2.py, and ensure the test cases below work correctly. The starter code always returns a nonsense string, so the test cases should run out of the box, but give bad answers.

In [ ]:
## TEST CODE: run but do not change. Just take a sample.
import hw2; reload(hw2)
hw2.draw_next_word_unigram_model(uni_counts_toy)
In [ ]:
## TEST CODE: run but do not change. Take lots of samples.
import hw2; reload(hw2)
print "unigram counts:", uni_counts_toy
from collections import Counter
print "Random sample counts. Should have same proportions as original counts."
Counter([hw2.draw_next_word_unigram_model(uni_counts_toy) for i in range(8000)])

Question D-2: Draw from bigram distribution (15 points)

Please implement draw_next_word_bigram_model. It takes three parameters: the first two are the unigram and bigram count tables, which effectively define the model. The third parameter is the previous context word. Make sure both test cases below run with correct results.

In [ ]:
## Test code: draw once
import hw2; reload(hw2)
hw2.draw_next_word_bigram_model(uni_counts_toy,bi_counts_toy,"A")
In [ ]:
## Test code: draw many times
from collections import Counter
for w in ['A','B','**START**']:
    manydraws = [hw2.draw_next_word_bigram_model(uni_counts_toy,bi_counts_toy,w) \
                 for i in range(1000)]
    sample_counts=Counter(manydraws)
    print "PREFIX %s --> %s" % (w, sample_counts)

Question D-3: Implementing sentence sampling (30 points):

Next, you will write the function sample_sentence which generates a new sentence from a given model (and pseudocount value of 0). Here are some considerations:

  • You should reuse the draw_next_word_bigram_model function.

  • You should generate a sentence that starts with the **START** token and ends with the **END** token. Other sequences of words have zero probability under the model, so they should never be generated. To start the function, you just set the first token to be **START** with probability one. You should keep randomly generating tokens, conditional on the previous word, until you generate the **END** token.

  • If your code has a bug and you enter an infinite loop and the "Stop" button in jupyter doesn't work, use Ctrl-C on the command line that launched the jupyter notebook. You'll have to re-run all the cells to load back in the toy data.

In [ ]:
## Test code -- draw one sample.  Run but do not change.  Run many times to be sure...
import hw2; reload(hw2)
hw2.sample_sentence(uni_counts_toy, bi_counts_toy)

Real Data!

Now that everything works on a toy model, we'll run on real data based on the movie review dataset from HW1. We have already performed tokenization, normalization, and n-gram counting for you, and are supplying you the unigram and bigram count tables. Their structure is the same as the toy corpus n-gram count dictionaries. (If you're curious, we used this script with NLTK to do the processing.)

First, make sure the unigram_count_IMDB.json and bigram_count_IMDB.json files are in the current directory and load them with the following code. Second, make sure you can sample from this model.

In [ ]:
# Loading code
import json, os
print "current working directory", os.getcwd()
uni_counts_IMDB = json.load(open('unigram_count_IMDB.json'))
print "loaded %s unigram types" % len(uni_counts_IMDB)
bi_counts_IMDB = json.load(open('bigram_count_IMDB.json'))
In [ ]:
# Take a sample
import hw2; reload(hw2)
print u' '.join(hw2.sample_sentence(uni_counts_IMDB, bi_counts_IMDB))

Part E: How Well do N-Gram Models Capture Grammar?

Question E-1 (20 points)

Sample ten sentences from the IMDB bigram model, then copy and paste them as text into the cell below. For each, judge whether the sentence is grammatical in English. How many sentences are grammatical out of 10? If you had to formulate particular standards for how you define grammaticality, please explain. (Notice we're talking about grammaticality, like whether the sentence is well-formed, as opposed to the sentence's meaning.)

ANSWER:

(delete this:) you may want to put the sentences in a triple-quoted code segment to deal with markdown formatting annoyances

Question E-2 (10 points)

Based on reading these samples, what are the strengths and weaknesses of n-gram language models? What phenomena can they model, and what do they get wrong? When possible, please refer to specific examples in the sentences you just analyzed.

ANSWER:

answerme