Problem Set 1: Probability review and language models

This is due on Thursday, Sept 18, before lecture (2:30pm), submitted electronically. It may take some time, so please get started as soon as possible!

How to do this problem set

Some of these questions have textual answers, and some of them require writing Python code and computing results. Write all the answers in this document. Once you are finished, you will upload this .ipynb file to Moodle. If you open it in a text editor, you'll see it's just a plain-text (JSON) representation of all the information in the notebook, including any embedded images.

As you develop code, it will be helpful to use the "Restart Kernel" button above, which restarts the Python interpreter and thus clears out all global variables. When creating your final version of the problem set to hand in, please do a fresh restart and execute every cell in order. Then make sure to hit "Save"!

As you are working, you may find the standard command-line version of Python to be a convenient alternative, depending on your working style. Whatever you do, you must copy all final answers into this IPython notebook and make sure it executes correctly on its own. We will grade based on that.

Please write your name and UMass ID here.

  • Name:
  • UMass OIT ID:

List collaborators, and how you collaborated, here: (see our grading and policies page for details on our collaboration policy).

  • name 1

Part (A): MLE

This is a WRITTEN portion of the assignment. Please answer using a Markdown cell. To look at the source code for any ipython notebook cell, just double click on it.

A nice feature of the notebook's version of Markdown is that you can embed LaTeX equations using dollar signs. Inline example: $x = \frac{a}{b}$ ... or, full-line example: $$x = \frac{a}{b}$$

You can also make things bold or italic and other fun things. See documentation for the Markdown syntax on the internet for more information. (Github, StackOverflow, and other websites also use Markdown.)

Question 1 (10 points):

Consider a unigram model. Say there are $N$ tokens in the training data, and a particular word $w$ occurs $n_w$ times. The unigram model says there is a parameter $\theta \in [0,1]$ which defines $P(w;\theta)=\theta$. We've already noted that the maximum likelihood estimate of $\theta$ is $\frac{n_w}{N}$. Prove this.

Feel free to use the bernoulli/binomial form of the unigram model, where there are only two words in the vocabulary, one of which is $w$. The maximum likelihood optimization problem can be written

$$ \hat{\theta} = \arg\max_{\theta} P(w_1..w_N; \theta), \qquad \text{where} \qquad P(w_1..w_N; \theta) = \prod_{i=1}^N P(w_i; \theta) = \prod_{i=1}^{N} \theta^{1\{w_i=w\}} (1-\theta)^{1\{w_i \neq w\}} $$

Here, the notation $1\{...\}$ is the indicator function, which evaluates to $1$ if the condition inside is true and $0$ otherwise, and the notation $\arg\max_a f(a)$ means "the value of $a$ that maximizes $f(a)$". (If you do the multinomial case where the vocabulary is larger than 2, you will need Lagrange multipliers. Feel free to do that if you know how.)

Answer:

answerme

Part (B): Heaps' law

In this section you will verify an important statistical property of text called Heaps' Law, which relates the number of tokens in a document or corpus to the vocabulary size (a.k.a. the number of unique word types). We would like to know how these two numbers relate to each other as you look at bigger and bigger corpora, which we will simulate by collecting a growing set of news articles.

Part B1:

Run the following code. It iterates through NLTK's collection of 10,788 Reuters news articles, and after each article, calculates the cumulative number of tokens so far, as well as the number of unique word types seen so far. The x-axis is the number of tokens so far, and the y-axis is the number of word types (vocabulary size) so far.

Notes: To access the data, follow NLTK's instructions for installing their data package, which contains this and other corpora. The instructions are linked from www.nltk.org, or just search for them online. You will need the data package for Part C as well.

In []:
from __future__ import division
import matplotlib.pyplot as plt
import nltk
%matplotlib inline
docids = nltk.corpus.reuters.fileids()
seen_words = set()
num_tokens_vs_wordtypes = []
ntoks = 0
for docid in docids:
    tokens = nltk.corpus.reuters.words(docid)
    ntoks += len(tokens)
    for tok in tokens: seen_words.add(tok.lower())
    num_tokens_vs_wordtypes.append( (ntoks, len(seen_words)) )

plt.plot(
    [ntok for ntok,ntype in num_tokens_vs_wordtypes],
    [ntype for ntok,ntype in num_tokens_vs_wordtypes]
         )
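
(Optional) Heaps' Law is usually stated as a power-law relationship, roughly $V \approx k N^\beta$ for constants $k$ and $\beta$. One quick, informal way to eyeball whether the curve above looks like a power law is to redraw it on log-log axes, where a power law appears as an approximately straight line. A minimal sketch, reusing the list built above:

In []:
# Optional: same curve on log-log axes; a power law V ~ k * N^beta looks roughly like a straight line here.
plt.figure()
plt.loglog(
    [ntok for ntok, ntype in num_tokens_vs_wordtypes],
    [ntype for ntok, ntype in num_tokens_vs_wordtypes]
)
plt.xlabel("number of tokens (N)")
plt.ylabel("number of word types (V)")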

Part B2:

Question 2 (5 points):

At least in this corpus, it appears that as you look at more and more documents, you keep seeing new unseen words. If you were to collect more and more documents from a larger corpus, would you ever stop seeing new words? Why or why not? What are the implications for language modeling?

ANSWER:

answerme

Part (C): N-gram Language Modeling

Part C1: Utility Code

First, we provide some utility functions for manipulating multinomial distributions over types. You will employ these later on.

In []:
from __future__ import division
from collections import defaultdict
import nltk, random

# Utility functions if you want them

def normalized_dict(dct):
    """
    Assume dct is a string-to-number map.  Return a normalized version where the values sum to 1.
    {"a":4.0, "b":2.0} ==> {"a":0.6666, "b":0.3333}
    """
    s = sum(dct.values())
    new_dct = {key: value*1.0/s for key,value in dct.items()}
    return new_dct

def weighted_draw_from_dict(choice_dict):
    """Randomly choose a key from a dict, where the values are the relative probability weights."""
    # http://stackoverflow.com/a/3679747/86684
    choice_items = choice_dict.items()
    total = sum(w for c, w in choice_items)
    r = random.uniform(0, total)
    upto = 0
    for c, w in choice_items:
        if upto + w > r:
            return c
        upto += w
    assert False, "Shouldn't get here"

# Quick test: This should be approx 7500
a_count = sum(['a'==weighted_draw_from_dict({'a':3,'b':1}) for i in range(10000)])
a_count
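
(Optional) A quick illustrative check of normalized_dict, which you will combine with weighted_draw_from_dict later on:

In []:
# Optional check: normalized_dict preserves relative weights and its values sum to 1.
probs = normalized_dict({'a': 4.0, 'b': 2.0})
print(probs)                  # roughly {'a': 0.6667, 'b': 0.3333}
print(sum(probs.values()))    # should be 1.0 (up to floating-point error)
print(weighted_draw_from_dict(probs))  # returns 'a' about two-thirds of the time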

Question 3 (2 points): Explain why the algorithm implemented in weighted_draw_from_dict(choice_dict) correctly samples from the multinomial distribution defined by choice_dict.

Answer:

answerme

Question 4 (2 points): Explain why the value of a_count above isn't exactly 7500 in practice. Note that a_count is a random variable. What is its expected value?

Answer:

answerme

Part C2: Load Some Data

We will be using the novel Emma by Jane Austen. Make sure you have the NLTK data packages installed, as described on the "Installing NLTK Data" page.

In []:
# http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt
sentence_splitter = nltk.data.load('tokenizers/punkt/english.pickle')
emma_text = nltk.corpus.gutenberg.raw("austen-emma.txt")

emma_sentences_raw = sentence_splitter.sentences_from_text(emma_text)
emma_sentences_toks_origcase = [nltk.word_tokenize(sent_text) for sent_text in emma_sentences_raw]
emma_sentences_toks = [[w.lower() for w in sent_toks] for sent_toks in emma_sentences_toks_origcase]

print "Document is %d characters, %s sentences, %s tokens" % (
    len(emma_text), len(emma_sentences_toks), sum(len(toks) for toks in emma_sentences_toks))

Question 5 (2 points):

Explain the difference between emma_sentences_raw and emma_sentences_toks above. What is tokenization?

Answer:

answerme

Part C3: Implementing Ngram Language Models

We will write some code to train an ngram model and to use it for various downstream tasks, such as querying for probability values and sampling from the model's distribution over sentences.

We will use a simple pseudocount smoothing estimator. For a $k$-gram model (of history size $k-1$), and vocabulary size $V$, we have:

$$ P(w_t | w_{t-k+1}..w_{t-1} ) = \frac{ \#(w_{t-k+1}..w_t) + \alpha }{ \#(w_{t-k+1}..w_{t-1}) + \alpha V }. $$

Here, $\alpha$ is the number of pseudocounts for every word type. The above conditional distribution yields the joint distribution over $w_1, \ldots, w_T$ given by

$$P(w_1, w_2, \ldots , w_T) = \prod_{t=1}^T P(w_t | w_{t-k+1}..w_{t-1} )$$

This expression doesn't quite make sense as written, though, since it contains terms like $P(w_1 | w_0)$ for a bigram model (or $P(w_1 | w_{-1}, w_0)$ for a trigram model), and there is no such thing as $w_0$ or $w_{-1}$. In response, we add the types "**START**" and "**END**" to our vocabulary. For an ngram order of $n$, we effectively add $n-1$ "**START**" tokens to the start of the query sequence and $n-1$ "**END**" tokens to the end. The same transformation is applied to the training data, so that you properly estimate conditional probabilities involving these special types.
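
For concreteness, here is the estimator plugged into a small hypothetical example (the counts and vocabulary size are made up): take a bigram model with vocabulary size $V = 10000$ and $\alpha = 0.1$, and suppose the bigram "the dog" was observed 3 times while the prefix "the" was observed 500 times. Then

$$ P(\text{dog} \mid \text{the}) = \frac{3 + 0.1}{500 + 0.1 \times 10000} = \frac{3.1}{1500} \approx 0.002, $$

whereas the unsmoothed MLE would be $3/500 = 0.006$; the pseudocounts shift a little probability mass away from observed words and toward rare and unseen ones.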

Part C3.0: Testing Your Code

Whenever you need to write code, we will provide the skeleton of a python function. Place your solution between the # ANSWER STARTS HERE and # ANSWER ENDS HERE comments. To make reading your submission easy, please leave these comments in.

As in all engineering disciplines, it is very important to design test cases to make sure that what you've created actually works. An important method for diagnosing bugs in an algorithm or its implementation is to run it on inputs for which you know the correct output in advance. If the output of your implementation doesn't match what you expected, then you have a bug. Such unit tests are mandatory for software engineers at places like Google, and you should definitely get in the habit of writing them. It will save you lots of time in the long run. Broken code is not fun.

From personal experience, it is helpful to run these tests before you run on real-world data. They typically run quickly and can save you from wasting time. Also, sometimes the output on real-world data might look good enough that you don't catch bugs.

Throughout Section C3, we will present a series of tests that the code should pass. We will do everything with respect to an imaginary language with the word types "A" and "B" (plus the **START** and **END** tokens).

Now, rather than starting with data and constructing a model via MLE, we will start with a model and then construct data such that the MLE on this data should have the same parameters as the model.

|                      | $P(w_2 = A \mid w_1)$ | $P(w_2 = B \mid w_1)$ | $P(w_2 = \text{END} \mid w_1)$ |
|----------------------|-----------------------|-----------------------|--------------------------------|
| $w_1 = A$            | 2/3                   | 1/3                   | 0                              |
| $w_1 = B$            | 0                     | 2/3                   | 1/3                            |
| $w_1 = \text{START}$ | 1                     | 0                     | 0                              |

Question 6 (2 points): In the next cell, we define two corpora, each consisting of one sentence. For which of these does the MLE bigram language model with zero pseudocounts yield the conditional probability table above? To provide your answer, uncomment one of the two definitions for the variable corpusForTesting in the box below it.

In []:
corpus1 = [['A','A','A','B','B','B']]
corpus2 = [['A','B','A','B','A','B']]
In []:
#ANSWER STARTS HERE
#corpusForTesting = corpus1
#corpusForTesting = corpus2
#ANSWER ENDS HERE

Part C3.1: Making Ngram Counts

In this section, we will provide you with complete implementations. However, we will ask questions to make sure you understand them, and you will need to write unit tests for them.

In []:
def make_ngrams(tokens, ngram_size):
    """Return a list of ngrams, of given size, from the input list of tokens.
    Also include **START** and **END** tokens appropriately."""
    ngrams = []
    tokens = ['**START**'] * (ngram_size-1) + tokens + ['**END**'] * (ngram_size-1)
    for i in range(ngram_size, len(tokens)+1):
        ngrams.append( tuple(tokens[i-ngram_size:i]))
    
    return ngrams
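
(Optional) A quick way to convince yourself the padding behaves as described is to run make_ngrams on a tiny input; the expected output below is just for illustration:

In []:
# Optional sanity check of make_ngrams on a two-token input with ngram_size 3.
print(make_ngrams(['x', 'y'], 3))
# expect: [('**START**', '**START**', 'x'), ('**START**', 'x', 'y'),
#          ('x', 'y', '**END**'), ('y', '**END**', '**END**')]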

Question 7 (2 points):

For an input sequence of length $L$, and an ngram size $n$, how many ngrams should the above function return?

Answer:

answerme

Next, we define a class for storing the necessary count statistics for learning an ngram language model. The two main data structures it contains are vocabulary and ngram_counts. We provide a fully-implemented version of this class.

In []:
class NgramModelCounts:
    def __init__(self):
        self.vocabulary = set()
        self.ngram_size = None
        # designed to have the structure {prefix: {nextword: count}}
        # Feel free to change if you don't like this approach
        self.ngram_counts = defaultdict(lambda:defaultdict(int))

def get_ngram_counts(sentences, ngram_size):
    """'Train' a fixed-order ngram model by doing the necessary ngram counts.
    Return a data structure that represents the counts."""
    model = NgramModelCounts()
    model.ngram_size = ngram_size
    model.vocabulary.add("**START**")
    model.vocabulary.add("**END**")
    for sent_tokens in sentences:
        ngrams = make_ngrams(sent_tokens, ngram_size)
        for ngram in ngrams:
            prefix = tuple(ngram[:ngram_size-1])
            model.ngram_counts[prefix][ngram[-1]] += 1
        for tok in sent_tokens:
            model.vocabulary.add(tok)
    return model
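
(Optional) To see the shape of the resulting data structures, here is a small illustrative example on a throwaway corpus (deliberately unrelated to corpusForTesting):

In []:
# Optional: inspect the counts produced for a tiny throwaway corpus.
tiny_model = get_ngram_counts([['x', 'y', 'x']], 2)
print(tiny_model.vocabulary)
# expect: x, y, **START**, **END**
print(dict((prefix, dict(nexts)) for prefix, nexts in tiny_model.ngram_counts.items()))
# expect: {('**START**',): {'x': 1}, ('x',): {'y': 1, '**END**': 1}, ('y',): {'x': 1}}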

Question 8 (2 points):

For a reasonable text corpus, and an NgramModelCounts model constructed from it, how will len(model.ngram_counts) compare to the total number of tokens passed in? Will it be about the same size, larger, or smaller?

Answer:

answerme

Part C3.2: Querying the Model for Probabilities

Next, we use the counts stored in an NgramModelCounts to define a multinomial distribution over next words for a given prefix.

In the next cell, complete the skeleton code for next_word_prob, which, for a given prefix, returns the distribution over all possible next words, stored as a dictionary from strings to probability values (hint: use the normalized_dict function from section C1).

Here are a few considerations:

  • Don't forget to account for word_pseudocount when computing probability estimates.
  • We will assume that out-of-vocabulary tokens have been mapped to the string "**OOV**." Make sure to provide correct probability values when the next word or one of the words in the prefix is out-of-vocabulary. Hint: make sure "**OOV**" receives pseudocounts.
  • Section C3.2-Test provides a means for you to make sure your code is working.
In []:
def next_word_prob(prefix, model, word_pseudocount):
    """ For the given prefix, return the distribution over all possible next words,
    using the model and per-word pseudocount value given as input. 
    Hints: 
    (0) You will want to use the normalized_dict function defined in part C1
    (1) Don't forget to add word_pseudocount to the counts in the input model. 
    (2) The input model (i.e. counts) doesn't include the **OOV** token. You will need 
        to explicitly add a count for this (i.e. just pseudocounts).
    """
#ANSWER STARTS HERE (15 points)

#ANSWER ENDS HERE

Question 9 (2 points):

Note that NgramModelCounts isn't a model because it doesn't provide answers to probability queries directly. It simply stores the count statistics that the answers to these queries can be computed from.

There are two ways to use the function next_word_prob to produce an ngram language model:

1) For every possible prefix, store the normalized_dict given by next_word_prob(prefix, model, word_pseudocount). In downstream applications, when we want to query it for a given prefix, we just return the pre-computed vector of multinomial probabilities.

2) In downstream applications, we call next_word_prob(prefix,model,pseudocounts) only as-needed for values of prefix that are asked for. (Note: this is sometimes called a lazy model).

What are the pros and cons of each approach in terms of memory usage and the time it takes to answer a probability query? What would be a hybrid of these approaches that would have better properties than either of them (hint: caching)?

Answer:

answerme

C3.2-Tests

Question 10 (5 points): Below, we have a simple piece of test code that (1) trains a model on the corpusForTesting you defined above, then (2) prints out the result of next_word_prob() with and without a pseudocount.

Make sure this code runs and returns correct answers.

In []:
toy_model = get_ngram_counts(corpusForTesting, 2)

print "Probs for first token, under MLE"
p_after_start = next_word_prob( ("**START**",), toy_model, 0 )
print "Probs:", p_after_start                   ## This should be a dictionary
print "Their sum:", sum(p_after_start.values()) ## Should sum to 1, or extremely close
print "Prob of OOV:", p_after_start["**OOV**"]  ## This should be in the dict, but value should be 0

print "\nProbs for first token, using a pseudocount"
p_after_start_pseudo = next_word_prob( ("**START**",), toy_model, 0.1 )
print "Probs:", p_after_start_pseudo
print "Their sum:", sum(p_after_start_pseudo.values())
print "Prob of OOV:", p_after_start_pseudo["**OOV**"]

Question 11 (4 points): Next, we want to test that our code works to learn ngram models from data. We will use the synthetic corpus defined in corpusForTesting in section C3.0. Call next_word_prob() a number of times with zero pseudocounts to print out the values of the conditional probability table corresponding to the table in section C3.0. If your implementation of next_word_prob() is correct, the values you print should match that table. Otherwise, you have a bug.

In []:
# ANSWER STARTS HERE

# ANSWER ENDS HERE

Part C3.3: Randomly Generating Sentences From the Model

Next, you will write a function in the next cell that samples a new sentence, given a model and a pseudocount value of 0. Here are some considerations:

  • We definitely encourage you to use the weighted_draw_from_dict function defined in section C.1.

  • Don't forget that you must begin the sentence with the **START** token(s) and finish it with the **END** token. Other sequences of tokens have zero probability under the model, so they should never be generated. For the beginning, you just set the first token(s) to **START** with probability one. Then keep randomly generating tokens, conditioned on their prefix, until you generate the **END** token.

  • Observe that the model may (with low probability) generate very long sentences, since there is always some probability of drawing a token that is not **END**, regardless of the length of what has been generated so far. You can just manually stop the generation process for sentences longer than 1000 tokens. This introduces a very mild bias into the sampling procedure (since it will never generate a sentence of length >1000, but such sentences have nonzero probability under the model).

  • If your code has a bug and you enter an infinite loop, use Ctrl-C on the command line that launched ipython notebook.

In []:
def sample_sentence(model):
    ##See hints above.
    ##Additional hint: don't forget to pad with START tokens. 
    tokens = ['**START**'] * (model.ngram_size-1)

    #ANSWER STARTS HERE (15 points)

    #ANSWER ENDS HERE

C3.3-Tests

Question 12 (5 points): In the cell below, do the following for different values of N=10, N=100, and N=1000:

  • Take the NgramModelCounts you defined on corpusForTesting when testing next_word_prob above and generate N synthetic sentences from it.
  • Construct a new NgramModelCounts from these sentences.
  • Print the conditional probability table for this new model, as you did in C3.2-Tests (using zero pseudocounts again).

Repeat these steps for each of N=10, N=100, and N=1000. As N grows, the recovered conditional probability table should get arbitrarily close to the table of the original model. This property is called statistical consistency; maximum likelihood estimation happens to be consistent, so you can check whether your implementation exhibits this behavior for large samples.

In []:
#ANSWER STARTS HERE

#ANSWER ENDS HERE

Part C3.4: Computing Probabilities of Observations Under the Model

Next, we provide a fully-implemented function to evaluate the log-probability of a sentence. As the function computes this, it also prints one line for each word, with four pieces of information:

prefix,  word,  p(word|prefix), count(ngrams of prefix-then-word)
In []:
import math
def logprob(tokens, model, word_pseudocount):
    # return lp, the log probability of the tokens under the model 
    # it also prints the info described above
    
    lp = 0
    tokens = ['**START**']*(model.ngram_size-1) + tokens + ['**END**']
    for t in range(model.ngram_size-1, len(tokens)):
        start,end = t-model.ngram_size+1, t
        prefix = tuple(tokens[start:end])
        probs = next_word_prob(prefix, model, word_pseudocount)
        nextword = tokens[t]
        if nextword not in model.vocabulary:
            nextword = '**OOV**'
        prob = probs[nextword]
        print prefix, nextword, prob, "with count", model.ngram_counts.get(prefix,{}).get(nextword,0)
        lp += math.log(prob)
    return lp

C3.4-Tests

Note: The following test certainly isn't comprehensive enough to assess whether your implementation (in particular, next_word_prob) is correct. In practice, you will benefit from writing additional unit tests.

Question 13 (2 points):

The output of logprob should never be positive. Why?

Answer:

answerme

Question 14 (2 points):

Now, in the cell below, write some code that runs on one of the NgramModelCounts objects defined in the tests above (or on the Emma corpus) and asserts that the computed log probability is never positive. We suggest generating random sentences using sample_sentence and then calling logprob on them. This is the beauty of unit tests: since you know sample_sentence works, you can now use it in other tests!

In []:
#ANSWER STARTS HERE
#ANSWER ENDS HERE

Part C4: Exploratory Data Analysis

Now let's explore the text of Emma by looking at properties of an ngram language model trained on it. First, train a model on the corpus:

In []:
##Warning: make sure you didn't redefine the variable emma_sentences_toks in your unit tests above.
#Otherwise you will be building a model on that data, rather than on the text of Emma
emma_model = get_ngram_counts(emma_sentences_toks, 2)

Now, let's evaluate the log probability of a few sentences

In []:
# Example 1 to evaluate. It should look like the following (subject to difference in formatting). 
#If it doesn't, you have a bug

#('**START**',) emma 0.0135983147854 with count 102
#('emma',) felt 0.0220173046907 with count 19
#('felt',) afraid 0.00841155264993 with count 1
#('afraid',) . 0.0411078996754 with count 3
#('.',) **END** 0.998585259193 with count 6354
#Out:
#-16.084855879500022

logprob(['emma','felt','afraid','.'], emma_model, .001)
In []:
# Example 2 to evaluate
logprob(['emma','felt','bodacious','.'], emma_model, .001)
In []:
# Example 3 to evaluate
logprob(['the','the','the','.'], emma_model, 0.001)

Question 15 (2 points): The three sentences above have a wide range of log probabilities. For each sentence, explain why its log probability is higher than, lower than, or in between the others'.

Answer: answerme

Part C5: How Well do NGram Models Capture Grammar?

First, train both a bigram and trigram model.

In []:
m1 = get_ngram_counts(emma_sentences_toks,2)
m2 = get_ngram_counts(emma_sentences_toks,3)

Then, sample ten sentences from each.

In []:
print "FIRST ORDER"
for i in range(10): print ' '.join(sample_sentence(m1))
print "\nSECOND ORDER"
for i in range(10): print ' '.join(sample_sentence(m2))

Question 16 (5 points):

We would like to assess how often each model generates grammatical sentences. Cut-and-paste the randomly-generated sentences from above into the answer section below, and add your grammaticality judgment at the beginning of every line: (1) if it was grammatical and (0) if it was ungrammatical. If you had to formulate particular standards for how you define grammaticality, please explain them.

Answer:

answerme

We would like to test the hypothesis that sentences drawn from a trigram model are grammatical more often than sentences drawn from a bigram model. However, since you annotated only 10 samples from each, it is difficult to detect a statistically significant difference in grammaticality: your resulting confidence intervals will be roughly +/- 30%. In other words, it would not be unlikely for one model to appear grammatical more frequently than the other simply by chance, rather than because of the underlying hypothesized difference. However, if you had annotated >100 samples, this would yield a more reasonable comparison.
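
If you are curious where the rough +/- 30% figure comes from, here is a small sketch of the normal-approximation (Wald) 95% confidence interval half-width for a proportion estimated from n annotated samples; the numbers are illustrative only:

In []:
# Half-width of a rough 95% confidence interval (normal approximation) for a proportion.
import math

def approx_95ci_halfwidth(p_hat, n):
    return 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / n)

print(approx_95ci_halfwidth(0.5, 10))   # ~0.31: with 10 samples, roughly +/- 30%
print(approx_95ci_halfwidth(0.5, 100))  # ~0.10: with 100 samples, roughly +/- 10%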

Question 17 (1 point): We are going to aggregate the statistics from everyone's homework submission to perform a single course-wide hypothesis test to see whether a trigram model produces grammatical sentences more frequently.

  • How many were grammatical for the bigram model? ANSWER: answerme
  • How many were grammatical from the trigram model? ANSWER: answerme

Question 18 (3 points):

Based on reading these samples, what are the strengths and weaknesses of n-gram language models? What phenomena can they model, and what do they get wrong? Please refer to specific examples in the sentences you just analyzed.

Answer:

answerme

Part D: Survey

Note: If you respond to these questions with something that isn't garbage, you will get full credit. Think of the 4 points from this section as pseudocounts :).

Question 19 (2 points):

How did this problem set go? Was it too easy? Too hard? Just right? What aspects of it did you find most difficult?

Answer:

answerme

Question 20 (2 points): Overall, what did you like or dislike about this problem set?

Answer:

answerme