Part 2: Text classification

In this problem set, you will build a system for classifying movie reviews as positive, negative, or neutral. You will:

  • Build a machine learning classifier using Naive Bayes
  • Evaluate your classifier and examine what it has learned

You will use our provided code for loading and training a model. Our NaiveBayesModel object contains data for word counts and document label counts. Please see nb.py for details on its data structures.

Execute the following code, which loads the training data and prints a few statistics about it. It also shows how to access the data in the model. Here, "POS" denotes positive reviews, "NEU" neutral reviews, and "NEG" negative reviews.

In []:
from __future__ import division
import nb
train_docs = nb.read_jsons("reviews/train-imdb.jsons")
train_labels = nb.read_keyfile("reviews/train-imdb.key")
mm = nb.NaiveBayesModel()
mm.train(train_docs, train_labels)
print "Number of documents with each label:", mm.class_doc_counts
print "Number of tokens total for each label:", mm.class_total_tokens
print "Vocabulary size:", len(mm.vocabulary)
print "Count of word 'awesome' in POS reviews:", mm.class_word_counts["POS"]["awesome"]
print "Count of word 'awesome' in NEU reviews:", mm.class_word_counts["NEU"]["awesome"]
print "Count of word 'awesome' in NEG reviews:", mm.class_word_counts["NEG"]["awesome"]

Part 2.A: Word likelihood function

Question 8 (10 points)

Implement a function in the next cell that calculates $P(w | y)$ from a NaiveBayesModel object and a given pseudocount value.

Hint: use the default argument of dict.get() so that if a key doesn't exist in a dict, you can get the value 0. For example, for d={"a":3,"b":1}, d.get("a",0) returns 3, but d.get("c",0) returns 0 (and does not raise an error).
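
For reference, the quantity you are implementing is presumably the standard additively smoothed relative frequency (an assumption based on the pseudocount argument; check it against your course notes), where $\alpha$ is the pseudocount, $n_{w,y}$ is the count of word $w$ in documents labeled $y$, $N_y$ is the total number of tokens in documents labeled $y$, and $|V|$ is the vocabulary size:

$$ P(w | y) = \frac{n_{w,y} + \alpha}{N_y + \alpha |V|} $$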

In []:
def word_prob(nb_model, word, label, pseudocount):    
    #ANSWER STARTS HERE
    return 1.0/42
    #ANSWER ENDS HERE

Sanity check: compare the probabilities of "awesome" by executing the next cell's code. It should yield:

P(awesome | POS, pseudocount=1) = 0.0001164472325

P(awesome | NEG, pseudocount=1) = 2.77842828281e-05

In []:
print "P(awesome | POS, pseudocount=1) =", word_prob(mm, "awesome", "POS", 1.0)
print "P(awesome | NEG, pseudocount=1) =", word_prob(mm, "awesome", "NEG", 1.0)

Part 2.B: Model inspection

We want to rank words by $LR(w) = \frac{P(w | y=POS)}{P(w | y=NEG)}$. This measures how discriminative a word is for indicating a positive review versus a negative review. A word with LR=5 is five times more likely to appear in a positive review than it is in a negative one.
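
As a concrete illustration, here is a minimal sketch of the measure for a single word (it assumes word_prob from Question 8 is implemented; the helper name below is ours, not part of the assignment):

# Hypothetical helper: the LR of a single word, reusing word_prob from Question 8.
def likelihood_ratio(nb_model, word, pseudocount):
    pos = word_prob(nb_model, word, "POS", pseudocount)
    neg = word_prob(nb_model, word, "NEG", pseudocount)
    return pos / neg

print "LR(awesome) =", likelihood_ratio(mm, "awesome", 1.0)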

Question 9 (15 points)

In the next cell, write code that will allow us to find the 15 most positive words by this measure and, for each, print the word, its LR, its count in positive reviews, and its count in negative reviews, separated by spaces.

For example, we will call the function below as print_likely_words(mm,"POS","NEG",100).

In []:
def print_likely_words(nb_model, label_for_numerator, label_for_denominator, pseudocount):
    #Hint: for every word in the vocabulary, you will want to call word_prob to obtain both the numerator 
    #and the denominator of the LR(w) expression above.
    # Then you should sort them by their LRs and take the 15 highest ones.
    # print the output with a single space between fields, one word per line. See next cell for example. 
    # ANSWER STARTS HERE
    pass
    # ANSWER ENDS HERE

Sanity check: the next cell calls print_likely_words. The output should be exactly the following text:

great 2.2877851756 419 126
best 1.83129126178 275 104
love 1.73599158978 252 102
excellent 1.72455314252 106 19
wonderful 1.69891506485 91 12
well 1.68391331426 353 168
her 1.67795511849 727 391
life 1.6678837338 270 121
very 1.57588840389 528 297
performance 1.5618197063 143 55
also 1.5554402509 320 169
beautiful 1.55293498999 112 36
father 1.53747716138 96 27
true 1.53150614895 106 34
family 1.52798983334 127 48

In []:
print_likely_words(mm,"POS","NEG",100) ##do not change this code

Question 10 (5 points) In the next cell, we print the most likely words with a pseudocount of 1. In the cell after that, comment on the differences between the output of print_likely_words(mm,"POS","NEG",100) above and print_likely_words(mm,"POS","NEG",1) below. What types of words does each list discover? Why are they different?

In []:
print_likely_words(mm,"POS","NEG",1) ##do not change this code

Answer: ANSWERME

Part 2.C: Logarithm of the Posterior Equation

To calculate the posterior probability of a label $P(y | \vec{w})$ using Bayes' rule, you'll need the unnormalized posterior for each label, $P(y) P(\vec{w} | y)$ (note this is the numerator in Bayes' rule).

Unfortunately, for long documents $P(\vec{w} | y) = \prod_t P(w_t|y)$ will numerically underflow: when you multiply together many small probabilities, the product gets so close to zero that the computer rounds it to zero. Instead, you'll have to work with log-probabilities, so we first derive the appropriate equation. We start with the equation for the posterior probability:

$$ P(y | \vec{w}) = \frac{ P(y) \prod_t P(w_t | y) }{ \sum_{y' \in \{POS,NEG,NEU\}} P(y') \prod_t P(w_t | y') } $$

You will implement the log-posterior equation,

$$\log P(y|\vec{w}) = \underbrace{\log P(y) + \sum_t \log P(w_t|y)}_{\text{Unnormalized-log-posterior}} - \underbrace{\log[\sum_{y'} P(y') \prod_t P(w_t | y')]}_{\text{Sum of unnorm-posteriors}}$$

This can also be written, where $f(y)=\log P(y) + \log P(\vec{w}|y)$, as

$$\log P(y|\vec{w}) = f(y) - \log \sum_{y'} e^{f(y')}$$

A common convention, which we adopt in the code below, is to refer to the right hand term $\log \sum_{y'} e^{f(y')}$ in the equation above as 'log Z.'
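
To see the underflow problem concretely, here is a small illustration (the numbers are made up for demonstration):

# Multiplying many small probabilities underflows to exactly 0.0:
p = 1e-5
prod = 1.0
for _ in range(100):
    prod *= p
print prod                # 0.0, although the true value is 1e-500
# The equivalent sum of logs stays comfortably within floating-point range:
import math
print 100 * math.log(p)   # about -1151.29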

Question 11 (15 points)

In the cell below, implement the posterior inference function that takes in the NaiveBayesModel object, a document $(w_1..w_n)$, and a pseudocount, and computes a posterior over the three labels, $P(y=k | w_1..w_n)$ for each $k \in \{POS,NEU,NEG\}$, factoring in the pseudocount. Represent the posterior as a dictionary whose values sum to 1.

To calculate the final posterior, you'll first calculate the unnormalized log posteriors for each label. Then the logsumexp function will come in handy when computing the logarithm of the Bayes' rule denominator: logsumexp calculates $\log \sum_i \exp(f_i)$ without numerical underflow, even when the $\exp(f_i)$ values are far too small to represent directly.
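
For instance, evaluating the denominator directly underflows, while logsumexp does not (illustrative values only, not from the data):

import math
from scipy.misc import logsumexp  # in newer scipy versions, scipy.special.logsumexp

f = [-1000.0, -1001.0, -1002.0]   # hypothetical f(y) values on the log scale
print [math.exp(x) for x in f]    # [0.0, 0.0, 0.0] -- all underflow
print logsumexp(f)                # about -999.59, computed without underflow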

In []:
import math
from scipy.misc import logsumexp

def calc_label_posterior(nb_model, doctokens, pseudocount):
    # nb_model: a NaiveBayesModel object
    # doctokens: a list of strings, of the tokens in the document
    # pseudocount: a number

    # We will incrementally compute the f(y) for each y, by adding terms into this dictionary.
    unnorm_log_posterior = {"POS": 0, "NEU":0, "NEG": 0}

    for label in ["POS","NEU","NEG"]:
        # First step: add the log P(y) term into unnorm_log_posterior.
        # The following line is dummy code that makes up a prior-- please fix.
        # ANSWER STARTS HERE
        prior_prob_of_label = 1.0/3 #this should be re-implemented with a correct right hand side
        # ANSWER ENDS HERE
        unnorm_log_posterior[label] += math.log(prior_prob_of_label)

        # Second step: iterate through all the tokens in the document
        # and add their log probabilities to the unnorm log posterior.
        for word in doctokens:
            # The following lines are dummy code; please fix.
            # ANSWER STARTS HERE
            prob_of_word = 1.0/42  #this should be re-implemented with a correct right hand side
            # ANSWER ENDS HERE
            unnorm_log_posterior[label] += math.log(prob_of_word)

    # OK, now we've computed the f(y)'s as in the equation above.
    # The final step is to calculate the normalized posterior.
    # You will want to use scipy.misc.logsumexp() to calculate logZ.
    # REPLACE DUMMY CODE in the next line
    
    # ANSWER STARTS HERE
    logZ = 42.0 #this should be re-implemented with a correct right hand side
    # ANSWER ENDS HERE

    
    # Finally compute the final posterior.
    # Calculate the values and place them in a dictionary to return.
    final_posterior = {}
    for label in ["POS","NEU","NEG"]:
        # ANSWER STARTS HERE
        log_posterior_prob_for_label = -1/42 #this should be re-implemented with a correct right hand side
        # ANSWER ENDS HERE
        final_posterior[label] = math.exp(log_posterior_prob_for_label)

    return final_posterior

Sanity checks: run the code on the following documents.

In []:
calc_label_posterior(mm, ["awesome"], 1)
#This should return
#{'NEG': 0.16750753705112045,
# 'NEU': 0.1162298481345833,
# 'POS': 0.7162626148142966}
In []:
calc_label_posterior(mm, ["awesome","awesome","awesome","awesome","awesome","awesome"], 1)
#This should return 
#{'NEG': 0.0001807103911743505,
# 'NEU': 0.0005766491562551465,
# 'POS': 0.9992426404525736}
In []:
# The following code predicts on the first document in the training set.
print "FIRST 20 WORDS OF DOC:", train_docs[0]["tokens"][:20]
calc_label_posterior(mm, train_docs[0]["tokens"], 1)

#This should return 
#FIRST 20 WORDS OF DOC: [u'Airport', u"'77", u'starts', u'as', u'a', u'brand', u'new', u'luxury', u'747', u'plane', u'is', u'loaded', u'up', u'with', u'valuable', u'paintings', u'&', u'such', u'belonging', u'to']
#{'NEG': 9.882509501713775e-79, 'NEU': 1.0, 'POS': 5.882654789945415e-82}

Part 2.D: Classification Evaluation

Execute the following code to load the test set. The variable test_labels is a dictionary mapping the document ID of a document to its label.

In []:
from __future__ import division
import nb;reload(nb)
test_docs = nb.read_jsons("reviews/dev-imdb.jsons")
test_labels = nb.read_keyfile("reviews/dev-imdb.key")
print "%d labels in test set" % len(test_docs)

Now you will evaluate your classifier's accuracy on the test set. First, we provide evaluation code that takes predictions for the test set and calculates accuracy (it simply counts the number of correct predictions). Next, the code reports the accuracy of the "most common class" baseline, which always predicts positive. It both prints the accuracy and returns it as a number. Execute this cell of code.

In []:
def evaluator(predictions):
    # predictions is a list of strings, length 1000, one for each test document.
    assert len(predictions) == len(test_docs), "predictions must be a list as long as test_docs"
    total_correct = 0
    for i in range(len(test_docs)):
        doc = test_docs[i]
        label = test_labels[doc['docid']]
        pred = predictions[i]
        total_correct += int( pred == label )
    print "Accuracy = %d/%d = %s" % (total_correct, len(test_docs), total_correct/len(test_docs))
    return total_correct/len(test_docs)

print "Most common baseline"
baseline_preds = ["POS" for i in range(len(test_docs)) ]
evaluator(baseline_preds)

Question 12 (10 points)

In the following code block, implement a method that takes in a document and produces a hard classification, based on the posterior calculated by calc_label_posterior (return the string label that has the highest posterior probability).

In []:
def classify(nb_model, doc, pseudocount):
    #Hint: to get the tokens from the input doc, use doc['tokens']
    #ANSWER STARTS HERE
    label_post = calc_label_posterior(nb_model, doc['tokens'], pseudocount)
    return nb.dict_argmax(label_post)
    #ANSWER ENDS HERE


print "Classification for first document in training set:", classify(mm,train_docs[0], 1.0)

Next, we provide some helper code for looping over documents and returning a list of their assigned labels.

In []:
def evaluate_on_documents(nb_model, docs, pseudocount):
    return evaluator([ classify(nb_model, doc, pseudocount) for doc in docs ])

Question 13 (8 points) In the next cell, we evaluate accuracy on the test documents using pseudocounts of 100 and 1. In the cell after that, explain which pseudocount value looked better when printing out feature weights in Part 2.B and which looks better here. Why do you think there is a difference?

In []:
evaluate_on_documents(mm,test_docs,100)
evaluate_on_documents(mm,test_docs,1)

Answer:

ANSWERME

Next, we calculate your classifier's accuracy for several different values of the pseudocount parameter and graph the results. Note that we plot log10(alpha) on the x-axis.

In []:
alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
accs = []
for alpha in alphas:
    print "calculating for alpha=", alpha
    acc = evaluate_on_documents(mm,test_docs,alpha)
    accs.append(acc)
print "Accuracy rates:", accs
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot([math.log10(a) for a in alphas], accs)

Question 14 (5 points)

Based on this analysis, what is the best value of the pseudocount parameter to use for future predictions?

Answer:

ANSWERME