Use Python version 3. We strongly suggest installing Python from the Anaconda Individual Edition software package.
Download large_movie_review_dataset.zip and lotr_script.txt. We will use these two datasets in this homework.
Most of these questions require writing Python code or Unix commands and computing results, while the remainder have textual answers. To complete this assignment, you will need to fill out the supporting files, hw1.py and hw1.sh.
For all of the textual answers, replace the placeholder text ("Answer in one or two sentences here.") with your answer.
This assignment is designed so that you can run all cells in a few minutes of computation time. If it is taking longer than that, you probably have a mistake in your code.
Write all the answers in this IPython notebook. Once you are finished: (1) generate a PDF via (File -> Download As -> PDF), (2) upload your PDF file to Gradescope, and (3) compress hw1.py, hw1.sh, and hw1.ipynb into one zip file and upload it to Gradescope.
Important: Check your PDF before you turn it in to Gradescope to make sure it exported correctly. If the IPython notebook gets confused about your syntax, it will sometimes terminate the PDF creation routine early. You are responsible for checking for these errors. If your whole PDF does not print, try running jupyter nbconvert --to pdf hw1.ipynb on the command line to identify and fix any syntax errors that might be causing problems.
Important: When creating your final version of the PDF to hand in, please do a fresh restart and execute every cell in order. Then you'll be sure it's actually right. One convenient way to do this is by clicking Cell -> Run All in the notebook menu.
If you are having trouble with PDF export, you can always paste screenshots into a word processor and then export that document as a PDF.
# Run this cell! It sets some things up for you.
# This code makes plots appear inline in this document rather than in a new window.
import matplotlib.pyplot as plt
# This code imports your work from hw1.py
from hw1 import *
%matplotlib inline
plt.rcParams['figure.figsize'] = (5, 4) # set default size of plots
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
import os  # imported explicitly rather than relying on `from hw1 import *` to provide it
# download the IMDB large movie review corpus to a file location on your computer
PATH_TO_DATA = 'large_movie_review_dataset'  # set this variable to point to the location of the IMDB corpus on your computer
POS_LABEL = 'pos'
NEG_LABEL = 'neg'
TRAIN_DIR = os.path.join(PATH_TO_DATA, "train")
TEST_DIR = os.path.join(PATH_TO_DATA, "test")
for label in [POS_LABEL, NEG_LABEL]:
    if len(os.listdir(TRAIN_DIR + "/" + label)) == 12500:
        print("Great! You have 12500 {} reviews in {}".format(label, TRAIN_DIR + "/" + label))
    else:
        print("Oh no! Something is wrong. Check your code which loads the reviews")
# Actually reading the data you are working with is an important part of NLP! Let's look at one of these reviews
with open(TRAIN_DIR + "/neg/3740_2.txt", encoding='utf8') as f:  # close the file once it has been read
    print(f.read())
One major part of any NLP project is word tokenization. Word tokenization is the task of segmenting text into individual words, called tokens. In this assignment, we will use simple whitespace tokenization. You will have a chance to improve this for extra credit at the end of the assignment. Take a look at the tokenize_doc function in hw1.py. You should not modify tokenize_doc, but make sure you understand what it is doing.
# We have provided a tokenize_doc function in hw1.py. Here is a short demo of how it works
d1 = "This SAMPLE doc has words tHat repeat repeat"
bow = tokenize_doc(d1)
assert bow['this'] == 1
assert bow['sample'] == 1
assert bow['doc'] == 1
assert bow['has'] == 1
assert bow['words'] == 1
assert bow['that'] == 1
assert bow['repeat'] == 2
bow2 = tokenize_doc("Computer science is both practical and abstract.")
for b in bow2:
    print(b)
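To make the idea of whitespace tokenization concrete, here is a minimal sketch of a bag-of-words tokenizer. It is not the tokenize_doc provided in hw1.py; the name whitespace_bow is hypothetical, and the sketch assumes (based only on the demo above) that tokens are lowercased and split on whitespace.

```python
from collections import defaultdict


def whitespace_bow(doc):
    """Hypothetical stand-in for tokenize_doc: lowercase, split on whitespace, count."""
    bow = defaultdict(float)
    for token in doc.lower().split():
        bow[token] += 1.0
    return bow


toy = whitespace_bow("Computer science is both practical and abstract.")
# Punctuation stays attached to the neighboring word, so the last token
# is "abstract." (with the period) rather than "abstract".
print(sorted(toy))
```

Note that whitespace splitting leaves punctuation glued to words, which is one of the things the extra credit section invites you to improve.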
Question 1.1 (5 points)
Now we are going to count the word types and word tokens in the corpus. In the cell below, use the word_counts dictionary variable to store the count of each word in the corpus.
Use the tokenize_doc function to break documents into tokens.
word_counts keeps track of how many times a word type appears across the corpus. For instance, word_counts["dog"] should store the number 990 -- the count of how many times the word dog appears in the corpus.
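If you have not used Counter before, the following sketch shows how counts accumulate across documents. The two toy documents and the names toy_docs and toy_counts are invented for illustration; the real cell should read the review files and call tokenize_doc instead of str.split.

```python
from collections import Counter

# Two invented toy documents standing in for review files.
toy_docs = ["the dog saw the cat", "the cat ran"]

toy_counts = Counter()
for doc in toy_docs:
    # update() adds one count for every token in the iterable it is given.
    toy_counts.update(doc.lower().split())

print(toy_counts["the"])   # count for a word that was seen
print(toy_counts["bird"])  # a Counter returns 0 for unseen keys instead of raising KeyError
```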
import glob
import codecs
from collections import defaultdict, Counter
word_counts = Counter() # Counters are often useful for NLP in python. Similar to dicts (you can also use those)
for label in [POS_LABEL, NEG_LABEL]:
    for directory in [TRAIN_DIR, TEST_DIR]:
        for fn in glob.glob(directory + "/" + label + "/*txt"):
            doc = open(fn, 'r', encoding='utf8')  # Open the file with UTF-8 encoding
            # IMPLEMENT ME
if word_counts["movie"] == 61492:
    print("yay! there are {} total instances of the word type movie in the corpus".format(word_counts["movie"]))
else:
    print("hmm. Something seems off. Double check your code")
Question 1.2 (5 points)
Fill out the functions n_word_types and n_word_tokens in hw1.py. These functions return the total number of word types and tokens in the corpus. Important: The autoreload "magic" that you set up earlier in the assignment should automatically reload functions as you make changes and save. If you run into trouble, you can always restart the notebook and clear any .pyc files.
print ("there are {} word types in the corpus".format(n_word_types(word_counts)))
print ("there are {} word tokens in the corpus".format(n_word_tokens(word_counts)))
What is the difference between word types and tokens? Why is the number of tokens much higher than the number of types?
Answer in one or two sentences here.
Question 1.3 (5 points)
Using the word_counts dictionary you just created, make a new list of (word, count) pairs called sorted_list, where tuples are sorted according to counts in descending order. Then print the first 30 values from sorted_list.
# IMPLEMENT ME!
In this part, you will practice extracting and processing information from text with Unix commands. Download lotr_script.txt from the course website to a file location on your computer. This text file corresponds to the movie script of The Lord of the Rings: The Fellowship of the Ring (2001). This script comes from a larger corpus of movie scripts, the ScriptBase-J corpus.
First, let's open and examine lotr_script.txt.
Question 1.4 (5 points)
Describe the structure of this script. How are roles, scene directions, and dialogue organized?
Answer in one or two sentences here.
Now that we've identified this file's structure, let's use Unix commands to process & analyze its contents.
You may want to revisit the optional reading, Ken Church's "Unix for Poets".
Question 1.5 (5 points)
Use Unix commands to print the name of each character with dialogue in the script, one name per line. This script's text isn't perfect, so expect a few additional names.
Implement this in hw1.sh. Then, copy your implementation and its resulting output into the following two cells.
Copy Unix commands here
Copy output here
Question 1.6 (5 points)
Now, let's extract and analyze the dialogue of this script using Unix commands.
First, extract all lines of dialogue in this script. Then, normalize and tokenize this text such that all alphabetic characters are converted to lowercase and words are sequences of alphabetic characters. Finally, print the top-20 most frequent word types and their corresponding counts.
Hint: Ignore parentheticals. These contain short stage directions.
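The question asks for Unix commands, but it can help to see the same normalization spelled out in Python so you can sanity-check your pipeline's output. This is only a sketch: the two dialogue lines are invented (they are not taken from lotr_script.txt), the name alphabetic_tokens is hypothetical, and it assumes a parenthetical never spans more than one line.

```python
import re
from collections import Counter

# Invented toy lines; the real input is the dialogue extracted from the script.
toy_dialogue = [
    "Hello there, friend!",
    "(quietly) Hello again.",
]


def alphabetic_tokens(line):
    """Drop parentheticals, lowercase, and keep maximal runs of letters a-z."""
    without_parens = re.sub(r"\([^)]*\)", " ", line)
    return re.findall(r"[a-z]+", without_parens.lower())


counts = Counter()
for line in toy_dialogue:
    counts.update(alphabetic_tokens(line))

# most_common(n) returns the n most frequent (word, count) pairs.
print(counts.most_common(1))
```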
Implement this in hw1.sh. Then, copy your implementation and its resulting output into the following two cells.
Copy Unix commands here
Copy output here
Question 1.7 (5 points)
If we instead tokenized all text in the script, how might the results from Question 1.6 change? Are there specific word types that might become more frequent?
Answer in one or two sentences here.
This section of the homework will walk you through coding a Naive Bayes classifier that can distinguish between positive and negative reviews (at some level of accuracy).
Question 2.1 (5 pts) To start, implement the update_model function and the tokenize_and_update_model function in hw1.py. Make sure to read the function comments so you know what to update. Also review the NaiveBayes class variables in the __init__ method of the NaiveBayes class to get a sense of which statistics are important to keep track of. Once you have implemented update_model and tokenize_and_update_model, run the train_model function using the code below. You’ll need to provide the path to the dataset you downloaded to run the code.
nb = NaiveBayes(PATH_TO_DATA, tokenizer=tokenize_doc)
nb.train_model()
if len(nb.vocab) == 251637:
    print("Great! The vocabulary size is {}".format(251637))
else:
    print("Oh no! Something seems off. Double check your code before continuing. Maybe a mistake in update_model?")
Let’s begin to explore the count statistics stored by the update_model function. Implement the top_n function in the Naive Bayes Block to find the top 10 most common words in the positive class and the top 10 most common words in the negative class.
print("TOP 10 WORDS FOR CLASS " + POS_LABEL + ":")
for tok, count in nb.top_n(POS_LABEL, 10):
    print('', tok, count)
print()
print("TOP 10 WORDS FOR CLASS " + NEG_LABEL + ":")
for tok, count in nb.top_n(NEG_LABEL, 10):
    print('', tok, count)
print()
Question 2.2 (5 points)
What is the first thing that you notice when you look at the top 10 words for the 2 classes? Are these words helpful for discriminating between the two classes? Do you imagine that processing other English text will result in a similar phenomenon? What about other languages?
Answer in one or two lines here.
Question 2.3 (5 pts)
The Naive Bayes model assumes that all features are conditionally independent given the class label. For our purposes, this means that the probability of seeing a particular word in a document with class label $y$ is independent of the rest of the words in that document. Implement the p_word_given_label function. This function calculates $P(w|y)$ (i.e., the probability of seeing word $w$ in a document given that the label of that document is $y$).
Use your p_word_given_label function to compute the probability of seeing the word “amazing” given each sentiment label. Repeat the computation for the word “dull.”
print("P('amazing'|pos):", nb.p_word_given_label("amazing", POS_LABEL))
print("P('amazing'|neg):", nb.p_word_given_label("amazing", NEG_LABEL))
print("P('dull'|pos):", nb.p_word_given_label("dull", POS_LABEL))
print("P('dull'|neg):", nb.p_word_given_label("dull", NEG_LABEL))
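As a reference point, one common reading of $P(w|y)$ is the relative frequency of $w$ among all tokens seen with label $y$. The sketch below computes that on invented toy counts. The names toy_class_counts and mle_word_prob are hypothetical and are not part of hw1.py; your p_word_given_label should follow the function comments in hw1.py and use the statistics stored on the NaiveBayes object.

```python
from collections import Counter

# Invented per-class token counts (10 tokens in each class).
toy_class_counts = {
    "pos": Counter({"amazing": 3, "film": 5, "plot": 2}),
    "neg": Counter({"dull": 4, "film": 5, "plot": 1}),
}


def mle_word_prob(word, label):
    """Relative frequency of `word` among all tokens observed with `label`."""
    counts = toy_class_counts[label]
    return counts[word] / sum(counts.values())


print(mle_word_prob("amazing", "pos"))
print(mle_word_prob("amazing", "neg"))  # a word never seen with this label
```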
Which word has a higher probability given the positive class, amazing or dull? Which word has a higher probability given the negative class? Is this what you would expect?
Answer in one or two lines here.
Question 2.4 (5 pts)
In the next cell, compute the probability of the word "car-thievery" in the positive training data and negative training data.
print("P('car-thievery'|pos):", nb.p_word_given_label("car-thievery", POS_LABEL))
print("P('car-thievery'|neg):", nb.p_word_given_label("car-thievery", NEG_LABEL))
What is unusual about P('car-thievery'|neg)? What would happen if we took the log of "P('car-thievery'|neg)"? What would happen if we multiplied "P('car-thievery'|neg)" by "P('dull'|neg)"? Why might these operations cause problems for a Naive Bayes classifier?
Answer in one or two lines here.
Question 2.5 (5 pts)
We can address the issues from Question 2.4 with add-$\alpha$ smoothing (like add-1 smoothing, except that instead of adding 1 we add $\alpha$). Implement p_word_given_label_and_alpha in the Naive Bayes Block and then run the next cell. Hint: look at the slides from the lecture on add-1 smoothing.
print("P('stop-sign.'|pos):", nb.p_word_given_label_and_alpha("stop-sign.", POS_LABEL, 0.2))
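The following sketch illustrates add-$\alpha$ smoothing on invented toy counts, under the assumption that the smoothed estimate is $(\mathrm{count}(w, y) + \alpha) / (N_y + \alpha |V|)$, where $N_y$ is the number of tokens seen with label $y$ and $|V|$ is the vocabulary size. The names toy_class_counts, toy_vocab, and smoothed_word_prob are hypothetical; check the lecture slides for the exact form expected in p_word_given_label_and_alpha.

```python
from collections import Counter

# Invented per-class token counts (10 tokens in each class).
toy_class_counts = {
    "pos": Counter({"amazing": 3, "film": 5, "plot": 2}),
    "neg": Counter({"dull": 4, "film": 5, "plot": 1}),
}
# The toy vocabulary is every word type seen in either class.
toy_vocab = set(toy_class_counts["pos"]) | set(toy_class_counts["neg"])


def smoothed_word_prob(word, label, alpha):
    """Add-alpha estimate: every vocabulary word gets alpha pseudo-counts."""
    counts = toy_class_counts[label]
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(toy_vocab))


# "amazing" never occurs with the negative label, yet its probability is now nonzero.
print(smoothed_word_prob("amazing", "neg", 0.2))
```

With this form, the smoothed probabilities over the toy vocabulary still sum to one for each label.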
Question 2.6 (5 pts) (getting ready for question 2.11)
Prior and Likelihood
As noted before, the Naive Bayes model assumes that all words in a document are independent of one another given the document’s label. Because of this we can write the likelihood of a document as:
$P(w_{d1},\cdots,w_{dn}|y_d) = \prod_{i=1}^{n}P(w_{di}|y_d)$
However, if a document has a lot of words, the likelihood will become extremely small and we’ll encounter numerical underflow. Underflow is a common problem when dealing with probabilistic models; if you are unfamiliar with it, you can get a brief overview on Wikipedia. To deal with underflow, a common transformation is to work in log-space.
$\log[P(w_{d1},\cdots,w_{dn}|y_d)] = \sum_{i=1}^{n}\log[P(w_{di}|y_d)]$
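To see underflow in practice, the sketch below multiplies 400 identical word probabilities and compares the result with the sum of their logs. The probability value 1e-3 and the document length 400 are arbitrary numbers chosen for illustration.

```python
import math

# An invented "document" of 400 words, each with probability 0.001.
word_probs = [1e-3] * 400

# Direct product: the true value is 1e-1200, far below the smallest positive float.
product = 1.0
for p in word_probs:
    product *= p

# Log-space: the sum of logs is an ordinary negative number.
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # underflows to 0.0
print(log_sum)  # roughly -2763.1
```

Once the product has collapsed to 0.0, two labels can no longer be compared, whereas the log-space sums remain distinguishable.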
Implement the log_likelihood function (Hint: it should make calls to the p_word_given_label_and_alpha function).
Implement the log_prior function. This function takes a class label and returns the log of the fraction of the training documents that are of that label.
There is nothing to print out for this question. But you will use these functions in a moment...
Question 2.7 (5 pts)
Naive Bayes is a model that tells us how to compute the posterior probability of a document being of some label (i.e., $P(y_d|\mathbf{w_d})$). Specifically, we do so using Bayes' rule:
$P(y_d|\mathbf{w_d}) = \frac{P(y_d)P(\mathbf{w_d}|y_d)}{P(\mathbf{w_d})}$
In the previous section you implemented functions to compute both the log prior ($\log[P(y_d)]$) and the log likelihood ($\log[P( \mathbf{w_d} |y_d)]$ ). Now, all you're missing is the normalizer, $P(\mathbf{w_d})$.
Derive the normalizer by expanding $P(\mathbf{w_d})$. You will have to use "MathJax" to write out the equations. MathJax is very similar to LaTeX. 99% of the MathJax you will need to write for this course (and others at UMass) is included in the first answer of this tutorial. MathJax and LaTeX can be annoying at first, but once you get a little practice, using these tools will feel like second nature.
Derive the normalizer by expanding $P(\mathbf{w_d})$. Fill out the answer with MathJax here
Answer in one or two lines here.
Question 2.8 (5 pts)
One way to classify a document is to compute the unnormalized log posterior for both labels and take the argmax (i.e., the label that yields the higher unnormalized log posterior). The unnormalized log posterior is the sum of the log prior and the log likelihood of the document. Why don’t we need to compute the log normalizer here?
Answer in one or two lines here.
Question 2.9 (10 pts) As we saw earlier, the top 10 words from each class do not seem to tell us much about the classes. A much more informative metric, which in some ways the model actually directly uses, is the likelihood ratio, which is defined as
$LR(w)=\frac{P(w|y=\mathrm{pos})}{P(w|y=\mathrm{neg})}$
A word with LR=3 is 3 times more likely to appear in the positive class than in the negative class. A word with LR=0.33 is one-third as likely to appear in the positive class as in the negative class.
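The arithmetic behind the likelihood ratio can be sketched with a few invented probabilities. The numbers in toy_word_prob are made up for illustration and are not results from the IMDB corpus, and toy_likelihood_ratio is a hypothetical helper; nb.likelihood_ratio should instead use your smoothed p_word_given_label_and_alpha.

```python
# Invented (already smoothed) word probabilities for each class.
toy_word_prob = {
    "pos": {"amazing": 0.0004, "dull": 0.0001, "to": 0.02},
    "neg": {"amazing": 0.0001, "dull": 0.0003, "to": 0.02},
}


def toy_likelihood_ratio(word):
    """LR(w) = P(w | pos) / P(w | neg) on the toy table above."""
    return toy_word_prob["pos"][word] / toy_word_prob["neg"][word]


for word in ["amazing", "dull", "to"]:
    print(word, toy_likelihood_ratio(word))
```

In this toy table the ratio is about 4 for "amazing", about one-third for "dull", and 1 for "to", which appears equally often in both classes.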
# Implement the nb.likelihood_ratio function and use it to investigate the likelihood ratio of "amazing" and "dull"
print ("LIKELIHOOD RATIO OF 'amazing':", nb.likelihood_ratio('amazing', 0.2))
print ("LIKELIHOOD RATIO OF 'dull':", nb.likelihood_ratio('dull', 0.2))
print ("LIKELIHOOD RATIO OF 'and':", nb.likelihood_ratio('and', 0.2))
print ("LIKELIHOOD RATIO OF 'to':", nb.likelihood_ratio('to', 0.2))
What are the minimum and maximum possible values the likelihood ratio can take? Does it make sense that $LR('amazing') > LR('to')$?
Answer in one or two lines here.
Question 2.10 (5 pts)
Find the word in the vocabulary with the highest likelihood ratio below.
# Implement me!
Question 2.11 (5 pts)
Implement the unnormalized_log_posterior function and the classify function. The classify function should use the unnormalized log posteriors but should not compute the normalizer. Once you implement the classify function, we'd like to evaluate its accuracy. evaluate_classifier_accuracy is implemented for you, so you don't need to change that method.
print(nb.evaluate_classifier_accuracy(0.2))
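The decision rule from Question 2.8 (sum the log prior and the log likelihood, then take the argmax over labels) can be sketched on a toy model. Everything here is invented for illustration: the probabilities, the names toy_log_prior, toy_word_prob, toy_unnormalized_log_posterior, and toy_classify, and the choice to pass a plain list of tokens rather than the bag-of-words dictionary used in hw1.py.

```python
import math

# Invented model: equal priors and a three-word vocabulary.
toy_log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
toy_word_prob = {
    "pos": {"great": 0.05, "boring": 0.01, "film": 0.10},
    "neg": {"great": 0.01, "boring": 0.05, "film": 0.10},
}


def toy_unnormalized_log_posterior(tokens, label):
    """log P(y) + sum_i log P(w_i | y); the normalizer is deliberately omitted."""
    log_likelihood = sum(math.log(toy_word_prob[label][t]) for t in tokens)
    return toy_log_prior[label] + log_likelihood


def toy_classify(tokens):
    """Return the label with the larger unnormalized log posterior."""
    return max(toy_word_prob, key=lambda label: toy_unnormalized_log_posterior(tokens, label))


print(toy_classify(["great", "film"]))
print(toy_classify(["boring", "film"]))
```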
Question 2.12 (5 pts)
Try evaluating your model again with a smoothing parameter of 1000.
print(nb.evaluate_classifier_accuracy(1000.0))
Does the accuracy go up or down when alpha is raised to 1000? Why do you think this is?
Answer in one or two lines here.
Question 2.13 (5 pts)
Find a review that your classifier got wrong.
# in this cell, print out a review that your classifier got wrong. Print out the text of the review along with the label
What are two reasons your system might have misclassified this example? What improvements could you make that may help your system classify this example correctly?
Answer in one or two lines here.
Question 2.14 (5 pts)
Often we care about multi-class classification rather than binary classification.
How many counts would we need to keep track of if the model were modified to support 5-class classification?
Answer in one or two lines here.
Extra credit (up to 10 points)
If you don't want to do the extra credit, you can stop here! Otherwise... keep reading... In this assignment, we use whitespace tokenization to create a bag-of-unigrams representation for the movie reviews. It is possible to improve this representation to improve your classifier's performance. Use your own code or an external library such as nltk to perform tokenization, text normalization, word filtering, etc. Fill out your work in tokenize_doc_and_more (below) and then show improvement by running the cells below.
Roughly speaking, the larger the performance improvement, the more extra credit. We will also give points for the effort in the evaluation and analysis process. For example, you can split the training data into training and validation sets to prevent overfitting, and report results from trying different versions of features. You can also provide some qualitative examples you found in the dataset to support your choices of preprocessing steps. Whatever you choose to try, make sure to describe your method and your hypotheses for why the method works. You can use this IPython notebook to show your work. Be sure to explain what your code is doing in the notebook.
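One possible way to hold out a validation set is to shuffle the training filenames with a fixed seed and set a fraction aside. This is only a sketch: split_train_validation is a hypothetical helper, the 80/20 split and the seed are arbitrary choices, the filenames are invented, and how you feed the resulting lists to your classifier depends on your own code.

```python
import random


def split_train_validation(filenames, validation_fraction=0.2, seed=0):
    """Shuffle reproducibly, then hold out a fraction of the files for validation."""
    shuffled = sorted(filenames)           # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)  # a private RNG leaves the global random state untouched
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Invented filenames standing in for the files of one class in the training directory.
toy_files = ["review_{}.txt".format(i) for i in range(10)]
train_files, validation_files = split_train_validation(toy_files)
print(len(train_files), len(validation_files))
```

Splitting each class separately keeps the positive and negative proportions the same in both parts.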
# from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.util import ngrams
stemmer = SnowballStemmer('english')
# stopset = set(stopwords.words('english'))
def tokenize_doc_and_more(doc):
    """
    Return some representation of a document.
    At a minimum, you need to perform tokenization, the rest is up to you.
    """
    # Implement me!
    bow = defaultdict(float)
    # your code goes here
    return bow
nb = NaiveBayes(PATH_TO_DATA, tokenizer=tokenize_doc_and_more)
nb.train_model()
nb.evaluate_classifier_accuracy(1.0)
Use cells at the bottom of this notebook to explain what you did in tokenize_doc_and_more. Include any experiments or explanations that you used to decide what goes in your function. Doing a good job examining, explaining, and justifying your work with small experiments and comments is as important as making the accuracy number go up!
# Your experiments and explanations go here