This is due on Friday, Sept 16 (11:59pm), submitted electronically. 100 points total.
Most of these questions require writing Python code and computing results; the rest require textual answers. Write all the answers in this document. Once you are finished, you will upload this .ipynb file to Moodle.
Your Name: write name here
List collaborators: list here
(see our grading and policies page for details on our collaboration policy).
You'll be working with a sample of the IMDB Large Movie Review Dataset.
Here's the sample for this assignment. It consists of 1136 positive movie reviews. Download it, unzip it somewhere, and look at a few documents to be sure you know what you're getting.
A1: Load documents (10 points):
Load the documents into a dictionary as described below. os.listdir() and os.path.join() may be helpful.
from __future__ import division
import os,sys,re,math
# This dictionary will hold all the documents.
# Keys and values are intended to both be strings.
# Keys: the filename
# Values: the text content of the file
fname2content = {} # {filename: text of file}
#-------------------Don't modify the code above-------------------------
#-------------------Provide your answer below--------------------------
#-------------------Provide your answer above---------------------------
#-------------------Don't modify the code below------------------------
# or only minimally modify it in case your keys have a slightly different format
print "Number of documents loaded: ", len(fname2content)
print fname2content["17_9.txt"][:500]
The output of the cell above should look like the following:
Number of documents loaded: 1136
This is a complex film ...
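For reference, here is a minimal sketch of one way to populate fname2content. It assumes the unzipped reviews sit in a local directory; the directory name below is just a placeholder for wherever you put the data.
import os

DATA_DIR = "pos"   # placeholder: path to the directory of unzipped review files
for fname in os.listdir(DATA_DIR):
    path = os.path.join(DATA_DIR, fname)
    with open(path) as f:
        fname2content[fname] = f.read()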
The goal of this part is to perform simple preprocessing on the raw text and count the words.
B1: Naive tokenization (10 points):
For now, assume tokens are based on whitespace. Write code to calculate the total number of tokens -- this will have to iterate through documents and tokenize each of them.
total_token_num=0
#-------------------Don't modify the code above-------------------------
#-------------------Provide your answer below--------------------------
#-------------------Provide your answer above---------------------------
#-------------------Don't modify the code below------------------------
print "Total Number of tokens: ", total_token_num
B2: Better tokenization/normalization (10 points)
Please develop a better tokenizer and text normalizer yourself -- perhaps using regular expressions, as we discussed in class, and with case normalization. A good way to experiment with tokenizers is to run them on very small examples, so start by improving your tokenizer's output on the two examples below. (Once it works better on these small samples, you will run it on the larger corpus next.) Show that your new tokenizer gives better results than the naive tokenizer on these two example texts.
## --- keep this code ---
examples = ["Hello, we are good.", "OK... I'll go here, ok?"]
print "Naive tokenizations"
for example in examples:
    print example.split()

## --- modify code below ---
def better_tokenizer(text):
    return []  ## todo changeme

print "Better tokenizations"
for example in examples:
    print better_tokenizer(example)
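As one possible starting point, here is a sketch of a regex-based better_tokenizer: it lowercases the text, keeps runs of letters, digits, and apostrophes together as word tokens, and splits each punctuation character into its own token. The exact regular expression is a design choice you may well want to refine.
import re

def better_tokenizer(text):
    # Match either a run of word-internal characters (letters, digits,
    # apostrophes) or any single non-space character, so punctuation
    # marks come out as separate tokens.
    return re.findall(r"[a-z0-9']+|[^a-z0-9'\s]", text.lower())
On the first example above this yields ['hello', ',', 'we', 'are', 'good', '.'], which separates punctuation from words, unlike the naive whitespace split.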
B3: Word count (10 points):
Count words from the corpus into the variable word_counts, using your better_tokenizer. We initialized the word_counts variable as a Python dict; feel free to use defaultdict(lambda: 0) instead, which is slightly easier to use. (In the future you may want to check out Counter, but please practice dict or defaultdict here.)
Print out the most common words and their counts.
Important functions to make this easy include dict's .items(), list's .sort() (and/or the standalone sorted()), and the key= parameter on sort.
#-------------------Provide your answer below--------------------------
word_counts = {} ## will contain {word: count}
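A sketch of how the counting and printing could go, assuming better_tokenizer from B2 and using defaultdict as suggested (the cutoff of 20 printed words is an arbitrary choice):
from collections import defaultdict

word_counts = defaultdict(lambda: 0)   # {word: count}
for text in fname2content.values():
    for tok in better_tokenizer(text):
        word_counts[tok] += 1

# Sort (word, count) pairs by descending count and print the top 20.
top = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)
for word, count in top[:20]:
    print word, count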
In this section, you will verify two key statistical properties of text: Zipf's Law and Heaps' Law. You can check your results by comparing your plots to ones on Wikipedia; they should look qualitatively similar.
Question C1: Visualizing Zipf's Law (20 points):
Zipf's Law describes the relationship between a word's frequency rank and its frequency value. For a word $w$, its frequency is inversely proportional to its rank:
$$count_w = K \frac{1}{rank_w}$$
or, equivalently,
$$\log(count_w) = \log K - \log(rank_w)$$
for some constant $K$, specific to the corpus and how words are being defined.
Therefore, if Zipf's Law holds, then after sorting the words in descending order of frequency, frequency decreases approximately linearly as a function of rank on a log-log scale.
Please make such a log-log plot by plotting rank versus frequency. Use a scatter plot where the x-axis is log(rank) and the y-axis is log(frequency). You should get this information from word_counts; for example, you can take the individual word counts and sort them. The dict methods .items() and/or .values() may be useful. (Note that it doesn't really matter whether ranks start at 1 or 0 in terms of how the plot comes out.)
Please remember to label the meaning of the x-axis and y-axis.
#-------------------Modify code below------------------
# sample plotting code; feel free to delete
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure()
ax = plt.gca()
ax.scatter( [1,2,3], [10,1000,3000], linewidth=2)
plt.xlabel("my x axis")
plt.ylabel("my y axis")
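For the actual plot, one possible approach is sketched below. It assumes word_counts from B3 and starts ranks at 1 (the most frequent word has rank 1).
import math
import matplotlib.pyplot as plt

# Sort counts in descending order; the rank of a count is its 1-based position.
counts = sorted(word_counts.values(), reverse=True)
log_ranks = [math.log(rank) for rank in range(1, len(counts) + 1)]
log_counts = [math.log(c) for c in counts]

fig = plt.figure()
ax = plt.gca()
ax.scatter(log_ranks, log_counts, linewidth=2)
plt.xlabel("log(rank)")
plt.ylabel("log(frequency)")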
Question C2: Interpreting a Zipf plot (5 points):
You should see some discontinuities on the left and right sides of this figure. Why do we see them on the left? Why do we see them on the right? On the right, what are those "ledges"?
ANSWER:
answerme
Question C3: Visualizing Heaps' Law (20 points):
Heaps' Law asserts (in its most basic version) that the vocabulary size of a corpus is approximately proportional to the square root of the number of tokens in the corpus:
$$ |V| = K \sqrt{N_{tok}} $$
where $K$ is a constant specific to the corpus.
We will investigate this phenomenon empirically by iterating over our corpus. Iterate over each document; for each, calculate the total number of tokens seen so far, and the total number of unique word types seen so far. We would like to know how these two numbers relate to each other as we see more and more of the corpus.
Create a plot with a curve (line plot or scatter plot, whichever you think is better) where the x-axis is the number of tokens and the y-axis is the vocabulary size (so far). Make sure to label your axes.
#-------------------Provide your answer below--------------------------
# one way to implement: maintain and update a set representing the vocabulary so far.
seen_words=set()
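One way the bookkeeping and plotting could look (a sketch assuming fname2content and better_tokenizer from earlier parts):
import matplotlib.pyplot as plt

seen_words = set()
tokens_so_far = 0
token_totals = []   # running number of tokens after each document
vocab_sizes = []    # running number of unique word types after each document

for text in fname2content.values():
    toks = better_tokenizer(text)
    tokens_so_far += len(toks)
    seen_words.update(toks)
    token_totals.append(tokens_so_far)
    vocab_sizes.append(len(seen_words))

plt.plot(token_totals, vocab_sizes)
plt.xlabel("number of tokens seen")
plt.ylabel("vocabulary size")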
C4 Heaps' Law: glass half empty (5 points):
Why is growth in the vocabulary slower than growth in number of tokens, and why does it get slower and slower?
ANSWER:
answerme
C5 Heaps' Law: glass half full (5 points):
Imagine you obtained millions and millions of documents and repeated this experiment. How long would the vocabulary keep increasing?
ANSWER:
answerme
C6: Constants (5 points)
Heaps' Law has a constant in it. Describe something, in either the data or the analysis method, that would affect this constant. Explain why.
ANSWER:
answerme