Homework 1: Word statistics

This is due on Friday, Sept 16 (11:59pm), submitted electronically. 100 points total.

How to do this problem set

Most of these questions require writing Python code and computing results, and the rest of them have textual answers. Write all the answers in this document. Once you are finished, you will upload this .ipynb file to Moodle.

A few tips as you develop code:

  • Press Enter to edit a cell and Ctrl-Enter to re-run it (see Help -> Keyboard Shortcuts).
  • When creating your final version of the problem set to hand in, please do a fresh restart with "Kernel -> Restart" and execute every cell in order. Then you'll be sure your code doesn't rely on weird global variables you forgot about. Make sure to press "Save"!

Your Name: write name here

List collaborators: list here

(see our grading and policies page for details on our collaboration policy).

Part (A): Download dataset and load text

You'll be working with a sample of the IMDB Large Movie Review Dataset.

Here's the sample for this assignment. It consists of 1136 positive reviews of movies. Download it, unzip it somewhere, and look at a few documents to be sure you know what you're getting.

A1: Load documents (10 points):

Load the documents into a dictionary as described below. os.listdir() and os.path.join() may be helpful.

In [ ]:
from __future__ import division
import os,sys,re,math

# This dictionary will hold all the documents.
# Keys and values are intended to both be strings.
#   Keys: the filename
#   Values: the text content of the file

fname2content = {}  # {filename: text of file}

#-------------------Don't modify the code above-------------------------
#-------------------Provide your answer below--------------------------



#-------------------Provide your answer above---------------------------
#-------------------Don't modify the code below------------------------
# or only minimally modify it in case your keys have a slightly different format
print "Number of documents loaded: ", len(fname2content)
print fname2content["17_9.txt"][:500]

The output of the cell above should look like the following:

Number of documents loaded: 1136
This is a complex film ...
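
A minimal sketch of one way to fill fname2content, assuming the unzipped reviews sit in a local directory named pos/ (adjust review_dir to wherever you unzipped the data):

import os  # already imported in the setup cell above

# Hedged sketch: read every file in the (assumed) "pos" directory into the dict.
review_dir = "pos"
for fname in os.listdir(review_dir):
    path = os.path.join(review_dir, fname)
    with open(path) as f:
        fname2content[fname] = f.read()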

Part (B): Tokenize and count words

The goal of this part is to perform simple preprocessing on the raw text and count the words.

B1: Naive tokenization (10 points):

For now, assume tokens are based on whitespace. Write code to calculate the total number of tokens -- this will have to iterate through documents and tokenize each of them.

In [10]:
total_token_num=0

#-------------------Don't modify the code above-------------------------
#-------------------Provide your answer below--------------------------

    
#-------------------Provide your answer above---------------------------
#-------------------Don't modify the code below------------------------

print "Total Number of tokens: ", total_token_num
Total Number of tokens:  0
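
One possible way to compute this (a sketch, assuming fname2content was filled in Part A): str.split() with no argument splits on any run of whitespace.

# Hedged sketch: whitespace-tokenize each loaded document and sum the token counts.
total_token_num = 0
for text in fname2content.values():
    total_token_num += len(text.split())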

B2: Better tokenization/normalization (10 points)

Please develop a better tokenizer and text normalizer yourself -- perhaps using regular expressions, as we discussed in class, and with case normalization. A good way to experiment with tokenizers is to run them on very small examples, so develop and refine yours on the two examples below. (Once it works better on a small sample, you will run it on the larger corpus next.) Show that your new tokenizer gives better results than the naive tokenizer on these two example texts.

In [ ]:
## --- keep this code ---
examples = ["Hello, we are good.",  "OK... I'll go here, ok?"]

print "Naive tokenizations"
for example in examples:
    print example.split()

## --- modify code below ---

def better_tokenizer(text):
    return []  ## todo changeme

print "Better tokenizations"
for example in examples:
    print better_tokenizer(example)
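
One possible starting point (a sketch, not the only reasonable design): lowercase the text, keep alphanumeric runs as word tokens (allowing an internal apostrophe, as in "i'll"), and emit each remaining punctuation character as its own token.

import re  # already imported in the setup cell above

# Hedged sketch of a regex-based tokenizer with case normalization.
def better_tokenizer(text):
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", text.lower())

On the second example this yields ['ok', '.', '.', '.', "i'll", 'go', 'here', ',', 'ok', '?'] rather than the naive ['OK...', "I'll", 'go', 'here,', 'ok?'].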

B3: Word count (10 points):

Count words from the corpus into the variable word_counts, using your better_tokenizer. We initialized the word_counts variable as a Python dict; feel free to use defaultdict(lambda:0) instead, which is slightly easier to use. (In the future you may want to check out Counter, but please practice dict or defaultdict here.)

Print out

  1. The vocabulary size.
  2. The top 10 most common terms.

Important functions to make this easy include dict's .items(), list's .sort() (and/or standalone sorted()) and the key= parameter on sort.

In [9]:
#-------------------Provide your answer below--------------------------
word_counts = {}   ## will contain {word: count}
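
A sketch of one way to do the counting and reporting, assuming better_tokenizer and fname2content from the earlier parts; defaultdict saves the explicit check for unseen keys.

# Hedged sketch: count tokens with a defaultdict, then sort items by count.
from collections import defaultdict
word_counts = defaultdict(lambda: 0)
for text in fname2content.values():
    for tok in better_tokenizer(text):
        word_counts[tok] += 1

print "Vocabulary size: ", len(word_counts)
top10 = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print "Top 10 terms: ", top10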

Part (C): Data visualization

In this section, you will verify two key statistical properties of text: Zipf's Law and Heaps' Law. You can check your results by comparing your plots to ones on Wikipedia; they should look qualitatively similar.

Question C1: Visualizing Zipf's Law (20 points):

Zipf's Law describes the relationship between the frequency rank of a word and its frequency value. For a word $w$, its frequency is approximately inversely proportional to its rank:

$$count_w = K \frac{1}{rank_w}$$

or, taking logs, $$\log(count_w) = \log K - \log(rank_w)$$

for some constant $K$, specific to the corpus and how words are being defined.

Therefore, if Zipf's Law holds, then after sorting the words by descending frequency, word frequency decreases approximately linearly on a log-log scale.
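
For example, under this model, if the most frequent word occurs 50,000 times, the rank-2 word is predicted to occur about 25,000 times and the rank-10 word about 5,000 times.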

Please make such a log-log plot by plotting rank versus frequency. Use a scatter plot where the x-axis is log(rank) and the y-axis is log(frequency). You should get this information from word_counts; for example, you can take the individual word counts and sort them. dict methods .items() and/or .values() may be useful. (Note that it doesn't really matter whether ranks start at 1 or 0 in terms of how the plot comes out.)

Please remember to label the meaning of the x-axis and y-axis.

In [ ]:
#-------------------Modify code below------------------
# sample plotting code; feel free to delete
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure()
ax = plt.gca()
ax.scatter( [1,2,3], [10,1000,3000], linewidth=2)
plt.xlabel("my x axis")
plt.ylabel("my y axis")
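
A sketch of one way to build the plot from word_counts (assuming word_counts was filled in B3); taking the logs explicitly keeps the axis labels simple:

import math
import matplotlib.pyplot as plt

# Hedged sketch: sort counts descending, then scatter log(rank) against log(count).
counts = sorted(word_counts.values(), reverse=True)
log_ranks = [math.log(rank) for rank in range(1, len(counts) + 1)]
log_counts = [math.log(c) for c in counts]
plt.scatter(log_ranks, log_counts, s=2)
plt.xlabel("log(rank)")
plt.ylabel("log(frequency)")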

Question C2: Interpreting a Zipf plot (5 points):

You should see some discontinuities on the left and right sides of this figure. Why are we seeing them on the left? Why are we seeing them on the right? On the right, what are those "ledges"?

ANSWER:

answerme

Question C3: Visualizing Heaps' Law (20 points):

Heaps' Law asserts (in its most basic version) that the vocabulary size of a corpus is approximately proportional to the square root of the number of tokens in the corpus:

$$ |V| = K \sqrt{N_{tok}} $$

where $K$ is a constant specific to the corpus.
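
For example, under this model with $K = 30$, a corpus of 1,000,000 tokens ($\sqrt{N_{tok}} = 1000$) would be expected to contain about 30,000 distinct word types.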

We will investigate this phenomenon empirically by iterating over our corpus. Iterate over each document; for each, calculate the total number of tokens seen so far, and the total number of unique word types seen so far. We would like to know how these two numbers relate to each other as we see more and more of the corpus.

Create a plot with a curve (lineplot or scatterplot, whichever you think is better) where the x-axis is the number of tokens and the y-axis is the vocabulary size (so far). Make sure to label your axes.

In [ ]:
#-------------------Provide your answer below--------------------------
# one way to implement: maintain and update a set representing the vocabulary so far.
seen_words=set()
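
A sketch of one way to collect and plot the data, assuming fname2content and better_tokenizer from the earlier parts: keep a running token total and a growing vocabulary set, recording both after each document.

import matplotlib.pyplot as plt  # already imported for the Zipf plot

# Hedged sketch: record (cumulative tokens, vocabulary size) after every document.
seen_words = set()
tokens_so_far = 0
token_counts, vocab_sizes = [], []
for text in fname2content.values():
    toks = better_tokenizer(text)
    tokens_so_far += len(toks)
    seen_words.update(toks)
    token_counts.append(tokens_so_far)
    vocab_sizes.append(len(seen_words))

plt.plot(token_counts, vocab_sizes)
plt.xlabel("number of tokens seen so far")
plt.ylabel("vocabulary size (unique word types)")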

C4 Heaps' Law: glass half empty (5 points):

Why is growth in the vocabulary slower than growth in number of tokens, and why does it get slower and slower?

ANSWER:

answerme

C5 Heaps' Law: glass half full (5 points):

Imagine you obtained millions and millions of documents and repeated this experiment. How long would the vocabulary keep increasing?

ANSWER:

answerme

C6: Constants (5 points)

Heaps' Law has a constant in it. Describe something, in either the data or the analysis method, that would affect this constant. Explain why.

ANSWER:

answerme