{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem set 1, Intro to NLP, 2017\n",
    "\n",
    "#### This is due on September 22nd at 11PM. Please see detailed submission instructions below.  100 points total.\n",
    "\n",
    "##### How to do this problem set:\n",
    "\n",
    "- What version of Python should I use? 2.7\n",
    "\n",
    "- Most of these questions require writing Python code and computing results, and the rest of them have textual answers. To generate the answers, you will have to fill out a supporting file, `hw_1.py`.\n",
    "\n",
    "- For all of the textual answers you have to fill out have placeholder text which says \"Answer in one or two sentences here.\" For each question, you need to replace \"Answer in one or two sentences here\" with your answer.\n",
    "\n",
    "- Write all the answers in this ipython notebook. Once you are finished (1) Generate a PDF via (File -> Download As -> PDF) and upload to Gradescope (2)Turn in `hw_1.py` and `hw_1.ipynb` on Moodle.\n",
    "  \n",
    "- **Important:** Check your PDF before you turn it in to gradescope to make sure it exported correctly. If ipyhton notebook gets confused about your syntax it will sometimes terminate the PDF creation routine early. You are responsible for checking for these errors. If your whole PDF does not print, try running `$jupyter nbconvert --to pdf hw_1.ipynb` to identify and fix any syntax errors that might be causing problems.\n",
    "\n",
    "- **Important:** When creating your final version of the PDF to hand in, please do a fresh restart and execute every cell in order. Then you'll be sure it's actually right. One convenient way to do this is by clicking `Cell -> Run All` in the notebook menu.\n",
    " \n",
    "- This assignment is designed so that you can run all cells in a few minutes of computation time. If it is taking longer than that, you probably have made a mistake in your code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Academic honesty \n",
    "\n",
    "- We will audit the Moodle code from a few dozen students, chosen at random. The audits will check that the code you wrote and turned on Moodle generates the answers you turn in on your Gradescope PDF. If you turn in correct answers on your PDF without code that actually generates those answers, we will consider this a potential case of cheating. See the course page for honesty policies.\n",
    "\n",
    "- We will also run automatic checks of code on Moodle for plagiarism. Copying code from others is considered a serious case of cheating."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Run this cell! It sets some things up for you.\n",
    "\n",
    "# This code makes plots appear inline in this document rather than in a new window.\n",
    "import matplotlib.pyplot as plt\n",
    "from __future__ import division  # this line is important to avoid unexpected behavior from division\n",
    "\n",
    "# This code imports your work from hw_1.py\n",
    "from hw_1 import *\n",
    "\n",
    "%matplotlib inline\n",
    "plt.rcParams['figure.figsize'] = (5, 4) # set default size of plots\n",
    "\n",
    "# Some more magic so that the notebook will reload external python modules;\n",
    "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# download the IMDB large movie review corpus from the class webpage to a file location on your computer\n",
    "\n",
    "PATH_TO_DATA = 'large_movie_review_dataset'  # set this variable to point to the location of the IMDB corpus on your computer\n",
    "POS_LABEL = 'pos'\n",
    "NEG_LABEL = 'neg'\n",
    "TRAIN_DIR = os.path.join(PATH_TO_DATA, \"train\")\n",
    "TEST_DIR = os.path.join(PATH_TO_DATA, \"test\")\n",
    "\n",
    "for label in [POS_LABEL, NEG_LABEL]:\n",
    "    if len(os.listdir(TRAIN_DIR + \"/\" + label)) == 12500:\n",
    "        print \"Great! You have 12500 {} reviews in {}\".format(label, TRAIN_DIR + \"/\" + label)\n",
    "    else:\n",
    "        print \"Oh no! Something is wrong. Check your code which loads the reviews\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Actually reading the data you are working with is an important part of NLP! Let's look at one of these reviews\n",
    "\n",
    "print open(TRAIN_DIR + \"/neg/3740_2.txt\").read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part One: Intro to NLP in Python: types, tokens and Zipf's law"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Types and tokens\n",
    "\n",
    "One major part of any NLP project is word tokenization. Word tokenization is the task of segmenting text into individual words, called tokens. In this assignment, we will use simple whitespace tokenization. You will have a chance to improve this for extra credit at the end of the assigment. Take a look at the `tokenize_doc` function in `hw_1.py`. **You should not modify tokenize_doc** but make sure you understand what it is doing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# We have provided a tokenize_doc function in hw_1.py. Here is a short demo of how it works\n",
    "\n",
    "d1 = \"This SAMPLE doc has   words tHat  repeat repeat\"\n",
    "bow = tokenize_doc(d1)\n",
    "\n",
    "assert bow['this'] == 1\n",
    "assert bow['sample'] == 1\n",
    "assert bow['doc'] == 1\n",
    "assert bow['has'] == 1\n",
    "assert bow['words'] == 1\n",
    "assert bow['that'] == 1\n",
    "assert bow['repeat'] == 2\n",
    "\n",
    "bow2 = tokenize_doc(\"Computer science is both practical and abstract.\")\n",
    "for b in bow2:\n",
    "    print b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1.1 (5 points)**\n",
    "\n",
    "Now we are going to count the word types and word tokens in the corpus. In the cell below, use the `word_counts` dictionary variable to store the count of each word in the corpus.\n",
    "Use the `tokenize_doc` function to break documents into tokens. \n",
    "\n",
    "`word_counts` keeps track of how many times a word type appears across the corpus. For instance, `word_counts[\"dog\"]` should store the number 990 -- the count of how many times the word `dog` appears in the corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import glob\n",
    "word_counts = {}  # you might want to use a defaultdict instead https://docs.python.org/2/library/collections.html\n",
    "                  # defaultdicts are often useful for NLP in python\n",
    "\n",
    "\n",
    "for label in [POS_LABEL, NEG_LABEL]:\n",
    "    for directory in [TRAIN_DIR, TEST_DIR]:\n",
    "        for fn in glob.glob(directory + \"/\" + label + \"/*txt\"):\n",
    "            ## Implement me!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# you should see 990 instances of the word type \"dog\" in the corpus. (updated 9/13)\n",
    "if word_counts[\"dog\"] == 990:\n",
    "    print \"yay! there are {} total instances of the word type dog in the corpus\".format(word_counts[\"dog\"])\n",
    "else:\n",
    "    print \"hrm. Something seems off. Double check your code\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1.2 (5 points)**\n",
    "\n",
    "Fill out the functions `n_word_types` and `n_word_tokens` in `hw_1.py`. These functions return the total number of word types and tokens in the corpus. **important** The autoreload \"magic\" that you setup early in the assignment should automatically reload functions as you make changes and save. If you run into trouble you can always restart the notebook and clear any .pyc files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print \"there are {} word types in the corpus\".format(n_word_types(word_counts))\n",
    "print \"there are {} word tokens in the corpus\".format(n_word_tokens(word_counts))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1.3 (5 points)**\n",
    "\n",
    "You should see a much higher number of tokens than types. Why is that? \n",
    "\n",
    "Answer in one or two sentences here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Zipf's Law\n",
    "\n",
    "**Question 1.4 (5 points)**\n",
    "\n",
    "In this section, you will verify a key statistical properties of text: [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law).\n",
    "\n",
    "Zipf's Law describes the relations between the frequency rank of words and frequency value of words.  For a word $w$, its frequency is inversely proportional to its rank:\n",
    "\n",
    "$$count_w = K \\frac{1}{rank_w}$$\n",
    "or in other words\n",
    "$$\\log(count_w) = K - \\log(rank_w)$$\n",
    "\n",
    "for some constant $K$, specific to the corpus and how words are being defined.\n",
    "\n",
    "Therefore, if Zipf's Law holds, after sorting the words descending on frequency, word frequency decreases in an approximately linear fashion under a log-log scale.\n",
    "\n",
    "Please make such a log-log plot by ploting the rank versus frequency.  Use a scatter plot where the x-axis is the *log(rank)*, and y-axis is *log(frequency)*.  You should get this information from `word_counts`; for example, you can take the individual word counts and sort them.  dict methods `.items()` and/or `values()` may be useful.  (Note that it doesn't really matter whether ranks start at 1 or 0 in terms of how the plot comes out.) You can check your results by comparing your plots to ones on Wikipedia; they should look qualitatively similar.\n",
    "\n",
    "*Please remember to label the meaning of the x-axis and y-axis.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import math\n",
    "x = []\n",
    "y = []\n",
    "X_LABEL = \"Fill in label here\"\n",
    "Y_LABEL = \"Fill in label here\"\n",
    "\n",
    "# implement me! you should fill the x and y arrays. Add your code here\n",
    "\n",
    "plt.scatter(x, y)\n",
    "plt.xlabel(X_LABEL)\n",
    "plt.ylabel(Y_LABEL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1.5 (5 points)**\n",
    "\n",
    "You should see some discountinuities on the left and right sides of this figure.  Why are we seeing them on the left?  Why are we seeing them on the right?  On the right, what are those \"ledges\"?\n",
    "\n",
    "Answer in one or two sentences here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part Two: Naive Bayes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section of the homework will walk you through coding a Naive Bayes classifier that can distinguish between postive and negative reviews (at some level of accuracy)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.1 (5 pts) ** To start, implement the `update_model` function in `hw_1.py`. Make sure to read the function comments so you know what to update. Also review the NaiveBayes class variables in the `def __init__` method of the NaiveBayes class  to get a sense of which statistics are important to keep track of. Once you have implemented `update_model`, run the train model function using the code below. You’ll need to provide the path to the dataset you downloaded to run the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nb = NaiveBayes(PATH_TO_DATA, tokenizer=tokenize_doc)\n",
    "nb.train_model()\n",
    "\n",
    "if len(nb.vocab) == 252165:\n",
    "    print \"Great! The vocabulary size is {}\".format(252165)\n",
    "else:\n",
    "    print \"Oh no! Something seems off. Double check your code before continuing. Maybe a mistake in update_model?\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exploratory analysis\n",
    "\n",
    "Let’s begin to explore the count statistics stored by the update model function. Use the provided `top_n` function to find the top 10 most common words in the positive class and top 10 most common words in the negative class. You don't have to code anything to do this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print \"TOP 10 WORDS FOR CLASS \" + POS_LABEL + \":\"\n",
    "for tok, count in nb.top_n(POS_LABEL, 10):\n",
    "    print '', tok, count\n",
    "print ''\n",
    "\n",
    "print \"TOP 10 WORDS FOR CLASS \" + NEG_LABEL + \":\"\n",
    "for tok, count in nb.top_n(NEG_LABEL, 10):\n",
    "    print '', tok, count\n",
    "print ''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.2 (5 points)**\n",
    "\n",
    "Will the top 10 words of the positive/negative classes help discriminate between the two classes? Do you imagine that processing other English text will result in a similar phenomenon?\n",
    "\n",
    "Answer in one or two sentences here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.3 (5 pts) **\n",
    "\n",
    "The Naive Bayes model assumes that all features are conditionally independent given the class label. For our purposes, this means that the probability of seeing a particular word in a document with class label $y$ is independent of the rest of the words in that document. Implement the `p_word_given_label` function. This function calculates P (w|y) (i.e., the probability of seeing word w in a document given the label of that document is y)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use your `p_word_given_label` function to compute the probability of seeing the word “fantastic” given each sentiment label. Repeat the computation for the word “boring.” "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print \"P('fantastic'|pos):\",  nb.p_word_given_label(\"fantastic\", POS_LABEL)\n",
    "print \"P('fantastic'|neg):\",  nb.p_word_given_label(\"fantastic\", NEG_LABEL)\n",
    "print \"P('boring'|pos):\",  nb.p_word_given_label(\"boring\", POS_LABEL)\n",
    "print \"P('boring'|neg):\",  nb.p_word_given_label(\"boring\", NEG_LABEL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which word has a higher probability given the positive class, fantastic or boring? Which word has a higher probability given the negative class? Is this what you would expect?\n",
    "\n",
    "Answer in one or two sentences here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.4 (5 pts)**\n",
    "\n",
    "In the next cell, compute the probability of the word \"car-thievery\" in the positive training data and negative training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print \"P('car-thievery'|pos):\",  nb.p_word_given_label(\"car-thievery\", POS_LABEL)\n",
    "print \"P('car-thievery'|neg):\",  nb.p_word_given_label(\"car-thievery\", NEG_LABEL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What do you notice about \"P('car-thievery'|neg)\"? Why do you see this number? What would happen if we took the log of \"P('car-thievery'|neg)\"? What would happen if we multiplied \"P('car-thievery'|neg)\" by \"P('cliche'|neg)\"? Why might these operations cause problems for a Naive Bayes classifier?\n",
    "\n",
    "Answer in one or two sentences here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.5 (5 pts)**\n",
    "\n",
    "We can address the issues from question 2.4 with psuedocounts. A psuedocount is a fixed amount added to the count of each word stored in our model. Psuedocounts are used to help smooth calculations involving words for which there is little data. Implement\n",
    "`p_word_given_label_and_psuedocount` and then run the next cell. Hint: look at the slides from the lecture on pseudocounts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print \"P('car-thievery'|neg):\",  nb.p_word_given_label_and_pseudocount(\"car-thievery\", NEG_LABEL, 1.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.6 (getting ready for question 2.10)**\n",
    "\n",
    "*Prior and Likelihood* \n",
    "\n",
    "As noted before, the Naive Bayes model assumes that all words in a document are independent of one another given the document’s label. Because of this we can write the likelihood of a document as:\n",
    "\n",
    "$P(w_{d1},\\cdots,w_{dn}|y_d) = \\prod_{i=1}^{n}P(w_{di}|y_d)$\n",
    "\n",
    "However, if a document has a lot of words, the likelihood will become extremely small and we’ll encounter numerical underflow. Underflow is a common problem when dealing with prob- abilistic models; if you are unfamiliar with it, you can get a brief overview on [Wikipedia](https:/en.wikipedia.org/wiki/Arithmetic_underflow). To deal with underflow, a common transformation is to work in log-space.\n",
    "\n",
    "$\\log[P(w_{d1},\\cdots,w_{dn}|y_d)] = \\sum_{i=1}^{n}\\log[P(w_{di}|y_d)]$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Implement the `log_likelihood` function (Hint: it should make calls to the p word given label and psuedocount function).\n",
    "Implement the `log_prior` function. This function takes a class label and returns the log of the fraction of the training documents that are of that label.\n",
    "\n",
    "There is nothing to print out for this question. But you will use these functions in a moment..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.7 (5 pts) **\n",
    "\n",
    "Naive Bayes is a model that tells us how to compute the posterior\n",
    "probability of a document being of some label (i.e.,\n",
    "$P(y_d|\\mathbf{w_d})$).  Specifically, we do so using bayes rule:\n",
    "\n",
    "  $P(y_d|\\mathbf{w_d}) = \\frac{P(y_d)P(\\mathbf{w_d}|y_d)}{P(\\mathbf{w_d})}$\n",
    "\n",
    "In the previous section you implemented functions to compute both\n",
    "the log prior ($\\log[P(y_d)]$) and the log likelihood\n",
    "($\\log[P( \\mathbf{w_d} |y_d)]$ ). Now, all your missing is the\n",
    "*normalizer*, $P(\\mathbf{w_d})$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Derive the normalizer by expanding $P(\\mathbf{w_d})$. You will have to use \"MathJax\" to write out the equations. MathJax is very similar to LaTeX. 99% of the MathJax you will need to write for this course (and others at U Mass) is included in the first answer of [this](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) tutorial. MathJax and LaTeX can be annoying first, but once you get a little practice, using these tools will feel like second nature.\n",
    "\n",
    "\n",
    "Derive the normalizer by expanding $P(\\mathbf{w_d})$. Fill out the answer with MathJax here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.8 (5 pts)**\n",
    "\n",
    "One way to classify a document is to compute the unnormalized log posterior for both labels and take the argmax (i.e., the label that yields the higher unnormalized log posterior). The unnormalized log posterior is the sum of the log prior and the log likelihood of the document. Why don’t we need to compute the log normalizer here?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Answer in one or two sentences here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.9 (15 pts)**\n",
    "\n",
    "Implement the `unnormalized_log_posterior` function and the `classify` function. The `classify` function should use the unnormalized log posteriors but should not compute the normalizer. Once you implement the `classify` function, we'd like to evaluate its accuracy. `evaluate_classifier_accuracy` is implemented for you so you don't need to change that method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print nb.evaluate_classifier_accuracy(1.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.10 (5 pts)**\n",
    "\n",
    "Try evaluating your model again with a pseudocount parameter of 500."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print nb.evaluate_classifier_accuracy(500.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Does the accuracy go up or down when the pseudo count parameter is raised to 500? Why do you think this is?\n",
    "\n",
    "Answer in one or two sentences here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.11 (5 pts)**\n",
    "\n",
    "Our trained model can be queried to do exploratory data analysis. We\n",
    "saw that the top 10 most common words for each class were not very\n",
    "discriminative. Often times, a more descriminative statistic is a\n",
    "word's likelihood ratio. A word's likelihood ratio is defined as\n",
    "\n",
    "$LR(w)=\\frac{P(w|y=\\mathrm{pos})}{P(w|y=\\mathrm{neg})}$\n",
    "\n",
    "A word with $LR=5$ is five times more likely to appear in a positive\n",
    "review than it is in a negative review; a word with $LR=0.33$ is one\n",
    "third as likely to appear in a positive review than a negative review"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Implement the nb.likelihod_ratio function and use it to investigate the likelihood ratio of \"fantastic\" and \"boring\"\n",
    "print \"LIKEHOOD RATIO OF 'fantastic':\", nb.likelihood_ratio('fantastic', 1.0)\n",
    "print \"LIKEHOOD RATIO OF 'boring':\", nb.likelihood_ratio('boring', 1.0)\n",
    "print \"LIKEHOOD RATIO OF 'the':\", nb.likelihood_ratio('the', 1.0)\n",
    "print \"LIKEHOOD RATIO OF 'to':\", nb.likelihood_ratio('to', 1.0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Does it make sense that $LR$('fantastic') $>$ $LR$('to')? \n",
    "\n",
    "Answer in one or two sentences here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Question 2.12 (15 pts)** \n",
    "\n",
    "Find a review that your classifier got wrong. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# in this cell, print out a review that your classifier got wrong. Print out the text of the review along with the label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are two reasons your system might have misclassified this example? What improvements could you make that may help your system classify this example correctly? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fill out reason nubmer 1 here | Fill out a possible improvement here\n",
    "\n",
    "\n",
    "Fill out reason nubmer 2 here | Fill out a possible improvement here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Extra credit (up to 10 points) **\n",
    "\n",
    "If you don't want to do the extra credit, you can stop here! Otherwise... keep reading... \n",
    " \n",
    "In this assignment, we use whitespace tokenization to create a bag-of-unigrams representation for the movie reviews. It is possible to improve this represetation to improve your classifier's performance. Use your own code or an external library such as nltk to perform tokenization, text normalization, word filtering, etc. Fill out your work in `def tokenize_doc_and_more` (below) and then show improvement by running the following.\n",
    "\n",
    "`nb = NaiveBayes(PATH_TO_DATA, tokenizer=tokenize_doc_and_more)\n",
    "nb.train_model()\n",
    "nb.evaluate_classifier_accuracy(1.0)\n",
    "`\n",
    "\n",
    "Roughly speaking, the larger performance improvement, the more extra credit. However, doing a good job investigating, explaining and justifying your work with small experiments and comments is also extremely important. Make sure to describe what you did and analyze why your method works. Use this ipython notebook to show your work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def tokenize_doc_and_more(doc):\n",
    "    \"\"\"\n",
    "    Return some representation of a document.\n",
    "    At a minimum, you need to perform tokenization, the rest is up to you.\n",
    "    \"\"\"\n",
    "    # Implement me!\n",
    "    bow = defaultdict(float)\n",
    "    \n",
    "    # your code goes here\n",
    "    \n",
    "    return bow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nb = NaiveBayes(PATH_TO_DATA, tokenizer=tokenize_doc_and_more)\n",
    "nb.train_model()\n",
    "nb.evaluate_classifier_accuracy(1.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use cells at the bottom of this notebook to explain what you did in `better_tokenize_doc`. Include any experiments or explanations that you used to decide what goes in your function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Your experiments and explanations go here"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}