CMPSCI 187: Programming With Data Structures

David Mix Barrington

Fall, 2012

Programming Project #5: Word Frequencies

Originally posted 18 November 2012; due at 11:59 p.m. EDT on Friday 7 December 2012. Submit by placing your .java files in a subdirectory called proj5 of your cs187 directory on your EdLab account. For more information on accessing the EdLab, see here or ask in discussion section.

Correction in green made on 19 November 2012.

Clarifications in purple added on 29 November.

Goals of this project:

Adapt an existing program that uses binary search trees to determine word frequencies in a document.
Write a program using a priority queue to determine word frequencies.
Compare the running times of these two implementations for the same problem.
Write a program taking input from and giving output to files, and using the command line to get parameters.
Use word frequencies and the related "tf-idf" statistic to analyze a set of documents, if possible making inferences about authorship.

Questions and answers about this project are collected here.

In Chapter 8 of DJW (and in their code base, available on their web site) there is a class called FrequencyList that determines the most common words in a given text file. It takes two parameters from the console: the minimum word length and the minimum count to report. It takes its input from a file named words.dat and sends a report to the console, in a format that may be seen on pages 596-7. It works by reading the file with a Scanner, identifying each word in the text, and placing a WordFreq object for each word in a BinarySearchTree, sorted alphabetically by word.

First Half of the Project:

You are to write a new class PQFrequency that mostly has the same functionality as FrequencyList, with the following changes:

The user may specify the input file name as well as the minimum word size and the minimum count to report.
The user may specify these things either from the console or from the command line. The output from the example on pages 596-7 could thus be produced in either of two ways: saying java PQFrequency and then giving the parameters to the console, or saying java PQFrequency 7 6 words.dat on the command line.
The program should store the words and frequency counts in a priority queue, with the frequency as the priority, rather than in a BST. This priority queue should be a java.util.PriorityQueue object, using the methods described in the Java API for that class. Note that you will have to write a new class for word-count pairs, similar to DJW's WordFreq but with a different compareTo method.
The program's output should be to a file named report.dat rather than to the console.
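
As a rough illustration of the priority-queue change described above, here is a minimal sketch. The names WordCount, PQSketch, and report are hypothetical and not part of DJW's code. The sketch also makes several assumptions of its own: it tallies counts in a HashMap before loading the queue, it lowercases the text, and it splits on anything that is not a letter, which may not match how FrequencyList delimits words. Writing to report.dat and matching the book's report format are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical word-count pair; DJW's WordFreq compares by word instead.
class WordCount implements Comparable<WordCount> {
    final String word;
    final int count;

    WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    // Higher counts sort first, so the queue's head is the most frequent word.
    // Ties fall back to alphabetical order (an assumption, for stable output).
    @Override
    public int compareTo(WordCount other) {
        int byCount = Integer.compare(other.count, this.count);
        return byCount != 0 ? byCount : this.word.compareTo(other.word);
    }
}

class PQSketch {
    // Counts words of at least minSize letters, then drains a PriorityQueue
    // to list those appearing at least minFreq times, most frequent first.
    static List<String> report(String text, int minSize, int minFreq) {
        // Assumption: tally in a HashMap first, then load the queue once.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (w.length() >= minSize) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        PriorityQueue<WordCount> queue = new PriorityQueue<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            queue.add(new WordCount(e.getKey(), e.getValue()));
        }
        List<String> lines = new ArrayList<>();
        while (!queue.isEmpty() && queue.peek().count >= minFreq) {
            WordCount wc = queue.poll();
            lines.add(wc.word + " " + wc.count);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : report("the cat and the hat and the bat", 3, 2)) {
            System.out.println(line);
        }
    }
}
```

The counts are tallied before anything enters the queue because java.util.PriorityQueue has no operation for changing an element's priority in place, and its remove(Object) method takes linear time. Other designs are possible.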

Completing this much of the project earns a C.
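
The requirement that parameters may come either from the console or from the command line could be handled along the following lines. This is only a sketch: the class ParamSketch, the method readParams, and the prompt wording are hypothetical, and the argument order (minimum size, minimum count, file name) is assumed from the example java PQFrequency 7 6 words.dat.

```java
import java.util.Scanner;

class ParamSketch {
    // Returns {minSize, minFreq, fileName} as strings, taken from args when
    // all three are present and otherwise read from the given Scanner.
    static String[] readParams(String[] args, Scanner console) {
        if (args.length == 3) {
            return new String[] { args[0], args[1], args[2] };
        }
        // Hypothetical prompts; the assignment does not fix their wording.
        System.out.print("Minimum word size: ");
        String minSize = console.next();
        System.out.print("Minimum word frequency: ");
        String minFreq = console.next();
        System.out.print("Input file name: ");
        String fileName = console.next();
        return new String[] { minSize, minFreq, fileName };
    }

    public static void main(String[] args) {
        String[] p = readParams(args, new Scanner(System.in));
        int minSize = Integer.parseInt(p[0]);
        int minFreq = Integer.parseInt(p[1]);
        System.out.println(minSize + " " + minFreq + " " + p[2]);
    }
}
```

Passing the Scanner in as a parameter, rather than creating it inside the method, makes it possible to exercise the console path from a test by wrapping a String.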

Additional Tasks for the Project

Completing some or all of these additional tasks can raise your grade to an A or even an A+.

(1) Write a version of the program (in a class CompareFrequency) that will input the same parameters as PQFrequency and do the task of PQFrequency twice, once as FrequencyList does it and once as PQFrequency does it. It should check that the outputs are identical and report the time taken for each job to the console. The program can check the current time at any point by creating a Date object (again, see the API). Your report to the console should say whether the two output files are identical (after actually checking them) and how long each method took to produce them. The format of your report is not important, as we will be looking at it by eye rather than string-matching.
(2) Write a version of PQFrequency (still using the priority queue) that also takes a "separator word" as a parameter. (In our example of the Federalist Papers, our separator word will be "Federalist" as that occurs at the beginning of each individual essay and nowhere within any essay. The word "federalists" occurs in essay #10 but you are only looking for the separator word occurring as a word, with delimiters or whitespace on either side.)
Your class MultitextFrequency should treat its input first as a single document (the corpus), then as a collection of documents (the items) broken up by each occurrence of the separator word. For each item, you are to report two things:
1. The three most common words in the item (of length at least six letters), and
2. The three words in the item with the highest tf-idf score, relative to the entire corpus. The tf-idf score is a double rather than an integer, and it is computed for each word w as (the frequency of w in this item) multiplied by (the natural log of (the number of items in the corpus divided by the number of items in which w occurs at least once)).
3. Your report should have a single line for each item, in the format:
  number: commonword, lesscommonword, stilllesscommonword; hightfidf, nexttfidf, thirdtfidf
  with the commas, spaces, colon and semicolon as specified and no tab characters. (Here "number" is the number of the item in order, starting with 0. Note that this may not equal the number given to the item in the text, because there are a few items before Essay #1 and there are two versions of Essay #70.)
4. Your report should be in a file called MultitextReport.txt. Except for the content of the line for each essay, the format is up to you.
(3) The Federalist Papers are known to have been written by Alexander Hamilton, John Jay, and James Madison, but historians are not entirely agreed on who wrote which. Our text, taken from Project Gutenberg, has their best guess as to the author of each paper. Your task is to say anything you can about what the information in part (2) above says about the authorship of the papers. Do your results show any pattern with respect to the claimed authors of each text? Your report may have any format you choose and should be in a file called Authorship.txt.
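
The tf-idf definition given above can be written out directly. The sketch below assumes that "frequency of w in this item" means the raw count of w in the item, and it represents each item as a list of already-extracted words. The class TfIdfSketch, its method names, and the sample words are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class TfIdfSketch {
    // tf-idf of word w in one item: (count of w in the item) times
    // ln(number of items / number of items containing w at least once).
    static double tfIdf(int countInItem, int totalItems, int itemsWithWord) {
        return countInItem * Math.log((double) totalItems / itemsWithWord);
    }

    // For each word, counts how many items contain it at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> items) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> item : items) {
            // The HashSet removes repeats so each item counts a word once.
            for (String w : new HashSet<>(item)) {
                df.merge(w, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> items = Arrays.asList(
            Arrays.asList("union", "union", "states"),
            Arrays.asList("states", "militia"));
        Map<String, Integer> df = documentFrequency(items);
        // "union" occurs twice in item 0 and appears in 1 of the 2 items.
        System.out.println(tfIdf(2, items.size(), df.get("union")));
        // "states" appears in every item, so its score is 0.
        System.out.println(tfIdf(1, items.size(), df.get("states")));
    }
}
```

Note the cast to double before dividing: with integer division, any word appearing in more than half of the items would get a ratio of 1 and a score of 0.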

Completing additional tasks (1) and (2) will earn an A. A serious, thoughtful response to (3) may raise your grade by as much as 1/3 of a letter, e.g., from B- to B or from A to A+.
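
For the timing and file comparison in additional task (1), one possible approach is sketched below. It measures elapsed time by creating a Date before and after a task, as the assignment suggests, and compares two files byte by byte. The class TimingSketch, its method names, and the placeholder task are hypothetical. Reading both files fully into memory is an assumption that suits small report files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Date;

class TimingSketch {
    // Runs the task and returns the elapsed wall-clock time in milliseconds,
    // measured by creating a Date before and after.
    static long timeMillis(Runnable task) {
        long start = new Date().getTime();
        task.run();
        return new Date().getTime() - start;
    }

    // Returns true when the two files hold exactly the same bytes.
    static boolean sameContents(Path a, Path b) throws IOException {
        return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for one of the two frequency jobs.
        long elapsed = timeMillis(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("Placeholder task took " + elapsed + " ms");
    }
}
```

Date has millisecond resolution, so a job on a short input may well report 0 ms. Running on a larger file such as the full corpus should give more meaningful numbers.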

You should put your classes and files in your EdLab space, in a directory called "cs187/proj5", and test the behavior with a main method or a driver class. (We will test the code with our own driver class.)

If you create your code within Eclipse (which we encourage but don't require), you will want to ignore the warning message that says "The use of the default package is discouraged". Bear in mind that if you make the mistake of declaring your class in a package, our driver class will not see it (since it will not be in a package) and we won't be able to test your code.

Last modified 29 November 2012