Originally posted 18 November 2012. Due at 11:59 p.m. EDT on Friday 7 December 2012, by placing your .java files in a subdirectory called proj5 of your cs187 directory on your edlab account.
For more information on accessing the edlab, see here or ask in discussion section.
Correction in green made on 19 November 2012.
Clarifications in purple added on 29 November.
Goals of this project:
Questions and answers about this project are collected here.
In Chapter 8 of DJW (and in their code base available on their web site) there is a class called FrequencyList that determines the most common words in a given text file. It takes two parameters from the console, the minimum word length and the minimum count to report. It takes its input from a file named words.dat and sends a report to the console, in a format that may be seen on pages 596-7. It works by reading the file with a Scanner, identifying each word in the text, and placing a WordFreq object for each word in a BinarySearchTree, sorted by alphabetical order on the words.
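To make the description above concrete, here is a minimal sketch of the same idea. It is not DJW's code: the class name FrequencySketch is hypothetical, a java.util.TreeMap stands in for their BinarySearchTree of WordFreq objects, and the rule for what counts as a word (runs of letters and apostrophes, folded to lower case) is an assumption.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

// Hypothetical sketch, not DJW's FrequencyList.
class FrequencySketch {

    // Counts every word of at least minSize characters.
    // The TreeMap keeps its keys in alphabetical order, much as a
    // binary search tree sorted on the words would.
    static TreeMap<String, Integer> countWords(Scanner in, int minSize) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        in.useDelimiter("[^a-zA-Z']+"); // assumed definition of a word
        while (in.hasNext()) {
            String word = in.next().toLowerCase();
            if (word.length() >= minSize) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Scanner in = new Scanner("The people, the PEOPLE and the union");
        TreeMap<String, Integer> counts = countWords(in, 3);
        int minFreq = 2; // minimum count to report
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minFreq) {
                System.out.println(entry.getKey() + " " + entry.getValue());
            }
        }
    }
}
```

Because the map is walked in key order, the words that pass the minimum count come out alphabetically, whatever their counts are.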
You are to write a new class PQFrequency that has mostly the same functionality as FrequencyList, with the following changes:
It can be run either by saying java PQFrequency and then giving the parameters to the console, or by saying java PQFrequency 7 6 words.dat on the command line.
It uses a java.util.PriorityQueue object, using the methods described in the Java API for that class. Note that you will have to write a new class for word-count pairs, similar to DJW's WordFreq but with a different compareTo method.
It sends its report to a file named report.dat rather than to the console.
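As an illustration of how a compareTo method drives a java.util.PriorityQueue, here is a rough sketch. The class names WordCount and PriorityQueueSketch are hypothetical, and the ordering shown (highest count first, ties broken alphabetically) is only one possible choice; this handout does not say which ordering your compareTo should impose, so treat it as an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical word-count pair; the name and the ordering are assumptions.
class WordCount implements Comparable<WordCount> {
    final String word;
    final int count;

    WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    // A PriorityQueue hands back its "smallest" element first, so a
    // higher count is made to compare as smaller. Ties are alphabetical.
    @Override
    public int compareTo(WordCount other) {
        if (count != other.count) {
            return Integer.compare(other.count, count);
        }
        return word.compareTo(other.word);
    }

    @Override
    public String toString() {
        return word + " " + count;
    }
}

class PriorityQueueSketch {

    // Removes pairs in priority order until the counts fall below minFreq.
    static List<String> report(PriorityQueue<WordCount> queue, int minFreq) {
        List<String> lines = new ArrayList<>();
        while (!queue.isEmpty() && queue.peek().count >= minFreq) {
            lines.add(queue.poll().toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        PriorityQueue<WordCount> queue = new PriorityQueue<>();
        queue.add(new WordCount("union", 4));
        queue.add(new WordCount("people", 9));
        queue.add(new WordCount("states", 9));
        queue.add(new WordCount("rare", 1));
        for (String line : report(queue, 3)) {
            System.out.println(line);
        }
    }
}
```

Only add, peek, poll and isEmpty are used here. Iterating over a PriorityQueue directly does not visit the elements in priority order, which is why the sketch removes them one at a time.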
Completing this much of the project earns a C.
Completing some or all of these additional tasks can raise your grade to an A or even an A+.
(1) Write a class (called CompareFrequency) that will input the same parameters as PQFrequency and do the task of PQFrequency twice, once as FrequencyList does it and once as PQFrequency does it. It should check that the outputs are identical and report the time taken for each job to the console. The program can check the current time at any point by creating a Date object (again, see the API). Your report to the console should say whether the two output files are identical (after actually checking them) and how long each method took to produce them. The format of your report is not important, as we will be looking at it by eye rather than string-matching.
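The timing and the file check might be sketched as follows. Everything named here (TimingSketch, timeMillis, sameContents, writeTemp) is hypothetical, the two "jobs" are stand-ins that just write a line to a temporary file, and comparing the files byte for byte is one assumed reading of "identical".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Date;

// Hypothetical sketch of timing two jobs and comparing their output files.
class TimingSketch {

    // Elapsed milliseconds for one job, measured with two Date objects.
    static long timeMillis(Runnable job) {
        Date start = new Date();
        job.run();
        Date end = new Date();
        return end.getTime() - start.getTime();
    }

    // True when the two files contain exactly the same bytes.
    static boolean sameContents(Path a, Path b) {
        try {
            return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for a real job: writes text to a fresh temporary file.
    static Path writeTemp(String text) {
        try {
            Path file = Files.createTempFile("report", ".dat");
            Files.write(file, text.getBytes());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path[] out = new Path[2];
        long first = timeMillis(() -> out[0] = writeTemp("people 9\n"));
        long second = timeMillis(() -> out[1] = writeTemp("people 9\n"));
        System.out.println("Outputs identical: " + sameContents(out[0], out[1]));
        System.out.println("First job: " + first + " ms, second job: " + second + " ms");
    }
}
```

Date only resolves to the millisecond, so a job on a small input may well report 0 ms.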
(2) Write a variant of PQFrequency (still using the priority queue) that also takes a "separator word" as a parameter. (In our example of the Federalist Papers, our separator word will be "Federalist" as that occurs at the beginning of each individual essay and nowhere within any essay. The word "federalists" occurs in essay #10 but you are only looking for the separator word occurring as a word, with delimiters or whitespace on either side.)
Your class MultitextFrequency should treat its input first as a single document (the corpus), then as a collection of documents (the items) broken up by each occurrence of the separator word. For each item, you are to report two things:
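its three most common words, in decreasing order of frequency, and its three words of highest tf-idf score, in decreasing order of score. The tf-idf score is a double rather than an integer, and it is computed for each word w as (the frequency of w in this item) multiplied by (the natural log of (the number of items in the corpus divided by the number of items in which w occurs at least once)).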
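The score described above comes down to one line of arithmetic. The sketch below assumes that "frequency" means the raw count of w in the item; the class and parameter names are hypothetical.

```java
// Hypothetical sketch of the tf-idf arithmetic described above.
class TfIdfSketch {

    // (count of w in this item) * ln(items in corpus / items containing w)
    static double tfIdf(int countInItem, int itemsInCorpus, int itemsContainingWord) {
        // The cast forces floating-point division; integer division
        // would truncate the ratio before the log is taken.
        return countInItem * Math.log((double) itemsInCorpus / itemsContainingWord);
    }

    public static void main(String[] args) {
        // A word found in every item: the ratio is 1, its log is 0,
        // so the score is 0 however often the word occurs.
        System.out.println(tfIdf(12, 80, 80));
        // A word found in only a few items scores well above 0.
        System.out.println(tfIdf(5, 80, 2));
    }
}
```

Math.log is the natural logarithm. Note that itemsContainingWord is never 0 for a word that actually occurs in the item being scored, so the division is safe in that case.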
The line for each item has the form
number: commonword, lesscommonword, stilllesscommonword; hightfidf, nexttfidf, thirdtfidf
with the commas, spaces, colon and semicolon as specified and no tab characters. (Here "number" is the number of the item in order, starting with 0. Note that this may not equal the number given to the item in the text, because there are a few items before Essay #1 and there are two versions of Essay #70.)
Write your report to a file called MultitextReport.txt. Except for the content of the line for each essay, the format is up to you.
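Breaking the input into items and laying out one report line could look roughly like this. The names ItemSplitSketch, splitItems and formatLine are hypothetical, and the sketch makes three assumptions you should check against your own reading of the task: the separator word is matched without regard to case, the separator itself belongs to no item, and whatever precedes the first separator counts as item 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch of splitting a corpus into items at a separator word.
class ItemSplitSketch {

    // Returns the words of each item, in order. Only a whole word equal
    // to the separator starts a new item, so "federalists" does not.
    static List<List<String>> splitItems(Scanner in, String separator) {
        in.useDelimiter("[^a-zA-Z']+"); // assumed definition of a word
        List<List<String>> items = new ArrayList<>();
        List<String> current = new ArrayList<>();
        while (in.hasNext()) {
            String word = in.next();
            if (word.equalsIgnoreCase(separator)) {
                items.add(current);
                current = new ArrayList<>();
            } else {
                current.add(word.toLowerCase());
            }
        }
        items.add(current);
        return items;
    }

    // Builds one line: "number: w1, w2, w3; t1, t2, t3" with no tabs.
    static String formatLine(int number, String[] common, String[] tfidf) {
        return number + ": " + String.join(", ", common)
                + "; " + String.join(", ", tfidf);
    }

    public static void main(String[] args) {
        Scanner in = new Scanner("preface FEDERALIST one two Federalist three federalists");
        List<List<String>> items = splitItems(in, "Federalist");
        for (int i = 0; i < items.size(); i++) {
            System.out.println(i + " has " + items.get(i).size() + " words");
        }
        System.out.println(formatLine(0,
                new String[] {"people", "states", "union"},
                new String[] {"militia", "senate", "treaty"}));
    }
}
```

The words in the sample line are made up for the demonstration; they are not results from the Federalist Papers.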
Authorship.txt.
Completing additional tasks (1) and (2) will earn an A. A serious, thoughtful response to (3) may raise your grade as much as 1/3 letter, e.g., B- to B or A to A+.
You should put your classes and files in your EdLab space, in a directory called "cs187/proj5", and test the behavior with a main method or a driver class. (We will test the code with our own driver class.)
If you create your code within Eclipse (which we encourage but don't require), you will want to ignore the warning message that says "The use of the default package is discouraged". Bear in mind that if you make the mistake of declaring your class in a package, our driver class will not see it (since it will not be in a package) and we won't be able to test your code.
Last modified 29 November 2012