Question text in black, answers in blue.
contains method of java.util.PriorityQueue and it is not finding objects
that I am certain are in the queue.
I looked at your code and found the problem. Your WordFreq class needs an
equals method, because the object you are searching with is not the same
object as the one in the queue, only an object with the same string. The
contains method of that standard Java class uses equals. Furthermore, it
uses public boolean equals(Object o), so you must write an equals method
that takes an Object parameter. (This is different from compareTo, which
can take a T parameter to implement Comparable<T>.) Your equals method
should take its Object argument and immediately cast it into a WordFreq,
or whatever class you are writing for.
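As a rough illustration of the point above, here is a minimal sketch of such an equals method. The WordFreq class shown is a hypothetical stand-in, not the DJW class: its field names and the inc method are assumptions. The sketch checks the type with instanceof before casting, which is a small safety addition to the "cast immediately" advice, and it overrides hashCode as well so the two methods stay consistent.

```java
import java.util.PriorityQueue;

// Hypothetical stand-in for the student's WordFreq class; field names are assumed.
class WordFreq implements Comparable<WordFreq> {
    private final String word;
    private int freq;

    WordFreq(String word) {
        this.word = word;
        this.freq = 0;
    }

    String getWord() { return word; }
    int getFreq() { return freq; }
    void inc() { freq++; }

    // compareTo may take a WordFreq because the class implements Comparable<WordFreq>.
    @Override
    public int compareTo(WordFreq other) {
        return word.compareTo(other.word);
    }

    // equals must take an Object so that it overrides Object.equals,
    // which is the method PriorityQueue.contains calls.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordFreq)) {
            return false;
        }
        WordFreq other = (WordFreq) o;
        return word.equals(other.word);
    }

    // Keep hashCode consistent with equals.
    @Override
    public int hashCode() {
        return word.hashCode();
    }
}

class EqualsDemo {
    public static void main(String[] args) {
        PriorityQueue<WordFreq> queue = new PriorityQueue<>();
        queue.add(new WordFreq("liberty"));
        // A different object holding the same string is now found.
        System.out.println(queue.contains(new WordFreq("liberty")));
    }
}
```

If equals were declared as equals(WordFreq other) instead, it would overload rather than override Object.equals, and contains would keep using the inherited identity comparison.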
Question 5.2, posted 20 November: I fixed that, thanks. When I
found that the word I was looking for already had an object in the queue, I
tried to increment that object, but it didn't work. I realized I was
incrementing the object I had just created to call contains on, not the
object that was in the queue. What I need now is the get method, but
java.util.PriorityQueue doesn't have that method.
No, it doesn't. Priority queues aren't meant to be searched this way, which is one of the many ways they are the wrong data structure for
this job. (But it makes the project more fun, doesn't it?) You have at least
two choices that I can see. You can make an Iterator for the queue, which
will give you the elements in some unknown order, and then
increment the right object when the iterator gives it to you. Or you could
dequeue objects until you dequeue the right one, keeping them somewhere until
you can add them back. The latter will be faster if you are looking for
common words that are near the front of the queue.
Question 5.3, posted 21 November: I had a strange error when I
modified the code on page 595 to get the input from the user. I'm not really
sure what the line skip = conIn.nextLine( ); is doing there anyway, but my
attempt to get the file name from the user caused the program to hang: it did
not take any input, throw an exception, or do anything else.
Our console input using a Scanner has normally taken in a String with the
nextLine method. But here we want to take two int inputs, for which we are
using the nextInt method of Scanner. Unlike nextLine, nextInt doesn't read
the linebreak character after reading the digits. So to continue with
anything, we need to read that character. I don't really see the point in
assigning the string read to a variable as DJW do, but whatever.
If you ask for the file name after asking for the two numbers, you need that
skip line after the first two requests but not after the last one.
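To make the nextInt and nextLine interaction concrete, here is a small sketch, not the code from page 595. The class and method names are made up for illustration, and a String stands in for the console so that it runs without any typing.

```java
import java.util.Scanner;

class SkipLineDemo {
    // Reads two ints and then a file name from the given Scanner.
    static String readInputs(Scanner conIn, int[] numbers) {
        numbers[0] = conIn.nextInt();
        conIn.nextLine();          // consume the rest of the line that nextInt left behind
        numbers[1] = conIn.nextInt();
        conIn.nextLine();          // same again after the second number
        return conIn.nextLine();   // reads a whole line, so no skip is needed after it
    }

    public static void main(String[] args) {
        // Assumed sample input: two numbers and a file name, one per line.
        Scanner conIn = new Scanner("6\n7\nfedpapers.txt\n");
        int[] numbers = new int[2];
        String fileName = readInputs(conIn, numbers);
        System.out.println(numbers[0] + " " + numbers[1] + " " + fileName);
    }
}
```

Without the second skip, the final nextLine would return the empty remainder of the line holding the second number instead of the file name.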
Question 5.4, posted 29 November: You say that the outputs of
FrequencyList and PQFrequency are supposed to be identical. Does that mean
the latter has to be alphabetical?
Yes.
Ok. It would be much easier if it were in frequency order, because that's how the objects would come out of the queue.
Yes, it would. You're going to need an extra sorting phase. I'd suggest
converting each PQFreq object into a WordFreq object when you take it out,
then putting it into a separate priority queue of WordFreq objects from
which you can take the objects out in alphabetical order.
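One way the extra sorting phase could look is sketched here. Both entry classes are hypothetical and reduced to the minimum: PQFreq is assumed to compare by frequency (highest first) and WordFreq by word. Since the head of a java.util.PriorityQueue is the least element under compareTo, the second queue gives the words back in alphabetical order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class AlphabeticalPass {
    // Hypothetical frequency-ordered entry: the highest count comes out first.
    static class PQFreq implements Comparable<PQFreq> {
        final String word;
        final int freq;
        PQFreq(String word, int freq) { this.word = word; this.freq = freq; }
        public int compareTo(PQFreq other) { return Integer.compare(other.freq, freq); }
    }

    // Hypothetical word-ordered entry: compared by the word only.
    static class WordFreq implements Comparable<WordFreq> {
        final String word;
        final int freq;
        WordFreq(String word, int freq) { this.word = word; this.freq = freq; }
        public int compareTo(WordFreq other) { return word.compareTo(other.word); }
    }

    // Empties the frequency-ordered queue and returns report lines in alphabetical order.
    static List<String> alphabeticalReport(PriorityQueue<PQFreq> byFreq) {
        PriorityQueue<WordFreq> byWord = new PriorityQueue<>();
        while (!byFreq.isEmpty()) {
            PQFreq entry = byFreq.poll();
            byWord.add(new WordFreq(entry.word, entry.freq));
        }
        List<String> lines = new ArrayList<>();
        while (!byWord.isEmpty()) {
            WordFreq entry = byWord.poll();
            lines.add(entry.word + " " + entry.freq);
        }
        return lines;
    }

    public static void main(String[] args) {
        PriorityQueue<PQFreq> byFreq = new PriorityQueue<>();
        byFreq.add(new PQFreq("union", 3));
        byFreq.add(new PQFreq("liberty", 5));
        byFreq.add(new PQFreq("faction", 1));
        for (String line : alphabeticalReport(byFreq)) {
            System.out.println(line);
        }
    }
}
```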
Question 5.5, posted 29 November: When we compute tf-idf values, do we do it only for words that are above the minimum length?
Yes. Only words above the minimum length need to have objects made to put in the queue. But minimum count doesn't have the same role as it used to -- it might be that a word that occurs only once is among the top three for tf-idf score.
Question 5.6, posted 29 November: I created a file and wrote my report to it. The report was there when I read the file, but so was a string of strange characters before the report started. Do you know why?
Actually, I don't, but it appears that this happens only when you use the
writeObject method as DJW explain with regard to serializable objects. You
only want to write String objects, which you can do with simple I/O methods
for text files.
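For what it's worth, an ObjectOutputStream writes a binary stream header and type information ahead of the data, which is a likely source of those strange characters. Here is a sketch of plain text output instead, using a PrintWriter. The class and method names are made up for illustration, and the roundTrip helper exists only so the sketch can check itself against a temporary file.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class TextReportDemo {
    // Writes each report line as ordinary text, one per line.
    static void writeReport(Path file, List<String> lines) {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            for (String line : lines) {
                out.println(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the lines to a temporary file, reads them back, and deletes the file.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path file = Files.createTempFile("report", ".txt");
            writeReport(file, lines);
            List<String> readBack = Files.readAllLines(file);
            Files.delete(file);
            return readBack;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(List.of("faction 1", "liberty 5")));
    }
}
```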
Question 5.7, posted 29 November: I had an idea for the problem
identified in 5.2 above, that there is no get method in
java.util.PriorityQueue. I can't add code to that class myself, but what if
I create a class MyPriorityQueue that extends PriorityQueue, then write my
get method in that class?
I like that idea -- your method still has to use a linear
search which takes O(q) time on a queue with q elements in the worst case, but
it's nice to have the get code written just once. Remember that in a class
extending PriorityQueue, you will only have access to public and protected
methods of that class. But you probably want to create an Iterator object,
and the iterator method of PriorityQueue is public.
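A minimal sketch of that idea follows. The name MyPriorityQueue comes from the question, but the signature and behavior of get here are only one possible choice: it returns the stored element that equals the target, or null if there is none. The for-each loop uses the public iterator method, so the search is linear as noted above.

```java
import java.util.PriorityQueue;

class MyPriorityQueue<E> extends PriorityQueue<E> {
    // Returns the queued element equal to target, or null if none is present.
    public E get(Object target) {
        for (E element : this) {   // for-each uses the public iterator() of PriorityQueue
            if (element.equals(target)) {
                return element;
            }
        }
        return null;
    }
}

class GetDemo {
    public static void main(String[] args) {
        MyPriorityQueue<String> queue = new MyPriorityQueue<>();
        String stored = new String("liberty");
        queue.add(stored);
        // The result is the object held by the queue, not the one used for the search.
        System.out.println(queue.get(new String("liberty")) == stored);
    }
}
```

Returning the stored element rather than a copy is what lets the caller work with the object that is actually in the queue.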
Question 5.8, posted 29 November: Could you please be more specific about what output you would like for the "Additional Tasks"? Should output be to the console, to files (with what names?), or what?
That's certainly a reasonable request -- I'll put this on the assignment
page as well.
CompareFrequency should output to the console, as it says. The output
should say whether the outputs of your two methods are identical (by
actually checking them, not just saying that they are) and should report the
time taken by each of the two methods. The format is not important as we are
checking it by eye, not by string-matching.
MultitextFrequency should output its report to a file named
MultitextReport.txt. The format is not so important because we are checking
it by eye, but do pay attention to the specification of what should go on
each line of the report corresponding to an individual essay. You may decide
what to report about the entire document.
Authorship.txt and may be in any format you like.
Question 5.9, posted 29 November: Where do we get the document with the Federalist Papers?
Whoops, I forgot to say, didn't I. There are multiple copies on the net, but use this one.
Question 5.10, posted 29 November: I found it easier to time my program by
using long start = System.currentTimeMillis(); then calling that method
again at the end of the program, and subtracting the times to get the total
run time. Is this alright?
Sure, no problem.
Also, I got only a slight loss of time with the priority queue and someone else said they were faster in some cases with the priority queue. Did we do something wrong? You said the PQ was "the wrong data structure".
It's hard to predict which will run faster and by
how much, because each method has some advantages. The "get method"
for the PQ uses more time on the unsuccessful searches, which is the
worst case. But depending on what order you search the elements,
which you don't control, you might run faster on successful searches
of common words.
Also, java.util.PriorityQueue is professionally-written software and might
benefit from various tricks. One thing that might be important is that it
uses a heap stored in an array, which might make much faster use of memory
than DJW's BinarySearchTree which is reference-based.
A serious attempt to compare the two methods would test them on a
variety of inputs of a variety of lengths, for starters. I'm
basically just getting you used to the idea of comparing them here.
You'll get full credit for legitimately creating, running, and
measuring the time of the two methods, whatever your results.
Question 5.11, posted 5 December: You say that MultitextFrequency should
have the separator word as an additional input. Does that mean it has the
inputs of PQFrequency, plus the new one?
Exactly. So the command-line call would be
"java MultitextFrequency 7 6 fedpapers.txt federalist", for example.
But the "minimum count" parameter has no meaning in this part. And you told us that the minimum word length should be 6.
Those are good points, but do it as I said above anyway.
Question 5.12, posted 5 December: Is the separator word "federalist" part of the individual items?
No, it's between them. The first item is before
the first occurrence of the word, the second item is between the first
and second, and so on until the last item is after the last occurrence
of the word. Remember that you are already breaking the text into words as
in FrequencyList, so you just have to check each word for equality with the
separator word when you see it.
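The splitting rule just described can be sketched as follows. This is only an illustration under some assumptions: the words are taken to be already extracted and lowercased, and the class and method names are made up. A real MultitextFrequency would more likely update its counts as it goes rather than store every word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SeparatorSplit {
    // Splits a sequence of words into items, treating each occurrence of
    // separator as a boundary that belongs to no item.
    static List<List<String>> splitItems(List<String> words, String separator) {
        List<List<String>> items = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String word : words) {
            if (word.equals(separator)) {
                items.add(current);            // close the item before this separator
                current = new ArrayList<>();
            } else {
                current.add(word);
            }
        }
        items.add(current);                    // the last item follows the last separator
        return items;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "preface", "federalist", "general", "introduction", "federalist", "dangers");
        System.out.println(splitItems(words, "federalist"));
    }
}
```

With k occurrences of the separator this always yields k + 1 items, some of which may be empty or very short.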
But the word "federalist" doesn't just occur at the start of essays -- it's also in the introductory part. Should I strip that out?
No, I want your code to work on the raw text. It's true that you'll have some bogus "essays" at the start and maybe at the end of the text. If you like, if an "essay" has fewer than 100 total words in it you don't have to bother with the output for it -- just give a line with the number and nothing else.
Question 5.13, posted 5 December: Do we have to write another program to analyze the results from Task 2 as we do Task 3?
No, you can take the data from MultitextFrequency and look at it either by
eye, using commercial software, or by writing some code yourself.
Question 5.14, posted 5 December: My program seems really
slow. It took about a half-hour for PQFrequency to get through
the Federalist Papers document. It seems to be taking a long time to
find words in the queue. What I'm doing is dequeueing words until the
right word comes along, then incrementing its count and putting all
the words back. Is that the wrong way to do it?
It's not the best way, as you are doing lots of PQ
operations and each of them is O(log n) rather than O(1) (if n is the
number of items in the PQ). It's usually better, when you are trying
to observe the PQ rather than change it, to use an iterator object
rather than the PQ operations. The PQ comes with its own contains method,
and you can create a get method with an iterator as discussed in 5.7 above.
Ok, so I can use my get method to find the item in the PQ if contains says
that it's there. Then I just increment the frequency field in place, like in
the DJW code, right?
That's a sensible idea, but there's a problem
with it. Remember that when DJW did that incrementing in place, they warned
you (at the top of page 592) that this was an unusual practice because
a data structure's items should normally only be modified using the
data structure's own methods. In the DJW code they are changing the
frequency field of a WordFreq object, and those objects
are compared by only looking at the word field. But in your PQ, the
objects are compared by frequency, and you are planning to change
the frequency. You might, for example, make the heap property
false by making an item's frequency greater than its parent's. The PQ
won't have any reason to move the item into the correct position.
It's certainly possible that the code will work anyway and give the
correct output, because we're going to sort the output alphabetically
anyway. But this kind of thing "voids the warranty" on the PQ -- you
no longer have the programmer's assurance that the class will do what
it's supposed to, because you may have violated the precondition of
some of the methods.
What you want to do is get the item you want, save it in a
temporary variable, remove it from the PQ, increment it, then
add it back into the queue. (The java.util.PriorityQueue class has a
remove(Object) method that will do this for you, along with the remove( )
method that dequeues the element at the head of the queue.) This costs you
the time for two PQ operations, but it ensures
that the PQ will hold its elements correctly. It should still be much
faster than the large number of PQ operations you were doing before.
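The get, remove, increment, and add-back sequence might look like the sketch below. The PQFreq class is a hypothetical stand-in that is ordered by frequency (highest first) and considered equal by word, and the count method is an invented name. The search is written inline with an iterator so the block stands alone.

```java
import java.util.PriorityQueue;

class IncrementInQueue {
    // Hypothetical entry: ordered by frequency (highest first), equal by word.
    static class PQFreq implements Comparable<PQFreq> {
        final String word;
        int freq;
        PQFreq(String word, int freq) { this.word = word; this.freq = freq; }
        public int compareTo(PQFreq other) { return Integer.compare(other.freq, freq); }
        @Override public boolean equals(Object o) {
            return o instanceof PQFreq && word.equals(((PQFreq) o).word);
        }
        @Override public int hashCode() { return word.hashCode(); }
    }

    // Records one more occurrence of word without ever changing an entry
    // while it is still inside the queue.
    static void count(PriorityQueue<PQFreq> queue, String word) {
        PQFreq found = null;
        for (PQFreq entry : queue) {          // linear search through the iterator
            if (entry.word.equals(word)) {
                found = entry;
                break;
            }
        }
        if (found == null) {
            queue.add(new PQFreq(word, 1));
        } else {
            queue.remove(found);              // take it out while its frequency is unchanged
            found.freq++;                     // modify it outside the queue
            queue.add(found);                 // re-insert so the heap positions it correctly
        }
    }

    public static void main(String[] args) {
        PriorityQueue<PQFreq> queue = new PriorityQueue<>();
        for (String w : new String[] {"a", "b", "b", "c", "b", "a"}) {
            count(queue, w);
        }
        while (!queue.isEmpty()) {
            PQFreq entry = queue.poll();
            System.out.println(entry.word + " " + entry.freq);
        }
    }
}
```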
Question 5.15, posted 6 December: What was that business about String objects in lecture yesterday?
In CompareFrequency, you are supposed to be having each word-counter make
text output to a file, then comparing the two files to see whether they are
identical. Some students instead made a String object out of each file (or
composed the output lines into a String) and compared them directly.
That's not really what I wanted, and I'm not sure that you will get away
with using such huge Strings, but I reluctantly said that this was ok.
Question 5.16, posted 6 December: I compared my PQFrequency output to the
file that you posted. I have the same table of words
and frequencies, but my program counted about 30,000 more words than
yours. It was the same with my altered version of FrequencyTable. Do you
know what's going on? Am I going
to get an F even though my output is nearly all right?
It looks like we aren't going to be able to insist
on your matching the exact word count from our report. We'll give
full credit for getting just the table of frequencies right.
What I think is happening, at least for many of you, is that you
are reformatting the Federalists.dat file in the course
of getting it from the web site into your computer. This is quite
likely if it passes through any Microsoft products on the way. The
formatting changes seem to result in your Scanner finding lots of extra
small words. Try altering your "numValidWords++;" statement to
"if (newWord.length( ) > 0) numValidWords++;", where "newWord" is the word
you are currently reading. For many of you at least, a lot of these
bogus words are empty. But one student had two such extra words of
length 7 or more, which didn't show up enough times to mess up the
list.
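Here is a small sketch of how empty words can arise and how the suggested test filters them. It assumes the Scanner uses a delimiter pattern that matches one non-letter character at a time (the pattern below is my assumption, not necessarily the one in your code), and it uses a Windows-style line ending as a guess at the kind of reformatting involved. With such a pattern, two delimiter characters in a row leave an empty token between them.

```java
import java.util.Scanner;

class WordCountDemo {
    // Returns {tokens seen, non-empty tokens} for the given text.
    static int[] countWords(String text) {
        Scanner in = new Scanner(text);
        in.useDelimiter("[^a-zA-Z']");     // assumed delimiter: one non-letter character per match
        int allTokens = 0;
        int numValidWords = 0;
        while (in.hasNext()) {
            String newWord = in.next();
            allTokens++;
            if (newWord.length() > 0) {    // the suggested guard against empty words
                numValidWords++;
            }
        }
        return new int[] { allTokens, numValidWords };
    }

    public static void main(String[] args) {
        int[] unixStyle = countWords("union\nliberty");
        int[] windowsStyle = countWords("union\r\nliberty");
        System.out.println(unixStyle[0] + " tokens, " + unixStyle[1] + " valid");
        System.out.println(windowsStyle[0] + " tokens, " + windowsStyle[1] + " valid");
    }
}
```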
Question 5.17, posted 6 December: For some reason, I got
only about half the table copied from my priority queue of PQWordFreq
objects to my priority queue of WordFreq objects.
More than one person has had this problem because the condition of the loop controlling the copying had a reference to the size of the queue being emptied. Since that size is changing as the loop executes, the loop may not run the number of times you were expecting.
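To see the effect, here is a sketch with queues of String (a simplification, and the method names are invented). The first loop tests the shrinking size in its condition, so the counter and the size meet in the middle. The second simply runs until the source queue is empty.

```java
import java.util.PriorityQueue;

class CopyLoopDemo {
    // Flawed: i grows while source.size() shrinks, so the loop stops about halfway.
    static int copyWithSizeInCondition(PriorityQueue<String> source, PriorityQueue<String> target) {
        int copied = 0;
        for (int i = 0; i < source.size(); i++) {
            target.add(source.poll());
            copied++;
        }
        return copied;
    }

    // Safer: keep going until the source queue is empty.
    static int copyUntilEmpty(PriorityQueue<String> source, PriorityQueue<String> target) {
        int copied = 0;
        while (!source.isEmpty()) {
            target.add(source.poll());
            copied++;
        }
        return copied;
    }

    // Builds a queue holding n distinct words, for the demonstration only.
    static PriorityQueue<String> sampleQueue(int n) {
        PriorityQueue<String> queue = new PriorityQueue<>();
        for (int i = 0; i < n; i++) {
            queue.add("word" + i);
        }
        return queue;
    }

    public static void main(String[] args) {
        System.out.println(copyWithSizeInCondition(sampleQueue(10), new PriorityQueue<>()));
        System.out.println(copyUntilEmpty(sampleQueue(10), new PriorityQueue<>()));
    }
}
```

Saving the size in a local variable before the loop starts would also work.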
Last modified 6 December 2012