CMPSCI 187: Programming With Data Structures

David Mix Barrington

Fall, 2012

Q&A for Programming Project #5: Word Frequencies

Question text in black, answers in blue.

Question 5.1, posted 20 November: I'm using the contains method of java.util.PriorityQueue and it is not finding objects that I am certain are in the queue.
I looked at your code and found the problem. Your WordFreq class needs an equals method, because the object you pass to contains is not the same object that is in the queue, only an object with the same string. The contains method of that standard Java class uses equals. Furthermore, it uses public boolean equals (Object o), so you must write an equals method that takes an Object parameter. (This is different from compareTo, which can take a T parameter to implement Comparable<T>.) Your equals method should take its Object argument and immediately cast it into a WordFreq, or whatever class you are writing for.
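Here is a minimal sketch of such an equals method. The WordFreq class below is a hypothetical stand-in, not the DJW class or any student's code: its field names, constructor, and the instanceof guard before the cast are assumptions made for illustration. It also overrides hashCode, which Java convention asks for whenever equals is overridden.

```java
import java.util.PriorityQueue;

// Hypothetical stand-in for a student's WordFreq class; names are assumptions.
class WordFreq implements Comparable<WordFreq> {
    private final String word;
    private int freq;

    WordFreq(String word) {
        this.word = word;
        this.freq = 0;
    }

    String getWord() { return word; }
    int getFreq() { return freq; }
    void inc() { freq++; }

    // compareTo may take a WordFreq because the class implements Comparable<WordFreq>.
    @Override
    public int compareTo(WordFreq other) {
        return word.compareTo(other.word);
    }

    // equals must take an Object so that it overrides Object.equals,
    // which is the method that contains calls.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordFreq)) return false; // guard added for safety (assumption)
        WordFreq other = (WordFreq) o;
        return word.equals(other.word);
    }

    // Kept consistent with equals: equal objects have equal hash codes.
    @Override
    public int hashCode() {
        return word.hashCode();
    }
}

class EqualsDemo {
    public static void main(String[] args) {
        PriorityQueue<WordFreq> pq = new PriorityQueue<>();
        pq.add(new WordFreq("liberty"));
        System.out.println(pq.contains(new WordFreq("liberty"))); // prints true
        System.out.println(pq.contains(new WordFreq("union")));   // prints false
    }
}
```

If the method were declared as equals(WordFreq other) instead, it would overload rather than override Object.equals, and contains would go on using the inherited version, which compares object identity.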
Question 5.2, posted 20 November: I fixed that, thanks. When I found that the word I was looking for already had an object in the queue, I tried to increment that object, but it didn't work. I realized I was incrementing the object I had just created to call contains on, not the object that was in the queue. What I need now is the get method, but java.util.PriorityQueue doesn't have that method.
No, it doesn't. Priority queues aren't meant to be searched this way, which is one of the many ways they are the wrong data structure for this job. (But it makes the project more fun, doesn't it?) You have at least two choices that I can see. You can make an Iterator for the queue, which will give you the elements in some unknown order, and then increment the right object when the iterator gives it to you. Or you could dequeue objects until you dequeue the right one, keeping them somewhere until you can add them back. The latter will be faster if you are looking for common words that are near the front of the queue.
Question 5.3, posted 21 November: I had a strange error when I modified the code on page 595 to get the input from the user. I'm not really sure what the line skip = conIn.nextLine( ); is doing there anyway, but my attempt to get the file name from the user caused the program to hang, not taking any input, throwing an exception, or doing anything else.
Our console input using a Scanner has normally taken in a String with the nextLine method. But here we want to take two int inputs, for which we are using the nextInt method of Scanner. Unlike nextLine, nextInt doesn't read the linebreak character after reading the digits. So to continue with anything, we need to read that character. I don't really see the point in assigning the string read to a variable as DJW do, but whatever.
If you ask for the file name after asking for the two numbers, you need that skip line after the first two requests but not after the last one.
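The input order described above might be sketched as follows. This is not the code from page 595: the variable names minSize and minFreq are assumptions, and the Scanner reads from a fixed String rather than the console so that the example runs by itself.

```java
import java.util.Scanner;

class InputDemo {
    // Reads two ints and then a file name, returning them joined by spaces.
    static String readInputs(Scanner conIn) {
        int minSize = conIn.nextInt();
        conIn.nextLine();                   // discard the line break left behind by nextInt
        int minFreq = conIn.nextInt();
        conIn.nextLine();                   // same again after the second number
        String fileName = conIn.nextLine(); // nextLine consumes its own line break
        return minSize + " " + minFreq + " " + fileName;
    }

    public static void main(String[] args) {
        // Simulated console input: two numbers and a file name, one per line.
        Scanner conIn = new Scanner("7\n6\nfedpapers.txt\n");
        System.out.println(readInputs(conIn)); // prints 7 6 fedpapers.txt
    }
}
```

Without the two skip calls, the final nextLine would return the empty remainder of the line holding the second number instead of the file name.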
Question 5.4, posted 29 November: You say that the outputs of FrequencyList and PQFrequency are supposed to be identical. Does that mean the latter has to be alphabetical?
Yes.
Ok. It would be much easier if it were in frequency order, because that's how the objects would come out of the queue.
Yes, it would. You're going to need an extra sorting phase. I'd suggest converting each PQFreq object into a WordFreq object when you take it out, then putting it into a separate priority queue of WordFreq objects from which you can take the objects out in alphabetical order.
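One way that extra sorting phase might look is sketched below. The PQFreq and WordFreq classes here are stripped-down hypothetical versions with assumed field names: PQFreq compares by frequency (highest first) and WordFreq compares alphabetically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical item ordered alphabetically by word.
class WordFreq implements Comparable<WordFreq> {
    final String word;
    final int freq;
    WordFreq(String word, int freq) { this.word = word; this.freq = freq; }
    @Override
    public int compareTo(WordFreq o) { return word.compareTo(o.word); }
}

// Hypothetical item ordered by frequency, highest frequency at the head.
class PQFreq implements Comparable<PQFreq> {
    final String word;
    final int freq;
    PQFreq(String word, int freq) { this.word = word; this.freq = freq; }
    @Override
    public int compareTo(PQFreq o) { return Integer.compare(o.freq, freq); }
}

class ResortDemo {
    // Empties the frequency-ordered queue into an alphabetical one,
    // then empties that one to produce report lines in alphabetical order.
    static List<String> alphabetical(PriorityQueue<PQFreq> byFreq) {
        PriorityQueue<WordFreq> byWord = new PriorityQueue<>();
        while (!byFreq.isEmpty()) {
            PQFreq p = byFreq.remove();
            byWord.add(new WordFreq(p.word, p.freq));
        }
        List<String> lines = new ArrayList<>();
        while (!byWord.isEmpty()) {
            WordFreq w = byWord.remove();
            lines.add(w.word + " " + w.freq);
        }
        return lines;
    }

    public static void main(String[] args) {
        PriorityQueue<PQFreq> byFreq = new PriorityQueue<>();
        byFreq.add(new PQFreq("union", 3));
        byFreq.add(new PQFreq("liberty", 5));
        byFreq.add(new PQFreq("people", 1));
        System.out.println(alphabetical(byFreq)); // prints [liberty 5, people 1, union 3]
    }
}
```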
Question 5.5, posted 29 November: When we compute tf-idf values, do we do it only for words that are above the minimum length?
Yes. Only words above the minimum length need to have objects made to put in the queue. But minimum count doesn't have the same role as it used to -- it might be that a word that occurs only once is among the top three for tf-idf score.
Question 5.6, posted 29 November: I created a file and wrote my report to it. The report was there when I read the file, but so was a string of strange characters before the report started. Do you know why?
Actually, I don't, but it appears that this happens only when you use the writeObject method as DJW explain with regard to serializable objects. You only want to write String objects, which you can do with simple I/O methods for text files.
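A plausible explanation is that ObjectOutputStream begins its output with a binary stream header and type information, which show up as odd characters when the file is read as text. A minimal sketch of plain text output is below, using PrintWriter and FileWriter from java.io; the method name writeReport and the sample lines are assumptions for illustration.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class TextReportDemo {
    // Writes each line as ordinary text; no serialization header is produced.
    static void writeReport(String fileName, List<String> lines) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(fileName))) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("report", ".txt"); // temporary file for the demo
        writeReport(tmp.toString(), Arrays.asList("liberty 5", "union 3"));
        System.out.println(Files.readAllLines(tmp)); // prints [liberty 5, union 3]
        Files.delete(tmp);
    }
}
```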
Question 5.7, posted 29 November: I had an idea for the problem identified in 5.2 above, that there is no get method in java.util.PriorityQueue. I can't add code to that class myself, but what if I create a class MyPriorityQueue that extends PriorityQueue, then write my get method in that class?
I like that idea -- your method still has to use a linear search which takes O(q) time on a queue with q elements in the worst case, but it's nice to have the get code written just once. Remember that in a class extending PriorityQueue, you will only have access to public and protected methods of that class. But you probably want to create an Iterator object, and the iterator method of PriorityQueue is public.
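A sketch of the student's idea follows. The class name MyPriorityQueue comes from the question; the get signature and the choice to return null when nothing matches are assumptions. The search relies on the element class having a suitable equals method, as in 5.1.

```java
import java.util.Iterator;
import java.util.PriorityQueue;

class MyPriorityQueue<T> extends PriorityQueue<T> {
    // Returns the stored element that equals target, or null if there is none.
    // This is a linear search, so it is O(q) on a queue with q elements.
    public T get(Object target) {
        Iterator<T> it = iterator(); // public method inherited from PriorityQueue
        while (it.hasNext()) {
            T item = it.next();
            if (item.equals(target)) {
                return item;
            }
        }
        return null;
    }
}

class GetDemo {
    public static void main(String[] args) {
        MyPriorityQueue<String> q = new MyPriorityQueue<>();
        String stored = new String("liberty"); // a distinct object with this text
        q.add(stored);
        q.add("union");
        // get hands back the object that is actually in the queue.
        System.out.println(q.get("liberty") == stored); // prints true
        System.out.println(q.get("people"));            // prints null
    }
}
```

Returning the stored object, not the probe object, is the point: it is the one whose count needs to change.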
Question 5.8, posted 29 November: Could you please be more specific about what output you would like for the "Additional Tasks"? Should output be to the console, to files (with what names?), or what?
That's certainly a reasonable request -- I'll put this on the assignment page as well.
Question 5.9, posted 29 November: Where do we get the document with the Federalist Papers?
Whoops, I forgot to say, didn't I. There are multiple copies on the net, but use this one.
Question 5.10, posted 29 November: I found it easier to time my program by using long start = System.currentTimeMillis(); then calling that time again at the end of the program, and subtracting the times to get the total run time. Is this alright?
Sure, no problem.
Also, I got only a slight loss of time with the priority queue and someone else said they were faster in some cases with the priority queue. Did we do something wrong? You said the PQ was "the wrong data structure".
It's hard to predict which will run faster and by how much, because each method has some advantages. The "get method" for the PQ uses more time on the unsuccessful searches, which is the worst case. But depending on what order you search the elements, which you don't control, you might run faster on successful searches of common words.
Also, java.util.PriorityQueue is professionally written software and might benefit from various tricks. One thing that might be important is that it uses a heap stored in an array, which might use memory much more efficiently than DJW's BinarySearchTree, which is reference-based.
A serious attempt to compare the two methods would test them on a variety of inputs of a variety of lengths, for starters. I'm basically just getting you used to the idea of comparing them here. You'll get full credit for legitimately creating, running, and measuring the time of the two methods, whatever your results.
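The timing approach from the question can be sketched as below. The helper name timeMillis and the busy loop standing in for a word-counting run are assumptions; in the project you would time the actual FrequencyList and PQFrequency runs.

```java
class TimingDemo {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.currentTimeMillis();
        task.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Placeholder workload standing in for a word-counting run.
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) {
                sum += i;
            }
            System.out.println("sum = " + sum);
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```

A single run like this is noisy, which is one reason a serious comparison would repeat the measurement over several inputs and lengths.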
Question 5.11, posted 5 December: You say that MultitextFrequency should have the separator word as an additional input. Does that mean it has the inputs of PQFrequency, plus the new one?
Exactly. So the command-line call would be "java MultitextFrequency 7 6 fedpapers.txt federalist", for example.
But the "minimum count" parameter has no meaning in this part. And you told us that the minimum word length should be 6.
Those are good points, but do it as I said above anyway.
Question 5.12, posted 5 December: Is the separator word "federalist" part of the individual items?
No, it's between them. The first item is before the first occurrence of the word, the second item is between the first and second, and so on until the last item is after the last occurrence of the word. Remember that you are already breaking the text into words as in FrequencyList, so you just have to check each word for equality with the separator word when you see it.
But the word "federalist" doesn't just occur at the start of essays -- it's also in the introductory part. Should I strip that out?
No, I want your code to work on the raw text. It's true that you'll have some bogus "essays" at the start and maybe at the end of the text. If you like, if an "essay" has fewer than 100 total words in it you don't have to bother with the output for it -- just give a line with the number and nothing else.
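The splitting rule described above might be sketched like this. The method name splitItems and the list-of-lists representation are assumptions, and the words are assumed to be already normalized (for example to lower case) by the word-breaking code. In the project you would more likely update counts word by word than store every item.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SeparatorDemo {
    // Splits a sequence of words into items. The separator word belongs to no
    // item, so k occurrences of the separator always produce k + 1 items.
    static List<List<String>> splitItems(List<String> words, String separator) {
        List<List<String>> items = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String w : words) {
            if (w.equals(separator)) {
                items.add(current);          // close the item before this occurrence
                current = new ArrayList<>(); // start the next item
            } else {
                current.add(w);
            }
        }
        items.add(current); // the last item follows the last occurrence
        return items;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "preface", "federalist", "union", "liberty", "federalist", "people");
        // prints [[preface], [union, liberty], [people]]
        System.out.println(splitItems(words, "federalist"));
    }
}
```

An item's size is then available as the length of its list, which makes the optional 100-word cutoff a one-line check.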
Question 5.13, posted 5 December: Do we have to write another program to analyze the results from Task 2 as we do Task 3?
No, you can take the data from MultitextFrequency and look at it by eye, with commercial software, or with code you write yourself.
Question 5.14, posted 5 December: My program seems really slow. It took about a half-hour for PQFrequency to get through the Federalist Papers document. It seems to be taking a long time to find words in the queue. What I'm doing is dequeueing words until the right word comes along, then incrementing its count and putting all the words back. Is that the wrong way to do it?
It's not the best way, as you are doing lots of PQ operations and each of them is O(log n) rather than O(1) (if n is the number of items in the PQ). It's usually better, when you are trying to observe the PQ rather than change it, to use an iterator object rather than the PQ operations. The PQ comes with its own contains method, and you can create a get method with an iterator as discussed in 5.7 above.
Ok, so I can use my get method to find the item in the PQ if contains says that it's there. Then I just increment the frequency field in place, like in the DJW code, right?
That's a sensible idea, but there's a problem with it. Remember that when DJW did that incrementing in place, they warned you (at the top of page 592) that this was an unusual practice because a data structure's items should normally only be modified using the data structure's own methods. In the DJW code they are changing the frequency field of a WordFreq object, and those objects are compared by only looking at the word field. But in your PQ, the objects are compared by frequency, and you are planning to change the frequency. You might, for example, make the heap property false by making an item's frequency greater than its parent's. The PQ won't have any reason to move the item into the correct position. It's certainly possible that the code will work anyway and give the correct output, because we're going to sort the output alphabetically anyway. But this kind of thing "voids the warranty" on the PQ -- you no longer have the programmer's assurance that the class will do what it's supposed to, because you may have violated the precondition of some of the methods.
What you want to do is get the item you want, save it in a temporary variable, remove it from the PQ, increment it, then add it back into the queue. (The java.util.PriorityQueue class has a remove(Object) method that will do the removal for you, along with the remove( ) method that dequeues the element at the head of the queue.) This costs you the time for two PQ operations, but it ensures that the PQ will hold its elements correctly. It should still be much faster than the large number of PQ operations you were doing before.
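The remove, increment, and add-back sequence might look like the sketch below. The PQWordFreq class is a hypothetical minimal version with assumed field names: it is compared by frequency (highest first) and tested for equality by word. The method name countWord is also an assumption.

```java
import java.util.PriorityQueue;

// Hypothetical item: ordered by frequency, equal when the words match.
class PQWordFreq implements Comparable<PQWordFreq> {
    final String word;
    int freq = 1;

    PQWordFreq(String word) { this.word = word; }

    @Override
    public int compareTo(PQWordFreq o) { return Integer.compare(o.freq, freq); }

    @Override
    public boolean equals(Object o) {
        return (o instanceof PQWordFreq) && word.equals(((PQWordFreq) o).word);
    }

    @Override
    public int hashCode() { return word.hashCode(); }
}

class UpdateDemo {
    // Records one occurrence of word without changing an item while it is in the queue.
    static void countWord(PriorityQueue<PQWordFreq> pq, String word) {
        PQWordFreq probe = new PQWordFreq(word);
        PQWordFreq found = null;
        for (PQWordFreq item : pq) {      // iterator-based search, order unspecified
            if (item.equals(probe)) {
                found = item;
                break;
            }
        }
        if (found == null) {
            pq.add(probe);                // first occurrence, count starts at 1
            return;
        }
        pq.remove(found);                 // remove(Object) locates it with equals
        found.freq++;                     // safe: the item is outside the queue now
        pq.add(found);                    // re-insert so the heap can place it correctly
    }

    public static void main(String[] args) {
        PriorityQueue<PQWordFreq> pq = new PriorityQueue<>();
        for (String w : new String[] {"union", "liberty", "union", "union", "liberty"}) {
            countWord(pq, w);
        }
        PQWordFreq head = pq.peek();
        System.out.println(head.word + " " + head.freq); // prints union 3
    }
}
```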
Question 5.15, posted 6 December: What was that business about String objects in lecture yesterday?
In CompareFrequency, you are supposed to have each word-counter write its text output to a file, then compare the two files to see whether they are identical. Some students instead made a String object out of each file (or composed the output lines into a String) and compared them directly. That's not really what I wanted, and I'm not sure that you will get away with using such huge Strings, but I reluctantly said that this was ok.
Question 5.16, posted 6 December: I compared my PQFrequency output to the file that you posted. I have the same table of words and frequencies, but my program counted about 30,000 more words than yours. It was the same with my altered version of FrequencyTable. Do you know what's going on? Am I going to get an F even though my output is nearly all right?
It looks like we aren't going to be able to insist on your matching the exact word count from our report. We'll give full credit for getting just the table of frequencies right.
What I think is happening, at least for many of you, is that you are reformatting the Federalists.dat file in the course of getting it from the web site into your computer. This is quite likely if it passes through any Microsoft products on the way. The formatting changes seem to result in your Scanner finding lots of extra small words. Try altering your "numValidWords++;" statement to "if (newWord.length( ) > 0) numValidWords++;", where "newWord" is the word you are currently reading. For many of you at least, a lot of these bogus words are empty. But one student had two such extra words of length 7 or more, which didn't show up enough times to mess up the list.
Question 5.17, posted 6 December: For some reason, I got only about half the table copied from my priority queue of PQWordFreq objects to my priority queue of WordFreq objects.
More than one person has had this problem because the condition of the loop controlling the copying had a reference to the size of the queue being emptied. Since that size is changing as the loop executes, the loop may not run the number of times you were expecting.
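The sketch below shows one loop shape that has this problem, next to a version that avoids it. The buggy loop is an assumed reconstruction, not any student's actual code, and Integer elements are used only to keep the example short.

```java
import java.util.PriorityQueue;

class CopyLoopDemo {
    // Buggy: i goes up while src.size() goes down, so the loop stops about halfway.
    static int copyBuggy(PriorityQueue<Integer> src, PriorityQueue<Integer> dst) {
        int copied = 0;
        for (int i = 0; i < src.size(); i++) {
            dst.add(src.remove());
            copied++;
        }
        return copied;
    }

    // Correct: keep going until the source queue is actually empty.
    static int copyAll(PriorityQueue<Integer> src, PriorityQueue<Integer> dst) {
        int copied = 0;
        while (!src.isEmpty()) {
            dst.add(src.remove());
            copied++;
        }
        return copied;
    }

    // Builds a queue holding the integers 0 through n - 1.
    static PriorityQueue<Integer> sample(int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        for (int i = 0; i < n; i++) {
            pq.add(i);
        }
        return pq;
    }

    public static void main(String[] args) {
        System.out.println(copyBuggy(sample(10), new PriorityQueue<>())); // prints 5
        System.out.println(copyAll(sample(10), new PriorityQueue<>()));   // prints 10
    }
}
```

Saving the size in a local variable before the loop starts would also work, since the saved value does not change as the queue empties.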



Last modified 6 December 2012