Starter code: information-retrieval-student.zip
Overview
Maybe you’ve heard of this little startup called Google? Part of their initial claim to fame was their web search engine, which let users quickly find web pages of interest based on a list of search terms. A full web search engine relies on a stable of different kinds of software.
In this assignment, you’ll be building a simplified part of a search engine, consisting of a document indexer and a system for determining documents’ relevance to a specific search term. We’ve provided an outline for the search engine, but it will be your job to fill it in.
We’ve provided a set of unit tests to help with automated testing. But unlike previous assignments, the Gradescope autograder is running slightly different tests that you do not have access to. If your code can pass the tests we’ve provided, it is likely correct, but the Gradescope tests are the final arbiter.
Goals
- Translate written descriptions of behavior into code.
- Practice representing state in a class.
- Practice interacting with the Map, Set, and List abstractions.
- Test code using unit tests.
Downloading and importing the starter code
As in previous assignments, download and decompress the provided archive file containing the starter code. Then import it into Code in the same way; you should end up with an information-retrieval-student project in the “Project Explorer”.
Search engine behavior
For the purposes of this assignment, a search engine is a stateful object that “knows about” a set of documents and supports various queries on those documents and their contents.
Documents are identified by a unique ID and consist of a sequence of terms. Terms are (always) the lowercase version of words; operations on the search engine and documents are therefore case-insensitive. Documents are added one-by-one to the search engine; the search engine’s internal state is updated each time a document is added.
The search engine functions as an index. That is, given a term, the search engine can return the set of documents (that previously were added to it) that contain that term.
The search engine can also find a list of documents (again, from among the set it knows about) relevant to a given term, ordered from most-relevant to least-relevant. It does so using a specific version of the tf-idf statistic, which sounds intimidating but is not too bad to calculate efficiently — so long as you have the data structures to support doing so.
What to do
The SearchEngine needs to keep track of the documents for two things: to do index lookups of terms, returning a set of documents (in indexLookup), and to compute the two components of the tf-idf statistic (in termFrequency and inverseDocumentFrequency). You can hold this state with whatever data structures you like, but my suggestions follow.
I suggest you get addDocument and indexLookup working first. To support the index, a straightforward mapping of terms to DocumentIds will work (to be clear: a Map<String, Set<DocumentId>>). (It turns out you don’t need to create this structure; you can use the one you’ll make to support tf-idf instead, but creating this Map might be a good warmup.) In any case, declare the structure(s) as instance variables, create the empty structure(s) you’ll use in the constructor, fill it/them in addDocument, and examine it/them in indexLookup.
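Sketched in code, the suggestion above might look something like the following. This is a rough outline only, not the official solution; String stands in for the starter code’s DocumentId type so the sketch is self-contained, and the method names inside are hypothetical helpers:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Rough sketch of the index structure only; your real code keys the sets
// by DocumentId and does the recording inside addDocument.
class IndexSketch {
    // term -> set of documents containing that term
    private final Map<String, Set<String>> index;

    IndexSketch() {
        index = new HashMap<>();  // create the empty structure in the constructor
    }

    // In addDocument, call something like this once per term read from the document.
    void recordTerm(String docId, String term) {
        index.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
    }

    // indexLookup: return the documents containing the term (empty set if none).
    Set<String> indexLookup(String term) {
        return index.getOrDefault(term, new HashSet<>());
    }
}
```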
How do you read in the document? One way is to use a BufferedReader to read lines one-by-one. You can turn a generic Reader into a BufferedReader using one of BufferedReader’s constructors:
BufferedReader br = new BufferedReader(reader);
Then you can set up a loop to read each line:
for (String line = br.readLine(); line != null; line = br.readLine()) {
// do some stuff with each line
  // probably involving split("\\W+") and toLowerCase()
}
When turning each line into terms, you can use the same approach as in Assignment 06: String.split using "\\W+", and remember to toLowerCase the result.
You are also welcome to use a Scanner to achieve the same effect, though be careful to get exactly the same word-splitting behavior that the "\\W+" regular expression produces: it splits on all non-word characters, where word characters are defined as the roman alphabet, upper- (A-Z) and lower-case (a-z), the numerals (0-9), and the underscore (_).
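For a quick illustration of how that split behaves (note that apostrophes split words, while underscores do not):

```java
// Splitting on non-word characters after lowercasing:
String[] terms = "Don't panic -- it's only CS_101!".toLowerCase().split("\\W+");
// yields: ["don", "t", "panic", "it", "s", "only", "cs_101"]
```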
termFrequency requires that you compute the number of times a given term appears in a given document. This suggests you should have a data structure that keeps track of the count of terms per document: a Map<String, Integer>. But this frequency-counting structure is per-document; you need to keep track of each document’s counts. So overall, I suggest a Map<DocumentId, Map<String, Integer>>. The outer map goes from DocumentIds to the inner frequency-counting structure. You’ll have to update addDocument to populate and update these structures. Be sure to get clear in your head the different times you’ll use get, put, containsKey, and getOrDefault before you start writing code!
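As one possible sketch of populating that nested structure (String stands in for DocumentId here so the example is self-contained; computeIfAbsent and merge are one idiom among several — get/put/getOrDefault work just as well):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the nested counting structure your addDocument would populate.
Map<String, Map<String, Integer>> termCounts = new HashMap<>();

String docId = "doc1";
for (String term : "the cat sat on the mat".toLowerCase().split("\\W+")) {
    termCounts
        .computeIfAbsent(docId, k -> new HashMap<>())  // inner map for this document
        .merge(term, 1, Integer::sum);                 // increment this term's count
}
// termCounts.get("doc1").get("the") is now 2
```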
Once you have the structure described above, you can use it to implement inverseDocumentFrequency. Be sure to read the javadoc comment above the method for the exact equation the tests are expecting. Use Math.log to compute the logarithm (not Math.log10 or Math.log2).
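As a sketch only: one common form of idf is ln(n / m), where n is the total number of documents and m is the number of documents containing the term. The javadoc in the starter code is authoritative and may specify a different variant (e.g. with smoothing terms), so follow it rather than this example:

```java
// One common idf variant -- check your javadoc for the exact formula!
int n = 8;  // hypothetical: total documents known to the engine
int m = 2;  // hypothetical: documents containing the term
double idf = Math.log((double) n / m);  // Math.log is the natural log
```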
Use these two methods to compute a given document-term pair’s tfIdf.
Be careful about integer division – it truncates! Remember to cast to double where appropriate to avoid this behavior.
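For instance:

```java
int count = 3, total = 4;
double wrong = count / total;           // 0.0 -- int / int truncates before the assignment
double right = (double) count / total;  // 0.75 -- cast one operand before dividing
```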
Finally, implement relevanceLookup, which returns a list of all documents containing a given term, sorted from largest tf-idf to smallest. You’ll probably need to implement TfIdfComparator.compare, but note that no tests test the comparator directly, so if you have another method in mind to sort the list, go ahead (though I do not recommend it). If, as recommended, you do implement it, make sure it returns a value that will result in the list being sorted largest-to-smallest, and remember the tie-breaker requirement. The Gradescope tests check the sorting and tie-breaker more thoroughly than the tests we provide to you.
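A hypothetical shape for such a comparator follows. String stands in for DocumentId, the scores map is assumed to hold precomputed tf-idf values, and lexicographic document ID is only an assumed tie-breaker — check the starter code for the actual requirement:

```java
import java.util.Comparator;
import java.util.Map;

// Sketch only: the real TfIdfComparator's fields and tie-breaker come
// from the starter code.
class TfIdfComparatorSketch implements Comparator<String> {
    private final Map<String, Double> scores;  // assumed precomputed tf-idf per document

    TfIdfComparatorSketch(Map<String, Double> scores) {
        this.scores = scores;
    }

    @Override
    public int compare(String a, String b) {
        // Compare b's score against a's so that larger tf-idf sorts first.
        int byScore = Double.compare(scores.get(b), scores.get(a));
        if (byScore != 0) return byScore;
        return a.compareTo(b);  // assumed tie-breaker; verify against the spec
    }
}
```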
Submitting the assignment
When you have completed the changes to your code, you should export an archive file containing the src/ directory from your Java project. To do this, follow the same steps as from Assignment 01 to produce a .zip file, and upload it to Gradescope.
Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.
Other notes
Mutators and observers
The only place where your SearchEngine’s state (its instance variables: Maps and so on) should be modified (mutated) is addDocument. The other methods (indexLookup, termFrequency, and so on) are observers: they should return results based upon the things stored in your SearchEngine’s state, but they should not change its state.
On Stringly-typed variables
You may have noticed that DocumentId is a very thin wrapper around the String class. Why bother? Why not just use a String for the document’s ID?
Well, you could. Doing so is called Stringly typing (a pun on “strongly” typing) your program. Sometimes we do this when it’s perhaps more trouble than it’s worth to declare a new type (like passport numbers or phone numbers in a previous assignment).
But then you lose one of Java’s strengths, which is its strong static (that is, compile-time) type system’s ability to help you find errors. If we had just used String, I can guarantee that at least a third or so of the class would have accidentally used a document ID where they meant to use a term (or vice versa) in their code. This mistake would result in hard-to-track-down errors at runtime. But making a different type for document IDs prevents this error. Further, it aids code readability, as it is now unambiguous what, say, a variable in a loop is for when it’s a DocumentId and not just a String.
The major downside is that Java makes declaring what amounts to a type alias very verbose. In many other statically-typed languages this ends up being a single line (compare with structs and enums in Rust or user-defined types in OCaml).
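To make the trade-off concrete, here is roughly what such a wrapper costs in Java (named WrappedId here to avoid implying this is the starter code’s actual DocumentId, which may differ in details):

```java
// A thin wrapper around String: one field, plus equals/hashCode so it
// behaves correctly as a Map key or Set element, and toString for debugging.
final class WrappedId {
    private final String id;

    WrappedId(String id) { this.id = id; }

    @Override public boolean equals(Object o) {
        return o instanceof WrappedId && ((WrappedId) o).id.equals(this.id);
    }
    @Override public int hashCode() { return id.hashCode(); }
    @Override public String toString() { return id; }
}
```

In Java 16 and later, a record collapses all of this to a single line — record WrappedId(String id) {} — which narrows the verbosity gap considerably.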