Programming Assignment 08: Information Retrieval

Estimated reading time: 10 minutes
Estimated time to complete: two to three hours (plus debugging time)
Prerequisites: Assignment 05
Starter code: information-retrieval-student.zip
Collaboration: not permitted

Overview

Maybe you’ve heard of this little startup called Google? Part of their initial claim to fame was their web search engine, which let users quickly find web pages of interest based on a list of search terms. A full web search engine relies on a stable of different kinds of software.

In this assignment, you’ll be building a simplified part of a search engine, consisting of a document indexer and a system for determining documents’ relevance to a specific search term. We’ve provided an outline for the search engine, but it will be your job to fill it in.

We’ve provided a set of unit tests to help with automated testing. But unlike previous assignments, the Gradescope autograder is running slightly different tests that you do not have access to. If your code can pass the tests we’ve provided, it is likely correct, but the Gradescope tests are the final arbiter.

Goals

  • Translate written descriptions of behavior into code.
  • Practice representing state in a class.
  • Practice interacting with the Map, Set, and List abstractions.
  • Test code using unit tests.

Downloading and importing the starter code

As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with an information-retrieval-student project in the “Project Explorer”.

Search engine behavior

For the purposes of this assignment, a search engine is a stateful object that “knows about” a set of documents and supports various queries on those documents and their contents.

Documents are identified by a unique ID and consist of a sequence of terms. Terms are (always) the lowercase version of words; operations on the search engine and documents should therefore be case-insensitive. Documents are added one-by-one to the search engine.

The search engine functions as an index. That is, given a term, the search engine can return the set of documents (that it knows about) that contain that term.

The search engine can also find a list of documents (again, from among the set it knows about) relevant to a given term, ordered from most-relevant to least-relevant. It does so using a specific version of the tf-idf statistic, which sounds intimidating but is actually fairly straightforward to calculate — so long as you have the data structures to support doing so.

What to do

The SearchEngine needs to keep track of the documents for two things: to do index lookups of terms, returning a set of documents (in indexLookup), and to compute the two components of the tf-idf statistic (in termFrequency and inverseDocumentFrequency). You can hold this state with whatever data structures you like, but my suggestions follow.

I suggest you get addDocument and indexLookup working first. To support the index, a straightforward mapping of terms to DocumentIds will work (to be clear: a Map<String, Set<DocumentId>>). It turns out you don’t need to create this structure; you can use the one you’ll make to support tf-idf instead, but creating this Map might be a good warmup. In any case, declare the structure(s) as instance variables, create the empty structure(s) you’ll use in the constructor, fill it/them in addDocument, and examine it/them in indexLookup. When turning the document itself into terms, use the same approach as in Assignment 05: String.split using "\\W+", and remember to call toLowerCase on the result.
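To make the shape of the index concrete, here is a minimal sketch of the term-to-documents idea. It is not the starter code: the class name IndexSketch is made up, and plain String IDs stand in for DocumentId so the example is self-contained. Returning an empty set for an unknown term is also an assumption; check the javadoc in the starter code for the behavior the tests expect.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IndexSketch {
    // Hypothetical stand-in: String IDs instead of the assignment's DocumentId.
    private final Map<String, Set<String>> index = new HashMap<>();

    public void addDocument(String id, String text) {
        for (String word : text.split("\\W+")) {
            // split can yield a leading "" when the text starts with a non-word character
            if (word.isEmpty()) {
                continue;
            }
            String term = word.toLowerCase();
            // Create the set for this term the first time it is seen, then add the ID.
            index.computeIfAbsent(term, k -> new HashSet<>()).add(id);
        }
    }

    public Set<String> indexLookup(String term) {
        // Assumption: an unknown term yields an empty set rather than null.
        return index.getOrDefault(term.toLowerCase(), new HashSet<>());
    }

    public static void main(String[] args) {
        IndexSketch engine = new IndexSketch();
        engine.addDocument("d1", "The quick brown fox");
        engine.addDocument("d2", "the lazy dog");
        System.out.println(engine.indexLookup("THE")); // both IDs, in unspecified order
        System.out.println(engine.indexLookup("cat")); // prints []
    }
}
```

Because the terms are lowercased both when indexing and when looking up, the lookup is case-insensitive without any extra work.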

termFrequency requires that you compute the number of times a given term appears in a given document. This suggests you should have a data structure that keeps track of the count of terms per document: a Map<String, Integer>. But this frequency-counting structure is per-document; you need to keep track of each document’s counts. So overall, I suggest a Map<DocumentId, Map<String, Integer>>. The outer map goes from DocumentIds to the inner frequency-counting structure. You’ll have to update addDocument to populate and update these structures. Be sure to get clear in your head the different times you’ll use get, put, containsKey, and getOrDefault.
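The nested-map idea above can be sketched as follows. Again, this is an illustration under assumptions, not the expected solution: TermCountSketch is a made-up name, String IDs stand in for DocumentId, and returning 0 for an unknown document or term is my guess at reasonable behavior. It does show one place each for containsKey, put, get, and getOrDefault.

```java
import java.util.HashMap;
import java.util.Map;

public class TermCountSketch {
    // Outer map: document ID -> (inner map: term -> count in that document).
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void addDocument(String id, String text) {
        // containsKey + put: make sure the document has an inner map.
        if (!counts.containsKey(id)) {
            counts.put(id, new HashMap<>());
        }
        // get: safe here, because the key is now guaranteed to exist.
        Map<String, Integer> perDocument = counts.get(id);
        for (String word : text.split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            String term = word.toLowerCase();
            // getOrDefault: a term not yet seen starts from a count of 0.
            perDocument.put(term, perDocument.getOrDefault(term, 0) + 1);
        }
    }

    public int termFrequency(String id, String term) {
        // Assumption: unknown documents and unknown terms both count as 0.
        if (!counts.containsKey(id)) {
            return 0;
        }
        return counts.get(id).getOrDefault(term.toLowerCase(), 0);
    }

    public static void main(String[] args) {
        TermCountSketch engine = new TermCountSketch();
        engine.addDocument("d1", "To be or not to be");
        System.out.println(engine.termFrequency("d1", "to")); // prints 2
        System.out.println(engine.termFrequency("d1", "or")); // prints 1
    }
}
```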

Once you have the structure described above, inverseDocumentFrequency is fairly straightforward. Be sure to read the javadoc comment above the method for the exact equation the tests are expecting. Use Math.log, which computes the natural logarithm (not Math.log10, which computes the base-10 logarithm).

Use these two methods to compute a given document-term pair’s tfIdf.
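As a rough illustration of how the two pieces combine, the sketch below uses one common textbook form of the inverse document frequency, ln(N / n), where N is the number of known documents and n is the number containing the term. That form is an assumption on my part; the javadoc in the starter code is authoritative and may differ. The class name, the method signatures taking raw counts, and the handling of n = 0 are all invented for this example.

```java
public class TfIdfSketch {
    // Assumed form: ln(totalDocuments / documentsContainingTerm).
    // The equation in the starter code's javadoc takes precedence over this one.
    static double inverseDocumentFrequency(int totalDocuments, int documentsContainingTerm) {
        if (documentsContainingTerm == 0) {
            return 0.0; // assumption: avoids dividing by zero for unseen terms
        }
        // Cast before dividing; int / int would silently truncate (for example, 3 / 2 == 1).
        return Math.log((double) totalDocuments / documentsContainingTerm);
    }

    // tf-idf is the product of the term frequency and the inverse document frequency.
    static double tfIdf(int termFrequency, int totalDocuments, int documentsContainingTerm) {
        return termFrequency * inverseDocumentFrequency(totalDocuments, documentsContainingTerm);
    }

    public static void main(String[] args) {
        // A term that appears in every document carries no information under this form.
        System.out.println(inverseDocumentFrequency(4, 4)); // prints 0.0
        // A term appearing twice in one of four documents.
        System.out.println(tfIdf(2, 4, 1));
    }
}
```

The cast to double is the detail most worth remembering: with integer division, many different ratios collapse to the same value before the logarithm is ever taken.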

Finally, implement relevanceLookup, which returns a list of all documents containing a given term, sorted from largest tf-idf to smallest. You’ll probably need to implement TfIdfComparator.compare, but note that no tests test the comparator directly, so if you have another method in mind to sort the list, go ahead. If you do implement it, make sure it returns a value that will result in the list being sorted largest-to-smallest, and mind the tie-breaker requirement.
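If you have not written a Comparator that sorts in descending order before, the following sketch shows the general pattern. It does not use the starter code's TfIdfComparator: the class name is made up, the scores come from a precomputed map rather than from tfIdf, and breaking ties by ascending ID is only an assumed rule for illustration. Use the tie-breaker the starter code actually specifies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RelevanceSortSketch {
    // Returns the map's keys ordered from the largest score to the smallest.
    static List<String> sortByScore(Map<String, Double> scores) {
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort(new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                // Arguments swapped (b before a) so that larger scores sort first.
                int byScore = Double.compare(scores.get(b), scores.get(a));
                if (byScore != 0) {
                    return byScore;
                }
                // Assumed tie-breaker for this sketch: ascending ID order.
                return a.compareTo(b);
            }
        });
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("d1", 0.5);
        scores.put("d2", 1.2);
        scores.put("d3", 0.5);
        System.out.println(sortByScore(scores)); // prints [d2, d1, d3]
    }
}
```

Double.compare is preferable to subtracting the two scores and casting to int, since the cast would turn any difference smaller than 1 into 0 and report a tie.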

Submitting the assignment

When you have completed the changes to your code, you should export an archive file containing the src/ directory from your Java project. To do this, follow the same steps as in Assignment 01 to produce a .zip file, and upload it to Gradescope.

Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.

Other notes

On Stringly-typed variables

You may have noticed that DocumentId is a very thin wrapper around the String class. Why bother? Why not just use a String for the document’s ID?

Well, you could. Doing so is called Stringly typing (a pun on “strongly” typing) your program.

But then you lose one of Java’s strengths, which is its strong static (that is, compile-time) type system’s ability to help you find errors. If we had just used String, I can guarantee that at least a third or so of the class would have accidentally used a document ID where they meant to use a term (or vice versa) in their code. This mistake would result in hard-to-track-down errors at runtime. But making a different type for document IDs prevents this error. Further, it aids in code readability, as it is now unambiguous what, say, a variable in a loop is for when it’s a DocumentId and not just a String.

The major downside is that Java makes declaring what amounts to a type alias very verbose. In many other languages this ends up being a single line (compare with type aliases and tuple structs in Rust or user-defined types in OCaml).
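To see both the benefit and the verbosity, here is a sketch of what a thin wrapper of this kind can look like. It is not the DocumentId from the starter code, which may be written differently; DocId, TypedIdSketch, and the placeholder termFrequency method are all invented for this example.

```java
import java.util.Objects;

public class TypedIdSketch {
    // Hypothetical thin wrapper around a String, in the spirit of DocumentId.
    static final class DocId {
        private final String id;

        DocId(String id) {
            this.id = Objects.requireNonNull(id);
        }

        // equals and hashCode are needed so that DocId works as a Map key or Set element.
        @Override
        public boolean equals(Object other) {
            return other instanceof DocId && ((DocId) other).id.equals(id);
        }

        @Override
        public int hashCode() {
            return id.hashCode();
        }

        @Override
        public String toString() {
            return id;
        }
    }

    // Placeholder body; only the parameter types matter for this illustration.
    static int termFrequency(DocId document, String term) {
        return 0;
    }

    public static void main(String[] args) {
        DocId document = new DocId("d1");
        System.out.println(termFrequency(document, "fox")); // compiles: arguments in the right order
        // termFrequency("fox", document); // would not compile: String is not a DocId
    }
}
```

With two String parameters, the swapped call on the commented-out line would compile and quietly return the wrong answer at runtime. The price is the twenty or so lines of wrapper needed to get there.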