Programming Assignment 06: Needle in a haystack

Estimated reading time: 10 minutes
Estimated time to complete: 2–3 hours (plus debugging time)
Prerequisites: Assignment 05, Lab 03
Starter code: needle-in-a-haystack-student.zip
Collaboration: not permitted

Overview

The INTERPOL Elite Crime Squad (ECS) has a problem, and only COMPSCI 190D students can help them. (Too cheesy? Not my fault! A surplus of cheese is clearly one of America’s most pressing problems.) A criminal has been committing a series of high-profile robberies from famous museums around the world. The Mona Lisa has left the Louvre. The Starry Night nipped out of New York. The Babbage Difference Engine No. 2 moved out of Mountain View. You get the idea.

From the most expensive hotel in the area of each of the five thefts, ECS has acquired logs of all the names, phone numbers, and passport numbers of the guests who checked in and out around each theft. ECS has reason to believe that the thief may have used multiple pseudonyms and phones, but used the same passport number in all cases. Therefore, they think they can figure out the identity of the thief (or at least, narrow down the list of suspects) by carefully examining the logs. Unfortunately, most of the thefts occurred in busy metropolitan areas, and the logs are enormous, containing literally hundreds of entries. No mere human being possesses the ability to methodically work through such vast quantities of data, what with the siren calls of Netflix and Pokémon Go and the like.

Your task is to write a program to sift through these logs and find the proverbial needle in the haystack. You’ll use an external library to help parse the logs. Then, you’ll use the appropriate data structure to quickly find the commonalities between the logs you’ve been provided.

We’ve provided a small set of unit tests to help with automated testing, though you might also want to write a class with a main method for interactive testing. If your code can pass the tests we’ve provided, it is likely correct.

Goals

  • Translate written descriptions of behavior into code.
  • Practice writing static methods.
  • Practice writing a “record” class with associated constructor and instance methods.
  • Practice representing state in a class.
  • Practice interacting with the Set and List abstractions.
  • Practice using external JARs.
  • Test code using unit tests.

Downloading and importing the starter code

As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a needle-in-a-haystack-student project in the “Project Explorer”.

What to do

Broadly, there are two parts to this assignment. First, you’ll write code to parse a hotel’s log into a collection of entries; each entry will represent a person and their passport number. Then you’ll compare the logs, finding the one entry (or small number of entries) that appear(s) in all of them.

Parsing the logs

The logs have been provided as comma-separated value files (“CSVs”). CSV, a format you will learn to hate due to its ubiquity and lack of standardization, is a way (actually, a large family of slightly incompatible ways) of representing tabular data. Fortunately for you, each of the logs you have been given is in a single CSV format, one which the opencsv library can handle.

The project already includes the opencsv JAR (as well as its dependency, the Commons Lang 3 JAR) on its build path. Read the opencsv page (in particular, “How do I read and parse a CSV file?”), and you’ll see that you can instantiate a CSVReader using any other Reader object, such as a FileReader or a StringReader (or, you know, the Reader argument of parseLog). Once you have a CSVReader, you can use its readNext method to get an array of Strings (that is, a String[]) representing the values on the next line, or you can use its readAll method to load the entire log at once into a List<String[]>.

Or, you can ignore CSVReader and attempt to MacGyver your way to victory here, using String.split, regular expressions, bubblegum, and the like to attempt to parse the CSV. It’s up to you, but we won’t help you with this approach.
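If you are wondering why the do-it-yourself route is risky, here is a minimal sketch of what String.split does with a quoted field. The class name NaiveSplitDemo and both log lines are made up for illustration; the actual logs may or may not contain quoted commas, so treat this as a cautionary example rather than a description of your data.

```java
import java.util.Arrays;

class NaiveSplitDemo {
    // Splits on every comma, with no awareness of CSV quoting rules.
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // Hypothetical log lines; the real logs may be formatted differently.
        String plain = "Ann Lee,555-0100,P123";
        String quoted = "\"Lee, Ann\",555-0100,P123";

        // Three pieces, as intended.
        System.out.println(Arrays.toString(naiveSplit(plain)));

        // Four pieces: the comma inside the quoted name is treated as a separator.
        System.out.println(Arrays.toString(naiveSplit(quoted)));
    }
}
```

A CSV-aware parser treats the quoted name as a single value, which is the main reason to prefer a library over hand-rolled splitting.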

In any case, once you have the ability to get a String[] representing an entry, you’ll need to convert it into a SuspectEntry object. We’ve provided a skeleton of a SuspectEntry, but you’ll need to define instance variables, a constructor, likely some methods, and possibly implement an interface for the next task.

Filtering the data

So now you can parse a CSV-format log into a List<SuspectEntry>.

Ultimately, ECS will want to use your code to parse several logs, find the entry or entries in common (or at least, that represent the same person) among all the logs, and return a list of all distinct entries, sorted lexicographically by passport number, breaking ties by name and then by phone number.

The findCommonEntries method should perform this task, but much like in the DNA assignment, you may want to break things up into simpler methods that you can test independently.

When thinking about whether two SuspectEntrys refer to the same person, note that only the passport number is the uniquely identifying piece of information in the context of this problem. (When narrowing down potential suspects, ECS does not care if an individual has more than one phone number or name/alias, only that it’s the same individual.)

You will probably want to use a Set (of what type?) to narrow things down. You might then generate a List<SuspectEntry> of the final entries of interest. To sort it, you can call the .sort(null) method on the List of entries. But for this to work, you’ll need to have given SuspectEntry a natural ordering (that is, it will need to implement the Comparable interface, similar to the PostalAddress example we did in lecture).
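As a reminder of how a Set can narrow things down, here is a small sketch of set intersection using retainAll. It deliberately works on plain integers so that the question of which type to use for the logs stays yours to answer; the class name IntersectionDemo and the method commonElements are invented for this example and are not part of the starter code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class IntersectionDemo {
    // Returns the elements present in every list; the input lists are not modified.
    static Set<Integer> commonElements(List<List<Integer>> lists) {
        if (lists.isEmpty()) {
            return new HashSet<>();
        }
        // Start with everything in the first list, then discard whatever
        // is missing from each of the remaining lists.
        Set<Integer> common = new HashSet<>(lists.get(0));
        for (List<Integer> list : lists.subList(1, lists.size())) {
            common.retainAll(new HashSet<>(list));
        }
        return common;
    }

    public static void main(String[] args) {
        List<List<Integer>> lists = Arrays.asList(
                Arrays.asList(1, 2, 3, 4),
                Arrays.asList(2, 4, 6),
                Arrays.asList(4, 2, 9, 2));
        // The result contains exactly 2 and 4.
        System.out.println(commonElements(lists));
    }
}
```

Wrapping each list in a HashSet before calling retainAll keeps each membership check fast, which matters more as the lists grow.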

Be sure to return a complete sorted list of SuspectEntrys that have passport numbers matching the narrowed-down list.
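For the sorting step, it may help to see what a natural ordering with a tie-breaker looks like. The sketch below is a made-up PostalAddress, not necessarily the version from lecture: the two fields and the choice to order by ZIP code and then by street are assumptions chosen only to illustrate the pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class; the fields and their ordering are assumptions for illustration.
class PostalAddress implements Comparable<PostalAddress> {
    private final String zipCode;
    private final String street;

    PostalAddress(String zipCode, String street) {
        this.zipCode = zipCode;
        this.street = street;
    }

    // Orders by ZIP code first; the street is consulted only to break ties.
    @Override
    public int compareTo(PostalAddress other) {
        int byZip = zipCode.compareTo(other.zipCode);
        if (byZip != 0) {
            return byZip;
        }
        return street.compareTo(other.street);
    }

    @Override
    public String toString() {
        return zipCode + " " + street;
    }

    public static void main(String[] args) {
        List<PostalAddress> addresses = new ArrayList<>();
        addresses.add(new PostalAddress("01003", "Pleasant St"));
        addresses.add(new PostalAddress("01002", "Main St"));
        addresses.add(new PostalAddress("01002", "Amity St"));

        // Passing null tells sort to use the elements' natural ordering.
        addresses.sort(null);

        // Prints [01002 Amity St, 01002 Main St, 01003 Pleasant St]
        System.out.println(addresses);
    }
}
```

The same shape extends to a third tie-breaker: compare the next field only when all earlier comparisons return zero.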

Submitting the assignment

When you have completed the changes to your code, you should export an archive file containing the src/ directory from your Java project. To do this, follow the same steps as in Assignment 01 to produce a .zip file, and upload it to Gradescope.

Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.