Assignment 07: Needle in a Haystack

Starter code: needle-in-a-haystack-student.zip

Overview

The INTERPOL Elite Crime Squad (ECS) has a problem, and only COMPSCI 186 students can help them. A criminal has been committing a series of high-profile robberies from famous museums around the world. The Mona Lisa has left the Louvre. The Starry Night nipped out of New York. The Babbage Difference Engine No. 2 moved out of Mountain View. You get the idea.

From the most expensive hotel in the area of each of the five thefts, ECS has acquired logs of all the names, phone numbers, and passport numbers of the guests that checked in and out around each theft. ECS has reason to believe that the thief may have used multiple pseudonyms and phones, but used the same passport number in all cases. Therefore, they think they can figure out the identity of the thief (or at least, narrow down the list of suspects) by carefully examining the logs. Unfortunately, most of the thefts occurred in busy metropolitan areas, and the logs are enormous, containing literally dozens of entries. No mere human being possesses the ability to methodically work through such vast quantities of data, what with the siren calls of Netflix (and chill?), Fortnite, and the like distracting them.

Your task is to write a program to sift through these logs and find the proverbial needle in the haystack. You’ll use an external library to help parse the logs. Then, you’ll use the appropriate data structure to quickly find the commonalities between the logs you’ve been provided.

We’ve provided a small set of unit tests to help with automated testing, though you might also want to write a class with a main method for interactive testing. If your code can pass the tests we’ve provided, it is likely correct.

Goals

Downloading and importing the starter code

As in previous assignments, download and decompress the provided archive file containing the starter code. Then import it into Code in the same way; you should end up with a needle-in-a-haystack-student project in the “Project Explorer”.

What to do

Broadly, there are two parts to this assignment. First, you’ll write code to parse a hotel’s log into a collection of entries; each entry will represent a person, their phone number, and their passport number. Then you’ll compare the logs, finding the one entry (or small number of entries) that appear(s) in all of them.

Parsing the logs

The logs have been provided as comma-separated value files (“CSVs”). CSV, a format you will learn to hate due to its ubiquity and lack of standardization, is a way (actually, a large family of slightly-incompatible ways) of representing tabular data. Fortunately for you, each of the logs you have been given is in a single CSV format, one which the opencsv library can handle.

The project already includes the opencsv JAR (as well as its dependency, the Commons Lang 3 JAR) on its build path. How can you use it to implement parseLog? Read the opencsv page (in particular, “Reading into an Array of Strings”), and you’ll see that you can instantiate a CSVReader using any other Reader object, such as a FileReader or a StringReader. Your code should use the Reader argument of parseLog, in other words, something like:

CSVReader csvReader = new CSVReader(r);

Once you have a CSVReader, you can use its readNext method to get an array of Strings (that is, a String[]) representing the values on the next line, or you can use its readAll method to load the entire log at once into a List<String[]>.

Or, you can ignore CSVReader and attempt to MacGyver your way to victory here, using String.split, regular expressions, bubblegum, baling wire, and the like to attempt to parse the CSV. It’s up to you, but we won’t help you with this approach.

In any case, once you have the ability to get a String[] representing an entry, you’ll need to convert it into a SuspectEntry object. We’ve provided a skeleton of a SuspectEntry, but you’ll need to define instance variables, a constructor, and some methods (like toString, equals, and hashCode, especially if you want things like a collection of SuspectEntries to work as you’d expect if you use, for example, contains). You might want to make SuspectEntry implement Comparable<SuspectEntry> to facilitate the sorting described later, but hold off on that until you read the rest of this document.
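To make the role of equals and hashCode concrete, here is a minimal sketch of a value class built from one parsed row. It is not the provided skeleton: the class name EntrySketch, the field names, the fromRow helper, and the column order (name, phone number, passport number) are all assumptions made for illustration, so check the actual logs and starter code before borrowing anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for SuspectEntry; names and column order are assumptions.
final class EntrySketch {
    private final String name;
    private final String phoneNumber;
    private final String passportNumber;

    EntrySketch(String name, String phoneNumber, String passportNumber) {
        this.name = name;
        this.phoneNumber = phoneNumber;
        this.passportNumber = passportNumber;
    }

    // Assumes each row holds exactly three values: name, phone number, passport number.
    static EntrySketch fromRow(String[] row) {
        if (row.length != 3) {
            throw new IllegalArgumentException("expected 3 columns, got " + row.length);
        }
        return new EntrySketch(row[0], row[1], row[2]);
    }

    String getPassportNumber() {
        return passportNumber;
    }

    // Two entries are equal only when all three fields match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EntrySketch)) return false;
        EntrySketch other = (EntrySketch) o;
        return name.equals(other.name)
                && phoneNumber.equals(other.phoneNumber)
                && passportNumber.equals(other.passportNumber);
    }

    // Must agree with equals: equal entries produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(name, phoneNumber, passportNumber);
    }

    @Override
    public String toString() {
        return "EntrySketch(" + name + ", " + phoneNumber + ", " + passportNumber + ")";
    }
}

class EntrySketchDemo {
    public static void main(String[] args) {
        List<EntrySketch> entries = new ArrayList<>();
        entries.add(EntrySketch.fromRow(new String[] {"A. Lupin", "555-0100", "X123"}));
        // contains relies on equals, so a separately constructed equal entry is found.
        System.out.println(entries.contains(new EntrySketch("A. Lupin", "555-0100", "X123"))); // prints true
    }
}
```

Without the equals override, contains would fall back to reference equality and the check in main would print false, which is the kind of surprise the paragraph above warns about.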

Filtering the data

So now you can parse a CSV-format log into a List<SuspectEntry>.

Ultimately, ECS will want to use your code to parse several logs, find the entry or entries in common (or at least, that represent the same person) among all the logs, and return a list of all distinct entries, sorted lexicographically by passport number, breaking ties by name and then by phone number.

The findCommonEntries method should perform this task, but much like in the DNA assignment, you may want to break things up into simpler methods that you can test independently.

When thinking about whether two SuspectEntrys refer to the same person, note that only the passport number is the uniquely identifying piece of information in the context of this problem. (When narrowing down potential suspects, ECS does not care if an individual has more than one phone number or name/alias, only that it’s the same individual.)
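Because only the passport number identifies a person, one possible way to find who appears in every log is to intersect sets of passport numbers. The sketch below shows that idea on plain strings; the class and method names are hypothetical, and it assumes you have already pulled the passport numbers out of each parsed log.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PassportIntersectionSketch {
    // Returns the passport numbers present in every log.
    // Each inner list stands for the passport numbers taken from one hotel's log.
    static Set<String> commonPassports(List<List<String>> passportsPerLog) {
        if (passportsPerLog.isEmpty()) {
            return new HashSet<>();
        }
        // Start with the first log's passports, then whittle the set down.
        Set<String> common = new HashSet<>(passportsPerLog.get(0));
        for (List<String> log : passportsPerLog.subList(1, passportsPerLog.size())) {
            // retainAll keeps only the elements that are also in the argument.
            common.retainAll(new HashSet<>(log));
        }
        return common;
    }

    public static void main(String[] args) {
        List<List<String>> logs = Arrays.asList(
                Arrays.asList("P1", "P2", "P3"),
                Arrays.asList("P2", "P3", "P4"),
                Arrays.asList("P3", "P2", "P9"));
        // Only P2 and P3 occur in all three logs (a HashSet's print order is unspecified).
        System.out.println(commonPassports(logs));
    }
}
```

Wrapping each log in a HashSet before calling retainAll keeps each membership check at expected constant time, rather than scanning a list for every element.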

You will probably want to use a Set (of what? not SuspectEntries!) to narrow things down. You might then generate a List<SuspectEntry> of the final entries of interest. To sort it, you can call the sort(null) method on the List of entries. But for this to work, you’ll need to have given SuspectEntry a natural ordering (that is, it will need to implement the Comparable interface, similar to the PostalAddress example we did in lecture). Or you can write your own Comparator and pass it to sort instead of null.
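The two sorting options can be sketched as follows. This is an illustration under assumptions, not the required design: OrderedEntrySketch and its fields are hypothetical names, and the ordering shown (passport number, then name, then phone number) is the one described earlier in this document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical entry type with a natural ordering; not the provided SuspectEntry.
final class OrderedEntrySketch implements Comparable<OrderedEntrySketch> {
    final String name;
    final String phoneNumber;
    final String passportNumber;

    OrderedEntrySketch(String name, String phoneNumber, String passportNumber) {
        this.name = name;
        this.phoneNumber = phoneNumber;
        this.passportNumber = passportNumber;
    }

    // Option 1: natural ordering by passport number, then name, then phone number.
    @Override
    public int compareTo(OrderedEntrySketch other) {
        int byPassport = passportNumber.compareTo(other.passportNumber);
        if (byPassport != 0) return byPassport;
        int byName = name.compareTo(other.name);
        if (byName != 0) return byName;
        return phoneNumber.compareTo(other.phoneNumber);
    }

    // Option 2: the same ordering expressed as a standalone Comparator.
    static final Comparator<OrderedEntrySketch> BY_PASSPORT_NAME_PHONE =
            Comparator.comparing((OrderedEntrySketch e) -> e.passportNumber)
                    .thenComparing(e -> e.name)
                    .thenComparing(e -> e.phoneNumber);

    @Override
    public String toString() {
        return passportNumber + "/" + name + "/" + phoneNumber;
    }
}

class SortSketchDemo {
    public static void main(String[] args) {
        List<OrderedEntrySketch> entries = new ArrayList<>();
        entries.add(new OrderedEntrySketch("Zed", "555-0102", "P2"));
        entries.add(new OrderedEntrySketch("Bob", "555-0101", "P1"));
        entries.add(new OrderedEntrySketch("Al", "555-0103", "P1"));

        entries.sort(null); // null means "use the elements' natural ordering"
        System.out.println(entries); // prints [P1/Al/555-0103, P1/Bob/555-0101, P2/Zed/555-0102]

        entries.sort(OrderedEntrySketch.BY_PASSPORT_NAME_PHONE); // same result via a Comparator
    }
}
```

Either option is enough on its own. If you implement Comparable, it is generally wise to keep compareTo consistent with equals, so that two entries compare as 0 exactly when equals says they are the same.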

Be sure to return a complete, sorted list of SuspectEntrys that have passport numbers matching the narrowed-down list.

Reminder: You do not need Maps for this assignment! Just Sets and Lists will suffice.

Submitting the assignment

When you have completed the changes to your code, you should export an archive file containing the entire Java project. To do this, follow the same steps as in Assignment 01 to produce a .zip file, and upload it to Gradescope. Note that if you want things to upload faster, you can expand the project and use an external program to zip only the src/ directory; that’s all this autograder requires.

Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.