Programming Assignment 07: Needle in a Haystack
Estimated reading time: 10 minutes
Estimated time to complete: 2–3 hours (plus debugging time)
Prerequisites: Assignment 06
Starter code: needle-in-a-haystack-student.zip
Collaboration: not permitted
Overview
The INTERPOL Elite Crime Squad (ECS) has a problem, and only COMPSCI 186 students can help them. A criminal has been committing a series of high-profile robberies from famous museums around the world. The Mona Lisa has left the Louvre. The Starry Night nipped out of New York. The Babbage Difference Engine No. 2 moved out of Mountain View. You get the idea.
From the most expensive hotel in the area of each of the five thefts, ECS has acquired logs of all the names, phone numbers, and passport numbers of the guests who checked in and out around each theft. ECS has reason to believe that the thief may have used multiple pseudonyms and phones, but used the same passport number in all cases. Therefore, they think they can figure out the identity of the thief (or at least, narrow down the list of suspects) by carefully examining the logs. Unfortunately, most of the thefts occurred in busy metropolitan areas, and the logs are enormous, containing literally hundreds of entries. No mere human being possesses the ability to methodically work through such vast quantities of data, what with the siren calls of Netflix (and chill?), Steam, and the like distracting them.
Your task is to write a program to sift through these logs and find the proverbial needle in the haystack. You’ll use an external library to help parse the logs. Then, you’ll use the appropriate data structure to quickly find the commonalities between the logs you’ve been provided.
We’ve provided a small set of unit tests to help with automated testing, though you might also want to write a class with a main method for interactive testing. If your code can pass the tests we’ve provided, it is likely correct.
Goals
- Translate written descriptions of behavior into code.
- Practice writing static methods.
- Practice writing a “record” class with associated constructor and instance methods.
- Practice representing state in a class.
- Practice interacting with the Set and List abstractions.
- Practice using external JARs.
- Test code using unit tests.
Downloading and importing the starter code
As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a needle-in-a-haystack-student project in the “Project Explorer”.
What to do
Broadly, there are two parts to this assignment. First, you’ll write code to parse a hotel’s log into a collection of entries; each entry will represent a person, their phone number, and their passport number. Then you’ll compare the logs, finding the one entry (or small number of entries) that appear(s) in all of them.
Parsing the logs
The logs have been provided as comma-separated value files (“CSVs”). CSV, a format you will learn to hate due to its ubiquity and lack of standardization, is a way (actually, a large family of slightly incompatible ways) of representing tabular data. Fortunately for you, all of the logs you have been given are in a single CSV format, one which the opencsv library can handle.
The project already includes the opencsv JAR (as well as its dependency, the Commons Lang 3 JAR) on its build path. How can you use it to implement parseLog? Read the opencsv page (in particular, “Reading into an Array of Strings”), and you’ll see that you can instantiate a CSVReader using any other Reader object, such as a FileReader or a StringReader (or, you know, the Reader argument of parseLog). Once you have a CSVReader, you can use its readNext method to get an array of Strings (that is, a String[]) representing the values on the next line, or you can use its readAll method to load the entire log at once into a List<String[]>.
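As a rough illustration of the two reading styles just described, here is a minimal sketch. It assumes the opencsv JAR on the build path exposes CSVReader in the com.opencsv package (check the version you were given). The class name CsvReadingSketch and the sample data are made up for illustration and are not part of the starter code. The methods declare throws Exception because the checked exceptions thrown by readNext and readAll differ between opencsv versions.

```java
import com.opencsv.CSVReader;

import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the starter code.
class CsvReadingSketch {

    // Reads rows one at a time; readNext() returns null at end of input.
    static List<String[]> readLineByLine(Reader reader) throws Exception {
        List<String[]> rows = new ArrayList<>();
        try (CSVReader csv = new CSVReader(reader)) {
            String[] row;
            while ((row = csv.readNext()) != null) {
                rows.add(row);
            }
        }
        return rows;
    }

    // Loads every row at once into a List<String[]>.
    static List<String[]> readEverything(Reader reader) throws Exception {
        try (CSVReader csv = new CSVReader(reader)) {
            return csv.readAll();
        }
    }

    public static void main(String[] args) throws Exception {
        // Invented sample data: two rows of three fields each.
        String sample = "\"Doe, Jane\",555-0100,P123\nJohn Roe,555-0199,P456\n";
        for (String[] row : readLineByLine(new StringReader(sample))) {
            System.out.println(row.length + " fields, first = " + row[0]);
        }
    }
}
```

Reading from a StringReader, as main does here, is a convenient way to try things out without touching the real log files.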
Or, you can ignore CSVReader and attempt to MacGyver your way to victory here, using String.split, regular expressions, bubblegum, and the like to attempt to parse the CSV. It’s up to you, but we won’t help you with this approach.
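To see why the do-it-yourself route can be fragile, consider a field that contains a quoted comma. The sketch below uses an invented sample line and a hypothetical class name (NaiveSplitSketch); whether the provided logs actually contain such fields is not stated here, so treat this as a cautionary illustration only.

```java
import java.util.Arrays;

// Hypothetical demo class; the sample line is invented for illustration.
class NaiveSplitSketch {

    // Splits on every comma, with no awareness of CSV quoting rules.
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // Three logical fields, but the first one contains a quoted comma.
        String line = "\"Doe, Jane\",555-0100,P123";
        String[] pieces = naiveSplit(line);
        System.out.println(pieces.length);           // prints 4, not 3
        System.out.println(Arrays.toString(pieces));
    }
}
```

A CSV-aware parser treats the quoted text as a single field, which is one reason a library is usually the safer choice.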
In any case, once you have the ability to get a String[] representing an entry, you’ll need to convert it into a SuspectEntry object. We’ve provided a skeleton of SuspectEntry, but you’ll need to define instance variables, a constructor, likely some methods, and possibly implement an interface for the next task.
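If the idea of a “record” class built from a parsed row is unfamiliar, the following sketch shows the general shape using a deliberately unrelated, hypothetical class (CityRecord, with made-up fields and sample data). It is not the SuspectEntry design; the fields, column order, and validation your class needs are for you to work out.

```java
// Hypothetical record-style class, unrelated to the starter code.
class CityRecord {
    // State is stored in private, final instance variables.
    private final String name;
    private final String country;
    private final int population;

    CityRecord(String name, String country, int population) {
        this.name = name;
        this.country = country;
        this.population = population;
    }

    // Builds a record from one parsed CSV row: name, country, population.
    static CityRecord fromRow(String[] row) {
        if (row.length != 3) {
            throw new IllegalArgumentException("expected 3 fields but got " + row.length);
        }
        return new CityRecord(row[0], row[1], Integer.parseInt(row[2].trim()));
    }

    String getName() { return name; }

    String getCountry() { return country; }

    int getPopulation() { return population; }

    @Override
    public String toString() {
        return name + " (" + country + "), population " + population;
    }

    public static void main(String[] args) {
        // Invented sample row.
        System.out.println(fromRow(new String[] {"Springfield", "Nowhere", "12345"}));
    }
}
```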
Filtering the data
So now you can parse a CSV-format log into a List<SuspectEntry>.
Ultimately, ECS will want to use your code to parse several logs, find the entry or entries in common (or at least, that represent the same person) among all the logs, and return a list of all distinct entries, sorted lexicographically by passport number, breaking ties by name and then by phone number.
The findCommonEntries method should perform this task, but much like in the DNA assignment, you may want to break things up into simpler methods that you can test independently.
When thinking about whether two SuspectEntry objects refer to the same person, note that only the passport number is the uniquely identifying piece of information in the context of this problem. (When narrowing down potential suspects, ECS does not care if an individual has more than one phone number or name/alias, only that it’s the same individual.)
You will probably want to use a Set (of what?) to narrow things down. You might then generate a List<SuspectEntry> of the final entries of interest. To sort it, you can call the .sort(null) method on the List of entries. But for this to work, you’ll need to have given SuspectEntry a natural ordering (that is, it will need to implement the Comparable interface, similar to the PostalAddress example we did in lecture). Or you can write your own Comparator and pass it to .sort in place of null.
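As a reminder of how a Set can narrow things down, the sketch below intersects several lists of plain integers using retainAll. The class and method names (IntersectionSketch, commonToAll) and the data are hypothetical; deciding what your own Set should contain is still up to you.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of the starter code.
class IntersectionSketch {

    // Returns the values present in every list; the inputs are not modified.
    static Set<Integer> commonToAll(List<List<Integer>> lists) {
        if (lists.isEmpty()) {
            return new HashSet<>();
        }
        // Start with everything in the first list, then discard
        // anything that is missing from any later list.
        Set<Integer> common = new HashSet<>(lists.get(0));
        for (List<Integer> list : lists.subList(1, lists.size())) {
            common.retainAll(new HashSet<>(list));
        }
        return common;
    }

    public static void main(String[] args) {
        List<List<Integer>> lists = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(2, 3, 4),
                Arrays.asList(3, 2, 9));
        System.out.println(commonToAll(lists)); // contains 2 and 3
    }
}
```

Copying each list into a HashSet before calling retainAll keeps the membership checks fast even when the lists are long.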
Be sure to return a complete, sorted list of SuspectEntry objects that have passport numbers matching the narrowed-down list.
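For a sense of how a natural ordering with tie-breaking makes .sort(null) work, here is a small sketch using a hypothetical Book class that sorts by author and then by title. It only illustrates the Comparable pattern; the fields and comparison order your class needs come from the requirements above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class used only to illustrate Comparable.
class Book implements Comparable<Book> {
    private final String author;
    private final String title;

    Book(String author, String title) {
        this.author = author;
        this.title = title;
    }

    // Natural ordering: by author first, breaking ties by title.
    @Override
    public int compareTo(Book other) {
        int byAuthor = author.compareTo(other.author);
        if (byAuthor != 0) {
            return byAuthor;
        }
        return title.compareTo(other.title);
    }

    @Override
    public String toString() {
        return author + ": " + title;
    }

    public static void main(String[] args) {
        List<Book> books = new ArrayList<>();
        books.add(new Book("Le Guin", "The Dispossessed"));
        books.add(new Book("Austen", "Persuasion"));
        books.add(new Book("Austen", "Emma"));
        books.sort(null); // null means "use the natural ordering"
        System.out.println(books);
    }
}
```

Passing null to sort only works because Book implements Comparable; for a class without a natural ordering, the call fails at runtime with a ClassCastException.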
Submitting the assignment
When you have completed the changes to your code, you should export an archive file containing the src/ directory from your Java project. To do this, follow the same steps as in Assignment 01 to produce a .zip file, and upload it to Gradescope.
Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.