Programming Assignment 06: Needle in a haystack
Estimated reading time: 10 minutes
Estimated time to complete: 60-120 minutes (plus debugging time)
Prerequisites: Assignment 05, Lab 04
Starter code: needle-in-a-haystack-student.zip
Collaboration: not permitted
Overview
The INTERPOL Elite Crime Squad (ECS) has a problem, and only COMPSCI 190D students can help them. (Too cheesy? Not my fault! A surplus of cheese is clearly one of America’s most pressing problems.) A criminal has been committing a series of high-profile robberies from famous museums around the world. The Mona Lisa has left the Louvre. The Starry Night nipped out of New York. The Babbage Difference Engine No. 2 moved out of Mountain View. You get the idea.
From the most expensive hotel in the area of each of the five thefts, ECS has acquired logs of all the names, phone numbers, and passport numbers of the guests who checked in and out around each theft. ECS has reason to believe that the thief may have used multiple pseudonyms and phones, but used the same passport number in all cases. Therefore, they think they can figure out the identity of the thief (or at least, narrow down the list of suspects) by carefully examining the logs. Unfortunately, most of the thefts occurred in busy metropolitan areas, and the logs are enormous, containing literally hundreds of entries. No mere human being possesses the ability to methodically work through such vast quantities of data, what with the siren calls of Netflix and Pokémon Go and the like.
Your task is to write a program to sift through these logs and find the proverbial needle in the haystack. You’ll use an external library to help parse the logs. Then, you’ll use the appropriate data structure to quickly find the commonalities between the logs you’ve been provided.
We’ve provided a small set of unit tests to help with automated testing, though you might also want to write a class with a main method for interactive testing. The Gradescope autograder includes a few more tests, but they exist primarily to verify you’re not gaming the autograder. If your code can pass the tests we’ve provided, it is likely correct.
Note that if you run into trouble with the Eclipse debugger mysteriously quitting during unit tests, it’s due to the timeout rule that we use to catch infinite loops:
@Rule
public Timeout globalTimeout = Timeout.seconds(10); // 10 seconds
Comment out the above two lines in all test files, and the debugger will no longer exit (but test cases that hit an infinite loop will now hang instead of timing out).
Goals
- Translate written descriptions of behavior into code.
- Practice writing static methods.
- Practice writing a “record” class with associated constructor and instance methods.
- Practice representing state in a class.
- Practice interacting with the Set and List abstractions.
- Practice using external JARs.
- Test code using unit tests.
Downloading and importing the starter code
As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a needle-in-a-haystack-student project in the “Project Explorer”.
What to do
Broadly, there are two parts to this assignment. First, you’ll write code to parse a hotel’s log into a collection of entries; each entry will represent a person and their passport number. Then you’ll compare the logs, finding the one entry (or small number of entries) that appear(s) in all of them.
Parsing the logs
The logs have been provided as comma-separated value files (“CSVs”). CSV, a format you will learn to hate due to its ubiquity and lack of standardization, is a way (actually, a large family of slightly-incompatible ways) of representing tabular data. Fortunately for you, each of the logs you have been given is in a single CSV format, one which the opencsv library can handle.
The project already includes the opencsv JAR (as well as its dependency, the Commons Lang 3 JAR) on its build path. Read the opencsv page (in particular, “How do I read and parse a CSV file?”), and you’ll see that you can instantiate a CSVReader using any other Reader object, such as a FileReader or a StringReader (or, you know, the Reader argument of parseLog). Once you have a CSVReader, you can use its readNext method to get an array of Strings (that is, a String[]) representing the values on the next line, or you can use its readAll method to load the entire log at once into a List<String[]>.
Or, you can ignore CSVReader and attempt to MacGyver your way to victory here, using String.split, regular expressions, bubblegum, and the like to attempt to parse the CSV. It’s up to you, but we won’t help you with this approach.
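To see why the do-it-yourself route can bite, here is a minimal sketch of what String.split does with a quoted field. The class name NaiveSplitDemo and both rows are made up for illustration; they are not taken from the real logs, which may or may not contain quoted fields.

```java
import java.util.Arrays;

// Hypothetical demo class; both rows are made-up data, not lines from the real logs.
class NaiveSplitDemo {
    // Splits on every comma and knows nothing about CSV quoting rules.
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // No quoted fields: the naive split yields the three values you would expect.
        System.out.println(Arrays.toString(naiveSplit("Jane Doe,555-0100,X123")));
        // A quoted field that contains a comma is cut in two, giving four pieces.
        System.out.println(Arrays.toString(naiveSplit("\"Doe, Jane\",555-0100,X123")));
    }
}
```

A CSV-aware parser would treat the quoted name as a single value, which is the main reason to prefer a library here.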
In any case, once you have the ability to get a String[] representing an entry, you’ll need to convert it into a SuspectEntry object. We’ve provided a skeleton of a SuspectEntry, but you’ll need to define instance variables, a constructor, likely some methods, and possibly implement an interface for the next task.
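As a rough illustration of what a “record” class can look like, here is a sketch using a hypothetical GuestRecord class (deliberately not the SuspectEntry skeleton, whose required shape may differ). It assumes the columns arrive in the order name, phone number, passport number, and it assumes that equality should cover all three fields; both are guesses you should check against the starter code and tests.

```java
import java.util.Objects;

// Hypothetical stand-in for a record class; the real SuspectEntry skeleton may differ.
final class GuestRecord {
    private final String name;
    private final String phoneNumber;
    private final String passportNumber;

    GuestRecord(String name, String phoneNumber, String passportNumber) {
        this.name = Objects.requireNonNull(name);
        this.phoneNumber = Objects.requireNonNull(phoneNumber);
        this.passportNumber = Objects.requireNonNull(passportNumber);
    }

    // ASSUMPTION: columns appear in the order name, phone number, passport number.
    static GuestRecord fromRow(String[] row) {
        if (row.length != 3) {
            throw new IllegalArgumentException("expected 3 columns, got " + row.length);
        }
        return new GuestRecord(row[0], row[1], row[2]);
    }

    String getName() { return name; }
    String getPhoneNumber() { return phoneNumber; }
    String getPassportNumber() { return passportNumber; }

    // ASSUMPTION: two records are equal only when all three fields match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof GuestRecord)) return false;
        GuestRecord that = (GuestRecord) other;
        return name.equals(that.name)
            && phoneNumber.equals(that.phoneNumber)
            && passportNumber.equals(that.passportNumber);
    }

    // Kept consistent with equals so the class behaves correctly in hash-based sets.
    @Override
    public int hashCode() {
        return Objects.hash(name, phoneNumber, passportNumber);
    }
}
```

Defining equals and hashCode together matters as soon as objects go into a HashSet or serve as HashMap keys; without them, two records holding identical values would count as different.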
Filtering the data
So now you can parse a CSV-format log into a List<SuspectEntry>.
Ultimately, ECS will want to use your code to parse several logs, find the entry or entries in common (or at least, that represent the same person) among all the logs, and return a list of all distinct entries, sorted lexicographically by passport number, breaking ties by name and then by phone number.
The findCommonEntries method should perform this task, but much like in the DNA assignment, you may want to break things up into simpler methods that you can test independently.
When thinking about whether two SuspectEntry objects refer to the same person, note that only the passport number is the uniquely identifying piece of information in the context of this problem. (When narrowing down potential suspects, ECS does not care if an individual has more than one phone number or name/alias, only that it’s the same individual.)
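Since the passport number is what identifies a person, one way to think about the narrowing-down step is as a set intersection over passport numbers. The sketch below is only one possible approach, not the required one; the class CommonPassportsDemo and the method commonPassports are hypothetical names, and it assumes each log has already been reduced to a Set<String> of its passport numbers.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; assumes each log was already reduced to its set of passport numbers.
class CommonPassportsDemo {
    // Returns the passport numbers that appear in every one of the given sets.
    static Set<String> commonPassports(List<Set<String>> passportsPerLog) {
        if (passportsPerLog.isEmpty()) {
            return new HashSet<>();
        }
        // Copy the first set so the caller's data is not modified.
        Set<String> common = new HashSet<>(passportsPerLog.get(0));
        for (Set<String> passports : passportsPerLog.subList(1, passportsPerLog.size())) {
            // retainAll keeps only the elements also present in the argument.
            common.retainAll(passports);
        }
        return common;
    }
}
```

With hash-based sets, each retainAll call does roughly one lookup per remaining element, which is much cheaper than comparing every entry of one list against every entry of another.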
Be sure to return a complete sorted list of SuspectEntry objects that have passport numbers matching the narrowed-down list.
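The ordering described above (passport number first, then name, then phone number) can be sketched with a chained Comparator. To keep the example self-contained it works on raw String[] rows rather than on SuspectEntry, and it assumes the columns are name, phone number, passport number; SortedDistinctDemo and its members are hypothetical names. A TreeSet built on that comparator drops exact duplicates and keeps the rows in order at the same time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical demo on raw rows; a real solution would likely work on SuspectEntry objects.
class SortedDistinctDemo {
    // ASSUMPTION: column order is name, phone number, passport number.
    static final int NAME = 0;
    static final int PHONE = 1;
    static final int PASSPORT = 2;

    // Orders by passport number, breaking ties by name and then by phone number.
    static final Comparator<String[]> ORDER =
        Comparator.comparing((String[] row) -> row[PASSPORT])
                  .thenComparing(row -> row[NAME])
                  .thenComparing(row -> row[PHONE]);

    // Returns the distinct rows in sorted order; rows equal in all three fields collapse.
    static List<String[]> sortedDistinct(List<String[]> rows) {
        TreeSet<String[]> ordered = new TreeSet<>(ORDER);
        ordered.addAll(rows);
        return new ArrayList<>(ordered);
    }
}
```

Note that a TreeSet decides “same element” using the comparator alone, so this only works because the comparator looks at every field. The same result could be had by putting the entries in a HashSet and then sorting a list copy.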
Submitting the assignment
When you have completed the changes to your code, you should export an archive file containing the src/ directory from your Java project. To do this, follow the same steps as in Assignment 01 to produce a .zip file, and upload it to Gradescope.
Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.