Programming Assignment 06: Similarity Detector
Estimated reading time: 10 minutes
Estimated time to complete: two to three hours (plus debugging time)
Prerequisites: Assignment 03
Starter code: similarity-detector-student.zip
Collaboration: not permitted
Overview
Many classes at UMass and other colleges and universities use one or more automated tools to alert instructors to possible plagiarism. These tools, such as turnitin and MossPlus, are used to evaluate the similarity of texts to one another and/or to large corpora of source material. Turnitin, for example, doesn’t just check student assignments against one another – it can compare against assignments submitted in previous years (thus helping deter paper banks), or against all of Wikipedia, or against other digitized texts.
At its core, these services have to compare the similarity between texts (or program code). How do they work? In this assignment, you’ll build a simplified similarity detection system and find out. Your system will convert input texts into sets representing those texts, and will compare those sets using the Jaccard index, a measure of the similarity between sets.
We’ve provided a large set of unit tests to help with automated testing, though you might also want to write a class with a main
method for interactive testing. The Gradescope autograder includes a few more tests, but they exist primarily to verify you’re not gaming the autograder. If your code can pass the tests we’ve provided, it is likely correct.
As before, we’ve disable the timeout code so you can use the debugger, but if your code gets stuck during testing, you might want to uncomment these two lines at the top of each test file:
@Rule
public Timeout globalTimeout = Timeout.seconds(10); // 10 seconds
Goals
- Translate written descriptions of behavior into code.
- Practice writing static methods.
- Practice interacting with the
Set
andList
abstractions. - Test code using unit tests.
Downloading and importing the starter code
As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a similarity-detector-student
project in the “Project Explorer”.
What to do
Complete the code in both SetUtilities
and SimilarityUtilities
.
SetUtilities
SetUtilities
has four methods for you to complete. Reading the type signatures:
public static <E> Set<E> union(Set<E> s, Set<E> t)
you should understand that these are generic methods, that is, they are parameterized on type <E>
. union
will work on any two of same-typed Set<E>
s. That’s what the <E>
before the return type of a static
method indicates: this static
method is generic on type <E>
.
These four methods should be straightforward to implement. Get them done early, because you’ll need them for the next part of the assignment.
SimilarityUtilities
There are six methods to complete here.
As in previous assignments, you’ll find String.split
helpful. But I don’t expect you to know “regular expressions” at this point, so here are two you’ll need:
- Passing
"\\n"
toString.split
will split the original string on newlines ('\n'
s). It will return an array of strings representing the split lines. You may still need to handle empty lines yourself, though. - Passing
"\\W+"
(note: uppercaseW
) toString.split
will split the original string on non-word characters. (\W
means “a non-word character” and+
means “one or more times in a row”). You may still need to handle empty words yourself, as above.
You may find other instance methods of String
, such as trim
and toLowercase
, helpful.
In the first lineSimilarity
method, your code should convert each text to a set using trimmedLines
, then compare them using jaccardIndex
. There is no trick here!
In the second lineSimilarity
method, there is a template
argument. The template
is a string that you don’t want to consider when comparing the similarity of two texts. For example, if I were checking your programming assignments for similarity, I would include the starter code as the template. To “ignore” the template, convert a text to a set and a template to a set, then use set difference to remove the template set from the text set. Do so for both input texts, then find their Jaccard index.
Then comes the shingle
method. Shingling is a venerable technique for finding partial duplicates (see, for example, http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html) that creates a set of multiple overlapping pieces of a text for further comparison. The Java doc has an example, and the JUnit test cases have further examples; it should be straightforward to generalize from those examples. You’re likely to need a pair of nested for
loops.
Finally, implement the shingleSimilarity
method, which parallels the second lineSimilarity
method but requires shingling the words of the text, rather than the trimmed lines.
Submitting the assignment
When you have completed the changes to your code, you should export an archive file containing the entire Java project. To do this, follow the same steps as from Assignment 01 to produce a .zip
file, and upload it to Gradescope.
Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.