Programming Assignment 04: DNA Sequence Assembly
Estimated reading time: 15 minutes
Estimated time to complete: 90-150 minutes (plus debugging time)
Prerequisites: Assignment 03
Starter code: dna-sequence-assembly-student.zip
Collaboration: not permitted  
Start this assignment early. You will be writing about the same amount of code as in the last two assignments (or perhaps less), but it will require more thinking time. This is not an assignment most students will be able to rush through on a Friday afternoon.
Overview
As you no doubt remember from high school biology, the “code of life” is written in DNA (and its pal, RNA) – triplet “codons” encode a sequence of amino acids, which are assembled into proteins, and so on. An important breakthrough in biological sciences was the ability to replicate DNA in vitro (that is, not in a cell, but in a test tube) using the polymerase chain reaction (PCR). PCR creates many copies of fragments of a sequence of DNA. The equipment for PCR is now cheap enough that some high schools have PCR labs.
Another breakthrough was the ability to assemble the sequence of fragments into a coherent whole, thus determining the genetic code (the genome) for an entire organism. For example, the Human Genome Project has sequenced the human genome.
In this assignment, you’ll solve a simplified version of the DNA sequence assembly problem, using a simple “greedy” algorithm. That is, given a collection of two or more overlapping DNA fragments, you’ll align and merge these fragments by choosing the best matches at each step, with the goal of ending with a single longer sequence.
We’ve provided a large set of unit tests to help with automated testing, though you might also want to write a class with a main method for interactive testing. The Gradescope autograder includes a few more tests, but they exist primarily to verify you’re not gaming the autograder. If your code can pass the tests we’ve provided, it is likely correct.
Goals
- Translate written descriptions of behavior into code.
- Practice writing instance methods, including overriding methods of Object.
- Practice interacting with the List abstraction.
- Test code using unit tests.
Downloading and importing the starter code
As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a dna-sequence-assembly-student project in the “Project Explorer”.
What you will be doing
The most important thing to understand is how you’ll assemble shorter overlapping fragments into longer ones.
First, what does a fragment look like? It’s a sequence of one or more nucleotides; each is one of adenine (A), cytosine (C), guanine (G), and thymine (T). So a fragment might look like one of:
CGCAT
CATGAC
ACATG
Next, how do we assemble them? We’re going to use a simple greedy algorithm, which, quoting Wikipedia, reads as follows:
Given a set of sequence fragments the object is to find the shortest common supersequence.
1. Calculate pairwise alignments of all fragments.
2. Choose two fragments with the largest overlap.
3. Merge chosen fragments.
4. Repeat step 2 and 3 until only one fragment is left.
Let’s look at the first few steps in more detail, since they’re the least clear.
Calculating pairwise alignments
What’s a pairwise alignment (and what’s its overlap)? Let’s look at an example. Consider our first two fragments, listed above: CGCAT and CATGAC. We can see how much their “ends” overlap if we put CGCAT first and then CATGAC:
  these three overlap
  vvv
CGCAT
  CATGAC
  ^^^
  these three overlap
Here, there’s an overlap of three (CAT). If we were to merge these together, the result would be CGCATGAC:
 CGCAT
+  CATGAC
---------
 CGCATGAC
If we tried them the other way around, what would the overlap and merged fragment look like?
 CATGAC
+     CGCAT
-----------
 CATGACGCAT
Here the overlap is only one and the result would be CATGACGCAT.
You should now see (1) that any two fragments can have an overlap of at least zero and at most the length of the shorter fragment, and (2) that order matters when comparing overlaps: the front of one fragment can be checked against the rear of another, but that’s different from checking the rear of the first against the front of the second.
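To make the overlap idea concrete, here is a minimal sketch that works on plain Strings rather than on the Fragment class from the starter code. The class name OverlapSketch and the method name overlap are made up for illustration; your own code will need to match the method signatures in the provided files.

```java
// Hypothetical helper class; it uses plain Strings, not the assignment's Fragment type.
class OverlapSketch {

    // Returns the length of the longest suffix of `first` that is also a prefix of `second`.
    static int overlap(String first, String second) {
        int longestPossible = Math.min(first.length(), second.length());
        // Try the longest candidate first, so the first match found is the largest overlap.
        for (int k = longestPossible; k > 0; k--) {
            if (first.endsWith(second.substring(0, k))) {
                return k;
            }
        }
        return 0; // the end of `first` and the front of `second` share nothing
    }

    public static void main(String[] args) {
        System.out.println(overlap("CGCAT", "CATGAC")); // CGCAT first: prints 3
        System.out.println(overlap("CATGAC", "CGCAT")); // CATGAC first: prints 1
    }
}
```

Swapping the arguments changes the answer, which is the "order matters" point made above.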
Finally, note that we will only consider overlaps at the ends, and will not worry about one fragment being entirely embedded within another. That is, your code must not check for things like:
 GCTCAGC
+  TCA
--------
 GCTCAGC
However, two identical fragments will be merged, because they overlap completely when compared end to end. In other words, we do expect you to merge fragments like:
 GCTCAGC
+GCTCAGC
--------
 GCTCAGC
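The merges pictured above can be sketched the same way. This is an illustration on plain Strings, with hypothetical names (MergeSketch, overlap, merge), not the Fragment API from the starter code. It assumes that a pair with no end overlap is simply joined end to end.

```java
// Hypothetical helper class; it uses plain Strings, not the assignment's Fragment type.
class MergeSketch {

    // Length of the longest suffix of `first` that is also a prefix of `second`.
    static int overlap(String first, String second) {
        for (int k = Math.min(first.length(), second.length()); k > 0; k--) {
            if (first.endsWith(second.substring(0, k))) {
                return k;
            }
        }
        return 0;
    }

    // Keeps all of `first`, then appends the part of `second` that is not already covered.
    static String merge(String first, String second) {
        return first + second.substring(overlap(first, second));
    }

    public static void main(String[] args) {
        System.out.println(merge("CGCAT", "CATGAC"));     // prints CGCATGAC
        System.out.println(merge("GCTCAGC", "GCTCAGC"));  // identical: prints GCTCAGC
        System.out.println(merge("GCTCAGC", "TCA"));      // embedded, not detected: prints GCTCAGCTCA
    }
}
```

Because only the ends are compared, the embedded TCA is not recognized and the two strings are concatenated, while the identical pair collapses to a single copy.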
Choosing the largest overlap
Given a collection of fragments, you can compare every fragment against every other (in both orders!) and find the pair with the largest overlap.
But what if two have the same overlap? I want you to break ties by choosing the pair whose merger results in the shorter merged sequence.
If there are further ties, do what you like; I will try to make sure there are no tests that are ambiguous, and I don’t want your merge method to be sixteen special cases long. In a practical sequence assembler, deciding how to handle ambiguity is very important, as are many other cases: What about “almost perfect” matches, as real PCR occasionally induces errors in the fragments? Or embedded fragments, which I told you to ignore? Or how little overlap is so little as to be not worth merging? And so on. But we won’t worry about those details here.
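Here is one possible sketch of the selection rule, including the tie-break on merged length. It operates on a list of plain Strings, and every name in it (PairChoiceSketch, choosePair, the example fragments) is invented for illustration. When both the overlap and the merged length tie, this sketch just keeps the first pair it found, which is one of many acceptable choices.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; it uses plain Strings, not the assignment's Fragment type.
class PairChoiceSketch {

    // Length of the longest suffix of `first` that is also a prefix of `second`.
    static int overlap(String first, String second) {
        for (int k = Math.min(first.length(), second.length()); k > 0; k--) {
            if (first.endsWith(second.substring(0, k))) {
                return k;
            }
        }
        return 0;
    }

    // Returns {i, j}, the positions of the ordered pair to merge, or null if there is no pair.
    static int[] choosePair(List<String> fragments) {
        int bestOverlap = -1;                     // tracking a maximum
        int bestMergedLength = Integer.MAX_VALUE; // tracking a minimum
        int[] best = null;
        for (int i = 0; i < fragments.size(); i++) {
            for (int j = 0; j < fragments.size(); j++) {
                if (i == j) {
                    continue; // same list slot, so not a pair
                }
                int k = overlap(fragments.get(i), fragments.get(j));
                int mergedLength = fragments.get(i).length() + fragments.get(j).length() - k;
                boolean largerOverlap = k > bestOverlap;
                boolean tieButShorter = k == bestOverlap && mergedLength < bestMergedLength;
                if (largerOverlap || tieButShorter) {
                    bestOverlap = k;
                    bestMergedLength = mergedLength;
                    best = new int[] {i, j};
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // TTAC overlaps both ACTTT and ACG by two; merging with ACG gives the shorter result.
        int[] pair = choosePair(Arrays.asList("TTAC", "ACTTT", "ACG"));
        System.out.println(pair[0] + " " + pair[1]); // prints 0 2
    }
}
```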
Merging the fragments
Suppose we are still working with our example three fragments, CGCAT, CATGAC, and ACATG. Further suppose they’re stored in a list, which we’ll write as [CGCAT, CATGAC, ACATG].
If, after checking, we decided to merge the first two (as described above), our list would look like: [CGCATGAC, ACATG]. Then we’d merge again and be left with a single entry in our list: [CGCATGACATG].
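The list bookkeeping in this walkthrough might look like the sketch below. It uses plain Strings and hypothetical names (ListUpdateSketch, mergeAt), and the positions to merge are passed in by hand to mirror the steps just described. Placing the merged entry at the lower of the two positions is an assumption, made only so that the printed lists match the ones shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; it uses plain Strings, not the assignment's Fragment type.
class ListUpdateSketch {

    // Length of the longest suffix of `first` that is also a prefix of `second`.
    static int overlap(String first, String second) {
        for (int k = Math.min(first.length(), second.length()); k > 0; k--) {
            if (first.endsWith(second.substring(0, k))) {
                return k;
            }
        }
        return 0;
    }

    static String merge(String first, String second) {
        return first + second.substring(overlap(first, second));
    }

    // Replaces the entries at positions i and j (i first, then j) with their merger.
    static void mergeAt(List<String> fragments, int i, int j) {
        String merged = merge(fragments.get(i), fragments.get(j));
        // Remove the higher position first so the lower position is still valid.
        fragments.remove(Math.max(i, j));
        fragments.remove(Math.min(i, j));
        fragments.add(Math.min(i, j), merged);
    }

    public static void main(String[] args) {
        List<String> fragments = new ArrayList<>(Arrays.asList("CGCAT", "CATGAC", "ACATG"));
        mergeAt(fragments, 0, 1);
        System.out.println(fragments); // prints [CGCATGAC, ACATG]
        mergeAt(fragments, 0, 1);
        System.out.println(fragments); // prints [CGCATGACATG]
    }
}
```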
What to do
As usual, look over the files we’ve provided. The Fragment class represents a single fragment; the Assembler class keeps a list of Fragments and assembles them into longer Fragments.
Start with the Fragment class. Here are some hints to get you started there:
- You’ll need to store the nucleotide sequence in an instance variable. String is probably the easiest thing to use.
- length and toString should be straightforward.
- Review your notes from September 27 to write the equals method; you’ll need a correct equals method for the last few tests (based upon assertEquals) to work correctly. Note that a corrected version of the tests was posted on October 02. In particular, FragmentTest.testEqualsTrue2 should read:
@Test
public void testEqualsTrue2() {
  Fragment f = new Fragment(new String("GCAT"));
  Fragment g = new Fragment(new String("GCAT"));
  assertTrue(f.equals(g));
  assertTrue(g.equals(f));
}
and there should be a few new tests added:
@Test
public void testOverlapPastBounds() {
  Fragment f = new Fragment("GGAA");
  Fragment g = new Fragment("AAGGAA");
  assertEquals(2, f.calculateOverlap(g));
}
@Test
public void testSameMergeLength() {
  Fragment f = new Fragment("GCAT");
  Fragment g = new Fragment("GCAT");
  assertEquals(4, f.calculateOverlap(g));
}
@Test
public void testSameMerge() {
  Fragment f = new Fragment("GCAT");
  Fragment g = new Fragment("GCAT");
  assertEquals(new Fragment("GCAT"), f.mergedWith(g));
}
- Look over the instance methods of String when writing calculateOverlap and mergedWith. In particular, you might find startsWith, endsWith, and substring helpful. Also try breaking the solution up into conceptual chunks. For example, you might write a method that checks whether two Fragments have an overlap of a parameterized length (that is, returns true if they do). This method’s signature might look something like boolean hasOverlap(Fragment f, int overlapLength). It will be easier to get that working before you tackle the general calculateOverlap method (though you might need to write your own tests for it).
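If it helps to see why testEqualsTrue2 builds its arguments with new String, here is a small sketch of a content-based equals on an unrelated, made-up class (Codon is not part of the starter code). It follows one common pattern for overriding equals, which may differ in detail from the version in your notes.

```java
// Hypothetical value class for illustration only; it is not the assignment's Fragment.
final class Codon {
    private final String bases;

    Codon(String bases) {
        this.bases = bases;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true; // same object
        }
        if (!(other instanceof Codon)) {
            return false; // null or a different type
        }
        // Compare the contents of the two Strings, not their references.
        return bases.equals(((Codon) other).bases);
    }

    @Override
    public int hashCode() {
        return bases.hashCode(); // keep hashCode consistent with equals
    }

    public static void main(String[] args) {
        String s = new String("GCA");
        String t = new String("GCA");
        System.out.println(s == t);                             // distinct objects: prints false
        System.out.println(new Codon(s).equals(new Codon(t)));  // same contents: prints true
    }
}
```

Using new String guarantees two distinct String objects with the same characters, so an equals method that compares references instead of contents would report false.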
Once you have Fragment passing the tests, start on Assembler. Again, some hints:
- The constructor and getFragments should be straightforward, though watch the copy requirement in the constructor.
- For assembleOnce:
  - Sometimes we use -1 or 0 as the initial value of a variable that we’re checking against to track a maximum. What if you want to initialize a variable that’s tracking a minimum? Use Integer.MAX_VALUE in this case.
  - You’re probably going to need to write a nested for loop (that is, a for loop inside a for loop) to check each pair of fragments. Remember not to compare a fragment against itself (and think about whether this check should use == or equals).
  - Remember to add the newly merged Fragment to the list, and to remove the two Fragments that were merged.
- assembleAll will be a one-liner once you get assembleOnce working.
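One way to explore the == versus equals question from the hints is to count how many ordered pairs each kind of check lets through. The sketch below uses plain Strings and made-up names (SelfCheckSketch and its two methods); it only illustrates how the two checks differ and is not the required implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical experiment class; it uses plain Strings, not the assignment's Fragment type.
class SelfCheckSketch {

    // Counts ordered pairs, skipping a pair only when both entries are the very same object.
    static int pairsSkippingByIdentity(List<String> items) {
        int count = 0;
        for (int i = 0; i < items.size(); i++) {
            for (int j = 0; j < items.size(); j++) {
                if (items.get(i) == items.get(j)) {
                    continue;
                }
                count++;
            }
        }
        return count;
    }

    // Counts ordered pairs, skipping a pair whenever the two entries have equal contents.
    static int pairsSkippingByEquals(List<String> items) {
        int count = 0;
        for (int i = 0; i < items.size(); i++) {
            for (int j = 0; j < items.size(); j++) {
                if (items.get(i).equals(items.get(j))) {
                    continue;
                }
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two distinct objects holding the same sequence.
        List<String> twoCopies = Arrays.asList(new String("GCAT"), new String("GCAT"));
        System.out.println(pairsSkippingByIdentity(twoCopies)); // prints 2
        System.out.println(pairsSkippingByEquals(twoCopies));   // prints 0
    }
}
```

With the equals-based check, two identical fragments are never compared with each other, so they could never be merged as the earlier GCTCAGC example expects.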
Submitting the assignment
When you have completed the changes to your code, you should export an archive file containing the entire Java project. To do this, follow the same steps as in Assignment 01 to produce a .zip file, and upload it to Gradescope.
Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.