Programming Assignment 04: DNA Sequence Assembly
Estimated reading time: 15 minutes
Estimated time to complete: 90-150 minutes (plus debugging time)
Prerequisites: Assignment 03
Starter code: dna-sequence-assembly-student.zip
Collaboration: not permitted
Start this assignment early. You will be writing about the same amount of code as in the last two assignments (or perhaps less), but it will require more thinking time. This is not an assignment most students will be able to rush through on Friday afternoon.
Overview
As you no doubt remember from high school biology, the “code of life” is written in DNA (and its pal, RNA) – triplet “codons” encode a sequence of amino acids, which are assembled into proteins, and so on. An important breakthrough in biological sciences was the ability to replicate DNA in vitro (that is, not in a cell, but in a test tube) using the polymerase chain reaction (PCR). PCR creates many copies of fragments of a sequence of DNA. The equipment for PCR is now cheap enough that some high schools have PCR labs.
Another breakthrough was the ability to assemble the sequence of fragments into a coherent whole, thus determining the genetic code (the genome) for an entire organism. For example, the Human Genome Project has sequenced the human genome.
In this assignment, you’ll solve a simplified version of the DNA sequence assembly problem, using a simple “greedy” algorithm. That is, given a collection of two or more overlapping DNA fragments, you’ll align and merge these fragments by choosing the best matches at each step, with the goal of ending with a single longer sequence.
We’ve provided a large set of unit tests to help with automated testing, though you might also want to write a class with a main method for interactive testing. The Gradescope autograder includes a few more tests, but they exist primarily to verify you’re not gaming the autograder. If your code can pass the tests we’ve provided, it is likely correct.
Goals
- Translate written descriptions of behavior into code.
- Practice writing instance methods, including overriding methods of Object.
- Practice interacting with the List abstraction.
- Test code using unit tests.
Downloading and importing the starter code
As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a dna-sequence-assembly-student project in the “Project Explorer”.
What you will be doing
The most important thing to understand is how you’ll assemble shorter overlapping fragments into longer ones.
First, what does a fragment look like? It’s a sequence of one or more nucleotides; each is one of adenine (A), cytosine (C), guanine (G), and thymine (T). So a fragment might look like one of
CGCAT
CATGAC
ACATG
Next, how do we assemble them? We’re going to use a simple greedy algorithm, which, quoting Wikipedia, reads as follows:
Given a set of sequence fragments the object is to find the shortest common supersequence.
1. Calculate pairwise alignments of all fragments.
2. Choose two fragments with the largest overlap.
3. Merge chosen fragments.
4. Repeat step 2 and 3 until only one fragment is left.
Let’s look at the first few steps in more detail, since they’re the least clear.
Calculating pairwise alignments
What’s a pairwise alignment (and what’s its overlap)? Let’s look at an example. Consider our first two fragments, listed above: CGCAT and CATGAC. We can see how much their “ends” overlap if we put CGCAT first and then CATGAC:
  these three overlap
  vvv
CGCAT
  CATGAC
  ^^^
  these three overlap
Here, there’s an overlap of three (CAT). If we were to merge these together, the result would be CGCATGAC:
  CGCAT
+   CATGAC
----------
  CGCATGAC
If we tried them the other way around, what would the overlap and merged fragment look like?
  CATGAC
+      CGCAT
------------
  CATGACGCAT
Here the overlap is only one and the result would be CATGACGCAT.
You should now see (1) that any two fragments can have an overlap of at least zero and at most the length of the shorter fragment, and (2) that order matters when comparing overlaps: the front of one fragment can be checked against the rear of another, but that’s different from checking the rear of the first against the front of the second.
Finally, note that we will only consider overlaps on the end, and not worry about one fragment being entirely embedded within another. That is, your code must not check for things like:
  GCTCAGC
+   TCA
---------
  GCTCAGC
Two identical fragments will still be merged, however, because comparing them at the ends finds a full-length overlap. In other words, we do expect you to merge fragments like:
  GCTCAGC
+ GCTCAGC
---------
  GCTCAGC
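The end-overlap rule above can be sketched on plain strings. This is only an illustration under stated assumptions: the class OverlapSketch and its methods overlap and merge are hypothetical names, not part of the starter code, and the sketch works on String rather than Fragment.

```java
// Illustrative sketch on plain Strings; OverlapSketch, overlap, and merge are
// hypothetical names and are not part of the starter code.
final class OverlapSketch {

    // Length of the longest suffix of `first` that is also a prefix of `second`.
    static int overlap(String first, String second) {
        int max = Math.min(first.length(), second.length());
        for (int k = max; k > 0; k--) {
            // Compare the last k characters of first with the first k of second.
            if (first.regionMatches(first.length() - k, second, 0, k)) {
                return k;
            }
        }
        return 0;
    }

    // Concatenate the two strings, writing the shared region only once.
    static String merge(String first, String second) {
        return first + second.substring(overlap(first, second));
    }

    public static void main(String[] args) {
        System.out.println(overlap("CGCAT", "CATGAC"));   // 3
        System.out.println(merge("CGCAT", "CATGAC"));     // CGCATGAC
        System.out.println(overlap("CATGAC", "CGCAT"));   // 1
        System.out.println(merge("CATGAC", "CGCAT"));     // CATGACGCAT
        System.out.println(merge("GCTCAGC", "GCTCAGC"));  // GCTCAGC
        // An embedded copy is not detected: only the ends are compared.
        System.out.println(merge("GCTCAGC", "TCA"));      // GCTCAGCTCA
    }
}
```

Trying the longest candidate overlap first means the first match found is the largest one, which is why the identical-fragment case collapses to a single copy.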
Choosing the largest overlap
Given a collection of fragments, you can compare every fragment against every other (in both orders!) and find the pair with the largest overlap.
But what if two have the same overlap? I want you to break ties by choosing the pair whose merger results in the shorter merged sequence.
If there are further ties, do what you like – I will try to make sure there are no tests that are ambiguous, and I don’t want your merge method to be sixteen special cases long. In a practical sequence assembler, deciding how to handle ambiguity is very important, as are many other cases: What about “almost perfect” matches, as real PCR occasionally induces errors in the fragments? Or subsets, which I told you to ignore? Or how little overlap is so little as to be not worth merging? And so on. But we won’t worry about those details here.
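One way to picture the selection rule (largest overlap, ties broken by the shorter merged result) is the sketch below. It is an assumption-laden illustration, not the required design: BestPairSketch, overlap, and bestPair are hypothetical names, the fragments are plain strings, and any ties that remain after the length check simply keep the first pair found.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; BestPairSketch and its methods are hypothetical names.
final class BestPairSketch {

    // Longest suffix of `first` that is also a prefix of `second`.
    static int overlap(String first, String second) {
        for (int k = Math.min(first.length(), second.length()); k > 0; k--) {
            if (first.endsWith(second.substring(0, k))) {
                return k;
            }
        }
        return 0;
    }

    // Returns {i, j}, meaning "fragments.get(i) followed by fragments.get(j)",
    // or null when there are fewer than two fragments.
    static int[] bestPair(List<String> fragments) {
        int bestOverlap = -1;                      // tracks a maximum
        int bestMergedLength = Integer.MAX_VALUE;  // tracks a minimum
        int[] best = null;
        for (int i = 0; i < fragments.size(); i++) {
            for (int j = 0; j < fragments.size(); j++) {
                if (i == j) {
                    continue; // same position in the list: nothing to merge
                }
                String a = fragments.get(i);
                String b = fragments.get(j);
                int o = overlap(a, b);
                int mergedLength = a.length() + b.length() - o;
                boolean better = o > bestOverlap
                        || (o == bestOverlap && mergedLength < bestMergedLength);
                if (better) {
                    bestOverlap = o;
                    bestMergedLength = mergedLength;
                    best = new int[] {i, j};
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // AAGG overlaps both other fragments by two, so the tie-break decides.
        List<String> fragments = Arrays.asList("AAGG", "GGCCC", "GGT");
        int[] pair = bestPair(fragments);
        System.out.println(fragments.get(pair[0]) + " + " + fragments.get(pair[1])); // AAGG + GGT
    }
}
```

Both orders of every pair are visited because the loops run over all (i, j) with i different from j, not just i less than j.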
Merging the fragments
Suppose we are still working with our example three fragments, CGCAT, CATGAC, and ACATG. Further suppose they’re stored in a list, which we’ll write as [CGCAT, CATGAC, ACATG].
If, after checking, we decided to merge the first two (as described above), our list would look like: [CGCATGAC, ACATG]. Then we’d merge again and be left with a single entry in our list: [CGCATGACATG].
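The list bookkeeping in that walkthrough can be sketched as follows. MergeStepSketch and replacePair are hypothetical names, the merged strings are copied from the walkthrough rather than computed, and placing the merged entry at the smaller of the two positions is an assumption made only to reproduce the listing above; the assignment’s tests may expect a different position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; MergeStepSketch and replacePair are hypothetical names.
final class MergeStepSketch {

    // Replace the entries at positions i and j with the single entry `merged`.
    static void replacePair(List<String> fragments, int i, int j, String merged) {
        // Remove the larger index first so the smaller index is still valid.
        fragments.remove(Math.max(i, j));
        fragments.remove(Math.min(i, j));
        // Assumption: put the merged entry where the earlier fragment was.
        fragments.add(Math.min(i, j), merged);
    }

    public static void main(String[] args) {
        List<String> fragments =
                new ArrayList<>(Arrays.asList("CGCAT", "CATGAC", "ACATG"));

        replacePair(fragments, 0, 1, "CGCATGAC");
        System.out.println(fragments); // [CGCATGAC, ACATG]

        replacePair(fragments, 0, 1, "CGCATGACATG");
        System.out.println(fragments); // [CGCATGACATG]
    }
}
```

Note that remove is called with an int here, which removes by position; calling it with an object removes the first element equal to that object instead.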
What to do
As usual, look over the files we’ve provided. The Fragment class represents a single fragment; the Assembler class keeps a list of Fragments and assembles them into longer Fragments.
Start with the Fragment class. Here are some hints to get you started there:
- You’ll need to store the nucleotide sequence in an instance variable. String is probably the easiest thing to use. length and toString should be straightforward.
- Review your notes from September 27 to write the equals method; you’ll need a correct equals method for the last few tests (based upon assertEquals) to work correctly. Note that a corrected version of the tests was posted on October 02. In particular, FragmentTest.testEqualsTrue2 should read:
    @Test
    public void testEqualsTrue2() {
        Fragment f = new Fragment(new String("GCAT"));
        Fragment g = new Fragment(new String("GCAT"));
        assertTrue(f.equals(g));
        assertTrue(g.equals(f));
    }
and there should be a few new tests added:
    @Test
    public void testOverlapPastBounds() {
        Fragment f = new Fragment("GGAA");
        Fragment g = new Fragment("AAGGAA");
        assertEquals(2, f.calculateOverlap(g));
    }

    @Test
    public void testSameMergeLength() {
        Fragment f = new Fragment("GCAT");
        Fragment g = new Fragment("GCAT");
        assertEquals(4, f.calculateOverlap(g));
    }

    @Test
    public void testSameMerge() {
        Fragment f = new Fragment("GCAT");
        Fragment g = new Fragment("GCAT");
        assertEquals(new Fragment("GCAT"), f.mergedWith(g));
    }
- Look over the instance methods of String when writing calculateOverlap and mergedWith. In particular, you might find startsWith, endsWith, and substring helpful. Also try breaking the solution up into conceptual chunks. For example, you might write a method that checks (that is, returns true) if two Fragments have an overlap of a parameterized length. This method’s signature might look something like boolean hasOverlap(Fragment f, int overlapLength). It will be easier to get that working before you tackle the general calculateOverlap method (though you might need to write your own tests for it).
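As a rough illustration of the hints above, here is a small stand-in class. Everything in it is an assumption for illustration: Seq is a hypothetical class (not the starter’s Fragment), the field name bases is invented, and hasOverlap only mirrors the shape of the signature suggested above. Treat it as one possible reading of the hints, not as the expected solution.

```java
// Illustrative stand-in; Seq and its field name are hypothetical and are not
// the starter code's Fragment class.
final class Seq {
    private final String bases;

    Seq(String bases) {
        this.bases = bases;
    }

    int length() {
        return bases.length();
    }

    @Override
    public String toString() {
        return bases;
    }

    // Two Seq objects are equal when their contents match, even if the
    // underlying String objects are distinct.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Seq)) {
            return false;
        }
        return bases.equals(((Seq) other).bases);
    }

    // Keep hashCode consistent with equals.
    @Override
    public int hashCode() {
        return bases.hashCode();
    }

    // True if the last overlapLength bases of this sequence equal the first
    // overlapLength bases of `next`.
    boolean hasOverlap(Seq next, int overlapLength) {
        if (overlapLength < 0
                || overlapLength > bases.length()
                || overlapLength > next.bases.length()) {
            return false;
        }
        return bases.endsWith(next.bases.substring(0, overlapLength));
    }

    public static void main(String[] args) {
        Seq f = new Seq(new String("GCAT"));
        Seq g = new Seq(new String("GCAT"));
        System.out.println(f.equals(g));  // true

        Seq a = new Seq("CGCAT");
        Seq b = new Seq("CATGAC");
        System.out.println(a.hasOverlap(b, 3)); // true  (CAT)
        System.out.println(a.hasOverlap(b, 2)); // false (AT vs CA)
    }
}
```

A fixed-length check like this is easy to test in isolation, and a general overlap calculation can then be built by trying candidate lengths from largest to smallest.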
Once you have Fragment passing the tests, start on Assembler. Again, some hints:
- The constructor and getFragments should be straightforward, though watch the copy requirement in the constructor.
- For assembleOnce:
  - Sometimes we use -1 or 0 as the initial value of a variable that we’re checking against to track a maximum. What if you want to initialize a variable that’s tracking a minimum? Use Integer.MAX_VALUE in this case.
  - You’re probably going to need to write a nested for loop (that is, a for loop inside a for loop) to check each pair of fragments. Remember not to compare a fragment against itself (and think about whether this check should use == or equals).
  - Remember to add the newly merged Fragment to the list, and to remove the two Fragments that were merged.
- assembleAll will be a one-liner once you get assembleOnce working.
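Two of the hints above (copying in the constructor, and == versus equals) can be illustrated with a short sketch. It assumes that the “copy requirement” means the constructor should keep its own copy of the incoming list; the starter code’s documentation is the authority on that. CopyAndIdentitySketch and its members are hypothetical names, and strings stand in for fragments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; CopyAndIdentitySketch is a hypothetical name.
final class CopyAndIdentitySketch {
    private final List<String> items;

    // Assumption: keep a private copy so later changes to the caller's
    // list do not affect this object.
    CopyAndIdentitySketch(List<String> items) {
        this.items = new ArrayList<>(items);
    }

    List<String> getItems() {
        return items;
    }

    public static void main(String[] args) {
        List<String> original = new ArrayList<>(Arrays.asList("GCAT", "GCAT"));
        CopyAndIdentitySketch holder = new CopyAndIdentitySketch(original);
        original.clear();
        System.out.println(holder.getItems().size()); // 2

        // Two distinct objects with the same contents.
        String a = new String("GCAT");
        String b = new String("GCAT");
        System.out.println(a == b);      // false: different objects
        System.out.println(a.equals(b)); // true: same contents
    }
}
```

The distinction matters when a list holds two fragments with identical contents: a contents-based check would treat them as the same fragment, while an identity or position check would not.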
Submitting the assignment
When you have completed the changes to your code, you should export an archive file containing the entire Java project. To do this, follow the same steps as from Assignment 01 to produce a .zip file, and upload it to Gradescope.
Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.