Programming Assignment 04: DNA Sequence Assembly
Estimated reading time: 15 minutes
Estimated time to complete: 2–3 hours (plus debugging time)
Prerequisites: Assignment 03
Starter code: dna-sequence-assembly-student.zip
Collaboration: not permitted
Start this assignment early. You will be writing about the same amount of code as in the last two assignments (or perhaps less), but it will require more thinking time. This is not an assignment most students will be able to rush through on Friday afternoon.
Overview
As you no doubt remember from high school biology, the “code of life” is written in DNA (and its pal, RNA) – triplet “codons” encode a sequence of amino acids, which are assembled into proteins, and so on. An important breakthrough in biological sciences was the ability to replicate DNA in vitro (that is, not in a cell, but in a test tube) using the polymerase chain reaction (PCR). PCR creates many copies of fragments of a sequence of DNA. The equipment for PCR is now cheap enough that some high schools have PCR labs.
Another breakthrough was the ability to assemble the sequence of fragments into a coherent whole, thus determining the genetic code (the genome) for an entire organism. For example, the Human Genome Project has sequenced the human genome.
In this assignment, you’ll solve a simplified version of the DNA sequence assembly problem, using a simple “greedy” algorithm. That is, given a collection of two or more overlapping DNA fragments, you’ll align and merge these fragments by choosing the best matches at each step, with the goal of ending with a single longer sequence.
We’ve provided a large set of unit tests to help with automated testing, though you might also want to write a class with a main method for interactive testing. The Gradescope autograder includes a few more tests, but they exist primarily to verify you’re not gaming the autograder. If your code passes the tests we’ve provided, it is likely correct. As before, we’ve disabled the timeout code so you can use the debugger, but if your code gets stuck during testing, you might want to uncomment these two lines at the top of each test file:
@Rule
public Timeout globalTimeout = Timeout.seconds(10); // 10 seconds
Goals
- Translate written descriptions of behavior into code.
- Practice writing instance methods, including overriding methods of Object.
- Practice interacting with the List abstraction.
- Test code using unit tests.
Downloading and importing the starter code
As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a dna-sequence-assembly-student project in the “Project Explorer”.
What you will be doing
The most important thing to understand is how you’ll assemble shorter overlapping fragments into longer ones.
First, what does a fragment look like? It’s a sequence of one or more nucleotides; each is one of adenine (A), cytosine (C), guanine (G), or thymine (T). So a fragment might look like one of
CGCAT
CATGAC
ACATG
Next, how do we assemble them? We’re going to use a simple greedy algorithm, which, quoting Wikipedia, reads as follows:
Given a set of sequence fragments the object is to find the shortest common supersequence.
1. Calculate pairwise alignments of all fragments.
2. Choose two fragments with the largest overlap.
3. Merge chosen fragments.
4. Repeat step 2 and 3 until only one fragment is left.
Let’s look at the first few steps in more detail, since they’re the least clear.
Calculating pairwise alignments
What’s a pairwise alignment (and what’s its overlap)? Let’s look at an example. Consider our first two fragments, listed above: CGCAT and CATGAC. We can see how much their “ends” overlap if we put CGCAT first and then CATGAC:
these three overlap
  vvv
CGCAT
  CATGAC
  ^^^
these three overlap
Here, there’s an overlap of three (CAT). If we were to merge these together, the result would be CGCATGAC:
  CGCAT
+   CATGAC
----------
  CGCATGAC
If we tried them the other way around, what would the overlap and merged fragment look like?
  CATGAC
+      CGCAT
------------
  CATGACGCAT
Here the overlap is only one and the result would be CATGACGCAT.
You should now see (1) that any two fragments can have an overlap of at least zero and at most the length of the shorter fragment, and (2) that order matters when comparing overlaps: the front of one fragment can be checked against the rear of another, but that’s different from checking the rear of the first against the front of the second.
Finally, note that we will only consider overlaps on the end, and not worry about one fragment being entirely embedded within another. That is, your code must not check for things like:
  GCTCAGC
+   TCA
---------
  GCTCAGC
Two identical fragments will still be merged, however, because they overlap completely when compared end to end. In other words, we do expect you to merge fragments like:
  GCTCAGC
+ GCTCAGC
---------
  GCTCAGC
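The end-overlap rules above can be sketched on plain strings. This is only an illustration under assumptions: the class OverlapSketch and its static methods overlap and merge are hypothetical names, and they work on String values rather than on the Fragment class from the starter code.

```java
public class OverlapSketch {
    // Largest k such that the last k characters of left equal the
    // first k characters of right. Only the ends are compared, so a
    // fragment embedded in the middle of another is never detected.
    static int overlap(String left, String right) {
        int max = Math.min(left.length(), right.length());
        for (int k = max; k > 0; k--) {
            if (left.endsWith(right.substring(0, k))) {
                return k;
            }
        }
        return 0;
    }

    // Joins left and right, writing the overlapping characters once.
    static String merge(String left, String right) {
        return left + right.substring(overlap(left, right));
    }

    public static void main(String[] args) {
        System.out.println(overlap("CGCAT", "CATGAC"));   // 3
        System.out.println(merge("CGCAT", "CATGAC"));     // CGCATGAC
        System.out.println(overlap("CATGAC", "CGCAT"));   // 1
        System.out.println(merge("CATGAC", "CGCAT"));     // CATGACGCAT
        System.out.println(overlap("GCTCAGC", "TCA"));    // 0 (embedded, not at the end)
        System.out.println(merge("GCTCAGC", "GCTCAGC"));  // GCTCAGC
    }
}
```

Swapping the arguments changes the answer, which is why each pair has to be examined in both orders.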
Choosing the largest overlap
Given a collection of fragments, you can compare every fragment against every other (in both orders!) and find the pair with the largest overlap. What do we mean by both orders? Consider each fragment as both a left fragment against every other on its right, and a right fragment against every other on its left.
But what if two have the same overlap? I want you to break ties by choosing the pair whose merger results in the shorter merged sequence.
If there are further ties, do what you like; I will try to make sure there are no tests that are ambiguous, and I don’t want your merge method to be sixteen special cases long. In a practical sequence assembler, deciding how to handle ambiguity is very important, as are many other cases: What about “almost perfect” matches, as real PCR occasionally induces errors in the fragments? Or subsets, which I told you to ignore? Or how little overlap is so little as to be not worth merging? And so on. But we won’t worry about those details here.
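One possible shape for the pair search, again on plain strings, is sketched below. The names BestPairSketch and bestPair are hypothetical, the overlap helper is an assumed end-overlap calculation, and returning a pair of indices is just one convenient choice for the illustration.

```java
import java.util.Arrays;
import java.util.List;

public class BestPairSketch {
    // Assumed helper: length of the longest suffix of left that is
    // also a prefix of right.
    static int overlap(String left, String right) {
        int max = Math.min(left.length(), right.length());
        for (int k = max; k > 0; k--) {
            if (left.endsWith(right.substring(0, k))) {
                return k;
            }
        }
        return 0;
    }

    // Returns {leftIndex, rightIndex} for the ordered pair with the
    // largest overlap, breaking ties by the shorter merged length.
    // Returns null if the list holds fewer than two fragments.
    static int[] bestPair(List<String> frags) {
        int bestOverlap = -1;                 // tracking a maximum
        int bestLength = Integer.MAX_VALUE;   // tracking a minimum
        int[] best = null;
        for (int i = 0; i < frags.size(); i++) {
            for (int j = 0; j < frags.size(); j++) {
                if (i == j) {
                    continue; // never pair a position with itself
                }
                int ov = overlap(frags.get(i), frags.get(j));
                int mergedLength = frags.get(i).length() + frags.get(j).length() - ov;
                if (ov > bestOverlap || (ov == bestOverlap && mergedLength < bestLength)) {
                    bestOverlap = ov;
                    bestLength = mergedLength;
                    best = new int[] {i, j};
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // AC+CTTT and AC+CG both overlap by one; AC+CG merges shorter.
        int[] pair = bestPair(Arrays.asList("AC", "CTTT", "CG"));
        System.out.println(pair[0] + ", " + pair[1]); // 0, 2
    }
}
```

Because the inner loop runs over every index rather than only those after i, each pair is examined in both orders.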
Merging the fragments
Suppose we are still working with our three example fragments, CGCAT, CATGAC, and ACATG. Further suppose they’re stored in a list, which we’ll write as [CGCAT, CATGAC, ACATG].
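Suppose, purely to illustrate how the list changes, that we merge the first two (as described above). Our list would then look like [CGCATGAC, ACATG]. Merging again would leave a single entry in our list: [CGCATGACATG]. (Under the largest-overlap rule, the first merge would actually be ACATG followed by CATGAC, which overlap by four; the pair above is used only because we have already worked it out.)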
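The list bookkeeping in this walk-through can be sketched as follows. MergeStepSketch and mergeAt are hypothetical names, and the indices and overlaps are hard-coded to follow the walk-through rather than chosen by comparing overlaps. Putting the merged string where the lower-indexed fragment used to be is also an assumption, made only so the output matches the lists written above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeStepSketch {
    // Replaces the fragments at leftIndex and rightIndex with their
    // merged form, given the known overlap between them.
    static void mergeAt(List<String> frags, int leftIndex, int rightIndex, int overlap) {
        String merged = frags.get(leftIndex) + frags.get(rightIndex).substring(overlap);
        int high = Math.max(leftIndex, rightIndex);
        int low = Math.min(leftIndex, rightIndex);
        frags.remove(high); // remove the higher index first so low stays valid
        frags.remove(low);
        frags.add(low, merged);
    }

    public static void main(String[] args) {
        List<String> frags = new ArrayList<>(Arrays.asList("CGCAT", "CATGAC", "ACATG"));
        mergeAt(frags, 0, 1, 3);   // CGCAT + CATGAC, overlap CAT
        System.out.println(frags); // [CGCATGAC, ACATG]
        mergeAt(frags, 0, 1, 2);   // CGCATGAC + ACATG, overlap AC
        System.out.println(frags); // [CGCATGACATG]
    }
}
```

Removing by index rather than by value matters when the list holds two fragments with the same sequence.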
What to do
As usual, look over the files we’ve provided. The Fragment class represents a single fragment; the Assembler class keeps a list of Fragments and assembles them into longer Fragments.
Start with the Fragment class. Here are some hints to get you started there:
- You’ll need to store the nucleotide sequence in an instance variable. String is probably the easiest thing to use. length and toString should be straightforward.
- Remember you can have Eclipse write the equals method for you (“Source → Generate hashCode() and equals()…”). You’ll need a correct equals method for the last few tests (based upon assertEquals) to work correctly. We mentioned this in class a couple of weeks ago, and we’ll touch on this again in lecture this week.
- Look over the instance methods of String when writing calculateOverlap and mergedWith. In particular, you might find startsWith, endsWith, and substring helpful. Try breaking the solution up into conceptual chunks.
- One such chunk to consider: You might add a new method boolean hasOverlap(Fragment f, int overlapLength) that checks (that is, returns true) if the current Fragment overlaps with another fragment f with an overlap of overlapLength. Then use it to implement calculateOverlap by iteratively checking for a maximum-size overlap, then one less, then one less, etc., until you find the largest overlap.
- If you choose to write a hasOverlap method, you might want to add some tests. For example:
@Test
public void testHasNoOverlap() {
    Fragment f = new Fragment("GCAT");
    Fragment g = new Fragment("CGTA");
    assertFalse(f.hasOverlap(g, 1));
    assertFalse(g.hasOverlap(f, 1));
}

@Test
public void testHasSomeOverlap() {
    Fragment f = new Fragment("GGGA");
    Fragment g = new Fragment("AGGG");
    assertTrue(f.hasOverlap(g, 1));
    assertFalse(f.hasOverlap(g, 2));
    assertTrue(g.hasOverlap(f, 1));
    assertTrue(g.hasOverlap(f, 2));
    assertTrue(g.hasOverlap(f, 3));
    assertFalse(g.hasOverlap(f, 4));
}
Once you have Fragment passing the tests, start on Assembler. Again, some hints:
- The constructor and getFragments should be straightforward, though remember the copy requirement in the constructor.
- For assembleOnce:
  - Sometimes we use -1 or 0 as the initial value of a variable that we’re checking against to track a maximum. What if you want to initialize a variable that’s tracking a minimum? Use Integer.MAX_VALUE in this case. (Note you may not need this, depending upon how you structure your code.)
  - You’re probably going to need to write a nested for loop (that is, a for loop inside a for loop) to check each pair of fragments. Remember not to compare a fragment against itself (and think about whether this check should use == or equals).
  - Remember to add the newly merged Fragment to the list, and to remove from the list the two Fragments that were merged.
- assembleAll will be a one- or two-liner once you get assembleOnce working.
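As background for the == versus equals hint, the sketch below shows how the two checks differ once a class overrides equals. The class Seq is a hypothetical stand-in, not the Fragment class from the starter code, and its equals and hashCode are hand-written approximations of what an IDE might generate.

```java
import java.util.Objects;

public class EqualitySketch {
    // Hypothetical value class that wraps a nucleotide string.
    static final class Seq {
        private final String bases;

        Seq(String bases) {
            this.bases = bases;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Seq)) {
                return false;
            }
            return bases.equals(((Seq) o).bases);
        }

        @Override
        public int hashCode() {
            return Objects.hash(bases);
        }
    }

    public static void main(String[] args) {
        Seq a = new Seq("GCTCAGC");
        Seq b = new Seq("GCTCAGC");
        System.out.println(a == b);      // false: two distinct objects
        System.out.println(a.equals(b)); // true: same sequence
        System.out.println(a == a);      // true: the very same object
    }
}
```

Which check fits the nested loop depends on what should happen when the list contains two separate fragments with the same sequence.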
Submitting the assignment
When you have completed the changes to your code, you should export an archive file containing the entire Java project. To do this, follow the same steps as in Assignment 01 to produce a .zip file, and upload it to Gradescope.
Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.