Assignment 04: DNA sequence assembly

Starter code: dna-sequence-assembly-student.zip

Start this assignment early. You will be writing about the same amount of code as in the last two assignments (probably less, in fact), but it will require more thinking time. This is not an assignment most students will be able to rush through on Friday afternoon.

Overview

As you no doubt remember from high school biology, the “code of life” is written in DNA (and its pal, RNA) – triplet “codons” encode a sequence of amino acids, which are assembled into proteins, and so on. An important breakthrough in biological sciences was the ability to replicate DNA in vitro (that is, not in a cell, but in a test tube) using the polymerase chain reaction (PCR). PCR creates many copies of fragments of a sequence of DNA. The equipment for PCR is now cheap enough that some high schools have PCR labs.

Another breakthrough was the ability to assemble the sequence of fragments into a coherent whole, thus determining the genetic code (the genome) for an entire organism. For example, the Human Genome Project has sequenced the human genome.

In this assignment, you’ll solve a simplified version of the DNA sequence assembly problem, using a simple “greedy” algorithm. That is, given a collection of two or more overlapping DNA fragments, you’ll align and merge these fragments by choosing the best matches at each step, with the goal of ending with a single longer sequence.

We’ve provided a large set of unit tests to help with automated testing, though you might also want to write a class with a main method for interactive testing. The Gradescope autograder includes a few more tests, but they exist primarily to verify you’re not gaming the autograder. If your code can pass the tests we’ve provided, it is likely correct. As before, we’ve disabled the timeout code so you can use the debugger, but if your code gets stuck during testing, you might want to uncomment these two lines at the top of each test file:

	@Rule
	public Timeout globalTimeout = Timeout.seconds(10); // 10 seconds

Goals

Downloading and importing the starter code

As in previous assignments, download and decompress the provided archive file containing the starter code. Then import it into Code in the same way; you should end up with a dna-sequence-assembly-student project.

What you will be doing

The most important thing to understand is how you’ll assemble shorter overlapping fragments into longer ones.

First, what does a fragment look like? It’s a sequence of one or more nucleotides; each is one of adenine (A), cytosine (C), guanine (G) and thymine (T). So a fragment might look like one of

CGCAT
CATGAC
ACATG

Next, how do we assemble them? We’re going to use a simple greedy algorithm, which, quoting Wikipedia, reads as follows:

Given a set of sequence fragments the object is to find the shortest common supersequence.

  1. Calculate pairwise alignments of all fragments.
  2. Choose two fragments with the largest overlap.
  3. Merge chosen fragments.
  4. Repeat step 2 and 3 until only one fragment is left.

Let’s look at the first few steps in more detail, since they’re the least clear.

Calculating pairwise alignments

What’s a pairwise alignment (and what’s its overlap)? Let’s look at an example. Consider our first two fragments, listed above: CGCAT and CATGAC. We can see how much their “ends” overlap if we put CGCAT first and then CATGAC:

  these three overlap
  vvv
CGCAT
  CATGAC
  ^^^
  these three overlap

Here, there’s an overlap of three (CAT). If we were to merge these together, the result would be CGCATGAC:

 CGCAT
+  CATGAC
---------
 CGCATGAC

If we tried them the other way around, what would the overlap and merged fragment look like?

 CATGAC
+     CGCAT
-----------
 CATGACGCAT

Here the overlap is only one and the result would be CATGACGCAT.

You should now see (1) that any two fragments can have an overlap of at least zero and at most the length of the shorter fragment, and (2) that order matters when comparing overlaps: the front of one fragment can be checked against the rear of another, but that’s different from checking the rear of the first against the front of the second.

Finally, note that we will only consider overlaps on the end, and not worry about one fragment being entirely embedded within another. That is, your code must not check for things like:

 GCTCAGC
+  TCA
--------
 GCTCAGC

Two identical fragments will be merged, though, since they overlap completely at their ends. In other words, we do expect you to merge fragments like:

 GCTCAGC
+GCTCAGC
--------
 GCTCAGC
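
The end-overlap rules in this section can be sketched in Java on plain strings. This is only a minimal sketch under assumptions: the class name OverlapSketch and the methods overlap and merge are hypothetical and are not taken from the starter code, and your Fragment class may well organize this differently.

```java
// Hypothetical helper class for illustration; not the starter code's API.
class OverlapSketch {

    // Length of the longest suffix of `left` that is also a prefix of `right`.
    // Only the ends are compared, so a fragment embedded in the middle of
    // another one is never detected.
    static int overlap(String left, String right) {
        int max = Math.min(left.length(), right.length());
        for (int k = max; k > 0; k--) {
            // Compare the last k characters of left with the first k of right.
            if (left.regionMatches(left.length() - k, right, 0, k)) {
                return k;
            }
        }
        return 0;
    }

    // Append `right` to `left`, writing the overlapping region only once.
    static String merge(String left, String right) {
        return left + right.substring(overlap(left, right));
    }

    public static void main(String[] args) {
        System.out.println(overlap("CGCAT", "CATGAC"));   // 3
        System.out.println(merge("CGCAT", "CATGAC"));     // CGCATGAC
        System.out.println(overlap("CATGAC", "CGCAT"));   // 1
        System.out.println(merge("CATGAC", "CGCAT"));     // CATGACGCAT
        System.out.println(overlap("GCTCAGC", "TCA"));    // 0 (embedded, not at the end)
        System.out.println(merge("GCTCAGC", "GCTCAGC"));  // GCTCAGC
    }
}
```

Trying the longest candidate overlap first means the first match found is the largest one. Note that the order of the arguments matters: swapping left and right generally gives a different overlap.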

Choosing the largest overlap

Given a collection of fragments, you can compare every fragment against every other (in both orders) and find the pair with the largest overlap. What do we mean by both orders? Consider each fragment as both a left fragment against every other on its right, and a right fragment against every other on its left.

But what if two have the same overlap? I want you to break ties by choosing the pair whose merger results in the shorter merged sequence.

If there are further ties, do what you like — I will make sure there are no tests that are ambiguous, and I don’t want your merge method to be sixteen special cases long. In a practical sequence assembler, deciding how to handle ambiguity is very important, as are many other cases: What about “almost perfect” matches, as real PCR occasionally induces errors in the fragments? Or subsets, which I told you to ignore? Or how little overlap is so little as to be not worth merging? And so on. But we won’t worry about those details here.
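
One way to picture the selection rule is a double loop over ordered pairs that keeps the pair with the largest overlap and, on a tie, the pair whose merged length is shorter. The sketch below works on plain strings and is an assumption-laden illustration: BestPairSketch, bestPair, and the int[] return value are hypothetical names and choices, not part of the provided Assembler class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the selection rule; not the starter code's API.
class BestPairSketch {

    // Longest suffix of `left` that is also a prefix of `right`.
    static int overlap(String left, String right) {
        int max = Math.min(left.length(), right.length());
        for (int k = max; k > 0; k--) {
            if (left.regionMatches(left.length() - k, right, 0, k)) {
                return k;
            }
        }
        return 0;
    }

    // Returns {leftIndex, rightIndex} for the chosen ordered pair,
    // or null when there are fewer than two fragments.
    static int[] bestPair(List<String> fragments) {
        int[] best = null;
        int bestOverlap = -1;
        int bestMergedLength = Integer.MAX_VALUE;
        for (int i = 0; i < fragments.size(); i++) {
            for (int j = 0; j < fragments.size(); j++) {
                if (i == j) {
                    continue; // a fragment is never paired with itself
                }
                String left = fragments.get(i);
                String right = fragments.get(j);
                int ov = overlap(left, right);
                int mergedLength = left.length() + right.length() - ov;
                // Larger overlap wins; equal overlap falls back to shorter merge.
                boolean better = ov > bestOverlap
                        || (ov == bestOverlap && mergedLength < bestMergedLength);
                if (better) {
                    best = new int[] {i, j};
                    bestOverlap = ov;
                    bestMergedLength = mergedLength;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // TTA overlaps both AGGG and AC by one; TTA+AC merges to the shorter result.
        List<String> fragments = Arrays.asList("TTA", "AGGG", "AC");
        System.out.println(Arrays.toString(bestPair(fragments))); // [0, 2]
    }
}
```

Because both loops run over every index, each fragment is tried as the left fragment and as the right fragment of every other one. Any remaining ties are resolved here by iteration order, which is an arbitrary choice.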

Merging the fragments

Suppose we are still working with our example three fragments, CGCAT, CATGAC, and ACATG. Further suppose they’re stored in a list, which we’ll write as [CGCAT, CATGAC, ACATG].

If, after checking, we decided to merge the first two (as described above), our list would look like: [CGCATGAC, ACATG]. Then we’d merge again and be left with a single entry in our list: [CGCATGACATG].
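
The list bookkeeping for a single merge step might look like the sketch below, which replays the two merges just described. It is a hedged illustration only: MergeStepSketch and mergeAt are hypothetical names, the indices of the pair are passed in by hand rather than chosen by the algorithm, and placing the merged fragment in the left fragment's slot is an assumption made to match the lists shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of one merge step; not the starter code's API.
class MergeStepSketch {

    // Longest suffix of `left` that is also a prefix of `right`.
    static int overlap(String left, String right) {
        int max = Math.min(left.length(), right.length());
        for (int k = max; k > 0; k--) {
            if (left.regionMatches(left.length() - k, right, 0, k)) {
                return k;
            }
        }
        return 0;
    }

    // Returns a new list in which the two chosen fragments are replaced
    // by their merge; all other fragments keep their relative order.
    static List<String> mergeAt(List<String> fragments, int leftIndex, int rightIndex) {
        String left = fragments.get(leftIndex);
        String right = fragments.get(rightIndex);
        String merged = left + right.substring(overlap(left, right));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < fragments.size(); i++) {
            if (i == leftIndex) {
                result.add(merged); // assumption: merged fragment takes the left slot
            } else if (i != rightIndex) {
                result.add(fragments.get(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("CGCAT", "CATGAC", "ACATG");
        List<String> afterFirst = mergeAt(fragments, 0, 1);
        System.out.println(afterFirst);                 // [CGCATGAC, ACATG]
        System.out.println(mergeAt(afterFirst, 0, 1));  // [CGCATGACATG]
    }
}
```

Each step shrinks the list by exactly one entry, so a list of n fragments reaches a single entry after n - 1 merges.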

What to do

As usual, look over the files we’ve provided. The Fragment class represents a single fragment; the Assembler class keeps a list of Fragments and assembles them into longer Fragments.

Start with the Fragment class. Here are some hints to get you started there:

Once you have Fragment passing the tests, start on Assembler. Again, some hints:

Submitting the assignment

When you have completed the changes to your code, you should export an archive file containing the entire Java project. To do this, follow the same steps as in Assignment 01 to produce a .zip file, and upload it to Gradescope. Note that if you want things to upload faster, you can use an external program to zip only the src/ directory by expanding the project; that’s all this autograder requires.

Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.