In the previous lecture we have seen the principle behind dynamic programming. This approach is extremely useful for comparing biological sequences, which is coincidentally one of the main points of this course. This lecture explain how this is done. In writing this text I heavily relied on wonderful course taught by Ben Langmead at Johns Hopkins. The cover image shows pairwise alignments for human, mouse, and dog KIF3 locus from Dubchak et al. 2000.
Suppose you have two sequences of the same length:
A C A T G C C T A
A C T G C C T A C
How different are they? In other words, how many bases should be change to turn one sequence onto the other:
A C A T G C C T A
| | * * * | * * *
A C T G C C T A C
the vertical lines above indicate positions that are identical between the two sequences, while asterisks show differences. It will take six substitutions to turn one sequence into the other. This number – six substitutions – is called Hamming distance or the minimal number of substitutions required to turn one string into another. The code below computes the Hamming distance. Try it. You can change seq1
and seq2
into whatever you want except that they should have the same length.
Now in addition to substitutions (i.e., changing one character into another) let's allow insertions and deletions. This will essentially allow us to insert dashes (gaps) between characters:
A C A T G C C T A -
| | * | | | | | | *
A C - T G C C T A C
In this case the edit distance between two sequences is 2. It is defined as the minimum number of operations (substitutions, insertions, and deletions) requited to turn one string into another. The compared strings do not have to be of the same length to be able to compute the edit distance as we can compensate for length differences using deletions and insertions. While the situation above (where we inserted two dashes) is biologically much more meaningful (and realistic), it is much more difficult to find.
Before we can develop an algorithm that will help us to compute the edit distance let's develop a general framework that would allow us to think about the problem in exact terms. Let's look at a pair of VERY long sequences. So long, that we do not even see the left end – it disappears into
the red parts of the two sequences represent prefixes for the last nucleotides shown in black. Let's assume that the edit distance between the two prefixes is known (don't ask how we know, we just do). For simplicity let's "compact" the prefix of the first sequence into
again, the edit distance between
but we not always have
where
We can write an algorithm that would exhaustively evaluate the above expression for all possible suffixes. This algorithm is below. Try executing it. It will take roughly ~30 seconds to find the edit distance between the two sequences used in the beginning of this lecture:
The take-home-message here is that it takes a very long time to compute the edit distance between two sequences that are only nine nucleotides long! Why is this happening? Figure 1 below shows a small subset of situations the algorithm is evaluating for two very short strings
Image Not Showing Possible ReasonsLearn More →
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Figure 1 | A fraction of situations evaluated by the naïve algorithm for computing the edit distance. Just like in the case of the change problem discussed in the previous lecture a lot of time is wasted on computing distances between suffixes that has been seen more than once (shown in red).
To understand the magnitude of this problem let's look at slightly modified version of the previous Python code below. All we do here is keeping track how many times a particular pair of suffixes (in this case
While this approach to the edit distance problem is correct, it will hardly help us on the genome-wide scale. Just like in case of the change problem and Manhattan tourist problem dynamic programming is going to save the day.
It turns out that in order to find the alignment we first need to learn how to compute edit distances between sequences efficiently. So, suppose we have two sequences that deliberately have different lengths:
and
Let's represent them as the following matrix where the first character
Let's fill the first column and first row of the matrix. Because the distance between a string and an empty string is equal to the length of the string (e.g., a distance between string
Now, let's fill in the cell between
where
In this specific case:
This minimum of 0, 2, and 2 will be 0, so we are putting zero into that cell:
Using this uncomplicated logic we can fill the entire matrix like this:
The lower rightmost cell highlighted in red is special. It contains the value for the edit distance between the two strings. The following Python script implements this idea. You can see that it is essentially instantaneous:
In the previous section we have filled the dynamic programming matrix to find out that the edit distance between the sequences is 2. But in biological applications we are most often interested not in edit distance per se but in the actual alignment between two sequences.
We can use the dynamic programming matrix to reconstruct the alignment. This is done using traceback procedure. Let's look at the rightmost bottom cell of the matrix highlighted in bold:
When we were filling this matrix did we come to this point from above (
Remembering the fact that when filling the matrix we are trying to minimize the expression for edit distance, we clearly arrived to this point diagonally from
Continuing this idea we will draw a trace like the one below until we hit an interesting point:
At this point we have arrived to this value from the top because 0 + 1 = 1. If we were arriving diagonally it would be 1 + 1 = 2 since
Now going through traceback line from the top we are getting the following alignment between two two sequences:
G C - T A T A C
| | * | | | * |
G C G T A T G C
Let's now get a bit more practical. In many real biological applications we are trying to see if one sequence is contained within another. So, how can we use dynamic programming to find if there is an approximate match between two sequences
Suppose we have two strings:
and
Let's construct the following matrix:
Let me remind you that our goal is to find where
Now let's fill this matrix in using the same logic we used before:
Now that we have filled in the complete matrix let's look at the bottom row. Instead of using the rightmost cell we will find a cell with the lowest number. That would be 2 (highlighted in red):
Starting already familiar traceback procedure at that cell we will get the following path through the matrix:
for a corresponding alignment:
A A C C C T A T G T C A T G C C T T G G A
| | * | | | | * | |
T A C G T C A - G C
So far in filling the dynamic programming matrix we were using the following expression to compute the number within each cell:
Basically we were adding 1 if there was a mismatch (the
Figure 2 | There are two kinds of nucleotide substitutions: Transitions and Transversions. Transitions are substitutions between nucleotides belonging to the same chemical group. For example, a substitution of Adenine, a purine, to Guanine, also a purine, is a transition. Transversions, on the other hand, occur between chemically dissimilar nucleotides. For example, any substitution of a purine to a pyrimidine and vice verse will be a transition. (Image from Wikipedia)
you can see that there are move ways in which we can have a transversion. Despite this fact transversions are significantly less frequent that transitions. In fact in human the so called Transition/Transversion ratio (
The situation with insertions and deletions (that are often called indels) is similar in that are relatively rare and their rarity increases with size. A single nucleotide indel may occur every 1,000 bases on average, while a two-nucleotide deletion occurs every 3,000 bases and so on (see Montgomery et al. 2013 for a more detailed statistics).
As a result it is simply unrealistic to use "1" is all cases. Instead, we need to penalize rare events more than we penalize frequent, more probable events. Let's create a penalty matrix to achieve this goal:
Here if two nucleotides match, the penalty is 0. For a transitional substitution we pay
Now, let's align two sequences:
and
First, we need to fill the following dynamic programming matrix:
But now we using penalty matrix generated above to calculate values in each cell. Specifically, before we were using this expression:
but now, we will change it into this:
where
Similarly, for the first column we get:
and finally the full matrix:
Drawing a traceback through this matrix will give us:
This corresponds to the following alignment:
T A C G T C A - G C
| | * | | | | * | |
T A T G T C A T G C
The following Python code implements Global alignment approach we have seen above. Note that the first function, exampleCost
, can be changed to set different value for the penalty matrix. You can play with it here:
Global alignment discussed above works only in cases when we truly expect sequences to match across their entire lengths. In majority of biological application this is rarely the case. For instance, suppose you want to compare two bacterial genomes to figure out if they have matching sequences. Local alignment algorithm helps with this challenge. Surprisingly, it is almost identical to the approaches we used before. So here is our problem:
Sequence 1: ##################################################
|||||||||||||||||
Sequence 2: #################################################
there are two sequences (shown with # characters) and they share a region of high similarity (indicated by vertical lines). How do we find this region? We cannot use global alignment logic here because across the majority of their lengths these sequences are dissimilar.
To approach this problem we will change our penalty strategy. Instead of giving a high value for each mismatch we will instead give negative penalties. Only matches get positive rewards. Here is an example of such scoring matrix (as opposed to the penalty matrix we've used before):
Our scoring logic will also change:
We are now looking for maximum and use 0 to prevent having negative values in the matrix. This also implies that the first row and column will now be filled with zeros. Let apply this to two sequences:
Now, let's align two sequences:
and
Dynamic programming matrix with initialized first row and column will look like this:
Filling it completely will yield the following matrix. Note that clues to where the local alignment may be are given off by positive numbers in the sea of 0s:
To identify to boundary of the local alignment we need to identify the cell with the highest value. This cell has value of
This corresponds to the best local alignment between the two sequences:
T A T A T G C - G G C G T T T
| | | | | * | | | |
G G T A T G C T G G C G C T A
Here is a Python representation of this approach:
import numpy
def exampleCost(xc, yc):
''' Cost function: 2 to match, -6 to gap, -4 to mismatch '''
if xc == yc: return 2 # match
if xc == '-' or yc == '-': return -6 # gap
return -4
def smithWaterman(x, y, s):
''' Calculate local alignment values of sequences x and y using
dynamic programming. Return maximal local alignment value. '''
V = numpy.zeros((len(x)+1, len(y)+1), dtype=int)
for i in range(1, len(x)+1):
for j in range(1, len(y)+1):
V[i, j] = max(V[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
V[i-1, j ] + s(x[i-1], '-'), # vertical
V[i , j-1] + s('-', y[j-1]), # horizontal
0) # empty
argmax = numpy.where(V == V.max())
return V, int(V[argmax])
def traceback(V, x, y, s):
""" Trace back from given cell in local-alignment matrix V """
# get i, j for maximal cell
i, j = numpy.unravel_index(numpy.argmax(V), V.shape)
xscript, alx, aly, alm = [], [], [], []
while (i > 0 or j > 0) and V[i, j] != 0:
diag, vert, horz = 0, 0, 0
if i > 0 and j > 0:
diag = V[i-1, j-1] + s(x[i-1], y[j-1])
if i > 0:
vert = V[i-1, j] + s(x[i-1], '-')
if j > 0:
horz = V[i, j-1] + s('-', y[j-1])
if diag >= vert and diag >= horz:
match = x[i-1] == y[j-1]
xscript.append('M' if match else 'R')
alm.append('|' if match else ' ')
alx.append(x[i-1]); aly.append(y[j-1])
i -= 1; j -= 1
elif vert >= horz:
xscript.append('D')
alx.append(x[i-1]); aly.append('-'); alm.append(' ')
i -= 1
else:
xscript.append('I')
aly.append(y[j-1]); alx.append('-'); alm.append(' ')
j -= 1
xscript = (''.join(xscript))[::-1]
alignment = '\n'.join(map(lambda x: ''.join(x), [alx[::-1], alm[::-1], aly[::-1]]))
return xscript, alignment
x, y = 'GGTATGCTGGCGCTA', 'TATATGCGGCGTTT'
V, best = smithWaterman(x, y, exampleCost)
print(V)
print("Best score=%d, in cell %s" % (best, numpy.unravel_index(numpy.argmax(V), V.shape)))
print(traceback(V, x, y, exampleCost)[1])
This algorithm was developed by Temple Smith and Michael Waterman in 1981. This is why it is most often called Smith Waterman local alignment algorithm.
https://github.com/lastz/lastz
https://github.com/lh3/minimap2
A schematic representation of the multiZ pipeline. Blue bars represent analysis steps performed by distinct tools. The workflow begins by softmasking interspersed and tandem repeats in the genomes to be aligned. The genomes are then split into smaller segments for parallelization purposes and all segments of the reference are aligned against all segments of each genome using lastZ. The resulting alignments are combined (in non-reference genome bins), sorted, and split by reference chromosomes. They are then chained and the chains are used to create nets. These steps are performed in parallel for each reference/genome{1, 2, …, N} combination. Subsequently all reference/genome combinations are combined into multiple alignments with MultiZ. Resulting MAF datasets are used to define conserved regions identified by phastCons and to compute per-base phyloP values. Chain and nets are explained in 1.2. 2bit, lav, axt, psl, and MAF are various intermediate file formats (described in [87]).
A schematic representation of steps involved in aligning four genomes using ProgressiveCactus. First, genomes to be aligned are softmasked: regions of the sequence corresponding to repetitive elements are converted to lowercase characters. During the pairwise alignment step lastZ does not use softmasked regions in the seeding but is allowed to extend matches to these regions. ProgressiveCactus splits the MSA task into smaller parallelizable subtasks using a guide tree. Here the ancestral genome E is first reconstructed by a process which aligns the ingroups A and B, using the outgroups C and D to determine orthology relationships within the alignment of A and B. The output of this process is a sub-alignment of A and B to their direct ancestor E and a reconstruction of the genome assembly of E. At each stage there are two computation steps: a local alignment step (lastZ) and a MSA step (CAF/BAR). Each of these can be run on a single compute node with sufficient resources. The local alignment step generates local alignments between the ingroup genomes (all against all), and between the in-group genomes and the outgroups (using an intelligent strategy to avoid aligning all ingroups to all outgroups). In the MSA step these local alignments are combined within a graph theoretic structure which initially represents the transitive closure of these alignments and which progressively cleans and then refines the resulting MSA to produce the final MSA for the node. This whole process is then repeated for internal nodes F and G and the resulting MSAs, represented using the flexible HAL format, are stitched together. The total number of individual steps is equal to two times the number of internal nodes in the tree plus a single post processing step that combines the resulting sub-MSAs to create the final alignment. Gray nodes on the tree represent ancestral genomes inferred by Progressive Cactus. CAF = Cactus alignment filter; BAR = base-level alignment refinement; MSA = multiple sequence alignment.