Try   HackMD

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Lecture 24: Sequence alignment

In the previous lecture we have seen the principle behind dynamic programming. This approach is extremely useful for comparing biological sequences, which is coincidentally one of the main points of this course. This lecture explain how this is done. In writing this text I heavily relied on wonderful course taught by Ben Langmead at Johns Hopkins. The cover image shows pairwise alignments for human, mouse, and dog KIF3 locus from Dubchak et al. 2000.

How different are two sequences?

Suppose you have two sequences of the same length:

A C A T G C C T A
A C T G C C T A C

How different are they? In other words, how many bases should be change to turn one sequence onto the other:

A C A T G C C T A
| | * * * | * * *
A C T G C C T A C

the vertical lines above indicate positions that are identical between the two sequences, while asterisks show differences. It will take six substitutions to turn one sequence into the other. This number six substitutions is called Hamming distance or the minimal number of substitutions required to turn one string into another. The code below computes the Hamming distance. Try it. You can change seq1 and seq2 into whatever you want except that they should have the same length.

Now in addition to substitutions (i.e., changing one character into another) let's allow insertions and deletions. This will essentially allow us to insert dashes (gaps) between characters:

A C A T G C C T A -
| | * | | | | | | *
A C - T G C C T A C

In this case the edit distance between two sequences is 2. It is defined as the minimum number of operations (substitutions, insertions, and deletions) requited to turn one string into another. The compared strings do not have to be of the same length to be able to compute the edit distance as we can compensate for length differences using deletions and insertions. While the situation above (where we inserted two dashes) is biologically much more meaningful (and realistic), it is much more difficult to find.

Generalizing the problem

Before we can develop an algorithm that will help us to compute the edit distance let's develop a general framework that would allow us to think about the problem in exact terms. Let's look at a pair of VERY long sequences. So long, that we do not even see the left end it disappears into

:

\colorred.....A C T G C C T A G\colorred.....A C T G C C T A C

the red parts of the two sequences represent prefixes for the last nucleotides shown in black. Let's assume that the edit distance between the two prefixes is known (don't ask how we know, we just do). For simplicity let's "compact" the prefix of the first sequence into

α and the prefix of the second sequence into
β
:

αGβC

again, the edit distance between

α and
β
is known to us. The three possibilities for computing the edit distance between
αG
and
βC
are:

Edit Distance(αG,βC)=min{Edit Distance(α,β)+1 because they do not matchEdit Distance(αG,β)+1 because we are adding a gapEdit Distance(α,βC)+1 because we are adding a gap

but we not always have

G and
C
as two last nucleotiodes, so for the general case:

Edit Distance(αx,βy)=min{Edit Distance(α,β)+δ(x,y)Edit Distance(αx,β)+1Edit Distance(α,βy)+1

where

δ(x,y)=0 if
x=y
(nucleotides match) and
δ(x,y)=1
if
xy
(nucleotides do not match). The
δ(x,y)
has a particular meaning. If the two nucleotides at the end are equal to each other, there is nothing to be done no substitution operation is necessary. If a substitution is necessary however, we will record this by adding 1. When we will be talking about global and local alignment below the power of
δ(x,y)
will become even more clear.

We can write an algorithm that would exhaustively evaluate the above expression for all possible suffixes. This algorithm is below. Try executing it. It will take roughly ~30 seconds to find the edit distance between the two sequences used in the beginning of this lecture:

The take-home-message here is that it takes a very long time to compute the edit distance between two sequences that are only nine nucleotides long! Why is this happening? Figure 1 below shows a small subset of situations the algorithm is evaluating for two very short strings

TAG and
TAC
:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Figure 1 | A fraction of situations evaluated by the naïve algorithm for computing the edit distance. Just like in the case of the change problem discussed in the previous lecture a lot of time is wasted on computing distances between suffixes that has been seen more than once (shown in red).

To understand the magnitude of this problem let's look at slightly modified version of the previous Python code below. All we do here is keeping track how many times a particular pair of suffixes (in this case

AC and
AC
) are seen by the program. The number is staggering: 48,639. So this algorithm is extremely wasteful.

While this approach to the edit distance problem is correct, it will hardly help us on the genome-wide scale. Just like in case of the change problem and Manhattan tourist problem dynamic programming is going to save the day.


Dynamic programming to the rescue

It turns out that in order to find the alignment we first need to learn how to compute edit distances between sequences efficiently. So, suppose we have two sequences that deliberately have different lengths:

G C T A T A C

and

G C G T A T G C

Let's represent them as the following matrix where the first character

ϵ represents an empty string:

ϵGCTATACϵGCGTATGCNote:sequence X is vertical and sequence Y is horizontal.

Let's fill the first column and first row of the matrix. Because the distance between a string and an empty string is equal to the length of the string (e.g., a distance between string

TCG and an empty string is 3) this resulting matrix will look like this:

ϵGCTATACϵ01234567G1C2G3T4A5T6G7C8Note:sequence X is vertical and sequence Y is horizontal.

Now, let's fill in the cell between

G and
G
. Recall that:

Edit Distance(αx,βy)=min{\colorredEdit Distance(α,β)+δ(x,y)\colorblueEdit Distance(αx,β)+1\colorgreenEdit Distance(α,βy)+1

where

δ(x,y)=0 if
x=y
and
δ(x,y)=1
if
xy
. And now let's color each of the cells corresponding to each part of the above expression:

ϵGCTATACϵ\colorred0\colorgreen1234567G\colorblue1C2G3T4A5T6G7C8Note:sequence X is vertical and sequence Y is horizontal.

In this specific case:

Edit Distance(ϵG,ϵG)=min{\colorredEdit Distance(ϵ,ϵ)+δ(G,G) or 0 + 0 = 0\colorblueEdit Distance(ϵG,ϵ)+1 or 1 + 1 = 2\colorgreenEdit Distance(ϵ,ϵG)+1 or 1 + 1 = 2

This minimum of 0, 2, and 2 will be 0, so we are putting zero into that cell:

ϵGCTATACϵ\colorred0\colorgreen1234567G\colorblue1\colorred0C2G3T4A5T6G7C8Note:sequence X is vertical and sequence Y is horizontal.

Using this uncomplicated logic we can fill the entire matrix like this:

ϵGCTATACϵ01234567G10123456C21012345G32112345T43212234A54321223T65432123G76543223C8765433\colorred2Note:sequence X is vertical and sequence Y is horizontal.

The lower rightmost cell highlighted in red is special. It contains the value for the edit distance between the two strings. The following Python script implements this idea. You can see that it is essentially instantaneous:


From edit distance to alignment

In the previous section we have filled the dynamic programming matrix to find out that the edit distance between the sequences is 2. But in biological applications we are most often interested not in edit distance per se but in the actual alignment between two sequences.

The traceback

We can use the dynamic programming matrix to reconstruct the alignment. This is done using traceback procedure. Let's look at the rightmost bottom cell of the matrix highlighted in bold:

ϵGCTATACϵ01234567G10123456C21012345G32112345T43212234A54321223T65432123G765432\colorred2\colorgreen3C876543\colorblue32

When we were filling this matrix did we come to this point from above (

\colorgreen3), from the left (
\colorblue3
), or diagonally (
\colorred2
):

\colorred2\colorgreen3\colorblue32

Remembering the fact that when filling the matrix we are trying to minimize the expression for edit distance, we clearly arrived to this point diagonally from

\colorred2. This because arriving from top (
\colorgreen3
) or left (
\colorblue3
) would add 1. So we highlight diagonal cell in red:

ϵGCTATACϵ01234567G10123456C21012345G32112345T43212234A54321223T65432123G765432\colorred23C8765433\colorred2

Continuing this idea we will draw a trace like the one below until we hit an interesting point:

ϵGCTATACϵ01234567G10123456C21012345G32\colorred112345T432\colorred12234A5432\colorred1223T65432\colorred123G765432\colorred23C8765433\colorred2

At this point we have arrived to this value from the top because 0 + 1 = 1. If we were arriving diagonally it would be 1 + 1 = 2 since

G  C, so we are turning traceback upward and then again diagonally:

ϵGCTATACϵ\colorred01234567G1\colorred0123456C21\colorred012345G32\colorred112345T432\colorred12234A5432\colorred1223T65432\colorred123G765432\colorred23C8765433\colorred2

Now going through traceback line from the top we are getting the following alignment between two two sequences:

G C - T A T A C
| | * | | | * |
G C G T A T G C

Approximate matching

Let's now get a bit more practical. In many real biological applications we are trying to see if one sequence is contained within another. So, how can we use dynamic programming to find if there is an approximate match between two sequences

P and
T
?

Suppose we have two strings:

T=A A C C C T A T G T C A T G C C T T G G A

and

P=T A C G T C A G C

Let's construct the following matrix:

ϵAACCCTATGTCATGCCTTGGATACGTCAGCNote:sequence T is horizontal while P is vertical.

Let me remind you that our goal is to find where

P matches
T
. A priori we do not know when it may be, so we will start by filling the entire first row with zeroes. This is because the match between
P
and
T
may start at any point up there. The first column we will initialize the same way we did previously: with increasing sequence of numbers:

ϵAACCCTATGTCATGCCTTGGA0000000000000000000000T1A2C3G4T5C6A7G8C9
Now let's fill this matrix in using the same logic we used before:
ϵAACCCTATGTCATGCCTTGGA0000000000000000000000T1111110101011011100111A2112221011111112211121C3221222112212221222222G4332233221222322233223T5443333322123233323333C6554334433212333333444A7655444444321234444454G8766555554432223455445C9876656665543332345555

Now that we have filled in the complete matrix let's look at the bottom row. Instead of using the rightmost cell we will find a cell with the lowest number. That would be 2 (highlighted in red):

ϵAACCCTATGTCATGCCTTGGA0000000000000000000000T1111110101011011100111A2112221011111112211121C3221222112212221222222G4332233221222322233223T5443333322123233323333C6554334433212333333444A7655444444321234444454G8766555554432223455445C987665666554333\colorred2345555

Starting already familiar traceback procedure at that cell we will get the following path through the matrix:

ϵAACCCTATGTCATGCCTTGGA00000\colorred00000000000000000T111111\colorred0101011011100111A2112221\colorred011111112211121C32212221\colorred12212221222222G433223322\colorred1222322233223T5443333322\colorred123233323333C65543344332\colorred12333333444A765544444432\colorred1\colorred234444454G87665555544322\colorred23455445C987665666554333\colorred2345555

for a corresponding alignment:

A A C C C T A T G T C A T G C C T T G G A
          | | * | | | | * | | 
          T A C G T C A - G C

Global alignment

So far in filling the dynamic programming matrix we were using the following expression to compute the number within each cell:

Edit Distance(αx,βy)=min{\colorredEdit Distance(α,β)+δ(x,y)\colorblueEdit Distance(αx,β)+1\colorgreenEdit Distance(α,βy)+1

Basically we were adding 1 if there was a mismatch (the

δ function) and also adding 1 for every gap. This however is not biologically realistic. If we look at the rates of different rates of mutations in the human genome we will see that they vary dramatically. Let's look at substitutions first:

Figure 2 | There are two kinds of nucleotide substitutions: Transitions and Transversions. Transitions are substitutions between nucleotides belonging to the same chemical group. For example, a substitution of Adenine, a purine, to Guanine, also a purine, is a transition. Transversions, on the other hand, occur between chemically dissimilar nucleotides. For example, any substitution of a purine to a pyrimidine and vice verse will be a transition. (Image from Wikipedia)

you can see that there are move ways in which we can have a transversion. Despite this fact transversions are significantly less frequent that transitions. In fact in human the so called Transition/Transversion ratio (

Ts:Tv) is close to 2 (or even higher in coding regions).

The situation with insertions and deletions (that are often called indels) is similar in that are relatively rare and their rarity increases with size. A single nucleotide indel may occur every 1,000 bases on average, while a two-nucleotide deletion occurs every 3,000 bases and so on (see Montgomery et al. 2013 for a more detailed statistics).

As a result it is simply unrealistic to use "1" is all cases. Instead, we need to penalize rare events more than we penalize frequent, more probable events. Let's create a penalty matrix to achieve this goal:

ACGTA0\colorblue4\colorred2\colorblue4\colororange8C\colorblue40\colorblue4\colorred2\colororange8G\colorred2\colorblue40\colorblue4\colororange8T\colorblue4\colorred2\colorblue40\colororange8\colororange8\colororange8\colororange8\colororange8

Here if two nucleotides match, the penalty is 0. For a transitional substitution we pay

\colorred2 and for a transversional we pay
\colorblue4
. The gap is the most expensive at
\colororange8
.

Now, let's align two sequences:

X=T A T G T C A T G C

and

Y=T A C G T C A G C

First, we need to fill the following dynamic programming matrix:

ϵTATGTCATGCϵTACGTCAGCNote:sequence X is vertical and sequence Y is horizontal.

But now we using penalty matrix generated above to calculate values in each cell. Specifically, before we were using this expression:

Edit Distance(αx,βy)=min{Edit Distance(α,β)+δ(x,y)Edit Distance(αx,β)+1Edit Distance(α,βy)+1

but now, we will change it into this:

D(αx,βy)=min{D(α,β)+p(x,y)D(αx,β)+p(x,-)D(α,βy)+p(-,y)

where

p(x,y),
p(x,-)
, and
p(-,y)
are taken directly from penalty matrix. Let's start with the first row. In this row we can only fill from the left were we essentially inserting a gap every time. Since the gap penalty is 8 we will get:

ϵTATGTCATGCϵ08162432404856647280TACGTCAGCNote:sequence X is vertical and sequence Y is horizontal.

Similarly, for the first column we get:

ϵTATGTCATGCϵ08162432404856647280T8A16C24G32T40C48A56G64C72

and finally the full matrix:

ϵTATGTCATGCϵ08162432404856647280T8081624324048566472A1680816243240485664C24168210182432404856G322416102101826344048T403224161021018263442C484032241810210182634A564840322618102101826G645648403226181061018C7264564840342618121010

Drawing a traceback through this matrix will give us:

ϵTATGTCATGCϵ\colorred08162432404856647280T8\colorred081624324048566472A168\colorred0816243240485664C24168\colorred210182432404856G32241610\colorred2101826344048T4032241610\colorred21018263442C484032241810\colorred210182634A56484032261810\colorred2\colorred101826G64564840322618106\colorred1018C72645648403426181210\colorred10

This corresponds to the following alignment:

T A C G T C A - G C
| | * | | | | * | | 
T A T G T C A T G C

The following Python code implements Global alignment approach we have seen above. Note that the first function, exampleCost, can be changed to set different value for the penalty matrix. You can play with it here:

Local alignment

Global alignment discussed above works only in cases when we truly expect sequences to match across their entire lengths. In majority of biological application this is rarely the case. For instance, suppose you want to compare two bacterial genomes to figure out if they have matching sequences. Local alignment algorithm helps with this challenge. Surprisingly, it is almost identical to the approaches we used before. So here is our problem:

Sequence 1: ##################################################
                              |||||||||||||||||
Sequence 2:         #################################################

there are two sequences (shown with # characters) and they share a region of high similarity (indicated by vertical lines). How do we find this region? We cannot use global alignment logic here because across the majority of their lengths these sequences are dissimilar.

To approach this problem we will change our penalty strategy. Instead of giving a high value for each mismatch we will instead give negative penalties. Only matches get positive rewards. Here is an example of such scoring matrix (as opposed to the penalty matrix we've used before):

ACGTA2\colorblue4\colorred4\colorblue4\colororange6C\colorblue42\colorblue4\colorred4\colororange6G\colorred4\colorblue42\colorblue4\colororange6T\colorblue4\colorred4\colorblue42\colororange6\colororange6\colororange6\colororange6\colororange6

Our scoring logic will also change:

D(αx,βy)=max{D(α,β)+s(x,y)D(αx,β)+s(x,-)D(α,βy)+s(-,y)0

We are now looking for maximum and use 0 to prevent having negative values in the matrix. This also implies that the first row and column will now be filled with zeros. Let apply this to two sequences:

Now, let's align two sequences:

T A T A T G C G G C G T T T

and

G G T A T G C T G G C G C T A

Dynamic programming matrix with initialized first row and column will look like this:

ϵTATATGCGGCGTTTϵ000000000000000G0G0T0A0T0G0C0T0G0G0C0G0C0T0A0Remember:sequence X is vertical while Y is horizontal.

Filling it completely will yield the following matrix. Note that clues to where the local alignment may be are given off by positive numbers in the sea of 0s:

ϵTATATGCGGCGTTTϵ000000000000000G000000202202000G000000202402000T020202000000422A004040000000000T020606000000222G000020822202000C0000002104040000T020202046000222G000000406822000G000000202844000C0000000402104000G00000020624\colorred12600C000000040246820T0202020000008104A004040000000246

To identify to boundary of the local alignment we need to identify the cell with the highest value. This cell has value of

\colorred12 and is highlighted above. Using traceback procedure we will find:

ϵTATATGCGGCGTTTϵ000000000000000G000000202202000G000000202402000T020\colorred202000000422A0040\colorred40000000000T02060\colorred6000000222G000020\colorred822202000C0000002\colorred104040000T0202020\colorred46000222G00000040\colorred6822000G000000202\colorred844000C0000000402\colorred104000G00000020624\colorred12600C000000040246820T0202020000008104A004040000000246

This corresponds to the best local alignment between the two sequences:

 T A T A T G C - G G C G T T T
     | | | | | * | | | |
 G G T A T G C T G G C G C T A

Here is a Python representation of this approach:

import numpy

def exampleCost(xc, yc):
    ''' Cost function: 2 to match, -6 to gap, -4 to mismatch '''
    if xc == yc: return 2 # match
    if xc == '-' or yc == '-': return -6 # gap
    return -4

def smithWaterman(x, y, s):
    ''' Calculate local alignment values of sequences x and y using
        dynamic programming.  Return maximal local alignment value. '''
    V = numpy.zeros((len(x)+1, len(y)+1), dtype=int)
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            V[i, j] = max(V[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
                          V[i-1, j  ] + s(x[i-1], '-'),    # vertical
                          V[i  , j-1] + s('-',    y[j-1]), # horizontal
                          0)                               # empty
    argmax = numpy.where(V == V.max())
    return V, int(V[argmax])
    
def traceback(V, x, y, s):
    """ Trace back from given cell in local-alignment matrix V """
    # get i, j for maximal cell
    i, j = numpy.unravel_index(numpy.argmax(V), V.shape)
    xscript, alx, aly, alm = [], [], [], []
    while (i > 0 or j > 0) and V[i, j] != 0:
        diag, vert, horz = 0, 0, 0
        if i > 0 and j > 0:
            diag = V[i-1, j-1] + s(x[i-1], y[j-1])
        if i > 0:
            vert = V[i-1, j] + s(x[i-1], '-')
        if j > 0:
            horz = V[i, j-1] + s('-', y[j-1])
        if diag >= vert and diag >= horz:
            match = x[i-1] == y[j-1]
            xscript.append('M' if match else 'R')
            alm.append('|' if match else ' ')
            alx.append(x[i-1]); aly.append(y[j-1])
            i -= 1; j -= 1
        elif vert >= horz:
            xscript.append('D')
            alx.append(x[i-1]); aly.append('-'); alm.append(' ')
            i -= 1
        else:
            xscript.append('I')
            aly.append(y[j-1]); alx.append('-'); alm.append(' ')
            j -= 1
    xscript = (''.join(xscript))[::-1]
    alignment = '\n'.join(map(lambda x: ''.join(x), [alx[::-1], alm[::-1], aly[::-1]]))
    return xscript, alignment

x, y = 'GGTATGCTGGCGCTA', 'TATATGCGGCGTTT'
V, best = smithWaterman(x, y, exampleCost)
print(V)
print("Best score=%d, in cell %s" % (best, numpy.unravel_index(numpy.argmax(V), V.shape)))
print(traceback(V, x, y, exampleCost)[1])

This algorithm was developed by Temple Smith and Michael Waterman in 1981. This is why it is most often called Smith Waterman local alignment algorithm.

Alignment tools

Lastz

https://github.com/lastz/lastz

Minimap2

https://github.com/lh3/minimap2

Multiple whole genome alignment approaches

MultiZ

A schematic representation of the multiZ pipeline. Blue bars represent analysis steps performed by distinct tools. The workflow begins by softmasking interspersed and tandem repeats in the genomes to be aligned. The genomes are then split into smaller segments for parallelization purposes and all segments of the reference are aligned against all segments of each genome using lastZ. The resulting alignments are combined (in non-reference genome bins), sorted, and split by reference chromosomes. They are then chained and the chains are used to create nets. These steps are performed in parallel for each reference/genome{1, 2, …, N} combination. Subsequently all reference/genome combinations are combined into multiple alignments with MultiZ. Resulting MAF datasets are used to define conserved regions identified by phastCons and to compute per-base phyloP values. Chain and nets are explained in 1.2. 2bit, lav, axt, psl, and MAF are various intermediate file formats (described in [87]).

Progressive Cactus

A schematic representation of steps involved in aligning four genomes using ProgressiveCactus. First, genomes to be aligned are softmasked: regions of the sequence corresponding to repetitive elements are converted to lowercase characters. During the pairwise alignment step lastZ does not use softmasked regions in the seeding but is allowed to extend matches to these regions. ProgressiveCactus splits the MSA task into smaller parallelizable subtasks using a guide tree. Here the ancestral genome E is first reconstructed by a process which aligns the ingroups A and B, using the outgroups C and D to determine orthology relationships within the alignment of A and B. The output of this process is a sub-alignment of A and B to their direct ancestor E and a reconstruction of the genome assembly of E. At each stage there are two computation steps: a local alignment step (lastZ) and a MSA step (CAF/BAR). Each of these can be run on a single compute node with sufficient resources. The local alignment step generates local alignments between the ingroup genomes (all against all), and between the in-group genomes and the outgroups (using an intelligent strategy to avoid aligning all ingroups to all outgroups). In the MSA step these local alignments are combined within a graph theoretic structure which initially represents the transitive closure of these alignments and which progressively cleans and then refines the resulting MSA to produce the final MSA for the node. This whole process is then repeated for internal nodes F and G and the resulting MSAs, represented using the flexible HAL format, are stitched together. The total number of individual steps is equal to two times the number of internal nodes in the tree plus a single post processing step that combines the resulting sub-MSAs to create the final alignment. Gray nodes on the tree represent ancestral genomes inferred by Progressive Cactus. CAF = Cactus alignment filter; BAR = base-level alignment refinement; MSA = multiple sequence alignment.

Alignment and UCSC Genome Browser

https://genome.ucsc.edu/