# Bioinformatics Algorithms
## index
- de Bruijn graph
- alignment
- minimum edit distance -> how much "different"
- longest common subsequence -> how much "equal"
- global alignment -> negative score for differences, positive score for equal character
- local alignment -> smith-waterman
- reading
- "normal" tries
- suffix tries
- suffix trees (nb trie != tree)
- suffix arrays
- Manber & Myers
- Burrows Wheeler Transform (BWT)
- FM index
- inexact matching
- *PINOLI'S PARTS*
- the hash table part for inexact matching
- multiple sequence alignment
- center star
- the other method (progressive alignment, see below)
- hidden markov models
## Playground
These are links to online demos of the algorithms seen in this course
- [Global alignment - Needleman-Wunsch](http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Needleman-Wunsch)
- [Local alignment - Smith-Waterman](http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Smith-Waterman)
- [Suffix trie](https://www.cs.usfca.edu/~galles/visualization/Trie.html)
- [Suffix tree creation](https://visualgo.net/en/suffixtree)
- [Burrows-Wheeler transform](https://calcoolator.eu/burrows-wheeler-transform-encoder-decoder-)
## Genome assembly
:::warning
todo
:::
- How many reads do we need to cover the human genome on average 10-fold if each read has a length of 500 bases?
- G: genome size
- L: read length
- N: number of reads
- n~b~ = NL: total number of sequenced bases
- $\lambda$ = NL/G is the coverage (10x in this case)
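Plugging in the numbers of the question above (taking G ≈ 3×10^9^ bases for the human genome, an assumption of this worked example):
$$
N = \frac{\lambda G}{L} = \frac{10 \cdot 3\times10^{9}}{500} = 6\times10^{7} \text{ reads}
$$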
With k-mers:
In a string of length L there are L-k+1 k-mers.
With N reads, we have n~k~ = N(L-k+1) total k-mers.
The coverage depth for k-mers is $d_k = \frac{n_k}{G}$,
so the ratio between the coverage depths for k-mers and bases is $\frac{d_k}{\lambda} = \frac{L-k+1}{L}$.

- Assuming reads start at uniformly random positions, the number of reads covering a given base is approximately Poisson-distributed with mean $\lambda$, so the probability that a given base is not covered is
$$P(X=0) = \frac{e^{-\lambda} \lambda^0}{0!} = e^{-\lambda}$$
Therefore, the probability of seeing at least one read at a given position is $P(X > 0) = 1 - P(X = 0) = 1 - e^{-\lambda}$.
So, to estimate the mean read depth required such that at least 99% of the genome is covered at least once,
we set P(X>0) = 0.99 and solve for $\lambda$.
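Worked out:
$$
1 - e^{-\lambda} = 0.99 \implies e^{-\lambda} = 0.01 \implies \lambda = \ln 100 \approx 4.6
$$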
:::warning
todo: all the stuff about contigs
:::
### random stuff for graphs
- **Eulerian cycle**: a path that traverses every edge exactly once and returns to the start node at the end of the traversal
- **Traversable graph**: a graph that can be drawn without lifting the pen from the paper and without retracing the same edge. In such a case the graph is said to have an Eulerian path (a trail in a graph which visits every edge exactly once)
- When the *degree of every vertex is even*, the graph is *traversable* and there is an **Eulerian cycle**
- If there are *exactly two vertices of odd degree*, there is an **Eulerian path**, but no cycle
- If there are more than two odd-degree vertices, an **Eulerian path** doesn't exist
##### Hierholzer's algorithm
//why does it work tho?
1. make sure that Euler's conditions hold for the graph
2. make two stacks, *tpath* and *epath*
3. pick a suitable start vertex *v* (the one with one extra outgoing edge for a path, any vertex for a cycle)
4. push *v* to *tpath*
5. let *u* = *tpath.TOP*
6. have all outgoing edges from *u* been visited?
- yes -> pop *u* from *tpath* and push it into *epath*
- no -> select a random outgoing edge (u,x), push x to tpath and delete (u,x) from the set of edges
7. repeat from 5 until tpath is empty
the Eulerian path will be in *epath*, in the right order (see the sketch below).
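A minimal Python sketch of the procedure above, assuming a directed multigraph given as adjacency lists (the representation, function name and toy graph are mine, not from the course):
```python=
def hierholzer(adj, start):
    # copy the adjacency lists, since edges are deleted once traversed
    edges = {u: list(vs) for u, vs in adj.items()}
    tpath, epath = [start], []            # the two stacks
    while tpath:
        u = tpath[-1]                     # u = tpath.TOP
        if edges.get(u):                  # some outgoing edge of u not yet visited
            tpath.append(edges[u].pop())  # push x to tpath, delete the edge (u,x)
        else:                             # all outgoing edges of u visited
            epath.append(tpath.pop())     # pop u from tpath, push it into epath
    return epath[::-1]                    # epath, read as a stack, is the path

# toy directed cycle A -> B -> C -> A (in-degree = out-degree everywhere)
print(hierholzer({"A": ["B"], "B": ["C"], "C": ["A"]}, "A"))  # ['A', 'B', 'C', 'A']
```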
## Sequence alignment
### Minimum edit distance
This is a metric for measuring how *different* two strings are:
the minimum number of operations needed to transform one string into the other.
The possible operations are *removal, insertion, substitution*.
$$
\text{E}(a(i),b(j)) = \min \begin{cases}
1 + E(a(i), b(j-1))\\
1 + E(a(i-1), b(j))\\
\begin{cases}
E(a(i-1), b(j-1)) & \text{if } a_i=b_j\\
1 + E(a(i-1), b(j-1)) & \text{if } a_i\neq b_j\\
\end{cases}
\end{cases}
$$
This is the recursive definition.
The base cases are E(a(i), $\epsilon$) = i, E($\epsilon$, b(j)) = j and
$$
E(a(1),b(1))=
\begin{cases}
0 & a_1 = b_1 \\
1 & a_1 \neq b_1 \\
\end{cases}
$$
#### graphical method
If done by hand, there is a graphical method for finding the solution.

For each cell (i,j) take the minimum of:
- cell to the left +1 (gap insertion in a)
- cell to the top +1 (gap insertion in b)
- if a(i) $=$ b(j) -> upper-left cell + 0 (same symbol, so the edit distance is not increased)
- if a(i) $\neq$ b(j) -> upper-left cell + 1 (substitution)

The bottom-right cell contains the *MED* for the two strings.
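A minimal Python sketch of this table fill (the function name and test strings are illustrative):
```python=
def minEditDistance(a, b):
    n, m = len(a), len(b)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): E[i][0] = i  # first column: i deletions
    for j in range(m + 1): E[0][j] = j  # first row: j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = min(E[i][j - 1] + 1,  # left: gap insertion in a
                          E[i - 1][j] + 1,  # top: gap insertion in b
                          E[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # diagonal
    return E[n][m]  # the bottom-right cell is the MED

print(minEditDistance("kitten", "sitting"))  # 3
```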
#### finding the best MED alignment
After finding the *MED* we need to find the alignment that actually minimizes the distance.
**Backtracking** achieves this goal.

Starting from the *MED* cell, we compute the minimum over the cells that could have led us there.
**note**
when calculating the minimum we have to add the "weight of the movement". For example:
- cell(6,4): min of 4+1 (left), 4+1(top), 3+0 (diagonal, same letter)
- cell(5,3): min of 3+1 (left), 3+1(top), 2+1 (diagonal, different letter (N vs R))
Two strings can have **more** than one *MED* alignment.
#### Weighted edit distance
Instead of assigning "1" for each operation, we can assign them a specific weight.
$$
\text{E}_w(a(i),b(j)) = \min \begin{cases}
w_{id} + E_w(a(i), b(j-1))\\
w_{id} + E_w(a(i-1), b(j))\\
\begin{cases}
E_w(a(i-1), b(j-1)) & \text{if } a_i=b_j\\
w(a_i,b_j) + E_w(a(i-1), b(j-1)) & \text{if } a_i\neq b_j\\
\end{cases}
\end{cases}
$$
- $w_{id}$ is the weight of *insertions and deletions*
- $w(a_i,b_j)$ is the **weight matrix**
Weights are *positive values*: the higher the weight, the higher the penalization.
### Longest Common Subsequence
This is a metric for measuring how *similar* two strings are.
**subsequence**: derived by deleting *some, all or no* characters from the string; the order of the remaining characters has to be preserved. APL is a valid subsequence of APPLE.
$$
\text{LCS}(a(i),b(j)) = \max \begin{cases}
0 + LCS(a(i), b(j-1))\\
0 + LCS(a(i-1), b(j))\\
\begin{cases}
1 + LCS(a(i-1), b(j-1)) & \text{if } a_i=b_j\\
0 + LCS(a(i-1), b(j-1)) & \text{if } a_i\neq b_j\\
\end{cases}
\end{cases}
$$
The idea is similar to the *MED*:
instead of minimizing a weight function, we maximize a *score*.
#### graphical method
The graphical method is analogous to the *MED*'s. Just keep in mind that the scores are assigned with a reverse logic.

So the first row and column are set to 0, because the score for gaps is 0.
#### finding the best LCS alignment
We **backtrack** in the same way as for the MED.
As always, the logic is flipped, so we account for a +1 score when we go diagonally on a matching character.

Remember: here we are searching for the maximum value.
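A matching sketch for the LCS length: the same table as the MED, but maximizing a score (names are illustrative):
```python=
def lcsLength(a, b):
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1  # +1 on a diagonal match
            else:
                L[i][j] = max(L[i][j - 1], L[i - 1][j])  # gaps score 0
    return L[n][m]

print(lcsLength("APPLE", "APL"))  # 3
```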
### Global Alignment (Needleman-Wunsch)
This is a synthesis of MED and LCS:
- Differences have a *negative* effect (penalty)
- Conserved letters have a *positive* effect (positive weight)
$$
\text{E}(a(i),b(j)) = \max \begin{cases}
w_d + E(a(i), b(j-1))\\
w_d + E(a(i-1), b(j))\\
\begin{cases}
w_m + E(a(i-1), b(j-1)) & \text{if } a_i=b_j\\
w_s + E(a(i-1), b(j-1)) & \text{if } a_i\neq b_j\\
\end{cases}
\end{cases}
$$
- $w_d <0$: negative **gap penalty** for insertions/deletions
- $w_s<0$: negative **mismatch penalty** for substitutions
- $w_m>0$: positive **match score** for same symbol

**note**
- the initialization of the first row and column is dictated by the *gap penalty*
- it's a **maximizing** problem
- when going diagonally we have to adjust, in both cases, with the appropriate penalty or score
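A minimal sketch of the recurrence; the weights below (w~m~=1, w~s~=-1, w~d~=-2) are example values, not from the course:
```python=
def needlemanWunsch(a, b, wm=1, ws=-1, wd=-2):
    n, m = len(a), len(b)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1): E[i][0] = i * wd  # first column: gap penalties
    for j in range(1, m + 1): E[0][j] = j * wd  # first row: gap penalties
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = wm if a[i - 1] == b[j - 1] else ws  # match or mismatch
            E[i][j] = max(E[i][j - 1] + wd,        # gap in a
                          E[i - 1][j] + wd,        # gap in b
                          E[i - 1][j - 1] + diag)  # diagonal
    return E[n][m]  # global alignment score in the bottom-right cell

print(needlemanWunsch("GATTACA", "GCATGCU"))
```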
### Local sequence alignment (Smith-Waterman)
> *Evolutionary relationships can be identified only for a part of the sequences, but not "globally", while for the rest of the sequences no relevant similarity can be found*

A method for finding similar regions *within* two sequences is needed.
The problem can be formalized as:
> *Find the partial alignment (of substrings of a and b) with the maximum score (such that there are no other substrings which would achieve a higher alignment score)*

In the global alignment algorithm, cell(i,j) contains the total alignment score for the *prefixes a(i), b(j)*, i.e. from cell(0,0).
Let's change that: let cell(i,j) contain the *maximum alignment score over all substring pairs which end at (i,j)*. So not from the beginning, but from some intermediate position in strings *a* and *b*.
This is achieved by adding 0 as a fourth option!
$$
\text{M}(a(i),b(j)) = \max \begin{cases}
0\\
w_d + M(a(i), b(j-1))\\
w_d + M(a(i-1), b(j))\\
\sigma(a_i,b_j) + M(a(i-1), b(j-1))
\end{cases}
$$
The reasoning for adding 0 is the following:
1. Computing a value for a cell (and thus creating a traceback pointer) is equivalent to extending an alignment.
2. But if all possible extensions lead to negative scores, then it's better to reset everything and start a new potential alignment.
3. When the choice is 0, no traceback pointer is associated with the cell: we are not extending an existing alignment but starting a new one.
#### finding the best local alignments
The optimal score is the *highest positive score* in the matrix.
Thus the optimal alignment can be found by backtracking from that point.

**note**
- Zeros in the first row and column, instead of negative scores from gaps.
- We can have more than one cell with the optimal score.
- We can have more than one path from the same cell.
- It's worth calculating the second, third... best alignments.
- This algorithm is applied to large sequences that may share more than one region of similarity.
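A minimal sketch with the extra 0 option; here $\sigma$ is folded into a simple match/mismatch score and all weights are example values:
```python=
def smithWaterman(a, b, wm=2, ws=-1, wd=-1):
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]  # zeros in first row/column
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = wm if a[i - 1] == b[j - 1] else ws
            M[i][j] = max(0,                        # reset: start a new alignment
                          M[i][j - 1] + wd,
                          M[i - 1][j] + wd,
                          M[i - 1][j - 1] + sigma)
            best = max(best, M[i][j])
    return best  # highest positive score in the matrix

print(smithWaterman("abracadabra", "cadab"))  # 10: "cadab" matches exactly
```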
## Read Mapping
When we have reads, we want to compare them with a reference genome.
E.g.: we have DNA reads from a patient and we want to check whether they map to a known DNA mutation, as in genetic diseases.
For future use, let's now introduce the symbol \$, defined to be lexicographically smaller than all other characters.
For DNA this means:
\$ < A < C < G < T
It's used to enforce the lexicographic order, e.g. "over\$" comes before "overture\$".
### Suffix trie and tree
A suffix *trie* of a string T is a [trie](https://en.wikipedia.org/wiki/Trie) composed of all the *suffixes* of T.

Given a string S, we can trivially check whether it is a substring of T by just following the edges labeled with the letters of S (every substring of T is a prefix of some suffix of T). If, after spelling S, the path can be extended with an edge labeled "\$", then S is also a suffix of T.
#### Trie algorithms
1. **Constructing a suffix trie**
```python=
# Build the suffix trie of T as nested dictionaries (one dict per node).
def createSuffixTrie(T):
    T += '$'                 # add the terminator
    root = {}
    for i in range(len(T)):  # insert every suffix T[i:]
        n = root             # n is the current node
        for c in T[i:]:      # for each char in the i-th suffix
            if c not in n:
                n[c] = {}    # add an outgoing edge to n if needed
            n = n[c]         # switch the current node to the child node
    return root
```
2. **Searching a suffix trie**
```python=
# Return the node at the end of the path spelled by S, or None if there's no path.
def followPath(T, S):
    n = createSuffixTrie(T)  # n starts at the root
    for c in S:              # follow one edge per character of S
        if c not in n:
            return None      # not found: S is not a substring of T
        n = n[c]             # switch the current node to the child node
    return n
```
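A quick usage sketch of the two functions above (the test strings are mine): a complete path means S is a substring; a "\$" edge at its end would mean S is also a suffix.
```python=
n = followPath("abracadabra", "cad")
print(n is not None)               # True: "cad" is a substring of T
print(n is not None and '$' in n)  # False: "cad" is not a suffix of T
```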
The space complexity is $O(m^2)$, where m=|T|. This is too much: the human genome is *3 billion bases* long.
The first step is to migrate from a trie to a tree. This is done by combining non-branching paths into a single edge with a *string label*.

By doing so we have greatly reduced the number of nodes, but by how much?
Let's reason:
- We have exactly m *leaf* nodes, where m=|T| (an *m*-string has *m* suffixes).
- A full binary tree (in general not our case) with m *leaf* nodes has m-1 internal nodes.
- Our tree has at most as many *internal* nodes as a FBT: if an internal node has >2 children, then there are fewer internal nodes overall.
- Thus we have <= 2m total nodes, $O(m)$!! What about edges? Each leaf and internal node has *one* incoming edge (the root has none), so <= 2m-1 edges.

BUT! There is a catch: the total length of the *edge labels* is still quadratic, $O(m^2)$. Instead of storing the label, we store two integers: the *offset* and the *length* of the label in the original string.

**erratum**: the label of node 10's incoming edge is (11,1)
The naive construction algorithms for suffix trees are quadratic in time, $O(m^2)$. An elegant alternative is [Ukkonen's algorithm](https://cs.brown.edu/courses/csci1820/resources/Ukkonen_1995.pdf), which is $O(m)$.
##### Find patterns with a suffix tree
Let T be our "genome" and P a "sequencing read", i.e. the pattern to be found in the genome.
The search takes $O(n+k)$ time, where n=|P| and k = the number of matches.

### Suffix Arrays
It's a lexicographically ordered array of *pointers to the suffixes* of T.

The array contains only the numbers; they can be indices or actual memory pointers into the original genome.
Note how repeated substrings are adjacent.
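A minimal (and naive) construction sketch, sorting the suffix start positions with Python's built-in sort; fine for small strings, far too slow for genomes:
```python=
def naiveSuffixArray(T):
    T += '$'  # add the terminator
    return sorted(range(len(T)), key=lambda i: T[i:])

print(naiveSuffixArray("abracadabra"))
# [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
```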
#### Creating suffix arrays - Manber & Myers
This is the method proposed in the original paper by Manber & Myers. It has a time complexity of $O(n\log n)$, but there are more advanced solutions that create suffix arrays in $O(n)$.
It's a [radix sort](https://en.wikipedia.org/wiki/Radix_sort) with an additional twist:
to sort $2^k$-mers we can reuse the information we have from sorting $2^{k-1}$-mers.

In this example, when sorting suffixes 0 and 9 we look at the ranks of the suffixes at positions 0+4=4 (rank 7) and 9+4=13 (rank 5). Without comparing the rest of the sequence, we know that suffix 9 comes before suffix 0.
This is possible because we are ordering all the suffixes of the same string.
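A compact prefix-doubling sketch of this idea; it uses Python's comparison sort instead of radix sort, so it runs in $O(n\log^2 n)$ rather than M&M's $O(n\log n)$ (the input is assumed to already end with '$'):
```python=
def manberMyers(T):
    n = len(T)
    sa = list(range(n))
    rank = [ord(c) for c in T]  # round 0: rank by the first character
    k = 1
    while k < n:
        # rank of the first 2^(k-1) chars, then of the next 2^(k-1) chars
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)  # M&M would radix-sort these pairs instead
        new_rank = [0] * n
        for j in range(1, n):  # equal keys share the same rank
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        k *= 2
    return sa

print(manberMyers("abracadabra$"))  # [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
```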
### Burrows Wheeler Transform (BWT)
The BWT applies a reversible transformation to a block of input text. The transformation does not itself compress the data, but reorders it to make it easy to compress with simple algorithms such as move-to-front coding.
1. Form all **rotations** of the input text T.
2. Sort the rotations lexicographically. Now we have the **Burrows Wheeler matrix**
3. The **BWT** is the last column of the matrix.
There's an obvious connection between BWT(T) and the SA(T).
The character at position *i* in the BWT is the character that, in the original string, immediately precedes the *i*-th suffix of the SA.

The function that creates the BWT from the SA is now trivial:
$$
BWT(T)[i]=
\begin{cases}
T[SA[i]-1] & \text{if } SA[i]>0\\
\$ & \text{if } SA[i]=0
\end{cases}
$$
```python=
# Build BWT(T) from the suffix array of T+'$'
# (naiveSuffixArray is the sketch from the suffix-array section above).
def bwtFromSA(T):
    sa = naiveSuffixArray(T)
    bwt = []
    for i in range(len(sa)):
        if sa[i] == 0:
            bwt.append('$')           # the rotation starting at 0 wraps to '$'
        else:
            bwt.append(T[sa[i] - 1])  # the char to the left of the i-th suffix
    return ''.join(bwt)

print(bwtFromSA("abracadabra"))  # ard$rcaaaabb
```
#### Reversing the BWT
Let's introduce the T-ranking:
- The T-ranking of the character at any given position is **the number of times an identical character has preceded it in T**
**LF Mapping property**:
in the BW matrix, for any character, the T-ranks of its occurrences in the first column (F) appear in the same relative order as in the last column (L)

Note how the T-ranks of *a* appear in both columns in the same order.
Instead of storing the ranks we introduce an "implicit" ranking: the **B-ranking**.
The **B-ranking** is the *number of times the same character has occurred in the F column "above" the current* position.

The B-ranks follow the same order-preserving property between F and L as the T-ranks did.
So, starting with the BWT (which is the L column), we can:
1. reconstruct the F column using the *cumulative index property*
- The position of a character $\tau$ in the F column is the sum of the number of characters lexicographically less than $\tau$ plus the B-rank of $\tau$.

The position of $c_0$ is 1 (#\$) + 5 (#a) + 2 (#b) + 0 (B-rank of $c_0$).
2. With the F column we can reconstruct the original string backwards, with a zig-zag movement

Remember: these are all rotations, so the L column contains the preceding character!
To do this we only need BWT(T), without reconstructing the matrix: a small data structure storing how many characters of each type are contained in T is enough.
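A sketch of the reconstruction via the LF mapping, computing the B-ranks and the cumulative index from the BWT alone (the function name is mine):
```python=
def inverseBWT(bwt):
    # B-rank of every character of L = occurrences of the same char above it
    ranks, count = [], {}
    for c in bwt:
        ranks.append(count.get(c, 0))
        count[c] = count.get(c, 0) + 1
    # first[c] = row where the block of c starts in F (cumulative index)
    first, tot = {}, 0
    for c in sorted(count):
        first[c] = tot
        tot += count[c]
    # row 0 of the BWM starts with '$', so its L char is the last char of T
    row, T = 0, '$'
    while bwt[row] != '$':
        c = bwt[row]
        T = c + T                    # prepend the preceding character
        row = first[c] + ranks[row]  # LF mapping: same char, same B-rank, in F
    return T

print(inverseBWT("ard$rcaaaabb"))  # abracadabra$
```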
### FM index
> The FM index uses the BWT and some other auxiliary data structures to generate a fast and efficient index for searching for small patterns within a larger string T
The main idea is: the substring we're searching for will be the *prefix of a suffix* in the Burrows-Wheeler matrix.
The approach is similar to the reconstruction of the original string.
This is better explained with an example:
Let P=abra be the pattern to be found, and T=abracadabra the genome in which to search.
Let's start from the *last character of P*, an "a". It's easy to find the portion of the BWM whose rows start with "a", thanks to the *cumulative index property*.

Once we have found all the rows that start with "a", look at L to identify the rows that contain the next char of the pattern (remember: we're going backwards in the pattern), an "r". Now read the B-ranks of these two "r" and start the process again with the next char of the pattern.
When the last (first) letter of the pattern has been checked, if we always succeeded, we have the rows that begin with our pattern, in our case rows 2 and 3.
Note how they're consecutive rows; this will always happen.
We have three problems though:
1. How do we efficiently find the preceding character? (that is: starting from a chunk of prefixes in F, how do we find the correct characters in L to continue leftwards?)
- Worst case: $O(|T|)$, scan the whole genome
2. We don't actually store the B-ranks of the characters; we need a way of "deducing" them.
3. How do we figure out at what position the pattern occurs in T?
The **Tally Table** solves issues #1 and #2. What the TT does is *precalculate the number of occurrences of each specific character in **L** up to every row.*

After we have found all the rows beginning with an "a" in the first step, say [i,j], look at rows [i-1,j] in the tally table.

We see that only 2 "r" occur in our range.
Instead of storing the entire table, $O(|T|\cdot|\Sigma|)$ integers, store only every K^th^ row. Assuming we space the checkpoint rows at a constant distance, the number of lookups will be at most that distance, if we start from the nearest checkpoint.

The number of "A" at row ??? will be either 113+1 or 114-1. Note how the maximum number of lookups is 8, the k-factor of this TT.
Note the relation between tally counts and B-ranks: B-rank = tally count - 1.
How to solve issue #3?
The suffix array has the information we need; it's just too costly in terms of space (4 bytes × |T|). Let's borrow the checkpoint idea from the TT: store only every k^th^ value **of the string T**. Let's call this method a *Selective Suffix Array*.

To find the position of a match that has no hit in the selective SA, just proceed backwards, as always, until a suffix with an SA hit is found.

Here, for example, "dabra" was found at position 6, which means that "abra" is at position 7. Storing every k^th^ value of the string T ensures that we do at most k-1 hops to find a hit.
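A sketch of the backward search that counts the occurrences of P; for clarity it recomputes the occurrence counts on the fly instead of using a sampled tally table:
```python=
def fmCount(bwt, P):
    # first[c] = start of c's block in F (cumulative index property)
    first, tot = {}, 0
    for c in sorted(set(bwt)):
        first[c] = tot
        tot += bwt.count(c)
    def occ(c, i):  # occurrences of c in bwt[:i] (the tally table's job)
        return bwt[:i].count(c)
    lo, hi = 0, len(bwt)   # current range of BWM rows [lo, hi)
    for c in reversed(P):  # match the pattern right to left
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0       # no row starts with c + the matched suffix of P
    return hi - lo         # number of occurrences of P in T

print(fmCount("ard$rcaaaabb", "abra"))  # 2 (rows 2 and 3, as above)
```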
### Inexact Matching
:::warning
todo
:::
## Multiple Sequence Alignment
k strings, each of length n.
Dynamic programming on a k-dimensional hypercube: space is O(n^k^).
Time is O(2^k^ n^k^): for each cell we look at 2^k^-1 alternatives.
**Sum of Pairs (SP)** metric:
$$
SP = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} a(i,j)
$$
This is the sum of the alignment scores of every pair of sequences.
To compute the MSA we need to maximize this score.
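A sketch of the SP score for already-aligned rows, assuming a toy pair scoring (match +1, mismatch -1, gap -2, gap/gap 0):
```python=
def pairScore(x, y, match=1, mismatch=-1, gap=-2):
    s = 0
    for a, b in zip(x, y):
        if a == '-' and b == '-':
            s += 0  # gap against gap is neutral
        elif a == '-' or b == '-':
            s += gap
        else:
            s += match if a == b else mismatch
    return s

def spScore(rows):
    k = len(rows)
    return sum(pairScore(rows[i], rows[j])
               for i in range(k - 1) for j in range(i + 1, k))

print(spScore(["AC-GT", "ACAGT", "AC-CT"]))  # 4
```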
### center star algorithm
1. compute all the *pairwise alignments*
2. find the string *c* (the center) with the maximum sum of pairwise alignment scores
3. the alignment is built by starting with *c* and then adding each of the other strings one by one, **as they were aligned to *c* in the initial step**
the complexity is:
1. the first step is O(k^2^) alignments
2. each of the alignments takes O(n^2^) time and space,
so overall the first step is O(k^2^ n^2^)
3. choosing the center string and building the rest costs O(nk) time
overall it is O(n^2^ k^2^)
#### progressive alignment
It's a greedy algorithm, i.e. at each step it takes the local optimum.
At each step, align the two most similar sequences (by SP score); this forms an evolutionary tree.
### local multiple alignment
1. build the profile, with no gaps.

Find the combination of substrings that, when aligned with no gaps, produces the maximum IC score.
$$
IC(P) = \sum_{i=1} ^4 \sum_{j=1} ^{w} f_{i,j} \log_2 \frac{f_{i,j}}{1/4}
$$
the **logo** shows the IC of every letter in every column, in bits.
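A sketch of the IC formula above for a DNA profile, with columns given as frequency dictionaries (background probability 1/4 per base):
```python=
from math import log2

def informationContent(profile):
    ic = 0.0
    for col in profile:  # one dict of base frequencies per column
        for base in "ACGT":
            f = col.get(base, 0.0)
            if f > 0:    # 0 * log(0) is taken as 0
                ic += f * log2(f / 0.25)
    return ic

cols = [{"A": 1.0}, {"A": 0.5, "C": 0.5}]
print(informationContent(cols))  # 2.0 + 1.0 = 3.0 bits
```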

## Hidden Markov models
It's a probabilistic FSM:
- we only see the output
- the states are hidden
### viterbi algorithm
finds the sequence of states that maximizes the probability of the string being generated by the model
dynamic programming table: L columns (the length of the string) and |Q| rows (the number of states).
Cell v~k~(i) is the **probability of the most probable path of states ending at symbol x~i~ in state $\pi_k$**

- a~kk'~: the probability of transitioning from state k to k'
- e~k~(n): the probability of emitting n in state k
In general, the rule for each column is
$$
v_F(i)= e_F(x_i) \cdot \max_{L,F} \{ v_F (i-1) \cdot a_{FF} , v_L (i-1) \cdot a_{LF} \}
$$
$$
v_L(i)= e_L(x_i) \cdot \max_{L,F} \{ v_F (i-1) \cdot a_{FL} , v_L (i-1) \cdot a_{LL} \}
$$
At each step create a traceback pointer.
The optimal solution for the whole string will thus be max{v~F~(L), v~L~(L)}; the path is obtained by following the traceback pointers.
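A sketch of Viterbi for the two-state fair/loaded (F/L) dice model that the formulas above suggest; all probabilities here are made-up example numbers:
```python=
def viterbi(x, states, start, trans, emit):
    V = [{k: start[k] * emit[k][x[0]] for k in states}]  # first column
    ptr = [{}]
    for i in range(1, len(x)):
        col, back = {}, {}
        for k in states:
            # previous state maximizing v_k'(i-1) * a_k'k
            prev = max(states, key=lambda kp: V[-1][kp] * trans[kp][k])
            col[k] = emit[k][x[i]] * V[-1][prev] * trans[prev][k]
            back[k] = prev  # traceback pointer for this cell
        V.append(col); ptr.append(back)
    last = max(states, key=lambda k: V[-1][k])  # max{v_F(L), v_L(L)}
    path = [last]
    for i in range(len(x) - 1, 0, -1):          # follow the traceback pointers
        path.append(ptr[i][path[-1]])
    return list(reversed(path))

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {s: 1 / 6 for s in "123456"},
        "L": {s: 0.5 if s == "6" else 0.1 for s in "123456"}}
print(viterbi("315666666", states, start, trans, emit))
```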