Algorithms for bioinformatics

--- tags: Master --- # Algorithms for bioinformatics ## Introduction 🖖 **Role of algoz** - procedure that solves a problem - solve `practical problems` - model a phenomenon and simulate - (model mathematically to predict/understand parameters) - `biological system as a computational device` - (evolution is an optimization procedure) **Terms:** - Strings - sequence on alphabet (Bases, Amminoacids?) - Sequence alignment - **Global** - attempts to align the `whole sequences` for best score - **suitable for similar sequences** and with equal length (e.g. _homologous genes_) - `Needleman-Wunsch` algorithm (dynamic programming, more intensive) - **exact alignment** (100% best solution) - **Local** - only covers `part of the sequences` for best score - **suitable for dissimilar sequences** that are suspected to contain regions of similarity (e.g. _non-homologous genes with homologous domains_) - `Smith-Waterman` algorithm - **exact alignment** (100% best solution) ![](https://i.imgur.com/xKBIobe.png) - Turing machines... - there are some good analogies in biology (DNA) ## Global sequence alignment 🌐 Direct comparison of two sequences is insufficient to estabilish the full genetic relationships between two proteins. **Allowance for gaps** increases the number of comparison that can be made. - pairwise alignment - `match` - `mismatch` - `gap` (insertion/deletion) ### Needleman-Wunsch algorithm ([wiki](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm)) `Dynamic programming` algorithm used in bioinformatics to align protein or nucleotide sequences. ![](https://i.imgur.com/Jv00SSX.png) - _Scoring function:_ assigns a **score** to every possible alignment depending of the goal, memorize in `substitution matrices` if assigning different weights on different letter relations, `common weights` are: - $\text{match} =1$ - $\text{mismatch} = -1$ - $\text{gap} = -2$ - _Matrix population:_ row by row, the score of each cell is the best choice given the scores of the left, upper, and upper left cells - diagonal movement represents match/mismatch - horizontal and vertical movements represent gaps - the final score of a cell indicates the best score of the subproblems until this point - _Traceback:_ the best matching solution is given by starting from the lower right cell (score of the whole problem) and following in reverse the best choices at each step - there may be more than one best choice $\implies$ more possible best matching solutions SCORING EXAMPLE: - `Inputs:` - $S_1=$ A C G T T A C - $S_2=$ C G T C A - `Weights:` as above | | Alignment_1 | Alignment_2 | | ----- | ------------------- | ------------------- | | $S_1$ | A C G T T A C | A C G T T A C | | $S_2$ | _ C G T C A _ | C _ G _ T C A | | SCORE | -2+1+1+1-1+1-2 = -1 | -1-2+1-2+1-1-1 = -6 | 👁‍🗨 The first alignment has a higher score, thus it is a better match than the second one. ### Local sequence alignment 🎯 Alignment of the best matchin substrings from both the sequences. ### Smith-Waterman algorithm ([wiki](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)) `Dynamic programming` algorithm similar to the Needleman-Wunsch one. ![](https://i.imgur.com/2OOKfy2.png) (on the Wiki it says $O(mn)$) <img style=" display: block; max-width: 100%; height: auto; margin: auto;" src="https://upload.wikimedia.org/wikipedia/commons/9/92/Smith-Waterman-Algorithm-Example-En.gif"> - _Scoring function:_ - _Matrix population:_ - _Traceback:_ start from the highest value cell, follow the best choices in reverse until reaching zero ## Substitution matrices 📅 (proteins alignment) They permit to have **individual scores** for aligning amino acids sequences. ![](https://i.imgur.com/n4QjGId.png) ### PAM Model **(**`Point-Accepted-Mutation`**) ([wiki](https://en.wikipedia.org/wiki/Point_accepted_mutation))** **Point accepted mutation:** replacement of a single amino-acid in a protein, with another single amino-acid, which is **neither silent or deadly**. - Each **PAM matrix** has $20$ rows and $20$ columns - based on an explicit `evolutionary model` - model of mutation and acceptance based on a `Markov model`: $P_{jk}^{(1)} = m_j \dfrac{A_{jk}}{∑_{h≠j}A_{jh}} \quad \text{with}\ j≠k$ $P_{jj}^{(1)}=1-m_j$ - where: - $m_j$ is the **mutability** of the amino acid $j$ - $A_{jk}$ is the **count** of accepted point mutations between amino acids $i$ and $j$ 👁‍🗨 $\text{PAMn} = P^{(n)}= (P ^{(1)})^n$ - `PAM1` is calculated from comparisons of sequences with **little divergence** - other PAM matrices are `extrapolated` from PAM1 $\implies$ errors in PAM1 scale in PAMn - higher PAM numbers to detect more remote sequence similarities ### BLOSUM **(**`Blocks-Substitution-Matrix`**) ([wiki](https://en.wikipedia.org/wiki/BLOSUM))** Based on **blocks** within protein sequences by clustering them `at a given level of similarity`. Each block has $s$ sequences of $w$ amino-acids in un-gapped multiple alignments. - based on `local alignment` - based on `protein families` - uses larger and more diverse set of protein sequences (**30-90% divergence**) - `errors` arise from alignment errors - lower BLOSUM numbers to detect more remote sequence similarities ![](https://i.imgur.com/K1bbjzq.png) ![](https://i.imgur.com/1n6KA0Q.png) **Where:** - $f_{ij}$ is the `frequency` of a pair of amino acids in the same column - $q_{ij}= \dfrac{f_{ij}}{\sum_{i=1}^{20}\sum_{j=1}^i f_{ij}}$ is the `observed proportion` - $p_i= q_{ii}+\dfrac{\sum_{j \ne i}q_{ij}}{2}$ is the `probability of occurrence` of the amino acid $i$ - $e_{ij}=p_i p_j+p_j p_i=2p_i p_j \quad ,i\ne j$ is the `expected occurrence` of the pair $ij$ - if $i=j$, then $e_{ii} = p_i^2$ - $s_{ij}=log_2 (\frac{q_{ij}}{e_{ij}})$ is the `log odds ratio` - **scores** obtained by $\text{round}(2s_{ij})$ ## Heuristic Alignment Methods 🖇 ⚠`Exact alignment methods`: (Needleman, Waterman, ...) have too much complexity for practical uses. ⚠`Heuristic methods`: sacrifice exact solutions for lower complexity: - FASTA (actually is an `hybrid`) - BLAST and its variations (blastp, blastn, blastx, ...) ### FASTA **Fasta ([wiki](https://en.wikipedia.org/wiki/FASTA))** stands for "FAST-All", because it works with any alphabet. Is an heuristic method for both `global` and `local` sequence alignment in any alphabet. 🌐 Online tool: https://www.ebi.ac.uk/Tools/sss/fasta - **query a sequence** against a database of subject sequences - subject sequences are `filtered in stages` **Based on 4 main ideas:** ![](https://i.imgur.com/NZeXxWQ.png) #### Look-up tables Each **subject sequence** is represented in a **look-up table** with amino-acids in same order as `PAM250`: - the `PAM250` matrix is the most used PAM matrix and represents the mutation probabilities of sequences with $20\%$ of equivalence - **lookup example:** finding the best offset for a good ungapped match ![](https://i.imgur.com/T8IA2tV.png) - **offset** values are used for defining the`best initial regions` - 👁‍🗨once the look-up table is built **the comparison is linear** with the length of the second sequence **Search window size:** `ktup` parameter between $1$ and $6$, determines how many consecutive identical residues are required for a match. - `low` ktup $\implies$ ➕sensitive, ➖selective - $\text{ktup}=2$ is default for `protein` comparisons - `high` ktup $\implies$ ➖sensitive, ➕selective - $4 \le \text{ktup} \le 6$ used for `DNA` comparisons ![](https://i.imgur.com/FkMCC20.png) #### After look-up 1. compute the `scores of the diagonals` (matching segments) using a substitution matrix (e.g. `PAM250`) 2. `discard` the worst subjects 3. compute an `exact alignment` in a limited area: **Smith Watermann** (~$20$ elements) #### FASTA format Single-line description (`identifier`), followed by lines of sequence `data`. 👁‍🗨 not only used in FASTA software, is the dominating format in the context (available in BLAST along many others) ```powershell >MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA DIDGDGQVNYEEFVQMMTAK* >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY ``` ### BLAST **(**`Basic Local Alignment Search Tool`**): ([wiki](https://en.wikipedia.org/wiki/BLAST_(biotechnology)))** 🌐 Online tool: https://blast.ncbi.nlm.nih.gov/Blast.cgi - ➕faster alternative to FASTA without sacrificing much accuracy - uses bigger search window sizes than FASTA $\implies$ ➖ sensitive, ➕selective - gaps between query and target sequences are not allowed **Procedure:** 1. **compare** query to every database entry in order to find `hits` - **word hit:** when segments of given size are similar between query and DB entry 2. **HSP discovery:** for each hit extend both ends until local score falls below given threshold - obtained sequence is a `high-scoring segment pair` 3. **local alignment** around each HSP (`Smith-Waterman` like) ![](https://i.imgur.com/MbAb2fj.png) ### Statistics of sequence similarity scores **Problem:** scores depend on many things... do not depend only on the actual content of the sequences. **Statistics of global sequence comparison:** - under even the simplest random models and scoring systems,` very little is known` about the random distribution of optimal global alignment scores - `Monte Carlo` experiments can provide rough distributional results for some specific scoring systems - but these not easily generalized **Statistics of local sequence comparison:** - fortunately statistics for the scores of `local alignments` (unlike those of global alignments) are well understood - particularly true for `gap-less` local alignments **Statistical significance indicators:** - **E-value:** expected number of random matches at least as good as this one (with at least this alignment score), is the most used - $E = K M N e^{–λ S}$ - $S$ is the raw alignment score - $K$ and $λ$ are constants that depends on the `score matrix` and `gap penalties` used - $M$ and $N$ are the lengths of the query and database sequences - Rules of thumb: - $E < 0.05$: probably related (homologous) - $E < 1$: may be related - $E >= 1$: no statistical significance, but may be biologically significant anyway - Some other simple indicators of significance (less accurate): - Percentage of identical residues - Percentage of similar residues - Bit score (Normalized score) - $S’ = (λ S – \text{ln} K ) / \text{ln} 2$ - Raw alignment score ... ### BLAST specialized programs Describes all the shades of BLAST... ## Multiple sequence alignment 🗿🗿🗿 **Utility:** - indicates ` common conserved residues` in all or most sequences - 👁‍🗨usually important for _function / activity_ - indicates accepted residues in the `different positions` - indicates positions where `gaps` are more likely **Use case:** - basis for construction of `phylogenetic trees` - basis for `sequence motifs and profiles` - essential for `evolutionary studies` and phylogenetics **Some definitions:** The **score** is usually the sum-of-pairs-score or similar: ![](https://i.imgur.com/qzSad9m.png) Relevance of multiple sequence alignment: - **Similarity:** QUANTITATIVE, `degree of likeness` between two sequences, usually expressed as a `percentage` of similar (or identical) residues over a given length of the alignment - ➕➕similarity degree $\implies$ ➕➕homology probability - **Homology:** QUALITATIVE, statement about `common evolutionary ancestry` of two sequences - in theory is a `boolean`, but we can rarely be certain about this, it is therefore used as a hypothesis with its probability **Approaches:** - **Brute force** optimal alignment (very hard) - **Centre-star** alignment (simple, used in PSI-BLAST) - **Progressive** alignment (Clustal-W, T-COFFEE,...) - 💡 **Idea:** transitive property, if protein 1 can be aligned to p2, and p2 to p3, then there must be an alignment that includes all three - ❌whenever a sequence is added, the tree it is not recalculated 1. `pairwise alignment` step $\rightarrow$ distance/similarity matrix - free choice, either local or global, exact or heuristic... - :warning:careful choice: point of **bottleneck** and errors propagation 2. `guide tree` construction - two main methods: 1. **NJ (**Neighbor Joining**)** 2. **UPGMA (**Unweighted Pair Group Method with Arithmetic mean**)** - :warning: not necessarily bound to the actual evolutionary history (`Philogeny reconstruction`) 3. `progressive alignment` 4. return aligned sequences ![](https://i.imgur.com/KybiqEH.png) - **Iterative** alignment (e.g. Muscle) ![](https://i.imgur.com/vE2nxGh.png) ### CLUSTAL-W One of the most commonly used and well-known tools for `multiple sequence alignment`. - most popular `progressive alignment` algorithm - does a type of `global alignment` ![](https://i.imgur.com/gkQ8dbH.png) **Consensus sequence:** holds the most frequent character of the alignment at each column. ### T-COFFEE **T-COFFEE (**`Tree-based Consistency Objective Function for Alignment Evaluation`**):** **Optimizes the CLUSTAL algorithm** by introducing after multiple alignment a complex scoring system that allows you to "learn" hot areas and adjust the alignment based on these, allowing the **"merging" of alignments made with different methods**. - `progressive algorithm` along a Tree - 💡 **Idea:** `consistency` between the pairwise alignment and the overall alignment ![](https://i.imgur.com/BmRqJOu.png) ### Profile Hidden Markov Models Can assume a DNA or protein sequence are Markovian chains (each state depends only on the $k$ previous states). - `standard profile analysis:` uses heuristic methods - vs `HMM profile analysis:` formal probabilistic basis and a consistent theory behind gap and insertion scores - in general, producing good profile HMMs requires less skill and manual intervention than producing good standard profiles **Profile analysis:** a `sequence comparison method` for finding and aligning distantly related sequences. - allows a new sequence to be aligned optimally to a family of similar sequences - **standard profile:** a `table` of position-specific symbol comparison values and gap penalties that represents and MSA Profile HMM **use cases**: - most commonly: compare profile HMMs and sequences - 👁‍🗨 because a profile HMM can serve as a representation of a sequence family or sequence domain - can use a sequence as query in a DB of HMM profiles - can use a profile HMM as query in a DB of sequences - create a multiple alignment of a large number of sequences more quickly than by using standard methods **Random info about profile HMM:** - match states with **position specific emission probabilities** - `deletion states are silent` (they do not emit symbols) - allowing self transitions as in insertions would complicate inference algorithms - can be aligned to a sequence either `globally` (use the whole profile) `or locally`, but the choice is part of the model - ⚠ choice at model built time, not at runtime! - 👁‍🗨 watch out the zero frequency values $\implies$ use smoothing (Laplacian add-1, ...) - 👁‍🗨 `consensus columns:` if prevalence of insertions/deletions in a column, options in that column don't have their own M state ![](https://i.imgur.com/ngQkxBQ.png) ![](https://i.imgur.com/5XFbx7X.png) For multiple-alignment of a `protein family`: - easy to find large portions of ungapped matches - insertions and deletions are rare in general - but more probable in certain positions #### HMMER Use cases: - identify homologous protein or nucleotide sequences - perform sequence alignments (profile HMM creation) 🌐 Online tool: https://www.ebi.ac.uk/Tools/hmmer - `hmmscan:` protein sequence $\rightarrow$ profile HMM DB - `hmmsearch:` profile HMM $\rightarrow$ protein sequence DB - `jackhmmer:` iterative search vs protein sequence DB ## Phylogenetic trees reconstruction 🌲 🌐 Online tool: http://www.phylogeny.fr/index.cgi Phyletic == phylogenetic `OTU`: Operational Taxonomic Unit (group of closely related individuals) **Issue:** phylogenetic tree `reconstruction` from aligned sequences **Phylogenetic tree:** tree structures encoding weighted relations between OTUs - rooted: hypothetic chronological order - un-rooted: simply describe relationships **Molecular phylogeny:** proteins vs nucleotides - **protein** sequences: 👎 - need $20\times 20$ replacement matrices, `very complex` to deal with - are the expression of coding regions only - identical amino acids can be the expression of multiple codons - **nucleotide** sequences: 👍 preferable - they can be described with $4 \times 4$ matrices - can be extracted from non-coding genomic sequences, therefore with higher variation - they have `no degeneration`, `no redundancy`. **Homologous genes:** in search of an evolutionary distance measure - **Paralogous genes:** originating from duplication in same organism - genes of **different colors** are `paralogs` because they are related through the initial gene duplication, regardless of whether they are found in the same or different species - **Orthologous genes:** originating by speciation - genes **sharing a color** are more closely related through the speciation event and are therefore `orthologs` - 👁‍🗨by definition, orthologs can never be found in the same species: orthologous gene trees are a **reflection of the species tree** - 👍 better for Phylogenetic tree construction ![](https://i.imgur.com/wYes54e.png) **Tree construction:** - methodologies: - `clustering` algorithms: based on the observation of calculated genetic distances on multiple alignments - **FM (**`Fitch-Margoliash`**)** - **NJ (**`Neighbor-joining`**)** - `optimization` algorithms: based on objective quality criteria - **Maximum Parsimony** - **Maximum likelihood** - **W/UPMGA (**`Weighted/Unweighted Pair Group Method with Arithmetic mean`**)** - data source: - pre-calculated `genetic distances` (shorter calculation times) - `homologous` multi-aligned sequences (much longer calculation times) ### Distance-based Methods (clustering) Neighbor-Joining uses Fitch-Margoliash in its original formulation. **Genetic distance:** calculable objective parameter - for conserved `nucleic acids`: $d = \frac{\#replacements}{length}$ - for less conserved (> 25%) `nucleic acids`: $d’= - \frac{3}{4} log⁡(1-\frac{4}{3} d)$ - Jukes-Cantor correction - for `proteins`: $d =(\text{Sobs} – \text{Srand})/( \text{Smax} - \text{Srand})$ - approximate formula - $\text{Sobs}$: score of the alignment - $\text{Smax}$: average of the scores of the alignments of all proteins with themselves - $\text{Srand}$: score expected for sequences of the same length and composition ![](https://i.imgur.com/2YDl7lM.png) #### Fitch-Margoliash Uses a **weighted least squares** method for clustering based on genetic distance. (?) Given a `matrix of distances` and a tree with a `known topology`: With $N = 3$: ![](https://i.imgur.com/fqntt5u.png) With $N > 3$: 1. look for the `closest OTUs` 2. calculate their `average distance` from the other OTUs 3. use this to calculate the `length of the branches` that join the OTUs close to their node 4. `join` neighboring OTUs - use as distances the average of the distances they had compared to the other OTUs in the tree 5. `update matrix` of distances 6. $\text{GOTO} \ 1$ ![](https://i.imgur.com/mCj30g6.png) #### Neighbor-Joining A `quick method` to solve the topology of trees. ([Wiki](https://en.wikipedia.org/wiki/Neighbor_joining)) - assumption: `evolutions with different speeds` Procedure: 1. $\text{START}$ from a `star tree` and a `distance matrix` 2. compute initial score $So$ (average distance from initial point) 3. $\forall$ `couple` of OTUs in the star: 1. the couple is extracted from the tree (e.g. AB) 2. all branches are calculated with $\text{FM}$ algorithm 3. the branches are summed up 4. get a score $S_{AB}$ 4. `join` the couple with the minimal score 5. `update matrix` of distances 6. $\text{GOTO} \ 1$ ...details... ### Optimization Methods #### Maximum Parsimony Principle 💡 **idea:** _"keep it simple"_, use simplest explanation for tree (`fewest evolutionary changes`). - belief that selection was more likely to discover the utility of a mutation in two closely related lines of descent, however `neutral mutations` exist - score $=$ number of mutations - 👁‍🗨 does not give branch length, only branch order - good for similar sequences (e.g. protein families) **Molecular clocks:** some regions of DNA appear to evolve at `constant rates`. PROCEDURE: (is this related to the maximum parsimony principle?) 1. **Preliminary reconstruction** from `assumed topology` and OTU assignments: ![](https://i.imgur.com/2cYoK5T.png) - ⚠ nodes are sets! (not sequences) - **Rules:** from each node look at descendants - if $\cap \ne \emptyset$, do $\cap$ - else do $\cup$ 2. **Final conversion:** use rules of ambiguity to `correct preliminary nodal sets` to final ones - 👁‍🗨root node is always final ![](https://i.imgur.com/IP51QPG.png) - rule of **diminished** ambiguity: if preliminary nodal set `contains his father` $\rightarrow$ remove extra elements - rule of **expanded** ambiguity: if preliminary nodal set is created by `union of his childs` $\rightarrow$ add all extra elements of his father - rule of **encompassing** ambiguity: add all extra elements that his `father and one of the child` have - 👁‍🗨 rules to apply in a specific procedure, not described here 3. **Linkage finding:** sequence is limited to the segments shown ![](https://i.imgur.com/GTvwsXi.png) - arrow-heads denote nucleotide replacement - **Rules:** - from $N_i$, `if` $\exists N_i$ descendant $\implies$ **obligatory** $N_i \rightarrow N_i$ and $(N_i) \rightarrow N_i$ - `else` $\implies$ can do every linkage **except** $N_i \rightarrow (N_j)$ and $(N_i) \rightarrow (N_j)$ #### Maximum Likelihood and Complements 💡 **idea:** `given certain rules` about how DNA changes over time, a tree can be found that reflects the most likely sequence of evolutionary events. **Pulley principle:** position of the root does not affect likelihood. ... random things... he passed to mentioning hierarchical cluster?! (WPGMA and UPGMA) ... linkage types between clusters... Ok, WPGMA and UPGMA are not explained. Some mentions of bayesian methods... he must be desperate by mentioning so many things. :sunny: FIN :sunny: