Lecture 23: Assembly, Projects, next set of lectures

--- tags: BMMB554-23 --- # Lecture 23: Assembly, Projects, next set of lectures ## Wrapping up assembly <iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSOWjQT9zSrrPQmmBZtLtGG-FmONWFlqa_Dc5Y2AqeGOM-CfFb9W8EV1GBYzMmx-0T4649oLWqkuL88/embed?start=false&loop=false&delayms=3000" frameborder="0" width="683" height="541" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe> ## Projects ### Logistics After all we will have **individual** projects. Here are the logistical details. 1. Pick a project from the list below. If you want, you can propose your own project as well. 2. Next class (April 11th) I will circulate a sign-up sheet where you will: a. Indicate what project you decided to pick b. Reserve a time to meet with me to discuss details (if necessary - this is optional) ### List of potential projects 1. Analyze several datasets from the latest [1,000 genomes set](https://www.sciencedirect.com/science/article/pii/S0092867422009916) against [T2T reference](https://www.science.org/doi/10.1126/science.abj6987) using [DeepVarinat caller](https://www.nature.com/articles/nbt.4235). 2. Pick a biologically interesting protein and try to sovle its structure using [AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) 3. Perform a simulation to estimate mutational target size --- estimate the chances of parallel mutations occuring by chance in *E. coli* genome. There is an existing R script here that needs to be modified. Develop a RStudio/Jupyter/Observable notebook for performing and visalizing this analysis. 4. Develop a Jupyter notebook for the analysis of [Genome/Plasmid evolutionary trajectories](https://www.biorxiv.org/content/10.1101/2022.03.17.484700v1) from tag counts data. 5. Pick any tutorial from [GTN](https://training.galaxyproject.org/) that is close to the analyses you are doing and use it to interpret your own data. ## Bacterial assembly ## DeBruijn Assembler: SPAdes In this example we have two types of data: (1) high quality high coverage but relatively short (250 bp) Illumina data and (2) low quality low coverage but long Oxford Nanopore (ONP) data. This would allow us to perform so called Hybrid assembly. In hybrid assembly long low quality reads produced by ONP will allow bridge contigs assembled from Illumina reads into a fully assembled bacterial genome. Currently, one of the best performing tools for hybrid assembly is [hybridSPAdes](http://bioinformatics.oxfordjournals.org/content/early/2015/11/20/bioinformatics.btv688.full.pdf+html), a part of [SPAdes family](http://cab.spbu.ru/software/spades/) of genome, metagenome, and transcriptome assemblers. ### Key SPAdes innovations Before briefly explaining some key aspects of the SPAdes assembler we should mention one practical issue. In the previous section we assumed that sequencing reads have identical lengths. In reality this is almost never the case. Thus we cannot assume that reads and *k*-mers are the same thing. Instead, we derive *k*-mers from the reads by breaking them into pieces: ``` atgcgt <- read atgc tgcg <- its k-mers of length 4 gcgt ``` ### Multisized deBruijn graph In the previous lecture we have been constructing graphs for *k*-mers of a fixed size. We have noted that when *k* is small it is difficult to resolve the repeats. If *k* is too high a corresponding graph may become fragments (especially if read coverage is low). SPAdes uses several values for *k* (that are either manually set or inferred automatically) to create a *multisized* graph that minimized tangledness and fragmentation by combining various *k*-mers: ----- ![](https://i.imgur.com/wYYJKh8.jpg) **Multisized de Bruijn graph**. A circular Genome CATCAGATAGGA is covered by a set Reads consisting of nine 4-mers, {ACAT, CATC, ATCA, TCAG, CAGA, AGAT, GATA, TAGG, GGAC}. Three out of 12 possible 4-mers from Genome are missing from Reads (namely {ATAG,AGGA,GACA}), but all 3-mers from Genome are present in Reads. (A) The outside circle shows a separate black edge for each 3-mer from Reads. Dotted red lines indicate vertices that will be glued. The inner circle shows the result of applying some of the glues. (B) The graph DB(Reads, 3) resulting from all the glues is tangled. The three h-paths of length 2 in this graph (shown in blue) correspond to h-reads ATAG, AGGA, and GACA. Thus Reads3,4 contains all 4-mers from Genome. (C) The outside circle shows a separate edge for each of the nine 4-mer reads. The next inner circle shows the graph DB(Reads, 4), and the innermost circle represents the Genome. The graph DB(Reads, 4) is fragmented into 3 connected components. (D) The multisized de Bruijn graph DB(Reads, 3, 4). (From [Bankevich:2012](http://online.liebertpub.com/doi/full/10.1089/cmb.2012.0021)). ----- ### Read pair utilization While the use of paired reads and mate pairs is not new (and key) to genome assembly, SPAdes utilizes so called paired DeBruin graphs to take the advantage of the paired end data. One of the key issues with paired DeBruin graphs is the resulting genome assemblies do not tolerate variability in insert sizes (the initial formulation of paired DeBruijn graphs assumed constant distance between pairs of reads). In practice this distance is always variable. SPAdes performs *k*-bimer (these are *k*-mers derived from *paired* reads) adjustment to identify exact of nearly-exact distances for each *k*-bimer pair: ---- ![](https://i.imgur.com/Tjpg0G9.jpg) **k-bimer adjustment**. **A**. Bireads are decomposed into pairs of k-mers with estimated genomic distances (B-transformation). These are tabulated into histograms of estimated genomic distances between pairs of h-edges (H-transformation), and peaks in the histograms and paths in the graph are used to reveal the actual genomic distances between h-edges (A-transformation). This may be converted back to genomic distances between k-mers on pairs of h-paths (E-transformation, used for presentation purposes but not needed in the implementation). **B**. The h-biedge histogram (α|β,*) corresponding to the exact h-biedge (α|β, 72163) in the assembly graph. path(α) is an h-path (condensed edge representing 72049 edges) in the upper right, and path(β) is an h-path (representing 46097 edges) at the lower left. The histogram collects all distance estimates between α and β derived from bireads. The h-biedge histogram was smoothed using the Fast Fourier Transform (red curve). The peak in the smoothed histogram (marked red) well approximates the actual distance (marked blue). **C**. The h-biedge histogram (α|β,*) estimates the distance between h-edges α and β (|path(α)| = 46054, |path(β)| = 72). Because of the directed cycle formed by the two h-paths of lengths 72 and 13, there may be multiple walks through the graph between α and β. The h-biedge histogram has been divided into clusters with centers at 46060 and 46145. Thus SPAdes transforms the entire histogram into two h-biedges: (α|β, 46054) and (α|β, 46139). (From [Bankevich:2012](http://online.liebertpub.com/doi/full/10.1089/cmb.2012.0021)). ---- ### Error correction Sequencing data contains a substantial number of sequencing errors that manifest themselves as deviations (bulges and non-connected components) within the assembly graph. One of the ways to improve the graph even constructing it is to minimize the amount sequencing errors by performing error correction. SPAdes uses [BayesHammer](https://goo.gl/1iGkMe) to correct the reads. Here is a brief summary of what it does: 1. SPAdes (or rather BayesHammer) counts *k*-mers in reads and computed *k*-mer statistics that takes into account base quality values. 2. [Hamming graph](https://en.wikipedia.org/wiki/Hamming_graph) is constructed for *k*-mers is which *k*-mers are nodes. In this graph edges connect nodes (*k*-mers) is they differ from each other by a number of nucleotides up to a certain threshold (the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance)). The graph is central to the error correction algorithm. 3. At this step Bayesian subclustering of the graph produced in the previous step. For each *k*-mer we now know the center of its subcluster. 4. **Solid** *k*-mers are derived from cluster centers and are assumed to be *error free*. 5. Solid *k*-mers are mapped back to the reads. 6. Reads are corrected using solid *k*-mers: ----- ![](https://i.imgur.com/xBLilqZ.png) **Read correction**. Black *k*-mers are solid. Grey k-mers are non-solid. Red k-mers are the centers of the corresponding clusters (two grey k-mers striked through on the right are non-solid singletons). As a result, one nucleotide is changed based on majority rule. (From [Nikolenko:2013](https://goo.gl/1iGkMe)). ---- ### Hybrid assembly SPAdes first construct an assembly graph from high quality high coverage short read data and then utilizes long reads generated with PacBio, Oxford Nanopore, or Illumina's TruSeq Synthetic Long-Read technology to close assembly gaps and to resolve repetitive regions. To close the gaps it finds long reads spanning a coverage gap in the assembly graph. The challenge with long reads (particularly from PacBio and Oxford Nanopore) is their high error rate. To alleviate this issues SPAdes looks for a collection of long reads spanning a given gap and computed a consensus. ----- ![](https://i.imgur.com/xuAyeDp.png) **Repeat resolution**. This picture represents an example of a collapsed repeat. Somewhere in the genome there are three distinct regions (blue, green, an red) containing a repeat (black). Because black region is highly similar (or identical) among these three regions it is collapsed on the graph as shown here. If we align long reads to this graph we will be able to disentangle this. For example, long reads (dashed lines) span red edges allowing us to partially uncollapse this assembly. (From [Antipov:2015](http://bioinformatics.oxfordjournals.org/content/early/2015/11/20/bioinformatics.btv688.full.pdf+html)) [![](https://i.imgur.com/eKyqEtH.png)](https://training.galaxyproject.org/training-material/topics/assembly/tutorials/assembly-with-preprocessing/tutorial.html)