ACEIDHA-SV: Quality control and genome assembly of long-read data

<center><img src="https://i.imgur.com/rPIZUIq.png" alt="drawing" width="700"/></center>

# ACEIDHA-SV: Quality control and genome assembly of long-read data

In this hands-on exercise, you will work on sequence data using the Galaxy EU platform.

> For this course, use the link **https://usegalaxy.eu/join-training/aceidha_bioinf22**. This gives us dedicated computing power for our workshop.
> After the course, your data and the server are accessible at **https://usegalaxy.eu**.

**The objectives of this exercise are:**

1. Quality control and assembly of this dataset's short-read data using the workflow you made in the [last exercise](https://hackmd.io/FWE3m3jAQOeJj8sSl2L5KQ?both)
2. Quality control and assembly of long-read data
3. Hybrid assembly of the isolates that have both short- and long-read data available

## The isolates

**Background isolates:**
We will now analyse genomes of the bacteria *Escherichia coli*, *Klebsiella pneumoniae* and *Serratia fonticola*. The genomes were generated in a conjugation experiment (Figure 1):

* Two donors, *Serratia* **A111** and *Klebsiella* **A177** (ESBL-producing)
* Two recipients, *Escherichia coli* DH5α

To document transfer of the plasmid, both donors and transconjugants were sequenced. Donors were sequenced with both short- and long-read sequencing because the *location* and genomic arrangement of the genes were important (not just presence/absence). Recipients were sequenced using long reads only.

The short-read data were generated using paired-end (PE) short-read technology on the Illumina NovaSeq platform (see [video](https://youtu.be/fCd6B5HRaZ8)), and the long-read data were generated using [Oxford Nanopore technology on the MinION platform](https://nanoporetech.com/applications/dna-nanopore-sequencing) (ONT).

<center>
Figure 1 (Abe et al, 2020)
<img src="https://i.imgur.com/1uI6RzR.png" alt="drawing" width="700"/></center>

**I.** Import the data from this [History](https://usegalaxy.eu/u/allarena/h/importedakl-aceidha-conjugation)

##### Overview of donors:

**A177**: ESBL *Klebsiella pneumoniae*

* Short-read data: *A177_R1.fq.gz* and *A177_R2.fq.gz*
* Long-read data: *A177_ONT.fastq*

**A111**: ESBL *Serratia*

* Short-read data: *A111_R1.fq.gz* and *A111_R2.fq.gz*
* Long-read data: *A111_ONT.fastq*

##### Transconjugants:

* DH5α that received the A177 plasmid: *E.coli_A177_ONT.fastq*
* DH5α that received the A111 plasmid: *E.coli_A111_ONT.fastq*

> #### TASK:
> * Inspect the different files by clicking on the "eye" icon. What format are the sequences in?

### Short-read data handling

**II.** In the *Campylobacter* exercise, you created a workflow for quality control and assembly of short-read data. Run that workflow on the short-read data for A111 and A177:

* *A111_R1.fq.gz*, *A111_R2.fq.gz*
* *A177_R1.fq.gz*, *A177_R2.fq.gz*

If you don't have a workflow, import it from here: https://usegalaxy.eu/u/allarena/w/reads-genomes-sr

### Quality control of long-read data

Before doing any assembly, we must know whether the reads are of sufficient quality, so we start with quality control. We will use [NanoPlot](https://github.com/wdecoster/NanoPlot), which generates a summary and plots of the data statistics. NanoPlot is mainly used for long-read data, such as ONT and PacBio.

**III.** Since you have now used some tools in Galaxy, find NanoPlot yourself and analyse all four isolates that were sequenced using long-read technology (DH5_A111, DH5_A177, A177, A111). Inspect the outputs when done.

> #### TASK (look at `Nanostats`)
>
> * What is the average read length for the four isolates?
> * Enterobacteriaceae genomes are about 5 Mbp long. What is the average coverage (pick one isolate)?
> * What is the quality like?
> * Is there any correlation between read length and quality?

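
The coverage question in the task above comes down to simple arithmetic: the total number of sequenced bases divided by the genome size. Below is a minimal Python sketch of that calculation, assuming an uncompressed FASTQ with four lines per record. The function names and the toy reads are hypothetical illustrations and are not part of NanoPlot or Galaxy.

```python
import io


def read_lengths(fastq_handle):
    """Return the length of every read in a 4-line-per-record FASTQ."""
    lengths = []
    for line_number, line in enumerate(fastq_handle):
        # The second line of each 4-line record (index 1, 5, 9, ...) is the sequence.
        if line_number % 4 == 1:
            lengths.append(len(line.strip()))
    return lengths


def mean_read_length(lengths):
    """Average read length in bases."""
    return sum(lengths) / len(lengths)


def average_coverage(lengths, genome_size_bp):
    """Total sequenced bases divided by the (approximate) genome size."""
    return sum(lengths) / genome_size_bp


# Toy data (hypothetical): two reads of 10 and 6 bases and a 4 bp "genome".
toy_fastq = io.StringIO(
    "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read2\nACGTAC\n+\nIIIIII\n"
)
lengths = read_lengths(toy_fastq)
print(mean_read_length(lengths))                    # 8.0
print(average_coverage(lengths, genome_size_bp=4))  # 4.0
```

For a real dataset you would pass an opened file such as `open("A177_ONT.fastq")` and use `genome_size_bp=5_000_000`, the approximate genome size given in the task. The result should be close to what you get from the total bases reported in `Nanostats`.
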
- [ ] **DO: Rename files so you don't lose track of what is what!**

Depending on the analysis, a certain minimum read quality or length may be needed. The reads can be filtered using the tool [Filtlong](https://github.com/rrwick/Filtlong). In this training, all reads shorter than 1000 bp will be filtered out.

**V.** Filtering: Run the Filtlong [Tool](https://toolshed.g2.bx.psu.edu/repos/iuc/filtlong/filtlong/0.2.0) with the following parameters:

* `Input FASTQ`: <your four different ONT files> (imported)
* `Output thresholds`; `Min. length`: 1000

- [ ] **Rename the files to `<name>_filtered`**
- [ ] **VI.** Move all filtered ONT files and paired-end short reads to a new history. Call this new history something like "ONT_assembly". You can find a step-by-step guide in Figure 4. Do not continue before you have made the new history and moved the files.

Figure 4
<iframe src="https://scribehow.com/embed/Usegalaxy_Workflow__m_LnMr11TDek6CTvjKeP0A" width="640" height="640" allowfullscreen frameborder="0"></iframe>

## Assembly

https://usegalaxy.eu/u/allarena/h/unnamed-history

Once the quality of the reads has been determined and the data have been filtered (as we did with Filtlong) and/or trimmed (as is more often done with short-read data), an assembly can be made. There are many tools that create assemblies from long-read data, but in this tutorial [Unicycler](https://github.com/rrwick/Unicycler) (see also the publication by [Wick et al. 2017](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005595)) will be used. Unicycler is a de novo assembler for paired-end reads, single-molecule (long) reads and hybrid assembly, but it is specifically built for hybrid assembly, that is, using both short- and long-read sequencing data of small (e.g., bacterial, viral, organellar) genomes. Unicycler outputs the assembly as a FASTA file.

We will use Unicycler to assemble the filtered ONT reads for A111, A177 and the *E. coli* transconjugants.
In addition, A111 and A177 will be assembled using both short reads and ONT reads (hybrid assembly). In total, A177 and A111 will each be assembled three times:

1. Just ONT reads (remember to leave the )
2. ONT reads and short reads (hybrid assembly)
3. Short-read assembly from step **II.**

At the end of the exercise (tomorrow), we will compare the assemblies obtained from ONT reads, short reads and the hybrid approach.

**VIII.** Find Unicycler in the tool panel on the left. Make four ONT-only assemblies, one for each of the ONT files from BC10_E.colidh5_A177, BC8_E.colidh5_A111, A111 and A177. At the same time, start the hybrid assemblies with the following parameters:

* `“Paired or Single end data?”` to `Paired` for hybrid and `none` for ONT-only assembly
* `“First Set of reads”` to the forward reads file R1 from A111 or A177
* `“Second Set of reads”` to the reverse reads file R2 from A111 or A177
* `“Long reads”` to the ONT file for either A111 or A177

Use default values for all other parameters.

##### Assembly takes time

There is no such thing as assembly in real time. It takes a while, so this is a good moment to have lunch or at least a coffee, or to leave it running until tomorrow. This Unicycler run will take anywhere between 90 minutes and two hours.