
<center><img src="https://i.imgur.com/rPIZUIq.png" alt="drawing" width="700"/></center>

# ACEIDHA: Quality control and genome assembly of long-read data

In this hands-on exercise, you will work on sequence data using the Galaxy EU platform.

> For this course, use the **https://usegalaxy.eu/join-training/aceidha_bioinf22** link. This gives us dedicated computing power for the workshop.
>
> After the course, your data and the server are accessible at **https://usegalaxy.eu**.

**The objectives of this exercise are:**

1. Quality control and assembly of this dataset's short-read data using the workflow you made in the [last exercise](https://hackmd.io/FWE3m3jAQOeJj8sSl2L5KQ?both)
2. Quality control and assembly of long-read data
3. Hybrid assembly of the isolates that have both short- and long-read data available

## The isolates

**Background isolates:**

> We will now analyse genomes of the bacteria *Escherichia coli*, *Klebsiella pneumoniae* and *Serratia fonticola*. The genomes come from a conjugation experiment (Figure 1): two donors, *Serratia* A111 and *Klebsiella* A177, were inoculated with a recipient, *Escherichia coli* DH5α. Both donors were assumed to be extended-spectrum beta-lactamase (ESBL) producing bacteria, with the trait located on a mobile genetic element, a plasmid. The recipient was only resistant to nalidixic acid (Nal_R). The strains were inoculated together to observe whether the recipient became ESBL-producing after the experiment, which would indicate horizontal transfer of the plasmid. To document transfer of the plasmid, both donors and transconjugants were sequenced using short- and long-read sequencing, because the *location* and genomic arrangement of the genes were important (not just presence/absence).
> The short-read data were generated using paired-end (PE) short-read technology on the Illumina NovaSeq platform (see [video](https://youtu.be/fCd6B5HRaZ8)), and the long-read data were generated using [Oxford Nanopore technology on the MinION platform](https://nanoporetech.com/applications/dna-nanopore-sequencing) (ONT).

<center>
Figure 1 (Abe et al, 2020)
<img src="https://i.imgur.com/1uI6RzR.png" alt="drawing" width="700"/></center>

Nanopore sequencing has several properties that make it well-suited for our purposes:

* Long-read sequencing technology offers simplified and less ambiguous genome assembly
* Long-read sequencing gives the ability to span repetitive genomic regions
* Long-read sequencing makes it possible to identify large structural variations

###### When using Oxford Nanopore Technologies (ONT) sequencing, the change in electrical current is measured over the membrane of a flow cell. When nucleotides pass through the pores in the flow cell, the change in current is translated (basecalled) to nucleotides by a basecaller. A schematic overview is given in the picture below.

###### When sequencing using a MinIT or MinION Mk1C, the basecalling software is present on the device. During basecalling, the electrical signals are translated to bases (A, T, G, C) with a quality score per base. Each sequenced DNA strand is basecalled and forms one read. Multiple reads are stored in a fastq file.

<center>
Figure 2 (Abe et al, 2020)
<img src="https://i.imgur.com/JAcZCKw.jpg" alt="drawing" width="700"/></center>

Some isolates were sequenced only with Nanopore (*E. coli* recipient), while the donors were sequenced using both Illumina and Nanopore technology.
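To make the "quality score per base" idea concrete, here is a minimal Python sketch of how the quality line of one fastq record can be decoded. The read shown is invented for illustration, and the sketch assumes the Sanger encoding (ASCII offset 33), which is what the `fastqsanger` format uses. The per-read mean is computed by averaging error probabilities rather than the scores themselves, which is one common convention and not necessarily what every tool does.

```python
import math


def decode_quality(quality_string, offset=33):
    """Turn the quality line of a fastq record into Phred scores.

    Assumes Sanger encoding: Phred score = ASCII code - 33.
    """
    return [ord(ch) - offset for ch in quality_string]


def phred_to_error_prob(q):
    """Phred score Q means an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_read_quality(scores):
    """Average the error probabilities, then convert back to the Phred scale."""
    mean_error = sum(phred_to_error_prob(q) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)


# A made-up fastq record: header, sequence, separator, quality line.
record = ["@example_read", "ACGTAC", "+", "IIII+5"]

scores = decode_quality(record[3])
print(scores)
print(round(mean_read_quality(scores), 1))
```

Note that two low-quality bases pull the mean read quality well below the plain arithmetic mean of the scores (about 31.7 for this toy read), because the average is taken over error probabilities.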
**I.** Import the data from this [History](https://usegalaxy.no/u/6feb7210142f4721b3766892afde6d40/h/conjugation-reads) or from your local hard drive (downloaded from https://drive.google.com/drive/folders/1eo4bVeXRlEv-hmiWTdCNHyGB6Iigr1R8?usp=sharing) (see former HackMD).

##### Overview of donors:

A177: ESBL *Klebsiella pneumoniae*

* Short-read data: `A177_R1.fq.gz` and `A177_R2.fq.gz`
* Long-read data: `merged_pass_reads_BC9_A177_nanopore.txt`

A111: ESBL *Serratia*

* Short-read data: `A111_R1.fq.gz` and `A111_R2.fq.gz`
* Long-read data: `merged_pass_reads_A111_nanopore.txt`

##### Transconjugants:

DH5 that received A177:

* Long-read data: `merged_pass_reads_BC10_E.colidh5_A177_nanopore.txt`

DH5 that received A111:

* Long-read data: `merged_pass_reads_BC8_E.colidh5_A111.txt`

Notice that the transconjugants were only sequenced using Nanopore technology.

> #### TASK:
> * Inspect the different files - what format are the sequences in?
> * Why are there two files for each short-read sample and just one for the long-read sequencing?

**IMPORTANT**
> - [ ] **For downstream analysis the files need to be in `fastqsanger` format, so if they were not imported like that, change the format now.**

### Shortread data handling

**II.** In the *Campylobacter* exercise, you created a workflow for quality control and assembly of short-read data. Run that workflow on the short-read data for A111 and A177.

### Quality control of longread data

Before doing any assembly, the first questions you should ask about your input reads include:

* What is the coverage of my genome?
* How good are my reads?
* Do I need to ask for (or perform) a new sequencing run?
* Is it suitable for the analysis I need to do?

We therefore do quality control. In a fastq file, each base has its own quality (Phred) score, represented by a symbol. Assessing the quality by hand would be too much work. That is why tools like [NanoPlot](https://github.com/wdecoster/NanoPlot) or FastQC exist; they generate a summary and plots of the data statistics.
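The coverage and read-length questions above come down to simple arithmetic on the read lengths. The following is a rough Python sketch, not a replacement for a QC tool: the read lengths are invented, and the 5 Mbp genome size is an assumed round figure for these bacteria.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_of_bases:
            return length
    raise ValueError("n50 needs at least one read")


def estimated_coverage(lengths, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return sum(lengths) / genome_size


# Invented read lengths (bp), for illustration only.
read_lengths = [8000, 5000, 3000, 2000, 1000, 1000]
ASSUMED_GENOME_SIZE = 5_000_000  # assumption: roughly 5 Mbp

print("reads:", len(read_lengths))
print("mean length:", sum(read_lengths) / len(read_lengths))
print("N50:", n50(read_lengths))
print("coverage:", estimated_coverage(read_lengths, ASSUMED_GENOME_SIZE))
```

As a sanity check on the formula: a run that yields 250 Mbp in total over a 5 Mbp genome corresponds to an average coverage of 50x.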
NanoPlot is mainly used for long-read data, such as ONT and PacBio, and FastQC for short-read data, such as Illumina and Sanger.

> [NanoPlot](https://github.com/wdecoster/NanoPlot) modules include:
>
> Summary statistics
> * Mean / Median / N50 read length
> * Mean / Median read quality (Qx is the average per-base error probability, expressed on the logarithmic Phred scale: Q10 corresponds to a 10% error probability, Q20 to 1%, etc.)
> * Number of reads
> * Total number of bases generated
>
> Plots (depends on parameters)
> * Histogram of read lengths
> * Yield by length
> * Read length vs. average read quality

Depending on the analysis, a certain minimum read quality or length may be required. The reads can be filtered using the tool [Filtlong](https://github.com/rrwick/Filtlong). In this training, all reads shorter than 1000 bp will be filtered out.

**III.** Since you have already used some tools in Galaxy, find NanoPlot yourself and analyze all four isolates that were sequenced using long-read technology (DH5_A111, DH5_A177, A177, A111). Inspect the outputs when done.

> #### TASK (look at `Nanostats`)
>
> * What is the average read length for the four isolates?
> * Enterobacteriaceae genomes are about 5 Mbp long - what is the average coverage (pick one isolate)?
> * What is the quality like?
> * Is there any correlation between read length and quality?

- [ ] **DO: Rename files so you don't lose track of what is what!**

For the next exercise, we need to groom the fastq files to the correct format (Filtlong is picky).

**IV.** Use [FASTQ Groomer](https://doi.org/10.1093%2Fbioinformatics%2Fbtq281). Use all reads as input.
Figure 3
<iframe src="https://scribehow.com/embed/FASTQGroomer___SgdMU2kTTC51TdXljHuSg" width="640" height="640" allowfullscreen frameborder="0"></iframe>

- [ ] **Remember to rename the output files correctly: `<name>_groomed`.**

**V.** Trimming: Run the Filtlong [Tool](https://toolshed.g2.bx.psu.edu/repos/iuc/filtlong/filtlong/0.2.0) with the following parameters:

* `Input FASTQ`: your four different ONT files (imported)
* `Output thresholds`; `Min. length`: 1000

- [ ] **Rename the files to `<name>_filtered`**

**VI.** Rerun NanoPlot on the filtered reads. We will now use `Scratchbook` to compare.

> If you would like to view two or more datasets at once, you can use the `Scratchbook` feature in Galaxy:
>
> 1. Click on the Scratchbook icon on the top menu bar. You should now see a little checkmark on the icon
> 1. View a dataset by clicking on the eye icon
> 1. You should see the output in a window overlaid on Galaxy
> 1. You can resize this window by dragging the bottom-right corner
> 1. Click outside the file to exit the Scratchbook
> 1. View a second dataset from your history by clicking on its eye icon
> 1. You should now see a second window with the new dataset
> 1. This makes it easier to compare the two outputs
>
> Repeat this for as many files as you would like to compare. You can turn off the Scratchbook by clicking on the icon again.

***ANSWER***

1. What is the increase in your median read length?
1. What is the decrease in total bases?
1. What is coverage?
1. What would be the coverage before and after trimming, based on a genome size of 5 Mbp?

- [ ] **VII.** Move all filtered ONT files and paired-end short reads to a new history. Call this new history something like "ONT_assembly". You can find a step-by-step guide in Figure 4. Do not continue before you have made the new history and moved the files.
Figure 4
<iframe src="https://scribehow.com/embed/Usegalaxy_Workflow__m_LnMr11TDek6CTvjKeP0A" width="640" height="640" allowfullscreen frameborder="0"></iframe>

## Assembly

Once the quality of the reads has been determined and the data have been filtered (as we did with Filtlong) and/or trimmed (as is more often done with short-read data), an assembly can be made. There are many tools that create assemblies from long-read data, but in this tutorial [Unicycler](https://github.com/rrwick/Unicycler) (see also the publication by [Wick et al. 2017](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005595)) will be used. Unicycler is a de novo assembler for paired-end reads, single-molecule (long) reads and hybrid assembly, but it is specifically built for hybrid assembly (that is, using both short- and long-read sequencing data) of small (e.g., bacterial, viral, organellar) genomes. Unicycler employs a multi-step process that utilizes a number of software tools:

###### Figure 5: Simplified view of the Unicycler assembly process (From Wick et al. 2017). In short, Unicycler uses SPAdes to produce an assembly graph, which is then bridged (simplified) using long reads to produce the longest possible set of contigs. These are then polished by aligning the original short reads against the contigs and feeding these alignments to Pilon - an assembly improvement tool.

![](https://i.imgur.com/GjHh2iI.jpg)

For ONT data, the Unicycler assembly is based on finding overlaps between reads of variable length with a high error tolerance; for short-read data, it is based on a de Bruijn graph. Unicycler will output the assembly in a fasta file.

We will use Unicycler to assemble the filtered ONT reads for A111, A177 and the *E. coli* transconjugants. In addition, A111 and A177 will be assembled using both short and ONT reads (hybrid). In total, A177 and A111 will be assembled three times:

1. Just ONT reads (remember to leave the short-read input set to `None`, see step **VIII.**)
2. ONT reads and short reads (hybrid assembly)
3. Short-read assembly from step **II.**

At the end of the exercise (tomorrow), we will compare the assemblies achieved by ONT, short-read and hybrid.

**VIII.** Find Unicycler in the tool panel on the left. Make four ONT-only assemblies, one for each of the ONT read sets from BC10_E.colidh5_A177, BC8_E.colidh5_A111, A111 and A177. At the same time, start the hybrid assemblies with the following parameters:

* `“Paired or Single end data?”` to `Paired` for the hybrid assembly and `None` for the ONT-only assembly
* `“First Set of reads”` to the forward reads file R1 from A111 or A177
* `“Second Set of reads”` to the reverse reads file R2 from A111 or A177
* `“Long reads”` to the ONT file for either A111 or A177

Use default parameters for everything else.

##### Assembly takes time

There is no such thing as assembly in real time. It takes time, so this is a good moment to have lunch or at least coffee, or to leave it running until tomorrow. This Unicycler run will take anywhere between 90 minutes and two hours.
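For the comparison of the ONT, short-read and hybrid assemblies, a few basic numbers per fasta file (number of contigs, total length, largest contig, N50) already say a lot. Below is a minimal Python sketch of how such a summary could be computed. The fasta text is a tiny invented example, the function names are hypothetical, and this is only a rough illustration rather than a dedicated assembly evaluation tool.

```python
def contig_lengths(fasta_text):
    """Return the length of every sequence in a fasta-formatted string."""
    lengths = []
    current = None  # length of the contig being read; None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize_assembly(fasta_text):
    """Contig count, total length, largest contig and N50 for one assembly."""
    lengths = sorted(contig_lengths(fasta_text), reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(lengths),
        "total_length": total,
        "largest": lengths[0] if lengths else 0,
        "n50": n50,
    }


# Invented toy assembly; a real one would be read from the Unicycler fasta output.
toy_assembly = ">contig_1\nACGTACGTAC\nACGT\n>contig_2\nACGTAC\n>contig_3\nAC\n"
print(summarize_assembly(toy_assembly))
```

Fewer, longer contigs (ideally one per chromosome or plasmid) generally indicate a more complete assembly, which is what a hybrid or long-read assembly is expected to improve compared with short reads alone.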