<center><img src="https://i.imgur.com/rPIZUIq.png" alt="drawing" width="700"/></center>
# ACEIDHA-SV: Quality control and genome assembly of long-read data
In this hands-on exercise, you will work on sequence data using the Galaxy EU platform.
> For this course, use the link **https://usegalaxy.eu/join-training/aceidha_bioinf22**. This gives us dedicated computing power for our workshop.
> After the course, your data and the server are accessible at **https://usegalaxy.eu**.
**The objectives of this exercise are:**
1. Quality control and assembly of this dataset's short-read data using the workflow you made in the [last exercise](https://hackmd.io/FWE3m3jAQOeJj8sSl2L5KQ?both)
2. Quality control and assembly of long-read data
3. Hybrid assembly of the isolates that have both short- and long-read data available.
## The isolates
**Background isolates:**
We will now analyse genomes of the bacteria *Escherichia coli*, *Klebsiella pneumoniae* and *Serratia fonticola*. The genomes were generated in a conjugation experiment (Figure 1):
* Two donors, *Serratia* **A111** and *Klebsiella* **A177** (ESBL-producing)
* Two recipients, both *Escherichia coli* DH5α
To document transfer of the plasmid, both donors and transconjugants were sequenced.
Donors were sequenced with both short- and long-read sequencing because the *location* and genomic arrangement of the genes were important (not just presence/absence).
Recipients were sequenced using long reads only.
The short-read data were generated using paired-end (PE) short-read technology on the Illumina NovaSeq platform (see [video](https://youtu.be/fCd6B5HRaZ8)), and the long-read data were generated using [Oxford Nanopore technology on the MinION platform](https://nanoporetech.com/applications/dna-nanopore-sequencing) (ONT).
<center> Figure 1 (Abe et al, 2020) <img src="https://i.imgur.com/1uI6RzR.png" alt="drawing" width="700"/></center>
**I.** Import the data from this [History](https://usegalaxy.eu/u/allarena/h/importedakl-aceidha-conjugation)
##### Overview of donors:
**A177**: ESBL *Klebsiella pneumoniae*
* Short-read data: *A177_R1.fq.gz* and *A177_R2.fq.gz*
* Long-read data: *A177_ONT.fastq*

**A111**: ESBL *Serratia*
* Short-read data: *A111_R1.fq.gz* and *A111_R2.fq.gz*
* Long-read data: *A111_ONT.fastq*

##### Transconjugants:
* DH5α that received the plasmid from A177: *E.coli_A177_ONT.fastq*
* DH5α that received the plasmid from A111: *E.coli_A111_ONT.fastq*
> #### TASK:
> * Inspect the different files by clicking on the "eye" icon. What format are the sequences in?
### Shortread data handling
**II.** In the *Campylobacter* exercise, you created a workflow for quality control and assembly of short read data. Run that workflow on the short read data for A111 and A177.
*A111_R1.fq.gz, A111_R2.fq.gz*
*A177_R1.fq.gz, A177_R2.fq.gz*
If you don't have a workflow, import it from here:
https://usegalaxy.eu/u/allarena/w/reads-genomes-sr
### Quality control of longread data
Before doing any assembly, we must check whether the reads are of sufficient quality, so we start with quality control.
We will use [NanoPlot](https://github.com/wdecoster/NanoPlot), which generates a summary and plots of the data statistics. NanoPlot is mainly used for long-read data, such as ONT and PacBio.
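To make the summary numbers less of a black box, here is a minimal Python sketch of how read count, total bases, mean read length and N50 can be derived from a FASTQ file. It is not NanoPlot's own code: the function names are made up for illustration, the example reads are invented, and it assumes plain four-line FASTQ records.

```python
import io


def read_lengths(handle):
    """Yield the sequence length of each record in a four-line FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of every record holds the bases
            yield len(line.strip())


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarise(handle):
    """Collect basic read-length statistics from a FASTQ stream."""
    lengths = list(read_lengths(handle))
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
    }


# Tiny in-memory example with invented reads, so the sketch runs without a data file.
example = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nAC\n+\nII\n"
)
stats = summarise(example)
print(stats)
```

For a real, uncompressed file you could pass `open("A177_ONT.fastq")` instead of the in-memory example.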
**III.** Since you have now used some tools in Galaxy already, find NanoPlot yourself and analyse all four isolates that were sequenced using long-read technology (DH5_A111, DH5_A177, A177, A111). Inspect the outputs when done.
> #### TASK (look at `Nanostats`)
>
> * What is the average read length for the four isolates?
> * Enterobacteriaceae genomes are about 5 Mbp long. What is the average coverage (pick one isolate)?
> * What is the quality like?
> * Is there any correlation between read length and quality?
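For the coverage question, the usual back-of-the-envelope estimate is the total number of sequenced bases divided by the genome length. The sketch below shows that arithmetic in Python. The function name and the example base count are made up for illustration; use the total number of bases reported for your own isolate.

```python
def mean_coverage(total_bases, genome_size):
    """Average depth of coverage = sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


GENOME_SIZE = 5_000_000            # ~5 Mbp, the approximate size given in the task
example_total_bases = 250_000_000  # invented value, for illustration only

coverage = mean_coverage(example_total_bases, GENOME_SIZE)
print(f"approximate coverage: {coverage:.0f}x")
```

This estimate ignores plasmids and the exact chromosome size, so treat the result as a rough figure.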
- [ ] **DO: Rename the files so you don't lose track of what is what!**
Depending on the downstream analysis, reads may need to meet a minimum quality or length. The reads can be filtered using the tool [Filtlong](https://github.com/rrwick/Filtlong). In this training, all reads shorter than 1000 bp will be removed.
**V.** Filtering:
Run the Filtlong [Tool](https://toolshed.g2.bx.psu.edu/repos/iuc/filtlong/filtlong/0.2.0) with the following parameters:
* `Input FASTQ`: your four different ONT files (imported)
* Under `Output thresholds`, set `Min. length` to `1000`
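Conceptually, the minimum-length threshold simply discards every read shorter than the cut-off. The following Python sketch illustrates only that one criterion. It is not how Filtlong is implemented (Filtlong offers further thresholds and quality-based scoring); the function names and example reads are invented, and it assumes four-line FASTQ records.

```python
import io


def parse_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples from four-line FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, sep, qual


def filter_by_length(records, min_length=1000):
    """Keep only reads with at least min_length bases."""
    for record in records:
        if len(record[1]) >= min_length:
            yield record


# Two invented reads: one of 500 bp (dropped) and one of 1500 bp (kept).
example = io.StringIO(
    "@short_read\n" + "A" * 500 + "\n+\n" + "I" * 500 + "\n"
    + "@long_read\n" + "C" * 1500 + "\n+\n" + "I" * 1500 + "\n"
)
kept = list(filter_by_length(parse_fastq(example), min_length=1000))
print([header for header, *_ in kept])
```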
- [ ] **Rename the files to `<name>_filtered`**
- [ ] **VI.** Move all filtered ONT files and paired-end short reads to a new history. Call this new history something like "ONT_assembly" or similar. You can find a step-by-step guide in Figure 4.
Do not continue before you have made the new history and moved the files.
Figure 4 <iframe src="https://scribehow.com/embed/Usegalaxy_Workflow__m_LnMr11TDek6CTvjKeP0A" width="640" height="640" allowfullscreen frameborder="0"></iframe>
## Assembly
https://usegalaxy.eu/u/allarena/h/unnamed-history
Once the quality of the reads has been determined and the data have been filtered (as we did with Filtlong) and/or trimmed (as is more often done with short-read data), an assembly can be made.
There are many tools that assemble long-read data, but in this tutorial [Unicycler](https://github.com/rrwick/Unicycler) (see also the publication by [Wick et al. 2017](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005595)) will be used.
Unicycler is a de novo assembler for paired-end short reads, single-molecule (long-read) sequencing data and hybrid assembly, but it is specifically built for hybrid assembly, that is, using both short- and long-read sequencing data of small (e.g., bacterial, viral, organellar) genomes.
Unicycler will output the assembly in a FASTA file.
We will use Unicycler to assemble the filtered ONT reads for A111, A177 and the *E. coli* transconjugants. In addition, A111 and A177 will be assembled using both short and ONT reads (hybrid assembly).
In total, A177 and A111 will be assembled three times:
1. Just ONT reads (remember to leave the )
2. ONT reads and short reads (hybrid assembly)
3. Short-read assembly from step **II.**
At the end of the exercise (tomorrow), we will compare the assemblies obtained from ONT reads, short reads and the hybrid approach.
**VIII.** Find Unicycler in the tool panel on the left. Make four ONT-only assemblies, one for each of the filtered ONT files from BC10_E.colidh5_A177, BC8_E.colidh5_A111, A111 and A177. At the same time, start the hybrid assemblies for A111 and A177. Use the following parameters:
* `Paired or Single end data?`: `Paired` for the hybrid assembly, `None` for the ONT-only assembly
* `First Set of reads`: the forward reads file (R1) from A111 or A177
* `Second Set of reads`: the reverse reads file (R2) from A111 or A177
* `Long reads`: the ONT file for either A111 or A177
* Use default values for all other parameters
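If you later want to repeat these six runs outside Galaxy, the plan can be written down as Unicycler command lines. The Python sketch below only builds and prints the commands; it does not run anything. It assumes Unicycler's command-line options `-1`/`-2` (paired short reads), `-l` (long reads) and `-o` (output directory). The filtered file names and output directory names are hypothetical, and the Galaxy tool may set additional options behind the scenes.

```python
def unicycler_command(out_dir, long_reads=None, short_r1=None, short_r2=None):
    """Build a Unicycler argument list for an ONT-only or hybrid assembly."""
    if (short_r1 is None) != (short_r2 is None):
        raise ValueError("paired-end data needs both R1 and R2")
    if long_reads is None and short_r1 is None:
        raise ValueError("no reads given")
    cmd = ["unicycler"]
    if short_r1 is not None:
        cmd += ["-1", short_r1, "-2", short_r2]
    if long_reads is not None:
        cmd += ["-l", long_reads]
    cmd += ["-o", out_dir]
    return cmd


# Hypothetical file names after the Filtlong step and renaming.
ont_reads = {
    "A111": "A111_ONT_filtered.fastq",
    "A177": "A177_ONT_filtered.fastq",
    "E.coli_A111": "E.coli_A111_ONT_filtered.fastq",
    "E.coli_A177": "E.coli_A177_ONT_filtered.fastq",
}
short_reads = {
    "A111": ("A111_R1.fq.gz", "A111_R2.fq.gz"),
    "A177": ("A177_R1.fq.gz", "A177_R2.fq.gz"),
}

# Four ONT-only assemblies, then two hybrid assemblies for the donors.
runs = [
    unicycler_command(f"{name}_ont", long_reads=path)
    for name, path in ont_reads.items()
]
runs += [
    unicycler_command(
        f"{name}_hybrid", long_reads=ont_reads[name], short_r1=r1, short_r2=r2
    )
    for name, (r1, r2) in short_reads.items()
]

for cmd in runs:
    print(" ".join(cmd))
```

Writing the plan out this way makes it easy to check that each donor gets both an ONT-only and a hybrid run, while the transconjugants get an ONT-only run.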
##### Assembly takes time
There is no such thing as real-time assembly. It takes time, so this is a good moment to have lunch or at least a coffee, or to leave it running until tomorrow. This Unicycler run will take anywhere between 90 minutes and two hours.