<center><img src="https://i.imgur.com/rPIZUIq.png" alt="drawing" width="700"/></center>
# ACEIDHA: Quality control of assemblies - short read
###### Relies heavily on *
You should now have assemblies from *Campylobacter* and the conjugation experiments in two histories. Keep them that way.
We will start with the *Campylobacter* assemblies:
**I.** Investigate the **Shovill** log output. Which steps did Shovill run? Try to identify some of them in the log (hint: see the Shovill documentation at https://github.com/tseemann/shovill). Also look at the contigs: does the file look like a FASTA file?
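The FASTA question can also be checked programmatically. The Python sketch below is only an illustration and not part of the Galaxy workflow: the function name `looks_like_fasta` and the example records are made up, and the check accepts only the letters A, C, G, T and N, although real files may contain other IUPAC codes.

```python
def looks_like_fasta(text):
    """Return True if text has the basic shape of a nucleotide FASTA file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # A FASTA file must start with a header line beginning with ">".
    if not lines or not lines[0].startswith(">"):
        return False
    allowed = set("ACGTNacgtn")  # simplification: no other IUPAC codes
    previous_was_header = False
    for ln in lines:
        if ln.startswith(">"):
            if previous_was_header:
                return False  # header without any sequence after it
            previous_was_header = True
        else:
            if not set(ln) <= allowed:
                return False  # unexpected character in a sequence line
            previous_was_header = False
    # The last record must also have sequence lines.
    return not previous_was_header


# Made-up example records, not real Shovill output.
example = ">contig00001\nACGTACGTNN\nGGCC\n>contig00002\nTTGACA\n"
print(looks_like_fasta(example))
print(looks_like_fasta("ACGTACGT\n"))
```

The first call reports a valid layout (header line, then sequence lines, repeated per contig). The second fails because the text does not start with a `>` header.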
## Assess Assembly quality with Bandage
Now that we’ve assembled the genomes, let’s visualise the assemblies using Bandage ([Wick et al. 2015](https://training.galaxyproject.org/training-material/topics/assembly/tutorials/debruijn-graph-assembly/tutorial.html#Wick2015)). This tool shows what the assembly graph actually looks like and gives a sense of whether the genome was well assembled.
**II.** Find the Bandage Image tool, and choose the “Contig graph” as input. Execute. View the output file.
Keep in mind that an assembly graph can have several valid interpretations, and without additional data there is no way to choose between them.
## Assess Assembly quality with Quast
Quast ([Gurevich et al. 2013](https://training.galaxyproject.org/training-material/topics/assembly/tutorials/unicycler-assembly/tutorial.html#Gurevich2013)) computes quality metrics for assemblies and can also compare multiple assemblies. It optionally takes a reference genome as input, in which case it reports additional reference-based metrics.
Quast writes its report as an HTML file with metrics and graphs. The image below looks exceptionally boring. This is a good thing:
***Figure 1. Quast Output**. Quast provides different statistics such as the number of contigs or scaffolds, the N50 and N75, and the total length of the assembly. You can also access 3 plots, the cumulative length of the contigs, the Nx, or the GC content*.

**III.** Run Quast on all assemblies. Do this in both histories, so that you get one Quast output for the *Campylobacter* assemblies and one for the conjugation isolates.
**TASK**
Can you summarize:
* How long are the assemblies?
* How many contigs have been built?
* What are the median and maximum contig lengths?
* What is the N50, and what does it tell you?
* Does the GC content (%) match what you expect?
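To make the N50 concrete: it is the contig length at which contigs of that length or longer together cover at least half of the total assembly length. The Python sketch below computes it alongside the other summary values from a list of contig lengths. The function name `assembly_stats` and the lengths are invented for illustration, and the values Quast reports may differ depending on its minimum contig length filter.

```python
from statistics import median


def assembly_stats(contig_lengths):
    """Summarise an assembly from a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    lengths = sorted(contig_lengths, reverse=True)  # longest first
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the running sum reaches half the total.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "total_length": total,
        "contigs": len(lengths),
        "median": median(lengths),
        "max": lengths[0],
        "n50": n50,
    }


# Invented toy lengths, not taken from a real assembly.
stats = assembly_stats([100, 80, 60, 40, 20])
print(stats)
```

In this toy example the total length is 300, and the two longest contigs (100 + 80 = 180) already cover more than half of it, so the N50 is 80. A high N50 relative to the genome size means most of the assembly sits in long contigs.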
**IV.** Even if everything looks fine, a quick species check should be standard before you continue with further analysis.
We will estimate the average nucleotide identity (ANI) between our assemblies and the reference genome of *Campylobacter jejuni* (NCTC 11168, accession number NC_002163.1) using a program called **FastANI** (https://github.com/ParBLiSS/FastANI). ANI is defined as the mean nucleotide identity of orthologous gene pairs shared between two microbial genomes.
- Find the reference genome at [NCBI](https://www.ncbi.nlm.nih.gov/assembly/GCF_000009085.1), download it in FASTA format to your computer, and upload it to your Galaxy history
- Decompress the file
  - Set the datatype to `fasta.gz` via the pencil icon, then uncompress it with `Convert`
- Inspect the file to make sure it looks OK
- FastANI takes query sequences (your assemblies, in FASTA format) and a reference sequence (also in FASTA format) as input. Read the instructions on Galaxy and run the analysis.
**FastANI** produces a table with the following columns: Query Genome, Reference Genome, ANI Value, Count of Bidirectional Fragment Mappings, and Total Query Fragments.
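If you download the table, the five columns can be read with a few lines of Python. This is a minimal sketch under assumptions: it expects tab-separated rows without a header line, the function name `parse_fastani` is hypothetical, and the example row contains invented values. The `aligned_fraction` field is a derived convenience (mapped fragments divided by total fragments), not a FastANI column.

```python
import csv
import io


def parse_fastani(handle):
    """Parse FastANI tabular output (assumed tab-separated, no header)."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if len(record) != 5:
            continue  # skip blank or malformed lines
        query, reference, ani, mapped, total = record
        rows.append({
            "query": query,
            "reference": reference,
            "ani": float(ani),
            "mapped_fragments": int(mapped),
            "total_fragments": int(total),
            # Derived value, not part of the FastANI output itself.
            "aligned_fraction": int(mapped) / int(total),
        })
    return rows


# Invented example row; file names and numbers are placeholders.
example = io.StringIO("isolate_A.fasta\tNC_002163.1.fasta\t97.5\t450\t500\n")
results = parse_fastani(example)
for row in results:
    print(row["query"], row["ani"], round(row["aligned_fraction"], 2))
```

Looking at the aligned fraction next to the ANI value helps judge how much of the query genome the identity estimate is actually based on.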
Questions:
* What is the ANI?
* Do we have genomes of the species *C. jejuni*? Hint: Check the original paper for intra-species ANI variation. [High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Jain et al. (2018)](https://doi.org/10.1038/s41467-018-07641-9)
* More recommended reading on ANI and species: https://pubmed.ncbi.nlm.nih.gov/29792589/
> ###### Anton Nekrutenko, Delphine Lariviere, Simon Gladman, 2022. Unicycler Assembly (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/assembly/tutorials/unicycler-assembly/tutorial.html Online; accessed Tue Oct 18 2022.
> ###### Batut et al., 2018. Community-Driven Data Analysis Training for Biology. Cell Systems. 10.1016/j.cels.2018.05.012
> ###### Simon Gladman, Helena Rasche, Saskia Hiltemann, 2022. De Bruijn Graph Assembly (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/assembly/tutorials/debruijn-graph-assembly/tutorial.html Online; accessed Wed Oct 19 2022.