IBA - Quality and species control of assemblies, rapid and full annotation and defining the core and accessory genome

<center><img src="https://i.imgur.com/BWehQrf.png" alt="drawing" width="700"/></center>

In this hands-on exercise, you will work on sequence data using the Galaxy platform. Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research, and usegalaxy.no is a national Galaxy server for life science data hosted and supported by ELIXIR Norway.

You can access the **NeLS portal** at [https://nels.bioinfo.no/](https://nels.bioinfo.no/)

You can access **usegalaxy.no** at [https://usegalaxy.no/](https://usegalaxy.no/)

**I:** Investigate the **Shovill** log. Which steps did Shovill perform? Try to identify them in the log. (Hint: see https://github.com/tseemann/shovill.)

**II:** To get a better overview of your genome statistics, run **Quast** *(Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D., & Gurevich, A. (2018). Versatile genome assembly evaluation with QUAST-LG. Bioinformatics, 34(13), i142–i150. https://doi.org/10.1093/bioinformatics/bty266)*.

Run **Quast** with the following settings:

* “Contigs/scaffolds output file” to the output of **Shovill**
* “Type of assembly” to Genome
* “Use a reference genome?” to No
* “Type of organism” to Prokaryotes
* “Lower Threshold” to 500
* “Comma-separated list of contig length thresholds” to 0,1000

This tool generates 5 output files, but we will focus on the HTML report and the Icarus viewer.

Inspect the output of **Quast** by downloading the HTML file and opening it in a browser of your choice. Can you summarize:

* How long are the assemblies?
* How many contigs have been built?
* What are the mean, minimum and maximum lengths of the contigs?
* What is N50, and what does it tell you?
* How does the GC% content match what you expect?

**III:** Even if everything looks fine, a quick species check should be standard before you continue with further analysis.
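As a side note on the N50 question in the Quast step above, the statistic can be made concrete with a short sketch. N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name `n50`, the `min_length` parameter and the example lengths are illustrative assumptions, not part of Quast or Galaxy.

```python
def n50(contig_lengths, min_length=0):
    """Return the N50 of a list of contig lengths (in bp).

    min_length is a hypothetical filter, comparable in spirit to a
    lower length threshold: shorter contigs are ignored.
    """
    kept = [length for length in contig_lengths if length >= min_length]
    if not kept:
        raise ValueError("no contigs left after filtering")
    total = sum(kept)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(kept, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Made-up contig lengths, for illustration only.
example_lengths = [500_000, 400_000, 300_000, 200_000, 100_000]
print(n50(example_lengths))  # 400000
```

A higher N50 for the same total length means the assembly is made up of fewer, longer contigs. Compare the value reported by Quast with what you would expect from the contig lengths in your own assembly.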
We will estimate the average nucleotide identity (ANI) between our assemblies and the reference genome of *Campylobacter jejuni* (NCTC 11168, accession number NC_002163.1) using a program called **FastANI** (https://github.com/ParBLiSS/FastANI). ANI is defined as the mean nucleotide identity of orthologous gene pairs shared between two microbial genomes.

- Find and download the reference sequence from NCBI with your favorite browser. Upload it to Galaxy (Upload Data). The file is now in a compressed *.tar* file format.
- Adjust the file type to fasta by editing the dataset through the pen symbol.
- For FastANI, the input is the query sequence (genomes in fasta format) and the reference sequence (also in fasta format). Read the instructions on Galaxy and run the analysis.

**FastANI** produces a table with the following columns: Query Genome, Reference Genome, ANI Value, Count of Bidirectional Fragment Mappings, and Total Query Fragments.

Questions:

* What is the ANI?
* Do we have genomes of the species *C. jejuni*? Hint: Check the original paper for intra-species ANI variation. *Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9, 5114 (2018). https://doi.org/10.1038/s41467-018-07641-9*

**IV:** When time is short, you need rapid detection of genotype, antimicrobial resistance (AMR) and possible plasmids. **Staramr** quickly scans contigs for sequence type (ST), plasmids and AMR genes. It uses the ResFinder, PlasmidFinder and PointFinder databases and compiles a summary report. Staramr produces 8 different output files as well as a collection of additional files.

* Select your genome contigs (in FASTA format).
* Select whether or not you wish to scan your genome for point mutations that confer antimicrobial resistance, using the PointFinder database. This requires you to specify the organism you are scanning.
* Run the tool.
Inspect the results:

- Summarize the genotypes, plasmids and AMR genes of your genomes.

**V:** We now know we have two *C. jejuni* genomes, and that they are of two different STs, with similar inherent resistance to penicillin. But the remainder of the genome content is not annotated yet.

In this section we will use a software tool called **Prokka** to annotate a draft genome sequence. Prokka is a software tool to rapidly annotate bacterial, archaeal and viral genomes, and produce output files that require only minor tweaking to submit to GenBank/ENA/DDBJ. *Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068–2069. https://doi.org/10.1093/bioinformatics/btu153*

- Find **Prokka** under the Annotation section.
- Select the two contig files to annotate.
- Fill out the species name, select batch mode for *Contigs to annotate*, and make sure that *Kingdom* is set to *Bacteria*; otherwise leave the settings at default.
- Press Execute.

Prokka will produce the following files:

```
Extension: Description
.gff    This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk    This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fna    Nucleotide FASTA file of the input contig sequences.
.faa    Protein FASTA file of the translated CDS sequences.
.ffn    Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
.sqn    An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsa    Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl    Feature Table file, used by "tbl2asn" to create the .sqn file.
.err    Unacceptable annotations - the NCBI discrepancy report.
.log    Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled.
.txt    Statistics relating to the annotated features found.
.tsv    Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product
```

Try to answer the following:

- How many CDS does each strain have? Is that normal for this species?
- Inspect the gbk file; what does this remind you of? Hint: go to Genbank....
- Search for the glutamine synthetase gene; it is one of the MLST genes. Do all strains have it?

**VI:** A collaborator at a hospital is wondering if his patients might have acquired *Campylobacter* from the animals you took your samples from. He sequenced the patient isolates and wants to send you his assemblies for analysis. He sends you data from six patients - see NeLS storage "IBA_course/WGS/Assemblies from collaborator".

To use Roary, we need gff files, which are generated by Prokka. Run Prokka again now for the six new strains you imported. BEWARE of the file type (you might have to change it to fasta). This might take some time, so now is a suitable time for a coffee!

**VII:** Next we will use the Prokka-generated *.gff files from nine strains to determine which genes are present in all genomes (core genome), and which are accessory (accessory genome). In addition, by aligning the genes making up the core genome, we can estimate the divergence between the strains, and thereafter build a phylogenetic tree. This phylogenetic tree will tell us how closely related the strains are - i.e. if they are the same, it might be that the humans were infected from the animals.

**Roary** is a commonly used pangenome pipeline which quickly estimates the core and accessory genome and constructs a core genome alignment from gff3 files. It also produces summary statistics and a table of gene presence and absence, which we can visualize later.
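The split into core and accessory genome described above can be sketched on a toy presence/absence table. This is a minimal illustration and not Roary's implementation: the function `split_core_accessory`, the `min_fraction` parameter, the strain labels and the gene list are all hypothetical, and real Roary output would first have to be parsed into this shape.

```python
def split_core_accessory(presence, n_genomes, min_fraction=1.0):
    """Split genes into core and accessory lists.

    presence maps a gene name to the set of strains carrying it.
    A gene counts as core when it is found in at least
    min_fraction of the n_genomes strains (1.0 = present in all).
    """
    core, accessory = [], []
    for gene, strains in sorted(presence.items()):
        if len(strains) >= min_fraction * n_genomes:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory


# Hypothetical toy data: three strains, four genes.
presence = {
    "glnA": {"animal_1", "animal_2", "patient_1"},
    "gyrA": {"animal_1", "animal_2", "patient_1"},
    "tetO": {"animal_2"},
    "hypothetical_gene_X": {"animal_1", "patient_1"},
}

core, accessory = split_core_accessory(presence, n_genomes=3)
print(core)       # ['glnA', 'gyrA']
print(accessory)  # ['hypothetical_gene_X', 'tetO']
```

Lowering `min_fraction` slightly below 1.0 tolerates genes that are missed in a few draft assemblies, which can matter when contigs are fragmented.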
Roary only works with genomes of the same species that are similar to each other. If you want to compare more distantly related genomes, other tools, such as Mashtree, could be more useful.

Roary takes a bit of time, so get it started and leave it running over a break or until tomorrow. First, change the name of each Prokka file to something unique - there cannot be spaces in the input to Roary. After that, you can start Roary:

* Now that you have used Galaxy and its tools for a while - maybe it's time to try filling it out without me helping? Leave settings at default for now.

Make sure the tool is running, and continue with the next HackMD sheet.