<center><img src="https://i.imgur.com/rPIZUIq.png" alt="drawing" width="700"/></center>
# ACEIDHA - Rapid genotyping and pangenome definition - the *Campylobacter* dataset
The goal of our analysis is to check whether clones of *Campylobacter* are circulating in Luxembourg, that is, whether our isolates are of the same type. A type can be defined in different ways. For example, a bacterial type may be *Staphylococcus aureus* with resistance to penicillin. We can also “type” a bacterium by focusing on several genes and seeing which allele is present for each of them. Each bacterial species has its own scheme, which is the set of genes that are looked at. Overall, this process is called multi-locus (= several genes) sequence typing, or MLST. If you have clinical isolates, their antimicrobial resistance (AMR) pattern is also relevant. Here we will again use `Staramr` to perform both MLST and AMR characterization in silico.
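To make the idea of a sequence type concrete, here is a minimal Python sketch of how an allele profile maps to an ST. The seven loci are those of the *C. jejuni* / *C. coli* MLST scheme; the allele numbers and the ST labels (`ST-A`, `ST-B`) are made up for illustration, and real profiles are curated in the PubMLST database.

```python
# Loci of the C. jejuni / C. coli MLST scheme (seven housekeeping genes).
LOCI = ("aspA", "glnA", "gltA", "glyA", "pgm", "tkt", "uncA")

# HYPOTHETICAL lookup table: allele profile -> sequence type.
# Real allele numbers and STs come from PubMLST, not from this dict.
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): "ST-A",
    (1, 1, 2, 1, 1, 1, 1): "ST-B",
}


def assign_st(alleles):
    """Return the ST for a dict of locus -> allele number, or None if unknown."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile)


# Three made-up isolates; isolate_2 differs from the others at gltA only.
isolate_1 = dict(zip(LOCI, (1, 1, 1, 1, 1, 1, 1)))
isolate_2 = dict(zip(LOCI, (1, 1, 2, 1, 1, 1, 1)))
isolate_3 = dict(zip(LOCI, (1, 1, 1, 1, 1, 1, 1)))

for name, alleles in (("isolate_1", isolate_1),
                      ("isolate_2", isolate_2),
                      ("isolate_3", isolate_3)):
    print(name, assign_st(alleles))
```

A single allele difference at one locus is enough to give a different ST, which is why two isolates with the same ST can still differ elsewhere in the genome.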
**I.**
* Select your genome contigs (in FASTA format).
* Select whether or not you wish to scan your genome for point mutations that confer antimicrobial resistance, using the PointFinder database. This requires you to specify the organism you are scanning.
* Run the tool.
Inspect the results:
- Summarize your genomes' genotypes, plasmids and AMR genes.
- Which sequence types (STs) do you have? Are any isolates of the same ST?
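If you download the Staramr summary table, grouping isolates by ST can be sketched in a few lines of Python. The column names `Isolate ID` and `Sequence Type` are assumptions about the summary file layout, so check them against your own download; the rows below are made-up placeholders, not results from this dataset.

```python
import csv
import io
from collections import defaultdict

# MADE-UP rows standing in for a downloaded Staramr summary table.
# Column names are assumed; verify them in your own file.
SUMMARY_TSV = (
    "Isolate ID\tGenotype\tPlasmid\tSequence Type\n"
    "sample_A\ttet(O)\tNone\t21\n"
    "sample_B\tNone\tNone\t45\n"
    "sample_C\ttet(O)\tNone\t21\n"
    "sample_D\tblaOXA-61\tNone\t21\n"
)


def isolates_by_st(handle):
    """Group isolate names by the value in the 'Sequence Type' column."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["Sequence Type"]].append(row["Isolate ID"])
    return dict(groups)


groups = isolates_by_st(io.StringIO(SUMMARY_TSV))

# Keep only STs that are shared by more than one isolate.
shared = {st: names for st, names in groups.items() if len(names) > 1}
print(shared)
```

For a real file, replace `io.StringIO(SUMMARY_TSV)` with an open file handle.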
**II.**
We know that we have isolates of the same ST, but that does not mean that they are of the same clone. We need to compare them at the whole-genome level to figure that out. That can be done in a range of ways; here we will start by performing a core genome alignment and investigating their phylogeny based on it. To do so, we need to annotate the entire genome (as for the conjugate dataset), compare the genes that the different isolates have, and identify the ones they have in common (core genes). Only these can be aligned.
- Find **Prokka** under the Annotation section
- Select the four contigs to annotate
- Fill out the species name, make sure that the multiple dataset mode is selected for *Contigs to annotate*, and make sure that *Kingdom* is set to *Bacteria*. Adjust the outputs so that you only get the annotations in a gff file and the statistics (otherwise you will get a lot of files).
- Press execute
Prokka can produce the following files:
```
Extension: Description
.gff This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fna Nucleotide FASTA file of the input contig sequences.
.faa Protein FASTA file of the translated CDS sequences.
.ffn Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
.sqn An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsa Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl Feature Table file, used by "tbl2asn" to create the .sqn file.
.err Unacceptable annotations - the NCBI discrepancy report.
.log Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled.
.txt Statistics relating to the annotated features found.
.tsv Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product
```
Try to answer the following:
- How many CDS does each strain have? Is that normal for this species?
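If you want to check the CDS count directly from the annotation, a minimal Python sketch is to count the feature lines whose third column is `CDS`, stopping at the `##FASTA` marker where the embedded sequences start. The toy GFF below is invented for illustration (contig name, coordinates and IDs are placeholders).

```python
import io


def count_cds(handle):
    """Count CDS features in a GFF3 stream, ignoring the embedded FASTA part."""
    n_cds = 0
    for line in handle:
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not annotation
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "CDS":
            n_cds += 1
    return n_cds


# TOY annotation with two CDS features and one tRNA (all values are placeholders).
TOY_GFF = (
    "##gff-version 3\n"
    "##sequence-region contig_1 1 3000\n"
    "contig_1\ttoy\tCDS\t10\t900\t.\t+\t0\tID=TOY_00001\n"
    "contig_1\ttoy\ttRNA\t950\t1025\t.\t+\t.\tID=TOY_00002\n"
    "contig_1\ttoy\tCDS\t1100\t2500\t.\t-\t0\tID=TOY_00003\n"
    "##FASTA\n"
    ">contig_1\n"
    "ATGAAATTTGGG\n"
)

print(count_cds(io.StringIO(TOY_GFF)))
```

For a real annotation, pass an open file handle instead, for example `count_cds(open("my_strain.gff"))`, where `my_strain.gff` is a placeholder file name. The number should agree with the CDS count in the Prokka statistics file.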
**RENAME YOUR FILES - this is super important!**
**III.** Next we will use the `.gff` files generated by `prokka` for the *Campylobacter* strains of similar ST to determine which genes are present in all genomes (core genome) and which are present in only some of them (accessory genome). In addition, by aligning the genes that make up the core genome, we can estimate the divergence between the three strains and then build a phylogenetic tree. This tree will tell us how closely related the strains are: if they are nearly identical, it might be that the humans got infected from the animals.
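As a rough illustration of what "divergence" means here, the Python sketch below counts the positions at which two aligned sequences differ, skipping gaps and ambiguous bases. The three short sequences are invented; a real core genome alignment is far longer, and tree-building tools use more refined models than a raw difference count.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count aligned positions where both sequences have A/C/G/T and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid and a != b
    )


# TOY alignment (invented sequences, one gap in strain_3).
ALIGNMENT = {
    "strain_1": "ATGACCGTTA",
    "strain_2": "ATGACCGTTA",
    "strain_3": "ATGATCG-TG",
}

# Pairwise distances between all strains.
distances = {
    (a, b): snp_distance(ALIGNMENT[a], ALIGNMENT[b])
    for a, b in combinations(sorted(ALIGNMENT), 2)
}

for pair, dist in distances.items():
    print(pair, dist)
```

Strains with a distance of zero or close to zero over the core genome are candidates for belonging to the same clone.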
**Roary** is a commonly used pangenome pipeline that quickly estimates the core and accessory genomes and constructs a core genome alignment from GFF3 files. It also produces summary statistics and a table of gene presence and absence, which we can visualize later.
Roary only works with genomes of the same species that are similar to each other. If you want to compare more distantly related genomes, other tools, such as Mashtree, could be more useful.
Roary takes a bit of time, so get it started and leave for a break.
> Now that you have used Galaxy and its tools for a while, maybe it's time to try filling it out without my help? Make sure you get at least the following outputs (default):
> 1. Summary
> 1. Core gene alignment
> 1. Gene presence absence file
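Once Roary has finished, the split into core and accessory genes can be sketched in Python as below. The sketch assumes a plain tab-separated 0/1 table with one row per gene and one column per strain, of the kind Roary writes as `gene_presence_absence.Rtab`; the CSV version carries extra annotation columns that would need to be skipped first. The gene names and values are invented, and the 0.99 threshold mirrors Roary's default core definition of 99% of isolates.

```python
import csv
import io

# TOY presence/absence table (1 = gene present in that strain); values are invented.
TOY_TABLE = (
    "Gene\tstrain_1\tstrain_2\tstrain_3\n"
    "gyrA\t1\t1\t1\n"
    "tetO\t1\t0\t0\n"
    "group_12\t1\t1\t0\n"
    "rpoB\t1\t1\t1\n"
)


def split_core_accessory(handle, core_fraction=0.99):
    """Return (core, accessory) gene lists from a 0/1 presence/absence table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_isolates = len(header) - 1  # first column holds the gene name
    core, accessory = [], []
    for row in reader:
        n_present = sum(int(value) for value in row[1:])
        if n_present / n_isolates >= core_fraction:
            core.append(row[0])
        else:
            accessory.append(row[0])
    return core, accessory


core, accessory = split_core_accessory(io.StringIO(TOY_TABLE))
print("core:", core)
print("accessory:", accessory)
```

With only three strains, the 99% threshold means a core gene has to be present in all of them.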
**IV Rename file entries**. Roary **might** change the names of the files to an internal filesystem name `(Dataset_xxxxxx)`. This is problematic because you will no longer be able to deduce which sample is which. To handle this, you can do some text manipulation. *Kjetil Klepper from NTNU* has written a script that swaps the dataset names for the file names. Find it under `General Text Tools - Text Manipulation - Rename file entries`, and execute it. Try to figure out which files go where before you ask us.
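The idea behind this renaming step can be sketched in Python as a simple text substitution. This is not the Galaxy tool's actual code; the `Dataset_<digits>` pattern, the dataset numbers and the sample names are all hypothetical and only show the principle.

```python
import re

# HYPOTHETICAL mapping from internal dataset names to your own sample names.
NAME_MAP = {
    "Dataset_101": "sample_A",
    "Dataset_102": "sample_B",
    "Dataset_1010": "sample_C",
}


def rename_entries(text, name_map):
    """Replace every Dataset_<digits> token that has an entry in name_map."""
    pattern = re.compile(r"Dataset_\d+")
    # Matching the whole token avoids turning Dataset_1010 into sample_A0.
    return pattern.sub(lambda m: name_map.get(m.group(0), m.group(0)), text)


# Toy Newick tree with internal names as tip labels.
newick = "((Dataset_101:0.001,Dataset_102:0.001):0.02,Dataset_1010:0.03);"
renamed = rename_entries(newick, NAME_MAP)
print(renamed)
```

Tokens without an entry in the mapping are left unchanged, so a missing sample name shows up as a leftover `Dataset_` label rather than disappearing silently.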