# ACEIDHA - Making a phylogenetic tree and exploring visualisation tools for a *Campylobacter* dataset

<center><img src="https://i.imgur.com/rPIZUIq.png" alt="drawing" width="700"/></center>

**I**: Inspecting the *Summary statistics* output of **Roary**

* How many genes are present in total across the strains (i.e. in at least one strain)?
* How many are present in all genomes?

**II**: Making a phylogenomic/evolutionary tree from a core gene alignment using **IQ-TREE**.

**IQ-TREE** takes a multiple sequence alignment as input and reconstructs the evolutionary tree that is best explained by the input data. The input alignment can be in various common formats, such as PHYLIP, FASTA, NEXUS, and CLUSTALW.

`Inspect your alignment - what filetype do you have?`

IQ-TREE will choose the best model for you automatically if you specify any of the TEST models, but valid custom models can also be specified. With default settings, **IQ-TREE** will choose the best model.

Run **IQ-TREE** with default settings. Use the core genome alignment from the Roary output as input to **IQ-TREE**.

When done, inspect the Report and Final Tree to understand which tree you should visualise. Download this tree file to your local computer and store the text file with a .nwk ending (Newick format).

Also, download the file from Roary that shows which genes are present in which genomes (Gene Presence Absence).

**III** We are going to use [Phandango](https://jameshadfield.github.io/phandango/) to visualise the tree and the core/accessory genomes of our strains simultaneously. Use Edge, not Chrome!

If you have stored the tree file correctly as Newick, drag and drop it into the browser window. Thereafter, drag and drop the csv file with the gene content.

Inspect the resulting tree:

- Can you find two strains that are exactly identical?
- How are their pangenomic profiles?
- How do you answer your collaborator?

[Microreact](https://microreact.org/) is another visualisation tool.
It's better than Phandango if you want to present trees, and you can add geographical (GPS) locations and time as well. For it to work best, add metadata such as AMR genes, ST type and year of isolation. Details on how to make compatible datasets can be found [here](https://docs.microreact.org/instructions/data/supported-file-formats).

**IV** The metadata file can be a csv (comma-separated values) file. It can be made in Excel.

> Required columns
>
> Only an identifier for your data rows is required. The ID column must be unique (i.e. each row has a unique ID value). Note that the column does not need to be named "ID" as a header, although it can be renamed as such.
>
> Proper visualisation in Microreact requires a single ID column that uniquely identifies each row of data in each of the related files.

Here the unique identifier should match the names of the leaf tips in the tree. Metadata could be, for instance, resistance data and ST type. Try it out.

As you see, the seven strains are very similar. But how similar, and where do they vary? Let's do a variant calling exercise to figure it out!

#### Variant calling

Variant calling is the process of identifying differences between two genome samples. Usually differences are limited to single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels). Larger structural variation such as inversions, duplications and large deletions are not typically covered by “variant calling”.

Imagine that you have been asked to find the differences between a sample that has been sequenced and a known genome. For example: you have a new sample from a patient and you want to see if it has any differences from a well-known reference genome of the same species. Typically, you would have a couple of fastq read files sent back to you from the sequencing provider and either an annotated or non-annotated reference genome.
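
To make the terms SNP and indel concrete, here is a toy sketch in Python. It assumes the reference and the sample have already been aligned to each other (gaps written as `-`), which is a strong simplification: real variant callers work from many overlapping reads and use statistical models. The function name and the two short sequences are invented for illustration.

```python
def classify_differences(ref_aln, sample_aln):
    """List differences between two pre-aligned sequences ('-' marks a gap).

    Returns tuples of (reference position, kind, reference base, sample base).
    Positions are 1-based and count reference bases only.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0
    for r, s in zip(ref_aln.upper(), sample_aln.upper()):
        if r != "-":
            ref_pos += 1          # only real reference bases advance the position
        if r == s:
            continue              # identical column, nothing to report
        if r == "-":
            kind = "ins"          # base present in the sample only
        elif s == "-":
            kind = "del"          # base present in the reference only
        else:
            kind = "snp"          # both have a base, but they differ
        variants.append((ref_pos, kind, r, s))
    return variants


# Hypothetical aligned sequences, made up for this example.
REF = "ATGCCGTA-CGT"
SAMPLE = "ATGACGTATC-T"

for variant in classify_differences(REF, SAMPLE):
    print(variant)
```

For these two sequences the sketch reports one SNP, one insertion and one deletion. An insertion is reported at the position of the last reference base before it.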
In this tutorial, we will use the tool “Snippy” (see the author's [development repository](https://github.com/tseemann/snippy)) to find high-confidence differences (indels or SNPs) between our known genome and our reads. Snippy uses one tool to align the reads to the reference genome, and another tool to decide (“call”) whether any of the resulting discrepancies are real variants or technical artifacts that can be ignored. Finally, Snippy uses another tool to check what effect these differences have on the predicted genes: truncation, frame shift, or whether the changes are synonymous.

For the read alignment (read mapping) step, Snippy uses BWA MEM with a custom set of settings that are well suited to aligning reads from microbial data. For the variant calling step, Snippy uses FreeBayes with a custom set of settings. SnpEff is then used to describe what the predicted changes do in terms of the genes themselves.

The Galaxy wrapper for Snippy lets you change some of the underlying tool settings in the advanced section, but this is not recommended.

Read more about SNP calling at Wikipedia.

**V.** Choose two strains that cluster close together in the phylogenetic tree. Use one of them as the reference strain and the other as the query.

Parameters:

- `Reference File` will be the chosen reference strain's `*.gbk` file (if the GenBank file is not selectable, make sure to change its datatype to ‘genbank’)
- `Single or Paired-end reads` to Paired
- “Select first set of reads” to `query_strain_R1.fastq`
- “Select second set of reads” to `query_strain_R2.fastq`
- Select all outputs

**VI.**

##### Examine the Snippy output

Snippy has taken the reads, mapped them against the reference using BWA MEM, looked through the resulting BAM file and found differences using some fancy Bayesian statistics (FreeBayes), filtered the differences for sensibility and finally checked what effect these differences will have on the predicted genes and other features in the genome.
It produces quite a bit of output: there can be up to [10 output files](https://github.com/tseemann/snippy#output-files).

Have a look at the contents of the SNP table file (snippy on data XX, data XX and data XX table):

1. Which types of variants have been found?
1. What is the third variant called?
1. What is the product of the mutation?
1. What might be the result of such a mutation?
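
If you download the SNP table, the first two questions can also be checked with a few lines of Python. This is only a sketch: it assumes the table is tab-separated with a header row containing `TYPE`, `REF`, `ALT` and `PRODUCT` columns, so compare it with the header of your own file first. The example rows below are invented placeholders, not results from the *Campylobacter* dataset.

```python
import csv
import io
from collections import Counter


def summarise_snp_table(handle):
    """Read a tab-separated SNP table and count the rows per variant TYPE."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    type_counts = Counter(row["TYPE"] for row in rows)
    return rows, type_counts


# Hypothetical rows, made up for illustration only.
EXAMPLE_ROWS = [
    ("CHROM", "POS", "TYPE", "REF", "ALT", "GENE", "PRODUCT"),
    ("contig_1", "1204", "snp", "C", "T", "geneA", "hypothetical protein"),
    ("contig_1", "8831", "snp", "G", "A", "", ""),
    ("contig_2", "412", "del", "TA", "T", "geneB", "hypothetical protein"),
]
EXAMPLE_TABLE = "\n".join("\t".join(row) for row in EXAMPLE_ROWS) + "\n"

rows, type_counts = summarise_snp_table(io.StringIO(EXAMPLE_TABLE))

# How many variants of each type were found?
for variant_type, count in type_counts.most_common():
    print(f"{variant_type}\t{count}")

# What is the third variant?
third = rows[2]
print("Third variant:", third["TYPE"], third["REF"], "->", third["ALT"],
      "| product:", third["PRODUCT"])
```

To use it on a real table, replace `io.StringIO(EXAMPLE_TABLE)` with an opened file, for example `open("snps.tab", newline="")`, where the file name is whatever you saved the downloaded table as.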