BI278 Lab 7 - Phylogenetics

# BI278 Lab 7 - Phylogenetics Goal: use... - andi for whole genome phylogenies - iqtree for single gene phylogenies - R for visualizing and manipulating trees **Exercise 1. andi (ANchor DIstances)** Notes: - aligment-free methods for genrating phylogenies = new type of phylogenetic method that has come about since advent of NGS tech & abundance of genomic seq data - gene trees = straighforward to generate - species trees = estimated by concatenation or coalescence modeling of multiple genes - diff ways of doing alignment-free phylogeny estimation but most methods use k-mer freqs & overlapping k-mers to generate distance matrices & distance trees - andi identifies long exact matches (anchors) to use as homologous segments between 2 genomes - then it looks at mismatched regions bracketed by these anchors & estimates a distance matrix - ideally used for very closely related genomes (strains within the same species and closely related species within the same genus) BUT method appears to work just as well as other alignment-free methods so we use it in this lab Read more about andi: https://academic.oup.com/bioinformatics/article/31/8/1169/212918 0) Make a new directory to work in mkdir lab_07 1) To run andi, you need to first prepare an input file that contains all of our genomes that we want to build a phylogeny with. You should be able to leave the input files where they are as long as we list out the entire path: ls /courses/bi278/Course_Materials/lab_07/*.*a > andi_inputs.txt 2) Run andi: andi --file-of-filenames=andi_inputs.txt -t 2 --join > andi_output.dist If you run andi on your own later, NOTE it'll ignore filenames after the first "." So rename your files accordingly so you can tell them apart (no periods before the last .fa or .fna so either remove them or use an _ instead) **Exercise 2 - IQ-TREE for phylogenies** Notes: - IQ-TREE = powerful phylogenetics software that estimates trees for either nucleotide or protein (amino acid) alignments using max likelihood - most convenient feature is that it tests a large range of phylogenetic (eg substitution) models & then outputs a tree best of which model appears to be most appropriate for your data - RECALL: we used FastTree before to generate quick phylogenies from multiple seq alignments in lab #4 - IQ-Tree has many more bells & whistles compared to FastTree but this may not be necessary depending on what your data is like - IF data requires more complex models of seq evolution such as rate heterogeneity (ex: you need to separately model sites that evolve), data partitions (ex: you want independent models for each gene in a multi-gene alignment), etc THEN IQ-Tree is great! Read about range of possible models: http://www.iqtree.org/doc/Substitution-Models Now let's look at the orthogroups, species and gene trees outputted by orthofinder. Prof Noh reran orthofinder with the same taxa used for andi. These files are in: /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa 3) Identify orthogroups w/ genes from each genome. We want all the rows where orthologous groups share a common ancestor at node 0 & each genome has at least 1 gene member in the group awk -F "\t" '$3=="n0" && $4!="" && $5!="" && $6!="" && $7!="" && $8!="" && $9!="" && $10!="" && $11!="" && $12!="" && $13!="" && $14!="" && $15!="" && $16!="" && $17!="" && $18!="" && $19!="" && $20!="" && $21!=""' /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Phylogenetic_Hierarchical_Orthogroups/N0.tsv > 18taxa_coregenes.txt There should be 1887 orthogroups. 4) Now, we want to see how many genes are included in each orthogroup from above. If there are only single copy genes, then the number of genes should be the same as the number of genomes included in the orthoginder analysis (18). sed 's/, /\t/g' 18taxa_coregenes.txt | awk '{print $1, NF-1}' For demonstration purposes: Prof Noh chose to investigate OG0000084 in more detail for the arbitrary reason that some but not all taxa that appear to have a gene duplication. She wanted to compare the gene tree to the species tree. The amino acid sequences for all members of this orthogroup have been collected here: /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Orthogroup_Sequences/OG0000084.fa 5) Generate a multiple seq alignment using MAFFT mafft --maxiterate 3 /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Orthogroup_Sequences/OG0000084.fa > OG0000084_mafft.faa 6) Now use MSA as input for IQ-Tree iqtree -s OG0000084_mafft.faa This will test all the protein substitution models it has, pick the best one for the data, do tree searching, etc & finally print a detailed log to the screen as it does this. Prof Noh's tree is in this file: OG0000084_mafft.faa.treefile For a preview of what the tree looks like (along w/ a bunch of other info regarding best protein substitution model), look at this file: OG0000084_mafft.faa.iqtree **Exercise 3. Trees in R** Here, we'll look at and compare our trees in R. The two main packages for phylogenetics in R are ape: Analyses of Phylogenetics & Evolution and phytools. If you need answers to future questions about phytools: http://blog.phytools.org/ Let’s take inventory of the trees we will look at: - gene tree from IQ-TREE above: OG0000084_mafft.faa.treefile - trees that Orthofinder generates! may want to copy these two files over to your working directory. - gene tree for the orthogroup 0000084: /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Gene_Trees/OG0000084_tree.txt - We also have a species tree from Orthofinder: /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Species_Tree/SpeciesTree_rooted.txt 7) Go to Rstudio @ bi278.colby.edu **From here on, enter commands into script section (uppper left) in RStudio view 8) Load phylogenetic packages: library (apes) library (phytools) 9) Before making comparisons, you need to take the distance matrix generated by andi & make a neighbor-joining tree from it: andi.dist <- read.table("/home2/snoh/lab_07/andi_output.dist", header=F, row.names=1, skip=1) colnames(andi.dist) <- rownames(andi.dist) # infer tree from distance matrix andi.tree <- nj(as.dist(andi.dist)) 10) Look at this tree: plot(andi.tree) 11) To make it easier to compare between trees between methods, clean up labels of the tips. Here, we use a few diff methods: # first check what the tip labels are andi.tree$tip.label # substitute patterns that have to do with method or source temp <- gsub(andi.tree$tip.label, pattern="_nanoassembly", replacement="") temp2 <- gsub(temp, pattern="_ncbi", replacement="") # we can also directly substitute individual elements temp2[1] <- "bh69" # we can also directly substitute ranges of elements temp2[15:18] <- c("Pfungorum", "Psprentiae", "Pterrae", "Pxenovorans") # last thing to clean up that was inconsistent across names temp3 <- gsub(temp2, pattern="qs", replacement="") # replace the original labels andi.tree$tip.label <- temp3 # clean up unnecessary files rm(temp, temp2, temp3) 12) Look at tree again (this time w/ node numbers): plot(andi.tree) nodelabels() You should have two outgoups (B. mallei and B. pseudomallei and they are join together at a specific node): ![](https://i.imgur.com/Git74l5.png) 14) Now look at species tree outputted by orthofinder: ortho.tree <- read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Species_Tree/SpeciesTree_rooted.txt") plot(ortho.tree) # fix tip labels ortho.tree$tip.label temp <- gsub(ortho.tree$tip.label, pattern="_prokka", replacement="") temp2 <- gsub(temp, pattern="qs", replacement="") ortho.tree$tip.label <- temp2 rm(temp, temp2) # root this tree too plot(ortho.tree) nodelabels() rooted.ortho <- root.phylo(ortho.tree, node=25, resolve.root=T) plot(rooted.ortho, cex=1, label.offset=0.005) 15) Rotate nodes of this tree so the tips match more closely the tips of the tree from andi: rotate.ortho <- rotateConstr(rooted.ortho, rooted.andi$tip.label) plot(rotate.ortho) 16) Compare how similar these trees are: # compare topology (branching pattern) all.equal(andi.tree, ortho.tree, use.edge.length=FALSE) # plot both trees side by side cophyloplot(rooted.andi, rotate.ortho) ![](https://i.imgur.com/jA4ALXu.png) Trees are not the same in topology! P. sprentiae and P. terrae are in slightly different positions, as are several of the bh# strains. This is ok BUT indicates that we can either improve our taxon sampling (e.g. include more Paraburkholderia species other than the four we have here to better resolve their relationships) or perhaps even just use andi for the very closely related taxa it’s supposed to be best for. 17) Compare gene trees now: # the tree I generated with IQ-TREE iq.tree <- read.tree("/home2/snoh/lab_07/OG0000084_mafft.faa.treefile") # check and root this tree plot(iq.tree) nodelabels() rooted.iq84 <- root.phylo(iq.tree, node=33, resolve.root=T) plot(rooted.iq84, cex=0.75, label.offset=0.005) # the gene tree from orthofinder gene.tree <- read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_1 8taxa/Gene_Trees/OG0000084_tree.txt") # check and root this tree too plot(gene.tree) nodelabels() rooted.or84 <- root.phylo(gene.tree, node=55, resolve.root=T) plot(rooted.or84, cex=0.75, label.offset=0.005) # rotate and compare rotate.iq84 <- rotateConstr(rooted.iq84, rooted.or84$tip.label) all.equal(rooted.or84, rooted.iq84, use.edge.length=FALSE) cophyloplot(rooted.or84, rotate.iq84) ![](https://i.imgur.com/lTRJOAb.png) These trees are not the same either! They're similar BUT we can see some interesting patterns. For example, PAGRI genes are from Paraburkholderia agricolaris strain Ba159 and there were 4 copies of this gene in the same genome (Figure 1). ![](https://i.imgur.com/wD2gr7S.png) There appear to be 2 copies that may be more distantly related to the ingroup than expected (PAGRI_06485 and PAGRI_07147). This is weird/not normal! NOTE: The other two copies are grouping together with other Paraburkholderia species like in the species tree. This seems to be the result of an old gene duplication event that affected both Burkholderia and Paraburkholderia. This is more of what we expected to see. 18) Save figures you created directly but also save cleaned up species trees as text files in case you want to use it in the future. write.tree(rooted.andi, file="PATH/andi_cleaned.tre") write.tree(rotate.ortho, file="PATH/ortho_cleaned.tre") DON'T FORGET to save Rscript so you can come back to it & use elements from it in the future!