# BI278 Lab 7 - Phylogenetics
Goal: use...
- andi for whole genome phylogenies
- iqtree for single gene phylogenies
- R for visualizing and manipulating trees
**Exercise 1. andi (ANchor DIstances)**
Notes:
- aligment-free methods for genrating phylogenies = new type of phylogenetic method that has come about since advent of NGS tech & abundance of genomic seq data
- gene trees = straighforward to generate
- species trees = estimated by concatenation or coalescence modeling of multiple genes
- diff ways of doing alignment-free phylogeny estimation but most methods use k-mer freqs & overlapping k-mers to generate distance matrices & distance trees
- andi identifies long exact matches (anchors) to use as homologous segments between 2 genomes
- then it looks at mismatched regions bracketed by these anchors & estimates a distance matrix
- ideally used for very closely related genomes (strains within the same species and closely related species within the same genus) BUT method appears to work just as well as other alignment-free methods so we use it in this lab
Read more about andi: https://academic.oup.com/bioinformatics/article/31/8/1169/212918
0) Make a new directory to work in
mkdir lab_07
1) To run andi, you need to first prepare an input file that contains all of our genomes that we want to build a phylogeny with. You should be able to leave the input files where they are as long as we list out the entire path:
ls /courses/bi278/Course_Materials/lab_07/*.*a > andi_inputs.txt
2) Run andi:
andi --file-of-filenames=andi_inputs.txt -t 2 --join > andi_output.dist
If you run andi on your own later, NOTE it'll ignore filenames after the first "." So rename your files accordingly so you can tell them apart (no periods before the last .fa or .fna so either remove them or use an _ instead)
**Exercise 2 - IQ-TREE for phylogenies**
Notes:
- IQ-TREE = powerful phylogenetics software that estimates trees for either nucleotide or protein (amino acid) alignments using max likelihood
- most convenient feature is that it tests a large range of phylogenetic (eg substitution) models & then outputs a tree best of which model appears to be most appropriate for your data
- RECALL: we used FastTree before to generate quick phylogenies from multiple seq alignments in lab #4
- IQ-Tree has many more bells & whistles compared to FastTree but this may not be necessary depending on what your data is like
- IF data requires more complex models of seq evolution such as rate heterogeneity (ex: you need to separately model sites that evolve), data partitions (ex: you want independent models for each gene in a multi-gene alignment), etc THEN IQ-Tree is great!
Read about range of possible models: http://www.iqtree.org/doc/Substitution-Models
Now let's look at the orthogroups, species and gene trees outputted by orthofinder. Prof Noh reran orthofinder with the same taxa used for andi. These files are in: /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa
3) Identify orthogroups w/ genes from each genome. We want all the rows where orthologous groups share a common ancestor at node 0 & each genome has at least 1 gene member in the group
awk -F "\t" '$3=="n0" && $4!="" && $5!="" && $6!="" && $7!="" && $8!="" && $9!="" && $10!="" && $11!="" && $12!="" && $13!="" && $14!="" && $15!="" && $16!="" && $17!="" && $18!="" && $19!="" && $20!="" && $21!=""' /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Phylogenetic_Hierarchical_Orthogroups/N0.tsv > 18taxa_coregenes.txt
There should be 1887 orthogroups.
4) Now, we want to see how many genes are included in each orthogroup from above. If there are only single copy genes, then the number of genes should be the same as the number of genomes included in the orthoginder analysis (18).
sed 's/, /\t/g' 18taxa_coregenes.txt | awk '{print $1, NF-1}'
For demonstration purposes: Prof Noh chose to investigate OG0000084 in more detail for the arbitrary reason that some but not all taxa that appear to have a gene duplication. She wanted to compare the gene tree to the species tree.
The amino acid sequences for all members of this orthogroup have been collected here:
/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Orthogroup_Sequences/OG0000084.fa
5) Generate a multiple seq alignment using MAFFT
mafft --maxiterate 3 /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Orthogroup_Sequences/OG0000084.fa > OG0000084_mafft.faa
6) Now use MSA as input for IQ-Tree
iqtree -s OG0000084_mafft.faa
This will test all the protein substitution models it has, pick the best one for the data, do tree searching, etc & finally print a detailed log to the screen as it does this.
Prof Noh's tree is in this file: OG0000084_mafft.faa.treefile
For a preview of what the tree looks like (along w/ a bunch of other info regarding best protein substitution model), look at this file: OG0000084_mafft.faa.iqtree
**Exercise 3. Trees in R**
Here, we'll look at and compare our trees in R. The two main packages for phylogenetics in R are ape: Analyses of Phylogenetics & Evolution and phytools.
If you need answers to future questions about phytools: http://blog.phytools.org/
Let’s take inventory of the trees we will look at:
- gene tree from IQ-TREE above: OG0000084_mafft.faa.treefile
- trees that Orthofinder generates! may want to copy these two files over to your working directory.
- gene tree for the orthogroup 0000084: /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Gene_Trees/OG0000084_tree.txt
- We also have a species tree from Orthofinder: /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Species_Tree/SpeciesTree_rooted.txt
7) Go to Rstudio @ bi278.colby.edu
**From here on, enter commands into script section (uppper left) in RStudio view
8) Load phylogenetic packages:
library (apes)
library (phytools)
9) Before making comparisons, you need to take the distance matrix generated by andi & make a neighbor-joining tree from it:
andi.dist <- read.table("/home2/snoh/lab_07/andi_output.dist", header=F, row.names=1, skip=1)
colnames(andi.dist) <- rownames(andi.dist)
# infer tree from distance matrix
andi.tree <- nj(as.dist(andi.dist))
10) Look at this tree:
plot(andi.tree)
11) To make it easier to compare between trees between methods, clean up labels of the tips. Here, we use a few diff methods:
# first check what the tip labels are
andi.tree$tip.label
# substitute patterns that have to do with method or source
temp <- gsub(andi.tree$tip.label, pattern="_nanoassembly",
replacement="")
temp2 <- gsub(temp, pattern="_ncbi", replacement="")
# we can also directly substitute individual elements
temp2[1] <- "bh69"
# we can also directly substitute ranges of elements
temp2[15:18] <- c("Pfungorum", "Psprentiae", "Pterrae",
"Pxenovorans")
# last thing to clean up that was inconsistent across names
temp3 <- gsub(temp2, pattern="qs", replacement="")
# replace the original labels
andi.tree$tip.label <- temp3
# clean up unnecessary files
rm(temp, temp2, temp3)
12) Look at tree again (this time w/ node numbers):
plot(andi.tree)
nodelabels()
You should have two outgoups (B. mallei and B. pseudomallei and they are join together at a specific node):

14) Now look at species tree outputted by orthofinder:
ortho.tree <- read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Species_Tree/SpeciesTree_rooted.txt")
plot(ortho.tree)
# fix tip labels
ortho.tree$tip.label
temp <- gsub(ortho.tree$tip.label, pattern="_prokka",
replacement="")
temp2 <- gsub(temp, pattern="qs", replacement="")
ortho.tree$tip.label <- temp2
rm(temp, temp2)
# root this tree too
plot(ortho.tree)
nodelabels()
rooted.ortho <- root.phylo(ortho.tree, node=25, resolve.root=T)
plot(rooted.ortho, cex=1, label.offset=0.005)
15) Rotate nodes of this tree so the tips match more closely the tips of the tree from andi:
rotate.ortho <- rotateConstr(rooted.ortho,
rooted.andi$tip.label)
plot(rotate.ortho)
16) Compare how similar these trees are:
# compare topology (branching pattern)
all.equal(andi.tree, ortho.tree, use.edge.length=FALSE)
# plot both trees side by side
cophyloplot(rooted.andi, rotate.ortho)

Trees are not the same in topology! P. sprentiae and P. terrae are in slightly different positions, as are several of the bh# strains. This is ok BUT indicates that we can either improve our taxon sampling (e.g. include more Paraburkholderia species other than the four we have here to better resolve their relationships) or perhaps even just use andi for the very closely related taxa it’s supposed to be best for.
17) Compare gene trees now:
# the tree I generated with IQ-TREE
iq.tree <-
read.tree("/home2/snoh/lab_07/OG0000084_mafft.faa.treefile")
# check and root this tree
plot(iq.tree)
nodelabels()
rooted.iq84 <- root.phylo(iq.tree, node=33, resolve.root=T)
plot(rooted.iq84, cex=0.75, label.offset=0.005)
# the gene tree from orthofinder
gene.tree <-
read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_1
8taxa/Gene_Trees/OG0000084_tree.txt")
# check and root this tree too
plot(gene.tree)
nodelabels()
rooted.or84 <- root.phylo(gene.tree, node=55, resolve.root=T)
plot(rooted.or84, cex=0.75, label.offset=0.005)
# rotate and compare
rotate.iq84 <- rotateConstr(rooted.iq84, rooted.or84$tip.label)
all.equal(rooted.or84, rooted.iq84, use.edge.length=FALSE)
cophyloplot(rooted.or84, rotate.iq84)

These trees are not the same either! They're similar BUT we can see some interesting patterns. For example, PAGRI genes are from Paraburkholderia agricolaris strain Ba159 and there were 4 copies of this gene in the same genome (Figure 1).

There appear to be 2 copies that may be more distantly related to the ingroup than expected (PAGRI_06485 and PAGRI_07147). This is weird/not normal! NOTE: The other two copies are grouping together with other Paraburkholderia species like in the species tree. This seems to be the result of an old gene duplication event that affected both Burkholderia and Paraburkholderia. This is more of what we expected to see.
18) Save figures you created directly but also save cleaned up species trees as text files in case you want to use it in the future.
write.tree(rooted.andi, file="PATH/andi_cleaned.tre")
write.tree(rotate.ortho, file="PATH/ortho_cleaned.tre")
DON'T FORGET to save Rscript so you can come back to it & use elements from it in the future!