# BI278 Lab 7: Phylogenetics
- use andi for whole genome phylogenies
- use iqtree for single gene phylogenies
- use R for visualizing and manipulating trees
**Excersize 1: andi (anchor distances)**
Alignment free methods for generating phylogenies. Most use k-mer frequencies and overlapping k-mers to generate distance matrices and distance trees. Andi identifies long exact matches to use as homologous segments between two genomes, then looks at the mismatched regions bracketed by the anchors and estimates a distance matrix.
0. Make a new directory
```
mkdir lab_07
```
1. Running andi is simple, but we first need to prepare an input file that contains all of our genomes that we want to build a phylogeny with. (need to list out entire path)
```
ls /courses/bi278/Course_Materials/lab_07/*.*a > andi_inputs.txt
```
2. Run andi:
```
andi --file-of-filenames=andi_inputs.txt -t 2 --join > andi_output.dist
```
**Excersize 2: IQ-TREE for phylogenies**
IQ-TREE is a powerful phylogenetics software that estimates trees for either nucleotide or protein alignments using maximum likelihood. It tests a large range of phylogenetic models and outputs a tree best of which model appears to be most appropriate for hte data.
FastTree was used in lab 4, but IQ-TREE is more complex, so it's used if your data require more complex models of sequence evolution like heterogeniety or data partitions.
Look at orthogroups, species and gene trees outputted by orthofinder. She reran orthofinder with the same taxa used for andi, which are in
```
/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa
```
3. Identify orthogroups with genes present from each genome. All rows where orthologous groups share a common ancestor should be at node 0 and each genome shuold have at least one gene member in the group
```
awk -F "\t" '$3=="n0" && $4!="" && $5!="" && $6!="" && $7!="" && $8!="" && $9!="" && $10!="" && $11!="" && $12!="" && $13!="" && $14!="" && $15!="" && $16!="" && $17!="" && $18!="" && $19!="" && $20!="" && $21!=""' /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Phylogenetic_Hierarchical_Orthogroups/N0.tsv > 18taxa_coregenes.txt
```
There appear to be 1887 such orthogroups.
4. Next, we want to see how many genes are included in each orthogroup above. If there are only single genes, then the number of genes should be the same as the number of genomes included in the orthofinder analysis (18).
```
sed 's/, /\t/g' 18taxa_coregenes.txt | awk '{print $1, NF-1}'
```
Prof. noh is investigating Og0000084 in more detail (bc some but not all taxa appear to have gene duplication).
The amino acid sequences for all members of the orthogroup are here:
```
/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Orthog
roup_Sequences/OG0000084.fa
```
5. Generate an MSA using MAFFT
```
mafft --maxiterate 3 /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Orthogroup_Sequences/OG0000084.fa > OG0000084_mafft.faa
```
6. Now can use MSA as input for igtree
```
iqtree -s OG0000084_mafft.faa
```
It will print a log to the screen as it tests all of the protein substitution models that it has, picks the best one for the data, and does tree searching, etc.
Tree is in file:```OG0000084_mafft.faa.treefile```
But you can see a preview of wha thte tree looks like, along with a bunch of other information regarding hte best f=protein substitution model in this file: ```OG0000084_mafft.faa.igtree```
**Excersize 3: Trees in R**
Look at and compare trees in R. The two main packages for phylogenetics in R are ```ape: Analyses of Phylogenetics and Evolution``` and ```phytools```.
The trees to look at:
- Gene tree from IQ-TREE: ```OG0000084_mafft.faa.treefile```
- Gene tree from orthofinder (orthogroup 0000084): ```/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Gene_Trees/OG0000084_tree.txt```
- Species tree from orthofinder: ```/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Species_Tree/SpeciesTree_rooted.txt```
Copy second two files over to working directory
```
cp /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Gene_Trees/OG0000084_tree.txt ~/lab_07
cp /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Species_Tree/SpeciesTree_rooted.txt ~/lab_07
```
7. In internet browser log onto RStudio at bi278.colby.edu
- Command, enter to run command in R
8. Load phylogenetics packages
```
library(ape)
library(phytools)
```
9. Need to take the distance matrix generated by andi and make a neighbor-joining tree from it
```
andi.dist <- read.table("/home2/cgsnow25/lab_07/andi_output.dist", header=F, row.names=1, skip=1)
colnames(andi.dist) <- rownames(andi.dist)
# infer tree from distance matrix
andi.tree <- nj(as.dist(andi.dist))
```
10. plot tree
```
plot(andi.tree)
```
11. To make it easier to compare trees between methos, clean up the labels of the tips
```
#first check what the tip labels are
andi.tree$tip.label
# substitute patterns that have to do with method or source
temp <- gsub(andi.tree$tip.label, pattern="_nanoassembly", replacement="")
temp2 <- gsub(temp, pattern="_ncbi", replacement="")
# we can also directly substitute individual elements
temp2[1] <- "bh69"
# we can also directly substitute ranges of elements
temp2[15:18] <- c("Pfungorum", "Psprentiae", "Pterrae", "Pxenovorans")
# last thing to clean up that was inconsistent across names
temp3 <- gsub(temp2, pattern="qs", replacement="")
# replace the original labels
andi.tree$tip.label <- temp3
# clean up unnecessary files
rm(temp, temp2, temp3)
```
12. Look at tree again with node numbers
```
plot(andi.tree)
nodelabels()
```
Two outgroups (B. mellei and B. pseudomallei) and they join together at a specific node.
13. Root tree with the outgroup node, then look at it again
```
rooted.andi <- root.phylo(andi.tree, node=34, resolve.root=T)
plot(rooted.andi, cex=1, label.offset=0.005)
```
14. Let's look at the species tree outputted by orthofinder
```
ortho.tree <- read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Species_Tree/SpeciesTree_rooted.txt")
plot(ortho.tree)
# fix tip labels
ortho.tree$tip.label
temp <- gsub(ortho.tree$tip.label, pattern="_prokka", replacement="")
temp2 <- gsub(temp, pattern="qs", replacement="")
ortho.tree$tip.label <- temp2
rm(temp, temp2)
# root this tree too
plot(ortho.tree)
nodelabels()
rooted.ortho <- root.phylo(ortho.tree, node=25, resolve.root=T)
plot(rooted.ortho, cex=1, label.offset=0.005)
```
15. Rotate the nodes of this tree so the tips match more closely the tips of the tree from andi
```
rotate.ortho <- rotateConstr(rooted.ortho, rooted.andi$tip.label)
plot(rotate.ortho)
```
16. Compare how similar the trees are
```
# compare topology (branching pattern)
all.equal(andi.tree, ortho.tree, use.edge.length=FALSE)
# plot both trees side by side
cophyloplot(rooted.andi, rotate.ortho)
```

The trees are not in the same topology. *P. sperentiae* and *P. terrae* are in slightly different positions, as are several of the bh# strains. It indicates that we can either improve our taxon sampling (include more Paraburkholderia species other than the four we have here to better reseolve their relationships) or just use andi for the very closely related taxa it's supposed to be best for.
17. Let's compare our gene trees now.
```
# the tree I generated with IQ-TREE
iq.tree <- read.tree("/home2/cgsnow25/lab_07/OG0000084_mafft.faa.treefile")
# check and root this tree
plot(iq.tree)
nodelabels()
rooted.iq84 <- root.phylo(iq.tree, node=33, resolve.root=T)
plot(rooted.iq84, cex=0.75, label.offset=0.005)
# the gene tree from orthofinder
gene.tree <- read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Gene_Trees/OG0000084_tree.txt")
# check and root this tree too
plot(gene.tree)
nodelabels()
rooted.or84 <- root.phylo(gene.tree, node=55, resolve.root=T)
plot(rooted.or84, cex=0.75, label.offset=0.005)
# rotate and compare
rotate.iq84 <- rotateConstr(rooted.iq84, rooted.or84$tip.label)
all.equal(rooted.or84, rooted.iq84, use.edge.length=FALSE)
cophyloplot(rooted.or84, rotate.iq84)
```

The trees are not the same, either. They are very similar, but there are interesting patterns. Eg, PAGRI genes from *Parabulkholderia agricolaris* strain Ba159 and there were 4 copies of this gene in the same genome.
There appear to be 2 copies that may be more distantly related to the ingroup than expected. The other two copies are grouping together with other Parabulkholderia species like in teh species tree. This seems to eb the result of an old gene duplication event that affected both *Bulkholderia* and *Parabulkholderia*.