# BI278 Lab#7
### Exercise 1. andi (ANchor DIstances)
0. Make a new directory to work in.
```
mkdir lab_07
```
1. Running andi is quite simple (it doesn’t have many options) but we need to first prepare an input file that contains all of our genomes that we want to build a phylogeny with. You should be able to leave the input files where they are as long as we list out the entire path:
```
ls /courses/bi278/Course_Materials/lab_07/*.*a > andi_inputs.txt
```
2. Now let’s run andi:
```
andi --file-of-filenames=andi_inputs.txt -t 2 --join > andi_output.dist
```
Later if you do run andi on your own, note that it will ignore filenames after the first “.” (period). So rename your files accordingly so you can tell them apart (no periods before the last .fa or .fna so either remove them or use an _ instead).
### Exercise 2. IQ-TREE for phylogenies
3. First, I want to identify orthogroups with genes present from each genome. I want all the rows where orthologous groups share a common ancestor at node 0 (zero) and each genome has at least one gene member in the group.
```
awk -F "\t" '$3=="n0" && $4!="" && $5!="" && $6!="" && $7!="" && $8!="" && $9!="" && $10!="" && $11!="" && $12!="" && $13!="" && $14!="" && $15!="" && $16!="" && $17!="" && $18!="" && $19!="" && $20!="" && $21!=""' /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Phylogenetic_Hierarchical_Orthogroups/N0.tsv > 18taxa_coregenes.txt
```
There appear to be 1887 such orthogroups.
4. Next I want to see how many genes are included in each orthogroup from above. If there are only single copy genes, then the number of genes should be the same as the number of genomes included in the orthofinder analysis (18).
```
sed 's/, /\t/g' 18taxa_coregenes.txt | awk '{print $1, NF-1}'
```
From the results, for demonstration purposes I’m going to choose in investigate OG0000084 in more detail for the arbitrary reason that some but not all taxa that appear to have a gene duplication. (Feel free to choose a different orthogroup if you wish.) I just want to compare the gene tree to the species tree.
The amino acid sequences for all members of this orthogroup have been collected here:
```
/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Orthogroup_Sequences/OG0000084.fa
```
5. I need to generate a multiple sequence alignment (MSA). I’ll use MAFFT.
```
mafft --maxiterate 3 OG0000084.fa > OG0000084_mafft.faa
```
6. Now I can use the MSA as input for iqtree.
```
iqtree -s OG0000084_mafft.faa
```
It will test all the protein substitution models it has, pick the best one for my data, do tree searching, etc, and will print a detailed log to the screen as it does this.
My tree is in this file:
>OG0000084_mafft.faa.treefile
>
But I can see a preview of what the tree looks like, along with a bunch of other information regarding the best protein substitution model, in this file:
>OG0000084_mafft.faa.iqtree
### Exercise 3. trees in R
7. On an internet browser, go and log on to RStudio
8. Let’s load our phylogenetics packages.
```
library(ape)
library(phytools)
```
9. Before any comparisons, we need to take the distance matrix generated by andi and make a neighbor-joining tree from it.
```
andi.dist <- read.table("/home2/xsi25/lab_07/andi_output.dist",
header=F, row.names=1, skip=1)
colnames(andi.dist) <- rownames(andi.dist)
# infer tree from distance matrix
andi.tree <- nj(as.dist(andi.dist))
```
10. Now let’s take a look at this tree.
```
plot(andi.tree)
```
11. To make it a little easier to compare trees between methods, let’s clean up the labels of the tips. We are employing a few different methods here:
```
# first check what the tip labels are
andi.tree$tip.label
# substitute patterns that have to do with method or source
temp <- gsub(andi.tree$tip.label, pattern="_nanoassembly",
replacement="")
temp2 <- gsub(temp, pattern="_ncbi", replacement="")
# we can also directly substitute individual elements
temp2[1] <- "bh69"
# we can also directly substitute ranges of elements
temp2[15:18] <- c("Pfungorum", "Psprentiae", "Pterrae",
"Pxenovorans")
# last thing to clean up that was inconsistent across names
temp3 <- gsub(temp2, pattern="qs", replacement="")
# replace the original labels
andi.tree$tip.label <- temp3
# clean up unnecessary files
rm(temp, temp2, temp3)
```
12. Let’s look at our tree again, this time with node numbers
```
plot(andi.tree)
nodelabels()
```
I had two outgroups (B. mallei and B. pseudomallei) and they are join together at a specific node.
13. Let’s also root our tree with the outgroup node, then look at it again:
```
rooted.andi <- root.phylo(andi.tree, node=34, resolve.root=T)
plot(rooted.andi, cex=1, label.offset=0.005)
```
14. Now let’s look at the species tree outputted by orthofinder.
```
ortho.tree <-
read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Species_Tree/SpeciesTree_rooted.txt")
plot(ortho.tree)
# fix tip labels
ortho.tree$tip.label
temp <- gsub(ortho.tree$tip.label, pattern="_prokka",
replacement="")
temp2 <- gsub(temp, pattern="qs", replacement="")
ortho.tree$tip.label <- temp2
rm(temp, temp2)
# root this tree too
plot(ortho.tree)
nodelabels()
rooted.ortho <- root.phylo(ortho.tree, node=25, resolve.root=T)
plot(rooted.ortho, cex=1, label.offset=0.005)
```
15. Let’s rotate the nodes of this tree so the tips match more closely the tips of the tree from andi.
```
rotate.ortho <- rotateConstr(rooted.ortho,
rooted.andi$tip.label)
plot(rotate.ortho)
```
16. Let’s compare how similar these trees are.
```
# compare topology (branching pattern)
all.equal(andi.tree, ortho.tree, use.edge.length=FALSE)
[1] FALSE
# plot both trees side by side
cophyloplot(rooted.andi, rotate.ortho)
```
My trees are not the same in topology. P. sprentiae and P. terrae are in slightly different positions, as are several of the bh# strains. This is ok, but does indicate that we can either improve our taxon sampling (e.g. include more Paraburkholderia species other than the four we have here to better resolve their relationships) or perhaps even just use andi for the very closely related taxa it’s supposed to be best for.
17. Let’s compare our gene trees now.
```
# the tree I generated with IQ-TREE
iq.tree <-
read.tree("/home2/xsi25/lab_07/OG0000084_mafft.faa.treefile")
# check and root this tree
plot(iq.tree)
nodelabels()
rooted.iq84 <- root.phylo(iq.tree, node=33, resolve.root=T)
plot(rooted.iq84, cex=0.75, label.offset=0.005)
# the gene tree from orthofinder
gene.tree <-
read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Gene_Trees/OG0000084_tree.txt")
# check and root this tree too
plot(gene.tree)
nodelabels()
rooted.or84 <- root.phylo(gene.tree, node=55, resolve.root=T)
plot(rooted.or84, cex=0.75, label.offset=0.005)
# rotate and compare
rotate.iq84 <- rotateConstr(rooted.iq84, rooted.or84$tip.label)
all.equal(rooted.or84, rooted.iq84, use.edge.length=FALSE)
cophyloplot(rooted.or84, rotate.iq84)
```
These trees are not the same either. They are certainly similar, but I can see some interesting patterns. For example, PAGRI genes are from Paraburkholderia agricolaris
strain Ba159 and there were 4 copies of this gene in the same genome (Figure 1).
There appear to be 2 copies that may be more distantly related to the ingroup than expected (PAGRI_06485 and PAGRI_07147). This is weird! The other two copies are grouping together with other Paraburkholderia species like in the species tree. This seems to be the result of an old gene duplication event that affected both Burkholderia and Paraburkholderia. This is more of what I expected to see.
18. Lastly, you can save any figures that you created directly, but let’s also save our cleaned up species trees as text files in case you want to use it in the future.
```
write.tree(rooted.andi, file="/home2/xsi25/lab_07/andi_cleaned.tre")
write.tree(rotate.ortho, file="/home2/xsi25/lab_07/ortho_cleaned.tre")
```
The comparison of gene trees (orthofinder vs iqtree)
