# Bio Lab 7
Todays lab is focused on gene and species trees, inparticular using `andi` to visualize genome phylogenies, `iqtree` for gene trees, and R to help us visualize these trees.
We can use our otrthofinder results or the genomes given by professor Noh.
## Excercise 1 andi (ANchor DIstances)
This is an allignment free method to generate a genome phylogeny. Because this form of phylogeny doesn't require an allignmnet, it can be used on highly fragemented seqeucnes such as draft genomes.
Alligment free phylogenies can be estimated using multiple different methods, but the most common is using k-mer frequencies and overlapping k-mers to generate a distance matrices and distance trees.
andi identifies long exact k-mer matches to use as homologous segments between two genemones. These homologous segments then serve as anchors and the non-homologous sections of the allignment between the two anchors is then scored for distance. So the area between two anchors is used to generate distance matrices and distance trees.
andi excels at being applied to very closely related genomes (within the same species or genus) but the method can be applied to other allignment-free methods so it can be used here too.
link to paper with more info: https://academic.oup.com/bioinformatics/article/31/8/1169/212918
0. Make a new directory to work in this week
`mkdir lab_7`
1. Prepare an input file that contains all of our genomes that we want to build a phylogeny with. You should be able to leave the input files where they are as long as we list out the entire path
`ls /courses/bi278/Course_Materials/lab_07/*.*a > andi_inputs.txt`
after you do this try to double check to make sure that worked by using `ls`
2. Running andi
`andi --file-of-filenames=andi_inputs.txt -t 2 --join > andi_output.dist`
It may take a second to run
NOTES about andi from Prof. Noh: "it will ignore filenames after the first . (period). So rename your files accordingly so you can tell them apart. For example, if your genome files were called pbonn.bb859.fna, pbonn.bb433.fna, then rename them to pbonn_bb859.fna, pbonn_bb433.fna because if you don’t, andi will report both genomes just as pbonn and pbonn"
## Excercise 2 IQ-TREE for gene trees
IQ-Tree (iqtree) buils phyolgdenic trees from both amino acid or nucleotide allignments using maximum likelihood. The software is built on multiple phylogenetic (e.g. substitution) models, and it tests all models while building a tree then outputs the best tree based on which model seems most appropriate for our data.
Prof Noh mentions that compared to fasttree (which we've previously use), IQ-Tree provides more options and flexibility. For example if your data requires a model that includes concepts such as rate heterogeneity or data partitions iq tree will provide the best results since it will run all of the models.
Heres a link to a page about IQ Tree and Substitution models http://www.iqtree.org/doc/Substitution-Models
Prof Noh also suggets that iqtree may be helpful for us if we want to look in depth in a protien we found interesting in our orthogroup. We can use iqtree to compare that protien and its evolution. (Try to compare it to another closely related protien that isn't in the in group)
The example we are doing today is looking into a sensory kinase protien, to get the protien info for this we can pull information from NCBI. When pulling aditional sequences on a protien avoid your ingroup (Burkholderia and Paraburkholderia in this case) becuas you have already revealed intial results from those.
Prof. Noh also increased her max target sequences to 500, but doesn't explain why.
Prof Noh ended up selcting 15 sequences from NCBI for us to run IQ Tree on. She took the seauences from all different genera, including some closely and distantly related species to the ingroup. You can directly download all of these through NCBI.
Prof Noh used an empty nano window to save the files, but you can also use the filer.
Runing IQ-Tree
1. Running a Multiple Sequence Allignment as an input. We will use mafft and some different options than we used in a previous week. We want to specify using a local alignment method with a generalized affine gap penalty function.
First combine all the sequences into one file
` cat /courses/bi278/Course_Materials/lab_07/ncbi_sensor_kinase.txt /courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Orthogroup_Sequences/OG0002162.fa > OG0002162_plus_ncbi.fa
`
Then run mafft
`mafft --genafpair --maxiterate 1000 OG0002162_plus_ncbi.fa > OG0002162_mafft.faa`
3. Now run iqtree
`iqtree -s OG0002162_mafft.faa`
IQ Tree tests all the protein substitution models it has, pick the best one for the data, does tree searching, etc, and will print a detailed log to the screen as it does this.
4. Previewing the tree (Not required)
If you want to preview the tree you can do that by using this command
`Head FILENAME
`
## Excercise 3 Tree Visualization in R
R cn be use to look at an compare different tree. R has two main packages for phylogenetics: ape - Analyses of Phylogenetics and Evolution and another Package called phytools. Phytools has alot more helpful information on the internet so you may want to start by using that.
Phytools Help: http://blog.phytools.org
The two data files we will use for this are:
distance matrix (to make a species tree) from andi above:
`andi_output.dist`
gene tree from IQ-TREE above:
`OG0002162_mafft.faa.treefile`
1. Log into R studio at https://bi278.colby.edu
All commands from here on can be entered into the script section on the upper left side. and executed there by leaving your cursor on the line and clicking run.
2. Load the phylogenetics packages
```
library(ape)
library(phytools)
```
3. Before you can compare, you need to take the andi distance matrix and form a neighbor-joining tree
```
andi.dist <- read.table("~/lab_07/andi_output.dist", header=F, row.names=1, skip=1)
colnames(andi.dist) <- rownames(andi.dist)
# infer tree from distance matrix
andi.tree <- nj(as.dist(andi.dist))
```
4. Look at the new tree formed
```
plot(andi.tree)
```
5. For intrpretation we will clean up the plot by using the following methods.
```
# first check what the tip labels are
andi.tree$tip.label
# substitute patterns that have to do with method or source
temp <- gsub(andi.tree$tip.label, pattern="_nanoassembly", replacement="")
temp2 <- gsub(temp, pattern="_ncbi", replacement="")
# check what it looks like now
temp2
# we can also directly substitute individual elements
temp2[1] <- "bh69"
# we can also directly substitute ranges of elements
temp2[13:14] <- c("bmall", "bpseu")
# last thing to clean up that was inconsistent across names
temp3 <- gsub(temp2, pattern="qs", replacement="")
# check what it looks like now
temp3
# replace the original labels
andi.tree$tip.label <- temp3
# clean up unnecessary files
rm(temp, temp2, temp3)
```
6. Now you can look at the tree with the new labels
```
plot(andi.tree)
nodelabels()
```
7. Root the tree with the outgroups
Our outgroups here are B. mallei and B. pseudomallei which are grouped together at node=34
```
rooted.andi <- root.phylo(andi.tree, node=34, resolve.root=T)
plot(rooted.andi, cex=1, label.offset=0.005)
```
8. Lets compare and look at a tree outputed by orthofinder
```
ortho.tree <- read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Species_Tree/SpeciesTree_rooted.txt")
plot(ortho.tree)
# fix tip labels
ortho.tree$tip.label
temp <- gsub(ortho.tree$tip.label, pattern="_prokka", replacement="")
temp2 <- gsub(temp, pattern="qs", replacement="")
temp2[6:7] <- c("Bmallei", "Bpseudomallei")
# check
temp2
# replace labels
ortho.tree$tip.label <- temp2
rm(temp, temp2)
# root this tree too
plot(ortho.tree)
nodelabels()
rooted.ortho <- root.phylo(ortho.tree, node=25, resolve.root=T)
plot(rooted.ortho, cex=1, label.offset=0.005)
```
9. Rotate the ortho tree to compare with the andi tree (So the nodes are alligned similarly)
```
rotate.ortho <- rotateConstr(rooted.ortho, rooted.andi$tip.label)
plot(rotate.ortho)
```
10. Compare how similar the trees are
```
# compare topology (branching pattern)
all.equal(andi.tree, ortho.tree, use.edge.length=FALSE)
# plot both trees side by side
cophyloplot(rooted.andi, rotate.ortho)
```

Prof Noh on the results
"My trees are not exactly the same in topology. P. sprentiae and P. terrae are in slightly different positions, as are several of the bh# strains. This is ok, but does indicate that we can either improve our taxon sampling (e.g. include more Paraburkholderia species other than the four we have here to better resolve their relationships) or perhaps even just use andi for the very closely related taxa it’s supposed to be best for."
11. Looking into the gene tree
```
# the tree you generated with IQ-TREE
iq.tree <- read.tree("~/lab_7/OG0002162_mafft.faa.treefile")
# check and root this tree
plot(iq.tree)
nodelabels()
rooted.iq <- root.phylo(iq.tree, node=64, resolve.root=T)
plot(rooted.iq, cex=0.75, label.offset=0.005)
```
This shows that the sensory kinase from the ingroup are more closely related to each other than any of the additional sequences found on NCBI. This suggets that the gene is inherited vertically. To have gorizontal inheretance you would expect the gene to be more closely related to the distantly related genomes.
12. Save figure
```
write.tree(rooted.andi, file="~/lab_7/andi_cleaned.tre")
write.tree(rotate.iq, file="~/lab_7/iq_sensorkinase.tre")
```
## Excercise 4 Using dendextend
To use dendextend on trees you have built you need to convert the newick treefile format to dendrogram format. The package decipher can do this.
Dendextend is helpful to change the visuals of the tree, you can color your taxa, show their clade, or do any other visual editing.
1. Convert to dendrogram
1. `# install if not present
library(DECIPHER)
library(dendextend)`
2. read in the tree and plot it
```
iq.dend <- ReadDendrogram("~/lab_7/OG0002162_mafft.faa.treefile")
plot(iq.dend)
```
2. Adjust margins
You can adjust the margins in the plot window. Then plot again to see if the labels are visible now:
```
par(mar = c(8,4,2,2) + 0.1) #bottom, left, top, right; default is c(5,4,4,2)
plot(iq.dend)
```
3. The order of taxa on the dendrogram is identical to what you see in the plot window:
```
iq.dend %>% labels
```
4. Color the leaves manually based on label order just checked.
```
iq.dend %>% set("leaves_pch", 19) %>% set("leaves_cex", 1) %>% set("leaves_col", c(rep(1,15),2,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4,5,5)) %>% plot(type="rectangle")
```
5. We can also rotate the plot and change the circles to square
```
par(mar = c(4,2,2,8))
iq.dend %>% set("leaves_pch", 15) %>% set("leaves_cex", 1) %>% set("leaves_col", c(rep(1,15),2,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4,5,5)) %>% plot(type="rectangle", horiz=TRUE)
```
6. We can also change the color of the labels
```
iq.dend %>% set("labels_colors", c(rep(1,15),2,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4,5,5)) %>% plot(type="rectangle", horiz=TRUE)
```
7. This resets the margins back to the deafualt afterwards
```
par(mar = c(5,4,4,2))
```
You can do alot of different formating with Dendextend, This is the link to more information on its capabilities: https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html