# BI278 Lab 7 Notes
### Exercise 1. andi (ANchor DIstances)
- andi identifies long exact matches between two genomes which it uses as homologous segments. It then looks at the mismatched regions bracketed by these anchors and estimates a distance matrix. Ideally, it is used for very closely related genomes.
- Running andi is quite simple but we need to first prepare an input file that contains all our genomes that we want to build a phylogeny with.
1. For this lab:
`ls /courses/bi278/Course_Materials/lab_07/*.* > andi_inputs.txt`
2. To run andi:
`andi --file-of-filenames=andi_inputs.txt -t 2 --join > andi_output.dist`
### Exercise 2. IQ-Tree for phylogenies
- IQ-TREE estimates trees for either nucleotide or protein alignments using maximum likelihood.
- tests a large range of phylogenetic models for you then outputs a tree best of which model appears to be most appropriate for your data.
3. To identify orthogroups with genes present from each genome, find the rows where orthologous groups share a common ancestor at node 0 and each genome has at least one gene member in the group. Utilizes this command:
`awk -F "\t" '$3=="n0" && $4!="" && $5!="" && $6!="" && $7!=""
&& $8!="" && $9!="" && $10!="" && $11!="" && $12!="" && $13!=""
&& $14!="" && $15!="" && $16!="" && $17!="" && $18!="" &&
$19!="" && $20!="" && $21!=""'
/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Phylo
genetic_Hierarchical_Orthogroups/N0.tsv > 18taxa_coregenes.txt`
4. To see how many genes are included in each orthogroup from the above command use:
`sed 's/, /\t/g' 18taxa_coregenes.txt | awk '{print $1, NF-1}'`
- This should give a list of a bunch of different orthogroups. Then you can collect the amino acid sequences for all the members of an orthogroup (from this list) of your choosing.
### Mistake here! I didn't choose a file found after the sed command, and therefore didn't limit the search enough.
- In this lab, I chose to use the orthogroup titled OG0006491
to view the amino acid sequences in this lab, we utilize the command:
`/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Orthog
roup_Sequences/OG0006491.fa`
5. To generate a multiple sequence alignment, you can use MAFFT:
`mafft --maxiterate 3
/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Ortho
group_Sequences/OG0006491.fa > OG0006491_mafft.faa`
6. Now, you can use this Multiple Sequence Alignments as input for an iqtree:
`iqtree -s OG0006491_mafft.faa`
- this tests the protein substitution models it has, picks the best one for your data, and does tree searching then prints a detailed log to the screen as it does it.
- At the end, it gave me a little message stating where to find specific items about the tree. My maximum-likelihood tree was found in a file called "OG0006491_mafft.faa.treefile"
- This command also gives an overarching review of the information (along with a preview of the tree) in the file called "OG0006491_mafft.faa.iqtree"
### Exercise 3. Trees in R
Now, we can compare trees in R. Phylogenetics packages in R are "ape: Analyses of Phylogenetics and Evolution" as well as "phytools". The developer of phytools created this website:
http://blog.phytools.org/
Which is very helpful for answering questions regarding phytools.
You can then copy trees from orthofinder that you may have found:
(for this example)
``/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Gene
_Trees/OG0000084_tree.txt``
and
`/courses/bi278/Course_Materials/lab_07/orthofinder_18taxa/Spec
ies_Tree/SpeciesTree_rooted.txt`
7. Now, we can compare these trees.
- First we log into to RStudio at bi278.colby.edu
- Within RStudio, all commands typed for the following steps can be entered in the top left box, and executed by typing command+return
- the cursor has to be on the line you are running it on though!
8. To load the phylogenetics packages we use: (written in R)
```
library(ape)
library(phytools)
```
9. we then take the distance matrix generated by andi and make a neighbor-joining tree from it.
```
andi.dist <- read.table("/home2/snoh/lab_07/andi_output.dist",
header=F, row.names=1, skip=1)
colnames(andi.dist) <- rownames(andi.dist)
# infer tree from distance matrix andi.tree <- nj(as.dist(andi.dist))
```
10. To take a look at this tree:
`plot(andi.tree)`
11. Now to make it easier to compare trees between methods, you can clean up the labels of the tips with a few different methods:
- first check what the tip labels are
andi.tree$tip.label
- substitute patterns that have to do with method or source
temp <- gsub(andi.tree$tip.label, pattern="_nanoassembly",
replacement="")
temp2 <- gsub(temp, pattern="_ncbi", replacement="")
- we can also directly substitute individual elements
temp2[1] <- "bh69"
- we can also directly substitute ranges of elements
temp2[15:18] <- c("Pfungorum", "Psprentiae", "Pterrae",
"Pxenovorans")
- last thing to clean up that was inconsistent across names
temp3 <- gsub(temp2, pattern="qs", replacement="")
- replace the original labels
andi.tree$tip.label <- temp3
- clean up unnecessary files
rm(temp, temp2, temp3)
12. Now we can look at our tree again, and this time with node numbers.
```
plot(andi.tree)
nodelabels()
```
13. We can also root our tree with the outgroup node, then look at it again:
```
rooted.andi <- root.phylo(andi.tree, node=34, resolve.root=T)
plot(rooted.andi, cex=1, label.offset=0.005)
```
14. Now, we can look at the species tree outputted by orthofinder:
```
ortho.tree <-
read.tree("/courses/bi278/Course_Materials/lab_07/orthofinder_1
8taxa/Species_Tree/SpeciesTree_rooted.txt")
plot(ortho.tree)
```
fix tip labels:
```
ortho.tree$tip.label
temp <- gsub(ortho.tree$tip.label, pattern="_prokka",
replacement="")
temp2 <- gsub(temp, pattern="qs", replacement="")
ortho.tree$tip.label <- temp2
rm(temp, temp2)
```
To root this tree too:
```
plot(ortho.tree)
nodelabels()
rooted.ortho <- root.phylo(ortho.tree, node=25, resolve.root=T)
plot(rooted.ortho, cex=1, label.offset=0.005)
```
15. Now we can rotate the nodes of this tree so the tips more closely match the tips of the tree from andi:
```
rotate.ortho <- rotateConstr(rooted.ortho, rooted.andi$tip.label)
plot(rotate.ortho)
```
16. Now we can compare how similar these trees are:
- compare topology (branching pattern)
```
all.equal(andi.tree, ortho.tree, use.edge.length=FALSE)
```
- plot both trees side by side
```
cophyloplot(rooted.andi, rotate.ortho)
```
- upon analysis of side by side comparisons, if there are notable differences between the trees, we can assume that we can either improve our taxon sampling (include more Paraburkholderia species other than the four we have here to better resolve their relationships in this example) or perhaps even just use andi for the very closely related taxa it's supposed to be best for.
17. Now we can compare our gene trees.
- the tree generated with IQ-Tree:
```
iqtree <- read.tree("/home2/jereil26/lab_07/OG0006491_mafft.faa.treefile")
```
- to check and root this tree:
```
plot(iq.tree)
nodelabels()
rooted.iq84 <- root.phylo(iq.tree, node=33, resolve.root=T)
plot(rooted.iq84, cex=0.75, label.offset=0.005)
```
- rotate and compare
```
rotate.iq84 <- rotateConstr(rooted.iq84, rooted.or84$tip.label)
all.equal(rooted.or84, rooted.iq84, use.edge.length=FALSE)
cophyloplot(rooted.or84, rotate.iq84)
```
## Mistake here: due to my random selection of the OG group (didn't fully follow instructions earlier) my tree was too small to be applied in R well, if a different (and applicable) orthologous group was chosen and the same steps occurred, I could have been in the correct place at this step :(
18. Last step is saving figures you created. You can do this directly, but you can also save species trees as text files in case you want to use it in the future:
```
write.tree(rooted.andi, file="PATH/andi_cleaned.tre")
write.tree(rotate.ortho, file="PATH/ortho_cleaned.tre")
```
I saved these two trees in my home2 under lab_07.
I saved the rscript for this lab as "lab7rscript.r"