# BI278 LAB 8 Notes
## Data Visualization in R
(R- bi278.colby.edu)
Given example dataset:

Using a few different ordination and clustering approaches, we can explore these data.
## Exercise 1. Ordination and Clustering
#### 1.1 Principal Components Analysis
Ordination methods are a family of approaches used to understand multivariate data.
- Principal Components Analysis (PCA) aims to take multiple variables measured on samples and reduce them to a smaller number of composite variables that capture the similarities and differences among the samples.
- Non-metric multidimensional scaling (NMDS) looks at distances between pairs of samples and tries to figure out how these samples are arranged in multivariate space.
PCA:
1. Use the following commands to load the required libraries:
```
x<-c("dendextend", "ggplot2", "vegan", "reshape2", "ggrepel")
lapply(x, require, character.only = TRUE)
```
2. Read in data file as a table named cdata and take a look at it.
```
cdata <- read.table("PATH/multivariate_data.txt", h=T)
head(cdata)
```
3. Set row names of cdata:
`rownames(cdata) <- cdata$sample`
4. Make a simple scatterplot. x and y variables can be changed to any of the columns in your object cdata.
```
ggplot(cdata, aes(x=gene1, y=gene2)) + geom_point() +
geom_text(aes(label=sample), hjust=-.3, vjust=0)
```
This command will take your table and tell the graphing function ggplot() to use gene1 as the x variable and gene2 as the y variable. Then it will make a scatterplot and add text labels so you can tell which dot is which.
5. To run a PCA on these data and save the results to a new dataframe:
```
cdata.pca <- prcomp(cdata[,c(3:10)], scale=T)
cdata.pca.res <- as.data.frame(cdata.pca$x)
```
"3:10" can change in this command. We haven chose to cscale (~normalize) our data.
6. To take a look at the results:
`summary(cdata.pca)`
Results look like this:

- The output indicates how much of the similarities and differences (variances) in our data are captured by the composite/reduced variables called principal components (PC1, PC2, etc.), reported cumulatively. Our numbers indicate that PC1 and PC2 together capture 67.6% of the total variance present in our data (Cumulative Proportion row).
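To make the "Cumulative Proportion" idea concrete, here is a minimal base-R sketch on simulated toy data (not the lab dataset); the same `summary(...)$importance` pattern applies to `cdata.pca`:

```r
# Toy illustration: the "Cumulative Proportion" row of summary(prcomp(...))
# reports how much total variance is captured by PC1, PC1+PC2, etc.
set.seed(1)
toy <- matrix(rnorm(9 * 8), nrow = 9)          # 9 samples, 8 variables
toy.pca <- prcomp(toy, scale = TRUE)
cumvar <- summary(toy.pca)$importance["Cumulative Proportion", ]
cumvar["PC2"]   # fraction of total variance captured by PC1 + PC2 together
```

With the lab data, `summary(cdata.pca)$importance["Cumulative Proportion", "PC2"]` would pull out the 67.6% figure directly.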
7. To make a scatterplot of these new composite variables using the PCA results we captured:
```
ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5)
```
8. To bring the taxon labels over and then add them as a new column to this new dataframe, then make a new scatterplot:
```
cdata.pca.res$taxonomy <- cdata$taxonomy
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5)
```
9. If you want to know whether the samples are separated by your experimental variable "path", you can add it as a variable and visualize it:
```
path = c(rep("green",5), rep("blue",4))
cdata.pca.res$path <- path
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5)
```
Once again, the coloring variable as well as "rownames(cdata.pca.res)" can be changed in this command.
10. If points or text overlap, this function from ggrepel is a nice way to make things clearer:
```
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_label_repel(aes(label=rownames(cdata.pca.res)))
```
11. Given how samples fall out on the PCA plot, we may want to know how the original factors (genes, etc.) were combined into these composite variables. We can do this by checking the loadings:
`cdata.pca$rotation`
- The results of this command show you the relative influences of your factors on each of the PCs (principal components). For example, I see for PC1 that gene1 and gene7 have higher positive influences while gene3 and gene5 have higher negative influences. So samples toward the right-hand side of PC1 should have higher abundances of gene1 and gene7 and lower abundances of gene3 and gene5 compared to other samples.
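A quick way to rank the influences rather than scanning the matrix by eye is to sort one column of the loadings. The sketch below uses simulated toy data with hypothetical gene names (with the lab data you would sort `cdata.pca$rotation[, "PC1"]`):

```r
# Sort PC1 loadings from most negative to most positive influence.
set.seed(1)
toy <- matrix(rnorm(9 * 8), nrow = 9,
              dimnames = list(NULL, paste0("gene", 1:8)))
toy.pca <- prcomp(toy, scale = TRUE)
pc1.loadings <- sort(toy.pca$rotation[, "PC1"])
pc1.loadings   # names at the ends of this vector drive PC1 the most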
#### 1.2 Nonmetric Multidimensional Scaling
NMDS is a popular and robust technique because it is based on ranks rather than on the assumption that the data are normally distributed.
12. First, we turn the data into a dissimilarity matrix; here we use Bray-Curtis. Bray-Curtis is commonly used because it handles changes in composition and relative abundance well, even when some data are missing.
- We also have to specify number of dimensions (k). It is common practice to choose 2 or 3 dimensions (but you would need to try out a few diff k values and choose the best one that minimizes the stress value)
```
cdata.nms <- metaMDS(cdata[,c(3:10)], distance="bray", k=3)
cdata.nms
```
These commands will have R report back how it fit the NMDS, including the stress value it landed on:

Rule of thumb: anything below 0.05 is excellent, < 0.1 is great, < 0.2 is ok, and anything over 0.2 is questionable.
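To see how stress responds to k in practice, a sketch like the one below loops over a few k values and compares the stress of each fit. It uses vegan's built-in dune dataset so it runs standalone; with the lab data you would pass `cdata[,c(3:10)]` instead:

```r
# Compare stress across k = 1..3 and pick the smallest k with acceptable stress.
library(vegan)
data(dune)
set.seed(2)
stress.by.k <- sapply(1:3, function(k)
  metaMDS(dune, distance = "bray", k = k, trace = FALSE)$stress)
names(stress.by.k) <- paste0("k=", 1:3)
round(stress.by.k, 3)   # stress should drop as k increases
```

Adding dimensions always lowers stress, so the goal is the smallest k whose stress clears the rule-of-thumb threshold, not the absolute minimum.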
13. We can also examine the Shepard plot, which shows the final reduced-dimension ordination result based on ranks (red line) compared to the original dissimilarities (blue dots). It is good if the blue dots are close to the red line.
`stressplot(cdata.nms)`
14. We can plot our results using vegan. If we want to differentiate the groups by our experimental variable "path", we can use the built-in methods for displaying results like this:
```
path = c(rep("green",5), rep("blue",4)) # same order as your rows
plot(cdata.nms, type="n")
ordihull(cdata.nms, groups=path, draw="polygon", col="grey90",label=F)
orditorp(cdata.nms, display="species", col="red")
orditorp(cdata.nms, display="sites",
col=c(rep("green",5),rep("blue",4)), cex=1.25)
```
Results of these commands:

- Unlike PCA, NMDS cannot tell you the loadings of each axis (NMDS1, NMDS2, ...), and the order of the axes is arbitrary (e.g., the largest difference does not necessarily show up on NMDS1, whereas it does on PC1). But we can often draw similar conclusions from NMDS analyses as we would from PCA. Here, it appears the blue group is associated with different abundances of gene1, gene7, and gene8 compared to the green group.
### Steps 16-18 are skipped. If interested, look back at the lab handout (bi278_fall2022_lab_08.pdf), steps 16-18, for the use of ggplot2 for the same purpose as the above steps.
Example of ggplot2 results:

#### 1.3 Hierarchical Clustering
Hierarchical clustering makes dendrograms very much like the distance trees we looked at when discussing phylogenies.
18. First, make a (Bray-Curtis) distance matrix:
`d <- vegdist(cdata[,3:10], method="bray")`
- "3:10" is subject to variability
19. Then, run hierarchical clustering analysis.
`cdata.clust <- hclust(d, method="ward.D2")`
- You can pick among a few different clustering methods (type ?hclust for more details; the default is "complete") that differ in how clusters are evaluated as they are grouped together. Ward's method tries to find compact spherical clusters, "average" uses the average identities of the members of a cluster to represent the cluster, and "complete" uses the maximum distance rather than the average distance.
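A small base-R sketch on simulated data (not the lab dataset) shows how to run the same distance matrix through several linkage methods and compare them; with the lab data, `d.toy` would be the Bray-Curtis matrix `d` from above:

```r
# Fit the same toy distance matrix with three linkage methods and
# report the range of merge heights each produces.
set.seed(1)
m <- matrix(rnorm(9 * 4), nrow = 9)
d.toy <- dist(m)   # Euclidean here, just for illustration
for (meth in c("complete", "average", "ward.D2")) {
  cl <- hclust(d.toy, method = meth)
  cat(meth, ": merge heights range", round(range(cl$height), 2), "\n")
}
```

Plotting each result side by side (e.g., `plot(hclust(d.toy, "average"))`) is often the quickest way to see how the choice changes the tree shape.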
20. Take a look at the dendrogram that was produced:
`plot(cdata.clust)`
The Dendrogram that was produced:

21. Next, we cut across this dendrogram to produce the number of clusters that you want. Pick a K based on your guess of how many clusters may be present and cut across the dendrogram.
`cutree(cdata.clust, k=2)`
- shown here with 2 based on our experimental variable “path”, though remember I made up these data
- You can see the classification results based on the K you chose (which class each sample was classified into).
- Worth noting: the K you select here does not change the clustering itself; it only determines where the dendrogram is cut into groups.
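This can be seen directly by cutting the same tree at several K values. The sketch below uses simulated toy data (with the lab data, `cl` would be `cdata.clust`):

```r
# The tree is built once; cutree() just slices it at different heights,
# so changing k re-labels the samples without re-clustering.
set.seed(1)
cl <- hclust(dist(matrix(rnorm(9 * 4), nrow = 9)), method = "ward.D2")
cutree(cl, k = 2)   # two clusters
cutree(cl, k = 3)   # three clusters from the exact same tree
```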
22. Now we can use an R package called dendextend to manipulate the hierarchical clustering dendrogram. First, we convert the clustering results into a dendrogram.
```
dendro <- as.dendrogram(cdata.clust)
plot(dendro, type = "rectangle", ylab = "Height")
```
Results:

23. Then, we can manipulate the dendrogram further. For example, we can color the "leaves" based on the cluster identity that was shown with cutree().
```
dendro %>%
set("leaves_pch", c(19)) %>%
set("leaves_cex", c(2)) %>%
set("leaves_col", as.numeric(cutree(cdata.clust, k=2)[cdata.clust$order]) + 1) %>%
plot(type = "rectangle")
```
- Note: while ggplot2 strings additional functions together with '+', dendextend uses '%>%'.
- This command gave me trouble; the version shown above is the one that was successful in the end:

## Exercise 2. Box Plots and Dot Plots
Box plots and dot plots let us work with a specific set of genes and show how they differ in occurrence and abundance among groups.
To do this, you have to format your data in "long" format, as opposed to "wide".
- instead of each row being a complete set of measures collected for each sample, each row will be a single measure collected for a sample, and any given sample will show up in multiple rows.
- This data conversion can be done using reshape2.
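A tiny illustration of the wide-to-long conversion with reshape2's melt() (the same function used in the next step), on a made-up two-sample table:

```r
# Wide: one row per sample. Long: one row per (sample, gene) measurement.
library(reshape2)
wide <- data.frame(sample = c("s1", "s2"), gene1 = c(5, 2), gene2 = c(0, 7))
long <- melt(wide, id.vars = "sample")
long   # 4 rows with columns: sample, variable, value
```

Each sample now appears once per gene, which is the shape ggplot2's faceting expects below.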
24. Convert your data from wide to long and take a look at how it's been converted so you know the column names. Make sure to add the experimental variable too:
```
cdata <- read.table("PATH/multivariate_data.txt", h=T)
path <- c(rep("green",5), rep("blue",4))
cdata$path <- path
cluster.long <- melt(cdata[,1:11])
head(cluster.long)
```
"1:11" is subject to variability.
25. Now we can plot these data as a series of box plots that show us the difference in each gene's count by taxonomy or "path":
```
ggplot(cluster.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
ggplot(cluster.long, aes(x=path, y=value, fill=path)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
```
Results:

26. If you know which gene you want from these, you can subset the data by that gene and make a plot for it:
```
gene2.long <- subset(cluster.long, variable=="gene2")
ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot()
```
Results:

27. You may want to make a dot plot instead of a box plot to better show the individual values:
```
ggplot(gene2.long, aes(x=path, y=value, fill=path)) + geom_dotplot(binaxis='y', stackdir='center')
```
Results:

## Exercise 3. Applying Dendextend to Other Trees
To use dendextend on trees you may have built with FastTree, IQ-Tree or any other phylogenetic package, you will need to convert the newick tree file format to a dendrogram format. The R package DECIPHER is able to do this.
1. First read a tree you made and plot it.
```
library(DECIPHER)
dendro.ex <- ReadDendrogram("PATH/k25.tre")
plot(dendro.ex)
```
"k25.tre" is the tree file.
My results using lab4 tree:

2. Increase the margins in the plot window, then plot again to check whether the labels are visible:
```
par(mar = c(12,4,2,2) + 0.1) #bottom, left, top, right; default is c(5,4,4,2)
plot(dendro.ex)
```
You can then color the taxa ("leaves"), to show which cluster they are in, or whatever else.
3. To find out how many taxa are on this tree, call the tree object itself:
`dendro.ex`

I find mine has 63 members total.
4. Now you can change the shape (fig4) and colors (fig5) of the leaves manually.
For a tree with 15 members total:
```
dendro.ex %>%
set("leaves_pch",
c(15,16,16,16,16,16,16,16,16,17,17,17,17,15,15)) %>%
set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
plot(type="rectangle")
```

5. You can also turn the plot sideways (for the 15-member tree):
```
par(mar = c(5,4,4,12))
dendro.ex %>%
set("leaves_pch",
c(15,16,16,16,16,16,16,16,16,17,17,17,17,15,15)) %>%
set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
plot(type="rectangle", horiz=TRUE)
```
6. You can also change the colors of the labels instead:
```
dendro.ex %>%
set("leaves_pch", c(16)) %>%
set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
set("labels_colors", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
plot(type="rectangle", horiz=TRUE)
```
7. It might be helpful to sort your taxa by increasing node depth:
```
dendro.ex %>%
set("labels_colors", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
sort(type = "nodes") %>%
plot(type="rectangle", horiz=TRUE)
```
8. This command sets the margins back to the default afterwards.
`par(mar = c(5,4,4,2))`