# BI278 Lab 8: Data visualization in R
Task: work through and play with R's visualization to figure out how to visualize the results of your projects
1. Copy over file to visualize in /personal directory so you can see it from R
```
mkdir lab_08
cp /courses/bi278/Course_Materials/lab_08/multivariate_data.txt ~/lab_08
```
The rest of the lab will be in R. The data set is simple and synthetic (we’ll pretend that this dataset is generated from a ortholog detection experiment and we have summarized our findings into measures of ortholog abundance)
**Ordination and Clustering**
*Principle Components Analysis*
2. Launch following libraries
```
library(dendextend)
library(ggplot2)
library(vegan)
library(reshape2)
library(ggrepel)
x<-c("dendextend", "ggplot2", "vegan", "reshape2", "ggrepel")
lapply(x, require, character.only = TRUE)
```
3. Figure out working directory and where you put this week's data file, then read it in as a table named cdata, and take a look at it
```
cdata <- read.table("multivariate_data.txt", h=T)
head(cdata)
```
4. Set the row names of cdata:
```
rownames(cdata) <- cdata$sample
```
5. Make scatter plot. Change x and y variables to any of the columns in object data
```
ggplot(cdata, aes(x=gene1, y=gene2)) + geom_point() + geom_text(aes(label=sample), hjust=-.3, vjust=0)
```
This command takes table and tells graphing function ggplot() to use gene1 as the x variable and gene2 as the y variable. Then it will make a
scatterplot with these data geom_point(), and add some text labels geom_text() so you can tell which dot is which.
You can also switch around components to look at data in different ways
```
ggplot(cdata, aes(x=gene4, y=gene5)) + geom_point() + geom_text(aes(label=taxonomy), hjust=-.2, vjust=0)
```
6. Run a PCA on these data and save results to new dataframe
```
cdata.pca <- prcomp(cdata[,c(3:10)], scale=T)
cdata.pca.res <- as.data.frame(cdata.pca$x)
```
7. Look at results
```
summary(cdata.pca)
```

Output indicates how much of the similarities and differences of our data are captured in the composite/reduced variables called principal components 1, 2, etc in a cumulative way. Our numbers indicate that PC1 and PC2 are able to capture within those two dimensions 67.6% of the total variance present in our data (last row).
8. Make scatterplot of new composite variables using the PCA results we captured in the previous step
```
ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5)
```
9. How does the taxon labels fall out on this new scatterplot– bring labels over and add them as a new column to this dataframe and make a new scatterplot
```
cdata.pca.res$taxonomy <- cdata$taxonomy
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5)
```
10. are samples seperated by experimental variable "path"– add that as a variable here, too, and visualize it
```
path = c(rep("green",5), rep("blue",4))
cdata.pca.res$path <- path
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5)
```
11. If points/texts overlap, this function is a nice way to make things clearer
```
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_label_repel(aes(label=rownames(cdata.pca.res)))
```
12. Want to know how factors (genes, etc) were reduced to these variables. Check the loadings
```
cdata.pca$rotation
```
The results of this command show you the relative influences of your factors on each of the PCs (principal components). For Example, I see for PC1 that gene1 and gene7 have higher positive influences while gene3 and gene5 have higher negative influences. So samples that are toward the right-hand side of PC1 should have higher abundances of gene1 and gene7 and lower abundances of gene3 and gene5 compared to other samples.

*Nonmetric Multidimentional Scaling*
(NMDS)
13. Turn data into dissimilarity matrix
Specify number of dimensions (k). Choose 2-3 dimensions but you need to try out a few different k values and choose the best one that minimizes the stress value
```
cdata.nms <- metaMDS(cdata[,c(3:10)], distance="bray", k=3)
cdata.nms
```
R will report back how it fit the NMDS, including the stress value it landed on:
A general rule of thumb for stress is that anything below 0.05 is excellent, < 0.1 is great, < 0.2 is ok, and anything over 0.2 is questionable.
14. Examine Shepard plot, which slows the final reduced-dimension ordination result baased on ranks (red line) compared to original dissimilarities (blue line). Blue dots should be pretty close to the red line
```
stressplot(cdata.nms)
```

15. Next, we can plot our results using the same package (vegan) we used for the NMDS analysis. But let’s say we want plot it be differentiating the groups by our experimental variable “path”. We can use the built-in method for displaying results like this.
```
path = c(rep("green",5), rep("blue",4)) # same order as your rows
plot(cdata.nms, type="n")
ordihull(cdata.nms, groups=path, draw="polygon", col="grey90", label=F)
orditorp(cdata.nms, display="species", col="red")
orditorp(cdata.nms, display="sites", col=c(rep("green",5),rep("blue",4)), cex=1.25)
```

Unlike PCA, NMDS analyses cannot tell you what the loadings of each axis (NMDS1-X) and sequence of axes is arbitrary (e.g. the largest difference does not necessarily show up on NMDS1 whereas it does on PC1). But we can often draw similar conclusions from NMDS analyses as we would for PCA. Here, it appears the blue group is associated with different abundances of gene1, gene7, and gene8 compared to the green group.
(skip 16-18)
*Hierarchical Clustering*
Hierarchical clustering makes dendograms (like distance trees we looked at when discussing phylogenies) and one type of hierarchical clustering method produces UPGMA (Unweighted Pair Group Method with Arithmetic Mean) trees.
19. Make a Bray-Curtis disance matrix:
```
d <- vegdist(cdata[,3:10], method="bray")
```
20. Next run a hierarchical clustering analysis
```
cdata.clust <- hclust(d, method="ward.D2")
```
You can pick among a few different clustering methods (type in ?hclust for more details; default is “complete”) that differ by how the clusters are evaluated as they are grouped together. Ward’s method tries to find compact spherical clusters, “average” will use the average identities of members of a cluster to represent the cluster, and the “complete” method uses the maximum distance rather than the average distance.
21. Look at dendogram produced
```
plot(cdata.clust)
```

22. Pick a K based off of how many clusters may be present and cut across dendogram (here, 2 based on experimental path)
```
cutree(clust.avg, k=2)
```
You can see the classification results based on the K you chose (which class each sample was classified into). NOTE: what K you select here did not impact the clustering itself.
Use dendextend to manipulate hierarchical clustering dendogram (R package).
23. Convert clustering results into dendogram
```
dendro <- as.dendrogram(cdata.clust)
plot(dendro, type = "rectangle", ylab = "Height")
```
24. We can manipulate it further: ex. color "leaves" based on cluster identity shown with cutree().
```
dendro %>%
set("leaves_pch", c(19)) %>%
set("leaves_cex", c(2)) %>%
set("leaves_col", as.numeric(cutree(cdata.clust, k=2)[cdata.clust$order])+1) %>%
plot(type = "rectangle")
```

NOTE: while ggplot2 strings additional functions together with '+', dendextend uses '%>%'
**Box Plots and Dot Plots**
Instead of working with a large number of genes, you can use a specific set of genes and how they differ in occurance/abundance among groups (therefore, make dot plot or box plot).
To do this, format data so they are in "long" format rather than in "wide" format using package reshape.
25. Convert data from wide to long (and look at how it's converted so that you know the column names, and add experimental variable)
```
cdata <- read.table("multivariate_data.txt", h=T)
path <- c(rep("green",5), rep("blue",4))
cdata$path <- path
cdata.long <- melt(cdata[,1:11])
head(cdata.long)
```
26. Plot data into series of box plots that showus the difference in each gene count by taxonomy or "path":
```
ggplot(cdata.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
ggplot(cdata.long, aes(x=path, y=value, fill=path)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
````

27. If you know which gene you want out of these, you can subset the data by gene and just make a plot for it
```
gene2.long <- subset(cdata.long, variable=="gene2")
ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot()
```

28. You could also make a dot plot instead of a box plot to better show the individual values
```
ggplot(gene2.long, aes(x=path, y=value, fill=path)) + geom_dotplot(binaxis='y', stackdir='center')
```

**Bonus: Applying Dendextend to other trees**