# Lab 08 Data Visualization in R
## Exercise 1: Ordination and Clustering
### 1.1 Principal Component Analysis
The code for this week's lab will be executed in Rstudio.
```
x<-c("dendextend", "ggplot2", "vegan", "reshape2", "ggrepel")
lapply(x, require, character.only = TRUE)
```
The code above loads the packages saved in the x sequence and requires them in order for the script to run.
```
cdata <- read.table("multivariate_data.txt", h = T)
head(cdata)
```
The command above reads in the data at the "multivariate_data.txt" file. The "h" parameter corresponds to whether the headers will be included in the data entry. The head command will print the first few rows of data.
```
rownames(cdata) <- cdata$sample
```
Assigns names to each row of cdata which is stored in the sample column.
```
ggplot(cdata, aes(x=gene1, y=gene2)) + geom_point() +
geom_text(aes(label=sample), hjust=-.3, vjust=0)
```
The ggplot R package allows for the easy manipulation of figure parameters in R plots. aes specifies which variables should be used as the axes, the geom_point() parameter is just one of any visual representations that the plots can take, and h and vjust will control the horizontal and vertical justification of the plot (on a scale of 0 to 1).

### Running PCA
```
cdata.pca <- prcomp(cdata[, c(3:10), scale = T])
cdata.pca.res <- as.data.frame(cdata.pca$x)
```
Principal component analysis is a form of statistical dimension reduction that can be used on datasets of size m x n. One other method employed here is vector slicing. Since cdata is a multidimensional array, we must specify which columns/rows the pca command will use as input data. Leaving the entry for the rows empty allows us to sample every row.
```
summary(cdata.pca)
```
The output of pca shows us how much the variances between columns of our data are captured in the reduced dimension variables called principal components. By looking at the row that shows cumulative proportion, we can see that 67.6% of the variance is explained using only the first two pcs.
```
ggplot(cdata.pca.res, aes(x=PC1, y= PC2)) + geom_point() +
geom_text(aes(label = rownames(cdata.pca.res)), hjust = 1, vjust = -0.5)
```

We can plot the first two principal components against each other. These types of plots are known as score plots and are used to project the data onto a span of the principal components.
```
cdata.pca.res$taxonomy <- cdata$taxonomy
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) +
geom_point() + geom_text(aes(label=rownames(cdata.pca.res)),
hjust=1,vjust=-.5)
```

Above is the code used to project the taxonomic origins of each gene to the data points as colors. They illustarate ways that we can represent multi-dimensional data on one plot.
```
path = c(rep("green", 5), rep("blue", 4))
cdata.pca.res$path <- path
ggplot(cdata.pca.res, aes(x = PC1, y = PC2, color = path))+
geom_point() + geom_text(aes(label = rownames(cdata.pca.res)), hjust = 1, vjust = -0.5)
path = c(rep("green", 5), rep("blue", 4))
cdata.pca.res$path <- path
ggplot(cdata.pca.res, aes(x = PC1, y = PC2, color = path))+
geom_point() + geom_text(aes(label = rownames(cdata.pca.res)), hjust = 1, vjust = -0.5)
```
Above assigns a new column to cdata.pca.res by creating the path variable. This variable is a sequence containing the word "green" five times followed by the word "blue" four times. These are used as the color labels on the plot displaying the first two principal components.
In order to make the points easier to see, ggrepel can be used in order to highlight the labels of each point.
```
cdata.pca$rotation
```
The information saved in cdata rotation column. It shows the relative influence of the factors on each of the principal components.
## Exercise 1.2: Nonmetric Multidimensional scaling
NMDS is a popular and robust technique because it is based on ranks rather than normally distributed numbers
```
cdata.nms <- metaMDS(cdata[, c(3:10)], distance = "bray", k = 3)
cdata.nms
```
In order to conduct nonmetric multidimensional scaling (NMDS), we first need to turn our data into a dissimilarity matrix using hte Bray-Curtis method. Bray Curtis handles changes in composition and relative abudnaces. We also specify that there are 3 dimensions.
metaMDS reports back the stress. The rule of thumb for stress is similar to p-values in that anything below 0.05 is statistically excellent, less than 0.1 is great, and anything over 0.2 is not accepted.
In order to visualize the stress data, we can use the stressplot method.
```
stressplot(cdata.nms)
```

It shows the final, reduced-dimension ordination based on the ranks which are displayed in red, compared to the original dissimilarity values. In an ideal plot, the blue dots will be fairly close to the red line.
```
path = c(rep ("green", 5), rep("blue", 4))
plot(cdata.nms, type = "n")
ordihull(cdata.nms, groups = path, draw = "polygon", col = "grey90", label = F)
orditorp(cdata.nms, display = "species", col = "red")
orditorp(cdata.nms, display = "sites", col = c(rep("green", 5), rep("blue", 4), cex= 1.25))
```
The ordihull plot from the vegan package allows us to plot our experiment variable path.
NMDS is unable to tell which loadings on each axis are arbitrary. In the plot below, it seems that the blue group is associated with different abundances of gene1, gene7, and gene8.

## 1.3 Hiearchical Clustering
```
d <- vegdist(cdata[, 3:10], method = "bray")
```
Creating the bray-curtis distance matrix.
```
cdata.clust <- hclust(d, method = 'ward.D2')
plot(cdata.clust)
```
Running the hierarchical clustering analysis. The different clustering methods that can be used in this step can be explored further by using ?hclust. These clustering methods differ by how clusters are evaluated. ward.D2 is a method that tries to find compact spherical clusters, "average" will use the average identities of members of a cluster, and the "complete" method uses the maximum distance between points rather than the average distance.

The plot produced by hierarhcical clustering will be a dendrogram. We can prune this tree in order to produce the number of desired clusters
Choose a value of k based on an estimation of how many clusters there should be present. In this case, we choose two because the experimental path variable can take on the two values, green and blue.
```
cuttree(cdata.clust, k = 2)
```
This method will clasify results based on how many expected clusters there should be.
Converting the clustering results into a dendro gram, we can run the following commands:
```
dendro <- as.dendrogram(cdata.clust)
plot(dendro, type = "rectangle", ylab = "Height")
```

```
dendro %>%
set("leaves_pch", c(19)) %>%
set("leaves_cex", c(2)) %>%
set("leaves_col", as.numeric(cutree(cdata.clust, k=2)[cdata.clust$order]) + 1) %>%
plot(type = "rectangle")
```
In the above command, the %>% string allows the dendextend package in order to string together additional functions.
This is a way to manipulate the dendrogram further so that the "leaves" are colored based on cluster identity. These are the classifications identified by cutree.

## Exercise 2: Box Plots and Dot Plots
Instead of working with a large set of genes independently, we can calculate statistics that allow us to stratify specific sets of genes. We can study these sets of genes for how they differ in occurence or abundance among groups.
In order to do this, we need to convert our data in a "long" format so that each row will be a single measure collected for a sample. Any given sample in this format can show up in multiple rows.
```
cdata$path <- path
cluster.long <- melt(cdata[, 1:11])
head(cluster.long)
```
This allows each subject data to have multiple rows for each gene.
The code below allows us to visualize the gene data for each gene by taxonomy and by path.
```
# for genes by taxonomy
ggplot(cdata.long, aes(x=taxonomy, y=value, fill = taxonomy)) + geom_boxplot() +facet_wrap(.~variable, scales = "free")
# for genes by path
ggplot(cdata.long, aes(x=path, y=value, fill = path)) + geom_boxplot() +facet_wrap(.~variable, scales = "free")
```


If we want to look at one gene out of this plot, we can assign the gene data to a new variable:
```
gene2.long <- subset(cdata.long, variable == "gene2")
ggplot(gene2.long, aes(x = taxonomy, y = value, fill = taxonomy)) + geom_boxplot()
```

Making a dotplot instead of a boxplot:
```
ggplot(gene2.long, aes(x=path, y = value, fill = path)) + geom_dotplot(binaxis = 'y', stackdir= 'center')
```

## Exercise 3: Applying DenDextend to other trees
```
library(DECIPHER)
dendro.ex <- ReadDendrogram("k25.tre")
par(mar = c(12, 4, 2, 2) + 0.1) # bottom, left, top, right; default is c(5,4,4,2)
dendro.ex
```
We first load in the DECIPHER R-package and a tree. We update the margins so that the labels of the dendrograms are easier to see.

Calling dendro.ex directly allows us to see the characteristics of the dendrogram.
Change the shapes of the leaves
```
dendro.ex %>%
set("leaves_pch",
c(15,16,16,16,16,16,16,16,16,17,17,17,17,15,15)) %>%
set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
plot(type="rectangle")
```
Turn the plot sideways
```
par(mar = c(5,4,4,12))
dendro.ex %>%
set("leaves_pch",
c(15,16,16,16,16,16,16,16,16,17,17,17,17,15,15)) %>%
set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
plot(type="rectangle", horiz=TRUE)
```

Change color of leaf
```
dendro.ex %>%
set("leaves_pch", c(16)) %>%
set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
set("labels_colors", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
plot(type="rectangle", horiz=TRUE)
```

Sort each taxa by increasing the depth of the node.
```
dendro.ex %>%
set("labels_colors", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
sort(type = "nodes") %>%
plot(type="rectangle", horiz=TRUE)
```
