# Lab 08: Data Visualization (in R)
Kirsten Pastre
#### 1 Copy over the file to visualize into your /personal directory
```
mkdir lab_08
cp /courses/bi278/Course_Materials/lab_08/multivariate_data.txt ~/lab_08
```
## Ordination and Clustering
### Principal Components Analysis
#### 2. launch the following libraries
```
x <- c("dendextend", "ggplot2", "vegan", "reshape2", "ggrepel")
lapply(x, require, character.only = TRUE)
```
#### 3. read in data as table named cdata
```
cdata <- read.table("multivariate_data.txt", h=T)
head(cdata)
```
#### 4. set the row name of cdata
```
rownames(cdata) <-cdata$sample
```
#### 5. make a simple scatterplot
```
ggplot(cdata, aes(x=gene1, y=gene2)) + geom_point() + geom_text(aes(label=sample), hjust=-.3, vjust =0)
```

#### 6. run a PCA on these data and save those results to a new dataframe
```
cdata.pca <- prcomp(cdata[,c(3:10)], scale = T)
cdata.pca.res <- as.data.frame(cdata.pca$x)
```
* note: we have chosen a scale to normalize our data
#### 7. look at the reslts of PCA
```
summary(cdata.pca)
```
#### 8. make a new scatterplot of these new composite variables using the PCA results
```
ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust =1, vjust=.5)
```

#### 9. place taxon labels on new scatterplot
```
cdata.pca.res$taxonomy <- cdata$taxonomy
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1, vjust=-.5)
```

#### 10. Want to know if my samples are separated by my experimental variable "path" so I'm going to add that as a variable here too and visualize it
```
path = c(rep("green", 5), rep("blue", 4))
cdata.pca.res$path <- path
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1, vjust=-.5)
```

#### 11. function ggrepel makes things clearer
```
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_label_repel(aes(label=rownames(cdata.pca.res)))
```

#### 12. what factor were reduced to these variables... check the loadings
```
cdata.pca$rotation
```
### Nonmetric Multidimensional Scaling
#### 13. Bray-Curtis method for forming a dissimalarity matrix
```
cdata.nms <- metaMDS(cdata[,c(3:10)], distance="bray", k=3)
cdata.nms
#returns: Stress: 0.02184196
```
#### 14. Examine the Shepard plot, shows the final reduced-dimension ordination result based on ranks (red line) compared to the original dissimilarities (blue dots)
```
stressplot(cdata.nms)
```

15. plot results using vegan package. Plot it by differentiating the group by our experimental variable "path"
```
path = c(rep("green", 5), rep("blue",4))
plot(cdata.nms, type="n")
ordihull(cdata.nms, groups=path, draw="polygon", col="grey90", label=F)
orditorp(cdata.nms, display="species", col="red")
orditorp(cdata.nms, display="sites", col=c(rep("green",5),rep("blue",4)), cex=1.25)
```

#### 19. Make a Bray-Curtis distance matrix:
```
d <- vegdist(cdata[,3:10], method="bray")
```
#### 20. Run a hierarchical clustering analysis
```
cdata.clust <- hclust(d, method="ward.D2")
```
#### 21. Take a look at the dendogram that was produced:
```
plot(cdata.clust)
```

#### 22. Pick a K based on your guess of how many clusters may be present
```
cutree(cdata.clust, k=2)
```
#### 23. Convert the clustering results into a dendrogram
```
dendro <- as.dendrogram(cdata.clust)
plot(dendro, type = "rectangle", ylab= "Height")
```

#### 24. Further manipulate the dendrogram - color the "leaves" based on the cluster identity that was shown in cutree()
```
dendro %>%
set("leaves_pch", c(19)) %>%
set("leaves_cex", c(2)) %>%
set("leaves_col", as.numeric(cutree(cdata.clust, k=2)[cdata.clust$order]) +1) %>%
plot(type = "rectangle")
```

## 2. Box Plots and Dot Plots
#### 25. Convert data from wide to long and take look at how converted so know the column names
```
cdata <- read.table("multivariate_data.txt", h=T)
path <- c(rep("green",5), rep("blue",4))
cdata$path <- path
cluster.long <- melt(cdata[,1:11])
head(cluster.long)
```
#### 26. Plot these data into a series of box plots that shows us the difference in each gene count by taxonomy or "path"
```
ggplot(cluster.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
ggplot(cluster.long, aes(x=path, y=value, fill=path)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
```


27. If you know which gene you want out of these, you can subset the data by that gene and make a plot for it
```
gene2.long <- subset(cdata.long, variable=="gene2")
ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot()
```

#### 28. make a dot plot instead of a box plot
```
ggplot(gene2.long, aes(x=path, y=value, fill=path)) + geom_dotplot(binaxis='y', stackdir='center')
```
