Lab 08: Data Visualization (in R)

# Lab 08: Data Visualization (in R) Kirsten Pastre #### 1 Copy over the file to visualize into your /personal directory ``` mkdir lab_08 cp /courses/bi278/Course_Materials/lab_08/multivariate_data.txt ~/lab_08 ``` ## Ordination and Clustering ### Principal Components Analysis #### 2. launch the following libraries ``` x <- c("dendextend", "ggplot2", "vegan", "reshape2", "ggrepel") lapply(x, require, character.only = TRUE) ``` #### 3. read in data as table named cdata ``` cdata <- read.table("multivariate_data.txt", h=T) head(cdata) ``` #### 4. set the row name of cdata ``` rownames(cdata) <-cdata$sample ``` #### 5. make a simple scatterplot ``` ggplot(cdata, aes(x=gene1, y=gene2)) + geom_point() + geom_text(aes(label=sample), hjust=-.3, vjust =0) ``` ![](https://i.imgur.com/0ZcbBVi.png) #### 6. run a PCA on these data and save those results to a new dataframe ``` cdata.pca <- prcomp(cdata[,c(3:10)], scale = T) cdata.pca.res <- as.data.frame(cdata.pca$x) ``` * note: we have chosen a scale to normalize our data #### 7. look at the reslts of PCA ``` summary(cdata.pca) ``` #### 8. make a new scatterplot of these new composite variables using the PCA results ``` ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust =1, vjust=.5) ``` ![](https://i.imgur.com/KOOsKtE.png) #### 9. place taxon labels on new scatterplot ``` cdata.pca.res$taxonomy <- cdata$taxonomy ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1, vjust=-.5) ``` ![](https://i.imgur.com/RCYFoWZ.png) #### 10. Want to know if my samples are separated by my experimental variable "path" so I'm going to add that as a variable here too and visualize it ``` path = c(rep("green", 5), rep("blue", 4)) cdata.pca.res$path <- path ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1, vjust=-.5) ``` ![](https://i.imgur.com/0onWbCE.png) #### 11. function ggrepel makes things clearer ``` ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_label_repel(aes(label=rownames(cdata.pca.res))) ``` ![](https://i.imgur.com/tR5YPoR.png) #### 12. what factor were reduced to these variables... check the loadings ``` cdata.pca$rotation ``` ### Nonmetric Multidimensional Scaling #### 13. Bray-Curtis method for forming a dissimalarity matrix ``` cdata.nms <- metaMDS(cdata[,c(3:10)], distance="bray", k=3) cdata.nms #returns: Stress: 0.02184196 ``` #### 14. Examine the Shepard plot, shows the final reduced-dimension ordination result based on ranks (red line) compared to the original dissimilarities (blue dots) ``` stressplot(cdata.nms) ``` ![](https://i.imgur.com/JAfmKI3.png) 15. plot results using vegan package. Plot it by differentiating the group by our experimental variable "path" ``` path = c(rep("green", 5), rep("blue",4)) plot(cdata.nms, type="n") ordihull(cdata.nms, groups=path, draw="polygon", col="grey90", label=F) orditorp(cdata.nms, display="species", col="red") orditorp(cdata.nms, display="sites", col=c(rep("green",5),rep("blue",4)), cex=1.25) ``` ![](https://i.imgur.com/ZoZzPGZ.png) #### 19. Make a Bray-Curtis distance matrix: ``` d <- vegdist(cdata[,3:10], method="bray") ``` #### 20. Run a hierarchical clustering analysis ``` cdata.clust <- hclust(d, method="ward.D2") ``` #### 21. Take a look at the dendogram that was produced: ``` plot(cdata.clust) ``` ![](https://i.imgur.com/GCIqBH6.png) #### 22. Pick a K based on your guess of how many clusters may be present ``` cutree(cdata.clust, k=2) ``` #### 23. Convert the clustering results into a dendrogram ``` dendro <- as.dendrogram(cdata.clust) plot(dendro, type = "rectangle", ylab= "Height") ``` ![](https://i.imgur.com/sOWHn9F.png) #### 24. Further manipulate the dendrogram - color the "leaves" based on the cluster identity that was shown in cutree() ``` dendro %>% set("leaves_pch", c(19)) %>% set("leaves_cex", c(2)) %>% set("leaves_col", as.numeric(cutree(cdata.clust, k=2)[cdata.clust$order]) +1) %>% plot(type = "rectangle") ``` ![](https://i.imgur.com/HkvQPvw.png) ## 2. Box Plots and Dot Plots #### 25. Convert data from wide to long and take look at how converted so know the column names ``` cdata <- read.table("multivariate_data.txt", h=T) path <- c(rep("green",5), rep("blue",4)) cdata$path <- path cluster.long <- melt(cdata[,1:11]) head(cluster.long) ``` #### 26. Plot these data into a series of box plots that shows us the difference in each gene count by taxonomy or "path" ``` ggplot(cluster.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() + facet_wrap(.~variable, scales="free") ggplot(cluster.long, aes(x=path, y=value, fill=path)) + geom_boxplot() + facet_wrap(.~variable, scales="free") ``` ![](https://i.imgur.com/eXmJA86.png) ![](https://i.imgur.com/4GTpK61.png) 27. If you know which gene you want out of these, you can subset the data by that gene and make a plot for it ``` gene2.long <- subset(cdata.long, variable=="gene2") ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() ``` ![](https://i.imgur.com/fbe8ZdD.png) #### 28. make a dot plot instead of a box plot ``` ggplot(gene2.long, aes(x=path, y=value, fill=path)) + geom_dotplot(binaxis='y', stackdir='center') ``` ![](https://i.imgur.com/VdRFT0k.png)