# BI278 Lab #8 – Data Visualization (in R)

Goal: Data exploration & visualization methods

Notes:
- when working with multivariate data, it's difficult to tell overall similarities & differences in what we are comparing
- genomic data = typically multivariate b/c for each sample, you collect more than 1 measure of data
- visualization can help tell differences apart

Task: Work through the exercise and play with R's visualization to figure out how you will visualize the results of your projects. It can help to draw out what you expect to compare (literally, with colored pencils or markers), then work backwards to figure out what columns of data you need, what sources those data will come from, etc.

1) Copy the file to visualize into your personal directory so you can see it from R:

    /courses/bi278/Course_Materials/lab_08/multivariate_data.txt

The rest of the work is in RStudio.

**Exercise 1 - Ordination & Clustering**

Notes:
- ordination methods = a variety of approaches used to understand multivariate data
- PCA (principal components analysis) aims to take multiple variables of samples & reduce them into a smaller # of variables that capture similarities & differences among samples
- NMDS (non-metric multidimensional scaling) looks at distances between pairs of samples & tries to figure out how these samples are arranged in multivariate space
- both methods pop up in biological data

1.1 - Principal Components Analysis

2) Log into RStudio & load the following libraries:

    x <- c("dendextend", "ggplot2", "vegan", "reshape2", "ggrepel")
    lapply(x, require, character.only = TRUE)

3) Figure out your working directory and where you put this week's data file, then read it in as a table named cdata and take a look at it (my working directory is lab_08):

    cdata <- read.table("multivariate_data.txt", h=T)
    head(cdata)

4) Set the row names of cdata:

    rownames(cdata) <- cdata$sample

5) Make a simple scatterplot. You can change the x and y variables to any of the columns in your object cdata.
    ggplot(cdata, aes(x=gene1, y=gene2)) + geom_point() + geom_text(aes(label=sample), hjust=-.3, vjust=0)

^This command takes your table and tells ggplot() (the graphing function) to use gene1 as the x variable & gene2 as the y variable and make a scatterplot of these data with geom_point(). It also adds text labels (geom_text()) so you know which dot is which.

NOTE: You can switch around components to take a look at your data in different ways. Ex:

    ggplot(cdata, aes(x=gene4, y=gene5)) + geom_point() + geom_text(aes(label=taxonomy), hjust=-.2, vjust=0)

6) Let's run a PCA on these data (which must all be numerical) and save those results to a new dataframe (a data table format that R uses):

    cdata.pca <- prcomp(cdata[,c(3:10)], scale=T)
    cdata.pca.res <- as.data.frame(cdata.pca$x)

NOTE: we've essentially chosen to scale (~normalize) our data.

7) Look at the results of the PCA:

    summary(cdata.pca)

![](https://i.imgur.com/nM9BlFw.png)

^The output indicates, cumulatively, how much of the similarities and differences (variance) in our data are captured by the composite/reduced variables called principal components 1, 2, etc.
^^Our numbers indicate that PC1 & PC2 together are able to capture, within those 2 dimensions, 67.7% of the total variance present in our data (last row!!).

8) Make a new scatterplot of these new composite variables using the PCA results captured in the previous step:

    ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1, vjust=-.5)

9) We're curious how the taxon labels fall out on this new scatterplot, so we have to bring those labels over and add them as a new column to this new dataframe, then make a new scatterplot:

    cdata.pca.res$taxonomy <- cdata$taxonomy
    ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1, vjust=-.5)

10) We also want to know if our samples are separated by the experimental variable "path", so we're going to add that as a variable here too & visualize it.
    path <- c(rep("green",5), rep("blue",4))
    cdata.pca.res$path <- path
    ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1, vjust=-.5)

11) If your points or text overlap, this function from ggrepel is a nice way to make things clearer:

    ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_label_repel(aes(label=rownames(cdata.pca.res)))

12) Now that you've observed how your samples fall out on the PCA plot, you probably want to know how your factors (genes, etc.) were reduced to these variables. To do this, we need to check the loadings:

    cdata.pca$rotation

^The results of this command show you the relative influences of your factors on each of the PCs. Prof Noh example: "I see for PC1 that gene1 and gene7 have higher positive influences while gene3 and gene5 have higher negative influences. So samples that are toward the right-hand side of PC1 should have higher abundances of gene1 and gene7 and lower abundances of gene3 and gene5 compared to other samples."

1.2 - Nonmetric Multidimensional Scaling

Notes:
- NMDS = a popular + robust technique b/c it is based on ranks rather than normally distributed numbers
- which means it can handle all kinds of biological data without having to worry about whether or not you have measured things appropriately

13) Turn the data into a dissimilarity matrix. Bray-Curtis = a commonly used method to do this b/c it handles changes in composition & relative abundances well even when some data are missing.

Specify the # of dimensions (k). It's common practice to choose 2 or 3 dimensions, so let's start here, BUT you'd need to try out a few different k values & choose the best one that minimizes the stress value.

    cdata.nms <- metaMDS(cdata[,c(3:10)], distance="bray", k=3)
    cdata.nms

R will report back how it fit the NMDS, including the stress value it landed on.
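To compare a few different k values, one option (a minimal sketch, not part of the lab manual; it assumes cdata has already been read in and vegan is loaded as above) is to refit the NMDS at several dimensionalities and collect the stress each one reaches:

```r
# Sketch: refit the NMDS at k = 1..4 and record the stress value for each,
# so you can pick the smallest k that brings stress into an acceptable range.
# Assumes vegan is loaded and cdata was read in as above.
stress.by.k <- sapply(1:4, function(k) {
  fit <- metaMDS(cdata[,c(3:10)], distance="bray", k=k, trace=FALSE)
  fit$stress
})
names(stress.by.k) <- paste0("k=", 1:4)
stress.by.k  # stress should drop as k increases
```

Stress always decreases as k grows, so the point is to find where adding another dimension stops buying you much.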
GENERAL RULE OF THUMB for stress: anything below 0.05 is excellent, < 0.1 is great, < 0.2 is OK, and anything over 0.2 is questionable.

14) We can also examine the Shepard plot, which shows the final reduced-dimension ordination result based on ranks (red line) compared to the original dissimilarities (blue dots). We want our blue dots to be close to the red line.

    stressplot(cdata.nms)

![](https://i.imgur.com/T5mEZpw.png)

15) Now we can plot our results using the same package (vegan) we used for the NMDS analysis. But let's say we wanted the plot to differentiate groups by our experimental variable "path". We can use the built-in method for displaying results like this:

    path <- c(rep("green",5), rep("blue",4)) # same order as your rows
    plot(cdata.nms, type="n")
    ordihull(cdata.nms, groups=path, draw="polygon", col="grey90", label=F)
    orditorp(cdata.nms, display="species", col="red")
    orditorp(cdata.nms, display="sites", col=c(rep("green",5), rep("blue",4)), cex=1.25)

![](https://i.imgur.com/86J61bt.png)

NOTE: unlike PCA, NMDS analyses can't tell you the loadings of each axis, and the sequence of axes is arbitrary. For instance, the largest difference does not necessarily show up on NMDS1, whereas it does on PC1. BUT we can still draw similar conclusions from NMDS analyses as we would for PCA. In this particular result, it appears the blue group is associated with different abundances of gene1, gene7, and gene8 compared to the green group.

NOTE: Skipped steps 16-18 from the lab manual, which use ggplot2 for the same purpose as above.
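For the skipped ggplot2 version, a minimal sketch (my own reconstruction of the idea, not the lab manual's exact code; it assumes cdata.nms and path exist as above) would pull the site scores out of the vegan object and plot them the same way we plotted the PCA results:

```r
# Sketch: extract NMDS site scores into a dataframe, attach the grouping
# variable, and plot with ggplot2. Assumes cdata.nms and path exist as above.
nms.scores <- as.data.frame(scores(cdata.nms, display="sites"))
nms.scores$path <- path
ggplot(nms.scores, aes(x=NMDS1, y=NMDS2, color=path)) +
  geom_point() +
  geom_text(aes(label=rownames(nms.scores)), hjust=1, vjust=-.5)
```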
Refer back to the lab manual/pdf if you want to learn more about that: file:///Users/sdivit25/Downloads/bi278_fall2022_lab_08.pdf

1.3 - Hierarchical Clustering

Notes:
- hierarchical clustering makes dendrograms similar to the distance trees we looked at when discussing phylogenies
- one type of hierarchical clustering method produces UPGMA (Unweighted Pair Group Method with Arithmetic Mean) trees

In this section:
- we check out the differences among the samples in this dataset using hierarchical clustering
- we will calculate a distance matrix and generate a dendrogram
** we can choose levels to cut across the branches of the dendrogram to group our samples into a set number of clusters

19) First make a (Bray-Curtis) distance matrix:

    d <- vegdist(cdata[,3:10], method="bray")

20) Run the hierarchical clustering analysis:

    cdata.clust <- hclust(d, method="ward.D2")

NOTE: you can pick among a few different clustering methods (type ?hclust for more details; the default is "complete") that differ by how the clusters are evaluated as they are grouped together. Ward's method tries to find compact spherical clusters, "average" uses the average identities of the members of a cluster to represent the cluster, & the "complete" method uses the maximum distance instead of the average distance.

21) Look at the dendrogram produced:

    plot(cdata.clust)

![](https://i.imgur.com/4U4OXx5.png)

I think this reflects what I thought was similar based on the PCA or NMDS. The dendrogram is definitely easier to read than the PCA or NMDS.

22) Pick a k based on your guess of how many clusters may be present (shown here with 2 based on our experimental variable "path", BUT REMEMBER Prof Noh made up these data) and cut across the dendrogram:

    cutree(cdata.clust, k=2)

NOTICE: You can see the classification results based on the k you chose (i.e., which cluster each sample was classified into).
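To check how well the 2-cluster cut lines up with the experimental variable, one quick option (a sketch of my own, assuming cdata.clust and path exist as above) is to cross-tabulate the cutree() classes against path:

```r
# Sketch: compare the k=2 cluster assignments against the "path" grouping.
# Assumes cdata.clust and path were created as above; samples are in the
# same order in both.
clusters <- cutree(cdata.clust, k=2)
table(clusters, path)  # rows = cluster IDs, columns = path groups
```

If the clustering recovers the grouping, each cluster row should be dominated by one path column.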
NOTE: the k selected doesn't impact the clustering itself.

Now you can use the package dendextend to manipulate the hierarchical clustering dendrogram. For more details: https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html

23) Convert the clustering results into a dendrogram:

    dendro <- as.dendrogram(cdata.clust)
    plot(dendro, type = "rectangle", ylab = "Height")

![](https://i.imgur.com/a9IJIJi.png)

24) Manipulate the dendrogram further. Ex: color the "leaves" based on the cluster identity that was shown with cutree:

    dendro %>% set("leaves_pch", c(19)) %>%
      set("leaves_cex", c(2)) %>%
      set("leaves_col", as.numeric(cutree(cdata.clust, k=2)[cdata.clust$order])+1) %>%
      plot(type = "rectangle")

![](https://i.imgur.com/og8fopp.png)

NOTE: ggplot2 strings additional functions together with '+' but dendextend uses '%>%'.

Key takeaways from the exercise:
- different exploratory methods can be useful for finding additional insights from data beyond initial hypotheses
- a lot of genomic data is multidimensional (thus difficult to interpret without additional processing), so it's important & useful to be fluent in different exploratory statistical techniques

**Exercise 2 - Box Plots and Dot Plots**

Notes:
- instead of working with a large # of genes, you may be interested in a specific set of genes & how they differ in occurrence/abundance among groups
- in this case, you want to make a box plot or dot plot!

To make a box/dot plot, you need to format the data so they are in "long" format (instead of "wide" format):
- instead of each row being a complete set of measures collected for each sample, each row will be a single measure collected for a sample, and any given sample will show up in multiple rows
- use the package reshape2 to do the data conversion for us!!

25) Convert the data from wide to long & look at how it's been converted so you know the column names. Also add the experimental variable.
    cdata <- read.table("multivariate_data.txt", h=T)
    path <- c(rep("green",5), rep("blue",4))
    cdata$path <- path
    cdata.long <- melt(cdata[,1:11])
    head(cdata.long)

(Note: the result must be named cdata.long, since that's what the plotting commands below use.)

26) Plot these data into a series of box plots that show us the difference in each gene count by taxonomy or "path":

    ggplot(cdata.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
    ggplot(cdata.long, aes(x=path, y=value, fill=path)) + geom_boxplot() + facet_wrap(.~variable, scales="free")

First result: ![](https://i.imgur.com/ExOyMOQ.png)
Second result: ![](https://i.imgur.com/VOgj4fN.png)

27) If you know which gene you want out of these, you can subset the data by that gene and just make a plot for it:

    gene2.long <- subset(cdata.long, variable=="gene2")
    ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot()

![](https://i.imgur.com/PySwz9m.png)

28) You may want to make a dot plot instead of a box plot, to better show the individual values:

    ggplot(gene2.long, aes(x=path, y=value, fill=path)) + geom_dotplot(binaxis='y', stackdir='center')

Prof Noh's NOTES: Usually, these types of figures would accompany a statistical test. There's no need to do that for our class. We don't have enough samples to do meaningful statistics, and I'm more interested in making sure you can understand and interpret the patterns and trends you see than in testing statistical significance.

NOTE: Refer to BONUS part 3 in the lab manual/handout if, when working on your project, you need to apply dendextend to other trees!
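For a project dataset that does have enough samples, one common nonparametric option to pair with a figure like the gene2 dot plot (a sketch of my own, not part of the lab; it assumes gene2.long was created as in step 27) is a Wilcoxon rank-sum test comparing the gene's values between the two path groups:

```r
# Sketch only: with more samples, a plot like the gene2 dot plot could be
# accompanied by a rank-based test of the group difference.
# Assumes gene2.long was created as above.
wilcox.test(value ~ path, data = gene2.long)
```

Like NMDS, this test is rank-based, so it doesn't assume the gene counts are normally distributed.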