Bi278 Lab 8 Data Visualization

# Bi278 Lab 8 Data Visualization ### By Lee Ferenc 11/14/2023 This lab will be dealing with Rstudio and visualizing data (notes may not be complete due to having taken a course or two in Rstudio) # 0. Using categorized COGs ### Step 1. Read in categorized tables #### Code ``` name <- read.delim("file.cog.intact.categorized", h=F) ``` #### Lab example ``` pagri <- read.delim("baqs159.cog.intact.categorized", h=F) phayl <- read.delim("bhqs11.cog.intact.categorized", h=F) pbonn <- read.delim("bbqs859.cog.intact.categorized", h=F) ``` Make sure to repeat for all your COGs. ### Step 2. Turn into individual summary tables #### Code (# of genes in each COG category) ``` cat.name <- as.data.frame(table(name$V3)) colnames(cat.name) <- c("COG", "all") ``` #### Lab example ``` cat.agri <- as.data.frame(table(pagri$V3)) colnames(cat.agri) <- c("COG","all") cat.bonn <- as.data.frame(table(pbonn$V3)) colnames(cat.bonn) <- c("COG","all") cat.hayl <- as.data.frame(table(phayl$V3)) colnames(cat.hayl) <- c("COG","all") ``` ### Step 3. Combine Tables #### Code ``` #concatinate them cat.comp <-cbind(cat.name[,2], cat.name2[,2], cat.name3[,2]) #make a data frame df.cat <-as.data.frame(t(cat.comp)) #label rows and columns rownames(df.cat) <- c("name1","name2","name3") colnames(df.cat) <- cat.name$COG ``` #### Lab example ``` cat.comp <- cbind(cat.agri[,2],cat.bonn[,2],cat.hayl[,2]) df.cat <- as.data.frame(t(cat.comp)) rownames(df.cat) <- c("pagri","pbonn","phayl") colnames(df.cat) <- cat.agri$COG ``` ### Step 4. Clean up tables Basically replacing NA with 0/zero. #### Code/Lab Example ``` df.cat[is.na(df.cat)] <- 0 df.cat[,1:25] <- lapply(df.cat[,1:25], as.numeric) #then add genome as a variable too df.cat$genome <- row.names(df.cat) ``` Table will be equivalent to `cdata` used on the other parts of lab. You will need to adjust variable names and column numbers as necessary. ### Step 5. Wide table -> long table Not always required but when we do this, we can specify the order in which COG categories are presented #### Code/Lab example ``` long.cat <- melt(df.cat) colnames(long.cat) <- c("genome","COG","count") long.cat$COG <- factor(long.cat$COG, levels=c("J", "A", "K", "L", "B", "D", "Y", "V", "T", "M", "N", "Z", "W", "U", "O", "X", "C", "G", "E", "F", "H", "I", "P", "Q", "R", "S")) ``` ### Step 6. Make histogram (of # of genes in each cateogories) #### One Histogram code ``` lcat.name <- subset(long.cat, genome=="name") cog1 <- ggplot(lcat.name, aes(x=COG, y=count)) + geom_bar(stat="identity", fill=2) + scale_x_discrete(breaks=lcat.name$COG, limits = rev(levels(lcat.name$COG))) + coord_flip() ``` #### Lab example It's easier to make three sepearte hisograms and then combine them together at the end. ``` lcat.agri <- subset(long.cat, genome=="pagri") lcat.bonn <- subset(long.cat, genome=="pbonn") lcat.hayl <- subset(long.cat, genome=="phayl") cog1 <- ggplot(lcat.agri, aes(x=COG, y=count)) + geom_bar(stat="identity", fill=2) + scale_x_discrete(breaks=lcat.agri$COG, limits = rev(levels(lcat.agri$COG))) + coord_flip() cog2 <- ggplot(lcat.bonn, aes(x=COG, y=count)) + geom_bar(stat="identity", fill=3) + scale_x_discrete(breaks=lcat.bonn$COG, limits = rev(levels(lcat.bonn$COG))) + coord_flip() cog3 <- ggplot(lcat.hayl, aes(x=COG, y=count)) + geom_bar(stat="identity", fill=4) + scale_x_discrete(breaks=lcat.hayl$COG, limits = rev(levels(lcat.hayl$COG))) + coord_flip() gridExtra::grid.arrange(cog1, cog2, cog3, ncol=1) ``` # 1. Ordination and Clustering * **Ordination methods**: Variety of approaches used to understand multivariate data. * **Principal components analysis (PCA)**: Takes multiple variables of samples and reduces them into a smaller # of variables that capture the similarities and differences among the samples. * **Non-metric multidimensional scaling (NMDS)**: Examines distances between pairs of samples and tries to figure out how these samples are arranged in multivariate space. ## 1.1 Principle Compenents Analysis ### Step 1. Relaunch libraries #### Code ``` x<-c("ggplot2", "vegan", "reshape2", "ggrepel", "dendextend", "gridExtra") lapply(x, require, character.only = TRUE) ``` ### Step 2. Find/Set Working Directory and Read in Data #### Code ``` getwd() setwd("/home2/enfere24/lab_08") getwd() cdata <- read.table("/courses/bi278/Course_Materials/lab_08/multivariate_data.txt", h=T) head(cdata) ``` ### Step 3. Set the row names #### Code ``` rownames(cdata) <- cdata$sample ``` ### Step 4. Make scatterplot You can change the x and y vairables to any column in object data. #### Code ``` ggplot(cdata, aes(x=gene1, y=gene2)) + geom_point() + geom_text(aes(label=sample), hjust=-.3, vjust=0) ``` ![1.1 step 1](https://hackmd.io/_uploads/SylUgSbEa.png) It takes ur table and tells `ggplot()` to use gene1 as x variable and gene2 as y varaible. It'll make a scatterplot w/ these data using `geom_point()` and add text labels w/ `geom_text()`. ### Step 5. Run a PCA and save in new dataframe 1) Data must be all numerical 2) Data frama is a data table format R uses #### Code ``` cdata.pca <- prcomp(cdata[,c(3:10)], scale=T) cdata.pca.res <- as.data.frame(cdata.pca$x) ``` Note: We have chosen to normalize the data ### Step 6. Look at results #### Code ``` summary(cdata.pca) ``` #### Output: ``` Importance of components: PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 Standard deviation 1.8621 1.3933 1.1006 0.8509 0.62170 0.44026 0.26143 0.08455 Proportion of Variance 0.4334 0.2427 0.1514 0.0905 0.04831 0.02423 0.00854 0.00089 Cumulative Proportion 0.4334 0.6761 0.8275 0.9180 0.96633 0.99056 0.99911 1.00000 ``` What the output indicates: How much of the similarities and differences (variances) of the data is captured in the composite/reduced variables called principal components 1, 2, etc (PC1, PC2, etc) in a cumulative way. Numbers indicate that PC1 and PC2 are able to capture within those two dimensions 67.6% of the total variance present in our data (last row or cumulative proportion). ### Step 7. Make another scatterplot This is using the new composite vairables using PCA results #### Code ``` ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=- .5) ``` ![1.1 step 2](https://hackmd.io/_uploads/Sy1YGBW4p.png) ### Step 8. Add labels to plot #### Code ``` #first make a new column that has the taxon labels cdata.pca.res$taxonomy <- cdata$taxonomy #plot ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5) # another way of plotting the same information with ggrepel ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) + geom_point() + geom_label_repel(aes(label=rownames(cdata.pca.res))) ``` ![1.1 step 3](https://hackmd.io/_uploads/SkvVQB-4p.png) ![1.1 step 3.2](https://hackmd.io/_uploads/rkbrmH-46.png) ### Step 9. See if samples are seperate by variable Seeing if the samples separated by the experimental variable “path” so add that as a variable here and visualize it. #### Code ``` path = c(rep("green",5), rep("blue",4)) cdata.pca.res$path <- path ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5) # ggrepel will repel the labels away from each other so is useful if your points are close together ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_label_repel(aes(label=rownames(cdata.pca.res))) ``` ![1.1 step 4](https://hackmd.io/_uploads/BJc1VrZVp.png) (Second one looks better but no big change so I'm not adding it) ### Step 10. Check loadings Basically see how the factors (genes, etc) were reduced to these variables #### Code ``` cdata.pca$rotation ``` Output: ``` PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 gene1 0.48023860 -0.2699394 0.03088775 -0.15215054 0.25731341 -0.11423481 -0.37436616 0.67304530 gene2 -0.06664634 -0.5649934 0.25894258 0.53782230 0.36431129 0.28470729 0.32531472 0.02064068 gene3 -0.41318650 0.2683005 -0.30257555 0.34383344 0.22519269 0.39826855 -0.55750958 0.16544443 gene4 -0.30080401 -0.3709459 0.03710416 -0.71179954 0.04812204 0.50555298 -0.05129491 -0.05787982 gene5 -0.50277021 0.1960257 -0.06293914 -0.12064954 0.11281210 -0.20962145 0.50509442 0.61521565 gene6 -0.16583805 0.1356629 0.80426528 0.09192012 -0.43452363 0.11563735 -0.21908406 0.22050114 gene7 0.44256984 0.2891876 -0.17146862 0.05278685 -0.27373096 0.65242135 0.36265048 0.23710107 gene8 0.16605668 0.5078083 0.39857314 -0.19215956 0.68671218 0.09340096 0.07706258 -0.18037207 ``` Results show the relative influences of the factors on each of the PCs. Example, PC1 gene1 and gene7 have high positive influence while gene3 and gene5 have high neg influence. So samples that are toward the right-hand side of PC1 should have higher abundances of gene1 and gene7 and lower abundances of gene3 and gene5 compared to other samples. Use boxplots to show this (skip to 3). ## 1.2 Nonmetric Multidimensional Scaling NMDS are based on ranks rather than normally distributed numbers. Essentially, it can handle all kinds of biological data without having to worry about whether you have measured things appropriately. ### Step 1. Turn data into a dissimilarity matrix "A dissimilarity matrix (also called distance matrix) describes pairwise distinction between M objects" -Google Bray-Curtis is the commonly used method to do this b/c it handles changes in composition and relative abundances well even when some data are missing (eg. in the raw data matrix some genes aren't observed) Also *specify the number of dimensions (k)*. It's common to chose 2 or 3. Choose the best to minimze stress value #### Code ``` cdata.nms <- metaMDS(cdata[,c(3:10)], distance="bray", k=3) cdata.nms ``` Last stress is the highest (k=2 was awful). I got: ``` Run 20 stress 0.02210778 ``` Stress: * Excellent: < 0.05 * Great: < 0.1 * Okay: < 0.2 * Questionable: > 0.2 So not too bad. ### Step 2. Examine Shepard plot Plot shows the final reduced-dimension ordination result based on ranks (red line) compared to the original dissimilarities (blue dots). Want blue dots to be close to the red line. Code ``` stressplot(cdata.nms) ``` ![1.2 1](https://hackmd.io/_uploads/Bklbur-4a.png) ### Step 3. Plot w/ vegan but differentiate by pth variable #### Code ``` path = c(rep("green",5), rep("blue",4)) # same order as your rows plot(cdata.nms, type="n") ordihull(cdata.nms, groups=path, draw="polygon", col="grey90", label=F) orditorp(cdata.nms, display="species", col="red") orditorp(cdata.nms, display="sites", col=c(rep("green",5),rep("blue",4)), cex=1.25) ``` ![1.2 2](https://hackmd.io/_uploads/SyDoYBWNT.png) Vegan assumes rows are sites and our columns are species. NMDS analyses cannot tell you what the loadings of each axis (NMDS1-X) and sequence of axes are arbitrary (e.g. the largest difference doesn't always show up on NMDS1 whereas it does on PC1). ### Step 4. Extract coords for ggplot2 #### Code pt.1 Make a dataframe w/ information regarding each genome ``` cdata.nms.res <- as.data.frame(scores(cdata.nms, "sites")) cdata.nms.res$taxonomy <- cdata$taxonomy cdata.nms.res$path <- path head(cdata.nms.res) ``` #### Output pt.1 ``` NMDS1 NMDS2 NMDS3 taxonomy path a1 -0.24232251 -0.07352231 -0.03156709 burkho green a2 -0.15393725 0.15153922 -0.08292738 burkho green a3 -0.24137050 -0.07425665 -0.05172880 burkho green a4 -0.21895156 -0.06702596 0.18095354 parabu green a5 -0.05144667 -0.15277998 0.02630649 parabu green a6 0.11986093 -0.12726608 -0.11856719 parabu blue ``` #### Code pt.2 Make an additional dataframe with information regarding each gene ``` cdata.scores <- as.data.frame(scores(cdata.nms, "species")) cdata.scores$genes <- rownames(cdata.scores) head(cdata.scores) ``` #### Output pt.2 ``` NMDS1 NMDS2 NMDS3 genes gene1 0.28405766 -0.21300739 0.115962720 gene1 gene2 -0.05182152 -0.18399198 -0.007671403 gene2 gene3 -0.45226934 0.53469004 -0.222850619 gene3 gene4 -0.33626396 -0.17801963 0.092867097 gene4 gene5 -0.16701435 0.14426518 0.002336161 gene5 gene6 -0.02849258 0.05968661 -0.040683146 gene6 ``` ### Step 5. Look at Taxonomy #### Code for NMDS1 vs. NMDS2 ``` ggplot() + geom_text(data=cdata.scores, aes(x=NMDS1, y=NMDS2, label=genes), alpha=0.5) + geom_point(data=cdata.nms.res, aes(x=NMDS1, y=NMDS2, color=taxonomy)) + geom_label_repel(data=cdata.nms.res, aes(x=NMDS1, y=NMDS2, color=taxonomy, label=row.names(cdata.nms.res))) ``` ![1.2 3](https://hackmd.io/_uploads/rknqjBZV6.png) #### Code for NMDS2 vs. NMDS3 ``` ggplot() + geom_text(data=cdata.scores, aes(x=NMDS2, y=NMDS3, label=genes), alpha=0.5) + geom_point(data=cdata.nms.res, aes(x=NMDS2, y=NMDS3, color=taxonomy)) + geom_label_repel(data=cdata.nms.res, aes(x=NMDS2, y=NMDS3, color=taxonomy, label=row.names(cdata.nms.res))) ``` ![1.2 4](https://hackmd.io/_uploads/Hk9sirbEa.png) ### Step 6. Determine which samples compose the "hull" of the groups #### Code pt.1 ``` ath.g1 <- subset(cdata.nms.res, path=="green")[chull(subset(cdata.nms.res, path=="green")[,c(1:2)]),] path.b1 <- subset(cdata.nms.res, path=="blue")[chull(subset(cdata.nms.res, path=="blue")[,c(1:2)]),] hull.data.1 <- rbind(path.g1, path.b1) hull.data.1 ``` #### Output pt.1 ``` NMDS1 NMDS2 NMDS3 taxonomy path a5 -0.05144667 -0.15277998 0.02630649 parabu green a3 -0.24137050 -0.07425665 -0.05172880 burkho green a1 -0.24232251 -0.07352231 -0.03156709 burkho green a2 -0.15393725 0.15153922 -0.08292738 burkho green a8 0.37146808 -0.05901354 0.14668125 caball blue a7 0.36383965 -0.06122061 -0.09179992 caball blue a6 0.11986093 -0.12726608 -0.11856719 parabu blue a9 0.05285983 0.46354590 0.02264910 ralsto blue ``` #### Code pt.2 ``` path.g2 <- subset(cdata.nms.res, path=="green")[chull(subset(cdata.nms.res, path=="green")[,c(2:3)]),] path.b2 <- subset(cdata.nms.res, path=="blue")[chull(subset(cdata.nms.res, path=="blue")[,c(2:3)]),] hull.data.2 <- rbind(path.g2, path.b2) hull.data.2 ``` #### Output pt.2 ``` NMDS1 NMDS2 NMDS3 taxonomy path a2 -0.15393725 0.15153922 -0.08292738 burkho green a3 -0.24137050 -0.07425665 -0.05172880 burkho green a5 -0.05144667 -0.15277998 0.02630649 parabu green a4 -0.21895156 -0.06702596 0.18095354 parabu green a6 0.11986093 -0.12726608 -0.11856719 parabu blue a8 0.37146808 -0.05901354 0.14668125 caball blue a9 0.05285983 0.46354590 0.02264910 ralsto blue ``` ### Step 7. Plot samples By each pair of MDS axes #### Code for plot 1 ``` ggplot() + geom_text(data=cdata.scores, aes(x=NMDS1, y=NMDS2, label=genes), alpha=0.5) + geom_point(data=cdata.nms.res, aes(x=NMDS1, y=NMDS2, color=taxonomy)) + geom_polygon(data=hull.data.1,aes(x=NMDS1,y=NMDS2, fill=path, group=path), alpha=0.30) + geom_label_repel(data=cdata.nms.res, aes(x=NMDS1,y=NMDS2, color=taxonomy, label=row.names(cdata.nms.res))) ``` #### Output for plot 1 ![1.2 5](https://hackmd.io/_uploads/HyxNaBbVp.png) #### Code for plot 2 ``` ggplot() + geom_text(data=cdata.scores, aes(x=NMDS2, y=NMDS3, label=genes), alpha=0.5) + geom_point(data=cdata.nms.res, aes(x=NMDS2, y=NMDS3, color=taxonomy)) + geom_polygon(data=hull.data.1,aes(x=NMDS2,y=NMDS3, fill=path, group=path), alpha=0.30) + geom_label_repel(data=cdata.nms.res, aes(x=NMDS2,y=NMDS3, color=taxonomy, label=row.names(cdata.nms.res))) ``` #### Output for plot 2 ![1.2 6](https://hackmd.io/_uploads/HkaETHWET.png) ## 1.3 Hierarchical Clustering Hierarchical clustering makes dendrograms similar to distance trees. One type makes UPGMA (Unweighted Pair Group Method with Arithmetic Mean) trees. ### Step 1. Make a (Bray-Curtis) Distance Matrix #### Code ``` d <- vegdist(cdata[,3:10], method="bray") ``` ### Step 2. Run Hierarchical Clustering Analysis #### Code ``` cdata.clust <- hclust(d, method="ward.D2") ``` There are different clusering methods: * "ward.D" tries to find compact spherical clusters * "ward.D2" tries to find compact spherical clusters * "single" * "complete" (default) uses the maximum distance rather than the average distance * "average" = UPGMA. *Uses the avg identities of members of a cluster to represent the cluster* (Note: For project might be helpful) * "mcquitty" = WPGMA * "median" = WPGMC * "centroid" = UPGMC ### Step 3. Look at the dendogram #### Code ``` plot(cdata.clust) ``` ![1.3 1](https://hackmd.io/_uploads/HyWUXUbVT.png) Looks surprisingly like I expected in 1.2 and the end of 1.1. ### Step 4. Cut Across the Dendogram Pick a K based on your guess of how many clusters may be present (example uses 2 based on our experimental variable “path”). Note: K doesn't affect clustering itself #### Code ``` cutree(cdata.clust, k=2) ``` #### Output ``` a1 a2 a3 a4 a5 a6 a7 a8 a9 1 1 1 1 1 1 2 2 1 ``` So everyone's in 1 except a7 and a8. ### Step 5. Convert Clustering into dendrogram #### Code ``` dendro <- as.dendrogram(cdata.clust) plot(dendro, type = "rectangle", ylab = "Height") ``` ![1.3 2](https://hackmd.io/_uploads/Bkz64UWE6.png) (Basically looks like the picture above but without the different lengths) ### Step 6. Manipulate the dendrogram Purdify it. Example: Color the “leaves” based on the cluster identity that was shown with cutree(). #### Code ``` dendro %>% set("leaves_pch", c(19)) %>% set("leaves_cex", c(2)) %>% set("leaves_col", as.numeric(cutree(cdata.clust, k=2)[cdata.clust$order])+1) %>% plot(type = "rectangle") ``` ![3 3](https://hackmd.io/_uploads/SJWNBLb46.png) ggplot2 uses + while dendextend uses %>% which is weird and awful. # 2. Box Plots and Dot Plots Compare a specific set of fenes and how they differ in occurrence or abundance! While usually, these types of figures would accompany a statistical test, in this class it's not super important. ### Step 1. Convert data from wide to long Instead of each row being a complete set of measures collected for each sample, each row will be a single measure collected for a sample/ #### Code ``` cdata$path <- path cdata.long <- melt(cdata[,1:11]) head(cdata.long) ``` #### Output ``` sample taxonomy path variable value 1 a1 burkho green gene1 2 2 a2 burkho green gene1 2 3 a3 burkho green gene1 3 4 a4 parabu green gene1 4 5 a5 parabu green gene1 4 6 a6 parabu blue gene1 5 ``` ### Step 2. Plot into dotplots This shows the differences in each gene count by taxonomy or "path". #### Code ``` ggplot(cdata.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() + facet_wrap(.~variable, scales="free") ggplot(cdata.long, aes(x=path, y=value, fill=path)) + geom_boxplot() + facet_wrap(.~variable, scales="free") ``` ![2 1](https://hackmd.io/_uploads/Hy4_88bVa.png) ![2 2](https://hackmd.io/_uploads/r1S5LIbNa.png) ### Step 3. Subset data Subseting by gene and plotting #### Code ``` gene2.long <- subset(cdata.long, variable=="gene2") ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() ``` ![0](https://hackmd.io/_uploads/H1WkPIbE6.png) ### Step 4. Dotplots not Boxplots #### Code ``` ggplot(cdata.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_dotplot(dotsize=2, binaxis='y', stackdir='center') + facet_wrap(.~variable, scales="free") ggplot(cdata.long, aes(x=path, y=value, fill=path)) + geom_dotplot(dotsize=2, binaxis='y', stackdir='center') + facet_wrap(.~variable, scales="free") ggplot(gene2.long, aes(x=path, y=value, fill=path)) + geom_dotplot(binaxis='y', stackdir='center') ``` ![1](https://hackmd.io/_uploads/SycQvIZV6.png) ![2](https://hackmd.io/_uploads/SyzNPUbEa.png) ![3](https://hackmd.io/_uploads/r13VwI-VT.png) ### Step 5. Spot Matrix #### Code ``` ggplot(cdata.long, aes(taxonomy, variable)) + geom_point(aes(size=value)) + scale_size_continuous() + coord_flip() + theme_bw() + xlab("") ``` ![1 spot](https://hackmd.io/_uploads/r1hFvLZEp.png) ### Step 6. Heat Map #### Code ``` ggplot(cdata.long, aes(taxonomy, variable, fill=value)) + geom_tile(colour="white", lwd=1.5, linetype=1) + scale_fill_gradient(low = "white", high = "black") + geom_text(aes(label=value), colour="white") + coord_flip() + theme_bw() + xlab("") ``` ![1 heat](https://hackmd.io/_uploads/SyD9PIZNT.png)