BI278 Lab noebook 8

# BI278 Lab noebook 8 Copy files from Course materials to personal directory ``` mkdir lab_08 cp /courses/bi278/Course_Materials/lab_08/multivariate_data.txt lab_08/ ``` ### Data visualization * Ordination methods are a variety of approaches used to understand multivariate data. * Principal components analysis (PCA) aims to take multiple variables of samples and reduce them into a smaller number of variables that capture the similarities and differences among the samples. * Non-metric multidimensional scaling (NMDS) looks at distances between pairs of samples and tries to figure out how these samples are arranged in multivariate space. Launch libraries needed ``` x<- c("dendextend", "ggplot2", "vegan", "reshape2", "ggrepel") lapply(x, require, character.only=TRUE) ``` Read the data file as a table ``` cdata<- read.table("multivariate_data.txt", h=T) head(cdata) ``` Set row names: `rownames(cdata)<- cdata$sample` Make a scatter plot ``` ggplot(cdata, aes(x=gene1, y = gene2)) + geom_point() +geom_text(aes(label=sample), hjust=-.3,vjust=0) ``` ![](https://i.imgur.com/RKawVtJ.png) Takes the table and makes scatter plot with data ``` geom_point() and adds text label geom_text() ggplot(cdata, aes(x=gene4, y = gene5)) + geom_point() +geom_text(aes(label=taxonomy), hjust=-.2,vjust=0) ``` ![](https://i.imgur.com/Q8Ir39t.png) #### PCA Run PCA on the data(must be numerical) and save it to a dataframe scale the data- normalize it ``` cdata.pca <- prcomp(cdata[,c(3:10)], scale=T) cdata.pca.res <- as.data.frame(cdata.pca$x) ``` Look at the results of the PCA `summary(cdata.pca)` Output indicates how much of the similarities and differences (variances) of the data are captured in the principal components in a cumulative way ![](https://i.imgur.com/a3KsKVc.png) Make a scatter plot with PCA results ``` ggplot(cdata.pca.res, aes(x=PC1, y = PC2)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5) ``` ![](https://i.imgur.com/UWxXZkk.png) Bring over the taxon labels and add them as a new column to the dataframe and make a scatterplot ``` cdata.pca.res$taxonomy<- cdata$taxonomy ggplot(cdata.pca.res, aes(x=PC1, y = PC2, color=taxonomy)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5) ``` ![](https://i.imgur.com/CGJnIGv.png) Want to know if samples are separated by experimental variable "path" Add path as a variable and visualize it ``` path = c(rep("green", 5), rep("blue",4)) cdata.pca.res$path <- path ggplot(cdata.pca.res, aes(x=PC1, y = PC2, color=path)) + geom_point() +geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5) ``` ![](https://i.imgur.com/AUWXyUO.png) Use ggrepel if points or text overlap ``` ggplot(cdata.pca.res, aes(x=PC1, y = PC2, color=path)) + geom_point() +geom_label_repel(aes(label=rownames(cdata.pca.res))) ``` ![](https://i.imgur.com/dGOiDZQ.png) Check the loadings to see what factors were redcued to the variables in the PCA `cdata.pca$rotation` ![](https://i.imgur.com/vMtPGGi.png) Shows the relative influences of your factors on each of the PCs For example for PC1 that gene1 and gene7 have higher positive influences while gene3 and gene5 have higher negative influences. So samples that are toward the right-hand side of PC1 should have higher abundances of gene1 and gene7 and lower abundances of gene3 and gene5 compared to other samples. ## Non metric multidimensional scaling It is based on ranks rather than normally distributed numbers. So it can handle all kinds of biological data without having to worry about whether you have measured things appropriately Ned to turn data into a dissimilarity matrix. Bray-Curtis is the commonly used method to do this for a variety of biological data because it handles changes in composition and relative abundances well even when some data are missing (e.g. in our raw data matrix some genes are not observed in a few samples and we don’t know if this is because that gene is actually not present or because of an error in our sampling method). * Use Bray-Curtis method to do this * need to specify k number of dimensions = usually 2 or 3 ``` cdata.nms <- metaMDS(cdata[,c(3:10)], distance="bray", k=3) cdata.nms ``` * Reports how it fit the NMDS including the stress value = 0.02184198 * stress value<0.05 is excellent, <0.1 is great, < 0.2 is ok, and anything over 0.2 is questionable Examine the Shepard plot, which slows the final reduced-dimension ordination result based on ranks (red line) compared to the original dissimilarities (blue dots). * We want our blue dots to be pretty close to the red line. `stressplot(cdata.nms)` ![](https://i.imgur.com/TkJyG2M.png) Plot results using package vegan which was used for NMDS analysis ``` path = c(rep("greeen",5), rep("blue",4)) #Same order as rows plot(cdata.nms, type="n") ordihull(cdata.nms, groups=path, draw = "polygon", col ="grey90", label=F) orditorp(cdata.nms, display = "species", col="red") orditorp(cdata.nms, display = "sites", col=c(rep("green",5), rep("blue",4)), cex=1.25) ``` ![](https://i.imgur.com/pKfOo23.png) * NMDS analyses cannot tell you what the loadings of each axis (NMDS1-X) are and sequence of axes is arbitrary (e.g. the largest difference does not necessarily show up on NMDS1). * It appears the blue group is associated with different abundances of gene1, gene7, and gene8 compared to the green group. ## Hierarchical clustering It makes dendrograms very much like distance trees, and one type of hierarchical clustering method in fact produces UPGMA (Unweighted Pair Group Method with Arithmetic Mean) trees. Check out the differences among the samples in this dataset using hierarchical clustering. You can pick among a few different clustering methods (type in ?hclust for more details; default is “complete”). Ward’s method tries to find compact spherical clusters, “average” will use the average identities of members of a cluster to represent the cluster, and the “complete” method uses the maximum distance rather than the average distance. * Make a (Bray-Curtis) distance matrix `d<- vegdist(cdata[,3:10], method = "bray")` * Run hierarchical clustering analysis `cdata.clust <- hclust(d, method = "ward.D2")` * Look at the dendogram `plot(cdata.clust)` ![](https://i.imgur.com/MI21erl.png) Cut across dendogram to produce clusters * Pick k based on how many clusters you think may be present `cutree(cdata.clust, k=2)` ![](https://i.imgur.com/7loagDQ.png) Shows what class each sample was classified into Use dendextend package to manipulate the dendogram * convert clustering results into a dendogram ``` dendro <- as.dendrogram(cdata.clust) plot(dendro, type="rectangle", ylab = "Height") ``` ![](https://i.imgur.com/FRzrU4O.png) Color the leaves based on cluster identity ``` dendro %>% set("leaves_pch", c(19)) %>% set("leaves_cex", c(2)) %>% set("leaves_col", as.numeric(cutree(cdata.clust,k=2)[cdata.clust$order])+1) %>% plot(type="rectangle") ``` `dendextend` uses %>% to string additonal functions together ![](https://i.imgur.com/J7PXT08.png) ## Box plots and dot plots Need to format your data so that they are in a “long” format, as opposed to a “wide” format. So, instead of each row being a complete set of measures collected for each sample, each row will be a single measure collected for a sample, and any given sample will show up in multiple rows. Convert data from wide to long using reshape2 package ``` cdata<- read.table("multivariate_data.txt", h=T) path <- c(rep("green",5), rep("blue",4)) cdata$path <- path cdata.long <- melt(cdata[,1:11]) ``` Plot data into a series of box plots that shows difference in each gene count by taxonomy or "path" ``` ggplot(cdata.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() + facet_wrap(.~variable, scales="free") ``` ![](https://i.imgur.com/wwe4vEo.png) ``` ggplot(cdata.long, aes(x=path, y=value, fill=path)) + geom_boxplot() + facet_wrap(.~variable, scales="free") ``` ![](https://i.imgur.com/sOZal3D.png) Can subset a gene and make a plot for it ``` gene2.long <- subset(cdata.long, variable=="gene2") ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() ``` ![](https://i.imgur.com/KFXc3MV.png) Make a dot plot ``` ggplot(gene2.long, aes(x=path, y=value, fill= path)) + geom_dotplot(binaxis = 'y', stackdir = 'center') ``` ![](https://i.imgur.com/wl0tVco.png) ## Applying `dendextend` to other tree `cp /courses/bi278/Course_Materials/lab_08/k25.tre lab_08` * Load library: `library(DECIPHER)` * Read in and plot a tree ``` dendro.ex <-ReadDendrogram("k25.tre") plot(dendro.ex) ``` ![](https://i.imgur.com/w9HCd9k.png) * Increase margins in plot window to see labels better ``` par(mar=c(12,4,2,2)+0.1) plot(dendro.ex) ``` ![](https://i.imgur.com/LJHuPjb.png) * Color the leaves(taxa) to show the cluster * find out how many taxa are on the tree `dendro.ex` * Has 15 members total * Change shape and colors of leaves manually ``` dendro.ex %>% set("leaves_pch", c(15,16,16,16,16,16,16,16,16,17,17,17,17,15,15)) %>% set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>% plot(type="rectangle") ``` ![](https://i.imgur.com/5Vzbbt0.png) Can use these shape and color numbers ![](https://i.imgur.com/lC9Suwq.png) * Turn the plot sideways ``` par(mar=c(5,4,4,12)) dendro.ex %>% set("leaves_pch", c(15,16,16,16,16,16,16,16,16,17,17,17,17,15,15)) %>% set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>% plot(type="rectangle", horiz=TRUE) ``` ![](https://i.imgur.com/c3pela1.png) * Change the colors of the labels ``` dendro.ex %>% set("leaves_pch", c(16)) %>% set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>% set("labels_colors",c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>% plot(type="rectangle", horiz=TRUE) ``` ![](https://i.imgur.com/5l1J0EV.png) * Sort taxa by increasing node depth ``` dendro.ex %>% set("labels_colors",c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>% sort(type ="nodes") %>% plot(type="rectangle", horiz=TRUE) ``` ![](https://i.imgur.com/KnIWsqt.png) * Sets margin back to default `par(mar= c(5,4,4,2))`