# BI278 Lab #8
### Task:
Work through the exercise and play with R’s visualization to figure out how you will visualize the results of your projects. It can help to draw out what you expect to compare (literally, with colored pencils or markers), then work backwards to figure out what columns of data you need, what sources those data will come from, etc.
1. To get started, copy over the file to visualize into your /personal directory so you can see it from R:
```
cp /courses/bi278/Course_Materials/lab_08/multivariate_data.txt lab_08
```
### 1. ORDINATION AND CLUSTERING
#### 1.1 PRINCIPAL COMPONENTS ANALYSIS
2. Once you are logged into RStudio, load the following libraries (install them first if necessary):
```
x<-c("dendextend", "ggplot2", "vegan", "reshape2", "ggrepel")
lapply(x, require, character.only = TRUE)
```
3. Figure out your working directory and where you put this week's data file, then read it in as a table named cdata, and take a look at it.
```
cdata <- read.table("multivariate_data.txt", h=T)
head(cdata)
```
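If `read.table()` above cannot find the file, a couple of base-R commands can confirm where R is looking. This check is my addition, not part of the original handout, and the `setwd()` path is just a placeholder for wherever you copied `multivariate_data.txt`.
```
getwd()                       # print the current working directory
list.files()                  # list the files visible from that directory
# setwd("~/personal/lab_08")  # adjust this path to the folder holding the data file
```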
4. Set the row names of cdata:
```
rownames(cdata) <- cdata$sample
```
5. Let's first make a simple scatterplot. You can change the x and y variables to any of the columns in your object cdata.
```
ggplot(cdata, aes(x=gene1, y=gene2)) + geom_point() +
geom_text(aes(label=sample), hjust=-.3, vjust=0)
```

This command takes your table and tells the graphing function ggplot() to use gene1 as the x variable and gene2 as the y variable. It then makes a scatterplot of these data with geom_point() and adds text labels with geom_text() so you can tell which dot is which.
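If the layering idea is still fuzzy, here is the same plot built up one piece at a time. This is only a restatement of the command above, not a new analysis.
```
# build the same plot in stages to make the layering explicit
p <- ggplot(cdata, aes(x=gene1, y=gene2))             # data + aesthetic mappings
p <- p + geom_point()                                 # add the scatterplot layer
p + geom_text(aes(label=sample), hjust=-.3, vjust=0)  # add the text labels and draw
```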
You can switch around components to take a look at your data in different ways:
```
ggplot(cdata, aes(x=gene4, y=gene5)) + geom_point() +
geom_text(aes(label=taxonomy), hjust=-.2, vjust=0)
```

6. Let's run a PCA on these data (must be all numerical) and save those results to a new dataframe (a data table format that R uses).
```
cdata.pca <- prcomp(cdata[,c(3:10)], scale=T)
cdata.pca.res <- as.data.frame(cdata.pca$x)
```
Note: we have chosen to scale (~normalize) our data.
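An optional way to see what scaling does (this quick check is my addition, not part of the exercise): after `scale()`, every gene has mean 0 and standard deviation 1, so no single highly abundant gene dominates the PCA.
```
# optional sanity check: scale() centers each column to mean 0 and sd 1
round(colMeans(scale(cdata[,3:10])), 2)      # column means should all be ~0
round(apply(scale(cdata[,3:10]), 2, sd), 2)  # standard deviations should all be 1
```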
7. We can take a look at the results of the PCA here.
```
summary(cdata.pca)
```

The output indicates how much of the similarity and difference (variance) in our data is captured by the composite/reduced variables called principal components (PC1, PC2, etc.), reported cumulatively. Here, PC1 and PC2 together capture 67.6% of the total variance present in our data (the last row, "Cumulative Proportion").
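If you want to see where those percentages come from, you can compute them yourself from the standard deviations stored in the PCA object (a small optional sketch, not required for the lab):
```
# proportion of variance explained by each PC, computed by hand
pc.var <- cdata.pca$sdev^2               # variance captured by each PC
round(pc.var / sum(pc.var), 3)           # matches the "Proportion of Variance" row
round(cumsum(pc.var / sum(pc.var)), 3)   # cumulative proportion, as in summary()
```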
8. Let's make a new scatterplot of these new composite variables using the PCA results we captured in the previous step:
```
ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point() +
geom_text(aes(label=rownames(cdata.pca.res)), hjust=1, vjust=-.5)
```

9. I'm curious how the taxon labels fall out on this new scatterplot, so I'll bring those labels over as a new column in the new dataframe and then make another scatterplot.
```
cdata.pca.res$taxonomy <- cdata$taxonomy
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) +
geom_point() + geom_text(aes(label=rownames(cdata.pca.res)),
hjust=1,vjust=-.5)
```

10. I also want to know if my samples are separated by my experimental variable “path” so I’m going to add that as a variable here too and visualize it.
```
path = c(rep("green",5), rep("blue",4))
cdata.pca.res$path <- path
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) +
geom_point() + geom_text(aes(label=rownames(cdata.pca.res)),
hjust=1,vjust=-.5)
```

11. If your points or text overlap, this function from ggrepel is a nice way to make things clearer.
```
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) +
geom_point() +
geom_label_repel(aes(label=rownames(cdata.pca.res)))
```

12. Now that you’ve observed how your samples fall out on the PCA plot, you probably want to know how your factors (genes, etc) were reduced to these variables. To do this, we need to check the loadings.
```
cdata.pca$rotation
```

The results of this command show the relative influence of each factor on each principal component (PC). For example, for PC1 I see that gene1 and gene7 have stronger positive influences while gene3 and gene5 have stronger negative influences. So samples toward the right-hand side of PC1 should have higher abundances of gene1 and gene7 and lower abundances of gene3 and gene5 compared to other samples.
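To make that table easier to read, you can pull out and sort the loadings for a single PC (an optional convenience, not one of the original steps):
```
# sort the PC1 loadings so the strongest positive and negative influences stand out
sort(cdata.pca$rotation[, "PC1"])
# a quick visual version of the same information
barplot(sort(cdata.pca$rotation[, "PC1"]), las=2, main="PC1 loadings")
```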
#### 1.2 NONMETRIC MULTIDIMENSIONAL SCALING
Let's try NMDS next. NMDS is a popular and robust technique because it is based on ranks rather than on normally distributed values. This means it can handle all kinds of biological data without you having to worry about whether you measured things on an appropriate scale (e.g. you don't need to ask whether a value of 2 really means twice as much as a value of 1). There are some limitations, which we'll see below.
13. First we need to turn our data into a dissimilarity matrix. Bray-Curtis is the commonly used method for a variety of biological data because it handles changes in composition and relative abundance well, even when some data are missing (e.g. in our raw data matrix some genes are not observed in a few samples, and we don't know whether that gene is actually absent or whether our sampling method missed it). We also have to specify the number of dimensions (k). It is common practice to choose 2 or 3 dimensions, so let's start there, but you would normally try a few different k values and choose the one that minimizes the stress value (a sketch for doing this follows the stress guidelines below).
```
cdata.nms <- metaMDS(cdata[,c(3:10)], distance="bray", k=3)
cdata.nms
```
R will report back how it fit the NMDS, including the stress value it landed on:
> Stress: 0.02184196
A general rule of thumb for stress is that anything below 0.05 is excellent, < 0.1 is great, < 0.2 is ok, and anything over 0.2 is questionable.
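As mentioned in step 13, it is worth checking how stress changes with the number of dimensions before settling on one. A minimal sketch (my addition; which k values to try is up to you, and `trace=FALSE` just quiets metaMDS's progress messages):
```
# compare stress across a few choices of k (lower stress = better fit)
for (k in 1:4) {
  fit <- metaMDS(cdata[,c(3:10)], distance="bray", k=k, trace=FALSE)
  cat("k =", k, " stress =", round(fit$stress, 4), "\n")
}
```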
14. We can also examine the Shepard plot, which shows the final reduced-dimension ordination result based on ranks (red line) compared to the original dissimilarities (blue dots). We want the blue dots to fall close to the red line.
```
stressplot(cdata.nms)
```

15. Next, we can plot our results using the same package (vegan) we used for the NMDS analysis. But let's say we want to plot it with the groups differentiated by our experimental variable "path". We can use the built-in methods for displaying results like this.
```
path = c(rep("green",5), rep("blue",4)) # same order as your rows
plot(cdata.nms, type="n")
ordihull(cdata.nms, groups=path, draw="polygon", col="grey90",
label=F)
orditorp(cdata.nms, display="species", col="red")
orditorp(cdata.nms, display="sites",
col=c(rep("green",5),rep("blue",4)), cex=1.25)
```

Unlike PCA, an NMDS analysis cannot tell you the loadings of each axis (NMDS1, NMDS2, etc.), and the sequence of axes is arbitrary (e.g. the largest difference does not necessarily show up on NMDS1, whereas it does on PC1). But we can often draw similar conclusions from NMDS analyses as we would from PCA. Here, it appears the blue group is associated with different abundances of gene1, gene7, and gene8 compared to the green group.
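If you want the numbers behind that interpretation, vegan's `scores()` function will return the gene ("species") and sample ("site") positions that `orditorp()` plotted (an optional look, not a required step):
```
# coordinates of the genes ("species") in NMDS space, as plotted by orditorp()
scores(cdata.nms, display="species")
# and the sample ("site") coordinates, if you want to inspect those too
head(scores(cdata.nms, display="sites"))
```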
#### 1.3 HIERARCHICAL CLUSTERING
Hierarchical clustering makes dendrograms very much like the distance trees we looked at when discussing phylogenies, and one type of hierarchical clustering method in fact produces UPGMA (Unweighted Pair Group Method with Arithmetic Mean) trees.
Let's check out the differences among the samples in this dataset using hierarchical clustering. We will calculate a distance matrix and generate a dendrogram. Then we can choose levels at which to cut across the branches of the dendrogram to group our samples into a set number of clusters.
19. First make a (Bray-Curtis) distance matrix:
```
d <- vegdist(cdata[,3:10], method="bray")
```
20. Next run a hierarchical clustering analysis.
```
cdata.clust <- hclust(d, method="ward.D2")
```
You can pick among a few different clustering methods (type ?hclust for more details; the default is "complete") that differ in how clusters are compared as they are grouped together. Ward's method tries to find compact, spherical clusters; "average" uses the average distance between members of two clusters; and "complete" uses the maximum distance rather than the average distance.
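If you are curious how much this choice matters for these data, you can fit a couple of alternatives side by side (optional; "average" linkage on a distance matrix is the UPGMA method mentioned above):
```
# compare a few linkage methods on the same Bray-Curtis distances
par(mfrow=c(1,3))  # three plots side by side
plot(hclust(d, method="ward.D2"),  main="Ward")
plot(hclust(d, method="average"),  main="Average (UPGMA)")
plot(hclust(d, method="complete"), main="Complete")
par(mfrow=c(1,1))  # back to one plot per window
```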
21. Take a look at the dendrogram that was produced:
```
plot(cdata.clust)
```

Does this reflect what you thought was similar or different based on the PCA or NMDS?
The next step is to cut across this dendrogram to produce the number of clusters that you want.
22. Pick a K based on your guess of how many clusters may be present (shown here with 2 based on our experimental variable “path”, though remember I made up these data) and cut across the dendrogram.
```
cutree(cdata.clust, k=2)
```
You can see the classification results based on the K you chose (i.e., which cluster each sample was assigned to). Worth noting: the K you select here does not change the clustering itself, only where the dendrogram is cut into groups.
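You can see this for yourself by cutting the same tree at a few different values of K (an optional check):
```
# the tree itself is unchanged; only the number of groups returned differs
cutree(cdata.clust, k=2)
cutree(cdata.clust, k=3)
table(cutree(cdata.clust, k=3))  # how many samples fall into each cluster
```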
Now we can use an R package called dendextend to manipulate the hierarchical clustering dendrogram.
23. Convert the clustering results into a dendrogram.
```
dendro <- as.dendrogram(cdata.clust)
plot(dendro, type = "rectangle", ylab = "Height")
```

24. We can manipulate the dendrogram further. For example, we can color the “leaves” based on the cluster identity that was shown with cutree().
```
dendro %>%
set("leaves_pch", c(19)) %>%
set("leaves_cex", c(2)) %>%
set("leaves_col", as.numeric(cutree(cdata.clust,
k=2)[cdata.clust$order])+1) %>%
plot(type = "rectangle")
```

Note that while ggplot2 strings additional functions together with '+', dendextend
uses '%>%'.
What I hope you get out of this exercise is that these different exploratory methods can be useful for finding additional insights in your data beyond your initial hypotheses. Because so much genomic data is multidimensional (and thus difficult to interpret without additional processing), being fluent in different exploratory statistical techniques can be very useful.
### 2. BOX PLOTS AND DOT PLOTS
Instead of working with a large number of genes, you may be interested in a specific set of genes and how they differ in occurrence or abundance among groups. In this case, you would want to make a box plot or a dot plot.
In order to do this, you will have to format your data so that they are in a “long” format, as opposed to a “wide” format. So, instead of each row being a complete set of measures collected for each sample, each row will be a single measure collected for a sample, and any given sample will show up in multiple rows. We can use the package reshape2 to do the data conversion for us.
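Here is a tiny made-up example of what `melt()` does to a wide table (two samples, two genes; this toy data frame is my own illustration, not from the lab file, and it assumes reshape2 is already loaded from step 2):
```
# toy wide table: one row per sample
toy <- data.frame(sample=c("A","B"), gene1=c(5,2), gene2=c(0,7))
toy
melt(toy, id.vars="sample")
#   sample variable value
# 1      A    gene1     5
# 2      B    gene1     2
# 3      A    gene2     0
# 4      B    gene2     7
```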
25. Convert your data from wide to long and take a look at how it's been converted so you know the column names. Let’s make sure to add our experimental variable too:
```
cdata <- read.table("multivariate_data.txt", h=T)
path <- c(rep("green",5), rep("blue",4))
cdata$path <- path
cluster.long <- melt(cdata[,1:11])  # sample, taxonomy, and path stay as id columns
head(cluster.long)
```

26. Now, plot these data as a series of box plots showing the differences in each gene's counts by taxonomy or by "path":
By taxonomy:
```
ggplot(cluster.long, aes(x=taxonomy, y=value, fill=taxonomy)) +
geom_boxplot() + facet_wrap(.~variable, scales="free")
```

By path:
```
ggplot(cluster.long, aes(x=path, y=value, fill=path)) +
geom_boxplot() + facet_wrap(.~variable, scales="free")
```

27. If you know which gene you want out of these, you can subset the data by that gene and just make a plot for it:
```
gene2.long <- subset(cluster.long, variable=="gene2")
ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) +
geom_boxplot()
```

28. You may want to make a dot plot instead of a box plot, to better show the individual values:
```
ggplot(gene2.long, aes(x=path, y=value, fill=path)) +
geom_dotplot(binaxis='y', stackdir='center')
```

Usually, these types of figures would accompany a statistical test. There's no need to do that for our class. We don't have enough samples to do meaningful statistics, and I'm more interested in making sure you can understand and interpret the patterns and trends you see than in testing statistical significance.
### 3. BONUS: APPLYING DENDEXTEND TO OTHER TREES
To use dendextend on trees you may have built with FastTree, IQ-TREE or any other phylogenetic package, you will need to convert the newick tree file format to a dendrogram format. The R package DECIPHER is able to do this. Here’s an example done with a tree I made. You should use a tree of your own.
1. I'm reading in a tree I made and plotting it.
```
library(DECIPHER)
dendro.ex <- ReadDendrogram("k25.tre")
plot(dendro.ex)
```

2. I'm increasing the margins in my plot window, then plotting again to test if I'm able to see my labels:
```
par(mar = c(12,4,2,2) + 0.1) #bottom, left, top, right; default is c(5,4,4,2)
plot(dendro.ex)
```

Now I want to color my taxa (“leaves”), to show which cluster they are in, or whatever else you want to show.
3. To find out how many taxa are on this tree, I can first call the tree object itself:
```
dendro.ex
```
>'dendrogram' with 3 branches and 15 members total, at height 0.30529
I find out that my dendrogram has 15 members in total. You can manually change parts of this tree based on your metadata, but you need to enter everything according to the order of the taxa (= leaves) as they appear in your tree (left to right).
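To get that left-to-right order without reading it off the plot, you can ask the dendrogram for its labels directly (a small convenience that works for any dendrogram object):
```
# leaf labels in the order they appear on the plotted tree (left to right)
labels(dendro.ex)
# count them, as a check against the "15 members" reported above
length(labels(dendro.ex))
```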
4. Now I can change the shape (Fig 4) and colors (Fig 5) of my leaves manually:
```
dendro.ex %>%
set("leaves_pch",
c(15,16,16,16,16,16,16,16,16,17,17,17,17,15,15)) %>%
set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
plot(type="rectangle")
```

5. You can also turn these plots sideways:
```
par(mar = c(5,4,4,12))
dendro.ex %>%
set("leaves_pch",
c(15,16,16,16,16,16,16,16,16,17,17,17,17,15,15)) %>%
set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
plot(type="rectangle", horiz=TRUE)
```

6. Change the colors of the labels instead:
```
dendro.ex %>%
set("leaves_pch", c(16)) %>%
set("leaves_col", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
set("labels_colors", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
plot(type="rectangle", horiz=TRUE)
```

7. It might be helpful to sort your taxa by increasing node depth:
```
dendro.ex %>%
set("labels_colors", c(4,3,3,3,3,3,3,3,3,2,2,2,2,4,4)) %>%
sort(type = "nodes") %>%
plot(type="rectangle", horiz=TRUE)
```

8. Finally, this command sets the plot margins back to the default.
```
par(mar = c(5,4,4,2))
```