# Bi278 Lab 8 Data Visualization
### By Lee Ferenc 11/14/2023
This lab will be dealing with Rstudio and visualizing data (notes may not be complete due to having taken a course or two in Rstudio)
# 0. Using categorized COGs
### Step 1. Read in categorized tables
#### Code
```
name <- read.delim("file.cog.intact.categorized", h=F)
```
#### Lab example
```
pagri <- read.delim("baqs159.cog.intact.categorized", h=F)
phayl <- read.delim("bhqs11.cog.intact.categorized", h=F)
pbonn <- read.delim("bbqs859.cog.intact.categorized", h=F)
```
Make sure to repeat for all your COGs.
### Step 2. Turn into individual summary tables
#### Code
(# of genes in each COG category)
```
cat.name <- as.data.frame(table(name$V3))
colnames(cat.name) <- c("COG", "all")
```
#### Lab example
```
cat.agri <- as.data.frame(table(pagri$V3))
colnames(cat.agri) <- c("COG","all")
cat.bonn <- as.data.frame(table(pbonn$V3))
colnames(cat.bonn) <- c("COG","all")
cat.hayl <- as.data.frame(table(phayl$V3))
colnames(cat.hayl) <- c("COG","all")
```
### Step 3. Combine Tables
#### Code
```
#concatinate them
cat.comp <-cbind(cat.name[,2], cat.name2[,2], cat.name3[,2])
#make a data frame
df.cat <-as.data.frame(t(cat.comp))
#label rows and columns
rownames(df.cat) <- c("name1","name2","name3")
colnames(df.cat) <- cat.name$COG
```
#### Lab example
```
cat.comp <- cbind(cat.agri[,2],cat.bonn[,2],cat.hayl[,2])
df.cat <- as.data.frame(t(cat.comp))
rownames(df.cat) <- c("pagri","pbonn","phayl")
colnames(df.cat) <- cat.agri$COG
```
### Step 4. Clean up tables
Basically replacing NA with 0/zero.
#### Code/Lab Example
```
df.cat[is.na(df.cat)] <- 0
df.cat[,1:25] <- lapply(df.cat[,1:25], as.numeric)
#then add genome as a variable too
df.cat$genome <- row.names(df.cat)
```
Table will be equivalent to `cdata` used on the other parts of lab. You will need to adjust variable names and column numbers as necessary.
### Step 5. Wide table -> long table
Not always required but when we do this, we can specify the order in which COG categories are presented
#### Code/Lab example
```
long.cat <- melt(df.cat)
colnames(long.cat) <- c("genome","COG","count")
long.cat$COG <- factor(long.cat$COG, levels=c("J", "A", "K", "L", "B", "D", "Y", "V", "T", "M", "N", "Z", "W", "U", "O", "X", "C", "G", "E", "F", "H", "I", "P", "Q", "R", "S"))
```
### Step 6. Make histogram (of # of genes in each cateogories)
#### One Histogram code
```
lcat.name <- subset(long.cat, genome=="name")
cog1 <- ggplot(lcat.name, aes(x=COG, y=count)) + geom_bar(stat="identity", fill=2) + scale_x_discrete(breaks=lcat.name$COG, limits = rev(levels(lcat.name$COG))) + coord_flip()
```
#### Lab example
It's easier to make three sepearte hisograms and then combine them together at the end.
```
lcat.agri <- subset(long.cat, genome=="pagri")
lcat.bonn <- subset(long.cat, genome=="pbonn")
lcat.hayl <- subset(long.cat, genome=="phayl")
cog1 <- ggplot(lcat.agri, aes(x=COG, y=count)) + geom_bar(stat="identity", fill=2) + scale_x_discrete(breaks=lcat.agri$COG, limits = rev(levels(lcat.agri$COG))) + coord_flip()
cog2 <- ggplot(lcat.bonn, aes(x=COG, y=count)) + geom_bar(stat="identity", fill=3) + scale_x_discrete(breaks=lcat.bonn$COG, limits = rev(levels(lcat.bonn$COG))) + coord_flip()
cog3 <- ggplot(lcat.hayl, aes(x=COG, y=count)) + geom_bar(stat="identity", fill=4) + scale_x_discrete(breaks=lcat.hayl$COG, limits = rev(levels(lcat.hayl$COG))) + coord_flip()
gridExtra::grid.arrange(cog1, cog2, cog3, ncol=1)
```
# 1. Ordination and Clustering
* **Ordination methods**: Variety of approaches used to understand multivariate data.
* **Principal components analysis (PCA)**: Takes multiple variables of samples and reduces them into a smaller # of variables that capture the similarities and differences among the samples.
* **Non-metric multidimensional scaling (NMDS)**: Examines distances between pairs of samples and tries to figure out how these samples are arranged in multivariate space.
## 1.1 Principle Compenents Analysis
### Step 1. Relaunch libraries
#### Code
```
x<-c("ggplot2", "vegan", "reshape2", "ggrepel", "dendextend", "gridExtra")
lapply(x, require, character.only = TRUE)
```
### Step 2. Find/Set Working Directory and Read in Data
#### Code
```
getwd()
setwd("/home2/enfere24/lab_08")
getwd()
cdata <- read.table("/courses/bi278/Course_Materials/lab_08/multivariate_data.txt", h=T)
head(cdata)
```
### Step 3. Set the row names
#### Code
```
rownames(cdata) <- cdata$sample
```
### Step 4. Make scatterplot
You can change the x and y vairables to any column in object data.
#### Code
```
ggplot(cdata, aes(x=gene1, y=gene2)) + geom_point() +
geom_text(aes(label=sample), hjust=-.3, vjust=0)
```

It takes ur table and tells `ggplot()` to use gene1 as x variable and gene2 as y varaible. It'll make a scatterplot w/ these data using `geom_point()` and add text labels w/ `geom_text()`.
### Step 5. Run a PCA and save in new dataframe
1) Data must be all numerical
2) Data frama is a data table format R uses
#### Code
```
cdata.pca <- prcomp(cdata[,c(3:10)], scale=T)
cdata.pca.res <- as.data.frame(cdata.pca$x)
```
Note: We have chosen to normalize the data
### Step 6. Look at results
#### Code
```
summary(cdata.pca)
```
#### Output:
```
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 1.8621 1.3933 1.1006 0.8509 0.62170 0.44026 0.26143 0.08455
Proportion of Variance 0.4334 0.2427 0.1514 0.0905 0.04831 0.02423 0.00854 0.00089
Cumulative Proportion 0.4334 0.6761 0.8275 0.9180 0.96633 0.99056 0.99911 1.00000
```
What the output indicates: How much of the similarities and differences (variances) of the data is captured in the composite/reduced variables called principal components 1, 2, etc (PC1, PC2, etc) in a cumulative way.
Numbers indicate that PC1 and PC2 are able to capture within those two dimensions 67.6% of the total variance present in our data (last row or cumulative proportion).
### Step 7. Make another scatterplot
This is using the new composite vairables using PCA results
#### Code
```
ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point() +
geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-
.5)
```

### Step 8. Add labels to plot
#### Code
```
#first make a new column that has the taxon labels
cdata.pca.res$taxonomy <- cdata$taxonomy
#plot
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) +
geom_point() + geom_text(aes(label=rownames(cdata.pca.res)),
hjust=1,vjust=-.5)
# another way of plotting the same information with ggrepel
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=taxonomy)) + geom_point() + geom_label_repel(aes(label=rownames(cdata.pca.res)))
```


### Step 9. See if samples are seperate by variable
Seeing if the samples separated by the experimental variable “path” so add that as a variable here and visualize it.
#### Code
```
path = c(rep("green",5), rep("blue",4))
cdata.pca.res$path <- path
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5)
# ggrepel will repel the labels away from each other so is useful if your points are close together
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=path)) + geom_point() + geom_label_repel(aes(label=rownames(cdata.pca.res)))
```

(Second one looks better but no big change so I'm not adding it)
### Step 10. Check loadings
Basically see how the factors (genes, etc) were reduced to these variables
#### Code
```
cdata.pca$rotation
```
Output:
```
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
gene1 0.48023860 -0.2699394 0.03088775 -0.15215054 0.25731341 -0.11423481 -0.37436616 0.67304530
gene2 -0.06664634 -0.5649934 0.25894258 0.53782230 0.36431129 0.28470729 0.32531472 0.02064068
gene3 -0.41318650 0.2683005 -0.30257555 0.34383344 0.22519269 0.39826855 -0.55750958 0.16544443
gene4 -0.30080401 -0.3709459 0.03710416 -0.71179954 0.04812204 0.50555298 -0.05129491 -0.05787982
gene5 -0.50277021 0.1960257 -0.06293914 -0.12064954 0.11281210 -0.20962145 0.50509442 0.61521565
gene6 -0.16583805 0.1356629 0.80426528 0.09192012 -0.43452363 0.11563735 -0.21908406 0.22050114
gene7 0.44256984 0.2891876 -0.17146862 0.05278685 -0.27373096 0.65242135 0.36265048 0.23710107
gene8 0.16605668 0.5078083 0.39857314 -0.19215956 0.68671218 0.09340096 0.07706258 -0.18037207
```
Results show the relative influences of the factors on each of the PCs. Example, PC1 gene1 and gene7 have high positive influence while gene3 and gene5 have high neg influence. So samples that are toward the right-hand side of PC1 should have higher abundances of gene1 and gene7 and lower abundances of gene3 and gene5 compared to other samples. Use boxplots to show this (skip to 3).
## 1.2 Nonmetric Multidimensional Scaling
NMDS are based on ranks rather than normally distributed numbers. Essentially, it can handle all kinds of biological data without having to worry about whether you have measured things appropriately.
### Step 1. Turn data into a dissimilarity matrix
"A dissimilarity matrix (also called distance matrix) describes pairwise distinction between M objects" -Google
Bray-Curtis is the commonly used method to do this b/c it handles changes in composition and relative abundances well even when some data are missing (eg. in the raw data matrix some genes aren't observed)
Also *specify the number of dimensions (k)*. It's common to chose 2 or 3. Choose the best to minimze stress value
#### Code
```
cdata.nms <- metaMDS(cdata[,c(3:10)], distance="bray", k=3)
cdata.nms
```
Last stress is the highest (k=2 was awful). I got:
```
Run 20 stress 0.02210778
```
Stress:
* Excellent: < 0.05
* Great: < 0.1
* Okay: < 0.2
* Questionable: > 0.2
So not too bad.
### Step 2. Examine Shepard plot
Plot shows the final reduced-dimension ordination result based on ranks (red line) compared to the original dissimilarities (blue dots). Want blue dots to be close to the red line.
Code
```
stressplot(cdata.nms)
```

### Step 3. Plot w/ vegan but differentiate by pth variable
#### Code
```
path = c(rep("green",5), rep("blue",4)) # same order as your
rows
plot(cdata.nms, type="n")
ordihull(cdata.nms, groups=path, draw="polygon", col="grey90", label=F)
orditorp(cdata.nms, display="species", col="red")
orditorp(cdata.nms, display="sites", col=c(rep("green",5),rep("blue",4)), cex=1.25)
```

Vegan assumes rows are sites and our columns are species.
NMDS analyses cannot tell you what the loadings of each axis (NMDS1-X) and sequence of axes are arbitrary (e.g. the largest difference doesn't always show up on NMDS1 whereas it does on PC1).
### Step 4. Extract coords for ggplot2
#### Code pt.1
Make a dataframe w/ information regarding each genome
```
cdata.nms.res <- as.data.frame(scores(cdata.nms, "sites"))
cdata.nms.res$taxonomy <- cdata$taxonomy
cdata.nms.res$path <- path
head(cdata.nms.res)
```
#### Output pt.1
```
NMDS1 NMDS2 NMDS3 taxonomy path
a1 -0.24232251 -0.07352231 -0.03156709 burkho green
a2 -0.15393725 0.15153922 -0.08292738 burkho green
a3 -0.24137050 -0.07425665 -0.05172880 burkho green
a4 -0.21895156 -0.06702596 0.18095354 parabu green
a5 -0.05144667 -0.15277998 0.02630649 parabu green
a6 0.11986093 -0.12726608 -0.11856719 parabu blue
```
#### Code pt.2
Make an additional dataframe with information regarding each gene
```
cdata.scores <- as.data.frame(scores(cdata.nms, "species"))
cdata.scores$genes <- rownames(cdata.scores)
head(cdata.scores)
```
#### Output pt.2
```
NMDS1 NMDS2 NMDS3 genes
gene1 0.28405766 -0.21300739 0.115962720 gene1
gene2 -0.05182152 -0.18399198 -0.007671403 gene2
gene3 -0.45226934 0.53469004 -0.222850619 gene3
gene4 -0.33626396 -0.17801963 0.092867097 gene4
gene5 -0.16701435 0.14426518 0.002336161 gene5
gene6 -0.02849258 0.05968661 -0.040683146 gene6
```
### Step 5. Look at Taxonomy
#### Code for NMDS1 vs. NMDS2
```
ggplot() + geom_text(data=cdata.scores, aes(x=NMDS1, y=NMDS2,
label=genes), alpha=0.5) + geom_point(data=cdata.nms.res, aes(x=NMDS1, y=NMDS2, color=taxonomy)) + geom_label_repel(data=cdata.nms.res, aes(x=NMDS1, y=NMDS2,
color=taxonomy, label=row.names(cdata.nms.res)))
```

#### Code for NMDS2 vs. NMDS3
```
ggplot() + geom_text(data=cdata.scores, aes(x=NMDS2, y=NMDS3,
label=genes), alpha=0.5) + geom_point(data=cdata.nms.res, aes(x=NMDS2, y=NMDS3, color=taxonomy)) + geom_label_repel(data=cdata.nms.res, aes(x=NMDS2, y=NMDS3,
color=taxonomy, label=row.names(cdata.nms.res)))
```

### Step 6. Determine which samples compose the "hull" of the groups
#### Code pt.1
```
ath.g1 <- subset(cdata.nms.res, path=="green")[chull(subset(cdata.nms.res, path=="green")[,c(1:2)]),]
path.b1 <- subset(cdata.nms.res, path=="blue")[chull(subset(cdata.nms.res, path=="blue")[,c(1:2)]),]
hull.data.1 <- rbind(path.g1, path.b1)
hull.data.1
```
#### Output pt.1
```
NMDS1 NMDS2 NMDS3 taxonomy path
a5 -0.05144667 -0.15277998 0.02630649 parabu green
a3 -0.24137050 -0.07425665 -0.05172880 burkho green
a1 -0.24232251 -0.07352231 -0.03156709 burkho green
a2 -0.15393725 0.15153922 -0.08292738 burkho green
a8 0.37146808 -0.05901354 0.14668125 caball blue
a7 0.36383965 -0.06122061 -0.09179992 caball blue
a6 0.11986093 -0.12726608 -0.11856719 parabu blue
a9 0.05285983 0.46354590 0.02264910 ralsto blue
```
#### Code pt.2
```
path.g2 <- subset(cdata.nms.res, path=="green")[chull(subset(cdata.nms.res, path=="green")[,c(2:3)]),]
path.b2 <- subset(cdata.nms.res, path=="blue")[chull(subset(cdata.nms.res, path=="blue")[,c(2:3)]),]
hull.data.2 <- rbind(path.g2, path.b2)
hull.data.2
```
#### Output pt.2
```
NMDS1 NMDS2 NMDS3 taxonomy path
a2 -0.15393725 0.15153922 -0.08292738 burkho green
a3 -0.24137050 -0.07425665 -0.05172880 burkho green
a5 -0.05144667 -0.15277998 0.02630649 parabu green
a4 -0.21895156 -0.06702596 0.18095354 parabu green
a6 0.11986093 -0.12726608 -0.11856719 parabu blue
a8 0.37146808 -0.05901354 0.14668125 caball blue
a9 0.05285983 0.46354590 0.02264910 ralsto blue
```
### Step 7. Plot samples
By each pair of MDS axes
#### Code for plot 1
```
ggplot() + geom_text(data=cdata.scores, aes(x=NMDS1, y=NMDS2, label=genes), alpha=0.5) + geom_point(data=cdata.nms.res, aes(x=NMDS1, y=NMDS2, color=taxonomy)) + geom_polygon(data=hull.data.1,aes(x=NMDS1,y=NMDS2, fill=path, group=path), alpha=0.30) + geom_label_repel(data=cdata.nms.res, aes(x=NMDS1,y=NMDS2, color=taxonomy, label=row.names(cdata.nms.res)))
```
#### Output for plot 1

#### Code for plot 2
```
ggplot() + geom_text(data=cdata.scores, aes(x=NMDS2, y=NMDS3, label=genes), alpha=0.5) + geom_point(data=cdata.nms.res, aes(x=NMDS2, y=NMDS3, color=taxonomy)) + geom_polygon(data=hull.data.1,aes(x=NMDS2,y=NMDS3, fill=path, group=path), alpha=0.30) + geom_label_repel(data=cdata.nms.res, aes(x=NMDS2,y=NMDS3, color=taxonomy, label=row.names(cdata.nms.res)))
```
#### Output for plot 2

## 1.3 Hierarchical Clustering
Hierarchical clustering makes dendrograms similar to distance trees. One type makes UPGMA (Unweighted Pair Group Method with Arithmetic Mean) trees.
### Step 1. Make a (Bray-Curtis) Distance Matrix
#### Code
```
d <- vegdist(cdata[,3:10], method="bray")
```
### Step 2. Run Hierarchical Clustering Analysis
#### Code
```
cdata.clust <- hclust(d, method="ward.D2")
```
There are different clusering methods:
* "ward.D" tries to find compact spherical clusters
* "ward.D2" tries to find compact spherical clusters
* "single"
* "complete" (default) uses the maximum distance rather than the average distance
* "average" = UPGMA. *Uses the avg identities of members of a cluster to represent the cluster* (Note: For project might be helpful)
* "mcquitty" = WPGMA
* "median" = WPGMC
* "centroid" = UPGMC
### Step 3. Look at the dendogram
#### Code
```
plot(cdata.clust)
```

Looks surprisingly like I expected in 1.2 and the end of 1.1.
### Step 4. Cut Across the Dendogram
Pick a K based on your guess of how many clusters may be present (example uses 2 based on our experimental variable “path”).
Note: K doesn't affect clustering itself
#### Code
```
cutree(cdata.clust, k=2)
```
#### Output
```
a1 a2 a3 a4 a5 a6 a7 a8 a9
1 1 1 1 1 1 2 2 1
```
So everyone's in 1 except a7 and a8.
### Step 5. Convert Clustering into dendrogram
#### Code
```
dendro <- as.dendrogram(cdata.clust)
plot(dendro, type = "rectangle", ylab = "Height")
```

(Basically looks like the picture above but without the different lengths)
### Step 6. Manipulate the dendrogram
Purdify it.
Example: Color the “leaves” based on the cluster identity that was shown with cutree().
#### Code
```
dendro %>%
set("leaves_pch", c(19)) %>%
set("leaves_cex", c(2)) %>%
set("leaves_col", as.numeric(cutree(cdata.clust, k=2)[cdata.clust$order])+1) %>% plot(type = "rectangle")
```

ggplot2 uses + while dendextend uses %>% which is weird and awful.
# 2. Box Plots and Dot Plots
Compare a specific set of fenes and how they differ in occurrence or abundance! While usually, these types of figures would accompany a statistical test, in this class it's not super important.
### Step 1. Convert data from wide to long
Instead of each row being a complete set of measures collected for each sample, each row will be a single measure collected for a sample/
#### Code
```
cdata$path <- path
cdata.long <- melt(cdata[,1:11])
head(cdata.long)
```
#### Output
```
sample taxonomy path variable value
1 a1 burkho green gene1 2
2 a2 burkho green gene1 2
3 a3 burkho green gene1 3
4 a4 parabu green gene1 4
5 a5 parabu green gene1 4
6 a6 parabu blue gene1 5
```
### Step 2. Plot into dotplots
This shows the differences in each gene count by taxonomy or "path".
#### Code
```
ggplot(cdata.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
ggplot(cdata.long, aes(x=path, y=value, fill=path)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
```


### Step 3. Subset data
Subseting by gene and plotting
#### Code
```
gene2.long <- subset(cdata.long, variable=="gene2")
ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) +
geom_boxplot()
```

### Step 4. Dotplots not Boxplots
#### Code
```
ggplot(cdata.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_dotplot(dotsize=2, binaxis='y', stackdir='center') + facet_wrap(.~variable, scales="free")
ggplot(cdata.long, aes(x=path, y=value, fill=path)) + geom_dotplot(dotsize=2, binaxis='y', stackdir='center') + facet_wrap(.~variable, scales="free")
ggplot(gene2.long, aes(x=path, y=value, fill=path)) +
geom_dotplot(binaxis='y', stackdir='center')
```



### Step 5. Spot Matrix
#### Code
```
ggplot(cdata.long, aes(taxonomy, variable)) + geom_point(aes(size=value)) + scale_size_continuous() + coord_flip() + theme_bw() + xlab("")
```

### Step 6. Heat Map
#### Code
```
ggplot(cdata.long, aes(taxonomy, variable, fill=value)) + geom_tile(colour="white", lwd=1.5, linetype=1) + scale_fill_gradient(low = "white", high = "black") + geom_text(aes(label=value), colour="white") + coord_flip() + theme_bw() + xlab("")
```
