BI345 lab #4 (wk_05)

# BI345 lab #4 (wk_05) We will take the read counts made last week and run a differential expression analysis. We will be running in R for most of this process. First, load in the DESeq2 library and assign the schema for the file structure. ``` library(DESeq2) directory <- "/courses/bi345/Course_Materials/wk_04/final_counts" sampleFiles <- grep("qs", list.files(directory), value = T) sampleName <- sub("(*).star.count", "\\1", sampleFiles) ``` Setting up the metadata table is important so that we can refer to this table for any information about any sample or any attributes. This metadata table includes the countfile names for the samples above. ``` sampleTable <- data.frame(sampleName = sampleName, fileName = sampleFiles, dicty = c(rep("qs18",6), rep("qs864",6), rep("qs9",6)), batch = c("1966_1","1966_1","1966_1","1966_2","1966_2","1966_2","1966_3","1966_3","1966_4","1966_4","1966_3","1966_4","1952_7","1952_8","1952_7","1952_7","1952_8","1952_8"), condition = c("control","infected","infected","control","infected","infected","infected","infected","infected","infected","control","control","control","control","infected","infected","infected","infected"), genome = c("none","reduced","nonreduced","none","nonreduced","reduced","reduced","nonreduced","nonreduced","reduced","none","none","none","none","reduced","nonreduced","nonreduced","reduced"), burk = c("none","b11","b159","none","b70","b859","b11","b159","b70","b859","none","none","none","none","b11","b159","b70","b859"), stringsAsFactors = T ) ``` Now, we can read in the counts. This uses a propreitary function, ```DESeqDataSetFromHTSeqCount``` that specifies the data filetype that should be read. The parameters for this function is specifying that the effect that should be focused on is the ```condition``` (i.e. infection vs. no infection). The Strain was the biological background and thus was included as a factor to account for random effect. ``` dds.dicty <- DESeqDataSetFromHTSeqCount(sampleTable= sampleTable, directory= directory, design= ~condition+dicty) ``` Eliminating some genes from the count matrix below a certain read count threshold will allow us to remove noise. In this case, we will only keep genes that have at least 3 samples that have 10 or more counts for a given gene. ``` keep <- rowSums(assay(dds.dicty) >= 10) >= 3 table(keep) dds.dicty <- dds.dicty[keep,] ``` ![](https://i.imgur.com/3DrNz5m.png) Running this threshold keeps 58% of the reads. The DESeq METHOD will normalize and shrinkage on this object. ``` dds.dicty <- DESeq(dds.dicty) ### output in console ### # estimating size factors # estimating dispersions # gene-wise dispersion estimates # mean-dispersion relationship # final dispersion estimates # fitting model and testing ``` Our DESeq dataset (dds) object contains raw counts, size factors, and other parameters used to apply shrinkage We can see the relationship between library size and szie factor by using hte apply function. ``` rc <- apply(assay(dds.dicty), 2, sum ) sf <- sizeFactors(dds.dicty) plot(rc, sf) ``` ![](https://i.imgur.com/YhuYWL0.png) **Q1:** Why might the relationship between library size and size factor be non-linear? **A1:** DESeq2 performs an internal normalization where geometric mean is calculated for each gene across all samples. The counts for a gene in each sample is then divided by this mean. The median of these ratios in a sample is the size factor for that sample. The library size in DESeq2 is internally corrected so it takes an untransformed input. This is why the relationship is non-linear. Next, we can take a look at the way the counts behave in the dataset. We can check the relationship between mean and variance (```log(dm), log(dv)``` ) as well as the mean-coefficient_of_variance (```log(dm), dc```) relationships ``` coefvar <- function(x) sd(x)/mean(x) dm <- apply(assay(dds.dicty), 1, mean) dv <- apply(assay(dds.dicty), 1, var) dc <- apply(assay(dds.dicty), 1, coefvar) plot(log(dm), log(dv)) plot(log(dm), dc) ``` ![](https://i.imgur.com/L7i4x9d.png) ![](https://i.imgur.com/wIRht8Y.png) We can also illustrate what DESeq2 does for dispersion to allow better visualization of differential effects ``` plotDispEsts(dds.dicty) ``` ![](https://i.imgur.com/KHQ8cQa.png) **Q2** Describe the relationship between count means and variances, and what DESeq2 appears to be doing during dispersion estimation and model fitting. **A2**: We see that from the mean vs. variance plot that the relationships appear to be roughly linear. This implies that the normalized means in the DESeq2 data set roughly equal the normalized variances in the same dataset. This is not true of the raw count data prior to normalization. It also appears that the spread of the mean-variance plot is larger for lower mean counts. This would effect the DESeq2 dispersion estimation by differing the estimate much more for smaller mean counts. From the dispersion-mean(normal) graph, we see that there appears to be three steps being conducted by DESeq2. It first estimates the dispersion for each gene by using maximum likelihood esitmation. Second, a curve is fit onto these estimated estimations. In the third step, the final dispersion estimates are calculated by shrinking the original estimates towards the fitted curve. Any original estimates circled in the color corresponding to the final estimates had dispersion/variability so high that they did not match the DESeq2 model assumptions and were not shrunk towards the curve. ## Differential Expression We can now run a differential expression analysis based on the generalized linear model. The command below will return which DE results can be accessed ``` resultsNames(dds.dicty) ``` We can specify that we want to know the differentially expressed genes based on their condition. If we wanted to use a stringent FDR (0.05) than the default, we can specify that in the parameters. ``` res.dicty <- results(dds.dicty, contrast = c("condition", "infected", "control"), alpha = 0.05) ``` We can check the relationship between the p-value and the FDR-corrected adjusted p-value by plotting them on a line. ``` plot(res.dicty$pvalue, res.dicty$padj) abline(a = 0, b = 1, lty = 2, col = "red") ``` ![](https://i.imgur.com/38Lso2f.png) We see that the smaller p-values are stretched out over a longer range. We can also reorder the results so that the significant results are listed first. ``` out of 9887 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 669, 6.8% LFC < 0 (down) : 357, 3.6% outliers [1] : 52, 0.53% low counts [2] : 0, 0% (mean count < 2) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results ### output ### # out of 9887 with nonzero total read count # adjusted p-value < 0.05 # LFC > 0 (up) : 669, 6.8% # LFC < 0 (down) : 357, 3.6% # outliers [1] : 52, 0.53% # low counts [2] : 0, 0% # (mean count < 2) # [1] see 'cooksCutoff' argument of ?results # [2] see 'independentFiltering' argument of ?results ``` ### Exploratory visualization We can transform the data using regularized log (rlog). This transformation will allow us to look at trends without worrying about the influence of the library size and the mean expression. ``` rld.dicty <- rlogTransformation(dds.dicty, blind = F) ``` We can now create a PCA plot which takes 500 of the most variable genes and runs a quick visualization. ``` plotPCA(rld.dicty, intgroup=c("dicty")) plotPCA(rld.dicty, intgroup=c("condition")) plotPCA(rld.dicty, intgroup=c("genome")) plotPCA(rld.dicty, intgroup=c("burk")) ``` ![](https://i.imgur.com/mV71wPm.png) ![](https://i.imgur.com/yUS4TND.png) ![](https://i.imgur.com/0E0Gp37.png) ![](https://i.imgur.com/OXKglge.png) We can see how well our samples correlate with each other via heatmap. We can do this by manually creating a color scheme for the infected and uninfected samples. ``` color.map <- function(condition) { if (condition=="infected") "red" else "greenyellow" } cond.colors <- unlist(lapply(sampleTable$condition, color.map)) library(RColorBrewer) corr.colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) dist.colors <- colorRampPalette(brewer.pal(9, "Oranges"))(255) ``` We can create a correlation matrix that shows the relationships among samples in a heatmap. Here, the highest correlation has light colors and the darker colors indicate low correlation. We use the Spearman correlation matrix here because we do not expect the relationship to be linear. ``` corr.mat <- as.matrix(cor(assay(rld.dicty), method="spearman")) heatmap(corr.mat, Colv=F, Rowv=NA, scale="none", ColSideColors=cond.colors, col=corr.colors) ``` ![](https://i.imgur.com/8ZIrh4r.png) We can also create a distance matrix that visualizes the results into a heatmap. This heatmap will correspond to distances rather than correlations. The smaller distances are shown with lighter colors ``` dist.mat <- as.matrix(dist(t(assay(rld.dicty)))) heatmap(dist.mat, Colv=F, Rowv=NA, scale="none", ColSideColors=cond.colors, col=dist.colors) ``` ![](https://i.imgur.com/m3dVcis.png) The information shown is slightly difference however fairly simmilar. We are checking to see if our controls are the most similar to each other. ### Results visualization ### We can make an MA plot that shows us the relationship between the mean expression and the fold change (up/down regulation). We can look at these relationships before and after shrinkage is applied to the log-fold change Before Shrinkage ``` plotMA(res.dicty) ``` ![](https://i.imgur.com/tWnX7bT.png) After Shrinkage ``` lfc.dicty <- lfcShrink(dds.dicty, coef = "condition_infected_vs_control", type = "apeglm") plotMA(lfc.dicty) ``` ![](https://i.imgur.com/2Xrtl83.png) We see from the effect of shrinkage by comparing these two plots. Before the shrinkage, the siginificant and non-significant fold changes are mixed together around the borders of the cluster of points. A volcano plot is another way that we can visualize fold changes. We need to save the results of the differential expression analysis to a data frame and plot those results using ggplot2 ``` library(ggplot2) tab.dicty <- as.data.frame(lfc.dicty) tab.dicty$sig <- tab.dicty$padj tab.dicty$sig <- ifelse(tab.dicty$padj<=0.05, "FDR 0.05","not sig.") tab.dicty$sig <- factor(ifelse(tab.dicty$padj>0.05 & tab.dicty$padj<=0.1, "FDR 0.1", tab.dicty$sig)) ggplot(tab.dicty, aes(x=log2FoldChange, y=-log10(padj), color=sig)) + geom_point(shape=16, size=1) + scale_color_manual(values=c("red","yellowgreen","grey"),name="") + ylab("-log10(Padj)") ``` ![](https://i.imgur.com/mVDt0HR.png) We can also show a heatmap of the expression change by selecting the 50 most significant fold change genes and showing the rlog-corrected expression. Overexpressed genes are shown in red while underexpressed genes are shown in blue ``` heat.colors <- colorRampPalette(brewer.pal(9, "RdBu"))(255) select <- row.names(res.dicty)[1:50] heatmap(assay(rld.dicty)[select,], ColSideColors=cond.colors, cexRow=0.4, col=heat.colors) ``` ![](https://i.imgur.com/OOtaPKx.png) Individual gene cts can also be shown. The most significant gene for our sorted DE analysis results are shown below. ``` plotCounts(dds.dicty, gene=res.dicty[1,1], intgroup="condition") plotCounts(dds.dicty, gene="DDB_G0268208", intgroup="condition") # higher in control plotCounts(dds.dicty, gene="DDB_G0293980", intgroup="condition") # lower in infected ``` ![](https://i.imgur.com/Gg5jJFh.png) ![](https://i.imgur.com/DicunGb.png) ![](https://i.imgur.com/QMJtgNN.png) ### Own Analysis We will be running our analysis using ```genome``` instead of the conditions field. This breaks down the segmented conditions field into further genome clades First load in the data set and threshold by the count again. ``` dds.genome <- DESeqDataSetFromHTSeqCount(sampleTable= sampleTable, directory= directory, design= ~genome+dicty) keep <- rowSums(assay(dds.genome) >= 10) >= 3 table(keep) dds.genome <- dds.genome[keep,] dds.genome <- DESeq(dds.genome) rc <- apply(assay(dds.genome), 2, sum ) sf <- sizeFactors(dds.genome) plot(rc, sf) ``` ![](https://i.imgur.com/jbmaThb.png) We see from the plot of the library size and the size factor that the relationship is non-linear, which we expect. ``` coefvar <- function(x) sd(x)/mean(x) dm <- apply(assay(dds.genome), 1, mean) dv <- apply(assay(dds.genome), 1, var) dc <- apply(assay(dds.genome), 1, coefvar) plot(log(dm), log(dv)) plot(log(dm), dc) plotDispEsts(dds.genome) ``` ![](https://i.imgur.com/4ntAe3S.png) ![](https://i.imgur.com/YZo26jA.png) ![](https://i.imgur.com/bA4Nxav.png) We see similar behavior in the mean-variance and mean-dispersion estimates. The smaller means tend to have higher variance and higher dispersion estimates than larger mean counts. ``` resultsNames(dds.genome) res.genome <- results(dds.genome, contrast=c("genome", "reduced", "nonreduced"), alpha=0.05) plot(res.genome$pvalue, res.genome$padj) abline(a = 0, b = 1, lty = 2, col = "red") res.genome <- res.genome[order(res.genome$padj),] summary(res.genome) mcols(res.genome)$description head(res.genome) ``` ![](https://i.imgur.com/xtyRBkW.png) Before and after shrinkage plots ``` plotMA(res.genome) # before shrinkage lfc.dicty <- lfcShrink(dds.genome, coef = "genome_reduced_vs_none", type = "apeglm") plotMA(lfc.dicty) # after shrinkage ``` ![](https://i.imgur.com/Sss7MFq.png) ![](https://i.imgur.com/NKYh6H4.png) We see again that before shrinkage, the significant and non-significant changes mix together around the border whereas the post-shrinkage plot is much clearer. Volcano Plot: ``` ## plotting volcano plots using ggplot library(ggplot2) tab.genome <- as.data.frame(lfc.genome) tab.genome$sig <- tab.genome$padj tab.genome$sig <- ifelse(tab.genome$padj<=0.05, "FDR 0.05","not sig.") tab.genome$sig <- factor(ifelse(tab.genome$padj>0.05 & tab.dicty$padj<=0.1, "FDR 0.1", tab.genome$sig)) ggplot(tab.genome, aes(x=log2FoldChange, y=-log10(padj), color=sig)) + geom_point(shape=16, size=1) + scale_color_manual(values=c("red","yellowgreen","grey"),name="") + ylab("-log10(Padj)") ``` ![](https://i.imgur.com/7g4dzaj.png) We see that in this volcano plot, in the reduced vs. none genome clade, there appears to be more upregulated genes than those that are downregulated. This is also seen in the log fold change plot above. Heatmap of over and underexpressed genes ``` ## heatmap of over/underexpressed genes heat.colors <- colorRampPalette(brewer.pal(9, "RdBu"))(255) select <- row.names(res.genome)[1:50] heatmap(assay(rld.genome)[select,], ColSideColors=cond.colors, cexRow=0.4, col=heat.colors) ``` ![](https://i.imgur.com/fHAg3r8.png)