# BI345 Lab 4 Goal: take out read counts from last week and do a differential expression analysis Note: Remember that you are using counts from the following samples. The sample files are named to make it easy to tell what it is. Each sample name has the structure dicty_burk, and the controls are noted as a1-c2 instead of a burk strain id. Note that I have already prepared these samples with block randomization so you do not need to take batch into account in your GLM. ![](https://i.imgur.com/UVS8u61.png) Documentation for DESeq2: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html ## Exercise 1 - Set Up We need to read our data into R, and then associate it with a metadata table. NOTE: Make sure to set working directory ``` setwd("/courses/bi345/sdivit25/Lab_4") ``` 1.1. First load the package. ```library(DESeq2)``` 1.2. Prepare to tell a subsequent function where your files are. ``` directory <- "/courses/bi345/Course_Materials/wk_04/final_counts/" sampleFiles <- grep("qs",list.files(directory),value=TRUE) sampleName <- sub("(*).star.count","\\1",sampleFiles) ``` You can see here why it is generally a good idea to have a consistent file-naming schema. It makes operations like this much simpler. 1.3. Next set up the metadata table, that includes the countfile names for the samples that you figured out above. ```sampleTable <- data.frame(sampleName = sampleName, fileName = sampleFiles, dicty = c(rep("qs18",6), rep("qs864",6), rep("qs9",6)), batch = c("1966_1","1966_1","1966_1","1966_2","1966_2","1966_2","1966_3","1966_3","1966_4","1966_4","1966_3","1966_4","1952_7","1952_8","1952_7","1952_7","1952_8","1952_8"), condition = c("control","infected","infected","control","infected","infected","infected","infected","infected","infected","control","control","control","control","infected","infected","infected","infected"), genome = c("none","reduced","nonreduced","none","nonreduced","reduced","reduced","nonreduced","nonreduced","reduced","none","none","none","none","reduced","nonreduced","nonreduced","reduced"), burk = c("none","b11","b159","none","b70","b859","b11","b159","b70","b859","none","none","none","none","b11","b159","b70","b859"), stringsAsFactors = T ) ``` 1.4. Now read in the counts. There are a few options for how to do this depending on how you got your counts. We will use the htseq-count file option. ``` dds.dicty <- DESeqDataSetFromHTSeqCount(sampleTable= sampleTable, directory= directory, design= ~condition+dicty) ``` Prof Noh gave this function the most basic GLM design. What she's interested in as her treatment in this GLM is the effect of condition, any infection compared to no infection. Dicty strain was the biological background in which each infection was carried out. So dicty is included as a factor to account for it as a random effect that she knows can be biologically variable. 1.5. We can eliminate some rows (genes) from the count matrix that don’t have a whole lot of read counts. I’m keeping rows in which at least 3 samples have 10 or more counts for a given gene. ``` keep <- rowSums(assay(dds.dicty) >= 10) >= 3 table(keep) dds.dicty <- dds.dicty[keep,] ``` 1.6. Now run the “normalization” and shrinkage. ``` dds.dicty <- DESeq(dds.dicty) ``` Pay attention here to everything that DESeq2 is saying it’s doing in your console. This output should make sense considering what we talked about regarding the statistics behind differential expression analysis. 1.7. At this point, your dds (DESeq dataset) object contains raw counts, size factors, and other parameters used to apply shrinkage. For example, we can check to see the relationship between library size and size factor. ``` rc <- apply(assay(dds.dicty), 2, sum) sf <- sizeFactors(dds.dicty) plot(rc, sf) ``` Output: ![](https://i.imgur.com/mtAFGS8.png) **Q1. Why might the relationship between library size and size factor be non-linear?** I think the relationship between library size and size factor can be non-linear for a few reasons. First, RNA-seq data often exhibit high variability in read counts due to factors such as GC content, transcript length, and sequence-specific biases. This variability can make it difficult to accurately estimate the size factor based on library size alone, especially for small or large libraries. Also, the relationship between library size and size factor may be influenced by the distribution of reads across genes. For instance, if a large proportion of reads in a library come from highly expressed genes, then the size factor may be overestimated if based solely on library size. Conversely, if a library has a high proportion of reads from lowly expressed genes, then the size factor may be underestimated. 1.8. We can also look at some of the ways in which counts behave in this dataset. We can check the mean-variance relationship, as well as that mean-CV (coefficient of variance = dispersion) relationship to better understand what DESeq2 does. ``` coefvar <- function(x) sd(x)/mean(x) dm <- apply(assay(dds.dicty), 1, mean) dv <- apply(assay(dds.dicty), 1, var) dc <- apply(assay(dds.dicty), 1, coefvar) plot(log(dm), log(dv)) plot(log(dm), dc) ``` First plot: ![](https://i.imgur.com/QDjZInV.png) Second plot: ![](https://i.imgur.com/MI0Mt04.png) 1.9. This next figure illustrates what DESeq2 does for dispersion to allow better differential expression analysis. ``` plotDispEsts(dds.dicty) ``` Output: ![](https://i.imgur.com/cZJgYvr.png) **Q2. Describe the relationship between count means and variances, and what DESeq2 appears to be doing during dispersion estimation and model fitting.** As the mean expression level of a gene increases, so does its variance. During dispersion estimation, DESeq2 seems to be estimating the mean-variance relationship of the data and uses it to model the dispersion of each gene by fitting a negative binomial distribution to the data. In model fitting, it uses a generalized linear model to analyze differential gene expression. Overall, I think DESeq2 estimates the mean-variance relationship of the data and uses it to model the dispersion of each gene, which is then used in the GLM to identify differentially expressed genes between experimental conditions. ## Exercise 2. Differential expression 2.1. Now let’s run a differential expression analysis based on the GLM model I specified above. You can see which DE results I can access by running this function. ``` resultsNames(dds.dicty) ``` You will see some choices regarding which contrasts are available. This will come in handy when you have a design that allows multiple factors to be fitted. 2.2. I can now specify that I want to know DE genes based on condition. I want to use a more stringent FDR threshold (0.05) than the default (0.1) so I’ve indicated that as well. ``` res.dicty <- results(dds.dicty, contrast=c("condition","infected","control"), alpha=0.05) ``` 2.3. We can do a quick check of the relationship between p-value and FDR-corrected padj (=adjusted p-value). ``` plot(res.dicty$pvalue, res.dicty$padj) abline(a=0, b=1, lty=2, col="red") ``` Plot of p-value & FDR-corrected p adj: ![](https://i.imgur.com/ds1P1Eb.png) ***You can see that small p-values are stretched out over a longer range when FDR-corrected. 2.4. For convenience, I can re-order these results so that the most significant ones are listed first. ``` res.dicty <- res.dicty[order(res.dicty$padj),] ``` 2.3. Now let’s check the summary of results. This will tell you how many genes are significantly up or down regulated. ``` summary(res.dicty) ``` Summary Output: ![](https://i.imgur.com/8MS2S0k.png) You can also look at the results table directly. ``` mcols(res.dicty)$description head(res.dicty) ``` Table Output: ![](https://i.imgur.com/Eq9cmoe.png) ## Exercise 3. Exploratory visualization Let’s do some visualization, first to explore our samples and then to show results. 3.1. First, we’ll transform the data using regularized log (rlog). These transformed data will let us look at trends without worrying too much about the influence of library size and mean expression. This is for visualization only. ``` rld.dicty <- rlogTransformation(dds.dicty, blind=F) ``` 3.2. Now we can create a PCA plot. By default, this function takes 500 of the most variable genes and runs a quick PCA visualization. You can supply more than one grouping variable at a time but I find it easier to just show one at a time. ``` plotPCA(rld.dicty, intgroup=c("dicty")) plotPCA(rld.dicty, intgroup=c("condition")) plotPCA(rld.dicty, intgroup=c("genome")) plotPCA(rld.dicty, intgroup=c("burk")) ``` Plot 1: ![](https://i.imgur.com/BmzKX5S.png) Plot 2: ![](https://i.imgur.com/EAz4lt2.png) Plot 3: ![](https://i.imgur.com/79LMULo.png) Plot 4: ![](https://i.imgur.com/pWq6jKA.png) 3.3. We can check how well our samples correlate with each other. I want to make sure I can easily tell the difference between infected and uninfected samples so I’ll create a color scheme for that, and also for the heatmap. ``` color.map <- function(condition) { if (condition=="infected") "red" else "greenyellow" } cond.colors <- unlist(lapply(sampleTable$condition, color.map)) library(RColorBrewer) corr.colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) dist.colors <- colorRampPalette(brewer.pal(9, "Oranges"))(255) ``` 3.4. We can create a correlation matrix and show the relationships among samples in a heatmap. Here, the highest correlation has the lightest colors and darker colors indicate samples with low correlation. ``` corr.mat <- as.matrix(cor(assay(rld.dicty), method="spearman")) heatmap(corr.mat, Colv=F, Rowv=NA, scale="none", ColSideColors=cond.colors, col=corr.colors) ``` Correlation Matrix as Heatmap: ![](https://i.imgur.com/QD4N2M6.png) 3.5. We can also create a distance matrix and visualize results in a heatmap. Here, the smallest distance has the lightest colors. ``` dist.mat <- as.matrix(dist(t(assay(rld.dicty)))) heatmap(dist.mat, Colv=F, Rowv=NA, scale="none", ColSideColors=cond.colors, col=dist.colors) ``` Distance Matrix as Heatmap: ![](https://i.imgur.com/ghmt6LX.png) Note: - In either correlations or distances, we are checking to see if our controls are most similar to each other. You’ll find that the information is slightly different depending on whether you look or correlations or distances. ## Exercise 4. Results visualization Now let’s make some typical RNA-seq results plots. MA plots can tell us the relationship between mean expression and fold change. We can look at this two ways, before and after shrinkage is applied to log fold change. 4.1. Before shrinkage: ``` plotMA(res.dicty) ``` MA Plot before shinkage: ![](https://i.imgur.com/odPovpm.png) 4.2. After shrinkage: ``` lfc.dicty <- lfcShrink(dds.dicty, coef="condition_infected_vs_control", type="apeglm") plotMA(lfc.dicty) ``` MA Plot after shrinkage: ![](https://i.imgur.com/CPDvKTO.png) Note: - You should be able to see the effect of shrinkage because in the before plot, the significant (blue) and nonsignificant (grey) fold changes are mixed around the edges of the cloud of points. 4.3. Now, let’s try a volcano plot. To do this, we’ll save the results of the DE analysis to a data frame and plot those results using some ggplot2 functions. ``` library(ggplot2) tab.dicty <- as.data.frame(lfc.dicty) tab.dicty$sig <- tab.dicty$padj tab.dicty$sig <- ifelse(tab.dicty$padj<=0.05, "FDR 0.05","not sig.") tab.dicty$sig <- factor(ifelse(tab.dicty$padj>0.05 & tab.dicty$padj<=0.1, "FDR 0.1", tab.dicty$sig)) ggplot(tab.dicty, aes(x=log2FoldChange, y=-log10(padj), color=sig)) + geom_point(shape=16, size=1) + scale_color_manual(values=c("red","yellowgreen","grey"),name="") + ylab("-log10(Padj)") ``` Volcano Plot: ![](https://i.imgur.com/gzdk1uU.png) 4.4. Maybe you want to show a heatmap of expression change. We can do that as well. Here I’m selecting my top 50 most significant foldchange genes and showing rlog-corrected expression. Genes relatively overexpressed are shown in Red while underexpressed are shown in blue. ``` heat.colors <- colorRampPalette(brewer.pal(9, "RdBu"))(255) select <- row.names(res.dicty)[1:50] heatmap(assay(rld.dicty)[select,], ColSideColors=cond.colors, cexRow=0.4, col=heat.colors) ``` Heatmap of expression change: ![](https://i.imgur.com/6YG5Ykz.png) 4.5. Lastly, you can also plot individual gene “counts”. For example, for the first (most significant, so row 1) gene for our sorted DE analysis results. If you knew of a specific gene id, you can enter it directly as well. ``` plotCounts(dds.dicty, gene=res.dicty[1,1], intgroup="condition") plotCounts(dds.dicty, gene="DDB_G0268208", intgroup="condition") # higher in control plotCounts(dds.dicty, gene="DDB_G0293980", intgroup="condition") # lower in infected ``` Plot 1: ![](https://i.imgur.com/ZfnOjRB.png) Plot 2: ![](https://i.imgur.com/VWTEvtH.png) Plot 3: ![](https://i.imgur.com/PkjdukK.png) 4.6. For next week, let’s save our results to a file. Make sure working directory is set to a place that you have write-permission. ``` write.csv(tab.dicty, file="dicty.condition_de_results.txt") ``` ## Exercise 5 - Your own Analysis: Genome Basically Rerunning analysis except with a new dds aobject where ```design= ~genome+dicty``` PCA Plots: intgroup = c("dicty") ![](https://i.imgur.com/nuwE1x1.png) intgroup = c("condition") ![](https://i.imgur.com/9dDAlfg.png) intgroup = c("genome") ![](https://i.imgur.com/BgOBrfK.png) intgroup = c("burk") ![](https://i.imgur.com/8wzDLCj.png) Correlation Matrix Heatmap: ![](https://i.imgur.com/SpY4OKH.png) Distance Matrix Heatmap: ![](https://i.imgur.com/XnIc529.png) MA Plot Before Shrinkage: ![](https://i.imgur.com/Qzf4Vtb.png) MA Plot After Shrinkage: ![](https://i.imgur.com/wdv2RTy.png) Volcano Plot: ![](https://i.imgur.com/E0pbKYm.png) Heatmap of Expression Change: ![](https://i.imgur.com/LDejwBK.png) Plot Counts 1: ![](https://i.imgur.com/E1VZFqr.png) Plot Counts 2: ![](https://i.imgur.com/V4eCwSv.png) Plot Counts 3: ![](https://i.imgur.com/7njQRdz.png)