BI345 Lab #7 - HackMD

# BI345 Lab #7 What are we learning about? Time series What are time series? Time series are essentially a more complicated version of DE analysis. We’ll take two approaches in limma, and limma + voom: linear models vs. cubic splines. General notes: - many parallels to what limma and DESeq2 do, though limma has slightly different language - One advantage to limma is that it is able to fit more complex models and contrasts simultaneously - It also uses an empirical Bayes procedure, as well as what it calls “quantitative weights” - These weights are meant to give more weight to controls or reliable genes in order to increase detection power - idea behind limma is that we can now treat the converted counts as if they followed a normal distribution, rather than a count distribution (e.g. NB) - voom is the specific function within limma that converts the relationship between individual observations of count mean and variance into “precision weights”, to make limma work even better for RNA-seq experiments To read more about limma: https://academic.oup.com/nar/article/43/7/e47/2414268 To read more about voom: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29 Data used in lab: We are using a subset of the data from Lindsey et al (https://journals.asm.org/doi/10.1128/mBio.03472-20) that looked at gene expression in flies infected with different combinations of Wolbachia and SINV. For simplicity, we are using only the set that are Wolbachia-positive. These data were available in TMM normalized form from NCBI Gene Expression Omnibus (GEO) here. Because of this, we are going to use these data as is, but it is better to be able to start with raw counts because pre-normalized data limit what types of analyses we can apply to them*. Today's lab is all in R. ## Exercise 1 - Linear Model NOTE: Make sure to set working directory ``` setwd("/courses/bi345/sdivit25/Lab_7") ``` 1.1. Read in both count and metadata tables. They are currently in ``` /courses/bi345/Course_Materials/wk_09 ``` We want to make sure the count columns and sample names in the metadata table match. ``` tmm.count <- read.table("./GSE162666.tmm.count", h=T, sep="\t") meta <- read.table("./GSE162666.metadata.txt", h=T, sep="\t", stringsAsFactors=T) colnames(tmm.count) %in% meta$sample ``` 1.2. We need to make sure the metadata table has factors coded properly. For the linear analysis, we’ll specify that the three time points are factors rather than a continuous variable. NOTE: This choice is because, to us, the time points are arbitrary “late” and “later” time points, rather than each 24 hr period being a meaningful unit since we didn’t generate the data. ``` str(meta) meta$time <- factor(as.character(meta$time), levels=c("6","24","48")) str(meta) ``` 1.3. Load the ```edgeR``` library (it automatically calls its dependency ```limma```) and convert the TMM count table to a ```DGEList``` object. ``` library(edgeR) d0 <- DGEList(tmm.count) d0$samples ``` We can see here that this is a similar object to DESeqDataSet. It stores information about library size and normalization (size) factors. These data were already normalized so you can see that the library sizes are very similar. For raw count data, size factors are calculated (and later applied) using calcNormFactors(). 1.4. Filter (remove) genes with low expression. ``` keep <- filterByExpr(d0, group=meta$sinv) table(keep) d <- d0[keep,] ``` ^This command keeps genes with 10 reads or more in at least as many samples as the smallest group size. We are using SINV infection status as our primary grouping factor we are interested in. 1.5. We can check out some diagnostic plots. ``` par(mfrow=c(1,3)) # make three plots appear on the same row plotMDS(d, col=as.numeric(meta$time), label=meta$time) plotMDS(d, col=as.numeric(meta$sinv), label=meta$sinv) plotMDS(d, col=as.numeric(meta$rep), label=meta$rep) ``` Output: ![](https://i.imgur.com/fzmoHQN.png) ^Notice: each of our factors tend to group samples together. So these should be accounted for in the model we’ll fit to these data. 1.6. Code our samples into groups by combining SINV status and time. Then build a LM with this new group factor. ``` meta$group <- factor(paste(meta$sinv, meta$time, sep="."), levels=c("cont.6","cont.24","cont.48","sinv.6","sinv.24","sinv.48")) mm <- model.matrix(~0 + group + rep, data=meta) mm ``` To read about discussion about how to build your model in edgeR (user guide): https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf General gist of discussion^ by Prof Noh: "Briefly, it is crucial to fit a model that is appropriate for the hypothesis you want to test. The experimental design for these data was to check how gene expression changed over time depending on SINV infection status. We need to account for small differences in replicates and allow for interaction between the factors sinv and time (expression changes differently depending on whether SINV-positive or -negative). So it’s recommended to make the 'group' factor like above so we can be explicit in which contrasts we want to make." 1.7. Build linear contrasts based on your hypothesis. ``` cm.linear <- makeContrasts( b6 = groupsinv.6 - groupcont.6, c24 = groupcont.24 - groupcont.6, s24 = groupsinv.24 - groupsinv.6, c48 = groupcont.48 - groupcont.6, s48 = groupsinv.48 - groupsinv.6, t1 = (groupsinv.24 + groupcont.24) - (groupsinv.6 + groupcont.6), t2 = (groupsinv.48 + groupcont.48) - (groupsinv.6 + groupcont.6), x1 = (groupsinv.24 - groupsinv.6) - (groupcont.24 - groupcont.6), x2 = (groupsinv.48 - groupsinv.6) - (groupcont.48 - groupcont.6), levels = colnames(mm)) cm.linear ``` Output: ![](https://i.imgur.com/x9krs5E.png) Explanation/reasoning: Based on what is essentially a 2-sample multi-time series design (both control and treatment are tested at each time point), we want to contrast the two types of samples at each time against their own baseline at 6 hours. Next, we want to see shared changes by both types of samples between 6 and 24 hours post-infection (hpi) and between 6 and 48 hpi. Lastly, we want to test whether there are interactions between 6 and 24 hpi and between 6 and 48 hpi. Keep in mind: we are not going to use all of these contrasts, and are mostly interested in the last 4. However, these are written out here so you can see how multiple pairwise contrasts can be specified at once. 1.8. Use voom to convert data for DE analysis using linear models. ``` par(mfrow=c(1,2)) y.linear <- voom(counts=d, mm, plot=T) ``` Output: ![](https://i.imgur.com/hakrR5d.png) 1.9. Run the DE analysis, including fitting contrasts. ``` vfit.linear <- lmFit(y.linear, mm) vfit.linear <- contrasts.fit(vfit.linear, contrasts=cm.linear) efit.linear <- eBayes(vfit.linear) ``` 1.10. Plot the residuals of the fitted data for comparison. You should not see any patterns here if mean-variance trends were properly accounted for. ``` plotSA(efit.linear, main="residual: Mean-variance trend") ``` Output: ![](https://i.imgur.com/dkB2B11.png) 1.11. Check the results of the different contrasted factors. ``` dt.linear <- decideTests(efit.linear) summary(dt.linear) ``` Output: ![](https://i.imgur.com/sKtnMoj.png) ^These results indicate that there are more changes in the SINV+ samples over time compared to the SINV- controls, particularly at 48 hours post-infection. There are also some genes that behave differently between 6 hpi and the subsequent time points depending on whether a fly was positive for SINV or not. 1.12. Visualize using a Venn diagram. Keep in mind this step is showing more than you need to demonstrate the flexibility of this function. ``` library(RColorBrewer) col <- brewer.pal(8, "Dark2") # any overlap between DE genes at time 24 for sinv- and sinv+ vennDiagram(dt.linear[,c(2,3)], circle.col=col[1:2]) # any overlap between DE genes at time 48 for sinv- and sinv+ vennDiagram(dt.linear[,c(4,5)], circle.col=col[3:4]) # all four early contrasts together vennDiagram(dt.linear[,c(2,3,6,8)], circle.col=col[1:4]) # any overlap between DE genes that responded differently for sinv+ and sinv- controls by time 24 or time 48 vennDiagram(dt.linear[,c(4,5,7,9)], circle.col=col[5:8]) ``` Output: ![](https://i.imgur.com/2DUXdNr.png) ![](https://i.imgur.com/65hYOoC.png) Something to consider: Prof Noh thinks its hard to read more comparisons at the same time but you can also compare up to 5 groups at a time. 1.13. Visualize using MD plots ``` plotMD(efit.linear, column=6, status=dt.linear[,6]) plotMD(efit.linear, column=7, status=dt.linear[,7]) plotMD(efit.linear, column=8, status=dt.linear[,8]) plotMD(efit.linear, column=9, status=dt.linear[,9]) ``` Output: ![](https://i.imgur.com/phL8B5L.png) ![](https://i.imgur.com/ASXAjgp.png) ```limma``` has a version of MA-plots called MD-plots. NEXT STEPS: From here you would proceed with GO or other types of follow-up analyses. We’ll move on to an alternative method of analyzing time series data. ## Exercise 2 - Cubic Splines Gen Notes: - We can also use cubic splines to fit smooth curves to gene-wise gene expression patterns - This would work better if we had more time points but it still works with just 3 time points in our current dataset 2.1. Load the splines library. ``` library(splines) ``` 2.2. Make a dummy variable to enable spline fitting. This time, we are coding time as a number instead of a factor. ``` meta$X <- ns(as.numeric(meta$time), df=2) ``` NOTES: - If we had more time points, we would increase the degrees of freedom df - The recommended number seems to be between 3 to 5 - The degrees of freedom will dictate how many knots (bends) you can have in your smooth spline - With df=2 we can have one knot, so your curve at its most extreme can look like a U-shape (upright or upside down) - If df=3 our spline could have two knots, so the curve at the most extreme could looks like an S-shape 2.3. Build a model that incorporates SINV status-specific splines using this dummy variable. ``` sm <- model.matrix(~0 + sinv*X + rep, data=meta) sm colnames(sm) <- c("cont", "sinv", "X1", "X2", "repB", "repC", "repD", "sinv.X1", "sinv.X2") ``` ^The last step here is changing the names of our model columns for a little more clarity (readibility). 2.4. Fit this model using voom etc like above. ``` y.spline <- voom(counts=d, sm, plot=T) vfit.spline <- lmFit(y.spline, sm) efit.spline <- eBayes(vfit.spline) plotSA(efit.spline, main="residual: Mean-variance trend") ``` Output: ![](https://i.imgur.com/AWmscRQ.png) Note: Because we’re using splines, we no longer look at specific linear contrasts. 2.5. Find genes that respond by time. ``` sfit.list.1 <- row.names(topTable(efit.spline, coef=3:4, number = 1000, p.value = 0.05)) sfit.list.2 <- row.names(topTable(efit.spline, coef=8:9, number = 1000, p.value = 0.05)) ``` Alternate way: Use the ```topTable``` function to see what the results look like in a summary table. Here, the ```p.value``` is the significance cutoff for the adjusted p-value for a gene to make it into the table, and the ```number``` is arbitrarily high to get it to show all the significant genes. 2.6. Save the DE status of these genes to visualize with a Venn diagram. ``` dt.spline <- as.data.frame(dt.linear[,0]) dt.spline$all <- row.names(dt.spline) %in% sfit.list.1 dt.spline$int <- row.names(dt.spline) %in% sfit.list.2 vennDiagram(dt.spline, circle.col = col) ``` Output: ![](https://i.imgur.com/XuFPt0F.png) 2.7. Alternatively, we can fit a negative binomial model and use quasilikelihood on the same spline model in edgeR. It turns out these results are very similar (limma vs. edgeR splines). Prof Noh did it here because edgeR results let us visualize what the splines look like. ``` y.quasi <- estimateDisp(d, sm) qfit <- glmQLFit(y.quasi, sm) plotQLDisp(qfit) ``` Output: ![](https://i.imgur.com/dxb28ok.png) 2.8. Find significant genes that respond to time. ``` qfit.all <- glmQLFTest(qfit, coef=3:4) summary(decideTests(qfit.all)) qfit.int <- glmQLFTest(qfit, coef=8:9) summary(decideTests(qfit.int)) ``` NOTE: This is a similar process to step 2.5. In both cases, the fitted model is being interrogated to show genes that are significantly DE by the selected coefficients. 2.9. Save each set of genes for more Venn diagrams. ``` qfit.list.1 <- row.names(topTags(qfit.all, n = 1000, p.value = 0.05)) qfit.list.2 <- row.names(topTags(qfit.int, n = 1000, p.value = 0.05)) dt.q <- as.data.frame(dt.linear[,0]) dt.q$all <- row.names(dt.q) %in% qfit.list.1 dt.q$int <- row.names(dt.q) %in% qfit.list.2 vennDiagram(dt.q, circle.col = col) ``` Output: ![](https://i.imgur.com/Xw7PvPP.png) 2.10. Let’s take a look at what these splines look like. To do so, we need to first save our data and fitted estimates into logCPM (log-transformed count per million). We’ll do this for all the samples, as well as samples by SINV status. ``` # counts first qfit.obs.0 <- cpm(y.quasi, log=TRUE, prior.count=qfit$prior.count) qfit.obs.1 <- cpm(y.quasi, log=TRUE, prior.count=qfit$prior.count)[,1:12] qfit.obs.2 <- cpm(y.quasi, log=TRUE, prior.count=qfit$prior.count)[,13:24] # fitted estimates next qfit.fit.0 <- cpm(qfit, log=TRUE) qfit.fit.1 <- cpm(qfit, log=TRUE)[,1:12] qfit.fit.2 <- cpm(qfit, log=TRUE)[,13:24] ``` 2.11. Remember that you’ve already saved the names of the genes above. Let’s take a look at some from the second set that seemed to have different splines based on SINV status (qfit.list.2). ``` par(mfrow=c(1,3)) for(i in 1:5) { FlybaseID <- sample(qfit.list.2, 5)[i] qfit.obs.0.i <- qfit.obs.0[FlybaseID,] qfit.obs.1.i <- qfit.obs.1[FlybaseID,] qfit.obs.2.i <- qfit.obs.2[FlybaseID,] qfit.fit.0.i <- qfit.fit.0[FlybaseID,] qfit.fit.1.i <- qfit.fit.1[FlybaseID,] qfit.fit.2.i <- qfit.fit.2[FlybaseID,] plot(as.numeric(meta$time), qfit.obs.0.i, ylab="log-CPM", main=FlybaseID, pch=16) plot(as.numeric(meta$time)[1:12], qfit.obs.1.i, ylab="log-CPM", main="cont", pch=16) lines(as.numeric(meta$time)[1:12], qfit.fit.1.i, col="blue") plot(as.numeric(meta$time)[13:24], qfit.obs.2.i, ylab="log-CPM", main="sinv", pch=16) lines(as.numeric(meta$time)[13:24], qfit.fit.2.i, col="red") } ``` ^This function will take a sample of 5 of those genes and plot out what everyone looks like, vs. SINV- controls, vs. SINV+ samples. It should have generated five pages of plots (look in outputs). Outputs: ![](https://i.imgur.com/7SJiRLt.png) ![](https://i.imgur.com/PRCNqBf.png) ![](https://i.imgur.com/nxGykf2.png) ![](https://i.imgur.com/HvlmUMZ.png) ![](https://i.imgur.com/GgJNuzY.png) ## Exercise 3 - Summarize Both Approaches In this part of the lab, we are going to try to summarize what the difference is between these approaches. 3.1. First let’s save all the tables that include DE genes by linear model vs. splines. We’ll take the ones that were done by limma. ``` dt.both <- cbind(dt.linear, dt.spline) ``` 3.2. Let’s look at some more Venn diagrams. ``` vennDiagram(dt.both[,c("t1","t2","all")], circle.col=col) vennDiagram(dt.both[,c("x1","x2","int")], circle.col=col) ``` Note: Here we’re looking at DE genes that should mostly have shared reponses to time regardless of approach, and DE genes that should have some sort of interaction. Output: ![](https://i.imgur.com/Fq9yWsH.png) 3.3. Let’s take a look at logCPM for these too. We can’t look at the fitted estimates here so these are just the counts that entered the models. ``` linear.obs.0 <- cpm(d, log=TRUE, prior.count=efit.linear$s2.prior) linear.obs.1 <- cpm(d, log=TRUE, prior.count=efit.linear$s2.prior)[,1:12] linear.obs.2 <- cpm(d, log=TRUE, prior.count=efit.linear$s2.prior)[,13:24] ``` 3.4. First, Prof Noh thinks it would be useful to look at genes detected as DE by t1 or t2 in the linear model, but not in the overall spline-based model. ``` checklist.1 <- sample(row.names(subset(as.data.frame(dt.both), t1!=0 & t2!=0 & all=="FALSE")),5) par(mfrow=c(1,3)) for(i in 1:5) { FlybaseID <- checklist.1[i] linear.obs.0.i <- linear.obs.0[FlybaseID,] linear.obs.1.i <- linear.obs.1[FlybaseID,] linear.obs.2.i <- linear.obs.2[FlybaseID,] plot(meta$time, linear.obs.0.i, ylab="log-CPM", main=FlybaseID) plot(meta$time[1:12], linear.obs.1.i, ylab="log-CPM", main="cont") plot(meta$time[13:24], linear.obs.2.i, ylab="log-CPM", main="sinv") } ``` Outputs: ![](https://i.imgur.com/kueZgFP.png) ![](https://i.imgur.com/2b6lSSt.png) ![](https://i.imgur.com/H0S5gqf.png) ![](https://i.imgur.com/jajfRHC.png) ![](https://i.imgur.com/qzRmE15.png) 3.5. Next Prof Noh thinks it would be useful to see what is detected by x2 in the linear model but not the SINV status-specific splines. ``` checklist.2 <- sample(row.names(subset(as.data.frame(dt.both), x2!=0 & int=="FALSE")),5) par(mfrow=c(1,3)) for(i in 1:5) { FlybaseID <- checklist.2[i] linear.obs.0.i <- linear.obs.0[FlybaseID,] linear.obs.1.i <- linear.obs.1[FlybaseID,] linear.obs.2.i <- linear.obs.2[FlybaseID,] plot(meta$time, linear.obs.0.i, ylab="log-CPM", main=FlybaseID) plot(meta$time[1:12], linear.obs.1.i, ylab="log-CPM", main="cont") plot(meta$time[13:24], linear.obs.2.i, ylab="log-CPM", main="sinv") } ``` ^linear and spline are identical here Outputs: ![](https://i.imgur.com/YBiFHx5.png) ![](https://i.imgur.com/PUhfLxP.png) ![](https://i.imgur.com/V0CesuN.png) ![](https://i.imgur.com/BTu5gMD.png) ![](https://i.imgur.com/EfuSiQO.png) The reason to do these types of visualization here is both to understand and verify what exactly you are testing for using these different types of models. It could be useufl to try visualizing different parts of the venn diagrams from either the linear or the cubic spline model if you are unsure of what is being tested. **Q1: What do you think? Do you find either of these approaches more attractive or intuitive? Try to verbalize why or why not.** I think that they're both really useful visualization tools and the specific use of the tool might depend on the research question/nature of data. Personally, I think linear models are more straightforward but I can also see that cubic splines are more flexible, thus providing a more complete picture of various relationships. It feels like they may be slightly more challenging to interpret but that could also be due to the fact that I am not used to the tool yet. **Conclusion/Takeaways** Time series are complicated to analyze and have different ways to approach them. For the most part, your choice will depend on your experimental design and your hypothesis. Model-based approaches for DE analysis may be more prone to false positives, especially if you’ve incorrecty specified your model. For practical examples of what the coefficients would mean depending on how you wrote your model, Prof Noh recommends carefully reading through the edgeR user guide in addition to the one for limma linked above. She found these to be the most helpful for thinking through different options. **NOTE FOR FINAL PROJECT:** Consider starting from a GEO dataset, rather than an SRA dataset, as long as you can get raw reads and reads were aligned in a similar way. By similar, Prof Noh means that either the software were identical, or software with similar alignment principles were used. This will save a little bit of time.