bi345 lab 7 (wk 09)

# bi345 lab 7 (wk 09) ## Linear Model First, we load in the data from Course_materials/wk_09. The two files we are interested in are the tmm.count files and the metadata file. Set the colnames of the count files to the names of the samples from the metadata. stringsAsFactors has been depreciated so we will edit the string variables to be considered R factors. We also load in the edgeR library which will automatically call its dependency limma. ``` library(edgeR) tmm.count <- read.table("./GSE162666.tmm.count", h = T, sep = "\t") meta <- read.table("./GSE162666.metadata.txt", h = T, sep = "\t") meta$sample = as.factor(meta$sample) meta$sinv = as.factor(meta$sinv) meta$rep = as.factor(meta$rep) colnames(tmm.count) %in% meta$sample ``` We then make sure that the metadata table has the factors coded properly. This is done in order to make sure that the time points are not treated as a continuous variable because they are discrete. ``` str(meta) meta$time <- factor(as.character(meta$time), levels = c("6", "24", "48")) str(meta) ``` We can use the DGEList function in order to convert the TMM count table to a DGEList object ``` d0 <- DGEList(tmm.count) d0$samples ``` The filestructure is similar to the DESeqDataSet in DESeq2. It stores data on the library size and normalization factors/ These data are already normalized so the library sizes are very similar compared to the raw data. The size factors are calculated on the raw count data and later applied using calcNormFactor() We can now filter genes with low expression values. We will use the filterByExpr method from edgeR. ``` keep <- filterByExpr(d0, group = meta$sinv) table(keep) d <- d0[keep,] ``` This command keeps genes with 10 or more reads in at least as many samples as the smallest group size. We can check some diagnostic plots for this filtered dataset ``` par(mfrow = c(1,3)) plotMDS(d, col = as.numeric(meta$time), label = meta$time) plotMDS(d, col = as.numeric(meta$sinv), label = meta$sinv) plotMDS(d, col = as.numeric(meta$rep), label = meta$rep) ``` ![](https://i.imgur.com/TJmyoMY.png) Each of the factors group samples together so we need to keep this in mind as we fit the data. We code our samples into groups by combining the SINV status and the time into a variable with six levels. ``` meta$group <- factor(paste(meta$sinv, meta$time, sep = "."), levels = c("cont.6", "cont.24", "cont.48", "sinv.6", "sinv.24", "sinv.48")) mm <- model.matrix(~0 + group + rep, data = meta) mm ``` It is crucial to fit the model to the hypothesis we want to test. In this case, the hypothesis is to check how gene expression changed over time depending on the infection status. In order to build linear contrasts based on our hypothesis, we can use the makeContrasts function from edgeR. ``` cm.linear <- makeContrasts( b6 = groupsinv.6 - groupcont.6, c24 = groupcont.24 - groupcont.6, s24 = groupsinv.24 - groupsinv.6, c48 = groupcont.48 - groupcont.6, s48 = groupsinv.48 - groupsinv.6, t1 = (groupsinv.24 + groupcont.24) - (groupsinv.6 + groupcont.6), t2 = (groupsinv.48 + groupcont.48) - (groupsinv.6 + groupcont.6), x1 = (groupsinv.24 - groupsinv.6) - (groupcont.24 - groupcont.6), x2 = (groupsinv.48 - groupsinv.6) - (groupcont.48 - groupcont.6), levels = colnames(mm)) cm.linear ``` ![](https://i.imgur.com/EXNfAUh.png) The contrast is between the two types of samples at each time against their own baseline at 6 hours. We also are interested in finding the shared changes by both types of samples between six and 24 hours post-infection and between six and 48 hours. We can now use voom to convert data for DE analysis using linear models. ``` par(mfrow = c(1,2)) y.linear <- voom(counts = d, mm, plot = T) ``` The voom method will return the following: ![](https://i.imgur.com/cv6pcyR.png) We now run our DE analysis including fitting contrasts: ``` vfit.linear <- lmFit(y.linear, mm) vfit.linear <- contrasts.fit(vfit.linear, contrasts = cm.linear) efit.linear <- eBayes(vfit.linear) ``` Visualize the residuals of the fitted data. If the mean-variance trends are accounted for, this plot should not have any patterns. ``` plotSA(efit.linear, main = "residual: Mean-variance trend") ``` ![](https://i.imgur.com/dU7ixQ7.png) In order to check the results of the different contrasted factors, we can use the method, decideTests. ``` dt.linear <- decideTests(efit.linear) summary(dt.linear) ``` ![](https://i.imgur.com/rFeSFUi.png) The results indicate that there are more changes in the positive-SINV samples over time compared to the SINV- controls. This is found by looking at the control (c24, c48) and the positive columns (s24, s48). The interaction terms (last four columns) indicate that there are some genes that behave dfiferently between 6hpi and the subsequent time points that is linked to their infection status. We now visualize the venn diagrams of the overlapping DE genes between the conditions. ``` # overlap between DE genes from sinv- and sinv+ at time = 24h vennDiagram(dt.linear[,c(2,3)], circle.col=col[1:2], names = c("sinv-", "sinv+"), ) # overlap between DE genes from sinv- and sinv+ at time = 48h vennDiagram(dt.linear[,c(4,5)], circle.col=col[3:4] ,names = c("sinv-", "sinv+")) ``` Overlapping genes at 24 hours (left) and 48 hours (right) ![](https://i.imgur.com/U3AIQxk.png) ``` # all four early contrasts vennDiagram(dt.linear[,c(2,3,6,8)], circle.col=col[1:4]) # any overlap between DE genes that responded differently for sinv+ and sinv- controls by time 24 or time 48 vennDiagram(dt.linear[,c(4,5,7,9)], circle.col=col[5:8]) ``` Overlapping genes from the contrast at 24 hours (left). Overlapping DE genes that responded differently by treatment by time 24 or 48 (right) ![](https://i.imgur.com/ge128Qs.png) We can also visualize this data using MD plots ``` plotMD(efit.linear, column=6, status=dt.linear[,6]) plotMD(efit.linear, column=7, status=dt.linear[,7]) plotMD(efit.linear, column=8, status=dt.linear[,8]) plotMD(efit.linear, column=9, status=dt.linear[,9]) ``` ![](https://i.imgur.com/2MWsakO.png) ![](https://i.imgur.com/gUml6A0.png) ## Cubic Splines We can use cubic splines to fit smooth curves to gene-wise gene expression patterns. This would work if we had more time points but it still works with just three time points. We first need to load the splines library and create a dummy variable to enable spline fitting. **This will treat the time as a continuous variable rather than a factor.** ``` library(splines) meta$X <- ns(as.numeric(meta$time), df = 2) ``` Since we only use three different time points, we have low degrees of freedom. While the recommend df value is between three and five, the degrees of freedom here is two. This determines how many knots we can have in our "smooth" spline. With extreme values, the curve can look like a U-shape (upright/upside down) We can build a model that incorporates SINV status-specific-splines using a dummy variable. ``` sm <- model.matrix(~0 + sinv*X + rep, data=meta) sm colnames(sm) <- c("cont", "sinv", "X1", "X2", "repB", "repC", "repD", "sinv.X1", "sinv.X2") ``` We changed the names of the columns from the simple strucutre in order to make them more readable. We can fit the model using voom as above ``` y.spline <- voom(counts=d, sm, plot=T) vfit.spline <- lmFit(y.spline, sm) efit.spline <- eBayes(vfit.spline) plotSA(efit.spline, main="residual: Mean-variance trend") ``` ![](https://i.imgur.com/4AuaVSj.png) We find genes that respond by time by assigning them to a list. The topTable function returns what the results should look like in a summary table. The p-value is the significance cutoff for the **adjusted p-value**. The number is arbitrarily high to show all significant genes. ``` sfit.list.1 <- row.names(topTable(efit.spline, coef=3:4, number = 1000, p.value = 0.05)) sfit.list.2 <- row.names(topTable(efit.spline, coef=8:9, number = 1000, p.value = 0.05)) ``` We save the DE status of these genes in order to do a vennDiagram visualization. ``` dt.spline <- as.data.frame(dt.linear[,0]) dt.spline$all <- row.names(dt.spline) %in% sfit.list.1 dt.spline$int <- row.names(dt.spline) %in% sfit.list.2 vennDiagram(dt.spline, circle.col = col) ``` ![](https://i.imgur.com/xbzbwHe.png) We can also fit a negative binomial model and use the quasilikelihood on the same model using edgeR. The results are expectedly very similar. ``` y.quasi <- estimateDisp(d, sm) qfit <- glmQLFit(y.quasi, sm) plotQLDisp(qfit) ``` ![](https://i.imgur.com/506oxxi.png) In order to record all the significant genes that respond to time: ``` qfit.all <- glmQLFTest(qfit, coef = 3:4) summary(decideTests(qfit.all)) qfit.int <- glmQLFTest(qfit, coef = 8:9) summary(decideTests(qfit.int)) ``` ![](https://i.imgur.com/CrFH6RZ.png) ![](https://i.imgur.com/4fyMPKo.png) Similarly to using splines, the fitted model is being probed in order to find genes that are significantly DE by the selected coefficients. To save each set of genes to more VennDiagarams: ``` qfit.list.1 <- row.names(topTags(qfit.all, n = 1000, p.value = 0.05)) qfit.list.2 <- row.names(topTags(qfit.int, n = 1000, p.value = 0.05)) dt.q <- as.data.frame(dt.linear[,0]) dt.q$all <- row.names(dt.q) %in% qfit.list.1 dt.q$int <- row.names(dt.q) %in% qfit.list.2 vennDiagram(dt.q, circle.col = col) ``` This result is expectedly similar to the spline results. ![](https://i.imgur.com/QdYXsDF.png) In order to look at what the splines look like, we need to save the data and fitted estimates into logCPM. We do this for samples as well as the samples by SINV status separately. ``` # counts qfit.obs.0 <- cpm(y.quasi, log=TRUE, prior.count=qfit$prior.count) qfit.obs.1 <- cpm(y.quasi, log=TRUE, prior.count=qfit$prior.count)[,1:12] qfit.obs.2 <- cpm(y.quasi, log=TRUE, prior.count=qfit$prior.count)[,13:24] # fitted estimates qfit.fit.0 <- cpm(qfit, log=TRUE) qfit.fit.1 <- cpm(qfit, log=TRUE)[,1:12] qfit.fit.2 <- cpm(qfit, log=TRUE)[,13:24] ``` We have already saved the names of the genes so we look at some from the fitted estimate set that have different splines based on the infection status. ``` par(mfrow=c(1,3)) for(i in 1:5) { FlybaseID <- sample(qfit.list.2, 5)[i] qfit.obs.0.i <- qfit.obs.0[FlybaseID,] qfit.obs.1.i <- qfit.obs.1[FlybaseID,] qfit.obs.2.i <- qfit.obs.2[FlybaseID,] qfit.fit.0.i <- qfit.fit.0[FlybaseID,] qfit.fit.1.i <- qfit.fit.1[FlybaseID,] qfit.fit.2.i <- qfit.fit.2[FlybaseID,] plot(as.numeric(meta$time), qfit.obs.0.i, ylab="log-CPM", main=FlybaseID, pch=16) plot(as.numeric(meta$time)[1:12], qfit.obs.1.i, ylab="log-CPM", main="cont", pch=16) lines(as.numeric(meta$time)[1:12], qfit.fit.1.i, col="blue") plot(as.numeric(meta$time)[13:24], qfit.obs.2.i, ylab="log-CPM", main="sinv", pch=16) lines(as.numeric(meta$time)[13:24], qfit.fit.2.i, col="red") } ``` The function detailed above takes a sample of five genes and plots out what the rest look like compared to the SINV- controls, and the SINV+ samples. There is a plot-page for each gene but only one example has been displayed below: ![](https://i.imgur.com/kGSZOF5.png) ## Summarize both approaches We first save all the labels that include DE genes by the linear model vs. the model made with splines. Then, we take the ones done by limma. We can then plot these data as venn diagrams. ``` dt.both <- cbind(dt.linear, dt.spline) vennDiagram(dt.both[,c("t1","t2","all")], circle.col=col) vennDiagram(dt.both[,c("x1","x2","int")], circle.col=col) ``` Our goal is to find DE genes that should mostly have shared responses to time regardless of which method was used to determine them. ![](https://i.imgur.com/UgHzAvN.png) We can also look at the logCPM values for these sets. We cannot see the fitted estimates so we just use the counts entered into the models ``` linear.obs.0 <- cpm(d, log=TRUE, prior.count=efit.linear$s2.prior) linear.obs.1 <- cpm(d, log=TRUE, prior.count=efit.linear$s2.prior)[,1:12] linear.obs.2 <- cpm(d, log=TRUE, prior.count=efit.linear$s2.prior)[,13:24] ``` We look at genes detected as differentially expressed by t1 or t2 in the linear model but not in the spline model. ``` checklist.1 <- sample(row.names(subset(as.data.frame(dt.both), t1!=0 & t2!=0 & all=="FALSE")),5) par(mfrow=c(1,3)) for(i in 1:5) { FlybaseID <- checklist.1[i] linear.obs.0.i <- linear.obs.0[FlybaseID,] linear.obs.1.i <- linear.obs.1[FlybaseID,] linear.obs.2.i <- linear.obs.2[FlybaseID,] plot(meta$time, linear.obs.0.i, ylab="log-CPM", main=FlybaseID) plot(meta$time[1:12], linear.obs.1.i, ylab="log-CPM", main="cont") plot(meta$time[13:24], linear.obs.2.i, ylab="log-CPM", main="sinv") } ``` We again end up with five plots that summarize the DE genes detected by t1 or t2 in the linear model but not in the spline set. ![](https://i.imgur.com/iCtWSxY.png) We can also look at what is detected by x2 in the linear model but not the SINV status-specific splines ``` checklist.2 <- sample(row.names(subset(as.data.frame(dt.both), x2!=0 & int=="FALSE")),5) par(mfrow=c(1,3)) for(i in 1:5) { FlybaseID <- checklist.2[i] linear.obs.0.i <- linear.obs.0[FlybaseID,] linear.obs.1.i <- linear.obs.1[FlybaseID,] linear.obs.2.i <- linear.obs.2[FlybaseID,] plot(meta$time, linear.obs.0.i, ylab="log-CPM", main=FlybaseID) plot(meta$time[1:12], linear.obs.1.i, ylab="log-CPM", main="cont") plot(meta$time[13:24], linear.obs.2.i, ylab="log-CPM", main="sinv") } ``` ![](https://i.imgur.com/GMxeg55.png) Question: What do you think? Do you find either of these approaches more attractive or intuitive? Try to verbalize why or why not. A: I think that it is more intuitive to use the linear model approach when there are fewer time points available, thus limiting the number of degrees of freedom. While the spline-model can still be used with data that has time-points, using a dummy variable may potentially be convoluted. However, the spline-based model has requirement of data normalization whereas the linear model has assumptions about the distribution of the data. The use of either method should be taken into close consideration.