BI345 Lab 5 - HackMD

# BI345 Lab 5 Goal: take your differentially expressed gene lists and do enrichment analyses with them in order to gain insight into system-wide functions of our significantly DE genes Note: Look at the documentation for both packages to make sure you understand what choices Prof Noh made for you in the analyses above so you can apply these steps to future analyses of your own choice. Keep in mind: - DE is not proof of any specific gene being involved in a condition, neither are enriched GO terms - having this info can help narrow down types of biological processes that may result from the expression variation found in DE analysis ## Exercise 1 - Prepare GO Annotation file Use Unix for this first step. Gen notes: - You should have your results data frame from last week. The only additional information necessary for enrichment analysis is gene_association.dictybase* in this week’s folder. - This is a GO annotation file that shows what GO term each gene maps to. You can find more details about this file type here. This is a file you will need for any organism in order to run a GO enrichment analysis. - The other file, goterm_list.AmiGO* is a table of all the GO terms in use (43249) on geneontology.org and their descriptions. 1.1) One thing to notice from this file is that GO annotations are created from a wide variety of sources. These are typically other protein databases. So the question becomes: what are all the sources of GO annotation in this file and how often are each source used? We can figure this out with a command like this: ``` awk -F "\t" 'NR>3 && $4!~/NOT/ && $7!="ND" {print}' OFS="\t" gene_association.dictybase* | awk -F "\t" '{split($8,a,":"); print a[1]}' OFS="\t" | sort | uniq -c ``` NOTE: Some annotations don’t have an attributed source (the 9009 up top), while the majority of annotations come from the following sources: InterPro, PANTHER, and UniProtKB. If we want to use these annotations to interpret what our differential expression results mean, we should understand how these annotations were made. Take some time to look up what these sources are. Note that a gene can have multiple sources of evidence, and the type of evidence are indicated by GO evidence codes that range from experimental to electronically annotated. Output: ![](https://i.imgur.com/H7836X2.png) 1.2) We need to make a couple of different file formats for the respective GO enrichment analysis packages. For GOstats (https://bioconductor.org/packages/release/bioc/html/GOstats.html), Prof Noh filtered out annotations that had ‘NOT’ qualifiers and ‘ND’ (No biological Data available) evidence codes. Then she removed any GO annotations that did not list a source. ``` awk -F "\t" 'NR>3 && $4!~/NOT/ && $7!="ND" {print}' OFS="\t" gene_association.dictybase* | awk -F "\t" '{split($8,a,":"); print $5, $7, $2, a[1]}' OFS="\t" | awk '$4 != "" {print $1,$2,$3}' OFS="\t" > gene_association.dictybase.filter.gostats.txt ``` ^This took # of GO anotation from 80k to 65k. NOTE: If you have any reason to remove certain annotations depending on the source, the evidence code, etc, this is the time to do it! Depending on your GO annotation source, you may need to make different decisions. Basically, you want to keep as much relatively-high-quality information as possible. 1.3) For topGO (https://bioconductor.org/packages/release/bioc/html/topGO.html), I can just modify the file I made for GOstats and pull out the two columns I need. ``` awk -F "\t" '{print $3,$1}' OFS="\t" gene_association.dictybase.filter.gostats.txt > geneid2go.dictybase.filter.topgo.txt ``` Now, our GO annotation data are just in flat table form. Within R, we will use packages that are aware of the nested structure of GO terms to properly test for any enrichment in functional terms. ## Exercise 2 - topGO This part uses tje course RStudio server. First, we'll use topGO to run some enrichment analyses. Here is the manual(https://bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf). Gen notes: This package has its own object classes and is a little more complicated to use than GOstats. But it has different types of statistical tests available and has built-in graph functions that are useful for exploring our results. 2.1) Set working directory & check out what's in current environment. Load it up using a ```read.csv()``` command. ``` setwd("/courses/bi345/sdivit25/Lab_5") df.res.neg <- read.csv("dicty.condition_de_results.txt") ls() ``` 2.2) Load ```topGO``` ``` library(topGO) ``` 2.25) Check that row names are genes and not numbers ``` row.names(df.res.neg) <- df.res.neg$X row.names(df.res.neg) ``` 2.3) Now we want to build some gene sets from our results. Here Prof Noh is defining up-regulated genes, down-regulated genes, and the universe of genes that DESeq2 ran its differential expression analysis in. ``` up <- row.names(subset(df.res.neg, padj<=0.05 & log2FoldChange>0)) down <- row.names(subset(df.res.neg, padj<=0.05 & log2FoldChange<0)) universe <- row.names(subset(df.res.neg, is.na(padj)==FALSE)) ``` 2.4) We now need to read in our GO annotation table and format it into a list structure. ``` raw.map <- read.table("geneid2go.dictybase.filter.topgo.txt", h=F) colnames(raw.map) <- c("gene_id","go_id") fixed.map <- by(raw.map$go_id, raw.map$gene_id, function(x) as.character(x)) str(head(fixed.map)) ``` NOTE: One of the nice things about ```topGO``` is that it can take both specific gene lists you want to test, and also data that can be ranked or filtered. In our case, rankings could be based on the adjusted p-values from the Wald tests of last week. 2.5) We will create a list with adjusted p-values and gene ids as names for each statistic. ``` geneList <- subset(df.res.neg, is.na(padj)==FALSE)[,6] names(geneList) <- universe str(geneList) ``` 2.6) We can now specify a function to filter this list within ```topGO```. We can just use a filter for adjusted p-values smaller than 0.05 for now. You can find other ways to filter your data by specifying a similar function. ``` diffGene <- function(allScore) { return(allScore < 0.05) } ``` 2.7) We can check that this function works the way we want. The number of diffGene should be the same as both upGene and downGene added together! ``` table(diffGene(geneList)) ``` 2.8) Now we need to make a ```topGOdata``` object. This is where we specify what our data are, what gene ontology we want to use to examine it, and what GO annotations to use when running enrichment tests. ``` BPdata <- new("topGOdata", description="GO analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=diffGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10) BPdata ``` NOTE: - output for BPdata is important. - First, it tells us that for the gene universe of this experiment, 9835 of these were used for GO analysis. A lot of this is because these are the genes that have GO annotations. Much like COG, not all genes will have a GO annotation. Accordingly, our list of significant genes has decreased from 1026 to 545. - The same output tells us that there are 1203 nodes to a GO graph. Because GO terms are hierarchical, this means that for BP (Biological Process) annotations, there are 1203 GO terms being used. - Now we can run some statistical tests. Fisher’s test works with gene counts (i.e. how many genes are present in a category), while the Kolmogorov-Smirnov test can use numeric data as well (i.e. which genes are present at what kinds of ranks or scores). We can use both tests in topGO as well as several others (t-Test, globalTest, etc; see manual for more information). Test Statistics Class Structure: ![](https://i.imgur.com/FCGChW5.png) - Any tests with this package are facilitated with a groupStats object that tells topGO what test to use and how. This test is then applied to the data object that we made above. - Let’s start with a count-based Fisher’s test. We will use the classic (term-for-term) approach first that tests each GO term individually. Then we’ll apply additional methods that take the GO hierarchy into account: elim, weight, weight01 (which is a supposed to be mixture of classic, elim, and weight and the default method of topGO), and parent-child. This will be discussed more in lecture later (keep an ear out!) 2.8) A classic Fisher’s test can be run like this. This test on its own is expected to lead to many false positives. ``` fisher.stat <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test") resultFisher <- getSigGroups(BPdata, fisher.stat) resultFisher ``` 2.9) Additional versions of Fisher’s test can be run like this: ``` elim.stat <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)") resultElim <- getSigGroups(BPdata, elim.stat) resultElim # 21 sig terms weight.stat <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)") resultWeight <- getSigGroups(BPdata, weight.stat) resultWeight # 19 sig terms mix.stat <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)") resultMix <- getSigGroups(BPdata, mix.stat) resultMix # 20 sig terms parent.stat <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)") resultParent <- getSigGroups(BPdata, parent.stat) resultParent # 58 sig terms ``` Note: my sig terms were off by just a couple (give or take 5ish) 2.10) We can make a table compiling these test results. Prof Noh asked for the top 20 (topNodes=) that weight01 detected as significant terms (orderBy=, ranksOf=). ``` countRes <- GenTable(BPdata, mix=resultMix, classic=resultFisher, elim=resultElim, weight=resultWeight, parent=resultParent, orderBy="mix", ranksOf="mix", topNodes=20) countRes ``` 2.11) ```topGo``` has some different functions that might be useful. For example, we can get the gene ids of all the genes that belong to a significant set of GO terms. ``` which.terms <- countRes[,1] genesInTerm(BPdata, which.terms) ``` 2.12) We can also make a directed acyclic graph (DAG) showing the top 3 GO terms that are significant. DAGs can show the nested structure of GO terms well, though these figures can easily can become too crowded to be useful. You can change the number of terms to look at by changing firstSigNodes. NOTE: Prof Noh doesn't know how to get it to show DAGs for specific GO terms. ``` showSigOfNodes(BPdata, score(resultFisher), firstSigNodes=3, useInfo='all') ``` Output: ![](https://i.imgur.com/pL6oaA4.png) ^In these DAGs, the significant terms are in rectangles & the more significant ones are in a darker red compared to orange or yellow. 2.13) You should get the idea of how to run the different tests shown in the figure above. The KS test with elim can be run like this. ``` elim.score <- new("elimScore", testStatistic=GOKSTest, name="KS test") resultKS <- getSigGroups(BPdata, elim.score) resultKS GenTable(BPdata, mixKS=resultKS, topNodes=26) ``` Prof Noh wants you to pay attention here. Enrichment analyses can have quite different results depending on which test you chose. Because of this, it is important to intentionally choose a way to interpret your data using enrichment tests. Case in point: the significant GO terms from the 5 different versions of the count-based test can be compared as below. ``` clasRes <- GenTable(BPdata, classic=resultFisher, topNodes=86) count.a <- clasRes[,c(1,6)] elimRes <- GenTable(BPdata, elim=resultElim, topNodes=21) count.b <- elimRes[,c(1,6)] weigRes <- GenTable(BPdata, weight=resultWeight, topNodes=19) count.c <- weigRes[,c(1,6)] pareRes <- GenTable(BPdata, parent=resultParent, topNodes=20) count.d <- pareRes[,c(1,6)] mixRes <- GenTable(BPdata, mix=resultMix, topNodes=58) count.e <- mixRes[,c(1,6)] # merge all significant results temp.a <- merge(count.a, count.b, by="GO.ID", all.x=T, all.y=T) temp.b <- merge(temp.a, count.c, by="GO.ID", all.x=T, all.y=T) temp.a <- merge(temp.b, count.d, by="GO.ID", all.x=T, all.y=T) all.sig <- merge(temp.a, count.e, by="GO.ID", all.x=T, all.y=T) rm(temp.a, temp.b) table(rowSums(!(is.na(all.sig[,2:6])))) ``` The last table here shows you how many versions of the count-based Fisher test support each GO term that at least one test found as significant: ![](https://i.imgur.com/XUdTOwE.png) Next, Prof Noh wanted to see whihc methods tended to produce solely supported (only one test found it significant) GO terms. ``` keep <- rowSums(!(is.na(all.sig[,2:6]))) <2 all.sig[keep,] ``` Notes: - Based on the table, it was always classic and mix (weight01). It turns out that weight01 simply takes the mean of classic, elim, and weight p-values and presents it at log scale. - Because of this, Prof Noh would hesitate to use classic or weight01 as her enrichment test version since both methods appear to be too lenient. ## Exercise 3 - topGO for up- vs down-regulated genes Despite the demonstration above, typically you would want to look separately at what up-regulated genes do vs. down-regulated genes do, in addition to or rather than looking at what all DE genes do together. - For example, you could use a function like this instead of the one above (2.5. diffGene) to filter for up-regulated genes. Start with up-regulated genes: ``` upGene <- function(allScore) { return(names(geneList) %in% up) } ``` Then run everything from previous exercises (): ``` #upregulated genes upGene <- function(allScore) { return(names(geneList) %in% up) } table(upGene(geneList)) BPdataUp <- new("topGOdata", description="GO analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=upGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10) BPdataUp fisher.statUp <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test") resultFisherUp <- getSigGroups(BPdata, fisher.statUp) resultFisherUp elim.statUp <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)") resultElimUp <- getSigGroups(BPdataUp, elim.statUp) resultElimUp # 21 sig terms weight.statUp <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)") resultWeightUp <- getSigGroups(BPdataUp, weight.statUp) resultWeightUp # 19 sig terms mix.statUp <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)") resultMixUp <- getSigGroups(BPdataUp, mix.statUp) resultMixUp # 20 sig terms parent.statUp <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)") resultParentUp <- getSigGroups(BPdataUp, parent.statUp) resultParentUp # 58 sig terms countResUp <- GenTable(BPdataUp, mix=resultMixUp, classic=resultFisherUp, elim=resultElimUp, weight=resultWeightUp, parent=resultParentUp, orderBy="mix", ranksOf="mix", topNodes=20) countResUp which.termsUp <- countRes[,1] genesInTerm(BPdataUp, which.termsUp) showSigOfNodes(BPdataUp, score(resultFisherUp), firstSigNodes=3, useInfo='all') elim.scoreUp <- new("elimScore", testStatistic=GOKSTest, name="KS test") resultKSUp <- getSigGroups(BPdata, elim.score) resultKSUp GenTable(BPdataUp, mixKS=resultKSUp, topNodes=26) clasResUp <- GenTable(BPdataUp, classic=resultFisherUp, topNodes=86) count.aUp <- clasResUp[,c(1,6)] elimResUp <- GenTable(BPdataUp, elim=resultElimUp, topNodes=21) count.bUp <- elimResUp[,c(1,6)] weigResUp <- GenTable(BPdataUp, weight=resultWeightUp, topNodes=19) count.cUp <- weigResUp[,c(1,6)] pareResUp <- GenTable(BPdataUp, parent=resultParentUp, topNodes=20) count.dUp <- pareResUp[,c(1,6)] mixResUp <- GenTable(BPdataUp, mix=resultMixUp, topNodes=58) count.eUp <- mixResUp[,c(1,6)] # merge all significant results temp.aUp <- merge(count.aUp, count.bUp, by="GO.ID", all.x=T, all.y=T) temp.bUp <- merge(temp.aUp, count.cUp, by="GO.ID", all.x=T, all.y=T) temp.aUp <- merge(temp.bUp, count.dUp, by="GO.ID", all.x=T, all.y=T) all.sigUp <- merge(temp.aUp, count.eUp, by="GO.ID", all.x=T, all.y=T) rm(temp.aUp, temp.bUp) table(rowSums(!(is.na(all.sigUp[,2:6])))) keep <- rowSums(!(is.na(all.sigUp[,2:6]))) <2 all.sigUp[keep,] ``` DAG Output: ![](https://i.imgur.com/mSP4kae.png) Table Output (shows me how many versions of the count-based Fisher test support each GO term that at least one test found as significant): ![](https://i.imgur.com/qMGOoQx.png) Now do down-regulated genes: ``` downGene <- function(allScore) { return(names(geneList) %in% down) } ``` Then run everything from previous exercises (): ``` #downregulated genes downGene <- function(allScore) { return(names(geneList) %in% down) } table(downGene(geneList)) BPdataDown <- new("topGOdata", description="GO analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=downGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10) BPdataDown fisher.statDown <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test") resultFisherDown <- getSigGroups(BPdataDown, fisher.statUp) resultFisherDown elim.statDown <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)") resultElimDown <- getSigGroups(BPdataDown, elim.statDown) resultElimDown # 21 sig terms weight.statDown <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)") resultWeightDown <- getSigGroups(BPdataDown, weight.statDown) resultWeightDown # 19 sig terms mix.statDown <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)") resultMixDown <- getSigGroups(BPdataDown, mix.statDown) resultMixDown # 20 sig terms parent.statDown <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)") resultParentDown <- getSigGroups(BPdataDown, parent.statDown) resultParentDown # 58 sig terms countResDown <- GenTable(BPdataDown, mix=resultMixDown, classic=resultFisherDown, elim=resultElimDown, weight=resultWeightDown, parent=resultParentDown, orderBy="mix", ranksOf="mix", topNodes=20) countResDown which.termsDown <- countRes[,1] genesInTerm(BPdataDown, which.termsDown) showSigOfNodes(BPdataDown, score(resultFisherDown), firstSigNodes=3, useInfo='all') elim.scoreDown <- new("elimScore", testStatistic=GOKSTest, name="KS test") resultKSDown <- getSigGroups(BPdataDown, elim.score) resultKSDown GenTable(BPdataDown, mixKS=resultKSDown, topNodes=26) clasResDown <- GenTable(BPdataDown, classic=resultFisherDown, topNodes=86) count.aDown <- clasResDown[,c(1,6)] elimResDown <- GenTable(BPdataDown, elim=resultElimDown, topNodes=21) count.bDown <- elimResDown[,c(1,6)] weigResDown <- GenTable(BPdataDown, weight=resultWeightDown, topNodes=19) count.cDown <- weigResDown[,c(1,6)] pareResDown <- GenTable(BPdataDown, parent=resultParentDown, topNodes=20) count.dDown <- pareResDown[,c(1,6)] mixResDown <- GenTable(BPdataDown, mix=resultMixDown, topNodes=58) count.eDown <- mixResDown[,c(1,6)] # merge all significant results temp.aDown <- merge(count.aDown, count.bDown, by="GO.ID", all.x=T, all.y=T) temp.bDown <- merge(temp.aDown, count.cDown, by="GO.ID", all.x=T, all.y=T) temp.aDown <- merge(temp.bDown, count.dDown, by="GO.ID", all.x=T, all.y=T) all.sigDown <- merge(temp.aDown, count.eDown, by="GO.ID", all.x=T, all.y=T) rm(temp.aDown, temp.bDown) table(rowSums(!(is.na(all.sigDown[,2:6])))) keepDown <- rowSums(!(is.na(all.sigDown[,2:6]))) <2 all.sigDown[keepDown,] ``` DAG Output: ![](https://i.imgur.com/Ut2V4KL.png) Table Output (shows me how many versions of the count-based Fisher test support each GO term that at least one test found as significant): ![](https://i.imgur.com/8C69ctH.png) Reporting back results: Comparison of up- vs down- regulated --> Are your GO terms enriched in up- vs. down-regulated genes different, overlapping, or mostly the same Upregulated Genes GO terms: ![](https://i.imgur.com/zZRupyu.png) Downregulated Genes GO terms: ![](https://i.imgur.com/Dq59T4J.png) We can see that the GO terms enriched in up- vs down- regulated genes are difffernet and not overlapping. Additionally, it's interesting to note that the upregulated genes are concerned with immune responses and metabolic processes while the downregulated genes are mostly concerned with regulation and transduction of processes. ## Exercise 4 - GOstats Gen Notes: - In comparison to topGO, GOstats is maybe less flexible but a similarly popular choice for the enrichment analysis - It employs a parent-child relationship-aware algorithm - It runs statistical tests from the leaves to the root (bottom to top), and if a significant term is detected those genes are removed from the parent term’s list before the next round of significance testing - NOTE it sounds a lot like topGo elim! - GOstats does not have much in terms of built-in visualization so you would need to explore other options. There are several visualization softwares out there and we'll learn how to do that next week. Vignette/description/manual for GOstats(https://bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/GOstatsHyperG.pdf) 4.1) We need two libraries here. The second one, GSEABase, is used for formatting GO term tables to work with this package. ``` library(GOstats) library(GSEABase) ``` 4.2) We can read in the table we made earlier and turn it into a GeneSetCollection object. ``` gocurateData <- read.table("gene_association.dictybase.filter.gostats.txt",h=F) goCurate <- GOFrame(gocurateData, organism="Dictyostelium discoideum") goCurFrame <- GOAllFrame(goCurate) gsc <- GeneSetCollection(goCurFrame, setType=GOCollection()) ``` 4.3) Next we build a GSEAGOHyperGParams object that points to the gene list, the ontology, and the annotations for GOstats to run a hypergeometric test on. Here we don’t have alternative choices for the statistical test. The hypergeometric test is similar to Fisher’s test and uses counts to test for enrichment. In this example, we are testing up-regulated genes. ``` bp.params <- GSEAGOHyperGParams(name="dicty GSEA", geneSetCollection=gsc, geneIds=up, universeGeneIds=universe, ontology="BP", pvalueCutoff=0.01, conditional=T, testDirection="over") ``` Note: The pvalueCutoff parameter here is both used as your typical significant level, but also to determine when to remove significant child-node genes from a parent-node. 4.4) Next we run the test itself, and obtain the results based on the p-value cutoff we specified in the previous step. If you wanted to get more results, you can set the cutoff higher so that the function reports back more results. ``` bp.up <- hyperGTest(bp.params) bp.up ``` 4.5) We can see a results table like this. ``` summary(bp.up, categorySize=10) ``` Note: The categorySize parameter makes GOstats only show significant results for those GO terms that are larger than a specific size. Output: ![](https://i.imgur.com/0s1PnQ2.png) 4.6) Complete steps above for down-regulated genes and compare. ``` ##down-regulated genes (4.6) #NOTE: same libraries and 1st four steps as up-regulated genes bp.paramsDown <- GSEAGOHyperGParams(name="dicty GSEA", geneSetCollection=gsc, geneIds=down, universeGeneIds=universe, ontology="BP", pvalueCutoff=0.01, conditional=T, testDirection="over") bp.down <- hyperGTest(bp.paramsDown) bp.down summary(bp.down, categorySize=10) ``` Output: ![](https://i.imgur.com/sXGInwn.png) 4.7) How do these GOstats results compare to topGO results? Are they qualitatively different? The GOstats results identified many more GO terms for up-regulated genes when compared to topGO results. GOstats identified some but not all of the same GO terms in both up-regulated and down-regulated as the topGO results.