# BI345 Lab 05
## Prepare GO annotation file
Since GO annotations are created from a wide range of sources, we should look at the annotation in these files and how each source is used.
```
awk -F "\t" 'NR>3 && $4!~/NOT/ && $7!="ND" {print}' OFS="\t" gene_association.dictybase* | awk -F "\t" '{split($8,a,":"); print a[1]}' OFS="\t" | sort | uniq -c
```

The output of this command displays where the source of the GO annotations are. Note that some of the annotations do not have an attributed source while the majority come from InterPro, PANTHER, and UniProtKB.
In order to preprocess the GO annotation file, we need to make two different file formats for the respective package.
#### GOStats Preprocessing
For GOStats, we filter any annotations that had "NOT" qualifiers and "ND" (No bio. data) as well as any annotations without a GO source.
```
awk -F "\t" 'NR>3 && $4!~/NOT/ && $7!="ND" {print}' OFS="\t" gene_association.dictybase* | awk -F "\t" '{split($8,a,":"); print $5, $7, $2, a[1]}' OFS="\t" | awk '$4 != "" {print $1,$2,$3}' OFS="\t" > gene_association.dictybase.filter.gostats.txt
```
We can see that this had an affect of filtering the GO Annotations included down to around 65k by counting the number of instances the string "GO" appeared in the file.
```
grep GO gene_association.dictybase.filter.gostats.txt | wc -l
```

#### TopGO Preprocessing
For the TopGO tool, we can just take the filtered annotations from the GOStats input file and pull out the two columns that are important for the package.
```
awk -F "\t" '{print $3,$1}' OFS="\t" gene_association.dictybase.filter.gostats.txt > geneid2go.dictybase.filter.topgo.txt
```
All GO annotation data is currently being stored in a flat table which will be interpreted by the R packages which are structure-aware.
## TopGO
TopGO is the R package used for enrichment analyses.
First, we set up our working environment by loading the topGO library, setting our working directory, and loading in our data. We also need to assign the rownames in our data to the first column in our data because it appears that the table was saved improperly.
```
library(topGO)
getwd()
setwd("/courses/bi345/kyamad23/bi345_lab5/")
df.res.neg <- read.csv(file = "dicty.condition_de_results.txt")
ls()
row.names(df.res.neg) <- df.res.neg$X
```
Next, we assign the whole set of non-NA genes to the universe variable and save the up and downregulated genes to their own objects.
The upregulated and downregulated genes are defined as any in the results from last week with an adjusted p-value less than 0.05 with the proper log2FoldChange values. Since our data set includes genes with an NA value, we assign universe to be any non-NA genes included in our data.
```
up <- row.names(subset(df.res.neg, padj <= 0.05 & log2FoldChange > 0))
down <- row.names(subset(df.res.neg, padj <= 0.05 & log2FoldChange < 0))
universe <- row.names(subset(df.res.neg, is.na(padj) == FALSE))
```
Read in the GO annotation table that was preprocessed and format it as an R list.
We first load in our preprocessed topGO annotation file without headers and assign the column names to be gene_id and go_id. Then, we use the "by" method which serves similarly to "tapply" on R dataframes and ensure that the data are properly labeled.
```
raw.map <- read.table("geneid2go.dictybase.filter.topgo.txt", h = F)
colnames(raw.map) <- c("gene_id", "go_id")
fixed.map <- by(raw.map$go_id, raw.map$gene_id, function(x) as.character(x))
str(head(fixed.map))
```
Create a list with adjusted p-values and gene ids as names for each statistic. We pull every row from the sixth column in the dataframe but we could alternatively just call the column by its name "padj".
```
geneList <- subset(df.res.neg, is.na(padj) == F)[,6]
names(genesList) <- universe
str(geneList)
```
We will write a function in order to filter this list within topGo to any genes with a score of 0.05.
```
diffGene <- function(allScore){
return(allScore < 0.05)
}
table(diffGene(geneList))
```

We are ensuring that the number of differentially expressed genes calculated via our function is the same as the sum of the number of up and downregulated genes.
Next, we need to make a topGOdata object to specify what are data are and what gene ontology we should use to analyze it. This
```
BPdata <- new("topGOdata", description="GO analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=diffGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10)
BPdata
```

The output of the BPdata tells us that there are 549 significant genes which is down from our initial 1030 differentially expressed genes. This decrease is because not all genes will have a GO annotation.
We can run Fisher's exact test on the gene counts in order to get the statistical significance of each GO-annotations' contigency table. The Kolmogorov-Smirnov test allows us to use numeric data to see which genes are prsent at which ranks/scores.
These two tests along with a variety of others are built into the topGO package. If we use a ```groupStats``` object will tell topGO which test to use.
### Fisher's test
```
fisher.stat <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test")
resultFisher <- getSigGroups(BPdata, fisher.stat)
resultFisher
```

The classic Fisher's test results in a large number of false positives so in most cases, we will likely use an alternative.
We can run additional versions of Fisher's test as follows:
In order to do an elimination test on the data that weighs children ontologies heavier, we can run the following command
```
elim.stat <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)")
resultElim <- getSigGroups(BPdata, elim.stat)
resultElim # 21 sig terms
```

In order to do a weight test on the data that weighs more generalized, parent nodes heavier, we can run the following command:
```
weight.stat <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)")
resultWeight <- getSigGroups(BPdata, weight.stat)
resultWeight # 19 sig terms
```

A mix of weight and elimination algorithms can be ran as follows:
```
mix.stat <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)")
resultMix <- getSigGroups(BPdata, mix.stat)
resultMix # 20 sig terms
```

The parent test can be ran as follows:
```
parent.stat <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)")
resultParent <- getSigGroups(BPdata, parent.stat)
resultParent # 58 sig terms
```

It would be good to store these compiled results in a table. We can also filter the nodes by specifying the ```topNodes``` parameter.
```
countRes <- GenTable(BPdata, mix=resultMix, classic=resultFisher, elim=resultElim, weight=resultWeight, parent=resultParent, orderBy="mix", ranksOf="mix", topNodes=20)
countRes
```
If we wanted to get the genes that belonged to these top 20 nodes, we pull the GO terms from the results and use the ```genesInTerm``` method in topGO to get the gene membership
```
which.terms <- countRes[,1]
genesInTerm(BPdata, which.terms)
```
This command will return the gene ID's by GO terms:

To summarize these results, we can make a directed acyclic graph (DAG) that show the top 3 GO terms that are significant. They can also show how gene membership in GO terms can be nested.
```
showSigOfNodes(BPdata, score(resultFisher), firstSigNodes = 3, useInfo = 'all')
```

Within this figure, the signifcant terms are in rectangles and the coloration indicates the level of significance.
### KS-test
KS-test with elimination algorithm:
```
elim.score <- new("elimScore", testStatistic=GOKSTest, name="KS test")
resultKS <- getSigGroups(BPdata, elim.score)
resultKS
GenTable(BPdata, mixKS=resultKS, topNodes=26)
```

The results of the KS table shows that depending on the test and the algorithm chosen, the results can be significantly different. While 28 vs. 21 significant terms does not sound like much, some GO terms can have dozens of genes so it is important to consider which terms are being included or secluded depending on the test and algorithm.
In the following lines, we store the significant GO terms from 5 different versions of the test (classic Fisher, elimination Fisher, weight Fisher, parent, and mix)
```
clasRes <- GenTable(BPdata, classic=resultFisher, topNodes=86)
count.a <- clasRes[,c(1,6)]
elimRes <- GenTable(BPdata, elim=resultElim, topNodes=21)
count.b <- elimRes[,c(1,6)]
weigRes <- GenTable(BPdata, weight=resultWeight, topNodes=19)
count.c <- weigRes[,c(1,6)]
pareRes <- GenTable(BPdata, parent=resultParent, topNodes=20)
count.d <- pareRes[,c(1,6)]
mixRes <- GenTable(BPdata, mix=resultMix, topNodes=58)
count.e <- mixRes[,c(1,6)]
# merge all significant results
temp.a <- merge(count.a, count.b, by="GO.ID", all.x=T, all.y=T)
temp.b <- merge(temp.a, count.c, by="GO.ID", all.x=T, all.y=T)
temp.a <- merge(temp.b, count.d, by="GO.ID", all.x=T, all.y=T)
all.sig <- merge(temp.a, count.e, by="GO.ID", all.x=T, all.y=T)
rm(temp.a, temp.b)
table(rowSums(!(is.na(all.sig[,2:6]))))
```

The table above shows which GO terms that are unique to each method are supported by other tests.
We can see which methods produce terms that are only supported by itself:
```
keep <- rowSums(!(is.na(all.sig[,2:6]))) < 2
all.sig[keep,]
```
This table shows us that the classic and mix algorithms tend to identify unique GO terms not found by other methods. This is because the classic Fisher's test can find many false positives and because the mix algorithms takes the mean of the classic, elim, and weight p-values and log-normalizes them. Because of this, we should generally avoid classic Fisher's and Mix tests in enrichment analyses as they are too lenient and include too many false positives.

## topGO for up and downregulated genes
Often times, we want to look at up and down regulated genes separately rather than the union of all genes. We can re-run through our analyses above using separated genesets.
In the code block below, we subset the universe of genes into separate up and downregulated genes and create new topGOdata objects using identical parameters in order to see what changes.
```
upGene <- function(allScore){
return(names(geneList) %in% up)
}
downGene <- function(allScore){
return(names(geneList) %in% down)
}
BPdata_up <- new("topGOdata", description="GO analysis of upregulated Genes", ontology="BP", allGenes=geneList , geneSel=upGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10)
BPdata_up
BPdata_down <- new("topGOdata", description="GO analysis of downregulated Genes", ontology="BP", allGenes=geneList , geneSel=downGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10)
BPdata_down
```


Now that we have separate BPdata objects for the up and down reg. genes, we can run the variety of tests to see if we get differing number of significant GO terms.
#### Upregulated Tests:
```
fisher.stat <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test")
resultFisher_up <- getSigGroups(BPdata_up, fisher.stat)
resultFisher_up # 123 terms
elim.stat <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)")
resultElim_up <- getSigGroups(BPdata_up, elim.stat)
resultElim_up # 26 sig terms
weight.stat <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)")
resultWeight_up <- getSigGroups(BPdata_up, weight.stat)
resultWeight_up # 19 sig terms
mix.stat <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)")
resultMix_up <- getSigGroups(BPdata_up, mix.stat)
resultMix_up # 22 sig terms
parent.stat <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)")
resultParent_up <- getSigGroups(BPdata_up, parent.stat)
resultParent_up # 67 sig terms
```
#### Downregulated Tests:
```
fisher.stat <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test")
resultFisher_down <- getSigGroups(BPdata_down, fisher.stat)
resultFisher_down # 36 terms
elim.stat <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)")
resultElim_down <- getSigGroups(BPdata_down, elim.stat)
resultElim_down # 12 sig terms
weight.stat <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)")
resultWeight_down <- getSigGroups(BPdata_down, weight.stat)
resultWeight_down # 11 sig terms
mix.stat <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)")
resultMix_down <- getSigGroups(BPdata_down, mix.stat)
resultMix_down # 12 sig terms
parent.stat <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)")
resultParent_down <- getSigGroups(BPdata_down, parent.stat)
resultParent_down # 20 sig terms
```
The results of these tests show that there are quite a few more terms identified among the upregulated genes than there are for the downregulated genes. We can create a similar table from the union analysis that identifies unique terms by method. This is most likely because the number of downregulated genes is around half the number of upregulated genes.
We can create a table of test results for each GO term for both the up and downregulated sets:
```
countRes_up <- GenTable(BPdata_up, elim=resultElim_up, weight=resultWeight_up, parent=resultParent_up, orderBy="parent", ranksOf="parent", topNodes=20)
countRes_up
countRes_down <- GenTable(BPdata_down, elim=resultElim_down, weight=resultWeight_down, parent=resultParent_down, orderBy="parent", ranksOf="parent", topNodes=20)
countRes_down
```


These result counts have now been sorted by the parent algorithm results.
We can take a look at how many of these top 20 scoring GO terms are overlapped between the two genesets.
```
which.terms_up <- countRes_up[,1]
up_terms <- genesInTerm(BPdata_up, which.terms_up)
which.terms_down <- countRes_down[,1]
down_terms <- genesInTerm(BPdata_down, which.terms_down)
length(setdiff(names(up_terms), names(down_terms)))
```
Out of the 20 GO terms identified, none of them are shared among the up and down regulated genes. We can pull out the terms from the counts for additional information:
```
terms_up <- countRes_up[,2]
terms_down <- countRes_down[,2]
```
It seems that the GO terms identified from the upregulated genes mostly have to do with the metabolic process and replication. However, among the downregulated genes, there are more GO terms associated with cell signaling and cell defense. This could indicate that genes may be shut off due to being infected making the microorganisms more vulnerable. The upregulation of genes involved in replication and metabolic processes may also be explained by the fact that the microorganisms are forced to replicate and consume more in order to accomdate the parasite. This is however speculation and further insight into specific genes in these pathways are necessary to conclude anything.
## GOstats
GOstats is a less flexible but popular chocie for enrichment analysis that employs a parent-child relatioinship-aware algorithm. It runs statistical tests from the leaves to theroot and removes genes from significant terms up the chain (genes in children terms are removed from their parents). This is similar to topGO's elim algorithm.
The second library loaded is used for formatting GO term tables.
```
library(GOstats)
library(GSEABase)
gocurateData <- read.table("gene_association.dictybase.filter.gostats.txt", h = F)
goCurate <- GOFrame(gocurateData, organism = "Dictyostelium discoideum")
goCurFrame <- GOAllFrame(goCurate)
gsc <- GeneSetCollection(goCurFrame, setType = GOCollection())
```
The genesets in GOstats are called "GeneSetCollections" and we can load the preprocessed table into this object.
Next, we can build a hyperparameter object that points to the location of the gene list, which ontology we should use, and the annotations for GOstats to run a hypergeometric test on. **GOStats does not offer alternatives to the hypergeometric test**
```
bp.params <- GSEAGOHyperGParams(name="dicty GSEA", geneSetCollection = gsc, geneIds = up, universeGeneIds = universe,
ontology = "BP", pvalueCutoff = 0.01, conditional = T, testDirection = "over")
```
The p-value cutoff is of note to determine when to remove significant child-node genes from the parent node.
We can run the test and obtain the results based on the threshold that we specified. The category Size parameter of the summary tells GoStats to only show significant results for GO terms larger than a certain gene count.
```
bp.up <- hyperGTest(bp.params)
bp.up
summary(bp.up, categorySize = 10)
```
We can run the same process for the downregulated genes:
```
bp.params_down <- GSEAGOHyperGParams(name="dicty GSEA", geneSetCollection = gsc, geneIds = down, universeGeneIds = universe,
ontology = "BP", pvalueCutoff = 0.01, conditional = T, testDirection = "over")
bp.down <- hyperGTest(bp.params_down)
bp.down
summary(bp.down, categorySize = 10)
```




We see that there is a lot of overlap between the GO terms identified by topGO and GOstats. If we rank the results table from topGO by elim, while the order is not perserved, the terms identified in both the up and down regulated analyses across the R tools are the same. While the counts and test statistics are slightly different, the GO terms that should be of note are qualitatively identical.