# BI345 Peer Review Packet
## Poster

## Updates/Supplemental Information
Here are some comparisons of enriched processes I found across varying acidity and temperature.
The table below shows the prevalent processes and how many conditions they were common to:

## Sample Information
The samples, or data, used for this project started as raw sequencing files.
Here are the BioProject IDs and the links to access the files:
- Cong(2020) → https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=649349
- Wong(2023) → https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA824428&o=acc_s%3Aa
For the thermoresistance files, samples were identified through the Library/Sample name provided here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE200383
Here is a table that contains all of the samples and relevant information:
| Sample | Replicate | LB/SM | SRR# | ID |
| ------ | --------- | ----- | ---- | -- |
| 20°C | 1 |GSM6032459|SRR18674015|N2_20C_Rep2|
| 20°C | 2 |GSM6032460|SRR18674014|N2_20C_Rep3|
| 20°C | 3 |GSM6032461|SRR18674013|N2_20C_Rep4|
| 37°C | 1 |GSM6032463|SRR18674011|N2_37C_Rep2|
| 37°C | 2 |GSM6032464|SRR18674010|N2_37C_Rep3|
| 37°C | 3 |GSM6032465|SRR18674009|N2_37C_Rep4|
| pH 2 | 1 |FRAS190032895|SRR12352234|pH2_1_1|
| pH 2 | 2 |FRAS190032896|SRR12352228|pH2_2_1|
| pH 2 | 3 |FRAS190032897|SRR12352225|pH2_3_1|
| pH 4 | 1 |FRAS190032901|SRR12352232|pH4_1_1|
| pH 4 | 2 |FRAS190032902|SRR12352231|pH4_2_1|
| pH 4 | 3 |FRAS190032903|SRR12352230|pH4_3_1|
| pH 6 | 1 |FRAS190032904|SRR12352229|pH6_1_1|
| pH 6 | 2 |FRAS190032905|SRR12352227|pH6_2_1|
| pH 6 | 3 |FRAS190032906|SRR12352226|pH6_3_1|
Downloaded fastq files from NCBI using fasterq-dump command:
fasterq-dump --split-files -x --include-technical SRR12352226
## Steps to obtain counts from raw sequencing data
### 1) The fastq files were downloaded from NCBI using the fasterq-dump command
```
fasterq-dump --split-files -x SRR#
```
Be sure to replace the # following "SRR" with the SRR# that corresponds to the file that you want to download.
### 2) Once you have your downloaded files, we need to perform RNA-seq Alignment
#### 2a) Quality Control
I made a QC directory to hold all cleaned files. My data was paired-end so we have to provide both files. From here on out, commands will be shown for the acid analysis using the sample SRR12352234 (pH2_1) as an example and for the temperature analysis using SRR18674015 (20C) as an example. The following command was used for quality control:
```
# acid
fastp -i /courses/bi345/sdivit25/FinalProjectWorking/SRR12352234_1.fastq -o QC/out.SRR12352234_1.fastq -I /courses/bi345/sdivit25/FinalProjectWorking/SRR12352234_2.fastq -O QC/out.SRR12352234_2.fastq
# temperature
fastp -i /courses/bi345/sdivit25/FinalProjectWorking/SRR18674015_1.fastq -o QC_Temp/out.SRR18674015_1.fastq -I /courses/bi345/sdivit25/FinalProjectWorking/SRR18674015_2.fastq -O QC_Temp/out.SRR18674015_2.fastq
```
#### 2b) Index the Reference Genome
I downloaded my genome files (.fna & .gff) from the NCBI database: https://www.ncbi.nlm.nih.gov/genome/?term=Caenorhabditis+elegans
These are located in the FinalProjectWorking directory as: GCF_000002985.6_WBcel235_genomic.fna and GCF_000002985.6_WBcel235_genomic.gff
Here, I used STAR to index the reference genome through the following command:
```
STAR --runMode genomeGenerate --genomeDir ./STAR_index --genomeFastaFiles ./GCF_000002985.6_WBcel235_genomic.fna --sjdbGTFfile ./GCF_000002985.6_WBcel235_genomic.gff
```
#### 2c) Align Reads to the Reference Genome
```
#acid
STAR --genomeDir /courses/bi345/sdivit25/FinalProjectWorking/STAR_index/ --readFilesIn /courses/bi345/sdivit25/FinalProjectWorking/QC/out.SRR12352234_1.fastq /courses/bi345/sdivit25/FinalProjectWorking/QC/out.SRR12352234_2.fastq --outFileNamePrefix SRR12352234. --outSAMtype BAM Unsorted --outSAMattributes All
#temperature
STAR --genomeDir /courses/bi345/sdivit25/FinalProjectWorking/STAR_index/ --readFilesIn /courses/bi345/sdivit25/FinalProjectWorking/QC_Temp/out.SRR18674015_1.fastq /courses/bi345/sdivit25/FinalProjectWorking/QC_Temp/out.SRR18674015_2.fastq --outFileNamePrefix SRR18674015. --outSAMtype BAM Unsorted --outSAMattributes All
```
#### 2d) Exclude Unmapped Reads and Sort the Reads by Coordinate
```
samtools view -f 0x2 -h SRR#.Aligned.out.bam | samtools sort -n -o SRR#.sorted.bam
```
#### 2e) Add Read Group Information
Since we didn’t do this during alignment, it is essential to carry out this step here using the following command:
```
#acid
samtools addreplacerg -r "@RG\tID:pH2_1_1\tSM:pH2_1_1\tPL:illumina" -o pH2_1_1_readgroupFiles.bam SRR12352234.sorted.bam
#temperature
samtools addreplacerg -r "@RG\tID:N2_20C_Rep2\tSM:GSM6032459\tLB:GSM6032459\tPL:illumina" -o N2_20C_Rep2_readgroupFiles.bam SRR18674015.sorted.bam
```
Make sure your ID, LB/SM, and SRR# align with the appropriate corresponding values for the files you are adding read group information. Refer back to the table!
#### 2f) Count Reads
```
samtools view ID_readgroupFiles.bam | htseq-count -s reverse -a 20 - GCF_000002985.6_WBcel235_genomic_filtered.gtf > pH#_#.countReads.txt
#acid
samtools view pH2_1_1_readgroupFiles.bam | htseq-count -s reverse -a 20 - GCF_000002985.6_WBcel235_genomic_filtered.gtf > pH2_1.countReads.txt
#temperature
samtools view N2_20C_Rep2_readgroupFiles.bam | htseq-count -s reverse -a 20 - GCF_000002985.6_WBcel235_genomic_filtered.gtf > 20C_1.countReads.txt
```
## Location of Data (counts)
My count files are in the countReadsAcid and countReadsTemp. As the directories indicate, the countReadsAcid and countReadsTemp contain the count files for the acid and temperature samples.
## R-Code
### Differential Expression (DE) Analysis - Acid
This is the R-Code necessary for running DE Analysis. This code is for differential expression analysis for "low" vs "control". You want to repeat this process for medium and high vs control. Here are the specific lines you'll have to change depending on whether you want to do low, medium, or high:
- res.acid <- results(dds.acid, contrast=c("acid_level","low","control"), alpha=0.05)
- lfc.acid<- lfcShrink(dds.acid, coef="acid_level_low_vs_control", type="normal")
- write.csv(tab.acid, file="LowVsControl.acidLevel_de_results.txt")
```
# Set working directory to the location of the project files
setwd("/courses/bi345/sdivit25/FinalProjectWorking/Rfiles")
# Load the DESeq2 library for differential expression analysis
library(DESeq2)
# Define the directory containing the count data for each sample
directory <- "/courses/bi345/sdivit25/FinalProjectWorking/countReadsAcid/"
# Get the file names for each sample
sampleFiles <- grep("pH",list.files(directory),value=TRUE)
# Extract the sample name from each file name
sampleName <- sub("(*).countReads.WB.txt","\\1",sampleFiles)
# Create a data frame containing the sample information
sampleTable <- data.frame(sampleName = sampleName,
fileName = sampleFiles,
acid_level = c("high","high","high","medium","medium","medium","low","low","low","control","control","control"),
stringsAsFactors = T
)
# Load the dplyr library for data manipulation
library(dplyr)
# Reorder the levels of the "acid_level" column in the sample table
sampleTable_reordered <- sampleTable %>%
mutate(acid_level = factor(acid_level, levels = c("control", "low", "medium", "high"))) %>%
arrange(acid_level)
# Create a DESeq2 object from the count data and sample information
dds.acid <- DESeqDataSetFromHTSeqCount(sampleTable= sampleTable_reordered,
directory= directory,
design= ~ acid_level)
# Filter out low-expressed genes
keep <- rowSums(assay(dds.acid) >= 10) >= 3
table(keep)
dds.acid <- dds.acid[keep,]
# Run DESeq2 analysis
dds.acid <- DESeq(dds.acid)
# Visualize the relationship between read counts and size factors
rc <- apply(assay(dds.acid), 2, sum)
sf <- sizeFactors(dds.acid)
plot(rc, sf)
# Define a function to calculate the coefficient of variation
coefvar <- function(x) sd(x)/mean(x)
# Calculate the mean, variance, and coefficient of variation for each gene
dm <- apply(assay(dds.acid), 1, mean)
dv <- apply(assay(dds.acid), 1, var)
dc <- apply(assay(dds.acid), 1, coefvar)
# Visualize the relationship between mean expression and variance
plot(log(dm), log(dv))
# Visualize the relationship between mean expression and coefficient of variation
plot(log(dm), dc)
# Plot dispersion estimates for each group
plotDispEsts(dds.acid)
# Print the names of the results tables
resultsNames(dds.acid)
# output: "acid_level_low_vs_control"; "acid_level_medium_vs_control"; "acid_level_high_vs_control"
# Perform differential expression analysis for "low" vs "control"
res.acid <- results(dds.acid, contrast=c("acid_level","low","control"), alpha=0.05)
# Plot p-values against adjusted p-values for the "low" vs "control" comparison
plot(res.acid$pvalue, res.acid$padj)
abline(a=0, b=1, lty=2, col="red")
# Order the results table by adjusted p-value
res.acid <- res.acid[order(res.acid$padj),]
# Print a summary of the results
summary(res.acid)
# Print the descriptions of the top differentially expressed genes
mcols(res.acid)$description
head(res.acid)
# Transform the data using variance stabilizing transformation (VST)
rld.acid <- vst(dds.acid, blind = F)
# Plot the principal component analysis (PCA) with acid_level as the grouping variable
plotPCA(rld.acid, intgroup=c("acid_level"))
# Define a color map function based on the acid_level
color.map <- function(acid_level) {
if (acid_level == "high") {
"red"
} else if (acid_level == "medium") {
"yellow"
} else if (acid_level == "low") {
"greenyellow"
} else {
"black"
}
}
# Generate the colors for each sample based on their acid_level using the color map function
cond.colors <- unlist(lapply(sampleTable_reordered$acid_level, color.map))
# Load the RColorBrewer library and define color palettes for the heatmap
library(RColorBrewer)
corr.colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
dist.colors <- colorRampPalette(brewer.pal(9, "Oranges"))(255)
# Calculate the correlation matrix using the spearman method and generate the heatmap
corr.mat <- as.matrix(cor(assay(rld.acid), method="spearman"))
heatmap(corr.mat, Colv=F, Rowv=NA, scale="none", ColSideColors=cond.colors, col=corr.colors)
# Calculate the distance matrix and generate the heatmap
dist.mat <- as.matrix(dist(t(assay(rld.acid))))
heatmap(dist.mat, Colv=F, Rowv=NA, scale="none", ColSideColors=cond.colors, col=dist.colors)
# Plot the MA plot for the differential expression analysis
plotMA(res.acid)
# Print the names of the results
resultsNames(dds.acid)
# Apply shrinkage estimator to the log-fold changes using the "acid_level_low_vs_control" contrast
lfc.acid<- lfcShrink(dds.acid, coef="acid_level_low_vs_control", type="normal")
# Plot the MA plot for the shrunk log-fold changes
plotMA(lfc.acid)
# Load the ggplot2 library and create a data frame for the shrunk log-fold changes
library(ggplot2)
tab.acid <- as.data.frame(lfc.acid)
# Add a column for significance based on the adjusted p-value (padj)
tab.acid$sig <- tab.acid$padj
tab.acid$sig <- ifelse(tab.acid$padj<=0.05, "FDR 0.05","not sig.")
tab.acid$sig <- factor(ifelse(tab.acid$padj>0.05 & tab.acid$padj<=0.1, "FDR 0.1", tab.acid$sig))
# Generate a scatter plot of the shrunk log-fold changes with significance color-coded
ggplot(tab.acid, aes(x=log2FoldChange, y=-log10(padj), color=sig)) + geom_point(shape=16, size=1) + scale_color_manual(values=c("red","yellow", "yellowgreen","grey"),name="") + ylab("-log10(Padj)")
# Define a color palette for the heatmap and select the top 50 differentially expressed genes
heat.colors <- colorRampPalette(brewer.pal(9, "RdBu"))(255)
select <- row.names(res.acid)[1:50]
# Plot a heatmap for the expression of the selected genes
heatmap(assay(rld.acid)[select,], ColSideColors=cond.colors, cexRow=0.4, col=heat.colors)
# Save the differential expression results to a CSV file
write.csv(tab.acid, file="LowVsControl.acidLevel_de_results.txt")
```
### Differential Expression (DE) Analysis - Temperature
DE code for temperature is mostly the same! The beginning/initial set-up is different and showed below.
```
# Set working directory to the location of the project files
setwd("/courses/bi345/sdivit25/FinalProjectWorking/Rfiles")
# Load the DESeq2 library for differential expression analysis
library(DESeq2)
# Define the directory containing the count data for each sample
directory_Temp <- "/courses/bi345/sdivit25/FinalProjectWorking/countReadsTemp/"
# Get the file names for each sample
sampleFiles_Temp <- grep("N2",list.files(directory_Temp),value=TRUE)
# Extract the sample name from each file name
sampleName_Temp <- sub("(*).countReads.WB.txt","\\1",sampleFiles_Temp)
# Create a data frame containing the sample information
sampleTable_Temp <- data.frame(sampleName = sampleName_Temp,
fileName = sampleFiles_Temp,
temperature = c("control","control","control","high","high","high"),
stringsAsFactors = T
)
# Create a DESeq2 object from the count data and sample information
dds.temp <- DESeqDataSetFromHTSeqCount(sampleTable= sampleTable_Temp,
directory= directory_Temp,
design= ~ temperature)
```
The rest of the code is the same. Just be sure to use the new DESeq2 object and and change the specifics of the code as necessary. You should be able to figure out the rest of the code (it's just making sure everything lines up correctly).
### Enrichment Analysis
The code below performs GO analysis to identify significant biological processes (BP) associated with differentially expressed (DE) genes between the two conditions investigated in the DE input file. The code essentially reads in the input file containing DE results, identifies up-regulated and down-regulated genes, sets the universe for the analysis, reads in the gene annotation file, creates a table of enriched GO terms using different methods, orders the results by the 'mix' method, extracts the GO term IDs from the table, and shows the most significant GO terms using Fisher's exact test and displays all available information.
```
# Set working directory to where input files are located
setwd("/courses/bi345/sdivit25/FinalProjectWorking/Rfiles")
# Read in the input file (from DE analysis) and store it in a data frame "df.res.neg"
df.res.neg <- read.csv("HighVsControl.temperature_de_results.txt")
ls()
#load the topGO package
library(topGO)
#Set the row names of "df.res.neg" as the values in the "X" column
row.names(df.res.neg) <- df.res.neg$X
row.names(df.res.neg)
# Identify the up-regulated and down-regulated genes based on adjusted p-value and log2 fold change
up <- row.names(subset(df.res.neg, padj<=0.05 & log2FoldChange>0))
down <- row.names(subset(df.res.neg, padj<=0.05 & log2FoldChange<0))
# Set the universe as all the rows where padj is not missing
universe <- row.names(subset(df.res.neg, is.na(padj)==FALSE))
# Set working directory to where the geneid2go.C.elegans.filter.topgo.txt file is located
setwd("/courses/bi345/sdivit25/FinalProjectWorking/EnrichmentAnalyses")
# Read in the "geneid2go.C.elegans.filter.topgo.txt" file and store it in a data frame "raw.map"
raw.map <- read.table("geneid2go.C.elegans.filter.topgo.txt", h=F)
# Set the column names of "raw.map" as "gene_id" and "go_id"
colnames(raw.map) <- c("gene_id","go_id")
# Create a new data frame "fixed.map" by grouping the values in the "go_id" column by "gene_id" and converting them into a character vector
fixed.map <- by(raw.map$go_id, raw.map$gene_id, function(x) as.character(x))
# Print the first few elements of "fixed.map"
str(head(fixed.map))
# Subset "df.res.neg" to include only rows where padj is not missing and select the 6th column (which contains the gene IDs) as the gene list
geneList <- subset(df.res.neg, is.na(padj)==FALSE)[,6]
# Set the names of the "geneList" as the values in "universe"
names(geneList) <- universe
# Print the structure of "geneList"
str(geneList)
# Define a function "diffGene" that returns a logical vector indicating whether the input value is less than 0.05
diffGene <- function(allScore) {
return(allScore < 0.05)
}
# Apply the "diffGene" function to "geneList" and create a table of the results
table(diffGene(geneList))
# Create a new topGOdata object "BPdata" for biological process (BP) ontology analysis, using the "geneList" and "fixed.map" as input
BPdata <- new("topGOdata", description="GO analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=diffGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10)
BPdata
# Create a new classicCount object "fisher.stat" using the GOFisherTest as the test statistic
fisher.stat <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test")
# Perform the enrichment analysis using "BPdata" and "fisher.stat"
resultFisher <- getSigGroups(BPdata, fisher.stat)
resultFisher
# Define and execute the Fisher's exact test using the 'elim' method to obtain significantly enriched Gene Ontology (GO) terms for biological processes (BP)
elim.stat <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)")
resultElim <- getSigGroups(BPdata, elim.stat)
resultElim
# Define and execute the Fisher's exact test using the 'weight' method to obtain significantly enriched GO terms for BP
weight.stat <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)")
resultWeight <- getSigGroups(BPdata, weight.stat)
resultWeight
# Define and execute the Fisher's exact test using the 'mix' method to obtain significantly enriched GO terms for BP
mix.stat <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)")
resultMix <- getSigGroups(BPdata, mix.stat)
resultMix
# Define and execute the Fisher's exact test using the 'parent-child' method to obtain significantly enriched GO terms for BP
parent.stat <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)")
resultParent <- getSigGroups(BPdata, parent.stat)
resultParent
# Generate a table of enriched GO terms using different methods and order them by the 'mix' method
countRes <- GenTable(BPdata, mix=resultMix, classic=resultFisher, elim=resultElim, weight=resultWeight, parent=resultParent, orderBy="mix", ranksOf="mix", topNodes=20)
countRes
# Extract the GO term IDs from the table
which.terms <- countRes[,1]
# Get the genes associated with each significant GO term
genesInTerm(BPdata, which.terms)
# Show the most significant GO terms using the Fisher's exact test and display all the available information
showSigOfNodes(BPdata, score(resultFisher), firstSigNodes=3, useInfo='all')
# Define and execute the Kolmogorov-Smirnov (KS) test using the 'elim' method to obtain significantly enriched GO terms for BP
elim.score <- new("elimScore", testStatistic=GOKSTest, name="KS test")
resultKS <- getSigGroups(BPdata, elim.score)
resultKS
# Generating tables using different statistical tests
# Generate a table using the KS test with mix statistic as input and top 26 nodes as output
GenTable(BPdata, mixKS=resultKS, topNodes=26)
# Generate a table using the classic Fisher test with top 86 nodes as output
clasRes <- GenTable(BPdata, classic=resultFisher, topNodes=86)
count.a <- clasRes[,c(1,6)]
# Generate a table using the Fisher test with elimination procedure with top 21 nodes as output
elimRes <- GenTable(BPdata, elim=resultElim, topNodes=21)
count.b <- elimRes[,c(1,6)]
# Generate a table using the Fisher test with weighting procedure with top 19 nodes as output
weigRes <- GenTable(BPdata, weight=resultWeight, topNodes=19)
count.c <- weigRes[,c(1,6)]
# Generate a table using the Fisher test with parent-child method with top 20 nodes as output
pareRes <- GenTable(BPdata, parent=resultParent, topNodes=20)
count.d <- pareRes[,c(1,6)]
# Generate a table using the Fisher test with a mixture of both weighting and elimination procedures with top 58 nodes as output
mixRes <- GenTable(BPdata, mix=resultMix, topNodes=58)
count.e <- mixRes[,c(1,6)]
# Merge all significant results
temp.a <- merge(count.a, count.b, by="GO.ID", all.x=T, all.y=T)
temp.b <- merge(temp.a, count.c, by="GO.ID", all.x=T, all.y=T)
temp.a <- merge(temp.b, count.d, by="GO.ID", all.x=T, all.y=T)
all.sig <- merge(temp.a, count.e, by="GO.ID", all.x=T, all.y=T)
# Remove temp.a and temp.b
rm(temp.a, temp.b)
# Create a table that shows the number of rows that have at least one non-NA value in columns 2 to 6
table(rowSums(!(is.na(all.sig[,2:6]))))
# Create a logical vector "keep" that is TRUE for rows that have less than 2 non-NA values in columns 2 to 6
keep <- rowSums(!(is.na(all.sig[,2:6]))) <2
# Subset the "all.sig" table to keep only rows that satisfy the condition in "keep" vector
all.sig[keep,]
```
This next code is going to perform GO analysis on the set of upregulated genes. It is an extension of the previous code. Instead of diffGene, it defines a function called upGene to determine which genes in a specified set are upregulated, and then generates a table of counts for the upregulated gene set. It then creates a topGOdata object using the upregulated gene set and performs different tests to identify significant Gene Ontology (GO) terms associated with the upregulated genes. This additional code includes a number of functions that are not present in the previous code including upGene, new, getSigGroups, GenTable, genesInTerm, showSigOfNodes, merge, and rowSums.
```
# Upregulated genes
# Define function upGene to return genes in a specific set, 'up'
upGene <- function(allScore) {
return(names(geneList) %in% up)
}
# Generate table of counts for the up-regulated gene set
table(upGene(geneList))
# Create a new topGOdata object with the up-regulated gene set
BPdataUp <- new("topGOdata", description="GO analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=upGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10)
BPdataUp
# Define the test statistic and create a new object of the classicCount class for Fisher test
fisher.statUp <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test")
# Get the significant groups using the classicCount object with Fisher test as the test statistic
resultFisherUp <- getSigGroups(BPdataUp, fisher.statUp)
resultFisherUp
# Define the test statistic and create a new object of the elimCount class for Fisher (elim) test
elim.statUp <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)")
# Get the significant groups using the elimCount object with Fisher (elim) test as the test statistic
resultElimUp <- getSigGroups(BPdataUp, elim.statUp)
resultElimUp
# Define the test statistic and create a new object of the weightCount class for Fisher (weight) test
weight.statUp <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)")
# Get the significant groups using the weightCount object with Fisher (weight) test as the test statistic
resultWeightUp <- getSigGroups(BPdataUp, weight.statUp)
resultWeightUp
# Define the test statistic and create a new object of the weight01Count class for Fisher (mix) test
mix.statUp <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)")
# Get the significant groups using the weight01Count object with Fisher (mix) test as the test statistic
resultMixUp <- getSigGroups(BPdataUp, mix.statUp)
resultMixUp
# Define the test statistic and create a new object of the parentChild class for Fisher (parent-child) test
parent.statUp <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)")
# Get the significant groups using the parentChild object with Fisher (parent-child) test as the test statistic
resultParentUp <- getSigGroups(BPdataUp, parent.statUp)
resultParentUp
# Generate a table with the results from all five tests ordered by the results of Fisher (mix) test
countResUp <- GenTable(BPdataUp, mix=resultMixUp, classic=resultFisherUp, elim=resultElimUp, weight=resultWeightUp, parent=resultParentUp, orderBy="mix", ranksOf="mix", topNodes=20)
countResUp
# Extract significant terms and their corresponding genes
which.termsUp <- countRes[,1]
genesInTerm(BPdataUp, which.termsUp)
# Show the top significant nodes based on the Fisher test
showSigOfNodes(BPdataUp, score(resultFisherUp), firstSigNodes=3, useInfo='all')
# Use the KS test to identify significant terms
elim.scoreUp <- new("elimScore", testStatistic=GOKSTest, name="KS test")
resultKSUp <- getSigGroups(BPdataUp, elim.score)
resultKSUp
GenTable(BPdataUp, mixKS=resultKSUp, topNodes=26)
# Generate tables of significant terms using each method individually
clasResUp <- GenTable(BPdataUp, classic=resultFisherUp, topNodes=86)
count.aUp <- clasResUp[,c(1,6)]
elimResUp <- GenTable(BPdataUp, elim=resultElimUp, topNodes=21)
count.bUp <- elimResUp[,c(1,6)]
weigResUp <- GenTable(BPdataUp, weight=resultWeightUp, topNodes=19)
count.cUp <- weigResUp[,c(1,6)]
pareResUp <- GenTable(BPdataUp, parent=resultParentUp, topNodes=20)
count.dUp <- pareResUp[,c(1,6)]
mixResUp <- GenTable(BPdataUp, mix=resultMixUp, topNodes=58)
count.eUp <- mixResUp[,c(1,6)]
# Merge all significant results
temp.aUp <- merge(count.aUp, count.bUp, by="GO.ID", all.x=T, all.y=T)
temp.bUp <- merge(temp.aUp, count.cUp, by="GO.ID", all.x=T, all.y=T)
temp.aUp <- merge(temp.bUp, count.dUp, by="GO.ID", all.x=T, all.y=T)
all.sigUp <- merge(temp.aUp, count.eUp, by="GO.ID", all.x=T, all.y=T)
# Remove temp.aUp and temp.bUp
rm(temp.aUp, temp.bUp)
# Create a table that shows the number of rows that have at least one non-NA value in columns 2 to 6
table(rowSums(!(is.na(all.sigUp[,2:6]))))
# Create a logical vector "keep" that is TRUE for rows that have less than 2 non-NA values in columns 2 to 6
keepUp <- rowSums(!(is.na(all.sigUp[,2:6]))) <2
# Subset the "all.sigUp" table to keep only rows that satisfy the condition in "keep" vector
all.sigUp[keepUp,]
```
To perform GO analysis on a set of downregulated genes, use the same code except substitute "Up" for "Down".
To visualize results, use the following code:
```
# Visualizing data
# Create topGOdata objects for BP analysis of up-regulated genes and down-regulated genes, respectively
BPup <- new("topGOdata", description="GO BP analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=upGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10)
BPdown <- new("topGOdata", description="GO BP analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=downGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10)
# Create an elimCount object with Fisher test statistic, name "Fisher (elim)", and cutoff value of 0.05
elim.stat.05 <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)", cutOff=0.05)
# perform topGO analysis using elimCount with up-regulated genes and down-regulated genes, respectively, to get significant GO groups
resultElimup <- getSigGroups(BPup, elim.stat.05)
resultElimup
resultElimdown <- getSigGroups(BPdown, elim.stat.05)
resultElimdown
# Generate tables of significant GO groups with P-values and number of genes annotated to each group for up-regulated genes and down-regulated genes, respectively
# TopNodes and numChar are parameters that control the number of top GO groups displayed and the number of characters used for displaying GO terms
GenTable(BPup, Pvalue=resultElimup, topNodes=50)
tab4 <- GenTable(BPup, Pvalue=resultElimup, topNodes=36, numChar=99)
GenTable(BPdown, Pvalue=resultElimdown, topNodes=50)
tab5 <- GenTable(BPdown, Pvalue=resultElimdown, topNodes=19, numChar=99)
# Write the tables to files in tab-delimited format without row or column names, respectively
write.table(tab4, file="elim.upreg_ControlVsHighTemp.topGO.txt", quote=F, sep="\t", row.names=T, col.names=T)
write.table(tab5, file="elim.downreg_ControlVsHighTemp.topGO.txt", quote=F, sep="\t", row.names=T, col.names=T)
# Load ggplot2 and scales libraries
library(ggplot2)
library(scales)
# Convert P-value columns to numeric and combine the two tables into a single dataframe
tab4$Pvalue <- as.numeric(tab4$Pvalue)
tab5$Pvalue <- as.numeric(tab5$Pvalue)
goVis <- rbind(tab4[,c(1,2,6)], tab5[,c(1,2,6)])
# set the order of GO terms to be displayed in the plot and generate a ggplot object
goVis$Term <- factor(goVis$Term, levels=rev(goVis$Term))
ggplot(goVis, aes(x=Term, y=-log10(Pvalue), fill=-log10(Pvalue))) +
expand_limits(y=1) + geom_point(shape=21, size=3) + scale_fill_continuous(low='blue', high='red4') +
xlab('') + ylab('Enrichment score') + labs(title='GO Biological Process') + theme_bw() + coord_flip()
```
Be sure to change the name of the file you're writing based on the enrichment analysis you're doing!
To visualize as bubble plots, use the following code:
```
gofigure -i /courses/bi345/sdivit25/FinalProjectWorking/EnrichmentAnalyses/elim.upreg_$1.topGO.txt -o elim_up_$1/ -j topgo$
gofigure -i /courses/bi345/sdivit25/FinalProjectWorking/EnrichmentAnalyses/elim.downreg_$1.topGO.txt -o elim_down_$1/ -j t$
```
Be ware I ran these commands in a new directory titled "BubblePlots" in order to contain all data in one place. Change $1 with the analysis you wish to visualize, for example, "ControlVsHighAcid".
Once you've generated the bubble plots, you're done! Great job.