BI345 Peer Review Packet

# BI345 Peer Review Packet ## Poster ![](https://i.imgur.com/1Kut5Fs.jpg) ## Updates/Supplemental Information Here are some comparisons of enriched processes I found across varying acidity and temperature. The table below shows the prevalent processes and how many conditions they were common to: ![](https://i.imgur.com/F6B8JXo.png) ## Sample Information The samples, or data, used for this project started as raw sequencing files. Here are the BioProject IDs and the links to access the files: - Cong(2020) → https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=649349 - Wong(2023) → https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA824428&o=acc_s%3Aa For the thermoresistance files, samples were identified through the Library/Sample name provided here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE200383 Here is a table that contains all of the samples and relevant information: | Sample | Replicate | LB/SM | SRR# | ID | | ------ | --------- | ----- | ---- | -- | | 20°C | 1 |GSM6032459|SRR18674015|N2_20C_Rep2| | 20°C | 2 |GSM6032460|SRR18674014|N2_20C_Rep3| | 20°C | 3 |GSM6032461|SRR18674013|N2_20C_Rep4| | 37°C | 1 |GSM6032463|SRR18674011|N2_37C_Rep2| | 37°C | 2 |GSM6032464|SRR18674010|N2_37C_Rep3| | 37°C | 3 |GSM6032465|SRR18674009|N2_37C_Rep4| | pH 2 | 1 |FRAS190032895|SRR12352234|pH2_1_1| | pH 2 | 2 |FRAS190032896|SRR12352228|pH2_2_1| | pH 2 | 3 |FRAS190032897|SRR12352225|pH2_3_1| | pH 4 | 1 |FRAS190032901|SRR12352232|pH4_1_1| | pH 4 | 2 |FRAS190032902|SRR12352231|pH4_2_1| | pH 4 | 3 |FRAS190032903|SRR12352230|pH4_3_1| | pH 6 | 1 |FRAS190032904|SRR12352229|pH6_1_1| | pH 6 | 2 |FRAS190032905|SRR12352227|pH6_2_1| | pH 6 | 3 |FRAS190032906|SRR12352226|pH6_3_1| Downloaded fastq files from NCBI using fasterq-dump command: fasterq-dump --split-files -x --include-technical SRR12352226 ## Steps to obtain counts from raw sequencing data ### 1) The fastq files were downloaded from NCBI using the fasterq-dump command ``` fasterq-dump --split-files -x SRR# ``` Be sure to replace the # following "SRR" with the SRR# that corresponds to the file that you want to download. ### 2) Once you have your downloaded files, we need to perform RNA-seq Alignment #### 2a) Quality Control I made a QC directory to hold all cleaned files. My data was paired-end so we have to provide both files. From here on out, commands will be shown for the acid analysis using the sample SRR12352234 (pH2_1) as an example and for the temperature analysis using SRR18674015 (20C) as an example. The following command was used for quality control: ``` # acid fastp -i /courses/bi345/sdivit25/FinalProjectWorking/SRR12352234_1.fastq -o QC/out.SRR12352234_1.fastq -I /courses/bi345/sdivit25/FinalProjectWorking/SRR12352234_2.fastq -O QC/out.SRR12352234_2.fastq # temperature fastp -i /courses/bi345/sdivit25/FinalProjectWorking/SRR18674015_1.fastq -o QC_Temp/out.SRR18674015_1.fastq -I /courses/bi345/sdivit25/FinalProjectWorking/SRR18674015_2.fastq -O QC_Temp/out.SRR18674015_2.fastq ``` #### 2b) Index the Reference Genome I downloaded my genome files (.fna & .gff) from the NCBI database: https://www.ncbi.nlm.nih.gov/genome/?term=Caenorhabditis+elegans These are located in the FinalProjectWorking directory as: GCF_000002985.6_WBcel235_genomic.fna and GCF_000002985.6_WBcel235_genomic.gff Here, I used STAR to index the reference genome through the following command: ``` STAR --runMode genomeGenerate --genomeDir ./STAR_index --genomeFastaFiles ./GCF_000002985.6_WBcel235_genomic.fna --sjdbGTFfile ./GCF_000002985.6_WBcel235_genomic.gff ``` #### 2c) Align Reads to the Reference Genome ``` #acid STAR --genomeDir /courses/bi345/sdivit25/FinalProjectWorking/STAR_index/ --readFilesIn /courses/bi345/sdivit25/FinalProjectWorking/QC/out.SRR12352234_1.fastq /courses/bi345/sdivit25/FinalProjectWorking/QC/out.SRR12352234_2.fastq --outFileNamePrefix SRR12352234. --outSAMtype BAM Unsorted --outSAMattributes All #temperature STAR --genomeDir /courses/bi345/sdivit25/FinalProjectWorking/STAR_index/ --readFilesIn /courses/bi345/sdivit25/FinalProjectWorking/QC_Temp/out.SRR18674015_1.fastq /courses/bi345/sdivit25/FinalProjectWorking/QC_Temp/out.SRR18674015_2.fastq --outFileNamePrefix SRR18674015. --outSAMtype BAM Unsorted --outSAMattributes All ``` #### 2d) Exclude Unmapped Reads and Sort the Reads by Coordinate ``` samtools view -f 0x2 -h SRR#.Aligned.out.bam | samtools sort -n -o SRR#.sorted.bam ``` #### 2e) Add Read Group Information Since we didn’t do this during alignment, it is essential to carry out this step here using the following command: ``` #acid samtools addreplacerg -r "@RG\tID:pH2_1_1\tSM:pH2_1_1\tPL:illumina" -o pH2_1_1_readgroupFiles.bam SRR12352234.sorted.bam #temperature samtools addreplacerg -r "@RG\tID:N2_20C_Rep2\tSM:GSM6032459\tLB:GSM6032459\tPL:illumina" -o N2_20C_Rep2_readgroupFiles.bam SRR18674015.sorted.bam ``` Make sure your ID, LB/SM, and SRR# align with the appropriate corresponding values for the files you are adding read group information. Refer back to the table! #### 2f) Count Reads ``` samtools view ID_readgroupFiles.bam | htseq-count -s reverse -a 20 - GCF_000002985.6_WBcel235_genomic_filtered.gtf > pH#_#.countReads.txt #acid samtools view pH2_1_1_readgroupFiles.bam | htseq-count -s reverse -a 20 - GCF_000002985.6_WBcel235_genomic_filtered.gtf > pH2_1.countReads.txt #temperature samtools view N2_20C_Rep2_readgroupFiles.bam | htseq-count -s reverse -a 20 - GCF_000002985.6_WBcel235_genomic_filtered.gtf > 20C_1.countReads.txt ``` ## Location of Data (counts) My count files are in the countReadsAcid and countReadsTemp. As the directories indicate, the countReadsAcid and countReadsTemp contain the count files for the acid and temperature samples. ## R-Code ### Differential Expression (DE) Analysis - Acid This is the R-Code necessary for running DE Analysis. This code is for differential expression analysis for "low" vs "control". You want to repeat this process for medium and high vs control. Here are the specific lines you'll have to change depending on whether you want to do low, medium, or high: - res.acid <- results(dds.acid, contrast=c("acid_level","low","control"), alpha=0.05) - lfc.acid<- lfcShrink(dds.acid, coef="acid_level_low_vs_control", type="normal") - write.csv(tab.acid, file="LowVsControl.acidLevel_de_results.txt") ``` # Set working directory to the location of the project files setwd("/courses/bi345/sdivit25/FinalProjectWorking/Rfiles") # Load the DESeq2 library for differential expression analysis library(DESeq2) # Define the directory containing the count data for each sample directory <- "/courses/bi345/sdivit25/FinalProjectWorking/countReadsAcid/" # Get the file names for each sample sampleFiles <- grep("pH",list.files(directory),value=TRUE) # Extract the sample name from each file name sampleName <- sub("(*).countReads.WB.txt","\\1",sampleFiles) # Create a data frame containing the sample information sampleTable <- data.frame(sampleName = sampleName, fileName = sampleFiles, acid_level = c("high","high","high","medium","medium","medium","low","low","low","control","control","control"), stringsAsFactors = T ) # Load the dplyr library for data manipulation library(dplyr) # Reorder the levels of the "acid_level" column in the sample table sampleTable_reordered <- sampleTable %>% mutate(acid_level = factor(acid_level, levels = c("control", "low", "medium", "high"))) %>% arrange(acid_level) # Create a DESeq2 object from the count data and sample information dds.acid <- DESeqDataSetFromHTSeqCount(sampleTable= sampleTable_reordered, directory= directory, design= ~ acid_level) # Filter out low-expressed genes keep <- rowSums(assay(dds.acid) >= 10) >= 3 table(keep) dds.acid <- dds.acid[keep,] # Run DESeq2 analysis dds.acid <- DESeq(dds.acid) # Visualize the relationship between read counts and size factors rc <- apply(assay(dds.acid), 2, sum) sf <- sizeFactors(dds.acid) plot(rc, sf) # Define a function to calculate the coefficient of variation coefvar <- function(x) sd(x)/mean(x) # Calculate the mean, variance, and coefficient of variation for each gene dm <- apply(assay(dds.acid), 1, mean) dv <- apply(assay(dds.acid), 1, var) dc <- apply(assay(dds.acid), 1, coefvar) # Visualize the relationship between mean expression and variance plot(log(dm), log(dv)) # Visualize the relationship between mean expression and coefficient of variation plot(log(dm), dc) # Plot dispersion estimates for each group plotDispEsts(dds.acid) # Print the names of the results tables resultsNames(dds.acid) # output: "acid_level_low_vs_control"; "acid_level_medium_vs_control"; "acid_level_high_vs_control" # Perform differential expression analysis for "low" vs "control" res.acid <- results(dds.acid, contrast=c("acid_level","low","control"), alpha=0.05) # Plot p-values against adjusted p-values for the "low" vs "control" comparison plot(res.acid$pvalue, res.acid$padj) abline(a=0, b=1, lty=2, col="red") # Order the results table by adjusted p-value res.acid <- res.acid[order(res.acid$padj),] # Print a summary of the results summary(res.acid) # Print the descriptions of the top differentially expressed genes mcols(res.acid)$description head(res.acid) # Transform the data using variance stabilizing transformation (VST) rld.acid <- vst(dds.acid, blind = F) # Plot the principal component analysis (PCA) with acid_level as the grouping variable plotPCA(rld.acid, intgroup=c("acid_level")) # Define a color map function based on the acid_level color.map <- function(acid_level) { if (acid_level == "high") { "red" } else if (acid_level == "medium") { "yellow" } else if (acid_level == "low") { "greenyellow" } else { "black" } } # Generate the colors for each sample based on their acid_level using the color map function cond.colors <- unlist(lapply(sampleTable_reordered$acid_level, color.map)) # Load the RColorBrewer library and define color palettes for the heatmap library(RColorBrewer) corr.colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) dist.colors <- colorRampPalette(brewer.pal(9, "Oranges"))(255) # Calculate the correlation matrix using the spearman method and generate the heatmap corr.mat <- as.matrix(cor(assay(rld.acid), method="spearman")) heatmap(corr.mat, Colv=F, Rowv=NA, scale="none", ColSideColors=cond.colors, col=corr.colors) # Calculate the distance matrix and generate the heatmap dist.mat <- as.matrix(dist(t(assay(rld.acid)))) heatmap(dist.mat, Colv=F, Rowv=NA, scale="none", ColSideColors=cond.colors, col=dist.colors) # Plot the MA plot for the differential expression analysis plotMA(res.acid) # Print the names of the results resultsNames(dds.acid) # Apply shrinkage estimator to the log-fold changes using the "acid_level_low_vs_control" contrast lfc.acid<- lfcShrink(dds.acid, coef="acid_level_low_vs_control", type="normal") # Plot the MA plot for the shrunk log-fold changes plotMA(lfc.acid) # Load the ggplot2 library and create a data frame for the shrunk log-fold changes library(ggplot2) tab.acid <- as.data.frame(lfc.acid) # Add a column for significance based on the adjusted p-value (padj) tab.acid$sig <- tab.acid$padj tab.acid$sig <- ifelse(tab.acid$padj<=0.05, "FDR 0.05","not sig.") tab.acid$sig <- factor(ifelse(tab.acid$padj>0.05 & tab.acid$padj<=0.1, "FDR 0.1", tab.acid$sig)) # Generate a scatter plot of the shrunk log-fold changes with significance color-coded ggplot(tab.acid, aes(x=log2FoldChange, y=-log10(padj), color=sig)) + geom_point(shape=16, size=1) + scale_color_manual(values=c("red","yellow", "yellowgreen","grey"),name="") + ylab("-log10(Padj)") # Define a color palette for the heatmap and select the top 50 differentially expressed genes heat.colors <- colorRampPalette(brewer.pal(9, "RdBu"))(255) select <- row.names(res.acid)[1:50] # Plot a heatmap for the expression of the selected genes heatmap(assay(rld.acid)[select,], ColSideColors=cond.colors, cexRow=0.4, col=heat.colors) # Save the differential expression results to a CSV file write.csv(tab.acid, file="LowVsControl.acidLevel_de_results.txt") ``` ### Differential Expression (DE) Analysis - Temperature DE code for temperature is mostly the same! The beginning/initial set-up is different and showed below. ``` # Set working directory to the location of the project files setwd("/courses/bi345/sdivit25/FinalProjectWorking/Rfiles") # Load the DESeq2 library for differential expression analysis library(DESeq2) # Define the directory containing the count data for each sample directory_Temp <- "/courses/bi345/sdivit25/FinalProjectWorking/countReadsTemp/" # Get the file names for each sample sampleFiles_Temp <- grep("N2",list.files(directory_Temp),value=TRUE) # Extract the sample name from each file name sampleName_Temp <- sub("(*).countReads.WB.txt","\\1",sampleFiles_Temp) # Create a data frame containing the sample information sampleTable_Temp <- data.frame(sampleName = sampleName_Temp, fileName = sampleFiles_Temp, temperature = c("control","control","control","high","high","high"), stringsAsFactors = T ) # Create a DESeq2 object from the count data and sample information dds.temp <- DESeqDataSetFromHTSeqCount(sampleTable= sampleTable_Temp, directory= directory_Temp, design= ~ temperature) ``` The rest of the code is the same. Just be sure to use the new DESeq2 object and and change the specifics of the code as necessary. You should be able to figure out the rest of the code (it's just making sure everything lines up correctly). ### Enrichment Analysis The code below performs GO analysis to identify significant biological processes (BP) associated with differentially expressed (DE) genes between the two conditions investigated in the DE input file. The code essentially reads in the input file containing DE results, identifies up-regulated and down-regulated genes, sets the universe for the analysis, reads in the gene annotation file, creates a table of enriched GO terms using different methods, orders the results by the 'mix' method, extracts the GO term IDs from the table, and shows the most significant GO terms using Fisher's exact test and displays all available information. ``` # Set working directory to where input files are located setwd("/courses/bi345/sdivit25/FinalProjectWorking/Rfiles") # Read in the input file (from DE analysis) and store it in a data frame "df.res.neg" df.res.neg <- read.csv("HighVsControl.temperature_de_results.txt") ls() #load the topGO package library(topGO) #Set the row names of "df.res.neg" as the values in the "X" column row.names(df.res.neg) <- df.res.neg$X row.names(df.res.neg) # Identify the up-regulated and down-regulated genes based on adjusted p-value and log2 fold change up <- row.names(subset(df.res.neg, padj<=0.05 & log2FoldChange>0)) down <- row.names(subset(df.res.neg, padj<=0.05 & log2FoldChange<0)) # Set the universe as all the rows where padj is not missing universe <- row.names(subset(df.res.neg, is.na(padj)==FALSE)) # Set working directory to where the geneid2go.C.elegans.filter.topgo.txt file is located setwd("/courses/bi345/sdivit25/FinalProjectWorking/EnrichmentAnalyses") # Read in the "geneid2go.C.elegans.filter.topgo.txt" file and store it in a data frame "raw.map" raw.map <- read.table("geneid2go.C.elegans.filter.topgo.txt", h=F) # Set the column names of "raw.map" as "gene_id" and "go_id" colnames(raw.map) <- c("gene_id","go_id") # Create a new data frame "fixed.map" by grouping the values in the "go_id" column by "gene_id" and converting them into a character vector fixed.map <- by(raw.map$go_id, raw.map$gene_id, function(x) as.character(x)) # Print the first few elements of "fixed.map" str(head(fixed.map)) # Subset "df.res.neg" to include only rows where padj is not missing and select the 6th column (which contains the gene IDs) as the gene list geneList <- subset(df.res.neg, is.na(padj)==FALSE)[,6] # Set the names of the "geneList" as the values in "universe" names(geneList) <- universe # Print the structure of "geneList" str(geneList) # Define a function "diffGene" that returns a logical vector indicating whether the input value is less than 0.05 diffGene <- function(allScore) { return(allScore < 0.05) } # Apply the "diffGene" function to "geneList" and create a table of the results table(diffGene(geneList)) # Create a new topGOdata object "BPdata" for biological process (BP) ontology analysis, using the "geneList" and "fixed.map" as input BPdata <- new("topGOdata", description="GO analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=diffGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10) BPdata # Create a new classicCount object "fisher.stat" using the GOFisherTest as the test statistic fisher.stat <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test") # Perform the enrichment analysis using "BPdata" and "fisher.stat" resultFisher <- getSigGroups(BPdata, fisher.stat) resultFisher # Define and execute the Fisher's exact test using the 'elim' method to obtain significantly enriched Gene Ontology (GO) terms for biological processes (BP) elim.stat <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)") resultElim <- getSigGroups(BPdata, elim.stat) resultElim # Define and execute the Fisher's exact test using the 'weight' method to obtain significantly enriched GO terms for BP weight.stat <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)") resultWeight <- getSigGroups(BPdata, weight.stat) resultWeight # Define and execute the Fisher's exact test using the 'mix' method to obtain significantly enriched GO terms for BP mix.stat <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)") resultMix <- getSigGroups(BPdata, mix.stat) resultMix # Define and execute the Fisher's exact test using the 'parent-child' method to obtain significantly enriched GO terms for BP parent.stat <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)") resultParent <- getSigGroups(BPdata, parent.stat) resultParent # Generate a table of enriched GO terms using different methods and order them by the 'mix' method countRes <- GenTable(BPdata, mix=resultMix, classic=resultFisher, elim=resultElim, weight=resultWeight, parent=resultParent, orderBy="mix", ranksOf="mix", topNodes=20) countRes # Extract the GO term IDs from the table which.terms <- countRes[,1] # Get the genes associated with each significant GO term genesInTerm(BPdata, which.terms) # Show the most significant GO terms using the Fisher's exact test and display all the available information showSigOfNodes(BPdata, score(resultFisher), firstSigNodes=3, useInfo='all') # Define and execute the Kolmogorov-Smirnov (KS) test using the 'elim' method to obtain significantly enriched GO terms for BP elim.score <- new("elimScore", testStatistic=GOKSTest, name="KS test") resultKS <- getSigGroups(BPdata, elim.score) resultKS # Generating tables using different statistical tests # Generate a table using the KS test with mix statistic as input and top 26 nodes as output GenTable(BPdata, mixKS=resultKS, topNodes=26) # Generate a table using the classic Fisher test with top 86 nodes as output clasRes <- GenTable(BPdata, classic=resultFisher, topNodes=86) count.a <- clasRes[,c(1,6)] # Generate a table using the Fisher test with elimination procedure with top 21 nodes as output elimRes <- GenTable(BPdata, elim=resultElim, topNodes=21) count.b <- elimRes[,c(1,6)] # Generate a table using the Fisher test with weighting procedure with top 19 nodes as output weigRes <- GenTable(BPdata, weight=resultWeight, topNodes=19) count.c <- weigRes[,c(1,6)] # Generate a table using the Fisher test with parent-child method with top 20 nodes as output pareRes <- GenTable(BPdata, parent=resultParent, topNodes=20) count.d <- pareRes[,c(1,6)] # Generate a table using the Fisher test with a mixture of both weighting and elimination procedures with top 58 nodes as output mixRes <- GenTable(BPdata, mix=resultMix, topNodes=58) count.e <- mixRes[,c(1,6)] # Merge all significant results temp.a <- merge(count.a, count.b, by="GO.ID", all.x=T, all.y=T) temp.b <- merge(temp.a, count.c, by="GO.ID", all.x=T, all.y=T) temp.a <- merge(temp.b, count.d, by="GO.ID", all.x=T, all.y=T) all.sig <- merge(temp.a, count.e, by="GO.ID", all.x=T, all.y=T) # Remove temp.a and temp.b rm(temp.a, temp.b) # Create a table that shows the number of rows that have at least one non-NA value in columns 2 to 6 table(rowSums(!(is.na(all.sig[,2:6])))) # Create a logical vector "keep" that is TRUE for rows that have less than 2 non-NA values in columns 2 to 6 keep <- rowSums(!(is.na(all.sig[,2:6]))) <2 # Subset the "all.sig" table to keep only rows that satisfy the condition in "keep" vector all.sig[keep,] ``` This next code is going to perform GO analysis on the set of upregulated genes. It is an extension of the previous code. Instead of diffGene, it defines a function called upGene to determine which genes in a specified set are upregulated, and then generates a table of counts for the upregulated gene set. It then creates a topGOdata object using the upregulated gene set and performs different tests to identify significant Gene Ontology (GO) terms associated with the upregulated genes. This additional code includes a number of functions that are not present in the previous code including upGene, new, getSigGroups, GenTable, genesInTerm, showSigOfNodes, merge, and rowSums. ``` # Upregulated genes # Define function upGene to return genes in a specific set, 'up' upGene <- function(allScore) { return(names(geneList) %in% up) } # Generate table of counts for the up-regulated gene set table(upGene(geneList)) # Create a new topGOdata object with the up-regulated gene set BPdataUp <- new("topGOdata", description="GO analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=upGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10) BPdataUp # Define the test statistic and create a new object of the classicCount class for Fisher test fisher.statUp <- new("classicCount", testStatistic=GOFisherTest, name="Fisher test") # Get the significant groups using the classicCount object with Fisher test as the test statistic resultFisherUp <- getSigGroups(BPdataUp, fisher.statUp) resultFisherUp # Define the test statistic and create a new object of the elimCount class for Fisher (elim) test elim.statUp <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)") # Get the significant groups using the elimCount object with Fisher (elim) test as the test statistic resultElimUp <- getSigGroups(BPdataUp, elim.statUp) resultElimUp # Define the test statistic and create a new object of the weightCount class for Fisher (weight) test weight.statUp <- new("weightCount", testStatistic=GOFisherTest, name="Fisher (weight)") # Get the significant groups using the weightCount object with Fisher (weight) test as the test statistic resultWeightUp <- getSigGroups(BPdataUp, weight.statUp) resultWeightUp # Define the test statistic and create a new object of the weight01Count class for Fisher (mix) test mix.statUp <- new("weight01Count", testStatistic=GOFisherTest, name="Fisher (mix)") # Get the significant groups using the weight01Count object with Fisher (mix) test as the test statistic resultMixUp <- getSigGroups(BPdataUp, mix.statUp) resultMixUp # Define the test statistic and create a new object of the parentChild class for Fisher (parent-child) test parent.statUp <- new("parentChild", testStatistic=GOFisherTest, name="Fisher (parent-child)") # Get the significant groups using the parentChild object with Fisher (parent-child) test as the test statistic resultParentUp <- getSigGroups(BPdataUp, parent.statUp) resultParentUp # Generate a table with the results from all five tests ordered by the results of Fisher (mix) test countResUp <- GenTable(BPdataUp, mix=resultMixUp, classic=resultFisherUp, elim=resultElimUp, weight=resultWeightUp, parent=resultParentUp, orderBy="mix", ranksOf="mix", topNodes=20) countResUp # Extract significant terms and their corresponding genes which.termsUp <- countRes[,1] genesInTerm(BPdataUp, which.termsUp) # Show the top significant nodes based on the Fisher test showSigOfNodes(BPdataUp, score(resultFisherUp), firstSigNodes=3, useInfo='all') # Use the KS test to identify significant terms elim.scoreUp <- new("elimScore", testStatistic=GOKSTest, name="KS test") resultKSUp <- getSigGroups(BPdataUp, elim.score) resultKSUp GenTable(BPdataUp, mixKS=resultKSUp, topNodes=26) # Generate tables of significant terms using each method individually clasResUp <- GenTable(BPdataUp, classic=resultFisherUp, topNodes=86) count.aUp <- clasResUp[,c(1,6)] elimResUp <- GenTable(BPdataUp, elim=resultElimUp, topNodes=21) count.bUp <- elimResUp[,c(1,6)] weigResUp <- GenTable(BPdataUp, weight=resultWeightUp, topNodes=19) count.cUp <- weigResUp[,c(1,6)] pareResUp <- GenTable(BPdataUp, parent=resultParentUp, topNodes=20) count.dUp <- pareResUp[,c(1,6)] mixResUp <- GenTable(BPdataUp, mix=resultMixUp, topNodes=58) count.eUp <- mixResUp[,c(1,6)] # Merge all significant results temp.aUp <- merge(count.aUp, count.bUp, by="GO.ID", all.x=T, all.y=T) temp.bUp <- merge(temp.aUp, count.cUp, by="GO.ID", all.x=T, all.y=T) temp.aUp <- merge(temp.bUp, count.dUp, by="GO.ID", all.x=T, all.y=T) all.sigUp <- merge(temp.aUp, count.eUp, by="GO.ID", all.x=T, all.y=T) # Remove temp.aUp and temp.bUp rm(temp.aUp, temp.bUp) # Create a table that shows the number of rows that have at least one non-NA value in columns 2 to 6 table(rowSums(!(is.na(all.sigUp[,2:6])))) # Create a logical vector "keep" that is TRUE for rows that have less than 2 non-NA values in columns 2 to 6 keepUp <- rowSums(!(is.na(all.sigUp[,2:6]))) <2 # Subset the "all.sigUp" table to keep only rows that satisfy the condition in "keep" vector all.sigUp[keepUp,] ``` To perform GO analysis on a set of downregulated genes, use the same code except substitute "Up" for "Down". To visualize results, use the following code: ``` # Visualizing data # Create topGOdata objects for BP analysis of up-regulated genes and down-regulated genes, respectively BPup <- new("topGOdata", description="GO BP analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=upGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10) BPdown <- new("topGOdata", description="GO BP analysis of All Genes", ontology="BP", allGenes=geneList , geneSel=downGene, annot = annFUN.gene2GO, gene2GO=fixed.map, nodeSize=10) # Create an elimCount object with Fisher test statistic, name "Fisher (elim)", and cutoff value of 0.05 elim.stat.05 <- new("elimCount", testStatistic=GOFisherTest, name="Fisher (elim)", cutOff=0.05) # perform topGO analysis using elimCount with up-regulated genes and down-regulated genes, respectively, to get significant GO groups resultElimup <- getSigGroups(BPup, elim.stat.05) resultElimup resultElimdown <- getSigGroups(BPdown, elim.stat.05) resultElimdown # Generate tables of significant GO groups with P-values and number of genes annotated to each group for up-regulated genes and down-regulated genes, respectively # TopNodes and numChar are parameters that control the number of top GO groups displayed and the number of characters used for displaying GO terms GenTable(BPup, Pvalue=resultElimup, topNodes=50) tab4 <- GenTable(BPup, Pvalue=resultElimup, topNodes=36, numChar=99) GenTable(BPdown, Pvalue=resultElimdown, topNodes=50) tab5 <- GenTable(BPdown, Pvalue=resultElimdown, topNodes=19, numChar=99) # Write the tables to files in tab-delimited format without row or column names, respectively write.table(tab4, file="elim.upreg_ControlVsHighTemp.topGO.txt", quote=F, sep="\t", row.names=T, col.names=T) write.table(tab5, file="elim.downreg_ControlVsHighTemp.topGO.txt", quote=F, sep="\t", row.names=T, col.names=T) # Load ggplot2 and scales libraries library(ggplot2) library(scales) # Convert P-value columns to numeric and combine the two tables into a single dataframe tab4$Pvalue <- as.numeric(tab4$Pvalue) tab5$Pvalue <- as.numeric(tab5$Pvalue) goVis <- rbind(tab4[,c(1,2,6)], tab5[,c(1,2,6)]) # set the order of GO terms to be displayed in the plot and generate a ggplot object goVis$Term <- factor(goVis$Term, levels=rev(goVis$Term)) ggplot(goVis, aes(x=Term, y=-log10(Pvalue), fill=-log10(Pvalue))) + expand_limits(y=1) + geom_point(shape=21, size=3) + scale_fill_continuous(low='blue', high='red4') + xlab('') + ylab('Enrichment score') + labs(title='GO Biological Process') + theme_bw() + coord_flip() ``` Be sure to change the name of the file you're writing based on the enrichment analysis you're doing! To visualize as bubble plots, use the following code: ``` gofigure -i /courses/bi345/sdivit25/FinalProjectWorking/EnrichmentAnalyses/elim.upreg_$1.topGO.txt -o elim_up_$1/ -j topgo$ gofigure -i /courses/bi345/sdivit25/FinalProjectWorking/EnrichmentAnalyses/elim.downreg_$1.topGO.txt -o elim_down_$1/ -j t$ ``` Be ware I ran these commands in a new directory titled "BubblePlots" in order to contain all data in one place. Change $1 with the analysis you wish to visualize, for example, "ControlVsHighAcid". Once you've generated the bubble plots, you're done! Great job.