geneSymbol -> gene ontology

--- title: 'geneSymbol -> gene ontology' disqus: hackmd --- geneSymbol -> gene ontology === ![downloads](https://img.shields.io/github/downloads/atom/atom/total.svg) ![build](https://img.shields.io/appveyor/ci/:user/:repo.svg) ![chat](https://img.shields.io/discord/:serverId.svg) Our selection, codeml, and gene loss results have generated these massive lists of geneSymbols, which are quite difficult to interpret (and quite overwhelming!). For each gene in the results, I am interested in retrieving gene ontology for: 1. geneID/geneSymbol 2. gene description (long form name of the gene) 3. Entrez gene summary 4. GO terms (BP, CC, MF) 5. Disease Ontology (disgenet) 6. Disease Ontology (DOIDs for cancer and aging assoc. degenerative disease) 7. KEGG_pathway 8. Reactome ## Table of Contents [TOC] ### Installation and Dependences We need to create a new conda environment for these gseas and install the biopython library. The .yml conda environment for this pipeline is [located on github](https://github.com/tanyalama/project_bat_longevity/tree/main/lifespan_obj2_comparative/scripts/gsea): ``` conda create --name gsea source activate gsea conda install -c conda-forge biopython ``` ### Step 0. Harvest what you can from DAVID [DAVID](https://david.ncifcrf.gov/tools.jsp) is an easy first step for converting official geneSymbols to geneIDs and harvesting: 1. geneID/geneSymbol :heavy_check_mark: 2. gene description (long form name of the gene) :heavy_check_mark: 3. Entrez gene summary :heavy_check_mark: 4. GO terms (BP, CC, MF) :heavy_check_mark: 5. Disease Ontology (disgenet) :heavy_check_mark: 6. Disease Ontology (DOIDs for cancer and aging assoc. degenerative disease) :exclamation: 7. KEGG_pathway :heavy_check_mark: 8. Reactome :heavy_check_mark: DAVID won't successfully enrich ALL of your genes:. I've included the following scripts which will 1) confirm any information harvested by DAVID and 2) retrieve additional information and parse it. Save the result from DAVID as david.csv and create a separate list of geneSymbols (with a geneSymbols header) called geneSymbols.csv. Parse results using 0_parse_DAVID_results.py ``` #!/usr/bin/env python #0_parse_DAVID_results.py compares the results from DAVID with your list of geneSymbols and merges the two into a single table (merged_table.csv), where information not found by DAVID will be listed as NA. This will help identify the genes that need further enrichment. import pandas as pd # Read geneSymbol table geneSymbol_df = pd.read_csv('geneSymbol.csv') # Read david table david_df = pd.read_csv('david.csv') # Create a dictionary to store the longest matching gene for each gene in geneSymbol gene_mapping = {} # Iterate over geneSymbol table for gene_symbol in geneSymbol_df['geneSymbol']: # Find matching genes in david table matching_genes = david_df[david_df['GeneSymbol'] == gene_symbol] if len(matching_genes) > 0: # Find the longest matching gene longest_gene = matching_genes.loc[matching_genes['GeneSymbol'].str.len().idxmax(), :'KEGG_REACTOME_PATHWAY'].to_dict() gene_mapping[gene_symbol] = longest_gene else: # If no match found, set 'Not Available' for all columns gene_mapping[gene_symbol] = {'GeneSymbol': gene_symbol, 'Gene Name': 'Not Available', 'GOTERM_BP_DIRECT': 'Not Available', 'GOTERM_CC_DIRECT': 'Not Available', 'GOTERM_MF_DIRECT': 'Not Available', 'KEGG_PATHWAY': 'Not Available', 'KEGG_REACTOME_PATHWAY': 'Not Available'} # Create a new DataFrame with merged data merged_df = pd.DataFrame.from_dict(gene_mapping, orient='index') # Reorder the columns merged_df = merged_df[['Gene Name', 'GOTERM_BP_DIRECT', 'GOTERM_CC_DIRECT', 'GOTERM_MF_DIRECT', 'KEGG_PATHWAY', 'KEGG_REACTOME_PATHWAY']] # Combine geneSymbol table with merged data final_df = pd.merge(geneSymbol_df, merged_df, left_on='geneSymbol', right_index=True, how='left') # Save the final result to a new CSV file final_df.to_csv('merged_table.csv', index=False) ``` ### Step 1. Retrieving gene descriptions and summaries Let's hop on an interactive session to get this running ``` srun --pty -t 120:00 --mem=8G -p cpu bash ``` Use 1_fetch_gene_info.py to retrieve the gene descriptions and summaries: ``` #!/usr/bin/env python #1_fetch_gene_info.py retrieves gene descriptions and summaries using the Entrez API. Outputs a table called gene_info.csv with parsed data. from Bio import Entrez import csv import xml.etree.ElementTree as ET def fetch_gene_info(symbol): Entrez.email = 'tlama@smith.edu' # Enter your email address handle = Entrez.esearch(db='gene', term=symbol) record = Entrez.read(handle) if record['IdList']: gene_id = record['IdList'][0] handle = Entrez.esummary(db='gene', id=gene_id) xml_data = handle.read() root = ET.fromstring(xml_data) summary = root.find(".//Summary") gene_description = root.find(".//Description") gene_summary = summary.text if summary is not None else 'Not available' if gene_description is not None: gene_description = gene_description.text else: gene_description = 'Not available' else: gene_description = 'Not available' gene_summary = 'Not available' return gene_description, gene_summary def main(): gene_symbols = [] with open('genes.txt', 'r') as file: gene_symbols = [line.strip() for line in file] gene_info_table = [] for symbol in gene_symbols: description, summary = fetch_gene_info(symbol) gene_info_table.append({'Symbol': symbol, 'Description': description, 'Summary': summary}) # Write the gene info table to a CSV file with open('gene_info.csv', 'w', newline='') as csvfile: fieldnames = ['Symbol', 'Description', 'Summary'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() for gene in gene_info_table: writer.writerow(gene) print("Gene info table has been written to gene_info.csv.") if __name__ == '__main__': main() ``` ### Step 2. Retrieving disease ontology (disease - gene associations) with DisGeNET You will need to create an account and password to use the disgenet API, then execute 2_fetch_disease_ontology_disgenet.py ``` #!/usr/bin/env python #2_fetch_disease_ontology_disgenet.py retrieves gene/disease associations from the disgenet API and outputs a parsed table called gene_disease_associations.csv. Note that this function requires an account and password on disgenet. import csv import requests def fetch_disease_associations(gene_symbol): # Set your DisGeNET account credentials email = "tlama@smith.edu" password = "JX4MN5d7@QctwHr" # Authenticate and obtain an API key auth_params = {"email": email, "password": password} api_host = "https://www.disgenet.org/api" api_key = None session = requests.Session() try: response = session.post(api_host + '/auth/', data=auth_params) if response.status_code == 200: json_response = response.json() api_key = json_response.get("token") else: print("Authentication failed.") except requests.exceptions.RequestException as req_ex: print(req_ex) print("Something went wrong with the request.") if api_key: session.headers.update({"Authorization": "Bearer %s" % api_key}) # Retrieve disease associations for the gene symbol response = session.get(api_host + f'/gda/gene/{gene_symbol}') if response.status_code == 200: associations_list = response.json() if associations_list and isinstance(associations_list, list): # Extract only the 'geneid' and 'disease_name' fields gene_diseases = [{'geneid': assoc['geneid'], 'disease_name': assoc['disease_name']} for assoc in associations_list] return gene_diseases else: print("Failed to retrieve disease associations.") if session: session.close() return [] def main(): gene_symbols = [] with open('geneSymbol.csv', 'r') as file: gene_symbols = [line.strip() for line in file] gene_disease_table = [] for symbol in gene_symbols: diseases = fetch_disease_associations(symbol) gene_disease_table.append({'Symbol': symbol, 'Diseases': diseases}) # Write the gene-disease table to a CSV file with open('gene_disease_associations.csv', 'w', newline='') as csvfile: fieldnames = ['Symbol', 'Diseases'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() for gene in gene_disease_table: writer.writerow(gene) print("Gene-disease associations have been written to gene_disease_associations.csv.") if __name__ == '__main__': main() ``` ### Step 3. Parse disgenet results Step 2 generates an output from disgenet for each of your genes, however much of the information is repetitive (e.g., 'geneid' is listed over and over again within a single result). 3_format_disease_ontology_disgenet.py produces a simplified format: ``` #!/usr/bin/env python #3_format_disease_ontology_disgenet.py reduces this repetitive data in gene_disease_associations.csv and parses a usable and simplified format. import csv import ast import sys # Increase the field size limit csv.field_size_limit(sys.maxsize) def main(): gene_disease_dict = {} with open('gene_disease_associations.csv', 'r') as file: reader = csv.DictReader(file) for row in reader: gene_symbol = row['Symbol'] diseases_str = row['Diseases'] diseases_list = ast.literal_eval(diseases_str) disease_names = [disease['disease_name'] for disease in diseases_list] gene_id = diseases_list[0]['geneid'] if diseases_list else '' if gene_symbol in gene_disease_dict: gene_disease_dict[gene_symbol]['disease_names'].extend(disease_names) else: gene_disease_dict[gene_symbol] = {'gene_id': gene_id, 'disease_names': disease_names} # Write the transformed data to a new CSV file with open('disgenet_transformed.csv', 'w', newline='') as file: writer = csv.writer(file) writer.writerow(['geneSymbol', 'geneid', 'disease_name']) # Write the header for gene_symbol, data in gene_disease_dict.items(): gene_id = data['gene_id'] disease_names = '; '.join(data['disease_names']) writer.writerow([gene_symbol, gene_id, disease_names]) if __name__ == '__main__': main() ``` ### Step 4. Retrieve cancer-specific disease ontologies by DOID 4_fetch_cancer_DO.py uses the Disease Ontology API to search for gene-disease associations specific to a disease ontology ID (e.g., DOID:162 which is "cancer"). ``` #!/usr/bin/env python #4_fetch_cancer_DO.py uses the Disease Ontology API to search for gene-disease associations specific to a disease ontology ID (e.g., DOID:162 which is "cancer"). Outputs a table called doid_cancer.csv import csv import requests from pydantic import BaseModel class Gene: def __init__(self, symbol): self.symbol = symbol class DiseaseOntologyTerm(BaseModel): id: str name: str definition: str def get_cancer_disease_ontology_terms(gene_symbol): url = f"http://www.disease-ontology.org/api/metadata/DOID:162/genes/{gene_symbol}" response = requests.get(url) data = response.json() disease_terms = [] for term_data in data["terms"]: disease_term = DiseaseOntologyTerm( id=term_data["id"], name=term_data["name"], definition=term_data["definition"], ) disease_terms.append(disease_term) return disease_terms def main(): genes = [] with open("ten.txt", "r") as file: for line in file: symbol = line.strip() gene = Gene(symbol) genes.append(gene) results = [] for gene in genes: disease_terms = get_cancer_disease_ontology_terms(gene.symbol) disease_names = "; ".join(term.name for term in disease_terms) result = [gene.symbol, disease_names] results.append(result) with open("doid_cancer.csv", "w", newline="") as file: writer = csv.writer(file) writer.writerow(["geneSymbol", "disease_name"]) writer.writerows(results) if __name__ == "__main__": main() ``` ### Step 5. Retrieve disease ontology using DOSE DOSE is a great R-package alternative to the above disease ontology methods. DOSE also has specific query sets for cancer and other "major" diseases. However, the documentation is terrible. I've provided here a brief script for basic implementation of enrichDO ``` # Install DOSE if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") library(BiocManager) BiocManager::install("DOSE") BiocManager::install("org.Hs.eg.db") browseVignettes("DOSE") # Assign ENTREZ gene ids -- note that geneSymbol will NOT work here. genes<- c(7133,2689,59340,3303,1376,11132) yy <- DOSE::enrichDO(genes, ont="DO", pvalueCutoff = 0.05) , pAdjustMethod = "BH", minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2, readable = FALSE ) yy <- DOSE::enrichDO( gene, ont = "DO", pvalueCutoff = 0.05, pAdjustMethod = "BH", minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2, readable = TRUE ) write.table(yy, file="/Volumes/GoogleDrive/My Drive/project_bat1k_longevity/scripts_longevity/gsea/dose.csv") ``` ## Appendix and FAQ :::info **Find this document incomplete?** Leave a comment! ::: ###### tags: `Templates` `Documentation`