--- tags: GeneLab title: Revisiting GL annotation stuff --- # Revisiting GL annotation stuff [toc] ## STRINGdb ### Summary --- > **Summary** > - Using either SYMBOL or ENSEMBL returns multiple STRING_id hits for some, this isn't an ENSEMBL-mapping problem, it's inherent to the STRINGdb. > - Using ENSEMBL does not return the wrong STRING_ids, for those that are shared, it returns the same > - e.g., see genes ENSMUSG00000000037 and ENSMUSG00000000056 in the `head(string_map_symbol)` and `head(string_map_ensembl)` outputs below in the "Running with GLDS-288" section > - We cannot appropriately use SYMBOL to map back because they are not unique, we lose the information to be able to tie back to each individual gene, e.g.: > - gene-A has SYMBOL-Y > - gene-B has SYMBOL-Y > - we get STRING_ids for SYMBOL-Y, then map back based on SYMBOL, and neither gene-A nor gene-B gets the annotation when we merge the tables (e.g., see genes ENSMUSG00000038729 and ENSMUSG00000090053 in the "Short run" section below) > - Currently, the deduplication step causes us to lose a ton of STRING_id annotations erroneously > - this is demonstrated in the "Short run" section below, and then tested out on the output diff. expr. table for GLDS-288 in the "Running with GLDS-288 IDs" section below, where ~20,000 of the genes have NA for STRING_id in the GLDS output table while only ~6,000 should > > - STRING_ids overall more successfullly map when using ENSEMBL > - Using SYMBOL > - 70% mapped of GLDS-288 > - Using ENSEMBL > - 72% mapped of GLDS-288 > - Fewer multiple-STRING_ids show up when using ENSEMBL > - Using SYMBOL > - 122 total multiple STRING_ids returned, spread across 57 unique SYMBOLs (though some of these may be combined genes) > - Using ENSEMBL > - 65 total multiple STRING_ids returned, spread across 32 unique ENSEMBL IDs (which we know are all unique) > > **As shown on the page below:** > 1. We need to tie the mapping to our unique ENSEMBL IDs in order to not lose individual gene information that is collapsed by the non-unique SYMBOLs - which is also causing a problem when merging (demonstrated in the "Short run" and by all the lost STRING_ids in the GLDS-288 example below). > 2. We should report multiple STRING_ids when they exist for a single gene entry. If not, we need to choose one, and there is no criteria in place or available to do that. > 3. We can map based on both ENSEMBL IDs *and* SYMBOLs, and we recover a tiny bit more string annotations that way, but we appropriately still keep the gene-level relationships because we are still using the unique ENSEMBL IDs. (E.g., ENSEMBL alone for GLDS-288 we have 5,841 NAs for STRING_id, with SYMBOLs too we have 5,731). This is the approach I think we should use, and is done in the last section ("STRINGdb with ENSEMBL and SYMBOL and dealing with multiple STRING_ids") 👍 > --- Work was/can be done anywhere on N288 in the following conda environment. No files are needed that aren't downloaded with commands below, and nothing is written out. So should be quick/easy to recreate and look at things via copy/paste/modify if wanted. Everything is virtually instant except for a few commands that may take about a minute. ```bash conda activate /global/smf/miniconda38_admin/envs/RNAseq_Rtools_03-2022 ``` ### Short run showing current problems ```r library(tidyverse) library(STRINGdb) # for adding String database annotations library(PANTHER.db) # for adding GOSLIM annotations options(timeout=600) # reading in gene IDs unique_gene_IDs <- c("ENSMUSG00000038729", "ENSMUSG00000099619", "ENSMUSG00000000031", "ENSMUSG00000090053", "ENSMUSG00000068073", "ENSMUSG00000085894") # reading in organisms.csv file org_link <- "https://raw.githubusercontent.com/asaravia-butler/GeneLab_Data_Processing/master/RNAseq/organisms.csv" organism_table <- read.csv(org_link) # setting organism for this example target_organism <- "MOUSE" # getting organism taxid target_taxid <- organism_table %>% filter(name == target_organism) %>% pull(taxon) # building annotation table ann.dbi <- organism_table %>% filter(name == target_organism) %>% pull(annotations) # loading organism db package (and installing if needed) if( ! require(ann.dbi, character.only = TRUE) ) { BiocManager::install(ann.dbi, ask = FALSE) library(ann.dbi, character.only = TRUE) } keytype <- "ENSEMBL" annot <- data.frame(unique_gene_IDs, stringsAsFactors = FALSE) colnames(annot)[1] <- keytype # adding gene symbols if available if ( "SYMBOL" %in% columns(eval(parse(text = ann.dbi), env=.GlobalEnv)) ) { annot$SYMBOL <- mapIds(eval(parse(text = ann.dbi), env=.GlobalEnv), keys = unique_gene_IDs, keytype = keytype, column = "SYMBOL", multiVals = "first") } annot # ENSEMBL SYMBOL # 1 ENSMUSG00000038729 Pakap # 2 ENSMUSG00000099619 Gm28091 # 3 ENSMUSG00000000031 H19 # 4 ENSMUSG00000090053 Pakap # 5 ENSMUSG00000068073 1110025L11Rik # 6 ENSMUSG00000085894 <NA> # adding STRINGdb annotations string_db <- STRINGdb$new(version = "11", species = target_taxid, score_threshold = 0) string_map_symbol <- string_db$map(annot, "SYMBOL", removeUnmappedRows = FALSE, takeFirst = TRUE) # Warning: we couldn't map to STRING 28% of your identifiers dim(string_map_symbol) # 7 3 # this introduced one duplicate row because it has multiple STRING_id hits dim(annot) # 6 2 string_map_symbol # SYMBOL ENSEMBL STRING_id # 1 PAKAP ENSMUSG00000038729 10090.ENSMUSP00000117466 # 6 GM28091 ENSMUSG00000099619 <NA> # 3 H19 ENSMUSG00000000031 10090.ENSMUSP00000028386 # 2 PAKAP ENSMUSG00000090053 10090.ENSMUSP00000117466 # 4 1110025L11RIK ENSMUSG00000068073 10090.ENSMUSP00000086506 # 5 1110025L11RIK ENSMUSG00000068073 10090.ENSMUSP00000086505 # 7 <NA> ENSMUSG00000085894 <NA> ``` --- > **As seen in the table just above, genes ENSMUSG00000038729 and ENSMUSG00000090053 should both have STRING_id 10090.ENSMUSP00000117466** --- ```r ## deduplication step breaks things string_map_symbol_no_dups <- string_map_symbol[!duplicated(string_map_symbol$SYMBOL), ] # that threw away this gene: ENSMUSG00000090053 string_map_symbol_no_dups # SYMBOL ENSEMBL STRING_id # 1 PAKAP ENSMUSG00000038729 10090.ENSMUSP00000117466 # 6 GM28091 ENSMUSG00000099619 <NA> # 3 H19 ENSMUSG00000000031 10090.ENSMUSP00000028386 # 4 1110025L11RIK ENSMUSG00000068073 10090.ENSMUSP00000086506 # 7 <NA> ENSMUSG00000085894 <NA> # merging annot_with_STRINGdb_after_dedup <- dplyr::left_join(annot, string_map_symbol_no_dups[, c("SYMBOL", "STRING_id")], by = "SYMBOL") # neither of the the original two genes mentioned above (that shared a SYMBOL) # got their appropriate STRING_id annotation annot_with_STRINGdb_after_dedup # ENSEMBL SYMBOL STRING_id # 1 ENSMUSG00000038729 Pakap <NA> # 2 ENSMUSG00000099619 Gm28091 <NA> # 3 ENSMUSG00000000031 H19 10090.ENSMUSP00000028386 # 4 ENSMUSG00000090053 Pakap <NA> # 5 ENSMUSG00000068073 1110025L11Rik <NA> # 6 ENSMUSG00000085894 <NA> <NA> ``` --- > - Neither of the two genes mentioned above (that shared a SYMBOL) have their appropriate STRING_id, they both have NAs. > - Checking one output from GL, ("[GLDS-288_rna_seq_differential_expression.csv](https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-288_rna_seq_differential_expression.csv?version=3)"), these same two genes (ENSMUSG00000038729 and ENSMUSG00000090053) both have the SYMBOL in the table, but neither has the STRING_id. > - SYMBOLs that mapped to more than one STRING_id also return nothing in the merged table. This is the case for ENSMUSG00000068073, which we can see has two rows in the `string_map_symbol` table above. Neither of those STRING_ids were assigned to it in the output merged table. --- ### Running with GLDS-288 IDs ```r # clearing out things from above rm(list = ls()) # getting ENSEMBL IDs from GLDS-288 output link <- "https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-288_rna_seq_differential_expression.csv?version=3" unique_gene_IDs <- read.csv(link) %>% pull(ENSEMBL) length(unique_gene_IDs) # 20724 # reading in organisms.csv file org_link <- "https://raw.githubusercontent.com/asaravia-butler/GeneLab_Data_Processing/master/RNAseq/organisms.csv" organism_table <- read.csv(org_link) # setting organism for this example target_organism <- "MOUSE" # getting organism taxid target_taxid <- organism_table %>% filter(name == target_organism) %>% pull(taxon) # building annotation table ann.dbi <- organism_table %>% filter(name == target_organism) %>% pull(annotations) # loading organism db package (and installing if needed) if( ! require(ann.dbi, character.only = TRUE) ) { BiocManager::install(ann.dbi, ask = FALSE) library(ann.dbi, character.only = TRUE) } keytype <- "ENSEMBL" annot <- data.frame(unique_gene_IDs, stringsAsFactors = FALSE) colnames(annot)[1] <- keytype # adding gene symbols if available if ( "SYMBOL" %in% columns(eval(parse(text = ann.dbi), env=.GlobalEnv)) ) { annot$SYMBOL <- mapIds(eval(parse(text = ann.dbi), env=.GlobalEnv), keys = unique_gene_IDs, keytype = keytype, column = "SYMBOL", multiVals = "first") } dim(annot) # 20724 2 head(annot) # ENSEMBL SYMBOL # 1 ENSMUSG00000000001 Gnai3 # 2 ENSMUSG00000000028 Cdc45 # 3 ENSMUSG00000000031 H19 # 4 ENSMUSG00000000037 Scml2 # 5 ENSMUSG00000000049 Apoh # 6 ENSMUSG00000000056 Narf ``` #### STRINGdb with SYMBOL ```r string_db <- STRINGdb$new(version = "11", species = target_taxid, score_threshold = 0) string_map_symbol <- string_db$map(annot, "SYMBOL", removeUnmappedRows = FALSE, takeFirst = TRUE) # Warning: we couldn't map to STRING 30% of your identifiers dim(string_map_symbol) # 20762 3 # the stringdb output has more rows, because of those cases where a gene maps # to multiple STRING_ids head(string_map_symbol) # SYMBOL ENSEMBL STRING_id # 1 GNAI3 ENSMUSG00000000001 10090.ENSMUSP00000000001 # 2 CDC45 ENSMUSG00000000028 10090.ENSMUSP00000000028 # 3 H19 ENSMUSG00000000031 10090.ENSMUSP00000028386 # 4 SCML2 ENSMUSG00000000037 10090.ENSMUSP00000076593 # 5 APOH ENSMUSG00000000049 10090.ENSMUSP00000000049 # 6 NARF ENSMUSG00000000056 10090.ENSMUSP00000099304 sum(is.na(string_map_symbol$STRING_id)) # 6273 # there are currently ~6,000 NAs for STRING_id in there ## checking out multiple STRING_ids symbols_with_multiple_STRING_ids <- data.frame(table(string_map_symbol$SYMBOL)) %>% filter(Freq > 1) %>% pull(Var1) %>% as.character length(symbols_with_multiple_STRING_ids) # 57 unique symbols with multiple STRING_ids symbols_total_multiple_STRING_ids <- data.frame(table(string_map_symbol$SYMBOL)) %>% filter(Freq > 1) %>% pull(Freq) %>% sum() symbols_total_multiple_STRING_ids # 122 total multiple STRING_ids ## current way of combining string_map_symbol_no_dupes <- string_map_symbol[!duplicated(string_map_symbol$SYMBOL), ] dim(string_map_symbol_no_dupes) # 16375 3 # this deduplication step just takes the first STRING_id that shows up and # throws the rest away. there is nothing special about the order they show # up, this is basically random annot_combined_with_string_map_symbol_no_dupes <- dplyr::left_join(annot, string_map_symbol_no_dupes[, c("SYMBOL", "STRING_id")], by = "SYMBOL") dim(annot_combined_with_string_map_symbol_no_dupes) # 20724 3 head(annot_combined_with_string_map_symbol_no_dupes) # ENSEMBL SYMBOL STRING_id # 1 ENSMUSG00000000001 Gnai3 <NA> # 2 ENSMUSG00000000028 Cdc45 <NA> # 3 ENSMUSG00000000031 H19 10090.ENSMUSP00000028386 # 4 ENSMUSG00000000037 Scml2 <NA> # 5 ENSMUSG00000000049 Apoh <NA> # 6 ENSMUSG00000000056 Narf <NA> # we can see compared to the 'string_map_symbol' table just above, NAs were introduced # probably because of the case difference in SYMBOLs, but # either way, they are not unique and not as systematic as the ENSEMBL IDs sum(is.na(annot_combined_with_string_map_symbol_no_dupes$STRING_id)) # 20640 are NA, when in the initial string map object only ~6,000 were sum(is.na(string_map_symbol$STRING_id)) # 6273 # (the current GLDS-288...expression.csv has 20627 NA) ## joining without removing duplicates is wonky, as expected annot_combined_with_string_map_symbol_with_dupes <- dplyr::left_join(annot, string_map_symbol[, c("SYMBOL", "STRING_id")], by = "SYMBOL") dim(annot_combined_with_string_map_symbol_with_dupes) # 18704735 3 # ^way too many rows ## trying joining on ENSEMBL annot_combined_with_string_map_symbol_by_ensembl <- dplyr::left_join(annot, string_map_symbol[, c("ENSEMBL", "STRING_id")], by = "ENSEMBL") dim(annot_combined_with_string_map_symbol_by_ensembl) # 20762 3 # that also introduced duplicates because of the ones with multiple STRING_ids # so it's a separate problem, but we need to combine the multiple STRING_ids # together for those that have them # the next section addresses that ``` #### Dealing with multiple STRING_ids ```r # we need to build this table with ENSEMBL IDs because they are unique and # we have one for each row, matching our starting gene table tab_with_multiple_STRINGids_combined <- data.frame(row.names = annot$ENSEMBL, stringsAsFactors = FALSE) # iterating through our unique gene IDs (takes about a minute) for ( curr_gene_ID in row.names(tab_with_multiple_STRINGids_combined) ) { # getting current gene's STRING_ids and turning into # combined a vector curr_STRING_ids <- string_map_symbol %>% filter(ENSEMBL == curr_gene_ID) %>% pull(STRING_id) %>% paste(collapse = "|") # adding to table tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids } tab_with_multiple_STRINGids_combined <- tab_with_multiple_STRINGids_combined %>% rownames_to_column("ENSEMBL") head(tab_with_multiple_STRINGids_combined) # ENSEMBL STRING_id # 1 ENSMUSG00000000001 10090.ENSMUSP00000000001 # 2 ENSMUSG00000000028 10090.ENSMUSP00000000028 # 3 ENSMUSG00000000031 10090.ENSMUSP00000028386 # 4 ENSMUSG00000000037 10090.ENSMUSP00000076593 # 5 ENSMUSG00000000049 10090.ENSMUSP00000000049 # 6 ENSMUSG00000000056 10090.ENSMUSP00000099304 # combining with primary table annots_with_combined_STRING_ids <- dplyr::left_join(annot, tab_with_multiple_STRINGids_combined, by = "ENSEMBL") head(annots_with_combined_STRING_ids) # ENSEMBL SYMBOL STRING_id # 1 ENSMUSG00000000001 Gnai3 10090.ENSMUSP00000000001 # 2 ENSMUSG00000000028 Cdc45 10090.ENSMUSP00000000028 # 3 ENSMUSG00000000031 H19 10090.ENSMUSP00000028386 # 4 ENSMUSG00000000037 Scml2 10090.ENSMUSP00000076593 # 5 ENSMUSG00000000049 Apoh 10090.ENSMUSP00000000049 # 6 ENSMUSG00000000056 Narf 10090.ENSMUSP00000099304 dim(annots_with_combined_STRING_ids) # 20724 3 # now we still have the proper amount of rows (one per gene, as # those with multiple are combined) annots_with_combined_STRING_ids[grep("\\|", annots_with_combined_STRING_ids$STRING_id), ] %>% head # ENSEMBL SYMBOL STRING_id # 4045 ENSMUSG00000025059 Gk 10090.ENSMUSP00000021114|10090.ENSMUSP00000099984|10090.ENSMUSP00000119564 # 5310 ENSMUSG00000027882 Stxbp3 10090.ENSMUSP00000099681|10090.ENSMUSP00000088014 # 5462 ENSMUSG00000028195 Ccn1 10090.ENSMUSP00000029416|10090.ENSMUSP00000029270|10090.ENSMUSP00000029846 # 5570 ENSMUSG00000028431 Elp1 10090.ENSMUSP00000030140|10090.ENSMUSP00000101595 # 9289 ENSMUSG00000038563 Efl1 10090.ENSMUSP00000029566|10090.ENSMUSP00000046046 # 9369 ENSMUSG00000038782 1700028J19Rik 10090.ENSMUSP00000071992|10090.ENSMUSP00000048665 annots_with_combined_STRING_ids$STRING_id[annots_with_combined_STRING_ids$STRING_id == "NA"] <- NA sum(is.na(annots_with_combined_STRING_ids$STRING_id)) # 6,273 NAs # and instead of ~20,000 unannotated with STRING_ids, there are ~6,000 ``` ### STRINGdb with ENSEMBL and dealing with multiple STRING_ids ```r string_map_ensembl <- string_db$map(annot, "ENSEMBL", removeUnmappedRows = FALSE, takeFirst = TRUE) # Warning: we couldn't map to STRING 28% of your identifiers (vs 30% with SYMBOL mapping) dim(string_map_ensembl) # 20733 3 sum(is.na(string_map_ensembl$STRING_id)) # 5841 # checking multiple STRING_ids ensembls_with_multiple_STRING_ids <- data.frame(table(string_map_ensembl$SYMBOL)) %>% filter(Freq > 1) %>% pull(Var1) %>% as.character length(ensembls_with_multiple_STRING_ids) # 32 unique ensembl IDs with multiple STRING_ids # (vs 57 with SYMBOL mapping) ensembls_total_multiple_STRING_ids <- data.frame(table(string_map_ensembl$SYMBOL)) %>% filter(Freq > 1) %>% pull(Freq) %>% sum() ensembls_total_multiple_STRING_ids # 65 total multiple STRING_ids # (vs 122 with SYMBOL mapping) ## collapsing those with multiple STRING_ids ensembl_tab_with_multiple_STRINGids_combined <- data.frame(row.names = annot$ENSEMBL, stringsAsFactors = FALSE) # (takes about a minute) for ( curr_gene_ID in row.names(ensembl_tab_with_multiple_STRINGids_combined) ) { # getting current gene's STRING_ids curr_STRING_ids <- string_map_ensembl %>% filter(ENSEMBL == curr_gene_ID) %>% pull(STRING_id) %>% paste(collapse = "|") # adding to table ensembl_tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids } ensembl_tab_with_multiple_STRINGids_combined <- ensembl_tab_with_multiple_STRINGids_combined %>% rownames_to_column("ENSEMBL") head(ensembl_tab_with_multiple_STRINGids_combined) # ENSEMBL STRING_id # 1 ENSMUSG00000000001 10090.ENSMUSP00000000001 # 2 ENSMUSG00000000028 10090.ENSMUSP00000000028 # 3 ENSMUSG00000000031 NA # 4 ENSMUSG00000000037 10090.ENSMUSP00000076593 # 5 ENSMUSG00000000049 10090.ENSMUSP00000000049 # 6 ENSMUSG00000000056 10090.ENSMUSP00000099304 # ^ in the first 6, this is missing 1 that SYMBOL mapping # catches, but overall more are caught by ENSEMBL despite the first 6 here # (72% by ensembl vs 70% by SYMBOL) # additionally, i think we can/should do mapping by both # as done in the next section ## combining with primary annots table annots_with_ensembl_combined_STRING_ids <- dplyr::left_join(annot, ensembl_tab_with_multiple_STRINGids_combined, by = "ENSEMBL") head(annots_with_ensembl_combined_STRING_ids) # ENSEMBL SYMBOL STRING_id # 1 ENSMUSG00000000001 Gnai3 10090.ENSMUSP00000000001 # 2 ENSMUSG00000000028 Cdc45 10090.ENSMUSP00000000028 # 3 ENSMUSG00000000031 H19 NA # 4 ENSMUSG00000000037 Scml2 10090.ENSMUSP00000076593 # 5 ENSMUSG00000000049 Apoh 10090.ENSMUSP00000000049 # 6 ENSMUSG00000000056 Narf 10090.ENSMUSP00000099304 # after combining the STRING_ids remain, most were switched # to NAs in the SYMBOL approach above due to case-differences between # Org.db and STRINGdb dim(annots_with_ensembl_combined_STRING_ids) # 20724 3 # now we still have the proper amount, and there instead of ~20,000 unannotated with STRING_ids, there are only annots_with_ensembl_combined_STRING_ids$STRING_id[annots_with_ensembl_combined_STRING_ids$STRING_id == "NA"] <- NA sum(is.na(annots_with_ensembl_combined_STRING_ids$STRING_id)) # now we have only 5,841 NAs ``` ### STRINGdb with ENSEMBL and SYMBOL and dealing with multiple STRING_ids As mentioned above, we can map by both, and still keep the individual gene-level info because we are including ENSEMBL IDs in the mapping. As seen below, this lets us recover a tiny bit more on these GLDS-288 IDs (we end up with 5,731 NAs instead of the 5,841 we just saw above, so 110 fewer in this case). ```r string_map_ensembl_and_symbol <- string_db$map(annot, c("ENSEMBL", "SYMBOL"), removeUnmappedRows = FALSE, takeFirst = TRUE) # Warning: we couldn't map to STRING 27% of your identifiers dim(string_map_ensembl_and_symbol) # 20734 3 head(string_map_ensembl_and_symbol) # ENSEMBL SYMBOL STRING_id # 1 ENSMUSG00000000001 GNAI3 10090.ENSMUSP00000000001 # 2 ENSMUSG00000000028 CDC45 10090.ENSMUSP00000000028 # 3 ENSMUSG00000000037 SCML2 10090.ENSMUSP00000076593 # 4 ENSMUSG00000000049 APOH 10090.ENSMUSP00000000049 # 5 ENSMUSG00000000056 NARF 10090.ENSMUSP00000099304 # 6 ENSMUSG00000000058 CAV2 10090.ENSMUSP00000000058 sum(is.na(string_map_ensembl_and_symbol$STRING_id)) # 5731 ## collapsing multiple STRING_ids E_and_S_produced_tab_with_multiple_STRINGids_combined <- data.frame(row.names = annot$ENSEMBL, stringsAsFactors = FALSE) # (takes about a minute) for ( curr_gene_ID in row.names(E_and_S_produced_tab_with_multiple_STRINGids_combined) ) { # getting current gene's STRING_ids curr_STRING_ids <- string_map_ensembl_and_symbol %>% filter(ENSEMBL == curr_gene_ID) %>% pull(STRING_id) %>% paste(collapse = "|") # adding to table E_and_S_produced_tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids } E_and_S_produced_tab_with_multiple_STRINGids_combined <- E_and_S_produced_tab_with_multiple_STRINGids_combined %>% rownames_to_column("ENSEMBL") head(E_and_S_produced_tab_with_multiple_STRINGids_combined) # ENSEMBL STRING_id # 1 ENSMUSG00000000001 10090.ENSMUSP00000000001 # 2 ENSMUSG00000000028 10090.ENSMUSP00000000028 # 3 ENSMUSG00000000031 10090.ENSMUSP00000028386 # 4 ENSMUSG00000000037 10090.ENSMUSP00000076593 # 5 ENSMUSG00000000049 10090.ENSMUSP00000000049 # 6 ENSMUSG00000000056 10090.ENSMUSP00000099304 ## combining with primary annots table annots_with_E_and_S_combined_STRING_ids <- dplyr::left_join(annot, E_and_S_produced_tab_with_multiple_STRINGids_combined, by = "ENSEMBL") head(annots_with_E_and_S_combined_STRING_ids) # ENSEMBL SYMBOL STRING_id # 1 ENSMUSG00000000001 Gnai3 10090.ENSMUSP00000000001 # 2 ENSMUSG00000000028 Cdc45 10090.ENSMUSP00000000028 # 3 ENSMUSG00000000031 H19 10090.ENSMUSP00000028386 # 4 ENSMUSG00000000037 Scml2 10090.ENSMUSP00000076593 # 5 ENSMUSG00000000049 Apoh 10090.ENSMUSP00000000049 # 6 ENSMUSG00000000056 Narf 10090.ENSMUSP00000099304 dim(annots_with_E_and_S_combined_STRING_ids) # 20724 3 # proper amount of rows still annots_with_E_and_S_combined_STRING_ids$STRING_id[annots_with_E_and_S_combined_STRING_ids$STRING_id == "NA"] <- NA sum(is.na(annots_with_E_and_S_combined_STRING_ids$STRING_id)) # 5,731 NAs # now all is right with the world # we still have the proper amount of rows # we captured things tied by ensembl ID and/or SYMBOL # and there are only 5,731 NAs ``` Now all is right with the world. We still have the proper amount of rows. We capptured things tied by ENSEMBL ID and/or SYMBOL. And there are only 5,731 NAs. This is the approach I think we should use 👍 --- <!-- ## STRINGdb with SYMBOL and ENSEMBL ```r string_map_symbol_and_ensembl <- string_db$map(annot, c("SYMBOL", "ENSEMBL"), removeUnmappedRows = FALSE, takeFirst = TRUE) # Warning: we couldn't map to STRING 27% of your identifiers dim(string_map_symbol_and_ensembl) # 20765 3 sum(is.na(string_map_symbol_and_ensembl$STRING_id)) # 5731 ``` --> <!-- # ORIGINAL FULL RUN ```r library(tidyverse) library(STRINGdb) # for adding String database annotations library(PANTHER.db) # for adding GOSLIM annotations options(timeout=600) # getting all unique gene IDs from initial gtf file map_tab <- read.table("Mus_musculus.GRCm38.101-gene-to-transcript-map.tsv", sep = "\t", header = FALSE, col.names = c("gene_ID", "transcript_ID"), stringsAsFactors = FALSE) unique_gene_IDs <- map_tab %>% pull(gene_ID) %>% unique %>% sort # reading in organisms.csv file organism_table <- read.csv("organisms.csv") # setting organism for this example target_organism <- "MOUSE" # getting organism taxid target_taxid <- organism_table %>% filter(name == target_organism) %>% pull(taxon) # building annotation table ann.dbi <- organism_table %>% filter(name == target_organism) %>% pull(annotations) # loading organism db package (and installing if needed) if( ! require(ann.dbi, character.only = TRUE) ) { BiocManager::install(ann.dbi, ask = FALSE) library(ann.dbi, character.only = TRUE) } keytype <- "ENSEMBL" annot <- data.frame(unique_gene_IDs, stringsAsFactors = FALSE) colnames(annot)[1] <- keytype # adding gene symbols if available if ( "SYMBOL" %in% columns(eval(parse(text = ann.dbi), env=.GlobalEnv)) ) { annot$SYMBOL <- mapIds(eval(parse(text = ann.dbi), env=.GlobalEnv), keys = unique_gene_IDs, keytype = keytype, column = "SYMBOL", multiVals = "first") } # adding entrez IDs if available if ( "ENTREZID" %in% columns(eval(parse(text = ann.dbi), env=.GlobalEnv)) ) { annot$ENTREZID <- mapIds(eval(parse(text = ann.dbi), env=.GlobalEnv), keys = unique_gene_IDs, keytype = keytype, column = "ENTREZID", multiVals = "first") } ``` ## STRINGdb ```r # adding STRINGdb annotations if available string_db <- STRINGdb$new(version = "11", species = target_taxid, score_threshold = 0) string_map_symbol <- string_db$map(annot, "SYMBOL", removeUnmappedRows = FALSE, takeFirst = TRUE) # Warning: we couldn't map to STRING 62% of your identifiers dim(string_map_symbol) # 55595 4 string_map_ensembl <- string_db$map(annot, "ENSEMBL", removeUnmappedRows = FALSE, takeFirst = TRUE) # Warning: we couldn't map to STRING 60% of your identifiers dim(string_map_ensembl) # 55506 4 # checking multiple STRING_ids symbols_with_multiple_STRING_ids <- data.frame(table(string_map_symbol$SYMBOL)) %>% filter(Freq > 1) %>% pull(Var1) %>% as.character length(symbols_with_multiple_STRING_ids) # 145 unique symbols with multiple STRING_ids symbols_total_multiple_STRING_ids <- data.frame(table(string_map_symbol$SYMBOL)) %>% filter(Freq > 1) %>% pull(Freq) %>% sum() symbols_total_multiple_STRING_ids # 320 total multiple STRING_ids ensembls_with_with_multiple_STRING_ids <- data.frame(table(string_map_ensembl$ENSEMBL)) %>% filter(Freq > 1) %>% pull(Var1) %>% as.character length(ensembls_with_with_multiple_STRING_ids) # 19 unique symbols with multiple STRING_ids ensembls_total_multiple_STRING_ids <- data.frame(table(string_map_symbol$ENSEMBL)) %>% filter(Freq > 1) %>% pull(Freq) %>% sum() ensembls_total_multiple_STRING_ids # 204 total multiple STRING_ids ``` > Either path, the units we're using can have multiple STRING_ids. There are more when using SYMBOL > My original question was about if we should be keeping and reporting all STRING_ids, instead of whichever happens to be first (which is what happens now), e.g.: ```r table_with_dups <- string_map_symbol %>% filter(SYMBOL %in% symbols_with_multiple_STRING_ids) %>% arrange(SYMBOL) head(table_with_dups) # SYMBOL ENSEMBL ENTREZID STRING_id # 1110025L11RIK ENSMUSG00000068073 68637 10090.ENSMUSP00000086505 # 1110025L11RIK ENSMUSG00000068073 68637 10090.ENSMUSP00000086506 # # 1700003E24RIK ENSMUSG00000095110 100042663 10090.ENSMUSP00000085456 # 1700003E24RIK ENSMUSG00000095110 100042663 10090.ENSMUSP00000085457 # # 1700028J19RIK ENSMUSG00000038782 70004 10090.ENSMUSP00000071992 # 1700028J19RIK ENSMUSG00000038782 70004 10090.ENSMUSP00000048665 table_without_dups <- table_with_dups[!duplicated(table_with_dups$SYMBOL), ] head(table_without_dups, n = 3) # SYMBOL ENSEMBL ENTREZID STRING_id # 1110025L11RIK ENSMUSG00000068073 68637 10090.ENSMUSP00000086505 # 1700003E24RIK ENSMUSG00000095110 100042663 10090.ENSMUSP00000085456 # 1700028J19RIK ENSMUSG00000038782 70004 10090.ENSMUSP00000071992 ``` I was suggesting we report them all, e.g.: ```r tab_with_multiple_STRINGids_combined <- data.frame(row.names = unique(na.omit(table_with_dups$SYMBOL)), stringsAsFactors = FALSE) for ( curr_symbol in row.names(tab_with_multiple_STRINGids_combined) ) { # getting current SYMBOL's STRING_ids curr_STRING_ids <- string_map_symbol %>% filter(SYMBOL == curr_symbol) %>% pull(STRING_id) %>% paste(collapse = "|") # adding to table tab_with_multiple_STRINGids_combined[curr_symbol, "STRING_id"] <- curr_STRING_ids } tab_with_multiple_STRINGids_combined <- tab_with_multiple_STRINGids_combined %>% rownames_to_column("SYMBOL") head(tab_with_multiple_STRINGids_combined) # and this would be combined with the annotation table ``` -->