---
tags: GeneLab
title: Revisiting GL annotation stuff
---
# Revisiting GL annotation stuff
[toc]
## STRINGdb
### Summary
---
> **Summary**
> - Using either SYMBOL or ENSEMBL returns multiple STRING_id hits for some, this isn't an ENSEMBL-mapping problem, it's inherent to the STRINGdb.
> - Using ENSEMBL does not return the wrong STRING_ids, for those that are shared, it returns the same
> - e.g., see genes ENSMUSG00000000037 and ENSMUSG00000000056 in the `head(string_map_symbol)` and `head(string_map_ensembl)` outputs below in the "Running with GLDS-288" section
> - We cannot appropriately use SYMBOL to map back because they are not unique, we lose the information to be able to tie back to each individual gene, e.g.:
> - gene-A has SYMBOL-Y
> - gene-B has SYMBOL-Y
> - we get STRING_ids for SYMBOL-Y, then map back based on SYMBOL, and neither gene-A nor gene-B gets the annotation when we merge the tables (e.g., see genes ENSMUSG00000038729 and ENSMUSG00000090053 in the "Short run" section below)
> - Currently, the deduplication step causes us to lose a ton of STRING_id annotations erroneously
> - this is demonstrated in the "Short run" section below, and then tested out on the output diff. expr. table for GLDS-288 in the "Running with GLDS-288 IDs" section below, where ~20,000 of the genes have NA for STRING_id in the GLDS output table while only ~6,000 should
>
> - STRING_ids overall more successfullly map when using ENSEMBL
> - Using SYMBOL
> - 70% mapped of GLDS-288
> - Using ENSEMBL
> - 72% mapped of GLDS-288
> - Fewer multiple-STRING_ids show up when using ENSEMBL
> - Using SYMBOL
> - 122 total multiple STRING_ids returned, spread across 57 unique SYMBOLs (though some of these may be combined genes)
> - Using ENSEMBL
> - 65 total multiple STRING_ids returned, spread across 32 unique ENSEMBL IDs (which we know are all unique)
>
> **As shown on the page below:**
> 1. We need to tie the mapping to our unique ENSEMBL IDs in order to not lose individual gene information that is collapsed by the non-unique SYMBOLs - which is also causing a problem when merging (demonstrated in the "Short run" and by all the lost STRING_ids in the GLDS-288 example below).
> 2. We should report multiple STRING_ids when they exist for a single gene entry. If not, we need to choose one, and there is no criteria in place or available to do that.
> 3. We can map based on both ENSEMBL IDs *and* SYMBOLs, and we recover a tiny bit more string annotations that way, but we appropriately still keep the gene-level relationships because we are still using the unique ENSEMBL IDs. (E.g., ENSEMBL alone for GLDS-288 we have 5,841 NAs for STRING_id, with SYMBOLs too we have 5,731). This is the approach I think we should use, and is done in the last section ("STRINGdb with ENSEMBL and SYMBOL and dealing with multiple STRING_ids") 👍
>
---
Work was/can be done anywhere on N288 in the following conda environment. No files are needed that aren't downloaded with commands below, and nothing is written out. So should be quick/easy to recreate and look at things via copy/paste/modify if wanted. Everything is virtually instant except for a few commands that may take about a minute.
```bash
conda activate /global/smf/miniconda38_admin/envs/RNAseq_Rtools_03-2022
```
### Short run showing current problems
```r
library(tidyverse)
library(STRINGdb) # for adding String database annotations
library(PANTHER.db) # for adding GOSLIM annotations
options(timeout=600)
# reading in gene IDs
unique_gene_IDs <- c("ENSMUSG00000038729",
"ENSMUSG00000099619",
"ENSMUSG00000000031",
"ENSMUSG00000090053",
"ENSMUSG00000068073",
"ENSMUSG00000085894")
# reading in organisms.csv file
org_link <-
"https://raw.githubusercontent.com/asaravia-butler/GeneLab_Data_Processing/master/RNAseq/organisms.csv"
organism_table <- read.csv(org_link)
# setting organism for this example
target_organism <- "MOUSE"
# getting organism taxid
target_taxid <- organism_table %>%
filter(name == target_organism) %>%
pull(taxon)
# building annotation table
ann.dbi <- organism_table %>%
filter(name == target_organism) %>%
pull(annotations)
# loading organism db package (and installing if needed)
if( ! require(ann.dbi, character.only = TRUE) ) {
BiocManager::install(ann.dbi, ask = FALSE)
library(ann.dbi, character.only = TRUE)
}
keytype <- "ENSEMBL"
annot <- data.frame(unique_gene_IDs, stringsAsFactors = FALSE)
colnames(annot)[1] <- keytype
# adding gene symbols if available
if ( "SYMBOL" %in% columns(eval(parse(text = ann.dbi), env=.GlobalEnv)) ) {
annot$SYMBOL <- mapIds(eval(parse(text = ann.dbi), env=.GlobalEnv),
keys = unique_gene_IDs,
keytype = keytype,
column = "SYMBOL",
multiVals = "first")
}
annot
# ENSEMBL SYMBOL
# 1 ENSMUSG00000038729 Pakap
# 2 ENSMUSG00000099619 Gm28091
# 3 ENSMUSG00000000031 H19
# 4 ENSMUSG00000090053 Pakap
# 5 ENSMUSG00000068073 1110025L11Rik
# 6 ENSMUSG00000085894 <NA>
# adding STRINGdb annotations
string_db <- STRINGdb$new(version = "11",
species = target_taxid,
score_threshold = 0)
string_map_symbol <-
string_db$map(annot, "SYMBOL",
removeUnmappedRows = FALSE,
takeFirst = TRUE)
# Warning: we couldn't map to STRING 28% of your identifiers
dim(string_map_symbol)
# 7 3 # this introduced one duplicate row because it has multiple STRING_id hits
dim(annot)
# 6 2
string_map_symbol
# SYMBOL ENSEMBL STRING_id
# 1 PAKAP ENSMUSG00000038729 10090.ENSMUSP00000117466
# 6 GM28091 ENSMUSG00000099619 <NA>
# 3 H19 ENSMUSG00000000031 10090.ENSMUSP00000028386
# 2 PAKAP ENSMUSG00000090053 10090.ENSMUSP00000117466
# 4 1110025L11RIK ENSMUSG00000068073 10090.ENSMUSP00000086506
# 5 1110025L11RIK ENSMUSG00000068073 10090.ENSMUSP00000086505
# 7 <NA> ENSMUSG00000085894 <NA>
```
---
> **As seen in the table just above, genes ENSMUSG00000038729 and ENSMUSG00000090053 should both have STRING_id 10090.ENSMUSP00000117466**
---
```r
## deduplication step breaks things
string_map_symbol_no_dups <-
string_map_symbol[!duplicated(string_map_symbol$SYMBOL), ]
# that threw away this gene: ENSMUSG00000090053
string_map_symbol_no_dups
# SYMBOL ENSEMBL STRING_id
# 1 PAKAP ENSMUSG00000038729 10090.ENSMUSP00000117466
# 6 GM28091 ENSMUSG00000099619 <NA>
# 3 H19 ENSMUSG00000000031 10090.ENSMUSP00000028386
# 4 1110025L11RIK ENSMUSG00000068073 10090.ENSMUSP00000086506
# 7 <NA> ENSMUSG00000085894 <NA>
# merging
annot_with_STRINGdb_after_dedup <-
dplyr::left_join(annot,
string_map_symbol_no_dups[, c("SYMBOL", "STRING_id")],
by = "SYMBOL")
# neither of the the original two genes mentioned above (that shared a SYMBOL)
# got their appropriate STRING_id annotation
annot_with_STRINGdb_after_dedup
# ENSEMBL SYMBOL STRING_id
# 1 ENSMUSG00000038729 Pakap <NA>
# 2 ENSMUSG00000099619 Gm28091 <NA>
# 3 ENSMUSG00000000031 H19 10090.ENSMUSP00000028386
# 4 ENSMUSG00000090053 Pakap <NA>
# 5 ENSMUSG00000068073 1110025L11Rik <NA>
# 6 ENSMUSG00000085894 <NA> <NA>
```
---
> - Neither of the two genes mentioned above (that shared a SYMBOL) have their appropriate STRING_id, they both have NAs.
> - Checking one output from GL, ("[GLDS-288_rna_seq_differential_expression.csv](https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-288_rna_seq_differential_expression.csv?version=3)"), these same two genes (ENSMUSG00000038729 and ENSMUSG00000090053) both have the SYMBOL in the table, but neither has the STRING_id.
> - SYMBOLs that mapped to more than one STRING_id also return nothing in the merged table. This is the case for ENSMUSG00000068073, which we can see has two rows in the `string_map_symbol` table above. Neither of those STRING_ids were assigned to it in the output merged table.
---
### Running with GLDS-288 IDs
```r
# clearing out things from above
rm(list = ls())
# getting ENSEMBL IDs from GLDS-288 output
link <-
"https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-288_rna_seq_differential_expression.csv?version=3"
unique_gene_IDs <- read.csv(link) %>% pull(ENSEMBL)
length(unique_gene_IDs)
# 20724
# reading in organisms.csv file
org_link <-
"https://raw.githubusercontent.com/asaravia-butler/GeneLab_Data_Processing/master/RNAseq/organisms.csv"
organism_table <- read.csv(org_link)
# setting organism for this example
target_organism <- "MOUSE"
# getting organism taxid
target_taxid <- organism_table %>%
filter(name == target_organism) %>%
pull(taxon)
# building annotation table
ann.dbi <- organism_table %>%
filter(name == target_organism) %>%
pull(annotations)
# loading organism db package (and installing if needed)
if( ! require(ann.dbi, character.only = TRUE) ) {
BiocManager::install(ann.dbi, ask = FALSE)
library(ann.dbi, character.only = TRUE)
}
keytype <- "ENSEMBL"
annot <- data.frame(unique_gene_IDs, stringsAsFactors = FALSE)
colnames(annot)[1] <- keytype
# adding gene symbols if available
if ( "SYMBOL" %in% columns(eval(parse(text = ann.dbi), env=.GlobalEnv)) ) {
annot$SYMBOL <- mapIds(eval(parse(text = ann.dbi), env=.GlobalEnv),
keys = unique_gene_IDs,
keytype = keytype,
column = "SYMBOL",
multiVals = "first")
}
dim(annot)
# 20724 2
head(annot)
# ENSEMBL SYMBOL
# 1 ENSMUSG00000000001 Gnai3
# 2 ENSMUSG00000000028 Cdc45
# 3 ENSMUSG00000000031 H19
# 4 ENSMUSG00000000037 Scml2
# 5 ENSMUSG00000000049 Apoh
# 6 ENSMUSG00000000056 Narf
```
#### STRINGdb with SYMBOL
```r
string_db <- STRINGdb$new(version = "11",
species = target_taxid,
score_threshold = 0)
string_map_symbol <-
string_db$map(annot, "SYMBOL",
removeUnmappedRows = FALSE,
takeFirst = TRUE)
# Warning: we couldn't map to STRING 30% of your identifiers
dim(string_map_symbol)
# 20762 3
# the stringdb output has more rows, because of those cases where a gene maps
# to multiple STRING_ids
head(string_map_symbol)
# SYMBOL ENSEMBL STRING_id
# 1 GNAI3 ENSMUSG00000000001 10090.ENSMUSP00000000001
# 2 CDC45 ENSMUSG00000000028 10090.ENSMUSP00000000028
# 3 H19 ENSMUSG00000000031 10090.ENSMUSP00000028386
# 4 SCML2 ENSMUSG00000000037 10090.ENSMUSP00000076593
# 5 APOH ENSMUSG00000000049 10090.ENSMUSP00000000049
# 6 NARF ENSMUSG00000000056 10090.ENSMUSP00000099304
sum(is.na(string_map_symbol$STRING_id))
# 6273
# there are currently ~6,000 NAs for STRING_id in there
## checking out multiple STRING_ids
symbols_with_multiple_STRING_ids <-
data.frame(table(string_map_symbol$SYMBOL)) %>%
filter(Freq > 1) %>% pull(Var1) %>% as.character
length(symbols_with_multiple_STRING_ids)
# 57 unique symbols with multiple STRING_ids
symbols_total_multiple_STRING_ids <-
data.frame(table(string_map_symbol$SYMBOL)) %>%
filter(Freq > 1) %>% pull(Freq) %>% sum()
symbols_total_multiple_STRING_ids
# 122 total multiple STRING_ids
## current way of combining
string_map_symbol_no_dupes <-
string_map_symbol[!duplicated(string_map_symbol$SYMBOL), ]
dim(string_map_symbol_no_dupes)
# 16375 3
# this deduplication step just takes the first STRING_id that shows up and
# throws the rest away. there is nothing special about the order they show
# up, this is basically random
annot_combined_with_string_map_symbol_no_dupes <-
dplyr::left_join(annot,
string_map_symbol_no_dupes[, c("SYMBOL", "STRING_id")],
by = "SYMBOL")
dim(annot_combined_with_string_map_symbol_no_dupes)
# 20724 3
head(annot_combined_with_string_map_symbol_no_dupes)
# ENSEMBL SYMBOL STRING_id
# 1 ENSMUSG00000000001 Gnai3 <NA>
# 2 ENSMUSG00000000028 Cdc45 <NA>
# 3 ENSMUSG00000000031 H19 10090.ENSMUSP00000028386
# 4 ENSMUSG00000000037 Scml2 <NA>
# 5 ENSMUSG00000000049 Apoh <NA>
# 6 ENSMUSG00000000056 Narf <NA>
# we can see compared to the 'string_map_symbol' table just above, NAs were introduced
# probably because of the case difference in SYMBOLs, but
# either way, they are not unique and not as systematic as the ENSEMBL IDs
sum(is.na(annot_combined_with_string_map_symbol_no_dupes$STRING_id))
# 20640 are NA, when in the initial string map object only ~6,000 were
sum(is.na(string_map_symbol$STRING_id))
# 6273
# (the current GLDS-288...expression.csv has 20627 NA)
## joining without removing duplicates is wonky, as expected
annot_combined_with_string_map_symbol_with_dupes <-
dplyr::left_join(annot,
string_map_symbol[, c("SYMBOL", "STRING_id")],
by = "SYMBOL")
dim(annot_combined_with_string_map_symbol_with_dupes)
# 18704735 3
# ^way too many rows
## trying joining on ENSEMBL
annot_combined_with_string_map_symbol_by_ensembl <-
dplyr::left_join(annot,
string_map_symbol[, c("ENSEMBL", "STRING_id")],
by = "ENSEMBL")
dim(annot_combined_with_string_map_symbol_by_ensembl)
# 20762 3
# that also introduced duplicates because of the ones with multiple STRING_ids
# so it's a separate problem, but we need to combine the multiple STRING_ids
# together for those that have them
# the next section addresses that
```
#### Dealing with multiple STRING_ids
```r
# we need to build this table with ENSEMBL IDs because they are unique and
# we have one for each row, matching our starting gene table
tab_with_multiple_STRINGids_combined <-
data.frame(row.names = annot$ENSEMBL,
stringsAsFactors = FALSE)
# iterating through our unique gene IDs (takes about a minute)
for ( curr_gene_ID in row.names(tab_with_multiple_STRINGids_combined) ) {
# getting current gene's STRING_ids and turning into
# combined a vector
curr_STRING_ids <- string_map_symbol %>%
filter(ENSEMBL == curr_gene_ID) %>%
pull(STRING_id) %>% paste(collapse = "|")
# adding to table
tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids
}
tab_with_multiple_STRINGids_combined <-
tab_with_multiple_STRINGids_combined %>%
rownames_to_column("ENSEMBL")
head(tab_with_multiple_STRINGids_combined)
# ENSEMBL STRING_id
# 1 ENSMUSG00000000001 10090.ENSMUSP00000000001
# 2 ENSMUSG00000000028 10090.ENSMUSP00000000028
# 3 ENSMUSG00000000031 10090.ENSMUSP00000028386
# 4 ENSMUSG00000000037 10090.ENSMUSP00000076593
# 5 ENSMUSG00000000049 10090.ENSMUSP00000000049
# 6 ENSMUSG00000000056 10090.ENSMUSP00000099304
# combining with primary table
annots_with_combined_STRING_ids <-
dplyr::left_join(annot,
tab_with_multiple_STRINGids_combined,
by = "ENSEMBL")
head(annots_with_combined_STRING_ids)
# ENSEMBL SYMBOL STRING_id
# 1 ENSMUSG00000000001 Gnai3 10090.ENSMUSP00000000001
# 2 ENSMUSG00000000028 Cdc45 10090.ENSMUSP00000000028
# 3 ENSMUSG00000000031 H19 10090.ENSMUSP00000028386
# 4 ENSMUSG00000000037 Scml2 10090.ENSMUSP00000076593
# 5 ENSMUSG00000000049 Apoh 10090.ENSMUSP00000000049
# 6 ENSMUSG00000000056 Narf 10090.ENSMUSP00000099304
dim(annots_with_combined_STRING_ids)
# 20724 3
# now we still have the proper amount of rows (one per gene, as
# those with multiple are combined)
annots_with_combined_STRING_ids[grep("\\|", annots_with_combined_STRING_ids$STRING_id), ] %>%
head
# ENSEMBL SYMBOL STRING_id
# 4045 ENSMUSG00000025059 Gk 10090.ENSMUSP00000021114|10090.ENSMUSP00000099984|10090.ENSMUSP00000119564
# 5310 ENSMUSG00000027882 Stxbp3 10090.ENSMUSP00000099681|10090.ENSMUSP00000088014
# 5462 ENSMUSG00000028195 Ccn1 10090.ENSMUSP00000029416|10090.ENSMUSP00000029270|10090.ENSMUSP00000029846
# 5570 ENSMUSG00000028431 Elp1 10090.ENSMUSP00000030140|10090.ENSMUSP00000101595
# 9289 ENSMUSG00000038563 Efl1 10090.ENSMUSP00000029566|10090.ENSMUSP00000046046
# 9369 ENSMUSG00000038782 1700028J19Rik 10090.ENSMUSP00000071992|10090.ENSMUSP00000048665
annots_with_combined_STRING_ids$STRING_id[annots_with_combined_STRING_ids$STRING_id == "NA"] <- NA
sum(is.na(annots_with_combined_STRING_ids$STRING_id))
# 6,273 NAs
# and instead of ~20,000 unannotated with STRING_ids, there are ~6,000
```
### STRINGdb with ENSEMBL and dealing with multiple STRING_ids
```r
string_map_ensembl <-
string_db$map(annot, "ENSEMBL",
removeUnmappedRows = FALSE,
takeFirst = TRUE)
# Warning: we couldn't map to STRING 28% of your identifiers (vs 30% with SYMBOL mapping)
dim(string_map_ensembl)
# 20733 3
sum(is.na(string_map_ensembl$STRING_id))
# 5841
# checking multiple STRING_ids
ensembls_with_multiple_STRING_ids <-
data.frame(table(string_map_ensembl$SYMBOL)) %>%
filter(Freq > 1) %>% pull(Var1) %>% as.character
length(ensembls_with_multiple_STRING_ids)
# 32 unique ensembl IDs with multiple STRING_ids
# (vs 57 with SYMBOL mapping)
ensembls_total_multiple_STRING_ids <-
data.frame(table(string_map_ensembl$SYMBOL)) %>%
filter(Freq > 1) %>% pull(Freq) %>% sum()
ensembls_total_multiple_STRING_ids
# 65 total multiple STRING_ids
# (vs 122 with SYMBOL mapping)
## collapsing those with multiple STRING_ids
ensembl_tab_with_multiple_STRINGids_combined <-
data.frame(row.names = annot$ENSEMBL,
stringsAsFactors = FALSE)
# (takes about a minute)
for ( curr_gene_ID in row.names(ensembl_tab_with_multiple_STRINGids_combined) ) {
# getting current gene's STRING_ids
curr_STRING_ids <- string_map_ensembl %>%
filter(ENSEMBL == curr_gene_ID) %>%
pull(STRING_id) %>% paste(collapse = "|")
# adding to table
ensembl_tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids
}
ensembl_tab_with_multiple_STRINGids_combined <-
ensembl_tab_with_multiple_STRINGids_combined %>%
rownames_to_column("ENSEMBL")
head(ensembl_tab_with_multiple_STRINGids_combined)
# ENSEMBL STRING_id
# 1 ENSMUSG00000000001 10090.ENSMUSP00000000001
# 2 ENSMUSG00000000028 10090.ENSMUSP00000000028
# 3 ENSMUSG00000000031 NA
# 4 ENSMUSG00000000037 10090.ENSMUSP00000076593
# 5 ENSMUSG00000000049 10090.ENSMUSP00000000049
# 6 ENSMUSG00000000056 10090.ENSMUSP00000099304
# ^ in the first 6, this is missing 1 that SYMBOL mapping
# catches, but overall more are caught by ENSEMBL despite the first 6 here
# (72% by ensembl vs 70% by SYMBOL)
# additionally, i think we can/should do mapping by both
# as done in the next section
## combining with primary annots table
annots_with_ensembl_combined_STRING_ids <-
dplyr::left_join(annot,
ensembl_tab_with_multiple_STRINGids_combined,
by = "ENSEMBL")
head(annots_with_ensembl_combined_STRING_ids)
# ENSEMBL SYMBOL STRING_id
# 1 ENSMUSG00000000001 Gnai3 10090.ENSMUSP00000000001
# 2 ENSMUSG00000000028 Cdc45 10090.ENSMUSP00000000028
# 3 ENSMUSG00000000031 H19 NA
# 4 ENSMUSG00000000037 Scml2 10090.ENSMUSP00000076593
# 5 ENSMUSG00000000049 Apoh 10090.ENSMUSP00000000049
# 6 ENSMUSG00000000056 Narf 10090.ENSMUSP00000099304
# after combining the STRING_ids remain, most were switched
# to NAs in the SYMBOL approach above due to case-differences between
# Org.db and STRINGdb
dim(annots_with_ensembl_combined_STRING_ids)
# 20724 3
# now we still have the proper amount, and there instead of ~20,000 unannotated with STRING_ids, there are only
annots_with_ensembl_combined_STRING_ids$STRING_id[annots_with_ensembl_combined_STRING_ids$STRING_id == "NA"] <- NA
sum(is.na(annots_with_ensembl_combined_STRING_ids$STRING_id))
# now we have only 5,841 NAs
```
### STRINGdb with ENSEMBL and SYMBOL and dealing with multiple STRING_ids
As mentioned above, we can map by both, and still keep the individual gene-level info because we are including ENSEMBL IDs in the mapping. As seen below, this lets us recover a tiny bit more on these GLDS-288 IDs (we end up with 5,731 NAs instead of the 5,841 we just saw above, so 110 fewer in this case).
```r
string_map_ensembl_and_symbol <-
string_db$map(annot,
c("ENSEMBL", "SYMBOL"),
removeUnmappedRows = FALSE,
takeFirst = TRUE)
# Warning: we couldn't map to STRING 27% of your identifiers
dim(string_map_ensembl_and_symbol)
# 20734 3
head(string_map_ensembl_and_symbol)
# ENSEMBL SYMBOL STRING_id
# 1 ENSMUSG00000000001 GNAI3 10090.ENSMUSP00000000001
# 2 ENSMUSG00000000028 CDC45 10090.ENSMUSP00000000028
# 3 ENSMUSG00000000037 SCML2 10090.ENSMUSP00000076593
# 4 ENSMUSG00000000049 APOH 10090.ENSMUSP00000000049
# 5 ENSMUSG00000000056 NARF 10090.ENSMUSP00000099304
# 6 ENSMUSG00000000058 CAV2 10090.ENSMUSP00000000058
sum(is.na(string_map_ensembl_and_symbol$STRING_id))
# 5731
## collapsing multiple STRING_ids
E_and_S_produced_tab_with_multiple_STRINGids_combined <-
data.frame(row.names = annot$ENSEMBL,
stringsAsFactors = FALSE)
# (takes about a minute)
for ( curr_gene_ID in row.names(E_and_S_produced_tab_with_multiple_STRINGids_combined) ) {
# getting current gene's STRING_ids
curr_STRING_ids <- string_map_ensembl_and_symbol %>%
filter(ENSEMBL == curr_gene_ID) %>%
pull(STRING_id) %>% paste(collapse = "|")
# adding to table
E_and_S_produced_tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids
}
E_and_S_produced_tab_with_multiple_STRINGids_combined <-
E_and_S_produced_tab_with_multiple_STRINGids_combined %>%
rownames_to_column("ENSEMBL")
head(E_and_S_produced_tab_with_multiple_STRINGids_combined)
# ENSEMBL STRING_id
# 1 ENSMUSG00000000001 10090.ENSMUSP00000000001
# 2 ENSMUSG00000000028 10090.ENSMUSP00000000028
# 3 ENSMUSG00000000031 10090.ENSMUSP00000028386
# 4 ENSMUSG00000000037 10090.ENSMUSP00000076593
# 5 ENSMUSG00000000049 10090.ENSMUSP00000000049
# 6 ENSMUSG00000000056 10090.ENSMUSP00000099304
## combining with primary annots table
annots_with_E_and_S_combined_STRING_ids <-
dplyr::left_join(annot,
E_and_S_produced_tab_with_multiple_STRINGids_combined,
by = "ENSEMBL")
head(annots_with_E_and_S_combined_STRING_ids)
# ENSEMBL SYMBOL STRING_id
# 1 ENSMUSG00000000001 Gnai3 10090.ENSMUSP00000000001
# 2 ENSMUSG00000000028 Cdc45 10090.ENSMUSP00000000028
# 3 ENSMUSG00000000031 H19 10090.ENSMUSP00000028386
# 4 ENSMUSG00000000037 Scml2 10090.ENSMUSP00000076593
# 5 ENSMUSG00000000049 Apoh 10090.ENSMUSP00000000049
# 6 ENSMUSG00000000056 Narf 10090.ENSMUSP00000099304
dim(annots_with_E_and_S_combined_STRING_ids)
# 20724 3
# proper amount of rows still
annots_with_E_and_S_combined_STRING_ids$STRING_id[annots_with_E_and_S_combined_STRING_ids$STRING_id == "NA"] <- NA
sum(is.na(annots_with_E_and_S_combined_STRING_ids$STRING_id))
# 5,731 NAs
# now all is right with the world
# we still have the proper amount of rows
# we captured things tied by ensembl ID and/or SYMBOL
# and there are only 5,731 NAs
```
Now all is right with the world. We still have the proper amount of rows. We capptured things tied by ENSEMBL ID and/or SYMBOL. And there are only 5,731 NAs.
This is the approach I think we should use 👍
---
<!--
## STRINGdb with SYMBOL and ENSEMBL
```r
string_map_symbol_and_ensembl <-
string_db$map(annot,
c("SYMBOL", "ENSEMBL"),
removeUnmappedRows = FALSE,
takeFirst = TRUE)
# Warning: we couldn't map to STRING 27% of your identifiers
dim(string_map_symbol_and_ensembl)
# 20765 3
sum(is.na(string_map_symbol_and_ensembl$STRING_id))
# 5731
```
-->
<!--
# ORIGINAL FULL RUN
```r
library(tidyverse)
library(STRINGdb) # for adding String database annotations
library(PANTHER.db) # for adding GOSLIM annotations
options(timeout=600)
# getting all unique gene IDs from initial gtf file
map_tab <- read.table("Mus_musculus.GRCm38.101-gene-to-transcript-map.tsv",
sep = "\t",
header = FALSE,
col.names = c("gene_ID", "transcript_ID"),
stringsAsFactors = FALSE)
unique_gene_IDs <- map_tab %>% pull(gene_ID) %>% unique %>% sort
# reading in organisms.csv file
organism_table <- read.csv("organisms.csv")
# setting organism for this example
target_organism <- "MOUSE"
# getting organism taxid
target_taxid <- organism_table %>%
filter(name == target_organism) %>%
pull(taxon)
# building annotation table
ann.dbi <- organism_table %>%
filter(name == target_organism) %>%
pull(annotations)
# loading organism db package (and installing if needed)
if( ! require(ann.dbi, character.only = TRUE) ) {
BiocManager::install(ann.dbi, ask = FALSE)
library(ann.dbi, character.only = TRUE)
}
keytype <- "ENSEMBL"
annot <- data.frame(unique_gene_IDs, stringsAsFactors = FALSE)
colnames(annot)[1] <- keytype
# adding gene symbols if available
if ( "SYMBOL" %in% columns(eval(parse(text = ann.dbi), env=.GlobalEnv)) ) {
annot$SYMBOL <- mapIds(eval(parse(text = ann.dbi), env=.GlobalEnv),
keys = unique_gene_IDs,
keytype = keytype,
column = "SYMBOL",
multiVals = "first")
}
# adding entrez IDs if available
if ( "ENTREZID" %in% columns(eval(parse(text = ann.dbi), env=.GlobalEnv)) ) {
annot$ENTREZID <- mapIds(eval(parse(text = ann.dbi), env=.GlobalEnv),
keys = unique_gene_IDs,
keytype = keytype,
column = "ENTREZID",
multiVals = "first")
}
```
## STRINGdb
```r
# adding STRINGdb annotations if available
string_db <- STRINGdb$new(version = "11",
species = target_taxid,
score_threshold = 0)
string_map_symbol <-
string_db$map(annot, "SYMBOL",
removeUnmappedRows = FALSE,
takeFirst = TRUE)
# Warning: we couldn't map to STRING 62% of your identifiers
dim(string_map_symbol)
# 55595 4
string_map_ensembl <-
string_db$map(annot, "ENSEMBL",
removeUnmappedRows = FALSE,
takeFirst = TRUE)
# Warning: we couldn't map to STRING 60% of your identifiers
dim(string_map_ensembl)
# 55506 4
# checking multiple STRING_ids
symbols_with_multiple_STRING_ids <-
data.frame(table(string_map_symbol$SYMBOL)) %>%
filter(Freq > 1) %>% pull(Var1) %>% as.character
length(symbols_with_multiple_STRING_ids)
# 145 unique symbols with multiple STRING_ids
symbols_total_multiple_STRING_ids <-
data.frame(table(string_map_symbol$SYMBOL)) %>%
filter(Freq > 1) %>% pull(Freq) %>% sum()
symbols_total_multiple_STRING_ids
# 320 total multiple STRING_ids
ensembls_with_with_multiple_STRING_ids <-
data.frame(table(string_map_ensembl$ENSEMBL)) %>%
filter(Freq > 1) %>% pull(Var1) %>% as.character
length(ensembls_with_with_multiple_STRING_ids)
# 19 unique symbols with multiple STRING_ids
ensembls_total_multiple_STRING_ids <-
data.frame(table(string_map_symbol$ENSEMBL)) %>%
filter(Freq > 1) %>% pull(Freq) %>% sum()
ensembls_total_multiple_STRING_ids
# 204 total multiple STRING_ids
```
> Either path, the units we're using can have multiple STRING_ids. There are more when using SYMBOL
> My original question was about if we should be keeping and reporting all STRING_ids, instead of whichever happens to be first (which is what happens now), e.g.:
```r
table_with_dups <- string_map_symbol %>%
filter(SYMBOL %in% symbols_with_multiple_STRING_ids) %>%
arrange(SYMBOL)
head(table_with_dups)
# SYMBOL ENSEMBL ENTREZID STRING_id
# 1110025L11RIK ENSMUSG00000068073 68637 10090.ENSMUSP00000086505
# 1110025L11RIK ENSMUSG00000068073 68637 10090.ENSMUSP00000086506
#
# 1700003E24RIK ENSMUSG00000095110 100042663 10090.ENSMUSP00000085456
# 1700003E24RIK ENSMUSG00000095110 100042663 10090.ENSMUSP00000085457
#
# 1700028J19RIK ENSMUSG00000038782 70004 10090.ENSMUSP00000071992
# 1700028J19RIK ENSMUSG00000038782 70004 10090.ENSMUSP00000048665
table_without_dups <-
table_with_dups[!duplicated(table_with_dups$SYMBOL), ]
head(table_without_dups, n = 3)
# SYMBOL ENSEMBL ENTREZID STRING_id
# 1110025L11RIK ENSMUSG00000068073 68637 10090.ENSMUSP00000086505
# 1700003E24RIK ENSMUSG00000095110 100042663 10090.ENSMUSP00000085456
# 1700028J19RIK ENSMUSG00000038782 70004 10090.ENSMUSP00000071992
```
I was suggesting we report them all, e.g.:
```r
tab_with_multiple_STRINGids_combined <-
data.frame(row.names = unique(na.omit(table_with_dups$SYMBOL)),
stringsAsFactors = FALSE)
for ( curr_symbol in row.names(tab_with_multiple_STRINGids_combined) ) {
# getting current SYMBOL's STRING_ids
curr_STRING_ids <- string_map_symbol %>%
filter(SYMBOL == curr_symbol) %>%
pull(STRING_id) %>% paste(collapse = "|")
# adding to table
tab_with_multiple_STRINGids_combined[curr_symbol, "STRING_id"] <- curr_STRING_ids
}
tab_with_multiple_STRINGids_combined <-
tab_with_multiple_STRINGids_combined %>%
rownames_to_column("SYMBOL")
head(tab_with_multiple_STRINGids_combined)
# and this would be combined with the annotation table
```
-->