## 前言
生物研究時常需要基因資訊,但分析工具與資料庫之間的註解未必能直接相通,如某些網站僅提供RefSeq ID, 有些只提供Ensembl ID等,雖然都是描述同一基因、但需要做進一步轉換才能串連資訊;另一常見狀況則是在GO enrichment analysis時,需要利用entrezid(gene id)來進行KEGG, GO enrichment分析。今天要介紹三種覺得好用的註解轉換套件:biomaRt, bitr, mygene,文章在R語言環境進行操作,若想使用python可以查詢是否支援並用 conda 安裝。
## biomaRT
`biomaRT` 套件主要協助取得 [Ensembl](http://www.ensembl.org ) 網站的資訊。
### Installation
[bioconductor website](https://bioconductor.org/packages/release/bioc/html/biomaRt.html)
```
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomaRt")
```
### Select desired BioMart database and dataset
以 `useEnsembl()` 來指定連線到哪個 biomaRT 資料庫與資料集
```
library(biomaRt)
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl", version = 113)
```
函式有兩個必要參數 `biomart` 和 `dataset`(層級 biomart > dataset),首先用 `listEnsembl()` 或 `listMarts()` 查詢 `biomart`選項,雖名稱不同但內容相等(如: `genes` 等價於 `ENSEMBL_MART_ENSEMBL`)
```
> listEnsembl()
biomart version
1 genes Ensembl Genes 113
2 mouse_strains Mouse strains 113
3 snps Ensembl Variation 113
4 regulation Ensembl Regulation 113
> listMarts()
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 113
2 ENSEMBL_MART_MOUSE Mouse strains 113
3 ENSEMBL_MART_SNP Ensembl Variation 113
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 113
```
以 `useMart()` 指定biomart資料庫 + `listDatasets()` 查詢可用的資料集,可搭配字串篩選:
```
mart <- useMart("ENSEMBL_MART_ENSEMBL")
ds <- as.data.frame(listDatasets(mart=mart))
```
```
## find rows in dataset which contains 'human' string
ds %>% filter(grepl("human",tolower(description)))
## output:
dataset description version
1 hsapiens_gene_ensembl Human genes (GRCh38.p14) GRCh38.p14
```
最後如果有指定ensembl版本需求,可用 `listEnsemblArchives()` 函數,並用 `useEnsembl(version = xxx)` 指定版本 (default: latest release)
```
name date url version current_release
1 Ensembl GRCh37 Feb 2014 https://grch37.ensembl.org GRCh37
2 Ensembl 113 Oct 2024 https://oct2024.archive.ensembl.org 113 *
3 Ensembl 112 May 2024 https://may2024.archive.ensembl.org 112
4 Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org 111
```
### Run biomaRT query
指定ensembl參考資料後,接著以 `getBM()`連線至 ensembl server 並針對query返回dataframe:
```
getBM(attributes, filters = "", values = "", mart, curl = NULL,
checkFilters = TRUE, verbose = FALSE, uniqueRows = TRUE, bmHeader = FALSE,
quote = "\"", useCache = TRUE)
```
參數包含:
- `atribute` 輸出ensembl的欄位,可用 `listAttributes(${useEnsembl()_varname})` 查看選項
- `filters` 輸入內容的資訊類型,如輸入 `ENSGxxx` 則 `filters = "ensembl_gene_id"` 可用 `listFilters(${useEnsembl()_varname})` 查看選項
- `values` 輸入內容 input query list
- `mart` 指定 `useEnsembl()` 物件
### Check caches
若先前有跑過相同資料庫,透過 `biomartCacheInfo()` 查詢是否有 cache file
```
biomaRt cache
- Location: C:\xxxx/biomaRt/Cache
- No. of files: 12
- Total size: 983.1 Kb
```
### Example R code
```
library(biomaRt)
## check mart dbs
listMarts()
mart <- useMart("ENSEMBL_MART_ENSEMBL")
## check mart datasets and find rows where column 'description' contains human
ds <- as.data.frame(listDatasets(mart=mart))
ds %>% filter(grepl("human",tolower(description)))
# build this object which specify older version 103
ensembl <- useEnsembl(biomart = 'ensembl', dataset = 'ggallus_gene_ensembl', version = 103)
## run biomart convert
data <- read.table('query_gene.txt', header = T)
inputs <- as.vector(data$ensembl_gene_id)
# send it to the Ensembl BioMart server
genes <- getBM(attributes=c("external_gene_name","ensembl_gene_id","entrezgene_id"),
filters = "ensembl_gene_id",
values = inputs,
mart = ensembl)
```
---
## bitr
`bitr()` 為 `clusterprofiler`套件內的函式,利用 `AnnotationDbi::select` 來篩選特定物種的 `OrgDB` 內的資訊,最後輸出dataframe
**所有的`.db`結尾的library其實就是使用AnnotationDbi來的物件
### Installation
[bioconductor website](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html)
```
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("clusterProfiler")
```
### Run bitr() function
```
bitr(geneID, fromType, toType, OrgDb, drop = TRUE)
```
- `geneID` 為 input query list
- `fromType` 和 `toType` 須為 `idType(orgDB)` 返回的類別(key):
```
fromType' should be one of ACCNUM, ALIAS, ENSEMBL, ENSEMBLPROT, ENSEMBLTRANS, ENTREZID, ENZYME, EVIDENCE, EVIDENCEALL, GENENAME, GENETYPE, GO, GOALL, IPI, MAP, OMIM, ONTOLOGY, ONTOLOGYALL, PATH, PFAM, PMID, PROSITE, REFSEQ, SYMBOL, UCSCKG, UNIPROT.
```
- `OrgDb` 為指定物種的註解包,需要先安裝才能使用,物種匹配的orgDB 可以參考[這個網站](https://evvail.com/2021/07/13/2459.html)
### Example R code
```
library(clusterProfiler)
library(org.Hs.eg.db)
query_gene <- c('TP53','BRAF','PIK3CA','KRAS')
gene_out <- bitr(query_gene, fromType="SYMBOL", toType=c("GENENAME", "ENTREZID","ENSEMBL"), OrgDb=org.Hs.eg.db)
gene_out <- as.data.frame(gene_out)
> print(gene_out)
SYMBOL GENENAME ENTREZID ENSEMBL
1 TP53 tumor protein p53 7157 ENSG00000141510
2 BRAF B-Raf proto-oncogene, serine/threonine kinase 673 ENSG00000157764
3 PIK3CA phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha 5290 ENSG00000121879
4 KRAS KRAS proto-oncogene, GTPase 3845 ENSG00000133703
```
轉換完成會顯示有多少比例轉換失敗
```
50% of input gene IDs are fail to map...
> print(gene_out)
SYMBOL GENENAME ENTREZID
1 TP53 tumor protein p53 7157
4 KRAS KRAS proto-oncogene, GTPase 3845
```
---
## mygene
`mygene.info` 以API形式獲取公開資料庫的基因資訊,官方文件多以 python code 做示範使用,R 套件函式屬於 API qurey 的 wrapper
> `queryMany` is a wrapper for POST query of "/query" service
### installation
[bioconductor website](https://www.bioconductor.org/packages/release/bioc/html/mygene.html)
```
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mygene")
```
### Run mygene query
主要使用 `queryMany` 函式
```
queryMany(qterms, scopes=NULL, ..., return.as=c("DataFrame", "records", "text"), mygene)
```
- `qterm` 為 input query list
- `scopes` 提供 query list 類型,可用 field請參考[官方文件](https://docs.mygene.info/en/latest/doc/data.html#available-fields),如: query 給 `ENSGxx` 則 `scopes = ensembl.gene`
- 其他常用參數
- `species` 給予物種資訊如 **taxid**;
- `fields` 指定輸出內容,可填類型跟 `scopes` field 相同
- `return.as` 指定輸出類型: 目前支援"DataFrame" (default), "records" (list), "text" (JSON) 三種
### Example R code
```
library(mygene)
geneList <- c('TP53','BRAF','PIK3CA','KRAS')
query_output <- queryMany(geneList, scopes = 'symbol', species = 9606, fields = 'ensembl.gene,name,entrezgene')
> print(query_output)
DataFrame with 4 rows and 6 columns
query _id X_score entrezgene name ensembl.gene
<character> <character> <numeric> <character> <character> <character>
1 TP53 7157 18.0079 7157 tumor protein p53 ENSG00000141510
2 BRAF 673 17.6012 673 B-Raf proto-oncogene.. ENSG00000157764
3 PIK3CA 5290 17.8778 5290 phosphatidylinositol.. ENSG00000121879
4 KRAS 3845 17.6623 3845 KRAS proto-oncogene,.. ENSG00000133703
```
---
## Comments
`bitr()` 和 `mygene queryMany()` 相當方便,常見或具有orgDB的物種基本上沒有太大問題,但有時會遇到以下情況導致轉換失敗:
- gene name alias 或 synonyms 無法直接聯繫到正確的HGNC資訊,這時可先用 `AnnotationDbi::select(org.${spec}.eg.db, keys = alias_list, columns = c('SYMBOL'), keytype = "ALIAS")` 轉換基因名稱,或是找尋好用套件來轉換...
```
alias_list <- c('MEK','PI3K','TP53')
> AnnotationDbi::select(org.Hs.eg.db, keys = alias_list, columns = c('SYMBOL'), keytype = "ALIAS")
> 'select()' returned 1:many mapping between keys and columns
ALIAS SYMBOL
1 MEK MAP2K7
2 PI3K PIK3CA
3 PI3K PIK3CB
4 PI3K PIK3CD
5 PI3K PIK3CG
6 TP53 TP53
```
- 若資料庫以舊版 ensembl 註解,有些transcript ID 可能已經被移除導致轉換失敗,這時 `biomaRT useEnsembl()` 指定資料集版本的功能相當方便,能多少救一些轉換,但也要注意這樣的轉換是否有意義,畢竟最新版沒有收錄......
## References
> https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html
> https://mygene.info/
> https://github.com/gege-circle/home/issues/702