UCSC db === ###### tags: `基因體/三級分析/資料庫` ###### tags: `生物資訊`, `基因體`, `資料庫`, `ucsc`, `db` <br> ## 測試環境 - Ubuntu 16.04 <br> ## 安裝 mysql client 套件 ```bash $ mysql # 查看此指令是否存在 ``` 若不存在,會顯示安裝提示,並協助你安裝 MariaDb 或 MySQL 套件 ```bash $ sudo apt install mysql-client-core-5.7 ``` <br> ## 連線到 UCSC db ``` $ mysql --host=genome-mysql.cse.ucsc.edu --user=genome ``` - 用法介紹 - 新版說明:[UCSC > Downloads > MySQL Access](https://genome.ucsc.edu/goldenPath/help/mysql.html) - 舊版說明:http://rohsdb.usc.edu/GBshape/goldenPath/help/mysql.html - host 主機名稱 - user 使用者名稱 <br> ## 離開資料庫 ```mysql mysql> exit; Bye ``` <br> ## 預備知識 - [UCSC 的版本介紹](https://hackmd.io/3dTx6dqhTBOKbg8-RsR3PA) - 基因體版本 hg 就是「資料庫名稱」 - 常見表格 - <b style="padding: 0.2em;background: lightgray;">[refSeq](https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=hg19&hgta_doSchemaTable=refGene)</b> (近即時資料,連接 NCBI) > gene prediction with some additional info. > 資料來自 NCBI (RefSeq gene predictions from NCBI) - **name** (主鍵) - RefSeq ID, 參考序列編號, 以流水號表示 - 正確來講,應為 transcript id (轉錄子 ID) - 亦可稱為「基因名稱(gene name)」 非人類可讀的「基因符號(gene symbol)」 - **染色體名稱** - **正反股** ![](https://i.imgur.com/iskJUOK.png) - 正股(+):Coding Strand, 編碼股(大陸:編碼鏈) - 反股(-):Template Strand, 模板股(大陸:模板鏈) - 基因的「起始位置」&「結束位置」 - 編碼區域的「起始位置」&「結束位置」 - 外顯子資訊 <br> - <b style="padding: 0.2em;background: lightgray;">[knownGene](https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=757737243_iyrwaPXTlRlqr72fx7dcW50v1Bga&hgta_doSchemaDb=hg19&hgta_doSchemaTable=knownGene)</b> (2013-06-14) > Transcript from default gene set in UCSC browser > UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics) - 基本欄位名稱,與 refGene 相同 - 差別: - refGene 近即時資料,隨時間更新 - knownGene 固定資料,不再更新 <br> - <b style="padding: 0.2em;background: lightgray;">[kgXref](https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=hg19&hgta_doSchemaTable=kgXref)</b> (2013-06-14, kg=KnownGene) > Links a Known Gene ID with mRNA, UniProtKB, RefSeq, and NCBI accessions/IDs > kgXXX 通常與 knownGene 搭配 - **geneSymbol** 基因符號,人類可讀的符號,如 BRCA1, BRCA2, APOE - **refseq** RefSeq ID, 參考序列編號, 以流水號表示 亦為「基因名稱(gene name)」 非人類可讀的「基因符號(gene symbol)」 <br> <hr> <hr> ## [概念1] 查看資料表格的綱要 - ```describe [db_name.]table.name``` - for formated output - ```show create table [db_name.]table_name;``` - for the SQL statement ## 查看 [hg19] 的 [refGene] 表格的綱要 - db_name: hg19 - table: refGene ```mysql mysql> describe hg19.refGene; +--------------+------------------------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +--------------+------------------------------------+------+-----+---------+-------+ | bin | smallint(5) unsigned | NO | | NULL | | | name | varchar(255) | NO | MUL | NULL | | | chrom | varchar(255) | NO | MUL | NULL | | | strand | char(1) | NO | | NULL | | | txStart | int(10) unsigned | NO | | NULL | | | txEnd | int(10) unsigned | NO | | NULL | | | cdsStart | int(10) unsigned | NO | | NULL | | | cdsEnd | int(10) unsigned | NO | | NULL | | | exonCount | int(10) unsigned | NO | | NULL | | | exonStarts | longblob | NO | | NULL | | | exonEnds | longblob | NO | | NULL | | | score | int(11) | YES | | NULL | | | name2 | varchar(255) | NO | MUL | NULL | | | cdsStartStat | enum('none','unk','incmpl','cmpl') | NO | | NULL | | | cdsEndStat | enum('none','unk','incmpl','cmpl') | NO | | NULL | | | exonFrames | longblob | NO | | NULL | | +--------------+------------------------------------+------+-----+---------+-------+ 16 rows in set (0.27 sec) ``` [![](https://i.imgur.com/d76g7Ny.png)](https://i.imgur.com/d76g7Ny.png) ```SELECT * FROM hg19.refGene LIMIT 1;``` <br> - 相同 name,不同 bin,不知道如何解釋 ```sql SELECT bin, name, txStart, txEnd FROM hg19.refGene WHERE name = 'NR_028325'; ``` ``` +------+-----------+-----------+-----------+ | bin | name | txStart | txEnd | +------+-----------+-----------+-----------+ | 587 | NR_028325 | 323891 | 328581 | | 1964 | NR_028325 | 180750506 | 180755196 | +------+-----------+-----------+-----------+ 2 rows in set (0.19 sec) ``` 會造成 join 的時候,產生多筆資料 <br> ## 查看 [hg19] 的 [kgXref] 表格的綱要 - db_name: hg19 - table: kgXref ```mysql mysql> describe hg19.kgXref; +-------------+--------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------+--------------+------+-----+---------+-------+ | kgID | varchar(255) | NO | MUL | NULL | | | mRNA | varchar(255) | NO | MUL | NULL | | | spID | varchar(255) | NO | MUL | NULL | | | spDisplayID | varchar(255) | NO | MUL | NULL | | | geneSymbol | varchar(255) | NO | MUL | NULL | | | refseq | varchar(255) | NO | MUL | NULL | | | protAcc | varchar(255) | NO | MUL | NULL | | | description | longblob | NO | | NULL | | | rfamAcc | varchar(255) | NO | MUL | NULL | | | tRnaName | varchar(255) | NO | MUL | NULL | | +-------------+--------------+------+-----+---------+-------+ 10 rows in set (1.14 sec) ``` [![](https://i.imgur.com/g4CDQA5.png)](https://i.imgur.com/g4CDQA5.png) ```SELECT * FROM hg19.kgXref LIMIT 1;``` <br> <hr> <hr> ## [概念2] 查看「基因名稱」、「起始位置」、「結束位置」 - 版本:hg19 - 需要表格 - **kgXref** - geneSymbol 基因名稱 - refseq 參考序列編號 - **knownGene** - name 參考序列編號 - txStart 序列起始 - txEnd 序列結束 <br> - SQL: table - **kgXref** ```sql SELECT geneSymbol, refseq from hg19.kgXref; ``` - 82960 rows in set (13.15 sec) - **knownGene** ```sql SELECT name, txStart, txEnd from hg19.knownGene; ``` - 75327 rows in set (3.04 sec) - SQL: join ```sql SELECT X.geneSymbol, Y.name, X.refseq, Y.txStart, Y.txEnd FROM hg19.kgXref as X LEFT JOIN hg19.knownGene as Y ON X.kgID = Y.name; ``` - 82960 rows in set (2.86 sec) - 若將 Y: knownGene 改成 Y: refGene,則 join 條件是 ```X.refseq = Y.name``` <br> <hr> <hr> ## GeneEpi ### 抓取 GenEpi 資料(取前 10 筆為例) ```mysql mysql> use hg19 Database changed mysql> SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM (SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')) AS L1 GROUP BY geneSymbol) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' ORDER BY CAST(chr AS UNSIGNED), txStart limit 10; +-----+---------+--------+--------+--------------+ | chr | txStart | txEnd | strand | geneSymbol | +-----+---------+--------+--------+--------------+ | 1 | 10873 | 14409 | + | DDX11L1 | | 1 | 14361 | 30370 | - | WASH7P | | 1 | 34610 | 37081 | - | FAM138F | | 1 | 68090 | 70008 | + | OR4F5 | | 1 | 134772 | 141566 | - | LOC729737 | | 1 | 322891 | 328581 | + | LOC100132062 | | 1 | 366658 | 368597 | + | OR4F29 | | 1 | 661138 | 666731 | - | LOC100133331 | | 1 | 700244 | 715068 | - | LOC100288069 | | 1 | 761585 | 763902 | - | LINC00115 | +-----+---------+--------+--------+--------------+ 10 rows in set (2.47 sec) ``` <br> - 實際上有 22376 筆資料 ### 分析上面的 SQL 指令 ```sql SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM ( SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM') ) AS L1 GROUP BY geneSymbol ) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' ORDER BY CAST(chr AS UNSIGNED), txStart ; ``` <br> ### 根據 GeneEpi 資料,過濾出 APOE 基因 ``` mysql> SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM (SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')) AS L1 GROUP BY geneSymbol) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' AND geneSymbol like 'APOE%' ORDER BY CAST(chr AS UNSIGNED), txStart ; ``` - 比較 hg19 和 hg38 的資料 - [Genome 版本資訊](https://hackmd.io/qmzwm4idRyWROeeboqpfSw#%E9%81%8E%E5%8E%BB%E8%A8%88%E7%95%AB%EF%BC%86%E8%A8%88%E7%95%AB%E4%B8%AD) - use hg19 (較舊) | chr | txStart | txEnd | strand | geneSymbol | |-----|----------|----------|--------|------------| | 19 | 45408038 | 45412650 | + | APOE | - use hg38 (最新) | chr | txStart | txEnd | strand | geneSymbol | |-----|----------|----------|--------|------------| | 19 | 44904753 | 44909393 | + | APOE | - Wiki | chr | txStart | txEnd | strand | geneSymbol | |-----|----------|----------|--------|------------| | 19 | 44905791 | 44909393 | + | [APOE](https://en.wikipedia.org/wiki/Apolipoprotein_E) | - Wiki 的資料,與 UCSC 的資料都不一致 <br> ### 根據 GeneEpi 資料,過濾出 BRCA 基因 ``` mysql> SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM (SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')) AS L1 GROUP BY geneSymbol) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' AND geneSymbol like 'BRCA%' ORDER BY CAST(chr AS UNSIGNED), txStart ; ``` - 比較 hg19 和 hg38 的資料 - [Genome 版本資訊](https://hackmd.io/qmzwm4idRyWROeeboqpfSw#%E9%81%8E%E5%8E%BB%E8%A8%88%E7%95%AB%EF%BC%86%E8%A8%88%E7%95%AB%E4%B8%AD) - use hg19 (較舊) | chr | txStart | txEnd | strand | geneSymbol | geneLength | |-----|----------|----------|--------|------------|-----------| | 13 | 32888616 | 32973809 | + | BRCA2 | 85192 | | 17 | 41196311 | 41278500 | - | BRCA1 | 82188 | - use hg38 (最新) | chr | txStart | txEnd | strand | geneSymbol | geneLength | |-----|----------|----------|--------|------------|-----------| | 13 | 32314479 | 32399668 | + | BRCA2 | 84791 | | 17 | 43044294 | 43126483 | - | BRCA1 | 82188 | - Wiki | chr | txStart | txEnd | strand | geneSymbol | geneLength | |-----|----------|----------|--------|------------|-----------| | 13 | 32315474 | 32400266 | + | [BRCA2](https://en.wikipedia.org/wiki/BRCA2) | 84791 | | 17 | 43044295 | 43170245 | - | [BRCA1](https://en.wikipedia.org/wiki/BRCA1) | 125949 | - [BRCA1](https://en.wikipedia.org/wiki/BRCA1) ![](https://i.imgur.com/VDMn9qU.png) - [BRCA2](https://en.wikipedia.org/wiki/BRCA2) ![](https://i.imgur.com/3CKzV7I.png) - Wiki 的資料,與 UCSC 的資料都不一致 <br> ## 怪異問題集 - 在 knwonGene 資料表中,基因起始點相同,卻有不同的結束點?蛋白質編碼區也是 ```sql select name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount from hg19.knownGene where txStart = 41243451; ``` [![](https://i.imgur.com/RgUPej1.png)](https://i.imgur.com/RgUPej1.png) - 位於 chr17 的 BRCA1 基因 - txStart = 41243451 - 可能解釋 - [選擇性剪接](https://zh.wikipedia.org/wiki/%E9%81%B8%E6%93%87%E6%80%A7%E5%89%AA%E6%8E%A5) > 選擇性剪接便是利用這樣的特性,一條未經剪接的RNA,含有的多種外顯子被剪成的**不同組合**,可轉譯出**不同的蛋白質**。 - [RNA剪接](https://zh.wikipedia.org/wiki/RNA%E5%89%AA%E6%8E%A5) - [knownGene]&[kgXref] v.s. [refGene],在這兩個表格中,發現資料不一致 ```sql SELECT name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, kgID, mRNA, geneSymbol, refseq, protAcc FROM hg19.kgXref AS A INNER JOIN hg19.knownGene AS B ON A.kgID = B.name WHERE geneSymbol = 'BRCA1' ORDER BY txStart, refseq; ``` [![](https://i.imgur.com/TCipx3W.png)](https://i.imgur.com/TCipx3W.png) - txEnd: 41277468 <br> <br> ```sql SELECT name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount FROM hg19.refGene WHERE chrom = 'chr17' AND 41196311 <= txStart AND txStart <= 41243451 ORDER BY name; ``` ![](https://i.imgur.com/h3U1W1N.png) - txEnd: 41277381 <br> ## 參考資料 - [UCSC > Downloads > MySQL Access](https://genome.ucsc.edu/goldenPath/help/mysql.html) - [[stackoverflow] How do I show the schema of a table in a MySQL database?](https://stackoverflow.com/questions/1498777/how-do-i-show-the-schema-of-a-table-in-a-mysql-database) - [[w3schools.com] MySQL Functions](https://www.w3schools.com/sql/sql_ref_mysql.asp)