UCSC db
===
###### tags: `基因體/三級分析/資料庫`
###### tags: `生物資訊`, `基因體`, `資料庫`, `ucsc`, `db`
<br>
## 測試環境
- Ubuntu 16.04
<br>
## 安裝 mysql client 套件
```bash
$ mysql # 查看此指令是否存在
```
若不存在,會顯示安裝提示,並協助你安裝 MariaDb 或 MySQL 套件
```bash
$ sudo apt install mysql-client-core-5.7
```
<br>
## 連線到 UCSC db
```
$ mysql --host=genome-mysql.cse.ucsc.edu --user=genome
```
- 用法介紹
- 新版說明:[UCSC > Downloads > MySQL Access](https://genome.ucsc.edu/goldenPath/help/mysql.html)
- 舊版說明:http://rohsdb.usc.edu/GBshape/goldenPath/help/mysql.html
- host 主機名稱
- user 使用者名稱
<br>
## 離開資料庫
```mysql
mysql> exit;
Bye
```
<br>
## 預備知識
- [UCSC 的版本介紹](https://hackmd.io/3dTx6dqhTBOKbg8-RsR3PA)
- 基因體版本 hg 就是「資料庫名稱」
- 常見表格
- <b style="padding: 0.2em;background: lightgray;">[refSeq](https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=hg19&hgta_doSchemaTable=refGene)</b> (近即時資料,連接 NCBI)
> gene prediction with some additional info.
> 資料來自 NCBI (RefSeq gene predictions from NCBI)
- **name** (主鍵)
- RefSeq ID, 參考序列編號, 以流水號表示
- 正確來講,應為 transcript id (轉錄子 ID)
- 亦可稱為「基因名稱(gene name)」
非人類可讀的「基因符號(gene symbol)」
- **染色體名稱**
- **正反股**

- 正股(+):Coding Strand, 編碼股(大陸:編碼鏈)
- 反股(-):Template Strand, 模板股(大陸:模板鏈)
- 基因的「起始位置」&「結束位置」
- 編碼區域的「起始位置」&「結束位置」
- 外顯子資訊
<br>
- <b style="padding: 0.2em;background: lightgray;">[knownGene](https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=757737243_iyrwaPXTlRlqr72fx7dcW50v1Bga&hgta_doSchemaDb=hg19&hgta_doSchemaTable=knownGene)</b> (2013-06-14)
> Transcript from default gene set in UCSC browser
> UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics)
- 基本欄位名稱,與 refGene 相同
- 差別:
- refGene 近即時資料,隨時間更新
- knownGene 固定資料,不再更新
<br>
- <b style="padding: 0.2em;background: lightgray;">[kgXref](https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=hg19&hgta_doSchemaTable=kgXref)</b> (2013-06-14, kg=KnownGene)
> Links a Known Gene ID with mRNA, UniProtKB, RefSeq, and NCBI accessions/IDs
> kgXXX 通常與 knownGene 搭配
- **geneSymbol**
基因符號,人類可讀的符號,如 BRCA1, BRCA2, APOE
- **refseq**
RefSeq ID, 參考序列編號, 以流水號表示
亦為「基因名稱(gene name)」
非人類可讀的「基因符號(gene symbol)」
<br>
<hr>
<hr>
## [概念1] 查看資料表格的綱要
- ```describe [db_name.]table.name```
- for formated output
- ```show create table [db_name.]table_name;```
- for the SQL statement
## 查看 [hg19] 的 [refGene] 表格的綱要
- db_name: hg19
- table: refGene
```mysql
mysql> describe hg19.refGene;
+--------------+------------------------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------------------------+------+-----+---------+-------+
| bin | smallint(5) unsigned | NO | | NULL | |
| name | varchar(255) | NO | MUL | NULL | |
| chrom | varchar(255) | NO | MUL | NULL | |
| strand | char(1) | NO | | NULL | |
| txStart | int(10) unsigned | NO | | NULL | |
| txEnd | int(10) unsigned | NO | | NULL | |
| cdsStart | int(10) unsigned | NO | | NULL | |
| cdsEnd | int(10) unsigned | NO | | NULL | |
| exonCount | int(10) unsigned | NO | | NULL | |
| exonStarts | longblob | NO | | NULL | |
| exonEnds | longblob | NO | | NULL | |
| score | int(11) | YES | | NULL | |
| name2 | varchar(255) | NO | MUL | NULL | |
| cdsStartStat | enum('none','unk','incmpl','cmpl') | NO | | NULL | |
| cdsEndStat | enum('none','unk','incmpl','cmpl') | NO | | NULL | |
| exonFrames | longblob | NO | | NULL | |
+--------------+------------------------------------+------+-----+---------+-------+
16 rows in set (0.27 sec)
```
[](https://i.imgur.com/d76g7Ny.png)
```SELECT * FROM hg19.refGene LIMIT 1;```
<br>
- 相同 name,不同 bin,不知道如何解釋
```sql
SELECT bin, name, txStart, txEnd
FROM hg19.refGene
WHERE name = 'NR_028325';
```
```
+------+-----------+-----------+-----------+
| bin | name | txStart | txEnd |
+------+-----------+-----------+-----------+
| 587 | NR_028325 | 323891 | 328581 |
| 1964 | NR_028325 | 180750506 | 180755196 |
+------+-----------+-----------+-----------+
2 rows in set (0.19 sec)
```
會造成 join 的時候,產生多筆資料
<br>
## 查看 [hg19] 的 [kgXref] 表格的綱要
- db_name: hg19
- table: kgXref
```mysql
mysql> describe hg19.kgXref;
+-------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| kgID | varchar(255) | NO | MUL | NULL | |
| mRNA | varchar(255) | NO | MUL | NULL | |
| spID | varchar(255) | NO | MUL | NULL | |
| spDisplayID | varchar(255) | NO | MUL | NULL | |
| geneSymbol | varchar(255) | NO | MUL | NULL | |
| refseq | varchar(255) | NO | MUL | NULL | |
| protAcc | varchar(255) | NO | MUL | NULL | |
| description | longblob | NO | | NULL | |
| rfamAcc | varchar(255) | NO | MUL | NULL | |
| tRnaName | varchar(255) | NO | MUL | NULL | |
+-------------+--------------+------+-----+---------+-------+
10 rows in set (1.14 sec)
```
[](https://i.imgur.com/g4CDQA5.png)
```SELECT * FROM hg19.kgXref LIMIT 1;```
<br>
<hr>
<hr>
## [概念2] 查看「基因名稱」、「起始位置」、「結束位置」
- 版本:hg19
- 需要表格
- **kgXref**
- geneSymbol 基因名稱
- refseq 參考序列編號
- **knownGene**
- name 參考序列編號
- txStart 序列起始
- txEnd 序列結束
<br>
- SQL: table
- **kgXref**
```sql
SELECT geneSymbol, refseq from hg19.kgXref;
```
- 82960 rows in set (13.15 sec)
- **knownGene**
```sql
SELECT name, txStart, txEnd from hg19.knownGene;
```
- 75327 rows in set (3.04 sec)
- SQL: join
```sql
SELECT X.geneSymbol, Y.name, X.refseq, Y.txStart, Y.txEnd
FROM
hg19.kgXref as X
LEFT JOIN
hg19.knownGene as Y
ON X.kgID = Y.name;
```
- 82960 rows in set (2.86 sec)
- 若將 Y: knownGene 改成 Y: refGene,則 join 條件是
```X.refseq = Y.name```
<br>
<hr>
<hr>
## GeneEpi
### 抓取 GenEpi 資料(取前 10 筆為例)
```mysql
mysql> use hg19
Database changed
mysql> SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM (SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')) AS L1 GROUP BY geneSymbol) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' ORDER BY CAST(chr AS UNSIGNED), txStart limit 10;
+-----+---------+--------+--------+--------------+
| chr | txStart | txEnd | strand | geneSymbol |
+-----+---------+--------+--------+--------------+
| 1 | 10873 | 14409 | + | DDX11L1 |
| 1 | 14361 | 30370 | - | WASH7P |
| 1 | 34610 | 37081 | - | FAM138F |
| 1 | 68090 | 70008 | + | OR4F5 |
| 1 | 134772 | 141566 | - | LOC729737 |
| 1 | 322891 | 328581 | + | LOC100132062 |
| 1 | 366658 | 368597 | + | OR4F29 |
| 1 | 661138 | 666731 | - | LOC100133331 |
| 1 | 700244 | 715068 | - | LOC100288069 |
| 1 | 761585 | 763902 | - | LINC00115 |
+-----+---------+--------+--------+--------------+
10 rows in set (2.47 sec)
```
<br>
- 實際上有 22376 筆資料
### 分析上面的 SQL 指令
```sql
SELECT
chr,
CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart,
CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd,
strand,
geneSymbol
FROM (
SELECT
REPLACE(chr, 'chr', '') AS chr,
txStart,
txEnd,
strand,
geneSymbol,
MAX(ABS(txEnd-txStart))
FROM (
SELECT
knownGene.chrom AS chr,
knownGene.txStart AS txStart,
knownGene.txEnd AS txEnd,
knownGene.strand AS strand,
kgXref.geneSymbol AS geneSymbol
FROM kgXref INNER JOIN knownGene
ON kgXref.kgID=knownGene.name
WHERE
LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')
) AS L1
GROUP BY geneSymbol
) AS L2
WHERE
LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U')
AND chr NOT LIKE '%\\_%'
ORDER BY
CAST(chr AS UNSIGNED), txStart
;
```
<br>
### 根據 GeneEpi 資料,過濾出 APOE 基因
```
mysql> SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM (SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')) AS L1 GROUP BY geneSymbol) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' AND geneSymbol like 'APOE%' ORDER BY CAST(chr AS UNSIGNED), txStart ;
```
- 比較 hg19 和 hg38 的資料
- [Genome 版本資訊](https://hackmd.io/qmzwm4idRyWROeeboqpfSw#%E9%81%8E%E5%8E%BB%E8%A8%88%E7%95%AB%EF%BC%86%E8%A8%88%E7%95%AB%E4%B8%AD)
- use hg19 (較舊)
| chr | txStart | txEnd | strand | geneSymbol |
|-----|----------|----------|--------|------------|
| 19 | 45408038 | 45412650 | + | APOE |
- use hg38 (最新)
| chr | txStart | txEnd | strand | geneSymbol |
|-----|----------|----------|--------|------------|
| 19 | 44904753 | 44909393 | + | APOE |
- Wiki
| chr | txStart | txEnd | strand | geneSymbol |
|-----|----------|----------|--------|------------|
| 19 | 44905791 | 44909393 | + | [APOE](https://en.wikipedia.org/wiki/Apolipoprotein_E) |
- Wiki 的資料,與 UCSC 的資料都不一致
<br>
### 根據 GeneEpi 資料,過濾出 BRCA 基因
```
mysql> SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM (SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')) AS L1 GROUP BY geneSymbol) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' AND geneSymbol like 'BRCA%' ORDER BY CAST(chr AS UNSIGNED), txStart ;
```
- 比較 hg19 和 hg38 的資料
- [Genome 版本資訊](https://hackmd.io/qmzwm4idRyWROeeboqpfSw#%E9%81%8E%E5%8E%BB%E8%A8%88%E7%95%AB%EF%BC%86%E8%A8%88%E7%95%AB%E4%B8%AD)
- use hg19 (較舊)
| chr | txStart | txEnd | strand | geneSymbol | geneLength |
|-----|----------|----------|--------|------------|-----------|
| 13 | 32888616 | 32973809 | + | BRCA2 | 85192 |
| 17 | 41196311 | 41278500 | - | BRCA1 | 82188 |
- use hg38 (最新)
| chr | txStart | txEnd | strand | geneSymbol | geneLength |
|-----|----------|----------|--------|------------|-----------|
| 13 | 32314479 | 32399668 | + | BRCA2 | 84791 |
| 17 | 43044294 | 43126483 | - | BRCA1 | 82188 |
- Wiki
| chr | txStart | txEnd | strand | geneSymbol | geneLength |
|-----|----------|----------|--------|------------|-----------|
| 13 | 32315474 | 32400266 | + | [BRCA2](https://en.wikipedia.org/wiki/BRCA2) | 84791 |
| 17 | 43044295 | 43170245 | - | [BRCA1](https://en.wikipedia.org/wiki/BRCA1) | 125949 |
- [BRCA1](https://en.wikipedia.org/wiki/BRCA1)

- [BRCA2](https://en.wikipedia.org/wiki/BRCA2)

- Wiki 的資料,與 UCSC 的資料都不一致
<br>
## 怪異問題集
- 在 knwonGene 資料表中,基因起始點相同,卻有不同的結束點?蛋白質編碼區也是
```sql
select name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount from hg19.knownGene where txStart = 41243451;
```
[](https://i.imgur.com/RgUPej1.png)
- 位於 chr17 的 BRCA1 基因
- txStart = 41243451
- 可能解釋
- [選擇性剪接](https://zh.wikipedia.org/wiki/%E9%81%B8%E6%93%87%E6%80%A7%E5%89%AA%E6%8E%A5)
> 選擇性剪接便是利用這樣的特性,一條未經剪接的RNA,含有的多種外顯子被剪成的**不同組合**,可轉譯出**不同的蛋白質**。
- [RNA剪接](https://zh.wikipedia.org/wiki/RNA%E5%89%AA%E6%8E%A5)
- [knownGene]&[kgXref] v.s. [refGene],在這兩個表格中,發現資料不一致
```sql
SELECT
name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount,
kgID, mRNA, geneSymbol, refseq, protAcc
FROM
hg19.kgXref AS A
INNER JOIN
hg19.knownGene AS B
ON A.kgID = B.name
WHERE geneSymbol = 'BRCA1'
ORDER BY txStart, refseq;
```
[](https://i.imgur.com/TCipx3W.png)
- txEnd: 41277468
<br>
<br>
```sql
SELECT
name, chrom, strand, txStart, txEnd,
cdsStart, cdsEnd, exonCount
FROM hg19.refGene
WHERE chrom = 'chr17'
AND 41196311 <= txStart
AND txStart <= 41243451
ORDER BY name;
```

- txEnd: 41277381
<br>
## 參考資料
- [UCSC > Downloads > MySQL Access](https://genome.ucsc.edu/goldenPath/help/mysql.html)
- [[stackoverflow] How do I show the schema of a table in a MySQL database?](https://stackoverflow.com/questions/1498777/how-do-i-show-the-schema-of-a-table-in-a-mysql-database)
- [[w3schools.com] MySQL Functions](https://www.w3schools.com/sql/sql_ref_mysql.asp)