---
# System prepended metadata

title: UCSC db
tags: [資料庫, 基因體/三級分析/資料庫, 基因體, db, ucsc, 生物資訊]

---

UCSC db
===
###### tags: `基因體/三級分析/資料庫`
###### tags: `生物資訊`, `基因體`, `資料庫`, `ucsc`, `db`

<br>

## 測試環境
- Ubuntu 16.04

<br>

## 安裝 mysql client 套件
```bash
$ mysql  # 查看此指令是否存在
```

若不存在，會顯示安裝提示，並協助你安裝 MariaDb 或 MySQL 套件
```bash
$ sudo apt install mysql-client-core-5.7
```

<br>

## 連線到 UCSC db
```
$ mysql --host=genome-mysql.cse.ucsc.edu --user=genome
```
- 用法介紹
    - 新版說明：[UCSC > Downloads > MySQL Access](https://genome.ucsc.edu/goldenPath/help/mysql.html)
    - 舊版說明：http://rohsdb.usc.edu/GBshape/goldenPath/help/mysql.html
    - host 主機名稱
    - user 使用者名稱

<br>

## 離開資料庫
```mysql
mysql> exit;
Bye
```

<br>

## 預備知識
- [UCSC 的版本介紹](https://hackmd.io/3dTx6dqhTBOKbg8-RsR3PA)
    - 基因體版本 hg 就是「資料庫名稱」
- 常見表格
    - <b style="padding: 0.2em;background: lightgray;">[refSeq](https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=hg19&hgta_doSchemaTable=refGene)</b> (近即時資料，連接 NCBI)
         > gene prediction with some additional info.
         > 資料來自 NCBI (RefSeq gene predictions from NCBI)
         
         - **name** (主鍵)
             - RefSeq ID, 參考序列編號, 以流水號表示
             - 正確來講，應為 transcript id (轉錄子 ID)
             - 亦可稱為「基因名稱(gene name)」
                 非人類可讀的「基因符號(gene symbol)」
         - **染色體名稱**
         - **正反股**
             ![](https://i.imgur.com/iskJUOK.png)
             - 正股(+)：Coding Strand, 編碼股（大陸：編碼鏈）
             - 反股(-)：Template Strand, 模板股（大陸：模板鏈）
         - 基因的「起始位置」＆「結束位置」
         - 編碼區域的「起始位置」＆「結束位置」
         - 外顯子資訊
         <br>
     - <b style="padding: 0.2em;background: lightgray;">[knownGene](https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=757737243_iyrwaPXTlRlqr72fx7dcW50v1Bga&hgta_doSchemaDb=hg19&hgta_doSchemaTable=knownGene)</b> (2013-06-14)
         > Transcript from default gene set in UCSC browser
         > UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics)
        - 基本欄位名稱，與 refGene 相同
        - 差別：
            - refGene 近即時資料，隨時間更新
            - knownGene 固定資料，不再更新
         <br>
     - <b style="padding: 0.2em;background: lightgray;">[kgXref](https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=hg19&hgta_doSchemaTable=kgXref)</b> (2013-06-14, kg=KnownGene)
        > Links a Known Gene ID with mRNA, UniProtKB, RefSeq, and NCBI accessions/IDs
        > kgXXX 通常與 knownGene 搭配
        - **geneSymbol**
            基因符號，人類可讀的符號，如 BRCA1, BRCA2, APOE
        - **refseq**
             RefSeq ID, 參考序列編號, 以流水號表示
             亦為「基因名稱(gene name)」
             非人類可讀的「基因符號(gene symbol)」
<br>
<hr>
<hr>

## [概念1] 查看資料表格的綱要
- ```describe [db_name.]table.name```
    - for formated output
- ```show create table [db_name.]table_name;```
    - for the SQL statement

## 查看 [hg19] 的 [refGene] 表格的綱要
- db_name: hg19
- table: refGene
    ```mysql
    mysql> describe hg19.refGene;

    +--------------+------------------------------------+------+-----+---------+-------+
    | Field        | Type                               | Null | Key | Default | Extra |
    +--------------+------------------------------------+------+-----+---------+-------+
    | bin          | smallint(5) unsigned               | NO   |     | NULL    |       |
    | name         | varchar(255)                       | NO   | MUL | NULL    |       |
    | chrom        | varchar(255)                       | NO   | MUL | NULL    |       |
    | strand       | char(1)                            | NO   |     | NULL    |       |
    | txStart      | int(10) unsigned                   | NO   |     | NULL    |       |
    | txEnd        | int(10) unsigned                   | NO   |     | NULL    |       |
    | cdsStart     | int(10) unsigned                   | NO   |     | NULL    |       |
    | cdsEnd       | int(10) unsigned                   | NO   |     | NULL    |       |
    | exonCount    | int(10) unsigned                   | NO   |     | NULL    |       |
    | exonStarts   | longblob                           | NO   |     | NULL    |       |
    | exonEnds     | longblob                           | NO   |     | NULL    |       |
    | score        | int(11)                            | YES  |     | NULL    |       |
    | name2        | varchar(255)                       | NO   | MUL | NULL    |       |
    | cdsStartStat | enum('none','unk','incmpl','cmpl') | NO   |     | NULL    |       |
    | cdsEndStat   | enum('none','unk','incmpl','cmpl') | NO   |     | NULL    |       |
    | exonFrames   | longblob                           | NO   |     | NULL    |       |
    +--------------+------------------------------------+------+-----+---------+-------+
    16 rows in set (0.27 sec)
    ```
    [![](https://i.imgur.com/d76g7Ny.png)](https://i.imgur.com/d76g7Ny.png)
    ```SELECT * FROM hg19.refGene LIMIT 1;```
    <br>
- 相同 name，不同 bin，不知道如何解釋
    ```sql
    SELECT bin, name, txStart, txEnd 
    FROM hg19.refGene 
    WHERE name = 'NR_028325';
    ```
    ```
    +------+-----------+-----------+-----------+
    | bin  | name      | txStart   | txEnd     |
    +------+-----------+-----------+-----------+
    |  587 | NR_028325 |    323891 |    328581 |
    | 1964 | NR_028325 | 180750506 | 180755196 |
    +------+-----------+-----------+-----------+
    2 rows in set (0.19 sec)
    ```
    會造成 join 的時候，產生多筆資料

<br>

## 查看 [hg19] 的 [kgXref] 表格的綱要
- db_name: hg19
- table: kgXref
    ```mysql
    mysql> describe hg19.kgXref;

    +-------------+--------------+------+-----+---------+-------+
    | Field       | Type         | Null | Key | Default | Extra |
    +-------------+--------------+------+-----+---------+-------+
    | kgID        | varchar(255) | NO   | MUL | NULL    |       |
    | mRNA        | varchar(255) | NO   | MUL | NULL    |       |
    | spID        | varchar(255) | NO   | MUL | NULL    |       |
    | spDisplayID | varchar(255) | NO   | MUL | NULL    |       |
    | geneSymbol  | varchar(255) | NO   | MUL | NULL    |       |
    | refseq      | varchar(255) | NO   | MUL | NULL    |       |
    | protAcc     | varchar(255) | NO   | MUL | NULL    |       |
    | description | longblob     | NO   |     | NULL    |       |
    | rfamAcc     | varchar(255) | NO   | MUL | NULL    |       |
    | tRnaName    | varchar(255) | NO   | MUL | NULL    |       |
    +-------------+--------------+------+-----+---------+-------+
    10 rows in set (1.14 sec)
    ```
    [![](https://i.imgur.com/g4CDQA5.png)](https://i.imgur.com/g4CDQA5.png)
    ```SELECT * FROM hg19.kgXref LIMIT 1;```

<br>
<hr>
<hr>

## [概念2] 查看「基因名稱」、「起始位置」、「結束位置」
- 版本：hg19
- 需要表格
    - **kgXref**
        - geneSymbol 基因名稱
        - refseq 參考序列編號
    - **knownGene**
        - name 參考序列編號
        - txStart 序列起始
        - txEnd 序列結束
    <br>
- SQL: table
    - **kgXref**
        ```sql
        SELECT geneSymbol, refseq from hg19.kgXref;
        ```
        - 82960 rows in set (13.15 sec)
    
    - **knownGene**
        ```sql
        SELECT name, txStart, txEnd from hg19.knownGene;
        ```
        - 75327 rows in set (3.04 sec)
- SQL: join
    ```sql
    SELECT X.geneSymbol, Y.name, X.refseq, Y.txStart, Y.txEnd
    FROM 
        hg19.kgXref as X 
        LEFT JOIN
        hg19.knownGene as Y
    ON X.kgID = Y.name;
    ```
    - 82960 rows in set (2.86 sec)
    - 若將 Y: knownGene 改成 Y: refGene，則 join 條件是 
        ```X.refseq = Y.name```

<br>
<hr>
<hr>

## GeneEpi
### 抓取 GenEpi 資料（取前 10 筆為例）
```mysql
mysql> use hg19
Database changed

mysql> SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM (SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')) AS L1 GROUP BY geneSymbol) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' ORDER BY CAST(chr AS UNSIGNED), txStart limit 10;
+-----+---------+--------+--------+--------------+
| chr | txStart | txEnd  | strand | geneSymbol   |
+-----+---------+--------+--------+--------------+
| 1   |   10873 |  14409 | +      | DDX11L1      |
| 1   |   14361 |  30370 | -      | WASH7P       |
| 1   |   34610 |  37081 | -      | FAM138F      |
| 1   |   68090 |  70008 | +      | OR4F5        |
| 1   |  134772 | 141566 | -      | LOC729737    |
| 1   |  322891 | 328581 | +      | LOC100132062 |
| 1   |  366658 | 368597 | +      | OR4F29       |
| 1   |  661138 | 666731 | -      | LOC100133331 |
| 1   |  700244 | 715068 | -      | LOC100288069 |
| 1   |  761585 | 763902 | -      | LINC00115    |
+-----+---------+--------+--------+--------------+
10 rows in set (2.47 sec)
```
<br>
- 實際上有 22376 筆資料

### 分析上面的 SQL 指令
```sql
SELECT 
	chr, 
	CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, 
	CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, 
	strand, 
	geneSymbol
FROM (
	SELECT
		REPLACE(chr, 'chr', '') AS chr, 
		txStart, 
		txEnd, 
		strand, 
		geneSymbol, 
		MAX(ABS(txEnd-txStart)) 
	FROM ( 
		SELECT 
			knownGene.chrom AS chr, 
			knownGene.txStart AS txStart, 
			knownGene.txEnd AS txEnd, 
			knownGene.strand AS strand, 
			kgXref.geneSymbol AS geneSymbol 
		FROM kgXref INNER JOIN knownGene 
		ON kgXref.kgID=knownGene.name
		WHERE 
			LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')
	) AS L1
	GROUP BY geneSymbol
) AS L2
WHERE
	LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') 
	AND chr NOT LIKE '%\\_%' 
ORDER BY 
	CAST(chr AS UNSIGNED), txStart
;
```

<br>

### 根據 GeneEpi 資料，過濾出 APOE 基因
```
mysql> SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM (SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')) AS L1 GROUP BY geneSymbol) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' AND geneSymbol like 'APOE%' ORDER BY CAST(chr AS UNSIGNED), txStart ;
```
- 比較 hg19 和 hg38 的資料
    - [Genome 版本資訊](https://hackmd.io/qmzwm4idRyWROeeboqpfSw#%E9%81%8E%E5%8E%BB%E8%A8%88%E7%95%AB%EF%BC%86%E8%A8%88%E7%95%AB%E4%B8%AD)
    - use hg19 (較舊)
        | chr | txStart  | txEnd    | strand | geneSymbol |
        |-----|----------|----------|--------|------------|
        | 19  | 45408038 | 45412650 | +      | APOE       |
    - use hg38 (最新)
        | chr | txStart  | txEnd    | strand | geneSymbol |
        |-----|----------|----------|--------|------------|
        | 19  | 44904753 | 44909393 | +      | APOE       |
    - Wiki
        | chr | txStart  | txEnd    | strand | geneSymbol |
        |-----|----------|----------|--------|------------|
        | 19  | 44905791 | 44909393 | +      | [APOE](https://en.wikipedia.org/wiki/Apolipoprotein_E)      |
        - Wiki 的資料，與 UCSC 的資料都不一致

<br>

### 根據 GeneEpi 資料，過濾出 BRCA 基因
```
mysql> SELECT chr, CASE WHEN strand='+' THEN txStart-1000 ELSE txStart END AS txStart, CASE WHEN strand='-' THEN txEnd+1000 ELSE txEnd END AS txEnd, strand, geneSymbol FROM (SELECT REPLACE(chr, 'chr', '') AS chr, txStart, txEnd, strand, geneSymbol, MAX(ABS(txEnd-txStart)) FROM ( SELECT knownGene.chrom AS chr, knownGene.txStart AS txStart, knownGene.txEnd AS txEnd, knownGene.strand AS strand, kgXref.geneSymbol AS geneSymbol FROM kgXref INNER JOIN knownGene ON kgXref.kgID=knownGene.name WHERE LEFT(kgXref.mRNA, 2) IN ('NR', 'NM')) AS L1 GROUP BY geneSymbol) AS L2 WHERE LEFT(chr, 1) NOT IN ('X', 'Y', 'M', 'U') AND chr NOT LIKE '%\\_%' AND geneSymbol like 'BRCA%' ORDER BY CAST(chr AS UNSIGNED), txStart ;
```
- 比較 hg19 和 hg38 的資料
    - [Genome 版本資訊](https://hackmd.io/qmzwm4idRyWROeeboqpfSw#%E9%81%8E%E5%8E%BB%E8%A8%88%E7%95%AB%EF%BC%86%E8%A8%88%E7%95%AB%E4%B8%AD)
    - use hg19 (較舊)
        | chr | txStart  | txEnd    | strand | geneSymbol | geneLength |
        |-----|----------|----------|--------|------------|-----------|
        | 13  | 32888616 | 32973809 | +      | BRCA2      | 85192 |
        | 17  | 41196311 | 41278500 | -      | BRCA1      | 82188 |
    - use hg38 (最新)
        | chr | txStart  | txEnd    | strand | geneSymbol | geneLength |
        |-----|----------|----------|--------|------------|-----------|
        | 13  | 32314479 | 32399668 | +      | BRCA2      | 84791 |
        | 17  | 43044294 | 43126483 | -      | BRCA1      | 82188 |
    - Wiki
        | chr | txStart  | txEnd    | strand | geneSymbol | geneLength |
        |-----|----------|----------|--------|------------|-----------|
        | 13  | 32315474 | 32400266 | +      | [BRCA2](https://en.wikipedia.org/wiki/BRCA2)      | 84791 |
        | 17  | 43044295 | 43170245 | -      | [BRCA1](https://en.wikipedia.org/wiki/BRCA1)      | 125949 |
        - [BRCA1](https://en.wikipedia.org/wiki/BRCA1)
            ![](https://i.imgur.com/VDMn9qU.png)
        - [BRCA2](https://en.wikipedia.org/wiki/BRCA2)
            ![](https://i.imgur.com/3CKzV7I.png)
        - Wiki 的資料，與 UCSC 的資料都不一致




<br>

## 怪異問題集
- 在 knwonGene 資料表中，基因起始點相同，卻有不同的結束點？蛋白質編碼區也是
    ```sql
    select name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount from hg19.knownGene where txStart = 41243451;
    ```
    [![](https://i.imgur.com/RgUPej1.png)](https://i.imgur.com/RgUPej1.png)
    - 位於 chr17 的 BRCA1 基因
    - txStart = 41243451
    - 可能解釋
        - [選擇性剪接](https://zh.wikipedia.org/wiki/%E9%81%B8%E6%93%87%E6%80%A7%E5%89%AA%E6%8E%A5)
            > 選擇性剪接便是利用這樣的特性，一條未經剪接的RNA，含有的多種外顯子被剪成的**不同組合**，可轉譯出**不同的蛋白質**。
        - [RNA剪接](https://zh.wikipedia.org/wiki/RNA%E5%89%AA%E6%8E%A5)

- [knownGene]&[kgXref] v.s. [refGene]，在這兩個表格中，發現資料不一致
    ```sql
    SELECT 
        name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, 
        kgID, mRNA, geneSymbol, refseq, protAcc 
    FROM
        hg19.kgXref AS A 
        INNER JOIN 
        hg19.knownGene AS B
        ON A.kgID = B.name
    WHERE geneSymbol = 'BRCA1'
    ORDER BY txStart, refseq;
    ```
    [![](https://i.imgur.com/TCipx3W.png)](https://i.imgur.com/TCipx3W.png)
    - txEnd: 41277468
    <br>
    <br>
    ```sql
    SELECT 
        name, chrom, strand, txStart, txEnd, 
        cdsStart, cdsEnd, exonCount
    FROM hg19.refGene
    WHERE chrom = 'chr17'
        AND 41196311 <= txStart
        AND txStart <= 41243451
    ORDER BY name;
    ```
    ![](https://i.imgur.com/h3U1W1N.png)
    - txEnd: 41277381

<br>

## 參考資料
- [UCSC > Downloads > MySQL Access](https://genome.ucsc.edu/goldenPath/help/mysql.html)
- [[stackoverflow] How do I show the schema of a table in a MySQL database?](https://stackoverflow.com/questions/1498777/how-do-i-show-the-schema-of-a-table-in-a-mysql-database)
- [[w3schools.com] MySQL Functions](https://www.w3schools.com/sql/sql_ref_mysql.asp)