Reference Genome
===
###### tags: `Genomics/Secondary analysis`
###### tags: `Bioinformatics`, `Genomics`, `Database`, `Genome Reference`, `WGS (whole-genome sequencing)`, `Base encoding`
<br>
[TOC]
<br>
## Reference sequences
### T2T-CHM13
- ### [How the T2T Consortium completed the first "complete" human genome sequence](https://www.blossombio.com/eNews/20210804/index.html)
    - GRCh38.p13 is missing nearly 8% of the sequence.
    - The missing sequence lies mainly in heterochromatin and other complex regions:
        - centromeric satellite arrays
        - rDNA arrays
        - subtelomeric regions
- [[Third-generation sequencing news] I want it all! With PacBio HiFi reads, being greedy is fine](https://www.toolsbiotech.com/news_detail.php?id=39)

### Q & A
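- ### [The last piece of the Human Genome Project puzzle: whose genome should represent all of humanity?](https://www.thenewslens.com/article/174038/fullpage)
    - About 8% of the current reference genome (roughly 200 million base pairs) still cannot be fully resolved:
        - telomere sequences
        - centromere sequences
        - ribosomal DNA (rDNA) array sequences
    - Diploidy makes assembly and analysis harder.
    - Scientists found that a special cell line designated CHM13 (Complete Hydatidiform Mole 13) can help solve this problem. It arose from an abnormal fertilization event, so the resulting cells contain only sperm-derived DNA, which makes the genome effectively homozygous and sidesteps the diploid assembly problem.
    - "Whose genome should represent humanity?"
    - Giving every person their own complete genome sequence may be the ultimate route to personalized medicine and highly accurate clinical diagnosis.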
<br>
<hr>
<br>
## Computing the sequence length of a FASTA file
### Step 1: Measure the size of the header lines
- [Step 1-1] Extract the header lines
```bash
$ grep '>' human_g1k_v37_decoy.fasta > head.txt
```
- [Step 1-2] Count the header lines and bytes
```bash
$ wc head.txt
86 258 5052 head.txt
# 86 lines, 5052 bytes in total (newline characters included)
```
### Step 2: Measure the size of the whole FASTA file
```bash
$ wc human_g1k_v37_decoy.fasta
52290996 52291168 3189750467 human_g1k_v37_decoy.fasta
# 52290996 lines, 3189750467 bytes in total
```
### Step 3: Subtract the header bytes and the sequence newlines from the FASTA size; what remains is the sequence length
```
3189750467 - 5052 - (52290996-86) = 3137454505
```
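- Where:
    - ```3189750467```: total size of the FASTA file in bytes
    - ```5052```: size of the header lines in bytes, including their 86 newline characters
    - ```52290996-86```: number of newline characters ('\n') on sequence lines = newlines in the FASTA file - newlines in the header lines
### The sequence length is about 3.137 Gbp (3,137,454,505 bases)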
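The three steps above can be wrapped in a small helper and cross-checked against a direct count. This is a minimal sketch, not part of the original notes: the function names `fasta_len_wc` and `fasta_len_awk` are made up for illustration, and the byte arithmetic assumes one byte per base, LF line endings, and a newline at the end of the file.

```bash
#!/usr/bin/env bash
# Sketch only: both function names are illustrative.

# Byte arithmetic from Steps 1-3:
#   sequence bytes = total bytes - header bytes - newlines on sequence lines
# Assumes one byte per base, LF line endings, and a trailing newline.
fasta_len_wc() {
    local fasta=$1 total_bytes total_lines header_bytes header_lines
    total_bytes=$(wc -c < "$fasta")
    total_lines=$(wc -l < "$fasta")
    header_bytes=$(grep '^>' "$fasta" | wc -c)   # includes the header newlines
    header_lines=$(grep -c '^>' "$fasta")
    echo $(( total_bytes - header_bytes - (total_lines - header_lines) ))
}

# Cross-check: add up the length of every non-header line.
fasta_len_awk() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}
```

If the assumptions hold, `fasta_len_wc human_g1k_v37_decoy.fasta` and `fasta_len_awk human_g1k_v37_decoy.fasta` should agree with each other and with the figure derived by hand above.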
### [Supplement] Sequence header lines
```
$ grep '>' human_g1k_v37_decoy.fasta
```
```fasta=
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
>2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
>4 dna:chromosome chromosome:GRCh37:4:1:191154276:1
>5 dna:chromosome chromosome:GRCh37:5:1:180915260:1
>6 dna:chromosome chromosome:GRCh37:6:1:171115067:1
>7 dna:chromosome chromosome:GRCh37:7:1:159138663:1
>8 dna:chromosome chromosome:GRCh37:8:1:146364022:1
>9 dna:chromosome chromosome:GRCh37:9:1:141213431:1
>10 dna:chromosome chromosome:GRCh37:10:1:135534747:1
>11 dna:chromosome chromosome:GRCh37:11:1:135006516:1
>12 dna:chromosome chromosome:GRCh37:12:1:133851895:1
>13 dna:chromosome chromosome:GRCh37:13:1:115169878:1
>14 dna:chromosome chromosome:GRCh37:14:1:107349540:1
>15 dna:chromosome chromosome:GRCh37:15:1:102531392:1
>16 dna:chromosome chromosome:GRCh37:16:1:90354753:1
>17 dna:chromosome chromosome:GRCh37:17:1:81195210:1
>18 dna:chromosome chromosome:GRCh37:18:1:78077248:1
>19 dna:chromosome chromosome:GRCh37:19:1:59128983:1
>20 dna:chromosome chromosome:GRCh37:20:1:63025520:1
>21 dna:chromosome chromosome:GRCh37:21:1:48129895:1
>22 dna:chromosome chromosome:GRCh37:22:1:51304566:1
>X dna:chromosome chromosome:GRCh37:X:1:155270560:1
>Y dna:chromosome chromosome:GRCh37:Y:2649521:59034049:1
>MT gi|251831106|ref|NC_012920.1| Homo sapiens mitochondrion, complete genome
>GL000207.1 dna:supercontig supercontig::GL000207.1:1:4262:1
>GL000226.1 dna:supercontig supercontig::GL000226.1:1:15008:1
>GL000229.1 dna:supercontig supercontig::GL000229.1:1:19913:1
>GL000231.1 dna:supercontig supercontig::GL000231.1:1:27386:1
>GL000210.1 dna:supercontig supercontig::GL000210.1:1:27682:1
>GL000239.1 dna:supercontig supercontig::GL000239.1:1:33824:1
>GL000235.1 dna:supercontig supercontig::GL000235.1:1:34474:1
>GL000201.1 dna:supercontig supercontig::GL000201.1:1:36148:1
>GL000247.1 dna:supercontig supercontig::GL000247.1:1:36422:1
>GL000245.1 dna:supercontig supercontig::GL000245.1:1:36651:1
>GL000197.1 dna:supercontig supercontig::GL000197.1:1:37175:1
>GL000203.1 dna:supercontig supercontig::GL000203.1:1:37498:1
>GL000246.1 dna:supercontig supercontig::GL000246.1:1:38154:1
>GL000249.1 dna:supercontig supercontig::GL000249.1:1:38502:1
>GL000196.1 dna:supercontig supercontig::GL000196.1:1:38914:1
>GL000248.1 dna:supercontig supercontig::GL000248.1:1:39786:1
>GL000244.1 dna:supercontig supercontig::GL000244.1:1:39929:1
>GL000238.1 dna:supercontig supercontig::GL000238.1:1:39939:1
>GL000202.1 dna:supercontig supercontig::GL000202.1:1:40103:1
>GL000234.1 dna:supercontig supercontig::GL000234.1:1:40531:1
>GL000232.1 dna:supercontig supercontig::GL000232.1:1:40652:1
>GL000206.1 dna:supercontig supercontig::GL000206.1:1:41001:1
>GL000240.1 dna:supercontig supercontig::GL000240.1:1:41933:1
>GL000236.1 dna:supercontig supercontig::GL000236.1:1:41934:1
>GL000241.1 dna:supercontig supercontig::GL000241.1:1:42152:1
>GL000243.1 dna:supercontig supercontig::GL000243.1:1:43341:1
>GL000242.1 dna:supercontig supercontig::GL000242.1:1:43523:1
>GL000230.1 dna:supercontig supercontig::GL000230.1:1:43691:1
>GL000237.1 dna:supercontig supercontig::GL000237.1:1:45867:1
>GL000233.1 dna:supercontig supercontig::GL000233.1:1:45941:1
>GL000204.1 dna:supercontig supercontig::GL000204.1:1:81310:1
>GL000198.1 dna:supercontig supercontig::GL000198.1:1:90085:1
>GL000208.1 dna:supercontig supercontig::GL000208.1:1:92689:1
>GL000191.1 dna:supercontig supercontig::GL000191.1:1:106433:1
>GL000227.1 dna:supercontig supercontig::GL000227.1:1:128374:1
>GL000228.1 dna:supercontig supercontig::GL000228.1:1:129120:1
>GL000214.1 dna:supercontig supercontig::GL000214.1:1:137718:1
>GL000221.1 dna:supercontig supercontig::GL000221.1:1:155397:1
>GL000209.1 dna:supercontig supercontig::GL000209.1:1:159169:1
>GL000218.1 dna:supercontig supercontig::GL000218.1:1:161147:1
>GL000220.1 dna:supercontig supercontig::GL000220.1:1:161802:1
>GL000213.1 dna:supercontig supercontig::GL000213.1:1:164239:1
>GL000211.1 dna:supercontig supercontig::GL000211.1:1:166566:1
>GL000199.1 dna:supercontig supercontig::GL000199.1:1:169874:1
>GL000217.1 dna:supercontig supercontig::GL000217.1:1:172149:1
>GL000216.1 dna:supercontig supercontig::GL000216.1:1:172294:1
>GL000215.1 dna:supercontig supercontig::GL000215.1:1:172545:1
>GL000205.1 dna:supercontig supercontig::GL000205.1:1:174588:1
>GL000219.1 dna:supercontig supercontig::GL000219.1:1:179198:1
>GL000224.1 dna:supercontig supercontig::GL000224.1:1:179693:1
>GL000223.1 dna:supercontig supercontig::GL000223.1:1:180455:1
>GL000195.1 dna:supercontig supercontig::GL000195.1:1:182896:1
>GL000212.1 dna:supercontig supercontig::GL000212.1:1:186858:1
>GL000222.1 dna:supercontig supercontig::GL000222.1:1:186861:1
>GL000200.1 dna:supercontig supercontig::GL000200.1:1:187035:1
>GL000193.1 dna:supercontig supercontig::GL000193.1:1:189789:1
>GL000194.1 dna:supercontig supercontig::GL000194.1:1:191469:1
>GL000225.1 dna:supercontig supercontig::GL000225.1:1:211173:1
>GL000192.1 dna:supercontig supercontig::GL000192.1:1:547496:1
>NC_007605
>hs37d5
```
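Several of the headers listed above carry no coordinates (`MT`, `NC_007605`, `hs37d5`), and the `Y` header reports a sub-range rather than a full length, so per-record lengths are safer to measure from the sequence lines themselves. The following is a minimal sketch under that reading; `fasta_record_lengths` is an illustrative name, and the record name is taken to be the first whitespace-delimited token after `>`.

```bash
#!/usr/bin/env bash
# Sketch only: fasta_record_lengths is an illustrative name.
# Prints "<name><TAB><length>" for each record in the FASTA file,
# where <name> is the first whitespace-delimited token after '>'.
fasta_record_lengths() {
    awk '
        /^>/ {
            # A new header: flush the previous record, then start a new one.
            if (name != "") print name "\t" len
            name = substr($1, 2); len = 0
            next
        }
        { len += length($0) }          # sequence line: accumulate its length
        END { if (name != "") print name "\t" len }
    ' "$1"
}
```

Running `fasta_record_lengths human_g1k_v37_decoy.fasta` would give one line per record; the values for `MT` and `NC_007605` can then be compared with the lengths listed in the RefSeq entries further down.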
<br>
<hr>
<br>
## Mitochondrion ([NC_012920](https://www.genome.jp/dbget-bin/www_bget?refseq:NC_012920))
- LOCUS
    - NC_012920
    - 16569 bp
    - DNA
- DEFINITION
    - Homo sapiens mitochondrion, complete genome
## Human gammaherpesvirus 4 ([NC_007605](https://www.genome.jp/dbget-bin/www_bget?refseq:NC_007605))
- LOCUS
    - NC_007605
    - 171823 bp
    - DNA
- DEFINITION
    - Human gammaherpesvirus 4, complete genome.
<br>
<hr>
<br>
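## References
- ### [Some notes on the human reference genome](https://www.plob.org/article/21355.html)
- #### What is an ALT contig?
> To better capture variation in the human genome across the world;
> it (the initial hg38 release) contains more copies of some loci than hg19 (initial release).
>
- #### What are patches?
- #### ALT contig
- Reference sequences that are placed at more than one location. For example, some regions of chromosome X are also present on chromosome Y (the pseudoautosomal regions, PARs). In a standard pipeline, variants cannot be called in these regions (how should this be understood? The first point above mentions it as well). The correct approach is to mask these regions on chromosome Y.
- #### Encoding ambiguous bases as N
- [IUB Nucleotide Codes](http://biocorp.ca/IUB.php)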
| Code | Definition | Mnemonic |
|:----:|:----------:|:----------:|
| A | Adenine | A |
| C | Cytosine | C |
| G | Guanine | G |
| T | Thymine | T |
| R | AG | puRine |
| Y | CT | pYrimidine |
| K | GT | Keto |
| M | AC | aMino |
| S | GC | Strong |
| W | AT | Weak |
| B | CGT | Not A |
| D | AGT | Not C |
| H | ACT | Not G |
| V | ACG | Not T |
| N | AGCT | aNy |
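To see how many bases in a reference are `N` or one of the other ambiguity codes in the table above, the characters on the sequence lines can simply be tallied. This is a rough sketch rather than an established tool: `base_composition` is an illustrative name, lowercase (soft-masked) bases are merged with uppercase ones, and the per-character loop will be slow on a full human genome.

```bash
#!/usr/bin/env bash
# Sketch only: base_composition is an illustrative name.
# Prints "<code><TAB><count>" for every distinct character found on the
# sequence lines, sorted by code. Lowercase bases are counted as uppercase.
base_composition() {
    grep -v '^>' "$1" | awk '
        {
            line = toupper($0)
            n = length(line)
            for (i = 1; i <= n; i++) count[substr(line, i, 1)]++
        }
        END { for (c in count) print c "\t" count[c] }
    ' | LC_ALL=C sort
}
```

Any code other than `A`, `C`, `G`, or `T` in the output marks positions where the reference does not commit to a single base.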
- #### human_g1k_v37.fasta
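- Contains the 24 human chromosome-level sequences (1-22, X, Y), the mitochondrial sequence, and the contig sequences that have not been placed.
- hs37d5.fa contains all of the above plus NC_007605 and hs37d5:
- NC_007605 is a human herpesvirus sequence, Human herpesvirus 4 type 1, also known as EBV (Epstein-Barr virus);
- hs37d5 is the decoy sequence, made up of many short sequences separated by runs of N.
- What are these short sequences, and what are they for?
> These short sequences can simply be thought of as human-derived sequence that the reference does not contain (a limitation of assembly technology). How were they found? By comparing existing clone sequences against the reference. The reference genome is incomplete, so some regions are always absent from it; reads that originate from those regions either fail to align or align incorrectly somewhere else (causing false positives). Adding the sequences that are already known but missing from the reference improves the situation.
>
> The EBV sequence serves a similar purpose to the decoy sequences: both help reads align to where they truly belong. The difference is that EBV is not part of the human genome, but cells may contain it, so it can show up when the extracted DNA is sequenced.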
- ### [2020-01-15 Understanding the different versions of the human reference genome and how to choose](https://www.codenong.com/jse65115b4633a/)
- #### To align to GRCh37 (hg19), use hs37-1kg:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
- #### To align to GRCh37 when you believe decoy sequences help variant calling, use hs37d5:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
- #### To align to GRCh38 (hg38):
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
- #### The yellow part is the flanking sequence, which has a regulatory role
![](https://i.imgur.com/7IueIIu.png)
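- The poly-A tail protects the mRNA
- #### Pseudoautosomal regions (PARs)
- Homologous nucleotide sequences on the X and Y chromosomes. Pseudoautosomal genes (at least 29 found so far) are inherited in an autosomal rather than a sex-linked pattern.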

- #### Not including unplaced and unlocalized contigs.
- If the reference omits the unlocalized and unplaced sequences, reads from those sequences are forced to map to other chromosomes, which produces false variant calls.

- ### [Which human reference genome to use?](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use)
- ### [Human genome reference builds - GRCh38 or hg38 - b37 - hg19](https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19)

