Secondary Analysis / Reference Sequences
===
###### tags: `Genomics/Secondary Analysis`
###### tags: `Bioinformatics`, `Genomics`, `Secondary Analysis`, `Reference Sequence`, `fasta`, `fna`, `fa`, `Reference`, `BWA`, `BWA-MEM`, `Burrows-Wheeler Aligner`, `Sequence Alignment`, `GATK`, `Nvidia Clara Parabricks`, `BAM`, `VCF`
<br>
[TOC]
<br>
## Version overview
| NCBI release date | NCBI version | UCSC version | Ensembl version | Broad Institute version |
|---------------|------------------------------|-----------|---------------------------|----------------------|
| 2022/01 | T2T-CHM13v2.0 (complete human reference sequence) | hs1 | | |
| 2013/12 | [GRCh38](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) | hg38 | release_76/77/78/80/81/82 | |
| 2009/02 | GRCh37 | hg19 | release_59/61/64/68/69/75 | b37 (enhanced GRCh37) |
| 2006/03 | NCBI Build 36.1 | hg18 | release_52 | |
| 2004/05 | NCBI Build 35 | hg17 | | |
| 2003/07 | NCBI Build 34 | hg16 | | |
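- ### Other versions
- **1000 Genomes Project version**
- [humanG1Kv37 = b37 minus human herpesvirus 4 type 1 (the herpesvirus sequence is not included)](http://toolsbiotech.blog.fc2.com/blog-entry-119.html)
- **GRCh37 hs37d5 version**
- [hs37d5 = b37 + decoy](https://blog.csdn.net/u014182497/article/details/84032261)
- Adds a viral sequence (herpesvirus)
- An upgraded version of b37
- [A modified version based on GRCh37](https://www.prismabiotech.com.tw/post/dragen上grch37參考序列版本的差別)
- **[What are decoy sequences?](https://www.prismabiotech.com.tw/post/dragen上grch37參考序列版本的差別)**
- Decoy sequences are not part of the human reference sequence. One example is the EBV viral sequence, which frequently shows up in raw sequencing data.
- [Gist] A workaround that keeps foreign DNA from being aligned to the human reference sequence.
> "Decoy" means bait. Decoy sequences are added to improve the accuracy of the analysis: an alignment algorithm will, to some extent, search exhaustively for the best available hit, so foreign DNA gets forced onto the genome. Adding decoy sequences keeps that foreign DNA from aligning to the core genome. A side benefit is faster alignment, because the aligner can stop searching early once it has found an optimal match.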
- ### Terminology
- **NCBI**
National Center for Biotechnology Information
- **UCSC**
University of California, Santa Cruz
- **GRCh**
Genome Reference Consortium - human build XX
<br>
<hr>
<br>
## Sequence types
> There appear to be the following five types.
### assembled chromosomes
> **"Assembled chromosomes" refer to the sequences of chromosomes that have been fully assembled and can be ordered and oriented in a specific order. These are typically the most complete and accurate representations of a genome.**
- chr1~chr22, chrX, chrY, chrM
### unlocalized scaffolds
> "Unlocalized scaffolds" refer to sequences that have been assembled into larger contiguous pieces, but their location on a specific chromosome is not yet determined.
> - In short:
> the chromosome of origin is known, but the position within that chromosome is not.
- chr1_KI270706v1_random ~ chr1_KI270716v1_random
- chr2_KI270715v1_random
- ...
- chr22_KI270739v1_random
- chrY_KI270740v1_random
### unplaced scaffolds
> "Unplaced scaffolds" refer to sequences that have been assembled into larger contiguous pieces but cannot be assigned to a specific chromosome location. These sequences may represent areas of the genome that are difficult to assemble or regions that are highly variable between individuals or populations.
> - In short:
> the chromosome of origin is not known.
- chrUn_KI270302v1
- chrUn_KI270304v1
- ...
- chrUn_GL000218v1
### alternate sequences
> - In short:
> some chromosomal regions are so variable that a single reference sequence cannot represent them.
- chr1_KI270762v1_alt
- chr1_KI270766v1_alt
- chr1_KI270760v1_alt
- ...
- chr19_KI270938v1_alt
### decoy sequences
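> Decoy sequences are not part of the human genome assembly. They are added to the reference FASTA used for alignment so that reads from foreign or otherwise unrepresented DNA (EBV, for example) align to the decoys instead of being forced onto the primary assembly. The `HLA-*` entries listed below are HLA allele sequences that ship in the same FASTA; strictly speaking they are not decoys.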
- chrEBV
- chrUn_KN707606v1_decoy
- chrUn_KN707607v1_decoy
- chrUn_KN707608v1_decoy
- ...
- chrUn_KN707992v1_decoy
- chrUn_JTFH01000001v1_decoy
- chrUn_JTFH01000002v1_decoy
- chrUn_JTFH01000003v1_decoy
- ...
- chrUn_JTFH01001998v1_decoy
- `HLA-A*01:01:01:01`
- `HLA-A*01:01:01:02N`
- ...
- `HLA-B*07:02:01`
- `HLA-B*07:05:01`
- ...
- `HLA-C*01:02:01`
- `HLA-C*01:02:11`
- ...
- `HLA-DQA1*01:01:02`
- ...
- `HLA-DQB1*02:01:01`
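
The five categories above can largely be told apart from the contig name alone. The following is a minimal sketch, assuming the UCSC-style names listed in this section (`_random`, `chrUn_`, `_alt`, `_decoy`, `HLA-`); the function name `classify_contig` and the returned labels are hypothetical, and the rules are a naming heuristic rather than an official classification.

```python
import re

# Primary chromosomes in UCSC naming: chr1..chr22, chrX, chrY, chrM.
_PRIMARY = re.compile(r"chr([1-9]|1[0-9]|2[0-2]|X|Y|M)")


def classify_contig(name: str) -> str:
    """Guess the sequence type of a contig from its UCSC-style name."""
    if name.startswith("HLA-"):
        return "hla"
    # Check the decoy suffix before the chrUn_ prefix: decoys are named chrUn_*_decoy.
    if name == "chrEBV" or name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_alt"):
        return "alternate"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    if _PRIMARY.fullmatch(name):
        return "assembled"
    return "unknown"


for contig in ["chr1", "chr1_KI270706v1_random", "chrUn_KI270302v1",
               "chr19_KI270938v1_alt", "chrUn_KN707606v1_decoy"]:
    print(contig, "->", classify_contig(contig))
```

Checking the `_decoy` suffix before the `chrUn_` prefix matters, because the decoy contigs listed above also start with `chrUn_`.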
### References
- [Human chromosome versions explained @ 有勁的基因資訊 (PIXNET)](https://yourgene.pixnet.net/blog/post/117778207-)
<br>
<hr>
<br>
## Base types
- ### [Nucleic acid sequence (Chinese Wikipedia)](https://zh.wikipedia.org/zh-tw/%E6%A0%B8%E9%85%B8%E5%BA%8F%E5%88%97)

- ### [Nucleic acid notation](https://en.wikipedia.org/wiki/Nucleic_acid_notation)

<br>
<hr>
<br>
## FAQ
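### **Why do we need a model?**
- [NCBI Genome Assembly Model](https://www.ncbi.nlm.nih.gov/assembly/model/)
> An organism's genome consists of a set of chromosomes. Bacteria may have a single chromosome, often accompanied by extrachromosomal plasmids, whereas eukaryotes usually have multiple chromosomes, each present in more than one copy. Many eukaryotes exist in a diploid state (two copies of each autosome plus two sex chromosomes), but other organisms, plants for example, can carry more copies (tetraploid or hexaploid, for instance). Even so, most genome representations describe only the haploid state, as shown in Figure 1.
>
> However, current sequencing technology cannot sequence a single chromosome from end to end. To sequence a genome, it is broken into small fragments that are each sequenced many times. These sequences are then assembled in an attempt to reconstruct the chromosome sequences (see [Assembly Basics](https://www.ncbi.nlm.nih.gov/assembly/basics/) for more information). Because of the complexity found in many organisms, it is currently not possible to obtain complete chromosome sequences. In practice, the output of most assembly algorithms is a set of contigs and scaffolds, which are then ordered and oriented using external data such as mapping information. Often, not all sequences can be ordered and oriented. A robust assembly model lets us tie the output of assembly algorithms to the biological models developed over many years of genome research.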
<br>
### **Why update the reference sequence?**
- [[wiki] Reference genome](https://en.wikipedia.org/wiki/Reference_genome)
- [[32] "Genome Reference Consortium". www.ncbi.nlm.nih.gov. Retrieved 2022-08-18.](https://en.wikipedia.org/wiki/Reference_genome#cite_note-:1-32)
- [The Genome Reference Consortium](https://www.ncbi.nlm.nih.gov/grc)
> The GRC remains committed to its mission to improve the human reference genome assembly, correcting errors and adding sequence to ensure it provides the best representation of the human genome to meet basic and clinical research needs.
<br>
### Why hg19 is so widely used
- [Differences between GRCh37 reference versions on DRAGEN](https://www.prismabiotech.com.tw/post/dragen上grch37參考序列版本的差別)
- Because UCSC's Genome Browser was the most convenient tool available at the time
<br>
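### hg19 vs GRCh37
- [Differences between GRCh37 reference versions on DRAGEN](https://www.prismabiotech.com.tw/post/dragen上grch37參考序列版本的差別)
- Apart from the primary chromosomes, the other contigs are named differently from GRCh37
[![](https://i.imgur.com/vaLqIsj.png)](https://i.imgur.com/vaLqIsj.png)
<br>
- **The mitochondrial sequence differs**
- hg19 uses NC_001807.4
- GRCh37 uses NC_012920.1
- The two versions differ at more than 40 positions
Pairwise BLAST result:
[![](https://i.imgur.com/GcddvgF.png)](https://i.imgur.com/GcddvgF.png)
<br>
- [[NGS Knowledge Showcase] Human Reference Genome](http://toolsbiotech.blog.fc2.com/blog-entry-119.html)
| diff | hg19 | GRCh37 |
| ---- | ----- | ----- |
| **Chromosome naming** | Prefixed with the three characters `chr`, e.g. chr1 | Plain numbers, without the `chr` prefix |
| **Mitochondrial version** | The older NC_001807 | The revised NC_012920 |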
<br>
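
The naming difference in the table above can be handled mechanically for the primary chromosomes. The sketch below assumes the common convention `chrM` <-> `MT` for the mitochondrion; the function names `ucsc_to_ensembl` and `ensembl_to_ucsc` are hypothetical. Renaming only changes labels: as noted above, hg19 `chrM` (NC_001807) and GRCh37 `MT` (NC_012920) are different sequences, so coordinates on the mitochondrion are not interchangeable.

```python
def ucsc_to_ensembl(name: str) -> str:
    """Convert a primary chromosome name such as 'chr1' to '1' ('chrM' -> 'MT')."""
    if name == "chrM":
        return "MT"
    # Only primary chromosomes are handled; other contigs use unrelated names.
    if name.startswith("chr") and "_" not in name:
        return name[3:]
    raise ValueError(f"not a primary chromosome name: {name}")


def ensembl_to_ucsc(name: str) -> str:
    """Convert a primary chromosome name such as '1' to 'chr1' ('MT' -> 'chrM')."""
    if name == "MT":
        return "chrM"
    if name in {str(i) for i in range(1, 23)} | {"X", "Y"}:
        return "chr" + name
    raise ValueError(f"not a primary chromosome name: {name}")


print(ucsc_to_ensembl("chr17"), ensembl_to_ucsc("X"))
```

Raising on anything that is not a primary chromosome is deliberate: the remaining contigs are named differently in the two conventions, so they need an explicit lookup table rather than a prefix rule.
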
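### Why is the version after hg19 called hg38 rather than hg20?
- [For the same reason Apple followed the iPhone 8 with the iPhone X instead of the iPhone 9 (just kidding)](http://toolsbiotech.blog.fc2.com/blog-entry-119.html)
- UCSC skipped ahead so that its name matches the GRC build number: hg38 corresponds to GRCh38.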
<br>
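### What does a different coordinate system mean?
- Just as in mathematics, where there are
- [Cartesian (rectangular) coordinates](https://www.wikiwand.com/zh-hk/%E7%AC%9B%E5%8D%A1%E5%B0%94%E5%9D%90%E6%A0%87%E7%B3%BB)
![](https://i.imgur.com/tTP5rFf.png)
- [Polar coordinates](https://zh.wikipedia.org/zh-tw/%E6%9E%81%E5%9D%90%E6%A0%87%E7%B3%BB)
![](https://i.imgur.com/D5HBjSO.png)
- [Logarithmic coordinates](https://www.newton.com.tw/wiki/%E5%8D%8A%E5%B0%8D%E6%95%B8%E5%9D%90%E6%A8%99/4527400)
![](https://i.imgur.com/vPCkLWO.png)
- and so on, and these coordinate systems can be converted into one another
<br>
- Reference versions presumably differ because sequence was added, completed, or corrected, so the position of a given base shifts between versions
- [The TP53 gene (NM_001276698) is located at](http://toolsbiotech.blog.fc2.com/blog-entry-119.html)
- chr17:7,668,402-7,675,493 in hg38 / GRCh38
- chr17:7,571,720-7,578,811 in hg19 / GRCh37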
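
To make the coordinate shift concrete, the sketch below parses the two TP53 regions quoted above and compares them. The helper `parse_region` is a hypothetical name. The constant offset it finds holds only for this locus; converting arbitrary positions between assemblies requires a chain-file based tool such as UCSC liftOver, which this sketch does not attempt to reproduce.

```python
import re


def parse_region(region: str):
    """Parse 'chr17:7,668,402-7,675,493' into (chrom, start, end) with integer coordinates."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", region)
    if match is None:
        raise ValueError(f"unrecognized region: {region}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    return chrom, start, end


hg38_region = parse_region("chr17:7,668,402-7,675,493")
hg19_region = parse_region("chr17:7,571,720-7,578,811")

# Inclusive interval lengths in each assembly.
hg38_length = hg38_region[2] - hg38_region[1] + 1
hg19_length = hg19_region[2] - hg19_region[1] + 1

# Offset of the locus between the two coordinate systems.
start_shift = hg38_region[1] - hg19_region[1]
end_shift = hg38_region[2] - hg19_region[2]

print("length:", hg38_length, hg19_length)
print("shift:", start_shift, end_shift)
```

Both intervals have the same length and both ends move by the same amount, so at this locus the gene is unchanged and only its position moved.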
<br>
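### How to choose a reference version?
- [With several versions in use at once, which one should I use?](http://toolsbiotech.blog.fc2.com/blog-entry-119.html?sp)
> The reference genome was assembled from the sequenced genomes of **multiple** DNA donors, so it does not accurately represent the genome sequence of any single person.
- [So which version should I pick? Is the newest always the best?](http://toolsbiotech.blog.fc2.com/blog-entry-119.html)
> That is up to the user. The two versions (GRChXX, hgXX) are currently used about equally often, and the supporting resources, including the major genome databases NCBI, UCSC, Ensembl, the 1000 Genomes Project, gnomAD and COSMIC, as well as Taiwan Biobank, support queries and data access with both. Some downstream tools have not been fully updated, however, so for now GRCh37/hg19 has the more complete supporting data and is less likely to run into version mismatches.
- [2020-01-15 Understanding the different human reference genome versions and how to choose](https://www.codenong.com/jse65115b4633a/)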
<br>
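### What is the T2T-CHM13 reference sequence?
- [How the T2T Consortium completed the first truly "complete" human genome sequence](https://www.blossombio.com/eNews/20210804/index.html)
> #T2T Consortium, CHM13v1.1
- [[Third-generation sequencing news] I want it all! PacBio HiFi reads let you be greedy](https://www.toolsbiotech.com/news_detail.php?id=39)
![](https://i.imgur.com/CJbT72a.png)
- [The last piece of the Human Genome Project puzzle: whose genome should represent all of humanity?](https://www.thenewslens.com/article/174038/fullpage)
- About 8% of the current reference genome (roughly 200 million base pairs) still could not be fully resolved
- Telomere sequences
- Centromere sequences
- Ribosomal DNA array sequences
- Diploidy complicates assembly and analysis
- Researchers found that a special cell line, CHM13 (Complete Hydatidiform Mole 13), helps with this problem: fertilization went wrong in this cell line, so the resulting cells contain only sperm-derived DNA.
- "Whose genome should represent humanity?"
- Giving every person their own complete genome sequence may be the ultimate route to personalized medicine and highly accurate clinical diagnosis.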
- [[Slide] NVIDIA Clara™ Parabricks 4.0](https://docs.google.com/presentation/d/1ELzX61Fmd9FNBsBcoYutCERCiLaru5A79EgDK1ggN2k/edit#slide=id.g17f0b3e9eb0_0_0)

---

---

<br>
<hr>
<br>
## How to build indexes for a reference sequence
> - Using T2T-CHM13v2.0 as the example
> - **keywords**: GRCh37, GRCh38, T2T-CHM13v2.0, genome reference, fasta, fna, bwa, samtools, index, indexing, fai
### Build a BWA index for a FASTA file (used by BWA-MEM)
```bash
# for T2T-CHM13v2.0
## Create the BWA indices
$ apt-get install bwa
$ bwa index GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
```
- [[NYCU Institutional Repository] A highly parallelizable FM-index variant for long biological sequences](https://ir.nctu.edu.tw/handle/11536/125690)
> The Ferragina-Manzini index (FM-index), built on the Burrows-Wheeler transform, is now widely used for **fast string search over genomic sequences**.
<br>
### Build an FAI index for a FASTA file with samtools
> Purpose: efficient access to arbitrary regions of the reference sequence
>
```bash
# for T2T-CHM13v2.0
## Create an FAI index
$ samtools faidx GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
```
- [[nvidia] Downloading and Indexing a Reference Genome and Known Sites](https://docs.nvidia.com/clara/parabricks/3.7.0/How-Tos/WholeGenomeGermlineSmallVariants.html#downloading-and-indexing-a-reference-genome-and-known-sites)
- [[samtools] faidx](http://www.htslib.org/doc/faidx.html)
faidx – an index enabling random access to FASTA and FASTQ files
| Field | Description |
|-----|-----|
| NAME | Name of this reference sequence |
| LENGTH | Total length of this reference sequence, in bases |
| OFFSET | Offset in the FASTA/FASTQ file of this sequence's first base |
| LINEBASES | The number of bases on each line |
| LINEWIDTH | The number of bytes in each line, including the newline |
| QUALOFFSET | Offset of sequence's first quality within the FASTQ file |
- Example
```
NC_000001.11 248956422 69 80 81
NT_187361.1 175055 252068568 80 81
NT_187362.1 32032 252245933 80 81
NT_187363.1 127682 252278487 80 81
NT_187364.1 66860 252407887 80 81
NT_187365.1 40176 252475704 80 81
NT_187366.1 42210 252516504 80 81
NT_187367.1 176043 252559363 80 81
NT_187368.1 40745 252737728 80 81
NT_187369.1 41717 252779104 80 81
```
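
The fields above are what make random access possible: the byte position of any base follows from OFFSET, LINEBASES and LINEWIDTH. The following is a minimal sketch of that arithmetic, using the first record of the example; the names `FaiRecord`, `parse_fai_line` and `byte_offset` are hypothetical, and positions are assumed to be 0-based.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str
    length: int
    offset: int
    linebases: int
    linewidth: int


def parse_fai_line(line: str) -> FaiRecord:
    """Parse one line of a FASTA .fai index (five whitespace-separated columns)."""
    fields = line.split()
    return FaiRecord(fields[0], *(int(value) for value in fields[1:5]))


def byte_offset(record: FaiRecord, pos0: int) -> int:
    """Byte offset in the FASTA file of the base at 0-based position pos0."""
    if not 0 <= pos0 < record.length:
        raise IndexError(f"position {pos0} is outside {record.name}")
    full_lines, column = divmod(pos0, record.linebases)
    # Every complete line before the target costs `linewidth` bytes (bases plus newline).
    return record.offset + full_lines * record.linewidth + column


chr1 = parse_fai_line("NC_000001.11 248956422 69 80 81")
print(byte_offset(chr1, 0), byte_offset(chr1, 79), byte_offset(chr1, 80))
```

With that offset, a reader can `seek()` straight to the base instead of scanning the file from the beginning.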
<br>
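### Build a .dict for a FASTA file with CreateSequenceDictionary from picard.jar
> Goal: create a sequence dictionary (the names, lengths and order of the reference sequences)
> Purpose: efficient random access to the reference bases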
>
- [[BroadInstitute] CreateSequenceDictionary (Picard) ](https://gatk.broadinstitute.org/hc/en-us/articles/360036729911)
```
java -jar picard.jar CreateSequenceDictionary \
R=reference.fasta \
O=reference.dict
```
```
gatk CreateSequenceDictionary \
R=reference.fna \
O=reference.dict
```
- gatk source:
https://github.com/broadinstitute/gatk/releases
- If the .dict file is missing, other commands fail with the following error message
> A USER ERROR has occurred: Fasta dict file /workspace/datasets/ref/T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.dict for reference /workspace/datasets/ref/T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.
- Example
```
@HD VN:1.6
@SQ SN:NC_000001.11 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna
@SQ SN:NT_187361.1 LN:175055 M5:62def1a794b3e18192863d187af956e6 UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna
@SQ SN:NT_187362.1 LN:32032 M5:78135804eb15220565483b7cdd02f3be UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna
@SQ SN:NT_187363.1 LN:127682 M5:1e95e047b98ed92148dd84d6c037158c UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna
@SQ SN:NT_187364.1 LN:66860 M5:4e2db2933ea96aee8dab54af60ecb37d UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna
@SQ SN:NT_187365.1 LN:40176 M5:9949f776680c6214512ee738ac5da289 UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna
@SQ SN:NT_187366.1 LN:42210 M5:af383f98cf4492c1f1c4e750c26cbb40 UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna
@SQ SN:NT_187367.1 LN:176043 M5:c38a0fecae6a1838a405406f724d6838 UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna
@SQ SN:NT_187368.1 LN:40745 M5:cb78d48cc0adbc58822a1c6fe89e3569 UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna
```
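
A `.dict` file is a SAM header, so each `@SQ` line can be read as a set of `TAG:value` fields. The sketch below assumes the fields are tab-separated, as the SAM specification requires (the tabs may not be visible in the example above); the function name `parse_sq_line` is hypothetical.

```python
def parse_sq_line(line: str) -> dict:
    """Parse one @SQ header line into a dict; LN is converted to int."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        raise ValueError("not an @SQ line")
    # Split on the first colon only: values such as UR:file:/path contain colons.
    record = dict(field.split(":", 1) for field in fields[1:])
    record["LN"] = int(record["LN"])
    return record


example = "\t".join([
    "@SQ",
    "SN:NC_000001.11",
    "LN:248956422",
    "M5:6aef897c3d6ff0c78aff06ac189178dd",
    "UR:file:/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna",
])
sq = parse_sq_line(example)
print(sq["SN"], sq["LN"], sq["M5"])
```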
<br>
### The complete set of files
```
├── [3.7K] GCF_009914755.1_T2T-CHM13v2.0_genomic.dict <- by picard.jar
├── [2.9G] GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
├── [ 16] GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.amb
├── [2.6K] GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.ann
├── [2.9G] GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.bwt
├── [ 915] GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.fai <- by samtools
├── [889M] GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
├── [743M] GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.pac
├── [1.5G] GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.sa
└── [2.1K] GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.summary
```
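
Before starting a pipeline it can help to confirm that every companion file in the listing above exists. The sketch below derives the expected names from the FASTA path, following the naming seen in the listing (BWA and samtools append a suffix, while the `.dict` replaces the `.fna` extension); the function names are hypothetical.

```python
from pathlib import Path

# Suffixes appended to the FASTA name by `bwa index`.
BWA_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def expected_index_files(fasta) -> list:
    """Return the index files expected next to a FASTA, per the listing above."""
    fasta = Path(fasta)
    files = [Path(str(fasta) + suffix) for suffix in BWA_SUFFIXES]
    files.append(Path(str(fasta) + ".fai"))   # samtools faidx
    files.append(fasta.with_suffix(".dict"))  # CreateSequenceDictionary
    return files


def missing_index_files(fasta) -> list:
    """Return the expected index files that do not exist on disk."""
    return [path for path in expected_index_files(fasta) if not path.exists()]


for path in expected_index_files("GCF_009914755.1_T2T-CHM13v2.0_genomic.fna"):
    print(path.name)
```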
<br>
### References
- ### [[BroadInstitute] (How to) Install all software packages required to follow the GATK Best Practices](https://gatk.broadinstitute.org/hc/en-us/articles/360041320571)
- **Software packages**
- BWA
- SAMtools
- Picard
- Genome Analysis Toolkit (GATK)
- ### [[BroadInstitute] FASTA - Reference genome format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531652-FASTA-Reference-genome-format)
- **Most GATK tools additionally require that the main FASTA file be accompanied by a dictionary file ending in `.dict` and an index file ending in `.fai`, because it allows efficient random access to the reference bases.**
- [[Format] Sequence Alignment/Map Format Specification (SAM v1)](https://samtools.github.io/hts-specs/SAMv1.pdf)
- `@HD`: header
File-level metadata. Optional. If present, there must be only one @HD line and it must be the first line of the file.
- `VN`: Format version.
- `@SQ`: sequence
Reference sequence dictionary. The order of @SQ lines defines the alignment sorting order.
- `SN`: Reference sequence name.
- `LN`: Reference sequence length.
- `M5`: MD5 checksum of the sequence.
- `UR`: URI of the sequence.
This value may start with one of the standard protocols, e.g., 'http:' or 'ftp:'. If it does not start with one of these protocols, it is assumed to be a file-system path.
- ### [[BioStars] How is contig length calculated ?](https://www.biostars.org/p/9527598/#9527607)
<br>
<hr>
<br>
## NCBI
### [T2T-CHM13 (2021.05.07)](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/)
| patch | Total sequence length | <span style="white-space: nowrap;">Submitted sequence</span><br><span style="white-space: nowrap;">(GenBank accession)</span> | <span style="white-space: nowrap;">Reference sequence</span><br>(RefSeq accession) |
| ----- | --------------------- | ------------------ | ----------------------|
| T2T-CHM13v2.0 (2022/01/24) | 3,054,832,041 | GCA_009914755.4<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.4/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_009914755.4/) | GCF_009914755.1<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCF_009914755.1/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_009914755.1/)
- **GCA vs GCF**
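GCA_ accessions are GenBank assemblies submitted by researchers or institutions, possibly over several versions; after review by NCBI, an assembly receives a GCF_ (RefSeq, official) accession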
- 2020/01/22: [GCA_009914755.1 -> T2T-CHM13v0.7](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.1/)
- 2021/01/26: [GCA_009914755.2 -> T2T-CHM13v1.0](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.2/)
- 2021/05/07: [GCA_009914755.3 -> T2T-CHM13v1.1](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/)
- 2022/01/24: [GCA_009914755.4 -> T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.4/)
- 2022/01/24: [GCF_009914755.1](https://www.ncbi.nlm.nih.gov/assembly/GCF_009914755.1/) (official version)
- **FTP**
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/

- Download the FASTA (fna)
```
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
```
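
The download URL above follows a regular pattern: the nine digits of the accession are split into three directory levels. The sketch below reconstructs the URL from an accession and an assembly name. The function name `ncbi_genome_fasta_url` is hypothetical, and the pattern is inferred from the URLs in this note rather than taken from NCBI documentation.

```python
def ncbi_genome_fasta_url(accession: str, assembly_name: str) -> str:
    """Build the genomic FASTA URL for e.g. ('GCF_009914755.1', 'T2T-CHM13v2.0')."""
    prefix, rest = accession.split("_", 1)
    digits = rest.split(".", 1)[0]
    if prefix not in ("GCA", "GCF") or len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession}")
    # 009914755 -> 009/914/755
    path = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    stem = f"{accession}_{assembly_name}"
    return (
        "https://ftp.ncbi.nlm.nih.gov/genomes/all/"
        f"{prefix}/{path}/{stem}/{stem}_genomic.fna.gz"
    )


print(ncbi_genome_fasta_url("GCF_009914755.1", "T2T-CHM13v2.0"))
```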
<br>
### [GRCh38 (2013.12.17)](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/)
| patch | Total sequence length | <span style="white-space: nowrap;">Submitted sequence</span><br><span style="white-space: nowrap;">(GenBank accession)</span> | <span style="white-space: nowrap;">Reference sequence</span><br>(RefSeq accession) |
| ----- | --------------------- | ------------------ | ----------------------|
| GRCh38<br>(2013/12/17) | 3,099,734,149 | GCA_000001405.15<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.15/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_000001405.15/) | <span style="white-space: nowrap;">GCF_000001405.26<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.26/) [[FTP]](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/)</span> |
| GRCh38.**p11**<br>(2017/06/14) | 3,099,734,149 | GCA_000001405.26<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.26/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_000001405.26/) | GCF_000001405.37<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.37/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.37/) [[FTP]](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.37_GRCh38.p11/) |
| GRCh38.**p12**<br>(2017/12/21) | 3,099,706,404 | GCA_000001405.27<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.27/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_000001405.27/) | GCF_000001405.38<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.38/) [[FTP]](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.38_GRCh38.p12/) |
| GRCh38.**p13**<br>(2019/02/18) | 3,099,706,404 | GCA_000001405.28<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.28/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_000001405.28/) | GCF_000001405.39<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.39/) [[FTP]](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/) |
| GRCh38.**p14**<br>(2022/02/03) | 3,099,441,038 | GCA_000001405.29<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.29/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_000001405.29/) | GCF_000001405.40<br>[[assembly]](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40/) [[data-hub]](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.40/) [[FTP]](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/) |
- [What is the correct name for the human genome reference assembly?](https://www.ncbi.nlm.nih.gov/grc/help/faq/)
- What is commonly called [GRCh38](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_000001405.15/) appears to refer to (?)
- GenBank accession [GCA_000001405.15](https://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.15)
- RefSeq accession [GCF_000001405.26](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26) (2013/12/17)
- **FTP**
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/

- Build the index for BWA-MEM
```bash
# for GRCh38.p13
$ apt-get install bwa
$ bwa index GCF_000001405.39_GRCh38.p13_genomic.fna
```
:::spoiler Run log
```
$ bwa index GCF_000001405.39_GRCh38.p13_genomic.fna
[bwa_index] Pack FASTA... 30.26 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=6544178410, availableWord=472472396
[BWTIncConstructFromPacked] 10 iterations done. 99999994 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999994 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999994 characters processed.
...
...
[BWTIncConstructFromPacked] 700 iterations done. 6493609402 characters processed.
[BWTIncConstructFromPacked] 710 iterations done. 6518454730 characters processed.
[BWTIncConstructFromPacked] 720 iterations done. 6540533258 characters processed.
[bwt_gen] Finished constructing BWT in 722 iterations.
[bwa_index] 3386.70 seconds elapse.
[bwa_index] Update BWT... 33.22 sec
[bwa_index] Pack forward-only FASTA... 16.21 sec
[bwa_index] Construct SA from BWT and Occ... 1705.25 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index GCF_000001405.39_GRCh38.p13_genomic.fna
[main] Real time: 5226.528 sec; CPU: 5171.657 sec
```
:::
<br>
Files produced:
```
# tree -h
.
├── [3.1G] GCF_000001405.39_GRCh38.p13_genomic.fna
├── [ 21K] GCF_000001405.39_GRCh38.p13_genomic.fna.amb
├── [ 82K] GCF_000001405.39_GRCh38.p13_genomic.fna.ann
├── [3.0G] GCF_000001405.39_GRCh38.p13_genomic.fna.bwt
├── [780M] GCF_000001405.39_GRCh38.p13_genomic.fna.pac
└── [1.5G] GCF_000001405.39_GRCh38.p13_genomic.fna.sa
0 directories, 6 files
```
<br>
### NCBI Accession
> NCBI accession numbers
- Genome database
- **GCA**: Genome Collections Accession (also covers other species)
- Submitted by the scientific community; each submission gets an accession (a GCA), and an assembly can be submitted several times
- **GCF**: possibly "reference for Genome Collections accession" (a guess)
- NCBI reviews the GCA submission and, after review, issues the official accession (a GCF)
- **Examples of GCA and GCF**
- Reference genome T2T-CHM13v2.0 (only v2.0 has a GCF record)
- **2020/01/22**: [GCA_009914755.1](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.1/) (name: T2T-CHM13v0.7)
- **2021/01/26**: [GCA_009914755.2](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.2/) (name: T2T-CHM13v1.0)
- **2021/05/07**: [GCA_009914755.3](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/) (name: T2T-CHM13v1.1)
- **2022/01/24**: [GCA_009914755.4](https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.4/) (name: T2T-CHM13v2.0)
- `.number`: .1, .2, .3, .4, … is the version of the accession
- ==**2022/01/24**: [GCF_009914755.1](https://www.ncbi.nlm.nih.gov/assembly/GCF_009914755.1/) (official version)==
- The version does not necessarily start at 1
- For example, reference genome GRCh38 (continuing the series from GRCh37)
- 2013/12/17: submitted [GCA_000001405.15](https://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.15/) -> official [GCF_000001405.26](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) (name: GRCh38)
- ...
- 2022/02/03: submitted [GCA_000001405.29](https://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.29/) -> official [GCF_000001405.40](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40/) (name: GRCh38.p14)
- ### References
- **GCA:**
[a unique Genome Collections Accession](https://asia.ensembl.org/info/genome/genebuild/assembly.html)
- **GCF:**
Possibly "reference Genome Collections" (?)
or "reference for Genome Collections" (?) (by analogy with RefSNP, abbreviated rs)
- **[Unique identifiers and NCBI accession prefixes](https://www.ncbi.nlm.nih.gov/pathogens/pathogens_help/#unique-identifiers)**
- `GCA_`
Accession number prefix for a GenBank genome assembly. This is data submitted by the scientific community directly to GenBank as an assembled genome.
(Read more about [genomes](https://www.ncbi.nlm.nih.gov/pathogens/pathogens_help/#data-type-genome) in the [data types](https://www.ncbi.nlm.nih.gov/pathogens/pathogens_help/#data-types) section of this document.)
- `GCF_`
Accession number prefix for a RefSeq genome assembly. This is a representative genome assembly for a given organism in RefSeq, a non-redundant database.
(Read more about [Prokaryotic RefSeq Genomes](https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/).)
(Read more about [NCBI Genome Assembly Models](https://www.ncbi.nlm.nih.gov/assembly/model/).)
- **[Accession Number prefixes: Where are the sequences from?](https://www.ncbi.nlm.nih.gov/genbank/acc_prefix/)**
Nucleotide Accession Prefixes
[](https://i.imgur.com/Lvnn8we.png)
(more)
- **[Introduction to NCBI Bioinformatics Resources: Gene Search / NCBI Gene](https://usuhs.libguides.com/c.php?g=468091&p=3247777)**
- NCBI Reference sequences: links to curated and annotated reference sequence records for the gene (accession number prefix NG), mRNA (NM) and protein (NP).
- NG accession number links to the GenBank record, FASTA sequence, and Sequence viewer in the Nucleotide database.
- NM accession number links to the mRNA record in the Nucleotide database.
- NP accession number links to the protein record in the Protein database.
### Other entry points
> [Human Genome Resources at NCBI](https://www.ncbi.nlm.nih.gov/genome/guide/human/)
> 
https://www.ncbi.nlm.nih.gov/genome/guide/human/
[](https://www.ncbi.nlm.nih.gov/genome/guide/human/)
<br>
<hr>
<br>
## UCSC
### Jan. 2022 (T2T-CHM13 v2.0/hs1)
- ### Web entry point
http://hgdownload.soe.ucsc.edu/downloads.html#human

- ### [hs1.fa.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/)
```
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz
$ gunzip -k hs1.fa.gz
```
<br>
### Dec. 2013 (GRCh38/hg38)
- ### Web entry point
http://hgdownload.soe.ucsc.edu/downloads.html#human

- ### FTP entry point
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

- ### [hg38.p13.fa.gz](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p13/)
> - http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p13/
> - https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p13/
```
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p13/hg38.p13.fa.gz
$ gunzip -k hg38.p13.fa.gz
```
- ### [hg38.p14.fa.gz](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p14/)
```
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p14/hg38.p14.fa.gz
$ gunzip -k hg38.p14.fa.gz
```
<br>
### References
- [How to Download hg38/GRCh38 FASTA Human Reference Genome](https://www.gungorbudak.com/blog/2018/05/16/how-to-download-hg38-grch38-fasta-human-reference-genome/)

<br>
### Feb. 2009 (GRCh37/hg19)
- ### [hg19Patch13.fa.gz](https://hgdownload.soe.ucsc.edu/goldenPath/hg19/hg19Patch13/)
```
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/hg19Patch13/hg19Patch13.fa.gz
$ gunzip -k hg19Patch13.fa.gz
```
<br>
<hr>
<br>
## Google
> [genomics-public-data](https://console.cloud.google.com/storage/browser/genomics-public-data/references)
[](https://i.imgur.com/ZHGrBkZ.png)
- GRCh37
- GRCh37lite
- GRCh38
- GRCh38_Verily
- Homo_sapiens_assembly19_1000genomes_decoy
- b37
- hg19
- hg38
- hs37d5
### References
- [Reference Genomes](https://cloud.google.com/life-sciences/docs/resources/public-datasets/reference-genomes)
<br>
<hr>
<br>
## Nvidia - Assembly38
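Left: GRCh38.p14; right: assembly38 (Homo_sapiens_assembly38)
[![](https://i.imgur.com/FTgTwuq.png)](https://i.imgur.com/FTgTwuq.png)
- Mostly matches GRCh38, but some parts match neither version
- chr5 is identical in GRCh38 and hg38,
but the NVIDIA copy differs, and no other sequence was found that corresponds to it
- Extra sequences have been added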
<br>
<hr>
<br>
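## Comparing versions
### GRCh38 vs hg38
- ### File contents
- Sequences in GRCh38 wrap every 80 characters
- Sequences in hg38 wrap every 50 characters
- ### Chromosome naming
- Example: chromosome 1
- ENSEMBL name -> 1
- UCSC name -> chr1
- NCBI name -> NC_000001.11
- ### Bases used
- hg38 uses only the bases ATCG plus N
- N is an ambiguity code that can stand for any of A, T, C or G
- Besides ATCG and N, GRCh38 also uses M, R, ...
- Over the whole of chromosome 1, GRCh38 and hg38 differ at two bases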
> src1: GRCh38, src2: hg38
> diff_at aound: (248752479, 248752560]
src1: cacatgcacatcaccccccacacacaccaaaCA<b style="color: red; background: yellow;">M</b>cccacacaacacacacacaccacaccacacaaacacaaacacacca
src2: cacatgcacatcaccccccacacacaccaaaca<b style="color: red; background: yellow;">n</b>cccacacaacacacacacaccacaccacacaaacacaaacacacca
> diff_at aound: (248755119, 248755200]
src1: G<b style="color: red; background: yellow;">R</b>CACCTAGCAGAGAAACAGAGCCTGGGAGTGACCCCATAGCAGGACCCAGGGCCGTGCTCAACACCTCTGTGGGTAAGA
src2: g<b style="color: red; background: yellow;">n</b>cacctagcagagaaacagagcctgggagtgaccccatagcaggacccagggccgtgctcAACACCTCTGTGGGTAAGA
- So in practice GRCh38 and hg38 are not strictly identical
- ### Base notation
- [[wiki] IUPAC notation](https://en.wikipedia.org/wiki/Nucleic_acid_notation)
> IUPAC degenerate base symbols
>

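- M (Amino) means the base may be A or C
- R (Purine) means the base may be A or G
- N (Any one base) means the base may be A, C, G or T
- GRCh38 uses IUPAC (International Union of Pure and Applied Chemistry) notation, whereas hg38 replaces all of these codes with N
- [[wiki] Nucleic acid sequence](https://zh.wikipedia.org/zh-tw/%E6%A0%B8%E9%85%B8%E5%BA%8F%E5%88%97)
![](https://i.imgur.com/bLcGupP.png)
- [Encoding ambiguous bases as N](https://www.plob.org/article/21355.html)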
http://biocorp.ca/IUB.php

<br>
### Read a scaffold and compute its length
```python=
class Scaffold:
    """Sequential reader for one sequence (scaffold) in a FASTA file."""

    def __init__(self, filepath, sequence_name):
        self._fd = open(filepath, 'r')
        self._buffer = ''
        # Skip ahead to the header line of the requested sequence.
        while True:
            line = self._fd.readline()
            if len(line) == 0:
                # Reached EOF without finding the header.
                self._fd.close()
                print('not found any sequence with ' + sequence_name)
                break
            if line.startswith('>' + sequence_name):
                print('found:', line)
                break

    def read(self, sequence_length=80):
        """Return up to `sequence_length` bases; '' once the sequence is exhausted."""
        if not self._fd.closed:
            while len(self._buffer) < sequence_length:
                line = self._fd.readline()
                # Stop at EOF or at the header of the next sequence.
                if len(line) == 0 or line.startswith('>'):
                    self._fd.close()
                    break
                self._buffer += line.strip()
        sequence = self._buffer[0:sequence_length]
        self._buffer = self._buffer[sequence_length:]
        return sequence


filepath1 = '/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna'
scaffold1 = Scaffold(filepath1, 'NC_000001.11')
count = 0
while True:
    sequence1 = scaffold1.read()
    count += len(sequence1)
    if len(sequence1) == 0:
        break
print(count)
```
<br>
### Compute the length and MD5 of a chromosome or scaffold
> A quick ad-hoc example
```python=
import hashlib

grch38 = '/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna'
NC_1 = '>NC_000001.11'
hg38 = '/workspace/datasets/ref/hg38.p14/hg38.p14.fa'
chr1 = '>chr1'
filepath = grch38
keyword = NC_1

md5 = hashlib.md5()
with open(filepath, 'r') as fd:
    # Locate the header line of the target sequence.
    while True:
        line = fd.readline()
        if line.startswith(keyword):
            break
        if len(line) == 0:
            break
    if len(line) > 0:
        count = 0
        last_line = ''
        while True:
            line = fd.readline()
            if line.startswith('>'):
                break
            if len(line) == 0:
                break
            bases = line.strip()
            count += len(bases)
            # The SAM spec defines M5 over the upper-cased sequence, so
            # soft-masked (lower-case) bases are upper-cased before hashing.
            md5.update(bases.upper().encode('ascii'))
            last_line = line
        print('ln:', count)
        print('m5:', md5.hexdigest())
        print('last line:', last_line)

# Expected values:
#   grch38 NC_000001.11 -> ln 248956422, m5 6aef897c3d6ff0c78aff06ac189178dd
#   hg38   chr1         -> ln 248956422, m5 2648ae1bacce4ec4b6cf337dcae37816
# Results recorded from the original run, which hashed the mixed-case text
# as-is; the missing upper-casing is the most likely reason the m5 differed:
#   grch38 -> ln: 248956422, m5: 657ffde0993ab3cc8152afd2c0f6e86c, last line: NNNNNNNNNNNNNNNNNNNNNN
#   hg38   -> ln: 248956422, m5: a004bc1b0bf05fc668cab6bbfd93d3eb, last line: NNNNNNNNNNNNNNNNNNNNNN
```
<br>
### Write out one chromosome, wrapped at 80 characters per line
> A quick ad-hoc example
```python=
import os


def read_scaffold(filepath, name, len_per_line=80):
    output_filename = f'output-{name}.txt'
    count = 0
    with open(filepath, 'r') as src, open(output_filename, 'w') as dest:
        # locate the target scaffold
        while True:
            line = src.readline()
            if line.startswith('>' + name):
                break
            if len(line) == 0:
                break
        # read the body of the target scaffold
        if len(line) > 0:
            buffer = ''
            while True:
                line = src.readline()
                if line.startswith('>'):
                    break
                if len(line) == 0:
                    break
                line = line.strip()
                count += len(line)
                buffer += line
                # emit every complete output line currently in the buffer
                while len(buffer) >= len_per_line:
                    dest.write(buffer[:len_per_line] + '\n')
                    buffer = buffer[len_per_line:]
            # Flush the final partial line. Without this the last
            # `count % len_per_line` bases are dropped; the file sizes recorded
            # below (252068355 = 3111955 * 81) come from a run without this flush.
            if buffer:
                dest.write(buffer + '\n')
    print(f'name:{name} ln: {count} file-size:{os.path.getsize(output_filename)}')


grch38 = '/workspace/datasets/ref/GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna'
read_scaffold(grch38, 'NC_000001.11')
hg38 = '/workspace/datasets/ref/hg38.p14/hg38.p14.fa'
read_scaffold(hg38, 'chr1')
```
Output:
```
name:NC_000001.11 ln: 248956422 file-size:252068355
name:chr1 ln: 248956422 file-size:252068355
```
<br>
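### Compare one chromosome between GRCh38 and hg38
> - Prerequisite: the hg38 chromosome must first be rewrapped from 50 to 80 characters per line
> - A quick ad-hoc example, not optimized; one could also seek directly to the chromosome and compare it base by base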
```python=
def diff_scaffold(filepath1, filepath2, break_after_n_line=5):
    count = 0
    diff_times = 0
    with open(filepath1, 'r') as src1, open(filepath2, 'r') as src2:
        while True:
            line1 = src1.readline()
            line2 = src2.readline()
            if len(line1) == 0 or len(line2) == 0:
                break
            count += len(line1.strip())
            # Compare case-insensitively so soft-masking differences are ignored.
            if line1.upper() == line2.upper():
                continue
            # NOTE: len(line1) still includes the trailing newline, so the
            # reported start is one less than the true start of the line.
            print(f'diff_at aound: ({count-len(line1)}, {count}]')
            print('src1:', line1.strip())
            print('src2:', line2.strip())
            print()
            diff_times += 1
            if diff_times > break_after_n_line:
                break
    print('count:', count)


diff_scaffold('output-NC_000001.11.txt', 'output-chr1.txt')
```
Comparison output:
> diff_at aound: (248752479, 248752560]
src1: cacatgcacatcaccccccacacacaccaaaCA<b style="color: red; background: yellow;">M</b>cccacacaacacacacacaccacaccacacaaacacaaacacacca
src2: cacatgcacatcaccccccacacacaccaaaca<b style="color: red; background: yellow;">n</b>cccacacaacacacacacaccacaccacacaaacacaaacacacca
> diff_at aound: (248755119, 248755200]
src1: G<b style="color: red; background: yellow;">R</b>CACCTAGCAGAGAAACAGAGCCTGGGAGTGACCCCATAGCAGGACCCAGGGCCGTGCTCAACACCTCTGTGGGTAAGA
src2: g<b style="color: red; background: yellow;">n</b>cacctagcagagaaacagagcctgggagtgaccccatagcaggacccagggccgtgctcAACACCTCTGTGGGTAAGA
> count: 248956422
<br>
### Base counts for specific reference sequences
- ### GRCh38: GCF_000001405.26_GRCh38_genomic.fna
```
A: 543573447
B: 2
C: 401305022
G: 401633720
K: 8
M: 8
N: 159970223
R: 27
S: 5
T: 544299756
W: 14
Y: 35
a: 354711972
c: 222422320
g: 224701417
t: 356668129
```
- total: 3,209,286,105 (3209286105)
- ### GRCh38.p14: GCF_000001405.40_GRCh38.p14_genomic.fna
```
A: 558619211
B: 2
C: 413530454
G: 413917617
K: 8
M: 8
N: 161611379
R: 29
S: 5
T: 559373567
W: 15
Y: 36
a: 364497992
c: 229022463
g: 231314379
t: 366543471
```
- total: 3,298,430,636 (3298430636)
- ### hg38.p14.fa
```
A: 446356635
C: 304761492
G: 305028912
N: 161608333
T: 447005205
a: 476942826
c: 338004625
g: 340408553
n: 3149
t: 479090309
```
- total: 3,299,210,039 (3299210039)
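
The note does not show how the tallies above were produced. The following is one way they could be reproduced, as a minimal sketch: header lines are skipped and every other character is counted case-sensitively. The function name `count_bases` is hypothetical, and the file path in the usage comment is only an example.

```python
from collections import Counter


def count_bases(lines) -> Counter:
    """Count each symbol in the sequence lines of a FASTA (headers are skipped)."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue
        counts.update(line.strip())
    return counts


# Small in-memory demonstration; for a real file pass the open file object:
#   with open("hg38.p14.fa") as fd:
#       counts = count_bases(fd)
demo = [">seq1 demo\n", "ACGTNNacgt\n", "AARM\n", ">seq2\n", "ttn\n"]
counts = count_bases(demo)
for symbol in sorted(counts):
    print(f"{symbol}: {counts[symbol]}")
print("total:", sum(counts.values()))
```

Sorting the symbols puts upper-case before lower-case, which matches the order of the listings above.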
<br>
<hr>
<br>
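## Assembly challenges
- [[Vertebrate Genomes Project]](https://www.facebook.com/groups/883166995753861/posts/975287693208457/)
- With current sequencing technology and algorithms, short reads alone cannot produce highly contiguous contigs, because they cannot span repeats longer than the read length.
- The assemblies corrected major errors in thousands of genes from earlier genomes and yielded new findings on the regulation of GC-rich regions and on chromosome evolution. This confirms that long-read sequencing is essential for the best possible genome assembly; if handled poorly, unresolved complex repeats and haplotype heterozygosity are the main sources of assembly errors.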
<br>
<hr>
<br>
## References
- ### [[wiki] Reference genome](https://en.wikipedia.org/wiki/Reference_genome)
| Release name | Date of release | Equivalent UCSC version |
|:---------------:|:--------------------------:|:-----------------------:|
| GRCh39 | Indefinitely postponed[^[32]^](https://en.wikipedia.org/wiki/Reference_genome#cite_note-:1-32) | - |
| T2T-CHM13 | January 2022 | - |
| GRCh38 | Dec 2013 | hg38 |
| GRCh37 | Feb 2009 | hg19 |
| NCBI Build 36.1 | Mar 2006 | hg18 |
| NCBI Build 35 | May 2004 | hg17 |
| NCBI Build 34 | Jul 2003 | hg16 |
- ### [[UCSC][Genomes] Table Browser](https://genome.ucsc.edu/cgi-bin/hgTables?clade=mammal&org=Human&db=hg19)

- assembly
- Jan. 2022 (T2T CHM13v2.0/hs1)
- Dec. 2013 (GRCh38/hg38)
- Feb. 2009 (GRCh37/hg19)
- Mar. 2006 (NCBI36/hg18)
- May 2004 (NCBI35/hg17)
- July 2003 (NCBI34/hg16)
- ### [Differences between GRCh37 reference versions on DRAGEN](https://www.prismabiotech.com.tw/post/dragen上grch37參考序列版本的差別)
- ### [[NGS Knowledge Showcase] Human Reference Genome](http://toolsbiotech.blog.fc2.com/blog-entry-119.html)
- ### [How the various genome versions correspond to one another](http://www.bio-info-trainee.com/1469.html)
- ### [2020-01-15 Understanding the different human reference genome versions and how to choose](https://www.codenong.com/jse65115b4633a/)
- ### [GRCh37 hg19 b37 humanG1Kv37 - Human Reference Discrepancies](https://gatk.broadinstitute.org/hc/en-us/articles/360035890711-GRCh37-hg19-b37-humanG1Kv37-Human-Reference-Discrepancies)
- hg19 (ucsc.hg19.fasta, MD5sum: a244d8a32473650b25c6e8e1654387d6)
- b37 (Homo_sapiens_assembly19.fasta, MD5sum: 886ba1559393f75872c1cf459eb57f2d)
- GRCh37 (GRCh37.p13.genome.fasta, MD5sum: c140882eb2ea89bc2edfe934d51b66cc)
- humanG1Kv37 (human_g1k_v37.fasta, MD5sum: 0ce84c872fc0072a885926823dcd0338)
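- ### [Some notes on the human reference genome](https://www.plob.org/article/21355.html)
- ALT contigs are alternate sequences that reflect population polymorphism and differ somewhat from the sequence at the corresponding position on the primary chromosome. The risk of keeping them in the reference is that they artificially add duplicated sequence.
- Ambiguous bases are encoded as N
http://biocorp.ca/IUB.php
- 4.7 Comparison with NCBI and ENSEMBL
The sequences of the primary chromosomes are the same across these databases, but the chromosome names differ.
"chr1" at UCSC, "NC_000001.11" at NCBI, and "1" at the ENSEMBL;
The soft-masked bases are not exactly the same, because the three run RepeatMasker with different parameters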