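Genome Reference
===
###### tags: `Genomics/Secondary analysis`
###### tags: `Bioinformatics`, `Genome`, `Database`, `Genome Reference`, `WGS (whole-genome sequencing)`, `Base encoding`

<br>

[TOC]

<br>

## Reference sequences
### T2T-CHM13
- ### [How the T2T Consortium completed the first "complete" human genome sequence](https://www.blossombio.com/eNews/20210804/index.html)
    - GRCh38.p13 is missing nearly 8% of the sequence
    - The missing sequence lies mainly in heterochromatin and complex regions:
        - centromeric satellite arrays
        - rDNA arrays
        - subtelomeric regions
    - [Third-generation sequencing update: I want it all!! PacBio HiFi reads let you be greedy](https://www.toolsbiotech.com/news_detail.php?id=39)
      ![](https://www.toolsbiotech.com/upload/web/News/2021/image-4-1-1024x541.png)

### Q & A
- ### [The last piece of the Human Genome Project puzzle: whose genome should represent all of humanity?](https://www.thenewslens.com/article/174038/fullpage)
    - About 8% (roughly 200 million base pairs) of the current reference genome sequence still cannot be fully resolved
        - telomere sequences
        - centromere sequences
        - ribosomal DNA array sequences
    - Diploidy makes the analysis harder during assembly
        - Scientists found that a special cell line, CHM13 (Complete Hydatidiform Mole 13), helps solve this problem. Fertilization went wrong when this cell line arose, so the resulting cells contain only sperm-derived DNA.
    - "Whose genome should represent humanity?"
        - Giving every person their own complete genome sequence may be the ultimate route to personalized medicine and highly precise clinical diagnosis.

<br>

<hr>

<br>

## Computing the sequence length of a FASTA file
### Step 1: Compute the size of the headers
- [Step 1-1] Extract the header lines
    ```bash
    $ grep '>' human_g1k_v37_decoy.fasta > head.txt
    ```
- [Step 1-2] Measure the header lines
    ```bash
    $ wc head.txt
    86 258 5052 head.txt
    # 86 lines, 5052 bytes in total
    ```

### Step 2: Compute the size of the FASTA file
```bash
$ wc human_g1k_v37_decoy.fasta
52290996 52291168 3189750467 human_g1k_v37_decoy.fasta
# 52290996 lines, 3189750467 bytes in total
```

### Step 3: Subtract the headers from the FASTA size; what remains is the "sequence size"
```
3189750467 - 5052 - (52290996-86) = 3137454505
```
- Where:
    - ```3189750467```: size of the FASTA file in bytes
    - ```5052```: size of the header lines in bytes (this already includes the 86 header newlines)
    - ```52290996-86```: number of newline characters '\n' on sequence lines = newlines in the FASTA file - newlines in the headers

### The sequence length is about 3.137 G bases

### [Supplement] Sequence header lines
```
$ grep '>' human_g1k_v37_decoy.fasta
```
```fasta=
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
>2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
>4 dna:chromosome chromosome:GRCh37:4:1:191154276:1
>5 dna:chromosome chromosome:GRCh37:5:1:180915260:1
>6 dna:chromosome chromosome:GRCh37:6:1:171115067:1
>7 dna:chromosome chromosome:GRCh37:7:1:159138663:1
>8 dna:chromosome chromosome:GRCh37:8:1:146364022:1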
>9 dna:chromosome chromosome:GRCh37:9:1:141213431:1
>10 dna:chromosome chromosome:GRCh37:10:1:135534747:1
>11 dna:chromosome chromosome:GRCh37:11:1:135006516:1
>12 dna:chromosome chromosome:GRCh37:12:1:133851895:1
>13 dna:chromosome chromosome:GRCh37:13:1:115169878:1
>14 dna:chromosome chromosome:GRCh37:14:1:107349540:1
>15 dna:chromosome chromosome:GRCh37:15:1:102531392:1
>16 dna:chromosome chromosome:GRCh37:16:1:90354753:1
>17 dna:chromosome chromosome:GRCh37:17:1:81195210:1
>18 dna:chromosome chromosome:GRCh37:18:1:78077248:1
>19 dna:chromosome chromosome:GRCh37:19:1:59128983:1
>20 dna:chromosome chromosome:GRCh37:20:1:63025520:1
>21 dna:chromosome chromosome:GRCh37:21:1:48129895:1
>22 dna:chromosome chromosome:GRCh37:22:1:51304566:1
>X dna:chromosome chromosome:GRCh37:X:1:155270560:1
>Y dna:chromosome chromosome:GRCh37:Y:2649521:59034049:1
>MT gi|251831106|ref|NC_012920.1| Homo sapiens mitochondrion, complete genome
>GL000207.1 dna:supercontig supercontig::GL000207.1:1:4262:1
>GL000226.1 dna:supercontig supercontig::GL000226.1:1:15008:1
>GL000229.1 dna:supercontig supercontig::GL000229.1:1:19913:1
>GL000231.1 dna:supercontig supercontig::GL000231.1:1:27386:1
>GL000210.1 dna:supercontig supercontig::GL000210.1:1:27682:1
>GL000239.1 dna:supercontig supercontig::GL000239.1:1:33824:1
>GL000235.1 dna:supercontig supercontig::GL000235.1:1:34474:1
>GL000201.1 dna:supercontig supercontig::GL000201.1:1:36148:1
>GL000247.1 dna:supercontig supercontig::GL000247.1:1:36422:1
>GL000245.1 dna:supercontig supercontig::GL000245.1:1:36651:1
>GL000197.1 dna:supercontig supercontig::GL000197.1:1:37175:1
>GL000203.1 dna:supercontig supercontig::GL000203.1:1:37498:1
>GL000246.1 dna:supercontig supercontig::GL000246.1:1:38154:1
>GL000249.1 dna:supercontig supercontig::GL000249.1:1:38502:1
>GL000196.1 dna:supercontig supercontig::GL000196.1:1:38914:1
>GL000248.1 dna:supercontig supercontig::GL000248.1:1:39786:1
>GL000244.1 dna:supercontig supercontig::GL000244.1:1:39929:1
>GL000238.1 dna:supercontig supercontig::GL000238.1:1:39939:1
>GL000202.1 dna:supercontig supercontig::GL000202.1:1:40103:1
>GL000234.1 dna:supercontig supercontig::GL000234.1:1:40531:1
>GL000232.1 dna:supercontig supercontig::GL000232.1:1:40652:1
>GL000206.1 dna:supercontig supercontig::GL000206.1:1:41001:1
>GL000240.1 dna:supercontig supercontig::GL000240.1:1:41933:1
>GL000236.1 dna:supercontig supercontig::GL000236.1:1:41934:1
>GL000241.1 dna:supercontig supercontig::GL000241.1:1:42152:1
>GL000243.1 dna:supercontig supercontig::GL000243.1:1:43341:1
>GL000242.1 dna:supercontig supercontig::GL000242.1:1:43523:1
>GL000230.1 dna:supercontig supercontig::GL000230.1:1:43691:1
>GL000237.1 dna:supercontig supercontig::GL000237.1:1:45867:1
>GL000233.1 dna:supercontig supercontig::GL000233.1:1:45941:1
>GL000204.1 dna:supercontig supercontig::GL000204.1:1:81310:1
>GL000198.1 dna:supercontig supercontig::GL000198.1:1:90085:1
>GL000208.1 dna:supercontig supercontig::GL000208.1:1:92689:1
>GL000191.1 dna:supercontig supercontig::GL000191.1:1:106433:1
>GL000227.1 dna:supercontig supercontig::GL000227.1:1:128374:1
>GL000228.1 dna:supercontig supercontig::GL000228.1:1:129120:1
>GL000214.1 dna:supercontig supercontig::GL000214.1:1:137718:1
>GL000221.1 dna:supercontig supercontig::GL000221.1:1:155397:1
>GL000209.1 dna:supercontig supercontig::GL000209.1:1:159169:1
>GL000218.1 dna:supercontig supercontig::GL000218.1:1:161147:1
>GL000220.1 dna:supercontig supercontig::GL000220.1:1:161802:1
>GL000213.1 dna:supercontig supercontig::GL000213.1:1:164239:1
>GL000211.1 dna:supercontig supercontig::GL000211.1:1:166566:1
>GL000199.1 dna:supercontig supercontig::GL000199.1:1:169874:1
>GL000217.1 dna:supercontig supercontig::GL000217.1:1:172149:1
>GL000216.1 dna:supercontig supercontig::GL000216.1:1:172294:1
>GL000215.1 dna:supercontig supercontig::GL000215.1:1:172545:1
>GL000205.1 dna:supercontig supercontig::GL000205.1:1:174588:1
>GL000219.1 dna:supercontig supercontig::GL000219.1:1:179198:1
>GL000224.1 dna:supercontig supercontig::GL000224.1:1:179693:1
>GL000223.1 dna:supercontig supercontig::GL000223.1:1:180455:1
>GL000195.1 dna:supercontig supercontig::GL000195.1:1:182896:1
>GL000212.1 dna:supercontig supercontig::GL000212.1:1:186858:1
>GL000222.1 dna:supercontig supercontig::GL000222.1:1:186861:1
>GL000200.1 dna:supercontig supercontig::GL000200.1:1:187035:1
>GL000193.1 dna:supercontig supercontig::GL000193.1:1:189789:1
>GL000194.1 dna:supercontig supercontig::GL000194.1:1:191469:1
>GL000225.1 dna:supercontig supercontig::GL000225.1:1:211173:1
>GL000192.1 dna:supercontig supercontig::GL000192.1:1:547496:1
>NC_007605
>hs37d5
```

<br>

<hr>

<br>

## Mitochondrion ([NC_012920](https://www.genome.jp/dbget-bin/www_bget?refseq:NC_012920))
- LOCUS
    - NC_012920
    - 16569 bp
    - DNA
- DEFINITION
    - Homo sapiens mitochondrion, complete genome

## Human gammaherpesvirus 4 ([NC_007605](https://www.genome.jp/dbget-bin/www_bget?refseq:NC_007605))
- LOCUS
    - NC_007605
    - 171823 bp
    - DNA
- DEFINITION
    - Human gammaherpesvirus 4, complete genome.

<br>

<hr>

<br>

## References
- ### [Some notes on the human reference genome](https://www.plob.org/article/21355.html)
    - #### What is an ALT contig?
        > To better capture variation in the human genome across the world;
        > it (the initial hg38 release) contains more copies of some loci than hg19 (the initial release).
        >
    - #### What are patches?
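    - #### ALT contig
        - Some reference sequences are placed at more than one location. For example, certain regions of chromosome X are also present on chromosome Y (the pseudoautosomal regions, PARs). In a standard pipeline, variants cannot be called in these regions (how should this be understood? The first point above mentions it as well). The proper approach is to mask these regions on chromosome Y.
    - #### Encoding ambiguous bases as N
        - [IUB Nucleotide Codes](http://biocorp.ca/IUB.php)

          | Code | Definition |  Mnemonic  |
          |:----:|:----------:|:----------:|
          |  A   |  Adenine   |     A      |
          |  C   |  Cytosine  |     C      |
          |  G   |  Guanine   |     G      |
          |  T   |  Thymine   |     T      |
          |  R   |     AG     |   puRine   |
          |  Y   |     CT     | pYrimidine |
          |  K   |     GT     |    Keto    |
          |  M   |     AC     |   aMino    |
          |  S   |     GC     |   Strong   |
          |  W   |     AT     |    Weak    |
          |  B   |    CGT     |   Not A    |
          |  D   |    AGT     |   Not C    |
          |  H   |    ACT     |   Not G    |
          |  V   |    ACG     |   Not T    |
          |  N   |    AGCT    |    aNy     |

    - #### human_g1k_v37.fasta
        - Contains the 24 human chromosome-level sequences, the mitochondrial sequence, and the contig sequences that have no placement.
        - hs37d5.fa contains all of the above plus NC_007605 and hs37d5:
            - NC_007605 is the sequence of a human herpesvirus, Human herpesvirus 4 type 1, also known as EBV;
            - hs37d5 is the decoy sequence, made up of many short sequences separated by runs of N.
        - What are these short sequences, and what are they for?
            > These short sequences can simply be regarded as human-derived sequences that the reference does not contain (a limitation of assembly technology). How were they found? By comparing existing clone sequences against the reference sequence. The reference genome is incomplete, and some regions are always missing from it. Reads that come from those regions either fail to align or align incorrectly somewhere else (causing false positives). Adding the sequences that are already known but absent from the reference improves the situation.
            >
            > The EBV sequence plays a role similar to the decoy sequence: both help reads align to where they truly came from. The difference is that EBV is not part of the human genome, but cells may carry it, so it can show up when the extracted DNA is sequenced.
- ### [2020-01-15 Understanding the different versions of the human reference genome and how to choose](https://www.codenong.com/jse65115b4633a/)
    - #### To align to GRCh37 (hg19), use hs37-1kg:
      ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
    - #### To align to GRCh37 when you consider the decoy sequence* helpful for variant calling, use hs37d5:
      ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
    - #### To align to GRCh38 (hg38):
      ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
    - #### The yellow part is the flanking sequence, which plays a regulatory role
      [](https://i.imgur.com/7IueIIu.png)
        - The poly-A tail protects mRNA
    - #### Pseudoautosomal regions (PARs)
        - These are homologous nucleotide sequences on the X and Y chromosomes. Pseudoautosomal genes (at least 29 have been found so far) show autosomal rather than sex-linked inheritance.
          ![](https://i.imgur.com/0W3iAmm.png)
    - #### Not including unplaced and unlocalized contigs.
        - When the genome omits the unlocalized and unplaced sequences, reads from those sequences are forced to map to other chromosomes, which leads to erroneous variant calls.
          ![](https://i.imgur.com/fZQgo6A.png)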
- ### [Which human reference genome to use?](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use)
- ### [Human genome reference builds - GRCh38 or hg38 - b37 - hg19](https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19)
    ![](https://i.imgur.com/LeIjA5o.png)
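
## [Supplement] Cross-checking the FASTA sequence length

The `wc` arithmetic used earlier in this note for `human_g1k_v37_decoy.fasta` can be cross-checked by counting the sequence characters directly. The following is a minimal sketch and not part of the original note: `fasta_seq_length` is a hypothetical helper name, and it assumes every header line starts with `>`. It also strips carriage returns, so Windows line endings would not inflate the count.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not from the original note):
# print the number of sequence characters in a FASTA file, i.e. every
# byte that is neither part of a header line nor a line terminator.
fasta_seq_length() {
    # 1. grep -v '^>'  : drop header lines (assumed to start with '>')
    # 2. tr -d '\r\n'  : delete line terminators
    # 3. wc -c         : count the remaining bytes
    # 4. tr -d ' '     : remove padding that some wc implementations add
    grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}
```

Running `fasta_seq_length human_g1k_v37_decoy.fasta` should agree with the subtraction result if the file uses Unix line endings and ends with a newline; a mismatch would suggest that one of those assumptions does not hold.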