Paper reading: graph-based genome for reference

# graph-based genome for reference

###### tags: `NGS課程筆記`

## Preface

These notes discuss excerpts from the review article cited below. I recently read the human pangenome reference newly published in Nature (2023.05), so I came back to review this material and publish the notes.

> Sherman, R.M., Salzberg, S.L. Pan-genomics in the human genome era. Nat Rev Genet 21, 243–254 (2020). https://doi.org/10.1038/s41576-020-0210-7

## pan-genomics in human

> No project to date has produced a comprehensive, analysable human pan-genome that surveys a wide variety of human populations, captures both genic and intergenic variation, and incorporates this variation into a single utilizable pan-genome

Building a pan-genome matters because the first step of the vast majority of genomic analyses is to align the sample reads to a reference sequence to work out their relative positions, and only then to carry out quality recalibration and variant detection. The reference sequence is therefore the baseline of the whole analysis. If the sample carries novel variants or population-specific variants, reads containing them may fail to align to the reference and be treated as erroneous reads, so the variants go undetected. The article notes:

> we now know that any given individual is likely to contain on the order of 20,000 structural variants (of >50 bp) relative to the reference genome

Correcting and updating the reference sequence is therefore important work. Just as GRCh38 added alternative contigs relative to GRCh37, many projects are now dedicated to finding new variant sites, mainly by increasing sequencing depth and sample size:

> 1000 Genomes Project (1KGP)
>
> initially collecting SNP array data and later generating low-coverage (mean 7.4×) whole-genome sequence (WGS) data for 2,504 samples from 26 populations
>
> In 2019 an updated re-sequencing of these 2,504 genomes was released in order to improve data quality and consistency

> Simons Genome Diversity Project
>
> deep coverage (30–40×) in short-read sequencing of 300 individuals from 142 diverse populations.
>
> The project assembled sequences that failed to align to the reference genome and discovered 5.8 Mb of novel, non-repeat sequences in the collection.
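The table below comes from the original paper. It shows the length of novel sequence found by different cohort studies; these novel regions help update the reference sequence and bring in variation from multiple populations.

![](https://i.imgur.com/bVmuY0U.png)

## pan-genomics representation

### graph reference genome

Once we know that new variants exist, how should they be incorporated into the reference sequence? The usual picture of a reference is a string of bases laid out along one long straight line; this form is called a linear genome. Its advantages are that it is easy to read at a glance and efficient to align against (the sequence works like a slide: once a similar stretch is found, the alignment can continue straight along it). Its disadvantage is that the form is too simple. GRCh37 in particular records fewer alternative haplotypes than GRCh38, and a linear reference does not fit every population equally well. References are therefore moving toward a graph genome representation. A graph has branches, so several sequences can be recorded at the same locus, as in the figure below:

![](https://i.imgur.com/pLVDInV.png)

Suppose genomes in African populations mostly carry A together with CGT, while Europeans mostly carry C together with a CGT deletion (or, equivalently, the African populations carry an insertion). With a linear reference, one of the two sequences may be reported as a variant. With a graph genome, both sequences can be aligned, which reduces the effect on variant detection.

### graph alignment

A graph genome not only adds sequence diversity and marks the differences. During alignment it also lets the aligner take several SNPs into account, which improves mapping quality and helps rule out these sequences being called as variants later. The figure below shows the benefit of a graph genome when aligning repetitive sequences.

![](https://i.imgur.com/0TBEvvy.png)

A graph genome carries more information, but the cost is higher computational complexity. Compromises include:

* Using haplotypes: if certain sequences are likely to occur together, treat them as one group and give the group a higher weight. When one of them is matched later, the sequence information of the others in the group can be filled in directly.
* Using two step alignment: as shown in the figure below, first use the graph genome to sort the reads by haplotype combination, then use a linear genome for the alignment and the downstream analysis steps.

![](https://i.imgur.com/Fijr9sK.png)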
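
The A/C SNP plus CGT insertion/deletion example and the two step idea above can be sketched in a few lines of Python. This is only a toy illustration, not how a real graph aligner works. The flanking sequences `TTG` and `GGA`, the node names, and the helper functions are all hypothetical and do not come from the article. Exact substring search stands in for real alignment, and enumerating every path is only feasible because the graph is tiny. In a real graph the number of paths grows quickly with the number of branches, which is one way to see the complexity cost mentioned above.

```python
from typing import Dict, List, Tuple

# Toy sequence graph: node name -> bases stored on that node.
# Flanks "TTG" / "GGA" are made up for illustration.
NODES: Dict[str, str] = {
    "left": "TTG",   # shared left flank
    "snp_A": "A",    # SNP allele A
    "snp_C": "C",    # SNP allele C
    "ins": "CGT",    # inserted segment (skipping it = deletion)
    "right": "GGA",  # shared right flank
}

# Directed edges: node name -> nodes that may follow it.
EDGES: Dict[str, List[str]] = {
    "left": ["snp_A", "snp_C"],
    "snp_A": ["ins", "right"],
    "snp_C": ["ins", "right"],
    "ins": ["right"],
    "right": [],
}


def all_paths(node: str, sink: str = "right") -> List[List[str]]:
    """Enumerate every path from `node` to `sink` (toy graphs only)."""
    if node == sink:
        return [[node]]
    return [[node] + rest for nxt in EDGES[node] for rest in all_paths(nxt, sink)]


def spell(path: List[str]) -> str:
    """Concatenate node sequences along a path into one linear haplotype."""
    return "".join(NODES[name] for name in path)


# A linear reference is just one path through the graph (here: A + CGT).
LINEAR_REF = spell(["left", "snp_A", "ins", "right"])


def two_step(read: str) -> List[Tuple[List[str], int]]:
    """Step 1: find which haplotype paths contain the read.
    Step 2: report the read's offset on that linear haplotype.
    Exact matching is an assumption standing in for real alignment."""
    hits = []
    for path in all_paths("left"):
        offset = spell(path).find(read)
        if offset != -1:
            hits.append((path, offset))
    return hits


if __name__ == "__main__":
    read = "TGCGG"  # read from a C + CGT-deletion haplotype
    print("found in linear reference:", read in LINEAR_REF)
    for path, offset in two_step(read):
        print("graph path:", " -> ".join(path), "| offset:", offset)
```

The read `TGCGG` has no exact match on the single linear path but does match the path through `snp_C` that skips `ins`, which mirrors the situation described for the two populations.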