## Preface

These are my notes on a review paper. Because the paper covers a lot of ground, the notes focus mainly on **the development of SV detection using short reads**.

## Abstract & Reference

> Ho, S.S., Urban, A.E. & Mills, R.E. Structural variation in the sequencing era. Nat Rev Genet 21, 171–189 (2020). https://doi.org/10.1038/s41576-019-0180-9

The paper reviews modern approaches for investigating SVs and proffers that, moving forwards, studies integrating biological information with detection will be necessary to comprehensively understand the impact of SV in the human genome.

## SV detection algorithm

Using short-read signatures to infer the presence of SVs relative to a reference genome is **limited by the sequence (read) length and insert size**.

Methods for resolving SVs that bypass the limitations of short-read approaches:

- algorithmic ensembles
- leveraging new technologies

### history of SV detection

- The very first methods were cytogenetic (chromosome banding or FISH), but they were limited to large-scale variation (e.g. trisomy)
- Microarrays arrived in the 2000s and were used to detect CNVs
- Sequencing-based approaches grew as costs came down, and they outperformed arrays:
    - detect **balanced SV types**
    - detect **novel SVs** missed by arrays
    - high resolution
- These advances pushed the number of detected SVs to **>25,000 per individual**

## algorithmic ensemble

### four main basic algorithms

- read-pair
  Calls a variant when the orientation of the two mates, or the insert size between them, does not match expectation
- read-depth
  Suited to deletions and duplications, which change the depth of coverage
- split-read
  Calls a variant when a read, once aligned to the reference, is not arranged as expected (the read maps over a breakpoint)
- local assembly
  Reassembles contigs and compares them with the reference sequence to find where they differ

### caller development and strategy

Early tools were built on a single algorithm and had limited detection ability, e.g. CNVnator.

Later tools combined several algorithms into a **hybrid algorithm**, e.g. DELLY, Lumpy, Manta.

Accuracy at the level of a single tool is still relatively low, because each tool suits particular SV types and size ranges; no single tool can do everything well.

> no individual caller has been shown to be capable of identifying the complete range of SV owing to the large diversity in viable detection approaches and the variability in SV subtype and size

The **ensemble algorithm**, which combines several callers to widen the detection range and to filter for high-confidence sites, has become the mainstream solution. For now each study assembles its own combination of tools, and there is no standard answer.

> The methods to integrate, combine and score calls vary
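markedly between studies

The general workflow and the factors to consider when designing ensemble algorithms are shown in the figure below:

![](https://hackmd.io/_uploads/rJOKT9XK2.png)

Panel B presents seven ways to decide whether variants from different callers can be merged:

- Ba merges by the degree of overlap between variants. **This is the most commonly used criterion, and it generally requires at least 50% reciprocal overlap**
- Bb merges by the distance between breakpoints (maximum distance?). **It is generally used with algorithms that resolve breakpoints at high resolution, such as split-read**
- Bc merges by genotype concordance. **This is a criterion for cases that demand high accuracy**
- Bd filters by read evidence, typically **when two tools detect a variant at the same position: the call with higher resolution and longer fragments is more credible than the fragmented one** (split reads > read pair > read depth)
- Be filters by read depth
- Bf filters the output variants with a **union or intersection strategy**, **or by requiring detection by at least a given number of callers**
- Bg is somewhat similar to Bb and also uses breakpoint distance. It seems to describe **the maximum breakpoint difference allowed among variants merged into the same site**
- Bh and Bi both describe benchmarking the accuracy of the merge and **tuning parameters** to filter variants, so as to improve the ROC curve and reduce the FDR

> applications will either intersect calls or take a union, decreasing and increasing sensitivity while decreasing and increasing the FDR, respectively

#### example of ensemble algorithm-based callers

| Caller | SVmerge | MetaSV | Parliament2 |
|:--------------:|:--------------:|:--------------:|:------------:|
| Merge strategy | intersection | union | intersection |
| Validation | local assembly | local assembly | x |

### population-scale SV detection (mostly for germline SVs)

- [1000 Genome Project Structural Variation sequencing](/JxrKdEE1RIaeY3k1HG5epw)
- [gnomAD database](/Ei6cobaCQiOtr277YH7dYA)

Directions in which population genomics can be extended:

- variants specific to ancestral backgrounds
- novel variants in specific populations
- *de novo* mutation rate using trio data
- phased SVs for Mendelian error rate using trio data

### Limitations in ensemble algorithm-based studies

- Detection results depend on sequencing depth and quality. For example, WGS depth in 1KGP was about 6-7x at the time, while the 2022 redo (see the 1KGP note above) raised it to 30x. Comparing the same batch of samples, the total number of variants did increase while recall stayed high
- Merging requires a trade-off between recall and precision
- There is still no standard merge strategy, and open-source tools mostly rely on overlap as the criterion, which may be biased or incomplete?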
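> stand-alone ensemble algorithm tools are largely immature and mostly rely on simple overlap

- Current tools mainly merge short-read data, so the limitations of the sequencing itself also constrain the merged result. Repeat regions, for instance, are called poorly to begin with, and merging does not help there

> ensemble algorithms focused on integrating only short-read data do not overcome the limitations of short-insert sizes; they continue to poorly detect small insertions and suffer in repetitive regions

## overview of the new technologies

The previous part covered algorithms designed for short reads, merge strategies, and the results and discussion of applying them to population data. This part introduces other existing sequencing technologies that improve SV detection.

* connected molecule strategy: inferring long connections between distally mapped short-read pairs
    * Linked reads
    * Strand-seq
    * Hi-C

sequence coverage vs. physical coverage

![](https://hackmd.io/_uploads/By2Gqfm63.png)

> Meyerson, M., et al. Nat Rev Genet 11, 685–696 (2010). https://doi.org/10.1038/nrg2841
> **Sequence coverage** represents the number of sequenced **reads** that **cover the site**; this affects the ability to detect point mutations.
> **Physical coverage** measures the number of **fragments** that **span the site**; this affects the ability to detect the rearrangement

* single molecule strategy: continuous reads tens to hundreds of kilobases long and improving alignment of unique reads in repetitive regions
    * long-read sequencing: PacBio and ONT
    * optical mapping: Bionano
* multiplatform discovery

## Connected molecule strategy

### Linked reads (~ synthetic long reads)

* **specific library preparations (partition and barcode)** to infer long-range information from existing short-read sequence
* Long DNA molecules are partitioned into oil droplets and tagged with barcodes, then fragmented and sequenced; the barcodes plus a predictive model are used to assemble the reads back into long sequences
* 10x Genomics stopped selling this technology in 2020
* **Judging from the schematic, reads are assembled into long fragments, which are then used to decide whether an SV is present**

![](https://hackmd.io/_uploads/HJPH5G7a3.png)

### Strand-seq

* During DNA replication, bromodeoxyuridine (a synthetic nucleoside analogue with a chemical structure similar to thymidine) is added and incorporated into the non-template strand, after which UV irradiation blocks that strand so that only a single strand is sequenced
* Because both complementary strands are available, Strand-seq data are well suited to haplotype phasing
* It also preserves strand directionality (synthesis clearly starts at the 5' end and runs until it stops), which suits the detection of rearrangements (inversion, translocation)
* The drawback is that library preparation and wash steps are laborious and reduce coverage, so short SVs are hard to detect
* **Judging from the schematic, the complementary strands are clearly separated and directional, which suits SVs that change orientation**
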
![](https://hackmd.io/_uploads/H1m85zQ62.png)

### Hi-C

* [Experiment video](https://www.jove.com/v/1869/hi-c-a-method-to-study-the-three-dimensional-architecture-of-genomes)
* Sequences crosslinked chromatin that is proximal in the 3D structure: neighbouring chromatin is captured, cut with a restriction enzyme, ligated into a ring, and sequenced
* The range extends up to megabases, which suits the detection of large translocations
* The drawback is that the chromatin must contain restriction enzyme cut sites
* **Judging from the schematic, the sequences cut out of the 3D structure can be compared by their true distance on the 2D linear genome and by depth, to identify large misplaced SVs**

![](https://hackmd.io/_uploads/ryFIqfma3.png)

## Single molecule strategy

### PacBio single-molecule real-time (SMRT) sequencing and Nanopore

* Algorithms detect SVs from SMRT data by **leveraging intra-read and inter-read signatures**
    * Intra-read
        * missing sequence (deletion)
        * soft-clip (insertion) within properly aligned flanking sequences
    * Inter-read
        * inconsistencies in orientation, location and size during mapping

### Optical Mapping

## Integrating SVs with biological information

* Once variants are detected, which techniques can be used to analyze their impact on downstream function

![](https://hackmd.io/_uploads/HJBvqfmTh.png)
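
As a closing note on the ensemble section, the most common merge criterion there (Ba, at least 50% reciprocal overlap) can be sketched in a few lines of Python. The `SV` record, the function names, and the toy coordinates are all made up for illustration. Real ensemble callers also weigh breakpoint distance, genotype and evidence type, so this is a minimal sketch under those assumptions and not the implementation of any particular tool.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV record covering the half-open interval [start, end)."""
    chrom: str
    start: int
    end: int
    svtype: str


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions (0.0 if the calls are disjoint)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    # Each call must be covered by the other, so take the worse of the two fractions.
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def same_event(a: SV, b: SV, min_ro: float = 0.5) -> bool:
    """Ba-style test: each call must cover at least min_ro of the other."""
    return reciprocal_overlap(a, b) >= min_ro


# Toy calls from hypothetical callers (coordinates are invented).
call_a = SV("chr1", 1_000, 2_000, "DEL")
call_b = SV("chr1", 1_200, 2_100, "DEL")  # shares 800 bp with call_a
call_c = SV("chr1", 1_900, 5_000, "DEL")  # shares only 100 bp with call_a

print(same_event(call_a, call_b))  # True
print(same_event(call_a, call_c))  # False
```

Keeping only calls that pass `same_event` across callers corresponds to an intersection-style merge, while also keeping the unmatched calls corresponds to a union (Bf). As the quote in that section says, the first lowers both sensitivity and FDR and the second raises both.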