2022.03.22 - [GTC] Accelerating Genomics Workflows Using Clara Parabricks
===
###### tags: `會議`, `講座`, `Nvidia`, `GTC`

> - [Session list](https://docs.google.com/spreadsheets/d/1Jm2gnqgc8tpFJaDc4nGwRHSzwMwFF8c_352uS1YtpXs/edit#gid=757809432)
> - [HCLS Dev Summit: Accelerating Genomics Workflows using Parabricks [S42590]](https://reg.rainfocus.com/flow/nvidia/gtcspring2022/aplive/page/ap/session/1643261496528001Fz2f)

<br>

[TOC]

<br>

## Introduction
**Genomic sequencing is faster and cheaper than ever. The new bottleneck in the genomics pipeline is in the analysis.**

**If it takes 30 hours to run variant calling on a single sample, it could take months or even years to process thousands of samples.**

**This is where CLARA Parabricks comes in. Using GPU acceleration, we've cut the variant calling time to below 30 minutes for a 30x human genome.**

**This allows for new genomics projects to be done at a scale that wasn't previously possible.**

**We'll discuss the capabilities of Parabricks and its performance compared to traditional genomics software packages (such as GATK), and show a demo of what it looks like in action.**

- ### Presenter
    [Gary Burnett](https://reg.rainfocus.com/flow/nvidia/gtcspring2022/aplive/page/ap/session/1643261496528001Fz2f), Technical Marketing Engineer, NVIDIA
- ### Industry Segment
    - Healthcare & Life Sciences
- ### Primary Topic
    - Healthcare – Drug Discovery, Genomics

<br>
<hr>
<br>

## [Slides](https://docs.google.com/presentation/d/1Wz-ZqYK8qMU6_LzDqihmri16elQGzlP0bYelY_ZTQik/edit?usp=sharing)
> https://docs.google.com/presentation/d/1Wz-ZqYK8qMU6_LzDqihmri16elQGzlP0bYelY_ZTQik/edit?usp=sharing

### page1: Accelerating genomics workflows using Parabricks
[![](https://i.imgur.com/ahvKf5R.jpg)](https://i.imgur.com/ahvKf5R.jpg)
- **Accelerating genomics workflows using parabricks**

<br>

### page2: Agenda
[![](https://i.imgur.com/qAilgGk.jpg)](https://i.imgur.com/qAilgGk.jpg)
- **AGENDA**
    - **What is Parabricks?**
    - **DEMO: Running Germline Analysis**
    - **What’s new in v3.7?**
    - **How to get started**

<br>

### page3: Computational genomics inflection point
[![](https://i.imgur.com/vciErw5.jpg)](https://i.imgur.com/vciErw5.jpg)
- **Computational genomics inflection point**
    > **Cost decreasing, throughput increasing**
- **X axis:** year
- **Left Y axis: Cost per Genome (USD)**
- **Right Y axis: Data on SRA (TBs)**, the amount of data held in the NCBI SRA
    - SRA: Sequence Read Archive, where the raw files of next-generation sequencing experiments are currently stored
- **1000 Genomes Project launches (2008)**
- **100,000 Genomes Project launches (2013)**
- **1,000,000 Genomes Initiative launches (2018)**

<br>

### page4: Next-generation sequencing workflows
[![](https://i.imgur.com/UtfzwUv.jpg)](https://i.imgur.com/UtfzwUv.jpg)
- **Next Generation Sequencing Workflows**
    > **Highlighted green boxes are currently supported with Clara Parabricks**
- **Primary Analysis (On Instrument)**
    > Base calling turns signal into sequence data (reads)
- **Secondary Analysis (Near Instrument)**
    > **Reference based: alignment-based applications (Dominant)**
    > **Reference free: de novo assembly of "New" genomes (Emerging)**

:::warning
:warning: **Parabricks does not support alignment and preprocessing of long reads**
Preprocessing here means MergeBam, Coordinate Sorting, Picard MarkDups, and BQSR (see the next page)
:::

- **Tertiary Analysis (Off Instrument)**
    > **Analyzing variations to assign annotations such as disease susceptibility or resistance**
    - **Phenotype to Genotype**
      What used to be inferred from outwardly visible symptoms can now be observed directly from the genes

<br>

### page5: Parabricks includes many familiar tools
[![](https://i.imgur.com/Urg0eMr.jpg)](https://i.imgur.com/Urg0eMr.jpg)
- **Parabricks Includes Many Familiar Tools**
    > **GPU and CPU Accelerated Bioinformatics Tools**
- **UMI**: unique molecular identifier
- **barcoded FastQ (read + umi)**: reads from which the sequencing adapter (the unique molecular identifier) has not yet been removed

<br>

### page6: Demo: running germline analysis
[![](https://i.imgur.com/2yloaFo.jpg)](https://i.imgur.com/2yloaFo.jpg)
- **DEMO: running germline analysis**

<br>

### page7: Parabricks germline pipeline
[![](https://i.imgur.com/GkM02E8.jpg)](https://i.imgur.com/GkM02E8.jpg)
- **Parabricks Germline Pipeline**
    > **Performance Benchmarks**

:::info
:bulb: **GATK runs on a single CPU core, so it does not get faster as more CPUs are added**
![](https://i.imgur.com/lPCRvRJ.png)
:::

:::danger
:warning: **Error on the slide**
![](https://i.imgur.com/h5KNd9y.png)
- The F1 score of DeepVariant is **0.9985** (see the calculation on the next page)
:::

:::info
:bulb: **References**
- [CNVKIT is CPU accelerated](https://docs.nvidia.com/clara/parabricks/v3.5/text/variant_callers.html#cnvkit)
- [The difference between multi-core programming and single-core multithreaded programming](https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/457223/)
- [Introduction to multi-core computing and programming](https://blog.xuite.net/wallace_tsou/twblog/116635855)
:::

<br>

### page8: Parabricks germline pipeline
[![](https://i.imgur.com/9lsGXlE.jpg)](https://i.imgur.com/9lsGXlE.jpg)
- **Parabricks Germline Pipeline**
    > **Running multiple variant callers**
- ### [UpSet plot] to [Venn diagram]
    [![](https://i.imgur.com/SvYiQZb.png)](https://i.imgur.com/SvYiQZb.png)
    [![](https://i.imgur.com/wmAtmCF.png)](https://i.imgur.com/wmAtmCF.png)

    #(TruthSet_HG002) = 3893 + 1481 + 3494167 + 5428 = 3504969
    #(DeepVariant) = 5428 + 3494167 + 2500 + 2601 = 3504696
    #(HaplotypeCaller) = 1481 + 3494167 + 2500 + 17805 = 3515953

    #Keywords: DeepVariant, HaplotypeCaller, strelka

- ### F1 score calculation
    HC = HaplotypeCaller, DV = DeepVariant.

    |           | FP                                        | FN                                       |
    |-----------|-------------------------------------------|------------------------------------------|
    | HC only   | 17805+2500 = 20305                        | 3893+5428 = 9321                         |
    | DV only   | 2601+2500 = 5101                          | 3893+1481 = 5374                         |
    |           |                                           |                                          |
    | Union     | (17805+2500) + (2601+2500) - 2500 = 22906 | (3893+5428) + (3893+1481) - 3893 = 10802 |
    | Intersect | 2500                                      | 3893                                     |

    |         | TP                     | TN    |
    |---------|------------------------|-------|
    | HC only | 1481+3494167 = 3495648 | 2601  |
    | DV only | 5428+3494167 = 3499595 | 17805 |

### F1 score of HC:
- precision = TP / (TP + FP) = 3495648 / (3495648 + 20305) = 0.9942248943600782
- recall = TP / (TP + FN) = 3495648 / (3495648 + 9321) = 0.9973406326846257
- f1 = 2/(1/precision + 1/recall) = 0.99578032628763 ≈ **0.9958**

### F1 score of DV:
- precision = TP / (TP + FP) = 3499595 / (3499595 + 5101) = 0.9985445242611627
- recall = TP / (TP + FN) = 3499595 / (3499595 + 5374) = 0.998466748208044
- f1 = 2/(1/precision + 1/recall) = 0.9985056347200616 ≈ **0.9985**

:::danger
:warning: **Error on the slide**
![](https://i.imgur.com/ygQkHH4.png)
- The 3,893 and 10,802 in the lower right corner should be swapped
- An **intersection (Intersect)** can never be larger than the **union (Union)** :exclamation::exclamation:
:::

<br>

### page9: What's new in Parabricks 3.7?
[![](https://i.imgur.com/9X2JH3j.jpg)](https://i.imgur.com/9X2JH3j.jpg)
- **What’s New In Parabricks 3.7?**

<br>

### page10: Parabricks 3.7 now includes 50+ tools
[![](https://i.imgur.com/3gkhvRw.jpg)](https://i.imgur.com/3gkhvRw.jpg)
- **Parabricks 3.7 Now Includes 50+ Tools**
    - **8 steps of fgbio condensed to a single command**

<br>

### page11: Supporting gene panels with molecular barcodes
[![](https://i.imgur.com/DcVhc6B.jpg)](https://i.imgur.com/DcVhc6B.jpg)
- **Supporting Gene Panels With Molecular Barcodes**
    > **Molecular Barcodes or Unique Molecular Indices (UMIs)**
- **Extract UMI > group by UMI > consensus reads**
- **Genomic material is fragmented or cDNA is generated.**
- **The fragments are uniquely labeled with molecular barcodes.**
- **After amplification and sequencing, there are many labeled reads.
But errors can occur in both reads and barcodes.**
    - true singleton: a genuinely unique molecule
- **Software helps to remove such errors, as well as PCR duplicates.**
- **Software-corrected molecules with singletons removed.**
- **UMI de-duplicating for ultra low frequency variants**
    - **VAF: Variant Allele Frequency**
- **References**
    - [A brief introduction to VAF (Variant Allele Frequency)](https://ppfocus.com/0/sc799a472.html)
      $$VAF = \frac{allele\_depth}{total\_depth} = \frac{AD}{DP}$$
        - **VAF**: Variant Allele Frequency
        - **MAF**: Minor Allele Frequency, the frequency of the minor allele in a population
        - **What does the value of VAF mean?**
          Take a diploid organism as an example. If the site is heterozygous in every cell, then 50% of the chromosomes carry the ref allele and the other 50% carry the alt allele, so the VAF observed at that site in the sequencing result should be 0.5. For a germline genotype, a reliable heterozygous variant site should have a VAF close to 0.5.
            - In germline variant detection, a shift in VAF is attributed to copy-number changes.
            - In somatic variant detection, a shift in VAF is more often attributed to tumor cell heterogeneity.

<br>

### page12: Supporting gene panels with molecular barcodes
[![](https://i.imgur.com/icWjDo1.jpg)](https://i.imgur.com/icWjDo1.jpg)
- **Supporting Gene Panels With Molecular Barcodes**
    > **Molecular Barcodes or Unique Molecular Indices (UMIs)**
- **Goals: simplified workflow with 10x acceleration**
- **Multi input capability:**
    - **UMIs still within the DNA read**
    - **UMIs extracted to FastQ read names**
    - **UMIs extracted to separate FastQ file**
- **DemuxFastq read structures, for extraction regardless of UMI position and length**
- **Supplementary material**
    - [fgbio tools](http://fulcrumgenomics.github.io/fgbio/tools/latest/)
    - [DemuxFastq](http://fulcrumgenomics.github.io/fgbio/tools/latest/DemuxFastqs.html)
      **Performs sample demultiplexing on FASTQs.**
        - **demultiplex**: to separate a combined signal back into its individual parts
    - [umi_fgbio](https://docs.nvidia.com/clara/parabricks/3.7.0/Documentation/ToolDocs/man_umi_fgbio.html)
        > fgbio: Fulcrum Genomic Bioinformatics?
        > **This UMI pipeline is based on Fulcrum Genomics toolkit, processes sequencing reads with molecular barcodes (also known as Unique Molecular Indices, UMIs), which provide impressive error correction and increased accuracy using a sequencing consensus read level.**

:::danger
:warning: **Error on the slide**
`pbrun umi` should be corrected to [`pbrun umi_fgbio`](https://docs.nvidia.com/clara/parabricks/3.7.0/Documentation/ToolDocs/man_umi_fgbio.html)
:::

<br>

### page13: Demo: running germline analysis (same title as page6)
[![](https://i.imgur.com/NTlEEDR.jpg)](https://i.imgur.com/NTlEEDR.jpg)
> The first part walks through running the command and printing its help output
> The second part looks at the 4 results of the germline run
- **DEMO: running germline analysis**

<br>

### page14: How to get started with Parabricks
[![](https://i.imgur.com/KWX0DTD.jpg)](https://i.imgur.com/KWX0DTD.jpg)
- **How To Get Started**

<br>

### page15: Verify the hardware and software requirements
[![](https://i.imgur.com/FmqnDMl.jpg)](https://i.imgur.com/FmqnDMl.jpg)
- **Verify the hardware and software requirements**
    > **Make sure your system meets the required specs**
    > https://docs.nvidia.com/clara/parabricks/3.7.0/GettingStarted.html#installation-requirements

<br>

### page16: You can run Parabricks anywhere
[![](https://i.imgur.com/MpVmYQF.jpg)](https://i.imgur.com/MpVmYQF.jpg)
- **You can run parabricks anywhere:**
    - **on-premises**
        - **Annual license software**
        - **Node locked** (single-machine license, tied to one machine)
        - **Floating license** (multi-machine license, not tied to a machine)
        - **Pay per GPU**
        - **EDU Discount**

<br>

:::info
:bulb: **[[Related reference] How do node-locked and floating licenses differ?](http://ahasoft.blogspot.com/2010/01/x-win32-node-locked-floating.html)**
:::

<br>

- **cloud**
    - **Can run on Google, Amazon, Microsoft**
    - **Working with channels: DNANexus, SevenBridges**
    - **Pay per hour**
    - **Some customers are only using the cloud**
    - **Some are just using the cloud
for peak computation**

<br>

:::info
:bulb: **SevenBridges**
- [Seven Bridges, a cloud-computing platform for genomic data analysis](https://ppfocus.com/0/eddd24ebe.html)
- [Bioinformatics company Seven Bridges joins Google Cloud Platform](https://kknews.cc/tech/g42pze.html)
- [火視全球 | Raising US$45 million: Seven Bridges Genomics brings the cloud and genomics together](https://kknews.cc/tech/38j849o.html)
:::

<br>

### page17: Parabricks starting points
[![](https://i.imgur.com/53CZUMu.jpg)](https://i.imgur.com/53CZUMu.jpg)
- **Starting points**
    - **Clara Parabricks Home Page**
      https://www.nvidia.com/en-us/clara/genomics/
    - **Getting Started Video**
      https://www.youtube.com/watch?v=AQltyCwPgU0&
    - **Request a Free Eval**
      > On Prem or Cloud
      > https://www.nvidia.com/en-us/clara/genomics/parabricks-free-trial/
    - **Access Parabricks AMI on AWS**
      https://aws.amazon.com/marketplace/pp/prodview-apbngojlskcyq
    - **Documentation**
      https://docs.nvidia.com/clara/parabricks/3.7.0/index.html
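
<br>

:::info
:bulb: **Supplement to page8: reproducing the F1 scores**
The hand calculation on page8 can be re-checked with a few lines of Python. This is only a sketch: the region counts are the ones read off the UpSet plot in these notes, and the names `REGIONS`, `caller_counts`, and `f1_from_counts` are hypothetical helpers written for this note, not part of Parabricks or any benchmarking tool.
:::

```python
# Region counts read off the page8 UpSet plot (truth set HG002 vs. two callers).
# "hc" = HaplotypeCaller, "dv" = DeepVariant. Key names are made up for this note.
REGIONS = {
    "truth_only": 3893,      # in the truth set, missed by both callers
    "truth_hc": 1481,        # in the truth set, found by HC only
    "truth_dv": 5428,        # in the truth set, found by DV only
    "truth_hc_dv": 3494167,  # in the truth set, found by both callers
    "hc_dv": 2500,           # called by both, not in the truth set
    "hc_only": 17805,        # called by HC only, not in the truth set
    "dv_only": 2601,         # called by DV only, not in the truth set
}


def caller_counts(regions, caller):
    """Return (TP, FP, FN) for one caller ("hc" or "dv")."""
    other = "dv" if caller == "hc" else "hc"
    tp = regions["truth_hc_dv"] + regions[f"truth_{caller}"]
    fp = regions["hc_dv"] + regions[f"{caller}_only"]
    fn = regions["truth_only"] + regions[f"truth_{other}"]
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for caller in ("hc", "dv"):
    tp, fp, fn = caller_counts(REGIONS, caller)
    precision, recall, f1 = f1_from_counts(tp, fp, fn)
    print(f"{caller.upper()}: TP={tp} FP={fp} FN={fn} "
          f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

Rounded to four decimals, this gives an F1 of 0.9958 for HC and 0.9985 for DV, matching the values worked out by hand on page8.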