2022.03.22 - [GTC] AI for Biology: Protein Structure Prediction and Beyond
===

###### tags: `meeting`, `talk`, `Nvidia`, `GTC`

> - [Session list](https://docs.google.com/spreadsheets/d/1Jm2gnqgc8tpFJaDc4nGwRHSzwMwFF8c_352uS1YtpXs/edit#gid=757809432)
> - [HCLS Dev Summit: AI for Biology - Protein Structure Prediction and Beyond [S42628]](https://reg.rainfocus.com/flow/nvidia/gtcspring2022/aplive/page/ap/session/1643414209690001u10x)

[TOC]

## Overview

**Originally developed to understand human language, self-supervised natural language processing models have recently been instrumental in understanding and predicting the structure and function of biomolecules like proteins.** **Much like they do for natural language, transformer-based representations of protein sequences provide powerful embeddings for use in downstream AI tasks, like predicting the final folded state of a protein, understanding the strength of protein-protein or protein-small molecule interactions, or in the design of protein structure provided a biological target.** **Here, we review some of the most exciting recent advancements and model architectures as a window into the tools, techniques, and infrastructure required to advance the field.**

- ### Presenter
    - Ade Ojewole, Senior Developer Relations Manager, NVIDIA
- ### Industry Segment
    - Healthcare & Life Sciences
- ### Primary Topic
    - Healthcare – Drug Discovery, Genomics

<hr>

## [Slides](https://docs.google.com/presentation/d/1jwMyvJrqKVziIStmEEp0yJgjAHPQPo4r_ln26mUIrtE/edit?usp=sharing)

> [Official 
PDF](https://static.rainfocus.com/nvidia/gtcspring2022/sess/1643414209690001u10x/SessionFile/HCLS%20Dev%20Summit%3A%20AI%20for%20Biology%20-%20Protein%20Structure%20Prediction%20and%20Beyond_1647313234154001tQdH.pdf)

### page1: AI for Biology

[![](https://i.imgur.com/xvnLJ49.jpg)](https://i.imgur.com/xvnLJ49.jpg)

- ### AI for biology:
    > **Protein structure prediction and beyond**

---

### page2: Agenda

[![](https://i.imgur.com/JuVc3rU.jpg)](https://i.imgur.com/JuVc3rU.jpg)

- ### AGENDA
    - **Data explosion in the biological sciences provides a great opportunity**
    - **Leveraging accelerated compute for state-of-the-art ML in biology**
    - **Structure determination**
    - **Cheminformatics and protein bioinformatics**
    - **Geometric learning**
    - **Future directions**

---

### page3: Motivation

[![](https://i.imgur.com/BQM68I1.jpg)](https://i.imgur.com/BQM68I1.jpg)

- ### MOTIVATION
    > **Developing and testing therapeutic molecules is slow and costly**
    - **Drug discovery and development is costly, slow, and has low rates of success**
    - **First principles models are limited:**
        - **assume simplified physical models**
        - **can be computationally prohibitive** (too expensive to compute in practice)
        - **require expert knowledge to set up correctly**
        - **have limited ability to leverage the increasing corpus of data**

---

![](https://i.imgur.com/keH2soJ.png)

- **Drug discovery**
    - **0-1: target to hit**
    - **1-2.5: hit to lead**
    - **2.5-4.5: lead optimization**
    - **4.5-5.5: preclinical**
- **Drug development**
    - **Phase I** (clinical trial, phase I)
    - **Phase II** (clinical trial, phase II)
    - **Phase III** (clinical trial, phase III)
    - **submission** (filing, in preparation for market launch)
- ### Supplementary material
    - hit
        - [Mainland China] 苗頭化合物
        - [Taiwan] 活性化合物 (active compound)
    - [[wiki] Drug development](https://zh.wikipedia.org/wiki/%E8%8D%AF%E7%89%A9%E7%A0%94%E5%8F%91)
    - 
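      [Chemical drugs | How a new drug goes from R&D to market, in one article](https://kknews.cc/science/v2avnv4.html)
        1. Target identification: drug targets include enzymes, receptors, ion channels, and so on
        2. Model building: screening compounds and evaluating their activity
        3. Lead compound discovery: compounds with some biological or pharmacological activity, obtained through various routes and methods
    - [Overview of the stages of drug discovery](https://kknews.cc/zh-tw/health/3okj6vo.html)
    - [Discovery and optimization of lead compounds](https://kknews.cc/news/opbv43m.html)
    - [Discovery of hit compounds](https://kknews.cc/health/q5z2r48.html)
    - [Developing novel drug delivery systems to open a new chapter for Taiwan's pharmaceutical industry](https://www.moea.gov.tw/MNS/doit/achievement/Achievements2.aspx?menu_id=5391&ac_id=1323)
    - [Introduction to the new CDE paid consultation service](https://www.cde.org.tw/Content/Files/Activity/CDE%E4%BB%98%E8%B2%BB%E8%AB%AE%E8%A9%A2%E4%BB%8B%E7%B4%B9_20211007.pdf)
      ![](https://i.imgur.com/O4D53HS.png)
        - Phase I trial design issues: (1) population selection (2) dose selection (3) trial structure (4) PK design (5) statistics (depending on the trial design) (6) ethical and regulatory issues
        - Phase II trial design issues (e.g. stability): (1) population selection (2) dose selection (3) trial structure (4) blinding design and statistics (5) ethical and regulatory issues (6) population PK design
        - Phase III trial design issues (e.g. reproductive toxicity): (1) indication selection (2) dose selection (3) trial structure (4) blinding design and statistical analysis (5) ethical and regulatory issues (6) efficacy endpoints (7) population PK design

---

### page4: Data Drives AI for Biology

[![](https://i.imgur.com/RB2Seao.jpg)](https://i.imgur.com/RB2Seao.jpg)

- ### DATA DRIVES AI FOR BIOLOGY
    > **Abundance of structure and sequence data in the last decade**
    - **Vs. 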
10 years ago: 2.13x more protein structures are publicly available** (= 162913 / 76389)
    - **Increasing rate of CryoEM structure deposition**
        > For CryoEM, see the notes on [page6](#page6)
    - **Finer resolution structures over time**
    - **Proliferation of proteomics has resulted in a wealth of protein sequence and function data**

---

### page5: ML Drives State-of-the-Art in Structure Determination

[![](https://i.imgur.com/mpg6WHa.jpg)](https://i.imgur.com/mpg6WHa.jpg)

- ### ML drives state-of-the-art in structure determination

---

### page6: Cryogenic Electron Microscopy (Cryo-EM)

[![](https://i.imgur.com/mqm9ZlI.jpg)](https://i.imgur.com/mqm9ZlI.jpg)

- ### Background
    - ["Seeing biomolecules more clearly!" The newest tool in structural biology: the cryo-electron microscope](https://research.sinica.edu.tw/tsai-ming-daw-cryo-electron-microscope/)
    - [2017 Nobel Prize in Chemistry: pioneers of cryo-electron microscopy honored](https://www.storm.mg/article/340228?mode=whole)
    - [What is Cryo-Electron Microscopy (Cryo-EM)?](https://www.youtube.com/watch?v=Qq8DO-4BnIY)
      ![](https://i.imgur.com/XgriBfC.png)
- ### CRYOGENIC ELECTRON MICROSCOPY (CRYO-EM)
    > **From Sample Processing to Data Collection to Machine Learning**
    - **Projected to dominate high resolution macromolecular structure determination**
    - **Advantages:**
        - **Structure determination is simpler, more robust**
        - **Can image large macromolecules**
        - **Less destructive to samples, does not require crystallization**
    - **With the tradeoff of high storage and computational requirements**

  ![](https://i.imgur.com/bJgdPCz.png)
- **Storage Pool**
    1. **Frame Alignment**
    2. **Particle Picking**
    3. **2D Classification**
    4. **3D Classification**
    5. 
       **3D Reconstruction**

![](https://i.imgur.com/wJejZVy.png)

---

### page7: Cryo-EM Workflow

[![](https://i.imgur.com/GIuz7O1.jpg)](https://i.imgur.com/GIuz7O1.jpg)

- ### CRYO-EM WORKFLOW
    > Large Data Requirements
    - Memory requirement: 512 GB
    - Storage requirement: 360 TB

<style> #page7 tr.header { background-color: #6ea200; color: white } #page7 th, #page7 td { padding: 0.1em 0.3em;} </style>
<table id="page7" style="font-size: 90%">
<tr class="header"> <th></th> <th>Microscope Output</th> <th>Frame Alignment</th> <th>Particle Picking</th> <th>2D Classification</th> <th>3D Classification</th> <th>3D Reconstruction</th> </tr>
<tr> <td>Description</td> <td></td> <td>Motion compensation to account for electron beam interaction with sample</td> <td>Identification of individual objects in the electron micrograph</td> <td>Determination of particle quality</td> <td>Determination of conformational heterogeneity</td> <td>Determine poses that best fit electron density</td> </tr>
<tr> <td>Application</td> <td></td> <td>MotionCor2</td> <td>Topaz</td> <td colspan="3">CryoSPARC</td> </tr>
<tr> <td>Method</td> <td></td> <td>Multi-GPU accelerated frame drift and anisotropic motion correction</td> <td>Multi-GPU self-supervised DL-based object detection</td> <td>Expectation maximization, Stochastic gradient descent</td> <td>Expectation maximization, Stochastic gradient descent</td> <td>Stochastic gradient descent for maximum likelihood</td> </tr>
<tr> <td>Storage Requirement (assuming 10<sup>6</sup> particles)</td> <td>2 TB</td> <td>14 GB</td> <td>207 GB</td> <td colspan="3"><center>36 TB</center></td> </tr>
</table>

- [[MotionCor2] GPU-accelerated microscope video speeds up research on containing COVID-19](https://kknews.cc/zh-tw/science/3y5qr8g.html)
    > Protein samples prepared for cryo-EM are held at -196 
degrees Celsius, frozen to preserve the biological structure, which the microscope's high-energy electron beam would otherwise destroy.
    >
    > The technique is better suited to slightly heterogeneous specimens such as macromolecules and cells. When the sample is tilted to different angles, the resulting series of images can be aligned and reconstructed into a detailed 3D model.
    >
    > The GPU setup can run eight jobs at once, iterating motion correction over videos of up to 400 frames (nearly 100 megapixels per frame).

---

### page8: Multi-GPU Scaling for 2.4-Fold Acceleration on the Cryo-EM Workflow

[![](https://i.imgur.com/ryYkjNj.jpg)](https://i.imgur.com/ryYkjNj.jpg)

- ### MULTI-GPU SCALING FOR 2.4-FOLD ACCELERATION ON CRYO-EM WORKFLOW
    > **Benchmarked on AWS**
    >
    > A100: 7h -> 3h = 2.33x
    >
    > Size per frame: 476 GB / 2756 = 172.7 MB per frame
    - **Cannabinoid Receptor 1-G Protein Complex**
        - **GPCR expressed in the central and peripheral nervous system**
        - **Dataset: 2756 frames, 476 GB**
    - **Analysis pipeline includes I/O and Compute**
        - **Frame alignment**
        - **Particle picker**
        - **2D/3D classification**
        - **3D reconstruction**

---

### page9: NLP Drives State-of-the-Art in Cheminformatics, Protein Bioinformatics

[![](https://i.imgur.com/Jewg7Kk.jpg)](https://i.imgur.com/Jewg7Kk.jpg)

- ### NLP DRIVES STATE-OF-THE-ART IN CHEMINFORMATICS, PROTEIN BIOINFORMATICS

---

### page10: NLP Language Models Are Massive

[![](https://i.imgur.com/xnMzZv3.jpg)](https://i.imgur.com/xnMzZv3.jpg)

- ### NLP LANGUAGE MODELS ARE MASSIVE
    - **We are experiencing an exponential increase in NLP model size**
    - **Transformer models achieve state-of-the-art performance**
    - **Downstream tasks improve monotonically as model size increases**
        - monotonically: results never get worse as the model grows
    - **Larger models achieve better results when trained on less data**
    - **Model Size (in billions of parameters)**
      ![](https://i.imgur.com/RIyj9fS.png)
        - **ELMo** (94M)
            - Embeddings from Language Models
            - [ELMo (Embeddings from Language Models)](https://medium.com/programming-with-data/x-c59937da83af)
                - **Handles polysemy**: it reads the context and can pick the correct sense of an ambiguous word
                - **Needs no manual labels**: it is pre-trained on large amounts of unlabeled text (unsupervised learning)
        - **GPT 
(GPT-1) (117M)**
            - Generative Pre-training
            - ==**[[GPT-3 training cost estimated at 350 million!] The bar for AI keeps rising; who can afford to play?](https://buzzorange.com/techorange/2020/08/25/gpt-3-ai-model/)**==
                - GPT = ELMo-style pre-training + fine-tuning
                - Training data volume and model scale
                  ![](https://i.imgur.com/I8vLkvu.png)
                - Financial backing
                    > In July 2019, Microsoft invested US$1 billion (about NT$29.3 billion) in OpenAI. Under the agreement, Microsoft supplies OpenAI with computing power, and OpenAI licenses part of its AI intellectual property to Microsoft for commercialization.
        - **BERT-Large** (340M)
            - BERT and GPT share the same framework, but GPT is a unidirectional language model while BERT is bidirectional
            - Unsupervised learning + supervised learning (that is, "pre-training + fine-tuning")
            - Drawback of BERT: it needs labeled domain-specific data, which can lead to overfitting
        - **GPT-2** (1.5B)
            - Generative Pre-training-2
            - Addresses the drawback of BERT
        - **Megatron-LM** (8.3B)
        - **T5** (11B)
        - **Turing-NLG** (17.2B)
            - NLG: Natural Language Generation
        - **GPT-3** (175B)
            - Generative Pre-training-3
        - **Megatron-Turing NLG** (530B)
            - NLG: Natural Language Generation

---

### page11: Megatron

[![](https://i.imgur.com/nSDTUCW.jpg)](https://i.imgur.com/nSDTUCW.jpg)

- ### MEGATRON
    > NVIDIA’s open-source project to efficiently train the world’s largest transformer-based language models
    - **Rearrange the layer normalization and residual connections in the transformer layer, improving scaling performance**
    - **Parallelism between layers**
        - **Split sets of layers across multiple devices**
        - **Layer 0,1,2 and layer 3,4,5 are on different devices**
    - **Parallelism within layers**
        - **Split individual layers across multiple devices**
        - **Both devices compute different parts of each layer**
    - References
        - [NVIDIA brings large language AI models to enterprises worldwide](https://blogs.nvidia.com.tw/2021/11/09/nvidia-brings-large-language-ai-models-to-enterprises-worldwide/)
            > The NeMo Megatron framework lets enterprises overcome the challenges of training sophisticated natural language processing models. It is optimized to scale out across the large-scale accelerated computing infrastructure of NVIDIA DGX SuperPOD™.
            >
            > NeMo Megatron automates the complexity of LLM training with data processing libraries that ingest, curate, organize, and clean data. Using advanced techniques for data, tensor, and pipeline parallelism, it distributes the training of large language models efficiently across thousands of GPUs. Enterprises can use the NeMo Megatron framework to train LLMs for their own domains and languages.

---

### page12: Megatron Scaling on Nvidia's DGX-A100 Cluster

[![](https://i.imgur.com/3h151t7.jpg)](https://i.imgur.com/3h151t7.jpg)

- ### Megatron Scaling On Nvidia's DGX-A100 Cluster
    > **Selene**
    - **Batch size: 3072**
    - **2048-token sequences**
    - **48-way data parallel**
    - **vocabulary size: 51200**

---

### page13: MegaMolBart (2021), a Drug Discovery Model

> [[Nvidia][NGC] MegaMolBART](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/megamolbart)

[![](https://i.imgur.com/lexgI7Z.jpg)](https://i.imgur.com/lexgI7Z.jpg)

- ### MEGAMOLBART (2021), a drug discovery model
    > **Leveraging Megatron for Cheminformatics**
    - **Model overview: MegaMolBart is a sequence-to-sequence bidirectional and auto-regressive transformer with two main use cases**
        - **Encoder embeddings can be used as features for predictive models**
        - **Encoder and decoder can be used together to generate novel molecules by sampling the model's latent space**
    - **Step 1: Pretrain**
      ![](https://i.imgur.com/mHQeGWG.png)
    - **Step 2: Fine-tune for downstream tasks**
        - **Application 1: reaction prediction**
          ![](https://i.imgur.com/NuEXizd.png)
            - reactant
            - products
        - **Application 2: molecule optimization**
          ![](https://i.imgur.com/uhujnDq.png)
        - **Application 3: property prediction**
          ![](https://i.imgur.com/RIlo3mQ.png)
            - FCN, Fully Convolutional Networks
- ### References
    - ### [How to download virtual-screening compound libraries from the ZINC database (www.zinc15.docking.org)](https://zhuanlan.zhihu.com/p/356969831)
        - ZINC is currently one of the largest databases of small organic compounds
    - ### [[Nvidia] MegaMolBART](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/megamolbart)
        - **MegaMolBART is a model that understands chemistry and can be used for a variety of cheminformatics applications in drug discovery.** **The embeddings from its 
encoder can be used as features for predictive models. Alternatively, the encoder and decoder can be used together to generate novel molecules by sampling the model's latent space.**
        - [Accelerating drug discovery with MegaMolBart in Clara Discovery](https://www.youtube.com/watch?v=pySSYb0Jftk)
            > https://www.facebook.com/NVIDIA.TW/posts/4047371831968806/
            >
            > Early in drug discovery, just screening for compounds that bind the target takes a great deal of time. MegaMolBart, part of the Clara Discovery framework, applies #AI and deep learning to speed up virtual screening, raising the odds that follow-up experiments succeed and accelerating drug development.
    - ### [Nvidia spends US$100 million to build the UK's fastest AI supercomputer](https://www.ithome.com.tw/news/145500)
        - ==MegaMolBART drug discovery model==
        - Developed from AstraZeneca's MolBART Transformer model, MegaMolBART can perform reaction prediction, molecular optimization, and de novo molecular generation, and in turn optimize the whole drug development process
    - ### [Amazing! Teacher BERT!](https://zhuanlan.zhihu.com/p/166273680)
        > Convert the molecular formulas of a large number of organic compounds into sequences, feed them into BERT like text for pre-training, and use the resulting pre-trained model as the encoder for downstream molecular prediction: toxicity, solubility, drug-likeness, and synthetic accessibility of organic compounds
    - ### [[wiki] IUPAC nomenclature of organic chemistry](https://zh.wikipedia.org/wiki/IUPAC%E6%9C%89%E6%9C%BA%E7%89%A9%E5%91%BD%E5%90%8D%E6%B3%95)
    - ### [[wiki] Simplified molecular-input line-entry system](https://zh.wikipedia.org/wiki/%E7%AE%80%E5%8C%96%E5%88%86%E5%AD%90%E7%BA%BF%E6%80%A7%E8%BE%93%E5%85%A5%E8%A7%84%E8%8C%83)
        - SMILES: Simplified molecular input line entry specification
        - SMILES describes a three-dimensional chemical structure with a string of characters, which requires turning the structure into a spanning tree
        - A canonical (unique) SMILES gives each molecule exactly one SMILES string; without canonicalization the same molecule can be written in several ways
    - ### [SMILES: a simplified molecular language](https://www.jianshu.com/p/8c915de5ad4d) :+1:
        - Unique SMILES
          ![](https://i.imgur.com/YHuIwqR.png)

          | Input SMILES | Unique SMILES |
          | ------------ | ------------- |
          | OC(=O)C(Br)(Cl)N | NC(Cl)(Br)C(=O)O |
          | ClC(Br)(N)C(=O)O | NC(Cl)(Br)C(=O)O |
          | O=C(O)C(N)(Br)Cl | NC(Cl)(Br)C(=O)O |
        - [SMILES canonicalization algorithm: CANGEN](https://www.zealseeker.com/archives/canonical-smiles-cangen/)
          ![](https://i.imgur.com/Aygla6D.png)
          ```
          C12C3C4C1C5C4C3C25
          ```
          ![](https://i.imgur.com/RyfE3f7.png)
    - ### SMILES online tools
        - [Online SMILES Translator - NCI/CADD](http://www.cheminfo.org/flavor/malaria/Utilities/SMILES_generator___checker/index.html)
    - ### SMILES examples
        - Hydrogen atoms
            - 
              [Water](https://zh.wikipedia.org/wiki/%E6%B0%B4)
            - [Ethanol (alcohol)](https://zh.wikipedia.org/wiki/%E4%B9%99%E9%86%87)
        - Chemical bonds
            - [Carbon dioxide](https://zh.wikipedia.org/wiki/%E4%BA%8C%E6%B0%A7%E5%8C%96%E7%A2%B3)
        - Ionic bonds
            - [Hydroxide](https://zh.wikipedia.org/wiki/%E6%B0%A2%E6%B0%A7%E6%A0%B9)
            - [Calcium hydroxide](https://zh.wikipedia.org/wiki/%E6%B0%A2%E6%B0%A7%E5%8C%96%E9%92%99)
        - [Trifluoroacetonitrile](https://zh.wikipedia.org/wiki/%E4%B8%89%E6%B0%9F%E4%B9%99%E8%85%88)
        - [Benzene](https://zh.wikipedia.org/wiki/%E8%8B%AF)
        - [Fluoroform](https://zh.wikipedia.org/wiki/%E4%B8%89%E6%B0%9F%E7%94%B2%E7%83%B7)
    - ### The problem of describing isomers in SMILES
        - Real examples
            - [Positions of the rightmost H and OH in glucose](https://smallcollation.blogspot.com/2013/04/anomers.html#gsc.tab=0)
              [![](https://i.imgur.com/qt3dET6.png)](https://i.imgur.com/qt3dET6.png)
            - [[wiki][en] SMILES / Stereochemistry / Alanine](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system#Stereochemistry)
                > ...the other three are either clockwise or counter-clockwise. These cases are indicated with @@ and @, respectively (because the @ symbol itself is a counter-clockwise spiral).
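
The `@` / `@@` convention quoted above can be checked mechanically. What follows is a minimal sketch in Python, not a real SMILES parser: `BRACKET_ATOM` and `chirality_tags` are hypothetical names introduced here, the regular expression only looks at bracket atoms, and extended tags such as `@TH1` are ignored. The three inputs are alanine written with each possible tag.

```python
import re

# Hypothetical helper pattern: one bracket atom such as "[C@@H]",
# capturing its chirality tag ("@" or "@@") when present.
BRACKET_ATOM = re.compile(r"\[[^\]@]*(@{1,2})?[^\]]*\]")

# Meaning of each tag, following the quoted rule: "@@" is clockwise, "@" is counter-clockwise.
LABELS = {"@@": "clockwise", "@": "counter-clockwise", None: "unspecified"}


def chirality_tags(smiles):
    """Return one label per bracket atom found in the SMILES string (toy scan only)."""
    return [LABELS[m.group(1)] for m in BRACKET_ATOM.finditer(smiles)]


for s in ("N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O", "N[CH](C)C(=O)O"):
    print(s, chirality_tags(s))
```

A cheminformatics toolkit such as RDKit would be the tool of choice for real stereochemistry work; the scan here only shows where the tags sit in the string.
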
- [Alanine](https://zh.wikipedia.org/wiki/%E4%B8%99%E6%B0%A8%E9%85%B8)
- [Trigonal carbon](https://www.easyatm.com.tw/wiki/%E4%B8%89%E8%A7%92%E5%9E%8B%E7%A2%B3)
    - Symbol @@: drawn as a solid wedge, meaning the bond lies above the plane of the carbon ring
    - Symbol @: drawn as a dashed wedge, meaning the group lies below the plane
- Test
  ```molecular
  N[C@@H](C)C(=O)O
  N[C@H](C)C(=O)O
  N[CH](C)C(=O)O
  ```
  ![](https://i.imgur.com/UD0u5Od.png)
- [Fischer projection](https://zh.wikipedia.org/wiki/%E8%B4%B9%E6%AD%87%E5%B0%94%E6%8A%95%E5%BD%B1%E5%BC%8F)
  ![](https://i.imgur.com/OsFCxDI.png)
    - [[Youtube][Organic chemistry] Ch2.10 Fischer Projection Part.1](https://www.youtube.com/watch?v=Ntj97tTDB_A)
- ### A very complex SMILES example
    - [[wiki] Tetracycline](https://zh.wikipedia.org/wiki/%E5%9B%9B%E7%92%B0%E9%BB%B4%E7%B4%A0)
      ![](https://i.imgur.com/swYHedX.png)
      `C[C@]1(c2cccc(c2C(=O)C3=C([C@]4([C@@H](C[C@@H]31)[C@@H](C(=C(C4=O)C(=O)N)O)N(C)C)O)O)O)O`
      ![](https://i.imgur.com/c9fDqh4.png)
- ### [Online SMILES Translator and Structure File Generator](https://cactus.nci.nih.gov/translate/)
    > #SMILES-to-PDB
    - [SMILES and Unique SMILES Definition](https://cactus.nci.nih.gov/translate/trans_info.html#unique)
    - [Notes on the PDB file format](https://jerkwin.github.io/2015/06/05/PDB%E6%96%87%E4%BB%B6%E6%A0%BC%E5%BC%8F%E8%AF%B4%E6%98%8E/)

    > SMILES: `C12C3C4C1C5C4C3C25`
    > ![](https://i.imgur.com/NgMcLK3.png)

---

### page14: MegaMolBart (2021) Achieves SOTA Performance in Molecule Generation

[![](https://i.imgur.com/4slJFwu.jpg)](https://i.imgur.com/4slJFwu.jpg)

- ### MegaMolBart (2021) Achieves SOTA Performance in Molecule Generation
    - [SOTA performance](https://towardsdatascience.com/software-design-patterns-and-principles-for-a-i-1-sota-tests-3dd265c6bf97): the best performance achieved so far on this problem
    - **Summary Statistics**

      | Property | Value |
      | -------- | ---- |
      | Number of Layers | 8 |
      | Number of Attention Heads | 4 |
      | Hidden Layers Size | 256 |
      | Local Batch Size | 512 |
      | Global Batch Size | 16,384 |
      | Optimizer | Adam |
      | Weight Decay | Beta1 = 0.9, Beta2 = 0.99 |
      | Training Steps | 610k |
      | Number of Parameters | 230M |
      | Number of Nodes | 1 |
      | Number of GPUs | 32 |
      | Total training time | 24 hours |
    - **Latent space sampling benchmarks**
        - **Validity: percentage of molecules generated that are valid SMILES, as per RDKit**
        - **Uniqueness: percentage of valid generated molecules that are unique**
        - **Beats CDDD model (Winter et al.) in generating novel, realistic molecules**
    - **See session for latest MegaMolBart!!**
- ### References
    - [Introduction to RDKit and how to use it](https://cowasks.com/culture/115851.html)
      RDKit is an open-source cheminformatics toolkit that operates on 2D and 3D molecular representations of compounds and generates compound descriptors for use with machine learning methods
    - [🔥 RDKit Python Wheels](https://pypi.org/project/rdkit-pypi/) `pip install rdkit-pypi`
    - [[RDKit] Basic Molecular Representation for Machine Learning](https://towardsdatascience.com/basic-molecular-representation-for-machine-learning-b6be52e9ff76)
      ![](https://i.imgur.com/wLPu6TY.png)

---

### page15: ProtTrans

[![](https://i.imgur.com/lyZtfTq.jpg)](https://i.imgur.com/lyZtfTq.jpg)

- ### ProtTrans (Protein Pre-Trained)
    **Large Protein Language Models Trained on Summit**
    - **Open-source project: Technical University of Munich, Med AI Technology, Google AI, NVIDIA and Oak Ridge National Laboratory**
        - Links: [Technical University of Munich](https://zh.wikipedia.org/zh-tw/%E6%85%95%E5%B0%BC%E9%BB%91%E5%B7%A5%E4%B8%9A%E5%A4%A7%E5%AD%A6), [Oak Ridge National Laboratory](https://zh.wikipedia.org/zh-tw/%E6%A9%A1%E6%A0%91%E5%B2%AD%E5%9B%BD%E5%AE%B6%E5%AE%9E%E9%AA%8C%E5%AE%A4)
    - **Models overview: ProtTXL is an auto-regressive language transformer trained on the IBM Summit supercomputer**
    - **Training corpus**

      |                       | UniRef100 | BFD |
      | ----------------------|--------|---------|
      | Number of Proteins | 216 M | 2,122 M |
      | Number of Amino Acids | 88 B | 393 B |
      | Disk Space | 150 GB | 572 GB |
        - UniRef100 averages `(88*10^9) / (216*10^6) = 407.4` amino acids per protein
        - BFD averages `(393*10^9) / (2122*10^6) = 185.2` 
amino acids per protein
- ### References
    - [[github] ProtTrans](https://github.com/agemagician/ProtTrans)
      **ProtTrans is providing state of the art pre-trained models for proteins.** **ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using various Transformer models.**

---

### page16: IBM Summit Supercomputer

[![](https://i.imgur.com/JSVvUlN.jpg)](https://i.imgur.com/JSVvUlN.jpg)

- ### IBM Summit Supercomputer
    - **World’s second-fastest supercomputer, fastest in the USA**
    - **200 petaflops peak** (a peak of 200 peta floating-point operations per second)
        - [PetaFLOPS](https://baike.baidu.hk/item/PetaFLOPS/6417557)
        - peta: unit prefix meaning 10 to the 15th power (one quadrillion)
        - FLOPS = FLoating-point Operations Per Second
    - **27,648 Tesla V100 GPUs**
    - **Summary Statistics**
      ![](https://i.imgur.com/QRVVQ2N.png)

---

### page17: ProtTXL Predicts Biochemical and Structural Properties

[![](https://i.imgur.com/VOMdJF4.jpg)](https://i.imgur.com/VOMdJF4.jpg)

- ### ProtTXL predicts biochemical and structural properties
    - **Self-supervised learning: t-SNE projections of uncontextualized amino acid embeddings**
        - **Good predictions of hydrophobicity, charge, size** (of the amino acids)

      ![](https://i.imgur.com/bWQlWRa.png)
    - **Supervised learning (with fine tuning):**
        - **a. Can predict secondary structure**
        - **b. Can predict cellular localization (i.e. 
membrane vs nonmembrane)**

---

### page18: Faster Inference Speed than Evolutionary Methods

[![](https://i.imgur.com/1ho9j7u.jpg)](https://i.imgur.com/1ho9j7u.jpg)

- ### ProtTXL: Faster Inference Speed than Evolutionary Methods
  ![](https://i.imgur.com/Ns4sScr.png)
    - **ProtTXL: up to 4x faster inference time than CPU**
    - **UniRef90 and UniRef100**
        - **113M and 216M proteins, respectively**
    - **CPU inference system: Intel Xeon Scalable Processor**
        - **“Skylake” Gold 6248 with 40 threads, 377 GB RAM**
            - Core codename: “Skylake”
            - Model: [Xeon Gold 6248](https://detail.zol.com.cn/1287/1286914/param.shtml)
            - Threads: 40
            - Memory: 377 GB

---

### page19: Breakthrough Performance in Geometric Deep Learning

[![](https://i.imgur.com/qMk3p09.jpg)](https://i.imgur.com/qMk3p09.jpg)

- ### Breakthrough Performance in Geometric Deep Learning

---

### page20: Accelerating the SE(3)-Transformer

[![](https://i.imgur.com/tF5Aeg3.jpg)](https://i.imgur.com/tF5Aeg3.jpg)

- ### Accelerating the SE(3)-Transformer
    - **SE(3)-Transformers are versatile graph neural networks unveiled at NeurIPS 2020 by Fuchs et al.**
    - **SE(3)-Transformers are useful for learning geometric symmetries in small molecules processing, proteins, or point clouds**
    - **They are part of larger models, such as RoseTTAFold and OpenFold**
    - **NVIDIA released an open-source optimized implementation that uses 43x less memory and is up to 21x faster than the baseline official implementation**

---

### page21: Our Accelerated SE(3)-Transformer Can Be Integrated into Larger Models

[![](https://i.imgur.com/7sX6oiw.jpg)](https://i.imgur.com/7sX6oiw.jpg)

- ### Our Accelerated SE(3)-Transformer Can Be Integrated into Larger Models
    - **Pooled SE(3)-Transformer for molecular property 
prediction**
      [![](https://i.imgur.com/a8ntAXq.png)](https://i.imgur.com/a8ntAXq.png)

---

### page22: Lower SE(3)-Transformer Footprint by 43x

[![](https://i.imgur.com/RUPXS1X.jpg)](https://i.imgur.com/RUPXS1X.jpg)

- ### Lower SE(3)-Transformer Footprint by 43x
    - **For research on proteins with amino acid residue as nodes, this means that you can feed longer sequences and increase the receptive field of each residue.**

---

### page23: Accelerating SE(3)-Transformer Training by 21x

[![](https://i.imgur.com/SRXUClK.jpg)](https://i.imgur.com/SRXUClK.jpg)

- ### Accelerating SE(3)-Transformer Training by 21x
    - **On an NVIDIA DGX A100, SE(3)-Transformers can now be trained in 12 minutes on the QM9 dataset.**
    - **By comparison, the authors of the original paper took 2.5 days on their hardware (NVIDIA GeForce GTX 1080 Ti).**
    - **Bottom line: faster training enables faster architecture search. 
Coupled with smaller memory footprint, you can now train bigger models with more attention layers, and feed larger inputs to the model.**
    - **Please see our developer blog for more details.**

---

### page24: Future Directions in AI for Biology

[![](https://i.imgur.com/RuIiLu0.jpg)](https://i.imgur.com/RuIiLu0.jpg)

- ### Future Directions in AI for Biology

---

### page25: Future Directions and Challenges

[![](https://i.imgur.com/hA5XyaQ.jpg)](https://i.imgur.com/hA5XyaQ.jpg)

- ### Future directions and challenges
    - **Biological sequences are distinct from natural language: future developments in language modeling, representation learning for biomolecules need to model properties fundamental to biomolecules**
    - **One direction is to condition language models on structural, chemical, and physical priors for increased fidelity**
    - **Also need to train for larger receptive fields to better capture the self-interactions and dynamics of biomolecules**
    - **We don’t know which ML architectures will power the next advancements in biology**
    - **What we do know is that the hardware that powers next-gen models, should be flexible, scalable, and have a mature software stack**

<hr>

## References

- ### [What is Cryo-Electron Microscopy (Cryo-EM)?](https://www.youtube.com/watch?v=Qq8DO-4BnIY)
- ### [“Seeing biomolecules more clearly!” The newest tool in structural biology: the cryo-electron microscope](https://research.sinica.edu.tw/tsai-ming-daw-cryo-electron-microscope/)
- ### [Using graph convolutional neural networks to decode molecule generation](https://www.gushiciku.cn/pl/pSnu/zh-tw)
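
As a supplement to the latent space sampling benchmarks on page14, the two metrics (validity and uniqueness) can be written down in a few lines. This is a rough sketch under stated assumptions, not the benchmark code used in the talk: the slide measures validity with RDKit, while `toy_is_valid` below is a hypothetical stand-in that only checks balanced brackets and paired single-digit ring closures, and `sampling_metrics` is a name introduced here for illustration. Uniqueness is computed on the raw strings; a real evaluation would canonicalize the SMILES first.

```python
def toy_is_valid(smiles):
    """Toy validity check (assumption: NOT a real SMILES parser).

    Accepts a string when parentheses balance, bracket atoms are closed,
    and every single-digit ring-closure label appears an even number of times.
    """
    depth = 0
    in_bracket = False
    open_rings = set()
    for ch in smiles:
        if in_bracket:
            # Digits inside [...] are isotopes/charges, not ring closures.
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first occurrence opens the ring, second closes it
    return bool(smiles) and depth == 0 and not in_bracket and not open_rings


def sampling_metrics(generated, is_valid=toy_is_valid):
    """Return (validity %, uniqueness %) as defined on the slide."""
    valid = [s for s in generated if is_valid(s)]
    validity = 100.0 * len(valid) / len(generated) if generated else 0.0
    # Uniqueness is measured among the valid molecules only.
    uniqueness = 100.0 * len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness


samples = ["CCO", "CCO", "c1ccccc1", "C1CC", "C(C"]
v, u = sampling_metrics(samples)
print(f"validity={v:.1f}% uniqueness={u:.1f}%")  # validity=60.0% uniqueness=66.7%
```

Passing the validity check as a parameter keeps the metric definitions separate from the parser, so an RDKit-based check could be swapped in for `toy_is_valid` without touching `sampling_metrics`.
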