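Differentiable Open-Ended Commonsense Reasoning: Reading Notes
===

[Original paper](https://arxiv.org/abs/2010.14439)

Abstract
===
- OpenCSR (open-ended commonsense reasoning) is a task rather than a method: answer a commonsense question without a predefined list of answer choices, using only a corpus of commonsense facts written in natural language.
- DRFACT performs multi-hop reasoning over knowledge facts.
- To evaluate OpenCSR methods, the authors adapt three popular multiple-choice datasets and collect multiple new answers for each test question via crowdsourcing.

---

1 Introduction
===
- Existing commonsense reasoning models usually work by scoring question-candidate answer pairs.
- OpenCSR instead requires producing a ranked list of answers from a large, question-independent set of candidate concepts extracted offline from a corpus of commonsense facts.

<big>**The OpenCSR problem**</big>
- Relevant facts must be retrieved from the corpus using only the question itself.

<big>**Multi-hop reasoning**</big>
- The answer is inferred by combining multiple facts rather than by relying on a single fact.
- OpenCSR has no annotations that tell the model which facts belong in the reasoning chain; the only supervision signal is a set of questions and answers.
- Most commonsense questions require reasoning across concepts, but the relations among concepts such as "tree", "carbon dioxide", "atmosphere", and "photosynthesis" are hard to represent in a simple knowledge graph (KG).
- Solution: a commonsense corpus (e.g. GenericsKB)
    - A corpus of natural language sentences stating generic commonsense facts, such as "trees remove carbon dioxide from the atmosphere through photosynthesis". Facts can be extracted from such a corpus.
    - → How can we conduct multi-hop reasoning over such a knowledge corpus, similar to the way multi-hop reasoning methods traverse a KG?
    - → Can we achieve this in a differentiable way, to support end-to-end learning?
- Differentiable
    - When a model is differentiable, it can be trained with gradient-based optimization: we compute the gradient of the loss with respect to the model parameters and adjust the parameters so the model fits the training data better. Example: a linear regression model is differentiable.
- End-to-end learning
    - The whole model is treated as a black box. We only provide inputs and expected outputs, and the model learns by itself how to extract features and make predictions.
- Multi-hop reasoning
    - Inferring the relation between concepts in a knowledge graph through multiple steps. In a KG, each concept is a node and relations between concepts are edges. Multi-step reasoning crosses several nodes and edges to infer complex relations between concepts.

**DRFACT**
- Formulates multi-hop reasoning over a corpus as an iterative process of differentiable fact-following operations over a hypergraph.
- Encodes every sentence in the corpus as a dense vector to form a neural fact index, which supports fast retrieval through maximum inner product search (MIPS).
    - Neural fact index: facts are encoded as vectors, and these vectors together form an index for quickly retrieving relevant facts. For example, to find the capital of the UK, we compute the inner product between each fact vector and the vector of the query "What is the capital of the UK?", then pick the fact with the largest inner product.
    - MIPS (to study in more detail later): the problem of finding the vectors in an index whose inner product with a query vector $q$ is largest. A plain linear search computes the inner product with every vector one by one. Approximate MIPS methods speed this up, for example with an inverted-index structure: the vectors are first partitioned into groups, and only the groups closest to $q$ are searched, which greatly reduces the number of inner products that must be computed.
- Fact-to-fact matrix
    - Stores symbolic links between facts (e.g. two facts are linked if they share a common concept).

<big>**Evaluating OpenCSR methods**</big>
- OpenCSR datasets are created to evaluate DRFACT, and crowdsourced human annotation is used to collect multiple new answers for the test questions.

---

2 Related Work
===
Commonsense Reasoning
---
- Recent commonsense reasoning (CSR) methods mostly focus on multiple-choice QA. These methods are a poor fit for real applications, where answer candidates are usually not available.
- UnifiedQA and other closed-book QA models can generate answers to questions, but they do not provide supporting evidence for those answers, which makes them less trustworthy.
- Some closed-book models are augmented with an additional retrieval module, but these models mainly handle single-hop reasoning.

QA over KGs or Text
---
- Triple-based symbolic commonsense knowledge graphs (CSKGs), with triples such as ("object 1", "attribute", "object 2"), describe only the relation between two entities, so they can be limited in representing complex commonsense knowledge.
    - Example of a three-way relation that triple-based CSKGs cannot handle: "Xiao Ming is the tallest student in this class". The relation involves three elements: "Xiao Ming", "this class", and "tallest". "Xiao Ming" and "this class" are entities; "tallest" is a descriptive term.
- GenericsKB (corpus of generic sentences about commonsense facts):
    - Text can represent more complex commonsense knowledge, including facts that relate three or more concepts.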
- OpenCSR might need iterative retrieval.
- OpenCSR questions provide fewer surface hints.
    - Surface hints: the open-domain QA question "Who was the first president of the United States?" contains keywords such as "United States" and "first president", which give clues about what is needed to answer it, so the answer can be reached with single-step reasoning. OpenCSR questions usually involve more complex commonsense reasoning and carry less surface information. For example, "Why do cats like to catch mice?" gives no direct clue to the answer; answering it requires multi-step reasoning across several concepts (cat, mouse, hunting). This makes OpenCSR harder.

Multi-Hop Reasoning
---
- Recent models that perform multi-hop reasoning through iterative retrieval (GRAFT-Net (Sun et al., 2018), MUPPET (Feldman and El-Yaniv, 2019), PullNet (Sun et al., 2019), and GoldEn (Qi et al., 2019)) are not end-to-end differentiable, so they tend to be slow.
- Neural Query Language (Cohen et al., 2020):
    - differentiable multi-hop reasoning
    - entity-following templates for reasoning over a compactly stored symbolic KG
    - but a KG can only handle binary relations

<big>**DrKIT**</big>

| DrKIT | DRFACT |
| --- | --- |
| multi-hop between entities | multi-hop between facts |
| 1) finding mentions of new entities x' that are related to some entity in X, guided by the indices, and then 2) aggregating these mentions to produce a new weighted set of entities. | |
| differentiable | differentiable |
| only named entities | |

| entity | fact |
| --- | --- |
| a concrete thing or concept | a relation between these entities |
| node | edge |
| e.g. people, places, organizations, events | e.g. "John is the CEO of some organization", "Paris is the capital of France" |

---

3 Open-Ended Commonsense Reasoning
===
Task Formulation
---
- $F$: corpus of knowledge facts
    - each fact is a sentence that describes generic commonsense knowledge
    - e.g. "trees remove carbon dioxide from the atmosphere through photosynthesis."
- $V$: vocabulary of concepts
    - a concept is a noun or base noun phrase mentioned frequently in these facts
    - e.g. 'tree' and 'carbon dioxide'
- $q$: question
    - answered by returning a weighted set of concepts, such as {(a1 = 'renewable energy', w1), (a2 = 'tree', w2), ...}, where $w_i \in \mathbb{R}$ is the weight of the predicted concept $a_i \in V$.
- To be interpretable and trustworthy, a reasoning model is expected to output intermediate results that justify the reasoning process, i.e. the supporting facts from $F$. E.g., an explanation for 'tree' being an answer to the question above can be the combination of two facts: f1 = "carbon dioxide is the major ..." and f2 = "trees remove ..."

Implicit Multi-Hop Structures
---

| Commonsense questions | Multi-hop factoid QA |
| --- | --- |
| more implicit and relatively unclear | focuses on querying evident relations between named entities |
| the reasoning process can be implicit and relatively unclear | the reasoning process can be decomposed into more specific sub-questions |
| e.g. question: "what can help alleviate global warming?" → q1 = "what contributes to global warming", q2 = "what removes q1.answer from the atmosphere" ==but many other decompositions are also plausible== | e.g. question: "which team does the player named 2015 Diamond Head Classic's MVP play for?" → q1 = "the player named 2015 DHC's MVP", q2 = "which team does q1.answer play for" |
| questions that can only be answered through general commonsense understanding | questions that require several reasoning steps across multiple facts |

- Unlike HotpotQA, there are no annotated supporting facts or reasoning chains to train on; the only supervision is the question-answer pairs.

---

4 DrFact: An Efficient Approach for Differentiable Reasoning over Facts
===
![](https://i.imgur.com/RqkVH1F.png)

Figure 3: The overall workflow of DRFACT. We encode the hypergraph (Fig. 2) with a concept-to-fact sparse matrix E and a fact-to-fact sparse matrix S. The dense fact index D is pre-computed with a pre-trained bi-encoder. A weighted set of facts is represented as a sparse vector F. The workflow (left) of DRFACT starts by mapping a question to a set of initial facts that have common concepts with it. Then, it recursively performs Fact-Follow operations (right) for computing $F_t$ and $A_t$. Finally, it uses learnable hop-weights $\alpha_t$ to aggregate the answers.

**Summary of steps**
1. Preprocessing: build the **Sparse Fact-to-Fact Index (S)** and the **Dense Neural Fact Index (D)** (produced by a bi-encoder), which are used to score how facts relate to one another.
2. Fact-following
    1. Sparse retrieval: pass $F_{t-1}$ through the matrix $S$ to get the candidate next-hop facts $F_t^s$.
       $F_t^s = F_{t-1}S$
    2. Dense retrieval
        1. Map $F_{t-1}$ through $D$ to get $z_{t-1}$.
           $z_{t-1} = F_{t-1}D$
        2. Feed $z_{t-1}$ and $q_t$ to an MLP to get **$h_{t-1}$**.
           $h_{t-1} = g(z_{t-1}, q_t)$
        3. Use MIPS (maximum inner product search) over $D$ with $h_{t-1}$ to retrieve the top-K relevant next-hop facts, $F_t^d$.
           $F_t^d = \text{MIPS}_K(h_{t-1}, D)$
    3. Element-wise multiplication
       $F_t = F_t^s \odot F_t^d$

4.1 Overview
---
![](https://i.imgur.com/9rgTIwq.png)

- The reasoning model traverses a hypergraph.
- Each hyperedge corresponds to a fact in $F$ and connects the concepts in $V$ that the fact mentions.
- Because a hyperedge (fact) connects all the concepts it mentions, this representation preserves the context of the original natural language statement.

1. Start traversing the hypergraph from the question (its concepts) and, after following several hyperedges (facts), arrive at a set of concept nodes.
2. For each concept $c \in V$, we want the probability $P(c \mid q)$ that it is an answer to the question $q$.
3. $F_t$ denotes the weighted set of facts retrieved at hop $t$, and $F_0$ the initial facts.
4. We iteratively retrieve facts for the next hop, and finally use the retrieved facts to score the concepts.
5. (Paraphrase produced by ChatGPT, source unclear and unverified: the facts retrieved at each hop are added to the facts already found, which expands the known fact set. This helps the model understand the question better and find a more precise answer. In the last step, the retrieved facts are used to score every candidate answer, i.e. every concept in $V$, to decide which is most likely correct.)

4.2 Pre-computed Indices
---
- **Dense Neural Fact Index D**
    - Pre-train a bi-encoder architecture over BERT.
        - Bi-encoder: [BERT to bi-encoder](https://www.notion.so/BERT-to-bi-encoder-b6d4254e10ef4b2c9ba732ca873cc106)
          A bi-encoder is a way to compute the similarity between two texts. One text is the "query" (the question) and the other is the "candidate" (a possible answer). The bi-encoder embeds both texts into the same vector space and then computes a similarity score between them.
    - It learns to maximize the score of facts that contain the correct answer to a question.
    - MIPS is then used for dense retrieval over the facts.
    - After pre-training, every fact in $F$ is embedded as a dense vector (the representation of the [CLS] token).
    - D is a $|F| \times d$ dense matrix.
- **Sparse Fact-to-Fact Index S**
    - Sparse links between facts are pre-computed with a set of connection rules.
    - e.g. $f_i \to f_j$ when
        - $f_i$ and $f_j$ have at least one common concept, and
        - $f_j$ introduces at least two new concepts that are not in $f_i$.
    - S is a binary sparse tensor with dense shape $|F| \times |F|$.
- **Sparse Index of Concept-to-Fact Links E**
    - A concept can appear in multiple facts.
    - A fact usually mentions multiple concepts.
    - ⇒ The co-occurrence between each fact and each concept is encoded as a sparse matrix with dense shape $|V| \times |F|$ (the concept-to-fact index).
    - In other words, E is an index that stores the links between every concept and the facts that mention it as a sparse matrix.

4.3 Differentiable Fact-Following Operation
---
<big>**fact-following framework**</big>
- **$P(F_t \mid F_{t-1}, q)$**: a model of the transition from one fact to another in the context of the question $q$.
- **S (Sparse Fact-to-Fact Index)**: entries $S_{ij}$. Intuitively, if we can traverse from $f_i$ to $f_j$, these facts should mention some common concepts.
    - Concretely, each row and each column of S corresponds to a fact, and $S_{ij} = 1$ when there is a link $f_i \to f_j$. Multiplying the vector $F_{t-1}$ by S gives a vector of length $|F|$ whose $j$-th entry sums the weights of the facts in $F_{t-1}$ that link to $f_j$. The non-zero entries are the possible next-hop facts.
- **D (Dense Neural Fact Index)**: holds the semantic information of each fact and can be used to measure how plausible a fact is for a given question.
- Implemented with TensorFlow's `tf.RaggedTensor` structure.

<big>**fact-follow operation**</big>

**sparse retrieval**
- Uses the fact-to-fact sparse matrix to obtain the possible next-hop facts.
- The authors store the S matrix with `tf.RaggedTensor` in TensorFlow, which saves a large amount of storage and computation and makes it efficient to compute the next-hop facts $F_t^s$ (a set).

$F_t^s = F_{t-1}S$

**dense retrieval**
1. Map $F_{t-1}$ through D to get a dense vector $z_{t-1}$.
    - **$z_{t-1}$**: a single vector that aggregates the vector representations of all facts in $F_{t-1}$; it summarizes the facts held before the current step.

   $z_{t-1} = F_{t-1}D$
2. Feed $z_{t-1}$ and $q_t$ to the model.

   $h_{t-1} = g(z_{t-1}, q_t)$
    - **$q_t$**: the query vector of the current step, computed from the input question and the previous query vector. Put simply, $q_t$ is the query for the current hop.
    - **$g()$**: an MLP, also called the fact-translating function, that maps the previous fact vector $z_{t-1}$ and the current query vector $q_t$ to the query vector $h_{t-1}$.
    - **$h_{t-1}$**: a dense query vector with the same dimensionality as the fact vectors ($z_{t-1}$).

   $F_t^d = \text{MIPS}_K(h_{t-1}, D)$
    - Maximum inner product search (MIPS) over D with $h_{t-1}$ retrieves the top-K relevant next-hop facts, $F_t^d$.
3. Element-wise multiplication
    - To get the best of both the symbolic and the neural world.
    - $F_t = F_t^s \odot F_t^d$

   $\begin{split}F_t&=\text{Fact-Follow}(F_{t-1},q)\\&=F_{t-1}S\odot \text{MIPS}_K(g(F_{t-1}D,q_t),D)\end{split}$
4. Concept predictions
    1. **$A_t$** (a set of concept predictions): multiply **$F_t$** with a precomputed **fact-to-concept matrix E** → $A_t$
        - fact-to-concept matrix E (not yet studied; see Appendix B)
    2. **Concept scores:**
        - It seems that, for each concept $c$, the score is aggregated by taking the maximum score among the facts in $A_t$ that mention $c$.
        - It seems these scores are used to update the S and D matrices of the pre-computed indices (unverified).
    3. $A = \sum_{t=1}^T \alpha_t A_t$
        - $\alpha_t$ is a learnable parameter.
        - For $\alpha_t$, see Appendix B.

- $F_t = \text{Fact-Follow}(F_{t-1}, q)$:
    - a random-walk process on the hypergraph associated with the corpus.
- Self-following
    - improves performance
    - augments $F_t$ with the facts in $F_{t-1}$ whose score is above a threshold $\tau$: $F_t = \text{Fact-Follow}(F_{t-1}, q) + \text{Filter}(F_{t-1}, \tau)$
    - $F_t$ then also contains highly relevant facts from shorter distances $t' < t$, which improves the model when different questions require different numbers of reasoning steps.

---
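
The Fact-Follow operation from Section 4.3 can be sketched on a toy corpus. This is a minimal NumPy illustration under stated assumptions, not the authors' TensorFlow implementation. The concept list, the five facts, the random embeddings standing in for the bi-encoder index D, the single-layer stand-in for the MLP `g`, the softmax normalisation of the top-K scores, and the fixed hop weights are all hypothetical choices made for illustration. Concept scores use max-pooling over facts, following the tentative reading in the notes above.

```python
import numpy as np

# Toy data (hypothetical): concepts, and for each fact the concept ids it mentions.
concepts = ["global warming", "carbon dioxide", "atmosphere", "tree",
            "photosynthesis", "forest", "oxygen"]
fact_concepts = [{0, 1}, {1, 2, 3, 4}, {0, 2}, {3, 4}, {3, 5, 6}]
n_facts, n_concepts, dim = len(fact_concepts), len(concepts), 8

# E: concept-to-fact links, dense shape |V| x |F|.
E = np.zeros((n_concepts, n_facts))
for f, cs in enumerate(fact_concepts):
    E[list(cs), f] = 1.0

# S: fact-to-fact links, dense shape |F| x |F|. f_i -> f_j when the two facts
# share a concept and f_j introduces at least two concepts not in f_i.
S = np.zeros((n_facts, n_facts))
for i, ci in enumerate(fact_concepts):
    for j, cj in enumerate(fact_concepts):
        if i != j and ci & cj and len(cj - ci) >= 2:
            S[i, j] = 1.0

rng = np.random.default_rng(0)
D = rng.normal(size=(n_facts, dim))   # assumption: random stand-in for bi-encoder embeddings
W = rng.normal(size=(2 * dim, dim))   # assumption: random weights for the stand-in MLP g


def g(z, q):
    """Stand-in fact-translating function: maps (z, q) to a query vector of size dim."""
    return np.tanh(np.concatenate([z, q]) @ W)


def mips_top_k(h, D, k):
    """Brute-force MIPS. Assumption: top-k scores are softmax-normalised, the rest are zero."""
    scores = D @ h
    top = np.argsort(scores)[-k:]
    out = np.zeros_like(scores)
    out[top] = np.exp(scores[top] - scores[top].max())
    return out / out.sum()


def fact_follow(F_prev, q, k=2):
    F_sparse = F_prev @ S                  # F_t^s = F_{t-1} S
    z = F_prev @ D                         # z_{t-1} = F_{t-1} D
    F_dense = mips_top_k(g(z, q), D, k)    # F_t^d = MIPS_K(g(z_{t-1}, q_t), D)
    return F_sparse * F_dense              # F_t = F_t^s (element-wise product) F_t^d


def concept_scores(F_t):
    """A_t[c]: largest weight among the facts that mention concept c (0 if none)."""
    return (E * F_t).max(axis=1)


q = rng.normal(size=dim)                   # assumption: one question vector reused for every hop
F0 = E[concepts.index("global warming")]   # initial facts: those mentioning a question concept
alphas = np.array([0.5, 0.5])              # hop weights; learnable in the paper, fixed here

F, A, hop_facts = F0, np.zeros(n_concepts), []
for alpha in alphas:
    F = fact_follow(F, q)
    hop_facts.append(F)
    A += alpha * concept_scores(F)         # A = sum_t alpha_t * A_t

for c in np.argsort(-A):
    print(f"{concepts[c]:16s} {A[c]:.3f}")
```

Because the dense scores come from random vectors, the printed ranking carries no meaning. The point is the data flow: the sparse product restricts candidates to facts linked through S, and the dense top-K scores reweight them.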