<style> .red { color: red; } .blue{ color: blue; } .green{ color: green; } </style>

# [Answering Questions by Meta-Reasoning over Multiple Chain of Thought](https://arxiv.org/abs/2304.13007)

:::danger
**GitHub:** https://github.com/oriyor/reasoning-on-cots
:::

:::warning
:bulb: This paper is an extension of **Self-Consistency**. Recommended prior reading:
:arrow_right: [Self-Consistency Improves Chain Of Thought Reasoning In Language Models](https://hackmd.io/Fewd37Q8Smm55_optr4JoQ)
:::

## 1. Introduction

![截圖 2024-01-30 20.14.49](https://hackmd.io/_uploads/ryR3dPI9T.png =65%x)

- While Self-Consistency (SC) leads to performance gains, it also has several shortcomings:
    1. When the space of possible outputs is large, each reasoning chain may lead to a different output, in which case no significant majority will be formed.
    2. Focusing exclusively on the final output discards relevant information that is present in the intermediate reasoning steps.
- Using SC jointly with chain-of-thought prompting reduces interpretability, as there is no single reasoning chain that can be considered an explanation.
- In this work, we propose Multi-Chain Reasoning (MCR), where we prompt a large language model (LLM) to meta-reason across multiple reasoning chains and produce a final answer alongside an explanation.
- Sampled reasoning chains are used to collect pieces of evidence from multiple chains.
- MCR concatenates the intermediate steps from each chain into a unified context, which is passed, along with the original question, to a meta-reasoner model.
- The meta-reasoner is a separate LLM, prompted to meta-reason over multiple reasoning chains and produce a final answer along with an explanation.

## 2. Background

- The majority of these works follow a common standard:
    1. Given a question, plan a **step-by-step reasoning chain** to derive the answer and solve all intermediate steps, aided by a retriever to minimize model hallucination.
    2. Then, incorporate multiple reasoning chains with answers to derive the final answer.
- In our work, we follow this template and focus on the latter part.
- However, our meta-reasoning approach differs from prior work by <span class='red'>reasoning on multiple reasoning chains</span>. Namely, we use multiple chains to collect relevant evidence for question answering.

## 3. Method

- Our focus is on open-domain QA, where the input is a question q, and the evidence to answer it is found in one or more sentences in a corpus C.
- When answering q requires multiple reasoning steps, it can be expressed by a reasoning chain, denoted by r.
- <span class='red'>The **reasoning chain** is a list of one or more intermediate question-evidence-answer triples (qi, ei, ai).</span>

## 3.1 Generating Reasoning Chains

![截圖 2024-01-31 12.20.53](https://hackmd.io/_uploads/Byw2cHw56.png =65%x)

- Given a question q, we generate its reasoning chain using:
    1. a decomposition model
    2. a retriever component
- At each step, the decomposition model generates an intermediate question qi, based on the original question q and the previous reasoning steps.
- Then, the retriever uses qi to retrieve relevant evidence ei ∈ C.
- We feed ei and qi to the decomposition model (along with the previous steps) to generate the intermediate answer ai.
- During answer generation, <span class='red'>we prepend intermediate evidence sentences to the beginning of the chain</span> rather than interleaving them, as this improves accuracy for all baselines.
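To make the loop above concrete, here is a minimal Python sketch of the decompose / retrieve / answer procedure. It is an illustration under stated assumptions, not the authors' implementation: the decomposition model and retriever are passed in as generic callables, since their actual prompting and search details are out of scope here.

```python
# Minimal sketch of the §3.1 chain-generation loop (illustration only,
# not the authors' code). The decomposition model and retriever are
# injected as callables standing in for the prompted LLM and the search API.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Step:
    question: str  # intermediate question q_i
    evidence: str  # retrieved evidence e_i
    answer: str    # intermediate answer a_i

def generate_chain(
    question: str,
    decompose: Callable[[str, List[Step]], Optional[str]],  # next q_i, or None when done
    retrieve: Callable[[str], str],                          # top-1 evidence e_i for q_i
    answer: Callable[[str, List[Step], str, str], str],      # a_i given q, history, q_i, e_i
    max_steps: int = 5,
) -> List[Step]:
    steps: List[Step] = []
    for _ in range(max_steps):
        # 1. The decomposition model proposes the next intermediate question,
        #    conditioned on the original question and the previous steps.
        q_i = decompose(question, steps)
        if q_i is None:  # the model signals that the chain is complete
            break
        # 2. The retriever fetches evidence for the intermediate question.
        e_i = retrieve(q_i)
        # 3. The decomposition model answers q_i; evidence sentences are
        #    prepended to the chain context rather than interleaved (§3.1).
        a_i = answer(question, steps, q_i, e_i)
        steps.append(Step(q_i, e_i, a_i))
    return steps
```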
## 3.2 Reasoning over Reasoning Chains

- <span class='red'>The meta-reasoner module is the core contribution of MCR</span>.
- Instead of sampling multiple chains only for their predicted answers, we utilize them for context generation.
- This context is fed to a prompted LLM, which reads the generated chains and reasons over them to return the answer.
- We first sample multiple chains and use all of their intermediate question-answer pairs (qi, ai) as our multi-chain context.
- Next, the multi-chain context and the original question are input to the meta-reasoner.
- This model is an LLM, few-shot prompted for QA over a multi-chain context (see the sketch below).

![截圖 2024-01-31 12.32.31](https://hackmd.io/_uploads/SyFuprP9T.png =65%x)

:::success
1. Providing the meta-reasoner with multiple chains allows it to combine and aggregate facts across chains.
2. The model needs to extract the most relevant facts in the chains to serve as its explanation. This enables MCR to be both more accurate and more interpretable than past multi-chain approaches.
:::
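The sketch below shows one way the meta-reasoning step could be wired together, reusing the `Step` dataclass and `generate_chain` idea from the previous sketch. `sample_chain` and `meta_reason_llm` are hypothetical placeholders for temperature-sampled chain generation and the few-shot-prompted meta-reasoner; the context format is an assumption for illustration, not the paper's exact prompt.

```python
# Minimal sketch of the MCR meta-reasoning step in §3.2 (illustration only).
# `sample_chain` and `meta_reason_llm` are hypothetical placeholders for
# temperature-sampled chain generation and the prompted meta-reasoner LLM.
from typing import Callable, List, Tuple

def build_multi_chain_context(chains: List[List[Step]]) -> str:
    """Concatenate the intermediate (q_i, a_i) pairs of every sampled chain."""
    lines = []
    for chain in chains:
        for step in chain:
            lines.append(f"Q: {step.question} A: {step.answer}")
    return "\n".join(lines)

def mcr_answer(
    question: str,
    greedy_chain: List[Step],
    sample_chain: Callable[[str, float], List[Step]],
    meta_reason_llm: Callable[[str, str], Tuple[str, str]],
    n_sampled: int = 4,
    temperature: float = 0.7,
) -> Tuple[str, str]:
    # One greedy-decoded chain plus several temperature-sampled chains
    # (five chains in total in the paper's MCR setting, §4.1.2).
    chains = [greedy_chain] + [sample_chain(question, temperature) for _ in range(n_sampled)]
    context = build_multi_chain_context(chains)
    # The meta-reasoner reads the multi-chain context and the original question,
    # returning an explanation together with the final answer.
    explanation, answer = meta_reason_llm(question, context)
    return answer, explanation
```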
## 4. Experiments

- We compare MCR to existing methods on 7 multi-hop QA benchmarks.
- MCR consistently outperforms existing approaches on all benchmarks, when experimenting with two different LLMs and retrievers.

## 4.1 Experimental Setting

### 4.1.1 Datasets

- To limit the cost of model API calls, we evaluate on 500-1000 random examples from the development set of each dataset.
- We also evaluate on the official test sets of STRATEGYQA and FERMI, as they target implicit reasoning, have multiple valid strategies, and their test set evaluation cost is reasonable.

:::info
1. **Implicit Reasoning**:
    - Questions that entail implicit reasoning steps.
    - The reasoning steps for solving them <span class='red'>cannot be explicitly derived from the language of the question</span> and require commonsense or arithmetic reasoning.
    - Such questions may have multiple valid reasoning chains.
    - We evaluate on: **STRATEGYQA**, **FERMI** and **QUARTZ**.
2. **Explicit Reasoning**:
    - Multi-hop questions where <span class='red'>the reasoning steps are explicitly expressed</span> in the language of the question.
    - These include **HOTPOTQA**, **2WIKIMQA** and **BAMBOOGLE**.
    - We also evaluate on FEVEROUS, a fact verification dataset where claims require verifying multiple facts, and evidence may be found in sentences, tables or both.
:::

![截圖 2024-01-31 12.41.51](https://hackmd.io/_uploads/Skii1ID9p.png =65%x)

- For evaluation, we <span class='red'>use F1 to compare predicted</span> and <span class='red'>gold answers for all explicit reasoning datasets</span>.
- In FERMI, we use the official order-of-magnitude evaluation.

### 4.1.2 Models

- Our main models and baselines are all retrieval-augmented instances of code-davinci-002, prompted with in-context learning exemplars.
- The number of exemplars varies from 6 to 12 between datasets.
- <span class='red'>Decomposition prompt exemplars are based on random examples from the train and development sets, coupled with their gold reasoning chain</span>.
- For the meta-reasoner exemplars, we use reasoning chains sampled from the decomposition model as the multi-chain context.
- We ensure that the answer can be inferred using the sampled chains and add an explanation before the final answer.
- For the binary-choice datasets, STRATEGYQA, QUARTZ, and FEVEROUS, the prompt contains an equal number of exemplars from each label.

### Meta-Reasoner

- We experiment with two variants of the meta-reasoner to measure the effect of reasoning on more than a single chain.
    1. **MCR**:
        - The meta-reasoner <span class='red'>is given five reasoning chains as its multi-chain context</span>.
        - <span class='red'>We decode one chain with greedy decoding, and sample another four reasoning chains with temperature t = 0.7.</span>
        - This enables the meta-reasoner to review different pieces of evidence when answering the full question.
    2. **SCR**:
        - The meta-reasoner is given the same prompt as MCR aside from <span class='red'>having only the greedy-decoded chain in its context</span>.
        - This disentangles the effect of using multiple chains from the effect of having an LLM that is separate from the decomposition model to generate the final answer.

:::info
In other words, this setup separates two effects: (1) the effect of using multiple reasoning chains, i.e., answering by generating and considering several reasoning processes; and (2) the effect of using an LLM that is separate from the decomposition model, i.e., a dedicated model that only produces the final answer. Keeping them apart lets us measure how much each technique contributes on its own; if the combination works better, the two techniques are both effective and complementary.
:::

### Baselines

1. **SA (Self-Ask)**:
    - Self-Ask returns the answer of a single reasoning chain that was generated with greedy decoding.
2. **SC (Self-Consistency)**:
    - We experiment with variants of 3, 5 and 15 sampled chains (SC@3, SC@5 and SC@15).
    - As in MCR, we use the chain generated with greedy decoding along with additional chains sampled with t = 0.7.

:::info
That is, as in MCR, one chain is produced by greedy decoding (picking the most likely token at every step), and the additional, more diverse chains are produced by sampling with temperature t = 0.7. The chain-generation procedure is therefore identical to the one used for MCR; only how the chains are aggregated differs.
:::

### Retrieval

- Our models and baselines use a retriever based on Google Search, via the SerpAPI service.
- As most of our datasets contain evidence from Wikipedia (§4.1.1), we consider it as our retrieval corpus. Therefore, we format search queries as “en.wikipedia.org qi”, with the Wikipedia domain preceding the intermediate question.
- We return the top-1 evidence retrieved by Google. Retrieved evidence may be either sentences or parsed lists.
- We also retrieve evidence for the original question q.
- Last, all retrieved evidence sentences are prepended to the decomposition (§3.1).

## 4.2 Main Results

### MCR Performance

![截圖 2024-01-31 13.41.33](https://hackmd.io/_uploads/H1QnpIvc6.png)

- In addition, we <span class='red'>list an **oracle score** which uses the best answer out of all five chains</span>.

:::success
MCR outperforms all baselines on all of the benchmarks!
:::

### Adding Reasoning Chains

![截圖 2024-01-31 13.53.15](https://hackmd.io/_uploads/ryPwlPwqa.png =60%x)

- As extending MCR is bounded by context length, we follow a straightforward approach and perform self-consistency on three MCR runs (see the sketch below).
- We compare this model, MCR+SC@3, which uses 15 reasoning chains (5 for each MCR run), to SC@15.

:::info
In other words, because a single MCR run is limited by context length (all chains must fit into one concatenated context), more chains are used by running MCR three independent times, each over different sampled chains, and taking a majority vote over the three answers. This is the same self-consistency idea, applied on top of MCR runs instead of individual chains.
:::

![截圖 2024-01-31 13.55.42](https://hackmd.io/_uploads/Sy5lbDDca.png =80%x)

- For each dataset, the figure shows the effect that adding more reasoning chains has on meta-reasoning performance.
- It presents the results with 1 chain (SCR), 5 chains (MCR) and 15 reasoning chains (MCR+SC@3).
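As a rough illustration of MCR+SC@3, the sketch below runs MCR several times and takes a majority vote over the final answers. `run_mcr` is a hypothetical wrapper around the meta-reasoning sketch from §3.2, and the naive answer normalization is an assumption made purely for illustration.

```python
# Minimal sketch of MCR+SC@3: self-consistency over independent MCR runs
# (illustration only, not the authors' code). `run_mcr` is a hypothetical
# wrapper performing one full MCR run (5 chains + meta-reasoner) and
# returning only the final answer string.
from collections import Counter
from typing import Callable

def mcr_with_self_consistency(
    question: str,
    run_mcr: Callable[[str], str],
    n_runs: int = 3,
) -> str:
    # Each run samples its own chains, so the runs can disagree.
    # Lower-casing/stripping is a naive normalization assumed for illustration.
    answers = [run_mcr(question).strip().lower() for _ in range(n_runs)]
    # Majority vote over the final answers of the MCR runs.
    return Counter(answers).most_common(1)[0][0]
```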
### Test Set Results

- We evaluate our models on the official test sets of STRATEGYQA and FERMI, which include 490 and 558 examples respectively.

![截圖 2024-01-31 20.41.54](https://hackmd.io/_uploads/SJ7uepDqa.png =70%x)

### Recent Approaches

![截圖 2024-01-31 20.46.45](https://hackmd.io/_uploads/r1dBbavq6.png)

- While an apples-to-apples comparison with other recent approaches is impossible due to fundamental differences in the experimental setup, it serves as a rough measuring stick for the robustness of MCR across different tasks.

## 4.3 Open-source Models

- To further examine MCR's performance (§4.2) and <span class='red'>for better reproducibility</span>, we experiment with an additional open-source retriever and LLM.
- As our <span class='red'>retriever</span>, we use <span class='red'>ColBERTv2</span>.
- In addition to code-davinci-002, we <span class='red'>experiment with Vicuna-13B, a 13-billion-parameter model</span> shown to outperform LLMs like LLaMA and Alpaca.
- We **use the same prompts as in code-davinci-002, trimmed to fit a 2,048-token context length**.

![截圖 2024-02-01 12.51.14](https://hackmd.io/_uploads/ryALQiu56.png)
![截圖 2024-02-01 12.52.31](https://hackmd.io/_uploads/B1ZsXoOqp.png)

- For code-davinci-002, substituting Google Search with ColBERTv2 exhibits the same trend, albeit with a slight decrease in performance.
- **(Table 6)** Unsurprisingly, <span class='red'>results sharply decrease when evaluating the smaller Vicuna-13B with ColBERTv2</span>.
- **(Table 6)** The comparison between MCR and SCR suggests that reasoning over multiple chains is a challenge for the weaker Vicuna-13B model.

:::success
This suggests that **meta-reasoning over multiple chains has greater gains** (compared to SCR) <span class='red'>when both the decomposition model and meta-reasoner are larger LLMs</span>.
:::

## 5. Analysis

*Next, we:*
- Measure the importance of incorporating multiple reasoning chains in MCR
- Qualitatively assess its output

### When are Multiple Chains Helpful?

- We wish to show that this advantage lies in cases where the <span class='red'>meta-reasoner uses additional chains</span>.
- To this end, we <span class='red'>sort examples based on the **similarity** of their greedy-decoded chain to the MCR explanation</span>.
- <span class='red'>Lower similarity indicates less reliance of MCR on the greedy chain</span>.
- In such cases, the MCR explanation includes relevant facts from a chain other than the greedy one.

:::success
MCR gains over SCR are highest when **MCR explanations are less similar to the greedy chain**.
:::

### Combining Reasoning Chains

- In addition to <span class='red'>choosing between reasoning chains</span>, an interesting property of the meta-reasoner is that <span class='red'>it can combine facts from different chains</span>.
- We focus on the **implicit datasets**, STRATEGYQA and FERMI, which **are more challenging**.
- We examine whether one of the output sentences appears in exactly one chain, while another sentence is absent from that chain and is part of a different chain.
- We consider **sentences as similar** if their **ROUGE-1** precision is above 0.8 (a sketch of this check follows this subsection).

:::success
1. We observe that these multi-chain explanations are better than any individual reasoning chain in 10% of cases.
2. For the remaining 90%, the reasoning expressed in the resulting combination is a paraphrase of an individual chain.
:::
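As a rough illustration of the similarity check above, here is a simplified ROUGE-1 precision computation: the fraction of a candidate sentence's unigrams that also appear in the reference sentence, with clipped counts. This is an approximation for illustration (plain whitespace tokenization, no stemming or stopword handling), not the paper's exact evaluation code.

```python
# Simplified ROUGE-1 precision check for deciding whether two sentences are
# "similar" (threshold 0.8). Illustrative re-implementation only.
from collections import Counter

def rouge1_precision(candidate: str, reference: str) -> float:
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    # Clipped unigram overlap: each candidate token counts at most as often
    # as it appears in the reference.
    overlap = sum(min(count, ref_counts[tok]) for tok, count in Counter(cand_tokens).items())
    return overlap / len(cand_tokens)

def is_similar(sentence_a: str, sentence_b: str, threshold: float = 0.8) -> bool:
    return rouge1_precision(sentence_a, sentence_b) > threshold
```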
### Explanation Quality

- Four of the authors manually reviewed 600 random examples, 100 per dataset, and scored their meta-reasoner explanations.
- Each explanation is scored as either 1 (irrelevant), 2 (partially relevant) or 3 (highly relevant), <span class='red'>based on its relevance to answering the question</span>.

:::success
We find the explanation is **highly relevant in 82% of the cases** (87% excluding FERMI, which is the most challenging), and is irrelevant in less than 3%.
:::

- Next, we evaluate the **faithfulness** of explanations. Namely, <span class='red'>whether a person provided only with the question and MCR explanation would answer the same as the model</span>.
- Our **focus** was **on examples with quality explanations (score 3)**, since they are answerable given the explanation.

:::success
1. In 90% of cases (95% excluding FERMI), the MCR predictions matched our own, highlighting the faithfulness of its explanations.
2. We attribute part of the gap between human and MCR predictions to implicit reasoning tasks, where humans lead by five points, on average.
:::

### Error Analysis

- We manually analyzed 700 errors by MCR (100 per dataset).
- We consider the following categories:
    1. **Valid predictions**: where the generated answer is accurate or the original question is ambiguous.
    2. **Decomposition errors**: where <span class='red'>no chain has the necessary reasoning steps to answer the question</span>.
    3. **Retrieval errors**: where <span class='red'>the retrieved contexts were irrelevant</span>, leading the model to hallucinate.
    4. **Explanation errors**: where <span class='red'>MCR generates a wrong explanation</span> while a correct one is present in the multi-chain context.
    5. **Answer errors**: where <span class='red'>the MCR explanation is correct, but the answer is not</span>.
    6. **Contradicting facts**: cases where MCR errs due to <span class='red'>contrasting statements appearing in the multi-chain context</span>.

![截圖 2024-02-01 17.36.14](https://hackmd.io/_uploads/H1H7UJtcp.png =70%x)

:::success
1. In four datasets, over 20% of errors appear to be **valid predictions**, labeled as incorrect due to ambiguous questions, outdated answers or dataset errors.
2. **Decomposition** is a challenge in the implicit datasets, STRATEGYQA and FERMI, accounting for more than 24% of errors.
3. **Explanation** and **Answer errors** account for 50% of errors on implicit reasoning datasets, compared to 23% on explicit reasoning ones.
4. **Retrieval errors** are more prevalent in explicit reasoning tasks, where 66% of errors are due to Retrieval or Contradicting facts, compared to 30% in implicit datasets.
:::

## 6. Conclusion

1. We introduce the MCR method for meta-reasoning over multiple chains of thought.
2. We show that MCR outperforms all baselines, including self-consistency, on all 7 multi-hop open-domain QA benchmarks.
3. We analyze MCR for its explanation quality and its multi-chain reasoning capabilities.