<style> .red { color: red; } .blue { color: blue; } .green { color: green; } </style>

# [Self-Consistency Improves Chain of Thought Reasoning in Language Models](https://arxiv.org/abs/2203.11171)

:::warning
:bulb: This paper builds on and extends **Chain-of-Thought** prompting. Recommended prior reading:
:arrow_right: [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://hackmd.io/zoYrS1J2SU-SpwN6zvKLFg)
:::

## 1. Introduction

- In this paper, <span class='green'>we introduce a novel decoding strategy called self-consistency to replace the greedy decoding strategy used in chain-of-thought prompting</span>.
- Self-consistency leverages the intuition that complex reasoning tasks typically admit multiple reasoning paths that reach a correct answer.
- The more deliberate thinking and analysis a problem requires, the greater the diversity of reasoning paths that can recover the answer.

:::success
![截圖 2024-01-19 16.02.22](https://hackmd.io/_uploads/HJYshsPtT.png)
- Instead of greedily decoding the single optimal reasoning path, we propose a “**sample-and-marginalize**” decoding procedure:
    1. We first <span class='red'>sample from the language model’s decoder to generate a diverse set of reasoning paths</span>; each reasoning path might lead to a different final answer.
    2. We then determine the optimal answer by <span class='red'>marginalizing out the sampled reasoning paths to find the most consistent answer in the final answer set</span>.
- The self-consistency method contains three steps:
    1. Prompt a language model using chain-of-thought (CoT) prompting;
    2. Replace the “greedy decode” in CoT prompting by sampling from the language model’s decoder to generate a diverse set of reasoning paths; and
    3. Marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.
:::

- Such an approach is analogous to the human experience that if **multiple different ways of thinking lead to the same answer, one has greater confidence that the final answer is correct**.
- Compared to other decoding methods, <span class='red'>self-consistency avoids the repetitiveness and local optimality</span> that plague greedy decoding, while mitigating the stochasticity of a single sampled generation.
- **Self-consistency is entirely unsupervised**: it works off-the-shelf with pre-trained language models, requires no additional human annotation, and avoids any additional training, auxiliary models, or fine-tuning.

## 2. Self-Consistency Over Diverse Reasoning Paths

- A salient aspect of humanity is that **people think differently**. It is natural to suppose that in tasks requiring deliberate thinking, there are likely several ways to attack the problem.
- We propose that such a process can be **simulated in language models via sampling from the language model’s decoder**.
- We hypothesize that correct reasoning processes, even if they are diverse, <span class='red'>tend to have greater agreement in their final answer than incorrect processes</span>.

**In more detail:**
1. Assume the generated answers $a_i$ come from a fixed answer set, $a_i \in A$, where $i = 1, \ldots, m$ indexes the $m$ candidate outputs sampled from the decoder.
2. Given a prompt and a question, self-consistency introduces an additional latent variable $r_i$, a sequence of tokens representing the reasoning path in the $i$-th output.
3. It then couples the generation of $(r_i, a_i)$, where $r_i \rightarrow a_i$. :arrow_right: <span class='green'>Generating a reasoning path $r_i$ is optional and only used to reach the final answer $a_i$.</span>
4. After sampling multiple $(r_i, a_i)$ from the model’s decoder, self-consistency applies a marginalization over $r_i$ by taking a majority vote over $a_i$ (see the sketch below). :arrow_right: ![截圖 2024-01-19 16.42.56](https://hackmd.io/_uploads/rygBLhPt6.png =35%x)<span class='green'>, which we define as the most “consistent” answer among the final answer set.</span>
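Below is a minimal Python sketch of the sample-and-marginalize procedure above. It assumes two hypothetical helpers not defined in the paper: `sample_cot()`, which draws one chain-of-thought completion from the model’s decoder, and `extract_answer()`, which parses the final answer $a_i$ out of a completion.

```python
from collections import Counter
from typing import Callable

def self_consistency(
    sample_cot: Callable[[], str],          # hypothetical: draws one CoT completion
    extract_answer: Callable[[str], str],   # hypothetical: parses the final answer a_i
    m: int = 40,                            # number of sampled reasoning paths
) -> str:
    """Sample m outputs (r_i, a_i) and majority-vote over the answers a_i."""
    answers = [extract_answer(sample_cot()) for _ in range(m)]
    # Marginalizing out the reasoning paths r_i reduces to counting how often
    # each final answer appears and keeping the most frequent one.
    return Counter(answers).most_common(1)[0][0]
```

Note that only the final answers vote: two samples with completely different rationales still count as agreeing when their answers match.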
**Different aggregation strategies**

![截圖 2024-01-19 16.48.19](https://hackmd.io/_uploads/SyrDP2wYp.png)

- In Table 1, we show the test accuracy over a set of reasoning tasks using different answer aggregation strategies.
- In addition to majority vote, one can also <span class='red'>**weight** each $(r_i, a_i)$ by $P(r_i, a_i \mid \text{prompt}, \text{question})$ when aggregating the answers</span>.
- Note that to compute $P(r_i, a_i \mid \text{prompt}, \text{question})$, we can either:
    1. take the **unnormalized probability** of the model generating $(r_i, a_i)$ given (prompt, question);
    2. or **normalize the conditional probability** by the output length (a code sketch of this aggregation appears at the end of this section): ![截圖 2024-01-19 16.58.12](https://hackmd.io/_uploads/rkyAYhDFa.png =90%x)
        - where $\log P(t_k \mid \text{prompt}, \text{question}, t_1, \ldots, t_{k-1})$ is the log probability of generating the $k$-th token $t_k$ in $(r_i, a_i)$ conditioned on the previous tokens, and $K$ is the total number of tokens in $(r_i, a_i)$.

:::success
**Result:**
1. Table 1 shows that taking the “<span class='red'>unweighted sum</span>”, i.e., a majority vote directly over $a_i$, <span class='red'>yields accuracy very similar to aggregating with the “normalized weighted sum”</span>.
    - A closer look at the model’s output probabilities shows why: for each $(r_i, a_i)$, **the normalized conditional probabilities $P(r_i, a_i \mid \text{prompt}, \text{question})$ are quite close to each other, i.e., the language model regards those generations as “similarly likely”**.
2. Table 1 also shows that the <span class='red'>“normalized” weighted sum</span> (i.e., Equation 1) <span class='red'>yields much higher accuracy than its unnormalized counterpart</span>.
3. For completeness, Table 1 also reports the results of <span class='red'>taking a “weighted average”</span>, i.e., each $a$ gets a score equal to its weighted sum divided by ![截圖 2024-01-19 17.16.15](https://hackmd.io/_uploads/SJZlC2vtp.png =25%x), which <span class='red'>results in much worse performance</span>.
:::

:::warning
Why natural logarithms of probabilities are used in NLP:
1. **Representing the probability of a token in a sentence.** Per-token probabilities are typically very small numbers; working in log space keeps them in a manageable range.
2. **Computing the generation probability of a sentence.** A sequence’s probability is a product of many per-token probabilities; in log space the product becomes a simple sum, which is easier to compute and avoids numerical underflow.
3. **Comparing sentences.** Scores such as the length-normalized log probability in Equation 1 are simpler to compute and compare in log space.

In short, natural logarithms let us handle probability computations more conveniently and more stably.
:::

:::warning
The difference between **normalized probability** and **unnormalized probability**:

- Unnormalized probability: the raw probability score the model outputs directly, without further processing.
- Normalized probability: the model’s raw scores after normalization, so that the scores of all candidate answers sum to 1.

Concretely, suppose a fill-in-the-blank question has three candidate answers A, B, and C, and the model gives them unnormalized scores of 0.7, 0.5, and 0.3. We can normalize them as follows:

P(A) = 0.7 / (0.7 + 0.5 + 0.3) = 0.47
P(B) = 0.5 / (0.7 + 0.5 + 0.3) = 0.33
P(C) = 0.3 / (0.7 + 0.5 + 0.3) = 0.20

After normalization, the three probabilities sum to 1. <span class='red'>Normalized probabilities are usually more meaningful, because they can be interpreted as the probability of choosing each answer</span>; unnormalized scores cannot be read as selection probabilities, since they need not sum to 1. (Note that in this paper, the “normalized” weighted sum specifically refers to normalizing the sequence log probability by the output length $K$, as in Equation 1.)
:::

- Self-consistency explores an interesting space between **open-ended text generation** and **optimal text generation with a fixed answer** (reasoning tasks typically have fixed answers).
- One should note that <span class='red'>self-consistency can be applied only to problems where the final answer comes from a fixed answer set</span>. But in principle this approach can be extended to open-text generation problems if a good metric of consistency can be defined between multiple generations, e.g., whether two answers agree with or contradict each other.

:::info
In other words, self-consistency currently applies only to questions whose answers are fixed, such as multiple-choice questions. In principle, however, the method can be generalized to open-ended problems, such as generative question answering. The key is to define a good consistency metric for measuring the agreement between multiple generated answers, e.g., whether they contradict or support each other. If such a metric can be found, self-consistency could be applied to more open-ended text generation tasks.
:::
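A minimal sketch of the length-normalized weighted-sum aggregation (Equation 1), assuming each sampled output comes with the per-token log probabilities of the full generation $(r_i, a_i)$, as returned by, e.g., an API’s log-prob field:

```python
import math
from collections import defaultdict

def weighted_vote(samples: list[tuple[str, list[float]]]) -> str:
    """Aggregate answers with the length-normalized weighted sum.

    Each element of `samples` is (answer a_i, per-token log probabilities of
    the full output (r_i, a_i)).  A sample's weight is
    exp((1/K) * sum_k log P(t_k | prompt, question, t_1..t_{k-1})),
    the length-normalized sequence probability from Equation 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for answer, token_logprobs in samples:
        K = len(token_logprobs)
        scores[answer] += math.exp(sum(token_logprobs) / K)
    return max(scores, key=scores.get)  # answer with the highest total weight
```

Dropping the `exp(... / K)` weight and adding 1 per sample instead recovers the plain majority vote, which Table 1 shows performs almost identically.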
## 3. Experiments

*We conducted a series of experiments to compare the proposed self-consistency method with existing approaches on a range of reasoning benchmarks.*

## 3.1 Experiment Setup

### 3.1.1 Tasks and datasets

1. Arithmetic reasoning:
    - Math Word Problem Repository — **AddSub**, **MultiArith**, **ASDiv**
    - **AQUA-RAT**
    - **GSM8K** (a recently published benchmark of grade-school math problems) and **SVAMP** (a challenge dataset over math word problems)
2. Commonsense reasoning:
    - **CommonsenseQA**
    - **StrategyQA**
    - **AI2 Reasoning Challenge (ARC)**
3. Symbolic reasoning:
    - **Last-letter concatenation**
    - **Coin flip**

### 3.1.2 Language models and prompts

*We evaluate self-consistency over four transformer-based language models with varying scales:*

1. **UL2** — an <span class='blue'>encoder-decoder model trained on a mixture of denoisers, with 20 billion parameters</span>, and thus more compute-friendly.
2. **GPT-3** — <span class='blue'>with 175 billion parameters</span>. We use two public engines, code-davinci-001 and code-davinci-002, from the Codex series to aid reproducibility.
3. **LaMDA-137B** — a <span class='blue'>dense left-to-right</span>, <span class='blue'>decoder-only language model with 137 billion parameters</span>, pre-trained on a mixture of web documents, dialog data, and Wikipedia.
4. **PaLM-540B** — a <span class='blue'>dense left-to-right</span>, <span class='blue'>decoder-only language model with 540 billion parameters</span>, pre-trained on a high-quality corpus of 780 billion tokens comprising filtered webpages, books, Wikipedia, news articles, source code, and social-media conversations.

:::success
1. We **perform all experiments in the few-shot setting, without training or fine-tuning the language models**.
2. For a fair comparison we use the same prompts as in Wei et al. (2022):
    - For all arithmetic reasoning tasks we use the same set of 8 manually written exemplars.
    - For each commonsense reasoning task, 4–7 exemplars are randomly chosen from the training set with manually composed chain-of-thought prompts.
:::

### 3.1.3 Sampling scheme

The following settings are illustrated in the code sketch below.

- **UL2-20B** and **LaMDA-137B**:
    - temperature sampling with <span class='red'>T = 0.5</span>
    - truncated at the <span class='red'>top-k (k = 40)</span> tokens with the highest probability
- **PaLM-540B**:
    - <span class='red'>T = 0.7</span>
    - <span class='red'>k = 40</span>
- **GPT-3**:
    - <span class='red'>T = 0.7</span>
    - <span class='red'>without top-k truncation</span>
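A minimal sketch of temperature sampling with optional top-k truncation, the decoding scheme described in §3.1.3. The `logits` array stands in for the model’s next-token scores; the defaults match the paper’s GPT-3/PaLM settings (T = 0.7, k = 40):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 0.7,
                 top_k: int | None = 40) -> int:
    """Sample one next-token id with temperature scaling and top-k truncation."""
    scaled = logits / temperature  # T < 1 sharpens, T > 1 flattens the distribution
    if top_k is not None:
        # Keep only the top_k highest-scoring tokens; mask out the rest.
        cutoff = np.sort(scaled)[-min(top_k, scaled.size)]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(scaled.size, p=probs))
```

Passing `top_k=None` reproduces the GPT-3 setting of pure temperature sampling without truncation.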
## 3.2 Main Results

- We report the **results of self-consistency averaged over 10 runs, where we sample 40 outputs independently from the decoder in each run**.
- The baseline we compare to is chain-of-thought prompting with greedy decoding.

### 3.2.1 Arithmetic Reasoning

![截圖 2024-01-19 18.13.59](https://hackmd.io/_uploads/SJpdipwtp.png)

:::success
1. Self-consistency significantly improves arithmetic reasoning performance over chain-of-thought prompting across all four language models.
2. <span class='red'>The gains become more significant as the language model’s scale increases</span>.
3. With self-consistency, we achieve new state-of-the-art results on almost all tasks.
4. Although self-consistency is unsupervised and task-agnostic, these results compare favorably to existing approaches that require task-specific training, or fine-tuning with thousands of examples.
:::

### 3.2.2 Commonsense and Symbolic Reasoning

![截圖 2024-01-19 18.14.39](https://hackmd.io/_uploads/Skzsipwta.png)

:::success
1. Self-consistency yields large gains across all four language models and obtains SoTA results on 5 out of 6 tasks.
2. In the challenging OOD (out-of-domain/out-of-distribution) setting, the gain of self-consistency remains significant compared to CoT prompting at sufficient model sizes.
:::

![截圖 2024-01-19 18.24.41](https://hackmd.io/_uploads/Hk5gApvFp.png)

:::success
**Effect of the number of sampled reasoning paths:**
1. Sampling a higher number (e.g., 40) of reasoning paths leads to consistently better performance.
2. This further emphasizes the importance of introducing diversity in the reasoning paths.
3. With a few example questions from two tasks, we show that self-consistency yields a richer set of reasoning paths than greedy decoding.
:::

## 3.3 Self-Consistency Helps When Chain-of-Thought Hurts Performance

- Ye & Durrett (2022) show that <span class='red'>chain-of-thought prompting can sometimes hurt performance compared to standard prompting in few-shot in-context learning</span>.
- Here we use self-consistency to see whether it can help fill in the gap, over a set of common NLP tasks.

![截圖 2024-01-19 18.32.44](https://hackmd.io/_uploads/r1l1eAvFT.png)

:::success
1. The results are over **PaLM-540B**.
2. For some tasks (e.g., ANLI-R1, e-SNLI, RTE), adding chain-of-thought does hurt performance compared to standard prompting.
3. Self-consistency, however, robustly boosts performance above standard prompting, making it a reliable way to add rationales in few-shot in-context learning for common NLP tasks.
:::

## 3.4 Comparison to Other Existing Approaches

### 3.4.1 Comparison to Sample-and-Rank

- A commonly used approach to improve generation quality is sample-and-rank, where <span class='red'>multiple sequences are sampled from the decoder and then ranked by each sequence’s log probability</span> (contrasted with self-consistency in the code sketch after §3.4.2).

![截圖 2024-01-19 18.43.06](https://hackmd.io/_uploads/rJaSf0DKT.png)

:::success
1. We compare self-consistency with sample-and-rank **on GPT-3 code-davinci-001**, **sampling the same number of sequences from the decoder as self-consistency** and taking the final answer from the top-ranked sequence.
2. While sample-and-rank does improve accuracy with additional sampled sequences and ranking, <span class='red'>the gain is much smaller than that of self-consistency</span>.
:::

### 3.4.2 Comparison to Beam Search

![截圖 2024-01-19 18.52.55](https://hackmd.io/_uploads/BJBoNADFp.png)

:::success
1. We compare self-consistency with beam search decoding **on the UL2-20B model**.
2. For a fair comparison we report accuracy **under the same number of beams and reasoning paths**.
3. **Self-consistency can also adopt beam search to decode each reasoning path** (shown as “Self-consistency using beam search”), but <span class='red'>its performance is worse than self-consistency with sampling</span>.
4. <span class='red'>The reason is that beam search yields lower diversity in the outputs</span>, whereas in self-consistency the diversity of the reasoning paths is the key to better performance.
:::
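To make the contrast with §3.4.1 concrete, here is a small sketch comparing the two aggregation rules over the same set of samples; the answers and log probabilities are made up for illustration:

```python
from collections import Counter

# Hypothetical decodes for one question: (extracted answer, sequence log probability).
samples = [("18", -12.3), ("18", -13.1), ("26", -11.9), ("18", -14.0)]

# Sample-and-rank: trust the single sequence with the highest log probability.
rank_answer = max(samples, key=lambda s: s[1])[0]                  # -> "26"

# Self-consistency: majority vote over all sampled answers.
vote_answer = Counter(a for a, _ in samples).most_common(1)[0][0]  # -> "18"
```

A single high-likelihood but wrong reasoning path wins under ranking, while voting recovers the consensus answer — one way to read the accuracy gap reported above.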
### 3.4.3 Comparison to Ensemble-based Approaches

- We further compare self-consistency to **ensemble-based methods for few-shot learning**.
- We consider ensembling by:
    1. **Prompt order permutation**: we randomly permute the exemplars in the prompt 40 times to mitigate the model’s sensitivity to prompt order.
    2. **Multiple sets of prompts**: we manually write 3 different sets of prompts.
- In both approaches, we take a majority vote over the answers from greedy decoding as the ensemble.

![截圖 2024-01-19 19.06.11](https://hackmd.io/_uploads/HJPnvCPF6.png)

:::success
1. Compared to self-consistency, <span class='red'>existing ensemble-based approaches achieve a much smaller gain</span>.
2. Self-consistency differs from a typical model-ensemble approach, where multiple models are trained and their outputs are aggregated. **Self-consistency acts more like a “self-ensemble” on top of a single language model**.
:::

## 3.5 Additional Studies

### 3.5.1 Self-Consistency is Robust to Sampling Strategies and Scaling

- We show that self-consistency is robust to sampling strategies and parameters, <span class='red'>by varying T in temperature sampling, k in top-k sampling, and p in nucleus sampling</span>, over PaLM-540B.

![截圖 2024-01-19 19.22.14](https://hackmd.io/_uploads/HkYusRDta.png)

:::success
1. **(LEFT)** Self-consistency robustly improves performance across all scales of the LaMDA-137B model series.
2. <span class='red'>**(RIGHT)** The gain is relatively smaller for smaller models, because certain abilities (e.g., arithmetic) only emerge once the model reaches a sufficient scale</span>.
:::

### 3.5.2 Self-Consistency Improves Robustness to Imperfect Prompts

- For few-shot learning with manually constructed prompts, **human annotators sometimes make minor mistakes when creating the prompts**.
- We further study whether self-consistency can improve a language model’s robustness to imperfect prompts.

![截圖 2024-01-19 19.27.28](https://hackmd.io/_uploads/H1G33RvF6.png)

:::success
1. **(LEFT)** Imperfect prompts decrease accuracy under greedy decoding (17.1 → 14.9); self-consistency fills in the gap and robustly improves the results.
2. <span class='red'>**(RIGHT)** Consistency (the % of decodes agreeing with the final aggregated answer) is highly correlated with accuracy</span>.
3. **(RIGHT)** This suggests that one <span class='red'>can use self-consistency to provide an uncertainty estimate of the model over its generated solutions</span>.
4. <span class='red'>**(RIGHT)** Low consistency can serve as an indicator that the model has low confidence</span>.
5. <span class='red'>**(RIGHT)** Self-consistency confers some ability for the model to “know when it doesn’t know”</span> (a small sketch of this consistency-as-confidence idea appears after §3.5.3 below).
:::

### 3.5.3 Self-Consistency Works for Non-Natural-Language Reasoning Paths and Zero-shot CoT

![截圖 2024-01-19 19.41.34](https://hackmd.io/_uploads/S1mZek_K6.png =80%x)

:::success
1. Self-consistency still improves accuracy when the model generates intermediate equations instead of natural-language rationales; however, <span class='red'>the gain is smaller than with natural-language reasoning paths, since the equations are much shorter and leave less room for diversity in the decoding process</span>.
2. Self-consistency also works for zero-shot CoT and improves the results significantly (+26.2%).
:::
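A minimal sketch of the consistency-as-confidence idea from §3.5.2: the fraction of decodes that agree with the aggregated answer serves as an uncertainty estimate (the vote list is made up for illustration):

```python
from collections import Counter

def consistency_estimate(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of decodes agreeing with it."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical 10 decodes for one question: 70% agreement.
answer, confidence = consistency_estimate(
    ["7", "7", "12", "7", "7", "9", "7", "7", "7", "12"])
print(answer, confidence)  # 7 0.7
```

Since consistency correlates strongly with accuracy, a low agreement fraction can be treated as a signal that the model “doesn’t know”.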
## 4. Conclusion and Discussion

1. We introduced a simple yet effective method called self-consistency.
2. Self-consistency is also useful for collecting rationales when performing reasoning tasks with language models, and for providing uncertainty estimates and improved calibration of language model outputs.
3. <span class='red'>One limitation of self-consistency is that it incurs more computation cost</span>.
4. In most cases, performance saturates quickly as more reasoning paths are sampled.
5. As future work, one could **use self-consistency to generate better supervised data for fine-tuning the model**, so that after fine-tuning the model can give more accurate predictions in a single inference run.

## 5. Related Work

1. Reasoning in language models
    - **Language models are known to struggle with Type 2 tasks** (slow, deliberate reasoning), such as arithmetic, logical, and commonsense reasoning.
    - Self-consistency is applicable to a wide range of reasoning tasks without any additional supervision or fine-tuning.
2. Sampling and re-ranking in language models
    - Multiple decoding strategies for language models have been proposed in the literature, e.g., temperature sampling, top-k sampling, nucleus sampling, minimum Bayes risk decoding, and typical decoding.
    - **Re-ranking is another common approach to improving generation quality in language models**.
    - These re-ranking methods require either training an additional re-ranker or collecting additional human annotations, while self-consistency requires no additional training, fine-tuning, or extra data collection.
3. Extracting reasoning paths
    - Some previous work has considered task-specific approaches for identifying reasoning paths.
    - Compared to these approaches, self-consistency is far simpler and requires no additional training.
    - The approach we propose **simply couples the generation of reasoning paths and a final answer by sampling from the decoder, and uses aggregation to recover the most consistent answer without additional modules**.
4. Consistency in language models
    - In this paper we focus on a slightly different notion of “consistency”: **utilizing answer consistency among diverse reasoning paths to improve accuracy**.