<style>
.red {
color: red;
}
.blue{
color: blue;
}
.green{
color: green;
}
</style>
# [Self-Consistency Improves Chain Of Thought Reasoning In Language Models](https://arxiv.org/abs/2203.11171)
:::warning
:bulb: This paper is an extension of the basic **Chain-of-Thought** work.
Recommended prior reading:
:arrow_right: [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://hackmd.io/zoYrS1J2SU-SpwN6zvKLFg)
:::
## 1. Introduction
- In this paper, <span class='green'>we introduce a novel decoding strategy called self-consistency to replace the greedy decoding strategy used in chain-of-thought prompting</span>.
- Self-consistency leverages the intuition that complex reasoning tasks typically admit multiple reasoning paths that reach a correct answer.
- The more that deliberate thinking and analysis is required for a problem, the greater the diversity of reasoning paths that can recover the answer.
:::success

- Instead of greedily decoding the optimal reasoning path, we propose a “**sample-and-marginalize**” decoding procedure:
1. We first <span class='red'>sample from the language model’s decoder to generate a diverse set of reasoning paths</span>; each reasoning path might lead to a different final answer.
2. Then we determine the optimal answer by <span class='red'>marginalizing out the sampled reasoning paths to find the most consistent answer in the final answer set</span>.
- The self-consistency method contains three steps (see the sketch after this list):
1. Prompt a language model using chain-of-thought (CoT) prompting;
2. Replace the “greedy decode” in CoT prompting by sampling from the language model’s decoder to generate a diverse set of reasoning paths; and
3. Marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.
:::
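A minimal Python sketch of these three steps, assuming hypothetical `model.generate` and `extract_answer` helpers (the former samples one chain-of-thought continuation from the decoder, the latter parses the final answer out of it):
```python
from collections import Counter

def self_consistency(model, cot_prompt, question, n_samples=40, temperature=0.7):
    """Steps 1-3 above: prompt with CoT exemplars, sample diverse reasoning
    paths instead of greedy decoding, then majority-vote the final answers."""
    answers = []
    for _ in range(n_samples):
        # Step 2: each sampled decode may follow a different reasoning path.
        path = model.generate(cot_prompt + question, temperature=temperature)
        # `extract_answer` is a hypothetical parser for the final answer span.
        answers.append(extract_answer(path))
    # Step 3: marginalize out the reasoning paths via a majority vote.
    return Counter(answers).most_common(1)[0][0]
```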
- Such an approach is analogous to the human experience that if **multiple different ways of thinking lead to the same answer, one has greater confidence that the final answer is correct**.
- Compared to other decoding methods, <span class='red'>self-consistency avoids the repetitiveness and local-optimality</span> that plague greedy decoding, while mitigating the stochasticity of a single sampled generation.
- **Self-consistency is entirely unsupervised**, works off-the-shelf with pre-trained language models, requires no additional human annotation, and avoids any additional training, auxiliary models or fine-tuning.
## 2. Self-Consistency Over Diverse Reasoning Paths
- A salient aspect of humanity is that **people think differently**. It is natural to suppose that in tasks requiring deliberate thinking, there are likely several ways to attack the problem.
- We propose that such a process can be **simulated in language models via sampling from the language model’s decoder**.
- We hypothesize that correct reasoning processes, even if they are diverse, <span class='red'>tend to have greater agreement in their final answer than incorrect processes</span>.
**In more detail:**
1. Assume the generated answers $a_i$ are from a fixed answer set, $a_i \in A$, where $i = 1, \ldots, m$ indexes the $m$ candidate outputs sampled from the decoder.
2. Given a prompt and a question, self-consistency introduces an additional latent variable $r_i$, which is a sequence of tokens representing the reasoning path in the i-th output.
3. Self-consistency then couples the generation of $(r_i, a_i)$, where $r_i \rightarrow a_i$.
:arrow_right: <span class='green'>Generating a reasoning path $r_i$ is optional and only used to reach the final answer $a_i$.</span>
4. After sampling multiple $(r_i, a_i)$ from the model’s decoder, self-consistency applies a marginalization over $r_i$ by taking a majority vote over $a_i$,
:arrow_right: <span class='green'>i.e., choosing the most “consistent” answer among the final answer set.</span>
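Formally, the unweighted majority vote selects the answer that the largest number of sampled reasoning paths agree on:
$$
\hat{a} = \arg\max_{a}\sum_{i=1}^{m}\mathbb{1}(a_i = a)
$$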
**Different aggregation strategies**

- In Table 1, we show the test accuracy over a set of reasoning tasks by using different answer aggregation strategies.
- In addition to majority vote, one can also <span class='red'>**weight** each $(r_i, a_i)$ by $P(r_i, a_i \mid \text{prompt}, \text{question})$ when aggregating the answers</span>.
- Note that to compute $P(r_i, a_i \mid \text{prompt}, \text{question})$, we can either:
1. take the **unnormalized probability** of the model generating $(r_i, a_i)$ given (prompt, question),
2. or **normalize the conditional probability** by the output length (Equation 1):
$$
P(r_i, a_i \mid \text{prompt}, \text{question}) = \exp\left(\frac{1}{K}\sum_{k=1}^{K}\log P(t_k \mid \text{prompt}, \text{question}, t_1, \ldots, t_{k-1})\right)
$$
- where $\log P(t_k \mid \text{prompt}, \text{question}, t_1, \ldots, t_{k-1})$ is the log probability of generating the k-th token $t_k$ in $(r_i, a_i)$ conditioned on the previous tokens, and
- $K$ is the total number of tokens in $(r_i, a_i)$.
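A short sketch of these aggregation options, assuming each sample is given as a hypothetical (answer, per-token log-probability list) pair:
```python
import math
from collections import defaultdict

def aggregate(samples, strategy="unweighted_sum"):
    """Aggregate sampled (answer, token_logprobs) pairs, where token_logprobs
    holds log P(t_k | prompt, question, t_1, ..., t_{k-1}) for each token of
    the decoded (r_i, a_i)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for answer, logps in samples:
        if strategy == "unweighted_sum":        # plain majority vote
            w = 1.0
        elif strategy == "unnormalized":        # raw P(r_i, a_i | prompt, question)
            w = math.exp(sum(logps))
        else:                                   # length-normalized weight (Equation 1)
            w = math.exp(sum(logps) / len(logps))
        sums[answer] += w
        counts[answer] += 1
    if strategy == "weighted_average":          # weighted sum / #occurrences of a
        return max(sums, key=lambda a: sums[a] / counts[a])
    return max(sums, key=sums.get)
```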
:::success
**Result:**
1. In Table 1, we show that taking the “<span class='red'>unweighted sum</span>”, i.e., taking a majority vote directly over $a_i$, <span class='red'>yields a very similar accuracy to aggregating using the “normalized weighted sum”</span>.
- We took a closer look at the model’s output probabilities and found this is because, for each $(r_i, a_i)$, **the normalized conditional probabilities $P(r_i, a_i \mid \text{prompt}, \text{question})$ are quite close to each other, i.e., the language model regards those generations as “similarly likely”**.
2. Table 1 shows that the <span class='red'>“normalized” weighted sum</span> (i.e., Equation 1) <span class='red'>yields a much higher accuracy compared to its unnormalized counterpart</span>.
3. For completeness, in Table 1 we also report the results of <span class='red'>taking a “weighted average”</span>, i.e., each $a$ gets a score of its weighted sum divided by its number of occurrences in the answer set, $\sum_{i=1}^{m}\mathbb{1}(a_i = a)$, which <span class='red'>results in a much worse performance</span>.
:::
:::warning
In natural language processing, working with natural logarithms of probabilities matters in several ways:
1. **Representing the probability of a word in a sentence.** Word probabilities are typically tiny; working in log space keeps the arithmetic simple and numerically stable.
2. **Computing the generation probability of a sentence.** A sentence's probability is a product of many per-token probabilities; in log space the product becomes a sum, which avoids numerical underflow.
3. **Computing the similarity between sentences.** Probability-based similarity measures are likewise easier to compute in log space.
In short, using natural logarithms helps us handle probability computations more effectively.
:::
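A small, self-contained demonstration of why log space matters: multiplying many per-token probabilities underflows float64, while summing their logs stays representable (the 0.5 probability and 2000-token length are made-up numbers):
```python
import math

token_probs = [0.5] * 2000   # hypothetical per-token probabilities

# Multiplying the raw probabilities underflows float64 to exactly 0.0.
product = 1.0
for p in token_probs:
    product *= p
print(product)   # 0.0 (the true value 2**-2000 is far below the float64 range)

# Summing the log probabilities stays well within range.
log_prob = sum(math.log(p) for p in token_probs)
print(log_prob)  # -1386.29...
```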
:::warning
The difference between **normalized probability** and **unnormalized probability** is as follows:
Unnormalized probability: the raw probability the model outputs directly, with no further processing.
Normalized probability: the model's raw probabilities after normalization, whose purpose is to make the probabilities of all candidate answers sum to 1.
Concretely, suppose a fill-in-the-blank question has 3 candidate answers A, B, and C, and the model assigns them unnormalized probabilities of 0.7, 0.5, and 0.3. We can normalize them with the following formulas:
P(A) = 0.7 / (0.7+0.5+0.3) ≈ 0.47
P(B) = 0.5 / (0.7+0.5+0.3) ≈ 0.33
P(C) = 0.3 / (0.7+0.5+0.3) = 0.20
After normalization, the three answers' probabilities sum to 1.
<span class='red'>Normalized probabilities are usually more meaningful, because they can be interpreted as the probability of selecting each answer</span>. Unnormalized probabilities cannot be used directly as selection probabilities, because they do not sum to 1 (here the sum is 1.5). This is why we often need to normalize the model's raw probabilities.
:::
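The normalization above in a few lines of Python, using the same made-up scores:
```python
scores = {"A": 0.7, "B": 0.5, "C": 0.3}   # unnormalized model scores
total = sum(scores.values())              # 1.5, so not yet a distribution
normalized = {k: v / total for k, v in scores.items()}
print(normalized)  # {'A': 0.466..., 'B': 0.333..., 'C': 0.2}, sums to 1
```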
- Self-consistency explores an interesting space between **open-ended text generation** and **optimal text generation with a fixed answer** ( Reasoning tasks typically have fixed answers ).
- One should note that <span class='red'>self-consistency can be applied only to problems where the final answer is from a fixed answer set</span>. But in principle this approach can be extended to open-text generation problems if a good metric of consistency can be defined between multiple generations, e.g., whether two answers agree or contradict each other.
:::info
In other words, self-consistency currently only applies to problems whose answers come from a fixed set, such as multiple-choice questions.
In principle, however, the method can be extended to open-ended problems whose answers are not fixed, such as generative question answering. The key is to define a good consistency metric that measures agreement between multiple generated answers, e.g., whether they contradict or support each other.
If such a consistency metric can be found, it may become possible to apply self-consistency to more open-ended text generation tasks.
:::
## 3. Experiments
*We conducted a series of experiments to compare the proposed self-consistency method with existing approaches on a range of reasoning benchmarks.*
## 3.1 Experiment Setup
### 3.1.1 Tasks and datasets
1. Arithmetic reasoning:
- Math Word Problem Repository - **AddSub**, **MultiArith**, **ASDiv**
- **AQUA-RAT**, **GSM8K** ( a recently published benchmark of grade-school math problems ), and **SVAMP** ( a challenge dataset over math word problems )
2. Commonsense reasoning:
- **CommonsenseQA**
- **StrategyQA**
- **AI2 Reasoning Challenge (ARC)**
3. Symbolic Reasoning:
- **Last letter concatenation**
- **Coin flip**
### 3.1.2 Language models and prompts
*We evaluate self-consistency over four transformer-based language models with varying scales:*
1. **UL2** - is an <span class='blue'>encoder-decoder model with 20-billion parameters, trained on a mixture of denoisers</span>, and is thus more compute-friendly.
2. **GPT-3** - <span class='blue'>with 175-billion parameters</span>. We use two public engines code-davinci-001 and code-davinci-002 from the Codex series to aid reproducibility.
3. **LaMDA-137B** - is a <span class='blue'>dense left-to-right</span>, <span class='blue'>decoder-only language model with 137-billion parameters</span>, pre-trained on a mixture of web documents, dialog data and Wikipedia.
4. **PaLM-540B** - is a <span class='blue'>dense left-to-right</span>, <span class='blue'>decoder-only language model with 540-billion parameters</span>, pre-trained on a high quality corpus of 780 billion tokens with filtered webpages, books, Wikipedia, news articles, source code, and social media conversations.
:::success
1. We **perform all experiments in the few-shot setting, without training or fine-tuning the language models**.
2. For a fair comparison we use the same prompts as in Wei et al. (2022):
- For all arithmetic reasoning tasks we use the same set of 8 manually written exemplars.
- For each commonsense reasoning task, 4-7 exemplars are randomly chosen from the training set with manually composed chain-of-thought prompts.
:::
### 3.1.3 Sampling scheme
- **UL2-20B** and **LaMDA-137B**:
- temperature sampling with <span class='red'>T = 0.5</span>
- truncated at the <span class='red'>top-k (k = 40)</span> tokens with the highest probability
- **PaLM-540B**:
- <span class='red'>T = 0.7</span>
- <span class='red'>k = 40</span>
- **GPT-3**:
- <span class='red'>T = 0.7</span>
- <span class='red'>without top-k truncation</span>
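A sketch of what temperature sampling with optional top-k truncation looks like over a next-token logit vector (NumPy, with hypothetical logits; not the authors' decoding code):
```python
import numpy as np

def sample_next_token(logits, temperature=0.5, top_k=40, rng=np.random.default_rng()):
    """Temperature sampling with optional top-k truncation over next-token logits."""
    scaled = logits / temperature                 # T < 1 sharpens, T > 1 flattens
    if top_k is not None and top_k < len(scaled):
        cutoff = np.sort(scaled)[-top_k]          # k-th highest scaled logit
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())         # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```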
## 3.2 Main Results
- We report the **results of self-consistency averaged over 10 runs, where we sampled 40 outputs independently from the decoder in each run**.
- The baseline we compare to is chain-of-thought prompting with greedy decoding.
### 3.2.1 Arithmetic Reasoning

:::success
1. Self-consistency improves the arithmetic reasoning performance over all four language models significantly over chain-of-thought prompting.
2. <span class='red'>The gains become more significant when the language model’s scale increases</span>.
3. With self-consistency, we achieve new state-of-the-art results on almost all tasks.
4. Despite the fact that self-consistency is unsupervised and task-agnostic, these results compare favorably to existing approaches that require task-specific training, or fine-tuning with thousands of examples.
:::
### 3.2.2 Commonsense and Symbolic Reasoning

:::success
1. Self-consistency yields large gains across all four language models, and obtains SoTA results on 5 out of 6 tasks.
2. In this challenging OOD ( Out-of-Domain/Out-of-Distribution ) setting, the gain of self-consistency is still quite significant compared to CoT-prompting with sufficient model sizes.
:::

:::success
**effect of the number of sampled reasoning paths:**
1. The results show that sampling a higher number (e.g., 40) of reasoning paths leads to a consistently better performance.
2. This further emphasizes the importance of introducing diversity into the reasoning paths.
3. We show self-consistency yields a richer set of reasoning paths compared to greedy decoding with a few example questions from two tasks.
:::
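A toy simulation (not from the paper) of why more sampled paths help: if each path is independently correct with probability p > 0.5, a majority vote over more paths is right more often:
```python
import random
from collections import Counter

def simulated_accuracy(p=0.6, n_paths=40, trials=10_000):
    """Fraction of trials where the majority vote over n_paths sampled answers
    is correct, assuming each path is independently correct with probability p
    and all wrong paths happen to agree (a pessimistic toy assumption)."""
    correct = 0
    for _ in range(trials):
        votes = ["right" if random.random() < p else "wrong" for _ in range(n_paths)]
        correct += Counter(votes).most_common(1)[0][0] == "right"
    return correct / trials

for n in (1, 5, 10, 20, 40):
    print(n, simulated_accuracy(n_paths=n))
```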
## 3.3 Self-Consistency Helps When Chain-of-Thought Hurts Performance
- Ye & Durrett (2022) show that <span class='red'>sometimes chain-of-thought prompting could hurt performance compared to standard prompting in few-shot in-context learning</span>.
- Here we perform a study using self-consistency to see if it can help fill in the gap, over a set of common NLP tasks.

:::success
1. The results over **PaLM-540B**.
2. For some tasks (e.g., ANLI-R1, e-SNLI, RTE), adding chain-of-thought does hurt performance compared to standard prompting.
3. But self-consistency is able to robustly boost the performance and outperform standard prompting, making it a reliable way to add rationales in few-shot in-context learning for common NLP tasks.
:::
## 3.4 Comparison to Other Existing Approaches
### 3.4.1 Comparison to Sample-and-Rank
- A commonly used approach to improve generation quality is sample-and-rank,
- where <span class='red'>multiple sequences are sampled from the decoder and then ranked according to each sequence’s log probability</span>.

:::success
1. We compare self-consistency with sample-and-rank **on GPT-3 code-davinci-001**, by **sampling the same number of sequences from the decoder as self-consistency** and taking the final answer from the top-ranked sequence.
2. While sample-and-rank does improve the accuracy with additionally sampled sequences and ranking, <span class='red'>the gain is much smaller compared to self-consistency</span>.
:::
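The difference between the two aggregation rules in miniature, over made-up (answer, sequence log-probability) samples:
```python
from collections import Counter

samples = [("18", -41.2), ("26", -40.8), ("18", -43.5), ("18", -44.1), ("26", -45.0)]

# Sample-and-rank: trust the single highest-log-probability sequence.
rank_answer = max(samples, key=lambda s: s[1])[0]                # -> "26"

# Self-consistency: majority vote over all sampled final answers.
sc_answer = Counter(a for a, _ in samples).most_common(1)[0][0]  # -> "18"
```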
### 3.4.2 Comparison to Beam Search

:::success
1. We compare self-consistency with beam search decoding **on the UL2-20B model**.
2. For a fair comparison we report the accuracy **under the same number of beams and reasoning paths**.
3. **Self-consistency can also adopt beam search to decode each reasoning path** (results are shown as “Self-consistency using beam search”), but <span class='red'>its performance is worse compared to self-consistency with sampling</span>.
4. <span class='red'>The reason is that beam search yields a lower diversity in the outputs</span>, while in self-consistency the diversity of the reasoning paths is the key to a better performance.
:::
### 3.4.3 Comparison to Ensemble-based Approaches
- We further compare self-consistency to **ensemble-based methods for few-shot learning**.
- We consider ensembling by:
1. **Prompt order permutation**: we randomly permute the exemplars in the prompt 40 times to mitigate the model’s sensitivity to prompt order.
2. **Multiple sets of prompts**: we manually write 3 different sets of prompts.
- We took a majority vote of the answers from greedy decoding in both approaches as an ensemble (a sketch of the permutation variant follows below).
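A sketch of the prompt-order-permutation baseline, assuming a hypothetical `greedy_generate(prompt)` call that greedily decodes one final answer:
```python
import random
from collections import Counter

def prompt_order_ensemble(exemplars, question, greedy_generate, n_permutations=40):
    """Shuffle the few-shot exemplars n_permutations times, greedily decode once
    per ordering, and majority-vote the resulting answers."""
    answers = []
    for _ in range(n_permutations):
        shuffled = random.sample(exemplars, len(exemplars))  # shuffled copy
        prompt = "\n\n".join(shuffled) + "\n\nQ: " + question + "\nA:"
        answers.append(greedy_generate(prompt))
    return Counter(answers).most_common(1)[0][0]
```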

:::success
1. Compared to self-consistency, <span class='red'>existing ensemble-based approaches achieve a much smaller gain</span>.
2. Self-consistency is different from a typical model-ensemble approach, where multiple models are trained and their outputs are aggregated. **Self-consistency acts more like a “self-ensemble” on top of a single language model**.
:::
## 3.5 Additional Studies
### 3.5.1 Self-Consistency is Robust to Sampling Strategies and Scaling
- We show self-consistency is robust to sampling strategies and parameters, <span class='red'>by varying T in temperature sampling, k in top-k sampling, and p in nucleus sampling</span>, over PaLM-540B.

:::success
1. **(LEFT)** Self-consistency robustly improves performance across all scales for the LaMDA-137B model series.
2. <span class='red'>**(RIGHT)** The gain is relatively lower for smaller models, because certain abilities (e.g., arithmetic) only emerge when the model reaches a sufficient scale</span>.
:::
### 3.5.2 Self-Consistency Improves Robustness to Imperfect Prompts
- For few-shot learning with manually constructed prompts, **human annotators sometimes make minor mistakes when creating the prompts**.
- We further study if self-consistency can help improve a language model’s robustness to imperfect prompts.

:::success
1. **(LEFT)** Imperfect prompts decrease accuracy with greedy decoding (17.1 → 14.9), but self-consistency can fill in the gap and robustly improve the results.
2. <span class='red'>**(RIGHT)** Consistency (in terms of % of decodes agreeing with the final aggregated answer) is highly correlated with accuracy</span>.
3. **(RIGHT)** This suggests that one <span class='red'>can use self-consistency to provide an uncertainty estimate of the model in its generated solutions</span>.
4. <span class='red'>**(RIGHT)** One can use low consistency as an indicator that the model has low confidence</span>.
5. <span class='red'>**(RIGHT)** Self-consistency confers some ability for the model to “know when it doesn’t know”</span>.
:::
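Reading consistency as a confidence signal is a one-liner on top of the majority vote; a sketch with a made-up 0.5 threshold:
```python
from collections import Counter

def answer_with_confidence(answers):
    """Return the majority answer and the fraction of decodes that agree with it."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

answer, consistency = answer_with_confidence(["18", "26", "18", "9.5", "7"])
if consistency < 0.5:  # hypothetical threshold: low agreement means low confidence
    print(f"Model is unsure about {answer} (consistency = {consistency:.2f})")
```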
### 3.5.3 Self-Consistency Works for Non-Natural-Language Reasoning Paths and Zero-shot CoT

:::success
1. Self-consistency still improves accuracy by generating intermediate equations; however, <span class='red'>compared to generating natural language reasoning paths, the gain is smaller since the equations are much shorter and less opportunity remains for generating diversity in the decoding process</span>.
2. Self-consistency works for zero-shot CoT as well and improves the results significantly (+26.2%).
:::
## 4. Conclusion and Discussion
1. We introduced a simple yet effective method called self-consistency.
2. Self-consistency is also useful for collecting rationales when performing reasoning tasks with language models, and for providing uncertainty estimates and improved calibration of language model outputs.
3. <span class='red'>One limitation of self-consistency is that it incurs more computation cost</span>.
4. In most cases the performance saturates quickly, so a small number of sampled paths (e.g., 5 or 10) already recovers most of the gains.
5. As part of future work, one could **use self-consistency to generate better supervised data to fine-tune the model**, such that the model can give more accurate predictions in a single inference run after fine-tuning.
## 5. Related Work
1. Reasoning in language models
- **Language models are known to struggle in Type 2 tasks**, such as arithmetic, logical and commonsense reasoning.
- Self-consistency is applicable to a wide range of reasoning tasks without any additional supervision or fine-tuning.
2. Sampling and re-ranking in language models
- Multiple decoding strategies for language models have been proposed in the literature, e.g., temperature sampling, top-k sampling, nucleus sampling, minimum Bayes risk decoding, and typical decoding.
- **Re-ranking is another common approach to improve generation quality in language models**.
- All these methods require either training an additional re-ranker or collecting additional human annotations, while self-consistency requires no additional training, fine-tuning, or extra data collection.
3. Extracting reasoning paths
- Some previous work has considered task-specific approaches for identifying reasoning paths.
- Compared to these approaches, self-consistency is far simpler and requires no additional training.
- The approach we propose **simply couples the generation of reasoning paths and a final answer by sampling from the decoder, using aggregation to recover the most consistent answer without additional modules**.
4. Consistency in language models
- In this paper we focus on a slightly different notion of “consistency”, i.e., **utilizing answer consistency among diverse reasoning paths to improve accuracy**.