<style> .red { color: red; } .blue{ color: blue; } .green{ color: green; } </style>

# [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)

## 1. Introduction

- Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency.
- However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as **arithmetic**, **commonsense**, and **symbolic reasoning**.
- This work explores how the reasoning ability of large language models can be unlocked by a simple method motivated by two ideas:
    1. <span class='red'>First, techniques for arithmetic reasoning can benefit from generating natural language rationales that lead to the final answer.</span> Prior work has given models the ability to generate natural language intermediate steps by <span class='blue'>training from scratch</span> or finetuning a pretrained model, in addition to neuro-symbolic methods that use formal languages instead of natural language.

:::info
In machine learning, "training from scratch" means training a model from the beginning, without using any pretrained model or parameters.

Concretely, it means:
1. No pretrained model is used; training starts from a randomly initialized model.
2. All of the model's parameters and weights are learned from scratch, without reusing a pretrained model's parameters.

By contrast, fine-tuning starts from a pretrained model and reuses the parameters and feature-extraction ability it has already learned, typically adapting only some of the later layers to the specific task.

So prior work enabled models to generate the natural language intermediate steps of arithmetic reasoning either by training entirely from scratch or by finetuning a pretrained model.
:::

    2. <span class='red'>Second, large language models offer the exciting prospect of in-context few-shot learning via prompting.</span> That is, instead of finetuning a separate language model checkpoint for each new task, one can simply “prompt” the model with a few input–output exemplars demonstrating the task.
- Both of the above ideas, however, have key limitations:
    1. <span class='red'>For rationale-augmented training and finetuning methods, it is costly to create a large set of high-quality rationales</span>, which is much more complicated than the simple input–output pairs used in normal machine learning.
    2. <span class='red'>For the traditional few-shot prompting method used in Brown et al. (2020), it works poorly on tasks that require reasoning abilities</span>, and often does not improve substantially with increasing language model scale.
- In this paper, we combine the strengths of these two ideas in a way that avoids their limitations. Specifically, we explore the ability of language models to perform few-shot prompting for reasoning tasks, **given a prompt that consists of triples: <input, chain of thought, output>** (a small sketch of this prompt format is given at the end of this section).
- A prompting-only approach is important because it does not require a large training dataset and because a single model checkpoint can perform many tasks without loss of generality. This work underscores how large language models can learn from just a few examples with natural language data about the task.

:::info
In other words: a prompting-only approach matters because it does not require a large training dataset, and a single model checkpoint can perform many tasks without losing generality. The paper also emphasizes that large language models can learn a task from only a few natural language examples of it.

Concretely:
1. Prompting does not require large-scale labeled training data; given only a few exemplars, the model can carry out many new tasks.
2. A single model can perform many different tasks via prompting, without finetuning a separate model for each task, which preserves the model's generality.
3. Large language models are capable enough to learn new tasks from very few natural language examples, without relying on large amounts of training data.

This reflects the strong generalization ability of large language models and their capacity for cross-task transfer learning.
:::
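To make the ⟨input, chain of thought, output⟩ triple concrete, here is a minimal Python sketch (written for this note, not code from the paper) of how such a few-shot prompt could be assembled. The exemplar text paraphrases the style of the paper's Figure 1, and `build_prompt` is a hypothetical helper.

```python
# Illustrative sketch (not code from the paper): assembling a chain-of-thought
# few-shot prompt from <input, chain of thought, output> triples.
exemplars = [
    {
        "input": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
                 "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "chain_of_thought": "Roger started with 5 balls. 2 cans of 3 tennis balls "
                            "each is 6 tennis balls. 5 + 6 = 11.",
        "output": "The answer is 11.",
    },
    # ... more manually written exemplars (the paper uses eight for most benchmarks)
]

def build_prompt(exemplars, question):
    """Concatenate each exemplar's question, chain of thought, and answer,
    then append the unanswered test question for the model to complete."""
    parts = []
    for ex in exemplars:
        parts.append(f"Q: {ex['input']}\nA: {ex['chain_of_thought']} {ex['output']}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

print(build_prompt(exemplars, "A juggler can juggle 16 balls. Half of the balls "
                              "are golf balls. How many golf balls are there?"))
```

Each exemplar contributes a question, a worked-out rationale, and a final answer; at test time only the new question is appended, and the model is expected to continue with its own chain of thought followed by the answer.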
## 2. Chain-of-Thought Prompting

![截圖 2024-01-08 18.19.56](https://hackmd.io/_uploads/HkDw3HtOa.png =90%x)

- Consider one's own thought process when solving a complicated reasoning task such as a multi-step math word problem. **It is typical to decompose the problem into intermediate steps and solve each before giving the final answer.**
- We will show that sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting.
- The chain of thought in this case resembles a solution and can be interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it mimics a step-by-step thought process for arriving at the answer.
- Chain-of-thought prompting has several attractive properties as an approach for facilitating reasoning in language models:
    1. First, chain of thought, in principle, allows models to decompose multi-step problems into intermediate steps, which means that <span class='red'>additional computation can be allocated to problems that require more reasoning steps.</span>
    2. Second, <span class='red'>a chain of thought provides an interpretable window into the behavior of the model</span>, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong (although fully characterizing a model's computations that support an answer remains an open question).

:::info
What this caveat means: even when a model produces a chain of thought on its way to an answer, we still cannot fully describe how the model's internal computations support generating that answer and that chain of thought. This remains an open problem.

Concretely:
1. The model can generate a chain of thought for its solution, which on the surface mimics a human thought process.
2. But we still do not understand the model's internal computations, nor how these chains of thought are actually produced inside the model.
3. Describing and characterizing the computations that support the model's outputs remains an unresolved open research question.

So although models can generate chains of thought, we cannot yet fully understand the computation behind them; this is left to future research.
:::

    3. Third, <span class='red'>chain-of-thought reasoning can be used</span> for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable (at least in principle) <span class='red'>to any task that humans can solve via language.</span>
    4. Finally, <span class='red'>chain-of-thought reasoning can be readily elicited in sufficiently large off-the-shelf language models</span> simply by including examples of chain-of-thought sequences in the exemplars of few-shot prompting.

## 3. Arithmetic Reasoning

- Though simple for humans, arithmetic reasoning is a task where language models often struggle.
- Strikingly, chain-of-thought prompting when <span class='green'>used with the 540B-parameter language model performs comparably with task-specific finetuned models on several tasks</span>, even achieving new state of the art on the challenging GSM8K benchmark.

![image](https://hackmd.io/_uploads/SkCr99Fd6.png)

## 3.1 Experimental Setup

### Benchmark

1. **GSM8K** benchmark - math word problems
2. **SVAMP** dataset - math word problems with varying structures
3. **ASDiv** dataset - diverse math word problems
4. **AQuA** dataset - algebraic word problems
5. **MAWPS** benchmark

### Standard prompting

- The language model is given <span class='blue'>in-context</span> exemplars of input–output pairs before outputting a prediction for a test-time example. Exemplars are formatted as questions and answers.

:::info
Here "in-context" means operating within the language model's context, i.e., reasoning or answering on the basis of the text already present in the model's context.

Concretely, "in-context" implies:
1. The question or reasoning task is embedded in the text given to the language model.
2. The language model must answer or reason based on the text that appears earlier in its context.
3. The model leverages the knowledge and language understanding acquired during pretraining to reason within that specific context.

By contrast, "out-of-context" would mean asking the language model to answer or reason directly, without any such context to draw on.

So "in-context few-shot learning via prompting" means embedding a small number of exemplars in the model's context so that it learns from those exemplars how to carry out the task, without any additional data for fine-tuning.
:::
### Chain-of-thought prompting

- <span class='red'>Augment each exemplar</span> in few-shot prompting with a chain of thought for an associated answer.
- As most of the datasets only have an evaluation split, we <span class='red'>manually composed a set of eight few-shot exemplars with chains of thought for prompting</span>.
- We used this single set of eight chain-of-thought exemplars for all benchmarks except AQuA, which is multiple choice instead of free response.

### Language Models

1. **GPT-3**: 350M, 1.3B, 6.7B, and 175B parameters
2. **LaMDA**: 422M, 2B, 8B, 68B, and 137B parameters
3. **PaLM**: 8B, 62B, and 540B parameters
4. **UL2**: 20B parameters
5. **Codex**

*We sample from the models via greedy decoding (though follow-up work shows chain-of-thought prompting can be improved by taking the majority final answer over many sampled generations; a small sketch of this idea follows below). For LaMDA, we report averaged results over five random seeds, where each seed had a different randomly shuffled order of exemplars. As LaMDA experiments did not show large variance among different seeds, to save compute we report results for a single exemplar order for all other models.*

:::info
What this passage says: outputs are sampled from the models via greedy decoding (although follow-up work shows that taking the majority final answer over many sampled generations can improve chain-of-thought prompting). For LaMDA, the reported numbers are averaged over five random seeds, each with a different randomly shuffled exemplar order. Since the LaMDA experiments did not vary much across seeds, results for all other models are reported for a single exemplar order to save compute.

In other words:
1. Model outputs are generated via greedy decoding.
2. Follow-up work found that taking the majority final answer over multiple sampled generations improves chain-of-thought prompting.
3. LaMDA results are averages over five random seeds, each seed using a different exemplar order.
4. Because LaMDA showed little variance across seeds, all other models report results for a single exemplar order to save compute.
:::
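A minimal sketch of the majority-vote idea mentioned above, assuming a hypothetical `sample_completion(prompt)` function that returns one sampled generation; the regex-based final-answer extraction is also an illustrative assumption, not the paper's exact evaluation code.

```python
import re
from collections import Counter

def extract_final_answer(generation):
    """Pull the last number following 'The answer is' from a generated chain of
    thought. Illustrative extraction rule, not the paper's evaluation code."""
    matches = re.findall(r"[Tt]he answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", generation)
    return matches[-1].replace(",", "") if matches else None

def majority_vote_answer(prompt, sample_completion, num_samples=40):
    """Sample several chains of thought for the same prompt and return the most
    common final answer. `sample_completion` is a hypothetical model call."""
    answers = []
    for _ in range(num_samples):
        answer = extract_final_answer(sample_completion(prompt))
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None
```

With a single greedy-decoded sample this reduces to extracting one final answer, which matches the setup used in this paper; the majority vote over many samples is the follow-up idea.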
## 3.2 Result

![截圖 2024-01-08 19.55.54](https://hackmd.io/_uploads/HklyXvYu6.png =50%x)

1. <span class='red'>Chain-of-thought prompting is an emergent ability of model scale</span>.
    - That is, chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of roughly 100B parameters.
    - We qualitatively found that models of smaller scale produced fluent but illogical chains of thought, leading to lower performance than standard prompting.
2. <span class='red'>Chain-of-thought prompting has larger performance gains for more-complicated problems</span>.
    - Performance more than doubled for the largest GPT and PaLM models.
    - On the easiest subset of MAWPS, which only requires a single step to solve, performance improvements were either negative or very small.
3. <span class='red'>Chain-of-thought prompting via GPT-3 175B and PaLM 540B compares favorably to prior state of the art</span>, which typically finetunes a task-specific model on a labeled training dataset.
    - PaLM 540B uses chain-of-thought prompting to achieve new state of the art on GSM8K, SVAMP, and MAWPS (though note that standard prompting already passed the prior best for SVAMP).
4. To better understand why chain-of-thought prompting works, we manually examined model-generated chains of thought produced by LaMDA 137B for GSM8K.
    - The summary of this analysis is that <span class='red'>46% of the chains of thought were almost correct, barring minor mistakes</span> (calculator error, symbol mapping error, or one reasoning step missing), and that the <span class='red'>other 54% of the chains of thought had major errors in semantic understanding or coherence</span>.
5. Why does scaling improve chain-of-thought reasoning ability?
    - We performed a similar analysis of errors made by PaLM 62B and whether those errors were fixed by scaling to PaLM 540B.
    - The summary is that <span class='green'>scaling PaLM to 540B fixes a large portion of the one-step-missing and semantic-understanding errors of the 62B model</span>.

## 3.3 Ablation Study

*The observed benefits of using chain-of-thought prompting raise the natural question of <span class='red'>whether the same performance improvements can be conferred via other types of prompting</span>.* An illustrative sketch of the prompt variants tested here follows the list below.

![截圖 2024-01-08 20.19.10](https://hackmd.io/_uploads/rkl5ODF_a.png =45%x)

1. **Equation only**
    - Chain-of-thought prompting produces the mathematical equation to be evaluated, so we test a variation where the model is prompted to <span class='red'>output only a mathematical equation before giving the answer</span>.
    - This variant does not help much on GSM8K, <span class='green'>because the semantics of the questions in GSM8K are too challenging to translate directly into an equation without the natural language reasoning steps of a chain of thought</span>.
    - For datasets of one-step or two-step problems, however, we find that equation-only prompting does improve performance, since the equation can be easily derived from the question.
2. **Variable compute only**
    - Another intuition is that chain of thought allows the model to spend more computation (i.e., more intermediate tokens) on harder problems.
    - We test a configuration where the model is <span class='red'>prompted to output only a sequence of dots (...) equal to the number of characters in the equation</span> needed to solve the problem.
    - <span class='green'>Variable computation by itself is not the reason for the success of chain-of-thought prompting; there appears to be utility in expressing the intermediate steps via natural language</span>.

:::info
What this means: the variable amount of computation by itself is not why chain-of-thought prompting succeeds; expressing the intermediate steps in natural language is what appears to help.

In more detail:
1. Chain-of-thought prompting lets the model emit more intermediate steps, which increases the amount of computation.
2. But simply increasing the amount of computation does not improve results and is not the key to the method's success.
3. Expressing the intermediate steps in natural language does help the model reason.
4. So the success of chain-of-thought prompting appears to come from using natural language to reason step by step.

In short, rather than merely adding computation, expressing the intermediate steps of the reasoning process in natural language is necessary for the success of chain-of-thought prompting.
:::

3. **Chain of thought after answer**
    - Another potential benefit of chain-of-thought prompting could simply be that <span class='red'>such prompts allow the model to better access relevant knowledge acquired during pretraining</span>.
    - We test an alternative configuration where the <span class='red'>chain-of-thought prompt is only given after the answer</span>, isolating whether the model actually depends on the produced chain of thought to give the final answer.

:::info
What this means: we test an alternative configuration in which the chain-of-thought prompt is given only after the answer, to isolate whether the model actually depends on the generated chain of thought in order to give its final answer.

In more detail:
1. In the previous experimental setup, the model must generate a chain of thought before giving the answer.
2. To test whether the model really needs to rely on the chain of thought, an alternative setup is designed.
3. In this setup, the exemplars give the answer first, with the chain of thought only afterwards.
4. Comparing the performance of the two setups checks whether the model actually uses chain-of-thought reasoning to arrive at its answer.

This isolates how important the chain of thought is for producing the answer, testing whether the model genuinely makes use of the reasoning in the chain-of-thought prompt.
:::

- <span class='green'>The sequential reasoning embodied in the chain of thought is useful for reasons beyond just activating knowledge</span>.
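As referenced above, here is a minimal sketch of how a single exemplar might be rendered under each ablation. The wording is written for this note; the exact prompt text in the paper's appendix may differ.

```python
# Illustrative only: one exemplar rendered under the prompting variants from the
# ablation study. The wording is a sketch, not the paper's exact prompt text.
question = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")
chain_of_thought = ("Roger started with 5 balls. 2 cans of 3 tennis balls each is "
                    "6 tennis balls. 5 + 6 = 11.")
equation = "5 + 2 * 3 = 11"

variants = {
    # Full chain-of-thought prompting: natural language reasoning, then the answer.
    "chain_of_thought": f"Q: {question}\nA: {chain_of_thought} The answer is 11.",
    # Equation only: just the equation is emitted before the answer.
    "equation_only": f"Q: {question}\nA: {equation}. The answer is 11.",
    # Variable compute only: dots matching the number of characters in the equation,
    # so extra tokens are spent without any natural language reasoning.
    "variable_compute_only": f"Q: {question}\nA: {'.' * len(equation)} The answer is 11.",
    # Chain of thought after answer: the rationale cannot influence the answer tokens.
    "cot_after_answer": f"Q: {question}\nA: The answer is 11. {chain_of_thought}",
}

for name, exemplar in variants.items():
    print(f"--- {name} ---\n{exemplar}\n")
```

The dots variant matches the token budget of the reasoning without conveying any of it, which is what lets the ablation separate "more computation" from "reasoning expressed in natural language".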
## 3.4 Robustness of Chain-of-Thought

*Sensitivity to exemplars is a key consideration of prompting approaches:*

![image](https://hackmd.io/_uploads/SkcQmFFOa.png =40%x)

1. <span class='red'>Varying the permutation of few-shot exemplars</span> can cause the accuracy of GPT-3 on SST-2 to range from near chance (54.3%) to near state of the art (93.4%).
2. Chains of thought <span class='red'>written by different annotators</span>.
    - Although there is variance among different chain-of-thought annotations, as would be expected when using exemplar-based prompting, all sets of chain-of-thought prompts outperform the standard baseline by a large margin.
    - <span class='green'>This result implies that successful use of chain of thought does not depend on a particular linguistic style</span>.
3. To confirm that successful chain-of-thought prompting works for other sets of exemplars, we also <span class='red'>run experiments with three sets of eight exemplars randomly sampled from the GSM8K training set</span>.
    - <span class='green'>These prompts performed comparably with our manually written exemplars, also substantially outperforming standard prompting.</span>

:::success
In addition to robustness to <span class='red'>annotators</span>, <span class='red'>independently written chains of thought</span>, <span class='red'>different exemplars</span>, and <span class='red'>various language models</span>, we also find that chain-of-thought prompting for arithmetic reasoning is robust to <span class='red'>different exemplar orders</span> and <span class='red'>varying numbers of exemplars</span>.
:::

## 4. Commonsense Reasoning

- The language-based nature of chain of thought actually makes it applicable to a broad class of commonsense reasoning problems, which <span class='red'>involve reasoning about physical and human interactions under the presumption of general background knowledge</span>.
- Commonsense reasoning is key for interacting with the world and is still beyond the reach of current natural language understanding systems.

![image](https://hackmd.io/_uploads/r1Ktq5F_p.png)

### Benchmark

1. **CSQA** - asks commonsense questions about the world involving complex semantics that often require prior knowledge
2. **StrategyQA** - requires models to infer a multi-hop strategy to answer questions
3. Two tasks from the BIG-bench effort:
    - **Date Understanding** - involves inferring a date from a given context
    - **Sports Understanding** - involves determining whether a sentence relating to sports is plausible or implausible
4. **SayCan** dataset - involves mapping a natural language instruction to a sequence of robot actions from a discrete set

### Prompts

1. For **CSQA** and **StrategyQA**, we randomly selected examples from the training set and manually composed chains of thought for them to use as few-shot exemplars.
2. The two **BIG-bench** tasks do not have training sets, so we selected the first ten examples from the evaluation set as few-shot exemplars and report numbers on the rest of the evaluation set (a hypothetical exemplar in this style is sketched after this list).
3. For **SayCan**, we use six examples from the training set and also manually composed chains of thought.
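As referenced above, a hypothetical chain-of-thought exemplar in the style of the BIG-bench Date Understanding task, written for this note rather than copied from the paper's prompts:

```python
# Hypothetical exemplar in the style of a BIG-bench date-understanding question;
# written for illustration, not taken from the paper's appendix.
date_understanding_exemplar = (
    "Q: Yesterday was April 30, 2021. What is the date tomorrow in MM/DD/YYYY?\n"
    "A: Yesterday was 04/30/2021, so today is 05/01/2021. "
    "Tomorrow is one day after today, which is 05/02/2021. "
    "The answer is 05/02/2021."
)
print(date_understanding_exemplar)
```

As with the arithmetic prompts, the rationale spells out each inference (yesterday, then today, then tomorrow) before committing to the final answer.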
### Results

![image](https://hackmd.io/_uploads/SkzkiYF_T.png)

1. For all tasks, <span class='red'>scaling up model size improved the performance of standard prompting</span>.
2. Chain-of-thought prompting led to further gains, with improvements appearing to be largest for PaLM 540B.
    - With chain-of-thought prompting, PaLM 540B achieved strong performance relative to baselines, outperforming the prior state of the art on StrategyQA (75.6% vs 69.4%)
    - and outperforming an unaided sports enthusiast on sports understanding (95.4% vs 84%).

:::success
These results demonstrate that <span class='red'>chain-of-thought prompting can also improve performance on tasks requiring a range of commonsense reasoning abilities</span> (though note that the gain was minimal on CSQA).
:::

## 5. Symbolic Reasoning

- Symbolic reasoning is **simple for humans but potentially challenging for language models**.
- Chain-of-thought prompting not only enables language models to perform symbolic reasoning tasks that are challenging in the standard prompting setting, but also <span class='red'>facilitates length generalization to inference-time inputs longer than those seen in the few-shot exemplars</span>.

![image](https://hackmd.io/_uploads/Hy5999td6.png)

### Tasks

1. **Last letter concatenation**
    - This task asks the model to <span class='red'>concatenate the last letters of words in a name</span> (e.g., “Amy Brown” → “yn”).
    - It is a more challenging version of first letter concatenation, which language models can already perform without chain of thought.
    - We generate full names by randomly concatenating names from the top one-thousand first and last names from name [census data](https://namecensus.com/).
2. **Coin flip**
    - This task asks the model to <span class='red'>answer whether a coin is still heads up after people either flip or don't flip the coin</span>
    - e.g., “A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Is the coin still heads up?” → “no”.

As the construction of these symbolic reasoning tasks is well defined, for each task we consider:
- an **in-domain** test set, for which examples <span class='red'>had the same number of steps as the training/few-shot exemplars</span>, and
- an **out-of-domain** (OOD) test set, for which evaluation examples <span class='red'>had more steps than those in the exemplars</span>.

### Experimental Setup

For **last letter concatenation**, the model only sees exemplars of names with two words, and then performs last letter concatenation on names with 3 and 4 words. We do the same for the number of potential flips in the **coin flip task**. **Our experimental setup uses the same methods and models as in the prior two sections.** We again manually compose chains of thought for the few-shot exemplars for each task.
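Because both tasks are procedurally constructed, instances and their ground-truth answers can be generated directly. Below is a minimal sketch written for this note: the name lists are placeholders for the census data mentioned above, and the question wording is illustrative rather than the paper's exact phrasing. Out-of-domain instances simply use more words or more flips than the two-step exemplars.

```python
import random

FIRST_NAMES = ["Amy", "Phoebe", "Osvaldo", "Kobe", "Elle"]   # placeholder for census data
LAST_NAMES = ["Brown", "Nguyen", "Garcia", "Smith", "Khan"]  # placeholder for census data

def last_letter_concat_instance(num_words):
    """Last letter concatenation: in-domain instances use two-word names, as in the
    few-shot exemplars; OOD instances use 3- or 4-word names."""
    words = [random.choice(FIRST_NAMES)] + random.choices(LAST_NAMES, k=num_words - 1)
    name = " ".join(words)
    answer = "".join(word[-1] for word in words)
    question = f'Take the last letters of the words in "{name}" and concatenate them.'
    return question, answer

def coin_flip_instance(num_people):
    """Coin flip: track whether the coin is still heads up after each person either
    flips it or leaves it alone; OOD instances involve more people than the exemplars."""
    heads_up = True
    sentences = ["A coin is heads up."]
    for _ in range(num_people):
        person = random.choice(FIRST_NAMES)
        if random.random() < 0.5:
            sentences.append(f"{person} flips the coin.")
            heads_up = not heads_up
        else:
            sentences.append(f"{person} does not flip the coin.")
    sentences.append("Is the coin still heads up?")
    return " ".join(sentences), "yes" if heads_up else "no"

print(last_letter_concat_instance(num_words=2))  # in-domain: same number of steps as exemplars
print(last_letter_concat_instance(num_words=4))  # out-of-domain: more steps than exemplars
print(coin_flip_instance(num_people=3))          # out-of-domain if exemplars show two flips
```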
### Result

![image](https://hackmd.io/_uploads/r1kF0Ytda.png =40%x)

1. With PaLM 540B, chain-of-thought prompting leads to almost 100% solve rates (note that standard prompting already solves coin flip with PaLM 540B, though not for LaMDA 137B).
2. Note that <span class='red'>these in-domain evaluations are “toy tasks” in the sense that perfect solution structures are already provided by the chains of thought in the few-shot exemplars; **all the model has to do is repeat the same steps with the new symbols in the test-time example**</span>.
3. <span class='red'>Small models still fail</span>
    - <span class='green'>the ability to perform abstract manipulations on unseen symbols for these three tasks (Arithmetic Reasoning, Commonsense Reasoning, Symbolic Reasoning) only arises at the scale of 100B model parameters</span>.
4. As for the OOD evaluations:
    - standard prompting fails for both tasks;
    - with chain-of-thought prompting, language models achieve upward scaling curves (though performance is lower than in the in-domain setting);
    - hence, <span class='green'>chain-of-thought prompting facilitates length generalization beyond seen chains of thought for language models of sufficient scale</span>.

## 6. Discussion

- In all experiments, chain-of-thought reasoning is elicited simply by prompting an off-the-shelf language model. **No language models were finetuned in the process of writing this paper.**
- The emergence of chain-of-thought reasoning as a result of model scale has been a prevailing theme.
- <span class='red'>Our work underscores that standard prompting only provides a lower bound on the capabilities of large language models</span>.

### Limitations

1. We first qualify that although chain of thought emulates the thought processes of human reasoners, **this does not answer whether the neural network is actually “reasoning”**, which we leave as an open question.
2. Although the cost of manually augmenting exemplars with chains of thought is minimal in the few-shot setting, such annotation costs could be prohibitive for finetuning (though this could potentially be surmounted with synthetic data generation, or zero-shot generalization).

:::info
What this means: although manually augmenting exemplars with chains of thought is cheap in the few-shot setting, such annotation costs could become prohibitive for finetuning (though this might be mitigated by synthetic data generation or zero-shot generalization).

In more detail:
1. Adding chains of thought to a handful of exemplars by hand is cheap, but annotation costs become high for large-scale training.
2. Annotating chains of thought at scale could become a bottleneck for finetuning large models.
3. Automatically generating chain-of-thought annotations could reduce this cost.
4. Zero-shot generalization may offer another route that avoids large amounts of annotated data.

So manually writing chains of thought is feasible for few-shot prompting, but annotation cost remains a limitation for large-scale finetuning, to be addressed through automatic annotation or zero-shot generalization.
:::

3. **There is no guarantee of correct reasoning paths**, which can lead to both correct and incorrect answers; improving factual generations of language models is an open direction for future work.
4. Finally, the **emergence of chain-of-thought reasoning only at large model scales** makes it costly to serve in real-world applications; further research could explore how to induce reasoning in smaller models.

## 7. Conclusions

- Through experiments on arithmetic, symbolic, and commonsense reasoning, we find that chain-of-thought reasoning is an emergent property of model scale that allows sufficiently large language models to perform reasoning tasks that otherwise have flat scaling curves.
- Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.