# DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning ###### tags:`論文翻譯` `deeplearning` [TOC] ## 說明 排版的順序為先原文,再繁體中文,並且圖片與表格都會出現在第一次出現的段落下面 原文 繁體中文 照片或表格 :::warning 1. 個人註解,任何的翻譯不通暢部份都請留言指導 2. 為了加速閱讀,直接採用自建的反思翻譯(Phi4-14B模型所翻譯)的結果呈現,然後快速看過,語意對了頭過身就過,多多包函 3. 這篇論文測試利用docling將論文從pdf取出轉成markdown格式,再利用正則式取片段至dify建置的反思翻譯api取得譯文再做優化調整 4. 解釋的部份也是直接用Phi4-14B模型來針對片段做理解說明 5. 機器說明的部份我會把一些不必要的冗文刪除,這可能可以從提示詞中再來做回應的優化 ::: :::danger * [paper hyperlink](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf) * [github](https://github.com/deepseek-ai/DeepSeek-R1) ::: ## Abstract We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama. 我們推出了我們的第一個推理模型——DeepSeek-R1-Zero及DeepSeek-R1。DeepSeek-R1-Zero是通過大規模強化學習(RL)訓練而不需要監督式微調(SFT)的初步階段,展現了卓越的推理能力。透過強化學習的方式,DeepSeek-R1-Zero自然地表現出豐富的強大且有趣的推理行為。不過,它面臨著像是可讀性差以及語言混用等挑戰。為了解決這些問題並進一步提升推理的效能,我們引入了DeepSeek-R1,該模型在強化學習之前結合了多階段訓練(multi-stage training)和冷啟動資料(cold-start data)。DeepSeek-R1的效能與OpenAI-o1-1217在推理任務上有著可比擬的實力。為了支持研究社群,我們開源了DeepSeek-R1-Zero、DeepSeek-R1以及六款基於Qwen和Llama的從DeepSeek-R1中蒸餾的密集模型(1.5B、7B、8B、14B、32B、70B)。 ![image](https://hackmd.io/_uploads/SJ-LTVdKJx.png) Figure 1 | Benchmark performance of DeepSeek-R1. :::warning >[name=機器說明] 本文介紹了兩款基於大規模強化學習(RL)訓練的人工智慧模型——DeepSeek-R1-Zero及DeepSeek-R1。DeepSeek-R1-Zero作為初步模型展示了優異的推理能力,但因可讀性和語言表達能力不足而面臨挑戰。為了解決這些問題,研究團隊引入了DeepSeek-R1,該模型通過結合多階段訓練和冷啟動資料來改善效能。 研究團隊還開源了六款基於Qwen和Llama的密集模型(1.5B、7B、8B、14B、32B、70B),並說明其於推理任務上的效能與OpenAI-o1-1217相當。最後,附圖說明了DeepSeek-R1的基準效能。 ::: :::warning >[name=Felo.ai] 在深度學習和機器學習的領域中,“冷啟動資料”(cold-start data)是指在模型訓練初期所使用的少量資料,這些資料用於幫助模型在缺乏歷史資料的情況下進行初步的學習和調整。這一概念特別重要,因為許多機器學習模型在開始時並沒有足夠的資料來進行有效的預測或推薦。 ::: ## 1. Introduction In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). 近年來,大型語言模型(Large Language Models, LLMs)正進行快速的迭代與演化(Anthropic, 2024; Google, 2024; OpenAI, 2024a),逐步縮小了與通用人工智慧(AGI)之間的差距。 Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources against pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. 
However, the challenge of effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models. 近期,在完整的訓練流程中,後訓練處理(post-training)已成為重要的一環。後訓練處理已經被證明可以在推理任務上提高準確性、對齊社會價值觀,並適應用戶偏好,同時其計算資源需求相對於預訓練來說非常低。在推理能力的上下文中,OpenAI 的 o1 系列模型(OpenAI, 2024b)是第一個通過增加「思維鏈(Chain-of-Thought)』推理過程的長度來縮放(scaling)推理時間的模型。這種方法在數學、程式設計和科學推理等各類推理任務上取得了顯著改善。然而,有效的測試時間縮放(testing-time scaling)的挑戰仍是研究社群裡未解的問題。在此背景下,已有多項先前的研究探索了包括基於過程的獎勵(process-based reward)模型(Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023)、強化學習(Kumar et al., 2024)以及搜尋演算法,如Monte Carlo Tree Search和Beam Search(Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024)。不過,這些方法都還是不能能達到 OpenAI o1 系列模型的通用推理效能水平。 :::warning > [name=機器說明] 1. 總結回應: 文本強調了後訓練處理(post-training)在人工智慧中的重要性,特別是在推理能力方面。OpenAI的o1系列模型通過延長「連續思考」的過程時間來提升其推理能力,在多個領域表現出色,但仍存在測試這種方法有效性的挑戰。先前的研究已經嘗試過用不同演算法來解決問題,不過還是無法匹敵o1系列在通用推理方面的表現。 2. 細部重點說明: - **後訓練處理(post-training)**: 在深度學習中,後訓練處理(post-training)指的是對已完成基本訓練的模型進行調整和優化。這些過程可以提高推理任務中的準確性並讓模型更好地與社會價值觀或用戶偏好保持一致,同時相對於預訓練所需的計算資源較少。 - **測試時間縮放(testing-time scaling)**: 儘管o1系列模型在理論上展現出色,但有效地測量和驗證其推理過程的延長是否實際帶來了效能的提升仍然是一個研究挑戰。 ::: In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912. 在這篇論文中,我們開始探索如何通過純粹的強化學習(RL)來提升語言模型的推理能力。我們的目標是探究大規模語言模型(LLMs)在沒有任何監督資料(supervised data)情況下發展推理能力的可能性,重點放在通過純粹的強化學習過程來自我演進。具體來說,我們使用DeepSeek-V3-Base作為基礎模型並採用GRPO(Shao et al., 2024,強化學習框架)來提升模型在推理上的效能。在訓練過程中,DeepSeek-R1-Zero自然而然地展現出多種強大且有趣的推理行為。在經過數千次的RL步驟之後,DeepSeek-R1-Zero在推理基準測試中表現出色。例如,在AIME 2024上,其pass@1 score從15.6%提升到71.0%,如果是搭配多數決(majority voting)的話,則是可以進一步的提升至86.7%,與OpenAI-o1-0912的表現持平。 :::warning > [name=機器說明] 細部重點說明: - **pass@1 score**:指模型第一次嘗試就成功解題的比例 ::: However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. 
Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217. 然而,DeepSeek-R1-Zero 面臨著如可閱讀性低與語言混合等挑戰。為了解決這些問題並進一步的提升推理能力,我們引入了 DeepSeek-R1,其結合少量的冷啟動資料(cold-start data)以及多階段訓練流程(multi-stage training pipeline)。具體來說,我們從收集數千筆冷啟動資料開始,用於微調 DeepSeek-V3-Base 模型。接下來,我們進行推理導向的強化學習(reasoning-oriented RL),類似於 DeepSeek-R1-Zero 的方法。在強化學習過程接近收斂的時候,在強化學習檢查點(RL checkpoint)上透過[棄卻抽樣](https://terms.naer.edu.tw/detail/a90d0278f94118daa2d8a9481137a71a/)的方式建立新的SFT資料,結合DeepSeek-V3中專有領域的監督資料,像是寫作、事實問答與自我認知,再次對 DeepSeek-V3-Base 模型進行重新訓練。在使用新資料微調之後,檢查點(checkpoint)經過另一輪強化學習的過程,考量所有場景的提示。完成以上步驟後,我們獲得了被稱為 DeepSeek-R1 的檢查點,其效能與 OpenAI-o1-1217 相當。 :::warning > [name=機器說明] 1. 總結回應 DeepSeek-R1 模型是一個在推理能力上進行了優化的語言模型,旨在解決其前身 DeepSeek-R1-Zero 面臨的挑戰,如可閱讀性低和語言混合問題。為達到此目標,研究者採用了一種複雜且多階段的訓練策略,包括使用少量冷啟動資料進行初步微調、利用推理導向的強化學習(RL)進一步改善模型效能,並在適當時期採用棄卻抽樣法來建立更多的監督式微調資料。最後,在集成了新資料和再次強化學習的基礎上,生成了 DeepSeek-R1 檢查點,其效能可與 OpenAI-o1-1217 相媲美。 2. 細部重點說明 本文描述了一個高級的深度學習模型訓練過程,涉及多階段技術結合。首先,DeepSeek-R1-Zero 面臨兩大主要挑戰:可閱讀性低和語言混合現象,這對於任何自然語言處理模型來說都是重大的問題。為解決這些問題,引入了 DeepSeek-R1,其訓練過程複雜而精細。 - **冷啟動資料微調**:初步使用數千筆冷啟動資料微調基模型(DeepSeek-V3-Base),這一階段通常被視作為讓模型快速適應新的語言結構或任務設置的方法。 - **強化學習(RL)**:利用推理導向的 RL 來進一步提高模型在複雜推理任務上的表現,類似於 DeepSeek-R1-Zero 的方法。這涉及到通過環境互動來持續調整模型以最大化特定樣本的積累獎勵。 - **棄卻抽樣法和監督式微調**:在 RL 近收斂時,採用棄卻抽樣法建立新的資料集合(Supervised Fine-Tuning Data, SFT),這一過程涉及篩選出最具代表性或最具挑戰性的資料,用於增強模型在特定領域(如寫作、事實問答和自我認知)的表現。 - **融合與再訓練**:這些新資料據與其他監督式資料結合,並再次用於微調模型,隨後模型在考量所有情境的提示下,通過另一輪強化學習的過程,進而獲得進一步改善。 - **效能提升**:最終生成的 DeepSeek-R1 模型不僅解決了原有挑戰,且其效能水平被比擬為 OpenAI-o1-1217 的水準。這表明模型在多個語言和推理任務上達到了高效的水平。 ::: We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on the reasoning benchmarks among dense models. 我們進一步研究從 DeepSeek-R1 蒸餾到更小型的密集模型。以 Qwen2.5-32B(Qwen, 2024b)作為基礎模型,直接從 DeepSeek-R1 蒸餾出來的效果優於在其上應用強化學習(RL)。這證明了更大規模基礎模型所發現到的推理模式對於提升推理能力而言是非常重要的。我們開源蒸餾後的 Qwen 和 Llama 系列模型。值得注意的是,我們蒸餾出的 14B 規模模型大幅度的超越了目前最先進的開源模型 QwQ-32B-Preview(Qwen, 2024a),而蒸餾後的 32B 和 70B 規模模型在密集模型推理基準測試中創下了新紀錄。 :::warning > [name=機器說明] 1. 總結回應: 文本討論了利用大型模型 DeepSeek-R1 對小規模密集模型進行知識蒸餾的過程。基於 Qwen2.5-32B 模型的研究顯示,從更大的模型直接蒸餾出來的效果優於在該基礎上應用強化學習(RL)方法。這證明了大規模模型所發現到的推理模式對提高小模型推理能力來說是非常重要的。 2. 細部重點說明: - **深度學習與知識蒸餾**: 深度學習在大型模型(如 DeepSeek-R1)中通過資料訓練發現了豐富的模式和推理能力。知識蒸餾是一種技術,允許將這些模型所學到的知識轉移到小規模模型上,以便在計算資源受限的條件下達到更好的效能。該文提到直接從大規模模型中蒸餾,優於進一步通過強化學習增強基礎小模型。 - **推理能力**: 在提到密集模型推理基準測試中創下新紀錄時,強調了這些蒸餾後模型在各類典型推理任務上表現出色。這不僅展示了技術突破,也標誌著開源努力如何改變和提升 AI 系統的整體能力。 ::: ## 1.1. Contributions ### Post-Training: Large-Scale Reinforcement Learning on the Base Model - · We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. 
This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area. - 我們直接將強化學習(RL)應用於基本模型中,而不依賴監督微調(SFT)作為預備步驟。這種方法允許模型探索思維鏈(CoT),以解決複雜問題,從而開發出 DeepSeek-R1-Zero。DeepSeek-R1-Zero 表現出自我驗證、反思和產生長篇 CoTs 的能力,這是研究社群的重要里程碑。值得注意的是,它是第一個驗證大型語言模型的推理能力可以單純的通過 RL 而不需要 SFT 來激勵的開源研究。這一突破性成果為這個領域的未來發展鋪平了道路。 - · We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models. - 我們介紹開發 DeepSeek-R1 的流程。該工作流程包含兩個 RL 階段,旨在探索提升推理模式並與人類偏好對齊,以更貼近人類思維方式;同時也有兩個 SFT 階段作為模型推理和非推理能力的種子。我們相信這個流程將有利於行業建置出更優秀的模型。 :::warning > [name=機器說明] 1. 總結回應 本文描述了在深度學習領域的一個創新方法,透過直接應用強化學習(RL)來訓練基礎模型,而不是先使用監督式微調(SFT)。這種方式使得模型DeepSeek-R1-Zero能夠自行驗證、反思和生成長鏈思考(CoT),並展示了進一步研究潛力。另外,介紹了一個包含RL與SFT階段的開發管道,以提升模型的推理能力及適應性。 ::: ### Distillation: Smaller Models Can Be Powerful Too - We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future. - 我們展示了大型模型中的推理能力可以被蒸餾到小型模型中,這樣做比使用強化學習在小型模型中探索出來的推理能力表現更佳。DeepSeek-R1 的開源專案和相關 API 在未來將有助於幫助研究社群蒸餾出更棒的小型模型。 - Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeekR1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous opensource models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community. - 我們對研究社群中已經廣泛使用的多個密集模型利用DeepSeek-R1生成的推理資料進行了微調。評估結果顯示,經過蒸餾處理的小型密集模型在基準測試上表現出色。DeepSeekR1-Distill-Qwen-7B 在 AIME 2024 上取得了 55.5% 的成績,超越了 QwQ-32B-Preview。此外,DeepSeek-R1-Distill-Qwen-32B 在 AIME 2024 上取得了 72.6%,在 MATH-500 上取得了 94.3%,以及在 LiveCodeBench 上取得了 57.2% 的成績。這些結果明顯超越了它們的前身,並與 o1-mini 相當。我們將基於 Qwen2.5 和 Llama3 系列壓縮後的 1.5B、7B、8B、14B、32B 和 70B 檢查點(checkpoints)開源給社群使用。 ## 1.2. Summary of Evaluation Results - Reasoning tasks : (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces outperforming 96.3% human participants in the competition. 
For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real world tasks. - Reasoning tasks :(1) DeepSeek-R1 在 AIME 2024 上的的 Pass@1獲得了 79.8% 分,略高於 OpenAI-o1-1217。在 MATH-500 上,其獲得了驚人的 97.3% 分,與 OpenAI-o1-1217 相當,並大幅超越其它模型。(2) 在程式相關的任務中,DeepSeek-R1 展現出專業水準,在 Codeforces 上取得了 2029 的 Elo 排名,優於比賽中96.3%的人類參賽者。在工程相關任務上,DeepSeek-R1 表現略優於 DeepSeek-V3 ,這對開發者在實際應用中可能會有幫助。 - Knowledge : On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeekR1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses 4o on this benchmark. - Knowledge : 在像是 MMLU、MMLU-Pro 和 GPQA Diamond 這類的基準測試中,DeepSeekR1 獲得了顯著優異的成果,其表現明顯超越了 DeepSeek-V3,分別在 MMLU、MMLU-Pro 和 GPQA Diamond 上獲得了 90.8%、84.0% 和 71.5% 的高分。雖然其表現略遜於 OpenAI-o1-1217 在這些基準測試上的表現,不過 DeepSeek-R1 仍究是碾壓其它的閉源模型,顯示出其在教育任務中的競爭優勢。在事實基準測試 SimpleQA 上,DeepSeek-R1 也能戰勝 DeepSeek-V3,證明了它處理事實問題的能力。類似的趨勢也出現在 OpenAI-o1 在此基準測試中較 4o 為優。 - Others : DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. It achieves an impressive length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard, showcasing its strong ability to intelligently handle non-exam-oriented queries. Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring long-context understanding, substantially outperforming DeepSeek-V3 on long-context benchmarks. - Others : DeepSeek-R1 在廣泛的任務中仍然是表現卓越,包括創意寫作、一般問答、編輯和摘要總結等多種功能。它在 AlpacaEval 2.0 長度控制挑戰中達成了令人印象深刻的 87.6% 勝率,在 ArenaHard 中則獲得了 92.3% 的勝率,展示了其於處理非考試類型問答時的強大智慧能力。此外,DeepSeek-R1 在需要長篇脈絡理解的任務中表現出色,明顯超越了 DeepSeek-V3 在長文內容基準測試上的表現。 ## 2. Approach ## 2.1. Overview Previous work has heavily relied on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of long Chain-of-Thought (CoT) examples. 3) Distill the reasoning capability from DeepSeek-R1 to small dense models. 
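To keep these threads straight before diving into the details, the sketch below restates the training recipes of §2.1-§2.4 as a small Python data structure. It is purely a reading aid: the stage names and descriptions paraphrase the paper, and the structure itself is not any official configuration format.

```python
# Reading aid: the training recipes described in §2.1-§2.4, written out as data.
# Field values paraphrase the paper; this is not an official configuration format.
PIPELINES = {
    "DeepSeek-R1-Zero": [
        {"stage": "RL (GRPO)", "init": "DeepSeek-V3-Base",
         "data": "no SFT data; rule-based accuracy + format rewards"},
    ],
    "DeepSeek-R1": [
        {"stage": "Cold-start SFT", "init": "DeepSeek-V3-Base",
         "data": "thousands of long-CoT examples"},
        {"stage": "Reasoning-oriented RL",
         "data": "math/code/science/logic prompts with rule-based rewards "
                 "plus a language-consistency reward"},
        {"stage": "Rejection-sampling SFT",
         "data": "~600k reasoning + ~200k non-reasoning samples"},
        {"stage": "RL for all scenarios",
         "data": "rule-based rewards + preference reward models "
                 "(helpfulness, harmlessness)"},
    ],
    "Distilled models": [
        {"stage": "SFT only", "init": "Qwen2.5 / Llama3 bases",
         "data": "800k samples generated with DeepSeek-R1"},
    ],
}

for model, stages in PIPELINES.items():
    print(model)
    for i, s in enumerate(stages, 1):
        print(f"  {i}. {s['stage']}: {s['data']}")
```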
過去的研究很大程度上是依賴著大量的監督資料來提升模型效能。在本研究中,我們證明了,通過大規模強化學習(RL),即使沒有使用任何冷啟動的監督微調(SFT)資料,推理能力也可以明顯提升。此外,效能還可以通過包含較少量的冷啟動資料進一步的加強。在後續的章節中,我們提出了:(1) DeepSeek-R1-Zero,它直接將RL應用於基本模型而不使用任何SFT資料;(2) DeepSeek-R1,利用數千個長的思維鏈(CoT)樣本從檢查點(checkpoint)開始微調;以及 (3) 從DeepSeek-R1中蒸餾推理能力到小型密集模型中。 :::warning > [name=機器說明] * 專業解釋:此文本討論了如何通過大規模強化學習(RL)來提高人工智能系統的推理能力,而不完全依賴於監督式微調(SFT)。文中提出了兩種方法:DeepSeek-R1-Zero直接應用RL到基本模型上,而DeepSeek-R1則從含有長連鎖思考(CoT)例子的微調點開始進行。此外,還介紹了將推理能力轉移到小密集模型的方法。 * 專有名詞解釋: * 強化學習(RL):一種機器學習技術,訓練算法通過與環境交互來完成任務,透過嘗試和錯誤學會如何做出最佳決策。 * 監督式微調(SFT):指使用標記好的數據對已有的模型進行再訓練,以改善其在特定任務上的性能。 * Cold Start:在不含初始微調數據的情況下開始訓練或學習的過程。 * DeepSeek-R1-Zero:指沒有使用SFT數據直接應用RL到基本模型上的方法。 * DeepSeek-R1:從含有長連鎖思考(CoT)例子的微調點開始進行強化學習的方法。 連鎖思考(Chain-of-Thought, CoT):一種指導模型通過展示解題步驟來提高推理能力的訓練技術。 ::: ## 2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Shao et al., 2024; Wang et al., 2023). However, these works heavily depended on supervised data, which are time-intensive to gather. In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data , focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with valuable insights. 強化學習在邏輯推理任務中展示了強大的效能,這在我們先前的研究中已經有所證明(Shao et al., 2024; Wang et al., 2023)。然而,這些研究很大程度的地依賴著監督的訓練資料,而這些資料的收集時間成本非常高昂。在這個章節中,我們探討了大語言模型(LLMs)在沒有任何監督式資料的情況下發展推理能力的可能性,專注於它們通過純粹的強化學習過程自主演化。首先,我們提供了有關強化學習算法的簡要概述,接著展示一些令人興奮的成果,希望能夠為社群提供寶貴的見解。 ### 2.2.1. Reinforcement Learning Algorithm **Group Relative Policy Optimization** In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. 
Specifically, for each question $q$, GRPO samples a group of outputs $\{ o_1, o_2, \cdots, o_G \}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$
\begin{align}
\mathcal{J}_{GRPO}(\theta)&= \mathbb{E}[q\sim P(Q),\{o_i\}^G_{i=1}\sim \pi_{\theta_{old}}(O\vert q)] \\
& \frac{1}{G}\sum^G_{i=1}\bigg(\min\bigg(\dfrac{\pi_\theta(o_i\vert q)}{\pi_{\theta_{old}}(o_i\vert q)}A_i,clip\bigg(\dfrac{\pi_\theta(o_i\vert q)}{\pi_{\theta_{old}}(o_i\vert q)}, 1-\epsilon, 1+\epsilon\bigg)A_i \bigg) -\beta \pmb{D}_{KL}(\pi_\theta \Vert \pi_{ref}) \bigg) \tag{1} \\
& \pmb{D}_{KL}(\pi_\theta\Vert\pi_{ref})=\dfrac{\pi_{ref}(o_i\vert q)}{\pi_\theta(o_i\vert q)}-\log\dfrac{\pi_{ref}(o_i\vert q)}{\pi_\theta(o_i\vert q)}-1 \tag{2}
\end{align}
$$

where $\epsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed using a group of rewards $\{ r_1,r_2,\cdots, r_G\}$ corresponding to the outputs within each group:

$$
A_i = \dfrac{r_i - mean(\{r_1,r_2,\cdots, r_G \})}{std(\{r_1,r_2,\cdots,r_G \})} \tag{3}
$$

**Group Relative Policy Optimization** 為了降低強化學習(RL)訓練成本,我們採用Group Relative Policy Optimization (GRPO) (Shao et al., 2024),其放棄了一個通常和策略模型(policy model)相同大小的評估模型(critic model),改用group scores來估測基線。具體來說,對每個問題$q$,GRPO從舊的策略 $\pi_{\theta_{old}}$ 中採樣出一組輸出 $\{ o_1, o_2, \cdots, o_G \}$,然後透過最大化下列目標函數來最佳化策略模型$\pi_\theta$:

$$
\begin{align}
\mathcal{J}_{GRPO}(\theta)&= \mathbb{E}[q\sim P(Q),\{o_i\}^G_{i=1}\sim \pi_{\theta_{old}}(O\vert q)] \\
& \frac{1}{G}\sum^G_{i=1}\bigg(\min\bigg(\dfrac{\pi_\theta(o_i\vert q)}{\pi_{\theta_{old}}(o_i\vert q)}A_i,clip\bigg(\dfrac{\pi_\theta(o_i\vert q)}{\pi_{\theta_{old}}(o_i\vert q)}, 1-\epsilon, 1+\epsilon\bigg)A_i \bigg) -\beta \pmb{D}_{KL}(\pi_\theta \Vert \pi_{ref}) \bigg) \tag{1} \\
& \pmb{D}_{KL}(\pi_\theta\Vert\pi_{ref})=\dfrac{\pi_{ref}(o_i\vert q)}{\pi_\theta(o_i\vert q)}-\log\dfrac{\pi_{ref}(o_i\vert q)}{\pi_\theta(o_i\vert q)}-1 \tag{2}
\end{align}
$$

其中$\epsilon$與$\beta$是超參數,$A_i$則是優勢(advantage),使用每個群組所對應的輸出的rewards $\{ r_1,r_2,\cdots, r_G\}$ 計算而得:

$$
A_i = \dfrac{r_i - mean(\{r_1,r_2,\cdots, r_G \})}{std(\{r_1,r_2,\cdots,r_G \})} \tag{3}
$$

:::warning
>[name=GPT]
>
* $\pi_\theta(o_i|q)$表示在當前策略$\theta$下,給定輸入$q$時生成輸出$o_i$的概率
* $\pi_{\theta_{old}}(o_i|q)$表示在舊策略$\theta_{old}$下,給定相同輸入$q$時生成輸出$o_i$的概率
* $A_i$是輸出$o_i$的相對優勢,通常透過對組內獎勵進行標準化計算
* 函數$\text{clip}(x, 1 - \epsilon, 1 + \epsilon)$用於限制$x$的值在$1 - \epsilon$和$1 + \epsilon$之間,以防止策略更新過大
* $D_{KL}(\pi_\theta \Vert \pi_{ref})$表示當前策略$\pi_\theta$與參考策略$\pi_{ref}$之間的KL散度,用於衡量兩者之間的差異,$\beta$是控制該項影響的超參數
:::

:::warning
>[name=GPT]
GRPO(Group Relative Policy Optimization,群體相對策略優化)是一種強化學習演算法,旨在透過群體輸出的相對優勢來優化策略,特別適用於大型語言模型的微調。與傳統的PPO(Proximal Policy Optimization,近端策略優化)不同,GRPO不需要單獨的價值函數模型,而是利用對同一輸入生成的多個輸出的平均得分作為基線,進行相對優勢的計算。

在GRPO中,對於每個輸入$q$,模型會生成多個輸出$\{o_i\}_{i=1}^G$,並使用獎勵模型對每個輸出進行評分,得到對應的獎勵值$\{r_i\}_{i=1}^G$。這些獎勵值的平均值作為基線,然後計算每個輸出的相對優勢$A_i$,通常透過對組內獎勵進行標準化(如減去平均值後除以標準差)來實現。
:::

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within &lt;think&gt; &lt;/think&gt; and &lt;answer&gt; &lt;/answer&gt; tags, respectively, i.e., &lt;think&gt; reasoning process here &lt;/think&gt; &lt;answer&gt; answer here &lt;/answer&gt;. User: prompt. Assistant:

Table 1 | Template for DeepSeek-R1-Zero.
prompt will be replaced with the specific reasoning question during training. ## 2.2.2. Reward Modeling The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards: 獎勵(reward)是訓練信號的來源,決定強化學習(RL)最佳化的方向。為了訓練DeepSeek-R1-Zero,我們採用了一種基於規則的獎勵系統,主要包括兩類的獎勵: - Accuracy rewards : The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases. - Accuracy rewards:正確性獎勵模型用於評估響應是否正確。舉例來說,在涉及決定性結果的數學問題中,模型必須以指定格式(如方框內)提供最終的答案,以便通過可靠的規則驗證其正確性。同樣地,在 LeetCode 問題中,編譯器可以根據預先定義的測試用例生成反饋。 - Format rewards : In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between '<think>' and '</think>' tags. - Format rewards:除了正確性獎勵模型外,我們還使用格式獎勵模型來要求模型將其思考過程放在「<think>」與「</think>」標籤之間。 We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline. 我們在開發 DeepSeek-R1-Zero 時沒有採用neural reward model的結果或處理過程,因為我們發現到neural reward model可能會在大規模強化學習過程中出現所謂的「Reward Hacking」的問題。重新訓練神經奬勵模型需要額外的訓練資源,這會使得整個訓練流程變得更加複雜。 ## 2.2.3. Training Template To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As depicted in Table 1, this template requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our constraints to this structural format, avoiding any content-specific biases-such as mandating reflective reasoning or promoting particular problem-solving strategies-to ensure that we can accurately observe the model's natural progression during the RL process. 為了訓練 DeepSeek-R1-Zero,我們從設計一個簡單的模板開始,指導基礎模型按照我們指定的指令操作。如Table 1所示,這個模板要求 DeepSeek-R1-Zero 先生成推理過程,然後提供最終答案。我們故意限制在這種結構格式中,避免任何對內容的偏見—像是,強制進行反思推理或提示特定的問題解决策略—以便能夠準確地觀察模型在強化學習過程中自然發展的情況。 ## 2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero **Performance of DeepSeek-R1-Zero** Figure 2 depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the RL training advances. Notably, the average pass@1 score on AIME 2024 shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels comparable to OpenAI-o1-0912. This significant improvement highlights the efficacy of our RL algorithm in optimizing the model's performance over time. 
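Before turning to the results, the following sketch ties §2.2.1 and §2.2.2 together numerically: it scores a group of sampled outputs with toy rule-based rewards, normalizes them into the advantages of Eq. (3), and evaluates the per-output clipped term of Eq. (1) with the KL estimator of Eq. (2). The probability ratios, the $\epsilon$/$\beta$ values, and the boxed-answer check are illustrative assumptions, not the training code used for DeepSeek-R1-Zero.

```python
import math
import re
from statistics import mean, pstdev

def format_reward(text: str) -> float:
    """Format reward (§2.2.2): 1 if the reasoning is wrapped in <think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", text, re.DOTALL) else 0.0

def accuracy_reward(text: str, reference: str) -> float:
    """Accuracy reward (§2.2.2), toy version: exact match on a boxed final answer."""
    m = re.search(r"\\boxed\{(.+?)\}", text)
    return 1.0 if m and m.group(1).strip() == reference else 0.0

def group_advantages(rewards):
    """Eq. (3): normalize each reward by the group mean and standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1e-8
    return [(r - mu) / sigma for r in rewards]

def grpo_term(ratio, ref_ratio, advantage, eps=0.2, beta=0.01):
    """One output's contribution to Eq. (1).

    ratio     = pi_theta(o|q) / pi_theta_old(o|q)
    ref_ratio = pi_ref(o|q)   / pi_theta(o|q)      (input to the Eq. (2) estimator)
    eps and beta are the clipping / KL hyper-parameters; the defaults are guesses.
    """
    clipped = min(ratio * advantage,
                  max(min(ratio, 1 + eps), 1 - eps) * advantage)
    kl = ref_ratio - math.log(ref_ratio) - 1        # Eq. (2)
    return clipped - beta * kl

# A group of G = 3 sampled answers to "What is 2 + 2?" (reference answer: "4").
outputs = [
    "<think>2 + 2 = 4</think> The answer is \\boxed{4}.",
    "<think>maybe 5?</think> The answer is \\boxed{5}.",
    "The answer is \\boxed{4}.",                    # correct but missing the tags
]
rewards = [accuracy_reward(o, "4") + format_reward(o) for o in outputs]
advantages = group_advantages(rewards)
ratios = [1.05, 0.90, 1.30]      # made-up pi_theta / pi_theta_old ratios
ref_ratios = [0.98, 1.02, 0.95]  # made-up pi_ref / pi_theta ratios

objective = mean(grpo_term(r, rr, a)
                 for r, rr, a in zip(ratios, ref_ratios, advantages))
print(rewards, [round(a, 2) for a in advantages], round(objective, 3))
```

Note how the group itself supplies the baseline: each output is only rewarded relative to its siblings, which is what lets GRPO drop the separate critic model.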
**Performance of DeepSeek-R1-Zero** Figure 2描述了在 AIME 2024 評估基準上,隨著 RL 訓練的進行 DeepSeek-R1-Zero 的效能軌跡。如圖所示,DeepSeek-R1-Zero 在強化學習訓練中表現出穩步地提高的效能。特別是在 AIME 2024 上的average pass@1 score明顯增加,從初始的 15.6% 躍升至驚人的 71.0%,達到與 OpenAI-o1-0912 相當的性能水準。這一顯著改善突顯出我們強化學習演算法隨著時間於效能提升方面的有效性。 Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI's o1-0912 models across a variety of reasoning-related benchmarks. The findings reveal that RL empowers DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model's ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero's performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without majority voting, highlights its strong foundational capabilities and its potential for further advancements in reasoning tasks. Table 2給出了DeepSeek-R1-Zero與OpenAI o1-0912模型在多種推理相關基準測試上的比較分析。研究結果顯示,強化學習(RL)讓DeepSeek-R1-Zero擁有魯棒棒的推理能力,而不需依賴於任何監督微調的資料。這是一項值得關注的成就,因為它凸顯出模型通過單獨的強化學習所能實現的學習和泛化的能力。此外,DeepSeekR1-Zero的效能還可以通過多數決(majority voting)的方式進一步的提升。舉例來說,在AIME基準測試上採用多數決後,DeepSeek-R1-Zero的表現從71.0%飆升至86.7%,超越了OpenAI-o1-0912的表現。無論是否採用多數決,DeepSeek-R1-Zero都展現出具有競爭性的表現能力,並顯示出在推理任務中的進一步提升的潛力。 | Model | AIME 2024 | AIME 2024 | MATH-500 | GPQA Diamond | LiveCode Bench | CodeForces | |------------------|-------------|-------------|------------|----------------|------------------|--------------| | | pass@1 | cons@64 | pass@1 | pass@1 | pass@1 | rating | | OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 | | OpenAI-o1-0912 | 74.4 | 83.3 | 94.8 | 77.3 | 63.4 | 1843 | | DeepSeek-R1-Zero | 71.0 | 86.7 | 95.9 | 73.3 | 50.0 | 1444 | Table 2 | Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks. ![image](https://hackmd.io/_uploads/rJLPl4jK1e.png) Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation. Figure 2 | DeepSeek-R1-Zero 在訓練期間的 AIME 正確率。對於每個問題,我們隨機選擇 16 個回答並計算平均準確率以保證穩定評估。 **Self-evolution Process of DeepSeek-R1-Zero** The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously. By initiating RL directly from the base model, we can closely monitor the model's progression without the influence of the supervised fine-tuning stage. This approach provides a clear view of how the model evolves over time, particularly in terms of its ability to handle complex reasoning tasks. **Self-evolution Process of DeepSeek-R1-Zero** DeepSeek-R1-Zero 的自我演化過程超吸睛,展現了強化學習(RL)如何驅動模型自主提升其推理的能力。透過直接於基礎模型使用 強化學習,我們可以仔細地監控模型在沒有受監督微調階段影響的情況下的發展過程。這種方法提供了一種清晰的視角來看模型如何隨著時間演變,特別是在處理複雜推理任務時的能力。 As depicted in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. This improvement is not the result of external adjustments but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. 
This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth. 如Figure 3所示,DeepSeek-R1-Zero在訓練過程中的思考時間展現出持續改善的趨勢。這種提升並非來自外部的調整,而是模型內在的發展結果。DeepSeek-R1-Zero很自然地獲得了解決越來越複雜的推理任務的能力(透過利用擴展測試時間的計算能力)。此計算涵蓋生成數百到數千個reasoning tokens的範圍,使得模型能夠更深入地探索和精煉其思考過程。 ![image](https://hackmd.io/_uploads/rkGIv4itJx.png) Figure 3 | The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time. Figure 3 | 在強化學習過程中,DeepSeek-R1-Zero在訓練集上的平均輸出長度。DeepSeek-R1-Zero很自然地學會通過提供更多思考時間來解決推理任務。 One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection-where the model revisits and reevaluates its previous steps-and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model's interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero's reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy. 這種自我演化中最讓人驚豔的一點就是,隨著測試時間計算的增加而出現了更為複雜的行為。例如,反思——即模型重新檢視和重新評估其之前的步驟——以及在問題解決中探索替代方案等行為自發性地出現了。這些行為並非事先預設好的,而是源於模型與強化學習環境中互動的結果。這自發性的發展大幅提升了DeepSeek-R1-Zero的推理能力,使其能夠更有效和更準確的處理更具挑戰性的任務。 **Aha Moment of DeepSeek-R1-Zero** Aparticularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an 'aha moment'. This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes. **Aha Moment of DeepSeek-R1-Zero** 在DeepSeek-R1-Zero的訓練過程中,出現了一個有趣的現象——「aha moment」。正如Table 3所描述,這種情況發生於模型的某個中間版本中。在這個階段,DeepSeek-R1-Zero學會了對一個問題花費更多的思考時間,通過重新評估其初始方法來實現這一點。這種行為不僅展示了模型推理能力的增強,也是一個引人注目的案例,說明如何通過強化學習達到意想不到且複雜的結果。 This moment is not only an 'aha moment' for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The 'aha moment' serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future. 這一刻不僅是模型的「aha moment」,也是觀察到此一行為的研究人員們所感受到的重要時刻。它強調了強化學習方法的強大與美妙:我們不需要很明確地教導模型如何解決問題,只需提供適當的激勵措施便能讓它自主地發展出高階的問題解決策略。這個「aha moment」是一個強而有力的提示,說明著强化學習有著在人工系統中解鎖新的智慧層級的潛力,為未來更自主和自我調整的模型鋪平道路。 ![image](https://hackmd.io/_uploads/r185JM3K1l.png) Table 3 | An interesting 'aha moment' of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning. 
Table 3 | DeepSeek-R1-Zero中一個引人注目的「aha moment」。模型學會用類似人類的方式重新思考問題。這對我們來說也是一個aha moment,讓我們見證了強化學習的強大與美妙之處。 **Drawback of DeepSeek-R1-Zero** Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data. **Drawback of DeepSeek-R1-Zero** DeepSeek-R1-Zero 雖然展現出強大的推理能力並自主形成出人意料的超強推理行為,但也面臨著許多問題。例如,在面對如可讀性不佳和語言混雜等挑戰時遇到困難。為了使推理過程更具有可讀性並與開放社群分享,我們探索出 DeepSeek-R1,這是一種搭配人類易於理解的冷啟動資料的強化學習方法。 ## 2.3. DeepSeek-R1: Reinforcement Learning with Cold Start Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The pipeline consists of four stages, outlined as follows. 受DeepSeek-R1-Zero令人鼓舞的成果啟發,兩個問題自然而然的浮現:1) 是否可以通過加入一些少量的高品質資料來進行冷啟動,進一步提升推理效能或促使模型更快收斂?2) 如何訓練一個易於使用的模型,它不僅生成清晰且連貫的思維鏈(CoT),還展示出超強的通用能力?為了解決這些問題,我們設計了一個DeepSeek-R1的訓練流程。該流程包括四個階段,如下所述: ### 2.3.1. Cold Start Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators. 與DeepSeek-R1-Zero不同,為了防止強化學習利用基礎模型訓練的初期出現不穩定的冷啟動階段,我們建構並收集少量的長思維鏈資料(long CoT data )來微調模型,做為初始的強化學習的actor。為了搜集這類的資料,我們探索了多種方法:使用搭配長思維鏈(long CoT)的提示範例(few-shot prompting)做為樣本、直接提示模型生成包含反思和驗證的詳細答案,以可閱讀的格式收集DeepSeek-R1-Zero的輸出,最後通過人工標註人員進行後處理以獲得更好的結果。 :::warning 根據論文,這邊的基礎模型指的是DeepSeek-V3-Base ::: In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data include: 在這個研究中,我們收集了數千筆冷啟動資料來微調 DeepSeek-V3-Base 作為強化學習 (RL) 的起點。相較於 DeepSeek-R1-Zero,使用冷啟動資料的優勢包含: - Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as |special\_token|&lt;reasoning\_process&gt;|special\_token|&lt;summary&gt;, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results. 
- Readability:DeepSeek-R1-Zero 的一個關鍵限制在於其內容通常不適合人類閱讀。回應可能會混雜多種語言,或缺乏 Markdown 格式來強調給予使用者的答案。相比之下,在為 DeepSeek-R1 的建立冷啟動資料時,我們就設計了一種可讀性的模式,這個模式會在每個回應的結尾包含摘要,並過濾掉不易閱讀的回應。這裡,我們定義輸出格式為 |special_token|\<reasoning process\>|special_token|\<summary\>,其中 Reasoning Process 是對於問題的 CoT,而 summary 則是用來總結推理結果。 - Potential: By carefully designing the pattern for cold-start data with human priors, we observe better performance against DeepSeek-R1-Zero. We believe the iterative training is a better way for reasoning models. - Potential:透過精細設計以人類先驗為基礎的冷啟動資料的模式,我們觀察到相對於DeepSeek-R1-Zero而言有著更好的效能。我們相信,對於推理模型來說,迭代訓練是一種更好的方式。 ### 2.3.2. Reasoning-oriented Reinforcement Learning After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses on enhancing the model's reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions. During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model's performance, this reward aligns with human preferences, making it more readable. Finally, we combine the accuracy of reasoning tasks and the reward for language consistency by directly summing them to form the final reward. We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks. 在以冷啟動資料對 DeepSeek-V3-Base 進行微調之後,我們用了與 DeepSeek-R1-Zero 相同的大規模強化學習訓練流程。這個階段專注於提升模型的推理能力,特別是那些需要高度推理能力的任務,如寫程式、數學、科學和邏輯推理等領域,這些都涉及有明確問題與解決方案的情境。在訓練過程中,我們發現 CoT 通常會表現出語言混用的問題,尤其是在RL prompts涉及多語言的情況下。為了解決語言混用的問題,我們在強化學習訓練過程中引入了一個語言一致性的獎勵機制,這是根據 CoT 中目標語言單詞的比例來計算的。儘管消融實驗(ablation experiment)顯示這種對齊可能會稍微降低模型的整體效能,但這個獎勵與人類偏好一致,使得輸出更易於閱讀。最後,我們通過直接將推理任務的準確性和語言一致性獎勵相加來形成最終的獎勵函數。隨後,我們在微調過的模型上進行 RL 訓練,一直到模型在推理任務上達到收斂。 ## 2.3.3. Rejection Sampling and Supervised Fine-Tuning When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below. 當推理導向的強化學習收斂時,我們會使用生成的模型檢查點來收集下一輪訓練要用的監督微調(SFT,Supervised Fine-Tuning)資料。與最初的冷啟動資料不同的地方在於——後者主要針對推理——這個階段整合了來自其它領域的資料,以增強模型在寫作、角色扮演與通用任務上的能力。具體而言,我們生成資料並微調模型,如下所述。 **Reasoning data** We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chain-of-thought with mixed languages, long parapraphs, and code blocks. 
For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning related training samples. **Reasoning data** 我們透過從上面所說的強化學習訓練的檢查點中執行棄卻抽樣來策劃推理提示詞並生成推理的軌跡。在上一階段中,我們就只包含可以使用規則性獎勵來進行評估的資料。然而,在這個階段中,我們透過合併額外的資料擴展了資料集,其中一些資料就使用生成獎勵模型,將實際資料與模型預測餵到DeepSeek-V3中進行評估。此外,由於模型輸出有時會混七八糟且難以閱讀,我們濾掉了包含多種語言混合、過長段落和程式碼區塊的思維鏈。針對每個提示,我們對多個響應做了採樣,只保留正確的響應。總的來看,我們收集了大約600k個推理相關的訓練樣本。 :::warning > [name=機器說明] 專有名詞解釋: - **Rejection Sampling**:一種用來從某個目標分佈中抽取樣本的方法,通常在無法直接進行樣本生成時使用。 - **Checkpoint**:在計算機科學中,是指保存訓練過程中某一階段模型參數的點。 - **Generative Reward Model**:使用生成模型來評估預測輸出的正確性或質量。 ::: :::warning 問題:600k的數量不少,怎麼判定正確與否? ::: **Non-Reasoning data** For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting. However, for simpler queries, such as 'hello' we do not provide a CoT in response. In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning. **Non-Reasoning data** 對於非推理的資料,像是寫作、事實問答、自我認知和翻譯的部份,我們採用 DeepSeek-V3 pipeline,並重複利用部分 DeepSeek-V3 的 SFT 資料集。在某些非推理任務中,我們會在回答問題的答案之前,利用提示詞呼叫 DeepSeek-V3 來生成一個可能的思維鏈。不過,在更簡單的詢問,像是「hello」時,我們就不會在響應中提供思維鏈。最終,我們收集了約 200k 個與推理無關的訓練樣本。 We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples. 我們使用上面精心策劃約80萬筆的資料集來對DeepSeek-V3-Base進行了兩輪的微調訓練。 ## 2.3.4. Reinforcement Learning for all Scenarios To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness. 為了更好的讓模型與人類偏好保持一致,我們執行了secondary reinforcement learning stage,旨在提升模型的有用性和安全性,同時精進其推理能力。具體來說,我們利用結合獎勵信號與多樣化的提示分佈(diverse prompt distributions)來訓練模型。在推理資料方面,我們遵循 DeepSeek-R1-Zero 的方法,其使用基於規則的獎勵來引導數學、程式碼和邏輯推理領域的學習過程。在一般資料方面,我們依靠獎勵模型來捕捉在複雜和微妙的情境中的人類偏好。我們建立在 DeepSeek-V3 pipeline 之上,並採用類似的偏好對(preference pairs)分佈及訓練提示詞。 在提升有用性方面,我們就只專注在最後的摘要,確保評估重點在於回應對使用者的實用性和相關性,同時盡量最小化干擾基礎的推理過程。在評估安全性方面,我們評估模型的整體回應,包括推理過程和總結,以識別並減少生成過程中可能出現的任何風險、偏見或有害內容。最終,結合獎勵信號和多樣化資料分佈,讓我們能夠訓練一個在推理方面表現優異並同時優先考舉有用性和安全性的模型。 ## 2.4. 
Distillation: Empower Small Models with Reasoning Capability To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3. Our findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.514B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its reasoning capability is slightly better than that of Llama-3.1. 為了使更高效的小型模型具備類似DeepSeek-R1的推理能力,我們直接微調開源模型,如Qwen(Qwen, 2024b)和Llama(AI@Meta, 2024),使用了800k樣本數據集,該資料的取得如§2.3.3所述。我們的研究結果指出,這種直接的蒸餾方法明顯提升了小型模型的推理能力。這邊我們所用的基礎模型包括Qwen2.5-Math-1.5B、Qwen2.5-Math-7B、Qwen2.514B、Qwen2.5-32B、Llama-3.1-8B和Llama-3.3-70B-Instruct。我們選擇了Llama-3.3,因為其推理能力略優於Llama-3.1。 For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community. 對於蒸餾的模型,我們單純的利用 SFT 資料集,且未包含 RL 階段,儘管結合 RL 可能會大大地提高模型效能就是。在這裡,我們的主要目標是說明壓縮技術的效果,而 RL 階段的探討就留給更廣泛的研究社群去吧。 ## 3. Experiment **Benchmarks** We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI,2024d), Aider 1 , LiveCodeBench (Jain et al., 2024) (2024-08 - 2025-01), Codeforces 2 , Chinese National High School Mathematics Olympiad (CNMO 2024) 3 , and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we only feed the final summary to evaluation to avoid the length bias. For distilled models, we report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench. 
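Looping back to §2.4 for a moment before the experimental setup: distillation there is plain supervised fine-tuning on the 800k DeepSeek-R1-generated samples, with no RL stage. The sketch below shows one way such an SFT step might look; the student checkpoint, the single toy sample, and the choice to mask the prompt tokens out of the loss are all assumptions for illustration, not the authors' released training code.

```python
# A sketch of SFT-style distillation on teacher-generated reasoning traces.
# The student checkpoint and the toy sample are placeholders standing in for
# the ~800k curated samples; masking the prompt with -100 is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student = "Qwen/Qwen2.5-Math-1.5B"                      # assumed student base
tok = AutoTokenizer.from_pretrained(student)
model = AutoModelForCausalLM.from_pretrained(student)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def build_example(prompt: str, teacher_response: str):
    """Concatenate prompt + teacher response; label prompt tokens with -100 so
    the cross-entropy loss only covers the teacher's reasoning and answer."""
    p_ids = tok(prompt, return_tensors="pt").input_ids
    r_ids = tok(teacher_response + (tok.eos_token or ""),
                return_tensors="pt").input_ids
    input_ids = torch.cat([p_ids, r_ids], dim=1)
    labels = torch.cat([torch.full_like(p_ids, -100), r_ids], dim=1)
    return input_ids, labels

# One toy (prompt, teacher CoT) pair standing in for a DeepSeek-R1-generated sample.
input_ids, labels = build_example(
    "Solve: 1 + 1 = ?",
    "<think>1 plus 1 equals 2.</think> The answer is 2.",
)

model.train()
loss = model(input_ids=input_ids, labels=labels).loss   # standard causal-LM loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```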
**Benchmarks** 我們在多個基準測試中評估模型的表現,包括 MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI,2024d), Aider 1 , LiveCodeBench (Jain et al., 2024) (2024-08 - 2025-01), Codeforces 2 , Chinese National High School Mathematics Olympiad (CNMO 2024) 3 , and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024)。除了標準的基準測試外,我們也在開放式生成任務中進行模型的評估,使用大語言模型(LLMs)作為裁判。具體來說,我們遵循 AlpacaEval 2.0(Dubois et al., 2024)和 Arena-Hard(Li et al., 2024)的原始配置,利用 GPT-4-Turbo-1106 進行對比評判。在這一過程中,我們僅將最終摘要用於評估,以避免因篇幅不同而產生的偏差。對於蒸餾後的模型,我們在這些基準測試中的 AIME 2024、MATH-500、GPQA Diamond、Codeforces 和 LiveCodeBench 上報告了代表性結果。 **Evaluation Prompts** Following the setup in DeepSeek-V3, standard benchmarks such as MMLU, DROP, GPQA Diamond, and SimpleQA are evaluated using prompts from the simpleevals framework. For MMLU-Redux, we adopt the Zero-Eval prompt format (Lin, 2024) in a zero-shot setting. In terms of MMLU-Pro, C-Eval and CLUE-WSC, since the original prompts are few-shot, we slightly modify the prompt to the zero-shot setting. The CoT in few-shot may hurt the performance of DeepSeek-R1. Other datasets follow their original evaluation protocols with default prompts provided by their creators. For code and math benchmarks, the HumanEval-Mul dataset covers eight mainstream programming languages (Python, Java, C++, C#, JavaScript, TypeScript, PHP, and Bash). Model performance on LiveCodeBench is evaluated using CoT format, with data collected between August 2024 and January 2025. The Codeforces dataset is evaluated using problems from 10 Div.2 contests along with expert-crafted test cases, after which the expected ratings and percentages of competitors are calculated. SWE-Bench verified results are obtained via the agentless framework (Xia et al., 2024). AIDER-related benchmarks are measured using a "diff" format. DeepSeek-R1 outputs are capped at a maximum of 32,768 tokens for each benchmark. 
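To make the generation side of this setup concrete, here is a minimal sketch of sampling capped responses with vLLM. The checkpoint, prompt, and number of samples are placeholders; the 32,768-token cap comes from the paragraph above, and the temperature/top-p values anticipate the Evaluation Setup described next. This is an illustration, not the authors' evaluation harness.

```python
# Illustrative only: sample k capped responses per question for pass@k-style
# evaluation. The model ID, prompt, and k are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")  # assumed checkpoint
params = SamplingParams(
    n=16,               # k responses per question (4 to 64 in the paper)
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,   # output cap used for every benchmark
)

question = "What is the sum of the first 100 positive integers?"  # toy prompt
outputs = llm.generate([question], params)
responses = [o.text for o in outputs[0].outputs]
print(len(responses), responses[0][:200])
```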
**Evaluation Prompts** 依循 DeepSeek-V3 的設置,標準基準測試如 MMLU、DROP、GPQA Diamond 和 SimpleQA 會使用 simpleevals 框架中的提示詞進行評估。對於 MMLU-Redux,我們採用無樣本設置(zero-shot setting)中的Zero-Eval prompt format(Lin, 2024)。至於 MMLU-Pro、C-Eval 和 CLUE-WSC,由於原始提示是少量樣本(few-shot)的,我們稍微修改了提示以適應zero-shot的設置。在 few-shot 中的 CoT 可能會傷害到 DeepSeek-R1 的效能。其他資料集則依循其原始的評估協議,使用其作者所提供的預設提示詞。 對於程式碼和數學基準測試,HumanEval-Mul 資料集涵蓋了八種主流程式語言(Python、Java、C++、C#、JavaScript、TypeScript、PHP 和 Bash)。模型在 LiveCodeBench 的效能評估是使用 CoT format,其資料是於 2024 年 8 月至 2025 年 1 月之間所收集。Codeforces 資料集則使用來自 10 個 Div.2 競賽的問題以及精心製作的測試案例進行評估,然後計算預期的平均排名和比賽中參與者的百分比。SWE-Bench 的驗證結果為通過 agentless framework (Xia et al., 2024)所取得。AIDER 相關的基準測試使用 "diff" 格式進行量測。DeepSeek-R1 的輸出在每個基準測試中被限制在最大 32,768 個 token。 :::warning > [name=機器說明] 專有名詞解釋: - MMLU, DROP, GPQA Diamond, SimpleQA:各自是特定的評估任務或測試集,在此背景下用於量化模型性能。 - simpleevals framework:一個用於生成評估語句的框架,具體定義不清楚。 - MMLU-Redux:可能是基於MMLU(Massive Multitask Language Understanding)測試集的變種或改進版本。 - Zero-Eval prompt format (Lin, 2024):一種在零對抗設置下使用的評估語句格式,由Lin於2024年提出。 - MMLU-Pro, C-Eval, CLUE-WSC:各自代表不同的評估任務或測試集。 - CoT (Chain of Thought):指模型在解決問題時展示推理過程的方法,通常用於改善性能和可信度。 - HumanEval-Mul:涵蓋多種主流程式語言的評估數據集。 - LiveCodeBench, Codeforces:特定類型的程式碼測試平台。 - SWE-Bench:可能是指某個軟體工程測試集,具體不清楚。 - agentless framework (Xia et al., 2024):一種無代理系統或框架,由Xia等人於2024年提出,用於驗證SWE-Bench測試。 - AIDER-related benchmarks:可能指與特定AI任務相關的基準測試。 ::: **Baselines** We conduct comprehensive evaluations against several strong baselines, including DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217. Since accessing the OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official reports. For distilled models, we also compare the open-source model QwQ-32B-Preview (Qwen, 2024a). **Baselines** 我們對多種強大的基準進行全面性的評估,包括 DeepSeek-V3、Claude-Sonnet-3.5-1022、GPT-4o-0513、OpenAI-o1-mini 和 OpenAI-o1-1217。由於在中國大陸訪問 OpenAI-o1-1217 API 呈現挑戰,我們基於官方公佈的資料做為呈現其表現的依據。對於經過蒸餾過的模型,我們也比較了開源模型 Qwen 32B 預覽版本 (QwQ-32B-Preview, Qwen, 2024a)。 **Evaluation Setup** We set the maximum generation length to 32,768 tokens for the models. We found that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates and significant variability across different checkpoints. Therefore, we default to pass@$𝑘$ evaluation (Chen et al., 2021) and report pass@1 using a non-zero temperature. Specifically, we use a sampling temperature of 0.6 and a top-$𝑝$ value of 0.95 to generate $k$ responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1 is then calculated as **Evaluation Setup** 我們將模型的最大生成長度設定為 32,768 個 token。我們發現使用greedy decoding來評估長度較長的推理模型時,會導致更高的重複率,且不同檢查點之間的明顯的差異。因此,我們預設採用 pass@$𝑘$ 來評估(Chen et al., 2021)方法,並報告使用non-zero temperature下的 pass@1 的結果。具體來說,我們使用sampling temperature為 0.6 和 top-$𝑝$ 值為 0.95 生成每個問題的 $k$ 個回應(通常介於 4 到 64 之間,取決於測試集的大小)。Pass@1的計算如下: $$ \text{pass}@1 = \dfrac{1}{k}\sum^k_{i=1}p_i $$ :::warning > [name=機器說明] 專有名詞解釋: - Greedy Decoding: 算法在生成序列時每次選擇最高概率的單元(如字或詞)作為下一步,可能導致重複且缺乏創造性。 - Pass@ $𝑘$ Evaluation: 這是一種測量模型在給定問題上的表現方式。通過多次生成不同答案來計算至少一個正確答案(Pass)的概率。 - Sampling Temperature: 控制針對生成分佈進行隨機取樣時的溫度,更高的溫度會增加多樣性但減少可預測性。 - Top-$p$ (Top-$p$ Sampling): 只從累積概率至少為 $p$ 的前面部分候選單元中進行抽樣,以控制生成的多樣性和可預測性。 ::: where $p_i$ denotes the correctness of the $𝑖$-th response. This method provides more reliable performance estimates. For AIME 2024, we also report consensus (majority vote) results (Wang et al., 2022) using 64 samples, denoted as cons@64. 
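A tiny worked version of these two metrics is given below. It assumes each sampled response has already been graded and that a final answer string has been extracted for voting; both helpers are stand-ins rather than the paper's evaluation code.

```python
from collections import Counter

def pass_at_1(correct_flags):
    """pass@1 as defined above: the mean correctness p_i over the k samples."""
    return sum(correct_flags) / len(correct_flags)

def cons_at_k(predicted_answers, reference):
    """cons@64-style majority voting: the most common answer wins the vote."""
    majority, _ = Counter(predicted_answers).most_common(1)[0]
    return 1.0 if majority == reference else 0.0

# Toy example: k = 8 sampled answers to one AIME-style question (reference "204").
answers = ["204", "210", "204", "204", "198", "204", "210", "204"]
flags = [a == "204" for a in answers]

print(pass_at_1(flags))            # 0.625 -> averaged over questions for the table
print(cons_at_k(answers, "204"))   # 1.0   -> the majority vote recovers the answer
```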
其中$p_i$表示第$i$個回應的正確性。這種方法提供更可靠的效能估計。AIME 2024的部份,我們還報告了投票達成一致的結果(多數決),使用64個樣本來計算,以cons@64表示(Wang et al., 2022)。 ### 3.1. DeepSeek-R1 Evaluation ![image](https://hackmd.io/_uploads/rkZNLOeqkg.png) For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This improvement is primarily attributed to enhanced accuracy in STEM-related questions, where significant gains are achieved through large-scale reinforcement learning. Additionally, DeepSeek-R1 excels on FRAMES, a long-context-dependent QA task, showcasing its strong document analysis capabilities. This highlights the potential of reasoning models in AI-driven search and data analysis tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark. However, DeepSeek-R1 performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due to its tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1 could achieve an accuracy of over 70%. 對於如MMLU、MMLU-Pro和GPQA Diamond等以教育為導向的知識基準測試,DeepSeek-R1對比DeepSeek-V3,展現出其不凡。這種提升主要歸功於STEM相關問題準確度的提升,通過大規模的强化學習取得了顯著的成果。除此之外,DeepSeek-R1在FRAMES上——一個長文本依賴的QA任務中——的表現也是非常出色,顯示其優秀的文件分析能力。這強調了AI驅動搜尋和資料分析任務中推理模型的潛力。 在事實查核基準測試SimpleQA上,DeepSeek-R1也超越了DeepSeek-V3,展示其處理事實查核的能力。同樣地,OpenAI-o1也超越了GPT-4o。然而,在中文SimpleQA基準測試上,DeepSeek-R1比DeepSeek-V3表現差,這主要是因為在safety RL後,它有時會拒絕回答某些問題。在不使用safety RL時,DeepSeek-R1可以達到超過70%的準確度。 DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a model's ability to follow format instructions. These improvements can be linked to the inclusion of instruction-following data during the final stages of supervised fine-tuning (SFT) and RL training. Furthermore, remarkable performance is observed on AlpacaEval2.0 and ArenaHard, indicating DeepSeek-R1's strengths in writing tasks and open-domain question answering. Its significant outperformance of DeepSeek-V3 underscores the generalization benefits of large-scale RL, which not only boosts reasoning capabilities but also improves performance across diverse domains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an average of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying its robustness across multiple tasks. DeepSeek-R1 在 IF-Eval 上也展現出優異的成效,IF-Eval 是一個設計用來衡量模型遵循格式指令能力的基準測試。這些改進可能與supervised fine-tuning (SFT)和强化學習訓練的最後階段納入指令依循資料有關。此外,DeepSeek-R1 在 AlpacaEval2.0 和 ArenaHard 上表現出色,這顯示出 DeepSeek-R1 在寫作任務及開放領域問答中的優異表現。其明顯超越 DeepSeek-V3 突顯出大規模強化學習的泛化優勢,不僅提升了推理能力,還改善了跨多個領域的表現。此外,在 ArenaHard 上 DeepSeek-R1 生成的摘要長度為平均 689 個字節,在 AlpacaEval 2.0 上為平均 2,218 個字符,這表明 DeepSeek-R1 在 GPT 型態評估中避免引入長度偏見,從而在多個任務上更加穩健。 On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217, surpassing other models by a large margin. A similar trend is observed on coding algorithm tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1 on Aider but achieves comparable performance on SWE Verified. We believe the engineering performance of DeepSeek-R1 will improve in the next version, as the amount of related RL training data currently remains very limited. 
在數學任務中,DeepSeek-R1 的表現與 OpenAI-o1-1217 相當,並且碾壓其它模型。這一趨勢也出現在程式演算法任務上,例如 LiveCodeBench 和 Codeforces,其中專注於推理的模型佔據了這些評估標準。在工程導向的程式任務中,OpenAI-o1-1217 在 Aider 中表現優於 DeepSeek-R1,但在 SWE Verified 上兩者的性能相近。我們認為 DeepSeek-R1 的工程表現將在下一版本中有所提升,因為目前與此類主題相關的 RL 訓練資料量仍然非常有限。 ### 3.2. Distilled Model Evaluation ![image](https://hackmd.io/_uploads/HJTRsdxqye.png) Table 5 | Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks. As shown in Table 5, simply distilling DeepSeek-R1's outputs enables the efficient DeepSeekR1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform nonreasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32BPreview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here. 如Table 5示,僅通過蒸餾DeepSeek-R1的輸出即可讓效率更高的DeepSeekR1-7B(即DeepSeek-R1-Distill-Qwen-7B,以下簡稱)在所有方面都超越非推理模型如GPT-4o-0513。DeepSeek-R1-14B在所有評估指標上超過QwQ-32BPreview,而DeepSeek-R1-32B和DeepSeekR1-70B則在大多數基準測試中顯著超越o1-mini。這些結果說明了蒸餾的強大潛力。此外,我們發現將RL應用於這些蒸餾模型能帶來進一步的提升。我們認為這值得進一步探索,因此在此僅呈現簡單SFT-distilled model的結果。 ## 4. Discussion ## 4.1. Distillation v.s. Reinforcement Learning In Section 3.2, we can see that by distilling DeepSeek-R1, the small model can achieve impressive results. However, there is still one question left: can the model achieve comparable performance through the large-scale RL training discussed in the paper without distillation? 在Section 3.2中,我們可以看到通過蒸餾DeepSeek-R1,小型模型能夠取得讓人驚豔的成果。然而,還有一個問題未解:這個模型是否能在沒有蒸餾的情況下,通過論文中所述的大規模的強化學習訓練達到類似或可比擬的效能? To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math, code, and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The experimental results, shown in Table 6, demonstrate that the 32B base model, after large-scale RL training, achieves performance on par with QwQ-32B-Preview. However, DeepSeek-R1-Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than DeepSeek-R1-Zero-Qwen-32B across all benchmarks. 為了解答這個問題,我們在數學、程式碼和STEM資料上對Qwen-32B-Base進行大規模強化學習訓練,訓練超過10,000個steps,最終生成DeepSeek-R1-Zero-Qwen-32B。實驗結果如Table 6所示,顯示大規模RL訓練後的32B基礎模型在性能上媲美QwQ-32B-Preview。然而,在所有評估標準下,從DeepSeek-R1中提煉出來的DeepSeek-R1-Distill-Qwen-32B,明顯超越了DeepSeek-R1-Zero-Qwen-32B。 ![image](https://hackmd.io/_uploads/B1ikRSW51l.png) Table 6 | Comparison of distilled and RL Models on Reasoning-Related Benchmarks. Therefore, we can draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning. 因此,我們可以得出兩個結論:首先,將強大的模型蒸餾成小型模型能夠取得優異的效果,而依賴於本論文中所提到的大規模強化學習(RL)的小型模型則需要龐大的計算資源且可能無法達到蒸餾所帶來的效能。其次,儘管蒸餾策略既節省錢省力,但要超越智力的極限仍然需要更強大的基礎模型和更大規模的強化學習。 ## 4.2. Unsuccessful Attempts In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. 
### 4.2. Unsuccessful Attempts

In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. We share our failure experiences here to provide insights, but this does not imply that these approaches are incapable of developing effective reasoning models.

在開發 DeepSeek-R1 的初期階段,我們也曾遭遇失敗和挫折。在此分享這些失敗經驗以提供一些見解,但這並不意味著這些方法無法開發出有效的推理模型。

**Process Reward Model (PRM)** PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grained step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model requires additional training resources and complicates the whole training pipeline. In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.

**Process Reward Model (PRM)** 對於解決推理任務而言,PRM 是一種引導模型採取更好解題方式的合理方法(Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023)。然而,在實務上,PRM 有三個主要的侷限性,可能會阻礙其最終的成功。首先,在一般推理中明確定義細粒度的步驟具有挑戰性;其次,判斷當前的中間步驟是否正確是個困難任務,使用模型進行自動化標注可能不會產生令人滿意的結果,而手動標注則不利於擴展;第三,一旦引入基於模型的 PRM,就不可避免地導致 reward hacking(Gao et al., 2022),且重新訓練獎勵模型需要投入額外的訓練資源,並使整個訓練流程複雜化。總之,儘管 PRM 在重新排序模型生成的前 N 個回應或輔助引導式搜尋(Snell et al., 2024)方面具有良好的能力,不過與我們實驗中大規模強化學習過程所引入的額外計算開銷相比,它的優勢是有限的。

:::warning
> [name=機器說明]
專有名詞解釋:
- Process Reward Model (PRM):一種用於引導模型改進推理任務解題方式的方法,針對推理過程中的中間步驟給予獎勵。
- reward hacking:指在訓練過程中,模型學會利用獎勵系統的弱點來取得更高分數,而不是實質改善效能的現象。
:::
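:::warning
> [name=補充說明]
補充上文提到的「以 PRM 重新排序模型生成的前 N 個回應」:下面是一個概念性的 Python 草圖。其中 `prm_score` 代表一個假設的 PRM 介面(輸入問題、先前步驟與當前步驟,回傳該步驟正確的機率),彙整方式採各步驟分數的乘積(實務上也常取最小值),僅供說明,並非論文實作。

```python
from typing import Callable, List

def rerank_with_prm(
    question: str,
    candidates: List[List[str]],   # 每個候選解答被切成若干推理步驟
    prm_score: Callable[[str, List[str], str], float],  # 假設的 PRM 介面
) -> List[int]:
    """以 PRM 對 top-N 候選解答重新排序,回傳由好到壞的候選索引。"""
    scores = []
    for steps in candidates:
        s = 1.0
        for i, step in enumerate(steps):
            # prm_score(問題, 先前步驟, 當前步驟) -> 當前步驟正確的機率
            s *= prm_score(question, steps[:i], step)
        scores.append(s)
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

if __name__ == "__main__":
    # 假設的 PRM:懲罰「猜測」類的步驟,僅為示意
    dummy_prm = lambda q, prefix, step: 0.2 if "猜" in step else 0.9
    cands = [["設 2x + 1 = 7", "移項得 2x = 6", "解得 x = 3"],
             ["直接猜測", "x = 5"]]
    print(rerank_with_prm("解方程 2x + 1 = 7", cands, dummy_prm))  # 輸出 [0, 1]
```
:::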
**Monte Carlo Tree Search (MCTS)** Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time compute scalability. This approach involves breaking answers into smaller parts to allow the model to explore the solution space systematically. To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search. For training, we first use collected prompts to find answers via MCTS guided by a pre-trained value model. Subsequently, we use the resulting question-answer pairs to train both the actor model and the value model, iteratively refining the process.

**Monte Carlo Tree Search (MCTS)** 受到 AlphaGo(Silver et al., 2017b)和 AlphaZero(Silver et al., 2017a)的啟發,我們探索了利用 Monte Carlo Tree Search(MCTS)來增強測試階段的計算可擴展性。這個方法將答案分解為更小的部分,讓模型能夠系統性地探索解答空間。為此,我們在提示中促使模型生成多個標籤,這些標籤對應於搜尋所需的特定推理步驟。在訓練時,我們首先使用收集到的提示詞,透過由預訓練 value model 引導的 MCTS 來找出答案;接著,我們使用所產生的問題-答案配對來訓練 actor model 和 value model,反覆地改善這個過程。

However, this approach encounters several challenges when scaling up the training. First, unlike chess, where the search space is relatively well-defined, token generation presents an exponentially larger search space. To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-grained value model is inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo's core success relied on training a value model to progressively enhance its performance, this principle proves difficult to replicate in our setup due to the complexities of token generation.

然而,這個方法在擴大訓練規模時遇到了許多挑戰。首先,不同於搜尋空間相對明確的西洋棋,token 的生成呈現出指數級更大的搜尋空間。為了解決這個問題,我們為每個節點設定了最大擴展限制,但這可能導致模型陷入局部最佳解。其次,value model 直接影響生成的品質,因為它引導著搜尋過程的每一個步驟。訓練一個細粒度的 value model 本身就非常困難,這使得模型難以迭代地改善。儘管 AlphaGo 的核心成功仰賴於訓練 value model 來逐步提升效能,但由於 token 生成的複雜性,這一原則在我們的設定中難以複製。

In conclusion, while MCTS can improve performance during inference when paired with a pre-trained value model, iteratively boosting model performance through self-search remains a significant challenge.

總而言之,儘管 MCTS 在搭配預訓練的 value model 時能夠提升推理階段的效能,但要透過 self-search 迭代地提升模型效能仍是一項重大的挑戰。
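:::warning
> [name=補充說明]
以下用一個玩具版的 MCTS 示意上述流程:用假設的 `propose_steps`(實務上由策略模型/LLM 生成候選推理步驟)與假設的 `value_model`(以價值模型取代隨機 rollout)來搜尋,並以 `max_children` 限制每個節點的擴展數量,對應文中提到的「最大擴展限制」。這只是概念性的草圖,並非論文的實作。

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state            # 到目前為止的推理步驟序列
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return (self.value_sum / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def propose_steps(state, k):
    """假設的策略模型介面:給定推理前綴,提出 k 個候選的下一步(此處以佔位字串代替)。"""
    return [f"step-{len(state)}-{i}" for i in range(k)]

def value_model(state):
    """假設的價值模型介面:評估部分解答的品質,回傳 [0, 1] 分數(此處以亂數代替)。"""
    return random.random()

def mcts(root_state, n_simulations=100, max_children=4, max_depth=8):
    root = Node(list(root_state))
    for _ in range(n_simulations):
        # 1. Selection:沿 UCB 最大的子節點往下走
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.ucb())
        # 2. Expansion:每個節點最多擴展 max_children 個子節點,避免搜尋空間爆炸
        if node.visits > 0 and len(node.state) < max_depth:
            for step in propose_steps(node.state, max_children):
                node.children.append(Node(node.state + [step], parent=node))
            node = node.children[0]
        # 3. Evaluation:以價值模型評估節點,取代隨機 rollout(AlphaZero 式作法)
        value = value_model(node.state)
        # 4. Backpropagation:將價值回傳到路徑上的所有節點
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    best = max(root.children, key=lambda n: n.visits) if root.children else root
    return best.state

print(mcts(["question: prove that ..."], n_simulations=50))
```
:::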
## 5. Conclusion, Limitations, and Future Work

In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks.

在這項工作中,我們分享了透過強化學習提升模型推理能力的歷程。DeepSeek-R1-Zero 代表一種不依賴冷啟動資料(cold-start data)的純強化學習方法,並在多項任務中展現出強大的效能。DeepSeek-R1 則更為強大,它結合了冷啟動資料與迭代式的強化學習微調。最終,DeepSeek-R1 在一系列任務上達到與 OpenAI-o1-1217 可比擬的效能。

We further explore distilling the reasoning capability to small dense models. We use DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH. Other dense models also achieve impressive results, significantly outperforming other instruction-tuned models based on the same underlying checkpoints.

我們進一步探索將推理能力蒸餾到小型密集模型中。我們使用 DeepSeek-R1 作為教師模型(teacher model)生成 800K 筆訓練樣本,並微調多個小型密集模型。結果令人振奮:DeepSeek-R1-Distill-Qwen-1.5B 在數學基準測試上表現優於 GPT-4o 和 Claude-3.5-Sonnet,分別在 AIME 上達到 28.9%、在 MATH 上達到 83.9%。其它密集模型也取得了令人印象深刻的成果,明顯超越基於相同底層 checkpoints 的其它指令微調模型。

In the future, we plan to invest in research across the following directions for DeepSeek-R1.

在未來,我們計畫針對 DeepSeek-R1 投入以下幾個方向的研究:

- **General Capability:** Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn, complex role-playing, and JSON output. Moving forward, we plan to explore how long CoT can be leveraged to enhance tasks in these fields.
- **General Capability:** 目前來看,DeepSeek-R1 在函數呼叫、多輪對話、複雜角色扮演以及 JSON 輸出等任務上,表現不如 DeepSeek-V3。未來,我們計畫探索如何利用長思維鏈(CoT)來強化這些領域中的任務。
- **Language Mixing:** DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses, even if the query is in a language other than English or Chinese. We aim to address this limitation in future updates.
- **Language Mixing:** DeepSeek-R1 目前主要針對中文和英文進行最佳化,這可能導致在處理其它語言的查詢時出現語言混用的問題。舉例來說,即使使用者的查詢使用中、英以外的語言,DeepSeek-R1 仍可能以英文進行推理與回應。我們計畫在未來版本中解決這一侷限。
- **Prompting Engineering:** When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results.
- **Prompting Engineering:** 在評估 DeepSeek-R1 時,我們觀察到它對提示詞非常敏感,few-shot prompting 會持續地降低其效能。因此,我們建議使用者直接描述問題並指定輸出格式,採用 zero-shot 的設置來取得最佳結果(見本列表後的提示詞示意)。
- **Software Engineering Tasks:** Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.
- **Software Engineering Tasks:** 由於評估時間過長,影響了強化學習過程的效率,大規模強化學習尚未在軟體工程任務中被廣泛應用。因此,DeepSeek-R1 在軟體工程基準測試上相比 DeepSeek-V3 並未展現出巨大的改進。未來版本將透過對軟體工程資料實施拒絕採樣(rejection sampling),或在強化學習過程中引入非同步評估來提升效率,以解決這個問題。
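:::warning
> [name=補充說明]
承上面 Prompting Engineering 一點,下面用簡單的字串示意「直接描述問題並指定輸出格式」的 zero-shot 提示寫法。題目與格式要求皆為假設的示例,並非論文提供的提示詞。

```python
# 建議的 zero-shot 用法:不附加 few-shot 範例,直接描述問題並指定輸出格式
prompt = (
    "Solve the following problem and put your final answer inside \\boxed{}.\n\n"
    "Problem: If 3x + 5 = 20, what is x?"
)

# 相對地,像下面這樣在前面塞入多個解題範例(few-shot),
# 論文觀察到反而會持續降低 DeepSeek-R1 的表現
few_shot_prompt = (
    "Example 1: Problem: 1 + 1 = ?  Answer: 2\n"
    "Example 2: Problem: 2 + 3 = ?  Answer: 5\n\n"
    "Problem: If 3x + 5 = 20, what is x?"
)

print(prompt)
```
:::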