
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

tags:論文翻譯 deeplearning

說明

排版的順序為先原文,再繁體中文,並且圖片與表格都會出現在第一次出現的段落下面

原文

繁體中文

照片或表格

  1. 個人註解,任何的翻譯不通暢部份都請留言指導
  2. 為了加速閱讀,直接採用自建的反思翻譯(Phi4-14B模型所翻譯)的結果呈現,然後快速看過,語意對了頭過身就過,多多包涵
  3. 這篇論文測試利用docling將論文從pdf取出轉成markdown格式,再利用正則式取片段至dify建置的反思翻譯api取得譯文再做優化調整
  4. 解釋的部份也是直接用Phi4-14B模型來針對片段做理解說明
  5. 機器說明的部份我會把一些不必要的冗文刪除,這可能可以從提示詞中再來做回應的優化

Abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pretraining DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

數學推理因為其複雜性與結構化的特性,對語言模型來說是個重大的挑戰。在這篇論文中,我們提出 DeepSeekMath 7B,它基於 DeepSeek-Coder-Base-v1.5 7B 模型,使用來自 Common Crawl 的 120B 個數學相關 tokens,以及自然語言和程式碼資料來繼續預訓練。DeepSeekMath 7B 在不依靠外部 toolkits 和 voting techniques 的情況下,在 competition-level MATH 基準測試中取得了令人印象深刻的 51.7% 的分數,接近 Gemini-Ultra 和 GPT-4 的表現水準。對 DeepSeekMath 7B 的 64 個樣本做自我一致性(self-consistency)解碼後,在 MATH 上達到 60.9%。DeepSeekMath 強大的數學推理能力主要歸因於兩個關鍵因素:首先,我們透過精心設計的資料選擇管線(pipeline),充分發揮公開網路資料的巨大潛力。其次,我們引入 Group Relative Policy Optimization (GRPO),它是 Proximal Policy Optimization (PPO) 的一種變體,不僅能提升數學推理能力,同時還能最佳化 PPO 的記憶體使用。


Figure 1 | Top1 accuracy of open-source models on the competition-level MATH benchmark (Hendrycks et al., 2021) without the use of external toolkits and voting techniques.

1. Introduction

Large language models (LLM) have revolutionized the approach to mathematical reasoning in artificial intelligence, spurring significant advancements in both the quantitative reasoning benchmark (Hendrycks et al., 2021) and the geometry reasoning benchmark (Trinh et al., 2024). Moreover, these models have proven instrumental in assisting humans in solving complex mathematical problems (Tao, 2023). However, cutting-edge models such as GPT-4 (OpenAI, 2023) and Gemini-Ultra (Anil et al., 2023) are not publicly available, and the currently accessible open-source models considerably trail behind in performance.

大型語言模型(LLM)對人工智慧領域的數學推理方法已經有著革命性的變化,這促使數理邏輯基準測試(Hendrycks et al., 2021)和幾何推理基準測試(Trinh et al., 2024)取得重大進展。此外,這些模型也已經被證明對於幫助人類解決複雜數學問題方面是有幫助的(Tao, 2023)。然而,最先進的模型如GPT-4(OpenAI,2023)和Gemini-Ultra(Anil et al., 2023)並未開源,目前可供使用的開源模型在效能方面明顯不如。

In this study, we introduce DeepSeekMath, a domain-specific language model that significantly outperforms the mathematical capabilities of open-source models and approaches the performance level of GPT-4 on academic benchmarks. To achieve this, we create the DeepSeekMath Corpus, a large-scale high-quality pre-training corpus comprising 120B math tokens. This dataset is extracted from the Common Crawl (CC) using a fastText-based classifier (Joulin et al., 2016). In the initial iteration, the classifier is trained using instances from OpenWebMath (Paster et al., 2023) as positive examples, while incorporating a diverse selection of other web pages to serve as negative examples. Subsequently, we employ the classifier to mine additional positive instances from the CC, which are further refined through human annotation. The classifier is then updated with this enhanced dataset to improve its performance. The evaluation results indicate that the large-scale corpus is of high quality, as our base model DeepSeekMath-Base 7B achieves 64.2% on GSM8K (Cobbe et al., 2021) and 36.2% on the competition-level MATH dataset (Hendrycks et al., 2021), outperforming Minerva 540B (Lewkowycz et al., 2022a). In addition, the DeepSeekMath Corpus is multilingual, so we notice an improvement in Chinese mathematical benchmarks (Wei et al., 2023; Zhong et al., 2023). We believe that our experience in mathematical data processing is a starting point for the research community, and there is significant room for improvement in the future.

在本研究中,我們引入 DeepSeekMath,這是一個領域特定(domain-specific)的語言模型,其數學能力明顯超越開源模型,並在學術基準測試中接近 GPT-4 的表現。為此,我們創建了 DeepSeekMath 語料庫,這是一個大規模且高品質的預訓練語料庫,包含 120B 個數學相關的 tokens。這個資料集是利用基於 fastText 的分類器(Joulin et al., 2016)從 Common Crawl (CC) 中提取出來的。在初始迭代中,分類器使用來自 OpenWebMath(Paster et al., 2023)的實例作為正樣本,並結合多樣化的其它網頁作為負樣本。隨後,我們利用這個分類器從 CC 中挖掘更多的正樣本,再透過人工標註進一步精煉。接著,再用這個增強後的資料集更新分類器,以提升其效能。評估結果指出,這個大規模語料庫的品質很高:我們的基礎模型 DeepSeekMath-Base 7B 在 GSM8K(Cobbe et al., 2021)上達到 64.2%,在 competition-level MATH dataset(Hendrycks et al., 2021)上達到 36.2%,超越了 Minerva 540B(Lewkowycz et al., 2022a)。此外,DeepSeekMath 語料庫是多語言的,因此我們也觀察到在中文數學基準測試(Wei et al., 2023; Zhong et al., 2023)上的提升。我們相信,我們在數學資料處理方面的經驗可以作為研究社群的起點,未來仍有極大的改善空間。

DeepSeekMath-Base is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), as we notice that starting from a code training model is a better choice compared to a general LLM. Furthermore, we observe the math training also improves model capability on MMLU (Hendrycks et al., 2020) and BBH benchmarks (Suzgun et al., 2022), indicating it does not only enhance the model's mathematical abilities but also amplifies general reasoning capabilities.

DeepSeekMath-Base 以 DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024) 為基礎初始化,我們發現以程式碼訓練的模型為起點比起普通的大型語言模型會是更好的選擇。此外,我們觀察到數學訓練同時會提升了模型在 MMLU (Hendrycks et al., 2020) 和 BBH 基準 (Suzgun et al., 2022) 上的能力,這說明了,它不單純的增強模型的數學能力,同時提升了其一般推理能力。

After pre-training, we apply mathematical instruction tuning to DeepSeekMath-Base with chain-of-thought (Wei et al., 2022), program-of-thought (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning (Gou et al., 2023) data. The resulting model DeepSeekMath-Instruct 7B beats all 7B counterparts and is comparable with 70B open-source instruction-tuned models.

在預訓練之後,我們使用數學指令資料來微調 DeepSeekMath-Base,資料涵蓋思維鏈(chain-of-thought,Wei et al., 2022)、思維規劃(program-of-thought,Chen et al., 2022; Gao et al., 2023)以及工具整合推理(tool-integrated reasoning,Gou et al., 2023)三種格式。最終得到的模型 DeepSeekMath-Instruct 7B 擊敗所有同級的 7B 模型,並與 70B 的開源指令微調模型相當。

Furthermore, we introduce the Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase. We also provide a unified paradigm to understand different methods, such as Rejection Sampling Fine-Tuning (RFT) (Yuan et al., 2023a), Direct Preference Optimization (DPO) (Rafailov et al., 2023), PPO and GRPO. Based on such a unified paradigm, we find that all these methods are conceptualized as either direct or simplified RL techniques. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative RL and so on,to deeply investigate the essential elements of this paradigm. At last, we explain why our RL boosts the performance of instruction-tuned models, and further summarize potential directions to achieve more effective RL based on this unified paradigm.

此外,我們引入 Group Relative Policy Optimization (GRPO),這是強化學習(RL)演算法 Proximal Policy Optimization (PPO) (Schulman et al., 2017) 的一種變體。與傳統 PPO 不同,GRPO 放棄使用 critic model,改從群組分數中估計基線,顯著降低了訓練資源的需求。僅使用英文指令微調資料的一個子集,GRPO 就在強化學習階段為原本已相當強大的 DeepSeekMath-Instruct 帶來大幅提升,包括 in-domain(GSM8K: 82.9% → 88.2%,MATH: 46.8% → 51.7%)與 out-of-domain 的數學任務(例如 CMATH: 84.6% → 88.8%)。我們還提供了一個統一的範式來理解不同的方法,例如 Rejection Sampling Fine-Tuning (RFT) (Yuan et al., 2023a)、Direct Preference Optimization (DPO) (Rafailov et al., 2023)、PPO 與 GRPO。基於這個統一範式,我們發現這些方法都可以被概念化為直接或簡化的 RL 技術。此外,我們進行了大量實驗,例如線上與離線訓練、結果監督與過程監督、single-turn 與 iterative RL 等,以深入探究此範式的核心要素。最後,我們解釋了我們的強化學習為何能提升指令微調模型的效能,並進一步總結在此統一範式下實現更有效強化學習的潛在方向。

1.1. Contributions

Our contribution includes scalable math pre-training, along with the exploration and analysis of reinforcement learning.

我們的貢獻包括可擴展的數學預訓練,以及對強化學習的探索和分析。

Math Pre-Training at Scale

  • Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes. By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a) and 9 times the size of the recently released OpenWebMath (Paster et al., 2023).

  • 我們的研究提供了有力證據,顯示公開可訪問的 Common Crawl 資料中包含著有價值的信息可用於數學用途。通過精心設計的資料選擇流程,我們成功地建立了 DeepSeekMath 語料庫,這是一個由120B個tokens所組成的高品質資料集,這是一些數學內容網頁過濾而來的資料集,幾乎是 Minerva (Lewkowycz et al., 2022a) 所使用的數學網頁數量的7倍,也是近期發布的 OpenWebMath (Paster et al., 2023) 的9倍。

  • Our pre-trained base model DeepSeekMath-Base 7B achieves comparable performance with Minerva 540B (Lewkowycz et al., 2022a), indicating the number of parameters is not the only key factor in mathematical reasoning capability. A smaller model pre-trained on high-quality data could achieve strong performance as well.

  • 我們預訓練的基礎模型 DeepSeekMath-Base 7B,在數學推理能力上與 Minerva 540B (Lewkowycz et al., 2022a) 表現相當,這說明參數數量並不是決定數學推理能力的唯一關鍵因素。一個在高品質資料上預訓練的較小模型同樣能達到強大的效能。

  • We share our findings from math training experiments. Code training prior to math training improves models' ability to solve mathematical problems both with and without tool use. This offers a partial answer to the long-standing question: does code training improve reasoning abilities? We believe it does, at least for mathematical reasoning.

  • 我們分享了在數學訓練實驗中的所發現到的事物。在進行數學訓練之前進行程式碼的訓練,這麼做能夠提升模型解決包含或不包含工具使用的數學問題的能力。這給出了長久以來的問題一部分答案:程式碼的訓練是否可能提升推理能力?我們認為,至少對於數學推理而言,它是有幫助的。

  • Although training on arXiv papers is common, especially in many math-related papers, it brings no notable improvements on all mathematical benchmarks adopted in this paper.

  • 儘管使用 arXiv 論文進行模型訓練是一種常見做法,特別是跟數學相關的論文,但這種方法並未在本研究所採用的數學基準測試中帶來顯著的改進。

Exploration and Analysis of Reinforcement Learning

  • We introduce Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources compared to Proximal Policy Optimization (PPO).

  • 我們提出 Group Relative Policy Optimization (GRPO),一種高效且有效的強化學習演算法。與 Proximal Policy Optimization (PPO) 相比,GRPO 放棄使用 critic model,改用群組分數來估計基線,顯著降低了訓練資源的需求。

  • We demonstrate that GRPO significantly enhances the performance of our instruction-tuned model DeepSeekMath-Instruct, by solely using the instruction-tuning data. Furthermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process.

  • 我們說明 GRPO 可以明顯提升我們指令調教後的模型 DeepSeekMath-Instruct 的效能,這模型單純的使用指令調教的相關資料。此外,在強化學習過程中,我們觀察到在領域外的效能提升。

  • We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative reinforcement learning, and so on to deeply investigate the essential elements of this paradigm.

  • 我們給出一種統一的範式來理解不同的方法,如RFT、DPO、PPO以及GRPO等。此外,我們還進行了大量的實驗,像是線上與離線訓練、結果與過程監督、single-turn與iterative強化學習等,深入探討這個範式的方方面面。

  • Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize several potential directions to achieve more effective reinforcement learning of LLMs.

  • 基於我們提出的統一範式,我們探討了強化學習之所以有效的原因,並總結了幾個使大型語言模型強化學習更高效的可能方向。

1.2. Summary of Evaluations and Metrics

  • English and Chinese Mathematical Reasoning : We conduct comprehensive assessments of our models on English and Chinese benchmarks, covering mathematical problems from grade-school level to college level. English benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), SAT (Azerbayev et al., 2023), OCW Courses (Lewkowycz et al., 2022a), MMLU-STEM (Hendrycks et al., 2020). Chinese benchmarks include MGSM-zh (Shi et al., 2023), CMATH (Wei et al., 2023), Gaokao-MathCloze (Zhong et al., 2023), and Gaokao-MathQA (Zhong et al., 2023). We evaluate models' ability to generate self-contained text solutions without tool use, and also the ability to solve problems using Python.

  • English and Chinese Mathematical Reasoning : 我們進行我們的模型在英文和中文評測基準上全面性的評估,涵蓋各級別從小學到大學階段的數學問題。英文評測基準包括 GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), SAT (Azerbayev et al., 2023), OCW Courses (Lewkowycz et al., 2022a), 和 MMLU-STEM (Hendrycks et al., 2020)。中文評測基準包括 MGSM-zh (Shi et al., 2023), CMATH (Wei et al., 2023), Gaokao-MathCloze (Zhong et al., 2023),以及 Gaokao-MathQA (Zhong et al., 2023)。我們評估模型生成不依賴任何外部工具的自含文本解答的能力,並測試其使用 Python 解決問題的能力。

self-contained:樂詞網中的電子計算機領域翻譯為自含、自主

On English benchmarks, DeepSeekMath-Base is competitive with the closed-source Minerva 540B (Lewkowycz et al., 2022a), and surpasses all open-source base models (e.g., Mistral 7B (Jiang et al., 2023) and Llemma-34B (Azerbayev et al., 2023)), regardless of whether they've undergone math pre-training or not, often by a significant margin. Notably, DeepSeekMath-Base is superior on Chinese benchmarks, likely because we don't follow previous works (Azerbayev et al., 2023; Lewkowycz et al., 2022a) to collect English-only math pre-training data, and also include high-quality non-English ones. With mathematical instruction tuning and reinforcement learning, the resulting DeepSeekMath-Instruct and DeepSeekMath-RL demonstrate strong performance, obtaining an accuracy of over 50% on the competition-level MATH dataset for the first time within the open-source community.

在英文基準測試中,DeepSeekMath-Base 可與閉源的 Minerva 540B(Lewkowycz et al., 2022a)相提並論,並且無論對方是否經過數學預訓練,都超越所有開源基礎模型(例如 Mistral 7B (Jiang et al., 2023) 和 Llemma-34B (Azerbayev et al., 2023)),且往往領先幅度相當大。值得注意的是,DeepSeekMath-Base 在中文基準測試上表現更為突出,這可能是因為我們沒有依循先前的研究(Azerbayev et al., 2023; Lewkowycz et al., 2022a)只收集英文的數學預訓練資料,而是同時納入了高品質的非英文資料。透過數學指令微調和強化學習,最終產生的 DeepSeekMath-Instruct 和 DeepSeekMath-RL 展現出強大的效能,在 competition-level MATH 資料集上首次於開源社群中達到超過 50% 的準確率。

  • Formal Mathematics : We evaluate DeepSeekMath-Base using the informal-to-formal theorem proving task from (Jiang et al., 2022) on miniF2F (Zheng et al., 2021) with Isabelle (Wenzel et al., 2008) chosen to be the proof assistant. DeepSeekMath-Base demonstrates strong few-shot autoformalization performance.

  • Formal Mathematics : 我們在 miniF2F(Zheng et al., 2021)上,利用 Jiang et al. (2022) 提出的 informal-to-formal theorem proving task(非形式化到形式化的定理證明任務)來評估 DeepSeekMath-Base,並選擇 Isabelle(Wenzel et al., 2008)作為證明輔助工具。DeepSeekMath-Base 展現出強大的少樣本(few-shot)自動形式化(autoformalization)效能。

  • Natural Language Understanding, Reasoning, and Code : To build a comprehensive profile of models' general understanding, reasoning, and coding capabilities, we evaluate DeepSeekMath-Base on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020) which encompasses 57 multiple-choice tasks covering diverse subjects, BIG-Bench Hard (BBH) (Suzgun et al., 2022) which consists of 23 challenging tasks that mostly require multi-step reasoning to solve, as well as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) which are widely used to evaluate code language models. Math pre-training benefits both language understanding and reasoning performance.

  • Natural Language Understanding, Reasoning, and Code : 為了全面性地刻劃模型在自然語言理解、推理和程式設計方面的能力,我們在以下基準上評估 DeepSeekMath-Base:Massive Multitask Language Understanding (MMLU)(Hendrycks et al., 2020),包含 57 個涵蓋各種主題的選擇題任務;BIG-Bench Hard (BBH)(Suzgun et al., 2022),由 23 個大多需要多步推理才能解決的挑戰性任務組成;以及被廣泛用來評估程式語言模型的 HumanEval (Chen et al., 2021) 與 MBPP (Austin et al., 2021)。數學預訓練對語言理解和推理的效能皆有助益。

2. Math Pre-Training

2.1. Data Collection and Decontamination

In this section, we will outline the process of constructing the DeepSeekMath Corpus from Common Crawl. As depicted in Figure 2, we present an iterative pipeline that demonstrates how to systematically gather a large-scale mathematical corpus from Common Crawl, starting with a seed corpus (e.g., a small but high-quality collection of math-related dataset). It's worth noting that this approach is also applicable to other domains, such as coding.

在本節中,我們將概述從 Common Crawl 建構 DeepSeekMath 語料庫的過程。如 Figure 2 所示,我們提出一種迭代式的流程,說明如何從一個種子語料庫(例如一個小型但高品質的數學相關資料集)出發,系統性地從 Common Crawl 收集大規模的數學語料庫。值得注意的是,這種方法也適用於其它領域,例如程式碼。

First, we choose OpenWebMath (Paster et al., 2023), a collection of high-quality mathematical web texts, as our initial seed corpus. Using this corpus, we train a fastText model (Joulin et al., 2016) to recall more OpenWebMath-like mathematical web pages. Specifically, we randomly select 500,000 data points from the seed corpus as positive training examples and another 500,000 web pages from Common Crawl as negative ones. We employ an open-source library for training, configuring the vector dimension to 256, learning rate to 0.1, the maximum length of word n-gram to 3, the minimum number of word occurrences to 3, and the number of training epochs to 3. To reduce the size of the original Common Crawl, we employ URL-based deduplication and near-deduplication techniques, resulting in 40B HTML web pages. We then recall mathematical web pages from deduplicated Common Crawl with the fastText model. To filter out low-quality mathematical content, we rank the collected pages according to their scores predicted by the fastText model, and only preserve the top-ranking ones. The volume of data preserved is assessed through pre-training experiments on the top 40B, 80B, 120B, and 160B tokens. In the first iteration, we choose to keep the top 40B tokens.

首先,我們選擇 OpenWebMath(Paster et al., 2023),一個高品質數學網頁文本的集合,作為初始的種子語料庫(seed corpus)。使用這個語料庫,我們訓練一個 fastText 模型(Joulin et al., 2016)來召回更多與 OpenWebMath 相似的數學網頁。具體來說,我們從種子語料庫中隨機選取 500,000 筆資料作為訓練用的正樣本,並從 Common Crawl 中選取另外 500,000 個網頁作為負樣本。我們使用開源函式庫來訓練模型,將向量維度設為 256、學習率設為 0.1、word n-gram 的最大長度設為 3、單詞最少出現次數設為 3、training epochs 設為 3。為了縮減原始 Common Crawl 的規模,我們採用 URL-based deduplication 及 near-deduplication 技術,最終得到 40B 個 HTML 網頁。接著,我們使用 fastText 模型從去重後的 Common Crawl 中召回數學相關網頁。為了過濾掉低品質的數學內容,我們根據 fastText 模型預測的分數對收集到的網頁進行排序,並僅保留排名最高者。保留的資料量是透過在前 40B、80B、120B 和 160B 個 tokens 上進行預訓練實驗來評估的。在第一次迭代中,我們選擇保留前 40B 個 tokens。

這邊的開源函式庫指的是:https://fasttext.cc
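下面是一個最小的示意程式(非論文原始碼),假設使用官方的 fasttext Python 套件,並套用上文所述的超參數(向量維度 256、學習率 0.1、word n-gram 最大長度 3、最少出現次數 3、3 個 epochs);其中的檔名與標籤格式皆為假設:

```python
import fasttext

# 假設的訓練檔:每行一筆資料,格式為 "__label__math <文字>" 或 "__label__other <文字>",
# 正樣本取自 OpenWebMath,負樣本取自隨機的 Common Crawl 網頁
model = fasttext.train_supervised(
    input="seed_corpus.train",  # 假設的檔名
    dim=256,                    # 向量維度 256
    lr=0.1,                     # 學習率 0.1
    wordNgrams=3,               # word n-gram 最大長度 3
    minCount=3,                 # 單詞最少出現次數 3
    epoch=3,                    # 訓練 3 個 epochs
)

# 對一段網頁文字打分,分數可用於排序並只保留排名最高的網頁
labels, probs = model.predict("Prove that the sum of two even numbers is even.")
print(labels[0], probs[0])
```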

image

Figure 2 | An iterative pipeline that collects mathematical web pages from Common Crawl.

After the first iteration of data collection, numerous mathematical web pages remain uncollected, mainly because the fastText model is trained on a set of positive examples that lacks sufficient diversity. We therefore identify additional mathematical web sources to enrich the seed corpus, so that we can optimize the fastText model. Specifically, we first organize the entire Common Crawl into disjoint domains; a domain is defined as web pages sharing the same base URL. For each domain, we calculate the percentage of web pages that are collected in the first iteration. Domains where over 10% of the web pages have been collected are classified as math-related (e.g., mathoverflow.net ). Subsequently, we manually annotate the URLs associated with mathematical content within these identified domains (e.g., mathoverflow.net/questions ). Web pages linked to these URLs, yet uncollected, will be added to the seed corpus. This approach enables us to gather more positive examples, thereby training an improved fastText model capable of recalling more mathematical data in the subsequent iteration. After four iterations of data collection, we end up with 35.5M mathematical web pages, totaling 120B tokens. In the fourth iteration, we notice that nearly 98% of the data has already been collected in the third iteration, so we decide to cease data collection.

在資料收集的第一次迭代之後,仍有許多數學相關的網頁尚未被收入,主要是因為 fastText 模型是在一組缺乏足夠多樣性的正樣本上訓練的。因此,我們找出了額外的數學網路來源以豐富種子語料庫,進而最佳化 fastText 模型。具體來說,我們首先將整個 Common Crawl 整理成彼此不相交的網域(domain);一個 domain 定義為共享相同基本 URL 的網頁集合。對於每個 domain,我們計算在第一次迭代中已被收集的網頁百分比。若某個 domain 有超過 10% 的網頁被收集,則將其分類為數學相關(例如 mathoverflow.net)。隨後,我們手動標註這些已識別 domain 中與數學內容相關的 URL(如 mathoverflow.net/questions)。連結到這些 URL 但尚未被收集的網頁,就會被加入種子語料庫。這種方法使我們能夠收集更多正樣本,進而訓練出一個改進的 fastText 模型,在後續的迭代中召回更多的數學資料。經過四輪資料收集迭代之後,我們最終獲得了 35.5M 個數學網頁,總計 120B 個 tokens。在第四次迭代中,我們發現近 98% 的資料已在第三次迭代中被收集,因此我們決定停止資料收集。
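以下是依上段描述計算「已收集比例超過 10% 的網域」的簡單示意(非論文原始碼),函式名稱與輸入資料皆為假設:

```python
from collections import defaultdict
from urllib.parse import urlparse

def math_related_domains(all_urls, collected_urls, threshold=0.10):
    """依 base URL 分組,回傳第一輪已收集比例超過 threshold 的網域集合。"""
    total = defaultdict(int)
    hit = defaultdict(int)
    collected = set(collected_urls)
    for url in all_urls:
        domain = urlparse(url).netloc   # 共享相同 base URL 的網頁視為同一個 domain
        total[domain] += 1
        if url in collected:
            hit[domain] += 1
    return {d for d, n in total.items() if hit[d] / n > threshold}

# 使用範例(假設性資料)
domains = math_related_domains(
    all_urls=["https://mathoverflow.net/questions/1", "https://example.com/a"],
    collected_urls=["https://mathoverflow.net/questions/1"],
)
print(domains)  # {'mathoverflow.net'}
```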

To avoid benchmark contamination, we follow Guo et al. (2024) to filter out web pages containing questions or answers from English mathematical benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al., 2023). The filtering criteria are as follows: any text segment containing a 10-gram string that matches exactly with any sub-string from the evaluation benchmarks is removed from our math training corpus. For benchmark texts that are shorter than 10 grams but have at least 3 grams, we employ exact matching to filter out contaminated web pages.

為避免基準測試的污染,我們依循 Guo et al. (2024) 的方法,將包含英文數學基準測試如 GSM8K(Cobbe et al., 2021)和 MATH(Hendrycks et al., 2021),以及中文基準測試如 CMATH(Wei et al., 2023)和 AGIEval(Zhong et al., 2023)中問題或答案的網頁從我們的數學訓練語料庫中移除。篩選標準如下:若文本段落內含一個長度為 10-gram(即十個單詞)的子字符串與任何評估基準的子字符串完全匹配,那就從語料庫中移除。對於文字長度小於 10-gram 但至少有 3-gram(即三個單詞)的評估基準文本,我們則是使用精確匹配(exact matching)來篩選出受污染的網頁。
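以下用一段以單詞為單位的 Python 程式,示意上述「10-gram 完全匹配」與「3-gram 以上精確匹配」的去污染(decontamination)規則;這只是概念性的簡化版本,並非論文實際使用的實作:

```python
def word_ngrams(text, n):
    """回傳文字中所有以空白切詞後的 n-gram 集合。"""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(page_text, benchmark_texts, n=10, min_n=3):
    """網頁中任一 10-gram 與基準測試文字完全匹配,或較短的基準文字(>=3 個詞)精確出現,即視為污染。"""
    page_grams = word_ngrams(page_text, n)
    for bench in benchmark_texts:
        length = len(bench.split())
        if length >= n:
            if word_ngrams(bench, n) & page_grams:   # 10-gram 完全匹配
                return True
        elif length >= min_n:
            if bench in page_text:                   # 短文字採用精確匹配
                return True
    return False
```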

2.2. Validating the Quality of the DeepSeekMath Corpus

We run pre-training experiments to investigate how the DeepSeekMath Corpus is compared with the recently released math-training corpora:

我們做了預訓練實驗來探討 DeepSeekMath 語料庫與最近釋出的數學訓練資料集相比如何:

  • MathPile (Wang et al., 2023c): a multi-source corpus (8.9B tokens) aggregated from textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, and arXiv, with the majority (over 85%) sourced from arXiv;

  • MathPile (Wang et al., 2023c):一個多來源的語料庫(8.9B tokens),從教科書、維基百科、ProofWiki、CommonCrawl、StackExchange和arXiv中集合而成,其中超過85%的內容來自於arXiv。

  • OpenWebMath (Paster et al., 2023): CommonCrawl data filtered for mathematical content, totaling 13.6B tokens;

  • OpenWebMath (Paster et al., 2023): 從CommonCrawl中篩選出含有數學內容的資料,總計13.6B tokens。

  • Proof-Pile-2 (Azerbayev et al., 2023): a mathematical corpus consisting of OpenWebMath, AlgebraicStack (10.3B tokens of mathematical code), and arXiv papers (28.0B tokens). When experimenting on Proof-Pile-2, we follow Azerbayev et al. (2023) to use an arXiv:Web:Code ratio of 2:4:1.

  • Proof-Pile-2 (Azerbayev et al., 2023):這是一個由 OpenWebMath、AlgebraicStack(10.3B tokens 的數學程式碼)和 arXiv 論文(28.0B tokens)組成的數學語料庫。在 Proof-Pile-2 上進行實驗時,我們依循 Azerbayev et al. (2023),將 arXiv:Web:Code 的比例設定為 2:4:1。

2.2.1. Training Setting

We apply math training to a general pre-trained language model with 1.3B parameters, which shares the same framework as the DeepSeek LLMs (DeepSeek-AI, 2024), denoted as DeepSeek-LLM 1.3B. We separately train a model on each mathematical corpus for 150B tokens. All experiments are conducted using the efficient and light-weight HAI-LLM (High-flyer, 2023) training framework. Following the training practice of DeepSeek LLMs, we use the AdamW optimizer (Loshchilov and Hutter, 2017) with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight_decay $= 0.1$, along with a multi-step learning rate schedule where the learning rate reaches the peak after 2,000 warmup steps, decreases to 31.6% of the peak after 80% of the training process, and further decreases to 10.0% of the peak after 90% of the training process. We set the maximum value of the learning rate to 5.3e-4, and use a batch size of 4M tokens with a 4K context length.

我們在一個 1.3B 參數量的通用預訓練語言模型上進行數學訓練,這個模型與 DeepSeek LLMs(DeepSeek-AI, 2024)有著相同的架構,記為 DeepSeek-LLM 1.3B。我們分別在每個數學語料庫上對模型訓練 150B tokens。所有實驗都使用高效且輕量的 HAI-LLM(High-flyer, 2023)訓練框架進行。依循 DeepSeek LLMs 的訓練慣例,我們使用 AdamW optimizer(Loshchilov and Hutter, 2017),其參數為 $\beta_1 = 0.9$、$\beta_2 = 0.95$、weight_decay $= 0.1$,並採用多階段學習率調度策略(multi-step learning rate schedule):學習率在 2,000 個 warmup steps 後達到峰值,在訓練過程的 80% 時降至峰值的 31.6%,再於 90% 時進一步降至峰值的 10%。我們將學習率的最大值設定為 5.3e-4,並使用 4M tokens 的 batch size 搭配 4K 的 context length。
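以下用一個簡單的 Python 函式示意上述的多階段學習率調度(2,000 步 warmup、80% 處降為峰值的 31.6%、90% 處降為峰值的 10%);僅為概念示意,實際實作依訓練框架而定:

```python
def multi_step_lr(step, total_steps, peak_lr=5.3e-4, warmup_steps=2000):
    """回傳第 step 步的學習率。"""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # 線性 warmup 至峰值
    if step < 0.8 * total_steps:
        return peak_lr                         # 維持峰值
    if step < 0.9 * total_steps:
        return peak_lr * 0.316                 # 80% 之後降為峰值的 31.6%
    return peak_lr * 0.10                      # 90% 之後降為峰值的 10%
```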

GSM8K, MATH, OCW, SAT and MMLU STEM are English benchmarks; CMATH, Gaokao MathCloze and Gaokao MathQA are Chinese benchmarks.

| Math Corpus | Size | GSM8K | MATH | OCW | SAT | MMLU STEM | CMATH | Gaokao MathCloze | Gaokao MathQA |
|---|---|---|---|---|---|---|---|---|---|
| No Math Training | N/A | 2.9% | 3.0% | 2.9% | 15.6% | 19.5% | 12.3% | 0.8% | 17.9% |
| MathPile | 8.9B | 2.7% | 3.3% | 2.2% | 12.5% | 15.7% | 1.2% | 0.0% | 2.8% |
| OpenWebMath | 13.6B | 11.5% | 8.9% | 3.7% | 31.3% | 29.6% | 16.8% | 0.0% | 14.2% |
| Proof-Pile-2 | 51.9B | 14.3% | 11.2% | 3.7% | 43.8% | 29.2% | 19.9% | 5.1% | 11.7% |
| DeepSeekMath Corpus | 120.2B | 23.8% | 13.6% | 4.8% | 56.3% | 33.1% | 41.5% | 5.9% | 23.6% |

Table 1 | Performance of DeepSeek-LLM 1.3B trained on different mathematical corpora, evaluated using few-shot chain-of-thought prompting. Corpus sizes are calculated using our tokenizer with a vocabulary size of 100K.

2.2.2. Evaluation Results

The DeepSeekMath Corpus is of high quality, covers multilingual mathematical content, and is the largest in size.

DeepSeekMath 語料庫品質高、涵蓋多語言的數學內容,並且是其中規模最大的。

  • High-quality : We evaluate downstream performance on 8 mathematical benchmarks using few-shot chain-of-thought prompting Wei et al. (2022). As shown in Table 1, there is a clear performance lead of the model trained on the DeepSeekMath Corpus. Figure 3 shows that the model trained on the DeepSeekMath Corpus demonstrates better performance than Proof-Pile-2 at 50B tokens (1 full epoch of Proof-Pile-2), indicating the average quality of DeepSeekMath Corpus is higher.

  • High-quality:我們使用 few-shot chain-of-thought prompting(Wei et al., 2022)在八個數學基準上評估下游效能。如 Table 1 所示,在 DeepSeekMath 語料庫上訓練的模型在效能上明顯領先。Figure 3 顯示,在 50B tokens(即 Proof-Pile-2 的一個完整 epoch)時,基於 DeepSeekMath Corpus 訓練的模型已明顯優於基於 Proof-Pile-2 訓練的模型,這說明 DeepSeekMath Corpus 的平均品質更高。

  • Multilingual : The DeepSeekMath Corpus encompasses data in multiple languages, predominantly featuring English and Chinese as the two most represented languages. As shown in Table 1, training on the DeepSeekMath Corpus enhances mathematical reasoning performance in both English and Chinese. In contrast, existing mathematical corpora, which are primarily English-centric, show limited improvement and may even hinder performance in Chinese mathematical reasoning.

  • Multilingual:DeepSeekMath 語料庫包含了多種語言的資料,主要以英文和中文為兩大代表性語言。如Table 1所示,在 DeepSeekMath 語料庫上訓練能夠提升數學推理的能力,這一改善同樣適用於英文與中文。相比之下,現有以英文為主的數學資料集,在中文數學推理方面的改善有限,甚至可能導致性能下降。

  • Large-scale : The DeepSeekMath Corpus is several times larger than existing mathematical corpora. As depicted in Figure 3, DeepSeek-LLM 1.3B, when trained on the DeepSeekMath Corpus, shows a steeper learning curve along with more lasting improvements. In contrast, the baseline corpora are much smaller, and have already been repeated multiple rounds during training, with the resulting model performance quickly reaching a plateau.

  • Large-scale:DeepSeekMath 語料庫比現有的數學語料庫大上數倍。如 Figure 3 所示,在 DeepSeekMath 語料庫上訓練時,DeepSeek-LLM 1.3B 展現出更陡峭的學習曲線以及更持久的改善。相比之下,基線語料庫小得多,在訓練過程中已被重複多輪,模型效能也很快趨於平穩。

image

Figure 3 | Benchmark curves of DeepSeek-LLM 1.3B trained on different mathematical corpora.

2.3. Training and Evaluating DeepSeekMath-Base 7B

In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024) and trained for 500B tokens. The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% is Github code, and the remaining 10% is natural language data from Common Crawl in both English and Chinese. We mainly adopt the training setting specified in Section 2.2.1, except that we set the maximum value of the learning rate to 4.2e-4 and use a batch size of 10M tokens.

在本節中,我們提出 DeepSeekMath-Base 7B,這是一個具備強大推理能力的基礎模型,尤其擅長數學。該模型以 DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024) 初始化,並訓練 500B tokens。資料分布如下:56% 來自 DeepSeekMath 語料庫,4% 來自 AlgebraicStack,10% 來自 arXiv,20% 為 Github 程式碼,剩餘 10% 是來自 Common Crawl 的英文和中文自然語言資料。我們主要採用 Section 2.2.1 所述的訓練設置,僅將學習率最大值改設為 4.2e-4,並使用 10M tokens 的批次大小。

We conduct a comprehensive assessment of the mathematical capabilities of DeepSeekMathBase 7B, focusing on its ability to produce self-contained mathematical solutions without relying on external tools, solve mathematical problems using tools, and conduct formal theorem proving. Beyond mathematics, we also provide a more general profile of the base model, including its performance of natural language understanding, reasoning, and programming skills.

我們對 DeepSeekMathBase 7B 的數學能力做了全面性的評估,重點關注其在不依賴外部工具的情況下產生自含的數學解決方案、使用工具解決數學問題以及進行形式化證明的能力。除了數學之外,我們還提供了基礎模型的更為一般性的概況,包括其自然語言理解、推理和程式設計技能的能力。

Mathematical Problem Solving with Step-by-Step Reasoning We evaluate DeepSeekMathBase's performance of solving mathematical problems using few-shot chain-of-thought prompting (Wei et al., 2022), across eight benchmarks in English and Chinese. These benchmarks encompass quantitative reasoning (e.g., GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMATH (Wei et al., 2023)) and multiple-choice problems (e.g., MMLU-STEM (Hendrycks et al., 2020) and Gaokao-MathQA (Zhong et al., 2023)), covering diverse fields of mathematics from elementary to college-level complexity.

Mathematical Problem Solving with Step-by-Step Reasoning 我們使用 few-shot chain-of-thought prompting(Wei et al., 2022),在八個英文與中文基準測試上評估 DeepSeekMath-Base 解決數學問題的效能。這些基準涵蓋數理推理(例如 GSM8K (Cobbe et al., 2021)、MATH (Hendrycks et al., 2021) 和 CMATH (Wei et al., 2023))以及選擇題(如 MMLU-STEM (Hendrycks et al., 2020) 和 Gaokao-MathQA (Zhong et al., 2023)),涵蓋從小學到大學程度的多元數學領域。

As shown in Table 2, DeepSeekMath-Base 7B leads in performance across all eight benchmarks among the open-source base models (including the widely-used general model Mistral 7B (Jiang et al., 2023) and the recently released Llemma 34B (Azerbayev et al., 2023) which underwent math training on Proof-Pile-2 (Azerbayev et al., 2023)). Notably, on the competitionlevel MATH dataset, DeepSeekMath-Base surpasses existing open-source base models by over 10% absolute, and outperforms Minerva 540B (Lewkowycz et al., 2022a), a closed-source base model 77 times larger which builds on PaLM (Lewkowycz et al., 2022b) and is further trained on mathematical texts.

如 Table 2 所示,DeepSeekMath-Base 7B 在所有八個基準測試中領先於其它開源基礎模型(包括被廣泛使用的通用模型 Mistral 7B (Jiang et al., 2023),以及最近釋出、在 Proof-Pile-2 (Azerbayev et al., 2023) 上做過數學訓練的 Llemma 34B (Azerbayev et al., 2023))。值得注意的是,在競賽級別的 MATH 資料集上,DeepSeekMath-Base 以超過 10% 的絕對差距領先現有的開源基礎模型,並且優於 Minerva 540B (Lewkowycz et al., 2022a),這是一個規模大上 77 倍、建立在 PaLM (Lewkowycz et al., 2022b) 基礎上並進一步以數學文本訓練的閉源基礎模型。

GSM8K, MATH, OCW, SAT and MMLU STEM are English benchmarks; CMATH, Gaokao MathCloze and Gaokao MathQA are Chinese benchmarks.

| Model | Size | GSM8K | MATH | OCW | SAT | MMLU STEM | CMATH | Gaokao MathCloze | Gaokao MathQA |
|---|---|---|---|---|---|---|---|---|---|
| **Closed-Source Base Model** | | | | | | | | | |
| Minerva | 7B | 16.2% | 14.1% | 7.7% | - | 35.6% | - | - | - |
| Minerva | 62B | 52.4% | 27.6% | 12.0% | - | 53.9% | - | - | - |
| Minerva | 540B | 58.8% | 33.6% | 17.6% | - | 63.9% | - | - | - |
| **Open-Source Base Model** | | | | | | | | | |
| Mistral | 7B | 40.3% | 14.3% | 9.2% | 71.9% | 51.1% | 44.9% | 5.1% | 23.4% |
| Llemma | 7B | 37.4% | 18.1% | 6.3% | 59.4% | 43.1% | 43.4% | 11.9% | 23.6% |
| Llemma | 34B | 54.0% | 25.3% | 10.3% | 71.9% | 52.9% | 56.1% | 11.9% | 26.2% |
| DeepSeekMath-Base | 7B | 64.2% | 36.2% | 15.4% | 84.4% | 56.5% | 71.7% | 20.3% | 35.3% |

Table 2 | Comparisons between DeepSeekMath-Base 7B and strong base models on English and Chinese mathematical benchmarks. Models are evaluated with chain-of-thought prompting. Minerva results are quoted from Lewkowycz et al. (2022a).

Mathematical Problem Solving with Tool Use We evaluate program-aided mathematical reasoning on GSM8K and MATH using few-shot program-of-thought prompting (Chen et al., 2022; Gao et al., 2023). Models are prompted to solve each problem by writing a Python program where libraries such as math and sympy can be utilized for intricate computations. The execution result of the program is evaluated as the answer. As shown in Table 3, DeepSeekMath-Base 7B outperforms the prior state-of-the-art Llemma 34B.

Mathematical Problem Solving with Tool Use 我們使用 few-shot program-of-thought prompting(Chen et al., 2022; Gao et al., 2023),評估模型在 GSM8K 和 MATH 上的程式輔助數學推理能力。模型被提示透過撰寫一個 Python 程式來解決每個問題,程式中可以使用 math 和 sympy 之類的函式庫來進行複雜的計算,程式的執行結果即被視為答案。如 Table 3 所示,DeepSeekMath-Base 7B 優於先前最先進的 Llemma 34B。
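以下示意 program-of-thought 的評估流程:模型輸出一段 Python 程式(可使用 math、sympy),執行後以標準輸出作為答案。題目與程式內容皆為假設的簡化範例,並非論文的實際評測程式:

```python
import subprocess
import sys
import tempfile

# 假設模型針對某題輸出的程式(示意用的簡化題目:解 3x + 7 = 22)
POT_SOLUTION = """
from sympy import symbols, solve
x = symbols('x')
print(solve(3*x + 7 - 22, x)[0])
"""

def run_program_of_thought(code, timeout=10):
    """執行模型產生的程式,將其標準輸出視為答案。"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip()

print(run_program_of_thought(POT_SOLUTION))  # 預期輸出:5
```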

GSM8K+Python and MATH+Python measure problem solving with tools; miniF2F-valid and miniF2F-test measure informal-to-formal proving.

| Model | Size | GSM8K+Python | MATH+Python | miniF2F-valid | miniF2F-test |
|---|---|---|---|---|---|
| Mistral | 7B | 48.5% | 18.2% | 18.9% | 18.0% |
| CodeLlama | 7B | 27.1% | 17.2% | 16.3% | 17.6% |
| CodeLlama | 34B | 52.7% | 23.5% | 18.5% | 18.0% |
| Llemma | 7B | 41.0% | 18.6% | 20.6% | 22.1% |
| Llemma | 34B | 64.6% | 26.3% | 21.0% | 21.3% |
| DeepSeekMath-Base | 7B | 66.9% | 31.4% | 25.8% | 24.6% |

Table 3 | Few-shot evaluation of base models' ability to solve mathematical problems using tools and the ability to conduct informal-to-formal theorem proving in Isabelle.

Formal Mathematics Formal proof automation is beneficial to ensure the accuracy and reliability of mathematical proofs and enhance efficiency, with increasing attention in recent years. We evaluate DeepSeekMath-Base 7B on the task of informal-to-formal proving from (Jiang et al., 2022) which is to generate a formal proof based on an informal statement, a formal counterpart of the statement, and an informal proof. We evaluate on miniF2F (Zheng et al., 2021), a benchmark for formal Olympiad-level mathematics, and generate a formal proof in Isabelle for each problem with few-shot prompting. Following Jiang et al. (2022), we leverage models to generate proof sketches, and execute the off-the-shelf automated prover Sledgehammer (Paulson, 2010) to fill in the missing details. As shown in Table 3, DeepSeekMath-Base 7B demonstrates strong performance in proof autoformalization.

Formal Mathematics 形式化證明自動化有助於確保數學證明的準確性與可靠度並提升效率,近年來受到越來越多關注。我們以 (Jiang et al., 2022) 的 informal-to-formal proving(非形式化到形式化證明)任務評估 DeepSeekMath-Base 7B,此任務是根據一個非形式化敘述、該敘述的形式化對應(formal counterpart)以及一個非形式化證明,來生成形式化證明。我們在 miniF2F (Zheng et al., 2021) 上進行評估,這是一個奧林匹亞級別形式化數學的基準測試,並以 few-shot prompting 為每個問題在 Isabelle 中生成形式化證明。依循 Jiang et al. (2022) 的方法,我們利用模型生成證明草稿(proof sketches),再使用現成的自動化證明工具 Sledgehammer (Paulson, 2010) 填補缺失的細節。如 Table 3 所示,DeepSeekMath-Base 7B 在證明的自動形式化上展現出強大的效能。

| Model | Size | MMLU | BBH | HumanEval (Pass@1) | MBPP (Pass@1) |
|---|---|---|---|---|---|
| Mistral | 7B | 62.4% | 55.7% | 28.0% | 41.4% |
| DeepSeek-Coder-Base-v1.5 † | 7B | 42.9% | 42.9% | 40.2% | 52.6% |
| DeepSeek-Coder-Base-v1.5 | 7B | 49.1% | 55.2% | 43.2% | 60.4% |
| DeepSeekMath-Base | 7B | 54.9% | 59.5% | 40.9% | 52.6% |

Table 4 | Evaluation on natural language understanding, reasoning, and code benchmarks. DeepSeek-Coder-Base-v1.5 † is the checkpoint right before learning rate decay, which is used to train DeepSeekMath-Base. On MMLU and BBH, we use few-shot chain-of-thought prompting. On HumanEval and MBPP, we evaluate model performance under the zero-shot setting and a few-shot setting, respectively.

Natural Language Understanding, Reasoning, and Code We evaluate model performance of natural language understanding on MMLU (Hendrycks et al., 2020), reasoning on BBH (Suzgun et al., 2022), and coding capabilities on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). As shown in Table 4, DeepSeekMath-Base 7B exhibits significant enhancements in performance on MMLU and BBH over its precursor, DeepSeek-Coder-Base-v1.5 (Guo et al., 2024), illustrating the positive impact of math training on language understanding and reasoning. Additionally, by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks. Overall, DeepSeekMath-Base 7B significantly outperforms the general model Mistral 7B (Jiang et al., 2023) on the three reasoning and coding benchmarks.

Natural Language Understanding, Reasoning, and Code 我們評估了模型在自然語言理解上的表現,使用 MMLU (Hendrycks et al., 2020)、推理能力則用 BBH (Suzgun et al., 2022),以及寫程式的能力分別使用 HumanEval (Chen et al., 2021) 和 MBPP (Austin et al., 2021)。如Table 4所示,DeepSeekMath-Base 7B 在 MMLU 和 BBH 上顯示出顯著的性能提升,超越了其前身 DeepSeek-Coder-Base-v1.5 (Guo et al., 2024),這說明了數學訓練對語言理解和推理能力的正面影響。此外,通過在持續訓練中包含code tokens,DeepSeekMath-Base 7B 能有效地保持 DeepSeek-Coder-Base-v1.5 在兩個寫程式的基準測試上的表現。總的來說,DeepSeekMath-Base 7B 在三個推理和寫程式的基準測試中明顯超越了通用模型 Mistral 7B (Jiang et al., 2023)。

3. Supervised Fine-Tuning

3.1. SFT Data Curation

We construct a mathematical instruction-tuning dataset covering English and Chinese problems from different mathematical fields and of varying complexity levels: problems are paired with solutions in chain-of-thought (CoT) (Wei et al., 2022), program-of-thought (PoT) (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning format (Gou et al., 2023). The total number of training examples is 776K.

我們建立了一個涵蓋來自英文和中文的各類數學問題的數學指令微調資料集,這些問題來自不同的數學領域且具有多種階層複雜度:每個問題都與chain-of-thought (CoT) (Wei et al., 2022);program-of-thought (PoT) (Chen et al., 2022; Gao et al., 2023);以及 tool-integrated reasoning format (Gou et al., 2023) 的解法配對。總訓練樣本數達776K。

  • English mathematical datasets : We annotate GSM8K and MATH problems with tool-integrated solutions, and adopt a subset of MathInstruct (Yue et al., 2023) along with the training set of Lila-OOD (Mishra et al., 2022) where problems are solved with CoT or PoT. Our English collection covers diverse fields of mathematics, e.g., algebra, probability, number theory, calculus, and geometry.

  • English mathematical datasets : 我們利用工具整合解決方案(tool-integrated solutions)來標註GSM8K與MATH問題,並採用MathInstruct (Yue et al., 2023)的子集,以及 Lila-OOD 的訓練集(Mishra et al., 2022)中用 CoT 或 PoT 解決的資料。我們的英文收藏涵蓋了多樣化的數學領域,例如代數、概率、數論、微積分和幾何學。


  • Chinese mathematical datasets : We collect Chinese K-12 mathematical problems spanning 76 sub-topics such as linear equations, with solutions annotated in both CoT and tool-integrated reasoning format.

  • Chinese mathematical datasets: 我們收集了Chinese K-12的數學問題,涉及76個子議題,像是線性方程式,並且其解是以CoT與tool-integrated reasoning format的方式標註。

3.2. Training and Evaluating DeepSeekMath-Instruct 7B

In this section, we introduce DeepSeekMath-Instruct 7B which undergoes mathematical instruction tuning based on DeepSeekMath-Base. Training examples are randomly concatenated until reaching a maximum context length of 4K tokens. We train the model for 500 steps with a batch size of 256 and a constant learning rate of 5e-5.

在本節中,我們介紹 DeepSeekMath-Instruct 7B,這是一個基於 DeepSeekMath-Base 進行數學指令微調的模型。訓練樣本會被隨機串接,直到達到 4K tokens 的最大上下文長度。我們以 256 的批次大小和固定 5e-5 的學習率訓練模型 500 steps。

We evaluate models' mathematical performance both without and with tool use, on 4 quantitative reasoning benchmarks in English and Chinese. We benchmark our model against the leading models of the time:

我們在不使用工具與使用工具兩種情況下,於四個英文和中文的數理推理基準測試上評估模型的數學表現,並將我們的模型與當時領先的模型進行比較:

  • Closed-source models include: (1) the GPT family among which GPT-4 (OpenAI, 2023) and GPT-4 Code Interpreter are the most capable ones, (2) Gemini Ultra and Pro (Anil et al., 2023), (3) Inflection-2 (Inflection AI, 2023), (4) Grok-1 , as well as models recently released by Chinese companies including (5) Baichuan-3, (6) the latest GLM-4 5 from the GLM family (Du et al., 2022). These models are for general purposes, most of which have undergone a series of alignment procedures.

  • Closed-source models include:(1) GPT家族成員中的GPT-4(OpenAI, 2023)和 GPT-4 Code Interpreter 是最具能力的,(2) Gemini Ultra 和 Pro(Anil et al., 2023);(3)Inflection-2(Inflection AI, 2023);(4)Grok-1,以及中國公司近期推出的模型:(5)Baichuan-3和(6)GLM家族最新版本GLM-4(Du et al., 2022)。這些模型主要用於一般性應用,大多數經歷了一系列對齊程序。

  • Open-source models include: general models like (1) DeepSeek-LLM-Chat 67B (DeepSeekAI, 2024), (2) Qwen 72B (Bai et al., 2023), (3) SeaLLM-v2 7B (Nguyen et al., 2023), and (4) ChatGLM3 6B (ChatGLM3 Team, 2023), as well as models with enhancements in mathematics including (5) InternLM2-Math 20B 6 which builds on InternLM2 and underwent math training followed by instruction tuning, (6) Math-Shepherd-Mistral 7B which applys PPO training (Schulman et al., 2017) to Mistral 7B (Jiang et al., 2023) with a process-supervised reward model, (7) the WizardMath series (Luo et al., 2023) which improves mathematical reasoning in Mistral 7B and Llama-2 70B (Touvron et al., 2023) using evolve-instruct (i.e., a version of instruction tuning that uses AI-evolved instructions) and PPO training with training problems primarily sourced from GSM8K and MATH, (8) MetaMath 70B (Yu et al., 2023) which is Llama-2 70B fine-tuned on an augmented version of GSM8K and MATH, (9) ToRA 34B Gou et al. (2023) which is CodeLlama 34B fine-tuned to do tool-integrated mathematical reasoning, (10) MAmmoTH 70B (Yue et al., 2023) which is Llama-2 70B instruction-tuned on MathInstruct.

  • Open-source models 包括:通用模型如(1)DeepSeek-LLM-Chat 67B (DeepSeekAI, 2024)、(2)Qwen 72B (Bai et al., 2023)、(3)SeaLLM-v2 7B (Nguyen et al., 2023) 及(4)ChatGLM3 6B (ChatGLM3 Team, 2023),以及在數學方面有加強的模型,例如(5)基於 InternLM2 的 InternLM2-Math 20B ,在經過數學訓練後進行指令調整、(6)Math-Shepherd-Mistral 7B ,這是一個採用 process-supervised reward model 將 PPO 訓練應用於 Mistral 7B (Jiang et al., 2023) 的模型(7)WizardMath 系列 (Luo et al., 2023),使用 evolve-instruct(也就是利用 AI-evolved instructions 的指令進行調整) 與 PPO 訓練來改善 Mistral 7B 和 Llama-2 70B (Touvron et al., 2023) 的數學推理能力,訓練問題主要來源為 GSM8K 和 MATH、(8)MetaMath 70B (Yu et al., 2023) 是在 GSM8K 和 MATH 增強版本上基於 Llama-2 70B 進行微調的結果、(9)ToRA 34B Gou et al. (2023) 是 CodeLlama 34B 微調以執行工具整合的數學推理,(10)MAmmoTH 70B (Yue et al., 2023) 是 Llama-2 70B 在 MathInstruct 上進行指令微調。

As shown in Table 5, under the evaluation setting where tool use is disallowed, DeepSeekMathInstruct 7B demonstrates strong performance of step-by-step reasoning. Notably, on the competition-level MATH dataset, our model surpasses all open-source models and the majority of proprietary models (e.g., Inflection-2 and Gemini Pro) by at least 9% absolute. This is true even for models that are substantially larger (e.g., Qwen 72B) or have been specifically enhanced through math-focused reinforcement learning (e.g., WizardMath-v1.1 7B). While DeepSeekMath-Instruct rivals the Chinese proprietary models GLM-4 and Baichuan-3 on MATH, it still underperforms GPT-4 and Gemini Ultra.

如 Table 5 所示,在不允許使用工具的評估設置下,DeepSeekMath-Instruct 7B 展現出強大的逐步推理能力。尤其是在競賽級別的 MATH 資料集上,我們的模型以至少 9% 的絕對差距超越了所有開源模型和多數專有模型(例如 Inflection-2 及 Gemini Pro)。即便是那些規模遠大於它的模型(如 Qwen 72B),或是經過以數學為重點的強化學習特別強化的模型(例如 WizardMath-v1.1 7B),此結果依然成立。儘管 DeepSeekMath-Instruct 在 MATH 上可與中國的專有模型 GLM-4 和 Baichuan-3 相抗衡,但其表現仍不及 GPT-4 和 Gemini Ultra。

Under the evaluation setting where models are allowed to integrate natural language reasoning and program-based tool use for problem solving, DeepSeekMath-Instruct 7B approaches an accuracy of 60% on MATH, surpassing all existing open-source models. On the other benchmarks, our model is competitive with DeepSeek-LLM-Chat 67B, the prior state-of-the-art that is 10 times larger.

在允許模型整合自然語言推理與基於程式的工具來解決問題的評估設置下,DeepSeekMath-Instruct 7B 在 MATH 上的準確率接近 60%,超越了現有所有的開源模型。在其它基準測試中,我們的模型與先前最先進、規模大上十倍的 DeepSeek-LLM-Chat 67B 表現相當。

4. Reinforcement Learning

4.1. Group Relative Policy Optimization

Reinforcement learning (RL) has been proven to be effective in further improving the mathematical reasoning ability of LLMs after the Supervised Fine-Tuning (SFT) stage (Luo et al., 2023; Wang et al., 2023b). In this section, we introduce our efficient and effective RL algorithm, Group Relative Policy Optimization (GRPO).

強化學習(RL)已被證實能在 Supervised Fine-Tuning (SFT) 階段之後進一步提升大型語言模型(LLMs)的數學推理能力(Luo et al., 2023; Wang et al., 2023b)。在本節中,我們介紹我們所提出、高效且有效的強化學習演算法:Group Relative Policy Optimization (GRPO)。

4.1.1. From PPO to GRPO

Proximal Policy Optimization (PPO) (Schulman et al., 2017) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs (Ouyang et al., 2022). In particular, it optimizes LLMs by maximizing the following surrogate objective:

Proximal Policy Optimization (PPO) (Schulman et al., 2017) 是一種 actor-critic 強化學習演算法,被廣泛應用於 LLM 的強化學習微調階段(Ouyang et al., 2022)。具體來說,它通過最大化以下替代目標函式(surrogate objective)來最佳化 LLM:

$$\mathcal{J}_{PPO}(\theta)=\mathbb{E}\left[q \sim P(Q),\, o \sim \pi_{\theta_{old}}(O \mid q)\right]\frac{1}{|o|}\sum_{t=1}^{|o|}\min\left[\frac{\pi_{\theta}(o_{t} \mid q, o_{<t})}{\pi_{\theta_{old}}(o_{t} \mid q, o_{<t})}A_{t},\ \operatorname{clip}\left(\frac{\pi_{\theta}(o_{t} \mid q, o_{<t})}{\pi_{\theta_{old}}(o_{t} \mid q, o_{<t})}, 1-\epsilon, 1+\epsilon\right)A_{t}\right]\tag{1}$$

where $\pi_{\theta}$ and $\pi_{\theta_{old}}$ are the current and old policy models, and $q$, $o$ are questions and outputs sampled from the question dataset and the old policy $\pi_{\theta_{old}}$, respectively. $\epsilon$ is a clipping-related hyper-parameter introduced in PPO for stabilizing training. $A_t$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based on the rewards $\{r_{\geq t}\}$ and a learned value function $V_{\psi}$. Thus, in PPO, a value function needs to be trained alongside the policy model, and to mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a reference model in the reward at each token (Ouyang et al., 2022), i.e.,

$$r_{t}=r_{\psi}(q, o_{\leq t})-\beta \log\frac{\pi_{\theta}(o_{t} \mid q, o_{<t})}{\pi_{ref}(o_{t} \mid q, o_{<t})}\tag{2}$$

其中,$\pi_{\theta}$ 與 $\pi_{\theta_{old}}$ 分別表示當前與舊的策略模型(policy model),而 $q$、$o$ 則分別是從問題資料集與舊策略 $\pi_{\theta_{old}}$ 中採樣的問題與輸出。$\epsilon$ 是 PPO 為了穩定訓練所引入的與 clipping 相關的超參數。$A_t$ 是優勢(advantage),利用 Generalized Advantage Estimation (GAE) (Schulman et al., 2015),基於獎勵 $\{r_{\geq t}\}$ 與學習到的價值函數(value function)$V_{\psi}$ 計算而得。因此,在 PPO 中,價值函數必須與策略模型一起訓練;同時為了減少對獎勵模型的過度最佳化,標準做法是在每個 token 的獎勵中加入來自參考模型的 per-token KL penalty(Ouyang et al., 2022),也就是,

$$r_{t}=r_{\psi}(q, o_{\leq t})-\beta \log\frac{\pi_{\theta}(o_{t} \mid q, o_{<t})}{\pi_{ref}(o_{t} \mid q, o_{<t})}\tag{2}$$

where $r_{\psi}$ is the reward model, $\pi_{ref}$ is the reference model, which is usually the initial SFT model, and $\beta$ is the coefficient of the KL penalty.

其中 $r_{\psi}$ 是獎勵模型(reward model),$\pi_{ref}$ 是參考模型(reference model),通常是初始的 SFT 模型,而 $\beta$ 是 KL penalty 的係數。
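以下用一小段 PyTorch 風格的程式示意方程式 (2) 中帶有 per-token KL penalty 的獎勵。這裡假設獎勵模型只對整個回覆給出一個分數並加在最後一個 token 上(如後文所述的常見做法),是簡化的概念示意,並非論文的實際實作:

```python
import torch

def ppo_token_rewards(reward_score, logp_policy, logp_ref, beta=0.04):
    """reward_score: 獎勵模型對整個回覆的分數 r_psi(q, o)
    logp_policy / logp_ref: 形狀 (T,) 的每個 token 在策略模型與參考模型下的 log 機率"""
    kl_penalty = beta * (logp_policy - logp_ref)   # 方程式 (2) 的 per-token KL penalty
    rewards = -kl_penalty
    rewards[-1] = rewards[-1] + reward_score       # 假設只有最後一個 token 獲得獎勵模型的分數
    return rewards
```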

GSM8K and MATH are English benchmarks; MGSM-zh and CMATH are Chinese benchmarks.

| Model | Size | GSM8K | MATH | MGSM-zh | CMATH |
|---|---|---|---|---|---|
| **Chain-of-Thought Reasoning** | | | | | |
| *Closed-Source Model* | | | | | |
| Gemini Ultra | - | 94.4% | 53.2% | - | - |
| GPT-4 | - | 92.0% | 52.9% | - | 86.0% |
| Inflection-2 | - | 81.4% | 34.8% | - | - |
| GPT-3.5 | - | 80.8% | 34.1% | - | 73.8% |
| Gemini Pro | - | 86.5% | 32.6% | - | - |
| Grok-1 | - | 62.9% | 23.9% | - | - |
| Baichuan-3 | - | 88.2% | 49.2% | - | - |
| GLM-4 | - | 87.6% | 47.9% | - | - |
| *Open-Source Model* | | | | | |
| InternLM2-Math | 20B | 82.6% | 37.7% | - | - |
| Qwen | 72B | 78.9% | 35.2% | - | - |
| Math-Shepherd-Mistral | 7B | 84.1% | 33.0% | - | - |
| WizardMath-v1.1 | 7B | 83.2% | 33.0% | - | - |
| DeepSeek-LLM-Chat | 67B | 84.1% | 32.6% | 74.0% | 80.3% |
| MetaMath | 70B | 82.3% | 26.6% | 66.4% | 70.9% |
| SeaLLM-v2 | 7B | 78.2% | 27.5% | 64.8% | - |
| ChatGLM3 | 6B | 72.3% | 25.7% | - | - |
| WizardMath-v1.0 | 70B | 81.6% | 22.7% | 64.8% | 65.4% |
| DeepSeekMath-Instruct | 7B | 82.9% | 46.8% | 73.2% | 84.6% |
| DeepSeekMath-RL | 7B | 88.2% | 51.7% | 79.6% | 88.8% |
| **Tool-Integrated Reasoning** | | | | | |
| *Closed-Source Model* | | | | | |
| GPT-4 Code Interpreter | - | 97.0% | 69.7% | - | - |
| *Open-Source Model* | | | | | |
| InternLM2-Math | 20B | 80.7% | 54.3% | - | - |
| DeepSeek-LLM-Chat | 67B | 86.7% | 51.1% | 76.4% | 85.4% |
| ToRA | 34B | 80.7% | 50.8% | 41.2% | 53.4% |
| MAmmoTH | 70B | 76.9% | 41.8% | - | - |
| DeepSeekMath-Instruct | 7B | 83.7% | 57.4% | 72.0% | 84.3% |
| DeepSeekMath-RL | 7B | 86.7% | 58.8% | 78.4% | 87.6% |

Table 5 | Performance of Open- and Closed-Source models with both Chain-of-Thought and Tool-Integrated Reasoning on English and Chinese Benchmarks. Scores in gray denote majority votes with 32 candidates; the others are Top1 scores. DeepSeekMath-RL 7B beats all open-source models from 7B to 70B, as well as the majority of closed-source models. Although DeepSeekMath-RL 7B is only further trained on chain-of-thought-format instruction tuning data of GSM8K and MATH, it improves over DeepSeekMath-Instruct 7B on all benchmarks.

image

Figure 4 | Demonstration of PPO and our GRPO. GRPO foregoes the value model, instead estimating the baseline from group scores, significantly reducing training resources.

上圖說明了GRPO與PPO的差異在於,GRPO是利用採樣的樣本來計算,而PPO是利用一個價值函數模型來計算。

As the value function employed in PPO is typically another model of comparable size as the policy model, it brings a substantial memory and computational burden. Additionally, during RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. While in the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token. To address this, as shown in Figure 4, we propose Group Relative Policy Optimization (GRPO), which obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the following objective:

$$\mathcal{J}_{GRPO}(\theta)=\mathbb{E}\left[q \sim P(Q),\,\{o_{i}\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right]\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|}\left\{\min\left[\frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}\hat{A}_{i,t},\ \operatorname{clip}\left(\frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}, 1-\epsilon, 1+\epsilon\right)\hat{A}_{i,t}\right]-\beta\,\mathbb{D}_{KL}\left[\pi_{\theta}\,\|\,\pi_{ref}\right]\right\}\tag{3}$$

在 PPO 中使用的價值函數(value function)通常是另一個與策略模型(policy model)規模相當的模型,這帶來了相當可觀的記憶體與計算負擔。此外,在 RL 訓練過程中,價值函數在計算優勢(advantage)時被當作基線,用以降低方差。然而,在大型語言模型(LLM)的情境下,通常只有最後一個 token 會由獎勵模型賦予獎勵分數,這可能使得訓練一個在每個 token 上都準確的價值函數變得更加困難。為了解決這個問題,如 Figure 4 所示,我們提出 Group Relative Policy Optimization (GRPO),它不需要像 PPO 那樣額外近似一個價值函數,而是利用針對同一個問題所生成的多個樣本輸出的平均獎勵來作為基線。更具體地說,對於每個問題 $q$,GRPO 會從舊的策略 $\pi_{\theta_{old}}$ 中採樣出一組輸出 $\{o_1, o_2, \ldots, o_G\}$,然後透過最大化以下目標函式來最佳化策略模型:

$$\mathcal{J}_{GRPO}(\theta)=\mathbb{E}\left[q \sim P(Q),\,\{o_{i}\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right]\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|}\left\{\min\left[\frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}\hat{A}_{i,t},\ \operatorname{clip}\left(\frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}, 1-\epsilon, 1+\epsilon\right)\hat{A}_{i,t}\right]-\beta\,\mathbb{D}_{KL}\left[\pi_{\theta}\,\|\,\pi_{ref}\right]\right\}\tag{3}$$

where $\epsilon$ and $\beta$ are hyper-parameters, and $\hat{A}_{i,t}$ is the advantage calculated based on relative rewards of the outputs inside each group only, which will be detailed in the following subsections. The group relative way that GRPO leverages to calculate the advantages aligns well with the comparative nature of reward models, as reward models are typically trained on datasets of comparisons between outputs on the same question. Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of $\hat{A}_{i,t}$.

其中 $\epsilon$ 與 $\beta$ 是超參數,而 $\hat{A}_{i,t}$ 是僅基於每個群組內輸出的相對獎勵所計算出的優勢(advantage),後續小節會進一步詳述。GRPO 以群組相對的方式計算優勢,與獎勵模型的比較性質非常吻合,因為獎勵模型通常是在針對同一問題之輸出間的比較資料集上訓練的。此外,GRPO 不是在獎勵中加入 KL penalty,而是直接將訓練中策略與參考策略之間的 KL divergence 加到損失函數中作為正則化,避免使 $\hat{A}_{i,t}$ 的計算變得複雜。


And different from the KL penalty term used in (2), we estimate the KL divergence with the following unbiased estimator (Schulman, 2020):

$$\mathbb{D}_{KL}\left[\pi_{\theta}\,\|\,\pi_{ref}\right]=\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}-\log\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}-1\tag{4}$$

與(2)中所用的 KL 正規化項不同,我們使用 Schulman (2020) 提出的無偏性估計來估算 KL divergence:

$$\mathbb{D}_{KL}\left[\pi_{\theta}\,\|\,\pi_{ref}\right]=\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}-\log\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}-1\tag{4}$$

which is guaranteed to be positive.

其保證始終為正值。
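以下是方程式 (3) 與 (4) 的一個 PyTorch 風格最小示意(非官方實作):對同一問題的 G 個採樣輸出逐 token 計算 clipped 比率項,並以 (4) 的無偏估計量直接在損失中加入 KL 正則化;變數形狀與名稱皆為假設:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask, eps=0.2, beta=0.04):
    """logp_new / logp_old / logp_ref: (G, T) 每個 token 在當前、舊、參考策略下的 log 機率
    advantages: (G, T) 群組相對優勢 A_hat;mask: (G, T) 有效 token 為 1,padding 為 0"""
    ratio = torch.exp(logp_new - logp_old)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    policy_term = torch.min(surr1, surr2)                    # 方程式 (3) 的 clipped 目標

    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1        # 方程式 (4) 的無偏 KL 估計量

    per_token = policy_term - beta * kl
    per_output = (per_token * mask).sum(dim=1) / mask.sum(dim=1)   # 先在每個輸出內對 token 取平均
    return -per_output.mean()                                      # 再對群組取平均;取負號供梯度下降最小化
```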

4.1.2. Outcome Supervision RL with GRPO

Formally, for each question $q$, a group of outputs $\{o_1, o_2, \ldots, o_G\}$ are sampled from the old policy model $\pi_{\theta_{old}}$. A reward model is then used to score the outputs, yielding $G$ rewards $\mathbf{r}=\{r_1, r_2, \ldots, r_G\}$ correspondingly. Subsequently, these rewards are normalized by subtracting the group average and dividing by the group standard deviation. Outcome supervision provides the normalized reward at the end of each output $o_i$ and sets the advantages $\hat{A}_{i,t}$ of all tokens in the output as the normalized reward, i.e., $\hat{A}_{i,t}=\tilde{r}_{i}=\frac{r_{i}-\operatorname{mean}(\mathbf{r})}{\operatorname{std}(\mathbf{r})}$, and then optimizes the policy by maximizing the objective defined in equation (3).

正式地說,對每個問題 $q$,我們從舊的策略模型 $\pi_{\theta_{old}}$ 中採樣出一組輸出 $\{o_1, o_2, \ldots, o_G\}$。接著使用獎勵模型對這些輸出進行評分,得到相對應的 $G$ 個獎勵 $\mathbf{r}=\{r_1, r_2, \ldots, r_G\}$。然後,這些獎勵會先減去群組平均值、再除以群組標準差來進行正規化。Outcome supervision 在每個輸出 $o_i$ 的結尾提供正規化後的獎勵,並將該輸出中所有 tokens 的優勢值(advantages)$\hat{A}_{i,t}$ 設定為此正規化獎勵,也就是 $\hat{A}_{i,t}=\tilde{r}_{i}=\frac{r_{i}-\operatorname{mean}(\mathbf{r})}{\operatorname{std}(\mathbf{r})}$,接著透過最大化方程式 (3) 所定義的目標函式來最佳化策略(policy)。

Outcome supervision,結果監督?字面直接翻譯感覺不是很通暢,
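以下以幾行 NumPy 程式示意 outcome supervision 的優勢計算:同一問題的 G 個輸出各得到一個分數,經群組正規化後,該輸出中的每個 token 都共用同一個優勢值(示意用,非論文原始碼):

```python
import numpy as np

def outcome_advantages(rewards, eps=1e-8):
    """rewards: 同一問題 G 個輸出各自的獎勵分數 r_1..r_G。"""
    r = np.asarray(rewards, dtype=np.float64)
    normalized = (r - r.mean()) / (r.std() + eps)   # 減去群組平均、除以群組標準差
    return normalized                               # 第 i 個輸出中所有 token 的 A_hat_{i,t} 皆為 normalized[i]

print(outcome_advantages([1.0, 0.0, 0.0, 1.0]))     # 例:[ 1. -1. -1.  1.]
```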

4.1.3. Process Supervision RL with GRPO

Outcome supervision only provides a reward at the end of each output, which may not be sufficient and efficient to supervise the policy in complex mathematical tasks. Following Wang et al. (2023b), we also explore process supervision, which provides a reward at the end of each reasoning step. Formally, given the question $q$ and $G$ sampled outputs $\{o_1, o_2, \ldots, o_G\}$, a process reward model is used to score each step of the outputs, yielding corresponding rewards: $\mathbf{R}=\{\{r_{1}^{index(1)}, \ldots, r_{1}^{index(K_{1})}\}, \ldots, \{r_{G}^{index(1)}, \ldots, r_{G}^{index(K_{G})}\}\}$, where $index(j)$ is the end token index of the $j$-th step, and $K_i$ is the total number of steps in the $i$-th output. We also normalize these rewards with the average and the standard deviation, i.e., $\tilde{r}_{i}^{index(j)}=\frac{r_{i}^{index(j)}-\operatorname{mean}(\mathbf{R})}{\operatorname{std}(\mathbf{R})}$. Subsequently, the process supervision calculates the advantage of each token as the sum of the normalized rewards from the following steps, i.e., $\hat{A}_{i,t}=\sum_{index(j) \geq t}\tilde{r}_{i}^{index(j)}$, and then optimizes the policy by maximizing the objective defined in equation (3).

Outcome supervision 只會在每個輸出結束時提供獎勵,對於複雜的數學任務而言,這樣的監督可能不夠充分也不夠有效。依循 Wang et al. (2023b),我們也探索了 process supervision(過程監督),這種方法會在每一個推理步驟結束時提供獎勵。正式地說,給定問題 $q$ 和 $G$ 個採樣輸出 $\{o_1, o_2, \ldots, o_G\}$,會使用一個 process reward model 對輸出的每個步驟進行評分,產生相對應的獎勵:$\mathbf{R}=\{\{r_{1}^{index(1)}, \ldots, r_{1}^{index(K_{1})}\}, \ldots, \{r_{G}^{index(1)}, \ldots, r_{G}^{index(K_{G})}\}\}$,其中 $index(j)$ 是第 $j$ 步的結尾 token 索引,而 $K_i$ 是第 $i$ 個輸出中步驟的總數。我們同樣使用均值和標準差將這些獎勵正規化,也就是 $\tilde{r}_{i}^{index(j)}=\frac{r_{i}^{index(j)}-\operatorname{mean}(\mathbf{R})}{\operatorname{std}(\mathbf{R})}$。隨後,process supervision 將每個 token 的優勢值(advantage)計算為其後續步驟之正規化獎勵的總和,也就是 $\hat{A}_{i,t}=\sum_{index(j) \geq t}\tilde{r}_{i}^{index(j)}$,並透過最大化方程式 (3) 所定義的目標來最佳化策略(policy)。
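以下示意 process supervision 中單一輸出的 token 優勢計算:每個推理步驟在其結尾 token 處獲得正規化獎勵,而每個 token 的優勢是其位置之後(含該步)所有步驟獎勵的總和;僅為示意,變數名稱為假設:

```python
import numpy as np

def process_advantages(step_rewards, step_end_indices, seq_len):
    """step_rewards: 某一輸出中 K 個步驟的正規化獎勵 r~^{index(j)}
    step_end_indices: 各步驟的結尾 token 索引 index(j);seq_len: 該輸出的 token 數"""
    adv = np.zeros(seq_len)
    for r, end in zip(step_rewards, step_end_indices):
        adv[: end + 1] += r    # 所有 t <= index(j) 的 token 都會累加第 j 步的獎勵
    return adv

# 例:三個步驟分別結束於 token 3、7、11,正規化獎勵為 0.5、-0.2、1.0
print(process_advantages([0.5, -0.2, 1.0], [3, 7, 11], seq_len=12))
```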

4.1.4. Iterative RL with GRPO

As the reinforcement learning training process progresses, the old reward model may not be sufficient to supervise the current policy model. Therefore, we also explore the iterative RL with GRPO. As shown in Algorithm 1, in iterative GRPO, we generate new training sets for the reward model based on the sampling results from the policy model and continually train the old reward model using a replay mechanism that incorporates 10% of historical data. Then, we set the reference model as the policy model, and continually train the policy model with the new reward model.

隨著強化學習訓練過程的進展,舊的獎勵模型可能無法充分指導當前的策略模型。因此,我們也探索結合 GRPO 的迭代式強化學習。如 Algorithm 1 所示,在迭代式 GRPO 中,我們根據從策略模型採樣的結果為獎勵模型生成新的訓練資料集,並利用包含 10% 歷史資料的重放機制(replay mechanism)持續訓練舊的獎勵模型。接著,我們將參考模型設為當前的策略模型,並利用新的獎勵模型持續訓練策略模型。

4.2. Training and Evaluating DeepSeekMath-RL

We conduct RL based on DeepSeekMath-Instruct 7B. The training data of RL are chain-of-thought-format questions related to GSM8K and MATH from the SFT data, which consists of around 144K questions. We exclude other SFT questions to investigate the impact of RL on benchmarks that lack data throughout the RL phase. We construct the training set of reward models following (Wang et al., 2023b). We train our initial reward model based on the DeepSeekMath-Base 7B with a learning rate of 2e-5. For GRPO, we set the learning rate of the policy model as 1e-6. The KL coefficient is 0.04. For each question, we sample 64 outputs. The max length is set to 1024, and the training batch size is 1024. The policy model only has a single update following each exploration stage. We evaluate DeepSeekMath-RL 7B on benchmarks following DeepSeekMath-Instruct 7B. For DeepSeekMath-RL 7B, GSM8K and MATH with chain-of-thought reasoning can be regarded as in-domain tasks and all the other benchmarks can be regarded as out-of-domain tasks.

我們基於 DeepSeekMath-Instruct 7B 實施強化學習(RL)。RL 的訓練資料是來自 SFT 資料中與 GSM8K 和 MATH 相關的思維鏈形式(chain-of-thought-format)問題,約有 144K 個問題。我們排除了其它 SFT 問題,以研究 RL 對那些在整個 RL 階段都缺乏訓練資料的基準測試的影響。我們依循 (Wang et al., 2023b) 建立獎勵模型的訓練集。初始的獎勵模型是以 DeepSeekMath-Base 7B 訓練的,學習率設定為 2e-5。在 GRPO 中,策略模型的學習率設定為 1e-6。KL 係數設為 0.04。針對每個問題,我們會採樣64個輸出。最大長度設置為 1024,訓練批次大小為 1024。策略模型在每一次探索階段後都只會進行一次更新。我們沿用 DeepSeekMath-Instruct 7B 所使用的基準測試來評估 DeepSeekMath-RL 7B 的表現。對於 DeepSeekMath-RL 7B,GSM8K 和 MATH 的思維鏈推理可以視為in-domain tasks,而所有其它的基準測試則被視為out-of-domain tasks。
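把上面段落提到的超參數整理成一份設定檔示意(僅為筆記用途,鍵名是自行命名的假設,並非官方程式碼):

```python
# 依本節敘述整理的 GRPO 訓練設定(鍵名為自行命名)
grpo_training_config = {
    "init_policy": "DeepSeekMath-Instruct 7B",    # RL 的起點
    "reward_model_init": "DeepSeekMath-Base 7B",  # 初始獎勵模型的底座
    "reward_model_lr": 2e-5,
    "policy_lr": 1e-6,
    "kl_coefficient": 0.04,
    "samples_per_question": 64,    # 每個問題採樣 64 個輸出
    "max_length": 1024,
    "train_batch_size": 1024,
    "updates_per_exploration": 1,  # 每次探索後 policy 只更新一次
    "num_questions": 144_000,      # GSM8K + MATH 相關的 CoT 問題數
}
```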

Table 5 demonstrates the performance of open- and closed-source models with both chain-of-thought and tool-integrated reasoning on English and Chinese benchmarks. We find that: 1) DeepSeekMath-RL 7B attains accuracies of 88.2% and 51.7% on GSM8K and MATH, respectively, utilizing chain-of-thought reasoning. This performance surpasses that of all open-source models in the 7B to 70B range, as well as the majority of closed-source models. 2) Crucially, DeepSeekMath-RL 7B is only trained on chain-of-thought-format instruction tuning data of GSM8K and MATH, starting from DeepSeekMath-Instruct 7B. Despite the constrained scope of its training data, it outperforms DeepSeekMath-Instruct 7B across all evaluation metrics, showcasing the effectiveness of reinforcement learning.

Table 5給出了在英文和中文基準測試上,開源與閉源模型以思維鏈(chain-of-thought)及工具整合(tool-integrated)推理方式的效能表現。我們發現:1) DeepSeekMath-RL 7B 利用思維鏈推理,在 GSM8K 和 MATH 上分別達到了88.2%和51.7%的準確率,其效能超過了所有 7B 到 70B 之間的開源模型以及大多數閉源模型。2) 很重要的一件事就是,DeepSeekMath-RL 7B 是從 DeepSeekMath-Instruct 7B 出發,僅在 GSM8K 和 MATH 的chain-of-thought-format instruction tuning data上進行訓練。儘管其訓練資料的範圍有限,但它在所有評估指標上都超越了 DeepSeekMath-Instruct 7B,展示了強化學習的有效性。

5. Discussion

In this section, we will share our findings in pre-training and RL experiments.

在本節中,我們要來分享我們在預訓練與強化學習實驗中發現的一些事情。

5.1. Lessons Learnt in Pre-Training

We first share our experience in pre-training. Unless otherwise specified, we will adhere to the training settings outlined in Section 2.2.1. It is worth noting that, when referring to the DeepSeekMath Corpus in this section, we use an 89B-token dataset from the second iteration of the data collection process.

我們首先分享關於預訓練的經驗。如果沒有特別說明,我們將遵循Section 2.2.1中所述的訓練設置。值得注意的是,本節中提到的DeepSeekMath語料庫,指的是資料收集過程第二輪迭代所得的 89B-token 資料集。

5.1.1. Code Training Benefits Mathematical Reasoning

A popular yet unverified hypothesis suggests that code training improves reasoning. We attempt to offer a partial response to this, particularly within the mathematical domain: code training improves models' ability to do mathematical reasoning both with and without tool use.

一種廣為流傳但尚未經驗證的假設指出,程式碼訓練可以提升推理能力。我們嘗試針對這一點給出部分的回應,特別是在數學領域內:程式碼訓練有助於模型在數學推理上表現得更佳,無論是否使用工具。

| Training Setting | General Tokens | Code Tokens | Math Tokens | GSM8K (w/o Tool Use) | MATH (w/o Tool Use) | CMATH (w/o Tool Use) | GSM8K+Python (w/ Tool Use) | MATH+Python (w/ Tool Use) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No Continual Training | - | - | - | 2.9% | 3.0% | 12.3% | 2.7% | 2.3% |
| **Two-Stage Training** | | | | | | | | |
| Stage 1: General Training | 400B | - | - | 2.9% | 3.2% | 14.8% | 3.3% | 2.3% |
| Stage 2: Math Training | - | - | 150B | 19.1% | 14.4% | 37.2% | 14.3% | 6.7% |
| Stage 1: Code Training | - | 400B | - | 5.9% | 3.6% | 19.9% | 12.4% | 10.0% |
| Stage 2: Math Training | - | - | 150B | 21.9% | 15.3% | 39.7% | 17.4% | 9.4% |
| **One-Stage Training** | | | | | | | | |
| Math Training | - | - | 150B | 20.5% | 13.1% | 37.6% | 11.4% | 6.5% |
| Code & Math Mixed Training | - | 400B | 150B | 17.6% | 12.1% | 36.3% | 19.7% | 13.5% |

Table 6 | Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1.3B, and evaluate its mathematical reasoning performance without and with tool use via few-shot chain-of-thought prompting and few-shot program-of-thought prompting, respectively.

To study how code training affects mathematical reasoning, we experimented with the following two-stage training and one-stage training settings:

為了研究程式的訓練究竟如何對數學推理產生影響,我們在以下兩種不同的訓練設置下進行了實驗:一是兩階段訓練,二是單階段訓練。

Two-Stage Training

  • Code Training for 400B Tokens → Math Training for 150B Tokens : We train DeepSeekLLM 1.3B for 400B code tokens followed by 150B math tokens;

  • Code Training for 400B Tokens → Math Training for 150B Tokens:對於 DeepSeekLLM 1.3B 模型進行訓練,我們首先使用了 400B code tokens,隨後再用 150B math tokens。

  • General Training for 400B Tokens → Math Training for 150B Tokens : As a control experiment, we also experiment with general tokens (sampled from a large-scale general corpus created by DeepSeek-AI) instead of code tokens in the first stage of training, in an attempt to investigate the advantages of code tokens over general tokens in improving mathematical reasoning.

  • General Training for 400B Tokens → Math Training for 150B Tokens:作為對照實驗,我們也嘗試在訓練的第一階段改用一般的general tokens(從 DeepSeek-AI 所建立的大型通用語料庫中採樣而得),而不是code tokens,試圖研究code tokens在提升數學推理能力方面相較於一般tokens的優勢。

One-Stage Training

  • Math Training for 150B Tokens : We train DeepSeek-LLM 1.3B for 150B math tokens;

  • Math Training for 150B Tokens:我們用150B的math tokens來訓練DeepSeek-LLM 1.3B;

  • Training on a mixture of 400B Code Tokens and 150B Math Tokens : Math training following code training degrades coding performance. We investigate whether code tokens, when mixed with math tokens for one-stage training, would still improve mathematical reasoning and also alleviate the problem of catastrophic forgetting.

  • Training on a mixture of 400B Code Tokens and 150B Math Tokens:在程式碼訓練之後進行數學訓練會降低原有的程式設計能力。我們探討在進行單階段訓練中將code tokens與math tokens混合時,是否仍然能夠提升數學推理能力並同時減輕災難性遺忘的問題。

Results Table 6 and Table 7 demonstrate the downstream performance under different training settings.

Results Table 6和Table 7給出了在不同訓練設置下的後續效能結果。

Code training benefits program-aided mathematical reasoning, both under the two-stage training and one-stage training settings. As shown in Table 6, under the two-stage training setting, code training alone already significantly enhances the ability to solve GSM8K and MATH problems using Python. Math training in the second stage yields further improvements. Interestingly, under the one-stage training setting, mixing code tokens and math tokens effectively mitigates the issue of catastrophic forgetting that arises from two-stage training, and also synergizes coding (Table 7) and program-aided mathematical reasoning (Table 6).

不論是兩階段訓練還是單階段訓練,程式訓練都有利於程式輔助的數學推理。如Table 6所示,在兩階段的訓練設置下,單純的程式訓練就能明顯提升使用Python解決GSM8K和MATH問題的能力。第二階段的數學訓練進一步加強了這種能力。有趣的是,在單階段訓練設置下,混合code tokens與math tokens有效地減輕了由雙階段訓練引起的災難性遺忘的問題,而且同時加強程式設計(Table 7)和程式輔助的數學推理(Table 6)。

| Training Setting | General Tokens | Code Tokens | Math Tokens | MMLU | BBH | HumanEval (Pass@1) | MBPP (Pass@1) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No Continual Training | - | - | - | 24.5% | 28.1% | 12.2% | 13.0% |
| **Two-Stage Training** | | | | | | | |
| Stage 1: General Training | 400B | - | - | 25.9% | 27.7% | 15.2% | 13.6% |
| Stage 2: Math Training | - | - | 150B | 33.1% | 32.7% | 12.8% | 13.2% |
| Stage 1: Code Training | - | 400B | - | 25.0% | 31.5% | 25.0% | 40.0% |
| Stage 2: Math Training | - | - | 150B | 36.2% | 35.3% | 12.2% | 17.0% |
| **One-Stage Training** | | | | | | | |
| Math Training | - | - | 150B | 32.3% | 32.5% | 11.6% | 13.2% |
| Code & Math Mixed Training | - | 400B | 150B | 33.5% | 35.6% | 29.3% | 39.4% |

Table 7 | Investigation of how different settings of code and math training affect model performance of language understanding, reasoning, and coding. We experiment with DeepSeek-LLM 1.3B. We evaluate the models on MMLU and BBH using few-shot chain-of-thought prompting. On HumanEval and MBPP, we conduct zero-shot and few-shot evaluations, respectively.

| Model | ArXiv Corpus | GSM8K | MATH | OCW | SAT | MMLU STEM | CMATH | Gaokao MathCloze | Gaokao MathQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.3B | No Math Training | 2.9% | 3.0% | 2.9% | 15.6% | 19.5% | 12.3% | 0.8% | 17.9% |
| 1.3B | MathPile | 2.7% | 3.3% | 2.2% | 12.5% | 15.7% | 1.2% | 0.0% | 2.8% |
| 1.3B | ArXiv-RedPajama | 3.3% | 3.4% | 4.0% | 9.4% | 9.0% | 7.4% | 0.8% | 2.3% |
| DeepSeek-Coder-Base-v1.5 | No Math Training | 29.0% | 12.5% | 6.6% | 40.6% | 38.1% | 45.9% | 5.9% | 21.1% |
| DeepSeek-Coder-Base-v1.5 | MathPile | 23.6% | 11.5% | 7.0% | 46.9% | 35.8% | 37.9% | 4.2% | 25.6% |
| DeepSeek-Coder-Base-v1.5 | ArXiv-RedPajama | 28.1% | 11.1% | 7.7% | 50.0% | 35.2% | 42.6% | 7.6% | 24.8% |

(English benchmarks: GSM8K, MATH, OCW, SAT, MMLU STEM; Chinese benchmarks: CMATH, Gaokao MathCloze, Gaokao MathQA.)

Table 8 | Effect of math training on different arXiv datasets. Model performance is evaluated with few-shot chain-of-thought prompting.

| ArXiv Corpus | miniF2F-valid | miniF2F-test |
| --- | --- | --- |
| No Math Training | 20.1% | 21.7% |
| MathPile | 16.8% | 16.4% |
| ArXiv-RedPajama | 14.8% | 11.9% |

Table 9 | Effect of math training on different arXiv corpora, the base model being DeepSeek-Coder-Base-v1.5 7B. We evaluate informal-to-formal proving in Isabelle.

Code training also improves mathematical reasoning without tool use. Under the two-stage training setting, the initial stage of code training already results in moderate enhancements. It also boosts the efficiency of the subsequent math training, eventually leading to the best performance. However, combining code tokens and math tokens for one-stage training compromises mathematical reasoning without tool use. One conjecture is that DeepSeek-LLM 1.3B, due to its limited scale, lacks the capacity to fully assimilate both code and mathematical data simultaneously.

程式碼訓練也能提升在沒有使用工具的情況下的數學推理能力。在兩階段訓練設置中,第一階段的程式碼訓練就已經帶來一定程度的提升,同時也提高了後續數學訓練的效率,最終達到最佳效能。然而,在單階段訓練中將code tokens和math tokens結合,會降低不使用工具時的數學推理能力。有一個猜測是這樣的,DeepSeek-LLM 1.3B 因其規模有限,而不足以同時全面吸收程式碼與數學資料。

5.1.2. ArXiv Papers Seem Ineffective in Improving Mathematical Reasoning

ArXiv papers are commonly included as a component of math pre-training data (Azerbayev et al., 2023; Lewkowycz et al., 2022a; Polu and Sutskever, 2020; Wang et al., 2023c). However, detailed analysis regarding their impact on mathematical reasoning has not been extensively conducted. Perhaps counter-intuitively, according to our experiments, arXiv papers seem ineffective in improving mathematical reasoning. We experiment with models of different sizes, including DeepSeek-LLM 1.3B and DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), using arXiv corpora that underwent varied processing pipelines:

ArXiv論文通常被用作數學預訓練資料的一部分(Azerbayev et al., 2023; Lewkowycz et al., 2022a; Polu and Sutskever, 2020; Wang et al., 2023c)。然而,針對其對數學推理能力影響的詳細分析並不多。這麼說也許有點違反直覺,不過根據我們的實驗結果,ArXiv論文似乎對提升數學推理能力沒有效果。我們在不同規模的模型上進行實驗,包括DeepSeek-LLM 1.3B和DeepSeek-Coder-Base-v1.5 7B(Guo et al., 2024),並使用經過不同處理流程的ArXiv語料庫:

  • MathPile (Wang et al., 2023c): an 8.9B-token corpus developed with cleaning and filtering heuristic rules, over 85% of which are scientific arXiv papers;

  • MathPile (Wang et al., 2023c):使用清洗與過濾的啟發式規則(heuristic rules)建立的8.9B-token語料庫,其中超過 85% 的內容來自科學領域的 arXiv 論文。

  • ArXiv-RedPajama (Computer, 2023): the entirety of arXiv LaTeX files with preambles, comments, macros, and bibliographies removed, totaling 28.0B tokens.

  • ArXiv-RedPajama (Computer, 2023):從 arXiv LaTeX 文件中移除前言、註解、巨集指令和參考文獻後,共計 28.0B 個 tokens。

In our experiments, we separately train DeepSeek-LLM 1.3B for 150B tokens and DeepSeek-Coder-Base-v1.5 7B for 40B tokens on each arXiv corpus. It seems that arXiv papers are ineffective in improving mathematical reasoning. When trained on an arXiv-only corpus, both models display no notable improvements or even deterioration across various mathematical benchmarks of different complexities employed in this study. These benchmarks include quantitative reasoning datasets like GSM8K and MATH (Table 8), multiple-choice challenges like MMLU-STEM (Table 8), and formal mathematics like miniF2F (Table 9).

在我們的實驗中,我們分別在每個 arXiv 語料庫上用 150B tokens 訓練 DeepSeek-LLM 1.3B、用 40B tokens 訓練 DeepSeek-Coder-Base-v1.5 7B。結果來看吼,arXiv 論文對於提升數學推理能力似乎沒有效果。當模型僅在 arXiv 語料庫上訓練時,兩個模型在本研究使用的各種複雜度的數學基準測試中均未顯示出顯著改善,甚至有所惡化。這些基準測試包括數理邏輯資料集如 GSM8K 和 MATH(Table 8)、多選題挑戰如 MMLU-STEM(Table 8)以及形式數學如 miniF2F(Table 9)。

However, this conclusion has its limitations and should be taken with a grain of salt. We have not yet studied:

然而,這個結論有其侷限性,應持保留態度。我們尚未研究的包括:

  • The impact of arXiv tokens on specific math-related tasks not included in this research, such as informalization of theorems which is to convert formal statements or proofs to their informal versions;

  • arXiv tokens 對本研究未涵蓋的特定數學相關任務的影響,像是定理的非正式化(informalization),也就是將正式的陳述或證明轉換為其非正式版本;

在人工智慧和深度學習領域中,這種轉換有助於自然語言處理模型更好地理解和生成數學內容。非正式化的過程涉及將數學定理或證明中技術性、符號豐富的表述轉換成更易於閱讀和理解的自然語言,這樣可以促進數學知識的普及與傳播。

  • The effect of arXiv tokens when combined with other types of data;

  • 將arXiv tokens跟其它類型的資料合併一起使用時的影響

  • Whether the benefits of arXiv papers would manifest themselves at a larger model scale.

  • 在更大模型規模下是否能觀察到 arXiv 論文所帶來的好處。

Thus, further exploration is required, which we leave for future studies.

因此,這部分還需要進一步的探索,我們將留待未來研究。

5.2. Insights of Reinforcement Learning

5.2.1. Towards to a Unified Paradigm

In this section, we provide a unified paradigm to analyze different training methods, such as SFT, RFT, DPO, PPO, GRPO, and further conduct experiments to explore the factors of the unified paradigm. Generally, the gradient with respect to the parameter $\theta$ of a training method can be written as:

$$\nabla_\theta \mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}_{\underbrace{(q,o)\sim\mathcal{D}}_{\text{Data Source}}}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\underbrace{GC_{\mathcal{A}}(q,o,t,\pi_{rf})}_{\text{Gradient Coefficient}}\nabla_\theta\log\pi_\theta(o_t \mid q, o_{<t})\right) \tag{5}$$

在本節中,我們提出一種統一的範式來分析各種不同的訓練方法,像是 SFT、RFT、DPO、PPO 和 GRPO 等,並進行實驗以探索此統一範式中的各個因子(factors)。通常情況下,一種訓練方法中相對於參數 $\theta$ 的梯度可表示為:

$$\nabla_\theta \mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}_{\underbrace{(q,o)\sim\mathcal{D}}_{\text{Data Source}}}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\underbrace{GC_{\mathcal{A}}(q,o,t,\pi_{rf})}_{\text{Gradient Coefficient}}\nabla_\theta\log\pi_\theta(o_t \mid q, o_{<t})\right) \tag{5}$$

There exist three key components: 1) Data Source $\mathcal{D}$, which determines the training data; 2) Reward Function $\pi_{rf}$, which is the source of the training reward signal; 3) Algorithm $\mathcal{A}$, which processes the training data and the reward signal to the gradient coefficient $GC$ that determines the magnitude of the penalty or reinforcement for the data. We analyze several representative methods based on such a unified paradigm:

這邊有著三個關鍵的元件:1) 資料來源 $\mathcal{D}$,這決定了訓練資料;2) 獎勵函數 $\pi_{rf}$,是訓練獎勵信號的來源;3) 演算法 $\mathcal{A}$:處理訓練資料和獎勵信號,以此生成梯度係數 $GC$ 來決定資料所受到的懲罰或獎勵的幅度。我們基於此統一範式分析了幾個代表性方法:

  • Supervised Fine-tuning (SFT) : SFT fine-tunes pretrained model on human selected SFT data.

  • Supervised Fine-tuning (SFT):基於人工選定的 SFT data 來微調預訓練模型。

  • Rejection Sampling Fine-tuning (RFT) : RFT further fine-tunes the SFT model on the filtered outputs sampled from the SFT model based on SFT questions. RFT filters the outputs based on the correctness of their answers.

  • Rejection Sampling Fine-tuning (RFT):RFT 進一步在經過過濾的採樣輸出上微調 SFT model,這些輸出是根據 SFT questions 從 SFT model 採樣而得,而 RFT 依據答案的正確性來過濾這些輸出。

  • Direct Preference Optimization (DPO) : DPO further refines the SFT model by fine-tuning it on augmented outputs sampled from the SFT model, using pair-wise DPO loss.

  • Direct Preference Optimization (DPO):DPO 使用 pair-wise DPO loss,在從 SFT model 採樣而得的增強輸出資料上進行微調,進一步優化 SFT 模型。

  • Online Rejection Sampling Fine-tuning (Online RFT) : Different from RFT, Online RFT initiates the policy model using the SFT model and refines it by fine-tuning with the augmented outputs sampled from the real-time policy model.

  • Online Rejection Sampling Fine-tuning (Online RFT):不同於 RFT,Online RFT 使用 SFT model 初始化策略模型,並通過使用從 real-time policy model 中採樣的增強輸出進行微調來對其優化。

  • PPO/GRPO : PPO/GRPO initializes the policy model using the SFT model and reinforces it with the outputs sampled from the real-time policy model.

  • PPO/GRPO:PPO/GRPO 利用 SFT model 初始化策略模型,並通過從 real-time policy model 中採樣的輸出來加強其訓練。

We summarize the components of these methods in Table 10. Please refer to Appendix A.1 for a more detailed derivation process.

我們在Table 10中總結了這些方法的各個元件。請參閱 Appendix A.1 以獲取更詳細的推導過程。
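照著 Equation 5 的觀點,這些方法其實只差在「資料來源」與「梯度係數 $GC$」的定義。下面用一段很簡短的示意程式(假設性的寫法,非官方實作)把這個統一形式寫出來,方便對照 Table 10:

```python
def unified_policy_gradient_loss(token_logprobs, gradient_coefficients):
    """token_logprobs: 每個 token 的 log pi_theta(o_t | q, o_<t)
    gradient_coefficients: 對應的 GC_A(q, o, t, pi_rf),依方法而定:
      - SFT:全為 1
      - RFT / Online RFT:答案正確為 1、錯誤為 0(差別在輸出是離線或即時採樣)
      - PPO:A_t(由 GAE 計算)
      - GRPO:A_hat_{i,t} + beta * (pi_ref / pi_theta - 1)
    最大化 (1/|o|) * sum(GC * logprob) 等價於最小化其負值。"""
    assert len(token_logprobs) == len(gradient_coefficients)
    weighted = [gc * lp for gc, lp in zip(gradient_coefficients, token_logprobs)]
    return -sum(weighted) / len(weighted)
```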


Table 10 | The data source and gradient coefficient of different methods.

$P_{sft}$ denotes the data distribution of supervised fine-tuning datasets. $\pi_{\theta_{sft}}$ and $\pi_\theta$ denote the supervised fine-tuned model and the real-time policy model during the online training process, respectively.


Figure 5 | Performance of the DeepSeekMath-Instruct 1.3B model, which was further trained using various methods, on two benchmarks.

Observation about Data Source We divide the data source into two categories, online sampling, and offline sampling. Online sampling denotes that the training data is from the exploration results of the real-time training policy model, while offline sampling denotes that the training data is from the sampling results of the initial SFT model. RFT and DPO follow the offline style, while Online RFT and GRPO follow the online style.

Observation about Data Source 我們將資料來源分為兩大類,也就是線上採樣(online sampling)和離線採樣(offline sampling)。線上採樣指訓練資料是來自於即時訓練策略模型的探索結果,而離線採樣則是指訓練資料來自於初始的 SFT model 的採樣結果。RFT和DPO使用離線的方式,而Online RFT和GRPO則是使用線上的方式。


Figure 6 | Performance of iterative reinforcement learning with DeepSeekMath-Instruct 7B on two benchmarks.

As shown in Figure 5, we find that the Online RFT significantly outperforms RFT on two benchmarks. Specifically, Online RFT is comparable to RFT in the early stage of training but gains an absolute advantage in the later stage, demonstrating the superiority of online training. This is intuitive, as in the initial stage, the actor and the SFT model exhibit close resemblance, with the sampled data revealing only minor differences. In the later stage, however, the data sampled from the actor will exhibit more significant differences, and real-time data sampling will offer greater advantages.

如Figure 5所示,我們發現到,Online RFT 在兩個基準測試上明顯優於 RFT。具體來說,Online RFT在訓練的早期階段與RFT表現相近,不過在後期階段則是取得了絕對領先,這說明了線上訓練的優勢。這也是蠻直觀的,在初始階段,actor和SFT model之間表現出密切的相似性,採樣到的資料只顯示出微小的差異。不過到了後期階段,從actor採樣到的資料會展現出更大的差異性,這時候即時的資料採樣就會提供更大的優勢。

Observation about Gradient Coefficient The algorithm processes the reward signal to the gradient coefficient to update the model parameter. We divide the reward function as 'Rule' and 'Model' in our experiments. Rule refers to judging the quality of a response based on the correctness of the answer, and Model denotes that we train a reward model to score each response. The training data of the reward model is based on the rule judgment. Equations 10 and 21 highlight a key difference between GRPO and Online RFT: GRPO uniquely adjusts its gradient coefficient based on the reward value provided by the reward model. This allows for differential reinforcement and penalization of responses according to their varying magnitudes. In contrast, Online RFT lacks this feature; it does not penalize incorrect responses and uniformly reinforces all responses with correct answers at the same level of intensity.

Observation about Gradient Coefficient 演算法會把獎勵信號處理成梯度係數,用以更新模型參數。在我們的實驗中,我們將獎勵函數分為「Rule」和「Model」。「Rule」指的是基於答案正確性來判斷響應的品質,而「Model」表示我們訓練一個獎勵模型來對每個響應評分,且獎勵模型的訓練資料是基於規則判斷而來。方程式10和21突顯出GRPO與Online RFT之間的一個關鍵差異:GRPO獨特地根據獎勵模型提供的獎勵值來調整其梯度係數,這使得可以依據響應的不同程度來差異化地強化和懲罰。相反,Online RFT缺乏此功能;它不會懲罰錯誤的響應,並對所有答案正確的響應給予相同程度的強化。

As demonstrated in Figure 5, GRPO surpasses online RFT, thereby highlighting the efficiency of altering positive and negative gradient coefficients. In addition, GRPO+PS shows superior performance compared to GRPO+OS, indicating the benefits of using fine-grained, step-aware gradient coefficients. Furthermore, we explore the iterative RL, in our experiments, we conduct two rounds of iteration. As shown in Figure 6, we notice that the iterative RL significantly improves the performance, especially at the first iteration.

如Figure 5所示,GRPO超越了online RFT,這突顯了調整正負梯度係數所帶來的效益。此外,GRPO+PS在效能上優於GRPO+OS,這表明使用粒度更細、具步驟感知(step-aware)的梯度係數是有好處的。而且吼,我們還探索了迭代RL,在實驗中我們進行了兩輪迭代。如Figure 6所示,我們發現迭代RL明顯提高效能,尤其是在第一輪迭代時。


Figure 7 | The Maj@K and Pass@K of SFT and RL DeepSeekMath 7B on GSM8K and MATH (temperature 0.7). It was noted that RL enhances Maj@K but not Pass@K.

5.2.2. Why RL Works?

In this paper, we conduct reinforcement learning based on a subset of instruction tuning data, and it achieves significant performance enhancement upon the instruction tuning model. To further explain why reinforcement learning works, we evaluate the Pass@K and Maj@K accuracy of the Instruct and RL models on two benchmarks. As shown in Figure 7, RL enhances Maj@K's performance but not Pass@K. These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities. Similarly, (Wang et al., 2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Song et al., 2023; Wang et al., 2023a; Yuan et al., 2023b).

在本篇論文中,我們基於一部分指令微調資料來做強化學習,這作法在指令微調模型的基礎上有著明顯的提升。為了進一步解釋為什麼強化學習有效,我們在兩個基準測試中對 Instruct 和 RL 模型的 Pass@K 和 Maj@K 正確率進行了評估。如Figure 7所顯示,RL 提高了 Maj@K 的效能,但未影響 Pass@K。這些發現表明,RL 是透過讓輸出分佈(output distribution)更加穩健的方式來強化模型的整體效能,也就是說吼,這種提升似乎是來自於把正確的響應從TopK中推上來,而不是模型基本能力的增加。類似地,(Wang et al., 2023a)在SFT模型的推理任務中識別出一個不對齊(misalignment)的問題,說明了SFT模型的推理效能可以透過一系列偏好對齊策略來改善(Song et al., 2023;Wang et al., 2023a;Yuan et al., 2023b)。
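Maj@K 與 Pass@K 的差別可以用下面這段小程式來理解(自行整理的示意,並非官方評測程式):

```python
from collections import Counter

def pass_at_k(sampled_answers, gold):
    """K 個採樣中只要有一個答對就算通過:衡量模型「潛在」解題能力的上限。"""
    return any(ans == gold for ans in sampled_answers)

def maj_at_k(sampled_answers, gold):
    """對 K 個採樣做多數決(self-consistency)再比對答案:
    衡量輸出分佈是否穩健地集中在正確解上。"""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return majority_answer == gold

# 範例:K=5 個採樣,多數決答案為 "12"
samples = ["12", "12", "8", "12", "10"]
print(pass_at_k(samples, "12"), maj_at_k(samples, "12"))  # True True
```

RL 提升 Maj@K 而非 Pass@K,大致就對應到:採樣分佈更集中在原本就答得出來的正解上,而不是多解出原本解不出來的題目。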

5.2.3. How to Achieve More Effective RL?

We demonstrate RL works pretty well in mathematical reasoning tasks. We also provide a unified paradigm to understand different representative training methods. Within this paradigm, all methods are conceptualized as either direct or simplified RL techniques. As summarized in Equation 5, there exist three key components: Data Source, Algorithm, and Reward Function. We provide some potential future directions about the three components.

我們說明了強化學習在數學推理任務中的表現是出色的。此外,我們還提出一種統一的範式來理解不同代表性的訓練方法。在這個範式中,所有的方法都被概念化為直接或簡化的強化學習技術。如 Equation 5所總結,存在三個關鍵元素:資料來源、演算法和獎勵函數。我們對這些元素提供了一些可能的未來發展方向。

Data Source Data source is the raw material of all training methods. In the context of RL, we specifically refer to the data source as the unlabeled questions with the outputs sampled from the policy model. In this paper, we only use the questions from the instruction tuning stage and a naive nucleus sampling to sample outputs. We think this is a potential reason that our RL pipeline only improves the Maj@K performance. In the future, we will explore our RL pipeline on out-of-distribution question prompts, in conjunction with advanced sampling (decoding) strategies , like those based on tree-search methods (Yao et al., 2023). Also, the efficient inference techniques (Kwon et al., 2023; Leviathan et al., 2023; Xia et al., 2023, 2024), which determines the exploration efficiency of policy models, also play an exceedingly important role.

Data Source 資料來源是所有訓練方法的原材料(raw material)。在強化學習(RL)的脈絡下,我們所指的資料來源是未標記的問題(unlabeled questions)搭配從策略模型採樣而得的輸出。在這篇論文中,我們單純使用來自指令微調階段的問題,以及一種基本的核採樣(nucleus sampling)來採樣輸出。這可能是我們的RL pipeline只改善了 Maj@K 效能的一個潛在原因。未來,我們將在分佈外(out-of-distribution)的問題提示上研究我們的RL pipeline,並結合進階的採樣(解碼)策略,例如基於樹搜索(tree-search)的方法(Yao et al., 2023)。此外,決定策略模型探索效率的高效推論技術(Kwon et al., 2023; Leviathan et al., 2023; Xia et al., 2023, 2024)也扮演著極為重要的角色。

Algorithms Algorithms process the data and reward signal to the gradient coefficient to update the model parameter. Based on Equation 5, to some extent, all methods now fully TRUST the signal of the reward function to increase or decrease the conditional probability of a certain token. However, it is impossible to ensure the reward signal is always reliable, especially in extremely complex tasks. For example, even the PRM800K datasets (Lightman et al., 2023), which have been carefully annotated by well-trained annotators, still contain approximately 20% incorrect annotations. To this end, we will explore reinforcement learning algorithms that are robust against noisy reward signals. We believe such WEAK-TO-STRONG (Burns et al., 2023) alignment methods will bring a fundamental change to the learning algorithms.

Algorithms 演算法處理資料與獎勵信號,並將其轉為梯度係數,做為更新模型參數的依據。根據Equation 5,某種程度上來說,現在所有的方法都完全信任獎勵函數的信號,依此來增加或減少某個特定token的條件機率。然而,在極端複雜的任務中,我們沒有辦法確保獎勵信號始終是可靠的。舉例來說,即使是由訓練有素的標記人員仔細標註的PRM800K資料集(Lightman et al., 2023),仍然存在約20%的錯誤標註。因此,我們將探索能夠抵禦帶有雜訊的獎勵信號(noisy reward signals)的強化學習演算法。我們相信這樣的從弱到強(WEAK-TO-STRONG)(Burns et al., 2023)的對齊方法將為學習演算法帶來根本性的改變。

Reward Function Reward function is the source of the training signal. In RL, the reward function is usually the neural reward model. We think there exist three important directions for reward models: 1) How to enhance the generalization ability of the reward model. The reward model must be effectively generalized to handle out-of-distribution questions and advanced decoding outputs; otherwise, reinforcement learning may merely stabilize the distribution of LLMs rather than improve their fundamental capabilities; 2) How to reflect the uncertainty of reward model. The uncertainty could potentially act as a linking bridge between the weak reward model and the weak-to-strong learning algorithms; 3) How to efficiently build high quality process reward models that can provide fine-grained training signals for the reasoning process (Lightman et al., 2023; Wang et al., 2023b).

Reward Function 獎勵函數是訓練信號的來源。在強化學習(RL)中,獎勵函數通常是神經獎勵模型。我們認為獎勵模型有三個重要的方向:1) 如何提升獎勵模型的泛化能力。獎勵模型必須有效地泛化以處理分佈外的問題和先進的解碼輸出;否則,強化學習可能就單純的能夠穩定大語言模型(LLMs)的分佈而非提升其基礎能力;2) 如何反映獎勵模型的不確定性。不確定性可能作為弱獎勵模型與弱到強學習算法之間的聯結橋樑;3) 如何有效地構建高品質的過程獎勵模型,這些模型能夠提供推理過程的細粒度訓練信號(Lightman et al., 2023; Wang et al., 2023b)。

6. Conclusion, Limitation, and Future Work

We present DeepSeekMath, which outperforms all open-source models on the competition-level MATH benchmark and approaches the performance of closed models. DeepSeekMath is initialized with DeepSeek-Coder-v1.5 7B and undergoes continual training for 500B tokens, with a significant component of the training data being 120B math tokens sourced from Common Crawl. Our extensive ablation study shows web pages offer significant potential for high-quality mathematical data, while arXiv may not be as beneficial as we expected. We introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), which can notably improve mathematical reasoning capabilities with less memory consumption. The experiment results show that GRPO is effective even if DeepSeekMath-Instruct 7B has reached a high score on benchmarks. We also provide a unified paradigm to understand a series of methods and summarize several potential directions for more effective reinforcement learning.

我們提出 DeepSeekMath,該模型在競賽水平的 MATH 基準測試中超越所有開源模型,且接近閉源模型的效能表現。DeepSeekMath 以 DeepSeek-Coder-v1.5 7B 為基礎進行初始化,並持續訓練了 500B tokens,其中訓練資料的重要組成是來自 Common Crawl 的 120B math tokens。我們大量的消融實驗(ablation study)顯示,網頁蘊藏著高品質數學資料的巨大潛力,而 arXiv 的效果反而不如預期。此外,我們引入了 Group Relative Policy Optimization (GRPO),這是 Proximal Policy Optimization (PPO) 的一種變體,能夠顯著提高數學推理能力並降低記憶體消耗。實驗結果表明,即便 DeepSeekMath-Instruct 7B 在基準測試中已達到高分,GRPO 仍然有效。我們還提出一種統一的範式來理解各種方法,並總結出有助於更有效強化學習的幾個潛在方向。

Although DeepSeekMath achieves impressive scores on quantitative reasoning benchmarks, its capability on geometry and theorem-proof are relatively weaker than closed models. For instance, in our dry run, the model cannot handle problems related to triangles and ellipses, which may indicate data selection bias in pre-training and fine-tuning. In addition, restricted by the model scale, DeepSeekMath is worse than GPT-4 on few-shot capability. GPT-4 could improve its performance with few-shot inputs, while DeepSeekMath shows similar performance in zero-shot and few-shot evaluation. In the future, we will further improve our engineered data selection pipeline to construct more high-quality pre-trained corpus. In addition, we will explore the potential directions (Section 5.2.3) for more effective reinforcement learning of LLMs.

儘管DeepSeekMath在數理邏輯基準測試上取得了令人印象深刻的成績,但在幾何和定理證明方面的表現相較於閉源模型來說還是比較弱。例如,在我們的初步試驗(dry run)中,模型無法處理與三角形和橢圓有關的問題,這可能表明在預訓練和微調過程中存在資料的選擇偏差。此外,受限於模型規模,DeepSeekMath在少量樣本(few-shot)能力上不如GPT-4。GPT-4能夠通過少量樣本輸入提高其表現,而DeepSeekMath在零樣本(zero-shot)和少量樣本的評估中效能相近。未來,我們將進一步改善資料選擇流程,以建構更高品質的預訓練語料庫。此外,我們還會針對如何更有效地對大型語言模型進行強化學習,探索可能的方向(Section 5.2.3)。

A. Appendix

A.1. Analysis of Reinforcement Learning

We provide the detailed derivation of the data source and gradient coefficient (algorithm and reward function) across various methods, including SFT, RFT, Online RFT, DPO, PPO, and GRPO.

我們提供了來自各種方法(包括 SFT、RFT、Online RFT、DPO、PPO 及 GRPO)的資料來源與梯度係數(演算法及其相關的獎勵函數)的詳細推導過程。

A.1.1. Supervised Fine-tuning

The objective of Supervised Fine-tuning is maximizing the following objective:

$$\mathcal{J}_{SFT}(\theta) = \mathbb{E}_{[q,o \sim P_{sft}(Q,O)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\log\pi_\theta(o_t \mid q, o_{<t})\right) \tag{6}$$

  • $\mathbb{E}_{[q,o \sim P_{sft}(Q,O)]}$:從分佈 $P_{sft}(Q,O)$ 中抽到 $(q, o)$ 這個pair的期望值,因為是監督微調(sft),因此這邊的資料是標註好的
  • $|o|$:回答的token長度
  • $o_{<t}$:回答到time step為 $t$ 之前的token,你,你好,你好嗎之類的
  • $\pi_\theta(o_t \mid q, o_{<t})$:policy在條件 $q$ 且當前已生成token的情況下,產出 $o_t$ 的機率

總的來看就是加總所有token取log的機率然後計算平均

The gradient of $\mathcal{J}_{SFT}(\theta)$ is:

$$\nabla_\theta\mathcal{J}_{SFT} = \mathbb{E}_{[q,o \sim P_{sft}(Q,O)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\nabla_\theta\log\pi_\theta(o_t \mid q, o_{<t})\right) \tag{7}$$

Data Source: The dataset employed for SFT. Reward Function: This can be regarded as human selection. Gradient Coefficient: always set to 1.
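式(6)(7)其實就是一般的token-level負對數似然(每個token的梯度係數固定為1),用一段示意程式表示如下(假設性的寫法,非官方實作):

```python
import math

def sft_objective(token_probs):
    """token_probs: 標註回答中每個 token 在 pi_theta 下的機率 pi_theta(o_t | q, o_<t)
    回傳 J_SFT 對單一 (q, o) 樣本的估計:逐 token 取 log 機率後再平均。"""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# 範例:一個長度為 3 的回答
print(sft_objective([0.9, 0.7, 0.95]))
```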

A.1.2. Rejection Sampling Fine-tuning

Rejection Sampling Fine-tuning first samples multiple outputs from the supervised fine-tuned LLMs for each question, and then trains LLMs on the sampled outputs with the correct answer. Formally, the objective of RFT is to maximize the following objectives:

棄卻抽樣微調(Rejection Sampling Fine-tuning, RFT)首先針對每一個問題,從監督式微調後的大型語言模型中採樣多個輸出,然後用答案正確的採樣輸出來進一步訓練模型。具體來說,RFT以最大化下面的目標函式為目標:

$$\mathcal{J}_{RFT}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, o \sim \pi_{sft}(O \mid q)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\mathbb{I}(o)\log\pi_\theta(o_t \mid q, o_{<t})\right) \tag{8}$$

  • $q \sim P_{sft}(Q)$:從sft中採樣一個問題
  • $o \sim \pi_{sft}(O \mid q)$:給定採樣的問題,從sft中採樣一個回答
  • $\mathbb{I}(o)$:indicator function,要不要接受這個回答,要就1,不要就0,這邊就是棄卻抽樣的概念,我們採樣很多,然後留下一個我們想要的

總的來說,就是我們就只會用我們想要的那個答案來做為訓練的依據。

The gradient of $\mathcal{J}_{RFT}(\theta)$ is:

$$\nabla_\theta\mathcal{J}_{RFT}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, o \sim \pi_{sft}(O \mid q)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\mathbb{I}(o)\,\nabla_\theta\log\pi_\theta(o_t \mid q, o_{<t})\right) \tag{9}$$

Data Source: question in SFT dataset with outputs sampled from SFT model. Reward Function: Rule (whether the answer is correct or not). Gradient Coefficient:

$$GC_{RFT}(q,o,t) = \mathbb{I}(o) = \begin{cases} 1 & \text{the answer of } o \text{ is correct} \\ 0 & \text{the answer of } o \text{ is incorrect} \end{cases} \tag{10}$$
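式(10)的indicator可以直接寫成程式來看(示意用,這裡用最簡單的字串比對代替實際的rule-based答案判斷):

```python
def rft_gradient_coefficient(output_answer, gold_answer):
    """GC_RFT = I(o):答案正確為 1、錯誤為 0(即只用答對的採樣來訓練)。"""
    return 1.0 if output_answer == gold_answer else 0.0

def filter_rft_samples(sampled_outputs, gold_answer):
    """棄卻抽樣:保留答案正確的輸出做為微調資料。
    sampled_outputs 假設為 [{"answer": "...", "text": "..."}, ...] 的形式。"""
    return [o for o in sampled_outputs
            if rft_gradient_coefficient(o["answer"], gold_answer) == 1.0]
```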

A.1.3. Online Rejection Sampling Fine-tuning

The only difference between RFT and Online RFT is that the outputs of Online RFT are sampled from the real-time policy model $\pi_\theta$, rather than from the SFT model $\pi_{\theta_{sft}}$. Therefore, the gradient of online RFT is:

$$\nabla_\theta\mathcal{J}_{OnRFT}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, o \sim \pi_\theta(O \mid q)]}\left(\frac{1}{|o|}\sum_{t=1}^{|o|}\mathbb{I}(o)\,\nabla_\theta\log\pi_\theta(o_t \mid q, o_{<t})\right) \tag{11}$$

RFT 和 Online RFT 的唯一區別在於,Online RFT 的輸出是從real-time policy model $\pi_\theta$ 採樣的,而不是從 SFT model $\pi_{\theta_{sft}}$ 中得出。因此,Online RFT 的梯度如式(11)所示。

A.1.4. Direct Preference Optimization (DPO)

The objective of DPO is:

$$\mathcal{J}_{DPO}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, o^{+},o^{-} \sim \pi_{sft}(O \mid q)]}\log\sigma\left(\beta\frac{1}{|o^{+}|}\sum_{t=1}^{|o^{+}|}\log\frac{\pi_\theta(o^{+}_t \mid q, o^{+}_{<t})}{\pi_{ref}(o^{+}_t \mid q, o^{+}_{<t})} - \beta\frac{1}{|o^{-}|}\sum_{t=1}^{|o^{-}|}\log\frac{\pi_\theta(o^{-}_t \mid q, o^{-}_{<t})}{\pi_{ref}(o^{-}_t \mid q, o^{-}_{<t})}\right) \tag{12}$$

  • $\mathbb{E}_{[q \sim P_{sft}(Q),\, o^{+},o^{-} \sim \pi_{sft}(O \mid q)]}$:其中 $q \sim P_{sft}(Q)$ 代表從sft資料中採樣一個問題 $q$,然後再根據這個問題從sft中採樣兩個回應,分別為人類偏好的 $o^{+}$,以及非人類偏好的 $o^{-}$
  • $\log\sigma$:利用 $\sigma$ 來轉為機率,取 $\log$ 就有點像是logistic-regression那樣對轉換後的機率做log-likelihood的最大化,從式子來看不難理解,前段是人類偏好,後段是非人類偏好,我們希望這兩者之間的距離愈大愈好
  • $\beta\frac{1}{|o^{+}|}\sum_{t=1}^{|o^{+}|}\log\frac{\pi_\theta(o^{+}_t \mid q, o^{+}_{<t})}{\pi_{ref}(o^{+}_t \mid q, o^{+}_{<t})} - \beta\frac{1}{|o^{-}|}\sum_{t=1}^{|o^{-}|}\log\frac{\pi_\theta(o^{-}_t \mid q, o^{-}_{<t})}{\pi_{ref}(o^{-}_t \mid q, o^{-}_{<t})}$:前段為人類偏好,後段為非人類偏好,也許可以想成是我們系統上被使用者按讚跟倒讚,計算的就是逐token的對數機率比,比的就是當下的模型與參考模型之間的比例。不是整句喔,是逐token,『你』跟『你』,『你好』跟『你好』,『你好嗎』跟『你好壞』,這樣,逐token的加總之後再取平均做為整句的preference score
  • $\beta$:一個調控的係數,很明顯的,數值愈大差異就愈明顯

總的來說就是希望模型能夠回應人類愛聽的話。

The gradient of $\mathcal{J}_{DPO}(\theta)$ is:

$$\nabla_\theta\mathcal{J}_{DPO}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, o^{+},o^{-} \sim \pi_{sft}(O \mid q)]}\left(\frac{1}{|o^{+}|}\sum_{t=1}^{|o^{+}|}GC_{DPO}(q,o,t)\,\nabla_\theta\log\pi_\theta(o^{+}_t \mid q, o^{+}_{<t}) - \frac{1}{|o^{-}|}\sum_{t=1}^{|o^{-}|}GC_{DPO}(q,o,t)\,\nabla_\theta\log\pi_\theta(o^{-}_t \mid q, o^{-}_{<t})\right) \tag{13}$$

Data Source: question in SFT dataset with outputs sampled from SFT model. Reward Function: human preference in the general domain (can be 'Rule' in mathematical tasks). Gradient Coefficient:

$$GC_{DPO}(q,o,t) = \sigma\left(\beta\log\frac{\pi_\theta(o^{-}_t \mid q, o^{-}_{<t})}{\pi_{ref}(o^{-}_t \mid q, o^{-}_{<t})} - \beta\log\frac{\pi_\theta(o^{+}_t \mid q, o^{+}_{<t})}{\pi_{ref}(o^{+}_t \mid q, o^{+}_{<t})}\right) \tag{14}$$

  • $\sigma$:值域為0~1,應該是為了加大偏好
  • log-likelihood為正,那就代表偏好,就會傾向生成該token
  • 當不愛的($o^{-}$)大於喜好的($o^{+}$),那得到的值經過sigmoid轉換之後就會接近1,那loss就會增大,為了降低loss,自然就會往喜好的token去生成
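式(14)的梯度係數可以直接照抄成程式來感受它的行為(示意用,非官方實作):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_gradient_coefficient(p_theta_pos, p_ref_pos, p_theta_neg, p_ref_neg, beta=0.1):
    """GC_DPO = sigmoid(beta*log(pi_theta(o-)/pi_ref(o-)) - beta*log(pi_theta(o+)/pi_ref(o+)))
    當模型相對參考模型更偏向非偏好回應 o- 時,係數趨近 1,更新的力道就愈大。"""
    return sigmoid(beta * math.log(p_theta_neg / p_ref_neg)
                   - beta * math.log(p_theta_pos / p_ref_pos))

# 範例:模型目前偏向 o-,因此得到較大的梯度係數
print(dpo_gradient_coefficient(p_theta_pos=0.2, p_ref_pos=0.3,
                               p_theta_neg=0.5, p_ref_neg=0.3, beta=0.5))
```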

A.1.5. Proximal Policy Optimization (PPO)

The objective of PPO is:

$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, o \sim \pi_{\theta_{old}}(O \mid q)]}\frac{1}{|o|}\sum_{t=1}^{|o|}\min\left[\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}A_t,\ \mathrm{clip}\left(\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}, 1-\epsilon, 1+\epsilon\right)A_t\right] \tag{15}$$

  • $q \sim P_{sft}(Q)$:從sft中採樣一個問題
  • $o \sim \pi_{\theta_{old}}(O \mid q)$:從舊的policy中去採樣針對該問題的回答
  • 取平均是避免長回應會影響模型最終偏好囉哩八唆
  • $\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}$:
    • 分子 $\pi_\theta(o_t \mid q, o_{<t})$:給定問題 $q$ 以及目前為止($t-1$)所生成的token $o_{<t}$,當下的policy會生成 $o_t$ 的機率
    • 分母就是舊的policy,也許就是前一個迭代的policy model
    • 兩個加起來看就是一個重要性採樣比,這在強化學習聖經裡面印象中是有提到的,大於1,那就代表當前的policy會比舊的policy更傾向於選擇這個token,反之就代表不愛
  • $A_t$:在time step為 $t$ 的時候的Advantage Function,也就是在state $s_t$(也就是面對問題 $q$ 與目前為止($o_{<t}$)的回應所生成的token)之下,選擇 $o_t$ 這個token有多好,大於0就代表比平均好,小於0則代表比平均差
    • 理解上平均也許是過去的policy,就連前輩們都覺得好的,那就放大,反之則減少
  • $\mathrm{clip}$:常見的剪裁作法,給定一個 $\epsilon$ 來做控制,不要造成過大的異動就是

總的來看應該就是,老人走過的路你也來,那就給你多一點的回報,不過同時也激勵自己再進步。
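式(15)的min/clip其實用幾行程式就可以表達(單一token的示意,非官方實作):

```python
def ppo_token_objective(p_new, p_old, advantage, epsilon=0.2):
    """單一 token 的 PPO 目標:min(ratio * A_t, clip(ratio, 1-eps, 1+eps) * A_t),
    其中 ratio 為重要性採樣比 pi_theta / pi_theta_old。"""
    ratio = p_new / p_old
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# 範例:新 policy 更偏好這個 token(ratio=1.5),advantage 為正時會被 clip 在 1.2
print(ppo_token_objective(p_new=0.3, p_old=0.2, advantage=1.0, epsilon=0.2))  # 1.2
```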

To simplify the analysis, it is assumed that the model only has a single update following each exploration stage, thereby ensuring that $\pi_{\theta_{old}} = \pi_\theta$. In this case, we can remove the min and clip operation:

$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, o \sim \pi_{\theta_{old}}(O \mid q)]}\frac{1}{|o|}\sum_{t=1}^{|o|}\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}A_t \tag{16}$$

為了簡化分析,我們假設模型在每個探索階段就只會做一次更新,從而確保 $\pi_{\theta_{old}} = \pi_\theta$。在這種情況下,我們可以拿掉 $\min$ 和 $\mathrm{clip}$:

$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, o \sim \pi_{\theta_{old}}(O \mid q)]}\frac{1}{|o|}\sum_{t=1}^{|o|}\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}A_t \tag{16}$$

The gradient of $\mathcal{J}_{PPO}(\theta)$ is:

$$\nabla_\theta\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, o \sim \pi_{\theta_{old}}(O \mid q)]}\frac{1}{|o|}\sum_{t=1}^{|o|}A_t\,\nabla_\theta\log\pi_\theta(o_t \mid q, o_{<t}) \tag{17}$$

Data Source: question in SFT dataset with outputs sampled from policy model. Reward Function: reward model. Gradient Coefficient:

$$GC_{PPO}(q,o,t,\pi_{\theta_{rm}}) = A_t \tag{18}$$

where $A_t$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based on the rewards $\{r_{\ge t}\}$ and a learned value function $V_\psi$.

A.1.6. Group Relative Policy Optimization (GRPO)

The objective of GRPO is (assume $\pi_{\theta_{old}} = \pi_\theta$ for simplified analysis):

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)]}\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left[\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}\hat{A}_{i,t} - \beta\left(\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1\right)\right] \tag{19}$$

  • $q \sim P_{sft}(Q)$:從sft中採樣一個問題
  • $\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)$:根據給定的問題 $q$,從舊的策略模型(通常是上一個迭代)中取 $G$ 個回應
  • $\frac{1}{G}\sum_{i=1}^{G}$:針對這 $G$ 個可能的回應計算平均,這邊平均的是整個完整的輸出序列
  • $\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}$:針對某個完整回應的輸出長度計算平均,平均的就是加總每個token針對中括號裡面的計算項目
  • $\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}\hat{A}_{i,t}$:重要性抽樣比例,搭配advantage function來看這個選擇好不好
  • $\beta$:調控用的
  • $\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}$:看看其它人(ref)會不會在相同情況下採取相同的token,也許我可以把參考用策略模型理解成是某一個超大型語言模型
  • $\log\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}$:這個對數比例,有人說是跟KL-divergence有關,可能需要再消化一下才能理解


The gradient of $\mathcal{J}_{GRPO}(\theta)$ is:

$$\nabla_\theta\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{[q \sim P_{sft}(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)]}\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left[\hat{A}_{i,t} + \beta\left(\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1\right)\right]\nabla_\theta\log\pi_\theta(o_{i,t} \mid q, o_{i,<t}) \tag{20}$$

Data Source: question in SFT dataset with outputs sampled from policy model. Reward Function: reward model. Gradient Coefficient:

$$GC_{GRPO}(q,o,t,\pi_{\theta_{rm}}) = \hat{A}_{i,t} + \beta\left(\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1\right) \tag{21}$$

where $\hat{A}_{i,t}$ is computed based on the group reward scores.
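把式(21)的梯度係數和前面outcome supervision的組內正規化組合起來,大概就是下面這段示意程式(依上面的推導自行整理,非官方實作):

```python
import numpy as np

def grpo_gradient_coefficients(group_rewards, p_ref, p_theta, beta=0.04):
    """group_rewards: 同一問題下 G 個輸出的獎勵 r_1..r_G(outcome supervision)
    p_ref / p_theta: 某個 token 在參考模型與當前 policy 下的機率
    回傳每個輸出 i 在該 token 上的 GC_GRPO = A_hat_i + beta * (pi_ref/pi_theta - 1)"""
    r = np.asarray(group_rewards, dtype=float)
    advantages = (r - r.mean()) / (r.std() + 1e-8)  # 組內相對的 advantage
    kl_term = beta * (p_ref / p_theta - 1.0)         # 來自 KL 正則項的梯度貢獻
    return advantages + kl_term

# 範例:G=4,兩個答對、兩個答錯,policy 在此 token 上已偏離參考模型
print(grpo_gradient_coefficients([1, 0, 1, 0], p_ref=0.2, p_theta=0.4))
```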