# DeepSeek-OCR: Contexts Optical Compression
###### tags:`論文翻譯` `deeplearning`
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
個人註解,任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/abs/2510.18234)
:::
## DeepSeek-OCR: Contexts Optical Compression
## Abstract
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10 × ), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20 × , the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR .
我們提出了 DeepSeek-OCR,這是一項對通過光學 2D 映射壓縮長文本上下文可行性進行的初步研究。DeepSeek-OCR 包含兩個組成部分:DeepEncoder 和作為解碼器的 DeepSeek3B-MoE-A570M。具體來說,DeepEncoder 是核心引擎,其設計目的是在高解析度輸入下保持低激活值,同時實現高壓縮比,以確保 vision tokens 的數量處於最佳且可管理的範圍內。實驗證明,當 text token 的數量是 vision token 的 10 倍以內(即壓縮比 < 10 ×)時,模型可以實現 97% 的解碼(OCR)精度。即使在壓縮比達到 20 倍時,OCR 的準確率仍可保持在約 60%。這對於歷史長文本壓縮和大型語言模型(LLMs)中的記憶遺忘機制等研究領域展現出相當大的潛力。此外,DeepSeek-OCR 也展示了很高的實用價值。在 OmniDocBench 測試中,它僅使用 100 個 vision tokens 就超越了 GOT-OCR2.0(256 tokens/page),並且在使用的 vision tokens 數量少於 800 個的情況下,其表現優於 MinerU2.0(平均每頁 6000+ tokens)。在實際應用中,DeepSeek-OCR 每天可以生成 20 萬頁以上用於 LLM/VLM 的訓練資料(單張 A100-40G)。程式碼和模型權重開源於 http://github.com/deepseek-ai/DeepSeek-OCR 。

Figure 1 | Figure (a) shows the compression ratio (number of text tokens in ground truth/number of vision tokens model used) testing on Fox [21] benchmark; Figure (b) shows performance comparisons on OmniDocBench [27]. DeepSeek-OCR can achieve state-of-the-art performance among end-to-end models enjoying the fewest vision tokens.
Figure 1 | Figure (a) 顯示了在 Fox [21] 基準測試中,壓縮比例(真實文本 token 數量與模型使用的 vision tokens 數量之比)的測試結果;Figure (b) 顯示了在 OmniDocBench [27] 上的表現比較。DeepSeek-OCR 在端到端模型中能以最少的 vision tokens 達到最先進(state-of-the-art)的表現。
## 1. Introduction
Current Large Language Models (LLMs) face significant computational challenges when processing long textual content due to quadratic scaling with sequence length. We explore a potential solution: leveraging visual modality as an efficient compression medium for textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios.
現代的大型語言模型(LLMs)在處理長文本時,會遇到嚴重的計算難題,這是因為它們的複雜度會隨著序列長度的增加而呈現二次方級別的增長。我們探索了一種可能的解決方案:利用視覺模態(visual modality)作為文本信息的高效壓縮媒介。單一張包含文件文本的圖片,能用遠少於等量數位文本的 token 數量來表示豐富的信息,這意味著通過 vision tokens 進行光學壓縮能夠實現更高的壓縮比率。
This insight motivates us to reexamine vision-language models (VLMs) from an LLM-centric perspective, focusing on how vision encoders can enhance LLMs' efficiency in processing textual information rather than the basic VQA [12, 16, 24, 32, 41] tasks at which humans excel. OCR tasks, as an intermediate modality bridging vision and language, provide an ideal testbed for this vision-text compression paradigm, as they establish a natural compression-decompression mapping between visual and textual representations while offering quantitative evaluation metrics.
這一發現促使我們從大型語言模型的角度重新審視視覺語言模型(VLMs),關注視覺編碼器如何提升大型語言模型在處理文本信息方面的效率,而不僅限於人類擅長的基本視覺問答(VQA)功能。光學字符識別(OCR)作為連接視覺與語言的中介模態,為這種視覺-文本壓縮範式提供了理想的測試平台,因為它在視覺表示和文本表示之間建立了一個自然的壓縮-解壓縮的映射,並提供了量化的評估指標。
Accordingly, we present DeepSeek-OCR, a VLM designed as a preliminary proof-of-concept for efficient vision-text compression. Our work makes three primary contributions:
因此,我們提出了 DeepSeek-OCR,一個作為高效視覺-文本壓縮之初步概念驗證而設計的視覺語言模型(Vision-Language Model, VLM)。我們的研究有三項主要貢獻:
First, we provide comprehensive quantitative analysis of vision-text token compression ratios. Our method achieves 96%+ OCR decoding precision at 9-10 × text compression, ∼ 90% at 10-12 × compression, and ∼ 60% at 20 × compression on Fox [21] benchmarks featuring diverse document layouts (with actual accuracy being even higher when accounting for formatting differences between output and ground truth), as shown in Figure 1(a). The results demonstrate that compact language models can effectively learn to decode compressed visual representations, suggesting that larger LLMs could readily acquire similar capabilities through appropriate pretraining design.
首先,我們對視覺-文本 tokens 的壓縮比進行了全面的定量分析。在涵蓋多種文檔佈局的 Fox [21] 測試基準中,我們的方法在文本壓縮率為 9-10 倍時實現了 96% 以上的 OCR 解碼精度,在 10-12 倍壓縮率時約為 90%,在 20 倍壓縮率時約為 60%(若考慮輸出與真實答案之間的格式差異,實際精度還會更高),如 Figure 1(a) 所示。這些結果表明,小型的語言模型能夠有效學習如何解碼壓縮後的視覺表示,這意味著,通過適當的預訓練設計,更大的大型語言模型(LLM)也能輕鬆獲得類似的能力。
Second, we introduce DeepEncoder, a novel architecture that maintains low activation memory and minimal vision tokens even with high-resolution inputs. It serially connects window attention and global attention encoder components through a 16 × convolutional compressor. This design ensures that the window attention component processes a large number of vision tokens, while the compressor reduces vision tokens before they enter the dense global attention component, achieving effective memory and token compression.
其次,我們引入了一種新穎的架構,DeepEncoder,即使在高解析度的輸入情況下,它也能保持較低的激活記憶與最少的 vision tokens。該結構通過一個 16 × 的卷積壓縮器,將窗口注意力(window attention)與全局注意力編碼器組件做串聯連接。這種設計確保窗口注意力組件能夠處理大量的 vision tokens,而壓縮器在進入密集的全局注意力組件(dense global attention component)之前減少 vision tokens,從而實現有效的記憶與 token 壓縮的效果。
Third, we develop DeepSeek-OCR based on DeepEncoder and DeepSeek3B-MoE [19, 20]. As shown in Figure 1(b), it achieves state-of-the-art performance within end-to-end models on OmniDocBench while using the fewest vision tokens. Additionally, we equip the model with capabilities for parsing charts, chemical formulas, simple geometric figures, and natural images to enhance its practical utility further. In production, DeepSeek-OCR can generate 33 million pages of data per day for LLMs or VLMs using 20 nodes (each with 8 A100-40G GPUs).
第三,我們基於 DeepEncoder 和 DeepSeek3B-MoE [19, 20] 開發了 DeepSeek-OCR。如 Figure 1(b) 所示,該模型在 OmniDocBench 的端到端模型中取得了最先進的表現,同時使用的 vision tokens 數量最少。此外,我們還為該模型加入了解析圖表、化學公式、簡單幾何圖形和自然影像的功能,進一步提升了其實用性。在實際應用環境中,DeepSeek-OCR 使用 20 個節點(每個節點配備 8 個 A100-40G GPU)時,每天可為大型語言模型(LLMs)或視覺語言模型(VLMs)生成 3300 萬頁資料。
In summary, this work presents a preliminary exploration of using visual modality as an efficient compression medium for textual information processing in LLMs. Through DeepSeekOCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20 × ) for different historical context stages, offering a promising direction for addressing long-context challenges in large language models. Our quantitative analysis provides empirical guidelines for VLM token allocation optimization, while the proposed DeepEncoder architecture showcases practical feasibility with real-world deployment capabilities. Although focused on OCR as a proof-of-concept, this paradigm opens new possibilities for rethinking how vision and language modalities can be synergistically combined to enhance computational efficiency in large-scale text processing and agent systems.
總而言之,這個研究初步探討了將視覺模態作為大型語言模型(LLM)中文本信息處理的高效壓縮媒介。透過 DeepSeek-OCR,我們證明了視覺-文本壓縮可以在不同的歷史上下文階段實現顯著的 token 量減少(7-20 倍),為解決大型語言模型中的長上下文問題提供了有希望的方向。我們的定量分析為視覺語言模型(VLM)的 token 分配最佳化提供了實證指導,而我們所提出的 DeepEncoder 架構則展示了其在實際部署中的可行性。雖然本研究的重點在於以 OCR 作為概念驗證,不過,這個範式為重新思考如何協同整合視覺與語言模態,以提升大規模文本處理和智能體系統的計算效率,開啟了新的可能性。

Figure 2 | Typical vision encoders in popular VLMs. Here are three types of encoders commonly used in current open-source VLMs, all of which suffer from their respective deficiencies.
Figure 2 | 現下視覺語言模型(VLMs)中常用的視覺編碼器。以下是當前開源視覺語言模型中常見的三種編碼器類型,這些編碼器均存在各自的缺陷。
## 2. Related Works
### 2.1. Typical Vision Encoders in VLMs
Current open-source VLMs employ three main types of vision encoders, as illustrated in Figure 2. The first type is a dual-tower architecture represented by Vary [36], which utilizes parallel SAM[17] encoder to increase visual vocabulary parameters for high-resolution image processing. While offering controllable parameters and activation memory, this approach suffers from significant drawbacks: it requires dual image preprocessing that complicates deployment and makes encoder pipeline parallelism challenging during training. The second type is tile-based method exemplified by InternVL2.0 [8], which processes images by dividing them into small tiles for parallel computation, reducing activation memory under high-resolution settings. Although capable of handling extremely high resolutions, this approach has notable limitations due to its typically low native encoder resolution (below 512 × 512), causing large images to be excessively fragmented and resulting in numerous vision tokens. The third type is adaptive resolution encoding represented by Qwen2-VL [35], which adopts the NaViT [10] paradigm to directly process full images through patch-based segmentation without tile parallelization. While this encoder can handle diverse resolutions flexibly, it faces substantial challenges with large images due to massive activation memory consumption that can cause GPU memory overflow, and sequence packing requires extremely long sequence lengths during training. Long vision tokens will slow down both prefill and generation phases of inference.
當前的開源視覺語言模型(VLM)主要採用三種類型的視覺編碼器,如 Figure 2 所示。第一種是以 Vary [36] 為代表的 dual-tower(雙塔)架構,該編碼器利用並行的 SAM [17] encoder 來擴充視覺詞彙的參數,從而提升高解析度影像的處理能力。雖然這種方法的參數量與激活記憶體都可控,但仍有明顯的缺陷:它需要雙重的影像預處理,這增加了部署的複雜性,也使得訓練期間的編碼器管線並行(pipeline parallelism)變得困難。第二種是 tile-based 的方法,以 InternVL2.0 [8] 為典型代表,通過將影像分割成小塊(tile)進行並行計算,以降低高解析度設定下的激活記憶體。雖然這種方法能處理極高的解析度,但由於其編碼器的原生解析度通常較低(小於 512 × 512),在處理大型影像時會導致影像被過度切割,進而產生大量的 vision tokens。第三種是自適應解析度編碼,以 Qwen2-VL [35] 為代表,該編碼器採用 NaViT [10] 範式,通過 patch-based segmentation 直接處理完整影像,無需進行 tile parallelization(切小片的平行處理)。雖然這種編碼器能靈活處理各種解析度,但在處理大型影像時會因消耗大量激活記憶體而面臨挑戰,可能導致 GPU 記憶體溢位;此外,sequence packing 在訓練期間需要非常長的序列長度。過長的 vision tokens 也會同時拖慢推理的預填充(prefill)與生成(generation)階段。
### 2.2. End-to-end OCR Models
OCR, particularly the document parsing task, has been a highly active topic in the image-to-text domain. With the advancement of VLMs, a large number of end-to-end OCR models have emerged, fundamentally transforming the traditional pipeline architecture (which required separate detection and recognition expert models) by simplifying OCR systems. Nougat [6] first employs an end-to-end framework for academic paper OCR on arXiv, demonstrating the potential of models in handling dense perception tasks. GOT-OCR2.0 [38] expands the scope of OCR2.0 to include more synthetic image parsing tasks and designs an OCR model with performance-efficiency trade-offs, further highlighting the potential of end-to-end OCR research. Additionally, general vision models such as the Qwen-VL series [35], the InternVL series [8], and many of their derivatives continuously enhance their document OCR capabilities to explore dense visual perception boundaries. However, a crucial research question that current models have not addressed is: for a document containing 1000 words, how many vision tokens are at least needed for decoding? This question holds significant importance for research on the principle that "a picture is worth a thousand words."
光學字符識別(OCR),尤其是文檔解析任務,在影像轉文字的領域中一直是一個非常活躍的研究方向。隨著視覺語言模型(VLMs)的進步,出現了大量的端到端(end-to-end)的OCR模型,這些模型大幅改變了傳統的處理流程(傳統流程需要分別使用檢測和識別專家模型),並簡化了OCR系統的架構。Nougat [6] 首次在arXiv上使用端到端框架進行學術論文的OCR處理,展示了模型在處理密集感知任務(dense perception task)方面的潛力。GOT-OCR2.0 [38] 將OCR2.0的應用範圍擴展到更多合成影像解析任務,並設計了一個在性能與效率之間取得平衡的OCR模型,進一步突顯了端到端OCR研究的價值。此外,像Qwen-VL系列 [35]、InternVL系列 [8] 及其許多衍生模型,不斷提升它們的文檔OCR能力,以探索密集視覺感知的邊界(dense visual perception boundaries)。然而,現有模型尚未解決的一個關鍵研究問題是:對於一個包含1000個字的文檔,至少需要多少個 vision tokens 才能完成解碼?這個問題對於「一張圖片等於千言萬語」這一研究原則有著重大的意義。

Figure 3 | The architecture of DeepSeek-OCR. DeepSeek-OCR consists of a DeepEncoder and a DeepSeek-3B-MoE decoder. DeepEncoder is the core of DeepSeek-OCR, comprising three components: a SAM [17] for perception dominated by window attention, a CLIP [29] for knowledge with dense global attention, and a 16 × token compressor that bridges between them.
Figure 3 | DeepSeek-OCR 的架構。DeepSeek-OCR 由一個 DeepEncoder 和一個 DeepSeek-3B-MoE 解碼器組成。DeepEncoder 是 DeepSeek-OCR 的核心,包含三個構成部分:一個 SAM [17],以窗口注意力機制為主進行感知;一個 CLIP [29],以密集全局注意力機制承載知識;以及一個 16 × token 壓縮器,用於連接這兩者。
## 3. Methodology
### 3.1. Architecture
As shown in Figure 3, DeepSeek-OCR enjoys a unified end-to-end VLM architecture consisting of an encoder and a decoder. The encoder (namely DeepEncoder) is responsible for extracting image features and tokenizing as well as compressing visual representations. The decoder is used for generating the required result based on image tokens and prompts. DeepEncoder is approximately 380M in parameters, mainly composed of an 80M SAM-base [17] and a 300M CLIP-large [29] connected in series. The decoder adopts a 3B MoE [19, 20] architecture with 570M activated parameters. In the following paragraphs, we will delve into the model components, data engineering, and training skills.
如 Figure 3 所示,DeepSeek-OCR 採用統一的端到端視覺語言模型架構,由一個編碼器和一個解碼器所組成。編碼器(即 DeepEncoder)負責提取影像特徵、將其 token 化並壓縮視覺表示。解碼器則根據 image tokens 和提示詞(prompts)生成所需結果。DeepEncoder 的參數量約為 380M,主要由參數量 80M 的 SAM-base [17] 與 300M 的 CLIP-large [29] 串聯構成。解碼器採用 3B 的 MoE [19, 20] 架構,激活參數約為 570M。在以下章節中,將詳細討論該模型的各個組成部分、資料工程以及訓練技巧。
### 3.2. DeepEncoder
To explore the feasibility of contexts optical compression, we need a vision encoder with the following features: 1.Capable of processing high resolutions; 2.Low activation at high resolutions; 3.Few vision tokens; 4.Support for multiple resolution inputs; 5. Moderate parameter count. However, as described in the Section 2.1, current open-source encoders cannot fully satisfy all these conditions. Therefore, we design a novel vision encoder ourselves, named DeepEncoder.
為了探索上下文光學壓縮的可行性,我們需要一個具備以下特性的視覺編碼器:1. 能夠處理高解析度影像;2. 在高解析度下激活記憶體較低;3. 使用較少的 vision tokens;4. 支援多種解析度輸入;5. 參數數量適中。然而,如 Section 2.1 所述,現有的開源編碼器無法完全滿足這些要求。因此,我們自行設計了一個新型的視覺編碼器,命名為 DeepEncoder。
#### 3.2.1. Architecture of DeepEncoder
DeepEncoder mainly consists of two components: a visual perception feature extraction component dominated by window attention, and a visual knowledge feature extraction component with dense global attention. To benefit from the pretraining gains of previous works, we use SAM-base (patch-size 16) and CLIP-large as the main architectures for the two components respectively. For CLIP, we remove the first patch embedding layer since its input is no longer images but output tokens from the previous pipeline. Between the two components, we borrow from Vary [36] and use a 2-layer convolutional module to perform 16 × downsampling of vision tokens. Each convolutional layer has a kernel size of 3, stride of 2, padding of 1, and channels increase from 256 to 1024. Assuming we input a 1024 × 1024 image, the DeepEncoder will segment it into 1024/16 × 1024/16=4096 patch tokens. Since the first half of encoder is dominated by window attention and only 80M, the activation is acceptable. Before entering global attention, the 4096 tokens go through the compression module and the token count becomes 4096/16=256, thus making the overall activation memory controllable.
DeepEncoder 主要包含兩個組成部分:一個是以窗口注意力(window attention)為主的視覺感知特徵提取組件;另一個是採用密集全局注意力(dense global attention)的視覺知識特徵提取組件。為了有效利用先前研究的預訓練成果,我們分別選用 SAM-base(patch-size 16)和 CLIP-large 作為這兩個組件的主要架構。CLIP 的部份,我們移除了第一個 patch embedding layer,因為其輸入已經不是影像,而是來自前一階段的輸出 tokens。在這兩個組件之間,我們借鏡 Vary [36],使用一個兩層卷積的模組對 vision tokens 做 16 倍的下採樣。每個卷積層的 kernel 大小為 3,stride 為 2,padding 為 1,channels 從 256 增加到 1024。假設我們輸入一張 1024 × 1024 的影像,DeepEncoder 會將其分割成 1024/16 × 1024/16 = 4096 個 patch tokens。由於編碼器的前半部分以窗口注意力為主,且參數量僅 80M,因此激活值仍在可接受範圍之內。在進入全局注意力之前,這 4096 個 tokens 會經過壓縮模組,token 數量減少到 4096/16 = 256 個,從而使整體的激活記憶體使用量得以控制。
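:::warning
個人註解:下面是一段最小化的示意程式碼(非官方實作),用 PyTorch 示意上述的兩層卷積 16 × token 壓縮器:kernel 3、stride 2、padding 1,通道 256 → 1024,每個 stride-2 卷積讓 token 數量縮減 4 倍,兩層共 4 × 4 = 16 倍。中間通道寬度(512)與啟用函數是我自己的假設。
```python
import torch
import torch.nn as nn

class TokenCompressor16x(nn.Module):
    """16 倍卷積 token 壓縮器的示意(依 Section 3.2.1 的描述,非官方實作)。"""
    def __init__(self, in_ch: int = 256, mid_ch: int = 512, out_ch: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=2, padding=1),
            nn.GELU(),  # 啟用函數為假設
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C),來自 SAM(window attention)階段,N = H*W 個 patch
        b, n, c = tokens.shape
        hw = int(n ** 0.5)  # 假設為正方形 patch 網格
        x = tokens.transpose(1, 2).reshape(b, c, hw, hw)
        x = self.net(x)  # (B, 1024, hw/4, hw/4)
        return x.flatten(2).transpose(1, 2)  # (B, N/16, 1024)

# 1024 × 1024 輸入、patch size 16:64 × 64 = 4096 個 token 進、256 個 token 出
dummy = torch.randn(1, 4096, 256)
print(TokenCompressor16x()(dummy).shape)  # torch.Size([1, 256, 1024])
```
:::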

Figure 4 | To test model performance under different compression ratios (requiring different numbers of vision tokens) and enhance the practicality of DeepSeek-OCR, we configure it with multiple resolution modes.
Figure 4 | 為了測試模型在不同壓縮比例下的性能(這些壓縮比例需要不同數量的 vision tokens),並提升 DeepSeek-OCR 的實用性,我們為其配置了多種解析度模式。
Table 1 | Multi resolution support of DeepEncoder. For both research and application purposes, we design DeepEncoder with diverse native resolution and dynamic resolution modes.
Table 1 | DeepEncoder 的多解析度支援。為了滿足研究與應用的需求,我們設計了 DeepEncoder,該模型支持多種解析度設定及動態調整解析度的功能。
| | Native Resolution | Native Resolution | Native Resolution | Native Resolution | Dynamic Resolution | Dynamic Resolution |
|------------|-------|-------|---------|---------|------------------|------------------|
| Mode | Tiny | Small | Base | Large | Gundam | Gundam-M |
| Resolution | 512 | 640 | 1024 | 1280 | 640 + 1024 | 1024 + 1280 |
| Tokens | 64 | 100 | 256 | 400 | n × 100 + 256 | n × 256 + 400 |
| Process | resize | resize | padding | padding | resize + padding | resize + padding |
#### 3.2.2. Multiple resolution support
Suppose we have an image with 1000 optical characters and we want to test how many vision tokens are needed for decoding. This requires the model to support a variable number of vision tokens. That is to say the DeepEncoder needs to support multiple resolutions.
假設我們有一張包含 1000 個光學字符的影像,我們想要測試解碼需要多少個 vision tokens。這意味著模型需要支援可變數量的 vision tokens,也就是說,DeepEncoder 需要支援多種解析度。
We meet the requirement aforementioned through dynamic interpolation of positional encodings, and design several resolution modes for simultaneous model training to achieve the capability of a single DeepSeek-OCR model supporting multiple resolutions. As shown in Figure 4, DeepEncoder mainly supports two major input modes: native resolution and dynamic resolution. Each of them contains multiple sub-modes.
我們透過對位置編碼進行動態插值的方式來滿足上述要求,並設計了多種解析度模式以實現模型的同步訓練,從而使得單一個 DeepSeek-OCR 模型能夠支援多種解析度的需求。如 Figure 4 所示,DeepEncoder 主要支援兩種輸入模式:原始解析度和動態解析度,每種模式都包含多個子模式。
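:::warning
個人註解:論文提到透過「位置編碼的動態插值」來支援多解析度,下面是這種做法的常見寫法示意(非論文原始碼):把學習到的方形位置編碼表以雙三次插值重採樣到新的 patch 網格大小。是否有 class token、插值模式等細節皆為假設。
```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """把 (1, N, C) 的位置編碼表重採樣到 new_grid × new_grid(示意用)。"""
    _, n, c = pos_embed.shape
    old_grid = int(n ** 0.5)  # 假設原本是正方形網格
    grid = pos_embed.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)

# 例:為 64×64 patch 訓練的編碼表,重採樣給 80×80 的 patch 網格(1280×1280 輸入、patch 16)
print(interpolate_pos_embed(torch.randn(1, 64 * 64, 768), 80).shape)  # (1, 6400, 768)
```
:::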
Native resolution supports four sub-modes: Tiny, Small, Base, and Large, with corresponding resolutions and token counts of 512 × 512 (64), 640 × 640 (100), 1024 × 1024 (256), and 1280 × 1280 (400) respectively. Since Tiny and Small modes have relatively small resolutions, to avoid wasting vision tokens, images are processed by directly resizing the original shape. For Base and Large modes, in order to preserve the original image aspect ratio, images are padded to the corresponding size. After padding, the number of valid vision tokens is less than the actual number of vision tokens, with the calculation formula being:
$$
N_{\text{valid}}=\left\lceil N_{\text{actual}}\times[1-((\max(w,h)-\min(w,h))/(\max(w,h)))] \right\rceil \tag{1}
$$
where $w$ and $h$ represent the width and height of the original input image.
原始解析度支援四種子模式:Tiny、Small、Base 和 Large,對應的解析度和 vision token 的數量分別為 512 × 512(64)、640 × 640(100)、1024 × 1024(256)和 1280 × 1280(400)。由於 Tiny 和 Small 模式的解析度相對較低,為避免 vision token 的浪費,影像會通過直接調整原始的尺寸來進行處理。對於 Base 和 Large 模式,為了保持影像的原始長寬比,影像會被填補(pad)到相應的大小。在填補之後,有效的 vision token 數量會少於實際的 vision token 數量,其計算公式為:
$$
N_{\text{valid}}=\left\lceil N_{\text{actual}}\times[1-((\max(w,h)-\min(w,h))/(\max(w,h)))] \right\rceil \tag{1}
$$
其中 $w$ 和 $h$ 分別代表原始影像的寬度與高度。
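:::warning
個人註解:公式 (1) 可化簡為 $N_{\text{valid}}=\lceil N_{\text{actual}}\times\min(w,h)/\max(w,h)\rceil$,下面用幾行 Python 示意;A4 長寬比只是舉例,Table 3 括號內的數值即依此公式對各頁計算後的平均。
```python
import math

def valid_vision_tokens(n_actual: int, w: int, h: int) -> int:
    """公式 (1):padding 之後,只有被實際內容覆蓋的部份才算有效 vision tokens。"""
    return math.ceil(n_actual * (1 - (max(w, h) - min(w, h)) / max(w, h)))

# Base 模式(1024 × 1024,256 tokens)搭配 A4 長寬比(210 × 297)的頁面
print(valid_vision_tokens(256, 210, 297))  # 182,接近 Table 3 中 Base 模式標示的 256(182)
```
:::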
Dynamic resolution can be composed of two native resolutions. For example, Gundam mode consists of $n$ × 640 × 640 tiles (local views) and a 1024 × 1024 global view. The tiling method follows InternVL2.0 [8]. Supporting dynamic resolution is mainly for application considerations, especially for ultra-high-resolution inputs (such as newspaper images). Tiling is a form of secondary window attention that can effectively reduce activation memory further. It's worth noting that due to our relatively large native resolutions, images won't be fragmented too much under dynamic resolution (the number of tiles is controlled within the range of 2 to 9). The vision token number output by the DeepEncoder under Gundam mode is: $n$ × 100 + 256, where $n$ is the number of tiles. For images with both width and height smaller than 640, $n$ is set to 0, i.e., Gundam mode will degrade to Base mode.
動態解析度可以由兩種原生解析度組合而成。以 Gundam 模式為例,它包含 $n$ 個 640 × 640 的圖塊(tile,局部視圖)與一個 1024 × 1024 的全局視圖。圖塊化的方法(tiling method)依循 InternVL2.0 [8]。支援動態解析度主要是出於應用方面的考量,特別是針對超高解析度的輸入(如報紙影像)。圖塊化是一種二次的窗口注意力,能夠進一步有效降低激活記憶體的消耗。值得注意的是,由於我們的原生解析度相對較高,在動態解析度下影像不會被過度切割(圖塊數量控制在 2 到 9 之間)。在 Gundam 模式下,DeepEncoder 輸出的 vision tokens 數量為:$n$ × 100 + 256,其中 $n$ 代表圖塊的數量。對於寬、高都小於 640 的影像,$n$ 會被設定為 0,也就是說,Gundam 模式會降級為 Base 模式。
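:::warning
個人註解:Gundam 模式的 vision token 數量($n$ × 100 + 256)與退化成 Base 模式的規則,可以用下面這個小函式示意;把 n 限制在 2 到 9 之間的寫法是依內文描述自行加上的假設。
```python
def gundam_vision_tokens(w: int, h: int, n_tiles: int) -> int:
    """Gundam 模式 token 數量示意:n 個 640×640 圖塊(各 100 tokens)+ 1024×1024 全局視圖(256 tokens)。"""
    if w < 640 and h < 640:
        return 256  # 寬高皆小於 640:退化為 Base 模式,只剩全局視圖
    n_tiles = max(2, min(9, n_tiles))  # 內文提到圖塊數量控制在 2 到 9 之間
    return n_tiles * 100 + 256

print(gundam_vision_tokens(2000, 2800, n_tiles=6))  # 856
print(gundam_vision_tokens(500, 400, n_tiles=6))    # 256(退化為 Base)
```
:::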
Gundam mode is trained together with the four native resolution modes to achieve the goal of one model supporting multiple resolutions. Note that Gundam-master mode (1024 × 1024 local views+1280 × 1280 global view) is obtained through continued training on a trained DeepSeekOCR model. This is mainly for load balancing, as Gundam-master's resolution is too large and training it together would slow down the overall training speed.
Gundam 模式是與四種原生解析模式一起進行訓練的,以此實現單一模型支持多種解析度的目標。Gundam-master 模式(1024 × 1024 的局部視圖加上 1280 × 1280 的全局視圖)是通過在已訓練好的 DeepSeekOCR 模型上進一步訓練而獲得的。這一設計的主要目的是爲了實現負載均衡,由於 Gundam-master 模式的解析度實在太高了,若與其它模式一起訓練可能會影響整體的訓練效率。
### 3.3. The MoE Decoder
Our decoder uses the DeepSeekMoE [19, 20], specifically DeepSeek-3B-MoE. During inference, the model activates 6 out of 64 routed experts and 2 shared experts, with about 570M activated parameters. The 3B DeepSeekMoE is very suitable for domain-centric (OCR for us) VLM research, as it obtains the expressive capability of a 3B model while enjoying the inference efficiency of a 500M small model.
我們的解碼器採用 DeepSeekMoE [19, 20],具體來說是 DeepSeek-3B-MoE。在推理過程中,模型會從 64 個路由專家(routed experts)中激活 6 個,再加上 2 個共享專家(shared experts),激活參數約為 570M。3B 的 DeepSeekMoE 非常適合以特定領域為核心(對我們而言是 OCR)的 VLM 研究:它既擁有 3B 模型的表達能力,又享有 500M 小模型的推理效率。
The decoder reconstructs the original text representation from the compressed latent vision tokens of DeepEncoder as:
$$
f_{\text{dec}}:\mathbb{R}^{n\times d_{\text{latent}}} \to \mathbb{R}^{N\times d_{\text{text}}};\quad \hat{\mathbf{X}}=f_{\text{dec}}(\mathbf{Z}) \text{ where } n\leq N \tag{2}
$$
where $\mathbf{Z} \in \mathbb{R}^{n\times d_{\text{latent}}}$ are the compressed latent (vision) tokens from DeepEncoder and $\hat{\mathbf{X}}\in\mathbb{R}^{N\times d_{\text{text}}}$ is the reconstructed text representation. The function $f_{\text{dec}}$ represents a non-linear mapping that can be effectively learned by compact language models through OCR-style training. It is reasonable to conjecture that LLMs, through specialized pretraining optimization, would demonstrate more natural integration of such capabilities.
解碼器根據 DeepEncoder 壓縮後的 latent vision tokens,重新構建出原始的文本表示形式為:
$$
f_{\text{dec}}:\mathbb{R}^{n\times d_{\text{latent}}} \to \mathbb{R}^{N\times d_{\text{text}}};\quad \hat{\mathbf{X}}=f_{\text{dec}}(\mathbf{Z}) \text{ where } n\leq N \tag{2}
$$
其中 $\mathbf{Z} \in \mathbb{R}^{n\times d_{\text{latent}}}$ 是來自 DeepEncoder 的壓縮後 latent(vision)tokens,而 $\hat{\mathbf{X}}\in\mathbb{R}^{N\times d_{\text{text}}}$ 是重建的文本表示。函數 $f_{\text{dec}}$ 代表一種非線性映射,可由緊湊的語言模型透過 OCR 式訓練有效學習。可以合理推測,經過專門的預訓練最佳化,大型語言模型(LLMs)將能更自然地整合這類能力。
### 3.4. Data Engine
We construct complex and diverse training data for DeepSeek-OCR, including OCR 1.0 data, which mainly consists of traditional OCR tasks such as scene image OCR and document OCR; OCR 2.0 data, which mainly includes parsing tasks for complex artificial images, such as common charts, chemical formulas, and plane geometry parsing data; and general vision data, which is mainly used to inject certain general image understanding capabilities into DeepSeek-OCR and preserve the general vision interface.
我們為 DeepSeek-OCR 建構了複雜且多樣化的訓練資料,包括 OCR 1.0 資料,主要包含傳統的 OCR 任務,如場景影像 OCR 和文件 OCR;OCR 2.0 資料,主要涉及對複雜人工影像的解析任務,例如常見的圖表、化學式和平面幾何圖形的解析;以及通用視覺資料,用於將一定的通用影像理解能力注入 DeepSeek-OCR,並保留其通用的視覺介面。
#### 3.4.1. OCR 1.0 data
Document data is the top priority for DeepSeek-OCR. We collect 30M pages of diverse PDF data covering about 100 languages from the Internet, with Chinese and English accounting for approximately 25M and other languages accounting for 5M. For this data, we create two types of ground truth: coarse annotations and fine annotations. Coarse annotations are extracted directly from the full dataset using fitz, aimed at teaching the model to recognize optical text, especially in minority languages. Fine annotations include 2M pages each for Chinese and English, labeled using advanced layout models (such as PP-DocLayout [33]) and OCR models (such as MinerU [34] and GOT-OCR2.0 [38]) to construct detection and recognition interleaved data. For minority languages, in the detection part, we find that the layout model enjoys certain generalization capabilities. In the recognition part, we use fitz to create small patch data to train a GOT-OCR2.0, then use the trained model to label small patches after layout processing, employing a model flywheel to create 600K data samples. During the training of DeepSeek-OCR, coarse labels and fine labels are distinguished using different prompts. The ground truth for fine annotation image-text pairs can be seen in Figure 5. We also collect 3M Word data, constructing high-quality image-text pairs without layout by directly extracting content. This data mainly brings benefits to formulas and HTML-formatted tables. Additionally, we select some open-source data [28, 37] as supplements.
文件資料是 DeepSeek-OCR 的重中之重。我們從網路上收集了 3000 萬頁、涵蓋約 100 種語言的 PDF 資料,其中中文和英文約佔 2500 萬頁,其他語言約佔 500 萬頁。對於這些資料,我們製作了兩種類型的標記:粗略標記和精細標記。粗略標記是使用 `fitz` 直接從完整的資料集中提取的,旨在訓練模型識別光學文本,尤其是少數語言的文本。精細標記的部份,中文和英文各包含 200 萬頁,使用先進的佈局模型(layout models,如 PP-DocLayout [33])和 OCR 模型(如 MinerU [34] 和 GOT-OCR2.0 [38])進行標記,以構建檢測與識別交錯(interleaved)的資料。在處理少數語言時,我們發現佈局模型在檢測部份具有一定的泛化能力;在識別部份,我們使用 `fitz` 生成小塊(patch)資料來訓練一個 GOT-OCR2.0,再用訓練好的模型對佈局處理後的小塊進行標記,透過模型飛輪(model flywheel)的方式產生 60 萬筆資料樣本。在 DeepSeek-OCR 的訓練過程中,粗略標記和精細標記使用不同的提示詞進行區分。精細標記的圖像-文本對(image-text pairs)的真實標記可見 Figure 5。此外,我們還收集了 300 萬頁的 Word 資料,透過直接提取內容來構建不含佈局的高品質 image-text pairs,這些資料主要對公式和 HTML 格式的表格有幫助。另外,我們還選擇了一些開源資料 [28, 37] 作為補充。
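:::warning
個人註解:粗略標記是用 `fitz`(PyMuPDF)直接抽取的。下面是一段概念示意(DPI、輸出欄位等皆為假設,並非論文的資料管線):把每一頁渲染成影像,並把 fitz 抽出的純文字當作粗略標記。
```python
import fitz  # PyMuPDF

def coarse_page_annotations(pdf_path: str) -> list[dict]:
    """用 fitz 建立「頁面影像 + 粗略文字標記」的示意(非論文原始資料管線)。"""
    samples = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=144)   # 頁面渲染成影像,DPI 為假設值
            text = page.get_text("text")     # 依閱讀順序抽出的純文字,當作粗略標記
            samples.append({"image_png": pix.tobytes("png"), "label": text})
    return samples
```
:::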

Figure 5 | OCR 1.0 fine annotations display. We format the ground truth into an interleaved layout and text format, where each paragraph of text is preceded by the coordinates and label of it in the original image. All coordinates are normalized into 1000 bins.
Figure 5 | OCR 1.0的精細註解。我們將真實資料格式化為交錯佈局和文本格式,其中每段文本前面都標有其在原始影像中的座標和標籤。所有座標都已被正規化到1000個區間內。
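:::warning
個人註解:Figure 5 提到座標會被正規化到 1000 個 bins。下面是一個最小示意(交錯式 layout+text 標記的完整序列化格式論文沒有給出,這裡只示意座標的部份):
```python
def normalize_box(x0: float, y0: float, x1: float, y1: float,
                  width: int, height: int) -> list[int]:
    """把像素座標框映射到 [0, 999] 的整數 bins,使標記與原圖解析度無關。"""
    return [
        min(999, int(x0 / width * 1000)),
        min(999, int(y0 / height * 1000)),
        min(999, int(x1 / width * 1000)),
        min(999, int(y1 / height * 1000)),
    ]

print(normalize_box(120, 80, 560, 240, width=1240, height=1754))  # [96, 45, 451, 136]
```
:::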
For natural scene OCR, our model mainly supports Chinese and English. The image data sources come from LAION [31] and Wukong [13], labeled using PaddleOCR [9], with 10M data samples each for Chinese and English. Like document OCR, natural scene OCR can also control whether to output detection boxes through prompts.
對於自然場景的光學字符識別(OCR)的部份,我們的模型主要支援中英文兩種語言。影像資料來源為 LAION [31] 和 Wukong [13],使用 PaddleOCR [9] 進行標記,中英文各包含 1000 萬筆樣本資料。與文檔 OCR 一樣,自然場景 OCR 也可以透過提示詞控制是否輸出檢測框(detection boxes)。
#### 3.4.2. OCR 2.0 data
Following GOT-OCR2.0 [38], we refer to chart, chemical formula, and plane geometry parsing data as OCR 2.0 data. For chart data, following OneChart [7], we use pyecharts and matplotlib to render 10M images, mainly including commonly used line, bar, pie, and composite charts. We define chart parsing as image-to-HTML-table conversion task, as shown in Figure 6(a). For chemical formulas, we utilize SMILES format from PubChem as the data source and render them into images using RDKit, constructing 5M image-text pairs. For plane geometry images, we follow Slow Perception [39] for generation. Specifically, we use perception-ruler size as 4 to model each line segment. To increase the diversity of rendered data, we introduce geometric translation-invariant data augmentation, where the same geometric image is translated in the original image, corresponding to the same ground truth drawn at the centered position in the coordinate system. Based on this, we construct a total of 1M plane geometry parsing data, as illustrated in Figure 6(b).
依循 GOT-OCR2.0 [38] 的分類,我們將圖表、化學公式和平面幾何圖形的解析資料稱為 OCR 2.0 資料。對於圖表資料,參考 OneChart [7] 的方法,我們使用 `pyecharts` 和 `matplotlib` 渲染了 1000 萬張影像,主要包括常見的折線圖、柱狀圖、圓餅圖和複合圖表。我們將圖表解析定義為 image-to-HTML-table 的轉換任務,如 Figure 6(a) 所示。對於化學公式,我們使用 PubChem 提供的 SMILES 格式作為資料來源,並使用 RDKit 將其渲染成影像,構建了 500 萬組 image-text pairs。對於平面幾何圖形,我們依循 Slow Perception [39] 的方法生成資料。具體來說,我們將 perception-ruler 的尺寸設為 4 來建模每條線段。為了增加渲染資料的多樣性,我們引入了幾何平移不變的資料增強:同一張幾何圖形在原始影像中平移,但對應的真實標記(ground truth)都繪製在座標系統的中心位置。基於這種方式,我們總共構建了 100 萬組平面幾何圖形解析資料,如 Figure 6(b) 所示。
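:::warning
個人註解:化學式部份的資料合成流程(PubChem 的 SMILES → 用 RDKit 渲染成影像,SMILES 本身就是文字端的 ground truth)大致可以這樣示意;影像尺寸與檔名皆為假設。
```python
from rdkit import Chem
from rdkit.Chem import Draw

def render_smiles(smiles: str, path: str, size: int = 640) -> str:
    """把 SMILES 渲染成影像並回傳對應標記(示意用,非論文原始資料引擎)。"""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    Draw.MolToFile(mol, path, size=(size, size))  # 渲染成圖片檔
    return smiles  # SMILES 字串本身作為 image-text pair 的文字端

render_smiles("CC(=O)Oc1ccccc1C(=O)O", "aspirin.png")  # 以阿司匹林為例
```
:::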

Figure 6 | For charts, we do not use OneChart's [7] dictionary format, but instead use HTML table format as labels, which can save a certain amount of tokens. For plane geometry, we convert the ground truth to dictionary format, where the dictionary contains keys such as line segments, endpoint coordinates, line segment types, etc., for better readability. Each line segment is encoded using the Slow Perception [39] manner.
Figure 6 | 針對圖表的部份,我們並沒有採用 OneChart [7] 的字典格式,而是使用 HTML 表格格式作為標記,這能在一定程度上節省 token 數量。在平面幾何的部份,我們將真實標記轉換成字典格式,字典中包含線段、端點座標、線段類型等鍵值,以提升可讀性。每條線段都使用 Slow Perception [39] 的方式進行編碼。
#### 3.4.3. General vision data
DeepEncoder can benefit from CLIP's pretraining gains and has sufficient parameters to incorporate general visual knowledge. Therefore, we also prepare some corresponding data for DeepSeek-OCR. Following DeepSeek-VL2 [40], we generate relevant data for tasks such as caption, detection, and grounding. Note that DeepSeek-OCR is not a general VLM model, and this portion of data accounts for only 20% of the total data. We introduce such type of data mainly to preserve the general vision interface, so that researchers interested in our model and general vision task can conveniently advance their work in the future.
DeepEncoder 能夠從 CLIP 的預訓練成果中獲益,並且擁有足夠的參數來整合通用的視覺知識。因此,我們也為 DeepSeek-OCR 準備了一些相應的資料。依循 DeepSeek-VL2 [40] 的做法,我們生成了影像描述(caption)、物體偵測(detection)和視覺定位(grounding)等任務的相關資料。需要強調的是,DeepSeek-OCR 並非一個通用的視覺語言模型(VLM),這部分資料僅佔總資料的 20%。我們引入這類資料的主要目的是保留通用的視覺介面,讓對我們的模型及通用視覺任務感興趣的研究人員未來能更方便地推進他們的工作。
#### 3.4.4. Text-only data
To ensure the model's language capabilities, we introduced 10% of in-house text-only pretrain data, with all data processed to a length of 8192 tokens, which is also the sequence length for DeepSeek-OCR. In summary, when training DeepSeek-OCR, OCR data accounts for 70%, general vision data accounts for 20%, and text-only data accounts for 10%.
為了確保模型在語言方面的能力,我們使用了 10% 的內部純文本預訓練資料。所有這些資料都被處理成 8192 個 token 的長度,這與 DeepSeek-OCR 的序列長度是一致的。總的來說,在訓練 DeepSeek-OCR 時,OCR 資料佔了 70% 的比例,一般視覺資料佔了 20%,純文本資料佔了 10% 的比例。
### 3.5. Training Pipelines
Our training pipeline is very simple and consists mainly of two stages: a).Training DeepEncoder independently; b).Training the DeepSeek-OCR. Note that the Gundam-master mode is obtained by continuing training on a pre-trained DeepSeek-OCR model with 6M sampled data. Since the training protocol is identical to other modes, we omit the detailed description hereafter.
我們的訓練流程非常簡單,主要包含兩個階段:a) 獨立訓練 DeepEncoder;b) 訓練 DeepSeek-OCR。請注意,Gundam-master 模式是透過對已預訓練的 DeepSeek-OCR 模型進行額外訓練(使用 600 萬個樣本資料)來獲得的。由於這個訓練協議與其它模式相同,因此這邊就不再詳述。
#### 3.5.1. Training DeepEncoder
Following Vary [36], we utilize a compact language model [15] and use the next token prediction framework to train DeepEncoder. In this stage, we use all OCR 1.0 and 2.0 data aforementioned, as well as 100M general data sampled from the LAION [31] dataset. All data is trained for 2 epochs with a batch size of 1280, using the AdamW [23] optimizer with cosine annealing scheduler [22] and a learning rate of 5e-5. The training sequence length is 4096.
依循 Vary [36] 的做法,我們採用一個緊湊型的語言模型 [15],並使用下一個 token 預測(next token prediction)框架來訓練 DeepEncoder。在這個階段中,我們使用前面提到的所有 OCR 1.0 和 OCR 2.0 資料,以及從 LAION [31] 資料集中抽取的 1 億筆通用資料。所有資料都訓練 2 個 epoch,批量大小為 1280,使用 AdamW [23] optimizer 搭配 cosine annealing scheduler(餘弦退火)[22],學習率設為 5e-5。訓練序列長度為 4096。
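:::warning
個人註解:第一階段的最佳化設定(AdamW + cosine annealing、學習率 5e-5)大致對應到下面這種 PyTorch 寫法;模型、總步數與 weight decay 皆為佔位用的假設,並非公開的訓練配方。
```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(1024, 1024)      # 佔位:實際為 DeepEncoder + 緊湊語言模型
total_steps = 10_000               # 假設值:取決於資料量與 batch size 1280
optimizer = AdamW(model.parameters(), lr=5e-5)               # 論文:lr = 5e-5
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)  # 餘弦退火排程
```
:::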
#### 3.5.2. Training DeepSeek-OCR
After DeepEncoder is ready, we use the data mentioned in Section 3.4 to train DeepSeek-OCR, with the entire training process conducted on the HAI-LLM [14] platform. The entire model uses pipeline parallelism (PP) and is divided into 4 parts, with DeepEncoder taking two parts and the decoder taking two parts. For DeepEncoder, we treat SAM and the compressor as the vision tokenizer, place them in PP0 and freeze their parameters, while treating the CLIP part as the input embedding layer and placing it in PP1 with unfrozen weights for training. For the language model part, since DeepSeek3B-MoE has 12 layers, we place 6 layers each on PP2 and PP3. We use 20 nodes (each with 8 A100-40G GPUs) for training, with a data parallelism (DP) of 40 and a global batch size of 640. We use the AdamW optimizer with a step-based scheduler and an initial learning rate of 3e-5. For text-only data, the training speed is 90B tokens/day, while for multimodal data, the training speed is 70B tokens/day.
在 DeepEncoder 準備就緒後,我們使用 Section 3.4 中所提到的資料來訓練 DeepSeek-OCR,整個訓練過程都在 HAI-LLM [14] 平台上進行。整個模型採用 pipeline parallelism(PP),並拆分為 4 個部分,其中 DeepEncoder 佔用兩個部分,decoder 佔用兩個部分。對於 DeepEncoder,我們將 SAM 和壓縮器(compressor)視為視覺分詞器(vision tokenizer),將其置於 PP0 並凍結其參數,同時將 CLIP 部分視為輸入嵌入層(input embedding layer),置於 PP1 且不凍結權重以進行訓練。對於語言模型部分,由於 DeepSeek3B-MoE 有 12 層,我們在 PP2 和 PP3 各放置 6 層。我們使用 20 個節點(每個節點配有 8 個 A100-40G GPU)進行訓練,資料並行(DP)為 40,全局批次大小為 640。我們使用 AdamW optimizer 搭配 step-based scheduler,初始學習率為 3e-5。對於純文本資料,訓練速度為 90B tokens/day;對於多模態資料,訓練速度則為 70B tokens/day。
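:::warning
個人註解:第二階段中「SAM + 壓縮器凍結、CLIP 與解碼器可訓練」的參數設定,可以用下面這段程式碼示意(pipeline parallelism 在 HAI-LLM 上的實際切分這裡不重現,模組皆為佔位)。
```python
import torch.nn as nn

def configure_trainable_parts(sam: nn.Module, compressor: nn.Module,
                              clip: nn.Module, decoder: nn.Module) -> list[nn.Parameter]:
    """凍結視覺分詞器(SAM + 16× 壓縮器),保留 CLIP 與 MoE 解碼器可訓練(示意用)。"""
    for module in (sam, compressor):          # 對應 PP0:凍結的 vision tokenizer
        for p in module.parameters():
            p.requires_grad_(False)
    trainable = []
    for module in (clip, decoder):            # 對應 PP1(CLIP)與 PP2/PP3(解碼器)
        for p in module.parameters():
            p.requires_grad_(True)
            trainable.append(p)
    return trainable                          # 可直接交給 AdamW(lr=3e-5)

# 例(以小模組佔位):
params = configure_trainable_parts(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))
print(len(params))  # 4(兩個 Linear,各含 weight 與 bias)
```
:::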
Table 2 | We test DeepSeek-OCR's vision-text compression ratio using all English documents with 600-1300 tokens from the Fox [21] benchmarks. Text tokens represent the number of tokens after tokenizing the ground truth text using DeepSeek-OCR's tokenizer. Vision Tokens=64 or 100 respectively represent the number of vision tokens output by DeepEncoder after resizing input images to 512 × 512 and 640 × 640.
Table 2 | 我們使用 Fox [21] 基準測試集中的所有英語文件(每個文件包含 600-1300 個 tokens)來測試 DeepSeek-OCR 的視覺文本壓縮比例。Text tokens 指的是使用 DeepSeek-OCR 的分詞器對原始文本進行分詞處理後得到的 token 數量。Vision Tokens 分別為 64 或 100,代表將輸入影像尺寸調整為 512 × 512 或 640 × 640 後,DeepEncoder 生成的 vision tokens 的數量。
| | Vision Tokens =64 | Vision Tokens =64 | Vision Tokens=100 | Vision Tokens=100 | |
|-------------|---------------------|---------------------|---------------------|---------------------|-------|
| Text Tokens | Precision | Compression | Precision | Compression | Pages |
| 600-700 | 96.5% | 10.5 × | 98.5% | 6.7 × | 7 |
| 700-800 | 93.8% | 11.8 × | 97.3% | 7.5 × | 28 |
| 800-900 | 83.8% | 13.2 × | 96.8% | 8.5 × | 28 |
| 900-1000 | 85.9% | 15.1 × | 96.8% | 9.7 × | 14 |
| 1000-1100 | 79.3% | 16.5 × | 91.5% | 10.6 × | 11 |
| 1100-1200 | 76.4% | 17.7 × | 89.8% | 11.3 × | 8 |
| 1200-1300 | 59.1% | 19.7 × | 87.1% | 12.6 × | 4 |
## 4. Evaluation
### 4.1. Vision-text Compression Study
We select Fox [21] benchmarks to verify DeepSeek-OCR's compression-decompression capability for text-rich documents, in order to preliminarily explore the feasibility and boundaries of contexts optical compression. We use the English document portion of Fox, tokenize the ground truth text with DeepSeek-OCR's tokenizer (vocabulary size of approximately 129k), and select documents with 600-1300 tokens for testing, which happens to be 100 pages. Since the number of text tokens is not large, we only need to test performance in Tiny and Small modes, where Tiny mode corresponds to 64 tokens and Small mode corresponds to 100 tokens. We use the prompt without layout: "<image> \n Free OCR." to control the model's output format. Nevertheless, the output format still cannot completely match Fox benchmarks, so the actual performance would be somewhat higher than the test results.
我們選擇 Fox [21] 基準測試來驗證 DeepSeek-OCR 對文本密集型文件的壓縮-解壓能力,以初步探索上下文光學壓縮的可行性與邊界。我們使用 Fox 的英文文件部分,以 DeepSeek-OCR 的分詞器(詞表大小約 129k)對真實文本(ground truth)進行分詞,並選取含 600 至 1300 個 token 的文件進行測試,剛好是 100 頁。由於文本 token 的數量不大,我們只需測試 Tiny 和 Small 模式下的效能,其中 Tiny 模式對應 64 個 vision tokens,Small 模式對應 100 個 vision tokens。我們使用不含佈局(layout)的提示詞:「<image> \n Free OCR.」來控制模型的輸出格式。儘管如此,輸出格式仍無法完全與 Fox 基準測試匹配,因此實際效能會略高於測試結果。
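:::warning
個人註解:本節「壓縮比」的定義與評測流程大致如下面的示意;其中的相似度函式只是佔位,論文在 Fox 上實際使用的精度指標並未在此重現。
```python
from difflib import SequenceMatcher

def compression_ratio(num_text_tokens: int, num_vision_tokens: int) -> float:
    """Figure 1(a) 的壓縮比:真實文本 token 數 / 模型使用的 vision token 數。"""
    return num_text_tokens / num_vision_tokens

def rough_precision(prediction: str, ground_truth: str) -> float:
    """佔位用的相似度分數,僅示意評測的形狀,並非論文的精度定義。"""
    return SequenceMatcher(None, prediction, ground_truth).ratio()

print(compression_ratio(1000, 100))                   # 10.0:約對應內文所說 ~97% 精度的區間
print(rough_precision("hello world", "hello w0rld"))  # ≈ 0.91
```
:::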
Table 3 | We use OmniDocBench [27] to test the performance of DeepSeek-OCR on real document parsing tasks. All metrics in the table are edit distances, where smaller values indicate better performance. "Tokens" represents the average number of vision tokens used per page, and "† 200dpi" means using `fitz` to interpolate the original image to 200dpi. For the DeepSeek-OCR model, the values in parentheses in the "Tokens" column represent valid vision tokens, calculated according to Equation 1.
Table 3 | 我們使用 OmniDocBench [27] 來測試 DeepSeek-OCR 在實際文件解析任務中的表現。表中的所有指標均為編輯距離(edit distance),數值越小表示表現越好。"Tokens" 代表每頁平均使用的 vision tokens 數量,"† 200dpi" 表示使用 `fitz` 將原始影像插值到 200dpi。對於 DeepSeek-OCR 模型,"Tokens" 欄中括號內的數值代表有效的 vision tokens,依公式 (1) 計算。
| Model | Tokens | English: overall | text | formula | table / order | Chinese: overall | text | formula | table / order |
|---|---|---|---|---|---|---|---|---|---|
| **Pipeline Models** | | | | | | | | | |
| Dolphin [11] | - | 0.356 | 0.352 | 0.465 | 0.258 / 0.35 | 0.44 | 0.44 | 0.604 | 0.367 / 0.351 |
| Marker [1] | - | 0.296 | 0.085 | 0.374 | 0.609 / 0.116 | 0.497 | 0.293 | 0.688 | 0.678 / 0.329 |
| Mathpix [2] | - | 0.191 | 0.105 | 0.306 | 0.243 / 0.108 | 0.364 | 0.381 | 0.454 | 0.32 / 0.30 |
| MinerU-2.1.1 [34] | - | 0.162 | 0.072 | 0.313 | 0.166 / 0.097 | 0.244 | 0.111 | 0.581 | 0.15 / 0.136 |
| MonkeyOCR-1.2B [18] | - | 0.154 | 0.062 | 0.295 | 0.164 / 0.094 | 0.263 | 0.179 | 0.464 | 0.168 / 0.243 |
| PPstructure-v3 [9] | - | 0.152 | 0.073 | 0.295 | 0.162 / 0.077 | 0.223 | 0.136 | 0.535 | 0.111 / 0.11 |
| **End-to-end Models** | | | | | | | | | |
| Nougat [6] | 2352 | 0.452 | 0.365 | 0.488 | 0.572 / 0.382 | 0.973 | 0.998 | 0.941 | 1.00 |
| SmolDocling [25] | 392 | 0.493 | 0.262 | 0.753 | 0.729 / 0.227 | 0.816 | 0.838 | 0.997 | 0.954 0.907 0.522 |
| InternVL2-76B [8] | 6790 | 0.44 | 0.353 | 0.543 | 0.547 / 0.317 | 0.443 | 0.29 | 0.701 | 0.555 / 0.228 |
| Qwen2.5-VL-7B [5] | 3949 | 0.316 | 0.151 | 0.376 | 0.598 / 0.138 | 0.399 | 0.243 | 0.5 | 0.627 |
| OLMOCR [28] | 3949 | 0.326 | 0.097 | 0.455 | 0.608 / 0.145 | 0.469 | 0.293 | 0.655 | 0.226 0.652 0.277 |
| GOT-OCR2.0 [38] | 256 | 0.287 | 0.189 | 0.360 | 0.459 / 0.141 | 0.411 | 0.315 | 0.528 | 0.52 / 0.28 |
| OCRFlux-3B [3] | 3949 | 0.238 | 0.112 | 0.447 | 0.269 / 0.126 | 0.349 | 0.256 | 0.716 | 0.162 / 0.263 |
| GPT4o [26] | - | 0.233 | 0.144 | 0.425 | 0.234 / 0.128 | 0.399 | 0.409 | 0.606 | 0.329 / 0.251 |
| InternVL3-78B [42] | 6790 | 0.218 | 0.117 | 0.38 | 0.095 | 0.296 | 0.21 | 0.533 | 0.282 / 0.161 |
| Qwen2.5-VL-72B [5] | 3949 | 0.214 | 0.092 | 0.315 | 0.279 0.341 0.106 | 0.261 | 0.18 | 0.434 | 0.262 / 0.168 |
| dots.ocr [30] | 3949 | 0.182 | 0.137 | 0.320 | 0.166 / 0.182 | 0.261 | 0.229 | 0.468 | 0.160 / 0.261 |
| Gemini2.5-Pro [4] | - | 0.148 | 0.055 | 0.356 | 0.13 / 0.049 | 0.212 | 0.168 | 0.439 | 0.119 / 0.121 |
| MinerU2.0 [34] | 6790 | 0.133 | 0.045 | 0.273 | 0.15 / 0.066 | 0.238 | 0.115 | 0.506 | 0.209 / 0.122 |
| dots.ocr † 200dpi [30] | 5545 | 0.125 | 0.032 | 0.329 | 0.099 / 0.04 | 0.16 | 0.066 | 0.416 | 0.092 / 0.067 |
| **DeepSeek-OCR (end2end)** | | | | | | | | | |
| Tiny | 64 | 0.386 | 0.373 | 0.469 | 0.422 / 0.283 | 0.361 | 0.307 | 0.635 | 0.266 / 0.236 |
| Small | 100 | 0.221 | 0.142 | 0.373 | 0.242 / 0.125 | 0.284 | 0.24 | 0.53 | 0.159 / 0.205 |
| Base | 256 (182) | 0.137 | 0.054 | 0.267 | 0.163 / 0.064 | 0.24 | 0.205 | 0.474 | 0.1 / 0.181 |
| Large | 400 (285) | 0.138 | 0.054 | 0.277 | 0.152 / 0.067 | 0.208 | 0.143 | 0.461 | 0.104 / 0.123 |
| Gundam | 795 | 0.127 | 0.043 | 0.269 | 0.134 / 0.062 | 0.181 | 0.097 | 0.432 | 0.089 / 0.103 |
| Gundam-M † 200dpi | 1853 | 0.123 | 0.049 | 0.242 | 0.147 / 0.056 | 0.157 | 0.087 | 0.377 | 0.08 / 0.085 |
As shown in Table 2, within a 10 × compression ratio, the model's decoding precision can reach approximately 97%, which is a very promising result. In the future, it may be possible to achieve nearly 10 × lossless contexts compression through text-to-image approaches. When the compression ratio exceeds 10 × , performance begins to decline, which may have two reasons: one is that the layout of long documents becomes more complex, and another reason may be that long texts become blurred at 512 × 512 or 640 × 640 resolution. The first issue can be solved by rendering texts onto a single layout page, while we believe the second issue will become a feature of the forgetting mechanism. When compressing tokens by nearly 20 × , we find that precision can still approach 60%. These results indicate that optical contexts compression is a very promising and worthwhile research direction, and this approach does not bring any overhead because it can leverage VLM infrastructure, as multimodal systems inherently require an additional vision encoder.
如 Table 2 所示,在 10 倍壓縮率以內,模型的解碼精度可達到約 97%,這是一個非常令人振奮的結果。未來,透過文字轉影像(text-to-image)的方法,或許能實現接近 10 倍的無損上下文壓縮。當壓縮率超過 10 倍時,性能開始下降,這可能有兩個原因:一是長文件的佈局變得更加複雜,另一個原因可能是長文本在 512 × 512 或 640 × 640 解析度下變得模糊。第一個問題可以透過將文本渲染到單一佈局的頁面上來解決,而我們認為第二個問題將成為遺忘機制的一種特性。當 token 壓縮近 20 倍時,我們發現精度仍能接近 60%。這些結果表明,光學上下文壓縮是一個非常有前景且值得研究的方向,且這種方法不會帶來任何額外開銷,因為它可以利用視覺語言模型(VLM)的基礎架構,畢竟多模態系統本身就需要一個額外的視覺編碼器。
Table 4 | Edit distances for different categories of documents in OmniDocBench. The results show that some types of documents can achieve good performance with just 64 or 100 vision tokens, while others require Gundam mode.
Table 4 | OmniDocBench 中不同類型文件的編輯距離(edit distance)。結果顯示,某些類型的文件僅使用 64 或 100 個 vision tokens 就能取得不錯的效能,而其它類型則需要使用 Gundam 模式。
| Mode | Book | Slides | Financial Report | Textbook | Exam Paper | Magazine | Academic Papers | Notes | Newspaper | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | 0.147 | 0.116 | 0.207 | 0.173 | 0.294 | 0.201 | 0.395 | 0.297 | 0.94 | 0.32 |
| Small | 0.085 | 0.111 | 0.079 | 0.147 | 0.171 | 0.107 | 0.131 | 0.187 | 0.744 | 0.205 |
| Base | 0.037 | 0.08 | 0.027 | 0.1 | 0.13 | 0.073 | 0.052 | 0.176 | 0.645 | 0.156 |
| Large | 0.038 | 0.108 | 0.022 | 0.084 | 0.109 | 0.06 | 0.053 | 0.155 | 0.353 | 0.117 |
| Gundam | 0.035 | 0.085 | 0.289 | 0.095 | 0.094 | 0.059 | 0.039 | 0.153 | 0.122 | 0.083 |
| Gundam-M | 0.052 | 0.09 | 0.034 | 0.091 | 0.079 | 0.079 | 0.048 | 0.1 | 0.099 | 0.077 |
### 4.2. OCR Practical Performance
DeepSeek-OCR is not only an experimental model; it has strong practical capabilities and can construct data for LLM/VLM pretraining. To quantify OCR performance, we test DeepSeek-OCR on OmniDocBench [27], with results shown in Table 3. Requiring only 100 vision tokens (640 × 640 resolution), DeepSeek-OCR surpasses GOT-OCR2.0 [38] which uses 256 tokens; with 400 tokens (285 valid tokens, 1280 × 1280 resolution), it achieves on-par performance with the state of the art on this benchmark. Using fewer than 800 tokens (Gundam mode), DeepSeek-OCR outperforms MinerU2.0 [34] which needs nearly 7,000 vision tokens. These results demonstrate that our DeepSeek-OCR model is powerful in practical applications, and because of its higher token compression, it enjoys a higher research ceiling.
DeepSeek-OCR 不僅是一個實驗性的模型,它還具有強大的實用能力,能夠為大型語言模型(LLM)/視覺語言模型(VLM)的預訓練構建資料。為了量化 OCR 的效能,我們在 OmniDocBench [27] 上對 DeepSeek-OCR 進行了測試,結果如 Table 3 所示。DeepSeek-OCR 只需要 100 個 vision tokens(640 × 640 解析度),就超越了使用 256 個 tokens 的 GOT-OCR2.0 [38];當使用 400 個 tokens(285 個有效 tokens,1280 × 1280 解析度)時,其表現與該基準上最先進的模型相當。在使用少於 800 個 tokens(Gundam 模式)的情況下,DeepSeek-OCR 的表現優於需要近 7,000 個 vision tokens 的 MinerU2.0 [34]。這些結果表明,DeepSeek-OCR 模型在實際應用中非常強大,而且由於其較高的 token 壓縮率,它具有更高的研究天花板。
As shown in Table 4, some categories of documents require very few tokens to achieve satisfactory performance, such as slides which only need 64 vision tokens. For book and report documents, DeepSeek-OCR can achieve good performance with only 100 vision tokens. Combined with the analysis from Section 4.1, this may be because most text tokens in these document categories are within 1,000, meaning the vision-token compression ratio does not exceed 10 × . For newspapers, Gundam or even Gundam-master mode is required to achieve acceptable edit distances, because the text tokens in newspapers are 4-5,000, far exceeding the 10 × compression of other modes. These experimental results further demonstrate the boundaries of contexts optical compression, which may provide effective references for researches on the vision token optimization in VLMs and context compression, forgetting mechanisms in LLMs.
如 Table 4 所示,某些類型的文件只需極少的 vision tokens 即可取得令人滿意的效果,例如幻燈片類的文件只需 64 個 vision tokens;書籍和報告類文件則僅需 100 個 vision tokens 就能有很好的表現。結合 Section 4.1 的分析,這可能是因為這些文件類型中的大部分 text tokens 數量都在 1,000 個以內,使得視覺-文本的壓縮比不超過 10 倍。報紙類文件的 text tokens 數量約為 4,000 至 5,000 個,遠超過其它模式 10 倍壓縮的範圍,因此需要使用 Gundam 甚至 Gundam-master 模式才能達到可接受的編輯距離。這些實驗結果進一步展示了上下文光學壓縮的邊界,可為視覺語言模型(VLMs)中的 vision token 最佳化,以及大型語言模型(LLMs)中的上下文壓縮與遺忘機制研究提供有效的參考。
### 4.3. Qualitative Study
#### 4.3.1. Deep parsing
DeepSeek-OCR possesses both layout and OCR 2.0 capabilities, enabling it to further parse images within documents through secondary model calls, a feature we refer to as "deep parsing". As shown in Figures 7,8,9,10, our model can perform deep parsing on charts, geometry, chemical formulas, and even natural images, requiring only a unified prompt.
DeepSeek-OCR 具備佈局分析和 OCR 2.0 的能力,這使得它能夠透過次級模型呼叫來進一步解析文檔中的影像,我們將此功能稱為「deep parsing(深度解析)」。如 Figures 7、8、9、10 所示,我們的模型可以對圖表、幾何圖形、化學公式,甚至是自然影像進行深度解析,而只需要一個統一的提示詞即可。

Figure 7 | In the field of financial research reports, the deep parsing mode of DeepSeek-OCR can be used to obtain structured results of charts within documents. Charts are a crucial form of data representation in finance and scientific fields, and chart structured extraction is an indispensable capability for future OCR models.
Figure 7 | 在金融研究報告領域,可以使用 DeepSeek-OCR 的深度解析模式取得文件中圖表的結構化結果。圖表是金融與科學領域中至關重要的資料呈現形式,圖表的結構化擷取是未來 OCR 模型不可或缺的能力。

Figure 8 | For books and articles, the deep parsing mode can output dense captions for natural images in the documents. With just a prompt, the model can automatically identify what type of image it is and output the required results.
Figure 8 | 對於書籍與文章,深度解析模式可以為文件中的自然影像輸出密集描述(dense captions)。只需一個提示詞,模型就能自動判斷影像的類型並輸出所需的結果。

Figure 9 | DeepSeek-OCR in deep parsing mode can also recognize chemical formulas within chemical documents and convert them to SMILES format. In the future, OCR 1.0+2.0 technology may play a significant role in the development of VLM/LLM in STEM fields.
Figure 9 | DeepSeek-OCR 在深度解析模式下能夠辨別化學文件中的化學式,並將其轉換為 SMILES 格式。未來,OCR 1.0+2.0 技術可能在 STEM 領域的 VLM/LLM(視覺語言模型/大型語言模型)發展中發揮重要作用。

Figure 10 | DeepSeek-OCR also possesses the capability to copy (structure) simple planar geometric figures. Due to the intricate interdependencies among line segments in geometric shapes, parsing geometry task is extremely challenging and has a long way to go.
Figure 10 | DeepSeek-OCR 也能複製(構建)簡單的平面幾何圖形。由於幾何圖形中線段之間的複雜關聯,解析幾何圖形的任務極具挑戰性,且仍有很長的發展道路要走。
#### 4.3.2. Multilingual recognition
PDF data on the Internet contains not only Chinese and English, but also a large amount of multilingual data, which is also crucial when training LLMs. For PDF documents, DeepSeek-OCR can handle nearly 100 languages. Like Chinese and English documents, multilingual data also supports both layout and non-layout OCR formats. The visualization results are shown in Figure 11, where we select Arabic and Sinhala languages to demonstrate results.
網際網路上的PDF文件不僅包含中文和英文資料,還有大量的多語言資料,這在訓練大型語言模型(LLM)時也非常重要。DeepSeek-OCR能處理近100種語言的PDF文件。與中文和英文文件一樣,多語言資料也支援有佈局的和無佈局的OCR格式。Figure 11 給出視覺結果,其中我們選用了阿拉伯語和僧伽羅語來展示處理效果。

Figure 11 | To endow the capability of processing widely crawled PDFs (multilingual data), we train our model with OCR capabilities for nearly 100 languages. Minority language documents can also support both layout and non-layout outputs through different prompts.
Figure 11 | 為了讓模型具備處理大量爬取的 PDF(多語言資料)的能力,我們為模型訓練了近 100 種語言的 OCR 能力。少數語言的文件同樣可以透過不同的提示詞,輸出含佈局或不含佈局的結果。
#### 4.3.3. General vision understanding
We also provide DeepSeek-OCR with a certain degree of general image understanding capabilities. The related visualization results are shown in Figure 12.
我們還為 DeepSeek-OCR 提供了相當程度的通用影像理解能力。相關的可視化結果如 Figure 12 所示。

Figure 12 | We retain DeepSeek-OCR's capabilities in general visual understanding, mainly including image description, object detection, grounding, etc. Meanwhile, due to the inclusion of text-only data, DeepSeek-OCR's language capabilities are also retained. Note that since we do not include SFT (Supervised Fine-Tuning) stage, the model is not a chatbot, and some capabilities need completion prompts to be activated.
Figure 12 | 我們保留了 DeepSeek-OCR 在通用視覺理解方面的能力,主要包括影像描述、物體偵測、視覺定位(grounding)等。同時,由於加入了純文本資料,DeepSeek-OCR 的語言能力也得以保留。需要注意的是,由於我們沒有進行監督式微調(SFT)階段,該模型並非聊天機器人,某些能力需要透過補全式提示詞(completion prompts)來啟用。
## 5. Discussion
Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode $N$ text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10 × ratios, while 20 × compression still retains 60% accuracy. These findings suggest promising directions for future applications, such as implementing optical processing for dialogue histories beyond $k$ rounds in multi-turn conversations to achieve 10 × compression efficiency.
我們的研究是對視覺-文本壓縮邊界的初步探索,研究需要多少 vision tokens 才能解碼 $N$ 個 text tokens。初步結果令人振奮:DeepSeek-OCR 在約 10 倍的壓縮比下能實現接近無損的 OCR 壓縮,而在 20 倍壓縮比下仍能保持 60% 的準確率。這些發現為未來的應用指出了有前景的方向,例如在多輪對話中,對超過 $k$ 輪的對話歷史進行光學處理,以實現約 10 倍的壓縮效率。

Figure 13 | Forgetting mechanisms constitute one of the most fundamental characteristics of human memory. The contexts optical compression approach can simulate this mechanism by rendering previous rounds of historical text onto images for initial compression, then progressively resizing older images to achieve multi-level compression, where token counts gradually decrease and text becomes increasingly blurred, thereby accomplishing textual forgetting.
Figure 13 | 遺忘機制是人類記憶最基本的特徵之一。上下文光學壓縮方法可以模擬這種機制:先將較早輪次的歷史文本渲染到影像上進行初始壓縮,再逐步縮小較舊影像的尺寸以實現多層壓縮。在此過程中,token 數量逐漸減少、文本也越來越模糊,從而達到文本「遺忘」的效果。
For older contexts, we could progressively downsize the rendered images to further reduce token consumption. This assumption draws inspiration from the natural parallel between human memory decay over time and visual perception degradation over spatial distance: both exhibit similar patterns of progressive information loss, as shown in Figure 13. By combining these mechanisms, the contexts optical compression method enables a form of memory decay that mirrors biological forgetting curves, where recent information maintains high fidelity while distant memories naturally fade through increased compression ratios.
對於較舊的上下文,我們可以逐步縮小已渲染影像的尺寸,以進一步降低 token 的消耗。這一假設的靈感來自於人類記憶隨時間衰減,與視覺感知隨空間距離增加而衰減之間的自然類比:兩者都呈現出類似的漸進式信息流失模式,如 Figure 13 所示。透過結合這些機制,上下文光學壓縮方法實現了一種近似生物遺忘曲線的記憶衰減:最近的信息保持高保真度,而較久遠的記憶則隨著壓縮比的增加而自然淡化。
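:::warning
個人註解:Figure 13 的多層壓縮/遺忘機制,可以用一個極簡的排程函式來示意:越舊的對話輪次改用越小的解析度模式重新渲染,token 數逐級遞減。下面的輪次門檻與最後歸零的設計都是我自己的示意假設,論文並未實作或評測這個機制。
```python
def tokens_for_round_age(age_in_rounds: int) -> int:
    """依對話輪次的「年齡」決定重新渲染時使用的 vision token 數(純示意)。"""
    schedule = [(1, 400), (3, 256), (6, 100), (10, 64)]  # (最大年齡, 對應模式的 token 數)
    for max_age, tokens in schedule:
        if age_in_rounds <= max_age:
            return tokens
    return 0  # 超過視野範圍:視為完全「遺忘」

print([tokens_for_round_age(a) for a in (1, 2, 5, 8, 20)])  # [400, 256, 100, 64, 0]
```
:::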
While our initial exploration shows potential for scalable ultra-long context processing, where recent contexts preserve high resolution and older contexts consume fewer resources, we acknowledge this is early-stage work that requires further investigation. The approach suggests a path toward theoretically unlimited context architectures that balance information retention with computational constraints, though the practical implications and limitations of such vision-text compression systems warrant deeper study in future research.
雖然我們的初步探索顯示,超長上下文處理具有可擴展的潛力(最新的上下文保持高解析度,較舊的上下文消耗較少的資源),不過我們也意識到,這仍屬於早期階段的工作,需要進一步研究。這種方法為理論上無限的上下文架構指出了一條在信息保留與計算限制之間取得平衡的途徑;然而,這類視覺-文本壓縮系統的實際影響和限制,仍需在未來的研究中深入探究。
## 6. Conclusion
In this technical report, we propose DeepSeek-OCR and preliminarily validate the feasibility of contexts optical compression through this model, demonstrating that the model can effectively decode text tokens exceeding 10 times the quantity from a small number of vision tokens. We believe this finding will facilitate the development of VLMs and LLMs in the future. Additionally, DeepSeek-OCR is a highly practical model capable of large-scale pretraining data production, serving as an indispensable assistant for LLMs. Of course, OCR alone is insufficient to fully validate true context optical compression and we will conduct digital-optical text interleaved pretraining, needle-in-a-haystack testing, and other evaluations in the future. From another perspective, optical contexts compression still offers substantial room for research and improvement, representing a promising new direction.
在這份技術報告中,我們提出了 DeepSeek-OCR,並透過這個模型初步驗證了上下文光學壓縮的可行性,證明模型能夠從少量的 vision tokens 中有效解碼出數量超過其 10 倍的 text tokens。我們相信這一發現將有助於未來視覺語言模型(VLMs)與大型語言模型(LLMs)的發展。此外,DeepSeek-OCR 是一個非常實用的模型,能夠大規模生產預訓練資料,是大型語言模型不可或缺的助手。當然,僅靠 OCR 尚不足以完全驗證真正的上下文光學壓縮,我們未來還將進行數位-光學文本交錯預訓練、大海撈針(needle-in-a-haystack)測試等其它評估。從另一個角度來看,光學上下文壓縮仍有很大的研究與改進空間,是一個非常有前景的新方向。