DeepSeek-OCR Notes: Performance of VLMs for Structured Data Extraction

###### tags: `Literature Reading` `DeepSeek` `OCR` `Paper` `CV`

### [AI / ML learning notes: index page](https://hackmd.io/@YungHuiHsu/BySsb5dfp)

---

## Paper Overview

![image](https://hackmd.io/_uploads/rklaebUClx.png =1200x)

> Together, they achieve an impressive 97% OCR precision when compressing text at a ratio of less than 10× (meaning 10 text tokens compressed into 1 vision token). Even at an aggressive 20× compression ratio, the system maintains approximately 60% accuracy.

- Highlights:

:::info
A picture is worth a thousand words: information is compressed, transmitted, and restored in visual form with low distortion.
:::

![image](https://hackmd.io/_uploads/BJcUTlRAle.png =350x)
source: [Sam Witteveen. DeepSeek OCR - More than OCR](https://www.youtube.com/watch?v=YEZHU4LSUfU)

![image](https://hackmd.io/_uploads/B1ZUAg2xbg.png =500x)
source: [Caleb Writes Code. DeepSeek-OCR Explained](https://www.youtube.com/watch?v=uWrBH4iN5y4)

The video's analogy for information compression is a good one. Audio can be compressed by lowering the sampling rate. Once a neural network has turned a visual signal into a numeric vector representation, can it likewise be compressed while keeping the features that matter for recognition? DeepSeek-OCR borrows this idea of signal compression (downsampling). Written characters carry no meaning by themselves either: the optic nerve turns them into electrical signals, and the brain interprets them through learned experience. Is sloppy handwriting not also a kind of compressed signal?

:::success
- At compression ratios below 10× (about 10 text tokens compressed into 1 vision token), OCR precision reaches 97%.
- At 20× compression, accuracy stays at about 60%.
:::

- On the OmniDocBench benchmark:
  - With only 100 vision tokens, it surpasses GOT-OCR2.0 (256 tokens/page).
  - It outperforms MinerU2.0 (averages 6000+ tokens per page) while using fewer than 800 vision tokens.

#### [DeepSeek-OCR: Contexts Optical Compression](https://arxiv.org/abs/2111.06377)

- [Official GitHub](https://github.com/deepseek-ai/DeepSeek-OCR)
- [DeepSeek.AI Blog](https://deepseek.ai/blog/deepseek-ocr-context-compression)
- [huggingface DEMO](https://huggingface.co/spaces/khang119966/DeepSeek-OCR-DEMO)

---

## The Technical Architecture Behind DeepSeek-OCR

### Vision Encoder Comparison

![image](https://hackmd.io/_uploads/By-yeGUAle.png =1200x)

> Figure 2 | Typical vision encoders in popular VLMs. Here are three types of encoders commonly used in current open-source VLMs, all of which suffer from their respective deficiencies.

Current open-source vision-language models (VLMs) use three main types of vision encoder (see Figure 2):

1. Dual-tower architecture
   - Representative models: Vary/DeepSeekVL/...
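   - Core idea:
     ![image](https://hackmd.io/_uploads/Sy2_vGLCgg.png =400x)
     * Two vision encoders run in parallel (for example ViT + SAM).
     * One handles low-resolution understanding of the whole image (global view); the other handles high-resolution detail.
     * Both encoders process the same image at the same time.
   - Pros:
     - Parameter count and memory usage are both relatively controllable, because the input image is resized to a fixed size.
   - Cons:
     - The same image has to be preprocessed twice, in two different ways.
     - Pipeline parallelism across the encoders is hard to achieve during training, and inference is slower.

2. Tile-based methods
   - Representative models: InternVL series/DeepSeekVL2/...
     ![image](https://hackmd.io/_uploads/rkCIPzUCel.png =400x)
   - Core idea:
     - Split a large image into small tiles (for example 448×448 or 384×384).
     - Process the tiles one by one or in batches instead of loading everything at once.
   - Pros:
     - Controllable memory usage (reduced activation memory), because the full image is never loaded in one pass.
     - Supports very high resolutions; in principle images of any size can be processed.
     - Parallelizable: different tiles can be processed on different GPUs at the same time.
   - Cons:
     - Large images are split too finely and produce a huge number of vision tokens. This token explosion makes the LLM slow and expensive.

:::spoiler Token count calculation:
```
# Encoder stage (token generation)
Tile size: 448×448
Patch size: 16×16
Tokens per tile: (448/16)² = 784 tokens

# Tiles needed for a large page
Page size: 2480×3508
Tiles needed: 6×8 = 48
Local tokens: 784 × 48 = 37,632 tokens
Global view: 784 tokens
Total vision tokens: 38,416 tokens

# Compared with the text version
Same document as text: ~4,000 text tokens
The image version uses about 9.6× as many tokens as the text version!
```
- Memory the LLM needs to process these tokens:
```
# Memory needed for LLM self-attention
seq_len = 38,416 tokens
hidden_dim = 4096
num_heads = 32
num_layers = 32

# KV cache memory (inference): K and V, 2 bytes per value
kv_cache_memory = 2 × seq_len × hidden_dim × num_layers × 2 bytes
                = 2 × 38,416 × 4096 × 32 × 2
                ≈ 20 GB

# Attention matrix memory (training): one seq_len × seq_len matrix, 4 bytes per value
attention_memory = seq_len × seq_len × 4 bytes
                 = 38,416² × 4
                 ≈ 5.9 GB

Impact:
- Slow inference: attention is O(n²), so 38k tokens cost about 92× (9.6²) as much attention compute as 4k tokens
- High memory usage: 20 GB+ of GPU memory is needed
```
:::

     - Excessive fragmentation: semantic information is broken into pieces, context is interrupted, and text at tile edges gets cut off.
     - The native resolution is too low, so the tiles end up too small.

3. Adaptive resolution encoding
   - Representative models: Qwen2(.5)/3VL series...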
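     ![image](https://hackmd.io/_uploads/rkYqwzICge.png =500x)
   * Core idea:
     * Uses the NaViT (Native Resolution ViT) architecture.
     * No resizing and no tiling; the image is processed at its original size.
     * Dynamic positional encoding (such as 2D RoPE) supports arbitrary aspect ratios.
       * RoPE: rotary encoding + computed on the fly + extrapolates beyond the lengths seen in training
       > RoPE implements positional encoding more elegantly through rotation and continuous functions. It encodes relative position by construction and can handle sequences of arbitrary length.
     * The image is cut into patches and fed straight into the encoder.
   * Pros:
     - Handles many resolutions flexibly
     - Keeps full detail
     - Strong semantic coherence
   - Cons:
     - Memory usage explodes. Root cause: the full image is processed directly.
     - The token count grows linearly with image area (pixel count), which makes inference slow.
     - Attention complexity is O(n²), where n is the number of tokens.
     - Training needs support for very long sequence lengths.

### DeepSeek-OCR architecture design

![image](https://hackmd.io/_uploads/ByFXbIv0ll.png =800x)

Architecture of DeepSeek-OCR. DeepSeek-OCR consists of DeepEncoder as the encoder and DeepSeek-3B-MoE as the decoder. DeepEncoder, the core module, has three parts: (1) [SAM](https://arxiv.org/abs/2304.02643), a perception model dominated by window attention; (2) [CLIP](https://arxiv.org/abs/2103.00020), a knowledge model built on dense global attention; and (3) a 16× token compressor that connects the two.

#### **Encoder**: DeepEncoder = `SAM` + `16× token compressor` + `CLIP`

![image](https://hackmd.io/_uploads/B15cjSDCex.png =800x)

DeepEncoder has two main components: a visual perception feature extraction component that uses window attention, and a visual knowledge feature extraction component that uses dense global attention.

1. Visual Perception Feature Extraction ("Tokenizer") → local attention, responsible for visual detail.
   - Standard definition: a tokenizer is the module that produces tokens. The design goal here is for CLIP to receive tokens that are already efficiently compressed and rich in low-level visual features, so SAM + Conv serves as the tokenizer module.
   ```
   "Tokenizer" in a conventional vision encoder:
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   Function: convert an image into a token sequence
   Implementation: Patch Embedding + Positional Encoding

   Example (CLIP):
   Image → Patch Embedding → N tokens
           └─ this is the "Tokenizer"

   "Tokenizer" in DeepSeek-OCR:
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   Function: convert an image into a compressed token sequence
   Implementation: Patch Embedding + SAM + Conv Compressor

   Image → Patch Emb → SAM → Conv → 256 tokens
           └────────────────────┘
           This whole stage produces the final tokens,
           which is why it is called the "Tokenizer"
   ```
   - **SAM (Segment Anything Model)** computes window attention in parallel and captures local visual detail (local attention).
     - Window attention → local
     - 64 windows can be computed in parallel
     - Extracts fine-scale features
     - input: 4096 tokens × 256 dim
       * Raw data: 1024×1024 image
       * After patch embedding:
         * Patch size: 16×16
         * Number of patches: (1024/16)²
         * Calculation: 64×64 = 4096 patches
     - output: 4096 tokens × 256 dim
   - **16× token compressor**: handles information compression. A learnable convolutional compression layer that compresses the visual information (vision tokens) further.
     - 4096 → 256 tokens
     - 16× compression ratio
     - input: 4096 tokens × 256 dim
     - output: 256 tokens × 1024 dim
     - `n/16` vision tokens
   ```
   n = number of original patch tokens
     = (Image_width / patch_size) × (Image_height / patch_size)
     = (1024/16) × (1024/16)
     = 64 × 64
     = 4096

   n/16 = number of vision tokens after compression
        = 4096 / 16
        = 256

   So: n/16 = 256 vision tokens
   ```

2. Visual Knowledge Feature Extraction / Embedding Generator (Layer) → global attention, responsible for semantics and context.
   - **CLIP**: handles semantic understanding. CLIP acts as the embedding generator here: it receives the compressed visual features and performs global semantic understanding and cross-modal alignment (computed with global attention).
     - input: 256 tokens × 1024 dim (256 compressed visual features)
     - passes through 256×256 global attention
     - output: 256 tokens × 1024 dim (semantic embeddings)
       - condenses the semantic information of the image
       - can be understood by the LLM
       - has cross-modal alignment properties
   - ⊕ denotes the residual fusion of the vision tokens and the CLIP embeddings. The design aims to balance stability, information retention, and low memory overhead between compression and global feature integration.
   ```python
   # vision_tokens:   [B, 256, D]  compressed tokens from SAM + compressor
   # clip_embeddings: [B, 256, D]  CLIP output for the same tokens
   deepencoder_output = vision_tokens + clip_embeddings  # ⊕ residual fusion
   ```

#### **Decoder**: `DeepSeek-3B-MoE`

![image](https://hackmd.io/_uploads/SJb3Z8D0le.png =250x)

The DeepSeek-OCR decoder uses a **DeepSeek-3B-MoE (Mixture-of-Experts)** structure, the **MoE Decoder** described in Section 3.3 of the paper. The design uses **expert routing** to increase the expressive power of language generation while keeping inference efficient and memory usage low.

- **Core concept**
  DeepSeek-3B-MoE consists of 64 routed experts and 2 shared experts. At inference only 6 routed experts + 2 shared experts are activated, about **570M activated parameters**, so:
  > **It has the expressive power of a 3B model at the inference cost of a 500M model.**
- **Mathematical definition**
  The decoder reconstructs the text representation from the latent vision tokens compressed by DeepEncoder:

$$f_{dec}: \mathbb{R}^{n \times d_{latent}} \rightarrow \mathbb{R}^{N \times d_{text}}$$

$$\hat{X} = f_{dec}(Z), \quad n \leq N$$

  where:
  * $Z \in \mathbb{R}^{n \times d_{latent}}$: compressed latent vision tokens from DeepEncoder (for example 256×1024)
  * $\hat{X} \in \mathbb{R}^{N \times d_{text}}$: the reconstructed text embedding sequence
  * $f_{dec}$: a nonlinear mapping implemented by the MoE structure
- **Workflow**
  1. **Input:** 256 latent tokens compressed by DeepEncoder
  2. **Prompt fusion:** a text prompt is injected according to the task (such as "Free OCR" or "Convert document to markdown.")
  3. **Expert routing:** each token selects suitable expert sub-networks through a gating network
  4. **Decoding:** the experts run their feed-forward computation and produce the corresponding text embeddings
  5. **Output:** the final text sequence is generated
- **In code**
  ```python
  import torch

  def decode_page(Z, prompt_embeds, f_dec, tokenizer):
      """Conceptual single-pass sketch; real decoding is autoregressive."""
      # Z: latent vision tokens from DeepEncoder, shape [B, 256, D]
      # prompt_embeds: embedded instruction tokens, shape [B, P, D]
      # f_dec: DeepSeek-3B-MoE decoder, assumed here to return vocabulary logits
      # Step 1: concatenate prompt and vision tokens along the sequence axis
      decoder_input = torch.cat([prompt_embeds, Z], dim=1)
      # Step 2: pass through the MoE-based decoder
      logits = f_dec(decoder_input)  # [B, P + 256, vocab]
      # Step 3: map to output tokens (text)
      token_ids = logits.argmax(dim=-1)
      return [tokenizer.decode(ids) for ids in token_ids]
  ```

### Multi-Resolution Support

DeepEncoder is designed to support multiple resolutions efficiently, enabling it to process documents of varying sizes and complexities without sacrificing performance or requiring excessive computational resources.

![image](https://hackmd.io/_uploads/BJhSUSwCel.png =900x)

> Figure 4 | Multiple resolution modes are configured to test the model at different compression ratios (which need different numbers of vision tokens) and to make DeepSeek-OCR more practical.

![image](https://hackmd.io/_uploads/r1B7UBwCgg.png =900x)

> Table 1 | Multi resolution support of DeepEncoder.
> To serve both research and application needs, DeepEncoder is designed with several native-resolution and dynamic-resolution modes.

$$N_{\text{valid}} = \left\lceil N_{\text{actual}} \times \left[1 - \frac{\max(w, h) - \min(w, h)}{\max(w, h)}\right]\right\rceil \tag{1}$$

> - $N_{\text{valid}}$: number of valid vision tokens
> - $N_{\text{actual}}$: actual number of vision tokens (total after padding)
> - $w$: width of the original image
> - $h$: height of the original image
> - $\lceil \cdot \rceil$: ceiling function (round up)
> - $\max(w, h)$: the larger of width and height
> - $\min(w, h)$: the smaller of width and height

:::info
- Numerator: max(w,h) - min(w,h) = long side - short side = amount of padding
- Denominator: max(w,h) = long side
- Ratio: (long side - short side) / long side = padding ratio
- 1 - padding ratio = valid content ratio = short side / long side
- Result: valid tokens = total tokens × valid content ratio (rounded up)
:::

---

## **Data Engine**

The **Data Engine** of DeepSeek-OCR is the training foundation of the whole model. It supports OCR compression learning, cross-modal alignment, and language generation.

> Design goal:
> Teach the model both to "read the image" and to "generate the matching text", across diverse document types and tasks.

The data falls into four categories:
1️⃣ OCR 1.0 data
2️⃣ OCR 2.0 data
3️⃣ General vision data
4️⃣ Text-only data

### **3.4.1. OCR 1.0 data: classic OCR task data**

**Purpose:**
Give the model document and scene text recognition ability. This is the core of DeepSeek-OCR training.

**Composition:**

* **Document OCR**
  * Source: about 30M PDF pages
  * Languages: 100 languages (Chinese and English account for about 25M pages)
  * Types:
    * **Coarse annotations**: extracted automatically from full PDFs, used to train optical text recognition (the focus is "reading the characters").
    * **Fine annotations**: generated by an advanced layout model (PP-DocLayout) plus OCR models (MinerU, GOT-OCR2.0).
      * Format: `[coordinates + block label + corresponding text]`, as shown in Figure 5.
      * Used to learn layout structure and text localization.
* **Scene OCR**
  * Source: the LAION and Wukong image sets
  * 10M images each for Chinese and English
  * Annotation: labeled automatically with PaddleOCR

**Use:**

* Train the model to recognize text in different settings (documents, photos, natural scenes)
* Control the output format (the prompt decides whether detection boxes are output)

### **3.4.2. OCR 2.0 data: structured parsing task data**

**Purpose:**
Extend OCR to understanding the structure of charts and symbols inside documents, building an "OCR + understanding" capability.

> The paper calls these *"parsing tasks for complex artificial images"*.

**Composition:**

1. **Charts**
   * Follows OneChart
   * Tools: generated automatically with `pyecharts` and `matplotlib`
   * Quantity: about 10M images
   * Annotation format: **HTML table** (instead of OneChart's dictionary format), which saves tokens.
2. **Chemical formulas**
   * Source: SMILES strings from PubChem
   * Tool: images rendered with `RDKit`
   * Quantity: about 5M samples
   * Task: image → SMILES conversion
3. **Plane geometry**
   * Follows the Slow Perception method
   * Geometric figures (line segments, angles, intersections, and so on) are produced by a rule-based generator
   * Annotation format: **dictionary**, containing:
     * `line_segments`, `endpoints`, `types`, `coordinates`
   * Quantity: about 1M images
   * Supports **translation-invariant augmentation** (the figure moves but the semantics stay the same)

**Use:**

* Teach the model structured output (for example HTML, JSON, dictionary formats)
* Exercise chart parsing, math/chemistry structure recognition, and spatial understanding.

### **3.4.3. General Vision Data: general vision task data**

**Purpose:**
Preserve the model's general vision ability and avoid over-specializing on OCR.

**Composition:**

* Task types: caption, object detection, grounding, and others
* Generation: reuses the DeepSeek-VL2 data generation pipeline
* Share: only **20%** of the overall training data

**Use:**

* Help the model understand ordinary images (images not dominated by text)
* Maintain cross-modal alignment, so the OCR model keeps general VLM ability.

### **3.4.4. Text-only data: pure text data**

**Purpose:**
Strengthen semantic generation and long-text alignment in the language layer (decoder).

**Composition:**

* In-house text corpus, with a length of about 8192 tokens
* Share: **10%** of the overall data

**Use:**

* Improve language generation quality and logical consistency
* Provide language pretraining support for the MoE decoder

### 📊 **Data shares and uses at a glance**

| Data type | Share | Task focus | Source / format |
| ------------------ | ---------- | ---------------------------------- | -------------------------------- |
| **OCR 1.0** | ~70% | Optical text recognition (Document + Scene OCR) | PDF, LAION, Wukong, fitz, PaddleOCR |
| **OCR 2.0** | (counted within the ~70% OCR share) | Structured parsing (Charts, Chemical, Geometry) | OneChart, RDKit, Slow Perception |
| **General Vision** | ~20% | General image understanding | Data generated as in DeepSeek-VL2 |
| **Text-only** | ~10% | Language generation / alignment training | In-house corpus |

---

### DeepSeek-OCR actual training pipeline (2 stages)

- Stage 1: DeepEncoder training
  - Trained modules: SAM + Compressor + CLIP (380M)
  - Data: OCR 1.0/2.0 + 100M LAION samples
  - Objective: next-token prediction (2 epochs)
  - Output: a trained DeepEncoder
- Stage 2: full-model training
  - Pipeline parallelism layout:
    - PP0: SAM + Compressor (frozen ❄️)
    - PP1: CLIP (not frozen, keeps training 🔥)
    - PP2: DeepSeek-3B-MoE layers 1-6 (trained 🔥)
    - PP3: DeepSeek-3B-MoE layers 7-12 (trained 🔥)
  - Data: mixture (70% OCR + 20% general vision + 10% text-only)
  - Output: the final DeepSeek-OCR model

---

# Performance of VLMs for structured data extraction as of late 2025

## 1. Comparing DeepSeek-OCR and DotsOCR

### Architecture differences

| Feature | DeepSeek-OCR | DotsOCR (reference document) |
|------|--------------|-------------------|
| **Vision Encoder** | DeepEncoder (SAM + CLIP, 380M) | dots.vit (NaViT, 1.2B) |
| **LLM Decoder** | DeepSeek3B-MoE-A570M | Qwen2.5-1.5B |
| **Total parameters** | ~3B (570M active) | ~3B (2.7B active) |
| **Vision architecture** | SAM (local) + CLIP (global) | NaViT (native resolution, up to 11M pixels) |
| **Decoder architecture** | MoE (Mixture of Experts) | Dense Transformer |
| **Design goal** | OCR + Visual Context Compression | Document layout parsing + OCR |
| **Compression** | 10× (1000 tokens → 100) | Keeps detail (NaViT) |

- Measured on the OmniDocBench dataset
  - [dots.ocr](https://github.com/rednote-hilab/dots.ocr) still wins in many cases; DeepSeek-OCR's advantage is its much smaller memory footprint.

![image](https://hackmd.io/_uploads/H1EpNjtJ-e.png)
[source](https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf)

:::info
From my own tests: on A4 documents with simple layouts, expert VLMs such as dots.ocr and DeepSeek-OCR already outperform pipeline tools such as Docling and Unstructured. On scanned documents their recognition accuracy is far ahead of tools like Docling and Unstructured, which rely on simple built-in OCR.
Although dots.ocr states that it handles charts in complex layouts poorly, in my tests its layout detection, localization, and reading-order parsing on PPTX slides were still better than DeepSeek-OCR's.
:::

- Used purely for layout detection, dots.ocr is also considerably more accurate than DocLayout-YOLO-DocStructBench.

<table>
<thead>
<tr> <th rowspan="2"><strong>Method</strong></th> <th colspan="5" style="text-align: center;"><strong>F1@IoU=.50:.05:.95↑</strong></th> <th colspan="5" style="text-align: center;"><strong>F1@IoU=.50↑</strong></th> </tr>
<tr> <th>Overall</th> <th>Text</th> <th>Formula</th> <th>Table</th> <th>Picture</th> <th>Overall</th> <th>Text</th> <th>Formula</th> <th>Table</th> <th>Picture</th> </tr>
</thead>
<tbody>
<tr> <td>DocLayout-YOLO-DocStructBench</td> <td>0.733</td> <td>0.694</td> <td>0.480</td> <td>0.803</td> <td>0.619</td> <td>0.806</td> <td>0.779</td> <td>0.620</td> <td>0.858</td> <td>0.678</td> </tr>
<tr> <td>dots.ocr-parse all</td> <td>0.831</td> <td>0.801</td> <td>0.654</td> <td>0.838</td> <td>0.748</td> <td>0.922</td> <td>0.909</td> <td>0.770</td> <td>0.888</td> <td>0.831</td> </tr>
<tr> <td> <strong>dots.ocr-detection only</strong> </td> <td><strong>0.845</strong></td> <td><strong>0.816</strong></td> <td><strong>0.716</strong></td> <td><strong>0.875</strong></td> <td><strong>0.765</strong></td> <td><strong>0.930</strong></td> <td><strong>0.917</strong></td> <td><strong>0.832</strong></td> <td><strong>0.918</strong></td> <td><strong>0.843</strong></td> </tr>
</tbody>
</table>

> **Notes:**
> - prompt_layout_all_en for **parse all**, prompt_layout_only_en for **detection only**, please refer to [prompts](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)

[source: dots.ocr](https://github.com/rednote-hilab/dots.ocr)

### Loss of instruction-following ability

:::warning
In my tests, both models perform poorly when a custom prompt asks them to parse a picture further, and they hallucinate badly.
:::

- DeepSeek-OCR paper:
  > "Since we do not include SFT stage, the model is not a chatbot, and some capabilities need completion prompts to be activated."
- Training strategy
  - [DotsOCR training pipeline](https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/blog.md#methods)

| Training stage | DeepSeek-OCR | DotsOCR |
|------------------|--------------------------------|-----------------------------------------|
| Number of stages | 2 | 4 |
| Stage 1 | DeepEncoder (OCR + 100M LAION) | Vision encoder trained from scratch (image-text pairs) |
| Stage 2 | Full-model training (70% OCR + 30% other) | Vision encoder enhancement (LLM frozen) |
| Stage 3 | ❌ None | OCR specialization (two-phase freeze/unfreeze, LLM starts training) |
| Stage 4 | ❌ None | SFT (300K samples, LLM keeps training) |
| LLM decoder training data | Stage 2: 70% OCR + 30% other | Stage 3/4: ~100% OCR (no general-data protection) |
| General-data protection | ⚠️ 30% mixed in, but not enough to protect | ❌ None (Stage 3/4 are pure OCR) |
| SFT present | ❌ No separate SFT | ✅ Yes (Stage 4, 300K OCR-centric) |

#### Specializing the LLM decoder for a downstream task can cause catastrophic forgetting and loss of instruction following

1. Overwriting in parameter space

   Parameter distribution of a pretrained LLM:
   - General language understanding (occupies most of the parameter space)
   - Instruction following (learned through instruction tuning)
   - Reasoning ability
   - Multi-task generalization

   After specialized training (for example OCR):
   - OCR pattern recognition (newly learned, overwrites the original parameters)
   - Fixed-format output (reinforces specific paths)
   - General abilities (squeezed to the margins or forgotten)

   Key issues:
   - An LLM's parameter capacity is limited (a 3B or 1.5B model cannot "remember everything perfectly").
   - Fine-tuning shifts the weight distribution and biases the model toward the new task.
   - When the new task has a lot of data with a narrow distribution, it squeezes the parameter space of the existing abilities.

2. Directional drift of gradient updates

   During full fine-tuning:
   ```python
   # Simplified illustration (model, optimizer and ocr_dataset are assumed to exist)
   for batch in ocr_dataset:      # assume ~90% of the data is OCR
       optimizer.zero_grad()      # clear gradients left over from the previous step
       loss = model(batch)        # assume the model returns a scalar loss
       loss.backward()            # gradients all point toward "how to do OCR well"
       optimizer.step()           # parameters move in the OCR-optimizing direction
   ```
   Result:
   - Gradients keep pushing the parameters toward the OCR optimum.
   - The parameter region that supported instruction following is damaged.
   - The model learns one fixed pattern: "see an image → output structured text".

3. Reshaping of the loss landscape

   Loss landscape of the original LLM:
   - Several local optima (corresponding to different tasks)
   - Instruction-following ability = one of the wide basins

   After OCR specialization:
   - The OCR task forms a "deep, narrow canyon".
   - The instruction-following basin is filled in or becomes shallower.
   - The model is stuck in the OCR canyon and generalizes poorly.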
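
The gradient-drift argument in point 2 can be reproduced on a toy problem. What follows is a minimal sketch under strong assumptions, not the training setup of either model: a 4-weight linear regressor stands in for the decoder, and the two synthetic tasks (`w_general`, `w_ocr`) and the helpers `mse` and `gradient_descent` are hypothetical names introduced only for this illustration. The model is first fitted on the "general" task, then fine-tuned on "OCR" data alone, and the loss on the general task is measured before and after.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((X @ w - y) ** 2))

def gradient_descent(w, X, y, lr=0.1, steps=200):
    """Full-batch gradient descent on the MSE loss."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))

# Two hypothetical tasks that need different weights from the same model.
w_general = np.array([1.0, -2.0, 0.5, 3.0])   # stands in for general ability
w_ocr = np.array([-1.0, 0.5, 2.0, -3.0])      # stands in for the OCR task
y_general = X @ w_general
y_ocr = X @ w_ocr

# "Pretraining": fit the general task.
w = gradient_descent(np.zeros(4), X, y_general)
loss_general_before = mse(w, X, y_general)

# "Specialization": every gradient step now uses OCR data only.
w = gradient_descent(w, X, y_ocr)
loss_general_after = mse(w, X, y_general)
loss_ocr_after = mse(w, X, y_ocr)

print(f"general loss before fine-tuning: {loss_general_before:.2e}")
print(f"general loss after fine-tuning:  {loss_general_after:.2e}")
print(f"OCR loss after fine-tuning:      {loss_ocr_after:.2e}")
```

Because this toy model has no spare capacity, the forgetting is total; a real LLM has far more parameters and the effect is partial, so the sketch illustrates only the direction of the drift, not its size.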