vLLM
===
###### tags: `LLM / inference`
###### tags: `LLM`, `inference`, `推論`, `vLLM`, `--enforce-eager`, `eager=true`

[TOC]

## Intro

- [[github] vllm](https://github.com/vllm-project/vllm/)

## Launching

```
python3 -m vllm.entrypoints.openai.api_server --port 5000 --model /model/llama3
```

### Full launch command

> Model: vllm-llama32-11b-vision

```bash
#!/bin/bash
MAX_CONTEXT_LEN=131072  #32768 #8192
MAX_NUM_SEQ=64
GPU_UTILIZATION=0.8
TP=4

python3 -m vllm_ocisext.entrypoints.openai.api_server \
    --served-model-name vllm-llama32-11b-vision \
    --port 5000 \
    --model /models/Llama-3.2-11B-Vision-Instruct \
    --dtype bfloat16 \
    --tensor-parallel-size $TP \
    --pipeline_parallel_size 1 \
    --max-num-batched-tokens $MAX_CONTEXT_LEN \
    --max-model-len $MAX_CONTEXT_LEN \
    --max-num-seqs $MAX_NUM_SEQ \
    --gpu-memory-utilization $GPU_UTILIZATION \
    --distributed-executor-backend mp \
    --enable-chunked-prefill false \
    --enforce-eager \
    --guided-decoding-backend lm-format-enforcer \
    --tool-call-parser llama31
```

- Running with fp8?

```
python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name meta-llama33-70b-inst \
    --port 5000 \
    --model /models/Llama-3.3-70B-Instruct/ \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --pipeline_parallel_size 1 \
    --max-num-batched-tokens 8192 \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.90 \
    --quantization="fp8"
```
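Both commands expose an OpenAI-compatible API. A minimal sketch of calling it with the official `openai` Python client, assuming the first server above is running locally on port 5000 (the prompt is just a placeholder):

```python
from openai import OpenAI  # pip install openai

# vLLM's api_server speaks the OpenAI protocol; no real key is needed by default.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="vllm-llama32-11b-vision",  # must match --served-model-name
    messages=[{"role": "user", "content": "Describe vLLM in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```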
## References

- [A walkthrough of the vLLM inference flow](https://blog.csdn.net/just_sort/article/details/132115735)
- [Ray: AI compute infrastructure for the large-model era](https://www.jiqizhixin.com/articles/2023-08-17-6)
- [vllm/vllm/sampling_params.py#skip_special_tokens](https://github.com/vllm-project/vllm/blob/v0.4.0.post1/vllm/sampling_params.py#L91-L93)

<br>
<hr>
<hr>
<br>

# Parameter notes

## `--enforce-eager`

- [doc](https://docs.vllm.ai/en/latest/models/engine_args.html)
  Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility.
- [[Usage]: what is enforce_eager #4449](https://github.com/vllm-project/vllm/issues/4449)
  - it disables the construction of CUDA graph in Pytorch. This may harm performance but reduces the memory requirement (of maintaining the CUDA graph)
  - CUDAGraphs take a bit of memory and time to compile, but can be helpful for performance if the model is small
- Conclusion:
  - true -> lower memory usage -> lower performance
  - false -> default (hybrid eager mode + CUDA graph)

### Q&A

- When either enforce_eager=True or False would work, should False be preferred? (i.e., enable CUDA graphs) [time=2024 11 20 07:51]
  Frank: I would go with False first; it speeds up sequences whose output stays within 8k.

### Blog

- [Vllm Enforce Eager=True Explained](https://www.restack.io/p/vllm-knowledge-eager-true-cat-ai)

<br>

### [Discussion] Key takeaways on enabling `eager=true` in vLLM

- ### gpt-4o-2024-11-20

### 1. **What eager execution is**

Enabling `eager=true` forces the model to run in eager execution mode: each operation executes immediately instead of being deferred until the whole computation graph is built.

### 2. **Benefits of eager execution**

- **Immediate feedback**: operation results are available right away, making problems easy to spot.
- **Simpler debugging**: intermediate results and state can be inspected, so debugging is more direct.
- **Dynamic computation graphs**: the graph can change on demand, which suits flexible model architectures.

### 3. **How to enable it**

1. **Set the parameter**: add `eager=true` to the model's initialization settings.
2. **Test**: run tests after enabling and check for differences in execution behavior.
3. **Monitor performance**: watch performance metrics to avoid degrading production efficiency.

### 4. **Example code**

```python=
from vllm import LLM

# Force eager mode (vLLM's Python API spells this enforce_eager;
# it disables CUDA graph capture)
llm = LLM(model="your_model_name", enforce_eager=True)

# Run inference
outputs = llm.generate(["Hello, world"])
print(outputs[0].outputs[0].text)
```

### 5. **Considerations**

- **Performance tradeoff**: eager execution helps development but can add overhead and hurt production efficiency.
- **Community involvement**: join the vLLM community and share experience and suggestions to help improve stability and performance.

### 6. **When it fits**

- Suited to development and debugging, where interactivity and flexibility matter most.
- Production deployments should weigh performance needs and may be better served by non-eager execution.

<br>

### [Discussion] What does `eager=true` actually mean?

- ### gpt-4o-2024-11-20

`eager=true` means "enable immediate execution mode", also readable as "force eager execution". With this setting on, the system executes every operation the moment it is issued, instead of waiting for the whole computation graph to be built and then executing in a batch.

### Key concepts

- **Eager (immediate)**: operations run the moment they are called, as opposed to lazy execution.
- **Execution mode**: eager mode suits development, testing, and debugging, because results and state can be inspected at every step.

### Execution-mode comparison

1. **lazy (deferred execution)**: build all operations first, then execute them in one go.
2. **eager (immediate execution)**: every operation executes at the moment it is called.

### Example code

```python=
from vllm import LLM

# Force eager mode (disables CUDA graph capture)
llm = LLM(model="your_model_name", enforce_eager=True)

# Run inference
outputs = llm.generate(["Hello, world"])
print(outputs[0].outputs[0].text)
```

---

## `--distributed-executor-backend=mp`

> `pipeline_parallel_size` x `tensor_parallel_size` <= number of GPUs on one machine -> `mp`
> - **In other words**:
>   - On a single-node (single-machine) deployment, `--distributed-executor-backend` already defaults to `mp` (the multiprocessing backend).
>   - vLLM picks `mp` automatically when it detects that all GPUs are on the same machine and `pipeline_parallel_size × tensor_parallel_size ≤ number of GPUs`, so `ray` is not needed.
>   - If you run vLLM on one machine and never specify `--distributed-executor-backend`, it uses `mp`, i.e., multiple worker processes, one per GPU.
>   - `ray` only becomes relevant for multi-node deployments.

- ### gpt-4o + Internet (2025/06/04)

In vLLM (an LLM inference engine), the `--distributed-executor-backend=mp` flag selects multiprocessing as the distributed execution backend.

This mode targets single-machine, multi-GPU deployments: when a model is too large for one GPU but can be sharded across the GPUs of one machine, this mode runs inference effectively.

### Characteristics and use cases of `mp` mode

- **Single-node multi-GPU support**: when the model needs several GPUs that all sit in one machine, `mp` mode runs one Python process per GPU for parallel execution.
- **No Ray installation needed**: unlike the Ray backend, which coordinates across nodes, `mp` does not depend on the Ray framework, cutting extra installation and configuration.
- **Lower system overhead**: with no cross-node communication involved, `mp` has lower overhead and a simpler deployment flow on a single machine.

### Automatic selection vs. manual override

vLLM can choose the executor backend automatically.
When the product of `pipeline_parallel_size` and `tensor_parallel_size` is at most the number of available GPUs, it defaults to `mp`.
Otherwise, if Ray is installed, it picks `ray`.
Either way, you can force a backend with the `--distributed-executor-backend` flag.

### Usage example

Suppose you have a machine with 4 GPUs and want to serve inference across all of them:

```
vllm serve facebook/opt-13b \
    --tensor-parallel-size 4 \
    --distributed-executor-backend mp
```

This launches a service with 4 processes (one per GPU), each owning a shard of the model, giving tensor-parallel execution.

### Summary

`--distributed-executor-backend=mp` lets vLLM run efficiently on a single multi-GPU machine with no external distributed framework, which fits deploying large language models on one box.

---

<br>

## `--guided-decoding-backend lm-format-enforcer`

### docs

- [DecodingConfig](https://docs.vllm.ai/en/latest/configuration/engine_args.html#decodingconfig)
- [[github] noamgat/lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer?tab=readme-ov-file#comparison-to-other-libraries)

![](https://hackmd.io/_uploads/SJHbQefIgx.png)

<br>

### Worked example

- ### JSON content so far:

```json
{ "key": "value"
```

The next token can only be:

- `" "` (whitespace is allowed)
- `","` (another key follows)
- `"}"` (closes the dict)

Every other token gets probability 0.
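A toy, self-contained sketch of that masking step (the vocabulary, logits, and allowed set below are invented for illustration and far smaller than a real tokenizer's; this is not the lm-format-enforcer API itself):

```python
import math

# Imaginary vocabulary and model logits for the state after `{ "key": "value"`.
vocab = [' ', ',', '}', '{', 'foo', ':']
logits = [1.2, 0.8, 2.0, 2.5, 0.3, 0.7]  # the model's top pick '{' is illegal here

# The format enforcer derives the legal next tokens from the JSON grammar.
allowed = {' ', ',', '}'}

# Mask: push illegal tokens to -inf so softmax assigns them probability 0.
masked = [l if tok in allowed else -math.inf for tok, l in zip(vocab, logits)]

# Greedy selection over the masked logits can only yield a legal token.
print(vocab[masked.index(max(masked))])  # -> '}'
```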
---

### [Discussion] What is `--guided-decoding-backend lm-format-enforcer` for?

- ### gpt-4.1-mini (2025/07/14)

`--guided-decoding-backend lm-format-enforcer` is a vLLM option that selects the backend used for **guided decoding**.

---

### What it does

* **Guided decoding**: a technique that steers the model toward a particular kind of output during generation, typically to guarantee the result satisfies some format or rule.
* **lm-format-enforcer**: this backend constrains and validates the generated output so it conforms to a predefined format or rule. For example:

  * guaranteeing the output is valid JSON or XML, or matches a regular expression;
  * avoiding syntactically or structurally broken generations;
  * producing stricter, more structured text.

---

### Summary

With `--guided-decoding-backend lm-format-enforcer`, vLLM runs a format enforcer during generation, so the output strictly follows the requested format, cutting format errors and improving output quality and consistency.

<br>

---

### [Discussion] Isn't the output token the model's choice? How can LM Format Enforcer decide which tokens come out?

- ### gpt-4.1-mini (2025/07/14)

Good question! You are right that **token output is decided by the model's probability distribution**: the model scores every candidate next token and one is picked (via sampling, beam search, and so on).

So how can LM Format Enforcer "decide" or restrict the output tokens? Here is the mechanism:

---

### Why LM Format Enforcer can restrict token output

#### 1. **The model scores every candidate token (logits)**

At every generation step, the model computes, over the whole vocabulary, how likely each token is to come next. These scores are the "logits", i.e., the probability distribution.

---

#### 2. **Guided decoding post-filters the distribution the model produced**

LM Format Enforcer does not alter the model's internal computation; it **takes the distribution the model produced and, before a token is chosen, screens it against the format rules, removing illegal tokens.**

* Concretely:
  * it first computes, from the format rules, the set of tokens allowed next (allowed tokens);
  * it then cuts the probability of every token outside that set (setting it extremely low or to negative infinity), so the model **cannot choose an illegal token**.

---

#### 3. **The actual decode loop**

1. **The model computes the probability distribution over all tokens.**
2. **LM Format Enforcer derives, from the text generated so far and the format rules, which next tokens are legal.**
3. **Illegal tokens get probability 0 (or negative infinity) and drop out.**
4. **The next token is sampled or selected from the remaining legal tokens.**

---

#### 4. **An analogy**

It is as if:

* the model first lists every possible answer (token);
* you filter the list for it and say "you may only choose from these";
* the model can then only pick from the allowed answers, nothing else.

---

#### 5. **What this buys**

* It implements constrained decoding: constraints are imposed on the generation process at decode time.
* The model itself is untouched; only the decoding behavior changes.
* Generated content is guaranteed to follow the format rules.

---

### Bottom line

* **The model computes the probability distribution.**
* **LM Format Enforcer filters that distribution before each token is picked, so tokens that break the format can never be chosen.**

The generation process thus keeps the model's language ability while guaranteeing a well-formed result (a toy version of this filtering step is sketched under the worked example above).

<br>
<br>

# Misc

## deepseek_r1_reasoning_parser.py vs qwen3_reasoning_parser.py

- ### Sources
  - [deepseek_r1_reasoning_parser.py](https://github.com/vllm-project/vllm/blob/main/vllm/reasoning/deepseek_r1_reasoning_parser.py)
  - [qwen3_reasoning_parser.py](https://github.com/vllm-project/vllm/blob/main/vllm/reasoning/qwen3_reasoning_parser.py)

- ### gpt-o4-mini-high (2025/05/20)

### 1. Default behavior with and without the tags

- **DeepSeekR1**
  - The model **does not necessarily emit a `<think>` tag**, so the parser assumes everything from the start of the output is reasoning, up to `</think>`.
  - If `extract_reasoning_content` finds no `</think>`, it returns `(model_output, None)`: the whole output counts as reasoning, with no content.
- **Qwen3**
  - **Strictly requires** both `<think>` and `</think>` to split out the reasoning.
  - If either tag is missing, it returns `(None, model_output)`: everything counts as content, with no reasoning.

### 2. Differences in `extract_reasoning_content`

|                    | DeepSeekR1                                                  | Qwen3                                                            |
|--------------------|-------------------------------------------------------------|------------------------------------------------------------------|
| With `<think>`     | Drops everything up to `<think>`, then splits at `</think>` | Splits with `partition("<think>")`, then `partition("</think>")` |
| Without `</think>` | Returns `(entire string, None)`                             | Returns `(None, entire string)`                                   |

### 3. Streaming logic (`extract_reasoning_content_streaming`)

- **DeepSeekR1**
  - Handles several situations:
    1. **`<think>` appeared earlier** and `</think>` shows up in a later delta: extract the reasoning, then treat whatever follows as content.
    2. **Only `</think>` appears** (in the delta or earlier): still handled; reasoning is extracted and content split off.
    3. **`<think>` never appears at all**: the whole delta is treated as reasoning (the default assumption).
- **Qwen3**
  - Simplified flow:
    1. Only once **`<think>` has appeared** (in the delta or earlier) does it go on to look for `</think>`; otherwise everything is returned as content.
    2. A standalone `</think>` is not special-cased: if `<think>` never arrived, nothing is treated as reasoning.

### 4. Which models they fit

- **DeepSeek R1**
  - The model itself **no longer** emits the opening `<think>` tag (but keeps `</think>` as the boundary), so the parser must be lenient: treat the beginning of the output as reasoning and split at `</think>`.
- **Qwen3**
  - The model **does** emit the complete `<think>…</think>` pair (and thinking can be switched off), so the parser can be strict: only text inside an actual tag pair is reasoning, and everything else is the final answer.

---

In short, the **DeepSeekR1** parser accommodates a model that mostly omits the opening tag, so it defaults to treating the output as reasoning; the **Qwen3** parser targets a model that must emit the full `<think>…</think>` pair, so it defaults to treating text as content and only extracts reasoning when the complete pair appears.
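A rough sketch of the two non-streaming splits described above, as plain string handling (simplified from the described behavior; these are not the actual vLLM parser classes):

```python
def deepseek_r1_split(model_output: str):
    """Lenient: everything before `</think>` is reasoning; no closing tag means all reasoning."""
    if "</think>" not in model_output:
        return model_output, None                 # (reasoning, content)
    # DeepSeek R1 may omit the opening <think>; drop it when present.
    body = model_output.split("<think>", 1)[-1]
    reasoning, _, content = body.partition("</think>")
    return reasoning, content

def qwen3_split(model_output: str):
    """Strict: reasoning only when the full <think>...</think> pair is present."""
    if "<think>" not in model_output or "</think>" not in model_output:
        return None, model_output                 # (reasoning, content)
    _, _, rest = model_output.partition("<think>")
    reasoning, _, content = rest.partition("</think>")
    return reasoning, content

print(deepseek_r1_split("plan the answer</think>final answer"))
# ('plan the answer', 'final answer')  -- DeepSeek R1 output with no opening tag
print(qwen3_split("plan the answer</think>final answer"))
# (None, 'plan the answer</think>final answer')  -- Qwen3 needs the full pair
```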
