---
hackmd:
    url: https://hackmd.io/kztDWXnISIWNPOic64h5TA
    title: linux final
    lastSync: 2025-04-26T05:04:38.718Z
---

# 2025q1 Linux Term Project (BitNet)

## Optimization directions
- Memory
    - NUMA
    - mmap (what llama.cpp already uses)
    - Huge pages
    - THP
- I/O
    - `io_uring`
- Scheduler
    - FIFO
    - CFS
    - CPU-bound vs. I/O-bound behavior

## Correspondence
- [Mail thread](https://mail.google.com/mail/u/0/#inbox/QgrcJHsTmbFLcjVnzSxbVNxsTFFmpJPbdbv), [Facebook post](https://www.facebook.com/groups/system.software2025/posts/973447758325359/)
- BitNet
    - [Run BitNet b1.58 2B4T on GNU/Linux and reproduce the paper's experiments](#Run-BitNet-b158-2B4T-on-GNULinux-and-reproduce-the-papers-experiments)
    - [Use perf and similar tools to measure the top 20 functions by compute share during inference, and explain their roles](#Use-perf-and-similar-tools-to-measure-the-top-20-functions-by-compute-share-during-inference-and-explain-their-roles)
    - [Analyze memory usage, in particular page-fault and TLB-miss statistics](#Analyze-memory-usage-in-particular-page-fault-and-TLB-miss-statistics); miners such as XMRig [2] use huge pages (or THP) to measurable effect
    - [Evaluate T-MAC [3][5], in particular its lookup-table benefit when paired with BitNet, and record perf event statistics](#Evaluate-T-MAC-in-particular-its-lookup-table-benefit-with-BitNet-and-record-perf-event-statistics)
    - Examine the model-loading mechanism and whether facilities such as splice [4] can speed it up
- Overall question: how can Linux kernel mechanisms accelerate BitNet [1]?

[1] https://github.com/microsoft/BitNet
[2] https://xmrig.com/docs/miner/hugepages
[3] https://github.com/microsoft/T-MAC
[4] https://hackmd.io/@sysprog/linux-zerocopy
[5] BitNet ships a LUT implementation: https://github.com/microsoft/BitNet/tree/main/src

## TODO (start)
- [Glossary](https://hackmd.io/@alanhc/B1ZkopHmlx)
- [Profiling](/JqdN3X6hQSm5A4-NsKEMFQ)
- [ ] ik_llama.cpp [notes](https://hackmd.io/@alanhc/BkyjoTHmll)
    - [ ] [Evaluate ik_llama.cpp performance](#Evaluate-ik_llamacpp-performance)
- [ ] BitNet [notes](https://hackmd.io/@alanhc/rkJJoPnzel)
    - [x] [Run BitNet b1.58 2B4T on GNU/Linux and reproduce the paper's experiments](#Run-BitNet-b158-2B4T-on-GNULinux-and-reproduce-the-papers-experiments)
    - [x] [Use perf and similar tools to measure the top 20 functions by compute share during inference, and explain their roles](#Use-perf-and-similar-tools-to-measure-the-top-20-functions-by-compute-share-during-inference-and-explain-their-roles)
- [ ] T-MAC [notes](https://hackmd.io/@alanhc/t-mac)
    - [ ] [T-MAC's model-loading mechanism](#T-MACs-model-loading-mechanism)
    - [ ] [Analyze memory usage, in particular page-fault and TLB-miss statistics](#Analyze-memory-usage-in-particular-page-fault-and-TLB-miss-statistics)
    - [ ] [Use perf to analyze T-MAC](#Use-perf-to-analyze-T-MAC)
    - [ ] [Use uftrace to locate the cost of the dot-product kernels](#Use-uftrace-to-locate-the-cost-of-the-dot-product-kernels)
- [ ] Scheduler
    - [ ] [Disable `GGML_USE_OPENMP` so `llama.cpp`'s own multi-threading handles matrix computation, avoiding the nondeterminism of GCC/LLVM's OpenMP](#Disable-GGML_USE_OPENMP-so-llamacpps-own-multi-threading-handles-matrix-computation-avoiding-the-nondeterminism-of-GCCLLVMs-OpenMP)
    - [ ] [Try the FIFO scheduling policy](#Try-the-FIFO-scheduling-policy)
    - [ ] [Improve llama.cpp's internal scheduling; see ggml: Implement yield barrier using futex for improved thread scheduling efficiency](#Improve-llamacpps-internal-scheduling-see-ggml-Implement-yield-barrier-using-futex-for-improved-thread-scheduling-efficiency)
- [ ] Memory
    - [ ] [Current mmap-based model loading in llama.cpp](#Current-mmap-based-model-loading-in-llamacpp)
    - [ ] [TODO: speed up with huge pages](#TODO-speed-up-with-huge-pages)
    - [ ] [Evaluate huge-page benefits](#Evaluate-huge-page-benefits) / io_uring
    - [ ] [Experiment: perf stat of BitNet inference with THP on vs. off](#Experiment-perf-stat-of-BitNet-inference-with-THP-on-vs-off)
    - [ ] [Evaluate T-MAC, in particular its lookup-table benefit with BitNet, and record perf event statistics](#Evaluate-T-MAC-in-particular-its-lookup-table-benefit-with-BitNet-and-record-perf-event-statistics)
    - [ ] Read the [Introduce New Lookup-Table(LUT)-Based Matrix Multiplication Method](https://github.com/ggml-org/llama.cpp/pull/10181) discussion alongside
## Environment for these experiments
- ASUS GL552VW, i7-6700HQ, 16 GB RAM, HDD
- Intel NUC7i7DNHE, i7-8650U, 16 GB RAM, SSD
- Intel Core i7-14700 (desktop), 64 GB RAM, SSD
- perf setup
    - `sudo sh -c 'echo -1 > /proc/sys/kernel/perf_event_paranoid'`

## Backgrounds
- [Huge page](https://hackmd.io/5LvqUZQ4RtmkQAoz3Dw6Bw#Huge-Page)
- [llama.cpp](/fdgpX6KkTSmBiLJzpakb-A)
- [Profiling](/JqdN3X6hQSm5A4-NsKEMFQ)
- [BitNet](/K_oSz8V5RcibFD04lj-Emg)
    - BitNet has a potential weakness: [TriLM vs FloatLM: Ternary LLMs are more Performant than Quantized FP16 LLMs](https://openreview.net/pdf?id=gvaDL9omKU)
- [T-MAC](/6U2Ac9Z7QB2GfFBYtNcxPw)

### Development history
- 25 Apr 2025: BitNet b1.58 v2
- 16 Apr 2025: BitNet b1.58 v1
- 25 Mar 2025: T-MAC v2
- 25 Jun 2024: T-MAC v1
- 17 Oct 2023: BitNet

See also [A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithm](https://arxiv.org/abs/2409.16694).

## T-MAC's model-loading mechanism
- [T-MAC's model-loading mechanism](https://hackmd.io/6U2Ac9Z7QB2GfFBYtNcxPw#T-MAC-%E8%BC%89%E5%85%A5%E6%A8%A1%E5%9E%8B%E7%9A%84%E6%A9%9F%E5%88%B6)

## Use perf and similar tools to measure the top 20 functions by compute share during inference, and explain their roles
- `cmake`: invokes CMake.
    - `-B build`: selects the build directory (equivalent to `mkdir build && cd build`).
    - `-DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS=...`: sets the build type and C++ compiler flags.
        - `RelWithDebInfo`: optimized (Release-level) build that keeps debug symbols for debugging.
        - `-g`: emit debug symbols.
        - `-fno-omit-frame-pointer`: keep the frame pointer so perf, gdb, uftrace, and similar tools can walk the call stack.
- Add to `BitNet/setup_env.py`:
```python
def compile():
    ...
    run_command(["cmake", "-B", "build", "-DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS=-g -fno-omit-frame-pointer"], log_step="generate_build_files")
    run_command(["cmake", "--build", "build", "--config", "RelWithDebInfo"], log_step="compile")
```
- Profile with perf:
    - `perf record -g python run_inference.py -m ~/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "hi, how are you" -n 10 --temp 0`
    - `perf report` results: https://gist.github.com/alanhc/a03265dde546bf9b8cfaedc0643e2a81

{%gist a03265dde546bf9b8cfaedc0643e2a81 %}

![image](https://hackmd.io/_uploads/SkKoUt2Gge.png)

| Rank | Function | Module | % cycles | Notes |
| -- | ---------------------------------------- | -------------------- | -------- | ------------------------------------------------ |
| 1 | `ggml_graph_compute.omp_outlined` | `libggml.so` | ~90% | OpenMP-wrapped main compute kernel; dispatches every graph-node computation. |
| 2 | `__kmp_invoke_microtask` | `libomp.so.5` | ~90% | OpenMP runtime microtask launcher (per-thread work); calls into ggml's parallel operators. |
| 3 | `ggml_compute_forward_thread` | `libggml.so` | ~90% | Per-thread entry that executes individual graph nodes. |
| 4 | `ggml_compute_forward` | `libggml.so` | ~89% | Forward computation for one node (matmul, activation, ...). |
| 5 | `ggml_compute_forward_mul_mat` | `libggml.so` | ~88% | Main matrix-multiplication (MatMul) entry point. |
| 6 | `ggml_compute_forward_mul_mat_one_chunk` | `libggml.so` | ~87% | Called by `mul_mat`; splits the large matrix into chunks. |
| 7 | `ggml_vec_dot_i2_i8_s` | `libggml.so` | ~52% | **Vector dot product for the packed i2/i8 format** (the low-bit quantized model). This is the core compute kernel. |
| 8 | `llama_decode` | `libllama.so` | ~46% | Main decode routine; runs the ggml graph for token inference. |
| 9 | `llama_decode_internal` | `libllama.so` | ~46% | Actual implementation: one round of prompt context → logits. |
| 10 | `llama_graph_compute` | `libllama.so` | ~46% | Wrapper around graph execution; decouples backend from compute. |
| 11 | `ggml_backend_sched_graph_compute_async` | `libggml.so` | ~46% | Kicks off asynchronous inference (threadpool scheduler). |
| 12 | `ggml_backend_sched_compute_splits` | `libggml.so` | ~46% | Splits the ggml graph and dispatches it to threads. |
| 13 | `ggml_backend_graph_compute_async` | `libggml.so` | ~46% | Asynchronous graph-compute entry point. |
| 14 | `ggml_backend_cpu_graph_compute` | `libggml.so` | ~46% | Runs the ggml graph on the CPU backend (vs. GPU/NPU). |
| 15 | `ggml_graph_compute` | `libggml.so` | ~46% | Graph-compute entry point; calls into the backend. |
| 16 | `__kmpc_fork_call` | `libomp.so.5` | ~45% | OpenMP fork call; sets up the thread team. |
| 17 | `main` → `__libc_start_main` | `llama-cli` / `libc` | ~55% | Program entry; its share is high only because it contains the whole call stack. |
| 18 | `_start` | `llama-cli` | ~55% | Binary entry point, not a real compute hotspot. |
| 19 | `libc.so.6` entries | `libc.so.6` | ~55% | Call-stack setup; no real compute load. |
| 20 | anonymous `0x…` symbol | `libomp.so.5` | ~89% | Anonymous pointer in the OpenMP runtime, usually microtask-related. |

| Hotspot | Notes |
| ------------------------------ | ---------------------------------- |
| `ggml_vec_dot_i2_i8_s` | Bitwise INT2 dot product, specific to BitNet |
| `ggml_compute_forward_mul_mat` | The main matmul entry point in ggml |
| `libomp.so.5` appearing throughout the call stack | Most of the parallel compute goes through OpenMP |
| `ggml_vec_dot_f16` | Some non-quantized tensors still use F16 (likely embeddings during inference) |

- Flame graph
  ![image](https://hackmd.io/_uploads/rkb4LY2Mee.png)
- Simplified call flow:
```mermaid
graph TD
llama_cli_main --> llama_decode
llama_decode --> llama_graph_compute
llama_graph_compute --> ggml_graph_compute
ggml_graph_compute --> ggml_compute_forward
ggml_compute_forward --> ggml_compute_forward_mul_mat
ggml_compute_forward_mul_mat --> ggml_vec_dot_i2_i8_s
```
### TODO: speed up with huge pages
> Miners such as XMRig make good use of huge pages (or THP) and obtain real speedups.
- [XMRig](https://deepwiki.com/xmrig/xmrig)
- https://xmrig.com/docs/miner/hugepages
- [How to enable huge pages](https://hackmd.io/5LvqUZQ4RtmkQAoz3Dw6Bw#Huge-Page)
- [Earlier experiment](https://gist.github.com/alanhc/474128a1e581a6f36e32a2a032d9d4e1)
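To make this direction concrete, here is a minimal C sketch (my own illustration, not BitNet/llama.cpp code) that backs a weight buffer with explicit 2 MiB huge pages via `mmap(MAP_HUGETLB)` and falls back to normal pages; it assumes a pool has been reserved in advance through `/proc/sys/vm/nr_hugepages`:

```c
/* Sketch: back a weight buffer with explicit 2 MiB huge pages.
 * Assumes a pool was reserved first, e.g.:
 *   echo 512 | sudo tee /proc/sys/vm/nr_hugepages
 * Falls back to normal 4 KiB pages when the pool is empty. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

static void *alloc_weights(size_t size) {
    /* round the size up to a multiple of the huge-page size */
    size_t len = (size + HUGE_2MB - 1) & ~(HUGE_2MB - 1);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* pool exhausted or hugetlb unavailable: use regular pages */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? NULL : p;
}

int main(void) {
    void *buf = alloc_weights(64 * HUGE_2MB);   /* 128 MiB */
    if (!buf) { perror("mmap"); return 1; }
    memset(buf, 0, 64 * HUGE_2MB);              /* touch every page once */
    puts("allocated");
    return 0;
}
```

Each huge page covers 512 regular pages, which is exactly the TLB/page-fault reduction the XMRig docs describe.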
## Run BitNet b1.58 2B4T on GNU/Linux and reproduce the paper's experiments
- [Reproducing the paper's experiments](https://hackmd.io/K_oSz8V5RcibFD04lj-Emg#%E9%87%8D%E7%8F%BE%E8%AB%96%E6%96%87%E5%AF%A6%E9%A9%97)

## Current mmap-based model loading in llama.cpp
- [Current mmap-based model loading in llama.cpp](https://hackmd.io/fdgpX6KkTSmBiLJzpakb-A#%E7%8F%BE%E8%A1%8C-llamacpp-%E7%9A%84-mmap-%E8%BC%89%E5%85%A5%E6%A9%9F%E5%88%B6)

## Analyze memory usage, in particular page-fault and TLB-miss statistics
- Within BitNet:
> Miners such as [XMRig](https://deepwiki.com/xmrig/xmrig) make good use of huge pages (or THP) and obtain real speedups.
```
sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses \
./build/bin/llama-cli \
    -m ./models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "Hi, how are you?" \
    -n 10 \
    --temp 0.0 \
    --top_k 1 \
    --ignore-eos
```
- Result: https://gist.github.com/alanhc/701f510a45fddfe399f60451fa93f3e2
- IPC

| Metric | Value | Assessment |
| -------------- | ---- | ---------------------- |
| `cpu_core` IPC | 2.82 | ✅ very high; the pipeline executes efficiently |
| `cpu_atom` IPC | 1.73 | ✅ reasonable; Atom cores are normally less efficient |

- Cache

| Item | Value | Miss rate |
| --------------------------- | ---- | ---------- |
| `cpu_core/cache-references` | 295M | - |
| `cpu_core/cache-misses` | 127M | **43.13%** |
| `cpu_atom/cache-misses` | 34M | **46.28%** |

- A cache-miss rate above 40% means the model's memory accesses lack spatial/temporal locality. Possible causes:
    - the int2 compression scheme leaves the weights laid out non-contiguously
    - no cache blocking / tiled matrix multiplication
    - too few output tokens, so init and graph-construction costs are never amortized
- Memory system: TLB

| Item | Value | Notes |
| --------------------- | ----- | --------------------------- |
| `cpu_core/dTLB-loads` | 44.6B | heavy memory traffic |
| `dTLB-load-misses` | 4.3M | **miss rate only ~0.009%**, very low |
| `iTLB-load-misses` | 56K | normal range; instruction fetch hits well |

- Page faults

| Metric | Value |
| ------------ | ---- |
| minor-faults | 101K |
| major-faults | 0 |

- Time

| Metric | Value |
| ----------------- | ------ |
| real time | 1.80 s |
| user + sys time | 7.59 s |
| estimated average thread count | ≈ 4.2 |

![screenshot 2025-04-28 21:05](https://hackmd.io/_uploads/SybKEB0Jgl.png)

- iTLB-load-misses somewhat high in this run
    - a high instruction-translation miss rate
    - many small pages force repeated page-table walks during instruction fetch
- cache-misses
    - the cache-miss count is very high
    - memory accesses lack locality and cross pages heavily
- Such a high **iTLB miss** rate → the instruction region (the program itself) is fragmented and translation is expensive.
- Such a high **cache miss** rate → the data region (model weights, activations) is scattered.

## Experiment: perf stat of BitNet inference with THP on vs. off
- THP off: https://gist.github.com/alanhc/14355d2075dad3d4c6e185f496b6f8b2
- THP on: https://gist.github.com/alanhc/4e1a45d4b4143125296567c7d0e13da8

### Comparison

| Metric | **THP off** | **THP on** | Delta (%) |
| ---------------------- | ---------- | ----------- | --------------- |
| ⏱️ Time elapsed | 1.817 s | 1.823 s | 🔄 +0.33% (essentially unchanged) |
| 🧠 Instructions (Core) | 108.73B | 105.32B | 🔽 -3.1% |
| 🔄 Cycles (Core) | 38.76B | 36.89B | 🔽 -4.8% |
| ⚙ IPC (Core) | 2.80 | **2.86** | 🔼 +2.1% |
| 🧭 Cache references | 294.48M | 289.08M | 🔽 -1.8% |
| ❗ Cache misses | 125.97M | **126.14M** | 🔼 +0.1% |
| 🧮 Cache miss ratio | 42.78% | **43.63%** | 🔼 +0.85% |
| 🔄 dTLB misses (Core) | 4.42M | **3.12M** | 🔽 -29.5% ✅ |
| 🔄 iTLB misses (Core) | 49.17K | 46.74K | 🔽 -4.9% ✅ |
| ⚠ Minor page faults | 101K | **21K** | 🔽 -79% ✅ |

- Analysis
    - Improved
        - dTLB misses drop sharply (-29.5%): THP's main benefit is fewer TLB misses, since 4 KiB pages become 2 MiB pages and the page table shrinks by 512×.
        - Minor page faults drop (-79%): larger pages trigger far fewer page allocations.
        - Core IPC rises (2.80 → 2.86): more instructions retire per cycle, i.e. the memory bottleneck eases slightly.
    - Flat
        - Cache-miss counts barely move: THP changes virtual-page mapping and TLB hit rates, which helps L1/L2 cache misses very little.
        - Total elapsed time is nearly unchanged: inference stays CPU compute-bound; THP alters memory-access behavior, not the compute-intensive structure.
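The experiment above toggles THP system-wide; the same hint can also be applied per region with `madvise(MADV_HUGEPAGE)`, which works whenever THP is in `madvise` (or `always`) mode. A minimal sketch, again my own illustration rather than project code:

```c
/* Sketch: ask the kernel to back an existing anonymous mapping with
 * transparent huge pages. Check the current mode with:
 *   cat /sys/kernel/mm/transparent_hugepage/enabled */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 256UL * 1024 * 1024;            /* 256 MiB tensor arena */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* hint: collapse this range into 2 MiB THPs where possible */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");        /* non-fatal: plain pages */

    memset(p, 0, len);                           /* fault the pages in */
    puts("done");
    return 0;
}
```

Unlike `MAP_HUGETLB`, this needs no pre-reserved pool, which matches the dTLB/minor-fault improvements observed above without changing total runtime.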
## Evaluate T-MAC, in particular its lookup-table benefit with BitNet, and record perf event statistics
:::success
$\to$ TODO: read the [Introduce New Lookup-Table(LUT)-Based Matrix Multiplication Method](https://github.com/ggml-org/llama.cpp/pull/10181) discussion alongside this section
:::
:::info
This appears to be T-MAC maintainer QingtaoLi1 working to integrate the method into ggml-org/llama.cpp. slaren notes the integration may be difficult, which could be a contribution opportunity.
- T-MAC would have to drop all third-party dependencies, since llama.cpp's design principle is to avoid external libraries.
- The thread also mentions performance on NPUs and other hardware configurations.
- He tested on an i7-12700 and an M2 Ultra; I have an M1 Pro that could be used for testing.
:::
- TODO
    - integrate into llama.cpp
- Model: `~/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf`
```
sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./3rdparty/llama.cpp/build/bin/llama-cli -m ~/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Hi, how are you?" --temp 0.0
```
```
perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./3rdparty/llama.cpp/build/bin/llama-bench -m ~/model/model.INT_N.gguf -n 128 -t 28 -p 256 -ngl 0
```
```
perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./3rdparty/llama.cpp/build/bin/llama-bench -m ~/model/model.INT_N.gguf -n 128 -t 4 -p 256 -ngl 0
```
- Result (Intel Core i7-14700): https://gist.github.com/alanhc/a422987be400751cbfb984fc1dfffea8

https://github.com/microsoft/T-MAC
https://deepwiki.com/microsoft/T-MAC

- iTLB-loads / misses (6.4M / 3.1M → 48.49% miss rate)
    - Nearly half of instruction-TLB lookups miss, suggesting scattered code or a very call-heavy program. Consider huge pages or reducing dynamic code loading.
- cache-misses (15.98G → 39.85% miss rate)
    - A very high cache-miss rate; model parameters or intermediates likely do not fit in L3. Consider:
        - adjusting the batch size
        - NUMA-aware execution or memory-layout optimization

BitNet ships a LUT implementation: https://github.com/microsoft/BitNet/tree/main/src
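For intuition about what the "lookup-table benefit" means, the sketch below is a toy 1-bit version of the trick behind kernels like `tbl_g4_int8_int32_update_impl`: precompute all 16 partial sums for a group of 4 activations, then replace the multiplies with a nibble-indexed lookup. This is my own simplified illustration, not T-MAC's actual kernel (which additionally amortizes each table across many weight rows):

```c
/* Toy LUT-based dot product for 1-bit (+1/-1) weights and int8 activations.
 * With group size g = 4, all 2^4 = 16 possible partial sums for a group of
 * activations are precomputed once, so the inner loop becomes a table
 * lookup indexed by 4 packed weight bits instead of 4 multiply-adds. */
#include <stdint.h>
#include <stdio.h>

#define G 4  /* group size: 4 weights packed into one nibble */

/* build the 16-entry table of partial sums for one activation group */
static void build_lut(const int8_t a[G], int32_t lut[1 << G]) {
    for (int m = 0; m < (1 << G); m++) {
        int32_t s = 0;
        for (int j = 0; j < G; j++)
            s += (m >> j & 1) ? a[j] : -a[j];   /* bit=1 → +1, bit=0 → -1 */
        lut[m] = s;
    }
}

/* dot product of k 1-bit weights (2 groups per byte) with k int8 activations */
static int32_t dot_lut(const uint8_t *wpacked, const int8_t *act, int k) {
    int32_t acc = 0;
    for (int i = 0; i < k; i += G) {
        int32_t lut[1 << G];
        build_lut(&act[i], lut);   /* real kernels reuse this across rows */
        uint8_t byte = wpacked[i / 8];
        uint8_t nib  = (i / G) % 2 ? byte >> 4 : byte & 0xF;
        acc += lut[nib];
    }
    return acc;
}

int main(void) {
    int8_t  act[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint8_t w[1]   = {0xF0};  /* group 0: 0000 (all -1); group 1: 1111 (all +1) */
    /* expected: -(1+2+3+4) + (5+6+7+8) = 16 */
    printf("%d\n", dot_lut(w, act, 8));
    return 0;
}
```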
## Use uftrace to locate the cost of the dot-product kernels
- https://github.com/namhyung/uftrace
```shell
cd ~/workspace/T-MAC/3rdparty/llama.cpp
rm -rf build && mkdir build && cd build
```
- `sudo apt install libc++-dev libc++abi-dev`
- Dropped from this build (the instrumented variant is used later): `export INSTR_FLAGS="-finstrument-functions -fno-omit-frame-pointer -stdlib=libc++"`
```shell
export CC=clang
export CXX=clang++
export INSTR_FLAGS="-fno-omit-frame-pointer -stdlib=libc++"

cmake .. \
  -DGGML_TMAC=ON \
  -DCMAKE_PREFIX_PATH=/home/alanhc/workspace/T-MAC/install/lib/cmake/t-mac \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_C_FLAGS="$INSTR_FLAGS" \
  -DCMAKE_CXX_FLAGS="$INSTR_FLAGS"

cmake --build . --target llama-cli -j$(nproc)
```
- Run uftrace:
```
LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH \
uftrace record -a ./bin/llama-cli \
  -m ~/model/model.INT_N.gguf -n 5 -t 28 -p "hello" --temp 0
```
:::info
With this setup uftrace currently fails with `WARN: child terminated by signal: 7: Bus error`, and RAM keeps growing while it runs; still under investigation, apparently related to /dev/shm. Clearing it avoids exhausting memory:
`sudo find /dev/shm -mindepth 1 -delete`
![image](https://hackmd.io/_uploads/S1_vytsble.png)
Workaround: use a filter so uftrace does not observe too many functions at once.
:::
- Report: `uftrace report` — https://gist.github.com/alanhc/5b2f479c0bbf2a7540d2e3b95471fa23
    - `14.994 s 14.994 s 38 linux:schedule`
        - Meaning: 15 seconds were spent waiting for threads to be scheduled onto a CPU (context switches).
        - Guess: threads are preempted by the OS → idle or blocked waiting → computation cannot proceed continuously.
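For reference, this is roughly what `-finstrument-functions` (used by the uftrace builds here) injects: the compiler emits calls to two hooks around every function, which is also why `__cyg_profile_func_enter` later shows up as overhead in the perf profiles. A minimal self-contained demo, not project code:

```c
/* Compile with: gcc -finstrument-functions demo.c
 * The compiler inserts calls to these hooks on every function entry/exit;
 * uftrace supplies its own hooks to record the call graph. */
#include <stdio.h>

__attribute__((no_instrument_function))   /* avoid instrumenting the hooks */
void __cyg_profile_func_enter(void *fn, void *call_site) {
    fprintf(stderr, "enter %p (from %p)\n", fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site) {
    fprintf(stderr, "exit  %p\n", fn);
}

static int square(int x) { return x * x; }

int main(void) {
    printf("%d\n", square(7));   /* enter/exit lines bracket this call */
    return 0;
}
```

Because the hooks run on every call, hot, tiny functions pay a large relative cost; that is a plausible reason both for the 12.63% `__cyg_profile_func_enter` share seen later and for preferring filtered uftrace runs.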
## Use perf to analyze T-MAC
- cmd: `sudo perf record -g ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 2 -p "hello" --temp 0`
- https://gist.github.com/alanhc/98aabd4b843536159316bb123f55dd87
- CPU time is spent almost entirely in OpenMP multi-threaded matrix computation (`libomp.so.5` and `libggml.so`)
- Core bottlenecks:
    - `ggml_compute_forward_mul_mat` (forward matrix multiplication)
    - `tbl_g4_int8_int32_update_impl<16, 2>` (likely the INT8 quantized matmul)
    - `TMACGeMMWrapper::llama_cpp_compute(...)` (the QGEMM kernel wrapper)
```
# this run: -99
llama_perf_sampler_print: sampling time =       0.13 ms / 12 runs   (0.01 ms per token, 90909.09 tokens per second)
llama_perf_context_print: load time        =    2227.85 ms
llama_perf_context_print: prompt eval time =     258.68 ms / 2 tokens  (129.34 ms per token, 7.73 tokens per second)
llama_perf_context_print: eval time        =    1326.53 ms / 9 runs    (147.39 ms per token, 6.78 tokens per second)
llama_perf_context_print: total time       =    1588.04 ms / 11 tokens

Performance counter stats for './3rdparty/llama.cpp/build/bin/llama-cli -m /home/alanhc/model/model.INT_N.gguf -n 10 -t 28 -p hi --temp 0':

   183,218,574,897      cpu_atom/cycles/                                        (94.41%)
   292,689,953,915      cpu_core/cycles/                                        (95.34%)
   105,868,429,772      cpu_atom/instructions/        # 0.58 insn per cycle     (94.41%)
   130,288,033,071      cpu_core/instructions/        # 0.45 insn per cycle     (95.34%)
           170,662      minor-faults
                 0      major-faults
    38,306,828,734      cpu_atom/dTLB-loads/                                    (94.41%)
    43,447,572,227      cpu_core/dTLB-loads/                                    (95.34%)
           595,512      cpu_atom/dTLB-load-misses/    # 0.00% of all dTLB cache accesses (94.41%)
         1,054,263      cpu_core/dTLB-load-misses/    # 0.00% of all dTLB cache accesses (95.34%)
   <not supported>      cpu_atom/iTLB-loads/
   <not supported>      cpu_core/iTLB-loads/
         1,740,032      cpu_atom/iTLB-load-misses/                              (94.41%)
            27,339      cpu_core/iTLB-load-misses/                              (95.34%)
           170,662      page-faults
       148,428,442      cpu_atom/cache-references/                              (94.41%)
       169,337,301      cpu_core/cache-references/                              (95.34%)
        77,247,617      cpu_atom/cache-misses/        # 52.04% of all cache refs (94.41%)
        78,828,727      cpu_core/cache-misses/        # 46.55% of all cache refs (95.34%)

       3.930637455 seconds time elapsed
      96.185629000 seconds user
       0.214916000 seconds sys
```
- cache-miss rate: 52% (atom), 46% (core) — very high; poor cache behavior with high L1/L2 miss rates
- IPC (atom/core): 0.58 / 0.45 — very low ➜ the CPU is limited by memory or cache
- iTLB misses (atom): 1.7M — instruction fetch is acceptable but slightly high
- Overall this run is memory-bound (cache-bound in particular) rather than CPU-bound.
- `sudo perf report`
![image](https://hackmd.io/_uploads/B1an1M2bgx.png)

| Symbol | Total % | Self % | Notes |
| ------------------------------------- | ------- | ------ | ----------------------------------------------- |
| `ggml_graph_compute_thread` | 91.59% | 43.35% | most compute concentrates in this worker (close to half of total time) |
| `ggml_compute_forward` | 47.54% | 0.00% | main forward flow (calls the various layers/blocks) |
| `ggml_compute_forward_mul_mat` | 47.25% | 7.17% | matmul core entry point |
| `ggml_tmac_mul_mat_task_compute` | 36.46% | 0.00% | the T-MAC matmul task wrapper |
| `TMACGeMMWrapper::llama_cpp_compute` | 36.46% | 0.00% | wrapper around the QInt kernel |
| `tbl_g4_int8_int32_update_impl<16,2>` | 34.39% | 19.26% | 🚨 **main hotspot!** the matmul kernel (likely a table-based INT8 kernel) |
| `qgemm_lut_t1_int8_*` | 12~17% | -- | LUT-based GEMM kernels under various parameters |

`./stackcollapse-perf.pl out.perf > out.folded`
`./flamegraph.pl out.folded > flamegraph.svg`
![image](https://hackmd.io/_uploads/r1ycEfnWgl.png)
### Another experiment
```
export OMP_PROC_BIND=true
export OMP_PLACES=cores
```
```
export CC=clang
export CXX=clang++
export INSTR_FLAGS="-fno-omit-frame-pointer -stdlib=libc++"

cmake .. \
  -DGGML_TMAC=ON \
  -DCMAKE_PREFIX_PATH=/home/alanhc/workspace/T-MAC/install/lib/cmake/t-mac \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_C_FLAGS="$INSTR_FLAGS" \
  -DCMAKE_CXX_FLAGS="$INSTR_FLAGS"

cmake --build . --target llama-cli -j$(nproc)
```
`sudo perf record -g ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 1 -p "hello" --temp 0`
`sudo perf report`
```
Samples: 5K of event 'cpu_core/cycles/P', Event count (approx.): 7699093253
  Children      Self  Command    Shared Object      Symbol
+   99.97%     0.00%  llama-cli  llama-cli          [.] _start
+   99.97%     0.00%  llama-cli  libc.so.6          [.] __libc_start_main@@GLIBC_2.34
+   99.97%     0.00%  llama-cli  libc.so.6          [.] __libc_start_call_main
+   99.95%     0.00%  llama-cli  llama-cli          [.] main
+   82.22%     0.00%  llama-cli  libllama.so        [.] llama_decode
+   82.09%     0.00%  llama-cli  libllama.so        [.] llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_threadpool*)
+   82.09%     0.00%  llama-cli  libggml.so         [.] ggml_backend_sched_graph_compute_async
+   82.09%     0.00%  llama-cli  libggml.so         [.] ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*)
+   81.94%     0.00%  llama-cli  libggml.so         [.] ggml_graph_compute
+   81.87%     0.07%  llama-cli  libggml.so         [.] ggml_graph_compute_thread
+   69.14%     0.31%  llama-cli  libggml.so         [.] ggml_compute_forward_mul_mat
+   67.97%     0.02%  llama-cli  libggml.so         [.] ggml_tmac_mul_mat_task_compute
+   67.95%     0.02%  llama-cli  libggml.so         [.] TMAC::TMACGeMMWrapper<float, 4>::llama_cpp_compute(void*, void*, void*, void*, void*, vo
+   67.50%    67.43%  llama-cli  libggml.so         [.] int tbl_g4_int8_int32_update_impl<16, 2>(int, int*, signed char*, unsigned char*)
+   30.18%     0.10%  llama-cli  libggml.so         [.] qgemm_lut_t1_int8_m640_k3200_n1_b2
+   24.09%     0.00%  llama-cli  llama-cli          [.] llama_init_from_gpt_params(gpt_params&)
+   22.52%     0.19%  llama-cli  libggml.so         [.] qgemm_lut_t1_int8_m128_k3200_n1_b2
+   15.16%     0.12%  llama-cli  libggml.so         [.] qgemm_lut_t1_int8_m256_k8640_n1_b2
+   11.46%     3.57%  llama-cli  libc.so.6          [.] __memset_avx2_unaligned_erms
+   10.68%    10.65%  llama-cli  libggml.so         [.] ggml_vec_dot_f16
+   10.08%     0.09%  llama-cli  [kernel.kallsyms]  [k] asm_exc_page_fault
+    7.71%     0.00%  llama-cli  [kernel.kallsyms]  [k] handle_mm_fault
+    7.58%     0.14%  llama-cli  [kernel.kallsyms]  [k] __handle_mm_fault
+    7.40%     0.02%  llama-cli  [kernel.kallsyms]  [k] handle_pte_fault
+    6.62%     0.02%  llama-cli  [kernel.kallsyms]  [k] exc_page_fault
+    6.48%     0.09%  llama-cli  [kernel.kallsyms]  [k] do_user_addr_fault
```
🔍 Key observations
1. The vast majority of CPU time sits in `llama_decode` → `ggml_graph_compute`
    - `llama_decode`: 82.22%
    - `ggml_graph_compute_thread` → `ggml_compute_forward_mul_mat` dominates execution
    - Key kernels:
        - `tbl_g4_int8_int32_update_impl<16, 2>`: 67.50% self time
        - `ggml_vec_dot_f16`: 10.65% self time
        - `__memset_avx2_unaligned_erms`: 3.57% self time
    - 👉 the current bottleneck is low-level matmul plus data initialization/movement.
2. Memory fault-handling overhead
    - `asm_exc_page_fault`, `handle_mm_fault`, `__handle_mm_fault` together approach 10%, hinting at:
        - possibly excessive memory allocation (heavy mmap or lazily loaded tensors)
        - frequent cache or TLB misses
```
export CC=clang
export CXX=clang++
# add the instrumentation and debug info that uftrace needs
export INSTR_FLAGS="-fno-omit-frame-pointer -finstrument-functions -g -O2 -stdlib=libc++"

cmake .. \
  -DGGML_TMAC=ON \
  -DCMAKE_PREFIX_PATH=/home/alanhc/workspace/T-MAC/install/lib/cmake/t-mac \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_C_FLAGS="$INSTR_FLAGS" \
  -DCMAKE_CXX_FLAGS="$INSTR_FLAGS"

cmake --build . --target llama-cli -j$(nproc)
```
`uftrace record -F 'ggml_compute_forward_mul_mat*' --no-libcall --depth 2 ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 2 -p "hello" -t 1 --temp 0`
- `uftrace report`
```
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
     1.020 m    4.359 ms         705  ggml_compute_forward_mul_mat
     1.015 m     1.015 m       36112  ggml_tmac_mul_mat_task_compute
     4.122 s   13.274 ms        2387  ggml_compute_forward_mul_mat_one_chunk
     4.108 s     4.107 s      312326  ggml_vec_dot_f16
  371.800 ms  371.774 ms         904  ggml_tmac_mul_mat_task_init
  263.510 ms  263.503 ms        2372  llamafile_sgemm
   21.626 ms   21.626 ms         614  linux:schedule (pre-empted)
    2.541 ms    2.541 ms        4995  ggml_fp32_to_fp16_row
  492.011 us  492.011 us        3092  ggml_is_contiguous
  189.770 us  189.770 us        2601  ggml_row_size
  167.919 us  167.919 us        8526  ggml_type_size
   91.019 us   91.019 us         653  ggml_tmac_can_mul_mat
   62.760 us   62.760 us        3438  ggml_n_dims
   40.840 us   40.840 us        2372  ggml_blck_size
   25.202 us   25.202 us         546  ggml_tmac_get_type_bits
   23.095 us   23.095 us         653  ggml_barrier
   11.440 us   11.440 us         107  ggml_is_numa
```

| Function | Role | Time (total) | Calls | Takeaway |
| ---------------------------------------- | ----------------------------------------------- | ------------- | ------- | ---------------------------- |
| `ggml_compute_forward_mul_mat` | high-level matmul entry | **1.020 min** | 705 | wrapper call at the entry layer |
| `ggml_tmac_mul_mat_task_compute` | the task doing the actual matmul | **1.015 min** | 36,112 | accounts for nearly all compute time |
| `ggml_compute_forward_mul_mat_one_chunk` | chunked matmul | 4.1 s | 2,387 | candidate for parallelization or smarter chunk splitting |
| `ggml_vec_dot_f16` | float16 dot product | 4.1 s | 312,326 | single dot products, in enormous volume |
| `llamafile_sgemm` | fallback float32 sgemm | 263 ms | 2,372 | likely the fallback (non-fastpath) cases |
| `linux:schedule` | context switches | 21.6 ms | 614 | some preemption; OpenMP threads may be getting preempted |
| others | `fp16` conversion, `ggml_barrier`, `ggml_row_size`, `ggml_type_size` | tiny, negligible | | |
## Try the FIFO scheduling policy
### How to check which scheduling policy is in use
- Find the pid: `ps aux | grep llama`
```
alanhc     39697  593  2.5 2500804 1693468 pts/4   Rl+  16:08   0:19 ./bin/llama-cli -m /home/alanhc/model/model.INT_N.gguf -n 10 -t 8 -p hello, how are you --temp 0
```
- Query it:
```
$ chrt -p 39958
pid 39958's current scheduling policy: SCHED_OTHER
pid 39958's current scheduling priority: 0
```
### Under CFS
- cmd: `./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 8 -p "hello, how are you" --temp 0`
```
llama_perf_sampler_print: sampling time =       0.10 ms / 16 runs   (0.01 ms per token, 156862.75 tokens per second)
llama_perf_context_print: load time        =    1027.12 ms
llama_perf_context_print: prompt eval time =    1980.98 ms / 6 tokens  (330.16 ms per token, 3.03 tokens per second)
llama_perf_context_print: eval time        =    3325.68 ms / 9 runs    (369.52 ms per token, 2.71 tokens per second)
llama_perf_context_print: total time       =    5308.03 ms / 15 tokens
```
### Under FIFO
#### Setup
- First check the current maximum, usually 99: `ulimit -r`
- If it is 0, do the following:
    1. `sudo nano /etc/security/limits.conf` and add the lines below (account alanhc as the example):
        ```
        alanhc - rtprio 99
        alanhc - nice -20
        ```
    2. Edit `sudo nano /etc/pam.d/common-session` and make sure it contains:
        ```
        session required pam_limits.so
        ```
    3. Reboot.
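Besides wrapping the command with `chrt -f`, a process can also request `SCHED_FIFO` for itself. A minimal sketch (illustrative only; it needs CAP_SYS_NICE or the rtprio limit configured above):

```c
/* Sketch: switch the calling process to SCHED_FIFO from code instead of
 * wrapping it with `chrt -f`. Run with sudo or with rtprio configured. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    struct sched_param sp = { .sched_priority = 5 };  /* 1..99 for SCHED_FIFO */

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {  /* 0 = this process */
        perror("sched_setscheduler");
        return 1;
    }
    if (sched_getparam(0, &sp) == 0)
        printf("now SCHED_FIFO, priority %d\n", sp.sched_priority);
    /* ... launch inference threads here; they inherit the policy ... */
    return 0;
}
```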
#### FIFO, priority 5
- `sudo chrt -f 5 ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 8 -p "hello, how are you" --temp 0`
```
llama_perf_sampler_print: sampling time =       0.19 ms / 16 runs   (0.01 ms per token, 84656.08 tokens per second)
llama_perf_context_print: load time        =    1641.62 ms
llama_perf_context_print: prompt eval time =    2330.08 ms / 6 tokens  (388.35 ms per token, 2.58 tokens per second)
llama_perf_context_print: eval time        =    3814.60 ms / 9 runs    (423.84 ms per token, 2.36 tokens per second)
llama_perf_context_print: total time       =    6155.38 ms / 15 tokens
```
#### FIFO, priority 20
- `sudo chrt -f 20 ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 8 -p "hello, how are you" --temp 0`
```
llama_perf_sampler_print: sampling time =       0.19 ms / 16 runs   (0.01 ms per token, 85561.50 tokens per second)
llama_perf_context_print: load time        =    1394.06 ms
llama_perf_context_print: prompt eval time =    2340.22 ms / 6 tokens  (390.04 ms per token, 2.56 tokens per second)
llama_perf_context_print: eval time        =    3830.23 ms / 9 runs    (425.58 ms per token, 2.35 tokens per second)
llama_perf_context_print: total time       =    6173.55 ms / 15 tokens
```
#### FIFO, priority 80
- `sudo chrt -f 80 ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 8 -p "hello, how are you" --temp 0`
```
llama_perf_sampler_print: sampling time =       0.19 ms / 16 runs   (0.01 ms per token, 84210.53 tokens per second)
llama_perf_context_print: load time        =    1388.09 ms
llama_perf_context_print: prompt eval time =    2329.73 ms / 6 tokens  (388.29 ms per token, 2.58 tokens per second)
llama_perf_context_print: eval time        =    3712.91 ms / 9 runs    (412.55 ms per token, 2.42 tokens per second)
llama_perf_context_print: total time       =    6045.73 ms / 15 tokens
```
#### FIFO, priority 99
- `sudo chrt -f 99 ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 8 -p "hello, how are you" --temp 0`
```
llama_perf_sampler_print: sampling time =       0.19 ms / 16 runs   (0.01 ms per token, 84656.08 tokens per second)
llama_perf_context_print: load time        =    1385.15 ms
llama_perf_context_print: prompt eval time =    2295.40 ms / 6 tokens  (382.57 ms per token, 2.61 tokens per second)
llama_perf_context_print: eval time        =    3825.41 ms / 9 runs    (425.05 ms per token, 2.35 tokens per second)
llama_perf_context_print: total time       =    6123.93 ms / 15 tokens
```
#### Additional reference runs (same CFS command)
`./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 8 -p "hello, how are you" --temp 0`
```
llama_perf_sampler_print: sampling time =       0.11 ms / 12 runs   (0.01 ms per token, 107142.86 tokens per second)
llama_perf_context_print: load time        =    1648.90 ms
llama_perf_context_print: prompt eval time =    1202.29 ms / 2 tokens  (601.14 ms per token, 1.66 tokens per second)
llama_perf_context_print: eval time        =    5723.40 ms / 9 runs    (635.93 ms per token, 1.57 tokens per second)
llama_perf_context_print: total time       =    6926.95 ms / 11 tokens
```
```
llama_perf_sampler_print: sampling time =       0.05 ms / 7 runs    (0.01 ms per token, 132075.47 tokens per second)
llama_perf_context_print: load time        =    1151.84 ms
llama_perf_context_print: prompt eval time =     678.12 ms / 2 tokens  (339.06 ms per token, 2.95 tokens per second)
llama_perf_context_print: eval time        =    1480.29 ms / 4 runs    (370.07 ms per token, 2.70 tokens per second)
llama_perf_context_print: total time       =    2159.31 ms / 6 tokens

real    0m3.410s
user    0m18.802s
sys     0m0.309s
```
```
llama_perf_sampler_print: sampling time =       0.09 ms / 7 runs    (0.01 ms per token, 74468.09 tokens per second)
llama_perf_context_print: load time        =    1410.44 ms
llama_perf_context_print: prompt eval time =     760.87 ms / 2 tokens  (380.43 ms per token, 2.63 tokens per second)
llama_perf_context_print: eval time        =    1690.04 ms / 4 runs    (422.51 ms per token, 2.37 tokens per second)
llama_perf_context_print: total time       =    2453.07 ms / 6 tokens

22.15user 0.40system 0:04.01elapsed 562%CPU (0avgtext+0avgdata 1675264maxresident)k
0inputs+0outputs (0major+185871minor)pagefaults 0swaps
```
#### Comparing CFS and FIFO
- Key metrics

| Policy | Load time (ms) | Prompt eval time / tok | Eval time / tok | Total time (ms) |
| ----------- | -------------- | ---------------------- | --------------- | --------------- |
| **CFS** | 1027.12 | 330.16 ms | 369.52 ms | 5308.03 |
| **FIFO 5** | 1641.62 | 388.35 ms | 423.84 ms | 6155.38 |
| **FIFO 20** | 1394.06 | 390.04 ms | 425.58 ms | 6173.55 |
| **FIFO 80** | 1388.09 | 388.29 ms | 412.55 ms | 6045.73 |
| **FIFO 99** | 1385.15 | 382.57 ms | 425.05 ms | 6123.93 |

- Conclusions
    - CFS is actually the fastest: in this environment it gives the shortest load time, eval time, and total time.
    - FIFO performance varies little with the priority value.
- Hypotheses
    - llama-cli uses OpenMP multithreading by default, while FIFO is per-thread scheduling, so some threads may end up starved under FIFO.
    - FIFO has no time slices, making thread starvation and unbalanced thread progress likely.
    - CFS dynamically adjusts for fairness, which lets OpenMP spread work more evenly across cores.
## Disable GGML_USE_OPENMP so llama.cpp's own multi-threading handles matrix computation, avoiding the nondeterminism of GCC/LLVM's OpenMP
:::success
$\to$ TODO: disable `GGML_USE_OPENMP` and let `llama.cpp`'s own multi-threading design run the matrix computation, avoiding the nondeterminism introduced by GCC/LLVM's OpenMP.
> Suppress OpenMP and use GGML's [thread pool](https://github.com/ggml-org/llama.cpp/pull/8672); this is the prerequisite for the later improvements.
:::
- First set `export GGML_N_THREADS=8`.

### Experiment: OpenMP disabled, 8 threads
```
rm -rf "$BUILD_DIR"
mkdir -p "$BUILD_DIR"
cd "$BUILD_DIR"

ZSTD_SO=$(ldconfig -p | grep libzstd.so | head -n1 | awk '{print $4}')
```
```
cmake -G Ninja .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_FLAGS="-DGGML_USE_OPENMP=0" \
  -DCMAKE_CXX_FLAGS="-DGGML_USE_OPENMP=0" \
  -DUSE_LLVM=llvm-config \
  -DUSE_GRAPH_EXECUTOR=ON \
  -DUSE_AUTO_SCHEDULER=ON \
  -DZSTD_LIBRARY="$ZSTD_SO" \
  -DZSTD_INCLUDE_DIR=/usr/include
```
#### CFS
- cmd: `./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 8 -p "hello, how are you" --temp 0`
- result
```
llama_perf_sampler_print: sampling time =       0.10 ms / 16 runs   (0.01 ms per token, 152380.95 tokens per second)
llama_perf_context_print: load time        =     291.67 ms
llama_perf_context_print: prompt eval time =      95.98 ms / 6 tokens  (16.00 ms per token, 62.51 tokens per second)
llama_perf_context_print: eval time        =     261.51 ms / 9 runs    (29.06 ms per token, 34.42 tokens per second)
llama_perf_context_print: total time       =     358.19 ms / 15 tokens
```
#### FIFO
- `sudo chrt -f 99 ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 8 -p "hello, how are you" --temp 0`
- result
```
llama_perf_sampler_print: sampling time =       0.18 ms / 16 runs   (0.01 ms per token, 87912.09 tokens per second)
llama_perf_context_print: load time        =     460.26 ms
llama_perf_context_print: prompt eval time =     249.23 ms / 6 tokens  (41.54 ms per token, 24.07 tokens per second)
llama_perf_context_print: eval time        =     567.86 ms / 9 runs    (63.10 ms per token, 15.85 tokens per second)
llama_perf_context_print: total time       =     818.15 ms / 15 tokens
```
#### Analysis

| Metric | CFS | FIFO |
| ---------------------------- | ----------- | ----------- |
| **tokens/sec (eval)** | `34.42` | `15.85` |
| **eval time (per token)** | `29.06 ms` | `63.10 ms` |
| **prompt eval time / token** | `16.00 ms` | `41.54 ms` |
| **total time** | `358.19 ms` | `818.15 ms` |

- CFS performs better
    - GGML's thread pool is a cooperative thread pool (pthread + semaphore); it relies on the CFS scheduler for thread yielding and fair scheduling, which suits multi-threaded, non-realtime parallel workloads.
    - FIFO realtime scheduling lets the main thread (or certain threads) hold the CPU for long stretches without yielding → other threads cannot be scheduled in time → parallel efficiency drops.
    - Linux therefore cannot help via kernel-level scheduling optimizations; FIFO breaks this fine-grained cooperation.
## Improve llama.cpp's internal scheduling: see [ggml: Implement yield barrier using futex for improved thread scheduling efficiency](https://github.com/ggml-org/llama.cpp/pull/13079)
:::success
$\to$ TODO: improve `llama.cpp`'s internal scheduling; see [ggml: Implement yield barrier using futex for improved thread scheduling efficiency](https://github.com/ggml-org/llama.cpp/pull/13079)
> Background: [Implementing a lightweight mutex lock](https://hackmd.io/@sysprog/concurrency/%2F%40sysprog%2Fconcurrency-mutex) and [Building a POSIX-thread-compatible implementation](https://hackmd.io/@sysprog/concurrency/%2F%40sysprog%2Fconcurrency-thread-package)
:::
- [ggml: Implement yield barrier using futex for improved thread scheduling efficiency](https://github.com/ggml-org/llama.cpp/pull/13079)
    - SongXiaoXi: futex-based yield barriers versus traditional spin barriers
        - yielding improves system scalability and efficiency, but adds context-switching cost on lighter workloads
    - slaren: the current performance loss in the generation phase is too large
    - SongXiaoXi:
        - the do_spin() implementation
        - counter and throttled counter
        - tuning for hybrid cores
- Purpose: when inferring on a multi-core CPU, ggml accelerates computation with multiple threads, and those threads need to synchronize.
    - Traditional approach (spin barrier): every thread busy-waits until all threads arrive before continuing, which wastes CPU.
    - Improved approach (yield barrier): if a thread finds that the others have not arrived yet, it calls futex() to block itself until another thread wakes it up, reducing CPU usage.
    - do_spin(): the fallback behavior — a thread spins briefly first (a short busy-wait); if the other threads still have not shown up, it calls futex_wait().
    - Counter and throttled counter
        - Counter: tracks how many threads have reached the synchronization point.
        - Throttled counter: likely avoids entering sleep too early, e.g. starting futex_wait only once most threads have already arrived.
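To make the PR's idea concrete, here is a toy spin-then-futex barrier in C — my own simplified reading of the yield-barrier concept, not the PR's code; `SPIN_LIMIT` and the generation scheme are illustrative choices:

```c
/* Toy spin-then-futex barrier: each thread spins on a generation word for a
 * bounded number of iterations, then sleeps in futex(); the last arriver
 * bumps the generation and wakes the sleepers.
 * Compile with: gcc -O2 -pthread futex_barrier.c */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_LIMIT 4096
#define N_THREADS  8

static atomic_int  arrived;      /* threads that reached the barrier */
static atomic_uint generation;   /* bumped once per completed barrier */

static long futex(atomic_uint *addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void barrier_wait(void) {
    uint32_t gen = atomic_load(&generation);

    if (atomic_fetch_add(&arrived, 1) + 1 == N_THREADS) {
        /* last arriver: reset, open the next generation, wake everyone */
        atomic_store(&arrived, 0);
        atomic_fetch_add(&generation, 1);
        futex(&generation, FUTEX_WAKE_PRIVATE, INT32_MAX);
        return;
    }
    for (int i = 0; i < SPIN_LIMIT; i++)      /* cheap path: short spin */
        if (atomic_load(&generation) != gen)
            return;
    while (atomic_load(&generation) == gen)   /* slow path: sleep in kernel */
        futex(&generation, FUTEX_WAIT_PRIVATE, gen);
}

static void *worker(void *arg) {
    (void)arg;
    for (int step = 0; step < 3; step++)
        barrier_wait();                       /* sync between graph steps */
    return NULL;
}

int main(void) {
    pthread_t t[N_THREADS];
    for (int i = 0; i < N_THREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++) pthread_join(t[i], NULL);
    puts("3 barriers passed");
    return 0;
}
```

The spin phase keeps the heavy-load case fast (threads arrive almost together), while the futex phase stops light-load threads from burning whole cores — exactly the trade-off SongXiaoXi and slaren debate in the PR.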
### Single-thread runs
#### CFS
`./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 1 -p "hello, how are you" --temp 0`
- Result:
```
llama_perf_sampler_print: sampling time =       0.10 ms / 16 runs   (0.01 ms per token, 156862.75 tokens per second)
llama_perf_context_print: load time        =    4711.76 ms
llama_perf_context_print: prompt eval time =   12384.73 ms / 6 tokens  (2064.12 ms per token, 0.48 tokens per second)
llama_perf_context_print: eval time        =   20465.06 ms / 9 runs    (2273.90 ms per token, 0.44 tokens per second)
llama_perf_context_print: total time       =   32851.03 ms / 15 tokens
```
#### FIFO
`sudo chrt -f 99 ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 1 -p "hello, how are you" --temp 0`
- Result:
```
llama_perf_sampler_print: sampling time =       0.19 ms / 16 runs   (0.01 ms per token, 84210.53 tokens per second)
llama_perf_context_print: load time        =    6111.02 ms
llama_perf_context_print: prompt eval time =   16312.76 ms / 6 tokens  (2718.79 ms per token, 0.37 tokens per second)
llama_perf_context_print: eval time        =   26601.43 ms / 9 runs    (2955.71 ms per token, 0.34 tokens per second)
llama_perf_context_print: total time       =   42917.32 ms / 15 tokens
```
#### FIFO + taskset
`sudo taskset -c 2 chrt -f 99 ./bin/llama-cli -m ~/model/model.INT_N.gguf -n 10 -t 1 -p "hello" --temp 0`
```
llama_perf_sampler_print: sampling time =       0.10 ms / 12 runs   (0.01 ms per token, 117647.06 tokens per second)
llama_perf_context_print: load time        =    5024.82 ms
llama_perf_context_print: prompt eval time =    4679.66 ms / 2 tokens  (2339.83 ms per token, 0.43 tokens per second)
llama_perf_context_print: eval time        =   22225.47 ms / 9 runs    (2469.50 ms per token, 0.40 tokens per second)
llama_perf_context_print: total time       =   26906.25 ms / 11 tokens
```
### Conclusion

| Metric | FIFO single thread (taskset) | CFS single thread |
| -------------------- | ----------------------- | ---------------- |
| **total time** | ~26.9 s | ~32.9 s |
| **tokens/sec** | 0.41 | 0.44 |
| **CPU utilization** | high and stable (nearly 100%) | high, but with scheduler intervention |
| **context switches** | nearly 0 (FIFO has no preemption) | more, because of time slices |
| **estimated energy behavior** | ✅ stable, sustained CPU occupancy | ⚠ more switching and idle-spin overhead |

- FIFO advantages
    - taskset guarantees cache locality; no core migration.
    - No CFS time-slice preemption → less scheduler overhead.
- CFS disadvantages
    - More context switches (repeated kernel/user transitions).
    - The CPU may go idle, then re-wake into the active state, causing power spikes.

## Evaluate ik_llama.cpp performance
:::success
TODO: evaluate [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) performance; see [Updated BitNet arch bitnet-b1.58](https://github.com/ikawrakow/ik_llama.cpp/issues/365)

---

llama-server appears to have a performance issue: [Research: performance divergence #476](https://github.com/ikawrakow/ik_llama.cpp/issues/476)
CPU processing speed may degrade on large contexts: [Feature Request: Improve CPU processing speed for large contexts #26](https://github.com/ikawrakow/ik_llama.cpp/issues/26), [CPU prompt processing speed for large contexts #25](https://github.com/ikawrakow/ik_llama.cpp/discussions/25)
:::
- [Notes](https://hackmd.io/@alanhc/BkyjoTHmll)
- ![image](https://hackmd.io/_uploads/SyvEolvQxx.png)
- The matmul-related part (purple): ![image](https://hackmd.io/_uploads/SkjF2WvXeg.png)
- [Reddit discussion of ik_llama.cpp](https://www.reddit.com/r/LocalLLaMA/comments/1keoint/llama_gotta_go_fast_both_ik_and_mainline_llamacpp/)
- `./build/bin/llama-quantize --allow-requantize ~/model/ggml-model-i2_s.gguf ggml-model-i2_s_bn.gguf iq2_bn`
- `./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`

`sudo perf record -g ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`
![image](https://hackmd.io/_uploads/B1v3oV6fxl.png)
![image](https://hackmd.io/_uploads/Bk9J2V6Ggl.png)

- Key symbols

| Function / symbol | CPU share | Notes |
| ------------------------------ | ------ | ------------------------------------------------------------ |
| `ggml_compute_forward_mul_mat` | 89.23% | the lowest-level forward matmul in the main inference path; nearly all CPU time lands here |
| `iqk_mul_mat_4d` | 43.44% | GGML's quantized matmul logic for 4D tensors |
| `mul_mat_iq2bn_q8_K64<1>()` | 8.14% | the kernel actually performing the bit-packed int4 × int8 matmul, typically with SIMD or blocking |
| `__cyg_profile_func_enter` | 12.63% | profiling overhead, not a compute hotspot, but not negligible; consider removing the instrumentation |
| `libggml.so` | 88%+ | nearly every hotspot comes from this shared object; the bottleneck is entirely **GGML's quantized compute** |

- Flame graph ![image](https://hackmd.io/_uploads/B1jO2E6Mxl.png)
    - The overwhelming majority of compute time is inside `ggml_compute_forward_mul_mat()`, the matmul forward pass — the core bottleneck of LLM inference.
- `sudo perf report`: {%gist 5b730dcc52018bb59d53e49376c38b12 %}

```
sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0
```
{%gist 9d2270a1e72dacd7300350a61f9f71e1 %}

- Cycles and instructions
    - IPC > 1 indicates good instruction-level parallelism.
    - core is slightly more efficient than atom, as expected (Atom is the low-power core).
- dTLB (data TLB) accesses and misses
    - The miss rate is very low (< 0.01%): memory regions are reasonably clustered, TLB hit rate is good, no obvious paging problem.
- iTLB (instruction TLB) misses
    - Atom misses far more than core, suggesting worse instruction locality on the Atom cores, or more small functions crossing page boundaries.
- Page faults
    - minor-faults: pages not yet mapped, but the system already has them available (e.g. anonymous mmap).
    - major-faults = 0: no disk I/O caused, i.e. the model is preloaded and need not come from disk.
- Cache accesses
    - The miss ratio is on the high side, suggesting:
        - the inference memory footprint is too large (does not fully fit in cache)
        - weights/data are not laid out cache-friendly (e.g. poor matrix ordering)
    - Worth optimizing the memory layout of the `ggml_compute_forward_*()` compute paths.

| Metric | Status | Suggestion |
| ---------- | -------- | --------------------------- |
| IPC | high (1.67+) | good execution efficiency; CPU resources are well used |
| TLB miss | low | memory paging behaves well |
| Cache miss | high | consider re-arranging tensors or cache-blocking optimizations |
| Page fault | no major faults | model already resident in memory |
| CPU utilization | very high | threading is effective, but also implies contention |

- `sudo uftrace record -F ggml_compute_forward_mul_mat ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`
  https://gist.github.com/alanhc/b4a389ce56d16e0e5e9f29d63d08b5f8
{%gist b4a389ce56d16e0e5e9f29d63d08b5f8 %}
- Hotspots

| Function | Total time | Self time | Calls | Notes |
| ------------------------------ | ----------- | ----------- | ------- | ------------------------------------ |
| `ggml_compute_forward_mul_mat` | **4.865 s** | 179 ms | 71,544 | matmul main entry (overall hotspot) 🔥 |
| `iqk_mul_mat_4d` | 4.271 s | 13 ms | 127,248 | IQ-quantized matmul for 4D tensors |
| `iqk_mul_mat` | 4.257 s | 20 ms | 190,608 | the expanded quantized multiply |
| `mul_mat_NxM` | 4.211 s | 10 ms | 134,904 | templated multiply logic (wrapper) |
| `mul_mat_iq2bn_q8_K64` | **2.786 s** | **2.784 s** | 55,440 | the core kernel! the actual int4 × int8 multiply 🔥 |
| `linux:schedule` | 339 ms | 339 ms | 91,701 | OS thread scheduling (a lot of context switching) ⚠️ |
| `mul_mat_qY_K_q8_K_T` | 1.332 s | 1.332 s | 264 | similar kernel, likely for other layers or configurations |

- Annotate `mul_mat_iq2bn_q8_K64`: ![image](https://hackmd.io/_uploads/SJI3vTafge.png)
- `linux:schedule`: the program clearly does a lot of thread context switching — https://gist.github.com/alanhc/3fcf8cf1bc61de173cf43bc52d5ccbcc
{%gist 3fcf8cf1bc61de173cf43bc52d5ccbcc %}
- Inspect how the hotspot executes: `uftrace replay -F mul_mat_iq2bn_q8_K64`
- Generate a flame graph: `uftrace dump -F -S > folded-stacks.txt`, then `flamegraph.pl folded-stacks.txt > llama-flamegraph.svg`

## Evaluate huge-page benefits
Huge-page-related discussions in the `llama.cpp` project (note: BitNet uses a [fork of llama.cpp](https://github.com/Eddie-Wang1120/llama.cpp/tree/merge-dev)):
* [Huge Page Support](https://github.com/ggml-org/llama.cpp/issues/2251)
* [Llama.cpp patch for using static hugepages](https://www.reddit.com/r/VFIO/comments/1fop300/llamacpp_patch_for_using_static_hugepages/)
* [allow mmap to take advantage of hugepage feature](https://github.com/ggml-org/llama.cpp/issues/12444)
* [support hugepage feature of pagesize 2M or 1G](https://github.com/ggml-org/llama.cpp/pull/12552)
:::warning
T-MAC appears to use this revision: https://github.com/kaleid-liner/llama.cpp/tree/eb07ecf0172230d58fff5d23a3fd6feebda35065?
:::
:::success
TODO: evaluate the benefit of the changes above.

The current `llama.cpp` relies on the mmap system call. It is a general way to cut model-load time, but with SSD-class storage there is still room for improvement. How `llama.cpp` uses `mmap`:
1. File → address space: at startup, `llama.cpp` maps the entire model file into its virtual address space with `mmap()`.
2. Demand paging: the first time the program touches a page (e.g. reading some layer's weights), the kernel takes a page fault and loads that 4 KiB page from disk into the OS page cache and memory.
3. Page-cache buffering: all pages live in Linux's page cache, so subsequent accesses hit memory (unless evicted) and the CPU caches directly.
4. Pinning in memory: with `--mlock`, after the mapping completes (or after the first page fault, or prefetched up-front with `MAP_POPULATE`), `mlock()` is called to pin those pages in memory and keep them from being swapped out.

Why consider replacing `mmap` with `io_uring`? The motivation is neither SSD write endurance nor bypassing the memory → CPU-cache path. The real advantage is model-load throughput:
1. Lower system-call overhead: with `io_uring`, hundreds or thousands of read requests go into a single SQE batch, instead of each page fault taking its own `read()`-like path.
    * Batched I/O reduces context switches and interrupt counts.
2. Asynchronous, parallel I/O: issue many reads to a fast NVMe device simultaneously and handle completions asynchronously.
    * Far more likely to saturate the storage device than serialized page faults or `MAP_POPULATE`.
3. Optional page-cache bypass: if the load phase reads all data exactly once, `O_DIRECT` (or `IORING_OP_READ` + `IOSQE_BUFFER_SELECT` in `io_uring`) can stream data straight into application buffers, avoiding double buffering in the page cache.

- [I/O model evolution: Linux's io_uring](https://hackmd.io/@sysprog/iouring)

When `io_uring` fits:
* Large sequential loads on very fast SSDs, where batched, fully asynchronous I/O can reach peak throughput.
* When memory use must be controlled: bypassing the page cache leaves more RAM for other work.

When it does not:
* Small random accesses during inference: `mmap`'s demand paging is already well tuned for random reads.
* A pinned working set: once `mlock` locks the pages in memory, `mmap` + page-cache zero-copy access is efficient enough.

Further reading:
* [Load LLaMA Models Instantly](https://news.ycombinator.com/item?id=35199418)
* [OS-Level Challenges in LLM Inference and Optimizations](https://eunomia.dev/blog/2025/02/18/os-level-challenges-in-llm-inference-and-optimizations/)
:::
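To give the batched-load idea a concrete shape, here is a minimal liburing sketch (assumes `liburing-dev` is installed and the program links with `-luring`; the file name and sizes are placeholders, and a real loader would also handle short reads and `O_DIRECT` alignment):

```c
/* Sketch: batch many reads of a model file into one io_uring submission,
 * instead of paying one page fault per 4 KiB as the mmap path does. */
#include <fcntl.h>
#include <liburing.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* 1 MiB per read */
#define BATCH 64          /* SQEs submitted in a single syscall */

int main(int argc, char **argv) {
    int fd = open(argc > 1 ? argv[1] : "model.gguf", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(BATCH, &ring, 0) < 0) return 1;

    char *buf = malloc((size_t)BATCH * CHUNK);
    if (!buf) return 1;

    /* queue BATCH reads at consecutive offsets, then submit them all at once */
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf + (size_t)i * CHUNK, CHUNK,
                           (uint64_t)i * CHUNK);
    }
    io_uring_submit(&ring);

    /* reap completions; out-of-order completion is fine for a bulk load */
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0) fprintf(stderr, "read failed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    printf("loaded %d MiB\n", BATCH);
    return 0;
}
```

One submission here replaces up to 16,384 demand-paging faults for the same 64 MiB, which is the throughput argument made in the admonition above.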
:::info
wip
:::

```mermaid
flowchart TD
    subgraph Kernel Space
        K1[🔧 mmap registers the VMA]
        K2[📍 page fault handler]
        K3[📦 page cache buffer]
        K4[🔁 io_uring SQ/CQ management]
    end

    subgraph RAM
        R1["🧠 page cache (OS managed)"]
        R2["📤 User buffer (mapped)"]
        R3["🔒 Locked page (mlock)"]
    end

    subgraph User Space
        U1[📁 llama.cpp startup]
        U2[📖 access weights → vaddr]
        U3["🔄 submit many reads via io_uring"]
        U4["📬 receive read data in buffer[]"]
    end

    %% mmap path
    U1 --> K1 --> U2 --> K2 --> K3 --> R1 --> R2
    R2 --> U2

    %% io_uring path
    U3 --> K4 --> R1
    K4 --> R2 --> U4

    %% mlock
    R2 --> R3

    %% styles
    style K1 fill:#e6f7ff,stroke:#3399cc
    style K2 fill:#ffe6e6,stroke:#cc3333
    style K3 fill:#fff2cc,stroke:#cc9900
    style K4 fill:#ddeeff,stroke:#3399cc
    style R1 fill:#f9ffe6,stroke:#ccff33
    style R2 fill:#ccffee,stroke:#33cc99
    style R3 fill:#ccffcc,stroke:#66cc66
    style U1 fill:#f0f0f0,stroke:#888
    style U2 fill:#f0f0f0,stroke:#888
    style U3 fill:#f0f0f0,stroke:#888
    style U4 fill:#f0f0f0,stroke:#888
```

## Visual overview
:::info
wip
:::

```mermaid
flowchart TD
    subgraph "🧠 NUMA Memory (User Space RAM)"
        B[📂 mmap-ed Model Memory]
    end

    subgraph "🧍‍♂️ User Space"
        G["🧩 ggml_load_model_from_file()"]
        H["📊 ggml_compute_forward()"]
        I[⚙️ SIMD / matmul / attention kernel]
    end

    subgraph "🧷 Kernel Space"
        D["🧠 mmap() → VMA creation"]
        E["📬 io_uring_submit()"]
        F[📤 Completion Ring Polling]
    end

    subgraph "💽 I/O Subsystem (Disk, SSD/NVMe)"
        J[📄 GGUF Model File]
    end

    subgraph "🧠 NUMA Node 0"
        A[CPU Core + Local Memory]
    end

    subgraph "🧠 NUMA Node 1"
        A1[Other CPU Core + Remote Memory]
    end

    %% Data flow
    J -->|model data read via io_uring| F
    F -->|written into page frames| B
    A -->|allocate| B
    A1 -->|fallback| B
    B --> D
    D --> E
    E --> F
    B --> G
    G --> H
    H --> I

    %% Style
    style A fill:#d1f4ff,stroke:#00a3cc
    style A1 fill:#f0f8ff,stroke:#0088cc
    style B fill:#fff6e6,stroke:#ffcc00
    style D fill:#e6f7ff,stroke:#3399ff
    style E fill:#e6f7ff,stroke:#3399ff
    style F fill:#e6f7ff,stroke:#3399ff
    style G fill:#e6ffe6,stroke:#00cc66
    style H fill:#e6ffe6,stroke:#00cc66
    style I fill:#f0fff0,stroke:#00b300
    style J fill:#fff0f0,stroke:#ff6666
```
