ik_llama notes
===
- `sudo uftrace record -F 'mul_mat*' ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`
- https://deepwiki.com/ikawrakow/ik_llama.cpp/
- https://chatgpt.com/c/683c74f9-8b0c-8002-b8fc-ec4dd55f2adc
- quantize: `./build/bin/llama-quantize --allow-requantize ~/model/ggml-model-i2_s.gguf ggml-model-i2_s_bn.gguf iq2_bn`
- Run: `./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`
- `sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`

https://gist.github.com/alanhc/c0a246fc1660ff29b3ed31fc67672dda
{%gist c0a246fc1660ff29b3ed31fc67672dda %}
- The load is concentrated on the P-cores, and IPC > 1, so performance is good.
| Metric | Notes |
| ------------------------ | -------------------------------------- |
| `minor-faults` = 107,553 | Minor page faults (e.g. pages not yet mapped, but no disk read required). |
| `major-faults` = 0 | No page faults caused disk I/O; memory residency is healthy. |
🟩 TLB behavior is good and does not noticeably slow things down.
| Metric | Notes |
| ---------------------------------------------------- | --------------------------------------------------- |
| `cache-references`: Atom 373M, Core 532M | Heavy cache traffic. |
| `cache-misses`: Atom 283M → 75.85%; Core 435M → 81.84% | ❗ High cache miss rate: poor cache locality, with frequent trips back to main memory. |
⚠️ This is likely one of the performance bottlenecks.
✅ Key takeaways
- Work is concentrated on the P-cores (90% of cycles and instructions land on the big cores), with good IPC.
- The cache miss rate is high (over 75%), which can add latency via frequent DRAM accesses.
- TLB misses are minimal, so paging and virtual-memory management are healthy.
- Almost no page faults and zero major faults: model data and code are fully resident in RAM.
- CPU time exceeds elapsed time, so the system is running inference in parallel across multiple cores.
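These checks can be sketched as a small Python helper (my own sketch; the thresholds, IPC of at least 1 counting as good and a cache-miss rate above 75% counting as a warning, come from this write-up rather than any universal rule):

```python
def summarize_counters(c):
    """Rough health checks mirroring the heuristics in these notes.

    `c` is a dict of raw perf counters for one core type."""
    findings = []
    ipc = c["instructions"] / c["cycles"]
    findings.append(f"IPC {ipc:.2f} ({'good' if ipc >= 1 else 'low'})")
    if c["major_faults"] == 0:
        findings.append("no major faults: working set resident in RAM")
    miss_rate = c["cache_misses"] / c["cache_references"]
    if miss_rate > 0.75:
        findings.append(f"cache miss rate {miss_rate:.0%}: poor locality")
    return findings

# Example with the P-core counters from the -t 1 run later in these notes:
print(summarize_counters({
    "cycles": 11_298_828_953, "instructions": 24_495_335_248,
    "major_faults": 0,
    "cache_references": 467_954_250, "cache_misses": 393_355_294,
}))
```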
- `sudo perf record -g ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`


`sudo perf record -F 999 -g ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`
- Flamegraph

- Call graph (tree view)
```
sudo apt install graphviz
sudo perf record -g -- ./llama-cli ...
sudo perf report --call-graph graph
# stackcollapse-perf.pl and flamegraph.pl come from Brendan Gregg's
# FlameGraph repo and must be on $PATH
sudo perf script | stackcollapse-perf.pl | flamegraph.pl --hash > out.svg
```

| Rank | % | Function | Module | Space | Notes |
| -- | ------ | --------------------------------------- | ------------------- | ------ | ------------------------------------------------------ |
| 1 | 84.09% | `ggml_compute_forward_mul_mat` | libggml.so | User | Core of every matrix operation in the LLM; drives the transformer's linear/attention/MLP layers |
| 2 | 81.24% | `iqk_mul_mat_4d` | libggml.so | User | Splits the 4-D matmul into SIMD-friendly blocks; used for quantized models |
| 3 | 81.17% | `iqk_mul_mat` | libggml.so | User | The implementation behind `*_4d`; performs the actual dot products |
| 4 | 51.82% | `mul_mat_iq2bn_q8_K64<1>` | libggml.so | User | Core kernel for 8-bit block-quantized matmul, Q8\_K variant |
| 5 | 27.41% | `mul_mat_qY_K_q8_K_T<DequantizerQ6K>` | libggml.so | User | Handles the Q6\_K quantization format, including dequantization logic |
| 6 | 7.28% | `entry_SYSCALL_64_after_hwframe` | kernel.kallsyms | Kernel | syscall entry point |
| 7 | 7.28% | `do_syscall_64` | kernel.kallsyms | Kernel | syscall dispatch logic |
| 8 | 6.73% | `x64_sys_futex` | kernel.kallsyms | Kernel | futex lock operations: threads waiting on each other |
| 9 | 5.88% | `do_futex` | kernel.kallsyms | Kernel | core futex handling |
| 10 | 5.77% | `libgomp.so+0x...` | libgomp.so | User | OpenMP thread management |
| 11 | 5.16% | `ggml_graph_compute_thread.constprop.*` | libggml.so | User | Runs multiple graph nodes on the thread pool |
| 12 | 4.91% | `futex_wait` | kernel.kallsyms | Kernel | futex wait-queue entry point |
| 13 | 4.72% | `__futex_wait` | kernel.kallsyms | Kernel | futex wait implementation |
| 14 | 3.27% | `futex_wait_queue` | kernel.kallsyms | Kernel | enqueue onto the wait queue |
| 15 | 2.93% | `schedule` | kernel.kallsyms | Kernel | Linux context-switch scheduling |
| 16 | 2.79% | `__schedule` | kernel.kallsyms | Kernel | lower-level scheduling logic |
| 17 | 2.08% | `__memset_avx2_unaligned_erms` | libc.so.6 | User | memory zeroing, likely from malloc/new |
| 18 | 1.93% | `_start` / `main` / `__libc_start_main` | llama-cli / libc.so | User | program startup path |
| 19 | 1.88% | `llama_init_from_gpt_params()` | llama-cli | User | llama model initialization (tokenizer load, ggml graph build, etc.) |
| 20 | 1.73% | `mul_mat_iq2bn_q8_K64<5>` | libggml.so | User | Same kernel as <1> with different parameters (different thread id or workload slice) |
```
# 1. Wipe the old build & cache
rm -rf build/ CMakeCache.txt CMakeFiles/
# 2. Regenerate the build with all OpenMP-related options disabled
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_OPENMP=OFF \
-DGGML_OPENMP=OFF \
-DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE \
-DCMAKE_C_FLAGS_RELEASE="-O3 -fno-openmp -DGGML_NO_OPENMP" \
-DCMAKE_CXX_FLAGS_RELEASE="-O3 -fno-openmp -DGGML_NO_OPENMP" \
.
# 3. Build
cmake --build build -j$(nproc)
# 4. Verify (both commands below must produce no output)
ldd build/bin/llama-cli | grep gomp
grep -R "fopenmp" build
```

| % (Children) | Function | Notes |
| ------------------- | ---------------------------------------------------- | -------------------------------------- |
| **54.9 %** | `ggml_compute_forward_mul_mat` → `iqk_mul_mat_*` | Quantized matmul core; still the dominant bottleneck |
| **34.6 %** | `ggml_barrier` | ggml thread-pool spin barrier (busy-loop synchronization) |
| **30.6 % / 23.1 %** | `mul_mat_iq2bn_q8_K64<1>` / `mul_mat_qY_K_q8_K_T<…>` | Main int8 block-quant matmul kernels |
| rest (< 2 %) | page-fault / memset / vocab load | initialization and memory housekeeping |
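`ggml_barrier` ranking this high is typical of spin barriers: waiting threads burn cycles in a busy loop instead of sleeping, so the wait time is charged as user CPU time. A minimal Python sketch of the idea (my own illustration, not ggml's actual implementation, which is in C):

```python
import threading

class SpinBarrier:
    """Minimal spin-barrier sketch: the last thread to arrive releases
    the rest; everyone else busy-waits, consuming CPU the whole time."""
    def __init__(self, n_threads):
        self.n = n_threads
        self.count = 0
        self.generation = 0
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            gen = self.generation
            self.count += 1
            if self.count == self.n:   # last arrival: open the barrier
                self.count = 0
                self.generation += 1
                return
        while self.generation == gen:  # busy loop: shows up as user CPU time
            pass

# Usage: four workers all reach the barrier before any proceeds.
barrier, done = SpinBarrier(4), []
threads = [threading.Thread(target=lambda: (barrier.wait(), done.append(1)))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(done))  # 4
```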
```
# Clean out the old build and cache first
rm -rf build/ CMakeCache.txt CMakeFiles/
# Regenerate the build system
cmake -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_OPENMP=OFF \
-DGGML_OPENMP=OFF \
-DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE \
-DCMAKE_C_FLAGS_RELEASE="-O3 -fno-openmp -DGGML_NO_OPENMP" \
-DCMAKE_CXX_FLAGS_RELEASE="-O3 -fno-openmp -DGGML_NO_OPENMP" \
-DGGML_MAX_THREADS=1 \
-DLLAMA_MAX_THREADS=1 \
-DLLAMA_THREADS=OFF \
-DGGML_USE_THREADS=OFF \
.
# Build
cmake --build build -j$(nproc)
```

`sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0 -t 1`
```
Performance counter stats for './build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt Once upon a time -n 32 --temp 0 -t 1':
1,628,069,275 cpu_atom/cycles/ (0.27%)
11,298,828,953 cpu_core/cycles/ (99.73%)
1,654,102,016 cpu_atom/instructions/ # 1.02 insn per cycle (0.27%)
24,495,335,248 cpu_core/instructions/ # 2.17 insn per cycle (99.73%)
107,439 minor-faults
0 major-faults
418,521,717 cpu_atom/dTLB-loads/ (0.27%)
5,758,680,746 cpu_core/dTLB-loads/ (99.73%)
629,274 cpu_atom/dTLB-load-misses/ # 0.15% of all dTLB cache accesses (0.27%)
1,295,693 cpu_core/dTLB-load-misses/ # 0.02% of all dTLB cache accesses (99.73%)
<not supported> cpu_atom/iTLB-loads/
<not supported> cpu_core/iTLB-loads/
1,189,318 cpu_atom/iTLB-load-misses/ (0.27%)
128,400 cpu_core/iTLB-load-misses/ (99.73%)
107,439 page-faults
10,632,085 cpu_atom/cache-references/ (0.27%)
467,954,250 cpu_core/cache-references/ (99.73%)
3,727,551 cpu_atom/cache-misses/ # 35.06% of all cache refs (0.27%)
393,355,294 cpu_core/cache-misses/ # 84.06% of all cache refs (99.73%)
2.146829471 seconds time elapsed
2.006486000 seconds user
0.138964000 seconds sys
```
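As a sanity check, the ratios perf annotates in the `-t 1` output can be reproduced from the raw P-core counters:

```python
# P-core counters copied from the -t 1 perf stat output above.
cycles       = 11_298_828_953
instructions = 24_495_335_248
cache_refs   = 467_954_250
cache_misses = 393_355_294

ipc = instructions / cycles             # perf annotates 2.17 insn per cycle
miss_rate = cache_misses / cache_refs   # perf annotates 84.06% of all cache refs
print(f"IPC: {ipc:.2f}, cache miss rate: {miss_rate:.2%}")
```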
`sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0 -t 8`
```
Performance counter stats for './build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt Once upon a time -n 32 --temp 0 -t 8':
11,527,089,160 cpu_atom/cycles/ (58.84%)
29,775,399,018 cpu_core/cycles/ (82.35%)
9,641,798,064 cpu_atom/instructions/ # 0.84 insn per cycle (58.84%)
23,503,884,833 cpu_core/instructions/ # 0.79 insn per cycle (82.35%)
107,767 minor-faults
0 major-faults
3,807,722,819 cpu_atom/dTLB-loads/ (58.84%)
5,694,071,994 cpu_core/dTLB-loads/ (82.35%)
385,176 cpu_atom/dTLB-load-misses/ # 0.01% of all dTLB cache accesses (58.84%)
4,006,803 cpu_core/dTLB-load-misses/ # 0.07% of all dTLB cache accesses (82.35%)
<not supported> cpu_atom/iTLB-loads/
<not supported> cpu_core/iTLB-loads/
380,657 cpu_atom/iTLB-load-misses/ (58.84%)
193,443 cpu_core/iTLB-load-misses/ (82.35%)
107,767 page-faults
218,262,470 cpu_atom/cache-references/ (58.84%)
463,026,457 cpu_core/cache-references/ (82.35%)
182,524,587 cpu_atom/cache-misses/ # 83.63% of all cache refs (58.84%)
393,850,846 cpu_core/cache-misses/ # 85.06% of all cache refs (82.35%)
1.162837679 seconds time elapsed
6.079556000 seconds user
0.144892000 seconds sys
```
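Comparing the two runs: `-t 8` cuts wall-clock time, but scales well below 8x, consistent with the spin-barrier and futex overhead seen in the profiles. The derived numbers (times copied from the perf output above):

```python
# elapsed/user times copied from the -t 1 and -t 8 perf runs above
t1_elapsed, t1_user = 2.146829471, 2.006486
t8_elapsed, t8_user = 1.162837679, 6.079556

speedup   = t1_elapsed / t8_elapsed   # wall-clock speedup from 8 threads
avg_cores = t8_user / t8_elapsed      # cores kept busy on average
print(f"speedup: {speedup:.2f}x, ~{avg_cores:.1f} cores busy")
```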
- `sudo perf sched latency -i sched.data --sort avg | less -S`

## scheduler
- Analyze its scheduling behavior with KernelShark
- `sudo apt install kernelshark`