ik_llama notes
===
- `sudo uftrace record -F 'mul_mat*' ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`
- https://deepwiki.com/ikawrakow/ik_llama.cpp/
- https://chatgpt.com/c/683c74f9-8b0c-8002-b8fc-ec4dd55f2adc
- quantize: `./build/bin/llama-quantize --allow-requantize ~/model/ggml-model-i2_s.gguf ggml-model-i2_s_bn.gguf iq2_bn`
- Run: `./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`
- `sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`

https://gist.github.com/alanhc/c0a246fc1660ff29b3ed31fc67672dda
{%gist c0a246fc1660ff29b3ed31fc67672dda %}
- The load is concentrated on the P-cores, and IPC > 1, so performance is good.
| Metric | Notes |
| ------------------------ | -------------------------------------- |
| `minor-faults` = 107,553 | Minor page faults (e.g. pages not yet mapped, but no disk read required). |
| `major-faults` = 0 | No page faults caused disk I/O; memory residency is healthy. |
🟩 TLB behavior is good and does not noticeably slow things down.
| Metric | Notes |
| ---------------------------------------------------- | --------------------------------------------------- |
| `cache-references`: Atom 373M, Core 532M | Heavy cache traffic. |
| `cache-misses`: Atom 283M → 75.85%; Core 435M → 81.84% | ❗ High cache miss rate: poor cache locality, with frequent trips back to main memory. |
⚠️ This is likely one of the performance bottlenecks.
✅ Key takeaways
- Work is concentrated on the P-cores (90% of cycles and instructions land on the big cores), with good IPC.
- The cache miss rate is high (over 75%), which can add latency via frequent DRAM accesses.
- TLB misses are minimal, so paging and virtual-memory management are healthy.
- Almost no page faults and zero major faults: model data and code are fully resident in RAM.
- CPU time exceeds elapsed time, so the system is running inference in parallel across multiple cores.
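These checks can be sketched as a small Python helper (my own sketch; the thresholds, IPC of at least 1 counting as good and a cache-miss rate above 75% counting as a warning, come from this write-up rather than any universal rule):

```python
def summarize_counters(c):
    """Rough health checks mirroring the heuristics in these notes.

    `c` is a dict of raw perf counters for one core type."""
    findings = []
    ipc = c["instructions"] / c["cycles"]
    findings.append(f"IPC {ipc:.2f} ({'good' if ipc >= 1 else 'low'})")
    if c["major_faults"] == 0:
        findings.append("no major faults: working set resident in RAM")
    miss_rate = c["cache_misses"] / c["cache_references"]
    if miss_rate > 0.75:
        findings.append(f"cache miss rate {miss_rate:.0%}: poor locality")
    return findings

# Example with the P-core counters from the -t 1 run later in these notes:
print(summarize_counters({
    "cycles": 11_298_828_953, "instructions": 24_495_335_248,
    "major_faults": 0,
    "cache_references": 467_954_250, "cache_misses": 393_355_294,
}))
```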
- `sudo perf record -g ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`


`sudo perf record -F 999 -g ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0`
- Flamegraph

- Call graph (tree view)
```
sudo apt install graphviz
sudo perf record -g -- ./llama-cli ...
sudo perf report --call-graph graph
# stackcollapse-perf.pl and flamegraph.pl come from Brendan Gregg's
# FlameGraph repo and must be on $PATH
sudo perf script | stackcollapse-perf.pl | flamegraph.pl --hash > out.svg
```

| Rank | % | Function | Module | Space | Notes |
| -- | ------ | --------------------------------------- | ------------------- | ------ | ------------------------------------------------------ |
| 1 | 84.09% | `ggml_compute_forward_mul_mat` | libggml.so | User | Core of every matrix operation in the LLM; drives the transformer's linear/attention/MLP layers |
| 2 | 81.24% | `iqk_mul_mat_4d` | libggml.so | User | Splits the 4-D matmul into SIMD-friendly blocks; used for quantized models |
| 3 | 81.17% | `iqk_mul_mat` | libggml.so | User | The implementation behind `*_4d`; performs the actual dot products |
| 4 | 51.82% | `mul_mat_iq2bn_q8_K64<1>` | libggml.so | User | Core kernel for 8-bit block-quantized matmul, Q8\_K variant |
| 5 | 27.41% | `mul_mat_qY_K_q8_K_T<DequantizerQ6K>` | libggml.so | User | Handles the Q6\_K quantization format, including dequantization logic |
| 6 | 7.28% | `entry_SYSCALL_64_after_hwframe` | kernel.kallsyms | Kernel | syscall entry point |
| 7 | 7.28% | `do_syscall_64` | kernel.kallsyms | Kernel | syscall dispatch logic |
| 8 | 6.73% | `x64_sys_futex` | kernel.kallsyms | Kernel | futex lock operations: threads waiting on each other |
| 9 | 5.88% | `do_futex` | kernel.kallsyms | Kernel | core futex handling |
| 10 | 5.77% | `libgomp.so+0x...` | libgomp.so | User | OpenMP thread management |
| 11 | 5.16% | `ggml_graph_compute_thread.constprop.*` | libggml.so | User | Runs multiple graph nodes on the thread pool |
| 12 | 4.91% | `futex_wait` | kernel.kallsyms | Kernel | futex wait-queue entry point |
| 13 | 4.72% | `__futex_wait` | kernel.kallsyms | Kernel | futex wait implementation |
| 14 | 3.27% | `futex_wait_queue` | kernel.kallsyms | Kernel | enqueue onto the wait queue |
| 15 | 2.93% | `schedule` | kernel.kallsyms | Kernel | Linux context-switch scheduling |
| 16 | 2.79% | `__schedule` | kernel.kallsyms | Kernel | lower-level scheduling logic |
| 17 | 2.08% | `__memset_avx2_unaligned_erms` | libc.so.6 | User | memory zeroing, likely from malloc/new |
| 18 | 1.93% | `_start` / `main` / `__libc_start_main` | llama-cli / libc.so | User | program startup path |
| 19 | 1.88% | `llama_init_from_gpt_params()` | llama-cli | User | llama model initialization (tokenizer load, ggml graph build, etc.) |
| 20 | 1.73% | `mul_mat_iq2bn_q8_K64<5>` | libggml.so | User | Same kernel as <1> with different parameters (different thread id or workload slice) |
```
# 1. Wipe the old build & cache
rm -rf build/ CMakeCache.txt CMakeFiles/
# 2. Regenerate the build with all OpenMP-related options disabled
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_OPENMP=OFF \
-DGGML_OPENMP=OFF \
-DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE \
-DCMAKE_C_FLAGS_RELEASE="-O3 -fno-openmp -DGGML_NO_OPENMP" \
-DCMAKE_CXX_FLAGS_RELEASE="-O3 -fno-openmp -DGGML_NO_OPENMP" \
.
# 3. Build
cmake --build build -j$(nproc)
# 4. Verify (both commands below must produce no output)
ldd build/bin/llama-cli | grep gomp
grep -R "fopenmp" build
```

| % (Children) | Function | Notes |
| ------------------- | ---------------------------------------------------- | -------------------------------------- |
| **54.9 %** | `ggml_compute_forward_mul_mat` → `iqk_mul_mat_*` | Quantized matmul core; still the dominant bottleneck |
| **34.6 %** | `ggml_barrier` | ggml thread-pool spin barrier (busy-loop synchronization) |
| **30.6 % / 23.1 %** | `mul_mat_iq2bn_q8_K64<1>` / `mul_mat_qY_K_q8_K_T<…>` | Main int8 block-quant matmul kernels |
| rest (< 2 %) | page-fault / memset / vocab load | initialization and memory housekeeping |
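`ggml_barrier` ranking this high is typical of spin barriers: waiting threads burn cycles in a busy loop instead of sleeping, so the wait time is charged as user CPU time. A minimal Python sketch of the idea (my own illustration, not ggml's actual implementation, which is in C):

```python
import threading

class SpinBarrier:
    """Minimal spin-barrier sketch: the last thread to arrive releases
    the rest; everyone else busy-waits, consuming CPU the whole time."""
    def __init__(self, n_threads):
        self.n = n_threads
        self.count = 0
        self.generation = 0
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            gen = self.generation
            self.count += 1
            if self.count == self.n:   # last arrival: open the barrier
                self.count = 0
                self.generation += 1
                return
        while self.generation == gen:  # busy loop: shows up as user CPU time
            pass

# Usage: four workers all reach the barrier before any proceeds.
barrier, done = SpinBarrier(4), []
threads = [threading.Thread(target=lambda: (barrier.wait(), done.append(1)))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(done))  # 4
```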
```
# Clean out the old build and cache first
rm -rf build/ CMakeCache.txt CMakeFiles/
# Regenerate the build system
cmake -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_OPENMP=OFF \
-DGGML_OPENMP=OFF \
-DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE \
-DCMAKE_C_FLAGS_RELEASE="-O3 -fno-openmp -DGGML_NO_OPENMP" \
-DCMAKE_CXX_FLAGS_RELEASE="-O3 -fno-openmp -DGGML_NO_OPENMP" \
-DGGML_MAX_THREADS=1 \
-DLLAMA_MAX_THREADS=1 \
-DLLAMA_THREADS=OFF \
-DGGML_USE_THREADS=OFF \
.
# Build
cmake --build build -j$(nproc)
```

`sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0 -t 1`
```
Performance counter stats for './build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt Once upon a time -n 32 --temp 0 -t 1':
1,628,069,275 cpu_atom/cycles/ (0.27%)
11,298,828,953 cpu_core/cycles/ (99.73%)
1,654,102,016 cpu_atom/instructions/ # 1.02 insn per cycle (0.27%)
24,495,335,248 cpu_core/instructions/ # 2.17 insn per cycle (99.73%)
107,439 minor-faults
0 major-faults
418,521,717 cpu_atom/dTLB-loads/ (0.27%)
5,758,680,746 cpu_core/dTLB-loads/ (99.73%)
629,274 cpu_atom/dTLB-load-misses/ # 0.15% of all dTLB cache accesses (0.27%)
1,295,693 cpu_core/dTLB-load-misses/ # 0.02% of all dTLB cache accesses (99.73%)
<not supported> cpu_atom/iTLB-loads/
<not supported> cpu_core/iTLB-loads/
1,189,318 cpu_atom/iTLB-load-misses/ (0.27%)
128,400 cpu_core/iTLB-load-misses/ (99.73%)
107,439 page-faults
10,632,085 cpu_atom/cache-references/ (0.27%)
467,954,250 cpu_core/cache-references/ (99.73%)
3,727,551 cpu_atom/cache-misses/ # 35.06% of all cache refs (0.27%)
393,355,294 cpu_core/cache-misses/ # 84.06% of all cache refs (99.73%)
2.146829471 seconds time elapsed
2.006486000 seconds user
0.138964000 seconds sys
```
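As a sanity check, the ratios perf annotates in the `-t 1` output can be reproduced from the raw P-core counters:

```python
# P-core counters copied from the -t 1 perf stat output above.
cycles       = 11_298_828_953
instructions = 24_495_335_248
cache_refs   = 467_954_250
cache_misses = 393_355_294

ipc = instructions / cycles             # perf annotates 2.17 insn per cycle
miss_rate = cache_misses / cache_refs   # perf annotates 84.06% of all cache refs
print(f"IPC: {ipc:.2f}, cache miss rate: {miss_rate:.2%}")
```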
`sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses ./build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt "Once upon a time" -n 32 --temp 0 -t 8`
```
Performance counter stats for './build/bin/llama-cli -m ggml-model-i2_s_bn.gguf --prompt Once upon a time -n 32 --temp 0 -t 8':
11,527,089,160 cpu_atom/cycles/ (58.84%)
29,775,399,018 cpu_core/cycles/ (82.35%)
9,641,798,064 cpu_atom/instructions/ # 0.84 insn per cycle (58.84%)
23,503,884,833 cpu_core/instructions/ # 0.79 insn per cycle (82.35%)
107,767 minor-faults
0 major-faults
3,807,722,819 cpu_atom/dTLB-loads/ (58.84%)
5,694,071,994 cpu_core/dTLB-loads/ (82.35%)
385,176 cpu_atom/dTLB-load-misses/ # 0.01% of all dTLB cache accesses (58.84%)
4,006,803 cpu_core/dTLB-load-misses/ # 0.07% of all dTLB cache accesses (82.35%)
<not supported> cpu_atom/iTLB-loads/
<not supported> cpu_core/iTLB-loads/
380,657 cpu_atom/iTLB-load-misses/ (58.84%)
193,443 cpu_core/iTLB-load-misses/ (82.35%)
107,767 page-faults
218,262,470 cpu_atom/cache-references/ (58.84%)
463,026,457 cpu_core/cache-references/ (82.35%)
182,524,587 cpu_atom/cache-misses/ # 83.63% of all cache refs (58.84%)
393,850,846 cpu_core/cache-misses/ # 85.06% of all cache refs (82.35%)
1.162837679 seconds time elapsed
6.079556000 seconds user
0.144892000 seconds sys
```
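Comparing the two runs: `-t 8` cuts wall-clock time, but scales well below 8x, consistent with the spin-barrier and futex overhead seen in the profiles. The derived numbers (times copied from the perf output above):

```python
# elapsed/user times copied from the -t 1 and -t 8 perf runs above
t1_elapsed, t1_user = 2.146829471, 2.006486
t8_elapsed, t8_user = 1.162837679, 6.079556

speedup   = t1_elapsed / t8_elapsed   # wall-clock speedup from 8 threads
avg_cores = t8_user / t8_elapsed      # cores kept busy on average
print(f"speedup: {speedup:.2f}x, ~{avg_cores:.1f} cores busy")
```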
- `sudo perf sched latency -i sched.data --sort avg | less -S`

## scheduler
- Analyze its scheduling behavior with KernelShark
- `sudo apt install kernelshark`