Profiling CPU scheduler with eBPF

實驗環境

硬體資訊：

hosted by QEMU with KVM enabled.
Architecture:            x86_64             
  CPU op-mode(s):        32-bit, 64-bit     
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7                 
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 7 3700X 8-Core Processor
    CPU family:          23
    Model:               113
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           8
    Stepping:            0

sched BPF hook 位置

runqueue profiling

load balance profiling

測試程式

使用 BPF ringbuf 紀錄 hook 觸發事件。

目前僅使用 stress 製造工作負載。MMTests 或許更適合？

userspace 使用 BPF ringbuf 提供的 API，ring_buffer__poll，對指定 ringbuf 使用 epoll 等待事件發生。(或者，也可以使用 ring_buffer__consume 進行 busy-looping 處理事件。目前 epoll 可以承受事件發生頻率，因此尚未使用。)

以下箱形圖，runqueue profiling 以 ~65539 筆量測數據繪製，load balance profiling 則以 ~11000 筆數據繪製。

runqueue 的部份，程式開始後 buffer 馬上被填滿。

load balance 的部份則要花半小時左右(無負載狀態下)。

task 於 runqueue 中的等待時間

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

CFS 做 load balance 花費時間

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

BPF 程式的執行開銷

2020 年 eBPF summit 裡頭一場以 How and When You Should Measure CPU Overhead of eBPF Programs 為題，介紹對 eBPF program 做 profiling 的幾種方式。

`kernel.bpf_stats_enabled`

於 v5.1 引入，透過 procfs 切換開啟/關閉，或 sysctl：

$ sudo sysctl -w kernel.bpf_stats_enabled=1

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

(取自上述介紹簡報)

透過 sched_clock() 提取時間戳，開銷為 ~20ns。

一旦開啟，機器上所有 BPF program 都會被量測。

最快速檢視量測結果方式為，透過 bpftool:

$ sudo bpftool prog show

獲得所有於機器中的 eBPF program 的資訊：

323: sched  name lb_start  tag f589160b6c071b80  gpl run_time_ns 60820676 run_cnt 13791
        loaded_at 2021-11-16T18:06:13+0800  uid 0
        xlated 128B  jited 74B  memlock 4096B  map_ids 165
        btf_id 428
        pids time_in_lb(965265)
324: sched  name lb_end  tag 53367b54a588dca6  gpl run_time_ns 4902948 run_cnt 13791
        loaded_at 2021-11-16T18:06:13+0800  uid 0
        xlated 152B  jited 82B  memlock 4096B  map_ids 165
        btf_id 428
        pids time_in_lb(965265)
306: sched  name enqueue  tag 0826b63ede8eaacc  gpl run_time_ns 7237294 run_cnt 87362
        loaded_at 2021-11-16T18:01:58+0800  uid 0
        xlated 184B  jited 152B  memlock 4096B  map_ids 159
        btf_id 407
        pids time_in_rq(961648)
307: sched  name dequeue  tag 49878c108acd3ce6  gpl run_time_ns 101317190 run_cnt 170419
        loaded_at 2021-11-16T18:01:58+0800  uid 0
        xlated 208B  jited 160B  memlock 4096B  map_ids 159
        btf_id 407
        pids time_in_rq(961648)

相關統計包含：總執行時間、總執行次數。

換算後，我們實驗時用的 BPF program 的平均執行時間個別為： (以工作負載 stress -c 500 作為量測背景)

task enqueue: 82.842586ns
task picked-up: 594.518158ns
load balance start: 4410.17156ns
load balance end: 355.517946ns

`BPF_ENABLE_STATS` (LKML 對應 patch)

於 v5.8 引入，以 fd 操作為主。

用於解決上述 kernel.bpf_stats_enabled 可能同時被多個工具開啟/關閉，進而潛在某些工具取得不正確的量測資訊的問題。

bpftool prog profile

於 v5.7 引入，為 bpftool 的功能之一。

可以做更細微的統計，包含：

cycles
instructions
l1d_loads
llc_misses
itlb_misses
dtlb_misses

使用方式：

bpftool prog profile PROG [duration DURATION] METRICs

將上述 load balance start 這個花費時間最高的 BPF program 用此方式量測，發現雖然其程式碼與 load balance end 幾乎一致，但實際執行的平均指令數量卻有兩倍左右的差異(636.431608 vs. 1241.71394, per run)，CPU cycle 平均數量差異則是接近四倍(1927.1585 vs 7458.94961, per run)

lb_start 以及 lb_end 經 BPF JIT 後的 native assembly

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

顯然，兩者幾乎產生一樣的組合語言。

目前推測 BPF ringbuf 的 API，bpf_ringbuf_reserve (用於取得 buffer)，所使用的 spinlock 有潛在的 contention (之前説錯了，實做中，只有處理 NMI 時才會以 try_lock 方式獲取 lock)，才導致指令、cycle 數量有如此的數量落差。

解決開銷如此之大的方案：

改用 PER_CPU_ARRAY BPF map 以降低潛在的 contention

以 Perf 觀察量測 CFS load balance 的 eBPF program

使用以下命令進行持續 10 分鐘的觀察： (以工作負載 stress -c 500 作為量測背景)

sudo perf record -e 'cycles:k' -a sleep 600

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

$↑$ 左側為沒有載入 BPF 程式的情境; 右側是載入後。

由此觀察得知，載入量測 load balance 的 BPF program 後，在同樣環境下，load balance 觸發的次數差距不小，因此本 BPF program 已經改變了 kernel 行為，需要修正。

以 PERCPU_ARRAY 改寫量測 load balance 的 BPF program

以 stress -c 500 作為量測背景，搭配上述工具量測改寫後的 BPF program 執行成本：

360: sched  name lb_start  tag c6a2f69becd9774c  gpl run_time_ns 8689011 run_cnt 26470
        loaded_at 2021-11-28T21:00:49+0800  uid 0
        xlated 192B  jited 120B  memlock 4096B  map_ids 170,169
        btf_id 491
        pids time_in_lb(247712)
361: sched  name lb_end  tag 88c55c7d6fa31085  gpl run_time_ns 2079679 run_cnt 26470
        loaded_at 2021-11-28T21:00:49+0800  uid 0
        xlated 216B  jited 133B  memlock 4096B  map_ids 170,169
        btf_id 491
        pids time_in_lb(247712)

換算後，平均時間降至：

load balance start: 4410.17156 ns
$\to$ 328.258821 ns
load balance end: 355.517946 ns
$\to$ 78.5673971 ns

以 bpftool prog profile 量測 dtlb_misses:

load balance start:
```
7322 run_cnt
49388 dtlb_misses
```
平均每次執行會發生 6.74 次 dTLB miss。(miss 成本？做 dummy BPF prog 或實際去 dTLB miss handler 量測)
load balance end:
```
9382 run_cnt
19388 dtlb_misses
```
平均每次執行會發生 2.06 次 dTLB miss。

以 perf 觀察機器行為是否仍受到 BPF 程式影響：(同樣以 stress -c 500 作為背景)

$ sudo perf record -e 'cycles:k' -ono_bpf_running -F5000 --strict-freq -a sleep 600

$↑$ 左：無 BPF 程式執行; 右：bpf 程式正在執行)

load balance 次數差了 24 次，差距還是不小..

以修改後的 BPF 程式量測 load balance 開銷：(量測時間為兩分鐘，使用 stress 製造不同工作負載)

load balance 開銷相較上面的量測降低了一半左右，可見 BPF 帶來的額外執行成本對 load balance 的行為有影響。(load_balance() 內部有大量 goto，在必要時會重複執行特定程式碼)

idle 時 load balance 會 sleep?

放大至 yrange=50000:

同樣量測時間，idle 時 load balance 觸發次數較多
task 數量多時，load balance 開銷浮動較大

放大至 yrange=50000:

以自定義 BPF map 改寫後的 BPF program

由於呼叫 BPF helper bpf_map_lookup_elem 的成本不低 (可高達 ~100 ns)，因此使用 atomic 操作搭配自定義 BPF map 改寫。

以下為目前使用的 map：








struct percpu_event {
    long time_ns[LB_EVENT_SZ];
    unsigned int idx;
} __cacheline_aligned;

struct {
    struct percpu_event cpu_ev[NPROC];
} map = {};

BPF 程式碼：













SEC("sched/cfs_trigger_load_balance_start")
int BPF_PROG(lb_start)
{
    __u32 cpu = bpf_get_smp_processor_id();
    __u32 idx;

    idx = __sync_fetch_and_add(&map.cpu_ev[cpu & CPU_MASK].idx, 1);
    if (idx >= LB_EVENT_SZ)
        return 0;
    __sync_fetch_and_add(&map.cpu_ev[cpu & CPU_MASK].time_ns[idx], bpf_ktime_get_ns());

    return 0;
}

以下列出各個 BPF helper 以及相關操作個別平均執行成本： (時間、dTLB miss 欄位中，左為虛擬機，右為 bare-metal)

helper, 操作 \ 成本	時間 (ns)	dTLB miss 次數
ktime_get_ns	167.62, 175	2.51, 0.56
bpf_map_lookup_elem	130, 127	2.88, 1.38
bpf_get_smp_processor_id	86, 32	1.77, 1.24
對 map 寫入	56, 87	1.82, 1.38

(為避免 false-sharing, 對 map 寫入 的數據為綜合 bpf_get_smp_processor_id 操作後再將後者成本扣除後得知)

改寫後的 BPF 只使用兩個 helper (bpf_get_smp_processor_id, ktime_get_ns)，對應執行成本相較 PERCPU_ARRAY BPF map 為：

執行成本： 328.25 ns
$\to$ 230.94 ns
dTLB miss 次數： 6.74 次
$\to$ 4.03 次

原先考量是否因為配給虛擬機器的處理器核沒有使用到 HT (Hyper Threading) 對應的真正處理器核，所以才導致 dTLB 次數不低，但實際使用 taskset 將處理器核綁定 (例如： taskset -c 0,8,1,9,2,10,3,11) 後，反而對執行時間造成反效果 (執行時間及 dTLB miss 次數分別為 293.42 ns, 3.90 次)。

ktime_get_ns 中使用到 seqlock，所以執行成本高還算合理，但 bpf_get_smp_processor_id 也可以高達 86 ns 就想不到原因了 (如果是在 bare-metal 機器上，執行成本則可低至 32 ns)。

目前 CFS 的 load_balance() 發生次數還是受到影響：

$↑$ 左： BPF 程式正在執行; 右：無 BPF 程式正在執行)

sched BPF call trace

找出例如 load_balance() 是從哪邊執行過來的。

BPF helper - bpf_get_stack

the helper needs ctx, which is a pointer to the context on which the tracing program is executed.

會需要對應 context 的指標。(TODO: trace kernel 內相關手法)

問題

LB_START 的執行次數有時候會比 LB_END 多一次 (甚至兩次)。為什麼不是只有後者較多的情況？ BPF 程式載入時，load balance 可能做到一半，所以 LB_END 有機會有多一次的執行次數。但 LB_START 發生次數較多的原因，沒頭緒..
- static key 對 kernel 做 live patching 時沒有一次做完？

off-topic

sched_schedstats 與 sched BPF hook 一樣，均使用 static key 機制控制，前者還可透過 /proc/sys/kernel/sched_schedstats 將相關 code path 開啟/關閉。老師提到過的將 EAS 相關程式碼移除的部份，也有對應的開關：/proc/sys/kernel/sched_energy_aware，相關文件。
有些 BPF map (目前已知： PERCPU_ARRAY) 在 kernel 內配置記憶體時，會以 8 bytes 做對齊。因此如果 map 小於 8 bytes 的話，在定義資料結構時，記得使用相關對齊 extension 標注。
注意 BPF map 宣告，誤用相關巨集可能導致對應資料結構大小不如預期 (例如：總是只有 8-byte)，相關修正。
即便在 Clang 12 也有 BPF XADD inst 無法使用問題，仍需 mcpu=v3 flag 才可編譯。(相關討論)。

References

BPF ring buffer (a successor to BPF perf buffer, since kernel 5.8)
- official doc
- Andrii's intro
off-topic
- BPF CO-RE guide
- Introduction of group scheduling

Profiling CPU scheduler with eBPF

實驗環境

sched BPF hook 位置

runqueue profiling

load balance profiling

測試程式

task 於 runqueue 中的等待時間

CFS 做 load balance 花費時間

BPF 程式的執行開銷

kernel.bpf_stats_enabled

BPF_ENABLE_STATS (LKML 對應 patch)

bpftool prog profile

lb_start 以及 lb_end 經 BPF JIT 後的 native assembly

以 Perf 觀察量測 CFS load balance 的 eBPF program

以 PERCPU_ARRAY 改寫量測 load balance 的 BPF program

以自定義 BPF map 改寫後的 BPF program

sched BPF call trace

BPF helper - bpf_get_stack

問題

off-topic

References

`kernel.bpf_stats_enabled`

`BPF_ENABLE_STATS` (LKML 對應 patch)