開發紀錄 (2): FCFS/RR scheduler

# 開發紀錄 (2): FCFS/RR scheduler > [Repo](https://github.com/cce-underdogs/scx/tree/otteryc_exp) : 重現 FIFO/RR scheduler 實驗, with Rust :crab: contributed by < [`charliechiou`](https://github.com/charliechiou) > < [`EricccTaiwan`](https://github.com/EricccTaiwan) >  ## 產生 trace 檔案 **Terminal 1** ```shell $ # scxtop $ sudo LC_ALL=C scxtop $ # 按下 a 錄製 perfetto 檔案 ``` **Terminal 2** ```shell $ # stress-ng $ stress-ng --cpu 3 --verbose ``` **Terminal 3** ```shell $ # scx $ sudo ./bin/scx_rlfifo ``` 接著便可用 perfetto UI 開啟流程影片 : https://www.youtube.com/watch?v=tIEW7PNKnR4 ## 實測 scx_rlfifo 實際跑測試，用 [Perfetto](https://perfetto.dev/) 視覺化排程開啟 `scx_rlfifo` 前 ![image](https://hackmd.io/_uploads/Bkm_7xyxxx.png) 開啟 `scx_rlfifo` 後 ![image](https://hackmd.io/_uploads/rysAXxyxge.png) :::warning 問題 : 如果不能鎖定只使用 1 個 CPU，task 就可以做 migrate ，這樣如何確認其 fifo 的特性 ? => 要試著限制 CPU migration，stress-ng 可以透過 `--taskset CPU_NUM` 的方式來把任務綁定在 CPU_NUM 的 CPU 上。 ::: ## stress-ng 單看上述的實驗結果無法分辨有沒有真的成功，需要加入額外的 workload 做測試。使用 [stress-ng](https://github.com/ColinIanKing/stress-ng?tab=readme-ov-file) 來測試。 ```shell $ stress-ng --cpu 3 --verbose ``` 增加三個 cpu 的 workload ，並觀察輸出如下 - 未使用 scx_lavd 前，有三個 workload 產生分別為 [59042][59043][59044] 。 ![image](https://hackmd.io/_uploads/B1mCrZ1elg.png) 重點觀察 [59042] ![image](https://hackmd.io/_uploads/Hyj-L-yxee.png) - 使用 scx_lavd 後 ![image](https://hackmd.io/_uploads/SJAtBWJgxx.png) 同樣重點觀察 [59042] ![image](https://hackmd.io/_uploads/ByyjB-Jxel.png) 可以實際看見使用 scx_lavd 的排程結果，會設定 time slice ，時間到便將任務 sleep，因此和未載入前相比較為零碎。 ## 改寫 `scx_rlfifo` > Commit [`ac88851`](https://github.com/sched-ext/scx/commit/ac888518da46bc1e0eefd3a72bf3bd5bcd7ee2e8) **參照 [Linux 核心專題: sched_ext 研究 (by otteryc)](https://hackmd.io/@sysprog/H1u6D9LI0)** 因為要限制 CPU migration ，因此做了以下的更動 ```diff let cpu = self.bpf.select_cpu(task.pid, task.cpu, task.flags); - dispatched_task.cpu = if cpu >= 0 { cpu } else { task.cpu }; + dispatched_task.cpu = task.cpu; ``` ![image](https://hackmd.io/_uploads/B1LBNsyell.png) 以下重現 otteryc 期末專題的設計概念： * 當 $NICE \lt 0$ 時，執行 FIFO * 當 $NICE \geq 0$ 時，執行 Round Robin (Time Slice = 50 $\mu$s) --- ### exp 1 > Follow otteryc ```rust if nice < 0 { // Nice < 0 => Treat as FIFO self.served_fifo += 1; dispatched_task.slice_ns = u64::MAX; } else { // Nice >= 0 => Treat as RR self.served_rr += 1; // dispatched_task.slice_ns = SLICE_NS / (nr_waiting + 1); dispatched_task.slice_ns = SLICE_NS / 50_000; // 固定 rr 大小 } ``` ![image](https://hackmd.io/_uploads/r1d08ikxxx.png) ### exp 2 > 變體: rr 的 Time Slice 由須排程的 task 數量所決定 ```rust // Get the amount of tasks that are waiting to be scheduled. let nr_waiting = *self.bpf.nr_queued_mut(); ``` ```rust if nice < 0 { // Nice < 0 => Treat as FIFO self.served_fifo += 1; dispatched_task.slice_ns = u64::MAX; } else { // Nice >= 0 => Treat as RR self.served_rr += 1; dispatched_task.slice_ns = SLICE_NS / (nr_waiting + 1); } ``` ![image](https://hackmd.io/_uploads/HyhxvsJllx.png) 上述兩項的 Perfetto 輸出圖無法有效確認是否有按照我們所想的不同的 nice 值設定不同的排程方式，且可以發現程式碼無法正確讀取到 nice 值的設定。 --- ### 改善實驗和 jserv 老師討論後修正實驗設計，應該考量 scx 的時間分配是否正確運作，因此修改 stress-ng 所產生的 workflow 如下，透過限制 workflow 的時間為 1 sec 並使用 watch 不斷產生 task 。 ```shell $ sudo watch nice -n -5 stress-ng --cpu 1 --timeout 1s ``` 重複安排 1 sec 的任務並將 NICE 值設定為 -5，觀察以下結果並配合 `htop` 確認 NICE 數值有正確被設定。 ![image](https://hackmd.io/_uploads/rJLUI8vlel.png =70%x) ![image](https://hackmd.io/_uploads/HkSMLIDexl.png =70%x) 實驗發現，雖然在 `htop` 中有正確的設定 NICE 數值，但 scx_rlfifo 中無論將 NICE 值設定成 5 或是 -5 讀取到的值皆為 0 。 >In this case you can simply use task.weight which is determined as a function of task's nice (see https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c#n10148 for detils) >[name=Andrea Righi] 在向其他 scx_ext 的開發人員 ([`arighi`](https://github.com/arighi)) 詢問後得知，可以**使用 `task.weight` 做代替**，`task.weight` 為 NICE 值的映射。接著為了測試是否能夠有效的取得 weight 數值，將程式碼做以下改變，讀取 `task.weight` 數值並輸出 PID 及對應的 weight 值。 ```rust let nice = task.weight; println!("PID={} weight={}", task.pid, nice); ``` 分別測試以下兩種不同 NICE 值，並使用 `grep` 過濾出指定的 PID 方便觀察。 #### NICE = -5 ```rust $ sudo nice -n -5 stress-ng --cpu 1 stress-ng: info: [250733] defaulting to a 1 day run per stressor stress-ng: info: [250733] dispatching hogs: 1 cpu ``` 執行結果： ```rust $ sudo ./bin/scx_rlfifo | grep "250734" PID=250734 weight=305 PID=250734 weight=305 PID=250734 weight=305 PID=250734 weight=305 ... ``` #### NICE = 5 ```rust $ sudo nice -n 5 stress-ng --cpu 1 stress-ng: info: [250928] defaulting to a 1 day run per stressor stress-ng: info: [250928] dispatching hogs: 1 cpu ``` 執行結果： ```rust $ sudo ./bin/scx_rlfifo | grep "250929" PID=250929 weight=33 PID=250929 weight=33 PID=250929 weight=33 PID=250929 weight=33 ... ``` 透過上述實驗可以發現，當 NICE 值越小優先度越高，因此所設定的權重 `weight` 也越大，反之亦然。而對應關係是透過 cgroup weight，而非直接使用 prio_to_wight 。 > The weight used by sched_ext is the same as the "cgroup weight", after resolving the weight from prio_to_weight[] the value is normalized in the cgroup weight range [1 .. 10000], where 100 is the logarithmic center (so 1024 becomes 100). Search for [sched_weight_to_cgroup()](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/sched.h#L268), [CGROUP_WEIGHT_MIN](https://elixir.bootlin.com/linux/v6.14.5/source/include/linux/cgroup.h#L33) and [CGROUP_WEIGHT_MAX](https://elixir.bootlin.com/linux/v6.14.5/source/include/linux/cgroup.h#L33)  ## 問題 :::warning 如何實際用指標計算排程效率 ? Throughput ? Latency ? 或單純計算執行時間就好 ? => 目前先著重在實驗的重現，驗證程式碼確實是 rr 的電腦可能同時有很多其餘執行序在跑，如何確保每次實驗執行環境都相同？ => 只能透過 p-value 確保其他因素是不顯著的 --- scx_rlfifo 的啟動時機？ => context_switch 的時間會計算在前一個任務還是後一個任務？ => ::: ## After office-hours 測試 ### RR test $NICE = 5$ ```shell $ watch -n 5 ' nice -n 5 taskset -c 3 bash -c " stress-ng --cpu 1 --timeout 2s & stress-ng --cpu 1 --timeout 2s & stress-ng --cpu 1 --timeout 2s & wait "' ``` 因為 $NICE \ge 0$ ，因此預期為 RR 和 time slice 10 ms ![image](https://hackmd.io/_uploads/ry2hfZceeg.png) 發現 stress-ng 的最小單位確實是 10ms，但同時卻觀察到有 20ms, 30ms 的出現，這明顯不符合應該全部 time slice 皆是 10ms 的期待。 >根據 office-hour 詢問的結果， perfetto 會紀錄 task 的執行，因此同任務第一個 time slice 及第二個 time slice 中間並未有其他任務產生 (i.e., 未有 context switch) 則兩者會顯示在一起，出現形如 20 ms 或 30 ms 的結果。為了更進一步測試，我們增加 stress-ng 分派的任務，改為一次分派 6 個 task 。預期 Avg Wall duration(ms) 會因為更頻繁的交換任務而降低兩個 time slice 相連的現象。 ![image](https://hackmd.io/_uploads/rktS9hhlel.png) 可以發現確實每個任務相較於前次實驗的 16.9, 12.9, 14.1 更為接近理想的 10 ms 。 #### schedbrench 未使用 ```shell $ taskset -c 3 nice -n 5 ./schbench/schbench -t 6 -r 2 Wakeup Latencies percentiles (usec) runtime 2 (s) (1145 total samples) 50.0th: 5160 (524 samples) 90.0th: 6872 (234 samples) * 99.0th: 7048 (63 samples) 99.9th: 8880 (10 samples) min=1, max=9008 Request Latencies percentiles (usec) runtime 2 (s) (1145 total samples) 50.0th: 5160 (466 samples) 90.0th: 7016 (336 samples) * 99.0th: 27744 (103 samples) 99.9th: 32960 (10 samples) min=1868, max=40500 RPS percentiles (requests) runtime 2 (s) (3 total samples) 20.0th: 0 (1 samples) * 50.0th: 563 (1 samples) 90.0th: 573 (1 samples) min=562, max=572 average rps: 572.50 ``` 使用 rr ```shell $ taskset -c 3 nice -n 5 ./schbench/schbench -t 6 -r 2 Wakeup Latencies percentiles (usec) runtime 2 (s) (207 total samples) 50.0th: 2002 (62 samples) 90.0th: 7064 (79 samples) * 99.0th: 29024 (18 samples) 99.9th: 30624 (2 samples) min=3, max=30616 Request Latencies percentiles (usec) runtime 2 (s) (207 total samples) 50.0th: 30048 (62 samples) 90.0th: 132352 (83 samples) * 99.0th: 204032 (18 samples) 99.9th: 208640 (2 samples) min=1883, max=208466 RPS percentiles (requests) runtime 2 (s) (3 total samples) 20.0th: 0 (1 samples) * 50.0th: 91 (1 samples) 90.0th: 104 (1 samples) min=91, max=104 average rps: 103.50 ``` - Wakeup Latencies percentiles (usec) | | RR | Rustland| Default | | -------- | ---------- | ------- | ----------- | | 50.0th | 2002 µs | 21 µs | 5160 µs | | 90.0th | 7064 µs | 1734 µs | 6872 µs | | 99.0th | 29024 µs | 8496 µs |7048 µs | | Max | 30616 µs | 32881 µs|9008 µs | | Total samples | 207 | 934 | 1145 | - Request Latencies percentiles (usec) | | RR | Rustland | Default | | -------- | --------- | ---------- |--------- | | 50.0th | 30048 µs | 6888 µs | 5160 µs | | 90.0th | 132352 µs | 10480 µs | 7016 µs | | 99.0th | 204032 µs | 55232 µs | 27744 µs | | Max | 208466 µs | 69299 µs | 40500 µs | | Total samples | 207 | 934 | 1145 | - RPS percentiles (requests) | | RR |Rustland | Default | | ------------ | -------- | -------- | ------- | | Average RPS | 103.5 | 467.0 | 572.5 | | 50.0th | 91 | 438 | 563 | | Max | 104 | 484 | 573 | ### FIFO test $NICE = -5$ ```shell $ watch -n 5 'nice -n -5 taskset -c 3 bash -c "stress-ng --cpu 1 --timeout 2s & stress-ng --cpu 1 --timeout 2s & stress-ng --cpu 1 --timeout 2s & wait"' ``` ### Time slice 設定確認為了確認 scx 可以正確設定 time_slice 因此我們更改原先 scx_rlfifo 的設定。 ```diff - dispatched_task.slice_ns = SLICE_NS / (nr_waiting + 1); + dispatched_task.slice_ns = 10_000_000; ``` 原先是依照任務數量平均分配，為了更好觀察 time_slice 設定是否符合 Perfetto 的觀察結果，我們將其設定為定值，分別為 10ms, 20ms, 30ms。結果如下 - 10ms ![image](https://hackmd.io/_uploads/rJuMWT7Eee.png) - 20ms ![image](https://hackmd.io/_uploads/ry_Q-pX4lg.png) - 30ms ![image](https://hackmd.io/_uploads/H1Z4Wa7Ele.png) 可以確認 scx 可以有效設定 time_slice。 ## Ref [Linux 核心專題: sched_ext 研究 by otteryc](https://hackmd.io/@sysprog/H1u6D9LI0) [GitHub - scx](https://github.com/sched-ext/scx) [eBPF 隨筆（七）：sched_ext](https://medium.com/@ianchen0119/ebpf-%E9%9A%A8%E7%AD%86-%E4%B8%83-sched-ext-f7b60ea28976) [Perfetto Docs](https://perfetto.dev/docs/)