# Development Log (3): Machine Learning
contributed by < [charliechiou](https://github.com/charliechiou) > < [EricccTaiwan](https://github.com/EricccTaiwan) >
## Some ideas
- Follow `scx_central`'s approach of pinning the ML computation to a single CPU
- Reuse the parameters used inside `scx_lavd`
- Study the `scx_lavd` code
## Base scheduler: why `scx_rusty`?
We decided to start by modifying an existing scheduler, and the two leading candidates were `scx_lavd` and `scx_rusty`. The former estimates each task's latency criticality by observing how often it blocks and wakes other tasks and schedules accordingly, while the latter implements its load-balancing mechanism in Rust.
:::warning
The description of `scx_rusty` here will be expanded once we understand `scx_rusty` more fully.
See our notes at [scx_rusty](/bSCNGtw1SjelXRieqXI1Qg)
:::
We first read `scx_lavd` and tried to introduce machine learning into it, but a closer reading showed that all of its operational logic (that is, the eBPF hooks) lives in `*.bpf.c` files (e.g., `loadbalance.bpf.c`), while `main.rs` merely invokes those functions. Introducing machine learning there would therefore raise several hard questions:
```text
// scx_lavd
Rust layer (e.g., main.rs)
|
\- eBPF layer (e.g., loadbalance.bpf.c) => the main implementation
===
// scx_rusty
Rust layer (e.g., main.rs, load_balance.rs) => the main implementation
|
\- eBPF layer (e.g., just some library code)
```
1. Think of `*.bpf.c` as the lower layer and `*.rs` as the upper layer: the `Rust` layer can call functions in the `eBPF` layer, but if the machine learning is also to live in the `Rust` layer it needs input data $\to$ the question becomes: can `eBPF` pass data up to the `Rust` layer?
2. Following on, if `eBPF` cannot pass task information up to the `Rust` layer, can ML be implemented in the `eBPF` layer itself, replacing `loadbalance.bpf.c`?

There are probably more issues, largely stemming from our unfamiliarity with `eBPF`, so we switched to `scx_rusty`: its `loadbalance` is implemented in the `Rust` layer, so it can easily be replaced with `Rust-ML` without worrying about the `Rust-ML <--> eBPF` interaction.
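The layering argument above can be made concrete with a small sketch: if the migration decision sits behind an ordinary Rust trait, a rule-based policy and an ML policy become interchangeable without touching eBPF. Everything below (`BalancePolicy`, `TaskFeatures`, and both policies) is a hypothetical illustration, not scx code:

```rust
// Hypothetical sketch of a swappable load-balancing decision point in the
// Rust layer. None of these names exist in scx; they only illustrate why a
// Rust-layer implementation is easy to replace with ML.

/// Per-task features the Rust layer would need as model input.
struct TaskFeatures {
    load: f64,
    is_kworker: bool,
}

/// The decision point: should this task be migrated?
trait BalancePolicy {
    fn should_migrate(&self, t: &TaskFeatures, imbalance: f64) -> bool;
}

/// Rule-based policy, standing in for the existing heuristic.
struct Heuristic;
impl BalancePolicy for Heuristic {
    fn should_migrate(&self, t: &TaskFeatures, imbalance: f64) -> bool {
        !t.is_kworker && t.load < imbalance
    }
}

/// ML-based policy: same interface, different decision procedure.
struct MlModel {
    threshold: f64,
}
impl BalancePolicy for MlModel {
    fn should_migrate(&self, t: &TaskFeatures, imbalance: f64) -> bool {
        // A real model would run inference here; a fixed threshold stands in.
        t.load / imbalance < self.threshold
    }
}

fn main() {
    let t = TaskFeatures { load: 2.0, is_kworker: false };
    let h = Heuristic;
    let m = MlModel { threshold: 0.5 };
    println!("heuristic: {}", h.should_migrate(&t, 8.0));
    println!("ml: {}", m.should_migrate(&t, 8.0));
}
```

Because both policies share one trait, the caller in the balancer never needs to know which one is active; this is exactly the property `scx_lavd`'s eBPF-side implementation lacks.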
## Machine learning framework
> Use [burn](https://github.com/tracel-ai/burn) or [Candle](https://github.com/huggingface/candle) as the framework

For the machine-learning part, we chose [Candle](https://github.com/huggingface/candle) as our framework.
>Candle's core goal is to make serverless inference possible. Full machine learning frameworks like PyTorch are very large, which makes creating instances on a cluster slow. Candle allows deployment of lightweight binaries.
`Candle` is provided by Hugging Face and already implements many common ML tasks such as YOLO, Mamba, and LLMs. `Burn`, on the other hand, aims to support many different backends, `Candle` among them. While backend flexibility is a useful property, our task lets us focus on the machine learning itself, so we picked Candle for its larger set of example implementations and its lighter weight.
>Compared to other frameworks, Burn has a very different approach to supporting many backends. By design, most code is generic over the Backend trait, which allows us to build Burn with swappable backends. This makes composing backend possible, augmenting them with additional functionalities such as autodifferentiation and automatic kernel fusion.
## `scx_rusty`
### `load_balance` flow
```text
fn perform_balancing
|-- if there are multiple NUMA nodes => balance between nodes (fn balance_between_nodes)
|
\-- walk each node and balance within it (fn balance_within_node)
    |
    \-- early return if node.domains.len() < 2
```
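The flow above can be sketched as a skeleton. All the types and the `trace` bookkeeping below are hypothetical stand-ins for the real `scx_rusty` code, kept only to show where the early returns happen:

```rust
// Hypothetical skeleton mirroring scx_rusty's balancing flow; the real code
// in load_balance.rs carries far more state than this.
struct Domain;
struct NumaNode {
    domains: Vec<Domain>,
}
struct LoadBalancer {
    nodes: Vec<NumaNode>,
}

impl LoadBalancer {
    fn perform_balancing(&mut self) -> Vec<&'static str> {
        let mut trace = Vec::new();
        // Inter-node balancing only makes sense with at least two NUMA nodes.
        if self.nodes.len() >= 2 {
            trace.push("balance_between_nodes");
        }
        // Then balance within each node; a node with fewer than two domains
        // has nothing to balance and returns early.
        for node in &self.nodes {
            if node.domains.len() < 2 {
                trace.push("early_return");
                continue;
            }
            trace.push("balance_within_node");
        }
        trace
    }
}

fn main() {
    // One node with one domain: both balancing steps are skipped entirely.
    let mut lb = LoadBalancer {
        nodes: vec![NumaNode { domains: vec![Domain] }],
    };
    println!("{:?}", lb.perform_balancing());
}
```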
:::warning
While testing we found that `println!` never produced any visible output, because the `scx` code uses
```rust
use log::debug;
use log::info;
```
並搭配
```rust
let llv = match opts.verbose {
0 => LevelFilter::Info,
1 => LevelFilter::Debug,
_ => LevelFilter::Trace,
};
```
so you must use `debug!("Hi!")`, `info!("Hi!")`, or `trace!("Hi")` together with `-v` or `-vv` to actually see output, a small pitfall we ran into.
:::
After adding code to extract training data in `try_find_move_task`, we found that no data was ever recorded, so we used `debug!()` to trace the call chain upward. Both `balance_between_nodes` and `balance_within_node` call this function.
In `balance_between_nodes` we find the following code:
```rust
if self.nodes.len() < 2 {
return Ok(());
}
```
The hardware specification of the server we first experimented on is as follows:

Since its node count is $< 2$, `try_find_move_task` is never executed, so no data can be recorded.
Moreover, although the number of domains in our node should be 8, the function still returned early because of `node.domains.len() < 2`:
```rust
fn balance_within_node(&mut self, node: &mut NumaNode) -> Result<()> {
if node.domains.len() < 2 {
return Ok(());
}
...
```
<!--  -->
<!--  -->
To test whether the problem was our hardware, we borrowed a server from our professor, with the following specification.

>jserv-arm
Actual testing showed that scx correctly detects the 2 NUMA nodes, but still finds only one domain instead of 28. We therefore opened an issue to ask: [scx_rusty: Domain detect didn't work #2214](https://github.com/sched-ext/scx/issues/2214). From Emil Tsalapatis's reply we learned that a `scx_rusty` domain corresponds to an LLC (last-level cache), which is why both servers above report only 1.
<!-- - 沒開 `scx_rusty` ( default 排程器 )
```shell
$ ./underdog/schbench/schbench -m 4 -t 4 -r 10
Wakeup Latencies percentiles (usec) runtime 10 (s) (20929 total samples)
50.0th: 10 (0 samples)
90.0th: 13 (7071 samples)
* 99.0th: 19 (1411 samples)
99.9th: 26 (184 samples)
min=1, max=1337
Request Latencies percentiles (usec) runtime 10 (s) (20969 total samples)
50.0th: 7592 (14314 samples)
90.0th: 7592 (0 samples)
* 99.0th: 7640 (898 samples)
99.9th: 13648 (194 samples)
min=7574, max=13832
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 2100 (9 samples)
* 50.0th: 2100 (0 samples)
90.0th: 2108 (2 samples)
min=2045, max=2108
sched delay: message 0 (usec) worker 0 (usec)
current rps: 2102.87
Wakeup Latencies percentiles (usec) runtime 10 (s) (20930 total samples)
50.0th: 10 (0 samples)
90.0th: 13 (7071 samples)
* 99.0th: 19 (1411 samples)
99.9th: 26 (184 samples)
min=1, max=1337
Request Latencies percentiles (usec) runtime 10 (s) (20986 total samples)
50.0th: 7592 (14329 samples)
90.0th: 7592 (0 samples)
* 99.0th: 7640 (899 samples)
99.9th: 13648 (195 samples)
min=7574, max=13832
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 2100 (9 samples)
* 50.0th: 2100 (0 samples)
90.0th: 2108 (2 samples)
min=2045, max=2108
average rps: 2098.60
sched delay: message 0 (usec) worker 0 (usec)
```
- 只有 LLC (Last-Level Cache) load balance
```shell
$ ./underdog/schbench/schbench -m 4 -t 4 -r 10
Wakeup Latencies percentiles (usec) runtime 10 (s) (21035 total samples)
50.0th: 11 (3162 samples)
90.0th: 17 (9093 samples)
* 99.0th: 26 (1121 samples)
99.9th: 36 (158 samples)
min=1, max=52
Request Latencies percentiles (usec) runtime 10 (s) (21036 total samples)
50.0th: 7576 (0 samples)
90.0th: 7576 (0 samples)
* 99.0th: 7608 (1195 samples)
99.9th: 7720 (134 samples)
min=7532, max=13756
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 2100 (8 samples)
* 50.0th: 2100 (0 samples)
90.0th: 2108 (3 samples)
min=2091, max=2110
sched delay: message 0 (usec) worker 0 (usec)
current rps: 2102.62
Wakeup Latencies percentiles (usec) runtime 10 (s) (21036 total samples)
50.0th: 11 (3162 samples)
90.0th: 17 (9093 samples)
* 99.0th: 26 (1121 samples)
99.9th: 36 (158 samples)
min=1, max=52
Request Latencies percentiles (usec) runtime 10 (s) (21053 total samples)
50.0th: 7576 (0 samples)
90.0th: 7576 (0 samples)
* 99.0th: 7608 (1197 samples)
99.9th: 7720 (134 samples)
min=7532, max=13756
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 2100 (8 samples)
* 50.0th: 2100 (0 samples)
90.0th: 2108 (3 samples)
min=2091, max=2110
average rps: 2105.30
sched delay: message 0 (usec) worker 0 (usec)
```
- domain 看到 L2 cache
```shell
$ ./underdog/schbench/schbench -m 4 -t 4 -r 10
Wakeup Latencies percentiles (usec) runtime 10 (s) (16272 total samples)
50.0th: 13 (3755 samples)
90.0th: 24 (3185 samples)
* 99.0th: 30 (1458 samples)
99.9th: 38 (87 samples)
min=1, max=219
Request Latencies percentiles (usec) runtime 10 (s) (16265 total samples)
50.0th: 7576 (0 samples)
90.0th: 13712 (3505 samples)
* 99.0th: 20256 (1376 samples)
99.9th: 20384 (131 samples)
min=7559, max=20479
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 1622 (3 samples)
* 50.0th: 1626 (5 samples)
90.0th: 1630 (3 samples)
min=1616, max=1629
sched delay: message 0 (usec) worker 0 (usec)
current rps: 1626.75
Wakeup Latencies percentiles (usec) runtime 10 (s) (16274 total samples)
50.0th: 13 (3755 samples)
90.0th: 24 (3186 samples)
* 99.0th: 30 (1458 samples)
99.9th: 38 (87 samples)
min=1, max=219
Request Latencies percentiles (usec) runtime 10 (s) (16283 total samples)
50.0th: 7576 (0 samples)
90.0th: 13712 (3511 samples)
* 99.0th: 20256 (1379 samples)
99.9th: 20384 (131 samples)
min=7559, max=20479
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 1622 (3 samples)
* 50.0th: 1626 (5 samples)
90.0th: 1630 (3 samples)
min=1616, max=1629
average rps: 1628.30
sched delay: message 0 (usec) worker 0 (usec)
```
- scx_lavd
```shell
./underdog/schbench/schbench -m 4 -t 4 -r 10
Wakeup Latencies percentiles (usec) runtime 10 (s) (21012 total samples)
50.0th: 12 (5377 samples)
90.0th: 18 (8307 samples)
* 99.0th: 23 (499 samples)
99.9th: 32 (188 samples)
min=1, max=233
Request Latencies percentiles (usec) runtime 10 (s) (21002 total samples)
50.0th: 7576 (0 samples)
90.0th: 7592 (3954 samples)
* 99.0th: 7880 (627 samples)
99.9th: 8752 (159 samples)
min=7533, max=17113
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 2100 (10 samples)
* 50.0th: 2100 (0 samples)
90.0th: 2100 (0 samples)
min=2084, max=2106
sched delay: message 0 (usec) worker 0 (usec)
current rps: 2103.87
Wakeup Latencies percentiles (usec) runtime 10 (s) (21014 total samples)
50.0th: 12 (5377 samples)
90.0th: 18 (8308 samples)
* 99.0th: 23 (499 samples)
99.9th: 32 (188 samples)
min=1, max=233
Request Latencies percentiles (usec) runtime 10 (s) (21020 total samples)
50.0th: 7576 (0 samples)
90.0th: 7592 (3956 samples)
* 99.0th: 7880 (628 samples)
99.9th: 8752 (159 samples)
min=7533, max=17113
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 2100 (10 samples)
* 50.0th: 2100 (0 samples)
90.0th: 2100 (0 samples)
min=2084, max=2106
average rps: 2102.00
sched delay: message 0 (usec) worker 0 (usec)
```
#### Summary -->
## L3 & L2 test
Compile the kernel while running `schbench`:
```shell
$ # terminal 1
$ cd linux
$ make clean
$ make -j"$(nproc)"
$ # terminal 2
$ ./underdog/schbench/schbench -m 4 -t 4 -r 10
```
- without `scx_rusty` (default scheduler)
:::spoiler Result
```shell
$ ./underdog/schbench/schbench -m 4 -t 4 -r 10
Wakeup Latencies percentiles (usec) runtime 10 (s) (11472 total samples)
50.0th: 14 (3264 samples)
90.0th: 82 (4136 samples)
* 99.0th: 3396 (1032 samples)
99.9th: 4020 (103 samples)
min=1, max=6018
Request Latencies percentiles (usec) runtime 10 (s) (11456 total samples)
50.0th: 7640 (1820 samples)
90.0th: 25504 (4549 samples)
* 99.0th: 52160 (1028 samples)
99.9th: 76672 (102 samples)
min=7576, max=100178
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 565 (3 samples)
* 50.0th: 579 (3 samples)
90.0th: 2084 (4 samples)
min=554, max=2088
sched delay: message 4 (usec) worker 303 (usec)
current rps: 578.93
Wakeup Latencies percentiles (usec) runtime 10 (s) (11475 total samples)
50.0th: 14 (3264 samples)
90.0th: 82 (4138 samples)
* 99.0th: 3396 (1033 samples)
99.9th: 4020 (103 samples)
min=1, max=6018
Request Latencies percentiles (usec) runtime 10 (s) (11475 total samples)
50.0th: 7640 (1820 samples)
90.0th: 25568 (4577 samples)
* 99.0th: 52160 (1019 samples)
99.9th: 76672 (102 samples)
min=7576, max=100178
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 565 (3 samples)
* 50.0th: 579 (3 samples)
90.0th: 2084 (4 samples)
min=554, max=2088
average rps: 1147.50
sched delay: message 0 (usec) worker 0 (usec)
```
:::
- LLC (Last-Level Cache) load balance only
:::spoiler Result
```shell
./underdog/schbench/schbench -m 4 -t 4 -r 10
Wakeup Latencies percentiles (usec) runtime 10 (s) (6581 total samples)
50.0th: 116 (1930 samples)
90.0th: 1050 (2631 samples)
* 99.0th: 3004 (591 samples)
99.9th: 5080 (58 samples)
min=1, max=20952
Request Latencies percentiles (usec) runtime 10 (s) (6567 total samples)
50.0th: 22880 (1964 samples)
90.0th: 28064 (2578 samples)
* 99.0th: 41792 (591 samples)
99.9th: 56896 (59 samples)
min=10794, max=67114
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 571 (3 samples)
* 50.0th: 675 (3 samples)
90.0th: 697 (4 samples)
min=556, max=699
sched delay: message 278 (usec) worker 459 (usec)
current rps: 685.90
Wakeup Latencies percentiles (usec) runtime 10 (s) (6584 total samples)
50.0th: 116 (1932 samples)
90.0th: 1050 (2631 samples)
* 99.0th: 3004 (592 samples)
99.9th: 5080 (58 samples)
min=1, max=20952
Request Latencies percentiles (usec) runtime 10 (s) (6584 total samples)
50.0th: 22880 (1974 samples)
90.0th: 28064 (2583 samples)
* 99.0th: 41792 (592 samples)
99.9th: 56896 (59 samples)
min=10794, max=67114
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 571 (3 samples)
* 50.0th: 675 (3 samples)
90.0th: 697 (4 samples)
min=556, max=699
average rps: 658.40
sched delay: message 0 (usec) worker 0 (usec)
```
:::
- domains down to the L2 cache
:::spoiler Result
```shell
$ ./underdog/schbench/schbench -m 4 -t 4 -r 10
Wakeup Latencies percentiles (usec) runtime 10 (s) (9116 total samples)
50.0th: 29 (2691 samples)
90.0th: 827 (3611 samples)
* 99.0th: 14320 (820 samples)
99.9th: 21024 (83 samples)
min=1, max=52989
Request Latencies percentiles (usec) runtime 10 (s) (9103 total samples)
50.0th: 13616 (2777 samples)
90.0th: 24352 (3589 samples)
* 99.0th: 41920 (808 samples)
99.9th: 56128 (82 samples)
min=7559, max=70317
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 683 (3 samples)
* 50.0th: 743 (3 samples)
90.0th: 1186 (4 samples)
min=659, max=1194
sched delay: message 891 (usec) worker 753 (usec)
current rps: 682.42
Wakeup Latencies percentiles (usec) runtime 10 (s) (9121 total samples)
50.0th: 29 (2691 samples)
90.0th: 841 (3615 samples)
* 99.0th: 14320 (821 samples)
99.9th: 21024 (83 samples)
min=1, max=52989
Request Latencies percentiles (usec) runtime 10 (s) (9122 total samples)
50.0th: 13616 (2777 samples)
90.0th: 24352 (3601 samples)
* 99.0th: 41920 (813 samples)
99.9th: 56128 (82 samples)
min=7559, max=70317
RPS percentiles (requests) runtime 10 (s) (11 total samples)
20.0th: 683 (3 samples)
* 50.0th: 743 (3 samples)
90.0th: 1186 (4 samples)
min=659, max=1194
average rps: 912.20
sched delay: message 0 (usec) worker 0 (usec)
```
:::
Conclusion:
These results show that even under a workload, balancing at the L2-cache level degrades rather than improves overall performance: its wakeup latency is nearly 3x worse than both the default scheduler and LLC-level balancing.
## TODO
:::warning
- [ ] Training / inference
- [ ] Apply the trained model to `scx_rusty`
:::
<!-- ## 將 [Burn](https://github.com/tracel-ai/burn) 引入 scx 中
在原先的 scx 根目錄下使用`cargo new ml --lib` 來創造新的子 crate ,並在根目錄的 Cargo.toml 中的 member 加入路徑。
接著在 ml 資料夾底下的 Cargo.toml 中增加 brun 作為 dependency,可以直接用 cargo 的指令打或是手動加進去。
```shell
cargo add burn --features train,ndarray
```
```rust
[package]
name = "ml"
version = "0.1.0"
edition = "2024"
[dependencies]
burn = { version = "0.17", features = ["train", "ndarray"] }
csv = "1.3"
serde = { version = "1.0", features = ["derive"] }
```
最後為了能夠在我們的 scx_mlp 中使用到這個 carte,因此也要把 ml 加入到該 scx_mlp 中的 Cargo.toml 中。
加入後為了測試是否能夠使用到 ml 裡面的函式,我們在原先 scx_mlp 裡面的 `src/main.rs` 中加入 `use ml::add;` 及 `println!("Test ml dependency: 2 + 3 = {}", add(2, 3));` 若能夠成功呼叫便會在 scx_mlp 排程器具執行時看見 `Test ml dependency: 2 + 3 = {}", add(2, 3));` 的輸出。
```shell
06:35:02 [INFO] Autopilot mode is enabled.
06:35:02 [INFO] Opts {
autopilot: true,
autopower: false,
performance: false,
powersave: false,
balanced: false,
slice_max_us: 5000,
slice_min_us: 500,
preempt_shift: 6,
cpu_pref_order: "",
no_futex_boost: false,
no_preemption: false,
no_wake_sync: false,
no_core_compaction: false,
no_freq_scaling: false,
stats: None,
monitor: None,
monitor_sched_samples: None,
verbose: 0,
version: false,
help_stats: false,
}
Test ml dependency: 2 + 3 = 5 <- 確認能夠成功使用到 ml 的函式
06:35:02 [INFO] CPU pref order in performance mode: [8, 12, 9, 13, 0, 2, 4, 6, 10, 14, 1, 3, 5, 7, 11, 15]
06:35:02 [INFO] CPU pref order in powersave mode: [0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 14, 15, 8, 9, 12, 13]
06:35:03 [INFO] scx_lavd scheduler is initialized (build ID: 1.0.12-gb5c9a5f9-dirty x86_64-unknown-linux-gnu/debug)
06:35:03 [INFO] scx_lavd scheduler starts running.
``` -->
## Candle test
To try out the Candle framework, we trained on the classic iris dataset with a basic MLP and a ResNet. [GitHub link](https://github.com/cce-underdogs/rust-ml)
MLP results:
```shell
# Training
Loaded 100 samples with 4 features
First xs row: Tensor[5.1, 3.5, 1.4, 0.2; f32]
Epoch 10: loss = 0.8027, acc = 50.00%
Epoch 20: loss = 0.6380, acc = 69.00%
Epoch 30: loss = 0.4993, acc = 92.00%
Epoch 40: loss = 0.4038, acc = 99.00%
Epoch 50: loss = 0.3200, acc = 99.00%
Epoch 60: loss = 0.2517, acc = 99.00%
Epoch 70: loss = 0.1951, acc = 99.00%
Epoch 80: loss = 0.1499, acc = 99.00%
Epoch 90: loss = 0.1159, acc = 100.00%
Epoch 100: loss = 0.0909, acc = 100.00%
Saving 4 variables to model.safetensors
#Inference
Predicted probability: 0.8351
Predicted class: 1
```
ResNet results:
```shell
# Training
Loaded 100 samples with 4 features
Epoch 10: loss = 0.4236, acc = 89.00%
Epoch 20: loss = 0.2110, acc = 100.00%
Epoch 30: loss = 0.1030, acc = 100.00%
Epoch 40: loss = 0.0393, acc = 100.00%
Epoch 50: loss = 0.0148, acc = 100.00%
Epoch 60: loss = 0.0074, acc = 100.00%
Epoch 70: loss = 0.0044, acc = 100.00%
Epoch 80: loss = 0.0031, acc = 100.00%
Epoch 90: loss = 0.0024, acc = 100.00%
Epoch 100: loss = 0.0019, acc = 100.00%
Saving 8 variables to model.safetensors
# Inference
# Testing data : 5.6,3,4.1,1.3 -> GT : 1
Predicted class: 1
Probability: 0.9989
```
This confirms that our ML setup can classify the simple iris dataset.
## Data collection
The original `load_balance` code first applies hard filters to each task:
- whether the task is allowed to be scheduled on the domain
- whether it is a kworker
- whether it has already been migrated
```rust
.filter(|task| {
task.dom_mask & (1 << pull_dom_id) != 0
&& !(self.skip_kworkers && task.is_kworker)
        && !task.migrated.get()
})
```
After these hard filters pass, the code checks whether migrating the task would effectively reduce the load imbalance; if not, the task is not migrated. This is where we collect data: if migrating the task would not effectively reduce the load we store it with label 0, and if it would we store it with label 1.
```rust
// If the best candidate can't reduce the imbalance, there's nothing
// to do for this pair.
let old_imbal = to_push + to_pull;
if old_imbal < new_imbal {
std::mem::swap(&mut push_dom.tasks, &mut SortedVec::from_unsorted(tasks));
return Ok(None);
}
let load = *(task.load);
let taskc_p = task.taskc_p;
task.migrated.set(true);
std::mem::swap(&mut push_dom.tasks, &mut SortedVec::from_unsorted(tasks));
push_dom.transfer_load(load, unsafe { &mut *taskc_p }, pull_dom);
```
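The labeling rule reduces to the comparison in the snippet above: label 1 when migration shrinks the imbalance, label 0 otherwise. A minimal sketch of the recording step, where `Sample` and `record` are our own hypothetical names:

```rust
// Hypothetical sketch of the data-collection step, assuming the task's load
// and the old/new imbalance are already computed as in the snippet above.
struct Sample {
    load: f64,
    old_imbal: f64,
    new_imbal: f64,
    label: u8,
}

/// Label a migration candidate: 1 if migrating it reduces the imbalance.
fn record(load: f64, old_imbal: f64, new_imbal: f64) -> Sample {
    let label = if new_imbal < old_imbal { 1 } else { 0 };
    Sample { load, old_imbal, new_imbal, label }
}

fn main() {
    let s = record(0.5, 10.0, 7.0);
    // A CSV-style training row: features first, label last.
    println!("{},{},{},{}", s.load, s.old_imbal, s.new_imbal, s.label);
}
```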
## Reference
- [sched_ext(5): Integrate Machine Learning into scx_rusty Load balancer](https://hackmd.io/@vax-r/sched_ext_5)
## Appendix A: scx_lavd (Latency-criticality Aware Virtual Deadline)
:::spoiler `lavd` overview
>Overall procedure of the LAVD scheduler
>
>LAVD is a deadline-based scheduling algorithm, so its overall procedure is similar to other deadline-based scheduling algorithms. Under LAVD, a runnable task has its time slice and virtual deadline. The LAVD scheduler picks a task with the closest virtual deadline and allows it to execute for the given time slice.
* `main.rs` - It processes command line options and logs out scheduling statistics. See the more detailed overview of the LAVD design at main.bpf.c.
* Reading notes on the design in `main.bpf.c`
### Design notes
1. Overall procedure of the LAVD scheduler
    * Deadline-based scheduling algorithm
    * Picks the task with the closest virtual deadline and allows it to execute for the given time slice.
    * i.e., picks the earliest virtual $deadline$
        1. **CFS** picks the smallest $vruntime$
        2. **EEVDF** first selects by shortest $vruntime$, then by shortest $deadline$
2. **Latency criticality**: how to determine how latency-critical a task is
    * A task that is not latency-critical can be deferred; a latency-critical one cannot.
    * Criteria for latency criticality:
        1. the task's runtime is short $\to$ latency-sensitive
        2. the task frequently wakes other tasks
        3. the task frequently waits on other tasks
3. **Virtual deadline**: when to execute a task
    * Higher-priority tasks get shorter deadlines $\to$ more likely to be picked by the scheduler
4. **Time slice**: how long to execute a task
    * Borrows the time-slice computation ideas from CFS and scx_rustland
:::warning
We still need to look at how scx_rustland computes it
:::
    * High priority (small $NICE$ value) $\to$ longer time slice
    * Compute-intensive tasks $\to$ boosted and given a longer time slice
    * Newly forked tasks $\to$ assigned **only half of a regular time slice** (to guard against fork bombs)
5. **Fairness**: how to enforce the fair use of CPU time
    * Allocating time slices by nice value alone does not guarantee fairness $\to$ a task may get scheduled too often (too many time slices) or too rarely (too few).
    * If a task is over-scheduled $\to$ defer picking it again, reducing its execution frequency.
    >The deferring time - ineligible duration - is proportional to how much time is over-spent and added to the task's deadline.
6. **Preemption**
    * Yield-based (accounts for 70-90%):
        * if a HIGHER-priority task is waiting in the global run queue, the CPU must be yielded
        * a periodic tick checks whether the CPU should be yielded, and yields it directly
    * Kick-based: the [Power of Two Random Choices](https://medium.com/the-intuition-project/load-balancing-the-intuition-behind-the-power-of-two-random-choices-6de2e139ac2f) heuristic
        * pick two candidate victim CPUs at random, choose the less-loaded one, and send it an IPI (inter-processor interrupt)
        * _immediately_ schedule an urgent task, even paying a higher preemption cost
7. **Performance criticality**
    * Measures how much a task depends on CPU frequency.
    * Long runtime & frequent waiting/being woken $\to$ performance-critical
8. **CPU frequency scaling**
    * Two factors drive frequency increases:
        1. current CPU utilization
        2. the performance criticality of the running task
    * Policy: scale up fast, scale down slowly, to avoid performance oscillation and spikes.
9. **Core compaction**
    * When system-wide CPU utilization is low, pack tasks onto a few cores and put the rest into C-states (CPU idle states).
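The kick-based victim selection in item 6 can be sketched as follows. The xorshift RNG and the load array are stand-ins; lavd consults real per-CPU state when it picks the two candidates:

```rust
// Sketch of power-of-two-random-choices victim selection used by kick-based
// preemption. The RNG and load values are illustrative stand-ins.
fn xorshift(state: &mut u64) -> u64 {
    // Classic xorshift64; good enough for a demonstration.
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Pick two random CPUs and return the index of the less-loaded one.
fn pick_victim(loads: &[u64], seed: &mut u64) -> usize {
    let a = (xorshift(seed) as usize) % loads.len();
    let b = (xorshift(seed) as usize) % loads.len();
    if loads[a] <= loads[b] { a } else { b }
}

fn main() {
    let loads = [90u64, 10, 70, 40];
    let mut seed = 0x9e3779b97f4a7c15u64;
    let victim = pick_victim(&loads, &mut seed);
    println!("kick CPU {victim} (load {})", loads[victim]);
}
```

Sampling two candidates instead of scanning all CPUs keeps the preemption path cheap while still avoiding the worst-loaded CPU most of the time.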
### Data structures
* `task_struct`: the scheduling unit
* `task_ctx`: the task's context, containing lat_cri, perf_cri, ... (introduced below)
### main.bpf.c
* `calc_greedy_ratio`
    * Computes the "greedy ratio" a task has accumulated relative to the system-average service time.
    * If the task later over-uses the CPU, the scheduler applies an extra penalty, lowering its priority through a penalty factor.
* `calc_greedy_factor`
    * Derives the penalty factor (greedy_ft) from the greedy ratio.
    * The penalty factor enters the virtual-deadline computation: a task that is chronically greedy with the CPU gets a larger numerator, pushing its deadline later and lowering its chance of being picked by the scheduler.
* `calc_runtime_factor`
    * **short runtime $\to$ high factor**
    * **long runtime $\to$ low factor**
$$
\text{rsigmoid}(runtime,\ \text{max}) =
\begin{cases}
0, & \text{if } runtime \ge \text{max} \\
\text{max} - runtime, & \text{if } runtime < \text{max}
\end{cases}
\qquad \text{max} = \text{LAVD\_LC\_RUNTIME\_MAX}
$$
* `calc_freq_factor`
    * Reflects how frequently the task is woken up or waits; a high frequency indicates the task matters in a producer/consumer chain
$$
\text{sigmoid}(freq,\ \text{max}) =
\begin{cases}
\text{max}, & \text{if } freq > \text{max} \\
freq, & \text{if } freq \le \text{max}
\end{cases}
\qquad \text{max} = \text{LAVD\_LC\_FREQ\_MAX}
$$
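The two clamp-style factors above can be sketched directly from the piecewise definitions. The constant values below are placeholders, not the actual `LAVD_LC_RUNTIME_MAX` / `LAVD_LC_FREQ_MAX`:

```rust
// Sketch of the runtime and frequency factors, with placeholder caps.
const RUNTIME_MAX: u64 = 1_000_000; // stand-in for LAVD_LC_RUNTIME_MAX
const FREQ_MAX: u64 = 100_000;      // stand-in for LAVD_LC_FREQ_MAX

/// rsigmoid: short runtime -> high factor; at or above the cap -> 0.
fn runtime_factor(runtime: u64) -> u64 {
    if runtime >= RUNTIME_MAX { 0 } else { RUNTIME_MAX - runtime }
}

/// sigmoid: the frequency is capped at FREQ_MAX, otherwise passed through.
fn freq_factor(freq: u64) -> u64 {
    if freq > FREQ_MAX { FREQ_MAX } else { freq }
}

fn main() {
    println!("runtime_factor(1000) = {}", runtime_factor(1000));
    println!("freq_factor(200_000) = {}", freq_factor(200_000));
}
```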
* `calc_weight_factor`
    * Recomputes the weight, applying a `BOOST` for kernel tasks, affinitized tasks, and so on.
    * Increases the weight of tasks with the **wake-up**, **kernel**, **kernel worker**, **affinity**, **pinned**, **migration disabled**, or **lock holder** properties to speed them up.
    * Finally applies the nice value to set priority (respect nice priority).
* `calc_perf_cri`
    * Only when both big and little cores exist is the task's CPU performance sensitivity computed $\to$ if the sensitivity is high ($> 1$), use a big core
    * perf_cri defaults to $1$
> A task is more CPU-performance sensitive when it meets the following
> conditions:
>
> - It is in the middle of the task graph (high wait and wake
> frequencies).
> - Its runtime and frequency are high;
> - Its nice priority is high;
* `calc_lat_cri`
    * Computes the task's latency sensitivity
    * Mainly influenced by `wait_freq_ft`, `wake_freq_ft`, and `runtime_ft`
    * When both wait_freq and wake_freq are high, the task can be assumed to sit in the middle of a task chain, so its latency sensitivity is high.
    $\to$ `log2_u64(wait_freq_ft * wake_freq_ft)`
    * Tasks with longer runtime and larger weight (more boost) are also latency-sensitive.
    $\to$ `log2_u64(runtime_ft * weight_ft)`
    * Finally, if the task was woken by a latency-sensitive task, it is marked latency-sensitive as well.
    $\to$ `taskc->lat_cri = max(lat_cri, taskc->lat_cri_waker);`
    * If its waker is more latency-critical, inherit the waker's latency criticality.
* `calc_adjusted_runtime`
    * A task is **woken up to do a job**, so it is generally good to prioritize such a task.
    * Prefer a **short-running** (avg_runtime) and **recently woken-up** (acc_runtime) task.
    * To avoid starving CPU-bound tasks, which rarely sleep:
    * Both are measured against LAVD_ACC_RUNTIME_MAX. A high `acc_runtime` means the task has kept running and is likely CPU-bound, so its deadline is pushed back by an extra maximum, `LAVD_ACC_RUNTIME_MAX`; otherwise the deadline grows from the base value by `acc_runtime`.
$$
adjusted\_runtime =
\begin{cases}
2 \cdot \text{MAX}, & \text{if } acc\_runtime \ge \text{MAX} \\
\text{MAX} + acc\_runtime, & \text{if } acc\_runtime < \text{MAX}
\end{cases}
\qquad \text{MAX} = \text{LAVD\_ACC\_RUNTIME\_MAX}
$$
* `calc_virtual_deadline_delta`
    >Calculate the deadline based on runtime, latency criticality, and greedy ratio.
    * The returned value is later added to cur_logical_clk; cur_logical_clk + $\Delta$ becomes the task's virtual deadline.
    * short-running and recently woken up $\to$ small `greedy_ft` $\to$ small $\Delta$, early $deadline$ $\to$ more likely to be picked
    * very latency-critical $\to$ large `lat_cri` $\to$ small $\Delta$, early $deadline$ $\to$ more likely to be picked
$$
\Delta = \frac{adjusted\_runtime \times greedy\_ft}{lat\_cri}
$$
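Combining `calc_adjusted_runtime` and the formula above gives a compact sketch; `ACC_MAX` and all magnitudes are placeholder values, not the actual lavd constants:

```rust
// Sketch combining calc_adjusted_runtime and calc_virtual_deadline_delta,
// with ACC_MAX standing in for LAVD_ACC_RUNTIME_MAX. Units are illustrative.
const ACC_MAX: u64 = 1_000_000;

/// Clamp accumulated runtime so a CPU-bound task is pushed back by at most
/// one extra ACC_MAX.
fn adjusted_runtime(acc_runtime: u64) -> u64 {
    if acc_runtime >= ACC_MAX {
        2 * ACC_MAX
    } else {
        ACC_MAX + acc_runtime
    }
}

/// delta = adjusted_runtime * greedy_ft / lat_cri: a greedier task gets a
/// later deadline, a more latency-critical one an earlier deadline.
fn deadline_delta(acc_runtime: u64, greedy_ft: u64, lat_cri: u64) -> u64 {
    adjusted_runtime(acc_runtime) * greedy_ft / lat_cri.max(1)
}

fn main() {
    // Higher lat_cri -> smaller delta -> earlier virtual deadline.
    println!("{}", deadline_delta(1000, 1, 1));
    println!("{}", deadline_delta(1000, 1, 10));
}
```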
* `calc_time_slice`
    * `taskc` is empty (we are not yet sure what this means) $\to$ assign the MAX time slice (`5ms`)
    * otherwise assign the time slice: `taskc->slice_ns = sys_stat.slice;`
* `reset_suspended_duration`
    * Clears the suspended time: if `cpuc->online_clk > cpuc->offline_clk`, set `cpuc->offline_clk = cpuc->online_clk`.
* `get_suspended_duration_and_reset`
    * If `cpuc->online_clk > cpuc->offline_clk` $\to$ compute `duration = online - offline` and call `reset_suspended_duration`
    * otherwise `duration = 0`
* `advance_cur_logical_clk`
    * Keeps the virtual clock `cur_logical_clk` from falling behind task deadlines.
    * Advance the clock up to the task's deadline. When overloaded, advance the clock more slowly so others can jump into the run queue.
* `update_stat_for_running`
    * Updates task state when the task starts running.
* `update_stat_for_stopping`
    * Updates task state when the task stops running.
* `calc_when_to_run`
    * Before enqueueing a task to a run queue, decide when it should be scheduled.
    * Returns `cur_logical_clk` + $\Delta$
* `BPF_STRUCT_OPS(lavd_select_cpu)`
    * Decides which CPU task p should run on when it becomes runnable.
    * Idle CPUs are preferred
    * When wake_flags is detected, boost the waker's latency criticality
    * DSQ is empty $\to$ dispatch immediately
* `can_direct_dispatch`
    * The CPU is idle and the DSQ is empty $\to$ dispatch directly
* `BPF_STRUCT_OPS(lavd_enqueue)`: decides when and into which DSQ the task is enqueued, attempting preemption when necessary.
    1. Check that taskc is valid
    2. Compute dsq_vtime and slice
    3. Choose a DSQ → pick_proper_dsq()
    4. If the CPU is idle and the DSQ is empty:
    $\to$ fast-insert into the local DSQ $\to$ kick the CPU immediately $\to$ return
    5. Otherwise enqueue into the DSQ, ordered by dsq_vtime
    6. If that CPU is idle $\to$ kick it to run the new task $\to$ return
    7. No idle CPU $\to$ try to kick another lower-priority CPU (preemption)
* `BPF_STRUCT_OPS(lavd_dispatch)`
    1. prev is a lock holder $\to$ refill its slice and let it keep running $\to$ avoids getting stuck on lock contention
    2. The CPU is active (one of the primarily used CPUs) or in the overflow set (e.g., a CPU forcibly added for affinitized tasks) $\to$ consume tasks from the DSQ
    3. The CPU is neither active nor overflow $\to$ try to grow the overflow set or find an affinitized task
    4. Consume a suitable task if found; otherwise go idle or let prev keep running
:::warning
`active_cpumask`: the set of CPUs the scheduler actively uses
`ovrflw_cpumask`: CPUs passively added to satisfy affinity constraints
:::
* `BPF_STRUCT_OPS(lavd_runnable)`: triggered when a task is woken; updates the wakee using information from its waker
    * Resets its acc_runtime (accumulated runtime)
    * Updates various statistics (wake frequency, latency criticality) based on the waker task
    * Update wake frequency: frequent wakeups suggest an intermediate node in the task graph; the wake frequency is computed with an EWMA (exponentially weighted moving average)
    * `p_taskc->lat_cri_waker = waker_taskc->lat_cri`: propagate the waker's latency criticality to the wakee.
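The EWMA update for wake frequency can be sketched as below, assuming an equal weighting of old and new values (lavd's actual weighting may differ):

```rust
// Sketch of an EWMA-based wake-frequency update; the 1/2 weight and the
// nanosecond scaling are illustrative assumptions.
fn ewma(old: u64, new: u64) -> u64 {
    // Equal weighting of the previous average and the new observation.
    (old + new) / 2
}

/// Convert the interval since the last wakeup into an instantaneous
/// frequency, then fold it into the running average.
fn update_wake_freq(old_freq: u64, interval_ns: u64) -> u64 {
    let new_freq = 1_000_000_000 / interval_ns.max(1); // wakeups per second
    ewma(old_freq, new_freq)
}

fn main() {
    // Woken every 1 ms: instantaneous frequency is 1000/s.
    println!("{}", update_wake_freq(500, 1_000_000));
}
```

The averaging smooths out bursts, so a single unusual wakeup interval cannot swing the task's latency-criticality estimate.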
* `BPF_STRUCT_OPS(lavd_running)`: triggered when a task starts executing on a CPU
    * Updates the task's and the CPU's execution statistics (`update_stat_for_running`)
    * Assigns a slice based on latency criticality (`calc_time_slice`)
    * Updates the CPU performance target to help DVFS (`update_cpuperf_target`) $\to$ the key logic of Dynamic Frequency Scaling (DVFS)-aware scheduling
    * Computes when the task may be preempted (estimated stop time = now $+$ `avg_runtime`)
* `BPF_STRUCT_OPS(lavd_stopping)`: called when a task is about to switch out of the running state
    * e.g., preemption, voluntarily yielding the CPU, time slice exhausted
    * Updates the statistics of the task and its CPU when the task is about to leave the CPU.
* `BPF_STRUCT_OPS(lavd_quiescent)`: handles statistics updates when a task enters the quiescent (non-runnable) state
* `cpu_ctx_init_online`: initializes a CPU's context state when the CPU comes online
* `BPF_STRUCT_OPS(lavd_cpu_online)`: the BPF hook triggered when a CPU is hotplugged online
* `cpu_ctx_init_offline`: cleanup/initialization when a CPU goes offline
* `BPF_STRUCT_OPS(lavd_cpu_offline)`: the BPF hook triggered when a CPU is hotplugged offline
* `BPF_STRUCT_OPS(lavd_update_idle)`: triggered when a CPU enters or leaves idle; tracks the CPU's idle duration to compute CPU utilization
    * CPU enters idle $\to$ record the idle start time and clear the preemption flag (an idle task cannot be preempted)
    * CPU leaves idle $\to$
        * `idle_start_clk != 0` $\to$ compute the idle duration
        * `idle_start_clk == 0` $\to$ reset the idle time
* `BPF_STRUCT_OPS(lavd_set_cpumask)`: handles changes to a task's CPU affinity
    * A task allowed to run on fewer than all of the system's CPUs is considered affinitized
* `BPF_STRUCT_OPS(lavd_cpu_acquire)`: updates when `scx_lavd` regains control of a CPU
    * Records the time spent by the higher-priority sched class and performs capacity and frequency calibration to update the CPU's load estimate and frequency data
* `void BPF_STRUCT_OPS(lavd_cpu_release)`: cleanup and bookkeeping when `scx_lavd` releases a CPU to a higher-priority sched class
    * Reset the CPU's preemption information so it cannot be a victim (cannot be kicked).
    * Requeue the tasks in the local DSQ to the global queue (the DSQ on this CPU can no longer use it).
    * Reset the current CPU's performance target
* `BPF_STRUCT_OPS(lavd_enable)`: initializes a new task's `svc_time` to the system-wide minimum service time to avoid unfairness
* `init_task_ctx`: initializes each task's `task_ctx`
* `BPF_STRUCT_OPS(lavd_init_task)`: the initialization hook called when a task is created or handed over to SCX control
    * Initialization & memory allocation
* `init_cpdoms`: creates a DSQ for each valid compute domain (cpdom)
    * A compute domain is a set of CPUs treated as a scheduling unit
* `calloc_cpumask`
    * allocates memory
* `init_cpumasks`
    * initializes the cpumasks and records the online cpumask
* `init_per_cpu_ctx`
    * initializes each CPU's context
* `BPF_STRUCT_OPS_SLEEPABLE(lavd_init)`
    * lavd's init
:::
## `main.rs`
## Profiling: ML-based `scx_rusty`
### compile kernel
```shell
$ make clean
$ time make -j"$(nproc)"
```
* EEVDF
```
1245.72s user
118.13s system
1417% cpu // higher is better
1:36.24 total // lower is better
```
* default `scx_rusty`
```
1236.88s user
118.20s system
1315% cpu
1:42.99 total
```
* default `scx_rusty-l2` L2-balance (domain = L2 cache)
```
1183.22s user
116.42s system
1070% cpu
2:01.38 total
```
* ml-based `scx_rusty-l2-ml` (stress-ng set)
```
1241.19s user
118.28s system
1400% cpu
1:37.07 total
```
* ml-based `scx_rusty-l2-ml` (compile set)
---
### perf
* https://github.com/brendangregg/perf-tools
### fio
* https://github.com/axboe/fio