sched_ext(5): Integrate Machine Learning into scx_rusty Load balancer

`scx_rusty` overview

scx_rusty 身為 scx 專案目前收錄的一個排程器，將 CPU 分到不同 domain 當中，在 bpf program 當中實作主要的排程邏輯，而特別的是透過一個使用者層級的 load balancer 來幫助進行負載平衡，定時進行 load balance 運算並將資料透過 BPF map 更新， bpf program 進行 select_cpu 或 enqueue, dispatch 等行為時可透過讀取 BPF map 當中的資訊而得知是否改進行負載平衡。

Userspace part

使用者層級的程式透過 Rust 程式語言實作，主要負責兩個部分，第一個是提供 higher frequency tuning decision 。找出當前負載不沉重的 CPU 並將他們標記起來，使得他們可以從過載的 CPU 當中進行 task pull 。

第二個部分是運行 lower frequency load balancing ，透過比較 domain 之間的 load averages 來決定是否進行 load balancing 。如果 load difference 夠大則會檢查最多 1024 個 task 來判斷哪個 task 應該被移動（這也是我們可以優化之處）。

在使用者層級的操作成本低，而且負載平衡成本也不像過往高，進行的頻率也低，不過 work-conservation 依舊透過 tuning 和 greedy execution 來進行。

目前 scx_rusty 假設所有 domain 之間的距離都一樣，而且 processing power 也都一樣 。

Load Balancer

負載平衡的流程大致如下

利用 dom_ctx_map_elem bpf map 當中的 duty cycle buckets 來計算 domain load average ，利用 scx_utils::LoadCalculator 來計算 load 並利用 scx_utils::LoadLedger 來判斷 load distribution 。
拆分 NumaNode 和 Domain 的 hierarchy ，詳細可見 scheds/rust/scx_rusty/src/load_balance.rs
從 LB root 開始尋遍每個 NUMA node ，從 push nodes 也就是負載過高的 node 將任務遷移到 pull nodes 當中。此處進行 task migration 的代價會比 node 當中的 domain 進行 task migration 的代價高。因此這種 migration 只有當 imbalance 情況非常嚴重才會發生。
若 NUMA nodes 之間的負載平衡結束後，接下來會針對每個 NUMA nodes 當中的 domain 進行負載平衡，此處 task migration 的成本低。

BPF part

負責進行在每個 domain 間進行 round robin ，當一個 task 初次進入系統當中時（ rusty_prep_enable ），它們會透過 round robin 順序進行其中一個 domain 。
排程最主要在 rusty_select_cpu 當中進行，當一個 task 狀態變為 runnable 時會被呼叫， lb_data bpf map 是用來判斷此 task 是否需要被遷移到新的 domain ，這個 bpf map 的 producer 就是 userspace part 的 load balancer 。若不需要的話 task 排程的 priority order 如下

若 task 被同步喚醒而系統當中有其他 idle CPU ，則放到當前的 core 當中
若前一個 core 為 idle ，則放到它上面
若 task 有被指派到特定的 core 上則放到該 core 上
domain 當中任意的 idle CPU

若上述四項都不滿足，該 task 會被放到對應 domain 的 dispatch queue 當中。

rusty_dispatch 會嘗試從對應 domain 的 dispatch queue 當中 consume 一個 task 來運行。若沒有找到 task ，則會從其他 dispatch queue 當中進行 greedy load stealing 。

基本上 load balancing 完全交給使用者層級處理， BPF part 負責將 task weight, domain mask 和 current domain 生產進入 task_data bpf map 當中，並透過 lb_data bpf map 當中的資訊判斷是否進行 load balance 。

整合機器學習進入 Load balancer

Main idea

在使用者層級的 load balancer 實作當中，有一個函式稱為 try_find_move_task ，用來找到 push_dom 當中要遷移到 pull_dom 當中的 task ，若找到的話會更新對應 lb_data bpf map 當中的資料，並且回傳移動的 load 。

該函式當中尋找可以被遷移的 task 的實作當中，若可以適當的加入機器學習模型的 inference 來判斷哪個 task 應該被遷移，或許可以帶來更好的效果。

Training

透過 MLLB 當中的程式碼為基礎，以 tensorflow 訓練一個模擬 EEVDF/CFS 的 load balancing 機制的模型。也就是模仿 can_migrate_task() 的行為。

原本論文當中透過掛載 kprobe 在 can_migrate_task() 上蒐集資料的方式可以採納，但拿來訓練的 feature 必須改變，因為在 scx_rusty 當中並不具有 struct lb_env 這個結構體，用來進行負載平衡的 lb_data, task_data 成員須擴充，但無法完全滿足 struct lb_env 的所有成員。

當前 scx_rusty 完全只依據 task load 來判定是否進行 load balance 和 task migration 。我們依照原論文方法重新蒐集資料，也就是分別在 stress-ng --cpu 30 -l 100 --timeout 120s, stress-ng --cpu 60 -l 100 --timeout 120s, stress-ng --cpu 120 -l 100 --timeout 120s 情況下蒐集系統資料，之後只利用四項 feature 進行訓練，分別是 cpu_idle, cpu_not_idle, src_load, dst_load （應該要再加上 NUMA 資訊，但該程式嘗試取得 NUMA 相關資料時會出錯，尚待解決）。

把訓練好的模型檔案搬移至 scx_rusty 資料夾底下，準備進行 inference 。

Inference

Rust 剛好有能整合 tensorflow 的套件，安裝該套件後實作 inference 應能變得很方便。
首先要在 scx_rusty 的 Cargo.toml 檔案當中新增套件

[dependencies]
tensorflow = "0.21.0"

之後擴充 struct task_ctx 使我們可以從 kernel space 取得要拿來 inference 的資料 (位在 scheds/rust/scx_rusty/src/bpf/initf.h)

+	/* Machine Learning inference needed data */
+	unsigned long src_dom_load;
+	unsigned long dst_dom_load;
+	int cpu_idle;
+	int cpu_not_idle;
+	int cpu;
+};

另外在 main.bpf.c 當中修改 init 。

+		.cpu = bpf_get_smp_processor_id(),
+		.cpu_idle = 1,
+		.cpu_not_idle = 0,

之後在 scheds/rust/scx_rusty/src/load_balance.rs 當中定義 tensorflow 的 struct 和相關 method


extern crate tensorflow;

use tensorflow::Graph;
use tensorflow::Session;
use tensorflow::SessionOptions;
use tensorflow::SessionRunArgs;
use tensorflow::Tensor;
use std::error::Error;

struct TensorFlowModel {
    graph: Graph,
    session: Session,
}

impl TensorFlowModel {
    fn new(model_dir: &str) -> Result<Self, Box<dyn Error>> {
        let mut graph = Graph::new();
        let bundle = tensorflow::SavedModelBundle::load(
            &SessionOptions::new(),
            &["serve"],
            &mut graph,
            model_dir,
        )?;
        let session = bundle.session;
        
        Ok(TensorFlowModel { graph, session })
    }

    fn predict(&self, input_data: Vec<f64>) -> Result<bool, Box<dyn Error>> {
        let input_tensor = Tensor::new(&[input_data.len() as u64]).with_values(&input_data)?;

        let input_op = self.graph.operation_by_name_required("serving_default_input")?;
        let output_op = self.graph.operation_by_name_required("StatefulPartitionedCall")?;

        let mut args = SessionRunArgs::new();
        args.add_feed(&input_op, 0, &input_tensor);
        let output_token = args.request_fetch(&output_op, 0);

        self.session.run(&mut args)?;

        let output_tensor: Tensor<f64> = args.fetch(output_token).unwrap();
        let output_value = output_tensor[0];
        Ok(output_value == 1.0)
    }
}

可以看到 inference 函式就是 predict 。
之後實作 inference call back 函式

    fn migrate_inference(&self, cpu: &i32, cpu_idle: &i32, cpu_not_idle: &i32, src_dom_load: &f64, dst_dom_load: &f64) -> bool {
        let mut input_vec: Vec<f64> = Vec::with_capacity(5 as usize);
        input_vec.push(f64::from(*cpu));
        input_vec.push(f64::from(*cpu_idle));
        input_vec.push(f64::from(*cpu_not_idle));
        input_vec.push(*src_dom_load);
        input_vec.push(*dst_dom_load);

        self.inference_model.predict(input_vec).unwrap()
    }

呼叫的方式很簡單，在 try_find_move_task 的最後檢查挑選出的 task 是否適合遷移

+        let migrate_or_not = self.migrate_inference(&task.cpu, &task.cpu_idle, &task.cpu_not_idle, &push_load, &pull_load);
+        task.migrated.set(migrate_or_not);

之後進行測試，首先進行同時編譯 linux kernel 與進行 fio 測試情況下 task migration 次數

$ sudo perf stat -e sched:sched_migrate_task -a sleep 120

 Performance counter stats for 'system wide':

            6,1749      sched:sched_migrate_task

     120.200994178 seconds time elapsed

可以看到遷移次數相對原本 scx_rusty 有顯著的下降，再來測試編譯一次 linux kernel 的時間

$ time sudo make -j $(nproc)
...

real    4m23.921s
user    0m0.046s
sys     0m0.094s

居然還比 scx_rusty 編譯的更快！

目前模型的訓練與 inference 都還有很大的改進空間，但已經可以顯示原本 scx_rusty 過多的 task migration 是不必要的，不只對於 load imbalance 沒有過多幫助，整體效能也有提升的空間，當然已經比原本的 EEVDF 好上許多。

sched_ext(5): Integrate Machine Learning into scx_rusty Load balancer

scx_rusty overview

Userspace part

Load Balancer

BPF part

整合機器學習進入 Load balancer

Main idea

Training

Inference

Read more

2024q1 Homework1 (lab0)

Rust learning note (2)

Concurrency 學習筆記

eBPF and related tools

`scx_rusty` overview