
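# Linux Kernel Project: Exploring sched_ext and Machine Learning
> Contributors: [`EricccTaiwan`](https://github.com/EricccTaiwan), [`charliechiou`](https://github.com/charliechiou)
> [Project walkthrough video](https://youtu.be/_rz-CAbTRkg), [slides](https://www.slideshare.net/slideshow/2025-linux-sched_ext-pdf/281093837)

### Reviewed by `salmoniscute`
I did not fully follow the label part. Once a sample is labeled 0 or 1, is that value fixed for good? If it is never updated, my question is this: system load changes dynamically. If the label means "migrating this task at this moment would make the imbalance worse", that does not imply the task is unsuitable for migration forever, or at other moments. How do you then guarantee the model's accuracy and generalization?

> > Once a sample is labeled 0 or 1, is that value fixed for good?
>
> The label records whether the state at that moment is suitable for migration. The decision itself still follows the original `scx_rusty` mechanism, so once written the value is final and is not changed.
>
> > That does not imply the task is unsuitable for migration forever, or at other moments. How do you then guarantee the model's accuracy and generalization?
>
> That statement is correct. To secure accuracy and generalization, the choice of task properties used as ML input is therefore important. Badly chosen features can mislead the model and lower accuracy, while too many features can cause overfitting and hurt generalization.
> However, the features `scx_rusty` currently exposes are all pre-packaged values that can be read from Rust, which is not the best choice. For the model to learn the system's dynamic behavior, it would be more appropriate to follow `scx_lavd` and extract data from the kernel directly with eBPF. This is our next goal.
> [name=Charlie Chiou]
>
> > > However, the features `scx_rusty` currently exposes are all pre-packaged values that can be read from Rust, which is not the best choice.
> >
> > In other words, we can only use the features `scx_rusty` already defines, which also limits the inputs available to the ML model.
> > [name=EricccTaiwan]

### Reviewed by `wurrrrrrrrrr`
Why did you choose ResNet as the model rather than another one?

> We started with a plain MLP. Experiments showed that because the input features are too simple, gradients tended to vanish and training did not converge (which also reflects room for improvement in data collection). To reduce vanishing gradients we switched to a deeper ResNet. Once data collection is improved we expect to start again from an MLP, to avoid an overly complex model.
> [name=Charlie Chiou]

> > the input features are too simple
>
> See the discussion with `salmoniscute` above.
> [name=EricccTaiwan]

### Reviewed by `HeatCrab`
So overall, is `sched_ext` an umbrella term for a kind of scheduler, meant to let users build a scheduler tailored to their own needs, somewhat like a plug-in? During development, do you also have to consider how `sched_ext` interacts or conflicts with EEVDF, the Linux kernel scheduler? Or were the necessary safeguards already in place when `sched_ext` was released?

> `sched_ext` can be viewed from two sides. The kernel-space part implements the **shell** of a scheduler, while user space injects its **soul** through eBPF to complete the custom scheduling behavior. Once a custom `sched_ext` scheduler is loaded, the tasks previously handled by EEVDF are handed over to `sched_ext`, so the two do not conflict. If the loaded custom scheduler misbehaves, the system safely falls back to the default scheduler.
> [name=Charlie Chiou]

> > So overall, is `sched_ext` an umbrella term for a kind of scheduler
>
> `SCHED_EXT` is a new scheduling class (scheduler class) in the Linux kernel, whereas `sched_ext` is a "feature": users rely on it to build their own scheduler from user space and implement the scheduling decisions in `eBPF` (e.g., `main.bpf.c`). Compare [slide 10](https://www.slideshare.net/slideshow/2025-linux-sched_ext-pdf/281093837#10).
>
> > During development, do you also have to consider how `sched_ext` interacts or conflicts with EEVDF, the Linux kernel scheduler? Or were the necessary safeguards already in place when `sched_ext` was released?
>
> Conflicts do not need to be considered: `sched_ext` was only merged in v6.12 once it had a safe fallback mechanism. Potential conflicts were one of the reasons Peter Zijlstra, the CPU scheduler [maintainer](https://docs.kernel.org/process/maintainers.html#scheduler), originally opposed merging `sched_ext`, and Tejun Heo (the `sched_ext` [maintainer](https://docs.kernel.org/process/maintainers.html#scheduler-sched-ext)) committed to handling all related scheduler conflicts. For details see [[PATCHSET v6] sched: Implement BPF extensible scheduler class](https://lore.kernel.org/bpf/20240501151312.635565-1-tj@kernel.org/).
> [name=EricccTaiwan]

### Reviewed by `Andrushika`
When comparing EEVDF with the default `scx_rusty`, you note that although `scx_rusty` migrates tasks far more often, performance does not improve, and you attribute this to "the workload being too simple" (compiling the kernel). How was this conclusion reached? Could cache misses or some other issue be the cause?

> > Although `scx_rusty` migrates tasks far more often, performance does not improve, and you attribute this to "the workload being too simple" (compiling the kernel).
>
> When `scx_rusty` decides whether a task may migrate, it only checks whether the load between `push_domain` (producer) and `pull_domain` (consumer) becomes "more balanced" after the migration than before. The same behavior can be seen in last year's [final project on the CPU scheduler](https://youtu.be/G6p0Y9DZJsM?t=1395). The load-balancing mechanism is simple, yet the [README](https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_rusty#production-ready) describes it as "Production Ready", and we only tested with simple workloads (e.g., `stress-ng`, compiling the kernel). That is how we arrived at "the workload is too simple".
>
> > Could cache misses or some other issue be the cause?
>
> Agreed. `scx_rusty` migrating too often may cause more cache misses. We will ask the developers about this.
> [name=EricccTaiwan]

> We compare migration counts to show with data that `scx_rusty` relies on migration to improve performance. Since neither CPU utilization nor execution time improved, we infer that in this workload performance is already sufficient without migration, and that migration instead drags performance down.
>
> In addition, consider the difference between the L3-cache-based and L2-cache-based variants of `scx_rusty`. The L3-cache-based variant does not move tasks between L2 domains, while the L2-cache-based one does, which produces the cache-miss effect mentioned in your question (i.e., the cost of a context switch). To determine whether cache misses are really the cause, we first need to understand when migration happens under EEVDF, which is still unclear to us.
> [name=Charlie Chiou]

### Reviewed by `dingsen-Greenhorn`
"DispatchTaks" should be "DispatchTasks".

> Typo fixed.
> [name=EricccTaiwan]

### Reviewed by `yy214123`
I am curious about two points:
- Why is this repo developed mainly in Rust rather than C? What trade-offs are involved?
- Since ML is introduced, inference necessarily has a cost. Can its impact be quantified?

> > Why is this repo developed mainly in Rust rather than C? What trade-offs are involved?
>
> In my view one of the biggest advantages is Rust's rich crate ecosystem, loosely comparable to header files in C. In addition, C requires manual memory management, while Rust guarantees memory "safety" at compile time through ownership and lifetimes. See "[Rust和Linux之争，到底在争什么？](https://youtu.be/ONZZvc_IqQg?si=CE1yQkrXX1pGOPPJ)".
> That said, C and Rust are not the only languages usable for `sched_ext` schedulers. Examples include Ian's [`Gthulhu`, written in Go](https://github.com/Gthulhu/Gthulhu) and [Writing a Linux scheduler in Java with eBPF](https://youtu.be/JWwX3uCEPO8?si=ehhTTZXEQZ5Im87S). Any language that can [integrate with `libbpf`](https://github.com/libbpf) can be used for scheduler development.
>
> [Developer reply](https://hackmd.io/@cce-underdogs/SJgyJEzBge) : I think it was more that there was someone maintaining [libbpf-rs](https://github.com/libbpf/libbpf-rs) at meta [[name=Daniel H]](https://github.com/hodgesds)
>
> > Inference necessarily has a cost. Can its impact be quantified?
>
> We cannot yet quantify the inference cost precisely, but we have not observed noticeable overhead so far. As [slide 19](https://www.slideshare.net/slideshow/2025-linux-sched_ext-pdf/281093837#19) shows, kernel compile time was about 20 seconds shorter after introducing ML. Inference may add some overhead, but the measurements show no significant impact.
> [name=EricccTaiwan]

> `scx_lavd` and `scx_rusty` both contain plenty of complex computation, and moving computation out of the kernel is one of the goals of `sched_ext`. How to measure the impact a scheduler itself has during scheduling is something we have not figured out yet.
> [name=charliechiou]

### Reviewed by `Ian-Yen`
> When `scx_rusty` decides whether a task may migrate, it only checks whether the load between `push_domain` (producer) and `pull_domain` (consumer) becomes "more balanced" after the migration than before.

The "may it migrate" logic currently looks only at whether the push/pull domain load becomes more balanced. Isn't this strategy too greedy, ignoring costs such as cache locality and execution continuity?

> This is the design of the default `scx_rusty`.
> We have asked the Meta developers about the rationale behind it and have not received a reply yet. The logic can be found in the `perform_balancing` function of [`scx_rusty/load_balance.rs`](https://github.com/sched-ext/scx/blob/main/scheds/rust/scx_rusty/src/load_balance.rs).
> [name=EricccTaiwan]

> Our current strategy is to first replace the heuristic part of the original algorithm with ML. Later, aspects that matter for a particular use case can be added to the ML decision. In this first attempt we therefore only use the features and scheduling logic that are already packaged internally. Cache locality, continuity cost, and similar factors may be added as ML inputs later.
> "Checking whether the push/pull domain load becomes more balanced" is the labeling rule we derived by imitating the original scheduler. How to decide on migration once cache locality and other factors are taken into account is something the ML part will have to address later.
> [name=Charlie Chiou]

### Reviewed by `RealBigMickey`
In the current design, every ML inference in `try_find_move_task()` has to fetch the task features from an eBPF map into Rust user space. My questions:
- Do you plan to quantify the time cost of kernel-to-user-space data transfer? (For example, how much latency does reading one task context add?)
- Could this overhead be the bottleneck of the whole scheduling loop, or is its real impact small?
- If the ML decision logic were moved into kernel space, would performance improve materially? Or would the limits of kernel space (no FP support, execution-time cap) keep the overall benefit small?

> > Do you plan to quantify the time cost of kernel-to-user-space data transfer?
> > Could this overhead be the bottleneck of the whole scheduling loop, or is its real impact small?
>
> `sched_ext` schedulers do commonly face this cost. For `scx_rustland`, Andrea Righi pointed out that "[The main bottleneck ... is the communication between kernel and user space.](https://arighi.blogspot.com/2024/08/re-implementing-my-linux-rust-scheduler.html)", which is why the pure-`eBPF` `scx_bpfland` was developed later to remove this overhead. We have not seen quantitative data from the community so far and will add measurements in the future.
>
> > If the ML decision logic were moved into kernel space, would performance improve materially? Or would the limits of kernel space (no FP support, execution-time cap) keep the overall benefit small?
>
> The lack of FP support will certainly be an issue. If ML-based load balancing is integrated into kernel space, it has to be rewritten with fixed-point arithmetic, lookup tables, or another quantized form, and every invocation must finish within the limits enforced by the `eBPF` verifier. By design, `sched_ext` encourages developers to keep scheduling policy in user space, so that they can iterate quickly, pick any ML framework, and let companies keep a competitive edge in closed-source environments.
> [name=EricccTaiwan]

## Task overview
[sched_ext](https://github.com/sched-ext/scx) (`scx`), introduced in Linux v6.12, lets developers use `eBPF` to load or swap CPU schedulers dynamically from user space. This project tries to combine it with machine learning: CPU-scheduling event data is aggregated through BPF maps, and time slice, CPU affinity, and task migration are adjusted dynamically based on inference. Planned topics:
* Review of CFS / EEVDF
* Innovations and mechanisms of `sched_ext`
* From custom FCFS / RR schedulers to machine learning, including a load-prediction mechanism

## TODO: Review of CFS / EEVDF
> Reference: 《[Demystifying the Linux CPU Scheduler](https://github.com/sysprog21/linux-kernel-scheduler-internals)》 - by Ching-Chun (“[Jserv](https://wiki.csie.ncku.edu.tw/User/jserv)”) Huang

### Completely Fair Scheduler (CFS)
CFS was the default Linux scheduler from v2.6.23 onward. Its core idea is to use $vruntime$ (virtual runtime) to emulate an ideal machine on which all runnable tasks run simultaneously and share the CPU equally. Based on its weight (derived from its $NICE$ value), each task computes

$$
\text{delta_vruntime} = \frac{\text{delta_exec} \times \text{NICE_0_WEIGHT}}{\text{task_weight}}
$$

and adds the result to its $vruntime$. A task with a higher weight (smaller $NICE$ value, i.e., less nice) accumulates $vruntime$ more slowly.

CFS maintains a $vruntime$ for every task and always picks the task with the smallest $vruntime$ (the one that is furthest "behind"), which keeps tasks fair to one another. The actual time slice is then derived from the $\textit{target latency}$ (`sched_latency_ns`) [1] and the $\textit{minimum granularity}$ (`sched_min_granularity_ns`) [2], so that high-weight tasks get more CPU time while low-weight tasks do not starve. CFS also keeps `min_vruntime` up to date and uses it to initialize newly created or newly woken tasks, so that their $vruntime$ does not end up far from the minimum in the queue. In this way CFS distributes CPU time fairly by weight and balances latency against throughput as load increases.

[1]: the minimum amount of time idealized to an infinitely small duration required for every runnable task to get at least one turn on the processor
[2]: minimum amount of time that can be assigned to a task

### Earliest Eligible Virtual Deadline First (EEVDF)
EEVDF is the default scheduler that replaced CFS starting with Linux v6.6. Its core idea is to maintain fairness through eligibility first, and then to order eligible tasks by their virtual deadline ($\text{virtual deadline}$). For every task it computes

$$
\begin{aligned}
\text{lag} &= \overline{vruntime} - vruntime \\
\text{deadline} &= vruntime + \Delta t_{\text{slice}} \\
\Delta t_{\text{slice}} &= \frac{\text{sched_base_slice} \times \text{NICE_0_WEIGHT}}{\text{task_weight}} \\
\end{aligned}
$$

where $\overline{vruntime}$ is the weighted average of the $vruntime$ of all tasks in the queue. $lag \geq 0$ means the task has received less CPU time than its fair share (under-served) and is therefore eligible. Among the eligible tasks, the scheduler picks the one with the earliest $\text{deadline}$. After the task has actually run for $\text{delta_exec}$, its $vruntime$ and its new $\text{deadline}$ are updated and it is reinserted into the red-black tree. Because $\Delta t_{\text{slice}}$ is inversely proportional to the weight, a high-weight task (small $NICE$ value) gets an earlier virtual deadline for the same request and its $vruntime$ advances more slowly, so it is picked more often and receives a proportionally larger share of CPU time. A low-weight task gets a later virtual deadline and a smaller share. Newly created or newly woken tasks are placed relative to the current $\overline{vruntime}$ while preserving their previous $lag$, which avoids a large gap to the rest of the queue. With this two-level "eligible + $\text{virtual deadline}$" mechanism, EEVDF keeps the weight-based fairness of CFS and uses the virtual deadline to serve tasks with shorter requests sooner.

## TODO: Innovations and mechanisms of `sched_ext`
* [Notes on `scx`](https://hackmd.io/@cce-underdogs/speech_notes)

### Why `sched_ext` ?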
CFS and EEVDF are the default general-purpose schedulers of the Linux kernel. They focus on the throughput of all tasks as a whole, but not every scenario cares about overall throughput. `sched_ext` addresses this pain point by offering more design freedom: developers can build schedulers for specific scenarios, such as `scx_flash`, `scx_rusty`, and `scx_lavd`, each designed for a different use case. `scx_lavd`, for example, focuses on improving the gaming experience (e.g., playing games on a Steam Deck running Arch Linux). While gaming, what matters is lowering the latency of the game itself (i.e., the **latency-critical tasks**), not raising the combined throughput of the game and the background programs.

> `scx_lavd` is introduced by [Changwoo Min](https://multics69.github.io/) (Igalia)
> lavd: **L**atency-criticality **A**ware **V**irtual **D**eadline scheduling algo

A task that sits in the middle of a task chain is latency-critical, because it affects the latency of the whole chain. How can a scheduler tell whether a task is latency-critical? Prof. Changwoo Min looks at how often a task is woken up (wakee) and how often it wakes other tasks (waker), and uses these frequencies to adjust the available time slice, preemptibility, and so on of latency-critical tasks. This design does introduce unfairness, but as stated above, the goal of this scheduler is a better gaming experience through lower latency for latency-critical tasks, so overall throughput is not the priority.

---

### and How ?

![image](https://hackmd.io/_uploads/rkLZv1tVgx.png)
> Image source: [sched_ext: scheduler architecture and interfaces (Part 2)](https://blogs.igalia.com/changwoo/sched-ext-scheduler-architecture-and-interfaces-part-2/)

![image](https://hackmd.io/_uploads/SJSfARdNll.png)
> Image source: [Crafting a Linux kernel scheduler in Rust - Andrea Righi](https://www.youtube.com/watch?v=L-39aeUQdS8)

`sched_ext` is a scheduling class in the Linux kernel ([`kernel/sched/ext.c`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/ext.c)). Its core idea is to move the scheduling decision logic from kernel space to user space, with `eBPF` as the bridging layer that performs the message passing, so that users can implement scheduling algorithms in user space without modifying kernel code. As the figures show, the architecture has three main layers:

- Kernel layer (`sched_ext` module): provides a minimal but complete scheduling framework, handles the low-level work around context switching, and uses Dispatch Queues (DSQs) as the queues of runnable units, delegating the concrete scheduling decisions to the layer above.
- eBPF layer (eBPF scheduler): the logical intermediary between `sched_ext` and user space. It defines a set of scheduling operations in `struct sched_ext_ops` (e.g., `select_cpu`, `enqueue`, `dispatch`). These callbacks are implemented by `eBPF` programs and control which DSQ a task goes into, which CPU should run it, and how a DSQ is refilled when it runs out of work.
- User-space layer: loads and registers the eBPF scheduler, operates on the eBPF object and maps through `libbpf` or `libbpf-rs`, and is free to design, implement, and update scheduling policies.

In short, a scheduler implemented from user space through `sched_ext` is essentially like a dynamically loadable kernel module. The difference is that it is not built as a traditional kernel module: it is implemented as an `eBPF` program that the in-kernel `sched_ext` framework loads and runs.

## TODO: From custom FCFS / RR schedulers to machine learning
First, if you want to implement or develop an `scx` scheduler, we strongly recommend joining the [official `Discord` server](https://discord.gg/b2J8DrWa7t). You can ask the developers (e.g., [Tejun Heo](https://github.com/htejun)) for help directly, and there are weekly `Office hours` where you can ask questions in person (the developers also report their own progress there). Second, the main development effort of this repo goes into the [Rust Scheduler](https://github.com/sched-ext/scx/tree/main/scheds/rust), not the [C Scheduler](https://github.com/sched-ext/scx/tree/main/scheds/c). Unless a C scheduler has a serious bug that needs fixing, PRs against it bring **no tangible benefit** to the repo. See [scx#1827](https://github.com/sched-ext/scx/pull/1827).

### [sched_ext environment setup](https://hackmd.io/@cce-underdogs/linux-exp1)

#### 1. Disable E-cores
On 12th-generation Intel CPUs and later, the [hybrid P-core/E-core design](https://www.intel.com.tw/content/www/tw/zh/support/articles/000091896/processors.html) affects the experimental results, so we disable the E-cores for now to avoid their influence.

:::spoiler Reference: [Intel系列主機板如何關閉CPU部分核心（即E-core）？](https://www.asus.com/hk/support/faq/1054283/#222)
- Inspect the P/E cores
```shell
$ lstopo # E-cores are present, so they must be disabled
```
![image](https://hackmd.io/_uploads/S14Vs6oyll.png)
![image](https://hackmd.io/_uploads/HkxB7BA1ex.png=30%x)
:::

:::spoiler Experiment environment #1 (x86)
```shell
$ uname -r
6.14.0-16-generic
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 25.04
Release:        25.04
Codename:       plucky
```
```shell
$ neofetch
$ lstopo
```
**Before disabling E-cores**
- CPU: 12th Gen Intel i9-12900K (24) @ 5.100GHz

![image](https://hackmd.io/_uploads/SyIUqkh1ge.png=30%x)
![image](https://hackmd.io/_uploads/r1aWUAoygg.png=10%x)

**After disabling E-cores**
- CPU: 12th Gen Intel i9-12900K (16) @ 5.100GHz

![image](https://hackmd.io/_uploads/r1WKMSAkxg.png=30%x)
![image](https://hackmd.io/_uploads/B1pkQSC1ee.png=30%x)
:::

:::spoiler Experiment environment #2 (arm)
```shell
$ uname -r
6.14.0-16-generic
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 25.04
Release:        25.04
Codename:       plucky
```
![image](https://hackmd.io/_uploads/SyzjokNNlg.png)
![lstopo](https://hackmd.io/_uploads/HJG3iyENeg.png)
:::

#### 2. Versions and sched_ext support

##### 2-A. Kernel version requirement (at least v6.12)
:::spoiler Kernels newer than v6.12 perform about the same as v6.12
[sched_ext is supported by the upstream kernel starting from version 6.12](https://arighi.blogspot.com/search?updated-max=2025-05-01T00%3A17%3A00%2B02%3A00&max-results=1), so kernel 6.12 or later is required. As for whether `scx` behaves differently on 6.13 or 6.14, [Tejun Heo](https://github.com/htejun) (the `scx` maintainer) replied:
> There are new features introduced which may improve performance in some cases (e.g. queued_wakeup support) but for the most part, the kernel versions wouldn't cause noticeable differences. All schedulers should work fine across the kernel versions. [name=Tejun Heo]
:::

##### 2-B. Kernel and Ubuntu versions
> We upgraded the experiment environment step by step from Ubuntu 24.04 $\to$ 24.10 $\to$ 25.04 (each upgrade moves up only one release, so two upgrades are needed). If disk space is limited, an older Ubuntu with kernel v6.12+ also works as an experiment environment (e.g. Ubuntu 24.04 w/ kernel v6.12).

```shell
OS: Ubuntu 25.04 x86_64
Kernel: 6.14.0-16-generic
```

::: spoiler Method 1: upgrade to Ubuntu 25.04
```shell
$ sudo apt update
$ sudo apt upgrade
$ sudo apt dist-upgrade
$ sudo apt install update-manager-core
```
The latest LTS is Ubuntu 24.04, and the releases after it are not LTS, so the upgrade settings must be changed to allow non-LTS releases:
```shell
$ sudo vim /etc/update-manager/release-upgrades
```
Change `Prompt=lts` on the last line to `Prompt=normal`:
```diff
[DEFAULT]
# Default prompting and upgrade behavior, valid options:
#
#  never  - Never check for, or allow upgrading to, a new release.
#  normal - Check to see if a new release is available.  If more than one new
#           release is found, the release upgrader will attempt to upgrade to
#           the supported release that immediately succeeds the
#           currently-running release.
#  lts    - Check to see if a new LTS release is available.  The upgrader
#           will attempt to upgrade to the first LTS release available after
#           the currently-running one.  Note that if this option is used and
#           the currently-running release is not itself an LTS release the
#           upgrader will assume prompt was meant to be normal.
- Prompt=lts
+ Prompt=normal
```
Then run
```shell
$ sudo do-release-upgrade -d # start upgrading Ubuntu
```
After the upgrade, check that it succeeded:
```shell
$ lsb_release -a
```
Each upgrade moves up only one release, so going from 24.04 to 25.04 requires upgrading to 24.10 first and then to 25.04, two upgrades in total.
:::

:::spoiler Method 2: install only kernel v6.12
Download the kernel source first:
```shell
$ wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.12.tar.xz
$ tar -xf linux-6.12.tar.xz
$ cd linux-6.12
```
Copy the configuration of the running system and adjust it:
```shell
$ cp /boot/config-$(uname -r) .config
$ make menuconfig
```
Build the kernel (this can take a long time):
```shell
$ make -j$(nproc)
$ sudo make modules_install
$ sudo make install
```
Finally update grub and reboot:
```shell
$ sudo update-grub
$ sudo reboot
```
Confirm the kernel version after the upgrade:
```shell
$ uname -r # should show 6.12
```
:::

##### 2-C. Check whether the running kernel supports sched_ext
```shell
$ ls /sys/kernel/ | grep sched_ext
```
- Check the current state of sched_ext
```shell
$ cat /sys/kernel/sched_ext/state
```
Taking `scx_rustland` as an example, the output before and after running `scx_rustland` is:
```shell
$ # Before attached scx_rustland
$ cat /sys/kernel/sched_ext/state
disabled
$ # After attached scx_rustland
$ cat /sys/kernel/sched_ext/state
enabled
```

##### 2-D. Download `sched-ext` (`scx` for short)
::: spoiler Reference: [Build & Install](https://github.com/sched-ext/scx?tab=readme-ov-file#build--install)
The steps below use Ubuntu/Debian as an example.

###### a. Install `meson`
> Note: Many distros only have earlier versions of meson, in that case just clone the meson repo and call `meson.py`
> e.g. `/path/to/meson/repo/meson.py compile -C build` .
>
> Alternatively, use pip e.g. `pip install meson` or `pip install meson --break-system-packages` (if needed).

###### b. Install dependencies
```shell
$ sudo apt install build-essential libssl-dev llvm lld libelf-dev meson cargo rustc clang llvm cmake pkg-config protobuf-compiler
```

###### c. Static linking against `libbpf` (preferred)
> With this approach both the C and the Rust schedulers build successfully.

```shell
$ cd ~/scx
$ meson setup build --prefix ~
$ # Re-run the following after every code change (remember to save before building)
$ meson compile -C build
$ sudo meson install -C build
$ # $ meson setup --wipe build ## only needed to re-configure, or when the build fails
```
> meson always uses a separate build directory. Running the following commands in the root of the tree builds and installs all schedulers under `~/bin`.

`meson compile` stores the executables and the intermediate build artifacts in the build directory. To install the executables, run `meson install` with `-C` pointing at the build directory. The install location is the prefix given to `setup` earlier (i.e., `~`). After that, `sudo ~/bin/<scheduler name>` runs the corresponding scheduler.

>The detailed install steps can be found in the `meson.build` script. In the `if enable_rust` section,
>```rust
> cargo_cmd = [cargo, 'build', '--manifest-path=@INPUT@', '--target-dir=@OUTDIR@', cargo_build_args]
>```
>builds the Rust schedulers into `@OUTDIR@`, which is the target passed in by meson. The Rust schedulers can therefore also be found under `release-fast` in the target directory. The schedulers written in C, on the other hand, are placed under the build directory.
:::

#### 3. Tools
> See the [Developer Guide](https://github.com/sched-ext/scx/blob/main/DEVELOPER_GUIDE.md#developer-guide)

:::spoiler [scxtop](https://github.com/sched-ext/scx/tree/main/tools/scxtop#scxtop)
```shell
$ sudo LC_ALL=C scxtop # run scxtop
```
The TUI below appears. Press `q` to quit.
![image](https://hackmd.io/_uploads/SJ-5FJyelx.png)
While an `scx` scheduler is running, press `a` to save a trace file. Then open [Perfetto UI](https://ui.perfetto.dev/#!/viewer) and click `Open trace file` on the left to load the generated file. The CPU scheduling result looks like the figure below.
![image](https://hackmd.io/_uploads/HJly06Y-lx.png)
If you prefer not to use the TUI provided by `scxtop`, the following command also produces a trace file:
```shell
$ sudo scxtop trace --trace-ms 5000 --output-file test # writes a binary file named test
```
:::

:::spoiler [Perfetto](https://perfetto.dev/)
[Pitfall](https://hackmd.io/@cce-underdogs/prob-1#Clarification) :boom: : In Perfetto, the Avg [Wall duration](https://groups.google.com/a/chromium.org/g/chromium-dev/c/IAvPoAwBtdM?pli=1) refers to the average time a process runs before being interrupted, rather than average time slice (`time_slice`) .
![image](https://hackmd.io/_uploads/SJbCA3t-lg.png)

The detailed test analysis is in [開發紀錄 (2)： FCFS / RR scheduler](https://hackmd.io/@cce-underdogs/linux-exp2). Only a few important fields are introduced here, which are also the pitfalls we ran into:
- [Wall duration](https://blog.csdn.net/f2006116/article/details/107581327) (ms)
    - The total duration of a process.
- Avg Wall duration (ms)
    - In Perfetto, the Avg [Wall duration](https://groups.google.com/a/chromium.org/g/chromium-dev/c/IAvPoAwBtdM?pli=1) refers to the average time a process runs before being interrupted, rather than the average time slice (`time_slice`). See the [clarification](https://hackmd.io/@cce-underdogs/prob-1#Clarification). (This is very important!)
- Occurrences
    - The number of times the process was interrupted while it was running.
:::

:::spoiler [stress-ng](https://github.com/ColinIanKing/stress-ng)
```shell
$ stress-ng --cpu 3 --timeout 4s --taskset 5
$ # Spawn 3 workers pinned to CPU 5, each running for 4 seconds.
```
:::

:::spoiler [WebGL Aquarium](https://webglsamples.org/aquarium/aquarium.html)
Can be used as a performance test for schedulers (e.g. FPS). See [scx#1912](https://github.com/sched-ext/scx/pull/1912). (Mainly used to test `scx_bpfland`.)
> Usual "gaming while building the kernel" scenario, using the WebGL Aquarium demo (15000 fishes) on a system with 8 CPUs Intel i7-1195G7 @ 2.90GHz, running `make -j 32`.
:::

:::spoiler [schbench](https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/about/)
:::

:::spoiler Isolating a CPU to make it easier to observe whether a workload's scheduling and time-slice allocation on that CPU behave as expected
Pitfall :boom: : [do not use `isolcpus`](https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt), [use `CPUSETS`](https://docs.kernel.org/admin-guide/cgroup-v1/cpusets.html) instead.

`isolcpus` pitfall :boom: : remember to enable ssh at boot first, otherwise the server cannot be reached after it restarts.
```shell
$ sudo systemctl enable ssh # start automatically at boot
$ sudo systemctl status ssh # check the service status
$ ssh localhost # connection check: make sure the machine can reach itself
```
The example below isolates **CPU $13$** to make testing and observation easier:
```diff
$ sudo vim /etc/default/grub
$ # add the line from the diff below
+ GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=13 nohz_full=13 rcu_nocbs=13"
$ sudo update-grub
$ sudo reboot
```
After the reboot, running `./bin/scx_rlfifo` again produced the following error:
![image](https://hackmd.io/_uploads/ryFi95K-xx.png)
After asking around, we learned that isolcpus is deprecated, see [Doc/admin-guide/kernel-parameters.txt](https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt):
```shell
isolcpus=   [KNL,SMP,ISOL] Isolate a given set of CPUs from disturbance.
            [Deprecated - use cpusets instead]
```
:::

---

### Implementing FCFS / RR schedulers (to be organized)
> [Linux 核心專題: sched_ext 研究 (by otteryc)](https://hackmd.io/@sysprog/H1u6D9LI0) : reproduce the FCFS / RR scheduler experiments.
> Goal: confirm that `scx` sets the time slice correctly.

:::success
Pitfall: every `scx` scheduler has a default maximum wait time, `timeout_ms`.
:::

In `scx_rustland_core`, for example, [the maximum wait time is $5$ seconds](https://github.com/search?q=repo%3Asched-ext/scx%20timeout_ms&type=code). Andrea suggested making this parameter configurable: changing `timeout_ms` leaves more room for adjusting the `time_slice` of a `DispatchTask`.

:::info
TODO: submit a PR with this change
:::

#### Verifying the time slice setting
To confirm that `scx` sets `time_slice` correctly, we changed the original setting in `scx_rlfifo`.
```diff
- dispatched_task.slice_ns = SLICE_NS / (nr_waiting + 1);
+ dispatched_task.slice_ns = 10_000_000;
```
The original code divides the slice evenly by the number of waiting tasks. To make it easier to check the configured `time_slice` against what `Perfetto` shows, we set it to a constant: $10$ ms, $20$ ms, and $30$ ms. The results are:
- $10$ ms ![image](https://hackmd.io/_uploads/rJuMWT7Eee.png)
- $20$ ms ![image](https://hackmd.io/_uploads/ry_Q-pX4lg.png)
- $30$ ms ![image](https://hackmd.io/_uploads/H1Z4Wa7Ele.png)

This confirms that `scx` sets `time_slice` effectively.

#### Rewriting `scx_rlfifo`
The following reproduces the design from the final project of [otteryc](https://wiki.csie.ncku.edu.tw/User/otteryc):
* When $NICE \lt 0$, use FCFS
* When $NICE \geq 0$, use Round-Robin (Time Slice = $50$ $\mu$s)

---

### [WIP] Machine learning - [Introduce ML into `scx_rusty`](https://hackmd.io/@cce-underdogs/scx_rusty)

#### Why `scx_rusty` ?
We first read `scx_lavd` and tried to introduce machine learning there. A closer look showed that all of the operating functions of `scx_lavd` (the `eBPF` hooks) are implemented in `*.bpf.c` (e.g., `loadbalance.bpf.c`), while `main.rs` only calls them. Introducing machine learning there would therefore mean overcoming several difficulties:
```rust
+===========+
| scx_lavd  |
+===========+
Rust layer (e.g., main.rs) => calls the functions
 \-- eBPF layer (e.g., loadbalance.bpf.c) => implements the functions

+===========+
| scx_rusty |
+===========+
Rust layer (e.g., main.rs, load_balance.rs) => implements and calls the functions
 \-- eBPF layer (e.g., some libs)
```
1. `*.bpf.c` can be thought of as the lower layer and `*.rs` as the upper layer. The `Rust` layer can call functions in the `eBPF` layer, but if the machine learning also lives in the `Rust` layer, it needs input data $\to$ so the question is whether `eBPF` can pass data up to the `Rust` layer.
2. Following from that, if `eBPF` cannot pass task information to the `Rust` layer, can ML be implemented in the `eBPF` layer to replace `loadbalance.bpf.c`?

There may be more problems, mostly stemming from our limited understanding of `eBPF`. We therefore turned to `scx_rusty`, whose `load_balance.rs` is implemented in the `Rust` layer. It can easily be replaced with `Rust-ML` without having to deal with the `Rust-ML <--> eBPF` issues.

---

#### Where the existing `scx_rusty` load balancing can be improved ( by [David Vernet](https://github.com/Byte-Lab) )
> When deciding whether to migrate a task, we're only looking at its impact on addressing load imbalances. In reality, this is a **very complex, multivariate cost function**.
> - For example, a domain with sufficiently low load to warrant having an imbalance / requiring more load maybe should not pull load if it's running tasks that are much better suited to isolation.
> - Or, a domain may not want to push a task to another domain if the task is co-located with other tasks that benefit from shared L3 cache locality.

---

#### ML framework
For the machine learning part we chose [Candle](https://github.com/huggingface/candle) as the framework.
>Candle's core goal is to make serverless inference possible. Full machine learning frameworks like PyTorch are very large, which makes creating instances on a cluster slow. Candle allows deployment of lightweight binaries.

`Candle` is provided by `Hugging Face` and already implements machine learning tasks from many domains, such as YOLO, Mamba, and LLMs. `Burn`, on the other hand, aims to support many different backends, and `Candle` is one of them. Supporting many backends is a useful property, but for our task the focus is on the machine learning itself, so we chose `Candle`, which has more example implementations and is more lightweight.
>Compared to other frameworks, Burn has a very different approach to supporting many backends. By design, most code is generic over the Backend trait, which allows us to build Burn with swappable backends. This makes composing backend possible, augmenting them with additional functionalities such as autodifferentiation and automatic kernel fusion.

---

We chose to implement the ML in `try_find_move_task`, the function most closely tied to migration. Balancing between NUMA nodes and balancing under the Last Level Cache (LLC) both rely on this function to pick a suitable task, so making migration more effective inside it improves overall load balance.

`try_find_move_task` takes the target and source domains and the ideal load of the task to migrate. It first uses a filter to drop unsuitable tasks, based on the following conditions:
- whether the task is allowed to be scheduled in that domain
- whether it is a kworker
- whether it has already been migrated

Once all migratable tasks have been selected, the function picks the one whose load is closest to the target, and finally checks whether migrating that task yields a better load balance than before. Our goal is to introduce machine learning into the filter, so that tasks which would later turn out to be unsuitable for load balancing are filtered out as well.

#### `scx_rusty` Data collection
For data collection we use the original `scx_rusty` and record samples with label 0 inside the if-statement that decides a task cannot be migrated.
```diff
 let old_imbal = to_push + to_pull;
 if old_imbal < new_imbal {
+    if self.export_ml_data {
+        if let Some(writer) = self.ml_data_file.as_mut() {
+            let taskc = unsafe { &*task.taskc_p };
+            let _ = writeln!(
+                writer,
+                "{},{},{},{},{},{},{},{}",
+                taskc.pid,
+                task.load,
+                taskc.blocked_freq,
+                taskc.waker_freq,
+                taskc.weight as f32,
+                taskc.deadline as u64,
+                taskc.avg_runtime as u64,
+                0 as u8, // can_not_migrate
+            );
+        }
+    }
     std::mem::swap(&mut push_dom.tasks, &mut SortedVec::from_unsorted(tasks));
     return Ok(None);
 }
```
Samples with label 1 are recorded at the point where migration is allowed. As the workload we used [stress-ng](https://github.com/ColinIanKing/stress-ng) to generate 30 CPU workers:
```shell
$ stress-ng --cpu 30 -l 100 --timeout 120s --cpu-method matrixprod
```

#### Implementing ML
We used a ResNet as the model. Because the input features have different value ranges, we normalize every feature and record its minimum and maximum, so that the same transformation can be applied at inference time.
```diff
-.filter(
+.filter(|task| {
+    let taskc = unsafe { &*task.taskc_p };
+    let input_raw = vec![
+        task.load.0 as f32,
+        taskc.blocked_freq as f32,
+        taskc.waker_freq as f32,
+        taskc.weight as f32,
+        taskc.deadline as f32,
+        taskc.avg_runtime as f32,
+    ];
+    let prediction_ok = self.predictor.predict(&input_raw).unwrap_or(false);
     task.dom_mask & (1 << pull_dom_id) != 0
         && !(self.skip_kworkers && task.is_kworker)
         && !task.migrated.get()
+        && task.load.0 > 0.0f64
+        && prediction_ok
-)
+})
```

#### Profiling: ML-based `scx_rusty`

##### compile [kernel](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/)
```shell
$ git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
$ cd linux
$ make defconfig # generate a minimal configuration from the defaults
$ make clean
$ time make -j"$(nproc)" # record the kernel build time
```
* EEVDF
```
1245.72s user 118.13s system 1417% cpu // higher is better
1:36.24 total // lower is better
```
* default `scx_rusty`
```
1236.88s user 118.20s system 1315% cpu 1:42.99 total
```
* default `scx_rusty-l2` L2-balance (domain cache)
```
1183.22s user 116.42s system 1070% cpu 2:01.38 total
```
* ml-based `scx_rusty-l2-ml` (stress-ng set)
```
1241.19s user 118.28s system 1400% cpu 1:37.07 total
```

##### [perf-tools](https://github.com/brendangregg/perf-tools#)
```shell
$ git clone git@github.com:brendangregg/perf-tools.git
$ sudo perf stat -e sched:sched_migrate_task
```
EEVDF
```shell
 Performance counter stats for 'system wide':

             55873      sched:sched_migrate_task

      96.921136397 seconds time elapsed
```
original `rusty-l2`
```shell
 Performance counter stats for 'system wide':

            428263      sched:sched_migrate_task

     100.828604854 seconds time elapsed
```
`ml_rusty`
```shell
 Performance counter stats for 'system wide':

            457935      sched:sched_migrate_task

     103.606137682 seconds time elapsed
```

##### [schbench](https://kernel.googlesource.com/pub/scm/linux/kernel/git/mason/schbench/)
```shell
$ git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/mason/schbench
```

## Miscellaneous

### TODO
* ML-tuned

### Contributions to [`sched-ext/scx`](https://github.com/sched-ext/scx)
* [EricccTaiwan PR](https://github.com/sched-ext/scx/pulls?q=is%3Amerged+is%3Apr+author%3AEricccTaiwan+)
* [charliechiou PR](https://github.com/sched-ext/scx/pulls?q=is%3Amerged+is%3Apr+author%3Acharliechiou+)

### Talks and presentations
* 《[Kafka相關饅頭營](https://www.facebook.com/share/p/1BnJ1EcTGg/)》- [Slides](https://bit.ly/3FKwf4k) & [YouTube](https://youtu.be/wk-qzWtVzAg?t=12032)
* OSS-NA 2025 - by Ching-Chun (Jim) Huang. [Slides](https://static.sched.com/hosted_files/ossna2025/d2/Improve-Load-Balancing-With-Machine-Learning-Techniques-based-on-sched_ext.pdf)
* [2025 年 Linux 核心設計課程期末展示](https://hackmd.io/@sysprog/linux2025-showcase). [Video](https://youtu.be/Ae0jVIDCycU?t=4716) & [slides](https://www.slideshare.net/slideshow/2025-linux-sched_ext-pdf/281093837)
* Speaker at COSCUP 2025 (Conference for Open Source Coders, Users & Promoters). Talk: "[藉由 `sched_ext` 實作客製化 Linux CPU 排程器](https://pretalx.coscup.org/coscup-2025/talk/WN9RDZ/)"

### References & to be organized
* [Scheduler in Kernel](https://hackmd.io/@cce-underdogs/BJrJXuoWle)
* [Core scheduling](https://hackmd.io/@cce-underdogs/BkLW78TZee)
* [`scx` overview HOW](https://github.com/sched-ext/scx/blob/main/OVERVIEW.md#how)
    * A CPU always executes a task from its local DSQ. A task is "dispatched" to a DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's local DSQ.
* [Linux 核心設計: 不只挑選任務的排程器](https://hackmd.io/@srhuang/S1d6875F1g?utm_source=preview-mode&utm_medium=rec)
* [jserv project clone](/xvRIBnRyRQSoXL1a-_aqLg)
* [Build your packet scheduler with BPF qdisc [0]!](https://www.facebook.com/share/p/19Aum68QXw/)
* [sched_ext(5): Integrate Machine Learning into scx_rusty Load balancer](https://hackmd.io/@vax-r/sched_ext_5)
* [Crafting a Linux kernel scheduler in Rust - Andrea Righi](https://youtu.be/L-39aeUQdS8?si=JenjV2Ij7LNIAyKO)
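
As a supplementary sketch for the normalization step described under "Implementing ML" above: the note says every feature is normalized and its minimum and maximum are recorded so that inference reuses the same transformation. The code below shows one way this could look with plain min-max scaling and only the Rust standard library. `MinMaxScaler`, `fit`, and `transform` are hypothetical names chosen for illustration, not part of `scx_rusty` or `Candle`. Clamping values outside the training range and mapping constant features to 0 are assumptions of this sketch, as is the requirement that all rows have the same length.

```rust
/// Per-feature range recorded on the training set.
/// Hypothetical helper for illustration, not part of scx_rusty.
#[derive(Debug, Clone, PartialEq)]
struct MinMaxScaler {
    min: Vec<f32>,
    max: Vec<f32>,
}

impl MinMaxScaler {
    /// Record the per-feature minimum and maximum over the training rows.
    /// Returns None for an empty training set. All rows are assumed to
    /// have the same number of features as the first row.
    fn fit(rows: &[Vec<f32>]) -> Option<Self> {
        let first = rows.first()?;
        let mut min = first.clone();
        let mut max = first.clone();
        for row in &rows[1..] {
            for (i, &v) in row.iter().enumerate() {
                min[i] = min[i].min(v);
                max[i] = max[i].max(v);
            }
        }
        Some(Self { min, max })
    }

    /// Map one raw feature vector into [0, 1] using the recorded range.
    /// Assumption of this sketch: out-of-range values are clamped and
    /// features that were constant during training map to 0.
    fn transform(&self, raw: &[f32]) -> Vec<f32> {
        raw.iter()
            .zip(self.min.iter().zip(self.max.iter()))
            .map(|(&v, (&lo, &hi))| {
                let span = hi - lo;
                if span > 0.0 {
                    ((v - lo) / span).clamp(0.0, 1.0)
                } else {
                    0.0
                }
            })
            .collect()
    }
}

fn main() {
    // Three made-up training rows with three features each.
    let train = vec![
        vec![10.0, 0.0, 100.0],
        vec![30.0, 4.0, 100.0],
        vec![20.0, 2.0, 100.0],
    ];
    let scaler = MinMaxScaler::fit(&train).expect("training set must not be empty");

    // At inference time the same recorded range is applied to new inputs.
    let x = scaler.transform(&[25.0, 8.0, 100.0]);
    println!("{:?}", x); // prints [0.75, 1.0, 0.0]
}
```

In a setup like the one described above, the recorded `min` and `max` vectors would be saved alongside the trained model and loaded by the predictor before it is called from the filter.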