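---
tags: Linux Kernel Internals, 作業系統
---

# Linux Kernel Design: Scheduler(7): sched_ext

## Introduction

The CPU scheduler in the Linux kernel has gone through many changes. Linux runs on platforms as varied as databases, mobile devices, and cloud services, so from the O(1) Scheduler through CFS to today's EEVDF, the key design requirement for the kernel's default scheduling algorithm has been a generic policy that fits a wide range of use cases. Only a limited set of parameters is exposed for fine-tuning the scheduler, so that applications can get somewhat closer to their ideal performance.

For some companies that adopt Linux, however, the platform only ever runs a specific workload, and a generic scheduler cannot meet their expectations. Being able to customize the scheduler's behavior could bring more benefit in such cases. Implementing a custom scheduler by modifying kernel code directly, though, takes considerable expertise. The scheduler is a critical component of the operating system, so implementation mistakes easily cause harm. Linux is also unlikely to let every vendor publish its own scheduler upstream, since that would fill the project with redundant code and confusion. Taken together, how to develop custom schedulers in a way that lowers development cost and maintenance burden is an important question.

With that need in mind, developers started to consider introducing a "pluggable" scheduler mechanism into Linux. It would let anyone write a customized scheduler and plug it into the kernel at low cost. But which mechanism best meets these requirements? This is where [eBPF](https://en.wikipedia.org/wiki/EBPF) caught developers' attention. eBPF supports loading code into the kernel dynamically, and it provides a verifier that reduces the risk of a loaded program breaking the kernel, which makes it arguably the most suitable foundation for a "pluggable" scheduler. [`sched_ext`](https://github.com/sched-ext/sched_ext) was born on that foundation.

## How to use `sched_ext`?

Is `sched_ext` really enough to achieve the goals above? Let's get hands-on and try it.

### Building the Linux kernel

At the time of writing, `sched_ext` has not been formally accepted into the kernel, so there are two ways to use it: apply the relevant [patch](https://lore.kernel.org/all/20231111024835.2164816-1-tj@kernel.org/) to mainline, or develop directly on top of the [`sched_ext`](https://github.com/sched-ext/sched_ext) tree. This article takes the latter, easier route.

```
$ cd sched_ext
$ make CC=clang-17 LLVM=1 defconfig
```

The commands above first create a default config file. Note that `CC=clang-17` and `LLVM=1` are set explicitly: [sched_ext: a BPF-extensible scheduler class (Part 1)](https://blogs.igalia.com/changwoo/sched-ext-a-bpf-extensible-scheduler-class-part-1/) recommends clang 16 or later, so clang-17 is used for the build here.

:::info
For installing different versions of clang, see [How to install Clang 17 or 16 in Ubuntu 22.04 | 20.04](https://ubuntuhandbook.org/index.php/2023/09/how-to-install-clang-17-or-16-in-ubuntu-22-04-20-04/)
:::

```
$ make CC=clang-17 LLVM=1 menuconfig
```

Next, enable the BPF-related options.

* General setup -> BPF subsystem

![image](https://hackmd.io/_uploads/rJyp7-QwT.png)

* General setup

![image](https://hackmd.io/_uploads/SknvCXmPp.png)

* Kernel hacking -> Scheduler Debugging

![image](https://hackmd.io/_uploads/BydEVb7P6.png)

* Kernel hacking -> Compile-time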
checks and compiler options

![image](https://hackmd.io/_uploads/HkNz5gEPT.png)
![image](https://hackmd.io/_uploads/SJwJsMmw6.png)

```
$ make CC=clang-17 LLVM=1 -j$(nproc)
```

Now the kernel can be built. Once the build finishes, we can test it on QEMU or virtme-ng!

:::info
The details of these tools are not repeated here; see [Linux 核心設計: 開發、測試與除錯環境](https://hackmd.io/@RinHizakura/SJ8GXUPJ6) for more information :smiley:~
:::

:::warning
If the build fails with the following message:
```
BTF: .tmp_vmlinux.btf: pahole (pahole) is not available
Failed to generate BTF for vmlinux
Try to disable CONFIG_DEBUG_INFO_BTF
make[2]: *** [scripts/Makefile.vmlinux:37: vmlinux] Error 1
make[1]: *** [/home/rin/Linux/sched_ext/Makefile:1165: vmlinux] Error 2
make[1]: *** Waiting for unfinished jobs....
  LD [M]  net/ipv4/netfilter/iptable_nat.ko
  LD [M]  net/netfilter/xt_addrtype.ko
make: *** [Makefile:234: __sub-make] Error 2
```
a missing package may be the cause. Try installing dwarves and building again:
```
$ sudo apt install dwarves
```
:::

### Building the `sched_ext` example code

Next, go to `tools/sched_ext/` and build the example custom schedulers.

```
$ cd tools/sched_ext
```

I ran into some trouble at this step. [`scx_utils`](https://docs.rs/scx_utils/latest/scx_utils/) checks whether clang is new enough (>= 16), and it does so by always running the default `clang --version` command, whereas on this setup only `clang-17` gives the expected answer. Because `clang --version` reports a version below 16 on my machine, the crates that depend on `scx_utils` fail to build. After some digging, the [scx_utils](https://docs.rs/scx_utils/0.4.0/src/scx_utils/bpf_builder.rs.html#229) source turned out to allow selecting the clang binary through `BPF_CLANG`, so for now I use the following workaround:

* Add one extra line to `scx_rusty/build.rs` and `scx_layered/build.rs` (inside `main`):
```diff
+ std::env::set_var("BPF_CLANG", "clang-17");
```

:::info
If there is a better solution, suggestions are welcome >m<
:::

Then run the following command:

```
$ make CC=clang-17 LLVM=1 -j$(nproc)
```

:::warning
If the build fails with the following message:
```
/usr/bin/ld: cannot find -lzstd: No such file or directory
```
install the required library package:
```
sudo apt-get install libzstd-dev
```
:::

After all the steps above, you should have a set of `build/bin/scx*` files, which means we now have the schedulers to load into the kernel!

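As a possible alternative to editing `build.rs`, the compiler could be chosen from the shell. The sketch below is only an illustration: `clang_major` and `pick_bpf_clang` are hypothetical helper names that are not part of `sched_ext` or `scx_utils`, and it assumes that `BPF_CLANG` is read from the environment the Rust build scripts inherit from `make`, which the linked `scx_utils` source suggests but which has not been verified here.

```shell
#!/bin/sh
# Hypothetical helpers; not part of sched_ext or scx_utils.

# Read `clang --version` style output on stdin and print the major version.
# Prints nothing when the input does not look like clang output.
clang_major() {
    sed -n 's/.*clang version \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Print the first installed candidate whose major version is >= 16.
# Returns non-zero when no suitable clang is found.
pick_bpf_clang() {
    for cand in clang clang-17 clang-16; do
        command -v "$cand" >/dev/null 2>&1 || continue
        major=$("$cand" --version | clang_major)
        if [ -n "$major" ] && [ "$major" -ge 16 ]; then
            echo "$cand"
            return 0
        fi
    done
    return 1
}
```

Under the assumption stated above, the build could then be started as `BPF_CLANG=$(pick_bpf_clang) make CC=clang-17 LLVM=1 -j$(nproc)`, leaving the build scripts untouched.
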
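### Testing a scheduler built on the `sched_ext` framework

Here we show how to test with `virtme-ng`. Compared with using QEMU directly, it should be easier to get started: there is no need to build a separate root filesystem image or do any complicated extra setup. It is recommended for readers who first want to focus on learning `sched_ext`.

Taking `scx_simple` as the example, start `virtme-ng` and then run the following from the `sched_ext` directory:

```
$ vng
$ sudo ./build/bin/scx_simple
local=0 global=0
local=5 global=2
local=137 global=9
local=140 global=11
local=143 global=22
local=274 global=27
...
```

## Writing a simple `sched_ext` scheduler

To understand how to write a sched_ext scheduler, [`scx_simple`](https://github.com/sched-ext/sched_ext/blob/sched_ext/tools/sched_ext/scx_simple.bpf.c) is probably a good starting point. It shows a simple sched_ext scheduler implementation.

### Scheduling Cycle

Before diving into the code, it is worth reading the [Scheduling Cycle](https://github.com/sched-ext/sched_ext/blob/sched_ext/Documentation/scheduler/sched-ext.rst#scheduling-cycle) section first, to understand the overall scheduling flow of `sched_ext` and which methods each step uses.

1. When a task wakes up, `select_cpu()` runs first. It serves two purposes: it gives a hint about the best CPU to choose, and it wakes the selected CPU if that CPU is idle.
    * If `select_cpu()` picks an invalid CPU (for example, one that the task's cpumask excludes), the choice is eventually discarded.
2. Once the target CPU is selected, `enqueue()` is called. It can make one of the following decisions:
    * Call `scx_bpf_dispatch()` with `SCX_DSQ_GLOBAL` or `SCX_DSQ_LOCAL` to dispatch the task directly to the global or local DSQ.
    * Call `scx_bpf_dispatch()` with an additional DSQ created by `scx_bpf_create_dsq`.
    * Enqueue the task into a custom data structure inside the BPF program first.
3. When a CPU is ready to schedule, it first checks its local DSQ (`SCX_DSQ_LOCAL`). If that is empty, it looks at the global DSQ (`SCX_DSQ_GLOBAL`). If there is still no runnable task, `dispatch()` is called, which has two ways to add tasks to the local DSQ:
    * `scx_bpf_dispatch()` dispatches a task to a DSQ. Any DSQ can be the target, including `SCX_DSQ_LOCAL`, `SCX_DSQ_LOCAL_ON | cpu`, `SCX_DSQ_GLOBAL`, or a custom DSQ.
    * `scx_bpf_consume()` moves a task from the specified non-local DSQ to the local DSQ.
4. After `dispatch()` returns, the local DSQ (`SCX_DSQ_LOCAL`) is checked again. If it holds a task, that task runs. Otherwise:
    * Try the global DSQ (`SCX_DSQ_GLOBAL`) once more.
    * If that fails and the preceding `dispatch()` did dispatch new tasks, retry step 3.
    * If that still fails and the previous task is an SCX task that is still runnable, keep running it.
    * Otherwise the CPU goes idle.

### BPF code

eBPF programs are usually written in two parts. One part is compiled to BPF bytecode and later loaded into kernel space to run; the other is the BPF loader running in userspace. The loader depends on libbpf and is used to load the BPF bytecode into the kernel and monitor its state so that userspace can react accordingly.

:::info
See [Linux 核心設計: eBPF](https://hackmd.io/@RinHizakura/S1DGq8ebw) and [BPF 的可移植性: Good Bye BCC!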
Hi CO-RE!](https://hackmd.io/@RinHizakura/HynIEOD7n) for more details~
:::

The part loaded into the kernel is [`scx_simple.bpf.c`](https://github.com/sched-ext/sched_ext/blob/sched_ext/tools/sched_ext/scx_simple.bpf.c), so let's read it first.

```c
SEC(".struct_ops.link")
struct sched_ext_ops simple_ops = {
    .enqueue  = (void *) simple_enqueue,
    .dispatch = (void *) simple_dispatch,
    .running  = (void *) simple_running,
    .stopping = (void *) simple_stopping,
    .enable   = (void *) simple_enable,
    .init     = (void *) simple_init,
    .exit     = (void *) simple_exit,
    .name     = "simple",
};
```

In sched_ext, `struct sched_ext_ops` describes the scheduler's operations, such as enqueueing and picking tasks. Note that the leading `SEC` annotation states which ELF section this data structure is placed in. It must go in `.struct_ops.link` so that sched_ext links correctly to the operations being plugged in.

```c
/*
 * Scheduling policies
 */
#define SCHED_NORMAL    0
#define SCHED_FIFO      1
#define SCHED_RR        2
#define SCHED_BATCH     3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE      5
#define SCHED_DEADLINE  6
#define SCHED_EXT       7
```

Looking at `include/uapi/linux/sched.h`, the scheduling policies under the sched_ext framework fall into the categories above, which are simply the kernel's existing policies plus `SCHED_EXT`. If no sched_ext scheduler is loaded, `SCHED_EXT` is treated as `SCHED_NORMAL`.

```c
#define SHARED_DSQ 0

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
    if (!switch_partial)
        scx_bpf_switch_all();

    return scx_bpf_create_dsq(SHARED_DSQ, -1);
}
```

Once the eBPF code is loaded, `init` runs. `switch_partial` is a variable that can be toggled from userspace. When it is off, `scx_bpf_switch_all`, a function provided by `sched_ext`, hands every task whose policy ranks below the real-time and deadline classes (`SCHED_NORMAL`, `SCHED_BATCH`, `SCHED_IDLE`, and `SCHED_EXT`) to our custom scheduler. Otherwise only tasks with the `SCHED_EXT` policy are controlled by it.

`sched_ext` provides dispatch queues (DSQs) to manage queues of tasks. Within a DSQ, tasks can be managed as a FIFO or as a priority queue (depending on the API used); when consuming (`scx_bpf_consume`, explained later), the FIFO is drained before the priority queue. By default there is one global DSQ (`SCX_DSQ_GLOBAL`) and one local DSQ per CPU (`SCX_DSQ_LOCAL`). `scx_bpf_create_dsq` creates additional DSQs. In this example, we use id=0 to create a task queue shared by all CPUs.

:::warning
In general, it seems the queue should be released with `scx_bpf_destroy_dsq` on exit, but this example omits that step.
:::

```c
void BPF_STRUCT_OPS(simple_running, struct task_struct *p)
{
    if (fifo_sched)
        return;

    /*
     * Global vtime always progresses forward as tasks start
     * executing. The test and update can be performed concurrently from
     * multiple CPUs and thus racy. Any error should be contained and
     * temporary. Let's just live with it.
     */
    if (vtime_before(vtime_now, p->scx.dsq_vtime))
        vtime_now = p->scx.dsq_vtime;
}
```

`running()` is called when a task gets scheduled onto a CPU. When `fifo_sched` is off, we record the latest `vtime` into `vtime_now`. When `fifo_sched` is on, time plays no part in how tasks are queued, so there is nothing to record.

```c
void BPF_STRUCT_OPS(simple_enable, struct task_struct *p,
                    struct scx_enable_args *args)
{
    p->scx.dsq_vtime = vtime_now;
}
```

`enable()` handles tasks that come under sched_ext management for the first time, for example through `fork()`, `clone()`, or `scx_bpf_switch_all()`. In sched_ext, every schedulable entity describes itself with its own [`struct sched_ext_entity`](https://github.com/sched-ext/sched_ext/blob/sched_ext/include/linux/sched/ext.h#L645), whose `dsq_vtime` field determines the task's order within a DSQ. Here `dsq_vtime` is explicitly aligned to `vtime_now`, which is the vtime of the most recently scheduled (running) task.

:::warning
Note that `vtime_now` starts at 0, so when other tasks are moved over by `scx_bpf_switch_all`, their previous relative priority is not taken into account: everyone starts from vtime = 0.
:::

```c
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
    /*
     * If scx_select_cpu_dfl() is setting %SCX_ENQ_LOCAL, it indicates that
     * running @p on its CPU directly shouldn't affect fairness. Just queue
     * it on the local FIFO.
     */
    if (enq_flags & SCX_ENQ_LOCAL) {
        stat_inc(0);    /* count local queueing */
        scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
        return;
    }

    stat_inc(1);    /* count global queueing */

    if (fifo_sched) {
        scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
    } else {
        u64 vtime = p->scx.dsq_vtime;

        /*
         * Limit the amount of budget that an idling task can accumulate
         * to one slice.
         */
        if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
            vtime = vtime_now - SCX_SLICE_DFL;

        scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
                               enq_flags);
    }
}
```

`enqueue()` adds a task to the scheduler's runqueue. The special enqueue flag `SCX_ENQ_LOCAL` hints that the task should go into the local DSQ (`SCX_DSQ_LOCAL`); the code handles this by inserting it there in FIFO fashion (`scx_bpf_dispatch`). For the other, non-local tasks, when `fifo_sched` is on they are likewise inserted directly into a FIFO queue, this time in `SHARED_DSQ`.

:::warning
Q: Is `SCX_ENQ_LOCAL` a suggestion or a strict requirement? In other words, if the flag is deliberately ignored and the task is put into `SHARED_DSQ`, does that only hurt performance, or does it break the correctness of `sched_ext` as a whole?

A: Judging from the comment on [`SCX_ENQ_LOCAL`](https://github.com/sched-ext/sched_ext/blob/sched_ext/kernel/sched/ext.h#L62), it seems to be only a suggestion.
:::

Let's look more closely at how `scx_bpf_dispatch` is used:

```c
void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
                      u64 enq_flags);
```

* `p` is the task to be added to the DSQ
* `dsq_id` is the id of the chosen DSQ
* `slice` is how long the task may run once it is picked (in ns)
* `enq_flags` passes the corresponding flags when the dispatch happens during the enqueue stage

:::info
Q: Since `scx_bpf_dispatch` requires `enq_flags`, does task management have to go through DSQs, with no room for a custom runqueue structure? That would seem to limit the flexibility of `sched_ext` considerably.

A: A DSQ can serve purely as an intermediate layer: maintain task priorities in a custom data structure first, and add tasks to a DSQ only at dispatch time. The comment on `scx_bpf_dispatch` also mentions that the API can be used from `enqueue()` or `dispatch()`. See also this passage from [sched-ext.rst](https://github.com/sched-ext/sched_ext/blob/sched_ext/Documentation/scheduler/sched-ext.rst):

> Note that the BPF scheduler can always choose to dispatch tasks immediately in ops.enqueue() as illustrated in the above simple example. If only the built-in DSQs are used, there is no need to implement ops.dispatch() as a task is never queued on the BPF scheduler and both the local and global DSQs are consumed automatically.


Another example, [`scx_pair`](https://github.com/sched-ext/sched_ext/blob/sched_ext/tools/sched_ext/scx_pair.bpf.c), may show this usage in more detail.
:::

When `fifo_sched` is off, `scx_bpf_dispatch_vtime` is used to insert into the priority queue of `SHARED_DSQ`, which orders the next task to be dispatched by `vtime`.

```c
void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
    scx_bpf_consume(SHARED_DSQ);
}
```

When a CPU looks for the next task to run, it picks the first task in its local DSQ if there is one. Otherwise it tries the global DSQ. If neither yields a runnable task, the `dispatch()` method is called to move a task from the DSQ we manage ourselves into that CPU's local DSQ for execution. `scx_bpf_consume` moves a task from the given DSQ to the CPU's local DSQ.

```c
void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
{
    if (fifo_sched)
        return;

    /*
     * Scale the execution time by the inverse of the weight and charge.
     *
     * Note that the default yield implementation yields by setting
     * @p->scx.slice to zero and the following would treat the yielding task
     * as if it has consumed all its slice. If this penalizes yielding tasks
     * too much, determine the execution time by taking explicit timestamps
     * instead of depending on @p->scx.slice.
     */
    p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
}
```

`stopping()` is called when a task stops running on the CPU. The `runnable` argument indicates whether the task is still runnable after stopping. The handling here advances `dsq_vtime` by the execution time scaled by the inverse of the task's weight, where the execution time is derived from the remaining slice `p->scx.slice` and the initially assigned value `SCX_SLICE_DFL`. Note the comment's remark on how this calculation behaves when a task yields.

### BPF loader

The code above is compiled to eBPF bytecode and then handed to the BPF loader, [scx_simple.c](https://github.com/sched-ext/sched_ext/blob/sched_ext/tools/sched_ext/scx_simple.c), which loads it into the kernel's virtual machine to run.

```cpp
static volatile int exit_req;

static void sigint_handler(int simple)
{
    exit_req = 1;
}

int main(int argc, char **argv)
{
    struct scx_simple *skel;
    struct bpf_link *link;
    __u32 opt;

    signal(SIGINT, sigint_handler);
    signal(SIGTERM, sigint_handler);

    libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
```

The BPF loader code is comparatively easy. It first registers signal handlers so that the loader shuts down properly. The handler only sets a flag, which terminates the main loop shown later.

`libbpf_set_strict_mode` controls libbpf's behavior: it determines whether libbpf operates with v1.0 semantics or with the legacy ones. See [Testing your application with libbpf 1.0 behavior](https://github.com/libbpf/libbpf/wiki/Libbpf-1.0-migration-guide#testing-your-application-with-libbpf-10-behavior) for details.

```cpp
    skel = scx_simple__open();
    SCX_BUG_ON(!skel, "Failed to open skel");

    while ((opt = getopt(argc, argv, "fph")) != -1) {
        switch (opt) {
        case 'f':
            skel->rodata->fifo_sched = true;
            break;
        case 'p':
            skel->rodata->switch_partial = true;
            break;
        default:
            fprintf(stderr, help_fmt, basename(argv[0]));
            return opt != 'h';
        }
    }

    SCX_BUG_ON(scx_simple__load(skel), "Failed to load skel");
```

The BPF loader uses a BPF **skeleton** to connect userspace to the BPF program in the kernel and to access its variables and data structures. For example, once `scx_simple__open()` has made `skel` valid, modifying the variables under `skel->rodata` changes the values of the corresponding global variables (`fifo_sched`/`switch_partial`) in the BPF code that `scx_simple__load()` loads afterwards.

```cpp
    link = bpf_map__attach_struct_ops(skel->maps.simple_ops);
    SCX_BUG_ON(!link, "Failed to attach struct_ops");

    while (!exit_req && !uei_exited(&skel->bss->uei)) {
        __u64 stats[2];

        read_stats(skel, stats);
        printf("local=%llu global=%llu\n", stats[0], stats[1]);
        fflush(stdout);
        sleep(1);
    }
```

After the BPF code is loaded with `scx_simple__load()`, `bpf_map__attach_struct_ops()` registers the `struct sched_ext_ops simple_ops` seen earlier with the kernel. From then on, the scheduler code we walked through runs according to the `sched_ext` execution flow.

```cpp
static void read_stats(struct scx_simple *skel, __u64 *stats)
{
    int nr_cpus = libbpf_num_possible_cpus();
    __u64 cnts[2][nr_cpus];
    __u32 idx;

    memset(stats, 0, sizeof(stats[0]) * 2);

    for (idx = 0; idx < 2; idx++) {
        int ret, cpu;

        ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
                                  &idx, cnts[idx]);
        if (ret < 0)
            continue;
        for (cpu = 0; cpu < nr_cpus; cpu++)
            stats[idx] += cnts[idx][cpu];
    }
}
```

While the BPF code is running, userspace can interact with the kernel dynamically to read the `map`s defined in the BPF program. `read_stats` shows this in particular: `bpf_map_lookup_elem` fetches the current per-CPU counters from the array map `stats` as they stand in the kernel, and the loop sums them across CPUs.

```cpp
    bpf_link__destroy(link);
    uei_print(&skel->bss->uei);
    scx_simple__destroy(skel);
    return 0;
}
```

What follows is resource cleanup, which is not covered further here.

## Reference

* [The extensible scheduler class](https://lwn.net/Articles/922405/)
* [sched-ext.rst](https://github.com/sched-ext/sched_ext/blob/sched_ext/Documentation/scheduler/sched-ext.rst)
* [sched_ext: a BPF-extensible scheduler class (Part 1)](https://blogs.igalia.com/changwoo/sched-ext-a-bpf-extensible-scheduler-class-part-1/)
* [Kernel Recipes 2023 - sched_ext: pluggable scheduling in the Linux kernel](https://www.youtube.com/watch?v=8kAcnNVSAdI)
* [Getting started with sched-ext development](https://arighi.blogspot.com/2024/04/getting-started-with-sched-ext.html)
