sched_ext (2) : 排程器程式碼分析

Introduction

在利用 sched_ext 撰寫我們的 custom scheduler 之前，先理解現有排程器如何實作，解析相關工具的使用並探討是否有改進空間。

架構分析

以 scheds/c/ 底下包含的排程器來說，不外乎是一個 userspace program 搭配一個 bpf program ，而 bpf program 又會透過 libbpf 解析行程 skeleton file ，因此使用者層級程式碼都是 include skel.h 標頭檔來和 bpf application 進行互動。以下先以架構最簡單的 scx_simple 為例。

使用者層級程式

先看到 main 函式當中，透過 libbpf API libbpf_set_print 來設定如何印出 libbpf warning 和 informational messages 。此處是設定為 libbpf_print_fn ，而 libbpf_print_fn 實作如下

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
	if (level == LIBBPF_DEBUG && !verbose)
		return 0;
	return vfprintf(stderr, format, args);
}

接下來可以看到兩行 signal 相關程式碼，目的是將 SIGINT, SIGTERM 都導向 sigint_handler 函式用以通知程式執行的結束。之後利用定義在 scheds/include/scx/compat.h 當中的巨集 SCX_OPS_OPEN 來讓 skeleton file 對應的 bpf application 進入 open phase ，可以注意到程式碼當中的 __skel = __scx_name##__open() 即是 libbpf 產生的 skeleton file 當中所提供之函式。在程式當中的 struct scx_simple 則是 libbpf 的 skeleton file 所建立的 struct 用來對應到 bpf program 當中所有可用的 data fields 。之後的 SCX_OPS_LOAD, SCX_OPS_ATTACH 分別對應到 bpf application 的 load phase 和 attach phase 。

值得注意的是在 SCX_OPS_ATTACH() 巨集當中使用到的 bpf_map__attach_struct_ops() 是 libbpf API ，會回傳一個 struct bpf_link。

另外可以注意到程式當中許多 UEI_* 的函式或巨集，都是定義在 scheds/include/scx/user_exit_info.h 當中， struct user_exit_info 是使用者層級程式和 BPF program 共享的資訊，用來傳送 exit status 和其他資訊。

在這個程式的 main loop 當中，利用 exit_req 和 UEI_EXITED(skel, uei) 來判斷是否離開 main loop ，我認為 UEI_EXITED 巨集的實作值得探討，它利用 __sync_val_compare_and_swap 來檢查 skeleton file 的 __uei_name.kind 的數值。

__sync_val_compare_and_swap 是一個 GNU builtin function ，可以參見 6.49 Built-in functions for atomic memory access ，此函式將 __uei_name.kind 的值和 -1 比較，若相同則將 -1 寫入 __uei_name.kind ，每次都是回傳 __uei_name.kind 原本的值。所以結束條件應該是當 __uei_name.kind 不為 0 ，正常情況下都應該是 0 。

接著 main loop 當中的操作很單純，不斷的透過 bpf_map_lookup_elem() 來查看 BPF program 當中 BPF maps 包含的值是什麼並印出來，所以 BPF program 是如何將原本的 scheduler 替換掉，並和 kernel 互動的？

BPF program

在每個排程器實作當中我們都可以看到大量的 BPF_STRUCT_OPS 巨集，該巨集定義在 scheds/include/scx/common.bpf.h 當中，定義如下

#define BPF_STRUCT_OPS(name, args...)						\
SEC("struct_ops/"#name)								\
BPF_PROG(name, ##args)

其中 SEC 巨集來自 libbpf 也就是 /tools/lib/bpf/bpf_helpers.h 當中，定義如下

/*
 * Helper macro to place programs, maps, license in
 * different sections in elf_bpf file. Section names
 * are interpreted by libbpf depending on the context (BPF programs, BPF maps,
 * extern variables, etc).
 * To allow use of SEC() with externs (e.g., for extern .maps declarations),
 * make sure __attribute__((unused)) doesn't trigger compilation warning.
 */
#if __GNUC__ && !__clang__

/*
 * Pragma macros are broken on GCC
 * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55578
 * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90400
 */
#define SEC(name) __attribute__((section(name), used))

#else

#define SEC(name) \
	_Pragma("GCC diagnostic push")					    \
	_Pragma("GCC diagnostic ignored \"-Wignored-attributes\"")	    \
	__attribute__((section(name), used))				    \
	_Pragma("GCC diagnostic pop")					    \

#endif

另一個巨集 BPF_PROG 定義在 /tools/lib/bpf/bpf_tracing.h 底下，定義如下

/*
 * BPF_PROG is a convenience wrapper for generic tp_btf/fentry/fexit and
 * similar kinds of BPF programs, that accept input arguments as a single
 * pointer to untyped u64 array, where each u64 can actually be a typed
 * pointer or integer of different size. Instead of requring user to write
 * manual casts and work with array elements by index, BPF_PROG macro
 * allows user to declare a list of named and typed input arguments in the
 * same syntax as for normal C function. All the casting is hidden and
 * performed transparently, while user code can just assume working with
 * function arguments of specified type and name.
 *
 * Original raw context argument is preserved as well as 'ctx' argument.
 * This is useful when using BPF helpers that expect original context
 * as one of the parameters (e.g., for bpf_perf_event_output()).
 */
#define BPF_PROG(name, args...)						    \
name(unsigned long long *ctx);						    \
static __always_inline typeof(name(0))					    \
____##name(unsigned long long *ctx, ##args);				    \
typeof(name(0)) name(unsigned long long *ctx)				    \
{									    \
	_Pragma("GCC diagnostic push")					    \
	_Pragma("GCC diagnostic ignored \"-Wint-conversion\"")		    \
	return ____##name(___bpf_ctx_cast(args));			    \
	_Pragma("GCC diagnostic pop")					    \
}									    \
static __always_inline typeof(name(0))					    \
____##name(unsigned long long *ctx, ##args)

幫我們根據給定的 name 與 args 建立一系列函式原型。
我們可以透過 BPF_STRUCT_OPS 和 BPF_STRUCT_OPS_SLEEPABLE 定義一系列 handlers 並利用 SCX_OPS_DEFINE 來將這些 handlers 和對應之 name 註冊為 struct sched_ext_ops，成為和其他元件互動的介面。
關於每個函式詳細的作用在 /kernel/scheds/ext.c 有講解。

在 /kernel/scheds/ext.c 當中有對應的 dummy interface 如下

static struct sched_ext_ops __bpf_ops_sched_ext_ops = {
	.select_cpu = select_cpu_stub,
	.enqueue = enqueue_stub,
	.dequeue = dequeue_stub,
	.dispatch = dispatch_stub,
	.runnable = runnable_stub,
	.running = running_stub,
	.stopping = stopping_stub,
	.quiescent = quiescent_stub,
	.yield = yield_stub,
	.core_sched_before = core_sched_before_stub,
	.set_weight = set_weight_stub,
	.set_cpumask = set_cpumask_stub,
	.update_idle = update_idle_stub,
	.cpu_acquire = cpu_acquire_stub,
	.cpu_release = cpu_release_stub,
	.init_task = init_task_stub,
	.exit_task = exit_task_stub,
	.enable = enable_stub,
	.disable = disable_stub,
#ifdef CONFIG_EXT_GROUP_SCHED
	.cgroup_init = cgroup_init_stub,
	.cgroup_exit = cgroup_exit_stub,
	.cgroup_prep_move = cgroup_prep_move_stub,
	.cgroup_move = cgroup_move_stub,
	.cgroup_cancel_move = cgroup_cancel_move_stub,
	.cgroup_set_weight = cgroup_set_weight_stub,
#endif
	.cpu_online = cpu_online_stub,
	.cpu_offline = cpu_offline_stub,
	.init = init_stub,
	.exit = exit_stub,
};

static struct bpf_struct_ops bpf_sched_ext_ops = {
	.verifier_ops = &bpf_scx_verifier_ops,
	.reg = bpf_scx_reg,
	.unreg = bpf_scx_unreg,
	.check_member = bpf_scx_check_member,
	.init_member = bpf_scx_init_member,
	.init = bpf_scx_init,
	.update = bpf_scx_update,
	.validate = bpf_scx_validate,
	.name = "sched_ext_ops",
	.owner = THIS_MODULE,
	.cfi_stubs = &__bpf_ops_sched_ext_ops
};

並且透過註冊在 bpf_sched_ext_ops 的成員當中，作為和其他程式互動的介面，透過 init_sched_ext_class 和 scx_init 來初始化 scx 。其中 scx_init 負責將上述在 BPF program 當中定義的 handler 註冊成為 kernel function 。之後 kernel 可以透過呼叫 ops->* 相關函式來操作 scheduler 。

所以這段程式是怎麼把原先的 scheduler 替換掉的？

先觀察 init 和 exit 函式

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	__COMPAT_scx_bpf_switch_all();
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
{
	UEI_RECORD(uei, ei);
}

在 init function 當中先利用 __COMPAT_scx_bpf_switch_all() 將所有 task 都轉為 scx task 讓所有 class 的任務都被 scx scheduler 排程。實作在 /scheds/include/scx/compat.bpf.h 當中。現階段被 SCX_OPS_SWITCH_PARTIAL 給替代了。

enum scx_ops_flags 被定義在 /scheds/include/vmlinux/vmlinux.h 當中如下

enum scx_ops_flags {
	SCX_OPS_KEEP_BUILTIN_IDLE = 1,
	SCX_OPS_ENQ_LAST = 2,
	SCX_OPS_ENQ_EXITING = 4,
	SCX_OPS_SWITCH_PARTIAL = 8,
	SCX_OPS_CGROUP_KNOB_WEIGHT = 65536,
	SCX_OPS_ALL_FLAGS = 65551,
};

關於每個 flags 用途的詳細解說可以參見 /kernel/sched/ext.c 。其中表示若 SCX_OPS_SWITCH_PARITIAL 有被設定，則只有 SCHED_EXT class 的任務會被我們撰寫的 BPF scheduler 排程，若沒設定則是全部都排程。如果你使用的是新版的 sched_ext kernel ，則 __COMPAT_scx_bpf_switch_all() 不會有任何作用，因為所有排程器預設會把所有任務拿來排程，只有當 SCX_OPS_SWITCH_PARTIAL 被設定時才會只排 SCHED_EXT 這個 class 的任務。

接下來程式碼呼叫並回傳 scx_bpf_create_dsq() 的回傳值，要理解此函式我們就要理解 sched_ext 當中 dispatch queue 的運用，可以參見 Extensible Scheduler Class : Dispatch Queues 。特別注意 built-int queues 例如 SCX_DSQ_GLOBAL, SCX_DSQ_LOCAL 都是 FIFO queues ，若想利用 priority queues (多數時候會使用，因為要考量權重) ，則要自己透過 scx_bpf_create_dsq() 配置新的 queue 。
每個 CPU 都會從自己的 local DSQ 抓取任務出來執行，若 local DSQ 為空，則會嘗試從 global DSQ 抓取任務，若沒有拿到可執行的任務，則會呼叫 ops.dispatch() 。

Scheduling Cycle

當 task 被喚醒，首先會呼叫 ops.select_cpu() ，主要有兩個功能，第一個是做 CPU selection optimization hint ，另外則是喚醒被選中的 CPU 。
被選中的 CPU 僅是一個 optimization hint 而非最終結果，最後 task 要被放到哪個 CPU 的 DSQ 上是由 scheduling 的最後一步來決定。若 ops.select_cpu() 選中的 CPU 和最終執行該 task 的 CPU 相同，效能上可以有些微提升。
選擇某個 CPU 的 side-effect 是會把它從 idle 狀態喚醒，當然 BPF scheduler 可以利用 scx_bpf_kick_cpu() 來喚醒任何 CPU 。
若 task 呼叫 scx_bpf_dispatch() ，則在 ops.select_cpu() 之後會立即將該 task 分配到某個 DSQ 上，如果是分配到 SCX_DSQ_LOCAL ，則該 local dispatch queue 就是 ops.select_cpu() 選中的 CPU 之 local dispatch queue 。另外注意到的是，如果 dispatch 是直接由 ops.select_cpu() 處理，則 ops.enqueue() 就會被跳過。
當 CPU 選好之後，會呼叫 ops.enqueue() ，而該函式呼叫會做以下其中一個決定
- 透過呼叫 scx_bpf_dispatch() 將 task 立即分配到 SCX_DSQ_GLOBAL 或 SCX_DSQ_LOCAL 。
- 透過呼叫 scx_bpf_dispatch() 將 task 立即分配到一個 custom DSQ 上，而 DSQ ID 一定小於
  $2^{63}$
- 將 task 分配到 BPF side 的 queue 上
當 CPU 可以進行排程時，它會先搜索自己的 local DSQ ，若是空的才會搜索 global DSQ ，若還是沒有任務可以運行，則會呼叫 ops.dispatch() ，該函式會運用以下兩個 helper function 來產生任務到 local DSQ
- scx_bpf_dispatch() : 將任務分配至一個 DSQ 上，此處任何 DSQ 都可以。不過目前若 BPF lock 被持有，則無法呼叫 scx_bpf_dispatch() ，未來改進項目之一。此函式當中最多可以有 ops.dispatch_max_batch 個 pending tasks 。
- scx_bpf_consume() : 從指定的 non-local DSQ 當中拿出一個 task 並放入另一個 DSQ ，此函式同樣無法在 BPF lock 被持有時呼叫。 scx_bpf_consume() 會在嘗試進行以上操作之前先將 pending dispatch tasks 清空。
如果 ops.dispatch() 回傳後， local DSQ 有任務可以執行，則 CPU 會執行第一個，若 DSQ 依舊為空的，則進行以下動作
- 嘗試 consume global DSQ ，若成功就運行該任務
- 若 ops.dispatch() 沒有分配任何任務，重複第三步驟
- 若前一個任務是一個 SCX task 並且依舊可運行，則繼續執行它。
- 轉為 idle

理解了 sched_ext 當中的 scheduling cycle 後，我們終於可以來探討 scx_simple 的各項 handler function 實作，首先是 simple_select_cpu 。

simple_select_cpu()
利用預設的 scx_bpf_select_cpu_dfl() 挑選 CPU ，此處尚需探討挑選的方式為何，若挑選到的 CPU 狀態為 idle ，我們透過 scx_bpf_dispatch() 將任務直接分配到 SCX_DSQ_LOCAL 當中。
依據 scheduling cycle ，若挑選到的 CPU 為 idle 則透過 scx_bpf_dispatch() 直接分發任務後，就會跳過 simple_enqueue() 。
simple_enqueue()
將任務分配到 SHARED_DSQ 當中，此處會判定排程的規則是 FIFO 或透過 vtime 當優先權，可以注意到若採用 vtime 來決定，則會比較 p->scx.dsq_vtime 和 vtime_now - SCX_SLICE_DFL 的大小。
simple_dispatch()
在 CPU 搜索 local DSQ 和 global DSQ 都找不到可執行的任務時會呼叫此函式，在這個排程器的例子當中是利用 scx_bpf_consume() 從 SHARED_DSQ 當中取出 task 執行。
simple_running()
若採用 vtime 決定排程順序，則此函式會負責更新 vtime_now 的數值，應該探討如何呼叫此函式。
simple_stopping()
更新 task 的 dsq_vtime 。應探討如何呼叫此函式。

sched-ext 解析

sched-ext 撰寫了關於 scheduler extension 的所有基礎建設，包括如何將 scheduler 透過 eBPF 機制替換成使用者撰寫的 BPF scheduler 等等，首先解析 struct sched_ext_ops ，這是一個 BPF scheduler 的 operation table 。

select_cpu()
選擇一個 CPU 來執行 task ，此處的決策不一定是最終決策，不過如果最後 task 被放入此處選擇的 CPU 的 runqueue ，我們可以減少一些 overhead 。它作為預設將 idle CPU 喚醒的函式，使用者也可以另外實作自己將 CPU 喚醒的機制。
enqueue()
將 task 放入 BPF scheduler 的 queue 當中。如果在 select_cpu() 時沒有利用 scx_bpf_dispatch() 直接將 task 分配到選擇到的 CPU 上，則會執行此函式，此時 BPF scheduler 是該 task 的持有者。
dequeue()
將 task 從 BPF scheduler 上移除，通常在更新該 task 的 scheduling properties 時才會特別呼叫，用來將該 task 隔離。 sched-ext 的程式當中已經實作了可以追蹤某個 task 是否被 BPF scheduler 持有的機制，用以忽略 BPF scheduler 進行的 spurious dispatch ，所以此函式不一定要自行實作。
dispatch()
通知 BPF scheduler 來分配 task 到 CPU 上，或者 consume DSQ 。當 CPU 的 local DSQ 為空時才會被呼叫， BPF scheduler 應該透過 scx_bpf_dispatch() 分配一個或多個任務並利用 scx_bpf_consume() 將 user DSQ 當中的任務移動到 loca DSQ 當中。 prev 是上一個被 switched out 的任務，並且該任務的 slice 已經被耗盡，若 prev 狀態依舊是 runnable ，則它會在 dispatch() 函式回傳後被 enqueue() ，若要持續執行 prev ，則回傳時不該分配或 consume 任何任務。
tick()
每個 CPU 在每 1/HZ 秒會呼叫一次，若 p->scx.slice 變為 0 則會自動觸發 CPU 的 dispatch cycle 。
runnable()
代表該任務狀態轉變為 runnable ，用來追蹤任務的 execution state transition 。一個任務會先在 CPU 上轉變為 runnable ，並且經過幾個 running() 和 stopping() ，最後它結束在該 CPU 上的運行後會轉為 quiescent() 。
以下三種情況，一個任務會在它要被執行的 CPU 上轉為 runnable
- waking up (SCX_ENQ_WAKEUP)
- being moved from another CPU
- being restored after temporarily taken off the queue for an attribute change
和 enqueue() 相關但並非成對的，也就是說此函式用來通知任務 p 的狀態改變，有可能會接著呼叫 enqueue() 也有可能不會。同樣的任務也可以進行 enqueue() 但沒有先呼叫 runnable() 。
running()
通知某任務狀態變為 running
stopping()
通知某任務狀態變為 stopping ，若呼叫後任務狀態並非 runnable ，則下一個會呼叫 quiescent() 。
quiescent()
任務狀態轉變為 not runnable 。以下三種情形會讓 CPU 上的任務轉變為 quiescent
- sleeping (SCX_DEQ_SLEEP)
- being moved to another CPU
- being temporarily taken off the queue for an attribute change (SCX_DEQ_SAVE)
此函式和 dequeue() 相關聯但並非成對。
yield()
任務主動讓出 CPU 給另一個 runnable task ， BPF scheduler 需要確保另一個可執行的任務在這個 yielding task 之前先被分配。

另外 scheduling 當中十分重視 SMT 是否有被開啟，透過位於 /include/linux/sched/smt.h 當中的函式 sched_smt_active() 可以判斷當前系統是否有開啟 SMT 。其中一個在意 SMT 是否開啟的場景即是 idle tracking 。
在 sched_ext 當中 idle tracking 會透過一個結構體稱為 idle_masks 來進行，定義如下

static struct {
	cpumask_var_t cpu;
	cpumask_var_t smt;
} idle_masks CL_ALIGNED_IF_ONSTACK;

cpumask_var_t 在 /include/linux/cpumask.h 當中的定義實際上就是 struct cpumask ，但操作上有些細節要注意，可以參見它的註解。

struct cpumask 和 cpumask_t 的定義如下

typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

可以看到 struct cpumask 實際上就是一個長度為 NR_CPUS 的 bitmap ，而 cpumask_t 就是 struct cpumask ，那為何要多定義一個 cpumask_var_t ？
smt mask 和 cpu mask 差異在哪？

例如在 test_and_clear_cpu_idle(int cpu) 函式當中就會針對 smt 做特別的處理，先取得該 cpu 的 smt mask 後，檢查 idle_masks.smt 是否和該 smt mask 有重疊的部分，若有則清除那些部分的位元，接著再送入 cpumask_test_and_clear_cpu() 。

scx 解析

scx 專案底下包含了許多排程器，用全 C 語言撰寫的排程器都放在 scheds/c 目錄底下，但此處的排程器都是展示功能用，並非能上 production 或重點開發的排程器。主要的排程器是以 Rust 搭配 C 語言撰寫的 BPF program 組合而成，位於 scheds/rust 目錄底下，目前較成熟的排程器有 scx_lavd 、 scx_rusty 、 scx_rustland 、 scx_layered 。

此專案有個值得注意的部分位於 rust/scx_utils 目錄底下，撰寫了包括如何取得電腦目前架構的 topology 等方法，這些工具在 scx_rusty 等排程器當中被大量使用。

這裡可以注意的細節是 Linux scheduler 的觀點， CPU 架構階層是 NUMA node -> domain -> core -> CPU ， CPU 在這當中實際上是 logical thread 的角色。

sched_ext (2) : 排程器程式碼分析

Introduction

架構分析

使用者層級程式

BPF program

Scheduling Cycle

sched-ext 解析

scx 解析

Read more

2024q1 Homework1 (lab0)

Rust learning note (2)

Concurrency 學習筆記

eBPF and related tools