Scheduler in Kernel

# Scheduler in Kernel 參考 [Linux 核心設計: Scheduler(2): 概述 CFS Scheduler](https://hackmd.io/@sysprog/B18MhT00t) 及核心程式碼來理解核心中如何排程及相互配合。 ## Scheduler class 核心有許多不同種類的 scheduler，並提供 `struct sched_class` 當作一致的抽象定義。 [struct sched_class](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/sched.h#L2411) 並且不同的 scheduler 則放於 [kernel/sched/](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched) 目錄下，並在 [linked script](https://elixir.bootlin.com/linux/v6.14.5/source/include/asm-generic/vmlinux.lds.h#L131) 中把所有的物件按照優先級排序。 ```c /* * The order of the sched class addresses are important, as they are * used to determine the order of the priority of each sched class in * relation to each other. */ #define SCHED_DATA \ STRUCT_ALIGN(); \ __sched_class_highest = .; \ *(__stop_sched_class) \ *(__dl_sched_class) \ *(__rt_sched_class) \ *(__fair_sched_class) \ *(__ext_sched_class) \ *(__idle_sched_class) \ __sched_class_lowest = .; ``` 其中每個 `*(__xxx_sched_class)` 會把對應的 class 按照順序放入記憶體空間中，因此由上方的程式碼可以知道最高優先級的 scheduler 為 `stop_sched_class` 其次為 `dl_sched_class` 以此類推最低優先級為 `__idle_sched_class`，而我們所關心的 CFS 排程器則是 `__fair_sched_class`。可以在 [/kernel/sched/fair.c](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L13616) 中看見使用到 `DEFINE_SCHED_CLASS` 的巨集來定義每個 `__xxx_sched_class`。而該巨集則定義於 [/kernel/sched/sched.h](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/sched.h#L2522) 中。 ```c /* * Helper to define a sched_class instance; each one is placed in a separate * section which is ordered by the linker script: * * include/asm-generic/vmlinux.lds.h * * *CAREFUL* they are laid out in *REVERSE* order!!! * * Also enforce alignment on the instance, not the type, to guarantee layout. */ #define DEFINE_SCHED_CLASS(name) \ const struct sched_class name##_sched_class \ __aligned(__alignof__(struct sched_class)) \ __section("__" #name "_sched_class") ``` 因此會把 `fair.c` 中的 ```c /* * All the scheduling class methods: */ DEFINE_SCHED_CLASS(fair) = { .enqueue_task = enqueue_task_fair, .dequeue_task = dequeue_task_fair, .yield_task = yield_task_fair, .yield_to_task = yield_to_task_fair, ...後略 }; ``` 展開為 ```c __attribute__((section("__fair_sched_class"))) struct sched_class fair_sched_class = { .enqueue_task = enqueue_task_fair, .dequeue_task = dequeue_task_fair, .pick_next_task = pick_next_task_fair, ...後略 }; ``` 這部份會把 `fair_sched_class` 放入 `__fair_sched_class` 的 section 中，也就是提到的 linker script 中，不同的任務會依照不同的特性及優先級放入不同的 scheduler_class 中。 :::info 這邊借用 [Linux 核心設計: Scheduler](https://hackmd.io/@sysprog/Hka7Kzeap/%2FB18MhT00t) 內 @Uduru0522 的補充之所以用 section 來決定優先順序是為考慮對 cache 使用上的友善，詳見: * [sched: Micro optimization in pick_next_task() and in check_preempt_curr()](https://lore.kernel.org/all/157675913272.349305.8936736338884044103.stgit@localhost.localdomain) * [sched: Force the address order of each sched class descriptor](https://lore.kernel.org/all/20191219214558.845353593@goodmis.org/T/#m74565019ad2c2ebbe4eff82190a72ed70c683d2c) ::: 那核心又是如何排程每個任務的呢？核心程式碼中排程的主程式位於 [/kernel/sched/core.c](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/core.c#L6606) 中。其中，我們關注的是如何選出下一個任務，對應到 6717 行的 `next = pick_next_task(rq, prev, &rf);`，接著查看同一個檔案中的 [6104 行](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/core.c#L6104) `pick_next_task` 是如何選出下一個任務的。其中判斷是否啟用了 core_scheduling 若無則使用 `__pick_next_task` 來選定下一個任務。 :::warning 還不了解 [core_scheduling](https://docs.kernel.org/admin-guide/hw-vuln/core-scheduling.html) 的機制，暫時跳過。 ::: --- [__pick_next_task](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/core.c#L6009) 便是我們要尋找的**挑選下一個任務**的方法。其中，先確定是否有開啟 scx_ext，若已開啟則跳過後面 CFS 的部份直接跳到 restart 的段落。 ```c if (scx_enabled()) goto restart; ``` 若無開啟 scx_ext，則[判斷以下條件](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/core.c#L6023) ```c if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) && rq->nr_running == rq->cfs.h_nr_queued)) { ``` - `sched_class_above(prev->sched_class, &fair_sched_class)` : 目前執行的任務(prev) 是否擁有比 CFS 類別還高的優先權 - `rq->nr_running == rq->cfs.h_nr_queued` : 目前在等待的任務是否全為 CFS 的任務類別。若 1.沒有優先權高於 CFS 的任務類別 2.所有排隊的任務皆屬於 CFS 類別，則直接從 CFS 類別中挑選下一個任務 (`if` 成立) ，以節省時間。 ```c p = pick_next_task_fair(rq, prev, rf); ``` 若上述的優化沒有執行或 scx_ext 開啟，則直接使用原先的策略如下。執行 `prev_balance(rq, prev, rf);` 處理前一次的任務(關於 `prev_balance(rq, prev, rf)` 的說明補充在 [Appendix A](https://hackmd.io/_02EdggFSBmWMPeami9eZA?both#Appendix-A---prev_balance) 中)。接著使用 [`for_each_active_class(class)`](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/sched.h#L2572) 走訪每個定義好的 sched_class ，由於前面的 linker_script 固定了每個 class 的順序(按照優先級排序)，因此這邊會先走訪到優先權高的 sched_class 中的任務。 ```c p = class->pick_next_task(rq, prev); if (p) return p; ``` 而走訪的 [for_each_active_class(class)](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/sched.h#L2572) 定義如下 ```c #define for_active_class_range(class, _from, _to) \ for (class = (_from); class != (_to); class = next_active_class(class)) #define for_each_active_class(class) \ for_active_class_range(class, __sched_class_highest, __sched_class_lowest) ``` 至此，可了解到 kernel 是如何在不同優先級的 scheduler class 中做選擇。 ## Completely Fair Scheduling (CFS) Ingo Molnar 在 2007 年發表了 [Completely Fair Scheduler (CFS)](https://lkml.indiana.edu/0704.1/2138.html)，信件中也提及了諸多設計理念。 >My goal is to address various feature requests and to fix deficiencies in the vanilla scheduler that were suggested/found in the past few years, both for desktop scheduling and for server scheduling workloads. 其中，使用了 [rbtree](https://en.wikipedia.org/wiki/Red%E2%80%93black_tree) 來實做，且不依賴原始的 jiffies 或是 Hz 的計算，這部份也會在後面查看程式碼時看見。 >CFS's design is quite radical: it does not use runqueues, it uses a time-ordered rbtree to build a 'timeline' of future task execution, and thus has no 'array switch' artifacts (by which both the vanilla scheduler and RSDL/SD are affected). > >CFS uses nanosecond granularity accounting and does not rely on any jiffies or other HZ detail. Thus the CFS scheduler has no notion of 'timeslices' and has no heuristics whatsoever. 接著查看核心中 [fair.c](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c) 中的程式碼來理解其設計。 --- 首先先查看 CFS 的 [sched_class](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L13616) 分別對應到哪個函式。 ```c /* * All the scheduling class methods: */ DEFINE_SCHED_CLASS(fair) = { .enqueue_task = enqueue_task_fair, .dequeue_task = dequeue_task_fair, .yield_task = yield_task_fair, .yield_to_task = yield_to_task_fair, .wakeup_preempt = check_preempt_wakeup_fair, .pick_task = pick_task_fair, .pick_next_task = __pick_next_task_fair, .put_prev_task = put_prev_task_fair, .set_next_task = set_next_task_fair, .task_tick = task_tick_fair, .task_fork = task_fork_fair, .reweight_task = reweight_task_fair, .prio_changed = prio_changed_fair, .switched_from = switched_from_fair, .switched_to = switched_to_fair, .get_rr_interval = get_rr_interval_fair, .update_curr = update_curr_fair, }; ``` :::warning 此處我們先忽略 CONFIG_SMP, CONFIG_FAIR_GROUP_SCHED, CONFIG_SCHED_CORE, CONFIG_UCLAMP_TASK 這幾個編譯核心時的選項，專注於討論 CFS 的主要函式。 ::: 我們關注和排程最相關的幾個函式，分別為把任務加入/移出 CFS runqueue 的 `enqueue_task_fair` / `dequeue_task_fair`，挑選出下一個任務的 `__pick_next_task_fair` 。 ### [sched_entity](https://elixir.bootlin.com/linux/v6.14.5/source/include/linux/sched.h#L553) 在開始查看如何把任務排入/移出 runqueue 前，需要先了解**排程的單位**。為了避免不同使用者擁有不同的 task 數量造成不公平 (unfair) ，因此 CFS 引入了 group scheduling 的機制。 >e.g. 若 A 擁有 1 個任務而 B 擁有 49 個任務，98% 的 CPU 資源便會被 B 拿走，這並非 CFS 所希望的公平。 > > [Linux 核心設計: Scheduler](https://hackmd.io/@sysprog/Hka7Kzeap/%2FB18MhT00t) : > scheduler 在排程時，會去尋找最需要 CPU 的 sched_entity，如果 sched_entity 表現的不是一個 task，則再去 sched_entity 下的 runqueue 挑出下一個 entity，重複直到找到是表現一個 task 的 sched_entity。最後，在把 CPU 時間加回 virtual runtime 時，會把該 sched_entity 包含往上 parent 階層的都一併更新。查看 sched_entity 的結構(部份省略) ```c struct sched_entity { /* For load-balancing: */ struct load_weight load; struct rb_node run_node; u64 deadline; u64 min_vruntime; u64 min_slice; struct list_head group_node; unsigned char on_rq; unsigned char sched_delayed; unsigned char rel_deadline; unsigned char custom_slice; /* hole */ u64 exec_start; u64 sum_exec_runtime; u64 prev_sum_exec_runtime; u64 vruntime; s64 vlag; u64 slice; u64 nr_migrations; #ifdef CONFIG_FAIR_GROUP_SCHED int depth; struct sched_entity *parent; /* rq on which this entity is (to be) queued: */ struct cfs_rq *cfs_rq; /* rq "owned" by this entity/group: */ struct cfs_rq *my_q; /* cached value of my_q->h_nr_running */ unsigned long runnable_weight; #endif }; ``` 當開啟 `CONFIG_FAIR_GROUP_SCHED` 時，每個 sched_entity 底下會有另外的 sched_entity 對應到其 parent，及該階層所在的 cfs_rq，透過該方式便可以有效的把每個不同層的任務分開且公平的排序。 :::warning 參考 [Linux 核心設計: Scheduler](https://hackmd.io/@sysprog/Hka7Kzeap/%2FB18MhT00t) 的筆記，另一個 `struct my_q` 會指向自己底下的 runqueue。但不理解為何自己底下還會需要另一個 runqueue。是指把自己當成 parent 的 sched_entity 的 runqueue 嗎？ TODO: 釐清 `cfs_rq` 和 `my_q` 的差異。 ::: 其中 `struct load_weight load;` 是由 NICE 值做轉換的，後面計算 vruntime 會使用到。其[轉換規則](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/core.c#L10097)如下，詳細的轉換過程紀錄於 [Appendix B](https://hackmd.io/_02EdggFSBmWMPeami9eZA?both#Appendix-B---vruntime) ```c /* * Nice levels are multiplicative, with a gentle 10% change for every * nice level changed. I.e. when a CPU-bound task goes from nice 0 to * nice 1, it will get ~10% less CPU time than another CPU-bound task * that remained on nice 0. * * The "10% effect" is relative and cumulative: from _any_ nice level, * if you go up 1 level, it's -10% CPU usage, if you go down 1 level * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25. * If a task goes up by ~10% and another task goes down by ~10% then * the relative distance between them is ~25%.) */ const int sched_prio_to_weight[40] = { /* -20 */ 88761, 71755, 56483, 46273, 36291, /* -15 */ 29154, 23254, 18705, 14949, 11916, /* -10 */ 9548, 7620, 6100, 4904, 3906, /* -5 */ 3121, 2501, 1991, 1586, 1277, /* 0 */ 1024, 820, 655, 526, 423, /* 5 */ 335, 272, 215, 172, 137, /* 10 */ 110, 87, 70, 56, 45, /* 15 */ 36, 29, 23, 18, 15, }; ``` `struct rb_node run_node;` 指向該任務在紅黑樹中的位置，而 vruntime 相關的計算補充於 [Appendix B](https://hackmd.io/_02EdggFSBmWMPeami9eZA?both#Appendix-B---vruntime)。 >[Linux 核心設計: Scheduler](https://hackmd.io/@sysprog/Hka7Kzeap/%2FB18MhT00t) : >先簡單從概念上解釋 CFS 維護任務調度的思維: 那些急需 CPU(virtual runtime 小) 的 task 會被放在紅黑樹的左邊，而不迫切的 task 則在右邊。為保持公平，Scheduler 會去挑選最左邊(leftmost)的任務，任務被切換掉後，把 virtual runtime 加上使用的 CPU 時間重新插入 RB-tree。如此一來，左側的任務就被確保有時間，並且樹的內容會從右向左遷移以保持公平，每個 task 會不斷追逐其他使用 CPU 多於自己的任務以保持平衡。 ### [enqueue_task_fair](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L6924) > 任務會透過 enqueue_task_fair 來加入到 CFS 的紅黑樹中後才增加 nr_running 的數目，避免在增加完成前就被視做 runnable 的任務。其中 `struct cfs_rq *cfs_rq;` 及 `struct sched_entity *se = &p->se;` 分別代表 CFS 的 runqueue 及要加入到 runqueue 的任務，結構的定義可以在`/kernel/linux/sched.h` 的 [650 行](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/sched.h#L650) 及 `/include/linux/sched.h` 的 [553 行](https://elixir.bootlin.com/linux/v6.14.5/source/include/linux/sched.h#L553)看到，因此我們的目標是把 `se` 放入 `cfs_rq` 中。 :::warning 從 6934 行至 6963 行皆是在針對不同的 flag 及設置做處理，還沒理解這邊每個設置的差異，暫時先不討論。 ::: 這邊使用到 for_each_sched_entity 來逐一走訪每個 sched_entity，目的便是在 enqueue 完目前的 group 後繼續把 parent 的任務給 enqueue。 ```c /* Walk up scheduling entities hierarchy */ #define for_each_sched_entity(se) \ for (; se; se = se->parent) ``` 先使用 `se->on_rq` 確認要 enqueue 的任務是否已經在 runqueue上，並確認是否是被 delay 的任務。若已經在 runqueue 的話則不須要再排入 runqueue 因此直接 break，若是被 delay 的任務則使用 requeue_delayed_entity 來重新排入 runqueue 。 ```c if (se->on_rq) { if (se->sched_delayed) requeue_delayed_entity(se); break; } ``` 接著使用 `cfs_rq = cfs_rq_of(se);` 來取得 se 應該要放置的 cfs runqueue 中，其函式定義如下。 ```c /* runqueue on which this entity is (to be) queued */ static inline struct cfs_rq *cfs_rq_of(const struct sched_entity *se) { return se->cfs_rq; } ``` enqueue 前後會更新 slice ，若子代 (enqueue 是從底層的 group 一路往上做直到 parent 的)的 se 有傳 slice 的話則把當前的 slice 設為由子代傳入的數值。而 enqueue 結束後重新算 slice 以傳給親代 (parent)。 ```c /* * Basically set the slice of group entries to the min_slice of * their respective cfs_rq. This ensures the group can service * its entities in the desired time-frame. */ if (slice) { se->slice = slice; se->custom_slice = 1; } enqueue_entity(cfs_rq, se, flags); slice = cfs_rq_min_slice(cfs_rq); ``` 而真正 enqueue 的函式為 [enqueue_entity](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L5317)，我們把它放在這個段落的最後介紹。完成後把子代 sched_entity 傳上來的資訊加入到當前的 cfs_runqueue 中，分別是 runnable 的權重, 目前等待排入的任務數量及閒置 entity 的數量。而 `cfs_rq_throttled` 則是用來限制 task_group 在一個週期內可以消耗的 CPU 資源。 ```c cfs_rq->h_nr_runnable += h_nr_runnable; cfs_rq->h_nr_queued++; cfs_rq->h_nr_idle += h_nr_idle; if (cfs_rq_is_idle(cfs_rq)) h_nr_idle = 1; /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) goto enqueue_throttle; ``` 程式碼第二個迴圈則在更新對應的統計資訊，最後則用 `add_nr_running(rq, 1);` 來增加 CPU 中 runnable 的數量，至此完成 enqueue 的處理。最後根據註解所說，當新任務初始化時會分配到所屬 CPU 的剩餘效能作為效能估算的數值，即使任務非常小也可能被高估。因此這邊不納入計算新任務的估計，而是等到這些任務的 PELT (Per-Entity Load Tracking) 收斂才納入計算。(關於 PELT 的介紹放在 [Appendix C](https://hackmd.io/_02EdggFSBmWMPeami9eZA?both#Appendix-C---PELT)) ```c /* * Since new tasks are assigned an initial util_avg equal to * half of the spare capacity of their CPU, tiny tasks have the * ability to cross the overutilized threshold, which will * result in the load balancer ruining all the task placement * done by EAS. As a way to mitigate that effect, do not account * for the first enqueue operation of new tasks during the * overutilized flag detection. * * A better way of solving this problem would be to wait for * the PELT signals of tasks to converge before taking them * into account, but that is not straightforward to implement, * and the following generally works well enough in practice. */ if (!task_new) check_update_overutilized_status(rq); ``` 接著查看 [enqueue_entity](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L5317) 的程式碼，發現 `enqueue_entity` 同樣先更新任務狀態，真正排程的部份在 [__enqueue_entity](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L850) 中。 ```c= /* * Enqueue an entity into the rb-tree: */ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { avg_vruntime_add(cfs_rq, se); se->min_vruntime = se->vruntime; se->min_slice = se->slice; rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less, &min_vruntime_cb); } ``` 第 6 行使用 `avg_vruntime_add` 將 se->vruntime 加入該 runqueue 的平均 vruntime 計算中。 :::warning 第 7 行及第 8 行用來儲存 min_vruntime 及 min_slice 至目前的任務，但不明白為何會直接把目前的 vruntime 視為該 sched_entity (se) 的最小值。 $\to$ 是否因為 vruntime 和 slice 都會根據 CFS 的排程去做調整，所以在最一開始 enqueue 時，紀錄最小值? ::: 第 9 行則是用來將目前的任務插入 runqueue 的紅黑樹中，vruntime 的解釋則放在 [Appendix B](https://hackmd.io/_02EdggFSBmWMPeami9eZA?both#Appendix-B---vruntime) 說明。我們查看 [rb_add_augmented_cached](https://elixir.bootlin.com/linux/v6.14.5/source/include/linux/rbtree_augmented.h#L63) 來了解如何將任務插入到 run queue 中。 ```c= static __always_inline struct rb_node * rb_add_augmented_cached(struct rb_node *node, struct rb_root_cached *tree, bool (*less)(struct rb_node *, const struct rb_node *), const struct rb_augment_callbacks *augment) { struct rb_node **link = &tree->rb_root.rb_node; struct rb_node *parent = NULL; bool leftmost = true; while (*link) { parent = *link; if (less(node, parent)) { link = &parent->rb_left; } else { link = &parent->rb_right; leftmost = false; } } rb_link_node(node, parent, link); augment->propagate(parent, NULL); /* suboptimal */ rb_insert_augmented_cached(node, tree, leftmost, augment); return leftmost ? node : NULL; } ``` - **第 2 ~ 3 行** : `*node` 為目標插入的節點，`*tree` 則為目標的紅黑樹，`*less` 為自訂的比較函式，用來依照 vruntime 來排序，`*augment` 則為紅黑樹的額外資訊。 - 對應到 enqueue 函式中的輸入則分別為 `&se->run_node`, `&cfs_rq->tasks_timeline`, `__entity_less`, `&min_vruntime_cb` - 查看傳入的比較函式 [__entity_less](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L798) 內部呼叫了 [entity_before](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L547) 並輸入 `__node_2_se(a)` 及 `__node_2_se(b)` ，其定義整理如下， ```c #define __node_2_se(node) \ rb_entry((node), struct sched_entity, run_node) #define rb_entry(ptr, type, member) container_of(ptr, type, member) ``` 相當於 `container_of(node, struct sched_entity, run_node)`，用來從 node 找到 sched_entity 的結構並傳入 entity_before 做排序。 ```c static inline bool __entity_less(struct rb_node *a, const struct rb_node *b) { return entity_before(__node_2_se(a), __node_2_se(b)); } ``` :::warning 這邊有個疑惑，為何比較函式中的 entity_before 會是用 deadline 在比較的?不是應該是 vruntime 嗎? ```c static inline bool entity_before(const struct sched_entity *a, const struct sched_entity *b) { /* * Tiebreak on vruntime seems unnecessary since it can * hardly happen. */ return (s64)(a->deadline - b->deadline) < 0; } ``` $\to$ 在 Linux [v6.8-rc1](https://elixir.bootlin.com/linux/v6.8-rc1/source/kernel/sched/fair.c#L551) 才從原先對 vruntime 比較，改成了比較 deadline (詳細的 git commit 要在找一下) ::: - **第 6 ~ 8 行** ：將初始位置設定為紅黑樹的根節點，宣告 parent 及 leftmost ， leftmost 用於最後若插入的節點放置在樹的最左邊則回傳該節點方便快速尋找。 - **第 10 ~ 18 行** : 使用二元搜尋的方式找到需要安插的位置，其中使用 less 當作比較函式。 - **第 20 行** ：將節點插入到搜尋出來的位置中，查看[程式碼](https://elixir.bootlin.com/linux/v6.14.5/source/include/linux/rbtree.h#L59)可以了解這步驟僅是將節點連接上去，還沒進行樹的平衡及改變節點顏色。 - **第 21 行** ：使用 `propagate` 將輸入的 vruntime 往上更新 parent 節點內的資訊。 :::warning 不明白 propagete 用途 :::spoiler [propogae 實作](https://elixir.bootlin.com/linux/v6.14.5/source/include/linux/rbtree_augmented.h#L90) ```c /* * Template for declaring augmented rbtree callbacks (generic case) * * RBSTATIC: 'static' or empty * RBNAME: name of the rb_augment_callbacks structure * RBSTRUCT: struct type of the tree nodes * RBFIELD: name of struct rb_node field within RBSTRUCT * RBAUGMENTED: name of field within RBSTRUCT holding data for subtree * RBCOMPUTE: name of function that recomputes the RBAUGMENTED data */ #define RB_DECLARE_CALLBACKS(RBSTATIC, RBNAME, \ RBSTRUCT, RBFIELD, RBAUGMENTED, RBCOMPUTE) \ static inline void \ RBNAME ## _propagate(struct rb_node *rb, struct rb_node *stop) \ { \ while (rb != stop) { \ RBSTRUCT *node = rb_entry(rb, RBSTRUCT, RBFIELD); \ if (RBCOMPUTE(node, true)) \ break; \ rb = rb_parent(&node->RBFIELD); \ } \ } \ static inline void \ RBNAME ## _copy(struct rb_node *rb_old, struct rb_node *rb_new) \ { \ RBSTRUCT *old = rb_entry(rb_old, RBSTRUCT, RBFIELD); \ RBSTRUCT *new = rb_entry(rb_new, RBSTRUCT, RBFIELD); \ new->RBAUGMENTED = old->RBAUGMENTED; \ } \ static void \ RBNAME ## _rotate(struct rb_node *rb_old, struct rb_node *rb_new) \ { \ RBSTRUCT *old = rb_entry(rb_old, RBSTRUCT, RBFIELD); \ RBSTRUCT *new = rb_entry(rb_new, RBSTRUCT, RBFIELD); \ new->RBAUGMENTED = old->RBAUGMENTED; \ RBCOMPUTE(old, false); \ } \ RBSTATIC const struct rb_augment_callbacks RBNAME = { \ .propagate = RBNAME ## _propagate, \ .copy = RBNAME ## _copy, \ .rotate = RBNAME ## _rotate \ }; ``` ::: - **第 22 行** ：使用到 [`rb_insert_augmented_cached`](https://elixir.bootlin.com/linux/v6.14.5/source/include/linux/rbtree_augmented.h#L54) 來修正插入節點後的紅黑樹 - ```c static inline void rb_insert_augmented_cached(struct rb_node *node, struct rb_root_cached *root, bool newleft, const struct rb_augment_callbacks *augment) { if (newleft) root->rb_leftmost = node; rb_insert_augmented(node, &root->rb_root, augment); } ``` - 先確認 `if (newleft)` 判斷插入的新節點是否是最左邊的節點，若為是的話則更新快取以方便後面做快速尋找最小節點。接著再呼叫 [rb_insert_augmented](https://elixir.bootlin.com/linux/v6.14.5/source/include/linux/rbtree_augmented.h#L47) 並使用其中的 [__rb_insert_augmented](https://elixir.bootlin.com/linux/v6.14.5/source/lib/rbtree.c#L456)，該函式又呼叫 [__rb_insert](https://elixir.bootlin.com/linux/v6.14.5/source/tools/lib/rbtree.c#L85) 來平衡紅黑樹，而詳細紅黑樹的操作可以參考 [Linux 核心的紅黑樹](https://hackmd.io/@sysprog/linux-rbtree)。綜合上述，CFS 使用紅黑樹來儲存任務，透過 enqueue 將任務加入到樹中並依照 vruntime 來排序，將需要優先執行的任務放在樹的左側並使用到快取的方式來加速。 ### [dequeue_task_fair](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L7179) 了解完 enqueue 的方式，為了避免贅述在 dequeue 的部份我們專注於在 dequeue 的過程中需要紀錄及更新什麼參數。實際在進行 dequeue 的函式為 [dequeue_entities](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L7067) 中的 [dequeue_entity](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L5454) ，程式碼如下，依照註解更新不同的參數並呼叫 `__dequeue_entity` 來移除任務。 ```c /* * When dequeuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. * - For group_entity, update its runnable_weight to reflect the new * h_nr_runnable of its group cfs_rq. * - Subtract its previous weight from cfs_rq->load.weight. * - For group entity, update its weight to reflect the new share * of its group cfs_rq. */ update_load_avg(cfs_rq, se, action); se_update_runnable(se); update_stats_dequeue_fair(cfs_rq, se, flags); update_entity_lag(cfs_rq, se); if (sched_feat(PLACE_REL_DEADLINE) && !sleep) { se->deadline -= se->vruntime; se->rel_deadline = 1; } if (se != cfs_rq->curr) __dequeue_entity(cfs_rq, se); se->on_rq = 0; account_entity_dequeue(cfs_rq, se); ``` 其中 [__dequeue_entity](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L859) 呼叫了 `rb_erase_augmented_cached` 及 `avg_vruntime_sub` ，分別用於把 task 移出 runqueue 及更新平均的 vruntime。 ```c static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, &min_vruntime_cb); avg_vruntime_sub(cfs_rq, se); } ``` 和 enqueue 相同，後綴的 cached 反應在上述程式碼的 if 判斷中，如果移除的節點是最左的節點則可使用快取來加速搜尋。 ```c static __always_inline void rb_erase_augmented_cached(struct rb_node *node, struct rb_root_cached *root, const struct rb_augment_callbacks *augment) { if (root->rb_leftmost == node) root->rb_leftmost = rb_next(node); rb_erase_augmented(node, &root->rb_root, augment); } ``` 而真正執行移除節點的函式位於 [rb_erase_augmented](https://elixir.bootlin.com/linux/v6.14.5/source/tools/include/linux/rbtree_augmented.h#L291) 中使用到 [__rb_erase_augmented](https://elixir.bootlin.com/linux/v6.14.5/source/tools/include/linux/rbtree_augmented.h#L187) 來真正的移除節點，而該函式也會返回是否需要 rebalance 若需要的話在透過 [__rb_erase_color](https://elixir.bootlin.com/linux/v6.14.5/source/tools/include/linux/rbtree_augmented.h#L183) 來調整。同樣的，詳細紅黑樹的操作可以參考 [Linux 核心的紅黑樹](https://hackmd.io/@sysprog/linux-rbtree)。 ```c static __always_inline void rb_erase_augmented(struct rb_node *node, struct rb_root *root, const struct rb_augment_callbacks *augment) { struct rb_node *rebalance = __rb_erase_augmented(node, root, augment); if (rebalance) __rb_erase_color(rebalance, root, augment->rotate); } ``` ### [__pick_next_task_fair](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L8960) 這部份直接呼叫 [`pick_next_task_fair`](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L8874) 來挑選下一個任務。 ```c static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev) { return pick_next_task_fair(rq, prev, NULL); } ``` 這邊我們從兩個方向理解，如何挑選出下一個任務及若有開啟 group scheduling 又需要注意哪些面向。首先，若沒有開啟 group scheduling 則 `#ifdef CONFIG_FAIR_GROUP_SCHED` 至 `#endif` 中間段不會被編譯。因此程式碼會如下 ```c= struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { struct sched_entity *se; struct task_struct *p; int new_tasks; again: p = pick_task_fair(rq); if (!p) goto idle; se = &p->se; put_prev_set_next_task(rq, prev, p); return p; idle: if (!rf) return NULL; new_tasks = sched_balance_newidle(rq, rf); /* * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is * possible for any higher priority task to appear. In that case we * must re-start the pick_next_entity() loop. */ if (new_tasks < 0) return RETRY_TASK; if (new_tasks > 0) goto again; /* * rq is about to be idle, check if we need to update the * lost_idle_time of clock_pelt */ update_idle_rq_clock_pelt(rq); return NULL; } ``` #### 選擇任務 - 第 9 行：使用 [pick_task_fair](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L8843) 從 runqueue 中選擇執行的 task 並存入 `task_struct *p` 中 - 第 10 ~ 11 行：若沒有選出任務則跳至 idle 處理。 - 第 12 行：若有找到任務則從 task_struct 中取出 sched_entity 供 CFS 使用 - 第 13 ~ 14 行：使用 put_prev_set_next_task 將前一次的任務儲存，並更新對應的 CFS 資訊，最後回傳找出的任務 p #### idle 處理 - 第 17 ~ 18 行：判斷傳入的旗標，若旗標為 true 則直接回傳 NULL 不進行後面的嘗試。 :::warning 不確定這邊旗標和什麼相關... (一種防禦機制? 避免 `sched_balance_newidle` 對空指針 dereference ) [rq_flags](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/sched.h#L1712) ```c struct rq_flags { unsigned long flags; struct pin_cookie cookie; #ifdef CONFIG_SCHED_DEBUG /* * A copy of (rq::clock_update_flags & RQCF_UPDATED) for the * current pin context is stashed here in case it needs to be * restored in rq_repin_lock(). */ unsigned int clock_update_flags; #endif }; ``` ::: - 第 20 行： [sched_balance_newidle](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L12774) 嘗試從其他的 cpu 偷任務來執行。 >sched_balance_newidle is called by schedule() if this_cpu is about to become > Returns: >< 0 - we released the lock and there are !fair tasks present > 0 - failed, no new tasks > \>0 - success, new (fair) tasks present idle. Attempts to pull tasks from other CPUs. - 第 27 ~ 31 行：若回傳為 `< 0` 表示找到的任務不是 CFS 的任務，因此需要回傳RETRY_TASK 通知排程器重新排程。若回傳值 `> 0` ，代表有 idle 的任務，跳回去 again 重新挑任務。 - 第 37 ~ 39 行：更新該 runqueue 的 PELT並回傳 NULL 表示沒有找到任務。 :::warning 關於 PELT 的介紹放在 Appendix C :::spoiler [update_idle_rq_clock_pelt](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/pelt.h#L125) ```c /* * When rq becomes idle, we have to check if it has lost idle time * because it was fully busy. A rq is fully used when the /Sum util_sum * is greater or equal to: * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT; * For optimization and computing rounding purpose, we don't take into account * the position in the current window (period_contrib) and we use the higher * bound of util_sum to decide. */ static inline void update_idle_rq_clock_pelt(struct rq *rq) { u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX; u32 util_sum = rq->cfs.avg.util_sum; util_sum += rq->avg_rt.util_sum; util_sum += rq->avg_dl.util_sum; /* * Reflecting stolen time makes sense only if the idle * phase would be present at max capacity. As soon as the * utilization of a rq has reached the maximum value, it is * considered as an always running rq without idle time to * steal. This potential idle time is considered as lost in * this case. We keep track of this lost idle time compare to * rq's clock_task. */ if (util_sum >= divider) rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt; _update_idle_rq_clock_pelt(rq); } ``` ::: #### [pick_task_fair](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L8843) pick_task_fair 會先取得 runqueue 中的任務數量，如果沒有任務便直接回傳 NULL。 ```c if (!cfs_rq->nr_queued) return NULL; ``` 接著先檢查目前的 cfs runqueue 的 runtime 是否仍有時間，若無則重新挑選，若有則使用 pick_next_entity 來挑選下一個任務，並進到下一個 group 繼續排程。 ```c if (unlikely(check_cfs_rq_runtime(cfs_rq))) goto again; se = pick_next_entity(rq, cfs_rq); if (!se) goto again; cfs_rq = group_cfs_rq(se); ``` 最後回傳該 sched_entity 的 task_struct。 ```c return task_of(se); ``` --- ## Appendix A - [prev_balance](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/core.c#L5976) prev_balance 從 prev 的 class 開始往下檢查，直到 idle class 為止，對每一個 class 執行以確保該層級排程器的狀態是正確的或是準備好的。 ```c #ifdef CONFIG_SCHED_CLASS_EXT /* * SCX requires a balance() call before every pick_task() including when * waking up from SCHED_IDLE. If @start_class is below SCX, start from * SCX instead. Also, set a flag to detect missing balance() call. */ if (scx_enabled()) { rq->scx.flags |= SCX_RQ_BAL_PENDING; if (sched_class_above(&ext_sched_class, start_class)) start_class = &ext_sched_class; } #endif ``` 用 ifdef 確認是否有 `CONFIG_SCHED_CLASS_EXT` 這個 Kconfig 的選項，若有的話則檢查 scx_ext 是否有被開啟，若開啟則直接將 start_class 設定為 ext_sched_class，後續則從 ext_sched_class 開始 balance 而非從 prev 開始。 ```c for_active_class_range(class, start_class, &idle_sched_class) { if (class->balance && class->balance(rq, prev, rf)) break; } ``` 接著確認 sched_class 有定義 balance 的函式，若有定義便呼叫 balance()。當某個 sched_class 的 balance 回傳 true 的時候表示該類別內部已經有可以執行的任務了，因此直接跳出迴圈，跳過 balance 後面其餘優先級較低的類別。 :::warning 各個 scheduler_class 的 balance() 具體在在什麼還不清楚。 :::spoiler [balance (還沒找到實作)](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/sched.h#L2424) ```c int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); ``` ::: ## Appendix B - vruntime CFS 使用 vruntime 來衡量一個行程已經「佔用 CPU 多久」的虛擬時間值，在計算上也會考慮到 NICE 值來分配並比較不同行程的 CPU 使用狀況，以決定下一個要執行的行程。從 [Line 566](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L566) 開始便是相關的計算。 ### vruntime 實際計算 vruntime 的函式為 [update_curr](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L1228)，函式會計算目前實際執行了多久時間，並且將其依照權重換算成 vruntime 。透過 update_cur_se 來取得執行時間差 `delta_exec`，接著透過 `calc_delta_fair` 來計算 vruntime 的更新。這邊我們著墨於 vruntime 的計算，因此便不撰述 `update_cur_se` 的內容。 [calc_delta_fair](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L292) 在進入真正的計算前會先確認 NICE 值是否為 0 ，如果為 0 就直接返回 delta 無須進行加權的調整。 ```c static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se) { if (unlikely(se->load.weight != NICE_0_LOAD)) delta = __calc_delta(delta, NICE_0_LOAD, &se->load); return delta; } ``` 而 [__calc_delta](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L250) 的計算公式如下，透過 $w_i$ 來調整 vruntime 。當權重大時 vruntime 的變化小，因此權重大的任務其 vruntime 變化量小，有更高的機會被執行。 $$ \Delta v = \frac{\Delta t \cdot w_{\text{ref}}}{w_i} $$ 其中 $\Delta v$ 為 vruntime 的變化量，$\Delta t$ 為實際時間變化量， $w_{ref}$ 為 NICE = 0 時的權重，而 $w_i$ 為該次任務的權重。但 Linux 核心可能會遇到沒有 FPU 的情況，因此這邊進一步改良成使用 fix-point 做計算，因此公式改如下， $$ \Delta v = \left( \Delta t \cdot \left( w_{\text{ref}} \cdot \frac{2^S}{w_i} \right) \right) \div 2^S $$ 其中 $S$ 一般會用 32 ，因此公式轉為 $$ \Delta v = \left( \Delta t \cdot \left( w_{\text{ref}} \cdot \frac{2^{32}}{w_i} \right) \right) \div 2^{32} $$ ### weight 那所謂的 weight 又是如何計算的呢？為了讓不同 $NICE$ 值獲得的 $vruntime$ 相差 10 % 使用了公式 $$ w(n) = \frac{1024}{(1.25^n)} $$ 以 weight 為 $n_a$ 及 $n_b$ 為例，兩者所獲得的 CPU 時間分別是 $$ CPU_a = \frac{w(n_a)}{w(n_a)+w(n_b)}, ~CPU_b = \frac{w(n_b)}{w(n_a)+w(n_b)} $$ 因此其 CPU 時間相差可以計算為 $$ CPU_{diff} = \frac{w(n_a)}{w(n_a)+w(n_b)} - \frac{w(n_b)}{w(n_a)+w(n_b)}=\frac{w(n_a)-w(n_b)}{w(n_a)+w(n_b)} $$ 接著結合上述的 $w(n) = \frac{1024}{(1.25^n)}$ 公式，我們把 $n_a$ 及 $n_b$ 定為 $n_a$ 及 $n_{a+diff}$ 因此 $CPU_{diff}$ 為 $$ CPU_{diff} = \frac{\frac{1024}{1.25^{n_a}}-\frac{1024}{1.25^{n_{a}+diff}}}{\frac{1024}{1.25^{n_a}}+\frac{1024}{1.25^{n_{a}+diff}}} = \frac{1.25^{diff}-1}{1.25^{diff}+1} $$ 當 diff 為 1 時， $CPU_{diff} = \frac{1.25-1}{1.25+1} = \frac{0.25}{2.25} = 0.1111$ ，略接近於我們希望的 10% 的差距。 :::info 可以注意到，上述計算若 1024 替換為其他數值結果同樣會是 10% 的差距，但又為何這邊要乘上 1024 ? $\to$ 為了避免 FPU 的使用，核心內部大量用到 fix-point 的計算，而這邊的 1024 便是 fix-point 計算中左移 10 位元（fix-point shift 10）的表現。 ::: CFS 維護了兩份[查找表](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/core.c#L10097)分別是用於計算不同 NICE 值時 weight 及 $\frac{2^{32}}{weight}$ 的對應值(使用了 fix-point 做計算且 fix-point shift 設定為 10）以方便快速計算。以 NICE 值 19 為例，其 weight 為 $$ w(19) = \frac{1024}{(1.25)^{19}}=14.757395259 $$ 因此在查找表上 NICE 19 對應的 weight 為 15。接著計算其 $\frac{2^{32}}{weight}$ $$ \frac{2^{32}}{15} = 286331153.067 $$ 因此在查找表上紀錄為 286331153。 --- $$ \Delta v = \left( \Delta t \cdot \left( w_{\text{ref}} \cdot \frac{2^{32}}{w_i} \right) \right) \div 2^{32} $$ 上述計算對應到[程式碼](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L250)如下: ```c= static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw) { u64 fact = scale_load_down(weight); u32 fact_hi = (u32)(fact >> 32); int shift = WMULT_SHIFT; int fs; __update_inv_weight(lw); if (unlikely(fact_hi)) { fs = fls(fact_hi); shift -= fs; fact >>= fs; } fact = mul_u32_u32(fact, lw->inv_weight); fact_hi = (u32)(fact >> 32); if (fact_hi) { fs = fls(fact_hi); shift -= fs; fact >>= fs; } return mul_u64_u32_shr(delta_exec, fact, shift); } ``` - 第 16 行 `mul_u32_u32` 傳入的 `lw->inv_weight` 便是先前用查找表先計算好的 $\frac{32}{w_i}$ ，而 `fact` 是公式內的 $w_{ref}$。 - 第 18 行則是對應到公式內除以 $2^32$ 的部份。 - 第 25 行再使用 mul_u64_u32_shr 來乘上 $\Delta t$。另一方面，當任務進入睡眠並醒來後，其 vruntime 可能會因為太久沒有更新因此過小。因此需要紀錄 min_vruntime ，當任務醒來後確認 vruntime 數值，若 vruntime 過小 (優先權過高)則將其設為儲存的 min_vruntime。 :::warning 檢視 min_vruntime 對應的程式碼及計算 ::: ### Avg_vruntime CFS 若僅需要使用 `vruntime` 的計算便可以排序每個任務，那又為何需要引入 avg_vruntime 的機制呢？原因可以在 2023 年的 [[PATCH 03/15] sched/fair: Add lag based placement](https://lkml.org/lkml/2023/5/31/606) 中了解，原先的排程只看每個行程單獨的 `vruntime` 並選擇最小的優先執行。但若某個行程突然加入（e.g. 剛醒來的行程），便可能導致平均 $V$ 大幅度改變。為了在任務醒來時保持穩定，因此引入 `vlag` 來計算任務被移出 `runqueue` 前落後的程度，任務醒來後便能夠透過 `vlag` 來重新推導出正確的 `vruntime` 。 :::warning 這邊跟 EEVDF 很像，且註解中也有提到 EEVDF ，似乎是CFS和EEVDF中間的一個過度？ -> EEVDF 並非是額外獨立一個排程器，而是對 fair.c 的改良，因此這邊 Avg_vruntime 確實是用於 EEVDF。 ::: :::warning 關於 `vlag` 的用途追蹤先跳過，超出這邊討論範圍。 ::: 因此得先了解如何正確計算 `avg_vruntime`。為了達成真正的公平，每個行程落後的時間總和要為 0 ，即 $$ \sum_{i} lag_i = 0 $$ 其中 $lag_i$ 表示第 $i$ 個行程落後的時間，其定義為 $$ lag_i = S - s_i = w_i \times (V - v_i) $$ 而這其中 $S$ 表示理想的執行時間, $s_i$ 表示已經獲得的執行時間。接著我們把它轉換成 $$ lag_i = w_i \times (V - v_i) $$ 其中 $w_i$ 表示每個任務的權重，越大表示應該越早被執行。而 $V$ 及 $v_i$ 則分別為整個 runqueue 的時間基準及每個任務實際已經獲得的執行時間。透過上面兩個公式可以推導出 $$ \sum_{i} w_i \cdot (V - v_i) = 0 \Rightarrow V = \frac{\sum_{i} w_i \cdot v_i}{\sum_{i} w_i} $$ 因此整個 CFS runqueue 的 average vruntime 會是全部行程 vruntime 的加權平均。另一方面，為了避免在乘法計算時產生溢位，在計算上會把 $v_i$ 調整為 $(v_i - v_0) + v_0$ 因此公式計算會變為 $$ V = \frac{\sum_{i} ((v_i - v_0) + v_0) \cdot w_i}{W} = \frac{\sum_{i} (v_i - v_0) \cdot w_i}{W} + v_0 $$ 上述公式對應到程式碼分別為 `avg_vruntime_add` 表示當新任務加入到 runqueue 時 `avg_vruntime` 的更新及 `avg_vruntime_sub` 表示當任務移出 runqueue 時的更新。下列兩者按照公式計算所有的執行時間及總權重，分別對應到 $(v_i - v_0) \cdot w_i$ 及 $W$ 其中 $v_0$ 為 `min_vruntime`。 ```c static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) { unsigned long weight = scale_load_down(se->load.weight); s64 key = entity_key(cfs_rq, se); cfs_rq->avg_vruntime += key * weight; cfs_rq->avg_load += weight; } static void avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) { unsigned long weight = scale_load_down(se->load.weight); s64 key = entity_key(cfs_rq, se); cfs_rq->avg_vruntime -= key * weight; cfs_rq->avg_load -= weight; } // 其中 entity_key 用於計算 vruntime 時間差，函式如下 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se) { return (s64)(se->vruntime - cfs_rq->min_vruntime); } ``` 當 min_vruntime 改變時，原先公式中的 $v_i - v_0$ 會變為 $v_i - (v_0+\delta)$ 其中 $\delta$ 為新舊 `min_vruntime` 的差距。因此求和公式內的 $(v_i - v_0)\cdot w_i$ 每項都會減少 $\delta \cdot w_i$，總共減少 $\delta \cdot \sum w_i$，對應到程式碼如下 ```c static inline void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) { /* * v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load */ cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta; } ``` 至此便可以在任務加入或移出 runqueue 的時候正確計算 `avg_vruntime`，而 CFS 主要透過 [avg_vruntime](https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/fair.c#L656) 來計算，其計算方式與上面三個函式相同因此便不再贅述。 :::danger Avg_vruntime "似乎" 是 EEVDF 的，之後可以考慮往後搬 ::: ## Appendix C - PerEntity Load Tracking (PELT)