Linux 核心設計: Scheduler(3): 深入剖析 CFS Scheduler

--- tags: Linux Kernel Internals, 作業系統 --- # Linux 核心設計: Scheduler(3): 深入剖析 CFS Scheduler ## 引言在上一章節中，我們探討了 CFS 背後運作的原理，以及其中重要的資料結構。對於 CFS 的核心思想有所掌握之後，接下來我們要關注透過 `DEFINE_SCHED_CLASS` 定義的 fair scheduler，更細節的去研究在該 scheduler 中的加入新的 task, 移除 task, 挑選下個 task 等各種操作是如何實現的。 ```cpp DEFINE_SCHED_CLASS(fair) = { .enqueue_task = enqueue_task_fair, .dequeue_task = dequeue_task_fair, .yield_task = yield_task_fair, .yield_to_task = yield_to_task_fair, .check_preempt_curr = check_preempt_wakeup, .pick_next_task = __pick_next_task_fair, .put_prev_task = put_prev_task_fair, .set_next_task = set_next_task_fair, #ifdef CONFIG_SMP .balance = balance_fair, .pick_task = pick_task_fair, .select_task_rq = select_task_rq_fair, .migrate_task_rq = migrate_task_rq_fair, .rq_online = rq_online_fair, .rq_offline = rq_offline_fair, .task_dead = task_dead_fair, .set_cpus_allowed = set_cpus_allowed_common, #endif .task_tick = task_tick_fair, .task_fork = task_fork_fair, .prio_changed = prio_changed_fair, .switched_from = switched_from_fair, .switched_to = switched_to_fair, .get_rr_interval = get_rr_interval_fair, .update_curr = update_curr_fair, #ifdef CONFIG_FAIR_GROUP_SCHED .task_change_group = task_change_group_fair, #endif #ifdef CONFIG_UCLAMP_TASK .uclamp_enabled = 1, #endif }; ``` ## 解析 CFS 的原始程式碼 - enqueue ### `enqueue_task` `enqueue_task` 一如其名，該函式指標對應的函式是用來向 runqueue 添加新任務的。該函式指標也是函式的 [`enqueue_task`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L1988) 工作的關鍵: ```cpp static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) { if (!(flags & ENQUEUE_NOCLOCK)) update_rq_clock(rq); if (!(flags & ENQUEUE_RESTORE)) { sched_info_enqueue(rq, p); psi_enqueue(p, flags & ENQUEUE_WAKEUP); } uclamp_rq_inc(rq, p); p->sched_class->enqueue_task(rq, p, flags); if (sched_core_enabled(rq)) sched_core_enqueue(rq, p); } ``` 首先，我們可以看到輸入中包含 `flags` 以考慮特殊的情境進行 enqueue 的調整 * `ENQUEUE_NOCLOCK`: 如果這個 flag 沒有被設置，需要在 enqueue 的時候更新 `rq->clock`(即 [`update_rq_clock`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L668)) * `ENQUEUE_RESTORE`: 若這個 flag 被設置，表示該任務之前被 dequeue 只是為了修改一下屬性(`DEQUEUE_SAVE`)，例如 [`rt_mutex_setprio`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L2013) 所為，現在要被 enqueue 回去，若非此情形則: * [`sched_info_enqueue`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/stats.h#L252): 用來更新此刻被加入到 runqueue 的 timestamp * [`psi_enqueue`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/stats.h#L114): 如果設置了 `CONFIG_PSI`，kernel 會做 Pressure stall information tracking，暫時不深入討論，可先參考 [psi: pressure stall information for CPU, memory, and IO](https://lwn.net/Articles/753840/) :::danger 對於 `ENQUEUE_NOCLOCK` flag 設置的需求我不是很肯定，暫時的推論是: 一般的執行流程下，`update_rq_clock` 會在 enqueue 之前就被週期性的更新，因此正常是不需要在 `enqueue_task` 這裡再更新一次的，故 `ENQUEUE_NOCLOCK` 會被設置。但像是 [`move_queued_task`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L2249) 這種涉及 CPU migration 情境的 enqueue，就必須額外先把 `rq->clock` 進行更新? 附帶一提，[註解](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L2056) 有簡單解釋 `ENQUEUE_NOCLOCK` 以外的其他 flag 之作用，因此如果可以釐清設置該 flag 之目的的話，也許可以送個 patch? ::: 此外，在進入核心的 enqueue 實作之前，如果有設置 `CONFIG_UCLAMP_TASK` 的話 [`uclamp_rq_inc`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L1598)，會進行 utilization clamping 相關的處理。utilization clamping 大體的作用是讓排程的優先可以不僅止於 priority 的設定，還可以直接從 userspace 去告知 scheduler 該任務的需求，本文暫時不討論，可先參考: * [Scheduler utilization clamping](https://lwn.net/Articles/762043/) * [Add utilization clamping support](https://lwn.net/Articles/791682/) ### `enqueue_task_fair` 接著，就進入 enqueue 的核心 [`enqueue_task_fair`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L5571) 部分。一如註解所說， CFS 的 enqueue 行為會將任務新增到 RB-tree 中。 ```cpp= /* * The enqueue_task method is called before nr_running is * increased. Here we update the fair scheduling stats and * then put the task into the rbtree: */ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) { struct cfs_rq *cfs_rq; struct sched_entity *se = &p->se; int idle_h_nr_running = task_has_idle_policy(p); int task_new = !(flags & ENQUEUE_WAKEUP); /* * The code below (indirectly) updates schedutil which looks at * the cfs_rq utilization to select a frequency. * Let's add the task's estimated utilization to the cfs_rq's * estimated utilization, before we update schedutil. */ util_est_enqueue(&rq->cfs, p); /* * If in_iowait is set, the code below may not trigger any cpufreq * utilization updates, so do it here explicitly with the IOWAIT flag * passed. */ if (p->in_iowait) cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT); ``` 如果有設置 `CONFIG_SMP` 的話，`util_est_enqueue` 會被執行，而如果有設置 `CONFIG_CPU_FREQ` 的話，則會執行 `cpufreq_update_util`。這兩者都與 `schedutil` 和 CPU 的調頻有關，這裡暫時不討論，可先參見: * [CPU调速器schedutil原理分析](https://deepinout.com/android-system-analysis/android-cpu-related/principle-analysis-of-cpu-governor-schedutil.html#schedutil) * [CPUFreq Governor](https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt) ```cpp=30 for_each_sched_entity(se) { if (se->on_rq) break; cfs_rq = cfs_rq_of(se); enqueue_entity(cfs_rq, se, flags); ``` 接著就是與排程最為相關的部分了。首先，[`for_each_sched_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L281) 的行為是遍歷輸入的 `task_struct` 之 `shed_entity` 與其向上的 parent，這是因為 CFS 的結構是多階層的，因此我們需要對每一層級的 sched entity(可能是屬於 `task_struct` 或 `task_group`) 都進行 enqueue 操作。對於每次遍歷的目標 `sched_entity`，如果其不在 runqueue 上的話，就把他加入到自己所屬([`cfs_rq_of`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L1384))的 runqueue([`enqueue_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4236)) ```cpp=35 cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running; if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running = 1; /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) goto enqueue_throttle; flags = ENQUEUE_WAKEUP; } ``` `h_nr_running` 根據註解所述包含了 `SCHED_{NORMAL,BATCH,IDLE}` 三種類型的任務，在 CFS 結構的意義上是包含在該層級 `cfs_rq` 中的 `sched_entity`，以及其底下層級(也就是其 runqueue 中 `sched_entity` 是屬於 `task_group` 之情形下) 所包含的 `sched_entity` 數量之總和。與之相對的，`cfs_rq` 中的另一個 member `nr_running` 則只有描述在該層級 `cfs_rq` 中的 `sched_entity` 數量。後者也意味著在該層級的 `cfs_rq` 中要去分配 CPU 時間資源的 entity 數量。同樣道理 `idle_h_nr_running` 也包含該層 `cfs_rq` 本身以及其下屬於 `SCHED_IDLE` 類型的任務數量。`idle_h_nr_running` 會根據這個 `task_struct` 是否是 `SCHED_IDLE` ([`task_has_idle_policy`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L185))，或者該 `task_group` 是否是 idle group ([`cfs_rq_is_idle`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L447)) 而設定成 1 或 0，累加到 `csf_rq->idle_h_nr_running` 上。 [`cfs_rq_throttled`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4737) 與 CFS 的 bandwidth control 相關，簡而言之是用來限制 `task_group` 在一個周期內可以消耗的 CPU 資源。細節我們會留待後續再深入討論。可先閱讀: > [CFS调度器（5）-带宽控制](http://www.wowotech.net/process_management/451.html) ```cpp=48 for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); update_load_avg(cfs_rq, se, UPDATE_TG); se_update_runnable(se); update_cfs_group(se); cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running; if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running = 1; /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) goto enqueue_throttle; /* * One parent has been throttled and cfs_rq removed from the * list. Add it back to not break the leaf list. */ if (throttled_hierarchy(cfs_rq)) list_add_leaf_cfs_rq(cfs_rq); } ``` 第二個 `for_each_sched_entity` 是因應前一個 for loop 因為 `se->on_rq` 成立，即該 level 的 sched_entity 已經在 runqueue 上而提早結束時的情境。舉例來說，假設有一個 `task_group` A 底下包含 B、C 兩個 `task_struct`，且先假設一開始大家都不在所屬的 runqueue 上 (`se->on_rq == false`)。則某一時間 C 被 enqueue 時，`task_group` A 作為 C 的 parent 也會被 enqueue(`on_rq` 被設為 1)，下一時間 B 要被 enqueue 時，不需要再進行 `task_group` A 的 enqueue，但仍需要更新相關的信息，即上方的 code block 之流程。這個流程具體做了甚麼呢? * [`update_load_avg`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L3813): 利用 PELT 機制更新 se 的 load * [`se_update_runnable`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L742): 更新 se->runnable_weight，這與 PELT 機制在計算 `runnable_sum` 時的權重相關 * [`update_cfs_group`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L3197): 若該 se 是一個 group entity，需把其中的 `struct cfs_rq` 也同步作更新 > [[PATCH v2 4/5] sched/pelt: Add a new runnable average signal](https://yhbt.net/lore/all/20200214152729.6059-5-vincent.guittot@linaro.org/) 對於上述的三個 function，這個和 CFS 的 PELT(per entity load tracking) 機制有關，每次 enqueue 時都需要 update 相關資訊。事實上，前一個 for loop 在 `enqueue_entity` 中也有進行同樣的更新。這些更新的目的詳細留待後續章節研究。更新 `h_nr_running` 和 `idle_h_nr_running` / `cfs_rq_throttled` 也與前一個 loop 同理。 * [`throttled_hierarchy`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4743) ? ```cpp=73 /* At this point se is NULL and we are at root level*/ add_nr_running(rq, 1); /* * Since new tasks are assigned an initial util_avg equal to * half of the spare capacity of their CPU, tiny tasks have the * ability to cross the overutilized threshold, which will * result in the load balancer ruining all the task placement * done by EAS. As a way to mitigate that effect, do not account * for the first enqueue operation of new tasks during the * overutilized flag detection. * * A better way of solving this problem would be to wait for * the PELT signals of tasks to converge before taking them * into account, but that is not straightforward to implement, * and the following generally works well enough in practice. */ if (!task_new) update_overutilized_status(rq); enqueue_throttle: if (cfs_bandwidth_used()) { /* * When bandwidth control is enabled; the cfs_rq_throttled() * breaks in the above iteration can result in incomplete * leaf list maintenance, resulting in triggering the assertion * below. */ for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); if (list_add_leaf_cfs_rq(cfs_rq)) break; } } assert_list_leaf_cfs_rq(rq); hrtick_update(rq); ``` * 將 runqueue 的 member `nr_running` 加一(事實上還有一些其他的考量，參見 [`add_nr_running`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L2347)) * [`update_overutilized_status`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L5529) / [`list_add_leaf_cfs_rq`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L297)? * [`hrtick_update`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L5500): 當目前的任務是 CFS class 時且其所屬的 CFS runqueue 之 `nr_running` 足夠小時，需要更新 hrtick(high resolution timer, 如果 `CONFIG_SCHED_HRTICK`)? > [Add HRTICK support to Core Scheduling](https://lwn.net/Articles/829717/): CFS supports a feature called "HRTICK" which allows scheduling decisions to be made independent of the HZ tick. That means that we can achieve much more fine grained time slices and thus be more fair in distributing time to different workloads. 總結 `enqueue_task_fair` 的實作，如果先忽略 bandwidth control 和 PELT 等複雜的機制的話，其流程就是從目標 `task_struct` 往其 parent 遞迴處理到 root，更新每一層級 runqueue 的任務數量 (例如 `h_nr_running`) 及插入對應節點到紅黑樹上。 ### `enqueue_entity` :::warning 由於 `enqueue_entity` 使用的邏輯上應該都是要把 entity 放回自己所屬的 cfs_rq，總覺得 function 的設計可以做修改: ```cpp static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags){ } => static void enqueue_entity(struct sched_entity *se, int flags){ struct cfs_rq *qcfs_rq = cfs_rq_of(se); ... } ``` 這樣的好處是避免誤將 entity 放去非預期的 rq 中，雖然實際上的效益可能不大就是了lol ::: 現在我們回過頭來研究 [`enqueue_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4236) 之中的實作細節。 ```cpp /* * MIGRATION * * dequeue * update_curr() * update_min_vruntime() * vruntime -= min_vruntime * * enqueue * update_curr() * update_min_vruntime() * vruntime += min_vruntime * * this way the vruntime transition between RQs is done when both * min_vruntime are up-to-date. * * WAKEUP (remote) * * ->migrate_task_rq_fair() (p->state == TASK_WAKING) * vruntime -= min_vruntime * * enqueue * update_curr() * update_min_vruntime() * vruntime += min_vruntime * * this way we don't have the most up-to-date min_vruntime on the originating * CPU and an up-to-date min_vruntime on the destination CPU. */ ``` 先從註解閱讀。這裡就面臨到排程所需要考慮的議題: 任務在不同 CPU 上的移動。想像一個情況，假設我們有 cpu0 和 cpu1，現在兩個 runqueue 上的 [`min_vruntime`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L540) 數值個別是 10 和 1000 這樣差距極大的數字。則當某任務 A 現在在 cpu0 被建立，其 `sched_entity` 下起始的 [`vruntime`](https://elixir.bootlin.com/linux/v5.16.6/source/include/linux/sched.h#L538) 值會根據 cpu0 之 runqueue 的 `min_vruntime` 被設定成 10(在此前的章節對 [`cfs_rq` 的解說](https://hackmd.io/9ktsi7dQR823VChlW1SyEw?view#cfs_rq) 有提及這麼做的原因)。那麼之後這個任務的運行卻是要移動到 cpu1 上該怎麼辦呢?首先，使用原本的 vruntime 10 直接加入 cpu1 的 queue 是不合理的，這代表該任務將會排擠那些原本就在 cpu1 上運行的任務，後者的 vruntime 理應都是 1000 以上，這樣明顯不是很公平。其次，直接初始成 cpu1 的 `min_vruntime` 也不太合理，這相當於忽視了之前該任務在 cpu0 上運行過的時間。因此，這裡的做法是進行標準化: 在該任務從 cpu0 移除時先減去 cpu0 的 `min_vruntime`，然後之後被移動到 cpu1 後會將該值加上 cpu1 的 vruntime 以確保公平。 ```cpp=1 static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED); bool curr = cfs_rq->curr == se; /* * If we're the current task, we must renormalise before calling * update_curr(). */ if (renorm && curr) se->vruntime += cfs_rq->min_vruntime; update_curr(cfs_rq); /* * Otherwise, renormalise after, such that we're placed at the current * moment in time, instead of some random moment in the past. Being * placed in the past could significantly boost this task to the * fairness detriment of existing tasks. */ if (renorm && !curr) se->vruntime += cfs_rq->min_vruntime; ``` 因此，對應回上面的程式碼。`renorm` 若為 true表示要進行如前述的標準化流程，則該任務可能會是: * 情形一: `!(flags & ENQUEUE_WAKEUP)`: 非從 sleep state 被喚醒的任務 * 可能是剛被建立(透過 `fork` 等相似系統呼叫) * 或者該任務是在 cpu A 的 runqueue 上被直接移動到 cpu B 的 runqueue(on queue migration) * 情形二: `flags & ENQUEUE_MIGRATED` :或者該任務是在 wake up 過程中經過 migrate，需注意 ` ENQUEUE_MIGRATED` 這個 flag [並不是在所有 enqueue 時有 migration 就會被設置](https://blog.csdn.net/zhouchengming1/article/details/118632471)，而是需要是被 wake up 的前提。也就是如果 `flags & ENQUEUE_MIGRATED == true` 則 `flags & ENQUEUE_WAKEUP == true` 理當也成立 * 換句話說，wake up 但並沒有發生 migration 的 enqueue entity 就不需要進行這個標準化的流程但這邊我們可以看到若 `renorm` 為 true，根據 `cfs_rq->curr == se` 還分成是在 `update_curr` 前後進行更新兩種。這又是甚麼考量呢? 詳細可以參見 [Migrated CFS task getting an unfair advantage](https://linux.kernel.narkive.com/p15Wmn0i/migrated-cfs-task-getting-an-unfair-advantage) 的這段 mailing list 和程式碼前面的註解。簡單來說，因為 [`update_curr`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L844) 有機會更改到 `cfs_rq->min_vruntime`，後者是進行 normalize 所需的參數。一般來說，如果在 `update_curr` **之前**就進行 normalize 的話(注意此時要 enqueue 的 entity 不一定是 current task 對應之 entity)，則對於在 `cfs_rq->min_vruntime` 被更新之前和之後進行 enqueue 的兩個 migrated 任務會有區別待遇。但同時還需要考慮 current 的 normalize 必須先於 `update_curr` 的進行，因此程式碼才根據兩種情形分別在不同時間點進行 normalize。 :::warning `update_curr` 需要被執行的時機點是? 有沒有可以歸納的原則。 ::: ```cpp=25 /* * When enqueuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. * - Add its load to cfs_rq->runnable_avg * - For group_entity, update its weight to reflect the new share of * its group cfs_rq * - Add its new weight to cfs_rq->load.weight */ update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH); // step1 se_update_runnable(se); // step2 update_cfs_group(se); // step3 account_entity_enqueue(cfs_rq, se); ``` 註解說明了在 enqueue entity 時需要進行的相關更新流程為: 1. 更新 se 的 load 使得與目前的 clock 同步 * `update_load_avg` -> `__update_load_avg_se` 2. 更新 se 所屬 cfs_rq 的 load 使得與目前的 clock 同步: * `update_load_avg` ->`update_cfs_rq_load_avg` 3. 藉由 `se_update_runnable` 更新 `se->runnable_weight`，後者會針對 group se 將 `runnable_weight` 與 se 所對應的 queue (`se->my_q`) 中存在的 se 總量(一層一層的遞規總和，詳見 `h_nr_running` 的算法) 做同步。 4. 將 load 累加到對應的 cfs_rq 中: * `update_load_avg` -> `attach_entity_load_avg` 5. 如果 entity 是屬於 `task_group` 的，需要透過 `share` 更新去更新該 entity 的 weight: * `update_cfs_group` -> [`reweight_entity`](#reweight_entity) 6. 然後可以透過 se 的 weight 去更新所屬的 `cfs_rq` 之 `load.weight`: * `account_entity_enqueue` -> `update_load_add` `account_entity_enqueue` 也會去將 cfs_rq->nr_running + 1。 :::warning `reweight_entity` 和 `account_entity_enqueue` 裡看起來都有 `update_load_add(&cfs_rq->load, se->load.weight)` 的操作，這樣會導致加到額外的 load 嗎? ::: 回到 code，可以發現 step1 - step 3 在 `for_each_sched_entity` 時候有看到過一次，也就是說這些是想 enqueue 輸入的 task 時，需要對該 task 與其 parent 的 sched_entity 不論是否已在 runqueue 上皆須進行的步驟。隨著 enqueue 發生，我們需要把該 entity 的 load_weight 累計到所屬的 `cfs_rq` 上([`update_load_add`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L147))。另外，還記得我們在 `enqueue_task_fair` 更新了 `h_nr_running` 和 `idle_h_nr_running`，當時我們提到該數值反映了包含在該層級 cfs_rq 以及其底下層級的所有 sched_entity 數量之總和。與之相對的 `nr_running` 和 `idle_nr_running` 則只有描述在該層級 cfs_rq 中的 sched_entity 數量，因此在 `enqueue_entity` 時再更新即可，這些行為都在 [`account_entity_enqueue`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L2942) 完成。 ```cpp=39 if (flags & ENQUEUE_WAKEUP) place_entity(cfs_rq, se, 0); ``` 對於從 sleep state 醒來的任務(`flags & ENQUEUE_WAKEUP == true`)，[`place_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4165) 會將 vruntime 做校正。考慮以下情境: 如果一個任務進入 sleep 很長一段時間，則睡醒時其 vruntime 勢必落後其他沒有進入 sleep 一直在進行的任務，如果讓該任務就此霸佔 CPU 其實並不合理。 ```cpp static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) { u64 vruntime = cfs_rq->min_vruntime; ... /* sleeps up to a single latency don't count. */ if (!initial) { unsigned long thresh; if (se_is_idle(se)) thresh = sysctl_sched_min_granularity; else thresh = sysctl_sched_latency; /* * Halve their sleep time's effect, to allow * for a gentler effect of sleepers: */ if (sched_feat(GENTLE_FAIR_SLEEPERS)) thresh >>= 1; vruntime -= thresh; } /* ensure we never gain time by being placed backwards. */ se->vruntime = max_vruntime(se->vruntime, vruntime); } ``` 具體怎麼做呢? 中心思想其實還是以 `min_vruntime` 來重新初始化 `se`。但雖然我們不希望 vruntime 太小導致霸佔 CPU，但一方面也希望剛醒來的任務可以盡快被排程，因此額外又引入 `thresh` 來給這些任務一些補償。可以看到當 `flags & ENQUEUE_WAKEUP == true` 時，`place_entity` 的 `initial` 被設置成 0，此時會嘗試以 `min_vruntime - thresh` 的值來做為每個 runqueue 中的 entity 該被接受的最小 vruntime。因此如果當前要 enqueue 的任務原本的 vruntime 小於該數值，則要將其調整到 `min_vruntime - thresh`，否則就會有睡醒的任務長時間霸佔資源的問題。更仔細的看 `thresh` 的設置的話: * non-idle task * [`sysctl_sched_latency`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L38): 在[上一張章節](https://hackmd.io/@RinHizakura/B18MhT00t#Schedule-Latency)有提到一般情形下的 schedule latency 為 6ms，如果一個任務睡超過這個時間，那就額外只給 1 個 latency * idle task * [`sysctl_sched_min_granularity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L59): 補償到 0.75ms，可參考 [[PATCH v3 4/4] sched: adjust sleeper credit for SCHED_IDLE entities](https://www.spinics.net/lists/kernel/msg4049712.html) * 可以開啟 `sched_feat(GENTLE_FAIR_SLEEPERS)` 來使被喚醒的 task 再禮讓更多資源 ```cpp=42 check_schedstat_required(); update_stats_enqueue_fair(cfs_rq, se, flags); check_spread(cfs_rq, se); ``` * 需要更新一些 sched_stat 相關的資訊 * [sched: support schedstat for RT sched class](https://lwn.net/Articles/837786/) * [`check_schedstat_required`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/stats.h#L54) * [`update_stats_enqueue_fair`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L953) * [`check_spread`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4151) ```cpp=45 if (!curr) __enqueue_entity(cfs_rq, se); se->on_rq = 1; ``` 被加入到 queue 上的 entity 之 `on_rq` 狀態要被設為 1，而除了 current task 之外的 entity 都需要被 [`__enqueue_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L586) 加入到紅黑樹的 runqueue 中(因為 current task 處於獲得排程的狀態，可以想成是被從 queue 中取出來執行，等 context switch 時會在 [`put_prev_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4279) 放回 queue)，細節的實作可以參考 [`rb_add_cached`](https://elixir.bootlin.com/linux/v5.16.6/source/include/linux/rbtree.h#L165)。 ```cpp=49 /* * When bandwidth control is enabled, cfs might have been removed * because of a parent been throttled but cfs->nr_running > 1. Try to * add it unconditionally. */ if (cfs_rq->nr_running == 1 || cfs_bandwidth_used()) list_add_leaf_cfs_rq(cfs_rq); if (cfs_rq->nr_running == 1) check_enqueue_throttle(cfs_rq); } ``` 後續還有一些 bandwidth controll 相關的調整，這裡一樣暫時先不深究。 ## 解析 CFS 的原始程式碼 - dequeue ### `dequeue_task` ```cpp static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags) { if (sched_core_enabled(rq)) sched_core_dequeue(rq, p); if (!(flags & DEQUEUE_NOCLOCK)) update_rq_clock(rq); if (!(flags & DEQUEUE_SAVE)) { sched_info_dequeue(rq, p); psi_dequeue(p, flags & DEQUEUE_SLEEP); } uclamp_rq_dec(rq, p); p->sched_class->dequeue_task(rq, p, flags); } ``` 與 enqueue 相對應的，需呼叫 `dequeue_task` 將任務從 runqueue 中取出。 * `DEQUEUE_NOCLOCK`: 如果這個 flag 沒有被設置，需要在 dequeue 的時候更新 `rq->clock` ([`update_rq_clock`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L668)) * `DEQUEUE_SAVE`: 這個 flag 和 `ENQUEUE_RESTORE` 成對，若這個 flag 被設置，表示該任務被 dequeue 只是為了修改一下屬性，若非此情形則: * [`sched_info_dequeue`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/stats.h#L2112): [`sched_info_enqueue`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/stats.h#L252) 時會更新當時被加入到 runqueue 的 timestamp，現在 `sched_info_dequeue` 可以藉自己被呼叫的時間，計算該任務從進入 queue 到實際被排程的時間長度 * [`psi_dequeue`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/stats.h#L137): Pressure stall information tracking 相關，暫略 * [`uclamp_rq_dec`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L1622): utilization clamping，暫略 CFS 相關的 dequeue 操作則被實作在 [`dequeue_task_fair`]( https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L5686)。 ### `dequeue_task_fair` `dequeue_task_fair` 與 `enqueue_task_fair` 相比其實不難看出就是反向的操作，因此這邊就不會再探討相對應的部分，只關注在有不同的地方。 ```cpp for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); dequeue_entity(cfs_rq, se, flags); ... /* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { /* Avoid re-evaluating load for this entity: */ se = parent_entity(se); /* * Bias pick_next to pick a task from this cfs_rq, as * p is sleeping when it is within its sched_slice. */ if (task_sleep && se && !throttled_hierarchy(cfs_rq)) set_next_buddy(se); break; } flags |= DEQUEUE_SLEEP; } ``` 第一次的 `for_each_sched_entity(se)` 遍歷輸入的 `task_struct` 之 `shed_entity` 與其向上的 parent，與 enqueue 時有不同的是，如果某層級的 entity 所屬的 runqueue 上還有其他 entity(`if (cfs_rq->load.weight)`)，則不需要再對該 runqueue 的 entity 進行 dequeue，而是設置 runqueue 上的下個 entity([`set_next_buddy`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L7087))。否則就把它從自己所屬(`cfs_rq_of`)的 runqueue 移除([`dequeue_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4342))。 ```cpp for_each_sched_entity(se) { ... } ``` 第二個 for 迴圈與 enqueue 基本是完全相近的。 ```cpp /* At this point se is NULL and we are at root level*/ sub_nr_running(rq, 1); /* balance early to pull high priority tasks */ if (unlikely(!was_sched_idle && sched_idle_rq(rq))) rq->next_balance = jiffies; dequeue_throttle: util_est_update(&rq->cfs, p, task_sleep); hrtick_update(rq); } ``` 最後的流程也與 enqueue 類似，有些許不同的地方是會去更新 `rq->next_balance`，這個變數與 Linux 中週期性的做 load balancing 相關([`trigger_load_balance`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L11021))，也就是 dequeue 的時候這邊會去提早觸發 load balancing 的機制。 :::warning (TODO: why?) ::: ### `dequeue_entity` ```cpp= static void dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { /* * Update run-time statistics of the 'current'. */ update_curr(cfs_rq); ``` [`dequeue_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4342) 與 `enqueue_entity` 相對。首先都要先做 `update_curr`: 更新 current task (`exec_start` / `vruntime`) 和其所屬 rq (`min_vruntime`) 的相關數據，以在後續的流程中做使用。 ```cpp /* * When dequeuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. * - Subtract its load from the cfs_rq->runnable_avg. * - Subtract its previous weight from cfs_rq->load.weight. * - For group entity, update its weight to reflect the new share * of its group cfs_rq. */ update_load_avg(cfs_rq, se, UPDATE_TG); se_update_runnable(se); ``` dequeu entity 在這裡也和 enqueue entity 一樣有類似的註解，不過要留意實作上各步驟進行的順序之差異: 1. 更新 se 的 load 使得與目前的 clock 同步 * `update_load_avg` -> `__update_load_avg_se` 2. 更新 se 所屬 cfs_rq 的 load 使得與目前的 clock 同步: * `update_load_avg` ->`update_cfs_rq_load_avg` 3. 藉由 `se_update_runnable` 更新 `se->runnable_weight`，後者會針對 group se 將 `runnable_weight` 與 se 所對應的 queue (`se->my_q`) 中存在的 se 總量(一層一層的遞規總和，詳見 `h_nr_running` 的算法) 做同步。 :::danger 要將 load 從對應的 cfs_rq 中移除嗎? 在 enqueue 裡面有的 `attach_entity_load_avg` 對應的 `detach_entity_load_avg` 在這一段似乎沒有做 ::: ```cpp update_stats_dequeue_fair(cfs_rq, se, flags); clear_buddies(cfs_rq, se); ``` * [`update_stats_dequeue_fair`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L970): dequeue 相關的 `schedstat` 更新?(暫略) * [`clear_buddies`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4327): `cfs_rq` struct 中分別會記錄 `cfs_rq->next`, `cfs_rq->last`, `cfs_rq->skip`，可以用來協助 scheduler 在挑選下個 se 時的策略。但對於即將被移出 rq 的 se，如果其正好等同於上述三者的任何之一，這會是無效的，因此需要在 dequeue 時先調整之 > * [调度器4—CFS的buddy ](https://www.cnblogs.com/hellokitty2/p/15302500.html) ```cpp if (se != cfs_rq->curr) __dequeue_entity(cfs_rq, se); se->on_rq = 0; ``` 除了 current task 之外的 entity 都需要藉由 [`__dequeue_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L591) 來從紅黑樹中移除掉。另外注意到在此時刻 `on_rq` 才被設置為 0，在後續才能進行到的像是 `reweight_entity` 等會需要用到 `se` 已經被 dequeue 資訊的函式。 ```cpp account_entity_dequeue(cfs_rq, se); ``` 然後可以透過 se 的 weight 去更新所屬的 `cfs_rq` 之 `load.weight`: * `account_entity_dequeue` -> `update_load_sub` 也要調整 `cfs_rq->nr_running`/`cfs_rq->idle_nr_running`。 ```cpp /* * Normalize after update_curr(); which will also have moved * min_vruntime if @se is the one holding it back. But before doing * update_min_vruntime() again, which will discount @se's position and * can move min_vruntime forward still more. */ if (!(flags & DEQUEUE_SLEEP)) se->vruntime -= cfs_rq->min_vruntime; /* return excess runtime on last dequeue */ return_cfs_rq_runtime(cfs_rq); ``` 在 se 從 cfs_rq 移出成為 non-runnable 狀態的時候要對 vruntime 進行標準化。 :::danger * TODO: [`return_cfs_rq_runtime`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L5156)? ::: ```cpp update_cfs_group(se); ``` `update_cfs_group` 主要考慮如果 entity 是屬於 `task_group` 的，需要透過 `share` 更新去更新該 entity 的 weight([`reweight_entity`](#reweight_entity)) ```cpp /* * Now advance min_vruntime if @se was the entity holding it back, * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be * put back on, and if we advance min_vruntime, we'll be placed back * further than we started -- ie. we'll be penalized. */ if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE) update_min_vruntime(cfs_rq); } ``` 最後將 `cfs_rq->min_vruntime` 更新。 :::info `update_curr` 那邊應該已經會執行一次 `update_min_vruntime`，這裡又呼叫一次的原因是? 1. 第一次 `update_curr` 的 `update_min_vruntime` 是為了校正 `se->vruntime` 2. 因為 `min_vruntime` 可能是來自當下被移除的 `se` 原本的 vruntime，在把 `se` 真正從 cfs_rq 移除後需要再進行一次 ::: ## 解析 CFS 的原始程式碼 - reweight 在某些特定時刻，se 的權重會動態的被調整，比如說使用者透過 sysfs 進行設定，或是 enqueue/dequeue 發生(調整 group se 的權重, 例如 `update_cfs_group`) 時。以下我們就來關注與此行為相關的函式。 ### `reweight_entity` [`reweight_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L3043) 是調整一個 se 之 weight 的核心函式。 ```cpp static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, unsigned long weight) { if (se->on_rq) { /* commit outstanding execution time */ if (cfs_rq->curr == se) update_curr(cfs_rq); update_load_sub(&cfs_rq->load, se->load.weight); } ``` 如果該 se 已經存在所屬的 queue 中，且是目前受排程中的 se，在更新 weight 之前，我們需要將到當下時刻的為止經過的 vruntime 做一次累計，因為這段時間是新的 weight 之前所歷經的，這藉由 `update_curr` 來完成。之後，上章節有提到 `cfs_rq->load` 的數值是該 runqueue 中所有任務的權重總和(方便可以迅速計算某 se 應分配的 vruntime 比例)，既然其下的 se 即將發生調整(reweight)，我們自然需要先減去舊的 weight，之後等新的 weight 更新好，再將其加總回去。 ```cpp dequeue_load_avg(cfs_rq, se); ``` [`dequeue_load_avg`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L3030) 將 `se` 的 load 從原屬的 cfs_rq 中先行移除(`cfs_rq->avg.load_sum` / `cfs_rq->avg.load_avg`) ```cpp /* * static inline void update_load_set(struct load_weight *lw, unsigned long w) * { * lw->weight = w; * lw->inv_weight = 0; * } */ update_load_set(&se->load, weight); ``` 然後我們終於可以將 se 的 weight 進行更新了。[`upload_set`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L159) 將 se 的 weight 換成輸入的參數 `weight`。 :::info 值得注意的是，weight 的倒數 `inv_weight` 並不會立刻被計算，而是等到會用到的 function(例如 `place_entity` -> [`__calc_delta`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L244)) 才會呼叫 [`__update_inv_weight`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L215) 去得到 ::: ```cpp #ifdef CONFIG_SMP do { u32 divider = get_pelt_divider(&se->avg); se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider); } while (0); #endif enqueue_load_avg(cfs_rq, se); if (se->on_rq) update_load_add(&cfs_rq->load, se->load.weight); } ``` * `CONFIG_SMP` ? * 隨著 weight 完成更新，[`enqueue_load_avg`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L3023) 再將該 se 的 load 重新計算回所屬的 cfs_rq 之中。 * 如果剛剛有因為 se 已經存在所屬的 queue 中而先扣掉 weight 的部分，將其再加回 cfs_rq 中 ### `reweight_task` 一個 `reweight_entity` 會被呼叫的點是在 [`reweight_task`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L3070)。 ```cpp void reweight_task(struct task_struct *p, int prio) { struct sched_entity *se = &p->se; struct cfs_rq *cfs_rq = cfs_rq_of(se); struct load_weight *load = &se->load; unsigned long weight = scale_load(sched_prio_to_weight[prio]); reweight_entity(cfs_rq, se, weight); load->inv_weight = sched_prio_to_wmult[prio]; } ``` 對於給定的 nice value `prio`，藉由 `sched_prio_to_weight` 可以轉換得到應該設置的 weight，接著就可以藉由 `reweight_entity` 去調整。 ### `calc_group_shares` 另一個會使用 `reweight_entity` 的情境是在 `update_cfs_group`，會去將 group se 的 weight 做修改。而這裡我們要關注的是用來計算新的 weight 的函式 [`calc_group_shares`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L3156) ```cpp /* Task group related information */ struct task_group { ... /* schedulable entities of this group on each CPU */ struct sched_entity **se; /* runqueue "owned" by this group on each CPU */ struct cfs_rq **cfs_rq; unsigned long shares; ``` 回顧一下 `task_group` 結構，`task_group` 對應每個 CPU 會各自有一個 `se` 和 `cfs_rq`。而整個 group 本身的權重則是 `shares`，這些 `se` 會按比例分配 `shares` 的權重。舉例來說，如下圖，假設我們有 CPU0 和 CPU1 各自的 `se0` 和 `se1`，而 CPU0 的 `cfs_rq0` 中分別有兩個任務，權重一個是 1024 一個是 2048，CPU1 的 `cfs_rq1` 中則有一個權重為 1024 的任務。則由此可以換算，`se0` 可以分配到的比例是 $\frac{(1024 + 2048)}{(1024 + 2048) + 1024}$，`se1` 則是 $\frac{1024}{(1024 + 2048) + 1024}$。 ![](https://i.imgur.com/FluRtyA.png) 這也對應 `calc_group_shares` 註解一開始說明的公式 (1)。 ```cpp * * All this does is approximate the hierarchical proportion which includes that * global sum we all love to hate. * * That is, the weight of a group entity, is the proportional share of the * group weight based on the group runqueue weights. That is: * * tg->weight * grq->load.weight * ge->load.weight = ----------------------------- (1) * \Sum grq->load.weight * ``` 然而，為此要計算 `task_group` 底下每個 `cfs_rq` 中 `se` 的 weight 之合，這點上的成本是比較高的。因此在 Linux 中我們可以改以近似值的方式來計算: ```cpp * Now, because computing that sum is prohibitively expensive to compute (been * there, done that) we approximate it with this average stuff. The average * moves slower and therefore the approximation is cheaper and more stable. * * So instead of the above, we substitute: * * grq->load.weight -> grq->avg.load_avg (2) * * which yields the following: * * tg->weight * grq->avg.load_avg * ge->load.weight = ------------------------------ (3) * tg->load_avg * * Where: tg->load_avg ~= \Sum grq->avg.load_avg * * That is shares_avg, and it is right (given the approximation (2)). ``` 我們可以改用 load_avg 的比例(PELT load)來近似計算 se 的 load weight 的比例(可以參考 [`__update_load_avg_cfs_rq`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/pelt.c#L324) 是如何計算 `cfs_rq` 的 `load_avg` 的)，而 task_group 底下所有 cfs_rq 之 load_avg 總和會近似於 `tg->load_avg` 所記錄的值(可以參考 [`update_tg_load_avg`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L3320)，得到公式 (3)。 ```cpp * The problem with it is that because the average is slow -- it was designed * to be exactly that of course -- this leads to transients in boundary * conditions. In specific, the case where the group was idle and we start the * one task. It takes time for our CPU's grq->avg.load_avg to build up, * yielding bad latency etc.. * * Now, in that special case (1) reduces to: * * tg->weight * grq->load.weight * ge->load.weight = ----------------------------- = tg->weight (4) * grp->load.weight * * That is, the sum collapses because all other CPUs are idle; the UP scenario. ``` 問題在於使用 load_avg 和實際用 weight 來計算之間的差距，由於 load_avg 的建立需要時間，這中間的延遲導致某些情境下近似的效果不佳。具體來說是當 group 一開始是 idle，並準備要開始運行第一個 task 的時候。此時會希望將其視為一個特例計算，也就是使用註解中提到的公式 (4)。 :::danger 這是引用註解的說明，實際上我也不太清楚如果不用此特例去計算的話具體是造成甚麼問題 :crying_cat_face: ::: ```cpp * So what we do is modify our approximation (3) to approach (4) in the (near) * UP case, like: * * ge->load.weight = * * tg->weight * grq->load.weight * --------------------------------------------------- (5) * tg->load_avg - grq->avg.load_avg + grq->load.weight ``` 可以用公式 (5) 使得一般清況下的計算會等同於 (3)，而特殊情況時會等同公式 (4)。 ```cpp * * But because grq->load.weight can drop to 0, resulting in a divide by zero, * we need to use grq->avg.load_avg as its lower bound, which then gives: * * * tg->weight * grq->load.weight * ge->load.weight = ----------------------------- (6) * tg_load_avg' * * Where: * * tg_load_avg' = tg->load_avg - grq->avg.load_avg + * max(grq->load.weight, grq->avg.load_avg) ``` 但問題還沒解決。當一個 cfs_rq 的 weight 為 0 時(該情況是可能發生的)，這除法的進行會有問題，因此我們還需要額外處理這種情況。作法是使用該 cfs_rq 的 load_avg 作為下限。 :::warning 可以看到由於近似的計算以及需考慮 boundary case 的原因，整個公式來源的邏輯可能並不是很容易能直接從程式碼梳理出來(也有可能只是我太爛看不懂QQ) ::: 暫且我們就當成只要套用這公式就能得到 task_group 中 se 的 weight，觀察其中的程式碼: ```cpp static long calc_group_shares(struct cfs_rq *cfs_rq) { long tg_weight, tg_shares, load, shares; struct task_group *tg = cfs_rq->tg; tg_shares = READ_ONCE(tg->shares); load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg); tg_weight = atomic_long_read(&tg->load_avg); /* Ensure tg_weight >= load */ tg_weight -= cfs_rq->tg_load_avg_contrib; tg_weight += load; shares = (tg_shares * load); if (tg_weight) shares /= tg_weight; /* * MIN_SHARES has to be unscaled here to support per-CPU partitioning * of a group with small tg->shares value. It is a floor value which is * assigned as a minimum load.weight to the sched_entity representing * the group on a CPU. * * E.g. on 64-bit for a group with tg->shares of scale_load(15)=15*1024 * on an 8-core system with 8 tasks each runnable on one CPU shares has * to be 15*1024*1/8=1920 instead of scale_load(MIN_SHARES)=2*1024. In * case no task is runnable on a CPU MIN_SHARES=2 should be returned * instead of 0. */ return clamp_t(long, shares, MIN_SHARES, tg_shares); } ``` :::danger 如果 `tg_load_avg' == 0`，就把 `tg->weight * load` 當成 `ge->load.weight` 了? ::: 不過實際對照後發現程式碼的實做和註解的公式 6 似乎不大相同? :dizzy_face: 感覺程式碼的實做比較像是: ``` tg->weight * load ge->load.weight = ----------------------------- (6) tg_load_avg' Where: load = max(grq->load.weight, grq->avg.load_avg) tg_load_avg' = tg->load_avg - grq->tg_load_avg_contrib + load ``` * `grq->tg_load_avg_contrib` 可以理解成前段時間某次的 `grq->avg.load_avg`，詳情可以看 [`update_tg_load_avg`](https://hackmd.io/@RinHizakura/Bk4y_5o-9#update_tg_load_avg) 的實作，兩者的值畢竟有區別，不太清楚這裡選用兩個不同變數的考量為何原諒小弟資質駑鈍，只能先放棄理解這段程式碼了QAQ 跪求有大神開示一下這到底是怎麼計算的... ## 解析 CFS 的原始程式碼 - pick next task ### `__pick_next_task` 在每次的重排程， [`__schedule`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L6131) 會被呼叫，函式執行的流程上 `pick_next_task` -> [`__pick_next_task`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L5579) 是選擇下一個將要被執行的任務的關鍵。 ```cpp /* * Pick up the highest-prio task: */ static inline struct task_struct * __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { const struct sched_class *class; struct task_struct *p; /* * Optimization: we know that if all tasks are in the fair class we can * call that function directly, but only if the @prev task wasn't of a * higher scheduling class, because otherwise those lose the * opportunity to pull in more work from other CPUs. */ if (likely(prev->sched_class <= &fair_sched_class && rq->nr_running == rq->cfs.h_nr_running)) { p = pick_next_task_fair(rq, prev, rf); if (unlikely(p == RETRY_TASK)) goto restart; /* Assume the next prioritized class is idle_sched_class */ if (!p) { put_prev_task(rq, prev); p = pick_next_task_idle(rq); } return p; } ``` `__pick_next_task` 先假設下一個運行的任務屬於 fair class，因為一般狀況下核心中的任務都是 fair class，因此該假設提供了一些優化。判斷此假設是否正確的方式是: 判斷目前的 task 是不是 fair class，以及 `rq->nr_running == rq->cfs.h_nr_running` 是否符合: * `rq->nr_running` 是指當前 rq 上所有的任務(遞迴運算底下的每個 rq) * `rq->cfs.h_nr_running` 則是指 rq 中屬於 cfs_rq 一部分，包含在該層級 cfs_rq 中的 sched_entity，以及其底下層級所包含的 sched_entity 數量之總和 * 若兩者相同，表示沒有 fair 以外的任務需要調度，則可以保證下一個被挑到的任務必然屬於 fair class 滿足假設的情況，可以直接透過 fair 的 pick_next 方式 `pick_next_task_fair` 進行。正常情況下返回 `pick_next_task_fair` 的返回值即可，但有些例外需考量: * `RETRY_TASK`: 有屬於更高優先級的任務被喚醒，需要重新選擇下個任務 * NULL(`!p`): 沒有可挑選的 fair class task ```cpp restart: put_prev_task_balance(rq, prev, rf); for_each_class(class) { p = class->pick_next_task(rq); if (p) return p; } BUG(); /* The idle class should always have a runnable task. */ } ``` 如果不在 fair class 中選擇，那就按照 class 的優先級逐一呼叫各自定義的 `pick_next_task` 方法。 :::warning `put_prev_task` 的目的是什麼? 為什麼優化該段調用 `pick_next_task_fair` 之前不需要先 `put_prev_task`? * `pick_next_task_fair` 裡是會做 `put_prev_task` 的 ::: ### `pick_next_task_fair` [`pick_next_task_fair`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L7238) 是 fair class scheduler 選擇任務機制的實作。 ```cpp struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { struct cfs_rq *cfs_rq = &rq->cfs; struct sched_entity *se; struct task_struct *p; int new_tasks; again: if (!sched_fair_runnable(rq)) goto idle; ``` [`sched_fair_runnable`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L2219) 首先確認 CFS 的 runqueue 中是否有 runnable 的任務，如果沒有，會移動到 `idle` 的該段程式碼，這部分我們留待後頭探討。如果 CFS 的 runqueue 中有任務，那麼就會往後執行後面的程式碼。在 `CONFIG_FAIR_GROUP_SCHED` 被打開的情況下，表示啟動分組排程的功能，可以看到以下會有額外的一段流程。 ```cpp #ifdef CONFIG_FAIR_GROUP_SCHED ... #endif ``` :::danger 由於筆者水平有限，加上相關資源都跳過對此段的研究，為避免錯誤解讀因此暫時略過這段QAQ ::: ```cpp if (prev) put_prev_task(rq, prev); do { se = pick_next_entity(cfs_rq, NULL); set_next_entity(cfs_rq, se); cfs_rq = group_cfs_rq(se); } while (cfs_rq); p = task_of(se); ``` simple 流程則較為單純。首先是藉由 [`put_prev_task`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L2160) -> [`put_prev_task_fair`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L7328) 將當前任務放回 runqueue 的紅黑樹中，然後就是 1. 利用 [`pick_next_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4482) 從 `cfs_rq` 中挑出優先的 se 2. [`set_next_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4434) 將下一個獲得排程的 se 從 runqueue 中取出，該 se 會被 `cfs_rq->curr` reference 到 3. 如果該 se 是屬於 group se (`se->my_q != NULL`)， `group_cfs_rq` 就再前往下一層的 `cfs_rq`，回到 step 1 往復，直到找到 task se 為止(透過 `task_of` 可以得到 `task_struct`)。 ```cpp done: __maybe_unused; #ifdef CONFIG_SMP /* * Move the next running task to the front of * the list, so our cfs_tasks list becomes MRU * one. */ list_move(&p->se.group_node, &rq->cfs_tasks); #endif if (hrtick_enabled_fair(rq)) hrtick_start_fair(rq, p); update_misfit_status(p, rq); return p; ``` 如果 `CONFIG_SMP` 被設置，額外需要把該 task 放到 `cfs_tasks` 這個 list 的前方，後者是在 `CONFIG_SMP` 下才會定義的成員，這是一個 Most Recently Used List，與 load balance 相關(?) > [](https://lkml.iu.edu/hypermail/linux/kernel/2208.1/02244.html) 如果 `CONFIG_SCHED_HRTICK` 被設置，需要針對 hrtimer 進行相關的設置。[`update_misfit_status`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L4088) 這邊則是會做 misfit task 的檢查(這兩段暫略)。最後就可以回傳被選到的 task 了! ```cpp idle: if (!rf) return NULL; new_tasks = newidle_balance(rq, rf); /* * Because newidle_balance() releases (and re-acquires) rq->lock, it is * possible for any higher priority task to appear. In that case we * must re-start the pick_next_entity() loop. */ if (new_tasks < 0) return RETRY_TASK; if (new_tasks > 0) goto again; /* * rq is about to be idle, check if we need to update the * lost_idle_time of clock_pelt */ update_idle_rq_clock_pelt(rq); return NULL; } ``` 回到最開始省略的 idle 流程。`*rf` 指向 [`struct rq_flags rf`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L1510) 結構，這和 `newidle_balance` 中的 dead lock 檢測有關(?)。而 [`newidle_balance`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L10870) 在 `CONFIG_SMP` 被 enable 時會回傳非零值，其作用是某個 CPU 即將要進入 idle 時，嘗試去其他的 CPU 上取得任務來執行: * return == 0: * 首先運行 [`update_idle_rq_clock_pelt`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/pelt.h#L115)，這與 `rq->clock_pelt` 的維護有關，後者是用 PELT 做 load balance 時使用的 timeline。`update_idle_rq_clock_pelt` 這邊會計算 `rq->lost_idle_time` * 因為已經沒有可挑選的任務，`pick_next_task_fair` 最後回傳 NULL :::info 簡單來說 `clock_pelt` 的使用考慮了以下情境: 在如 [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) 的系統架構下，CPU 會有大小核之分，時脈頻率也有不同，因此即便是同樣的任務量，也會因為各自計算能力的不同導致花費時間的區別;換句話說，直接依此時間計算的話，相同任務量也會在不同的 CPU 上得到各異的 PELT 結果。為了能達到一致，實際上 PELT 使用的時間會經過正規化，即把所有執行時間對齊到最大核的最高頻率上。但帶來的副作用是 cpu idle 的時間被壓縮了（正規化後的時間會變小一些，但 idle 時間就不需區分大小核/頻率了），為了解決這個問題，因此需要 `update_idle_rq_clock_pelt`。 ::: * return < 0: 有更高優先級的任務被喚醒，回傳 `RETRY_TASK`，重新選擇下個任務 * return > 0: 得到新的 fair task，因此回到 `again` 從頭來過 ## Reference * [蜗窝科技 - CFS](http://www.wowotech.net/tag/CFS) * [完全公平调度(Completely Fair Scheduler)](https://github.com/freelancer-leon/notes/blob/master/kernel/sched/sched_cfs.md) * [CFS调度器（3）-组调度](http://www.wowotech.net/process_management/449.html) * [Linux CFS and task group](https://mechpen.github.io/posts/2020-04-27-cfs-group/index.html)