理解 CFS 到 EEVDF

# 理解 CFS 到 EEVDF 為了理解 Linux 核心的 CPU 排程器從 CFS 到 EEVDF 轉變的動機，首先會需要了解 CPU 的 bandwidth control。以下整理相關論文內容及參考文件。 [CPU bandwidth control for CFS](https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf) https://danluu.com/cgroup-throttling/ 另外，此[論文](https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf)撰寫於 2010，當中提及的實作方式與目前 Linux latest release LTS `v6.1.140` 已經有所差距，目前是打算以 `v6.1.140` 的原始碼為主來搭配論文探討。 ## 論文 - CPU bandwidth control for CFS Fair Group Scheduler 透過指定 group （`sched_entity` 可代表任務也可代表群組）的權重來按比例分配的 CPU 資源使用時間，排程器的本質是 work-conserving 的。Work-conserving 意指排程器只要有可執行的實體，就會盡可能讓 CPU 忙碌，只有在沒有可執行實體的情況下，CPU 才會 idle。這種設計讓系統能夠充分利用 CPU 資源，不浪費任何可用的計算能力。這種設計在大多數情況下（例如個人電腦）是效率會是很好的，但在某些情況下，會帶來下列副作用： * CPU 分配變得不穩定：某個群組實際能使用的 CPU 時間，會隨其他群組是否活躍而波動。 * 很難準確分配資源：如果不了解所有共用 CPU 的應用在做什麼，無法有效的預先規劃每個人可以用多少資源。 * 無法設上限：shares 提供的是「下限」保證（該群組在大家都忙時至少能拿到多少），但沒有提供「上限」控制。所以當其他群組很閒時，任務可能突然用很多 CPU，超出預期。也因此無法預測最多會用到多少 CPU，對容量相關的規劃（例如要部署幾個應用、客戶要收多少錢）造成困難。 > shares 在以下 group scheduling 段落會簡單介紹 ### Quota 概念的引入為了解決上述問題，可透過 CPU bandwidth control。Bandwidth control 機制引入了 quota 的概念，為群組設定明確的上限。與 shares 提供下界保證不同，quota 提供上界限制，使系統管理者能夠預測和控制資源使用。這項功能在企業環境中很實用，例如在按使用付費的場景中，客戶購買特定的 CPU 資源配額，服務提供者需要確保客戶既不會獲得過少也不會消耗過多資源。可以盡可能準確的控制客戶使用的 CPU 資源，也有助於提供較穩定的延遲保障，改善多租戶環境中的資源隔離。 ### 排程器歷史簡介 > 較詳細的 linux 排程器歷史可參考《Demystifying the Linux CPU Scheduler》一書 Linux CPU 排程器中有兩個主要的排程類別： * `SCHED_RT`（即時排程類別） * `SCHED_NORMAL`（一般排程類別）當即時排程類別的排程實體變為可執行時，總是會優先於一般排程類別的實體被選中執行。可以注意的是，`SCHED_RT` 的時間消耗是獨立於 `SCHED_NORMAL` 的。在 v2.6.24 之前，Linux 排程器只管理單一 task，並以 `nice(2)` 值作為主要的頻寬控制手段。CFS 在 v2.6.24 引入，取代了原先的 `SCHED_NORMAL` 排程類別的實作方式。 :::info > In v2.6.24, the completely fair scheduler (CFS) was merged, replacing the existing SCHED_NORMAL scheduling class. 論文的這段我認為會有些誤導，我的理解是底層的實作機制，也就是排程的演算法從 O(1) 改成使用 CFS，但不是代表 CFS 取代 `SCHED_NORMAL` 這個排程類別本身。 ::: CFS 基於權重來進行 CPU 的頻寬分配，可讓使用者自訂 group scheduling ，透過 `cgroups` 的 CPU 控制子系統進行管理。 ### Group Scheduling CFS 引入了基於權重的排程機制，透過指定每個任務或群組的 shares 值，實作按比例分配的 CPU bandwidth。每個可執行的實體會根據其 shares 比例獲得對應的 CPU 時間，這提供了一種下界保證（lower-bound provisioning）。注意到 shares 並不代表使用上限，而是一種比例性分配：當部分實體未執行時，其餘可執行的實體可以取得更多 CPU 資源。 > Assigning share to a group doesn’t guarantee that it will get a particular amount of CPU. It only means that the available CPU bandwidth will be divided as per the shares. shares 並不是絕對的 CPU 配額，而是表示相對於其他可執行實體的「權重比」。實際舉例，假設有三個群組，各有 $1024$ shares 三個群組的總 shares：$1024 + 1024 + 1024 = 3072$ 每個群組的 CPU 分配比例： $$1024 / 3072 = 1 / 3 ≈ 33.3%$$ 加入第 4 個群組，有 2048 shares 新的總 shares：$1024 + 1024 + 1024 + 2048 = 5120$ 原有的三個群組的比例變為： $$1024 / 5120 = 0.2 = 20%$$ 新的第 4 群組比重為： $$2048 / 5120 = 0.4 = 40%$$ 可以看出 shares 所對應的實際 CPU 資源分配是會動態變動的，依賴於： * 哪些排程實體（tasks/groups）處於 runnable 狀態。 * 各自的 shares 設定。 ### Nested Groups CFS 引入 group scheduling 後，整個排程架構從原本的單層任務排程演進為多層階層結構。在這個架構中，排程器不再直接面對所有任務，而是先處理群組，再由群組管理其內部的子群組或任務。 ``` 傳統模型：Scheduler → Tasks 群組模型：Scheduler → Groups → Tasks ``` 傳統模型中，排程器直接管理系統中的所有任務。但在群組模型中，排程器首先在頂層群組間進行排程，每個群組再負責其內部實體的排程。透過階層式的群組結構（Nested Groups）來達成實作，代表群組中可以再包含群組。每個群組或任務的 bandwidth 分配透過相應的 shares 來配置，這個 shares 就代表其權重。對於單一任務而言，可用 `nice(2)` API 來調整其 shares。論文中的例子：設想一個大學伺服器的情境。最頂層可能分為教授群組、學生群組和行政群組。假設教授群組分配到 2048 shares，學生群組和行政群組各分配 1024 shares，那麼教授群組就能獲得 50% 的 CPU 資源，其他兩個群組各得 25%。在教授群組內部，可能又細分為不同的個別教授。如果 Prof A 和 Prof B 在教授群組內各有 1024 shares，他們就會平分教授群組的 50% CPU 資源，也就是各得到整體系統 25% 的 CPU 時間。學生群組內部也可能進一步劃分為大學部學生和研究生，各自根據其 shares 來分配學生群組所獲得的 25% CPU 資源。 #### 統一的排程實體概念 CFS 透過 `sched_entity` 結構體統一處理群組和任務。每個 `sched_entity` 可能代表一個群組或一個任務，排程器在處理時不需要區分兩者的差異。群組類型的 `sched_entity` 會包含一個指向其子執行佇列（`cfs_rq`）的指標，這個子佇列管理該群組內的所有子實體。任務類型的 `sched_entity` 則是階層的葉節點，不包含子佇列。這種設計讓排程演算法可以遞迴的在階層中運作。排程器首先在頂層選擇要執行的群組，然後遞迴到該群組內部選擇具體的子群組或任務。 ### Hybrid Quota Model Local pool 模型讓每個 CPU 的 runqueue 都維護自己的 quota pool。當某個 CPU 的 quota 用光時，它需要向其他 CPU 借用剩餘的 quota。這種設計在小型 SMP 系統中運作是良好的，但隨著 CPU 數量增加就會遇到嚴重問題。在多對多的借用關係中，每次 quota 不足時都需要逐一走訪所有其他 CPU 的 runqueue，並且需要得到每個 runqueue 的 lock。另外每次借用只能獲得其他 CPU 剩餘 quota 的一小部分，這代表可能需要多次的迭代才能獲得足夠的 quota。 Local pool 模型的優點是，當 CPU 有足夠的 local quota 時，執行 quota 扣除操作完全不需要 lock，也不需要訪問其他遠端的資料結構。另一方面，完全採用 global pool 的做法是讓所有 CPU 直接從同一個全局 quota pool 中扣掉使用量。這種方式雖然避免了複雜的借用關係，但會造成嚴重的全域競爭問題。因為每次任務執行時都需要更新全域 quota，表示會需要大量的 lock 競爭。在高度並行的系統中，這種全域競爭可能會成為效能瓶頸。而 global pool 模型的優點在於 quota 管理的簡潔和準確，所有的 quota 使用都能即時反映在全局狀態中，不會有 local 不一致的問題。以下整理： * 傳統的 local Pool 模型： * 缺點：local pool 模型中存在多對多的關係來追蹤每個 runqueue 的剩餘 quota。在 SMP 系統小時還可以接受，但隨著 CPU 數量變多就會變得複雜且低效。 * 優點：一旦有 quota，可以在本地使用，locklessly 且無需遠端查詢，執行效率非常高。 * global Pool 模型： * 缺點：如果完全使用全域 quota pool 來追蹤每個排程實體的資源使用量，會導致 global contention 嚴重。另外在大型系統中不具擴展性。 * 優點：quota 管理的簡潔和準確。為了結合兩種模型的優勢並避免各自的缺點，CFS 實作上是採用 local 與 global 結合的混合式模型：每個 `task_group` 新增了一個 `cfs_bandwidth` 結構體，global pool 會由 `cfs_bandwidth` 來管理，追蹤目前的配額總量還有已用配額。 ```c struct cfs_bandwidth { #ifdef CONFIG_CFS_BANDWIDTH raw_spinlock_t lock; ktime_t period; u64 quota; // u64 runtime; // u64 burst; u64 runtime_snap; s64 hierarchical_quota; u8 idle; u8 period_active; u8 slack_started; struct hrtimer period_timer; struct hrtimer slack_timer; struct list_head throttled_cfs_rq; /* Statistics: */ int nr_periods; int nr_throttled; int nr_burst; u64 throttled_time; u64 burst_time; #endif }; ``` 但實際的 quota 消耗並不直接在這個全局的結構上進行，而是先分配到各個 CPU 的 local pool 中。每個 `cfs_rq` 仍保有自己的 local granted + consumed quota。當 local quota 不足時，不會向其他 CPU 借用，會從 global pool 申請新的 quota（一次申請一個 batch，大小可由使用者設定）。 ```c struct cfs_rq { ... u64 runtime_remaining; // 對應論文的 local granted quota u64 runtime_used; // 對應論文的 consumed quota ... }; ``` 如果沒能從 global pool 取得 quota，此 runqueue 就會被 throttle。每當進入新的一個 quota period 時會把所有被 throttle 的 runqueue 解除限制，並且全域的配額被重新補充。 > Our design for the distribution of quota is a hybrid model which attempts to combine both local and global quota tracking. To each task_group a new cfs_ bandwidth structure has been added. This tracks (globally) the allocated and consumed quota within a period. However, consumption does not occur against this pool directly; as in the local pool approach above there is a local, per cfs_rq, store of granted and consumed quota. This quota is acquired from the global pool in a (user configurable) batch size. When there is no quota available to re-provision a running cfs_rq, it is locally throttled until the next quota refresh. Bandwidth refresh is a periodic operation that occurs once per quota period within which all throttled runqueues are unthrottled and the global bandwidth pool is replenished. 要開啟 CFS bandwidth control 的話，需要設定以下 `CONFIG` 選項： * `CONFIG_FAIR_GROUP_SCHED`：開啟公平 group scheduling * `CONFIG_CFS_BANDWIDTH`：開啟 CFS bandwidth control ```c if CGROUP_SCHED config FAIR_GROUP_SCHED bool "Group scheduling for SCHED_OTHER" depends on CGROUP_SCHED default CGROUP_SCHED config CFS_BANDWIDTH bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED" depends on FAIR_GROUP_SCHED default n help This option allows users to define CPU bandwidth rates (limits) for tasks running within the fair group scheduler. Groups with no limit set are considered to be unconstrained and will run with no restriction. See Documentation/scheduler/sched-bwc.rst for more information. ``` ### Bandwidth distribution and constraint 根據論文，`update_curr` 函式會被週期性的呼叫，來源包括： * scheduler ticks（排程器的時鐘中斷） * 其他排程器事件，像是 enqueue，dequeue 這個函式主要負責計算目前執行實體（`curr`）自上次呼叫以來獲得的 CPU 時間，也會更新 `vruntime` 等等排程統計的資訊。最後呼叫 `account_cfs_rq_runtime` 函式。 ```c static void update_curr(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; u64 now = rq_clock_task(rq_of(cfs_rq)); u64 delta_exec; ... account_cfs_rq_runtime(cfs_rq, delta_exec); } ``` > todo：深入 trace `update_curr` #### Quota 分配和追蹤 ![截圖 2025-06-05 下午4.04.36](https://hackmd.io/_uploads/Hy4kxCCGlg.png) > 對照論文的 Figure 5: Control flow diagram 剛剛說明的 `update_curr` 會呼叫 `account_cfs_rq_runtime` 函式，對應論文裡 `account_cfs_rq_quota` 函式，負責追蹤和扣除 CPU 使用時間。 ::: info 研讀過程中發現好像許多舊的版本中是 quota 的地方改成叫 runtime 為什麼？ > todo ::: ```c static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled) return; __account_cfs_rq_runtime(cfs_rq, delta_exec); } ``` 如果沒有啟用 bandwidth control 或此 runqueue 沒有限制，直接返回。反之繼續呼叫下一層函式 `__account_cfs_rq_runtime`。 ```c static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { /* dock delta_exec before expiring quota (as it could span periods) */ cfs_rq->runtime_remaining -= delta_exec; // 對應論文圖片 local_quota_used += delta if (likely(cfs_rq->runtime_remaining > 0)) // 對應 local_quota_used < local_quota_assigned ? return; if (cfs_rq->throttled) return; if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) resched_curr(rq_of(cfs_rq)); // 對應 Throttle cfs_rq and resched task } ``` 從 local 剩餘的配額中扣除剛才執行的時間 `delta_exec`，如果還有剩餘，直接返回： ```c cfs_rq->runtime_remaining -= delta_exec; if (likely(cfs_rq->runtime_remaining > 0)) return; ``` 如果已經被 throttle，不再處理。 ```c if (cfs_rq->throttled) return; ``` 配額不足時的處理： ```c /* * if we're unable to extend our runtime we resched so that the active * hierarchy can be throttled */ if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) resched_curr(rq_of(cfs_rq)); ``` 嘗試從 global pool 申請更多 quota（`assign_cfs_rq_runtime`）如果申請失敗且有正在執行的任務，觸發重新排程。重新排程會導致該 runqueue 被 throttle。 #### 從 Global Pool 申請 Quota 剛剛提到的從 global pool 申請一個 batch size 的 quota，透過 `assign_cfs_rq_runtime` 函式來做，對應到論文圖例中的 `Borrow runtime from global pool borrowed = tg_request_cfs_quota()`。 ```c /* returns 0 on failure to allocate runtime */ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) { struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); int ret; raw_spin_lock(&cfs_b->lock); ret = __assign_cfs_rq_runtime(cfs_b, cfs_rq, sched_cfs_bandwidth_slice()); raw_spin_unlock(&cfs_b->lock); return ret; } ``` 輸入的參數 `cfs_rq` 是需要分配運行時間的 runqueue。首先透過 `tg_cfs_bandwidth` 函式獲取該 `cfs_rq` 所屬 `task_group` 的 `cfs_bandwidth` 結構體。 ```c static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) { return &tg->cfs_bandwidth; } ``` 呼叫底層函式 `__assign_cfs_rq_runtime` 執行實際的分配工作，分配大小由 `sched_cfs_bandwidth_slice` 決定。 ```c ret = __assign_cfs_rq_runtime(cfs_b, cfs_rq, sched_cfs_bandwidth_slice()); ``` ```c static inline u64 sched_cfs_bandwidth_slice(void) { return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC; } ``` 此函式將 `sysctl_sched_cfs_bandwidth_slice` 從微秒轉去奈秒。 --- `v6.1.140` `sysctl_sched_cfs_bandwidth_slice`： ```c #ifdef CONFIG_CFS_BANDWIDTH /* * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool * each time a cfs_rq requests quota. * * Note: in the case that the slice exceeds the runtime remaining (either due * to consumption or the quota being specified to be smaller than the slice) * we will always only issue the remaining available time. * * (default: 5 msec, units: microseconds) */ static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL; #endif ``` > This sysctl interface manages how many units are transferred from the global pool each time a local pool requires additional quota. The current default is 10ms. The details of this transfer process are discussed in later sections. 以上可以看出 v6.1.140 的實作上跟論文的的落差，一個是 5ms 一個是 10ms 。從論文第七章討論 Fairness issues 的部分，提及到了 batch size 的重要性： > A short term mitigation strategy could be to reduce the batch-sizing used for the propagation of quota from global to local pools. * 較小的 slice 可以更精細的控制配額，減少過度分配，但是也會更頻繁的去請求 global pool，增加同步的競爭 * 較大的 slice 可以減少 global pool 的訪問頻率，降低同步開銷，但是可能造成配額浪費和不公平分配。後來將預設值從 10ms 降低到 5ms 可能就是基於上述考量。 > todo：why where when --- 以下繼續看 `__assign_cfs_rq_runtime`： ```c static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b, struct cfs_rq *cfs_rq, u64 target_runtime) ``` 輸入的參數 `target_runtime` 是剛剛提到的換算成奈秒的 `sysctl_sched_cfs_bandwidth_slice`。首先計算需要申請的最小時間量。如果 `runtime_remaining` 為負數（表示已透支），這會是一個正數。 ```c min_amount = target_runtime - cfs_rq->runtime_remaining; ``` 當配額設為無限時（`RUNTIME_INF`），直接滿足所有需求。反之會去檢查 global pool （由 `cfs_bandwidth` 結構體管理）是否還有配額。有的話分配需求量以及剩餘配額當中的較小值。 ```c if (cfs_b->quota == RUNTIME_INF) amount = min_amount; else { start_cfs_bandwidth(cfs_b); if (cfs_b->runtime > 0) { amount = min(cfs_b->runtime, min_amount); cfs_b->runtime -= amount; cfs_b->idle = 0; } } ``` 最後將分配到的時間加入 local pool，並返回是否成功獲得配額。 ```c cfs_rq->runtime_remaining += amount; return cfs_rq->runtime_remaining > 0; ``` ### Throttling 機制首先要先釐清楚 `task_group`、`cfs_rq`、`sched_entity` 的關係：每個 `task_group` 在每個 Cpu 上都有： * 一個 `cfs_rq`：可以想成一種容器，存放屬於這個群組的排程實體 `sched_entity` ，當然這個排程實體可能是指單一任務也可能是另一個 `task_group`。 * 一個 `sched_entity`：在親代群組的 `cfs_rq` 中代表這個群組，因為 `task_group` 本身也是一種 `sched_entity` --- > A throttled cfs_rq is one that has run out of local bandwidth (specifically, local_quota_used ≥ local_quota_assigned) 根據論文中對於 throttled 的定義，throttled 的 `cfs_rq` 是指 local bandwidth 耗盡，當本地已使用配額 ≥ 本地分配到的配額時觸發。同時也代表 global pool 也沒有額外配額可以去補充。被 throttled 的排程實體在本次 quota 週期中已經達到了自己可用的頻寬上限，因此目前不能被排入執行，必須等到下一個 quota 週期才能繼續執行。要注意到的是，被 throttled 的實體一定會是正在運行的，因為只有運行中的實體才有可能會去消耗配額。 > the dequeue operation on the throttled entity will just update the accounting since the currently running entity is already dequeued from RB tree as per CFS design 在 CFS 設計中，獲得 CPU 資源而正在運行的排程實體已經從紅黑樹中移除，所以對正在運行實體的 dequeue 操作只會： - 更新統計資訊 - 設定 `entity->on_rq = 0` - 不需要實際從樹中移除，因為已經不在樹上 ### [`throttle_cfs_rq`](https://elixir.bootlin.com/linux/v6.1.140/source/kernel/sched/fair.c#L5310) 函式實作部分 ```c static bool throttle_cfs_rq(struct cfs_rq *cfs_rq); ``` 在實際 throttle 前，透過 `__assign_cfs_rq_runtime` 函式再次嘗試分配 1ns 的 runtime 給 `cfs_rq`。 ```c /* * We have raced with bandwidth becoming available, and if we * actually throttled the timer might not unthrottle us for an * entire period. We additionally needed to make sure that any * subsequent check_cfs_rq_runtime calls agree not to throttle * us, as we may commit to do cfs put_prev+pick_next, so we ask * for 1ns of runtime rather than just check cfs_b. */ ``` 某個時刻任務正在被檢查是否需要 throttle，但同時 global pool 剛好補充回來（例如新的週期開始，或配額被增加），造成檢查結果可能過時或錯誤。unthrottle 的機制可能只在每個週期重新配額時觸發。如果在週期開始時發生競爭，被誤判需要 throttle，那整個週期都可能失去執行機會，導致 starvation。另外如果只是檢查到有配額從而決定不 throttle，在執行 `put_prev + pick_next` 的過程中，其他 CPU 可能消耗掉剩餘配額，後續的 `check_cfs_rq_runtime` 呼叫會發現沒配額了，導致狀態不一致。因此為了讓整體判斷更一致，改成主動請求 1ns。 > todo `check_cfs_rq_runtime` 相關 ```c raw_spin_lock(&cfs_b->lock); if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)) { dequeue = 0; } else { list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq); } raw_spin_unlock(&cfs_b->lock); if (!dequeue) return false; ``` ```c /* freeze hierarchy runnable averages while throttled */ rcu_read_lock(); walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq); rcu_read_unlock(); ``` #### Hierarchical Impact > todo 搭配書本第四章閱讀