Linux 核心設計: Scheduler(2): 概述 CFS Scheduler

--- tags: Linux Kernel Internals, 作業系統 --- # Linux 核心設計: Scheduler(2): 概述 CFS Scheduler ## 前言 ### Problem of scheduler Scheduler 需要考慮的議題相當複雜，如何 "公平" 的排程呢？顯然不是每個任務都享有相同的 CPU 時間那麼單純而已。如何把抽象的 "priority" 轉換成公平的時間分配？Interactive 的程式與 batch 的程式該如何排程才可以滿足使用的期待？諸多的問題讓 scheduler 的設計可說是沒有正確答案的。 :::info interactive 指接受某些形式輸入並產生回應的任務類型。例如一個文字編輯器，會一直等待使用者輸入內容，且當輸入後儘量快速的回應(例如顯示在畫面上) ::: 在 linux 上，scheduler 需要思考更多問題。即使 linux 最初是以桌面系統發展的，但如今它已經被廣泛運用在 server、嵌入式系統、[mainframes](https://en.wikipedia.org/wiki/Mainframe_computer)、及超級電腦上。顯然，這些系統的 schedule domain 各自不同，硬體結構的差異、對任務優先的差異等，更是讓 linux 排程器的設計充滿困難。 O(1) scheduler 雖然比起其前面的版本具有更高的彈性，也考量了對於 I/O bound 及 CPU bound 的 "公平" 問題，但是因為其仰賴 heuristics 的計算來調整任務，導致了程式的架構非常難以維護，而 heuristics 也缺乏算術實質，沒有具體的邏輯建立排程的演算法。正因如此，CFS 在其後被提出以改善 O(1) scheduler 所存在的問題。 ### Overview of CFS CFS 的核心想法是保持 task 對於使用 CPU time 的 "平衡"，當有 task 被給予的可用的時間落後於其他時，scheduler 會對此做出補償，給予其更多的執行時間。那麼 CFS 該怎麼定義 "平衡" 呢？先簡而言之: CFS 會維護每一個 task 的 *virtual runtime*。 Task 的 virtual runtime 愈小，表示其被允許使用的 CPU 時間愈短，也間接代表此 task 對 CPU 的需要更高，因此更需要被排程。CFS 同時也思考到了進入 sleep 狀態的任務(例如等待 I/O)之公平性，確保當他們醒來時可以得到足夠的 CPU 額度。 :::success 補充： Virtual runtime 指 process 截至目前為止已經運行的虛擬時間，所以一個任務的 vruntime 越小，代表它比起其他任務還需要更多的 CPU 額度。 ::: 這麼說明可能有些不清楚，下面就讓我們更仔細的來認識 CFS 中的設計考量和核心架構吧! ## sched_class 在探討 CFS 本身之前，可以先認識一下在現今的 Linux 版本(5.16)中設計 scheduler 的框架。由於核心提供許多不同種類的 scheduler ，例如 RT scheduler、Deadline scheduler、CFS scheduler 及 Idle scheduler 等。近似於物件導向的 parent class(沒錯! Linux 中有許多透過 C 語言設計的物件導向思維)，這些不同的 scheduler 透過 [`struct sched_class`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L2095) 結構作為一致的抽象(abstraction)定義，這樣的模組化使得設計 scheduler 時可以不需要修改大量的程式碼，只需要定義 scheduler 對應的 operation 即可。 scheduler 的定義透過 [`DEFINE_SCHED_CLASS`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L2180) 這個 macro 完成，後者將 scheduler 根據 linker script 擺放到編譯出的 linux image 的對應位置。設計上，每一個執行單元在建立之後，都要選擇一種調度策略，而每一種調度策略會對應一個 sched_class。因此，除了執行單元在 scheduler 中有優先級，不同的 scheduler 之間也存在先後之分。有趣的是，這些 scheduler 的先後順序是透過很特別的方法決定的，前面我們有提到 `DEFINE_SCHED_CLASS` 會將 scheduler 按照 linker script 擺在記憶體中。實際上，這些 scheduler 會緊密的排列在相鄰的位置，而參見 [/include/asm-generic/vmlinux.lds.h](https://elixir.bootlin.com/linux/v6.14.5/source/include/asm-generic/vmlinux.lds.h#L132): ```c /* * The order of the sched class addresses are important, as they are * used to determine the order of the priority of each sched class in * relation to each other. */ #define SCHED_DATA \ STRUCT_ALIGN(); \ __sched_class_highest = .; \ *(__stop_sched_class) \ *(__dl_sched_class) \ *(__rt_sched_class) \ *(__fair_sched_class) \ *(__ext_sched_class) \ *(__idle_sched_class) \ __sched_class_lowest = .; ``` 註解告訴我們 `sched_class` 在記憶體中的先後順序其實就對應優先程度!這使得核心實際上不需要透過額外的資料結構來紀錄優先序。在 [`_pick_next_task`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L5579) 的 [`for_each_class`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L2195) 也驗證了此種設計。 :::info 之所以用 section 來決定優先順序是為考慮對 cache 使用上的友善，詳見: (感謝 @Uduru0522 補充) * [sched: Micro optimization in pick_next_task() and in check_preempt_curr()](https://lore.kernel.org/all/157675913272.349305.8936736338884044103.stgit@localhost.localdomain) * [sched: Force the address order of each sched class descriptor](https://lore.kernel.org/all/20191219214558.845353593@goodmis.org/T/#m74565019ad2c2ebbe4eff82190a72ed70c683d2c) ::: 也因為所有的 scheduler 都需要透過 `DEFINE_SCHED_CLASS` 做宣告，我們可以在 [`DEFINE_SCHED_CLASS` 的搜尋頁面](https://elixir.bootlin.com/linux/v6.14.5/C/ident/DEFINE_SCHED_CLASS) 找到不同種類的 scheduler 所在。因此，我們將從此 macro 出發，並深入解析實作一個 scheduler 所需定義的介面。 ## CFS scheduler 的關鍵元素 > [CFS调度器（1）-基本原理](http://www.wowotech.net/process_management/447.html) ### Weight Function 所謂公平並不是使所有任務都具有相同的執行時間，而是應該根據每個任務的優先級給予合適的時間比例。為此，核心引入權重的概念，權重代表著進程的優先級。各個進程之間按照權重的比例分配 cpu 時間。而權重的數值就透過 nice 值的概念進行對應，CFS 可以透過陣列表對應 nice value 與一個 weight 數值。nice 值的範圍在 [-20, 19] 之間，值越小，代表優先級越大，權重值越大。如下所示為在 linux kernel 4.4 以前在 kernel/sched/sched.h header 中的定義。 ```cpp /* * Nice levels are multiplicative, with a gentle 10% change for every * nice level changed. I.e. when a CPU-bound task goes from nice 0 to * nice 1, it will get ~10% less CPU time than another CPU-bound task * that remained on nice 0. * * The "10% effect" is relative and cumulative: from _any_ nice level, * if you go up 1 level, it's -10% CPU usage, if you go down 1 level * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25. * If a task goes up by ~10% and another task goes down by ~10% then * the relative distance between them is ~25%.) */ static const int prio_to_weight[40] = { /* -20 */ 88761, 71755, 56483, 46273, 36291, /* -15 */ 29154, 23254, 18705, 14949, 11916, /* -10 */ 9548, 7620, 6100, 4904, 3906, /* -5 */ 3121, 2501, 1991, 1586, 1277, /* 0 */ 1024, 820, 655, 526, 423, /* 5 */ 335, 272, 215, 172, 137, /* 10 */ 110, 87, 70, 56, 45, /* 15 */ 36, 29, 23, 18, 15, }; ``` nice value 換算到 weight 的公式近似 $\frac{1024} {1.25 ^ {nice}}$。因此 nice value 每增加 1，weight 就乘上 1.25 倍。這使得 nice value 每增加 1 可以減少約 10% 的 CPU time 的比重。 :::info 我們可以簡單的計算出 10% 的由來，假設 process A 的 nice 為 n 而 process B 的 nice 為 n + 1。因此 $weight_A = \frac{1024} {1.25 ^ {n}}$ 而 $weight_B = \frac{1024} {1.25 ^ {n+1}}$，則 A 佔比重 $\frac{weight_A}{ weight_A+weight_B}$，B 佔比重 $\frac{weight_B}{ weight_A+weight_B}$，兩者比重相減的值(計算略)得到的數值就約為 10% ::: 不過這其實存在一個容易被忽略的缺陷: 因為該陣列是 static 宣告在 header 中的，這導致所有 include 都會建立一份獨立的 header 到記憶體中。由於該陣列理應只需在記憶體中存在一份且不必被修改，該結構被修改成 [`const int sched_prio_to_weight`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L10848)。對照 [sched/core: Move the sched_to_prio[] arrays out of line](https://lore.kernel.org/lkml/tip-ed82b8a1ff76ed7b2709e36ed361ddd022fe2407@git.kernel.org/) 的 patch 郵件，從這裡我們也可以發現一個也許可以著手的修改思考! ### Schedule Latency ```cpp /* * The idea is to set a period in which each task runs once. * * When there are too many tasks (sched_nr_latency) we have to stretch * this period because otherwise the slices get too small. * * p = (nr <= nl) ? l : l*nr/nl */ static u64 __sched_period(unsigned long nr_running) { if (unlikely(nr_running > sched_nr_latency)) return nr_running * sysctl_sched_min_granularity; else return sysctl_sched_latency; } ``` schedule lantency 所指是每個任務在被排程之後，下一次可以再被排程的時間，一般狀況下固定為 [`sysctl_sched_latency`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L38)(6ms)。但假設想固定這個延遲時間，可以想像如果 queue 中任務一多，每個任務能夠獲得的 time slice 將被瓜分而導致過於頻繁的 context switch。因此，CFS 中會根據 queue 中的任務數量動態調整 latency 的。當任務數量超過 [`sched_nr_latency`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L73)，會以 `nr_running` 為權重提高 latency，確保每個任務至少有足夠 `sysctl_sched_min_granularity` 的運行時間可以使用。可以在 [`__sched_period`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L668) 看到這一系列行為的呈現。 ### Virtual Runtime 前面我們有提到 CFS 會根據不同任務的 nice value 分配不同比例的時間，但這就表示任務實際執行的時間長短不能做為判斷是否 "公平" 分配的依據。為了方便管理，CFS 引入了 virtual runtime(vruntime) 的觀念來定義任務之間的 "公平"。每一個任務的 virtual runtime 會根據 nice value 所對應的權重以及實際執行的時間做計算，如此一來，task 的 virtual runtime 越小，就可以直接表示其對 CPU 的需要更高。換句話說，CFS 只需要保證每個任務在系統中運行的 vruntime 是平衡的即可，在選擇下一個需要被進行的任務的時候，CFS 就只需選擇 vruntime 數值中最小的任務來達到公平。那麼，將實際時間(wall time)轉換成 vruntime 的公式為何呢? 我們可以關注 [`calc_delta_fair`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L652)，先閱讀其中的註解部份: ```cpp /* * delta_exec * weight / lw.weight * OR * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT * * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case * we're guaranteed shift stays positive because inv_weight is guaranteed to * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22. * * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus * weight/lw.weight <= 1, and therefore our shift will also be positive. */ ``` 根據註解，計算式為: $$ \begin{aligned} \text{vruntime_delta} &= \text{delta_exec} \times \frac{\text{weigth}}{\text{lw.weigth}} \\ &= (\text{delta_exec} \times \text{weigth} \times \text{lw->inv_weigth}) >> \text{WMULT_SHIFT} \end{aligned} $$ 為了方便說明，我們先考慮 `weight == NICE_0_LOAD` 的一般情形下([`CONFIG_64BIT`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L122) not defined)，可以把計算式理解成: $$ \text{vruntime_delta} = \text{wall_time_delta} \times \frac{\text{NICE_0_weigth}}{\text{NICE_k_weigth}} $$ 不過在核心的計算中可能需要遇到沒有 FPU 的情況，為此我們需要針對會用到的除法運算進行調整。因為 $1/ \text{NICE_k_weigth}$ 顯然是 1 以下的小數(對照 `sched_prio_to_weight` 都是正整數)，我們可以把式子調整成: $$ \begin{aligned} \text{vruntime_delta} &= \text{wall_time_delta} \times \frac{\text{NICE_0_weigth}}{\text{NICE_k_weigth}} \\ &= \text{wall_time_delta} \times \frac{\text{NICE_0_weigth} \times 2^{32}}{\text{NICE_k_weigth} \times 2^{32}} \\ &= (\text{wall_time_delta} \times \frac{\text{NICE_0_weigth} \times 2^{32}}{\text{NICE_k_weigth}} ) >> 32 \\ &= (\text{wall_time_delta} \times \text{NICE_0_weigth} \times \frac{2^{32}}{\text{NICE_k_weigth}} ) >> 32 \end{aligned} $$ 也就是說，我們可以先把 $1/ \text{NICE_k_weigth}$ 放大 $2^{32}$ 倍和 `wall_time_delta` 進行計算，也就是 $\frac{2^{32}}{\text{NICE_k_weigth}}$(`lw->inv_weigth`)，然後最後再把放大的部份補償回來即可。 $\frac{2^{32}}{\text{NICE_k_weigth}}$ 這個數字在 [`sched_prio_to_wmult`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L10866) 中被記錄起來，例如我們可以在 [`reweight_task`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L3070) 中看到其被用來指派給一個 `sched_entity` 中的 `load_weight` 中儲存起來。 ```cpp void reweight_task(struct task_struct *p, int prio) { struct sched_entity *se = &p->se; struct cfs_rq *cfs_rq = cfs_rq_of(se); struct load_weight *load = &se->load; unsigned long weight = scale_load(sched_prio_to_weight[prio]); reweight_entity(cfs_rq, se, weight); load->inv_weight = sched_prio_to_wmult[prio]; } ``` 藉此 [`calc_delta_fair`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/fair.c#L652) 可以拿到 `lw->inv_weigth` 並使用，可以看到核心的計算都被放在 `__calc_delta`，而 `calc_delta_fair` 做了一點小優化因為 nice 值為 0 的任務 vruntime = wall time。 ```cpp static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se) { if (unlikely(se->load.weight != NICE_0_LOAD)) delta = __calc_delta(delta, NICE_0_LOAD, &se->load); return delta; } ``` 下面來看 vruntime 計算的核心函數 `__calc_delta` 是怎麼實現的: ```cpp /* /include/linux/sched.h */ # define SCHED_FIXEDPOINT_SHIFT 10 # define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT) /* kernel/sched/sched.h */ #ifdef CONFIG_64BIT # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT) # define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT) # define scale_load_down(w) \ ({ \ unsigned long __w = (w); \ if (__w) \ __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \ __w; \ }) #else # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT) # define scale_load(w) (w) # define scale_load_down(w) (w) #endif /* kernel/sched/fair.c */ static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw) { u64 fact = scale_load_down(weight); u32 fact_hi = (u32)(fact >> 32); int shift = WMULT_SHIFT; int fs; __update_inv_weight(lw); if (unlikely(fact_hi)) { fs = fls(fact_hi); shift -= fs; fact >>= fs; } fact = mul_u32_u32(fact, lw->inv_weight); ... ``` 首先，我們想計算 $\text{weigth} \times \text{lw->inv_weigth}$。然而，如果 $\text{weigth}$ 超過 32 bits(在 `CONFIG_64BIT` defined 時可能發生)，而 $\text{lw->inv_weigth}$(也就是 $\frac{2^{32}}{\text{NICE_k_weigth}}$) 是 32 bits 表示的，直接相乘可能會發生超過 64 bits 的 overflow 的問題。為此，我們可以先把最後要做的 $\times \frac{1}{2^{32}}$( $>> 32$) 中的一部分 shift 先 apply 到 weight 上，也就是把 $\text{weigth} \times \frac{2^{32}}{\text{NICE_k_weigth}} \times \frac{1}{2^{32}}$ 改寫成 $\frac{\text{weigth}}{2^{fs}} \times \frac{2^{32}}{\text{NICE_k_weigth}} \times \frac{2^{fs}}{2^{32}}$。然後上面就是先計算 $\frac{\text{weigth}}{2^{fs}} \times \frac{2^{32}}{\text{NICE_k_weigth}}$ 的部份 ```cpp fact_hi = (u32)(fact >> 32); if (fact_hi) { fs = fls(fact_hi); shift -= fs; fact >>= fs; } return mul_u64_u32_shr(delta_exec, fact, shift); } ``` 最後要再乘上 $\text{wall_time_delta} \times \frac{2^{fs}}{2^{32}}$，類似於前面，直接乘上 $\text{wall_time_delta}$ 可能有 overflow，因此可以先 apply 一部分 shift 到前面運算得到的結果。 :::warning - [ ] TODO: [`CONFIG_64BIT`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L122) 的目的以及實作上的額外考量點 ::: ### CFS 的資料結構 [`sched_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/include/linux/sched.h#L529), [`task_struct`](https://elixir.bootlin.com/linux/v5.16.6/source/include/linux/sched.h#L723), [`task_group`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L391) 和 [`cfs_rq`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L532) 是 CFS 最為核心的資料結構，釐清這些資料結構之間的架構可說是弄懂 CFS 的關鍵。我們可以先透過下面的圖簡單的了解四個資料結構彼此的大致關係(紅色的箭頭表示指標變數的 reference，灰色虛線則是對應變數與其 instance 內部的詳細內容)。接著，就讓我們更詳盡的用這張圖進行說明吧! ![](https://i.imgur.com/gOgkZ2I.png) #### `sched_entity` 我們首先關注排程器管理資源分配的單元: [`sched_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/include/linux/sched.h#L529)。前面我們先探討了 CFS 如何實現其公平的目標，提及了每個任務會根據優先度對應的權重調配可以分到的 time slice。若從程式碼的實作而言，CFS 實際上調配權重以排程的單位就是 `sched_entity` 實體。 ```cpp struct sched_entity { /* For load-balancing: */ struct load_weight load; struct rb_node run_node; ... u64 vruntime; ... struct sched_entity *parent; /* rq on which this entity is (to be) queued: */ struct cfs_rq *cfs_rq; /* rq "owned" by this entity/group: */ struct cfs_rq *my_q; ... ``` 也因此，首先我們可以看到 `struct sched_entity` 中包含 `struct cfs_rq` 的指標 `cfs_rq`，指向其所在的 runqueue 中。而另一個 `struct cfs_rq` 指標 `my_q` 則是當該 se 是屬於 group entity 是，會 reference 到自己底下的 run queue。結構中都有自己的 [`struct load_weight`](https://elixir.bootlin.com/linux/v5.16.6/source/include/linux/sched.h#L393) 結構的成員，代表該排程單元在排程中的被分配權重。最後，`sched_entity` 中也包含用來選擇任務需要的 `vruntime`。如果 `sched_entity` 是隸屬於非 `root_task_group` 的某個 `task_group` 之下的話，成員的 `parent` 會指向該 `task_group` 的 `sched_entity`。對應上方的圖可以看到，這是因為系統中的 runqueue 其實是階層式結構的(對於某個 `sched_entity` 所在的 `runqueue` 所屬的 `task_group`，該 `task_group` 可能是在另一個 `task_group` 的 runqueue 之下)，而某些操作會需要對每一層級的`sched_entity` 都進行更新，因此記錄 `parent` 資訊可以有效的進行此操作。如果要仔細探討的話，CFS 的 runqueue 的資料結構其實是藉由一個紅黑樹(RB-tree) 來維護的。因此我們可以在其 `sched_entity` 下看到 `rb_node` 這個結構。後者就可以被插入在 runqueue 的紅黑樹下。 ```cpp struct rb_node { unsigned long __rb_parent_color; struct rb_node *rb_right; struct rb_node *rb_left; } __attribute__((aligned(sizeof(long)))); ``` ![](https://i.imgur.com/NBHmRwV.png) > [Inside the Linux 2.6 Completely Fair Scheduler](https://developer.ibm.com/tutorials/l-completely-fair-scheduler/) 對於 scheduler 如何使用紅黑樹進行排程，我們會在後續的章節透過程式碼仔細探討。先簡單從概念上解釋 CFS 維護任務調度的思維: 那些急需 CPU(virtual runtime 小) 的 task 會被放在紅黑樹的左邊，而不迫切的 task 則在右邊。為保持公平，Scheduler 會去挑選最左邊(leftmost)的任務，任務被切換掉後，把 virtual runtime 加上使用的 CPU 時間重新插入 RB-tree。如此一來，左側的任務就被確保有時間，並且樹的內容會從右向左遷移以保持公平，每個 task 會不斷追逐其他使用 CPU 多於自己的任務以保持平衡。 #### `task_struct` 對於 Linux 有稍微了解的人應該知道，Linux 可以透過 [`task_struct`](https://elixir.bootlin.com/linux/v5.16.6/source/include/linux/sched.h#L723) 結構來描述在核心中被建立之任務。關注與 CFS 有關的部分的話，`task_struct` 中包含的 [`sched_entity`](https://elixir.bootlin.com/linux/v5.16.6/source/include/linux/sched.h#L529) 即可用來向 CFS scheduler 註冊排程，而 `sched_task_group` 則會指向所屬的 `task_group`。 ```cpp struct task_struct { struct sched_entity se; struct task_group *sched_task_group; ``` #### `task_group` 如果 CFS 的公平僅是以 `task_struct` 表示的 task 為單位的話，實際上會造成某種程度上的不公平。試想一個情境：在某個 server 上的 50 個任務，有 1 個 A 君所擁有，另外 49 個是由一個 B 君所擁有，如果只以 task 為公平的判斷單位，這就表示 B 君可以佔用 98 % 的 CPU 資源。為此，CFS 中帶來了 group scheduling 的概念。Group scheduling 是另一種帶來公平性的方法。而由於排程的單位是基於 `sched_entity` 而非 `task_struct`，這就使得 group scheduling 得以被實作。實作上是透過定義 [`task_group`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L391) 結構描述由多個任務組成的群組。`task_group` 中也包含 `sched_entity` 實體，因此其也可以做為一個 CFS 所管理的對象，被加入到一個 runqueue 之中。並且，`task_group` 中對於自己群組中的任務也存在自己的 runqueue 中。換句話說，系統中的 runqueue 其實是階層式樹狀結構的，在前面的章節中我們已經介紹過這點。 ```cpp /* Task group related information */ struct task_group { ... /* schedulable entities of this group on each CPU */ struct sched_entity **se; /* runqueue "owned" by this group on each CPU */ struct cfs_rq **cfs_rq; ``` `task_group` 中兩個重要的成員: 其一是 `se`，其內容是一個 `sched_entity` 指標的陣列，對於該 `task_group`，每個 CPU 都具有一個 `sched_entity` 可以註冊到 runqueue 中。另一個則是 `cfs_rq`，是 `task_group` 對應每個 CPU 的 runqueue。因為引入了 group scheduling 的觀念，scheduler 在排程時，會去尋找最需要 CPU 的 `sched_entity`，如果 `sched_entity` 表現的不是一個 task，則再去 `sched_entity` 下的 runqueue 挑出下一個 entity，重複直到找到是表現一個 task 的 `sched_entity`。最後，在把 CPU 時間加回 virtual runtime 時，會把該 `sched_entity` 包含往上 parent 階層的都一併更新。因此，面對一開始提到的問題，系統的管理者只要針對 A 君和 B 君建立一個 `sched_entity` ，就可以做到針對兩個使用者的公平。實際上，所有任務都會根源於 [`root_task_group`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/core.c#L9252) 這個群組中。 #### `cfs_rq` 描述 CFS 之 runqueue 的結構是 [`cfs_rq`](https://elixir.bootlin.com/linux/v5.16.6/source/kernel/sched/sched.h#L532): ```cpp struct cfs_rq { struct load_weight load; unsigned int nr_running; u64 min_vruntime; struct rb_root_cached tasks_timeline; struct task_group *tg; /* group that "owns" this runqueue */ ``` 可以看到 `cfs_rq` 中也存在一個 `load_weight` 結構，該數值其實是 runqueue 中所有任務的權重總和，藉此可以有效的更新對應層級的 `task_group` 之 vruntime，而 `nr_running` 則記錄該 runqueue 中的 `sched_entity` 之數量。 `min_vruntime` 是 runqueue 中所有 entity 的 vruntime 之最小值，該值是用來在 enqueue `sched_entity` 的時候協助初始化其 vruntime。思考任務被移出 runqueue 時的議題: 假設有任務 A 需要處理 I/O 而被暫時移出 runqueue，其 `vruntime` 會維持同樣的值，而仍處在在 runqueue 中的任務則會持續增加。當這個任務 A 被喚醒並且放回 runqueue 時，如果 `vruntime` 是依離開時的值設定，這會導致這個任務的優先遠超其他原本在 runqueue 中的任務，這違反了 CFS "fair" 的精神。類似的情形，如果一個新任務被建立，我們不可能直接 assign 0 值給它，否則舊的任務等於有很大一段時間都不能取得 CPU 資源，同樣不 "fair"。因此，CFS scheduler 會追蹤整個 runqueue 中的最小 `min_vruntime`，當每個任務被挑選並運行時，這個最小值也同步更新。如此一來，當新的任務被建立，就可以透過這個值去初始化其 `vruntime`。對於進入 wait queue 而後被喚醒的任務，則是確保其原本的 `vruntime` 需不小於記錄的最小值，否則就重設為該最小值再放回 runqueue 中，藉此避免被喚醒的任務會長時間的霸佔 CPU 資源。接著，我們可以在 `cfs_rq` 結構中看到 [`rb_root_cached`](https://elixir.bootlin.com/linux/v5.16.6/source/include/linux/rbtree_types.h#L26) 類型的成員 `tasks_timeline`，後者包含了 RB-tree 的根節點(root, rnuqueue 直接擁有)與最左側的節點(leftmost, 指到某個 `sched_entity` 之 `rb_node`)之 reference 以更有效率的操作 RB-tree。 ```cpp struct rb_root_cached { struct rb_root rb_root; struct rb_node *rb_leftmost; }; ``` `cfs_rq` 也可以透過其下的 `tg` 來找到自己所屬的 `task_group`。 ## Reference * [Inside the Linux 2.6 Completely Fair Scheduler](https://developer.ibm.com/technologies/linux/tutorials/l-completely-fair-scheduler/) * [CFS group scheduling](https://lwn.net/Articles/240474/) * [Linux CFS and task group](https://mechpen.github.io/posts/2020-04-27-cfs-group/index.html)