Linux 核心專題: 排程器研究

執行人: yenslife
解說錄影
 簡報內容

Reviewed By `SimonLee0316`

Rotating Staircase DeadLine Scheduler 方法為何沒有被 linux 採納？

我並沒有在相關 commit 找到線索，不過我發現了一些文章，說明麻醉科醫師 Con Kolivas 在當時沒有辦法證明 RSDL 比原生的排程器更好。但我覺得最主要的原因是 Con Kolivas 先生在 2007 年決定離開 Linux 社群 (不過在 2009 年帶著 Brain Fuck Scheduler 重磅回歸XD)，也就是 CFS 正式納入 Linux 的一年，先不論原因為何，沒了主要的開發者就更不可能被採納了。僅管如此，RSDL 還是啟發 Ingo Molnar 先生開發 CFS 很重要的推手，因為這是 Linux 排秤器討論中首次展現 "fairness" 的排程構想。

參考資料：[patch] Modular Scheduler Core and Completely Fair Scheduler [CFS], A Scheduling Story, Linux: The Completely Fair Scheduler, The staircase scheduler, Why I quit: kernel developer Con Kolivas (這個是目前我找到最詳細的，看完可以知道為何 Klivas 先生會離開 Linux，主要是對核心開發者接收 Patch 風氣感到不滿，被拒絕很多次 QQ)
yenslife

Reviewed by `ollieni`

現在的 task_struct 中沒有 counter 成員了

為甚麼沒有 counter 了，是不需要嗎?

過去 task_struct 的成員 counter 是為了計算 timeslice，一直到
$O (n)$ 都還有 counter 存在。但是到了 v2.5.2 這個 long counter 就不見了，我目前能找到最早的 git commit 已經是 1da177e4，也就是 v2.6.12-rc2。觀察 v2.5.1 的註解，我猜這是因為
$O (1)$ 計算 timeslice 的方式不再是 goodness，也就不需要 counter 了。
yenslife

任務簡介

把第一章第二章全部看完 + 有貢獻 (一定要細讀，能作出貢獻，比照新版程式碼)，這邊的看完包含紀錄問題、與老師討論。發現有誤解的地方，改進表達。第三章第四章也要看，最好是可以和老師討論。確保 6 月底之前有東西可以拿出來。需要做貢獻的時候就需要討論。盡可能把所有問題提出來。

需要針對 Linux v6.6+ 進行探討

發送 patch 的參考資料

TODO: 檢驗第一章的內容

內容描述和程式碼應當符合 Linux v6.6，若發現不相容，指出並修正

在 Linux 中最小的排程單位是 thread 而不是 process，thread 的 ID 叫做 PID (process identifier)，process 的 id 叫做 TGID (thread group identifier)。如果 process 只有一個 thread 那該 process 的 PID 等於 TGID

注意 Tuning scheduling policy 提到: "because the scheduler’s primary concern is to keep the system busy, it may not schedule threads optimally for application performance." 實際排程不會以執行緒作為單一考量來排程，例如 NUMA 就涉及到行程間對處理器節點的關聯。

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

jserv

Thread 和 Process 最大的差異在於：thread 之間共享 address space；process 則無，其他都一樣。在 Linux 中則以 "task" (task_struct) 來稱呼他們。

之所以在 task_struct 中有 volatile 是為了抑制編譯器最佳化，以確保每次都會有一個 load。

Linux 中使用 hash table 來紀錄 PID，測驗題提過的 hlist_head，所以在 kill [PID] 才會是

O (1)

。

TASK_RUNNING 表示 runnable (ready but not running) 或是正在 running

列出對應的 elixir 超連結，要標注版本和程式碼區域。

17 頁的 tast_struct 2 ~ 7 行的註解和成員程式碼與 v6.6 有出入， v6.x 中沒有 unsigned int cpu 成員，取而代之的是 on_cpu, recent_used_cpu, wake_cpu 等等；也沒有 volatile 關鍵字，而volatile long state 被 unsigned int __state 所取代。

已更新程式碼，並整合進書本 commit 1ff5d23333d02ddba80f7074dc9127ce441f0d22






struct task_struct {
    /* -1 unrunnable, 0 runnable, >0 stopped: */
    volatile long state;
    void *stack;
    /* Current CPU: */
    unsigned int cpu;      // v6.6 中不存在

commit bcf9033e54 說明 cpu 成員被移動到 thread_info 中

TODO: 檢驗第二章的內容

內容描述和程式碼應當符合 Linux v6.6，若發現不相容，指出並修正
EEVDF 會是最大的變革

2.1.1

Fairness: 同優先權的行程被分配到差不多的 CPU time，任意優先權的行程都會被分配到時間
Starvation: 一直沒有被分配到 CPU time 的行程的狀態有趣的笑話
書中 34 頁提到

An operating system’s liveness property cannot be broken in a finite execution since the “good” event might still happen at a later time.

這邊的 "good" event 指的是高優先權的行程嗎？good 是哪方面 good？

2.1.2

Linux 不把排程策略寫死，而是讓每個行程都有自己的排程策略類別。

為什麼引用這段原文要把開頭的 the introduction of Scheduling Classes: 改成中括號？

[. . . Scheduling classes is] an extensible hierarchy of scheduler mod- ules. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming too much about them.

在 symmetric multiprocessing 中，所有 processor 都平等，沒有主從之分，stop class 只會發生在這類系統；asymmetric multiprocessing 則有一個主要的 processor 來控制其他 processor

我不太懂為什麼 stop class 的排程類別會將該 CPU 的行程遷移到其他 CPU，又為什麼需要讓其他 CPU 多一個 kernel thread？

Reviewed by `YangYeh-PD`

我認為是這樣。如果今天系統負載較低，它會關閉任意 secondary CPU，以降低能源消耗。在關閉該 CPU 之前，一定會執行 Stop-task 類別的任務。如果你關閉的那顆 CPU 上面剛好在執行其他任務，那 Stop-task 就需要把上面的任務 migrate 到其他 CPU 完成，因此需要任務遷移。

原來是這樣，感謝同學解惑！yenslife

scheduling class 和 scheduling policies 的差異在於，前者用來區分優先權，後者則是排程的方法

nice 值越高 (19)，表示優先權越低；反之，nice 值越低 (-20) 則優先權越高，那為什麼他不要叫做 bad 值？我只是覺得有點不直覺。範圍之所以在 -20 ~ 19 是因為記憶體大小考量嗎？

可以參考 2.2.2 的註釋

2.2.1 Early Linux Scheduler

最早的 CPU scheduler 十分簡單，沒有考慮到多核的環境。有 timeslice，以相反方向走訪 runqueue，根據 timeslice 大小、狀態 (sleep/runnable/nil) 來決定誰會先被執行。當所有在 runqueue 中 runnable task 的 timeslice 都為 0，將任務加上一半的 timeslice + priority。從這裡開始 TASK_RUNNING 表示的就不是「正在執行」，而是「可以被執行/準備好被執行」，NR_TASK 為最大任務數量。

44 頁的程式碼與原始碼有些微的不同，不知道是不是老師有意為之？

是的，為了排版需要。

現在的 task_struct 中沒有 counter 成員了

2.2.2
$O (n)$ scheduler

用 Linked list 來紀錄任務們，每個任務有一個 goodness score，這個分數是根據像是 nice 值或是 timeslice 來決定。

O (n)

scheduler 引進了 epoch 的概念，當所有任務的 timeslice 都被用完了，則表示這次 epoch 結束了。若還有剩下的 timeslice，就把這剩下的 timeslice / 2 加到該任務的新 timeslice

流程：

Not to be confused with “niceness”, which is a measure of a task’s courtesy to other tasks in the system: a task with high niceness (i.e., running at a low priority) is more generous to the system’s users, whereas a less “nice” task demands more CPU resources. See Section 2.1.2 for more details

這一段註解解答了我對 "nice value" 的疑問，nice 指的是中文的「友善」，當一個任務不佔用 CPU，表示他越「友善」。具有高「niceness」的任務運行在低優先級，對系統的其他使用者更慷慨，因為它們不急於佔用 CPU 資源，很友善。nice 高 -> priority 低；nice 低 -> priority 高。先前我錯把 nice 理解成 priority 了。

goodness

goodness 函式引入 List API 相關程式碼，用來計算 niceness。他會先檢查傳入的任務使用什麼 policy，如果是 Non-real time 就直接把該任務的 counter 當作 weight；如果不是，則先檢查 CPU，若該任務跑在同一個 cpu 上，則讓他的優先程度增加，給這樣的任務更高的優勢。如果是在跑在同一個 address space (struct_mm)，也給他一些優勢。若該任務是 read time 的，就給他超大的權重 (1000 + real time priority)，這邊用 1000 這個數字做為分隔：-1000 表示永遠不會執行到；1000 表示 real time 的任務。

之所以叫做

O (n)

就是因為這個 list_for_each。

2.2.3
$O (1)$ scheduler

在多核的環境中，排程器最大的挑戰就是如何在 cpu affinity 和 load balance 之間取得平衡：我們不希望有 CPU 閒置，但又要考慮到任務轉換會有 cache miss 的問題。

O (1)

scheduler 使用一個 global priority queue 來記錄不同優先級的 runqueue，長度為 MAX_PRIO(140)，伴隨著一個 bitmap 來紀錄每一個 priority 的 queue 有沒有任務，如果沒有就是 0，如果有就是 1。有了這個 bitmap，排程器就可以在

O (1)

時間選擇任務，因為查找的時間不是根據任務的數量，而是 priority 的數量 (140)。

O (1)

用兩個陣列來紀錄任務與 timeslice 的關係

Active array：還有 timeslice 的任務
Expired array：用完 timeslice 的任務

typo: 50 頁的 BITMAP\_SIZE -> BITMAP_SIZE

已整合進書本 commit be5a7cd1051a9e204e41cbf3450fd1e45df76171

Priority 分成 static 以及 bonus，計算方法是利用過去的該任務「睡眠」的時間長度來決定，「睡眠」指的是 idle 或 waiting，就是沒有讓 CPU 處理的時間。Static priority 的範圍為 -20 ~ +19，其值根據 nice 值來決定；Bonus priority 的範圍為 -5 ~ +5，若該任務長時間處於「睡眠」狀態，則提高它的 priority (e.g. priotity - 5)。最後將 Static priority 與 Bonus priority 相加，即可得到 "Dynamic priority"

在 Timeslice 的環節提到

The O(1) scheduler is based on the concept of timeslice, this is the amount of time that a task runs on the CPU during a cycle. A new cycle begins with moving every task to the expired queue.

一個新的 cycle 開始時，會將所有任務放到 expired array，但為什麼不是放到 Actice array？這些任務不是被重新分配 timeslice 了嗎？

Interactive tasks and heuristics

當一個被視為 interactive 行程的 timeslice 用完後，它會被插入 active array 中，這是為了減少 response time，但要怎樣才可以被視為 interactive？好像沒有在書中細說，只有說會根據其 dynamic priority 以及 nice value。這邊舉了兩個例子：

A task with a nice value of −20 is marked as interactive even if it has a dynamic priority of +3, while a task with a nice value of 0 has to have a dynamic priority of −2. With a nice level of +19 it is impossible for a task to be marked as interactive, independently of the dynamic priority.

為什麼 nice value 可以和 dynamic priority 一起討論？dynamic priority 不是等於 nice value + static priority 嗎？我可以理解為

O (1)

排程器最大的弱點就是因為判斷一個任務為 interactive 的方式太過簡單嗎？

Nice value 與 jiffy

由於 nice value 為 19 的任務太佔用 CPU (低優先權任務不該佔用太長的 CPU 時間，nice value 19 為最低優全任務)，於是 O(1) 延伸出了 jiffy 的概念，jiffy 為 timeslice 的最小時間單位，根據 tick rate 而定，也就是系統常數 HZ (tick 次數 / 秒) 。後面舉了一個例子當作補充說明，兩個優先權低的任務在系統中會被分配到 50% 的時間，但還是依據 HZ 來給任務時間，所以會發生「低優先權任務頻繁排程」的問題。

2.2.4 Rotating Staircase DeadLine Scheduler

概念由澳洲麻醉科醫師 Con Kolivas 提出，但最後沒有被用在 Linux 中。和

O (1)

一樣都用到 Multi-level feedback queue 的概念，把 level 形容成「階梯」。為每個任務都給予 rank，rank 是根據該任務的優先權而定，當該任務的 timeslice 被用完時，就下降它的 rank，讓該任務到下一階層的佇列；在 sleep 狀態下的任務會被提升 rank。當任務到達最底層的階梯時，將任務的 rank 提升到他原有最大的 rank 的下一層，並多給他兩個 timeslice，過程看起來就像在下樓梯。RSDS 確保每個任務都可以被執行到，首次引入 "fairness" 的概念，那些在睡覺的任務通常是 IO bound，在原先

O (1)

的情況下可能會永遠不被 CPU 選到；在 RSDS 中，它們一醒來可能就變成高優先權了。

2.3 CFS

理想中的 fair scheduling 是讓任務分配到

1 / n

個 CPU time，並平行處理，n 為 runnable 的任務數量。現實中的處理器不可能這樣，因為 context switch 成本太高，需要掃過每一個任務，時間複雜度為

O (N)

。

於是，CFS 引入 virtual runtime (vruntime) 的概念，為達到 fairness 的目標，CFS 希望所有任務的 vruntime 盡可能相同，所以高優先權的任務會有較少的 vruntime；反之，低優先權任務擁有越少的 vruntime。

In time-sharing systems, guaranteeing CPU partitions is difficult, if not impossible.

為什麼還要講 if not impossible，難道有 impossible 的時候嗎？

CFS 會讓 virtual runtime 較低的任務被執行，不同 nice 值的任務，其消耗 virtual runtime 的速率也會不同。

2.3.2 Weight function

之所以 weight function 的公式會有一個 1.25，是為了達到 nice value 差 1 接近差 10% 的效果

w (n) = \frac{1024}{(1.25)^{n}} = \frac{2^{10}}{(\frac{5}{4})^{n}}

舉例：當有兩個 nice value 相同的任務，他們的 CPU time 個別會是 50%；但如果他們的 nice value 相差 1，那就會大約是 45% 和 55%。

2.3.3 Assigned time and virtual runtime

target latency：所有 runnable 任務至少能夠在處理器上被執行一次所需的最短時間，用來確保任務的 fairness

minimum granularity：分配給任務的最小時間量，當發生 context switch 時，可以讓每個任務在被搶佔之前有一段時間可以被執行。在桌面環境會希望這個值小一點來增加 interactive，在伺服器環境則會希望這個值大一點，讓伺服器有更好的負載。

virtual runtime 又稱

v r u n t i m e

，演算法如下

v r u n t i m e = d e l t a_e x e c \frac{w e i g h t_o f_n i c e_0}{t a s k_w e i g h t}

其中 delta_exec 是實際的 runtime，weight_of_nice_0 是 nice 值為 0 的權重大小，這個公式後面的分數就像是給 runtime 加上一個速度，一個相對 nice 值為 0 任務的執行速度。

Fixed point arithmetic

比起浮點數運算，Linux 更偏好定點數運算，因為前者使用的是 FPU，後者則可以完全利用 ALU 的能力，速度快了 17%。

一個 32 bits 的定點數的格式為前 22 bits 表示整數，後 10 bits 表示分數。
fixed-point shift 表示小數點會在 bits 組成的哪個位置，又叫 scaling factor。常見的 scaling factor 為 1024，表示該定點數的組成為 10 bits 是分數部分，前 22 bits 為整數部分。sched_prio_to_weight 中的數值也是定點數，舉例來說 nice 值為 0 的值就是

1024 = 2^{10} = 1_{i n t e g e r}

；sched_prio_to_wmult 則是預先算好的倒數值。

為什麼浮點數 v 轉換成定點數 f 是： f = (int) (v * (1 << fixedpoint_shift));

上式子表示浮點數代表的值，不是浮點數本身

2.3.4 Runqueue

當一個任務從 runqueue 中移除 (sleep)，他的 vruntime 會維持在被移除之前相同，但當他被移回到 runqueue，因為其他在 runqueue 中的任務也會增加 vruntime，所以該任務的優先權會變得異常高，這就導致 unfairness。

解決方法是，紀錄目前 runqueue 中最小的 vruntime，每當新任務進到 runqueue 之前，先檢查其 vruntime 是否小於目前最小的 vruntime，茹果是，則更新其 vruntime 為目前最小的 vruntime。

2.4 Multiprocessing

Theoretical performance gain

以下是平行加速公式，其中 F 為不能平行化的任務比率，N 為處理器數量。可以處理器的數量提供的加速效果取決於任務類型。

s p e e d u p = \frac{1}{F + \frac{1 - F}{N}}

2.4.1 Load

Per-Entity Load Tracking

我們不能直接用任務的權重 (weight) 來當作計算 load 的依據，還需要考慮到任務的 CPU 使用率 (CPU-bound or I/O-bound)。

PELT 用來追蹤任務的 CPU 使用率，它將任務從現在到過去的時間歷程劃分成固定的區塊 (time slot)，每個區塊給一個正或負的值，如果該任務有使用 CPU 則該值為正。最後，加總這些值。由於使用靠近現在這個時間點的資訊才有意義，所以它會在每次計算完時加入一個 decay factor。以下是在時間點 t 的 load 計算公式

L o a d_{t} = u_{t} + u_{t - 1} \times y^{1} + u_{t - 2} \times y^{2} + . . .

為什麼要叫他 "cared record" 不要直接叫 "recorded value" 就好？要 care 什麼？

已改進表達，並整合進 commit 9bbbffe24e737dc102ca23fba5d5716b6778ee25

2.4.2 Load balancing

Linux 週期性的檢查 CPU 是否附載均衡。
將任務從一個 CPU core 移動到另一個，這個行為稱為 "task migration"，由 load balancer 去執行。migrating 是排程器的「一部分」，但它不是排程。

migration 有兩種，一種是 pull migration，另一種是 push migration。前者是當該 core 處為 idle 狀態，它會想要去「拉」其他人的任務來做；後者則是 core 的 runqueue 還有任務，那它就會想把一些任務「推」給其他 core 執行。

scheduling domains

The system’s structure is described by two entities: scheduling domains and CPU groups. A scheduling domain is a set of CPUs, and a CPU group is a partition of the CPUs within a domain. Each CPU in a domain must be a member of a group within that domain. Typically, a CPU cannot be a part of multiple groups within the same scheduling domain. However, in intricate NUMA sytems, a CPU might be a member of multiple groups within a scheduling domain.

scheduling domains 是一組 CPU
CPU Groups 則是在 scheduling domains 中的 CPU 的劃分 (partition)

我覺得這邊的 partition 改用 subset 會更好懂

CFS Load Balancing

typo: p78 頁 scheding -> scheduling
已整合進 commit be5a7cd1051a9e204e41cbf3450fd1e45df76171

Each scheduling domain has a circular list of scheduling groups, as illustrated by the arrows in Figure 2.20 (excluding the circular part).

上面提到 CPU groups 了建議就不要用 scheduling groups 比較好。

圖示的 core/sd 為什麼不要用 core 就好，還要加 /sd，是 per scheduler 嗎？

sched groups -> CPU groups (還是其實指的是 scheduling domain?)

2.4.3 CPU Isolation

這一章我只對 "jitter" 這個詞感到疑惑，第一次看到這樣用，我將它理解為「系統內部的延遲或計算負載引起的不穩定性。」–> 老師說其實就是「最大延遲減去最低延遲」

slight uncontrolled movement or shaking, for example in electronic equipment

2.4.4 Unintended Side Effects

在 CFS 中，每個 core 都有自已的 runqueue，當有 core idle 時，load balancer 就會緊急遷移任務給它，但如果僅僅是遷移任務這樣的解法太過簡單，可能會忽略 cache locality 以及 NUMA 議題，導致更多效能喪失。

常見的疑慮有四種

The Group Imbalance Bug: 當一 scheduling group 的負載高於「平均值」，那其他 cores 可能會把它底下的任務搶來做。這是一種依賴平均值來判斷 load balance 的時機導致的缺失，比較好的做法是使用最小負載。
Scheduling Group Construction: taskset 命令可以將特定的任務固定在特定的 CPU 來執行，這可以避免 load balancer 從特定的 CPU 搶任務去達到 load balance。問題來了，所有的任務都是從
$C P U_{0}$ 的角度去建構的，也就是說組內的距離都是相對
$C P U_{0}$ 去計算的。如果 load balance 任務是在
$C P U_{31}$ 上被執行，那它會可能不會從相鄰的 CPU 中去搶任務，因為它認為這些 CPU 太遠了。這種錯誤的距離判斷會導致
$C P U_{31}$ 避免從這些「相鄰但被認為太遠」的 CPU 中搶工作。
(不知道我的理解有沒有對，如果有的話，我覺得這段的文字說明要再更清楚才行，像是強調 taskset 指定的任務，其距離是從
$C P U_{0}$ 開始算的)
Overload on Wakeup: 過度依賴 Processor Affinity 去判斷導致的缺失。Processor/CPU Affinity 為了避免 cache miss，會希望在同一個 core/group 的任務盡可能不要被遷移。這邊的例子是當一個在 Group 1 的任務從 sleep 變成 runnable，因為 Affinity 的特性，排程器可能會把它交給 Group 1 的其中一個 core 來處理，但這樣可能錯失了讓其他更空閒的 runqueue 做事的機會。
Missing Scheduling Domains: 如同其名稱，在分配任務時將所有的任務都集中在一個 core 而忽略了該 group 的其他 cores。

參閱論文 The Linux Scheduler: a Decade of Wasted Cores

2.5 Energy-AwareScheduling(EAS)

排程器的耗電量也是一個重要的議題，尤其是在行動裝置上的 Linux。

In general, when picking a CPU for a task to run on, the EAS policy is to find a CPU with enough capacity for the task that has the smallest energy impact.

EAS 系統中有兩個子系統，CPUIdle 和 CPUFreq。前者通過在 CPU 沒有任務運行時選擇 low-power mode 或 Idle mode 來最小化能功耗。但是能源節省越多，喚醒的時間也就越長，如何讓 CPU 進入 idle mode 變成一個重要的議題；CPU Frequency 指的是 CPU 的運行頻率，越高表示處理速度越快。CPUFreq 能夠縮放 CPU 頻率。

Arm bigLITTLE architecture

big.LITTLE 這個名字對應到兩個 CPY type：big 表示運算能力更強但是功耗更高的「大核」；LITTLE 表示運算能力不強但更省電的「小核」。這樣的模式切換在行動裝置上成效顯著，在重視效能的任務使用 big 模式；反之則選擇 LITTLE 模式。

Dynamic Voltage and Frequency Scaling (DVFS)

在 Linux 中，cpufreq 這個界面用來支援與 DVFS 相關操作，對 CPU 進行頻率縮放。在 Arm 架構也有支援，包含前面提到的 bit.LITTLE 模式，與之搭配的省電效果更好。

Scheduling in Asymmetric Multi-Cores

為了在好效能與低功耗之間取得平衡，排程器需要考量以下三點

選擇 big/LITTLE 種類的核心
在核心之以及核心種類之間切換是否符合效益
如何調配核心的速度和功耗

現有的 CFS 是根據吞吐量策略運作的，但這可能並不是最理想的方法。EAS 就是為了解決這一問題而誕生的。

2.6 Miscellaneous CPU Schedulers

miscellaneous: consisting of a mixture of various things that are not usually connected with each other

EEVDF

為了解決 CFS 無法處理 latency 的問題，Linux v6.x 版本引入 EEVDF (Earliest Eligible Virtual Deadline First) 排程機制。與其說它取代 CFS，不如說它是 CFS 的擴充，因為有大量共用的程式碼。

TODO: 紀錄第三章的疑惑

3.1 Structs and their role

這一章主要是對 task_struct 結構體成員的介紹

`thread_info`

包含許多 flag，追蹤任務發出或是發送給任務的信號。

之所以必須定義在 tast_struct 的第一個成員是為了讓 current_thread_info() 巨集透過轉型的方式來實作

#define current_thread_info() ((struct thread_info *)current)

thread_info 的實作根據機器架構而有所不同

`state` (目前已修改為`__state`)

state 為型態 long，為 -1 時表示 unrunnable，0 表示 runnable，> 0 則為 stopped。state 成員在 Linux 5.x 版本還存在，但到了 v6.x 有了新的變革，有關 state 的成員被替換成 unsigned int __state; 以及 unsigned int saved_state;，其數值所代表的意義也不再只有三種。電子書應更新這一段程式碼。
可從註解窺探各別的功能，__state 與之前接近；saved_state 為 "spinlock sleepers" 而生。










/*
 * Task state bitmask. NOTE! These bits are also
 * encoded in fs/proc/array.c: get_task_state().
 *
 * We have two separate sets of flags: task->__state
 * is about runnability, while task->exit_state are
 * about the task exiting. Confusing, but this way
 * modifying one set can't modify the other one by
 * mistake.
 */


/* saved state for "spinlock sleepers" */
unsigned int			saved_state;

commit 2f064a59a11ff

Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

在講座〈並行程式設計: Atomics 操作〉中提到，volatile 有許多誤用，因為它僅保證編譯器不進行最佳化，不能保證 CPU 不會遭遇重排，若讀取操作被提前，即使抑制最佳化，也可能讀出意料之外的資料。這邊改成 READ_ONCE/WRITE_ONCE 來確保操作的 atomic 操作的正確性。

有關 state 的描述我參考了這篇文章，以及書本第一章。

已整合進書本 commit c22f920249fc6861a6f289cc3e0cc2e299815ee1

但好像沒有特別說明 /* Used in tsk->__state again: */ 的部分，這些 state 會用在哪裡呢？

`sched_entity`

對排程器而言，不論是行程 (process)、執行緒 (thread) 還是一整組的行程，並沒有區別。他們都是由 sched_entity 結構體表示排程實體。因此，排程器只與排程實體打交道，而不是直接與任務（即 task_struct）打交道。也就是說，runqueue 包含的是排程實體而不是任務。一個排程實體可能會被組織成階層結構，目的是提供分組排程的選項，適用在多人不同任務共享 CPU 的情境。

假設目前有兩個使用者同時在一個系統中，且共有 20 個任務存在，其中一個任務屬於 User1，剩下 19 個任務屬於 User2。CFS 若是只根據任務的 fairness 來排程，那每個執行緒能得到

\frac{1}{20} = 5 %

的 CPU 時間；若有 sched_entity 有設定分組選項，那 CFS 會先將 CPU 時間「公平分配」給兩個使用者，再根據其階層結構分配任務的 CPU 時間。在這個例子中，User1 的唯一一個任務會被分配到

50]

的時間；User2 的單一任務則會被分配到

50 % \div (20 - 1) = 2.6 %

CPU 時間。

第二章提到 CFS 的 vruntime 等相關數值就是定義在 sched_entity 結構體中的成員包含

load_weight: 第二章提到 weight_function 計算的結果
struct rb_node run_node: 紅黑數中的節點
on_rq: 表示該實體是否在 runqueue 上
exec_start: 該任務在 CPU 上開始執行的時間
sum_exec_runtime: 任務實際上運行的原始時間
vruntime: 搭配權重計算後的虛擬時間
struct sched_entity *parent: 階層結構的親代節點
struct cfs_rq *cfs_rq: 該排程實體所屬的 runqueue
struct cfs_rq *my_q: 若該實體為一個組，則 my_q 為該排程實體底下子代所屬的 sub-runqueue

runqueue 的資料結構為紅黑數

`sched_class`

定義在 kernel/sched/sched.h，如同其名，定義排程類別的內容。

v6.x 後的 yield_to_task 函式指標成員的參數 bool preempt 和 pick_next_task 函式指標的參數 struct task_struct *prev, struct rq_flags *rf 都已經不存在，其相關註解也被刪掉了。

參見 commit 0900acf2d8273 可以得知，原先的 preempt 參數在 yield_to_task() 中變得冗餘，因為相關的排程程式碼已經移動到 yield_to() 中。

至於註解被刪除以及 pick_next_task 的相關更動可以在 commit 67692435c 與 commit 98c2f700e 看到線索，用不到後面兩個參數了。

已整合至書籍 commit 5fd54f9375fc0c87646fb15c4362527b21c6d02c

排程器會根據排程類別的優先權順序來走訪，若該排程類別沒有任務，則往下一階的類別，這個過程會不斷重複直到排程找到可以排程的任務。為了保證排程類別的順序，他們被放置在自身的 data section，原本類別定義的成員中有 .next，但現在這些順序是通過連結器腳本 (linker script: vmlinux.lds.h) 定義的，所以不再需要它們了，可以參見 commit a87e749，有關連結器腳本的閱讀方式可以對照閱讀這篇文章。由於排程類別的順序已經被保證了，所以撰寫走訪程式碼時只要比較指標即可，而不需要比較目前排程類別與其他排程類別的相等性。有關走訪程式碼的更動請參見 commit c3a340f7e7

 #define for_class_range(class, _from, _to) \
-       for (class = (_from); class != (_to); class = class->next)
+       for (class = (_from); class != (_to); class--)

 #define for_each_class(class) \
-       for_class_range(class, sched_class_highest, NULL)
+       for_class_range(class, sched_class_highest, sched_class_lowest)

有關書本 fair class 的程式碼已經過時，不直接使用 __section 巨集，而是利用 DEFINE_SCHED_CLASS 這個巨集來定義排程類別 (v5.11+)。可以在書籍的註釋中看到簡單的說明。由於所有的排程類別都需要透過 DEFINE_SCHED_CLASS 來定義，所以只要搜尋關鍵字就可以找到其他排程類別的定義。

定義可以在 kernel/sched/sched.h 找到

/*
 * Helper to define a sched_class instance; each one is placed in a separate
 * section which is ordered by the linker script:
 *
 *   include/asm-generic/vmlinux.lds.h
 *
 * *CAREFUL* they are laid out in *REVERSE* order!!!
 *
 * Also enforce alignment on the instance, not the type, to guarantee layout.
 */
#define DEFINE_SCHED_CLASS(name) \
const struct sched_class name##_sched_class \
	__aligned(__alignof__(struct sched_class)) \
	__section("__" #name "_sched_class")

/* Defined in include/asm-generic/vmlinux.lds.h */
extern struct sched_class __sched_class_highest[];
extern struct sched_class __sched_class_lowest[];

#define for_class_range(class, _from, _to) \   
	for (class = (_from); class < (_to); class++)

#define for_each_class(class) \
	for_class_range(class, __sched_class_highest, __sched_class_lowest)

參見 commit 43c31ac0e6 得知原先的記憶體對齊是根據型態 (sched_class) 而非物件實例去對齊，類別對齊是一個最小值，其實編譯器可以為類型實例 (即變數本身) 選擇更大的對齊。並且移除了 #include <asm-generic/vmlinux.lds.h>。原先的對齊方式是利用 vmlinux.lds.h 中的 STRUCT_ALIGNMENT，所以是固定的大小 (32 bytes)。

@@ -1836,7 +1835,7 @@ struct sched_class {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_change_group)(struct task_struct *p, int type);
 #endif
-} __aligned(STRUCT_ALIGNMENT); /* STRUCT_ALIGN(), vmlinux.lds.h */
+};

// include/asm-generic/vmlinux.lds.h
#define STRUCT_ALIGNMENT 32
#define STRUCT_ALIGN() . = ALIGN(STRUCT_ALIGNMENT)

註解還特別強調

*CAREFUL* they are laid out in REVERSE order!!!

參見 commit 546a3fee1，可以知道這個改動是為了避免使用負偏移量，因為 "GCC-12 is fully stupid about array bounds"，至於是多 stupid 有待研究。Peter Zijlstra 也將原先表示指標開始和結束的命名從 __begin_sched_classes 和 __end_sched_classes 修改成 __sched_class_highest 和 __sched_class_lowest，更符合情境而不是單純頭尾交換位置。

排程類別在連結腳本 vmlinux.lds.h 中的順序如下，. 在連結器腳本中表示 "current location counter"。

/*
 * The order of the sched class addresses are important, as they are
 * used to determine the order of the priority of each sched class in
 * relation to each other.
 */
#define SCHED_DATA                              \
        STRUCT_ALIGN();                         \
        __sched_class_highest = .;              \   // 範圍的開始
        *(__stop_sched_class)                   \
        *(__dl_sched_class)                     \
        *(__rt_sched_class)                     \
        *(__fair_sched_class)                   \
        *(__idle_sched_class)                   \   
        __sched_class_lowest = .;                   // 範圍的結束

已新增說明，整合進 commit 670559f9c39a8bbcb02d14d71df69c115f9e20e8

有關 sched_class 的成員代表的意義如下

enqueue_task(): 當有任務的狀態變成 runnable 時會被呼叫，將 nr_running 這個變數的值加一，表示 sched_entities 在 runqeue 的數量多一個了，並將這個任務 (entity) 插入至紅黑數中;
dequeue_task(): 與 enqueue_task() 做的事情相反，將任務狀態不再為 runnable 的任務移出紅黑數，並將 nr_running 的數值減一;
yield_task(): 先 dequeue 再 enqueue，但如果開啟了 compat_yield 設定，它會將排程實體放到紅黑數的最右邊 (最後面);
check_preempt_curr(): 檢查狀為態 runnable 的任務是否需要被搶佔;
pick_next_task(): 選擇最適合當下一個被執行的任務;
set_curr_task(): 當任務的排程類別或是所屬的組別有更動時被呼叫;
task_tick(): 由 time tick 函式呼叫，可能會引發行程切換以及搶佔;

runqueue

怎麼沒有人回答這個註解的問題呢？

核心程式碼中只有三處用到 rq_cpu_time，如果可以確保 rq_cpu_tim = cfs_rq.exec_clock + rq->rt_rq.rt_runtime，是否可以省略這個成員？

TODO: 紀錄第四章的疑惑

4.1 Introduction

The Completely Fair Scheduler (CFS), as discussed in Chapter 2, operates on individual tasks and aims to distribute CPU time fairly among all tasks. How- ever, it does not consider whether two threads belong to the same process or user.

開頭提及的例子其實在第三章講 sched_entity 時有提到過了，後續的例子感覺有一點點點點冗餘