Linux 排程器研究

--- tags: _核心設計 breaks: true --- # Linux 排程器研究 ## CFS 排程器先來介紹自從核心版本 2.6 之後引進的 CFS (Completely Fair Scheduler) 的概念和實作原理。根據[官方文檔介紹](https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html)和[从几个问题开始理解CFS调度器](http://linuxperf.com/?p=42)，為了實現有效率和公平的排程引入了 RBTree 的資料結構和 virtual runtime 的概念，無關乎 cpu 的 jiffy (就是 system clock 一個 tick 的長度) 或是本身運算能力，一切排程和演算法也是依據這個值在進行的。首先，每個 CFS 的 runqueue (也就是排程器的佇列) 都有設個 `sysctl_sched_latency`，這是每個 schedule entity 的等待時間，也就是這個 runqueue 的執行周期。再按各個 process 的優先權 (nice 值) 不同，去加權分配就是每個 process 的 vruntime。從 vruntime 最小的開始執行。執行完各自的時間片段之後會再加上一個 CPU virtual runtime `delta_exec`。再去尋找下一個最小的 vruntime。 ###### spoiler 文中的 *Process 和 Schedule entity* > > [color=#bee] > process 的資料型態是 [`struct task_struct`](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L632)，schedule entity 其實應該是 process 的 fields。 > > schedule entity 的資料型態是 [`struct sched_entity`](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L451)，而 `vruntime`、`cfs_rq` 和 `weight` 則是 [`struct sched_entity`](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L451) 的 fields。 > > 上面解釋原理在概念上是同一個東西，所以當作同一個詞就行。 > 順帶一提，[nice 值](https://en.wikipedia.org/wiki/Nice_(Unix))是決定 CPU priority 的值，也就是決定 CPU 執行片段長度 (vruntime) 的依據，由最高優先的 -20 到最低優先的 19，按權重計算 vruntime。 ###### *nice 值決定 vruntime 實作原理* > [color=#bee] > 有正負號的參數自然不會是直接做四則運算，而是透過指數運算。 > 當前 Linux kernel 使用的公式轉換如下 >> [color=#afa] >> weight = 1024 * 1.25^(-nice) >> > 這部份 linux kernel 是用查表的 > > [core.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/core.c) > ```c=7932 > /* > * Nice levels are multiplicative, with a gentle 10% change for every > * nice level changed. I.e. when a CPU-bound task goes from nice 0 to > * nice 1, it will get ~10% less CPU time than another CPU-bound task > * that remained on nice 0. > * > * The "10% effect" is relative and cumulative: from _any_ nice level, > * if you go up 1 level, it's -10% CPU usage, if you go down 1 level > * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25. > * If a task goes up by ~10% and another task goes down by ~10% then > * the relative distance between them is ~25%.) > */ > const int sched_prio_to_weight[40] = { > /* -20 */ 88761, 71755, 56483, 46273, 36291, > /* -15 */ 29154, 23254, 18705, 14949, 11916, > /* -10 */ 9548, 7620, 6100, 4904, 3906, > /* -5 */ 3121, 2501, 1991, 1586, 1277, > /* 0 */ 1024, 820, 655, 526, 423, > /* 5 */ 335, 272, 215, 172, 137, > /* 10 */ 110, 87, 70, 56, 45, > /* 15 */ 36, 29, 23, 18, 15, > }; > ``` > 換算過的參數再做簡單的正規化 > > weight_factor = weight ~nice=0~ / weight ~curr~ ~process~ > > 這就是用來計算 cpu 運算時間片段長度的加權比重了 > > [fair.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/fair.c) > ```c=219 > /* > * delta_exec * weight / lw.weight > * OR > * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT > * > * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case > * we're guaranteed shift stays positive because inv_weight is guaranteed to > * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22. > * > * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus > * weight/lw.weight <= 1, and therefore our shift will also be positive. > */ > static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw) > { > u64 fact = scale_load_down(weight); > int shift = WMULT_SHIFT; > > __update_inv_weight(lw); > > if (unlikely(fact >> 32)) { > while (fact >> 32) { > fact >>= 1; > shift--; > } > } > > fact = mul_u32_u32(fact, lw->inv_weight); > > while (fact >> 32) { > fact >>= 1; > shift--; > } > > return mul_u64_u32_shr(delta_exec, fact, shift); > } > ``` > > 其中 `delta_exec` 就是 CPU virtual runtime。如果 nice 值為負，加權比重就會小於 1，CPU virtual runtime 就會比較小，那 virtual runtime 成長的幅度就會比較慢，最終得到的 cpu 運算時間就會相對比較多 (差一級約莫會差 10% 的 CPU 運算時間)。 > > 參見 [Linux Scheduler – CFS and Virtual Run Time (vruntime)](https://oakbytes.wordpress.com/2012/07/03/linux-scheduler-cfs-and-virtual-run-time/) 介紹 nice 值如何影響 virtual rumtime > 與此同時，每個 runqueue 還有另外設置 `sysctl_sched_min_granularity` 和 `min_vruntime`。 - 前者一方面顧名思義就是用來確保每個 process 有保障的 CPU 使用時間，時間沒到誰都不能把他從 CPU 上換下來；另一方面用來避免當 process 數量過多導致各個 process 分配到的 vruntime 太小的時候就以 `sysctl_sched_min_granularity` 作為 vruntime，此時的 runqueue 週期就會是 ***`sysctl_sched_min_granularity` * process 數量***，而忽略原本的 `sysctl_sched_latency`。 :::spoiler 對應核心原始碼 `__sched_period()` 中用到了 `sysctl_sched_min_granularity` [fair.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/fair.c) ```c=682 /* * The idea is to set a period in which each task runs once. * * When there are too many tasks (sched_nr_latency) we have to stretch * this period because otherwise the slices get too small. * * p = (nr <= nl) ? l : l*nr/nl */ static u64 __sched_period(unsigned long nr_running) { if (unlikely(nr_running > sched_nr_latency)) return nr_running * sysctl_sched_min_granularity; else return sysctl_sched_latency; } ``` ::: - 後者則是會紀錄當前 runqueue 中最小的 vruntime，作為當前這個 runqueue 的一個參考值，用來讓新進 process 的 vruntime 的初值有個依據，也在喚醒 process 之後更新 vruntime 時有個依據。 :::spoiler 對應核心原始碼 `place_entity()` 中用到了 `vruntime` > [color=#bee] > [fair.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/fair.c) > ```c=4085 > static void > place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) > { > u64 vruntime = cfs_rq->min_vruntime; > > /* > * The 'current' period is already promised to the current tasks, > * however the extra weight of the new task will slow them down a > * little, place the new task so that it fits in the slot that > * stays open at the end. > */ > if (initial && sched_feat(START_DEBIT)) > vruntime += sched_vslice(cfs_rq, se); > > /* sleeps up to a single latency don't count. */ > if (!initial) { > unsigned long thresh = sysctl_sched_latency; > > /* > * Halve their sleep time's effect, to allow > * for a gentler effect of sleepers: > */ > if (sched_feat(GENTLE_FAIR_SLEEPERS)) > thresh >>= 1; > > vruntime -= thresh; > } > > /* ensure we never gain time by being placed backwards. */ > se->vruntime = max_vruntime(se->vruntime, vruntime); > } > ``` > - `initial` 表示新進 process，`!intial` 表示喚醒 process。 > - 會影響新進 process vruntime 的初值的來有兩個變數 `sysctl_sched_child_runs_first` 和 scheduling features 其中的 `START_DEBIT` (參見 [feature.h](https://elixir.bootlin.com/linux/latest/source/kernel/sched/features.h) 查看所有的 scheduling features) > - `sysctl_sched_child_runs_first` 這是 Linux 核心參數 kernel.sched_child_runs 設定 child process 優先於 parent process 與否 > ```c=10673 > if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) { > /* > * Upon rescheduling, sched_class::put_prev_task() will place > * 'current' within the tree based on its new key value. > */ > swap(curr->vruntime, se->vruntime); > resched_curr(rq); > } > ``` > - `sched_feat(START_DEBIT)` 則是將新進 process 的 vruntime 設的比原本參考值 (min_vruntime) 更大一點，避免 process 透過不斷的 `fork()` 取得 CPU 使用時間 > - 還有為了避免休眠了一段之後的 process vruntime 跟其他 process 差距過大 (這會導致得到的 CPU 時間會跟其他 process 差距過大)，要重新設定。減去的部份是對於休眠了一段時間的補償 (因為 vruntime 愈小愈容易得到 CPU 的時間)。 ::: ## 衍生議題接著，根據引導閱讀的兩篇文章說起: #### 第一篇 [Linux fork之后，到底是子进程先运行还是父进程先运行](https://blog.csdn.net/dog250/article/details/105756168) :zzz: 其實執行順序追根究底就是一個核心參數 kernel.sched_child_runs_first ```bash # 察看系統預設值 sysctl -a | grep kernel.sched_child_runs_first # 設成 child process 先執行 sysctl -w kernel.sched_child_runs_first=1 ``` 為了測試這個參數實際效用，這篇針對 parent process 和 child process 執行順序的測試採用的是 [Systemtap](https://sourceware.org/systemtap/)，而不是採用最常見最籠統的 printf()，理由是 printf() 的執行周期和成本不小，不一定能夠顯示正確的執行順序。而且執行過後事實證明結果好像也不如預期。這是另外寫的一篇 ==[Systemtab 的介紹](/fHpOveNOSZad5_R1YTAshA)== :::spoiler 以下是原文用來分析問題的 Systemtab script 不過實際跑過一遍結果跳了個錯誤訊息 `ERROR: Skipped too many probes, check MAXSKIPPED or try again with stap -t for more details.` >[color=#bee] > ```stap= > global g_se; > global g_cfs_rq; > > probe begin { > g_cfs_rq = 0; > g_se = 0; > } > > probe kernel.function("__schedule") > { > t_curr = task_current(); > if (task_execname(t_curr) == "a.out") > printf("[_schedule] current task: %s[%d]\n", task_execname(t_curr), task_pid(t_curr)); > } > > probe kernel.function("do_exit") > { > t_curr = task_current(); > if (task_execname(t_curr) == "a.out") > printf("Exit task: %s[%d]\n", task_execname(t_curr), task_pid(t_curr)); > } > > probe kernel.function("pick_next_task_fair") > { > g_cfs_rq = &$rq->cfs; > } > > function container_of_entity:long(se:long) > { > offset = &@cast(0, "struct task_struct")->se; > return se - offset; > } > > probe kernel.function("pick_next_task_fair").return > { > if($return != 0) { > se = &$return->se; > t_se = container_of_entity(se); > t_curr = task_current(); > if (task_execname(t_se) == "a.out" || task_execname(t_curr) == "a.out") { > printf("[pick_next_task_fair] Return task: %s[%d] From current: %s[%d]\n", task_execname(t_se), task_pid(t_se), task_execname(t_curr), task_pid(t_curr)); > } > } > } > > > probe kernel.function("wake_up_new_task") > { > g_se = &$p->se; > g_cfs_rq = @cast(g_se, "struct sched_entity")->cfs_rq; > } > > probe kernel.function("wake_up_new_task").return > { > t_se = container_of_entity(g_se); > tname = task_execname(t_se); > vruntime = @cast(g_se, "struct sched_entity")->vruntime; > if (tname == "a.out") { > curr = @cast(g_cfs_rq, "struct cfs_rq")->curr; > t_curr = container_of_entity(curr); > curr_vruntime = @cast(curr, "struct sched_entity")->vruntime; > printf("[wake_up_new_task] current:[%s][%d] curr:%d new:%d del:%d\n", > task_execname(t_curr), task_pid(t_curr), curr_vruntime, vruntime, > curr_vruntime - vruntime); > } > g_se = 0; > g_cfs_rq = 0; > } > > probe kernel.function("place_entity") > { > t_initial = $initial; > if (t_initial == 1) { > g_cfs_rq = $cfs_rq; > g_se = $se; > } > } > probe kernel.function("place_entity").return > { > if (g_se) { > t_se = container_of_entity(g_se); > tname = task_execname(t_se); > vruntime = @cast(g_se, "struct sched_entity")->vruntime; > if (tname == "a.out") { > curr = @cast(g_cfs_rq, "struct cfs_rq")->curr; > t_curr = container_of_entity(curr); > curr_vruntime = @cast(curr, "struct sched_entity")->vruntime; > printf("[place_entity] name:[%s][%d] curr:%d new:%d delta:%d\n", > task_execname(t_curr), task_pid(t_curr), curr_vruntime, vruntime, > curr_vruntime - vruntime); > } > g_se = 0; > g_cfs_rq = 0; > } > } > ``` > - 第一組追蹤的函式是進入和離開當前排程器任務的 [`__schedule()`](https://elixir.bootlin.com/linux/v5.7.1/source/kernel/sched/core.c#L3962) 和 [`do_exit()`](https://elixir.bootlin.com/linux/v5.7.1/source/kernel/exit.c#L706) 是排程器運作主要的函式，再用 [task_current()](https://elixir.bootlin.com/linux/v5.7.1/source/kernel/sched/sched.h#L1662) 輸出當前的 process > > - 第二組追蹤的函式是針對 [cgroup](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01) 去優化過的排程函式 [pick_next_task_fair()](https://elixir.bootlin.com/linux/v5.7.1/source/kernel/sched/fair.c#L6957) 的進入點和離開點，輸出替換上來的 process 和用 `container_of_entity()` 取得的 parent process > > - 最後一組是喚醒新創 process 的 [`wake_up_new_task()`](https://elixir.bootlin.com/linux/latest/source/kernel/sched/core.c#L2926) 和[重新設定 vruntime](https://stackoverflow.com/questions/8016154/linux-cfs-completely-fair-scheduler-latency) 的 [`place_entity()`](https://elixir.bootlin.com/linux/latest/source/kernel/sched/fair.c#L4086)，因為是緊接著的工作，所以感覺可能會有點多餘，忠於原著，暫時先做保留。 ::: 檢查出來的問題在於，在 child process 剛創建時，確實優先順序是高於 parent process，但是 child process 還沒準備好，也就不足已被 wakeup。原文作者試圖透過 systemtap 的 script 手動解決這個問題。 :::spoiler 原文實驗用的 script 附在下面，實際操作了一遍還是沒能成功，出現了錯誤訊息: `ERROR: read fault [man error::fault] at 0x8d8 near operator '@cast' at correction.stp:40:11` ```stap= #!/usr/bin/stap -g global g_p; probe begin { g_p = 0; } %{ static void *(*_resched_task)(struct task_struct *p); %} function resched(tsk:long, tskp:long) %{ struct task_struct *task = NULL, *parent = NULL; struct sched_entity *pse = NULL, *cse = NULL; task = (struct task_struct *)STAP_ARG_tsk; parent = (struct task_struct *)STAP_ARG_tskp; cse = &task->se; pse = &parent->se; if (_resched_task == NULL) _resched_task = (void *)kallsyms_lookup_name("resched_task"); if (_resched_task && pse->vruntime < cse->vruntime) { swap(pse->vruntime, cse->vruntime); STAP_PRINTF("---[%lu]------[%lu]-------\n", pse->vruntime, cse->vruntime); _resched_task(current); } %} probe kernel.function("check_preempt_wakeup") { g_p = $p; } // 这里的trick在于，由于我们的父子a.out都是纯CPU型的，只在创建时被wakeup一次，所以hook该点。 probe kernel.function("check_preempt_wakeup").return { parent = @cast(g_p, "struct task_struct")->parent; // 这里过滤掉了除了我们的fork场景之外的所有其它的wakeup场景。 if (task_execname(g_p) == "a.out" || task_execname(parent) == "a.out") { resched(g_p, parent); } g_p = 0; } ``` ::: 還是沒能成功將 child process 優先於 parent process 執行。由於對 systemtap 的研究還不夠透澈，暫時還沒辦法做問題排除。 :dizzy_face: ###### 為何 `sysctl -w kernel.sched_child_runs_first=1` 無效根據 Linus Torvalds 在[這份文件](https://yarchive.net/comp/linux/child-runs-first.html)中提到其實 child runs first 和 parent runs first 各有各的理由和問題，所以首先這本來就算是偏好和建議而非保證。在最後的嘗試發現 child runs first 對於 bash 來說的問題比較大，所以現階段是選擇 parent runs first。而單純的參數設定也因為上述問題而沒用。 #### 第二篇 [从一个CFS调度案例谈Linux系统卡顿的根源](https://blog.csdn.net/dog250/article/details/105710571) :zzz: 這個議題在於排程的運作好像不錯，以及處理器的使用率似乎好高，但系統操作起來卻有些卡頓。原文嘗試了幾次才成功重現並分析了問題，我照著原方法試了幾次都沒遇到，所以就不附過程了，畢竟套一句原作者 [dog250](https://blog.csdn.net/dog250) 說的「如果這麼容易重現問題 Linux 社群應該早就修掉了。對於問題的分析過程可以參見原文，這裡講講問題發生的情境和描述。起因於單純用紅黑樹和上述演算法去實作 CFS 的話倒還單純，可為了提升效能和資源的使用率，CFS 後來又引入並支援了許多 features，包括後來發現問題的 `LAST_BUDDY` 這個 feature。這是一個好比生產者消費者這種有對應關係或順序的 process 在輪流 sleep 和 wakeup 的過程中，理論上都會有個理論上是中斷點，下一次 rescheduling 的過程中再從這個中斷點繼續。但是有的時候 CPU 會因為 cache locality 的關係讓一些原本可能是 wakeup 之後的接續點 preempt，因此打亂了這些理論上的順序。