Linux Kernel Project: CPU Scheduler

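# Linux Kernel Project: Considerations and Implementation of Group Scheduling
> Contributor: yy214123

[Project walkthrough video](https://youtu.be/CC6P5zdBkyU)

### Reviewed by `EricccTaiwan`
```shell
$ # echo cfs_quota_us cfs_period_us > ./subgroup_name/cpu.max
$ echo "50000 100000" > /sys/fs/cgroup/child1/cpu.max
$ echo "50000 100000" > /sys/fs/cgroup/child2/cpu.max
```
Regarding the commands that set `cpu.max` for the child groups, I would like to know how the Linux kernel handles the following cases, along with the corresponding code:
1. When `cfs_quota_us` is greater than `cfs_period_us`, how does the kernel react?
2. With two child `cgroup`s where $\text{quota}_1 + \text{quota}_2$ exceeds `cfs_period_us`, how does the kernel distribute CPU time?
3. If the two child `cgroup`s above are each given a different `cfs_period_us`, how does the kernel actually behave?

> When `cfs_quota_us` is greater than `cfs_period_us`, how does the kernel react?
> > The experiment for this question is designed as follows:
> > 1. Keep two child groups and set each `cpu.max` to "200000 100000", so that `cfs_quota_us` is greater than `cfs_period_us`.
> > 2. Run two CPU-bound tasks, both with CPU affinity set to processor 1.
> >
> > Observing the run with KernelShark shows:
> > ![image](https://hackmd.io/_uploads/SJgKA3-Sxe.png)
> > The tasks switch roughly every 0.003 s. Here the `cpu.max` limit has no visible effect: a quota of 200000 µs per 100000 µs period allows up to two CPUs' worth of runtime, which tasks pinned to a single CPU can never exhaust. How long each task runs is then determined by `base_slice_ns`.
> >
> > `base_slice_ns` is the `sched_min_granularity_ns` mentioned in the book. Its history can be traced through commit [8a99b68](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8a99b6833c884fa0e7919030d93fecedc69fc625) and commit [87ff27c](https://github.com/redhat-performance/tuned/commit/87ff27c8ca9f4f985cabd094dacb3ceee446cfca).
> >
> > - First the location changed: starting with 5.13 it moved to `/sys/kernel/debug/sched`, and at the same time `sched_min_granularity_ns` was renamed to `min_granularity_ns`.
> > - Starting with 6.6 the parameter was renamed once more (`base_slice_ns`), the name still in use today.
> >
> > The following command shows `base_slice_ns` on my machine:
> > ```shell
> > $ cat /sys/kernel/debug/sched/base_slice_ns
> > 3000000
> > ```
> > This is exactly 0.003 s. Changing it and rerunning the experiment confirms that it is what drives the scheduling result:
> > ```shell
> > $ echo 5000000 > /sys/kernel/debug/sched/base_slice_ns
> > ```
> > Observing the run with KernelShark again shows:
> > ![image](https://hackmd.io/_uploads/r10CMaWHxg.png)
> > The tasks now switch every 0.005 s.
___
> With two child `cgroup`s where $\text{quota}_1 + \text{quota}_2$ exceeds `cfs_period_us`, how does the kernel distribute CPU time?
> > The experiment is designed as follows:
> > ```shell
> > $ # echo cfs_quota_us cfs_period_us > ./subgroup_name/cpu.max
> > $ echo "90000 100000" > /sys/fs/cgroup/child1/cpu.max
> > $ echo "30000 100000" > /sys/fs/cgroup/child2/cpu.max
> > ```
> > ![image](https://hackmd.io/_uploads/BkcUpXBBex.png)
> > With this setting, the tasks of the two groups alternate during the first part of the period, and each green block is about 0.004 s. Once the child2 task hits its `cfs_quota_us` limit, the child1 task runs continuously for 0.03 s (the orange part in the figure below).
> > ![image](https://hackmd.io/_uploads/rJOPa7HSee.png)
> > So the larger `cfs_quota_us` of child1 does not change the overall pattern. child1 simply runs as much as it can until the next `cfs_period_us` arrives, at which point the tasks of the two groups go back to alternating.
___
> If the two child `cgroup`s above are each given a different `cfs_period_us`, how does the kernel actually behave?
> > The experiment is designed as follows:
> > ```shell
> > $ # echo cfs_quota_us cfs_period_us > ./subgroup_name/cpu.max
> > $ echo "50000 200000" > /sys/fs/cgroup/child1/cpu.max
> > $ echo "50000 100000" > /sys/fs/cgroup/child2/cpu.max
> > ```
> > This gives each group a different `cfs_period_us`.
> > ![image](https://hackmd.io/_uploads/BkaCffHHgx.png)
> > Each task runs for roughly `base_slice_ns` (0.003 s) before switching, and the effect of `cfs_period_us` is still present.
> >
> > After raising `base_slice_ns` to 0.005 s:
> > ![image](https://hackmd.io/_uploads/r1pbIfBBlx.png)
> > Each task indeed runs for at least `base_slice_ns` (0.005 s) before switching. Take cpu-bound-29000, which belongs to child1: in the interval between Marker A and Marker B, which is roughly its `cfs_period_us`, it ran for a total of $0.005 \times 11 = 0.055$ s and rested for $0.005 \times 11 = 0.055$ s, plus the longer rest covered by the orange mask (about 0.08 s).
> >
> > The conclusion is that under this setting a task runs for `base_slice_ns` each time it is scheduled, while each group's `cpu.max` determines its overall running time and resting time.

### Reviewed by `salmoniscute`
Your literature reading notes currently have no access permission enabled!
> The permissions have been adjusted, but a lot of content is still missing because I have not finished reading these two papers.
> > - [CPU bandwidth control for CFS](https://hackmd.io/NJXuSFHDRX65opFJ06kBuQ)
> > - [Mitigating Unnecessary Throttling in Linux CFS Bandwidth Control](https://hackmd.io/Ls-IpJj3Qo2ZkcNmaMJNBQ)
> >
> > [name=yy214123]

### Reviewed by `charliechiou`
The two cgroup versions differ in their hierarchy. Could you elaborate on the differences between them?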
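> Suppose there are two tasks and we want to apply bandwidth control to CPU and memory so that the first task gets only 30% of the resources and the other gets 70%:
> ![image](https://hackmd.io/_uploads/SJKhqMZSxe.png)
> The figure above shows how the hierarchies of the two versions differ.
>
> **cgroup v1**
> In v1, limiting CPU and memory at the same time requires mounting two parallel trees, each mounted with its own controller:
> - /sys/fs/cgroup/cpu/: CPU control
> - /sys/fs/cgroup/memory/: memory control
>
> Under both trees we create a directory for the 30% quota and one for the 70% quota. If the orange task should be limited to 30% for both CPU and memory, its PID has to be added to both:
> - /sys/fs/cgroup/cpu/30%
> - /sys/fs/cgroup/memory/30%
>
> This is the inconvenience of v1.
>
> **cgroup v2**
> This is more intuitive: any subdirectory under the root group directory is a child group, and each child group exposes multiple controllers that can be tuned.
> [name=yy214123]

### Reviewed by `wurrrrrrrrrr`
base_slice_ns defines the lower bound of the timeslice. Can this value be customized?
> Yes, this value can be adjusted. In my reply to question 1 from `EricccTaiwan` above, the experiment I ran happens to change the value of `base_slice_ns`.
> [name=yy214123]

### Reviewed by `Andrushika`
The KernelShark visualizations show a pattern: when a group contains several processes with the same vruntime, the process with the lower PID always seems to run first. Is this caused by the scheduling policy?
> This comes from how [__enqueue_entity](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/fair.c#L850) and [rb_add_augmented_cached](https://elixir.bootlin.com/linux/v6.15/source/include/linux/rbtree_augmented.h#L64) are implemented in the kernel. When all vruntime values are equal, insertion into the red-black tree behaves like FIFO order, so the earliest inserted scheduling entity remains the leftmost node of the tree unless a later entity has a smaller vruntime.
>
> The mktasks tool I use creates tasks sequentially, so the task with the smallest PID is the earliest scheduling entity inserted into the red-black tree, which is why you observed this behavior.

### Reviewed by `dingsen-Greenhorn`
In the list of merged pull requests "＃252, ＃254, ＃255, ＃256, ＃257, ＃258, ＃260, ＃261, ＃262, ＃263, ＃264, ＃270, ＃272, #287", "＃" is a full-width character. I suggest using the half-width "#" throughout, for example #252, which better fits the style of technical documents.
> Fixed.
> [name=yy214123]

### Reviewed by `Ian-Yen`
How does base_slice_ns interact with quota and period? I do not quite follow.
> quota and period give a global view, while `base_slice_ns` is the lower bound between context switches, that is, the minimum time the task currently on the processor gets to run.

## Task summary
Linux provides several scheduling classes, and this final project focuses on the fair one, the Completely Fair Scheduler (CFS). With the spread of cloud servers, scheduling purely from the perspective of individual tasks is no longer adequate. In today's cloud environments a single server may serve many users at once. These users typically pay the provider the same fee, yet the number of tasks each one runs can differ enormously. Allocating scarce hardware resources purely by task count is severely unfair to users with few tasks. To solve this, Linux introduced group scheduling, which allocates resources per user or per task group. This project examines the core mechanisms of group scheduling and uses experiments to study the relevant kernel code and the actual behavior and performance of the cgroup controller.

## TODO: Revise the e-book
> Study *Demystifying the Linux CPU Scheduler*, record issues, and submit fixes
> Revisions target wording and descriptions. List the pull requests submitted and merged

Merged pull requests: [#252](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/252), [#254](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/254), [#255](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/255), [#256](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/256), [#257](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/257), [#258](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/258), [#260](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/260), [#261](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/261), [#262](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/262), [#263](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/263), [#264](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/264), [#270](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/270), [#272](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/272), [#287](https://github.com/sysprog21/linux-kernel-scheduler-internals/pull/287#issuecomment-2994404999)

PR #287 is a special case. The experiments in chapter 7 of the book depend on external submodules, and the code fetched by `$ git submodule update --init --recursive` is an old version with a use-after-free problem. The fix differs from the usual workflow, so it is recorded here.

Create a new branch and move into the corresponding directory:
```shell
$ git checkout -b fix-jsonc-update
$ cd external/json-c
```
Set the remote URL of that project and fetch its updates:
```shell
$ git remote set-url origin https://github.com/json-c/json-c
$ git fetch --tags origin
```
Check out the desired version:
```shell
$ git checkout json-c-0.18-20240915
```
> The latest version was not chosen because it does not support building with cmake. My goal was only to avoid build errors, so I compromised and picked a suitable version.

Finally, go back to the superproject, stage the change, and commit:
```shell
$ cd ../..
$ git add -f external/json-c .gitmodules
$ git commit -a
$ git push -u origin fix-jsonc-update
```
### Issue log
#### The flag that enables group scheduling
Page 143 of the book states:
> Linux allows group scheduling with CFS via namespaces.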
This feature is active only if the kernel is configured with the **CONFIG_FAIR_GROUP_SCHED** flag.

However, [CFS Group Scheduling](https://blogs.oracle.com/linux/post/cfs-group-scheduling) states:
> CFS group scheduling was introduced in Linux kernel version 2.6.24 and needs 2 kernel config options, **`CONFIG_CGROUP_SCHED`** and **`CONFIG_FAIR_GROUP_SCHED`** to get enabled.

:::info
Enabling `CONFIG_FAIR_GROUP_SCHED` is related to bandwidth. I am currently not sure whether group scheduling and bandwidth are tied together on current systems.
> 5/28 one-on-one meeting: look further into the RT-related cgroup.

`CONFIG_CGROUP_SCHED` governs enabling both RT groups and CFS groups. Why the CFS group functionality is not simply folded into `CONFIG_FAIR_GROUP_SCHED` is a question the instructor continues to explore.
:::
___
#### Updating `min_vruntime`
Page 109 of the book states:
> The min_vruntime member represents the least vruntime value across all enqueued entities, which is the minimum vruntime value of all entities in the queue.

In kernel 5.0, [`update_min_vruntime`](https://elixir.bootlin.com/linux/v5.0/source/kernel/sched/fair.c#L495) works in three stages:
- **Stage 1**
```c
if (curr) {
	if (curr->on_rq)
		vruntime = curr->vruntime;
	else
		curr = NULL;
}
```
If the currently running task is still on the runqueue, the `vruntime` variable is set to that task's vruntime.
- **Stage 2**
```c
if (leftmost) { /* non-empty tree */
	struct sched_entity *se;
	se = rb_entry(leftmost, struct sched_entity, run_node);

	if (!curr)
		vruntime = se->vruntime;
	else
		vruntime = min_vruntime(vruntime, se->vruntime);
}
```
Next the red-black tree is checked. If it is not empty, the leftmost node corresponds to the next task to run, and the branch taken depends on the result of stage 1. If the current task is still on the runqueue, `vruntime` becomes the smaller of the vruntime values of the current task and the next task. Otherwise, if the current task is no longer on the runqueue, `vruntime` is set directly to the vruntime of the next task.
- **Stage 3**
```c
cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
```
In this stage `vruntime` is compared with the current `min_vruntime` and the larger of the two is kept.

In kernel 6.14.8, [`update_min_vruntime`](https://elixir.bootlin.com/linux/v6.14.8/source/kernel/sched/fair.c#L759) also has three stages. The first two are the same as above, and the only difference is in the last stage:
- **Stage 3**
```c
static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
{
	u64 min_vruntime = cfs_rq->min_vruntime;
	/*
	 * open coded max_vruntime() to allow updating avg_vruntime
	 */
	s64 delta =
		(s64)(vruntime - min_vruntime);
	if (delta > 0) {
		avg_vruntime_update(cfs_rq, delta);
		min_vruntime = vruntime;
	}
	return min_vruntime;
}

cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
```
`min_vruntime` is updated only when `vruntime` is greater than the current `min_vruntime`.

:::info
For these two kernel versions the book's description is incorrect: `min_vruntime` is not necessarily the smallest vruntime in the queue and may be larger than it.
> TODO: open an issue. :heavy_check_mark:
> [issue #276](https://github.com/sysprog21/linux-kernel-scheduler-internals/issues/276)
> [Linux 核心設計: Scheduler(1): O(1) Scheduler](https://hackmd.io/@sysprog/Hka7Kzeap/%2FBJ9m_q)
:::

:::info
I also want to confirm which situations cause curr to be off the runqueue, since before curr starts running it must be on the runqueue. What happens to it during execution?
> 5/28 meeting: it may be the idle task, or the task may have been migrated. But migration does not happen instantly.
> Study [migrate_se_pelt_lag()](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/fair.c#L4488)
:::

:::info
Why must min_vruntime increase monotonically?
> 5/28 meeting: to prevent some task from keeping a very small vruntime forever. It acts as a threshold. It also accounts for migration on multicore systems, in which case vruntime has to be recalculated.
> [Linux 核心設計: Scheduler(3): 深入剖析 CFS Scheduler](https://hackmd.io/@sysprog/Hka7Kzeap/%2FBJ9m_qs-5)
:::
___
#### Clarifying `cgroup.procs`
The book states:
> cgroup.procs contains the list of PIDs of **all processes in the system**.

When I ran the corresponding experiment, however, the PIDs shown for the root group did not match the PIDs shown by `$ ps -ef`:
```shell
$ cat ./cgroup.procs
...
7604
7828
9302
```
> The largest PID shown here is only `9302`.
```shell
$ ps -ef
UID PID PPID C STIME TTY TIME CMD
...
root 9302 2 0 15:27 ? 00:00:00 [kworker/16:0]
boju 9322 5511 3 15:28 pts/1 00:00:00 zsh
boju 9358 2498 0 15:28 pts/1 00:00:00 zsh
boju 9360 2498 0 15:28 pts/1 00:00:00 zsh
...
```
> In reality many processes with PIDs greater than `9302` are still running on the system.

##### **TODO:** Run the experiments in a clean environment (using Alpine)
> At the 5/21 one-on-one meeting I reported the problem above to the instructor, who pointed me to [How to Use the Alpine Docker Official Image](https://www.docker.com/blog/how-to-use-the-alpine-docker-official-image/)

##### Pitfalls
First install the required packages:
```shell
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
```
Then download [docker-desktop-amd64.deb](https://desktop.docker.com/linux/main/amd64/docker-desktop-amd64.deb) and install it with the following commands:
```shell
$ sudo apt-get update
$ sudo apt-get install ./docker-desktop-amd64.deb
```
Docker Desktop can then be started from a terminal:
```shell
systemctl --user start docker-desktop
```
![image](https://hackmd.io/_uploads/rkqUKI3Zll.png)
In the menu on the left click `image`, search for alpine, and click `run`.

:::success
I ran into something odd. Normally the kernel version seen inside Docker should match my system, but the following commands showed otherwise:

**docker**
```shell
$ uname -r
6.10.14-linuxkit
```
**Host terminal**
```shell
$ uname -r
6.14.0-15-generic
```
> [Install Docker Desktop on Linux](https://docs.docker.com/desktop/setup/install/linux/) explains:
> Docker Desktop on Linux runs a **Virtual Machine (VM)** which creates and uses a custom docker context, desktop-linux, on startup.
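>
> So **Docker Engine** should be used instead, because Docker Desktop is not really native Docker. It is essentially still a virtual machine.
:::

##### Experiment
After switching to Docker Engine, confirm again that the kernel version inside Docker now matches the host.
```shell
$ docker run -it --rm alpine /bin/ash
$ uname -r
6.14.0-15-generic
```
Check the PIDs again:
```shell
$ cat /sys/fs/cgroup/cgroup.procs
1
9
$ ps -ef
PID USER TIME COMMAND
1 root 0:00 /bin/ash
10 root 0:00 ps -ef
```
This time the result looks more reasonable. The evening livestream on 5/22 discussed the issue of increasing PIDs at [57:34](https://www.youtube.com/live/du2i9K9WcpI?si=FKhhG1v8j7u1EfdM&t=3454), so my understanding is that 9 is the process forked to run `$ cat`. Once its output is shown in the terminal its work is done and it is killed, and the same applies to 10. That is why the process with PID 9 is not visible when `$ ps -ef` runs.

:::success
I am not sure whether the explanation above is precise. To be verified.
> 5/28 meeting: the process is not killed, it simply terminates.
:::

##### TODO: Check whether tasks of other scheduling classes exist on Ubuntu 25.04
[Comparison script](https://gist.github.com/yy214123/8329aecb9226bf88e29dad031b30f9b5)
First grant execute permission, then run the script:
```shell
$ chmod +x diff-cgroup-pids.sh
$ ./check_cgroup_diff.sh
```
The output is saved in a comment on the link above. The processes that are not in root_group turn out to be attached to one of two cgroup paths:
`/sys/fs/cgroup/user.slice`
`/sys/fs/cgroup/system.slice`

Some of them contain strings related to chrome and kitty. I then ran another script to see which CMD each PID in the root group corresponds to:
```shell
while read pid; do
  ps -p "$pid" -o pid,cmd
done < /sys/fs/cgroup/cgroup.procs
```
The result, however, shows kernel-related entries:
```shell
PID CMD
2 [kthreadd]
PID CMD
3 [pool_workqueue_release]
PID CMD
4 [kworker/R-rcu_gp]
PID CMD
5 [kworker/R-sync_wq]
PID CMD
6 [kworker/R-kvfree_rcu_reclaim]
PID CMD
7 [kworker/R-slub_flushwq]
...
```
:::info
1. Some kernel-related processes are attached to the root group.
2. 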
Some user-level processes, on the other hand, are not, which conflicts somewhat with the book's description.
> After thinking about it for a long time: are `user.slice` and `system.slice` perhaps child groups as well?
> 5/28 meeting: trace `systemd` further.

tools/perf/tests/shell/stat_bpf_counters_cgrp.sh
:::
___
## TODO: Review last year's related reports and get familiar with the tools
> [Report 1](https://hackmd.io/@sysprog/rkJd7TFX0), [Report 2](https://hackmd.io/@sysprog/rysYj43ER), [Report 3](https://hackmd.io/@sysprog/SyTH65LUC)
> Resolve the concerns raised there, and get familiar with tools such as [jitterdebugger](https://github.com/igaw/jitterdebugger), schbench, and schedgraph

### [jitterdebugger](https://github.com/igaw/jitterdebugger)
#### Introduction
This tool runs one thread on each processor and sets up a timer for it. It mainly measures the wake-up latency between the timer firing and the corresponding thread actually running.

First download the project and build it with `make`:
```shell
$ git clone git@github.com:igaw/jitterdebugger.git
$ cd jitterdebugger
$ make
```
There are two ways to run it:
- **Option 1**
```shell
$ sudo ./jitterdebugger
```
With this option the results are shown only after pressing ctrl+c in the terminal.
- **Option 2**
```shell
$ sudo ./jitterdebugger -v
```
With this option the measurements are updated live in the terminal.

The statistics look like this:
```shell
T: 0 (25351) A: 0 C: 2168411 Min: 1 Avg: 1.16 Max: 9990
T: 1 (25352) A: 1 C: 2168410 Min: 1 Avg: 1.19 Max: 11165
T: 2 (25353) A: 2 C: 2168409 Min: 1 Avg: 1.00 Max: 91
T: 3 (25354) A: 3 C: 2168407 Min: 1 Avg: 1.00 Max: 34
T: 4 (25355) A: 4 C: 2168406 Min: 1 Avg: 1.33 Max: 10977
T: 5 (25356) A: 5 C: 2168404 Min: 1 Avg: 2.05 Max: 61491
T: 6 (25357) A: 6 C: 2168403 Min: 1 Avg: 1.18 Max: 9543
T: 7 (25358) A: 7 C: 2168401 Min: 1 Avg: 1.38 Max: 25288
T: 8 (25359) A: 8 C: 2168400 Min: 1 Avg: 1.15 Max: 5806
T: 9 (25360) A: 9 C: 2168399 Min: 1 Avg: 1.18 Max: 11117
T:10 (25361) A:10 C: 2168397 Min: 1 Avg: 1.15 Max: 4896
T:11 (25362) A:11 C: 2168396 Min: 1 Avg: 1.19 Max: 12993
T:12 (25363) A:12 C: 2168395 Min: 1 Avg: 1.62 Max: 2807
T:13 (25364) A:13 C: 2168394 Min: 1 Avg: 1.43 Max: 2160
T:14 (25365) A:14 C: 2168392 Min: 1 Avg: 1.50 Max: 1565
T:15 (25366) A:15 C: 2168391 Min: 1 Avg: 1.38 Max: 2922
T:16 (25367) A:16 C: 2168390 Min: 1 Avg: 1.37 Max: 1951
T:17 (25368) A:17 C: 2168388 Min: 1 Avg: 1.24 Max: 1506
T:18 (25369) A:18 C: 2168387 Min: 1 Avg: 1.30 Max: 1135
T:19 (25370) A:19 C: 2168386 Min: 1 Avg: 1.38 Max: 12901
```
___
### schbench
> Related resources:
> 
[Linux 核心設計: Scheduler(6): 排程器測試工具](https://hackmd.io/@sysprog/H1Eh3clIp#Schbench)
> [kernel/git/mason/schbench.git](https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/tree/README.md)

First download the project and build it with `make`:
```shell
$ git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/mason/schbench
$ cd schbench
$ make
```
#### baseline benchmark
The project provides a command for measuring a baseline:
```shell
$ ./schbench -F 256 -n 5 --calibrate -r 10
setting worker threads to 20
Wakeup Latencies percentiles (usec) runtime 10 (s) (53418 total samples)
	50.0th: 3 (10411 samples)
	90.0th: 78 (18535 samples)
	* 99.0th: 775 (4808 samples)
	99.9th: 1082 (481 samples)
	min=1, max=1834
Request Latencies percentiles (usec) runtime 10 (s) (53475 total samples)
	50.0th: 3252 (15625 samples)
	90.0th: 4264 (21847 samples)
	* 99.0th: 6120 (4151 samples)
	99.9th: 7096 (474 samples)
	min=1909, max=8804
RPS percentiles (requests) runtime 10 (s) (11 total samples)
	20.0th: 5336 (3 samples)
	* 50.0th: 5352 (4 samples)
	90.0th: 5368 (4 samples)
	min=5238, max=5365
current rps: 5360.71
Wakeup Latencies percentiles (usec) runtime 10 (s) (53420 total samples)
	50.0th: 3 (10411 samples)
	90.0th: 79 (18543 samples)
	* 99.0th: 775 (4801 samples)
	99.9th: 1082 (482 samples)
	min=1, max=1834
Request Latencies percentiles (usec) runtime 10 (s) (53495 total samples)
	50.0th: 3252 (15627 samples)
	90.0th: 4264 (21855 samples)
	* 99.0th: 6120 (4151 samples)
	99.9th: 7096 (474 samples)
	min=1909, max=8804
RPS percentiles (requests) runtime 10 (s) (11 total samples)
	20.0th: 5336 (3 samples)
	* 50.0th: 5352 (4 samples)
	90.0th: 5368 (4 samples)
	min=5238, max=5365
average rps: 5349.50
```
This disables the spinlock (`-C`) and performs matrix operations. The matrix size corresponds to the number after `-F`, and `-n` corresponds to the number of matrix operations. The p99 of the second set of statistics (Request Latencies) should then be matched to the timeslice of interest.

:::success
How do I know which timeslice I should care about?
> 5/28 meeting: the focus should be on target latency. The description on the corresponding page is not rigorous enough. There should be a proper statistical method for computing the value of interest in a given scenario.
:::
___
### [schedgraph](https://gitlab.inria.fr/schedgraph/schedgraph)
First install the required packages:
```shell
$ sudo apt install ocaml
$ sudo apt install opam
$ sudo apt install jgraph
$ sudo apt install libglfw3-dev
$ sudo apt install trace-cmd
```
> I later found that jgraph has to be installed separately, otherwise generating images fails. According to [Jgraph -- A Filter for Plotting Graphs in Postscript](https://web.eecs.utk.edu/~jplank/plank/jgraph/jgraph.html) it must be downloaded and compiled manually before use. [Download link](http://web.eecs.utk.edu/~jplank/plank/jgraph/2024-02-15-Jgraph.tar)
> Remember to put the jgraph executable on the PATH:
> ```shell
> $ sudo cp jgraph /usr/local/bin/
> ```

After cloning the project, some initialization is required:
```shell
$ git clone https://gitlab.inria.fr/schedgraph/schedgraph.git
$ cd schedgraph
$ git submodule init
$ git submodule update
```
Build it:
```shell
$ make
...
opam install ocamlfind
[ERROR] Opam has not been initialised, please run `opam init'
make: *** [Makefile:21: stepper] Error 50
```
This fails. Follow the instruction in the error message, then export the installation path into the environment:
```shell
$ opam init
$ eval $(opam env)
```
Then rebuild:
```shell
$ make clean
$ make
```
> The project's original Makefile has no clean target, so it has to be added manually:
> ```diff
>+ clean:
>+	rm -f *.cmi *.cmo *.cma *.o \
>+	dat2graph2 dat2graph \
>+	running_waiting events stepper \
>+	implot_viewer
> ```
___
## TODO: Study `scheduler_tick`
> [Issue #236](https://github.com/sysprog21/linux-kernel-scheduler-internals/issues/236)

## TODO: group scheduling and scheduler domain/group
> [Issue #271](https://github.com/sysprog21/linux-kernel-scheduler-internals/issues/271)

**Task description**:
Linux v5.14 introduced a substantially improved CPU bandwidth control mechanism in CFS, which promises to solve long-standing problems with container workloads. CPU quotas, however, have historically caused unexpected and undesirable effects in Kubernetes.

Kubernetes uses cgroups to limit CPU and memory. Memory quotas behave predictably: exceeding the limit is refused, and exceeding it slightly may succeed but can trigger an OOM kill, which matches what developers expect on ordinary Linux.

CPU quotas are expressed as a fraction of time. For example, 0.5 vCPU means only 50 ms of CPU time may be used in every 100 ms. Once the quota is used up, the scheduler makes the process unschedulable until the quota is refilled, rather than slowing it down smoothly. In a multithreaded container, if idle cores are available, the process consumes a large amount of CPU in a burst and then disappears completely until the quota is replenished. The workload therefore looks runnable on the surface while in fact it cannot be scheduled.

Tim Hockin recommends disabling CPU quotas with `--cpu-cfs-quota=false`. Many operators have done the same and use CPU pinning (whole CPUs) instead, but this is impractical for services that need only a fraction of a vCPU.
> [Issue #149](https://github.com/sysprog21/linux-kernel-scheduler-internals/issues/149)

### Literature reading notes
> [CPU bandwidth control for CFS](https://hackmd.io/NJXuSFHDRX65opFJ06kBuQ)
> [Mitigating Unnecessary Throttling in Linux CFS Bandwidth 
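Control](https://hackmd.io/Ls-IpJj3Qo2ZkcNmaMJNBQ)

### Introduction
CFS allocates processor resources per task. Suppose a system has 50 runnable tasks: each task receives about $1/50$ of the time on average. It may well be that 49 of those tasks belong to userA and the remaining one belongs to userB. From the users' point of view, userA then receives 98% of the processor while userB receives only 2%. In other words, the scheduler is fair from the perspective of individual tasks but very unfair from the perspective of users. To be fair, each user should receive 50% of the processor, and that share should then be divided among the user's own tasks. In CFS every schedulable unit is called a `sched_entity`, which can be a single task or a group. Group scheduling was introduced to solve the problem above.

### per-CPU runqueues
Each CPU has one runqueue whose sole purpose is to decide which task should run right now. Linux has several scheduling classes (the higher in the table, the higher the priority):

| Scheduling classes | Scheduling policies |
| ------------------ | ------------------- |
| stop_sched_class | |
| dl_sched_class | SCHED_DEADLINE |
| rt_sched_class | SCHED_FIFO,SCHED_RR |
| fair_sched_class | SCHED_NORMAL(OTHER),SCHED_BATCH,SCHED_IDLE |
| idle_sched_class | |

The corresponding data structure is `rq`:
```c
struct rq {
	//...
	struct cfs_rq cfs;
	struct rt_rq rt;
	//...
};
```
This shows that each scheduling class also has its own runqueue. Each scheduling class has its own algorithm for picking the next task to run.

### CFS runqueue
The CFS runqueue is represented by the `cfs_rq` structure. It builds a red-black tree ordered by vruntime (the `tasks_timeline` member of cfs_rq). Every node corresponds to a scheduling entity, and the leftmost node is the task with the smallest vruntime.

### From the perspective of a single task
For a single task, `load_weight` determines how much CPU time it can be given, and it is derived from the nice value (-20 to 19):
#### **weight function**
$$
weight(nice) = \frac{1024}{(1.25)^{nice}}
$$
With the weight in hand, the next step is computing vruntime:
#### **vruntime**
$$
vruntime=\delta * \frac{weight(nice\_0)}{task\_weight}
$$
> Here $\delta$ is the time already executed. total_weight, used in the timeslice formula below, is the total weight of all runnable tasks.

CFS always picks the task with the smallest vruntime. Once the task has been chosen, the next step is to decide how long it may actually run.
#### **assigned time**
Also called the timeslice of each task.
$$
timeslice=target\_latency* \frac{task\_weight}{total\_weight}
$$
> Here target_latency is the shortest interval within which every runnable task gets to run at least once (running, not necessarily finishing).

:::info
I have a question here. If group scheduling is enabled, is the timeslice computed the same way? Or does the method above apply only to single tasks, so that when a group entity is picked the actual runtime is decided by the global quota and local quota mechanism?
> 5/28 meeting: quota changes the weight, and once task_weight is affected the timeslice is computed the same way. This still needs further verification.
:::

There is also the so-called min granularity, which sets the lower bound of the timeslice. Because a context switch is expensive, any running task must run for some minimum time before it can be preempted, to avoid waste. Its variable name in the kernel has changed twice:
> commit 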
[8a99b68](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8a99b6833c884fa0e7919030d93fecedc69fc625)

5.13 moved the location to `/sys/kernel/debug/sched` and renamed `sched_min_granularity_ns` to `min_granularity_ns`.
>[patch](https://github.com/redhat-performance/tuned/commit/87ff27c8ca9f4f985cabd094dacb3ceee446cfca)

Starting with 6.6 the parameter was renamed once more (`base_slice_ns`), the name still in use today.

### From the perspective of a group
If group scheduling is enabled, the runqueue of the root group is `rq->cfs`:
```c
struct rq {
	//...
	struct cfs_rq cfs;
	struct rt_rq rt;
	//...
};
```
#### Cgroup
There are two versions, which differ in their hierarchy:
**v1**
![image](https://hackmd.io/_uploads/ByWsf1sZel.png)
**v2**
![image](https://hackmd.io/_uploads/Syi2G1sZge.png)

The following command shows which cgroup version the system uses and where it is mounted:
```shell
$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
```
This directory is called the **root group**. Its interface files can be listed:
```shell
$ ls /sys/fs/cgroup
cgroup.controllers cpu.stat.local memory.reclaim
cgroup.max.depth dev-hugepages.mount memory.stat
cgroup.max.descendants dev-mqueue.mount memory.zswap.writeback
...
```
There are three key core interface files:
1. `cgroup.procs`: all PIDs in the group.
2. `cgroup.controllers`: the controllers supported by the current group.
3. 
`cgroup.subtree_control`: the controllers enabled for child groups.

If a new directory is created under the root group directory, it becomes a child group:
```graphviz
digraph G {
  rankdir=TB;
  node [shape=box];
  "/sys/fs/cgroup" -> "group1";
  "/sys/fs/cgroup" -> "group2";
}
```
Because the root group is the parent of group1 and group2, their `cgroup.controllers` match the root group's `cgroup.subtree_control`:
```shell
$ cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
$ cat /sys/fs/cgroup/group1/cgroup.controllers
cpuset cpu io memory pids
$ cat /sys/fs/cgroup/group2/cgroup.controllers
cpuset cpu io memory pids
```
Because the relationship is hierarchical, a controller cannot be added by editing a group's own `cgroup.controllers`. The parent group's `cgroup.subtree_control` has to be modified instead.
> For the full list of controllers, see the [man page](https://man7.org/linux/man-pages/man7/cgroups.7.html).

Take group1 as an example. Its `cgroup.subtree_control` is empty, so when a subdirectory is created under it, the child's `cgroup.controllers` is empty:
```graphviz
digraph G {
  rankdir=TB;
  node [shape=box];
  "/sys/fs/cgroup" -> "group1";
  "/sys/fs/cgroup" -> "group2";
  "group1" -> "child1"
}
```
```shell
$ cat /sys/fs/cgroup/group1/child1/cgroup.controllers
# (empty: no controllers enabled yet)
```
The following commands add a controller to group1's `cgroup.subtree_control` and then check child1's `cgroup.controllers` again:
```shell
# configure the parent group
$ echo "+cpu" >> /sys/fs/cgroup/group1/cgroup.subtree_control
$ cat /sys/fs/cgroup/group1/cgroup.subtree_control
cpu
# inspect the child group
$ cat /sys/fs/cgroup/group1/child1/cgroup.controllers
cpu
```
Within the hierarchy, the group at each level has its own runqueue. Using the figure above: for child1, `$ cat child1/cgroup.procs` lists the processes it contains. From child1's runqueue one process must be chosen to represent all of child1 in its parent group (group1). Moving up one level, group1's runqueue may contain many individual processes as well as the representatives chosen by several child groups. Each of these is treated as an independent scheduling entity, and again one representative is sent up to the root group. The root group then picks one representative in the same way.

:::info
> When the scheduler must choose the next task to run, it initially selects the most deserving entity from the top-level queue using the standard CFS algorithm.
> If the chosen entity represents a task, it is executed. Otherwise, if it represents a group, the scheduler repeats the process at the lower level.

According to this description, the process should be viewed top down. When the top-level `cfs_rq` picks the entity with the smallest vruntime, it checks whether that entity is a single task or the representative entity of a group.
> The only entity that is inserted into the CPU runqueue is the top-level object, which has **no parents**.
I keep hitting a blind spot when thinking about the "representative entity": who exactly should be chosen as the representative, and how? According to the description above, if a group's representative entity is picked, the search continues downward. This means the group representative in the root group is not necessarily the one with the smallest vruntime in the whole hierarchy, and also that the bottom-up choice of a representative was not based on vruntime. If the representative is not the entity with the smallest vruntime, I wonder why that design was not adopted. With it, when the top-level runqueue picks an entity (even a group entity), it could be run directly just like an ordinary task.
> https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/fair.c#L13080
:::
```graphviz
digraph G {
  rankdir=TB;
  node [shape=box];
  "/sys/fs/cgroup" -> "group1";
  "/sys/fs/cgroup" -> "group2";
  "group1" -> "child1"
}
```
#### Kernel code
I used ftrace to trace the chain of function calls in the kernel when a child group directory is created under the root directory. The first is [cgroup_mkdir()](https://elixir.bootlin.com/linux/v6.14.8/source/kernel/cgroup/cgroup.c#L5838), which performs a series of checks, including whether the parent group exists and the parent group's `nr_descendants` and `level` settings ([cgroup_check_hierarchy_limits()](https://elixir.bootlin.com/linux/v6.14.8/source/kernel/cgroup/cgroup.c#L5815)).
>`nr_descendants`: the breadth of this group (the total number of descendant groups).
>`level`: the depth of this group in the hierarchy, which is checked against the depth limit.

I checked the root directory and both settings are `max`, which matches the kernel documentation:
> **cgroup.max.descendants**
> A read-write single value files. The default is "max".
> Maximum allowed number of descent cgroups. If the actual number of descendants is equal or larger, an attempt to create a new cgroup in the hierarchy will fail.
>
> **cgroup.max.depth**
> A read-write single value files. The default is "max".
> Maximum allowed descent depth below the current cgroup. If the actual descent depth is equal or larger, an attempt to create a new child cgroup will fail.
> [〈Control Group v2〉](https://www.kernel.org/doc/Documentation/cgroup-v2.txt)

These two defaults are handled in [cgroup_max_descendants_write()](https://elixir.bootlin.com/linux/v6.14.8/source/kernel/cgroup/cgroup.c#L3622) and [cgroup_max_depth_write()](https://elixir.bootlin.com/linux/v6.14.8/source/kernel/cgroup/cgroup.c#L3665), and the value is in fact `INT_MAX`.

##### Issue 1
The description in [man cgroups](https://man7.org/linux/man-pages/man7/cgroups.7.html) is imprecise:
```shell
cgroup.max.depth (since Linux 4.14)
    This file defines a limit on the depth of nesting of
    descendant cgroups. A value of 0 in this file means that no
    descendant cgroups can be created.
    An attempt to create a descendant whose nesting level
    exceeds the limit fails (mkdir(2) fails with the error EAGAIN).

    Writing the string "max" to this file means that no limit is
    imposed. The default value in this file is "max".

cgroup.max.descendants (since Linux 4.14)
    This file defines a limit on the number of live descendant
    cgroups that this cgroup may have. An attempt to create more
    descendants than allowed by the limit fails (mkdir(2) fails
    with the error EAGAIN).

    Writing the string "max" to this file means that no limit is
    imposed. The default value in this file is "max".
```
It is not true that **no limit is imposed**. The limit is actually `INT_MAX`.

##### Picking the next runnable task
Tracing the kernel code that picks tasks further, I first noticed conditional compilation. If `CONFIG_SCHED_CORE` is enabled, this version of [pick_next_task](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/core.c#L6105) is used. If it is not enabled, this version of [pick_next_task](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/core.c#L6549) is used instead.
> The config option is described at [SCHED_CORE](https://elixir.bootlin.com/linux/v6.15/source/kernel/Kconfig.preempt#L135)

The following command checks whether it is enabled on the current system:
```shell
$ grep CONFIG_SCHED_CORE /boot/config-$(uname -r)
```
:::info
It is enabled by default on my machine. I know it is closely related to [Core Scheduling](https://docs.kernel.org/admin-guide/hw-vuln/core-scheduling.html#core-scheduling), but I have not studied it further.

Section 4.5 of the instructor's scheduler book covers it:
![image](https://hackmd.io/_uploads/Bkf0zXrQel.png)
> As shown in Figure 4.3, when running untrusted task, the sibling CPU either
> • runs a task from the same untrusted group.
> • not runs any tasks forcedly if the scheduler can not find a runnable trusted task in the same group as the other sibling.
I do not see much connection between the figure and the text.

SMT: hardware thread
:::

The following continues the trace assuming `CONFIG_SCHED_CORE` is not enabled. From the perspective of the per-CPU runqueue, the following checks are made ([__pick_next_task()](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/core.c#L6013)):
- whether the class of the previous task is also fair_sched_class
- whether `rq->nr_running` equals `rq->cfs.h_nr_queued`

Next, [pick_next_task_fair()](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/fair.c#L8872) calls [pick_task_fair()](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/fair.c#L8841), which first checks whether `cfs_rq->nr_queued` is zero:
```c
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_queued)
	return NULL;
```
:::info
Clarify the exact difference between `nr_queued` and `h_nr_queued`. I also found that these two members were renamed at the end of last year.
- [cfs_rq.h_nr_running -> cfs_rq.h_nr_queued](https://lkml.org/lkml/2024/12/2/1223)
- [cfs_rq.nr_running -> cfs_rq.nr_queued](https://lkml.indiana.edu/2412.1/01010.html)
> According to [PATCH 01/11 v3](https://lkml.org/lkml/2024/12/2/1221), `nr_running` counts the entities at the current level, while `h_nr_running` also includes the entities in the child levels. The rename was made because of delayed dequeue: sleeping scheduling entities may stay in the queue, so the new names describe the behavior more accurately. [PATCH 04/11 v3](https://lkml.org/lkml/2024/12/2/1224) then added the `h_nr_runnable` member.

```c
struct cfs_rq {
	unsigned int h_nr_queued;	/* SCHED_{NORMAL,BATCH,IDLE} */
	unsigned int h_nr_runnable;	/* SCHED_{NORMAL,BATCH,IDLE} */
	unsigned int h_nr_idle;		/* SCHED_IDLE */
};
```
My current understanding of these three members of cfs_rq is that `h_nr_queued` = `h_nr_runnable` + `h_nr_idle`. This still needs to be backed by experiments or kernel code.

What made me curious at first: since the member names in `cfs_rq` replaced the word running with queued, why does the [rq](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/sched.h#L1101) structure used by the per-CPU runqueue still use the word running? Or is the behavior of the per-CPU runqueue simply more straightforward?
> Instructor's reply: contact the developers and make good use of LKML for discussion.
>> Q: [LKML](https://lkml.org/lkml/2025/6/6/) seems to show many patches submitted by developers (code changes not yet merged), but what I have above is more of a question. How should I proceed in this case?
>>
>> A: The instructor explained that developers will probably ignore a plain question, so an **RFC patch** should be sent.
>
>> Q: The latest version on bootlin is v6.15, while some commit messages in [torvalds/linux](https://github.com/torvalds/linux?tab=readme-ov-file) already mention 6.16.
>>
>> A: To track the latest code, see [Working with linux-next](https://www.kernel.org/doc/man-pages/linux-next.html)
>
>> Q: Are there open-source projects related to cgroup that I could contribute to?
>>
>> A: Yes, [kubernetes](https://kubernetes.io/docs/concepts/architecture/cgroups/).
:::

:::success
TODO: send an RFC patch 🚫
> I originally planned to submit an RFC patch asking the kernel developers whether queued and running are inconsistent. While following up on this topic I found the following:
>
> This series of changes was proposed by Vincent Guittot, a SCHED_NORMAL maintainer. In this [message](https://lkml.org/lkml/2024/12/3/741), Dietmar Eggemann noted during review:
> > Using nr_running on rq/rt_rq/dl_rq and nr_queued for cfs_rq might look strange to the untrained eye.
>
> Another maintainer, Peter Zijlstra, then replied:
> > Yes, but keeping nr_running with new semantics would not be **less confusing** and potentially **more dangerous**.
>
> My guess is that rq covers a broader scope, but I did not see any further explanation from Peter. All I know is that there is currently no intention to change it.
:::

:::info
TODO: send a patch
> Although the kernel developers do not want to rename the members of the rq structure to match the cfs_rq structure for now, the instructor, after our discussion, considers improving the comments of the cfs_rq structure a worthwhile contribution. I will work on this next.
:::

The values mentioned above can be observed for each group with the following command:
```shell
$ sudo cat /sys/kernel/debug/sched/debug
```
Next comes the key do-while loop together with the function [group_cfs_rq()](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/sched.h#L1582), which determines whether the scheduling entity just picked is a single task or a group:
```c
do {
	/* Might not have done put_prev_entity() */
	if (cfs_rq->curr && cfs_rq->curr->on_rq)
		update_curr(cfs_rq);

	if (unlikely(check_cfs_rq_runtime(cfs_rq)))
		goto again;

	se = pick_next_entity(rq, cfs_rq);
	if (!se)
		goto again;
	cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
```
```c
#ifdef CONFIG_FAIR_GROUP_SCHED
static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
{
	return grp->my_q;
}
```
If the entity is a group entity, the do-while condition holds and the loop keeps descending into the child group. If it is a single task, the `my_q` member of its structure is NULL. However, `sched.h` already defines the [entity_is_task](https://elixir.bootlin.com/linux/v6.15/source/kernel/sched/sched.h#L903) macro:
```c
#ifdef CONFIG_FAIR_GROUP_SCHED
/* An entity is a task if it doesn't "own" a runqueue */
#define entity_is_task(se)	(!se->my_q)
```
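The `entity_is_task` macro is also referenced in [many places](https://elixir.bootlin.com/linux/v6.15/C/ident/entity_is_task) in the kernel, so to stay closer to the kernel coding style the helper could perhaps be changed as follows:
```diff
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 {
-    return grp->my_q;
+    return !entity_is_task(grp);
 }
```

:::success
TODO: send a patch 🚫
> Modify `group_cfs_rq` so that it reuses the existing macro.
> Later testing showed that the macro cannot simply be substituted: it evaluates to a boolean, whereas in a group hierarchy `my_q` must point to the child group's `cfs_rq`.
:::

## TODO: New experiments
At the in-person class on 6/10 I will discuss the cgroup chapter with the instructor; the plan is to add a container experiment to the book.
> Environment: to avoid interference from systemd's user.slice and system.slice, I decided to use docker.
> Target kernel version: 6.1.x
> Subject of analysis: the execution time of each scheduling entity after the bandwidth of the root cgroup is adjusted.
> Analysis tool: schedgraph; for examples see [FOSDEM 2023](https://archive.fosdem.org/2023/schedule/event/sched_tracing/attachments/slides/5824/export/events/attachments/sched_tracing/slides/5824/fosdem_lawall.pdf)

List the available versions with the `mainline` command:
```shell
$ mainline list
...
6.5
6.4
6.3
6.2
6.1.141
6.1.140
6.1.139
6.1.138
```

I chose 6.1.141, the latest kernel of that series:
```shell
$ mainline install 6.1.141
```

Start docker and change to the root cgroup directory:
```shell
$ sudo docker run -it --rm alpine /bin/ash
/ # cd sys/fs/cgroup
```

Creating the child group directories failed at first:
```shell
/sys/fs/cgroup # mkdir child1
mkdir: can't create directory 'child1': Read-only file system
/sys/fs/cgroup # mkdir child2
mkdir: can't create directory 'child2': Read-only file system
```

:::info
I have not found a solution yet. After setting the options described in [docker container run](https://docs.docker.com/reference/cli/docker/container/run/), the child group directories could be created, but processes then could not be moved into the child groups:
```shell
bash: echo: write error: No such file or directory
```
1. Discuss with the instructor whether to run the experiments on the host instead.
   > 6/14 meeting: use [QEMU](https://wiki.alpinelinux.org/wiki/Install_Alpine_in_QEMU) together with KVM.
2. Experiment details:
   - depth and breadth of the child groups
   - total number of tasks
   - bandwidth ratios
   - How many cores should be isolated to present the results? My idea is to set `cpuset.cpus` for every child group so that everything belonging to the groups only has affinity with specific cores.
3. Type of task: how should a task suitable for the experiment be designed? For example, is a task like the following appropriate?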
```shell
yes > /dev/null &
```
> 6/14 meeting: use mktasks instead.

My preliminary idea is to isolate a few specific cores in advance, but the following questions have to be settled first:
```shell
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 5100.0000 800.0000 5099.9858
  1    0      0    0 0:0:0:0          yes 5100.0000 800.0000 4587.1509
  2    0      0    1 4:4:1:0          yes 5100.0000 800.0000  800.0000
  3    0      0    1 4:4:1:0          yes 5100.0000 800.0000  800.0000
  4    0      0    2 8:8:2:0          yes 5100.0000 800.0000 5079.9990
  5    0      0    2 8:8:2:0          yes 5100.0000 800.0000 4757.5869
  6    0      0    3 12:12:3:0        yes 5100.0000 800.0000 5089.0762
  7    0      0    3 12:12:3:0        yes 5100.0000 800.0000 5089.0288
  8    0      0    4 16:16:4:0        yes 5100.0000 800.0000 5100.0132
  9    0      0    4 16:16:4:0        yes 5100.0000 800.0000 5100.0000
 10    0      0    5 20:20:5:0        yes 5100.0000 800.0000 5099.9751
 11    0      0    5 20:20:5:0        yes 5100.0000 800.0000 5100.0000
 12    0      0    6 24:24:6:0        yes 3900.0000 800.0000 3899.9680
 13    0      0    7 25:25:6:0        yes 3900.0000 800.0000 3900.0071
 14    0      0    8 26:26:6:0        yes 3900.0000 800.0000 3900.0801
 15    0      0    9 27:27:6:0        yes 3900.0000 800.0000 3842.7729
 16    0      0   10 28:28:7:0        yes 3900.0000 800.0000 3888.0730
 17    0      0   11 29:29:7:0        yes 3900.0000 800.0000 3888.2700
 18    0      0   12 30:30:7:0        yes 3900.0000 800.0000 3892.3989
 19    0      0   13 31:31:7:0        yes 3900.0000 800.0000 3888.1479
```
1. COREs 0 to 5 each expose two CPUs (COREs 6 to 13 expose one each). I looked it up and this is called Hyper-Threading. Taking CPU 0 and CPU 1 as an example, my understanding is that they share the resources of CORE 0, so from the scheduler's point of view migrating within the same CORE should be faster, right? Could this make the experimental results inaccurate?
2. Should Hyper-Threading be turned off so that the mapping becomes one-to-one? Or should I only pick odd-numbered or even-numbered CPUs in the isolate and affinity settings? I would also like to clarify whether odd and even CPUs differ in any fundamental way.
```shell
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=0,2,4,6,8,10,12,14"
$ echo "0,2,4,6,8,10,12,14" > /sys/fs/cgroup/child1/cpuset.cpus
```
```shell
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=1,3,5,7,9,11,13,15"
$ echo "1,3,5,7,9,11,13,15" > /sys/fs/cgroup/child2/cpuset.cpus
```
:::

### Experiment 1
```graphviz
digraph G {
    rankdir=TB;
    node [shape=box];
    "/sys/fs/cgroup" -> "child1";
    "/sys/fs/cgroup" -> "child2";
    subgraph cluster_child1 {
        label = "child1 (50%)";
        style=filled;
        color=lightgrey;
        node [shape=circle, style=filled, color=white];
        task_1; task_2; task_3; task_4; task_5; task_6; task_7; task_8; task_9; task_10;
    }
    subgraph cluster_child2 {
        label = "child2 (50%)";
        style=filled;
        color=lightgrey;
        node [shape=circle, style=filled, color=white];
        task_11; task_12; task_13; task_14; task_15; task_16; task_17; task_18; task_19; task_20;
    }
    "child1" -> {task_1 task_2 task_3 task_4 task_5 task_6 task_7 task_8 task_9 task_10} [style=dotted];
    "child2" -> {task_11 task_12 task_13 task_14 task_15 task_16 task_17 task_18 task_19 task_20} [style=dotted];
}
```

#### **CPU isolation**
Edit the grub configuration file from a terminal:
```shell
$ sudo vim /etc/default/grub
```
```diff
- GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
+ GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=0,1,2,3,4,5,6,7"
```
Update grub and reboot:
```shell
$ sudo update-grub
$ sudo reboot
```
Check after rebooting:
```shell
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.1.141-0601141-generic root=UUID=c71fcb9c-8ab6-464c-8aa0-d7d5195e3a35 ro quiet splash isolcpus=0,1,2,3,4,5,6,7 crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M vt.handoff=7
```

#### **Child groups**
First create the child groups `child1` and `child2` under the root cgroup directory:
```shell
$ cd /sys/fs/cgroup
$ sudo mkdir child1 child2
```
Bandwidth settings
> Start with equal shares for the two child groups.
```shell
$ echo "50000 100000" > /sys/fs/cgroup/child1/cpu.max
$ echo "50000 100000" > /sys/fs/cgroup/child2/cpu.max
```
Affinity settings
> CPUs 0 to 7 were isolated earlier, so the matching setting is needed here to make sure the tasks in the child groups are not migrated to other cores.
```shell
$ echo "0-7" > /sys/fs/cgroup/child1/cpuset.cpus
$ echo "0-7" > /sys/fs/cgroup/child2/cpuset.cpus
```

#### **Tasks**
Add ten `yes > /dev/null &` tasks to each of the two child groups:
```shell
$ for i in $(seq 1 10); do
    yes > /dev/null &
    echo $! > /sys/fs/cgroup/child1/cgroup.procs
done
[1] 8409
[2] 8410
[3] 8411
[4] 8412
[5] 8413
[6] 8414
[7] 8415
[8] 8416
[9] 8417
[10] 8418
$ for i in $(seq 1 10); do
    yes > /dev/null &
    echo $! > /sys/fs/cgroup/child2/cgroup.procs
done
[11] 8424
[12] 8425
[13] 8426
[14] 8427
[15] 8428
[16] 8429
[17] 8430
[18] 8431
[19] 8432
[20] 8433
```
Check that they were added successfully:
```shell
$ cat /sys/fs/cgroup/child1/cgroup.procs
8409
8410
8411
8412
8413
8414
8415
8416
8417
8418
$ cat /sys/fs/cgroup/child2/cgroup.procs
8424
8425
8426
8427
8428
8429
8430
8431
8432
8433
```

#### **Monitoring the scheduling result**
The plotting tool is schedgraph, as specified by the instructor; for environment setup and installation see [TODO: review last year's reports and get familiar with the tools - schedgraph](#schedgraph). trace-cmd is used to trace scheduling information. Since all the tasks placed in the child groups are the same program, their PIDs can be collected first to avoid unnecessary noise:
```shell
$ PIDS=$(pgrep yes | paste -sd, -)
```
Record 5 seconds of data with trace-cmd:
```shell
$ sudo trace-cmd record -e sched -e sched_stat_runtime -P $PIDS -M 0-7 sleep 5
```
Plot the resulting `trace.dat` with `schedgraph`:
```shell
$ ./dat2graph2 trace.dat --color-by-top-command --save-tmp --ps
```
![image](https://hackmd.io/_uploads/rkbcQeiQgg.png)
> 1. There should be finer-grained options for choosing which cores appear on the y-axis; I am still looking into this.
> 2.
>    There is a large gap in the middle, as if other tasks were interleaved.

If the plotting command is changed to:
```shell
$ ./dat2graph2 trace.dat --color-by-pid --save-tmp --ps
```
![image](https://hackmd.io/_uploads/BygKKGnXxg.png)

:::info
This probably has to be addressed on the trace-cmd side, but after trying for a long time I could not find an option that restricts collection to specific cores. From another angle, it may be possible to filter the original trace.dat down to the data of specific cores and then plot that with the tool.
:::

### Switching to QEMU for the experiments
#### Environment setup
Install the required packages with the following command:
```shell
$ sudo apt install qemu-system-x86 qemu-kvm virt-manager bridge-utils
```
Since kernel 6.1.x is required, the matching Alpine version is 3.18; the iso file can be downloaded [here](https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-standard-3.18.12-x86_64.iso). Open the VM manager:
```shell
$ sudo virt-manager
```
The matching iso file has just been downloaded, so pick the first option.
![image](https://hackmd.io/_uploads/rynDJtZNll.png)

:::info
I later found that this GUI cannot paste content from the host directly, so I stopped using it and switched to a pure CLI workflow.
:::

Pick a path of your choice and create the virtual disk:
```shell
$ qemu-img create -f qcow2 alpine-cgroupv2.qcow2 4G
```
> alpine-cgroupv2 is a name I chose myself.

Then start the installation with QEMU:
```shell
$ qemu-system-x86_64 \
    -m 1024M \
    -enable-kvm \
    -cpu host \
    -smp 2 \
    -drive file=~/alpine-cgroupv2.qcow2,format=qcow2 \
    -cdrom ~/alpine-standard-3.18.12-x86_64.iso \
    -boot d \
    -nic user \
    -nographic
```
Some initial configuration follows. I pressed Enter to keep the defaults until this prompt appeared:
```shell
Which disk(s) would you like to use? (or '?' for help or 'none') [none]
```
Check it against the list:
```shell
Available disks are:
  fd0 (0.0 GB )
  sda (4.3 GB ATA QEMU HARDDISK )
```
and enter the correct name. The next prompt is:
```Shell
How would you like to use it? ('sys', 'data', 'crypt', 'lvm' or '?'
for help)
```
Be sure to enter sys here. Otherwise every launch starts a fresh emulator and the experiment environment has to be set up again each time. After that you are inside QEMU; check that the kernel version meets the requirement:
```shell
$ uname -r
6.1.128-0-lts
```
cgroupv2 is not mounted by default, so mount it manually:
```shell
$ mount -t cgroup2 none /sys/fs/cgroup
```
To check that the mount succeeded:
```shell
$ mount | grep cgroup
none on /sys/fs/cgroup type cgroup2 (rw,relatime)
```

**Installing trace-cmd**

This is less convenient than on Ubuntu: everything has to be built and installed from the upstream projects:
```shell
$ git clone https://github.com/rostedt/trace-cmd.git
$ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git
$ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git
```
Enter each directory, then build and install:
```shell
$ cd xxx
$ make
$ make install
```
The processes for the experiment are created with mktasks, a tool from the scheduler book project. To use it, run:
```shell
$ make tools
$ python3 tools/mktasks/start.py --targets=xxx
```
The value after targets selects the corresponding json file, so the process configuration has to be prepared beforehand.
> The mapping works as follows: with targets set to xxx, the script looks for experiments/xxx-config.json and passes its settings to mktasks.

**mktasks configuration**

Judging from the json structure described in the README under the tools directory, the current `mktasks.c` supports cgroupv1. Since I intend to run the experiments on cgroupv2, `mktasks.c` has to be modified and the json format adjusted as well.

:::info
cgroupv1 is in fact still used on some kernel versions, and I think supporting both is a direction worth pursuing. Because time is short, for now I have written a version of `mktasks.c` dedicated to cgroupv2.
> [Code](https://gist.github.com/yy214123/39cdea6b73c70688d0e36d3640ae18ea); replace the original `mktasks.c` with it.
:::

The corresponding process configuration json file (`cgroup-config.json`):
```json
{
    "cgroup": {
        "": {
            "subgroup": {
                "child1": { "cpu.max": "50000 100000" },
                "child2": { "cpu.max": "50000 100000" }
            }
        }
    },
    "procs": [
        { "comm": "cpu_bound", "cpus": "3", "name": "p_child1", "group": "child1", "policy": "other", "policy_detail": "0" },
        { "comm": "cpu_bound", "cpus": "3", "name": "p_child2", "group": "child1", "policy": "other", "policy_detail": "0" },
        { "comm": "cpu_bound", "cpus": "3", "name": "p_child3", "group": "child1", "policy": "other", "policy_detail": "0" },
        { "comm": "cpu_bound", "cpus": "3", "name": "p_child4", "group": "child2", "policy": "other", "policy_detail": "0" }
    ]
}
```
The group hierarchy and processes defined above look like this:
```graphviz
digraph G {
    rankdir=TB;
    node [shape=box];
    "/sys/fs/cgroup" ->
    "child1";
    "/sys/fs/cgroup" -> "child2";
    subgraph cluster_child1 {
        label = "child1 (50%)";
        style=filled;
        color=lightgrey;
        node [shape=circle, style=filled, color=white];
        task_1; task_2; task_3;
    }
    subgraph cluster_child2 {
        label = "child2 (50%)";
        style=filled;
        color=lightgrey;
        node [shape=circle, style=filled, color=white];
        task_4;
    }
    "child1" -> {task_1 task_2 task_3} [style=dotted];
    "child2" -> {task_4} [style=dotted];
}
```
First run:
```shell
$ sudo python3 tools/mktasks/start.py --targets=cgroup --events=sched:sched_switch --cpu-mask=3
```
This produces two files in the result directory, `cgroup-trace.dat` and `cgroup.log`. `cgroup-trace.dat` is the file to visualize; plot it with kernelshark:
```shell
$ kernelshark result/cgroup-trace.dat
```
The result:
![image](https://hackmd.io/_uploads/B1S1OJu4le.png)

The plot matches expectations. child1 contains 3 tasks (PID 24186, 24187, 24188) and child2 contains 1 task (PID 24189). Because both groups have the same share, PID 24189 runs densely early on, and the white gaps are where one of child1's processes is running. Since child1 has three processes, by the time the child2 process finishes, each of the three child1 processes has completed about 1/3 of its work. After that, only these three processes remain and alternate.

:::success
While both groups still have unfinished tasks, the alternation is reasonable, but once only child1 is left I could not explain the large gaps in between.
> Looking at the timestamps later, I found that this is exactly the effect of "cpu.max": "50000 100000":
> ![image](https://hackmd.io/_uploads/ByybkDuVlg.png)
> One such interval (load plus gap) is exactly 100 milliseconds.

kernelshark can plot the trace, but schedgraph fails with:
```shell
Fatal error: exception Failure("hd")
```
> Known issues so far:
> * schedgraph cannot be used when only a single core is recorded. This problem was also raised in [設計實驗並以 schedgraph 展現 CFS 排程行為](https://hackmd.io/@sysprog/BJh9FdlS2#Block-the-FIFO).

To provide a task that does nothing, add the corresponding file (`idle_dummy.c`) to the experiments directory and modify start.py as follows:
```diff
 prepare_cmds = [
     f"make -C {root_dir} tools",
     f"gcc {expr_dir}cpu_bound.c -O2 -o cpu_bound",
     f"gcc {expr_dir}io_block.c -O2 -o io_block",
     f"gcc {expr_dir}yield.c -O2 -o yield",
+    f"gcc {expr_dir}idle_dummy.c -O2 -o idle_dummy",
 ]

 def do_cleanup() -> None:
     os.remove("cpu_bound")
     os.remove("io_block")
     os.remove("yield")
+    os.remove("idle_dummy")
```
The plot produced by schedgraph:
![image](https://hackmd.io/_uploads/HkidlPdEel.png)
> To show only specific PIDs, add these options when plotting:
> ```shell
> $ ./dat2graph2 --pid 11959 --pid 11960 --pid 11961 --pid 11962 --color-by-pid xxx.dat
> ```

The labels in the figure are PIDs, which is not very readable. Adding the `--save-tmp` option when plotting produces an xxx.jgr file, whose labels can then be rewritten in bulk with a script:
```shell
$ sed -i 's/label : 11959:/label : child1-1/' xxx.jgr
$ sed -i 's/label : 11960:/label : child1-2/' xxx.jgr
$ sed -i 's/label : 11961:/label : child1-3/' xxx.jgr
$ sed -i 's/label : 11962:/label : child2-1/' xxx.jgr
```
Then convert the file to xxx.eps and on to xxx.pdf:
```shell
$ jgraph xxx.jgr > xxx.eps
$ epstopdf xxx.eps
```
The result:
![image](https://hackmd.io/_uploads/HkrdvDdEeg.png)
:::

#### Experiment 1
There are two child groups: child1 (three cpu-bound tasks) and child2 (one cpu-bound task). Both have `cpu.max` set to 50000 100000, which means that within one period (100000 microseconds) each group may use 50000 microseconds of CPU time.
> All tasks keep the default weight.
```graphviz
digraph G {
    rankdir=TB;
    node [shape=box];
    "/sys/fs/cgroup" -> "child1";
    "/sys/fs/cgroup" -> "child2";
    subgraph cluster_child1 {
        label = "child1 (50%)";
        style=filled;
        color=lightgrey;
        node [shape=circle, style=filled, color=white];
        task_2; task_3; task_50;
    }
    subgraph cluster_child2 {
        label = "child2 (50%)";
        style=filled;
        color=lightgrey;
        node [shape=circle, style=filled, color=white];
        task_1;
    }
    "child1" -> {task_2 task_3 task_50} [style=dotted];
    "child2" -> {task_1} [style=dotted];
}
```
##### Single core
![image](https://hackmd.io/_uploads/rkmBv12Egx.png)
![image](https://hackmd.io/_uploads/S1E_v1hEgg.png)
##### Dual core
![image](https://hackmd.io/_uploads/BJIMdk3Ell.png)
![image](https://hackmd.io/_uploads/BkvsPknVgx.png)
##### Quad core

#### Experiment 2
Following the explanation in chapter 4 of the book, the number of tasks in child1 is increased to 49 while child2 keeps one.
##### Single core
![image](https://hackmd.io/_uploads/BJvepJ3Ege.png)
![image](https://hackmd.io/_uploads/S1pqjJhVxg.png)
![image](https://hackmd.io/_uploads/HJbFak34lg.png)
![image](https://hackmd.io/_uploads/rktkAJ3Eel.png)

Zooming in on the orange region shows that within a 1-second epoch there are ten 0.05-second bursts of load on cpu 1.
![image](https://hackmd.io/_uploads/Hkb26y34ll.png)

This again shows that the two groups are treated fairly.
##### Dual core
![image](https://hackmd.io/_uploads/r1untU9Ngl.png)

With more tasks the behaviour also matches expectations. Within 100000 microseconds (the orange mask), the child2 task runs for 50000 microseconds and is throttled for 50000 microseconds, which matches the `cpu.max` setting.
![image](https://hackmd.io/_uploads/SJScSnKVel.png)

:::info
Putting the findings above together:
1. On a single core the behaviour matches expectations regardless of the number of tasks.
2. With very few tasks, something odd happens on multiple cores.
3. In the dual-core case of Experiment 1, why is the child2 task not placed on a core of its own, as it is in the quad-core case?
4. * vruntime -> the group perspective and an introduction to the hierarchy
5. The renaming of the members
6. https://youtu.be/DAqjl_x4hZc
:::

At the meeting with the instructor on 6/26 we decided to turn HT off first and run the experiments on the host.
> I turned HT off in the BIOS. The following command verifies the setting:
> ```shell
> # before turning HT off
> $ lscpu | grep "Thread"
> Thread(s) per core: 2
>
> # after turning HT off
> $ lscpu | grep "Thread"
> Thread(s) per core: 1
> ```

#### Experiment 3 (same hierarchy as Experiment 1, only with HT turned off)
##### Single core
![image](https://hackmd.io/_uploads/ByEVXb3Nel.png)
##### Dual core
![image](https://hackmd.io/_uploads/SJX5X-2Nee.png)
##### Quad core
![image](https://hackmd.io/_uploads/SymCQW24le.png)

#### Experiment 4
##### Single core
![image](https://hackmd.io/_uploads/SJ5GRZ3Eeg.png)
##### Dual core
![image](https://hackmd.io/_uploads/SyObhM3Nlx.png)
![image](https://hackmd.io/_uploads/r1XehGhEeg.png)
##### Quad core
![image](https://hackmd.io/_uploads/Bk-Q6M34le.png)
![image](https://hackmd.io/_uploads/Bydi2Mn4ll.png)

#### Experiment 5
##### Single core
![image](https://hackmd.io/_uploads/r1iHhbn4lx.png)
##### Dual core
![image](https://hackmd.io/_uploads/SyoHabnNel.png)
##### Quad core
![image](https://hackmd.io/_uploads/Syr1tbnNle.png)

Here too something odd happens before 0.1 s, while everything from 0.1 s to 0.7 s matches expectations. The second half also shows that even when a CPU is idle, tasks tend not to be migrated to it.
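
As a cross-check for the traces in this section, the `cpu.max` values used in the experiments can be turned into an expected run/throttle pattern with a few lines of arithmetic. The sketch below is not part of the experiment tooling. The function names are made up for illustration, and the model assumes an idealized case: a single group that is always runnable and pinned to one CPU, ignoring competition from other groups and any burst or slack handling in the kernel.

```python
from typing import Optional, Tuple


def parse_cpu_max(value: str) -> Tuple[Optional[int], int]:
    """Split a cpu.max string such as "50000 100000" into (quota_us, period_us).

    A quota of None stands for "max" (no limit). Both fields are required here,
    which is a simplification made for this sketch.
    """
    quota_str, period_str = value.split()
    quota = None if quota_str == "max" else int(quota_str)
    return quota, int(period_str)


def cpu_share(value: str) -> Optional[float]:
    """Quota divided by period, in units of CPUs; None when unlimited."""
    quota, period = parse_cpu_max(value)
    return None if quota is None else quota / period


def run_and_throttle_on_one_cpu(value: str) -> Tuple[int, int]:
    """(run_us, throttled_us) per period for an always-runnable group on one CPU."""
    quota, period = parse_cpu_max(value)
    run = period if quota is None else min(quota, period)
    return run, period - run


for setting in ("50000 100000", "200000 100000", "max 100000"):
    print(setting, cpu_share(setting), run_and_throttle_on_one_cpu(setting))
```

For `50000 100000` the model gives a share of 0.5 with 50000 microseconds running and 50000 microseconds throttled per period, which is the pattern seen in the zoomed-in traces. For `200000 100000` the share is 2.0, so a group confined to one CPU never reaches its quota within a period and is not throttled.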