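# Control Groups Explained

:::warning
Mainly for cgroups v2
:::

## Overview

### What are cgroups

According to the official definition, a cgroup associates a set of tasks with a set of parameters for one or more subsystems.

A **subsystem** is a module that uses the task grouping facilities provided by cgroups to treat groups of tasks in particular ways. It is also called a **resource controller**, and it can impose resource limits on one or more cgroups, that is, on groups of processes.

The structure of cgroups can be pictured as a tree, called a **hierarchy**. Every task is in exactly one cgroup of the hierarchy, and the hierarchy is associated with a set of subsystems. Every hierarchy has its own cgroup virtual filesystem instance.

### Why cgroups are needed

The Linux kernel has seen many efforts to provide process aggregations, mainly for resource-tracking purposes, including cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server namespaces. All of them need to group or partition processes, with newly created processes ending up in the same cgroup as their parent.

Hierarchies allow each subsystem to divide tasks into cgroups in a completely different way, which means hierarchies are parallel and independent of one another. We can give each subsystem its own hierarchy, or put all subsystems in the same hierarchy.

### How it is implemented

`struct cgroup` is defined in [/include/linux/cgroup-defs.h](https://elixir.bootlin.com/linux/latest/source/include/linux/cgroup-defs.h#L397). A few members deserve special attention:

* `self`: the cgroup's own css (cgroup subsystem state)
* `subsys`: the css of each subsystem attached to this cgroup

A few other points are worth noting:

* Every task in the system has a reference-counted pointer to a `css_set`.
* A `css_set` contains a set of reference-counted pointers to `cgroup_subsys_state` objects, one for each subsystem registered in the system. A task has no pointer or other member that links it directly to its cgroup in a hierarchy; the cgroup is found through the `cgroup_subsys_state` objects. This is because reading the subsystem state usually happens in performance-critical code, whereas moving a task between cgroups is comparatively rare. The definition of `css_set` is in [/include/linux/cgroup-defs.h](https://elixir.bootlin.com/linux/latest/source/include/linux/cgroup-defs.h#L217).
* A cgroup hierarchy filesystem can be mounted so that userspace can browse and manipulate it.

The rest of the system has a few simple hooks:

* At boot, `init/main.c` initializes the root cgroups and the `css_set`.
* On fork and exit, a task is attached to or detached from its `css_set`.

Note that if a cgroup filesystem has child cgroups, the corresponding hierarchy stays active even after it is unmounted; if it has none, the hierarchy is deactivated. All cgroup operations go through the cgroup filesystem; there are no dedicated system calls. Each directory under the cgroup filesystem represents a cgroup and contains the following files:

* `tasks`: the tasks in this cgroup, not necessarily sorted. Writing a thread ID into this file moves that thread into this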
cgroup.
* `cgroup.procs`: the thread groups in this cgroup. The list is not guaranteed to be sorted and may contain duplicates; sorting and uniquifying are left to userspace.
* `notify_on_release`
* `release_agent`

:::info
What is process migration?
:::

When a task is moved from one cgroup to another, it gets a new pointer to a `css_set`. If a `css_set` with the desired collection of cgroups already exists, it is reused; otherwise a new `css_set` is allocated. The existing `css_set` is located through a hash table lookup.

### `struct cgroup_subsys_state`

Also called css, it stores per-cgroup, per-subsystem state. The following field deserves attention:

* `sibling`: a doubly linked list node that chains together all children of this node's parent. `cgroup_for_each_live_child()` walks the children of a given cgroup.

### `notify_on_release`

`notify_on_release` is off by default. Once it is enabled, whenever the last task leaves the cgroup and the last child cgroup of that cgroup is removed, the kernel runs the command written in the `release_agent` file.

## Usage Example

### Basic Usage

We can perform a range of operations on cgroups through the cgroup virtual filesystem. First, if we want a place to mount cgroup hierarchies covering all subsystems, we can mount a tmpfs at `/sys/fs/cgroup` (this is the first step of the cgroup v1 procedure):

```shell
$ sudo mount -t tmpfs cgroup_root /sys/fs/cgroup
```

Our system most likely has `/sys/fs/cgroup` already, so the step above is not needed. For each resource we want to control we should create a cgroup hierarchy, so once the `tmpfs` is mounted at `/sys/fs/cgroup` we create a directory for each resource:

```shell
$ sudo mkdir /sys/fs/cgroup/rg1
```

Once it is created we can enter the directory and look at its contents:

```shell
$ cd /sys/fs/cgroup/rg1
$ ls
cgroup.controllers      cpu.stat             memory.reclaim
cgroup.events           io.pressure          memory.stat
cgroup.freeze           memory.current       memory.swap.current
cgroup.kill             memory.events        memory.swap.events
cgroup.max.depth        memory.events.local  memory.swap.high
cgroup.max.descendants  memory.high          memory.swap.max
cgroup.pressure         memory.low           memory.swap.peak
cgroup.procs            memory.max           memory.zswap.current
cgroup.stat             memory.min           memory.zswap.max
cgroup.subtree_control  memory.numa_stat     pids.current
cgroup.threads          memory.oom.group     pids.events
cgroup.type             memory.peak          pids.max
cpu.pressure            memory.pressure      pids.peak
```

The corresponding files have all been created for us. If we create another cgroup under this one:

```shell
$ sudo mkdir richard_cgroup
$ cd richard_cgroup
$ ls
cgroup.controllers  cgroup.freeze  cgroup.max.depth        cgroup.pressure  cgroup.stat             cgroup.threads  cpu.pressure  io.pressure
cgroup.events       cgroup.kill    cgroup.max.descendants  cgroup.procs     cgroup.subtree_control  cgroup.type     cpu.stat      memory.pressure
```

The nested cgroup only has the core `cgroup.*` files plus a few statistics files, because no controller has been enabled in the `cgroup.subtree_control` of `rg1` yet.

## Kernel API

### Overview

Every subsystem that hooks into the generic cgroup system creates a `cgroup_subsys` object, defined in [/include/linux/cgroup-defs.h](https://elixir.bootlin.com/linux/latest/source/include/linux/cgroup-defs.h#L688). It contains many callback functions that operate on the cgroup, while the subsystem ID is assigned automatically by the system at boot time. A subsystem's name should be no longer than `MAX_CGROUP_TYPE_NAMELEN`. In addition, every cgroup object created by the system has an array of pointers; these pointers are managed entirely by the subsystems, and the generic cgroup code never touches them.

### Synchronization

The cgroup system has a global mutex called `cgroup_mutex`. Any caller that wants to modify the cgroup system must take this mutex first, using `cgroup_lock() / cgroup_unlock()` to acquire and release it.

:::info
In `/kernel/cgroup/cgroup.c`, `cgroup_mutex` is always used through a macro called `lockdep_assert_held`. What does this macro do? Why not wait on the mutex in the traditional way? See [/include/linux/lockdep.h](https://elixir.bootlin.com/linux/latest/source/include/linux/lockdep.h), which says to consult [lockdep-design.rst](https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst). When should `cgroup_lock()` be used, and when `lockdep_assert_held()`?
:::

### Subsystem API

`struct cgroup_subsys` defines a series of callback functions that make up the subsystem API. They are introduced one by one below.

* `struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)`
  The caller must hold `cgroup_mutex`. Allocates the subsystem state object for the cgroup.
* `int css_online(struct cgroup *cgrp)`
  The caller must hold `cgroup_mutex`. Called once `cgrp` has finished allocating state for all subsystems and is visible to `cgroup_for_each_child/descendant_*()`. It is mainly used to implement reliable state sharing and propagation along the hierarchy.
* `void css_offline(struct cgroup *cgrp)`
  The caller must hold `cgroup_mutex`. Called if and only if `css_online()` succeeded. It marks the beginning of the end for the cgroup: the subsystem should start dropping every reference it holds, and once all references are dropped removal proceeds to the next step, `css_free()`. After this callback returns, the cgroup should be considered dead as far as the subsystem is concerned.

## Cgroups Interface

### Two parts of cgroup

The cgroup mechanism has two main parts:

* core: sets up and maintains the cgroup hierarchy, and is integrated with eBPF
* controllers: distribute specific resources along the hierarchy

The cgroup hierarchy is managed in the system as a virtual filesystem mounted at some location, just like procfs.

### Mounted hierarchy

The listing below was captured under a Chinese locale, hence the `六` in the month column.

```shell
/sys/fs/cgroup$ ls -la
total 0
dr-xr-xr-x 12 root root 0 六 11 15:07 .
drwxr-xr-x 8 root root 0 六 19 13:23 ..
-r--r--r-- 1 root root 0 六 19 13:23 cgroup.controllers
-rw-r--r-- 1 root root 0 六 19 13:23 cgroup.max.depth
-rw-r--r-- 1 root root 0 六 19 13:23 cgroup.max.descendants
-rw-r--r-- 1 root root 0 六 19 13:23 cgroup.pressure
-rw-r--r-- 1 root root 0 六 19 13:23 cgroup.procs
-r--r--r-- 1 root root 0 六 19 13:23 cgroup.stat
-rw-r--r-- 1 root root 0 六 15 07:28 cgroup.subtree_control
-rw-r--r-- 1 root root 0 六 19 13:23 cgroup.threads
-rw-r--r-- 1 root root 0 六 19 13:23 cpu.pressure
-r--r--r-- 1 root root 0 六 19 13:23 cpuset.cpus.effective
-r--r--r-- 1 root root 0 六 19 13:23 cpuset.cpus.isolated
-r--r--r-- 1 root root 0 六 19 13:23 cpuset.mems.effective
-r--r--r-- 1 root root 0 六 19 13:23 cpu.stat
-r--r--r-- 1 root root 0 六 19 13:23 cpu.stat.local
drwxr-xr-x 2 root root 0 六 11 15:07 dev-hugepages.mount
drwxr-xr-x 2 root root 0 六 11 15:07 dev-mqueue.mount
drwxr-xr-x 2 root root 0 六 11 15:07 init.scope
-rw-r--r-- 1 root root 0 六 19 13:23 io.cost.model
-rw-r--r-- 1 root root 0 六 19 13:23 io.cost.qos
-rw-r--r-- 1 root root 0 六 19 13:23 io.pressure
-rw-r--r-- 1 root root 0 六 19 13:23 io.prio.class
-r--r--r-- 1 root root 0 六 19 13:23 io.stat
-r--r--r-- 1 root root 0 六 19 13:23 memory.numa_stat
-rw-r--r-- 1 root root 0 六 19 13:23 memory.pressure
--w------- 1 root root 0 六 19 13:23 memory.reclaim
-r--r--r-- 1 root root 0 六 19 13:23 memory.stat
-rw-r--r-- 1 root root 0 六 19 13:23 memory.zswap.writeback
-r--r--r-- 1 root root 0 六 19 13:23 misc.capacity
-r--r--r-- 1 root root 0 六 19 13:23 misc.current
drwxr-xr-x 2 root root 0 六 11 15:07 proc-sys-fs-binfmt_misc.mount
drwxr-xr-x 2 root root 0 六 11 15:07 sys-fs-fuse-connections.mount
drwxr-xr-x 2 root root 0 六 11 15:07 sys-kernel-config.mount
drwxr-xr-x 2 root root 0 六 11 15:07 sys-kernel-debug.mount
drwxr-xr-x 2 root root 0 六 11 15:07 sys-kernel-tracing.mount
drwxr-xr-x 68 root root 0 六 19 12:30 system.slice
drwxr-xr-x 4 root root 0 六 19 13:22 user.slice
```

This filesystem lists all the controllers
and the way to use them is to create a process first and then add it to a cgroup, after which these controllers can be used to control that process.

* `blkio` limits per-cgroup block I/O performance
* `cpu` Enables setting of scheduling preferences on a per-cgroup basis, including the distribution of CPU cycles
* `cpuset` Facilitates assigning a set of CPUs and memory nodes to cgroups
    * Tasks in a cpuset cgroup may only be scheduled on CPUs assigned to that cpuset
    * Very valuable in NUMA systems where the CPU and memory location need to be controlled

  Pay particular attention here to the L1/L2 cache, the L3 cache, and memory. If the processor tries to fetch data that sits in the L1 cache, we get it at close to processor speed. The problem is that L1 cache size is strictly limited because it is very expensive. An L1 miss that hits in L2 costs roughly 5~10 clock cycles; an L2 miss that has to go to a farther cache costs roughly 20~25 clock cycles; and going all the way to memory costs roughly 200~300 clock cycles. Suppose we are on an x86 superscalar processor: even a very old one can execute 10 instructions at once, so if each of them has to go to memory we lose about 2000 clock cycles in total (10 × 200). A single fetch is that costly, and the cost cannot be recovered. The purpose of cpusets is to pin certain processes to specific processor cores, which also keeps the data those processes need in the corresponding caches.
* `cpuacct` Provides per-cgroup CPU usage accounting
* `devices` Controls the ability of tasks to create or use device nodes using either a blacklist or whitelist
* `freezer` Roughly, forces all processes associated with the cgroup to sleep, at which point they can be migrated to another processor or handled in some other way.
* `hugetlb`
* `memory` Allows memory, kernel memory, and swap usage to be tracked and limited
* `net_cls` Provides an interface for tagging packets based on the sender cgroup
* `net_prio` Allows setting network traffic priority on a per-cgroup basis
* `perf_event`
* `pids`
* `rdma`
* `unified`

## Cgroups v1

Each controller can be mounted on a separate cgroup filesystem with its own hierarchy. Several controllers can also be mounted together on the same cgroup filesystem, which means those controllers share the same hierarchical organization of processes. The directory tree is the cgroup hierarchy: each cgroup is represented by a directory, and each of its child cgroups is a child directory. For example, `user/joe/1.session` represents the cgroup `1.session`, which is a child of the cgroup `joe`, which in turn is a child of `user`.

### Tasks vs. Processes

In cgroup v1 there is a clear line between a `process` and a `task`: a process consists of several tasks (threads), and cgroup v1 makes it possible to manipulate the cgroup membership of one specific thread. In other words, the threads of a single process can be spread across different cgroups. This ability causes problems in some cases. In the `memory` controller, for example, all threads of a process share the same address space, so splitting them up makes no sense. For this reason the ability was removed in cgroup v2, which offers a more restricted form of it called "thread mode".

## Cgroups v2

In this design, all mounted controllers reside in a single unified hierarchy. Different controllers may still be mounted simultaneously under v1 and v2 hierarchies, but a given controller can be mounted in only one of them, v1 or v2, at any one time. The behaviour of cgroup v2 can be summarized as follows:

* It provides a unified hierarchy against which all controllers are mounted.
* "Internal" processes are not permitted. Processes may only reside in leaf nodes, as explained below.
* Active cgroups are specified through `cgroup.controllers` and `cgroup.subtree_control`.
* The `tasks` file and the `cgroup.clone_children` file have been removed.
* The `cgroup.events` file provides a better mechanism for notification of empty cgroups.

### Basic operation

The following commands mount the cgroup v2 hierarchy at any location we like:

```shell
$ mkdir tmp_cgroupfs
$ sudo mount -t cgroup2 none tmp_cgroupfs
$ cd tmp_cgroupfs; ls
cgroup.controllers      cgroup.stat             cpuset.cpus.isolated   dev-mqueue.mount  io.prio.class     memory.stat                    sys-fs-fuse-connections.mount  test_cgroup
cgroup.max.depth        cgroup.subtree_control  cpuset.mems.effective  init.scope        io.stat           memory.zswap.writeback         sys-kernel-config.mount        user.slice
cgroup.max.descendants  cgroup.threads          cpu.stat               io.cost.model     memory.numa_stat  misc.capacity                  sys-kernel-debug.mount
cgroup.pressure         cpu.pressure            cpu.stat.local         io.cost.qos       memory.pressure   misc.current                   sys-kernel-tracing.mount
cgroup.procs            cpuset.cpus.effective   dev-hugepages.mount    io.pressure       memory.reclaim    proc-sys-fs-binfmt_misc.mount  system.slice
```

The cgroup2 filesystem has a special magic number, `0x63677270`, which spells "cgrp". All controllers that support v2 are automatically bound under this path. The interface also remains backward compatible with v1 controllers. A controller can be moved to a different hierarchy only after it is no longer referenced in its current hierarchy. After the previous hierarchy is unmounted, the controller may not show up in the new hierarchy immediately, because per-cgroup controller state is destroyed asynchronously and the controller may have lingering references. Likewise, moving a controller from the unified hierarchy to another hierarchy requires the controller to be completely disabled
and that process takes time, so it may be a while before the controller becomes visible in the new hierarchy. If there are dependencies between controllers, the other controllers need to be disabled as well.

cgroup v2 provides the following mount options:

* **nsdelegate**: treat cgroup namespaces as delegation boundaries. This option is system wide and can only be set at mount time or through a remount from the init namespace.
* **favordynmods**: reduce the latency of dynamic cgroup modifications such as task migrations and controller on/offs, at the cost of making hot path operations such as forks and exits more expensive.
* **memory_localevents**: populate `memory.events` only with data for the current cgroup, not for any of its subtrees.
* **memory_recursiveprot**: recursively apply `memory.min` and `memory.low` protection to entire subtrees, without requiring explicit downward propagation into leaf cgroups. This protects entire subtrees from one another while retaining free competition within each subtree.
* **memory_hugetlb_accounting**: count HugeTLB memory usage towards the cgroup's overall memory usage for the memory controller. It may regress existing setups. A few caveats to keep in mind:
    * There is no HugeTLB pool management in the memory controller. The pre-allocated pool does not belong to anyone. In particular, when a new HugeTLB folio is allocated to the pool, the memory controller does not account for it; it is only charged when it is actually used. Overall, HugeTLB pool management should be handled by some other mechanism.
    * (TODO: understand HugeTLB)

### Organizing Processes and Threads

Initially, all processes belong to the root cgroup. A child cgroup is created by making a new directory under the root cgroup's directory. Every cgroup has a read-writable interface file called `cgroup.procs`. Reading it lists the PIDs of all processes that belong to the cgroup. Note that the PIDs are not sorted, and the same PID may show up more than once if the process was moved to another cgroup and then back, or if the PID was recycled while reading.

One way to move a process into a cgroup is to write its PID into that cgroup's `cgroup.procs` file. A single `write(2)` call migrates only one process. If the process has multiple threads, writing the ID of any of them moves the whole thread group into the cgroup. When a process forks, the child is created in the cgroup the parent belonged to at the time of the fork. After exit, a process stays associated with its cgroup until it is reaped. A zombie process, however, does not appear in `cgroup.procs` and therefore cannot be moved to another cgroup.

A cgroup with no child cgroups and no live processes can be destroyed by removing its directory with `rmdir`. A cgroup that has no children and contains only zombie processes can be removed the same way.

```shell
$ sudo rmdir $CGROUP_NAME
```

In addition, `/proc/$PID/cgroup` lists the cgroup membership of process `$PID`. The cgroup v2 entry always has the form `0::$PATH`:

```shell
$ sudo cat /proc/$PID/cgroup
...
0::/test-cgroup/test-cgroup-nested
```

If the process becomes a zombie and its cgroup is subsequently removed, the file shows:

```shell
$ sudo cat /proc/$PID/cgroup
...
0::/test-cgroup/test-cgroup-nested (deleted)
```

### Threads

By default, all threads of a process belong to the same cgroup, which also serves as the resource domain that hosts resource consumption not specific to a process or thread. Thread mode lets threads be spread across a subtree while still sharing a common resource domain. Controllers that support thread mode are called threaded controllers; those that do not are called domain controllers.

Marking a cgroup as threaded makes it join the resource domain of its parent as a threaded cgroup. The parent may itself be a threaded cgroup whose resource domain lies further up the hierarchy. The root of a threaded subtree, that is, the nearest ancestor that is not threaded, is called the threaded domain or thread root, and it serves as the resource domain for the entire subtree.

Inside a threaded subtree, the threads of a process can be placed in different cgroups and are not subject to the no internal process constraint: threaded controllers can be enabled on non-leaf cgroups whether or not they contain threads. Because the threaded domain cgroup hosts all domain resource consumption of its subtree, it is considered to have internal resource consumption whether or not it contains processes, and it cannot have populated child cgroups that are not threaded. The root cgroup is not subject to the no internal process constraint, so it can serve both as a threaded domain and as a parent to domain cgroups.

The `cgroup.type` file shows the current operation mode of a cgroup: a normal domain, a domain serving as the domain of a threaded subtree, or a threaded cgroup. A cgroup always starts out as a domain cgroup and is made threaded with the following command:

```shell
# echo threaded > cgroup.type
```

Once a cgroup is threaded it cannot be turned back into a domain cgroup. The following conditions must be met to enable thread mode:

* Because the cgroup joins its parent's resource domain, the parent must be either a valid (threaded) domain or a threaded cgroup.
* If the parent is an unthreaded domain, it must not have any domain controllers enabled or any populated domain children.

Consider the following cgroup topology:

```shell
A (threaded domain) - B (threaded) - C (domain, just created)
```

C has just been created, so it is a domain cgroup, which means its parent would have to be a domain that can host child domains. That is not the case here, so C is invalid in its current state and cannot be used until it is turned into a threaded cgroup.

When a child of a domain cgroup becomes threaded, or when the cgroup contains processes while threaded controllers in its `cgroup.subtree_control` file are
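enabled, that domain cgroup turns into a threaded domain. Once these conditions clear, the threaded domain reverts to a normal domain on its own.

A threaded domain cgroup serves as the resource domain for the whole subtree, and because threads can be spread across the subtree, all processes are considered to be in the threaded domain cgroup. Its `cgroup.procs` file lists the PIDs of all processes in the subtree. Only threaded controllers can be enabled in a threaded subtree. Once enabled, a threaded controller accounts for and controls only the resource consumption associated with the threads in the cgroup and its descendants. Any consumption that is not thread-specific is attributed to the threaded domain cgroup. The following controllers are currently threaded and can be enabled in a threaded cgroup:

* cpu
* cpuset
* perf_event
* pids

### Populated Notification

Every non-root cgroup has a `cgroup.events` file containing a `populated` field that indicates whether the cgroup's sub-hierarchy has live processes in it. The value is `0` if there is no live process in the cgroup or its descendants, and `1` otherwise. poll and notify events are triggered when the value changes. This can be used, for example, to start a clean-up after all processes of a sub-hierarchy have exited. Populated state updates and notifications are recursive. In the following example, the numbers in parentheses are the number of processes in each cgroup:

```shell
A(4) - B(0) - C(1)
            \ D(0)
```

Initially the populated fields of A, B and C are 1 while D's is 0. After the one process in C exits, the populated fields of B and C flip to 0, and file modified events are generated on the `cgroup.events` files of both cgroups.

### Controlling Controllers

#### Enabling and Disabling

Every cgroup has a `cgroup.controllers` file that lists all controllers available to the cgroup.

```shell
$ cat cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
```

No controller is enabled by default. Controllers are enabled and disabled by writing to `cgroup.subtree_control`.

```shell
$ echo "+cpu +memory -io" > cgroup.subtree_control
```

Enabling a controller in a cgroup means that the distribution of the corresponding resource across its immediate children is controlled by that controller.

```
A(cpu,memory) - B(memory) - C()
                          \ D()
```

In the layout above, A has the cpu and memory controllers enabled, so A controls the distribution of CPU cycles and memory to B, while B controls the distribution of memory to C and D. Enabling a controller in a cgroup creates the controller's interface files in its immediate children. For example, enabling the cpu controller in B creates interface files prefixed with `cpu.` in C and D. This means that the controller interface files, that is, the files whose prefix is not `cgroup.`, are owned by the parent cgroup rather than the cgroup itself.

#### Top-down Constraint

Resources are distributed top-down: a cgroup can distribute a resource further to its children only if its parent has distributed that resource to it. Put differently, the `cgroup.subtree_control` of a non-root cgroup can only contain controllers that are listed in its parent's `cgroup.subtree_control`. From the controller's point of view, it can be enabled in a cgroup only if it is enabled in the parent, and it cannot be disabled while one or more children still have it enabled.

#### No Internal Process Constraint

A non-root cgroup can distribute domain resources to its children only when it has no processes of its own. In other words, only domain cgroups that contain no processes can have domain controllers enabled in their `cgroup.subtree_control`. This guarantees that, when a domain controller looks at the part of the hierarchy that has it enabled, processes are always on the leaves only. It rules out competition between child cgroups and processes living in their parent.

The root cgroup is exempt from this restriction. It can contain processes and anonymous resource consumption (not associated with any other cgroup), and each controller has to treat the root cgroup in its own way. Note that the restriction does not apply when a cgroup has no controller enabled in its `cgroup.subtree_control`; otherwise it would be impossible to create children of a populated cgroup. To control the resource distribution of a cgroup, the cgroup must create children and move all of its processes into them before enabling controllers in its `cgroup.subtree_control`.

### Delegation

#### Model of Delegation

A cgroup can be delegated in two ways.

1. Delegate the cgroup to a less privileged user by granting that user write access to the directory and to its `cgroup.procs`, `cgroup.threads` and `cgroup.subtree_control` files.
2. If the `nsdelegate` mount option is set, delegate it automatically to a cgroup namespace when the namespace is created.

Note that the delegatee should not be able to write to the resource control interface files, because the files in a given directory control the distribution of the parent's resources. With the first method, this is achieved by not granting the delegatee access to these files. With the second, the kernel rejects writes from inside the namespace to every file on the namespace root other than `cgroup.procs` and `cgroup.subtree_control`.

The end result is the same for both delegation types. Once delegated, the user can build a sub-hierarchy under the directory and manage the resources and processes under it. The limits and settings of all resource controllers are hierarchical, and nothing configured in the sub-hierarchy can escape the resource restrictions imposed by the parent. Currently cgroup places no limit on the number of cgroups in a delegated sub-hierarchy or on its nesting depth; this may be limited in the future.

#### Delegation Containment

A delegated sub-hierarchy is contained in the sense that the delegatee cannot move processes into or out of it. For delegation to a less privileged user, both of the following conditions must hold before a process with a non-root euid can migrate a target process into a cgroup by writing its PID to "cgroup.procs".

* The writer must have write access to the "cgroup.procs" file
* The writer must have write access to the "cgroup.procs" file of the common ancestor of the source and destination cgroups

A delegatee who satisfies these conditions can move processes around freely inside the delegated sub-hierarchy, but cannot move them outside of it. Take the following example:

```
~~~~~~~~~~~~~ - C0 - C00
~ cgroup    ~      \ C01
~ hierarchy ~
~~~~~~~~~~~~~ - C1 - C10
```

We delegate cgroups C0 and C1 to user U0, who creates C00 and C01 under C0 and C10 under C1. What happens if U0 now tries to write the PID of a process in C10 into "C00/cgroup.procs"? U0 does have write access to "C00/cgroup.procs", but the common ancestor of the source cgroup C10 and the destination cgroup C00 has not been delegated, so U0 has no write access to that ancestor's "cgroup.procs" and the write fails with `-EACCES`.

### Guidelines

#### Organize Once and Control

Migrating a task between cgroups is a relatively expensive operation, and stateful resources such as memory do not move along with it. Moving processes between cgroups is therefore discouraged. A task should be assigned to a suitable cgroup at startup according to the system's requirements and structure; if dynamic adjustment is needed later, it should be done by changing the controller configuration.

## Resource Distribution Models

### Weights

A parent's resource is distributed by adding up the weights of all active children and giving each child the fraction `weight / sum`. This model is usually used only for stateless resources. Weights default to 100 and range over [1, 10000]. As long as a weight is within this range every configuration combination is valid, so there is no reason to reject configuration changes or process migrations. A classic example is "cpu.weight", which distributes CPU cycles proportionally to active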
children.

### Limits

A child can consume at most the configured amount of the resource. Limits can be over-committed, meaning the sum of the children's limits may exceed the amount of the resource actually available to the parent. Limits default to "max", which is a no-op, and range over [0, max]. "io.max" is a classic example.

### Protections

(TODO)

### Allocations

A cgroup is exclusively allocated a certain amount of a finite resource. Allocations cannot be over-committed: the sum of the children's allocations cannot exceed the amount of the resource available to the parent. Allocations range over [0, max] and default to 0, which means no resource. Because allocations cannot be over-committed, some configuration combinations are invalid and must be rejected, and if the resource is mandatory for a process to run, process migrations may be rejected too. "cpu.rt.max" is an example; it hard-allocates realtime slices.

## Interface files

### Core Interface Files

* **cgroup.type**
  A read-write single value file that exists on non-root cgroups. When read, it indicates the current type of the cgroup, which can be one of the following:
    * **domain** : A normal valid domain cgroup
    * **domain threaded** : A threaded domain cgroup which is serving as the root of a threaded subtree
    * **domain invalid** : A cgroup which is in an invalid state. It can't be populated or have controllers enabled. It may be allowed to become a threaded cgroup
    * **threaded** : A threaded cgroup which is a member of a threaded subtree

  A cgroup can be turned into a threaded cgroup by writing "threaded" to this file.
* **cgroup.procs**
  A read-write newline-separated values file that exists on all cgroups. When read, it lists the PIDs of the processes that belong to the cgroup, one per line. The PIDs are not sorted and the same PID may appear more than once. A process can be migrated by writing its PID to this file, and the writer must satisfy the following conditions:
    * Have write access to the "cgroup.procs" file
    * Have write access to the "cgroup.procs" file of the common ancestor of the source and destination cgroups

  In a threaded cgroup, reading this file fails with `EOPNOTSUPP` because all the processes belong to the thread root. Writing is still supported.
* **cgroup.threads**
  A read-write newline-separated values file that exists on all cgroups. It behaves essentially like cgroup.procs, except that it operates at thread granularity.
* **cgroup.controllers**
  A read-only space separated values file that exists on all cgroups. It shows all controllers available to the cgroup, in no particular order.
* **cgroup.subtree_control**
  A read-write space separated values file that exists on all cgroups. It starts out empty. When read, it shows the controllers that are enabled to control resource distribution from the cgroup to its children. Controllers are enabled or disabled by writing their names prefixed with "+" or "-".
* **cgroup.events**
  A read-only flat-keyed file that exists only on non-root cgroups. A value change in this file generates a file modified event.
    * populated : 1 if the cgroup or its descendants contains any live processes, otherwise 0.
    * frozen : 1 if the cgroup is frozen, otherwise 0.
* **cgroup.stat**
  A read-only flat-keyed file.
    * **nr_descendants** : Total number of visible descendant cgroups
    * **nr_dying_descendants** : Total number of dying cgroups. A cgroup becomes dying after being deleted by a user. It stays in the dying state for some time before it is completely destroyed. A process cannot enter a dying cgroup under any circumstances, and a dying cgroup cannot be revived. Note that a dying cgroup can still consume system resources, as long as it stays within its limits.
* **cgroup.freeze**
  A read-write single value file that exists only on non-root cgroups. Allowed values are "0" and "1"; the default is "0". Writing "1" to this file freezes the cgroup and all of its descendants, which means every process in them is stopped until the cgroup is thawed. Freezing a cgroup may take some time; once it completes, the "frozen" value in the "cgroup.events" file is updated to "1" and the corresponding notification is issued.

### Controllers

#### CPU

The "cpu" controller regulates the distribution of CPU cycles. It implements the weight and absolute bandwidth limit models for the normal scheduling policy, and the absolute bandwidth allocation model for the realtime scheduling policy. In all of these models, cycle distribution is defined on a temporal basis only and does not take into account the frequency at which tasks run. Utilization clamping lets the schedutil cpufreq governor be told the minimum and maximum frequency each CPU should provide.

#### CPU interface files

(All time durations below are in microseconds.)

* **cpu.stat**
  A read-only flat-keyed file that exists whether or not the controller is enabled. It always reports the following three stats:
    * usage_usec
    * user_usec
    * system_usec

  and the following five when the controller is enabled:
    * nr_periods
    * nr_throttled
    * throttled_usec
    * nr_bursts
    * burst_usec
* **cpu.weight**
  A read-write file that exists on all non-root cgroups. The default is "100". For non-idle cgroups (cpu.idle = 0), the weight ranges over [1, 10000]. If the cgroup has been configured as SCHED_IDLE (cpu.idle = 1), the weight reads as 0.
* **cpu.weight.nice**
  A read-write file that exists on non-root cgroups. The default is "0". The nice value ranges over [-20, 19]. This interface file is an alternative to "cpu.weight" that lets the weight be read and set using the same values as `nice(2)`. Because the nice range is smaller and its granularity coarser, the value read back is the closest approximation of the current weight.
* **cpu.max**
  A read-write file that exists on non-root cgroups. The default is "max 100000". It sets the maximum bandwidth limit, in the following format

  ```
  $MAX $PERIOD
  ```

  meaning that the group
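may consume up to `$MAX` in each `$PERIOD`. Setting `$MAX` to `max` means there is no limit.
* **cpu.max.burst**
  A read-write file that exists on all non-root cgroups. The default is "0". The range is [0, `$MAX`].
* **cpu.uclamp.min**
* **cpu.uclamp.max**
* **cpu.idle**

#### Memory

The memory controller regulates the distribution of memory. It is stateful and implements both the limit and protection models. The model is fairly complex because memory usage and reclaim pressure are intertwined and because memory is stateful. Although the controller is not completely water-tight, all major memory usages of a cgroup are tracked so that total memory consumption can be accounted for and kept within a reasonable range. The following three types of memory usage are currently tracked:

* Userland memory - page cache and anonymous memory
* Kernel data structures such as dentries and inodes
* TCP socket buffers

#### Memory Interface Files

All memory amounts are in bytes. If a written value is not aligned to `PAGE_SIZE`, it may be rounded up to the closest `PAGE_SIZE` multiple when read back.

* **memory.current**
  The total amount of memory currently used by the cgroup and its descendants.
* **memory.min**
  Hard memory protection. If a cgroup's memory usage is within its min boundary, the cgroup's memory is not reclaimed under any circumstances. If no unprotected reclaimable memory is available, the OOM killer is invoked. Above the min boundary, pages are reclaimed in proportion to the overage, which reduces reclaim pressure for smaller overages. The effective min boundary is limited by the memory.min values of all ancestor cgroups. If memory.min is overcommitted (child cgroups require more protected memory than the parent allows), each child cgroup gets a share of the parent's protection proportional to its actual memory usage below memory.min. If a memory cgroup is not populated with processes, its memory.min is ignored.
* **memory.low**
  The default is 0. Best-effort memory protection.

## To Be Organized

[Control Groups](https://docs.kernel.org/admin-guide/cgroup-v1/cgroups.html)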
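
As a small worked example of the `cpu.max` format described under CPU interface files, the sketch below turns a `$MAX $PERIOD` pair into a percentage of one CPU. The function name `cpu_max_percent` is made up for illustration and the sample values are hypothetical, not taken from a real system; it assumes a POSIX shell and integer microsecond values.

```shell
# cpu_max_percent "<MAX> <PERIOD>" (hypothetical helper, not a kernel tool)
# Prints the bandwidth limit as a percentage of one CPU, or "unlimited"
# when MAX is the literal string "max".
cpu_max_percent() {
    # Split the two whitespace-separated fields of a cpu.max line.
    set -- $1
    max=$1
    period=$2
    if [ "$max" = "max" ]; then
        echo "unlimited"
    else
        # Integer arithmetic: quota divided by period, scaled to percent.
        echo "$(( $max * 100 / $period ))%"
    fi
}

cpu_max_percent "max 100000"      # prints: unlimited
cpu_max_percent "50000 100000"    # prints: 50%
cpu_max_percent "200000 100000"   # prints: 200%
```

On a live cgroup v2 system the input could come from the file itself, for example `cpu_max_percent "$(cat /sys/fs/cgroup/rg1/cpu.max)"`, assuming the `cpu` controller is enabled for `rg1` so that the file exists. A result above 100% simply means the quota exceeds one period, which is possible on multi-CPU machines.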