# Linux Kernel Project: CPU Scheduler Research
> Contributor: chun61205
> [Project presentation video](https://youtu.be/UmsdPt122QE)
:::success
:question: Question List
* Could the Chapter 7 experiments be done with schedgraph instead?
> [`linhoward0522`](https://hackmd.io/hE9dWQ3MSU6nLMNZCjdudA?both)
Following the article written by `linhoward0522`, I set up the schedgraph environment.
:::warning
To set up the `opam` environment, the command should be
```shell
$ eval $(opam env)
```
instead.
:::
After generating a workload, use the following commands to produce the graph:
```shell
$ ./dat2graph2 --save-tmp /path/to/trace.dat
$ jgraph trace.jgr | ps2eps -q -l -r 600 | epstopdf --filter > trace.pdf
```
Taking the experiment that generates different workloads as an example, the resulting graph looks like this:
![](https://hackmd.io/_uploads/SJS1mlVq3.png)
In the graph, lines in different colors represent different processes: the purple line is the process with nice value $5$, the brown line is the process with nice value $0$, and the remaining line is the process with nice value $-5$.
However, I found out that `schedgraph` does not record `vruntime`, so I cannot compute the ratio $vruntime/runtime$.
:::
## Task Overview
Improve《Demystifying the Linux CPU Scheduler》, ensuring that all referenced code applies to Linux v5.15 (LTS), including the experiments involving ftrace. The content is written in English.
## TODO: Record Questions Raised While Reading the Book
Record questions raised while reading, compile them together with my own reasoning on this page, and discuss them with the instructor when appropriate.
> When citing a page number, attach the date of the book's latest update, e.g. `P.16 @ March 21`
### Chapter 1
#### What considerations led to the linked list being circular? (P.14, paragraph 3 @ March 21)
> [johnnylord](https://hackmd.io/@johnnylord/rk7xJhVIN?type=view)
> [你所不知道的 C 語言: linked list 和非連續記憶體](https://hackmd.io/@sysprog/c-linked-list)
> [Why Linux kernel doubly-linked list is not just a simple linked list?!](https://medium.com/@m.zanoosi/why-linux-kernel-doubly-linked-list-is-not-just-a-simple-linked-list-fb8c43ff150)
> [include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h)
1. No need to consider edge cases
Here is the linked-list diagram in the Linux kernel.
![](https://hackmd.io/_uploads/ByVQGN8H3.png)
Observing the diagram above, we can see that the head node carries no data; it is a pure sentinel, unrelated to the data stored in the other nodes.
Take `__list_del` for example,
```c
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
next->prev = prev;
WRITE_ONCE(prev->next, next);
}
```
If a plain singly-linked list were used, the user would need to handle the special case where the head node itself is deleted. With the circular design above, there is no such edge case, as the contrasting sketch below shows.
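For contrast, here is a hypothetical NULL-terminated singly-linked list (my own sketch, not kernel code); deleting a node forces the caller to special-case the head, because removing the first node changes the head pointer itself:
```c
#include <stdio.h>

struct node {
    int data;
    struct node *next;
};

/* Delete target from a NULL-terminated singly-linked list.
 * The head pointer itself may change, hence struct node **head. */
static void sll_del(struct node **head, struct node *target)
{
    if (*head == target) {              /* edge case: target is the head */
        *head = target->next;
        return;
    }
    for (struct node *cur = *head; cur->next; cur = cur->next) {
        if (cur->next == target) {      /* unlink target */
            cur->next = target->next;
            return;
        }
    }
}

int main(void)
{
    struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
    struct node *head = &a;

    sll_del(&head, &a);                 /* deleting the head is the special case */
    for (struct node *cur = head; cur; cur = cur->next)
        printf("%d ", cur->data);       /* prints: 2 3 */
    printf("\n");
    return 0;
}
```
With the kernel's circular doubly-linked list, `__list_del` above is the entire operation: every node, including the first and last, always has valid `prev` and `next` pointers, so no branch is needed.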
2. Polymorphism
Simply put, polymorphism allows objects of different subclasses to be treated as objects of the parent class, so they can be stored in a common list and have their specific functions called in a consistent manner.
Suppose we want to create a structure named `employee` and several positions that inherit from it. The implementation looks like this:
```c
struct employee {
int id;
char name[10];
struct list_head _list;
};
struct supervisor {
struct employee _employee;
char subordinates[10][10];
struct list_head _list;
};
struct business {
struct employee _employee;
char works[10][10];
struct list_head _list;
};
```
`struct supervisor` and `struct business` inherit from `employee`, and all objects derived from the same parent are put into one list.
```graphviz
digraph ele_list {
rankdir=LR
node[shape=record]
subgraph cluster_0 {
node [shape=record]
value [label="id|name"];
head [label="{<left>prev|<right>next}"];
style=bold
label = employee
}
subgraph cluster_1 {
node [shape=record]
node1_value [label="_employee|subordinates"];
node1 [label="{<left>prev|<right>next}"]
style=bold
label = supervisor
}
subgraph cluster_2 {
node [shape=record]
node2_value [label="_employee|works"];
node2 [label="{<left>prev|<right>next}"]
style=bold
label = business
}
head:right:e -> node1:left:w[arrowhead=normal, arrowtail=normal, dir=both]
node1:right:e -> node2:left:w[arrowhead=normal, arrowtail=normal, dir=both]
}
```
3. Poison
Here is the function `list_del` in [include/linux/list.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/list.h#n146).
```c
static inline void list_del(struct list_head *entry)
{
__list_del_entry(entry);
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
}
```
In `__list_del_entry`, the previous and next nodes of `entry` are connected to each other, removing `entry` from the list. However, if `entry` itself is still reachable, `entry->next` and `entry->prev` can still be dereferenced through it, which is not what we expect. Therefore `LIST_POISON1` and `LIST_POISON2` are assigned to mark the pointers as no longer valid.
According to the patch [c9cf55285e87ac423c45d9efca750d3f50234d10](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/include/linux/poison.h?id=c9cf55285e87ac423c45d9efca750d3f50234d10),
> Localize poison values into one header file for better documentation and easier/quicker debugging and so that the same values won't be used for multiple purposes.
poisons are designed to improve code traceability.
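For reference, the poison values are defined in [include/linux/poison.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/poison.h) as non-NULL pointers that page-fault if dereferenced under normal circumstances:
```c
/*
 * These are non-NULL pointers that will result in page faults
 * under normal circumstances, used to verify that nobody uses
 * non-initialized list entries.
 */
#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
#define LIST_POISON2  ((void *) 0x122 + POISON_POINTER_DELTA)
```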
:::warning
1. How does the Linux kernel ensure that the addresses `LIST_POISON1` and `LIST_POISON2` will never be accessed?
From my observation, the Linux kernel simply checks for them manually to ensure the addresses are never accessed.
For example, in [/lib/list_debug.c](https://elixir.bootlin.com/linux/latest/source/lib/list_debug.c#L53)
```c
bool __list_del_entry_valid(struct list_head *entry)
{
	struct list_head *prev, *next;

	prev = entry->prev;
	next = entry->next;

	if (CHECK_DATA_CORRUPTION(next == NULL,
			"list_del corruption, %px->next is NULL\n", entry) ||
	    CHECK_DATA_CORRUPTION(prev == NULL,
			"list_del corruption, %px->prev is NULL\n", entry) ||
	    CHECK_DATA_CORRUPTION(next == LIST_POISON1,
			"list_del corruption, %px->next is LIST_POISON1 (%px)\n",
			entry, LIST_POISON1) ||
	    CHECK_DATA_CORRUPTION(prev == LIST_POISON2,
			"list_del corruption, %px->prev is LIST_POISON2 (%px)\n",
			entry, LIST_POISON2) ||
	    CHECK_DATA_CORRUPTION(prev->next != entry,
			"list_del corruption. prev->next should be %px, but was %px. (prev=%px)\n",
			entry, prev->next, prev) ||
	    CHECK_DATA_CORRUPTION(next->prev != entry,
			"list_del corruption. next->prev should be %px, but was %px. (next=%px)\n",
			entry, next->prev, next))
		return false;

	return true;
}
```
However, one question remains in my mind: if an allocation could return an address equal to `LIST_POISON1` or `LIST_POISON2`, would that make the newly allocated address unusable?
2. Why does the Linux kernel define two macros for list poisoning instead of just one?
My guess is traceability: with two distinct values, one can tell from a crash whether a poisoned `next` pointer or a poisoned `prev` pointer was dereferenced.
:::
:::warning
> [fix LIST_POISON{1,2} offset - Jiri Slaby - lore](https://lore.kernel.org/lkml/c7ecfa39d66c62ee662ae6906a2eec3d28a96e6a.1463648873.git.jslaby@suse.cz/)
:::
#### How does the Linux kernel insert a pid into the PID hash table? Why were `pid_chain` and `pid_list` removed? (P.19 @ May 30)
TODO: cross-check against the source code
> [include/linux/pid.h](https://elixir.bootlin.com/linux/v6.1.32/source/include/linux/pid.h#L59)
> [Pid Namespace 详解- 泰晓科技](https://tinylab.org/pid-namespace/)
> [struct pid and struct upid - Linux - Kernel - LinuxQuestions.org](https://www.linuxquestions.org/questions/linux-kernel-70/question-about-kernel-pid_namespace-struct-pid-and-struct-upid-796265/)
> [kernel/pid.c](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/pid.c#n331)
> [kernel/fork.c](https://elixir.bootlin.com/linux/v6.1.32/source/kernel/fork.c#L1994)
> [include/linux/sched.h](https://elixir.bootlin.com/linux/v6.1.32/source/include/linux/sched.h#L737)
> [92476d7fc0326a409ab1d3864a04093a6be9aca7](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/include/linux/pid.h?id=92476d7fc0326a409ab1d3864a04093a6be9aca7)
> [e8cfbc245e24887e3c30235f71e9e9405e0cfc39](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/include/linux/pid.h?id=e8cfbc245e24887e3c30235f71e9e9405e0cfc39)
1. `struct pid` and `pid_t`
`pid_t` is an `int` data type used to store a process id,
whereas `struct pid` is a data structure that describes the pid itself.
```c
struct pid
{
refcount_t count;
unsigned int level;
spinlock_t lock;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
struct hlist_head inodes;
/* wait queue for pidfd notifications */
wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
};
```
2. Data Structures Related to `attach_pid` in `pid.c`
1. `pid`
As mentioned before, `struct pid` is a data structure describing a pid. One `struct pid` represents one specific pid number. A `struct pid` may be attached under several pid types at once, and these types are not necessarily directly related: they may contain completely different tasks.
```c
// In alloc_pid
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
```
2. `pid_type`
`pid_type` is used to specify the type of a pid, such as `PIDTYPE_PID` or `PIDTYPE_TGID`.
3. `struct pid_namespace`
A namespace is a mechanism for encapsulating and isolating global system resources, giving processes in different namespaces independent views of those resources.
4. `retval`
`retval` is used to store the error code.
5. `struct upid`
```c
/*
* struct upid is used to get the id of the struct pid, as it is
* seen in particular namespace. Later the struct pid is found with
* find_pid_ns() using the int nr and struct pid_namespace *ns.
*/
```
6. `struct hlist_head tasks[PIDTYPE_MAX]`
> It is never possible for two tasks (inside one NS) to have the same PID. However, the ProcessGroup ID and the Session ID can be shared by multiple tasks.
7. `PIDTYPE_MAX`
```c
enum pid_type
{
PIDTYPE_PID,
PIDTYPE_TGID,
PIDTYPE_PGID,
PIDTYPE_SID,
PIDTYPE_MAX,
};
```
`enum pid_type` defines the types of pid. Here `PIDTYPE_MAX` is $4$, the number of pid types. In `include/linux/sched.h`, `pid_links` is defined as
```c
struct hlist_node pid_links[PIDTYPE_MAX];
```
We can see that each task carries `PIDTYPE_MAX` hash-list nodes, one per pid type.
3. `attach_pid`
```c
void attach_pid(struct task_struct *task, enum pid_type type)
{
struct pid *pid = *task_pid_ptr(task, type);
hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
}
```
In `kernel/fork.c`, `copy_process` creates a new process as a copy of an old one. Inside `copy_process`, `attach_pid` is called several times to insert the different pid types of the same pid number into their corresponding hash lists.
![](https://hackmd.io/_uploads/B13KeirD2.png)
In fact, the hash-list nodes are embedded in `struct task_struct` itself, in the `pid_links` array.
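For reference, the relevant portion of `copy_process` in kernel/fork.c (v5.15, heavily abridged; elisions marked `/* ... */`) looks roughly like this:
```c
	if (likely(p->pid)) {
		/* ... */
		init_task_pid(p, PIDTYPE_PID, pid);
		if (thread_group_leader(p)) {
			init_task_pid(p, PIDTYPE_TGID, pid);
			init_task_pid(p, PIDTYPE_PGID, task_pgrp(current));
			init_task_pid(p, PIDTYPE_SID, task_session(current));
			/* ... */
			attach_pid(p, PIDTYPE_TGID);
			attach_pid(p, PIDTYPE_PGID);
			attach_pid(p, PIDTYPE_SID);
			/* ... */
		}
		attach_pid(p, PIDTYPE_PID);
		/* ... */
	}
```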
4. Why were `pid_chain` and `pid_list` removed?
1. `pid_list`
The reason why `pid_list` was removed is mentioned in the patch [92476d7fc0326a409ab1d3864a04093a6be9aca7](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/include/linux/pid.h?id=92476d7fc0326a409ab1d3864a04093a6be9aca7).
Simply put, removing `pid_list` simplifies the code, removes the need for four separate pid hash tables, and makes the code more flexible.
Originally, each `pid_list` held one type of pid. When a pid was created, the list node for each type was not allocated immediately; a node was allocated only upon attaching.
After this patch, the hash table entry is created when the pid is allocated and freed when the pid is freed.
2. `pid_chain`
The reason why `pid_chain` was removed is mentioned in the patch [e8cfbc245e24887e3c30235f71e9e9405e0cfc39](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/include/linux/pid.h?id=e8cfbc245e24887e3c30235f71e9e9405e0cfc39).
> pidhash is no longer required as all the information can be looked up from idr tree.
Here idr stands for "ID Radix tree", a data structure used for managing mappings from integer IDs to pointers.
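For example, in v5.15 `find_pid_ns` in kernel/pid.c is a direct idr lookup in the namespace:
```c
struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
{
	return idr_find(&ns->idr, nr);
}
```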
#### Only part of the hardware context is stored in the process descriptor (`struct task_struct`), whereas older kernels stored all of it there. What was the rationale for this design? (P.24, paragraph 1, line 3 @ March 21)
TODO: cross-check git log / `git blame include/linux/sched.h`
#### Why doesn't the Linux kernel distinguish "TASK_RUNNING (ready but not running)" from "TASK_RUNNING (running)"? (P.31 Figure 1.12: State machine of task states @ May 30)
> [Linux 核心設計: Scheduler(4): PELT - HackMD](https://hackmd.io/@RinHizakura/Bk4y_5o-9)
1. Per-Entity Load Tracking (PELT)
PELT is a mechanism in the scheduler for tracking the load and utilization contributed by each scheduling entity, aggregated on each per-CPU runqueue.
$$
\begin{aligned}
Load_t&=u_t+u_{t-1}\times y^1+u_{t-2}\times y^2+\dots\\
&=\sum_{i=0}^{\infty}u_{t-i}\times y^i
\end{aligned}
$$
Where $u_t$ is the contribution recorded in the current time slot and $y$ is the decay factor, which controls the decaying process described above.
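As a rough illustration (my own sketch, not kernel code), the decayed sum can be computed as below; the kernel actually uses fixed-point arithmetic, with $y$ chosen so that $y^{32}=0.5$:
```c
#include <stdio.h>

/* Decay factor: the kernel picks y so that y^32 = 1/2 (y ~= 0.97857).
 * Floating point is used here purely for illustration. */
#define DECAY_Y 0.97857

/* u[0] is the newest time slot's contribution, u[n-1] the oldest. */
static double pelt_load(const double *u, int n)
{
    double load = 0.0, y_pow = 1.0;

    for (int i = 0; i < n; i++) {
        load += u[i] * y_pow;   /* u_{t-i} * y^i */
        y_pow *= DECAY_Y;
    }
    return load;
}

int main(void)
{
    double u[] = { 1.0, 1.0, 1.0, 1.0 };    /* four fully busy slots */

    printf("decayed load = %f\n", pelt_load(u, 4));
    return 0;
}
```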
2. Why doesn't the Linux kernel differentiate between the ready state and running state?
The reason is stated in chapter 2.2.3.
> Since each core has its own set of registers and cache, this means that it is simpler to restore a task to the core it previoulsy ran on. This is known as processor affinity: the task has an affinity for running on the same core it was swapped out from.
In summary, combining the ready and running states reduces the chance of swapping tasks off the processor; in consequence, the kernel does not need to restore the task's state as often.
[Applying Machine Learning Techniques to Improve Linux Process Scheduling](https://ieeexplore.ieee.org/document/4085157)
PELT ==> measurement of load
### Other Issues
#### "PIDd" has an extra "d" (P.16, paragraph 3, line 2 @ March 21)
> :flashlight: Resolved
#### "busy" is misspelled as "buisy", with an extra "i" (P.74, paragraph 1, line 3 @ May 30)
> :flashlight: Resolved
## TODO: Ensure All Referenced Code Applies to Linux v5.15+
Go through the code listings in the book one by one to make sure they all apply to Linux v5.15+. If discrepancies are found (possibly from renamed kernel symbols), record them here, try to find the likely reason for the change on [lxr](https://elixir.bootlin.com/linux/latest/source), and attach my own reasoning.
Reproduce the ftrace and tracepoint experiments in the book and make sure they work correctly on Linux v5.15+; if discrepancies are found, try adapting the code and the experimental steps.
### Chapter 1
#### `struct pid` has changed (P.19 @ May 30)
> [include/linux/pid.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/pid.h#n59)
The implementation in Linux v5.15.112:
```c
struct pid {
refcount_t count;
unsigned int level;
spinlock_t lock;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
struct hlist_head inodes;
/* wait queue for pidfd notifications */
wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
};
```
`pid_chain` and `pid_list` have been removed, replaced by two `struct hlist_head` members.
#### `list_entry` has moved to [include/linux/list.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/list.h#n519) (P.18 @ March 21)
> :flashlight: Resolved
#### `container_of` has moved to [include/linux/container_of.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/container_of.h#n18) (P.21 @ May 30)
> [d2a8ebbf8192b84b11f1b204c4f7c602df32aeac](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/include/linux/kernel.h?id=d2a8ebbf8192b84b11f1b204c4f7c602df32aeac)
In Linux v5.16, the `container_of` macro was split out of `include/linux/kernel.h`.
The reason can be seen in the patch:
> For time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
> Note, there are _a lot_ of headers and modules that include kernel.h
solely for one of these macros and this allows to unburden compiler for
the twisted inclusion paths and to make new code cleaner in the future.
`container_of` is used quite frequently, so the kernel split it out of `include/linux/kernel.h` to unburden the compiler from the twisted inclusion paths and to keep new code cleaner.
:::warning
Investigate how to design an experiment for this.
:::
#### `offsetof` has changed (P.22 @ May 30)
> [include/linux/stddef.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/stddef.h#n16)
> :flashlight: Resolved
## TODO: Reproduce the Chapter 7 Experiments
> [Linux 核心專題: CPU 排程器研究](https://hackmd.io/@sysprog/BJh9FdlS2#%E9%96%8B%E7%99%BC%E7%92%B0%E5%A2%83)
> [rostedt/trace-cmd: utilities for Linux ftrace - GitHub](https://github.com/rostedt/trace-cmd)
### Environment
```shell
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
$ uname -mrs
Linux 5.19.0-42-generic x86_64
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz
CPU family: 6
Model: 126
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 5
CPU max MHz: 3600.0000
CPU min MHz: 400.0000
BogoMIPS: 2380.80
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 192 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
```
### Isolating One CPU
In this experiment we observe how virtual runtime varies. To observe it cleanly, we first isolate a CPU so that only our tasks run on it.
Modify `/etc/default/grub`:
```shell
$ sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3"
```
Update the configuration:
```shell
$ sudo update-grub
```
Reboot:
```shell
$ sudo reboot
```
Check the isolated CPUs:
```shell
$ cat /sys/devices/system/cpu/isolated
2-3
```
### Installing `trace-cmd`
```shell
$ sudo apt-get install build-essential git pkg-config -y
$ sudo apt-get install libtracefs-dev libtraceevent-dev -y
$ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git
$ cd libtraceevent
$ make
$ sudo make install
$ cd ..
$ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git
$ cd libtracefs
$ make
$ sudo make install
$ cd ..
$ git clone https://github.com/rostedt/trace-cmd.git
$ cd trace-cmd
$ make
$ sudo make install
$ sudo make install_libs
```
### Same Nice vs. Different Nice
According to 《Demystifying the Linux CPU Scheduler》, prepare two configurations.
For same nice,
```json
{
"procs": [
{
"comm": "cpu_bound",
"cpus": "3",
"name": "nice 0",
"spown_cnt": 3
}
]
}
```
For different nice,
```json
{
"procs": [
{
"comm": "cpu_bound",
"cpus": "3",
"name": "nice -5",
"policy": "other",
"policy_detail": "-5"
},
{
"comm": "cpu_bound",
"cpus": "3",
"name": "nice 0",
"policy": "other",
"policy_detail": "0"
},
{
"comm": "cpu_bound",
"cpus": "3",
"name": "nice 5",
"policy": "other",
"policy_detail": "5"
}
]
}
```
Run the experiments:
```shell
$ tools/start.py --targets=same,diff
```
:::danger
After actually trying this, `start.py` failed to execute, so `stress-ng` is used here to run the experiments for now.
:::
### `stress-ng`
Use the following command to install `stress-ng`.
```shell
$ sudo apt install stress-ng
```
Use the following command to check if `stress-ng` has been successfully installed.
```shell
$ stress-ng --version
stress-ng, version 0.13.12 (gcc 11.3, x86_64 Linux 5.19.0-42-generic) 💻🔥
```
### Running Experiment
#### Same Nice
1. Generating workload
```shell
$ nice -n 0 taskset -c 3 stress-ng --cpu 3 &
```
1. `nice -n`: Specify the nice value
2. `taskset -c`: Pin the tasks to a CPU core
3. `--cpu`: Specify the number of CPU worker tasks
2. Tracing the tasks
```shell
$ sudo trace-cmd record -e sched
```
* `-e sched`: Scheduling events only
3. Getting the result
```shell
$ sudo trace-cmd report | grep "stress-ng" | grep -o "pid=[0-9]* runtime=[0-9]* \[ns\] vruntime=[0-9]* \[ns\]" > experiments/same_nice.txt
```
4. Preparing a program to plot
```python
import matplotlib.pyplot as plt

# pids is assumed to map pid -> {'runtime': [...], 'vruntime': [...]},
# parsed beforehand from the trace-cmd report output (e.g. same_nice.txt).
width = 0.25  # bar width

for i, (pid, data) in enumerate(pids.items()):
    x_values = [x + i * width for x in range(2, len(data['vruntime']) - 1)]
    # Per-cycle delta of vruntime, divided by the runtime reported in that cycle.
    y_values = [(data['vruntime'][j + 3] - vruntime) / data['runtime'][j + 3]
                for j, vruntime in enumerate(data['vruntime'][2:-1])]
    plt.bar(x_values, y_values, width, label=f"PID {pid}")
plt.xlabel("cycle")
plt.ylabel("vruntime / runtime")
plt.legend()
plt.show()
```
5. Observing the result graph
![](https://hackmd.io/_uploads/H1WBU4Q_3.png)
In this graph, we can see that tasks with the same nice value have the same $vruntime/runtime$ ratio. In addition, since the nice value is $0$, whose load weight equals the base weight `NICE_0_LOAD` ($1024$), the ratio is $1$.
#### Different Nice Values
1. Generating workload
```shell
$ sudo nice -n -5 taskset -c 3 stress-ng --cpu 1 &
$ nice -n 0 taskset -c 3 stress-ng --cpu 1 &
$ nice -n 5 taskset -c 3 stress-ng --cpu 1 &
```
2. Tracing the tasks
```shell
$ sudo trace-cmd record -e sched
```
3. Getting the result
```shell
$ sudo trace-cmd report | grep "stress-ng" | grep -o "pid=[0-9]* runtime=[0-9]* \[ns\] vruntime=[0-9]* \[ns\]" > experiments/diff_nice.txt
```
4. Observing the result graph
![](https://hackmd.io/_uploads/SJ5DJSX_n.png)
* `PID 10175`: nice -5
* `PID 10181`: nice 0
* `PID 10184`: nice 5
In this experiment, we can see that the task with the lower nice value gets more runtime.
According to Chapter 7,
$$
\begin{aligned}
&ratio_{(-5)}\approx0.328\\
&ratio_{(0)}\approx1\\
&ratio_{(5)}\approx3.056
\end{aligned}
$$
which fits the measured results.
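These ratios follow directly from how CFS scales vruntime: the actual runtime is multiplied by `NICE_0_LOAD` (the weight of nice $0$, i.e. $1024$) divided by the task's load weight, with the weights taken from the `sched_prio_to_weight` table in kernel/sched/core.c (nice $-5\mapsto3121$, nice $0\mapsto1024$, nice $5\mapsto335$):
$$
\frac{vruntime}{runtime}=\frac{NICE\_0\_LOAD}{weight}
\implies
ratio_{(-5)}=\frac{1024}{3121}\approx0.328,\quad
ratio_{(0)}=\frac{1024}{1024}=1,\quad
ratio_{(5)}=\frac{1024}{335}\approx3.057
$$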
Reproduce the Chapter 7 scheduler-events experiment, using the code provided in section 7.4 to trace scheduler behavior; this should be combined with [Cgroups](https://hackmd.io/@0xff07/Bkse9fRdF) to verify the Linux kernel's resource-management behavior.