Contributor: chun61205
Project walkthrough video
:question: Question list
Could the experiments in Chapter 7 be carried out with schedgraph instead?
Referring to the article written by linhoward0522, I set up the environment of schedgraph. Note that to set up the opam environment, the command should instead be:
$ eval $(opam env)
After generating a workload, use the following commands to produce the result graph:
$ ./dat2graph2 --save-tmp /path/to/trace.dat
$ jgraph trace.jgr | ps2eps -q -l -r 600 | epstopdf --filter > trace.pdf
Take the experiment of generating different workloads for example; the resulting graph will look like:
schedgraph does not record vruntime, so I cannot calculate the vruntime/runtime ratio from it.
Improve 《Demystifying the Linux CPU Scheduler》: ensure that all referenced code applies to Linux v5.15 (LTS), including the experiments involving ftrace. The content is written in English.
Record questions raised while reading, compile them together with my own inferences on this page, and discuss them with the instructor when appropriate.
When mentioning page numbers, attach the last-updated date of the e-book, for example:
P.16 @ March 21
johnnylord
你所不知道的 C 語言: linked list 和非連續記憶體
Why Linux kernel doubly-linked list is not just a simple linked list?!
include/linux/list.h
No need to consider edge cases
Here is the linked list diagram in the Linux kernel.
Take __list_del for example:
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
    next->prev = prev;
    WRITE_ONCE(prev->next, next);
}
If a singly-linked list is used, the user needs to consider the case where the head node is deleted. With the circular doubly-linked list above, however, there is no such edge case: every node, including the first one, is removed in the same way.
Polymorphism
Simply put, polymorphism allows you to treat objects of different subclasses as if they were objects of the parent class, enabling you to store them in a common list and call their specific functions in a consistent manner.
Suppose we want to create a structure named employee and several positions inheriting from it. The implementation looks like this:
struct employee {
    int id;
    char name[10];
    struct list_head _list;
};
struct supervisor {
    struct employee _employee;
    char subordinates[10][10];
    struct list_head _list;
};
struct business {
    struct employee _employee;
    char works[10][10];
    struct list_head _list;
};
struct supervisor and struct business inherit employee, and we put all the objects inheriting from a parent into a single list.
Poison
Here is the function list_del in include/linux/list.h:
static inline void list_del(struct list_head *entry)
{
    __list_del_entry(entry);
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}
In __list_del_entry, the previous node and the next node of entry are linked to each other to delete entry. However, if entry can still be accessed, entry->next and entry->prev can still be reached through entry too, which is not the result that we expect. Therefore, LIST_POISON1 and LIST_POISON2 are used to mark that these pointers are not accessible anymore.
According to the patch c9cf55285e87ac423c45d9efca750d3f50234d10,
Localize poison values into one header file for better documentation and easier/quicker debugging and so that the same values won't be used for multiple purposes.
Poison values are designed to improve code traceability.
How does the Linux kernel ensure that the addresses LIST_POISON1 and LIST_POISON2 will not be accessed?
From my observation, the Linux kernel simply checks for them explicitly to make sure a poisoned address is never used.
For example, in lib/list_debug.c:
bool __list_del_entry_valid(struct list_head *entry)
{
    struct list_head *prev, *next;

    prev = entry->prev;
    next = entry->next;

    if (CHECK_DATA_CORRUPTION(next == NULL,
            "list_del corruption, %px->next is NULL\n", entry) ||
        CHECK_DATA_CORRUPTION(prev == NULL,
            "list_del corruption, %px->prev is NULL\n", entry) ||
        CHECK_DATA_CORRUPTION(next == LIST_POISON1,
            "list_del corruption, %px->next is LIST_POISON1 (%px)\n",
            entry, LIST_POISON1) ||
        CHECK_DATA_CORRUPTION(prev == LIST_POISON2,
            "list_del corruption, %px->prev is LIST_POISON2 (%px)\n",
            entry, LIST_POISON2) ||
        CHECK_DATA_CORRUPTION(prev->next != entry,
            "list_del corruption. prev->next should be %px, but was %px. (prev=%px)\n",
            entry, prev->next, prev) ||
        CHECK_DATA_CORRUPTION(next->prev != entry,
            "list_del corruption. next->prev should be %px, but was %px. (next=%px)\n",
            entry, next->prev, next))
        return false;

    return true;
}
However, there is still a question in my mind: if an address equal to LIST_POISON1 or LIST_POISON2 can be allocated, does that mean the newly allocated address is unusable?
Why does the Linux kernel define two macros for list poison instead of just one?
I guess it is for traceability: with two distinct values, users can tell specifically whether next or prev was the pointer being misused.
pid_chain and pid_list? (P.19 @ May 30)
TODO: compare against the original source code
include/linux/pid.h
Pid Namespace 详解- 泰晓科技
struct pid and struct upid - Linux - Kernel - LinuxQuestions.org
kernel/pid.c
kernel/fork.c
include/linux/sched.h
92476d7fc0326a409ab1d3864a04093a6be9aca7
e8cfbc245e24887e3c30235f71e9e9405e0cfc39
struct pid and pid_t
pid_t is an int data type used to store the process id. struct pid is a data structure storing process data.
struct pid
{
    refcount_t count;
    unsigned int level;
    spinlock_t lock;
    /* lists of tasks that use this pid */
    struct hlist_head tasks[PIDTYPE_MAX];
    struct hlist_head inodes;
    /* wait queue for pidfd notifications */
    wait_queue_head_t wait_pidfd;
    struct rcu_head rcu;
    struct upid numbers[1];
};
attach_pid in kernel/pid.c
struct pid is a data structure storing process data; a struct pid represents a specific pid number. A struct pid may contain several types of pid, and these types may not be directly related: they may contain completely different tasks.
// In alloc_pid
for (type = 0; type < PIDTYPE_MAX; ++type)
    INIT_HLIST_HEAD(&pid->tasks[type]);
pid_type
pid_type is used to specify the type of a pid, such as PIDTYPE_PID or PIDTYPE_TGID.
struct pid_namespace
retval
retval is used to store the error type.
struct upid
/*
* struct upid is used to get the id of the struct pid, as it is
* seen in particular namespace. Later the struct pid is found with
* find_pid_ns() using the int nr and struct pid_namespace *ns.
*/
struct hlist_head tasks
It is never possible for two tasks (inside one NS) to have the same PID. However, the ProcessGroup ID and the Session ID can be shared by multiple tasks.
PIDTYPE_MAX
enum pid_type
{
    PIDTYPE_PID,
    PIDTYPE_TGID,
    PIDTYPE_PGID,
    PIDTYPE_SID,
    PIDTYPE_MAX,
};
enum pid_type defines the types of pid, and PIDTYPE_MAX is the number of types. In include/linux/sched.h, pid_links is defined as
struct hlist_node pid_links[PIDTYPE_MAX];
where PIDTYPE_MAX is used as the size of the per-type arrays that replaced the pid hash tables.
attach_pid
void attach_pid(struct task_struct *task, enum pid_type type)
{
    struct pid *pid = *task_pid_ptr(task, type);
    hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
}
In kernel/fork.c, copy_process creates a new process as a copy of the old one. In copy_process, attach_pid is called several times to insert the different pid types of the same pid number into the corresponding type-specific hash table lists.
In struct task_struct, why were pid_chain and pid_list removed in favor of pid_links?
pid_list
The reason pid_list was removed is mentioned in the patch 92476d7fc0326a409ab1d3864a04093a6be9aca7: removing it simplifies the code, reduces the need for 4 pid hash tables, and makes the code more capable. pid_list contained one type of pid. When a pid is created, the list node of each type is not allocated immediately; a node is allocated only when attaching.
pid_chain
The reason pid_chain was removed is mentioned in the patch e8cfbc245e24887e3c30235f71e9e9405e0cfc39.
pidhash is no longer required as all the information can be looked up from idr tree.
Here "idr" stands for "ID Radix" tree: idr is a data structure used for managing mappings from integer IDs to pointers.
…struct task_struct), but in older kernels all of this information was stored here. What was the consideration behind that design? (P.24, first paragraph, third line @ March 21)
TODO: compare git log / git blame of include/linux/sched.h
Per-Entity Load Tracking (PELT)
PELT is a mechanism in the Linux scheduler used for tracking the load and utilization of each per-CPU runqueue on a per-entity basis.
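As a sketch of the math behind PELT (from the kernel's scheduler documentation): the load of an entity is a geometrically decayed sum over ~1024 µs periods, where u_i is the entity's contribution in the i-th most recent period and y is the decay factor:

```latex
L = u_0 + u_1\,y + u_2\,y^2 + u_3\,y^3 + \cdots, \qquad y^{32} = \tfrac{1}{2}
```

Since y^32 = 1/2, a period's contribution halves after about 32 ms, so recent activity dominates while old activity fades out geometrically.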
Why doesn't the Linux kernel differentiate between the ready state and running state?
The reason is stated in chapter 2.2.3.
Since each core has its own set of registers and cache, this means that it is simpler to restore a task to the core it previously ran on. This is known as processor affinity: the task has an affinity for running on the same core it was swapped out from.
In summary, combining the ready state and the running state reduces the chance of swapping tasks off the processor; as a consequence, the kernel does not need to restore the task.
Applying Machine Learning Techniques to Improve Linux Process Scheduling
PELT ==> measurement of load
:flashlight: Solved
Go through the code listings in the book one by one to ensure that all referenced code applies to Linux v5.15+. Where the code diverges (possibly due to renamed Linux kernel symbols), record it here, try to find the likely reason for the change via lxr, and attach my own inference.
Reproduce the ftrace- and tracepoint-related experiments in the book, make sure they run correctly on Linux v5.15+, and adjust the code and experiment steps if discrepancies are found.
struct pid has changed (P.19 @ May 30)
The implementation in Linux v5.15.112:
struct pid {
    refcount_t count;
    unsigned int level;
    spinlock_t lock;
    /* lists of tasks that use this pid */
    struct hlist_head tasks[PIDTYPE_MAX];
    struct hlist_head inodes;
    /* wait queue for pidfd notifications */
    wait_queue_head_t wait_pidfd;
    struct rcu_head rcu;
    struct upid numbers[1];
};
pid_chain and pid_list were removed, and 2 struct hlist_head fields are used instead.
list_entry has been moved to include/linux/list.h (P.18 @ March 21)
:flashlight: Solved
container_of has been moved to include/linux/container_of.h (P.21 @ May 30)
In Linux v5.16, the container_of macro was split out from include/linux/kernel.h.
The reason can be seen in the patch:
For time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
Note, there are a lot of headers and modules that include kernel.h
solely for one of these macros and this allows to unburden compiler for
the twisted inclusion paths and to make new code cleaner in the future.
container_of is used quite frequently, so the Linux kernel split it out of include/linux/kernel.h to unburden the compiler from the twisted inclusion paths and to make new code cleaner.
Study how to design the experiments
offsetof has changed (P.22 @ May 30): see include/linux/stddef.h
:flashlight: Solved
Linux kernel project: CPU scheduler research
rostedt/trace-cmd: utilities for Linux ftrace - GitHub
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
$ uname -mrs
Linux 5.19.0-42-generic x86_64
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz
CPU family: 6
Model: 126
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 5
CPU max MHz: 3600.0000
CPU min MHz: 400.0000
BogoMIPS: 2380.80
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 192 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
In this experiment, we will observe the variation of virtual runtime. To observe it, we first need to isolate a CPU so that only the tasks we launch run on it.
Modify /etc/default/grub
:
$ sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3"
Update the configuration:
$ sudo update-grub
Reboot:
$ sudo reboot
Check the isolated CPUs:
$ cat /sys/devices/system/cpu/isolated
2-3
trace-cmd
$ sudo apt-get install build-essential git pkg-config -y
$ sudo apt-get install libtracefs-dev libtraceevent-dev -y
$ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git
$ cd libtraceevent
$ make
$ sudo make install
$ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git
$ cd libtracefs
$ make
$ sudo make install
$ git clone https://github.com/rostedt/trace-cmd.git
$ cd trace-cmd
$ make
$ sudo make install
$ sudo make install_libs
According to 《Demystifying the Linux CPU Scheduler》, prepare two configurations.
For same nice,
{
    "procs": [
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "nice 0",
            "spown_cnt": 3
        }
    ]
}
For different nice,
{
    "procs": [
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "nice -5",
            "policy": "other",
            "policy_detail": "-5"
        },
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "nice 0",
            "policy": "other",
            "policy_detail": "0"
        },
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "nice 5",
            "policy": "other",
            "policy_detail": "5"
        }
    ]
}
Run the experiments
$ tools/start.py --targets=same,diff
After actually trying it, start.py could not be executed, so for now stress-ng is used to run the experiments instead.
stress-ng
Use the following command to install stress-ng.
$ sudo apt install stress-ng
Use the following command to check if stress-ng
has been successfully installed.
$ stress-ng --version
stress-ng, version 0.13.12 (gcc 11.3, x86_64 Linux 5.19.0-42-generic) 💻🔥
Generating workload
$ nice -n 0 taskset -c 3 stress-ng --cpu 3 &
nice -n: specify the nice value
taskset -c: specify the CPU core
--cpu: specify the number of CPU workers
Tracing the tasks
$ sudo trace-cmd record -e sched
-e sched: trace scheduling events only
Getting the result
$ sudo trace-cmd report | grep "stress-ng" | grep -o "pid=[0-9]* runtime=[0-9]* \[ns\] vruntime=[0-9]* \[ns\]" > experiments/same_nice.txt
Preparing a program to plot
import re
import matplotlib.pyplot as plt

# Parse "pid=... runtime=... vruntime=..." lines into per-PID series
pids = {}
for m in re.finditer(r"pid=(\d+) runtime=(\d+) \[ns\] vruntime=(\d+) \[ns\]",
                     open("experiments/same_nice.txt").read()):
    d = pids.setdefault(m.group(1), {"runtime": [], "vruntime": []})
    d["runtime"].append(int(m.group(2)))
    d["vruntime"].append(int(m.group(3)))
width = 0.25  # horizontal offset between the bars of different PIDs
for i, (pid, data) in enumerate(pids.items()):
    x_values = [x + i * width for x in range(2, len(data["vruntime"]) - 1)]
    y_values = [(data["vruntime"][j + 3] - v) / data["runtime"][j + 3]
                for j, v in enumerate(data["vruntime"][2:-1])]
    plt.bar(x_values, y_values, width, label=f"PID {pid}")
plt.xlabel("cycle")
plt.ylabel("vruntime / runtime")
plt.legend()
plt.show()
Observing the result graph
In this graph, we can see that tasks with the same nice value have the same vruntime-to-runtime ratio.
Generating workload
$ sudo nice -n -5 taskset -c 3 stress-ng --cpu 1 &
$ nice -n 0 taskset -c 3 stress-ng --cpu 1 &
$ nice -n 5 taskset -c 3 stress-ng --cpu 1 &
Tracing the tasks
$ sudo trace-cmd record -e sched
Getting the result
$ sudo trace-cmd report | grep "stress-ng" | grep -o "pid=[0-9]* runtime=[0-9]* \[ns\] vruntime=[0-9]* \[ns\]" > experiments/diff_nice.txt
Observing the result graph
PID 10175: nice -5
PID 10181: nice 0
PID 10184: nice 5
In this experiment, we can see that the task with the lower nice value gets more runtime.
According to Chapter 7, tasks with lower nice values carry larger load weights and therefore accumulate vruntime more slowly while receiving more CPU time, which fits the result.
Reproduce the Chapter 7 scheduler-events experiments: use the code provided in Section 7.4 to trace scheduler behavior, and combine it with Cgroups to verify the Linux kernel's resource-management behavior.