
Linux Kernel Project: CPU Scheduler Study

Contributor: chun61205
Project presentation video

:question: Question List

  • Could the experiments in Chapter 7 be done with schedgraph instead?

    linhoward0522

    Referring to the article written by linhoward0522, set up the environment of schedgraph.

    To set up the opam environment, the command should instead be:

    $ eval $(opam env)

    After generating a workload, use the following commands to produce the result graph:

    $ ./dat2graph2 --save-tmp /path/to/trace.dat
    $ jgraph trace.jgr | ps2eps -q -l -r 600 | epstopdf --filter > trace.pdf
    

    Taking the experiment of generating different workloads as an example, the resulting graph looks like this:

    (Figure: schedgraph output for the generated workloads; original image unavailable.)

    In the graph, lines in different colors represent different processes, one each for the tasks with nice values -5, 0, and 5.
    However, I found that schedgraph does not record vruntime, so I cannot calculate the vruntime/runtime ratio.

Task Overview

Improve 《Demystifying the Linux CPU Scheduler》: make sure all referenced code applies to Linux v5.15 (LTS), including the experiments involving ftrace. Write the content in English.

TODO: Record questions raised while reading the e-book

Record the questions that come up while reading, compile them together with your own reasoning on this page, and discuss them with the instructor when appropriate.

When citing a page number, include the date of the e-book's latest update, e.g. P.16 @ March 21

Chapter 1

What considerations led to the linked list being circular? (P.14, third paragraph @ March 21)

johnnylord
你所不知道的 C 語言: linked list 和非連續記憶體
Why Linux kernel doubly-linked list is not just a simple linked list?!
include/linux/list.h

  1. No need to consider edge cases
    Here is the linked-list diagram in the Linux kernel.

    (Figure: the kernel's circular doubly-linked list with a dedicated head node; original image unavailable.)

    Observing the diagram above, we can see that the head node does not carry any data; it is independent of the data stored in the other nodes.
    Take __list_del for example,

    static inline void __list_del(struct list_head * prev, struct list_head * next)
    {
        next->prev = prev;
        WRITE_ONCE(prev->next, next);
    }
    

    If a headless singly-linked list were used instead, the user would need to handle the special case of deleting the first node. With the dedicated head node above, no such edge case exists, as the sketch below illustrates.
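    To make the edge case concrete, here is a minimal userspace sketch (hypothetical snode/slist_del names, not kernel code) of deletion from a headless singly-linked list: the caller must treat the first node specially and receive a possibly-new head. With the kernel's circular list and dedicated head node, __list_del handles every position identically.

    struct snode {
        int data;
        struct snode *next;
    };

    /* Delete target from a headless singly-linked list; return the new head. */
    static struct snode *slist_del(struct snode *head, struct snode *target)
    {
        struct snode *cur;

        if (head == target)            /* special case: deleting the first node */
            return target->next;
        for (cur = head; cur && cur->next; cur = cur->next) {
            if (cur->next == target) {
                cur->next = target->next;
                break;
            }
        }
        return head;
    }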

  2. Polymorphism
    Simply put, polymorphism allows objects of different derived types to be treated as objects of their parent type, so they can be stored in a common list and handled in a uniform way.
    Suppose we want to create a structure named employee and several position types that inherit from it. The implementation looks like this:

    struct employee {
        int id;
        char name[10];
        struct list_head _list;
    };
    struct supervisor {
        struct employee _employee;
        char subordinates[10][10];
        struct list_head _list;
    };
    struct business {
        struct employee _employee;
        char works[10][10];
        struct list_head _list;
    };
    

    struct supervisor and struct business inherit from employee, and all objects derived from the same parent can be put into one list, as the sketch below shows.
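    With this layout, generic code can walk the shared list through the embedded list_head and recover the containing object with container_of (or list_entry). A minimal sketch, assuming the objects are linked through the embedded employee's _list member and that sup_list and dump_supervisors are hypothetical names for a list holding only supervisors:

    LIST_HEAD(sup_list);    /* hypothetical list of supervisor objects */

    static void dump_supervisors(void)
    {
        struct employee *e;
        struct supervisor *s;

        /* Generic traversal only needs the parent type... */
        list_for_each_entry(e, &sup_list, _list) {
            pr_info("employee %d: %s\n", e->id, e->name);

            /* ...and the derived object can be recovered when needed. */
            s = container_of(e, struct supervisor, _employee);
            pr_info("first subordinate: %s\n", s->subordinates[0]);
        }
    }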

    
    
    
    
    
    
    (Diagram ele_list: the employee head node (id, name) linked to a supervisor node (_employee, subordinates) and a business node (_employee, works) through their list_head members; original rendering unavailable.)
    
    
    
    
  3. Poison
    Here is the function list_del in include/linux/list.h.

    static inline void list_del(struct list_head *entry)
    {
        __list_del_entry(entry);
        entry->next = LIST_POISON1;
        entry->prev = LIST_POISON2;
    }
    

    In __list_del_entry, the previous and next nodes of entry are linked to each other, which removes entry from the list. However, if entry itself is still reachable, entry->next and entry->prev would still point into the list, which is not what we want. Therefore LIST_POISON1 and LIST_POISON2 are assigned to mark these pointers as no longer usable.
    According to the patch c9cf55285e87ac423c45d9efca750d3f50234d10,

    Localize poison values into one header file for better documentation and easier/quicker debugging and so that the same values won't be used for multiple purposes.

    that is, the poison values are centralized to improve documentation and to make debugging easier and quicker.

    1. How does the Linux kernel ensure that the addresses LIST_POISON1 and LIST_POISON2 will never be accessed?

      From my observation, the Linux kernel simply checks for these poison values explicitly when list debugging is enabled.
      For example, in lib/list_debug.c (the local prev/next declarations, omitted in the earlier excerpt, are restored here):

      bool __list_del_entry_valid(struct list_head *entry)
      {
          struct list_head *prev, *next;

          prev = entry->prev;
          next = entry->next;

          if (CHECK_DATA_CORRUPTION(next == NULL,
                  "list_del corruption, %px->next is NULL\n", entry) ||
              CHECK_DATA_CORRUPTION(prev == NULL,
                  "list_del corruption, %px->prev is NULL\n", entry) ||
              CHECK_DATA_CORRUPTION(next == LIST_POISON1,
                  "list_del corruption, %px->next is LIST_POISON1 (%px)\n",
                  entry, LIST_POISON1) ||
              CHECK_DATA_CORRUPTION(prev == LIST_POISON2,
                  "list_del corruption, %px->prev is LIST_POISON2 (%px)\n",
                  entry, LIST_POISON2) ||
              CHECK_DATA_CORRUPTION(prev->next != entry,
                  "list_del corruption. prev->next should be %px, but was %px. (prev=%px)\n",
                  entry, prev->next, prev) ||
              CHECK_DATA_CORRUPTION(next->prev != entry,
                  "list_del corruption. next->prev should be %px, but was %px. (next=%px)\n",
                  entry, next->prev, next))
              return false;

          return true;
      }
      

      However, one question remains: if memory at an address equal to LIST_POISON1 or LIST_POISON2 could actually be allocated, would that newly allocated address be unusable?

    2. Why does the Linux kernel define two poison macros for lists instead of just one?
      I guess it is for traceability: with two distinct values, one can tell whether a corrupted pointer was a next pointer or a prev pointer. The definitions below also make the two values easy to recognize.
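      For reference, the poison values are defined in include/linux/poison.h (excerpt based on my reading of v5.15; please verify against the actual tree). They are small non-NULL addresses, optionally shifted by a per-architecture offset into an unmappable range, so dereferencing them faults immediately and allocators never hand out these addresses in practice:

      #ifdef CONFIG_ILLEGAL_POINTER_VALUE
      # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
      #else
      # define POISON_POINTER_DELTA 0
      #endif

      /* Non-NULL pointers that fault when dereferenced; used to catch
       * accesses to deleted or uninitialized list entries. */
      #define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
      #define LIST_POISON2  ((void *) 0x122 + POISON_POINTER_DELTA)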

How does the Linux kernel insert a pid into the PID hash table? Why were pid_chain and pid_list removed? (P.19 @ May 30)

TODO: Compare against the source code

include/linux/pid.h
Pid Namespace 详解- 泰晓科技
struct pid and struct upid - Linux - Kernel - LinuxQuestions.org
kernel/pid.c
kernel/fork.c
include/linux/sched.h
92476d7fc0326a409ab1d3864a04093a6be9aca7
e8cfbc245e24887e3c30235f71e9e9405e0cfc39

  1. struct pid and pid_t
    pid_t is an integer type used to store a process ID,
    whereas struct pid is the kernel data structure that represents a PID itself.
    struct pid
    {
        refcount_t count;
        unsigned int level;
        spinlock_t lock;
        /* lists of tasks that use this pid */
        struct hlist_head tasks[PIDTYPE_MAX];
        struct hlist_head inodes;
        /* wait queue for pidfd notifications */
        wait_queue_head_t wait_pidfd;
        struct rcu_head rcu;
        struct upid numbers[1];
    };
    
  2. Data Structures Related to attach_pid in pid.c
    1. pid
      As mentioned before, struct pid is the data structure that describes a PID. A struct pid represents one specific PID number, but it may be attached under several pid types at once, and those types are not directly related and may contain completely different tasks.
      // In alloc_pid
      for (type = 0; type < PIDTYPE_MAX; ++type)
          INIT_HLIST_HEAD(&pid->tasks[type]);
      
    2. pid_type
      pid_type specifies the type of a pid, such as PIDTYPE_PID or PIDTYPE_TGID.
    3. struct pid_namespace
      A namespace is a mechanism for encapsulating and isolating global system resources, allowing processes in different namespaces to have their own independent view of those resources.
    4. retval
      retval is used to store the error code.
    5. struct upid
      /*
       * struct upid is used to get the id of the struct pid, as it is
       * seen in particular namespace. Later the struct pid is found with
       * find_pid_ns() using the int nr and struct pid_namespace *ns.
       */
      
    6. struct hlist tasks

      It is never possible for two tasks (inside one NS) to have the same PID. However, the ProcessGroup ID and the Session ID can be shared by multiple tasks.

    7. PIDTYPE_MAX
      enum pid_type
      {
          PIDTYPE_PID,
          PIDTYPE_TGID,
          PIDTYPE_PGID,
          PIDTYPE_SID,
          PIDTYPE_MAX,
      };

      enum pid_type defines the types of a pid; PIDTYPE_MAX evaluates to 4, the number of pid types. In include/linux/sched.h, struct task_struct defines pid_links as

      struct hlist_node pid_links[PIDTYPE_MAX];

      so each task carries PIDTYPE_MAX hash-list nodes, one per pid type.
  3. attach_pid
    void attach_pid(struct task_struct *task, enum pid_type type)
    {
        struct pid *pid = *task_pid_ptr(task, type);
        hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
    }
    
    In kernel/fork.c, copy_process creates a new process as a copy of an existing one. During copy_process, attach_pid is called several times to insert the different pid types that share the same pid number into the corresponding per-type hash lists.

    In fact, the per-task linkage into these lists is stored in struct task_struct as pid_links, while the list heads live in struct pid's tasks[] array; task_pid_ptr (quoted below) resolves which struct pid a given type refers to.
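    For reference, task_pid_ptr is a small helper in kernel/pid.c (quoted from my reading of v5.15; please verify against the actual tree): the PIDTYPE_PID entry is per-task, while the TGID/PGID/SID entries are shared through the thread group's signal_struct.

    static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
    {
        return (type == PIDTYPE_PID) ?
            &task->thread_pid :
            &task->signal->pids[type];
    }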
  4. Why were pid_chain and pid_list removed?
    1. pid_list
      The reason pid_list was removed is mentioned in the patch 92476d7fc0326a409ab1d3864a04093a6be9aca7.
      Simply put, removing pid_list simplifies the code, removes the need for 4 separate pid hash tables, and makes the implementation more flexible.
      Originally, each pid_list held one type of pid. When a pid was created, the list node for each type was not allocated immediately; instead, a node was allocated only when it was attached.
      After this patch, the hash table entry is created when the pid is allocated and freed when the pid is freed.
    2. pid_chain
      The reason pid_chain was removed is mentioned in the patch e8cfbc245e24887e3c30235f71e9e9405e0cfc39.

      pidhash is no longer required as all the information can be looked up from idr tree.

      Here idr ("ID radix" tree) is a kernel data structure that manages mappings from integer IDs to pointers; the excerpt below shows how it is used for PID lookup.
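      For example, after this change, looking up a struct pid by number is just an idr lookup in the namespace's idr (kernel/pid.c, as I read it in v5.15; please verify against the actual tree):

      struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
      {
          return idr_find(&ns->idr, nr);
      }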

Only part of the hardware context is stored in the process descriptor (struct task_struct), whereas older kernels stored all of it there. What were the considerations behind this design? (P.24, first paragraph, third line @ March 21)

TODO: Check git log / git blame of include/linux/sched.h

Why doesn't the Linux kernel distinguish "TASK_RUNNING (ready but not running)" from "TASK_RUNNING (running)"? (P.31, Figure 1.12: State machine of task states @ May 30)

Linux 核心設計: Scheduler(4): PELT - HackMD

  1. Per-Entity Load Tracking (PELT)
    PELT is the mechanism the scheduler uses to track the load/utilization of each scheduling entity and, by aggregation, of each per-CPU runqueue.
    $Load_t = u_t + u_{t-1}\cdot y + u_{t-2}\cdot y^2 + \dots = \sum_{i=0}^{\infty} u_{t-i}\cdot y^i$
    where $u_t$ is the contribution recorded in the current time slot and $y$ is the decay factor, which controls how quickly past contributions decay.
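    In the kernel's PELT implementation the decay factor is chosen so that a contribution is halved after 32 periods (each period is roughly 1 ms):
    $y^{32} = \frac{1}{2} \quad\Rightarrow\quad y \approx 0.9785$
    so load contributed about 32 ms ago counts only half as much as load contributed now.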

  2. Why doesn't the Linux kernel differentiate between the ready state and the running state?
    The reason is stated in chapter 2.2.3.

    Since each core has its own set of registers and cache, this means that it is simpler to restore a task to the core it previously ran on. This is known as processor affinity: the task has an affinity for running on the same core it was swapped out from.

    In summary, combining the ready and running states reduces the chance of a task being moved away from its CPU; as a consequence, the kernel does not need to restore the task's context as often.

Applying Machine Learning Techniques to Improve Linux Process Scheduling

PELT ==> measurement of load

Other Issues

"PIDd" seems to have an extra "d" (P.16, third paragraph, second line @ March 21)

:flashlight: Resolved

"busy" was typed with an extra "i" as "buisy" (P.74, first paragraph, third line @ May 30)

:flashlight: Resolved

TODO: Ensure all referenced code applies to Linux v5.15+

Go through the code listings in the book one by one and make sure all referenced code applies to Linux v5.15+. If a discrepancy is found (possibly due to renamed Linux kernel symbols), record it here, try to find the likely reason for the change via lxr, and add your own reasoning.

Reproduce the ftrace- and tracepoint-related experiments in the book and make sure they work correctly on Linux v5.15+; if discrepancies are found, try adjusting the code and the experimental steps.

Chapter 1

struct pid has changed (P.19 @ May 30)

include/linux/pid.h

Implementation in Linux v5.15.112:

struct pid {
    refcount_t count;
    unsigned int level;                       
    spinlock_t lock;
    /* lists of tasks that use this pid */
    struct hlist_head tasks[PIDTYPE_MAX];
    struct hlist_head inodes;
    /* wait queue for pidfd notifications */
    wait_queue_head_t wait_pidfd;
    struct rcu_head rcu;
    struct upid numbers[1];
};

pid_chain and pid_list have been removed; two struct hlist_head fields are used instead.

list_entry has moved to include/linux/list.h (P.18 @ March 21)

:flashlight: Resolved

container_of has moved to include/linux/container_of.h (P.21 @ May 30)

d2a8ebbf8192b84b11f1b204c4f7c602df32aeac

In Linux v5.16, the container_of macro was split out of include/linux/kernel.h into its own header.
The reason can be seen in the patch:

For time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
Note, there are a lot of headers and modules that include kernel.h
solely for one of these macros and this allows to unburden compiler for
the twisted inclusion paths and to make new code cleaner in the future.

container_of is used very frequently, so the kernel split it out of include/linux/kernel.h to unburden the compiler from the twisted inclusion paths and to keep new code cleaner.
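Ignoring the type-checking added in the real header, the macro essentially recovers the address of the enclosing structure from the address of one of its members. A simplified sketch (not the verbatim kernel definition):

/* Simplified container_of: subtract the member's offset within the type
 * from the member's address to obtain the enclosing object's address. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))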

Study how to design the experiments

offsetof has changed (P.22 @ May 30)

include/linux/stddef.h
:flashlight: Resolved

TODO: Reproduce the Chapter 7 experiments

Linux Kernel Project: CPU Scheduler Study
rostedt/trace-cmd: utilities for Linux ftrace - GitHub

Environment

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

$ uname -mrs
Linux 5.19.0-42-generic x86_64

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz
    CPU family:          6
    Model:               126
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            5
    CPU max MHz:         3600.0000
    CPU min MHz:         400.0000
    BogoMIPS:            2380.80
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   192 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)
  L3:                    6 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7

Isolating One CPU

In this experiment we will observe how virtual runtime varies. To do so, we first isolate a CPU so that only tasks we explicitly pin to it will run there.
Modify /etc/default/grub:

$ sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3"

Update the configuration:

$ sudo update-grub

Reboot:

$ sudo reboot

Check the isolated CPUs:

$ cat /sys/devices/system/cpu/isolated
2-3

Installing trace-cmd

$ sudo apt-get install build-essential git pkg-config -y
$ sudo apt-get install libtracefs-dev libtraceevent-dev -y

$ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git
$ cd libtraceevent
$ make
$ sudo make install

$ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git
$ cd libtracefs
$ make
$ sudo make install

$ git clone https://github.com/rostedt/trace-cmd.git
$ cd trace-cmd
$ make
$ sudo make install
$ sudo make install_libs

Same Nice vs. Different Nice

According to 《Demystifying the Linux CPU Scheduler》, prepare two configurations.
For same nice,

{
    "procs": [
	{
	    "comm": "cpu_bound",
	    "cpus": "3",
	    "name": "nice 0",
	    "spown_cnt": 3
	}
    ]
}

For different nice,

{
    "procs": [
	{
	    "comm": "cpu_bound",
	    "cpus": "3",
	    "name": "nice -5",
	    "policy": "other",
	    "policy_detail": "-5"
	},
	{
	    "comm": "cpu_bound",
	    "cpus": "3",
	    "name": "nice 0",
	    "policy": "other",
	    "policy_detail": "0"
	},
	{
	    "comm": "cpu_bound",
	    "cpus": "3",
	    "name": "nice 5",
	    "policy": "other",
	    "policy_detail": "5"
	}
    ]
}

Run the experiments

$ tools/start.py --targets=same,diff

After actually trying it, start.py could not be executed, so stress-ng is used here for the time being to run the experiments.

stress-ng

Use the following command to install stress-ng.

$ sudo apt install stress-ng

Use the following command to check if stress-ng has been successfully installed.

$ stress-ng --version
stress-ng, version 0.13.12 (gcc 11.3, x86_64 Linux 5.19.0-42-generic) 💻🔥

Running Experiment

Same Nice

  1. Generating workload

    $ nice -n 0 taskset -c 3 stress-ng --cpu 3 &

    1. nice -n: specify the nice value
    2. taskset -c: pin the tasks to a specific CPU core
    3. --cpu: specify the number of CPU worker tasks
  2. Tracing the tasks

    $ sudo trace-cmd record -e sched

    • -e sched: record scheduling events only
  3. Getting the result

    ​​​​$ sudo trace-cmd report | grep "stress-ng" | grep -o "pid=[0-9]* runtime=[0-9]* \[ns\] vruntime=[0-9]* \[ns\]" > experiments/same_nice.txt 
    
  4. Preparing a program to plot

    # Assumes `pids` maps each stress-ng PID to lists of the runtime and
    # vruntime values parsed from experiments/same_nice.txt.
    import matplotlib.pyplot as plt

    width = 0.25    # bar width (assumed value; not in the original snippet)

    for i, (pid, data) in enumerate(pids.items()):
        # For each scheduling interval, plot the vruntime increment
        # divided by the runtime of that interval.
        x_values = [x + i * width for x in range(2, len(data['vruntime']) - 1)]
        y_values = [(data['vruntime'][j+3] - vruntime) / data['runtime'][j+3] for j, vruntime in enumerate(data['vruntime'][2:-1])]
        plt.bar(x_values, y_values, width, label=f"PID {pid}")

    plt.xlabel("cycle")
    plt.ylabel("vruntime / runtime")
    plt.legend()
    plt.show()
    
  5. Observing the result graph

    In this graph, we can see that tasks with the same nice value have the same vruntime/runtime ratio. In addition, since we set the nice value to 0, the ratio is 1.
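    This matches how CFS scales virtual runtime by task weight, where NICE_0_LOAD (1024) is the weight of a nice-0 task:
    $\Delta vruntime = \Delta runtime \times \frac{NICE\_0\_LOAD}{weight}$
    so for nice 0 the ratio is exactly 1, and a heavier (lower-nice) task accumulates vruntime more slowly.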

Different Nice Value

  1. Generating workload

    ​​​​$ sudo nice -n -5 taskset -c 3 stress-ng --cpu 1 &
    ​​​​$ nice -n 0 taskset -c 3 stress-ng --cpu 1 &
    ​​​​$ nice -n 5 taskset -c 3 stress-ng --cpu 1 &
    
  2. Tracing the tasks

    ​​​​$ sudo trace-cmd record -e sched
    
  3. Getting the result

    ​​​​$ sudo trace-cmd report | grep "stress-ng" | grep -o "pid=[0-9]* runtime=[0-9]* \[ns\] vruntime=[0-9]* \[ns\]" > experiments/diff_nice.txt 
    
  4. Observing the result graph

    • PID 10175: nice -5
    • PID 10181: nice 0
    • PID 10184: nice 5

    In this experiment, we can see that the task with the lower nice value gets more runtime.
    According to Chapter 7,
    $ratio(-5) \approx 0.328,\quad ratio(0) \approx 1,\quad ratio(5) \approx 3.056$
    which fits the measured result; the derivation is shown below.
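    These ratios follow from the nice-to-weight table sched_prio_to_weight in kernel/sched/core.c, where weight(-5) = 3121, weight(0) = 1024, and weight(5) = 335:
    $ratio(n) = \frac{1024}{weight(n)}:\qquad \frac{1024}{3121} \approx 0.328,\qquad \frac{1024}{1024} = 1,\qquad \frac{1024}{335} \approx 3.056$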

Reproduce the Chapter 7 scheduler-events experiments: use the code provided in Section 7.4 to trace scheduler behavior, and combine it with cgroups to verify how the Linux kernel manages resources.