《Demystifying the Linux CPU Scheduler》問題集

# 《Demystifying the Linux CPU Scheduler》問題集 [筆記](https://hackmd.io/49H-da0GQe6IcJr9AxNa0w) [GPS 筆記](https://hackmd.io/@salmonii/HJSvKVOlle) ## ch1 ### Typo 相關問題 1. introduction 最後一句 :heavy_check_mark: > When discussing architecture-dependent code it will be assumed that the architecture is x86. When discussing architecture-dependent code, it will be assumed that the architecture is x86. 這樣讀起來比較通順 2. 1.2.1 System call :heavy_check_mark: >A system call, often known as syscall,`is a routine that are used to` implement these gates and act as APIs between kernel and user space. is a routine that is used to ... 3. 1.3.1 Processes and threads > Each thread is assigned a unique number, which is confusingly called a PID (Process `IDentifier`) in kernel terms. Threads belonging to a process are accounted for as the TGID (Thread Group `ID`). 是否都統一用 IDentifier 比較好？ ### 其他問題 1. 因為在看 clone process , thread 的差別，於是去看了 cloning flags 內容以下節錄自 [linux/include/uapi/linux/sched.h](https://github.com/torvalds/linux/blob/fc96b232f8e7c0a6c282f47726b2ff6a5fb341d2/include/uapi/linux/sched.h#L11) ```c #define CLONE_VM 0x00000100 /* set if VM shared between processes */ #define CLONE_FS 0x00000200 /* set if fs info shared between processes */ #define CLONE_FILES 0x00000400 /* set if open files shared between processes */ #define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */ #define CLONE_THREAD 0x00010000 /* Same thread group? */ #define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */ #define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */ #define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */ #define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */ ``` 我想問為什麼 CLONE_THREAD 的註解有一個問號？註解那句本身也很不完整 2. 如果在看原始碼的過程中，發現一些註解的排版錯誤以及文法錯誤，發 patch 會很像在找碴嗎？ --- 以下跟書的內容有關 3. 參考 1.3.1 中的說明 :heavy_check_mark: 老師已解惑我發現我說明的不清楚，因此再補充： > Each thread is assigned a unique number, which is confusingly called a PID (Process IDentifier) in kernel terms. Threads belonging to a process are accounted for as the TGID (Thread Group ID). > Adding further confusion, when getpid() is called, it is actually retrieving the TGID (the group leader PID, identifying the entire process). Conversely, upon calling gettid(), the PID is obtained (which identifies a sin- gle thread, not a group). Therefore, it appears that the PID functions more like a thread identifier. This confusing way of using IDs was implemented to com- ply with POSIX standards, which require that each thread of a multi-threaded process must have the same id: this is why getpid() returns the TGID. 參照這段話 ``` PID -> thread id TGID -> process id ``` 但是我去看 Linux Manual Page 中的對於 [getpid](https://man7.org/linux/man-pages/man2/getpid.2.html) 以及 [gettid](https://man7.org/linux/man-pages/man2/gettid.2.html) 的定義，發現跟書中不太一樣： ``` PID -> process id TID -> thread id ``` 應該要以哪一個版本為準？以及本書之後所提及的所有 PID 皆是指 thread id 嗎？ > man page 是 user space 的內容 4. 如何組織和管理 multithreaded process？ :heavy_check_mark: 老師已解惑參考《Demystifying the Linux CPU Scheduler》 1.3.2 中的圖示： ![cpu-sched-book-20250331 2](https://hackmd.io/_uploads/BJDtZWEJgl.jpg) 如果這邊的 PID 是指 thread id 的情況，我想問為什麼 pid_list 裡面的 PID （nr）的值都是一樣的，同個 thread group 裡面的不同 thread 不會有不同的 PID 嗎？ 5. 有軟體中斷這種東西嗎？參考 1.3.1 中的說明： > Interrupt number 128 (=0x80) is released. In Linux, it corresponds to a system call interrupt. > Performing a system call in assembly code involves filling right CPU registers with the syscall arguments and then using a special assembly instruction. > ... invoke the interrupt 128 (by the instruction int 0x80). 根據這段描述，系統呼叫是透過指令 int 0x80 來觸發的。這種透過 int 指令產生的中斷會不會經過 PIC / APIC？還是說它會直接對應到一個 interrupt vector 並跳轉到對應的 handler，而不像硬體中斷那樣？老師上課說現在講「中斷」都是指硬體中斷，那書中的說明會不會讓讀者混淆？還是我這樣很鑽牛角尖 6. 1.3.4 附上的程式碼與目前[核心程式碼](https://github.com/torvalds/linux/blob/bc3372351d0c8b2726b7d4229b878342e3e6b0e8/include/linux/sched.h#L4)不同 ```c /* Used in tsk->state: */ 2 #define TASK_RUNNING 0x0000 3 #define TASK_INTERRUPTIBLE 0x0001 4 #define TASK_UNINTERRUPTIBLE 0x0002 5 #define __TASK_STOPPED 0x0004 6 #define __TASK_TRACED 0x0008 7 /* Used in tsk->exit_state: */ 8 #define EXIT_DEAD 0x0010 9 #define EXIT_ZOMBIE 0x0020 10 #define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD) ``` 這樣後續的部分說明內容是否也需要更新？ >The values associated with these states are defined with values such as 0x0000, 0x0001, 0x0002, etc. ```c if (tsk->state & (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)) printk("task %d is waiting for something", tsk->pid); ``` 以及這段範例程式碼中的 `tsk->state` 應該要改成 `tsk->__state` 7. 1.3.1 create new process 程式碼部分 ```c // spawns a new process, this is the same as using fork() clone(&do_something, stackTop, SIGCHLD, 0); ``` 我實際執行 `fork` 看到 `clone` 系統呼叫的 cloning flags 不只帶 `SIGCHLD`，還有帶 `CLONE_CHILD_CLEARTID` 以及 `CLONE_CHILD_SETTID`，我想問為什麼？以及這樣的話 the same as 的用法是不是不精準？ 8. 1.3.1 > This is because when THREAD_INFO_IN_TASK is enabled, thread_info must remain at the beginning of task_struct as current_thread_info() retrieves it directly from there. 我想是否把 `THREAD_INFO_IN_TASK` 改成 `CONFIG_THREAD_INFO_IN_TASK` 比較好，這樣也跟第三章一致。且我在[原始碼](https://elixir.bootlin.com/linux/v6.14.4/A/ident/THREAD_INFO_IN_TASK)查找 `THREAD_INFO_IN_TASK` 顯示 Identifier not used。 #### 釐清關於 `THREAD_INFO_IN_TASK` 和 `CONFIG_THREAD_INFO_IN_TASK` 的關係 /init/Kconfig ``` config THREAD_INFO_IN_TASK bool help Select this to move thread_info off the stack into task_struct. To make this work, an arch will need to remove all thread_info fields except flags and fix any runtime bugs. One subtle change that will be needed is to use try_get_task_stack() and put_task_stack() in save_thread_stack_tsk() and get_wchan(). menu "General setup" ``` 在 Kconfig 中，是以 `config THREAD_INFO_IN_TASK` 的形式定義的，沒有 CONFIG_。以下程式碼節錄自 [scripts/kconfig/confdata.c](https://elixir.bootlin.com/linux/v6.4.16/source/scripts/kconfig/confdata.c#L1179) ```c int conf_write_autoconf(int overwrite){ ... ret = __conf_write_autoconf(conf_get_autoheader_name(), print_symbol_for_c, &comment_style_c); ... } ``` ```c static const char *conf_get_autoheader_name(void) { char *name = getenv("KCONFIG_AUTOHEADER"); return name ? name : "include/generated/autoconf.h"; } ``` ```c static void print_symbol_for_c(FILE *fp, struct symbol *sym) { ... switch (sym->type) { case S_BOOLEAN: case S_TRISTATE: switch (*val) { case 'n': return; case 'm': sym_suffix = "_MODULE"; /* fall through */ default: val = "1"; } break; ... fprintf(fp, "#define %s%s%s %s%s\n", CONFIG_, sym->name, sym_suffix, val_prefix, val); free(escaped); } ``` 從以上程式碼可以看出在前置處理的時候就已經是把 Kconfig 中的 `THREAD_INFO_IN_TASK` 以加上 prefix 的方式寫入 include/generated/autoconf.h。實際去看電腦中的 include/generated/autoconf.h 檔案： ```shell $ cat /usr/src/linux-headers-$(uname -r)/include/generated/autoconf.h ``` ```shell #define CONFIG_THREAD_INFO_IN_TASK 1 ``` 在 C 語言撰寫中會是以下的形式，而不是 `THREAD_INFO_IN_TASK` ```c #ifdef CONFIG_THREAD_INFO_IN_TASK /* * For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the * definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels, * including <asm/current.h> can cause a circular dependency on some platforms. */ ``` 參考[你所不知道的 C 語言：前置處理器應用篇](https://hackmd.io/@sysprog/c-preprocessor) 綜合以上，我認爲此處用 `CONFIG_THREAD_INFO_IN_TASK` 比 `THREAD_INFO_IN_TASK` 更清楚。 9. 1.3.1 >The flag CLONE_SIGHAND at line 11 requires that the parent process receives a SIGCHLD signal upon the termination of the created child process, with the exception that the address space, filesystem information (CLONE_FS), open files (CLONE_FILES) and signal handlers will be shared.∗ 以下節錄自 [man](https://man7.org/linux/man-pages/man2/clone.2.html) >A new thread created with CLONE_THREAD has the same parent process as the process that made the clone call (i.e., like CLONE_PARENT), so that calls to getppid(2) return the same value for all of the threads in a thread group. When a CLONE_THREAD thread terminates, the thread that created it is not sent a SIGCHLD (or other termination) signal; nor can the status of such a thread be obtained using wait(2). (The thread is said to be detached.) After all of the threads in a thread group terminate the parent process of the thread group is sent a SIGCHLD (or other termination) signal. 我想問 `CLONE_SIGHAND` 跟 `SIGCHLD` 有直接關係嗎？ man page 是寫如果有設定 `CLONE_THREAD`，終止時創建該 thread 的 thread 不會收到 `SIGCHLD`。 ## ch2 ### Typo 相關問題 1. 2.1.1 :heavy_check_mark: > While the scheduler has to provide a interactive environment for the system user, it should also take a common evaluation matrix, throughput, into consideration. a interactive environment -> an interactive environment 2. 2.2.2 :heavy_check_mark: > Note that when a task becomes runnable and the has_cpu field is 0 (indi- cating that it is not currently assigned to any CPU), the scheduler invoke the reschedule_idle function. the scheduler invoke -> the scheduler invokes 3. 2.2.2 :heavy_check_mark: Overall 為開頭的段落，其中的 essentially 沒有適當換行 4. 2.2.2 在這個章節有介紹到 goodness() function 還有 reschedule_idle function，然後他們出現時的撰寫形式都沒有統一，有的是有加 () ，有的是字體不同。 ``` goodness() function reschedule_idle function ``` 5. 2.2.3 :heavy_check_mark: > What if two Bnice=19 running? 應該要是 What if two Bnice=19 **are** running? 6. 2.2.4 :heavy_check_mark: > With several queues, processes placed onto these queues will run for the timeslice specified by its queue. its queue -> their queue 7. 2.4 :heavy_check_mark: > while the traditional server workloads highlights the fairness of jobs highlights -> highlight ### 其他問題 1. 2.1.1 :heavy_check_mark: 已有其他同學專題在進行 ```c /* * Scheduling policies */ #define SCHED_NORMAL 0 #define SCHED_FIFO 1 #define SCHED_RR 2 #define SCHED_BATCH 3 /* SCHED_ISO: reserved but not implemented yet */ #define SCHED_IDLE 5 #define SCHED_DEADLINE 6 #define SCHED_EXT 7 ``` 2.1.2 要補上 `SCHED_EXT` 的說明 CONFIG_SCHED_CLASS_EXT 2. 2.1.2 > The command ps -eo pid,rtprio,time,comm can report a snapshot of the current processes with real-time priority ranging from 0 (lowest priority) to 99 (highest priority). These have a priority called nice that ranges from −20 to 19, representing how “nice” the process is to all the others: a value of −20 indicates that the task is unwilling to share CPU with the others, hence it corresponds to a high priority value. 此處我覺得 `These` 所指的東西不太清楚前面是在說明 real-time priority，後面討論到不是 real-time 的 priority 卻用 these 作為連接。 3. 2.2.1 O(1) scheduler ```c struct prio_array { int nr_active; unsigned long bitmap[BITMAP_SIZE]; struct list_head queue[MAX_PRIO]; }; ``` ```c #define TASK_INTERACTIVE(p) \ ((p)->prio <= (p)->static_prio - DELTA(p)) #define DELTA(p) \ (SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \ INTERACTIVE_DELTA) #define rt_task(p) ((p)->prio < MAX_RT_PRIO) ``` ```c void scheduler_tick(int user_ticks, int sys_ticks){ ... if (unlikely(rt_task(p))) { /* * RR tasks need a special form of timeslice management. * FIFO tasks have no timeslices. */ if ((p->policy == SCHED_RR) && !--p->time_slice) { p->time_slice = task_timeslice(p); p->first_time_slice = 0; set_tsk_need_resched(p); /* put it at the end of the queue: */ dequeue_task(p, rq->active); enqueue_task(p, rq->active); } goto out_unlock; } if (!--p->time_slice) { dequeue_task(p, rq->active); set_tsk_need_resched(p); p->prio = effective_prio(p); p->time_slice = task_timeslice(p); p->first_time_slice = 0; if (!rq->expired_timestamp) rq->expired_timestamp = jiffies; if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { enqueue_task(p, rq->expired); } else enqueue_task(p, rq->active); } ... } ``` 看過 2.6.0 版本的原始碼整理以下：在 `scheduler_tick` 函式裡面會先判斷 `if (unlikely(rt_task(p)))` 。從巨集 `rt_task` 的定義可以看到他會判斷 prio 是否小於 `MAX_RT_PRIO`（100），也就是這個 task 是否為 real-time 的。所以可以看到在 `p->time_slice` 用完的時候就會走兩條路：一條是對於在 0-99 號佇列裡面，且 policy 為 `SCHED_RR` 的 task（`SCHED_FIFO` 跟 time_slice 無關），另外一條是對於在 100-139 號佇列裡面的 task。 * real-time 對於 real-time task 來說，並不會在時間到之後被丟進去 expired array 裡面，可以看到程式碼會把 task 重新放到 active array 中對應的佇列裡面。 * non real-time `p->prio = effective_prio(p);` 會計算出新的 dynamic priority，`TASK_INTERACTIVE` 巨集比較任務的目前 priority 與 static priority 的差值來判斷 interactive。如果動態調整後的 prioirty 不為 `SCHED_NORMAL`（interactive），或是出現 expired task starvation 的情況，就會把時間用完的 task 丟到 expired array 裡面，反之繼續放回去 active array。但是我認為在書中的這一章節沒有區分清楚 0-99 以及 100-139 號佇列中對於 timeslice expiration 的任務的處理方式。書中主要是在講 non real-time 這條，導致我在一開始閱讀的時候誤以為這樣的運作方式是適用於第 0-139 priority 的所有任務。另外，我有查到相關的[教材](https://hackmd.io/@sysprog/Hka7Kzeap/%2FS1opp7-mP)，當中的說明也是不夠精確。而且這個教材說明比書的內容更不符合程式碼，教材裡面是寫一旦 timeslice 歸零，任務就會被放到 expired array，但其實有可能被放到 active array。 --- TODO: CFS, EEVDF, SMP --- 3. 2.3 Completely Fair Scheduler (CFS) > Practically, it is the actual execution time normalized by the number of runnable tasks. CFS uses it to determine the next task to put on the core. The target is to maintain the virtual runtime of every task close to each other (ideally, to the same value), therefore CFS always picks the task with the smallest runtime. it is the actual execution time normalized by the number of runnable tasks. 這句話以 virtual runtime 的算法來說的話，是不是不精準 ## ch3 ### Typo 相關問題 1. 3.1 Structs and their role :heavy_check_mark: > as order is guaranteed, instead of comparing equality of the current class with every other scheduling classes every other scheduling classes -> every other scheduling class 2. 3.1 Structs and their role > sched_class collects all the methods of a scheduling class, which forms a table, what is the pointer in task_struct points to. 這句看起來很怪，what is 的用法讓我疑惑他到底是陳述句還是疑問句。 3. 3.1 :heavy_check_mark: > The pointer grants an O(1) access time instead of the ordinary O(logn), a useful speed-up in a critical path such the one present in schedule(). such as ?? 4. 3.2 Time keeping :heavy_check_mark: > Every time that the interrupt handler is executed, the value of jiffies is increased by one. Every time that the -> Every time the ### 其他問題 1. 3.1 Structs and their role > A sched_entity may be organized in a hierarchy of entities, this is done to allow group scheduling and it is optional. Suppose that there are more users using a system and we want to share the CPU equally between them, but they have different processes running. With normal CFS, each task is present in the runqueue and it is scheduled independently. By organizing the entities in hierarchies, we can group together the tasks and schedule them as a single schedulable entity. In the previous example we would have only two entities on the runqueue, one for each user, and a corresponding runqueue containing all the tasks of that user. The scheduler decides which of the two entities to schedule, as if there were only two tasks running, and then repeats the process for the sub-queue. As stated earlier, a sched_entity does not always correspond to a process; only the leaves of this structure correspond to a task. 最後那句話：As stated earlier, a sched_entity does not always correspond to a process; only the leaves of this structure correspond to a task. 在跟著上下文理解後，我認為會讓人混淆，誤以為 `sched_entity` 本身是個樹狀結構，但其實是 root `cfs_rq` 使用紅黑數的資料結構來存所有可運行的 `sched_entity`。 2. 第 110 頁 Figure 3.7: CFS runqueue using red-black tree 的位置可以調整一下，他直接卡在 `rt_prio_array` 的程式碼中間。 3. 3.2 Time keeping > It is also possible to configure the kernel as tickless (NO_HZ): this means that the interrupt is no longer at fixed intervals, but it is dynamically scheduled as needed, which is helpful for power saving. 實際去看 v5.16.6 原始碼`linux/kernel/time/Kconfig` ``` config NO_HZ_IDLE bool "Idle dynticks system (tickless idle)" select NO_HZ_COMMON help This option enables a tickless idle system: timer interrupts will only trigger on an as-needed basis when the system is idle. This is usually interesting for energy saving. Most of the time you want to say Y here. config NO_HZ bool "Old Idle dynticks config" help This is the old config entry that enables dynticks idle. We keep it around for a little while to enforce backward compatibility with older config files. ``` 會發現其實有 `NO_HZ_FULL` 還有 `NO_HZ_IDLE`，且 `NO_HZ` 現今已經大多不再使用，留著是為了舊版配置的 compatibility。在設定上 `NO_HZ_IDLE` 還有 `NO_HZ` 會是一樣的行為。從程式碼說明可以得知： * `NO_HZ_IDLE`：CPU 處於閒置時，停止週期性的 timer interrupts * `NO_HZ_FULL`：即使 CPU 並非完全閒置，只有一個 runnable 任務時，也嘗試停止週期性的 timer interrupts 我覺得這部分可以改寫的更清楚，然後 `NO_HZ` 改用 `NO_HZ_IDLE`。另外一個點是，是否要與 `CONFIG_THREAD_INFO_IN_TASK` 一樣，統一用加上 `CONFIG_` 前綴？這部分在上面第一章有說明。 ## ch4 ### Typo 相關問題 1. 4.3.1 Group/Entity Hierarchy > The scheduling entities se(1) and se(3) represents some individual tasks represents -> represent 2. 4.3.1 Group/Entity Hierarchy > Entities that belongs to a group, like se(A) and se(B) belongs -> belong ### 其他問題 1. 4.3.1 Group/Entity Hierarchy ![截圖 2025-05-08 下午2.05.06](https://hackmd.io/_uploads/HyaGb0tlgx.png =80%x) 這張圖是錯的 se(B) se(A) 的 parent 應該是要指向 se(2) > todo 須釐清 [教材](https://hackmd.io/@sysprog/Hka7Kzeap/%2FB18MhT00t)中的圖也有部分畫錯 ![截圖 2025-05-06 上午11.43.04](https://hackmd.io/_uploads/r1S9GRKxle.png =90%x)