Notes on the course videos and in-class discussions.
- [ ] [Linux 核心設計: 賦予應用程式生命的系統呼叫](https://hackmd.io/@sysprog/linux-syscall)
- [ ] [Linux 核心設計: 針對事件驅動的 I/O 模型演化](https://hackmd.io/@sysprog/linux-io-model)
- [ ] Demystifying the Linux CPU Scheduler
## [bpftrace](https://github.com/bpftrace/bpftrace/blob/master/man/adoc/bpftrace.adoc)
bpftrace is a tracing tool built on eBPF (extended Berkeley Packet Filter) for Linux systems.
```shell
$ sudo bpftrace [OPTIONS] FILENAME
$ sudo bpftrace [OPTIONS] -e 'program code'
```
>A program will continue running until Ctrl-C is hit, or an exit function is called.
### Probe types
#### 1. [kprobe/kretprobe](https://docs.kernel.org/trace/kprobes.html)
Traces kernel functions.
Previously, to confirm whether a kernel-module function was being called, I used `printk` together with `dmesg`.
With kprobes, a probe can instead be inserted dynamically: when the kernel reaches the target function, the probe's callback collects the relevant information, and execution then returns to the original flow.
Common usage:
```
kprobe:function_name { action }
kretprobe:function_name { action }
```
> How Does a Kprobe Work?
>
>When a kprobe is registered, Kprobes makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction (e.g., int3 on i386 and x86_64).
>
>When a CPU hits the breakpoint instruction, a trap occurs, the CPU’s registers are saved, and control passes to Kprobes via the notifier_call_chain mechanism. Kprobes executes the “pre_handler” associated with the kprobe, passing the handler the addresses of the kprobe struct and the saved registers.
>
>Next, Kprobes single-steps its copy of the probed instruction. (It would be simpler to single-step the actual instruction in place, but then Kprobes would have to temporarily remove the breakpoint instruction. This would open a small time window when another CPU could sail right past the probepoint.)
>
>After the instruction is single-stepped, Kprobes executes the “post_handler,” if any, that is associated with the kprobe. Execution then continues with the instruction following the probepoint.
Since the probed function itself does not have to be modified or recompiled, the approach is non-intrusive; a kernel-module sketch using the same kprobes API follows.
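Besides going through bpftrace, the kprobes API can also be used directly from a kernel module, which matches the printk/dmesg workflow mentioned above. Below is a minimal sketch modeled on the kernel's `samples/kprobes`; probing `kernel_clone` is only an illustrative choice of symbol, not something prescribed by these notes:
```c
#include <linux/kernel.h>
#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/sched.h>

/* Probe kernel_clone() as an example; any traceable kernel symbol works. */
static struct kprobe kp = {
    .symbol_name = "kernel_clone",
};

/* Called right before the probed instruction executes. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
    pr_info("kprobe hit: %s, comm=%s, pid=%d\n",
            p->symbol_name, current->comm, current->pid);
    return 0;
}

static int __init kprobe_demo_init(void)
{
    kp.pre_handler = handler_pre;
    return register_kprobe(&kp);   /* dynamically insert the probe */
}

static void __exit kprobe_demo_exit(void)
{
    unregister_kprobe(&kp);        /* restore the original instruction */
}

module_init(kprobe_demo_init);
module_exit(kprobe_demo_exit);
MODULE_LICENSE("GPL");
```
After `insmod`, every hit shows up in `dmesg`, so the output ends up in the same place as the earlier printk-based approach.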
#### 2. tracepoint
Tracepoints are static trace points built into the Linux kernel source. Unlike kprobes, which dynamically patch probes into target functions, tracepoints cost essentially nothing when they are not enabled.
>The advantage of tracepoints is that they generally provide a more stable interface than kprobes do, they do not depend on the existence of a kernel function.

Typical use: observing system calls (syscalls) and similar events.
Example: list all syscall tracepoints whose names start with `sys_enter_`, then look for a target tracepoint to write a bpftrace script against:
```shell
$ sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*'
tracepoint:syscalls:sys_enter_accept
tracepoint:syscalls:sys_enter_accept4
tracepoint:syscalls:sys_enter_access
tracepoint:syscalls:sys_enter_acct
tracepoint:syscalls:sys_enter_add_key
tracepoint:syscalls:sys_enter_adjtimex
tracepoint:syscalls:sys_enter_alarm
tracepoint:syscalls:sys_enter_arch_prctl
tracepoint:syscalls:sys_enter_bind
tracepoint:syscalls:sys_enter_bpf
...
```
We can also check which tracepoints exist for the scheduler or for IRQs:
```shell
$ sudo bpftrace -l 'tracepoint:sched:*'
tracepoint:sched:sched_kthread_stop
tracepoint:sched:sched_kthread_stop_ret
tracepoint:sched:sched_kthread_work_execute_end
tracepoint:sched:sched_kthread_work_execute_start
tracepoint:sched:sched_kthread_work_queue_work
tracepoint:sched:sched_migrate_task
tracepoint:sched:sched_move_numa
tracepoint:sched:sched_pi_setprio
tracepoint:sched:sched_prepare_exec
tracepoint:sched:sched_process_exec
tracepoint:sched:sched_process_exit
tracepoint:sched:sched_process_fork
...
```
```shell
$ sudo bpftrace -l 'tracepoint:irq:*'
tracepoint:irq:irq_handler_entry
tracepoint:irq:irq_handler_exit
tracepoint:irq:softirq_entry
tracepoint:irq:softirq_exit
tracepoint:irq:softirq_raise
tracepoint:irq:tasklet_entry
tracepoint:irq:tasklet_exit
...
```
## [Linux 核心設計: 不僅是個執行單元的 Process](https://hackmd.io/@sysprog/linux-process)
In Linux, a process is treated as a container that can hold multiple threads. How Linux distinguishes a process from a thread comes down to the degree of resource sharing:
### process vs thread
#### What is threading?
Threading means running multiple threads of execution within the same process address space (shared).
Threads can share memory, open files, and other resources.
Many operating systems treat the thread as the smallest unit of execution.
Linux has no dedicated thread data structure:
every thread is a [task_struct](https://elixir.bootlin.com/linux/v6.14.1/source/include/linux/sched.h#L791) (i.e. a process); the `clone` family of system calls simply selects which resources are shared.
>Both clone() and clone3() allow a flags bit mask that modifies
their behavior and allows the caller to specify what is shared
between the calling process and the child process.
> they're called lightweight processes (LWP)
Threads also come with problems:
Because threads share memory and resources, a segmentation fault in one thread crashes the entire process,
and a lot of cost goes into locking the shared resources.
This is where TLS (thread-local storage) comes in:
selected variables get a per-thread copy so that threads do not interfere with one another.
It takes support from both the OS and the compiler to generate code in which these copies stay independent; a minimal example follows below.
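A small TLS sketch in C using GCC/Clang's `__thread` storage class (C11's `_Thread_local` works the same way); the variable name and thread count are made up for illustration:
```c
#include <pthread.h>
#include <stdio.h>

/* Each thread gets its own copy of this variable. */
static __thread int tls_counter = 0;

static void *worker(void *arg)
{
    for (int i = 0; i < 1000; i++)
        tls_counter++;                 /* no lock needed: per-thread copy */
    printf("thread %lu: tls_counter = %d\n",
           (unsigned long)pthread_self(), tls_counter);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```
Compiled with `-pthread`, each thread prints 1000, because every thread increments its own copy rather than one shared global.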
> Added 4/21:
> To better accommodate concurrent programs, operating systems introduced a hybrid abstraction called a thread.
[POSIX threads](https://man7.org/linux/man-pages/man3/pthread_create.3.html)
> a user-space API
#### fork
Creates a child process.
Parent and child start out with mostly identical contents, but in separate address spaces and with different PIDs.
The pages are managed with copy-on-write: nothing is physically copied at fork time, and a page is only duplicated when one side writes to it. This is why calling exec right after fork is cheap, since the child's address space is replaced before any copying would be needed, and it is the optimization Linux relies on for the fork + exec pattern (see the sketch below).
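A minimal fork + exec sketch in C; `ls -l` is just a placeholder command:
```c
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();        /* COW: page tables are copied, pages are not */
    if (pid == 0) {
        /* exec replaces the child's address space, so almost none of the
         * parent's pages ever need to be physically copied */
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(127);            /* only reached if exec fails */
    }
    int status;
    waitpid(pid, &status, 0);
    return 0;
}
```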
#### clone
Similar to fork, except that flags decide which resources are shared.
Sharing can be chosen per resource: share (thread-like) or not (process-like); the sketch below illustrates the idea.
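A sketch using the glibc `clone()` wrapper; the `child_fn` function and the heap-allocated child stack are made up for illustration. With `CLONE_VM` the child updates the parent's memory directly; dropping the flag gives the child a copy-on-write copy instead:
```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static int counter = 0;          /* shared with the child only if CLONE_VM is set */

static int child_fn(void *arg)
{
    counter++;                   /* visible to the parent only with CLONE_VM */
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack)
        return 1;

    /* Thread-like: share the address space with the parent.
     * Drop CLONE_VM and the child works on a COW copy (process-like). */
    pid_t pid = clone(child_fn, stack + stack_size, /* stack grows downward */
                      CLONE_VM | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        return 1;
    }

    waitpid(pid, NULL, 0);
    printf("counter = %d\n", counter);   /* 1 with CLONE_VM, 0 without */
    free(stack);
    return 0;
}
```
Running it with and without `CLONE_VM` shows the difference directly in the printed counter.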
### eBPF
This section is mainly an excuse to get more familiar with bpftrace.
From the tracepoint listing earlier, there is a syscall-level `sys_enter_clone` to play with; `-lv` shows its arguments:
```shell
$ sudo bpftrace -lv tracepoint:syscalls:sys_enter_clone
```
```
tracepoint:syscalls:sys_enter_clone
int __syscall_nr
unsigned long clone_flags
unsigned long newsp
int * parent_tidptr
int * child_tidptr
unsigned long tls
```
The following demonstrates using those arguments to observe the clone flags:
```shell
$ sudo bpftrace -e '
tracepoint:syscalls:sys_enter_clone {
printf("PID %d called clone with flags=0x%x\n", pid, args->clone_flags);
}
tracepoint:syscalls:sys_exit_clone {
printf(" clone returned: child_pid/tid=%d\n", args->ret);
printf("-------------------------\n");
}'
```
`sys_enter_clone` fires when the clone system call starts executing and records the arguments clone was called with.
`sys_exit_clone` fires when the clone system call finishes and is about to return to user space.
With this bpftrace script running, there is output every time any program in the system calls `clone`.
Sample output:
```
PID 11367 called clone with flags=0x1200011
clone returned: child_pid/tid=11368
-------------------------
PID 4366 called clone with flags=0x3d0f00
clone returned: child_pid/tid=11369
-------------------------
```
#### fork()
First, run the following test program:
```c
#define _GNU_SOURCE   /* for gettid() (glibc >= 2.30) */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main() {
    pid_t pid = fork();
    printf("TGID: %d, PID: %d, Parent TGID: %d\n",
           getpid(), gettid(), getppid());
    return 0;
}
```
Sample output:
```
TGID: 12105, PID: 12105, Parent TGID: 11694
TGID: 12106, PID: 12106, Parent TGID: 12105
```
After `fork` there are two processes running the program, so `printf` executes twice, once in each. The second process's Parent TGID is the first process's TGID.
Note that in the kernel's world, PID (Process IDentifier) actually refers to a thread's ID, while TGID (Thread Group ID) is the ID shared by all threads belonging to the same process.
`getpid` returns the TGID, which identifies the thread group leader and thus represents the process; to get the per-thread PID, call `gettid`.
> Question:
> The definitions above follow section 1.3.1 of 《Demystifying the Linux CPU Scheduler》.
> However, the Linux man pages for [getpid](https://man7.org/linux/man-pages/man2/getpid.2.html) and [gettid](https://man7.org/linux/man-pages/man2/gettid.2.html) word it somewhat differently:
> gettid() returns the caller's thread ID (TID). In a single-threaded process, the thread ID is equal to the process ID (PID), as returned by getpid(2)).
---
To test, open two terminals: one traces clone (the script above) and the other traces fork, as follows:
```shell
$ sudo bpftrace -e '
tracepoint:syscalls:sys_enter_fork {
printf("PID %d called fork\n", pid);
}
tracepoint:syscalls:sys_exit_fork {
printf(" fork returned: child_pid/tid=%d\n", args->ret);
printf("-------------------------\n");
}'
Attaching 2 probes...
```
Running the `fork` program again, it is the clone tracepoint that actually reacts. Use strace to observe which system calls user space issues:
```shell
$ strace ./test
```
```
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7021e365ca10) = 15883
getppid() = 15879
getpid() = 15882
fstat(1, PID: 15883, Parent PID: 15882
```
So when the `fork` library function is used, Linux actually enters the kernel through the clone system call.
#### pthread_create()
Next, create a thread with the pthread API and compare it with creating a process; a minimal test program is sketched below, followed by the strace output.
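A minimal program along these lines (not necessarily the exact test code used for the trace; the `worker` function is just a placeholder), compiled with `-pthread`:
```c
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    printf("hello from thread\n");
    return NULL;
}

int main(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, NULL) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(tid, NULL);   /* wait for the thread to finish */
    return 0;
}
```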
```shell
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
child_tid=0x75af9b1ff990,
parent_tid=0x75af9b1ff990,
exit_signal=0,
stack=0x75af9a9ff000,
stack_size=0x7fff80,
tls=0x75af9b1ff6c0} => {parent_tid=[16238]}, 88) = 16238
```
Thread creation uses the newer `clone3` system call, unlike process creation above. Why?
[clone3](https://man7.org/linux/man-pages/man2/clone.2.html) is a syscall introduced in Linux kernel 5.4. Compared with clone, clone3 offers more flexible argument passing. Creating a thread typically needs more parameters, for example the TLS setup plus a whole set of flags that let the thread share the address space and other data.
#### The corresponding kernel functions
`sys_clone`
```c
#ifdef __ARCH_WANT_SYS_CLONE
#ifdef CONFIG_CLONE_BACKWARDS
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
int __user *, parent_tidptr,
unsigned long, tls,
int __user *, child_tidptr)
#elif defined(CONFIG_CLONE_BACKWARDS2)
SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
int __user *, parent_tidptr,
int __user *, child_tidptr,
unsigned long, tls)
#elif defined(CONFIG_CLONE_BACKWARDS3)
SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
int, stack_size,
int __user *, parent_tidptr,
int __user *, child_tidptr,
unsigned long, tls)
#else
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
int __user *, parent_tidptr,
int __user *, child_tidptr,
unsigned long, tls)
#endif
{
struct kernel_clone_args args = {
.flags = (lower_32_bits(clone_flags) & ~CSIGNAL),
.pidfd = parent_tidptr,
.child_tid = child_tidptr,
.parent_tid = parent_tidptr,
.exit_signal = (lower_32_bits(clone_flags) & CSIGNAL),
.stack = newsp,
.tls = tls,
};
return kernel_clone(&args);
}
#endif
```
`sys_clone3`
```c
/**
* sys_clone3 - create a new process with specific properties
* @uargs: argument structure
* @size: size of @uargs
*
* clone3() is the extensible successor to clone()/clone2().
* It takes a struct as argument that is versioned by its size.
*
* Return: On success, a positive PID for the child process.
* On error, a negative errno number.
*/
SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
{
int err;
struct kernel_clone_args kargs;
pid_t set_tid[MAX_PID_NS_LEVEL];
#ifdef __ARCH_BROKEN_SYS_CLONE3
#warning clone3() entry point is missing, please fix
return -ENOSYS;
#endif
kargs.set_tid = set_tid;
err = copy_clone_args_from_user(&kargs, uargs, size);
if (err)
return err;
if (!clone3_args_valid(&kargs))
return -EINVAL;
return kernel_clone(&kargs);
}
```
Both `sys_clone` and `sys_clone3` end up in the `kernel_clone` function.
```c
/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*
* args->exit_signal is expected to be checked for sanity by the caller.
*/
pid_t kernel_clone(struct kernel_clone_args *args)
```
```c
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
*
* It copies the registers, and all the appropriate
* parts of the process environment (as per the clone
* flags). The actual kick-off is left to the caller.
*/
__latent_entropy struct task_struct *copy_process(
struct pid *pid,
int trace,
int node,
struct kernel_clone_args *args)
```
> 4/22: In class, jserv asked whether calling `pthread_create` creates a process.

We already know `pthread_create` goes through the `clone3` system call. In the [source](https://github.com/torvalds/linux/blob/bc3372351d0c8b2726b7d4229b878342e3e6b0e8/kernel/fork.c#L3117), the comment on `sys_clone3` says it will `create a new process with specific properties`, and it then calls `copy_process`, which uses the clone flags to decide how different the newly created task should be from its parent.
So `fork` and `pthread_create` ultimately take the same path; what deserves attention is the clone flags, rather than a strict process/thread dichotomy.
#### clone flags
As mentioned above, the clone family of system calls takes flags that determine its behavior. We can compare the flags used for a thread against those used for a process.
The cloning-flag definitions below are excerpted from [linux/include/uapi/linux/sched.h](https://github.com/torvalds/linux/blob/fc96b232f8e7c0a6c282f47726b2ff6a5fb341d2/include/uapi/linux/sched.h#L11).
1. `pthread_create`
```
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID
```
```c
#define CLONE_VM 0x00000100 /* set if VM shared between processes */
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
#define CLONE_THREAD 0x00010000 /* Same thread group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */
```
> Question:
> Why does the comment for CLONE_THREAD end with a question mark?
* **CLONE_VM**: share the virtual address space; this is the defining property of a thread. By contrast, separate processes normally have independent address spaces.
* **CLONE_FS**: share filesystem information.
* **CLONE_THREAD**: the new task is a thread rather than a process; it joins the same thread group, so `getpid()` returns the same value.
* **CLONE_SETTLS**: set up a new TLS (Thread Local Storage) area for the thread.
2. `fork`
```
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD
```
The biggest difference between threads and processes is thus the degree of resource sharing: a thread shares the address space, filesystem information, open files, and so on, while a process is created as an independent execution entity with its own address space and system resources.
This happens to match section 1.2.4 of Demystifying the Linux CPU Scheduler,
which notes that on some architectures the `thread_info` structure used to sit at the bottom of the kernel stack, so a stack overflow would hit it first:
>the overflow would corrupt the thread_info structure, which is the first data that the stack encounters along its path. This corruption would make the process non-existent for the kernel and cause a memory leak
```c
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
* For reasons of header soup (see current_thread_info()), this
* must be the first element of task_struct.
*/
struct thread_info thread_info;
#endif
...
};
```
Later implementations therefore tend to put `thread_info` inside `task_struct`, so that stack overflows cannot corrupt it.
[linux/include/linux/thread_info.h](https://github.com/torvalds/linux/blob/9d7a0577c9db35c4cc52db90bc415ea248446472/include/linux/thread_info.h#L17)
```c
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
* For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
* definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
* including <asm/current.h> can cause a circular dependency on some platforms.
*/
```
```shell
$ cat /boot/config-$(uname -r) | grep CONFIG_THREAD_INFO_IN_TASK
```
On my machine this prints `CONFIG_THREAD_INFO_IN_TASK=y`, which means thread_info lives inside task_struct.
---
From the [source](https://elixir.bootlin.com/linux/v6.14.3/source/arch/x86/include/asm/thread_info.h#L62), the thread_info structure on x86:
```c
struct thread_info {
unsigned long flags; /* low level flags */
unsigned long syscall_work; /* SYSCALL_WORK_ flags */
u32 status; /* thread synchronous flags */
#ifdef CONFIG_SMP
u32 cpu; /* current CPU */
#endif
};
```
> when THREAD_INFO_IN_TASK is enabled, thread_info must remain at the beginning of task_struct as current_thread_info() retrieves it directly from there. The implementation relies on specific field ordering, and altering this order would break offset calculations across different architectures, potentially causing kernel malfunctions.
>
> todo
### How is a multithreaded process organized and managed?

> Question:
> Why do all the PIDs inside pid_list have the same value?
> Don't different threads in the same thread group have different PIDs?
---
It turns out there is another tracepoint that can be used; the ones above were at the syscall level.
```shell
$ sudo bpftrace -lv tracepoint:sched:sched_process_fork
tracepoint:sched:sched_process_fork
char parent_comm[16]
pid_t parent_pid
char child_comm[16]
pid_t child_pid
```
`sched:sched_process_fork` is a scheduler-level tracepoint that fires whenever a new task is forked.
Whether the task was created via fork(), vfork(), clone(), or pthread_create(), this tracepoint records it.
Unlike sys_enter_clone, it does not depend on which syscall was used.
### Task Lifetime
```c
SYSCALL_DEFINE1(exit, int, error_code)
{
do_exit((error_code&0xff)<<8);
}
```
> Upon exit, its exit state is initially set to the zombie state.
The task becomes a zombie, waiting for its parent process to reap it with wait().
This state is set inside the `exit_notify` function
```c
/*
* Send signals to all our closest relatives so that they know
* to properly mourn us..
*/
static void exit_notify(struct task_struct *tsk, int group_dead)
```
by the assignment `tsk->exit_state = EXIT_ZOMBIE;`.
```c
list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
list_del_init(&p->ptrace_entry);
release_task(p);
}
```
> If the parent of a zombie process exits without waiting, the child becomes an orphan process so it will become a child of init. The ancestor process (init) has a routine that waits periodically to reap possible zombie processes; so the child process will simply be waited by init and get cleared.
Inside `exit_notify`:
```c
/*
* This does two things:
*
* A. Make init inherit all the child processes
* B. Check to see if any process groups have become orphaned
* as a result of our exiting, and if they have any stopped
* jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
*/
static void forget_original_parent(struct task_struct *father,
struct list_head *dead)
```
```c
reaper = find_child_reaper(father, dead);
```
It finds a live process to adopt the zombie as its child, so that the zombie can eventually be reaped with wait(); a small user-space demo of this cycle follows.
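A user-space sketch of the zombie/reap cycle described above (the 5-second sleep is only there to leave time to inspect the child with `ps`):
```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                 /* child exits immediately -> EXIT_ZOMBIE */

    sleep(5);                     /* while the parent sleeps, the child stays a zombie */

    int status;
    waitpid(pid, &status, 0);     /* reap it: the kernel can now release_task() the child */
    printf("reaped child %d, status %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}
```
While the parent is sleeping, `ps` shows the child as `<defunct>`; after `waitpid` returns, the entry disappears.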
### [vm_area_struct](https://hackmd.io/@RinHizakura/rJfd9Xdfv#vm_area_struct-in-Linux)
Linux uses vm_area_struct to manage virtual memory areas; it is the basic unit of memory management.
Early kernels kept these areas on a linked list, which was inefficient: traversal costs O(n). Nowadays a red-black tree plus a caching mechanism is used to improve efficiency.
mm_struct is the structure that describes a process's memory management as a whole, covering a sparse address space.
#### mmap system call
mmap maps a file or a device into memory; reading and writing that memory region then drives the underlying peripheral,
or accesses a file from the filesystem through memory.
It supports demand paging: the system allocates actual physical memory only when it is needed.
> ex. 《Demystifying the Linux CPU Scheduler》 1.2.4
> The user space stack can potentially be increasing, and even though it is initially small it can allocate more physical memory if needed: this mechanism is called “on-demand paging”.

Most of the time the operating system is lazy, doing the work only when it has to; a file-backed mmap sketch follows.
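A minimal read-only file mapping in C (error handling is trimmed; the file path comes from argv and is arbitrary):
```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; physical pages are only faulted in on access. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    write(STDOUT_FILENO, p, st.st_size);   /* read the file through memory */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```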
#### context switching
Switching from one running process to another.
Replacing the address space requires switching page tables.
In certain situations a context switch can be avoided; for example, when two tasks exchange data, the cost can sometimes be reduced by reading values directly from registers instead of going through a full switch.
In short, it all depends on the optimization strategy of the moment.
> Each process has an execution context, which includes everything needed for a task to execute, from its stack to the code.

Before a context switch the hardware context must be saved:
part of it is stored in task_struct (e.g. `task_struct->thread.sp`) and part on the kernel stack.
> switches the address spaces of the two processes and then calls __switch_to()

This is very low-level code, tightly coupled to the hardware architecture.
The scheduler is not a standalone kernel thread:
the current process itself runs the scheduler code,
and a context switch always takes place in kernel mode.
### scheduling classes and policies
> Within each class, policies represent special scheduling decisions, and each process has a policy associated with it.
```c
struct task_struct {
...
unsigned int policy;
struct sched_entity se;
#ifdef CONFIG_SCHED_CLASS_EXT
struct sched_ext_entity scx;
#endif
const struct sched_class *sched_class;
...
}
```
> The scheduler relies on the priority setting provided by its source to determine the execution order at runtime. In another word, the scheduler only put a task from a scheduling class into execution only when there are no other available tasks with higher priority in the scheduling hierarchy.
From the scheduler's point of view, it does not operate on `task_struct` directly, which means it does not care whether a task is a thread or a process; what the scheduler manipulates is the `sched_entity` structure.
What sits on the runqueues are scheduling entities.
#### group scheduling
In the most basic case, one `sched_entity` corresponds to one task;
multiple tasks can also be organized into a group represented by a single `sched_entity`.
> Why does `task_struct` have on_rq when `sched_entity` also has on_rq?
```c
/* task_struct::on_rq states: */
#define TASK_ON_RQ_QUEUED 1
#define TASK_ON_RQ_MIGRATING 2
```
> todo
### O(N) scheduler
It uses a single runqueue protected by a single runqueue lock.
This design is good for load balancing, since a task can run on any processor,
but precisely because a task may be dispatched to any processor, cache efficiency is poor (there is no notion of processor affinity).
And since there is only one runqueue, there is only one lock protecting it:
when one processor needs to modify the runqueue, the other processors must wait for the lock to be released,
and how long they wait is bounded by the O(N) cost of walking the queue.
### [Linux O(1) scheduler](https://hackmd.io/@sysprog/Hka7Kzeap/%2FS1opp7-mP)
Unlike the O(N) scheduler's single global runqueue, the O(1) scheduler keeps per-CPU runqueues
and selects and switches tasks in constant time.
> There is one queue for each priority level, thus MAX_PRIO (140) is both the highest priority and the number of queues. Tasks are indexed according to their priority [0,139].

In total it maintains `2 * 140 * number_of_CPUs` queues:
for each CPU the O(1) scheduler keeps two arrays, the active and the expired array, and each array contains one queue per priority level (140 of them).
```c
struct prio_array {
int nr_active;
unsigned long bitmap[BITMAP_SIZE];
struct list_head queue[MAX_PRIO];
};
```
> tasks are migrated only if the core imbalance exceeds 25%, and cache-hot tasks (those in the active array) are less likely to be migrated than tasks in the expired array.
```c
#define TASK_INTERACTIVE(p) \
((p)->prio <= (p)->static_prio - DELTA(p))
#define DELTA(p) \
(SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \
INTERACTIVE_DELTA)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
```
```c
void scheduler_tick(int user_ticks, int sys_ticks){
...
if (unlikely(rt_task(p))) {
/*
* RR tasks need a special form of timeslice management.
* FIFO tasks have no timeslices.
*/
if ((p->policy == SCHED_RR) && !--p->time_slice) {
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
set_tsk_need_resched(p);
/* put it at the end of the queue: */
dequeue_task(p, rq->active);
enqueue_task(p, rq->active);
}
goto out_unlock;
}
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
enqueue_task(p, rq->expired);
} else
enqueue_task(p, rq->active);
}
...
}
```
A summary after reading through the source:
In `scheduler_tick` the first check is `if (unlikely(rt_task(p)))`. From the definition of the `rt_task` macro, it tests whether prio is less than `MAX_RT_PRIO` (100), i.e. whether the task is real-time. So when `p->time_slice` runs out there are two paths: one for tasks in queues 0-99 whose policy is SCHED_RR (`SCHED_FIFO` does not use time slices at all), and one for tasks in queues 100-139.
* real-time
  A real-time task is not thrown into the expired array when its slice runs out; the code simply re-queues it in the corresponding queue of the active array.
* non real-time
  `p->prio = effective_prio(p);` computes a new dynamic priority, and the `TASK_INTERACTIVE` macro compares the task's current priority with its static priority to decide whether it counts as interactive.
  If the task is not interactive after the dynamic adjustment, or expired tasks are starving (`EXPIRED_STARVING(rq)`), the task whose time slice ran out is moved to the expired array; otherwise it goes back into the active array.

Priority is adjusted dynamically, taking into account things such as frequent blocking or heavy I/O use,
depending on how long the task spent sleeping, which refers to time spent idling or waiting.
The priority assigned at creation is static; at runtime it becomes dynamic.
#### rebalancing tasks
Deciding which core a task should run on has a cost of its own.
In theory, keeping every core equally loaded looks ideal, but in practice careless rebalancing can hurt performance.
Why not always rebalance?
* NUMA (Non-Uniform Memory Access)
  Memory access latency differs between cores, and moving a task across NUMA nodes has an extra cost.
* hyper-threading
  Described in class as the magic technique that makes 12 CPUs appear as 24; I don't know the details yet.
* multi-core cache behavior
  Once a task has run on a particular core, its data sits in that core's cache; moving it to another core causes cache misses and lowers performance.
### Completely Fair Scheduler
CFS has been in Linux since 2.6.23.
```c
struct sched_entity {
u64 vruntime;
...
```
When scheduling, CFS picks the task with the smallest vruntime to run.
The discussion below focuses on the code before Linux v6.6; from v6.6 onward, EEVDF picks the next task instead of simply taking the smallest vruntime.
Taking v6.5.13, in DEFINE_SCHED_CLASS:
```c
DEFINE_SCHED_CLASS(fair) = {
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
.yield_to_task = yield_to_task_fair,
.check_preempt_curr = check_preempt_wakeup,
.pick_next_task = __pick_next_task_fair,
.put_prev_task = put_prev_task_fair,
.set_next_task = set_next_task_fair,
...
```
`__pick_next_task_fair` -> `pick_next_task_fair` -> `pick_next_entity` -> `__pick_first_entity`
From the currently runnable tasks, pick the one with the smallest vruntime to run,
i.e. the task that has received the least attention from the CPU.
```c
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
if (!left)
return NULL;
return __node_2_se(left);
}
```
This first finds the node with the smallest vruntime in the red-black tree.
The priority (nice value) itself stays static; fairness comes from how vruntime accumulates, as the toy calculation below illustrates.
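A simplified model of how vruntime advances, mirroring the idea behind the kernel's `calc_delta_fair()`; the weight 335 for nice +5 is taken from the kernel's `sched_prio_to_weight` table, but the rest is a toy calculation, not kernel code:
```c
#include <stdint.h>
#include <stdio.h>

#define NICE_0_LOAD 1024

/* Heavier (higher-priority) tasks accumulate vruntime more slowly, so they
 * stay toward the left of the red-black tree longer and get more CPU time. */
static uint64_t update_vruntime(uint64_t vruntime, uint64_t delta_exec_ns,
                                uint64_t weight)
{
    return vruntime + delta_exec_ns * NICE_0_LOAD / weight;
}

int main(void)
{
    /* Two tasks each run 1 ms of real time. */
    uint64_t nice0 = update_vruntime(0, 1000000, 1024);  /* nice 0  */
    uint64_t nice5 = update_vruntime(0, 1000000, 335);   /* nice +5 */
    printf("nice 0 : +%llu ns of vruntime\n", (unsigned long long)nice0);
    printf("nice +5: +%llu ns of vruntime\n", (unsigned long long)nice5);
    return 0;
}
```
For the same 1 ms of real run time, the nice +5 task accumulates roughly three times as much vruntime, so it drifts to the right of the tree and is picked less often.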
### signal
Interrupts target hardware; signals target applications and are used for communication between processes and threads.
The kernel has default handling for signals such as `SIGINT`, `SIGTERM`, and `SIGSEGV`;
sending a signal notifies user space to take the corresponding action,
and user space can register its own handlers as well, which makes signals an asynchronous form of IPC (a minimal handler-registration sketch follows).
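A minimal sketch of registering a user-space handler with `sigaction`; the flag variable and handler name are made up for the example:
```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

static void on_sigint(int sig)
{
    got_sigint = 1;               /* only async-signal-safe work in a handler */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL); /* register a user-space handler for SIGINT */

    while (!got_sigint)
        pause();                  /* sleep until a signal arrives */

    printf("caught SIGINT, exiting\n");
    return 0;
}
```
Pressing Ctrl-C delivers SIGINT asynchronously; the handler only sets a flag, and the main loop reacts to it.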
#### [futex](https://hackmd.io/@sysprog/concurrency-thread-package#futex-系統呼叫)
Used to implement POSIX threads.
> With a traditional kernel-based lock, switching between locked and unlocked requires a system call into kernel mode to check whether any thread is waiting for the lock. Under low contention there is usually no other thread in the wait queue, yet the check still forces a switch from user level into kernel mode, and such frequent CPU mode switches hurt performance.

A futex is a fast, low-cost synchronization mechanism provided by Linux.
It lowers the cost of locking by doing as much as possible in user space ("fast user-space locking").
Previously a mutex always went through a system call; locks are acquired and released very often, so the constant mode switches were expensive.
With futexes most of the work happens in user space, and the kernel is only entered for the two main operations, wait and wake (a minimal sketch follows).
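A minimal wait/wake sketch using the raw futex syscall (there is no glibc wrapper, so `syscall(SYS_futex, ...)` is used; the flag/waiter setup is made up for illustration). The waiter checks the value in user space first and only enters the kernel to block:
```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int flag = 0;   /* the futex word is ordinary user-space memory */

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *waiter(void *arg)
{
    /* Fast path in user space; only enter the kernel if we must block. */
    while (atomic_load(&flag) == 0)
        futex(&flag, FUTEX_WAIT, 0);   /* sleep while *uaddr is still 0 */
    printf("waiter woke up\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    sleep(1);

    atomic_store(&flag, 1);            /* state change in user space ... */
    futex(&flag, FUTEX_WAKE, 1);       /* ... then wake at most one waiter */

    pthread_join(t, NULL);
    return 0;
}
```
In the uncontended case (`flag` already 1) the waiter never makes a syscall at all, which is exactly the point of the design.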
#### Threads and their synchronization problems
A function is reentrant if it can be interrupted and re-entered from the beginning, or called by multiple threads at the same time, without corrupting the state of shared resources.
The scenario to worry about is being interrupted mid-execution while the interrupt handler happens to run the same code; strtok_r is an example (the `_r` suffix marks the reentrant version), illustrated below.
Whenever interrupts are possible, reentrancy has to be considered,
and global variables should be avoided as much as possible.
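A small illustration of why `strtok_r` is the reentrant variant: the caller owns the position state, so two tokenizations can be nested (or run in different threads) without stepping on each other. The input string is arbitrary:
```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    char input[] = "a,b;c,d";
    char *save_outer, *save_inner;

    /* strtok() keeps its position in hidden static state, so nesting it (or
     * calling it from two threads) corrupts that state.  strtok_r() makes the
     * caller hold the state in the save pointers, which makes it reentrant. */
    for (char *tok = strtok_r(input, ";", &save_outer); tok;
         tok = strtok_r(NULL, ";", &save_outer)) {
        for (char *sub = strtok_r(tok, ",", &save_inner); sub;
             sub = strtok_r(NULL, ",", &save_inner))
            printf("%s\n", sub);
    }
    return 0;
}
```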