Notes on the lecture videos and in-class discussion:

- [ ] [Linux 核心設計: 賦予應用程式生命的系統呼叫](https://hackmd.io/@sysprog/linux-syscall)
- [ ] [Linux 核心設計: 針對事件驅動的 I/O 模型演化](https://hackmd.io/@sysprog/linux-io-model)
- [ ] Demystifying the Linux CPU Scheduler

## [bpftrace](https://github.com/bpftrace/bpftrace/blob/master/man/adoc/bpftrace.adoc)

bpftrace is a tracing tool built on eBPF (extended Berkeley Packet Filter) for Linux systems.

```shell
$ sudo bpftrace [OPTIONS] FILENAME
$ sudo bpftrace [OPTIONS] -e 'program code'
```

> A program will continue running until Ctrl-C is hit, or an exit function is called.

### Probe types

#### 1. [kprobe/kretprobe](https://docs.kernel.org/trace/kprobes.html)

Traces kernel functions. I used to confirm whether a kernel-module function was being called with `printk` plus `dmesg`, but a kprobe can be inserted dynamically instead: when the kernel reaches the target function, the probe's callback collects the information we want and execution then resumes on the original path. See the kernel-module sketch after the tracepoint listings below for what this looks like without bpftrace.

Common usage:

```shell
kprobe:function_name { action }
kretprobe:function_name { action }
```

> How Does a Kprobe Work?
>
> When a kprobe is registered, Kprobes makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction (e.g., int3 on i386 and x86_64).
>
> When a CPU hits the breakpoint instruction, a trap occurs, the CPU's registers are saved, and control passes to Kprobes via the notifier_call_chain mechanism. Kprobes executes the "pre_handler" associated with the kprobe, passing the handler the addresses of the kprobe struct and the saved registers.
>
> Next, Kprobes single-steps its copy of the probed instruction. (It would be simpler to single-step the actual instruction in place, but then Kprobes would have to temporarily remove the breakpoint instruction. This would open a small time window when another CPU could sail right past the probepoint.)
>
> After the instruction is single-stepped, Kprobes executes the "post_handler," if any, that is associated with the kprobe. Execution then continues with the instruction following the probepoint.

This is non-intrusive: no change to the probed function's source is needed.

#### 2. tracepoint

Tracepoints are static trace points built into the Linux kernel source. Unlike kprobes, which dynamically patch a probe into the target function, a tracepoint that is not enabled adds almost no overhead.

> The advantage of tracepoints is that they generally provide a more stable interface than kprobes do, they do not depend on the existence of a kernel function.

Common use: observing system calls (syscalls), among other things.

Example usage: list all syscall tracepoints whose names start with `sys_enter_`, then look for a suitable target tracepoint to write a bpftrace script against:

```shell
$ sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*'
tracepoint:syscalls:sys_enter_accept
tracepoint:syscalls:sys_enter_accept4
tracepoint:syscalls:sys_enter_access
tracepoint:syscalls:sys_enter_acct
tracepoint:syscalls:sys_enter_add_key
tracepoint:syscalls:sys_enter_adjtimex
tracepoint:syscalls:sys_enter_alarm
tracepoint:syscalls:sys_enter_arch_prctl
tracepoint:syscalls:sys_enter_bind
tracepoint:syscalls:sys_enter_bpf
...
```

We can also see which tracepoints exist for the scheduler or for IRQs:

```shell
$ sudo bpftrace -l 'tracepoint:sched:*'
tracepoint:sched:sched_kthread_stop
tracepoint:sched:sched_kthread_stop_ret
tracepoint:sched:sched_kthread_work_execute_end
tracepoint:sched:sched_kthread_work_execute_start
tracepoint:sched:sched_kthread_work_queue_work
tracepoint:sched:sched_migrate_task
tracepoint:sched:sched_move_numa
tracepoint:sched:sched_pi_setprio
tracepoint:sched:sched_prepare_exec
tracepoint:sched:sched_process_exec
tracepoint:sched:sched_process_exit
tracepoint:sched:sched_process_fork
...
```

```shell
$ sudo bpftrace -l 'tracepoint:irq:*'
tracepoint:irq:irq_handler_entry
tracepoint:irq:irq_handler_exit
tracepoint:irq:softirq_entry
tracepoint:irq:softirq_exit
tracepoint:irq:softirq_raise
tracepoint:irq:tasklet_entry
tracepoint:irq:tasklet_exit
...
```
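To make the kprobe mechanism above more concrete, here is a minimal kernel-module sketch that registers a kprobe by hand, roughly what bpftrace automates for us. This is my own example, not from the course material; the probed symbol `do_sys_openat2` and the message text are arbitrary choices for illustration.

```c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/sched.h>

/* Probe do_sys_openat2 as an example target; any traceable kernel
 * function name could be used instead. */
static struct kprobe kp = {
    .symbol_name = "do_sys_openat2",
};

/* pre_handler runs just before the probed instruction executes. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
    pr_info("kprobe hit: %s, comm=%s, pid=%d\n",
            p->symbol_name, current->comm, current->pid);
    return 0;
}

static int __init kprobe_demo_init(void)
{
    kp.pre_handler = handler_pre;
    return register_kprobe(&kp);   /* 0 on success */
}

static void __exit kprobe_demo_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(kprobe_demo_init);
module_exit(kprobe_demo_exit);
MODULE_LICENSE("GPL");
```

The one-liner `sudo bpftrace -e 'kprobe:do_sys_openat2 { printf("%s\n", comm); }'` gives roughly the same visibility without writing and loading a module.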
## [Linux 核心設計: 不僅是個執行單元的 Process](https://hackmd.io/@sysprog/linux-process)

Linux treats a process as a container that can hold multiple threads. The distinction Linux draws between a process and a thread mainly comes down to how much is shared:

### process vs thread

#### What is threading?

Running multiple threads inside the same process address space (shared). Threads can share memory, open files, and other resources. Many operating systems treat the thread as the smallest unit of execution.

Linux has no dedicated thread data structure: every thread is a [task_struct](https://elixir.bootlin.com/linux/v6.14.1/source/include/linux/sched.h#L791) (that is, a process); the `clone` family of system calls simply decides which resources it shares.

> Both clone() and clone3() allow a flags bit mask that modifies their behavior and allows the caller to specify what is shared between the calling process and the child process.

> they're called lightweight processes (LWP)

Threads also bring problems: because threads share memory and resources, a segmentation fault in one thread crashes the whole process, and a lot of cost goes into handling locks around shared resources. This is why TLS (thread-local storage) exists: it gives each thread its own copy of designated variables so threads do not interfere with one another. It needs help from both the OS and the compiler to generate code whose per-thread copies stay separate.

> 4/21 addendum:
> To better accommodate concurrent programs, operating systems introduced a hybrid abstraction called a thread.

[POSIX thread](https://man7.org/linux/man-pages/man3/pthread_create.3.html): a user-space API.

#### fork

Creates a child process. Parent and child are mostly identical in content, but occupy separate memory and have different PIDs. If `exec` immediately follows `fork`, the kernel does not copy the address space up front; this is Linux's copy-on-write optimization (pages are only duplicated when one side writes to them).

#### clone

Much like `fork`, except the flags decide whether resources are shared: we can choose to share (thread-like) or not to share (process-like).

### eBPF

Mainly an excuse to get more familiar with bpftrace. From the tracepoint listing earlier, there is a syscall-level `sys_enter_clone` to play with; `-lv` shows its arguments:

```shell
$ sudo bpftrace -lv tracepoint:syscalls:sys_enter_clone
```
```
tracepoint:syscalls:sys_enter_clone
    int __syscall_nr
    unsigned long clone_flags
    unsigned long newsp
    int * parent_tidptr
    int * child_tidptr
    unsigned long tls
```

The following uses those arguments to observe the clone flags:

```shell
$ sudo bpftrace -e '
tracepoint:syscalls:sys_enter_clone {
    printf("PID %d called clone with flags=0x%x\n", pid, args->clone_flags);
}
tracepoint:syscalls:sys_exit_clone {
    printf("  clone returned: child_pid/tid=%d\n", args->ret);
    printf("-------------------------\n");
}'
```

`sys_enter_clone` fires when the clone system call starts executing and records the arguments passed in. `sys_exit_clone` fires when the clone system call has finished and is about to return to user space. With this bpftrace script running, anything in the system that calls `clone` produces output.

Sample output:

```
PID 11367 called clone with flags=0x1200011
  clone returned: child_pid/tid=11368
-------------------------
PID 4366 called clone with flags=0x3d0f00
  clone returned: child_pid/tid=11369
-------------------------
```

#### fork()

First run the test program:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

int main()
{
    pid_t pid = fork();
    printf("TGID: %d, PID: %d, Parent TGID: %d\n", getpid(), gettid(), getppid());
    return 0;
}
```

Sample output:

```
TGID: 12105, PID: 12105, Parent TGID: 11694
TGID: 12106, PID: 12106, Parent TGID: 12105
```

After `fork`, two processes run the program: `printf` executes twice, once in each of them. The second process's Parent TGID is the first process's TGID.

Note that in the kernel's terminology, PID (Process IDentifier) actually refers to a thread's ID, while TGID (Thread Group ID) is the ID shared by all threads belonging to the same process. `getpid` returns the TGID, which identifies the thread group leader and therefore stands for the process; to get the per-thread PID, call `gettid`.

> Question:
> The definitions above follow §1.3.1 of "Demystifying the Linux CPU Scheduler".
> But the Linux manual pages for [getpid](https://man7.org/linux/man-pages/man2/getpid.2.html) and [gettid](https://man7.org/linux/man-pages/man2/gettid.2.html) word it differently:
> gettid() returns the caller's thread ID (TID). In a single-threaded process, the thread ID is equal to the process ID (PID), as returned by getpid(2).
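To see the TGID/TID distinction from user space, here is a small sketch of my own (not from the course notes) that starts two POSIX threads and prints `getpid()` next to the raw gettid syscall: every thread reports the same `getpid()` value but a different TID.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Use the raw syscall so this also works on older glibc without gettid(). */
static pid_t my_gettid(void) { return (pid_t)syscall(SYS_gettid); }

static void *worker(void *arg)
{
    printf("thread %ld: getpid()=%d, gettid()=%d\n",
           (long)arg, getpid(), my_gettid());
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    printf("main:     getpid()=%d, gettid()=%d\n", getpid(), my_gettid());
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Compile with `gcc -pthread`; only the main thread's TID equals the value `getpid()` returns, which matches the man-page wording quoted above.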
---

Open two terminals for testing: one traces clone (the script from before) and one traces fork, as follows:

```shell
$ sudo bpftrace -e '
tracepoint:syscalls:sys_enter_fork {
    printf("PID %d called fork\n", pid);
}
tracepoint:syscalls:sys_exit_fork {
    printf("  fork returned: child_pid/tid=%d\n", args->ret);
    printf("-------------------------\n");
}'
Attaching 2 probes...
```

Running the `fork` program again shows that what actually reacts is the clone tracepoint. Use strace to observe the system calls issued from user space:

```shell
$ strace ./test
```
```
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7021e365ca10) = 15883
getppid()                               = 15879
getpid()                                = 15882
fstat(1, PID: 15883, Parent PID: 15882
```

So when the `fork` function is used, the system call Linux actually enters the kernel with is clone.

#### pthread_create()

Next, use the pthread API to create a thread and compare it with creating a process.

```shell
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x75af9b1ff990, parent_tid=0x75af9b1ff990, exit_signal=0, stack=0x75af9a9ff000, stack_size=0x7fff80, tls=0x75af9b1ff6c0} => {parent_tid=[16238]}, 88) = 16238
```

Creating a thread uses the newer `clone3` system call, unlike creating a process. Why? [clone3](https://man7.org/linux/man-pages/man2/clone.2.html) was introduced in Linux kernel 5.4. Compared with clone, clone3 offers more flexible argument handling, and creating a thread usually requires more parameters, for example the TLS-related arguments plus a pile of flags that make the thread share the address space and other data.

#### The corresponding kernel functions

`sys_clone`:

```c
#ifdef __ARCH_WANT_SYS_CLONE
#ifdef CONFIG_CLONE_BACKWARDS
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
		 int __user *, parent_tidptr,
		 unsigned long, tls,
		 int __user *, child_tidptr)
#elif defined(CONFIG_CLONE_BACKWARDS2)
SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
		 int __user *, parent_tidptr,
		 int __user *, child_tidptr,
		 unsigned long, tls)
#elif defined(CONFIG_CLONE_BACKWARDS3)
SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
		int, stack_size,
		int __user *, parent_tidptr,
		int __user *, child_tidptr,
		unsigned long, tls)
#else
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
		 int __user *, parent_tidptr,
		 int __user *, child_tidptr,
		 unsigned long, tls)
#endif
{
	struct kernel_clone_args args = {
		.flags		= (lower_32_bits(clone_flags) & ~CSIGNAL),
		.pidfd		= parent_tidptr,
		.child_tid	= child_tidptr,
		.parent_tid	= parent_tidptr,
		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
		.stack		= newsp,
		.tls		= tls,
	};

	return kernel_clone(&args);
}
#endif
```

`sys_clone3`:

```c
/**
 * sys_clone3 - create a new process with specific properties
 * @uargs: argument structure
 * @size:  size of @uargs
 *
 * clone3() is the extensible successor to clone()/clone2().
 * It takes a struct as argument that is versioned by its size.
 *
 * Return: On success, a positive PID for the child process.
 *         On error, a negative errno number.
 */
SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
{
	int err;

	struct kernel_clone_args kargs;
	pid_t set_tid[MAX_PID_NS_LEVEL];

#ifdef __ARCH_BROKEN_SYS_CLONE3
#warning clone3() entry point is missing, please fix
	return -ENOSYS;
#endif

	kargs.set_tid = set_tid;

	err = copy_clone_args_from_user(&kargs, uargs, size);
	if (err)
		return err;

	if (!clone3_args_valid(&kargs))
		return -EINVAL;

	return kernel_clone(&kargs);
}
```

Both `sys_clone` and `sys_clone3` end up in `kernel_clone`:

```c
/*
 * Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 *
 * args->exit_signal is expected to be checked for sanity by the caller.
 */
pid_t kernel_clone(struct kernel_clone_args *args)
```

```c
/*
 * This creates a new process as a copy of the old one,
 * but does not actually start it yet.
 *
 * It copies the registers, and all the appropriate
 * parts of the process environment (as per the clone
 * flags). The actual kick-off is left to the caller.
 */
__latent_entropy struct task_struct *copy_process(
					struct pid *pid,
					int trace,
					int node,
					struct kernel_clone_args *args)
```

> 4/22: in class, jserv asked whether calling `pthread_create` creates a process.

We already know `pthread_create` goes through the `clone3` system call. In the [source](https://github.com/torvalds/linux/blob/bc3372351d0c8b2726b7d4229b878342e3e6b0e8/kernel/fork.c#L3117), the comment on `sys_clone3` says it "create[s] a new process with specific properties", and it then calls `copy_process`, which uses the clone flags to decide how different the newly created structure is from its parent.

So `fork` and `pthread_create` end up on the same path; what deserves attention is the clone flags, not a strict process/thread dichotomy.

#### clone flags

As mentioned above, the clone family of system calls takes flags that determine its behavior. Compare the flags used for a thread against those used for a process. The cloning flags below are excerpted from [linux/include/uapi/linux/sched.h](https://github.com/torvalds/linux/blob/fc96b232f8e7c0a6c282f47726b2ff6a5fb341d2/include/uapi/linux/sched.h#L11).

1. `pthread_create`

```
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID
```

```c
#define CLONE_VM	0x00000100	/* set if VM shared between processes */
#define CLONE_FS	0x00000200	/* set if fs info shared between processes */
#define CLONE_FILES	0x00000400	/* set if open files shared between processes */
#define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
#define CLONE_THREAD	0x00010000	/* Same thread group? */
#define CLONE_SYSVSEM	0x00040000	/* share system V SEM_UNDO semantics */
#define CLONE_SETTLS	0x00080000	/* create a new TLS for the child */
#define CLONE_PARENT_SETTID	0x00100000	/* set the TID in the parent */
#define CLONE_CHILD_CLEARTID	0x00200000	/* clear the TID in the child */
```

> Question:
> Why does the comment on CLONE_THREAD end with a question mark?

* **CLONE_VM**: share the virtual memory space. This is the defining property of a thread; separate processes normally have their own address spaces.
* **CLONE_FS**: share filesystem information.
* **CLONE_THREAD**: the new task is a thread rather than a separate process; it joins the caller's thread group and shares its TGID.
* **CLONE_SETTLS**: set up a new TLS (Thread Local Storage) area for the thread.

2. `fork`

```
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD
```

The biggest difference between a thread and a process is therefore the degree of resource sharing: a thread shares the address space, filesystem information, open files, and so on, while a process is an independent execution entity with its own memory space and system resources.
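To see that the sharing really is decided by the flags, here is a small user-space sketch of mine (not from the notes) using glibc's `clone()` wrapper: with CLONE_VM the parent sees the child's write to a global variable, without it the child only modifies its own copy-on-write copy.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int counter = 0;          /* shared only when CLONE_VM is used */

static int child_fn(void *arg)
{
    counter = 42;                /* modify the "shared" variable */
    return 0;
}

static void run_clone(int flags, const char *label)
{
    char *stack = malloc(STACK_SIZE);
    counter = 0;
    /* clone() expects the top of the child stack (stacks grow down on x86) */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, flags | SIGCHLD, NULL);
    waitpid(pid, NULL, 0);       /* child has exited before we look */
    printf("%-18s parent sees counter = %d\n", label, counter);
    free(stack);
}

int main(void)
{
    run_clone(CLONE_VM, "with CLONE_VM:");    /* prints 42: same address space  */
    run_clone(0,        "without CLONE_VM:"); /* prints 0: child got a CoW copy */
    return 0;
}
```

Adding CLONE_THREAD (as `pthread_create` does) would additionally place the child in the caller's thread group, which is exactly what the clone3 call traced above shows.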
I happened to read §1.2.4 of "Demystifying the Linux CPU Scheduler", which mentions that on some architectures the `thread_info` structure used to sit at the bottom of the kernel stack, so that

> the overflow would corrupt the thread_info structure, which is the first data that the stack encounters along its path. This corruption would make the process non-existent for the kernel and cause a memory leak

```c
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
	/*
	 * For reasons of header soup (see current_thread_info()), this
	 * must be the first element of task_struct.
	 */
	struct thread_info		thread_info;
#endif
...
};
```

Later implementations therefore moved `thread_info` into `task_struct` itself, so a stack overflow cannot clobber it.

[linux/include/linux/thread_info.h](https://github.com/torvalds/linux/blob/9d7a0577c9db35c4cc52db90bc415ea248446472/include/linux/thread_info.h#L17)

```c
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
 * For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
 * definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
 * including <asm/current.h> can cause a circular dependency on some platforms.
 */
```

```shell
$ cat /boot/config-$(uname -r) | grep CONFIG_THREAD_INFO_IN_TASK
```

On my machine this shows `CONFIG_THREAD_INFO_IN_TASK=y`, meaning thread_info lives inside task_struct.

---

Looking at the [source](https://elixir.bootlin.com/linux/v6.14.3/source/arch/x86/include/asm/thread_info.h#L62), the thread_info structure under x86:

```c
struct thread_info {
	unsigned long		flags;		/* low level flags */
	unsigned long		syscall_work;	/* SYSCALL_WORK_ flags */
	u32			status;		/* thread synchronous flags */
#ifdef CONFIG_SMP
	u32			cpu;		/* current CPU */
#endif
};
```

> when THREAD_INFO_IN_TASK is enabled, thread_info must remain at the beginning of task_struct as current_thread_info() retrieves it directly from there. The implementation relies on specific field ordering, and altering this order would break offset calculations across different architectures, potentially causing kernel malfunctions.
>
> todo

### How is a multithreaded process organized and managed?

![cpu-sched-book-20250331 2](https://hackmd.io/_uploads/BJDtZWEJgl.jpg)

> Question:
> Why do all the PIDs in pid_list hold the same value?
> Shouldn't different threads in the same thread group have different PIDs?

---

Later I found another usable tracepoint; the earlier ones were at the syscall level.

```shell
$ sudo bpftrace -lv tracepoint:sched:sched_process_fork
tracepoint:sched:sched_process_fork
    char parent_comm[16]
    pid_t parent_pid
    char child_comm[16]
    pid_t child_pid
```

`sched:sched_process_fork` is a scheduler-level tracepoint that fires whenever a new task is forked, regardless of whether it was created via fork(), vfork(), clone(), pthread_create(), and so on. Unlike `sys_enter_clone`, it does not depend on which syscall was used.

### Task Lifetime

```c
SYSCALL_DEFINE1(exit, int, error_code)
{
	do_exit((error_code&0xff)<<8);
}
```

> Upon exit, its exit state is initially set to the zombie state.

The task becomes a zombie and waits for its parent process to reap it with wait(). This is set by `tsk->exit_state = EXIT_ZOMBIE;` inside the `exit_notify` function:

```c
/*
 * Send signals to all our closest relatives so that they know
 * to properly mourn us..
 */
static void exit_notify(struct task_struct *tsk, int group_dead)
```

```c
	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
		list_del_init(&p->ptrace_entry);
		release_task(p);
	}
```

> If the parent of a zombie process exits without waiting, the child becomes an orphan process so it will become a child of init. The ancestor process (init) has a routine that waits periodically to reap possible zombie processes; so the child process will simply be waited by init and get cleared.

In `exit_notify`:

```c
/*
 * This does two things:
 *
 * A. Make init inherit all the child processes
 * B. Check to see if any process groups have become orphaned
 *    as a result of our exiting, and if they have any stopped
 *    jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
 */
static void forget_original_parent(struct task_struct *father,
					struct list_head *dead)
```

```c
	reaper = find_child_reaper(father, dead);
```

finds a live process to adopt the zombie process as its child, so the zombie can later be wait()-ed on and cleaned up.

### [vm_area_struct](https://hackmd.io/@RinHizakura/rJfd9Xdfv#vm_area_struct-in-Linux)

Linux uses vm_area_struct to manage virtual memory areas; it is the basic unit of memory management. Early kernels kept these areas on a linked list, which was inefficient because traversal costs O(n); nowadays a red-black tree combined with a cache mechanism is used to improve efficiency. `mm_struct` is the memory-management structure, describing a sparse address space.

#### mmap system call

Maps a file or a device into memory: by reading and writing that region of memory we can operate on the underlying file or peripheral. Files from the filesystem are mapped into memory. It supports demand paging: the system only allocates actual physical memory when it is needed.

> cf. "Demystifying the Linux CPU Scheduler" §1.2.4
> The user space stack can potentially be increasing, and even though it is initially small it can allocate more physical memory if needed: this mechanism is called "on-demand paging".

Most of the time the operating system is lazy and only provides the service when it really has to.
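A minimal user-space sketch of mmap (my own example, assuming a readable file path is passed on the command line): the mapping is created immediately, but physical pages are only faulted in once the mapped bytes are actually touched.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror("open/fstat");
        return 1;
    }

    /* Set up the mapping: no file data has been read from disk yet. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touching one byte per page triggers the page faults that pull
     * the pages in on demand. */
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];

    printf("mapped %lld bytes, checksum of first byte per page = %ld\n",
           (long long)st.st_size, sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```

Watching the resident set in `/proc/<pid>/smaps` while the loop runs (or attaching bpftrace to a page-fault probe) would show memory being populated only as pages are touched.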
#### context switching

Switching from one running process to another means switching address spaces, which requires changing the page tables. In certain situations a context switch can be avoided: for example, when two tasks exchange data, reading the values directly out of registers instead of going through a context switch can lower the cost. In the end it all depends on whatever optimization strategy applies at the moment.

> Each process has an execution context, which includes everything needed for a task to execute, from its stack to the code.

Before a context switch, the hardware context has to be saved: part of it goes into task_struct (e.g. `task_struct->thread.sp`) and part stays on the kernel stack.

> switches the address spaces of the two processes and then calls __switch_to()

This is very low-level code, tightly tied to the hardware architecture. Note that the scheduler is not an independent kernel thread: the current process itself executes the scheduler code, and a context switch always happens in kernel mode.

### scheduling classes and policies

> Within each class, policies represent special scheduling decisions, and each process has a policy associated with it.

```c
struct task_struct {
	...
	unsigned int			policy;
	struct sched_entity		se;
#ifdef CONFIG_SCHED_CLASS_EXT
	struct sched_ext_entity		scx;
#endif
	const struct sched_class	*sched_class;
	...
}
```

> The scheduler relies on the priority setting provided by its source to determine the execution order at runtime. In another word, the scheduler only put a task from a scheduling class into execution only when there are no other available tasks with higher priority in the scheduling hierarchy.

From the scheduler's point of view, it does not operate on `task_struct` directly, which means it does not care whether a task is a thread or a process: what the scheduler manipulates is the `sched_entity` structure. What runs on the runqueues are scheduling entities.

#### group scheduling

In the simplest case, one `sched_entity` corresponds to one task, but multiple tasks can also be organized into a group represented by a single `sched_entity`.

> Why does `task_struct` have on_rq when `sched_entity` also has on_rq?

```c
/* task_struct::on_rq states: */
#define TASK_ON_RQ_QUEUED	1
#define TASK_ON_RQ_MIGRATING	2
```

> todo

### O(N) scheduler

Uses a single runqueue protected by a single runqueue lock. This design is good for load balancing, since a task can run on any processor, but precisely because a task may land on any processor, cache efficiency is poor (there is no notion of processor affinity). And since there is only one runqueue, there is only one lock protecting it: when one processor needs to modify the runqueue, all the others must wait for the lock to be released. How long does picking the next task take? O(N), the cost of walking the whole queue.

### [Linux O(1) scheduler](https://hackmd.io/@sysprog/Hka7Kzeap/%2FS1opp7-mP)

A global scheduler that selects and switches tasks in constant time.

> There is one queue for each priority level, thus MAX_PRIO (140) is both the highest priority and the number of queues. Tasks are indexed according to their priority [0,139].

It maintains `2 * 140 * number of CPUs` queues: the O(1) scheduler keeps two arrays per CPU, the active and the expired array, and each array contains queues for the 140 priority levels.

```c
struct prio_array {
	int nr_active;
	unsigned long bitmap[BITMAP_SIZE];
	struct list_head queue[MAX_PRIO];
};
```

> tasks are migrated only if the core imbalance exceeds 25%, and cache-hot tasks (those in the active array) are less likely to be migrated than tasks in the expired array.

```c
#define TASK_INTERACTIVE(p) \
	((p)->prio <= (p)->static_prio - DELTA(p))

#define DELTA(p) \
	(SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \
		INTERACTIVE_DELTA)

#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
```

```c
void scheduler_tick(int user_ticks, int sys_ticks)
{
	...
	if (unlikely(rt_task(p))) {
		/*
		 * RR tasks need a special form of timeslice management.
		 * FIFO tasks have no timeslices.
		 */
		if ((p->policy == SCHED_RR) && !--p->time_slice) {
			p->time_slice = task_timeslice(p);
			p->first_time_slice = 0;
			set_tsk_need_resched(p);

			/* put it at the end of the queue: */
			dequeue_task(p, rq->active);
			enqueue_task(p, rq->active);
		}
		goto out_unlock;
	}
	if (!--p->time_slice) {
		dequeue_task(p, rq->active);
		set_tsk_need_resched(p);
		p->prio = effective_prio(p);
		p->time_slice = task_timeslice(p);
		p->first_time_slice = 0;

		if (!rq->expired_timestamp)
			rq->expired_timestamp = jiffies;
		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
			enqueue_task(p, rq->expired);
		} else
			enqueue_task(p, rq->active);
	}
	...
}
```

My summary after reading the source:

`scheduler_tick` first checks `if (unlikely(rt_task(p)))`. The `rt_task` macro tests whether prio is below `MAX_RT_PRIO` (100), i.e. whether the task is real-time. So when `p->time_slice` runs out there are two paths: one for tasks in queues 0–99 whose policy is SCHED_RR (`SCHED_FIFO` does not use time slices at all), and one for tasks in queues 100–139.

* real-time
  A real-time task is not thrown into the expired array when its time slice runs out; the code re-enqueues it at the end of the corresponding queue in the active array.
* non real-time
  `p->prio = effective_prio(p);` computes a new dynamic priority, and the `TASK_INTERACTIVE` macro compares the task's current priority with its static priority to judge whether it is interactive. If the task is not considered interactive after the dynamic adjustment, or if expired-task starvation is detected, the task whose slice ran out is moved to the expired array; otherwise it goes back into the active array.

Priority is adjusted dynamically: the dynamic priority takes into account things like frequent blocking or heavy I/O, depending on "how long the task spent sleeping, which refers to time spent idling or waiting". The priority set at configuration time is static; at run time it becomes dynamic.

#### rebalancing tasks

Deciding which core a task should go to has a cost of its own. In theory, keeping every core equally busy looks ideal, but in practice ill-judged rebalancing can hurt performance.

Why not always rebalance?

* NUMA (Non-Uniform Memory Access): different cores see different memory-access latencies, and moving a task between NUMA nodes has extra cost.
* Hyper-threading: the magic that supposedly turns 12 CPUs into 24; I do not know the details yet.
* Multi-core cache behavior: after a task has run on a particular core, its data sits in that core's cache; moving it to another core causes cache misses and lowers performance.

### Completely Fair Scheduler

Linux has used CFS since 2.6.23.

```c
struct sched_entity {
	u64	vruntime;
	...
```

When scheduling, CFS picks the task with the smallest vruntime to run. The discussion below follows the code before Linux v6.6; from v6.6 onward the pick is made by EEVDF rather than by smallest vruntime.

In v6.5.13, inside DEFINE_SCHED_CLASS:

```c
DEFINE_SCHED_CLASS(fair) = {

	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		= yield_task_fair,
	.yield_to_task		= yield_to_task_fair,

	.check_preempt_curr	= check_preempt_wakeup,

	.pick_next_task		= __pick_next_task_fair,
	.put_prev_task		= put_prev_task_fair,
	.set_next_task		= set_next_task_fair,
	...
```

__pick_next_task_fair -> pick_next_task_fair -> pick_next_entity -> __pick_first_entity

From the currently runnable tasks, pick the one with the smallest vruntime, that is, the one the CPU has taken care of the least.

```c
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

	if (!left)
		return NULL;

	return __node_2_se(left);
}
```

It starts by finding the node with the smallest vruntime in the red-black tree. The priority (weight) itself is static.
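As a rough mental model of why a higher weight makes vruntime grow more slowly, and why the leftmost (smallest-vruntime) task is the one that has received the least weighted CPU time, here is a toy user-space simulation I wrote. The update rule `vruntime += delta * NICE_0_WEIGHT / weight` and the nice-level weights mirror the CFS idea, but none of this is kernel code.

```c
#include <stdio.h>

#define NICE_0_WEIGHT 1024   /* CFS uses 1024 as the weight of nice 0 */
#define NTASKS 3

struct toy_task {
    const char *name;
    int weight;              /* larger weight = higher priority          */
    unsigned long vruntime;  /* weighted runtime, what CFS sorts on      */
};

/* Stand-in for picking the leftmost node of the red-black tree. */
static struct toy_task *pick_min_vruntime(struct toy_task *t, int n)
{
    struct toy_task *min = &t[0];
    for (int i = 1; i < n; i++)
        if (t[i].vruntime < min->vruntime)
            min = &t[i];
    return min;
}

int main(void)
{
    struct toy_task tasks[NTASKS] = {
        { "nice 0 ", 1024, 0 },
        { "nice +5",  335, 0 },   /* roughly the kernel's weight for nice 5  */
        { "nice -5", 3121, 0 },   /* roughly the kernel's weight for nice -5 */
    };

    /* Each "tick" the chosen task runs 1000 time units, then its vruntime
     * advances by that delta scaled inversely to its weight. */
    for (int tick = 0; tick < 12; tick++) {
        struct toy_task *cur = pick_min_vruntime(tasks, NTASKS);
        cur->vruntime += 1000UL * NICE_0_WEIGHT / cur->weight;
        printf("tick %2d: ran %s (vruntime now %lu)\n",
               tick, cur->name, cur->vruntime);
    }
    return 0;
}
```

Over the 12 ticks the high-weight (nice -5) task gets picked most often because its vruntime advances the slowest; CFS is "fair" in weighted virtual time, not in wall-clock time.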
### signal

An interrupt targets hardware; a signal targets an application and is used for communication between processes and threads. The kernel defines signals such as `SIGINT`, `SIGTERM`, and `SIGSEGV` and handles their delivery: sending a signal notifies the user program to take the corresponding action, and user programs can also register their own handlers. It is an asynchronous IPC mechanism.

#### [futex](https://hackmd.io/@sysprog/concurrency-thread-package#futex-系統呼叫)

Used to implement POSIX threads.

> A traditional kernel-based lock has to enter kernel mode through a system call whenever it switches between locked and unlocked, in order to check whether any thread is waiting for the lock. But when threads rarely compete for the shared resource (a low contention rate), there is no other thread sitting in the wait queue, yet we still switch from user level into kernel mode just to check; such frequent CPU mode switches hurt performance.

A futex is a fast, low-cost synchronization mechanism provided by Linux: it lowers the cost of locking by doing as much as possible in user space (fast user-space locking). In the past a mutex went through system calls, and since locks are acquired and released constantly, all that switching back and forth made the mode-switch cost very high. The futex approach keeps most of the work in user space, except for the two main operations, wait and wake.

#### Threads and their synchronization problems

A function that can be interrupted and then executed again from the start, called by multiple threads at the same time, without corrupting the state of shared resources, is reentrant. The classic scenario is code being interrupted partway through while the interrupt handler happens to run that same code; `strtok_r` is an example (the `_r` suffix marks the reentrant version), as sketched below. Whenever interruption is possible, reentrancy has to be considered, so avoid global variables as much as possible.
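A small sketch of the difference (my own example): `strtok` keeps its position in hidden static state, so two interleaved tokenizations would trample each other, while `strtok_r` makes the caller hold that state in a local `saveptr`, which is what makes it reentrant and thread-safe.

```c
#include <stdio.h>
#include <string.h>

/* Tokenize two strings in an interleaved fashion.
 * With strtok() the second walk would overwrite the hidden static
 * position of the first one; strtok_r() keeps the position in a
 * caller-owned variable, so the two walks do not interfere. */
int main(void)
{
    char a[] = "alpha,beta,gamma";
    char b[] = "1 2 3";
    char *save_a = NULL, *save_b = NULL;

    char *ta = strtok_r(a, ",", &save_a);
    char *tb = strtok_r(b, " ", &save_b);

    while (ta || tb) {
        if (ta) {
            printf("a: %s\n", ta);
            ta = strtok_r(NULL, ",", &save_a);
        }
        if (tb) {
            printf("b: %s\n", tb);
            tb = strtok_r(NULL, " ", &save_b);
        }
    }
    return 0;
}
```

The same per-caller-state idea is what thread-local storage generalizes: instead of one hidden global, each thread (or each call site) gets its own copy.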