# From `start_kernel` to the First Task: Linux Scheduler Reading Notes (2)

In the previous post, [Review with code: Linux Scheduler Reading Notes (1)](https://hackmd.io/@Kuanch/linux-kernel-scheduler-notes3), we covered `sched_fork()` and traced it back up to `kernel_clone()`. But who calls `kernel_clone()`? How is the first process (PID = 1) created? And how is the CPU scheduler initialized? This post tries to answer these questions in order to build a deeper understanding of the Linux CPU scheduler.

All code below is from Linux kernel `v6.8-rc7`.

## Call Hierarchy

Most of us know that the bootstrap flow is roughly BIOS -> MBR -> boot loader -> kernel. The first three stages are mostly hardware-driven; what we care about here is what happens after the kernel is loaded. The boot loader loads the kernel, and once the kernel has placed itself in memory it runs a series of initializations and checks, at which point it is the operating system. To answer the opening questions, this is where we start.

We begin with the following call hierarchy, from `start_kernel()` down to the familiar `_fair()` functions in `kernel/sched/fair.c`:

```c
- start_kernel()                            // in init/main.c at line 873
    - sched_init()                          // in kernel/sched/core.c at line 9900
    - rest_init()                           // in init/main.c at line 684
        - user_mode_thread()                // in kernel/fork.c at line 2970
        - kernel_thread()                   // in kernel/fork.c at line 2951
            - kernel_clone()                // in kernel/fork.c at line 2861
                - copy_process()            // in kernel/fork.c at line 2240
                - wake_up_new_task()        // in kernel/sched/core.c at line 4868
                    - __set_task_cpu()      // in kernel/sched/sched.h at line 2056
                        - set_task_rq()     // in kernel/sched/sched.h at line 2027
                            - set_task_rq_fair()    // in kernel/sched/fair.c at line 4171
```

### `sched_init()`

`start_kernel()` is defined in `init/main.c` and is [regarded](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-1.md#next-to-start_kernel) as the formal entry point of the Linux kernel. It initializes a long list of subsystems, including the one we care about most, `sched_init()`, and it also starts the first process, PID = 1, which runs `kernel_init`.

![image](https://hackmd.io/_uploads/H1iVmxe-0.png)

One of the most prominent pieces of `sched_init()` is the loop that walks every possible CPU and initializes its run queue:

```c=9964
// defined at kernel/sched/core.c
	for_each_possible_cpu(i) {
		struct rq *rq;

		rq = cpu_rq(i);
		raw_spin_lock_init(&rq->__lock);
		rq->nr_running = 0;
		rq->calc_load_active = 0;
		rq->calc_load_update = jiffies + LOAD_FREQ;
		init_cfs_rq(&rq->cfs);
		init_rt_rq(&rq->rt);
		init_dl_rq(&rq->dl);
		...
	}
```

### `user_mode_thread()` and `kernel_thread()`

Although the figure above shows `kernel_init` being started with `kernel_thread()`, v6.8-rc7 actually uses `pid = user_mode_thread(kernel_init, NULL, CLONE_FS);`. The reason is that `kernel_init` does not really need to be a kernel thread, since it eventually becomes a user process. Note that what `user_mode_thread()` creates here is not a regular user thread: it does not have an `mm`, the field used to tell user threads from kernel threads. For details, see [Patch: kthread: Unify kernel_thread() and user_mode_thread()](https://lore.kernel.org/lkml/20230605161052.033ebe4cecc0a9c879d43f56@linux-foundation.org/t/) and [Patch: kthread: Rename user_mode_thread() to kmuser_thread()](https://lore.kernel.org/lkml/ZJ9kWqhRCWkLcYyv@bombadil.infradead.org/T/).

We can also compare `user_mode_thread()` and `kernel_thread()` here:

```c=2947
// defined at kernel/fork.c
/*
 * Create a kernel thread.
 */
pid_t kernel_thread(int (*fn)(void *), void *arg, const char *name,
		    unsigned long flags)
{
	struct kernel_clone_args args = {
		.flags		= ((lower_32_bits(flags) | CLONE_VM |
				    CLONE_UNTRACED) & ~CSIGNAL),
		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
		.fn		= fn,
		.fn_arg		= arg,
		.name		= name,
		.kthread	= 1,
	};

	return kernel_clone(&args);
}

/*
 * Create a user mode thread.
 */
pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
	struct kernel_clone_args args = {
		.flags		= ((lower_32_bits(flags) | CLONE_VM |
				    CLONE_UNTRACED) & ~CSIGNAL),
		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
		.fn		= fn,
		.fn_arg		= arg,
	};

	return kernel_clone(&args);
}
```

As shown, the only difference between the two is whether the arguments carry `name` and `kthread = 1`. As noted above, however, `user_mode_thread` does not create a user thread but a "special kernel thread".

:::info
:::spoiler How do we create a "user thread"?
Use `sys_clone3()`:
```c=3175
// defined at kernel/fork.c
/**
 * sys_clone3 - create a new process with specific properties
 * @uargs: argument structure
 * @size:  size of @uargs
 *
 * clone3() is the extensible successor to clone()/clone2().
 * It takes a struct as argument that is versioned by its size.
 *
 * Return: On success, a positive PID for the child process.
 *         On error, a negative errno number.
 */
SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
{
	int err;

	struct kernel_clone_args kargs;
	pid_t set_tid[MAX_PID_NS_LEVEL];

	kargs.set_tid = set_tid;

	err = copy_clone_args_from_user(&kargs, uargs, size);
	if (err)
		return err;

	if (!clone3_args_valid(&kargs))
		return -EINVAL;

	return kernel_clone(&kargs);
}
```

The key is whether the `kargs` passed in carry user-thread information, which is then handed on to `kernel_clone()` and `copy_process()`. Note `copy_mm()`, which is called from `copy_process()`:

```c
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
	struct mm_struct *mm, *oldmm;

	tsk->min_flt = tsk->maj_flt = 0;
	tsk->nvcsw = tsk->nivcsw = 0;
#ifdef CONFIG_DETECT_HUNG_TASK
	tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
	tsk->last_switch_time = 0;
#endif

	tsk->mm = NULL;
	tsk->active_mm = NULL;

	/*
	 * Are we cloning a kernel thread?
	 *
	 * We need to steal a active VM for that..
	 */
	oldmm = current->mm;
	if (!oldmm)
		return 0;

	if (clone_flags & CLONE_VM) {
		mmget(oldmm);
		mm = oldmm;
	} else {
		mm = dup_mm(tsk, current->mm);
		if (!mm)
			return -ENOMEM;
	}

	tsk->mm = mm;
	tsk->active_mm = mm;
	sched_mm_cid_fork(tsk);
	return 0;
}
```

If the task being cloned is a kernel thread, that is, `current` has no `->mm`, the new task is created without an `->mm` as well, and vice versa. Put another way, **if cloning is initiated by a user thread (`sys_clone3`), a user thread is created; if it is initiated by `kernel_thread()`, a kernel thread is created.**
:::

### `kernel_clone()` and `copy_process()`

The actual work of the two functions above is hidden in `kernel_clone()`, which in turn calls `copy_process()`. The latter, defined in `kernel/fork.c`, is roughly five hundred lines long, which says a lot about its importance. After it returns, `wake_up_new_task()`, defined in `kernel/sched/core.c`, must still be called before a new process can be considered fully created.

### `__kthread_bind_mask`

`__kthread_bind_mask` binds each `kworker` to a CPU according to its preset CPU affinity. Each CPU gets its own corresponding `kworker`s, or, put the other way around, the `kworker`s reflect the number of CPUs. Multi-core processing starts here.

Breaking on `__kthread_bind_mask` in gdb shows that when `kworker/0:0` and `kworker/0:1` are created their `mask->bits` is 1, while for `kworker/1:0` and `kworker/1:1` it is 2.

## What makes a CPU?

Running `ps` shows many running processes. With KVM in particular, we can watch processes being started one after another from power-on until boot completes, and most of them are kernel threads. What does each of them do? And which kernel threads are needed to make up a minimal, complete CPU?
```sh
#!/bin/bash

qemu-system-x86_64 \
    -M pc \
    -smp 2 \
    -kernel ./output/images/bzImage \
    -drive file=./output/images/rootfs.ext2,if=virtio,format=raw \
    -append "root=/dev/vda console=ttyS0 nokaslr" \
    -net user,hostfwd=tcp:127.0.0.1:3333-:22 \
    -net nic,model=virtio \
    -fsdev local,security_model=passthrough,id=test_dev,path=./share \
    -device virtio-9p-pci,id=fsdev0,fsdev=test_dev,mount_tag=test_mount \
    -nographic \
    -S -s
```

With the qemu command above, we emulate a dual-core x86 system and see the following processes after boot:

```shell
       TASK         PID  COMM
0xffffffff8200a880    0  swapper/0
0xffff8880029c0000    1  init
0xffff8880029c0e00    2  kthreadd
0xffff8880029c1c00    3  pool_workqueue_
0xffff8880029c2a00    4  kworker/R-rcu_g
0xffff8880029c3800    5  kworker/R-rcu_p
0xffff8880029c4600    6  kworker/R-slub_
0xffff8880029c5400    7  kworker/R-netns
0xffff8880029c7000    9  kworker/0:0H
0xffff888002a00000   10  kworker/0:1
0xffff888002a00e00   11  kworker/u4:0
0xffff888002a01c00   12  kworker/R-mm_pe
0xffff888002a02a00   13  rcu_tasks_kthre
0xffff888002a03800   14  rcu_tasks_rude_
0xffff888002a04600   15  rcu_tasks_trace
0xffff888002a05400   16  ksoftirqd/0
0xffff888002a06200   17  rcu_preempt
0xffff888002a07000   18  migration/0
0xffff888002a38e00   19  cpuhp/0
0xffff888002a39c00   20  cpuhp/1
0xffff888002a3aa00   21  migration/1
0xffff888002a3b800   22  ksoftirqd/1
0xffff888002a3c600   23  kworker/1:0
0xffff888002a3d400   24  kworker/1:0H
0xffff888002a78000   25  kdevtmpfs
0xffff888002a78e00   26  kworker/R-inet_
0xffff888002a79c00   27  kworker/1:1
0xffff888002a7aa00   28  kworker/u4:1
0xffff888002a7b800   29  oom_reaper
0xffff888002a7c600   30  kworker/R-write
0xffff888002a3e200   31  kworker/u4:2
0xffff888002a3f000   32  kcompactd0
0xffff888002b08000   33  kworker/R-kbloc
0xffff888002a7d400   34  kworker/R-ata_s
0xffff888002a7e200   35  kswapd0
0xffff888002b08e00   36  kworker/0:1H
0xffff888002b09c00   37  kworker/R-acpi_
0xffff888002a7f000   38  kworker/R-ttm
0xffff888002b0aa00   39  scsi_eh_0
0xffff888002b0b800   40  kworker/R-scsi_
0xffff888002b0c600   41  scsi_eh_1
0xffff888002b0d400   42  kworker/R-scsi_
0xffff888002b0e200   43  kworker/0:2
0xffff888003390000   44  kworker/R-mld
0xffff888002b0f000   45  kworker/R-ipv6_
0xffff8880033b0000   46  kworker/1:1H
0xffff888003391c00   48  kworker/u5:0
0xffff8880033b0e00   49  kworker/R-ext4-
0xffff888003396200   68  syslogd
0xffff888003392a00   72  klogd
0xffff888003393800  112  udhcpc
0xffff888003397000  114  sh
0xffff8880036d0000  115  getty
```

Everything up to PID 49, apart from `init`, looks very much like a kernel thread. We can classify them and understand what each is for:

1. Core Kernel Threads:
    * `swapper/0`: the idle thread.
    * `init`: the initial user-space process.
    * `kthreadd`: the kernel thread daemon responsible for creating other kernel threads.
    * `ksoftirqd/n`: handles software interrupts (softirqs) on CPU n.
    * `migration/n`: manages process migration between CPUs.
    * `cpuhp/n`: manages CPU hotplug operations for CPU n.
2. RCU (Read-Copy Update) Threads:
3. Memory Management Threads:
4. Device and I/O Management Threads: