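# From `start_kernel` to the First Task: Linux Scheduler Reading Notes (2)
In the previous post, [Review with code: Linux Scheduler Reading Notes (1)](https://hackmd.io/@Kuanch/linux-kernel-scheduler-notes3), we mentioned `sched_fork()` and traced it upward to `kernel_clone()`. But who calls `kernel_clone()`?
How is the first process (PID = 1) created? And how is the CPU scheduler initialized? We try to answer these questions to get a deeper understanding of the Linux CPU scheduler.
The code below is from Linux kernel `v6.8-rc7`.
## Call Hierarchy
We all roughly know that the bootstrap flow is BIOS -> MBR -> boot loader -> kernel. The first three are mostly hardware-driven; what we care about here begins once the kernel is loaded. The boot loader loads the kernel, and the kernel, after placing itself in memory, runs a long series of initializations and tests, at which point it is the operating system. This is where we can start answering the opening questions.
We start from the call hierarchy below, going from `start_kernel()` down to the `_fair()` functions we know from `kernel/sched/fair.c`: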
```c
- start_kernel() // in init/main.c at line 873
  - sched_init() // in kernel/sched/core.c at line 9900
  - rest_init() // in init/main.c at line 684
    - user_mode_thread() // in kernel/fork.c at line 2970
    - kernel_thread() // in kernel/fork.c at line 2951
      - kernel_clone() // in kernel/fork.c at line 2861
        - copy_process() // in kernel/fork.c at line 2240
        - wake_up_new_task() // in kernel/sched/core.c at line 4868
          - __set_task_cpu() // in kernel/sched/sched.h at line 2056
            - set_task_rq() // in kernel/sched/sched.h at line 2027
              - set_task_rq_fair() // in kernel/sched/fair.c at line 4171
```
### `sched_init()`
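`start_kernel()`, defined in `init/main.c`, is [regarded](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-1.md#next-to-start_kernel) as the proper entry point of the Linux kernel. It runs the initialization of a long list of subsystems, including the one we care about most, `sched_init()`, and it also starts the first process, PID = 1, which runs `kernel_init`.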
![image](https://hackmd.io/_uploads/H1iVmxe-0.png)
One of the most eye-catching pieces of code in `sched_init()` is the loop that walks over every possible CPU and initializes its run queue:
```c=9964
// defined at kernel/sched/core.c
for_each_possible_cpu(i) {
    struct rq *rq;
    rq = cpu_rq(i);
    raw_spin_lock_init(&rq->__lock);
    rq->nr_running = 0;
    rq->calc_load_active = 0;
    rq->calc_load_update = jiffies + LOAD_FREQ;
    init_cfs_rq(&rq->cfs);
    init_rt_rq(&rq->rt);
    init_dl_rq(&rq->dl);
    ...
}
```
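To make the shape of that loop concrete, here is a minimal user-space sketch of "one run queue per CPU, each initialized in turn". It is not kernel code: `rq_model`, `cpu_rq_model()`, `sched_init_model()`, `NR_CPUS_MODEL` and `LOAD_FREQ_MODEL` are all hypothetical names invented for this illustration, a plain array stands in for the kernel's per-CPU storage, and the locking is left out entirely.

```c
#include <assert.h>

#define NR_CPUS_MODEL   2      /* hypothetical: two CPUs, like a 2-core guest */
#define LOAD_FREQ_MODEL 5001UL /* hypothetical stand-in for LOAD_FREQ */

/* Simplified stand-ins for the per-class queues embedded in struct rq. */
struct cfs_rq_model { unsigned int nr_running; };
struct rt_rq_model  { unsigned int rt_nr_running; };
struct dl_rq_model  { unsigned int dl_nr_running; };

struct rq_model {
    unsigned int nr_running;
    long calc_load_active;
    unsigned long calc_load_update;
    struct cfs_rq_model cfs;
    struct rt_rq_model rt;
    struct dl_rq_model dl;
};

/* One run queue per CPU; a plain array replaces per-CPU storage here. */
static struct rq_model runqueues_model[NR_CPUS_MODEL];

static struct rq_model *cpu_rq_model(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    return &runqueues_model[cpu];
}

/* Same shape as the loop in sched_init(); `now` plays the role of jiffies. */
static void sched_init_model(unsigned long now)
{
    for (int i = 0; i < NR_CPUS_MODEL; i++) {
        struct rq_model *rq = cpu_rq_model(i);

        rq->nr_running = 0;
        rq->calc_load_active = 0;
        rq->calc_load_update = now + LOAD_FREQ_MODEL;
        rq->cfs.nr_running = 0;   /* stands in for init_cfs_rq() */
        rq->rt.rt_nr_running = 0; /* stands in for init_rt_rq() */
        rq->dl.dl_nr_running = 0; /* stands in for init_dl_rq() */
    }
}
```

The point of the sketch is only that every CPU owns a separate queue structure, and that each scheduling class (CFS, RT, deadline) has its own sub-queue inside it.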
### `user_mode_thread()` and `kernel_thread()`
Although the figure above shows `kernel_init` being started through `kernel_thread()`, v6.8-rc7 actually uses `pid = user_mode_thread(kernel_init, NULL, CLONE_FS);`. The reason is that `kernel_init` does not really need to be a kernel thread, since it eventually becomes a user process. Note that what `user_mode_thread` creates here is not a regular user thread either: it does not have the `mm` that is used to tell user threads from kernel threads. For details, see
[Patch: kthread: Unify kernel_thread() and user_mode_thread()](https://lore.kernel.org/lkml/20230605161052.033ebe4cecc0a9c879d43f56@linux-foundation.org/t/)
and
[Patch: kthread: Rename user_mode_thread() to kmuser_thread()](https://lore.kernel.org/lkml/ZJ9kWqhRCWkLcYyv@bombadil.infradead.org/T/)
At this point we can also compare `user_mode_thread()` with `kernel_thread()`:
```c=2947
// defined at kernel/fork.c
/*
 * Create a kernel thread.
 */
pid_t kernel_thread(int (*fn)(void *), void *arg, const char *name,
                    unsigned long flags)
{
    struct kernel_clone_args args = {
        .flags = ((lower_32_bits(flags) | CLONE_VM |
                   CLONE_UNTRACED) & ~CSIGNAL),
        .exit_signal = (lower_32_bits(flags) & CSIGNAL),
        .fn = fn,
        .fn_arg = arg,
        .name = name,
        .kthread = 1,
    };
    return kernel_clone(&args);
}
/*
 * Create a user mode thread.
 */
pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
    struct kernel_clone_args args = {
        .flags = ((lower_32_bits(flags) | CLONE_VM |
                   CLONE_UNTRACED) & ~CSIGNAL),
        .exit_signal = (lower_32_bits(flags) & CSIGNAL),
        .fn = fn,
        .fn_arg = arg,
    };
    return kernel_clone(&args);
}
```
As we can see, the two differ only in whether `name` and `kthread = 1` are passed in. As noted above, though, what `user_mode_thread` creates is not a user thread but one of the "special kernel threads".
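The flag arithmetic shared by both helpers can be tried out in user space. The sketch below is a model, not the kernel implementation: `clone_args_model` and the `*_model` functions are hypothetical names, and the `K_`-prefixed constants copy the values of `CSIGNAL`, `CLONE_VM`, `CLONE_FS` and `CLONE_UNTRACED` from `include/uapi/linux/sched.h` (prefixed only to avoid clashing with system headers).

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from include/uapi/linux/sched.h, with a K_ prefix added. */
#define K_CSIGNAL        0x000000ffUL /* low byte: signal sent at exit */
#define K_CLONE_VM       0x00000100UL
#define K_CLONE_FS       0x00000200UL
#define K_CLONE_UNTRACED 0x00800000UL

/* Hypothetical, trimmed-down stand-in for struct kernel_clone_args. */
struct clone_args_model {
    uint64_t flags;
    int exit_signal;
    const char *name;
    int kthread;
};

static uint32_t lower_32_bits_model(uint64_t v)
{
    return (uint32_t)v;
}

/* The part both helpers share: force CLONE_VM | CLONE_UNTRACED on,
 * and move the low signal byte out of flags into exit_signal. */
static struct clone_args_model common_args_model(unsigned long flags)
{
    struct clone_args_model args = {
        .flags = (lower_32_bits_model(flags) | K_CLONE_VM |
                  K_CLONE_UNTRACED) & ~K_CSIGNAL,
        .exit_signal = (int)(lower_32_bits_model(flags) & K_CSIGNAL),
    };

    assert((args.flags & K_CSIGNAL) == 0);
    return args;
}

static struct clone_args_model kernel_thread_args_model(const char *name,
                                                        unsigned long flags)
{
    struct clone_args_model args = common_args_model(flags);

    args.name = name;  /* only kernel_thread() carries a name ...    */
    args.kthread = 1;  /* ... and marks the new task as a kthread    */
    return args;
}

static struct clone_args_model user_mode_thread_args_model(unsigned long flags)
{
    return common_args_model(flags);
}
```

Under these assumptions, `user_mode_thread_args_model(K_CLONE_FS)`, which mirrors the `CLONE_FS` argument used for `kernel_init`, yields `flags == 0x00800300`, `exit_signal == 0` and `kthread == 0`.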
:::info
:::spoiler How do we create a "user thread" ?
Use `sys_clone3()`:
```c=3175
// defined at kernel/fork.c
/**
 * sys_clone3 - create a new process with specific properties
 * @uargs: argument structure
 * @size: size of @uargs
 *
 * clone3() is the extensible successor to clone()/clone2().
 * It takes a struct as argument that is versioned by its size.
 *
 * Return: On success, a positive PID for the child process.
 * On error, a negative errno number.
 */
SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
{
    int err;
    struct kernel_clone_args kargs;
    pid_t set_tid[MAX_PID_NS_LEVEL];
    kargs.set_tid = set_tid;
    err = copy_clone_args_from_user(&kargs, uargs, size);
    if (err)
        return err;
    if (!clone3_args_valid(&kargs))
        return -EINVAL;
    return kernel_clone(&kargs);
}
```
The key is whether the `kargs` passed in carries the information for a user thread, which is then handed on to `kernel_clone()` and `copy_process()`. Note `copy_mm()`, which is called from `copy_process()`:
```c
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm;
    tsk->min_flt = tsk->maj_flt = 0;
    tsk->nvcsw = tsk->nivcsw = 0;
#ifdef CONFIG_DETECT_HUNG_TASK
    tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
    tsk->last_switch_time = 0;
#endif
    tsk->mm = NULL;
    tsk->active_mm = NULL;
    /*
     * Are we cloning a kernel thread?
     *
     * We need to steal a active VM for that..
     */
    oldmm = current->mm;
    if (!oldmm)
        return 0;
    if (clone_flags & CLONE_VM) {
        mmget(oldmm);
        mm = oldmm;
    } else {
        mm = dup_mm(tsk, current->mm);
        if (!mm)
            return -ENOMEM;
    }
    tsk->mm = mm;
    tsk->active_mm = mm;
    sched_mm_cid_fork(tsk);
    return 0;
}
```
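If the task doing the clone is a kernel thread, that is, one with no `->mm`, the new task is created without an `->mm` as well; conversely, a parent that has an `->mm` produces a child that has one. Put another way, **when cloning is initiated by a user thread (`sys_clone3`), a user thread is created; when it is initiated through `kernel_thread()`, a kernel thread is created.**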
:::
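The decision `copy_mm()` makes can be reduced to a three-way outcome. The following is a minimal sketch under that reading of the code above; `task_model`, `mm_outcome_model` and `copy_mm_model()` are hypothetical names, and a single `has_mm` flag stands in for `current->mm` being non-NULL.

```c
#include <assert.h>
#include <stddef.h>

#define K_CLONE_VM 0x00000100UL /* value of CLONE_VM in uapi/linux/sched.h */

/* What the child ends up with, in this simplified model. */
enum mm_outcome_model {
    MM_NONE,       /* parent had no mm: child gets none (kernel thread) */
    MM_SHARED,     /* CLONE_VM: child shares the parent's mm            */
    MM_DUPLICATED  /* otherwise: child gets its own copy (dup_mm)       */
};

/* Hypothetical stand-in for the parent task; has_mm models current->mm. */
struct task_model {
    int has_mm;
};

static enum mm_outcome_model copy_mm_model(const struct task_model *parent,
                                           unsigned long clone_flags)
{
    assert(parent != NULL);

    if (!parent->has_mm)
        return MM_NONE;        /* the early `return 0` branch   */
    if (clone_flags & K_CLONE_VM)
        return MM_SHARED;      /* the mmget(oldmm) branch       */
    return MM_DUPLICATED;      /* the dup_mm() branch           */
}
```

Note that in this model the clone flags only matter once the parent has an `mm`: a parent without one yields `MM_NONE` whether or not `CLONE_VM` is set.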
### `kernel_clone()` and `copy_process()`
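The real work behind the two functions above is done in `kernel_clone()`, which in turn calls `copy_process()`. The latter is a roughly five-hundred-line function defined in `kernel/fork.c`, which says a lot about its importance. After it returns, `wake_up_new_task()`, defined in `kernel/sched/core.c`, still has to be called before a new process can be considered fully created.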
### `__kthread_bind_mask`
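`__kthread_bind_mask` binds each `kworker` to a CPU according to its preset CPU affinity. Every CPU gets its own `kworker`s, or, put the other way around, the per-CPU `kworker`s that exist reflect how many CPUs there are; this is where multi-core handling begins.
Breaking on `__kthread_bind_mask` in gdb shows that when `kworker/0:0` and `kworker/0:1` are created, `mask->bits` is 1, whereas for `kworker/1:0` and `kworker/1:1` it is 2.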
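The values 1 and 2 seen in gdb are what a bitmap with one bit per CPU looks like when only CPU 0 or only CPU 1 is set. A minimal sketch of that encoding, using a single `unsigned long` word in place of the kernel's cpumask and two hypothetical helper names:

```c
#include <assert.h>
#include <limits.h>

/* Bits of a mask word in which only `cpu` is set: bit n stands for CPU n. */
static unsigned long single_cpu_mask_bits(unsigned int cpu)
{
    assert(cpu < sizeof(unsigned long) * CHAR_BIT);
    return 1UL << cpu;
}

/* Non-zero if `cpu` is allowed by the mask word `bits`. */
static int mask_has_cpu(unsigned long bits, unsigned int cpu)
{
    assert(cpu < sizeof(unsigned long) * CHAR_BIT);
    return (int)((bits >> cpu) & 1UL);
}
```

With this encoding, `single_cpu_mask_bits(0)` is 1 and `single_cpu_mask_bits(1)` is 2, matching what was observed for the `kworker/0:*` and `kworker/1:*` threads.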
## What makes a CPU?
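If we run `ps`, we can see many running processes. With KVM in particular, we can watch processes being started one after another from power-on until boot completes, and most of them are kernel threads. What does each of them do? And which kernel threads does a minimal but complete CPU need?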
```sh
#!/bin/bash
qemu-system-x86_64 \
-M pc \
-smp 2 \
-kernel ./output/images/bzImage \
-drive file=./output/images/rootfs.ext2,if=virtio,format=raw \
-append "root=/dev/vda console=ttyS0 nokaslr" \
-net user,hostfwd=tcp:127.0.0.1:3333-:22 \
-net nic,model=virtio \
-fsdev local,security_model=passthrough,id=test_dev,path=./share \
-device virtio-9p-pci,id=fsdev0,fsdev=test_dev,mount_tag=test_mount \
-nographic \
-S -s
```
With the qemu command above, we can emulate a dual-core x86 system and, after boot, see the following processes:
```shell
TASK PID COMM
0xffffffff8200a880 0 swapper/0
0xffff8880029c0000 1 init
0xffff8880029c0e00 2 kthreadd
0xffff8880029c1c00 3 pool_workqueue_
0xffff8880029c2a00 4 kworker/R-rcu_g
0xffff8880029c3800 5 kworker/R-rcu_p
0xffff8880029c4600 6 kworker/R-slub_
0xffff8880029c5400 7 kworker/R-netns
0xffff8880029c7000 9 kworker/0:0H
0xffff888002a00000 10 kworker/0:1
0xffff888002a00e00 11 kworker/u4:0
0xffff888002a01c00 12 kworker/R-mm_pe
0xffff888002a02a00 13 rcu_tasks_kthre
0xffff888002a03800 14 rcu_tasks_rude_
0xffff888002a04600 15 rcu_tasks_trace
0xffff888002a05400 16 ksoftirqd/0
0xffff888002a06200 17 rcu_preempt
0xffff888002a07000 18 migration/0
0xffff888002a38e00 19 cpuhp/0
0xffff888002a39c00 20 cpuhp/1
0xffff888002a3aa00 21 migration/1
0xffff888002a3b800 22 ksoftirqd/1
0xffff888002a3c600 23 kworker/1:0
0xffff888002a3d400 24 kworker/1:0H
0xffff888002a78000 25 kdevtmpfs
0xffff888002a78e00 26 kworker/R-inet_
0xffff888002a79c00 27 kworker/1:1
0xffff888002a7aa00 28 kworker/u4:1
0xffff888002a7b800 29 oom_reaper
0xffff888002a7c600 30 kworker/R-write
0xffff888002a3e200 31 kworker/u4:2
0xffff888002a3f000 32 kcompactd0
0xffff888002b08000 33 kworker/R-kbloc
0xffff888002a7d400 34 kworker/R-ata_s
0xffff888002a7e200 35 kswapd0
0xffff888002b08e00 36 kworker/0:1H
0xffff888002b09c00 37 kworker/R-acpi_
0xffff888002a7f000 38 kworker/R-ttm
0xffff888002b0aa00 39 scsi_eh_0
0xffff888002b0b800 40 kworker/R-scsi_
0xffff888002b0c600 41 scsi_eh_1
0xffff888002b0d400 42 kworker/R-scsi_
0xffff888002b0e200 43 kworker/0:2
0xffff888003390000 44 kworker/R-mld
0xffff888002b0f000 45 kworker/R-ipv6_
0xffff8880033b0000 46 kworker/1:1H
0xffff888003391c00 48 kworker/u5:0
0xffff8880033b0e00 49 kworker/R-ext4-
0xffff888003396200 68 syslogd
0xffff888003392a00 72 klogd
0xffff888003393800 112 udhcpc
0xffff888003397000 114 sh
0xffff8880036d0000 115 getty
```
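Many of the names in this listing are `kworker` threads, and their names encode what kind of worker they are. The sketch below classifies a `COMM` string under the following assumptions about the workqueue naming scheme: `kworker/<cpu>:<id>` (optionally followed by `H` for high priority) is a worker bound to one CPU, `kworker/u<pool>:<id>` is an unbound worker, and `kworker/R-<name>` is a rescuer. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum kworker_kind {
    KW_NOT_KWORKER,
    KW_PER_CPU,  /* kworker/<cpu>:<id>[H] */
    KW_UNBOUND,  /* kworker/u<pool>:<id>  */
    KW_RESCUER   /* kworker/R-<name>      */
};

/* Classify a COMM string. For KW_PER_CPU, *cpu receives the CPU index;
 * for every other result *cpu is set to -1. */
static enum kworker_kind classify_comm(const char *comm, int *cpu)
{
    static const char prefix[] = "kworker/";
    int pool, id, parsed_cpu;

    assert(comm != NULL && cpu != NULL);
    *cpu = -1;

    if (strncmp(comm, prefix, sizeof(prefix) - 1) != 0)
        return KW_NOT_KWORKER;
    comm += sizeof(prefix) - 1; /* skip "kworker/" */

    if (strncmp(comm, "R-", 2) == 0)
        return KW_RESCUER;
    if (sscanf(comm, "u%d:%d", &pool, &id) == 2)
        return KW_UNBOUND;
    if (sscanf(comm, "%d:%d", &parsed_cpu, &id) == 2) {
        *cpu = parsed_cpu;
        return KW_PER_CPU;
    }
    return KW_NOT_KWORKER;
}
```

Applied to the listing, `kworker/1:0H` would come out as a per-CPU worker on CPU 1, `kworker/u4:2` as unbound, and `kworker/R-rcu_g` as a rescuer, while `ksoftirqd/0` is not a `kworker` at all.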
Everything up to PID 49 looks very much like a kernel thread (apart from `init`, PID 1). We can group them and work out what each one is for:
1. Core Kernel Threads:
* `swapper/0`: idle thread
* `init`: initial user-space process
* `kthreadd`: the kernel thread daemon responsible for creating other kernel threads
* `ksoftirqd/n`: handles software interrupts (softirqs) on CPU n.
* `migration/n`: manages process migration between CPUs.
* `cpuhp/n`: manages CPU hotplug operations for CPU n.
2. RCU (Read-Copy Update) Threads:
3. Memory Management Threads:
4. Device and I/O Management Threads: