
From start_kernel to the First Task: Linux Scheduler Reading Notes (2)

In the previous post, Review with code: Linux Scheduler Reading Notes (1), we covered sched_fork() and traced upward to kernel_clone(). So who, in turn, calls kernel_clone()?

How is the first process (PID = 1) created? And how is the CPU scheduler initialized? We will try to answer these questions to gain a deeper understanding of the Linux CPU scheduler.

The code below is from Linux kernel v6.8-rc7.

Call Hierarchy

We roughly know that the bootstrap flow is BIOS -> MBR -> boot loader -> kernel. The first three stages are mostly driven by firmware and hardware; what we care about starts once the kernel is loaded. The boot loader loads the kernel into memory, and the kernel then runs a long series of initialization and self-tests, at which point it is the operating system proper. To answer the opening questions, this is where we can start.

Let's work through the following call hierarchy, from start_kernel() down to the familiar set_task_rq_fair() in kernel/sched/fair.c:

- start_kernel()                                // in          init/main.c at line  873
    - sched_init()                              // in  kernel/sched/core.c at line 9900
    - rest_init()                               // in          init/main.c at line  684
        - user_mode_thread()                    // in        kernel/fork.c at line 2970
        - kernel_thread()                       // in        kernel/fork.c at line 2951
            - kernel_clone()                    // in        kernel/fork.c at line 2861
                - copy_process()                // in        kernel/fork.c at line 2240
                - wake_up_new_task()            // in  kernel/sched/core.c at line 4868
                    - __set_task_cpu()          // in kernel/sched/sched.h at line 2056
                        - set_task_rq()         // in kernel/sched/sched.h at line 2027
                           - set_task_rq_fair() // in  kernel/sched/fair.c at line 4171

sched_init()

start_kernel(), defined in init/main.c, is considered the formal entry point of the Linux kernel. It performs a long sequence of subsystem initialization, including the one we are most interested in, sched_init(), and it also starts the first process (PID = 1), which runs kernel_init.

(figure: call flow of start_kernel()/rest_init() starting kernel_init via kernel_thread())

One of the most prominent pieces of code in sched_init() is the loop that iterates over every possible CPU and initializes its run queue:

```c
// defined at kernel/sched/core.c
for_each_possible_cpu(i) {
	struct rq *rq;

	rq = cpu_rq(i);
	raw_spin_lock_init(&rq->__lock);
	rq->nr_running = 0;
	rq->calc_load_active = 0;
	rq->calc_load_update = jiffies + LOAD_FREQ;
	init_cfs_rq(&rq->cfs);
	init_rt_rq(&rq->rt);
	init_dl_rq(&rq->dl);
	...
}
```

user_mode_thread() and kernel_thread()

Although the figure above shows kernel_init being started with kernel_thread(), in v6.8-rc7 it is actually started with pid = user_mode_thread(kernel_init, NULL, CLONE_FS);. The reason is that kernel_init does not actually need to be a kernel thread, since it will eventually become a user process. Note that the user_mode_thread here is not a regular user thread: it does not yet have the mm part normally used to distinguish user threads from kernel threads. For details, see
Patch: kthread: Unify kernel_thread() and user_mode_thread()
and
Patch: kthread: Rename user_mode_thread() to kmuser_thread()

Here we can also compare user_mode_thread() with kernel_thread():

```c
// defined at kernel/fork.c
/*
 * Create a kernel thread.
 */
pid_t kernel_thread(int (*fn)(void *), void *arg, const char *name,
		    unsigned long flags)
{
	struct kernel_clone_args args = {
		.flags		= ((lower_32_bits(flags) | CLONE_VM |
				    CLONE_UNTRACED) & ~CSIGNAL),
		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
		.fn		= fn,
		.fn_arg		= arg,
		.name		= name,
		.kthread	= 1,
	};

	return kernel_clone(&args);
}

/*
 * Create a user mode thread.
 */
pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
	struct kernel_clone_args args = {
		.flags		= ((lower_32_bits(flags) | CLONE_VM |
				    CLONE_UNTRACED) & ~CSIGNAL),
		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
		.fn		= fn,
		.fn_arg		= arg,
	};

	return kernel_clone(&args);
}
```

As we can see, the only differences are that kernel_thread() additionally passes a name and sets .kthread = 1. But as mentioned above, a user_mode_thread is not a user thread; it is a "special kernel thread".

How do we create a "user thread"?

By using sys_clone3():

```c
// defined at kernel/fork.c
/**
 * sys_clone3 - create a new process with specific properties
 * @uargs: argument structure
 * @size:  size of @uargs
 *
 * clone3() is the extensible successor to clone()/clone2().
 * It takes a struct as argument that is versioned by its size.
 *
 * Return: On success, a positive PID for the child process.
 *         On error, a negative errno number.
 */
SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
{
	int err;

	struct kernel_clone_args kargs;
	pid_t set_tid[MAX_PID_NS_LEVEL];

	kargs.set_tid = set_tid;

	err = copy_clone_args_from_user(&kargs, uargs, size);
	if (err)
		return err;

	if (!clone3_args_valid(&kargs))
		return -EINVAL;

	return kernel_clone(&kargs);
}
```

The key is whether the kargs passed in carry user-thread information, which is then handed to kernel_clone() and copy_process(). Note copy_mm() inside copy_process():

```c
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
	struct mm_struct *mm, *oldmm;

	tsk->min_flt = tsk->maj_flt = 0;
	tsk->nvcsw = tsk->nivcsw = 0;
#ifdef CONFIG_DETECT_HUNG_TASK
	tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
	tsk->last_switch_time = 0;
#endif

	tsk->mm = NULL;
	tsk->active_mm = NULL;

	/*
	 * Are we cloning a kernel thread?
	 *
	 * We need to steal a active VM for that..
	 */
	oldmm = current->mm;
	if (!oldmm)
		return 0;

	if (clone_flags & CLONE_VM) {
		mmget(oldmm);
		mm = oldmm;
	} else {
		mm = dup_mm(tsk, current->mm);
		if (!mm)
			return -ENOMEM;
	}

	tsk->mm = mm;
	tsk->active_mm = mm;
	sched_mm_cid_fork(tsk);
	return 0;
}
```

If we are cloning from a kernel thread, that is, current->mm is NULL, the child is likewise created without an ->mm, and vice versa. Put another way: cloning initiated from a user thread (via sys_clone3) produces a user thread, while cloning initiated from kernel_thread() produces a kernel thread.

kernel_clone() and copy_process()

The implementations behind the two functions above are hidden in kernel_clone(), which in turn calls copy_process(). The latter is a roughly 500-line function defined in kernel/fork.c, which gives a sense of its importance. After it, wake_up_new_task(), defined in kernel/sched/core.c, must be called before the creation of a new process can be considered complete.

__kthread_bind_mask

__kthread_bind_mask binds each kworker to a CPU according to its preset CPU affinity; each CPU gets its own corresponding kworkers, or, put the other way around, the number of per-CPU kworker pools reflects the number of CPUs. This is where multi-core handling begins.

Breaking on __kthread_bind_mask in gdb, we can see that when kworker/0:0 and kworker/0:1 are created, mask->bits is 1, while for kworker/1:0 and kworker/1:1 it is 2.

What makes a CPU?

Using the ps command we can see many running processes. Especially under KVM, we can watch processes start up one after another between power-on and the end of boot, and most of them are kernel threads. What is each of them for? And which kernel threads does a minimal, complete system need?

```bash
#!/bin/bash
qemu-system-x86_64 \
    -M pc \
    -smp 2 \
    -kernel ./output/images/bzImage \
    -drive file=./output/images/rootfs.ext2,if=virtio,format=raw \
    -append "root=/dev/vda console=ttyS0 nokaslr" \
    -net user,hostfwd=tcp:127.0.0.1:3333-:22 \
    -net nic,model=virtio \
    -fsdev local,security_model=passthrough,id=test_dev,path=./share \
    -device virtio-9p-pci,id=fsdev0,fsdev=test_dev,mount_tag=test_mount \
    -nographic \
    -S -s
```

With the qemu command above, we can emulate a dual-core x86 system and observe the following processes after boot:

      TASK          PID    COMM
0xffffffff8200a880   0   swapper/0
0xffff8880029c0000   1   init
0xffff8880029c0e00   2   kthreadd
0xffff8880029c1c00   3   pool_workqueue_
0xffff8880029c2a00   4   kworker/R-rcu_g
0xffff8880029c3800   5   kworker/R-rcu_p
0xffff8880029c4600   6   kworker/R-slub_
0xffff8880029c5400   7   kworker/R-netns
0xffff8880029c7000   9   kworker/0:0H
0xffff888002a00000  10   kworker/0:1
0xffff888002a00e00  11   kworker/u4:0
0xffff888002a01c00  12   kworker/R-mm_pe
0xffff888002a02a00  13   rcu_tasks_kthre
0xffff888002a03800  14   rcu_tasks_rude_
0xffff888002a04600  15   rcu_tasks_trace
0xffff888002a05400  16   ksoftirqd/0
0xffff888002a06200  17   rcu_preempt
0xffff888002a07000  18   migration/0
0xffff888002a38e00  19   cpuhp/0
0xffff888002a39c00  20   cpuhp/1
0xffff888002a3aa00  21   migration/1
0xffff888002a3b800  22   ksoftirqd/1
0xffff888002a3c600  23   kworker/1:0
0xffff888002a3d400  24   kworker/1:0H
0xffff888002a78000  25   kdevtmpfs
0xffff888002a78e00  26   kworker/R-inet_
0xffff888002a79c00  27   kworker/1:1
0xffff888002a7aa00  28   kworker/u4:1
0xffff888002a7b800  29   oom_reaper
0xffff888002a7c600  30   kworker/R-write
0xffff888002a3e200  31   kworker/u4:2
0xffff888002a3f000  32   kcompactd0
0xffff888002b08000  33   kworker/R-kbloc
0xffff888002a7d400  34   kworker/R-ata_s
0xffff888002a7e200  35   kswapd0
0xffff888002b08e00  36   kworker/0:1H
0xffff888002b09c00  37   kworker/R-acpi_
0xffff888002a7f000  38   kworker/R-ttm
0xffff888002b0aa00  39   scsi_eh_0
0xffff888002b0b800  40   kworker/R-scsi_
0xffff888002b0c600  41   scsi_eh_1
0xffff888002b0d400  42   kworker/R-scsi_
0xffff888002b0e200  43   kworker/0:2
0xffff888003390000  44   kworker/R-mld
0xffff888002b0f000  45   kworker/R-ipv6_
0xffff8880033b0000  46   kworker/1:1H
0xffff888003391c00  48   kworker/u5:0
0xffff8880033b0e00  49   kworker/R-ext4-
0xffff888003396200  68   syslogd
0xffff888003392a00  72   klogd
0xffff888003393800  112  udhcpc
0xffff888003397000  114  sh
0xffff8880036d0000  115  getty

Everything up to PID 49 looks very much like kernel threads. We can classify them and explain their purposes:

  1. Core Kernel Threads:
  • swapper/0: idle thread

  • init: initial user-space process

  • kthreadd: the kernel thread daemon responsible for creating other kernel threads

  • ksoftirqd/n: handles software interrupts (softirqs) on CPU n.

  • migration/n: manages process migration between CPUs.

  • cpuhp/n: manages CPU hotplug operations for CPU n.

  2. RCU (Read-Copy Update) Threads:
  3. Memory Management Threads:
  4. Device and I/O Management Threads: