# Linus Project 3 Cheatsheet

---

<font size="10rem"><b><i>>>>[LITERAL CHEATSHEET](#The-real-cheersheat-TODO)<<<</i></b></font>

[Cloudy's](https://hackmd.io/@Cing-Chen/ryoc0zocj)

---

## Contents

[TOC]

---

{%hackmd /vgoFfe-PQTioGuCz7JDuJA %}

---

## Source Code of Interest

- Task info
    - [`#define current`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/current.h#L17) <small>*(`arch/x86/include/asm/current.h:17`)*</small>
    - [`struct task_struct`](https://elixir.bootlin.com/linux/v3.10.104/source/include/linux/sched.h#L1041) <small>*(`include/linux/sched.h:1041`)*</small>
    - [`struct thread_struct`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/processor.h#L442) <small>*(`arch/x86/include/asm/processor.h:442`)*</small>
    - [`struct thread_info`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/thread_info.h#L25) <small>*(`arch/x86/include/asm/thread_info.h:25`)*</small>
    - [`#define task_thread_info(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/include/linux/sched.h#L2322) <small>*(`include/linux/sched.h:2322`)*</small>
- Get CPU
    - [`unsigned int task_cpu(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/include/linux/sched.h#L2628) <small>*(`include/linux/sched.h:2628`)*</small>
    - [`#define smp_processor_id()`](https://elixir.bootlin.com/linux/v3.10.104/source/include/linux/smp.h#L218) <small>*(`include/linux/smp.h:218`)*</small>
- Sched & CSW related
    - [`void schedule(void)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/core.c#L3052) <small>*(`kernel/sched/core.c:3052`)*</small>
    - [`void __schedule(void)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/core.c#L2955) <small>*(`kernel/sched/core.c:2955`)*</small>
    - [(`struct task_struct`) `unsigned long nvcsw, nivcsw`](https://elixir.bootlin.com/linux/v3.10.104/source/include/linux/sched.h#L1186) <small>*(`include/linux/sched.h:1186`)*</small>
    - [`struct sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L968) <small>*(`kernel/sched/sched.h:968`)*</small>
    - [`void context_switch(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/core.c#L1984) <small>*(`kernel/sched/core.c:1984`)*</small>
    - [`void enter_lazy_tlb(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/mmu_context.h#L25) <small>*(`arch/x86/include/asm/mmu_context.h:25`)*</small>
    - [`void switch_mm(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/mmu_context.h#L33) <small>*(`arch/x86/include/asm/mmu_context.h:33`)*</small>
    - [`#define switch_to(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/switch_to.h#L104) <small>*(`arch/x86/include/asm/switch_to.h:104`)*</small>
    - ~~[`#define switch_to(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/include/asm-generic/switch_to.h#L25)~~ <small>*(`include/asm-generic/switch_to.h:25`)*</small>
    - [`struct task_struct* __switch_to(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/kernel/process_64.c#L272) <small>*(`arch/x86/kernel/process_64.c:272`)*</small>
- Sched classes
    - [`struct sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L968) <small>*(`kernel/sched/sched.h:968`)*</small>
    - [`extern struct sched_class stop_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L1018) <small>*(`kernel/sched/sched.h:1018`)*</small>
      [`struct sched_class stop_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/stop_task.c#L105) <small>*(`kernel/sched/stop_task.c:105`)*</small>
    - [`extern struct sched_class dl_sched_class`](https://elixir.bootlin.com/linux/v6.1.4/source/kernel/sched/sched.h#L2252) <small>*(`v6.1.4@kernel/sched/sched.h:2252`)*</small>
      [`struct sched_class dl_sched_class`](https://elixir.bootlin.com/linux/v6.1.4/source/kernel/sched/deadline.c#L2696) <small>*(`v6.1.4@kernel/sched/deadline.c:2696`)*</small>
    - [`extern struct sched_class rt_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L1019) <small>*(`kernel/sched/sched.h:1019`)*</small>
      [`struct sched_class rt_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/rt.c#L2071) <small>*(`kernel/sched/rt.c:2071`)*</small>
    - [`extern struct sched_class fair_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L1020) <small>*(`kernel/sched/sched.h:1020`)*</small>
      [`struct sched_class fair_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/fair.c#L6167) <small>*(`kernel/sched/fair.c:6167`)*</small>
    - [`extern struct sched_class idle_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L1021) <small>*(`kernel/sched/sched.h:1021`)*</small>
      [`struct sched_class idle_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/idle_task.c#L90) <small>*(`kernel/sched/idle_task.c:90`)*</small>
- CSW related
    - [`struct tss_struct`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/processor.h#L264) <small>*(`arch/x86/include/asm/processor.h:264`)*</small>
    - [`struct x86_hw_tss`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/processor.h#L240) <small>*(`arch/x86/include/asm/processor.h:240`)*</small>
- Misc
    - [`long sys_sched_setaffinity(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/core.c#L4246) <small>*(`kernel/sched/core.c:4246`)*</small>
    - [`long do_fork(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/fork.c#L1574) <small>*(`kernel/fork.c:1574`)*</small>
    - [`int gettimeofday(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/vdso/vclock_gettime.c#L281) <small>*(`arch/x86/vdso/vclock_gettime.c:281`)*</small>
    - [`int do_nanosleep(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/hrtimer.c#L1557)
<small>*(`kernel/hrtimer.c:1557`)*</small>

## Cache Table

> messy
> tired
> [name=Mibudin]
> https://hackmd.io/@wasabi-neko/ByXAcyJtj
> [name=wasabi-neko]
> This doesn't even qualify as a cheatsheet
> [name=JCxYIS]

---

### Logical & Physical CPU ID

==TODO==

---

### `smp_processor_id`

- [`#define smp_processor_id()`](https://elixir.bootlin.com/linux/v3.10.104/source/include/linux/smp.h#L218) <small>*(`include/linux/smp.h:218`)*</small>
- A per-CPU variable named `cpu_number`

> ```c=
> #define raw_smp_processor_id() (this_cpu_read(cpu_number))
> #define smp_processor_id() raw_smp_processor_id()
> ```

---

### `taskset`

- [`int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask)`](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html)
- System call: [`long sys_sched_setaffinity(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/core.c#L4246) <small>*(`kernel/sched/core.c:4246`)*</small>

> ```c=
> #include <sched.h>
>
> int cpu_id = 3;
> cpu_set_t cpuset;
> CPU_ZERO(&cpuset);
> CPU_SET(cpu_id, &cpuset);
> sched_setaffinity(0, sizeof(cpuset), &cpuset);
> ```

---

### Sleep (`sleep()` / `usleep()` / `nanosleep()`)

- https://www.jianshu.com/p/42abcc2c9e50
- In Linux, `sleep()`, `usleep()`, and `nanosleep()` all use the system call `nanosleep()`
  [`int do_nanosleep(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/hrtimer.c#L1557) <small>*(`kernel/hrtimer.c:1557`)*</small>

The first two below give the CPU up to other processes (the caller puts itself on a wait queue, using `sigsuspend()`). For very short `usleep` intervals, the time the Linux scheduler spends on context switches (a few ms, ~1000 us) can exceed the requested sleep, making the timing inaccurate:

- `sleep`: unit is seconds (s)
- `usleep`: unit is microseconds (us, $10^{-6}s$); suitable for intervals of tens of ms (tens of thousands of us) and above

For very small intervals, `nanosleep` works by arming a timer (a queue made of `timer_list` entries); the system checks the queue periodically and fires the callbacks whose time is up, so it is not extremely accurate either:

- `nanosleep`: unit is nanoseconds (ns, $10^{-9}s$); be careful to check the return value and error code, otherwise it is easy to end up burning 100% CPU

If you need real accuracy, use `select` or count clock cycles yourself in hand-written asm.

![](https://i.imgur.com/jEIDDj0.png =10000x)

---

### `gettimeofday()`

- System call ---> vsyscall (vDSO on x86)
- vsyscall/vDSO: map some kernel code (system calls) into user space to avoid kernel/user context switches
- `strace` cannot trace it
- [`int gettimeofday(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/vdso/vclock_gettime.c#L281) <small>*(`arch/x86/vdso/vclock_gettime.c:281`)*</small>
  [`long sys_gettimeofday(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/time.c#L101) <small>*(`kernel/time.c:101`)*</small>

---

### `time`

==TODO==

---

### `current`

- [`#define current`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/current.h#L17) <small>*(`arch/x86/include/asm/current.h:17`)*</small>:
- A per-CPU variable named `current_task`

> ```c=
> DECLARE_PER_CPU(struct task_struct *, current_task);
>
> static __always_inline struct task_struct *get_current(void)
> {
> 	return this_cpu_read_stable(current_task);
> }
>
> #define current get_current()
> ```

---

### `task_struct` & `thread_info`

Every process has a `task_struct` and a `thread_info`; the `task_struct` can be reached through the `thread_info`.

`task_struct` (*[include/linux/sched.h:1041](https://elixir.bootlin.com/linux/v3.10.104/source/include/linux/sched.h#L1041)*) is the most important, and also the largest, process descriptor in the kernel. It records the process's state, inter-process relationships, resource allocation, system-call information, and so on (the [mm_struct](https://hackmd.io/@hami-duck/HyCjWTkVj/%2FKZXSVG4fRdqCQS0ykXgZWg) from earlier also lives in here); it is used to manage the relationships and states between processes.

`thread_info` (*[arch/x86/include/asm/thread_info.h:25](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/thread_info.h#L25)*) is a lighter-weight structure that stores a process's basic information, such as its PID, its state, and kernel-stack information. `thread_info` is normally placed at the low end of the process's kernel stack region.

![](https://i.imgur.com/6XuSZyT.png)

> It is placed there because of the memory layout:
> masking the low 13 bits of the stack pointer (`esp`) to zero yields the `thread_info` address directly.
> See the [implementation of `current_thread_info`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/thread_info.h#L176):

```c
return (struct thread_info *)
	(current_stack_pointer & ~(THREAD_SIZE - 1));
```

---

### `thread_struct`

==TODO==

---

### `tss_struct` & `x86_hw_tss`

==TODO==

---

### Task State

==TODO==

- There are five process states; inspect them with the `top` command
- 'D' = UNINTERRUPTIBLE_SLEEP
  - Uninterruptible sleep: waiting for a resource; usually entered inside system calls, e.g. `mkdir`
- 'R' = RUNNING & RUNNABLE
  - Running, or runnable (not yet assigned a CPU)
- 'S' = INTERRUPTIBLE_SLEEP
  - Interruptible sleep: waiting for a resource, e.g. user input or a network packet; it gives up the CPU.
  - Can be terminated at will (`kill`)
- 'T' = STOPPED
  - A suspended process: sending `SIGSTOP` (`kill -STOP`) or `SIGTSTP` (`Ctrl+Z`) at runtime puts the process into the stopped state (T).
  - Sending `SIGCONT` (`fg`) restores the process to the running or runnable state.
- 'Z' = ZOMBIE
  - When a process finishes execution or is terminated, it sends `SIGCHLD` to its parent and enters the zombie state. It stays in this state until the parent clears it from the process table (`kill -s SIGCHLD <PPID>`). The parent must read the child's exit value with the `wait()` or `waitpid()` system call.

---

### Segment Registers

==TODO==

- cs: code segment
- ds: data segment
- es: extra segment (destination segment for string operations)
- ss: stack segment
- fs: TLS (thread-local storage) entry
- gs: TLS entry

---

### Schedule

> https://hackmd.io/@RinHizakura/B18MhT00t

- CFS - Completely Fair Scheduler
  - The default scheduling algorithm in the Linux kernel (since 2.6.23), used mainly for process scheduling in multi-core, multi-threaded environments.
  - Aims for fairness: tries to guarantee every process a reasonable share of the CPU.
  - Implementation: a timeslice-based scheduling algorithm
    - Every process is assigned a timeslice.
    - When handing out timeslices, CFS takes the process's priority, execution time, and other factors into account.
    - If a running process does not use up its timeslice, it carries the remainder over.
- Virtual Runtime `vruntime`
  - CFS uses the virtual runtime to measure fairness between tasks. It is computed from each task's actual execution time, weighted by the weight corresponding to the task's nice value; the smaller a task's virtual runtime, the greater its need, so the task with the smallest value is chosen to run first.
- `sched_entity`
  - The unit of resource allocation: what CFS actually weighs and schedules is the `sched_entity` instance.
  - It lives inside `task_struct`.
  - Member `cfs_rq`: points to the runqueue it is on.
  - Member `load_weight`: the weight assigned to this entity.
  - Member `vruntime`: the virtual runtime.
- `task_group`
  - Referenced from `task_struct`; it exists to keep a user who owns many tasks from taking too much CPU.
  - Describes a group made up of several tasks; it also contains a `sched_entity` instance, so CFS can put it on a runqueue, and it has a runqueue of its own.
- `cfs_rq`
  - The runqueue of CFS.
  - `load_weight`: the sum of the weights of all tasks on the runqueue; used to weigh a `task_group`.
  - `min_vruntime`: the minimum `vruntime` among all entities on the runqueue; used to help initialize an entity's `vruntime` on enqueue, so a task returning to the runqueue does not come back with an unfairly high priority.
  - `tasks_timeline`: holds the red-black tree's root `rb_root` and leftmost node `rb_leftmost`
  - `tg`: the `task_group` it belongs to
- runqueue
  - `struct rq` ([`kernel/sched/sched.h:397`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L397))
  - Every CPU has exactly one runqueue
  - Contains a red-black tree (RB-tree) used as a min-heap (priority queue)
  - Jobs (i.e. `sched_entity`s) with smaller `vruntime` are kept toward the left of the red-black tree; operations cost O(log N)
  - The scheduler picks the leftmost task; when it is switched out, its virtual runtime is increased by the CPU time it used and it is re-inserted into the RB-tree.
  - Inserting and removing nodes (`rb_node`) is done by `dequeue_entity()` and `enqueue_entity()`.
  - The currently running task is returned to the red-black tree by calling `put_prev_task`; the next task to run is picked by `pick_next_task`.
  - ![](https://i.imgur.com/2MwTXKI.png)
- [Scheduling Policy](https://man7.org/linux/man-pages/man7/sched.7.html)
  5 policies *(6 policies in higher versions)*
- :::spoiler Deadline Scheduling Policy
  - **`SCHED_DEADLINE`: Sporadic task model deadline scheduling** *(since v3.14)*
    > - GEDF *(Global Earliest Deadline First)* + CBS *(Constant Bandwidth Server)*
    > - Tasks are the highest priority (user controllable) tasks in the system
    > - If runnable, it will preempt any task scheduled under one of the other policies
    > - A sporadic task is one that has a sequence of jobs, where each job is activated at most once per period
    > ```
    >  arrival/wakeup                    absolute deadline
    >       |    start time                    |
    >       |        |                         |
    >       v        v                         v
    > -----x--------xooooooooooooooooo--------x--------x---
    >               |<-- Runtime ------->|
    >      |<----------- Deadline ----------->|
    >      |<-------------- Period ------------------->|
    > ```
    > - The CBS guarantees non-interference between tasks, by throttling threads that attempt to over-run their specified `Runtime`
  :::
- :::spoiler Real-Time Scheduling Policies
  `sched_priority = 1...99`
  - **`SCHED_FIFO`: First in-first out scheduling**
    > - Without time slicing
    > - Preempted tasks stay at the head of the list for their priority
    >   - They resume execution as soon as all higher-priority tasks are blocked
    > - Newly runnable or yielded tasks are inserted at the end of the list for their priority
    > - Priority raised ---> inserted at the end of the list for its new priority
    >   Priority lowered ---> inserted at the front of the list for its new priority
  - **`SCHED_RR`: Round-robin scheduling**
    > - Similar to `SCHED_FIFO`, except ...
    > - Each task is allowed to run only for a maximum time quantum
    >   - or it is put at the end of the list for its priority
    > - Preempted tasks keep the remaining part of their round-robin time quantum
  :::
- :::spoiler Normal Scheduling Policies
  `sched_priority = 0` *(always LOWER than the real-time policies)*
  `nice = -20...+19` *(high ---> low)*
  - **`SCHED_OTHER = SCHED_NORMAL`**
  - **`SCHED_NORMAL`: Default Linux time-sharing scheduling**
    > - dynamic priority: determined only inside the static priority 0 list
    >   - based on the `nice` value
    >   - increased when the task is ready but denied the CPU by the scheduler
    > - `nice` value:
    >   - a per-thread (per-task) value in Linux
    >   - each unit of difference in the `nice` values of two tasks results in a factor of 1.25 in the degree to which the scheduler favors the higher-priority task
  - **`SCHED_BATCH`: Scheduling batch processes**
    > - Similar to `SCHED_NORMAL`, except ...
    > - Assumes every task is CPU-intensive (CPU-bound)
    > - Applies a small penalty on wakeup ---> mildly disfavored in scheduling decisions
    > - Useful for:
    >   - noninteractive workloads that do not want to lower their `nice` value
    >   - workloads that need a deterministic scheduling policy without interactivity causing extra preemptions
  - **`SCHED_IDLE`: Scheduling very low priority jobs**
    > - the `nice` value has no effect
    > - Useful for:
    >   - running jobs at extremely low priority
    >   - lower even than `nice = +19` in `SCHED_NORMAL` & `SCHED_BATCH`
  :::
- Scheduling Class: *4 classes (5 classes in higher versions)*
  [`struct sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L968) <small>*(`kernel/sched/sched.h:968`)*</small>

:::spoiler 4 classes *(5 classes in higher versions)*
- **Stop-Task Scheduling**
  [`extern struct sched_class stop_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L1018) <small>*(`kernel/sched/sched.h:1018`)*</small>
  [`struct sched_class stop_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/stop_task.c#L105) <small>*(`kernel/sched/stop_task.c:105`)*</small>
  > - The stop task is the highest priority task in the system; it preempts everything and will be preempted by nothing.
- **Deadline Scheduling *(`SCHED_DEADLINE`)*** *(since v3.14)* *(source code in latest release v6.1.4)*
  [`extern struct sched_class dl_sched_class`](https://elixir.bootlin.com/linux/v6.1.4/source/kernel/sched/sched.h#L2252) <small>*(`v6.1.4@kernel/sched/sched.h:2252`)*</small>
  [`struct sched_class dl_sched_class`](https://elixir.bootlin.com/linux/v6.1.4/source/kernel/sched/deadline.c#L2696) <small>*(`v6.1.4@kernel/sched/deadline.c:2696`)*</small>
  > - Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS)
  > - Tasks that periodically execute their instances for less than their runtime won't miss any of their deadlines.
  > - Tasks that are not periodic or sporadic, or that try to execute more than their reserved bandwidth, will be slowed down (and may potentially miss some of their deadlines), and won't affect any other task.
- **Real-Time Scheduling *(`SCHED_FIFO` or `SCHED_RR`)***
  [`extern struct sched_class rt_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L1019) <small>*(`kernel/sched/sched.h:1019`)*</small>
  [`struct sched_class rt_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/rt.c#L2071) <small>*(`kernel/sched/rt.c:2071`)*</small>
- **Completely Fair Scheduling *(`SCHED_NORMAL` or `SCHED_BATCH`)*** ==TODO: more details?==
  [`extern struct sched_class fair_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L1020) <small>*(`kernel/sched/sched.h:1020`)*</small>
  [`struct sched_class fair_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/fair.c#L6167) <small>*(`kernel/sched/fair.c:6167`)*</small>
- **Idle Task Scheduling *(`SCHED_IDLE`)***
  [`extern struct sched_class idle_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/sched.h#L1021) <small>*(`kernel/sched/sched.h:1021`)*</small>
  [`struct sched_class
idle_sched_class`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/idle_task.c#L90) <small>*(`kernel/sched/idle_task.c:90`)*</small>
:::

:::spoiler `void schedule(void)`
:::info
[`void schedule(void)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/core.c#L3052) <small>*(`kernel/sched/core.c:3052`)*</small>
```c=
asmlinkage void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);   // Submit plugged IO queued to avoid deadlocks
	__schedule();             // Actually do schedule
}
EXPORT_SYMBOL(schedule);          // Export this symbol to all kernel code
```
:::
*(get current task & submit work)*

:::spoiler `void __schedule(void)`
:::info
[`void __schedule(void)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/core.c#L2955) <small>*(`kernel/sched/core.c:2955`)*</small>
```c=
static void __sched __schedule(void)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq *rq;
	int cpu;

need_resched:
	// Disable preemption if needed
	preempt_disable();
	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	rcu_note_context_switch(cpu);  // RCU (Read-Copy Update) handling
	prev = rq->curr;               // p' is the current task

	// For debug statistics
	schedule_debug(prev);

	// Test if the `HRTICK` feature is enabled
	if (sched_feat(HRTICK))
		hrtick_clear(rq);  // Clear `rq->hrtick_timer`

	// To avoid race with `signal_wake_up()`
	// while there are signals pending & `TASK_INTERRUPTIBLE` set,
	// make sure the order of signals is correct
	smp_mb__before_spinlock();
	raw_spin_lock_irq(&rq->lock);

	// Non-running & preemption disallowed --> Voluntary
	// else --> Involuntary
	switch_count = &prev->nivcsw;  // Count `nivcsw`
	// Test if p' non-running & preemption disallowed
	// `TASK_RUNNING` = 0
	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
		if (unlikely(signal_pending_state(prev->state, prev))) {
			// If there are signals pending to p', set `TASK_RUNNING`
			prev->state = TASK_RUNNING;
		} else {
			// Dequeue p' from `rq` (`sched_class` related)
			// `p->sched_class->dequeue_task(rq, p, flags);`
			deactivate_task(rq, prev, DEQUEUE_SLEEP);
			prev->on_rq = 0;  // No more on rq

			// If a work thread is sleeping, notify and ask whether to wake up
			if (prev->flags & PF_WQ_WORKER) {
				struct task_struct *to_wakeup;

				to_wakeup = wq_worker_sleeping(prev, cpu);
				if (to_wakeup)
					// Try to wake up the worker
					// Enqueue if needed (`sched_class` related)
					// `p->sched_class->enqueue_task(rq, p, flags);`
					try_to_wake_up_local(to_wakeup);
			}
		}
		switch_count = &prev->nvcsw;  // Count `nvcsw`
	}

	// Pre-schedule handling (`sched_class` related)
	pre_schedule(rq, prev);

	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);

	// Put p' back to `rq` (`sched_class` related)
	// `prev->sched_class->put_prev_task(rq, prev);`
	put_prev_task(rq, prev);
	// Pick n' from `rq` (`sched_class` related)
	// `for_each_class(class) if(p = class->pick_next_task(rq)) return p;`
	next = pick_next_task(rq);
	clear_tsk_need_resched(prev);  // p' has just been rescheduled
	rq->skip_clock_update = 0;

	// Context switch if p' != n'
	if (likely(prev != next)) {
		// Update infos
		rq->nr_switches++;
		rq->curr = next;
		++*switch_count;

		// Do context switch:
		//   Switch to the new MM if needed
		//   Save & restore flags
		//   Switch the reg state and the stack
		context_switch(rq, prev, next);  // also unlock rq

		// Re-evaluate `cpu` / `rq`
		// p' may have moved CPUs since `schedule` called
		cpu = smp_processor_id();
		rq = cpu_rq(cpu);
	} else
		// Unlock rq manually
		raw_spin_unlock_irq(&rq->lock);

	// Post-schedule handling (`sched_class` related)
	post_schedule(rq);

	// Enable preemption if needed
	sched_preempt_enable_no_resched();
	// Test if `thread_info->flags` has `TIF_NEED_RESCHED` set
	if (need_resched())
		goto need_resched;
}
```
:::
*(the main scheduler function)*

---

### Context Switch

- Associated:
  - [`void enter_lazy_tlb(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/mmu_context.h#L25) <small>*(`arch/x86/include/asm/mmu_context.h:25`)*</small>
  - [`void
switch_mm(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/mmu_context.h#L33) <small>*(`arch/x86/include/asm/mmu_context.h:33`)*</small>

:::spoiler `void context_switch(...)`
:::info
[`void context_switch(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/core.c#L1984) <small>*(`kernel/sched/core.c:1984`)*</small>
```c=
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);

	mm = next->mm;
	oldmm = prev->active_mm;

	// For para-virtualization
	arch_start_context_switch(prev);

	// Kernel thread:
	//   mm == NULL
	//   active_mm == active user virtual space | NULL
	// User thread:
	//   mm == user virtual space
	//   active_mm == user virtual space

	// Kernel threads:
	// => all kernel threads share same kernel virtual space
	// => no need to switch user virtual space
	// => no need to switch mm & clear TLB (Lazy TLB)
	if (!mm) {  // n' is kernel thread
		next->active_mm = oldmm;       // Keep last user virtual space
		atomic_inc(&oldmm->mm_count);  // Count how many refs to this mm
		enter_lazy_tlb(oldmm, next);
		// Lazy TLB handling:
		//   per-CPU `cpu_tlbstate.state`: TLBSTATE_OK --> TLBSTATE_LAZY
	} else  // n' is user thread
		switch_mm(oldmm, mm, next);
		// Switch mm:
		//   per-CPU `cpu_tlbstate.state`: --> TLBSTATE_OK
		//   load n' cpumask
		//   load pagetable (`pgd`)
		//   load LDT (GDT will load at `__switch_to`)

	if (!prev->mm) {  // p' is kernel thread
		prev->active_mm = NULL;  // No more active user virtual space
		rq->prev_mm = oldmm;     // Keep the previous active user virtual space
	}

	// Release rq lock
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

	// For context tracking
	context_tracking_task_switch(prev, next);
	// Switch the reg state and the stack
	switch_to(prev, next, prev);

	// Do memory barrier: force memory accessing IN ORDER
	barrier();

	// Re-evaluate `rq`
	// p' may have moved CPUs since `schedule` called
	finish_task_switch(this_rq(), prev);
}
```
:::
*(switch to the new MM if needed)*

:::spoiler `#define switch_to(...)`
:::info
[`#define switch_to(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/include/asm/switch_to.h#L104) <small>*(`arch/x86/include/asm/switch_to.h:104`)*</small>
```c=
#define switch_to(prev, next, last) \
	asm volatile(SAVE_CONTEXT                             \ // Push kernel context to stack
	     "movq %%rsp,%P[threadrsp](%[prev])\n\t"          \ // Save p' %rsp (stack ptr) -> p' `thread_struct`
	     "movq %P[threadrsp](%[next]),%%rsp\n\t"          \ // Load n' %rsp
	     "call __switch_to\n\t"                           \ // Call `__switch_to()`
	     "movq "__percpu_arg([current_task])",%%rsi\n\t"  \ // Move n' `task_struct` -> %rsi
	     __switch_canary                                  \ // For protecting against smashing the stack
	     "movq %P[thread_info](%%rsi),%%r8\n\t"           \ // Move n' `thread_info` -> %r8
	     "movq %%rax,%%rdi\n\t"                           \ // Move p' `task_struct` -> %rdi
	     "testl %[_tif_fork],%P[ti_flags](%%r8)\n\t"      \ // Test if n' just created with clone/fork
	     "jnz ret_from_fork\n\t"                          \ // If true, jump to `ret_from_fork`
	     RESTORE_CONTEXT                                  \ // Pop kernel context from stack
	     : "=a" (last)                                    \
	       __switch_canary_oparam                         \ // Output stack canary
	     : [next] "S" (next), [prev] "D" (prev),          \
	       [threadrsp] "i" (offsetof(struct task_struct, thread.sp)), \
	       [ti_flags] "i" (offsetof(struct thread_info, flags)),      \
	       [_tif_fork] "i" (_TIF_FORK),                               \
	       [thread_info] "i" (offsetof(struct task_struct, stack)),   \
	       [current_task] "m" (current_task)              \
	       __switch_canary_iparam                         \ // Input stack canary
	     : "memory", "cc" __EXTRA_CLOBBER)
```
:::
*(save restore flags to clear handle leaking NT)*

:::spoiler `struct task_struct* __switch_to(...)`
:::info
[`struct task_struct* __switch_to(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/arch/x86/kernel/process_64.c#L272) *(`arch/x86/kernel/process_64.c:272`)*
```c=
__notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
	struct thread_struct *prev = &prev_p->thread;
	struct thread_struct *next =
		&next_p->thread;
	int cpu = smp_processor_id();
	struct tss_struct *tss = &per_cpu(init_tss, cpu);  // TSS is per CPU
	unsigned fsindex, gsindex;
	fpu_switch_t fpu;

	// Save p', load & init n' FPU state -> FPU
	fpu = switch_fpu_prepare(prev_p, next_p, cpu);

	// Load n' sp0 -> TSS per CPU
	// sp0: kernel space stack ptr
	load_sp0(tss, next);

	// Save p' %fs & %gs
	// Linux `glibc` uses %gs to access TLS
	savesegment(fs, fsindex);
	savesegment(gs, gsindex);

	// Load n' TLS -> GDT per CPU
	// Load TLS first to correct GDT refs in the following
	// LDT loaded at `switch_mm`, GDT loaded here
	load_TLS(next, cpu);

	// For para-virtualization
	arch_end_context_switch(next_p);

	// %cs & %ss already saved as part of `pt_regs`
	// Switch %es & %ds if needed
	savesegment(es, prev->es);
	if (unlikely(next->es | prev->es))
		loadsegment(es, next->es);

	savesegment(ds, prev->ds);
	if (unlikely(next->ds | prev->ds))
		loadsegment(ds, next->ds);

	// Switch %fs & %gs if needed
	if (unlikely(fsindex | next->fsindex | prev->fs)) {
		loadsegment(fs, next->fsindex);
		if (fsindex)
			prev->fs = 0;
	}
	if (next->fs)
		wrmsrl(MSR_FS_BASE, next->fs);
	prev->fsindex = fsindex;

	if (unlikely(gsindex | next->gsindex | prev->gs)) {
		load_gs_index(next->gsindex);
		if (gsindex)
			prev->gs = 0;
	}
	if (next->gs)
		wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
	prev->gsindex = gsindex;

	// Init n' FPU state -> FPU
	switch_fpu_finish(next_p, fpu);

	// Switch %rsp
	// `this_cpu_write`: write per CPU vars
	prev->usersp = this_cpu_read(old_rsp);
	this_cpu_write(old_rsp, next->usersp);
	// Load n' `task_struct*` -> `current_task`
	// `current_task`: has PDA (per-processor data area) states & FPU context
	this_cpu_write(current_task, next_p);

	this_cpu_write(kernel_stack,
		       (unsigned long)task_stack_page(next_p) +
		       THREAD_SIZE - KERNEL_STACK_OFFSET);

	// Load debug regs & I/O bitmaps
	if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
		     task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
		__switch_to_xtra(prev_p, next_p, tss);

#ifdef CONFIG_XEN
	// Manually swap out I/O privilege bits for Xen paravirtualization
	if (unlikely(xen_pv_domain() &&
		     prev->iopl != next->iopl))
		xen_set_iopl_mask(next->iopl);
#endif

	return prev_p;
}
```
:::
*(switch the reg state and the stack)*

---

### `nvcsw` / `nivcsw`

- [(`struct task_struct`) `unsigned long nvcsw, nivcsw`](https://elixir.bootlin.com/linux/v3.10.104/source/include/linux/sched.h#L1186) <small>*(`include/linux/sched.h:1186`)*</small>
- **`nvcsw`**: ***N**umber of **V**oluntary **C**ontext **Sw**itches*
- **`nivcsw`**: ***N**umber of **I**n<b>v</b>oluntary **C**ontext **Sw**itches*
- Counted in [`void __schedule(void)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/sched/core.c#L2955) <small>*(`kernel/sched/core.c:2955`)*</small>

> ```c=
> if (prev->state && !(preempt_count() & PREEMPT_ACTIVE))  // Non-running & preemption disallowed
> 	switch_count = &prev->nvcsw;   // Voluntary
> else
> 	switch_count = &prev->nivcsw;  // Involuntary
> ```

- They are initialized in `fork()`, aka [`long do_fork(...)`](https://elixir.bootlin.com/linux/v3.10.104/source/kernel/fork.c#L1574) <small>*(`kernel/fork.c:1574`)*</small>

---

### CPU-bound / I/O-bound

==TODO==

- CPU-bound: the process is limited by CPU power; may be caused by a slow language runtime, e.g. Python
- I/O-bound: the process is limited by I/O; try multi-processing

{%youtube sntGta76v6Y %}

---

==TODO==

~**C/C++ `stdout` Flush**
~**Inline Assembly**
~**`strace`**

:::info
## BSQTAA
1. How to grab the CPU counter
2.
3.
4.
5.
6.
7.
:::

:::info
## The real cheersheat

==TODO==

<!-- ![](https://i.imgur.com/3ZpBwDL.png) -->

**1. What is `randomized_struct_fields` in `task_struct` for?**
> - [`struct task_struct`](https://elixir.bootlin.com/linux/v6.1.4/source/include/linux/sched.h#L737) <small>*(`v6.1.4@include/linux/sched.h:737`)*</small>
> ```c=
> # define randomized_struct_fields_start struct {
> # define randomized_struct_fields_end   } __randomize_layout;
> ```
> - Randomize the memory positions of fields (e.g.
`mm_struct`)
> - Make it more difficult for an attacker (e.g. 富皓) to exploit certain types of memory-based vulnerabilities (e.g. BOF)

**2. Why is `thread_struct` placed at the very end of `task_struct`?**
> - Because `thread_struct` is a frequently used data structure and x86 registers are scarce, it is placed at the lowest, most easily reachable position of `task_struct`.
> - The reason is similar to [*the one above*](#task_struct-amp-thread_info).
> - The memory positions at the very top and very bottom are fixed; they are not part of [`struct {...} __randomize_layout;`](https://elixir.bootlin.com/linux/v6.1.4/source/include/linux/compiler_types.h#L256) <small>*(`v6.1.4@include/linux/compiler_types.h:256`)*</small>
> - [`struct task_struct`](https://elixir.bootlin.com/linux/v6.1.4/source/include/linux/sched.h#L737) <small>*(`v6.1.4@include/linux/sched.h:737`)*</small>
> ```c=737
> struct task_struct {
> #ifdef CONFIG_THREAD_INFO_IN_TASK
> 	/*
> 	 * For reasons of header soup (see current_thread_info()), this
> 	 * must be the first element of task_struct.
> 	 */
> 	struct thread_info thread_info;
> #endif
>
> 	// ...OMITTED...
>
> 	/*
> 	 * This begins the randomizable portion of task_struct. Only
> 	 * scheduling-critical items should be added above here.
> 	 */
> 	randomized_struct_fields_start // struct {
> ```
> ```c=1531
> 	/*
> 	 * New fields for task_struct should be added above here, so that
> 	 * they are included in the randomized portion of task_struct.
> 	 */
> 	randomized_struct_fields_end // } __randomize_layout;
>
> 	/* CPU-specific state of this task: */
> 	struct thread_struct thread;
>
> 	/*
> 	 * WARNING: on x86, 'thread_struct' contains a variable-sized
> 	 * structure. It *MUST* be at the end of 'task_struct'.
> 	 *
> 	 * Do not put anything below here!
> 	 */
> }
> ```

**3. What are `nvcsw` / `nivcsw` in `kernel/sched/core.c`?**
> - Voluntary / Involuntary context switch counts
> - [LINK](#nvcsw--nivcsw)

**4. What does `rq = context_switch(rq, prev, next, &rf);` in `kernel/sched/core.c` do?**
> - Moves the 6 segment registers:
>   - ES: Extra segment (general purpose)
>   - DS: Data segment
>   - FS: TLS entry (general purpose)
>   - GS: TLS entry (general purpose)
>   - CS: Code segment
>   - SS: Stack segment
> - For the rest, see the line-by-line annotations above: [LINK](#Context-Switch)

**5. What serves as the priority in a CPU runqueue?**
> - Virtual Runtime `vruntime`
> - `nice` (---> `weight`) ---> `vruntime` (---> load *(PELT (Per-Entity Load Tracking))*)

**6. How many runqueues are there in one Linux OS?**
> - One runqueue per CPU
<!-- > - see [here (runqueue)](#Schedule) -->

**7. How does a process get woken from the wait queue back to the ready queue?**
> https://hackmd.io/@YiZjennnnn/process_scheduling
> - When the I/O or system call (e.g. `sleep()`) finishes, an interrupt is thrown back, and the OS puts the process back on the ready queue
> - Possibly an interrupt or a signal
> ![](https://i.imgur.com/TQNJOqK.png)
:::

<!------------------------------------------------------------>

<style>
.markdown-body {
    max-width: 1024px !important;
}
</style>