Linux Kernel Development

# Process Management A process is a program in the midst of execution. Each thread includes a unique program counter, process stack and set of processor registers, kernel schedules individual threads, not process. To Linux thread is just a process which share resources with other process Kernel stores the list of processes in a circular doubly linked list called **task list** ## Process descriptor Each element in the task list is a process descriptor of the type **struct task_struct** `task_struct` 中儲存所有核心所需的 process 資訊，如: - open files - process's address space - pending signals - process state - etc ## Process State Process descriptor 中的 `state` 欄位儲存行程目前的狀態，在 kernel 中 process 有五種可能的狀態，用不同的 flag 表示 - `TASK_RUNNING`: 已經可以執行，行程在這個狀態下可能已經在執行或在 runqueue 中 - `TASK_INTERRUPTIBLE`: 在睡眠中 (blocked)，等待 condition 成立，成立後 kernel 會把狀態改為 `TASK_RUNNING`，行程收到 signal 時也會被喚醒 - `TASK_UNINTERRUPTIBLE`: 與 `TASK_INTERRUPTIBLE` 相同，但在收到 signal 時不會被喚醒，只有在 interrupt 或是事件發生才會被喚醒。 >不會被`SIGKILL`刪除，但若 task 持有 semaphore 時不該被刪除 (mutex 可以知道誰持有，所以在刪除時會釋放) - `__TASK_TRACED`: 行程正在被其他行程 **traced**，如使用 `ptrace` - `__TASK_STOPPED`: 行程執行被終止，在收到 `SIGSTOP`、`SIGTSTP`、`SIGTTIN`、`SIGTTOU` signal 發生，或在被 debug 時收到任何 signal 都會發生 ## Process Family Tree 所有行程都是 `init` process (PID = 1) 的後代 > `init` process 在開機的最後部分時執行`initscripts`完成開機每個行程都有一個 parent，在 `task_struct` 中有欄位儲存 parent 以及 children 的串列。 ## Process Creation Unix 使用 `fork()` 及 `exec()` 來創建行程 1. `fork`: 利用目前行程建立一個子行程，與目前行程唯一的差異只有 PID 與 PPID 及一些資源及資訊，如 pending signals 2. `exec`: 載入新的可執行檔到 address space 並開始執行 ### fork Linux 利用 `colne` 系統呼叫來實現 `fork`，`clone` 的參數為 flags，來指定那些資源會被 parent 與 child 共享，`vfork()`、`fork`、`__clone`都是利用 `clone` 及不同的 flag 實現。在 `clone` 完成後，會呼叫 `do_fork()`，`do_fork()` 呼叫 `copy_process()`並讓行程開始執行，以下是`copy_process()`的流程: 1. 呼叫 `dup_task_struct()`建立新的 kernel stack、`thread_info`、`task_struct`，目前 parent 跟 child 的 process descriptor 是相同的 2. 檢查 current user 在新增 new child 後 number of process 有沒有超過 resource limit 3. 清除或初始化 child 的 process descriptor 中的部分內容，主要是 statistically information 4. child 的 state 設定成 `TASK_UNINTERRUPTIBLE` 5. 更新 `task_struct` 中的 flag 6. 更新 child 的 PID 7. 根據 `clone` 的參數複製或共享 open files, filesystem information, signal handlers, process address space and namespace :::success Kernel 會先執行子行程，因為通常子行程會直接呼叫 `exec`，而若先執行 parent 就較可能寫入 address space 但目前 kernel 的實作好像不是這樣 ::: ## Linux Implementation of threads 多個執行緒會在相同的 address space 下執行相同的程式，也可以共享 open files 與其他資源，在 Linux 中 thread 只是跟其他行程共享資源的行程，每個執行緒都有自己的 `task_struct` 結構體 ## Creating Threads 與`fork`相同，只是會共享 process address space、filesystem resourses、file descriptors、signal handler ```c clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0); ``` `fork` 可以利用以下 flag 實作 ```c clone(SIGCHLD, 0); ``` ## Kernel Threads Kernel thread 與一般行程的差別是 kernel thread 沒有 address space (`mm` 指向 NULL，也可以說 address space 是整個 kernel space)，且只會在 kernel space 中執行，不會 context switch 到 user-space 如 `ksoftirqd` (Kernel Software Interrupt ReQuest Daemon): 每個 CPU 永遠會有一個來 dispatch interrupt requests. # Process Scheduling Criteria of scheduler: - Interativness - Response time - Throughput - Fairness: 相同優先權的任務執行時間應該要差不多 - Liveness: Freedom from starvation - Power Efficiency: 根據 workload 動態調整 CPU frequency ## How scheduler knows if task is time-sensitive source program 中會標註類型 (**scheduling class**) 來進行排程 Scheduler only put a task from a scheduling class into execution only when there are no other available tasks with higher priority in the scheduling hierarchy ## Classes and policies Policies represent special scheduling decisions and each process has a policy associated with it ### Classes 目前 kernel 有五種不同的 scheduling classes 1. `stop_sched_class`: overwrite all scheduling classes before attempting to shut down the machine or hot-plug a CPU，用在排程 per-CPU 的 stop task，優先權最高 (只有在 SMP 系統才會有) 2. `dl_sched_class`: Implements the Earliest Deadline First scheduling policy 3. `rt_sched_class`: follows the POSIX specification for real-time scheduling 4. `fair_sched_class`: Implements the completely fair scheduling policy 5. `idle_sched_class`: solely responsible for the dispaching of the idle tasks，用來排程 per-CPU idle task (swapper task)，在沒有其他 task 時執行 ### Policies - `SCHED_DEADLINE` - `SCHED_FIFO` - `SCHED_RR` - `SCHED_NORMAL` - `SCHED_BATCH` - `SCHED_IDLE` `SCHED_FIFO`與`SCHED_RR`的優先權相同，會共享相同的 runqueue，且 RR 與 FIFO 非常相似，差別只有 RR 有 timeslices > Runqueue: holds tasks that are in a runnalbe state at any given moment of time, waiting for the processor #### SCHED_FIFO 1. 系統只有在以下狀況時才會中斷 FIFO 任務: a. 更高優先權的執行續狀態變為 Ready b. 另一個更高優先權的 FIFO 任務變為 Ready c. 目前的 FIFO 任務被 block (如等待 I/O) d. 目前 FIFO 任務呼叫 `sched_yield` 自願放棄 CPU 2. FIFO 任務被中斷時會被放在對應優先權的佇列中 3. 若 FIFO 任務變為 ready 且優先權高於目前任務則目前任務會被 preempt，若優先權相同則先執行等比較久的 #### SCHED_RR 除了 timesilce 之外 Policy 與 FIFO 相同 FIFO 與 RR 的程式碼是同一份可能讓 real-time task 有 bug 時 completely block the system - `dl_sched_class` - `kernel/sched/deadline.c` - `rt_sched_class` - `kernel/sched/rt.c` - `fair_sched_class` - `kernel/sched/fair.c` `SCHED_DEADLINE`: 優先權最高的 policy，會定義 deadline 的時間 `SCHED_NORMAL` 與 `SCHED_OTHER` 為 default policy 並使用 CFS (在 fair.c 中) `SCHED_BATCH`與`SCHED_NORMAL`類似，但搶佔發生較不頻繁，interactivity 低但 better use of cache `SCHED_IDLE`: 優先權比`SCHED_NORMAL`低每個 task 都會有對應的 priority ## CFS CFS 沒有 timeslice 的概念，但還是會紀錄每個行程執行的時間來保持 fair share 使用 `<linux/sched.h>` 中的 `struct sched_entity` 儲存，`struct sched_entity` 為排程器管理資源分配的單元 ```c struct sched_entity { /* For load-balancing: */ struct load_weight load; struct rb_node run_node; u64 deadline; u64 min_vruntime; u64 min_slice; struct list_head group_node; unsigned char on_rq; unsigned char sched_delayed; unsigned char rel_deadline; unsigned char custom_slice; /* hole */ u64 exec_start; u64 sum_exec_runtime; u64 prev_sum_exec_runtime; u64 vruntime; s64 vlag; u64 slice; u64 nr_migrations; #ifdef CONFIG_FAIR_GROUP_SCHED int depth; struct sched_entity *parent; /* rq on which this entity is (to be) queued: */ struct cfs_rq *cfs_rq; /* rq "owned" by this entity/group: */ struct cfs_rq *my_q; ``` > 省略 config 部分變數 scheduler entity 也被 embedded 在 process descriptor (`task_struct`) 中，命名為 `se` ```c struct task_struct { struct sched_entity se; struct task_group *sched_task_group; } ``` `struct cfs_rq` 的指標 `cfs_rq`，指向其所在的 runqueue 中。而另一個 `struct cfs_rq` 指標 `my_q` 則是當該 `se` 是屬於 **group entity**，會 reference 到自己底下的 run queue。結構中都有自己的 struct load_weight 結構的成員，代表該排程單元在排程中的被分配權重。最後，sched_entity 中也包含用來選擇任務需要的 vruntime。如果 sched_entity 是隸屬於非 root_task_group 的某個 task_group 之下的話，成員的 parent 會指向該 task_group 的 sched_entity。對應上方的圖可以看到，這是因為系統中的 runqueue 其實是階層式結構的(對於某個 sched_entity 所在的 runqueue 所屬的 task_group，該 task_group 可能是在另一個 task_group 的 runqueue 之下)，而某些操作會需要對每一層級的sched_entity 都進行更新，因此記錄 parent 資訊可以有效的進行此操作。 CFS 的 runqueue 的資料結構一個 RB-tree 來維護。因此我們可以在其 sched_entity 下看到 rb_node 這個結構。後者就可以被插入在 runqueue 的紅黑樹下。那些急需 CPU (virtual runtime 小) 的 task 會被放在紅黑樹的左邊，而不迫切的 task 則在右邊。為保持公平，Scheduler 會去挑選 leftmost 的任務，任務被切換掉後，把 virtual runtime 加上使用的 CPU 時間重新插入 RB-tree。如此一來，左側的任務就被確保有時間，並且樹的內容會從右向左遷移以保持公平 ### task_group 如果 CFS 的公平僅是以 `task_struct` 表示的 task 為單位的話，實際上會造成某種程度上的不公平。試想一個情境：在某個 server 上的 50 個任務，有 1 個行程 A 所擁有，另外 49 個是由一個行程 B 所擁有，如果只以 task 為公平的判斷單位，這就表示行程 B 可以佔用 98 % 的 CPU 資源。為此，CFS 中帶來了 group scheduling 的概念。Group scheduling 是另一種帶來公平性的方法。而由於排程的單位是基於 `sched_entity` 而非 `task_struct`，這就使得 group scheduling 得以被實作。實作上是透過定義 task_group 結構描述由多個任務組成的群組。`task_group` 中也包含 `sched_entity` 實體，因此其也可以做為一個 CFS 所管理的對象，被加入到一個 runqueue 之中。並且，`task_group` 中對於自己群組中的任務也存在自己的 runqueue 中。換句話說，系統中的 runqueue 其實是階層式樹狀結構的 ```c /* Task group related information */ struct task_group { ... /* schedulable entities of this group on each CPU */ struct sched_entity **se; /* runqueue "owned" by this group on each CPU */ struct cfs_rq **cfs_rq; ``` `task_group` 中兩個重要的成員: 其一是 se，其內容是一個 sched_entity 指標的陣列，對於該 `task_group`，每個 CPU 都具有一個 sched_entity 可以註冊到 runqueue 中。另一個則是 cfs_rq，是 `task_group` 對應每個 CPU 的 runqueue。因為引入了 group scheduling 的觀念，scheduler 在排程時，會去尋找最需要 CPU 的 `sched_entity`，如果 `sched_entity` 表現的不是一個 task，則再去 `sched_entity` 下的 runqueue 挑出下一個 entity，重複直到找到是表現一個 task 的 `sched_entity`。最後，在把 CPU 時間加回 virtual runtime 時，會把該 `sched_entity` 包含往上 parent 階層的都一併更新。因此，面對一開始提到的問題，系統的管理者只要針對 A 君和 B 君建立一個 `sched_entity` ，就可以做到針對兩個使用者的公平。實際上，所有任務都會根源於 `root_task_group` 這個群組中。 ### cfs_rq 描述 CFS 之 runqueue 的結構是 cfs_rq: ```c struct cfs_rq { struct load_weight load; unsigned int nr_running; u64 min_vruntime; struct rb_root_cached tasks_timeline; struct task_group *tg; /* group that "owns" this runqueue */ ``` 可以看到 `cfs_rq` 中也存在一個 `load_weight` 結構，該數值其實是 runqueue 中所有任務的權重總和，藉此可以有效的更新對應層級的 task_group 之 vruntime，而 `nr_running` 則記錄該 runqueue 中的 `sched_entity` 之數量。 `min_vruntime` 是 runqueue 中所有 entity 的 vruntime 之最小值，該值是用來在 enqueue `sched_entity` 的時候協助初始化其 vruntime。思考任務被移出 runqueue 時的議題: 假設有任務 A 需要處理 I/O 而被暫時移出 runqueue，其 vruntime 會維持同樣的值，而仍處在在 runqueue 中的任務則會持續增加。當這個任務 A 被喚醒並且放回 runqueue 時，如果 vruntime 是依離開時的值設定，這會導致這個任務的優先遠超其他原本在 runqueue 中的任務，這違反了 CFS **fair** 的精神。類似的情形，如果一個新任務被建立，我們不可能直接 assign 0 值給它，否則舊的任務等於有很大一段時間都不能取得 CPU 資源，同樣不 "fair"。因此，CFS scheduler 會追蹤整個 runqueue 中的最小 min_vruntime，當每個任務被挑選並運行時，這個最小值也同步更新。如此一來，當新的任務被建立，就可以透過這個值去初始化其 vruntime。對於進入 wait queue 而後被喚醒的任務，則是確保其原本的 vruntime 需不小於記錄的最小值，否則就重設為該最小值再放回 runqueue 中，藉此避免被喚醒的任務會長時間的霸佔 CPU 資源。接著，我們可以在 cfs_rq 結構中看到 rb_root_cached 類型的成員 tasks_timeline，後者包含了 RB-tree 的根節點(root, rnuqueue 直接擁有)與最左側的節點(leftmost, 指到某個 sched_entity 之 rb_node)之 reference 以更有效率的操作 RB-tree。 ```c struct rb_root_cached { struct rb_root rb_root; struct rb_node *rb_leftmost; }; ``` cfs_rq 也可以透過其下的 tg 來找到自己所屬的 task_group。 ### vruntime CFS 會維護每一個 task 的 **virtual runtime**。 Task 的 virtual runtime 愈小，表示其被允許使用的 CPU 時間愈短，也間接代表此 task 對 CPU 的需要更高，因此更需要被排程。CFS 也會讓進入 sleep 狀態的任務(例如等待 I/O)算入 virtual runtime，確保當他們醒來時可以得到足夠的 CPU 額度，進而讓 interactive task 有較高的優先權。 >Virtual runtime 指形成截至目前為止已經運行的虛擬時間，所以一個任務的 vruntime 越小，代表它比起其他任務還需要更多的 CPU 額度，會選擇較小 vruntime 的行程來達到公平性，最理想的狀態是相同優先權下的任務有相同的 virtual runtime ```c /* * Update the current task's runtime statistics. */ static void update_curr(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; s64 delta_exec; if (unlikely(!curr)) return; // Get the amount of time the current task was running // since last time change load delta_exec = update_curr_se(rq_of(cfs_rq), curr); if (unlikely(delta_exec <= 0)) return; curr->vruntime += calc_delta_fair(delta_exec, curr); update_deadline(cfs_rq, curr); update_min_vruntime(cfs_rq); if (entity_is_task(curr)) update_curr_task(task_of(curr), delta_exec); account_cfs_rq_runtime(cfs_rq, delta_exec); } ``` CFS 利用 nice value 來加權形成獲得的 proportion (越高優先權越低) 因為是計算 proportion，有可能導致在行程過多時分配到的 proportion 過小，時間小於 context switch，所以有 minimum granularity，假如為 1 則每個行程至少會執行 1 毫秒 [Linux 核心設計: Scheduler(2): 概述 CFS Scheduler](https://hackmd.io/@RinHizakura/B18MhT00t) [Linux 核心設計: 不只挑選任務的排程器](https://hackmd.io/@sysprog/linux-scheduler#CPU-scheduling-%E5%B0%8D%E7%B3%BB%E7%B5%B1%E7%9A%84%E5%BD%B1%E9%9F%BF) # System call 除了 cexception 跟 trap 外從 user-space 進入 kernel space 的唯一方式 > device files, /proc 等等都是通過 syscall 存取在 user-space 利用 C library 中的函式呼叫，回傳值為 `long` (for compatibility with 64-bit architectures)，通常負的回傳值代表錯誤 (not always **find exception**)，錯誤會被寫入全域變數 `errno` 中，可以透過 `perror()` 轉換成 human-readable 的錯誤訊息 kernel 中 syscall 的實作 (如 `getpid`): ```c SYSCALL_DEFINE0(getpid) { return task_tgid_vnr(current); // returns current->tgid } ``` [SYSCALL_DEFINE0](https://elixir.bootlin.com/linux/v6.11/A/ident/SYSCALL_DEFINE0) ，不同 ISA 也有不同的相同的巨集， :::info 為甚麼不同 ISA 需要實作不同的 SYSCALL_DEFINE0 ::: ```c #ifndef SYSCALL_DEFINE0 #define SYSCALL_DEFINE0(sname) \ SYSCALL_METADATA(_##sname, 0); \ asmlinkage long sys_##sname(void); \ ALLOW_ERROR_INJECTION(sys_##sname, ERRNO); \ asmlinkage long sys_##sname(void) #endif /* SYSCALL_DEFINE0 */ ``` :::warning sys call return an `int` in user-space and a `long` in the kernel for compatibility between 32 and 64-bit systems ??? ::: [asmlinkage modifier](https://www.jollen.org/blog/2006/10/_asmlinkage.html) Tell the compiler to look only on the stack for this function's arguments，所有的系統呼叫都必須有這個 modifier。 > The asmlinkage tag is one other thing that we should observe about this simple function. This is a #define for some gcc magic that tells the compiler that the function should not expect to find any of its arguments in registers (a common optimization), but only on the CPU's stack. Recall our earlier assertion that system_call consumes its first argument, the system call number, and allows up to four more arguments that are passed along to the real system call. system_call achieves this feat simply by leaving its other arguments (which were passed to it in registers) on the stack. All system calls are marked with the asmlinkage tag, so they all look to the stack for arguments. Of course, in sys_ni_syscall's case, this doesn't make any difference, because sys_ni_syscall doesn't take any arguments, but it's an issue for most other system calls. It is also used to allow calling a function from assembly files. 使用 stack 傳參數可能是為了不同 ISA 間的相容性 ## System call number 每個系統呼叫都有一個唯一的 **syscall number** 讓 kernel 辨別，而不是用函式的名稱，一個 syscall 在之後來的版本被刪除 syscall number 也不能重複使用，不然可能導致以前編譯的程式有錯誤的行為核心中有 system call table，不同架構有不同的 table :::info TODO: [[PATCH 17/17] riscv: convert to generic syscall table](https://lore.kernel.org/linux-arm-kernel/20240704143611.2979589-18-arnd@kernel.org/)，了解 scripts/syscall.tbl 怎麼生成 syscall_table_64.h ::: user-space 應用程式在呼叫系統呼叫時會通過 software interrupt 來通知 kernel，系統會切換成 kernel mode 並執行 exception handler，這邊的 exception handler 為 system call handler > x86 的 software interrupt 對應的 interrupt number 為 128 syscall number 與 parameter 是利用 register 來傳遞給 kernel，回傳值也是利用暫存器來傳遞到 user-space 系統呼叫需要檢查 user-space 傳入的參數 - 指標要指向 user-space 的記憶體位置，不能讀取 kernel-space 的記憶體 - 指標要指向行程的 address space，否則可能讀取到其他資料 - 須遵守 memory accoss restrictions，如讀取則記憶體需要是可讀的若要寫入 user-space 要使用 `copy_to_user()`，讀取 user-space 資料則使用 `copy_from_user()` > 整理 > https://hackmd.io/@sakuraiiii/r1gFubKkF 有些操作還需要檢查 permission，如 `reboot()` ```c if (!capable(CAP_SYS_BOOT)) return -EPERM; ``` 只有 superuser 才可以呼叫這個系統呼叫在呼叫系統呼叫的行程的 process context 中執行，所以可以睡眠 (blocks on a call or call `schedule()`)，也是 preemptible，如果目前的 task 正在執行系統呼叫，被另一個要執行相同系統呼叫的 task preempt，需要保證系統呼叫是 reentrant 雖然系統呼叫很容易使用且執行的很快，但因為 syscall number 進入到 stable kernel 後就不能再修改了，沒辦法很容易地用 script 跟 filesystem 來存取系統呼叫，所以通常可以使用以下方式實作替代新增一個系統呼叫 - device node 搭配 `ioctl()` - sysfs - etc :::success others: - vsyscall ::: # Kernel Data Sturctures ## Linked List [linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h) 核心的串列為 circular doubly-linked list，為了讓大家不用自己維護一套 doubly-linked list，把資料與 doubly-linked list 拆開 ```c struct list_head { struct list_head *next; struct list_head *prev; } ``` 只要在自定義的結構中加入 `struct list_head`，就可以搭配 Linux 中一系列的 linked list 操作 (詳見 [The Linux Kernel API - List Management Functions](https://www.kernel.org/doc/html/latest/core-api/kernel-api.html)) 來建立自定義結構的 linked list ```c struct fox { unsigned long tail_length; unsigned long weight; struct list_head list; }; ``` 並利用 `container_of()` 來找到含有 `list_head` 的 parent structure ```c #define container_of(ptr, type, member) ({ \ const typeof( ((type *)0)->member ) *__mptr = (ptr); \ (type *) ( (char *)__mptr - offsetof(type,member) );}) ``` ```c #define list_entry(ptr, type, member) \ container_of(ptr, type, member) ``` 透過分離 linked-list 與 data，就可以使用核心提供的一系列的 linked list API [Linux 核心原始程式碼巨集: container_of](https://hackmd.io/@sysprog/linux-macro-containerof) [你所不知道的 C 語言: linked list 和非連續記憶體](https://hackmd.io/@sysprog/c-linked-list) ## Queues <linux/kfifo.h> ## Map ## RBtree # Interrupt 當硬體想要讓處理器做某些事情就會發出中斷，中斷可能會在任何時候發生 (與 processor clock 不同步)，硬體裝置會產生數位訊號到 interrupt controller 的 input pin，interrupt controller 會再傳送訊號給處理器，接收到後就會通知 OS 中斷發生並處理。 > TODO: PLIC 核心會利用 Interrupt value (也稱為 interrupt request (IRQ) line) 來判斷中斷的原因，一個 interrupt 會對應一個裝置 > exception 會與 processor clock 同步，而中斷不會，exception 是在執行指令導致的錯誤 (如 divide by zero) 或是一些需要核心處理的狀況發生 (如 page fault) ## Interrupt Handler 核心中處理各個中斷的函式叫做 interrupt handler 或 interrupt service routine (ISR)，每個中斷都有對應的 ISR > ISR 是 device driver 的一部分 ISR 會在 interrupt context (有時候被稱為 atomic context) 中執行，在 process context 執行是 preemptible 的，但 interrupt context 不能被 block，進去就會做完 (只能被更高優先權的中斷 preempt)，所以 ISR 需要快速執行完成若 ISR 需要處理的事情比較多，則需要分成 top 跟 buttom 兩部分如網路卡當網路卡收到封包時，需要立即通知 kernel，如果這個過程很慢就會掉封包，且為了優化network throughput and latency and avoid timeouts.，所以他會立馬發送interrupt 給 kernel kernel 需要立即把網路卡送來的 data 存進記憶體裡，因為是 ring buffer，沒有及時處理會有覆蓋問題，這些需要立即處理的問題放在 top half，也就是 ISR :::warning [圖解Linux網路包接收過程](https://mp.weixin.qq.com/s/GoYDsfy9m0wRoXi_NCfCmg) 中提到 top half 只會簡單紀錄暫存器的值與觸發 softirq，不會在 top half 處理 data ::: ## Registerring an Interrupt Handler 每個裝置都有對應的 driver，如果需要使用中斷則 driver 需要跟核心註冊 irq，利用 `<linux/interrupt.h>` 中的 `request_irq()` ```c /* request_irq: allocate a given interrupt line */ int request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags, const char *name, void *dev) ``` - `irq`: 要分配的 interrupt number - `handler`: 指向 ISR 的指標 - `flag`: 定義在 `<linux/interrupt.h>` 中 - `IRQF_DISABLED`: 設為 1 代表核心會在執行這個 ISR 時關閉所有的中斷，只有在 real-time 裝置使用 - `IRQF_SAMPLE_RANDOM` - `IRQF_TIMER` - `IRQF_SHARED`: 表示這個 interrupt line 可以被多個 ISR 共享。如果沒有設定這個 flag，則每個 line 只會有一個 handler - `name`: 裝置的名稱 - `dev`: 在共享 interrupt line 時使用 :::info request_irq() 是可能導致 sleep 的，所以不能在 ISR 中執行執行過程會在 `/proc/irq` 建立一個 entry，其中會使用到 kmalloc()，而 kmalloc() 可以 sleep ::: ## Freeing an Interrupt Handler 在 unload driver 時需要 unregister ISR，如果一個 interrupt line 的 handler 都被 free 了則 interrupt line 會被 disable ```c void free_irq(unsigned int irq, void *dev) ``` :::info ISR 不需要是 reentrant，一個 ISR 在執行時，對應的 interrupt line 會在所有的處理器被 mask out，所以相同的 ISR 不會被 concurrently 呼叫，簡化了 ISR 的實作 ::: ## Interrupt Context 在 process context 下，巨集 `current` 會指向目前的 task，且可以 sleep 或是呼叫排程器而 interrupt context 下，與行程沒有關聯 (`current` 指向被中斷的行程)，所以不能 sleep (**沒辦法排程**) Interrupt context 一定要 quick and simple (因為占用的是別的行程的的 cpu 時間，被中斷的對象也可能是別的 interrupt handler on a different line) 有些版本的 interrupt 有自已的 stack (interrupt stack)，有些是跟被中斷的行程共用，但不管有沒有都要最小化 stack 的使用量 > TODO: READ >https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html [Device Driver](https://hackmd.io/LY7FS2QtTkibO_PsU9RMow?view) ## Implementing Interrupt Handler 中斷處理系統的實作是 architecture-dependent，根據處理器與 interrupt controller 而有所不同處理器收到中斷後會進入核心中預先訂好的中斷處理 entry point，會保存目前的暫存器到 stack，並執行 `do_IRQ()` ```c unsigned int do_IRQ(strutct pt_regs regs) ``` `pt_regs` 中會儲存在 assembly entry 保存的暫存器的初始值，接收中斷並 disable 該 interrupt line，再來會檢查該 line 的 ISR 有沒有註冊且目前沒有在執行，並呼叫 `handle_IRQ_event()` ## 裝置中斷號碼如何映射到核心 interrupt line number 使用 `struct irq_domain` 來描述中斷控制器，使用 `irq_of_parse_and_map()` 將硬體中斷號碼映射到核心的 IRQ number ## /proc/interrupts ## Interrupt Control Disable interrupt 可以保證目前的程式不會被 ISR preempt，disable interrupt 也會 **disable kernel preemption** # Bottom half 以下操作都在 ISR 中執行，其他都是在 buttom half 執行: - time sensitive - 與硬體有關 - 需要保證其他中斷不會中斷目前的程式 (特別是相同的中斷) Kernel 2.3 引入 **softirqs** 與 **tasklets** - sotfirqs - statically defined bottom halves 集合 (在編譯時就要寫好) - 可以在任一處理器同時執行，兩個相同的任務也可以 - 在注重 performance 時使用 - Tasklets - 可以動態建立，基於 softirq 實作 - 兩個不同的 tasklets 可以同時執行，但相同的不行 ## Softirqs 用在最重要且 timing-critical 的下半部份處理，目前只有在 networking 跟 block device 兩個 subsystems 使用 :::info Kernel timer 與 tasklets 都是用 softirq 實作 ::: ```c struct softirq_action { void (*action)(struct softirq_action*); // 呼叫方式 my_softirq->action(my_softirq); // 可以在 softirq_action 結構體後面帶上自定義資料 }; ``` softirq 不會被其他 softirq 搶佔，只會被 ISR 搶佔 Pending 的 softirq 會在以下狀況被檢查: - return from hardware interrupt code path - In the `ksoftirqd` kernel thread - In any code that explicitly cehcks for and executes pending softirqs Softirq 類型: | Tasklet | Priority | Softirq Description | | -------- | -------- | -------- | | HI_SOFTIRQ | 0 | High-priority tasklets | | TIMER_SOFTIRQ | 1 | Timers | | NET_TX_SOFTIRQ | 2 | Send network packets | | NET_RX_SOFTIRQ | 3 | Receive network packets | | BLOCK_SOFTIRQ | 4 | Block devices | | TASKLET_SOFTIRQ | 5 | Normal priority tasklets | | SCHED_SOFTIRQ | 6 | Scheduler | | HRTIMER_SOFTIRQ | 7 | High-resolution timers | | RCU_SOFTIRQ | 8 | RCU locking | 在一個 softirq 執行時，相同的 softirq 也可能在另一個處理器執行，所以如果有共享的資源 (甚至是只有在 softirq handler 中使用的全域資料) 都需要 **proper locking**，但如果用鎖導致不能 concurrent 執行那就使用 tasklet 就可以了 > 大部分 softirq handler 會有 per-processor data (每個處理器獨有的資料)，就不需要鎖，與其他的技巧來避免鎖如果沒辦法做到 **scalability** (sacle to infinitely many processors) 那就使用 tasklet 就好 ## Tasklets 大部分情況下使用 tasklet 即可，是其中一種 softirq (`HI_SOFTIRQ` 或 `TASKLET_SOFTIRQ`，差別只在優先權) softirq 與 tasklet 都不能 sleep (在 interrupt context 執行)，代表不能使用 semaphore 或其他會導致阻塞的函式，在執行時也可能被其他中斷搶占，所以如果有與中斷共享資源則需要鎖，但兩個相同的 tasklet 不會同時執行 ## ksoftirqd softirq (包含 tasklet) 是由 per-processor kernel thread 集合來處理 softirq 可以自己 reactive 自己，所以可能一直重複處理 softirq，導致 user-space 的程式的 starvation，但定期處理 softirq 是不可接受的解決方式: reactivated softirq 留到下次處理，但在 idle system 時可能會浪費可以處理 softirq 的機會。雖然解決 user-space 程式 starvation，但 softirq 處理得較慢，沒有利用到 idle-system 的優勢目前的解決辦法是在要處理的 softirq 的數量過多時喚醒 a family of kernel threads 來處理，這些 kernel thread 的優先權是最低的，避免與其他重要任務搶奪資源每個處理器都會有一個這樣的執行緒，命名為 `ksoftirqd/n` (n 是處理器編號)，保證 idle processor 永遠可以處理 softirq ## Workqueue Defer work into kernel thread 永遠在 process context 執行，所以可以排程也可以 sleep。如果需要這兩個功能則用 workqueue，如分配大量記憶體、取得 semaphore、block I/O 操作，不需要的話使用 softirq 或 tasklet # Kernel Synchronization 如果其他執行緒也可以存取資源的話，就需要鎖 > lock data, not code ## Atomic Integer Operations [atomic_t](https://www.kernel.org/doc/Documentation/atomic_t.txt)，保證編譯器不會優化 value 的存取 prefer atomic operation over locking mechanisms ## Atomic Bitwise Operations Architecture-specific，defined in `<asm/bitops>` Nonatomic Bit Operation 是甚麼? :::info atomicity: Either instructioons succeed in their entirety, uninterrupted, or instructions fail to execute at all. Real atomicity requires all intermediate states be correctly realized. For example, assume you issue two atoomic bit operations: set the bit and then clear the bit. Without atomic operations, the bit might end up cleared, but it might never have been set. With atomic operations, the set would actually occur. ::: https://hackmd.io/@sysprog/ByS9w_lHh#TODO-%E7%A2%BA%E8%AA%8D%E6%94%B9%E5%AF%AB%E7%9A%84%E7%A8%8B%E5%BC%8F%E7%A2%BC%E5%BE%97%E4%BB%A5%E5%9C%A8-Linux-%E6%A0%B8%E5%BF%83%E9%81%8B%E4%BD%9C ## Spin Locks A single-holder lock should be held for short durations. 若持有 spinlock 的時間超過兩個 context switch，則可以使用 sleep Spin locks in Linux kernel are not recursive (嘗試持有已經持有的自旋鎖也會被鎖住) 可以在 ISR 中使用，取得鎖前需要關閉 local interrupt (其他處理器上的中斷不會影響鎖的釋放)，否則優先權更高的 ISR 如果也要取得相同的鎖可能導致 deadlock 在 uniporcessor 或關閉 kernel preempt 則鎖會被 compile away，但還是要關閉中斷避免 ISR 存取共享資料 Always associate shared data with a specific lock. For example, `struct foo` is locked by `foo_lock` ## Read write spin lock 在可以分成 reader 與 writer 的場景，kernel 有提供 read-write spin lock，多個 reader 可以在沒有 writer 時同時讀取相同的執行緒可以 recursively obtain 相同的 read lock ## Semaphores Sleep locks，如果嘗試取得失敗則會把任務放到 wait queue 並讓任務睡眠在取得 semaphore 時可以 sleep (因為其他行程嘗試取得相同的鎖會 sleep)，但在持有 semaphore 時不能持有 spin lock (持有 spin lock 時不能 sleep) ## mutex - 只有持有 mutex 的行程才可以解鎖 - 行程在持有 mutex 時不能 exit - 持有 mutex 時也可以睡眠 https://hackmd.io/jLH5aLc-R3qWkNjfbKti9A?view ## Completion Variables In fact, completion variables merely provide a simple solution to a problem whose answer is otherwise semaphores 用在同步，發送訊號讓另一個行程知道事件發生如 `vfork()` 使用 completion variables 在子行程呼叫 `exec()` 或 `exit()` 時來喚醒 parent 行程 [Difference between completion variables and semaphores](https://stackoverflow.com/questions/4764945/difference-between-completion-variables-and-semaphores) ## Sequential Locks [Sequence counters and sequential locks](https://docs.kernel.org/locking/seqlock.html) 又稱為 **seq lock**，writer 優先及較高在以下狀況時使用: - 有很多 reader - writer 少 - 想要提高 writer 的優先級但不會讓 reader starve - 要保護的資料簡單，如一個簡單的結構體或整數 (不能為 atomic) jiffies 使用 seq lock ## Preemption Disabling 在 per-percessor data，因為資料只在一個處理器中，所以可能不需要用 lock 保護，但資料可能被其他任務存取，這時候可以關閉搶占 # Timers and Time Mangement https://hackmd.io/@sysprog/linux-timer # Memory Mangement Kernel 的記憶體管理與比 user-space 困難很多，因為核心沒辦法很容易得處理記憶體分配錯誤，且常常處於不能 sleep 的狀態 ## Pages 核心以 Physical pages 作為記憶體管理的基本單位。雖然處理器的最小 addressable 的單位是 byte 或 word，但通常 MMU 的單位為 page 核心中以 [struct page](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h) 來表示所有的 **physical page** (用來表示 physical memory，而非裡面的資料)，核心需要知道 page 是否為 free (有沒有分配)，如果不是，則需要知道是誰擁有這個 page ## Zones 一些頁面由於 physical address 在記憶體中，所以無法用於某些任務。因此，核心將頁面劃分為不同的區域，並通過這些區域來分組具有相似屬性的頁面。 Linux 必須處理硬體在記憶體尋址方面的兩個缺陷: - 某些硬體裝置只能對特定的記憶體地址執行 DMA - 某些架構物理記憶體比虛擬記憶體更大，因此某些記憶體無法永久映射到核心地址空間中 Linux 有以下四種主要的 memory zones，還有兩種在 [<linux/mmzone.h>](https://github.com/torvalds/linux/blob/master/include/linux/mmzone.h) 中: - DMA zone: 可以進行 DMA 的頁面 - DMA32 zone: 可以進行 DMA 的頁面，但只能被 32-bit 裝置使用 - Normal zone (LowMem) - HighMem Zone: High memory, 不會永遠映射到 kernel's address space 的頁面 Memory zone 的使用與架構有關，例如有些架構任何記憶體位置都可以進行 DMA，所以 DMA zone 是空的，都是使用 Normal zone 把頁面區分成不同的 Zone 可以建立 pool 來滿足各種分配需求。例如，`ZONE_DMA` 區域用來滿足 DMA 所需的記憶體分配。如果需要這類記憶體，核心可以直接從 ZONE_DMA 區域中提取所需數量的頁面。需要注意的是，這些區域並沒有實際的物理意義，它們只是核心用來管理頁面的邏輯分組。有些分配會只會從一個 zone 取得頁面，但有些分配可以從多個 zone 進行分配，例如只能從 `ZONE_DMA` 分配可以進行 DMA 的記憶體，但一般的分配可以從 `ZONE_DMA` 或 `ZONE_NORMAL` 進行分配，但不能跨 zone (**allocation cannot cross zone boundaries**) ## Getting Pages 核心在 [<linux/gfp.h>] 中提供一系列 interface 來以 page 為單位 request memory 主要的函式為 `alloc_pages` ```c sturct page * alloc_pages(gfp_t gfp_mask, unsigned int orderr) ``` 會分配 $2^{order}$ (1 << order) 的連續物理頁並回傳只到起始頁 `page` 結構體的指標可以利用 `page_address` 函式來將頁面轉換成對應的 logical address ```c void * page_address(struct page *page) ``` ## kmalloc() 以 byte 為單位取得核心的記憶體，如果需要以 page 為單位則使用上一節的 interface ```c void * kmalloc(size_t size, gfp_t flags) ``` 分配的空間是連續物理記憶體，只有在空間不夠時才會分配失敗，失敗時會回傳 NULL ## gfp_mask Flags 用 `gfp_t` type 表示，在 `<linux/types.h>` 中定義為 `unsigned int`，`gfp` 代表 `__get_free_pages()` Flag 被分成三個類別: - action modifiers: 表示核心應該如何分配記憶體 - zone modifiers: 表示要去哪裡分配記憶體 - Type Flags: action 與 zone 的組合 (如 GFP_KERNEL) 通常只會用到 type modifiers ### Action modifiers | Flag | Description | | -------- | -------- | | __GFP_WAIT | Alloctor can sleep | | __GFP_HIGH | Alloctor can access emergency pools| | __GFP_IO | Alloctor can start disk I/O| | __GFP_FS | Alloctor can start filesystem I/O| | __GFP_COLD | Alloctor should use cache cold pages| | __GFP_NOWARN | Alloctor does not print failure warnings| | __GFP_REPEAT | Alloctor repeats the allocation if it fails, but the allocation can potentially fail| | __GFP_NOFAIL | Alloctor indefinitely repeats the allocation. The allocation cannot fail.| | __GFP_NORETRY | Alloctor never retries if the allocation fails| | __GFP_NOMEMALLOC | Alloctor does not fall back on reserves| | __GFP_HARDWALL | Alloctor enforces “hardwall” cpuset boundaries| | __GFP_RECLAIMABLE | Alloctor marks the pages reclaimable| | __GFP_COMP | Alloctor adds compound page metadata (used internally by the hugetlb code| 可以一起使用，例如: ```c ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS); ``` 代表分配時如果需要可以阻塞，執行 I/O 與 filesystem 操作 ### Zone modifiers | Flag | Description | | -------- | -------- | | __GFP_DMA | 只從 `ZONE_VMA` 分配 | | __GFP_DMA32 | 只從 `ZONE_VMA32` 分配 | | __GFP_HIGHMEM | 從 `ZONE_HIGHMEM` 或 `ZONE_NORMAL` 分配 | 在呼叫 `kmalloc()` 或 `__get_free_pages()` 時不能使用 `__GFP_HIGHMEM`，因為兩個函式都會回傳 logical address，而非 `page` 結構體，可能會分配到目前沒有映射到核心虛擬記憶體的記憶體空間，導致沒有 logical address。只有 `alloc_page()` 可以分配 high memory。 ### Type Flags | Flag | Description | | -------- | -------- | | GFP_ATOMIC | 高優先權且不能 sleep，在 ISR 以及 bottom half 持有 spinlock 等不能睡眠的狀況中使用 | | GFP_NOWAIT | 優先及較 GFP_ATOMIC 低，也不能 sleep，will not fallback on emergency memory pools，導致分配失敗的機率較高 | | GFP_NOIO | 可以被 block，但不能發起 disk I/O，用在不想觸發更多 disk I/O 的 block I/O 程式中 (might lead to some unpleasant recursion) | | GFP_NOFS | 可以被 block 也可以發起 disk I/O，但不能發起 filesystem 操作，用在不能發起其他 filesystem 操作的 filesystem 程式中 | | GFP_KERNEL | 用在可以 sleep 的 process context 中， kernel will do whatever it has to do to obtain the memory requested by the caller | | GFP_USER | 可能被 block 的分配，用在 user-space 行程 | | GFP_HIGHUSER | 從 ZONE_HIGHMEM 分配且克能被 block，用在 user-space 行程 | | GFP_DMA | 從 ZONE_DMA 分配，需要 DMA-able 記憶體的 device driver 使用 | ### Type Flag 背後的 Modifiers | Flag | Modifiers Flags | | -------- | -------- | | GFP_ATOMIC | __GFP_HIGH | | GFP_NOWAIT | 0 | | GFP_NOIO | __GFP_WAIT | | GFP_NOFS | (__GFP_WAIT | __GFP_IO) | | GFP_KERNEL | (__GFP_WAIT | __GFP_IO | __GFP_FS) | | GFP_USER | (__GFP_WAIT | __GFP_IO | __GFP_FS) | | GFP_HIGHUSER | (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM) | | GFP_DMA | __GFP_DMA | 大部分在核心中的分配使用 `GFP_KERNEL`，但可能會 block 所以只能在可以重新排程的 process context 使用，因為沒有做任何限制，所以成功機率較高而 `GFP_ATOMIC` 代表在分配的過程不能 sleep，所以如果沒有夠大且可用的連續物理記憶體，因為不能 sleep 的限制，核心也不能去釋放其他空間 (`GFP_KERNEL` 可以讓 caller sleep 並 swap inactive pages to disk, flush dirty pages to disk, and so on) ### Which Flag to Use | Situation | Solution | | -------- | -------- | | Process context, can sleep | Use GFP_KERNEL | | Process context, cannot sleep | Use GFP_ATOMIC, or perform your allocations with GFP_KERNEL at an earlier or later point when you can sleep | | Interrupt handler | Use GFP_ATOMIC | | Softirq | Use GFP_ATOMIC | | Tasklet | Use GFP_ATOMIC | | Need DMA-able memory, can sleep | Use (GFP_DMA \| GFP_KERNEL) | | Need DMA-able memory, cannot sleep | Use (GFP_DMA \| GFP_ATOMIC), or perform your allocation at an earlier point when you can sleep | ## kfree 只能用在使用 `kmalloc()` 分配的記憶體 ```c void kfree(const void *ptr) ``` 不可以 free 已經被 free 過的記憶體。Doing so is a bug, resulting in bad behavior such as freeing memory belonging to another part of the kernel > When will this happen `kfree(NULL);` 是合法且安全的 ## vmalloc() 與 `kmalloc()` 相似，但分配的記憶體只保證在虛擬記憶體是連續的通常只有硬體裝置需要連續的物理記憶體，因為沒有虛擬記憶體的概念，但在不需要連續物理記憶體時通常還是會只用 `kmalloc()`，因為 `vmalloc()` 的速度通常較慢，所以只有在需要大量記憶體空間時使用，例如 moudle 被動態載入時 > 要讓不連續的物理記憶體在虛擬記憶體連續需要設定好 page table entry，而且透過 `vmalloc()` 獲取的頁面必須一個個映射，因為它們在物理上並非連續，這會導致比直接映射記憶體更多的 TLB thrashing (多 way TLB?)，進而影響效能。 ## Slab layer 核心中使用 **free lists** 來加速 object 的分配與釋放 The slab layer attempts to leverage several basic tenets: - Frequently used data structures tend to be allocated and freed often, so cache them. - Frequent allocation and deallocation can result in memory fragmentation (the inability to find large contiguous chunks of available memory).To prevent this, the cached free lists are arranged contiguously. Because freed data structures return to the free list, there is no resulting fragmentation. - The free list provides improved performance during frequent allocation and deallocation because a freed object can be immediately returned to the next allocation. - If the allocator is aware of concepts such as object size, page size, and total cache size, it can make more intelligent decisions. - If part of the cache is made per-processor (separate and unique to each processor on the system), allocations and frees can be performed without an SMP lock. - If the allocator is NUMA-aware, it can fulfill allocations from the same memory node as the requestor. - Stored objects can be colored to prevent multiple objects from mapping to the same cache lines 每個不同的 object type 一個 cache，例如 process descriptor (free list of `task_struct`)，另一個 cache 存 `struct inode`，而一個 cache 由多個 slab 組成，一組 slab 包含一個或多個連續頁面的記憶體，這些頁面包含特定大小的核心物件。 slab 有三種狀態: - full: 沒有 free object (所有的 object 都被分配) - partial: 有些物件已經分配有些被釋放 - empty: 都是 free object 當核心想要一些新的 object 時，會從 partial slab 開始尋找，若沒有才會從 empty 分配，**減少碎片化** `kmalloc()` 是由 slab 實作 (較新版本的 Linux 已改成 slub) The relationship between caches, slabs, and objects: ![image](https://hackmd.io/_uploads/SJxuIzhAR.png) 每個 cache 以 `kmem_cache` 結構體表示 ## Per-CPU Allocations Per-CPU data: 每個處理器獨有的資料，以陣列儲存，並使用目前處理器的號碼來索引使用時只需要考慮 kernel preemption，任務重新排程可能會在另一個處理器上執行，導致原本的 cpu 變數不再有效，也可能另一個任務也要在相同的處理器上存取相同的變數導致 race condition :::info Compile time 的 per-CPU 變數不能在 module 中使用，因為 linker 會建立一個 unique executable section (.data.percpu) ::: ### Reasons for Using Per-CPU Data - 減少鎖的需求，只需要關閉 kernel preemption (核心提供的 per-cpu interface 會自動關閉) - 減少 cahce invalidation，如果資料在不同處理器的快取間使用則需要 flush 或更新快取，常常 flush 稱為 **thrashing the cache**，會讓系統的效能降低可以在 process 或 interrupt context 間使用，但在存取 per-CPU 資料的過程中不能 sleep (可能會被排成到其他處理器) [slab 記憶體配置](https://hackmd.io/@sysprog/linux-slab#Linux-%E6%A0%B8%E5%BF%83%E7%9A%84-slab) [Linux 記憶體管理](https://hackmd.io/@sysprog/linux-memory#%E7%9B%AE%E6%A8%99%E8%A8%AD%E5%AE%9A) [Memory Management](https://linux-kernel-labs.github.io/refs/heads/master/lectures/memory-management.html) # Virtual Filesystem (VFS) 提供給 user-space 程式的檔案或檔案系統相關的介面，讓程式可以用 Unix 系統呼叫(如 read()、write() 等) 來讀寫不同的檔案系統或 media，使用者不需要管底層實作，且這些系統呼叫透過 VFS 支援不同的 filesystems 與 media，例如在不同檔案系統間移動檔案 ![image](https://hackmd.io/_uploads/Hku0CgaCC.png) VFS 提供通用的 file model 來表示檔案系統的 general feature set and behavior (biased toward Unix-sytle filesystems) 當一個 user-space program 執行 ```c ret = write(fd, buf, len); ``` 會呼叫 `sys_write()` 系統呼叫，並以 `fd` 的 filesystem 來決定寫入的方法 ![image](https://hackmd.io/_uploads/r141J-60A.png) ## Unix Filesystems Unix 提供四大與檔案系統有關的 abstraction: files, directory, entries, and mount points ### filesystem 遵循特定結構的階層式資料儲存方式。檔案系統包含檔案、目錄以及相關的控制資訊。對檔案系統進行的典型操作包括建立、刪除和掛載。在 Unix 系統中，檔案系統掛載於一個名為 namespace 的全域階層中的特定掛載點。這使得所有掛載的檔案系統都以單一樹狀結構中的條目形式出現。與此相對比，DOS 和 Windows 的行為是將檔案命名空間分解為磁碟代號，例如 `C:`，這樣的方式將命名空間按照裝置和分割區邊界分割開來，將硬體細節「洩漏」到檔案系統的抽象層中。由於這種劃分可能是任意的，甚至會讓使用者感到困惑，因此這種方式比 Linux 的統一命名空間要遜色。 > Linux 中每個行程都有自己的命名空間，若沒有特別定義則是繼承 parent ### files Orderd string of bytes，每個檔案都有 human-readable 的名子來給系統或使用者辨別，典型的操作有讀取、寫入、建立與刪除 ### directory 檔案被組織在目錄中，目錄下也可以有其他目錄，稱作子目錄。目錄可以嵌套形成路徑。路徑的每個組成部分都稱為 **dentries**。例如，路徑 `/home/wolfman/butter` 中，根目錄 `/`、目錄 `home` 和 `wolfman`，以及檔案 `butter` 都是 dentries 對於 VFS 來說目錄也是檔案，所以操作目錄的方式與檔案相同 Unix 系統將檔案的概念與其相關的資訊（例如存取權限、大小、擁有者、建立時間等）分開的。這些資訊有時稱為 file metadata（即檔案資料的資料），並存儲在一個稱為 **inode** (index node) 的獨立資料結構中。這些控制資訊存儲在一個稱為 superblock 的資料結構中。superblock 包含有關整個檔案系統的資訊。這些資料集合有時稱為 filesystem matadata，filesystem matadata 包括有關單個檔案和整個檔案系統的資訊。 ## VFS Objects and their Data Structures VFS 是用 C 寫的 OOP 架構，所以這些資料結構常被稱為物件 (結構體) VFS 有四種主要的物件類型: - superblock object: represents a specific mounted filesystem - inode object: represents a specific file - dentry object: represents a directory entry, which is a single component of a path - file object: represents a open file as associated with a process 因為 VFS 把目錄也視為檔案，所以沒有目錄物件。dentry 代表的是路徑中的 component，可能包含檔案，與目錄不相同。每個物件都有 operation object - **super_operations** object: contains the methods that the kernel can invoke on a specific filesystem, such as `write_inode()` and `sync_fs()` - **inode_operations** object: contains the methods that the kernel can invoke on a specific file, such as create() and link() - **dentry_operations** object: contains the methods that the kernel can invoke on a specific directory entry, such as `d_compare()` and `d_delete()` - **file_operations** object: contains the methods that a process can invoke on an open file, such as `read()` and `write()` operation objects 通常由結構體實現，內含指向相關函式的指標，這些函式專門用於操作其父物件。如果基本功能已經足夠，這些物件可以繼承一個通用的函式。否則，特定的檔案系統實例會用其實作專有的方法來填充這些指標。 ### Superblock Object Superblock 物件是由每個檔案系統實作的，並用來儲存描述該特定檔案系統的相關資訊。這個物件通常對應於 **filesystem superblock** 或 **filesystem control block**，這些資訊存儲在磁碟上的特殊 sector 中。對基於虛擬記憶體的檔案系統（例如 sysfs），會即時生成 superblock，並將其存儲在記憶體中。 superblock 物件在 [<linux/fs.h>](https://github.com/torvalds/linux/blob/master/include/linux/fs.h) 中以 `struct super_block` 表示建立、管理與刪除 superblock object 的程式在 [<fs/super.c>](https://github.com/torvalds/linux/blob/master/fs/super.c) 中，由 `alloc_super()` 來進行建立與初始化，在 mount 時，檔案系統會呼叫這個函式來從讀取磁碟的 superblock，並加入 superblock 物件 #### Superblock Operations superblock 物件中最重要的是 `s_op` (只到 superblock operation table 的指標)，在 [<linux/fs.h>](https://github.com/torvalds/linux/blob/master/include/linux/fs.h) 定義為 `super_operations`，其中每個 operation 都會對檔案系統與其 inode 做對應的操作例如一個檔案系統想要寫入它的 superblock，可以呼叫 `write_super()` ```c sb->s_op->write_super(sb); ``` 傳入的 `sb` 為 superblock，需要再傳一的 superblock 的原因是 C 沒有支援物件導向，沒辦法很容易地找到 parent。這些函式都是由 VFS 來呼叫，且在 process context 下執行 ### inode object inode 裡含有核心操作檔案或目錄的所有資訊，也在 [<linux/fs.h>](https://github.com/torvalds/linux/blob/master/include/linux/fs.h) 中以 `struct inode` 表示 inode 用來表示檔案系統中的檔案，但 inode 物件只有在檔案被存取時才會在記憶體中建立 ### Dentry Object 在 `/bin/vi` 路徑中，`bin` 與 `vi` 都是檔案，`bin` 是目錄，而 `vi` 是常規檔案。兩個檔案都有個別的 inode 儘管 VFS 的這種統一性很有用，它經常需要執行特定於目錄的操作，例如路徑名稱查找。路徑名稱查找涉及翻譯路徑中的每個檔案，確保其有效，並依序跟隨到下一個檔案。為了讓這個過程更容易，VFS 引入了 dentry 的概念。dentry 是路徑中的一個具體檔案。以先前的例子來說，`/`、`bin` 和 `vi` 都是 dentry 物件。前兩者是目錄，而最後一個是常規檔案。這是一個重要的點： **dentry 物件不僅是路徑中的目錄，也包括檔案**。解析路徑並遍歷其檔案是一項不簡單的工作，耗時且涉及大量的字串操作，這些操作執行起來昂貴且代碼編寫複雜。而 dentry 物件使整個過程變得更簡單。 VFS 會在需要進行 directory operation 時及時建立 dentry 物件 dentry 物件在 [<linux/dcache.h>](https://github.com/torvalds/linux/blob/master/include/linux/dcache.h) 中定義成 `struct dentry` 與前面兩個物件不同，dentry 物件中不會有任何磁碟上的資料 #### Dentry State 一個有效的 dentry 物件有三中可能的狀態: - used - 對應到一個有效的 inode (`d_inode` 指到對應 inode) - object 有一個或多個 user (`dcount` > 0) - used 狀態的 dentry 被 VFS 使用，而且有指到有效的資料，所以不能被丟棄 - unused - 對應到一個有效的 inode (`d_inode` 指到對應 inode) - VFS 目前沒有在使用這個 dentry object (`dcount` 為 0) - 快取狀態，path name lookup 可以更快的完成，因為沒有在使用所以可以丟棄 - negative - 沒有對應到有效的 inode (`d_inode` 指到 `NULL`)，代表 inode 被刪除或是路徑查找失敗 - 可以加速未來的 lookup，例如 `open()` 一直回傳 `ENOENT` 直到核心建立路徑並檢查路徑後確認檔案存在，因為 lookup 失敗也需要花很多時間，所以把失敗的結果放入快取 - 可以被丟棄 ### file object 用來表示行程打開的檔案，以使用者空間的角度來看 VFS 時最容易想到的就是檔案物件，因為行程直接與檔案互動，而不是 superblock、inode 或 dentry。因為檔案物件包含的資訊（例如存取模式和目前的 offset）是我們熟悉的資料，以及熟悉的操作方式，例如系統呼叫 `read()` 和 `write()`。檔案物件是打開檔案時的 in-memory representation。此物件（不是實體檔案）會在呼叫 `open()` 系統呼叫時創建，並在呼叫 `close()` 系統呼叫時銷毀。所有這些與檔案相關的呼叫，實際上都是在 file operation table 中定義的方法。由於多個行程可以同時打開和操作一個檔案，因此**同一檔案可以存在多個檔案物件**。檔案物件只是表示一個行程對打開檔案的描述。該物件會指向 dentry（而 dentry 又指向 inode），它實際上代表了該檔案。當然，inode 和 dentry 物件是唯一的。與 dentry 物件類似，file object 的資料也不會在磁碟上 ## Data Structures Associated with Filesystems 除了 VFS 的基本物件，核心也使用其他資料結構來管理檔案系統相關的資訊因為 Linux 支援多種檔案系統，所以用 `struct file_system_type` 來表示檔案系統的能力與行為。不管有幾個檔案系統的 instance 被 mount 到系統上 (或沒有被 mount) 每個檔案系統都只有一個 `file_system_type` 在檔案系統被掛載後，會建立 `vfsmount` 結構體，用來表示檔案系統的 instance (掛載點) ## Data Structures Associated with a Process 每個行程會儲存開啟檔案列表、根檔案系統、當前工作目錄、掛載點等等的資訊有三個資料結構將 VFS 層與行程聯繫起來，分別是： - `file_struct` - `fs_struct` - `namespace` ### [file_struct](https://github.com/torvalds/linux/blob/master/include/linux/fdtable.h) 在 process descriptor 中的 `files` 欄位儲存，有行程開啟的檔案與檔案描述符的資訊 ```c /* * Open file table structure */ struct files_struct { /* * read mostly part */ atomic_t count; bool resize_in_progress; wait_queue_head_t resize_wait; struct fdtable __rcu *fdt; struct fdtable fdtab; /* * written part on a separate cache line in SMP */ spinlock_t file_lock ____cacheline_aligned_in_smp; unsigned int next_fd; unsigned long close_on_exec_init[1]; unsigned long open_fds_init[1]; unsigned long full_fds_bits_init[1]; struct file __rcu * fd_array[NR_OPEN_DEFAULT]; }; ``` `fd_array` 會指向 open file object 的陣列，`NR_OPEN_DEFAULT` 與 `BITS_PER_LONG` 相等，所以在 64-bit 上只能儲存 64 個 open file object，如果有行程打開更多的檔案則會分配一個新的陣列，並用 `fdt` 指向該陣列，如果大部分的行程都會超過 `NR_OPEN_DEFAULT` 也可以修改 `NR_OPEN_DEFAULT` 巨集 ### [fs_struct](https://github.com/torvalds/linux/blob/master/include/linux/fs_struct.h) 在 process descriptor 中的 `fs` 欄位儲存，儲存目前工作目錄 (pwd) 與行程的根目錄 ```c struct fs_struct { int users; spinlock_t lock; seqcount_spinlock_t seq; int umask; int in_exec; struct path root, pwd; } __randomize_layout; ``` ### [namespace](https://github.com/torvalds/linux/blob/master/include/linux/mnt_namespace.h) 在 process descriptor 中的 `namespace` 欄位儲存，enable each process to have a unique view of the mounted filesystem hierarchy (not just a unique root directory, but an entirely unique filesystem hierarchy) ```c struct mnt_namespace { struct ns_common ns; struct mount * root; struct rb_root mounts; /* Protected by namespace_sem */ struct user_namespace *user_ns; struct ucounts *ucounts; u64 seq; /* Sequence number to prevent loops */ wait_queue_head_t poll; u64 event; unsigned int nr_mounts; /* # of mounts in the namespace */ unsigned int pending_mounts; struct rb_node mnt_ns_tree_node; /* node in the mnt_ns_tree */ refcount_t passive; /* number references not pinning @mounts */ } __randomize_layout; ``` [Overview of the Linux Virtual File System](https://www.kernel.org/doc/html/latest/filesystems/vfs.html) [Linux 核心設計: 檔案系統概念及實作手法](https://hackmd.io/@sysprog/linux-file-system) # Block I/O Layer Block device 與 character device 的差別是 Block device 會 random 存取固定大小的資料 (稱為 block)，例如硬碟、floppy drivers、flash memory 等，這些裝置都是可以掛載檔案系統的裝置——檔案系統是 block device 的通用語言。 Charater device 是以有順序資料流的形式進行訪問，一個字節接一個字節。例如，serial port 和鍵盤就是字符裝置的例子。如果裝置是以資料流的形式進行訪問，那麼它就是以字符裝置的方式實現的。另一方面，如果裝置以隨機（非順序）的方式進行訪問，那麼它就是一個塊裝置。管理字符裝置只需要知道目前的位置，但 block device 需要可以在任意位置之間來回。核心因為塊裝置的複雜性提供專門的子系統來管理。然而，對塊裝置提供如此廣泛支持的更大原因是，塊裝置為 performance sensitive；充分利用硬碟的每一點性能比從鍵盤中榨取額外的百分之幾的速度更為重要。此外，正如你將看到的，塊裝置的複雜性為此類優化提供了大量空間。這部分核心被稱為 block I/O layer。 Sector 是 block device 的最小可尋址單位，sector 的大小通常是 2 的冪，其中最常見的大小是 512 bytes。sector 大小是裝置的一個物理屬性，裝置無法尋址或操作比 sector 更小的單元，但許多 block device 可以一次操作多個 sector 。 block 為軟體操作的最小邏輯循址單位，是檔案系統的一種抽象，檔案系統只能以 block 的倍數進行訪問。儘管物理裝置可以在 sector 層級進行尋址，但核心對磁碟的所有操作都是基於塊進行的。由於裝置的最小可尋址單元是 sector，因此 block 的大小不能小於 sector ，且必須是 sector 大小的倍數。此外，與硬體和 sector 類似，核心要求 block 的大小是 2 的冪，並且不能超過頁面的大小因此，**block 大小通常是 sector 大小的 2 的冪倍，且不大於頁面的大小**。常見的塊大小有512字節、1KB和4KB。 Sector 有稱為 hard sectors 或 device blocks，而 block 又被稱為 filesystem blocks 或 I/O blocks :::info Summery: Device I/O 的操作以 sector 為單位，而核心使用更高層的概念 block 來操作檔案系統 (built on top of sectors) ::: ## Buffers and Buffer Heads 當 block 被儲存在記憶體中，例如在讀取後或等待寫入時，它會被儲存在一個**緩衝區**中。每個緩衝區對應一個 block。緩衝區用來在記憶體中表示 disk block 的對象。block 是由一個或多個 sector 組成的，但其大小不會超過一個頁面。因此，一個頁面可以在記憶體中容納一個或多個 block。由於核心需要一些控制資訊（例如該緩衝區來自哪個 block 設備以及是哪個具體的 block），每個緩衝區都有一個描述符與其相關聯。這個描述符被稱為 **buffer head**，其類型為 `struct buffer_head`，包含核心操作緩衝區所需的所有資訊，並定義在 [<linux/buffer_head.h>](https://github.com/torvalds/linux/blob/master/include/linux/buffer_head.h) 中。 ```c /* * Historically, a buffer_head was used to map a single block * within a page, and of course as the unit of I/O through the * filesystem and block layers. Nowadays the basic I/O unit * is the bio, and buffer_heads are used for extracting block * mappings (via a get_block_t call), for tracking state within * a folio (via a folio_mapping) and for wrapping bio submission * for backward compatibility reasons (e.g. submit_bh). */ struct buffer_head { unsigned long b_state; /* buffer state bitmap (see above) */ struct buffer_head *b_this_page;/* circular list of page's buffers */ union { struct page *b_page; /* the page this bh is mapped to */ struct folio *b_folio; /* the folio this bh is mapped to */ }; sector_t b_blocknr; /* start block number */ size_t b_size; /* size of mapping */ char *b_data; /* pointer to data within the page */ struct block_device *b_bdev; bh_end_io_t *b_end_io; /* I/O completion */ void *b_private; /* reserved for b_end_io */ struct list_head b_assoc_buffers; /* associated with another mapping */ struct address_space *b_assoc_map; /* mapping this buffer is associated with */ atomic_t b_count; /* users using this buffer_head */ spinlock_t b_uptodate_lock; /* Used by the first bh in a page, to * serialise IO completion of other * buffers in the page */ }; ``` 在操作 buffer head 前需要使用 inline 函式 `get_bh()` 來修改 reference count (`b_count`)，並在結束後呼叫 `put_bh()` 磁碟上與給定緩衝區對應的物理塊是由 `b_bdev` 描述的 block device 上的第 `b_blocknr` 個邏輯塊。記憶體中與給定緩衝區對應的物理頁面由 `b_page` 指向的頁面表示。更具體地說，`b_data` 是直接指向塊的指針（該塊位於 `b_page` 的某個位置），其長度為 `b_size` byte。因此，這個塊在記憶體中的位置從 `b_data` 開始，到地址 (`b_data` + `b_size`) 結束。 buffer head 的作用是**描述磁碟上的塊與記憶體中物理緩衝區（即特定頁面上的一個字節序列）之間的映射**。 ## bio Structure 核心中 block I/O 的基本容器是 `bio` 結構，定義在 [<linux/blk_type.h>](https://github.com/torvalds/linux/blob/master/include/linux/blk_types.h) 中。這個結構體表示**正在進行的 block I/O 操作**，作為 segment 串列來管理。segment 是**記憶體中連續的緩衝區塊**，因此，**單個緩衝區不必在記憶體中是連續的**。通過允許緩衝區以 block 的形式描述，`bio` 結構使核心**能夠在記憶體的多個位置對一個緩衝區進行 block I/O 操作**。這種類型的向量 I/O 被稱為 scatter-gather I/O。 `bio` 結構體主要的目的是用來表示正在執行的 block I/O 操作，其中最重要的欄位就是 `bi_io_vec`、`bi_vcnt`、`bi_idx` ![image](https://hackmd.io/_uploads/BJFNhpkJkg.png) ### I/O vectors `bi_io_vec` 欄位會指向存有 `bio_vec` 結構體的陣列，這些結構體作為此特定 block I/O 操作中的 segment 列表。每個 `bio_vec` 都被視為一個向量，格式為 `<page, offset, len>` - page：表示該 segment 所在的物理頁面。 - offset：block 在頁面內的偏移量。 - len：從該偏移量開始的 block 的長度。每個向量描述一個具體的 segment，存有這些向量的陣列 (`bi_io_vec`) 描述了整個緩衝區。`bio_vec` 結構體定義在 [<linux/bvec.h>](https://github.com/torvalds/linux/blob/master/include/linux/bvec.h) 中 ```c /** * struct bio_vec - a contiguous range of physical memory addresses * @bv_page: First page associated with the address range. * @bv_len: Number of bytes in the address range. * @bv_offset: Start of the address range relative to the start of @bv_page. * * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len: * * nth_page(@bv_page, n) == @bv_page + n * * This holds because page_is_mergeable() checks the above property. */ struct bio_vec { struct page *bv_page; unsigned int bv_len; unsigned int bv_offset; }; ``` 在每個給定的 block I/O 操作中，`bi_vcnt` 表示 `bio_vec` 陣列中的向量數量，並從 `bi_io_vec` 開始。隨著 block I/O 操作的進行，`bi_idx` 字段用來指向當前陣列中的索引位置。總結來說，每個 block I/O 請求由一個 `bio` 結構體來表示。每個請求由一個或多個 block 組成，這些 block 存儲在 `bio_vec` 結構體陣列中，並通過 `bi_io_vec` 指向最初的 segment，`bi_idx` 指向當前進行的 segment。在 I/O 操作過程中，這些向量定義了多段 block 的記憶體位置和大小。這些結構體作為向量，描述了每個 segment 在物理記憶體頁中的位置。I/O 操作的第一個段由 `b_io_vec` 指向，每個後續的 segment 依次跟隨，總共由 `bi_vcnt` 個 segment 組成的列表。當 block I/O layer 提交請求中的 block 時，`bi_idx` 會更新，指向當前正在處理的 segment，這有助於 block I/O 層追踪部分完成的 block I/O 操作。更重要的用途是**允許對 `bio` 結構進行分割**。這個特性允許實現 RAID 的驅動程序，將最初打算針對單一設備的 `bio` 結構分割到多個磁碟之間。例如，RAID 驅動程序可以複製 `bio` 結構，並更新 **bi_idx** 來指示每個磁碟應從哪裡開始其操作。這樣，RAID 裝置能夠有效地將單個 block I/O 請求分配給多個磁碟，以提高性能和可靠性。 `bio` 使用 `bi_cnt` 來管理 usage count，當 `bi_cnt` 為零時結構就會被摧毀，並釋放記憶體 ## Request Queue Block device 維護 request queues (struct request_queue) 來保存 pending 的 block I/O 請求，定義在 [<linux/blkdev.h>](https://github.com/torvalds/linux/blob/master/include/linux/blkdev.h) 中，利用雙向鏈結串列來管理請求，請求由核心中的更高層的程式 (例如檔案系統) 加入佇列，block device driver 會從佇列中取出請求並給對應的 block device 處理。每個請求由 `struct request` 表示，因為**單個請求可能會在不同的連續 disk block 上操作所以請求可由一個或多個 `bio` 組成** :::info blocks on disk 必須相鄰，但在記憶體中的 block 可以不相鄰，每個 `bio` 結構體可以描述多個 segment (segment 是記憶體中連續的 block) ::: ## I/O Schedulers 如果按照核心發出 block I/O 請求的順序，立即將它們發送到 block device，會導致性能不佳。在現代計算機中，disk seek 是最慢的操作之一。每次 seek (即將硬盤的 disk head 定位到特定 block 的位置) 都需要幾毫秒的時間。所以減少 seek 次數對系統效能至關重要。因此，核心不會按照接收到的順序或立即將 block I/O 請求發送給磁碟。相反，核心會執行稱 merging 和 sorting 的操作，以提高系統整體效能。執行這些操作的核心子系統稱為 I/O Scheduler。 I/O 排程器負責在系統中的待處理 block I/O 請求之間分配磁碟 I/O 資源。它通過對請求隊列中的請求進行合併和排序來實現這一點。需要注意的是，I/O 排程器和行程排程器是不同的（行程排程器負責在系統上的行程之間分配處理器資源）。雖然兩者性質相似，都是在多個對象之間虛擬化資源，但它們的目標不同。行程排程器把處理器虛擬化，並在系統上的多個行程之間共享處理器，這為多任務和分時操作系統（如 Unix）提供了虛擬化的表象。而 I/O 排程器則虛擬化 block 裝置，並在多個未完成的 block 請求之間共享它們，這樣可以減少磁碟 seek 次數，並確保磁碟效能的最佳化。 ### The Job of an I/O Scheduler 目的為增加 **global throughput** (還是會有不公平狀況)，主要利用合併與排序來減少 seek 的時間合併: 將多個請求 (存取相鄰 sector) 合併成一個，讓多個請求可以省略 seeking 步驟排序: 若兩個請求沒辦法合併，但要存取的 sector 很接近，則可以把他們排在一起，減少 seek 的時間 (讓 disk sick 直線移動，不會來回)，而 kernel 中的 I/O 排程器被稱為 **elevators** ### Linux Elevator 核心中第一個 I/O 排程器，以 Linus 命名，在 2.6 後被取代會進行 front 與 back merge - front merge: the new request immediately proceeds an existing request - back merge: the new request immediately precedes an existing request 但通常檔案的 sector number 通常為遞增，且 I/O 通常是從開始讀到結束並且沒有倒著讀取，所以 front merge 的數量比 back merge 少很多如果合併失敗，會尋找佇列中可能的插入點 (較接近的 sector)，如果沒有的話就會插入到佇列的結尾，但如果已經在佇列中的請求與要插入的請求間隔超過 threshold，則新的請求也會被加入到結尾，避免 starving，但這個版本的 `age` 檢查效率不高，而且也不會嘗試在期間內完成請求，只會中止插入的排序，所以 starvation 還是可能會發生 ### Deadline I/O Scheduler 目標為解決 Linus Elevator 的 starvation 問題，在希望最小化 seek 的情況下，在某一區塊有頻繁的 I/O 請求很容易會 starve 另一區塊的請求，並且減少 read starvation 還有另一個問題是 **writes starving reads**，寫入的操作通常是非同步的，在提交請求後就可以做其他事情，而讀取操作是同步的，需要等待請求完成才能繼續執行，所以讀取操作的延遲對系統表現就變得非常重要 (但寫入的延遲也不能過久，否則資料可能過時或緩存過大) 而且通常讀取的請求會互相依賴，例如要讀取一個大檔案，每次讀取一小塊 buffered chunks，應用會在前一個 chunk 從磁碟讀取完後才讀取下一個 chunk，所以如果每次讀取都會被 starve 一次，那總共的 delay 就會非常高而非同步或是沒有與其他讀取依賴的讀取請求就可以忍受較大的延遲實現公平與最小化 seek 是有衝突的，所以要做出一些妥協在 Deadlline I/O Scheduler 中，每個請求都會有過期時間 (預設讀取為 500 ms，寫入 5s)，也會維護一個 request queue 稱作 **sorted queue**，與 Linus Elevator 不同的是，除了 request queue 之外，讀取請求會被排序到 read FIFO queue，寫入請求會被排序到 write FIFO queue，request queue 是由 sector 排序，而在 FIFO 佇列中是以插入時間排序在沒有請求過期時，還是會從 sorted queue 的開頭處理請求，但如果過期了就會從 FIFO 佇列處理 ![image](https://hackmd.io/_uploads/rkIiRClk1l.png) Deadline I/O 排程器不會保證請求的延遲時間，但因為讀取請求的過期時間較短，所以也可以保證寫入操作不會 starve 讀取操作 ### Anticipatory I/O Scheduler 在使用 Deadline I/O 排程器的情況下，如果系統正在執行大量的寫入，在讀取請求過期後，系統就會優先執行讀取，並 seek 到讀取位置，完成後再 seek 並繼續寫入，期間可能發生多次這種狀況，導致 global disk throughput 降低 Anticipatory I/O Scheduler 的目的就是在維持 read latency 的情況下提升 global throughput 與 Deadline 一樣也有三個佇列與過期時間，主要是新增了 **anticipation heuristic**，在還是會優先去處理讀取操作，但在完成讀取後，不會馬上 seek 回原本的操作，而是等待一段時間 (預設是 6 ms)，在這段期間很可能會有其他的讀取需要處理，如果在附近的話就會直接去處理，等待時間結束後再回到原位，減少來回的 seek time，這段等待的時間就叫做 **anticipation** ### The Complete Fair Queuing (CFQ) I/O Scheduler 為了特別的 workload 設計，但在很多情況都可以有不錯的表現每個行程都會有一個佇列來維護 I/O 請求，佇列也會各自進行合併與排序，並用 Round Robin 來處理 (每次處理幾個請求，預設為 4)，讓行程間的 I/O 保持公平，原本的情境是在多媒體，但在不同場景表現也不錯 ### The Noop I/O Scheduler Noop: No Operation Performed 不進行排序或其他 seek-prevention 的動作，但會進行合併，新的請求會和任何相鄰的請求合併在 block device 真的是 random-access 時使用，例如 flash memory cards，如果不需要優化 seeking 或 sorting 則 Noop 是個好選擇其中核心預設是使用 CFQ # Process Address Space 核心除了管理自己的物理記憶體之外，還需要管理使用者空間行程的 address space，通常一個行程的 address space 會比物理記憶體大很多 (vm) ## Address Space 不同行程有各自的 address space，所以相同的記憶體位置在不同行程通常是完全沒有關聯的，但行程之間也可以共享 address space，這時就會被稱為執行緒 Adderss space 中不是所有位置都可以存取，通常會有一個 interval 來表示合法的地址，稱作 **memory areas**，行程可以透過核心來動態增加或減少 memory areas 行程只能在有效的 memory area 存取記憶體，memory area 會有存取權，例如可讀、可寫、可執行，如果違反就會發生 segmentation fault Memory areas: - text section: 映射到可執行檔程式碼的記憶體空間 - data section: 映射到已初始化的全域變數的記憶體空間 - bss section: 映射到 zero page 的記憶體空間，其中存還沒初始化的變數 (但 C standard 規定把還沒初始化的變數給定預設值，通常是 0，所以核心會把未初始化變數放到 zero page，避免浪費空間) - 給行程的 user-space stack 的 zero page - shared library 中的 text, data, bss 區域，例如 C library - memory mapped files - Shared memory segments - Anonymous memory mappings 如 `malloc()` 分配的記憶體 (sbrk，也有 mmap 版本的) ## Memory Descriptor 核心利用 **memory descriptor** 來描述行程的 address space，稱為 `struct mm_struct`，定義在 [<linux/mm_types.h>](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h) 中。其中 `mm_user` 儲存使用這個 address space 的行程數量，而 `mm_count` 用來記錄 `mm_struct` 的 reference count，所有的 `mm_user` 只會算一個 `mm_count`，在 `mm_count` 為 0 時 `mm_struct` 就可能被釋放 > The kernel needs to increment mm_count whenever it is using the mm_struct. For example, the VM subsystem increments mm_count by one if it is inspecting the address space for page ins or outs. As another example, the kernel will bump mm_count when a kernel thread is lazily executing in the process address space of the last-running user thread. > [Why are we using two variables mm_users and mm_count in mm_struct?](https://www.quora.com/Linux-Kernel-Why-are-we-using-two-variables-mm_users-and-mm_count-in-mm_struct) `mmap` 與 `mm_rb` 是兩個不同的資料結構，但都是儲存這個 address space 的所有 memory areas，`mmap` 是 linked list，而 `mm_rb` 是紅黑樹，串列是為了更方便的走訪所有的元素，而 RB-tree 用來尋找特定元素所有的 `mm_struct` 由雙向鏈結串列 `mmlist` 來維護，`init_mm` 會儲存串列的開頭 memory descriptor，也就是 init process 的 address space，這個串列由 `mmlist_lock` 來保護 ### Allocating a Memory Descriptor 行程對應的 memory descriptor 會儲存在 `task_struct` 的 `mm` 欄位，而 `fork()` 時會呼叫 `copy_mm()` 來複製 parent 的 memory descriptor，`mm_struct` 利用 `allocate_mm()` 來分配，搭配 `mm_cachep` slab cache。在 `clone()` 時搭配 `CLONE_VM` 就不會呼叫 `allocate_mm()` 來複製 memory descriptor，而是把行程的 `mm` 欄位指向 parent 的 memory descriptor ```c tsk->mm = current->mm; ``` ## mm_struct and Kernel Threads Kernel thread 不會有 process address space，所以沒有對應的 memory descriptor，因此 `mm` 欄位會指向 NULL。kernel thread 的定義為沒有 user context 的行程，可以沒有 address space 是因為 kernel thread 不會存取任何使用者空間的記憶體，所以不會有自己的 memory descriptor 跟 page table。但 kernel thread 有時候還是會需要 page table 或存取核心記憶體等等，為了讓 kernel thread 可以存取這寫資源又不想浪費記憶體在 memory descriptor 與 page table，或花費時間在 kernel thread 開始執行時切換 address space，kernel thread 會繼續使用上一個在執行的行程的 memory descriptor。在行程被排程時，`mm` 會被設定，`active_mm` 也會被更新成新的 address space，而 kernel thread 被排程時 `mm` 欄位會設定為 NULL，而 `active_mm` 會保留，讓 kernel thread 可以使用 page table。 > [How kernel threaduse memory descriptor(mm_struct) of last ran process in Linux?](https://stackoverflow.com/questions/27518369/how-kernel-threaduse-memory-descriptormm-struct-of-last-ran-process-in-linux) ## Virtual Memory Areas 核心使用 `vm_area_struct` 來表示 memory areas，在核心中 memory area 也被稱為 virtual memory areas (VMAs) 一個 `vm_area_struct` 描述一個 address space 中連續的一段區間的 memory area，核心把每個 memory area 當成一個獨立的物件，每個 memory area 都有自己的特性，例如存取權以及對應的操作，定義在 [<linux/mm_type.h>](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h) 中因為每個 memory descriptor 都會與行程的 address space 的區間對應，所以 `vm_start` 中儲存區間的位址 (lowest) 起點，`vm_end` 中儲存區間最後 (highest) 位址後的第一個 byte，所以 `vm_end` - `vm_start` 就會是區間的大小 (byte 為單位)，不同 memory area 間的區間不能重疊 `vm_mm` 欄位會指向這個 VMA 對應的 `mm_struct` (memory descriptor)，不同 `mm_struct` 不會共享 VMA，所以就算兩個行程把相同的檔案映射到他們的 address space 他們的 `vm_area_struct` 也是獨立的，而相同 address space 間的 thread 會共享所有的 `vm_area_struct` ### VMA flags `vm_flags` 欄位的 bit flag 在 [<linux/mm.h>](https://github.com/torvalds/linux/blob/master/include/linux/mm.h) 中定義，例如行程的 object code 在映射時會使用 `VM_READ`、`VM_EXEC` 並且不會有 `VM_WRITE` 而 `VM_SHARE` flag 設置時代表有不只一個 process 對這個 memory area 進行映射 (稱為 **shared mapping**)，如果沒有設置則稱為 **private mapping** `VM_IO` 表示這個 memory area 映射到裝置的 I/O 空間，通常是在 device driver 的 I/O space 呼叫 `mmap()` 時設定 > It specifies, among other things, that the memory area must not be included in any process's core dump `VM_RESERVED` 表示這個記憶體區塊不能被 swapped out，也會在 device driver mapping 中使用 `VM_SEQ_READ` 告訴核心應用正在這個 VMA 映射進行 sequential read (linear and contiguous)，可以增加備份檔案的 read-ahead 來進行優化，而 `VM_RAND_READ` 則相反，讓核心減少或關閉 read-ahead。利用 `madvise()` 系統呼叫來設定 ### VMA Operations `vm_area_struct` 中的 `vm_ops` 會指向 memory area 的 operation table，核心可以利用這些函式來操作 VMA，定義在 [<linux/mm.h>](https://github.com/torvalds/linux/blob/master/include/linux/mm.h) 中 ## Manipulating Memory Areas [<linux/mm.h>](https://github.com/torvalds/linux/blob/master/include/linux/mm.h) 中有需多函式可以讓核心在 memory area 進行操作，這些操作來自 `mmap()` 具體實作在 [mm/mmap.c](https://github.com/torvalds/linux/blob/master/mm/mmap.c) ### find_vma() 用來尋找 VMA 屬於哪個 memory area ```c struct vm_area_struct * find_vma(struct mm_struct *mm, unsigned long addr); ``` 會尋找 memory area `vm_end` 大於 `addr` 的 memory area，因為條件只有大於，所以可能回傳的 VMA 起點位址會大於 `addr`，表示 `addr` 不一定會在回傳的 VMA 區間中 `find_vma()` 的結果會被佔存在 memory descriptor 的 `mmap_cache` 欄位，如果沒有在快取中則會用紅黑樹結構來尋找。 ### find_vma_prev() 與 `find_vma()` 相同，但回傳的是 `addr` 之前的第一個 VMA ### find_vma_intersection() 回傳第一個與傳入地址間隔重疊的 VMA ## mmap() and do_map(): Creating an Address Interval 核心使用 `do_mmap()` 來擴展現有或建立新的 linear address interval，如果已經存在的 address interval 相鄰要建立的 address interval 且 permission 都相同就會合併在一起，如果沒有就會建立一個新的 VMA ```c unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, struct list_head *uf) ``` 若 `file` 為 NULL 稱為 **anonymous mapping**，有 `file` 與 `offset` 則稱為 `file-backed mapping` 若不能合併需要建立 `vm_area_struct` 會透過 `vm_area_cachep` slab cache 來分配，並把新的 memory area 加入 address space 的 linked list 與 rb-tree 中，並更新 memory descriptor 的 `total_vm` 欄位 ## Page Tables 核心中的分頁表為 3 層，就算在沒有支援 3 層分頁表的硬體也是 ![image](https://hackmd.io/_uploads/By7S1hzkyl.png) # Page Cahe and Page Writeback [Page Cache](https://www.cs.unc.edu/~porter/courses/cse506/s16/slides/page-cache.pdf) 快取 physical memory 中的資料來**減少 disk I/O** (TLB 是減少地址轉換) page 來自讀取或寫入 filesystem files, block device files, and memory-mapped files ## Approacheds to Caching page cache 在 RAM 中儲存 physical pages，內容對應到磁碟上的 physical block，page cache 的大小是動態的，依照系統的記憶體使用量調整，被 cached 的儲存裝置稱為 backing store > Backing store is typically a part of the hard disk that is used for a paging system to store data that is not currently in main memory 在核心開始執行 `read()` 操作時會先檢查資料有沒有在 page cache 裡，若有則不需要存取磁碟，若沒有則核心需要排程 block I/O 操作來從磁碟讀取資料 ## Write Cache 在行程想要寫入磁碟時，cache 有三種策略: - no write: 不快取寫入的操作，直接寫入磁碟中。通常不會使用，因為沒辦法快取寫入操作，且需要 invalidation - write-through: 同時更新快取與磁碟內容，可以保持 cache coerent，且不需要 invalidation，實作較簡單 - write-back: Linux 使用，只在 page cache 進行寫入，backing store 不會馬上執行，page cache 中被寫入的 page 會被標記程 dirty 並加入到 dirty list 中，一段時間後會把 dirty list 裡的 page 寫回磁碟。好處是可以合併寫入操作 (coalescing)，但實作較複雜。 ### Cache Eviction 從快取刪除資料的操作稱為 **cache eviction**，核心會選擇 clean (not dirty) 頁面並替換，若沒有 clean 頁面則需要 writeback 來產生更多的 clean page #### Least Recently Used 選擇 least recently used 頁面做替換但在很多檔案只被存取一次就不會在被存取時效果不佳，這些資料被放在 LRU list 的最前面顯然不是很好，我們沒辦法知道未來會不會被使用，但可以知道被存取了多少次 #### Two-List Strategy Linux 會維護兩個串列: **active list** 與 **inactive list**，在 active list 上的頁面被視為**熱頁面**，因此不會被驅逐，而在 inactive list 中的頁面就可以被驅逐。頁面只有**在 inactive list 中被存取時才會被放入 active list** 兩個串列都以 pesudo-LRU 來維護，從結尾新增，並從開頭刪除 Two-List 也可稱為 LRU/2，也可為 n-list ## address_space Object Page cache 中的頁面可以有多個非連續的 physical disk blocks > For example, a physical page is 4KB in size on the x86 architecture, whereas a disk block on many filesystems can be as small as 512 bytes. Therefore, 8 blocks might fit in a single page. The blocks need not be contiguous because the files might be laid out all over the disk. 由於構成每個頁面的 block 是非連續的，因此檢查頁面快取中是否已緩存特定資料變得困難。因此，**無法僅使用設備名稱和 block 號來索引頁面快取中的資料**，儘管這本來是最簡單的解決方案。此外，Linux 的頁面快取相當通用，能夠緩存多種類型的頁面。Linux 的頁面快取旨在緩存任何基於頁面的對象，這包括多種檔案和記憶體映射形式。雖然 Linux 的頁面快取可以通過擴展 `inode` 結構來支持頁面 I/O 操作，但這樣的選擇會將頁面快取限制在檔案層面上。為了維持一個通用的頁面快取——即不依賴於物理檔案或 `inode` 結構的快取——Linux 使用一個 object 來管理頁面快取中的條目和頁面 I/O 操作。這個對象就是 `address_space` 結構。可以將 `address_space` 視為與 `vm_area_struct` 結構相對應的物理對象。比如，一個檔案可能會被表示為 10 個 `vm_area_struct` 結構（假設有 5 個行程各自對它進行了兩次 mmap()），但這個檔案只有一個 `address_space` 結構——就像一個檔案可能有多個虛擬地址，但在物理記憶體中只存在一次。和 Linux 核心的許多命名類似，`address_space` 的名稱並不完全精確，或許 **page_cache_entity** 或 **physical_pages_of_a_file** 會更適合。定義在 [<linux/fs.h>](https://github.com/torvalds/linux/blob/master/include/linux/fs.h) 中 ```c struct address_space { struct inode *host; struct xarray i_pages; struct rw_semaphore invalidate_lock; gfp_t gfp_mask; atomic_t i_mmap_writable; #ifdef CONFIG_READ_ONLY_THP_FOR_FS /* number of thp, only for non-shmem files */ atomic_t nr_thps; #endif struct rb_root_cached i_mmap; unsigned long nrpages; pgoff_t writeback_index; const struct address_space_operations *a_ops; unsigned long flags; errseq_t wb_err; spinlock_t i_private_lock; struct list_head i_private_list; struct rw_semaphore i_mmap_rwsem; void * i_private_data; } __attribute__((aligned(sizeof(long)))) __randomize_layout; /* * On most architectures that alignment is already the case; but * must be enforced here for CRIS, to let the least significant bit * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON. */ ``` `i_mmap`: 儲存這個 address space 中所有共享與私有的映射的 priority search tree，因為一個被快取的檔案對應一個 `address space` 結構體，可以有多個對應的 `vm_area_struct`，所以一個 physical page 可以對應到多個 virtual pages，而 `i_mmap` 可以讓核心有效率的找到這個快取檔案對應的映射 > priority search tree 由 heap 與 radix tree 組成 address space 中總共有 `nrpages` 個 page entries `address_space` 會跟一些 kernel object 有關聯，`host` 儲存 `address_space` 的擁有者，不是 inode 就是 block device ### address_space Operations `a_ops` 會指向 address space operation table [<pagemap.h>](https://elixir.bootlin.com/linux/v6.11.3/source/include/linux/pagemap.h#L713) ```c /** * find_get_page - find and get a page reference * @mapping: the address_space to search * @offset: the page index * * Looks up the page cache slot at @mapping & @offset. If there is a * page cache page, it is returned with an increased refcount. * * Otherwise, %NULL is returned. */ static inline struct page *find_get_page(struct address_space *mapping, pgoff_t offset) { return pagecache_get_page(mapping, offset, 0, 0); } ``` `mapping` 就是 `address_space`!!! 回傳的 `page` 就可以使用 `mapping->a_ops->readpage(file, page)` 進行讀取 > 現在已經變成 read_folio() :::danger TODO: folio ::: ### Radix Tree 在 page I/O 操作之前，檢察頁面有沒有在頁面快取中，所以需要快速完成，不然快取可能不會變快 page cache 透過 `address_space` object 加上 offset 來搜尋，每個 `address_space` 有一個自己的 radix tree ## Buffer Cache 單個 disk block 透過 block I/O 緩衝區與頁面快取聯繫在一起。緩衝區是單個物理磁碟 block 的記憶體表示。緩衝區充當描述符，將記憶體中的頁面映射到磁碟 block 。因此，頁面快取在 block I/O 操作期間也通過緩存磁碟 block 並延遲緩衝 block I/O 操作來減少磁碟訪問。這種緩存通常稱為 **buffer cache**，儘管在實作上，它不是一個獨立的快取，**而是頁面快取的一部分**。 block I/O 操作一次只能操作單個磁碟 block。常見的 block I/O 操作包括讀寫 inode。核心提供了 `bread()` 函數來執行從磁碟中低層級讀取單個 block 的操作。通過緩衝區，磁碟 block 被映射到其相關的記憶體頁面，並緩存在頁面快取中。緩衝快取和頁面快取並非一開始就是統一的。將它們統一起來是 Linux 2.4 核心中的一個重要特性。在早期的核心中，存在兩個單獨的磁碟快取：頁面快取和緩衝快取。前者緩存頁面，後者緩存緩衝區。這兩個快取並未統一：一個磁碟 block 可能同時存在於兩個快取中，這導致在同步兩個快取副本時浪費了精力，並且重複緩存的項目浪費了記憶體。如今只有一個磁碟快取，即頁面快取。不過，核心仍然需要使用緩衝區來表示記憶體中的磁碟 block ，而緩衝區正好描述了 block 映射到頁面上的過程，這些頁面存在於頁面快取中。 ## Flusher Thread 寫入操作會被延遲執行在頁面快取中，當頁面快取中的資料比 backing store 還新，我們稱這些資料為 dirty。在記憶體中積累的髒頁面最終需要寫回磁碟。髒頁面的寫回會發生在以下三種情況下： - 當可用記憶體低於指定的門檻時，核心會將髒資料寫回磁碟以釋放記憶體，因為只有乾淨的記憶體可以被驅逐出記憶體。當資料變乾淨後，核心可以將其從快取中驅逐，進而縮小快取，釋放更多記憶體。 - 當髒資料超過一定時間閾值時，會寫回到磁碟，以確保髒資料不會無限期地保持髒的狀態。 - 當用戶行程調用 `sync()` 和 `fsync()` 系統調用時，核心會按需進行資料寫回操作。這三項工作的目標不同。事實上，在舊的核心版本中，兩個獨立的核心執行緒分別負責這些工作。然而，從 Linux 2.6 開始，一組核心執行緒，稱為 flusher 執行緒，負責執行這三項工作。首先，當系統中的可用記憶體縮小到指定的水平以下時，flusher 執行緒需要將髒資料刷新回磁碟。這項背景寫回操作的目標是當可用物理記憶體較低時，通過釋放被髒頁面佔用的記憶體來回收記憶體。該過程開始的記憶體水平由 `dirty_background_ratio` sysctl 參數設定。當可用記憶體低於此閾值時，核心會調用 `wakeup_flusher_threads()` 喚醒一個或多個 flusher 執行緒，並讓它們運行 `bdi_writeback_all()` 函數來開始將髒頁面寫回磁碟。此函數接收要寫回的頁面數量作為參數，並持續寫出資料直到滿足以下兩個條件： - 已經寫出了指定的最小數量的頁面。 - 可用記憶體數量高於 `dirty_background_ratio` 閾值。這兩個條件保證 flusher thread 在缺少記憶體空間時進行緩解另外，flusher thread 會被定期喚醒，並把 dirty page 寫回磁碟，保證 dirty page 不會永遠留在記憶體中，在系統 failure 時，因為 memory 是 volatile，所以會失去 dirty block，所以定期同步資料非常重要 ## Laptop mode Laptop mode 是一種專門針對頁面寫回策略的調整，目的為通過最小化硬碟活動來優化電池續航時間，使硬碟保持關閉狀態盡可能長的時間。這個模式可以通過 `/proc/sys/vm/laptop_mode` 檔案進行配置。 Laptop mode 對頁面寫回行為做了一項改變。除了當髒頁面變得過舊時執行寫回外，flusher 執行緒還會搭載其他物理磁碟的 I/O 操作，將所有的髒緩衝區一起寫回磁碟。通過這種方式，頁面寫回會利用磁碟剛好已經被啟動的時機，確保以後不會因為寫回操作再次啟動磁碟。這種行為改變在 `dirty_expire_interval` 和 `dirty_writeback_interva`l 設置為較大值（例如 10 分鐘）時最有意義。由於寫回延遲較長，磁碟不會頻繁啟動，而當磁碟被啟動時，Laptop mode **確保充分利用這個機會**。由於關閉硬碟是節省電力的重要方式之一，Laptop mode 可以顯著延長筆記型電腦的電池續航時間。缺點是，如果**系統崩潰或發生其他故障，可能會丟失大量數據**。 # Devices and Modules Linux 中有三種裝置類型 - Block device (blkdevs) - Character device (cdevs) - Network device 使用者空間的程式利用 device node 直接跟 character device 作互動，而 network device 通過 **socket API**，而不是 device node。不是所有的 driver 都會對應到硬體裝置，有些是虛擬的，稱為 **pseudo devices** # Debugging - Reliably reproducible bug - Kernel version that bug first appeared ## Debugging by Printing `printk()` 可以在任何時候使用，不管是 interrupt 或 process context，只有在系統啟動時 console 初始化之前不能使用 (可以使用 `early_printk()`，但有些架構可能沒有實作) ## Loglevels [Message logging with printk](https://www.kernel.org/doc/html/next/core-api/printk-basics.html) All `printk()` messages are printed to the kernel log buffer, which is a ring buffer exported to userspace through `/dev/kmsg`. The usual way to read it is using dmesg. ### Log Buffer Kernel messages 存在 `LOG_BUF_LEN` 大小的環狀 buffer 中，若 buffer 滿了又呼叫一次 `printk()` 則會覆蓋掉舊的資訊使用 circular buffer 的優點: - 容易同時寫入與讀取，interrupt context 也可以使用 - 簡易的 log 管理，如果資訊太多就覆蓋缺點就是可能會失去訊息 ### syslogd and klogd 使用者空間的 klogd daemon 會從 log buffer 收取 kernel message，並通過 `syslogd` daemon 送到 system log。預設使用 `/proc`，`klogd` 會 block 直到有新的 kernel message `syslogd` daemon 會把接收到的訊息新增到檔案，預設為 `/var/log/messages` ## Oops 核心告訴使用者 something bad happened 的方法，由於核心是整個系統的監督者，它無法像處理使用者空間問題時那樣簡單地修復或直接終止。因此，核心會發出一個 `oops`。這涉及在控制台上顯示錯誤訊息、dumping 暫存器內容，並提供 backtrace。核心中的失敗很難處理，因此核心必須通過多重操作來發出 `oops` 並進行清理。通常，在發生 `oops` 之後，核心會處於 inconsistant state。例如，核心可能在處理重要數據的過程中發生了 `oops`。它可能持有一個鎖，或者正在與硬體互動。核心必須優雅地退出當前的上下文並嘗試恢復對系統的控制。然而，在許多情況下，這是不可能的。如果 `oops` 發生在 interrupt context 中，核心無法繼續，將會觸發 `panic`。核心 `panic` 會立即導致系統停止。如果 `oops` 發生在閒置任務（pid 為 0）或 init 任務（pid 為 1）中，結果也會是 `panic`，因為核心無法在沒有這些重要行程的情況下繼續運行。但如果 `oops` 發生在其他行程中，核心會終止該行程並嘗試繼續執行。 `Oops` 可能由多種原因引起，包括記憶體訪問違規或非法指令。 ## Asserting Bugs and Dumping Information `BUG()` 與 `BUG_ON()` 執行時會觸發 oops，並導致 stack trace 與 error messgage dump 到核心，大部分的架構把 `BUG()` 與 `BUG_ON()` 定義成非法只應，就可以造成 oops 用於 assertion ```c if (bad_thing) BUG(); ``` 或這樣更好 ```c BUG_ON(bad_thing); ``` `BUG_ON` 較具可讀性，且 assertion 有 `unlikely()` 可以有選項來把 `BUG_ON()` 都編譯掉，減少空間用在 embedded kernel，所以 `BUG_ON()` 不能有任何 side effect `BUILD_BUG_ON()` 目的相同，但是在編譯時檢查最嚴重的錯誤是 panic，呼叫 `panic()` 會列印錯誤訊息，並終止核心 ## SysRq SysRq (system request) 是鍵盤按鍵，可以在任何時候與核心溝通，所以可以在 dying system 做一些任務，但如果系統在 deadlock 狀態可能不會回應 sysRq，或是處理過程中失敗 ## gdb ``` $ gdb vmlinux /proc/kcore ``` `vmlinux` 檔案是存在 root 的 build 目錄的 uncompressed kernel image 可以列印變數的值與 disassemble 函式，但不能修改核心資料，且不能一步驟執行核心程式碼，也不能設定 breakpoint ## kgdb 可以使用 gdb 的完整功能，利用 serial line 遠端 debug 核心需要兩台電腦，一台利用 kgdb 執行核心，另一台使用 serial line 的 gdb 來 debug bug in which kernel version binary search at revision level ``` $ git bisect start ``` # Portability 如果需要不同實作來支援不同架構，則會實作不同的函式並在需要時呼叫，例如排程器，大部分的排程器都是 architecture independent，例如切換處理器狀態、swapping out address space 等等。與架構有關的程式會在 `arch/architecture` 中 ## Word size and Data Types 處理器一次可以處理的資料大小稱為 **word**，`n-bits` machine 通常代表的是 word size 處理器的 general-purpose register (GPRs) 的數量與 word size 相同，架構的 component 的 width 至少也會是 word size，例如 memory bus 在 Linux 中，virtual memory address space 與 word size 相同，但 physical address space 通常會更小，而 pointer 與 C type `long` 的大小也會與 word size 相同，而在 64-bit 架構中，`int` 為 32 bit，`long` 為 64 bit，32 bit 架構 `int` 與 `long` 都是 32 bit。所以在寫程式的時候不能預設 standard C type 的大小 Rules: - 根據 ANSI C，`char` 永遠是 1 byte - 雖然沒有規定，但在 Linux 支援的所有架構 `int` 都是 32 bit，`short` 都為 16 bit - 不能假設指標與 `long` 的大小，可能是 64 也可能是 32 ## Opaque Types Opaque Data Types 不會揭露其內部格式或結構，在 C 語言中，它們基本上是一種“黑盒子”。語言本身對它們的支援有限，開發者通常會聲明一個 typedef，將其稱為不透明類型，並期望沒有人會將它強制轉換回標準 C 類型。這些類型的使用通常通過開發者創建的一套特殊接口來進行。一個典型的不透明類型例子是 `pid_t`，用於儲存 process identification number。實際上，這個類型的大小並沒有明確揭露，儘管任何人都可以偷看並發現它實際上是一個 int。如果程式碼沒有明確依賴這個類型的大小，那麼改變其大小也不會帶來太多問題。事實上，這在舊的 Unix 系統中曾經發生過：`pid_t` 最初被宣告為 `short`。另一個例子是 `atomic_t`，它持有一個可以以 atomic 方式操作的整數值。雖然這個類型實際上是 int，但使用不透明類型有助於**確保該資料僅能透過特定的 atomic 操作函數進行操作**。`atomic_t` 的不透明性還有助於隱藏它的實際大小，這並非總是 32 位，因為在某些 32 位 SPARC 架構上存在限制。其他內核中的不透明類型還包括 `dev_t`（設備號類型）、`gid_t`（群組ID類型）和 `uid_t`（用戶ID類型）。在處理不透明類型時，需遵守以下規則： - **不要假設類型的大小**：在某些系統中可能是 32 位，而在其他系統中則可能是 64 位。內核開發者可以隨時間調整其大小。 - 不要將類型轉換回標準 C 類型。 - **保持與大小無關的代碼**：編寫程式碼時應避免依賴具體的儲存大小和格式，以便不透明類型的實際儲存和格式可以改變。 ## Special Types 因為不同架構的 type 大小不同，所以核心也有定義特定大小的類型，例如 `s8` 為 signed byte，`u8` 為 unsigned byte 等等 ## Signedness of Char char 可以是有號也可以是無號的，由編譯器與架構一起決定 https://lkml.org/