Linux Kernel Design: Process

---
tags: NCKU Linux Kernel Internals, 作業系統
---

# Linux Kernel Design: Process

[Linux 核心設計: 不僅是個執行單元的 Process](https://hackmd.io/@sysprog/linux-process?type=view) (Linux Kernel Design: Process, More Than an Execution Unit)

:::success
:video_camera: This note only supplements a few topics of interest. For the details, please refer to the course recording.
:::

## Overview

### Process

Reference: [CSE 506, Spring 2016: Operating Systems](http://www.cs.unc.edu/~porter/courses/cse506/s16/): [Process Address Spaces and Binary Formats](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/address-spaces.pdf)

First, what is a process?

* A process is a virtual address space that the kernel maps onto physical memory
* A process contains
    * Memory-mapped files (including the program binary itself)
    * Anonymous pages (heap, stack, and other memory not backed by a file)

Since a process is a set of mappings maintained by the kernel, some data structure has to describe those mappings. That structure is **vm_area_struct**.

### vm_area_struct in Linux

Linux manages the userspace address space through [struct vm_area_struct](https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L301), which is reached via `struct task_struct` -> `struct mm_struct`. Its members include:

* vm_start: the start of the mapped virtual address range
* vm_end: the end of the mapped virtual address range (the first address past it)
    * Because mappings are page aligned, `vm_end - vm_start` may be slightly larger than the size that was requested
* vm_next, vm_prev: links of the VMA linked list, sorted by address (present in kernels before 6.1)
* vm_flags: readable? writable? executable?
* vm_file: the file being mapped (may be NULL)

and so on.

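
The member list above can be pictured with a small user-space sketch. This is not the kernel's definition: `struct vma`, the `VMA_*` flags, `SKETCH_PAGE_SIZE`, `page_align()` and `lookup_vma()` are hypothetical names made up for illustration, and a 4 KiB page size is assumed. It only shows why `vm_end - vm_start` can exceed the requested length and what an O(n) walk over an address-sorted list looks like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's vm_area_struct. */
struct vma {
    unsigned long vm_start;  /* first address inside the area */
    unsigned long vm_end;    /* first address past the area (exclusive) */
    unsigned long vm_flags;  /* combination of the VMA_* bits below */
    struct vma *vm_next;     /* next area in address order, or NULL */
};

/* Illustrative permission bits, not the kernel's VM_* values. */
enum { VMA_READ = 1, VMA_WRITE = 2, VMA_EXEC = 4 };

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096UL

/* Round a requested length up to a whole number of pages. */
unsigned long page_align(unsigned long len)
{
    return (len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Walk the sorted list and return the area containing addr, or NULL. */
struct vma *lookup_vma(struct vma *head, unsigned long addr)
{
    for (struct vma *v = head; v != NULL; v = v->vm_next) {
        assert(v->vm_start < v->vm_end);
        if (addr < v->vm_start)
            return NULL;     /* list is sorted, so addr lies in a gap */
        if (addr < v->vm_end)
            return v;        /* vm_start <= addr < vm_end */
    }
    return NULL;
}
```

For example, with two areas `[0x1000, 0x2000)` and `[0x3000, 0x5000)` linked in that order, `lookup_vma(head, 0x4000)` returns the second area and `lookup_vma(head, 0x2000)` returns NULL, because `vm_end` is exclusive. The kernel's own `find_vma()` behaves differently: it returns the first area whose `vm_end` lies above the address, even when the address falls in a gap below that area.
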
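
As the figure in the lecture slides shows, the relationship between VMAs and the address space looks roughly like this:

![](https://i.imgur.com/9ilLsDO.png)

However, VMAs are looked up very frequently inside the kernel. As the slides put it:

> Source code (mm/mmap.c): claims 35% hit rate

A plain linked list costs O(n) per search, which inevitably hurts performance, so Linux also indexes the VMAs in a red-black tree to speed up lookups. In addition, a VMA that was accessed a moment ago is likely to be used again, so the kernel keeps a pointer that caches the most recently used VMA to exploit locality. The 35% figure quoted above is the hit rate of that cache. (This describes kernels before 6.1. Since 6.1 the list and the red-black tree have been replaced by a maple tree.)

#### Demand paging

When a user asks for a new VMA through a syscall such as mmap, the kernel does not have to set up the mapping to physical memory right away. Only when the VMA is actually accessed does a page fault occur, and only then does the kernel map physical memory. This is [demand paging](https://en.wikipedia.org/wiki/Demand_paging). The design has the following advantages:

* Only the pages that are really needed at run time are loaded, which reduces wasted memory
* The more memory stays available, the more processes can be kept resident in memory, which reduces the time needed for a context switch

It also comes with some trade-offs:

* Handling a page fault is relatively expensive. If page faults happen too often (imagine several processes that all need memory and fight over it, causing [thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science))), performance may actually get worse
* Page replacement / swapping is what makes demand paging powerful, but not every hardware architecture has an [MMU](https://en.wikipedia.org/wiki/Memory_management_unit) to support it. (Without an MMU there is no guarantee that a page swapped out to secondary storage is loaded back at the same address, which causes problems. With an MMU, the page may land in a different physical frame but can still be mapped at the same virtual address.)

### Signal

Reference: [Signals and Inter-Process Communication](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/ipc.pdf)

#### What is signal?

According to Wikipedia's [Signal (IPC)](https://en.wikipedia.org/wiki/Signal_(IPC)):

> Signals are a limited form of inter-process communication (IPC), typically used in Unix, Unix-like, and other POSIX-compliant operating systems. A signal is an asynchronous notification sent to a process or to a specific thread within the same process in order to notify it of an event that occurred.
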
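
To make the definition above concrete, here is a minimal sketch of a process notifying itself. The names `on_sigusr1()`, `install_handler()`, `notify_self()` and `got_signal` are hypothetical and chosen only for this illustration. The POSIX calls used are `sigaction()` and `raise()`, and the choice of `SIGUSR1` is arbitrary. Another process could send the same notification with `kill(pid, SIGUSR1)`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t got_signal = 0;

/* The action this process chooses to take when SIGUSR1 arrives. */
static void on_sigusr1(int sig)
{
    got_signal = sig;
}

/* Register on_sigusr1() for SIGUSR1. Returns 0 on success, -1 on error. */
int install_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);   /* block no additional signals in the handler */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Send SIGUSR1 to the calling thread and report what the handler recorded. */
int notify_self(void)
{
    got_signal = 0;
    if (raise(SIGUSR1) != 0)
        return -1;
    /* raise() returns only after the handler has run. */
    return got_signal;
}
```

Calling `install_handler()` and then `notify_self()` from `main()` should return `SIGUSR1`. Without the handler installed, the default action of `SIGUSR1` would terminate the process instead.
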
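
A signal is similar to an interrupt, except that its target is not the kernel but an application (a process). It is a limited form of message exchange between processes. A process can signal another process or thread, and each process or thread can define its own action for each signal it receives.

Signals are also closely related to hardware exceptions. When an exception occurs while a program is running, the OS switches to the kernel exception handler to resolve the problem and then resumes the program. Some exceptions, however, cannot be resolved by the kernel, so the kernel turns the exception into a signal and hands it back to the process, telling the process to deal with the problem itself. For example, when a process on an x86 CPU tries to divide by zero, a divide error exception is raised and the kernel delivers a SIGFPE signal to the process.

The signal types are listed in [signal(7) — Linux manual page](https://man7.org/linux/man-pages/man7/signal.7.html).

![](https://i.imgur.com/8yB45Qw.png)

As the figure shows, when a process receives a signal on Linux, `do_signal` checks the signal type and handles it accordingly. If it is a signal the kernel cannot handle itself (for example, one with a user-installed handler), the kernel first saves the context that was about to run on return to user mode, builds a stack frame for the signal handler, and points the program counter at the handler. When the signal handler finishes, it returns to the kernel through a system call (`sigreturn`), and the kernel restores the saved context so that the program resumes its original flow.

:::warning
:warning: The paragraph above was written from second-hand material and from my memory of a course I took. It is quite vague and not guaranteed to be free of mistakes. If you are curious, trace the code or test it by experiment. I may come back and fill in the details later.
:::

#### Nested Signals

What happens if another signal arrives while a signal handler is running?

```c
#include <signal.h>
#include <stdio.h>

volatile sig_atomic_t a = 0;

/* printf() is not async-signal-safe; it is used here only for illustration. */
void handler(int sig)
{
    (void)sig;
    a++;
    /* do something */
    printf("this is the %d time of handler\n", (int)a);
}

int main(void)
{
    /* ... */
    signal(SIGINT, handler);
    signal(SIGTERM, handler);
    /* ... */
    return 0;
}
```

In this simple example you might expect the handler to print how many times it has run, so that the output follows the pattern:

```
this is the 1 time of handler
this is the 2 time of handler
```

But suppose the first handler is interrupted by another signal while it is in `/* do something */`. The second handler then runs `a++` before the first handler reaches its `printf`, and the result may become:

```
this is the 2 time of handler
this is the 2 time of handler
```

To avoid such unexpected results, it is recommended to replace `signal()` with `sigaction()`. With `sigaction()`, the signal being handled stays blocked while its handler runs (unless `SA_NODEFER` is set), and `sa_mask` can block further signals, such as SIGTERM during the SIGINT handler. This is much like disabling interrupts while an interrupt handler runs, and it prevents the problem above. As a side note, signals look different from the application side and the kernel side:

* From the application's point of view, a signal is sent out immediately.
* The kernel first marks the signal as pending. When it returns to userspace from an interrupt or a system call, it checks for pending signals and delivers them at that point.

### Native POSIX Threading Library (NPTL)

What is a thread?
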
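
* A thread is the unit that the OS schedules and executes
* One address space (process) can contain one or more threads
    * On Linux, this means several `task_struct` instances sharing one `mm_struct`

In early Linux (before version 2.6), the relationship between process and thread was ambiguous. Linux started out as an operating system built around single-threaded processes, so the process was the smallest unit of scheduling. It did, however, provide the system call [clone](https://man7.org/linux/man-pages/man2/clone.2.html), which can create a **process** that shares its address space. The man page says:

> For example, using these system calls, the caller can control whether or not the two processes share the virtual address space

In other words, a thread created with clone() is still essentially a process. The "thread" defined by the Linux kernel only achieved resource sharing and did not constitute the most basic unit of execution. This also made threads on Linux incompatible with the POSIX standard. For example, because they were not real threads, when the signals of one thread were blocked, the signals of the others in the same address space were blocked along with them.

Further reading: [Linux threading models compared: LinuxThreads and NPTL](https://pdfs.semanticscholar.org/a1a7/451a1332d62ea6a6414f3e474e5415e523eb.pdf?_ga=2.105282755.1976655423.1597985306-1100633812.1597985306)

## TODO

- [ ] Study how the Linux source code uses a red-black tree to speed up VMA access
- [ ] Study the signal mechanism in the Linux source code, including the system call flow and how the kernel handles signals
- [ ] Study the design of task_struct, including the related system calls such as fork and clone, and how threads and processes relate to it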