---
# System prepended metadata

title: 'Linux Kernel Design: Process'
tags: [NCKU Linux Kernel Internals, 'Operating Systems']

---

---
tags: NCKU Linux Kernel Internals, Operating Systems
---
# Linux Kernel Design: Process

[Linux 核心設計: 不僅是個執行單元的 Process](https://hackmd.io/@sysprog/linux-process?type=view) (Linux Kernel Design: Process, more than just a unit of execution)


:::success
:video_camera: These notes only supplement a few topics of interest; for the details, see the course recording.
:::

## Overview 

### Process
Reference: [CSE 506, Spring 2016: Operating Systems](http://www.cs.unc.edu/~porter/courses/cse506/s16/): [Process Address Spaces and Binary Formats](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/address-spaces.pdf)

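First, what is a process?
* A process is a virtual address space that the kernel maps onto physical memory
* A process contains
    * Memory-mapped files (including the program binary itself)
    * Anonymous pages (heap, stack, and other memory not backed by a file)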

Since a process is a set of mappings maintained by the kernel, some data structure has to represent those mappings. That structure is **vm_area_struct**.

### vm_area_struct in Linux

Linux manages the userspace address space through [struct vm_area_struct](https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L301), which is reached via `struct task_struct` -> `struct mm_struct`. Its members include:

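* vm_start: start of the mapped virtual address range
* vm_end: end of the mapped virtual address range (the first address after the area)
    * Because mappings are page-aligned, `vm_end - vm_start` may be slightly larger than the size that was requested
* vm_next, vm_prev: links of the VMA linked list (sorted by address)
* vm_flags: readable? writable? executable?
* vm_file: the file being mapped (may be NULL)

and so on.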
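
The page-alignment remark can be made concrete with a small userspace sketch. `page_align_up` is a hypothetical helper name chosen for illustration, not a kernel function; it assumes the page size reported by `sysconf(_SC_PAGESIZE)` and simply rounds a requested length up to a whole number of pages.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: round a requested length up to a whole number of
 * pages, which is the granularity a mapping has to use.
 * Returns 0 if the page size cannot be determined. */
size_t page_align_up(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return 0;

    size_t p = (size_t)page;
    return (len + p - 1) / p * p;   /* integer round-up to a multiple of p */
}
```

With 4 KiB pages, a request for 5000 bytes would occupy two pages, so the resulting area spans 8192 bytes.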

As illustrated in the lecture slides, the relationship between VMAs and the address space looks roughly like this:
![](https://i.imgur.com/9ilLsDO.png)
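
On Linux, the VMAs of a process can also be inspected from userspace through `/proc/self/maps`, where each line begins with `start-end perms`. The sketch below is only an illustration: `find_mapping` is a hypothetical helper, and it assumes `/proc` is mounted and that lines fit in the buffer.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find the mapping in /proc/self/maps that contains
 * addr. Returns 1 and fills perms (for example "r-xp") when found,
 * 0 when no mapping contains addr, -1 if the file cannot be opened. */
int find_mapping(const void *addr, char perms[5])
{
    FILE *fp = fopen("/proc/self/maps", "r");
    if (!fp)
        return -1;

    unsigned long target = (unsigned long)(uintptr_t)addr;
    char line[512];
    int found = 0;

    while (fgets(line, sizeof line, fp)) {
        unsigned long start, end;
        char p[5];

        /* Each line starts with "start-end perms", addresses in hex. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, p) != 3)
            continue;
        if (target >= start && target < end) {   /* range is [start, end) */
            memcpy(perms, p, sizeof p);
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}
```

Passing the address of a writable global or a local variable should report a mapping whose permission string contains `w`, while the address of a function should fall in a mapping marked `x`.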

However, VMAs are operated on very frequently inside the kernel. As the slides put it:

> Source code (mm/mmap.c): claims 35% hit rate	

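Searching a plain linked list is O(n), which would hurt performance, so Linux speeds up lookups with a red-black tree. In addition, a VMA that was accessed recently is likely to be used again, so a pointer caches the most recently used VMA to take advantage of locality. (Since Linux 6.1, the VMA linked list and red-black tree have been replaced by a maple tree.)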
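
The cached lookup can be sketched in a few lines. This is a simplified userspace model, not the kernel's code: `struct vma` and `struct mm` are stand-ins that keep only the fields needed here, `cache_hits` exists purely for this illustration, and a linear list walk takes the place of the tree descent.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumed for this sketch). */
struct vma {
    unsigned long vm_start;   /* first address inside the area */
    unsigned long vm_end;     /* first address after the area */
    struct vma *vm_next;      /* next area, sorted by address */
};

struct mm {
    struct vma *mmap;         /* head of the sorted VMA list */
    struct vma *mmap_cache;   /* result of the last successful lookup */
    unsigned long cache_hits; /* bookkeeping for this sketch only */
};

/* Return the first area with addr < vm_end, or NULL if there is none. */
struct vma *find_vma(struct mm *mm, unsigned long addr)
{
    struct vma *vma = mm->mmap_cache;

    /* Fast path: the cached area covers addr, so no search is needed. */
    if (vma && vma->vm_start <= addr && addr < vma->vm_end) {
        mm->cache_hits++;
        return vma;
    }

    /* Slow path: a linear walk stands in for the tree lookup. */
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
        if (addr < vma->vm_end) {
            mm->mmap_cache = vma;   /* remember it for the next lookup */
            return vma;
        }
    }
    return NULL;
}
```

Two consecutive lookups that land in the same area take the fast path on the second call, which is the locality the cache is meant to exploit.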

#### Demand paging
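When a user asks for a new VMA through a syscall such as mmap, the kernel does not have to set up the mapping to physical memory right away. The first real access to the VMA triggers a page fault, and only then does the kernel map physical memory. This is [demand paging](https://en.wikipedia.org/wiki/Demand_paging).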

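This design has the following advantages:
* Only the pages that are actually needed at run time are loaded, which reduces wasted memory
* The more memory is left available, the more processes can stay resident in memory, which reduces the time needed for a context switch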

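It also comes with some trade-offs:
* Handling a page fault has a cost. If page faults happen too often (imagine several processes that all need memory and keep taking it from one another, causing [thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science))), performance may end up worse
* Page replacement / swapping is what makes demand paging powerful, but not every hardware architecture has an [MMU](https://en.wikipedia.org/wiki/Memory_management_unit) to support it. (Without an MMU, there is no guarantee that a page swapped out to secondary storage is loaded back at the same address, which causes problems. With an MMU, the page may land in different physical memory but can still be mapped at the same virtual address.)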

### Signal
Reference: [Signals and Inter-Process Communication](http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/ipc.pdf)

#### What is signal?
According to the Wikipedia article [Signal (IPC)](https://en.wikipedia.org/wiki/Signal_(IPC)):

> Signals are a limited form of inter-process communication (IPC), typically used in Unix, Unix-like, and other POSIX-compliant operating systems. A signal is an asynchronous notification sent to a process or to a specific thread within the same process in order to notify it of an event that occurred.

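A signal is similar to an interrupt, except that its target is an application (process) rather than the kernel. It is a limited form of message exchange between processes. A process can signal another process or thread, and each process or thread can define its own action for each signal it receives.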

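Signals are also closely tied to hardware exceptions. When an exception occurs while a program is running, the OS switches to the kernel exception handler to resolve the problem and then resumes the program. Some exceptions, however, cannot be resolved by the kernel, so the kernel uses a signal to hand the exception back to the process and tell it to deal with the problem itself. For example, when a process on an x86 CPU attempts to divide by zero, a divide error exception is raised and the kernel delivers SIGFPE to the process.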

The signal types are listed in [signal(7) - Linux manual page](https://man7.org/linux/man-pages/man7/signal.7.html)

![](https://i.imgur.com/8yB45Qw.png)

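As the figure shows, in Linux for example, when a process receives a signal, `do_signal` checks the signal's type and handles it accordingly. If the signal is one the kernel cannot handle on its own (the process has installed a handler for it), the kernel first saves the context that was about to run on return to user mode, builds a stack frame for the signal handler, and points the program counter at the handler. When the handler finishes, it re-enters the kernel through a system call (`sigreturn`), and the kernel restores the saved context so the program resumes its original flow.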

:::warning
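:warning: The paragraph above was written from second-hand material and my memory of a course I took earlier. It is quite vague and is not guaranteed to be free of errors. If you are curious, it is better to trace the code or test it experimentally.

I may come back and fill in the details later.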
:::

#### Nested Signals
What happens if another signal arrives while a signal handler is running?
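```c
#include <signal.h>
#include <stdio.h>

int a = 0;

/* A signal handler takes the signal number and returns void. */
void handler(int signo) {
    (void)signo;
    a++;
    /* do something */
    /* printf is not async-signal-safe; it is used here only for illustration */
    printf("this is the %d time of handler\n", a);
}

int main(void) {
    /* ... */
    signal(SIGINT, handler);
    signal(SIGTERM, handler);
    /* ... */
    return 0;
}
```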

Take this simple example. You might expect each run of the handler to print which invocation it is, so the output would look like:
```
this is the 1 time of handler
this is the 2 time of handler
```
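But suppose the first handler is interrupted by another signal while it is in `/* do something */`, so `a++` runs a second time before the first handler reaches its `printf`. The output may then become: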
```
this is the 2 time of handler
this is the 2 time of handler
```

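To avoid such unexpected results, it is recommended to replace `signal()` with `sigaction()`. Its `sa_mask` field specifies which signals are blocked while the handler runs, much like disabling interrupts while an interrupt handler executes, which prevents the problem described above.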

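As a side note, signals look somewhat different from the application side and the kernel side:

* From the application's point of view, a signal is delivered immediately.
* The kernel first marks the signal as pending, then checks for pending signals when returning to userspace from an interrupt or system call and delivers them at that point.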


### Native POSIX Threading Library (NPTL)

What is a thread?
* A thread is the unit the OS schedules and executes
* One address space (process) can contain one or more threads
* In Linux terms, this means several `task_struct`s share the same `mm_struct`

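In early Linux (before 2.6), the relationship between processes and threads was ambiguous. Linux started out as a single-thread-oriented operating system, so the process was the smallest scheduling unit. It did, however, provide the system call [clone](https://man7.org/linux/man-pages/man2/clone.2.html), which can create **processes** that share an address space. As the man page puts it: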

> For example, using these system calls, the caller can control whether or not the two processes share the virtual address space

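In other words, a thread created with clone() is still essentially a process. The "thread" defined by the Linux kernel only achieved resource sharing and did not form the most basic unit of execution. This also made Linux threads incompatible with the POSIX standard. For example, because they are not true threads, if a signal is blocked in one thread, it ends up blocked for the other threads in the same address space as well.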
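
The behaviour POSIX expects can be observed with a small pthread sketch. It assumes a system whose threading library is NPTL (any current glibc); `worker` and `thread_demo` are hypothetical names. It checks two things: a write made by one thread is visible to another because they share an address space, and both threads report the same `getpid()`.

```c
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

static int shared_value = 0;        /* lives in the single shared address space */
static pid_t pid_seen_by_thread;    /* what getpid() returned in the worker */

static void *worker(void *arg)
{
    (void)arg;
    shared_value = 42;
    pid_seen_by_thread = getpid();
    return NULL;
}

/* Run one worker thread and report what the main thread observes.
 * Returns 0 on success, -1 if the thread could not be created or joined. */
int thread_demo(int *value, int *same_pid)
{
    pthread_t tid;

    shared_value = 0;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)   /* join also orders the memory accesses */
        return -1;

    *value = shared_value;
    *same_pid = (pid_seen_by_thread == getpid());
    return 0;
}
```

Build with `-pthread`. Under the older LinuxThreads implementation each thread had its own process ID, which is one of the POSIX incompatibilities NPTL removed.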

Further reading: [Linux threading models compared: LinuxThreads and NPTL](https://pdfs.semanticscholar.org/a1a7/451a1332d62ea6a6414f3e474e5415e523eb.pdf?_ga=2.105282755.1976655423.1597985306-1100633812.1597985306)

## TODO
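- [ ] Study how the Linux source code uses a red-black tree to speed up VMA access
- [ ] Study the signal mechanism in the Linux source code, including the system call flow and how the kernel handles signals
- [ ] Study the design of task_struct, including related system calls such as fork and clone, and how threads and processes relate to it.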