# Operation System - Three Easy Pieces 中文版筆記
PDF link: https://book.douban.com/subject/33463930/
## 三個部份
* Vitualization
* concurrency
* persistence
### Vitualization
為什麼需要 Vitualization
為了使 OS 更加易於使用-將 Hardware(physical resource) 轉換為通用的虛擬形式
資源的隔離
以模擬獨有的方式共享 CPU resource
以模擬獨有的方式共享 Memory resource
### Concurrency
在 writeback 模式下,cpu cache 寫回 memory 要三個步驟。
多個Vitual Resource 同時對單一 api 的解釋不存在 atomically
### Persistence
DRAM 以 volatile 方式存儲數據,中止後數據就丟失。
概敘流程:透過 System Call 打到 File System ,對 I/O Device 發出請求,轉到 Device Driver。
## OS
### Abstraction
在 Abstraction 和 Performance 中間找到平衡點
寫 C 不用管組合語言,寫組合語言不用考慮指令集實作,實作指令集不用考慮電晶體的組成,這就是抽象的力量。
### Isolation
避免 Application 破壞整個 OS
避免 OS 層級的失效,提高 OS 本身的 Reliability
### History
#### Earlist
OS = Library
#### More than just library
Isolation 的概念開始出現
任意存取 Memory 或 File I/O 沒有 Isolation 和 Reliability
從操作 File 的 Library 升級為操作 File System 的 Library
如何隔離對 File 和 File System 的操作 -> System Call
隔離 User Mode 跟 Kernel Mode
在 Kernel Mode 下,OS 可以完全操作 Hardware,因此可以透過 driver 操作 I/O 或取求更多 Memory 給 User 之類的功能
流程是 Application -> trap -> trap handler -> kernal mode -> return from trap
### Multiprogramming
提高 CPU 利用率,減少 I/O 操作的佔比。
Memory protection, Concurrency 變成顯學
### Unix 的誕生
Pipe 是創舉,提供極為直覺的程序互動方式。
Unix -> Linux 脫離了版權枷鎖。
## Vitualization - 從桃子到 OS
Physical 上有一個桃子
有無限多個 User 都想吃桃,由於 User 量不定因此我們也無法預先切好桃子來分配。
於是我們提供一個假象,承諾 ~~Make Peach Great Again~~ 每個人都有一個自己的桃子,但其實每個人拿到的都是一個 Virtual Peach。
這樣做不會出錯的基礎是建立在
* 假設平均下來每個人對自己桃子的利用率不高。
* 假設不會有人無時無刻都咬住(Block)桃子。
在 OS 的議題上,CPU 就是那個桃子
> 後面不會有更多桃子了。
### Abstraction:Process
Process 是 OS 對 Resource 調度的最基本抽象,Program 本身沒有生命週期,只是存在 Disk 上的一堆 Byte,只有跟 Process 結合成為 Application 之後才有 Life cycle 可言。
有了 Process 之後 User 就不用管理 CPU 的可用狀況。
> Q: 如何提供有無限多~~桃..~~ Resource 的假象
> A: 基於 Time Sharing 和 Space Sharing 共享 Resource,對 CPU 使用 Time Sharing,在不同的時間段把 CPU 提供給不同 Instance 使用。對 Disk 使用 Space Sharing,把整個 Disk 切成不同的空間分給不同 Instance 使用。
在此機制之上,必定要有特定 Policy 存在以作到 Sharing,Scheduling Policy 會嘗試利用各種訊息(e.g. Time usage,Space usage, throughput) 去做出決定。
#### Machine state
Process 的 Life cycle 儲存在兩個部份,其一是 address space,具體長相參見下圖

其二是 Register,很多執行的狀態會在 Register 中流轉。
例如以下組合語言
```
push ebp
mov ebp,esp
sub esp,size_of_local_variables
...
somehting something something
...
mov esp, ebp
pop ebp
ret
```
個人就曾經有經驗,由於 Stack overflow 不小心修改到 Stack Pointer,最終導致 Register 從 Stack Pointer 取得位置並 return 時跳到無法解析的位置導致 Exception。
> 這種狀況操作的好可以直接變成一種攻擊手段,控制 Applicatino 跳轉到任意指定的位置。
#### API
* Create
* Destory
* Wait
* Statu
* Other(Pause/Resume...)
#### Memory Space 的加載
分為兩種,Eagerly & Lazily
* Eagerly
* Binary
* Static Global value
* Lazily
* Dynamic library
* malloc(當 Heap 內沒有可用空間時)
#### Process State

* Running
* Running on cpu right now
* Ready
* Ready to Running
* Blocked
* Waiting for interupt,(e.g, Disk I/O, network I/O)
Ready -> Running = Scheduled
Running -> Blcoked = Descheduled
#### Data struct
OS 在 Trace Process status 的時候會需要許多資料,因此在 Descheduled 的時候會需要把這些資料保存到 Memory 內,範例如下
OS
```c
// the registers xv6 will save and restore
// to stop and subsequently restart a process
struct context {
int eip;
int esp;
int ebx;
int ecx;
int edx;
int esi;
int edi;
int ebp;
};
// the different states a process can be in
enum proc_state { UNUSED, EMBRYO, SLEEPING,
RUNNABLE, RUNNING, ZOMBIE };
// the information xv6 tracks about each process
// including its register context and state
struct proc {
char *mem; // Start of process memory
uint sz; // Size of process memory
char *kstack; // Bottom of kernel stack
// for this process
enum proc_state state; // Process state
int pid; // Process ID
struct proc *parent; // Parent process
void *chan; // If non-zero, sleeping on chan
int killed; // If non-zero, have been killed
struct file *ofile[NOFILE]; // Open files 参考资料 25
struct inode *cwd; // Current directory
struct context context; // Switch here to run process
struct trapframe *tf; // Trap frame for the
// current interrupt
};
```
> xv6 process kernel data struct
此外 Process 其實還有一些其他的狀態如
* Initial
* Creating
* Final
* Process ended, 等待被 Parent Process 回收
> 因此 Linux 中 Parent 呼叫 [wait](https://man7.org/linux/man-pages/man2/waitpid.2.html) 之後的流程其實是這樣
| Event | Parent | Child |
| --------------------- | -------- | ------- |
| | Running | Running |
| Parent::wait(Block) | Blocking | Running |
| Child::return | Blocking | Final |
| | Ready | Final |
| Parent::wait(Unblock) | Running | |
## TODO
./process-run.py
## Process API
> 獨萬捲書不如行萬里路,對於內容有任何問題都推薦直接啃 [xv6 Source Code](https://github.com/mit-pdos/xv6-public) ,極為簡潔易懂。
### fork
fork 是相當特殊的 API,同時也是搞 Multi-Process 一切開始的地方。
```c=
int main(int argc, char *argv[]) {
printf("hello world (pid:%d)\n", (int) getpid());
int rc = fork();
if (rc < 0) { // fork failed; exit
fprintf(stderr, "fork failed\n");
exit(1);
} else if (rc == 0) { // child (new process)
printf("hello, I am child (pid:%d)\n", (int) getpid());
} else { // parent goes down this path (main)
printf("hello, I am parent of %d (pid:%d)\n", rc, (int) getpid());
}
return 0;
}
```
輸出
```shell=
hello world (pid:29146)
hello, I am parent of 29147 (pid:29146)
hello, I am child (pid:29147)
```
或
```shell=
hello world (pid:29146)
hello, I am child (pid:29147)
hello, I am parent of 29147 (pid:29146)
```
> 這種不確定就是 Concurrency 的代表特徵之一
Child Process 有不同的 address space, register 等,但 memory 內的數值都是複製過去的。
Parent process 會獲得 Child process 的 PID,至於 Child Process 則會獲得 0。
#### How fork return PID?
> 感謝宇光提出
TL;DR
fork 的執行在 Kernel mode,有全部權限,執行完透過 Trap Frame 修改 Register 狀態來回傳。
可以參考 xv6 這裡的敘述
[https://github.com/mit-pdos/xv6-public/blob/eeb7b415dbcb12cc362d0783e41c3d1f44066b17/TRICKS](https://github.com/mit-pdos/xv6-public/blob/eeb7b415dbcb12cc362d0783e41c3d1f44066b17/TRICKS)
實作在這
[https://github.com/mit-pdos/xv6-public/blob/master/proc.c#L181](https://github.com/mit-pdos/xv6-public/blob/master/proc.c#L181)
[https://github.com/mit-pdos/xv6-public/blob/master/proc.c#L74](https://github.com/mit-pdos/xv6-public/blob/master/proc.c#L74)
調用 Fork 後切到 Kernel Mode,在系統創建新 Process 之後可以從 PCB 取得 PID,接下來在載入 Parent Process stack pointer 的時候先將 PID 寫入 return value 的 register,等 Parent Process 載入完成時自然可以從該 register 讀到對應 Value。
On x86 the EAX register is normally used for return values.
```wasm
main:
push rbp
mov rbp, rsp
sub rsp, 16
call fork
mov DWORD PTR [rbp-4], eax
mov eax, 0
leave
ret
```
可見編譯出來的組合語言也是直接從 eax 取值。
https://godbolt.org/z/7893jnf6z

### wait
Block until all child process execute to `Final`
> waitpid 可以指定 child process
具體流程可參考上方 [Data Struct](https://hackmd.io/c5v55On3SmCysA6lij-xIQ?view#Data-struct) 小節
### exec
直接開啟加載給定 binary 並且執行,繼承現有 resource,除非 exec 在加載時發生異常不然永不返回。
### 為什麼這樣實現?
把操作範圍最小化,這樣可以方便實現很多其他功能。
比如 Shell 在解析 `cat main.go > test.txt` 的時候,可以先 fork 一個 Process,並且在真正執行 cat 之前先把這個 Child process 的 stdout 導到 test.txt,在 exec `cat main.go` 時會自動繼承,並且由於 exec `cat main.go` 時是直接使用 Child process,對於 Parent process 可以不在意 Child process 打算做什麼,單純就 wait() 直到對應 Child process 返回即可。
簡單,方便,易測試。
### Other API
RTFM(Read the friendly manpage)
### TODO 延伸思考:Why the cost of process context switch is bigger than thread context switch
## limited direct execution
> We want to virtualize our resources without performance costs
### User mode & Kernel mode
Only kernal mode can access the physical resource
`System Call` a way to allow OS expose some key feature to user application.
* I/O
* Operate process
* Cross process communication
### Trap
An standard data exchange interface for System Call
When System Call happen
Call fork() -> Call Trap() ->
Save current register status to TrapFrame ->
Query Trap Table to find out where is the System Call code ->
Jump to specific space to execute System Call Code(A number) ->
Send interupt ->
OS received interupt, read System Call Code number, Call specific function ->
Function read argument from trap frame
## Limited Direct Execution
### How to make sure OS is executing?
If a process never yield itself, how we make sure OS is still executing in our CPU?
Through Clock interupt.(Provide by hardware)

However, cause the trap is not call by User.
So the register should be save by hardware. (Just like when a function call happen)
then trigger trap() -> jump to switch() -> executing other process
### Note: reboot is useful
Time already approve that reboot computer is a good way to make sure the status of application back to our control.
It's also easy way can be automatic
So be positive about reboot. It's not a bad way to solve problem. It's a reliable way to solve the problem XD
## The truth about context switch
Why context effect performance a lot?
We may heard that it's because that CPU have to reload process status.
However, what's "process status"?
It's actully combine lot's of things include
* Recover register status
* Update TLB & cache
* TLB miss
* Cache miss
* Page fault
* Maintain cache coherence
* Rebuild Branch predicter
* etc.
Recover register status is actully not the biggest cost, it's quite cheap.
## Space sharing
All the memory address we see is virtual address
only OS & Hardware know the physical address
## Note: How malloc & free work
Those are actully **C library**
reference: https://blog.csdn.net/dog250/article/details/80331590
Source code: https://github.com/lattera/glibc/blob/master/malloc/malloc.c
### brk
### mmap && munmap
## Memory layout

> Unitialize/Initialized data are also a part of heap
### How to caculate offset?
With `Base address` & `Size`
Usually It will store like this(explicit)

Top 2 bit declare is it stack/heap/text
So the code to find the physical address will be like
```c=
// get top 2 bits of 14-bit VA
Segment = (VirtualAddress & SEG_MASK) >> SEG_SHIFT
// now get offset
Offset = VirtualAddress & OFFSET_MASK
if (Offset >= Bounds[Segment])
RaiseException(PROTECTION_FAULT)
else
PhysAddr = Base[Segment] + Offset
Register = AccessMemory(PhysAddr)
```
> Bounds and Base will be different for each process? or different for each segment?
### How to caculate stack?
stack grow reversely, how we present it?
Quite simple, add a reverse flag.(one bit)
Or hardcode the caculation of stack as subtraction
### Sharing
Some of Segment need to been sharing
e.g, dynamic library
So there is actually a `Protect bit` to mark is those
### Q: those are all for one process, how I access the sharing memory?
### Q:Why stack grow from high to low?
Better use space, let heap and stack can share the same unused address range.
## Schedule
There is two way to record memory space usage
for example, we have follow memory space
| 0~9 | 10~19 | 20~29 |
| ---- | ----- | ----- |
| free | used | free |
1. Record used space
```graphviz
digraph {
"root" -> "10~19";
}
```
2. Record free space
```graphviz
digraph {
"root" -> "0~9";
"0~9" -> "20~29";
}
```
Sees the no big different. However, let's see another example
| 0~9 | 10~19 | 20~29 |
| ---- | ----- | ----- |
| used | used | free |
1. Record used space
```graphviz
digraph {
"root" -> "0~9";
"0~9" -> "10~19"
}
```
2. Record free space
```graphviz
digraph {
"root" -> "20~29";
}
```
It show's that the time complexity of find a free space in second solution(Record free space) is always O(1). However, in the first solution(Record used space) we may have to traverse to the end of list, it worst-case time-complexity is O(n).
There are still a disadvantages for second solution, it may sperate a free space to two different block, such as
```graphviz
digraph {
"root" -> "10~19";
"10~19" -> "20~29";
}
```
We have to merge them by ourself when we schedule them.
### Chunk header

> `magic` is used to check is that header correct or not, a.k.a checksum.
### Schdule policy
1. Best fit
traverse the list, find the smallest free space that fit the requirement
```go
fitNode := &node{val: INT_MAX}
for _, node := nodes {
if
}
```
0~29
| 0~9 | 10~19 | 20~29 |
| ---- | ----- | ----- |
| used | used | used |
free(obj1) 40
obj2 50
free C -> Mark
* sbrk -> mmap