
Operating Systems: Three Easy Pieces, notes on the Chinese edition

PDF link: https://book.douban.com/subject/33463930/

Three parts

  • Virtualization
  • Concurrency
  • Persistence

Virtualization

Why do we need virtualization?
To make the OS easier to use: hardware (physical resources) is turned into general-purpose virtual forms.

Isolation of resources

Share the CPU resource by simulating exclusive ownership.
Share the memory resource by simulating exclusive ownership.

Concurrency

In write-back mode, flushing the CPU cache back to memory takes three steps.
When multiple virtual resources go through a single API at the same time, the operation is not carried out atomically.

Persistence

DRAM stores data in a volatile way; once the machine stops, the data is lost.

Rough flow: a system call goes to the file system, which issues a request to the I/O device and hands it off to the device driver.
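
A tiny illustration (mine, not the book's): the write path starts from these system calls, and everything below them is handled by the file system and the device driver.

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // open() / write() / fsync() / close() are the system calls the C
    // library issues on our behalf; the file system and device driver
    // take care of everything below this line of abstraction.
    int fd = open("/tmp/notes.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    write(fd, "hello persistence\n", 18);
    fsync(fd);   // ask the OS to push the data down to the device
    close(fd);
    return 0;
}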

OS

Abstraction

Find the balance point between abstraction and performance.
Writing C you do not have to think about assembly; writing assembly you do not have to think about the instruction-set implementation; implementing an instruction set you do not have to think about how the transistors are built. That is the power of abstraction.

Isolation

Prevent an application from wrecking the whole OS.
Avoid OS-level failures and improve the reliability of the OS itself.

History

Earliest

OS = Library

More than just a library

The concept of isolation starts to appear.
Arbitrary access to memory or file I/O gives neither isolation nor reliability.
The library for operating on files gets upgraded to a library for operating on a file system.

How to isolate operations on files and the file system -> system calls

Separating user mode from kernel mode

In kernel mode the OS has full control over the hardware, so it can drive I/O through device drivers, hand out more memory on behalf of the user, and so on.

The flow is Application -> trap -> trap handler -> kernel mode -> return from trap

Multiprogramming

Raise CPU utilization and shrink the share of time spent idling on I/O.

Memory protection and concurrency become the hot topics.

The birth of Unix

The pipe was a breakthrough, giving programs an extremely intuitive way to interact.
Unix -> Linux broke free of the copyright shackles.

Virtualization: from a peach to the OS

Physically there is one peach.
Infinitely many users want a bite, and since the number of users is unknown we cannot pre-slice the peach to divide it up.

So we offer an illusion, promising to Make Peach Great Again: everyone gets a peach of their own, but what each person actually holds is a virtual peach.

This only works without going wrong because:

  • We assume that, on average, each person's utilization of their own peach is low.
  • We assume nobody keeps biting down on (blocking) the peach all the time.

In OS terms, the CPU is that peach.

No more peaches after this point.

Abstraction: Process

A process is the OS's most basic abstraction for scheduling resources. A program by itself has no life cycle; it is just a pile of bytes sitting on disk. Only once it is bound to a process and becomes a running application does a life cycle exist.

With processes, users no longer need to manage CPU availability themselves.

Q: How do we provide the illusion of infinitely many peach... er, resources?
A: By sharing resources through time sharing and space sharing. The CPU is time-shared: in different time slices it is handed to different instances. The disk is space-shared: the whole disk is carved into separate regions given to different instances.

On top of this mechanism there has to be a concrete policy to make sharing happen; the scheduling policy tries to use all kinds of information (e.g. time usage, space usage, throughput) to make its decisions.

Machine state

A process's state is stored in two parts. The first is the address space (the code/heap/stack layout shown in the book's figure).

The second is the registers; a lot of execution state flows through them.

For example, the following assembly prologue and epilogue:

push ebp                          ; save the caller's frame pointer
mov  ebp, esp                     ; set up this function's frame
sub  esp, size_of_local_variables ; reserve space for locals
...
; function body
...
mov  esp, ebp                     ; tear the frame down
pop  ebp
ret                               ; jump back via the saved return address

I once ran into this myself: a stack overflow accidentally overwrote the saved stack pointer, so when the return target was fetched via the stack pointer, execution jumped to an unresolvable address and raised an exception.

Done deliberately, this becomes an attack technique: steering the application to jump to an arbitrary attacker-chosen address.

API

  • Create
  • Destroy
  • Wait
  • Status
  • Other (Pause/Resume)

Loading the memory space

Two flavors: eagerly and lazily (see the sketch after this list).

  • Eagerly
    • The binary itself
    • Static/global values
  • Lazily
    • Dynamic libraries
    • malloc (when the heap has no free space left)
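
A minimal sketch of the lazy case (not from the book): the shared library is only mapped into the address space when dlopen is called. libm.so.6 and the cos lookup are just an example, and linking may need -ldl on older glibc.

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    // the math library is mapped into our address space only now,
    // not when the process was first loaded
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
    if (cosine)
        printf("cos(0) = %f\n", cosine(0.0));
    dlclose(handle);
    return 0;
}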

Process State

  • Running
    • Running on the CPU right now
  • Ready
    • Ready to run
  • Blocked
    • Waiting for an interrupt (e.g. disk I/O, network I/O)

Ready -> Running = scheduled
Running -> Ready = descheduled
Running -> Blocked = the process starts waiting, e.g. it issues an I/O request

Data struct

The OS needs plenty of data to track a process's status, so when a process is descheduled that data has to be saved into memory. An example:


// the registers xv6 will save and restore 
// to stop and subsequently restart a process 
struct context { 
 int eip; 
 int esp; 
 int ebx; 
 int ecx; 
 int edx; 
 int esi; 
 int edi; 
 int ebp; 
}; 
// the different states a process can be in 
enum proc_state { UNUSED, EMBRYO, SLEEPING, 
 RUNNABLE, RUNNING, ZOMBIE }; 
// the information xv6 tracks about each process 
// including its register context and state 
struct proc { 
 char *mem; // Start of process memory 
 uint sz; // Size of process memory 
 char *kstack; // Bottom of kernel stack 
 // for this process 
 enum proc_state state; // Process state 
 int pid; // Process ID 
 struct proc *parent; // Parent process 
 void *chan; // If non-zero, sleeping on chan 
 int killed; // If non-zero, have been killed 
 struct file *ofile[NOFILE]; // Open files 
 struct inode *cwd; // Current directory 
 struct context context; // Switch here to run process 
 struct trapframe *tf; // Trap frame for the 
 // current interrupt 
}; 

xv6 process kernel data struct

Besides these, a process actually has a few other states:

  • Initial
    • Being created
  • Final
    • The process has ended and is waiting to be reaped by its parent process

So in Linux the flow after the parent calls wait actually looks like this:

| Event | Parent | Child |
| --- | --- | --- |
|  | Running | Running |
| Parent::wait(Block) | Blocked | Running |
| Child::return | Blocked | Final |
|  | Ready | Final |
| Parent::wait(Unblock) | Running |  |

TODO

./process-run.py

Process API

Reading ten thousand books is no match for walking ten thousand miles: for any question about this material I recommend chewing straight through the xv6 source code; it is remarkably concise and easy to follow.

fork

fork is a rather special API, and it is where everything multi-process begins.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    printf("hello world (pid:%d)\n", (int) getpid());
    int rc = fork();
    if (rc < 0) {            // fork failed; exit
        fprintf(stderr, "fork failed\n");
        exit(1);
    } else if (rc == 0) {    // child (new process)
        printf("hello, I am child (pid:%d)\n", (int) getpid());
    } else {                 // parent goes down this path (main)
        printf("hello, I am parent of %d (pid:%d)\n", rc, (int) getpid());
    }
    return 0;
}

Output (one possible interleaving):

hello world (pid:29146)
hello, I am parent of 29147 (pid:29146)
hello, I am child (pid:29147)

or:

hello world (pid:29146)
hello, I am child (pid:29147)
hello, I am parent of 29147 (pid:29146)

This nondeterminism is one of the signature traits of concurrency.

The child process gets its own address space, registers, and so on, but the values in its memory are copies of the parent's.

fork hands the parent process the child's PID, while the child receives 0.

How does fork return the PID?

Thanks to 宇光 for raising this question.

TL;DR
fork runs in kernel mode with full privileges; when it finishes, it passes the return value back by editing the register state saved in the trap frame.

See the description in xv6 here:

https://github.com/mit-pdos/xv6-public/blob/eeb7b415dbcb12cc362d0783e41c3d1f44066b17/TRICKS

The implementation is here:
https://github.com/mit-pdos/xv6-public/blob/master/proc.c#L181

https://github.com/mit-pdos/xv6-public/blob/master/proc.c#L74

After fork is called, execution switches to kernel mode. Once the system has created the new process, the PID can be read from the PCB; then, while restoring the parent process's register state (stack pointer and the rest), the kernel first writes the PID into the return-value register, so when the parent finishes resuming it naturally reads that value out of the register.

On x86 the EAX register is normally used for return values.

main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        call    fork
        mov     DWORD PTR [rbp-4], eax
        mov     eax, 0
        leave
        ret

As you can see, the compiled assembly also reads the value straight from eax.

https://godbolt.org/z/7893jnf6z

wait

Blocks until a child process has run to the Final state.

waitpid lets you pick which child process to wait for.

For the concrete flow, see the Data struct section above.
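
A minimal sketch (mine, not the book's) of reaping one specific child with waitpid:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int rc = fork();
    if (rc == 0) {               // child: pretend to do some work
        sleep(1);
        exit(42);
    }
    int status = 0;
    // block until this particular child reaches the Final state,
    // then reap it and read its exit code
    waitpid(rc, &status, 0);
    if (WIFEXITED(status))
        printf("child %d exited with %d\n", rc, WEXITSTATUS(status));
    return 0;
}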

exec

Loads the given binary in place and executes it, inheriting the existing resources; unless something goes wrong during loading, exec never returns.

Why implement it this way?

It keeps the scope of each operation minimal, which makes many other features easy to build on top.

For example, when the shell parses cat main.go > test.txt, it can fork a process first and, before actually running cat, redirect that child process's stdout to test.txt. The redirection is inherited automatically when exec cat main.go runs, and because exec runs inside the child process itself, the parent does not need to care what the child is up to; it simply wait()s until the corresponding child returns.

Simple, convenient, easy to test. A rough sketch of the flow follows.
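
A rough sketch of that cat main.go > test.txt flow, using open + dup2 before execvp; a real shell differs in the details, and most error handling is omitted:

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int rc = fork();
    if (rc == 0) {
        // child: point stdout at test.txt *before* exec,
        // so the redirection is inherited by cat
        int fd = open("test.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        dup2(fd, STDOUT_FILENO);
        close(fd);
        char *argv[] = {"cat", "main.go", NULL};
        execvp(argv[0], argv);   // never returns on success
        perror("execvp");
        return 1;
    }
    // parent: doesn't care what the child does, just waits for it
    wait(NULL);
    return 0;
}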

Other API

RTFM(Read the friendly manpage)

TODO, further thinking: why is the cost of a process context switch higher than a thread context switch?

limited direct execution

We want to virtualize our resources without paying a performance cost.

User mode & Kernel mode

Only kernel mode can touch the physical resources directly.

A system call is the mechanism that lets the OS expose key features to user applications:

  • I/O
  • Process operations
  • Inter-process communication

Trap

A standard data-exchange interface for system calls.
When a system call happens:

Call fork() -> trap ->
save the current register state into the trap frame ->
look up the trap table to find where the system-call code lives ->
jump there to run the system-call code, carrying the system-call number ->
raise the interrupt ->
the OS receives the interrupt, reads the system-call number, and calls the matching function ->
the function reads its arguments from the trap frame (a sketch of this dispatch step follows)
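
A hedged sketch of that final dispatch step, loosely modeled on xv6's syscall(); the trapframe layout, table name, and NSYSCALL bound are simplified and illustrative, not the exact xv6 definitions.

// Simplified trap frame: the syscall number comes in through eax and the
// return value travels back the same way.
struct trapframe {
    int eax;
    /* ... the other saved registers ... */
};

typedef int (*syscall_fn)(struct trapframe *tf);

#define NSYSCALL 32
extern syscall_fn syscall_table[NSYSCALL];   // indexed by system-call number

void syscall_dispatch(struct trapframe *tf) {
    int num = tf->eax;
    if (num > 0 && num < NSYSCALL && syscall_table[num])
        tf->eax = syscall_table[num](tf);    // handler reads arguments from the trap frame
    else
        tf->eax = -1;                        // unknown system call
}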

Limited Direct Execution

How do we make sure the OS gets to run?

If a process never yields on its own, how do we make sure the OS still gets time on our CPU?
Through the timer (clock) interrupt, provided by the hardware.

However, because this trap is not initiated by the user, the registers have to be saved by the hardware (much like what happens on a function call); the hardware then triggers the trap handler -> the kernel jumps into switch() -> another process executes.
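
A hedged C sketch of that clock-interrupt path; tick_handler, save_context, pick_next, and swtch are illustrative names (bodies omitted), not the literal xv6 symbols, and the real switch routine is assembly.

struct proc;                                      // process control block
extern struct proc *current;                      // process running on this CPU

void save_context(struct proc *p);                // stash registers captured at trap time
struct proc *pick_next(void);                     // scheduling policy picks who runs next
void swtch(struct proc *prev, struct proc *next); // swap kernel stacks/registers

void tick_handler(void) {
    save_context(current);            // state was already saved by hardware + trap code
    struct proc *next = pick_next();
    if (next != current) {
        struct proc *prev = current;
        current = next;
        swtch(prev, next);            // resume 'next' from its saved context
    }
    // when this process is scheduled back in, execution continues right here
}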

Note: rebooting is useful

Time has already proven that rebooting a computer is a good way to bring application state back under our control.
It is also an easy thing to automate.

So be positive about rebooting: it is not a bad way to solve a problem, it is a reliable way to solve the problem XD

The truth about context switches

Why does a context switch hurt performance so much?

We may have heard that it is because the CPU has to reload the process state.
But what exactly is this "process state"?

It actually bundles together a lot of things, including:

  • Restoring register state
  • Refilling the TLB & caches
    • TLB misses
    • Cache misses
    • Page faults
  • Maintaining cache coherence
  • Rebuilding branch-predictor state
  • etc.

Restoring the registers is actually not the biggest cost; it is quite cheap.

Space sharing

Every memory address we see is a virtual address.

Only the OS and the hardware know the physical addresses.

Note: How malloc & free work

These are actually C library functions, built on the primitives listed below (a sketch using them directly follows the list).

reference: https://blog.csdn.net/dog250/article/details/80331590

Source code: https://github.com/lattera/glibc/blob/master/malloc/malloc.c

  • brk
  • mmap && munmap
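
A minimal sketch (not malloc's real code) of grabbing memory through those two primitives directly:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    // brk/sbrk: grow the heap segment by one page
    void *old_brk = sbrk(0);
    sbrk(4096);
    printf("heap grew from %p\n", old_brk);

    // mmap: ask the OS for an anonymous mapping, independent of the heap
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED)
        munmap(p, 4096);
    return 0;
}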

Memory layout

The uninitialized/initialized data segments also count as part of the heap in this layout.

How is the offset calculated?

With the base address and the size (bounds).

Usually the segment is encoded explicitly in the virtual address: the top 2 bits say whether it belongs to the stack, heap, or code (text) segment.

So the code that finds the physical address looks like this:

// get top 2 bits of 14-bit VA
Segment = (VirtualAddress & SEG_MASK) >> SEG_SHIFT
// now get offset
Offset = VirtualAddress & OFFSET_MASK
if (Offset >= Bounds[Segment])
    RaiseException(PROTECTION_FAULT)
else
    PhysAddr = Base[Segment] + Offset
    Register = AccessMemory(PhysAddr)

Are Bounds and Base different per process, or per segment? (Each process has its own set of values, with one base/bounds pair per segment.)

How is the stack calculated?

The stack grows in the opposite direction; how do we represent that?
Quite simply: add a grows-in-reverse flag (one bit),
or hard-code the stack calculation as a subtraction.
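
A hedged sketch of that grows-negative case; the 14-bit address split and the constants below are illustrative, matching the toy segmentation code above.

#include <stdint.h>

#define SEG_SHIFT     12
#define SEG_MASK      0x3000
#define OFFSET_MASK   0x0FFF
#define MAX_SEG_SIZE  4096        // 4 KiB per segment in this toy layout

extern uint32_t Base[4];
extern int GrowsNegative[4];      // the one-bit "reverse" flag per segment

uint32_t translate(uint16_t va) {
    int seg = (va & SEG_MASK) >> SEG_SHIFT;
    int32_t offset = va & OFFSET_MASK;
    if (GrowsNegative[seg])
        offset -= MAX_SEG_SIZE;   // e.g. an offset of 3 KiB becomes 3 KiB - 4 KiB = -1 KiB
    // bounds check omitted; the physical address is the base plus the
    // (possibly negative) offset
    return Base[seg] + offset;
}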

Sharing

Some segments need to be shared,

e.g. dynamic libraries.

So there are actually protection bits to mark them.

Q: these mappings are all per process, so how do I access the shared memory?

Q: Why does the stack grow from high addresses toward low ones?

To make better use of space: the heap and the stack can then share the same unused address range between them.

Schedule

There are two ways to record memory-space usage.

For example, suppose we have the following memory space:

| 0~9  | 10~19 | 20~29 |
| ---- | ----- | ----- |
| free | used  | free  |
  1. Record used space

     used list: root -> 10~19

  2. Record free space

     free list: root -> 0~9 -> 20~29

There is no big difference so far. However, look at another example:

| 0~9  | 10~19 | 20~29 |
| ---- | ----- | ----- |
| used | used  | free  |
  1. Record used space

     used list: root -> 0~9 -> 10~19

  2. Record free space

     free list: root -> 20~29

This shows that with the second approach (recording free space) the time to find a free block is always O(1), whereas with the first approach (recording used space) we may have to walk to the end of the list, giving a worst-case time complexity of O(n).

The second approach still has a disadvantage: one contiguous free region may end up recorded as two separate blocks, such as:

     free list: root -> 10~19 -> 20~29 (two adjacent free blocks stored separately)

We have to coalesce (merge) them ourselves when we manage the free list.
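
A minimal sketch of that coalescing step, assuming a singly linked free list kept sorted by starting address; the node layout here is illustrative.

// Free-list node: [start, start + size) is a free range.
struct freenode {
    unsigned long start;
    unsigned long size;
    struct freenode *next;
};

// Merge physically adjacent blocks, e.g. 10~19 and 20~29 become 10~29.
void coalesce(struct freenode *head) {
    struct freenode *cur = head;
    while (cur && cur->next) {
        if (cur->start + cur->size == cur->next->start) {
            struct freenode *dead = cur->next;
            cur->size += dead->size;   // absorb the neighbor
            cur->next = dead->next;    // unlink it (returning its node is omitted)
        } else {
            cur = cur->next;           // not adjacent, move on
        }
    }
}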

Chunk header

The magic field is used to check whether the header is intact, i.e. it serves as a checksum.
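
Roughly the header layout OSTEP sketches in its free-space-management chapter; the check-at-free helper here is my own illustration. The header sits immediately before the pointer handed back to the user.

#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t size;      // size of the region that follows the header
    int magic;        // written at allocation time, checked at free time
} header_t;

#define MAGIC 1234567

// On free(ptr), step back over the header and verify the magic value.
void check_header(void *ptr) {
    header_t *hptr = (header_t *)ptr - 1;
    assert(hptr->magic == MAGIC);
}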

Schedule policy

  1. Best fit

Traverse the list and find the smallest free block that satisfies the request, roughly:

// pick the smallest free block that still fits the requested size
fitNode := &node{val: math.MaxInt}
for _, n := range nodes {
    if n.val >= request && n.val < fitNode.val {
        fitNode = n
    }
}

0~29

| 0~9  | 10~19 | 20~29 |
| ---- | ----- | ----- |
| used | used  | used  |

free(obj1) 40
obj2 50

free C -> Mark

  • sbrk -> mmap