Linux Notes

# Outline

[TOC]

# Intel x86 Architecture

* **Motherboard of a computer**
    ![image](https://hackmd.io/_uploads/BJ9E3CUL6.png =70%x)

# Intel x86 Registers

* **General Purpose Register**
    ![image](https://hackmd.io/_uploads/B1fdZamDa.png )
* **Table Register (System Address Register)**
    ![image](https://hackmd.io/_uploads/rJC_2RIIT.png )
* **Control Register**
    CR0 ~ CR7: registers used to configure the CPU's behavior
    ![image](https://hackmd.io/_uploads/HkwufT7D6.png )

# Linux OS Kernel

* **Process init**
    The init process is configured through the `/etc/inittab` file. It starts a getty process, which shows the login prompt so the user can enter a username and password, and finally starts a login shell session.

# IA32 Process Address Space Layout

* **Address space of a process**
    * total space: 4GB
    * split into two parts:
        * the first 3GB is user address space: 0x00000000 ~ 0xBFFFFFFF
        * the last 1GB is kernel address space: 0xC0000000 ~ 0xFFFFFFFF
    * ==Diagram==:
    ![image](https://hackmd.io/_uploads/ByX3rmhUT.png )

> kernel control path (3 ways to enter the kernel):
> * system call
> * exception
> * interrupt

# Memory Addressing

* **Logical and Linear (Virtual) addresses both express translation rules; think of them as two different notations on a blueprint. A Physical address is the actual digital signal sent over the bus.**
* **Logical Address**
    * the address used inside the program
    * made of two parts:
        * segment base address (provided by a segment register)
        * offset (the distance from the start of the segment to the actual address)
* **Linear/Virtual Address**
    * the addresses shown in a program layout diagram are linear addresses
    * 32-bit
    * 2^32 = 4GB in total
    * From 0x00000000 to 0xFFFFFFFF
* **Physical Address**
    * the real address in memory
    * 32-bit
* **The idea of relocatable code**
    * a segment is one way of partitioning memory into blocks
    * the source code of an executable can be spread across different files, EX: gcc ./a.out ./b.out
    * offsets never need to change; only the starting address of the segment is added
    * two ways to express an address:
        1. segment_address + offset
        2. 
0 + (segment_address + offset), treating the whole thing as the offset --- if every segment starts at 0, segmentation effectively disappears

![image](https://hackmd.io/_uploads/SybSpKuUT.png)

# Address Translation

![image](https://hackmd.io/_uploads/Sybd92rB6.png =80%x)

* **Segmentation Unit**
    * each segment is one contiguous block of memory (at most 2^32 = 4GB)
    * each segment is described by a segment descriptor (8-byte)
        * the descriptor records: the starting address, the size, and the read/write permissions
    * segmentation unit:
        * a 16-bit segment selector -> used to find the segment base address
        * a 32-bit offset
* **Segmentation Register**
    * 6 in total: CS DS SS ES FS GS, each 16 bits
    * a segmentation register holds a segment selector
    * CS: points to the code segment
    * DS: points to the data segment
    * SS: points to the stack segment
    * ES FS GS: general purpose
* **CPU Privilege Level**
    * the RPL field in the CS register is also called the CPL
    * kernel mode: CPL = 0
    * user mode: CPL = 3
    * two checks when accessing a segment (a smaller value means higher privilege):
        * CPL <= DPL
        * RPL <= DPL
* **Segment Descriptor**
    * segment descriptors are stored in the GDT or in an LDT
    * each CPU has exactly one GDT; the number of LDTs varies with the number of processes
    ![ppt2](https://hackmd.io/_uploads/SJg52DO8a.jpg)
* **Segment Descriptor Format**
    * Base field (32)
    * G granularity flag (1): 0 (byte); 1 (4K bytes)
    * Limit field (20)
    * DPL (Descriptor privilege level) (2)
* **Global Descriptor Table(GDT)**
    * Diagram:
    ![ppt2-66](https://hackmd.io/_uploads/rkVlg9_8p.jpg =70%x)
    * data type: per-CPU
    ```
    Define: DEFINE_PER_CPU(struct delayed_work, vmstat_work)
    Access: per_cpu(vm_event_states, cpu)
    ```
    * Define GDT Table: per-CPU gdt_page
    * 21 entry (each entry 8-byte)
    * entry16: ==**TSS (Task State Segment)**==
        Each CPU has one TSS (a per-CPU variable). It always refers to the same block and is used by whichever process is currently running.
        All TSSs are stored, in order, in each CPU's init_tss variable.
        ![image](https://hackmd.io/_uploads/rkd4TR8U6.png)
    * entry6,7,8: TLS (Thread Local Storage)
        define keyword: \_\_thread
* **4 Main Linux Segments**
    ![image](https://hackmd.io/_uploads/BymJ6WhLp.png)
    * defined through macros: `__USER_CS`, `__USER_DS`, `__KERNEL_CS`, and `__KERNEL_DS`
    * To address the kernel code segment, the kernel loads the value yielded by the `__KERNEL_CS` macro into the cs segmentation register
* **Privilege Level Change**
    ==Past exam question==
    In x86 protected mode, when the CS register changes, why must all the other segment
selector registers (such as DS, SS, ES, FS) be changed at the same time? Because this ensures that the RPL of every segment register is the same as that of the CS register.

# Paging - Virtual to Physical

## Basic Terms

* **What the Paging Unit does**
    1. translates linear addresses to physical addresses
    2. checks access rights: verifies that the requested access type is permitted for the linear address
* **Page & Page Frame**
    1. A page frame is like a container (a block of memory that can hold something); a page is the data placed in the container
    2. The starting address of a page frame is always a multiple of 4KB
* **PG Flag**
    1. Paging is enabled by setting the PG flag of the control register cr0
    2. Right after boot, when the CPU has just switched from real mode to protected mode, ==PG flag = 0 and the virtual address equals the physical address==

## Virtual Address Fields

32-bit divided into 3 parts:

| Page Directory(10) | Page Table(10) | Offset(12) |
| -------- | -------- | -------- |

![ppt3-18](https://hackmd.io/_uploads/HJabuQ8HT.jpg =80%x)

## 2 Steps of Translating a Linear Address

![image](https://hackmd.io/_uploads/ByjBSb8H6.png =60%x)

* **Level 1: Page Directory(G)**
    * each process has exactly one
    * the physical address of the Page Directory is placed in cr3
    * 1024 entries in total (4 byte each)
    * entries 0 ~ 767 translate the user address space
    * entries 768 ~ 1023 translate the kernel address space
    * A process's address space is 4GB, and all 4GB (2^32) are translated through the 1024 entries of the Page Directory(G):
        2^32 / 2^10 = 2^22
        so each entry covers 2^22 (4MB) of the address space
* **Level 2: Page Table(G)**
* **Structure of Page Directory(G) & Page Table(G)**
    ![image](https://hackmd.io/_uploads/SJbmUXIH6.png)
* **EXAMPLE**
    ![image](https://hackmd.io/_uploads/Bk6JImUBT.png)
* **Diagram: how virtual addresses map to physical memory**

## Extended Paging

* Diagram:
    ![ppt3-35](https://hackmd.io/_uploads/HyqRXp6ra.jpg =60%x)
* a page frame can be 4MB
* useful for translating **large contiguous linear address ranges to corresponding physical ones**
* the middle level, the capitalized Page Table(G), is skipped
* enable extended paging by setting:
    * Page Directory: page size flag
    * CR4: PSE flag
* virtual address layout:

| Page Directory(10) | Offset(22) |
| -------- | -------- |

* 2^22=4MB

## Hardware Protection Scheme

* NX bit (No eXecute bit)
    * in a 64-bit (8-byte) entry, the leftmost bit (bit 63) controls whether the corresponding page frame may be executed

## The Physical Address Extension (PAE) Paging Mechanism

**Q1: CPU registers such as EIP, ESP, are still 32 bits; 
thus, how can a 32-bit virtual address be translated into a 36-bit physical one?**
**Q2: The physical addresses a program's virtual addresses can translate to cover at most 4GB worth of combinations; how is the extra 12GB handled?**

* Enabled by setting PAE flag in cr4 register
* physical address: 36bit (2^36=64GB) (there are 36 address lines)
* each page frame = 4KB
* 2^36(64GB) / 2^12(4KB) = 2^24 (24 bits are needed to number the page frames)
* **4KB page size in PAE**
    ![ppt3-55](https://hackmd.io/_uploads/B1B0hL_Ia.jpg)
    > PDPT: 8 byte, 4 entry
    > PDE: 8 byte, 512 entry (total size: 4KB)
    > PTE: 8 byte (holding a 24-bit page frame number), 512 entry
* **2MB page size in PAE**
    ![ppt3-57](https://hackmd.io/_uploads/r1FCnIdUa.jpg)
    * PAE + extended page: the PTE level is removed, and the PDE points directly to a 2MB (2^21) page frame holding the code

**PAE adds one level**
**Extended paging removes one level**

* Linux: if PAE is enabled => there is no PUD

## Paging in Linux(3-9-4)

* Linux has four levels:
    * Page Global Directory
    * Page Upper Directory
    * Page Middle Directory
    * Page Table(G)

    ![image](https://hackmd.io/_uploads/BJ_DCV18a.png)
* Code:
    ```c
    #define PGDIR_SHIFT 22
    #define PTRS_PER_PGD 1024

    /*
     * the pgd page can be thought of an array like pgd_t[PTRS_PER_PGD]
     *
     * this macro returns the index of the entry in the pgd page which would
     * control the given virtual address
     */
    #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
    // shifting the 32-bit address right by 22 bits leaves the leftmost 10 bits

    #define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
    // type of pgd: pgd_t (4 byte)

    static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
    {
        return (pud_t *)pgd;
    }

    static inline pmd_t * pmd_offset(pud_t * pud, unsigned long address)
    {
        return (pmd_t *)pud;
    }
    // pgd, pud and pmd all refer to the same 4 bytes of the pgd entry
    ```
    ```c
    static void machine_kexec_page_table_set_one(
        pgd_t *pgd, pmd_t *pmd, pte_t *pte,
        unsigned long vaddr, unsigned long paddr)
    {
        pud_t *pud;

        pgd += pgd_index(vaddr);
        ...
        pud = pud_offset(pgd, vaddr);
        pmd = pmd_offset(pud, vaddr);
        ...
    ```
* Applied to a 2-Level Paging System
    * on 32-bit without PAE, the PMD and PUD are set to 0 bits wide, which eliminates these two levels
* I/O Ports
    * I/O ports are also mapped into the physical address space
* Default Physical Addresses Used by Kernel
    * default address space 0xC0000000
    * ==At boot, the kernel is loaded into memory at physical address 0x01000000 (16M)==
    * Why? When a PC starts, before Linux is loaded into memory and takes control of the system, the BIOS runs in real mode to test and probe the hardware, start the operating system, and do some hardware initialization. During this stage the BIOS needs specific memory at fixed addresses for its own use.

# Kernel Initializes Its Own Page Tables

## Phase 1

* the kernel builds a limited address space that contains:
    * kernel’s code segment
    * kernel’s data segments
    * initial page tables(b)
    * 128 KB for some dynamic data structures

    This minimal address space is just large enough to install the kernel in RAM and initialize its core data structures
* Level 1: initial_page_table
    ![image](https://hackmd.io/_uploads/r1HITzFu6.png)
    > 1024 entries, 4 byte each, all initialized to 0
* Level 2: __brk_base
* Physical Address Layout
    ![ppt5-14](https://hackmd.io/_uploads/BJH_6MY_6.jpg =80%x)
    > 16MB(reserved memory) + 7MB(vmlinux size) + 1MB(MAPPING_BEYOND_END)
    > The boot loader puts the Linux kernel at physical address 0x01000000
* Diagram:
    ![ppt5-20](https://hackmd.io/_uploads/rkU06GFdp.jpg =80%x)
    > in initial_page_table, every entry other than 0~5 and 768~773 is 0
* there are two virtual address ranges (each uses 6 page tables to map RAM):
    1. **0x00000000 ~ 0x017fffff** (24MB)
        the translated value is identical to the input; during the initial phase, physical addresses only fall within this range
    2. 
**0xc0000000 ~ 0xc17fffff** (24MB)
        virtual address - 0xc0000000 (start of the kernel address space) = physical address
* **Enable the Paging Unit**
    Load the physical address of initial_page_table into cr3
    Set the PG flag of cr0
    ![image](https://hackmd.io/_uploads/ryab0GY_6.png =80%x)

## Phase 2

* Finish the PGD
    * ==Frequently tested==
        **The final mapping provided by the ==kernel Page Tables== must transform virtual addresses starting from ==0xc0000000== to physical addresses starting from 0x00000000** (only the last 256 entries are involved)
    * the size of the linear mapping region is determined by the following two config options:
        * CONFIG_NOHIGHMEM
            * set -> the kernel can only access physical memory below 1024MB
                * case1: RAM size is less than 895MB
                * case2: RAM size is between 895MB and 1024MB
        * CONFIG_HIGHMEM
            * set -> the kernel can access physical memory above 1024MB
                * case1: RAM size is less than 887MB
                * case2: RAM size is between 887MB and 4096MB
                * case3: RAM size is more than 4096MB
    * ==Guaranteed exam question==
        A memory cell that holds code or data of the user address space has 2 virtual addresses: one above 0xc0000000 and one below 0xc0000000
        * for the kernel to access any physical memory cell, that cell needs an address above 0xc0000000
        * if the memory cell holds data of some process's user address space, it also has an address below 0xc0000000
* **Case 1: When RAM Size Is Less Than 887MB**
    ![ppt5-39](https://hackmd.io/_uploads/rJ0mRfK_6.jpg =80%x)
* **Case 2: When RAM Size Is between 887MB and 4096MB(4GB)**
    * the kernel's virtual address space is only 1GB, so it cannot linearly map all 4GB of RAM
    * so in the initial phase the kernel linear address space still maps only 887MB of RAM
    * reaching the remaining 3GB requires changing the translation table
    ![ppt5-43](https://hackmd.io/_uploads/H13V0fFu6.jpg =80%x)
    * mapping diagram:
    ![ppt5-44](https://hackmd.io/_uploads/rJIHCfKO6.jpg =80%x)
* **Case 3: When RAM Size Is More Than 4096MB**
    * PAE must be enabled
    * the linearly mapped physical memory is still only 887MB
    * handled the same way as Case 2: only 887MB of RAM is linearly mapped, the rest stays unmapped and is reached through dynamic remapping
    * mapping diagram:
    ![ppt5-49](https://hackmd.io/_uploads/HJlSUCMFup.jpg =80%x)

# Fix-Mapped Linear Addresses

* out of the 4GB, about 137MB of linear addresses are reserved so the kernel can use them for noncontiguous memory allocation and fix-mapped linear addresses
* a fix-mapped linear address is a constant linear address (like 0xffffc000)
* 
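each fix-mapped linear address maps to **1 page frame** of physical memory
* how fix_to_virt() works
    * input of the function: x
    * output of the function: the virtual address 2^32 - (x+1) * 4K, i.e. 0xfffff000 - x * 4K (x = 3 gives 0xffffc000)
    * Diagram:
    ![ppt5-56](https://hackmd.io/_uploads/HJEw0MKuT.jpg =80%x)

# Introduction of Process

* **Three things are created together when a process is created:**
    * process descriptor (stored in dynamic memory)
    * kernel mode stack
    * translation table
* **fork()**
    ```c=
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void)
    {
        char str[256];

        while (fgets(str, sizeof str, stdin)) {
            str[strcspn(str, "\n")] = '\0';   // strip the trailing newline
            pid_t pid = fork();
            if (pid > 0) {                    // non-zero -> parent code
                wait4(pid, NULL, 0, NULL);
            } else if (pid == 0) {            // child code
                char *argv[] = { str, NULL };
                execve(str, argv, environ);
                _exit(127);                   // reached only if execve() fails
            }
        } // end while
        return 0;
    } // main()
    ```
    > When a process reaches fork():
    > it enters the kernel
    > the kernel creates the child's process descriptor, translation table, and kernel mode stack
    > once they are created
    > execution returns to the user address space at the fork() line, and the parent receives the child PID as the return value
    > the child usually runs first

## Process Descriptor

* **task_struct: the data type of the Process Descriptor**
    ![image](https://hackmd.io/_uploads/SJvcRGFu6.png)
    [Linux Source Code task_struct](https://elixir.bootlin.com/linux/v3.9/source/include/linux/sched.h#L1201)
* **Guaranteed exam question** ==Lecture video: 11/20 28:00==
    When adding a field to the process descriptor, do not insert it in the middle; append it at the end.
    Some code written in assembly accesses fields directly as the starting address plus an offset,
    so an insertion would shift every field after it.
* **volatile**
    * purpose: tells the compiler not to optimize away accesses to the variable
    * example:
    ```c=
    volatile int t, k;   // volatile: every access must really touch memory

    // As written; with volatile the compiler must keep the store, the load and the test
    int foo(void)
    {
        t = 3;
        k = t;
        if (k > 3)
            return 8;
        else
            return 10;
    }

    // Without volatile, the compiler may reduce foo() to this
    int foo_optimized(void)
    {
        return 10;
    }
    ```
    > If an interrupt occurs after k = t and changes k so that k != 3, the other branch can be taken
* **Process Descriptor Pointer**
    Points to the address of the first byte of the process descriptor
    The process descriptor lives in dynamically allocated memory
* **Process Descriptor field: state**
    * TASK_RUNNING
        > EX: p->state = TASK_RUNNING ;
    * TASK_INTERRUPTIBLE
        > execution is suspended temporarily
        > the process is placed in a wait queue
    * TASK_UNINTERRUPTIBLE
        > same as above (TASK_INTERRUPTIBLE)
        > but a signal cannot wake it up
    * `__TASK_STOPPED`
        > stopped temporarily after receiving one of 4 specific signals
        > SIGSTOP SIGTSTP SIGTTIN SIGTTOU
    * `__TASK_TRACED`
    * TASK_WAKEKILL
    * TASK_KILLABLE
    * EXIT_ZOMBIE
    * EXIT_DEAD

    > A signal is called a fatal signal in the following two cases. When the process receives the signal:
    > (1) it cannot be caught, and the kernel kills the process by default
    > (2) it can be caught, but if it is not caught the kernel also kills the process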
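
The two fatal-signal cases above can be observed from user space. The block below is a minimal sketch, not kernel code: `signal_is_catchable()` and `uncaught_signal_kills_child()` are hypothetical helper names introduced here, built only on the POSIX calls `sigaction()`, `fork()`, `raise()` and `waitpid()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void noop_handler(int signo) { (void)signo; }

/* Case (1): returns 1 if a handler can be installed for signo,
 * 0 if the kernel refuses (as it does for SIGKILL and SIGSTOP). */
int signal_is_catchable(int signo)
{
    struct sigaction sa, old;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old) == -1)
        return 0;
    sigaction(signo, &old, NULL);   /* restore the previous disposition */
    return 1;
}

/* Case (2): returns 1 if a child that does not catch signo is killed by it.
 * Do not call this with a stop signal: the child would stop, not exit. */
int uncaught_signal_kills_child(int signo)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {                 /* child */
        sigset_t set;

        sigemptyset(&set);
        sigaddset(&set, signo);
        sigprocmask(SIG_UNBLOCK, &set, NULL);
        signal(signo, SIG_DFL);     /* make sure nothing catches it */
        raise(signo);
        _exit(0);                   /* reached only if the signal was not fatal */
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == signo;
}
```

Under these assumptions, SIGKILL falls under case (1) because no handler can be installed for it, while SIGTERM falls under case (2): it is catchable, yet a child that leaves the default action in place is terminated by it.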
* **3 ways to send a signal**
    * the kill command
    * the kill() system call
    * keyboard
* **Macro Current**
    * current is a function defined through a macro
        > #define current get_current()
        > calling get_current() returns the process descriptor pointer of the process that is currently running
    * per-CPU variable: current_task
        > stores the address of the current process's process descriptor

## Kernel Mode Stack

* **data type: thread_info**
    ![ppt6-39](https://hackmd.io/_uploads/S1BwJ7Y_a.jpg =80%x)

# Linked List

* **Data Structure: struct list_head**
    ![image](https://hackmd.io/_uploads/SkHjy7KOa.png)
* **Macro: LIST_HEAD(name)**
    Creates a variable of type list_head (the head of a linked list)
    Initialization makes both prev and next point to the variable itself
    Diagram of an empty doubly linked list:
    ![ppt6-59](https://hackmd.io/_uploads/Sy1CkmKOT.jpg =50%x)
* **Macro: list_entry(p, t, m)**
    Finds the starting address of the larger data structure that contains a list_head
    list_entry(ptr, type, member)
    * ptr: pointer to the list_head
    * type: data type of the larger data structure
    * member: name of the list_head field inside it
    * example:
    ![ppt6-63](https://hackmd.io/_uploads/r1RCJmK_p.jpg =80%x)
    ![ppt6-66](https://hackmd.io/_uploads/B1Y1lQFO6.jpg =80%x)
* **Data structure: hlist_head, hlist_node**
    Another kind of list structure, diagram:
    ![ppt6-69](https://hackmd.io/_uploads/r1wgx7KOa.jpg =80%x)
    ![image](https://hackmd.io/_uploads/HJzHxmt_T.png =70%x)
    ![image](https://hackmd.io/_uploads/rJaSxXt_a.png =70%x)
    * \*\*prev (`struct hlist_node **pprev` in the kernel source): a pointer to a pointer; it stores the address of the previous node's \*next field, or of the first-node pointer inside hlist_head for the first node
* **Process list**
    * circular doubly linked list
    * links together the process descriptors of thread group leaders
    * every task_struct contains a tasks field (data type: list_head) whose prev and next point to the tasks field of the previous and next task_struct
    * for_each_process: the macro used to walk this linked list
    ![image](https://hackmd.io/_uploads/BkIclXY_6.png)
    ![ppt6-74](https://hackmd.io/_uploads/HyaclQtda.jpg =70%x)

# Linux Scheduler

* **kernel function schedule()**
    * a process switch always happens inside the kernel, through a call to schedule()
* **struct sched_class**
    * the data type of a scheduling class
    ![ppt6-99](https://hackmd.io/_uploads/ByNhemKda.jpg =70%x)
    * source code
    ![image](https://hackmd.io/_uploads/Hkw0eXK_T.png)
    * the first field of sched_class is called
next; it points to the next class type, which has lower priority
    * the 4 types form a linked list: stop_sched_class → rt_sched_class → fair_sched_class → idle_sched_class → NULL
        * **stop sched class**: used to schedule the per-cpu stop task; after it runs, the cpu is about to be shut down
        * **real time sched class**: its set of processes changes dynamically; it has its own sub-run queue and its own scheduling policy
        * **completely fair sched class**: its set of processes changes dynamically
        * **idle sched class**: this is process 0; used to schedule the per-cpu idle task (also called the swapper task)
    * run queue:
    ![image](https://hackmd.io/_uploads/r1DZbXtda.png =70%x)

# Run Queue

* a process never appears in two sub-run queues at the same time
* nor does it run on more than one cpu at the same time
* **data type: struct rq**
    ```c
    struct rq {
        struct cfs_rq cfs;
        struct rt_rq rt;
        ...
        struct task_struct *curr, *idle, *stop;
        unsigned long next_balance;
        struct mm_struct *prev_mm;
        ...
    };
    ```
    * cfs: the sub-run queue of the completely fair sched class
    * rt: the sub-run queue of the real time sched class
    * (task_struct*)curr: points to the process descriptor of the running process
    * idle: points to the process descriptor of the idle process

## CFS Run Queue

* Red-Black Tree
    * root/leaves(NIL): black
    * red node: 2 black child nodes
* CFS Run Queue of a CPU:
    ![ppt6-112](https://hackmd.io/_uploads/ryQfZQF_a.jpg)
    source code: from struct rq -> (struct cfs_rq)cfs
    ![image](https://hackmd.io/_uploads/rk44ZQFdp.png =80%x)
    > tasks_timeline: points to the root
    > \*rb_leftmost: points to the process that has waited the longest
* other source code:
    ![image](https://hackmd.io/_uploads/H1SPWmYOT.png =80%x)
    > line 1217: the se field
    > the memory of a tree node lives in (task_struct)process descriptor -> (sched_entity)se -> (rb_node)run_node

    ![image](https://hackmd.io/_uploads/ByjtZXK_6.png =80%x)
    > line 1141: the run_node field

    ![image](https://hackmd.io/_uploads/HyOjZ7YuT.png =80%x)

# Wait Queue

![image](https://hackmd.io/_uploads/Hk3-GQt_p.png =80%x)

* introduction:
    * a set of sleeping processes
    * ... 
* implementation: doubly linked list
* **__wait_queue_head**
    typedef'd as wait_queue_head_t
    ![image](https://hackmd.io/_uploads/B17Uz7Yua.png =70%x)
    > * lock: for synchronization (interrupts and some kernel code both modify the wait queue, so the pointers could get corrupted)
    > * task_list: the head of a list of waiting processes
* **__wait_queue**
    typedef'd as wait_queue_t
    ![image](https://hackmd.io/_uploads/SkP9fXKua.png =70%x)
    ![image](https://hackmd.io/_uploads/By_jGQY_6.png =70%x)
    > * Sleeping process types(flags):
    > WQ_FLAG_EXCLUSIVE(1) or ~WQ_FLAG_EXCLUSIVE(0)
    > exclusive means the awaited resource is exclusive: many processes may wait for the same event -> avoids the Thundering Herd problem
    > * func
    > func is used to wake the process and put it on the run queue
* **Declare a wait queue head**
    * source code:
    ![image](https://hackmd.io/_uploads/SkYeXQt_p.png =70%x)
    > lock: initialized (in the unlocked state)
    > task_list: initialized so that \*next and \*prev both point to itself
    * Diagram: disk_wait_queue is a name chosen for the example
    ![image](https://hackmd.io/_uploads/BJG8B7FOT.png =70%x)
* **Initialize a wait queue element**
    * the element has already been defined; this function gives it its initial values
    * source code:
    ![image](https://hackmd.io/_uploads/r18dXQK_p.png)
    > initializes the fields of the node
    > parameter 1 wait_queue_t \*q: an ordinary wait queue node
    > parameter 2 task_struct \*p: points to the corresponding process descriptor
    > q -> flags: default nonexclusive (~WQ_FLAG_EXCLUSIVE(0))
    > q -> func: default_wake_function, which calls try_to_wake_up() to wake the process and add it to the run queue (the entry must still be removed from the wait queue afterwards)
* **Define a new wait queue element**
    * autoremove_wake_function() also wakes the process and adds it to the run queue, and additionally removes the entry from the wait queue
    * a custom wake-up function has to be installed through init_waitqueue_func_entry()
* **Functions to Add/Remove Elements from a Wait Queue**
    * used after an element has been defined, to add it to a queue
    * add_wait_queue(): flag = 0, added at the head of the linked list
    * add_wait_queue_exclusive(): flag = 1, added at the tail of the linked list
* **Functions called to put a process to sleep**
    * **sleep_on()** and **interruptible_sleep_on()**
        * Diagram:
        ![ppt8-24](https://hackmd.io/_uploads/BkRFQ7Y_6.jpg)
        * source code:
            ```c
            void sleep_on(wait_queue_head_t *wq)
            {
                wait_queue_t wait;

                init_waitqueue_entry(&wait, current);
                current->state = TASK_UNINTERRUPTIBLE;
                add_wait_queue(wq, &wait); /* wq points to the wait queue head */
                schedule();                    // returns after the process has been woken up and put on the rq
                remove_wait_queue(wq, &wait);  // the entry must still be removed from wq explicitly
            }
            ```
            > Suppose schedule() has 1000 lines and the process switch happens at line 500
            > after that, eip stays parked at line 501
            > the scheduler finds the highest-priority process and hands control to it
            > the state of the sleeping process is still saved
            > so when it is woken up later, the saved state is loaded back into the registers and execution continues
            >
            > Control moves among 3 processes
            > who I hand control to: prev hands over to next
            > who hands control back to me: x hands over to prev
        * causes a race condition, do not use it!
            > race condition: an error caused by a different order of execution
            ```c=
            while(we_have_to_wait)
                sleep_on(&some_wait_queue);
            ```
            > Reason: a context switch can happen right after line 1 executes
            > EX: a context switch occurs between the test and the call, and the awaited resource is taken by another process, so the process that goes to sleep never gets the resource
        * interruptible_sleep_on() differs in current->state = TASK_INTERRUPTIBLE; (a signal can wake this process up)
    * **sleep_on_timeout()** and **interruptible_sleep_on_timeout()**
        The difference: sleep_on() has to wait until an interrupt occurs or the resource is released;
        sleep_on_timeout() takes a timeout and wakes the process up when the time expires
    * **prepare_to_wait()** and **prepare_to_wait_exclusive()**
        * source code:
        ![image](https://hackmd.io/_uploads/S1ZamXtup.png)
        > line 74: checks whether the entry is already in the list; if not -> adds it to wq
        * \+ finish_wait()
        ![image](https://hackmd.io/_uploads/S1ny4mY_6.png)
        > prepares first, then decides whether to join wq
    * **wait_event()** and **wait_event_interruptible()**
        * source code:
        ![image](https://hackmd.io/_uploads/B1sGVQY_a.png)
        * \+ finish_wait()
        ![image](https://hackmd.io/_uploads/By1S4mK_a.png)
        > the difference: it first checks whether waiting is needed, and joins wq only if it is
* **Functions that wake a process up**
    * **wake_up macro**
        ```c
        void wake_up(wait_queue_head_t *q)
        {
            struct list_head *tmp;
            wait_queue_t *curr;

            list_for_each(tmp, &q->task_list) {
                curr = list_entry(tmp, wait_queue_t, task_list);
                if (curr->func(curr, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, NULL)
                    && curr->flags)
                    break;
            }
        }
        ```
        > func is called for each entry; assume it succeeds and returns 1
        > if flags = 0 (nonexclusive) -> keep waking up the following entries
        > if flags = 1 (exclusive) -> break, so only one exclusive process is woken

# Process Resource Limits

* **Where a process's resource limits are stored**
    * **current->signal->rlim**
        * field of the process descriptor: signal
        ![image](https://hackmd.io/_uploads/S1iwNQFuT.png)
        * field of signal_struct: rlim
        ![image](https://hackmd.io/_uploads/rksKV7YuT.png =80%x)
        ![image](https://hackmd.io/_uploads/BkiqVmtuT.png =80%x)
        * data
type: rlimit
        ![image](https://hackmd.io/_uploads/S1A3NQYua.png)
        > rlim_cur: the current limit on how much of the resource may be used
        > rlim_max: the maximum value the limit can be raised to
        * source code:
        ![image](https://hackmd.io/_uploads/SyPJSmY_6.png)

# Process Switch

* **Hardware Context**
    * the CPU registers are shared by all processes
    * so when a process switch happens, the register values have to be saved in memory
    * process execution context: code, data, translation table, process descriptor, kernel mode stack
    * Repository: saved in the process descriptor and on the Kernel Mode stack
* **Task State Segment Components Used by Linux**
    * used when going from user mode to kernel mode
    * through the TR register the hardware can directly find the TSS currently in use (GDT entry 16), which holds the address of the Kernel mode stack
    * ==Guaranteed exam question==
        The hardware context of a process is not saved in the TSS, so why does every CPU still need an init_tss for the running process? Because the TSS has to hold two fields:
        * the kernel mode stack address of the running process
        * I/O Permission Bitmap: consulted when code in the user address space performs I/O, to see which ports it may access

# SWITCH_TO

* Diagram
    ![ppt9-8](https://hackmd.io/_uploads/HyFSyft_a.jpg =85%x)
* Steps:
    1. **Load prev and next into eax and edx**
        > movl prev,%eax
        > movl next,%edx
    2. **Save eflags and ebp on the kernel mode stack of the currently running process**
        > pushfl
        > pushl %ebp
    3. **Save esp in prev->thread.sp**
        > movl %esp,484(%eax)
    4. **Load the value of next->thread.sp into esp**
        ==From now on, the kernel operates on the Kernel Mode stack of next, so this instruction performs the actual process switch from prev to next.==
        > movl 484(%edx), %esp
    5. **Store the address of label 1 (where prev will resume) in prev->thread.ip**
        > movl $1f, 480(%eax)
    6. **Push the value of next->thread.ip onto next's kernel mode stack**
        > pushl 480(%edx)
    7. **Jump to __switch_to()**
        > jmp __switch_to
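
The bookkeeping in steps 3 to 7 can be modelled in plain C. This is only a toy sketch under stated assumptions: `toy_cpu`, `toy_task`, `toy_thread` and `toy_switch_to()` are hypothetical names introduced for illustration, not kernel types, and steps 6 and 7 are collapsed into a single assignment (in the real code, the `ret` at the end of `__switch_to()` pops the address pushed in step 6 into eip).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for thread_struct, task_struct and the CPU state. */
struct toy_thread { uint32_t sp; uint32_t ip; };
struct toy_task   { struct toy_thread thread; };
struct toy_cpu    { uint32_t esp; uint32_t eip; };

/* Models steps 3-7: save prev's stack pointer and resume address,
 * then continue on next's kernel mode stack at next's saved address. */
void toy_switch_to(struct toy_cpu *cpu, struct toy_task *prev,
                   struct toy_task *next, uint32_t resume_label)
{
    assert(cpu && prev && next);

    prev->thread.sp = cpu->esp;      /* step 3: movl %esp, prev->thread.sp   */
    cpu->esp = next->thread.sp;      /* step 4: now on next's kernel stack   */
    prev->thread.ip = resume_label;  /* step 5: movl $1f, prev->thread.ip    */
    cpu->eip = next->thread.ip;      /* steps 6-7: pushl + ret, merged here  */
}
```

Switching from A to B and later from B back to A restores A's saved `sp` and `ip`, which is why A appears to continue right after its own switch as if nothing had happened.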