# MIT6.s081 Lab: xv6 lazy page allocation ###### tags: `mit6.s081` `作業系統 Operating System` 虛擬地址有兩個主要的優點: 1. 隔離性 (isolation):每個進程都有自己的地址空間,而 kernel 與 user space 之間也使用不同的映射關係 (除了處理 trap 的 trampoline),這讓程序之間不能互相修改對方的資料,而每個程序就像獨自擁有整個記憶體空間 2. 間接性 (level of indirection):虛擬地址提供了一層抽象,CPU 的指令都執行在虛擬地址上,由 kernel 定義這個映射關係。這層抽象為 kernel 的設計帶來很多彈性,可以完成很多有趣的設計 截至目前為止,記憶體映射的使用是靜態的,無論是 user space 或著 kernel space,都是在一開始就設置好,然後不再更動。當 CPU 無法從 page table 轉換虛擬地址時,會發出 page faule (缺頁錯誤),kernel 依照 page fault 的成因更新 page table,這是讓映射關係動態化的關鍵。那 kernel 需要哪些訊息,才能妥善地處理 page fault? - 錯誤發生的虛擬地址,這個地址會放到 **stval** 暫存器中 - page fault 發生的原因,根據指令的不同可分為三類,在 **scause** 會有對應的值來表示 - **Instruction page fault** : 12 - **Load page fault** : 13 - **Store page fault** : 15 - 導致 page fault 發生的程式碼地址,因為發生錯誤會陷入 kernel,想當然爾這個地址會存在 **sepc** 暫存器中及 trapframe->epc 當中。程式計數器非常重要,因為當修復 page fault 之後,通常指令就可以無錯誤的運行,所以 page fault 很多時候其實是刻意為之,是系統設計的一個特色 本次實驗要求修改 xv6 程式碼,讓 xv6 具有 lazy allocation 的特性。在 xv6 中我們透過 **sbrk** 系統呼叫為程序動態配置 heap 空間,**sbrk** 在一開始指向 heap 的底端、stack 的頂端。目前 xv6 在呼叫 **sbrk** 時,kernel 會立即為進程分配物理空間、將該空間初始化為 0 並建立映射,這種實現稱為 **eager allocation** ![](https://i.imgur.com/QDZBa59.png) 但很多時候,應用程式很難預測確切需要多少記憶體,所以會傾向申請多於程序所需要的空間,若 kernel 每次程序呼叫 **sbrk** 都立刻配置對應大小的物理空間,可想而知系統將有大量被浪費的記憶體 **lazy allocation** 在根本上解決了這個問題,其概念非常簡單,當我們調用 **sbrk** 時,只變更 `p->sz` 的數值為 `p->sz + n`,但 kernel 在這個時間點不會配置任何物理空間,稍後應用程序使用到新分配的地址時,由於我們還沒建立映射,這時將觸發 page fault,我們透過 handler 為該地址要求一塊物理空間並更新 page table,再重新執行原來的指令。kerenl 只在應用程序實際使用時,才分配物理空間,並透過 page fault 的機制,讓整個過程對 user 進程是透明的 ![](https://i.imgur.com/kdEE3zx.png) ## Eliminate allocation from sbrk() (easy) 第一題非常簡單,單純修改 `sys_sbrk`,讓它單純變更 `p->sz` 的值,但不實際配置物理空間 ```cpp kernel/sysproc.c uint64 sys_sbrk(void) { int addr; int n; if(argint(0, &n) < 0) return -1; addr = myproc()->sz; myproc()->sz += n; return addr; } ``` ## Lazy allocation (moderate) & Lazytests and Usertests (moderate) 第二題跟第三題要求修改 `kernel/trap.c`,當 page fault 發生時,為發生錯誤的虛擬地址配置新的物理頁 page fault 發生後陷入 kernel 的過程與其他 trap 無異,回想前一個實驗 traps,xv6 在 **usertrap** 判斷中斷類型並做相應的處理,我們一樣在 usertrap 增加 page fault 對應的處理機制 首先透過 **r_scause** 函式取得 RISC-V 的 **scause** 暫存器,若其中的數值為 13 或 15 代表為 page fault。然後藉由 **r_stval** 取得發生錯誤的虛擬地址,這個地址不應該超過目前進程的記憶體大小,也不該與 stack 重疊 (可透過 trapframe->sp 取得進程的 stack pointer),若超過的話依照實驗要求殺掉進程。最後呼叫 **kalloc** 及 **mappages** 分配及映射物理地址,依實驗要求無論 kalloc 或 mappages 失敗我們都直接殺掉進程 ```cpp kernel/trap.c void usertrap(void) { ... if(r_scause() == 8){ ... } else if ((r_scause() == 13) || (r_scause() == 15)) { uint64 va = r_stval(); // printf("page fault %p\n", va); if (va >= p->sz || va < p->trapframe->sp) { p->killed = 1; } else { char *mem = kalloc(); if (mem == 0) { p->killed = 1; } else { memset(mem, 0, PGSIZE); if(mappages(p->pagetable, PGROUNDDOWN(va), PGSIZE, (uint64)mem, PTE_W|PTE_R|PTE_U) != 0) { kfree(mem); p->killed = 1; } } } } else { ... ``` hints 提到 **uvmunmap** 會發生錯誤,這是因為修改前的 xv6 在呼叫 **sbrk** 時一定會配置物理空間,所以釋放物理頁時 pte 一定存在,而且 PTE_V bit 應該要被設置,但採用 lazy allocation 後,heap 有機會存在沒有實際分配物理頁的虛擬地址,可能是 page table 已經建立了但 pte 尚未使用過 (PTE_V = 0),或著 page table 根本就還沒有建立 (pte = 0),無論是哪種情況現在都是合理的,不該再引發 panic,所以直接忽略即可。另外,**fork** 呼叫 **uvmcopy** 也有相同的狀況,程式碼一併進行修改 ```cpp kernel/vm.c void uvmunmap(pagetable_t pagetable, uint64 va, uint64 npages, int do_free) { uint64 a; pte_t *pte; if((va % PGSIZE) != 0) panic("uvmunmap: not aligned"); for(a = va; a < va + npages*PGSIZE; a += PGSIZE){ if((pte = walk(pagetable, a, 0)) == 0) // panic("uvmunmap: walk"); continue; if((*pte & PTE_V) == 0) // panic("uvmunmap: not mapped") continue; if(PTE_FLAGS(*pte) == PTE_V) panic("uvmunmap: not a leaf"); if(do_free){ uint64 pa = PTE2PA(*pte); kfree((void*)pa); } *pte = 0; } } int uvmcopy(pagetable_t old, pagetable_t new, uint64 sz) { pte_t *pte; uint64 pa, i; uint flags; char *mem; for(i = 0; i < sz; i += PGSIZE){ if((pte = walk(old, i, 0)) == 0) // panic("uvmcopy: pte should exist"); continue; if((*pte & PTE_V) == 0) // panic("uvmcopy: page not present") continue; pa = PTE2PA(*pte); flags = PTE_FLAGS(*pte); if((mem = kalloc()) == 0) goto err; memmove(mem, (char*)pa, PGSIZE); if(mappages(new, i, PGSIZE, (uint64)mem, flags) != 0){ kfree(mem); goto err; } } return 0; err: uvmunmap(new, 0, i / PGSIZE, 1); return -1; } ``` 應用程式有可能縮小 heap,xv6 要能妥善處理這種情況,理論上直接呼叫 **uvmdealloc** 取消物理頁面的映射即可,但 user 有可能惡意傳進一個負值非常大的參數,我們必需先判斷縮小後的進程空間,是否高於 stack pointer,若低於 stack pointer 代表這是一個非法的參數 ```cpp kernel/sysproc.c uint64 sys_sbrk(void) { int addr; int n; if(argint(0, &n) < 0) return -1; struct proc *p = myproc(); addr = p->sz; // heap should not overlap stack if (addr + n < p->trapframe->sp) return -1; // if grow negative, dealloc memory map if (n < 0) uvmdealloc(p->pagetable, p->sz, p->sz + n); p->sz += n; return addr; } ``` 最後 hints 提醒我們一種狀況,我們以 **sbrk** 跟 kernel 要了一塊空間,但因為 lazy allocation 的特性,這塊空間尚未實際分配物理頁,此時我們執行**系統呼叫**,希望對這個空間做一些記憶體的操作,如 read、write 等,但在 kernel space 觸發 page fault 時不會進入 usertrap,無法透過 trap 機制動態配置記憶體 我們以 **write** 為例,呼叫的過程為 write (syscall) --> filewrite (kernel/file.c) --> writei (kernel/fs.c) --> either_copyin (kernel/proc.c) --> copyin (kernel/vm.c) --> walkaddr (kernel/vm.c),xv6 在 kernel space 下,user 虛擬地址的映射都必須透過 **walkaddr** 進行查找 以下為 **copyin** 及 **walkaddr** 的原始實作,walkaddr 在 pte = 0 以及 PTE_V = 0 的情況都直接返回 0,copyin 因此認定為異常而結束 ```cpp kernel/vm.c int copyin(pagetable_t pagetable, char *dst, uint64 srcva, uint64 len) { uint64 n, va0, pa0; while(len > 0){ va0 = PGROUNDDOWN(srcva); pa0 = walkaddr(pagetable, va0); if(pa0 == 0) return -1; n = PGSIZE - (srcva - va0); if(n > len) n = len; memmove(dst, (void *)(pa0 + (srcva - va0)), n); len -= n; dst += n; srcva = va0 + PGSIZE; } return 0; } uint64 walkaddr(pagetable_t pagetable, uint64 va) { pte_t *pte; uint64 pa; if(va >= MAXVA) return 0; pte = walk(pagetable, va, 0); if(pte == 0) return 0; if((*pte & PTE_V) == 0) return 0; if((*pte & PTE_U) == 0) return 0; pa = PTE2PA(*pte); return pa; } ``` 但現在這兩個狀況是由於 lazy allocation 所導致,我們不該直接返回 0,而是替虛擬地址分配物理頁,參考 **usertrap** 修改如下 ```cpp kernel/vm.c uint64 walkaddr(pagetable_t pagetable, uint64 va) { pte_t *pte; uint64 pa; if(va >= MAXVA) return 0; pte = walk(pagetable, va, 0); struct proc *p = myproc(); // va might have not been allocated from system call if(pte == 0 || (*pte & PTE_V) == 0) { if (va >= p->sz || va < p->trapframe->sp) return 0; char *mem = kalloc(); if (mem == 0) return 0; memset(mem, 0, PGSIZE); if(mappages(pagetable, PGROUNDDOWN(va), PGSIZE, (uint64)mem, PTE_W|PTE_R|PTE_U) != 0) { kfree(mem); return 0; } } if((*pte & PTE_U) == 0) return 0; pa = PTE2PA(*pte); return pa; } ``` :::success == Test running lazytests == $ make qemu-gdb (19.6s) == Test lazy: map == lazy: map: OK == Test lazy: unmap == lazy: unmap: OK == Test usertests == $ make qemu-gdb Timeout! (300.3s) == Test usertests: pgbug == usertests: pgbug: OK == Test usertests: sbrkbugs == usertests: sbrkbugs: OK == Test usertests: argptest == usertests: argptest: OK == Test usertests: sbrkmuch == usertests: sbrkmuch: OK == Test usertests: sbrkfail == usertests: sbrkfail: OK == Test usertests: sbrkarg == usertests: sbrkarg: OK == Test usertests: stacktest == usertests: stacktest: OK == Test usertests: execout == usertests: execout: OK == Test usertests: copyin == usertests: copyin: OK == Test usertests: copyout == usertests: copyout: OK == Test usertests: copyinstr1 == usertests: copyinstr1: OK == Test usertests: copyinstr2 == usertests: copyinstr2: OK == Test usertests: copyinstr3 == usertests: copyinstr3: OK == Test usertests: rwsbrk == usertests: rwsbrk: OK == Test usertests: truncate1 == usertests: truncate1: OK == Test usertests: truncate2 == usertests: truncate2: OK == Test usertests: truncate3 == usertests: truncate3: OK == Test usertests: reparent2 == usertests: reparent2: OK == Test usertests: badarg == usertests: badarg: OK == Test usertests: reparent == usertests: reparent: OK == Test usertests: twochildren == usertests: twochildren: OK == Test usertests: forkfork == usertests: forkfork: OK == Test usertests: forkforkfork == usertests: forkforkfork: OK == Test usertests: createdelete == usertests: createdelete: OK == Test usertests: linkunlink == usertests: linkunlink: OK == Test usertests: linktest == usertests: linktest: OK == Test usertests: unlinkread == usertests: unlinkread: OK == Test usertests: concreate == usertests: concreate: OK == Test usertests: subdir == usertests: subdir: OK == Test usertests: fourfiles == usertests: fourfiles: OK == Test usertests: sharedfd == usertests: sharedfd: OK == Test usertests: exectest == usertests: exectest: OK == Test usertests: bigargtest == usertests: bigargtest: OK == Test usertests: bigwrite == usertests: bigwrite: OK == Test usertests: bsstest == usertests: bsstest: OK == Test usertests: sbrkbasic == usertests: sbrkbasic: OK == Test usertests: kernmem == usertests: kernmem: OK == Test usertests: validatetest == usertests: validatetest: OK == Test usertests: opentest == usertests: opentest: OK == Test usertests: writetest == usertests: writetest: OK == Test usertests: writebig == usertests: writebig: OK == Test usertests: createtest == usertests: createtest: OK == Test usertests: openiput == usertests: openiput: OK == Test usertests: exitiput == usertests: exitiput: OK == Test usertests: iput == usertests: iput: OK == Test usertests: mem == usertests: mem: OK == Test usertests: pipe1 == usertests: pipe1: OK == Test usertests: preempt == usertests: preempt: OK == Test usertests: exitwait == usertests: exitwait: OK == Test usertests: rmdot == usertests: rmdot: OK == Test usertests: fourteen == usertests: fourteen: OK == Test usertests: bigfile == usertests: bigfile: OK == Test usertests: dirfile == usertests: dirfile: OK == Test usertests: iref == usertests: iref: OK == Test usertests: forktest == usertests: forktest: OK == Test time == time: OK Score: 119/119 ::: ## Reference 1. [xv6: a simple, Unix-like teaching operating system](https://pdos.csail.mit.edu/6.828/2020/xv6/book-riscv-rev1.pdf) 2. [Lab: xv6 lazy page allocation](https://pdos.csail.mit.edu/6.S081/2020/labs/traps.html) 3. [Chapter 4: Traps and System Calls](https://zhuanlan.zhihu.com/p/351939252) 4. [MIT6.S081 課程翻譯](https://mit-public-courses-cn-translatio.gitbook.io/mit6-s081/)