# MIT6.s081 Lab: xv6 lazy page allocation
###### tags: `mit6.s081` `作業系統 Operating System`
虛擬地址有兩個主要的優點:
1. 隔離性 (isolation):每個進程都有自己的地址空間,而 kernel 與 user space 之間也使用不同的映射關係 (除了處理 trap 的 trampoline),這讓程序之間不能互相修改對方的資料,而每個程序就像獨自擁有整個記憶體空間
2. 間接性 (level of indirection):虛擬地址提供了一層抽象,CPU 的指令都執行在虛擬地址上,由 kernel 定義這個映射關係。這層抽象為 kernel 的設計帶來很多彈性,可以完成很多有趣的設計
截至目前為止,記憶體映射的使用是靜態的,無論是 user space 或著 kernel space,都是在一開始就設置好,然後不再更動。當 CPU 無法從 page table 轉換虛擬地址時,會發出 page faule (缺頁錯誤),kernel 依照 page fault 的成因更新 page table,這是讓映射關係動態化的關鍵。那 kernel 需要哪些訊息,才能妥善地處理 page fault?
- 錯誤發生的虛擬地址,這個地址會放到 **stval** 暫存器中
- page fault 發生的原因,根據指令的不同可分為三類,在 **scause** 會有對應的值來表示
- **Instruction page fault** : 12
- **Load page fault** : 13
- **Store page fault** : 15
- 導致 page fault 發生的程式碼地址,因為發生錯誤會陷入 kernel,想當然爾這個地址會存在 **sepc** 暫存器中及 trapframe->epc 當中。程式計數器非常重要,因為當修復 page fault 之後,通常指令就可以無錯誤的運行,所以 page fault 很多時候其實是刻意為之,是系統設計的一個特色
本次實驗要求修改 xv6 程式碼,讓 xv6 具有 lazy allocation 的特性。在 xv6 中我們透過 **sbrk** 系統呼叫為程序動態配置 heap 空間,**sbrk** 在一開始指向 heap 的底端、stack 的頂端。目前 xv6 在呼叫 **sbrk** 時,kernel 會立即為進程分配物理空間、將該空間初始化為 0 並建立映射,這種實現稱為 **eager allocation**

但很多時候,應用程式很難預測確切需要多少記憶體,所以會傾向申請多於程序所需要的空間,若 kernel 每次程序呼叫 **sbrk** 都立刻配置對應大小的物理空間,可想而知系統將有大量被浪費的記憶體
**lazy allocation** 在根本上解決了這個問題,其概念非常簡單,當我們調用 **sbrk** 時,只變更 `p->sz` 的數值為 `p->sz + n`,但 kernel 在這個時間點不會配置任何物理空間,稍後應用程序使用到新分配的地址時,由於我們還沒建立映射,這時將觸發 page fault,我們透過 handler 為該地址要求一塊物理空間並更新 page table,再重新執行原來的指令。kerenl 只在應用程序實際使用時,才分配物理空間,並透過 page fault 的機制,讓整個過程對 user 進程是透明的

## Eliminate allocation from sbrk() (easy)
第一題非常簡單,單純修改 `sys_sbrk`,讓它單純變更 `p->sz` 的值,但不實際配置物理空間
```cpp
kernel/sysproc.c
uint64
sys_sbrk(void)
{
int addr;
int n;
if(argint(0, &n) < 0)
return -1;
addr = myproc()->sz;
myproc()->sz += n;
return addr;
}
```
## Lazy allocation (moderate) & Lazytests and Usertests (moderate)
第二題跟第三題要求修改 `kernel/trap.c`,當 page fault 發生時,為發生錯誤的虛擬地址配置新的物理頁
page fault 發生後陷入 kernel 的過程與其他 trap 無異,回想前一個實驗 traps,xv6 在 **usertrap** 判斷中斷類型並做相應的處理,我們一樣在 usertrap 增加 page fault 對應的處理機制
首先透過 **r_scause** 函式取得 RISC-V 的 **scause** 暫存器,若其中的數值為 13 或 15 代表為 page fault。然後藉由 **r_stval** 取得發生錯誤的虛擬地址,這個地址不應該超過目前進程的記憶體大小,也不該與 stack 重疊 (可透過 trapframe->sp 取得進程的 stack pointer),若超過的話依照實驗要求殺掉進程。最後呼叫 **kalloc** 及 **mappages** 分配及映射物理地址,依實驗要求無論 kalloc 或 mappages 失敗我們都直接殺掉進程
```cpp
kernel/trap.c
void
usertrap(void)
{
...
if(r_scause() == 8){
...
} else if ((r_scause() == 13) || (r_scause() == 15)) {
uint64 va = r_stval();
// printf("page fault %p\n", va);
if (va >= p->sz || va < p->trapframe->sp) {
p->killed = 1;
} else {
char *mem = kalloc();
if (mem == 0) {
p->killed = 1;
} else {
memset(mem, 0, PGSIZE);
if(mappages(p->pagetable, PGROUNDDOWN(va), PGSIZE, (uint64)mem, PTE_W|PTE_R|PTE_U) != 0) {
kfree(mem);
p->killed = 1;
}
}
}
} else {
...
```
hints 提到 **uvmunmap** 會發生錯誤,這是因為修改前的 xv6 在呼叫 **sbrk** 時一定會配置物理空間,所以釋放物理頁時 pte 一定存在,而且 PTE_V bit 應該要被設置,但採用 lazy allocation 後,heap 有機會存在沒有實際分配物理頁的虛擬地址,可能是 page table 已經建立了但 pte 尚未使用過 (PTE_V = 0),或著 page table 根本就還沒有建立 (pte = 0),無論是哪種情況現在都是合理的,不該再引發 panic,所以直接忽略即可。另外,**fork** 呼叫 **uvmcopy** 也有相同的狀況,程式碼一併進行修改
```cpp
kernel/vm.c
void
uvmunmap(pagetable_t pagetable, uint64 va, uint64 npages, int do_free)
{
uint64 a;
pte_t *pte;
if((va % PGSIZE) != 0)
panic("uvmunmap: not aligned");
for(a = va; a < va + npages*PGSIZE; a += PGSIZE){
if((pte = walk(pagetable, a, 0)) == 0)
// panic("uvmunmap: walk");
continue;
if((*pte & PTE_V) == 0)
// panic("uvmunmap: not mapped")
continue;
if(PTE_FLAGS(*pte) == PTE_V)
panic("uvmunmap: not a leaf");
if(do_free){
uint64 pa = PTE2PA(*pte);
kfree((void*)pa);
}
*pte = 0;
}
}
int
uvmcopy(pagetable_t old, pagetable_t new, uint64 sz)
{
pte_t *pte;
uint64 pa, i;
uint flags;
char *mem;
for(i = 0; i < sz; i += PGSIZE){
if((pte = walk(old, i, 0)) == 0)
// panic("uvmcopy: pte should exist");
continue;
if((*pte & PTE_V) == 0)
// panic("uvmcopy: page not present")
continue;
pa = PTE2PA(*pte);
flags = PTE_FLAGS(*pte);
if((mem = kalloc()) == 0)
goto err;
memmove(mem, (char*)pa, PGSIZE);
if(mappages(new, i, PGSIZE, (uint64)mem, flags) != 0){
kfree(mem);
goto err;
}
}
return 0;
err:
uvmunmap(new, 0, i / PGSIZE, 1);
return -1;
}
```
應用程式有可能縮小 heap,xv6 要能妥善處理這種情況,理論上直接呼叫 **uvmdealloc** 取消物理頁面的映射即可,但 user 有可能惡意傳進一個負值非常大的參數,我們必需先判斷縮小後的進程空間,是否高於 stack pointer,若低於 stack pointer 代表這是一個非法的參數
```cpp
kernel/sysproc.c
uint64
sys_sbrk(void)
{
int addr;
int n;
if(argint(0, &n) < 0)
return -1;
struct proc *p = myproc();
addr = p->sz;
// heap should not overlap stack
if (addr + n < p->trapframe->sp)
return -1;
// if grow negative, dealloc memory map
if (n < 0)
uvmdealloc(p->pagetable, p->sz, p->sz + n);
p->sz += n;
return addr;
}
```
最後 hints 提醒我們一種狀況,我們以 **sbrk** 跟 kernel 要了一塊空間,但因為 lazy allocation 的特性,這塊空間尚未實際分配物理頁,此時我們執行**系統呼叫**,希望對這個空間做一些記憶體的操作,如 read、write 等,但在 kernel space 觸發 page fault 時不會進入 usertrap,無法透過 trap 機制動態配置記憶體
我們以 **write** 為例,呼叫的過程為 write (syscall) --> filewrite (kernel/file.c) --> writei (kernel/fs.c) --> either_copyin (kernel/proc.c) --> copyin (kernel/vm.c) --> walkaddr (kernel/vm.c),xv6 在 kernel space 下,user 虛擬地址的映射都必須透過 **walkaddr** 進行查找
以下為 **copyin** 及 **walkaddr** 的原始實作,walkaddr 在 pte = 0 以及 PTE_V = 0 的情況都直接返回 0,copyin 因此認定為異常而結束
```cpp
kernel/vm.c
int
copyin(pagetable_t pagetable, char *dst, uint64 srcva, uint64 len)
{
uint64 n, va0, pa0;
while(len > 0){
va0 = PGROUNDDOWN(srcva);
pa0 = walkaddr(pagetable, va0);
if(pa0 == 0)
return -1;
n = PGSIZE - (srcva - va0);
if(n > len)
n = len;
memmove(dst, (void *)(pa0 + (srcva - va0)), n);
len -= n;
dst += n;
srcva = va0 + PGSIZE;
}
return 0;
}
uint64
walkaddr(pagetable_t pagetable, uint64 va)
{
pte_t *pte;
uint64 pa;
if(va >= MAXVA)
return 0;
pte = walk(pagetable, va, 0);
if(pte == 0)
return 0;
if((*pte & PTE_V) == 0)
return 0;
if((*pte & PTE_U) == 0)
return 0;
pa = PTE2PA(*pte);
return pa;
}
```
但現在這兩個狀況是由於 lazy allocation 所導致,我們不該直接返回 0,而是替虛擬地址分配物理頁,參考 **usertrap** 修改如下
```cpp
kernel/vm.c
uint64
walkaddr(pagetable_t pagetable, uint64 va)
{
pte_t *pte;
uint64 pa;
if(va >= MAXVA)
return 0;
pte = walk(pagetable, va, 0);
struct proc *p = myproc();
// va might have not been allocated from system call
if(pte == 0 || (*pte & PTE_V) == 0) {
if (va >= p->sz || va < p->trapframe->sp)
return 0;
char *mem = kalloc();
if (mem == 0)
return 0;
memset(mem, 0, PGSIZE);
if(mappages(pagetable, PGROUNDDOWN(va), PGSIZE, (uint64)mem, PTE_W|PTE_R|PTE_U) != 0) {
kfree(mem);
return 0;
}
}
if((*pte & PTE_U) == 0)
return 0;
pa = PTE2PA(*pte);
return pa;
}
```
:::success
== Test running lazytests ==
$ make qemu-gdb
(19.6s)
== Test lazy: map ==
lazy: map: OK
== Test lazy: unmap ==
lazy: unmap: OK
== Test usertests ==
$ make qemu-gdb
Timeout! (300.3s)
== Test usertests: pgbug ==
usertests: pgbug: OK
== Test usertests: sbrkbugs ==
usertests: sbrkbugs: OK
== Test usertests: argptest ==
usertests: argptest: OK
== Test usertests: sbrkmuch ==
usertests: sbrkmuch: OK
== Test usertests: sbrkfail ==
usertests: sbrkfail: OK
== Test usertests: sbrkarg ==
usertests: sbrkarg: OK
== Test usertests: stacktest ==
usertests: stacktest: OK
== Test usertests: execout ==
usertests: execout: OK
== Test usertests: copyin ==
usertests: copyin: OK
== Test usertests: copyout ==
usertests: copyout: OK
== Test usertests: copyinstr1 ==
usertests: copyinstr1: OK
== Test usertests: copyinstr2 ==
usertests: copyinstr2: OK
== Test usertests: copyinstr3 ==
usertests: copyinstr3: OK
== Test usertests: rwsbrk ==
usertests: rwsbrk: OK
== Test usertests: truncate1 ==
usertests: truncate1: OK
== Test usertests: truncate2 ==
usertests: truncate2: OK
== Test usertests: truncate3 ==
usertests: truncate3: OK
== Test usertests: reparent2 ==
usertests: reparent2: OK
== Test usertests: badarg ==
usertests: badarg: OK
== Test usertests: reparent ==
usertests: reparent: OK
== Test usertests: twochildren ==
usertests: twochildren: OK
== Test usertests: forkfork ==
usertests: forkfork: OK
== Test usertests: forkforkfork ==
usertests: forkforkfork: OK
== Test usertests: createdelete ==
usertests: createdelete: OK
== Test usertests: linkunlink ==
usertests: linkunlink: OK
== Test usertests: linktest ==
usertests: linktest: OK
== Test usertests: unlinkread ==
usertests: unlinkread: OK
== Test usertests: concreate ==
usertests: concreate: OK
== Test usertests: subdir ==
usertests: subdir: OK
== Test usertests: fourfiles ==
usertests: fourfiles: OK
== Test usertests: sharedfd ==
usertests: sharedfd: OK
== Test usertests: exectest ==
usertests: exectest: OK
== Test usertests: bigargtest ==
usertests: bigargtest: OK
== Test usertests: bigwrite ==
usertests: bigwrite: OK
== Test usertests: bsstest ==
usertests: bsstest: OK
== Test usertests: sbrkbasic ==
usertests: sbrkbasic: OK
== Test usertests: kernmem ==
usertests: kernmem: OK
== Test usertests: validatetest ==
usertests: validatetest: OK
== Test usertests: opentest ==
usertests: opentest: OK
== Test usertests: writetest ==
usertests: writetest: OK
== Test usertests: writebig ==
usertests: writebig: OK
== Test usertests: createtest ==
usertests: createtest: OK
== Test usertests: openiput ==
usertests: openiput: OK
== Test usertests: exitiput ==
usertests: exitiput: OK
== Test usertests: iput ==
usertests: iput: OK
== Test usertests: mem ==
usertests: mem: OK
== Test usertests: pipe1 ==
usertests: pipe1: OK
== Test usertests: preempt ==
usertests: preempt: OK
== Test usertests: exitwait ==
usertests: exitwait: OK
== Test usertests: rmdot ==
usertests: rmdot: OK
== Test usertests: fourteen ==
usertests: fourteen: OK
== Test usertests: bigfile ==
usertests: bigfile: OK
== Test usertests: dirfile ==
usertests: dirfile: OK
== Test usertests: iref ==
usertests: iref: OK
== Test usertests: forktest ==
usertests: forktest: OK
== Test time ==
time: OK
Score: 119/119
:::
## Reference
1. [xv6: a simple, Unix-like teaching operating system](https://pdos.csail.mit.edu/6.828/2020/xv6/book-riscv-rev1.pdf)
2. [Lab: xv6 lazy page allocation](https://pdos.csail.mit.edu/6.S081/2020/labs/traps.html)
3. [Chapter 4: Traps and System Calls](https://zhuanlan.zhihu.com/p/351939252)
4. [MIT6.S081 課程翻譯](https://mit-public-courses-cn-translatio.gitbook.io/mit6-s081/)