RV32 port for MIT xv6 operating system (and contribute!)

# RV32 port for MIT xv6 operating system (and contribute!)

contributed by < [terry23304](https://github.com/terry23304?tab=repositories) and [paulpeng](https://github.com/paulpeng-popo) >

# Preparation

## [riscv-gnu-toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain)

For the 64-bit toolchain:
```shell
$ git clone https://github.com/riscv/riscv-gnu-toolchain
$ sudo apt-get install autoconf automake autotools-dev curl python3 python3-pip libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev ninja-build git cmake libglib2.0-dev
$ cd riscv-gnu-toolchain
$ mkdir build && cd build
$ ../configure --prefix=/opt/riscv
$ sudo make -j$(nproc)
```

For the 32-bit toolchain, add extra flags to the configuration:
```shell
$ ../configure --prefix=/opt/riscv \
    --with-arch=rv32i \
    --with-isa-spec=20191213 \
    --with-multilib-generator="\
rv32im_zicsr-ilp32--;\
rv32imac_zicsr-ilp32--;\
rv32im_zicsr_zba_zbb_zbc_zbs-ilp32--;\
rv32imac_zicsr_zba_zbb_zbc_zbs-ilp32--;\
rv32em_zicsr-ilp32e--;\
rv32emac_zicsr-ilp32e--"
```

## [qemu](https://github.com/qemu/qemu)

```shell
$ sudo apt install autoconf automake autotools-dev curl libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev git
$ sudo apt-get install libpixman-1-dev
$ git clone https://github.com/qemu/qemu
$ cd qemu
$ ./configure --target-list=riscv64-softmmu
$ make -j $(nproc)
$ sudo make install
```

## [xv6-riscv](https://github.com/mit-pdos/xv6-riscv)

```shell
$ git clone https://github.com/mit-pdos/xv6-riscv.git
$ cd xv6-riscv
$ make qemu
```
OUTPUT:
```
qemu-system-riscv64 -machine virt -bios none -kernel kernel/kernel -m 128M -smp 3 -nographic -global virtio-mmio.force-legacy=false -drive file=fs.img,if=none,format=raw,id=x0 -device virtio-blk-device,drive=x0,bus=virtio-mmio-bus.0

xv6 kernel is booting

hart 2 starting
hart 1 starting
init: starting sh
$
```

## [ladybird](https://github.com/harihitode/ladybird)

Clone the repository:
```shell
$ git clone https://github.com/harihitode/ladybird.git
$ cd ladybird
```

For the `gdbstub` submodule, change `.gitmodules` as follows, then run the submodule update:
```.gitmodules
[submodule "sim/gdbstub"]
	path = sim/gdbstub
	url = https://github.com/harihitode/gdbstub.git
```

```shell
$ git submodule update --init --recursive
```

With the `kernel` and `fs.img` files in place, build the simulator and run xv6:
```
$ make
$ ./launch_sim ./kernel --ebreak --disk ./fs.img --uart-in
```
OUTPUT:
```
xv6 kernel is booting

init: starting sh
$ ls
ls
.              1 1 1024
..             1 1 1024
README         2 2 2226
cat            2 3 30988
echo           2 4 29504
forktest       2 5 17160
grep           2 6 36220
init           2 7 30572
kill           2 8 29344
ln             2 9 29408
ls             2 10 34364
mkdir          2 11 29508
rm             2 12 29496
sh             2 13 67252
stressfs       2 14 30800
usertests      2 15 195824
grind          2 16 52520
wc             2 17 32016
zombie         2 18 28784
exit           2 19 28728
console        3 20 0
tmp.txt        2 21 7
```

# page table

The RISC-V privileged specification defines several page-table formats, including:
- sv32: two levels (rv32)
- sv39: three levels (rv64, used by upstream xv6)
- sv48: four levels (rv64)

## sv32 page table entry

![image](https://hackmd.io/_uploads/Hyjv8MuwT.png)

```c
#define PTE_V (1L << 0) // valid
#define PTE_R (1L << 1) // readable
#define PTE_W (1L << 2) // writable
#define PTE_X (1L << 3) // executable
#define PTE_U (1L << 4) // 1 -> user can access
```

## Translation Lookaside Buffer (TLB)

An sv32 page table has two levels, so without caching every translation would cost two extra memory accesses. Most processors therefore cache recently used translations in a Translation Lookaside Buffer, which stores mappings from virtual addresses to their Page Table Entries.

On a context switch between processes the kernel also switches page tables by writing the **satp** register. The translations cached in the TLB then belong to the old page table and must be invalidated; otherwise an address could be translated through a stale entry. In RISC-V, the `sfence.vma` instruction (wrapped by `sfence_vma()` in xv6) flushes the TLB.
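
To make the two-level sv32 structure described above concrete, the following sketch shows how a 32-bit virtual address splits into two 10-bit page-table indices plus a 12-bit offset, and how a PTE converts to and from a physical address. The macro names follow xv6's `riscv.h`, but the 10-bit index width and the exact definitions here are an assumption about how a 32-bit port would adapt them, not code taken from the port.

```c
#include <stdint.h>

// Sv32 virtual address: VPN[1] (bits 31:22) | VPN[0] (bits 21:12) | offset (bits 11:0).
// Names mirror xv6's riscv.h; the values are assumed for a 32-bit port.
#define PGSHIFT 12
#define PXMASK 0x3FFu                              // 10 index bits per level
#define PXSHIFT(level) (PGSHIFT + 10 * (level))
#define PX(level, va) ((((uint32_t)(va)) >> PXSHIFT(level)) & PXMASK)
#define PGOFF(va) (((uint32_t)(va)) & 0xFFFu)      // hypothetical helper

// A PTE stores the physical page number in bits 31:10 and flags in bits 9:0.
// Sv32 physical addresses can be up to 34 bits wide, so they are kept in uint64_t.
#define PA2PTE(pa) ((uint32_t)((((uint64_t)(pa)) >> 12) << 10))
#define PTE2PA(pte) ((((uint64_t)(pte)) >> 10) << 12)
```

For example, the address `0x80401234` selects entry `0x201` in the top-level table, entry `1` in the second-level table, and byte offset `0x234` within the page.
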
## page table lookups

xv6 implements the page-table lookup in `walk()`, which for sv32 descends two levels. It returns the address of the page table entry (PTE) for the virtual address `va` in the lowest-level page table. If an intermediate page-table page has not been allocated yet and `alloc` is non-zero, `walk()` allocates a new page and zeroes it, which is why xv6 also uses this function when it builds page tables.

```c
pte_t *
walk(pagetable_t pagetable, uint64 va, int alloc)
{
  // descend through the upper level(s); with sv32 only level 1
  for(int level = 1; level > 0; level--) {
    pte_t *pte = &pagetable[PX(level, va)];
    if(*pte & PTE_V) {
      pagetable = (pagetable_t)PTE2PA(*pte);
    } else {
      // allocate the missing page-table page only if asked to
      if(!alloc || (pagetable = (pde_t*)kalloc()) == 0)
        return 0;
      memset(pagetable, 0, PGSIZE);
      *pte = PA2PTE(pagetable) | PTE_V;
    }
  }
  return &pagetable[PX(0, va)];
}
```

# RISC-V trap machinery

Each RISC-V CPU has a set of control registers that the kernel writes to tell the CPU how to handle traps, and reads to find out about a trap that has occurred. The key control registers are:

- **stvec**: Stores the address of the trap handler. When a trap occurs, the CPU jumps to the address stored in stvec.
- **sepc**: When a trap occurs, the current program counter is saved in sepc, because the CPU then overwrites pc with the value in stvec. Before returning to user space, `userret` uses the sret instruction to copy sepc back to pc.
- **scause**: Contains a number that describes the reason for the trap.
- **sscratch**: Before returning from kernel space to user space, the kernel saves the address of the trapframe page in this register. `uservec` then uses the csrrw instruction to exchange sscratch with a general-purpose register, which lets it save and restore registers.
- **sstatus**: The SIE (Supervisor mode Interrupt Enable) bit controls whether device interrupts are allowed. If the kernel clears SIE, RISC-V defers device interrupts until SIE is set again. The SPP bit indicates whether the trap came from user or supervisor mode, and controls which mode sret returns to.

These registers cannot be read or written in user mode. Machine mode has a similar set of registers for traps handled there.

When a trap occurs, RISC-V hardware processes it as follows:
1. If it is a device interrupt and the SIE bit is clear, skip the subsequent steps.
2. Clear the SIE bit to disable device interrupts.
3. Copy pc to the sepc register.
4. Store the current mode (user/supervisor) in the SPP bit of the sstatus register.
5. Set the scause register.
6. Switch to supervisor mode.
7. Copy the value of stvec to pc.
8. Begin execution at the new pc.

The CPU does not switch page tables or kernel stacks, and it saves no registers other than the pc. These tasks are left to the kernel, which gives the kernel the flexibility to choose how to handle each trap.

## Trap from user space

In user space, a trap occurs when the program executes an ecall instruction, performs an illegal operation, or is interrupted by a device.

Because RISC-V hardware does not switch page tables when a trap occurs, the user page table must include a mapping for **uservec**, and **stvec** must point to **uservec** while user code runs. xv6's trap handling then has to switch to the kernel page table, so the kernel page table must map the handler that stvec points to at the same address. xv6 solves this with a trampoline page, which is mapped at the highest address in both the user and kernel virtual address spaces.

### switch to kernel mode

```c
// in start.c
void start(){
  // ...
  // delegate all interrupts and exceptions to supervisor mode.
  w_medeleg(0xffff);
  w_mideleg(0xffff);
  w_sie(r_sie() | SIE_SEIE | SIE_STIE | SIE_SSIE);
  // ...
}
```

- medeleg: Machine exception delegation register.
- mideleg: Machine interrupt delegation register.
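
As a rough model of what these two delegation registers do, the sketch below picks the privilege mode that takes a trap. The macro `TRAP_TARGET` and the `PRIV_*` names are hypothetical and only illustrate the rule from the privileged specification: a trap is taken in supervisor mode when its delegation bit is set and the CPU was not already running in machine mode; otherwise it is taken in machine mode.

```c
#include <stdint.h>

// Privilege mode encodings used by the RISC-V privileged specification.
enum { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

// Hypothetical model: `deleg` is medeleg for exceptions or mideleg for
// interrupts, and `cause` is the exception/interrupt code (the bit index).
// Traps raised while in machine mode are never delegated downward.
#define TRAP_TARGET(cur_priv, cause, deleg) \
  ((((cur_priv) != PRIV_M) && ((((uint32_t)(deleg)) >> (cause)) & 1u)) ? PRIV_S : PRIV_M)
```

With the `0xffff` written in `start()`, an `ecall` from user mode (cause 8) is modeled as `TRAP_TARGET(PRIV_U, 8, 0xffff)`, which yields `PRIV_S`; with a delegation value of 0 the same trap would land in machine mode.
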
> By default, all traps at any privilege level are handled in machine mode, though a machine-mode handler can redirect traps back to the appropriate level with the MRET instruction. To increase performance, implementations can provide individual read/write bits within **medeleg** and **mideleg** to indicate that certain exceptions and interrupts should be processed directly by a lower privilege level.

Therefore, at the beginning of `usertrap()`, we can write stvec with `w_stvec()` directly: the trap has already been taken in supervisor mode, so no further mode switch is needed.

### uservec

After ecall executes, the CPU switches to supervisor mode and sets pc to the address stored in the stvec register, which is the beginning of the trampoline page, **uservec**.

**uservec** has two tasks:
- Save the contents of the registers.
- Swap the page table.

In supervisor mode with paging enabled, every memory access goes through the page table. At the beginning of **uservec**, **satp** still holds the user page table, and we do not yet know the address of the kernel page table, so the place where the registers are saved must itself be mapped in the user page table.

Each process has a trapframe page to store register values. The trapframe is always mapped at the fixed virtual address `TRAPFRAME`, just below the trampoline page (`0x3fffffe000` in upstream sv39 xv6; a 32-bit port needs a value that fits in 32 bits), so register values can be saved without switching page tables. The `kernel/proc.h` file defines the structure of the trapframe.

`uservec` begins by using csrrw (an atomic swap) to exchange the contents of the **a0** and **sscratch** registers. After the swap, a0 contains the starting address of the trapframe, and sscratch holds the user's a0 (the first argument, if the trap is a system call). The remaining registers are then stored in the trapframe at fixed offsets from a0.

```c
.globl uservec
uservec:
        #
        # trap.c sets stvec to point here, so
        # traps from user space start here,
        # in supervisor mode, but with a
        # user page table.
        #
        # sscratch points to where the process's p->trapframe is
        # mapped into user space, at TRAPFRAME.
        #

        # swap a0 and sscratch
        # so that a0 is TRAPFRAME
        csrrw a0, sscratch, a0

        # save the user registers in TRAPFRAME
        sw ra, 20(a0)
        sw sp, 24(a0)
        sw gp, 28(a0)
        ...
        sw t6, 140(a0)

        # save the user a0 in p->trapframe->a0
        csrr t0, sscratch
        sw t0, 56(a0)
```

After saving the registers, `uservec` prepares to enter the kernel by loading information from the trapframe: the kernel stack pointer, the CPU ID, the kernel page table, and the address of `usertrap()`. At this point **trapframe->epc** (the user program counter) has not been saved yet.

Finally, it writes satp, flushes the TLB with `sfence.vma zero, zero`, and jumps to `usertrap()`. Note that the kernel page table has no mapping for the trapframe at this address, so the value in a0 is no longer valid at this stage. The kernel can keep running after the switch to the kernel page table because execution is still inside the `trampoline.S` code, which is mapped at the same virtual address in both the user and kernel page tables.

```c
        # restore kernel stack pointer from p->trapframe->kernel_sp
        lw sp, 4(a0)

        # make tp hold the current hartid, from p->trapframe->kernel_hartid
        lw tp, 16(a0)

        # load the address of usertrap(), p->trapframe->kernel_trap
        lw t0, 8(a0)

        # restore kernel page table from p->trapframe->kernel_satp
        lw t1, 0(a0)
        csrw satp, t1
        sfence.vma zero, zero

        # a0 is no longer valid, since the kernel page
        # table does not specially map p->tf.

        # jump to usertrap(), which does not return
        jr t0
```

### usertrap

`usertrap()` first changes the stvec register to point to the kernel's trap handling code, kernelvec. It then stores the process's pc (sepc) in the trapframe. This is needed because a context switch might occur, and another process might then take a trap that overwrites the **sepc** register. Each process therefore needs its own storage for the saved pc so that it cannot be overwritten by other processes.
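
The offsets that `uservec` uses and the `epc` slot that `usertrap()` fills both come from the trapframe layout. The sketch below assumes the 32-bit port keeps the field order of upstream's `struct trapframe` in `kernel/proc.h` and narrows every field to 32 bits. The struct name `trapframe32` is hypothetical, and the compile-time checks compare the layout only against the offsets visible in the assembly above.

```c
#include <stddef.h>
#include <stdint.h>

// Assumed 32-bit trapframe: upstream field order, each field 4 bytes wide.
struct trapframe32 {
  uint32_t kernel_satp;    //   0: kernel page table
  uint32_t kernel_sp;      //   4: top of process's kernel stack
  uint32_t kernel_trap;    //   8: address of usertrap()
  uint32_t epc;            //  12: saved user program counter
  uint32_t kernel_hartid;  //  16: saved kernel tp
  uint32_t ra, sp, gp, tp; //  20..32
  uint32_t t0, t1, t2;     //  36..44
  uint32_t s0, s1;         //  48..52
  uint32_t a0, a1, a2, a3, a4, a5, a6, a7;          //  56..84
  uint32_t s2, s3, s4, s5, s6, s7, s8, s9, s10, s11; //  88..124
  uint32_t t3, t4, t5, t6; // 128..140
};

// These offsets are the ones used by `sw`/`lw` in uservec.
_Static_assert(offsetof(struct trapframe32, kernel_sp) == 4, "lw sp, 4(a0)");
_Static_assert(offsetof(struct trapframe32, kernel_trap) == 8, "lw t0, 8(a0)");
_Static_assert(offsetof(struct trapframe32, ra) == 20, "sw ra, 20(a0)");
_Static_assert(offsetof(struct trapframe32, a0) == 56, "sw t0, 56(a0)");
_Static_assert(offsetof(struct trapframe32, t6) == 140, "sw t6, 140(a0)");
```

Under this assumption the whole trapframe is 144 bytes, half the size of the 64-bit layout, and `p->trapframe->epc` lives at offset 12.
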
> User registers cannot be saved from a compiled language such as C, because the compiled code itself uses those registers. This job can only be done in assembly. Control registers have no such restriction, so saving them in either trampoline.S or usertrap is acceptable.

```c
void
usertrap(void)
{
  int which_dev = 0;

  if((r_sstatus() & SSTATUS_SPP) != 0)
    panic("usertrap: not from user mode");

  // send interrupts and exceptions to kerneltrap(),
  // since we're now in the kernel.
  w_stvec((uint64)kernelvec);

  struct proc *p = myproc();

  // save user program counter.
  p->trapframe->epc = r_sepc();
```

Next, the scause register tells us why the trap was triggered. A value of 8 in scause indicates a system call. If the trap is a device interrupt, we call devintr; for any other exception, the process is killed.

For a system call, sepc holds the address of the ecall instruction itself. To return to user space at the following instruction, trapframe->epc must be incremented by 4.

By the time `syscall()` is called, the user registers and sepc are already saved in the trapframe and the kernel is done with the trap-related control registers, so a device interrupt can no longer corrupt that state. Device interrupts are therefore enabled before `syscall()` so that the kernel can respond to other interrupts more quickly.

Top Half:
- Save registers.
- Determine the cause of the trap.
- Set the pc.

```c
  if(r_scause() == 8){
    // system call

    if(p->killed)
      exit(-1);

    // sepc points to the ecall instruction,
    // but we want to return to the next instruction.
    p->trapframe->epc += 4;

    // an interrupt will change sstatus &c registers,
    // so don't enable until done with those registers.
    intr_on();

    syscall();
  } else if((which_dev = devintr()) != 0){
    // ok
  } else {
    printf("usertrap(): unexpected scause %p pid=%d\n", r_scause(), p->pid);
    printf("            sepc=%p stval=%p\n", r_sepc(), r_stval());
    p->killed = 1;
  }
```

After handling the trap, the process exits if it has been killed. For a timer interrupt, `yield()` is called to give up the CPU to other processes. In all other cases, `usertrapret()` is invoked.

```c
  if(p->killed)
    exit(-1);

  // give up the CPU if this is a timer interrupt.
  if(which_dev == 2)
    yield();

  usertrapret();
```

`syscall()` records the return value of the system call in `p->trapframe->a0`. Before jumping back to user space, this value is loaded into the a0 register, so the user program finds the return value in a0 as if it had made an ordinary function call.

### usertrapret

The usertrapret function begins by disabling interrupts, because it is about to point the **stvec** register back at uservec: a trap taken in the kernel after that change would be sent to the user-space handler. Next, the kernel information that uservec needs is stored back in the trapframe, including the kernel page table, the kernel stack, the function pointer for usertrap, and the CPU id.

To return to user mode, the SPP bit in sstatus is cleared. Since we want interrupts to be enabled after returning to user mode, the SPIE bit is set to 1.

The sret instruction sets the pc to the value in the sepc register. Therefore, we set sepc to trapframe->epc, which for a system call is the address of the instruction after the ecall.

Subsequently, the satp value that corresponds to the user page table is computed. The actual page table switch occurs later in userret: because the trampoline is the only region that has the same mapping in the user and kernel page tables, the page table can only be switched in trampoline.S.

Finally, a function pointer fn is created, pointing to userret in trampoline.S.
The virtual address of the trapframe (TRAPFRAME) and the satp value for the user page table are passed as parameters (a0 and a1).

```c
void
usertrapret(void)
{
  struct proc *p = myproc();

  // we're about to switch the destination of traps from
  // kerneltrap() to usertrap(), so turn off interrupts until
  // we're back in user space, where usertrap() is correct.
  intr_off();

  // send syscalls, interrupts, and exceptions to trampoline.S
  w_stvec(TRAMPOLINE + (uservec - trampoline));

  // set up trapframe values that uservec will need when
  // the process next re-enters the kernel.
  p->trapframe->kernel_satp = r_satp();         // kernel page table
  p->trapframe->kernel_sp = p->kstack + PGSIZE; // process's kernel stack
  p->trapframe->kernel_trap = (uint64)usertrap;
  p->trapframe->kernel_hartid = r_tp();         // hartid for cpuid()

  // set up the registers that trampoline.S's sret will use
  // to get to user space.

  // set S Previous Privilege mode to User.
  unsigned long x = r_sstatus();
  x &= ~SSTATUS_SPP; // clear SPP to 0 for user mode
  x |= SSTATUS_SPIE; // enable interrupts in user mode
  w_sstatus(x);

  // set S Exception Program Counter to the saved user pc.
  w_sepc(p->trapframe->epc);

  // tell trampoline.S the user page table to switch to.
  uint64 satp = MAKE_SATP(p->pagetable);

  // jump to trampoline.S at the top of memory, which
  // switches to the user page table, restores user registers,
  // and switches to user mode with sret.
  uint64 fn = TRAMPOLINE + (userret - trampoline);
  ((void (*)(uint64,uint64))fn)(TRAPFRAME, satp);
}
```

Finally, the userret function is executed. The first step is switching the page table. Subsequently, the value stored in the trapframe's a0 is loaded into the sscratch register. If the trap was a system call, sscratch now holds the return value of the system call.

> usertrapret passes TRAPFRAME as a parameter, so a0 now holds the address of the trapframe.

Before returning to user space, the values of the a0 and sscratch registers are swapped.
If the trap was a system call, a0 holds the return value of the system call after the swap, and sscratch holds the address of the trapframe. Finally, the sret instruction switches back to user mode. It sets the pc to the value stored in sepc (ecall + 4 for a system call) and re-enables interrupts.

```c
.globl userret
userret:
        # userret(TRAPFRAME, pagetable)
        # switch from kernel to user.
        # usertrapret() calls here.
        # a0: TRAPFRAME, in user page table.
        # a1: user page table, for satp.

        # switch to the user page table.
        csrw satp, a1
        sfence.vma zero, zero

        # put the saved user a0 in sscratch, so we
        # can swap it with our a0 (TRAPFRAME) in the last step.
        lw t0, 56(a0)
        csrw sscratch, t0

        # restore all but a0 from TRAPFRAME
        lw ra, 20(a0)
        lw sp, 24(a0)
        ...
        lw t6, 140(a0)

        # restore user a0, and save TRAPFRAME in sscratch
        csrrw a0, sscratch, a0

        # return to user mode and user pc.
        # usertrapret() set up sstatus and sepc.
        sret
```

## Traps From Kernel Space

In kernel space, two types of traps can occur: device interrupts and exceptions. The handler first saves the critical control registers so that other processes cannot change their values after a context switch.

The kerneltrap function calls devintr to identify the type of device interrupt. If the trap is not a device interrupt, it must be an exception, which results in a kernel panic.

```c
  if((sstatus & SSTATUS_SPP) == 0)
    panic("kerneltrap: not from supervisor mode");
  if(intr_get() != 0)
    panic("kerneltrap: interrupts enabled");

  if((which_dev = devintr()) == 0){
    printf("scause %p\n", scause);
    printf("sepc=%p stval=%p\n", r_sepc(), r_stval());
    panic("kerneltrap");
  }
```

If it is a timer interrupt, the kernel calls `yield()` to give up the CPU. Since we have saved the registers that other processes/threads might modify, execution can resume from the instruction after `yield()` once this thread is switched back in.
While the process is suspended, other threads of execution might take traps and overwrite sepc and sstatus, so kerneltrap reloads these registers in every case. Finally, it returns to kernelvec, which restores all the saved registers from the kernel stack, restores the kernel stack pointer, and executes sret to return to the interrupted position in the kernel and resume execution.

```c
kernelvec:
        // make room to save registers.
        addi sp, sp, -128

        // save the registers.
        sw ra, 0(sp)
        sw sp, 4(sp)
        sw gp, 8(sp)
        sw tp, 12(sp)
        ...
        sw t6, 120(sp)

        // call the C trap handler in trap.c
        call kerneltrap
```

### kerneltrap

In kernel space, there are two types of traps:
- device interrupts
- exceptions

The kerneltrap function calls devintr to identify the type of device interrupt. If the trap is not a device interrupt, it must be an exception, and the kernel panics and stops.

```c
  if((sstatus & SSTATUS_SPP) == 0)
    panic("kerneltrap: not from supervisor mode");
  if(intr_get() != 0)
    panic("kerneltrap: interrupts enabled");

  if((which_dev = devintr()) == 0){
    printf("scause %p\n", scause);
    printf("sepc=%p stval=%p\n", r_sepc(), r_stval());
    panic("kerneltrap");
  }
```

If it is a timer interrupt, the kernel calls yield to give up the CPU. Since we have already saved the registers that other processes/threads might modify, execution can resume from the instruction after yield once this thread is switched back in.

While the process is suspended, other threads of execution might take traps and overwrite sepc and sstatus, so kerneltrap reloads these registers in every case. Finally, it returns to kernelvec, which restores all the saved registers from the kernel stack, restores the kernel stack pointer, and executes sret to return to the interrupted position in the kernel and resume execution.

```c
        // restore registers.
        lw ra, 0(sp)
        lw sp, 4(sp)
        lw gp, 8(sp)
        // not this, in case we moved CPUs: lw tp, 12(sp)
        lw t0, 16(sp)
        ...
        lw t6, 120(sp)

        addi sp, sp, 128

        // return to whatever we were doing in the kernel.
        sret
```

# Interrupt

An interrupt is how hardware asks for the operating system's attention. For example, a network card generates an interrupt when it receives a packet, and a keyboard generates an interrupt when the user presses a key.

The operating system has to save its current work, handle the interrupt, and then resume the previous work. Saving and restoring work is very similar to what happens for a system call, so system calls, page faults, and interrupts all use the same mechanism.

There are three main differences between interrupts and system calls:
- Asynchronous: When hardware generates an interrupt, the interrupt handler is not associated with the process currently executing on the CPU. A system call, in contrast, occurs in the context of the process that makes it.
- Concurrency: The CPU and the device that generates the interrupt run concurrently. A network card processes incoming packets on its own and raises an interrupt at some point in time, while the CPU is executing something else. This concurrency between the CPU and devices has to be managed.
- Program Device: External devices, such as network cards and UARTs, need to be programmed. Each device has a programming manual, just as RISC-V has a manual that describes its instructions and registers. The device manual describes what registers the device has, what operations it can perform, and how it responds when its control registers are read or written.

# Reference

- https://pdos.csail.mit.edu/6.S081/2020/xv6/book-riscv-rev1.pdf
- https://github.com/harihitode/ladybird-xv6