# Implementing an Sv32 MMU for MyCPU: Report
> This is the report; the full development log is here: https://hackmd.io/@sysprog/SyCApAcWWg
## Goals
* The provided pipelined RISC-V design (`4-soc`) is currently a physical-address-only core. Extending it to support Sv32 (32-bit page-based virtual memory) requires substantial hardware changes.
* CSR implementation (Privileged Specification):
* Implementation of `satp` (Supervisor Address Translation and Protection), which controls paging mode, ASID, and the physical page number of the root page table.
* Correct handling of `mstatus` and `sstatus`, including privilege mode tracking (MPP/SPP) and permission-related bits such as SUM and MXR.
* Support for the `sfence.vma` instruction, which flushes TLB entries and requires logic to invalidate selected or all TLB entries.
* Address translation logic:
* Translation Lookaside Buffers (TLB):
* Engineering challenge: Designing separate instruction and data TLBs (I-TLB and D-TLB). These must be set-associative or fully associative to maintain reasonable performance.
* Critical-path concern: TLB lookup sits directly in the instruction-fetch and memory stages. Careful Chisel design is required to avoid degrading the maximum clock frequency.
* Page Table Walker (PTW):
* Requires a hardware finite-state machine that walks the two-level Sv32 page table (Level 1 then Level 0) in physical memory when a TLB miss occurs.
* The PTW must arbitrate memory access with the rest of the pipeline, typically stalling the core while it fetches and processes Page Table Entries (PTEs).
* Hardware page updates (A/D bits):
* The hardware must check and update the Accessed (A) and Dirty (D) bits in PTEs during the walk. When a page is accessed or written, these bits must be set atomically in memory, which increases the complexity of the PTW’s memory operations.
* Integrating the MMU into the MyCPU pipeline is one of the most error-prone aspects of the project:
* Pipeline stalling: The core must correctly stall the Fetch stage on an I-TLB miss and the Memory stage on a D-TLB miss while the PTW completes translation.
* Exception handling:
* Precise detection of Instruction Page Faults, Load Page Faults, and Store/AMO Page Faults.
* Correctly writing the faulting virtual address into `stval` (or `mtval`) and transferring control to the trap handler (`stvec`), with proper privilege-mode switching.
* Verification: Ensure that the [mmu-test suite](https://github.com/sysprog21/rv32emu/tree/master/tests/system/mmu) runs correctly and that all translations, exceptions, and corner cases behave as expected.
---
## What is Sv32?
### Sv32 Virtual Memory Overview
Sv32 is the 32-bit virtual memory paging scheme defined in the RISC-V privileged specification.
It translates a 32-bit virtual address (VA) into a physical address (PA) using a two-level page table structure.
Sv32 uses 4 KiB pages, and the virtual address is divided as follows:
```
 31        22 21        12 11          0
+------------+------------+-------------+
|   VPN[1]   |   VPN[0]   | page offset |
+------------+------------+-------------+
   10 bits      10 bits       12 bits
```
- VPN[1]: Index into the Level-1 page table
- VPN[0]: Index into the Level-0 page table
- page offset: Offset within a 4 KiB page
After translation, the physical address is formed as:
```
PA = [ PPN (from PTE) | page offset ]
```
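As a quick illustration, here is a minimal Chisel sketch of this field split (the core is written in Chisel; the module and signal names below are illustrative, not MyCPU's actual ports):
```scala
import chisel3._
import chisel3.util._

// Sv32 field split and PA formation. Illustrative names, not MyCPU's
// actual signals; assumes a 32-bit PA (Sv32 allows up to 34 bits).
class Sv32Fields extends Module {
  val io = IO(new Bundle {
    val va     = Input(UInt(32.W))
    val ppn    = Input(UInt(20.W)) // PPN taken from the leaf PTE
    val vpn1   = Output(UInt(10.W))
    val vpn0   = Output(UInt(10.W))
    val offset = Output(UInt(12.W))
    val pa     = Output(UInt(32.W))
  })
  io.vpn1   := io.va(31, 22)
  io.vpn0   := io.va(21, 12)
  io.offset := io.va(11, 0)
  io.pa     := Cat(io.ppn, io.offset) // PA = [ PPN | page offset ]
}
```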
### satp Register and Paging Enable
Paging in Sv32 is controlled by the `satp` CSR (Supervisor Address Translation and Protection).
In Sv32 mode, `satp` contains:
- `MODE` (bit 31): enables Sv32 paging (0 = Bare, 1 = Sv32)
- `ASID` (bits 30:22): Address Space Identifier (not implemented yet)
- `PPN` (bits 21:0): Physical Page Number of the root (Level-1) page table

When `satp.MODE` enables Sv32, all instruction fetches and data accesses below M-mode use virtual addresses and must go through the MMU for VA→PA translation.
### Page Table Walk (PTW)
On a TLB miss, the MMU performs a two-level page table walk:
1. Level-1 Lookup
```
PTE1_addr = satp.PPN * 4096 + VPN[1] * 4
```
2. Level-0 Lookup
```
PTE0_addr = PTE1.PPN * 4096 + VPN[0] * 4
```
If any PTE is invalid or a permission check fails, the MMU must raise the fault matching the access type:
- Instruction Page Fault
- Load Page Fault
- Store/AMO Page Fault
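In Chisel terms, the two PTE addresses above can be sketched as below (names are illustrative; the `* 4096` and `* 4` become concatenations because page tables are page-aligned and PTEs are 4 bytes):
```scala
import chisel3._
import chisel3.util._

// PTE1_addr = satp.PPN * 4096 + VPN[1] * 4
// PTE0_addr = PTE1.PPN * 4096 + VPN[0] * 4
// Sketch only; assumes a 32-bit PA, so only satp.PPN[19:0] is usable.
class PteAddr extends Module {
  val io = IO(new Bundle {
    val satpPpn  = Input(UInt(22.W)) // root page-table PPN from satp
    val l1PtePpn = Input(UInt(20.W)) // PPN field of the non-leaf L1 PTE
    val vpn1     = Input(UInt(10.W))
    val vpn0     = Input(UInt(10.W))
    val pte1Addr = Output(UInt(32.W))
    val pte0Addr = Output(UInt(32.W))
  })
  io.pte1Addr := Cat(io.satpPpn(19, 0), io.vpn1, 0.U(2.W))
  io.pte0Addr := Cat(io.l1PtePpn, io.vpn0, 0.U(2.W))
}
```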
---
## Supervisor-mode CSR + Trap Infrastructure (Implemented)
### Why implement S-mode first?
Before MMU/PTW becomes meaningful, the core must be able to *enter and run Supervisor mode correctly*, because in a typical Sv32 system the operating system runs in S-mode and controls virtual memory through S-mode CSRs (e.g., `satp`, `stvec`, `sstatus`).
So I implemented the S-mode CSR set and the trap/return path first, to make sure page table control and page-fault delivery can later work end-to-end in Supervisor context.
---
### 1) What I implemented
- Added Supervisor CSRs in the CSR module:
- `satp`, `stvec`, `sscratch`, `sepc`, `scause`, `stval`, and the supervisor-visible status view `sstatus`.
- Added a privilege mode state (`priv_mode`) in CSR to track the current execution mode (M or S).
- Extended CLINT to handle trap entry / return and to commit CSR updates via a direct-write interface.
- Implemented exception delegation via `medeleg` (exceptions only; interrupt delegation via `mideleg` is not implemented at this stage).
---
### 2) How it is implemented (hardware behavior)
#### 2.1 CSR storage + `sstatus` behavior
- Keep one physical `mstatus` register.
- Expose `sstatus` as a masked view of `mstatus`:
  - Read: `sstatus = mstatus & SSTATUS_MASK`
  - Write: only the masked bits in `mstatus` are updated, preserving all other fields.
- This ensures S-mode status fields (`SPP`, `SIE`, `SPIE`, etc.) are stored physically in `mstatus` while still behaving like `sstatus` architecturally.
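A minimal sketch of this masked-view logic (the mask below only covers a few representative `sstatus` bits; the full mask in the real CSR module has more fields):
```scala
import chisel3._

// sstatus as a masked view of one physical mstatus register.
// Mask covers SIE(1), SPIE(5), SPP(8), SUM(18), MXR(19) only.
object SstatusView {
  val SSTATUS_MASK = "b0000_0000_0000_1100_0000_0001_0010_0010".U(32.W)

  // read: sstatus = mstatus & SSTATUS_MASK
  def read(mstatus: UInt): UInt = mstatus & SSTATUS_MASK

  // write: update only the masked bits, preserve everything else
  def write(mstatus: UInt, wdata: UInt): UInt =
    (mstatus & ~SSTATUS_MASK) | (wdata & SSTATUS_MASK)
}
```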
#### 2.2 Trap commit path (atomic CSR updates)
- CSR module provides a “direct write” commit path from CLINT:
- `direct_write_enable` (M target)
- `direct_write_enable_s` (S target)
- When CLINT asserts these enables, CSR updates commit atomically in one place:
- M target writes: `mstatus/mepc/mcause/mtval`
- S target writes: `sepc/scause/stval` plus `sstatus` effect through the masked write path
- `priv_mode` is updated through a dedicated interface (`priv_write_enable/priv_write_data`) to make privilege transitions explicit and easy to debug in waveforms.
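For reference, the commit interface can be pictured as a bundle like the following (a sketch that mirrors the description above; field names follow the prose, widths are assumptions):
```scala
import chisel3._

// CLINT -> CSR direct-write commit path, as described above.
class TrapCommitIO extends Bundle {
  val direct_write_enable   = Bool() // commit to M-target CSRs
  val direct_write_enable_s = Bool() // commit to S-target CSRs
  val epc   = UInt(32.W)             // -> mepc or sepc
  val cause = UInt(32.W)             // -> mcause or scause
  val tval  = UInt(32.W)             // -> mtval or stval
  val priv_write_enable = Bool()     // explicit privilege-mode update
  val priv_write_data   = UInt(2.W)  // 2'b11 = M, 2'b01 = S
}
```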
#### 2.3 Trap entry and return (CLINT)
- CLINT determines whether to take a trap (exceptions + optional interrupts).
- On trap entry, it records the faulting PC into `*epc`, writes `*cause`, then redirects PC to `*tvec`:
- If handled in M-mode:
- jump to `mtvec`
- write `mepc/mcause/mtval`
- set `priv_mode = M`
- If handled in S-mode:
- jump to `stvec`
- write `sepc/scause/stval`
- set `priv_mode = S`
- Return instructions:
- `mret`: PC <- `mepc`, and `priv_mode` is restored from `mstatus.MPP` (supports M→S transition)
- `sret`: PC <- `sepc`, return to S (current design has no U-mode yet)
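The return path boils down to a small sketch like this (assuming `MPP` at `mstatus[12:11]` and the standard 2'b11/2'b01 encodings for M/S; signal names are illustrative):
```scala
import chisel3._

// mret/sret return path. Illustrative names; interrupt-enable
// (MIE/MPIE, SIE/SPIE) restoration is elided for brevity.
class RetPath extends Module {
  val io = IO(new Bundle {
    val is_mret  = Input(Bool())
    val mepc     = Input(UInt(32.W))
    val sepc     = Input(UInt(32.W))
    val mstatus  = Input(UInt(32.W))
    val next_pc  = Output(UInt(32.W))
    val next_prv = Output(UInt(2.W))
  })
  val PRV_S = 1.U(2.W)
  io.next_pc := Mux(io.is_mret, io.mepc, io.sepc)
  // mret restores the mode saved in mstatus.MPP (supports M -> S);
  // sret always lands in S because the core has no U-mode yet.
  io.next_prv := Mux(io.is_mret, io.mstatus(12, 11), PRV_S)
}
```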
#### 2.4 Exception delegation via `medeleg` (exceptions only)
- Delegation decision is explicit and per-cause:
- `delegatedToS = (cur_priv != M) && trap_is_exception && medeleg[cause]`
- If delegated:
- use S-mode CSRs (`sepc/scause/stval`) and jump to `stvec`
- If not delegated:
- default to M-mode CSRs (`mepc/mcause/mtval`) and jump to `mtvec`
- Interrupt delegation via `mideleg` is intentionally not implemented at this stage.
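In hardware the decision itself is a one-liner; a sketch (illustrative names, exceptions only since `mideleg` is not implemented):
```scala
import chisel3._

// delegatedToS = (cur_priv != M) && trap_is_exception && medeleg[cause]
class DelegDecision extends Module {
  val io = IO(new Bundle {
    val cur_priv     = Input(UInt(2.W))  // 2'b11 = M, 2'b01 = S
    val is_exception = Input(Bool())
    val cause        = Input(UInt(5.W))  // exception cause code
    val medeleg      = Input(UInt(32.W))
    val delegatedToS = Output(Bool())
  })
  val PRV_M = 3.U(2.W)
  io.delegatedToS := (io.cur_priv =/= PRV_M) && io.is_exception &&
    io.medeleg(io.cause) // dynamic bit select by cause code
}
```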
---
### 3) How I verified it (waveform checkpoints)
:::spoiler test code here
```asm=
.section .text
.globl main
.option norvc
main:
# ===== M-mode checkpoint =====
# Write a known value to mscratch to mark execution in M-mode
li t0, 0x4D300001
csrw mscratch, t0
# ===== Configure MRET target and set MPP = S =====
# Prepare mstatus so that MRET returns to Supervisor mode
csrr t1, mstatus
li t0, ~(3 << 11) # Clear MPP[12:11]
and t1, t1, t0
li t0, (1 << 11) # Set MPP = 01 (Supervisor mode)
or t1, t1, t0
csrw mstatus, t1
# Set return address for MRET
la t0, s_main
csrw mepc, t0
# Second M-mode checkpoint before MRET
li t0, 0x4D300002
csrw mscratch, t0
# Return from M-mode to S-mode
mret
# =========================
# S-mode trap handler
# =========================
.align 4
s_trap:
# Trap entry checkpoint
li t0, 0x5330EE01
csrw sscratch, t0
# Advance SEPC to skip the faulting ECALL instruction
csrr t1, sepc
addi t1, t1, 4
csrw sepc, t1
# Trap exit checkpoint before SRET
li t0, 0x5330EE02
csrw sscratch, t0
# Return from Supervisor trap
sret
# --- Insert a large gap to make PC jumps clearly visible in waveforms ---
.align 4
.space 1024 # Can be increased (e.g., 4096) for clearer separation
# =========================
# S-mode main
# =========================
.align 4
s_main:
# Write the current PC into sscratch for easy identification in waveforms
auipc t2, 0
csrw sscratch, t2 # sscratch = PC of s_main
# Configure Supervisor trap vector to point to s_trap
# This is intentionally done in S-mode
la t0, s_trap
csrw stvec, t0
# S-mode execution checkpoint before ECALL
li t0, 0x53300002
csrw sscratch, t0
# Trigger Supervisor-mode ECALL
ecall
after_ecall:
# Checkpoint indicating successful return from SRET
li t0, 0x53300003
csrw sscratch, t0
done:
# Infinite loop to keep execution observable
j done
```
:::
#### 3.1 M→S transition via `mret`
- Test sets `mstatus.MPP = S` and `mepc = s_main`, then executes `mret`.
- Expected waveform:
- PC jumps to `s_main`
- `priv_mode` switches from M to S

#### 3.2 S-mode `ecall` trap + `sret` return
- In S-mode, set `stvec = s_trap`, execute `ecall`.
- Handler increments `sepc` by 4 then executes `sret`.
- Expected waveform:
- `sepc` captures the faulting PC
- `scause = 9` (ECALL from S-mode)
- PC jumps to `stvec`
- `sret` returns to the instruction after `ecall`


#### 3.3 `medeleg` behavior (delegated vs non-delegated)
- Run the same S-mode `ecall`, but toggle `medeleg[9]`:
- `medeleg[9] = 1`:
- trap stays in S-mode
- PC jumps to `stvec`
- writes `sepc/scause`

- `medeleg[9] = 0`:
- trap escalates to M-mode (default behavior)
- PC jumps to `mtvec`
- writes `mepc/mcause`
- `priv_mode` transitions S → M

---
## PTW + TLB (Full Version Only)
### 1) What I implemented
To support Sv32 VA→PA translation, I implemented inside the MMU:
- **Separate ITLB / DTLB**
- **8 sets × 2 ways** (16 entries) each
- Cache translations for **4KB pages** and **4MB superpages**
- Per-set **round-robin replacement** (waveform-friendly and deterministic)
- **A shared PTW (Page Table Walker) FSM**
- Performs a **full 2-level Sv32 walk**:
- L1 PTE fetch → (if non-leaf) L0 PTE fetch
- Produces either:
- **A leaf translation** → fills ITLB/DTLB
- **A fault** → raises I/D fault signals (full trap plumbing is the next step)
- **Pipeline stall + bus arbitration**
- PTW must fetch PTEs via the **same AXI/bus** that the core MEM stage uses
- So I added a PTW memory port and used **mutual exclusion**:
- When PTW is active, **stall the whole core** (reuse `mem_stall`)
- Block MEM stage from issuing requests during MMU stall
- Use a **MUX** controlled by `ptw_active` to route the bus request/response to either the **PTW** or the **normal MEM stage** (see the sketch below)
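A minimal sketch of that mux (the bundle fields are illustrative placeholders, not the actual AXI channel signals):
```scala
import chisel3._

// Hypothetical simplified request bundle (not the real AXI channels).
class MemReq extends Bundle {
  val valid = Bool()
  val addr  = UInt(32.W)
  val wen   = Bool()
  val wdata = UInt(32.W)
}

// Route the shared bus to either the PTW or the MEM stage.
class BusMux extends Module {
  val io = IO(new Bundle {
    val ptw_active = Input(Bool())
    val ptw_req    = Input(new MemReq)
    val mem_req    = Input(new MemReq)
    val bus_req    = Output(new MemReq)
  })
  // While the PTW owns the bus the whole core is stalled (mem_stall),
  // so the MEM stage cannot issue requests and a plain mux suffices.
  io.bus_req := Mux(io.ptw_active, io.ptw_req, io.mem_req)
}
```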
---
### 2) How the TLB works (high level)
Each (I/D) TLB entry stores:
- `valid`
- `tag`
- `ppn` (PA[31:12])
- `isSuper` (distinguish 4KB vs 4MB superpage entry)
Lookup:
- **4KB page** lookup uses **VPN0-based set** and a 4KB tag
- **4MB superpage** lookup uses **VPN1-based set** and a superpage tag
- `isSuper` prevents matching the wrong page size
On hit:
- output `PA = (PPN << 12) | page_offset`
On PTW completion:
- fill the selected set/way
- update per-set RR victim pointer
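A sketch of one entry and the 4KB hit check for the 8-set, 2-way organization (widths assume a 32-bit PA; field names follow the list above):
```scala
import chisel3._

// One TLB entry. For 4KB pages: set = VPN[2:0] (VA[14:12], 8 sets),
// tag = VPN[19:3] (VA[31:15]). Widths assume a 32-bit PA.
class TlbEntry extends Bundle {
  val valid   = Bool()
  val tag     = UInt(17.W)
  val ppn     = UInt(20.W) // PA[31:12]
  val isSuper = Bool()     // 4MB superpage entry
}

// 4KB hit check against one way of the selected set.
class TlbHit4K extends Module {
  val io = IO(new Bundle {
    val vpn   = Input(UInt(20.W)) // VA[31:12]
    val entry = Input(new TlbEntry)
    val hit   = Output(Bool())
  })
  // isSuper must be clear so a superpage entry cannot alias a 4KB match.
  io.hit := io.entry.valid && !io.entry.isSuper &&
    (io.entry.tag === io.vpn(19, 3))
}
```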
---
### 3) How the PTW works (full Sv32 walk)

The PTW FSM contains the following states:
- **sIdle**
The PTW is inactive and the pipeline is not stalled by the MMU.
When Sv32 is enabled and an instruction/data access misses in the corresponding TLB, the MMU latches the faulting virtual address and access type (I-side fetch vs D-side load/store), then starts a page table walk.
- **sL1Req**
Issues a memory read request for the level-1 PTE (L1 PTE).
The requested physical address is computed from `satp.ppn` (root page table base) and `VPN[1]` extracted from the latched VA.
- **sL1Wait**
Waits for the memory response containing the L1 PTE.
Once the response arrives, the PTW checks:
1) validity and illegal encoding (e.g., `V=0` or `R=0 && W=1`),
2) whether the entry is a **leaf** (translation terminates here) or a **pointer** (must continue to level-0),
3) if it is an L1 leaf (superpage/4MB mapping), it also checks alignment constraints (Sv32 requires `PPN0 == 0` for a superpage leaf).
If the L1 PTE is a valid leaf, the walk completes and transitions to `sLeaf`.
If it is a valid non-leaf pointer, the PTW proceeds to fetch the L0 PTE.
- **sL0Req**
Issues a memory read request for the level-0 PTE (L0 PTE).
The base address is derived from the PPN in the L1 PTE, and the index comes from `VPN[0]` of the latched VA.
- **sL0Wait**
Waits for the memory response containing the L0 PTE.
When the response arrives, the PTW validates the PTE similarly:
- invalid or illegal encoding → page fault (`sFault`)
- leaf entry → translation complete (`sLeaf`)
- non-leaf entry at L0 → page fault (`sFault`) because Sv32 has only two levels
- **sLeaf**
Finalization state for a successful translation.
In this state the PTW:
1) checks access permissions based on the request type:
- instruction fetch requires `X`
- load requires `R`
- store requires `W`
2) constructs the final translated PPN:
- for a normal 4KB page, PPN comes directly from the L0 leaf PTE
- for a 4MB superpage, the PPN is formed by combining `PPN1` from the L1 leaf PTE with `VPN0` from the VA
3) fills the appropriate TLB (ITLB or DTLB), including replacement selection (2-way set-associative with per-set round-robin victim)
After filling the TLB, the PTW returns to `sIdle` so the stalled request can be retried using the translated physical address.
- **sFault**
Terminal state for translation failure.
The MMU reports a fault to either the I-side or D-side depending on which request triggered the walk, then returns to `sIdle`.
*(Note: fault signaling to the IF stage and writing the corresponding `scause` are not implemented yet and will be added later.)*
> Note: A/D-bit update is **not implemented yet** (I currently pre-set A/D in the test PTEs).
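Putting the states together, the control skeleton looks roughly like this (transitions only; the bus handshake, PTE decoding, the superpage alignment check, permission checks, and the TLB fill are elided, and all names are illustrative):
```scala
import chisel3._
import chisel3.util._

// PTW FSM skeleton: sIdle -> sL1Req -> sL1Wait -> (sL0Req -> sL0Wait)
// -> sLeaf/sFault. PTE checks and the TLB fill are elided.
class PtwFsm extends Module {
  val io = IO(new Bundle {
    val start     = Input(Bool()) // TLB miss latched, Sv32 enabled
    val mem_valid = Input(Bool()) // PTE response arrived
    val pte_leaf  = Input(Bool()) // decoded from the returned PTE
    val pte_ok    = Input(Bool()) // V=1, not (R=0 && W=1), alignment OK
    val active    = Output(Bool())
  })
  val sIdle :: sL1Req :: sL1Wait :: sL0Req :: sL0Wait :: sLeaf :: sFault :: Nil = Enum(7)
  val state = RegInit(sIdle)
  io.active := state =/= sIdle // drives the core-wide stall

  switch(state) {
    is(sIdle)   { when(io.start) { state := sL1Req } }
    is(sL1Req)  { state := sL1Wait } // request accepted (handshake elided)
    is(sL1Wait) {
      when(io.mem_valid) {
        state := Mux(!io.pte_ok, sFault, Mux(io.pte_leaf, sLeaf, sL0Req))
      }
    }
    is(sL0Req)  { state := sL0Wait }
    is(sL0Wait) {
      when(io.mem_valid) {
        // A non-leaf PTE at level 0 is a fault: Sv32 has only two levels.
        state := Mux(io.pte_ok && io.pte_leaf, sLeaf, sFault)
      }
    }
    is(sLeaf)   { state := sIdle } // permission check + TLB fill here
    is(sFault)  { state := sIdle } // raise i_fault/d_fault
  }
}
```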
---
### 4) How I verified it
Because the full mmu-test requires fault delivery + other missing pieces, I validated PTW/TLB behavior using a **standalone Sv32 assembly test** that:
- Builds page tables in memory at fixed, aligned physical addresses
- Enables `satp`
- Triggers controlled accesses that force:
- **L1 leaf (4MB superpage)** translations
- **L1 pointer → L0 leaf (4KB)** translations
:::spoiler
```asm=
.section .text
.globl main
.equ L1_PT_PA, 0x00005000 # 4KB aligned
.equ L0_PT0_PA, 0x00006000 # vpn1=0 -> VA 0x0000_0000 ~ 0x003F_FFFF
.equ PTE_PTR, 0x001 # V=1, R=W=X=0 (pointer)
.equ PTE_LEAF, 0x0CF # V|R|W|X|A|D
main:
# 1) mtvec point to trap entry
la t0, __trap_entry
csrw mtvec, t0
# ---------------------------------------------------
# 2) clear L1 + L0 tables (only tables we use)
# ---------------------------------------------------
li t2, 0
# clear L1 page table @ L1_PT_PA
li t0, L1_PT_PA
li t1, 1024
1:
sw t2, 0(t0)
addi t0, t0, 4
addi t1, t1, -1
bnez t1, 1b
# clear L0_PT0 @ L0_PT0_PA
li t0, L0_PT0_PA
li t1, 1024
2:
sw t2, 0(t0)
addi t0, t0, 4
addi t1, t1, -1
bnez t1, 2b
# ---------------------------------------------------
# 3) L1[0] -> L0_PT0 (pointer)
# L1[1] = superpage leaf (4MB) for VA 0x0040_0000~0x007F_FFFF
# ---------------------------------------------------
li t0, L1_PT_PA
# L1[0] pointer -> L0_PT0
li t1, (L0_PT0_PA >> 12)
slli t1, t1, 10
ori t1, t1, PTE_PTR
sw t1, 0(t0) # entry vpn1=0
# L1[1] superpage leaf:
# PPN1 = 1, and PPN0 must be 0 => just put (1<<20) into PTE[31:20]
li t2, (1 << 20) # PPN1 goes to bits [31:20]
ori t2, t2, PTE_LEAF
sw t2, 4(t0) # entry vpn1=1 (superpage leaf)
# ---------------------------------------------------
# 4) Fill L0_PT0: identity map 0~4MB (vpn1=0 chunk)
# ---------------------------------------------------
li t0, L0_PT0_PA
li t1, 0 # vpn0
4:
slli t2, t1, 12 # va within vpn1=0 region
srli t3, t2, 12 # ppn = va>>12 (identity)
slli t3, t3, 10
ori t3, t3, PTE_LEAF
sw t3, 0(t0)
addi t0, t0, 4
addi t1, t1, 1
li t4, 1024
blt t1, t4, 4b
# ---------------------------------------------------
# 5) Enter S-mode
# ---------------------------------------------------
csrr t0, mstatus
li t1, ~(3 << 11)
and t0, t0, t1
li t1, (1 << 11) # MPP=S
or t0, t0, t1
csrw mstatus, t0
la t0, s_main
csrw mepc, t0
mret
# =====================================================
# S-mode main
# =====================================================
.align 4
s_main:
li t0, (1 << 31) | (L1_PT_PA >> 12)
csrw satp, t0
li s0, 0 # error accumulator (0 = pass)
# ===================================================
# (A) Superpage region (vpn1=1 megapage)
# touch a few addresses: base + 0x1000*k + small offset
# ===================================================
li s1, 0x00400000 # superpage base
li s2, 0xA0000000 # seed
# A0: [0x00401000]
li t1, 0x00401000
li t2, 0xA0000001
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# A1: [0x00402004]
li t1, 0x00402004
li t2, 0xA0000002
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# A2: [0x00403008]
li t1, 0x00403008
li t2, 0xA0000003
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# A3: [0x0040400C]
li t1, 0x0040400C
li t2, 0xA0000004
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# ===================================================
# (B) 4KB region (vpn1=0 via L0)
# Force SAME set conflicts:
# base = 0x00001000 keeps VA[14:12]=001
# stride = 0x8000 changes VA[21:15] (tag) but keeps set
# Touch 6 distinct pages => must evict in 2-way
# ===================================================
li s4, 0xCAFE0000
# B0: 0x00001000
li t1, 0x00001000
li t2, 0xCAFE1000
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B1: 0x00009000
li t1, 0x00009000
li t2, 0xCAFE1001
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B2: 0x00011000
li t1, 0x00011000
li t2, 0xCAFE1002
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B3: 0x00019000
li t1, 0x00019000
li t2, 0xCAFE1003
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B4: 0x00021000
li t1, 0x00021000
li t2, 0xCAFE1004
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B5: 0x00029000
li t1, 0x00029000
li t2, 0xCAFE1005
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# ===================================================
# (C) Re-touch first two addresses again
# If replacement is working, at least one should miss/refill
# ===================================================
# C0: 0x00001000 again
li t1, 0x00001000
li t2, 0xDEAD0000
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# C1: 0x00009000 again
li t1, 0x00009000
li t2, 0xDEAD0001
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
done:
j done
```
:::
Waveforms confirm the PTW/TLB loop works end-to-end:
- **I-side (Instruction fetch)**
- On an iTLB miss, the PTW performs the expected Sv32 walk and then fills the **ITLB**.
- 4KB case: `L1 fetch → L0 fetch → leaf → ITLB fill`
- (If the VA maps to a superpage) 4MB case: `L1 fetch → leaf → ITLB fill`
- After the fill, the same fetch is retried and **hits in ITLB**.

- **D-side (Load/Store)**
- I used **two waveforms** to cover both translation patterns:
1) **D-side 1-stage (superpage / 4MB)**
- `L1 fetch → leaf → DTLB fill`
- The retried load/store then **hits in DTLB**.

2) **D-side 2-stage (normal 4KB)**
- `L1 fetch → L0 fetch → leaf → DTLB fill`
- The retried load/store then **hits in DTLB**.

---
### 5) Current limitation (next work)
I can raise `i_fault/d_fault` from PTW, but **end-to-end page fault handling is not complete yet**, mainly because:
- In a pipeline, faults must be **precise** (only the correct-path instruction should trap)
- If the pipeline keeps presenting the same faulting request, PTW can re-walk repeatedly unless the core **kills the request + redirects PC** to the handler
- The final integration step is:
- carry fault info to the right stage,
- flush/redirect correctly,
- write `stval/scause/sepc` via the trap path.
## Next work / Not finished yet
There are still several missing pieces before the Sv32 MMU is “complete” and OS-ready:
- **Run the official `mmu-test` suite**
- I currently cannot pass the full test suite because the remaining exception/fault plumbing is incomplete (see below).
- Target: make the core pass the `rv32emu mmu-test` end-to-end (translation + faults + corner cases).
- **Precise page fault handling (end-to-end)**
- PTW can raise `i_fault/d_fault`, but I still need to:
- generate the correct fault type (`Instruction/Load/Store Page Fault`) and write **`scause`**
- write the faulting VA into **`stval`**
- ensure faults are **precise** in a pipelined core (only correct-path instructions trap)
- correctly **flush/kill** the faulting request and **redirect PC** to the trap handler (`stvec`) to avoid infinite re-walk loops.
- **A/D bit update in hardware**
- Current tests pre-set `A/D` in PTEs; PTW is read-only.
- Target: implement atomic A/D updates (including the extra memory write sequence and corner cases).
- **`sfence.vma` support**
- TLB flush/invalidate behavior is still pending.
- Target: implement `sfence.vma` and verify selective + global invalidation.