# Implementing an Sv32 MMU for MyCPU: Report
> This is the report; the full development log is here: https://hackmd.io/@sysprog/SyCApAcWWg
## Goals
* The provided pipelined RISC-V design (`4-soc`) is currently a physical-address-only core. Extending it to support Sv32 (32-bit page-based virtual memory) requires substantial hardware changes.
* CSR implementation (Privileged Specification):
* Implementation of `satp` (Supervisor Address Translation and Protection), which controls paging mode, ASID, and the physical page number of the root page table.
* Correct handling of `mstatus` and `sstatus`, including privilege mode tracking (MPP/SPP) and permission-related bits such as SUM and MXR.
* Support for the `sfence.vma` instruction, which flushes TLB entries and requires logic to invalidate selected or all TLB entries.
* Address translation logic:
* Translation Lookaside Buffers (TLB):
* Engineering challenge: Designing separate instruction and data TLBs (I-TLB and D-TLB). These must be set-associative or fully associative to maintain reasonable performance.
* Critical-path concern: TLB lookup sits directly in the instruction-fetch and memory stages. Careful Chisel design is required to avoid degrading the maximum clock frequency.
* Page Table Walker (PTW):
* Requires a hardware finite-state machine that walks the two-level Sv32 page table (Level 1 then Level 0) in physical memory when a TLB miss occurs.
* The PTW must arbitrate memory access with the rest of the pipeline, typically stalling the core while it fetches and processes Page Table Entries (PTEs).
* Hardware page updates (A/D bits):
* The hardware must check and update the Accessed (A) and Dirty (D) bits in PTEs during the walk. When a page is accessed or written, these bits must be set atomically in memory, which increases the complexity of the PTW’s memory operations.
* Integrating the MMU into the MyCPU pipeline is one of the most error-prone aspects of the project:
* Pipeline stalling: The core must correctly stall the Fetch stage on an I-TLB miss and the Memory stage on a D-TLB miss while the PTW completes translation.
* Exception handling:
* Precise detection of Instruction Page Faults, Load Page Faults, and Store/AMO Page Faults.
* Correctly writing the faulting virtual address into `stval` (or `mtval`) and transferring control to the trap handler (`stvec`), with proper privilege-mode switching.
* Verification: Ensure that the [mmu-test suite](https://github.com/sysprog21/rv32emu/tree/master/tests/system/mmu) runs correctly and that all translations, exceptions, and corner cases behave as expected.
---
## What is Sv32?
### Sv32 Virtual Memory Overview
Sv32 is the 32-bit virtual memory paging scheme defined in the RISC-V privileged specification.
It translates a 32-bit virtual address (VA) into a physical address (PA) using a two-level page table structure.
Sv32 uses 4 KiB pages, and the virtual address is divided as follows:
```
 31        22 21        12 11          0
+------------+------------+-------------+
|   VPN[1]   |   VPN[0]   | page offset |
+------------+------------+-------------+
   10 bits      10 bits       12 bits
```
- VPN[1]: Index into the Level-1 page table
- VPN[0]: Index into the Level-0 page table
- page offset: Offset within a 4 KiB page
After translation, the physical address is formed as:
```
PA = [ PPN (from PTE) | page offset ]
```
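As a quick illustration, here is a minimal Chisel sketch of this field split (the core is written in Chisel; the module and signal names below are illustrative, not MyCPU's actual ports):
```scala
import chisel3._
import chisel3.util._

// Sv32 field split and PA formation. Illustrative names, not MyCPU's
// actual signals; assumes a 32-bit PA (Sv32 allows up to 34 bits).
class Sv32Fields extends Module {
  val io = IO(new Bundle {
    val va     = Input(UInt(32.W))
    val ppn    = Input(UInt(20.W)) // PPN taken from the leaf PTE
    val vpn1   = Output(UInt(10.W))
    val vpn0   = Output(UInt(10.W))
    val offset = Output(UInt(12.W))
    val pa     = Output(UInt(32.W))
  })
  io.vpn1   := io.va(31, 22)
  io.vpn0   := io.va(21, 12)
  io.offset := io.va(11, 0)
  io.pa     := Cat(io.ppn, io.offset) // PA = [ PPN | page offset ]
}
```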
### satp Register and Paging Enable
Paging in Sv32 is controlled by the `satp` CSR (Supervisor Address Translation and Protection).
In Sv32 mode, `satp` contains:
- `MODE` (bit 31): enables Sv32 paging (0 = Bare, 1 = Sv32)
- `ASID` (bits 30:22): Address Space Identifier (not implemented yet)
- `PPN` (bits 21:0): Physical Page Number of the root (Level-1) page table

When `satp.MODE` enables Sv32, all instruction fetches and data accesses below M-mode use virtual addresses and must go through the MMU for VA→PA translation.
### Page Table Walk (PTW)
On a TLB miss, the MMU performs a two-level page table walk:
1. Level-1 Lookup
```
PTE1_addr = satp.PPN * 4096 + VPN[1] * 4
```
2. Level-0 Lookup
```
PTE0_addr = PTE1.PPN * 4096 + VPN[0] * 4
```
If any PTE is invalid or a permission check fails, the MMU must raise the fault matching the access type:
- Instruction Page Fault
- Load Page Fault
- Store/AMO Page Fault
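In Chisel terms, the two PTE addresses above can be sketched as below (names are illustrative; the `* 4096` and `* 4` become concatenations because page tables are page-aligned and PTEs are 4 bytes):
```scala
import chisel3._
import chisel3.util._

// PTE1_addr = satp.PPN * 4096 + VPN[1] * 4
// PTE0_addr = PTE1.PPN * 4096 + VPN[0] * 4
// Sketch only; assumes a 32-bit PA, so only satp.PPN[19:0] is usable.
class PteAddr extends Module {
  val io = IO(new Bundle {
    val satpPpn  = Input(UInt(22.W)) // root page-table PPN from satp
    val l1PtePpn = Input(UInt(20.W)) // PPN field of the non-leaf L1 PTE
    val vpn1     = Input(UInt(10.W))
    val vpn0     = Input(UInt(10.W))
    val pte1Addr = Output(UInt(32.W))
    val pte0Addr = Output(UInt(32.W))
  })
  io.pte1Addr := Cat(io.satpPpn(19, 0), io.vpn1, 0.U(2.W))
  io.pte0Addr := Cat(io.l1PtePpn, io.vpn0, 0.U(2.W))
}
```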
---
## Supervisor-mode CSR + Trap Infrastructure (Implemented)
### Why implement S-mode first?
Before MMU/PTW becomes meaningful, the core must be able to *enter and run Supervisor mode correctly*, because in a typical Sv32 system the operating system runs in S-mode and controls virtual memory through S-mode CSRs (e.g., `satp`, `stvec`, `sstatus`).
So I implemented the S-mode CSR set and the trap/return path first, to make sure page table control and page-fault delivery can later work end-to-end in Supervisor context.
---
### 1) What I implemented
- Added Supervisor CSRs in the CSR module:
- `satp`, `stvec`, `sscratch`, `sepc`, `scause`, `stval`, and the supervisor-visible status view `sstatus`.
- Added a privilege mode state (`priv_mode`) in CSR to track the current execution mode (M or S).
- Extended CLINT to handle trap entry / return and to commit CSR updates via a direct-write interface.
- Implemented exception delegation via `medeleg` (exceptions only; interrupt delegation via `mideleg` is not implemented at this stage).
---
### 2) How it is implemented (hardware behavior)
#### 2.1 CSR storage + `sstatus` behavior
- Keep one physical `mstatus` register.
- Expose `sstatus` as a masked view of `mstatus`:
  - Read: `sstatus = mstatus & SSTATUS_MASK`
  - Write: only the masked bits in `mstatus` are updated, preserving all other fields.
- This ensures S-mode status fields (`SPP`, `SIE`, `SPIE`, etc.) are stored physically in `mstatus` while still behaving like `sstatus` architecturally.
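A minimal sketch of this masked-view logic (the mask below only covers a few representative `sstatus` bits; the full mask in the real CSR module has more fields):
```scala
import chisel3._

// sstatus as a masked view of one physical mstatus register.
// Mask covers SIE(1), SPIE(5), SPP(8), SUM(18), MXR(19) only.
object SstatusView {
  val SSTATUS_MASK = "b0000_0000_0000_1100_0000_0001_0010_0010".U(32.W)

  // read: sstatus = mstatus & SSTATUS_MASK
  def read(mstatus: UInt): UInt = mstatus & SSTATUS_MASK

  // write: update only the masked bits, preserve everything else
  def write(mstatus: UInt, wdata: UInt): UInt =
    (mstatus & ~SSTATUS_MASK) | (wdata & SSTATUS_MASK)
}
```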
#### 2.2 Trap commit path (atomic CSR updates)
- CSR module provides a “direct write” commit path from CLINT:
- `direct_write_enable` (M target)
- `direct_write_enable_s` (S target)
- When CLINT asserts these enables, CSR updates commit atomically in one place:
- M target writes: `mstatus/mepc/mcause/mtval`
- S target writes: `sepc/scause/stval` plus `sstatus` effect through the masked write path
- `priv_mode` is updated through a dedicated interface (`priv_write_enable/priv_write_data`) to make privilege transitions explicit and easy to debug in waveforms.
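For reference, the commit interface can be pictured as a bundle like the following (a sketch that mirrors the description above; field names follow the prose, widths are assumptions):
```scala
import chisel3._

// CLINT -> CSR direct-write commit path, as described above.
class TrapCommitIO extends Bundle {
  val direct_write_enable   = Bool() // commit to M-target CSRs
  val direct_write_enable_s = Bool() // commit to S-target CSRs
  val epc   = UInt(32.W)             // -> mepc or sepc
  val cause = UInt(32.W)             // -> mcause or scause
  val tval  = UInt(32.W)             // -> mtval or stval
  val priv_write_enable = Bool()     // explicit privilege-mode update
  val priv_write_data   = UInt(2.W)  // 2'b11 = M, 2'b01 = S
}
```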
#### 2.3 Trap entry and return (CLINT)
- CLINT determines whether to take a trap (exceptions + optional interrupts).
- On trap entry, it records the faulting PC into `*epc`, writes `*cause`, then redirects PC to `*tvec`:
- If handled in M-mode:
- jump to `mtvec`
- write `mepc/mcause/mtval`
- set `priv_mode = M`
- If handled in S-mode:
- jump to `stvec`
- write `sepc/scause/stval`
- set `priv_mode = S`
- Return instructions:
- `mret`: PC <- `mepc`, and `priv_mode` is restored from `mstatus.MPP` (supports M→S transition)
- `sret`: PC <- `sepc`, return to S (current design has no U-mode yet)
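The return path boils down to a small sketch like this (assuming `MPP` at `mstatus[12:11]` and the standard 2'b11/2'b01 encodings for M/S; signal names are illustrative):
```scala
import chisel3._

// mret/sret return path. Illustrative names; interrupt-enable
// (MIE/MPIE, SIE/SPIE) restoration is elided for brevity.
class RetPath extends Module {
  val io = IO(new Bundle {
    val is_mret  = Input(Bool())
    val mepc     = Input(UInt(32.W))
    val sepc     = Input(UInt(32.W))
    val mstatus  = Input(UInt(32.W))
    val next_pc  = Output(UInt(32.W))
    val next_prv = Output(UInt(2.W))
  })
  val PRV_S = 1.U(2.W)
  io.next_pc := Mux(io.is_mret, io.mepc, io.sepc)
  // mret restores the mode saved in mstatus.MPP (supports M -> S);
  // sret always lands in S because the core has no U-mode yet.
  io.next_prv := Mux(io.is_mret, io.mstatus(12, 11), PRV_S)
}
```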
#### 2.4 Exception delegation via `medeleg` (exceptions only)
- Delegation decision is explicit and per-cause:
- `delegatedToS = (cur_priv != M) && trap_is_exception && medeleg[cause]`
- If delegated:
- use S-mode CSRs (`sepc/scause/stval`) and jump to `stvec`
- If not delegated:
- default to M-mode CSRs (`mepc/mcause/mtval`) and jump to `mtvec`
- Interrupt delegation via `mideleg` is intentionally not implemented at this stage.
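In hardware the decision itself is a one-liner; a sketch (illustrative names, exceptions only since `mideleg` is not implemented):
```scala
import chisel3._

// delegatedToS = (cur_priv != M) && trap_is_exception && medeleg[cause]
class DelegDecision extends Module {
  val io = IO(new Bundle {
    val cur_priv     = Input(UInt(2.W))  // 2'b11 = M, 2'b01 = S
    val is_exception = Input(Bool())
    val cause        = Input(UInt(5.W))  // exception cause code
    val medeleg      = Input(UInt(32.W))
    val delegatedToS = Output(Bool())
  })
  val PRV_M = 3.U(2.W)
  io.delegatedToS := (io.cur_priv =/= PRV_M) && io.is_exception &&
    io.medeleg(io.cause) // dynamic bit select by cause code
}
```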
---
### 3) How I verified it (waveform checkpoints)
:::spoiler test code here
```asm=
.section .text
.globl main
.option norvc
main:
# ===== M-mode checkpoint =====
# Write a known value to mscratch to mark execution in M-mode
li t0, 0x4D300001
csrw mscratch, t0
# ===== Configure MRET target and set MPP = S =====
# Prepare mstatus so that MRET returns to Supervisor mode
csrr t1, mstatus
li t0, ~(3 << 11) # Clear MPP[12:11]
and t1, t1, t0
li t0, (1 << 11) # Set MPP = 01 (Supervisor mode)
or t1, t1, t0
csrw mstatus, t1
# Set return address for MRET
la t0, s_main
csrw mepc, t0
# Second M-mode checkpoint before MRET
li t0, 0x4D300002
csrw mscratch, t0
# Return from M-mode to S-mode
mret
# =========================
# S-mode trap handler
# =========================
.align 4
s_trap:
# Trap entry checkpoint
li t0, 0x5330EE01
csrw sscratch, t0
# Advance SEPC to skip the faulting ECALL instruction
csrr t1, sepc
addi t1, t1, 4
csrw sepc, t1
# Trap exit checkpoint before SRET
li t0, 0x5330EE02
csrw sscratch, t0
# Return from Supervisor trap
sret
# --- Insert a large gap to make PC jumps clearly visible in waveforms ---
.align 4
.space 1024 # Can be increased (e.g., 4096) for clearer separation
# =========================
# S-mode main
# =========================
.align 4
s_main:
# Write the current PC into sscratch for easy identification in waveforms
auipc t2, 0
csrw sscratch, t2 # sscratch = PC of s_main
# Configure Supervisor trap vector to point to s_trap
# This is intentionally done in S-mode
la t0, s_trap
csrw stvec, t0
# S-mode execution checkpoint before ECALL
li t0, 0x53300002
csrw sscratch, t0
# Trigger Supervisor-mode ECALL
ecall
after_ecall:
# Checkpoint indicating successful return from SRET
li t0, 0x53300003
csrw sscratch, t0
done:
# Infinite loop to keep execution observable
j done
```
:::
#### 3.1 M→S transition via `mret`
- Test sets `mstatus.MPP = S` and `mepc = s_main`, then executes `mret`.
- Expected waveform:
- PC jumps to `s_main`
- `priv_mode` switches from M to S

#### 3.2 S-mode `ecall` trap + `sret` return
- In S-mode, set `stvec = s_trap`, execute `ecall`.
- Handler increments `sepc` by 4 then executes `sret`.
- Expected waveform:
- `sepc` captures the faulting PC
- `scause = 9` (ECALL from S-mode)
- PC jumps to `stvec`
- `sret` returns to the instruction after `ecall`


#### 3.3 `medeleg` behavior (delegated vs non-delegated)
- Run the same S-mode `ecall`, but toggle `medeleg[9]`:
- `medeleg[9] = 1`:
- trap stays in S-mode
- PC jumps to `stvec`
- writes `sepc/scause`

- `medeleg[9] = 0`:
- trap escalates to M-mode (default behavior)
- PC jumps to `mtvec`
- writes `mepc/mcause`
- `priv_mode` transitions S → M

---
## PTW + TLB (Full Version Only)
### 1) What I implemented
To support Sv32 VA→PA translation, I implemented inside the MMU:
- **Separate ITLB / DTLB**
- **8 sets × 2 ways** (16 entries) each
- Cache translations for **4KB pages** and **4MB superpages**
- Per-set **round-robin replacement** (waveform-friendly and deterministic)
- **A shared PTW (Page Table Walker) FSM**
- Performs a **full 2-level Sv32 walk**:
- L1 PTE fetch → (if non-leaf) L0 PTE fetch
- Produces either:
- **A leaf translation** → fills ITLB/DTLB
- **A fault** → raises I/D fault signals (full trap plumbing is the next step)
- **Pipeline stall + bus arbitration**
- PTW must fetch PTEs via the **same AXI/bus** that the core MEM stage uses
- So I added a PTW memory port and used **mutual exclusion**:
- When PTW is active, **stall the whole core** (reuse `mem_stall`)
- Block MEM stage from issuing requests during MMU stall
- Use a **MUX** controlled by `ptw_active` to route the bus request/response to either the **PTW** or the **normal MEM stage** (see the sketch below)
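A minimal sketch of that mux (the bundle fields are illustrative placeholders, not the actual AXI channel signals):
```scala
import chisel3._

// Hypothetical simplified request bundle (not the real AXI channels).
class MemReq extends Bundle {
  val valid = Bool()
  val addr  = UInt(32.W)
  val wen   = Bool()
  val wdata = UInt(32.W)
}

// Route the shared bus to either the PTW or the MEM stage.
class BusMux extends Module {
  val io = IO(new Bundle {
    val ptw_active = Input(Bool())
    val ptw_req    = Input(new MemReq)
    val mem_req    = Input(new MemReq)
    val bus_req    = Output(new MemReq)
  })
  // While the PTW owns the bus the whole core is stalled (mem_stall),
  // so the MEM stage cannot issue requests and a plain mux suffices.
  io.bus_req := Mux(io.ptw_active, io.ptw_req, io.mem_req)
}
```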
---
### 2) How the TLB works (high level)
Each (I/D) TLB entry stores:
- `valid`
- `tag`
- `ppn` (PA[31:12])
- `isSuper` (distinguish 4KB vs 4MB superpage entry)
Lookup:
- **4KB page** lookup uses **VPN0-based set** and a 4KB tag
- **4MB superpage** lookup uses **VPN1-based set** and a superpage tag
- `isSuper` prevents matching the wrong page size
On hit:
- output `PA = (PPN << 12) | page_offset`
On PTW completion:
- fill the selected set/way
- update per-set RR victim pointer
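A sketch of one entry and the 4KB hit check for the 8-set, 2-way organization (widths assume a 32-bit PA; field names follow the list above):
```scala
import chisel3._

// One TLB entry. For 4KB pages: set = VPN[2:0] (VA[14:12], 8 sets),
// tag = VPN[19:3] (VA[31:15]). Widths assume a 32-bit PA.
class TlbEntry extends Bundle {
  val valid   = Bool()
  val tag     = UInt(17.W)
  val ppn     = UInt(20.W) // PA[31:12]
  val isSuper = Bool()     // 4MB superpage entry
}

// 4KB hit check against one way of the selected set.
class TlbHit4K extends Module {
  val io = IO(new Bundle {
    val vpn   = Input(UInt(20.W)) // VA[31:12]
    val entry = Input(new TlbEntry)
    val hit   = Output(Bool())
  })
  // isSuper must be clear so a superpage entry cannot alias a 4KB match.
  io.hit := io.entry.valid && !io.entry.isSuper &&
    (io.entry.tag === io.vpn(19, 3))
}
```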
---
### 3) How the PTW works (full Sv32 walk)

The PTW FSM contains the following states:
- **sIdle**
The PTW is inactive and the pipeline is not stalled by the MMU.
When Sv32 is enabled and an instruction/data access misses in the corresponding TLB, the MMU latches the faulting virtual address and access type (I-side fetch vs D-side load/store), then starts a page table walk.
- **sL1Req**
Issues a memory read request for the level-1 PTE (L1 PTE).
The requested physical address is computed from `satp.ppn` (root page table base) and `VPN[1]` extracted from the latched VA.
- **sL1Wait**
Waits for the memory response containing the L1 PTE.
Once the response arrives, the PTW checks:
1) validity and illegal encoding (e.g., `V=0` or `R=0 && W=1`),
2) whether the entry is a **leaf** (translation terminates here) or a **pointer** (must continue to level-0),
3) if it is an L1 leaf (superpage/4MB mapping), it also checks alignment constraints (Sv32 requires `PPN0 == 0` for a superpage leaf).
If the L1 PTE is a valid leaf, the walk completes and transitions to `sLeaf`.
If it is a valid non-leaf pointer, the PTW proceeds to fetch the L0 PTE.
- **sL0Req**
Issues a memory read request for the level-0 PTE (L0 PTE).
The base address is derived from the PPN in the L1 PTE, and the index comes from `VPN[0]` of the latched VA.
- **sL0Wait**
Waits for the memory response containing the L0 PTE.
When the response arrives, the PTW validates the PTE similarly:
- invalid or illegal encoding → page fault (`sFault`)
- leaf entry → translation complete (`sLeaf`)
- non-leaf entry at L0 → page fault (`sFault`) because Sv32 has only two levels
- **sLeaf**
Finalization state for a successful translation.
In this state the PTW:
1) checks access permissions based on the request type:
- instruction fetch requires `X`
- load requires `R`
- store requires `W`
2) constructs the final translated PPN:
- for a normal 4KB page, PPN comes directly from the L0 leaf PTE
- for a 4MB superpage, the PPN is formed by combining `PPN1` from the L1 leaf PTE with `VPN0` from the VA
3) fills the appropriate TLB (ITLB or DTLB), including replacement selection (2-way set-associative with per-set round-robin victim)
After filling the TLB, the PTW returns to `sIdle` so the stalled request can be retried using the translated physical address.
- **sFault**
Terminal state for translation failure.
The MMU reports a fault to either the I-side or D-side depending on which request triggered the walk, then returns to `sIdle`.
*(Note: fault signaling to the IF stage and writing the corresponding `scause` are not implemented yet and will be added later.)*
> Note: A/D-bit update is **not implemented yet** (I currently pre-set A/D in the test PTEs).
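Putting the states together, the control skeleton looks roughly like this (transitions only; the bus handshake, PTE decoding, the superpage alignment check, permission checks, and the TLB fill are elided, and all names are illustrative):
```scala
import chisel3._
import chisel3.util._

// PTW FSM skeleton: sIdle -> sL1Req -> sL1Wait -> (sL0Req -> sL0Wait)
// -> sLeaf/sFault. PTE checks and the TLB fill are elided.
class PtwFsm extends Module {
  val io = IO(new Bundle {
    val start     = Input(Bool()) // TLB miss latched, Sv32 enabled
    val mem_valid = Input(Bool()) // PTE response arrived
    val pte_leaf  = Input(Bool()) // decoded from the returned PTE
    val pte_ok    = Input(Bool()) // V=1, not (R=0 && W=1), alignment OK
    val active    = Output(Bool())
  })
  val sIdle :: sL1Req :: sL1Wait :: sL0Req :: sL0Wait :: sLeaf :: sFault :: Nil = Enum(7)
  val state = RegInit(sIdle)
  io.active := state =/= sIdle // drives the core-wide stall

  switch(state) {
    is(sIdle)   { when(io.start) { state := sL1Req } }
    is(sL1Req)  { state := sL1Wait } // request accepted (handshake elided)
    is(sL1Wait) {
      when(io.mem_valid) {
        state := Mux(!io.pte_ok, sFault, Mux(io.pte_leaf, sLeaf, sL0Req))
      }
    }
    is(sL0Req)  { state := sL0Wait }
    is(sL0Wait) {
      when(io.mem_valid) {
        // A non-leaf PTE at level 0 is a fault: Sv32 has only two levels.
        state := Mux(io.pte_ok && io.pte_leaf, sLeaf, sFault)
      }
    }
    is(sLeaf)   { state := sIdle } // permission check + TLB fill here
    is(sFault)  { state := sIdle } // raise i_fault/d_fault
  }
}
```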
---
### 4) How I verified it
Because the full mmu-test requires fault delivery + other missing pieces, I validated PTW/TLB behavior using a **standalone Sv32 assembly test** that:
- Builds page tables in memory at fixed, aligned physical addresses
- Enables `satp`
- Triggers controlled accesses that force:
- **L1 leaf (4MB superpage)** translations
- **L1 pointer → L0 leaf (4KB)** translations
:::spoiler
```asm=
.section .text
.globl main
.equ L1_PT_PA, 0x00005000 # 4KB aligned
.equ L0_PT0_PA, 0x00006000 # vpn1=0 -> VA 0x0000_0000 ~ 0x003F_FFFF
.equ PTE_PTR, 0x001 # V=1, R=W=X=0 (pointer)
.equ PTE_LEAF, 0x0CF # V|R|W|X|A|D
main:
# 1) mtvec point to trap entry
la t0, __trap_entry
csrw mtvec, t0
# ---------------------------------------------------
# 2) clear L1 + L0 tables (only tables we use)
# ---------------------------------------------------
li t2, 0
# clear L1 page table @ L1_PT_PA
li t0, L1_PT_PA
li t1, 1024
1:
sw t2, 0(t0)
addi t0, t0, 4
addi t1, t1, -1
bnez t1, 1b
# clear L0_PT0 @ L0_PT0_PA
li t0, L0_PT0_PA
li t1, 1024
2:
sw t2, 0(t0)
addi t0, t0, 4
addi t1, t1, -1
bnez t1, 2b
# ---------------------------------------------------
# 3) L1[0] -> L0_PT0 (pointer)
# L1[1] = superpage leaf (4MB) for VA 0x0040_0000~0x007F_FFFF
# ---------------------------------------------------
li t0, L1_PT_PA
# L1[0] pointer -> L0_PT0
li t1, (L0_PT0_PA >> 12)
slli t1, t1, 10
ori t1, t1, PTE_PTR
sw t1, 0(t0) # entry vpn1=0
# L1[1] superpage leaf:
# PPN1 = 1, and PPN0 must be 0 => just put (1<<20) into PTE[31:20]
li t2, (1 << 20) # PPN1 goes to bits [31:20]
ori t2, t2, PTE_LEAF
sw t2, 4(t0) # entry vpn1=1 (superpage leaf)
# ---------------------------------------------------
# 4) Fill L0_PT0: identity map 0~4MB (vpn1=0 chunk)
# ---------------------------------------------------
li t0, L0_PT0_PA
li t1, 0 # vpn0
4:
slli t2, t1, 12 # va within vpn1=0 region
srli t3, t2, 12 # ppn = va>>12 (identity)
slli t3, t3, 10
ori t3, t3, PTE_LEAF
sw t3, 0(t0)
addi t0, t0, 4
addi t1, t1, 1
li t4, 1024
blt t1, t4, 4b
# ---------------------------------------------------
# 5) Enter S-mode
# ---------------------------------------------------
csrr t0, mstatus
li t1, ~(3 << 11)
and t0, t0, t1
li t1, (1 << 11) # MPP=S
or t0, t0, t1
csrw mstatus, t0
la t0, s_main
csrw mepc, t0
mret
# =====================================================
# S-mode main
# =====================================================
.align 4
s_main:
li t0, (1 << 31) | (L1_PT_PA >> 12)
csrw satp, t0
li s0, 0 # error accumulator (0 = pass)
# ===================================================
# (A) Superpage region (vpn1=1 megapage)
# touch a few addresses: base + 0x1000*k + small offset
# ===================================================
li s1, 0x00400000 # superpage base
li s2, 0xA0000000 # seed
# A0: [0x00401000]
li t1, 0x00401000
li t2, 0xA0000001
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# A1: [0x00402004]
li t1, 0x00402004
li t2, 0xA0000002
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# A2: [0x00403008]
li t1, 0x00403008
li t2, 0xA0000003
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# A3: [0x0040400C]
li t1, 0x0040400C
li t2, 0xA0000004
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# ===================================================
# (B) 4KB region (vpn1=0 via L0)
# Force SAME set conflicts:
# base = 0x00001000 keeps VA[14:12]=001
# stride = 0x8000 changes VA[21:15] (tag) but keeps set
# Touch 6 distinct pages => must evict in 2-way
# ===================================================
li s4, 0xCAFE0000
# B0: 0x00001000
li t1, 0x00001000
li t2, 0xCAFE1000
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B1: 0x00009000
li t1, 0x00009000
li t2, 0xCAFE1001
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B2: 0x00011000
li t1, 0x00011000
li t2, 0xCAFE1002
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B3: 0x00019000
li t1, 0x00019000
li t2, 0xCAFE1003
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B4: 0x00021000
li t1, 0x00021000
li t2, 0xCAFE1004
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# B5: 0x00029000
li t1, 0x00029000
li t2, 0xCAFE1005
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# ===================================================
# (C) Re-touch first two addresses again
# If replacement is working, at least one should miss/refill
# ===================================================
# C0: 0x00001000 again
li t1, 0x00001000
li t2, 0xDEAD0000
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
# C1: 0x00009000 again
li t1, 0x00009000
li t2, 0xDEAD0001
sw t2, 0(t1)
lw t3, 0(t1)
xor t4, t2, t3
or s0, s0, t4
done:
j done
```
:::
Waveforms confirm the PTW/TLB loop works end-to-end:
- **I-side (Instruction fetch)**
- On an iTLB miss, the PTW performs the expected Sv32 walk and then fills the **ITLB**.
- 4KB case: `L1 fetch → L0 fetch → leaf → ITLB fill`
- (If the VA maps to a superpage) 4MB case: `L1 fetch → leaf → ITLB fill`
- After the fill, the same fetch is retried and **hits in ITLB**.

- **D-side (Load/Store)**
- I used **two waveforms** to cover both translation patterns:
1) **D-side 1-stage (superpage / 4MB)**
- `L1 fetch → leaf → DTLB fill`
- The retried load/store then **hits in DTLB**.

2) **D-side 2-stage (normal 4KB)**
- `L1 fetch → L0 fetch → leaf → DTLB fill`
- The retried load/store then **hits in DTLB**.

---
### 5) Current limitation (next work)
I can raise `i_fault/d_fault` from PTW, but **end-to-end page fault handling is not complete yet**, mainly because:
- In a pipeline, faults must be **precise** (only the correct-path instruction should trap)
- If the pipeline keeps presenting the same faulting request, PTW can re-walk repeatedly unless the core **kills the request + redirects PC** to the handler
- The final integration step is:
- carry fault info to the right stage,
- flush/redirect correctly,
- write `stval/scause/sepc` via the trap path.
## Next work / Not finished yet
There are still several missing pieces before the Sv32 MMU is “complete” and OS-ready:
- **Run the official `mmu-test` suite**
- I currently cannot pass the full test suite because the remaining exception/fault plumbing is incomplete (see below).
- Target: make the core pass the `rv32emu mmu-test` end-to-end (translation + faults + corner cases).
- **Precise page fault handling (end-to-end)**
- PTW can raise `i_fault/d_fault`, but I still need to:
- generate the correct fault type (`Instruction/Load/Store Page Fault`) and write **`scause`**
- write the faulting VA into **`stval`**
- ensure faults are **precise** in a pipelined core (only correct-path instructions trap)
- correctly **flush/kill** the faulting request and **redirect PC** to the trap handler (`stvec`) to avoid infinite re-walk loops.
- **A/D bit update in hardware**
- Current tests pre-set `A/D` in PTEs; PTW is read-only.
- Target: implement atomic A/D updates (including the extra memory write sequence and corner cases).
- **`sfence.vma` support**
- TLB flush/invalidate behavior is still pending.
- Target: implement `sfence.vma` and verify selective + global invalidation.