# Computer Architecture - Fall 2025 Homework 3

# Implementation of single cycle CPU

## Overview

In this assignment we start from a single-cycle CPU in Chisel. Although the design executes each instruction in a single cycle, we still conceptually divide the datapath into five familiar stages: IF, ID, EX, MEM, and WB, which makes the structure clearer for implementation and debugging.

## IF (Instruction Fetch)

In this stage, the CPU uses the current program counter (PC) to retrieve the corresponding instruction from instruction memory. Because the design is single-cycle, the fetched instruction is passed to the decode stage within the same cycle.

To correctly manage sequential execution and control-flow changes, this stage handles several important control signals:

#### • `pc` (Program Counter)

A register that stores the address of the current instruction. It updates every cycle unless the fetch operation is stalled.

#### • `instruction_address`

The output address used by the instruction memory. This is simply the current value of `pc`.

#### • `jump_flag_id`

Asserted by the Execute stage when a jump or a taken branch is detected (e.g., **JAL**, **JALR**, **BEQ taken**). When this flag is true, the PC does **not** increment by 4; instead, it redirects to the target address.

#### • `jump_address_id`

The calculated target address for control-flow instructions. This becomes the next PC value if `jump_flag_id` is high.

#### • `instruction_valid`

Indicates whether the instruction memory has returned a valid instruction.

- When `instruction_valid = 1` → normal execution: PC updates and the instruction is forwarded.
- When `instruction_valid = 0` → the PC holds its value, and a **NOP** (`0x00000013`) is inserted to avoid executing invalid data.

Through the combination of these signals, the IF stage supports sequential instruction flow, correct handling of jumps and branches, and safe stalling when instruction memory is not ready.

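
The PC update rule described above can be sketched as a small behavioral model. This is a Python reference model written for illustration, not the Chisel module itself; the function name `fetch_step` and its argument order are assumptions, while the signal meanings follow the descriptions in this section.

```python
NOP = 0x00000013  # addi x0, x0, 0


def fetch_step(pc, instruction_valid, jump_flag, jump_address, mem_instruction):
    """Model one clock cycle of the IF stage.

    Returns (next_pc, instruction forwarded to decode).
    """
    if not instruction_valid:
        # Memory is not ready: hold the PC and forward a NOP.
        return pc, NOP
    # A jump or taken branch overrides the sequential PC + 4.
    next_pc = jump_address if jump_flag else pc + 4
    return next_pc & 0xFFFFFFFF, mem_instruction


print(hex(fetch_step(0x1000, True, False, 0, 0x0040A183)[0]))
print(hex(fetch_step(0x1004, True, True, 0x1000, 0x00000013)[0]))
```

The first call models a sequential step from `0x1000`, the second a redirect back to `0x1000`.
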

---

## IF Test

To validate the correctness of the Instruction Fetch (IF) stage, the `InstructionFetchTest` focuses on two critical behaviors:

1. **Sequential PC update**: when no jump occurs, the PC should increment by 4.
2. **Control-flow redirection**: when a jump or taken branch is requested, the PC should update to the jump target address.

The test constructs an `InstructionFetch` module and initializes the PC at `0x1000`. The signal `instruction_valid` is set to `true`, meaning the IF stage is allowed to update the PC every cycle.

A loop then performs randomized testing across 100 iterations. In each iteration, the test randomly selects one of the following cases:

#### 1. No-jump case (case 0)

- `jump_flag_id` is set to `false`, indicating no control-flow redirection.
- Expected behavior: The next PC should be `pre + 4`, following normal sequential execution.

The test performs:

- Compute `cur = pre + 4`
- Step one clock cycle
- Expect `instruction_address == cur`
- Update `pre` to `pre + 4` for the next iteration

#### 2. Jump case (case 1)

- `jump_flag_id` is set to `true`
- `jump_address_id` is set to `entry` (`0x1000`)
- Expected behavior: The PC must immediately redirect to the given jump target instead of incrementing by 4.

The test performs:

- Step one clock cycle
- Expect `instruction_address == entry`
- Update `pre = entry`

#### What this test verifies

This randomized pattern of interleaving "jump" and "non-jump" cycles ensures that:

- **Sequential behavior is correct**
  When no jump is taken, PC strictly follows `PC + 4`.
- **Jump behavior is correct**
  When a jump is requested, the PC is replaced by the jump target on the very next cycle.


Overall, this test confirms that the IF stage correctly implements the core responsibilities of a single-cycle RISC-V design:

- Proper sequential instruction flow
- Correct handling of JAL, JALR, and taken branch instructions
- Immediate PC redirection in control-flow changes

**Waveform**

![Screenshot 2025-12-04 3.13.26 PM](https://hackmd.io/_uploads/BJzsN3Rb-g.png)

The waveform illustrates the normal sequential behavior of the Instruction Fetch stage. While **instruction_valid** remains high, the program counter increments by 4 on each rising clock edge, producing the expected sequence of **instruction addresses** (0x1000 → 0x1004 → 0x1008…). This verifies that the IF stage properly handles continuous instruction flow under non-branching conditions.

When **jump_flag_id** transitions to 1, the waveform shows that the PC is redirected on the next cycle to the target address provided by **jump_address_id**. After the jump completes, the PC resumes sequential increments from the new location. This confirms correct integration of jump/branch control signals and demonstrates that the IF stage reacts to control-flow changes within a single cycle.

---

## ID (Instruction Decode)

The Instruction Decode stage is responsible for breaking down the 32-bit instruction and generating all control signals required by the Execute, Memory, and Write-Back stages. In this stage, the CPU extracts register addresses, determines immediate formats, selects ALU operand sources, and configures whether memory or register operations will occur.

### Key Functions of the Decode Stage

The ID stage receives the instruction fetched in the IF stage and performs:

- **Field extraction**
  - `opcode`, `rd`, `rs1`, `rs2`, `funct3`, `funct7`
- **Immediate decoding**
  - Supports all RV32I formats
- **Control signal generation**
  - Configures ALU operand sources
  - Selects the write-back data path
  - Enables load/store behavior
  - Enables or disables register file writes
- **Register read address selection**
  - Determines whether rs1 / rs2 are used based on instruction type

Important Control Signals

### ALU Operand Sources

- **ex_aluop1_source**
  - `Register` → uses `rs1`
  - `InstructionAddress` → uses PC (for AUIPC, JAL, branches)
- **ex_aluop2_source**
  - `Register` → uses `rs2`
  - `Immediate` → used by most RV32I instructions except R-type

### Write-Back Data Source

- **RegWriteSource.ALUResult**
  Default for arithmetic instructions
- **RegWriteSource.Memory**
  Used by load instructions
- **RegWriteSource.NextInstructionAddress (PC+4)**
  Used by JAL, JALR (to save return address)

### Memory Access Signals

- **memory_read_enable**
  Asserted for load instructions
- **memory_write_enable**
  Asserted for store instructions

### Register File

- **regs_reg1_read_address / regs_reg2_read_address**
  Only valid if the instruction actually uses `rs1` / `rs2`
- **reg_write_enable**
  Deasserted for branches and stores, so these instructions cannot modify the register file

## ID Test

This test file verifies whether each RV32I instruction type (R, I, S, B, U, and J) correctly generates the expected control signals during the Decode stage. The test focuses on the behaviors implemented in the TODO sections, including proper write-back source selection, correct ALU operand routing, and accurate immediate decoding for each instruction format.


To ensure that the Decode stage behaves as intended, I examined the waveform of several key signals, such as `wb_reg_write_source`, `ex_aluop1_source`, `ex_aluop2_source`, and `ex_immediate`, to confirm that the control logic responds appropriately to different instruction encodings.

**Waveform**

![Screenshot 2025-12-04 4.24.59 PM](https://hackmd.io/_uploads/rkuvrpCbZe.png)

The first instruction observed in the waveform is `0x0040a183`, which corresponds to:

**lw x3, 4(x1)**

- rs1 = x1
- rd = x3
- immediate = 4

Because this is a **Load-type instruction**, several Decode-stage control signals take on specific expected values. The waveform confirms each of these behaviors.

---

#### **1. `ex_aluop1_source = 0` (Register)**

For load instructions, the ALU computes the memory address using:

**ALU = rs1 + imm**

Thus the first operand must come from **rs1**, not from the PC. The waveform shows:

- `ex_aluop1_source = 0` → correctly selecting *Register*.

---

#### **2. `ex_aluop2_source = 1` (Immediate)**

Load instructions always use an I-type immediate as the second ALU operand. The waveform shows:

- `ex_aluop2_source = 1` → correctly selecting *Immediate*.

---

#### **3. `ex_immediate = 4`**

The immediate field for this instruction is:

**imm[11:0] = 0x004 → decimal 4**

The waveform confirms:

- `ex_immediate = 00000004`, meaning the immediate-generation logic (TODO section) works correctly.

---

#### **4. `wb_reg_write_source = 1` (Memory)**

Load instructions write **memory data** back into rd. Thus:

- `wb_reg_write_source = 1` → selecting *Memory*, which matches the waveform.

---

#### **5. `reg_write_enable = 1`**

Loads always write to rd (`x3`). The waveform shows:

- `reg_write_enable = 1`, indicating that the decode logic properly enables register writes for this instruction.

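
The field values quoted for `0x0040a183` can be cross-checked by slicing the instruction word by hand. The following is a minimal Python sketch of RV32I field extraction and I-type immediate decoding; the helper names `sign_extend` and `decode_fields` are illustrative and do not come from the Chisel sources.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_fields(inst):
    """Slice a 32-bit RV32I instruction into its fixed-position fields."""
    return {
        "opcode": inst & 0x7F,           # inst[6:0]
        "rd": (inst >> 7) & 0x1F,        # inst[11:7]
        "funct3": (inst >> 12) & 0x7,    # inst[14:12]
        "rs1": (inst >> 15) & 0x1F,      # inst[19:15]
        "rs2": (inst >> 20) & 0x1F,      # inst[24:20], unused by I-type
        "funct7": (inst >> 25) & 0x7F,   # inst[31:25]
        "imm_i": sign_extend(inst >> 20, 12),  # I-type immediate, inst[31:20]
    }


fields = decode_fields(0x0040A183)
print(fields["opcode"], fields["rd"], fields["rs1"], fields["funct3"], fields["imm_i"])
```

For this word the model yields opcode `0x03` (load), `rd = 3`, `rs1 = 1`, `funct3 = 2` (`lw`), and an immediate of 4, matching the waveform reading.
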

---

Summary

| Signal | Expected | Waveform | Description |
|-------|----------|----------|-------------|
| `ex_aluop1_source` | Register (0) | 0 | ALU operand1 = rs1 |
| `ex_aluop2_source` | Immediate (1) | 1 | ALU operand2 = immediate |
| `ex_immediate` | 4 | 4 | Correct I-type immediate |
| `wb_reg_write_source` | Memory (1) | 1 | Write back memory data |
| `reg_write_enable` | 1 | 1 | Write rd enabled |

This confirms that the Instruction Decode stage behaves correctly for the `lw` instruction.

---

## EX (Execute)

The Execute (EX) stage is responsible for performing arithmetic and logical operations, evaluating branch conditions, and computing jump or branch target addresses. Using the operands and control signals provided by the Decode stage, the EX stage determines whether control flow should change and produces the ALU result that is passed on to the later stages.

### Core Responsibilities

The EX stage performs several essential tasks:

- **ALU operand selection**
  Chooses between register data, immediate values, or the instruction address (PC), depending on the instruction type.
- **Arithmetic and logical computation**
  Executes ADD, SUB, shifts, logical operations, comparisons, and other ALU-supported functions.
- **Branch condition evaluation**
  Determines whether a branch is taken by comparing register values according to funct3:
  - BEQ/BNE → equality/inequality
  - BLT/BGE → signed comparison
  - BLTU/BGEU → unsigned comparison
- **Jump and branch target calculation**
  Computes next PC for:
  - **Branches**: `PC + immediate`
  - **JAL**: `PC + immediate`
  - **JALR**: `(rs1 + immediate) & ~1`
- **Control-flow signaling**
  Produces:
  - `if_jump_flag`: tells IF whether to redirect the PC
  - `if_jump_address`: the computed jump or branch target

---

### Important Signals

#### **ALU Operand Selection**

- **`aluop1_source`**
  - `Register` → operand 1 = `rs1`
  - `InstructionAddress` → operand 1 = PC

  Used for AUIPC, branches, and JAL.
- **`aluop2_source`**
  - `Register` → operand 2 = `rs2`
  - `Immediate` → operand 2 = imm

  Used by all I/S/U/J-type instructions and branches.

These signals guarantee correct operand pairing for every instruction category.

---

#### **Branch Comparison Logic**

Based on **funct3**, the EX stage evaluates:

- BEQ: `rs1 == rs2`
- BNE: `rs1 != rs2`
- BLT/BGE: signed comparisons
- BLTU/BGEU: unsigned comparisons

If the condition is true and the opcode is a branch, the branch is considered **taken**.

---

#### **Jump Target Calculation**

- **Branches**: `target = PC + imm`
- **JAL**: `target = PC + imm`
- **JALR**: `target = (rs1 + imm) & ~1` (LSB forced to zero for alignment)

These calculations ensure correct control-flow redirection at runtime.

#### **Control Outputs**

- **`if_jump_flag`**
  Asserted when:
  - A branch is taken
  - JAL
  - JALR
- **`if_jump_address`**
  Contains the resolved branch or jump target.

---

## EX Test

The `ExecuteTest` focuses on two key responsibilities of the Execute (EX) stage:

1. Performing correct ALU operations (here using ADD as a representative case).
2. Handling branch comparison and jump target calculation for a BEQ instruction.

### 1. ALU operation (ADD)

The first part of the test configures the instruction as an R-type ADD:

- Instruction: `0x001101b3` → `add x3, x2, x1`
- For 100 iterations, it randomly generates:
  - `reg1_data = op1`
  - `reg2_data = op2`
  - Expected result `= op1 + op2`

In each cycle, the test:

- Drives `reg1_data` and `reg2_data` with new random values.
- Steps the clock once.
- Checks that `mem_alu_result` equals `op1 + op2`.
- Verifies `if_jump_flag` remains `0`, confirming that a normal ALU instruction does **not** trigger any control-flow change.

This part validates that:

- The ALUControl correctly decodes an ADD instruction.
- The ALU receives proper operands from the EX stage and produces the expected result.
- The branch/jump logic stays inactive for non-branch instructions.

### 2. Branch decision and target address (BEQ)

The second part tests branch behavior using a BEQ instruction:

- Instruction: `0x00208163` → `beq x1, x2, 2`
- `instruction_address` is set to `2`, and `immediate` is set to `2`.
- `aluop1_source` and `aluop2_source` are both set to `1`, so:
  - ALU operand 1 = PC
  - ALU operand 2 = immediate
- Internally, the branch target is computed as `PC + imm = 2 + 2 = 4`.

Two cases are then checked:

1. **Branch taken (equal case)**
   - `reg1_data = 9`, `reg2_data = 9`
   - After one clock:
     - `if_jump_flag = 1` → branch is taken
     - `if_jump_address = 4` → matches `PC + imm`
2. **Branch not taken (not-equal case)**
   - `reg1_data = 9`, `reg2_data = 19`
   - After one clock:
     - `if_jump_flag = 0` → branch is not taken
     - `if_jump_address` stays at `4`: the **target address computation** is independent of the comparison result, and only the decision to jump depends on the branch condition.

**Waveform of add**

![Screenshot 2025-12-04 10.51.57 PM](https://hackmd.io/_uploads/rJnMgQ1MZg.png)

This waveform corresponds to the first part of `ExecuteTest`, where the instruction is `add x3, x2, x1`. In each cycle, `io_reg1_data` and `io_reg2_data` are randomized, and the Execute stage should simply compute:

- `alu_io_op1 = io_reg1_data`
- `alu_io_op2 = io_reg2_data`
- `alu_io_result = alu_io_op1 + alu_io_op2`
- `io_if_jump_flag = 0` (no control flow change for R-type ALU ops)

From the trace, we can see that `alu_io_op1` and `alu_io_op2` always follow the values of `io_reg1_data` and `io_reg2_data`, and `alu_io_result` is exactly their sum (e.g., `0x0d2ba1b3 + 0x177ee9f1 = 0x24aa8ba4`). Meanwhile, `io_if_jump_flag` stays at 0 for the entire ADD test.

**EX Test: BEQ (`0x00208163`) Waveform**

![Screenshot 2025-12-04 10.51.17 PM](https://hackmd.io/_uploads/rk7gxmJMWg.png)

This waveform verifies two key behaviors implemented in the TODO sections of the Execute stage: (1) branch condition evaluation and (2) branch target address calculation.

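
Branch condition evaluation and target calculation can be modeled in a few lines. The sketch below is a Python reference model under the rules listed for the EX stage; the function names are hypothetical, and the `funct3` encodings are the standard RV32I branch encodings.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret a 32-bit unsigned value as signed."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def branch_taken(funct3, rs1, rs2):
    """Evaluate the branch condition selected by funct3."""
    rs1 &= MASK32
    rs2 &= MASK32
    conditions = {
        0b000: rs1 == rs2,                        # BEQ
        0b001: rs1 != rs2,                        # BNE
        0b100: to_signed(rs1) < to_signed(rs2),   # BLT
        0b101: to_signed(rs1) >= to_signed(rs2),  # BGE
        0b110: rs1 < rs2,                         # BLTU
        0b111: rs1 >= rs2,                        # BGEU
    }
    return conditions.get(funct3, False)


def branch_target(pc, imm):
    """Branches and JAL: PC + imm."""
    return (pc + imm) & MASK32


def jalr_target(rs1, imm):
    """JALR: (rs1 + imm) with the least significant bit cleared."""
    return (rs1 + imm) & ~1 & MASK32


# Values from the BEQ test: PC = 2, imm = 2, operands 9/9 then 9/19.
print(branch_taken(0b000, 9, 9), branch_taken(0b000, 9, 19), branch_target(2, 2))
```

The target is the same in both BEQ cases because it depends only on the PC and the immediate.
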

At around 206 ps, both operands are equal (`io_reg1_data = 0x09`, `io_reg2_data = 0x09`), so the BEQ condition evaluates to true, and the waveform correctly shows `io_if_jump_flag = 1`. In the following cycle (208–209 ps), the operands differ (`0x09` vs. `0x13`), meaning the branch should not be taken, and the waveform reflects this with `io_if_jump_flag = 0`. This confirms that the branch comparison logic behaves as intended.

---

## MEM and WB

See my other report: https://hackmd.io/@ab842650/InClassDisscussion1125

---

## The difficulty that I encountered during development

The major issue I faced occurred when running `make compliance`, where the generated signature output was always `0x00000000`. To diagnose the problem, I traced the execution flow inside `tests/mycpu_plugin/riscof_mycpu.py` and discovered that the command used to invoke the compliance tests originally relied on a utility that is not available on macOS by default:

```shell=
cd {parent_dir} && timeout 3600 sbt --batch "project {sbt_project_name}" "testOnly riscv.compliance.ComplianceTest" 2>&1
```

I was developing on macOS, which does not ship the GNU `timeout` command. As a result, the script attempted to execute the command but silently failed to launch the actual test process. The flow continued running, but no compliance tests were executed, which left the signature files filled entirely with zeros.

After replacing the line with a macOS-compatible version:

```shell=
cd {parent_dir} && sbt --batch "project {sbt_project_name}" "testOnly riscv.compliance.ComplianceTest" 2>&1
```

the compliance framework successfully invoked the simulator, real instructions were executed, and correct signatures were generated.

## sbt test and make compliance results

All single-cycle stage modules (IF, ID, EX, MEM, WB) and their corresponding ScalaTest files compiled and ran successfully.
This confirmed that:

- ALU operations were correct
- Branch comparison logic behaved as expected
- PC update rules matched the specification
- Decode and immediate generation logic functioned properly

![Screenshot 2025-12-04 11.37.05 PM](https://hackmd.io/_uploads/Syqh5myzbe.png)

With the fix in place, the compliance framework was able to:

- correctly invoke the simulator
- execute all RV32I reference tests
- generate valid signature files (no longer all zeros)
- compare them against the golden reference output

All compliance tests passed without errors, confirming that the CPU behaves correctly according to the RISC-V RV32I base ISA.

![Screenshot 2025-12-04 11.36.27 PM](https://hackmd.io/_uploads/rJwK9mkGbg.png)

# Single cycle CPU with MMIO-trap

## Overview

This part extends the 1-single-cycle MyCPU into an RV32I core with machine-mode CSRs, trap/interrupt handling, and memory-mapped I/O (MMIO). On top of the original datapath, it adds:

- A CSR module (mstatus, mie, mtvec, mscratch, mepc, mcause, cycle[h/l]) with Zicsr read-modify-write behavior.
- A CLINT-like block that updates mstatus/mepc/mcause on traps and drives the interrupt handler PC.
- MMIO decoding in the data memory path to access timer / VGA instead of normal RAM.

The CPU must still pass `make test` and `make compliance`, and additionally execute the Nyancat demo to correctly render the VGA animation under Verilator. During development, I used the provided unit tests and waveforms to debug CSR logic, PC update with interrupts, and load/store alignment for MMIO regions.

---

## Control and Status Registers (CSR) in 2-mmio-trap

RISC-V defines a separate 12-bit CSR address space (up to 4096 CSRs), accessed only by CSR instructions. CSR instructions are specified as *atomic read-modify-write* operations, so the CPU must handle the read and the write of a CSR instruction as one indivisible step.


In this lab, MyCPU implements a small subset of machine-mode CSRs:

- **mstatus (0x300)**: Global machine status
  - `MIE` (bit 3): global machine interrupt enable
  - `MPIE` (bit 7): stores previous MIE on trap entry, restored by MRET
- **mie (0x304)**: Machine interrupt enable mask
  - In this design, used to enable/disable the timer interrupt source.
- **mtvec (0x305)**: Trap vector base address
  - When a trap occurs, the PC is redirected to `mtvec`.
- **mscratch (0x340)**: Scratch register for trap handlers
  - Hardware does not interpret the value; used by trap code.
- **mepc (0x341)**: Machine exception program counter
  - Stores the faulting PC on trap entry; MRET jumps back to `mepc`.
- **mcause (0x342)**: Trap cause register
  - Bit 31: 1 = interrupt, 0 = exception
  - Bits 30:0: exception/interrupt code used by the handler.
- **cycle (0xC00) / cycleh (0xC80)**: 64-bit cycle counter
  - `cycle` holds the low 32 bits, `cycleh` the high 32 bits.
  - The counter increments every clock cycle.

## CSR control in each stage

Compared to single-cycle, mmio-trap adds CSR and trap handling, which affects several stages of the core. The datapath is still single-cycle, but the control logic is extended as follows.

### Instruction Fetch (IF)

- New inputs:
  - `interrupt_assert`: interrupt/trap request.
  - `interrupt_handler_address`: trap handler entry (usually derived from `mtvec`).
- PC update priority:
  1. Interrupt: `PC_next = interrupt_handler_address`.
  2. Jump/branch taken: `PC_next = jump_address_id`.
  3. Sequential execution: `PC_next = PC + 4`.
- When `instruction_valid = 0`, IF holds `PC` and injects a NOP to avoid executing invalid instructions.
- IF does not read or write CSRs directly, but it relies on CLINT/CSR to supply the correct trap handler address.

### Instruction Decode (ID)

- Detects CSR instructions:
  - `isSystem && funct3 =/= 0` → `isCsr`.
- Generates CSR-related signals:
  - `csr_reg_address = inst[31:20]` (12-bit CSR address).
  - `csr_reg_write_enable`: depends on the CSR instruction type and whether `rs1` / `zimm` is zero (for CSRRS/CSRRC variants).
- Selects write-back source:
  - For CSR instructions: `wb_reg_write_source = RegWriteSource.CSR`, so rd receives the old CSR value.
- ALU operand selection remains similar to single-cycle, since CSR arithmetic is handled in the CSR datapath rather than the normal ALU.

### Execute (EX)

- Implements CSR atomic read-modify-write behavior:
  - Computes `csrSource` from either `rs1` or the zero-extended immediate `zimm` based on `funct3(2)`.
  - Uses `csr_reg_read_data` (the old CSR value) to compute the new CSR contents:
    - CSRRW/CSRRWI: write `csrSource`.
    - CSRRS/CSRRSI: `csrResult = old | csrSource`.
    - CSRRC/CSRRCI: `csrResult = old & ~csrSource`.
- Outputs:
  - `csr_reg_write_data = csrResult`, which is sent to the CSR module for actual register updates.
- In parallel, EX still performs normal ALU operations, branch comparison, and jump target address calculation.

### Memory (MEM)

Unchanged from the single-cycle design; it only handles load/store alignment and MMIO access. CSR instructions do not access memory, so no CSR-specific control is needed in MEM.

### WriteBack (WB)

- Extends the write-back source multiplexer with a CSR option:
  - `RegWriteSource.CSR`: selects `io.csr_read_data` as the value written to rd.
- Other write-back sources:
  - ALU result (default).
  - Memory read data (for load instructions).
  - `PC + 4` (for JAL/JALR return address).
- For CSR instructions, WB writes the CSR read value (the *old* CSR content) back to rd, as required by the RISC-V specification.

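
The read-modify-write rules listed for the EX and WB stages can be captured in a short behavioral model. This is a Python sketch, not the Chisel code: `csr_execute` is a hypothetical name, the write-enable conditions for a zero `rs1`/`zimm` are deliberately left out, and the `funct3` values are the standard Zicsr encodings (bit 2 selects the immediate form, the low two bits select write/set/clear).

```python
MASK32 = 0xFFFFFFFF


def csr_execute(funct3, old_csr, rs1_data, zimm):
    """Return (new CSR value, value written back to rd)."""
    # funct3[2] = 1 selects the 5-bit zero-extended immediate (CSRRWI/SI/CI).
    source = (zimm & 0x1F) if funct3 & 0b100 else (rs1_data & MASK32)
    op = funct3 & 0b011
    if op == 0b01:    # CSRRW / CSRRWI: overwrite
        new_csr = source
    elif op == 0b10:  # CSRRS / CSRRSI: set bits
        new_csr = old_csr | source
    elif op == 0b11:  # CSRRC / CSRRCI: clear bits
        new_csr = old_csr & ~source & MASK32
    else:
        raise ValueError("funct3 does not encode a CSR instruction")
    # rd always receives the old CSR value.
    return new_csr, old_csr


# csrrs with rs1 = 0x80 on a CSR that currently holds 0x8
print(csr_execute(0b010, 0x8, 0x80, 0))
```

Returning the old value alongside the new one mirrors the split described above: EX produces `csr_reg_write_data`, while WB selects the CSR read value for rd.
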
### CSR module

The `CSR` module:

- **Implements machine-mode CSRs**
  - Stores: `mstatus`, `mie`, `mtvec`, `mscratch`, `mepc`, `mcause`
  - Implements a 64-bit `cycle` counter, exposed as `cycle` (low 32 bits) and `cycleh` (high 32 bits)
- **Uses an address-to-register lookup table**
  - `CSRRegister` defines the CSR addresses (e.g., `0x300`, `0x304`, `0x341`, `0xC00`)
  - `regLUT` maps each CSR address to the corresponding register or slice of `cycles`
  - CSR reads (`reg_read_data`, `debug_reg_read_data`) are implemented via `MuxLookup(regLUT)`
- **Provides combinational reads and registered writes**
  - Reads are purely combinational: the pipeline can see the current CSR value in the same cycle
  - Writes are registered and take effect on the next clock edge
- **Maintains a global cycle counter**
  - `cycles` is a 64-bit register incremented every cycle (`cycles := cycles + 1.U`)
  - `cycle` / `cycleh` CSRs are implemented by slicing `cycles(31,0)` and `cycles(63,32)`
- **Arbitrates writes between CLINT and the CPU**
  - Trap-related CSRs (`mstatus`, `mepc`, `mcause`):
    - If `clint_access_bundle.direct_write_enable` is asserted, CLINT updates these CSRs (trap entry/exit)
    - Otherwise, CPU CSR instructions can write them via `reg_write_enable_id` and `reg_write_address_id`
  - CPU-only CSRs (`mie`, `mtvec`, `mscratch`):
    - Updated only by CPU CSR instructions
    - CLINT never writes these registers
- **Exposes CSR state to the CLINT**
  - Forwards `mstatus`, `mie`, `mtvec`, `mepc`, and `mcause` to `clint_access_bundle`
  - CLINT uses these values to decide when to take interrupts and where to jump on trap

### CLINT module

The `CLINT` module is the core-local interrupt controller that decides **when to enter a trap** and **how to return from it**.
It observes the current interrupt flags, `mstatus`/`mie` enable bits, and the executing instruction, then:

- On an **interrupt** (e.g., timer interrupt):
  - Checks global `MIE` and per-source enable bits in `mie`
  - Asserts `interrupt_assert` and jumps to `mtvec`
  - Updates CSRs via `CSRDirectAccessBundle`:
    - `mepc` ← next PC (`instruction_address` or jump target)
    - `mcause` ← interrupt code with bit 31 = 1
    - `mstatus`: `MPIE ← MIE`, `MIE ← 0`, `MPP ← M-mode`
- On an **exception** (`ECALL` / `EBREAK`):
  - Follows the same entry flow (save PC to `mepc`, set `mcause`, disable interrupts)
  - But sets `mcause` with bit 31 = 0 (exception)
- On **`MRET`**:
  - Asserts `interrupt_assert` and redirects PC to `mepc`
  - Restores interrupt state in `mstatus`:
    - `MIE ← MPIE` (re-enable interrupts)
    - `MPIE ← 1`
    - `MPP` fixed to machine mode (M-only design)

Using `direct_write_enable`, CLINT tells the CSR module "this cycle is a trap transition", so trap-related CSRs (`mstatus`, `mepc`, `mcause`) are updated atomically and take priority over normal CSR instructions.

---

## Testing and Verification

In this project I mainly relied on two system-level test flows:

- `make test`: functional end-to-end tests
- `make compliance`: RISC-V ISA compliance tests

### Functional tests (`make test`)

The `make test` target runs several ScalaTest suites on the MyCPU design. In this report I mainly focus on the following files:

- `ExecuteTest.scala`
- `CLINTCSRTest.scala`
- `TimerTest.scala`
- `UartMMIOTest.scala`
- `CPUTest.scala`

#### ExecuteTest.scala: CSR write-back in EX stage

**Module under test:** `Execute` (EX stage only)

- Feeds hand-crafted CSR instructions (`csrrw`, `csrrs`, `csrrc` and the immediate variants) with different values on `reg1_data` and `csr_reg_read_data`.
- Checks that `csr_reg_write_data` implements the correct atomic read-modify-write behavior:
  - CSRRW: write the source value directly.
  - CSRRS: `csr_reg_read_data | csrSource`.
  - CSRRC: `csr_reg_read_data & ~csrSource`.
- This confirms that the EX stage already produces correct CSR write data before it reaches the CSR module.

#### CLINTCSRTest.scala: CLINT and CSR interaction

**Modules under test:** `CLINT` + `CSR` (connected together)

- **External interrupt flow**
  - Pre-programs `mtvec`, `mstatus`, and `mie` to enable machine interrupts.
  - Raises an interrupt flag and steps the clock.
  - Verifies that:
    - `interrupt_assert` is raised and the handler PC equals `mtvec`.
    - `mepc` stores the correct return address.
    - `mcause` encodes an interrupt with the expected cause code.
    - `mstatus` updates `MPIE` and clears `MIE` according to the privileged spec.
- **Environment instructions (ECALL/EBREAK/MRET)**
  - Drives `ecall` and `ebreak` as instructions and checks that traps update `mepc` and `mcause` correctly.
  - Drives `mret` and verifies that the PC returns to `mepc` and that `mstatus` restores `MIE` from `MPIE`.
- These tests validate the whole trap entry/return sequence in machine mode.

#### TimerTest.scala: MMIO timer registers

**Module under test:** `Timer` peripheral

- Uses the MMIO interface to write and read timer registers (limit, enable, clear).
- Verifies that:
  - Writing the **limit** register and reading it back returns the same value.
  - When the timer is enabled, the internal counter increases and asserts the IRQ line when it reaches the limit.
  - Clearing or disabling the timer de-asserts the interrupt and stops counting.
- This confirms that the timer peripheral behaves correctly for later use by the CPU through MMIO.

#### UartMMIOTest.scala: UART TX/RX with a real program

**Module under test:** `UartHarness` (CPU + UART + memory)

- Uses `ROMLoader` to load the assembled UART test program into instruction memory.
- Runs the CPU for a fixed number of cycles until the program finishes.
- At the end of simulation, reads two words from memory:
  - Address `0x100` should contain `0xcafef00d` as a success signature.
  - Address `0x104` should contain `0xF` (0b1111), meaning that all four UART sub-tests (TX, multi-byte RX, binary RX, timeout RX) have passed.
- This test shows that the UART MMIO registers and the associated software driver work correctly together.

#### CPUTest.scala: full CPU programs

**Module under test:** full `CPU` core with RAM/ROM

- Uses `ROMLoader` to load different RISC-V programs (e.g., Fibonacci, timer, interrupt-trap, quicksort) into the instruction ROM.
- Examples:
  - **Fibonacci program**
    - Runs a recursive Fibonacci routine.
    - Checks that a designated memory location contains `fib(10)` at the end.
  - **MMIO registers program**
    - Accesses timer MMIO registers via load/store.
    - Verifies that the CPU can correctly read and write these registers.
  - **Interrupt trap flow**
    - Triggers a timer interrupt, jumps to the trap handler, and returns with `mret`.
    - Confirms correct updates of `mepc`, `mcause`, and control flow.
  - **Quicksort program**
    - Sorts an array of 10 numbers in RAM.
    - Asserts that the array is sorted in ascending order after execution.
- These programs act as end-to-end tests that exercise the whole datapath, CSR/CLINT logic, and MMIO peripherals under realistic workloads.

All these tests pass on my implementation, which gives confidence that the CSR logic, interrupt controller, timer, UART, and the CPU core are working correctly together.

![Screenshot 2025-12-10 8.52.32 PM](https://hackmd.io/_uploads/SJ6zTyvf-g.png)

---

### riscof test

Besides the custom `make test` programs for UART, timer, and VGA, the 2-mmio-trap core is also validated by **RISCOF** architectural compliance testing.

In this flow, the official `riscv-arch-test` cases are compiled into ELF binaries (e.g., `cadd-01.elf`) and executed on:

- a golden reference model (rv32emu), which generates a **reference signature** in memory, and
- our MyCPU mmio-trap implementation, which writes its own **test signature** to the same region.

RISCOF then compares the two signatures to check RV32I(+Zicsr) behavior (ALU, branches, loads/stores, and basic CSR usage).

Even after adding MMIO devices and trap/interrupt support, all RISCOF compliance tests passed on my design (no signature mismatches), confirming that the CPU core still conforms to the RISC-V ISA.

![Screenshot 2025-12-10 9.11.57 PM](https://hackmd.io/_uploads/H1y3Zlvzbl.png)

## Nyancat Demo

For the 2-mmio-trap project, I used the provided VGA and Verilator infrastructure to run the Nyancat demo on my own RV32I(+Zicsr) core.

![Screenshot 2025-12-10 9.13.24 PM](https://hackmd.io/_uploads/ByZWMlwfWx.png)

# 5 Stage Pipeline CPU

## Implementations Overview

Compared to the single-cycle RISC-V CPU, the five-stage pipelined CPU improves instruction throughput by overlapping the execution of multiple instructions across different pipeline stages (IF, ID, EX, MEM, WB). While this significantly increases performance, it also introduces new correctness challenges that do not exist in a single-cycle design. To maintain architectural correctness, several additional mechanisms must be introduced.

### Pipeline Registers

In a single-cycle CPU, all instruction stages are completed within one clock cycle, so no intermediate state needs to be stored. In contrast, the pipelined CPU divides instruction execution into multiple stages, requiring **pipeline registers** (IF2ID, ID2EX, EX2MEM, MEM2WB) to hold intermediate results between stages. These registers enable multiple instructions to be processed concurrently and form the structural backbone of the pipeline.

### Data Forwarding (Bypassing)

Single-cycle execution naturally avoids data hazards because register writes are completed before the next instruction begins. In a pipelined CPU, however, an instruction may require a register value that has not yet been written back.
To reduce unnecessary stalls, **data forwarding paths** are introduced to bypass results directly from later pipeline stages (EX/MEM or MEM/WB) to earlier stages (EX or ID).

### Hazard Detection and Pipeline Control

Not all hazards can be resolved by forwarding alone. In particular, **load-use hazards** and **jump register dependencies** require the pipeline to temporarily pause execution. A dedicated **hazard detection and control unit** is therefore introduced to dynamically manage pipeline behavior.

This unit generates control signals such as:

- `pc_stall` to freeze instruction fetch
- `if_stall` to hold the IF/ID pipeline register
- `id_flush` and `if_flush` to insert bubbles or discard wrong-path instructions

These signals ensure that instructions are executed in the correct order with valid operands.

### Control Hazard Handling and Early Branch Resolution

In a single-cycle CPU, branch outcomes are known immediately. In the five-stage pipelined CPU, branches introduce control hazards because instructions may already be fetched along the wrong path.

To minimize the performance penalty, this design resolves branch decisions in the **ID stage**, supported by ID-stage forwarding. As a result, only one instruction needs to be flushed, reducing the branch penalty from two cycles to one.

### Summary

In summary, moving from a single-cycle CPU to a five-stage pipelined CPU requires:

- pipeline registers to store intermediate state
- forwarding networks to resolve most data hazards
- hazard detection logic to manage stalls and flushes
- optimized control-flow handling to reduce branch penalties

These enhancements allow the five-stage pipelined CPU to achieve higher performance while preserving full architectural correctness.

---

## Pipeline-Specific Components and Design Rationale

Compared to a single-cycle CPU, a five-stage pipelined CPU introduces new challenges related to **data hazards**, **control hazards**, and **instruction overlap**.
To ensure correctness while maintaining performance, several additional components and control mechanisms are implemented. This section describes the key pipeline-related modules, their responsibilities, and how they work together.

---

### Data Forwarding Mechanism

**Related file:** `Forwarding.scala`

In a pipelined architecture, results produced by an instruction are not immediately written back to the register file. However, subsequent instructions may need these results before the write-back stage completes. To handle this, a data forwarding (bypass) mechanism is introduced.

#### EX-Stage Forwarding

- For ALU operations, source registers in the EX stage are compared with destination registers in:
  - EX/MEM stage
  - MEM/WB stage
- When a match is detected and the destination register is not `x0`, the most recent value is forwarded directly to the ALU input: Forward = (rd ≠ 0) ∧ (rd = rs) ∧ RegWrite.
- Forwarding from EX/MEM has higher priority than MEM/WB, because EX/MEM holds the more recent value of the register.

This mechanism resolves most RAW data hazards without introducing stalls.

#### ID-Stage Forwarding (Early Branch Support)

In this design, branches are resolved in the ID stage. Therefore, operands required for branch comparison must also be available during ID.

- Source registers in ID (`rs1_id`, `rs2_id`) are compared with destination registers in:
  - EX/MEM stage
  - MEM/WB stage
- Forwarding is enabled under the same conditions as EX-stage forwarding:
  - the instruction in a later stage will write a register,
  - the destination register matches the source register in ID,
  - and the destination register is not `x0`.

When these conditions are satisfied, the operand is forwarded from EX/MEM or MEM/WB to the ID stage for branch comparison, with EX/MEM having higher priority.
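
The forwarding condition and priority rule described above can be sketched as a small behavioral model. This is plain Scala rather than Chisel hardware, and it is not taken from `Forwarding.scala`: the object name, the function name, and its parameters are hypothetical. The encodings 1 (ForwardFromMEM) and 2 (ForwardFromWB) are assumed from the values that appear in the waveform analysis of this report.

```scala
// Behavioral sketch only; names are illustrative, not the project's API.
object ForwardingModel {
  // Assumed encodings, matching the values observed in the waveforms.
  val NoForward      = 0
  val ForwardFromMEM = 1
  val ForwardFromWB  = 2

  /** Select the bypass source for a single source register `rs`.
    * Forward = (rd != 0) && (rd == rs) && RegWrite, checked for EX/MEM first.
    */
  def select(rs: Int,
             rdMem: Int, regWriteMem: Boolean,
             rdWb: Int, regWriteWb: Boolean): Int =
    if (regWriteMem && rdMem != 0 && rdMem == rs) ForwardFromMEM // most recent value wins
    else if (regWriteWb && rdWb != 0 && rdWb == rs) ForwardFromWB
    else NoForward
}
```

For example, `ForwardingModel.select(5, 5, true, 5, true)` picks the EX/MEM source even though MEM/WB also matches, which is the priority rule stated above. The same function applies to both the EX-stage and ID-stage comparisons, since they share the same conditions.
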
Because branch operands can be forwarded to the ID stage, the branch decision is made one cycle earlier, so only the IF stage needs to be flushed, resulting in a **1-cycle branch penalty instead of 2.**

---

### Hazard Detection and Pipeline Control

**Related file:** `Control.scala`

Not all hazards can be resolved through forwarding. Some situations require the pipeline to pause or discard incorrect instructions. A centralized hazard detection unit is responsible for identifying such cases and generating appropriate control signals.

#### Load-Use Hazards

When a load instruction is followed immediately by an instruction that uses the loaded register:
- The loaded data becomes available only at the end of the MEM stage.
- Even with forwarding, the data cannot be provided in time for the next instruction’s EX stage.

To handle this:
- The PC and IF/ID pipeline register are stalled.
- A bubble (NOP) is inserted into the ID/EX stage.

#### Jump Register Dependencies

For instructions like `JALR`, the jump target address depends on a register value.
- If that register value is still being produced by a load instruction, the jump cannot be resolved safely.
- The pipeline stalls until the correct value is available.

#### Control Hazard Handling

When a branch is taken:
- The branch decision is already resolved in the ID stage.
- Only the instruction in the IF stage is incorrect.

Therefore:
- The IF/ID pipeline register is flushed.
- No additional stall or ID-stage flush is required.

This design choice is a major improvement over EX-stage branch resolution.

---

### Pipeline Register Stall and Flush Behavior

**Related file:** `IF2ID.scala` (and other pipeline registers)

Each pipeline register supports both **stall** and **flush** operations:
- **Stall**
  - Holds the current contents of the register.
  - Used to wait for data dependencies to be resolved.
- **Flush**
  - Replaces the current instruction with a NOP.
  - Used to discard wrong-path instructions caused by control hazards.
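
The stall and flush behavior listed above can be illustrated with a minimal next-state function for a register that holds an instruction word. This is a plain Scala sketch, not the code in `IF2ID.scala`; the object and function names are hypothetical, and giving flush priority over stall when both are asserted is an assumption made for this example. The NOP encoding `0x00000013` is the one used elsewhere in this report.

```scala
// Behavioral sketch only; names and the flush-over-stall priority are assumptions.
object PipelineRegisterModel {
  val Nop: Long = 0x00000013L // addi x0, x0, 0

  /** Value held by the register after the next clock edge. */
  def next(current: Long, input: Long, stall: Boolean, flush: Boolean): Long =
    if (flush) Nop          // discard the wrong-path instruction
    else if (stall) current // hold contents while a dependency resolves
    else input              // normal operation: latch the incoming instruction
}
```

With neither signal asserted the register simply latches its input, so the model reduces to an ordinary pipeline stage.
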
Correct default values are critical:
- Instructions are flushed to `NOP`
- Instruction addresses reset to entry address
- Interrupt flags are cleared

Without this behavior, incorrect instructions could propagate and corrupt architectural state.

---

## Waveform-Based Verification (Key Signals)

In addition to checking final memory and register values, waveform inspection is used to verify that **hazard handling, forwarding, and stall/flush control** behave correctly cycle by cycle.

This section summarizes the key signals observed in the waveform and the expected behavior under different instruction scenarios.

The following assembly program is used as the test case:

:::spoiler code here
```asm=
hazard.asmbin:     file format binary

Disassembly of section .data:

00000000 <.data>:
   0:   c0002573    rdcycle a0
   4:   00100293    li      t0,1
   8:   40500333    neg     t1,t0
   c:   0062f3b3    and     t2,t0,t1
  10:   00702223    sw      t2,4(zero) # 0x4
  14:   00c0006f    j       0x20
  18:   0062e3b3    or      t2,t0,t1
  1c:   0062c3b3    xor     t2,t0,t1
  20:   00138313    addi    t1,t2,1
  24:   007303b3    add     t2,t1,t2
  28:   007373b3    and     t2,t1,t2
  2c:   0023a383    lw      t2,2(t2)
  30:   00736e33    or      t3,t1,t2
  34:   01c3c663    blt     t2,t3,0x40
  38:   0052ee33    or      t3,t0,t0
  3c:   0062ce33    xor     t3,t0,t1
  40:   00300e93    li      t4,3
  44:   fdde1ee3    bne     t3,t4,0x20
  48:   01c02423    sw      t3,8(zero) # 0x8
  4c:   00000e97    auipc   t4,0x0
  50:   008e8ee7    jalr    t4,8(t4) # 0x54
  54:   004e8ee7    jalr    t4,4(t4)
  58:   c00025f3    rdcycle a1
  5c:   40a580b3    sub     ra,a1,a0
  60:   0000006f    j       0x60
```
:::

### EX Stage Forwarding Waveform

**Target Instruction Sequence**

```asm=
0x00000004 li t0, 1
0x00000008 neg t1, t0
```

Waveform:

![Screenshot 2025-12-14 5.01.52 PM](https://hackmd.io/_uploads/SJCbplnMbg.png)

This instruction sequence introduces a classic **RAW (Read After Write)** data dependency:
- `neg t1, t0` requires the value of `t0`
- `t0` is produced by the immediately preceding instruction `li t0, 1`

Pipeline Timing (Instruction-centric View)

| Instruction \ Cycle | C0 | C1 | C2 | C3 | C4 |
|---|---|---|---|---|---|
| `rdcycle a0` | IF | ID | EX | MEM | WB |
| `li t0, 1` | | IF | ID | EX | MEM |
| `neg t1, t0` | | | IF | ID | **EX*** |
| `and t2, t0, t1` | | | | IF | ID |
| `sw t2, 4(zero)` | | | | | IF |

👉 **EX-stage forwarding occurs at Cycle C4**

At Cycle C4:
- `neg t1, t0` is in the **EX stage**
- `li t0, 1` is in the **MEM stage**
- Register mapping:
  - `t0 = x5`
  - `t1 = x6`

Observed signals:
- `rs2_ex = 5` (`neg t1, t0` expands to `sub t1, zero, t0`, so `t0` is the second source operand)
- `rd_mem = 5`
- `reg_write_enable_mem = 1`
- `io_reg2_forward_ex = 1` (ForwardFromMEM)

After validating EX/MEM forwarding, we now examine a case where the producer is two instructions ahead of the consumer, requiring forwarding from MEM/WB instead.

**Target Instruction Sequence**

```asm=
0x00000004 li t0, 1
0x00000008 neg t1, t0
0x0000000c and t2, t0, t1
```

Waveform:

![Screenshot 2025-12-14 5.30.54 PM](https://hackmd.io/_uploads/HyiCXW2M-g.png)

Here:
- `li t0, 1` produces `t0`
- `and t2, t0, t1` consumes `t0` two cycles later
- At this moment, `li t0, 1` has reached the MEM/WB stage

Pipeline Timing (Instruction-centric View)

| Instruction \ Cycle | C1 | C2 | C3 | C4 | C5 |
|---------------------|----|----|----|----|----|
| `li t0, 1` | IF | ID | EX | MEM | **WB** |
| `neg t1, t0` | | IF | ID | EX | MEM |
| `and t2, t0, t1` | | | IF | ID | **EX*** |

MEM/WB → EX forwarding occurs at **Cycle C5**

Cycle C5: Forwarding Analysis (MEM/WB → EX)

At Cycle C5:
- `and t2, t0, t1` is in the EX stage
- `li t0, 1` is in the WB stage
- Register mapping:
  - `rs1_ex = 5` (`t0 = x5`, the first source register of `and t2, t0, t1`)
  - `rd_wb = 5` (`t0 = x5`, the destination register of `li t0, 1`)

Observed control signals:
- `reg_write_enable_wb = 1`
- `reg1_forward_ex = 2` (ForwardFromWB)

Conclusion: The forwarding unit correctly selects MEM/WB as the bypass source.

---

### ID Stage Forwarding Waveform

TODO

---

# HW2 Assembly Code on MyCPU

## Running rsqrt on MyCPU

This work ports the handwritten RISC-V rsqrt assembly code from Homework 2 to the MyCPU pipelined processor.
First, to make the rsqrt program executable on MyCPU, I modified the Makefile in the `csrc` directory so that the rsqrt C file, the handwritten rsqrt assembly code, and `init.o` are correctly linked into a single ELF file. The ELF file is then converted into a binary (`.asmbin`) using `objcopy` and placed under `src/main/resources` for simulation.

Next, I extended the `PipelineProgramTest` by adding a new test case for the rsqrt program. The test loads `rsqrt_fast.asmbin`, allows the program to execute on the pipelined CPU, and reads the result from a predefined memory-mapped address. The output value is compared against the expected reference result to verify correctness.

The rsqrt program is executed on different pipelined CPU configurations, including three-stage and five-stage pipelines. The program is implemented using handwritten RISC-V assembly, and special care is taken to minimize pipeline hazards, particularly load-use dependencies that are difficult to fully eliminate in a five-stage pipeline. By reordering independent instructions and avoiding immediate use of load results whenever possible, unnecessary stalls are reduced while preserving correct functionality.

Simulation is performed using Verilator, and waveform traces are analyzed to observe instruction flow, data forwarding behavior, and remaining pipeline stalls. Successful execution and correct memory output confirm that the modified rsqrt assembly code functions correctly on the pipelined RISC-V CPU.
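
To make the reordering idea concrete, the sketch below counts load-use stalls in a short instruction sequence before and after moving an independent instruction between a load and its consumer. It is a plain Scala model with hypothetical names (`Instr`, `loadUseStalls`), and the two sequences are invented for illustration rather than taken from the rsqrt code. It assumes the consumer reads its operand in the EX stage and that MEM/WB to EX forwarding is available, so one intervening instruction is enough to hide the load latency.

```scala
// Illustrative model only; the instruction sequences are made up for this example.
object LoadUseModel {
  // Minimal instruction record: destination register, source registers, load flag.
  final case class Instr(rd: Int, srcs: Seq[Int], isLoad: Boolean = false)

  /** Count adjacent pairs where a load is immediately followed by a consumer of its rd. */
  def loadUseStalls(program: Seq[Instr]): Int =
    program.zip(program.drop(1)).count { case (prev, next) =>
      prev.isLoad && prev.rd != 0 && next.srcs.contains(prev.rd)
    }

  // lw x6, 0(x29); add x7, x6, x5; addi x28, x28, 1
  val original: Seq[Instr] = Seq(
    Instr(rd = 6, srcs = Seq(29), isLoad = true),
    Instr(rd = 7, srcs = Seq(6, 5)),
    Instr(rd = 28, srcs = Seq(28))
  )

  // Same instructions, with the independent addi moved between the load and its use.
  val reordered: Seq[Instr] = Seq(original(0), original(2), original(1))
}
```

In this model `loadUseStalls(original)` finds one stall and `loadUseStalls(reordered)` finds none. Branches that read a loaded value in the ID stage would need a stricter check than the one shown here.
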
:::spoiler testing code
```scala=
it should "compute rsqrt_fast and match reference" in {
  runProgram("rsqrt_fast.asmbin", cfg) { c =>
    // Raise the timeout so the long simulation below is not aborted early.
    c.clock.setTimeout(200000)
    // Give the program enough cycles to execute before reading results.
    c.clock.step(50000)

    // Read one word through the debug port and compare it with the reference.
    def check(addr: Int, expected: Int): Unit = {
      c.io.mem_debug_read_address.poke(addr.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(expected.U)
    }

    check(0x4, 0x00018000)  // rsqrt(1)
    check(0x8, 0x0000B505)  // rsqrt(2)
    check(0xC, 0x00008000)  // rsqrt(4)
    check(0x10, 0x00004000) // rsqrt(16)
    check(0x14, 0x00001000) // rsqrt(256)
  }
}
```
:::

![Screenshot 2025-12-15 11.54.08 PM](https://hackmd.io/_uploads/B1fNJ3Tf-e.png)

---

## Waveform Trace

Here, we trace the waveform of `rsqrt_fast` to examine whether the critical control signals behave correctly in the pipelined CPU.

:::spoiler asmcode here (partial)
```asm=
00001bcc <main>:
    1bcc:   fe010113    addi    sp,sp,-32
    1bd0:   00112e23    sw      ra,28(sp)
    1bd4:   00812c23    sw      s0,24(sp)
    1bd8:   02010413    addi    s0,sp,32
    1bdc:   00100513    li      a0,1
    1be0:   07c000ef    jal     1c5c <rsqrt_fast>
    1be4:   fea42623    sw      a0,-20(s0)
    1be8:   00400793    li      a5,4
    1bec:   fec42703    lw      a4,-20(s0)
    1bf0:   00e7a023    sw      a4,0(a5)
    1bf4:   00200513    li      a0,2
    1bf8:   064000ef    jal     1c5c <rsqrt_fast>
    1bfc:   fea42623    sw      a0,-20(s0)
    1c00:   00800793    li      a5,8
    1c04:   fec42703    lw      a4,-20(s0)
    1c08:   00e7a023    sw      a4,0(a5)
    1c0c:   00400513    li      a0,4
    1c10:   04c000ef    jal     1c5c <rsqrt_fast>
    1c14:   fea42623    sw      a0,-20(s0)
    1c18:   00c00793    li      a5,12
    1c1c:   fec42703    lw      a4,-20(s0)
    1c20:   00e7a023    sw      a4,0(a5)
    1c24:   01000513    li      a0,16
    1c28:   034000ef    jal     1c5c <rsqrt_fast>
    1c2c:   fea42623    sw      a0,-20(s0)
    1c30:   01000793    li      a5,16
    1c34:   fec42703    lw      a4,-20(s0)
    1c38:   00e7a023    sw      a4,0(a5)
    1c3c:   10000513    li      a0,256
    1c40:   01c000ef    jal     1c5c <rsqrt_fast>
    1c44:   fea42623    sw      a0,-20(s0)
    1c48:   01400793    li      a5,20
    1c4c:   fec42703    lw      a4,-20(s0)
    1c50:   00e7a023    sw      a4,0(a5)
    1c54:   00000013    nop
    1c58:   ffdff06f    j       1c54 <main+0x88>

00001c5c <rsqrt_fast>:
    1c5c:   22050863    beqz    a0,1e8c <.mul32_done+0x8c>
    1c60:   fff00293    li      t0,-1
    1c64:   22550863    beq     a0,t0,1e94 <.mul32_done+0x94>
    1c68:   00050f93    mv      t6,a0
    1c6c:   00050293    mv      t0,a0
    1c70:   00000313    li      t1,0
    1c74:   000103b7    lui     t2,0x10
    1c78:   0072be33    sltu    t3,t0,t2
    1c7c:   004e1e13    slli    t3,t3,0x4
    1c80:   01c30333    add     t1,t1,t3
    1c84:   01c292b3    sll     t0,t0,t3
    1c88:   010003b7    lui     t2,0x1000
    1c8c:   0072be33    sltu    t3,t0,t2
    1c90:   003e1e13    slli    t3,t3,0x3
    1c94:   01c30333    add     t1,t1,t3
    1c98:   01c292b3    sll     t0,t0,t3
    1c9c:   100003b7    lui     t2,0x10000
    1ca0:   0072be33    sltu    t3,t0,t2
    1ca4:   002e1e13    slli    t3,t3,0x2
    1ca8:   01c30333    add     t1,t1,t3
    1cac:   01c292b3    sll     t0,t0,t3
    1cb0:   400003b7    lui     t2,0x40000
    1cb4:   0072be33    sltu    t3,t0,t2
    1cb8:   001e1e13    slli    t3,t3,0x1
    1cbc:   01c30333    add     t1,t1,t3
    1cc0:   01c292b3    sll     t0,t0,t3
    1cc4:   800003b7    lui     t2,0x80000
    1cc8:   0072be33    sltu    t3,t0,t2
    1ccc:   01c30333    add     t1,t1,t3
    1cd0:   01c292b3    sll     t0,t0,t3
    1cd4:   0012b393    seqz    t2,t0
    1cd8:   00730333    add     t1,t1,t2
    1cdc:   01f00393    li      t2,31
    1ce0:   406382b3    sub     t0,t2,t1
    1ce4:   00000e17    auipc   t3,0x0
    1ce8:   1b8e0e13    addi    t3,t3,440 # 1e9c <rsqrt_table>
    1cec:   00229393    slli    t2,t0,0x2
    1cf0:   007e0eb3    add     t4,t3,t2
    1cf4:   000ea303    lw      t1,0(t4)
    1cf8:   01f00393    li      t2,31
    1cfc:   0072fc63    bgeu    t0,t2,1d14 <rsqrt_fast+0xb8>
    1d00:   00128e93    addi    t4,t0,1
    1d04:   002e9e93    slli    t4,t4,0x2
    1d08:   01de0eb3    add     t4,t3,t4
    1d0c:   000ea383    lw      t2,0(t4)
    1d10:   0080006f    j       1d18 <rsqrt_fast+0xbc>
    1d14:   00100393    li      t2,1
    1d18:   00050e93    mv      t4,a0
    1d1c:   00100f13    li      t5,1
    1d20:   005f1f33    sll     t5,t5,t0
    1d24:   41ee8f33    sub     t5,t4,t5
    1d28:   01000e93    li      t4,16
    1d2c:   01d2c863    blt     t0,t4,1d3c <exp_less_16>
    1d30:   ff028e93    addi    t4,t0,-16
    1d34:   01df5f33    srl     t5,t5,t4
    1d38:   00c0006f    j       1d44 <fraction_done>
```
:::

### Control Signal Verification Using Waveform

![Screenshot 2025-12-16 3.55.33 AM](https://hackmd.io/_uploads/rJzTDJCGZe.png)

The figure shows the waveform around the execution of the `jal rsqrt_fast` instruction at PC = `0x1be0`.

At this cycle, the IF stage fetches the instruction at address `0x1be0`, which corresponds to `jal rsqrt_fast`. This can be observed from the `io_instruction` signal showing the instruction word at PC `0x1be0`.
In the following cycle, the instruction enters the ID stage, which is reflected by the `io_id_instruction` signal. At this point, the control logic correctly identifies the instruction as a jump. As a result, the jump decision is resolved in the ID stage, and the `io_jump_flag` signal is asserted.

Once the jump is detected, the pipeline control logic asserts `io_if_flush`. This flush signal invalidates the sequential instruction that was speculatively fetched in the IF stage (e.g., the instruction at `0x1be4`). As shown in the waveform, `io_if_flush` goes high immediately after the jump decision, ensuring that the wrong-path instruction does not proceed further down the pipeline.

After the flush, the program counter is redirected to the jump target address (`rsqrt_fast` at `0x1c5c`), and instruction fetching resumes from the correct location.

This behavior confirms that control hazards introduced by `jal` are handled correctly: the jump is resolved early in the ID stage, and the pipeline is properly flushed to maintain correct execution.

---

### Forwarding Signal Verification

![Screenshot 2025-12-16 4.15.41 AM](https://hackmd.io/_uploads/SJquh1Cfbe.png)

For the instruction sequence at 0x1c74–0x1c7c, a RAW data hazard occurs between `sltu t3, t0, t2` and `slli t3, t3, 0x4`.

When `slli` reaches the EX stage, the value of register `t3` has been produced by the preceding `sltu` instruction but has not yet been written back to the register file. At this cycle, `sltu` is in the MEM stage and its destination register (`rd_mem = t3`) matches the source register of `slli` (`rs1_ex = t3`).

The forwarding unit detects this condition and asserts the EX-stage forwarding signal, allowing the operand to be forwarded directly from the MEM stage to the EX stage. As a result, the pipeline executes the dependent instructions correctly without introducing any stall.