Assignment3: Your Own RISC-V CPU

# Assignment3: Your Own RISC-V CPU
contributed by < [AnnTaiwan](https://github.com/AnnTaiwan/ca2025-mycpu) >

>[!Note] AI tools usage
>I used ChatGPT to help me understand the concepts of the project and to explain some of the code. I also used it to rephrase my notes and summarize the code.

>[!Tip] Notice
>Follow these instructions to learn and build the environment: [Lab3: Construct RISC-V CPU with Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx#Lab3-Construct-a-RISC-V-CPU-with-Chisel)
>[Assignment3: RISC-V CPU](https://hackmd.io/@sysprog/2025-arch-homework3)
>* code
>[mycpu upstream/main](https://github.com/sysprog21/ca2025-mycpu)
>[mycpu forked by me](https://github.com/AnnTaiwan/ca2025-mycpu)
>* [chisel bootcamp - online](https://mybinder.org/v2/gh/freechipsproject/chisel-bootcamp/master)
>* Environment: Ubuntu 24.04.3.
>* Toolchain: riscv-none-elf-gcc

## chisel bootcamp
### Describe the operation of 'Hello World in Chisel'
* `Hello World in Chisel`
```scala
class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  val CNT_MAX = (50000000 / 2 - 1).U;
  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}
```
* `cntReg`: a 32-bit counter, initialized to 0.
* `blkReg`: a 1-bit register controlling the LED state.
* `CNT_MAX = 50,000,000 / 2 - 1`
    * For a 50 MHz clock, this creates a toggle every 0.5 seconds → LED blinks at 1 Hz.

There is only one output, `io.led`, which is a 1-bit unsigned integer. In each clock cycle, the register `cntReg` increases by 1. When `cntReg` is equal to `CNT_MAX`, it resets to zero and inverts `blkReg`, which drives the output signal. The LED output is therefore a square wave with a fixed blinking frequency.
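The relation between the clock frequency, the blink frequency, and `CNT_MAX` can be checked with a few lines of host-side C. This is only a sketch of the arithmetic; the helper name `cnt_max_for` is hypothetical and does not appear in the bootcamp code.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the Chisel example): counter limit for
 * one half-period of the LED square wave. */
static uint32_t cnt_max_for(uint32_t clock_hz, uint32_t blink_hz)
{
    /* One blink consists of two toggles, so a half-period lasts
     * clock_hz / (2 * blink_hz) cycles. The counter starts at 0,
     * so the last value it reaches is one less than that. */
    return clock_hz / (2u * blink_hz) - 1u;
}
```

For a 50 MHz clock and a 1 Hz blink, `cnt_max_for(50000000u, 1u)` gives 24999999, which is the same value as `50000000 / 2 - 1` in the `Hello` module.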
### Enhance it by incorporating logic circuit
* Use this example to ==learn how to use sbt and manage a Scala project.==
* Add an `enable` signal, and adjust `CNT_MAX` to 249, which means the **LED blink period is 500 cycles.**
```scala
import chisel3._
import chisel3.util._

class HelloEnhanced extends Module {
  val io = IO(new Bundle {
    val enable = Input(Bool()) // new input
    val led = Output(UInt(1.W))
  })
  val CNT_MAX = (500 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  // logic circuit enhancement
  io.led := blkReg & io.enable
}
```
* Use test code to examine the module.
```scala
import chisel3._
import chiseltest._
import org.scalatest.flatspec.AnyFlatSpec

class HelloTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior of "HelloEnhanced"
  it should "blink correctly with enable" in {
    test(new HelloEnhanced).withAnnotations(Seq(WriteVcdAnnotation)) { dut =>
      val PERIOD = 500 // Must match 2 * (CNT_MAX + 1)

      // Test 1: Enable HIGH - should blink
      dut.io.enable.poke(true.B)

      // Verify first half-period (LED should be LOW initially)
      // Counter: 0→249, then toggles on cycle 250
      for (i <- 0 until PERIOD/2) { // loop i: 0-249
        dut.io.led.expect(0.U) // (0-249), LOW
        dut.clock.step(1)
      }
      // After 250 cycles, LED should now be HIGH
      dut.io.led.expect(1.U)

      // Verify second half-period (LED stays HIGH)
      for (i <- 0 until PERIOD/2 - 1) {
        dut.clock.step(1)
        dut.io.led.expect(1.U) // Should be HIGH (cycles 251-499)
      }
      // Step once more to complete the period
      dut.clock.step(1)
      // After 500 cycles total, LED should toggle back to LOW
      dut.io.led.expect(0.U)

      // Verify third half-period (LED stays LOW)
      for (i <- 0 until PERIOD/2 - 1) {
        dut.clock.step(1)
        dut.io.led.expect(0.U) // Back to LOW
      }

      // Test 2: Enable LOW - LED should always be OFF
      dut.io.enable.poke(false.B)
      for (_ <- 0 until PERIOD * 2) {
        dut.clock.step(1)
        dut.io.led.expect(0.U) // Always LOW when disabled
      }
    }
  }
}
```
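As a cross-check of the values `HelloTest` expects, the blinking behaviour of `HelloEnhanced` can be modelled on the host in C. This is a minimal sketch, not part of the project: the function `led_after` and the macro `HALF_PERIOD` are hypothetical names, and the model assumes the cycle count is measured in clock steps after reset.

```c
#include <stdbool.h>
#include <stdint.h>

#define HALF_PERIOD 250u /* CNT_MAX + 1, with CNT_MAX = 500 / 2 - 1 */

/* Hypothetical software model: expected io.led value after `cycles`
 * clock steps following reset. blkReg toggles once every HALF_PERIOD
 * steps, and the output is gated by `enable`. */
static unsigned led_after(uint32_t cycles, bool enable)
{
    unsigned blk = (cycles / HALF_PERIOD) & 1u;
    return enable ? blk : 0u;
}
```

With `enable` high the model is LOW for steps 0 to 249, HIGH for steps 250 to 499, and LOW again at step 500. With `enable` low it is always LOW.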
>`HelloTest` checks the output over approximately 1,750 cycles (250 + 249 + 1 + 249 + 1000 = 1749 clock steps) and covers both the enabled and the disabled case.

* output example

|cycle count| `io.led`|
|--|--|
|0-249|false|
|250-499|true|
|500|false|

* `sbt test` result
```c
[info] HelloTest:
[info] HelloEnhanced
[info] - should blink correctly with enable
[info] Run completed in 1 second, 750 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
* See the waveform

>Reset lasts for 1 cycle, so `io.enable` and `io.led` are 0.

![image](https://hackmd.io/_uploads/rkmuz_Wzbe.png)

>`io.enable` is set to 1.
>After `cntReg` reaches 249, `io.led` becomes HIGH.

![image](https://hackmd.io/_uploads/H1uE7_-zZg.png)

>`io.enable` is set to 1.
>After another 250 cycles, `io.led` is inverted, so it becomes LOW.

![image](https://hackmd.io/_uploads/SyLiQu-MWg.png)

>When `io.enable` is 0, `io.led` should be LOW, even though `blkReg` is HIGH.
![image](https://hackmd.io/_uploads/BktJ4_ZGWg.png)

## Prerequisites
### Install riscof in venv
```shell
sudo apt update
sudo apt install -y python3 python3-pip python3-venv git \
    build-essential cmake ninja-build
python3 -m venv riscof_env # located at ca2025/
source riscof_env/bin/activate
pip install riscof
```
## mycpu : Finish all CA25 TODOs
### Before `make check-deps`, run these commands to set up the required environment for `make compliance`
```shell
# for toolchain
source ~/riscv-none-elf-gcc/setenv
# go to ca2025-mycpu
cd ca2025-mycpu
make check-deps
# It will then suggest doing the following:
export RISCV=/home/chouan/riscv-none-elf-gcc
# for riscof, activate my python venv
source riscof_env/bin/activate
make check-deps
# Now it should print "All dependencies validated successfully"
```
## MyCPU
* [mycpu code](https://github.com/AnnTaiwan/ca2025-mycpu)
* workflow: SBT produces Verilog, Verilator simulates it, C++ produces VCD

## MyCPU : 0-minimal
* Run `make`, terminal output:
```clike
cd .. && sbt "project minimal" test
[info] [launcher] getting org.scala-sbt sbt 1.10.7 (this may take some time)...
[info] [launcher] getting Scala 2.12.20 (for sbt)...
[info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29)
[info] loading project definition from /home/chouan/ncku_courses/ca2025/ca2025-mycpu/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/)
[info] set current project to mycpu-minimal (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/)
[info] Updating mycpu-minimal_2.13 ...
[info] JITTest:
[info] Minimal CPU - JIT Test
[info] - should correctly execute jit.asmbin and set a0 to 42
[info] Run completed in 1 minute, 7 seconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 82 s (01:22), completed Nov 28, 2025, 1:12:18 AM
```
>same as `make test`
* I use `gtkwave trace.vcd` to view the waveform.

## MyCPU : 1-single-cycle
[Commit 8c7c07e](https://github.com/AnnTaiwan/ca2025-mycpu/commit/8c7c07e0d6a1bbbc1ca981405179e7b5153becb9)
>1-single-cycle: Finish ALL TODOs, and pass 'make compliance'

### Code note
#### Simulate flow using verilator and `sim.cpp`
* In `Makefile`:
```makefile
verilator:
	cd .. && PATH=$$HOME/.local/bin:$$PATH sbt "project singleCycle" "runMain board.verilator.VerilogGenerator"
	cd verilog/verilator && verilator --trace --exe --cc sim.cpp Top.v && make -C obj_dir -f VTop.mk
```
* Files explanation

|code|role|How to get?|
|--|--|--|
|Top.scala|Chisel hardware code|handwritten, it can generate Top.v|
|Top.v|Hardware design (Verilog)|`sbt "project singleCycle" "runMain board.verilator.VerilogGenerator"`, `VerilogGenerator` is written in Top.scala|
|VTop\.mk|Makefile that builds the simulator|Generated by `verilator --trace --exe --cc sim.cpp Top.v`|
|VTop.cpp\/h|Hardware model (C++)|Generated by the same `verilator` command, then compiled by `make -C obj_dir -f VTop.mk`|
|sim.cpp|==The Testbench (C++ Driver)==, **Instantiates** the hardware (VTop object), **Drives** inputs (clock, reset, memory data), **Reads** outputs (CPU state, success signals), **Controls** simulation (time, termination), **Provides** memory model (behavioral MRAM/WRAM)|handwritten|

* In `sim.cpp`:
>It controls the signal data of the `top` and `memory` objects below.
```cpp
std::unique_ptr<VTop> top;
std::unique_ptr<VCDTracer> vcd_tracer;
std::unique_ptr<Memory> memory;
```
>This declares a software `Memory`, whose size is set to $1024 * 1024$ 32-bit words (4 MB).
```c++
class Memory {
  std::vector<uint32_t> memory;

public:
  Memory(size_t size) : memory(size, 0) {}
  // ... (remaining members omitted in this excerpt)
```
* In `1-single-cycle/src/main/scala/board/verilator/Top.scala`
```scala
object VerilogGenerator extends App {
  (new ChiselStage).emitVerilog(
    new Top(),
    Array("--target-dir", "1-single-cycle/verilog/verilator")
  )
}
```
>It generates the Verilog when `sbt "project singleCycle" "runMain board.verilator.VerilogGenerator"` (written in `Makefile`) is run.

#### Step by step
1. Chisel -> Verilog
```
Top.scala → Top.v (hardware description)
```
2. **Verilator:** Verilog -> C++
```
Top.v → VTop.cpp, VTop.h, VTop__Syms.cpp, ...
(Converted to C++ classes modeling the circuit)
```
3. Link with testbench
```
sim.cpp + VTop.cpp → VTop (executable)
(Testbench links with hardware model)
```
* Overall flow
```
Chisel (Top.scala)
    ↓ [SBT generates]
Top.v (Verilog RTL)
    ↓ [Verilator compiles]
VTop.cpp/VTop.h (C++ model)
    ↓ [Links with]
sim.cpp (C++ testbench)
    ↓ [Compiles to]
VTop executable (simulator)
```
In the Verilator flow, `Top.scala` holds the CPU instance. `verilog/verilator/sim.cpp` declares `VTop` and `Memory`, reads each instruction from `Memory`, and feeds it into `VTop` (`Top.scala` passes the instruction on to `CPU.scala`).
* overall `sim.cpp` structure.
```
┌─────────────────────────────────────────────────┐
│  sim.cpp (C++ Testbench)                        │
│                                                 │
│  ┌──────────────────────┐                       │
│  │ Memory (C++ class)   │ ← Pure software!      │
│  │ - vector<uint32_t>   │   NOT Memory.scala    │
│  │ - read()             │                       │
│  │ - write()            │                       │
│  │ - load_binary()      │                       │
│  └──────────────────────┘                       │
│           ↕                                     │
│  ┌─────────────────────────────────────┐        │
│  │ VTop (from verilator/Top.v)         │        │
│  │                                     │        │
│  │  ┌───────────────────────────────┐  │        │
│  │  │ CPU (from CPU.scala)          │  │        │
│  │  │ - No Memory module inside!    │  │        │
│  │  │ - Only I/O ports for memory   │  │        │
│  │  └───────────────────────────────┘  │        │
│  │                                     │        │
│  │ I/O Ports:                          │        │
│  │ - io_instruction_address →          │        │
│  │ - io_instruction ←                  │        │
│  │ - io_memory_bundle_address →        │        │
│  │ - io_memory_bundle_read_data ←      │        │
│  │ - io_memory_bundle_write_data →     │        │
│  └─────────────────────────────────────┘        │
│                                                 │
└─────────────────────────────────────────────────┘
```
#### `MemoryAccess.scala`
>Explain how `sb`, `sh`, and `sw` are designed, and why the r/w data needs to be shifted left.

`reg2_data` needs to be shifted left by `mem_address_index * 8` bits because every memory write targets a word-aligned 32-bit word, so the byte has to be moved into the lane selected by the low address bits.
```c
reg2_data = 0x000000FF // write to 0x1002
MEM = [00, 00, 00, 00] (pos = [0x1003, 0x1002, 0x1001, 0x1000])
// After shift left data
reg2_data = 0x00FF0000
// After writing to 0x1000 (start position, which is word-aligned)
MEM = [00, FF, 00, 00] (pos = [0x1003, 0x1002, 0x1001, 0x1000])
```
#### `CPUTest.scala` for ChiselTest simulation
>Explain the techniques employed for loading test program instructions.
* It instantiates the real hardware memory module defined under peripherals, i.e. `Memory.scala`.
```scala
class TestTopModule(exeFilename: String) extends Module {
  val io = IO(new Bundle {
    val mem_debug_read_address  = Input(UInt(Parameters.AddrWidth))
    val regs_debug_read_address = Input(UInt(Parameters.PhysicalRegisterAddrWidth))
    val regs_debug_read_data    = Output(UInt(Parameters.DataWidth))
    val mem_debug_read_data     = Output(UInt(Parameters.DataWidth))
  })
  val mem             = Module(new Memory(8192))
  val instruction_rom = Module(new InstructionROM(exeFilename))
  val rom_loader      = Module(new ROMLoader(instruction_rom.capacity))
  // ... (rest of the module omitted in this excerpt)
```
* InstructionROM architecture
```scala
class InstructionROM(instructionFilename: String) extends Module {
  val io = IO(new Bundle {
    val address = Input(UInt(Parameters.AddrWidth))
    val data    = Output(UInt(Parameters.InstructionWidth))
  })
  val (instructionsInitFile, capacity) = readAsmBinary(instructionFilename)
  val mem = Mem(capacity, UInt(Parameters.InstructionWidth))
  loadMemoryFromFileInline(mem, instructionsInitFile.toString.replaceAll("\\\\", "/"))
  io.data := mem.read(io.address)
  // ... (rest of the module omitted in this excerpt)
```
* `InstructionROM` reads the binary file and writes each instruction into an `asmbin_filename.txt` file.
    * It allocates a `Mem`.
    * `loadMemoryFromFileInline`: loads the .txt file into `mem` (happens once at start)
    * During simulation: reads one word per cycle from the memory array and sends it to `ROMLoader`.
    * `io.address` changes → `io.data` outputs the corresponding word
* `ROMLoader` architecture
    * It loads instructions by outputting `rom_address` (from 0 to `instruction_rom.capacity` - 1) to `InstructionROM`, which looks up its internal `mem` (holding all instructions) and returns the instruction data.
    * It writes each instruction read from `InstructionROM` into memory, starting at `Parameters.EntryAddress` (0x1000).
    * When loading finishes, it sets `io.load_finished=true`, and the CPU can then start its work.
* For example: (`InstructionROM` <-> `ROMLoader`)
```
┌─────────────────────────────────────────────────┐
│ InstructionROM (Array Index)                    │
├─────────────────────────────────────────────────┤
│ mem[0] = 0x12345678                             │
│ mem[1] = 0xABCDEF00                             │
│ mem[2] = 0xDEADBEEF                             │
│ mem[3] = 0xCAFEBABE                             │
└─────────────────────────────────────────────────┘
                        ↓
              ROMLoader translates
          (index << 2) + load_address
                        ↓
┌─────────────────────────────────────────────────┐
│ Memory (Byte Address Space)                     │
├─────────────────────────────────────────────────┤
│ 0x0000-0x0FFC: Empty                            │
│ 0x1000: 0x12345678 ← mem[0] mapped here         │
│ 0x1004: 0xABCDEF00 ← mem[1] mapped here         │
│ 0x1008: 0xDEADBEEF ← mem[2] mapped here         │
│ 0x100C: 0xCAFEBABE ← mem[3] mapped here         │
│ 0x1010+: Empty                                  │
└─────────────────────────────────────────────────┘
```
* Overall CPUTest
```
┌───────────────────────────────────────────────┐
│  TestTopModule (CPUTest.scala)                │
│                                               │
│  ┌──────────────────┐   ┌─────────────────┐   │
│  │ InstructionROM   │   │ ROMLoader       │   │
│  │ (from .asmbin)   │──→│ (copies to mem) │   │
│  └──────────────────┘   └─────────────────┘   │
│                                  ↓            │
│  ┌─────────────────────────────────────────┐  │
│  │ Memory (Hardware from Memory.scala)     │  │
│  │ - SyncReadMem(8192 words = 32KB)        │  │
│  │ - 1-cycle read latency                  │  │
│  │ - Byte-level write strobes              │  │
│  └─────────────────────────────────────────┘  │
│                    ↕                          │
│  ┌─────────────────────────────────────────┐  │
│  │ CPU (Single-cycle RISC-V)               │  │
│  │ - Instruction fetch                     │  │
│  │ - Data memory access                    │  │
│  └─────────────────────────────────────────┘  │
│                                               │
└───────────────────────────────────────────────┘
                       ↑
              ChiselTest simulator
```
### Test cases summary:
**1. Unit Tests (Component-Level)**
>**RegisterFileTest.scala**
* **Tests**
    * Register read/write operations
    * Simultaneous read/write behavior
    * Register x0 hardwired to zero
* **Validates**
    * RegisterFile module behavior in isolation
    * Write-after-read hazards
* **Outcome**
    * Confirms correct register file functionality per RISC-V spec

>**InstructionFetchTest.scala**
* **Tests**
    * PC increment
    * Jump target calculation
* **Validates**
    * InstructionFetch stage logic
    * PC update mechanisms
* **Outcome**
    * Ensures correct instruction sequencing and control flow

>**InstructionDecoderTest.scala**
* **Tests**
    * Opcode decoding
    * Immediate extraction
    * Control signal generation for R/I/S/B/U/J instruction formats
* **Validates**
    * InstructionDecode stage correctness
* **Outcome**
    * Confirms proper parsing and generation of control signals

>**ExecuteTest.scala**
* **Tests**
    * ALU operations (add)
    * Comparison operations (equal and not equal)
    * Branch condition evaluation (beq)
* **Validates**
    * Arithmetic and logic correctness of the Execute stage
* **Outcome**
    * Ensures ALU results match expected RISC-V behavior

**2. Integration Tests (End-to-End)**
>**CPUTest.scala: FibonacciTest**
* **Tests**
    * Recursive computation of Fibonacci(10)
* **Validates**
    * Function call flow
    * Stack read/write
    * Return address handling
    * Register allocation
    * Memory load/store
* **Expected Outcome**
    * Memory address **0x0004** holds **55**
* **Evaluates**
    * Full CPU instruction set integration and control/data flow

>**CPUTest.scala: QuicksortTest**
* **Tests**
    * Quicksort on an array of 10 integers
* **Validates**
    * Complex control flow (recursive calls, nested loops)
    * Array manipulation
    * Comparison and branching operations
* **Expected Outcome**
    * Memory locations **0x0004–0x0028** contain sorted values 0–9
* **Evaluates**
    * Correct behavior of algorithm execution and memory subsystem

>**CPUTest.scala: ByteAccessTest**
* **Tests**
    * SB, LB, LW instructions
* **Validates**
    * Byte-level memory accesses
    * Write strobes
    * Sign/zero extension
* **Expected Outcomes**
    * x5 = `0xDEADBEEF`
    * x6 = `0xEF` (byte extraction)
    * x1 = `0x15EF` (partial word)
* **Evaluates**
    * Memory granularity and alignment behavior

### Test result:
* Phase 1: Instruction Decode (Exercises 1–2)
```c
[info] InstructionDecoderTest:
[info] InstructionDecoder
[info] - should decode RV32I instructions and generate correct control signals
[info] Run completed in 4 seconds, 77 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 11 s, completed Nov 29, 2025, 11:49:28 PM
```
* Phase 2: ALU Control (Exercise 3)
```c
[info] ExecuteTest:
[info] Execute
[info] - should execute ALU operations and branch logic correctly
[info] Run completed in 4 seconds, 143 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 5 s, completed Nov 29, 2025, 11:53:47 PM
```
* Phase 3, 4, 5: `sbt "project singleCycle" test` (same as `make test`)
```c
[info] InstructionDecoderTest:
[info] InstructionDecoder
[info] - should decode RV32I instructions and generate correct control signals
[info] ByteAccessTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly handle byte-level store/load operations (SB/LB)
[info] InstructionFetchTest:
[info] InstructionFetch
[info] - should correctly update PC and handle jumps
[info] ExecuteTest:
[info] Execute
[info] - should execute ALU operations and branch logic correctly
[info] FibonacciTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute recursive Fibonacci(10) program
[info] RegisterFileTest:
[info] RegisterFile
[info] - should correctly read previously written register values
[info] - should keep x0 hardwired to zero (RISC-V compliance)
[info] - should support write-through (read during write cycle)
[info] QuicksortTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute Quicksort algorithm on 10 numbers
[info] Run completed in 24 seconds, 22 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
#### `make compliance` in 1-single-cycle => Success
```
INFO | === Generating batch test file with 41 tests ===
INFO | === Running all 41 tests in single SBT session ===
...
INFO | Batch test completed. Full log: /home/chouan/ncku_courses/ca2025/ca2025-mycpu/tests/riscof_work_1sc/batch_test.log
INFO | Results: 41 passed, 0 failed
INFO | Running Tests on Reference Model.
...
✅ Compliance tests complete. Results in riscof_work_1sc/
Completion time: Sun Nov 30 00:03:50 CST 2025
Copying results to results/ directory...
Cleaning up auto-generated RISCOF test files...
✅ Compliance tests complete. Results in results/
📊 View report: results/report.html
```
![image](https://hackmd.io/_uploads/Hy50JW9WZl.png)

### Analyze the waveform
#### `sb.asmbin`
* `make sim SIM_VCD=trace_sb.vcd SIM_ARGS="-instruction src/main/resources/sb.asmbin"`
* source code
```c=
.global _start
_start:
    li a0, 0x4
    li t0, 0xDEADBEEF
    sb t0, 0(a0)
    lw t1, 0(a0)
    li s2, 0x15
    sb s2, 1(a0)
    lw ra, 0(a0)
loop:
    j loop
```
* hexdump result
```
0000000 0513 0040 c2b7 dead 8293 eef2 0023 0055
0000010 2303 0005 0913 0150 00a3 0125 2083 0005
0000020 006f 0000
0000024
```
* waveform

![image](https://hackmd.io/_uploads/rkNSX6WzZl.png)
* First, check that `io_instruction` matches the hexdump result. The 1st instruction is `00400513`.
* In the code, line 5 (`sb t0, 0(a0)`) stores a byte (0xEF) at 0x4. These two numbers can be seen in io_memory_bundle_write_data (000000EF) and io_memory_bundle_address (00000004). Because 0x4 = 0b100 (the last two bits are 00), the byte is written to byte lane 0, so io_memory_bundle_write_strobe_0 is 1 and the others are zero.

![image](https://hackmd.io/_uploads/SyIsVp-MZg.png)
* In the code, line 6 (`lw t1, 0(a0)`) loads a word (0x000000EF) from 0x4. These two numbers can be seen in io_memory_bundle_read_data (000000EF) and io_memory_bundle_address (00000004).

![image](https://hackmd.io/_uploads/ryRVLTbG-g.png)
* In the code, line 8 (`sb s2, 1(a0)`) stores a byte (0x15) to a0[1], whose address is 0x5. These two numbers can be seen in io_memory_bundle_write_data (00001500) and io_memory_bundle_address (00000005). Because the address is 0x5 = 0b01==01==, the byte is written at index 1, so the written data (0x00000015) is shifted left by 8 bits and becomes 0x00001500. For the same reason, io_memory_bundle_write_strobe_1 is 1 and the others are zero.

![image](https://hackmd.io/_uploads/rJViI6WM-l.png)
>The above operations are mainly implemented in `MemoryAccess.scala`.
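The lane selection seen in the waveform can be reproduced with a small host-side C model. This is a hedged sketch of the idea only, not a translation of `MemoryAccess.scala`: the names `sb_request`, `make_sb`, and `apply_sb` are hypothetical, and the strobes are packed into one bit mask here instead of four separate signals.

```c
#include <stdint.h>

/* Hypothetical model of what a byte store puts on the memory bus. */
typedef struct {
    uint32_t write_data; /* byte shifted into its lane */
    uint8_t  strobe;     /* bit i set => byte lane i is written */
} sb_request;

static sb_request make_sb(uint32_t addr, uint32_t reg2_data)
{
    uint32_t index = addr & 0x3u; /* low two address bits select the lane */
    sb_request r;
    r.write_data = (reg2_data & 0xFFu) << (index * 8u);
    r.strobe = (uint8_t)(1u << index);
    return r;
}

/* Merge the request into the 32-bit word already stored in memory:
 * only the lanes whose strobe bit is set are replaced. */
static uint32_t apply_sb(uint32_t old_word, sb_request r)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < 4; i++) {
        if (r.strobe & (1u << i))
            mask |= 0xFFu << (i * 8u);
    }
    return (old_word & ~mask) | (r.write_data & mask);
}
```

Feeding it the two stores from `sb.asmbin` gives `write_data = 0x000000EF` with lane 0 for address 0x4 and `write_data = 0x00001500` with lane 1 for address 0x5. Applying both to an initially zero word yields `0x000015EF`, the value the test expects in x1.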
## MyCPU: 2-mmio-trap
[Commit 5eb5ef9](https://github.com/AnnTaiwan/ca2025-mycpu/commit/5eb5ef9881adbc5aa029c6ccb299318c15331e2c)
>2-mmio-trap: Finish ALL TODOs, and pass 'make compliance'

### Code note
#### Trap Handling
* `csr`
```clike
# csr format
 31          20 19        15 14    12 11     7 6        0
[   CSR addr   ] [ rs1/zimm ] [funct3] [  rd  ] [ opcode ]
# operations
CSRRW: write_data = rs1
CSRRS: write_data = csr_old | rs1
CSRRC: write_data = csr_old & (~rs1)
```
#### MMIO Peripherals
* MMIO Peripherals: Memory-mapped Timer, UART, and VGA devices with device address decoding

##### Timer Peripheral
* `mmio.h`
    * concept: MMIO (Memory-Mapped I/O) maps the Timer's hardware registers into the CPU's memory address space.
    * The processor uses standard load and store instructions on specific memory addresses that are mapped to hardware registers.
```cpp
/* Timer peripheral registers (base: 0x80000000) */
#define TIMER_BASE 0x80000000
/* +0x04: Timer limit register */
#define TIMER_LIMIT ((volatile unsigned int *) (TIMER_BASE + 4))
/* +0x08: Timer enable register */
#define TIMER_ENABLED ((volatile unsigned int *) (TIMER_BASE + 8))
```
>It says the offsets of `limit` (==count bound==) and `enable` (whether the timer is enabled) are 0x4 and 0x8 in the MMIO area, respectively. Hence, the data is written to addresses `0x80000004` and `0x80000008`, which are routed to the Timer's registers instead of RAM.
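To make the register offsets concrete, here is a minimal sketch of a configuration routine that can also be exercised on a host machine. It is an illustration under assumptions, not code from the project: `timer_configure` and the two `*_WORD` macros are hypothetical names, and the base pointer is passed in as a parameter so that a plain array can stand in for the device. On the CPU itself the base would be `(volatile uint32_t *) 0x80000000`.

```c
#include <stdint.h>

/* Word indices matching the +0x04 (limit) and +0x08 (enable) offsets in mmio.h. */
#define TIMER_LIMIT_WORD  (0x4u / sizeof(uint32_t))
#define TIMER_ENABLE_WORD (0x8u / sizeof(uint32_t))

/* Hypothetical helper: program the limit and the enable flag through
 * ordinary stores. `volatile` keeps the compiler from removing or
 * reordering the two device writes. */
static void timer_configure(volatile uint32_t *base, uint32_t limit, uint32_t enable)
{
    base[TIMER_LIMIT_WORD] = limit;
    base[TIMER_ENABLE_WORD] = enable;
}
```

Passing the base address as an argument is a design choice made only for this sketch: it lets the same routine run against a fake register block in a unit test, whereas the macros in `mmio.h` hard-code the address.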
>The registers `limit` and `enabled` in `Timer` are mapped into the memory space so that software can access them. Their addresses are defined in `Timer.scala` (`object DataAddr`):
```scala
class Timer extends Module {
  val io = IO(new Bundle {
    val bundle           = new RAMBundle
    val signal_interrupt = Output(Bool())
    val debug_limit      = Output(UInt(Parameters.DataWidth))
    val debug_enabled    = Output(Bool())
  })
  val count = RegInit(0.U(32.W))
  val limit = RegInit(Parameters.TimerDefaultLimit.U(32.W)) // Default: 100M cycles (~1s at 100MHz)
  io.debug_limit := limit
  val enabled = RegInit(true.B)
  io.debug_enabled := enabled

  // Memory-mapped register addresses
  object DataAddr {
    val enable = 0x8.U
    val limit  = 0x4.U
  }
  // ... (rest of the module omitted in this excerpt)
```
>The Timer counts up until it reaches the limit.
>The interrupt signal remains high while (count >= limit) and enabled.
>The Timer module can read and write data in that MMIO space.
>Tested in `TimerTest.scala`

---
##### VGA Peripherals
>`src/main/scala/peripheral/VGA.scala`
>`csrc/nyancat.c`
* Build details of nyancat.asmbin: `make nyancat.asmbin`
```shell
(riscof_env) chouan@chouan-ASUS-TUF:~/ncku_courses/ca2025/ca2025-mycpu/2-mmio-trap/csrc$ make nyancat.asmbin
python3 ../../scripts/gen-nyancat-data.py --delta --output nyancat-data.h
Downloading from: https://raw.githubusercontent.com/klange/nyancat/master/src/animation.c
Parsing animation frames...
Parsed 12 frames, 4096 pixels each
Compressing frames with delta-RLE...
Frame 0 (baseline): 4096 pixels → 576 opcodes (86% reduction)
Frame 1 (delta): 4096 pixels → 403 opcodes (91% reduction)
Frame 2 (delta): 4096 pixels → 485 opcodes (89% reduction)
Frame 3 (delta): 4096 pixels → 235 opcodes (95% reduction)
Frame 4 (delta): 4096 pixels → 471 opcodes (89% reduction)
Frame 5 (delta): 4096 pixels → 291 opcodes (93% reduction)
Frame 6 (delta): 4096 pixels → 417 opcodes (90% reduction)
Frame 7 (delta): 4096 pixels → 411 opcodes (90% reduction)
Frame 8 (delta): 4096 pixels → 482 opcodes (89% reduction)
Frame 9 (delta): 4096 pixels → 236 opcodes (95% reduction)
Frame 10 (delta): 4096 pixels → 486 opcodes (89% reduction)
Frame 11 (delta): 4096 pixels → 262 opcodes (94% reduction)
Total: 49152 pixels → 4755 opcodes (91% reduction)
Generated: nyancat-data.h
Header size: 30641 bytes
Verifying compression...
=== Verification Mode ===
Frame 0: ✓ Perfect match (576 opcodes)
Frame 1: ✓ Perfect match (403 opcodes)
Frame 2: ✓ Perfect match (485 opcodes)
Frame 3: ✓ Perfect match (235 opcodes)
Frame 4: ✓ Perfect match (471 opcodes)
Frame 5: ✓ Perfect match (291 opcodes)
Frame 6: ✓ Perfect match (417 opcodes)
Frame 7: ✓ Perfect match (411 opcodes)
Frame 8: ✓ Perfect match (482 opcodes)
Frame 9: ✓ Perfect match (236 opcodes)
Frame 10: ✓ Perfect match (486 opcodes)
Frame 11: ✓ Perfect match (262 opcodes)
✓ All frames verified successfully
/home/chouan/riscv-none-elf-gcc/bin/riscv-none-elf-as -R -march=rv32i_zicsr -mabi=ilp32 -o init_minimal.o init_minimal.S
/home/chouan/riscv-none-elf-gcc/bin/riscv-none-elf-gcc -O0 -Wall -march=rv32i_zicsr -mabi=ilp32 -c -o nyancat.o nyancat.c
/home/chouan/riscv-none-elf-gcc/bin/riscv-none-elf-ld -o nyancat.elf -T link.lds --oformat=elf32-littleriscv nyancat.o init_minimal.o
/home/chouan/riscv-none-elf-gcc/bin/riscv-none-elf-objcopy -O binary -j .text -j .data nyancat.elf nyancat.asmbin
```
* Memory-Mapped Registers: Base address: `0x30000000`
* Each color is 6-bit: RRGGBB
* The palette holds at most 16 colors, so the color index is 4-bit.
* Each 32-bit word contains 8 pixels with 4-bit color indices.
* Run the animation
```
# Build with SDL2 support
make verilator-sdl2
# Run with custom program
cd verilog/verilator/obj_dir
./VTop -vga -instruction ../../../src/main/resources/nyancat.asmbin -time 100000000
```
---
###### `nyancat.c`
* The memory-mapped I/O area is based at `0x30000000`.
```cpp
// VGA MMIO register addresses (base: 0x30000000)
#define VGA_BASE 0x30000000u
#define VGA_ID (VGA_BASE + 0x00)
#define VGA_CTRL (VGA_BASE + 0x04)
#define VGA_STATUS (VGA_BASE + 0x08)
#define VGA_UPLOAD_ADDR (VGA_BASE + 0x10)
#define VGA_STREAM_DATA (VGA_BASE + 0x14)
#define VGA_PALETTE(n) (VGA_BASE + 0x20 + ((n) << 2))
```
* VGA Control Registers

| Register Name | Address | R/W | Description | Bitfield / Notes |
| ------------------- | ------------- | --- | ---------------------------------- | ----------------- |
| **VGA_ID** | `0x3000_0000` | R | Device identification | Returns constant `0x56474131` (`"VGA1"`) |
| **VGA_CTRL** | `0x3000_0004` | R/W | Control register | Bit 0: Display enable (1 = on)<br>Bit 1: Auto-advance enable (1 = automatic frame cycling) |
| **VGA_STATUS** | `0x3000_0008` | R | Status register | Bit 0: V-sync active<br>Bit 1: H-sync active |
| **VGA_UPLOAD_ADDR** | `0x3000_0010` | W | Framebuffer upload address pointer | Format: `[frame_index:4][pixel_offset:12]` packed into 32-bit word address |
| **VGA_STREAM_DATA** | `0x3000_0014` | W | Streaming data write port | Write 32 bits = 8 pixels, address auto-increments |

* Palette Registers

| Register Name | Address Range | R/W | Description | Format |
| ------------------ | ------------------------------------ | --- | ------------- | ------- |
| **VGA_PALETTE(n)** | `0x3000_0020 + n*4`<br>*n = 0..15* | R/W | Palette entry | 6-bit `RRGGBB`<br>`RR` = red (bits 5:4)<br>`GG` = green (bits 3:2)<br>`BB` = blue (bits 1:0)<br>Each component ranges 0–3 |

* Compressed frame data is stored in `nyancat-data.h`, which is generated by `scripts/gen-nyancat-data.py`.
* The two functions below are used to read from or write to the MMIO area.
```cpp
// MMIO access functions
static inline void vga_write32(uint32_t addr, uint32_t val) {
    *(volatile uint32_t *) addr = val;
}
static inline uint32_t vga_read32(uint32_t addr) {
    return *(volatile uint32_t *) addr;
}
```
* `vga_init_palette`: Writes all 14 colors (0~13) into VGA_PALETTE; the last two entries (14~15) in VGA_PALETTE are black.
* `vga_upload_frame_delta`: **Decompression**
>Extracts the color and action (skip, repeat, EOF) from the compressed data, so it knows how to draw each frame.
>Uploads the decompressed frame to VGA.
* Two modes exist:
    * **Frame 0: Full RLE Decode**
      The baseline frame contains no “skip unchanged” instructions. All pixels are reconstructed entirely from compressed data:
        1. Read opcodes
        2. Maintain a current color
        3. Output repeated colors according to opcode

      Partial frames are padded with background color 0.

---
* **Frames 1–11: Delta Decode**
  Later frames describe only *changes* from the previous frame. Process:
    1. Start by copying previous frame → current frame
    2. Parse delta opcodes:
        * **Skip** means “reuse previous frame’s pixels”
        * **Repeat** means “overwrite with current color”
    3. Write updated pixels into `frame_buffer`

  After decoding, the current frame is saved as the new `prev_frame_buffer`.
* opcode format in compressed data
```
Frame 0 (Baseline):
┌──────────┬─────────────────────────────────┐
│ Opcode   │ Meaning                         │
├──────────┼─────────────────────────────────┤
│ 0x0X     │ SetColor: current_color = X     │
│ 0x2Y     │ Repeat: write (Y+1) pixels      │
│ 0x3Y     │ Repeat: write (Y+1)×16 pixels   │
│ 0xFF     │ End of frame                    │
└──────────┴─────────────────────────────────┘
Frame 1-11 (Delta):
┌──────────┬─────────────────────────────────┐
│ Opcode   │ Meaning                         │
├──────────┼─────────────────────────────────┤
│ 0x0X     │ SetColor: current_color = X     │
│ 0x1Y     │ Skip: advance (Y+1) pixels      │
│ 0x2Y     │ Repeat: write (Y+1) pixels      │
│ 0x3Y     │ Skip: advance (Y+1)×16 pixels   │
│ 0x4Y     │ Repeat: write (Y+1)×16 pixels   │
│ 0x5Y     │ Skip: advance (Y+1)×64 pixels   │
│ 0xFF     │ End of frame                    │
└──────────┴─────────────────────────────────┘
```
* Decode example:
    * Frame 0 :arrow_right: `0x00, 0x31, 0x25, .., 0xFF`: color is 0, repeat (1+1)×16=32 times, repeat 5+1=6 times, ..., end of frame
    * Frame 1 :arrow_right: `0x31, 0x10, 0x01, 0x20, 0x13, 0x00, 0x20, 0x32, 0x17, 0x01, 0x21, 0x10, 0x21, 0x10`: skip (1+1)×16=32 unchanged, skip 0+1=1 unchanged, color=1 (white), repeat 0+1=1 changed, skip 3+1=4 unchanged, color=0 (dark blue), repeat 0+1=1 changed, skip (2+1)×16=48 unchanged, skip 7+1=8 unchanged (per the table above, `0x1Y` skips Y+1 pixels; a ×64 skip would be encoded as `0x5Y`), color=1 (white), repeat 1+1=2 changed, skip 0+1=1 unchanged, repeat 1+1=2 changed, skip 0+1=1 unchanged
* Actually write the data to VGA for display (written to `VGA_STREAM_DATA`)
```cpp
// Upload decompressed frame to VGA (512 words = 4096 pixels / 8)
for (int i = 0; i < FRAME_SIZE; i += PIXELS_PER_WORD) {
    uint32_t packed = pack8_pixels(&frame_buffer[i]);
    vga_write32(VGA_STREAM_DATA, packed);
}
```
>Each iteration writes 8 pixels (one 32-bit word), at 4 bits per pixel.

### Test case summary
#### **1. CPU Integration Tests (`CPUTest.scala`)**
| Test Name | Evaluates | Outcome |
| --------------------- | --------------------------------------------------------- | ------------------------------------------------------------- |
| **FibonacciTest** | ALU operations, function calls, recursion, register usage | ✅ Memory[0x4] = 55 (10th Fibonacci) |
| **QuicksortTest** | Array manipulation, branching, memory ops, recursion | ✅ Memory[0x4–0x28] = sorted 0–9 |
| **ByteAccessTest** | SB/LB byte-level access, write strobes | ✅ x5 = 0xDEADBEEF, x6 = 0xEF, x1 = 0x15EF |
| **InterruptTrapTest** | Trap handling, CSR updates, interrupt flow | ✅ Memory[0x4] = 0x2022, mcause = 0x80000007, mstatus = 0x1888 |

---
#### **2. Execute Unit Test (`ExecuteTest.scala`)**
| Test Name | Evaluates | Outcome |
| ------------------ | ------------------------------------------------- | -------------------------------------------- |
| **CSR write-back** | CSR instructions (`csrc`, `csrs`, `csrw`, `csrr`) | ✅ Correct CSR read/write with proper masking |

---
#### **3. CLINT/CSR Tests (`CLINTCSRTest.scala`)**
| Test Name | Evaluates | Outcome |
| ------------------------------ | ------------------------------------------------------ | ------------------------------------------------------------ |
| **Machine-mode interrupt** | Timer/external interrupts, mstatus/mepc/mcause updates | ✅ mepc = 0x1904, mcause = 0x80000007, MIE cleared |
| **Environmental instructions** | ecall/ebreak handling, MEPC increment | ✅ ecall: mcause = 0xB, mepc = 0x2004<br>ebreak: mcause = 0x3 |

---
#### **4. Timer Peripheral Test (`TimerTest.scala`)**
| Test Name | Evaluates | Outcome |
| ------------------ | ------------------------------------------------- | ----------------------------------- |
| **MMIO registers** | Read/write limit (0x4) and enable (0x8) registers | ✅ limit = 0x990315, enabled = false |

---
#### **5. UART Peripheral Test (`UartMMIOTest.scala`)**
| Test Name | Evaluates | Outcome |
| ----------------------- | --------------------------------------------------- | -------------------------------------------------------- |
| **Comprehensive TX+RX** | UART TX, multi-byte RX, binary RX, timeout handling | ✅ Memory[0x100] = 0xCAFEF00D<br>test_status[0x104] = 0xF |

---
#### **Key Coverage**
| Area | Coverage |
| ------------------- | ------------------------------------------------------- |
| **Instruction Set** | RV32I base (ALU, load/store, branches, jumps) |
| **CSR/Privileged** | Zicsr, trap handling, M-mode interrupts |
| **Memory** | Byte/word access, write strobes, ROM loading |
| **MMIO** | Timer (limit/enable), UART (TX/RX with baud rate) |
| **System** | Interrupt prioritization, mret flow, ecall/ebreak traps |

### Test result
#### `make test`
```clike
cd .. && sbt "project mmioTrap" test
[info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29)
[info] loading project definition from /home/chouan/ncku_courses/ca2025/ca2025-mycpu/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/)
[info] set current project to mycpu-mmio-trap (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/)
[info] ByteAccessTest:
[info] [CPU] Byte access program
[info] - should store and load single byte
[info] CLINTCSRTest:
[info] [CLINT] Machine-mode interrupt flow
[info] - should handle external interrupt
[info] - should handle environmental instructions
[info] UartMMIOTest:
[info] [UART] Comprehensive TX+RX test
[info] - should pass all TX and RX tests
[info] ExecuteTest:
[info] [Execute] CSR write-back
[info] - should produce correct data for csr write
[info] FibonacciTest:
[info] [CPU] Fibonacci program
[info] - should calculate recursively fibonacci(10)
[info] TimerTest:
[info] [Timer] MMIO registers
[info] - should read and write the limit
[info] InterruptTrapTest:
[info] [CPU] Interrupt trap flow
[info] - should jump to trap handler and then return
[info] QuicksortTest:
[info] [CPU] Quicksort program
[info] - should quicksort 10 numbers
[info] Run completed in 23 seconds, 475 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 24 s, completed Nov 30, 2025, 11:49:15 PM
```
#### `make compliance` in 2-mmio-trap => Success
```
INFO | === Generating batch test file with 119 tests ===
INFO | === Running all 119 tests in single SBT session ===
...
INFO | Batch test completed. Full log: /home/chouan/ncku_courses/ca2025/ca2025-mycpu/tests/riscof_work_2mt/batch_test.log
INFO | Results: 119 passed, 0 failed
INFO | Running Tests on Reference Model.
...
✅ Compliance tests complete. Results in riscof_work_2mt/
Completion time: Mon Dec 1 01:12:40 CST 2025
Copying results to results/ directory...
Cleaning up auto-generated RISCOF test files...
✅ Compliance tests complete.
Results in results/
📊 View report: results/report.html
```
![image](https://hackmd.io/_uploads/S15XkZc-Wx.png)
#### `make demo` => I can see the animation successfully.
![image](https://hackmd.io/_uploads/H1Lg7W5-Zl.png)
#### Simulation
* waveform - `fibonacci.asmbin`
![image](https://hackmd.io/_uploads/HkYv2WqWbe.png)
* hexdump
```c
chouan@chouan-ASUS-TUF:~/ncku_courses/ca2025/ca2025-mycpu/2-mmio-trap/src/main/resources$ hexdump fibonacci.asmbin
0000000 1197 0000 8193 a981 0137 0040 0297 0000
0000010 8293 28c2 0317 0000 0313 2843 f863 0062
```
* waveform - `irqtrap.asmbin`
![image](https://hackmd.io/_uploads/BJhX6Zq-Zl.png)
* hexdump
```c
chouan@chouan-ASUS-TUF:~/ncku_courses/ca2025/ca2025-mycpu/2-mmio-trap/src/main/resources$ hexdump irqtrap.asmbin
0000000 1197 0000 8193 ae41 0137 0040 0297 0000
0000010 8293 2d82 0317 0000 0313 2d03 f863 0062
0000020 a023 0002 8293 0042 f06f ff5f f297 000f
```
* `verilator-sdl2` adds SDL2 graphics library support.
### Analyze the waveform
#### See waveform after executing `irqtrap.asmbin`
The program calls `enable_interrupt`, which is written in assembly:
```clike
    .globl enable_interrupt
enable_interrupt:
    la    t0, __trap_entry
    csrrw t1, mtvec, t0             # setup trap vector base
    li    t0, (1 << 3)              # MIE bit (Machine Interrupt Enable)
    csrs  mstatus, t0               # set MIE without clobbering other bits
    li    t0, (1 << 7) | (1 << 11)  # MTIE (bit 7) | MEIE (bit 11)
    csrs  mie, t0                   # enable timer + external interrupts
    ret
```
It writes the address of `__trap_entry` into `mtvec` (the trap vector base address), **which tells the CPU where to jump when an interrupt occurs**, and it enables interrupts globally by setting `mstatus[3] = 1` (bit 3 is MIE, Machine Interrupt Enable). `__trap_entry` in turn calls `trap_handler`, which is defined in `irqtrap.c`.
* waveform
![image](https://hackmd.io/_uploads/BJIfegEMZg.png)
>* `mtvec = 0x000010A0` is the trap vector base address. On an interrupt the CPU jumps there, and the code at that address calls `trap_handler`.
>* `mstatus[3] = 1`: the MIE bit is set to 1.
>* `mie = 0b1000_1000_0000` (`0x880`): MTIE (bit 7) and MEIE (bit 11) are 1, which enables timer + external interrupts.

In `CLINT.scala`:
```scala
val interrupt_enable_global   = io.csr_bundle.mstatus(3) // MIE bit (global enable)
val interrupt_enable_timer    = io.csr_bundle.mie(7)     // MTIE bit (timer enable)
val interrupt_enable_external = io.csr_bundle.mie(11)    // MEIE bit (external enable)
```
>* The waveform also shows the three interrupt enable signals (global, timer, external) successfully set to 1.

## MyCPU : 3-pipeline
[Commit a82fef3](https://github.com/AnnTaiwan/ca2025-mycpu/commit/a82fef373f22668e98fe809d22e207cb6f5303f6)
>3-pipeline: Finish ALL TODOs, and pass 'make compliance'
### code note
#### `fivestage_final/Control.scala`
1. Control Hazards:
    * Branch in `ID` :arrow_right: Flush IF
    ```
    IF -> ID -> EX -> MEM -> WB
          A (will branch)
    B (need to be flushed)
    ```
2. Data Hazards:
    * `lw`'s rd is used for ALU. :arrow_right: stall 1 cycle
    ```
    lw a0, 0(s0)
    add t0, t1, a0
    ```
    * `lw`'s rd is used for branch. :arrow_right: stall 1~2 cycles
    ```
    lw a0, 0(s0)
    beq a0, t0, next
    ```
    * Jump register dependencies. :arrow_right: stall until operands ready
    ```
    lw x1, 0(x2)      # Load x1
    jalr x3, x1, 0    # Jump to address in x1 → needs stall
    ```
##### Data hazard needs stall
**1. stall 1-cycle**
```she
IF   ID    EX    MEM   WB
     jalr  add
```
* After stalling 1 cycle
```she
IF   ID    EX    MEM   WB
     jalr  NOP   add      # jalr gets the ALU result by forwarding to ID.
```
If the combination above is (lw -> jalr) instead, it needs step 1 followed by step 2 :arrow_right: stall 2 cycles.
```she
IF   ID    EX    MEM   WB
     jalr  lw
```
* After stalling 1 cycle
```she
IF   ID    EX    MEM   WB
     jalr  NOP   lw
```
* It needs to stall again even with forwarding, because the load data is only ready at the end of MEM. So, it can't be forwarded to ID within the same cycle.
* Hence, the situation becomes the next case, `2. need to stall 2 cycles in total`.

**2. 
need to stall 2 cycles in total**
```she
IF   ID    EX    MEM   WB
     jalr  NOP   lw
```
* After stalling 1 more cycle
```she
IF   ID    EX    MEM   WB
     jalr  NOP   NOP   lw
```
##### Consider flush and stall
**1. Set all of the signals below to true**
* Control signals:
```scala
// - Flush ID/EX register (insert bubble) => EXE is NOP.
// - Freeze PC (don't fetch next instruction) => So, IF didn't change.
// - Freeze IF/ID (hold current fetch result) => So, ID didn't change.
io.id_flush := true.B
io.pc_stall := true.B
io.if_stall := true.B
```
* Explanation, using a load-use data hazard:
```c=
# clock 1
IF:  [instruction 3]
ID:  [ADD x3, x1, x4]  ← Hazard detected! x1 not ready
EX:  [LW x1, 0(x2)]    ← x1 being loaded
MEM: [...]
WB:  [...]

Control unit detects:
- ADD (in ID) needs x1
- x1 is being loaded by LW (in EX)
- Set: pc_stall=1, if_stall=1, id_flush=1

# clock 2
IF:  [instruction 3]   (frozen by pc_stall)
ID:  [ADD x3, x1, x4]  (frozen by if_stall, stays in IF/ID register)
EX:  [NOP]             (id_flush inserted bubble into ID/EX register)
MEM: [LW x1, ...]      (LW advanced normally)
WB:  [...]
```
**2. `jump_flag` is true**
* A taken branch is detected in `ID`.
```cpp
.elsewhen(io.jump_flag) {
  // ============ Control Hazard (Branch Taken) ============
  // Branch resolved in ID stage - only 1 cycle penalty
  // Only flush IF stage (not ID) since branch resolved early
  // TODO: Which stage needs to be flushed when branch is taken?
  // Hint: Branch resolved in ID stage, discard wrong-path instruction
  io.if_flush := true.B
  // Note: No ID flush needed - branch already resolved in ID!
  // This is the key optimization: 1-cycle branch penalty vs 2-cycle
}
```
>A `NOP` is sent to `ID` in place of the wrong-path instruction; the flow is shown below:
```
Before Branch Resolution (Cycle 1):
┌──────┬──────┬──────┬──────┬──────┐
│  IF  │  ID  │  EX  │ MEM  │  WB  │
├──────┼──────┼──────┼──────┼──────┤
│Inst  │ BEQ  │ ADD  │ SUB  │ OR   │
│at PC+4│Branch│      │      │      │
└──────┴──────┴──────┴──────┴──────┘
    ▲      │
    │      └─> Branch resolves: taken!
│ jump_flag = true │ └─ Wrong path instruction (fetched speculatively) After Branch Resolution (Cycle 2): ┌──────┬──────┬──────┬──────┬──────┐ │ IF │ ID │ EX │ MEM │ WB │ ├──────┼──────┼──────┼──────┼──────┤ │Inst │ NOP │ BEQ │ ADD │ SUB │ │at │(flush│Branch│ │ │ │Target│ ed) │ │ │ │ └──────┴──────┴──────┴──────┴──────┘ ▲ ▲ ▲ │ │ │ New Flushed Branch continues correct IF (already done its work) fetch stage ``` #### `fivestage_final/Forwarding.scala` * Remember to check if `rd_mem` or `rd_wb` is zero. And, make sure `io.reg_write_enable_mem` & `io.reg_write_enable_wb` are `true.B`. * Forwarding explanation ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ FORWARDING UNIT │ │ │ │ INPUTS (Read Sources): │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ ID Stage: │ │ │ │ rs1_id ──┐ Register addresses being read in ID stage │ │ │ │ rs2_id ──┘ (for branch comparison) │ │ │ │ │ │ │ │ EX Stage: │ │ │ │ rs1_ex ──┐ Register addresses being read in EX stage │ │ │ │ rs2_ex ──┘ (for ALU operations) │ │ │ └──────────────────────────────────────────────────────────────┘ │ │ │ │ INPUTS (Write Destinations - Hazard Sources): │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ MEM Stage: │ │ │ │ rd_mem ─────────────┐ Destination register in MEM │ │ │ │ reg_write_enable_mem┘ (most recent result) │ │ │ │ │ │ │ │ WB Stage: │ │ │ │ rd_wb ──────────────┐ Destination register in WB │ │ │ │ reg_write_enable_wb ┘ (older result) │ │ │ └──────────────────────────────────────────────────────────────┘ │ │ │ │ OUTPUTS (Forwarding Control Signals): │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ To ID Stage: │ │ │ │ reg1_forward_id ──┐ Control for rs1 bypass mux │ │ │ │ reg2_forward_id ──┘ Control for rs2 bypass mux │ │ │ │ │ │ │ │ To EX Stage: │ │ │ │ reg1_forward_ex ──┐ Control for rs1 bypass mux │ │ │ │ reg2_forward_ex ──┘ Control for rs2 bypass mux │ │ │ 
└──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
>First check `EXE` forwarding, then `ID` forwarding.
>Check whether `rs1` or `rs2` in `EXE` or `ID` is the same as `rd` in `MEM` or `WB`.
>Output `io.reg1_forward_id`, `io.reg1_forward_ex`, `io.reg2_forward_id`, `io.reg2_forward_ex` as a **forward type**, which indicates which stage's `rd` value (`MEM` or `WB`) is forwarded back to an earlier stage (`ID` or `EXE`).
### "CA25: Exercise 21" in 3-pipeline
>Question
```c=
// ============================================================
// [CA25: Exercise 21] Hazard Detection Summary and Analysis
// ============================================================
// Conceptual Exercise: Answer the following questions based on the hazard
// detection logic implemented above
//
// Q1: Why do we need to stall for load-use hazards?
// A: [Student answer here]
// Hint: Consider data dependency and forwarding limitations
//
// Q2: What is the difference between "stall" and "flush" operations?
// A: [Student answer here]
// Hint: Compare their effects on pipeline registers and PC
//
// Q3: Why does jump instruction with register dependency need stall?
// A: [Student answer here]
// Hint: When is jump target address available?
//
// Q4: In this design, why is branch penalty only 1 cycle instead of 2?
// A: [Student answer here]
// Hint: Compare ID-stage vs EX-stage branch resolution
//
// Q5: What would happen if we removed the hazard detection logic entirely?
// A: [Student answer here]
// Hint: Consider data hazards and control flow correctness
//
// Q6: Complete the stall condition summary:
// Stall is needed when:
// 1. ? (EX stage condition)
// 2. ? (MEM stage condition)
//
// Flush is needed when:
// 1. ? (Branch/Jump condition)
```
>Answer:
* Q1: Why do we need to stall for load-use hazards?
    `lw` only obtains its loaded data at the end of the `MEM` stage, so the value of its `rd` is not yet ready for the next instruction, which is in the `ID` stage at that point. If that instruction's `rs1` or `rs2` is the same as `lw`'s `rd`, the instruction after `lw` must be stalled and a NOP inserted into EX in the next cycle. The forwarding limitation is that the loaded data only becomes available at the end of the MEM stage, so it can't be forwarded to ID in the same cycle.
* Q2: What is the difference between "stall" and "flush" operations?
    * stall: The **current instruction remains in its stage and does not advance to the next stage.** It is mainly a "**freeze**" action. A stall is often accompanied by flushing the pipeline register that feeds the next stage, so that a `NOP` enters it; this is also called "inserting a bubble". If `IF` is stalled, `pc` holds its value as well, which means the next instruction is not fetched.
    * flush: The contents of a pipeline register between stages, such as ID/EX or EX/MEM, are wiped out, so the instruction held there is discarded.
* Q3: Why does jump instruction with register dependency need stall?
    ```
    add t0, a1, a2
    jalr t1, t0, 0
    ```
    `t0` is the register dependency. In this example, `jalr` cannot know its jump destination until the ALU result of `add`, which is still in the `EX` stage, is available. Hence, `jalr` needs to stall.
* Q4: In this design, why is branch penalty only 1 cycle instead of 2?
    A branch is resolved in the `ID` stage, and the only stage before `ID` is `IF`. Hence, only the wrong-path instruction in `IF` needs to be flushed, which wastes just 1 cycle.
* Q5: What would happen if we removed the hazard detection logic entirely?
    ```
    add t0, a1, a2
    jalr t1, t0, 0
    ```
    `t0` is the register dependency. Without hazard detection, `jalr` would read the old value of `t0` rather than the result of `add`, so it would jump to the wrong address.
* Q6: Complete the stall condition summary
    * Stall is needed when:
        1. 
`rs1` or `rs2` required in `ID` is the same as `rd` of instruction in `EX` (EX stage condition) 2. `rs1` or `rs2` required in `ID` is the same as `rd` of instruction in `MEM` (MEM stage condition) * Flush is needed when: 1. the branch action is taken or stall action is needed (which needs to flush pipeline registers for next stage) (Branch/Jump condition) ### Test case summary * Test Structure * All tests run on 4 pipeline variants with different hazard handling: * Three-stage (hazardX1=26): IF→ID→EX/MEM/WB combined * Five-stage Stall (hazardX1=46): Stall-only, no forwarding * Five-stage Forward (hazardX1=27): Data forwarding enabled * Five-stage Final (hazardX1=26): Forwarding + early branch resolution * 1. PipelineProgramTest (6 tests × 4 variants = 24 tests) * Test: fibonacci(10) * Evaluates: Recursive calls, stack ops, register allocation, ALU ops * Expected Outcome: `Memory[0x4] = 55` * Hazards Tested: Call/return deps, RAW hazards * Test: quicksort * Evaluates: Sorting, array access, branching, recursion * Expected Outcome: `Memory[0x4-0x28]` = sorted 0–9 * Hazards Tested: Load-use hazards, branch deps * Test: sb (store byte) * Evaluates: SB/LB ops, byte alignment, write strobes * Expected Outcome: * `x5 = 0xDEADBEEF` * `x6 = 0xEF` * `x1 = 0x15EF` * Hazards Tested: Byte-level store-load forwarding * Test: hazard (basic) * Evaluates: RAW, control hazards, load-use stalls * Program Flow: * RAW: `sub t1, zero, t0 → and t2, t0, t1` * Jump: `j skip1` * Load-use: `lw t2, 2(t2) → or t3, t1, t2` * Branch on dependency: `bne t3, t4, skip1` * JALR dep: `jalr t4, 8(t4)` * Expected Outcome: * `x1` = 26 (three-stage, final), 27 (forward), 46 (stall-only) * `Memory[4] = 1`, `Memory[8] = 3` * Hazards Tested: All basic pipeline hazards * Test: hazard_extended (comprehensive) * Evaluates: 10 complex hazard scenarios * Sections: * WAW: `mem[0x10]=2` * Store-Load Forwarding: `mem[0x14]=0xAB` * Multi-Load Chains: `mem[0x18]=0` * Branch Condition RAW: `mem[0x1C]=10` * JAL RA: 
`mem[0x20]=PC+4` * CSR RAW: `mem[0x24]=0x1888+diff` * Long RAW Chain: `mem[0x28]=5` * WB Forwarding: `mem[0x2C]=7` * Load-to-Store: `mem[0x30]=0` * Branch Multi-RAW: `mem[0x34]=20` * Expected Outcome: All memory values correct * Hazards Tested: Every RISC-V hazard type * Test: irqtrap * Evaluates: Interrupt handling, CSR updates, trap entry/exit * Program Flow: * Initialize: `Memory[0x4] = 0xDEADBEEF` * Enable interrupts * WFI loop * Inject interrupt * Trap handler writes: `Memory[0x4] = 0x2022` * Expected Outcome: * `Memory[0x4]: 0xDEADBEEF → 0x2022` * `mstatus = 0x1888` * `mcause = 0x80000007` or `0x8000000B` * Hazards Tested: Trap-related pipeline timing * 2. PipelineUartTest (1 test × 4 variants = 4 tests) * Test: UART Comprehensive * Evaluates: UART TX/RX, MMIO timing, interrupts * Program Flow: * TX send * Multi-byte RX * Binary RX * Timeout RX * Expected Outcome: * `Memory[0x100] = 0xCAFEF00D` * `Memory[0x104] = 0xF` (all tests passed) * Hazards Tested: MMIO load-use, interrupt ordering * 3. PipelineRegisterTest (1 test) * Test: Pipeline Register Stall/Flush * Evaluates: Bubble insertion, stall, flush logic * Method: * 1000 random cycles * No stall/flush → Output=Input * Stall → Output=frozen * Flush → Output=DefaultValue * Expected Outcome: All cycles correct * Hazards Tested: Pipeline control paths ### Test result #### `make test` ```clike= (riscof_env) chouan@chouan-ASUS-TUF:~/ncku_courses/ca2025/ca2025-mycpu/3-pipeline$ make test cd .. && sbt "project pipeline" test [info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29) [info] loading project definition from /home/chouan/ncku_courses/ca2025/ca2025-mycpu/project [info] loading settings for project root from build.sbt... 
[info] set current project to mycpu-root (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/) [info] set current project to mycpu-pipeline (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/) [info] PipelineProgramTest: [info] Three-stage Pipelined CPU [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] Five-stage Pipelined CPU with Stalling [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] Five-stage Pipelined CPU with Forwarding [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] Five-stage Pipelined CPU with Reduced Branch Delay [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] PipelineUartTest: [info] Three-stage Pipelined CPU UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Stalling UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Forwarding UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test [info] - should pass all TX and RX 
tests
[info] PipelineRegisterTest:
[info] Pipeline Register
[info] - should be able to stall and flush
[info] Run completed in 1 minute, 16 seconds.
[info] Total number of tests run: 29
[info] Suites: completed 3, aborted 0
[info] Tests: succeeded 29, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 77 s (01:17), completed Dec 2, 2025, 2:15:55 AM
```
#### `make compliance` in 3-pipeline => Success
```clike
INFO | === Generating batch test file with 119 tests ===
INFO | === Running all 119 tests in single SBT session ===
...
INFO | Batch test completed. Full log: /home/chouan/ncku_courses/ca2025/ca2025-mycpu/tests/riscof_work_3pl/batch_test.log
INFO | Results: 119 passed, 0 failed
INFO | Running Tests on Reference Model.
...
✅ Compliance tests complete. Results in riscof_work_3pl/
Completion time: Tue Dec 2 02:28:10 CST 2025
Copying results to results/ directory...
Cleaning up auto-generated RISCOF test files...
✅ Compliance tests complete. Results in results/
📊 View report: results/report.html
```
![image](https://hackmd.io/_uploads/SkfvR8oWbe.png)
### Analyze the waveform
#### See waveform after executing `hazard.asmbin`
* What does `hazard.S` do?
  It is a pipeline hazard stress test program that deliberately creates various types of hazards to verify that the CPU's forwarding and stall logic works correctly.
* Inspect the `ctrl` wires to check the hazard control

![image](https://hackmd.io/_uploads/Sk20WUrfbx.png)
At the red line, `rs2_id` is the same as `rd_ex`, and `memory_read_enable_ex` is also 1. This is a load-use hazard, so the stall condition is met and the control unit outputs:
```cpp
io.id_flush := true.B
io.pc_stall := true.B
io.if_stall := true.B
```
The signals shown in the waveform match this, so the behavior is correct.
![image](https://hackmd.io/_uploads/rJOZ7LHMbg.png)
At the red line, `jump_flag` is 1, so `if_flush` is set to 1, which can also be seen in the waveform.
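
The two waveform checks above can be summarized as a small software model. This is only a sketch in C, not the `Control.scala` source: it covers just the two cases observed here (load-use stall and taken jump), the names `HazardInputs`, `HazardOutputs` and `hazard_control` are hypothetical, and the `rd_ex != 0` guard is an assumption that a write to `x0` never creates a dependency. The stall case is checked before the jump case, mirroring the `.elsewhen(io.jump_flag)` ordering quoted in the code note.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bundle of the signals read from the waveform. */
typedef struct {
    uint8_t rs1_id;                /* source register 1 of the instruction in ID */
    uint8_t rs2_id;                /* source register 2 of the instruction in ID */
    uint8_t rd_ex;                 /* destination register of the instruction in EX */
    bool    memory_read_enable_ex; /* instruction in EX is a load */
    bool    jump_flag;             /* branch/jump resolved as taken in ID */
} HazardInputs;

/* Hypothetical bundle of the control outputs. */
typedef struct {
    bool pc_stall; /* freeze PC */
    bool if_stall; /* freeze IF/ID register */
    bool id_flush; /* insert bubble into ID/EX register */
    bool if_flush; /* discard wrong-path instruction in IF */
} HazardOutputs;

static HazardOutputs hazard_control(HazardInputs in)
{
    HazardOutputs out = { false, false, false, false };

    /* Load-use hazard: the load in EX writes a register that ID reads.
       Assumption: x0 is never treated as a dependency. */
    bool load_use = in.memory_read_enable_ex && in.rd_ex != 0 &&
                    (in.rd_ex == in.rs1_id || in.rd_ex == in.rs2_id);

    if (load_use) {
        out.id_flush = true;
        out.pc_stall = true;
        out.if_stall = true;
    } else if (in.jump_flag) {
        /* Branch resolved in ID: only the instruction in IF is wrong-path. */
        out.if_flush = true;
    }
    return out;
}
```

Calling `hazard_control` with `rs2_id == rd_ex` and `memory_read_enable_ex` set reproduces the first waveform (all three stall signals high), and calling it with only `jump_flag` set reproduces the second one (`if_flush` high).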
* Inspect the `forwarding` wires to check forwarding detection
```
object ForwardingType {
  val NoForward      = 0.U(2.W)
  val ForwardFromMEM = 1.U(2.W)
  val ForwardFromWB  = 2.U(2.W)
}
```
![image](https://hackmd.io/_uploads/rkb7UUBGZe.png)
At the red line, first look at forwarding from MEM to EX: `rs2_ex` and `rd_mem` are the same, so `reg2_forward_ex` is set to 1, which means `ForwardFromMEM`.
Then look at forwarding from MEM to ID: `rs1_id` and `rd_mem` are the same, so `reg1_forward_id` is set to 1, which means `ForwardFromMEM`.
![image](https://hackmd.io/_uploads/B17rwIrzWe.png)
At the red line, first look at forwarding from WB to EX: `rs2_ex` and `rd_wb` are the same, so `reg2_forward_ex` is set to 2, which means `ForwardFromWB`.
Then look at forwarding from WB to ID: `rs1_id` and `rd_wb` are the same, so `reg1_forward_id` is set to 2, which means `ForwardFromWB`.
### Run HW2 programs on mycpu
[commit c04130b](https://github.com/AnnTaiwan/ca2025-mycpu/commit/c04130be5c8e77e468830e58fcec4f445f8d5d1d)
* command: `sbt "project pipeline" "testOnly *PipelineProgramTest"`
#### Run uf8-decode/encode
* Description:
    The main entry point tests the UF8 encode/decode functions for all values from 0 to 255. It stores the input values, decoded results, and encoded results to memory for validation. It also checks two conditions:
    1. Encoded result must match original input (encode(decode(x)) == x)
    2. 
Decoded values must be monotonically increasing * Test result (**Pass**)`3-pipeline/src/test/scala/riscv/PipelineProgramTest.scala` :::spoiler ``` (riscof_env) chouan@chouan-ASUS-TUF:~/ncku_courses/ca2025/ca2025-mycpu$ sbt "project pipeline" "testOnly *PipelineProgramTest" [info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29) [info] loading project definition from /home/chouan/ncku_courses/ca2025/ca2025-mycpu/project [info] loading settings for project root from build.sbt... [info] set current project to mycpu-root (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/) [info] set current project to mycpu-pipeline (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/) [info] compiling 1 Scala source to /home/chouan/ncku_courses/ca2025/ca2025-mycpu/3-pipeline/target/scala-2.13/test-classes ... [info] PipelineProgramTest: [info] Three-stage Pipelined CPU [info] - should do uf8_decode/encode from 0-255 [info] Five-stage Pipelined CPU with Stalling [info] - should do uf8_decode/encode from 0-255 [info] Five-stage Pipelined CPU with Forwarding [info] - should do uf8_decode/encode from 0-255 [info] Five-stage Pipelined CPU with Reduced Branch Delay [info] - should do uf8_decode/encode from 0-255 [info] Run completed in 49 seconds, 330 milliseconds. [info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. 
[success] Total time: 52 s, completed Dec 9, 2025, 11:01:21 PM ``` ::: #### Run fast_rsqrt * Description: Computes 1/sqrt(x) * Test cases ``` * Test Cases: * rsqrt(1) = 1.0 → 65536 * rsqrt(4) = 0.5 → 32768 * rsqrt(16) = 0.25 → 16384 * rsqrt(20) ≈ 0.2236 → 14654 * rsqrt(100) ≈ 0.1 → 6553 * rsqrt(258) ≈ 0.0623 → 4080 * rsqrt(650) ≈ 0.0392 → 2570 ``` * Test result (**Pass**)`3-pipeline/src/test/scala/riscv/PipelineProgramTest.scala` :::spoiler ``` (riscof_env) chouan@chouan-ASUS-TUF:~/ncku_courses/ca2025/ca2025-mycpu$ sbt "project pipeline" "testOnly *PipelineProgramTest" [info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29) [info] loading project definition from /home/chouan/ncku_courses/ca2025/ca2025-mycpu/project [info] loading settings for project root from build.sbt... [info] set current project to mycpu-root (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/) [info] set current project to mycpu-pipeline (in build file:/home/chouan/ncku_courses/ca2025/ca2025-mycpu/) [info] compiling 1 Scala source to /home/chouan/ncku_courses/ca2025/ca2025-mycpu/3-pipeline/target/scala-2.13/test-classes ... [info] PipelineProgramTest: [info] Three-stage Pipelined CPU [info] - should do fast_rsqrt [info] Five-stage Pipelined CPU with Stalling [info] - should do fast_rsqrt [info] Five-stage Pipelined CPU with Forwarding [info] - should do fast_rsqrt [info] Five-stage Pipelined CPU with Reduced Branch Delay [info] - should do fast_rsqrt [info] Run completed in 22 seconds, 941 milliseconds. [info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 26 s, completed Dec 9, 2025, 11:45:11 PM ``` ::: * waveform ![image](https://hackmd.io/_uploads/Hyv4tTBzZx.png) In `writeData`, I can see there are data like 65536, 32768, 16384, 14654, ..., which are the same as my test answer. 
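
The expected values in the test-case list can be cross-checked with an integer-only reference. The C sketch below is not the HW2 `fast_rsqrt` implementation; it assumes the results are fixed-point values scaled by 65536 (inferred from `rsqrt(1) → 65536`) and rounded down, so the reference is `floor(65536 / sqrt(x))`, computed as the integer square root of `floor(2^32 / x)`. The function names `isqrt_u64` and `rsqrt_q16_reference`, and the saturating result for `x == 0`, are hypothetical choices made for illustration.

```c
#include <stdint.h>

/* Largest r with r * r <= n, found by binary search.
   The upper bound 0xFFFFFFFF keeps mid * mid within 64 bits. */
static uint32_t isqrt_u64(uint64_t n)
{
    uint64_t lo = 0, hi = 0xFFFFFFFFull;
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo + 1) / 2;
        if (mid * mid <= n)
            lo = mid;      /* mid is still a valid root candidate */
        else
            hi = mid - 1;  /* mid is too large */
    }
    return (uint32_t)lo;
}

/* Reference for floor(65536 / sqrt(x)).
   y <= 65536 / sqrt(x)  <=>  y * y * x <= 2^32  <=>  y * y <= floor(2^32 / x). */
static uint32_t rsqrt_q16_reference(uint32_t x)
{
    if (x == 0)
        return 0xFFFFFFFFu; /* assumption: saturate, since 1/sqrt(0) is undefined */
    return isqrt_u64((1ull << 32) / x);
}
```

Under these assumptions the reference returns the same numbers as the list above for the inputs 1, 4, 16, 20, 100, 258 and 650, so it can serve as a quick sanity check against the values observed in `writeData`.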
## What I have learned from Chisel Bootcamp
I learned that writing Chisel makes it easy to generate the corresponding Verilog code. The bootcamp is thorough enough to show how modules relate to their test code, which makes it efficient to test each small module, and it also teaches the Chisel syntax. It is well organized as a Jupyter notebook, so beginners do not need to worry about how to build a Scala project or set up the environment.