
<h1> Assignment 3: Your Own RISC-V CPU</h1>

contributed by <[`hsuhsuhs`](https://github.com/hsuhsuhs/NCKU-ca2025-mycpu)>

---

[Toc]

## Environment Setup
This section outlines the procedure for setting up the Chisel development environment on macOS.

* To simulate hardware designs and view waveforms, Verilator and Surfer were installed using Homebrew.
    ```bash
    $ brew install verilator surfer
    ```
    * Verilator: Used for compiling Verilog (generated from Chisel) into a fast C++ cycle-accurate simulation model.
    * Surfer: A modern waveform viewer used to debug signals.
* Setting up Scala and sbt
    * Clean Installation
      Existing versions of sbt or jenv were removed to prevent conflicts:
        ```bash
        $ brew uninstall sbt
        $ brew uninstall jenv
        ```
    * Installing SDKMAN and Java 11
      SDKMAN was used to manage parallel versions of multiple Software Development Kits.
        ```bash
        # Install SDKMAN
        $ curl -s "https://get.sdkman.io" | bash
        $ source "$HOME/.sdkman/bin/sdkman-init.sh"
        # Install JDK 11 and sbt
        $ sdk install java 11.0.29-tem
        $ sdk install sbt
        ```
* Chisel Environment Verification
    ```bash
    $ git clone https://github.com/ucb-bar/chisel-tutorial
    $ cd chisel-tutorial
    $ git checkout release
    $ sbt run
    ```

### Chisel vs. Verilog
Chisel acts as a Hardware Construction Language rather than a direct hardware description language like Verilog. The workflow involves writing Scala/Chisel code, which is then compiled into Verilog for use with standard EDA tools.

* Abstraction: Chisel uses object-oriented programming to generate complex Verilog logic programmatically.
* Design Enforcement: Chisel enforces modern digital design practices (e.g., synchronous logic) by limiting support for complex features like negative-edge triggering or unstructured multi-clock designs, which are often sources of bugs in Verilog.
#### Chisel Generation Flow
```mermaid
graph LR
    %% Nodes
    A[Chisel Code High-Level Scala]
    B(Chisel Compiler sbt / FIRRTL)
    C[Verilog File Hardware Description]
    D((Simulation Verilator))
    E((FPGA Bitstream))

    %% Edges / Connections
    A -->|1. Elaboration| B
    B -->|2. Generation| C
    C -->|3. Verification| D
    C -->|4. Synthesis| E

    %% Styling to highlight the Software vs Hardware boundary
    style A fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style B fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,stroke-dasharray: 5 5
    style C fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
```

### RISCOF (RISC-V Compatibility Framework)
To ensure strict adherence to the RISC-V Instruction Set Architecture specification, this project incorporates the RISCOF validation suite. Unlike ad-hoc unit tests, which mainly check that an instruction works on a few hand-picked inputs, RISCOF performs architectural compliance testing to detect subtle deviations from the standard.

* The compliance testing process employs a differential testing strategy based on signature comparison.
    * The framework selects a suite of assembly tests (e.g., `add-01.S`, `sll-01.S`) that cover all base instructions, including edge cases such as overflow, zero-extension, and signed comparisons.
    * RISCOF compares the signature file from the DUT against that of the Reference Model. Any discrepancy signifies a non-compliant implementation.
* Environment
  ![Screenshot 2025-12-10 11.38.40 PM](https://hackmd.io/_uploads/ryYXEGwMWx.png =300x)
* The compliance suite was executed using the project's build system command:
    ```bash
    $ make compliance
    ```

#### Adapting RISCOF for Apple Silicon (macOS)
While establishing the compliance testing framework, I encountered significant compatibility issues due to the architectural differences between the standard lab environment and my local development environment.

* The Architecture Mismatch Issue
  When attempting to run the standard `make compliance` command, the simulation failed immediately with a Java Native Interface error:
```bash
[error] Uncaught exception when running riscv.compliance.ComplianceTest: java.lang.UnsatisfiedLinkError: ... (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64'))
```
* My local JVM runs in native ARM64 mode. However, the simulation flow, probably through the Python wrappers or the default Verilator settings, was trying to load x86_64 shared libraries.
* System Reconfiguration
  It became clear that a straightforward port was insufficient. I needed to force the build system to respect the local architecture and correct the path resolution for the Python plugins.
    * Step 1: I created a custom configuration file, such as `config-3-pipeline.ini`, to define the paths for the DUTPlugin (mycpu) and ReferencePlugin (rv32emu) explicitly. This bypassed the default Linux-centric paths.
        ```ini
        [RISCOF]
        ReferencePlugin=rv32emu
        ReferencePluginPath=rv32emu_plugin
        DUTPlugin=mycpu
        DUTPluginPath=mycpu_plugin
        ...
        ```
    * Step 2: The custom plugins were not visible to the default Python environment. I resolved this by manually exporting `PYTHONPATH` to include the plugin directories before execution:
        ```bash
        export PYTHONPATH=$PROJECT_ROOT/tests/mycpu_plugin:$PROJECT_ROOT/tests/rv32emu_plugin:$PYTHONPATH
        ```
    * Step 3: To automate this process and ensure reproducibility, I modified the `Makefile`. I created a target that explicitly passes the custom configuration and suite paths to the RISCOF runner, overriding the defaults.
        ```makefile
        compliance:
        	cd ../tests && $(RISCOF) run --config $(CONFIG) --suite $(SUITE) --env $(ENV) --work-dir $(WORK)
        ```
* Executing via sbt
  Instead of relying on `make compliance`, which triggers the incompatible Python wrapper, I can now execute both the basic unit tests and the architectural compliance tests using a single command within each project directory:
    ```bash
    $ sbt test
    ```

## 0-minimal
In this section, I implemented a minimal RISC-V core designed specifically to pass the `jit.asmbin` test.
This exercise demonstrates the concept of Just-In-Time compilation and self-modifying code, where the processor treats data written to memory as executable instructions.

### Implementation Strategy
Unlike a full RV32I processor, this minimal core only supports the 5 instructions required for the JIT test:
* **SW (Store Word)**: Used by the program to write new machine code instructions into memory at runtime.
* **LW (Load Word)**: Verifies memory reads.
* **JALR (Jump and Link Register)**: Critical for transferring control from the main program to the newly generated code in memory.
* **AUIPC & ADDI (Add Upper Immediate to PC & Add Immediate)**: Used for address calculation and register manipulation.

### Test Verification
The test `jit.asmbin` writes a small code sequence to memory, jumps to it, and calculates a result. The expected return value in register `a0` is 42.

* The testing environment is invoked using the standard sbt command line interface. This ensures consistency across different development stages, from the minimal CPU to the pipelined version.
    ```bash
    $ sbt
    [info] started sbt server
    [sbt:mycpu-root> project minimal
    [sbt:mycpu-minimal> test
    ```
* Test Output Analysis
    ```bash
    [info] JITTest:
    [info] Minimal CPU - JIT Test
    [info] - should correctly execute jit.asmbin and set a0 to 42
    [info] Run completed in 1 minute, 7 seconds.
    [info] Total number of tests run: 1
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    [success] Total time: 68 s
    ```

### Use Surfer to view waveform
```bash
surfer trace.vcd
```
![Screenshot 2025-12-14 6.37.55 PM](https://hackmd.io/_uploads/r1xeEzhf-l.png =480x)

## 1-single-cycle
This design implements the complete RV32I Base Integer Instruction Set, making it capable of executing general-purpose C programs.
### Architectural Overview
The processor adopts a classic single-cycle microarchitecture where every instruction completes its execution within one clock cycle ($CPI = 1$). The data path is structured into five distinct stages, connected via Chisel Bundles to ensure modularity:

![image](https://hackmd.io/_uploads/S1ricIEz-l.png)

* **Instruction Fetch (IF)**: Maintains the Program Counter (PC) and fetches instructions from the Instruction Memory. It handles sequential execution ($PC+4$) and updates the PC for jumps/branches.
* **Instruction Decode (ID)**: Decodes the 32-bit instruction to generate control signals (e.g., MemRead, MemWrite, RegWrite). It also reads operands from the Register File.
* **Execute (EX)**: Performs arithmetic/logical operations using the ALU and calculates branch target addresses.
* **Memory Access (MEM)**: Handles data exchange with Data Memory for Load and Store instructions.
* **Write Back (WB)**: Selects the final result and writes it back to the destination register.

### Implementation Strategy
The core implements the RV32I Base Integer Instruction Set:
* Arithmetic and Logic Instructions: `add`, `sub`, `slt`, etc.
* Memory Access Instructions: `lb`, `lw`, `sb`, etc.
* Branch and Jump Instructions: `beq`, `jal`, etc.

### The issues I encountered and how I overcame them
1. During the initial testing of branch instructions, the CPU was jumping to incorrect, seemingly random addresses.
    * Unlike I-Type instructions, B-Type and J-Type instructions in RISC-V do not store their immediate bits sequentially. The bit positions are chosen so that the sign bit is always instruction bit 31 and most immediate bits keep the same position across formats, which simplifies the hardware decoder.
    * **Solution**: I implemented a precise bit-concatenation strategy using Chisel's `Cat` function to reconstruct the immediate values according to the ISA manual.
```scala
// Resolving B-Type Scrambling: Reordering bits [31, 7, 30:25, 11:8]
val immB = Cat(
  Fill(Parameters.DataBits - 13, instruction(31)), // Sign extension (bits 31:13)
  instruction(31),     // Bit 12
  instruction(7),      // Bit 11
  instruction(30, 25), // Bits 10:5
  instruction(11, 8),  // Bits 4:1
  0.U(1.W)             // LSB always 0
)
```
2. The `BLT` instruction failed when comparing a negative number (e.g., -1) with a positive number (e.g., 5). The branch was not taken as expected.
    * In Chisel, comparisons on `UInt` values are unsigned. Therefore, -1 (`0xFFFFFFFF`) was interpreted as a very large positive number, making it greater than 5.
    * **Solution**: I modified the comparator logic to explicitly cast operands to `SInt` (Signed Integer) for `BLT` and `BGE` instructions, while keeping `BLTU` and `BGEU` as `UInt`.
    ```scala
    // Distinguishing Signed vs Unsigned Comparisons
    InstructionsTypeB.blt  -> (io.reg1_data.asSInt < io.reg2_data.asSInt), // Signed Cast
    InstructionsTypeB.bltu -> (io.reg1_data < io.reg2_data)                // Unsigned Keep
    ```
3. The `JALR` instruction occasionally caused misalignment exceptions or jumped to odd addresses.
    * According to the RISC-V specification, the target address of `JALR` is obtained by adding the sign-extended 12-bit immediate to register `rs1` and then setting the least significant bit of the result to 0. My initial addition did not mask the LSB.
    * **Solution**: I used bit slicing and concatenation to force the LSB to zero, so the target is always at least 2-byte aligned.
    ```scala
    // JALR Target Calculation: Force LSB to 0
    val jalrSum = io.reg1_data + io.immediate
    val jalrTarget = Cat(jalrSum(Parameters.DataBits - 1, 1), 0.U(1.W))
    ```

### Test Verification
The testing environment is invoked using the standard sbt command line interface.
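Before running that suite, the three fixes described above can also be cross-checked against a small software model. The following is a minimal sketch in plain Scala (no Chisel dependency); the object `Rv32Model` and its method names are illustrative and do not exist in the project, and only the bit layouts are taken from the RV32I encoding.

```scala
// Plain-Scala reference model for the three fixes. All names are illustrative.
object Rv32Model {
  // B-type immediate: imm[12] = inst[31], imm[11] = inst[7],
  // imm[10:5] = inst[30:25], imm[4:1] = inst[11:8], imm[0] = 0.
  def immB(inst: Int): Int = {
    val bit12    = (inst >>> 31) & 0x1
    val bit11    = (inst >>> 7) & 0x1
    val bits10_5 = (inst >>> 25) & 0x3f
    val bits4_1  = (inst >>> 8) & 0xf
    val raw = (bit12 << 12) | (bit11 << 11) | (bits10_5 << 5) | (bits4_1 << 1)
    (raw << 19) >> 19 // sign-extend the 13-bit value to 32 bits
  }

  // BLT: operands compared as signed 32-bit values (the effect of asSInt).
  def blt(a: Int, b: Int): Boolean = a < b

  // BLTU: the same bit patterns compared as unsigned values.
  def bltu(a: Int, b: Int): Boolean = Integer.compareUnsigned(a, b) < 0

  // JALR target: (rs1 + imm) with bit 0 cleared.
  def jalrTarget(rs1: Int, imm: Int): Int = (rs1 + imm) & ~1

  def main(args: Array[String]): Unit = {
    println(immB(0xfe000ee3))                  // `beq x0, x0, -4` -> prints -4
    println(blt(-1, 5))                        // prints true
    println(bltu(-1, 5))                       // prints false
    println(jalrTarget(0x1001, 2).toHexString) // prints 1002
  }
}
```

A negative branch offset such as the `-4` above is exactly the case that exposes a missing sign bit in the concatenation. The hardware tests themselves are run from the sbt shell: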
```bash
$ sbt
[info] started sbt server
[sbt:mycpu-root> project singleCycle
[sbt:mycpu-singleCycle> test
```
#### Base Test
```bash
[info] InstructionDecoderTest:
[info] InstructionDecoder
[info] - should decode RV32I instructions and generate correct control signals
[info] ByteAccessTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly handle byte-level store/load operations (SB/LB)
[info] InstructionFetchTest:
[info] InstructionFetch
[info] - should correctly update PC and handle jumps
[info] ExecuteTest:
[info] Execute
[info] - should execute ALU operations and branch logic correctly
[info] FibonacciTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute recursive Fibonacci(10) program
[info] RegisterFileTest:
[info] RegisterFile
[info] - should correctly read previously written register values
[info] - should keep x0 hardwired to zero (RISC-V compliance)
[info] - should support write-through (read during write cycle)
[info] QuicksortTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute Quicksort algorithm on 10 numbers
[info] Run completed in 22 seconds, 456 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 25 s
```
#### Base Test + Compliance Test
```bash
[info] InstructionDecoderTest:
...
[info] ComplianceTest: [info] MyCPU Compliance ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_1sc/rv32i_m/I/src/add-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/add-01.S ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_1sc/rv32i_m/I/src/addi-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/addi-01.S ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_1sc/rv32i_m/I/src/and-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/and-01.S riscv-arch-test/riscv-test-suite/rv32i_m/I/src/xori-01.S ... ... ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_1sc/rv32i_m/hints/src/fence-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/hints/src/fence-01.S ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_1sc/rv32i_m/hints/src/srl-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/hints/src/srl-01.S ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_1sc/rv32i_m/privilege/src/misalign1-jalr-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/privilege/src/misalign1-jalr-01.S ... 
[info] QuicksortTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute Quicksort algorithm on 10 numbers
[info] Run completed in 2 minutes, 25 seconds.
[info] Total number of tests run: 50
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 50, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 146 s
```

## 2-mmio-trap
### Architectural Overview
This project extends the microarchitecture to support the RISC-V Privileged Architecture. The data path is significantly expanded to handle asynchronous events and external device communication through three key architectural additions:

* **Control and Status Registers (CSR)**: Implements a separate 12-bit CSR address space (up to 4096 registers), independent of the general-purpose register file. This module manages the processor's privilege state (e.g., `mstatus`, `mie`) and supports the atomic Read-Modify-Write operations required by the ISA.
* **Core-Local Interrupt Controller (CLINT)**: Coordinates trap handling. It manages the transition between normal execution and exception handlers by saving context (`mepc`, `mcause`) and redirecting control flow (`mtvec`) at instruction boundaries.
* **Memory-Mapped I/O (MMIO) Interface**: Modifies the memory stage to include an address decoder. This logic differentiates between standard memory access and peripheral device communication (Timer, UART, VGA) based on the high-order address bits.

### Implementation Strategy
The design incorporates the Zicsr Extension and the system-level instructions necessary for embedded operating system support.

* **CSR Instructions (Zicsr)**: `csrrw`, `csrrs`, `csrrc` and their immediate variants. These ensure atomic updates to system status registers.
* **System & Trap Instructions**:
    * Trap Entry: `ecall` (Environment Call), `ebreak` (Breakpoint).
    * Trap Return: `mret` (Return from Machine-mode trap), restoring the PC and interrupt enable state.
* **Peripheral Integration**:
    * Timer: Configurable 32-bit counter with interrupt generation logic.
    * UART/VGA: Support for serial communication and graphical display output via memory-mapped addresses.

### The issues I encountered and how I overcame them
1. A race can occur between software and hardware updates to the CSR file: a CSR instruction may write a register in the same cycle in which a hardware interrupt arrives, and the interrupt requires the CLINT to save `mcause` and update `mstatus`.
    * If both the CPU pipeline and the CLINT module assert write enables to the CSR file simultaneously, the hardware trap context could be corrupted by the software write, leading to a system crash or incorrect exception handling.
    * **Solution**: I implemented hardware priority logic within the CSR module. The CLINT's `direct_write_enable` signal is given strict precedence over the CPU's `reg_write_enable_id`.
    ```scala
    // Priority Arbitration: CLINT (Hardware) > CPU (Software)
    when(io.clint_access_bundle.direct_write_enable) {
      // Atomic hardware update for Trap Entry
      mstatus := io.clint_access_bundle.mstatus_write_data
      mepc    := io.clint_access_bundle.mepc_write_data
      mcause  := io.clint_access_bundle.mcause_write_data
    }.elsewhen(io.reg_write_enable_id) {
      // Software CSR instruction
      when(io.reg_write_address_id === CSRRegister.MSTATUS) {
        mstatus := io.reg_write_data_ex
      }
      // ... other CSRs
    }
    ```
2. When an interrupt occurred simultaneously with a Branch or Jump instruction, the PC sometimes updated to the Branch Target instead of the Trap Handler (`mtvec`).
    * **Solution**: I used a nested `Mux` structure so that the interrupt vector takes precedence over the jump/branch target, which in turn takes precedence over sequential execution.
    ```scala
    // PC Update Priority: Interrupt > Jump/Branch > Sequential
    pc := Mux(
      io.interrupt_assert,
      io.interrupt_handler_address, // 1. Highest Priority: Interrupt Vector
      Mux(
        io.jump_flag_id,
        io.jump_address_id,         // 2. Control Flow
        pc + 4.U                    // 3. Sequential
      )
    )
    ```
3. The Store Byte instruction was writing to the correct address but shifting the data incorrectly, overwriting adjacent bytes in the 32-bit word.
    * Memory is word-aligned in hardware. Writing a byte to address `0x...1` means the data must be shifted to bits `[15:8]`, and only write strobe 1 (the second byte lane) may be active.
    * **Solution**: I calculated the shift amount dynamically based on the lower 2 bits of the address (`mem_address_index`).
    ```scala
    // Dynamic Shifting for Store Byte
    val byteShiftAmount = Cat(mem_address_index, 0.U(3.W)) // equivalent to index * 8
    io.memory_bundle.write_strobe(mem_address_index) := true.B
    io.memory_bundle.write_data := io.reg2_data(Parameters.ByteBits - 1, 0) << (byteShiftAmount)
    ```
4. The `JALR` instruction occasionally caused address misalignment exceptions.
    * The RISC-V specification mandates that the target address of `JALR` must have its Least Significant Bit set to 0. A simple addition `rs1 + imm` is insufficient.
    * **Solution**: I implemented bit concatenation to forcibly clear bit 0.
    ```scala
    // Force LSB to 0
    val jalrSum = io.reg1_data + io.immediate // rs1 + sign-extended immediate
    val jalrTarget = Cat(jalrSum(Parameters.DataBits - 1, 1), 0.U(1.W))
    ```

### Test Verification
The testing environment is invoked using the standard sbt command line interface.
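Two of the fixes above, byte-lane steering for `SB` and the PC selection priority, can also be written as a small software model for cross-checking. This is a sketch in plain Scala under stated assumptions: `MmioTrapModel`, `storeByte`, and `nextPc` are illustrative names that do not appear in the project, and the strobe is modelled as a 4-bit mask rather than as the `Vec` of Booleans used in the Chisel code.

```scala
// Plain-Scala model of SB lane steering and PC priority. Names are illustrative.
object MmioTrapModel {
  // SB: returns (data placed on the 32-bit write bus, 4-bit strobe mask).
  def storeByte(addr: Int, rs2: Int): (Int, Int) = {
    val index  = addr & 0x3            // lower 2 address bits select the byte lane
    val shift  = index << 3            // index * 8
    val data   = (rs2 & 0xff) << shift // only the low byte of rs2 is stored
    val strobe = 1 << index            // exactly one lane is enabled
    (data, strobe)
  }

  // PC update priority: interrupt > jump/branch > sequential.
  def nextPc(pc: Int, interrupt: Boolean, handler: Int,
             jump: Boolean, target: Int): Int =
    if (interrupt) handler
    else if (jump) target
    else pc + 4

  def main(args: Array[String]): Unit = {
    val (data, strobe) = storeByte(0x1001, 0x12345678)
    println(data.toHexString)                               // prints 7800
    println(strobe)                                         // prints 2
    println(nextPc(0x100, true, 0x8000, true, 0x200).toHexString) // prints 8000
  }
}
```

The last call covers the case from issue 2: an interrupt and a taken jump in the same cycle must select the handler address. The hardware tests themselves are run from the sbt shell: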
```bash
$ sbt
[info] started sbt server
[sbt:mycpu-root> project mmioTrap
[sbt:mycpu-mmio-trap> test
```
#### Base Test
```bash
[info] ByteAccessTest:
[info] [CPU] Byte access program
[info] - should store and load single byte
[info] CLINTCSRTest:
[info] [CLINT] Machine-mode interrupt flow
[info] - should handle external interrupt
[info] - should handle environmental instructions
[info] UartMMIOTest:
[info] [UART] Comprehensive TX+RX test
[info] - should pass all TX and RX tests
[info] ExecuteTest:
[info] [Execute] CSR write-back
[info] - should produce correct data for csr write
[info] FibonacciTest:
[info] [CPU] Fibonacci program
[info] - should calculate recursively fibonacci(10)
[info] TimerTest:
[info] [Timer] MMIO registers
[info] - should read and write the limit
[info] InterruptTrapTest:
[info] [CPU] Interrupt trap flow
[info] - should jump to trap handler and then return
[info] QuicksortTest:
[info] [CPU] Quicksort program
[info] - should quicksort 10 numbers
[info] Run completed in 23 seconds, 273 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 24 s
```
#### Base Test + Compliance Test
```bash
[info] ByteAccessTest:
[info] [CPU] Byte access program
...
[info] ComplianceTest: [info] MyCPU Compliance ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_2mmio/src/add-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/add-01.S ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_2mmio/src/addi-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/addi-01.S ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_2mmio/src/and-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/and-01.S ... ... ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_2mmio/src/sub-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/sub-01.S ✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_2mmio/src/sw-align-01.S/dut/DUT-mycpu.signature [info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/sw-align-01.S ... ... [info] QuicksortTest: [info] [CPU] Quicksort program [info] - should quicksort 10 numbers [info] Run completed in 6 minutes, 22 seconds. [info] Total number of tests run: 128 [info] Suites: completed 9, aborted 0 [info] Tests: succeeded 128, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 382 s ``` ### Nyancat VGA Display Demo Command line ```bash cd ~/chisel-tutorial/ca2025-mycpu/2-mmio-trap make demo ``` Demo ```bash cd .. 
&& PATH=$HOME/.local/bin:$PATH sbt "project mmioTrap" "runMain board.verilator.VerilogGenerator" [info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29) [info] loading project definition from /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/project [info] loading settings for project root from build.sbt... [info] set current project to mycpu-root (in build file:/Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/) [info] set current project to mycpu-mmio-trap (in build file:/Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/) [info] running board.verilator.VerilogGenerator [success] Total time: 3 s, completed 2025年12月10日下午11:31:30 cd verilog/verilator && verilator --trace --exe --cc sim.cpp Top.v ../../src/main/resources/vsrc/TrueDualPortRAM32.v \ -Wno-WIDTHEXPAND -Wno-WIDTH \ -CFLAGS "-DENABLE_SDL2 $(sdl2-config --cflags)" -LDFLAGS "$(sdl2-config --libs)" && \ make -C obj_dir -f VTop.mk - V e r i l a t i o n R e p o r t: Verilator 5.042 2025-11-02 rev UNKNOWN.REV - Verilator: Built from 0.000 MB sources in 0 modules, into 0.000 MB in 0 C++ files needing 0.000 MB - Verilator: Walltime 0.002 s (elab=0.000, cvt=0.000, bld=0.000); cpu 0.001 s on 1 threads make[1]: Nothing to be done for `default'. 🐱 Starting VGA demo with nyancat animation... Display: 640×480@72Hz with SDL2 visualization Program: nyancat.asmbin (12-frame nyancat animation) Note: Frame upload + animation takes significant time Duration: 500M cycles (~5 minutes, includes full animation) cd verilog/verilator/obj_dir && ./VTop -vga -instruction ../../../src/main/resources/nyancat.asmbin -time 500000000 [SDL2] Window opened: 640x480 'VGA Display - MyCPU' [SDL2] Press ESC or close window to stop simulation early Simulation progress: 1% Simulation progress: 2% Simulation progress: 3% Simulation progress: 4% Simulation progress: 5% ... 
Simulation progress: 94%
Simulation progress: 95%
Simulation progress: 96%
Simulation progress: 97%
Simulation progress: 98%
Simulation progress: 99%
Simulation progress: 100%
✅ Demo complete! You should have seen animated nyancat.
```
![Screenshot 2025-12-10 11.31.45 PM](https://hackmd.io/_uploads/SkbEXzwzWg.png =500x)

#### Proposed Compression Strategies
I inspected the first 100 lines of the hex dump of `nyancat.asmbin` to reverse-engineer the rendering logic.
```bash
$ hexdump -C src/main/resources/nyancat.asmbin | head -n 100
```
The dump contains many words whose opcode field is `0x23`, the RISC-V STORE opcode (S-Type: `sb`, `sh`, `sw`).
```bash
00000000 37 01 40 00 ef 00 80 7a 6f 00 00 00 13 01 01 fd |7.@....zo.......|
00000010 23 26 11 02 23 24 81 02 13 04 01 03 23 2e a4 fc |#&..#$......#...|
00000020 23 2c b4 fc 23 2a c4 fc 23 26 04 fe 6f 00 00 03 |#,..#*..#&..o...|
00000030 83 27 c4 fe 03 27 84 fd 33 07 f7 00 83 27 c4 fe |.'...'..3....'..|
00000040 83 26 c4 fd b3 87 f6 00 03 47 07 00 23 80 e7 00 |.&.......G..#...|
00000050 83 27 c4 fe 93 87 17 00 23 26 f4 fe 03 27 c4 fe |.'......#&...'..|
00000060 83 27 44 fd e3 46 f7 fc 13 00 00 00 13 00 00 00 |.'D..F..........|
00000070 83 20 c1 02 03 24 81 02 13 01 01 03 67 80 00 00 |. ...$......g...|
00000080 13 01 01 fe 23 2e 11 00 23 2c 81 00 13 04 01 02 |....#...#,......|
00000090 23 26 a4 fe 23 24 b4 fe 83 27 c4 fe 03 27 84 fe |#&..#$...'...'..|
...
```
This pattern suggests that the graphics are rendered in software: the CPU executes store instructions to upload frame data instead of relying on dedicated hardware. Note that many stores in this excerpt are stack-relative (for example, `23 26 11 02` decodes to `sw ra, 44(sp)`), which is typical of function prologues in unoptimized compiled code, so the share of stores that actually write pixels should be confirmed by disassembling the whole binary. If the image data is indeed encoded as instructions, that would explain the large binary size relative to the simple animation.

* **Proposed Solution**: To compress this program effectively, we must move away from Instruction-based Rendering.
My proposal to implement a Hardware Sprite Controller is motivated by this finding, as it would replace a large number of `SW` instructions with a few metadata writes, reducing both instruction count and binary size.
* Implementation Plan:
    * Sprite Management Unit: Integrate a dedicated logic unit within the VGA controller.
    * Store the static sprite bitmaps (cat, background) in a dedicated Read-Only Memory block, separate from the instruction memory.
    * The CPU only needs to write lightweight metadata (Asset ID, X-Coordinate, Y-Coordinate) to specific MMIO registers to update the frame, instead of manually rewriting the frame buffer.

## 3-pipeline
This project involves transforming the core into a Pipelined Architecture. The primary objective is to improve instruction throughput by overlapping the execution of multiple instructions. The design shares the same front-end (Fetch/Decode) as the previous implementations but introduces complex control logic to handle the hazards inherent in parallel execution.

### Architectural Overview
To ensure architectural correctness while progressively optimizing performance, I implemented the pipeline in four distinct iterations, selectable via the implementation parameter in `Top.scala`:

* **ThreeStage (IF $\to$ ID $\to$ EX/MEM/WB)**: A minimal pipeline that folds the last three stages into one. It introduces the basics of control-flow redirection without the complexity of data hazards.
* **FiveStageStall (IF $\to$ ID $\to$ EX $\to$ MEM $\to$ WB)**: The classic 5-stage design. It handles Data Hazards (Read-After-Write) conservatively by using interlocks. The control unit inserts "bubbles" (NOPs) to wait for data dependencies.
* **FiveStageForward**: An optimized version that implements Data Forwarding. It routes results from the MEM or WB stages directly back to the EX stage, significantly reducing the number of stall cycles required.
* **FiveStageFinal**: The complete integration, which combines forwarding with robust Control Hazard handling (Flushing) and the Core-Local Interrupt Controller to support exceptions and interrupts.

### Hazard Management
* **Structural Hazards**: Structural hazards occur when multiple stages compete for the same resource.
* **Data Hazards**: Read-After-Write (RAW), Write-After-Write (WAW), and Write-After-Read (WAR). In a classic in-order five-stage pipeline, where registers are written only in WB, only RAW hazards arise in practice.
* **Control Hazards**: Branch and Jump instructions determine the next PC in the EX stage, but by then, the IF and ID stages have already fetched incorrect instructions.
* **Integration with Hazard Unit**
    * Extended the pipeline registers (`ID2EX`, `EX2MEM`) to carry control signals (`RegWrite`, `MemRead`) so the hazard unit can observe the state of instructions in flight.
    * Ran `make sim` and inspected the waveforms to confirm that:
        * Load-Use Hazard: Correctly triggers a 1-cycle stall even in the forwarding design, since the memory data is not yet available.
        * Branch Misprediction: Correctly flushes the two following instructions.
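The two checks listed above can be written down as a small decision function. The following plain-Scala sketch follows the textbook scheme for a five-stage pipeline with branches resolved in EX. The names `HazardModel`, `IdStage`, `ExStage`, `Control`, and all field names are illustrative assumptions; they are not taken from the project's hazard unit, which may differ in detail (for example, in which pipeline registers it flushes).

```scala
// Plain-Scala sketch of load-use stall and branch flush decisions.
// All names are illustrative; this is not the project's hazard unit.
object HazardModel {
  final case class IdStage(rs1: Int, rs2: Int)        // sources of the instruction in ID
  final case class ExStage(rd: Int, memRead: Boolean) // instruction currently in EX

  final case class Control(
    stallPc: Boolean,    // hold the PC
    stallIf2Id: Boolean, // hold the IF/ID register
    flushIf2Id: Boolean, // squash the instruction leaving IF
    flushId2Ex: Boolean  // insert a bubble into EX
  )

  // Load-use hazard: the instruction in EX is a load whose destination is a
  // source of the instruction in ID. Forwarding cannot help, because the
  // loaded value only exists after MEM. rs2 is compared for every
  // instruction, which is conservative for formats without rs2.
  def loadUse(id: IdStage, ex: ExStage): Boolean =
    ex.memRead && ex.rd != 0 && (ex.rd == id.rs1 || ex.rd == id.rs2)

  def control(id: IdStage, ex: ExStage, jumpTakenInEx: Boolean): Control =
    if (jumpTakenInEx)
      // Taken branch/jump in EX: the two younger instructions are squashed.
      Control(stallPc = false, stallIf2Id = false, flushIf2Id = true, flushId2Ex = true)
    else if (loadUse(id, ex))
      // One-cycle stall: freeze IF and ID, send a bubble to EX.
      Control(stallPc = true, stallIf2Id = true, flushIf2Id = false, flushId2Ex = true)
    else
      Control(stallPc = false, stallIf2Id = false, flushIf2Id = false, flushId2Ex = false)

  def main(args: Array[String]): Unit = {
    // `lw x5, 0(x1)` in EX followed by `add x6, x5, x7` in ID
    println(control(IdStage(5, 7), ExStage(5, memRead = true), jumpTakenInEx = false))
  }
}
```

The redirect case is checked first because an instruction sitting behind a taken branch is discarded anyway, so stalling for it would only waste a cycle.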
```mermaid
graph LR
    %% --- define ---
    classDef stage fill:#f0f4c3,stroke:#827717,stroke-width:2px,font-weight:bold,rx:5,ry:5;
    classDef component fill:#e3f2fd,stroke:#1565c0,stroke-width:1px,rx:3,ry:3;
    classDef memory fill:#fff3e0,stroke:#e65100,stroke-width:1px,rx:3,ry:3;
    classDef pipeReg fill:#cfd8dc,stroke:#455a64,stroke-width:2px,stroke-dasharray: 5 5;

    %% --- Instruction Fetch (IF) ---
    subgraph IF [Stage 1: Instruction Fetch]
        direction TB
        PC(Program Counter)
        IMEM(Instruction Memory)
    end

    %% --- Instruction Decode (ID) ---
    subgraph ID [Stage 2: Decode]
        direction TB
        RegFile(Register File)
        CtrlUnit(Control Unit)
    end

    %% --- Execute (EX) ---
    subgraph EX [Stage 3: Execute]
        direction TB
        ALU(ALU)
    end

    %% --- Memory Access (MEM) ---
    subgraph MEM [Stage 4: Memory]
        direction TB
        DMEM(Data Memory)
    end

    %% --- Write Back (WB) ---
    subgraph WB [Stage 5: Write Back]
        direction TB
        WBMux(Write Back Mux)
    end

    %%
    PC --> IMEM
    IMEM --> RegFile
    RegFile --> ALU
    ALU --> DMEM
    DMEM --> WBMux
    WBMux --> RegFile

    class IF,ID,EX,MEM,WB stage;
    class PC,RegFile,CtrlUnit,ALU,WBMux component;
    class IMEM,DMEM memory;
```

### The issues I encountered and how I overcame them
1. The Forwarding Unit prioritizes the most recent correct data available in the pipeline registers (`EX/MEM` or `MEM/WB`) and routes it directly to the EX stage's ALU inputs, bypassing the Register File access.
    * Forwarding Priority Logic: The forwarding unit resolves data hazards by selecting the most recent value available in the pipeline.
      The priority logic is summarized below:

      | Forwarding Path | Priority | Condition | `reg_forward_ex` Setting |
      | :--- | :--- | :--- | :--- |
      | **MEM $\to$ EX** | **High** | `RegWriteEnable_MEM` AND `rd_MEM == rs_EX` AND `rd_MEM != x0` | `ForwardFromMEM` |
      | **WB $\to$ EX** | **Low** | `RegWriteEnable_WB` AND `rd_WB == rs_EX` AND `rd_WB != x0` | `ForwardFromWB` |

    * **Solution**
    ```scala
    when(io.reg_write_enable_mem && (io.rd_mem === io.rs1_ex) && io.rd_mem =/= 0.U) {
      io.reg1_forward_ex := ForwardingType.ForwardFromMEM
    }.elsewhen(io.reg_write_enable_wb && (io.rd_wb === io.rs1_ex) && io.rd_wb =/= 0.U) {
      io.reg1_forward_ex := ForwardingType.ForwardFromWB
    }.otherwise {
      io.reg1_forward_ex := ForwardingType.NoForward
    }
    ```

### Test Verification
The testing environment is invoked using the standard sbt command line interface.
```bash
$ sbt
[info] started sbt server
[sbt:mycpu-root> project pipeline
[sbt:mycpu-pipeline> test
```
#### Base Test
```bash
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Stalling
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Forwarding
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Reduced Branch Delay
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] PipelineUartTest:
[info] Three-stage Pipelined CPU UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Stalling UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Forwarding UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] PipelineRegisterTest:
[info] Pipeline Register
[info] - should be able to stall and flush
[info] Run completed in 1 minute, 13 seconds.
[info] Total number of tests run: 29
[info] Suites: completed 3, aborted 0
[info] Tests: succeeded 29, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 76 s
```

#### Base Test + Compliance Test

```bash
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
...
[info] ComplianceTest:
[info] MyCPU Compliance
✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_3pl/src/add-01.S/dut/DUT-mycpu.signature
[info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/add-01.S
✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_3pl/src/addi-01.S/dut/DUT-mycpu.signature
[info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/addi-01.S
✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_3pl/src/and-01.S/dut/DUT-mycpu.signature
[info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/and-01.S
...
... t/riscv-test-suite/rv32i_m/I/src/sra-01.S
✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_3pl/src/srai-01.S/dut/DUT-mycpu.signature
[info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/srai-01.S
✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_3pl/src/srl-01.S/dut/DUT-mycpu.signature
[info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/srl-01.S
✅ Test completed - signature: /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscof_work_3pl/src/srli-01.S/dut/DUT-mycpu.signature
[info] - should pass test /Users/hsuhsiaofan/chisel-tutorial/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/I/src/srli-01.S
...
...
[info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] PipelineRegisterTest:
[info] Pipeline Register
[info] - should be able to stall and flush
[info] Run completed in 8 minutes, 40 seconds.
[info] Total number of tests run: 147
[info] Suites: completed 4, aborted 0
[info] Tests: succeeded 148, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 520 s
```

### [CA25: Exercise 21] Hazard Detection Summary and Analysis

#### Q1: Why do we need to stall for load-use hazards?
>The loaded value does not exist until the load finishes its Memory (MEM) stage, but the dependent instruction directly behind it (currently in the Instruction Decode (ID) stage) would enter EX one cycle too early to receive that value, even with forwarding. Stalling the dependent instruction for one cycle lets the load complete its memory access so that the data can then be forwarded.

#### Q2: What is the difference between "stall" and "flush" operations?
>* Stall resolves Data Hazards by freezing the PC and the IF/ID register and inserting a NOP (bubble) into the ID/EX register.
>* Flush resolves Control Hazards by discarding an instruction that was fetched down the wrong path.

#### Q3: Why does jump instruction with register dependency need stall?
>JALR requires the target address (from `rs1`) to be calculated in the ID stage. If `rs1` is being written by a prior instruction that has not yet produced its result (Data Hazard), a stall is needed until the correct value can be forwarded to the ID stage to calculate the next PC.

#### Q4: In this design, why is branch penalty only 1 cycle instead of 2?
>The branch penalty is only 1 cycle because the design implements early branch resolution in the ID stage. By resolving in ID, only the single incorrect instruction in the IF/ID register needs to be flushed (discarded), resulting in a 1-cycle penalty.

#### Q5: What would happen if we removed the hazard detection logic entirely?
>The processor would fail to execute code correctly.
Data Hazards would lead to incorrect computation results because instructions would read stale register values, and Control Hazards would lead to instructions from the wrong code path being executed.

#### Q6: Complete the stall condition summary:
>Stall is needed when:
>1. (EX stage condition) An instruction in ID reads the destination of a Load or JALR instruction currently in the EX stage.
>2. (MEM stage condition) A JALR instruction in ID reads the destination of a Load instruction currently in the MEM stage.
>
>Flush is needed when:
>1. (Branch/Jump condition) A branch in the ID stage evaluates to taken (or a jump is decoded), i.e., `io.jump_flag` is true.

## Modify the handwritten RISC-V assembly code in Homework2

To validate the pipelined CPU's performance in a realistic scenario, I ported the Tower of Hanoi assembly program (from Homework 2) to the RISC-V 5-stage pipeline environment. The primary objective was to eliminate pipeline stalls caused by Load-Use Hazards through software-level optimization.

### Code Optimization

The critical optimization was applied in the `display_move` section, where character data is loaded from memory and then decrypted.

```diff
display_move:
    add  x5, s4, s2      # Calculate FROM address
-   lbu  s8, 0(x5)       # Load FROM char
-   xor  s8, s8, x6      # [STALL] Dependency on s8
-   add  x7, s4, s3      # Calculate TO address
-   lbu  s9, 0(x7)       # Load TO char
+   # --- Optimized Sequence ---
+   add  x7, s4, s3      # Calculate TO address (Moved up)
+   lbu  s8, 0(x5)       # Load FROM char
+   lbu  s9, 0(x7)       # [FILL SLOT] Load TO char immediately
+                        # This instruction executes while s8 is being loaded.
+   xor  s8, s8, x6      # [NO STALL] s8 is now ready (Forwarding WB->EX)
```

After reordering, the second `lbu` occupies the slot that would otherwise be a bubble behind the first `lbu`, so the load-use stall cycle is eliminated.
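The effect of the reordering can be checked with a small software model. The following is a minimal sketch in plain Scala (no Chisel), assuming a classic five-stage pipeline with full forwarding in which only a load immediately followed by a consumer of its result costs a stall. The `Instr` case class, the `LoadUseDemo` object, and the `loadUseStalls` helper are hypothetical names introduced only for this illustration and are not part of the project sources.

```scala
// Simplified, hypothetical instruction model for illustration only;
// it is not part of the project's Chisel sources.
final case class Instr(
    text: String,           // assembly text, for display
    rd: Option[String],     // destination register, if any
    srcs: Set[String],      // source registers read by the instruction
    isLoad: Boolean = false // true for load instructions such as lbu
)

object LoadUseDemo {
  // Counts load-use pairs: a load immediately followed by an instruction
  // that reads the load's destination register. Under the assumption of
  // full forwarding, each such pair costs one stall cycle.
  def loadUseStalls(prog: Seq[Instr]): Int =
    prog.sliding(2).count {
      case Seq(prev, next) => prev.isLoad && prev.rd.exists(next.srcs.contains)
      case _               => false // programs shorter than two instructions
    }

  // Instruction order before the optimization (from the diff above).
  val original: Seq[Instr] = Seq(
    Instr("add x5, s4, s2", Some("x5"), Set("s4", "s2")),
    Instr("lbu s8, 0(x5)", Some("s8"), Set("x5"), isLoad = true),
    Instr("xor s8, s8, x6", Some("s8"), Set("s8", "x6")),
    Instr("add x7, s4, s3", Some("x7"), Set("s4", "s3")),
    Instr("lbu s9, 0(x7)", Some("s9"), Set("x7"), isLoad = true)
  )

  // Instruction order after the optimization.
  val optimized: Seq[Instr] = Seq(
    Instr("add x5, s4, s2", Some("x5"), Set("s4", "s2")),
    Instr("add x7, s4, s3", Some("x7"), Set("s4", "s3")),
    Instr("lbu s8, 0(x5)", Some("s8"), Set("x5"), isLoad = true),
    Instr("lbu s9, 0(x7)", Some("s9"), Set("x7"), isLoad = true),
    Instr("xor s8, s8, x6", Some("s8"), Set("s8", "x6"))
  )

  def main(args: Array[String]): Unit = {
    println(s"original:  ${loadUseStalls(original)} load-use stall(s)")
    println(s"optimized: ${loadUseStalls(optimized)} load-use stall(s)")
  }
}
```

Under these assumptions the model reports one load-use stall for the original order and none for the reordered sequence. It deliberately ignores the `x0` special case and any stalls specific to the other pipeline variants, so it is only a sanity check of the scheduling idea, not a replacement for the waveform-level verification.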
### Compilation ELF to Binary

I modified the existing Makefile in the `csrc/` directory to support the mixed compilation of C and Assembly sources:

* The C entry wrapper was compiled using gcc with the Zicsr extension enabled:
  ```bash
  riscv64-unknown-elf-gcc -O0 -Wall -march=rv32i_zicsr -mabi=ilp32 -c -o main.o main.c
  ```
* Create `main.c`:
  ```c
  // main.c
  // Entry point of the Hanoi assembly program.
  extern void qz2_A_main();

  // Empty stub so that the linker can resolve print_hanoi_move.
  void print_hanoi_move(int disk, int from, int to) {}

  void _start() {
      qz2_A_main();
      while (1); // Spin forever once the program returns.
  }
  ```
* Assembly Compilation and Linking
  ```bash
  riscv64-unknown-elf-as -R -march=rv32i_zicsr -mabi=ilp32 -o hanoi.o hanoi.S
  riscv64-unknown-elf-ld -o hanoi.elf -T link.lds --oformat=elf32-littleriscv main.o hanoi.o
  ```
* `objcopy` was used to extract the text and data sections into the raw binary format:
  ```bash
  riscv64-unknown-elf-objcopy -O binary -j .text -j .data hanoi.elf hanoi.asmbin
  ```

### Test Integration

The Scala test suite was extended to run the new program.

```scala
// src/test/scala/riscv/PipelineProgramTest.scala
it should "execute optimized Hanoi Tower" in {
  runProgram("hanoi.asmbin", cfg) { c =>
    // Disable timeout to accommodate the long running time of the algorithm
    c.clock.setTimeout(0)
    // Step for sufficient cycles to ensure completion
    c.clock.step(50000)
  }
}
```

### Test Verification

Command:

```bash
WRITE_VCD=1 sbt "project pipeline" "testOnly riscv.PipelineProgramTest"
```

#### Test

```bash
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
[info] - should execute optimized Hanoi Tower
...
[info] Five-stage Pipelined CPU with Stalling
[info] - should execute optimized Hanoi Tower
...
[info] Five-stage Pipelined CPU with Forwarding
[info] - should execute optimized Hanoi Tower
...
[info] Five-stage Pipelined CPU with Reduced Branch Delay
[info] - should execute optimized Hanoi Tower
...
[info] Run completed in 1 minute, 9 seconds.
[info] Total number of tests run: 28
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 74 s
```

#### Reference

* [2025-arch-homework3](https://hackmd.io/@sysprog/2025-arch-homework3)
* [Lab3: Construct a RISC-V CPU with Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx)