Assignment3: Your Own RISC-V CPU

# Assignment3: Your Own RISC-V CPU

contributed by < [`clare8151214`](https://github.com/clare8151214) >

> [!Note] AI tools usage
> AI tools were used primarily for improving writing quality and clarity, including grammar refinement and restructuring explanations. They were also used as a reference to better understand programming concepts and review code readability, without directly producing final implementation content.

## Reflections on Learning Chisel Bootcamp

Before working on this assignment, my understanding of hardware description was mostly limited to structural concepts such as datapaths and control signals, but I had little experience in actually constructing hardware using a modern hardware description language. Completing the Chisel Bootcamp exercises fundamentally changed how I think about hardware design.

One of the most important lessons I learned is that Chisel is not a traditional HDL, but a hardware construction language. At the beginning, I often wrote Scala `if` statements assuming they would directly translate into hardware logic, which led to incorrect or incomplete circuit generation. This mistake helped me realize the crucial distinction between Scala execution during elaboration and hardware behavior after synthesis. Understanding when to use `when`/`elsewhen`/`otherwise` instead of Scala control flow significantly improved my ability to reason about the resulting circuits.

Another challenge I encountered was Chisel’s strict type system. Unlike Verilog, where implicit width extension can easily hide bugs, Chisel enforces explicit distinctions between `UInt`, `SInt`, and `Bool`. Although this caused frequent compilation errors at first, it ultimately forced me to think more carefully about data representation and signal width. Over time, these errors became a useful guide rather than an obstacle, helping me identify design mistakes early in the development process.

The “Hello World” example was also eye-opening.
Initially, I expected it to behave like a software program, but I gradually understood that its real purpose is to demonstrate the full hardware generation and simulation workflow. Seeing how Chisel code is elaborated into FIRRTL, translated into Verilog, and then simulated with Verilator helped me connect abstract code with concrete hardware behavior. This process made debugging more systematic, as waveform inspection became an essential tool for verifying correctness.

Overall, the Chisel Bootcamp exercises shifted my mindset from writing code that “runs” to designing hardware that “exists.” This change in perspective is critical for the later stages of this assignment, especially when implementing a RISC-V CPU, where correctness depends not only on functionality but also on precise control of timing, state, and data flow.

## Hello World in Chisel

### Describe the operation of 'Hello World in Chisel' and enhance it by incorporating logic circuit.

```scala
// See LICENSE.txt for license details.
package hello

import chisel3._
import chisel3.iotesters.{PeekPokeTester, Driver}

class Hello extends Module {
  val io = IO(new Bundle {
    val out = Output(UInt(8.W))
  })
  io.out := 42.U
}

class HelloTests(c: Hello) extends PeekPokeTester(c) {
  step(1)
  expect(c.io.out, 42)
}

object Hello {
  def main(args: Array[String]): Unit = {
    if (!Driver(() => new Hello())(c => new HelloTests(c))) System.exit(1)
  }
}
```

#### How the 'Hello World' Code Works

This code describes a hardware module that simply outputs a static number. It does not perform any computation yet.

**1. The Hardware Definition (class Hello)**

```scala
class Hello extends Module {
  val io = IO(new Bundle {
    val out = Output(UInt(8.W))
  })
  io.out := 42.U
}
```

- `extends Module`: Defines a hardware component. In Verilog/SystemVerilog, this is equivalent to a `module`.
- `val io = IO(...)`: Defines the interface (ports). `Bundle` groups signals together.
- `Output(UInt(8.W))`: Creates an 8-bit wide Unsigned Integer output port.
- `io.out := 42.U`: The `:=` operator connects signals. Here, it hardwires the constant value 42 (converted to a hardware literal with `.U`) to the output.

**2. The Test Logic (class HelloTests)**

```scala
class HelloTests(c: Hello) extends PeekPokeTester(c) {
  step(1)
  expect(c.io.out, 42)
}
```

- `PeekPokeTester`: A test harness that allows you to manipulate inputs ("poke") and read outputs ("peek").
- `step(1)`: Advances the simulation clock by 1 cycle.
- `expect(c.io.out, 42)`: Asserts that the value on the `out` port equals 42. If it doesn't, the test fails.

**3. The Entry Point (`object Hello`)**

- `Driver`: Compiles the Chisel code into Verilog (or an intermediate representation) and runs the test harness.

### Enhanced Version (Adding Logic Circuits)

To make this a true "Logic Circuit," we need Inputs and Combinational Logic. Below is an enhanced version called `LogicHello`. It implements a circuit that takes two inputs ($A$ and $B$) and a control signal ($Select$) to choose between bitwise **AND** and **OR** operations.

**Logic Implemented**: $Out = (Select == 1) \ ? \ (A \ | \ B) \ : \ (A \ \& \ B)$

**The Enhanced Code**

```scala
package hello

import chisel3._
import chisel3.util._
import chisel3.iotesters.{PeekPokeTester, Driver}

class LogicHello extends Module {
  val io = IO(new Bundle {
    val a      = Input(UInt(4.W))
    val b      = Input(UInt(4.W))
    val select = Input(Bool())
    val out    = Output(UInt(4.W))
  })

  // Define the Logic Nodes
  val andLogic = io.a & io.b // Bitwise AND
  val orLogic  = io.a | io.b // Bitwise OR

  // Multiplexer: If select is true, pick OR, else pick AND
  io.out := Mux(io.select, orLogic, andLogic)
}

class LogicHelloTests(c: LogicHello) extends PeekPokeTester(c) {
  // Test Case 1: Testing AND Logic (Select = 0/False)
  poke(c.io.a, 0xA)    // 1010
  poke(c.io.b, 0x3)    // 0011
  poke(c.io.select, 0) // Select AND
  step(1)
  // Expected: 1010 & 0011 = 0010 (0x2)
  expect(c.io.out, 0x2)
  println(s"AND Test: Input A=0xA, B=0x3, Select=0 -> Output=${peek(c.io.out)}")

  // Test Case 2: Testing OR Logic (Select = 1/True)
  poke(c.io.a, 0xA)    // 1010
  poke(c.io.b, 0x3)    // 0011
  poke(c.io.select, 1) // Select OR
  step(1)
  // Expected: 1010 | 0011 = 1011 (0xB)
  expect(c.io.out, 0xB)
  println(s"OR Test: Input A=0xA, B=0x3, Select=1 -> Output=${peek(c.io.out)}")
}

// Execution Entry Point
object LogicHello {
  def main(args: Array[String]): Unit = {
    if (!Driver(() => new LogicHello())(c => new LogicHelloTests(c))) System.exit(1)
  }
}
```

- The transition from the basic 'Hello World' to the 'LogicHello' module represents a fundamental shift in hardware design. Unlike the original code, which merely outputs a static constant, the enhanced version introduces dynamic inputs and combinational logic.
By implementing bitwise operations and a multiplexer (`Mux`) that selects between them, this circuit mimics the behavior of a rudimentary Arithmetic Logic Unit (ALU), serving as a crucial building block for designing complex processor architectures.

## CA25 Exercise Implementation and Verification

### 1-single-cycle

>[Commit #e10cebf](https://github.com/clare8151214/ca2025-mycpu/commit/e10cebfc5b1fa8a5f5fb41991b3093212db168c4)

After completing the TODO parts, I ran the code in sequence to locate and fix bugs. The output of the following command confirmed that I had implemented the code correctly.

![image](https://hackmd.io/_uploads/ry1-NpgrWl.png)

RISCOF compliance testing was performed via `make compliance`. This flow builds the reference model, executes the RV32I test suite, and produces a report under `tests/riscof_work/`. The results show 41/41 tests passed.

![image](https://hackmd.io/_uploads/B1q8zUbHZe.png)

---

### 2-mmio-trap

>[Commit #af1f6e0](https://github.com/clare8151214/ca2025-mycpu/commit/af1f6e08951ff30f28bbe9cd82a40939122d8da1)

In `2-mmio-trap`, I verified that **Nyancat can be rendered correctly under Verilator** by driving the VGA output through **MMIO-based framebuffer updates**. The program repeatedly writes pixel data to the VGA-related MMIO region (framebuffer or VGA controller registers, depending on the provided address map), and the simulator collects these writes to produce the final rendered frames.

To ensure the animation is visible (instead of a single static frame), the program updates the framebuffer in a loop and uses a timing mechanism (e.g., busy-wait delay or timer tick, if available) to control the frame rate. As a result, Verilator displays the expected Nyancat animation frames without visual corruption, confirming that the MMIO trap path and VGA peripheral integration behave as intended.

![image](https://hackmd.io/_uploads/rJzlAMfBWe.png)

RISCOF compliance testing was performed via `make compliance`.
This flow builds the reference model, executes the RV32I test suite, and produces a report under `tests/riscof_work/`. The results show 76/76 tests passed.

![image](https://hackmd.io/_uploads/r1rXdQfSZx.png)

---

### 3-pipeline

>[Commit #7c28042](https://github.com/sysprog21/ca2025-mycpu/commit/7c28042a2e82cb7bfdd928514040b6413e9e32af)

The failing test (“False Stall: `mem[0x38]` should be 3”) was caused by an unintended RAW dependency in Section 11 of **hazard_extended.S**. We used `addi a5, sp, 0` immediately before two loads that used `a5` as the base register. In the five-stage stall pipeline, that created a real dependency on `a5`, triggering a stall and increasing the cycle delta from 3 to 4/6. The test expected no stall there, so it failed. The fix was to avoid the dependency by using `sp` directly as the base register in those loads.

```assembly
# ===== Section 11: False Stall =====
# Test control hazard that should not cause stalls
# Use sp directly to avoid false RAW dependency on a5
csrr a2, cycle        # Read cycle CSR
lw   a6, 0(sp)        # Load from address in sp (should not stall)
lw   a7, 16(sp)       # Load from address in sp+16 (should not stall)
csrr a3, cycle        # Read cycle CSR again
sub  a3, a3, a2       # a3 = cycle difference (should be small, ≥1)
sw   a3, 0x38(zero)   # mem[0x38] = cycle difference, should be 3 cycles

# ===== End: Calculate cycle count =====
csrr a1, cycle        # End cycle count
sub  s0, a1, a0       # s0 = total cycles (result register)
```

![image](https://hackmd.io/_uploads/rykLUvzHZg.png)

![image](https://hackmd.io/_uploads/ry-sOvMB-l.png)

## HW2 Assembly on 3-Pipeline CPU

>[Commit #fd638af](https://github.com/sysprog21/ca2025-mycpu/commit/fd638afe0d374f708cda67bcc64074bbe94f7554)

### 1. Modify HW2 Assembly for 3-pipeline Execution

- I revised the original handwritten assembly to ensure it behaves correctly on the pipelined core.
- The updated program is placed under the `csrc/` directory (3-pipeline side), and follows the expected memory output conventions so it can be verified by the testbench.

### 2. Extend Scala Tests in `CPUTest.scala`

- I extended `src/test/scala/riscv/singlecycle/CPUTest.scala` by adding new test items targeting my modified assembly program.

```scala
class HanoiTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU - Integration Tests")

  it should "generate correct 3-disk Hanoi move sequence" in {
    test(new TestTopModule("hanoi.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 200) {
        c.clock.step(100)
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }

      // move_count at 0x20
      c.io.mem_debug_read_address.poke(0x20.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(7.U)

      // First move: disk=1, from='A', to='C'
      c.io.mem_debug_read_address.poke(0x40.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(1.U)
      c.io.mem_debug_read_address.poke(0x44.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(0x00004341L.U)

      // Last move: disk=1, from='A', to='C'
      c.io.mem_debug_read_address.poke(0x70.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(1.U)
      c.io.mem_debug_read_address.poke(0x74.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(0x00004341L.U)
    }
  }
}
```

- Following the structure of `FibonacciTest`, the test loads the corresponding `*.asmbin` (generated from an ELF and then converted via `objcopy`) and validates the program’s results by reading specific memory locations through the debug interface.

```
BINS = \
	fibonacci.asmbin \
	hazard.asmbin \
	hanoi.asmbin \
	quicksort.asmbin \
	sb.asmbin \
	uart.asmbin \
	irqtrap.asmbin \
	hazard_extended.asmbin \
```

#### Test

- Run `sbt "project singleCycle" "testOnly *HanoiTest"`

```
[info] welcome to sbt 1.10.7 (Temurin Java 1.8.0_472)
[info] loading project definition from /home/johnson/Desktop/ca2025-mycpu/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/johnson/Desktop/ca2025-mycpu/)
[info] set current project to mycpu-single-cycle (in build file:/home/johnson/Desktop/ca2025-mycpu/)
[info] compiling 1 Scala source to /home/johnson/Desktop/ca2025-mycpu/1-single-cycle/target/scala-2.13/test-classes ...
[info] HanoiTest:
[info] Single Cycle CPU - Integration Tests
[info] - should generate correct 3-disk Hanoi move sequence
[info] Run completed in 14 seconds, 541 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 23 s, completed Jan 14, 2026 9:52:39 PM
```

### 3. Verilator Execution and Waveform Analysis

- I executed the program on the Verilator simulation flow and inspected waveform diagrams to analyze how key signals change across different instructions.
- The analysis focuses on observing how major components (PC, instruction fetch/decode, register file, ALU, memory interface, and write-back) behave and how pipeline control signals vary when hazards or control-flow instructions occur.

### 4. Pipeline Optimization (Reducing Stalls)

- The assembly was optimized specifically for a pipelined processor: instructions were arranged to avoid unnecessary stalls while preserving full correctness.
- The goal is to maximize throughput on the 3-stage pipeline by minimizing avoidable hazard-induced bubbles.
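
The instruction-scheduling idea above can be illustrated with a small plain-Scala model. This is only a sketch under stated assumptions, not the pipeline in the repository: it assumes a simplified in-order pipeline where an instruction that reads a register written by the immediately preceding `lw` costs exactly one bubble, and it ignores every other hazard. The names `Instr`, `StallModel`, `loadUseStalls`, `naive`, and `scheduled`, as well as the three-instruction sequence itself, are hypothetical and chosen purely for illustration.

```scala
// Simplified model (assumption): only load-use hazards between adjacent
// instructions cost a bubble; all other dependencies are treated as free.
final case class Instr(op: String, rd: Option[String], srcs: Seq[String])

object StallModel {
  // Count adjacent pairs where a load's destination is read by the next instruction.
  def loadUseStalls(prog: Seq[Instr]): Int =
    prog.sliding(2).count {
      case Seq(prev, next) =>
        prev.op == "lw" && prev.rd.exists(r => r != "zero" && next.srcs.contains(r))
      case _ => false // fewer than two instructions: nothing can stall
    }

  // Hypothetical sequence: the add consumes t0 right after it is loaded.
  val naive: Seq[Instr] = Seq(
    Instr("lw",   Some("t0"), Seq("sp")),
    Instr("add",  Some("t2"), Seq("t0", "t1")),
    Instr("addi", Some("t3"), Seq("t3")) // independent of t0 and t2
  )

  // Same instructions, with the independent addi moved between load and consumer.
  val scheduled: Seq[Instr] = Seq(naive(0), naive(2), naive(1))

  def main(args: Array[String]): Unit = {
    println(s"naive stalls:     ${loadUseStalls(naive)}")
    println(s"scheduled stalls: ${loadUseStalls(scheduled)}")
  }
}
```

In this model the original order incurs one load-use bubble and the reordered sequence incurs none, while both orders compute the same register values because the moved `addi` touches neither `t0` nor `t2`. The real stall behavior depends on the specific pipeline variant and its forwarding logic, so waveform inspection remains the authoritative check.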