# Assignment3: Your Own RISC-V CPU
## Chisel Bootcamp Note
### Introduction to Scala for Hardware Design
Scala serves as the foundation for writing parameterized hardware generators in Chisel. `val` and `var`, conditionals, functions, lists, anonymous functions, named parameters, and object-oriented constructs all help structure hardware-generation code clearly. A Scala block returns its final expression, which enables concise composition of hardware-related functions, traits, and companion objects.
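A tiny, Chisel-free sketch of these constructs (all names here are illustrative, not from the bootcamp):

```scala
object ScalaBasics {
  // A block evaluates to its final expression; no explicit `return` is needed.
  def clamp(x: Int, lo: Int, hi: Int): Int = {
    val bounded = if (x < lo) lo else if (x > hi) hi else x
    bounded // returned value
  }

  // An anonymous function mapped over a list, as one might scale tap weights.
  val weights: List[Int] = List(1, 2, 3).map(w => w * 2)

  def main(args: Array[String]): Unit = {
    println(clamp(150, 0, 100)) // prints 100
    println(weights)            // prints List(2, 4, 6)
  }
}
```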
**From Scala to Hardware Modules**
A Chisel `Module` becomes actual hardware after elaboration. Defining an `io` `Bundle` and using the directional operator `:=` specifies concrete signal connections. Elaboration utilities such as `getVerilog` transform high-level Chisel descriptions into Verilog, while parameterized classes allow the creation of module **generators** capable of producing entire families of related hardware circuits.
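A minimal sketch of such a module, assuming a standard `chisel3` setup (the module name is illustrative):

```scala
import chisel3._

// A 4-bit passthrough: after elaboration this is just a wire from in to out.
class Passthrough extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(4.W))
    val out = Output(UInt(4.W))
  })
  io.out := io.in // directional connection from input to output
}
```

In the bootcamp environment, `getVerilog(new Passthrough)` would then emit the corresponding Verilog.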
---
### Hardware Testing with ChiselTest
ChiselTest enables systematic verification of designs. Inputs are driven with `poke`, outputs checked with `expect`, and clocks advanced with `step`. Tests proceed inside a `test(new Module) { ... }` block, and a passing test continues execution without interruption. `peek` is available when observation is needed without assertion.
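The pattern looks roughly like this (a sketch assuming the `chiseltest` library with ScalaTest; depending on the ChiselTest version, the imports may instead be `chisel3.tester._` as in the bootcamp; the passthrough module is a hypothetical device under test):

```scala
import chisel3._
import chiseltest._
import org.scalatest.flatspec.AnyFlatSpec

// Hypothetical device under test.
class Passthrough extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(4.W))
    val out = Output(UInt(4.W))
  })
  io.out := io.in
}

class PassthroughSpec extends AnyFlatSpec with ChiselScalatestTester {
  it should "pass inputs through" in {
    test(new Passthrough) { c =>
      c.io.in.poke(3.U)    // drive the input
      c.io.out.expect(3.U) // assert on the output
      c.clock.step()       // advance one clock cycle
    }
  }
}
```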
**Simulation vs Elaboration Printing**
Two different printing contexts are available:
- **Chisel `printf`** emits messages during simulation and displays cycle-accurate hardware values.
- **Scala `println`** executes during elaboration or in the test harness.
Understanding this separation clarifies whether a message reflects hardware operation or generator behavior.
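A sketch that mixes both kinds of printing (illustrative module):

```scala
import chisel3._

class PrintDemo extends Module {
  val io = IO(new Bundle { val in = Input(UInt(4.W)) })
  // Scala println: runs once, at elaboration time, while the generator executes.
  println("elaborating PrintDemo")
  // Chisel printf: becomes hardware that prints on every simulated cycle.
  printf("in = %d\n", io.in)
}
```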
---
### Combinational and Sequential Logic
**Combinational Logic Construction**
Combinational circuits are constructed using operators such as `+`, `-`, `*`, logical (`&&`, `||`) and bitwise (`&`, `|`) operations. Chisel expressions build hardware graphs rather than compute immediate software values. `Mux`, `Cat`, and the widening add `+&` provide expressive mechanisms for multiplexer design, bit concatenation, and carry-preserving arithmetic.
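An illustrative combinational block covering these operators (names are mine, not from the bootcamp):

```scala
import chisel3._
import chisel3.util.Cat

class CombOps extends Module {
  val io = IO(new Bundle {
    val sel = Input(Bool())
    val a   = Input(UInt(8.W))
    val b   = Input(UInt(8.W))
    val mux = Output(UInt(8.W))
    val cat = Output(UInt(16.W))
    val sum = Output(UInt(9.W)) // +& widens by one bit to keep the carry
  })
  io.mux := Mux(io.sel, io.a, io.b) // 2:1 multiplexer
  io.cat := Cat(io.a, io.b)         // a in the high byte, b in the low byte
  io.sum := io.a +& io.b            // 8-bit + 8-bit -> 9-bit result
}
```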
**Sequential Logic and Registers**
State elements are introduced using `Reg`, `RegNext`, and `RegInit`. Registers update on clock edges and preserve values between cycles. Clock and reset behavior can be customized with `withClock`, `withReset`, and `withClockAndReset`, enabling multi-clock or specialized reset structures. Sequential tests advance time explicitly using `c.clock.step(...)`.
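A small sketch of the common register styles (illustrative names):

```scala
import chisel3._

class RegDemo extends Module {
  val io = IO(new Bundle {
    val in      = Input(UInt(8.W))
    val delayed = Output(UInt(8.W))
    val count   = Output(UInt(8.W))
  })
  io.delayed := RegNext(io.in)  // one-cycle delay, no defined reset value

  val cnt = RegInit(0.U(8.W))   // register that resets to 0
  cnt := cnt + 1.U              // updates on each rising clock edge
  io.count := cnt
}
```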
---
### Control Flow and Assignment Semantics
Chisel’s control flow (`when`, `elsewhen`, `otherwise`) expresses conditional wiring instead of imperative branching. The assignment operator `:=` defines hardware connections, following the **last-connect-wins** rule when multiple assignments target the same signal. Temporary combinational signals are created with `Wire`, and state names can be generated using `Enum`. Equality checks rely on the hardware-aware `===`.
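A sketch showing conditional wiring and the last-connect-wins default pattern (illustrative module):

```scala
import chisel3._

class WhenDemo extends Module {
  val io = IO(new Bundle {
    val a   = Input(UInt(8.W))
    val b   = Input(UInt(8.W))
    val out = Output(UInt(8.W))
  })
  val result = Wire(UInt(8.W))
  result := 0.U                  // default; later connections below win
  when(io.a === io.b) {          // hardware-aware equality
    result := io.a
  }.elsewhen(io.a > io.b) {
    result := io.a - io.b
  }.otherwise {
    result := io.b - io.a
  }
  io.out := result
}
```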
---
### FIR Filters as a Complete Hardware Example
An FIR filter resembles a shift-register pipeline where new samples enter each cycle and older samples shift through taps. Weighted sums are computed continuously, demonstrating how combinational and sequential elements combine within a parameterized generator. This example highlights how Chisel naturally expresses structured DSP pipelines.
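A sketch of such a generator, with hypothetical names and non-negative integer coefficients:

```scala
import chisel3._

// Parameterized FIR: samples shift through a register chain, and a
// weighted sum over all taps is computed combinationally every cycle.
class Fir(coeffs: Seq[Int], width: Int = 8) extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(width.W))
    val out = Output(UInt()) // width inferred from the sum below
  })
  // taps.head holds the newest sample; older samples shift down the chain.
  val taps = Seq.fill(coeffs.length)(RegInit(0.U(width.W)))
  taps.head := io.in
  for ((next, prev) <- taps.tail.zip(taps)) { next := prev }
  // Weighted sum of all taps; +& preserves carries as the sum widens.
  io.out := taps.zip(coeffs).map { case (t, c) => t * c.U }.reduce(_ +& _)
}
```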

**DSP Blocks, Diplomacy, and System-Level Integration**
A deeper look at DSP blocks shows how Chisel integrates with the RocketChip ecosystem. Streaming interfaces (AXI4-Stream), control/status registers (`RegField`, `regmap`), and LazyModules form reusable accelerator blocks. Filters receive coefficients over a memory-mapped interface, process streaming data, and output results through a ready/valid handshake. Specializations such as `AXI4FIRBlock` produce ready-to-synthesize standalone modules.

---
### Decoupled Interfaces and Streaming Handshakes
The ready/valid `Decoupled` interface provides a standardized way to connect producers and consumers that operate at different speeds. Handshake correctness is easy to test using helpers like `enqueueNow`, `expectDequeueNow`, and sequence-based versions. This testing model supports modular streaming architectures.
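A sketch of a `Decoupled` producer/consumer pair joined by a small queue (module name is illustrative):

```scala
import chisel3._
import chisel3.util.{Decoupled, Queue}

class StreamBuffer extends Module {
  val io = IO(new Bundle {
    val in  = Flipped(Decoupled(UInt(8.W))) // we are the consumer here
    val out = Decoupled(UInt(8.W))          // we are the producer here
  })
  // A 2-entry queue absorbs rate differences between the two sides.
  io.out <> Queue(io.in, 2)
}
```

In a test, helpers such as `c.io.in.enqueueNow(1.U)` and `c.io.out.expectDequeueNow(1.U)` exercise the handshake after the source and sink have been initialized with the test clock.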
---
### Concurrency in Tests and Bundle Literals
Test environments can express concurrent behavior using `fork` and `join`, enabling independent input drivers and output monitors to run simultaneously. Bundle literals created with `.Lit` supply compact structured test vectors, while Scala libraries such as `BigInt` can compute reference values for validating hardware behavior.
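The reference-model idea can be sketched in plain Scala: a golden model computes the expected value in software, and a test compares `peek`ed hardware values against it. Here `mulRef` is an illustrative model for an unsigned multiply, with `BigInt` avoiding `Int` overflow:

```scala
object GoldenModel {
  // Software reference for an unsigned multiply; BigInt avoids overflow.
  def mulRef(a: Int, b: Int): BigInt = BigInt(a) * BigInt(b)

  def main(args: Array[String]): Unit = {
    println(mulRef(65535, 65535)) // prints 4294836225
  }
}
```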
---
### Parameterization and Preconditions
Hardware generators rely on parameters to express scalable designs. `require(condition)` ensures parameter validity early in elaboration, preventing illegal configurations. This forms a foundation for reusable and safe generator-based design patterns.
**Option Types and Pattern Matching**
Optional configuration is represented using `Option` (`Some` and `None`), providing a safe alternative to nulls. Pattern matching (`match`) handles value cases, type cases, and tuples, supporting expressive construction of optional IO, reset logic, and configuration-dependent behavior.
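These ideas combine naturally in generator parameters. The following plain-Scala sketch (the `CpuConfig` class is illustrative) uses `require` for elaboration-time validation and `match` on an `Option` field:

```scala
// Sketch of Option-driven configuration for a hardware generator.
case class CpuConfig(xlen: Int, debugPort: Option[Int]) {
  require(xlen == 32 || xlen == 64, "xlen must be 32 or 64") // early validity check

  // Pattern matching on the optional field decides extra behavior.
  def describe: String = debugPort match {
    case Some(width) => s"rv$xlen with $width-bit debug port"
    case None        => s"rv$xlen without debug"
  }
}

object CpuConfigDemo {
  def main(args: Array[String]): Unit = {
    println(CpuConfig(32, Some(8)).describe) // prints rv32 with 8-bit debug port
    println(CpuConfig(64, None).describe)    // prints rv64 without debug
  }
}
```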
**Implicit Arguments and Conversions**
Implicit parameters and conversions reduce boilerplate in complex Chisel and RocketChip systems. They automatically provide required context (such as implicit `Parameters`) or type conversions. Their power is balanced by careful usage, with simpler alternatives such as traits or overloading preferred when possible.
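The context-threading idea can be sketched in plain Scala, loosely analogous to RocketChip's implicit `Parameters` (this `Params` class is illustrative, not RocketChip's actual type):

```scala
case class Params(xlen: Int)

object ImplicitDemo {
  // Any caller with an implicit Params in scope need not pass it explicitly.
  def regWidth(implicit p: Params): Int = p.xlen

  def main(args: Array[String]): Unit = {
    implicit val p: Params = Params(32) // ambient context
    println(regWidth)                   // prints 32
  }
}
```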
### Problems Encountered
**Chisel Test Library Compatibility Issue and Resolution**
A problem emerged when running the Chisel Bootcamp material, matching the issue documented here: [issue](https://github.com/sysprog21/chisel-bootcamp/issues/15).
```scala
import chisel3._
import chisel3.util._
import chisel3.tester._
import chisel3.tester.RawTester.test
```
```diff
- cmd1.sc:3: object tester is not a member of package chisel3
- import chisel3.tester._
-        ^
- cmd1.sc:4: object tester is not a member of package chisel3
- import chisel3.tester.RawTester.test
-        ^
- Compilation Failed
```
The core difficulty was that the ChiselTest library bundled in the bootcamp environment was no longer compatible with newer versions of the Scala or Chisel dependencies pulled during the Ivy load stage. This mismatch prevented the tutorial notebooks from running correctly and caused repeated failures when executing even simple test cases. The issue reproduced exactly as described in the linked discussion, confirming that it stemmed from dependency version drift rather than any local configuration error.
To resolve the problem, the effective solution was to **downgrade the Chisel-related libraries** used by the bootcamp environment. This required editing the dependency list in the following file: [load-ivy.sc](https://github.com/freechipsproject/chisel-bootcamp/blob/master/source/load-ivy.sc).
By aligning the versions of Chisel, ChiselTest, and supporting libraries with the historically stable set expected by the bootcamp, the incompatibilities were removed and the notebooks executed correctly again. Once the downgrade was applied, all ChiselTest features—such as `poke`, `expect`, and clock stepping—functioned without issue, fully restoring the intended bootcamp environment behavior.
## Enhancing Hello World in Chisel
The given “Hello World” module in Chisel implements a simple LED blinker using a clock divider. Here is the original code:
```scala
class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  val CNT_MAX = (50000000 / 2 - 1).U

  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))

  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}
```
This module is built under the assumption that the FPGA clock runs at 50 MHz. The goal is to blink an LED at about 1 Hz, that is, one full on–off cycle per second. The `CNT_MAX` constant is defined as `(50000000 / 2 - 1).U`, which evaluates to 24,999,999 in decimal. Since the clock period is 1 / 50,000,000 seconds, counting from 0 up to 24,999,999 takes exactly 25,000,000 cycles, which corresponds to 0.5 seconds. Therefore, each time the counter reaches this maximum value, 0.5 seconds have elapsed.
The signal `cntReg` is a 32-bit register initialized to zero by `RegInit(0.U(32.W))`. On every rising edge of the clock, the line `cntReg := cntReg + 1.U` causes this register to increment by 1. Without any other logic, this would form a free-running 32-bit up-counter that wraps around when it overflows. However, there is a conditional block that modifies this behavior. The `when (cntReg === CNT_MAX)` block checks whether the current value of `cntReg` equals 24,999,999. When this condition is true on a given cycle, two assignments happen for the next cycle: `cntReg := 0.U` resets the counter back to zero instead of allowing it to go to 25,000,000, and `blkReg := ~blkReg` flips the value of `blkReg`. The signal `blkReg` is a one-bit register initialized to 0 (`RegInit(0.U(1.W))`), so it will alternate between 0 and 1 every time the counter rolls over. Because the counter reaches `CNT_MAX` every 0.5 seconds, `blkReg` toggles every 0.5 seconds, which means its full period (from 0 back to 0) is 1 second. The final line, `io.led := blkReg`, simply drives the LED output with the current value of this flip-flop. As a result, the LED turns on for 0.5 seconds, then off for 0.5 seconds, repeatedly, blinking at 1 Hz.
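The timing arithmetic above can be checked in plain Scala (a quick sanity-check sketch, not part of the hardware):

```scala
object BlinkMath {
  val clockHz = 50000000
  val cntMax  = clockHz / 2 - 1          // 24999999, the CNT_MAX value
  val cycles  = cntMax + 1               // counting 0..cntMax takes 25000000 cycles
  val halfPeriodSec = cycles.toDouble / clockHz // 0.5 s per toggle -> 1 Hz blink

  def main(args: Array[String]): Unit = {
    println(cntMax)        // prints 24999999
    println(halfPeriodSec) // prints 0.5
  }
}
```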
To translate this into a logic circuit, think of the design in terms of combinational logic and flip-flops. The register `cntReg` is a bank of 32 D flip-flops with a common clock and synchronous reset-like behavior implemented via a multiplexer on their D inputs. There is a 32-bit adder that computes `cntReg + 1`. There is also a 32-bit equality comparator that checks whether `cntReg` equals the constant `CNT_MAX`. The output of this comparator is a single bit, which we can call `eqMax`. This `eqMax` signal controls a multiplexer that selects whether the next value of `cntReg` should be the incremented value (`cntReg + 1`) or zero. In terms of equations, if `cntReg_next` denotes the next-state value of `cntReg`, then logically:
$$
\text{cntReg\_next} =
\begin{cases}
0, & \text{if } \text{eqMax} = 1 \\
\text{cntReg} + 1, & \text{if } \text{eqMax} = 0
\end{cases}
$$
The `blkReg` signal is a simple one-bit D flip-flop that behaves like a T flip-flop with an enable. When `eqMax` is false, `blkReg` simply holds its current value. When `eqMax` is true, `blkReg` takes the bitwise NOT of its previous value. If we denote the next state as `blkReg_next`, then:
$$
\text{blkReg\_next} =
\begin{cases}
\overline{\text{blkReg}}, & \text{if } \text{eqMax} = 1 \\
\text{blkReg}, & \text{if } \text{eqMax} = 0
\end{cases}
$$
At the circuit level, that means there is a NOT gate that computes `~blkReg`, and a 2:1 multiplexer chooses between `blkReg` and `~blkReg` based on `eqMax`. The output of this multiplexer feeds the D input of the `blkReg` flip-flop. Finally, the LED output is just wired to the Q output of `blkReg`. So the entire circuit consists of a 32-bit up-counter with terminal-count detection and reset, combined with a 1-bit T flip-flop that toggles on every terminal count.
To “enhance” the Chisel version by making the logic circuit more explicit, we can factor the combinational logic out into intermediate wires and, optionally, parameterize the clock frequency and blink rate. Here is an extended version that stays close to the original behavior but clarifies the structure:
```scala
class HelloBlink(
  val CLOCK_FREQ_HZ: Int = 50000000, // input clock frequency
  val BLINK_HZ: Int = 1              // desired blink frequency
) extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })

  // The LED toggles every half period, so we compute the number of cycles
  // for half the blink period and subtract 1 because we count from 0.
  val halfPeriod = (CLOCK_FREQ_HZ / (2 * BLINK_HZ)) - 1
  val CNT_MAX = halfPeriod.U

  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))

  // Combinational logic: adder, comparator, and muxes.
  val cntInc = cntReg + 1.U      // 32-bit incrementer
  val eqMax  = cntReg === CNT_MAX // 32-bit equality comparator

  // Next-state logic for the counter: if eqMax is true, go to 0; else increment.
  val cntNext = Mux(eqMax, 0.U, cntInc)
  // Next-state logic for the LED flip-flop: toggle only when eqMax is true.
  val blkNext = Mux(eqMax, ~blkReg, blkReg)

  // Sequential logic: registers sample their next state on each clock edge.
  cntReg := cntNext
  blkReg := blkNext

  // Output: LED is driven directly by blkReg.
  io.led := blkReg
}
```
In this version, the internal wires `cntInc`, `eqMax`, `cntNext`, and `blkNext` correspond quite directly to blocks in the logic circuit. The expression `cntInc = cntReg + 1.U` is synthesized into a 32-bit adder. The expression `eqMax = cntReg === CNT_MAX` becomes an equality comparator. The functions `Mux(eqMax, 0.U, cntInc)` and `Mux(eqMax, ~blkReg, blkReg)` become multiplexers that implement the next-state equations written above. The registers `cntReg` and `blkReg` together with these combinational blocks form a synchronous sequential circuit that divides down the input clock and produces a blinking LED signal. The behavior is the same as the original “Hello World” design, but the connection to standard logic elements such as adders, comparators, muxes, and flip-flops is made explicit in the code structure.
## Progress of 1-single-cycle
### **Condensed Summary of CA25 Exercises**
Across the CA25 exercises, the focus was on implementing the essential functional blocks of a single-cycle RV32I CPU:
* Immediate generation for all instruction formats
* Control-signal derivation
* ALU function decoding
* Branch comparison
* Jump-target computation
* Load/store data alignment
* Write-back selection
* PC update logic
These exercises collectively built the full datapath and control logic needed to support the entire RV32I instruction set. The work clarified how instruction fields, immediates, ALU operands, and memory interactions combine into a coherent single-cycle design, and how each instruction type steers data through the CPU via control signals. Rather than diving deeply into each subcomponent, the exercises formed an integrated understanding of how all modules interact to implement ISA-level behavior in hardware.
See the complete implementation in [1-single-cycle](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/tree/main/1-single-cycle).
---
### Test Cases Summary
**Full-System Integration Testing**
A large portion of the work involved verifying the processor using full-system test programs executed on the completed single-cycle CPU. These programs—such as RSqrt, Fibonacci, Quicksort, and byte-access tests—exercise the entire datapath and control logic together, stressing not only ALU computation but also branching, memory access, function-call behavior, and register-file correctness. By checking final memory values or register results, these tests validate that the CPU’s implementation aligns with software expectations under realistic workloads. The integration tests serve as end-to-end verification that the complete execution pipeline behaves correctly when executing nontrivial programs.
* Example of using ChiselTest to run quicksort on the single-cycle CPU:
```scala=
class QuicksortTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU - Integration Tests")
  it should "correctly execute Quicksort algorithm on 10 numbers" in {
    test(new TestTopModule("quicksort.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 50) {
        c.clock.step(1000)
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      for (i <- 1 to 10) {
        c.io.mem_debug_read_address.poke((4 * i).U)
        c.clock.step()
        c.io.mem_debug_read_data.expect((i - 1).U)
      }
    }
  }
}
```
The complete test file is [CPUTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/2-mmio-trap/src/test/scala/riscv/singlecycle/CPUTest.scala).
**Component-Level Testing for ALU, Branching, and PC Logic**
Unit tests for the execute stage target the ALU and branch comparator specifically. Randomized ALU inputs help confirm arithmetic logic correctness over a wide operand range, while branch tests validate both taken and not-taken scenarios for conditional instructions. The PC logic is tested to ensure correct sequential incrementing and immediate redirection during jumps and branches. These unit tests isolate logic related to arithmetic correctness and control-flow correctness, ensuring robustness before integrating the full datapath.
```scala=
{
  ...
  // add test
  c.io.instruction.poke(0x001101b3L.U) // x3 = x2 + x1
  for (_ <- 0 to 100) {
    val op1 = scala.util.Random.nextInt(429496729)
    val op2 = scala.util.Random.nextInt(429496729)
    val result = op1 + op2
    c.io.reg1_data.poke(op1.U)
    c.io.reg2_data.poke(op2.U)
    c.clock.step()
    c.io.mem_alu_result.expect(result.U)
    c.io.if_jump_flag.expect(0.U)
  }

  // beq test
  c.io.instruction.poke(0x00208163L.U) // pc + 2 if x1 === x2
  c.io.instruction_address.poke(2.U)
  c.io.immediate.poke(2.U)
  c.io.aluop1_source.poke(1.U)
  c.io.aluop2_source.poke(1.U)
  c.clock.step()
  ...
}
```
The complete test file is [ExecuteTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/1-single-cycle/src/test/scala/riscv/singlecycle/ExecuteTest.scala).
**Decoder Validation Across All Instruction Formats**
Another set of test cases validates decoding behavior for each RISC-V instruction type. By checking immediate extraction, register index selection, memory-access behavior, ALU operand choices, and write-back destinations, the tests confirm that the CPU properly interprets 32-bit instruction words. These checks ensure that the control signals reflect ISA semantics, forming a critical foundation for correct execution of all instruction categories—including R-type arithmetic, I-type logical and load instructions, S-type stores, B-type branches, U-type immediate loads, and J-type jumps.
* Using an I-type load instruction as an example:
```scala=
{
  ...
  // InstructionTypes.L, I-type load instruction
  c.io.instruction.poke(0x0040a183L.U) // lw x3, 4(x1)
  c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
  c.io.ex_aluop2_source.expect(ALUOp2Source.Immediate)
  c.io.ex_immediate.expect(4.U)
  c.io.regs_reg1_read_address.expect(1.U)
  c.io.reg_write_enable.expect(true.B)
  c.io.reg_write_address.expect(3.U)
  c.io.wb_reg_write_source.expect(RegWriteSource.Memory)
  c.io.memory_read_enable.expect(true.B)
  c.io.memory_write_enable.expect(false.B)
  c.clock.step()
  ...
}
```
The complete test file is [InstructionDecoderTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/1-single-cycle/src/test/scala/riscv/singlecycle/InstructionDecoderTest.scala).
**Instruction Fetch and PC Sequencing Tests**
Fetch-stage tests verify that the PC increments correctly by 4 for sequential execution and updates precisely to jump targets when branches or jumps occur. These tests alternate between normal progression and redirection, confirming that the fetch unit properly interacts with control signals from the decode stage. This ensures correct instruction flow throughout program execution, preventing PC misalignment, instruction-skipping, or unintended infinite loops.
* The full test logic:
```scala=
{
  ...
  val entry = 0x1000
  var pre = entry
  var cur = pre
  c.io.instruction_valid.poke(true.B)
  for (_ <- 0 to 100) {
    Random.nextInt(2) match {
      case 0 => // no jump
        cur = pre + 4
        c.io.jump_flag_id.poke(false.B)
        c.clock.step()
        c.io.instruction_address.expect(cur.U)
        pre = pre + 4
      case 1 => // jump
        c.io.jump_flag_id.poke(true.B)
        c.io.jump_address_id.poke(entry.U)
        c.clock.step()
        c.io.instruction_address.expect(entry.U)
        pre = entry
    }
  }
  ...
}
```
The complete test file is [InstructionFetchTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/1-single-cycle/src/test/scala/riscv/singlecycle/InstructionFetchTest.scala).
**Register File Tests Ensuring Architectural Compliance**
Register-file tests confirm architectural guarantees: registers except `x0` must store written values accurately, and `x0` must remain hardwired to zero regardless of write attempts. Additional tests verify that write-through behavior works as expected—meaning that reading a register in the same cycle as a write returns the updated value. These behaviors are required for correct execution of RISC-V programs, especially those that rely on immediate register use after writes.
* Using the write-through test case as an example:
* Write to the register file first, then read the register value back.
* `0xdeadbeef` is a hexadecimal constant (3,735,928,559 in decimal).
```scala=
{
  ...
  behavior.of("RegisterFile")
  it should "support write-through (read during write cycle)" in {
    test(new RegisterFile).withAnnotations(TestAnnotations.annos) { c =>
      timescope {
        c.io.read_address1.poke(2.U)
        c.io.read_data1.expect(0.U)
        c.io.write_enable.poke(true.B)
        c.io.write_address.poke(2.U)
        c.io.write_data.poke(0xdeadbeefL.U)
        c.clock.step()
        c.io.read_address1.poke(2.U)
        c.io.read_data1.expect(0xdeadbeefL.U)
        c.clock.step()
      }
      c.io.read_address1.poke(2.U)
      c.io.read_data1.expect(0xdeadbeefL.U)
    }
  }
  ...
}
```
The complete test file is [RegisterFileTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/1-single-cycle/src/test/scala/riscv/singlecycle/RegisterFileTest.scala).
---
### Purpose of `make compliance`
**Ensuring Architectural Accuracy Against the Official RISC-V Specification**
The `make compliance` command integrates [RISCOF](https://riscof.readthedocs.io/en/stable/), which executes the official RISC-V architectural test suite and compares the CPU’s behavior against a golden reference model. Its main role is to ensure that every instruction defined in the RV32I specification—along with CSR behaviors (not included in 1-single-cycle)—matches the ISA’s architectural expectations. This goes beyond program execution correctness, requiring cycle-consistent, value-accurate, and exception-correct behavior across all legal instruction patterns.
**Detecting Subtle and Hard-to-Find Bugs**
Many RISC-V rules are easy to violate unintentionally:
* Sign-extension rules
* Alignment constraints
* PC update semantics
* CSR behaviors
* Interactions between loads/stores
* Memory alignment
While simple test programs may run successfully, ISA compliance tests expose subtle corner cases and undefined-behavior conditions that often go unnoticed. Running `make compliance` identifies these mismatches early, preventing long-term architectural errors in future development.
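As an example of how easy the sign-extension rules are to get wrong: a 12-bit I-type immediate must be extended to 32 bits by replicating bit 11. A plain-Scala model of this (illustrative, not the CPU's actual decoder code):

```scala
object SignExtend {
  // Sign-extend a 12-bit immediate to 32 bits by replicating bit 11.
  def sext12(imm: Int): Int =
    if ((imm & 0x800) != 0) imm | 0xFFFFF000 else imm & 0xFFF

  def main(args: Array[String]): Unit = {
    println(sext12(0xFFF)) // prints -1 (all-ones 12-bit value is -1)
    println(sext12(0x7FF)) // prints 2047 (largest positive 12-bit value)
  }
}
```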
**Providing a Standardized Verification Framework**
RISCOF offers a standardized harness for comparing DUT behavior against reference signatures. This ensures consistency: every RISC-V core, whether educational or commercial, is judged by the same test criteria and the same expected outputs. Using this framework, the CPU can be validated objectively rather than through ad-hoc testing, making the results more credible and portable across projects.
---
### Problems Encountered
**Prepare RISCOF in WSL**
While setting up RISCOF in WSL, I learned to isolate the RISC-V compliance environment using Python virtual environments. I created and activated a `.venv/`, installed RISCOF inside it, and adopted the workflow of activating the environment before running tests and deactivating when done. This setup ensures a clean, reproducible environment where RISCOF’s Python dependencies do not conflict with the system-wide Python configuration.
Using a virtual environment is especially important in WSL because installing Python packages system-wide often leads to system library and dependency conflicts. Many Python packages required by RISCOF depend on specific versions of libraries (such as pyyaml, jinja2, or elftools), and installing them globally can break other tools that rely on different versions, or even interfere with the Linux distribution’s package manager (`apt`). By installing RISCOF inside a virtual environment, all required Python libraries are contained and version-pinned locally, avoiding conflicts with system Python packages and preventing hard-to-debug errors that commonly arise when mixing pip-installed libraries with system-managed libraries in WSL.
* Command to create the `.venv` folder:
```bash=
python3 -m venv .venv
```
* Command to activate the virtual environment:
```bash=
source .venv/bin/activate
```
**MuxLookup and Seq in ALUControl**
When working on ALU control, I gained a clearer understanding of how `MuxLookup` and `Seq` combine to implement decode tables in Chisel. `MuxLookup` behaves like a hardware-friendly switch that selects an output based on a `UInt` key, and `Seq` provides the mapping as a list of (key -> value) pairs. Using this pattern, I could compactly express the mapping from `funct3` (and sometimes `funct7`) to ALU function codes, making the decode logic both readable and systematic.
* Example of `MuxLookup` on `funct3` in the ALU control unit:
```scala=
{
  ...
  // I-type immediate operation instructions (ADDI, SLTI, XORI, ORI, ANDI, SLLI, SRLI, SRAI)
  io.alu_funct := MuxLookup(io.funct3, ALUFunctions.zero)(
    Seq(
      InstructionsTypeI.addi -> ALUFunctions.add,
      InstructionsTypeI.slli -> ALUFunctions.sll,
      InstructionsTypeI.slti -> ALUFunctions.slt,
      InstructionsTypeI.sltiu -> ALUFunctions.sltu,
      InstructionsTypeI.xori -> ALUFunctions.xor,
      InstructionsTypeI.ori -> ALUFunctions.or,
      InstructionsTypeI.andi -> ALUFunctions.and,
      // SRLI/SRAI distinguished by funct7[5]:
      //   funct7[5] = 0 -> SRLI (logical right shift)
      //   funct7[5] = 1 -> SRAI (arithmetic right shift)
      InstructionsTypeI.sri -> Mux(io.funct7(5), ALUFunctions.sra, ALUFunctions.srl)
    )
  )
  ...
}
```
**RISC-V GNU Toolchain not found**
During RISCOF setup, I encountered the “RISC-V GNU Toolchain not found” error, which highlighted the importance of environment variables and toolchain visibility. I discovered that simply having the compiler in `PATH` is not enough for some flows; scripts may rely on `CROSS_COMPILE` or custom environment setups.
```diff
- Error: RISC-V GNU Toolchain not found
- Please install a RISC-V toolchain or set CROSS_COMPILE environment variable
```
By sourcing Homework 2's `setenv` script (which configures `PATH` and `CROSS_COMPILE` for `riscv-none-elf-`), I enabled `make compliance` to locate the compiler and run successfully.
```bash!
source ~/riscv-none-elf-gcc/setenv
```
**RISCOF: All Tests Fail**
Another issue arose when all RISCOF tests failed because the framework could not find the toolchain via `$RISCV`. By reading `ComplianceTestBase.scala`, I realized that RISCOF checks specific locations and the `RISCV` environment variable rather than the generic `PATH`.
My previous `setenv` only modified `PATH`, so RISCOF could not locate the toolchain root.
```bash!
export RISCV=/home/kiwi/riscv-none-elf-gcc
export PATH="$RISCV/bin:$PATH"
```
After updating `setenv` to export `RISCV` pointing to the toolchain directory, RISCOF’s discovery logic passed, and the entire RV32I suite ran to completion with all tests passing.
```
✅ Compliance tests complete. Results in riscof_work_1sc/
Completion time: Wed Dec 10 10:39:29 CST 2025
Copying results to results/ directory...
Cleaning up auto-generated RISCOF test files...
✅ Compliance tests complete. Results in results/
📊 View report: results/report.html
```
This experience reinforced the value of reading framework source code to understand hidden assumptions.
### Chisel Blanks and Waveform Analysis (Using quicksort.asmbin)
* Possible values and meanings for `wb_reg_write_source`
```scala=
object RegWriteSource {
  val ALUResult = 0.U(2.W)
  val Memory = 1.U(2.W)
  val NextInstructionAddress = 2.U(2.W)
}
```
* Possible values and meanings for `ALUOp1Source` and `ALUOp2Source`
```scala=
object ALUOp1Source {
  val Register = 0.U(1.W)
  val InstructionAddress = 1.U(1.W)
}

object ALUOp2Source {
  val Register = 0.U(1.W)
  val Immediate = 1.U(1.W)
}
```

* **`add` (Register-Register Arithmetic)**
For an `add a5,a4,a5` instruction, the opcode is decoded as R-type (`isOpImm = 0`). Both ALU operands are sourced directly from registers, so `io_ex_aluop1_source = 0 (Register)` and `io_ex_aluop2_source = 0 (Register)`. The ALU computes the sum of `rs1` and `rs2`, and `wb_reg_write_source = 0 (ALUResult)` so the result is written back to the destination register via the ALU result path. In the waveform, this is visible as stable operand-select signals pointing to registers, with no memory activity during the cycle.
* **`lw` (Load Word)**
When executing an `lw a5,0(a5)` instruction, the decoder asserts `isLoad = 1`. As a result, the ALU is used to compute the effective memory address by adding the base register (`rs1`) and the sign-extended immediate.
This is reflected by `io_ex_aluop2_source = 1 (Immediate)` and `io_ex_aluop1_source = 0 (Register)`. The memory read path is enabled (`memory_read_enable = 1`), while memory write remains disabled. In the write-back stage, `wb_reg_write_source = 1` selects **Memory**, so the data loaded from memory is written into the destination register. The waveform shows the instruction remaining stable during the cycle, with ALU operand 2 consistently driven by the immediate value.

* **`jal` (Jump and Link)**
When a `jal ra, -500` instruction is executed, `isJal = 1` is asserted. As with `auipc`, `io_ex_aluop1_source = 1` and `io_ex_aluop2_source = 1`, so the ALU uses the instruction address (PC) as operand 1 and the J-type immediate as operand 2 to compute the jump target. However, the write-back path differs: `wb_reg_write_source = 2 (NextInstructionAddress)` selects **PC + 4**, storing the return address in the destination register (typically `ra`). The PC update logic uses the computed target address for the next instruction.
* **`addi` (Immediate Arithmetic)**
When executing `addi sp,sp,-48`, the instruction is decoded as OP-IMM (`isOpImm = 1`). `io_ex_aluop1_source = 0 (Register)`, so ALU operand 1 comes from `rs1`, while operand 2 is the sign-extended I-type immediate (`io_ex_aluop2_source = 1 (Immediate)`). There is no memory access in this instruction. The ALU result is written back directly to the destination register.

* **`auipc` (Add Upper Immediate to PC)**
For `auipc t1, 0x1`, the decoder asserts `isAuipc = 1`, which changes both ALU operand selections. Operand 1 is sourced from the **instruction address (PC)** (`io_ex_aluop1_source = 1 (InstructionAddress)`), and operand 2 comes from the U-type immediate (`io_ex_aluop2_source = 1 (Immediate)`). The ALU computes `PC + (imm << 12)`, and the result is written back to the destination register. No memory access occurs.
* **`bgeu` (Branch if Greater or Equal, Unsigned)**
For `bgeu t0,t1,16`, the decoder asserts `isBranch = 1`. `io_ex_aluop1_source = 1 (InstructionAddress)` sets ALU operand 1 to the instruction address (PC), and `io_ex_aluop2_source = 1 (Immediate)` selects the B-type immediate as operand 2, which is used to compute the branch target address. In parallel, the branch comparator evaluates the unsigned comparison between `rs1` and `rs2`. No register write-back occurs, and memory is not accessed. If the comparison condition is true, the PC is updated to the branch target; otherwise, it advances normally to PC + 4.
## Progress of 2-mmio-trap
### Exercises Overview
Work in the 2-mmio-trap stage extended the basic single-cycle RV32I core into a full RV32I + Zicsr machine with MMIO and trap/interrupt support. Immediate generation, ALU control, and branch/jump target calculation were reused and refined to handle CSR-related instructions and PC redirection into trap handlers. On top of this, the writeback stage was generalized so that register data can now come from the ALU, memory, PC+4, or CSR read data, depending on the instruction and control signals derived from decode.
CSRs and trap handling formed the main new architectural focus. The project implemented a CSR lookup table that maps CSR addresses (such as `mstatus`, `mie`, `mtvec`, `mscratch`, `mepc`, `mcause`, and the 64-bit cycle counter) to their internal registers, and added write logic with a clear priority scheme: CLINT direct writes for interrupts take precedence, while CSR instructions from the CPU update CSRs in normal execution. The `mstatus` state machine for trap entry and return was implemented according to the privileged spec:
1. saving and disabling `MIE` on entry
2. recording `mepc` and `mcause`
3. jumping to `mtvec`
4. then restoring `MIE` and updating `MPIE` on `mret`.
Finally, the PC update logic was extended to arbitrate between sequential execution, control-flow changes (branches/jumps), and trap/interrupt redirection, while still injecting a NOP when instructions are invalid.
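As a sanity check, the `MIE`/`MPIE` bookkeeping of steps 1 and 4 can be modeled in a few lines of plain Scala. This is a software sketch of the spec-mandated behavior, not the project's Chisel implementation; the bit positions (`MIE` = bit 3, `MPIE` = bit 7) follow the RISC-V privileged spec:

```scala
object MstatusModel {
  // mstatus bit positions per the RISC-V privileged spec
  val MIE  = 3
  val MPIE = 7

  private def bit(v: Long, pos: Int): Long = (v >> pos) & 1L
  private def setBit(v: Long, pos: Int, b: Long): Long =
    (v & ~(1L << pos)) | (b << pos)

  // Trap entry: save MIE into MPIE, then disable MIE
  def trapEntry(mstatus: Long): Long =
    setBit(setBit(mstatus, MPIE, bit(mstatus, MIE)), MIE, 0L)

  // mret: restore MIE from MPIE, then set MPIE to 1
  def mret(mstatus: Long): Long =
    setBit(setBit(mstatus, MIE, bit(mstatus, MPIE)), MPIE, 1L)
}
```

For example, entering a trap with `MIE = 1` (`mstatus = 0x8`) yields `0x80`, and a subsequent `mret` yields `0x88` (`MIE` restored, `MPIE` forced to 1).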
See the complete implementation in [2-mmio-trap](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/tree/main/2-mmio-trap)
---
### Test Cases Summary
**1. Machine-Mode Interrupt and CSR Handling**
Test cases for interrupts and CSRs verify the complete trap-handling pipeline from event detection to handler execution and return. They check whether timer or external interrupts correctly assert the CLINT interface, cause the CPU to jump to `mtvec`, and update `mstatus`, `mepc`, and `mcause` according to the RISC-V privileged rules. Both interrupt and exception sources are covered:
* Timer interrupts
* Environment calls (`ecall`)
* Breakpoints (`ebreak`).
The tests confirm that `mcause[31]` is set correctly for interrupts versus exceptions, that the saved `mepc` points to the faulting instruction or next PC as required, and that the `mret` instruction correctly restores `mstatus` (copying `MPIE` back to `MIE` and setting `MPIE` to 1) and resumes execution at `mepc`. This gives strong evidence that the CSR file, CLINT integration, and PC transfer logic implement trap semantics faithfully.
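The `mcause` encoding the tests check can be stated in one line: bit 31 flags an interrupt, and the low-order bits carry the cause code. A small sketch (cause codes follow the privileged spec):

```scala
// mcause: interrupt flag in bit 31, cause code in the low-order bits
def mcauseValue(isInterrupt: Boolean, code: Long): Long =
  (if (isInterrupt) 1L << 31 else 0L) | code

// Machine timer interrupt has code 7; environment call from M-mode has code 11
val timerMcause = mcauseValue(isInterrupt = true, code = 7)   // 0x80000007
val ecallMcause = mcauseValue(isInterrupt = false, code = 11) // 0x0000000b
```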
See the full test cases in [CLINTCSRTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/2-mmio-trap/src/test/scala/riscv/singlecycle/CLINTCSRTest.scala)
**2. Full-System CPU Program Execution**
System-level tests run real assembly programs on the enhanced CPU to validate end-to-end behavior under realistic workloads. Programs such as Fibonacci and Quicksort exercise function calls, stack usage, arithmetic, branching, and memory access, while their final results are checked in memory or registers (e.g., verifying `fib(10) = 55` or that an array is sorted). Byte-access programs ensure that LB/LBU/LH/LHU and SB/SH behave correctly with respect to sign/zero extension and alignment. Interrupt-focused programs deliberately trigger traps (via timer or software events), perform work in the handler, then return using `mret`, with post-run checks on CSRs and memory to confirm the exact sequence of events. Together, these tests show that fetch, decode, execute, memory, CSR, and interrupt subsystems are all correctly wired and cooperate properly during full program execution.
* Example testing the interrupt and trap mechanism in the mmio-trap implementation
```scala=
class InterruptTrapTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("[CPU] Interrupt trap flow")

  it should "jump to trap handler and then return" in {
    test(new TestTopModule("irqtrap.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 1000) {
        c.clock.step()
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      c.io.mem_debug_read_address.poke(4.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(0xdeadbeefL.U)
      c.io.interrupt_flag.poke(0x1.U)
      c.clock.step(5)
      c.io.interrupt_flag.poke(0.U)
      for (i <- 1 to 1000) {
        c.clock.step()
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      c.io.csr_regs_debug_read_address.poke(CSRRegister.MSTATUS)
      c.io.csr_regs_debug_read_data.expect(0x1888.U)
      c.io.csr_regs_debug_read_address.poke(CSRRegister.MCAUSE)
      c.io.csr_regs_debug_read_data.expect(0x80000007L.U)
      c.io.mem_debug_read_address.poke(0x4.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(0x2022L.U)
    }
  }
}
```
See the full test cases in [CPUTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/2-mmio-trap/src/test/scala/riscv/singlecycle/CPUTest.scala)
**3. Execute-Stage CSR Operation Tests**
Dedicated execute-stage tests concentrate on CSR instructions (`csrrw`, `csrrs`, `csrrc` and their immediate variants). For each CSR operation, the tests verify that the combined effect on the CSR register and the destination GPR matches the Zicsr specification:
* `csrrw` performs a full overwrite
* `csrrs` sets bits
* `csrrc` clears bits based on the source operand
The tests also confirm that read-back paths and writeback selection properly route CSR values into the integer register file when required. This validates both the ALU/execute datapath handling of CSR-side operands and the correctness of the CSR register update logic.
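The three read-modify-write behaviors can be summarized as a compact software model (this mirrors the Zicsr specification, not the project's Chisel datapath): the destination GPR always receives the old CSR value, while the new CSR value depends on the opcode.

```scala
// Returns (value written to rd, new CSR value)
def csrOp(op: String, csr: Long, src: Long): (Long, Long) = {
  val newCsr = op match {
    case "csrrw" => src        // full overwrite
    case "csrrs" => csr | src  // set the bits given in src
    case "csrrc" => csr & ~src // clear the bits given in src
    case _       => csr        // unknown op: leave the CSR unchanged
  }
  (csr, newCsr) // rd gets the old CSR value (read-before-write)
}
```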
See the full test cases in [ExecuteTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/2-mmio-trap/src/test/scala/riscv/singlecycle/ExecuteTest.scala)
**4. Timer MMIO Register Tests**
Timer tests check that the MMIO interface for a machine timer behaves as expected. By issuing loads and stores to the timer’s mapped addresses, the tests verify that configuration registers such as the timer limit and enable flag update correctly and that readbacks reflect the current hardware state. They also verify that when the timer reaches the programmed limit and is enabled, an interrupt is generated and propagated to the CPU via the CLINT path. This demonstrates that the timer peripheral follows the designed MMIO protocol and can drive the interrupt mechanism in a predictable way.
See the full test case in [TimerTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/2-mmio-trap/src/test/scala/riscv/singlecycle/TimerTest.scala)
**5. UART MMIO and End-to-End Transmission Tests**
UART tests exercise a memory-mapped serial interface integrated with the CPU, focusing on both transmit and receive paths. Test programs write characters or binary values into the UART transmit register, toggle control bits to enable or disable the UART, and then read status or receive registers to confirm correct behavior. The testbench tracks how many bytes were transmitted, what the last transmitted byte was, and whether transmission only occurs when the UART is properly enabled. Completion flags or result codes are written back into memory, and the test harness checks these values to decide whether all UART scenarios passed.
```scala=
class UartMMIOTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("[UART] Comprehensive TX+RX test")

  it should "pass all TX and RX tests" in {
    test(new UartHarness("uart.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      c.io.interrupt_flag.poke(0.U)
      c.clock.setTimeout(0)
      for (_ <- 0 until 15000) {
        c.clock.step()
      }
      c.io.mem_debug_read_address.poke(0x100.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(0xcafef00dL.U)
      // Check all tests passed: TX + Multi-byte RX + Binary RX + Timeout RX
      c.io.mem_debug_read_address.poke(0x104.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(0xf.U) // 0b1111 = all 4 tests passed
    }
  }
}
```
These tests validate correct MMIO address decoding, side-effect behavior on writes and reads, and correct coordination between CPU software and the UART peripheral.
See the full test case setup in [UARTMMIOTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/2-mmio-trap/src/test/scala/riscv/singlecycle/UartMMIOTest.scala)
---
### Encountered Problems
**Running Test Programs with `make sim`**
While running simulation with Verilator, a path-related issue appeared when specifying the instruction binary for `make sim`. The default instinct was to use a relative path from the Verilator `obj_dir` directory, such as `SIM_ARGS="-instruction ../../../src/main/resources/fibonacci.asmbin"`. However, in the 2-mmio-trap setup, `make sim` interprets the instruction path relative to the 2-mmio-trap project root, not the generated `verilog/verilator/obj_dir` directory. The correct usage is therefore `make sim SIM_ARGS="-instruction src/main/resources/fibonacci.asmbin"`. Once this adjustment was made, the simulator could locate the program binary correctly, and test programs like Fibonacci and Quicksort ran without file-not-found errors.
---
### Verify Nyancat Simulation and Compression Strategy
**Verifying Nyancat Simulation**
The Nyancat VGA demo serves as a stress test for the MMIO and peripheral subsystem, particularly the VGA pipeline and simulation environment. Running the Nyancat program via `make demo` with SDL2 support verifies that the CPU can continuously feed frame data through the MMIO interface to a VGA-like framebuffer and that the generated Verilator model produces a stable, animated display. Observing the animation and confirming that the simulation runs for the expected number of cycles provides strong evidence that the CPU, bus, and VGA peripheral agree on addressing, timing, and data formats.
* A .gif demonstrating the Nyancat demo running successfully

**Extending and Refining Nyancat Compression**
Beyond simply running the demo, significant attention went into analyzing and improving the Nyancat frame-compression scheme. The current `delta-RLE` format uses opcodes such as `SetColor`, “skip unchanged” runs, and “repeat changed” runs to exploit temporal redundancy between frames. Several directions for further compression were identified. One group of ideas adds richer opcodes, such as “same-as-previous-line” or “same-as-baseline-frame” blocks, which can drastically reduce encoding size for repetitive sky or background regions. Another idea is to generalize run-length encodings with more flexible length fields so that long runs do not have to be approximated by multiple fixed-size ops.
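To make the delta-RLE idea concrete, here is a minimal decoder sketch over a symbolic opcode stream. The opcode names mirror the prose above, but the actual byte-level encoding used in the repository may differ; this is an illustrative model, not the shipped decompressor.

```scala
sealed trait Op
case class SetColor(c: Int) extends Op // change the current draw color
case class Skip(n: Int)     extends Op // copy n pixels unchanged from the previous frame
case class Repeat(n: Int)   extends Op // emit n pixels of the current color

def decodeFrame(prev: Vector[Int], ops: Seq[Op]): Vector[Int] = {
  var color = 0
  var pos   = 0
  val out   = Vector.newBuilder[Int]
  ops.foreach {
    case SetColor(c) => color = c
    case Skip(n)     => out ++= prev.slice(pos, pos + n); pos += n
    case Repeat(n)   => out ++= Vector.fill(n)(color);    pos += n
  }
  out.result()
}
```

A frame identical to the previous one compresses to a single `Skip(width * height)`, which is where the temporal redundancy pays off.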
**Structural Reorganization of Frame Data**
A deeper level of optimization focuses on changing the structure of what is encoded rather than just how it is encoded. The animation can be decomposed into static background, moving rainbow stripes, and cat sprites with small animated patches. Under this model, the background and main sprite are stored once, while per-frame data is reduced to offsets and small patches, significantly shrinking data size. Similarly, tile-based schemes divide each frame into 8×8 tiles and build a dictionary of unique tiles shared across frames, reconstructing frames by referencing tile indices rather than storing full pixel arrays. Symmetry and mirroring can also be exploited for regions that are horizontally or vertically symmetric, further reducing the stored data.
**Entropy Coding and Lightweight Dictionary Methods**
On top of the structural and opcode-level schemes, generic compression techniques such as static `Huffman coding` or `small-window LZ-style dictionary compression` can be applied to the opcode stream.
Since some opcodes (especially skip and repeat) occur much more frequently than others, a tailored Huffman code could reduce the byte stream size by another 10–20%, with the possibility of decoding to the original opcodes at startup and running the existing decompressor unchanged.
Alternatively, a very small LZ decoder with constrained window and match length could exploit recurring opcode sequences between similar frames without overly complicating the on-device code.
* How LZ-style dictionary compression works (LZ77/LZSS example)
* **Sliding Window**: Instead of a massive dictionary, it uses a fixed-size buffer (e.g., 256 bytes to a few KB) of recently processed data.
* **Look-Ahead Buffer**: A small buffer scans ahead in the data stream.
* **Match Finding**: The algorithm finds the longest match for the look-ahead buffer within the sliding window.
* **Encoding**:
1. If a match is found, it outputs a pointer: (offset, length, next_char).
2. If no match, it outputs the literal character.
* **Window Movement**: The window "slides" forward by the length of the match plus one character.
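The decode side of such a scheme is tiny, which is what makes it attractive for a softcore. Below is a minimal LZSS-style decoder sketch; the token layout is illustrative, not a committed format for this project:

```scala
sealed trait Token
case class Literal(b: Byte)              extends Token // emit one byte as-is
case class Ref(offset: Int, length: Int) extends Token // back-reference into decoded output

def lzssDecode(tokens: Seq[Token]): Array[Byte] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
  tokens.foreach {
    case Literal(b) => out += b
    case Ref(offset, length) =>
      // Copy byte-by-byte so overlapping matches (offset < length) work,
      // which is how a short pattern expands into a long run
      val start = out.length - offset
      for (i <- 0 until length) out += out(start + i)
  }
  out.toArray
}
```

For example, `Literal('a'), Literal('b'), Ref(offset = 2, length = 4)` expands to `ababab`: a 2-byte pattern extended by a single token.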
**Balancing Data Size, Code Size, and Runtime Costs**
The overall Nyancat work highlights the trade-offs between compression ratio, decoder complexity, code size, and runtime overhead on a softcore CPU. Some techniques (extra opcodes, small structural tweaks) are almost free in terms of code size and runtime, while others (full tile dictionaries, Huffman/LZ decoding) provide more compression at the cost of additional logic and cycles.
By enumerating these options and analyzing their impact, the project builds a clear roadmap for future optimization:
1. Start with small opcode extensions and simple structural refactoring
2. Selectively layer on entropy coding or dictionary-based methods if further size reductions are required under tight binary size constraints.
### Chisel and Waveform Analysis
* **CSR (running irqtrap.asmbin)**
* CSR register addresses
```scala=
object CSRRegister {
  val MSTATUS  = 0x300.U(Parameters.CSRRegisterAddrWidth)
  val MIE      = 0x304.U(Parameters.CSRRegisterAddrWidth)
  val MTVEC    = 0x305.U(Parameters.CSRRegisterAddrWidth)
  val MSCRATCH = 0x340.U(Parameters.CSRRegisterAddrWidth)
  val MEPC     = 0x341.U(Parameters.CSRRegisterAddrWidth)
  val MCAUSE   = 0x342.U(Parameters.CSRRegisterAddrWidth)
  val CycleL   = 0xc00.U(Parameters.CSRRegisterAddrWidth)
  val CycleH   = 0xc80.U(Parameters.CSRRegisterAddrWidth)
}
```
* **csrrw**

When the single-cycle CPU executes `csrrw t1, mtvec, t0`, the instruction decode identifies a CSRRW operation with CSR address `0x305`, causing the CSR read and write paths to be activated in the same cycle. The CSR module first performs a combinational read of `mtvec`, placing its old value on `io_reg_read_data`, which is then written back to the destination register `t1`. At the same time, the value from source register `t0` is forwarded on `io_reg_write_data_ex`, and with `io_reg_write_enable_id` asserted and no CLINT override, the CSR write logic updates `mtvec` at the rising clock edge.
The waveform shows `mtvec` changing only after the clock edge, while `t1` captures the pre-write value (`register_5 = 0x0`; `t1` (x6) appears as `register_5` because `x0` is optimized away during compilation to Verilog), confirming correct read-before-write semantics. With no memory access, no pipeline stalls, and CLINT inactive, all effects complete atomically within one cycle, matching the RISC-V CSRRW specification and the behavior implemented in [CSR.scala](https://github.com/sysprog21/ca2025-mycpu/blob/main/2-mmio-trap/src/main/scala/riscv/core/CSR.scala).
## Progress of 3-pipeline
### Exercise Overview
The work in this stage focused on transforming the earlier single-cycle and MMIO-trap designs into a fully functional pipelined CPU with hazard detection, forwarding, control-flow recovery, and correct MMIO/interrupt interaction.
The first part of the exercises completed the remaining **ALU operations** so that the EX stage could handle the full RV32I arithmetic and logical instruction set. Once the datapath was complete, attention shifted to eliminating unnecessary stalls by introducing forwarding paths.
**Data forwarding** into the EX stage resolved most RAW hazards, while additional forwarding into the ID stage enabled early branch comparison and reduced the branch penalty to one cycle.
With forwarding in place, the next step was to develop **hazard detection logic** to identify the few remaining cases where forwarding is insufficient—especially load-use hazards and jump/branch dependencies that rely on values not yet available in EX or MEM. This hazard unit determines when to stall the pipeline and when to flush mis-fetched instructions, ensuring correctness under all dependency patterns.
A final piece of pipeline control involved **implementing flush and stall behavior** in the pipeline registers. Flushes inject a NOP when a branch or jump is taken, removing wrong-path instructions, while stalls freeze IF and ID to introduce bubbles when operands are not yet ready. Together with early branch resolution, full EX/ID forwarding, and load-use stalling, these mechanisms create a robust 5-stage pipeline capable of handling all major hazard types in an optimized manner.
---
### Test Cases Summary
**1. Pipeline Configuration Definitions**
A set of pipeline configurations—ranging from a simple 3-stage version to various 5-stage versions with different hazard-handling strategies—provides a common testing framework. These definitions unify how each pipeline variant is constructed and how expected outputs (such as the value of `x1` in hazard tests) are checked across all designs. This allows identical test suites to evaluate correctness and behavior under different architectural tradeoffs.
See the full configuration in [PipelineConfigs.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/3-pipeline/src/test/scala/riscv/PipelineConfigs.scala)
---
**2. Full Program Execution Tests**
Running complete assembly programs on each pipeline configuration verifies that the whole CPU—fetch, decode, execute, memory, writeback, hazards, MMIO, and interrupt logic—works coherently under realistic workloads.
* The Fibonacci program stresses recursion, stack usage, and return-address handling.
* Quicksort heavily exercises data hazards, control-flow hazards, and repeated memory operations.
* Byte-access tests ensure correctness of subword memory instructions across pipeline boundaries.
These tests confirm that the pipelined system behaves identically to a correct RV32I implementation despite internal pipeline stalls and flushes.
**Basic Hazard Resolution Test**
A compact hazard program checks fundamental RAW and load-use behaviors. The final value of a specific register and selected memory locations are compared against expected values for each pipeline configuration. This test highlights how pure-stall pipelines differ from forwarding pipelines in performance while validating that all configurations remain architecturally correct.
**Extended Hazard Test Suite**
A comprehensive suite systematically evaluates every hazard class: WAW update ordering, store-to-load forwarding, load-to-store propagation, multi-cycle load chains, branch RAW hazards under early ID-stage resolution, CSR dependencies, writeback-stage forwarding, and long dependency sequences that stress forwarding paths. A cycle counter is also checked to ensure that no deadlock or livelock occurs. This suite validates that the pipeline handles hazards cleanly in all complex scenarios.
**Interrupt and Trap Handling**
Trap and interrupt tests confirm correct pipeline recovery under asynchronous events. When an interrupt or exception occurs, the pipeline must flush in-flight instructions, save state to CSRs (`mepc`, `mcause`, `mstatus`), and jump to the handler at `mtvec`. After executing the handler, `mret` must correctly restore machine state and resume execution. Memory and register checks after return validate that the pipeline returns to normal operation cleanly.
See the full test cases in [PipelineProgramTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/3-pipeline/src/test/scala/riscv/PipelineProgramTest.scala)
---
**3. Stall/Flush Behavior of Pipeline Registers**
Unit tests on the pipeline register primitive verify the correctness of the fundamental control mechanisms. The tests confirm that a stall holds the previous value, a flush overwrites the entry with a NOP, and normal operation passes through the next instruction. This validates the basic building block upon which all pipeline control logic depends.
See the full test logic in [PipelineRegisterTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/3-pipeline/src/test/scala/riscv/PipelineRegisterTest.scala)
---
**4. UART MMIO Tests Under the Pipeline**
UART interaction tests ensure that MMIO behavior remains correct even when multiple instructions overlap in flight. The program performs transmit and receive operations, and the testbench tracks transmitted bytes and status flags. At program completion, signature values written to memory are inspected to ensure every UART subtest passed. This confirms correct ordering of MMIO accesses and compatibility between the pipelined CPU and long-latency devices.
See the full test implementation in [PipelineUARTTest.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/3-pipeline/src/test/scala/riscv/PipelineUartTest.scala)
---
**5. Unified System Test Harness**
A unified test harness connects the pipelined CPU to ROM, RAM, MMIO devices, and debug ports. This infrastructure allows all pipeline configurations to be tested uniformly and makes it possible to inspect internal state such as registers, CSRs, memory contents, and instruction addresses.
See the unified test system in [TopTestModule.scala](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/3-pipeline/src/test/scala/riscv/TestTopModule.scala)
---
### Encountered Problems
**Invalid Read/Write Address**
When running the Quicksort program with the `make sim` command, repeated warnings appeared for invalid memory accesses such as `0x00400000` and `0x1ffffffc` at the beginning of the simulation.
```diff
-invalid read address 0x00400000
-invalid read address 0x00400000
-invalid read address 0x1ffffffc
-invalid read address 0x1ffffffc
-invalid read address 0x00400000
-invalid read address 0x00400000
Simulation progress: 1%
Simulation progress: 2%
Simulation progress: 3%
Simulation progress: 4%
Simulation progress: 5%
Simulation progress: 6%
```
The invalid read-address warnings during Quicksort simulation are caused by how the simulator unconditionally interprets the `io_memory_bundle_address` signal as a data-memory access every cycle. In the CPU, this address comes directly from the ALU result, and is reused across many instruction types such as arithmetic operations, comparisons, and address calculations, not only load/store instructions.
However, in `sim.cpp`, the simulator calls
```cpp=
memory->read(top->io_memory_bundle_address)
```
on every iteration of the main loop, without checking whether the current instruction is actually performing a data-memory read. As a result, whenever the ALU computes a value outside the implemented RAM range (for example, a large arithmetic result unrelated to memory), the simulator still treats it as a memory address and reports it as an “invalid read address”, even though the CPU is not logically accessing memory at that moment.
This means the warnings are not indicating real architectural bugs in the core, but rather a mismatch in the interface contract: the C++ simulation side assumes “ALU result == active memory address every cycle”, whereas the actual CPU design uses that bus only when executing load/store instructions.
---
### Hazard Detection Summary
**Q1. Why do we need to stall for load-use hazards?**
A load instruction produces its result only in the **MEM stage**, because the data must be read from memory.
The consumer instruction, however, needs the value in the **EX stage** on the very next cycle.
Since forwarding cannot provide data that has not yet been fetched, the pipeline must **insert one bubble** to wait for the load result.
Without this stall, the consumer would read an **incorrect (stale)** register value.
**Q2. What is the difference between *stall* and *flush*?**
| Operation | Meaning | Effect on Pipeline |
| --- | --- | --- |
| **Stall** | Freeze pipeline movement | PC and IF/ID do **not advance**; a bubble is inserted into ID/EX |
| **Flush** | Discard an instruction | The target pipeline register is overwritten with a **NOP** |
**Stall = stop moving forward**,
**Flush = erase an instruction**.
**Q3. Why must a jump instruction stall when it has register dependencies?**
Jump/branch instructions often need **rs1** (e.g., JALR) or need to compare **rs1/rs2** (for branches).
If the correct register value is still in EX or MEM and not yet forwarded to ID, then ID will compute the **wrong next PC**.
Jumping to an incorrect address would corrupt program flow.
Therefore, the pipeline must **stall until the operand can be forwarded or written back safely**.
**Q4. Why is the branch penalty only 1 cycle in this CPU design?**
This implementation resolves branches **in the ID stage**, not in EX.
As a result:
- Only the instruction currently in **IF/ID** is from the wrong control path.
- Only **one flush** (IF stage) is required.
If branch resolution occurred in EX, both IF and ID instructions would be wrong, leading to a **2-cycle penalty**.
Early resolution in ID **reduces branch penalty to 1 cycle**.
**Q5. What happens if hazard detection logic is removed?**
Without hazard detection:
- Load-use hazards would cause the consumer to read **incorrect values**.
- Branch/jump instructions would use **wrong operands** and jump to the **wrong PC**.
- Pipeline flow would become inconsistent.
The processor might continue to execute instructions but would produce **incorrect results** and behave unpredictably.
Hazard detection is essential for **functional correctness**.
**Q6. Summarize all conditions that require a stall.**
A stall is required when **any of the following is true**:
1. **EX-stage hazard (load-use or jump dependency)**
- `rd_ex != x0`
- ID-stage instruction uses this register (`rs1_id` or `rs2_id`)
- And **either**:
- The EX-stage instruction is a **load** (`memory_read_enable_ex = 1`)
- OR the ID-stage instruction is a **jump** (`jump_instruction_id = 1`)
2. **MEM-stage load hazard for jumps**
- ID-stage instruction is a **jump**
- MEM-stage instruction is a **load**
- `rd_mem != x0`
- ID-stage instruction needs that value (`rd_mem == rs1_id` or `rs2_id`)
**When do we perform a flush?**
A flush is required when:
- **A branch or jump is taken** in ID (`jump_flag = 1`)
- The instruction in IF/ID is along the **wrong path**, so IF is flushed with a NOP.
---
### Chisel and Waveform Analysis
* **Forwarding (Running quicksort.asmbin)**
* Forwarding type encoding
```scala=
object ForwardingType {
  val NoForward      = 0.U(2.W)
  val ForwardFromMEM = 1.U(2.W)
  val ForwardFromWB  = 2.U(2.W)
}
```
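Using this encoding, the per-operand decision reduces to a priority check. The sketch below models the forwarding unit's selection for one EX-stage source register as plain Scala with `Int` constants mirroring `ForwardingType`; the port names are assumptions, and this is a software model rather than the Chisel source:

```scala
val NoForward      = 0
val ForwardFromMEM = 1
val ForwardFromWB  = 2

// MEM-stage data is newer than WB-stage data, so it wins the priority check;
// x0 is never forwarded because writes to it are discarded.
def forwardSelect(rsEx: Int,
                  rdMem: Int, regWriteMem: Boolean,
                  rdWb: Int,  regWriteWb: Boolean): Int =
  if (regWriteMem && rdMem != 0 && rdMem == rsEx) ForwardFromMEM
  else if (regWriteWb && rdWb != 0 && rdWb == rsEx) ForwardFromWB
  else NoForward
```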
* Forwarding to ex_stage

* Forward from Memory Access Stage
In the waveform, a clear MEM-to-EX forwarding scenario is observed. An instruction in the MEM stage has `reg_write_enable_mem = 1` and a destination register (`rd_mem`) that matches (both values are 05, i.e. `t0`) one of the source registers (`rs1_ex`, read in the ID stage) of the instruction currently in the EX stage. At this moment, the forwarding unit asserts `io_reg1_forward_ex = 1` (`ForwardFromMEM`), indicating that the ALU operand of the EX-stage instruction (`addi t0,t0,1052`) is not taken from the register file but bypassed directly from the EX/MEM pipeline register, which holds the result of `auipc t0, 0x0`. In the waveform, this behavior aligns with sequences such as `auipc t0, 0x0` followed immediately by an instruction that consumes `t0`, where the updated value has not yet been written back. The priority logic in the forwarding unit ensures that MEM-stage data takes precedence over WB-stage data, correctly resolving the 1-cycle RAW hazard.

* Forward from Write Back Stage
In this waveform, a WB-to-EX forwarding case is visible. Here, the destination register (`rd_wb`) of the instruction `addi t1,t1,1044` in the WB stage matches the source register (`rs2_ex`, read in the ID stage) of the instruction `bgeu t0,t1,16` in the EX stage (`bgeu` had stalled in ID for one cycle), while no matching write exists for the instruction in the MEM stage (a NOP due to the stall). Under these conditions, the forwarding logic sets `io_reg2_forward_ex` to 2 (`ForwardFromWB`). This reflects a 2-cycle-old dependency, where the value is available in the MEM/WB pipeline register but has not yet been read from the register file. The waveform shows this behavior during instruction sequences where a register is produced, followed by one intervening instruction, and then consumed. The forwarding unit correctly falls back to WB forwarding only when MEM forwarding is not applicable, preserving correctness and recency of data.
## Adapt HW2’s RSqrt to 3-pipeline
### Setup & Integration
For adapting Homework 2’s RSqrt code to the 3-pipeline myCPU, I first focused on making the program compatible with the Chisel-based test harness. The original HW2 program was designed for an rv32emu bare-metal environment with outputs printed via `ecall`, while the pipeline CPU testbench inspects memory directly. To fit this flow, I modified the RSqrt program so that each test input writes its result and its measured cycle count into specific memory locations. In the assembly `main`, each call to `fast_rsqrt` stores the returned value to an address tracked by `l` (starting from 4), and writes the low and high 32-bit words of the elapsed cycle count to an address tracked by `j` (starting from 8), then increments both pointers by 12 for the next test case. This mirrors the structure of the original C test harness in `rsqrt-c.c` but replaces host I/O with memory-mapped result slots suitable for Chisel tests.
See the full implementations of [rsqrt-asm.S](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/3-pipeline/csrc/rsqrt-asm.S) and [rsqrt-c.c](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/main/3-pipeline/csrc/rsqrt-c.c)
To integrate this into the 3-pipeline project, I added a dedicated RSqrt test to `PipelineProgramTest.scala` that loads `rsqrt-asm.asmbin` and then repeatedly steps the clock to allow the program to complete. The test uses the debug memory port to read back the results and cycle counts from the agreed-upon addresses. I also verified that simulation can be run standalone with Verilator using:
```bash
make sim SIM_ARGS="-instruction src/main/resources/rsqrt.asmbin" SIM_VCD=rsqrt_trace.vcd
```
which produces a VCD trace for inspecting the internal waveforms of the pipelined CPU during RSqrt execution.
---
### RSqrt Test Case in `PipelineProgramTest.scala`
In the 3-pipeline test suite, the RSqrt test is written to exercise all pipeline configurations uniformly. For each pipeline configuration, the test loads the `rsqrt-asm.asmbin` program and then runs a long stepping loop:
```scala
for (i <- 1 to 500) {
  c.clock.step(1000)
  c.io.mem_debug_read_address.poke((i * 4).U)
}
```
This ensures the program has enough time to finish computing all RSqrt test vectors and storing their results and cycle counts into memory.
After warm-up, the test iterates through a fixed set of input–expected-output pairs:
```scala
val rsqrtVectors: Seq[(UInt, UInt)] = Seq(
1.U -> 65536.U,
4.U -> 32768.U,
16.U -> 16384.U,
20.U -> 14654.U,
30.U -> 11965.U,
100.U -> 6553.U,
120.U -> 5982.U,
130.U -> 5747.U,
0.U -> 0xffffffffL.U, // x=0 → “infinity” sentinel
0xffffffffL.U -> 1.U // x=2^32-1 → y=1
)
```
The test expects the RSqrt results to be stored at addresses `4, 16, 28, ...` (starting from `i = 4`, incremented by 12 each case), and the corresponding elapsed cycle counts to be stored at `8, 20, 32, ...` (starting from `j = 8`, also incremented by 12). For each vector, the test reads the result word, checks it against the expected fixed-point value, then reads the cycle count word and prints it to the console. This structure directly matches how the assembly `main` writes its outputs, allowing the same program to be used both for correctness testing and performance profiling.
Testing memory address increment logic is here:
```scala=
{
  ...
  var i = 4
  var j = 8
  var k = 16
  println("========== rsqrt() test start =========")
  println("")
  for ((input, expected) <- rsqrtVectors) {
    c.io.mem_debug_read_address.poke(i.U)
    c.clock.step()
    c.io.mem_debug_read_data.expect(expected, s"rsqrt($input) should be $expected")
    val result = c.io.mem_debug_read_data.peek().litValue
    println(f"rsqrt(${input.litValue}%d) = ${result}%d")
    c.io.mem_debug_read_address.poke(j.U)
    c.clock.step()
    val elapse_cycle = c.io.mem_debug_read_data.peek().litValue
    println(f"  Elapsed cycles: $elapse_cycle%d")
    println("")
    i += 12
    j += 12
  }
  ...
}
```
---
### Assembly Adaptation for Pipeline-Friendly RSqrt
When translating the HW2 RSqrt logic into `rsqrt-asm.S`, I took care to make the hot path of `fast_rsqrt` as pipeline-friendly as possible. The function begins with a short prologue and two early-exit branches for the edge cases `x == 0` and `x == 1`, so that these rare conditions are handled quickly and separately. Once past these checks, the common path computes `exp = 31 - clz(x)`, loads an initial estimate from `rsqrt_table[exp]`, and sets up a normalization factor `(1u << exp)` in `s2`, keeping all of this setup tightly packed to reduce pipeline bubbles.
The interpolation branch for non–power-of-two inputs is also arranged so that the hot case (`x` not requiring interpolation) falls through into a straight-line segment labeled `newton_two_iters`, while the more complex interpolation logic is off to the side under `interp_entry`. The Newton iterations themselves are written as two unrolled iterations in straight-line assembly with calls to `mul32`, minimizing loop-control branches inside the numerically heavy core. This structure keeps the main RSqrt path branch-light and predictable: once the input has passed the edge checks and any optional interpolation, the pipeline sees a mostly linear sequence of ALU and multiply calls with minimal control hazards.
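To make the numeric core concrete, here is a plain-Scala model of one Q16.16 Newton step for `1/sqrt(x)` (a sketch under my reading of the algorithm; it is not the assembly itself, and `RsqrtModel`/`newtonStep` are hypothetical names):

```scala
// Q16.16 fixed point: 1.0 == 65536. x is a plain unsigned integer input,
// y approximates 65536 / sqrt(x). One Newton step:
//   y' = y * (3 - x * y^2) / 2
object RsqrtModel {
  val One: Long = 1L << 16

  def newtonStep(x: Long, y: Long): Long = {
    val y2  = (y * y) >> 16   // y^2, back in Q16.16
    val xy2 = x * y2          // integer x times Q16.16 stays Q16.16
    val t   = 3 * One - xy2   // 3 - x*y^2 in Q16.16
    (y * t) >> 17             // multiply by y, then halve (>>16 and >>1)
  }
}

// At the true root the step is a fixed point: rsqrt(4) = 0.5 = 32768
println(RsqrtModel.newtonStep(4, 32768))  // 32768
```

Two such unrolled steps starting from a coarse LUT estimate are enough to land within a few ULPs of the true Q16.16 value, which is why the assembly keeps them in straight-line code.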
At the program level, the `main` loop over test inputs is similarly structured to have a simple, predictable control flow: it loads the next test input from a static table, calls `get_cycles`, calls `fast_rsqrt`, stores the result, calls `get_cycles` again, and stores the cycle difference. The loop index and address pointers are updated with straightforward arithmetic, keeping branches simple and easily resolved in the ID stage of the pipeline.
---
### Performance Comparison: `rsqrt_O0` vs `rsqrt_O0_improved` on myCPU
To quantify the benefit of pipeline-oriented tuning, I compared the cycle counts recorded by the myCPU pipeline for two versions of the RSqrt benchmark:
* `rsqrt_O0`: the direct compilation of `rsqrt-c.c` with `-O0`, using the original C structure and unoptimized control flow.
* `rsqrt_O0_improved`: the hand-tuned `rsqrt-asm.S` version integrated into the pipeline test, using the memory layout and structure just described.
Both versions use the same measurement strategy: they call `get_cycles()` before and after `fast_rsqrt(x)` and store the elapsed cycles into memory for each of the ten test inputs. On the pipelined myCPU, the improved assembly consistently shows lower cycle counts per RSqrt call compared to the `-O0` C version. See the comparison below:
| Pipeline CPU | O0 Average Cycles | O0_Improved Average Cycles | Improvement (%) |
| ----------------------- | ----------------- | -------------------------- | --------------- |
| three-stage | **4061.8** | **3979.8** | **2.0188%** |
| five-stage (final) | **4622.9** | **4524.2** | **2.1350%** |
| five-stage (forwarding) | **4601.3** | **4506.5** | **2.0603%** |
| five-stage (stall) | **7212.8** | **7067.4** | **2.0159%** |
The full test results are in [performance](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/tree/main/3-pipeline/performance).
The reduction comes from several sources:
* The assembly eliminates some of the compiler-generated prologue/epilogue overhead and redundant loads/stores that appear at `-O0`.
* The control flow is reorganized to keep the hot path as straight-line as possible, reducing branch penalties and control hazards.
* Arithmetic sequences are ordered to help forwarding paths and minimize stalls caused by back-to-back dependencies.
The trend in the per-input cycle measurements shows a clear improvement, especially for nontrivial inputs where the interpolation and Newton iterations dominate runtime. The improved `rsqrt-asm.S` thus serves as a concrete example of how algorithm-aware scheduling and branch layout can better exploit the 5-stage pipeline’s forwarding and hazard mechanisms.
---
### Performance Comparison: `rsqrt_O0_improved` vs `perfCount-Ofast_inline_improve` (rv32emu Bare-Metal)
I also compared the cycle counts from the **pipelined myCPU** running `rsqrt_O0_improved` against the performance numbers from Homework 2's `perfCount-Ofast_inline_improve`, which is the same RSqrt algorithm compiled with aggressive `-Ofast` optimization and inlined helpers on an **rv32emu bare-metal environment**. In both cases the algorithm is essentially the same (LUT + interpolation + two Newton iterations), but the underlying execution engines differ significantly. See the comparison below:
* $\text{diff\%} = \frac{\text{O0\_Improved} - \text{Ofast}}{\text{O0\_Improved}} \times 100$
| Input value | rsqrt(input) | O0_Improved Cycles | Ofast_inline_improve Cycles | Difference (%) |
| ----------- | ------------ | ------------------ | --------------------------- | -------------- |
| 1 | 65536 | 68 | 19 | 72.06% |
| 4 | 32768 | 3992 | 1387 | 65.27% |
| 16 | 16384 | 3992 | 51 | 98.72% |
| 20 | 14654 | 6400 | 317 | 95.05% |
| 30 | 11934 | 6304 | 67 | 98.94% |
| 100 | 6553 | 6137 | 65 | 98.94% |
| 120 | 5982 | 6521 | 65 | 99.00% |
| 130 | 5736 | 6582 | 65 | 99.01% |
| 0 | 0 | 60 | 18 | 70.00% |
| 4294967295 | 1 | 5186 | 66 | 98.73% |
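The Difference (%) column follows directly from the formula above; for example, the first row (values taken from the table) works out as:

```scala
// diff% = (O0_Improved - Ofast) / O0_Improved * 100
def diffPercent(o0Improved: Long, ofast: Long): Double =
  (o0Improved - ofast).toDouble / o0Improved * 100.0

println(f"${diffPercent(68, 19)}%.2f%%")  // 72.06%, matching the first row
```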
On rv32emu, the “cycle” counts reflect an emulator’s model of instruction timing rather than a physical 5-stage pipeline with real hazards. The emulator executes instructions in a simple, idealized order, without modeling penalties for mispredicted branches, load-use hazards, or MMIO latencies to the same degree as a hardware-style pipeline. In contrast, the myCPU pipeline’s cycle counts include all the real microarchitectural costs: stalls for load-use hazards, one-cycle penalties for taken branches resolved in ID, and realistic timing of CSRs and memory operations.
At a high level, the observation is that `rsqrt_O0_improved` is already tuned to make good use of the myCPU pipeline, but its cycle profile is still shaped by hardware realities such as data dependencies and control hazards. In contrast, the `perfCount-Ofast_inline_improve` results on rv32emu demonstrate how far a compiler can go under an idealized timing model, but they do not directly capture the impact of pipeline-specific behavior. Together, these measurements provide a useful perspective on how microarchitectural details influence the practical performance of the same numerical kernel.
---
### Problems Encountered
**Unexpected Testing Behavior**
During early integration, the `make test` result showed unexpected behavior: the memory locations probed by the testbench did not yet contain the expected results, leading to mismatches in the Chisel tests.
```bash
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
========== rsqrt() test start =========
rsqrt(1) = 65536
Elapsed cycles: 68
rsqrt(4) = 32768
Elapsed cycles: 3571
rsqrt(16) = 16384
Elapsed cycles: 3571
rsqrt(20) = 0
[info] - should calculate rsqrt() numbers *** FAILED ***
[info] io_mem_debug_read_data=0 (0x0) did not equal expected=14654 (0x393e): rsqrt(UInt<5>(20)) should be UInt<14>(14654) (lines in PipelineProgramTest.scala: 59, 56, 56, 29, 21, 18) (PipelineProgramTest.scala:59)
```
The root cause was that the simulation did not run long enough for the RSqrt program to complete all ten test cases, and the memory layout in the assembly program initially did not exactly match the addresses assumed by the Scala test.
* Initially, the stepping loop ran the rsqrt program for only 50,000 cycles
```scala=
for (i <- 1 to 50) {
  c.clock.step(1000)
  c.io.mem_debug_read_address.poke((i * 4).U)
}
```
After I extended the stepping loop to allow sufficient runtime, the `make test` result became consistent. The RSqrt outputs appeared at the correct addresses, the cycle counts updated as expected, and the RSqrt test passed across all pipeline configurations.
* After modifying the loop to run 500,000 cycles over the rsqrt program
```scala=
for (i <- 1 to 500) {
  c.clock.step(1000)
  c.io.mem_debug_read_address.poke((i * 4).U)
}
```
* The `make test` result then matches expectations
```
[info] Five-stage Pipelined CPU with Reduced Branch Delay
========== rsqrt() test start =========
rsqrt(1) = 65536
Elapsed cycles: 68
rsqrt(4) = 32768
Elapsed cycles: 3992
rsqrt(16) = 16384
Elapsed cycles: 3992
rsqrt(20) = 14654
Elapsed cycles: 6400
rsqrt(30) = 11965
Elapsed cycles: 6304
rsqrt(100) = 6553
Elapsed cycles: 6137
rsqrt(120) = 5982
Elapsed cycles: 6521
rsqrt(130) = 5747
Elapsed cycles: 6582
rsqrt(0) = 4294967295
Elapsed cycles: 60
rsqrt(4294967295) = 1
Elapsed cycles: 5186
.
.
.
[info] Run completed in 7 minutes, 16 seconds.
[info] Total number of tests run: 33
[info] Suites: completed 3, aborted 0
[info] Tests: succeeded 33, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
### Chisel and Waveform Analysis
* **Control (Hazard Detection)**

Although the waveform initially appears to show a correct load-use hazard response when `lw a0, 16(a5)` is in the ID stage, a closer inspection reveals that this stall is actually **spuriously triggered** rather than caused by a true data dependency. In the observed cycle, the preceding instruction in the EX stage is `lw a6, 12(a5)`, whose destination register is `a6` (x16). The ID-stage instruction `lw a0, 16(a5)` semantically uses only **one source register**, `rs1 = a5`; there is no architectural `rs2` for an I-type load. However, in my current implementation, the hazard detection logic compares `rd_ex` against **both `rs1_id` and `rs2_id` unconditionally**. Because `rs2_id` is wired directly from instruction bits `[24:20]`, it incorrectly reflects `imm[4:0]` for I-type instructions. In this case, the immediate value `16` (`0x10`) coincidentally equals the register index of `a6` (`x16`), causing the condition `(rd_ex == rs2_id)` to evaluate true.
As a result, the control unit asserts `io_pc_stall`, `io_if_stall`, and `io_id_flush`, inserting a bubble/NOP even though no real load-use dependency exists. This behavior strictly follows the implemented hazard-detection specification in `Control.scala`, but it highlights a **decode-level limitation**: the control logic is unaware of whether the ID-stage instruction actually *uses* `rs2`. Correcting this requires gating the `rs2` comparison with an instruction-type signal (I added a `uses_rs2` signal for this purpose in `InstructionDecode.scala` and `Control.scala`) or forcing `rs2_id` to zero for I-type instructions, thereby eliminating false-positive stalls while preserving correct handling of genuine load-use hazards.
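A plain-Scala model of the gated check (a sketch only; the real logic lives in `Control.scala`, and the parameter names here are illustrative) shows how gating by operand usage removes the false positive:

```scala
// Load-use hazard check with the rs1/rs2 comparisons gated by
// "does the ID-stage instruction actually use this operand".
def loadUseStall(rdEx: Int, isLoadEx: Boolean,
                 rs1Id: Int, usesRs1: Boolean,
                 rs2Id: Int, usesRs2: Boolean): Boolean = {
  val rs1Hazard = usesRs1 && rs1Id != 0 && rs1Id == rdEx
  val rs2Hazard = usesRs2 && rs2Id != 0 && rs2Id == rdEx
  isLoadEx && (rs1Hazard || rs2Hazard)
}

// The spurious case from the waveform: lw a6, 12(a5) in EX (rd = x16),
// lw a0, 16(a5) in ID (rs1 = a5 = x15; bits [24:20] decode as 16, but an
// I-type load has no architectural rs2).
val ungated = loadUseStall(16, true, 15, true, 16, true)   // true: spurious stall
val gated   = loadUseStall(16, true, 15, true, 16, false)  // false: no stall
```

A genuine load-use hazard (an R-type consumer whose `rs1` or `rs2` matches the EX-stage load destination) still stalls, since its `uses_*` gate is true.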
I fixed this problem in commit [01d73d4](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/commit/01d73d40ac9b9df3179942e2283fe75a13c4a970). After the fix, the waveform behaves as expected and the unnecessary stall is avoided.

In addition, I found that the `rs1` comparison also needs to be gated when the instruction is `jal`, `auipc`, or `lui`, since these instructions do not use `rs1` and instead treat the same bit positions as immediate bits. I have already made this additional fix in commit [d817f0e](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/commit/d817f0eef37db93f05a4086b4c6f9b710d70d22a).