# Assignment3: Your Own RISC-V CPU
# PART 1: Chisel Bootcamp
In this part, we follow the instructions in [Lab3: Construct a RISC-V CPU with Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx#Lab3-Construct-a-RISC-V-CPU-with-Chisel), work through the [Chisel Bootcamp](https://github.com/sysprog21/chisel-bootcamp), and then describe and enhance the [Hello World in Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx#Hello-World-in-Chisel) example from Lab3.
---
## Learning Summary
> Chisel Bootcamp learning summary
### 1. Introduction to Scala
**Scala** is a modern programming language that combines object-oriented and functional programming on top of the JVM. It is well suited as a host language for embedded **DSLs** (Domain-Specific Languages) such as Chisel.
Mainly Learned:
- The basic Scala syntax (`val`/`var`, `if` expressions, functions, `for` loops, and `List`).
- Both `if` and code blocks `{...}` return values.
- Read and write simple `class` / `object` definitions.
- Understand common patterns like `import` and named parameters with default values.
- A first impression of treating functions as values (passing functions as parameters) and using anonymous functions.
These points make it easier for us to read, write, and understand the Scala code that appears in Chisel modules and testbenches.
### 2. Basics of Chisel
**Chisel** is a DSL embedded in Scala for **hardware construction**: all circuit logic must be described using the hardware types and constructs provided by the Chisel library.
Key Concepts:
| Concept | Description |
|--------|-------------|
| `Module` & `IO` `Bundle` | A `Module` is the basic hardware block, and its interface is defined by an `IO(new Bundle {...})`, which declares typed input and output ports. |
| Hardware types & literals | Chisel provides hardware types such as `UInt`, `SInt`, and `Bool`. Literals use suffixes like `3.U`, `-1.S`, `true.B`, and an optional width such as `8.W` (for example, `0.U(8.W)`). |
| Combinational logic | Pure combinational logic is described with operators (`+`, `-`, `&`, `^`, etc.) and helpers like `Cat`, `Fill`, and `Mux`. Signals are connected with `:=`, and values update immediately without waiting for a clock edge. |
| Control flow in hardware | `when` / `.elsewhen` / `.otherwise` express conditional hardware similar to an `if-else` chain, and `Mux` / `MuxLookup` are hardware multiplexers. Temporary `Wire` values are often used to hold intermediate combinational results or to give a signal a default value before overwriting it in different branches. |
| Sequential logic | Registers are created with `Reg`, `RegInit`, and `RegNext(x)`. `Reg` is a state element updated on the rising clock edge, `RegInit` provides a reset value, and `RegNext(x)` creates a one-cycle delayed version of `x`. |
| ChiselTest basics | The new testing style uses `test(new DUT) {...}` together with `poke`, `expect`, and `clock.step()` to write concise tests. |
---
## Hello World in Chisel
In [Hello World in Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx#Hello-World-in-Chisel), the `Hello` module implements a simple **blinking LED** (controller) circuit. It uses a counter that accumulates until it reaches a specific upper limit, at which point it inverts the LED state, causing the physical LED to alternate between on and off.
### 1. About Description
Specifically, there is a **counter register** that increments every clock cycle. When the counter register reaches a **predefined threshold**, then on the next clock edge the counter resets to zero and the **LED state register** simultaneously inverts its value.
Since the **output port** is directly driven by LED state register, it reflects this value change instantly. Therefore, the blink period is determined by the threshold, and the output signal also toggles in the cycle immediately after the threshold is reached.
The original code from [Lab3](https://hackmd.io/@sysprog/B1Qxu2UkZx#Lab3-Construct-a-RISC-V-CPU-with-Chisel) is shown below:
```scala
import chisel3._

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  // Toggle threshold: half a second at a 50 MHz clock
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W)) // cycle counter
  val blkReg = RegInit(0.U(1.W))  // current LED state
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}
```
Where:
- **CNT_MAX:** A **constant literal** used as the threshold to **control the blinking timing**.
- **cntReg:** A 32-bit **counter register** initialized to `0`, used to track elapsed clock cycles. Normally, it increments by `1` every cycle, but it resets to `0` in the cycle following a match with `CNT_MAX`.
- **blkReg:** A 1-bit **state register** initialized to `0`, which **dictates the on/off status** of the LED. It normally holds its value but updates to `~blkReg` (inverts: 0→1 or 1→0) in the cycle following the match between `cntReg` and `CNT_MAX`.
- **io.led:** The module's **single output port**, driven directly by `blkReg`, and represents the signal that will eventually control the physical LED.
For example, assuming a **50 MHz** clock frequency, the `blkReg` value inverts every `0.5` seconds (the counter runs through `25,000,000` cycles between toggles). Consequently, the full LED cycle (On → Off → On) takes approximately `1` second (two inversions are required to return to the original state), resulting in a **1 Hz blinking LED**.
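The timing arithmetic above can be checked with a few lines of plain Scala. This is only a sketch: the object and method names (`BlinkTiming`, `cyclesPerToggle`, `blinkHz`) are made up for illustration, and it assumes the counter takes the values `0` through `CNT_MAX` inclusive between two toggles, as in the `Hello` module.

```scala
object BlinkTiming {
  // The counter runs 0..cntMax inclusive, so one toggle interval
  // lasts cntMax + 1 clock cycles.
  def cyclesPerToggle(cntMax: Long): Long = cntMax + 1

  // One full on/off period needs two toggles.
  def blinkHz(clockHz: Long, cntMax: Long): Double =
    clockHz.toDouble / (2.0 * cyclesPerToggle(cntMax))
}
```

With `clockHz = 50000000` and `cntMax = 50000000 / 2 - 1`, `cyclesPerToggle` gives 25,000,000 cycles and `blinkHz` gives 1.0 Hz, matching the description.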
---
### 2. About Enhancement
To satisfy the assignment requirement, we extend the original `Hello` blinking LED design into a **3-mode** LED (controller). This simple enhancement introduces:
- An additional **mode input port** (`io.mode`).
- Combinational control logic to decide whether the LED should **blink or stay on**.
- A **multiplexer** (`Mux`) to select different blinking thresholds at runtime.
In other words, instead of a single fixed blinking period, the enhanced design can switch among multiple behaviors according to various modes.
The corresponding enhanced code is shown below:
```scala
import chisel3._

class HelloEnhance extends Module {
  val io = IO(new Bundle {
    val led  = Output(UInt(1.W))
    val mode = Input(UInt(2.W)) // 0: slow blink, 1: fast blink, 2: always on
  })
  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))
  val isBlink = (io.mode === 0.U) || (io.mode === 1.U)
  // Thresholds assume a 50 MHz clock
  val CNT_SLOW = (50000000 / 2 - 1).U // toggle every 0.5 sec
  val CNT_FAST = (50000000 / 4 - 1).U // toggle every 0.25 sec
  // Logic circuit enhancement: select the active threshold with a Mux
  val CNT_MAX = Mux(io.mode === 1.U, CNT_FAST, CNT_SLOW)
  when(isBlink) {
    cntReg := cntReg + 1.U
    when(cntReg === CNT_MAX) {
      cntReg := 0.U
      blkReg := ~blkReg
    }
  }.otherwise {
    // Not a blink mode: hold the counter at zero and keep the LED on
    cntReg := 0.U
    blkReg := 1.U
  }
  io.led := blkReg
}
```
#### 1. Added I/O and Modes
We add a **2-bit** input `io.mode` to select one of the following modes:
- `mode = 0`: **slow blink**.
- `mode = 1`: **fast blink**.
- `mode = 2`: **always on** (no blinking).
#### 2. Mux-based Threshold Selection
The original design uses a fixed `CNT_MAX`. By contrast, in the enhanced version, we define **two thresholds**: `CNT_SLOW` for **slow** blinking and `CNT_FAST` for **fast** blinking. Then we use a `Mux` to select the active threshold:
- If `mode == 1`, choose `CNT_FAST`.
- Otherwise (e.g., `mode == 0`), choose `CNT_SLOW`.
This `Mux` is the core logic circuit used to enhance the behavior, because it **dynamically selects which timing constant** controls the counter.
#### 3. Control Logic for Blink vs. Always-On
We define a boolean condition `isBlink = (mode == 0) || (mode == 1)`. If `isBlink == true`, the circuit behaves like the original `Hello`:
- `cntReg` increments every cycle.
- When `cntReg == CNT_MAX`, it resets to `0` and inverts `blkReg`.
Otherwise (`mode == 2`, and also the unused encoding `mode == 3`), we force `cntReg := 0` (to avoid unnecessary counting) and `blkReg := 1` (so the LED stays on).
#### Summary of Enhancement
In summary, this enhancement transforms the original `Hello` blinking LED into a **3-mode** LED (controller). By adding a **mode input port** and a small amount of **combinational control logic**, this design can dynamically switch between **slow / fast blink** and **always-on** behaviors at runtime.
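To sanity-check the mode behavior without waiting millions of cycles, the register updates can be mirrored in a small plain-Scala model. This is a hedged sketch, not the Chisel design itself: `LedModel`, `State`, `step`, and `run` are illustrative names, the thresholds are passed in as parameters so that tiny values can be used, and the 32-bit wrap-around of `cntReg` is not modelled.

```scala
object LedModel {
  // Register state of the modelled circuit: counter value and LED bit.
  final case class State(cnt: Long, led: Int)

  // One rising clock edge; slowMax/fastMax stand in for CNT_SLOW/CNT_FAST.
  def step(s: State, mode: Int, slowMax: Long, fastMax: Long): State = {
    val isBlink = mode == 0 || mode == 1
    val cntMax  = if (mode == 1) fastMax else slowMax // the Mux
    if (!isBlink) State(0, 1)                         // always on
    else if (s.cnt == cntMax) State(0, s.led ^ 1)     // wrap and toggle
    else State(s.cnt + 1, s.led)
  }

  // LED value seen in each of `cycles` cycles, starting from the reset state.
  def run(mode: Int, cycles: Int, slowMax: Long, fastMax: Long): List[Int] =
    Iterator.iterate(State(0, 0))(step(_, mode, slowMax, fastMax))
      .take(cycles).map(_.led).toList
}
```

For example, with `slowMax = 3` and `fastMax = 1`, `run(0, 8, 3, 1)` toggles every four cycles, `run(1, 8, 3, 1)` toggles every two cycles, and `run(2, 4, 3, 1)` shows the reset value `0` in the first cycle and `1` afterwards.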
---
## Problems Encountered
This section records several major **issues** encountered while running the Chisel Bootcamp notebook inside the Docker container, along with the solutions that were found to work.
### 1. Importing issue
> Encountered in **2.1 Your First Chisel Module**.
**Problem:**
Issues with importing `chisel3.tester` and `chisel3.tester.RawTester.test`.
**Solution:**
Replace the old `imports` with the following:
```scala
import chisel3._
import chisel3.util._
import chiseltest._ // replace `chisel3.tester`
import chiseltest.RawTester.test // replace `chisel3.tester.RawTester.test`
```
This avoids having to downgrade chiseltest from `0.6.+` to `0.5.+`, keeping the environment consistent with Chisel Bootcamp’s expected version range.
### 2. Compatibility issues
> Encountered in **2.1 Your First Chisel Module**.
**Problem:**
The following example is a Chisel `Module`:
```scala
// Chisel Code: Declare a new module definition
class Passthrough extends Module {
val io = IO(new Bundle {
val in = Input(UInt(4.W))
val out = Output(UInt(4.W))
})
io.out := io.in
}
```
When running the cell below:
```scala
// Scala Code: Elaborate our Chisel design by translating it to Verilog
// Don't worry about understanding this code; it is very complicated Scala
println(getVerilog(new Passthrough))
```
The following error occurred:
```
Warning: Verilog generation has json4s compatibility issues, using FIRRTL
java.lang.NoSuchMethodError: 'void org.json4s.FullTypeHints.<init>(scala.collection.immutable.List, java.lang.String)'
```
Furthermore, running the following cell will also result in errors:
```scala
// Scala Code: `test` runs the unit test.
// test takes a user Module and has a code block that applies pokes and expects to the
// circuit under test (c)
test(new Passthrough()) { c =>
c.io.in.poke(0.U) // Set our input to value 0
c.io.out.expect(0.U) // Assert that the output correctly has 0
c.io.in.poke(1.U) // Set our input to value 1
c.io.out.expect(1.U) // Assert that the output correctly has 1
c.io.in.poke(2.U) // Set our input to value 2
c.io.out.expect(2.U) // Assert that the output correctly has 2
}
println("SUCCESS!!") // Scala Code: if we get here, our tests passed!
```
```
java.lang.NoSuchMethodError: 'os.Path os.Path.$div(os.PathChunk)'
```
These errors were caused by **binary incompatibility** (the dependency versions loaded at runtime are inconsistent with the versions expected by the program), due to a **dependency version mismatch**:
1. Verilog / FIRRTL generation depends on a **json4s** version that is **not compatible** with the json4s version loaded by the default kernel (e.g., `json4s-core_2.12-3.6.7`), which leads to `NoSuchMethodError`.
2. **chiseltest** requires a newer **os-lib** package, but the default kernel loads an older version (`os-lib_2.12-0.3.0`), which triggers `NoSuchMethodError` (missing `os.Path.$div`).
Therefore, both `getVerilog(...)` and `test(...)` fail because the runtime environment is **incompatible** with the required Chisel / chiseltest dependency versions.
**Solution:**
Install and switch to a **new Almond kernel** built for **Scala 2.12.17**, then update `source/load-ivy.sc` accordingly:
```
// Replace the original 2.12.10 with a newer Scala version
interp.configureCompiler(x => x.settings.source.value = scala.tools.nsc.settings.ScalaVersion("2.12.17"))
```
Switching to the new kernel helps resolve dependencies more consistently, so the **json4s** and **os-lib** versions match what Chisel and chiseltest need.
> **Note:** All subsequent work in Chisel Bootcamp is carried out under this Scala 2.12.17 kernel and the updated `source/load-ivy.sc`.
### 3. Module Loading Failure in Setup
> Encountered in **2.1 Your First Chisel Module**.
Switching to the new Almond kernel solves compatibility issues, but it immediately leads to another problem in the **Setup** step.
**Problem:**
When running the Setup cell:
```
val path = System.getProperty("user.dir") + "/source/load-ivy.sc"
interp.load.module(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(path)))
```
The following error occurred:
```
cell1.sc:2: object ops is not a member of package ammonite
val res1_1 = interp.load.module(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(path)))
^Compilation Failed
Compilation Failed
```
This happens because the new Almond kernel uses a different runtime environment, so the original Setup cell is no longer compatible.
**Solution:**
Replace the path construction and module loading with:
```
val path = os.pwd / "source" / "load-ivy.sc"
interp.load.module(path)
```
Additionally, before every Setup, clean up stale Almond temp output under `/tmp`:
```
val tmp = os.Path("/tmp")
if (os.exists(tmp)) {
os.list(tmp).filter(p => p.last.startsWith("almond-output")).foreach(p => os.remove.all(p))
}
```
This prevents temporary-file conflicts (e.g., `FileAlreadyExistsException`) when re-running the Setup cell.
---
# PART 2: Single-Cycle CPU
A **Single-Cycle CPU** completes **one full instruction per clock cycle**. In this design, all **five stages** (IF, ID, EX, MEM, and WB) complete **within the same cycle**, which simplifies control logic at the cost of a longer critical path.
In [**Project 1**](https://github.com/Micelearner/ca2025-mycpu/tree/main/1-single-cycle) (`1-single-cycle`), we implemented a basic **RISC-V RV32I single-cycle processor** in Chisel with comprehensive verification.
---
## Exercises Summary
In [`1-single-cycle`](https://github.com/Micelearner/ca2025-mycpu/tree/ca2025-exercise-updates/1-single-cycle), we incrementally completed the basic Single-Cycle RV32I CPU by filling in **9 exercises** across **five phases**.
### Phase 1: Instruction Decode
> Exercises **1–2**, in **InstructionDecoderTest.scala**
**Exercise 1: Immediate Extension (S / B / J)**
We filled in the **S/B/J** immediate bit **reordering** and **sign-extension** so `store/branch/jal` instructions get the correct 32-bit immediate.
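As an illustration of the bit reordering involved, the following plain-Scala sketch reconstructs the three immediates from a 32-bit instruction word according to the RV32I encoding. It is a software model for cross-checking, not the Chisel code from the exercise; the names `Imm`, `bits`, `signExtend`, `immS`, `immB`, and `immJ` are hypothetical.

```scala
object Imm {
  // Sign-extend the low `width` bits of `value` to a 32-bit Int.
  def signExtend(value: Int, width: Int): Int =
    (value << (32 - width)) >> (32 - width)

  // Extract bits [hi:lo] of an instruction word (field width below 32).
  def bits(inst: Int, hi: Int, lo: Int): Int =
    (inst >>> lo) & ((1 << (hi - lo + 1)) - 1)

  // S-type: imm[11:5] = inst[31:25], imm[4:0] = inst[11:7]
  def immS(inst: Int): Int =
    signExtend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)

  // B-type: imm[12] = inst[31], imm[11] = inst[7],
  //         imm[10:5] = inst[30:25], imm[4:1] = inst[11:8], imm[0] = 0
  def immB(inst: Int): Int = signExtend(
    (bits(inst, 31, 31) << 12) | (bits(inst, 7, 7) << 11) |
      (bits(inst, 30, 25) << 5) | (bits(inst, 11, 8) << 1), 13)

  // J-type: imm[20] = inst[31], imm[19:12] = inst[19:12],
  //         imm[11] = inst[20], imm[10:1] = inst[30:21], imm[0] = 0
  def immJ(inst: Int): Int = signExtend(
    (bits(inst, 31, 31) << 20) | (bits(inst, 19, 12) << 12) |
      (bits(inst, 20, 20) << 11) | (bits(inst, 30, 21) << 1), 21)
}
```

For instance, `Imm.immS(0x00A02223)` (the encoding of `sw a0, 4(x0)`) yields `4`, and `Imm.immJ(0xFFDFF06F)` (the encoding of `jal x0, -4`) yields `-4`.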
**Exercise 2: Control Signal Generation (WB source / ALU op sources)**
We filled in the **conditions** for `wbSource/aluOp1Sel/aluOp2Sel` so the WB source and the two ALU operand **muxes** select the correct datapath based on instruction type.
### Phase 2: ALU Control
> Exercise **3**, in **ALUControl.scala**
**Exercise 3: Opcode/Funct3/Funct7 → ALU Function**
We filled in the `OpImm/Op` **funct3→ALU-function** mappings (including ADD/SUB, SRL/SRA, and SRLI/SRAI distinguished by `funct7(5)`) so the ALU performs the correct **arithmetic**, **logic** and **shift** operations.
### Phase 3: Execute
> Exercises **4-5**, in **Execute.scala**
**Exercise 4: Branch Comparison Logic (6 types)**
We filled in all **six branch comparisons** (BEQ/BNE, signed BLT/BGE, unsigned BLTU/BGEU) to produce the correct `branchCondition`.
**Exercise 5: Jump/Branch Target Address Calculation**
We filled in the **target address calculations** for `branchTarget = PC + imm` and `jalrTarget = (rs1 + imm) & ~1`, so `branch/jal/jalr` jump addresses are computed correctly.
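The six comparisons and the two target formulas from Exercises 4 and 5 can be summarized in a plain-Scala reference model. This is a sketch under assumptions: `BranchModel` and its method names are hypothetical, operands are modelled as 32-bit `Int` values, and the `funct3` codes follow the standard RV32I encoding.

```scala
object BranchModel {
  // Branch condition for each B-type funct3 encoding.
  def taken(funct3: Int, rs1: Int, rs2: Int): Boolean = funct3 match {
    case 0 => rs1 == rs2                                       // BEQ
    case 1 => rs1 != rs2                                       // BNE
    case 4 => rs1 < rs2                                        // BLT  (signed)
    case 5 => rs1 >= rs2                                       // BGE  (signed)
    case 6 => java.lang.Integer.compareUnsigned(rs1, rs2) < 0  // BLTU (unsigned)
    case 7 => java.lang.Integer.compareUnsigned(rs1, rs2) >= 0 // BGEU (unsigned)
    case _ => false                                            // not a branch
  }

  // Branch and JAL targets are PC-relative.
  def branchTarget(pc: Int, imm: Int): Int = pc + imm

  // JALR target is register-relative with bit 0 cleared.
  def jalrTarget(rs1: Int, imm: Int): Int = (rs1 + imm) & ~1
}
```

The difference between the signed and unsigned variants shows up with operands such as `-1` and `1`: `taken(4, -1, 1)` is true (BLT), while `taken(6, -1, 1)` is false (BLTU), because `-1` is `0xFFFFFFFF` when interpreted as unsigned.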
### Phase 4: Memory Access
> Exercises **6-7**, in **MemoryAccess.scala**
**Exercise 6: Load Data Extension**
We filled in the **LB/LBU/LH/LHU sign/zero extension logic** (using `byte(7)` / `half(15)` as the **sign bit**) to correctly generate `wb_memory_read_data`.
**Exercise 7: Store Data Alignment**
We filled in the **SB/SH write strobes** and **write data alignment** (SB shifts by `mem_address_index << 3`, and SH shifts by 16 depending on `mem_address_index(1)`) to ensure **byte/halfword** stores write to the correct lanes.
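The lane selection for loads and stores can be modelled in plain Scala as follows. This is an illustrative sketch rather than the code in `MemoryAccess.scala`: the object `MemModel` and its methods are hypothetical, `index` plays the role of `mem_address_index` (the low two address bits), and the write strobe is represented as a 4-bit mask instead of a vector of booleans.

```scala
object MemModel {
  // LB/LBU: select the addressed byte, then sign- or zero-extend it.
  def loadByte(word: Int, index: Int, signed: Boolean): Int = {
    val b = (word >>> (index * 8)) & 0xFF
    if (signed) (b << 24) >> 24 else b
  }

  // LH/LHU: select the lower or upper halfword using index bit 1.
  def loadHalf(word: Int, index: Int, signed: Boolean): Int = {
    val h = (word >>> ((index & 2) * 8)) & 0xFFFF
    if (signed) (h << 16) >> 16 else h
  }

  // SB: shift the byte onto its lane; returns (write data, strobe mask).
  def storeByte(data: Int, index: Int): (Int, Int) =
    ((data & 0xFF) << (index << 3), 1 << index)

  // SH: shift by 16 when index bit 1 is set; returns (write data, strobe mask).
  def storeHalf(data: Int, index: Int): (Int, Int) = {
    val upper = (index & 2) != 0
    ((data & 0xFFFF) << (if (upper) 16 else 0), if (upper) 0xC else 0x3)
  }
}
```

With the word `0xdeadbeef`, `loadByte(0xdeadbeef, 0, signed = false)` returns `0xef`, while the signed variant returns `0xffffffef` because bit 7 of the byte is set.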
### Phase 5: WriteBack & Instruction Fetch
> Exercises **8-9**, in **WriteBack.scala** and **InstructionFetch.scala**
**Exercise 8: WriteBack Source Selection (3-way mux)**
In `WriteBack.scala`, we filled in the `MuxLookup(...)` so `regs_write_data` correctly selects among **ALU result**, **`memory_read_data`**, and **(PC + 4)**.
**Exercise 9: PC Update Logic**
In `InstructionFetch.scala`, we filled in the **PC** update `pc := Mux(jump_flag, jump_addr, pc + 4)` (only when `instruction_valid` is true), completing the **sequential vs control-flow** PC selection logic.
---
## Comprehensive Verification
### ChiselTest Unit Tests
These tests cover both **Stage-Level functionality** (IF/ID/EX/RegisterFile) and **Full-Program execution**, which are located in `src/test/scala/riscv/singlecycle/`.
#### a. InstructionDecoderTest
1. **Concise Summary**
This test validates the Instruction Decode (**ID**) stage, ensuring correct decoding of representative RV32I instructions and correct generation of control signals.
2. **Expected Outcome**
All decode cases matched the expected control signals and immediates.
#### b. ExecuteTest
1. **Concise Summary**
This test validates the Execute (**EX**) stage. It checks that the ALU datapath produces correct results for an R-type ADD instruction (100 random test cases), and that the branch logic works correctly for a BEQ instruction by generating the correct `if_jump_flag` and `if_jump_address`.
2. **Expected Outcome**
For randomized ADD operands, `mem_alu_result` always matches `op1 + op2` and `if_jump_flag` stays `0`. For the BEQ instruction, the test confirms that equality sets `if_jump_flag = 1` and redirects to `PC+imm`, while inequality keeps `if_jump_flag = 0` (with the computed target address remaining available on `if_jump_address`).
#### c. CPUTest
1. **Concise Summary**
`CPUTest.scala` provides **integration tests** that validate the **entire Single-Cycle CPU** (IF/ID/EX/MEM/WB) by running 3 real RV32I programs.
2. **How are test program instructions loaded?**
Each test instantiates `TestTopModule(exeFilename)`, which uses a loader (`ROMLoader`) and an instruction ROM (`InstructionROM(exeFilename)`). The loader copies the program from ROM into main memory (`mem`) starting at `Parameters.EntryAddress`. While `load_finished` is `false`, the memory ports are connected to the loader. Once loading completes, the ports are switched to the CPU and `instruction_valid` is asserted so that the CPU begins execution.
3. **Test Cases and Expected Outcome**
- **FibonacciTest**(`fibonacci.asmbin`): runs a recursive Fibonacci(10) function and verifies the final result by reading memory at address `4` (expected `55`).
- **QuicksortTest**(`quicksort.asmbin`): executes Quicksort algorithm on 10 numbers and validates sorted output by reading back the memory (expected `0...9`).
- **ByteAccessTest**(`sb.asmbin`): focuses on byte-level memory operations (SB/LB), and checks correctness by reading specific registers after execution completed (`x5 = 0xdeadbeef`, `x6 = 0xef`, `x1 = 0x15ef`).
#### d. RegisterFileTest
1. **Concise Summary**
The RegisterFileTest (All 3 tests) targets the **RegisterFile** module. In **Test 1**, it checks that a value written to a general-purpose register can be read back correctly. In **Test 2**, it checks that `x0` remains hardwired to zero even if a write is attempted. In **Test 3**, it checks that the design supports write-through (read during write cycle).
2. **Expected Outcome**
- **Test 1**: Writing `0xdeadbeef` to `x1` and reading `x1` afterward returns `0xdeadbeef`.
- **Test 2**: After attempting to write `0xdeadbeef` to `x0`, reading `x0` still returns the value `0x00000000`.
- **Test 3**: After writing `0xdeadbeef` to `x2`, subsequent reads (immediately, during write cycle) return `0xdeadbeef` as expected.
#### e. InstructionFetchTest
1. **Concise Summary**
This test validates the Instruction Fetch (**IF**) stage **program counter** update logic: when `instruction_valid` is asserted, the next PC must be either PC+4 (sequential) or `jump_address_id` (redirection) depending on `jump_flag_id`.
2. **Expected Outcome**
Across randomized sequences of jump and non-jump cycles, `instruction_address` always matches the expected next PC (either the computed `pre + 4` or the forced jump target `0x1000`).
#### Test Result
Running `make test` in `~/ca2025-mycpu/1-single-cycle` produces the following result:

All unit and integration tests completed successfully (all tests passed, 0 failed), indicating correct functionality across decode, fetch, execute, memory access, write-back, and full-program execution.
---
### RISCOF Compliance Tests
**RISCOF (RISC-V Architectural Test Framework)** is a testing framework used to validate RISC-V ISA compliance. Based on the CPU specification we declare (**RV32I** in this project), it selects the corresponding **standard test programs** (from `riscv-arch-test` / `riscv-test-suite`), then uses a **RISC-V toolchain** to assemble each test into an executable.
Each test is executed on both the **DUT** (Design Under Test, i.e., our RISC-V CPU) and a **reference model (rv32emu)**. After execution, both sides produce a **signature** (records the test results), and RISCOF compares the DUT signature against the reference signature to decide whether each test Passed or Failed.
Therefore, when `make compliance` reports that all tests pass, it means our CPU’s behavior matches the reference model for those RV32I instructions and scenarios, demonstrating basic ISA correctness and compliance.
#### Error Encountered
```
Validating RISCOF installation...
Error: riscof not found in PATH
RISCOF (RISC-V Architectural Test Framework) is required for compliance tests.
Installation options:
1. Install via pip:
pip install riscof
2. If using virtualenv:
python -m pip install riscof
# Or use: python -m riscof --version
3. If already installed, add to PATH:
export PATH="$HOME/.local/bin:$PATH"
4. Verify installation:
which riscof && riscof --version
make: *** [../common/build.mk:16: check-riscof] Error 1
```
To fix this, install **riscof** and the **RISC-V toolchain**:
```
sudo apt update
sudo apt install pipx
pipx ensurepath
pipx install riscof
```
```
sudo apt update
sudo apt install gcc-riscv64-unknown-elf
export RISCV=/usr
make compliance
```
#### Test Result
Running `make compliance` in `~/ca2025-mycpu/1-single-cycle` produces the following result:


---
# PART 3: MMIO-Trap CPU
## Comprehensive Verification
### ChiselTest Unit Tests
These tests are located in `src/test/scala/riscv/singlecycle/`.
#### Test Result
Running `make test` in `~/ca2025-mycpu/2-mmio-trap` produces the following result:

The result shows that all tests completed successfully (all tests passed, 0 failed).
---
### RISCOF Compliance Tests
#### Test Result
Running `make compliance` in `~/ca2025-mycpu/2-mmio-trap` produces the following result:


---
## Nyancat Animation
### Render Animation
This processor includes a VGA peripheral for visual output with SDL2 support. Running `make demo` in `~/ca2025-mycpu/2-mmio-trap` produces the following result:

The result shows that Nyancat animation is correctly rendered on the VGA display during Verilator-based simulation.
---
### Compression approach
#### 1. Effective Approach
Use a **Keyframe + Patch** scheme: store a few frames as full keyframes (independently decodable), and encode the other frames as compact “**patches**” that only describe what changed relative to a chosen reference keyframe (or the nearest keyframe).
A patch can be represented as a simple sequence of “**skip unchanged pixels, then write changed pixels**” commands, plus a small per-frame offset table so the decoder can jump to each frame’s patch data quickly.
#### 2. Why can it be more effective?
Nyancat is **highly periodic**, and a frame is not always most similar to the immediately previous one. Referencing a keyframe aligned with the animation cycle can reduce the amount of “changed pixels” that must be described, so patches shrink.
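The “skip unchanged pixels, then write changed pixels” idea can be sketched in plain Scala as below. This is a minimal illustration under assumptions, not the project's actual frame format: frames are modelled as flat `Vector[Int]` pixel arrays of equal length, and `PatchCodec`, `Cmd`, `encode`, and `decode` are hypothetical names. The per-frame offset table mentioned above is omitted.

```scala
object PatchCodec {
  // One patch command: skip `skip` unchanged pixels, then write `pixels`.
  final case class Cmd(skip: Int, pixels: Vector[Int])

  // Describe `frame` as a list of changes relative to the reference keyframe.
  def encode(ref: Vector[Int], frame: Vector[Int]): Vector[Cmd] = {
    require(ref.length == frame.length, "frames must have the same size")
    val out = Vector.newBuilder[Cmd]
    var i = 0
    while (i < frame.length) {
      val runStart = i
      while (i < frame.length && frame(i) == ref(i)) i += 1 // unchanged run
      val skip = i - runStart
      val writeStart = i
      while (i < frame.length && frame(i) != ref(i)) i += 1 // changed run
      if (i > writeStart) out += Cmd(skip, frame.slice(writeStart, i))
    }
    out.result()
  }

  // Rebuild the frame by applying the patch on top of the reference keyframe.
  def decode(ref: Vector[Int], patch: Vector[Cmd]): Vector[Int] = {
    val buf = ref.toArray
    var pos = 0
    for (Cmd(skip, pixels) <- patch) {
      pos += skip
      for (p <- pixels) { buf(pos) = p; pos += 1 }
    }
    buf.toVector
  }
}
```

A frame identical to its reference encodes to an empty patch, and trailing unchanged pixels cost nothing, which is why choosing a reference keyframe close to the target frame keeps patches small.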
---
# PART 4: Pipelined RISC-V CPU
## Comprehensive Verification
### ChiselTest Unit Tests
These tests are located in `src/test/scala/riscv/`.
#### Test Result
Running `make test` in `~/ca2025-mycpu/3-pipeline` produces the following result:

The result shows that all tests completed successfully (all tests passed, 0 failed).
---
### RISCOF Compliance Tests
#### Test Result
Running `make compliance` in `~/ca2025-mycpu/3-pipeline` produces the following result:


---
## Hazard Detection Summary
The answers to the questions below are based on the hazard detection logic implemented in `fivestage_final/Control.scala`.
To inspect the **waveform**, we first modify `Top.scala`, replacing `ThreeStage` with `FiveStageFinal` in `new CPU(implementation = ImplementationType.ThreeStage)`. We then run the Verilator simulation with the test program and open the resulting trace:
```
make sim SIM_ARGS="-instruction src/main/resources/hazard.asmbin"
gtkwave trace.vcd
```
> Commands run in `~/ca2025-mycpu`
### Q1. Why do we need to stall for load-use hazards?
In the **FiveStageFinal** pipeline, a **load-use hazard** occurs when the instruction currently in the EX stage is a **load**, and its destination register matches one of the source registers of the instruction currently in the ID stage.
In this situation, the loaded value is not available immediately: it is produced only after the memory access completes in the MEM stage, while the dependent instruction needs that operand at the beginning of its EX stage. Because of this time gap, **forwarding alone cannot provide the correct value in the same cycle**, so the pipeline must **stall one cycle** to preserve correctness.

As shown in the **waveform figure** above, the hazard condition is visible because the EX stage load has `memory_read_enable_ex = 1`, and the register dependency can be confirmed by checking that `rd_ex == rs2_id`. When a load-use hazard is detected, all three signals are asserted simultaneously: `pc_stall = 1`, `if_stall = 1`, and `id_flush = 1`.
In the next cycle (starting at 53 ps in the waveform), the inserted NOP can be observed, and the pipeline now has enough time for the load result to become ready. After this one cycle delay, the original dependent instruction can finally enter the EX stage and obtain the correct operand via forwarding.
### Q2. What is the difference between "stall" and "flush" operations?
1. **Stall operation**
Prevents the **PC** and/or certain **pipeline registers** from being updated in the next cycle, so the earlier stages keep holding the same instruction and do not advance (waiting for the required data to become ready).
2. **Flush operation**
**Overwrites** certain **pipeline registers** with a **NOP (bubble)** to discard instructions that should not be executed (e.g., wrong-path instructions fetched due to a taken branch or jump). Flush mainly clears pipeline registers, while the **PC** is typically corrected through redirection (jumping to the correct target address).
### Q3. Why does jump instruction with register dependency need stall?
There are **two cases** where a jump instruction with a register dependency needs to stall:
**Case 1:**
In the same cycle, when the **ID stage holds a jump instruction** and the **MEM stage holds a load instruction** whose `rd` matches any source register of the jump instruction, the jump instruction cannot obtain the correct value immediately. Therefore, it must stall and wait until the loaded data becomes available (after MEM/WB), so that it can compute the correct jump target address and redirect the PC correctly, or make a correct control-flow decision.
**Case 2:**
In the same cycle, when the ID stage holds a jump instruction and the `rd` of the instruction in the EX stage matches any source register of the jump instruction, the jump instruction also cannot obtain the correct value immediately. Therefore, it must stall and wait until the correct source operand becomes available. Otherwise, the ID stage cannot make a correct control-flow decision (e.g., branch taken/not-taken) or compute a correct redirect target address (e.g., the JALR target address).

### Q4. In this design, why is branch penalty only 1 cycle instead of 2?
If branch resolution is placed in the EX stage, then when a branch happens, it is necessary to flush the instructions in the earlier IF and ID stages, so the branch penalty is 2 cycles.
However, in this design, branch resolution is **moved earlier to the ID stage**. Thus, when a branch occurs, the pipeline only needs to flush the instruction in the IF stage, so branch penalty is **only 1 cycle instead of 2**, as illustrated in the figure below.

As shown in the waveform figure above, we can observe that the instruction `0x01c3c663` triggers a branch penalty of **only 1 cycle** (a NOP starting at 65 ps in the waveform).
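### Q5. What would happen if we removed the hazard detection logic entirely?
If the hazard detection logic were removed entirely, the processor would suffer from **functional incorrectness**, leading to **corrupted data** and **wrong program flow**. For example, an instruction that depends on a preceding load would enter the EX stage before the loaded value is available and compute with a stale operand, and a jump whose source register is still being produced would redirect the PC to a wrong target.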
### Q6. Stall and flush condition summary
**Stall** is needed when:
1. **EX stage hazard**
- Jump instruction in ID stage **or** load instruction in EX stage.
- Destination register in EX stage is not `x0` and conflicts with ID source registers.
2. **MEM stage load + jump dependency**
- Jump instruction in ID stage **and** load instruction in MEM stage.
- Destination register in MEM stage is not `x0` and conflicts with ID source registers.
**Flush** is needed in both scenarios above (`id_flush` inserts a NOP while the earlier stages are stalled), and also when a **branch/jump resolved in ID triggers a PC redirect** (the instruction already fetched in IF is discarded).
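The stall and flush conditions can be captured in a small plain-Scala decision function, following the two cases described in Q3 and the signals observed in Q1. This is a hedged sketch, not a transcription of `Control.scala`: the `Pipe` and `Ctrl` case classes and their field names are hypothetical, and giving the stall priority over the redirect flush is an assumption of this model.

```scala
object HazardModel {
  // Pipeline fields the hazard unit inspects (field names are illustrative).
  final case class Pipe(
    jumpInId: Boolean, rs1Id: Int, rs2Id: Int, // instruction in ID
    loadInEx: Boolean, rdEx: Int,              // instruction in EX
    loadInMem: Boolean, rdMem: Int,            // instruction in MEM
    jumpTaken: Boolean)                        // branch/jump resolved in ID

  final case class Ctrl(pcStall: Boolean, ifStall: Boolean,
                        idFlush: Boolean, ifFlush: Boolean)

  def control(p: Pipe): Ctrl = {
    // A dependency exists when rd is not x0 and matches an ID source register.
    def dependsOn(rd: Int): Boolean =
      rd != 0 && (rd == p.rs1Id || rd == p.rs2Id)

    val exHazard  = (p.jumpInId || p.loadInEx) && dependsOn(p.rdEx)
    val memHazard = p.jumpInId && p.loadInMem && dependsOn(p.rdMem)

    if (exHazard || memHazard) Ctrl(true, true, true, false) // stall + bubble
    else if (p.jumpTaken) Ctrl(false, false, false, true)    // flush IF only
    else Ctrl(false, false, false, false)
  }
}
```

For a load-use hazard such as a load writing `x6` in EX while the ID instruction reads `x6`, the model asserts `pcStall`, `ifStall`, and `idFlush` together, which matches the three signals seen in the Q1 waveform.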
---
# PART 5: Run code on pipelined RISC-V CPU
In this part, we need to modify the handwritten RISC-V assembly code in [HW2](https://hackmd.io/@Im0gRUsqRbqQEpU005z5PQ/arch2025-homework2) to ensure it functions correctly on the pipelined RISC-V CPU `3-pipeline`.
---
## Related work
Here, we choose `uf8_decode` (in `uf8.s`, which was required to be modified in [HW2](https://hackmd.io/@Im0gRUsqRbqQEpU005z5PQ/arch2025-homework2) and run correctly on rv32emu & bare-metal environment) as a simple example, and adapt it so that it can execute successfully on the `3-pipeline`.
### Handwritten assembly program (`uf8_decodeTest.S`)
First, we prepare a handwritten assembly program (wraps `uf8_decode`) that can run on `3-pipeline`. The corresponding `uf8_decodeTest.S` is shown below:
```
.section .data
.align 4
test_in:
.word 0x21
.section .text
.globl _start
_start:
# a0 = test input
la t0, test_in
lw a0, 0(t0)
# call UF8_decode(a0)
jal ra, UF8_decode
# store result for Scala test to read
sw a0, 4(x0) # result -> mem[4]
li t1, 1
sw t1, 8(x0) # done=1 -> mem[8]
done:
j done
# uint32 UF8_decode(uint32 a0) (input in a0, output in a0)
UF8_decode:
andi t1, a0, 0x0F # m = a0 & 0x0F
srli t2, a0, 4 # e = a0 >> 4
li t3, 1
sll t3, t3, t2 # 1 << e
addi t3, t3, -1 # (1 << e) - 1
slli t3, t3, 4 # offset = ((1 << e)-1) << 4
sll t1, t1, t2 # m << e
add a0, t1, t3 # return (m << e) + offset
jr ra
```
Where:
- **_start**: The **entry point** for program execution, which is specified as the starting address by `csrc/link.lds`.
- **Infinite loop (`done: j done`)**: After the last instruction in the `.text` section executes, if no control-flow change is made, the CPU still fetches the next instruction at `PC+4` according to hardware rules, which may lead to unpredictable behavior. Thus, after writing the test outputs, we keep the PC in a safe location using an **infinite loop**.
- **Test data (`.word 0x21`)**: The test input. After decoding with `uf8_decode`, the expected result is 52 (`0x34`).
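The expected value can be cross-checked with a plain-Scala mirror of the assembly routine. This is only a reference sketch: the object name `Uf8` is made up for illustration, and the extra mask on the exponent assumes the input is a single byte.

```scala
object Uf8 {
  // Mirrors UF8_decode: low nibble is the mantissa, high nibble the exponent.
  def decode(b: Int): Int = {
    val m = b & 0x0F                 // andi t1, a0, 0x0F
    val e = (b >>> 4) & 0x0F         // srli t2, a0, 4 (byte input assumed)
    val offset = ((1 << e) - 1) << 4 // ((1 << e) - 1) << 4
    (m << e) + offset                // (m << e) + offset
  }
}
```

For the input `0x21`, `m = 1` and `e = 2`, so the offset is `3 << 4 = 48` and the result is `4 + 48 = 52`, which is the value the Scala test expects at `mem[4]`.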
Next, we assemble and link `uf8_decodeTest.S` to produce a 32-bit ELF-format RISC-V executable, `uf8_decodeTest.elf`.
```
riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -o uf8_decodeTest.o uf8_decodeTest.S
riscv64-unknown-elf-ld -T linker.ld --oformat=elf32-littleriscv -o uf8_decodeTest.elf uf8_decodeTest.o
```
Then we convert `uf8_decodeTest.elf` into a binary file `uf8_decodeTest.asmbin` via `objcopy`.
```
riscv64-unknown-elf-objcopy -O binary -j .text -j .data uf8_decodeTest.elf uf8_decodeTest.asmbin
```
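To get a feel for what the flat binary contains, the sketch below interprets a byte image as consecutive little-endian 32-bit words, which is how RV32 instructions and `.word` data are laid out. It is only an illustration: the object name `AsmBinWords` and the 8-byte sample image are hypothetical, and whether the test harness loads `.asmbin` files in exactly this way is an assumption.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object AsmBinWords {
  // Interpret a flat binary image as consecutive little-endian 32-bit words.
  def toWords(bytes: Array[Byte]): Seq[Int] = {
    // Pad to a multiple of 4 bytes so a trailing partial word is not dropped.
    val padded = bytes.padTo((bytes.length + 3) / 4 * 4, 0.toByte)
    val buf = ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN)
    Seq.fill(padded.length / 4)(buf.getInt())
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical image: `.word 0x21` followed by `addi x0, x0, 0` (0x00000013).
    val image = Array[Byte](0x21, 0, 0, 0, 0x13, 0, 0, 0)
    toWords(image).foreach(w => println(f"0x$w%08x"))
  }
}
```

In practice, the bytes would come from the generated file, for example via `java.nio.file.Files.readAllBytes`.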
Finally, we place `uf8_decodeTest.asmbin` under `src/main/resources/` so that it can be tested by `PipelineProgramTest.scala`.
### Modify the Scala Test (`PipelineProgramTest.scala`)
Next, we add a new test case to `PipelineProgramTest.scala`. Following the existing `runProgram(...){...}` testing approach, we run `uf8_decodeTest.asmbin` and verify correctness by reading the predefined output locations through the memory debug port.
The added Scala code is shown below:
```scala=
it should "execute uf8_decodeTest and write done/result" in {
  runProgram("uf8_decodeTest.asmbin", cfg) { c =>
    c.clock.setTimeout(0)
    c.clock.step(2000)
    // done flag
    c.io.mem_debug_read_address.poke(8.U)
    c.clock.step()
    c.io.mem_debug_read_data.expect(1.U)
    // result
    c.io.mem_debug_read_address.poke(4.U)
    c.clock.step()
    c.io.mem_debug_read_data.expect(52.U) // 0x21 decode = 52
  }
}
```
Note that:
- We use `c.clock.step(2000)` to provide sufficient cycles for the CPU to complete the full execution. This prevents reading the memory too early, before the stores (writing back the result and the done flag) have occurred.
- In `uf8_decodeTest.S`, after finishing the decode computation, the program stores the results to two memory locations at addresses `0x4` and `0x8` (one for decoded result, one for the done flag). Therefore, in `PipelineProgramTest.scala`, we read these two addresses via the debug port and use `expect(...)` to verify that the program has finished and that the decoded result matches the expected value.
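Since the program already writes a done flag, an alternative to a fixed 2000-cycle wait is to poll that flag and stop as soon as it is set. The sketch below shows the idea in plain Scala with a stand-in memory model, so it runs without the CPU. The helper name `pollUntilDone` and its parameters are hypothetical. In the real test, `stepOneCycle` would presumably wrap `c.clock.step()` and `isDone` would read address `0x8` through the debug port.

```scala
object DonePolling {
  // Advance the simulation one cycle at a time until `isDone` reports true or the
  // cycle budget is exhausted. Returns Some(cycles used) on success, None on timeout.
  def pollUntilDone(maxCycles: Int)(stepOneCycle: () => Unit)(isDone: () => Boolean): Option[Int] = {
    var cycles = 0
    while (cycles < maxCycles && !isDone()) {
      stepOneCycle()
      cycles += 1
    }
    if (isDone()) Some(cycles) else None
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the CPU: a hypothetical model that sets mem(8) = 1 on cycle 37.
    val mem = Array.fill(16)(0)
    var now = 0
    val used = pollUntilDone(2000)(() => { now += 1; if (now == 37) mem(8) = 1 })(() => mem(8) == 1)
    println(used) // prints "Some(37)"
  }
}
```

Polling shortens the simulation when the program finishes early, and the `None` case turns a hung program into an explicit timeout instead of a misleading failed `expect`.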
Finally, we run the following command under `~/ca2025-mycpu`:
```
WRITE_VCD=1 sbt "project pipeline" "testOnly riscv.PipelineProgramTest"
```
The corresponding execution result is shown below:
```
[info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29)
[info] loading project definition from /home/xyh/ca2025-mycpu/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/xyh/ca2025-mycpu/)
[info] set current project to mycpu-pipeline (in build file:/home/xyh/ca2025-mycpu/)
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] - should execute uf8_decodeTest and write done/result
[info] Five-stage Pipelined CPU with Stalling
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] - should execute uf8_decodeTest and write done/result
[info] Five-stage Pipelined CPU with Forwarding
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] - should execute uf8_decodeTest and write done/result
[info] Five-stage Pipelined CPU with Reduced Branch Delay
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] - should execute uf8_decodeTest and write done/result
[info] Run completed in 1 minute, 28 seconds.
[info] Total number of tests run: 28
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 89 s (01:29)
```
The result shows that all 28 tests passed with 0 failures, confirming that `uf8_decodeTest.S` executes successfully on `3-pipeline` (and on the three five-stage configurations as well).
---
> Refer to <[Assignment3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/2025-arch-homework3)>
> Contributed by <[`Micelearner`](https://github.com/Micelearner/ca2025-mycpu)>
***Reference***
- [Assignment3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/2025-arch-homework3)
- [Lab3: Construct a RISC-V CPU with Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx#Lab3-Construct-a-RISC-V-CPU-with-Chisel)
- [Chisel Bootcamp](https://github.com/sysprog21/chisel-bootcamp)
- [Chisel: Cheatsheet](https://www.scribd.com/document/682760217/chisel-cheatsheet)
- [ca2025-mycpu: RISC-V CPU Labs in Chisel](https://github.com/sysprog21/ca2025-mycpu)
- [CS61C: RISC-V Instruction Formats Part 1](https://docs.google.com/presentation/d/1pXcXcBjmUCXFJNFi4DpFLtpyDKsol5SaKfVce1hKdXE/edit?slide=id.p1#slide=id.p1)
- [CS61C: RISC-V Instruction Formats Part 2](https://docs.google.com/presentation/d/1rYJeuDUPezDdk01M8ib3XTo15IGLdsMJclzD4Vv9XGQ/edit?slide=id.p1#slide=id.p1)
- [CS61C: RISC-V 5-Stage Pipeline](https://docs.google.com/presentation/d/1v-Squx8lK-oOrflFOwBZh-ue94seVHZudqDOgJmFf5Q/edit?slide=id.g2fa6143ce8c_0_133#slide=id.g2fa6143ce8c_0_133)
- [RISC-V Assembly Programmer’s Manual](https://github.com/riscv-non-isa/riscv-asm-manual/blob/main/src/asm-manual.adoc)