---
tags: CA2025
---
# Assignment3: Your Own RISC-V CPU
contributed by [<wilson0828>](https://github.com/wilson0828)
>Refer to [Assignment3](https://hackmd.io/@sysprog/2025-arch-homework3)
## Learned from Chisel Bootcamp
### Scala
- Int & UInt
  - Scala `Int`: a number that exists only at elaboration time, while the Scala program is generating the circuit (e.g., loop counters, parameter sizes). Chisel `UInt`: a wire in the actual hardware.
  - The error: `b0 * io.in` does not compile if `b0` is an `Int` and `io.in` is a `UInt`, because there is no operator that multiplies a software number by a hardware wire. The fix: turn the Scala integer into a hardware literal with `.U`, e.g. `b0.U * io.in`.
- Defining functions (`def`)
  - The biggest difference: no `return` keyword is needed. The value of the last expression is returned automatically.
- Variables
  - `val` (value): immutable once defined.
  - `var` (variable): mutable and reassignable.
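
The points above can be put together in one small module. This is only a sketch: the module name `Scale`, its port widths, and the helper `scaled` are made up for illustration and do not come from the bootcamp.

```scala
import chisel3._

// Hypothetical module: multiplies an 8-bit input by a constant chosen at elaboration time.
// `b0` is a Scala Int (software value); `io.in` is a Chisel UInt (hardware wire).
class Scale(b0: Int) extends Module {
  require(b0 >= 0, "an unsigned literal cannot be negative")

  val io = IO(new Bundle {
    val in  = Input(UInt(8.W))
    val out = Output(UInt(16.W))
  })

  // `def` without `return`: the last expression is the result.
  // b0.U converts the Scala Int into a hardware literal so it can meet a UInt.
  def scaled(x: UInt): UInt = b0.U * x

  // `val`: bound once, always names the same hardware node.
  val product = scaled(io.in)

  // io.out := b0 * io.in   // would not compile: Int has no `*` that accepts a UInt
  io.out := product
}
```

Each `Scale(n)` instance becomes a separate circuit with the constant `n` baked in, which is why `b0` can stay an ordinary `Int` until the point where it has to touch a wire.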
### Chisel
- Register Semantics
  - The golden rule: reads happen "now", writes take effect "at the next clock edge".
  - After `reg := io.in`, reading `reg` during the same cycle still returns the old value.
  - Example: `io.out := io.in - delay`, where `delay` is a register that captures `io.in`. This outputs the difference between the current and the previous input, because `delay` still holds the previous value of `in`.
- Control Flow
  - `if`: used by Scala to decide which hardware to build (conditional generation).
  - `when`: used by Chisel to generate multiplexers (MUX) in the hardware.
  - If a register is assigned inside a `when` block and the condition is false, the register automatically holds its old value. You don't need to write `.otherwise { reg := reg }`.
- Decoupled (Ready-Valid) Automation
  - `enqueueNow(data)`: drives the data and `valid` for one cycle and expects `ready` to be high in that same cycle (`enqueue(data)` is the variant that waits for `ready`).
  - `expectDequeueNow(data)`: asserts `ready`, expects `valid` to be high in the current cycle, and checks the data.
  - `enqueueSeq(Seq(...))`: enqueues a whole sequence of data into the module, one element after another.
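
The register and `when` rules above can be sketched as two tiny modules. The module names `Differentiator` and `HoldRegister` and the 8-bit widths are assumptions chosen for illustration.

```scala
import chisel3._

// Hypothetical example 1: output the difference between this cycle's and last cycle's input.
class Differentiator extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(8.W))
    val out = Output(UInt(8.W))
  })
  val delay = RegInit(0.U(8.W))
  delay := io.in            // the write takes effect at the next rising clock edge
  io.out := io.in - delay   // the read still sees the value captured one cycle ago
}

// Hypothetical example 2: a register that only updates while `en` is high.
class HoldRegister extends Module {
  val io = IO(new Bundle {
    val en  = Input(Bool())
    val in  = Input(UInt(8.W))
    val out = Output(UInt(8.W))
  })
  val reg = RegInit(0.U(8.W))
  when(io.en) {             // elaborates to a mux in front of the register
    reg := io.in
  }                         // no .otherwise branch: reg keeps its value when en is low
  io.out := reg
}
```

In `Differentiator`, feeding the inputs 3, 5, 5 on consecutive cycles would give the outputs 3, 2, 0, since the register starts at 0 and always lags the input by one cycle.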
## Encounter Problems in Chisel Bootcamp
### Problem Description 1
When running the Chisel Bootcamp "2.1_first_module.ipynb" notebook in the official Docker container, I ran into errors when importing `chisel3.tester` and `chisel3.tester.RawTester.test`.
- The following error occurred
```
cmd1.sc:3: object tester is not a member of package chisel3
import chisel3.tester._
^cmd1.sc:4: object tester is not a member of package chisel3
import chisel3.tester.RawTester.test
^Compilation Failed
Compilation Failed
```
- Attempts to Fix
After reading the [GitHub issue](https://github.com/sysprog21/chisel-bootcamp/issues/15) and the `load-ivy.sc` file in `/chisel-bootcamp/source`, I found that the environment loads ChiselTest through `` import $ivy.`edu.berkeley.cs::chiseltest:0.6.+` ``, which provides the `chiseltest` package rather than `chisel3.tester`. I therefore replaced `chisel3.tester` with `chiseltest` in the imports, and the cell ran without errors.
### Problem Description 2
When running the Chisel Bootcamp "2.1_first_module.ipynb" notebook in the official Docker container, calling the `getVerilog` helper failed because of a json4s compatibility problem.
- The following error occurred
```
java.lang.NoSuchMethodError: 'void org.json4s.FullTypeHints.<init>(scala.collection.immutable.List, java.lang.String)'
firrtl.annotations.JsonProtocol$.jsonFormat(JsonProtocol.scala:226)
firrtl.annotations.JsonProtocol$.serializeTry(JsonProtocol.scala:259)
firrtl.annotations.JsonProtocol$.serialize(JsonProtocol.scala:239)
firrtl.options.phases.WriteOutputAnnotations.transform(WriteOutputAnnotations.scala:90)
firrtl.options.phases.WriteOutputAnnotations.transform(WriteOutputAnnotations.scala:31)
firrtl.options.phases.DeletedWrapper.internalTransform(DeletedWrapper.scala:38)
firrtl.options.phases.DeletedWrapper.internalTransform(DeletedWrapper.scala:15)
firrtl.options.Translator.transform(Phase.scala:248)
firrtl.options.Translator.transform$(Phase.scala:248)
firrtl.options.phases.DeletedWrapper.transform(DeletedWrapper.scala:15)
firrtl.options.Stage.$anonfun$transform$5(Stage.scala:47)
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
scala.collection.immutable.List.foldLeft(List.scala:89)
firrtl.options.Stage.$anonfun$transform$3(Stage.scala:47)
logger.Logger$.$anonfun$makeScope$2(Logger.scala:137)
scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
logger.Logger$.makeScope(Logger.scala:135)
firrtl.options.Stage.transform(Stage.scala:47)
firrtl.options.Stage.execute(Stage.scala:58)
chisel3.stage.ChiselStage.emitChirrtl(ChiselStage.scala:60)
ammonite.$file.dummy.source.load$minusivy_2$Helper.getVerilog(Main.sc:37)
ammonite.$sess.cmd3$Helper.<init>(cmd3.sc:1)
ammonite.$sess.cmd3$.<init>(cmd3.sc:7)
ammonite.$sess.cmd3$.<clinit>(cmd3.sc:-1)
```
- Attempts to Fix
Initially, I thought I could work around the broken `getVerilog` helper by calling the Chisel API directly: `println(chisel3.stage.ChiselStage.emitVerilog(new Passthrough))`
That worked for a moment, but I quickly ran into a wall. As I continued through the notebook, I kept hitting dependency conflicts, specifically with json4s. The json4s version in my environment was incompatible with the old Chisel version used in the bootcamp, so the code kept crashing with `NoSuchMethodError`.
I realized that patching individual cells wasn't enough and that I had to fix the environment. To match the bootcamp's requirements, I installed an Almond kernel for Scala 2.12.17 and switched the notebook to it: first download the coursier launcher with `wget https://github.com/coursier/launchers/raw/master/coursier -O coursier`, then install the kernel with `java -jar coursier launch almond --scala 2.12.17 -- --install --id scala_2_12_17 --display-name "Scala 2.12.17 (Chisel)"`.
### Problem Description 3
However, after changing the Jupyter kernel to Almond (Scala 2.12.17), the first setup cell failed with an `object ops is not a member of package ammonite` error.
- The following error occurred
```
cmd1.sc:2: object ops is not a member of package ammonite
val res1_1 = interp.load.module(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(path)))
^
Compilation Failed
```
- Attempts to Fix
This happens because newer Almond kernels no longer ship the legacy `ammonite.ops` package and use os-lib instead. I resolved it by switching the script to `os.Path` and replacing the first setup cell with the following code.
```
// Remove leftover entries named almond-output* from /tmp
val tmp = os.Path("/tmp")
if (os.exists(tmp)) {
  os.list(tmp).filter(p => p.last.startsWith("almond-output")).foreach(p => os.remove.all(p))
}
// Load the bootcamp setup script from source/load-ivy.sc
val path = os.pwd / "source" / "load-ivy.sc"
interp.load.module(path)
```
## Hello World in Chisel
```
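import chisel3._

// Blinks an LED by toggling it every CNT_MAX + 1 clock cycles.
class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  // 25,000,000 cycles per half period
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W))  // cycle counter
  val blkReg = RegInit(0.U(1.W))   // current LED state

  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U       // restart the count
    blkReg := ~blkReg   // toggle the LED
  }
  io.led := blkReg
}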
```
- Operation of 'Hello World in Chisel'
The LED starts off. After 25 million clock cycles (rising edges) it turns on and stays on for the next 25 million cycles, then it turns off again, and the pattern repeats. With a 50 MHz clock, which the constant 50000000 suggests, that is one blink per second.
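
The counter-and-toggle behaviour can be checked with a small software model. This is a sketch in plain Scala rather than Chisel; `BlinkModel` and its parameters are hypothetical names, and a tiny `cntMax` stands in for the real 24,999,999 so the result is easy to read.

```scala
object BlinkModel {
  // One loop iteration per clock cycle; returns the LED value seen in each cycle.
  def run(cntMax: Int, cycles: Int): Seq[Int] = {
    var cnt = 0   // models cntReg
    var blk = 0   // models blkReg
    (0 until cycles).map { _ =>
      val led = blk                          // the output shows the current register value
      if (cnt == cntMax) { cnt = 0; blk = 1 - blk }  // the when-branch overrides the increment
      else cnt += 1
      led
    }
  }
}
```

With `cntMax = 3`, `BlinkModel.run(3, 12)` gives four cycles off, four cycles on, and four cycles off, which is the same shape as the real circuit with 25 million cycles per phase.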
```
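import chisel3._

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  val cntReg = RegInit(0.U(32.W))  // cycle counter

  cntReg := cntReg + 1.U
  when(cntReg(31) === 1.U) {       // restart once bit 31 is set
    cntReg := 0.U
  }
  // LED is on only while the slow bit (26) and the fast bit (22) are both high
  io.led := (cntReg(26) & cntReg(22)).asUInt
}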
```
- To enhance the circuit with combinational logic, I implemented a signal gating mechanism using an AND gate.
Unlike a simple continuous toggle, the AND gate requires both the slow counter bit (bit 26) and the fast counter bit (bit 22) to be high at the same time. This creates a distinct rhythm: the LED stays completely dark while bit 26 is low (about 1.3 seconds, assuming the 50 MHz clock from the previous example), then produces a burst of eight rapid flashes while bit 26 is high.
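
The timing of this pattern can be worked out from the bit positions alone. The sketch below is a plain Scala model, not part of the design; `GatedBlink` is a hypothetical name, and the 50 MHz clock is an assumption taken from the constant in the first example.

```scala
object GatedBlink {
  // LED function from the module above: bit 26 AND bit 22 of the counter.
  def led(cnt: Long): Boolean =
    ((cnt >> 26) & 1L) == 1L && ((cnt >> 22) & 1L) == 1L

  // Count off-to-on transitions over one full period of bit 26 (2^27 cycles),
  // sampling every 2^22 cycles, which is how often bit 22 changes.
  def flashesPerPeriod: Int = {
    val step = 1L << 22
    val samples = (0L until (1L << 27) by step).map(led)
    samples.zip(samples.tail).count { case (prev, cur) => !prev && cur }
  }

  val clockHz = 50000000.0                  // assumed clock frequency
  val darkSeconds  = (1L << 26) / clockHz   // bit 26 low: LED forced off
  val flashSeconds = (1L << 22) / clockHz   // length of one on-pulse while bit 26 is high
}
```

Under that assumption the dark phase lasts 2^26 / 50 MHz ≈ 1.34 s, and each of the eight flashes is on for 2^22 / 50 MHz ≈ 0.084 s.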
## Single-cycle CPU
### Summary of test cases
- Component tests: Instead of running full programs, these verify isolated modules such as the decoder or the ALU. We use ChiselTest to `poke` specific input values into the circuit and `expect` precise outputs, so the low-level logic is known to be solid before the entire CPU is assembled.
- CPU integration tests: These verify the full CPU by executing actual binaries such as Fibonacci and Quicksort. A loader reads the program files from disk into the simulated memory, which shows that the CPU can handle complex software tasks like recursion and sorting.
### ChiselTest Unit Tests (9 tests)

### RISCOF Compliance Testing (41 tests)
- Problem encountered (RISCOF not installed)
```
Validating RISCOF installation...
Error: riscof not found in PATH
RISCOF (RISC-V Architectural Test Framework) is required for compliance tests.
```
- Attempts to Fix
```
sudo apt update
sudo apt install -y pipx
pipx ensurepath
pipx install riscof
export PATH="$HOME/.local/bin:$PATH"
make compliance
```
- Problem encountered (RISC-V toolchain not installed)
```
Validating RISCOF installation...
RISCOF found: /home/blz/.local/bin/riscof
Version: RISC-V Architectural Test Framework., version 1.25.3
Running RISCOF compliance tests for 1-single-cycle (RV32I)...
Error: RISC-V GNU Toolchain not found
```
- Attempts to Fix
```
sudo apt install -y gcc-riscv64-unknown-elf binutils-riscv64-unknown-elf
export RISCV=/usr
export CROSS_COMPILE=riscv64-unknown-elf-
export PATH="$HOME/.local/bin:$PATH"
make compliance
```



## RISC-V CPU with MMIO Peripherals and Trap Handling
### ChiselTest Unit Tests (9 tests)

### RISCOF Compliance Testing (119 tests)



### Nyancat animation
```
sudo apt install libsdl2-dev
make demo
```

### Effective approaches
- Bounding Box Update
For each frame, compare it with the previous frame and find the smallest bounding box that covers every changed pixel (stored as x, y, w, h). Instead of saving a full 64×64 frame, only the pixels inside that box are saved, because everything outside it can be reused from the previous frame unchanged. On the decoder side, keep a current frame buffer that starts as a copy of the previous frame, then apply the patch by writing the box pixels into the right place.
This is especially effective for Nyancat-style animations because most of the background stays the same, and motion is usually limited to small regions like the rainbow tail or a few moving stars.
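
The encoder and decoder steps described above can be sketched in a few lines of plain Scala. This is an illustration under assumptions, not the code used in the demo: the names `BoxDelta`, `Patch`, `encode`, and `decode` are hypothetical, and a frame is modelled as `frame(y)(x)` with one `Int` per pixel.

```scala
object BoxDelta {
  type Frame = Array[Array[Int]]   // frame(y)(x), one Int per pixel

  // A patch: top-left corner, size, and the pixels inside the box in row-major order.
  final case class Patch(x: Int, y: Int, w: Int, h: Int, pixels: Vector[Int])

  // Encoder: smallest box covering every pixel that differs from the previous frame.
  // Returns None when the two frames are identical.
  def encode(prev: Frame, cur: Frame): Option[Patch] = {
    val changed = for {
      y <- cur.indices
      x <- cur(y).indices
      if cur(y)(x) != prev(y)(x)
    } yield (x, y)
    if (changed.isEmpty) None
    else {
      val x0 = changed.map(_._1).min; val x1 = changed.map(_._1).max
      val y0 = changed.map(_._2).min; val y1 = changed.map(_._2).max
      val pixels = for { y <- y0 to y1; x <- x0 to x1 } yield cur(y)(x)
      Some(Patch(x0, y0, x1 - x0 + 1, y1 - y0 + 1, pixels.toVector))
    }
  }

  // Decoder: copy the previous frame, then overwrite only the pixels inside the box.
  def decode(prev: Frame, patch: Option[Patch]): Frame = {
    val out = prev.map(_.clone)
    patch.foreach { p =>
      for (dy <- 0 until p.h; dx <- 0 until p.w)
        out(p.y + dy)(p.x + dx) = p.pixels(dy * p.w + dx)
    }
    out
  }
}
```

If two pixels change at (20, 10) and (23, 12) in a 64×64 frame, the patch is a 4×3 box holding 12 pixels instead of 4096, which is where the saving comes from.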
## Pipelined RISC-V CPU
### [CA25: Exercise 21] Hazard Detection Summary and Analysis
>Conceptual Exercise: Answer the following questions based on the hazard detection logic implemented above
#### For waveforms
First, we changed the `implementation` argument of `val cpu = Module(new CPU(implementation = ImplementationType.ThreeStage))` in Top.scala to select the pipeline variant being analyzed.
Then we used the commands below to view the waveform:
```shell
make sim SIM_ARGS="-instruction src/main/resources/hazard.asmbin"
gtkwave trace.vcd
```
#### Q1: Why do we need to stall for load-use hazards?
A: Even with forwarding, a `lw` does not have its data until the end of the MEM stage. The dependent instruction right behind it is already in the EX stage and needs that operand in the same cycle, so forwarding alone cannot deliver it in time. The pipeline has to stall until the loaded value can be forwarded.
- Analysis with waveforms

From the waveform, once `id2ex_io_memory_read_enable` goes high (meaning the `lw` is already in the EX stage), the next instruction in the ID stage, `0x00736E33` (which decodes to `or x28, x6, x7`), needs the same register `x7` that the load will write back to. This is a classic load-use hazard. On the next cycle, the ID-stage instruction `0x00736E33` and `io_instruction_address` (the PC) stay the same instead of moving forward, which shows that the pipeline stalls.
#### Q2: What is the difference between "stall" and "flush" operations?
A: Stall means “pause the pipeline from moving forward.” In this design, that is done by not updating the PC and not updating the IF/ID register (`pc_stall` + `if_stall`). Usually we also flush ID/EX to inject a bubble, so that the stalled instruction is not executed twice.
Flush means “kill an instruction and turn it into a NOP.” Here it is mainly used for wrong-path instructions when control flow changes, so we assert `if_flush` to discard what IF just fetched.
#### Q3: Why does jump instruction with register dependency need stall?
A: In our design, JALR's next PC is resolved in the ID stage, so the jump target (rs1 + imm) must be available in ID. If an earlier instruction that writes rs1 is still in flight (for example, a load whose data has not arrived yet), the value cannot be forwarded into ID in time and the correct PC cannot be computed, so the jump has to wait.
- Analysis with waveforms

From the waveform, when `id_io_if_jump_flag` is asserted, the CPU is trying to resolve a register-based jump (JALR) in the ID stage. At the same time, the hazard check shows a dependency (`ctrl_io_rs1_id` matches `ctrl_io_rd_ex`, both 0x1D), so the control unit raises `ctrl_io_pc_stall` and `ctrl_io_if_stall`.
#### Q4: In this design, why is branch penalty only 1 cycle instead of 2?
A: Because the branch decision is made in the ID stage, not the EX stage.
When the branch is taken, only the instruction in IF, which was fetched on the wrong path, needs to be flushed. No deeper stage has to be flushed, which is why the penalty is just 1 cycle.
- Analysis with waveforms

Here `id_io_if_jump_flag` is also asserted, but the behavior we see is not a jump-dependency stall. It is a taken-branch redirect with a flush. `ctrl_io_if_flush` goes high while `ctrl_io_id_flush` stays low, which means only the IF stage gets squashed: the wrong-path instruction in IF is turned into a NOP. That matches our claim that the branch penalty is just 1 cycle.
#### Q5: What would happen if we removed the hazard detection logic entirely?
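A: The CPU would most likely produce incorrect results, because dependent instructions would read stale register values and wrong-path instructions would still execute after a taken branch or jump.
If we removed hazard detection but still wanted correctness, the only brute-force solution would be to insert 3 NOPs between dependent instructions so that every result has time to be written back.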
#### Q6: Complete the stall condition summary: Stall is needed when:
1: Load-use hazard
2: Jump register dependency
#### Flush is needed when:
1: Control flow is redirected
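
The stall and flush conditions summarized above can be written down as a small truth function. The sketch below is a simplified plain Scala model, not the control unit from the repository: all names (`HazardModel`, `View`, `Control`, and their fields) are hypothetical, and it only looks at the instruction in EX.

```scala
object HazardModel {
  // Per-cycle view of the fields the control logic inspects (hypothetical names).
  final case class View(
    rs1Id: Int, rs2Id: Int,      // source registers of the instruction in ID
    jumpNeedsRegInId: Boolean,   // ID holds a branch/JALR that reads registers in ID
    rdEx: Int,                   // destination register of the instruction in EX
    memReadEx: Boolean,          // EX holds a load
    redirect: Boolean            // a taken branch/jump changes the PC this cycle
  )

  final case class Control(pcStall: Boolean, ifStall: Boolean, idFlush: Boolean, ifFlush: Boolean)

  def control(v: View): Control = {
    // x0 is never a real dependency.
    val dependsOnEx = v.rdEx != 0 && (v.rdEx == v.rs1Id || v.rdEx == v.rs2Id)
    // Stall cases from the summary: load-use, or a register-based jump whose operand is not ready.
    val stall = dependsOnEx && (v.memReadEx || v.jumpNeedsRegInId)
    Control(
      pcStall = stall,                 // hold the PC
      ifStall = stall,                 // hold the IF/ID register
      idFlush = stall,                 // inject a bubble into ID/EX while stalled
      ifFlush = !stall && v.redirect   // squash the wrong-path instruction in IF
    )
  }
}
```

For the load-use case from Q1 (a load writing `x7` in EX while ID reads `x7`), this model returns a stall with a bubble and no IF flush. For a taken branch with no dependency, it returns only `ifFlush`, matching the 1-cycle penalty discussed in Q4.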
### ChiselTest Unit Tests (29 tests)


### RISCOF Compliance Testing (119 tests)



## Make assembly code run on the pipelined RISC-V CPU
Here, we need to ensure that our hand-written RISC-V assembly code functions correctly on the pipelined RISC-V CPU. Moreover, it should be fully optimized for the pipeline, meaning it should not introduce any unnecessary stalls.
I chose `bf16_add` as the target program for the pipelined RISC-V CPU. In Assignment 2, we ran it in a bare-metal environment using rv32emu in system-emulation mode.
### Modified Assembly
Since we don't rely on the Makefile here, the assembly has to be modified. To run on the pipelined RISC-V CPU, I added a `.bss` section to allocate a stack, and updated `bf16_add.S` with a bare-metal entry `_start` that writes the result and a done flag to memory before entering an infinite loop.
```c
    .section .bss
    .align 4
stack:      .space 2048
stack_top:
...
_start:
    la   sp, stack_top
    # bf16 input
    li   a0, 0x3F80        # 1.0 (bf16)
    li   a1, 0x4000        # 2.0 (bf16)
    jal  ra, bf16_add
    sw   a0, 4(x0)         # store the result at mem[4]
    li   t1, 1
    sw   t1, 8(x0)         # store the done flag at mem[8]
inf_loop:
    j    inf_loop
...
```
### Compile to .asmbin file
First, assemble `bf16_add.S` into an object file
```shell
cd csrc
riscv64-unknown-elf-as -march=rv32i_zicsr -mabi=ilp32 -o bf16_add.o bf16_add.S
```
Link the object file into a 32-bit RISC-V ELF executable
```shell
riscv64-unknown-elf-ld -T link.lds --oformat=elf32-littleriscv -o bf16_add.elf bf16_add.o
```
Convert the ELF into `.asmbin` file
```shell
riscv64-unknown-elf-objcopy -O binary -j .text -j .data bf16_add.elf bf16_add.asmbin
```
Copy the generated `.asmbin` into src/main/resources/
```shell
cp bf16_add.asmbin ../src/main/resources/
```
### Test
I extended PipelineProgramTest.scala to validate the program output.
```scala
it should "execute bf16_add and set done/result" in {
  runProgram("bf16_add.asmbin", cfg) { c =>
    c.clock.setTimeout(0)   // disable the cycle timeout
    c.clock.step(3000)      // give the program time to finish
    c.io.mem_debug_read_address.poke(8.U) // done
    c.clock.step()
    c.io.mem_debug_read_data.expect(1.U)
    c.io.mem_debug_read_address.poke(4.U) // result
    c.clock.step()
    c.io.mem_debug_read_data.expect(0x4040.U) // 3.0 bf16
  }
}
```
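
The expected value `0x4040` in the test above can be cross-checked on the host. The sketch below is a plain Scala reference, not the `bf16_add` assembly routine: `Bf16Ref` is a hypothetical helper, and it converts back to bf16 by truncation (no rounding), which is exact for the inputs used here.

```scala
object Bf16Ref {
  // bf16 is the upper 16 bits of an IEEE-754 single-precision float.
  def toFloat(bits: Int): Float =
    java.lang.Float.intBitsToFloat((bits & 0xFFFF) << 16)

  // Truncating conversion back to bf16 (assumption: no rounding step).
  def fromFloat(f: Float): Int =
    (java.lang.Float.floatToIntBits(f) >>> 16) & 0xFFFF

  def add(a: Int, b: Int): Int = fromFloat(toFloat(a) + toFloat(b))
}
```

`Bf16Ref.add(0x3F80, 0x4000)` adds 1.0 and 2.0 and yields `0x4040`, the bf16 encoding of 3.0 that the test reads back from `mem[4]`.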
Run the test with:
```shell
WRITE_VCD=1 sbt "project pipeline" "testOnly riscv.PipelineProgramTest"
```
### Validation
```c
blz@localhost:~/ca2025-mycpu$ WRITE_VCD=1 sbt "project pipeline" "testOnly riscv.PipelineProgramTest"
[info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29)
[info] loading project definition from /home/blz/ca2025-mycpu/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/blz/ca2025-mycpu/)
[info] set current project to mycpu-pipeline (in build file:/home/blz/ca2025-mycpu/)
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should execute bf16_add and set done/result
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Stalling
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should execute bf16_add and set done/result
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Forwarding
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should execute bf16_add and set done/result
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Reduced Branch Delay
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should execute bf16_add and set done/result
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Run completed in 2 minutes, 21 seconds.
[info] Total number of tests run: 28
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 143 s
```