# Assignment3: Your Own RISC-V CPU
contributed by < [`jin11109`](https://github.com/jin11109/ca2025-mycpu) >
## Simple Chisel Code
Below is the 'Hello World in Chisel' example provided in [lab3](https://hackmd.io/@sysprog/B1Qxu2UkZx#Hello-World-in-Chisel):
```scala
import chisel3._

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}
```
This code makes the LED blink at a fixed interval. The counting register (`cntReg`) increments every clock cycle until it reaches the maximum count (`CNT_MAX`). When it does, `cntReg` is reset to `0` and `blkReg` is inverted; since `io.led` is driven by `blkReg`, the LED output toggles.
### Enhance by incorporating logic circuit
The original circuit uses `blkReg` to record the current state. This register can be removed, as discussed below.
Many bits of `cntReg` are unused, because `CNT_MAX` fits in 25 bits, well below half of the largest 32-bit unsigned value, so the upper bits of the register never change.
```
# When cntReg stores CNT_MAX
|XXXX_XXX1 0111_1101 0111_1000 0011_1111|
```
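The bit pattern above can be checked with a few lines of plain Scala. This is only a sanity check of the arithmetic; the object and value names are chosen for illustration and are not part of the lab code.

```scala
object CntWidth {
  // CNT_MAX from the Hello module: 50,000,000 / 2 - 1.
  val cntMax: Int = 50000000 / 2 - 1

  // Number of bits needed to hold cntMax as an unsigned value.
  val bitsNeeded: Int = BigInt(cntMax).bitLength

  // Bits of a 32-bit register that the counter never touches.
  val spareBits: Int = 32 - bitsNeeded

  def main(args: Array[String]): Unit =
    println(f"CNT_MAX = 0x$cntMax%08X, needs $bitsNeeded bits, $spareBits spare")
}
```

Since the counter occupies only bits 24..0, bits 31..25 are free for other uses.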
So `blkReg` can be removed by treating the MSB of `cntReg` as the state recorder.
```
# When cntReg stores CNT_MAX
|YXXX_XXX1 0111_1101 0111_1000 0011_1111|
 ^
 state recorder
```
Here is the chisel code:
```scala
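import chisel3._
import chisel3.util.Cat

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  val CNT_MAX = (50000000 / 2 - 1).U
  // Bits 24..0 count clock cycles; bit 31 holds the LED state.
  val cntReg = RegInit(0.U(32.W))
  cntReg := cntReg + 1.U
  when(cntReg(24, 0) === CNT_MAX) {
    // Chisel does not allow assigning to a bit slice, so rewrite the whole
    // register: toggle the MSB and clear the counter bits.
    cntReg := Cat(~cntReg(31), 0.U(31.W))
  }
  io.led := cntReg(31)
}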
```
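To convince myself that the MSB really behaves like `blkReg`, the idea can be modeled in plain Scala without Chisel. This is a minimal software sketch, not the hardware itself: `BlinkModel`, `step`, `led`, and `trace` are illustrative names, and `cntMax` is a parameter so the model can be exercised with small values.

```scala
object BlinkModel {
  // One clock edge: bits 24..0 count, bit 31 holds the LED state.
  def step(cnt: Long, cntMax: Long): Long = {
    val low = cnt & 0x1FFFFFFL            // counter field, bits 24..0
    val msb = (cnt >> 31) & 1L            // current LED state
    if (low == cntMax) (msb ^ 1L) << 31   // toggle the state, clear the count
    else (cnt + 1) & 0xFFFFFFFFL          // otherwise keep counting
  }

  // LED output is the MSB of the register.
  def led(cnt: Long): Int = ((cnt >> 31) & 1L).toInt

  // LED value observed on each of the first n cycles after reset.
  def trace(n: Int, cntMax: Long): Seq[Int] =
    Iterator.iterate(0L)(step(_, cntMax)).take(n).map(led).toList
}
```

With `cntMax = 2`, `BlinkModel.trace(8, 2)` gives `0, 0, 0, 1, 1, 1, 0, 0`: the LED toggles every `cntMax + 1` cycles, which is the same period as the version with a separate `blkReg`.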
## Exercises in [ca2025-mycpu](https://github.com/sysprog21/ca2025-mycpu)
In this section, I describe the issues I encountered while completing the exercises in ca2025-mycpu and how I resolved them.
I also discuss my hazard-detection analysis for “CA25: Exercise 21” in 3-pipeline using both Chisel logic and waveform inspection.
### Issues and Resolutions
#### Issues in `1-single-cycle`
While working on CA25: Exercise 4, which focuses on the Instruction Fetch Unit, I encountered the following test failure:
```
[info] InstructionFetchTest:
[info] InstructionFetch
[info] - should correctly update PC and handle jumps *** FAILED ***
[info] io_instruction_address=4096 (0x1000) did not equal expected=4100 (0x1004) (lines in InstructionFetchTest.scala: 36, 30, 24) (InstructionFetchTest.scala:36)
```
To investigate, I ran `make sim` and inspected the waveform using:
```console
# in 1-single-cycle
$ make sim SIM_ARGS="-instruction ./src/main/resources/fibonacci.asmbin"
```
Use `surfer` to view the result:
```console
$ ../../surfer ./trace.vcd
```
`surfer` shows:

Zoomed in to the beginning:

I observed the following issues:
1. Signal values stopped changing after `2 ps`, remaining at `0000 1197`.
To confirm the initial instruction value, I dumped the first few words of the raw binary for the `fibonacci` program:
```console
$ hexdump ./src/main/resources/fibonacci.asmbin | head -1
0000000 1197 0000 8193 a981 0137 0040 0297 0000
```
This shows that the program crashes at the very beginning. It successfully assigns a value to `io.instruction` once, but never updates afterward. This indicates that the failure occurs immediately after the first instruction fetch.
2. Only a few IO signals appeared in the waveform.
This suggests that several components, including `pc`, `io.jump_flag_id`, and others, were not initialized or not being driven at all.
Given these symptoms, I focused on the instruction fetch logic:
```scala
pc := Mux(io.jump_flag_id, io.jump_address_id, io.instruction + 4.U)
```
It became clear that I mistakenly used `io.instruction` instead of `pc` in the PC update path. The correct logic should increment the current PC, not the instruction value:
```diff
-pc := Mux(io.jump_flag_id, io.jump_address_id, io.instruction + 4.U)
+pc := Mux(io.jump_flag_id, io.jump_address_id, pc + 4.U)
```
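The difference between the two update rules can also be seen in a small plain-Scala model. This is only a sketch of the selection logic, with 32-bit values held in `Long`; `PcModel` and its method names are made up for illustration.

```scala
object PcModel {
  // Correct rule: take the jump target when a jump is flagged,
  // otherwise advance the PC by one 4-byte instruction.
  def nextPc(pc: Long, jumpFlag: Boolean, jumpAddress: Long): Long =
    (if (jumpFlag) jumpAddress else pc + 4) & 0xFFFFFFFFL

  // Mistaken rule: adds 4 to the fetched instruction word instead of the PC.
  def buggyNextPc(instruction: Long, jumpFlag: Boolean, jumpAddress: Long): Long =
    (if (jumpFlag) jumpAddress else instruction + 4) & 0xFFFFFFFFL
}
```

Starting from `pc = 0x1000`, the correct rule yields `0x1004`, the value the failing test expected. The mistaken rule instead derives the next address from the instruction word (`0x00001197` for the first `fibonacci` instruction), which is not a meaningful address.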
After applying the fix, all tests passed:
```
jin@jin-mcslab2404:~/temp/ca2025-mycpu/1-single-cycle$ make test
...
[info] Run completed in 28 seconds, 539 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 37 s, completed Nov 26, 2025, 1:42:12 PM
```
The corrected waveform is shown below. Unlike the failed run, the fixed version exposes all relevant IO signals in the Instruction Fetch Unit, and the simulation proceeds as expected.

#### No RISC-V toolchain found
After fixing the earlier mistake in my code, all test cases in `make compliance` continued to fail.
To diagnose the problem, I first inspected the report located at `/tests/riscof_work_1sc/report.html`, which is generated by the compliance workflow. The report showed that all results produced by my implementation were zeros, indicating that the tests were not actually being executed correctly.
```
commit_id:-
MACROS:
TEST_CASE_1=True
XLEN=32
File1 Path:/home/jin/course/ca2025/ca2025-mycpu/tests/riscof_work_1sc/rv32i_m/I/src/add-01.S/dut/DUT-mycpu.signature
File2 Path:/home/jin/course/ca2025/ca2025-mycpu/tests/riscof_work_1sc/rv32i_m/I/src/add-01.S/ref/Reference-rv32emu.signature
Match Line# File1 File2
* 0 00000000 6f5ca309
* 1 00000000 deadbeef
* 2 00000000 deadbeef
* 3 00000000 deadbeef
* 4 00000000 deadbeef
* 5 00000000 deadbeef
* 6 00000000 deadbeef
* 7 00000000 deadbeef
* 8 00000000 deadbeef
* 9 00000000 deadbeef
* 10 00000000 deadbeef
* 11 00000000 deadbeef
* 12 00000000 deadbeef
```
I then continued searching through the output artifacts and found `batch_test.log` under `/tests/riscof_work_1sc/`. This log revealed the root cause: the RISC-V toolchain was not detected, so none of the compliance tests were compiled or run.
```
[info] - should pass test /home/jin/course/ca2025/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/hints/src/srl-01.S *** FAILED ***
[info] java.lang.Exception: No RISC-V toolchain found. Set $RISCV or install to $HOME/riscv/toolchain
[info] at riscv.compliance.ElfSignatureExtractor$.$anonfun$extractSignatureRange$10(ComplianceTestBase.scala:84)
[info] at scala.Option.getOrElse(Option.scala:201)
[info] at riscv.compliance.ElfSignatureExtractor$.readelfCmd$lzycompute$1(ComplianceTestBase.scala:84)
[info] at riscv.compliance.ElfSignatureExtractor$.readelfCmd$1(ComplianceTestBase.scala:70)
[info] at riscv.compliance.ElfSignatureExtractor$.extractSignatureRange(ComplianceTestBase.scala:95)
[info] at riscv.compliance.ComplianceTestBase.runComplianceTest(ComplianceTestBase.scala:166)
[info] at riscv.compliance.ComplianceTest.$anonfun$new$40(ComplianceTest.scala:365)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] ...
```
To address this error, I opened an [issue](https://github.com/sysprog21/ca2025-mycpu/issues/1) on GitHub and resolved it with a [pull request](https://github.com/sysprog21/ca2025-mycpu/pull/2).
#### Issues in 3-pipeline
In this section, I encountered a functional error in the `ALU.scala` implementation. The issue was caused by not slicing `rs2` to ensure that the shift amount stays within the valid range of 0 to 31. Without this constraint, the shift operations behave incorrectly, resulting in widespread test failures:
```
[info] Run completed in 8 seconds, 84 milliseconds.
[info] Total number of tests run: 29
[info] Suites: completed 3, aborted 0
[info] Tests: succeeded 1, failed 28, canceled 0, ignored 0, pending 0
```
To investigate further, I ran `make sim` to generate waveforms. The build stopped before simulation with a FIRRTL width-check error pointing at the shift logic in `ALU.scala`: the shift amount passed to the dynamic left shift (`dshl`) was too wide. This confirmed that the shift amount had not been sliced from `rs2`.
```
[error] firrtl.passes.CheckWidths$DshlTooBig: @[3-pipeline/src/main/scala/riscv/core/ALU.scala 63:27] : [target ~Top|ALU] Width of dshl shift amount must be less than 20 bits.
[error] stack trace is suppressed; run last Compile / runMain for the full output
[error] (Compile / runMain) firrtl.passes.CheckWidths$DshlTooBig: @[3-pipeline/src/main/scala/riscv/core/ALU.scala 63:27] : [target ~Top|ALU] Width of dshl shift amount must be less than 20 bits.
[error] Total time: 2 s, completed Dec 1, 2025, 7:17:56 PM
make: *** [Makefile:17: verilator] Error 1
```
After slicing `rs2` so that only the valid shift-amount bits are used, all test cases passed.
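The masking rule itself is small enough to sketch in plain Scala. In RV32I, register shifts use only the low 5 bits of `rs2` as the shift amount; in Chisel this corresponds to a bit extract of the form `x(4, 0)` on the second operand. The names below (`ShiftModel`, `shamt`, and so on) are illustrative and are not taken from `ALU.scala`.

```scala
object ShiftModel {
  // RV32I register shifts use only the low 5 bits of rs2 as the shift amount.
  def shamt(rs2: Int): Int = rs2 & 0x1F

  def sll(rs1: Int, rs2: Int): Int = rs1 << shamt(rs2)   // shift left logical
  def srl(rs1: Int, rs2: Int): Int = rs1 >>> shamt(rs2)  // shift right logical
  def sra(rs1: Int, rs2: Int): Int = rs1 >> shamt(rs2)   // shift right arithmetic
}
```

For example, a shift amount of 33 is treated as 1, so the result always stays inside the 32-bit datapath.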
### Nyancat program in `2-mmio-trap`
The screenshot below shows that the Nyancat animation is correctly rendered on the VGA display during Verilator-based simulation.

I have not yet figured out a more efficient way to compress this program.
### Perform Hazard Detection Summary and Analysis with Chisel and waveforms
```
make sim SIM_ARGS="-instruction ./src/main/resources/hazard_extended.asmbin"
```
To complete "CA25: Exercise 21" in `3-pipeline`, I answer the questions below:
1. Why do we need to stall for load-use hazards? (Hint: Consider data dependency and forwarding limitations)
Because the data from a Load instruction is fetched from memory during the MEM stage, but the immediately following instruction needs to use that data in the EX stage. Even with forwarding, the data arrives too late (at the end of the cycle). Therefore, we must stall for 1 cycle to wait for the data to become available from the memory.
2. What is the difference between "stall" and "flush" operations?
- **Stall**: Freezes the pipeline state. The `PC` and pipeline registers are not updated (they keep their values), effectively repeating the current stage and inserting a `NOP` into the next stage. It is used to resolve Data Hazards (waiting for data).
- **Flush**: Clears the pipeline state. It resets the control signals in the pipeline register to zero (turning the instruction into a NOP/Bubble). It is used to resolve Control Hazards.
3. Why does jump instruction with register dependency need stall?
Instructions like JALR calculate the target address using a register value (e.g., PC = Reg[rs1] + offset). If we calculate this address in the ID stage to reduce latency, but the register rs1 is being updated by a previous instruction currently in the EX or MEM stage, we must stall to wait for that result to be forwarded to the ID stage.
4. In this design, why is branch penalty only 1 cycle instead of 2?
Because the branch comparison logic and target address calculation are performed in the ID (Decode) stage instead of the EX stage. By detecting the branch outcome earlier in ID, we only need to flush the one instruction currently being fetched in the IF stage. (If logic were in EX, we would have fetched 2 wrong instructions, causing a 2-cycle penalty).
5. What would happen if we removed the hazard detection logic entirely?
- **Data Hazards**: Instructions would read stale (old) data from registers because the new values haven't been written back yet (Read-After-Write errors).
- **Control Hazards**: The processor would execute instructions from the wrong memory path after a branch or jump, corrupting the program state.
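
The load-use condition from question 1 and the flush from question 4 can be written down as plain Scala predicates. This is a textbook-style sketch under my own assumptions, not the hazard logic of `3-pipeline`: the `Inflight` case class and the function names are hypothetical, and the extra stall for jumps with register dependencies (question 3) is not modeled.

```scala
object HazardModel {
  // Minimal view of an instruction in the EX stage; field names are illustrative.
  case class Inflight(rd: Int, writesReg: Boolean, isLoad: Boolean)

  // Load-use hazard: the instruction in EX is a load whose destination
  // register (other than x0) is read by the instruction currently in ID.
  def loadUseStall(ex: Inflight, idRs1: Int, idRs2: Int): Boolean =
    ex.isLoad && ex.writesReg && ex.rd != 0 &&
      (ex.rd == idRs1 || ex.rd == idRs2)

  // With branches resolved in ID, only the instruction in IF is flushed
  // when the branch or jump redirects the PC.
  def flushIf(jumpTakenInId: Boolean): Boolean = jumpTakenInId
}
```

A load to `x5` followed immediately by an instruction reading `x5` triggers the stall, while an ALU instruction writing `x5` does not, because its result can be forwarded.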
## Modify the handwritten RISC-V assembly code in Homework2 to ensure it functions correctly on the pipelined RISC-V CPU
### Extend the Scala code for testing
Before working on this requirement, I first discuss the differences between running a program in [rv32emu](https://github.com/sysprog21/rv32emu) and in `mycpu`:
1. `rv32emu` provides syscalls such as `WRITE` and `EXIT`, but `mycpu` does not support them.
2. In `rv32emu`, the program starts at `0x10000` and must be built with its own linker script. In `mycpu`, the program starts at `0x00001000`, and `/csrc/link.lds` is already provided for test programs.
#### Ensure handwritten code can run in mycpu
To run the handwritten code from homework2 on `mycpu`, all uses of syscalls must be removed.
Then copy `uf8.S` to `3-pipeline/csrc` and execute `make update`.
However, my program is not compiled unless the following line is added to `3-pipeline/csrc/Makefile`:
```diff
BINS = \
fibonacci.asmbin \
hazard.asmbin \
quicksort.asmbin \
sb.asmbin \
uart.asmbin \
irqtrap.asmbin \
+ uf8.asmbin
```
---
I got the warning below, which means the `_start` label is missing from my program.
```
/home/jin/riscv/toolchain/bin/riscv-none-elf-ld: warning: cannot find entry symbol _start; defaulting to 00001000
```
The reason is that in homework2 I used `start.S` as the entry point, where `bss` and the stack are initialized.
Now only the `main` function in `uf8.S` is needed, so I renamed `main` to `_start` to fix this warning.
---
Once `uf8.asmbin` is built, it can be run on this CPU:
```console
$ make sim SIM_ARGS="-instruction ./src/main/resources/uf8.asmbin"
```
Unfortunately, it hit some errors:
```
...
invalid write address 0x1fffecb4
invalid read address 0x1fffecb0
invalid write address 0x1fffecb0
invalid read address 0x1fffecb0
invalid write address 0x1fffecb0
invalid read address 0x1fffecb0
invalid write address 0x1fffecb0
invalid read address 0x1fffecb0
invalid write address 0x1fffecb0
Simulation progress: 100%
```
Recall the differences between `rv32emu` and `mycpu` discussed at the beginning of this section. If we remove every syscall, especially `EXIT`, how does the program stop?
Looking at the existing test programs, each of them ends with an endless loop, for example:
Assembly code in `sb.S`:
```asm
.globl _start
_start:
li a0, 0x4
li t0, 0xDEADBEEF
sb t0, 0(a0)
lw t1, 0(a0)
li s2, 0x15
sb s2, 1(a0)
lw ra, 0(a0)
loop:
j loop
```
or C language code in `uart.c`:
```c
/* Use wfi (Wait For Interrupt) for power efficiency */
while (1)
__asm__ volatile("wfi");
return 0;
```
So the errors come from unexpected behavior once the program reaches its end without such a loop.
#### Add to testing logic
Unlike my homework2 program for `rv32emu`, which runs the tests itself and prints the result on screen, `mycpu` requires splitting this into the main program and separate test code.
To meet this requirement, we should remove the testing code from `uf8.S` and add new test logic in `/src/test/scala/riscv/PipelineProgramTest.scala`, where such test cases are defined with Chisel.
My original test in homework2 checks whether the difference between the input value and the value after encoding and decoding exceeds the maximum absolute error ([detail](https://hackmd.io/ay0lR_mBRrq5nQsHUKYa2w?view#Version1Naive)).
To learn how to write a test in `PipelineProgramTest.scala`, I studied `/csrc/fibonacci.c`, focusing on how it stores its result where the Chisel test can read and check it.
`fibonacci.c` stores the result of the calculation at address `0x4`:
```c
*(int *) (4) = fib(10);
```
In `PipelineProgramTest.scala`, after enough clock cycles have elapsed, the test pokes the debug read address with `0x4`, steps one more cycle, and finally compares the value read with the expected value `55`:
```scala
it should "calculate recursively fibonacci(10)" in {
runProgram("fibonacci.asmbin", cfg) { c =>
for (i <- 1 to 50) {
c.clock.step(1000)
c.io.mem_debug_read_address.poke((i * 4).U)
}
c.io.mem_debug_read_address.poke(4.U)
c.clock.step()
c.io.mem_debug_read_data.expect(55.U)
}
}
```
So I wrote a Chisel test modeled on the `fibonacci` one:
```scala
it should "encode and decode correctly in uf8 form" in {
runProgram("uf8.asmbin", cfg) { c =>
c.clock.step(1000)
c.io.mem_debug_read_address.poke(4.U)
c.clock.step()
c.io.mem_debug_read_data.expect(123456.U)
}
}
```
I also rewrote the logic in `uf8.S` that reports "All test cases pass" and "Fail" as shown below. If all test cases pass, it writes the magic number `123456` to memory address `0x4`.
```diff
# Print result
bnez s4, print_pass
- li a0, STDOUT
- la a1, msg_failed
- li a2, 7
- li a7, WRITE
- ecall
+ li a0, 0
+ sw a0, 4(x0)
j 1f
print_pass:
- li a0, STDOUT
- la a1, msg_pass
- li a2, 17
- li a7, WRITE
- ecall
+ li a0, 123456
+ sw a0, 4(x0)
```
Although this change builds and runs under `make test`, one case fails: the `uf8` test in "Five-stage Pipelined CPU with Stalling".
```
[info] Five-stage Pipelined CPU with Stalling
[info] - should encode and decode correctly in uf8 form *** FAILED ***
[info] io_mem_debug_read_data=0 (0x0) did not equal expected=123456 (0x1e240) (lines in PipelineProgramTest.scala: 33, 29, 21, 18) (PipelineProgramTest.scala:33)
```
With `io_mem_debug_read_data=0`, it is impossible to tell whether the value really means "Fail" or points to some other issue. So I changed the failure path to store a different magic number, `123`, at the same address.
After that, the error message still showed `io_mem_debug_read_data=0`. This means the program had not reached its end, because the test gave it too few clock cycles.
So I rewrote my Chisel test to closely follow the `fibonacci` test case:
```scala
it should "encode and decode correctly in uf8 form" in {
runProgram("uf8.asmbin", cfg) { c =>
for (i <- 1 to 50) {
c.clock.step(1000)
c.io.mem_debug_read_address.poke((i * 4).U)
}
c.io.mem_debug_read_address.poke(4.U)
c.clock.step()
c.io.mem_debug_read_data.expect(123456.U)
}
}
```
Finally, with this change the test passes on all of the five-stage CPU variants.
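
The result convention used above (three possible values at address `0x4`) can be summarized in a small plain-Scala helper. This is only an illustration of how I read the debug word; `Uf8Result` and `classify` are hypothetical names and are not part of `PipelineProgramTest.scala`.

```scala
object Uf8Result {
  sealed trait Outcome
  case object Passed extends Outcome       // uf8.S stored 123456
  case object Failed extends Outcome       // uf8.S stored 123
  case object NotFinished extends Outcome  // memory still holds 0
  case class Unexpected(word: Long) extends Outcome

  // Interpret the word read back from memory address 0x4.
  def classify(word: Long): Outcome = word match {
    case 123456L => Passed
    case 123L    => Failed
    case 0L      => NotFinished
    case other   => Unexpected(other)
  }
}
```

Using distinct non-zero values for pass and fail is what made it possible to tell that a read of `0` meant the program had simply not finished yet.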