# Assignment3: My RISC-V CPU
contributed by < [JimmyCh1025](https://github.com/JimmyCh1025) >
[TOC]
###### tags: `RISC-V` `RV32I` `Computer Architecture 2025`
---
## Hello World in Chisel
### original
- module
```scala
import chisel3._

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  // The counter runs from 0 to CNT_MAX, so it wraps every 25,000,000 cycles.
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg  = RegInit(0.U(32.W))
  val blkReg  = RegInit(0.U(1.W))

  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg // toggle the LED on every wrap
  }
  io.led := blkReg
}
```
- test
```scala
test(new Hello()) { c =>
  for (i <- 0 until 25000000) {
    c.io.led.expect(0.U)
    c.clock.step(1)
  }
  c.io.led.expect(1.U)
}
```
1. Counter Operation
    - `cntReg` increments every clock cycle.
    - When `cntReg === CNT_MAX`:
        - The counter resets to 0.
        - `blkReg` toggles, which flips the LED.
2. LED Blink Operation
    - `blkReg` is a 1-bit register that toggles every `25000000` cycles.
    - `io.led` outputs the current LED state.
3. Hardware Meaning
    - Registers: `RegInit`
    - Combinational logic: `cntReg + 1.U`
    - Hardware signal: `io.led`
### Enhanced Hello Module
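```scala
import chisel3._

class Hello extends Module {
  val io = IO(new Bundle {
    val normalSpeed = Input(Bool())
    val led         = Output(UInt(1.W))
  })
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg  = RegInit(0.U(32.W))
  val blkReg  = RegInit(0.U(1.W))

  // Slow mode wraps at CNT_MAX; fast mode wraps whenever the counter is odd.
  // The input must be referenced through the bundle as io.normalSpeed.
  val slowWrap = io.normalSpeed && (cntReg === CNT_MAX)
  val fastWrap = !io.normalSpeed && ((cntReg & 1.U) === 1.U)

  cntReg := cntReg + 1.U
  when(slowWrap || fastWrap) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}
```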
- This Chisel module implements a blinking LED with a speed control input `normalSpeed`.
- When `normalSpeed` is true, the LED toggles slowly, each time the counter reaches `CNT_MAX`.
- When `normalSpeed` is false, the LED toggles every two clock cycles for fast blinking.
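
The two blink rates can be checked with a small software model. The following is a minimal sketch in C, not the Chisel module itself: `hello_state`, `hello_step`, and `cycles_until_toggle` are hypothetical names, and `cnt_max` stands in for `CNT_MAX` so that a small value can be used.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software model of the module's two registers (illustrative only). */
typedef struct {
    uint32_t cnt; /* models cntReg */
    uint8_t  led; /* models blkReg, a single bit */
} hello_state;

/* Advance the model by one clock edge. */
void hello_step(hello_state *s, bool normal_speed, uint32_t cnt_max)
{
    /* Slow mode wraps at cnt_max; fast mode wraps when the counter is odd. */
    bool wrap = normal_speed ? (s->cnt == cnt_max) : ((s->cnt & 1u) == 1u);
    if (wrap) {
        s->cnt = 0;
        s->led ^= 1u;
    } else {
        s->cnt += 1;
    }
}

/* Count clock edges from reset until the LED first changes. */
uint32_t cycles_until_toggle(bool normal_speed, uint32_t cnt_max)
{
    hello_state s = {0, 0};
    uint32_t cycles = 0;
    while (s.led == 0) {
        hello_step(&s, normal_speed, cnt_max);
        cycles++;
    }
    return cycles;
}
```

With `cnt_max = 4`, `cycles_until_toggle(true, 4)` returns 5 (that is, `cnt_max + 1`, matching the 25,000,000-cycle period for the real `CNT_MAX`), and `cycles_until_toggle(false, 4)` returns 2.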
---
## Single cycle
A `single-cycle` processor executes each instruction in exactly one clock cycle.
In this stage, I modified the single-cycle source files and verified that every part of the datapath (such as ALU control and the five stages from instruction fetch to write-back) works correctly.
### Challenge
In the early stages of implementing the `sb` (Store Byte) instruction, I understood the basic operation: the least significant 8 bits of rs2 are written to memory at the address `rs1 + offset`. However, I initially misunderstood how the byte reaches memory. I thought that once I had calculated the address, I could place the byte directly at that address. I did not consider that memory is organized in 32-bit words, so a single-byte write has to account for the byte's position within its word.
I later realized that, in the actual implementation, a full 32-bit word is presented to memory even though only one byte from rs2 is being stored. The byte has to be shifted into the correct position within that word based on the lower two bits of the address.

### Resolution
To resolve this issue, I split the byte address into two parts. The word index, `address >> 2`, selects the aligned 32-bit word, and the lower two bits of the address, `address(1, 0)`, give the byte offset within that word. I then shifted the low byte of rs2 left by `address(1, 0) << 3` bits (offset * 8) so that it lands in the matching byte position of the 32-bit word.
For example, when the byte offset is 2, the byte from rs2 is stored in `write_data(23, 16)` because `2 * 8 = 16`.
After implementing this solution, I ran several tests to verify that the byte was stored at the intended memory location. The updated code passed all the tests and the memory content was updated correctly, confirming that the fix was effective.
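
The address split and shift can be modeled in software. This is a minimal sketch in C under stated assumptions: the byte offset is taken as the low two bits of the byte address, and `sb_request`, `make_sb_request`, `apply_sb`, and the one-hot `strobe` field are hypothetical names for illustration, not the identifiers used in the Chisel source.

```c
#include <stdint.h>

/* What a byte store presents to a 32-bit-wide memory (illustrative model). */
typedef struct {
    uint32_t word_index; /* which 32-bit word: address >> 2 */
    uint32_t write_data; /* low byte of rs2, shifted into its byte position */
    uint8_t  strobe;     /* assumed one-hot byte enable, bit i = byte i */
} sb_request;

sb_request make_sb_request(uint32_t address, uint32_t rs2)
{
    uint32_t byte_offset = address & 0x3u; /* low two bits of the address */
    sb_request req;
    req.word_index = address >> 2;
    req.write_data = (rs2 & 0xFFu) << (byte_offset << 3); /* offset * 8 */
    req.strobe     = (uint8_t)(1u << byte_offset);
    return req;
}

/* Apply the request to a word array, changing only the enabled byte. */
void apply_sb(uint32_t *mem, sb_request req)
{
    for (unsigned lane = 0; lane < 4; lane++) {
        if (req.strobe & (1u << lane)) {
            uint32_t mask = 0xFFu << (lane * 8);
            mem[req.word_index] =
                (mem[req.word_index] & ~mask) | (req.write_data & mask);
        }
    }
}
```

For a store to byte address 6 with `rs2 = 0xABCDEF`, the model yields word index 1, `write_data = 0x00EF0000` (bits 23 to 16), and only that byte of the word changes.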
### Result
- make test
```
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute Quicksort algorithm on 10 numbers
[info] Run completed in 26 seconds, 356 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 27 s, completed Dec 3, 2025, 12:23:10 PM
```
- make compliance
```
Compliance tests complete. Results in riscof_work_1sc/
Completion time: Thu Dec 4 14:27:43 CST 2025
Copying results to results/ directory...
Cleaning up auto-generated RISCOF test files...
Compliance tests complete. Results in results/
View report: results/report.html
```
---
## MMIO-Trap
`MMIO (Memory-Mapped I/O)` is a method of mapping the control registers of external devices into a range of memory addresses. This allows the CPU to interact with hardware devices by reading from and writing to specific memory addresses, just like regular memory access.
### Challenge
I encountered an issue while implementing the `lb` (load byte) instruction in the MemoryAccess stage. Initially, I selected the wrong byte of the 32-bit word: in the `MuxLookup` for `lb`, both the entry for index 0 and the entry for index 1 used `data(7, 0)`, so the loaded byte did not match the addressed position.
I knew that `lb` should load a byte from memory and sign-extend it to 32 bits, but I was unsure how to handle the different byte positions (0, 1, 2, or 3) within a 32-bit word.
### Resolution
To fix this, I used `mem_address_index` to determine the byte position within the 32-bit word and adjusted the `MuxLookup` so that each index selects its own byte: `data(7, 0)` for index 0, `data(15, 8)` for index 1, `data(23, 16)` for index 2, and `data(31, 24)` for index 3. The selected byte is then sign-extended to form the 32-bit result.
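
The byte selection and sign extension can be expressed as a short software model. This is a minimal sketch in C, not the Chisel code: `load_byte` is a hypothetical helper, and the index is assumed to be the low two bits of the byte address, playing the role of `mem_address_index`.

```c
#include <stdint.h>

/* Model of lb: pick the addressed byte of a 32-bit word and sign-extend it.
 * Returns the 32-bit value that would be written to the destination register. */
uint32_t load_byte(uint32_t word, uint32_t address)
{
    uint32_t index = address & 0x3u;              /* byte position 0..3 */
    uint32_t byte  = (word >> (index * 8)) & 0xFFu; /* data(8*index+7, 8*index) */
    /* Replicate bit 7 of the byte into bits 31..8. */
    return (byte & 0x80u) ? (0xFFFFFF00u | byte) : byte;
}
```

For the word `0x80FF7F01`, index 0 gives `0x00000001`, index 1 gives `0x0000007F`, index 2 gives `0xFFFFFFFF`, and index 3 gives `0xFFFFFF80`.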
### Result
- make test
```
[info] ByteAccessTest:
[info] [CPU] Byte access program
[info] - should store and load single byte
[info] CLINTCSRTest:
[info] [CLINT] Machine-mode interrupt flow
[info] - should handle external interrupt
[info] - should handle environmental instructions
[info] UartMMIOTest:
[info] [UART] Comprehensive TX+RX test
[info] - should pass all TX and RX tests
[info] ExecuteTest:
[info] [Execute] CSR write-back
[info] - should produce correct data for csr write
[info] FibonacciTest:
[info] [CPU] Fibonacci program
[info] - should calculate recursively fibonacci(10)
[info] TimerTest:
[info] [Timer] MMIO registers
[info] - should read and write the limit
[info] InterruptTrapTest:
[info] [CPU] Interrupt trap flow
[info] - should jump to trap handler and then return
[info] QuicksortTest:
[info] [CPU] Quicksort program
[info] - should quicksort 10 numbers
[info] Run completed in 26 seconds, 812 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 28 s, completed Dec 3, 2025, 3:14:49 PM
```
- make compliance
```
Compliance tests complete. Results in riscof_work_2mt/
Completion time: Thu Dec 4 14:34:26 CST 2025
Copying results to results/ directory...
Cleaning up auto-generated RISCOF test files...
Compliance tests complete. Results in results/
View report: results/report.html
```
---
## Pipeline
### Challenge
While running `make all` in the `csrc` directory, I encountered the following error related to undefined references to stack protection functions:
```
riscv64-linux-gnu-gcc -O0 -Wall -march=rv32i_zicsr -mabi=ilp32 -c -o quicksort.o quicksort.c
riscv64-linux-gnu-ld -o quicksort.elf -T link.lds --oformat=elf32-littleriscv quicksort.o init.o
riscv64-linux-gnu-ld: warning: quicksort.elf has a LOAD segment with RWX permissions
riscv64-linux-gnu-ld: quicksort.o: in function `main':
quicksort.c:(.text+0x18c): undefined reference to `__stack_chk_guard'
riscv64-linux-gnu-ld: quicksort.c:(.text+0x250): undefined reference to `__stack_chk_guard'
riscv64-linux-gnu-ld: quicksort.c:(.text+0x26c): undefined reference to `__stack_chk_fail'
make: *** [Makefile:19: quicksort.elf] Error 1
rm LC3370.elf init.o fibonacci.elf hazard.elf
```
This error occurred at the link step. The linker reported unresolved references to `__stack_chk_guard` and `__stack_chk_fail`, which belong to the stack protection mechanism that the compiler can insert into C programs.
### Resolution
To resolve this issue, I modified the Makefile to disable the stack protector. With the protector enabled, the compiler emits references to the `__stack_chk_guard` canary variable and the `__stack_chk_fail` handler, which are used to detect stack buffer overflows. The link command above combines only `quicksort.o` and `init.o`, so nothing defines these symbols.
I added the `-fno-stack-protector` flag to the `CFLAGS` in the Makefile as follows:
```
CFLAGS = -O0 -Wall -march=rv32i_zicsr -mabi=ilp32 -fno-stack-protector
```
This change stops the compiler from inserting stack protection code, so the object file no longer references the missing symbols.
After making this change, `make all` completed without the stack protection errors.
### Result
- make test
```
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
[info] - should calculate recursively fibonacci(10)
[info] - should LC3370 program
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Stalling
[info] - should calculate recursively fibonacci(10)
[info] - should LC3370 program
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Forwarding
[info] - should calculate recursively fibonacci(10)
[info] - should LC3370 program
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Reduced Branch Delay
[info] - should calculate recursively fibonacci(10)
[info] - should LC3370 program
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] PipelineUartTest:
[info] Three-stage Pipelined CPU UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Stalling UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Forwarding UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] PipelineRegisterTest:
[info] Pipeline Register
[info] - should be able to stall and flush
[info] Run completed in 1 minute, 38 seconds.
[info] Total number of tests run: 33
[info] Suites: completed 3, aborted 0
[info] Tests: succeeded 33, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 101 s (01:41), completed Dec 3, 2025, 3:42:19 PM
```
- make compliance
```
Compliance tests complete. Results in riscof_work_3pl/
Completion time: Thu Dec 4 14:37:49 CST 2025
Copying results to results/ directory...
Cleaning up auto-generated RISCOF test files...
Compliance tests complete. Results in results/
View report: results/report.html
```
### Question
Analyze `fibonacci.asmbin`
#### Q1: Why do we need to stall for load-use hazards?
1. Instruction Decoding:
    - hex = `fec42783`, bin = `1111 1110 1100 0100 0010 0111 1000 0011` => `lw x15, -20(x8)`
    - hex = `fff78793`, bin = `1111 1111 1111 0111 1000 0111 1001 0011` => `addi x15, x15, -1`
    - `fec42783` is an `LW` (Load Word) instruction, which loads a 32-bit value from memory into register `x15`.
    - `fff78793` is an `ADDI` instruction, which adds -1 to `x15` and stores the result back in `x15`.
2. Load-Use Hazard:
    - The `load-use` hazard occurs because the `ADDI` instruction depends on the value of `x15`, which is loaded by the `LW` instruction immediately before it.
    - The loaded value only becomes available at the end of the `LW` instruction's MEM stage, but the `ADDI` needs it at the start of its EX stage, one cycle earlier. Even forwarding cannot supply a value that has not been read yet, so without intervention the `ADDI` would use a stale `x15`.
3. Stall Mechanism:
    - To resolve this hazard, we need to insert a `stall` in the pipeline.
    - The stall holds the `ADDI` instruction back until the `LW` instruction has obtained the data for `x15`.
4. Hazard Detection in Chisel:
    - In the Chisel design, the `hazard detection unit` detects that the `ADDI` instruction depends on the result of the `LW` instruction and asserts a stall signal (`io_stall`) to pause the pipeline.

==Summary:==
- The stall is needed to handle the `load-use` hazard, where the `ADDI` instruction depends on the value being loaded by `LW`, so that the pipeline still executes correctly.
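
The decoding and the dependency can be checked mechanically. This is a minimal sketch in C: `itype_fields`, `decode_itype`, and `is_load_use` are hypothetical names, and the check only looks at `rs1` of the following instruction, which is enough for the I-type `addi` in this example.

```c
#include <stdint.h>

/* Fields of an RV32I I-type instruction (used by both lw and addi). */
typedef struct {
    uint32_t opcode, rd, funct3, rs1;
    int32_t  imm; /* sign-extended 12-bit immediate */
} itype_fields;

itype_fields decode_itype(uint32_t insn)
{
    itype_fields f;
    uint32_t imm12 = insn >> 20; /* bits 31..20 */
    f.opcode = insn & 0x7Fu;
    f.rd     = (insn >> 7) & 0x1Fu;
    f.funct3 = (insn >> 12) & 0x7u;
    f.rs1    = (insn >> 15) & 0x1Fu;
    f.imm    = (int32_t)imm12 - ((imm12 & 0x800u) ? 4096 : 0);
    return f;
}

/* Load-use pattern: a load (opcode 0x03) whose rd is read as rs1 by the
 * next instruction. Writes to x0 never create a dependency. */
int is_load_use(uint32_t load_insn, uint32_t next_insn)
{
    itype_fields ld = decode_itype(load_insn);
    itype_fields nx = decode_itype(next_insn);
    return ld.opcode == 0x03u && ld.rd != 0 && ld.rd == nx.rs1;
}
```

Decoding `0xfec42783` gives `rd = 15`, `rs1 = 8`, `imm = -20`, and decoding `0xfff78793` gives `rd = 15`, `rs1 = 15`, `imm = -1`, so `is_load_use(0xfec42783, 0xfff78793)` returns 1.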

#### Q2: What is the difference between "stall" and "flush" operations?
hex = `00f71663`, bin = `0000 0000 1111 0111 0001 0110 0110 0011`
=> `bne x14, x15, 12` (rs1 = `x14`, rs2 = `x15`; the `imm[4:1]` field holds 6, which is a byte offset of 12)
| | stall | flush |
| ------------------- | ---------------------- | ------------------------------------ |
| Why it happens | Data hazard | Control hazard (branch) or exception |
| What it does | Freeze pipeline stages | Clear wrong instructions with NOP |
| PC | Does not `advance` | `Redirected` to new target |
| Pipeline effect | Insert `bubble` | Discard instructions |
| Example in waveform | After `fec42783` load | When `io_flush` = 1 appears |
==Summary:==
- A `stall` pauses the pipeline because data is not ready, for example after a `load` instruction when the next instruction immediately needs the loaded value. The PC and the pipeline registers stop updating for one cycle, and a bubble is inserted.
- A `flush` discards instructions that were fetched along the wrong path, typically after a branch. Instead of pausing, the pipeline replaces these instructions with NOPs and redirects the PC.
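
The `bne` encoding used in this question can be decoded with a few shifts. This is a minimal sketch in C with hypothetical helper names. It shows that the `imm[4:1]` field of `00f71663` holds 6, which corresponds to a byte offset of 12, and that the encoding places `x14` in rs1 and `x15` in rs2 (the comparison is symmetric, so the operand order does not change the outcome).

```c
#include <stdint.h>

/* Reassemble the B-type immediate: imm[12|10:5] sits in bits 31..25 and
 * imm[4:1|11] sits in bits 11..7. The result is a signed byte offset. */
int32_t btype_offset(uint32_t insn)
{
    uint32_t imm = (((insn >> 31) & 0x1u)  << 12)
                 | (((insn >> 7)  & 0x1u)  << 11)
                 | (((insn >> 25) & 0x3Fu) << 5)
                 | (((insn >> 8)  & 0xFu)  << 1);
    return (int32_t)imm - ((imm & 0x1000u) ? 8192 : 0);
}

uint32_t btype_rs1(uint32_t insn) { return (insn >> 15) & 0x1Fu; }
uint32_t btype_rs2(uint32_t insn) { return (insn >> 20) & 0x1Fu; }
```

For `0x00f71663`, `btype_rs1` returns 14, `btype_rs2` returns 15, and `btype_offset` returns 12, so a taken branch redirects the PC to `pc + 12`.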

#### Q3: Why does jump instruction with register dependency need stall?
A register-indirect jump such as `jalr` computes its target address from a register value. If an earlier instruction that is still in the pipeline is about to write that register, the value is not yet available when the jump target is resolved, so the pipeline must stall until the producer's result can be read. Without the stall, the PC could be updated to the wrong address, causing the pipeline to fetch the wrong instruction.
#### Q4: In this design, why is branch penalty only 1 cycle instead of 2?
The branch penalty is 1 cycle because branch resolution occurs in the ID stage. By the time the branch decision is made, only one younger instruction has been fetched, so if that fetch turns out to be on the wrong path, only the instruction in the IF stage needs to be flushed.
If branch resolution happened in the EX stage, 2 instructions would need to be flushed: one from the IF stage and one from the ID stage, because both would already have been fetched before the decision is made.
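
The relation between the resolving stage and the penalty can be written down as a small calculation. This is a minimal sketch in C with hypothetical names. It assumes a five-stage pipeline, counts only wrong-path fetches after taken branches, and ignores data stalls.

```c
/* Pipeline stage indices, counted from fetch. */
enum stage { STAGE_IF = 0, STAGE_ID = 1, STAGE_EX = 2, STAGE_MEM = 3, STAGE_WB = 4 };

/* Wrong-path instructions behind a redirecting branch: one for every stage
 * in front of the stage where the branch is resolved. */
int branch_penalty(enum stage resolve_stage)
{
    return (int)resolve_stage;
}

/* Cycles for n instructions: n + 4 to fill and drain five stages, plus the
 * penalty for each taken branch. */
unsigned long total_cycles(unsigned long n, unsigned long taken,
                           enum stage resolve_stage)
{
    return (n + 4) + taken * (unsigned long)branch_penalty(resolve_stage);
}
```

Under these assumptions, 100 instructions with 10 taken branches take 114 cycles when branches resolve in ID and 124 cycles when they resolve in EX.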
#### Q5: What would happen if we removed the hazard detection logic entirely?
If we removed the hazard detection logic, data hazards (such as `RAW` or `load-use` hazards) would produce incorrect results: a later instruction that depends on an earlier instruction that has not completed yet would read stale data. Control hazards would also go unhandled, because wrong-path instructions fetched after a taken `branch` or jump would no longer be flushed and would execute, corrupting the program's control flow and its outputs.

#### Q6: Complete the stall condition summary:
Stall is needed when:
1. `EX destination === ID RS1 || EX destination === ID RS2` (EX stage condition)
2. `(MEM reg write enable) && (MEM destination === ID RS1 || MEM destination === ID RS2)` (MEM stage condition)

Flush is needed when:
1. `(EX branch_flag && branchCondition(beq ..)) || EX jump_flag` (Branch/Jump condition)
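
The summary above can be transcribed into two predicates. This is a minimal sketch in C that follows the listed conditions literally. `pipeline_view` and its field names are hypothetical, `ex_branch_taken` stands in for the evaluated branch condition, and refinements that the summary does not list (for example ignoring writes to `x0`) are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Snapshot of the signals named in the summary (illustrative names). */
typedef struct {
    uint8_t id_rs1, id_rs2;       /* sources of the instruction in ID */
    uint8_t ex_rd, mem_rd;        /* destinations of the ones in EX and MEM */
    bool    mem_reg_write_enable;
    bool    ex_branch_flag, ex_branch_taken, ex_jump_flag;
} pipeline_view;

bool need_stall(const pipeline_view *p)
{
    /* Condition 1: the EX destination matches a source register in ID. */
    bool ex_hit = p->ex_rd == p->id_rs1 || p->ex_rd == p->id_rs2;
    /* Condition 2: MEM will write a register that ID wants to read. */
    bool mem_hit = p->mem_reg_write_enable &&
                   (p->mem_rd == p->id_rs1 || p->mem_rd == p->id_rs2);
    return ex_hit || mem_hit;
}

bool need_flush(const pipeline_view *p)
{
    /* A branch whose condition holds, or any jump, redirects the PC. */
    return (p->ex_branch_flag && p->ex_branch_taken) || p->ex_jump_flag;
}
```

Writing the conditions as plain predicates makes it easy to try individual cases, such as a matching MEM destination with the write enable cleared, which should not stall.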
---
## Modify the handwritten RISC-V assembly code in [Homework2](https://hackmd.io/@JimmyChen88/arch2025-homework2)
- `3-pipeline/src/test/scala/riscv/PipelineProgramTest.scala`
```scala
it should "LC3370 program" in {
  runProgram("LC3370.asmbin", cfg) { c =>
    for (i <- 1 to 50) {
      c.clock.step(1000)
      c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
    }
    c.io.mem_debug_read_address.poke(4.U)
    c.clock.step()
    c.io.mem_debug_read_data.expect(1.U)
    c.io.mem_debug_read_address.poke(8.U)
    c.clock.step()
    c.io.mem_debug_read_data.expect(511.U)
    c.io.mem_debug_read_address.poke(12.U)
    c.clock.step()
    c.io.mem_debug_read_data.expect(1023.U)
  }
}
```
- `3-pipeline/csrc/LC3370.c`
```c
// SPDX-License-Identifier: MIT
// MyCPU is freely redistributable under the MIT License. See the file
// "LICENSE" for information on usage and redistribution of this file.

// Count leading zeros with a binary search over the bit positions.
unsigned int clz(unsigned int x)
{
    int n = 32, c = 16;
    do {
        unsigned int y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}

// Return the all-ones value that has the same bit length as n.
int smallestNumber(int n) {
    int bit_len = (1 << (32 - clz(n))) - 1;
    return bit_len;
}

int main()
{
    // Results are written to fixed addresses that the test bench reads back.
    *(int *) (4) = smallestNumber(1);
    *(int *) (8) = smallestNumber(509);
    *(int *) (12) = smallestNumber(1000);
}
```
- ==Running `make update` in the `csrc` directory will generate the `.asmbin` file and place it in `src/main/resources`.==
### Waveform
- `verilog/LC3370.asmbin.txt`
```
@0
00001197
@1
b1818193
@2
00400137
@3
00000297
@4
30c28293
@5
00000317
@6
30430313
@7
0062f863
@8
0002a023
@9
00428293
@a
ff5ff06f
@b
000ff297
@c
fd428293
@d
000ff317
@e
fcc30313
.
.
.
@c7
00000013
@c8
00000013
```

## Reference
* [Computer Architecture HW3](https://hackmd.io/@sysprog/2025-arch-homework3)
* [Lab3: Construct a RISC-V CPU with Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx#Lab3-Construct-a-RISC-V-CPU-with-Chisel)
* [Assignment1: RISC-V Assembly and Instruction Pipeline](https://hackmd.io/@JimmyChen88/arch2025-homework1)
* [Assignment2: Complete Applications](https://hackmd.io/@JimmyChen88/arch2025-homework2)
* [RISC-V Instruction Set Manual](https://riscv.org/specifications/ratified/)