# Assignment3: RISC-V CPU
contributed by [< ukp66482 >](https://github.com/ukp66482/ca2025-RISC-V-CPU)
[TOC]
## 1-single-cycle
### Handling Memory Latency Through Clock Domain Separation

After opening the generated VCD file and inspecting the waveform, it can be seen that the instruction address, instruction data, and ALU result do not change on every system clock cycle. Instead, each instruction stays stable for several cycles before the next instruction appears. **This initially looks different from what is typically expected from a single-cycle CPU when viewed under the system clock.**
By looking more closely at the waveform, the source of this behavior becomes clearer. As highlighted in the above figure, the **blue marker** shows that after the instruction address is issued, the instruction data becomes valid one system clock cycle later, indicating a one-cycle read delay from the instruction memory. Similarly, during the execution of the `lw` instruction, the **purple marker** shows that the data memory read data also appears one cycle after the data memory address is asserted, revealing the one-cycle latency of the data memory.
Further checking the project’s `1-single-cycle/README.md` confirms that both the instruction memory and data memory are implemented using `synchronous block RAM`, which introduces a **one-cycle delay for read operations**.
According to [Xilinx UG473](https://docs.amd.com/v/u/en-US/ug473_7Series_Memory_Resources), block RAM performs synchronous read operations in which the address is registered on a clock edge and the read data becomes available in the next cycle. This matches the one-cycle memory access latency observed in the waveform.

The logic diagram shows that **block RAM performs synchronous reads by registering the address before accessing the memory array.** Without enabling the `optional output register`, the **read data is produced after one clock cycle and remains stable for capture by the CPU**, matching the timing behavior observed in the waveform. If the optional output register is enabled, the read data is registered once more before reaching the output, which adds another cycle of latency. This additional pipeline stage can improve **timing and allow higher operating frequencies**.
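The one-cycle read latency can be illustrated with a small behavioral model. The sketch below is plain Scala rather than Chisel, and `SyncReadMemModel` is a hypothetical name used only for illustration, not part of the project. It assumes a memory without the optional output register.

```scala
// Behavioral model of a synchronous-read memory (no output register).
// Hypothetical helper for illustration; not taken from the project sources.
final class SyncReadMemModel(contents: Vector[Int]) {
  private var addrReg: Int = 0 // address latched on the previous clock edge

  /** Models one rising clock edge: returns the data for the address latched
    * on the previous edge, then latches `addr` for the next cycle. */
  def clock(addr: Int): Int = {
    val readData = contents(addrReg)
    addrReg = addr
    readData
  }
}

object SyncReadMemDemo {
  /** Presents addresses 1, 2, 3, 3 on four consecutive clock edges. */
  def run(): Vector[Int] = {
    val mem = new SyncReadMemModel(Vector(10, 11, 12, 13))
    Vector(1, 2, 3, 3).map(a => mem.clock(a))
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", "))
}
```

Each returned value belongs to the address that was latched one edge earlier (the first value comes from the initial address 0), which is the same lag seen between address and data in the waveform.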

To accommodate this memory behavior without adding extra control logic, the CPU is intentionally operated at a slower effective rate. Specifically, the CPU advances once every four system clock cycles, allowing sufficient time for memory read data to stabilize before being captured by the CPU.
The following snippet from `1-single-cycle/src/test/scala/riscv/singlecycle/CPUTest.scala` shows how the divided CPU clock is generated and used to clock the CPU core.
``` scala
class TestTopModule(exeFilename: String) extends Module {
  val io = IO(new Bundle {
    ......
    ......
  })
  val mem = Module(new Memory(8192))
  val instruction_rom = Module(new InstructionROM(exeFilename))
  val rom_loader = Module(new ROMLoader(instruction_rom.capacity))
  ......
  ......
  ......
  // 2-bit counter that counts 0, 1, 2, 3 on the system clock
  val CPU_clkdiv = RegInit(UInt(2.W), 0.U)
  val CPU_tick = Wire(Bool())
  val CPU_next = Wire(UInt(2.W))
  CPU_next := Mux(CPU_clkdiv === 3.U, 0.U, CPU_clkdiv + 1.U)
  // High for one system clock cycle out of every four
  CPU_tick := CPU_clkdiv === 0.U
  CPU_clkdiv := CPU_next
  // The CPU is clocked by CPU_tick, so it advances once per four system cycles
  withClock(CPU_tick.asClock) {
    val cpu = Module(new CPU)
    ......
    ......
    ......
  }
  ......
}
```
### Clock Domain Crossing Problem
**Before discussing the following design, it is important to note that if this CPU is intended to be implemented on real hardware (e.g. FPGA), clock-domain considerations must be carefully addressed.**
#### Issue
The issue comes from how the following snippet in `1-single-cycle/src/test/scala/riscv/singlecycle/CPUTest.scala` derives the CPU clock:
``` scala
class TestTopModule(exeFilename: String) extends Module {
  val io = IO(new Bundle {
    ......
    ......
  })
  val mem = Module(new Memory(8192))
  val instruction_rom = Module(new InstructionROM(exeFilename))
  val rom_loader = Module(new ROMLoader(instruction_rom.capacity))
  ......
  ......
  ......
  val CPU_clkdiv = RegInit(UInt(2.W), 0.U)
  val CPU_tick = Wire(Bool())
  val CPU_next = Wire(UInt(2.W))
  CPU_next := Mux(CPU_clkdiv === 3.U, 0.U, CPU_clkdiv + 1.U)
  CPU_tick := CPU_clkdiv === 0.U
  CPU_clkdiv := CPU_next
  // A logic signal is converted into a clock here
  withClock(CPU_tick.asClock) {
    val cpu = Module(new CPU)
    ......
    ......
    ......
  }
  ......
}
```
At first, the design slows down the CPU by generating a separate CPU clock from the system clock. The memory continues to run at the system clock, while the CPU runs on this derived clock.
In this implementation, the CPU clock is created using logic (`CPU_tick.asClock`) instead of a clock-enable signal. This means the CPU and the memory do not strictly run on the same clock. As a result, signals transferred between the memory and the CPU cross between two different clocks without explicit synchronization.
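Although the derived CPU clock is generated from the system clock, it is produced by ordinary logic, so on an FPGA it is typically routed outside the dedicated clock network and its skew relative to the system clock is hard to control. In a real hardware design, this can lead to clock domain crossing issues such as setup/hold violations and potential metastability.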
#### Improvement
To slow down the CPU, a clock-enable signal can be used. The CPU clock remains the system clock, but the CPU only updates its registers when the enable signal is asserted. When the enable signal is not active, the CPU simply holds its state.
With this approach, the CPU and the memory stay in the same clock domain, and all signals are sampled synchronously. This removes the clock domain crossing issue and avoids potential timing or metastability problems, while still allowing the CPU to run at a slower effective rate.
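As a rough illustration of the clock-enable idea, the following plain-Scala cycle model keeps a single clock and updates the CPU state only when the enable is high. `EnabledCpuModel` and its `pc` field are hypothetical stand-ins for the real CPU, and the enable condition simply mirrors `CPU_tick` from the test bench. In Chisel, the same idea would usually be expressed by guarding register updates with `when(enable) { ... }` or by using `RegEnable`.

```scala
// Cycle-level model: every register is clocked by the single system clock.
// Hypothetical names; `pc` stands in for all architectural CPU state.
final class EnabledCpuModel {
  private var div: Int = 0 // 2-bit divider counter, mirrors CPU_clkdiv
  private var pc: Int = 0

  /** One system-clock edge. Returns (enable, pc) as seen after the edge. */
  def clock(): (Boolean, Int) = {
    val enable = div == 0  // same condition as CPU_tick
    if (enable) pc += 4    // CPU state changes only when enabled
    div = (div + 1) % 4    // the divider itself always runs
    (enable, pc)
  }
}

object ClockEnableDemo {
  /** Returns the pc value observed after each of `cycles` system-clock edges. */
  def run(cycles: Int): Vector[Int] = {
    val cpu = new EnabledCpuModel
    Vector.fill(cycles)(cpu.clock()._2)
  }

  def main(args: Array[String]): Unit =
    println(run(8).mkString(", "))
}
```

The CPU still advances once every four system clock cycles, but no second clock exists, so the memory and the CPU sample each other's signals on the same edges.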
### Verification
#### make test
``` bash
ukp66482@ukp66482-Ubuntu:~/Desktop/ca2025-RISC-V-CPU/1-single-cycle$ make test
cd .. && sbt "project singleCycle" test
[info] welcome to sbt 1.10.7 (Temurin Java 1.8.0_472)
[info] loading project definition from /home/ukp66482/Desktop/ca2025-RISC-V-CPU/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/ukp66482/Desktop/ca2025-RISC-V-CPU/)
[info] set current project to mycpu-single-cycle (in build file:/home/ukp66482/Desktop/ca2025-RISC-V-CPU/)
[info] InstructionDecoderTest:
[info] InstructionDecoder
[info] - should decode RV32I instructions and generate correct control signals
[info] ByteAccessTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly handle byte-level store/load operations (SB/LB)
[info] InstructionFetchTest:
[info] InstructionFetch
[info] - should correctly update PC and handle jumps
[info] ExecuteTest:
[info] Execute
[info] - should execute ALU operations and branch logic correctly
[info] FibonacciTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute recursive Fibonacci(10) program
[info] RegisterFileTest:
[info] RegisterFile
[info] - should correctly read previously written register values
[info] - should keep x0 hardwired to zero (RISC-V compliance)
[info] - should support write-through (read during write cycle)
[info] QuicksortTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute Quicksort algorithm on 10 numbers
[info] Run completed in 24 seconds, 552 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 25 s, completed Dec 15, 2025 10:19:57 AM
```
#### RISCOF Compliance test

## 2-mmio-trap
### MMIO (Memory-Mapped I/O)
MMIO does **not** correspond to **real physical memory storage**.
Instead, MMIO addresses are **reserved regions of the address space** that are used to access peripheral devices.
In this design, MMIO is realized through board-level address multiplexing logic, allowing peripheral accesses to be handled outside the CPU core.
In practice, this routing mechanism works as follows:
```
                  ┌─────────────────┐
                  │       CPU       │
                  │   (lw or sw)    │
                  └────────┬────────┘
                           │
Address (32 bits), data, and control signals are issued
                           │
       ┌───────────────────▼───────────────────┐
       │ Address bits: [31:0]                  │
       │ Extract: bits [31:29] → deviceSelect  │
       └─────────┬───────────────────┬─────────┘
                 │                   │
                 │ deviceSelect == 1 │ deviceSelect != 1
                 │                   │
            ┌────▼─────┐      ┌──────▼──────┐
            │   VGA    │      │  Testbench  │
            │Peripheral│      │   (Other    │
            │          │      │ peripherals)│
            └────┬─────┘      └──────┬──────┘
                 │                   │
            Read/write          Read/write
           VGA registers         external
             and frame           memory or
              buffer            peripherals
                 │                   │
                 └─────────┬─────────┘
                           │
                    ┌──────▼──────┐
                    │  Read Data  │
                    │     MUX     │
                    └──────┬──────┘
                           │
                    Returned to CPU
```
1. The CPU exposes the upper bits of the memory address as `deviceSelect`, and routing logic at the board level (`Top`) uses this signal to select the target device.
``` scala
// 2-mmio-trap/src/main/scala/riscv/core/CPU.scala
class CPU extends Module {
  val io = IO(new CPUBundle)
  // Pipeline stage modules
  val regs = Module(new RegisterFile)
  val inst_fetch = Module(new InstructionFetch)
  val id = Module(new InstructionDecode)
  val ex = Module(new Execute)
  val mem = Module(new MemoryAccess)
  val wb = Module(new WriteBack)
  // Privileged architecture components
  val csr_regs = Module(new CSR)
  val clint = Module(new CLINT)
  // deviceSelect
  io.deviceSelect := mem.io.memory_bundle
    .address(Parameters.AddrBits - 1, Parameters.AddrBits - Parameters.SlaveDeviceCountBits)
  ......
}

// 2-mmio-trap/src/main/scala/board/verilator/Top.scala
class Top extends Module {
  val io = IO(new TopBundle)
  val cpu = Module(new CPU)
  val vga = Module(new VGA)
  ......
  // Mux read data based on deviceSelect
  cpu.io.memory_bundle.read_data := Mux(
    cpu.io.deviceSelect === 1.U,
    vga.io.bundle.read_data,
    io.memory_bundle.read_data
  )
  ......
  // VGA MMIO routing
  vga.io.bundle.address := cpu.io.memory_bundle.address
  vga.io.bundle.write_data := cpu.io.memory_bundle.write_data
  vga.io.bundle.write_strobe := cpu.io.memory_bundle.write_strobe
  vga.io.bundle.write_enable := Mux(
    cpu.io.deviceSelect === 1.U,
    cpu.io.memory_bundle.write_enable,
    false.B
  )
}
```
2. If the address falls within an MMIO region, the access is routed through a multiplexer to the corresponding peripheral (e.g., UART, timer), rather than being handled by a physical memory block.
This design helped me understand that MMIO can be implemented externally using address decoding and multiplexing logic, so the CPU can access different peripherals through the same memory interface. **This approach is similar to my previous experience with AXI-based designs on Xilinx FPGAs, where address decoding and interconnect logic are used to route transactions to different AXI slaves.**
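The address decoding described above can be sketched in plain Scala. The values `AddrBits = 32` and `SlaveDeviceCountBits = 3` are assumptions inferred from the `[31:29]` annotation in the diagram, and `DeviceSelectDemo` is a hypothetical helper, not code from the project.

```scala
object DeviceSelectDemo {
  // Assumed values, inferred from the "[31:29]" annotation in the diagram.
  val AddrBits = 32
  val SlaveDeviceCountBits = 3

  /** Upper SlaveDeviceCountBits bits of the address, as an unsigned value. */
  def deviceSelect(address: Int): Int =
    (address >>> (AddrBits - SlaveDeviceCountBits)) & ((1 << SlaveDeviceCountBits) - 1)

  /** Mirrors the read-data Mux in Top: device 1 is the VGA peripheral,
    * every other device is served by the external memory bundle. */
  def readData(address: Int, vgaData: Int, externalData: Int): Int =
    if (deviceSelect(address) == 1) vgaData else externalData

  def main(args: Array[String]): Unit = {
    println(deviceSelect(0x20000000)) // top three bits are 001
    println(deviceSelect(0x00001000)) // top three bits are 000
  }
}
```

Under these assumptions, any address from `0x20000000` to `0x3FFFFFFF` selects device 1, while ordinary data addresses near zero select device 0.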
### Verification
#### make test
``` bash
ukp66482@ukp66482-Ubuntu:~/Desktop/ca2025-RISC-V-CPU/2-mmio-trap$ make test
cd .. && sbt "project mmioTrap" test
[info] welcome to sbt 1.10.7 (Temurin Java 1.8.0_472)
[info] loading project definition from /home/ukp66482/Desktop/ca2025-RISC-V-CPU/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/ukp66482/Desktop/ca2025-RISC-V-CPU/)
[info] set current project to mycpu-mmio-trap (in build file:/home/ukp66482/Desktop/ca2025-RISC-V-CPU/)
[info] ByteAccessTest:
[info] [CPU] Byte access program
[info] - should store and load single byte
[info] CLINTCSRTest:
[info] [CLINT] Machine-mode interrupt flow
[info] - should handle external interrupt
[info] - should handle environmental instructions
[info] UartMMIOTest:
[info] [UART] Comprehensive TX+RX test
[info] - should pass all TX and RX tests
[info] ExecuteTest:
[info] [Execute] CSR write-back
[info] - should produce correct data for csr write
[info] FibonacciTest:
[info] [CPU] Fibonacci program
[info] - should calculate recursively fibonacci(10)
[info] TimerTest:
[info] [Timer] MMIO registers
[info] - should read and write the limit
[info] InterruptTrapTest:
[info] [CPU] Interrupt trap flow
[info] - should jump to trap handler and then return
[info] QuicksortTest:
[info] [CPU] Quicksort program
[info] - should quicksort 10 numbers
[info] Run completed in 27 seconds, 984 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 29 s, completed Dec 15, 2025 2:51:42 PM
```
#### NyanCat Demo

#### RISCOF Compliance test

## 3-pipeline
### Hazard Detection Summary and Analysis
#### Q1 : Why do we need to stall for load-use hazards?
A load-use hazard occurs when an instruction immediately following a load instruction needs the loaded data.
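In this case, the loaded data only becomes available at the end of the memory stage, while the dependent instruction would need it one cycle earlier, so forwarding alone cannot resolve the hazard and the pipeline must stall.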

``` asm
sw t1, 0(sp)
lw t2, 0(sp)
sw t2, 20(zero) # load-use hazard
```
The waveform shows a one-cycle stall between the `lw` and the dependent `sw`, indicating that the pipeline waits for the load data to become available before continuing execution.
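A minimal sketch of the detection condition, written as plain Scala: it uses the textbook load-use check (load in EX whose destination matches a source register of the instruction in ID), which may differ in detail from the project's hazard unit. `IdStage`, `ExStage`, and `loadUseStall` are hypothetical names.

```scala
object LoadUseHazardDemo {
  // Register indices of the instruction currently in ID.
  final case class IdStage(rs1: Int, rs2: Int)
  // Relevant fields of the instruction currently in EX.
  final case class ExStage(isLoad: Boolean, rd: Int)

  /** Textbook load-use condition: a load in EX writes a register (other than x0)
    * that the instruction in ID reads. */
  def loadUseStall(id: IdStage, ex: ExStage): Boolean =
    ex.isLoad && ex.rd != 0 && (ex.rd == id.rs1 || ex.rd == id.rs2)

  def main(args: Array[String]): Unit = {
    val lw = ExStage(isLoad = true, rd = 7) // lw t2, 0(sp)    (t2 = x7)
    val sw = IdStage(rs1 = 0, rs2 = 7)      // sw t2, 20(zero) (store data in rs2)
    println(loadUseStall(sw, lw))
  }
}
```

For the sequence above, `lw t2, 0(sp)` is in EX while `sw t2, 20(zero)` is in ID, so the condition holds and a one-cycle stall is requested.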
#### Q2 : What is the difference between "stall" and "flush" operations?
- Stall: pauses the pipeline to wait for data while keeping the current instructions in place.
- Flush: discards instructions that should not be executed and replaces them with bubbles.

``` asm
sw t1, 0(sp)
lw t2, 0(sp)
sw t2, 20(zero) # load-use hazard
```
In this design, a load-use hazard triggers a stall to wait for the load data to become available.
The `program counter` and `IF/ID pipeline register` are frozen so that instructions do not advance.
At the same time, `id_flush` is asserted to clear the `ID/EX register` and insert a bubble.
This is necessary because the **instruction memory has a one-cycle delay**, and **flushing** the decode stage prevents an invalid instruction from entering execution.
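The difference between the two operations can be modeled with a small pipeline register in plain Scala. `PipelineRegModel` is a hypothetical software model, and giving flush priority over stall is an assumption made for this sketch; the project's own pipeline register may resolve the two inputs differently.

```scala
// Software model of a pipeline register with stall and flush inputs.
// Assumption: flush takes priority over stall.
final class PipelineRegModel[T](bubble: T) {
  private var value: T = bubble
  def out: T = value

  /** One clock edge. */
  def clock(in: T, stall: Boolean, flush: Boolean): Unit = {
    if (flush) value = bubble   // discard: replace the content with a bubble
    else if (!stall) value = in // normal operation: capture the input
    // stall without flush: keep the current value
  }
}

object StallFlushDemo {
  /** Replays the load-use sequence; returns (IF/ID, ID/EX) after the stall cycle. */
  def run(): (String, String) = {
    val ifId = new PipelineRegModel[String]("nop")
    val idEx = new PipelineRegModel[String]("nop")

    def edge(fetched: String, hazard: Boolean): Unit = {
      // ID/EX samples the old IF/ID output before IF/ID itself updates.
      idEx.clock(ifId.out, stall = false, flush = hazard)
      ifId.clock(fetched, stall = hazard, flush = false)
    }

    edge("lw", hazard = false)  // lw reaches ID
    edge("sw", hazard = false)  // lw reaches EX, the dependent sw reaches ID
    edge("next", hazard = true) // hazard: sw is held in ID, a bubble enters EX
    (ifId.out, idEx.out)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

After the hazard cycle, the IF/ID model still holds `sw` (stalled) while the ID/EX model holds the bubble (flushed), which matches the behavior described above.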
#### Q3 : Why does a jump instruction with a register dependency need a stall?

From the waveform, the branch instruction `bne` depends on register values that are produced by the previous `addi` instructions.
When the branch is reached, these register values are not yet ready, so `if_stall` and `pc_stall` are asserted for one cycle.
This stall allows the correct register values to become available before the branch decision is made.
Without the stall, the CPU may evaluate the branch using incorrect values and jump to the wrong address.
``` asm
addi s11, zero, 3 # s11 = 3
addi t0, zero, 3 # t0 = 3
bne s11, t0, 12 # branch depends on s11 and t0 (register dependency)
```
#### Q4 : In this design, why is branch penalty only 1 cycle instead of 2?
The branch penalty is only 1 cycle because branch decisions are resolved in the `ID` stage rather than the `EX` stage.
As a result, only the instruction in the `IF` stage needs to be flushed, instead of flushing both `IF` and `ID` stages.
- If branches are resolved in the EX stage (traditional):
```
Cycle 1: BEQ [IF]
Cycle 2: Wrong-1 [IF] → BEQ [ID]
Cycle 3: Wrong-2 [IF] → Wrong-1 [ID] → BEQ [EX] <- resolves here
Cycle 4: Target [IF] (flush IF + ID = 2 cycles penalty)
```
- If branches are resolved in the ID stage (this design):
```
Cycle 1: BEQ [IF]
Cycle 2: Wrong-1 [IF] → BEQ [ID] <- resolves here
Cycle 3: Target [IF] (only flush IF = 1 cycle penalty)
```
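The two timelines follow one rule: a taken branch wastes one fetch slot for every stage that sits in front of the stage where it is resolved. The plain-Scala sketch below encodes that rule for a classic five-stage ordering; `BranchPenaltyDemo` and its function names are hypothetical, and other stall sources are ignored.

```scala
object BranchPenaltyDemo {
  // Classic five-stage ordering, used only for this illustration.
  val stages: Vector[String] = Vector("IF", "ID", "EX", "MEM", "WB")

  /** Wrong-path instructions fetched before a taken branch resolves:
    * one for every stage in front of the resolving stage. */
  def penalty(resolveStage: String): Int = {
    val idx = stages.indexOf(resolveStage)
    require(idx >= 0, s"unknown stage: $resolveStage")
    idx
  }

  /** Cycles lost to flushed instructions for a given number of taken branches. */
  def extraCycles(takenBranches: Int, resolveStage: String): Int =
    takenBranches * penalty(resolveStage)

  def main(args: Array[String]): Unit = {
    println(penalty("EX"))
    println(penalty("ID"))
  }
}
```

With this model, moving branch resolution from EX to ID saves one cycle per taken branch, so a program with 100 taken branches would lose 100 cycles instead of 200.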
#### Q5 : What would happen if we removed the hazard detection logic entirely?
If we removed the hazard detection logic, the pipeline would no longer handle data and control dependencies correctly.
Instructions could read stale register values before earlier results are written back, and branches and jumps might be resolved using wrong operands or targets, leading to incorrect control flow.
Overall, the CPU would produce incorrect execution results.
### Verification
#### make test
``` bash
ukp66482@ukp66482-Ubuntu:~/Desktop/ca2025-RISC-V-CPU/3-pipeline$ make test
cd .. && sbt "project pipeline" test
[info] welcome to sbt 1.10.7 (Temurin Java 1.8.0_472)
[info] loading project definition from /home/ukp66482/Desktop/ca2025-RISC-V-CPU/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/ukp66482/Desktop/ca2025-RISC-V-CPU/)
[info] set current project to mycpu-pipeline (in build file:/home/ukp66482/Desktop/ca2025-RISC-V-CPU/)
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Stalling
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Forwarding
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Reduced Branch Delay
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] PipelineUartTest:
[info] Three-stage Pipelined CPU UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Stalling UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Forwarding UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] PipelineRegisterTest:
[info] Pipeline Register
[info] - should be able to stall and flush
[info] Run completed in 1 minute, 23 seconds.
[info] Total number of tests run: 29
[info] Suites: completed 3, aborted 0
[info] Tests: succeeded 29, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 84 s (01:24), completed Dec 15, 2025 2:56:42 PM
```
#### RISCOF Compliance test

## Reference
- https://github.com/sysprog21/ca2025-mycpu
- https://hackmd.io/@sysprog/2025-arch-homework3
- https://docs.riscv.org/reference/isa/unpriv/unpriv-index.html
- https://docs.riscv.org/reference/isa/priv/priv-index.html
- https://docs.amd.com/v/u/en-US/ug473_7Series_Memory_Resources
- https://electronics.stackexchange.com/questions/73398/gated-clocks-and-clock-enables-in-fpga-and-asics