---
title: 'Assignment 3: Your Own RISC-V CPU'
disqus: hackmd
---
Assignment 3: Your Own RISC-V CPU
===
>[!Note] AI tools usage
>I use ChatGPT to assist with Quiz 1 by providing code explanations, grammar revisions, pre-work research, code summaries, and explanations of standard RISC-V instruction usage.
> GitHub code for this assignment: [Link](https://github.com/scarlett910/ca2025-mycpu)
## 1. Single-cycle CPU
#### Testcases for exercise 1
* Unit Tests:
  * InstructionFetchTest:
    Ensures that the Program Counter (PC) updates correctly, including normal sequential execution (PC + 4) and correct behavior for jump instructions.
  * InstructionDecoderTest:
    Checks whether the decoder accurately interprets all RV32I instruction formats (R, I, S, B, U, J) and generates the correct control signals, such as ALU input selection and register-write enables.
  * ExecuteTest:
    Validates ALU functionality (arithmetic, logical, shift, and comparison operations) and verifies the branch decision logic.
  * RegisterFileTest:
    Tests register read/write operations, ensuring that register x0 always returns zero and that write-through behavior functions properly.
* Integration Tests (CPUTest.scala):
  * FibonacciTest:
    Runs a recursive Fibonacci program (`fibonacci.asmbin`) to verify correct function calls, stack usage, and control flow.
  * QuicksortTest:
    Executes a quicksort implementation (`quicksort.asmbin`) to sort 10 integers, testing more complex branching, loops, and memory operations.
  * ByteAccessTest:
    Runs the byte-access program (`sb.asmbin`) to validate correct behavior of the `lb` and `sb` instructions, including alignment handling and proper sign extension.

After filling in the blanks for all nine exercises of the single-cycle CPU, I ran `make test` to run all the tests.
The result:
```
cd .. && sbt "project singleCycle" test
[info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29)
[info] loading project definition from /home/harrypotter/ca2025-mycpu/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/harrypotter/ca2025-mycpu/)
[info] set current project to mycpu-single-cycle (in build file:/home/harrypotter/ca2025-mycpu/)
[info] compiling 3 Scala sources to /home/harrypotter/ca2025-mycpu/common/target/scala-2.13/classes ...
[info] compiling 12 Scala sources to /home/harrypotter/ca2025-mycpu/1-single-cycle/target/scala-2.13/classes ...
[info] compiling 7 Scala sources to /home/harrypotter/ca2025-mycpu/1-single-cycle/target/scala-2.13/test-classes ...
[info] InstructionDecoderTest:
[info] InstructionDecoder
[info] - should decode RV32I instructions and generate correct control signals
[info] ByteAccessTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly handle byte-level store/load operations (SB/LB)
[info] InstructionFetchTest:
[info] InstructionFetch
[info] - should correctly update PC and handle jumps
[info] ExecuteTest:
[info] Execute
[info] - should execute ALU operations and branch logic correctly
[info] FibonacciTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute recursive Fibonacci(10) program
[info] RegisterFileTest:
[info] RegisterFile
[info] - should correctly read previously written register values
[info] - should keep x0 hardwired to zero (RISC-V compliance)
[info] - should support write-through (read during write cycle)
[info] QuicksortTest:
[info] Single Cycle CPU - Integration Tests
[info] - should correctly execute Quicksort algorithm on 10 numbers
[info] Run completed in 1 minute, 18 seconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
**RISCOF Compliance Test**
`make compliance` result:

**Waveform Analysis**
I use waveform analysis to demonstrate that my implementation is correct, by showing how the key signals of each component change when different instructions are executed.
**1. Instruction Fetch:**
Exercise 9: PC Update Logic - Sequential vs Control Flow
```scala=
// ============================================================
// [CA25: Exercise 9] PC Update Logic - Sequential vs Control Flow
// ============================================================
when(io.instruction_valid) {
io.instruction := io.instruction_read_data
pc := Mux(io.jump_flag_id, io.jump_address_id, pc + 4.U)
}.otherwise {
// When instruction is invalid, hold PC and insert NOP (ADDI x0, x0, 0)
// NOP = 0x00000013 allows pipeline to continue safely without side effects
pc := pc
io.instruction := 0x00000013.U // NOP: prevents illegal instruction execution
}
io.instruction_address := pc
}
```

Observation from the waveform:
| Test Case | Waveform Evidence | Expected | Actual |
| ----------- | -------------------------- | ----------- | ------------------- |
| Sequential | pc += 4 | PC + 4 | 1000→1004→1008→100c |
| Jump (4ps) | jump_flag=1, target=0x1000 | PC = 0x1000 | 0x1004→0x1000 |
| Jump (8ps) | jump_flag=1, target=0x1000 | PC = 0x1000 | 0x1004→0x1000 |
| Jump (22ps) | jump_flag=1, target=0x1000 | PC = 0x1000 | 0x1010→0x1000 |
**Code Logic**: `Mux(jump_flag_id, jump_address_id, pc + 4.U)`
* When `jump_flag = 0`: Select `pc + 4`
* When `jump_flag = 1`: Select `jump_address_id`
**Result**: Implementation correct, all behaviors match specification
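
The PC update rule above can be cross-checked with a small software model. This is a minimal Python sketch, not part of the Chisel design: the function name `fetch_step` and its argument order are assumptions made for illustration, while the NOP value and the mux behaviour follow the Exercise 9 code.

```python
NOP = 0x00000013  # ADDI x0, x0, 0, as used in the Exercise 9 code

def fetch_step(pc, instruction_valid, instruction_read_data, jump_flag, jump_address):
    """Model one clock edge of the fetch stage; returns (next_pc, instruction)."""
    if not instruction_valid:
        # Invalid instruction: hold the PC and issue a NOP.
        return pc, NOP
    # Valid instruction: jump target wins over sequential PC + 4 (32-bit wrap).
    next_pc = jump_address if jump_flag else (pc + 4) & 0xFFFFFFFF
    return next_pc, instruction_read_data
```

With the waveform values from the table, `fetch_step(0x1010, True, inst, True, 0x1000)` yields a next PC of `0x1000`, and the sequential case steps `0x1000` to `0x1004`.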
**2. Instruction Decode:**
Exercise 1: Immediate Extension - RISC-V Instruction Encoding

| Time | Instruction | Type | ImmI | ImmS | ImmB | ImmJ | ImmU | io_ex_immediate |
| ---- | ----------- | ---- | ----- | ---- | ---- | ---- | ------ | --------------- |
| 2ps | lw | I | 0x04 | | | | | 0x04 |
| 4ps | sw | S | | 0x04 | | | | 0x04 |
| 6ps | andi | I | 0x018 | | | | | 0x018 |
| 8ps | bge | B | | | 0x10 | | | 0x10 |
| 12ps | lui | U | | | | | 0x2000 | 0x2000 |
| 14ps | jal | J | | | | 0x08 | | 0x08 |
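
The immediate values in the table can be reproduced in software. The following is a hedged Python sketch of the five RV32I immediate formats; the helper names (`bits`, `sign_extend`, `imm_i`, and so on) are my own and do not come from the Chisel source. Results are returned as unsigned 32-bit values.

```python
def bits(x, hi, lo):
    """Extract x[hi:lo] (inclusive), like x(hi, lo) in Chisel."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Sign-extend a `width`-bit value to an unsigned 32-bit result."""
    sign = 1 << (width - 1)
    return ((value ^ sign) - sign) & 0xFFFFFFFF

def imm_i(inst):
    return sign_extend(bits(inst, 31, 20), 12)

def imm_s(inst):
    return sign_extend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)

def imm_b(inst):
    raw = (bits(inst, 31, 31) << 12) | (bits(inst, 7, 7) << 11) \
        | (bits(inst, 30, 25) << 5) | (bits(inst, 11, 8) << 1)
    return sign_extend(raw, 13)

def imm_u(inst):
    return inst & 0xFFFFF000

def imm_j(inst):
    raw = (bits(inst, 31, 31) << 20) | (bits(inst, 19, 12) << 12) \
        | (bits(inst, 20, 20) << 11) | (bits(inst, 30, 21) << 1)
    return sign_extend(raw, 21)

# 0x00400513 = addi a0, x0, 4      -> I-immediate 4
# 0x0062f863 = bgeu t0, t1, +16    -> B-immediate 16
# 0xff5ff06f = jal  x0, -12        -> J-immediate 0xFFFFFFF4
```
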

Exercise 2: Control Signal Generation
**Code Implementation:**
```scala=
// ============================================================
// [CA25: Exercise 2] Control Signal Generation
// ============================================================
// TODO: Determine when to write back from Memory
when(isLoad) {
wbSource := RegWriteSource.Memory
}
// TODO: Determine when to write back PC+4
.elsewhen(isJal || isJalr) {
wbSource := RegWriteSource.NextInstructionAddress
}
```

| isLoad | isJal | isJalr | wb_source | Result Source |
|--------|-------|--------|-----------|----------------------|
| 0 | 0 | 0 | 0 | ALU result |
| **1** | 0 | 0 | **1** | **Memory read data** |
| 0 | **1** | 0 | **2** | **PC + 4** |
| 0 | 0 | **1** | **2** | **PC + 4** |

**Code Implementation:**
```scala=
// TODO: Determine when to use PC as first operand
when(isBranch || isAuipc || isJal) {
aluOp1Sel := ALUOp1Source.InstructionAddress
}
val needsImmediate = isLoad || isStore || isOpImm || isBranch || isLui || isAuipc || isJal || isJalr
val aluOp2Sel = WireDefault(ALUOp2Source.Register)
// TODO: Determine when to use immediate as second operand
when(needsImmediate) {
aluOp2Sel := ALUOp2Source.Immediate
}
```

| isBranch | isAuipc | isJal | Condition (OR) | aluop1_src | Result |
|----------|---------|-------|----------------|------------|---------------|
| 0 | 0 | 0 | 0 | **0** | **Register rs1** |
| **1** | 0 | 0 | **1** | **1** | **PC** |
| 0 | **1** | 0 | **1** | **1** | **PC** |
| 0 | 0 | **1** | **1** | **1** | **PC** |
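
The two truth tables above can be summarized as one lookup on the opcode. This Python sketch is an illustration only: the function name `control_signals` is hypothetical, the opcode constants are the standard RV32I major opcodes, and the numeric encodings (0, 1, 2) are taken from the tables rather than from the Chisel enums.

```python
# Standard RV32I major opcodes (instruction bits 6:0).
LOAD, OP_IMM, AUIPC, STORE, OP = 0x03, 0x13, 0x17, 0x23, 0x33
LUI, BRANCH, JALR, JAL = 0x37, 0x63, 0x67, 0x6F

def control_signals(opcode):
    """Return (wb_source, aluop1_src, aluop2_src) using the table encodings."""
    # Write-back source: 0 = ALU result, 1 = memory read data, 2 = PC + 4.
    if opcode == LOAD:
        wb_source = 1
    elif opcode in (JAL, JALR):
        wb_source = 2
    else:
        wb_source = 0
    # First ALU operand: 1 = PC for branch/AUIPC/JAL, 0 = register rs1.
    aluop1_src = 1 if opcode in (BRANCH, AUIPC, JAL) else 0
    # Second ALU operand: 1 = immediate, 0 = register rs2 (only R-type here).
    needs_immediate = opcode in (LOAD, STORE, OP_IMM, BRANCH, LUI, AUIPC, JAL, JALR)
    aluop2_src = 1 if needs_immediate else 0
    return wb_source, aluop1_src, aluop2_src
```

Note that JALR writes back PC + 4 but still takes rs1 as its first operand, which is why it is absent from the `aluop1_src` condition.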

**3. Execute**


Observation from the waveform:
Exercise 3: ALU Control Logic - Opcode/Funct3/Funct7 Decoding
| | opcode | funct3 | funct7 | Output |
| -------------------- | ------ | ------ | --------- | ------------ |
| R-type ADD | 0x33 | 0x0 | 0 | ADD function |
| Branch (default ADD) | 0x63 | X | X | ADD function |

**Conclusion**: ALU Control correctly decodes instruction type and generates appropriate ALU function codes
Exercise 4: Branch Comparison Logic
```scala=
// ============================================================
// [CA25: Exercise 4] Branch Comparison Logic
// ============================================================
val branchCondition = MuxLookup(funct3, false.B)(
Seq(
// TODO: Implement six branch conditions
InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data),
InstructionsTypeB.bne -> (io.reg1_data =/= io.reg2_data),
// Signed comparison (need conversion to signed type)
InstructionsTypeB.blt -> (io.reg1_data.asSInt < io.reg2_data.asSInt),
InstructionsTypeB.bge -> (io.reg1_data.asSInt >= io.reg2_data.asSInt),
// Unsigned comparison
InstructionsTypeB.bltu -> (io.reg1_data < io.reg2_data),
InstructionsTypeB.bgeu -> (io.reg1_data >= io.reg2_data)
)
)
val isBranch = opcode === InstructionTypes.Branch
val isJal = opcode === Instructions.jal
val isJalr = opcode === Instructions.jalr
```
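
The difference between the signed and unsigned comparisons is easiest to see with a concrete value such as `0xFFFFFFFF`, which is -1 when signed but the largest value when unsigned. Below is a small Python sketch of the six conditions; `to_signed` and `branch_condition` are illustrative names, and the `funct3` values are the standard RV32I branch encodings.

```python
def to_signed(x):
    """Interpret a 32-bit unsigned value as two's-complement signed."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def branch_condition(funct3, reg1, reg2):
    """Evaluate the branch condition; unknown funct3 values yield False."""
    reg1 &= 0xFFFFFFFF
    reg2 &= 0xFFFFFFFF
    conditions = {
        0b000: reg1 == reg2,                        # beq
        0b001: reg1 != reg2,                        # bne
        0b100: to_signed(reg1) < to_signed(reg2),   # blt  (signed)
        0b101: to_signed(reg1) >= to_signed(reg2),  # bge  (signed)
        0b110: reg1 < reg2,                         # bltu (unsigned)
        0b111: reg1 >= reg2,                        # bgeu (unsigned)
    }
    return conditions.get(funct3, False)
```

For `reg1 = 0xFFFFFFFF` and `reg2 = 1`, `blt` is taken (-1 < 1) while `bltu` is not.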
Exercise 5: Jump Target Address Calculation
```scala=
// ============================================================
// [CA25: Exercise 5] Jump Target Address Calculation
// ============================================================
// TODO: Complete the following address calculations
val branchTarget = io.instruction_address + io.immediate
val jalTarget = branchTarget // JAL and Branch use same calculation method
val jalrSum = io.reg1_data + io.immediate
// TODO: Clear LSB using bit concatenation
val jalrTarget = Cat(jalrSum(Parameters.DataBits - 1, 1), 0.U(1.W))
val branchTaken = isBranch && branchCondition
io.if_jump_flag := branchTaken || isJal || isJalr
io.if_jump_address := Mux(
isJalr,
jalrTarget,
Mux(isJal, jalTarget, branchTarget)
)
}
```
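
As a quick check of the address arithmetic, here is a minimal Python sketch of the target calculation. The function `jump_target` and its string `kind` argument are assumptions made for this example; the immediate is assumed to be already sign-extended to 32 bits.

```python
MASK32 = 0xFFFFFFFF

def jump_target(kind, pc, rs1, imm):
    """kind is 'branch', 'jal' or 'jalr'; all values are unsigned 32-bit."""
    if kind == "jalr":
        # JALR: (rs1 + imm) with the least significant bit cleared.
        return ((rs1 + imm) & MASK32) & ~1
    # JAL and taken branches: PC-relative target.
    return (pc + imm) & MASK32
```

For example, `jump_target("jalr", 0, 0x1003, 4)` gives `0x1006`, since `0x1007` has its LSB cleared.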
Observation from the waveform:
| Test Case | Time | Instruction | Condition | Expected | Actual | Status |
|-----------|------|-------------|-----------|----------|--------|--------|
| **BEQ FALSE** | 204-206ps | BEQ (unequal) | reg1 $\neq$ reg2 | Not taken | branchTaken=0, jump_flag=0 | ✓ PASS |
| **BEQ TRUE** | 206ps | BEQ (equal) | reg1 = reg2 | **Taken** | **branchTaken=1, jump_flag=1** | **✓ PASS** |

**4. Memory Access**
Exercise 6: Load Data Extension - Sign and Zero Extension
```scala=
// ============================================================
// [CA25: Exercise 6] Load Data Extension - Sign and Zero Extension
// ============================================================
// TODO: Complete sign/zero extension for load operations
io.wb_memory_read_data := MuxLookup(io.funct3, 0.U)(
Seq(
// TODO: Complete LB (sign-extend byte)
// Hint: Replicate sign bit, then concatenate with byte
InstructionsTypeL.lb -> Cat(Fill(24, byte(7)), byte),
// TODO: Complete LBU (zero-extend byte)
// Hint: Fill upper bits with zero, then concatenate with byte
InstructionsTypeL.lbu -> Cat(0.U(24.W), byte),
// TODO: Complete LH (sign-extend halfword)
// Hint: Replicate sign bit, then concatenate with halfword
InstructionsTypeL.lh -> Cat(Fill(16, half(15)), half),
// TODO: Complete LHU (zero-extend halfword)
// Hint: Fill upper bits with zero, then concatenate with halfword
InstructionsTypeL.lhu -> Cat(0.U(16.W), half),
// LW: Load full word, no extension needed (completed example)
InstructionsTypeL.lw -> data
)
)
```

Observation from the waveform:
- At 51ps: io_funct3 = 2 → LW
```
io.memory_read_enable = 1
memory_read_data = 0x000000ef
val data = io.memory_bundle.read_data //= 0x000000ef
io.wb_memory_read_data := data //= 0x000000ef -> Correct outcome
```
The waveform shows that the code in TODO gives the correct result.
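
The other load variants can be checked the same way. The sketch below models the extension logic in Python; the name `load_extend` is hypothetical, and selecting the byte or halfword from the low address bits is an assumption, because that part of the Chisel code is not shown in the excerpt.

```python
def load_extend(funct3, word, byte_index=0):
    """Return the 32-bit write-back value for LB/LH/LW/LBU/LHU."""
    word &= 0xFFFFFFFF
    # Assumed lane selection: address bits [1:0] pick the byte, bit [1] the half.
    byte = (word >> (8 * byte_index)) & 0xFF
    half = (word >> (16 * (byte_index >> 1))) & 0xFFFF
    if funct3 == 0b000:  # LB: replicate bit 7 into the upper 24 bits
        return byte | (0xFFFFFF00 if byte & 0x80 else 0)
    if funct3 == 0b100:  # LBU: upper 24 bits are zero
        return byte
    if funct3 == 0b001:  # LH: replicate bit 15 into the upper 16 bits
        return half | (0xFFFF0000 if half & 0x8000 else 0)
    if funct3 == 0b101:  # LHU: upper 16 bits are zero
        return half
    if funct3 == 0b010:  # LW: full word, no extension
        return word
    return 0             # default, as in MuxLookup(io.funct3, 0.U)
```

With the waveform value `0x000000ef`, LW returns `0x000000ef` as observed, whereas LB on the same byte would return `0xffffffef`.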
Exercise 7: Store Data Alignment - Byte Strobes and Shifting

- At 43ps: io_funct3 = 0 → SB
```
io.memory_write_enable = 1
reg2_data = 0xdeadbeef
alu_result = 4 → mem_address_index = 0
is(InstructionsTypeS.sb) {
writeStrobes(mem_address_index) := true.B //= 1 -> correct
writeData := data(7, 0) << (mem_address_index << 3.U) //= 0xef << 0 = 0x000000ef -> correct
}
```
The waveform shows that the code in TODO gives the correct result.
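
A software model makes the strobe and shift pattern easier to check for other byte offsets. This is a rough Python sketch: `store_align` is an illustrative name, the SB case follows the excerpt above, and the SH case (lane chosen by address bit 1) is an assumption because it is not shown in this report.

```python
def store_align(funct3, address, data):
    """Return (write_strobes, write_data) for SB/SH/SW."""
    index = address & 0x3  # byte offset inside the 32-bit word
    if funct3 == 0b000:    # SB: one strobe, byte shifted into its lane
        strobes = [i == index for i in range(4)]
        return strobes, ((data & 0xFF) << (8 * index)) & 0xFFFFFFFF
    if funct3 == 0b001:    # SH (assumed): two strobes for the selected half
        upper = index >> 1
        strobes = [(i >> 1) == upper for i in range(4)]
        return strobes, ((data & 0xFFFF) << (16 * upper)) & 0xFFFFFFFF
    if funct3 == 0b010:    # SW: all four strobes, data unchanged
        return [True] * 4, data & 0xFFFFFFFF
    return [False] * 4, 0
```

For the waveform case (`address = 4`, `data = 0xdeadbeef`, SB) the model gives strobe 0 only and write data `0x000000ef`; at address 5 the same byte would move to lane 1 as `0x0000ef00`.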
**5. Write Back**
```scala=
class WriteBack extends Module {
val io = IO(new Bundle() {
val instruction_address = Input(UInt(Parameters.AddrWidth))
val alu_result = Input(UInt(Parameters.DataWidth))
val memory_read_data = Input(UInt(Parameters.DataWidth))
val regs_write_source = Input(UInt(2.W))
val regs_write_data = Output(UInt(Parameters.DataWidth))
})
//============================================================
// [CA25: Exercise 8] WriteBack Source Selection
//============================================================
// TODO: Complete MuxLookup to multiplex writeback sources
io.regs_write_data := MuxLookup(io.regs_write_source, io.alu_result)(
Seq(
RegWriteSource.Memory -> io.memory_read_data,
RegWriteSource.NextInstructionAddress -> (io.instruction_address + 4.U)
)
)
}
```

- Default: regs_write_source = 0 → ALU result
```
alu_result = 104c
regs_write_source = 0
instruction address = 103c
memory_read_data = 0
io.regs_write_data := MuxLookup(io.regs_write_source, io.alu_result)
//= 104c -> match with waveform: regs_write_data = 104c
```
- regs_write_source = 1 → memory data

```
regs_write_source = 1
instruction address = 106c
memory_read_data = 0000000a
...
RegWriteSource.Memory -> io.memory_read_data,
//= 0000000a -> match with waveform: regs_write_data = 0000000a
```
- regs_write_source = 2 → PC + 4

```
regs_write_source = 2
instruction address = 10ec
memory_read_data = 00000000
...
RegWriteSource.NextInstructionAddress -> (io.instruction_address + 4.U)
//PC + 4 = 0x10ec + 4 = 0x10f0
//match with waveform: regs_write_data = 000010f0
```
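
The three cases can be replayed with a tiny Python model of the write-back mux. The function name `write_back` is an assumption for this sketch; the source encodings 0, 1 and 2 are the ones observed in the waveform.

```python
def write_back(regs_write_source, alu_result, memory_read_data, instruction_address):
    """Select the register write data; anything but 1 or 2 falls back to the ALU result."""
    if regs_write_source == 1:   # RegWriteSource.Memory
        return memory_read_data
    if regs_write_source == 2:   # RegWriteSource.NextInstructionAddress
        return (instruction_address + 4) & 0xFFFFFFFF
    return alu_result            # default of the MuxLookup
```
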
#### Integration test results
* Fibonacci.c:

```
hexdump -e src/main/resources/fibonacci.asmbin | head -1
00001197 91418193 00001137 00000297 10828293 00000317 10030313 0062f863 0002a023 00428293 ff5ff06f 00000297 0e828293 00000317 0e030313 0062f863 0002a023 00428293 ff5ff06f 084000ef 0000006f fe010113 00112e23 00812c23 00912a23 02010413 fea42623 fec42703 00100793 00f70863
```
Conclusion: The fibonacci waveform values in `io_instruction_read_data` match the expected instruction words of the program.
* Quicksort.c:

```
hexdump -e src/main/resources/quicksort.asmbin | head -1
00002197 82818193 00001137 00001297 01c28293 00001317 01430313 0062f863 0002a023 00428293 ff5ff06f 00001297 ffc28293 00001317 ff430313 0062f863 0002a023 00428293 ff5ff06f 184000ef 0000006f fd010113 02112623 02812423 03010413 fca42e23 fcb42c23 fcc42a23 fd842703 fd442783
```
Conclusion: The quicksort waveform values in `io_instruction_read_data` match the expected instruction words of the program.
* sb.S:

```
hexdump -e src/main/resources/sb.asmbin | head -1
00400513 deadc2b7 eef28293 00550023 00052303 01500913 012500a3 00052083 0000006f
```
Conclusion: The sb waveform values in `io_instruction_read_data` match the expected instruction words of the program.
## 2. RISC-V CPU with MMIO Peripherals and Trap Handling
### 1. Nyancat VGA Display Demo
The Nyancat animation runs with `make demo`, which uses Verilator with SDL2.

### 2. Further Compression for Nyan program
**Current Implementation Analysis**
The Nyancat animation uses Delta-RLE (Run-Length Encoding with Delta Frames) compression, implemented in `scripts/gen-nyancat-data.py`.
Compression Statistics:
- Uncompressed: 24,576 bytes (12 frames × 64×64 pixels × 4-bit color)
- Compressed: 4,755 bytes
- Compression Ratio: 5.17× (80.7% size reduction)
- Method: Frame 0 uses baseline RLE; Frames 1-11 use delta encoding
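
To illustrate why the two ideas combine well, here is a toy Python sketch of run-length encoding plus delta frames. It is only a conceptual model: the opcode layout actually used by `scripts/gen-nyancat-data.py` is not reproduced here, and the names `rle_encode`, `delta_encode` and `delta_decode` are made up for this example.

```python
def rle_encode(pixels):
    """Baseline RLE: list of (run_length, color) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][1] == p:
            runs[-1] = (runs[-1][0] + 1, p)
        else:
            runs.append((1, p))
    return runs

def delta_encode(prev, cur):
    """Delta frame: runs of ('skip',) for unchanged pixels or ('set', color)."""
    ops = []
    for old, new in zip(prev, cur):
        op = ("skip",) if old == new else ("set", new)
        if ops and ops[-1][0] == op:
            ops[-1] = (op, ops[-1][1] + 1)
        else:
            ops.append((op, 1))
    return ops

def delta_decode(prev, ops):
    """Rebuild the current frame from the previous frame and the delta ops."""
    out, i = [], 0
    for op, count in ops:
        for _ in range(count):
            out.append(prev[i] if op[0] == "skip" else op[1])
            i += 1
    return out

# Uncompressed size from the statistics above: 12 frames, 64x64 pixels, 4 bits each.
UNCOMPRESSED_BYTES = 12 * 64 * 64 * 4 // 8
```

When consecutive frames are mostly identical, a delta frame collapses into a few long `skip` runs, which is where most of the saving comes from.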
**Additional Compression**
Several alternative approaches were evaluated:
1. Palette Quantization (14 → 8 colors)
* Approach: Merge similar colors (e.g., red/orange, yellow/green)
* Result: No size reduction
* Reason: Opcode count remains unchanged; only color indices are remapped
2. Opcode Merging After Remapping
* Approach: Decompress → remap colors → re-compress with merged operations
* Result: Increased size (72KB vs 4.7KB)
* Reason: Decompression/re-compression introduces inefficiencies
3. Advanced Algorithms (LZ77, Huffman, LZMA)
* LZ77: ~5,200 bytes (worse than current)
* Huffman: ~4,900 bytes (about 3% larger than the current 4,755 bytes)
* LZMA: ~3,800 bytes (20% improvement but 10× slower decompression)
* Conclusion: Trade-offs unacceptable
**Conclusion:** The existing compression is a good trade-off for this application because:
1. Temporal Coherence: Rainbow animation has high frame-to-frame similarity, making delta encoding highly effective
1. Spatial Coherence: Large solid-color regions benefit from RLE
1. Hardware Constraints: VGA peripheral expects uncompressed 4-bit pixels; no hardware decompression support
1. Real-time Requirements: Decompression must complete within frame time (50ms at 20Hz)
1. Code Quality: Instructor's implementation is production-grade
## 3. Pipelined RISC-V CPU
After filling in the blanks for the six pipeline exercises (16-21), I ran `make test` to run all the tests.
The result:
```
make test
cd .. && sbt "project pipeline" test
[info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29)
[info] loading project definition from /home/harrypotter/ca2025-mycpu/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/harrypotter/ca2025-mycpu/)
[info] set current project to mycpu-pipeline (in build file:/home/harrypotter/ca2025-mycpu/)
[info] compiling 61 Scala sources to /home/harrypotter/ca2025-mycpu/3-pipeline/target/scala-2.13/classes ...
[info] compiling 7 Scala sources to /home/harrypotter/ca2025-mycpu/3-pipeline/target/scala-2.13/test-classes ...
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Stalling
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Forwarding
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] Five-stage Pipelined CPU with Reduced Branch Delay
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] PipelineUartTest:
[info] Three-stage Pipelined CPU UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Stalling UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Forwarding UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] PipelineRegisterTest:
[info] Pipeline Register
[info] - should be able to stall and flush
[info] Run completed in 4 minutes, 18 seconds.
[info] Total number of tests run: 29
[info] Suites: completed 3, aborted 0
[info] Tests: succeeded 29, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
### Exercise 17 & 18: Data Hazard Analysis - Forwarding
This analysis examines how data hazards are resolved through forwarding in the FiveStageFinal pipeline implementation, using the `hazard.asmbin` test program.
**EX-Stage Forwarding** (Exercise 17)
Hazard Scenario: RAW (Read-After-Write) Dependency
Test Program: `hazard.asmbin` contains deliberately crafted hazard scenarios.

* **MEM-to-EX Forwarding**
Time Window: ~20ps
Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_mem [4:0] | 0x0a | Instruction in MEM writes to x10
io_reg_write_enable_mem | 1 | Write enabled in MEM stage
io_rs2_ex [4:0] | 0x01 | Instruction in EX reads from x1
io_reg2_forward_ex [1:0] | 1 | ForwardFromMEM activated

**Analysis:**
* Hazard Type: RAW (Read-After-Write) dependency
* Detection: `rd_mem == rs_ex` and `reg_write_enable_mem == 1`
* Resolution: Forwarding unit selects MEM stage output instead of register file
* Impact: Eliminates 1-cycle stall penalty
* **WB-to-EX Forwarding**
Time Window: ~57ps
Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_wb [4:0] | 0x07 | Instruction in WB writes to x7
io_reg_write_enable_wb | 1 | Write enabled in WB stage
io_rs1_ex [4:0] | 0x06 | Instruction in EX reads from x6
io_rs2_ex [4:0] | 0x07 | Instruction in EX reads from x7
io_reg1_forward_ex [1:0] | 0 | No forward for rs1
io_reg2_forward_ex [1:0] | 2 | ForwardFromWB for rs2

**Analysis:**
* Scenario: Instruction needs data that just finished WB stage
* Forwarding Logic: Select WB stage output over register file
* Cycle Savings: Without forwarding, would need 2-cycle stall
* **Multiple Simultaneous Forwards**
Time Window: ~45ps
Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_mem [4:0] | 0x07 | MEM stage writes x7
io_rd_wb [4:0] | 0x06 | WB stage writes x6
io_rs1_ex [4:0] | 0x06 | EX needs x6
io_rs2_ex [4:0] | 0x07 | EX needs x7
io_reg1_forward_ex [1:0] | 2 | Forward x6 from WB
io_reg2_forward_ex [1:0] | 1 | Forward x7 from MEM

**Analysis:**
* Complex Scenario: Both ALU operands require forwarding from different stages
* Forwarding Unit Behavior:
- rs1 (x6): Forward from WB (older instruction)
- rs2 (x7): Forward from MEM (newer instruction)
* Priority Handling: MEM forwarding takes precedence over WB when both available
* Performance Impact: Without dual forwarding, would require multiple stalls
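
The priority rule can be written down as a short decision function. The Python sketch below is a simplified model, not the Chisel forwarding unit: `forward_select` is a hypothetical name, the encodings 0/1/2 follow the waveform tables, and the `rd != 0` guard (never forward x0) is the usual convention that I am assuming here.

```python
NO_FORWARD, FORWARD_FROM_MEM, FORWARD_FROM_WB = 0, 1, 2

def forward_select(rs, rd_mem, reg_write_enable_mem, rd_wb, reg_write_enable_wb):
    """Pick the forwarding source for one source register."""
    # MEM holds the newer result, so it is checked first.
    if reg_write_enable_mem and rd_mem != 0 and rd_mem == rs:
        return FORWARD_FROM_MEM
    if reg_write_enable_wb and rd_wb != 0 and rd_wb == rs:
        return FORWARD_FROM_WB
    return NO_FORWARD
```

Using the ~45ps values (`rd_mem = 7`, `rd_wb = 6`), the model returns 2 for `rs1 = 6` and 1 for `rs2 = 7`, matching `io_reg1_forward_ex` and `io_reg2_forward_ex`.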
**ID-Stage Forwarding** (Exercise 18)
This forwarding path exists for early branch resolution: branches compare their registers in the ID stage, so if a branch operand has not been written back yet, forwarding it to ID avoids additional stall cycles.
* **WB-to-ID Forwarding**
Time Window: ~57ps
Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_wb [4:0] | 0x07 | WB stage writes x7
io_rs1_id [4:0] | 0x07 | ID reads x7 MATCH!
io_rs2_id [4:0] | 0x1c | ID reads x28
io_reg1_forward_id [1:0] | 2 | ForwardFromWB
io_reg2_forward_id [1:0] | 0 | No forward

**Analysis:**
* Hazard: Branch in ID needs x7 which is completing WB
* Detection: `rd_wb (0x07) == rs1_id (0x07)` and `reg_write_enable_wb = 1`
* Resolution: Forward x7 from WB stage to ID comparator
* Benefit: Branch resolves immediately in ID without stalling
**Performance Impact:**
* Without ID forwarding: 2-cycle stall (wait for WB → register file → ID read)
* With ID forwarding: 0-cycle penalty (immediate comparison)
* **MEM-to-ID Forwarding**
**Time Window: ~49ps**
Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_mem [4:0] | 0x07 | MEM stage writes x7
io_rd_wb [4:0] | 0x07 | WB stage also writes x7
io_rs1_id [4:0] | 0x06 | ID reads x6 (no hazard)
io_rs2_id [4:0] | 0x07 | ID reads x7
io_reg1_forward_id [1:0] | 0 | No forward for rs1
io_reg2_forward_id [1:0] | 1 | ForwardFromMEM

**Analysis:**
* Hazard: Branch in ID needs x7 which is still in MEM stage (not yet in WB/register file)
* Detection: `rd_mem (0x07) == rs2_id (0x07)` and `reg_write_enable_mem = 1`
* Resolution: Forward x7 from MEM stage directly to ID comparator
* Priority: MEM forward (value=1) chosen over WB forward because MEM has more recent result
**Performance Impact:**
* Without forwarding: 1-2 cycle stall
* With MEM-to-ID forwarding: 0-cycle penalty
### Exercise 19: Control Hazard Analysis
Control hazards occur when pipeline execution sequence changes due to branches/jumps. This analysis examines hazard detection and pipeline flush mechanisms.

* **Branch Hazard - IF Flush Only**
**Time Window: ~25ps**
Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_jump_flag | 1 | Branch/Jump taken
io_if_flush | 1 | Flush IF stage
io_id_flush | 0 | Keep ID stage
io_pc_stall | 0 | PC redirected (not stalled)
io_if_stall | 0 | IF/ID not stalled

**Analysis:**
* Hazard: Branch taken, PC redirected to target address
* Detection: `jump_flag = 1` indicates branch/jump condition TRUE
* Resolution: Flush IF stage only
* Impact: 1-cycle penalty (one bubble inserted)
* **Load-Use Hazard with Stall**
**Time Window: ~49ps**
Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_memory_read_enable_ex | 1 | Load in EX stage
io_rd_ex [4:0] | 0x07 | Load writes to x07
io_id_flush | 1 | Insert bubble
io_pc_stall | 1 | Freeze PC
io_if_stall | 1 | Freeze IF/ID

**Analysis:**
* Hazard: Instruction in ID needs data from load in EX (not ready until MEM)
* Detection: `memory_read_enable_ex = 1` AND `rd_ex == rs_id`
* Resolution: Stall pipeline 1 cycle (bubble + freeze)
* Impact: 1-cycle penalty, forward from MEM next cycle
* `io_id_flush = io_pc_stall = io_if_stall = 1` : All three toggle together for load-use hazards
**Control Signal Patterns**
| Scenario | io_if_flush | io_id_flush | io_pc/if_stall | Cause |
| ------------ | ----------- | ----------- | -------------- | ----------------------- |
| Branch Taken | 1 | 0 | 0 | `jump_flag = 1` |
| Load-Use | 0 | 1 | 1 | `mem_read_ex && rd==rs` |
| Normal | 0 | 0 | 0 | No hazard |
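
The three rows of the table can be expressed as a small decision function. This Python sketch covers only those three scenarios and is not the full hazard unit: the name `hazard_control`, the `rd_ex != 0` guard, and giving the load-use stall priority over a taken branch are all assumptions for illustration.

```python
def hazard_control(jump_flag, memory_read_enable_ex, rd_ex, rs1_id, rs2_id):
    """Return the flush/stall signals for the three scenarios in the table."""
    load_use = (
        memory_read_enable_ex
        and rd_ex != 0
        and rd_ex in (rs1_id, rs2_id)
    )
    if load_use:
        # Bubble into ID/EX, freeze the PC and the IF/ID register.
        return {"if_flush": 0, "id_flush": 1, "pc_stall": 1, "if_stall": 1}
    if jump_flag:
        # Taken branch or jump: drop the wrong-path instruction in IF.
        return {"if_flush": 1, "id_flush": 0, "pc_stall": 0, "if_stall": 0}
    return {"if_flush": 0, "id_flush": 0, "pc_stall": 0, "if_stall": 0}
```
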
### Exercise 20: Pipeline Register Behavior

* **Stall Mechanism**
**Register Holds Value During Stall**
Time Window: 2ps - 14ps
Signal | Behavior
--------------------|----------------------------------
io_stall | 1 (asserted continuously)
io_flush | 0
io_in [31:0] | 4037c779 → 2a581587 → 0a6b9b95 → ... (changing)
io_out [31:0] | 57e348f (frozen)
reg [31:0] | 57e348f (frozen)

**Analysis:**
* Behavior: Output frozen at 0x57e348f despite input changing
* Purpose: Hold instruction during pipeline hazard
* Verification: Input ignored while io_stall = 1
* **Flush Mechanism**
**Register Outputs Default Value**
Time Window: ~14ps, ~24ps
Signal | @14ps | @24ps |
--------------------|------------|------------|
io_stall | 1→0 | 1→0 |
io_flush | 0→1 | 0→1 |
io_in [31:0] | 04c02c2b ->... | 1bc9be96 |
io_out [31:0] | 57e348f | 57e348f|

**Analysis:**
* When flush = 1, output becomes default value (0x57e348f)
* Input ignored during flush
* Same default value at both flush events
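
The stall and flush behavior seen in the waveform can be mimicked with a few lines of Python. This is a behavioral sketch only: the class name `PipelineRegister` and the choice to let flush win over stall when both are asserted are assumptions, since the waveform never shows both asserted at once.

```python
class PipelineRegister:
    """Register that can hold its value (stall) or reset to a default (flush)."""

    def __init__(self, default=0):
        self.default = default
        self.value = default

    def step(self, data_in, stall=False, flush=False):
        """Advance one clock edge and return the new output."""
        if flush:
            self.value = self.default   # output returns to the default value
        elif not stall:
            self.value = data_in        # normal operation: capture the input
        # stall without flush: keep the current value
        return self.value
```

Usage with the waveform values: a register created with default `0x57e348f` keeps that output while `stall=True` regardless of the input, and returns to it whenever `flush=True`.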
### Exercise 21: Hazard Detection Summary and Analysis
* Q1: Why do we need to stall for load-use hazards?
Answer:
Load data is not available until the MEM stage, so it cannot be forwarded while the load is still in EX (the value has not been read from memory yet). The pipeline must stall for one cycle so that the load reaches MEM; the data can then be forwarded to the dependent instruction.

Waveform Evidence (Time ~49ps):
LW in EX: memory_read_enable_ex = 1, data not ready
Dependent inst in ID: needs loaded data
→ io_id_flush = 1, io_pc_stall = 1 (stall 1 cycle)
→ Next cycle: forward from MEM
* Q2: What is the difference between "stall" and "flush" operations?
Answer:
| Operation | Effect on pipeline | Effect on pc | Purpose |
| --------- | ---------------------------- | ------------ | ------------------------------ |
| Stall | Hold register value (freeze) | Freeze pc | Wait for data dependency |
| Flush | Insert default | Redirect pc | Remove wrong path instructions |

Waveform Evidence:
Stall (2-14ps): io_out frozen at 0x57e348f, PC unchanged
Flush (14ps, 24ps): io_out = default 0x57e348f, PC redirected
* Q3: Why does jump instruction with register dependency need stall?
Answer:
The jump target address is computed from a register value (e.g. `JALR x1, offset(x2)`). If x2 has a pending data hazard, the target address is unknown until the dependency is resolved. The pipeline must stall until the register value is available, then compute the target and redirect the PC.
Example:
```
ADD x2, x3, x4 # x2 not ready
JALR x1, 0(x2) # Needs x2 for target ← Stall
```
* Q4: Why is branch penalty only 1 cycle instead of 2?
**Answer:**
Branch resolved in **ID stage** (early), not EX stage. Only IF stage contains wrong-path instruction when branch taken. Flush IF only → 1 bubble.
If branch resolved in EX: both IF and ID would have wrong-path → 2 bubbles.

**Waveform Evidence (Time ~25ps):**
io_jump_flag = 1 (branch taken in ID)
io_if_flush = 1 (flush IF only)
io_id_flush = 0 (ID has branch itself, keep it)
→ Penalty = 1 cycle
* Q5: What would happen if we removed hazard detection logic entirely?
Answer:
* Data Hazards:
* RAW hazards → Read stale data from register file instead of forwarded values
* Incorrect computation results
* Example: `ADD x1, x2, x3; SUB x4, x1, x5` → x1 not updated yet
* Control Hazards:
* Wrong-path instructions execute instead of being flushed
* Branch/jump → Continue sequential execution, corrupting registers/memory
* Example: Branch taken but ADD after branch still executes → wrong register modified
**Result:** Program produces incorrect output, pipeline not functionally correct.
## 4. Homework 2 on Pipelined CPU
For Homework 2, I chose the UF8 encode/decode problem and ran it on the pipelined RISC-V CPU. I modified the program to eliminate hazards so that it runs correctly on the pipeline.
**csrc/uf8_modified.S**
```
# UF8 Encode/Decode Test - Modified for Pipeline Testing
# Stores final result in memory address 4 for verification
.globl _start
_start:
# initialize test
li s0, 0 # test counter
li s1, 3 # number of tests (reduced for pipeline testing)
test_loop:
beq s0, s1, test_done
# load test value based on counter
beq s0, x0, test_0
li t0, 1
beq s0, t0, test_1
li t0, 2
beq s0, t0, test_2
test_0:
li a0, 15 # test value 1: small value
j do_test
test_1:
li a0, 48 # test value 2: medium value
j do_test
test_2:
li a0, 240 # test value 3: large value
j do_test
do_test:
# save original value
mv s2, a0
# encode
jal ra, uf8_encode
mv s3, a0 # s3 = encoded byte
# decode
mv a0, s3
jal ra, uf8_decode
mv s4, a0 # s4 = decoded value
# simple validation: check if decoded ≈ original
# for small values (<16): must be exact
li t0, 16
blt s2, t0, check_exact
# for larger values: allow some error
sub t0, s4, s2 # diff = decoded - original
bgez t0, diff_pos
neg t0, t0 # abs(diff)
diff_pos:
slli t0, t0, 4 # diff * 16
bgt t0, s2, test_fail # if diff*16 > original, fail
j test_pass
check_exact:
bne s4, s2, test_fail
test_pass:
addi s0, s0, 1 # Next test
j test_loop
test_fail:
# store failure indicator
li t0, 0xDEAD
sw t0, 4(x0) # Memory[4] = 0xDEAD (failure)
j end_program
test_done:
# all tests passed, store success value
li t0, 0x55 # 0x55 = success indicator
sw t0, 4(x0) # Memory[4] = 0x55
end_program:
# Infinite loop to end
j end_program
# UF8 Encode: value -> byte
# input: a0 = value to encode
# output: a0 = encoded byte (exponent in upper 4 bits, mantissa in lower 4 bits)
uf8_encode:
# Handle small values (0-15)
li t0, 16
blt a0, t0, encode_small
# Initialize for loop
li t1, 0 # exponent = 0
li t2, 0 # base_offset = 0
li t4, 15 # max_exponent = 15
encode_loop:
# calculate next threshold: base_offset + (16 << exponent)
add t3, t2, t0
bgt t3, a0, encode_done
# update for next iteration
mv t2, t3 # base_offset = threshold
slli t0, t0, 1 # threshold *= 2
addi t1, t1, 1 # exponent++
blt t1, t4, encode_loop
encode_done:
# calculate mantissa: (value - base_offset) >> exponent
sub t3, a0, t2
srl t3, t3, t1
andi t3, t3, 0x0F # mantissa (4 bits)
# combine: (exponent << 4) | mantissa
slli t1, t1, 4
or a0, t1, t3
ret
encode_small:
# value < 16, no encoding needed
ret
# UF8 Decode: byte -> value
# input: a0 = encoded byte
# output: a0 = decoded value
uf8_decode:
# extract exponent and mantissa
andi t0, a0, 0x0F # mantissa (lower 4 bits)
srli t1, a0, 4 # exponent (upper 4 bits)
# calculate offset: (2^exponent - 1) * 16
li t2, 1
sll t2, t2, t1 # 2^exponent
addi t2, t2, -1 # 2^exponent - 1
slli t2, t2, 4 # * 16
# calculate value: (mantissa << exponent) + offset
sll t0, t0, t1
add a0, t0, t2
ret
```
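The encode and decode routines above can be cross-checked against a small software model. The sketch below mirrors the assembly step by step in plain Scala. The object name `Uf8Model` and its method names are illustrative and not part of the repository, and the model assumes non-negative inputs to `encode` and byte-sized inputs (0 to 255) to `decode`.

```scala
// Hypothetical reference model of the UF8 routines (not taken from the repository).
object Uf8Model {
  // value -> byte: exponent in the upper 4 bits, mantissa in the lower 4 bits.
  def encode(value: Int): Int =
    if (value < 16) value // exponent 0: the byte equals the value
    else {
      var exponent = 0
      var base = 0   // base_offset in the assembly (t2)
      var step = 16  // 16 << exponent in the assembly (t0)
      // Advance while the next range still starts at or below the value,
      // capped at exponent 15 just like the assembly loop.
      while (exponent < 15 && base + step <= value) {
        base += step
        step <<= 1
        exponent += 1
      }
      val mantissa = ((value - base) >> exponent) & 0x0F
      (exponent << 4) | mantissa
    }

  // byte -> value: (mantissa << exponent) + (2^exponent - 1) * 16.
  def decode(byte: Int): Int = {
    val mantissa = byte & 0x0F
    val exponent = byte >>> 4 // assumes byte is in 0..255
    val offset = ((1 << exponent) - 1) << 4
    (mantissa << exponent) + offset
  }

  def main(args: Array[String]): Unit =
    for (v <- Seq(0, 15, 16, 100, 1000)) {
      val b = encode(v)
      println(f"$v%d -> 0x$b%02x -> ${decode(b)}%d")
    }
}
```

Comparing the values printed by this model with the values the CPU produces in `a0` is one way to tell a wrong program from a wrong pipeline when the test fails.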
Commands to assemble, link, and convert the new test program to a binary:
```
# 1. Assemble
riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 \
uf8_pipeline.S -o uf8_pipeline.o
# 2. Link (use link.lds!)
riscv64-unknown-elf-ld -T link.lds \
--oformat=elf32-littleriscv \
uf8_pipeline.o -o uf8_pipeline.elf
# 3. Convert to binary
riscv64-unknown-elf-objcopy -O binary \
-j .text -j .data \
uf8_pipeline.elf ../src/main/resources/uf8_test.asmbin
```
Modify the Scala test file to run the UF8 test:
```scala=
it should "correctly execute UF8 encode/decode program" in {
runProgram("uf8_test.asmbin", cfg) { c =>
// Run UF8 test (3 encode/decode cycles)
for (i <- 1 to 20) {
c.clock.step(500)
c.io.mem_debug_read_address.poke((i * 4).U)
}
// Check result in memory[4]
// 0x55 = all tests passed
// 0xDEAD = test failed
c.io.mem_debug_read_address.poke(4.U)
c.clock.step()
val result = c.io.mem_debug_read_data.peek().litValue.toInt
assert(
result == 0x55,
f"${cfg.name}: UF8 test failed! Memory[4] = 0x${result}%x (expected 0x55)"
)
}
}
```
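The `0x55` or `0xDEAD` word that this test reads from Memory[4] is decided by the assembly test loop: values below 16 must round-trip exactly, and larger values pass when the absolute error times 16 does not exceed the original, which is a relative error of at most 1/16. A minimal sketch of that criterion in plain Scala follows. The names `Uf8Check` and `roundTripOk` are illustrative and do not exist in the repository.

```scala
// Hypothetical restatement of the pass/fail rule used by the assembly test loop.
object Uf8Check {
  def roundTripOk(original: Int, decoded: Int): Boolean =
    if (original < 16) decoded == original // small values: must be exact
    else {
      val diff = math.abs(decoded - original) // abs(decoded - original)
      (diff << 4) <= original                 // fail only if diff * 16 > original
    }

  def main(args: Array[String]): Unit = {
    println(roundTripOk(101, 100)) // error of 1 on a value of 101
    println(roundTripOk(10, 11))   // small values tolerate no error
  }
}
```

Using a shift and a compare instead of a division keeps the check within RV32I, which has no divide instruction.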
Test result:
```
make test
cd .. && sbt "project pipeline" test
[info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29)
[info] loading project definition from /home/harrypotter/ca2025-mycpu/project
[info] loading settings for project root from build.sbt...
[info] set current project to mycpu-root (in build file:/home/harrypotter/ca2025-mycpu/)
[info] set current project to mycpu-pipeline (in build file:/home/harrypotter/ca2025-mycpu/)
[info] compiling 1 Scala source to /home/harrypotter/ca2025-mycpu/3-pipeline/target/scala-2.13/test-classes ...
[info] PipelineProgramTest:
[info] Three-stage Pipelined CPU
[info] - should calculate recursively fibonacci(10)
make[1]: Warning: File 'TestTopModule-harness.cpp' has modification time 0.35 s in the future
make[1]: warning: Clock skew detected. Your build may be incomplete.
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
make[1]: Warning: File 'TestTopModule-harness.cpp' has modification time 0.49 s in the future
make[1]: warning: Clock skew detected. Your build may be incomplete.
[info] - should handle machine-mode traps
[info] - should correctly execute UF8 encode/decode program
[info] Five-stage Pipelined CPU with Stalling
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] - should correctly execute UF8 encode/decode program
[info] Five-stage Pipelined CPU with Forwarding
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] - should correctly execute UF8 encode/decode program
[info] Five-stage Pipelined CPU with Reduced Branch Delay
[info] - should calculate recursively fibonacci(10)
[info] - should quicksort 10 numbers
[info] - should store and load single byte
[info] - should solve data and control hazards
[info] - should handle all hazard types comprehensively
[info] - should handle machine-mode traps
[info] - should correctly execute UF8 encode/decode program
[info] ComplianceTest:
[info] MyCPU Compliance
✅ Test completed - signature: /home/harrypotter/ca2025-mycpu/tests/riscof_work_3pl/rv32i_m/hints/src/sll-01.S/dut/DUT-mycpu.signature
[info] - should pass test /home/harrypotter/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/hints/src/sll-01.S
[info] PipelineUartTest:
[info] Three-stage Pipelined CPU UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Stalling UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Forwarding UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test
[info] - should pass all TX and RX tests
[info] PipelineRegisterTest:
[info] Pipeline Register
[info] - should be able to stall and flush
[info] Run completed in 4 minutes, 58 seconds.
[info] Total number of tests run: 34
[info] Suites: completed 4, aborted 0
[info] Tests: succeeded 34, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
## 5. What I have learnt from Chisel Bootcamp

Before taking Chisel Bootcamp, I mostly thought about hardware in terms of low-level Verilog syntax. This course changed the way I think about hardware design. I learned that Chisel is not just another HDL, but a hardware construction language that lets me describe circuits at a much higher and more flexible level.
One of the most important things I learned was the difference between writing software code and describing hardware. At first, it was confusing that Scala code runs only during elaboration, while the generated hardware runs cycle by cycle. After working through the exercises, I started to understand how `Reg`, `Wire`, and `when` constructs map to real hardware elements such as registers, wires, and multiplexers.
I also really appreciated how Chisel allows parameterized and reusable designs. Using loops and functions to generate hardware made my code cleaner and easier to modify, especially compared to writing repetitive Verilog. This became very useful when building larger components and pipelined structures.
Finally, using Verilator and waveform analysis helped me connect my Chisel code to real signal behavior. Seeing signals change every cycle made hardware behavior much more intuitive. Overall, Chisel Bootcamp gave me confidence in designing and reasoning about hardware systems, especially pipelined processors.
## References
* [Computer Architecture Homework 3](https://hackmd.io/@sysprog/2025-arch-homework3)
* [Lab3: Construct a RISC-V CPU with Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx#Chisel-Bootcamp)
* [Assignment2: Complete Applications](https://hackmd.io/@6qS8IHTdRr2PrBg7Q97fww/H1HxZJP1bx)