Assignment 3: Your Own RISC-V CPU

---
title: 'Assignment 3: Your Own RISC-V CPU'
disqus: hackmd
---

Assignment 3: Your Own RISC-V CPU
===

>[!Note] AI tools usage
>I use ChatGPT to assist with Quiz 1 by providing code explanations, grammar revisions, pre-work research, code summaries, and explanations of standard RISC-V instruction usage.

> GitHub code for this assignment: [Link](https://github.com/scarlett910/ca2025-mycpu)

## 1. Single-cycle CPU
#### Testcases for exercise 1
* Unit Tests:
    * InstructionFetchTest: Ensures that the Program Counter (PC) updates correctly, including normal sequential execution (PC + 4) and correct behavior for jump instructions.
    * InstructionDecoderTest: Checks whether the decoder accurately interprets all RV32I instruction formats (R, I, S, B, U, J) and generates the correct control signals, such as ALU input selection and register-write enables.
    * ExecuteTest: Validates ALU functionality (arithmetic, logical, shift, and comparison operations) and verifies the branch decision logic.
    * RegisterFileTest: Tests register read/write operations, ensuring that register x0 always returns zero and that write-through behavior functions properly.
* Integration Tests (CPUTest.scala)
    * FibonacciTest: Runs a recursive Fibonacci program (`fibonacci.asmbin`) to verify correct function calls, stack usage, and control flow.
    * QuicksortTest: Executes a quicksort implementation (`quicksort.asmbin`) to sort 10 integers, testing more complex branching, loops, and memory operations.
    * ByteAccessTest: Runs the byte-access program (`sb.asmbin`) to validate correct behavior of the `lb` and `sb` instructions, including alignment handling and proper sign extension.

After filling in the missing blanks in the 9 exercises of this first part, I ran `make test` to run all the tests. The result:
```
cd .. 
&& sbt "project singleCycle" test [info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29) [info] loading project definition from /home/harrypotter/ca2025-mycpu/project [info] loading settings for project root from build.sbt... [info] set current project to mycpu-root (in build file:/home/harrypotter/ca2025-mycpu/) [info] set current project to mycpu-single-cycle (in build file:/home/harrypotter/ca2025-mycpu/) [info] compiling 3 Scala sources to /home/harrypotter/ca2025-mycpu/common/target/scala-2.13/classes ... [info] compiling 12 Scala sources to /home/harrypotter/ca2025-mycpu/1-single-cycle/target/scala-2.13/classes ... [info] compiling 7 Scala sources to /home/harrypotter/ca2025-mycpu/1-single-cycle/target/scala-2.13/test-classes ... [info] InstructionDecoderTest: [info] InstructionDecoder [info] - should decode RV32I instructions and generate correct control signals [info] ByteAccessTest: [info] Single Cycle CPU - Integration Tests [info] - should correctly handle byte-level store/load operations (SB/LB) [info] InstructionFetchTest: [info] InstructionFetch [info] - should correctly update PC and handle jumps [info] ExecuteTest: [info] Execute [info] - should execute ALU operations and branch logic correctly [info] FibonacciTest: [info] Single Cycle CPU - Integration Tests [info] - should correctly execute recursive Fibonacci(10) program [info] RegisterFileTest: [info] RegisterFile [info] - should correctly read previously written register values [info] - should keep x0 hardwired to zero (RISC-V compliance) [info] - should support write-through (read during write cycle) [info] QuicksortTest: [info] Single Cycle CPU - Integration Tests [info] - should correctly execute Quicksort algorithm on 10 numbers [info] Run completed in 1 minute, 18 seconds. [info] Total number of tests run: 9 [info] Suites: completed 7, aborted 0 [info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. 
``` **RISCOF Compliance Test** `make compliance` result: ![Screenshot 2025-12-11 002217](https://hackmd.io/_uploads/SyVYAzwM-g.png) **Waveform Analysis** I use waveform analysis to demonstrate that my implementation is correct, by showing how the key signals of each component change when different instructions are executed. **1. Instruction Fetch:** Exercise 9: PC Update Logic - Sequential vs Control Flow ```scala= // ============================================================ // [CA25: Exercise 9] PC Update Logic - Sequential vs Control Flow // ============================================================ when(io.instruction_valid) { io.instruction := io.instruction_read_data pc := Mux(io.jump_flag_id, io.jump_address_id, pc + 4.U) }.otherwise { // When instruction is invalid, hold PC and insert NOP (ADDI x0, x0, 0) // NOP = 0x00000013 allows pipeline to continue safely without side effects pc := pc io.instruction := 0x00000013.U // NOP: prevents illegal instruction execution } io.instruction_address := pc } ``` ![Screenshot 2025-12-10 114551](https://hackmd.io/_uploads/Bk9eRv8G-g.png) Observation from the waveform: | Test Case | Waveform Evidence | Expected | Actual | | ----------- | -------------------------- | ----------- | ------------------- | | Sequential | pc += 4 | PC + 4 | 1000→1004→1008→100c | | Jump (4ps) | jump_flag=1, target=0x1000 | PC = 0x1000 | 0x1004→0x1000 | | Jump (8ps) | jump_flag=1, target=0x1000 | PC = 0x1000 | 0x1004→0x1000 | | Jump (22ps) | jump_flag=1, target=0x1000 | PC = 0x1000 | 0x1010→0x1000 | **Code Logic**: `Mux(jump_flag_id, jump_address_id, pc + 4.U)` * When `jump_flag = 0`: Select `pc + 4` * When `jump_flag = 1`: Select `jump_address_id` **Result**: Implementation correct, all behaviors match specification **2. 
Instruction Decode:** Exercise 1: Immediate Extension - RISC-V Instruction Encoding ![Screenshot 2025-12-09 073813](https://hackmd.io/_uploads/SJLV_1Bzbe.png) | Time | Instruction | Type | ImmI | ImmS | ImmB | ImmJ | ImmU | io_ex_immediate | | ---- | ----------- | ---- | ----- | ---- | ---- | ---- | ------ | --------------- | | 2ps | lw | I | 0x04 | | | | | 0x04 | | 4ps | sw | S | | 0x04 | | | | 0x04 | | 6ps | andi | I | 0x018 | | | | | 0x018 | | 8ps | bge | B | | | 0x10 | | | 0x10 | | 12ps | lui | U | | | | | 0x2000 | 0x2000 | | 14ps | jal | J | | | | 0x08 | | 0x08 | Exercise 2: Control Signal Generation **Code Implementation:** ```scala= // ============================================================ // [CA25: Exercise 2] Control Signal Generation // ============================================================ // TODO: Determine when to write back from Memory when(isLoad) { wbSource := RegWriteSource.Memory } // TODO: Determine when to write back PC+4 .elsewhen(isJal || isJalr) { wbSource := RegWriteSource.NextInstructionAddress } ``` ![Screenshot 2025-12-10 105717](https://hackmd.io/_uploads/rJQnbDIMbg.png) | isLoad | isJal | isJalr | wb_source | Result Source | |--------|-------|--------|-----------|----------------------| | 0 | 0 | 0 | 0 | ALU result | | **1** | 0 | 0 | **1** | **Memory read data** | | 0 | **1** | 0 | **2** | **PC + 4** | | 0 | 0 | **1** | **2** | **PC + 4** | **Code Implementation:** ```scala= // TODO: Determine when to use PC as first operand when(isBranch || isAuipc || isJal) { aluOp1Sel := ALUOp1Source.InstructionAddress } val needsImmediate = isLoad || isStore || isOpImm || isBranch || isLui || isAuipc || isJal || isJalr val aluOp2Sel = WireDefault(ALUOp2Source.Register) // TODO: Determine when to use immediate as second operand when(needsImmediate) { aluOp2Sel := ALUOp2Source.Immediate } ``` ![Screenshot 2025-12-09 075841](https://hackmd.io/_uploads/SJCyevUz-l.png) | isBranch | isAuipc | isJal | Condition (OR) | aluop1_src | Result | 
|----------|---------|-------|----------------|------------|---------------| | 0 | 0 | 0 | 0 | **0** | **Register rs1** | | **1** | 0 | 0 | **1** | **1** | **PC** | | 0 | **1** | 0 | **1** | **1** | **PC** | | 0 | 0 | **1** | **1** | **1** | **PC** | ![Screenshot 2025-12-09 080036](https://hackmd.io/_uploads/BJAJlPIzWg.png) **3. Execute** ![Screenshot 2025-12-10 121910](https://hackmd.io/_uploads/ry8Xw_8GWg.png) ![Screenshot 2025-12-10 121933](https://hackmd.io/_uploads/rJwXwu8z-l.png) Observation from the waveform: Exercise 3: ALU Control Logic - Opcode/Funct3/Funct7 Decoding | | opcode | funct3 | funct7 | Output | | -------------------- | ------ | ------ | --------- | ------------ | | R-type ADD | 0x33 | 0x0 | 0 | ADD function | | Branch (default ADD) | 0x63 | X | X | ADD function | **Conclusion**: ALU Control correctly decodes instruction type and generates appropriate ALU function codes Exercise 4: Branch Comparison Logic ```scala= // ============================================================ // [CA25: Exercise 4] Branch Comparison Logic // ============================================================ val branchCondition = MuxLookup(funct3, false.B)( Seq( // TODO: Implement six branch conditions InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data), InstructionsTypeB.bne -> (io.reg1_data =/= io.reg2_data), // Signed comparison (need conversion to signed type) InstructionsTypeB.blt -> (io.reg1_data.asSInt < io.reg2_data.asSInt), InstructionsTypeB.bge -> (io.reg1_data.asSInt >= io.reg2_data.asSInt), // Unsigned comparison InstructionsTypeB.bltu -> (io.reg1_data < io.reg2_data), InstructionsTypeB.bgeu -> (io.reg1_data >= io.reg2_data) ) ) val isBranch = opcode === InstructionTypes.Branch val isJal = opcode === Instructions.jal val isJalr = opcode === Instructions.jalr ``` Exercise 5: Jump Target Address Calculation ```scala= // ============================================================ // [CA25: Exercise 5] Jump Target Address Calculation // 
============================================================ // TODO: Complete the following address calculations val branchTarget = io.instruction_address + io.immediate val jalTarget = branchTarget // JAL and Branch use same calculation method val jalrSum = io.reg1_data + io.immediate // TODO: Clear LSB using bit concatenation val jalrTarget = Cat(jalrSum(Parameters.DataBits - 1, 1), 0.U(1.W)) val branchTaken = isBranch && branchCondition io.if_jump_flag := branchTaken || isJal || isJalr io.if_jump_address := Mux( isJalr, jalrTarget, Mux(isJal, jalTarget, branchTarget) ) } ``` Observation from the waveform: | Test Case | Time | Instruction | Condition | Expected | Actual | Status | |-----------|------|-------------|-----------|----------|--------|--------| | **BEQ FALSE** | 204-206ps | BEQ (unequal) | reg1 $\neq$ reg2 | Not taken | branchTaken=0, jump_flag=0 | ✓ PASS | | **BEQ TRUE** | 206ps | BEQ (equal) | reg1 = reg2 | **Taken** | **branchTaken=1, jump_flag=1** | **✓ PASS** | **4. Memory Access** Exercise 6: Load Data Extension - Sign and Zero Extension ```scala= // ============================================================ // [CA25: Exercise 6] Load Data Extension - Sign and Zero Extension // ============================================================ // TODO: Complete sign/zero extension for load operations io.wb_memory_read_data := MuxLookup(io.funct3, 0.U)( Seq( // TODO: Complete LB (sign-extend byte) // Hint: Replicate sign bit, then concatenate with byte InstructionsTypeL.lb -> Cat(Fill(24, byte(7)), byte), // TODO: Complete LBU (zero-extend byte) // Hint: Fill upper bits with zero, then concatenate with byte InstructionsTypeL.lbu -> Cat(0.U(24.W), byte), // TODO: Complete LH (sign-extend halfword) // Hint: Replicate sign bit, then concatenate with halfword InstructionsTypeL.lh -> Cat(Fill(16, half(15)), half), // TODO: Complete LHU (zero-extend halfword) // Hint: Fill upper bits with zero, then concatenate with halfword InstructionsTypeL.lhu -> 
Cat(0.U(16.W), half), // LW: Load full word, no extension needed (completed example) InstructionsTypeL.lw -> data ) ) ``` ![Screenshot 2025-12-10 133536](https://hackmd.io/_uploads/r1t1uKUGbe.png) Observation from the waveform: - At 51ps: io_funct3 = 2 → LW ``` io.memory_read_enable = 1 memory_read_data = 0x000000ef val data = io.memory_bundle.read_data //= 0x000000ef io.wb_memory_read_data := data //= 0x000000ef -> Correct outcome ``` The waveform shows that the code in TODO gives the correct result. Exercise 7: Store Data Alignment - Byte Strobes and Shifting ![Screenshot 2025-12-10 150134](https://hackmd.io/_uploads/ByNyoq8GZl.png) - At 43ps: io_funct3 = 0 → SB ``` io.memory_write_enable = 1 reg2_data = 0xdeadbeef alu_result = 4 → mem_address_index = 0 is(InstructionsTypeS.sb) { writeStrobes(mem_address_index) := true.B //= 1 -> correct writeData := data(7, 0) << (mem_address_index << 3.U) //= 0xef << 0 = 0x000000ef -> correct } ``` The waveform shows that the code in TODO gives the correct result. **5. 
Write Back**
```scala=
class WriteBack extends Module {
  val io = IO(new Bundle() {
    val instruction_address = Input(UInt(Parameters.AddrWidth))
    val alu_result = Input(UInt(Parameters.DataWidth))
    val memory_read_data = Input(UInt(Parameters.DataWidth))
    val regs_write_source = Input(UInt(2.W))
    val regs_write_data = Output(UInt(Parameters.DataWidth))
  })
  //============================================================
  // [CA25: Exercise 8] WriteBack Source Selection
  //============================================================
  // TODO: Complete MuxLookup to multiplex writeback sources
  // Default (source 0) is the ALU result; the other sources are listed explicitly.
  io.regs_write_data := MuxLookup(io.regs_write_source, io.alu_result)(
    Seq(
      RegWriteSource.Memory -> io.memory_read_data,
      RegWriteSource.NextInstructionAddress -> (io.instruction_address + 4.U)
    )
  )
}
```
![Screenshot 2025-12-10 154857](https://hackmd.io/_uploads/HJJQIsLMZe.png)
- Default: regs_write_source = 0 → ALU result
```
alu_result = 104c
regs_write_source = 0
instruction address = 103c
memory_read_data = 0
io.regs_write_data := MuxLookup(io.regs_write_source, io.alu_result)
//= 104c -> match with waveform: regs_write_data = 104c
```
- regs_write_source = 1 → memory data

![Screenshot 2025-12-10 154927](https://hackmd.io/_uploads/SymoUiIMZx.png)
```
regs_write_source = 1
instruction address = 106c
memory_read_data = 0000000a
...
RegWriteSource.Memory -> io.memory_read_data,
//= 0000000a -> match with waveform: regs_write_data = 0000000a
```
- regs_write_source = 2 → PC + 4

![Screenshot 2025-12-10 155634](https://hackmd.io/_uploads/rySpDiLzWl.png)
```
regs_write_source = 2
instruction address = 10ec
memory_read_data = 00000000
... 
RegWriteSource.NextInstructionAddress -> (io.instruction_address + 4.U)
//PC + 4 = 0x10ec + 4 = 0x10f0
//match with waveform: regs_write_data = 000010f0
```
#### Integration test results
* Fibonacci.c:

![Screenshot 2025-12-09 022718](https://hackmd.io/_uploads/Bk11-gHfZe.png)
```
hexdump -e src/main/resources/fibonacci.asmbin | head -1
00001197 91418193 00001137 00000297 10828293 00000317 10030313 0062f863 0002a023 00428293 ff5ff06f 00000297 0e828293 00000317 0e030313 0062f863 0002a023 00428293 ff5ff06f 084000ef 0000006f fe010113 00112e23 00812c23 00912a23 02010413 fea42623 fec42703 00100793 00f70863
```
Conclusion: The values of `io_instruction_read_data` in the Fibonacci waveform match the instruction words in the hexdump of the program.
* Quicksort.c:

![Screenshot 2025-12-09 024848](https://hackmd.io/_uploads/SyTtZlSzWg.png)
```
hexdump -e src/main/resources/quicksort.asmbin | head -1
00002197 82818193 00001137 00001297 01c28293 00001317 01430313 0062f863 0002a023 00428293 ff5ff06f 00001297 ffc28293 00001317 ff430313 0062f863 0002a023 00428293 ff5ff06f 184000ef 0000006f fd010113 02112623 02812423 03010413 fca42e23 fcb42c23 fcc42a23 fd842703 fd442783
```
Conclusion: The values of `io_instruction_read_data` in the quicksort waveform match the instruction words in the hexdump of the program.
* sb.S:

![Screenshot 2025-12-09 025745](https://hackmd.io/_uploads/B186-xBfbg.png)
```
hexdump -e src/main/resources/sb.asmbin | head -1
00400513 deadc2b7 eef28293 00550023 00052303 01500913 012500a3 00052083 0000006f
```
Conclusion: The values of `io_instruction_read_data` in the sb waveform match the instruction words in the hexdump of the program.

## 2. RISC-V CPU with MMIO Peripherals and Trap Handling
### 1. Nyancat VGA Display Demo
The Nyancat animation is run with the command `make demo`, using Verilator with SDL2.

![Screenshot 2025-12-10 223156](https://hackmd.io/_uploads/HJcttfDGZx.png)

### 2. 
Further Compression for Nyan program
**Current Implementation Analysis**

The Nyancat animation uses Delta-RLE (Run-Length Encoding with Delta Frames) compression, implemented in `scripts/gen-nyancat-data.py`.

Compression Statistics:
- Uncompressed: 24,576 bytes (12 frames × 64×64 pixels × 4-bit color)
- Compressed: 4,755 bytes
- Compression Ratio: 5.17× (80.7% size reduction)
- Method: Frame 0 uses baseline RLE; Frames 1-11 use delta encoding

**Additional Compression**

Several alternative approaches were evaluated:
1. Palette Quantization (14 → 8 colors)
    * Approach: Merge similar colors (e.g., red/orange, yellow/green)
    * Result: No size reduction
    * Reason: Opcode count remains unchanged; only color indices are remapped
2. Opcode Merging After Remapping
    * Approach: Decompress → remap colors → re-compress with merged operations
    * Result: Increased size (72KB vs 4.7KB)
    * Reason: Decompression/re-compression introduces inefficiencies
3. Advanced Algorithms (LZ77, Huffman, LZMA)
    * LZ77: ~5,200 bytes (worse than current)
    * Huffman: ~4,900 bytes (within about 3% of the current 4,755 bytes, so no real gain)
    * LZMA: ~3,800 bytes (20% improvement but 10× slower decompression)
    * Conclusion: Trade-offs unacceptable

**Conclusion:** The existing compression is already well suited to this application because:
1. Temporal Coherence: The rainbow animation has high frame-to-frame similarity, making delta encoding highly effective
1. Spatial Coherence: Large solid-color regions benefit from RLE
1. Hardware Constraints: The VGA peripheral expects uncompressed 4-bit pixels; there is no hardware decompression support
1. Real-time Requirements: Decompression must complete within the frame time (50ms at 20Hz)
1. Code Quality: The instructor's implementation is production-grade

## 3. Pipelined RISC-V CPU
After filling in the missing blanks in the 6 exercises (16-21) of the pipelined CPU, I ran `make test` to run all the tests. The result:
```
make test
cd .. 
&& sbt "project pipeline" test [info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29) [info] loading project definition from /home/harrypotter/ca2025-mycpu/project [info] loading settings for project root from build.sbt... [info] set current project to mycpu-root (in build file:/home/harrypotter/ca2025-mycpu/) [info] set current project to mycpu-pipeline (in build file:/home/harrypotter/ca2025-mycpu/) [info] compiling 61 Scala sources to /home/harrypotter/ca2025-mycpu/3-pipeline/target/scala-2.13/classes ... [info] compiling 7 Scala sources to /home/harrypotter/ca2025-mycpu/3-pipeline/target/scala-2.13/test-classes ... [info] PipelineProgramTest: [info] Three-stage Pipelined CPU [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] Five-stage Pipelined CPU with Stalling [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] Five-stage Pipelined CPU with Forwarding [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] Five-stage Pipelined CPU with Reduced Branch Delay [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] PipelineUartTest: [info] Three-stage 
Pipelined CPU UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Stalling UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Forwarding UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test [info] - should pass all TX and RX tests [info] PipelineRegisterTest: [info] Pipeline Register [info] - should be able to stall and flush [info] Run completed in 4 minutes, 18 seconds. [info] Total number of tests run: 29 [info] Suites: completed 3, aborted 0 [info] Tests: succeeded 29, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Exercise 17 & 18: Data Hazard Analysis - Forwarding This analysis examines data hazard resolution through forwarding mechanisms in the FiveStageFinal pipeline implementation using `hazard.asmbin` test program. **EX-Stage Forwarding** (Exercise 17) Hazard Scenario: RAW (Read-After-Write) Dependency Test Program: `hazard.asmbin` contains deliberately crafted hazard scenarios. 
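
To make the RAW dependency concrete, the following plain-Scala sketch (a software model, not part of the Chisel design) scans a short instruction sequence and reports how far back the producer of each source register is. The `Instr` record, the sample program, and the mapping of distance 1 to MEM forwarding and distance 2 to WB forwarding are illustrative assumptions based on a classic five-stage pipeline without stalls; loads, which additionally need a stall, are not modelled.

```scala
// Software model only: all names here are hypothetical, not taken from the project.
object RawHazardScan {
  // rd = destination register, rs1/rs2 = source registers (0 means x0 or unused).
  final case class Instr(text: String, rd: Int, rs1: Int, rs2: Int)

  // Distance (1 or 2 instructions back) to the nearest earlier writer of `reg`.
  def producerDistance(prog: Vector[Instr], i: Int, reg: Int): Option[Int] =
    if (reg == 0) None // x0 is hardwired to zero, so it never creates a hazard
    else (1 to 2).find(d => i - d >= 0 && prog(i - d).rd == reg)

  // Assumed mapping for a five-stage pipeline with no stalls in between.
  def source(distance: Option[Int]): String = distance match {
    case Some(1) => "forward from MEM"
    case Some(2) => "forward from WB"
    case _       => "register file"
  }

  // Hypothetical sample program with back-to-back dependencies.
  val demo: Vector[Instr] = Vector(
    Instr("add x1, x2, x3", rd = 1, rs1 = 2, rs2 = 3),
    Instr("sub x4, x1, x5", rd = 4, rs1 = 1, rs2 = 5),
    Instr("or  x6, x1, x4", rd = 6, rs1 = 1, rs2 = 4)
  )

  def main(args: Array[String]): Unit =
    for (i <- demo.indices) {
      val ins = demo(i)
      println(
        s"${ins.text}: rs1 -> ${source(producerDistance(demo, i, ins.rs1))}, " +
          s"rs2 -> ${source(producerDistance(demo, i, ins.rs2))}"
      )
    }
}
```

In this sample, `sub` reads x1 one instruction after it is written, while `or` reads x1 two instructions later and x4 one instruction later, so both forwarding paths are exercised at once.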
![ca2025_ex17_18](https://hackmd.io/_uploads/HkYDMq2fZl.png)

* **MEM-to-EX Forwarding**

Time Window: ~20ps

Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_mem [4:0] | 0x0a | Instruction in MEM writes to x10
io_reg_write_enable_mem | 1 | Write enabled in MEM stage
io_rs2_ex [4:0] | 0x01 | Instruction in EX reads from x1
io_reg2_forward_ex [1:0] | 1 | ForwardFromMEM activated

**Analysis:**
* Hazard Type: RAW (Read-After-Write) dependency
* Detection: `rd_mem == rs_ex` and `reg_write_enable_mem == 1`
* Resolution: The forwarding unit selects the MEM stage output instead of the register file
* Impact: Eliminates the 1-cycle stall penalty

* **WB-to-EX Forwarding**

Time Window: ~57ps

Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_wb [4:0] | 0x07 | Instruction in WB writes to x7
io_reg_write_enable_wb | 1 | Write enabled in WB stage
io_rs1_ex [4:0] | 0x06 | Instruction in EX reads from x6
io_rs2_ex [4:0] | 0x07 | Instruction in EX reads from x7
io_reg1_forward_ex [1:0] | 0 | No forward for rs1
io_reg2_forward_ex [1:0] | 2 | ForwardFromWB for rs2

**Analysis:**
* Scenario: The instruction needs data that is just completing the WB stage
* Forwarding Logic: Select the WB stage output over the register file
* Cycle Savings: Without forwarding, a 2-cycle stall would be needed

* **Multiple Simultaneous Forwards**

Time Window: ~45ps

Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_mem [4:0] | 0x07 | MEM stage writes x7
io_rd_wb [4:0] | 0x06 | WB stage writes x6
io_rs1_ex [4:0] | 0x06 | EX needs x6
io_rs2_ex [4:0] | 0x07 | EX needs x7
io_reg1_forward_ex [1:0] | 2 | Forward x6 from WB
io_reg2_forward_ex [1:0] | 1 | Forward x7 from MEM

**Analysis:**
* Complex Scenario: Both ALU operands require forwarding from different stages
* Forwarding Unit Behavior:
    - rs1 (x6): Forward from WB (older instruction)
    - rs2 (x7): Forward from MEM (newer instruction)
* Priority Handling: MEM forwarding takes precedence over WB when both are available
* Performance Impact: Without dual forwarding, multiple stalls would be required

**ID-Stage Forwarding** (Exercise 18)

This forwarding path supports early branch resolution. Branches compare their registers in the ID stage, so if a branch operand is not yet in the register file, forwarding to the ID stage avoids additional stall penalties.

* **WB-to-ID Forwarding**

Time Window: ~57ps

Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_wb [4:0] | 0x07 | WB stage writes x7
io_rs1_id [4:0] | 0x07 | ID reads x7 (matches io_rd_wb)
io_rs2_id [4:0] | 0x1c | ID reads x28
io_reg1_forward_id [1:0] | 2 | ForwardFromWB
io_reg2_forward_id [1:0] | 0 | No forward

**Analysis:**
* Hazard: The branch in ID needs x7, which is completing WB
* Detection: `rd_wb (0x07) == rs1_id (0x07)` and `reg_write_enable_wb = 1`
* Resolution: Forward x7 from the WB stage to the ID comparator
* Benefit: The branch resolves immediately in ID without stalling

**Performance Impact:**
* Without ID forwarding: 2-cycle stall (wait for WB → register file → ID read)
* With ID forwarding: 0-cycle penalty (immediate comparison)

* **MEM-to-ID Forwarding**

**Time Window: ~49ps**

Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_rd_mem [4:0] | 0x07 | MEM stage writes x7
io_rd_wb [4:0] | 0x07 | WB stage also writes x7
io_rs1_id [4:0] | 0x06 | ID reads x6 (no hazard)
io_rs2_id [4:0] | 0x07 | ID reads x7
io_reg1_forward_id [1:0] | 0 | No forward for rs1
io_reg2_forward_id [1:0] | 1 | ForwardFromMEM

**Analysis:**
* Hazard: The branch in ID needs x7, which is still in the MEM stage (not yet in WB or the register file)
* Detection: `rd_mem (0x07) == rs2_id (0x07)` and `reg_write_enable_mem = 1`
* Resolution: Forward x7 from the MEM stage directly to the ID comparator
* Priority: The MEM forward (value=1) is chosen over the WB forward because MEM holds the more recent result
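
The detection and priority rules described in the analyses above can be summarised in a small plain-Scala model (a software sketch, not the Chisel forwarding unit itself). The encodings 0, 1, and 2 follow the waveform values shown above; the function name and the explicit exclusion of x0 as a forwarding source are assumptions made for illustration.

```scala
// Software model only: `select` is a hypothetical helper, not a project API.
object ForwardingModel {
  // Encodings as they appear on io_reg*_forward_* in the waveforms.
  val NoForward = 0
  val ForwardFromMEM = 1
  val ForwardFromWB = 2

  // Chooses the forwarding source for one source register `rs`.
  // MEM is checked first because it holds the more recent result.
  // Assumption: writes to x0 are never forwarded.
  def select(rs: Int, rdMem: Int, weMem: Boolean, rdWb: Int, weWb: Boolean): Int =
    if (weMem && rdMem != 0 && rdMem == rs) ForwardFromMEM
    else if (weWb && rdWb != 0 && rdWb == rs) ForwardFromWB
    else NoForward

  def main(args: Array[String]): Unit = {
    // "Multiple Simultaneous Forwards" case: MEM writes x7, WB writes x6.
    println(select(rs = 6, rdMem = 7, weMem = true, rdWb = 6, weWb = true))
    println(select(rs = 7, rdMem = 7, weMem = true, rdWb = 6, weWb = true))
    // "MEM-to-ID" case: both MEM and WB write x7, so MEM wins.
    println(select(rs = 7, rdMem = 7, weMem = true, rdWb = 7, weWb = true))
  }
}
```

The same selection applies to both the EX-stage and ID-stage operands; only the source register indices fed into it differ.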

**Performance Impact:**
* Without forwarding: 1-2 cycle stall
* With MEM-to-ID forwarding: 0-cycle penalty

### Exercise 19: Control Hazard Analysis
Control hazards occur when the pipeline's execution sequence changes due to branches or jumps. This analysis examines the hazard detection and pipeline flush mechanisms.

![ca2025_ex19](https://hackmd.io/_uploads/S1YhSohGZg.png)

* **Branch Hazard - IF Flush Only**

**Time Window: ~25ps**

Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_jump_flag | 1 | Branch/Jump taken
io_if_flush | 1 | Flush IF stage
io_id_flush | 0 | Keep ID stage
io_pc_stall | 0 | PC redirected (not stalled)
io_if_stall | 0 | IF/ID not stalled

**Analysis:**
* Hazard: Branch taken, PC redirected to the target address
* Detection: `jump_flag = 1` indicates that the branch/jump condition is true
* Resolution: Flush the IF stage only
* Impact: 1-cycle penalty (one bubble inserted)

* **Load-Use Hazard with Stall**

**Time Window: ~49ps**

Signal | Value | Interpretation
----------------------------|-----------|----------------------------------
io_memory_read_enable_ex | 1 | Load in EX stage
io_rd_ex [4:0] | 0x07 | Load writes to x7
io_id_flush | 1 | Insert bubble
io_pc_stall | 1 | Freeze PC
io_if_stall | 1 | Freeze IF/ID

**Analysis:**
* Hazard: The instruction in ID needs data from a load in EX (not ready until MEM)
* Detection: `memory_read_enable_ex = 1` AND `rd_ex == rs_id`
* Resolution: Stall the pipeline for 1 cycle (bubble + freeze)
* Impact: 1-cycle penalty, then forward from MEM in the next cycle
* `io_id_flush = io_pc_stall = io_if_stall = 1`: All three are asserted together for load-use hazards

**Control Signal Patterns**

| Scenario     | io_if_flush | io_id_flush | io_pc_stall / io_if_stall | Cause                   |
| ------------ | ----------- | ----------- | ------------------------- | ----------------------- |
| Branch Taken | 1           | 0           | 0                         | `jump_flag = 1`         |
| Load-Use     | 0           | 1           | 1                         | `mem_read_ex && rd==rs` |
| Normal       | 0           | 0           | 0                         | No hazard               |

### Exercise 20: Pipeline Register Behavior
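
As a reference point for the stall and flush behaviour examined in this exercise, a pipeline register can be modelled in a few lines of plain Scala (a software sketch, not the Chisel module). The class name and the choice that flush takes priority over stall when both are asserted are assumptions for illustration; the default value is the one visible in the waveform.

```scala
// Software model only: PipelineRegModel is a hypothetical name, not a project class.
final class PipelineRegModel(default: Long) {
  private var reg: Long = default

  // Current register output.
  def out: Long = reg

  // Advances one clock edge.
  // Assumption: flush has priority over stall when both are asserted.
  def step(in: Long, stall: Boolean, flush: Boolean): Unit =
    if (flush) reg = default   // flush: drop the contents, output the default value
    else if (!stall) reg = in  // normal operation: capture the input
    // stall without flush: keep the stored value, ignore the input
}

object PipelineRegDemo {
  def main(args: Array[String]): Unit = {
    val r = new PipelineRegModel(0x57e348fL)
    r.step(0x4037c779L, stall = true, flush = false)
    println(r.out.toHexString) // input ignored while stalled
    r.step(0x2a581587L, stall = false, flush = false)
    println(r.out.toHexString) // input captured
    r.step(0x04c02c2bL, stall = false, flush = true)
    println(r.out.toHexString) // back to the default value
  }
}
```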
![ca2025_ex20](https://hackmd.io/_uploads/Sk8qLhnf-g.png)

* **Stall Mechanism**

**Register Holds Value During Stall**

Time Window: 2ps - 14ps

Signal | Behavior
--------------------|----------------------------------
io_stall | 1 (asserted continuously)
io_flush | 0
io_in [31:0] | 4037c779 → 2a581587 → 0a6b9b95 → ... (changing)
io_out [31:0] | 57e348f (frozen)
reg [31:0] | 57e348f (frozen)

**Analysis:**
* Behavior: Output frozen at 0x57e348f despite the input changing
* Purpose: Hold the instruction during a pipeline hazard
* Verification: Input is ignored while io_stall = 1

* **Flush Mechanism**

**Register Outputs Default Value**

Time Window: ~14ps, ~24ps

Signal | @14ps | @24ps |
--------------------|------------|------------|
io_stall | 1→0 | 1→0 |
io_flush | 0→1 | 0→1 |
io_in [31:0] | 04c02c2b → ... | 1bc9be96 |
io_out [31:0] | 57e348f | 57e348f |

**Analysis:**
* When flush = 1, the output becomes the default value (0x57e348f)
* Input is ignored during flush
* The same default value appears at both flush events

### Exercise 21: Hazard Detection Summary and Analysis
* Q1: Why do we need to stall for load-use hazards?

Answer: Load data is not available until the MEM stage, so it cannot be forwarded from the EX stage (the data has not been read yet). The pipeline must stall for 1 cycle to let the load reach MEM, and then forward the data to the dependent instruction.

![ca2025_ex19](https://hackmd.io/_uploads/HyYGPhnzZe.png)

Waveform Evidence (Time ~49ps):
- LW in EX: memory_read_enable_ex = 1, data not ready
- Dependent instruction in ID: needs the loaded data
- → io_id_flush = 1, io_pc_stall = 1 (stall 1 cycle)
- → Next cycle: forward from MEM

* Q2: What is the difference between "stall" and "flush" operations? 
Answer:

| Operation | Effect on pipeline           | Effect on PC | Purpose                        |
| --------- | ---------------------------- | ------------ | ------------------------------ |
| Stall     | Hold register value (freeze) | Freeze PC    | Wait for data dependency       |
| Flush     | Insert default               | Redirect PC  | Remove wrong-path instructions |

![ca2025_ex20](https://hackmd.io/_uploads/rkJjd2hzbx.png)

Waveform Evidence:
- Stall (2-14ps): io_out frozen at 0x57e348f, PC unchanged
- Flush (14ps, 24ps): io_out = default 0x57e348f, PC redirected

* Q3: Why does a jump instruction with a register dependency need a stall?

Answer: The jump target address is computed from a register value (JALR x1, offset(x2)). If x2 has a data hazard, the target address is unknown until the dependency is resolved. The pipeline must stall until the register value is available, then compute the target and redirect the PC.

Example:
```
ADD x2, x3, x4   # x2 not ready
JALR x1, 0(x2)   # Needs x2 for target ← Stall
```
* Q4: Why is the branch penalty only 1 cycle instead of 2?

**Answer:** The branch is resolved in the **ID stage** (early), not the EX stage. When the branch is taken, only the IF stage contains a wrong-path instruction, so flushing IF alone costs 1 bubble. If the branch were resolved in EX, both IF and ID would hold wrong-path instructions, costing 2 bubbles.

![ca2025_ex19](https://hackmd.io/_uploads/rktiFn3z-l.png)

**Waveform Evidence (Time ~25ps):**
- io_jump_flag = 1 (branch taken in ID)
- io_if_flush = 1 (flush IF only)
- io_id_flush = 0 (ID holds the branch itself, keep it)
- → Penalty = 1 cycle

* Q5: What would happen if we removed hazard detection logic entirely? 
Answer:
* Data Hazards:
    * RAW hazards → stale data is read from the register file instead of forwarded values
    * Incorrect computation results
    * Example: `ADD x1, x2, x3; SUB x4, x1, x5` → x1 not updated yet
* Control Hazards:
    * Wrong-path instructions execute instead of being flushed
    * Branch/jump → sequential execution continues, corrupting registers/memory
    * Example: Branch taken but the ADD after the branch still executes → wrong register modified

**Result:** The program produces incorrect output; the pipeline is not functionally correct.

## 4. Homework 2 on Pipelined CPU
From Homework 2, I chose the UF8 encode/decode problem to run on the pipelined RISC-V CPU, and I modified it to eliminate the hazards so that it runs correctly.

**crsc/uf8_modified.S**
```
# UF8 Encode/Decode Test - Modified for Pipeline Testing
# Stores final result in memory address 4 for verification
.globl _start
_start:
    # initialize test
    li s0, 0                 # test counter
    li s1, 3                 # number of tests (reduced for pipeline testing)
test_loop:
    beq s0, s1, test_done
    # load test value based on counter
    beq s0, x0, test_0
    li t0, 1
    beq s0, t0, test_1
    li t0, 2
    beq s0, t0, test_2
test_0:
    li a0, 15                # test value 1: small value
    j do_test
test_1:
    li a0, 48                # test value 2: medium value
    j do_test
test_2:
    li a0, 240               # test value 3: large value
    j do_test
do_test:
    # save original value
    mv s2, a0
    # encode
    jal ra, uf8_encode
    mv s3, a0                # s3 = encoded byte
    # decode
    mv a0, s3
    jal ra, uf8_decode
    mv s4, a0                # s4 = decoded value
    # simple validation: check if decoded ≈ original
    # for small values (<16): must be exact
    li t0, 16
    blt s2, t0, check_exact
    # for larger values: allow some error
    sub t0, s4, s2           # diff = decoded - original
    bgez t0, diff_pos
    neg t0, t0               # abs(diff)
diff_pos:
    slli t0, t0, 4           # diff * 16
    bgt t0, s2, test_fail    # if diff*16 > original, fail
    j test_pass
check_exact:
    bne s4, s2, test_fail
test_pass:
    addi s0, s0, 1           # Next test
    j test_loop
test_fail:
    # store failure indicator
    li t0, 0xDEAD
    sw t0, 4(x0)             # Memory[4] = 0xDEAD (failure)
    j 
end_program test_done: # all tests passed, store success value li t0, 0x55 # 0x55 = success indicator sw t0, 4(x0) # Memory[4] = 0x55 end_program: # Infinite loop to end j end_program # UF8 Encode: value -> byte # input: a0 = value to encode # output: a0 = encoded byte (exponent in upper 4 bits, mantissa in lower 4 bits) uf8_encode: # Handle small values (0-15) li t0, 16 blt a0, t0, encode_small # Initialize for loop li t1, 0 # exponent = 0 li t2, 0 # base_offset = 0 li t4, 15 # max_exponent = 15 encode_loop: # calculate next threshold: base_offset + (16 << exponent) add t3, t2, t0 bgt t3, a0, encode_done # update for next iteration mv t2, t3 # base_offset = threshold slli t0, t0, 1 # threshold *= 2 addi t1, t1, 1 # exponent++ blt t1, t4, encode_loop encode_done: # calculate mantissa: (value - base_offset) >> exponent sub t3, a0, t2 srl t3, t3, t1 andi t3, t3, 0x0F # mantissa (4 bits) # combine: (exponent << 4) | mantissa slli t1, t1, 4 or a0, t1, t3 ret encode_small: # value < 16, no encoding needed ret # UF8 Decode: byte -> value # input: a0 = encoded byte # output: a0 = decoded value uf8_decode: # extract exponent and mantissa andi t0, a0, 0x0F # mantissa (lower 4 bits) srli t1, a0, 4 # exponent (upper 4 bits) # calculate offset: (2^exponent - 1) * 16 li t2, 1 sll t2, t2, t1 # 2^exponent addi t2, t2, -1 # 2^exponent - 1 slli t2, t2, 4 # * 16 # calculate value: (mantissa << exponent) + offset sll t0, t0, t1 add a0, t0, t2 ret ``` Commands to compile new files: ``` # 1. Assemble riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 \ uf8_pipeline.S -o uf8_pipeline.o # 2. Link (use link.lds!) riscv64-unknown-elf-ld -T link.lds \ --oformat=elf32-littleriscv \ uf8_pipeline.o -o uf8_pipeline.elf # 3. 
Convert to binary riscv64-unknown-elf-objcopy -O binary \ -j .text -j .data \ uf8_pipeline.elf ../src/main/resources/uf8_test.asmbin ``` Modify test scala file to run uf8 test: ```scala= it should "correctly execute UF8 encode/decode program" in { runProgram("uf8_test.asmbin", cfg) { c => // Run UF8 test (3 encode/decode cycles) for (i <- 1 to 20) { c.clock.step(500) c.io.mem_debug_read_address.poke((i * 4).U) } // Check result in memory[4] // 0x55 = all tests passed // 0xDEAD = test failed c.io.mem_debug_read_address.poke(4.U) c.clock.step() val result = c.io.mem_debug_read_data.peek().litValue.toInt assert( result == 0x55, f"${cfg.name}: UF8 test failed! Memory[4] = 0x${result}%x (expected 0x55)" ) ``` Test result: ``` make test cd .. && sbt "project pipeline" test [info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29) [info] loading project definition from /home/harrypotter/ca2025-mycpu/project [info] loading settings for project root from build.sbt... [info] set current project to mycpu-root (in build file:/home/harrypotter/ca2025-mycpu/) [info] set current project to mycpu-pipeline (in build file:/home/harrypotter/ca2025-mycpu/) [info] compiling 1 Scala source to /home/harrypotter/ca2025-mycpu/3-pipeline/target/scala-2.13/test-classes ... [info] PipelineProgramTest: [info] Three-stage Pipelined CPU [info] - should calculate recursively fibonacci(10) make[1]: Warning: File 'TestTopModule-harness.cpp' has modification time 0.35 s in the future make[1]: warning: Clock skew detected. Your build may be incomplete. [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively make[1]: Warning: File 'TestTopModule-harness.cpp' has modification time 0.49 s in the future make[1]: warning: Clock skew detected. Your build may be incomplete. 
[info] - should handle machine-mode traps [info] - should correctly execute UF8 encode/decode program [info] Five-stage Pipelined CPU with Stalling [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] - should correctly execute UF8 encode/decode program [info] Five-stage Pipelined CPU with Forwarding [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] - should correctly execute UF8 encode/decode program [info] Five-stage Pipelined CPU with Reduced Branch Delay [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] - should correctly execute UF8 encode/decode program [info] ComplianceTest: [info] MyCPU Compliance ✅ Test completed - signature: /home/harrypotter/ca2025-mycpu/tests/riscof_work_3pl/rv32i_m/hints/src/sll-01.S/dut/DUT-mycpu.signature[info] - should pass test /home/harrypotter/ca2025-mycpu/tests/riscv-arch-test/riscv-test-suite/rv32i_m/hints/src/sll-01.S [info] PipelineUartTest: [info] Three-stage Pipelined CPU UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Stalling UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Forwarding UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test 
[info] - should pass all TX and RX tests [info] PipelineRegisterTest: [info] Pipeline Register [info] - should be able to stall and flush [info] Run completed in 4 minutes, 58 seconds. [info] Total number of tests run: 34 [info] Suites: completed 4, aborted 0 [info] Tests: succeeded 34, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ## 5. What I have learnt from Chisel Bootcamp ![Screenshot 2025-12-15 132532](https://hackmd.io/_uploads/HJJJ3M6fWl.png) Before taking Chisel Bootcamp, I mostly thought about hardware in terms of low-level Verilog syntax. This course changed the way I think about hardware design. I learned that Chisel is not just another HDL, but a hardware construction language that lets me describe circuits at a much higher and more flexible level. One of the most important things I learned was the difference between writing software code and describing hardware. At first, it was confusing that Scala code runs only during elaboration, while the generated hardware runs cycle by cycle. After working through the exercises, I started to understand how Reg, Wire, and when statements actually represent real hardware elements like registers and multiplexers. I also really appreciated how Chisel allows parameterized and reusable designs. Using loops and functions to generate hardware made my code cleaner and easier to modify, especially compared to writing repetitive Verilog. This became very useful when building larger components and pipelined structures. Finally, using Verilator and waveform analysis helped me connect my Chisel code to real signal behavior. Seeing signals change every cycle made hardware behavior much more intuitive. Overall, Chisel Bootcamp gave me confidence in designing and reasoning about hardware systems, especially pipelined processors. 
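
The split between elaboration-time Scala and cycle-by-cycle hardware can be sketched with a small parameterized module. The `DelayLine` class below is a hypothetical illustration, not code from the assignment repository, and it assumes the `chisel3` library is available on the classpath. The `width` and `depth` parameters and the `for` loop are plain Scala that runs once during elaboration, while `RegInit` and `when` describe the registers and enable muxes that exist in the generated circuit.

```scala
import chisel3._

// Hypothetical example (not part of the assignment code): a delay line whose
// width and depth are ordinary Scala values fixed at elaboration time.
class DelayLine(width: Int, depth: Int) extends Module {
  require(depth >= 1, "depth must be at least 1")

  val io = IO(new Bundle {
    val en  = Input(Bool())
    val in  = Input(UInt(width.W))
    val out = Output(UInt(width.W))
  })

  // Scala code: runs once during elaboration and creates `depth` registers,
  // each reset to zero.
  val stages = Seq.fill(depth)(RegInit(0.U(width.W)))

  // Hardware: `when` becomes an enable mux in front of every register.
  // Registers that are not assigned in a cycle simply hold their value.
  when(io.en) {
    stages.head := io.in
    // Scala loop: unrolled at elaboration into (depth - 1) register-to-register
    // connections; no loop exists in the generated circuit.
    for (i <- 1 until depth) {
      stages(i) := stages(i - 1)
    }
  }

  // The output is the value stored in the last register.
  io.out := stages.last
}
```

Under these assumptions, `new DelayLine(width = 32, depth = 3)` would elaborate to three 32-bit registers in series, and changing `depth` changes the generated circuit without touching the module body.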
## Reference
* [Computer Architecture Homework 3](https://hackmd.io/@sysprog/2025-arch-homework3)
* [Lab3: Construct a RISC-V CPU with Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx#Chisel-Bootcamp)
* [Assignment2: Complete Applications](https://hackmd.io/@6qS8IHTdRr2PrBg7Q97fww/H1HxZJP1bx)