# arch-homework3

GitHub link: https://github.com/zhenfong07/ca2025-mycpu

## 1. Go through Chisel tutorial and describe the operation of 'Hello World in Chisel' and enhance it by incorporating logic circuit

### 1.1 Chisel tutorial

I went through the Chisel tutorial and completed exercises 1 to 3.6 on the tutorial site.

![chisel tutorial](https://hackmd.io/_uploads/SJLyGiAl-l.png)

### 1.2 Describe the operation of 'Hello World in Chisel' and enhance it by incorporating logic circuit

#### 1.2.1 Describe the operation

```scala
// See LICENSE.txt for license details.
package hello

import chisel3._
import chisel3.iotesters.{PeekPokeTester, Driver}

class Hello extends Module {
  val io = IO(new Bundle {
    val out = Output(UInt(8.W))
  })
  io.out := 42.U
}

class HelloTests(c: Hello) extends PeekPokeTester(c) {
  step(1)
  expect(c.io.out, 42)
}

object Hello {
  def main(args: Array[String]): Unit = {
    if (!Driver(() => new Hello())(c => new HelloTests(c))) System.exit(1)
  }
}
```

- **class Hello extends Module** creates a hardware component, and **val out = Output(UInt(8.W))** gives this component one output port that is 8 bits wide.
- **io.out := 42.U** connects the output wire to the constant value 42 as an unsigned integer.
- **class HelloTests**, built on **PeekPokeTester**, tests the module: step(1) advances the clock by one cycle, and expect(c.io.out, 42) passes if the output is 42, in which case the terminal reports success.
- **object Hello** is the entry point: Driver elaborates the Chisel design for simulation and runs HelloTests against it, exiting with status 1 if the test fails.

#### 1.2.2 Enhancement with incorporating logic circuit

- I enhance the code to create **a Conditional Adder** with two 8-bit inputs a and b and a **control signal (enable)**.

```scala
// See LICENSE.txt for license details.
package hello

import chisel3._
import chisel3.util._ // Optional here: Mux is part of chisel3 itself; util adds MuxLookup, Cat, Fill, etc.
import chisel3.iotesters.{PeekPokeTester, Driver}

// --- 1. HARDWARE DEFINITION (The Module) ---
class LogicCircuit extends Module {
  val io = IO(new Bundle {
    val a = Input(UInt(8.W))      // Input A (8-bit width)
    val b = Input(UInt(8.W))      // Input B (8-bit width)
    val enable = Input(Bool())    // Control signal (Boolean: 1 bit)
    val out = Output(UInt(8.W))   // Output (8-bit width)
  })

  // LOGIC DESCRIPTION:
  // A Multiplexer (Mux) selects between two values based on a condition.
  // Logic: If (io.enable is True) return (io.a + io.b), else return (0).
  io.out := Mux(io.enable, io.a + io.b, 0.U)
}

// --- 2. VERIFICATION (The Tester) ---
class LogicTests(c: LogicCircuit) extends PeekPokeTester(c) {
  // Test Case 1: Enable is OFF
  poke(c.io.a, 10)          // Drive input A with 10
  poke(c.io.b, 20)          // Drive input B with 20
  poke(c.io.enable, false)  // Drive enable with False (0)
  step(1)                   // Advance clock by 1 cycle
  expect(c.io.out, 0)       // Assert that output is 0
  println(s"Test 1 (Enable OFF): Output = ${peek(c.io.out)}")

  // Test Case 2: Enable is ON
  poke(c.io.a, 10)
  poke(c.io.b, 20)
  poke(c.io.enable, true)   // Drive enable with True (1)
  step(1)
  expect(c.io.out, 30)      // Assert that output is 30 (10 + 20)
  println(s"Test 2 (Enable ON): Output = ${peek(c.io.out)}")

  // Test Case 3: Different Values
  poke(c.io.a, 42)
  poke(c.io.b, 8)
  poke(c.io.enable, true)
  step(1)
  expect(c.io.out, 50)      // Assert that output is 50 (42 + 8)
  println(s"Test 3 (42 + 8): Output = ${peek(c.io.out)}")
}

// --- 3. MAIN ENTRY POINT ---
object Hello {
  def main(args: Array[String]): Unit = {
    // Elaborates the Chisel design and runs the tests
    if (!Driver(() => new LogicCircuit())(c => new LogicTests(c))) System.exit(1)
  }
}
```

- **Mux** acts as an if-else condition: if io.enable is true it returns io.a + io.b, otherwise it returns 0. An **enable** signal matters in real designs. In a CPU, for example, the ALU output is physically wired to the register file's write port, so without an enable the comparison result (or garbage data) would be written into a register (e.g., x1) and overwrite valuable data.
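As a software cross-check of the Mux behaviour described above, here is a minimal sketch of a reference model in plain Scala (no Chisel dependency). The names `ConditionalAdderModel` and `conditionalAdderModel` are hypothetical and not part of the original code. The model assumes the sum is truncated to 8 bits, which is how Chisel's `+` behaves on two 8-bit `UInt` operands.

```scala
// Hypothetical software model of LogicCircuit (illustration only, not from the repo).
object ConditionalAdderModel {
  // Mirrors Mux(enable, a + b, 0.U) for 8-bit unsigned operands.
  // The & 0xFF models the 8-bit truncation of the hardware adder.
  def conditionalAdderModel(a: Int, b: Int, enable: Boolean): Int = {
    require(a >= 0 && a <= 0xFF && b >= 0 && b <= 0xFF, "operands must fit in 8 bits")
    if (enable) (a + b) & 0xFF else 0
  }

  def main(args: Array[String]): Unit = {
    println(conditionalAdderModel(10, 20, enable = false))  // 0  (enable off)
    println(conditionalAdderModel(10, 20, enable = true))   // 30
    println(conditionalAdderModel(200, 100, enable = true)) // 44 (300 wraps modulo 256)
  }
}
```

The third call shows a case the tester above does not cover: because the output is only 8 bits wide, 200 + 100 wraps around to 44 rather than producing 300.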
*This version is an enhancement that incorporates a logic circuit because:*

- It has both inputs and an output **(inputs a, b and the control signal 'Enable')**.
- It performs an arithmetic operation **(+)**, which is built from **AND, XOR, and OR logic**.
- It has control logic, a **MUX**, which acts as an if-else condition.

Lastly, the **output depends on the inputs**: if we change a or b, the output changes as well.

![Helloworld](https://hackmd.io/_uploads/ryI8R30eWg.png)

## Run Exercise 1 to Exercise 3

### Exercise 1 (1-single-cycle)

After implementing the **TODO** parts, I ran the tests to find and fix bugs. The output below confirms that the code is correct.

![excersie term](https://hackmd.io/_uploads/rkiT76Ce-e.png)

- **make compliance** passed all tests

![testcompliance3](https://hackmd.io/_uploads/BktXaa0W-x.png)

The following waveform analysis checks whether the logic in the code is correct.

#### Instruction Fetch

```scala
// SPDX-License-Identifier: MIT
// MyCPU is freely redistributable under the MIT License. See the file
// "LICENSE" for information on usage and redistribution of this file.

package riscv.core

import chisel3._
import riscv.Parameters

// Program counter reset value
object ProgramCounter {
  val EntryAddress = Parameters.EntryAddress
}

// Instruction Fetch stage: maintains PC and fetches instructions from memory
//
// This is the first stage of the processor pipeline, responsible for:
// - Maintaining the program counter (PC) register
// - Providing current PC to instruction memory
// - Handling control flow changes from Execute stage
//
// PC update logic:
// - Sequential: PC = PC + 4 (when no jump/branch)
// - Control flow: PC = jump_address_id (when jump_flag_id asserted by Execute stage)
//
// The instruction_valid signal gates PC updates to handle memory latency and stalls.
class InstructionFetch extends Module {
  val io = IO(new Bundle {
    val jump_flag_id = Input(Bool())
    val jump_address_id = Input(UInt(Parameters.AddrWidth))
    val instruction_read_data = Input(UInt(Parameters.DataWidth))
    val instruction_valid = Input(Bool())

    val instruction_address = Output(UInt(Parameters.AddrWidth))
    val instruction = Output(UInt(Parameters.InstructionWidth))
  })

  // Program counter register (PC)
  val pc = RegInit(ProgramCounter.EntryAddress)

  // ============================================================
  // [CA25: Exercise 9] PC Update Logic - Sequential vs Control Flow
  // ============================================================
  // Hint: Implement program counter (PC) update logic for sequential execution
  // and control flow changes
  //
  // PC update rules:
  // 1. Control flow (jump/branch taken): PC = jump target address
  //    - When jump flag is asserted, use jump address
  //    - Covers: JAL, JALR, and taken branches (BEQ, BNE, BLT, BGE, BLTU, BGEU)
  // 2. Sequential execution: PC = PC + 4
  //    - When no jump/branch, increment PC by 4 bytes (next instruction)
  //    - RISC-V instructions are 4 bytes (32 bits) in RV32I
  // 3. Invalid instruction: PC = PC (hold current value)
  //    - When instruction is invalid, don't update PC
  //    - Insert NOP to prevent illegal instruction execution
  //
  // Examples:
  // - Normal ADD: PC = 0x1000 → next PC = 0x1004 (sequential)
  // - JAL offset: PC = 0x1000, target = 0x2000 → next PC = 0x2000 (control flow)
  // - BEQ taken: PC = 0x1000, target = 0x0FFC → next PC = 0x0FFC (control flow)
  when(io.instruction_valid) {
    io.instruction := io.instruction_read_data

    // TODO: Complete PC update logic
    // Hint: Use multiplexer to select between jump target and sequential PC
    // - Check jump flag condition
    // - True case: Use jump target address
    // - False case: Sequential execution
    pc := Mux(io.jump_flag_id, io.jump_address_id, pc + 4.U)
  }.otherwise {
    // When instruction is invalid, hold PC and insert NOP (ADDI x0, x0, 0)
    // NOP = 0x00000013 allows pipeline to continue safely without side effects
    pc := pc
    io.instruction := 0x00000013.U // NOP: prevents illegal instruction execution
  }
  io.instruction_address := pc
}
```

***Waveform Analysis***

After running the **make sim** command, a file named trace.vcd is exported. I use it to analyze each stage.

##### Reset and Initial Stage and Execution Stage

***Initial Stage (0 to 3 ps)***: reset is active (reset=1), and **io_jump_address** stays at **00001000**.

![instructionfetch1](https://hackmd.io/_uploads/rkqdt00eZg.png)

***Next Stage (4 ps)***: reset and clock are both 0, and **io_jump_address** is still **00001000**, the same as in the initial stage.

![instructionfetch2](https://hackmd.io/_uploads/rJe_iCRgWg.png)

***Execution Stage (5 to 7 ps)***: reset is 0 and clock is 1. **io_jump_address** is now **00001004**, which means the program counter has incremented by 4 bytes to fetch the next sequential instruction **(pc+4)**.
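The three waveform observations above follow from the PC update rule in InstructionFetch. As a cross-check, here is a minimal sketch of that rule as a plain-Scala software model. The names `NextPcModel` and `nextPc` and the parameter names are hypothetical, and the addresses are only example values in the style of those quoted above.

```scala
// Hypothetical software model of the PC update rule (illustration only, not from the repo).
object NextPcModel {
  private val Mask32 = 0xFFFFFFFFL // keep results within 32 bits, like the PC register

  // valid && jump  -> jump target
  // valid && !jump -> pc + 4
  // !valid         -> hold pc
  def nextPc(pc: Long, instructionValid: Boolean, jumpFlag: Boolean, jumpAddress: Long): Long =
    if (!instructionValid) pc
    else if (jumpFlag) jumpAddress & Mask32
    else (pc + 4) & Mask32

  def main(args: Array[String]): Unit = {
    val sequential = nextPc(0x1000L, instructionValid = true, jumpFlag = false, jumpAddress = 0L)
    val jump = nextPc(0x1004L, instructionValid = true, jumpFlag = true, jumpAddress = 0x1000L)
    val hold = nextPc(0x1000L, instructionValid = false, jumpFlag = true, jumpAddress = 0x2000L)
    println(f"sequential: 0x$sequential%08x") // sequential: 0x00001004
    println(f"jump:       0x$jump%08x")       // jump:       0x00001000
    println(f"hold:       0x$hold%08x")       // hold:       0x00001000
  }
}
```

The hold case shows that an invalid instruction keeps the PC unchanged even when the jump flag is asserted, matching the `.otherwise` branch of the hardware.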
![instructionfetch3](https://hackmd.io/_uploads/Bk_X2CAlZl.png)

##### Control Flow Change

![instruction fetch 2 exer1](https://hackmd.io/_uploads/SJjjmxyW-x.png)

```scala
// PC update on a taken jump: io.jump_address_id is 0x1000 in this trace
pc := io.jump_address_id
```

When the instruction is a ***JAL, JALR, or taken branch***, ***io.jump_flag_id*** is set to 1. In the waveform, ***io_jump_flag_id*** is 1 when ***io_instruction*** is jal ra, -68, which indicates that the instruction fetch works correctly.

#### Instruction Decode

```scala
// SPDX-License-Identifier: MIT
// MyCPU is freely redistributable under the MIT License. See the file
// "LICENSE" for information on usage and redistribution of this file.

package riscv.core

import chisel3._
import chisel3.util._
import chisel3.ChiselEnum
import riscv.Parameters

// RV32I opcode groupings
object InstructionTypes {
  val Load = "b0000011".U(7.W)
  val OpImm = "b0010011".U(7.W)
  val Store = "b0100011".U(7.W)
  val Op = "b0110011".U(7.W)
  val Lui = "b0110111".U(7.W)
  val Auipc = "b0010111".U(7.W)
  val Jal = "b1101111".U(7.W)
  val Jalr = "b1100111".U(7.W)
  val Branch = "b1100011".U(7.W)
  val MiscMem = "b0001111".U(7.W)
  val System = "b1110011".U(7.W)
}

// Convenience aliases for specific opcodes
object Instructions {
  val jal = InstructionTypes.Jal
  val jalr = InstructionTypes.Jalr
  val lui = InstructionTypes.Lui
  val auipc = InstructionTypes.Auipc
}

// Funct3 encodings for load instructions
object InstructionsTypeL {
  val lb = "b000".U(3.W)
  val lh = "b001".U(3.W)
  val lw = "b010".U(3.W)
  val lbu = "b100".U(3.W)
  val lhu = "b101".U(3.W)
}

// Funct3 encodings for store instructions
object InstructionsTypeS {
  val sb = "b000".U(3.W)
  val sh = "b001".U(3.W)
  val sw = "b010".U(3.W)
}

// Funct3 encodings for OP-IMM instructions
object InstructionsTypeI {
  val addi = "b000".U(3.W)
  val slli = "b001".U(3.W)
  val slti = "b010".U(3.W)
  val sltiu = "b011".U(3.W)
  val xori = "b100".U(3.W)
  val sri = "b101".U(3.W)
  val ori = "b110".U(3.W)
  val andi = "b111".U(3.W)
}

// Funct3 encodings for OP instructions
object InstructionsTypeR {
  val add_sub = "b000".U(3.W)
  val sll = "b001".U(3.W)
  val slt = "b010".U(3.W)
  val sltu = "b011".U(3.W)
  val xor = "b100".U(3.W)
  val sr = "b101".U(3.W)
  val or = "b110".U(3.W)
  val and = "b111".U(3.W)
}

// Funct3 encodings for branch instructions
object InstructionsTypeB {
  val beq = "b000".U(3.W)
  val bne = "b001".U(3.W)
  val blt = "b100".U(3.W)
  val bge = "b101".U(3.W)
  val bltu = "b110".U(3.W)
  val bgeu = "b111".U(3.W)
}

object ALUOp1Source {
  val Register = 0.U(1.W)
  val InstructionAddress = 1.U(1.W)
}

object ALUOp2Source {
  val Register = 0.U(1.W)
  val Immediate = 1.U(1.W)
}

object RegWriteSource {
  val ALUResult = 0.U(2.W)
  val Memory = 1.U(2.W)
  val NextInstructionAddress = 2.U(2.W)
}

object ImmediateKind extends ChiselEnum {
  val None, I, S, B, U, J = Value
}

/**
 * InstructionDecode: Instruction field extraction and control signal generation
 *
 * Pipeline Stage: ID (Instruction Decode)
 *
 * Key Responsibilities:
 * - Extract instruction fields (opcode, rd, rs1, rs2, funct3, funct7)
 * - Generate control signals for Execute, Memory, and WriteBack stages
 * - Decode and sign-extend immediate values based on instruction format
 * - Determine ALU operand sources (register vs PC, register vs immediate)
 * - Identify register file read/write operations
 * - Configure memory access (read/write enable signals)
 *
 * RV32I Instruction Formats Decoded:
 * - R-type: Register-register operations (op1=rs1, op2=rs2)
 *   - Examples: ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU
 * - I-type: Immediate operations and loads (12-bit signed immediate)
 *   - Examples: ADDI, SLTI, ANDI, ORI, XORI, LB, LH, LW, JALR
 *   - Immediate: inst[31:20] sign-extended to 32 bits
 * - S-type: Store operations (12-bit signed immediate split across fields)
 *   - Examples: SB, SH, SW
 *   - Immediate: {inst[31:25], inst[11:7]} sign-extended
 * - B-type: Branch operations (13-bit signed PC-relative, LSB always 0)
 *   - Examples: BEQ, BNE, BLT, BGE, BLTU, BGEU
 *   - Immediate: {inst[31], inst[7], inst[30:25], inst[11:8], 0}
 * - U-type: Upper immediate (20-bit immediate in upper bits)
 *   - Examples: LUI, AUIPC
 *   - Immediate: {inst[31:12], 12'b0}
 * - J-type: Jump (21-bit signed PC-relative, LSB always 0)
 *   - Example: JAL
 *   - Immediate: {inst[31], inst[19:12], inst[20], inst[30:21], 0}
 *
 * Control Signal Generation:
 * - reg_write_enable: Enable writing to rd (false for branches, stores)
 * - memory_read_enable: Asserted for load instructions
 * - memory_write_enable: Asserted for store instructions
 * - ex_aluop1_source: Select PC vs rs1 for ALU operand 1
 * - ex_aluop2_source: Select immediate vs rs2 for ALU operand 2
 * - wb_reg_write_source: Select ALU result, memory data, or PC+4 for rd
 *
 * Interface:
 * - Input: 32-bit instruction from IF stage
 * - Outputs: Control signals to EX/MEM/WB, immediate value, register addresses
 */
class InstructionDecode extends Module {
  val io = IO(new Bundle {
    val instruction = Input(UInt(Parameters.InstructionWidth))

    val regs_reg1_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val regs_reg2_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val ex_immediate = Output(UInt(Parameters.DataBits.W))
    val ex_aluop1_source = Output(UInt(1.W))
    val ex_aluop2_source = Output(UInt(1.W))
    val memory_read_enable = Output(Bool())
    val memory_write_enable = Output(Bool())
    val wb_reg_write_source = Output(UInt(2.W))
    val reg_write_enable = Output(Bool())
    val reg_write_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
  })

  val instruction = io.instruction
  val opcode = instruction(6, 0)
  val rs1 = instruction(19, 15)
  val rs2 = instruction(24, 20)
  val rd = instruction(11, 7)

  val isLoad = opcode === InstructionTypes.Load
  val isStore = opcode === InstructionTypes.Store
  val isOpImm = opcode === InstructionTypes.OpImm
  val isOp = opcode === InstructionTypes.Op
  val isLui = opcode === InstructionTypes.Lui
  val isAuipc = opcode === InstructionTypes.Auipc
  val isJal = opcode === InstructionTypes.Jal
  val isJalr = opcode === InstructionTypes.Jalr
  val isBranch = opcode === InstructionTypes.Branch

  val usesRs1 = isLoad || isStore || isOpImm || isOp || isBranch || isJalr
  val usesRs2 = isStore || isOp || isBranch
  val regWrite = isLoad || isOpImm || isOp || isLui || isAuipc || isJal || isJalr

  // ============================================================
  // [CA25: Exercise 2] Control Signal Generation
  // ============================================================
  // Hint: Generate correct control signals based on instruction type
  //
  // Need to determine three key multiplexer selections:
  // 1. WriteBack data source selection (wbSource)
  // 2. ALU operand 1 selection (aluOp1Sel)
  // 3. ALU operand 2 selection (aluOp2Sel)

  // WriteBack data source selection:
  // - Default: ALU result
  // - Load instructions: Read from Memory
  // - JAL/JALR: Save PC+4 (return address)
  val wbSource = WireDefault(RegWriteSource.ALUResult)
  // TODO: Determine when to write back from Memory
  when(isLoad) {
    wbSource := RegWriteSource.Memory
  }
  // TODO: Determine when to write back PC+4
  .elsewhen(isJal || isJalr) {
    wbSource := RegWriteSource.NextInstructionAddress
  }

  // ALU operand 1 selection:
  // - Default: Register rs1
  // - Branch/AUIPC/JAL: Use PC (for calculating target address or PC+offset)
  //
  val aluOp1Sel = WireDefault(ALUOp1Source.Register)
  // TODO: Determine when to use PC as first operand
  // Hint: Consider instructions that need PC-relative addressing
  when(isBranch || isJal || isAuipc) {
    aluOp1Sel := ALUOp1Source.InstructionAddress
  }

  // ALU operand 2 selection:
  // - Default: Register rs2 (for R-type instructions)
  // - I-type/S-type/B-type/U-type/J-type: Use immediate
  val needsImmediate =
    isLoad || isStore || isOpImm || isBranch || isLui || isAuipc || isJal || isJalr
  val aluOp2Sel = WireDefault(ALUOp2Source.Register)
  // TODO: Determine when to use immediate as second operand
  // Hint: Most instruction types except R-type use immediate
  when(isBranch || isOpImm || isLoad || isStore || isLui || isAuipc || isJal || isJalr) {
    aluOp2Sel := ALUOp2Source.Immediate
  }

  val immKind = WireDefault(ImmediateKind.None)
  when(isLoad || isOpImm || isJalr) {
    immKind := ImmediateKind.I
  }
  when(isStore) {
    immKind := ImmediateKind.S
  }
  when(isBranch) {
    immKind := ImmediateKind.B
  }
  when(isLui || isAuipc) {
    immKind := ImmediateKind.U
  }
  when(isJal) {
    immKind := ImmediateKind.J
  }

  io.regs_reg1_read_address := Mux(usesRs1, rs1, 0.U)
  io.regs_reg2_read_address := Mux(usesRs2, rs2, 0.U)
  io.ex_aluop1_source := aluOp1Sel
  io.ex_aluop2_source := aluOp2Sel
  io.memory_read_enable := isLoad
  io.memory_write_enable := isStore
  io.wb_reg_write_source := wbSource
  io.reg_write_enable := regWrite
  io.reg_write_address := rd
  // Note: Chisel uses last-connect semantics, so the two connections below
  // override the Mux(usesRs1, ...) and Mux(usesRs2, ...) connections above.
  io.regs_reg1_read_address := Mux(opcode === Instructions.lui, 0.U(Parameters.PhysicalRegisterAddrWidth), rs1)
  io.regs_reg2_read_address := rs2

  // ============================================================
  // [CA25: Exercise 1] Immediate Extension - RISC-V Instruction Encoding
  // ============================================================
  // Hint: RISC-V has five immediate formats, requiring correct bit-field
  // extraction and sign extension
  //
  // I-type (12-bit): Used for ADDI, LW, JALR, etc.
  // Immediate located at inst[31:20]
  // Requires sign extension to 32 bits
  // Hint: Use Fill() to replicate sign bit instruction(31)
  //
  val immI = Cat(
    Fill(Parameters.DataBits - 12, instruction(31)), // Sign extension: replicate bit 31 twenty times
    instruction(31, 20)                              // Immediate: bits [31:20]
  )

  // S-type (12-bit): Used for SW, SH, SB store instructions
  // Immediate split into two parts: inst[31:25] and inst[11:7]
  // Need to concatenate these parts then sign extend
  // Hint: High bits at upper field, low bits at lower field
  //
  // TODO: Complete S-type immediate extension
  val immS = Cat(
    Fill(Parameters.DataBits - 12, instruction(31)), // Sign extension
    instruction(31, 25),                             // High 7 bits
    instruction(11, 7)                               // Low 5 bits
  )

  // B-type (13-bit): Used for BEQ, BNE, BLT branch instructions
  // Immediate requires reordering: {sign, bit11, bits[10:5], bits[4:1], 0}
  // Note: LSB is always 0 (2-byte alignment)
  // Requires sign extension to 32 bits
  // Hint: B-type bit order is scrambled, must reorder per specification
  //
  // TODO: Complete B-type immediate extension
  val immB = Cat(
    Fill(Parameters.DataBits - 13, instruction(31)), // Sign extension
    instruction(31),                                 // bit [12]
    instruction(7),                                  // bit [11]
    instruction(30, 25),                             // bits [10:5]
    instruction(11, 8),                              // bits [4:1]
    0.U(1.W)                                         // bit [0] = 0 (alignment)
  )

  // U-type (20-bit): Used for LUI, AUIPC
  // Immediate located at inst[31:12], low 12 bits filled with zeros
  // No sign extension needed (placed directly in upper 20 bits)
  // Hint: U-type places 20 bits in result's upper bits, fills low 12 bits with 0
  val immU = Cat(instruction(31, 12), 0.U(12.W))

  // J-type (21-bit): Used for JAL
  // Immediate requires reordering: {sign, bits[19:12], bit11, bits[10:1], 0}
  // Note: LSB is always 0 (2-byte alignment)
  // Requires sign extension to 32 bits
  // Hint: J-type bit order is scrambled, similar to B-type
  //
  // TODO: Complete J-type immediate extension
  val immJ = Cat(
    Fill(Parameters.DataBits - 21, instruction(31)), // Sign extension
    instruction(31),                                 // bit [20]
    instruction(19, 12),                             // bits [19:12]
    instruction(20),                                 // bit [11]
    instruction(30, 21),                             // bits [10:1]
    0.U(1.W)                                         // bit [0] = 0 (alignment)
  )

  val immediate = MuxLookup(immKind.asUInt, 0.U(Parameters.DataBits.W))(
    Seq(
      ImmediateKind.I.asUInt -> immI,
      ImmediateKind.S.asUInt -> immS,
      ImmediateKind.B.asUInt -> immB,
      ImmediateKind.U.asUInt -> immU,
      ImmediateKind.J.asUInt -> immJ
    )
  )
  io.ex_immediate := immediate
}
```

We set ***io_memory_read_enable*** and ***io_memory_write_enable*** as follows:

```scala
io.memory_read_enable := isLoad   // Load (lw, lb...)
io.memory_write_enable := isStore // Store (sw, sb...)
```

![instruction decode](https://hackmd.io/_uploads/Bk1zxW1--g.png)

If the decoded instruction is S-type (store instructions: sw, sh, sb), whose opcode is 0x23 (binary 0100011), the output flag memory_write_enable is 1.

![instruction decode2](https://hackmd.io/_uploads/HyUyZWyZZx.png)

When the instruction is lw a4, -20(s0), it is an I-type load instruction, so memory_read_enable is true.

#### Execution

```scala
// SPDX-License-Identifier: MIT
// MyCPU is freely redistributable under the MIT License. See the file
// "LICENSE" for information on usage and redistribution of this file.

package riscv.core

import chisel3._
import chisel3.util.Cat
import chisel3.util.MuxLookup
import riscv.Parameters

/**
 * Execute: ALU operations and branch resolution for RV32I
 *
 * Pipeline Stage: EX (Execute)
 *
 * Key Responsibilities:
 * - Select ALU operands from register data, PC, or immediate values
 * - Perform arithmetic and logical operations via ALU submodule
 * - Evaluate branch conditions for all six RV32I branch types
 * - Calculate branch and jump target addresses
 * - Generate control signals for instruction fetch (jump flag and address)
 *
 * Control Flow Handling:
 * - Branch (B-type): Compare rs1 and rs2, compute PC + immediate if taken
 *   - BEQ/BNE: Equality comparison (signed or unsigned agnostic)
 *   - BLT/BGE: Signed comparison using .asSInt
 *   - BLTU/BGEU: Unsigned comparison using .asUInt
 * - JAL (J-type): Unconditional PC-relative jump, save PC+4 to rd
 * - JALR (I-type): Indirect jump to (rs1 + immediate) & ~1, save PC+4 to rd
 *
 * ALU Operand Selection:
 * - Operand 1: Register (rs1) or PC (for branches, JAL, AUIPC)
 * - Operand 2: Register (rs2) or immediate (for I-type, S-type, U-type)
 *
 * Interface:
 * - Inputs: instruction, PC, register values, immediate, operand source selects
 * - Outputs: ALU result (for MEM stage), jump flag, jump address (for IF stage)
 *
 * Branch Penalty:
 * - Taken branches/jumps cause 1-cycle penalty (IF must restart)
 * - Not-taken branches proceed without penalty
 */
class Execute extends Module {
  val io = IO(new Bundle {
    val instruction = Input(UInt(Parameters.InstructionWidth))
    val instruction_address = Input(UInt(Parameters.AddrWidth))
    val reg1_data = Input(UInt(Parameters.DataWidth))
    val reg2_data = Input(UInt(Parameters.DataWidth))
    val immediate = Input(UInt(Parameters.DataWidth))
    val aluop1_source = Input(UInt(1.W))
    val aluop2_source = Input(UInt(1.W))

    val mem_alu_result = Output(UInt(Parameters.DataWidth))
    val if_jump_flag = Output(Bool())
    val if_jump_address = Output(UInt(Parameters.DataWidth))
  })

  // Decode instruction fields
  val opcode = io.instruction(6, 0)
  val funct3 = io.instruction(14, 12)
  val funct7 = io.instruction(31, 25)

  // Instantiate ALU and control logic
  val alu = Module(new ALU)
  val alu_ctrl = Module(new ALUControl)

  alu_ctrl.io.opcode := opcode
  alu_ctrl.io.funct3 := funct3
  alu_ctrl.io.funct7 := funct7

  // Select ALU operands based on instruction type
  alu.io.func := alu_ctrl.io.alu_funct
  val aluOp1 = Mux(io.aluop1_source === ALUOp1Source.InstructionAddress, io.instruction_address, io.reg1_data)
  val aluOp2 = Mux(io.aluop2_source === ALUOp2Source.Immediate, io.immediate, io.reg2_data)
  alu.io.op1 := aluOp1
  alu.io.op2 := aluOp2
  io.mem_alu_result := alu.io.result

  // ============================================================
  // [CA25: Exercise 4] Branch Comparison Logic
  // ============================================================
  // Hint: Implement all six RV32I branch conditions
  //
  // Branch types:
  // - BEQ/BNE: Equality/inequality comparison (sign-agnostic)
  // - BLT/BGE: Signed comparison (requires type conversion)
  // - BLTU/BGEU: Unsigned comparison (direct comparison)
  val branchCondition = MuxLookup(funct3, false.B)(
    Seq(
      // TODO: Implement six branch conditions
      // Hint: Compare two register data values based on branch type
      InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data),
      InstructionsTypeB.bne -> (io.reg1_data =/= io.reg2_data),

      // Signed comparison (need conversion to signed type)
      InstructionsTypeB.blt -> (io.reg1_data.asSInt < io.reg2_data.asSInt),
      InstructionsTypeB.bge -> (io.reg1_data.asSInt >= io.reg2_data.asSInt),

      // Unsigned comparison
      InstructionsTypeB.bltu -> (io.reg1_data < io.reg2_data),
      InstructionsTypeB.bgeu -> (io.reg1_data >= io.reg2_data)
    )
  )
  val isBranch = opcode === InstructionTypes.Branch
  val isJal = opcode === Instructions.jal
  val isJalr = opcode === Instructions.jalr

  // ============================================================
  // [CA25: Exercise 5] Jump Target Address Calculation
  // ============================================================
  // Hint: Calculate branch and jump target addresses
  //
  // Address calculation rules:
  // - Branch: PC + immediate (PC-relative)
  // - JAL: PC + immediate (PC-relative)
  // - JALR: (rs1 + immediate) & ~1 (register base, clear LSB for alignment)
  //
  // TODO: Complete the following address calculations
  val branchTarget = io.instruction_address + io.immediate
  val jalTarget = branchTarget // JAL and Branch use same calculation method

  // JALR address calculation:
  // 1. Add register value and immediate
  // 2. Clear LSB (2-byte alignment)
  val jalrSum = io.reg1_data + io.immediate

  // TODO: Clear LSB using bit concatenation
  // Hint: Extract upper bits and append zero
  val jalrTarget = Cat(jalrSum(31, 1), 0.U(1.W))

  val branchTaken = isBranch && branchCondition
  io.if_jump_flag := branchTaken || isJal || isJalr
  io.if_jump_address := Mux(
    isJalr,
    jalrTarget,
    Mux(isJal, jalTarget, branchTarget)
  )
}
```

The ALU operands ***aluOp1*** and ***aluOp2*** are selected as follows:

```scala
val aluOp1 = Mux(io.aluop1_source === ALUOp1Source.InstructionAddress, io.instruction_address, io.reg1_data)
alu.io.op1 := aluOp1
val aluOp2 = Mux(io.aluop2_source === ALUOp2Source.Immediate, io.immediate, io.reg2_data)
alu.io.op2 := aluOp2
```

First, we check in the waveform that:

- aluop1_source = 0 → op1 = reg1_data
- aluop2_source = 0 → op2 = reg2_data

![execute2](https://hackmd.io/_uploads/B10UuGJZZg.png)

Then we check that:

- aluop1_source = 1 → op1 = instruction_address (PC)
- aluop2_source = 1 → op2 = immediate

![Execute](https://hackmd.io/_uploads/Hk5XvfyZWg.png)

#### Memory Access

```scala
// SPDX-License-Identifier: MIT
// MyCPU is freely redistributable under the MIT License. See the file
// "LICENSE" for information on usage and redistribution of this file.
package riscv.core import chisel3._ import chisel3.util._ import peripheral.RAMBundle import riscv.Parameters // Memory Access stage: handles load/store operations with proper byte/halfword/word alignment // // This module implements RV32I memory access operations: // - Load operations (LB, LH, LW, LBU, LHU): extract and sign/zero-extend data // - Store operations (SB, SH, SW): write with byte-level strobes // // Memory alignment: // - Addresses are byte-addressable but memory is organized as 32-bit words // - mem_address_index (bits 1:0) selects byte/halfword position within word // - Byte stores use individual byte strobes for precise writes // - Loads extract and extend data based on address alignment class MemoryAccess extends Module { val io = IO(new Bundle() { val alu_result = Input(UInt(Parameters.DataWidth)) val reg2_data = Input(UInt(Parameters.DataWidth)) val memory_read_enable = Input(Bool()) val memory_write_enable = Input(Bool()) val funct3 = Input(UInt(3.W)) val wb_memory_read_data = Output(UInt(Parameters.DataWidth)) val memory_bundle = Flipped(new RAMBundle) }) val mem_address_index = io.alu_result(log2Up(Parameters.WordSize) - 1, 0).asUInt io.memory_bundle.write_enable := false.B io.memory_bundle.write_data := 0.U io.memory_bundle.address := io.alu_result io.memory_bundle.write_strobe := VecInit(Seq.fill(Parameters.WordSize)(false.B)) io.wb_memory_read_data := 0.U // ============================================================ // [CA25: Exercise 6] Load Data Extension - Sign and Zero Extension // ============================================================ // Hint: Implement proper sign extension and zero extension for load operations // // RISC-V Load instruction types: // - LB (Load Byte): Load 8-bit value and sign-extend to 32 bits // - LBU (Load Byte Unsigned): Load 8-bit value and zero-extend to 32 bits // - LH (Load Halfword): Load 16-bit value and sign-extend to 32 bits // - LHU (Load Halfword Unsigned): Load 16-bit value and zero-extend to 
32 bits // - LW (Load Word): Load full 32-bit value, no extension needed // // Sign extension: Replicate the sign bit (MSB) to fill upper bits // Example: LB loads 0xFF → sign-extended to 0xFFFFFFFF // Zero extension: Fill upper bits with zeros // Example: LBU loads 0xFF → zero-extended to 0x000000FF when(io.memory_read_enable) { // Optimized load logic: extract bytes/halfwords based on address alignment val data = io.memory_bundle.read_data val bytes = Wire(Vec(Parameters.WordSize, UInt(Parameters.ByteWidth))) for (i <- 0 until Parameters.WordSize) { bytes(i) := data((i + 1) * Parameters.ByteBits - 1, i * Parameters.ByteBits) } // Select byte based on lower 2 address bits (mem_address_index) val byte = bytes(mem_address_index) // Select halfword based on bit 1 of address (word-aligned halfwords) val half = Mux(mem_address_index(1), Cat(bytes(3), bytes(2)), Cat(bytes(1), bytes(0))) // TODO: Complete sign/zero extension for load operations // Hint: // - Use Fill to replicate a bit multiple times // - For sign extension: Fill with the sign bit (MSB) // - For zero extension: Fill with zeros // - Use Cat to concatenate extension bits with loaded data io.wb_memory_read_data := MuxLookup(io.funct3, 0.U)( Seq( // TODO: Complete LB (sign-extend byte) // Hint: Replicate sign bit, then concatenate with byte InstructionsTypeL.lb -> Cat(Fill(24, byte(7)), byte), // TODO: Complete LBU (zero-extend byte) // Hint: Fill upper bits with zero, then concatenate with byte InstructionsTypeL.lbu -> Cat(0.U(24.W), byte), // TODO: Complete LH (sign-extend halfword) // Hint: Replicate sign bit, then concatenate with halfword InstructionsTypeL.lh -> Cat(Fill(16, half(15)), half), // TODO: Complete LHU (zero-extend halfword) // Hint: Fill upper bits with zero, then concatenate with halfword InstructionsTypeL.lhu ->Cat(0.U(16.W), half), // LW: Load full word, no extension needed (completed example) InstructionsTypeL.lw -> data ) ) // 
============================================================ // [CA25: Exercise 7] Store Data Alignment - Byte Strobes and Shifting // ============================================================ // Hint: Implement proper data alignment and byte strobes for store operations // // RISC-V Store instruction types: // - SB (Store Byte): Write 8-bit value to memory at byte-aligned address // - SH (Store Halfword): Write 16-bit value to memory at halfword-aligned address // - SW (Store Word): Write 32-bit value to memory at word-aligned address // // Key concepts: // 1. Byte strobes: Control which bytes in a 32-bit word are written // - SB: 1 strobe active (at mem_address_index position) // - SH: 2 strobes active (based on address bit 1) // - SW: All 4 strobes active // 2. Data shifting: Align data to correct byte position in 32-bit word // - mem_address_index (bits 1:0) indicates byte position // - Left shift by (mem_address_index * 8) bits for byte operations // - Left shift by 16 bits for upper halfword // // Examples: // - SB to address 0x1002 (index=2): data[7:0] → byte 2, strobe[2]=1 // - SH to address 0x1002 (index=2): data[15:0] → bytes 2-3, strobes[2:3]=1 }.elsewhen(io.memory_write_enable) { io.memory_bundle.write_enable := true.B io.memory_bundle.address := io.alu_result val data = io.reg2_data // Optimized store logic: reduce combinational depth by simplifying shift operations // mem_address_index is already computed from address alignment (bits 1:0) val strobeInit = VecInit(Seq.fill(Parameters.WordSize)(false.B)) val defaultData = 0.U(Parameters.DataWidth) val writeStrobes = WireInit(strobeInit) val writeData = WireDefault(defaultData) switch(io.funct3) { is(InstructionsTypeS.sb) { // TODO: Complete store byte logic // Hint: // 1. Enable single byte strobe at appropriate position // 2. 
//      Shift byte data to correct position based on address
        writeStrobes(mem_address_index) := true.B
        writeData := data(7, 0) << (mem_address_index << 3)
      }
      is(InstructionsTypeS.sh) {
        // TODO: Complete store halfword logic
        // Hint: Check address to determine lower/upper halfword position
        when(mem_address_index(1) === 0.U) {
          // Lower halfword (bytes 0-1)
          // TODO: Enable strobes for lower two bytes, no shifting needed
          writeStrobes(0) := true.B
          writeStrobes(1) := true.B
          writeData := data(15, 0)
        }.otherwise {
          // Upper halfword (bytes 2-3)
          // TODO: Enable strobes for upper two bytes, apply appropriate shift
          writeStrobes(2) := true.B
          writeStrobes(3) := true.B
          writeData := data(15, 0) << 16
        }
      }
      is(InstructionsTypeS.sw) {
        // Store word: enable all byte strobes, no shifting needed (completed example)
        writeStrobes := VecInit(Seq.fill(Parameters.WordSize)(true.B))
        writeData := data
      }
    }
    io.memory_bundle.write_data := writeData
    io.memory_bundle.write_strobe := writeStrobes
  }
}
```

When ***io.memory_read_enable*** is true, the code reads data from the memory bus and processes it according to the instruction type: lb sign-extends a byte, lbu zero-extends a byte, lh and lhu do the same for a halfword, and lw passes the full word through. The processed data is then assigned to io.wb_memory_read_data for write-back.

When ***io.memory_write_enable*** is true, the code writes data to memory, again according to the instruction type: sw writes 32 bits, sh writes 16 bits, and sb writes 8 bits, with the byte strobes selecting which bytes of the word are updated.

![mem_read](https://hackmd.io/_uploads/HJ0w2MkZbx.png)

As the waveform shows, a value is first written through ***io_write_data*** and the same value is then read back on ***io_read_data***.

#### Write-back

```scala
// SPDX-License-Identifier: MIT
// MyCPU is freely redistributable under the MIT License. See the file
// "LICENSE" for information on usage and redistribution of this file.
package riscv.core import chisel3._ import chisel3.util._ import riscv.Parameters // Write Back stage: selects final result to write to register file // // This is the final stage of the processor pipeline, responsible for multiplexing // the appropriate data source to be written back to the register file: // - ALU result (default): Arithmetic/logical operation output // - Memory data: Load instruction result // - Next instruction address (PC+4): Return address for JAL/JALR instructions // // The regs_write_source signal (from Decode stage) determines which source is selected. class WriteBack extends Module { val io = IO(new Bundle() { val instruction_address = Input(UInt(Parameters.AddrWidth)) val alu_result = Input(UInt(Parameters.DataWidth)) val memory_read_data = Input(UInt(Parameters.DataWidth)) val regs_write_source = Input(UInt(2.W)) val regs_write_data = Output(UInt(Parameters.DataWidth)) }) // ============================================================ // [CA25: Exercise 8] WriteBack Source Selection // ============================================================ // Hint: Select the appropriate write-back data source based on instruction type // // WriteBack sources: // - ALU result (default): Used by arithmetic/logical/branch/jump instructions // - Memory read data: Used by load instructions (LB, LH, LW, LBU, LHU) // - Next instruction address (PC+4): Used by JAL/JALR for return address // // The control signal regs_write_source (from Decode stage) selects: // - RegWriteSource.ALUResult (0): Default, use ALU computation result // - RegWriteSource.Memory (1): Load instruction, use memory read data // - RegWriteSource.NextInstructionAddress (2): JAL/JALR, save return address // // TODO: Complete MuxLookup to multiplex writeback sources // Hint: Specify default value and cases for each source type io.regs_write_data := MuxLookup(io.regs_write_source, io.alu_result)( Seq( RegWriteSource.Memory -> io.memory_read_data, RegWriteSource.NextInstructionAddress -> 
        (io.instruction_address + 4.U)
    )
  )
}
```

***In the write-back stage, the computed data or the data read from memory is written into registers. The write-back module is essentially a multiplexer, and the code is very simple. However, it raises an interesting question: the write-enable signal is generated in the decode stage, but at that point the correct write-back data has not been calculated (or read from memory). So, will incorrect write-back data be written into the register file, and why?***

- The answer is that ***incorrect write-back data WILL NOT be written into the register file***. The **Register File** only updates a register on the rising edge of the clock, and only when the write-enable signal is asserted. The write-enable signal produced by **Decode** is available early in the cycle, while the write data is still propagating through **Execute** and **Memory Access**. During that time write_data may carry intermediate values, but nothing is stored until the clock edge. By the time the next rising edge arrives (the end of the cycle), the write_data signal has stabilized to the correct value, so incorrect write-back data is never written into the register file.

#### Functional Tests

##### Fibonacci Test

![fibonaci waveform](https://hackmd.io/_uploads/rkCoF11WWx.png)

![fibonaci](https://hackmd.io/_uploads/B1i57yJ-Ze.png)

As the waveform shows, **io_instruction_read_data** for the Fibonacci program matches the expected waveform.

##### Quicksort

![quicksort](https://hackmd.io/_uploads/rkrl3JJW-g.png)

![quicksort wave](https://hackmd.io/_uploads/Hkxg21yZWe.png)

As the waveform shows, **io_instruction_read_data** for the Quicksort program matches the expected waveform.

##### SB

![sb](https://hackmd.io/_uploads/HJNweeJZbg.png)

![sb wave](https://hackmd.io/_uploads/r1tDgx1Zbe.png)

As the waveform shows, ***io_instruction_read_data*** for the SB test matches the expected waveform.

### Exercise 2 (2-mmio-trap)

***1. 
Ensure that the Nyancat animation is correctly rendered on the VGA display during Verilator-based simulation, and propose effective approaches to further compress the Nyancat program.***

After implementing the **TODO** parts, I ran the code to find and fix bugs. The output of the following command confirms that the implementation is correct.

![exercise22](https://hackmd.io/_uploads/B1HBD4ZW-l.png)

- **make compliance** passes all tests:

![test compliance4](https://hackmd.io/_uploads/Sy58aaR-Wl.png)

![demo cat](https://hackmd.io/_uploads/Bk9a9EWW-l.png)

#### 2. Propose effective approaches to further compress the Nyancat program

##### 2.1 Downscaling method

This is the simplest method to compress the Nyancat program. It only requires editing ***VGA.scala***, which defines the default resolution of the VGA display. I halve the display parameters to save capacity: Nyancat is pixel art, so it does not need to be stored at high resolution. The original parameters are:
```scala
// ============ VGA Timing Parameters ============
// 640×480 @ 72Hz, pixel clock = 31.5 MHz
val H_ACTIVE = 640
val H_FP     = 24  // Front porch
val H_SYNC   = 40  // Sync pulse width
val H_BP     = 128 // Back porch
val H_TOTAL  = H_ACTIVE + H_FP + H_SYNC + H_BP // 832

val V_ACTIVE = 480
val V_FP     = 9  // Front porch
val V_SYNC   = 3  // Sync pulse width
val V_BP     = 28 // Back porch
val V_TOTAL  = V_ACTIVE + V_FP + V_SYNC + V_BP // 520

// Display scaling and positioning (6× scaling as per design spec)
val FRAME_WIDTH    = 64
val FRAME_HEIGHT   = 64
val SCALE_FACTOR   = 6                               // 6× scaling: 64×64 → 384×384 (fits cleanly in 640×480)
val DISPLAY_WIDTH  = FRAME_WIDTH * SCALE_FACTOR      // 64×6 = 384
val DISPLAY_HEIGHT = FRAME_HEIGHT * SCALE_FACTOR     // 64×6 = 384
val LEFT_MARGIN    = (H_ACTIVE - DISPLAY_WIDTH) / 2  // Horizontal center: (640-384)/2 = 128
val TOP_MARGIN     = (V_ACTIVE - DISPLAY_HEIGHT) / 2 // Vertical center: (480-384)/2 = 48
```

Besides, the scale factor must change from 6 to 3 (so the division x/6 becomes x/3), because the display is now 320×240. With the old factor the image would be 64×6 = 384 pixels wide and high, which is out of range of the display, so the scale factor has to be reduced as well. The original division code is:
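The size argument above can be checked with a few lines of plain Scala before touching VGA.scala. This is only an arithmetic sketch and not part of the project: `VgaGeometry`, `Geometry` and `fits` are hypothetical names, and the numbers are the frame size, scale factors and active areas quoted in this section.

```scala
// Hypothetical helper, not part of VGA.scala: recomputes the scaled image
// size and the centering margins for a given active area and scale factor.
object VgaGeometry {
  final case class Geometry(hActive: Int, vActive: Int, frame: Int, scale: Int) {
    val display: Int    = frame * scale           // scaled width and height
    val leftMargin: Int = (hActive - display) / 2 // horizontal centering
    val topMargin: Int  = (vActive - display) / 2 // vertical centering
    // The image fits only if it is no larger than the active area.
    def fits: Boolean = display <= hActive && display <= vActive
  }

  def main(args: Array[String]): Unit = {
    val original = Geometry(640, 480, 64, 6) // original settings
    val tooLarge = Geometry(320, 240, 64, 6) // halved display, old scale factor
    val halved   = Geometry(320, 240, 64, 3) // halved display, scale factor 3
    for (g <- Seq(original, tooLarge, halved))
      println(s"$g: display=${g.display}, fits=${g.fits}, margins=(${g.leftMargin}, ${g.topMargin})")
  }
}
```

With a 320×240 active area, the 6× image (384 pixels) does not fit, while the 3× image (192 pixels) does and is centered with margins of 64 and 24 pixels.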
```scala
// Division by 6: Manual implementation due to Verilator division bug
// Use binary long division approximation: x/6 ≈ floor((x * 10923) / 65536)
// For 10-bit input (0-1023), this gives correct results with minimal error
// 10923/65536 = 0.166656 ≈ 1/6 (0.166667) - optimal constant for integer division
// Extract bits [23:16] for 8-bit division result
val frame_x_mult = rel_x * 10923.U
val frame_x_div  = frame_x_mult(23, 16)
val frame_y_mult = rel_y * 10923.U
val frame_y_div  = frame_y_mult(23, 16)

// Clamp to valid range [0, 63] to prevent out-of-bounds access
// Use 8-bit division result directly, clamp will handle overflow
val frame_x = Mux(frame_x_div >= FRAME_WIDTH.U, (FRAME_WIDTH - 1).U, frame_x_div(5, 0))
val frame_y = Mux(frame_y_div >= FRAME_HEIGHT.U, (FRAME_HEIGHT - 1).U, frame_y_div(5, 0))
```

***After the modification, the parameters look like this:***

```scala
// ============ VGA Timing Parameters ============
// 320×240, with the 640×480 timing values roughly halved
val H_ACTIVE = 320
val H_FP     = 12 // Front porch
val H_SYNC   = 20 // Sync pulse width
val H_BP     = 64 // Back porch
val H_TOTAL  = H_ACTIVE + H_FP + H_SYNC + H_BP // 416

val V_ACTIVE = 240
val V_FP     = 4  // Front porch
val V_SYNC   = 2  // Sync pulse width
val V_BP     = 14 // Back porch
val V_TOTAL  = V_ACTIVE + V_FP + V_SYNC + V_BP // 260

// Display scaling and positioning (3× scaling)
val FRAME_WIDTH    = 64
val FRAME_HEIGHT   = 64
val SCALE_FACTOR   = 3                               // 3× scaling: 64×64 → 192×192 (fits cleanly in 320×240)
val DISPLAY_WIDTH  = FRAME_WIDTH * SCALE_FACTOR      // 64×3 = 192
val DISPLAY_HEIGHT = FRAME_HEIGHT * SCALE_FACTOR     // 64×3 = 192
val LEFT_MARGIN    = (H_ACTIVE - DISPLAY_WIDTH) / 2  // Horizontal center: (320-192)/2 = 64
val TOP_MARGIN     = (V_ACTIVE - DISPLAY_HEIGHT) / 2 // Vertical center: (240-192)/2 = 24
```

We also need to implement x/3 as a multiply-and-shift of the form **x*K/65536**, with K close to 65536/3:
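Because the multiply-and-shift trick is easy to get wrong by one, it is worth checking the constant in software first. The sketch below is a plain Scala model, not project code: `DivByConstant`, `mulShift` and `mismatches` are hypothetical names, and it assumes that `rel_x` and `rel_y` never exceed 10 bits (0 to 1023). It compares the original constant for x/6 and the two candidate constants for x/3, floor(65536/3) = 21845 and ceil(65536/3) = 21846, against exact integer division.

```scala
// Software check of the multiply-and-shift division used for pixel mapping.
// Assumption: the inputs (rel_x / rel_y) are at most 10 bits wide, 0 to 1023.
object DivByConstant {
  // floor((x * magic) / 65536): the same arithmetic as dropping the low 16 bits.
  def mulShift(x: Int, magic: Int): Int = (x * magic) >> 16

  // All inputs in 0..1023 where the approximation differs from exact x / divisor.
  def mismatches(divisor: Int, magic: Int): Seq[Int] =
    (0 to 1023).filter(x => mulShift(x, magic) != x / divisor)

  def main(args: Array[String]): Unit = {
    println(s"x/6 with 10923: ${mismatches(6, 10923).size} mismatches")
    println(s"x/3 with 21845: ${mismatches(3, 21845).size} mismatches")
    println(s"x/3 with 21846: ${mismatches(3, 21846).size} mismatches")
  }
}
```

Under these assumptions the rounded-up constants (10923 for x/6, 21846 for x/3) are exact over the whole range, while 21845 is one too low at every nonzero multiple of 3: for example 3 * 21845 = 65535, which is just below 65536, so 3 maps to 0 instead of 1.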
```scala
// Division by 3: Manual implementation due to Verilator division bug
// Use multiply-and-shift: x/3 = floor((x * 21846) / 65536) for 10-bit input (0-1023)
// 21846 = ceil(65536 / 3). The rounded-down constant 21845 maps exact multiples
// of 3 one too low (3 * 21845 = 65535 < 65536, so 3 would give 0 instead of 1)
// Extract bits [23:16] for 8-bit division result
val frame_x_mult = rel_x * 21846.U
val frame_x_div  = frame_x_mult(23, 16)
val frame_y_mult = rel_y * 21846.U
val frame_y_div  = frame_y_mult(23, 16)

// Clamp to valid range [0, 63] to prevent out-of-bounds access
// Use 8-bit division result directly, clamp will handle overflow
val frame_x = Mux(frame_x_div >= FRAME_WIDTH.U, (FRAME_WIDTH - 1).U, frame_x_div(5, 0))
val frame_y = Mux(frame_y_div >= FRAME_HEIGHT.U, (FRAME_HEIGHT - 1).U, frame_y_div(5, 0))
```

**The result is shown below: the width and height are half of the original, so it saves capacity.**

![downscale](https://hackmd.io/_uploads/r1IdZH-bbe.png)

##### 2.2 Change the -O0 flag to -Os in the Makefile (optimization for size)

```makefile
CROSS_COMPILE ?= $(HOME)/riscv/toolchain/bin/riscv-none-elf-

ASFLAGS = -march=rv32i_zicsr -mabi=ilp32
CFLAGS = -Os -Wall -march=rv32i_zicsr -mabi=ilp32
LDFLAGS = --oformat=elf32-littleriscv
```

Add definitions for memcpy and memset:

```c
typedef unsigned long size_t;

/* Fill n bytes at dest with the byte value c. */
void *memset(void *dest, int c, size_t n) {
    unsigned char *p = (unsigned char *)dest;
    while (n--) {
        *p++ = (unsigned char)c;
    }
    return dest;
}

/* Copy n bytes from src to dest (the regions must not overlap). */
void *memcpy(void *dest, const void *src, size_t n) {
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;
    while (n--) {
        *d++ = *s++;
    }
    return dest;
}
```

Then run **make update CROSS_COMPILE=riscv64-unknown-elf-** to rebuild the program with this toolchain. As learned in homework 2, at **-O0** the compiler translates C code to assembly in a "naive" way.
A simple calculation or loop may produce dozens of redundant load/store instructions, and **nyancat.c** uses many **for** loops, which leads to a lot of redundant instructions. With **-Os**, the compiler analyzes and removes redundant instructions and keeps values in registers instead of constantly accessing memory.

To build with -Os we have to add **memset** and **memcpy**, because the compiler may replace a **for** loop that fills or copies memory with a call to one of these two functions, and in a bare-metal build nothing else provides them. Calling an optimized memset/memcpy instead of running a naive loop saves a lot of work, for two reasons. Note that the simple definitions above still copy one byte at a time, so the first reason applies to an optimized implementation.

**1. Data Width**

**for** loop: usually copies one byte (8 bits) at a time, so to copy 4096 bytes the CPU must perform 4096 reads and 4096 writes.

**Optimized memcpy**: checks whether the memory addresses are aligned (divisible by 4 or 8). If they are, it copies word by word (32-bit or 64-bit), so copying 4096 bytes with 32-bit words needs only 1024 reads and 1024 writes (4 times fewer).

**2. Reduced Loop Overhead**

Every time a for loop completes one iteration, the CPU must perform 3 extra tasks:

- Increment the counter (i++).
- Compare the counter (i < n).
- Jump/branch back to the start of the loop.

Copying a word per iteration instead of a byte reduces the number of iterations, and therefore this overhead, by the same factor.

**The result when changing to -Os**

The cat no longer stutters or draws line by line in slow motion, because with -Os the compiler recognizes where the program clears the screen or copies a frame. Instead of running a slow byte-by-byte for loop, it calls memset/memcpy (which are often hand-optimized, sometimes in assembly) to handle large data blocks at once. As a result, fewer instructions are executed, the simulator has less work to do per frame, and the frame rate increases.

### Exercise 3 (3-pipeline)

***Requirements: Perform Hazard Detection Summary and Analysis with Chisel and waveforms.***

**make compliance** passes all tests:

![testcompliance5](https://hackmd.io/_uploads/BJV2TTRZ-x.png)

#### 3.1 ThreeStage

This is the simplest pipeline, with the flow **IF->ID->EX/MEM/WB(folded)**.
To test hazard detection, I changed **Top.scala** as follows:

```scala
// SPDX-License-Identifier: MIT
// MyCPU is freely redistributable under the MIT License. See the file
// "LICENSE" for information on usage and redistribution of this file.

package board.verilator

import chisel3._
import chisel3.stage.ChiselStage
import riscv.core.CPU
import riscv.core.CPUBundle
import riscv.ImplementationType

class Top extends Module {
  val io = IO(new CPUBundle)

  val cpu = Module(new CPU(implementation = ImplementationType.ThreeStage))

  io.device_select := 0.U
  cpu.io.debug_read_address     := io.debug_read_address
  io.debug_read_data            := cpu.io.debug_read_data
  cpu.io.csr_debug_read_address := io.csr_debug_read_address
  io.csr_debug_read_data        := cpu.io.csr_debug_read_data

  io.memory_bundle <> cpu.io.memory_bundle
  io.instruction_address := cpu.io.instruction_address
  cpu.io.instruction     := io.instruction

  cpu.io.interrupt_flag    := io.interrupt_flag
  cpu.io.instruction_valid := io.instruction_valid
}

object VerilogGenerator extends App {
  (new ChiselStage).emitVerilog(
    new Top(),
    Array("--target-dir", "3-pipeline/verilog/verilator")
  )
}
```

For control hazards, Flush is set on the IF and ID stages on taken branches/jumps.

```scala
ctrl.io.JumpFlag := ex.io.if_jump_flag
if2id.io.flush   := ctrl.io.Flush
id2ex.io.flush   := ctrl.io.Flush
```

Then I ran **make verilator**, tested with Fibonacci.asmbin and analysed the waveforms.

![3stage test2](https://hackmd.io/_uploads/SkQHWxNW-l.png)

In the waveform, stage 3 (**id2ex.io_output_instruction**) holds **addi a0,zero,10**, which arrived from stage 2 (**if2id.io_output_instruction**), as expected for the 3-stage flow **IF->ID->EX/MEM/WB(folded)**. Next we check hazard detection on branches and jumps.
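Before reading the flush off the waveform, it helps to state what a flushed or stalled pipeline register is expected to do on each clock edge. The following is a behavioural sketch in plain Scala, not the project's `PipelineRegister`: the names `PipeRegModel`, `PipeReg` and `step` are hypothetical, and the choice that flush takes priority over stall is an assumption made for illustration.

```scala
// Behavioural model of a pipeline register with stall and flush inputs.
// Assumption (not taken from the project source): flush wins over stall.
object PipeRegModel {
  val Nop: Long = 0x00000013L // encoding of addi zero, zero, 0

  final class PipeReg(private var value: Long = Nop) {
    def out: Long = value

    // One rising clock edge: flush inserts a NOP, stall keeps the old value,
    // otherwise the register captures its input.
    def step(in: Long, stall: Boolean, flush: Boolean): Unit = {
      if (flush) {
        value = Nop
      } else if (!stall) {
        value = in
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val if2id = new PipeReg()
    if2id.step(in = 0x00a00513L, stall = false, flush = false) // addi a0, zero, 10
    println(f"after normal edge: 0x${if2id.out}%08x")
    if2id.step(in = 0x12345678L, stall = true, flush = false)  // input ignored, value held
    println(f"after stall edge:  0x${if2id.out}%08x")
    if2id.step(in = 0x12345678L, stall = false, flush = true)  // replaced by NOP
    println(f"after flush edge:  0x${if2id.out}%08x")
  }
}
```

In the three-stage core, a taken jump drives the flush input of both the IF/ID and ID/EX registers, so in this model both would output the NOP encoding on the following edge.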
At the same time, the current instruction in **if2id.io_output_instruction** is **jal ra,-152**, which is the condition that activates **flush** (hazard detection).

![3stage test 3](https://hackmd.io/_uploads/ryL87gE-bl.png)

**io_jump_flag_ex** and **io_flush** are both set to **1**. In the next cycle, both **if2id.io_output_instruction** and **id2ex.io_output_instruction** hold **NOP (addi zero,zero,0)**, because both IF2ID.scala and ID2EX.scala define their instruction register with this default value:

```scala
val instruction = Module(new PipelineRegister(defaultValue = InstructionsNop.nop))
instruction.io.in     := io.instruction
instruction.io.stall  := stall
instruction.io.flush  := io.flush
io.output_instruction := instruction.io.out
```

This analysis shows that **ThreeStage** performs hazard detection (flush) correctly in simulation on the Fibonacci test.

#### 3.2 Five Stage Stall

This is the five-stage pipeline with the flow **IF → ID → EX → MEM → WB**. To test hazard detection, I changed Top.scala as follows:

```scala
// SPDX-License-Identifier: MIT
// MyCPU is freely redistributable under the MIT License. See the file
// "LICENSE" for information on usage and redistribution of this file.
package board.verilator import chisel3._ import chisel3.stage.ChiselStage import riscv.core.CPU import riscv.core.CPUBundle import riscv.ImplementationType class Top extends Module { val io = IO(new CPUBundle) val cpu = Module(new CPU(implementation = ImplementationType.FiveStageStall)) io.device_select := 0.U cpu.io.debug_read_address := io.debug_read_address io.debug_read_data := cpu.io.debug_read_data cpu.io.csr_debug_read_address := io.csr_debug_read_address io.csr_debug_read_data := cpu.io.csr_debug_read_data io.memory_bundle <> cpu.io.memory_bundle io.instruction_address := cpu.io.instruction_address cpu.io.instruction := io.instruction cpu.io.interrupt_flag := io.interrupt_flag cpu.io.instruction_valid := io.instruction_valid } object VerilogGenerator extends App { (new ChiselStage).emitVerilog( new Top(), Array("--target-dir", "3-pipeline/verilog/verilator") ) } ``` For control hazards, Flush and Stall is set with interlocks (bubbles) and performs branch resolution in EX. You can see the condition is set in **Control.scala** ```scala // Hazard detection priority logic when(io.jump_flag) { // =========================== Control Hazard =========================== // Jump/branch taken - must flush incorrectly fetched instructions // Instructions in IF and ID stages are on wrong execution path io.if_flush := true.B // Clear IF/ID register (discard fetched instruction) io.id_flush := true.B // Clear ID/EX register (discard decoded instruction) }.elsewhen( // =========================== Data Hazard (RAW) =========================== // Conservative stalling: ANY register dependency causes a stall // No forwarding capability, so must wait for register write to complete // Check EX stage dependency (1-cycle old instruction): (io.reg_write_enable_ex && // EX stage will write a register (io.rd_ex === io.rs1_id || io.rd_ex === io.rs2_id) && // Destination matches ID source io.rd_ex =/= 0.U) // Not writing to x0 (always zero) || // Check MEM stage dependency (2-cycle 
    // old instruction):
      (io.reg_write_enable_mem &&                               // MEM stage will write a register
        (io.rd_mem === io.rs1_id || io.rd_mem === io.rs2_id) && // Destination matches ID source
        io.rd_mem =/= 0.U)                                      // Not writing to x0
  ) {
    // Stall action: Insert bubble (NOP) and freeze earlier stages
    io.id_flush := true.B // Insert NOP into ID/EX register (bubble)
    io.pc_stall := true.B // Freeze PC (don't fetch new instruction)
    io.if_stall := true.B // Freeze IF/ID register (hold current instruction)
    // Result: ID stage instruction waits until dependency resolved
  }
}
```

For jump/branch hazard detection: when the instruction in the Execute stage is a taken jump or branch, the **flush** signals are activated. When **jal ra,508** is in **EX**, both flushes are asserted and the **IF/ID** and **ID/EX** registers are overwritten with NOP (addi zero,zero,0). The behaviour is nearly the same as in **Three Stage**.

![5stagestall3](https://hackmd.io/_uploads/ryEl0M4WZe.png)

For RAW hazard detection (ANY register dependency causes a stall):

![5stagestall4](https://hackmd.io/_uploads/BJ-rmQN-We.png)

When the destination register of the instruction in EX is the same as the first or second source register of the instruction in ID, a stall is triggered. Here **addi sp,sp,-16** is being executed while ID holds **sw ra,12(sp)**, so the pipeline stalls. The PC is frozen for 2 cycles, the IF2ID register is interlocked and keeps its instruction for the next 2 cycles, and because **EX** receives no new instruction it is filled with bubbles (addi zero,zero,0). The last 4 signals in the waveform show the condition **(io.rd_ex === io.rs1_id || io.rd_ex === io.rs2_id)**: **rd_ex** equals **rs1_id** (both 02).

#### 3.3 Five Stage Forward

This is the five-stage pipeline with forwarding, with the flow IF → ID → EX → MEM → WB.
To test hazard detection, I changed Top.scala as follows:

```scala
// SPDX-License-Identifier: MIT
// MyCPU is freely redistributable under the MIT License. See the file
// "LICENSE" for information on usage and redistribution of this file.

package board.verilator

import chisel3._
import chisel3.stage.ChiselStage
import riscv.core.CPU
import riscv.core.CPUBundle
import riscv.ImplementationType

class Top extends Module {
  val io = IO(new CPUBundle)

  val cpu = Module(new CPU(implementation = ImplementationType.FiveStageForward))

  io.device_select := 0.U
  cpu.io.debug_read_address     := io.debug_read_address
  io.debug_read_data            := cpu.io.debug_read_data
  cpu.io.csr_debug_read_address := io.csr_debug_read_address
  io.csr_debug_read_data        := cpu.io.csr_debug_read_data

  io.memory_bundle <> cpu.io.memory_bundle
  io.instruction_address := cpu.io.instruction_address
  cpu.io.instruction     := io.instruction

  cpu.io.interrupt_flag    := io.interrupt_flag
  cpu.io.instruction_valid := io.instruction_valid
}

object VerilogGenerator extends App {
  (new ChiselStage).emitVerilog(
    new Top(),
    Array("--target-dir", "3-pipeline/verilog/verilator")
  )
}
```

This design handles control hazards like the previous ones, when a jump or branch instruction reaches the EX stage. When **jal ra,-68** is in the EX stage, id_flush is set to 1 and a **NOP** appears in both the ID and EX stages.

![5stageforward1](https://hackmd.io/_uploads/H1phL4EW-l.png)

Next is **load-use hazard detection**: the EX stage holds a load instruction whose destination is not x0, and the ID stage uses that destination as a source. Here the load **lw a5,-20(s0)** is in the EX stage and **addi a5,a5,-1** is in the ID stage, which meets the condition in **Control.scala**. In cycle T, **pc_stall,if_flush,id_flush** are set to 1. In the next cycle, T+1, the instruction **addi a5,a5,-1** is stalled.
The **EX** stage holds a NOP in this cycle, and the load moves on to the MEM stage, where its value is read from memory. In the last cycle, T+2, **addi a5,a5,-1** has moved from the ID stage to the **EX** stage, and the loaded value of a5, now in the **WB** stage, is forwarded to it in **EX**. This is confirmed by io_reg1_forward_ex being equal to 2, which is set in **Forwarding.scala** when the MEM/WB destination register matches the current rs1. All of this shows how forwarding works in **FiveStageForward**: it improves on **FiveStageStall** and reduces stalls significantly.

![5stageforward](https://hackmd.io/_uploads/rkF2U44bWe.png)

#### Five Stage Final

There are 3 hazard types: load-use hazards, jump-related hazards and control hazards. **Five Stage Final** improves considerably on the previous versions because it resolves branches early. I first test **Control.scala**.

- First, a hazard occurs when the **EX** stage holds a load whose result the **ID** stage needs as its first or second operand (the value is not available until the load finishes), or when a jump instruction in **ID** needs a register that the instruction in **EX** has not finished producing. This condition appears in **Control.scala**:

```scala
((io.jump_instruction_id || io.memory_read_enable_ex) &&
  io.rd_ex =/= 0.U &&
  (io.rd_ex === io.rs1_id || io.rd_ex === io.rs2_id))
```

- Second, a hazard occurs when a jump instruction in the **ID** stage needs the result of a load in the **MEM** stage. The loaded value only becomes available at the end of the **MEM** stage, so at that moment it cannot yet be forwarded to the **ID** stage.
```scala
(io.jump_instruction_id &&                              // Jump instruction in ID
  io.memory_read_enable_mem &&                          // Load instruction in MEM
  io.rd_mem =/= 0.U &&                                  // Load destination not x0
  (io.rd_mem === io.rs1_id || io.rd_mem === io.rs2_id)) // Load dest matches jump source
```

- In both cases the following actions are taken:

```scala
// Stall action: Insert bubble and freeze pipeline
// TODO: Which control signals need to be set to insert a bubble?
// Hint:
// - Flush ID/EX register (insert bubble)
// - Freeze PC (don't fetch next instruction)
// - Freeze IF/ID (hold current fetch result)
io.id_flush := true.B
io.pc_stall := true.B
io.if_stall := true.B
```

The waveform below illustrates this:

![fivestagefinal4](https://hackmd.io/_uploads/rkPKodHb-g.png)

Here the instruction **sw ra,4(sp)** is immediately followed by **bltu sp,t0,264**, which matches the condition above. **id_flush**, **pc_stall** and **if_stall** are then activated. The instruction in the **ID** stage (**bltu sp,t0,264**) is stalled and held for the next cycle, while a bubble (**NOP**, addi zero,zero,0) is inserted into the **EX** stage.

On the other hand, this is the most effective difference from the other **Five Stage** versions:

```scala
}.elsewhen(io.jump_flag) {
  // ============ Control Hazard (Branch Taken) ============
  // Branch resolved in ID stage - only 1 cycle penalty
  // Only flush IF stage (not ID) since branch resolved early
  // TODO: Which stage needs to be flushed when branch is taken?
  // Hint: Branch resolved in ID stage, discard wrong-path instruction
  io.if_flush := true.B
  // Note: No ID flush needed - branch already resolved in ID!
  // This is the key optimization: 1-cycle branch penalty vs 2-cycle
}
```

When **jump_flag** is set, we flush only the instruction in the **IF** stage, because the branch has already been resolved in the **ID** stage.
We do not need to wait for the wrong-path instruction to move from the **IF** stage to the **ID** stage and then flush it, which wastes 2 cycles as in the other versions. This version saves 1 cycle of branch penalty.

![fivestagefinal3](https://hackmd.io/_uploads/BkoSavrb-x.png)

At the **ID** stage, the instruction **jal ra,-88** activates **io_jump_flag** and **if_flush**. In the next cycle, when **jal ra,-88** has moved on to the **EX** stage, the jump has already been resolved in the **ID** stage and the correct target has been found: the PC goes from **4576** to **4484**. **if_flush** prevents the instruction at **4576+4** from running by discarding it (without the flush, this wrong-path instruction would continue into the **ID/EX/MEM** stages), and execution continues directly at the target of **jal ra,-88**. This version therefore has a 1-cycle penalty rather than the 2-cycle penalty of the versions above.

***Next I test Forwarding.scala with waveform analysis for the RAW (read-after-write) data hazard in the EX stage and for data forwarding to the ID stage.***

- The first case is the RAW (read-after-write) data hazard in the EX stage.

![fivestagefinalforwarding3](https://hackmd.io/_uploads/BJJCc08WWe.png)

Note the signals marked with red arrows. The instruction in the **EX** stage is **and t2,t0,t1**, which must read its source registers. **io_rd_wb === io_rs1_ex** (both 05), so **io_reg1_forward_ex** is 2, which stands for forwarding from WB. The next step is to verify that the value was forwarded correctly from WB to EX: **io_forward_from_wb** and **alu_io_op1** are both 00000001, which means that the value 00000001 from the **WB** stage is forwarded directly to the **EX** stage.
On the other hand, **io_output_reg1_data**, the value of rs1 read from the Register File and passed from ID to EX, is 00000000. It cannot be used because of the data hazard: since **rd_wb** equals **rs1_ex**, the value from the *WB* stage is forwarded instead of the stale 00000000 in *io_output_reg1_data*.

- The second case is data forwarding to the **ID** stage.

![fivestagefinalforwarding5](https://hackmd.io/_uploads/SkM-dkvZbl.png)

This is forwarding from the **MEM** stage to the **ID** stage. The instruction in the **ID** stage is **jalr t4,8(t4)**, and it needs the value of **t4**, the destination register of the instruction in the **MEM** stage (**jalr t4,4(t4)**). **io_reg_write_enable_mem** is 1 and **io_rd_mem** equals **io_rs1_id** (value 1d), so **io_reg1_forward_id** becomes 1, which means **forwarding from MEM**. To verify this on the waveform, **io_forward_from_mem** equals **reg1_data** (0000104c), so the value was forwarded correctly from the **MEM** stage to the **ID** stage. I do not check the flush flag in this example because it was covered in the part above. **io_reg1_data** holds a stale value (it does NOT match **reg1_data**), so it is not the value used. The code in **InstructionDecode.scala** that selects io_forward_from_mem for reg1_data is:

```scala
val reg1_data = MuxLookup(
  io.reg1_forward,
  0.U
)(
  IndexedSeq(
    ForwardingType.NoForward      -> io.reg1_data,
    ForwardingType.ForwardFromWB  -> io.forward_from_wb,
    ForwardingType.ForwardFromMEM -> io.forward_from_mem
  )
)
```

#### ANSWER QUESTIONS

**1. Why do we need to stall for load-use hazards? (Hint: Consider data dependency and forwarding limitations)**

We need to stall because of the physical timing of Data Memory access, which makes immediate forwarding impossible.
Specifically:

**Data Availability:** A load instruction (LW) only has valid data after the RAM access completes at the end of the MEM stage. However, a dependent instruction immediately following it needs that data at the beginning of the EX stage as an input for the ALU.

**Forwarding Limitation:** When the consumer instruction is in the EX stage, the producer instruction (LW) is in the MEM stage at the same time, still fetching the data. Since the data has not yet come out of the RAM, it cannot be forwarded from MEM to EX within the same clock cycle (we cannot forward a value that does not exist yet).

![answer1111](https://hackmd.io/_uploads/BJ_cWbOZZx.png)

- In the waveform, the instruction in the **EX** stage is **lw t2,2(t2)** (see **ex.io_instruction**) and the one in the **ID** stage is **or t3,t1,t2** (see **id.io_instruction**), which causes the load-use hazard. Look at the cycle from **50ps to 52ps (I will call it cycle T)**. The condition in Control.scala is activated:

```scala
// Complex hazard detection for early branch resolution in ID stage
when(
  // ============ Complex Hazard Detection Logic ============
  // This condition detects multiple hazard scenarios requiring stalls:

  // --- Condition 1: EX stage hazards (1-cycle dependencies) ---
  // TODO: Complete hazard detection conditions
  // Need to detect:
  // 1. Jump instruction in ID stage
  // 2. OR Load instruction in EX stage
  // 3. AND destination register is not x0
  // 4. 
  //    AND destination register conflicts with ID source registers
  //
  ((io.jump_instruction_id || io.memory_read_enable_ex) && // Either:
    // - Jump in ID needs register value, OR
    // - Load in EX (load-use hazard)
    io.rd_ex =/= 0.U &&                                    // Destination is not x0
    (io.rd_ex === io.rs1_id || io.rd_ex === io.rs2_id))    // Destination matches ID source
```

**io.rd_ex** (**t2** in lw t2,2(t2)) equals **io.rs2_id** (**t2** in or t3,t1,t2): both show the value **07** in the waveform, and **io_memory_read_enable_ex** is 1. As a result, **id_flush**, **if_stall** and **pc_stall** are set to 1.

- Next, **cycle T+1 (54ps to 56ps):**

![answer222](https://hackmd.io/_uploads/rkzaZWu-bx.png)

In this cycle, the instruction **lw t2,2(t2)** has moved to the **MEM** stage, while **or t3,t1,t2** is stalled in the **ID** stage and a **NOP** (addi zero,zero,0) is inserted into the **EX** stage, because of this action in **Control.scala**:

```scala
// Stall action: Insert bubble and freeze pipeline
// TODO: Which control signals need to be set to insert a bubble?
// Hint:
// - Flush ID/EX register (insert bubble)
// - Freeze PC (don't fetch next instruction)
// - Freeze IF/ID (hold current fetch result)
io.id_flush := true.B
io.pc_stall := true.B
io.if_stall := true.B
```

- Lastly, cycle **T+2 (58ps to 60ps)**:

![answer3](https://hackmd.io/_uploads/HJukf-uZZx.png)

In this cycle, the load has finished reading **t2** at the end of the **MEM** stage and has moved to the **WB** stage, and the instruction **or t3,t1,t2** has moved to the **EX** stage. Forwarding is now activated because **t2** has finished loading.
In the waveform, **io_output_memory_read_data** holds the loaded value of **t2**, which is *00000001*. It is equal to **alu_io_op2**, the rs2 operand of **or t3,t1,t2** in the **EX** stage, which shows that the operand was forwarded from **io_output_memory_read_data**.

- As a result, we can see the **data dependency**: the value of **t2** from **lw t2,2(t2)** is only available at the end of the **MEM** stage, while **or t3,t1,t2** needs to read **t2** at the beginning of the **EX** stage. We can also see the **forwarding limitation**: forwarding cannot be done in the same cycle in which **lw t2,2(t2)** is still reading **t2** in the **MEM** stage, because we cannot forward data that does not exist yet to the **EX** stage. So we need to stall for one cycle, which is why cycle **T+1** is used to hold the instruction in the **ID** stage. Only in cycle **T+2** does the value of **t2** reach the **MEM/WB** pipeline register, so that *00000001* can be forwarded to the **EX** stage.

**2. What is the difference between "stall" and "flush" operations?**

| Characteristics | Stall | Flush |
| -------- | -------- | -------- |
| *Actions* | Holds the current state | Clears or resets the state |
| *Effect on pipeline register* | Retains the previous value. Prevents new data from overwriting the register | Overwrites the current value with a NOP instruction (0x00000013, which is addi zero,zero,0) or with zero |
| *Effect on PC* | Maintains the current PC value | Often accompanied by updating the PC to a new target address (typically during jumps or branch mispredictions) |
| *Purpose* | To "buy time" (waiting for data to become available or for a hazard to resolve) | To "discard" invalid instructions (due to control hazards) or to insert the bubble required by stall logic |

You can see the **flush and stall operations** in the waveform below.
![question2](https://hackmd.io/_uploads/B1cCeGd--l.png)

- Pay attention to *flush and stall* through the signals **id_flush**, **if_stall** and **pc_stall**. The instruction **bne t3,t4,-36** meets the condition for **if_stall**, so it is stalled and stays in the **ID** stage in the next cycle, while the instruction **addi t4,zero,3** meets the condition for **id_flush**, so in the next cycle it is replaced with a **NOP** (addi zero,zero,0).

**3. Why does jump instruction with register dependency need stall?**

- Unlike the JAL instruction (which uses PC-relative addressing with an immediate value), the JALR instruction needs the value of a source register (rs1) to calculate the target address (Target = rs1 + offset). A stall is mandatory when there is a data dependency (RAW hazard), for the following reasons:

**Target Address Availability:** The target address can only be computed after the CPU has obtained the correct value of rs1. If rs1 is being updated by a preceding instruction such as a load, its most recent value has not yet been written back to the Register File.

**Timing Conflict:** The CPU attempts to resolve the jump target address during the ID stage (to reduce the branch penalty to 1 cycle). However, if the preceding instruction is a LW (load), the rs1 data is still being retrieved from RAM (currently in the MEM stage). As a consequence, the data is unavailable, so the CPU cannot calculate the jump address in the **ID** stage and cannot fetch the correct next instruction. The pipeline must therefore stall until rs1 becomes available, that is, until it can be forwarded from the **MEM/WB** stage, as shown in the waveform for question 1.
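
As a rough software model of the stall decision described above, the plain-Scala sketch below mirrors only the "Condition 1" excerpt from Control.scala and the `jump_flag` case quoted in this report. The names `HazardModel`, `Ctrl` and `control` are hypothetical and exist only for illustration; any further conditions in the real Control.scala are not modeled, and the rs2 field of the jalr example is assumed to be 0.

```scala
// Plain-Scala sketch (no Chisel). HazardModel, Ctrl and control are hypothetical names.
object HazardModel {
  // One Boolean per control signal named in the report.
  final case class Ctrl(pcStall: Boolean, ifStall: Boolean, idFlush: Boolean, ifFlush: Boolean)

  def control(
      jumpInId: Boolean, // models io.jump_instruction_id
      loadInEx: Boolean, // models io.memory_read_enable_ex
      rdEx: Int,         // models io.rd_ex
      rs1Id: Int,        // models io.rs1_id
      rs2Id: Int,        // models io.rs2_id
      jumpFlag: Boolean  // models io.jump_flag
  ): Ctrl = {
    // Condition 1 from the excerpt: producer in EX, consumer in ID, rd is not x0.
    val exHazard =
      (jumpInId || loadInEx) && rdEx != 0 && (rdEx == rs1Id || rdEx == rs2Id)

    if (exHazard) Ctrl(pcStall = true, ifStall = true, idFlush = true, ifFlush = false)
    else if (jumpFlag) Ctrl(pcStall = false, ifStall = false, idFlush = false, ifFlush = true)
    else Ctrl(pcStall = false, ifStall = false, idFlush = false, ifFlush = false)
  }

  def main(args: Array[String]): Unit = {
    // jalr ra,0(t2) in ID while lw t2,0(t0) is in EX; t2 is register x7.
    println(control(jumpInId = true, loadInEx = true, rdEx = 7, rs1Id = 7, rs2Id = 0, jumpFlag = false))
  }
}
```

In this model the jalr-after-load case sets the stall signals and the bubble (`idFlush`) together, while a taken jump without a hazard only sets `ifFlush`.
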
![question3](https://hackmd.io/_uploads/HkqG8w_W-l.png)

In the cycle from 27ps to 30ps, **id_flush** is set to 1. The **ID** stage instruction **jalr ra,0(t2)** needs the value of **t2** from **lw t2,0(t0)** in the **EX** stage, but as mentioned above, **t2** is not ready until the load moves from the end of the **MEM** stage to the **WB** stage. Only then can it be forwarded to **t2**, which is rs1 of **jalr ra,0(t2)**, so **id_flush** is set to 1 in this cycle. As a result, in the next cycle (**30ps to 37ps**) the instruction **jalr ra,0(t2)** is still held in the **ID** stage and a **NOP** is inserted into the **EX** stage.

![question3](https://hackmd.io/_uploads/HklF2PuWZg.png)

In the following cycle (37ps to 40ps), **jalr ra,0(t2)** moves to the **EX** stage, and by then **t2 from lw t2,0(t0)** has finished loading and has been forwarded to **jalr ra,0(t2)**. This is why a jump instruction with a register dependency needs a stall. Without the stall, **jalr ra,0(t2)** would take the previous value of **t2** and jump to the wrong address, and forwarding alone could not handle this situation **(this is also a forwarding limitation: the value cannot be forwarded in the same cycle, so a stall is necessary as well).**

**4. In this design, why is branch penalty only 1 cycle instead of 2? (Hint: Compare ID-stage vs EX-stage branch resolution)**

- In this design, the branch penalty is reduced to 1 cycle because we implement the Early Branch Resolution technique.

**Original approach (From *ThreeStage*) (EX-Stage Resolution - 2-cycle penalty):** If the branch decision (register comparison) and target address calculation occur in the EX stage, the pipeline has already speculatively fetched the next two instructions (currently in the ID and IF stages).
When a branch is taken, both of these invalid instructions must be flushed, resulting in a 2-cycle penalty. You can see this in the **ThreeStage** waveform together with the condition in CPU.scala: when there is a jump or branch in the **EX** stage, the jump flag is set to 1 and the instructions in both the **IF and ID stages** are flushed.

```scala
  ctrl.io.JumpFlag := ex.io.if_jump_flag
  if2id.io.flush   := ctrl.io.Flush
  id2ex.io.flush   := ctrl.io.Flush
```

![3stage test 3](https://hackmd.io/_uploads/SJveyVuW-x.png)

In this waveform, when **id2ex.io_output_instruction is jal ra,-152**, the next 2 cycles are the 2-cycle penalty: both contain a **NOP** (addi zero,zero,0) because of the condition above. This can be followed in the cycles from 58ps to 64ps, which contain only **addi zero,zero,0**. This is the proof of the **2-cycle penalty**, because these bubbles come directly from the instructions in the **IF and ID** stages, which were flushed in the cycle from 54ps to 57ps when the **EX** stage detected the jump instruction **jal ra,-152**.

**Current Design (ID-Stage Resolution - 1-cycle penalty):** We have moved the branch comparator and target address adder to the ID stage. The branch decision is made while the instruction is being decoded. At this moment, the pipeline has only fetched one incorrect instruction (currently in the IF stage). As a result, we only need to flush the IF stage (if_flush), which gives a 1-cycle penalty to clear the pipeline and fetch the correct instruction. The condition for this 1-cycle penalty follows. It is located in Control.scala in *FiveStageFinal*.

```scala
    .elsewhen(io.jump_flag) {
      // ============ Control Hazard (Branch Taken) ============
      // Branch resolved in ID stage - only 1 cycle penalty
      // Only flush IF stage (not ID) since branch resolved early
      // TODO: Which stage needs to be flushed when branch is taken?
      // Hint: Branch resolved in ID stage, discard wrong-path instruction
      io.if_flush := true.B
      // Note: No ID flush needed - branch already resolved in ID!
      // This is the key optimization: 1-cycle branch penalty vs 2-cycle
    }
```

![fivestagefinalforwarding5](https://hackmd.io/_uploads/ryyRQU_ZZl.png)

I will use the waveform from the **FiveStageFinal** analysis above. The instruction **jalr t4,8(t4)** is in the **ID** stage. There is only one bubble, a **NOP** (addi zero,zero,0), which also appears in the **ID** stage in the cycle from 98ps to 100ps, because **if_flush** was activated in the cycle from 94ps to 96ps. I already gave the detailed analysis in question 1, so I will not repeat it here; this waveform is only meant to show the difference between the *1-cycle penalty and the 2-cycle penalty*.

**5. What would happen if we removed the hazard detection logic entirely? (Hint: Consider data hazards and control flow correctness)**

- Removing the hazard detection logic (specifically the stall and flush operations) would result in a non-functional CPU that produces incorrect results and executes invalid instructions, so the register file and memory would be corrupted immediately. Specifically:

**Data Corruption due to Unresolved Load-Use Hazards:** Without hazard detection, the CPU would fail to stall when a load instruction (lw) is immediately followed by a dependent instruction. The dependent instruction would proceed to the EX stage while the load is still in the MEM stage, and the loaded data is only available at the end of the **MEM** stage on its way to the **WB** stage (in mem2wb). Since the data has not returned from memory, forwarding is physically impossible. As a result, the dependent instruction executes with invalid or stale data, propagating calculation errors throughout the program.

**Control Flow Violation due to Branch/Jump Hazards:** Without hazard detection, the CPU would fail to flush the pipeline when a branch or jump is taken.
Due to speculative fetching, instructions following the branch (at PC+4) enter the pipeline before the branch decision is made. Without the flush operation **(if_flush)**, these wrong-path instructions would run to completion (WB stage). They would modify registers or memory, permanently corrupting the program state before the control flow redirects to the correct target.

![question666](https://hackmd.io/_uploads/ryjtjwYbbx.png)

In detail, I turned off the **flush and stall** operations, and when I run the test the CPU can no longer flush or stall when it meets the instruction **lw ra,0(a0)**. Forwarding then behaves incorrectly, as the waveform shows: when **rd_mem** equals **rs2_id**, **io_reg2_forward_ex** is set to 1, meaning **forwarding from the MEM stage**, but the values of **alu_io_op2** and **io_forward_from_mem** are not the same (00000081 and 00000015 respectively). All of this evidence shows that forwarding without a **stall** is physically impossible here: **rs2_ex** has taken wrong data, which propagates calculation errors throughout the program. **(If forwarding had worked correctly, the values of **alu_io_op2** and **io_forward_from_mem** would be the same.)** This is an example of what happens when we remove the hazard detection logic.

#### Modify the handwritten RISC-V assembly code in Homework2 to ensure it functions correctly on the pipelined RISC-V CPU (i.e., 3-pipeline).

I will take the already optimized **hanoi_optimized.S** from homework2 and modify it.
**The original version, from hanoi_optimized.S in homework2:**

```c
    .text
    .globl hanoi_asm
hanoi_asm:
    # only save used registers
    addi x2, x2, -32        # Reduced from 56 to 32 bytes
    sw   x8, 0(x2)
    sw   x9, 4(x2)
    sw   x18, 8(x2)
    sw   x19, 12(x2)
    # removed x20, s2-s7 (never used)

    # 3 disk positions stored at offsets 16,20,24
    sw   x0, 16(x2)
    sw   x0, 20(x2)
    sw   x0, 24(x2)

    li   x8, 1

game_loop:
    addi x5, x0, 8
    beq  x8, x5, finish_game

    srli x5, x8, 1
    xor  x6, x8, x5

    addi x7, x8, -1
    srli x28, x7, 1
    xor  x7, x7, x28

    xor  x5, x6, x7

    addi x9, x0, 0
    andi x6, x5, 1
    bne  x6, x0, disk_found

    addi x9, x0, 1
    andi x6, x5, 2
    bne  x6, x0, disk_found

    addi x9, x0, 2

disk_found:
    slli x5, x9, 2
    addi x5, x5, 16         #Changed from 44 to 16
    add  x5, x2, x5
    lw   x18, 0(x5)

    bne  x9, x0, handle_large

    addi x19, x18, 2
    li   x6, 3
    blt  x19, x6, display_move
    sub  x19, x19, x6
    jal  x0, display_move

handle_large:
    lw   x6, 16(x2)         #Changed from 44 to 16
    li   x19, 3
    sub  x19, x19, x18
    sub  x19, x19, x6

display_move:
    addi t2, x18, 65
    addi t3, x19, 65

    li   a0, 1
    la   a1, str1
    li   a2, 10
    li   a7, 64
    ecall

    addi t0, x9, 1
    addi t0, t0, 48
    la   t1, char_buffer
    sb   t0, 0(t1)
    li   a0, 1
    la   a1, char_buffer
    li   a2, 1
    li   a7, 64
    ecall

    li   a0, 1
    la   a1, str2
    li   a2, 6
    li   a7, 64
    ecall

    la   t1, char_buffer
    sb   t2, 0(t1)
    li   a0, 1
    la   a1, char_buffer
    li   a2, 1
    li   a7, 64
    ecall

    li   a0, 1
    la   a1, str3
    li   a2, 4
    li   a7, 64
    ecall

    la   t1, char_buffer
    sb   t3, 0(t1)
    li   a0, 1
    la   a1, char_buffer
    li   a2, 1
    li   a7, 64
    ecall

    li   t0, 10
    la   t1, char_buffer
    sb   t0, 0(t1)
    li   a0, 1
    la   a1, char_buffer
    li   a2, 1
    li   a7, 64
    ecall

    slli x5, x9, 2
    addi x5, x5, 16         #Changed from 44 to 16
    add  x5, x2, x5
    sw   x19, 0(x5)

    addi x8, x8, 1
    jal  x0, game_loop

finish_game:
    lw   x8, 0(x2)
    lw   x9, 4(x2)
    lw   x18, 8(x2)
    lw   x19, 12(x2)
    # removed loads of x20, s2-s7
    addi x2, x2, 32         # Changed from 56 to 32
    ret

    .data
str1:        .asciz "Move Disk "
str2:        .asciz " from "
str3:        .asciz " to "
char_buffer: .space 1
```

**The modified version:**

```c
    .text
    .global _start

_start:
    # --- INIT ---
    li   sp, 0x02000        # Initialize Stack Pointer
    la   t0, trap_vector
    csrw mtvec, t0          # Setup Trap Vector Base Address

    # --- PROLOGUE ---
    addi x2, x2, -32
    sw   x8, 0(x2)
    sw   x9, 4(x2)
    sw   x18, 8(x2)
    sw   x19, 12(x2)

    # Init Disk positions (0,0,0)
    sw   x0, 16(x2)
    sw   x0, 20(x2)
    sw   x0, 24(x2)

    # --- PRE-LOAD CONSTANTS (Invariant Code Motion) ---
    li   x8, 1              # Counter start
    li   x4, 8              # Loop limit (Kept in x4)
    li   x20, 3             # Constant 3 (Used for peg calculation)

game_loop:
    # Use x4 directly, remove redundant 'addi x5'
    beq  x8, x4, finish_game

    # --- GRAY CODE LOGIC ---
    srli x5, x8, 1
    addi x7, x8, -1
    xor  x6, x8, x5
    srli x28, x7, 1
    xor  x7, x7, x28
    xor  x5, x6, x7

    # --- FIND DISK ---
    # Disk finding logic remains unchanged as it is data-dependent
    andi x6, x5, 1
    addi x9, x0, 0
    #andi x6, x5, 1
    bne  x6, x0, disk_found

    andi x6, x5, 2
    addi x9, x0, 1
    #andi x6, x5, 2
    bne  x6, x0, disk_found

    addi x9, x0, 2

disk_found:
    slli x5, x9, 2
    addi x5, x5, 16
    add  x5, x2, x5
    lw   x18, 0(x5)         # LOAD disk position

    # Hazard Fix: Inserted Branch to avoid Load-Use Stall
    bne  x9, x0, handle_large

    addi x19, x18, 2
    lw   x6,16(x2)
    blt  x19, x20, update_state   # Compare with x20
    sub  x19, x19, x20            # Subtract x20
    jal  x0, update_state

handle_large:
    lw   x6, 16(x2)
    # Use x20 (value 3) instead of 'li x19, 3'
    sub  x19, x20, x18      # 3 - current
    sub  x19, x19, x6       # result - other

update_state:
    slli x5, x9, 2
    addi x5, x5, 16
    add  x5, x2, x5
    sw   x19, 0(x5)         # Store new position

    # --- Trap Test ---
    ecall

    addi x8, x8, 1          # Increment Counter
    jal  x0, game_loop

finish_game:
    lw   x9, 4(x2)
    lw   x18, 8(x2)
    lw   x19, 12(x2)
    addi x2, x2, 32

finish_loop:
    jal  x0, finish_loop

# --- TRAP HANDLER ---
    .align 4
trap_vector:
    addi sp, sp, -4
    sw   t0, 0(sp)

    csrr t0, mepc
    csrw mepc, t0

    lw   t0, 0(sp)
    addi sp, sp, 4
    mret
```

**The modified version has the following changes:**

**1. 
Set up stack pointer and trap handler**

We need to initialize the stack pointer to a safe RAM address, which is 0x02000. In the hardware simulation the register starts at 0, so without this initialization the instruction **addi sp,sp,-32** would produce an out-of-range address. That is why we initialize the stack pointer first.

Next, we set up the **trap handler** so that the **ecall** instruction is supported without halting the CPU. The purpose is to test trap handling in the **3-pipeline** design. The address of the trap_vector routine is loaded into the mtvec (Machine Trap-Vector Base-Address) CSR. The trap handler logic is as follows:

- The temporary register t0 is pushed onto the stack to preserve the state of the interrupted program.
- The Machine Exception Program Counter (mepc) is read and written back to confirm the return address.
- The mret instruction is executed to restore the privilege mode and jump back to the instruction flow defined in mepc, effectively bypassing the ecall and continuing execution.

**2. Deal with blt/bne instructions, remove the restore of the counter register x8 (lw x8, 0(x2)), and remove the printf-style output ecalls.**

**Deal with blt/bne instructions to avoid stall operations**

```c
    # --- FIND DISK ---
    # Disk finding logic remains unchanged as it is data-dependent
    andi x6, x5, 1
    addi x9, x0, 0
    #andi x6, x5, 1
    bne  x6, x0, disk_found

    andi x6, x5, 2
    addi x9, x0, 1
    #andi x6, x5, 2
    bne  x6, x0, disk_found

    addi x9, x0, 2
```

In the original version, the instruction **andi x6,x5,1** comes immediately before **bne x6,x0,disk_found**, which reads x6. I therefore swapped the order of **addi x9,x0,0** and **andi x6,x5,1**. The logic does not change, because the two swapped instructions write different registers (x6 and x9) and neither reads the other's result, but the swap puts one instruction between the producer of x6 and the branch, which avoids the stall. I checked the waveform to confirm that no stall is activated.
![ans1](https://hackmd.io/_uploads/HJFIaOabbx.png)

```c
    addi x19, x18, 2
    lw   x6,16(x2)
    blt  x19, x20, update_state   # Compare with x20
    sub  x19, x19, x20            # Subtract x20
    jal  x0, update_state
```

In the original version above, **addi x19,x18,2** comes right before the **blt** that reads x19 (with only **li x6,3** in between, which is now hoisted out of the loop). So we insert **lw x6,16(x2)**, the same load used in **handle_large**, between them, so that the CPU is not stalled by **blt x19,x20,update_state** directly following **addi x19,x18,2**. I checked the waveform to confirm that there is no stall.

![ans2](https://hackmd.io/_uploads/Hye5aOpW-e.png)

**Remove the restore of the counter register x8 (lw x8, 0(x2)) and the printf-style output.**

In homework 2, we ran the program on **Ripes** and **RV32emu**, where the output appears in the **console log** because the environment services the write system call behind **printf**. In **myCPU** there is no operating system, so we remove the instruction sequences such as **li a7,64** followed by ecall, which rely on Linux system calls (sys_write) to perform console I/O. Since myCPU does not implement these OS services, executing them would result in undefined behavior. We still keep one ecall to check the trap handler. As a result, we can no longer tell from printed output whether the program succeeded, so instead we watch the counter register **x8** and check that it equals 8 at the end of the run. If we restored x8 from the stack at the end, it would go back to 0, and we could not check its final value in the **waveform** produced by Verilator.
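
To make the expected end state concrete, here is a small plain-Scala reference model of the loop in the modified assembly. It is only a sketch that follows the register-level logic of the listing (Gray-code difference selects the disk, the smallest disk moves by +2 modulo 3, the other disks move to 3 - current - position of disk 0); `HanoiModel` and `run` are hypothetical names that do not exist in the repository.

```scala
// Plain-Scala reference model. HanoiModel and run are hypothetical names.
object HanoiModel {
  // Returns (final counter value, final peg of each disk).
  def run(): (Int, Vector[Int]) = {
    var pos = Vector(0, 0, 0) // peg of disks 0..2, as stored at stack offsets 16, 20, 24
    var n = 1                 // loop counter kept in x8
    while (n != 8) {          // beq x8, x4, finish_game
      val gray = n ^ (n >> 1)
      val prevGray = (n - 1) ^ ((n - 1) >> 1)
      val diff = gray ^ prevGray // exactly one bit differs between consecutive Gray codes
      val disk = if ((diff & 1) != 0) 0 else if ((diff & 2) != 0) 1 else 2
      val cur = pos(disk)
      val next =
        if (disk == 0) { val t = cur + 2; if (t < 3) t else t - 3 } // smallest-disk path
        else 3 - cur - pos(0)                                        // handle_large path
      pos = pos.updated(disk, next)
      n += 1                  // addi x8, x8, 1
    }
    (n, pos)
  }

  def main(args: Array[String]): Unit = {
    val (counter, pegs) = run()
    println(s"x8 = $counter, pegs = $pegs")
  }
}
```

In this model the loop exits with the counter at 8 and all three disks on peg 2, which is the same x8 value the waveform check looks for.
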
Besides, I also moved the constant loads **(li x4, 8 and li x20, 3)** outside game_loop to reduce the instruction count. The original code had **addi x5,x0,8** before **beq x8,x5,finish_game** in game_loop, plus **li x6,3** in disk_found and **li x19,3** in handle_large. Hoisting them reduces the dynamic instruction count by approximately 24 (3 instructions times 8 iterations), which improves execution efficiency.

```c
trap_vector:
    addi sp, sp, -4
    sw   t0, 0(sp)          # (ecall)

    csrr t0, mepc
    csrw mepc, t0

    lw   t0, 0(sp)
    addi sp, sp, 4
    mret
```

The following condition is in **CLINT.scala**, and I will analyze the **CLINT** behavior in the waveform to check **ecall**.

```scala
  // Trap entry: Set MPP=0b11 (Machine mode), MPIE=MIE (save), MIE=0 (disable)
  val mstatus_disable_interrupt =
    io.csr_bundle.mstatus(31, 13) ## 3.U(2.W) ## io.csr_bundle.mstatus(10, 8) ## io.csr_bundle.mstatus(3) ##
      io.csr_bundle.mstatus(6, 4) ## 0.U(1.W) ## io.csr_bundle.mstatus(2, 0)

  // Trap return: Set MPP=0b11 (Machine mode), MPIE=1, MIE=MPIE (restore)
  val mstatus_recover_interrupt =
    io.csr_bundle.mstatus(31, 13) ## 3.U(2.W) ## io.csr_bundle.mstatus(10, 8) ## 1.U(1.W) ##
      io.csr_bundle.mstatus(6, 4) ## io.csr_bundle.mstatus(7) ## io.csr_bundle.mstatus(2, 0)
  ...
  when(io.instruction_id === InstructionsEnv.ecall || io.instruction_id === InstructionsEnv.ebreak) {
    io.csr_bundle.mstatus_write_data := mstatus_disable_interrupt
    io.csr_bundle.mepc_write_data    := instruction_address
    io.csr_bundle.mcause_write_data := MuxLookup(
      io.instruction_id,
      10.U
    )(
      IndexedSeq(
        InstructionsEnv.ecall  -> 11.U,
        InstructionsEnv.ebreak -> 3.U,
      )
    )
```

![answer150](https://hackmd.io/_uploads/B1mZUQhZbl.png)

- Upon encountering the ecall instruction, the mcause register immediately updates to 11 (0xb), which corresponds to the 'Environment call from M-mode' exception code, confirming that the trap logic has been triggered correctly.
- Simultaneously, the mstatus register is updated to reflect the trap state: bits 12 and 11 (the MPP field) are set to 3 (binary 11) to indicate that the previous mode was Machine Mode, while the MIE bit (bit 3) is cleared to 0. This disables global interrupts, so the trap handler executes without being interrupted, which matches the **mstatus_disable_interrupt** value defined above.
- Subsequently, the Program Counter jumps to the trap handler routine. Here, the temporary register t0 is pushed onto the stack (addi sp, sw) to preserve the context and prevent data corruption in the main program. The routine then verifies CSR access by reading the Machine Exception Program Counter (mepc) into t0 with csrr and writing it back with csrw. Finally, the original value of t0 is popped from the stack (lw, addi sp) to fully restore the processor state.
- Crucially, the mepc register captures the return address. As observed, mepc holds the value 0x000010b8, which points to the addi s0, s0, 1 instruction immediately following the ecall. This ensures that when mret is executed, the program correctly resumes execution by incrementing the counter.

![answer10000](https://hackmd.io/_uploads/Hyp0kbnbZe.png)

- On trap entry, the old value of **MIE** (bit 3) is also saved into **MPIE** (bit 7). When leaving the trap handler with **mret**, **MIE** is restored from **MPIE** and **MPIE** is set to 1, meaning that other interrupts can be accepted again. You can see that bit 7 is equal to 1 in **mstatus**, which matches the **mstatus_recover_interrupt** value defined above.

![scape](https://hackmd.io/_uploads/BJkPhXn-We.png)

When the CPU meets **mret** (displayed in the waveform as UNKNOWN INSG 0x30200073), **if_flush** is activated because mret redirects the PC like a jump. You can then see the **NOP** before execution reaches **addi s0,s0,1** at address **000010b8**, which is the same as mepc, meaning that our **CSR and CLINT modules** work correctly.
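
The mstatus updates described above can be checked with a small bit-level model. The plain-Scala sketch below re-expresses the two concatenations from the CLINT.scala excerpt as mask operations on a 32-bit value held in a `Long`; `MstatusModel`, `trapEntry` and `trapReturn` are hypothetical names used only for this illustration.

```scala
// Plain-Scala sketch (no Chisel). MstatusModel, trapEntry and trapReturn are hypothetical names.
object MstatusModel {
  private val MieBit  = 3          // mstatus.MIE
  private val MpieBit = 7          // mstatus.MPIE
  private val MppMask = 0x3L << 11 // mstatus.MPP = 0b11 (Machine mode)
  private val Word    = 0xFFFFFFFFL

  // Trap entry: MPP = 0b11, MPIE = old MIE, MIE = 0 (models mstatus_disable_interrupt).
  def trapEntry(mstatus: Long): Long = {
    val mie     = (mstatus >> MieBit) & 1L
    val cleared = mstatus & ~((1L << MieBit) | (1L << MpieBit))
    (cleared | (mie << MpieBit) | MppMask) & Word
  }

  // Trap return: MPP = 0b11, MPIE = 1, MIE = old MPIE (models mstatus_recover_interrupt).
  def trapReturn(mstatus: Long): Long = {
    val mpie    = (mstatus >> MpieBit) & 1L
    val cleared = mstatus & ~((1L << MieBit) | (1L << MpieBit))
    (cleared | (mpie << MieBit) | (1L << MpieBit) | MppMask) & Word
  }

  def main(args: Array[String]): Unit = {
    val entered = trapEntry(0x8L) // start with only MIE set
    println(f"after ecall: 0x$entered%08x, after mret: 0x${trapReturn(entered)}%08x")
  }
}
```

Starting from an mstatus with only MIE set, the model gives MPP=3, MPIE=1, MIE=0 after the trap entry, and MIE=1 with MPIE=1 after the return, which is the pattern the waveform bullets describe.
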
```c
update_state:
    slli x5, x9, 2
    addi x5, x5, 16
    add  x5, x2, x5
    sw   x19, 0(x5)

    # --- 3. TEST ECALL ---
    ecall

    addi x8, x8, 1
    jal  x0, game_loop
```

After mret, execution jumps back to **addi x8,x8,1** in the code above. You can see **addi s0,s0,1** in the **EX** stage: *alu_io_op1* is now 00000003, it is added to 1 (*alu_io_op2*), and the result **alu_io_result** is 00000004. We keep checking until **alu_io_result** is 00000008, after which the program loops forever, meaning that it has run correctly and agrees with **make test**.

![answer99](https://hackmd.io/_uploads/BkyBtehbbe.png)

When **alu_io_result** reaches 00000008 a few cycles later, x8 is equal to **8** and the game finishes.

![anwer9999](https://hackmd.io/_uploads/ryDqYx3-Wl.png)

The program then enters the infinite loop, meaning that it has run correctly. We will compare this with **make test**. Before doing that, we have to add a test for this program in **PipelineProgramTest.scala**:

```scala
  it should "solve Towers of Hanoi (Optimized)" in {
    runProgram("hanoi_opt.asmbin", cfg) { c =>
      c.clock.setTimeout(50000)
      c.clock.step(20000)

      // Verify: read x8 through the register debug port
      c.io.regs_debug_read_address.poke(8.U)
      c.clock.step()

      val result = c.io.regs_debug_read_data.peek().litValue
      println(f"Hanoi Result Check: x8 = $result")

      c.io.regs_debug_read_data.expect(8.U, "Hanoi 3 disks should finish with x8 = 8")
    }
  }
```

Then we run **make test**, and the output **x8** should be equal to 8, as in the waveform analysis.

![answer111111](https://hackmd.io/_uploads/Bkw2BGnbWe.png)

### Express what you have learned from Chisel Bootcamp

Through the Chisel Bootcamp and the series of RISC-V CPU design labs, ranging from single-cycle architecture and MMIO trap handling to pipelined designs, I have broadened my understanding of the "hardware generator" mindset using Chisel on the Scala platform.
I learned to use basic components like Wire, Reg, and Bundle to construct modules, while also managing complex logic such as instruction decoding, CSR management for trap mechanisms, and MMIO memory coordination. In particular, implementing the pipelined CPU deepened my understanding of sequential circuit design through pipeline registers and hazard handling techniques like forwarding, stalling, and flushing. Ultimately, this assignment not only gave me skills in parameterization, in hardware verification with ChiselTest, and in analyzing the architecture through waveforms, but also fostered a systems-thinking approach that lets me follow the execution flow from a software instruction to its physical effect on the hardware.