Assignment 3: My RISC-V CPU

# Assignment 3: My RISC-V CPU

#### contributed by <`bbchen`>

[Full code](https://github.com/CHENSHIBO0000/ca2025-mycpu.git)

# [Lab3](https://hackmd.io/@sysprog/B1Qxu2UkZx#RISC-V-CPU-with-MMIO-Peripherals-and-Trap-Handling)

## Single-cycle CPU

### Overview of Single-Cycle CPU Implementation

The single-cycle RISC-V CPU built in this lab supports a basic subset of the RV32I instruction set, including:
1. Arithmetic and Logic Operations.
2. Memory Access Instructions.
3. Branch and Jump Instructions.

---

To execute each instruction, the CPU goes through five main phases:
1. `Instruction Fetch` – Retrieve the instruction from memory.
2. `Instruction Decode` – Interpret the instruction and read the required register values.
3. `Execute` – Perform the necessary computation using the ALU.
4. `Memory Access` – For load/store instructions, read data from or write data to memory.
5. `Write-back` – Store the computed or loaded result back into the register file (except for store instructions).

---

### 1. Instruction Fetch

This is the first stage of the processor, responsible for:
- Maintaining the program counter (PC) register
- Providing the current PC to instruction memory
- Handling control flow changes from the Execute stage

The PC update logic is:
- Sequential: PC = PC + 4 (when no jump/branch)
- Control flow: PC = jump_address_id (when jump_flag_id is asserted by the Execute stage)

---

:::spoiler IF
```scala
package riscv.core

import chisel3._
import riscv.Parameters

// Program counter reset value
object ProgramCounter {
  val EntryAddress = Parameters.EntryAddress
}

class InstructionFetch extends Module {
  val io = IO(new Bundle {
    val jump_flag_id          = Input(Bool())
    val jump_address_id       = Input(UInt(Parameters.AddrWidth))
    val instruction_read_data = Input(UInt(Parameters.DataWidth))
    val instruction_valid     = Input(Bool())

    val instruction_address = Output(UInt(Parameters.AddrWidth))
    val instruction         = Output(UInt(Parameters.InstructionWidth))
  })

  // Program counter register (PC)
  val pc =
    RegInit(ProgramCounter.EntryAddress)

  when(io.instruction_valid) {
    io.instruction := io.instruction_read_data
    pc             := Mux(io.jump_flag_id, io.jump_address_id, pc + 4.U)
  }.otherwise {
    pc             := pc
    io.instruction := 0x00000013.U // NOP: prevents illegal instruction execution
  }
  io.instruction_address := pc
}
```
:::

The PC update logic is implemented with a multiplexer:
```scala
pc := Mux(io.jump_flag_id, io.jump_address_id, pc + 4.U)
```

---

### 2. Instruction Decode

This stage is responsible for:
- Extracting instruction fields (`opcode`, `rd`, `rs1`, `rs2`, `funct3`, `funct7`)
- Generating control signals for the `Execute`, `Memory`, and `WriteBack` stages
- Decoding and sign-extending `immediate` values based on the instruction format
- Determining ALU operand sources (register vs PC, register vs immediate)
- Identifying register file read/write operations
- Configuring memory access (read/write enable signals)

:::spoiler ID
```scala
package riscv.core

import chisel3._
import chisel3.util._
import chisel3.ChiselEnum
import riscv.Parameters

// RV32I opcode groupings
object InstructionTypes {
  val Load    = "b0000011".U(7.W)
  val OpImm   = "b0010011".U(7.W)
  val Store   = "b0100011".U(7.W)
  val Op      = "b0110011".U(7.W)
  val Lui     = "b0110111".U(7.W)
  val Auipc   = "b0010111".U(7.W)
  val Jal     = "b1101111".U(7.W)
  val Jalr    = "b1100111".U(7.W)
  val Branch  = "b1100011".U(7.W)
  val MiscMem = "b0001111".U(7.W)
  val System  = "b1110011".U(7.W)
}

// Convenience aliases for specific opcodes
object Instructions {
  val jal   = InstructionTypes.Jal
  val jalr  = InstructionTypes.Jalr
  val lui   = InstructionTypes.Lui
  val auipc = InstructionTypes.Auipc
}

// Funct3 encodings for load instructions
object InstructionsTypeL {
  val lb  = "b000".U(3.W)
  val lh  = "b001".U(3.W)
  val lw  = "b010".U(3.W)
  val lbu = "b100".U(3.W)
  val lhu = "b101".U(3.W)
}

// Funct3 encodings for store instructions
object InstructionsTypeS {
  val sb = "b000".U(3.W)
  val sh = "b001".U(3.W)
  val sw = "b010".U(3.W)
}

// Funct3 encodings for OP-IMM instructions
object InstructionsTypeI {
  val addi  = "b000".U(3.W)
  val slli  = "b001".U(3.W)
  val slti  = "b010".U(3.W)
  val sltiu = "b011".U(3.W)
  val xori  = "b100".U(3.W)
  val sri   = "b101".U(3.W)
  val ori   = "b110".U(3.W)
  val andi  = "b111".U(3.W)
}

// Funct3 encodings for OP instructions
object InstructionsTypeR {
  val add_sub = "b000".U(3.W)
  val sll     = "b001".U(3.W)
  val slt     = "b010".U(3.W)
  val sltu    = "b011".U(3.W)
  val xor     = "b100".U(3.W)
  val sr      = "b101".U(3.W)
  val or      = "b110".U(3.W)
  val and     = "b111".U(3.W)
}

// Funct3 encodings for branch instructions
object InstructionsTypeB {
  val beq  = "b000".U(3.W)
  val bne  = "b001".U(3.W)
  val blt  = "b100".U(3.W)
  val bge  = "b101".U(3.W)
  val bltu = "b110".U(3.W)
  val bgeu = "b111".U(3.W)
}

object ALUOp1Source {
  val Register           = 0.U(1.W)
  val InstructionAddress = 1.U(1.W)
}

object ALUOp2Source {
  val Register  = 0.U(1.W)
  val Immediate = 1.U(1.W)
}

object RegWriteSource {
  val ALUResult              = 0.U(2.W)
  val Memory                 = 1.U(2.W)
  val NextInstructionAddress = 2.U(2.W)
}

object ImmediateKind extends ChiselEnum {
  val None, I, S, B, U, J = Value
}

class InstructionDecode extends Module {
  val io = IO(new Bundle {
    val instruction = Input(UInt(Parameters.InstructionWidth))

    val regs_reg1_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val regs_reg2_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val ex_immediate           = Output(UInt(Parameters.DataBits.W))
    val ex_aluop1_source       = Output(UInt(1.W))
    val ex_aluop2_source       = Output(UInt(1.W))
    val memory_read_enable     = Output(Bool())
    val memory_write_enable    = Output(Bool())
    val wb_reg_write_source    = Output(UInt(2.W))
    val reg_write_enable       = Output(Bool())
    val reg_write_address      = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
  })

  val instruction = io.instruction
  val opcode      = instruction(6, 0)
  val rs1         = instruction(19, 15)
  val rs2         = instruction(24, 20)
  val rd          = instruction(11, 7)

  val isLoad  = opcode === InstructionTypes.Load
  val isStore = opcode === InstructionTypes.Store
  val isOpImm = opcode === InstructionTypes.OpImm
  val isOp    = opcode ===
    InstructionTypes.Op
  val isLui    = opcode === InstructionTypes.Lui
  val isAuipc  = opcode === InstructionTypes.Auipc
  val isJal    = opcode === InstructionTypes.Jal
  val isJalr   = opcode === InstructionTypes.Jalr
  val isBranch = opcode === InstructionTypes.Branch

  val usesRs1  = isLoad || isStore || isOpImm || isOp || isBranch || isJalr
  val usesRs2  = isStore || isOp || isBranch
  val regWrite = isLoad || isOpImm || isOp || isLui || isAuipc || isJal || isJalr

  val wbSource = WireDefault(RegWriteSource.ALUResult)
  when(isLoad) {
    wbSource := RegWriteSource.Memory
  }.elsewhen(isJal || isJalr) {
    wbSource := RegWriteSource.NextInstructionAddress
  }

  val aluOp1Sel = WireDefault(ALUOp1Source.Register)
  when(isBranch || isAuipc || isJal) {
    aluOp1Sel := ALUOp1Source.InstructionAddress
  }

  val needsImmediate =
    isLoad || isStore || isOpImm || isBranch || isLui || isAuipc || isJal || isJalr
  val aluOp2Sel = WireDefault(ALUOp2Source.Register)
  when(needsImmediate) {
    aluOp2Sel := ALUOp2Source.Immediate
  }

  val immKind = WireDefault(ImmediateKind.None)
  when(isLoad || isOpImm || isJalr) {
    immKind := ImmediateKind.I
  }
  when(isStore) {
    immKind := ImmediateKind.S
  }
  when(isBranch) {
    immKind := ImmediateKind.B
  }
  when(isLui || isAuipc) {
    immKind := ImmediateKind.U
  }
  when(isJal) {
    immKind := ImmediateKind.J
  }

  io.regs_reg1_read_address := Mux(usesRs1, rs1, 0.U)
  io.regs_reg2_read_address := Mux(usesRs2, rs2, 0.U)
  io.ex_aluop1_source       := aluOp1Sel
  io.ex_aluop2_source       := aluOp2Sel
  io.memory_read_enable     := isLoad
  io.memory_write_enable    := isStore
  io.wb_reg_write_source    := wbSource
  io.reg_write_enable       := regWrite
  io.reg_write_address      := rd

  val immI = Cat(
    Fill(Parameters.DataBits - 12, instruction(31)), // Sign extension: replicate bit 31 twenty times
    instruction(31, 20)                              // Immediate: bits [31:20]
  )

  val immS = Cat(
    Fill(Parameters.DataBits - 12, instruction(31)), // Sign extension
    instruction(31, 25),                             // High 7 bits
    instruction(11, 7)                               // Low 5 bits
  )

  val immB = Cat(
    Fill(Parameters.DataBits - 13, instruction(31)), // Sign extension
    instruction(31),                                 // bit [12]
    instruction(7),                                  // bit [11]
    instruction(30, 25),                             // bits [10:5]
    instruction(11, 8),                              // bits [4:1]
    0.U(1.W)                                         // bit [0] = 0 (alignment)
  )

  val immU = Cat(instruction(31, 12), 0.U(12.W))

  val immJ = Cat(
    Fill(Parameters.DataBits - 21, instruction(31)), // Sign extension
    instruction(31),                                 // bit [20]
    instruction(19, 12),                             // bits [19:12]
    instruction(20),                                 // bit [11]
    instruction(30, 21),                             // bits [10:1]
    0.U(1.W)                                         // bit [0] = 0 (alignment)
  )

  val immediate = MuxLookup(immKind.asUInt, 0.U(Parameters.DataBits.W))(
    Seq(
      ImmediateKind.I.asUInt -> immI,
      ImmediateKind.S.asUInt -> immS,
      ImmediateKind.B.asUInt -> immB,
      ImmediateKind.U.asUInt -> immU,
      ImmediateKind.J.asUInt -> immJ
    )
  )
  io.ex_immediate := immediate
}
```
:::

---

### 3. Execution

Execute stage's responsibilities:
- Select ALU operands from register data, PC, or immediate values
- Perform arithmetic and logical operations via ALU submodule
- Evaluate branch conditions for all six RV32I branch types
- Calculate branch and jump target addresses
- Generate control signals for instruction fetch (jump flag and address)

:::spoiler EX
```scala
package riscv.core

import chisel3._
import chisel3.util.Cat
import chisel3.util.MuxLookup
import riscv.Parameters

class Execute extends Module {
  val io = IO(new Bundle {
    val instruction         = Input(UInt(Parameters.InstructionWidth))
    val instruction_address = Input(UInt(Parameters.AddrWidth))
    val reg1_data           = Input(UInt(Parameters.DataWidth))
    val reg2_data           = Input(UInt(Parameters.DataWidth))
    val immediate           = Input(UInt(Parameters.DataWidth))
    val aluop1_source       = Input(UInt(1.W))
    val aluop2_source       = Input(UInt(1.W))

    val mem_alu_result  = Output(UInt(Parameters.DataWidth))
    val if_jump_flag    = Output(Bool())
    val if_jump_address = Output(UInt(Parameters.DataWidth))
  })

  // Decode instruction fields
  val opcode = io.instruction(6, 0)
  val funct3 = io.instruction(14, 12)
  val funct7 = io.instruction(31, 25)

  // Instantiate ALU and control logic
  val alu = Module(new ALU)
  val alu_ctrl = Module(new ALUControl)

  alu_ctrl.io.opcode := opcode
  alu_ctrl.io.funct3 := funct3
  alu_ctrl.io.funct7 := funct7

  // Select ALU operands based on instruction type
  alu.io.func := alu_ctrl.io.alu_funct
  val aluOp1 = Mux(io.aluop1_source === ALUOp1Source.InstructionAddress, io.instruction_address, io.reg1_data)
  val aluOp2 = Mux(io.aluop2_source === ALUOp2Source.Immediate, io.immediate, io.reg2_data)
  alu.io.op1 := aluOp1
  alu.io.op2 := aluOp2

  io.mem_alu_result := alu.io.result

  // The six RV32I branch conditions, each comparing the two register operands
  val branchCondition = MuxLookup(funct3, false.B)(
    Seq(
      InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data),
      InstructionsTypeB.bne -> (io.reg1_data =/= io.reg2_data),
      // Signed comparison (needs conversion to a signed type)
      InstructionsTypeB.blt -> (io.reg1_data.asSInt < io.reg2_data.asSInt),
      InstructionsTypeB.bge -> (io.reg1_data.asSInt >= io.reg2_data.asSInt),
      // Unsigned comparison
      InstructionsTypeB.bltu -> (io.reg1_data < io.reg2_data),
      InstructionsTypeB.bgeu -> (io.reg1_data >= io.reg2_data)
    )
  )

  val isBranch = opcode === InstructionTypes.Branch
  val isJal    = opcode === Instructions.jal
  val isJalr   = opcode === Instructions.jalr

  val branchTarget = io.instruction_address + io.immediate
  val jalTarget    = branchTarget // JAL and Branch use the same calculation method

  // JALR address calculation:
  // 1. Add register value and immediate
  // 2. Clear LSB (2-byte alignment)
  val jalrSum    = io.reg1_data + io.immediate
  val jalrTarget = Cat(jalrSum(31, 1), 0.U(1.W))

  val branchTaken = isBranch && branchCondition
  io.if_jump_flag := branchTaken || isJal || isJalr
  io.if_jump_address := Mux(
    isJalr,
    jalrTarget,
    Mux(isJal, jalTarget, branchTarget)
  )
}
```
:::

---

### 4. Memory Access

This stage implements the RV32I memory access operations:
- Load operations (`LB`, `LH`, `LW`, `LBU`, `LHU`): extract and sign/zero-extend data
- Store operations (`SB`, `SH`, `SW`): write with byte-level strobes

:::spoiler Memory access
```scala
package riscv.core

import chisel3._
import chisel3.util._
import peripheral.RAMBundle
import riscv.Parameters

class MemoryAccess extends Module {
  val io = IO(new Bundle() {
    val alu_result          = Input(UInt(Parameters.DataWidth))
    val reg2_data           = Input(UInt(Parameters.DataWidth))
    val memory_read_enable  = Input(Bool())
    val memory_write_enable = Input(Bool())
    val funct3              = Input(UInt(3.W))

    val wb_memory_read_data = Output(UInt(Parameters.DataWidth))

    val memory_bundle = Flipped(new RAMBundle)
  })

  val mem_address_index = io.alu_result(log2Up(Parameters.WordSize) - 1, 0).asUInt

  io.memory_bundle.write_enable := false.B
  io.memory_bundle.write_data   := 0.U
  io.memory_bundle.address      := io.alu_result
  io.memory_bundle.write_strobe := VecInit(Seq.fill(Parameters.WordSize)(false.B))
  io.wb_memory_read_data        := 0.U

  when(io.memory_read_enable) {
    // Load logic: extract bytes/halfwords based on address alignment
    val data  = io.memory_bundle.read_data
    val bytes = Wire(Vec(Parameters.WordSize, UInt(Parameters.ByteWidth)))
    for (i <- 0 until Parameters.WordSize) {
      bytes(i) := data((i + 1) * Parameters.ByteBits - 1, i * Parameters.ByteBits)
    }
    // Select byte based on lower 2 address bits (mem_address_index)
    val byte = bytes(mem_address_index)
    // Select halfword based on bit 1 of address (word-aligned halfwords)
    val half = Mux(mem_address_index(1), Cat(bytes(3), bytes(2)), Cat(bytes(1), bytes(0)))

    io.wb_memory_read_data := MuxLookup(io.funct3, 0.U)(
      Seq(
        // LB sign-extends the addressed byte; LBU zero-extends the same byte
        InstructionsTypeL.lb  -> Cat(Fill(24, byte(7)), byte),
        InstructionsTypeL.lbu -> Cat(0.U(24.W), byte),
        InstructionsTypeL.lh  -> Cat(Fill(16, half(15)), half(15, 0)),
        InstructionsTypeL.lhu -> Cat(0.U(16.W), half),
        InstructionsTypeL.lw  -> data
      )
    )
  }.elsewhen(io.memory_write_enable) {
    io.memory_bundle.write_enable := true.B
    io.memory_bundle.address      := io.alu_result

    val data         = io.reg2_data
    val strobeInit   = VecInit(Seq.fill(Parameters.WordSize)(false.B))
    val defaultData  = 0.U(Parameters.DataWidth)
    val writeStrobes = WireInit(strobeInit)
    val writeData    = WireDefault(defaultData)

    switch(io.funct3) {
      is(InstructionsTypeS.sb) {
        writeStrobes(mem_address_index) := true.B
        writeData                       := data(7, 0) << (mem_address_index << 3)
      }
      is(InstructionsTypeS.sh) {
        when(mem_address_index(1) === 0.U) {
          writeStrobes(0) := true.B
          writeStrobes(1) := true.B
          writeData       := data(15, 0)
        }.otherwise {
          writeStrobes(2) := true.B
          writeStrobes(3) := true.B
          writeData       := data(15, 0) << 16
        }
      }
      is(InstructionsTypeS.sw) {
        writeStrobes := VecInit(Seq.fill(Parameters.WordSize)(true.B))
        writeData    := data
      }
    }

    io.memory_bundle.write_data   := writeData
    io.memory_bundle.write_strobe := writeStrobes
  }
}
```
:::

---

### 5. Write-back

The Write-back stage selects the final result to write to the register file. It is the final stage of the processor, responsible for multiplexing the appropriate data source to be written back to the register file:
- ALU result (default): Arithmetic/logical operation output
- Memory data: Load instruction result
- Next instruction address (PC+4): Return address for JAL/JALR instructions

:::spoiler WB
```scala
package riscv.core

import chisel3._
import chisel3.util._
import riscv.Parameters

class WriteBack extends Module {
  val io = IO(new Bundle() {
    val instruction_address = Input(UInt(Parameters.AddrWidth))
    val alu_result          = Input(UInt(Parameters.DataWidth))
    val memory_read_data    = Input(UInt(Parameters.DataWidth))
    val regs_write_source   = Input(UInt(2.W))

    val regs_write_data = Output(UInt(Parameters.DataWidth))
  })

  io.regs_write_data := MuxLookup(io.regs_write_source, io.alu_result)(
    Seq(
      RegWriteSource.Memory                 -> io.memory_read_data,
      RegWriteSource.NextInstructionAddress -> (io.instruction_address + 4.U)
    )
  )
}
```
:::

---

### Functional Test

Chiseltest is used to run functional tests
for the CPU design. Chiseltest provides a lightweight and expressive framework for verifying Chisel-based RTL.

As an example, a simple C program computes `Fibonacci(10)` and writes the result to memory address 4. During testing, the Chiseltest framework checks this memory location to confirm that the CPU correctly executed the program and stored the expected value.

```
make test
```

![image](https://hackmd.io/_uploads/rJcU3QpZbx.png)

All functional and integration tests for the single-cycle CPU passed, confirming that each submodule operates as intended. The CPU correctly executed a variety of workloads, from basic RV32I operations to recursive Fibonacci computation and the Quicksort algorithm. These results indicate that the overall design is stable, compliant with RISC-V requirements, and capable of running nontrivial programs reliably.

---

### RISCOF Compliance Testing

RISCOF (RISC-V Compatibility Framework) is the official framework used to verify whether a RISC-V processor implementation conforms to the RISC-V ISA specification. It works by running a collection of official RISC-V architectural tests (riscv-arch-test) on both the user-designed CPU (the DUT) and a trusted golden reference model. Each test program produces a memory region called a signature, which records the results of the test’s execution. RISCOF compares the DUT’s signature with the golden signature, and a test is considered passed only if both match exactly.

![image](https://hackmd.io/_uploads/HyUZ07aZWe.png)

![image](https://hackmd.io/_uploads/r1X4AmTWZx.png)

The RISCOF compliance report shows that my CPU implementation passed all `41` RV32I architectural tests with 0 failures. This indicates that the functional behavior of the processor matches the RISC-V ISA specification for the RV32I subset as far as these tests exercise it. Based on the complete `RISCOF` pass rate, the CPU can be considered a functionally correct and ISA-compliant RV32I processor.
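
The signature comparison described above can be pictured with a small plain-Scala sketch. This is not RISCOF's own code: the object name `SignatureCheck`, the function names, and the assumption that a signature is dumped as one hex word per line are all illustrative.

```scala
// Hypothetical helper: compares two signature dumps line by line.
// Assumption: each signature is text with one hex word per line.
object SignatureCheck {
  // Trim whitespace and drop blank lines so only the words are compared.
  def words(text: String): Vector[String] =
    text.split("\n").iterator.map(_.trim).filter(_.nonEmpty).toVector

  // Index of the first differing word, or None if the signatures match exactly.
  def firstMismatch(dut: String, golden: String): Option[Int] = {
    val d   = words(dut)
    val g   = words(golden)
    val idx = d.zipAll(g, "", "").indexWhere { case (a, b) => a != b }
    if (idx < 0) None else Some(idx)
  }

  // A test passes only when there is no mismatch at all.
  def passes(dut: String, golden: String): Boolean =
    firstMismatch(dut, golden).isEmpty
}
```

A signature that is shorter or longer than the reference is also reported as a mismatch, which mirrors the "match exactly" rule.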

---

## MMIO Peripherals and Trap Handling

### Control and Status Register (CSR)

The CSR module manages machine-level status, interrupt control, and trap handling, and provides debug visibility for verification. It supports atomic CSR operations, ensures consistent trap entry semantics, and interfaces directly with the CLINT for interrupt-driven updates.

---

The CSR module implements the following features:
- A separate 4096-entry CSR register file mapped to the RISC-V CSR address space
- Read-only enforcement for constant or information-only CSRs
- Atomic Read-Modify-Write (RMW) semantics for CSRRS/CSRRC instructions in a single cycle
- CLINT interface for interrupt-driven CSR updates
- Debug read port for simulation and verification
- Full support for machine-mode CSRs required by trap and interrupt handling

:::spoiler CSR
```scala
package riscv.core

import chisel3._
import chisel3.util._
import riscv.Parameters

// RISC-V Machine-mode CSR addresses (Privileged Spec Vol.II)
object CSRRegister {
  val MSTATUS  = 0x300.U(Parameters.CSRRegisterAddrWidth)
  val MIE      = 0x304.U(Parameters.CSRRegisterAddrWidth)
  val MTVEC    = 0x305.U(Parameters.CSRRegisterAddrWidth)
  val MSCRATCH = 0x340.U(Parameters.CSRRegisterAddrWidth)
  val MEPC     = 0x341.U(Parameters.CSRRegisterAddrWidth)
  val MCAUSE   = 0x342.U(Parameters.CSRRegisterAddrWidth)
  val CycleL   = 0xc00.U(Parameters.CSRRegisterAddrWidth)
  val CycleH   = 0xc80.U(Parameters.CSRRegisterAddrWidth)
}

class CSR extends Module {
  val io = IO(new Bundle {
    val reg_read_address_id    = Input(UInt(Parameters.CSRRegisterAddrWidth))
    val reg_write_enable_id    = Input(Bool())
    val reg_write_address_id   = Input(UInt(Parameters.CSRRegisterAddrWidth))
    val reg_write_data_ex      = Input(UInt(Parameters.DataWidth))
    val debug_reg_read_address = Input(UInt(Parameters.CSRRegisterAddrWidth))

    val debug_reg_read_data = Output(UInt(Parameters.DataWidth))
    val reg_read_data       = Output(UInt(Parameters.DataWidth))

    val clint_access_bundle = Flipped(new CSRDirectAccessBundle)
  })

  val mstatus =
    RegInit(UInt(Parameters.DataWidth), 0.U)
  val mie      = RegInit(UInt(Parameters.DataWidth), 0.U)
  val mtvec    = RegInit(UInt(Parameters.DataWidth), 0.U)
  val mscratch = RegInit(UInt(Parameters.DataWidth), 0.U)
  val mepc     = RegInit(UInt(Parameters.DataWidth), 0.U)
  val mcause   = RegInit(UInt(Parameters.DataWidth), 0.U)
  val cycles   = RegInit(UInt(64.W), 0.U)

  val regLUT =
    IndexedSeq(
      CSRRegister.MSTATUS  -> mstatus,
      CSRRegister.MIE      -> mie,
      CSRRegister.MTVEC    -> mtvec,
      CSRRegister.MSCRATCH -> mscratch,
      CSRRegister.MEPC     -> mepc,
      CSRRegister.MCAUSE   -> mcause,
      CSRRegister.CycleL   -> cycles(31, 0),
      CSRRegister.CycleH   -> cycles(63, 32),
    )
  cycles := cycles + 1.U

  io.reg_read_data       := MuxLookup(io.reg_read_address_id, 0.U)(regLUT)
  io.debug_reg_read_data := MuxLookup(io.debug_reg_read_address, 0.U)(regLUT)

  io.clint_access_bundle.mstatus := mstatus
  io.clint_access_bundle.mtvec   := mtvec
  io.clint_access_bundle.mcause  := mcause
  io.clint_access_bundle.mepc    := mepc
  io.clint_access_bundle.mie     := mie

  // CLINT direct writes (trap entry/exit) take priority over CSR instruction
  // writes to mstatus, mepc, and mcause
  when(io.clint_access_bundle.direct_write_enable) {
    mstatus := io.clint_access_bundle.mstatus_write_data
    mepc    := io.clint_access_bundle.mepc_write_data
    mcause  := io.clint_access_bundle.mcause_write_data
  }.elsewhen(io.reg_write_enable_id) {
    when(io.reg_write_address_id === CSRRegister.MSTATUS) {
      mstatus := io.reg_write_data_ex
    }.elsewhen(io.reg_write_address_id === CSRRegister.MEPC) {
      mepc := io.reg_write_data_ex
    }.elsewhen(io.reg_write_address_id === CSRRegister.MCAUSE) {
      mcause := io.reg_write_data_ex
    }
  }

  // mie, mtvec, and mscratch are only written by CSR instructions
  when(io.reg_write_enable_id) {
    when(io.reg_write_address_id === CSRRegister.MIE) {
      mie := io.reg_write_data_ex
    }.elsewhen(io.reg_write_address_id === CSRRegister.MTVEC) {
      mtvec := io.reg_write_data_ex
    }.elsewhen(io.reg_write_address_id === CSRRegister.MSCRATCH) {
      mscratch := io.reg_write_data_ex
    }
  }
}
```
:::

---

### Core-Local Interrupt Controller (CLINT)

CLINT manages interrupt entry/exit and CSR state transitions.
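
One concrete piece of this CSR state handling is the `mstatus` interrupt-enable stack: on trap entry `MPIE` takes the old `MIE` and `MIE` is cleared, and on `MRET` `MIE` takes `MPIE` and `MPIE` is set to 1. The plain-Scala model below is only a sketch of that bit manipulation. The object and function names are hypothetical, and it deliberately ignores the `MPP` field that the hardware also writes.

```scala
// Hypothetical software model of the mstatus MIE/MPIE update (not the RTL).
object MstatusModel {
  private val MieBit  = 3 // mstatus.MIE: global machine interrupt enable
  private val MpieBit = 7 // mstatus.MPIE: previous interrupt enable

  private def bit(x: Long, i: Int): Long = (x >> i) & 1L

  private def withBit(x: Long, i: Int, v: Long): Long =
    (x & ~(1L << i)) | (v << i)

  // Trap entry: MPIE <- MIE, then MIE <- 0
  def trapEntry(mstatus: Long): Long =
    withBit(withBit(mstatus, MpieBit, bit(mstatus, MieBit)), MieBit, 0L)

  // MRET: MIE <- MPIE, then MPIE <- 1
  def mret(mstatus: Long): Long =
    withBit(withBit(mstatus, MieBit, bit(mstatus, MpieBit)), MpieBit, 1L)
}
```

For example, starting from `0x8` (interrupts enabled), `trapEntry` yields `0x80` and a following `mret` yields `0x88`, so interrupts are enabled again after the handler returns.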

CLINT's responsibilities:
- Handle external interrupts from peripherals (timer, UART)
- Manage trap entry for exceptions (ECALL, EBREAK)
- Execute MRET (machine return) for trap exit
- Update CSR state atomically during trap handling

---

State Transitions:
- Interrupt entry: Save PC to `mepc`, set `mcause`, disable interrupts (`MIE=0`), jump to `mtvec`.
- Exception entry: Same as interrupt, but `mcause[31]` = 0.
- MRET: Restore `PC` from `mepc`, re-enable interrupts (`MIE=MPIE`), set `MPIE=1`.

CSR Updates (`mstatus`):
- Trap entry: `MPIE←MIE`, `MIE←0` (save and disable interrupts)
- `MRET`: `MIE←MPIE`, `MPIE←1` (restore interrupts and set MPIE)

:::spoiler CLINT
```scala
package riscv.core

import chisel3._
import chisel3.util.Cat
import chisel3.util.MuxLookup
import riscv.Parameters

// Interrupt cause codes for mcause register
object InterruptCode {
  val None   = 0x0.U(8.W)
  val Timer0 = 0x1.U(8.W)
  val Ret    = 0xff.U(8.W)
}

object InterruptEntry {
  val Timer0 = 0x4.U(8.W)
}

class CSRDirectAccessBundle extends Bundle {
  val mstatus = Input(UInt(Parameters.DataWidth))
  val mepc    = Input(UInt(Parameters.DataWidth))
  val mcause  = Input(UInt(Parameters.DataWidth))
  val mtvec   = Input(UInt(Parameters.DataWidth))
  val mie     = Input(UInt(Parameters.DataWidth))

  val mstatus_write_data  = Output(UInt(Parameters.DataWidth))
  val mepc_write_data     = Output(UInt(Parameters.DataWidth))
  val mcause_write_data   = Output(UInt(Parameters.DataWidth))
  val direct_write_enable = Output(Bool())
}

class CLINT extends Module {
  val io = IO(new Bundle {
    // Interrupt signals from peripherals
    val interrupt_flag = Input(UInt(Parameters.InterruptFlagWidth))

    val instruction         = Input(UInt(Parameters.InstructionWidth))
    val instruction_address = Input(UInt(Parameters.AddrWidth))

    val jump_flag    = Input(Bool())
    val jump_address = Input(UInt(Parameters.AddrWidth))

    val interrupt_handler_address = Output(UInt(Parameters.AddrWidth))
    val interrupt_assert          = Output(Bool())

    val csr_bundle = new CSRDirectAccessBundle
  })

  val interrupt_enable_global   = io.csr_bundle.mstatus(3) // MIE bit (global enable)
  val interrupt_enable_timer    = io.csr_bundle.mie(7)     // MTIE bit (timer enable)
  val interrupt_enable_external = io.csr_bundle.mie(11)    // MEIE bit (external enable)

  val instruction_address = Mux(
    io.jump_flag,
    io.jump_address,
    io.instruction_address + 4.U,
  )

  val mpie = io.csr_bundle.mstatus(7)
  val mie  = io.csr_bundle.mstatus(3)

  // Check individual interrupt source enable based on interrupt type
  val interrupt_source_enabled = Mux(
    io.interrupt_flag === InterruptCode.Timer0,
    interrupt_enable_timer,
    interrupt_enable_external
  )

  when(io.interrupt_flag =/= InterruptCode.None && interrupt_enable_global && interrupt_source_enabled) {
    // interrupt
    io.interrupt_assert          := true.B
    io.interrupt_handler_address := io.csr_bundle.mtvec
    io.csr_bundle.mstatus_write_data := Cat(
      io.csr_bundle.mstatus(31, 13),
      3.U(2.W), // mpp ← 0b11 (Machine mode)
      io.csr_bundle.mstatus(10, 8),
      mie, // mpie ← mie (save current interrupt enable)
      io.csr_bundle.mstatus(6, 4),
      0.U(1.W), // mie ← 0 (disable interrupts)
      io.csr_bundle.mstatus(2, 0)
    )
    io.csr_bundle.mepc_write_data := instruction_address
    io.csr_bundle.mcause_write_data := Cat(
      1.U,
      MuxLookup(
        io.interrupt_flag,
        11.U(31.W) // machine external interrupt
      )(
        IndexedSeq(
          InterruptCode.Timer0 -> 7.U(31.W),
        )
      )
    )
    io.csr_bundle.direct_write_enable := true.B
  }.elsewhen(io.instruction === InstructionsEnv.ebreak || io.instruction === InstructionsEnv.ecall) {
    // exception
    io.interrupt_assert          := true.B
    io.interrupt_handler_address := io.csr_bundle.mtvec
    io.csr_bundle.mstatus_write_data := Cat(
      io.csr_bundle.mstatus(31, 13),
      3.U(2.W), // mpp ← 0b11 (Machine mode)
      io.csr_bundle.mstatus(10, 8),
      mie, // mpie
      io.csr_bundle.mstatus(6, 4),
      0.U(1.W), // mie
      io.csr_bundle.mstatus(2, 0)
    )
    io.csr_bundle.mepc_write_data := instruction_address
    io.csr_bundle.mcause_write_data := Cat(
      0.U,
      MuxLookup(io.instruction, 0.U)(
        IndexedSeq(
          InstructionsEnv.ebreak -> 3.U(31.W),
          InstructionsEnv.ecall ->
            11.U(31.W),
        )
      )
    )
    io.csr_bundle.direct_write_enable := true.B
  }.elsewhen(io.instruction === InstructionsRet.mret) {
    // ret
    io.interrupt_assert          := true.B
    io.interrupt_handler_address := io.csr_bundle.mepc
    io.csr_bundle.mstatus_write_data := Cat(
      io.csr_bundle.mstatus(31, 13),
      3.U(2.W), // mpp ← 0b11 (Machine mode)
      io.csr_bundle.mstatus(10, 8),
      1.U(1.W), // mpie ← 1 (reset MPIE)
      io.csr_bundle.mstatus(6, 4),
      mpie, // mie ← mpie (restore interrupt enable)
      io.csr_bundle.mstatus(2, 0)
    )
    io.csr_bundle.mepc_write_data     := io.csr_bundle.mepc
    io.csr_bundle.mcause_write_data   := io.csr_bundle.mcause
    io.csr_bundle.direct_write_enable := true.B
  }.otherwise {
    io.interrupt_assert               := false.B
    io.interrupt_handler_address      := io.csr_bundle.mtvec
    io.csr_bundle.mstatus_write_data  := io.csr_bundle.mstatus
    io.csr_bundle.mepc_write_data     := io.csr_bundle.mepc
    io.csr_bundle.mcause_write_data   := io.csr_bundle.mcause
    io.csr_bundle.direct_write_enable := false.B
  }
}
```
:::

---

### Memory-Mapped Peripherals

The processor uses the high-order address bits to select between devices:
- `deviceSelect = 0`: Main memory
- `deviceSelect = 1`: Timer peripheral
- `deviceSelect = 2`: UART peripheral
- `deviceSelect = 3`: VGA peripheral

1. Timer peripheral: A memory-mapped timer peripheral that provides periodic interrupt generation.
2. UART peripheral: Implements 8-N-1 serial transmission (8 data bits, no parity, 1 stop bit) with a ready/valid handshaking protocol.
3. VGA peripheral: A memory-mapped VGA display peripheral for visual output with 640×480@72Hz timing and indexed color support.

---

### Functional and RISCOF Compliance Test

All functional tests for the `CPU`, `CSR module`, `CLINT interrupt handling`, `UART subsystem`, `memory operations`, and `execution pipeline` passed.
A total of 9 test suites were executed, including byte-access tests, CSR write-back validation, machine-mode interrupt flow, UART TX/RX behavior, timer register tests, recursive Fibonacci execution, and a full quicksort program.

![image](https://hackmd.io/_uploads/ry3uLdRbZx.png)

These results confirm that the CPU core correctly implements memory access, ALU execution, CSR semantics, trap handling (interrupt + exception), timer operations, and I/O communication. The successful completion of higher-level programs (Fibonacci, Quicksort) further demonstrates the correctness and stability of the pipeline and control logic.

---

The RISCOF compliance framework was executed using the rv32emu reference model and validated against the `RV32I + Zicsr` ISA specification. The device under test, `mycpu`, passed all compliance checks, confirming that the CPU correctly implements the RISC-V base integer instructions, CSR operations, and required machine-mode architectural behavior.

![image](https://hackmd.io/_uploads/ByBBwORb-x.png)

---

### Nyancat

```
make demo
```

![image](https://hackmd.io/_uploads/SykWCb4fWx.png)

---

## Pipelined RISC-V CPU

With the single-cycle RV32I processor complete, the next step of the project is to extend it into several `pipelined versions`. Building on the same instruction memory, register file, and peripheral framework from previous labs, these pipeline designs progressively introduce more advanced techniques to remove performance bottlenecks while preserving correct architectural behavior.

---

| Implementation | Stages | Highlight |
| --- | --- | --- |
| `ThreeStage` | IF → ID → EX/MEM/WB (folded) | Minimal pipeline that introduces control-flow redirection and CLINT interaction with a single execute stage. |
| `FiveStageStall` | IF → ID → EX → MEM → WB | Classic five-stage design that resolves `data hazards` with interlocks (`bubbles`) and performs `branch resolution in EX`. |
| `FiveStageForward` | IF → ID → EX → MEM → WB | Adds `bypass` paths from MEM/WB back to EX to reduce stalls caused by `RAW hazards`. |
| `FiveStageFinal` | IF → ID → EX → MEM → WB | Combines `forwarding`, refined `flush` logic, and the optimized CLINT/CSR handshake that matches the interrupt-capable

---

### Hazard

#### 1. Structural Hazards

The myCPU project uses a `Harvard-style memory system`.
- Memory has independent ports for instruction fetch and data access.
- The register file supports two reads and one write per cycle.

As a result, `structural hazards` are avoided by design.

#### 2. Data Hazards

Read-After-Write (RAW):
- The `three-stage` and `five-stage` stall cores insert `bubbles` when an instruction consumes a value that is still in flight.
- The `forwarding` and `final` cores add forwarding paths that feed results from `MEM` or `WB` back into `EX`.

#### 3. Control Hazards

`Branches` and `jumps` must redirect the instruction fetch stage.

#### 4. Hazard Unit

- `Control.scala`: Makes the stall and flush decisions.
- `Forwarding.scala`: Resolves data hazards with dual-stage forwarding.
- `ID2EX.scala` and `EX2MEM.scala`: Register the control signals so that the hazard unit can observe the pipeline state.
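
The interlock idea behind the stall-based cores can be sketched as a plain-Scala predicate. This is only an illustrative software model under stated assumptions, not the logic in `Control.scala`: the names `InFlight` and `needsBubble` are hypothetical, and the model assumes a bubble is needed whenever any later stage will still write a register that the instruction in `ID` reads.

```scala
// Hypothetical model of a RAW interlock for a stall-only pipeline.
object StallModel {
  // An instruction further down the pipeline that has not written back yet.
  final case class InFlight(regWrite: Boolean, rd: Int)

  // True when the instruction in ID reads a register that an in-flight
  // instruction will write. x0 never causes a hazard because it reads as zero.
  def needsBubble(rs1: Int, rs2: Int, later: Seq[InFlight]): Boolean =
    later.exists { s =>
      s.regWrite && s.rd != 0 && (s.rd == rs1 || s.rd == rs2)
    }
}
```

For example, with `ADD x1, x2, x3` still in `EX`, a following `BEQ x1, x4, label` in `ID` gives `needsBubble(1, 4, Seq(InFlight(regWrite = true, rd = 1)))`, which is `true`, while a store in `EX` (no register write) never triggers a bubble.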

---

### IF2ID

The IF/ID pipeline register's responsibilities:
- Hold the instruction between `IF` → `ID`
- Hold the instruction address (`PC`) for control-flow resolution
- Carry interrupt/trap-related flags into later stages
- Support pipeline stalls by freezing contents
- Support pipeline flushes by inserting safe NOP instructions

---

This module has two control signals:
- `stall`: Freeze register contents (hold the current instruction when ID/EX is busy)
- `flush`: Clear register contents to NOP (discard wrong-path instructions)

---

For the instruction register:
- `Normal`: Pass the instruction from IF
- `Stall`: Keep the previous instruction
- `Flush`: Output `NOP` (InstructionsNop.nop = 0x00000013)

:::spoiler IF2ID
```scala
import chisel3._
import riscv.Parameters

class IF2ID extends Module {
  val io = IO(new Bundle {
    val stall               = Input(Bool())
    val flush               = Input(Bool())
    val instruction         = Input(UInt(Parameters.InstructionWidth))
    val instruction_address = Input(UInt(Parameters.AddrWidth))
    val interrupt_flag      = Input(UInt(Parameters.InterruptFlagWidth))

    val output_instruction         = Output(UInt(Parameters.DataWidth))
    val output_instruction_address = Output(UInt(Parameters.AddrWidth))
    val output_interrupt_flag      = Output(UInt(Parameters.InterruptFlagWidth))
  })

  // Instruction register: resets and flushes to NOP (addi x0, x0, 0)
  val instruction = Module(new PipelineRegister(defaultValue = 0x00000013.U))
  instruction.io.in     := io.instruction
  instruction.io.stall  := io.stall
  instruction.io.flush  := io.flush
  io.output_instruction := instruction.io.out

  val instruction_address = Module(new PipelineRegister(defaultValue = ProgramCounter.EntryAddress))
  instruction_address.io.in     := io.instruction_address
  instruction_address.io.stall  := io.stall
  instruction_address.io.flush  := io.flush
  io.output_instruction_address := instruction_address.io.out

  val interrupt_flag = Module(new PipelineRegister(Parameters.InterruptFlagBits))
  interrupt_flag.io.in     := io.interrupt_flag
  interrupt_flag.io.stall  := io.stall
  interrupt_flag.io.flush  := io.flush
  io.output_interrupt_flag := interrupt_flag.io.out
}
```
:::

### Forwarding Unit

The forwarding unit defines the source of forwarded data:
- `NoForward`: Use register file value (no forwarding needed)
- `ForwardFromMEM`: Forward from EX/MEM pipeline register (1 cycle old)
- `ForwardFromWB`: Forward from MEM/WB pipeline register (2 cycles old)

```scala
object ForwardingType {
  val NoForward      = 0.U(2.W)
  val ForwardFromMEM = 1.U(2.W)
  val ForwardFromWB  = 2.U(2.W)
}
```

---

The enhancements over basic forwarding:
- `ID stage forwarding`: Enables early branch comparison in the decode stage
- `Dual-stage support`: Simultaneous forwarding to both the ID and EX stages
- `Reduced branch penalty`: Branch decisions are made 1 cycle earlier

---

The forwarding unit determines where the forwarded data will be delivered:
- `ID` Stage (for branch comparison):
  - `EX/MEM` → `ID`: Forward for immediate branch operand resolution
  - `MEM/WB` → `ID`: Forward for 2-cycle old branch operands
- `EX` Stage (for ALU operations):
  - `EX/MEM` → `EX`: Forward ALU result from the previous instruction
  - `MEM/WB` → `EX`: Forward memory or writeback value

---

Due to early branch resolution with `ID forwarding`, the `branch penalty` is reduced from 2 cycles to 1 cycle.

For example:
```assembly
ADD x1, x2, x3
BEQ x1, x4, label
NOP
```

`ADD x1, x2, x3` produces its result in the `EX` stage, but the value is not written back to the register file until the `WB` stage. The next instruction, `BEQ x1, x4, label`, needs the value of `x1` during the `ID` stage, because branches compare their operands early to decide whether the pipeline must redirect control flow. This creates a `RAW data hazard`. If the processor cannot forward operands into the ID stage, BEQ must wait until ADD reaches WB.

>[!Note] Timing illustration without ID forwarding
>```
> ADD: IF ID EX MEM WB
> BEQ:    IF xx xx  ID EX MEM WB   ← 2 stalls
>```

This requires `2 stall cycles` before BEQ can safely read x1.
--- >[!Note]Timing illustration of With ID forwarding >``` > ADD: IF ID EX MEM WB > BEQ: IF xx ID EX MEM WB ← only 1 stall >``` Forwarding from EX/MEM supplies the correct value before WB. However, because the two events occur in the same cycle, the pipeline typically inserts `one bubble` to align the timing so that BEQ’s ID stage sees a stable forwarded value. --- #### Data Forwarding to EX Stage - Forwarding Conditions : For a source register `rs1` (same applies to `rs2`): Forwarding is required if: 1. The later stage will write a register (`reg_write_enable_mem` or `reg_write_enable_wb`) 2. The destination register matches the source register (`rd_mem` === `rs1_ex`, or `rd_wb` === `rs1_ex`) 3. The destination is not `x0` (`rd ≠ 0`), because `x0` always reads as zero - Forwarding Priority : - Priority 1 — Forward from EX/MEM - Most recent result (1-cycle hazard) - Used when the MEM-stage instruction writes the needed register - Priority 2 — Forward from MEM/WB - Older result (2-cycle hazard) - Used only when EX/MEM does not match - Otherwise — No Forward - Safe to use register file output directly :::spoiler EX-Stage Forwarding Logic ```scala // rs1 Forwarding Logic when(io.reg_write_enable_mem && (io.rd_mem === io.rs1_ex) && (io.rd_mem =/= 0.U)) { io.reg1_forward_ex := ForwardingType.ForwardFromMEM }.elsewhen(io.reg_write_enable_wb && (io.rd_wb === io.rs1_ex) && (io.rd_wb =/= 0.U)) { io.reg1_forward_ex := ForwardingType.ForwardFromWB }.otherwise { io.reg1_forward_ex := ForwardingType.NoForward } // rs2 Forwarding Logic when(io.reg_write_enable_mem && (io.rd_mem === io.rs2_ex) && (io.rd_mem =/= 0.U)) { io.reg2_forward_ex := ForwardingType.ForwardFromMEM }.elsewhen(io.reg_write_enable_wb && (io.rd_wb === io.rs2_ex) && (io.rd_wb =/= 0.U)) { io.reg2_forward_ex := ForwardingType.ForwardFromWB }.otherwise { io.reg2_forward_ex := ForwardingType.NoForward } ``` ::: --- #### Data Forwarding to ID Stage - Forwarding Conditions : For each branch operand (`rs1_id` or 
`rs2_id`), forwarding is needed if: 1. A later pipeline stage will write a register - `io.reg_write_enable_mem` or `io.reg_write_enable_wb` 2. The destination register matches the source register used in ID - `io.rd_mem === io.rs1_id` / `io.rs2_id` - `io.rd_wb === io.rs1_id` / `io.rs2_id` 3. The destination is not `x0` - `rd ≠ 0`, since `x0` is always 0 and does not need forwarding - Forwarding Priority : - The priority is the same as EX-stage forwarding. :::spoiler ID-Stage Forwarding Logic ```scala // rs1 ID-Stage Forwarding Logic when(io.reg_write_enable_mem && (io.rd_mem === io.rs1_id) && (io.rd_mem =/= 0.U)) { // Forward from EX/MEM → ID for branch rs1 io.reg1_forward_id := ForwardingType.ForwardFromMEM }.elsewhen(io.reg_write_enable_wb && (io.rd_wb === io.rs1_id) && (io.rd_wb =/= 0.U)) { // Forward from MEM/WB → ID if no newer MEM-stage match io.reg1_forward_id := ForwardingType.ForwardFromWB }.otherwise { // No hazard: use register file output for rs1 io.reg1_forward_id := ForwardingType.NoForward } // rs2 ID-Stage Forwarding Logic when(io.reg_write_enable_mem && (io.rd_mem === io.rs2_id) && (io.rd_mem =/= 0.U)) { // Forward from EX/MEM → ID for second branch operand io.reg2_forward_id := ForwardingType.ForwardFromMEM }.elsewhen(io.reg_write_enable_wb && (io.rd_wb === io.rs2_id) && (io.rd_wb =/= 0.U)) { // Forward from MEM/WB → ID io.reg2_forward_id := ForwardingType.ForwardFromWB }.otherwise { // No hazard: use register file output for rs2 io.reg2_forward_id := ForwardingType.NoForward } ``` ::: --- ### Control Unit The design provides the most advanced hazard detection, enabling early branch resolution in the `ID` stage and incorporating full forwarding support. 
- Key Enhancement :
  - `Early branch resolution`: Branches are resolved in the `ID` stage (not EX)
  - `ID-stage forwarding`: Enables immediate branch operand comparison
  - `Complex hazard detection`: Handles jump dependencies and multi-stage loads

---
The control unit implements centralized hazard detection for the pipelined CPU. It handles both data and control hazards by coordinating `stalls` and `flushes` across the PC, IF, and ID stages.

For data hazards, it detects `load-use` and `jump` dependencies on registers whose values are not yet available, and inserts a bubble by stalling PC/IF and flushing the ID/EX register.

For control hazards, it supports early `branch` and `jump` resolution in the ID stage: when a branch or jump is taken (`jump_flag`), only the IF stage is flushed, reducing the branch penalty to a single cycle.

---
To handle these situations, the Control unit drives four main outputs:
- `pc_stall`: freezes the PC to prevent fetching the next instruction
- `if_stall`: holds the IF/ID pipeline register
- `id_flush`: clears ID/EX, inserting a bubble
- `if_flush`: clears IF/ID to discard wrong-path instructions

---
The `when` block detects all data hazards that prevent the ID-stage instruction from executing safely. If any such hazard is detected, the pipeline must `stall` and insert a `bubble` to ensure correctness. If no data hazard exists but a branch/jump is taken (`jump_flag = true`), the pipeline performs an `IF flush` to discard the wrong-path instruction.

1. Condition 1: EX-stage Dependency (1-cycle hazard)

Triggered when:
- The `ID-stage` instruction is a `jump`, OR the `EX-stage` instruction is a `load`

AND:
- The `EX-stage` instruction writes a register (`rd_ex ≠ x0`)
- That register is used as `rs1` or `rs2` of the ID-stage instruction

Meaning: The instruction in ID needs a value that the EX-stage instruction has not finished producing. Even with forwarding, EX → ID forwarding is not possible for jumps and some load-use hazards.

2. 
Condition 2: MEM-stage Load with Jump Dependency (2-cycle hazard)

Triggered when:
- The `ID-stage` instruction is a `jump`, AND
- The `MEM-stage` instruction is a `load`, AND
- The load destination (`rd_mem`) matches `rs1_id` or `rs2_id`, AND
- The destination is not `x0`

Meaning: Even with forwarding, the load value in the MEM stage takes one extra cycle before it can be forwarded back to ID for jump target calculation.

If either of the above conditions is detected:
```scala
io.id_flush := true.B
io.pc_stall := true.B
io.if_stall := true.B
```
The pipeline performs a stall plus bubble insertion.

3. Control Hazard Handling (`jump_flag`)

If the `ID` stage determines that a branch or `jump` is `taken`, the pipeline must discard the next sequential instruction.
```scala
io.if_flush := true.B
```
Only IF is flushed because:
- The correct `jump target PC` is already known in `ID`.
- ID-stage early resolution avoids the need to flush ID/EX.

---
The hazard detection logic distinguishes between EX-stage and MEM-stage dependencies that affect ID-stage jumps or load-use cases. If a dependency makes the operand unavailable in time, the pipeline stalls and inserts a bubble. Otherwise, if a branch or jump is taken, only the IF stage is flushed thanks to early resolution in ID, achieving a one-cycle branch penalty. This combination ensures correctness with minimal performance overhead.
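
The stall and flush decision described above can be modelled in plain C as a quick sanity check of the priority between the two stall conditions and the `jump_flag` flush. This is only a sketch: the names `ctrl_in_t`, `ctrl_out_t`, `matches_id_source`, and `control_decide` are hypothetical and are not part of the Chisel project.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the control unit inputs. */
typedef struct {
    bool    jump_flag;              /* branch/jump taken, resolved in ID */
    bool    jump_instruction_id;    /* ID-stage instruction is a jump    */
    uint8_t rs1_id;                 /* source registers read in ID       */
    uint8_t rs2_id;
    bool    memory_read_enable_ex;  /* EX-stage instruction is a load    */
    uint8_t rd_ex;                  /* EX-stage destination register     */
    bool    memory_read_enable_mem; /* MEM-stage instruction is a load   */
    uint8_t rd_mem;                 /* MEM-stage destination register    */
} ctrl_in_t;

typedef struct {
    bool if_flush;
    bool id_flush;
    bool pc_stall;
    bool if_stall;
} ctrl_out_t;

/* True when rd is a real register (not x0) that the ID stage reads. */
static bool matches_id_source(const ctrl_in_t *in, uint8_t rd) {
    return rd != 0 && (rd == in->rs1_id || rd == in->rs2_id);
}

ctrl_out_t control_decide(const ctrl_in_t *in) {
    ctrl_out_t out = { false, false, false, false };

    /* Condition 1: jump in ID or load in EX, with an EX-stage dependency. */
    bool ex_hazard = (in->jump_instruction_id || in->memory_read_enable_ex) &&
                     matches_id_source(in, in->rd_ex);

    /* Condition 2: jump in ID depends on a load still in MEM. */
    bool mem_hazard = in->jump_instruction_id && in->memory_read_enable_mem &&
                      matches_id_source(in, in->rd_mem);

    if (ex_hazard || mem_hazard) {
        /* Stall PC and IF/ID, insert a bubble into ID/EX. */
        out.id_flush = true;
        out.pc_stall = true;
        out.if_stall = true;
    } else if (in->jump_flag) {
        /* Taken branch/jump: discard only the wrong-path fetch. */
        out.if_flush = true;
    }
    return out;
}
```

In this model a data hazard always takes priority over a taken jump, so a dependent jump is first held in ID and only redirects the fetch once its operand is available.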
:::spoiler Control.scala ```scala class Control extends Module { val io = IO(new Bundle { val jump_flag = Input(Bool()) // id.io.if_jump_flag val jump_instruction_id = Input(Bool()) // id.io.ctrl_jump_instruction // val rs1_id = Input(UInt(Parameters.PhysicalRegisterAddrWidth)) // id.io.regs_reg1_read_address val rs2_id = Input(UInt(Parameters.PhysicalRegisterAddrWidth)) // id.io.regs_reg2_read_address val memory_read_enable_ex = Input(Bool()) // id2ex.io.output_memory_read_enable val rd_ex = Input(UInt(Parameters.PhysicalRegisterAddrWidth)) // id2ex.io.output_regs_write_address val memory_read_enable_mem = Input(Bool()) // ex2mem.io.output_memory_read_enable // val rd_mem = Input(UInt(Parameters.PhysicalRegisterAddrWidth)) // ex2mem.io.output_regs_write_address // val if_flush = Output(Bool()) val id_flush = Output(Bool()) val pc_stall = Output(Bool()) val if_stall = Output(Bool()) }) io.if_flush := false.B io.id_flush := false.B io.pc_stall := false.B io.if_stall := false.B when( ((io.jump_instruction_id || io.memory_read_enable_ex) && // Either: (io.rd_ex =/= 0.U) && // Destination is not x0 (io.rd_ex === io.rs1_id || io.rd_ex === io.rs2_id)) // Destination matches ID source || (io.jump_instruction_id && // Jump instruction in ID io.memory_read_enable_mem && // Load instruction in MEM (io.rd_mem =/= 0.U) && // Load destination not x0 (io.rd_mem === io.rs1_id || io.rd_mem === io.rs2_id)) // Load dest matches jump source ) { io.id_flush := true.B io.pc_stall := true.B io.if_stall := true.B }.elsewhen(io.jump_flag) { io.if_flush := true.B } } ``` ::: --- ### Hazard Detection Summary and Analysis #### Q1 : Why do we need to stall for load-use hazards? >[!Note] Ans >Because the value loaded from memory is not available until the `MEM/WB` stage, and forwarding cannot provide it early enough for the next instruction in ID. Without stalling, the instruction in ID would read an incorrect register value. 
___
#### Q2 : What is the difference between "stall" and "flush" operations?
>[!Note] Ans
>A `stall` freezes the `PC` and `IF/ID` registers, keeping the same instruction for another cycle. A `flush` overwrites a pipeline register with a `NOP` to discard an incorrect or unwanted instruction.
>
>`Stall` delays execution; `flush` removes wrong-path instructions.

___
#### Q3 : Why does a jump instruction with a register dependency need a stall?
>[!Note] Ans
>The `jump` target address depends on a register value that is not yet ready when the jump reaches the `ID` stage. The CPU must wait for the producer instruction (in the EX or MEM stage) to compute the correct value before calculating the jump target.

___
#### Q4 : In this design, why is the branch penalty only 1 cycle instead of 2?
>[!Note] Ans
>`Branches` and `jumps` are resolved in the `ID` stage instead of the `EX` stage.
>This early resolution means only the instruction in `IF` needs to be flushed, reducing the penalty from two discarded instructions to just one.

___
#### Q5 : What would happen if we removed the hazard detection logic entirely?
>[!Note] Ans
>The pipeline would execute instructions using incorrect data, leading to wrong branch decisions, incorrect jump targets, and corrupted ALU results.
>
>Control flow would break, and the CPU would no longer execute programs correctly.

___
#### Q6 : Complete the stall condition summary
>[!Note] Ans
>Stall is needed when:
>1. EX-stage dependency:
>The `ID-stage` instruction is a `jump`, or the `EX-stage` instruction is a `load`, and the EX-stage destination register (`rd_ex`, not `x0`) matches `rs1_id` or `rs2_id`.
>2. MEM-stage dependency:
>A `jump` instruction in `ID` depends on a `load result` still in the `MEM` stage, and the loaded register matches `rs1_id` or `rs2_id`.
>
>Flush is needed when:
>1. Branch or jump is taken (`jump_flag`):
>The `IF-stage` instruction is on the wrong path and must be discarded.
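
The stall/flush distinction from Q2 can be illustrated with a small C model of a single pipeline register. This is a sketch under the assumption that `flush` takes priority over `stall` when both are asserted; the actual `PipelineRegister` module may order them differently. The names `pipe_reg_t` and `pipe_reg_step` are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* NOP encoding used as the flush value (addi x0, x0, 0). */
#define NOP_INSTRUCTION 0x00000013u

/* Hypothetical model of one pipeline register holding an instruction. */
typedef struct {
    uint32_t value;
} pipe_reg_t;

/* Advance the register by one clock edge.
 * Assumption: flush wins over stall if both are asserted. */
void pipe_reg_step(pipe_reg_t *reg, uint32_t in, bool stall, bool flush) {
    if (flush) {
        reg->value = NOP_INSTRUCTION; /* discard: replace with a bubble */
    } else if (!stall) {
        reg->value = in;              /* normal: latch the new input */
    }
    /* stall without flush: keep the previous contents unchanged */
}
```

A stalled register replays the same instruction on the next cycle, while a flushed register loses its instruction for good, which is why a stall is used for data hazards and a flush for wrong-path fetches.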
--- ### Functional Test ```info [info] PipelineProgramTest: [info] Three-stage Pipelined CPU [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] Five-stage Pipelined CPU with Stalling [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] Five-stage Pipelined CPU with Forwarding [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] Five-stage Pipelined CPU with Reduced Branch Delay [info] - should calculate recursively fibonacci(10) [info] - should quicksort 10 numbers [info] - should store and load single byte [info] - should solve data and control hazards [info] - should handle all hazard types comprehensively [info] - should handle machine-mode traps [info] PipelineUartTest: [info] Three-stage Pipelined CPU UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Stalling UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Forwarding UART Comprehensive Test [info] - should pass all TX and RX tests [info] Five-stage Pipelined CPU with Reduced Branch Delay UART Comprehensive Test [info] - should pass all TX and RX tests [info] PipelineRegisterTest: [info] Pipeline Register [info] - should be able to stall and flush [info] Run completed in 1 minute, 32 seconds. 
[info] Total number of tests run: 29 [info] Suites: completed 3, aborted 0 [info] Tests: succeeded 29, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` --- All pipeline implementations (the `three-stage` CPU, the `five-stage stalling` design, the `five-stage design with forwarding`, and the `five-stage final design`) passed every functional and hazard-related test. The tests include `recursive Fibonacci` computation, `quicksort` execution, `byte-level memory operations`, comprehensive `data and control hazard resolution`, and correct handling of `machine-mode traps`. In addition, all `UART transmit/receive tests` passed across all pipeline variants, confirming correct peripheral integration. The pipeline register behavior was also verified to properly support `stall` and `flush` operations. --- ### RISCOF Compliance Testing ![image](https://hackmd.io/_uploads/ByvJReVMWg.png) The pipelined version of mycpu passed all compliance checks, confirming that the CPU design correctly implements the `RISC-V` base `integer instructions`, `CSR` operations, and required machine-mode architectural behavior. # [Adapting Homework 2 RISC-V Assembly](https://hackmd.io/@bbchen/arch2025-homework2) I modified the handwritten `RISC-V assembly` from `Homework 2` to make it compatible with the pipelined CPU. After adapting the `UF8 encode/decode` routines to respect the pipeline’s hazard behavior and placing the updated assembly in the `csrc` directory as required, the program now executes correctly on the pipelined processor. The test confirms that both encoding and decoding operations produce the expected results across all input values, demonstrating full functional correctness under the pipelined execution model. --- ### C version Before integrating the handwritten RISC-V assembly routines, I first validated the `UF8 encode/decode` logic using a pure C implementation. 
I compiled the C-only version into `main.asmbin` and executed it on the pipelined CPU to ensure that the baseline algorithm behaved correctly independent of any assembly-level hazards. The `C version` successfully completed all encode–decode checks and wrote the expected result code to the debug memory location, confirming that the UF8 algorithm itself was correct and that the test harness, memory-mapped interface, and pipeline environment were functioning properly. :::spoiler c version ```c typedef unsigned int uint32_t; typedef unsigned char uint8_t; typedef uint8_t uf8; /* ============= Decode Encode ============= */ /* CLZ: count leading zeros for 32-bit unsigned int */ static unsigned clz(uint32_t x) { int n = 32; int c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } /* Decode uf8 to uint32_t */ uint32_t uf8_decode(uf8 fl) { uint32_t mantissa = (uint32_t)(fl & 0x0f); uint8_t exponent = (uint8_t)(fl >> 4); uint32_t offset = (0x7FFFu >> (15 - exponent)) << 4; return (mantissa << exponent) + offset; } /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16u) return (uf8)value; /* Find appropriate exponent using CLZ hint */ int lz = (int)clz(value); int msb = 31 - lz; uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = (uint8_t)(msb - 4); if (exponent > 15u) exponent = 15u; /* Calculate overflow for estimated exponent */ for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16u; /* Adjust if estimate was off */ while (exponent > 0u && value < overflow) { overflow = (overflow - 16u) >> 1; 
exponent--; } } /* Find exact exponent */ while (exponent < 15u) { uint32_t next_overflow = (overflow << 1) + 16u; if (value < next_overflow) break; overflow = next_overflow; exponent++; } uint8_t mantissa = (uint8_t)((value - overflow) >> exponent); return (uf8)((exponent << 4) | mantissa); } /* test: C-only round trip, result code is written to address 0x4 */ int main(void) { *(int *)(4) = 1; for (int i = 5; i < 256; i++) { uint32_t d = uf8_decode((uf8)i); uf8 e = uf8_encode(d); if ( (e != (uf8)i) ) { *(int *)(4) = i; } } } ``` ::: --- Modify `PipelineProgramTest.scala` : ```scala it should "perform uf8 encode and decode" in { runProgram("main.asmbin", cfg) { c => for (i <- 1 to 100) { c.clock.step(1000) c.io.mem_debug_read_address.poke((i * 4).U) } c.io.mem_debug_read_address.poke(4.U) c.clock.step() c.io.mem_debug_read_data.expect(1.U) } } ``` --- The C version completed all encode–decode checks. For a more detailed explanation of the testing methodology, refer to [Assignment 2](https://hackmd.io/@bbchen/arch2025-homework2). ```info [info] PipelineProgramTest: [info] Three-stage Pipelined CPU [info] - should perform uf8 encode and decode [info] Five-stage Pipelined CPU with Stalling [info] - should perform uf8 encode and decode [info] Five-stage Pipelined CPU with Forwarding [info] - should perform uf8 encode and decode [info] Five-stage Pipelined CPU with Reduced Branch Delay [info] - should perform uf8 encode and decode [info] Run completed in 55 seconds, 560 milliseconds. [info] All tests passed. ``` --- ### RISC-V assembly Next, I adapted the handwritten `RISC-V assembly` code so that it operates correctly on the pipelined RISC-V CPU. 
:::spoiler RISC-V assembly ```assembly .section .text # .align 2 /* --------- uf8_decode --------- */ .globl uf8_decode_asm .type uf8_decode_asm, @function uf8_decode_asm: addi sp, sp, -4 sw ra, 0(sp) andi t0, a0, 0x0f #t0 = mantissa srli t1, a0, 4 #t1 = exponent # offset = ((1 << exponent) - 1) << 4 addi t2, zero, 1 sll t2, t2, t1 addi t2, t2, -1 slli t2, t2, 4 #t2 = offset sll a0,t0, t1 add a0,a0, t2 lw ra, 0(sp) # restore return addr addi sp, sp, 4 ret /* --------- uf8_encode --------- */ .globl uf8_encode_asm .type uf8_encode_asm, @function uf8_encode_asm: li t0, 16 slt t0, a0 ,t0 bne t0, zero, return_value addi sp, sp, -36 sw s7, 32(sp) sw s6, 28(sp) sw s0, 24(sp) sw s1, 20(sp) sw s2, 16(sp) sw s3, 12(sp) sw s4, 8(sp) sw s5, 4(sp) sw ra, 0(sp) add s0, a0, zero #s0 = value #Find appropriate exponent using CLZ hint jal ra, clz_asm add s1, a0, x0 #s1 = lz li t0, 31 sub s2, t0, s1 #s2 = msb add s3, zero, zero #s3 = exponent add s4, zero, zero #s4 = overflow li t0, 5 blt t0, s2, 1f addi s3, s2, -4 li t0, 15 ble s3, t0, 2f li s3, 15 # Calculate overflow for estimated exponent 2: li s5, 0 3: bge s5, s3, 4f slli s4, s4, 1 addi s4, s4, 16 addi s5, s5, 1 j 3b # Adjust if estimate was off 4: slt t0, x0, s3 slt t1, s0, s4 and t0, t0, t1 beq t0, x0, 1f addi s4, s4, -16 srli s4, s4, 1 addi s3, s3, -1 j 4b #Find exact exponent 1: addi t0, x0, 15 bge s3, t0, encode_done slli s6, s4, 1 addi s6, s6, 16 blt s0, s6, encode_done mv s4, s6 addi s3, s3, 1 j 1b encode_done: sub s7, s0, s4 srl s7, s7, s3 slli a0, s3, 4 or a0, a0, s7 lw ra, 0(sp) lw s5, 4(sp) lw s4, 8(sp) lw s3, 12(sp) lw s2, 16(sp) lw s1, 20(sp) lw s0, 24(sp) lw s6, 28(sp) lw s7, 32(sp) addi sp, sp, 36 return_value: ret /* --------- clz (count leading zeros, 0..32) --------- */ .globl clz_asm .type clz_asm, @function clz_asm: addi t0, zero, 32 #t0 = n addi t1, zero, 16 #t1 = c add t3, a0, zero #t3 = x clz_Loop: srl t2, t3, t1 #t2 = y beq t2, zero, clz_y_0 sub t0, t0, t1 add t3, t2, zero #t3 = x clz_y_0: srli t1,t1,1 # 
c >> 1 bne t1, x0, clz_Loop sub a0,t0,t3 ret ``` ::: --- I also updated the `Makefile` to link the handwritten `UF8 assembly module`. I added a dedicated build rule for `main.elf` so that it includes `uf8.o` along with `main.o`: ```makefile main.elf: main.o init.o uf8.o $(LD) -o $@ -T link.lds $(LDFLAGS) $^ ``` --- ### Functional Test I updated the test logic in `main.c` to perform a full cross-check between the `C` reference implementation and the `handwritten assembly` routines for UF8 encoding and decoding. The program now iterates through all 256 possible UF8 values and verifies three conditions: 1. The `C` implementation of `uf8_encode()` must regenerate the original UF8 byte. 2. The assembly implementation `uf8_encode_asm()` must also regenerate the original UF8 byte, matching the C version. 3. The assembly decoder `uf8_decode_asm()` must produce the same 32-bit result as the C decoder. --- A result code is written to address `0x4` to indicate the outcome: - `3` → all tests passed - `2` → C encoder mismatch - `1` → assembly encoder mismatch - `0` → assembly decoder mismatch ```c int main(void) { *(int *)(4) = 3; for (int i = 0; i < 256; i++) { uint32_t d = uf8_decode((uf8)i); uint32_t d_asm = uf8_decode_asm((uf8)i); uf8 e = uf8_encode(d); uf8 e_asm = uf8_encode_asm(d_asm); if ( (e != (uf8)i) ) { *(int *)(4) = 2; break; }else if(e_asm != (uf8)i) { *(int *)(4) = 1; break; }else if(d_asm != d) { *(int *)(4) = 0; break; } } } ``` The `UF8` program ran correctly on the pipelined CPU, and all encode/decode checks passed, confirming full functional correctness of the modified assembly implementation. 
```info [info] PipelineProgramTest: [info] Three-stage Pipelined CPU [info] - should perform uf8 encode and decode [info] Five-stage Pipelined CPU with Stalling [info] - should perform uf8 encode and decode [info] Five-stage Pipelined CPU with Forwarding [info] - should perform uf8 encode and decode [info] Five-stage Pipelined CPU with Reduced Branch Delay [info] - should perform uf8 encode and decode [info] All tests passed. ``` >[!warning] >At first, I encountered linking errors because both the C implementation and the handwritten assembly implementation were named uf8 (e.g., `uf8.c` and `uf8.S`). > >Since they produced object files with the same name (uf8.o), the linker could not distinguish between the two, causing one file to overwrite the other during compilation. > >As a result, the final executable was missing the correct symbols, leading to unresolved references when the program attempted to call the assembly routines.
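
One detail of the port is that the assembly decoder computes the offset as `((1 << exponent) - 1) << 4`, while the C reference uses `(0x7FFFu >> (15 - exponent)) << 4`. The sketch below checks in plain C that the two expressions agree for every 4-bit exponent. The helper names `offset_c`, `offset_asm`, and `offsets_agree` are introduced here for illustration only and do not exist in the project.

```c
#include <stdint.h>

/* Offset as written in the C reference decoder. */
uint32_t offset_c(uint8_t exponent) {
    return (0x7FFFu >> (15 - exponent)) << 4;
}

/* Offset as computed by the handwritten assembly (addi/sll/addi/slli). */
uint32_t offset_asm(uint8_t exponent) {
    return ((1u << exponent) - 1u) << 4;
}

/* Returns 1 if both formulas agree for every exponent in 0..15, else 0. */
int offsets_agree(void) {
    for (uint8_t e = 0; e <= 15; e++) {
        if (offset_c(e) != offset_asm(e)) {
            return 0;
        }
    }
    return 1;
}
```

Both expressions reduce to (2^e - 1) * 16 because 0x7FFF equals 2^15 - 1, so shifting it right by 15 - e leaves exactly e low bits set.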