Try   HackMD

Construct RISC-V in Chisel

賴傑南

Differences Between RV32E and RV32IM in Chisel RISC-V Processor Design

In Chisel-based RISC-V processor designs, RV32E and RV32IM are two distinct RISC-V architecture variants. The primary differences lie in the number of registers and the supported instruction sets.

Comparison between RV32E and RV32IM:

Feature RV32E RV32IM
Number of Registers 16 (x0 - x15) 32 (x0 - x31)
Bit Width 32-bit 32-bit
Instruction Set RV32I RV32I + M Extension
Target Applications Ultra-low power embedded systems General embedded and compute-intensive
Computation Ability Basic operations Supports multiplication/division
Hardware Resources Minimal More
Power Consumption Lower Higher

Part A : Change the code from RV32E to RV32IM

Source Code <rave32> : This is an unpipelined RV32E (RISC-V 32-bit, embedded variant) CPU written in Chisel.

I forked the original Repositories and pushed the program I modified to RV32IM into a new branch called <RV32IM>.

  • 32 general-purpose registers (x0 to x31)
  • M Extension: Integer multiplication/division instructions (MUL, MULH, MULHSU, MULHU, DIV, DIVU, REM, REMU)

main/scala

Instruction set (Inst.scala)

  • Added instructions related to the M extension (MUL, MULH, MULHSU, MULHU, DIV, DIVU, REM, REMU).
  // ---------------------- M extension ---------------------- //
  // R-type, funct7 = 0x1
  def MUL    = BitPat("b0000001??????????000?????0110011")
  def MULH   = BitPat("b0000001??????????001?????0110011")
  def MULHSU = BitPat("b0000001??????????010?????0110011")
  def MULHU  = BitPat("b0000001??????????011?????0110011")
  def DIV    = BitPat("b0000001??????????100?????0110011")
  def DIVU   = BitPat("b0000001??????????101?????0110011")
  def REM    = BitPat("b0000001??????????110?????0110011")
  def REMU   = BitPat("b0000001??????????111?????0110011")
  // ---------------------- M extension ---------------------- //

ALU (Alu.scala)

  • Added MUL, MULH, MULHSU, MULHU, DIV, DIVU, REM, and REMU operation codes.
////Define the operation code
  // M extension
  val MUL    = Value
  val MULH   = Value
  val MULHSU = Value
  val MULHU  = Value
  val DIV    = Value
  val DIVU   = Value
  val REM    = Value
  val REMU   = Value

////Define calculation formula
// ---------------------- M extension ---------------------- //
is(AluOp.MUL) {
  io.out := (io.src1 * io.src2)(31, 0)
}
is(AluOp.MULH) {
  // High 32 bits of 4-bit product, signed multiplication
  val product = (src1S * src2S).asSInt
  io.out := product(63, 32).asUInt
}
is(AluOp.MULHSU) {
  // src1 has a number, src2 has no number
  val product = (src1S * io.src2).asSInt
  io.out := product(63, 32).asUInt
}
is(AluOp.MULHU) {
  // All without numbers
  val product = (io.src1 * io.src2)
  io.out := product(63, 32)
}
is(AluOp.DIV) {
  when(src2S === 0.S) {
    // Divide by 0, the result is undefined
    io.out := "hFFFFFFFF".U
  }.otherwise {
    io.out := (src1S / src2S).asUInt
  }
}
is(AluOp.DIVU) {
  when(io.src2 === 0.U) {
    io.out := "hFFFFFFFF".U
  }.otherwise {
    io.out := io.src1 / io.src2
  }
}
is(AluOp.REM) {
  when(src2S === 0.S) {
    io.out := src1S.asUInt
  }.otherwise {
    io.out := (src1S % src2S).asUInt
  }
}
is(AluOp.REMU) {
  when(io.src2 === 0.U) {
    io.out := io.src1
  }.otherwise {
    io.out := io.src1 % io.src2
  }
}
// ---------------------- M extension ---------------------- //

RegFile.scala

  • The original 16 registers have been expanded to 32 registers.
  • The bit width of rs1, rs2, and rd has been increased from 3.W to 5.W.
// Change to 5 bits (0~31)
val rs1 = Input(UInt(5.W))
val rs2 = Input(UInt(5.W))
val rd  = Input(UInt(5.W))

Decoder (Decoder.scala)

  • Added the above instructions to the decoding table, mapping them to corresponding AluOp and InstFormat.
////Change to 5 bits (0~31)
val rs1       = UInt(5.W)   // Change to 5 digits
val rs2       = UInt(5.W)   // Change to 5 digits
val rd        = UInt(5.W)   // Change to 5 digits

////Add instruction to decoding table
// ---------------------- M extension ---------------------- //
Inst.MUL    -> List(false.B, InstFormat.R, AluOp.MUL, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.MULH   -> List(false.B, InstFormat.R, AluOp.MULH,   MemOp.NONE, SpecialOp.NONE, false.B),
Inst.MULHSU -> List(false.B, InstFormat.R, AluOp.MULHSU, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.MULHU  -> List(false.B, InstFormat.R, AluOp.MULHU,  MemOp.NONE, SpecialOp.NONE, false.B),
Inst.DIV    -> List(false.B, InstFormat.R, AluOp.DIV,    MemOp.NONE, SpecialOp.NONE, false.B),
Inst.DIVU   -> List(false.B, InstFormat.R, AluOp.DIVU,   MemOp.NONE, SpecialOp.NONE, false.B),
Inst.REM    -> List(false.B, InstFormat.R, AluOp.REM,    MemOp.NONE, SpecialOp.NONE, false.B),
Inst.REMU   -> List(false.B, InstFormat.R, AluOp.REMU,   MemOp.NONE, SpecialOp.NONE, false.B),
// ---------------------- M extension ---------------------- //

Memory.scala

  • The memory architecture and access methods are the same for both RV32E and RV32IM.

  • Instructions like LW (Load Word), SW (Store Word), as well as LB, LH, etc., are supported in both RV32E and RV32IM.

    • These instructions are part of the RV32I base instruction set.
  • M extension (multiplication and division) instructions do not involve memory operations.

    • They are purely ALU operations and therefore do not affect the logic in Memory.scala.
  • Memory.scala is responsible for data access rather than arithmetic operations.

  • The memory module only cares about:

    • Addresses (addr)
    • Data (writeData, readData)
    • Memory operation types (MemOp).
  • Multiplication and division instructions executed by the ALU do not require additional memory access, hence no changes are needed for Memory.scala.

OneCycle.scala

  • OneCycle.scala mainly handles:
    • Instruction fetch (imem)
    • Decoding (decoder)
    • ALU execution (alu)
    • Memory access (dmem)
    • Writing results back to the register file (RegFile).
//Change to 5 bits, corresponding to 0~31
val readAddr = Input(UInt(5.W))  

test/scala

AluSpec.scala

Added M extended ALU test program

  1. MUL Test Instruction
  • Function: Tests the ALU multiplication instruction, returning the lower 32 bits of the product.
  • Test Range: -10 to 9.
  • Formula:
    ​​result = (x * y) & 0xFFFFFFFFL
    
  1. MULH Test Instruction
  • Function: Tests the ALU multiplication instruction, returning the upper 32 bits of the product.
  • Test Range: -10 to 9.
  • Formula:
    ​​result = ((x * y) >> 32) & 0xFFFFFFFFL
    
  1. MULHSU Test Instruction
  • Function: Tests signed and unsigned mixed multiplication, returning the upper 32 bits of the product.
  • Test Range: -10 to 9.
  • Formula:
    ​​result = ((BigInt(x) * BigInt(y & 0xFFFFFFFFL)) >> 32).toLong & 0xFFFFFFFFL
    
  1. MULHU Test Instruction
  • Function: Tests unsigned multiplication, returning the upper 32 bits of the product.
  • Test Range: -10 to 9.
  • Formula:
    ​​result = ((BigInt(x & 0xFFFFFFFFL) * BigInt(y & 0xFFFFFFFFL)) >> 32).toLong
    
  1. DIV Test Instruction
  • Function: Tests signed division, including handling cases where the divisor is zero.
  • Test Range: -10 to 9.
  • Formula:
    ​​result = if (y != 0) x / y else -1
    
  1. DIVU Test Instruction
  • Function: Tests unsigned division, including handling cases where the divisor is zero.
  • Test Range: 0 to 19.
  • Formula:
    ​​result = if (y != 0) (x & 0xFFFFFFFFL) / (y & 0xFFFFFFFFL) else 0xFFFFFFFFL
    
  1. REM Test Instruction
  • Function: Tests signed remainder, including handling cases where the divisor is zero.
  • Test Range: -10 to 9.
  • Formula:
    ​​result = if (y != 0) x % y else x
    

RegFileSpec.scala

Test high register (x31), Write to and Read from Higher Registers

  • Function: Tests the ability to write to and read from a higher register (specifically x31) in the register file.
  • Test Steps:
    1. Enable write operations by setting writeEnable to true.
    2. Write the value 42 to register x31 (rd set to 31).
    3. Perform a clock cycle to commit the write operation.
    4. Set rs1 to 31 to read the value of x31.
    5. Verify that the read value matches the written value (42).

DecoderSpec.scala

Remove or comment out the original "not decode MUL"

//it should "not decode MUL" in {
//test(new Decoder) { c =>
 // c.io.inst.poke(assemble("mul, x1, x2, x3"))
  //c.io.ctrl.exception.peekBoolean() shouldBe true
//}
//}   

Added M extended test case to decoder. The following tests were added to verify the decoding functionality of the Decoder module. Each test ensures that the corresponding R-type instruction is decoded correctly, with the proper ALU operation and register fields.

  1. MUL Decode Test
  • Function: Verifies that the Decoder correctly decodes the MUL instruction.
  • Instruction: mul x1, x2, x3
  • Expected Decoding:
    • ALU Operation: AluOp.MUL
    • Destination Register: x1
    • Source Registers: x2, x3
  • Example Test:
    ​​checkRType(c, "mul x1, x2, x3", AluOp.MUL, 1, 2, 3)
    
  1. MULH Decode Test
  • Function:Verifies that the Decoder correctly decodes the MULHinstruction.
  • Instruction: mulh x1, x2, x3
  • Expected Decoding:
    • ALU Operation: AluOp.MULH
    • Destination Register: x1
    • Source Registers: x2, x3
  • Example Test:
    ​​checkRType(c, "mulh x1, x2, x3", AluOp.MULH, 1, 2, 3)
    
  1. MULHSU Decode Test
  • Function: Verifies that the Decoder correctly decodes the MULHSU instruction.
  • Instruction: mulhsu x1, x2, x3
  • Expected Decoding:
    • ALU Operation: AluOp.MULHSU
    • Destination Register: x1
    • Source Registers: x2, x3
  • Example Test:
    ​​checkRType(c, "mulhsu x1, x2, x3", AluOp.MULHSU, 1, 2, 3)
    
  1. MULHU Decode Test
  • Function: Verifies that the Decoder correctly decodes the MULHU instruction.
  • Instruction: mulhu x1, x2, x3
  • Expected Decoding:
    • ALU Operation: AluOp.MULHU
    • Destination Register: x1
    • Source Registers: x2, x3
  • Example Test:
    ​​checkRType(c, "mulhu x1, x2, x3", AluOp.MULHU, 1, 2, 3)
    
  1. DIV Decode Test
  • Function: Verifies that the Decoder correctly decodes the DIV instruction.
  • Instruction: div x1, x2, x3
  • Expected Decoding:
    • ALU Operation: AluOp.DIV
    • Destination Register: x1
    • Source Registers: x2, x3
  • Example Test:
    ​​checkRType(c, "div x1, x2, x3", AluOp.DIV, 1, 2, 3)
    
  1. DIVU Decode Test
  • Function: Verifies that the Decoder correctly decodes the DIVU instruction.
  • Instruction: divu x1, x2, x3
  • Expected Decoding:
    • ALU Operation: AluOp.DIVU
    • Destination Register: x1
    • Source Registers: x2, x3
  • Example Test:
    ​​checkRType(c, "divu x1, x2, x3", AluOp.DIVU, 1, 2, 3)
    
  1. REM Decode Test
  • Function: Verifies that the Decoder correctly decodes the REM instruction.
  • Instruction: rem x1, x2, x3
  • Expected Decoding:
    • ALU Operation: AluOp.REM
    • Destination Register: x1
    • Source Registers: x2, x3
  • Example Test:
    ​​checkRType(c, "rem x1, x2, x3", AluOp.REM, 1, 2, 3)
    
  1. REMU Decode Test
  • Function: Verifies that the Decoder correctly decodes the REMU instruction.
  • Instruction: remu x1, x2, x3
  • Expected Decoding:
    • ALU Operation: AluOp.REMU
    • Destination Register: x1
    • Source Registers: x2, x3
  • Example Test:
    ​​checkRType(c, "remu x1, x2, x3", AluOp.REMU, 1, 2, 3)
    

Because the RISCVAssembler compiler cannot compile M-extended instructions, so need to add M-extended instructions to assemble

def assemble(in: String): UInt = {
    if (in == "ebreak") {
      "b00000000000100000000000001110011".U
    } else if (in == "ecall") {
      "b00000000000000000000000001110011".U
    } else if (in == "fence") {
      "b00000000000000000000000000001111".U
    } else {
        in match {
        // ---------------------- M extension ---------------------- //
        case "mul x1, x2, x3"    => "h023100B3".U  // funct7=1, funct3=0, opcode=0x33
        case "mulh x1, x2, x3"   => "h023110B3".U(32.W)  // funct7=1, funct3=1
        case "mulhsu x1, x2, x3" => "h023120B3".U  // funct7=1, funct3=2
        case "mulhu x1, x2, x3"  => "h023130B3".U
        case "div x1, x2, x3"    => "h023140B3".U  // funct7=1, funct3=4
        case "divu x1, x2, x3"   => "h023150B3".U
        case "rem x1, x2, x3"    => "h023160B3".U
        case "remu x1, x2, x3"   => "h023170B3".U

        // ---------------------- M extension ---------------------- //
        // If it is not the above M command, it will be handed over to the original Assembler.
        // -------------------------------------------------
        case _ => ("b" + RISCVAssembler.binOutput(in)).U
      }

    }

MemorySpec.scala

Because Memory.scala has not been modified, MemorySpec.scala does not need to be modified.

OneCycleSpec.scala

Add M-extended instructions and Convert RISC-V assembly language instructions into corresponding machine code

 def assemble(in: String): Int = {
   in match {

      case "mul x1, x2, x3" => Integer.parseUnsignedInt("023100B3", 16)
      case "mulh x1, x2, x3" => Integer.parseUnsignedInt("023110B3", 16)
      case "mulhsu x1, x2, x3" => Integer.parseUnsignedInt("023120B3", 16)
      case "mulhu x1, x2, x3" => Integer.parseUnsignedInt("023130B3", 16)
      case "div x1, x2, x3" => Integer.parseUnsignedInt("023140B3", 16)
      case "divu x1, x2, x3" => Integer.parseUnsignedInt("023150B3", 16)
      case "rem x1, x2, x3" => Integer.parseUnsignedInt("023160B3", 16)
      case "remu x1, x2, x3" => Integer.parseUnsignedInt("023170B3", 16)

      case "nop" => Integer.parseUnsignedInt("00000013", 16)

      case "ebreak" => Integer.parseUnsignedInt("00000000000000000000000001110011", 2)
      case "ecall" => Integer.parseUnsignedInt("00000000000100000000000001110011", 2)
      case "fence" => Integer.parseUnsignedInt("00000000000000000000000000001111", 2)

      case other =>
        val binStr = RISCVAssembler.binOutput(other)
        Integer.parseUnsignedInt(binStr, 2)
    }

Added test program for M extension

  1. MUL Functionality Test
  • Function: Verifies that the mul instruction correctly multiplies two registers and stores the result in the destination register.
  • Instruction: mul x1, x2, x3
  • Expected Behavior:
    • Setup:
      • x2 is set to 2.
      • x3 is set to 3.
    • Operation: x1 = x2 * x3 = 6
    • Final State:
      • x1 should be 6.
      • The processor should halt after execution.
  1. MULH Functionality Test
  • Function: Verifies that the mulh instruction correctly multiplies two registers and stores the high 32 bits of the result in the destination register.
  • Instruction: mulh x1, x2, x3
  • Expected Behavior:
    • Setup:
      • x2 is set to 1 and then shifted left by 16 bits (x2 = 65536).
      • x3 is set to 1 and then shifted left by 16 bits (x3 = 65536).
    • Operation: x1 = (65536 * 65536) >> 32 = 1
    • Final State:
      • x1 should be 1.
      • The processor should halt after execution.
  1. MULHSU Functionality Test
  • Function: Verifies that the mulhsu instruction correctly multiplies two registers (with one operand signed and the other unsigned) and stores the high 32 bits of the result in the destination register.
  • Instruction: mulhsu x1, x2, x3
  • Expected Behavior:
    • Setup:
      • x2 is set to 1 and then shifted left by 16 bits (x2 = 65536).
      • x3 is set to 1 and then shifted left by 16 bits (x3 = 65536).
    • Operation: x1 = (65536 * 65536) >> 32 = 1
    • Final State:
      • x1 should be 1.
      • The processor should halt after execution.
  1. MULHU Functionality Test
  • Function: Verifies that the mulhu instruction correctly multiplies two unsigned registers and stores the high 32 bits of the result in the destination register.
  • Instruction: mulhu x1, x2, x3
  • Expected Behavior:
    • Setup:
      • x2 is set to 1 and then shifted left by 16 bits (x2 = 65536).
      • x3 is set to 1 and then shifted left by 16 bits (x3 = 65536).
    • Operation: x1 = (65536 * 65536) >> 32 = 1
    • Final State:
      • x1 should be 1.
      • The processor should halt after execution.
  1. DIV Functionality Test
  • Function: Verifies that the div instruction correctly divides two signed registers and stores the quotient in the destination register.
  • Instruction: div x1, x2, x3
  • Expected Behavior:
    • Setup:
      • x2 is set to 10.
      • x3 is set to 3.
    • Operation: x1 = x2 / x3 = 3
    • Final State:
      • x1 should be 3.
      • The processor should halt after execution.
  1. DIVU Functionality Test
  • Function: Verifies that the divu instruction correctly multiplies two registers and stores the result in the destination register.
  • Instruction: divu x1, x2, x3
  • Expected Behavior:
    • Setup:
      • x2 is set to 10.
      • x3 is set to 3.
    • Operation: x1 = x2 / x3 = 3
    • Final State:
      • x1 should be 3.
      • The processor should halt after execution.
  1. REM Functionality Test
  • Function: Verifies that the rem instruction correctly multiplies two registers and stores the result in the destination register.
  • Instruction: rem x1, x2, x3
  • Expected Behavior:
    • Setup:
      • x2 is set to 10.
      • x3 is set to 3.
    • Operation: x1 = x2 % x3 = 1
    • Final State:
      • x1 should be 1.
      • The processor should halt after execution.
  1. REMU Functionality Test
  • Function: Verifies that the remu instruction correctly multiplies two registers and stores the result in the destination register.
  • Instruction: remu x1, x2, x3
  • Expected Behavior:
    • Setup:
      • x2 is set to 10.
      • x3 is set to 3.
    • Operation: x1 = x2 % x3 = 1
    • Final State:
      • x1 should be 1.
      • The processor should halt after execution.

Simulate and run tests

Tests can be executed using sbt test to run all tests or with sbt "testOnly mrv.DecorderSpec" to run a specific test. Various test cases have been used to validate the functionality, and the results are displayed in the Test Results.

Part B : Rewrite OneCycle into a 3-stage pipeline

After upgrading the RISC-V CPU from RV32E to RV32IM, I will transition it from a single-cycle design to a 3-stage pipeline. The traditional 5-stage pipeline consists of Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write-Back stages. In my 3-stage pipeline, the CPU is reorganized into the following three stages:

  • Stage 1 : Instruction Fetch (IF) and Instruction Decode (ID)
  • Stage 2 : Execution (EX)
  • Stage 3 : Memory Access (MEM) and Write-Back (WB)

To enable the CPU to function in a pipelined architecture, temporary registers need to be placed between the different stages. These registers hold the data necessary for the next stage as well as the results produced by the preceding one. Proper handling of these stage registers and guaranteeing the accurate transfer of information is essential for the pipeline to operate correctly.

Pipeline register

Pipeline registers are used to hold and pass the necessary data and control signals from one stage of the pipeline to the next.

IF/ID to EXE register

Used to transfer data between the instruction fetch stage and the instruction decode stage of the processor

val if_id = Module(new PipelineReg(new Bundle {
    val inst    = UInt(32.W)
    val pcPlus4 = UInt(32.W)
    val rs1     = UInt(5.W)
    val rs2     = UInt(5.W)
    val rd      = UInt(5.W)
    val ctrl    = new Ctrl
    val rs1Data = UInt(32.W)
    val rs2Data = UInt(32.W)
}))

if_id.io.in.inst := io.imem.readData
if_id.io.in.pcPlus4 := pcPlus4
if_id.io.in.rs1 := decoder.io.ctrl.rs1
if_id.io.in.rs2 := decoder.io.ctrl.rs2
if_id.io.in.rd := decoder.io.ctrl.rd
if_id.io.in.ctrl := decoder.io.ctrl
if_id.io.in.rs1Data := regFile.io.rs1Data
if_id.io.in.rs2Data := regFile.io.rs2Data
if_id.io.enable := true.B
if_id.io.flush := false.B

EXE to MEM/WB register

Used to transfer data between the excution stage of the processor

val ex_mem = Module(new PipelineReg(new Bundle {
    val ctrl    = new Ctrl
    val aluOut  = UInt(32.W)
    val rd      = UInt(5.W)
    val rs2Data = UInt(32.W)
}))

MEM to WB register

This register is used to handle data hazard issues by transferring the results of the ALU calculations back before writing them back.

val mem_wb = Module(new PipelineReg(new Bundle {
    val ctrl   = new Ctrl
    val wbData = UInt(32.W)
    val rd     = UInt(5.W)
}))

Deal with Data Hazard

  • Purpose: When data dependencies occur between multiple instructions in the pipeline, stall or forward data to avoid incorrect results.
  • Detection Conditions:
    • When ex_mem's rd matches decoder's rs1 or rs2 and regWrite is true.
    • If the condition is met, activate the stall signal to pause the PC and halt the pipeline progression.

Stall

  • Stalling Logic:
    • When a hazard is detected, the stall signal is asserted (stall := true.B).
    • The program counter (PC) and pipeline registers are frozen, effectively pausing the pipeline.
    • The input to the EX/MEM pipeline register is cleared to prevent incorrect data from propagating further.
  • Impact of Stalling:
    • Stalling introduces a bubble into the pipeline, which delays execution but ensures correctness.
    • This is a necessary step when forwarding is insufficient, such as when a result is still in the EX stage and not available for forwarding.
  val hazard = ex_mem.io.out.ctrl.regWrite &&
               ex_mem.io.out.rd =/= 0.U &&
               (ex_mem.io.out.rd === decoder.io.ctrl.rs1 ||
                ex_mem.io.out.rd === decoder.io.ctrl.rs2)

  val stall = WireDefault(false.B)
  when(hazard) {
    stall := true.B
  }

  when(!stall) {
    pc := nextPC
  }.otherwise {
    ex_mem.io.in.ctrl := 0.U.asTypeOf(ex_mem.io.in.ctrl)
    ex_mem.io.in.aluOut := 0.U
    ex_mem.io.in.rd := 0.U
    ex_mem.io.in.rs2Data := 0.U
  }

  if_id.io.enable := !stall

Forwarding

  • Forwarding Logic:
    • In the EX stage, check if the rd (destination register) from a later pipeline stage matches rs1 or rs2 of the current instruction:
      • EX/MEM stage:
        • If the instruction in the EX/MEM stage will write back (regWrite = true.B) and its rd matches the current rs1 or rs2, use the aluOut from the EX/MEM stage as the forwarded value.
      • MEM/WB stage:
        • Similarly, if the MEM/WB stage instruction writes back (regWrite = true.B) and its rd matches the current rs1 or rs2, use the wbData from the MEM/WB stage.
  • Data Selection:
    • Forwarding overrides the default values of rs1Data and rs2Data by dynamically selecting the most recent value available in the pipeline.
    • If no hazards are detected, the default register file outputs are used.
  • Impact of Forwarding:
    • Reduced Stalls:
      • Forwarding allows the pipeline to proceed without delays, provided that the required data is available in a later stage.
    • Increased Complexity:
      • The forwarding unit requires additional hardware for hazard detection and data multiplexing but greatly improves performance.
  • Forwarding Example:
    • Instruction 1 (in EX stage): add x5, x1, x2 (produces result in x5).
    • Instruction 2 (in ID stage): sub x6, x5, x3 (requires the result from x5).
      • Without forwarding: Instruction 2 stalls until Instruction 1 writes back to the register file.
      • With forwarding: Instruction 2 uses aluOut from Instruction 1 directly, avoiding a stall.
  val ex_alu = Module(new Alu)

  val forwardA = WireDefault(if_id.io.out.rs1Data)
  val forwardB = WireDefault(if_id.io.out.rs2Data)

  val ex_mem = Module(new PipelineReg(new Bundle {
    val ctrl    = new Ctrl
    val aluOut  = UInt(32.W)
    val rd      = UInt(5.W)
    val rs2Data = UInt(32.W)
  }))

  when(ex_mem.io.out.ctrl.regWrite && ex_mem.io.out.rd =/= 0.U) {
    when(ex_mem.io.out.rd === if_id.io.out.rs1) {
      forwardA := ex_mem.io.out.aluOut
    }
    when(ex_mem.io.out.rd === if_id.io.out.rs2) {
      forwardB := ex_mem.io.out.aluOut
    }
  }

  val mem_wb = Module(new PipelineReg(new Bundle {
    val ctrl   = new Ctrl
    val wbData = UInt(32.W)
    val rd     = UInt(5.W)
  }))

  when(mem_wb.io.out.ctrl.regWrite && mem_wb.io.out.rd =/= 0.U) {
    when(mem_wb.io.out.rd === if_id.io.out.rs1) {
      forwardA := mem_wb.io.out.wbData
    }
    when(mem_wb.io.out.rd === if_id.io.out.rs2) {
      forwardB := mem_wb.io.out.wbData
    }
  }

  ex_alu.io.op := if_id.io.out.ctrl.aluOp
  ex_alu.io.src1 := forwardA
  ex_alu.io.src2 := Mux(if_id.io.out.ctrl.useImm, if_id.io.out.ctrl.imm.asUInt, forwardB)

Test CPU

I converted the original OneCycle program into a ThreeStage program, and added simple stalling and forwarding to avoid Data Hazard. My 3 stage pipeline program is ThreeStage.scala and is paired with a test program ThreeStageSpec.scala.

These two tests ensure that the simulator can handle RAW hazards correctly and maintain the correct execution order, and check whether the simulator handles the dependencies between registers correctly.
The other focuses on memory operations, checking whether the emulator correctly supports data loading and storing instructions.

RAW Hazard Test

  • Function: Verifies that the simulator handles a RAW (Read After Write) hazard correctly.
  • Instructions:
    1. addi x1, x0, 10 - Write 10 to register x1.
    2. addi x2, x1, 5 - Add 5 to the value in x1 and write the result to x2.
  • Expected Results:
    • After the first instruction, x1 should be 10.
    • After the second instruction, x2 should be 15.

Memory Load/Store Test

  • Function: Verifies that the simulator supports memory load (lw) and store (sw) operations.
  • Instructions:
    1. addi x1, x0, 5 - Write 5 to register x1.
    2. sw x1, 100(x0)`` - Store the value of x1` (5) into memory at address 100.
    3. lw x2, 100(x0) - Load the value from memory at address 100 into x2.
  • Expected Results:
    • After the first instruction, x1 should be 5.
    • After the second instruction, the memory at address 100 should contain 5.
    • After the third instruction, x2 should be 5.

Simulate and run tests

Tests can be executed using sbt test to run all tests or with sbt "testOnly mrv.ThreeStageSpec" to run a specific test. Various test cases have been used to validate the functionality, and the results are displayed in the Test Results.

Part C : Select from the course quiz questions rewrite the code to run on extended RISC-V processor

Quiz question1 (Quiz 1 Problem C)

This function, fabsf, is a custom implementation of the standard C function fabsf, which calculates the absolute value of a floating-point number.

static inline float fabsf(float x) {
    uint32_t i = *(uint32_t *)&x;  // Read the bits of the float into an integer
    i &= 0x7FFFFFFF;               // Clear the sign bit to get the absolute value
    x = *(float *)&i;              // Write the modified bits back into the float
    return x;
}

Absolute value Test (fabsf)

For the operation |x| where the input ( x = -5 ) (testing the computation of absolute value via subtraction):

  • Instructions:

    • addi x1, x0, -5 - Load the value (-5) into register x1.
    • sub x1, x0, x1 - Subtract x1 from x0 to compute its absolute value (resulting in ( 5 )).
    • add x2, x1, x0 - Copy the result from x1 to x2 (to verify the result).
  • Execution Steps:

    • The simulator loads all three instructions into the pipeline.
    • Wait for the pipeline to complete execution, allowing sufficient clock cycles (6 total).
    • After execution, check the values in registers x1 and x2.
  • Expected Results:

    • After executing the first instruction (addi x1, x0, -5), x1 should hold the value (-5).
    • After executing the second instruction (sub x1, x0, x1), x1 should hold the value ( 5 ) (absolute value of the initial value).
    • After executing the third instruction (add x2, x1, x0), x2 should hold the value ( 5 ), confirming that the absolute value computation was correct.
  • Verification:

    • Register x1 should be ( 5 ) after all instructions complete.
    • Register x2 should also be ( 5 ), as it mirrors the value of x1.

Quiz question2 (Quiz 2 Problem D)

Fix the permissions of the uploaded pictures.

The formula defines a recursive calculation for a value 𝑛𝑚 based on the input values 𝑛 and 𝑚. The calculation varies depending on the properties of 𝑛, as follows:

  1. If ( n ) is even:
    The value of ( nm ) is calculated as image
    This means ( n ) is halved, and the result is multiplied by ( 2m ).

  2. If ( n ) is odd:
    The value of ( nm ) is calculated as image
    Here, ( n ) is reduced by 1 to make it even, halved, and multiplied by ( 2m ). Then, ( m ) is added to the result.

  3. If ( n = 1 ):
    The value of ( nm ) is simply ( m ).
    This serves as the base case for the recursion or iterative process.

nm={n22mif n is even,n122m+mif n is odd,mif n=1.

Multiplication Test (n*m via mul)
For every combination of ( n ) and ( m ) where ( n, m \in {1, 2, \ldots, 7} ) (49 total combinations):

  1. addi x2, x0, n - Load the value ( n ) into register x2.
  2. addi x3, x0, m - Load the value ( m ) into register x3.
  3. mul x1, x2, x3 - Multiply the values in x2 and x3 and store the result in x1.
  • Execution Steps:
    1. After the first instruction (addi x2, x0, n), the simulator waits for the instruction to complete (3 clock cycles total).
    2. After the second instruction (addi x3, x0, m), the simulator again waits for the instruction to complete (3 clock cycles total).
    3. After the third instruction (mul x1, x2, x3), the simulator waits for the multiplication to complete (3 clock cycles total).
  • Expected Result ( For each combination of ( n ) and ( m ) ):
    1. After the first instruction, x2 should equal ( n ).
    2. After the second instruction, x3 should equal ( m ).
    3. After the third instruction, x1 should equal ( n * m ).

Quiz question3 (Quiz 5 Problem D)

This code snippet calculates the position of the most significant bit (MSB) in a non-negative integer N.

int logint(int N)
{
    int k = N, i = 0;
    while (k) {
        k >>= 1;
        i++;
    }
    return i - 1;
}

Logint test

assmebly code

  addi x2, x0, 16   # x2 = N = 16
  addi x3, x0, 0    # x3 = i = 0

loop:
  beq x2, x0, end   # if (x2 == 0) => jump to end
  srai x2, x2, 1    # x2 >>= 1
  addi x3, x3, 1    # i++
  jal x0, loop      # unconditional jump to loop

end:
  addi x4, x3, -1   # x4 = i - 1
  nop
  • Execution Steps:

    • Initialization:
      • x2 is set to 16 (the initial value of 𝑁).
      • x3 is initialized to 0 (counter for iterations).
    • Loop Execution:
      • The loop iterates, repeatedly right-shifting x2 by 1 bit (dividing it by 2).
      • Each iteration increments x3 to count the number of shifts performed.
      • The loop exits when x2 becomes 0.
    • Final Step:
      • After the loop, x4 is calculated as x3-1 to account for zero-based indexing of the MSB position.
  • Expected Results:

    • After executing the program:
      • Register x4 should hold the value 4, indicating that the MSB of 16(binary 10000) is at position 4.
  • Verification:

    • The test confirms that x4 is updated correctly after all instructions are executed.
    • The expected output in x4 matches the calculated MSB position.

Reference

Do refer to the lecture materials and/or primary source.