賴傑南
In Chisel-based RISC-V processor designs, RV32E and RV32IM are two distinct RISC-V architecture variants. The primary differences lie in the number of registers and the supported instruction sets.
Feature | RV32E | RV32IM |
---|---|---|
Number of Registers | 16 (x0 - x15) | 32 (x0 - x31) |
Bit Width | 32-bit | 32-bit |
Instruction Set | RV32I | RV32I + M Extension |
Target Applications | Ultra-low power embedded systems | General embedded and compute-intensive |
Computation Ability | Basic operations | Supports multiplication/division |
Hardware Resources | Minimal | More |
Power Consumption | Lower | Higher |
Source Code <rave32> : This is an unpipelined RV32E (RISC-V 32-bit, embedded variant) CPU written in Chisel.
I forked the original repository and pushed my RV32IM modifications to a new branch called <RV32IM>.
The following M-extension instructions were added: `MUL`, `MULH`, `MULHSU`, `MULHU`, `DIV`, `DIVU`, `REM`, `REMU`.
// ---------------------- M extension ---------------------- //
// R-type, funct7 = 0x1
def MUL = BitPat("b0000001??????????000?????0110011")
def MULH = BitPat("b0000001??????????001?????0110011")
def MULHSU = BitPat("b0000001??????????010?????0110011")
def MULHU = BitPat("b0000001??????????011?????0110011")
def DIV = BitPat("b0000001??????????100?????0110011")
def DIVU = BitPat("b0000001??????????101?????0110011")
def REM = BitPat("b0000001??????????110?????0110011")
def REMU = BitPat("b0000001??????????111?????0110011")
// ---------------------- M extension ---------------------- //
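To make the role of the `?` wildcards concrete, here is a minimal plain-Scala sketch of how such a pattern can be matched against an instruction word. It does not use Chisel; `BitPatModel`, `compile`, and `matches` are hypothetical names introduced only for illustration, and the real `BitPat` is a hardware construct rather than this software function.

```scala
// Software-only model of a 32-bit pattern with '?' don't-care bits.
// BitPatModel is a hypothetical helper, not part of Chisel.
object BitPatModel {
  // Convert a pattern string into (mask, value); '?' bits get mask bit 0.
  def compile(pattern: String): (Long, Long) = {
    val bits = pattern.stripPrefix("b")
    require(bits.length == 32, "expected a 32-bit pattern")
    bits.foldLeft((0L, 0L)) { case ((mask, value), c) =>
      c match {
        case '?' => (mask << 1, value << 1)
        case '0' => ((mask << 1) | 1L, value << 1)
        case '1' => ((mask << 1) | 1L, (value << 1) | 1L)
        case other => throw new IllegalArgumentException(s"bad pattern char: $other")
      }
    }
  }

  // An instruction matches when all non-wildcard bits agree with the pattern.
  def matches(pattern: String, inst: Long): Boolean = {
    val (mask, value) = compile(pattern)
    (inst & mask) == value
  }

  val MUL = "b0000001??????????000?????0110011"
  val DIV = "b0000001??????????100?????0110011"
}
```

For example, `BitPatModel.matches(BitPatModel.MUL, 0x023100B3L)` is true for the encoding of `mul x1, x2, x3`, while the same word does not match the `DIV` pattern because funct3 differs.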
Added the `MUL`, `MULH`, `MULHSU`, `MULHU`, `DIV`, `DIVU`, `REM`, and `REMU` operation codes.
// Define the operation codes
// M extension
val MUL = Value
val MULH = Value
val MULHSU = Value
val MULHU = Value
val DIV = Value
val DIVU = Value
val REM = Value
val REMU = Value
// Define the computation for each operation
// ---------------------- M extension ---------------------- //
is(AluOp.MUL) {
io.out := (io.src1 * io.src2)(31, 0)
}
is(AluOp.MULH) {
// High 32 bits of the 64-bit product, signed x signed
val product = (src1S * src2S).asSInt
io.out := product(63, 32).asUInt
}
is(AluOp.MULHSU) {
// src1 is signed, src2 is unsigned
val product = (src1S * io.src2).asSInt
io.out := product(63, 32).asUInt
}
is(AluOp.MULHU) {
// Both operands are unsigned
val product = (io.src1 * io.src2)
io.out := product(63, 32)
}
is(AluOp.DIV) {
when(src2S === 0.S) {
// Division by zero: the RISC-V spec defines the quotient as all ones (-1)
io.out := "hFFFFFFFF".U
}.otherwise {
io.out := (src1S / src2S).asUInt
}
}
is(AluOp.DIVU) {
when(io.src2 === 0.U) {
io.out := "hFFFFFFFF".U
}.otherwise {
io.out := io.src1 / io.src2
}
}
is(AluOp.REM) {
when(src2S === 0.S) {
io.out := src1S.asUInt
}.otherwise {
io.out := (src1S % src2S).asUInt
}
}
is(AluOp.REMU) {
when(io.src2 === 0.U) {
io.out := io.src1
}.otherwise {
io.out := io.src1 % io.src2
}
}
// ---------------------- M extension ---------------------- //
The width of `rs1`, `rs2`, and `rd` has been increased from `3.W` to `5.W`.
// Change to 5 bits (0~31)
val rs1 = Input(UInt(5.W))
val rs2 = Input(UInt(5.W))
val rd = Input(UInt(5.W))
`AluOp` and `InstFormat`.
// Change to 5 bits (0~31)
val rs1 = UInt(5.W) // Change to 5 bits
val rs2 = UInt(5.W) // Change to 5 bits
val rd = UInt(5.W) // Change to 5 bits
// Add the instructions to the decoding table
// ---------------------- M extension ---------------------- //
Inst.MUL -> List(false.B, InstFormat.R, AluOp.MUL, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.MULH -> List(false.B, InstFormat.R, AluOp.MULH, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.MULHSU -> List(false.B, InstFormat.R, AluOp.MULHSU, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.MULHU -> List(false.B, InstFormat.R, AluOp.MULHU, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.DIV -> List(false.B, InstFormat.R, AluOp.DIV, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.DIVU -> List(false.B, InstFormat.R, AluOp.DIVU, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.REM -> List(false.B, InstFormat.R, AluOp.REM, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.REMU -> List(false.B, InstFormat.R, AluOp.REMU, MemOp.NONE, SpecialOp.NONE, false.B),
// ---------------------- M extension ---------------------- //
The memory architecture and access methods are the same for both RV32E and RV32IM.
Instructions like `LW` (Load Word), `SW` (Store Word), as well as `LB`, `LH`, etc., are supported in both RV32E and RV32IM.
M extension (multiplication and division) instructions do not involve memory operations.
Memory.scala
`Memory.scala` is responsible for data access rather than arithmetic operations.
The memory module only cares about:
- the address (`addr`)
- the data (`writeData`, `readData`)
- the memory operation (`MemOp`)
Multiplication and division instructions executed by the ALU do not require additional memory access, hence no changes are needed for `Memory.scala`.
`OneCycle.scala` mainly handles:
- instruction fetch (`imem`)
- decoding (`decoder`)
- execution (`alu`)
- data memory access (`dmem`)
- register access (`RegFile`)
// Change to 5 bits, corresponding to 0~31
val readAddr = Input(UInt(5.W))
Added an ALU test program for the M extension. Each test lists the range of test values and the reference result the ALU output is compared against:
- `MUL` Test: values from `-10` to `9`; `result = (x * y) & 0xFFFFFFFFL`
- `MULH` Test: values from `-10` to `9`; `result = ((x * y) >> 32) & 0xFFFFFFFFL`
- `MULHSU` Test: values from `-10` to `9`; `result = ((BigInt(x) * BigInt(y & 0xFFFFFFFFL)) >> 32).toLong & 0xFFFFFFFFL`
- `MULHU` Test: values from `-10` to `9`; `result = ((BigInt(x & 0xFFFFFFFFL) * BigInt(y & 0xFFFFFFFFL)) >> 32).toLong`
- `DIV` Test: values from `-10` to `9`; `result = if (y != 0) x / y else -1`
- `DIVU` Test: values from `0` to `19`; `result = if (y != 0) (x & 0xFFFFFFFFL) / (y & 0xFFFFFFFFL) else 0xFFFFFFFFL`
- `REM` Test: values from `-10` to `9`; `result = if (y != 0) x % y else x`
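The reference results above can be gathered into one small software model. The following is a sketch in plain Scala, assuming the semantics of the RISC-V M extension (division by zero returns all ones, remainder by zero returns the dividend, and signed overflow of `-2^31 / -1` returns the dividend). `MExtRef` and its method names are hypothetical and are not part of the repository.

```scala
// Software reference model for the RV32 M extension (hypothetical helper).
// Operands and results are 32-bit values held in Scala Ints.
object MExtRef {
  private def u(x: Int): Long = x.toLong & 0xFFFFFFFFL // zero-extend to 64 bits
  private def s(x: Int): Long = x.toLong               // sign-extend to 64 bits

  // Low 32 bits of the product (Int multiplication wraps).
  def mul(a: Int, b: Int): Int = a * b
  // High 32 bits: signed x signed.
  def mulh(a: Int, b: Int): Int = ((s(a) * s(b)) >> 32).toInt
  // High 32 bits: signed x unsigned.
  def mulhsu(a: Int, b: Int): Int = ((BigInt(s(a)) * BigInt(u(b))) >> 32).toInt
  // High 32 bits: unsigned x unsigned (needs more than 64 bits, hence BigInt).
  def mulhu(a: Int, b: Int): Int = ((BigInt(u(a)) * BigInt(u(b))) >> 32).toInt

  def div(a: Int, b: Int): Int =
    if (b == 0) -1
    else if (a == Int.MinValue && b == -1) Int.MinValue // signed overflow case
    else a / b
  def divu(a: Int, b: Int): Int =
    if (b == 0) -1 else java.lang.Integer.divideUnsigned(a, b)
  def rem(a: Int, b: Int): Int =
    if (b == 0) a
    else if (a == Int.MinValue && b == -1) 0
    else a % b
  def remu(a: Int, b: Int): Int =
    if (b == 0) a else java.lang.Integer.remainderUnsigned(a, b)
}
```

A test bench could call these functions to produce the expected value for each operand pair instead of repeating the formulas inline; whether that fits the existing spec files is a design choice left to the author.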
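Test high register (x31): Write to and Read from Higher Registers
- Verifies access to the highest register (`x31`) in the register file.
- Set `writeEnable` to `true`.
- Write `42` to register `x31` (`rd` set to `31`).
- Set `rs1` to `31` to read the value of `x31`.
- The value read back should match the value written (`42`).
Remove or comment out the original "not decode MUL" test: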
//it should "not decode MUL" in {
//test(new Decoder) { c =>
// c.io.inst.poke(assemble("mul, x1, x2, x3"))
//c.io.ctrl.exception.peekBoolean() shouldBe true
//}
//}
Added M-extension test cases to the decoder spec. The following tests were added to verify the decoding functionality of the `Decoder` module. Each test ensures that the corresponding R-type instruction is decoded correctly, with the proper ALU operation and register fields.
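`MUL` Decode Test
- Verifies that the `Decoder` correctly decodes the `MUL` instruction.
- Instruction: `mul x1, x2, x3`
- Expected: `AluOp.MUL`, destination `x1`, sources `x2`, `x3`
checkRType(c, "mul x1, x2, x3", AluOp.MUL, 1, 2, 3)
`MULH` Decode Test
- Verifies that the `Decoder` correctly decodes the `MULH` instruction.
- Instruction: `mulh x1, x2, x3`
- Expected: `AluOp.MULH`, destination `x1`, sources `x2`, `x3`
checkRType(c, "mulh x1, x2, x3", AluOp.MULH, 1, 2, 3)
`MULHSU` Decode Test
- Verifies that the `Decoder` correctly decodes the `MULHSU` instruction.
- Instruction: `mulhsu x1, x2, x3`
- Expected: `AluOp.MULHSU`, destination `x1`, sources `x2`, `x3`
checkRType(c, "mulhsu x1, x2, x3", AluOp.MULHSU, 1, 2, 3)
`MULHU` Decode Test
- Verifies that the `Decoder` correctly decodes the `MULHU` instruction.
- Instruction: `mulhu x1, x2, x3`
- Expected: `AluOp.MULHU`, destination `x1`, sources `x2`, `x3`
checkRType(c, "mulhu x1, x2, x3", AluOp.MULHU, 1, 2, 3)
`DIV` Decode Test
- Verifies that the `Decoder` correctly decodes the `DIV` instruction.
- Instruction: `div x1, x2, x3`
- Expected: `AluOp.DIV`, destination `x1`, sources `x2`, `x3`
checkRType(c, "div x1, x2, x3", AluOp.DIV, 1, 2, 3)
`DIVU` Decode Test
- Verifies that the `Decoder` correctly decodes the `DIVU` instruction.
- Instruction: `divu x1, x2, x3`
- Expected: `AluOp.DIVU`, destination `x1`, sources `x2`, `x3`
checkRType(c, "divu x1, x2, x3", AluOp.DIVU, 1, 2, 3)
`REM` Decode Test
- Instruction: `rem x1, x2, x3`
- Expected: `AluOp.REM`, destination `x1`, sources `x2`, `x3`
checkRType(c, "rem x1, x2, x3", AluOp.REM, 1, 2, 3)
`REMU` Decode Test
- Instruction: `remu x1, x2, x3`
- Expected: `AluOp.REMU`, destination `x1`, sources `x2`, `x3`
checkRType(c, "remu x1, x2, x3", AluOp.REMU, 1, 2, 3)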
Because `RISCVAssembler` cannot assemble M-extension instructions, their encodings need to be added to `assemble` by hand:
def assemble(in: String): UInt = {
if (in == "ebreak") {
"b00000000000100000000000001110011".U
} else if (in == "ecall") {
"b00000000000000000000000001110011".U
} else if (in == "fence") {
"b00000000000000000000000000001111".U
} else {
in match {
// ---------------------- M extension ---------------------- //
case "mul x1, x2, x3" => "h023100B3".U // funct7=1, funct3=0, opcode=0x33
case "mulh x1, x2, x3" => "h023110B3".U(32.W) // funct7=1, funct3=1
case "mulhsu x1, x2, x3" => "h023120B3".U // funct7=1, funct3=2
case "mulhu x1, x2, x3" => "h023130B3".U
case "div x1, x2, x3" => "h023140B3".U // funct7=1, funct3=4
case "divu x1, x2, x3" => "h023150B3".U
case "rem x1, x2, x3" => "h023160B3".U
case "remu x1, x2, x3" => "h023170B3".U
// ---------------------- M extension ---------------------- //
// Anything that is not one of the M-extension instructions above is handed to the original assembler.
// -------------------------------------------------
case _ => ("b" + RISCVAssembler.binOutput(in)).U
}
}
}
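The hard-coded hex words above follow directly from the R-type layout (funct7, rs2, rs1, funct3, rd, opcode). As a cross-check, here is a minimal plain-Scala sketch that builds the same encodings from the register numbers. `RTypeEncoder` and `encode` are hypothetical names for illustration; the sketch assumes the standard M-extension values funct7 = 1 and opcode = 0x33 shown in the comments above.

```scala
// Builds R-type encodings for the M extension from their fields.
// RTypeEncoder is a hypothetical helper, not part of the repository.
object RTypeEncoder {
  // funct3 selects the operation within the M extension.
  val funct3Of: Map[String, Int] = Map(
    "mul" -> 0, "mulh" -> 1, "mulhsu" -> 2, "mulhu" -> 3,
    "div" -> 4, "divu" -> 5, "rem" -> 6, "remu" -> 7
  )

  def encode(mnemonic: String, rd: Int, rs1: Int, rs2: Int): Int = {
    require(Seq(rd, rs1, rs2).forall(r => r >= 0 && r < 32), "register index out of range")
    val funct7 = 0x01 // M extension
    val opcode = 0x33 // OP (register-register)
    (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
      (funct3Of(mnemonic) << 12) | (rd << 7) | opcode
  }
}
```

With this helper, `RTypeEncoder.encode("mul", 1, 2, 3)` yields `0x023100B3`, the same word as the hand-written case for `mul x1, x2, x3`.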
Because Memory.scala has not been modified, MemorySpec.scala does not need to be modified.
Add the M-extension instructions and convert RISC-V assembly instructions into the corresponding machine code:
def assemble(in: String): Int = {
in match {
case "mul x1, x2, x3" => Integer.parseUnsignedInt("023100B3", 16)
case "mulh x1, x2, x3" => Integer.parseUnsignedInt("023110B3", 16)
case "mulhsu x1, x2, x3" => Integer.parseUnsignedInt("023120B3", 16)
case "mulhu x1, x2, x3" => Integer.parseUnsignedInt("023130B3", 16)
case "div x1, x2, x3" => Integer.parseUnsignedInt("023140B3", 16)
case "divu x1, x2, x3" => Integer.parseUnsignedInt("023150B3", 16)
case "rem x1, x2, x3" => Integer.parseUnsignedInt("023160B3", 16)
case "remu x1, x2, x3" => Integer.parseUnsignedInt("023170B3", 16)
case "nop" => Integer.parseUnsignedInt("00000013", 16)
// ebreak has imm = 1 (0x00100073); ecall has imm = 0 (0x00000073)
case "ebreak" => Integer.parseUnsignedInt("00000000000100000000000001110011", 2)
case "ecall" => Integer.parseUnsignedInt("00000000000000000000000001110011", 2)
case "fence" => Integer.parseUnsignedInt("00000000000000000000000000001111", 2)
case other =>
val binStr = RISCVAssembler.binOutput(other)
Integer.parseUnsignedInt(binStr, 2)
}
}
Added a test program for the M extension
`MUL` Functionality Test
- Verifies that the `mul` instruction correctly multiplies two registers and stores the result in the destination register.
- Instruction: `mul x1, x2, x3`
- Setup: `x2` is set to `2`; `x3` is set to `3`.
- Expected: `x1 = x2 * x3 = 6`, so `x1` should be `6`.
`MULH` Functionality Test
- Verifies that the `mulh` instruction correctly multiplies two registers and stores the high 32 bits of the result in the destination register.
- Instruction: `mulh x1, x2, x3`
- Setup: `x2` is set to `1` and then shifted left by 16 bits (`x2 = 65536`); `x3` is set to `1` and then shifted left by 16 bits (`x3 = 65536`).
- Expected: `x1 = (65536 * 65536) >> 32 = 1`, so `x1` should be `1`.
`MULHSU` Functionality Test
- Verifies that the `mulhsu` instruction correctly multiplies two registers (with one operand signed and the other unsigned) and stores the high 32 bits of the result in the destination register.
- Instruction: `mulhsu x1, x2, x3`
- Setup: `x2` is set to `1` and then shifted left by 16 bits (`x2 = 65536`); `x3` is set to `1` and then shifted left by 16 bits (`x3 = 65536`).
- Expected: `x1 = (65536 * 65536) >> 32 = 1`, so `x1` should be `1`.
`MULHU` Functionality Test
- Verifies that the `mulhu` instruction correctly multiplies two unsigned registers and stores the high 32 bits of the result in the destination register.
- Instruction: `mulhu x1, x2, x3`
- Setup: `x2` is set to `1` and then shifted left by 16 bits (`x2 = 65536`); `x3` is set to `1` and then shifted left by 16 bits (`x3 = 65536`).
- Expected: `x1 = (65536 * 65536) >> 32 = 1`, so `x1` should be `1`.
`DIV` Functionality Test
- Verifies that the `div` instruction correctly divides two signed registers and stores the quotient in the destination register.
- Instruction: `div x1, x2, x3`
- Setup: `x2` is set to `10`; `x3` is set to `3`.
- Expected: `x1 = x2 / x3 = 3`, so `x1` should be `3`.
`DIVU` Functionality Test
- Verifies that the `divu` instruction correctly divides two unsigned registers and stores the quotient in the destination register.
- Instruction: `divu x1, x2, x3`
- Setup: `x2` is set to `10`; `x3` is set to `3`.
- Expected: `x1 = x2 / x3 = 3`, so `x1` should be `3`.
`REM` Functionality Test
- Verifies that the `rem` instruction correctly divides two signed registers and stores the remainder in the destination register.
- Instruction: `rem x1, x2, x3`
- Setup: `x2` is set to `10`; `x3` is set to `3`.
- Expected: `x1 = x2 % x3 = 1`, so `x1` should be `1`.
`REMU` Functionality Test
- Verifies that the `remu` instruction correctly divides two unsigned registers and stores the remainder in the destination register.
- Instruction: `remu x1, x2, x3`
- Setup: `x2` is set to `10`; `x3` is set to `3`.
- Expected: `x1 = x2 % x3 = 1`, so `x1` should be `1`.
Tests can be executed using `sbt test` to run all tests or with `sbt "testOnly mrv.DecorderSpec"` to run a specific test. Various test cases have been used to validate the functionality, and the results are displayed in the Test Results.
After upgrading the RISC-V CPU from `RV32E` to `RV32IM`, I will transition it from a single-cycle design to a 3-stage pipeline. The traditional 5-stage pipeline consists of Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write-Back stages. In my 3-stage pipeline, the CPU is reorganized into three stages.
To enable the CPU to function in a pipelined architecture, temporary registers need to be placed between the different stages. These registers hold the data necessary for the next stage as well as the results produced by the preceding one. Proper handling of these stage registers and guaranteeing the accurate transfer of information is essential for the pipeline to operate correctly.
Pipeline registers are used to hold and pass the necessary data and control signals from one stage of the pipeline to the next.
Used to transfer data between the `instruction fetch` stage and the `instruction decode` stage of the processor:
val if_id = Module(new PipelineReg(new Bundle {
val inst = UInt(32.W)
val pcPlus4 = UInt(32.W)
val rs1 = UInt(5.W)
val rs2 = UInt(5.W)
val rd = UInt(5.W)
val ctrl = new Ctrl
val rs1Data = UInt(32.W)
val rs2Data = UInt(32.W)
}))
if_id.io.in.inst := io.imem.readData
if_id.io.in.pcPlus4 := pcPlus4
if_id.io.in.rs1 := decoder.io.ctrl.rs1
if_id.io.in.rs2 := decoder.io.ctrl.rs2
if_id.io.in.rd := decoder.io.ctrl.rd
if_id.io.in.ctrl := decoder.io.ctrl
if_id.io.in.rs1Data := regFile.io.rs1Data
if_id.io.in.rs2Data := regFile.io.rs2Data
if_id.io.enable := true.B
if_id.io.flush := false.B
Used to transfer data between the execution stage and the memory access stage of the processor:
val ex_mem = Module(new PipelineReg(new Bundle {
val ctrl = new Ctrl
val aluOut = UInt(32.W)
val rd = UInt(5.W)
val rs2Data = UInt(32.W)
}))
This register helps resolve data hazards: it holds the result that is about to be written back, so the value can be passed to a dependent instruction before the write-back completes.
val mem_wb = Module(new PipelineReg(new Bundle {
val ctrl = new Ctrl
val wbData = UInt(32.W)
val rd = UInt(5.W)
}))
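When a data hazard occurs, the pipeline must `stall` or `forward` data to avoid incorrect results.
- Detection: a hazard exists when `ex_mem`'s `rd` matches `decoder`'s `rs1` or `rs2` and `regWrite` is true.
- Action: assert the `stall` signal to pause the PC and halt the pipeline progression.
- When a hazard is detected, the `stall` signal is asserted (`stall := true.B`).
val hazard = ex_mem.io.out.ctrl.regWrite &&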
ex_mem.io.out.rd =/= 0.U &&
(ex_mem.io.out.rd === decoder.io.ctrl.rs1 ||
ex_mem.io.out.rd === decoder.io.ctrl.rs2)
val stall = WireDefault(false.B)
when(hazard) {
stall := true.B
}
when(!stall) {
pc := nextPC
}.otherwise {
ex_mem.io.in.ctrl := 0.U.asTypeOf(ex_mem.io.in.ctrl)
ex_mem.io.in.aluOut := 0.U
ex_mem.io.in.rd := 0.U
ex_mem.io.in.rs2Data := 0.U
}
if_id.io.enable := !stall
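The hazard condition above is a pure Boolean function of three fields, so it can be checked in software before simulating the hardware. The following is a minimal plain-Scala sketch; `HazardModel` and `ExMem` are hypothetical names that mirror only the fields the condition reads, and they are not part of the Chisel design.

```scala
// Software model of the stall condition (hypothetical helper).
object HazardModel {
  // Only the two EX/MEM fields the check depends on.
  final case class ExMem(regWrite: Boolean, rd: Int)

  // True when the instruction being decoded reads a register that the
  // instruction in EX/MEM is about to write. x0 never causes a hazard.
  def hazard(exMem: ExMem, rs1: Int, rs2: Int): Boolean =
    exMem.regWrite && exMem.rd != 0 && (exMem.rd == rs1 || exMem.rd == rs2)
}
```

For instance, `HazardModel.hazard(HazardModel.ExMem(regWrite = true, rd = 5), 5, 3)` is true, while the same call with `rd = 0` is false because writes to `x0` are discarded.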
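Forwarding applies when the `rd` (destination register) from a later pipeline stage matches `rs1` or `rs2` of the current instruction:
- EX/MEM forwarding: if the instruction in the EX/MEM stage writes a register (`regWrite = true.B`) and its `rd` matches the current `rs1` or `rs2`, use the `aluOut` from the EX/MEM stage as the forwarded value.
- MEM/WB forwarding: if the instruction in the MEM/WB stage writes a register (`regWrite = true.B`) and its `rd` matches the current `rs1` or `rs2`, use the `wbData` from the MEM/WB stage.
- Forwarding updates `rs1Data` and `rs2Data` by dynamically selecting the most recent value available in the pipeline.
Example:
- Instruction 1: `add x5, x1, x2` (produces result in `x5`).
- Instruction 2: `sub x6, x5, x3` (requires the result from `x5`).
With forwarding, Instruction 2 uses the `aluOut` from Instruction 1 directly, avoiding a stall.
val ex_alu = Module(new Alu)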
val forwardA = WireDefault(if_id.io.out.rs1Data)
val forwardB = WireDefault(if_id.io.out.rs2Data)
val ex_mem = Module(new PipelineReg(new Bundle {
val ctrl = new Ctrl
val aluOut = UInt(32.W)
val rd = UInt(5.W)
val rs2Data = UInt(32.W)
}))
when(ex_mem.io.out.ctrl.regWrite && ex_mem.io.out.rd =/= 0.U) {
when(ex_mem.io.out.rd === if_id.io.out.rs1) {
forwardA := ex_mem.io.out.aluOut
}
when(ex_mem.io.out.rd === if_id.io.out.rs2) {
forwardB := ex_mem.io.out.aluOut
}
}
val mem_wb = Module(new PipelineReg(new Bundle {
val ctrl = new Ctrl
val wbData = UInt(32.W)
val rd = UInt(5.W)
}))
when(mem_wb.io.out.ctrl.regWrite && mem_wb.io.out.rd =/= 0.U) {
when(mem_wb.io.out.rd === if_id.io.out.rs1) {
forwardA := mem_wb.io.out.wbData
}
when(mem_wb.io.out.rd === if_id.io.out.rs2) {
forwardB := mem_wb.io.out.wbData
}
}
ex_alu.io.op := if_id.io.out.ctrl.aluOp
ex_alu.io.src1 := forwardA
ex_alu.io.src2 := Mux(if_id.io.out.ctrl.useImm, if_id.io.out.ctrl.imm.asUInt, forwardB)
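The operand selection can also be expressed as a small software function, which makes the priority between the two forwarding sources explicit. The sketch below is plain Scala with hypothetical names (`ForwardModel`, `StageOut`, `select`). It assumes that the EX/MEM value should win when both stages match, since that is the more recent result. Note that Chisel uses last-connect semantics, so in the listing above the later MEM/WB `when` block takes effect if both match; whether that case can arise in this design is not stated in the source.

```scala
// Software model of forwarding selection (hypothetical helper).
object ForwardModel {
  // One pipeline register's view: does it write, which register, what value.
  final case class StageOut(regWrite: Boolean, rd: Int, data: Long)

  // Choose the operand for source register `rs`.
  // Assumption: EX/MEM (newest) has priority over MEM/WB, then the register file.
  def select(rs: Int, regFileData: Long, exMem: StageOut, memWb: StageOut): Long = {
    def hit(s: StageOut): Boolean = s.regWrite && s.rd != 0 && s.rd == rs
    if (hit(exMem)) exMem.data
    else if (hit(memWb)) memWb.data
    else regFileData
  }
}
```

For the `add x5, x1, x2` / `sub x6, x5, x3` example, calling `select(5, stale, exMem, memWb)` with the `add` result sitting in `exMem` returns that result rather than the stale register-file value.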
I converted the original OneCycle program into a ThreeStage program and added simple stalling and forwarding to avoid data hazards. My 3-stage pipeline program is ThreeStage.scala, and it is paired with a test program, ThreeStageSpec.scala.
Two tests were added. The first ensures that the simulator handles RAW hazards correctly and maintains the correct execution order, checking that dependencies between registers are resolved properly.
The other focuses on memory operations, checking whether the simulator correctly supports load and store instructions.
RAW Hazard Test
Instructions:
- `addi x1, x0, 10` - Write 10 to register `x1`.
- `addi x2, x1, 5` - Add 5 to the value in `x1` and write the result to `x2`.
Expected Results:
- `x1` should be 10.
- `x2` should be 15.
Memory Load/Store Test
Verifies the `load (lw)` and `store (sw)` operations.
Instructions:
- `addi x1, x0, 5` - Write 5 to register `x1`.
- `sw x1, 100(x0)` - Store the value of `x1` (5) into memory at address 100.
- `lw x2, 100(x0)` - Load the value from memory at address 100 into `x2`.
Expected Results:
- `x1` should be 5.
- `x2` should be 5.
Tests can be executed using `sbt test` to run all tests or with `sbt "testOnly mrv.ThreeStageSpec"` to run a specific test. Various test cases have been used to validate the functionality, and the results are displayed in the Test Results.
This function, `fabsf`, is a custom implementation of the standard C function fabsf, which calculates the absolute value of a floating-point number.
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
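The same sign-bit trick can be written in Scala, the language used for the rest of this project, which is handy when a software reference is needed inside a test. This is a minimal sketch: `FabsBits` is a hypothetical name, and it relies on the JDK methods `Float.floatToRawIntBits` and `Float.intBitsToFloat` instead of pointer casts.

```scala
// Absolute value of a float by clearing the IEEE 754 sign bit (bit 31).
// FabsBits is a hypothetical helper for illustration.
object FabsBits {
  def fabsf(x: Float): Float = {
    val bits = java.lang.Float.floatToRawIntBits(x) // reinterpret float as int
    java.lang.Float.intBitsToFloat(bits & 0x7FFFFFFF) // clear the sign bit
  }
}
```

`FabsBits.fabsf(-5.0f)` returns `5.0f`, and negative zero maps to positive zero because only the sign bit changes.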
Absolute value Test (fabsf)
For the operation \( |x| \) where the input \( x = -5 \) (testing the computation of absolute value via subtraction; negating the value gives the absolute value here because the input is negative):
Instructions:
- `addi x1, x0, -5` - Load the value \(-5\) into register `x1`.
- `sub x1, x0, x1` - Subtract `x1` from `x0` to compute its absolute value (resulting in \( 5 \)).
- `add x2, x1, x0` - Copy the result from `x1` to `x2` (to verify the result).
Execution Steps:
- Execute the instructions in order and read back `x1` and `x2`.
Expected Results:
- After the first instruction (`addi x1, x0, -5`), `x1` should hold the value \(-5\).
- After the second instruction (`sub x1, x0, x1`), `x1` should hold the value \( 5 \) (absolute value of the initial value).
- After the third instruction (`add x2, x1, x0`), `x2` should hold the value \( 5 \), confirming that the absolute value computation was correct.
Verification:
- `x1` should be \( 5 \) after all instructions complete.
- `x2` should also be \( 5 \), as it mirrors the value of `x1`.
Fix the permissions of the uploaded pictures.
The formula defines a recursive calculation for the product \( nm \) based on the input values \( n \) and \( m \). The calculation depends on the parity of \( n \), as follows:
If \( n \) is even:
The value of \( nm \) is calculated as
\[ nm = \frac{n}{2} \cdot 2m \]
This means \( n \) is halved, and the result is multiplied by \( 2m \).
If \( n \) is odd:
The value of \( nm \) is calculated as
\[ nm = \frac{n - 1}{2} \cdot 2m + m \]
Here, \( n \) is reduced by 1 to make it even, halved, and multiplied by \( 2m \). Then, \( m \) is added to the result.
If \( n = 1 \):
The value of \( nm \) is simply \( m \).
This serves as the base case for the recursion or iterative process.
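The three cases above translate directly into a recursive function. The following plain-Scala sketch is one way to write it; `ShiftAddMul` is a hypothetical name, and the function is only meant to illustrate the recurrence, not to describe how the hardware `mul` instruction is implemented.

```scala
// Recursive multiplication by halving n and doubling m (hypothetical helper).
object ShiftAddMul {
  def mul(n: Int, m: Int): Int = {
    require(n >= 1, "the recurrence is defined for n >= 1")
    if (n == 1) m                              // base case
    else if (n % 2 == 0) mul(n / 2, 2 * m)     // n even: (n/2) * (2m)
    else mul((n - 1) / 2, 2 * m) + m           // n odd: ((n-1)/2) * (2m) + m
  }
}
```

For the 49 combinations with n and m between 1 and 7, this function agrees with the built-in `*` operator, which is the same range the multiplication test uses.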
Multiplication Test (n*m via mul)
For every combination of \( n \) and \( m \) where \( n, m \in \{1, 2, \ldots, 7\} \) (49 total combinations):
Instructions:
- `addi x2, x0, n` - Load the value \( n \) into register `x2`.
- `addi x3, x0, m` - Load the value \( m \) into register `x3`.
- `mul x1, x2, x3` - Multiply the values in `x2` and `x3` and store the result in `x1`.
Execution Steps:
- After the first instruction (`addi x2, x0, n`), the simulator waits for the instruction to complete (3 clock cycles total).
- After the second instruction (`addi x3, x0, m`), the simulator again waits for the instruction to complete (3 clock cycles total).
- After the third instruction (`mul x1, x2, x3`), the simulator waits for the multiplication to complete (3 clock cycles total).
Expected Results:
- `x2` should equal \( n \).
- `x3` should equal \( m \).
- `x1` should equal \( n * m \).
This code snippet calculates the position of the most significant bit (MSB) in a non-negative integer `N`.
int logint(int N)
{
int k = N, i = 0;
while (k) {
k >>= 1;
i++;
}
return i - 1;
}
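For use as a software reference in the Scala test bench, the same loop can be written in Scala. This is a sketch with a hypothetical name (`LogInt`); it adds a guard for negative inputs, which is an assumption on my part because an arithmetic right shift of a negative value never reaches zero.

```scala
// Position of the most significant set bit, mirroring the C loop above.
// LogInt is a hypothetical helper for illustration.
object LogInt {
  def logint(n: Int): Int = {
    require(n >= 0, "defined for non-negative inputs only")
    var k = n
    var i = 0
    while (k != 0) { // count how many right shifts empty the value
      k >>= 1
      i += 1
    }
    i - 1 // zero-based bit position; yields -1 when n is 0
  }
}
```

For positive inputs the result equals `31 - Integer.numberOfLeadingZeros(n)`, so `LogInt.logint(16)` is 4.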
Logint test
Assembly code
addi x2, x0, 16 # x2 = N = 16
addi x3, x0, 0 # x3 = i = 0
loop:
beq x2, x0, end # if (x2 == 0) => jump to end
srai x2, x2, 1 # x2 >>= 1
addi x3, x3, 1 # i++
jal x0, loop # unconditional jump to loop
end:
addi x4, x3, -1 # x4 = i - 1
nop
Execution Steps:
- `x2` is set to 16 (the initial value of N).
- `x3` is initialized to 0 (counter for iterations).
- Each loop iteration shifts `x2` right by 1 bit (dividing it by 2).
- Each loop iteration increments `x3` to count the number of shifts performed.
- After the loop, `x4` is calculated as `x3-1` to account for zero-based indexing of the MSB position.
Expected Results:
- `x4` should hold the value 4, indicating that the MSB of 16 (binary 10000) is at position 4.
Verification:
- Check that `x4` is updated correctly after all instructions are executed.
- Confirm that `x4` matches the calculated MSB position.
Do refer to the lecture materials and/or primary source.