# Construct RISC-V in Chisel
> 賴傑南
## Differences Between RV32E and RV32IM in Chisel RISC-V Processor Design
In Chisel-based RISC-V processor designs, **RV32E** and **RV32IM** are two distinct RISC-V architecture variants. The primary differences lie in the **number of registers** and the **supported instruction sets**.
### Comparison between RV32E and RV32IM:
| Feature | RV32E | RV32IM |
|-----------------------|-------------------------------|----------------------------------|
| **Number of Registers** | 16 (x0 - x15) | 32 (x0 - x31) |
| **Bit Width** | 32-bit | 32-bit |
| **Instruction Set** | RV32I | RV32I + M Extension |
| **Target Applications** | Ultra-low power embedded systems | General embedded and compute-intensive |
| **Computation Ability** | Basic operations | Supports multiplication/division |
| **Hardware Resources** | Minimal | More |
| **Power Consumption** | Lower | Higher |
---
## Part A : Change the code from RV32E to RV32IM
> Source Code <[rave32](https://github.com/64/rave32)> : This is an unpipelined RV32E (RISC-V 32-bit, embedded variant) CPU written in Chisel.
> I forked the original Repositories and pushed the program I modified to RV32IM into a new branch called <[RV32IM](https://github.com/Jiegoqqq/Computer_Architecture_term_project/tree/RV32IM)>.
- **32 general-purpose registers** (x0 to x31)
- **M Extension**: Integer multiplication/division instructions (`MUL`, `MULH`, `MULHSU`, `MULHU`, `DIV`, `DIVU`, `REM`, `REMU`)
### main/scala
#### Instruction set ([Inst.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/main/scala/Inst.scala))
- Added instructions related to the **M extension** (`MUL`, `MULH`, `MULHSU`, `MULHU`, `DIV`, `DIVU`, `REM`, `REMU`).
```s
// ---------------------- M extension ---------------------- //
// R-type, funct7 = 0x1
def MUL = BitPat("b0000001??????????000?????0110011")
def MULH = BitPat("b0000001??????????001?????0110011")
def MULHSU = BitPat("b0000001??????????010?????0110011")
def MULHU = BitPat("b0000001??????????011?????0110011")
def DIV = BitPat("b0000001??????????100?????0110011")
def DIVU = BitPat("b0000001??????????101?????0110011")
def REM = BitPat("b0000001??????????110?????0110011")
def REMU = BitPat("b0000001??????????111?????0110011")
// ---------------------- M extension ---------------------- //
```
#### ALU ([Alu.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/main/scala/Alu.scala))
- Added `MUL`, `MULH`, `MULHSU`, `MULHU`, `DIV`, `DIVU`, `REM`, and `REMU` operation codes.
```s
////Define the operation code
// M extension
val MUL = Value
val MULH = Value
val MULHSU = Value
val MULHU = Value
val DIV = Value
val DIVU = Value
val REM = Value
val REMU = Value
////Define calculation formula
// ---------------------- M extension ---------------------- //
is(AluOp.MUL) {
io.out := (io.src1 * io.src2)(31, 0)
}
is(AluOp.MULH) {
// High 32 bits of 4-bit product, signed multiplication
val product = (src1S * src2S).asSInt
io.out := product(63, 32).asUInt
}
is(AluOp.MULHSU) {
// src1 has a number, src2 has no number
val product = (src1S * io.src2).asSInt
io.out := product(63, 32).asUInt
}
is(AluOp.MULHU) {
// All without numbers
val product = (io.src1 * io.src2)
io.out := product(63, 32)
}
is(AluOp.DIV) {
when(src2S === 0.S) {
// Divide by 0, the result is undefined
io.out := "hFFFFFFFF".U
}.otherwise {
io.out := (src1S / src2S).asUInt
}
}
is(AluOp.DIVU) {
when(io.src2 === 0.U) {
io.out := "hFFFFFFFF".U
}.otherwise {
io.out := io.src1 / io.src2
}
}
is(AluOp.REM) {
when(src2S === 0.S) {
io.out := src1S.asUInt
}.otherwise {
io.out := (src1S % src2S).asUInt
}
}
is(AluOp.REMU) {
when(io.src2 === 0.U) {
io.out := io.src1
}.otherwise {
io.out := io.src1 % io.src2
}
}
// ---------------------- M extension ---------------------- //
```
#### [RegFile.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/main/scala/RegFile.scala)
- The original 16 registers have been expanded to **32 registers**.
- The bit width of `rs1`, `rs2`, and `rd` has been increased from `3.W` to **`5.W`**.
```s
// Change to 5 bits (0~31)
val rs1 = Input(UInt(5.W))
val rs2 = Input(UInt(5.W))
val rd = Input(UInt(5.W))
```
#### Decoder ([Decoder.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/main/scala/Decoder.scala))
- Added the above instructions to the decoding table, mapping them to corresponding `AluOp` and `InstFormat`.
```s
////Change to 5 bits (0~31)
val rs1 = UInt(5.W) // Change to 5 digits
val rs2 = UInt(5.W) // Change to 5 digits
val rd = UInt(5.W) // Change to 5 digits
////Add instruction to decoding table
// ---------------------- M extension ---------------------- //
Inst.MUL -> List(false.B, InstFormat.R, AluOp.MUL, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.MULH -> List(false.B, InstFormat.R, AluOp.MULH, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.MULHSU -> List(false.B, InstFormat.R, AluOp.MULHSU, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.MULHU -> List(false.B, InstFormat.R, AluOp.MULHU, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.DIV -> List(false.B, InstFormat.R, AluOp.DIV, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.DIVU -> List(false.B, InstFormat.R, AluOp.DIVU, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.REM -> List(false.B, InstFormat.R, AluOp.REM, MemOp.NONE, SpecialOp.NONE, false.B),
Inst.REMU -> List(false.B, InstFormat.R, AluOp.REMU, MemOp.NONE, SpecialOp.NONE, false.B),
// ---------------------- M extension ---------------------- //
```
#### [Memory.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/main/scala/Memory.scala)
- The memory architecture and access methods are the same for both **RV32E** and **RV32IM**.
- Instructions like **`LW`** (Load Word), **`SW`** (Store Word), as well as **`LB`**, **`LH`**, etc., are supported in both **RV32E** and **RV32IM**.
- These instructions are part of the **RV32I base instruction set**.
- **M extension** (multiplication and division) instructions do not involve memory operations.
- They are purely ALU operations and therefore do not affect the logic in **`Memory.scala`**.
- `Memory.scala` is responsible for **data access** rather than arithmetic operations.
- The memory module only cares about:
- **Addresses** (`addr`)
- **Data** (`writeData`, `readData`)
- **Memory operation types** (`MemOp`).
- Multiplication and division instructions executed by the ALU do not require additional memory access, hence **no changes** are needed for **`Memory.scala`**.
#### [OneCycle.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/main/scala/OneCycle.scala)
- `OneCycle.scala` mainly handles:
- **Instruction fetch** (`imem`)
- **Decoding** (`decoder`)
- **ALU execution** (`alu`)
- **Memory access** (`dmem`)
- **Writing results back to the register file** (`RegFile`).
```s
//Change to 5 bits, corresponding to 0~31
val readAddr = Input(UInt(5.W))
```
### test/scala
#### [AluSpec.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/test/scala/AluSpec.scala)
Added M extended ALU test program
1. `MUL` Test Instruction
- **Function**: Tests the ALU multiplication instruction, returning the lower 32 bits of the product.
- **Test Range**: `-10` to `9`.
- **Formula**:
```scala
result = (x * y) & 0xFFFFFFFFL
```
2. `MULH` Test Instruction
- **Function**: Tests the ALU multiplication instruction, returning the upper 32 bits of the product.
- **Test Range**: `-10` to `9`.
- **Formula**:
```scala
result = ((x * y) >> 32) & 0xFFFFFFFFL
```
3. `MULHSU` Test Instruction
- **Function**: Tests signed and unsigned mixed multiplication, returning the upper 32 bits of the product.
- **Test Range**: `-10` to `9`.
- **Formula**:
```scala
result = ((BigInt(x) * BigInt(y & 0xFFFFFFFFL)) >> 32).toLong & 0xFFFFFFFFL
```
4. `MULHU` Test Instruction
- **Function**: Tests unsigned multiplication, returning the upper 32 bits of the product.
- **Test Range**: `-10` to `9`.
- **Formula**:
```scala
result = ((BigInt(x & 0xFFFFFFFFL) * BigInt(y & 0xFFFFFFFFL)) >> 32).toLong
```
5. `DIV` Test Instruction
- **Function**: Tests signed division, including handling cases where the divisor is zero.
- **Test Range**: `-10` to `9`.
- **Formula**:
```scala
result = if (y != 0) x / y else -1
```
6. `DIVU` Test Instruction
- **Function**: Tests unsigned division, including handling cases where the divisor is zero.
- **Test Range**: `0` to `19`.
- **Formula**:
```scala
result = if (y != 0) (x & 0xFFFFFFFFL) / (y & 0xFFFFFFFFL) else 0xFFFFFFFFL
```
7. `REM` Test Instruction
- **Function**: Tests signed remainder, including handling cases where the divisor is zero.
- **Test Range**: `-10` to `9`.
- **Formula**:
```scala
result = if (y != 0) x % y else x
```
#### [RegFileSpec.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/test/scala/RegFileSpec.scala)
Test high register (x31), Write to and Read from Higher Registers
- **Function**: Tests the ability to write to and read from a higher register (specifically `x31`) in the register file.
- **Test Steps**:
1. Enable write operations by setting `writeEnable` to `true`.
2. Write the value `42` to register `x31` (`rd` set to `31`).
3. Perform a clock cycle to commit the write operation.
4. Set `rs1` to `31` to read the value of `x31`.
5. Verify that the read value matches the written value (`42`).
#### [DecoderSpec.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/test/scala/DecoderSpec.scala)
Remove or comment out the original "not decode MUL"
```scala
//it should "not decode MUL" in {
//test(new Decoder) { c =>
// c.io.inst.poke(assemble("mul, x1, x2, x3"))
//c.io.ctrl.exception.peekBoolean() shouldBe true
//}
//}
```
Added M extended test case to decoder. The following tests were added to verify the decoding functionality of the `Decoder` module. Each test ensures that the corresponding R-type instruction is decoded correctly, with the proper ALU operation and register fields.
1. `MUL` Decode Test
- **Function**: Verifies that the `Decoder` correctly decodes the `MUL` instruction.
- **Instruction**: `mul x1, x2, x3`
- **Expected Decoding**:
- **ALU Operation**: `AluOp.MUL`
- **Destination Register**: `x1`
- **Source Registers**: `x2`, `x3`
- **Example Test**:
```scala
checkRType(c, "mul x1, x2, x3", AluOp.MUL, 1, 2, 3)
```
2. `MULH` Decode Test
- **Function**:Verifies that the Decoder correctly decodes the `MULH`instruction.
- **Instruction**: `mulh x1, x2, x3`
- **Expected Decoding**:
- **ALU Operation**: `AluOp.MULH`
- **Destination Register**: `x1`
- **Source Registers**: `x2`, `x3`
- **Example Test**:
```scala
checkRType(c, "mulh x1, x2, x3", AluOp.MULH, 1, 2, 3)
```
3. `MULHSU` Decode Test
- **Function**: Verifies that the Decoder correctly decodes the `MULHSU` instruction.
- **Instruction**: `mulhsu x1, x2, x3`
- **Expected Decoding**:
- **ALU Operation**: `AluOp.MULHSU`
- **Destination Register**: `x1`
- **Source Registers**: `x2`, `x3`
- **Example Test**:
```scala
checkRType(c, "mulhsu x1, x2, x3", AluOp.MULHSU, 1, 2, 3)
```
4. `MULHU` Decode Test
- **Function**: Verifies that the Decoder correctly decodes the `MULHU` instruction.
- **Instruction**: `mulhu x1, x2, x3`
- **Expected Decoding**:
- **ALU Operation**: `AluOp.MULHU`
- **Destination Register**: `x1`
- **Source Registers**: `x2`, `x3`
- **Example Test**:
```scala
checkRType(c, "mulhu x1, x2, x3", AluOp.MULHU, 1, 2, 3)
```
5. `DIV` Decode Test
- **Function**: Verifies that the Decoder correctly decodes the `DIV` instruction.
- **Instruction**: `div x1, x2, x3`
- **Expected Decoding**:
- **ALU Operation**: `AluOp.DIV`
- **Destination Register**: `x1`
- **Source Registers**: `x2`, `x3`
- **Example Test**:
```scala
checkRType(c, "div x1, x2, x3", AluOp.DIV, 1, 2, 3)
```
6. `DIVU` Decode Test
- **Function**: Verifies that the Decoder correctly decodes the `DIVU` instruction.
- **Instruction**: `divu x1, x2, x3`
- **Expected Decoding**:
- **ALU Operation**: `AluOp.DIVU`
- **Destination Register**: `x1`
- **Source Registers**: `x2`, `x3`
- **Example Test**:
```scala
checkRType(c, "divu x1, x2, x3", AluOp.DIVU, 1, 2, 3)
```
7. `REM` Decode Test
- **Function**: Verifies that the Decoder correctly decodes the REM instruction.
- **Instruction**: `rem x1, x2, x3`
- **Expected Decoding**:
- **ALU Operation**: `AluOp.REM`
- **Destination Register**: `x1`
- **Source Registers**: `x2`, `x3`
- **Example Test**:
```scala
checkRType(c, "rem x1, x2, x3", AluOp.REM, 1, 2, 3)
```
8. `REMU` Decode Test
- **Function**: Verifies that the Decoder correctly decodes the REMU instruction.
- **Instruction**: `remu x1, x2, x3`
- **Expected Decoding**:
- **ALU Operation**: `AluOp.REMU`
- **Destination Register**: `x1`
- **Source Registers**: `x2`, `x3`
- **Example Test**:
```scala
checkRType(c, "remu x1, x2, x3", AluOp.REMU, 1, 2, 3)
```
Because the `RISCVAssembler` compiler cannot compile M-extended instructions, so need to add M-extended instructions to assemble
```s
def assemble(in: String): UInt = {
if (in == "ebreak") {
"b00000000000100000000000001110011".U
} else if (in == "ecall") {
"b00000000000000000000000001110011".U
} else if (in == "fence") {
"b00000000000000000000000000001111".U
} else {
in match {
// ---------------------- M extension ---------------------- //
case "mul x1, x2, x3" => "h023100B3".U // funct7=1, funct3=0, opcode=0x33
case "mulh x1, x2, x3" => "h023110B3".U(32.W) // funct7=1, funct3=1
case "mulhsu x1, x2, x3" => "h023120B3".U // funct7=1, funct3=2
case "mulhu x1, x2, x3" => "h023130B3".U
case "div x1, x2, x3" => "h023140B3".U // funct7=1, funct3=4
case "divu x1, x2, x3" => "h023150B3".U
case "rem x1, x2, x3" => "h023160B3".U
case "remu x1, x2, x3" => "h023170B3".U
// ---------------------- M extension ---------------------- //
// If it is not the above M command, it will be handed over to the original Assembler.
// -------------------------------------------------
case _ => ("b" + RISCVAssembler.binOutput(in)).U
}
}
```
#### [MemorySpec.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/test/scala/MemorySpec.scala)
Because Memory.scala has not been modified, MemorySpec.scala does not need to be modified.
#### [OneCycleSpec.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/RV32IM/src/test/scala/OneCycleSpec.scala)
Add M-extended instructions and Convert RISC-V assembly language instructions into corresponding machine code
```scala
def assemble(in: String): Int = {
in match {
case "mul x1, x2, x3" => Integer.parseUnsignedInt("023100B3", 16)
case "mulh x1, x2, x3" => Integer.parseUnsignedInt("023110B3", 16)
case "mulhsu x1, x2, x3" => Integer.parseUnsignedInt("023120B3", 16)
case "mulhu x1, x2, x3" => Integer.parseUnsignedInt("023130B3", 16)
case "div x1, x2, x3" => Integer.parseUnsignedInt("023140B3", 16)
case "divu x1, x2, x3" => Integer.parseUnsignedInt("023150B3", 16)
case "rem x1, x2, x3" => Integer.parseUnsignedInt("023160B3", 16)
case "remu x1, x2, x3" => Integer.parseUnsignedInt("023170B3", 16)
case "nop" => Integer.parseUnsignedInt("00000013", 16)
case "ebreak" => Integer.parseUnsignedInt("00000000000000000000000001110011", 2)
case "ecall" => Integer.parseUnsignedInt("00000000000100000000000001110011", 2)
case "fence" => Integer.parseUnsignedInt("00000000000000000000000000001111", 2)
case other =>
val binStr = RISCVAssembler.binOutput(other)
Integer.parseUnsignedInt(binStr, 2)
}
```
Added test program for M extension
1. `MUL` Functionality Test
- **Function**: Verifies that the `mul` instruction correctly multiplies two registers and stores the result in the destination register.
- **Instruction**: `mul x1, x2, x3`
- **Expected Behavior**:
- **Setup**:
- `x2` is set to `2`.
- `x3` is set to `3`.
- **Operation**: `x1 = x2 * x3 = 6`
- **Final State**:
- `x1` should be `6`.
- The processor should halt after execution.
2. `MULH` Functionality Test
- **Function**: Verifies that the `mulh` instruction correctly multiplies two registers and stores the high 32 bits of the result in the destination register.
- **Instruction**: `mulh x1, x2, x3`
- **Expected Behavior**:
- **Setup**:
- `x2` is set to `1` and then shifted left by 16 bits (`x2 = 65536`).
- `x3` is set to `1` and then shifted left by 16 bits (`x3 = 65536`).
- **Operation**: `x1 = (65536 * 65536) >> 32 = 1`
- **Final State**:
- `x1` should be `1`.
- The processor should halt after execution.
3. `MULHSU` Functionality Test
- **Function**: Verifies that the `mulhsu` instruction correctly multiplies two registers (with one operand signed and the other unsigned) and stores the high 32 bits of the result in the destination register.
- **Instruction**: `mulhsu x1, x2, x3`
- **Expected Behavior**:
- **Setup**:
- `x2` is set to `1` and then shifted left by 16 bits (`x2 = 65536`).
- `x3` is set to `1` and then shifted left by 16 bits (`x3 = 65536`).
- **Operation**: `x1 = (65536 * 65536) >> 32 = 1`
- **Final State**:
- `x1` should be `1`.
- The processor should halt after execution.
4. `MULHU` Functionality Test
- **Function**: Verifies that the `mulhu` instruction correctly multiplies two unsigned registers and stores the high 32 bits of the result in the destination register.
- **Instruction**: `mulhu x1, x2, x3`
- **Expected Behavior**:
- **Setup**:
- `x2` is set to `1` and then shifted left by 16 bits (`x2 = 65536`).
- `x3` is set to `1` and then shifted left by 16 bits (`x3 = 65536`).
- **Operation**: `x1 = (65536 * 65536) >> 32 = 1`
- **Final State**:
- `x1` should be `1`.
- The processor should halt after execution.
5. `DIV` Functionality Test
- **Function**: Verifies that the `div` instruction correctly divides two signed registers and stores the quotient in the destination register.
- **Instruction**: `div x1, x2, x3`
- **Expected Behavior**:
- **Setup**:
- `x2` is set to `10`.
- `x3` is set to `3`.
- **Operation**: `x1 = x2 / x3 = 3`
- **Final State**:
- `x1` should be `3`.
- The processor should halt after execution.
6. `DIVU` Functionality Test
- **Function**: Verifies that the `divu` instruction correctly multiplies two registers and stores the result in the destination register.
- **Instruction**: `divu x1, x2, x3`
- **Expected Behavior**:
- **Setup**:
- `x2` is set to `10`.
- `x3` is set to `3`.
- **Operation**: `x1 = x2 / x3 = 3`
- **Final State**:
- `x1` should be `3`.
- The processor should halt after execution.
7. `REM` Functionality Test
- **Function**: Verifies that the `rem` instruction correctly multiplies two registers and stores the result in the destination register.
- **Instruction**: `rem x1, x2, x3`
- **Expected Behavior**:
- **Setup**:
- `x2` is set to `10`.
- `x3` is set to `3`.
- **Operation**: `x1 = x2 % x3 = 1`
- **Final State**:
- `x1` should be `1`.
- The processor should halt after execution.
8. `REMU` Functionality Test
- **Function**: Verifies that the `remu` instruction correctly multiplies two registers and stores the result in the destination register.
- **Instruction**: `remu x1, x2, x3`
- **Expected Behavior**:
- **Setup**:
- `x2` is set to `10`.
- `x3` is set to `3`.
- **Operation**: `x1 = x2 % x3 = 1`
- **Final State**:
- `x1` should be `1`.
- The processor should halt after execution.
### Simulate and run tests
Tests can be executed using `sbt test` to run all tests or with `sbt "testOnly mrv.DecorderSpec"` to run a specific test. Various test cases have been used to validate the functionality, and the results are displayed in the [Test Results](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/ThreeStage/ThreeStage_test_result.txt).
## Part B : Rewrite OneCycle into a 3-stage pipeline
After upgrading the RISC-V CPU from `RV32E` to `RV32IM`, I will transition it from a single-cycle design to a 3-stage pipeline. The traditional 5-stage pipeline consists of Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write-Back stages. In my 3-stage pipeline, the CPU is reorganized into the following three stages:
- **Stage 1 :** Instruction Fetch (IF) and Instruction Decode (ID)
- **Stage 2 :** Execution (EX)
- **Stage 3 :** Memory Access (MEM) and Write-Back (WB)
To enable the CPU to function in a pipelined architecture, temporary registers need to be placed between the different stages. These registers hold the data necessary for the next stage as well as the results produced by the preceding one. Proper handling of these stage registers and guaranteeing the accurate transfer of information is essential for the pipeline to operate correctly.
### Pipeline register
Pipeline registers are used to hold and pass the necessary data and control signals from one stage of the pipeline to the next.
#### IF/ID to EXE register
Used to transfer data between the `instruction fetch` stage and the `instruction decode` stage of the processor
```scala
val if_id = Module(new PipelineReg(new Bundle {
val inst = UInt(32.W)
val pcPlus4 = UInt(32.W)
val rs1 = UInt(5.W)
val rs2 = UInt(5.W)
val rd = UInt(5.W)
val ctrl = new Ctrl
val rs1Data = UInt(32.W)
val rs2Data = UInt(32.W)
}))
if_id.io.in.inst := io.imem.readData
if_id.io.in.pcPlus4 := pcPlus4
if_id.io.in.rs1 := decoder.io.ctrl.rs1
if_id.io.in.rs2 := decoder.io.ctrl.rs2
if_id.io.in.rd := decoder.io.ctrl.rd
if_id.io.in.ctrl := decoder.io.ctrl
if_id.io.in.rs1Data := regFile.io.rs1Data
if_id.io.in.rs2Data := regFile.io.rs2Data
if_id.io.enable := true.B
if_id.io.flush := false.B
```
#### EXE to MEM/WB register
Used to transfer data between the `excution` stage of the processor
```scala
val ex_mem = Module(new PipelineReg(new Bundle {
val ctrl = new Ctrl
val aluOut = UInt(32.W)
val rd = UInt(5.W)
val rs2Data = UInt(32.W)
}))
```
#### MEM to WB register
This register is used to handle data hazard issues by transferring the results of the ALU calculations back before writing them back.
```scala
val mem_wb = Module(new PipelineReg(new Bundle {
val ctrl = new Ctrl
val wbData = UInt(32.W)
val rd = UInt(5.W)
}))
```
### Deal with Data Hazard
- **Purpose:** When data dependencies occur between multiple instructions in the pipeline, `stall` or `forward` data to avoid incorrect results.
- **Detection Conditions:**
- When `ex_mem`'s `rd` matches `decoder`'s `rs1` or `rs2` and `regWrite` is true.
- If the condition is met, activate the `stall` signal to pause the PC and halt the pipeline progression.
#### Stall
- **Stalling Logic**:
- When a hazard is detected, the `stall` signal is asserted (`stall := true.B`).
- The program counter (PC) and pipeline registers are frozen, effectively pausing the pipeline.
- The input to the EX/MEM pipeline register is cleared to prevent incorrect data from propagating further.
- **Impact of Stalling**:
- Stalling introduces a bubble into the pipeline, which delays execution but ensures correctness.
- This is a necessary step when forwarding is insufficient, such as when a result is still in the EX stage and not available for forwarding.
```scala
val hazard = ex_mem.io.out.ctrl.regWrite &&
ex_mem.io.out.rd =/= 0.U &&
(ex_mem.io.out.rd === decoder.io.ctrl.rs1 ||
ex_mem.io.out.rd === decoder.io.ctrl.rs2)
val stall = WireDefault(false.B)
when(hazard) {
stall := true.B
}
when(!stall) {
pc := nextPC
}.otherwise {
ex_mem.io.in.ctrl := 0.U.asTypeOf(ex_mem.io.in.ctrl)
ex_mem.io.in.aluOut := 0.U
ex_mem.io.in.rd := 0.U
ex_mem.io.in.rs2Data := 0.U
}
if_id.io.enable := !stall
```
#### Forwarding
- **Forwarding Logic**:
- In the EX stage, check if the `rd` (destination register) from a later pipeline stage matches `rs1` or `rs2` of the current instruction:
- **EX/MEM stage**:
- If the instruction in the EX/MEM stage will write back (`regWrite = true.B`) and its `rd` matches the current `rs1` or `rs2`, use the `aluOut` from the EX/MEM stage as the forwarded value.
- **MEM/WB stage**:
- Similarly, if the MEM/WB stage instruction writes back (`regWrite = true.B`) and its `rd` matches the current `rs1` or `rs2`, use the `wbData` from the MEM/WB stage.
- **Data Selection**:
- Forwarding overrides the default values of `rs1Data` and `rs2Data` by dynamically selecting the most recent value available in the pipeline.
- If no hazards are detected, the default register file outputs are used.
- **Impact of Forwarding**:
- **Reduced Stalls**:
- Forwarding allows the pipeline to proceed without delays, provided that the required data is available in a later stage.
- **Increased Complexity**:
- The forwarding unit requires additional hardware for hazard detection and data multiplexing but greatly improves performance.
- **Forwarding Example**:
- Instruction 1 (in EX stage): `add x5, x1, x2` (produces result in `x5`).
- Instruction 2 (in ID stage): `sub x6, x5, x3` (requires the result from `x5`).
- Without forwarding: Instruction 2 stalls until Instruction 1 writes back to the register file.
- With forwarding: Instruction 2 uses `aluOut` from Instruction 1 directly, avoiding a stall.
```scala
val ex_alu = Module(new Alu)
val forwardA = WireDefault(if_id.io.out.rs1Data)
val forwardB = WireDefault(if_id.io.out.rs2Data)
val ex_mem = Module(new PipelineReg(new Bundle {
val ctrl = new Ctrl
val aluOut = UInt(32.W)
val rd = UInt(5.W)
val rs2Data = UInt(32.W)
}))
when(ex_mem.io.out.ctrl.regWrite && ex_mem.io.out.rd =/= 0.U) {
when(ex_mem.io.out.rd === if_id.io.out.rs1) {
forwardA := ex_mem.io.out.aluOut
}
when(ex_mem.io.out.rd === if_id.io.out.rs2) {
forwardB := ex_mem.io.out.aluOut
}
}
val mem_wb = Module(new PipelineReg(new Bundle {
val ctrl = new Ctrl
val wbData = UInt(32.W)
val rd = UInt(5.W)
}))
when(mem_wb.io.out.ctrl.regWrite && mem_wb.io.out.rd =/= 0.U) {
when(mem_wb.io.out.rd === if_id.io.out.rs1) {
forwardA := mem_wb.io.out.wbData
}
when(mem_wb.io.out.rd === if_id.io.out.rs2) {
forwardB := mem_wb.io.out.wbData
}
}
ex_alu.io.op := if_id.io.out.ctrl.aluOp
ex_alu.io.src1 := forwardA
ex_alu.io.src2 := Mux(if_id.io.out.ctrl.useImm, if_id.io.out.ctrl.imm.asUInt, forwardB)
```
### Test CPU
I converted the original OneCycle program into a ThreeStage program, and added simple stalling and forwarding to avoid Data Hazard. My 3 stage pipeline program is [ThreeStage.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/ThreeStage/src/main/scala/ThreeStage.scala) and is paired with a test program [ThreeStageSpec.scala](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/ThreeStage/src/test/scala/ThreeStageSpec.scala).
These two tests ensure that the simulator can handle RAW hazards correctly and maintain the correct execution order, and check whether the simulator handles the dependencies between registers correctly.
The other focuses on memory operations, checking whether the emulator correctly supports data loading and storing instructions.
**RAW Hazard Test**
- **Function**: Verifies that the simulator handles a RAW (Read After Write) hazard correctly.
- **Instructions**:
1. `addi x1, x0, 10` - Write 10 to register `x1`.
2. `addi x2, x1, 5` - Add 5 to the value in `x1` and write the result to `x2`.
- **Expected Results**:
- After the first instruction, `x1` should be 10.
- After the second instruction, `x2` should be 15.
**Memory Load/Store Test**
- **Function**: Verifies that the simulator supports memory `load (lw)` and `store (sw)` operations.
- **Instructions**:
1. `addi x1, x0, 5` - Write 5 to register `x1`.
2. `sw x1, 100(x0)`` - Store the value of `x1` (5) into memory at address 100.
3. `lw x2, 100(x0)` - Load the value from memory at address 100 into `x2`.
- **Expected Results**:
- After the first instruction, `x1` should be 5.
- After the second instruction, the memory at address 100 should contain 5.
- After the third instruction, `x2` should be 5.
### Simulate and run tests
Tests can be executed using `sbt test` to run all tests or with `sbt "testOnly mrv.ThreeStageSpec"` to run a specific test. Various test cases have been used to validate the functionality, and the results are displayed in the [Test Results](https://github.com/Jiegoqqq/Computer_Architecture_term_project/blob/ThreeStage/ThreeStage_test_result.txt).
## Part C : Select from the course quiz questions rewrite the code to run on extended RISC-V processor
### Quiz question1 ([Quiz 1 Problem C](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C))
This function, `fabsf`, is a custom implementation of the standard C function fabsf, which calculates the absolute value of a floating-point number.
```s
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
```
**Absolute value Test (fabsf)**
For the operation `|x|` where the input \( x = -5 \) (testing the computation of absolute value via subtraction):
- **Instructions**:
- `addi x1, x0, -5` - Load the value \(-5\) into register `x1`.
- `sub x1, x0, x1` - Subtract `x1` from `x0` to compute its absolute value (resulting in \( 5 \)).
- `add x2, x1, x0` - Copy the result from `x1` to `x2` (to verify the result).
- **Execution Steps**:
- The simulator loads all three instructions into the pipeline.
- Wait for the pipeline to complete execution, allowing sufficient clock cycles (6 total).
- After execution, check the values in registers `x1` and `x2`.
- **Expected Results**:
- After executing the first instruction (`addi x1, x0, -5`), `x1` should hold the value \(-5\).
- After executing the second instruction (`sub x1, x0, x1`), `x1` should hold the value \( 5 \) (absolute value of the initial value).
- After executing the third instruction (`add x2, x1, x0`), `x2` should hold the value \( 5 \), confirming that the absolute value computation was correct.
- **Verification**:
- Register `x1` should be \( 5 \) after all instructions complete.
- Register `x2` should also be \( 5 \), as it mirrors the value of `x1`.
### Quiz question2 ([Quiz 2 Problem D](https://hackmd.io/@sysprog/arch2024-quiz2-sol#Problem-D))
:::danger
Fix the permissions of the uploaded pictures.
:::
The formula defines a recursive calculation for a value 𝑛𝑚 based on the input values 𝑛 and 𝑚. The calculation varies depending on the properties of 𝑛, as follows:
1. **If \( n \) is even**:
The value of \( nm \) is calculated as 
This means \( n \) is halved, and the result is multiplied by \( 2m \).
2. **If \( n \) is odd**:
The value of \( nm \) is calculated as 
Here, \( n \) is reduced by 1 to make it even, halved, and multiplied by \( 2m \). Then, \( m \) is added to the result.
3. **If \( n = 1 \)**:
The value of \( nm \) is simply \( m \).
This serves as the base case for the recursion or iterative process.
$$
nm =
\begin{cases}
\frac{n}{2} \cdot 2m & \text{if } n \text{ is even,} \\
\frac{n-1}{2} \cdot 2m + m & \text{if } n \text{ is odd,} \\
m & \text{if } n = 1.
\end{cases}
$$
**Multiplication Test (n\*m via mul)**
For every combination of \( n \) and \( m \) where \( n, m \in \{1, 2, \ldots, 7\} \) (49 total combinations):
1. `addi x2, x0, n` - Load the value \( n \) into register `x2`.
2. `addi x3, x0, m` - Load the value \( m \) into register `x3`.
3. `mul x1, x2, x3` - Multiply the values in `x2` and `x3` and store the result in `x1`.
- **Execution Steps**:
1. After the first instruction (`addi x2, x0, n`), the simulator waits for the instruction to complete (3 clock cycles total).
2. After the second instruction (`addi x3, x0, m`), the simulator again waits for the instruction to complete (3 clock cycles total).
3. After the third instruction (`mul x1, x2, x3`), the simulator waits for the multiplication to complete (3 clock cycles total).
- **Expected Result ( For each combination of \( n \) and \( m \) )**:
1. After the first instruction, `x2` should equal \( n \).
2. After the second instruction, `x3` should equal \( m \).
3. After the third instruction, `x1` should equal \( n \* m \).
### Quiz question3 ([Quiz 5 Problem D](https://hackmd.io/@sysprog/arch2024-quiz5-sol#Problem-D))
This code snippet calculates the position of the most significant bit (MSB) in a non-negative integer `N`.
```s
int logint(int N)
{
int k = N, i = 0;
while (k) {
k >>= 1;
i++;
}
return i - 1;
}
```
**Logint test**
assmebly code
```s
addi x2, x0, 16 # x2 = N = 16
addi x3, x0, 0 # x3 = i = 0
loop:
beq x2, x0, end # if (x2 == 0) => jump to end
srai x2, x2, 1 # x2 >>= 1
addi x3, x3, 1 # i++
jal x0, loop # unconditional jump to loop
end:
addi x4, x3, -1 # x4 = i - 1
nop
```
- **Execution Steps**:
- Initialization:
- `x2` is set to 16 (the initial value of 𝑁).
- `x3` is initialized to 0 (counter for iterations).
- Loop Execution:
- The loop iterates, repeatedly right-shifting `x2` by 1 bit (dividing it by 2).
- Each iteration increments `x3` to count the number of shifts performed.
- The loop exits when x2 becomes 0.
- Final Step:
- After the loop, `x4` is calculated as `x3-1` to account for zero-based indexing of the MSB position.
- **Expected Results**:
- After executing the program:
- Register `x4` should hold the value 4, indicating that the MSB of 16(binary 10000) is at position 4.
- **Verification**:
- The test confirms that `x4` is updated correctly after all instructions are executed.
- The expected output in `x4` matches the calculated MSB position.
## Reference
:::danger
Do refer to the lecture materials and/or primary source.
:::
* [Pipeline Processor](https://hackmd.io/@joanne8826/HkT32O85I#4-3%E2%80%934-Data-Hazard-and-forwarding)
* [3-stage pipeline](https://hackmd.io/@shhung/Syfu0Z3Ip)
* [pipeline hazard](https://king0980692.medium.com/computer-architecture-cheat-sheet-pipeline-hazard-ee27d0d66e89)
* [pipeline](https://tobygao.github.io/Learning-Lounge/2017/12/15/ca-c-pipelining.html)