# Extend riscv-mini
> 蕭維昭, 吳柏漢
>
## Introduction
`riscv-mini` is a basic 3-stage RISC-V pipeline implemented in Chisel. Its pipeline has only Fetch, Execute, and Write Back stages, which keeps the design and implementation straightforward and makes it well suited to teaching and prototyping.
However, compared with a 5-stage CPU, a 3-stage pipeline may suffer from lower instruction throughput and a lower achievable clock frequency, because each stage has to do more work per cycle.
In this project we address this problem by extending riscv-mini to a 5-stage CPU.
We based our work on the [implementation by YangKefan-rk](https://github.com/YangKefan-rk/riscv-mini) and looked for potential flaws in this extended implementation. We also attempted to address these issues to ensure robust handling of various hazard-related challenges.
## Data path of the 3-stage pipeline

From this diagram, we can see a 3-stage pipeline implementing a simplified RISC-V CPU. The pipeline is divided into three main stages:
**Fetch, Execute, Write Back**
## Extending RISC-V mini to 5 stages
Here we follow the classic RISC pipeline: we split **Execute** into **Decode** and **Execute**, and split **Write Back** into **Memory** and **Write Back**.
## Data path of extended RISC-V mini
### Fetch
The main code of the Fetch stage is shown below:
```scala!
pc := next_pc
io.icache.req.bits.addr := next_pc
io.icache.req.bits.data := 0.U
io.icache.req.bits.mask := 0.U
io.icache.req.valid := !stall
io.icache.abort := false.B
```
Each line does the following:
* `pc := next_pc` : Updates the program counter (pc) with the computed value of `next_pc`.
* `io.icache.req.bits.addr := next_pc` : Sends the address of the next instruction (`next_pc`) to the instruction cache (`io.icache.req.bits.addr`).
* `io.icache.req.bits.data := 0.U` : Sets the data field of the instruction cache request to 0.U (`zero`).
* `io.icache.req.bits.mask := 0.U` : Sets the mask field of the instruction cache request to 0.U (`zero`).
* `io.icache.req.valid := !stall` : Sets the valid signal of the instruction cache request to true if there is no stall (`!stall`).
* `io.icache.abort := false.B` : Sets the abort signal of the instruction cache to false.
The value of `next_pc` is selected dynamically with `MuxCase`. When several conditions hold at once, the one listed first takes priority.
```scala!
val next_pc = MuxCase(
  pc + 4.U,
  IndexedSeq(
    stall -> pc,
    csr.io.expt -> csr.io.evec,
    (em_reg.pc_sel === PC_EPC) -> csr.io.epc,
    ((de_reg.pc_sel === PC_ALU) || (de_reg.taken)) -> (alu.io.sum >> 1.U << 1.U),
    (io.ctrl.pc_sel === PC_0) -> pc
  )
)
```
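The priority order of this selection can be sketched as a small software model. This is only an illustration, not part of the Chisel source: the function name `select_next_pc`, its boolean parameters, and the 32-bit mask are assumptions made for the example.

```python
XLEN_MASK = 0xFFFFFFFF  # assumption: 32-bit datapath (RV32)


def select_next_pc(pc, stall, expt, evec, eret, epc, redirect, alu_sum, hold):
    """Software model of the MuxCase above: the first true condition wins."""
    cases = [
        (stall, pc),                             # stall: keep the current PC
        (expt, evec),                            # csr.io.expt: use csr.io.evec
        (eret, epc),                             # em_reg.pc_sel === PC_EPC: use csr.io.epc
        (redirect, (alu_sum & ~1) & XLEN_MASK),  # PC_ALU or taken branch, bit 0 cleared
        (hold, pc),                              # io.ctrl.pc_sel === PC_0: hold the PC
    ]
    for condition, value in cases:
        if condition:
            return value
    return (pc + 4) & XLEN_MASK                  # default: next sequential instruction


# A stall takes priority over a simultaneous exception.
print(hex(select_next_pc(0x100, True, True, 0x80, False, 0x40, False, 0, False)))
# A jump target of 0x205 is forced to an even address.
print(hex(select_next_pc(0x100, False, False, 0x80, False, 0x40, True, 0x205, False)))
```

Listing the cases in the same order as the `IndexedSeq` makes it easy to see that a stall masks every other condition.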
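The Fetch stage also substitutes a NOP for the fetched instruction whenever that instruction must be discarded, for example on a jump, a taken branch, or an exception: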
```scala!
val inst = Mux(started || io.ctrl.inst_kill || (io.ctrl.pc_sel === PC_ALU) || de_reg.taken || brCond.io.taken || csr.io.expt || de_reg.pc_sel === PC_ALU || de_reg.pc_sel === PC_EPC || em_reg.pc_sel === PC_EPC,
Instructions.NOP, io.icache.resp.bits.data)
```
Finally, the Fetch-Decode pipeline register is updated only when the pipeline is not stalled:
```scala!
when(!stall) {
  fd_reg.pc := pc
  fd_reg.inst := inst
}
```
### Decode
#### Fetch-Decode (FD) Instruction Assignment
```scala!
io.ctrl.inst := fd_reg.inst
```
Assigns the fetched instruction (`fd_reg.inst`) to the control unit (`io.ctrl.inst`).
#### Register File Read
```scala!
val rd_addr = fd_reg.inst(11, 7)
val rs1_addr = fd_reg.inst(19, 15)
val rs2_addr = fd_reg.inst(24, 20)
regFile.io.raddr1 := rs1_addr
regFile.io.raddr2 := rs2_addr
```
This code decodes the instruction to extract the source and destination register addresses, then assigns the source register addresses (`rs1_addr` and `rs2_addr`) to the register file read ports (`regFile.io.raddr1` and `regFile.io.raddr2`).
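The same bit slices can be checked in software. The sketch below is an illustrative model, with `reg_fields` as a hypothetical helper name; the instruction word `0x00B50533` is the standard RV32I encoding of `add a0, a0, a1`.

```python
def reg_fields(inst):
    """Extract (rd, rs1, rs2) from a 32-bit RISC-V instruction word."""
    rd = (inst >> 7) & 0x1F    # inst(11, 7)
    rs1 = (inst >> 15) & 0x1F  # inst(19, 15)
    rs2 = (inst >> 20) & 0x1F  # inst(24, 20)
    return rd, rs1, rs2


# add a0, a0, a1: a0 is x10 and a1 is x11
print(reg_fields(0x00B50533))
```

Each field is five bits wide because the register file has 32 registers.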
#### Immediate Generation
```scala!
immGen.io.inst := fd_reg.inst
immGen.io.sel := io.ctrl.imm_sel
```
This code generates the immediate value from the instruction (`fd_reg.inst`) and the immediate-type control signal (`io.ctrl.imm_sel`).
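As an illustration of what immediate generation involves, the sketch below decodes the I-type and S-type immediates defined by the RISC-V base ISA. It is not the `ImmGen` module itself: the helper names `sign_extend`, `imm_i`, and `imm_s` are hypothetical, and only two of the immediate formats are covered.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def imm_i(inst):
    """I-type immediate: inst[31:20]."""
    return sign_extend(inst >> 20, 12)


def imm_s(inst):
    """S-type immediate: inst[31:25] concatenated with inst[11:7]."""
    upper = (inst >> 25) & 0x7F
    lower = (inst >> 7) & 0x1F
    return sign_extend((upper << 5) | lower, 12)


print(imm_i(0xFFF00513))  # addi a0, zero, -1
print(imm_s(0xFEB52E23))  # sw a1, -4(a0)
```

The S-type case shows why a dedicated unit is useful: the immediate bits are scattered across the instruction word and have to be reassembled before sign extension.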
#### Forwarding Logic
With the main blocks of the Decode stage described, the operand values are selected by the following forwarding logic:
```scala!
val rs1 = MuxCase(regFile.io.rdata1, Seq(
  (de_rs1hazard) -> alu.io.out,
  (em_rs1hazard) -> em_regWrite,
  (mw_rs1hazard) -> regWrite
))
val rs2 = MuxCase(regFile.io.rdata2, Seq(
  (de_rs2hazard) -> alu.io.out,
  (em_rs2hazard) -> em_regWrite,
  (mw_rs2hazard) -> regWrite
))
```
where `rs1` and `rs2` are the final values of the source registers after forwarding is applied. When no hazard is detected, the value read from the register file is used.
The forwarding sources are:
* `alu.io.out`: ALU result from the Execute stage (EX/DE hazard).
* `em_regWrite`: Result from the Memory stage (MEM/DE hazard).
* `regWrite`: Result from the Write Back stage (WB/DE hazard).
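The forwarding priority can be modelled in a few lines. This is a sketch under the assumption that the hazard flags are already computed elsewhere; the function `forward_operand` and its parameter names are invented for the example.

```python
def forward_operand(rf_data, de_hazard, alu_out, em_hazard, em_reg_write,
                    mw_hazard, reg_write):
    """Pick the newest available value for a source register."""
    if de_hazard:       # producer is in Execute: take the ALU output
        return alu_out
    if em_hazard:       # producer is in Memory
        return em_reg_write
    if mw_hazard:       # producer is in Write Back
        return reg_write
    return rf_data      # no hazard: the register file value is up to date


# Two in-flight instructions write the same register: the younger one (Execute) wins.
print(forward_operand(1, True, 38, True, 7, False, 0))
```

Checking the stages from youngest to oldest matters when several in-flight instructions write the same register, since the most recent write must be the one observed.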
#### Pipelining De/Ex
When a reset (`reset.asBool`) or an exception (`csr.io.expt`) occurs, the control fields of the Decode-Execute pipeline register are cleared so that invalid values do not propagate. Under normal operation, the register is updated with valid data, including the ALU inputs (`alu_a`, `alu_b`) and operation (`alu_op`), the control signals for memory access (`st_type`, `ld_type`), the write-back enable signal (`wb_en`), and the forwarded value of the second source register (`rs2`), which later serves as store data. This ensures proper execution of instructions and maintains the integrity of the pipeline.
```scala!
when(reset.asBool || !stall && csr.io.expt) {
  // Reset DE pipeline registers or clear on exception
  de_reg.st_type := 0.U
  de_reg.csr_cmd := 0.U
  de_reg.illegal := false.B
  de_reg.pc_check := false.B
  de_reg.ld_type := 0.U
  de_reg.wb_en := false.B
  de_reg.taken := false.B
  de_reg.pc_sel := 0.U
}.elsewhen(!stall && !csr.io.expt) {
  // Update DE pipeline registers when no stall or exception
  de_reg.pc := fd_reg.pc
  de_reg.inst := fd_reg.inst
  de_reg.alu_a := Mux(io.ctrl.A_sel === A_RS1, rs1, fd_reg.pc)
  de_reg.alu_b := Mux(io.ctrl.B_sel === B_RS2, rs2, immGen.io.out)
  de_reg.csr_in := Mux(io.ctrl.imm_sel === IMM_Z, immGen.io.out, rs1)
  de_reg.alu_op := io.ctrl.alu_op
  de_reg.st_type := io.ctrl.st_type
  de_reg.csr_cmd := io.ctrl.csr_cmd
  de_reg.illegal := io.ctrl.illegal
  de_reg.pc_check := io.ctrl.pc_sel === PC_ALU
  de_reg.ld_type := io.ctrl.ld_type
  de_reg.wb_sel := io.ctrl.wb_sel
  de_reg.wb_en := io.ctrl.wb_en
  de_reg.taken := brCond.io.taken
  de_reg.pc_sel := io.ctrl.pc_sel
  de_reg.rs2 := rs2
}
```
### Execute
#### ALU Calculation
This part sets up the inputs and operation type for the ALU
```scala!
alu.io.A := de_reg.alu_a
alu.io.B := de_reg.alu_b
alu.io.alu_op := de_reg.alu_op
```
The values come from the Decode-Execute pipeline register (`de_reg`):
* `alu_a` and `alu_b` are the two operands for the ALU.
* `alu_op` specifies the operation the ALU should perform.
#### Dcache Access
```scala!
val woffset = (alu.io.sum(1) << 4.U).asUInt | (alu.io.sum(0) << 3.U).asUInt
val daddr = alu.io.sum >> 2.U << 2.U
io.dcache.req.valid := !stall && (de_reg.st_type.orR || de_reg.ld_type.orR)
io.dcache.req.bits.addr := daddr
io.dcache.req.bits.data := de_reg.rs2 << woffset
io.dcache.req.bits.mask := MuxLookup(de_reg.st_type, "b0000".U)(
  Seq(
    ST_SW -> "b1111".U,
    ST_SH -> ("b11".U << alu.io.sum(1, 0)),
    ST_SB -> ("b1".U << alu.io.sum(1, 0))
  )
)
```
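* `woffset`: the bit offset of the store data within the 32-bit word. It is derived from the two low bits of the ALU sum and takes one of four values: 0, 8, 16, or 24.
* `daddr`: the ALU sum with its two low bits cleared, i.e. the word-aligned address.
* `io.dcache.req.bits.addr`: the address of the data, set to `daddr`.
* `io.dcache.req.valid`: indicates whether the request is valid. It is asserted only when the pipeline is not stalled and the instruction is a load or a store.
* `io.dcache.req.bits.data`: the data to be stored, shifted left by `woffset` into the addressed byte lanes.
* `io.dcache.req.bits.mask`: the byte-enable mask that selects which bytes are written: 4 bytes (`0b1111`), 2 bytes (`0b11`), a single byte (`0b1`), or none (`0b0000`). The 2-byte and 1-byte masks are shifted to the addressed byte lanes.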
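To make the alignment arithmetic concrete, here is a small software model of the store path. It is a sketch rather than the actual hardware: the string tags standing in for `ST_SW`, `ST_SH`, and `ST_SB`, the function name `store_request`, and the explicit truncation to 32 data bits and 4 mask bits are assumptions of the example.

```python
ST_SW, ST_SH, ST_SB = "sw", "sh", "sb"  # hypothetical stand-ins for the Chisel constants


def store_request(addr, rs2, st_type):
    """Return (daddr, data, mask) for a store to byte address addr."""
    byte_lane = addr & 0x3                  # alu.io.sum(1, 0)
    woffset = byte_lane * 8                 # 0, 8, 16 or 24
    daddr = (addr & ~0x3) & 0xFFFFFFFF      # word-aligned address
    data = (rs2 << woffset) & 0xFFFFFFFF    # move the data into its byte lanes
    masks = {
        ST_SW: 0b1111,
        ST_SH: 0b11 << byte_lane,
        ST_SB: 0b1 << byte_lane,
    }
    mask = masks.get(st_type, 0b0000) & 0xF  # no store: nothing is written
    return daddr, data, mask


daddr, data, mask = store_request(0x1003, 0xAB, ST_SB)
print(hex(daddr), hex(data), bin(mask))
```

For a byte store to address `0x1003`, the byte lands in the top lane of the word at `0x1000` and only the top mask bit is set.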
### Memory
#### Load
The `MEM` section starts at line 321 in `datapath.scala`, where the `Load` operation is described.
```scala!
val loffset = (em_reg.alu(1) << 4.U).asUInt | (em_reg.alu(0) << 3.U).asUInt
val load_reg_valid = !io.icache.resp.valid && io.dcache.resp.valid && em_reg.ld_type =/= LD_XXX
```
Here, `loffset` calculates the bit offset from the address computed by the ALU; it is used to align the loaded data.
Next, `load_reg_valid` indicates that valid load data has arrived. It requires that the instruction cache response is not valid (`io.icache.resp.valid` is false), that the data cache response is valid (`io.dcache.resp.valid` is true), and that the current instruction is actually a load (`ld_type` is not `LD_XXX`).
The load state is then tracked:
```scala!
val load_state = RegInit(false.B)
switch(load_state) {
  is(false.B) {
    when(load_reg_valid) {
      load_state := true.B
    }
  }
  is(true.B) {
    when(!stall) {
      load_state := false.B
    }
  }
}
```
Since loading data from memory or cache takes time, `load_state` is defined to help determine the loading condition. When `load_reg_valid` is true, the load state is entered. If no stalls (`!stall`) occur, the state resets to `false.B`.
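The two-state behaviour described above can be written as a next-state function. This is an illustrative model only, and `next_load_state` is a hypothetical name.

```python
def next_load_state(load_state, load_reg_valid, stall):
    """Next value of load_state for one clock cycle."""
    if not load_state:
        # Idle: enter the load state once valid load data arrives.
        return bool(load_reg_valid)
    # In the load state: stay there while stalled, leave once the stall clears.
    return bool(stall)


# Data arrives while the pipeline is stalled, then the stall is released.
state = False
for valid, stall in [(True, True), (False, True), (False, False)]:
    state = next_load_state(state, valid, stall)
    print(state)
```

The state therefore stays set for exactly as long as the pipeline remains stalled after the data cache has responded.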
Next, data source selection and conversion are performed:
```scala!
val load_reg = RegEnable(io.dcache.resp.bits.data, 0.U(conf.xlen.W), load_reg_valid && !load_state)
val load_data = Mux(load_state, load_reg, io.dcache.resp.bits.data)
val lshift = load_data >> loffset
load := MuxLookup(em_reg.ld_type, load_data.zext)(
  Seq(
    LD_LH -> lshift(15, 0).asSInt,
    LD_LB -> lshift(7, 0).asSInt,
    LD_LHU -> lshift(15, 0).zext,
    LD_LBU -> lshift(7, 0).zext
  )
)
```
`load_data` selects the data source based on `load_state`. In the load state, the data saved in `load_reg` is used; otherwise, the data from the cache response is used. The data is then right-shifted by `loffset` for alignment, and the part selected by `ld_type` is extracted and extended:
* `LD_LH` and `LD_LB` indicate signed loading of a halfword and a byte, respectively.
* `LD_LHU` and `LD_LBU` indicate unsigned loading of a halfword and a byte, respectively.
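The shift-then-extend step can be illustrated with a software model that returns the 32-bit pattern eventually written to the register. The string tags used for the load types and the names `sign_extend` and `load_value` are assumptions of this sketch.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def load_value(word, addr, ld_type):
    """Model of the load path: shift by loffset, then extend according to ld_type."""
    loffset = (addr & 0x3) * 8
    lshift = (word & 0xFFFFFFFF) >> loffset
    if ld_type == "lh":
        return sign_extend(lshift & 0xFFFF, 16) & 0xFFFFFFFF
    if ld_type == "lb":
        return sign_extend(lshift & 0xFF, 8) & 0xFFFFFFFF
    if ld_type == "lhu":
        return lshift & 0xFFFF
    if ld_type == "lbu":
        return lshift & 0xFF
    return word & 0xFFFFFFFF  # default (word load): the unshifted data


word = 0x8001F280
print(hex(load_value(word, 0x2, "lh")))   # upper halfword, sign-extended
print(hex(load_value(word, 0x2, "lhu")))  # upper halfword, zero-extended
```

The signed and unsigned variants read the same bits and differ only in how the upper bits of the result are filled.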
#### CSR
The Control and Status Registers (CSR) manage system-level operations and configurations, providing essential information about the processor state. These registers handle privileged operations such as interrupt processing, exception management, timer configuration, and system monitoring.
```scala!
csr.io.stall := stall
csr.io.in := em_reg.csr_in
csr.io.cmd := em_reg.csr_cmd
csr.io.inst := em_reg.inst
csr.io.pc := em_reg.pc
csr.io.addr := em_reg.alu
csr.io.illegal := em_reg.illegal
csr.io.pc_check := em_reg.pc_check
csr.io.ld_type := em_reg.ld_type
csr.io.st_type := em_reg.st_type
io.host <> csr.io.host
```
In the MEM stage, the code above connects the pipeline to the CSR unit:
* `csr.io.stall`: Notifies the CSR whether to pause operations based on the pipeline's execution status.
* `csr.io.cmd`: Command signal that specifies whether the CSR should read, write, or perform another action.
* `csr.io.inst` and `csr.io.pc`: Provide the current instruction and the corresponding program counter, enabling the CSR to process instruction-related tasks.
* `csr.io.addr`: Supplies the address calculated by the ALU so that the CSR unit can check it, for example for misaligned accesses.
* `csr.io.illegal`: Flags whether the current instruction is illegal. If true, the CSR can trigger an exception.
* `csr.io.pc_check`, `csr.io.ld_type`, and `csr.io.st_type`: Additional signals that assist the CSR in handling specific operations such as jump target checking, load type determination, and store type processing.
### Write Back
This stage writes results back to the register file (`regFile`) and drives its control signals.
#### Regfile Write
```scala!
regWrite :=
  MuxLookup(mw_reg.wb_sel, mw_reg.alu.zext)(
    Seq(
      WB_MEM -> mw_reg.load,
      WB_PC4 -> (mw_reg.pc + 4.U).zext,
      WB_CSR -> mw_reg.csr_out.zext
    )
  ).asUInt
```
During write-back, `regWrite` determines the final data to be written to the register file. The data source is selected by `mw_reg.wb_sel`:
* `WB_MEM`: Writes back data loaded from memory (`mw_reg.load`).
* `WB_PC4`: Writes back PC + 4, the return address saved by jump-and-link instructions.
* `WB_CSR`: Writes back the result of a CSR operation (`mw_reg.csr_out`).
* Default: the ALU result (`mw_reg.alu.zext`) is written back.
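A minimal software model of this selection is shown below. The string keys stand in for the `WB_*` constants and `select_reg_write` is a hypothetical name; the final mask models the conversion back to an unsigned 32-bit value.

```python
def select_reg_write(wb_sel, alu, load, pc, csr_out):
    """Choose the value written to the register file, as a 32-bit pattern."""
    sources = {
        "WB_MEM": load,     # data loaded from memory (may be sign-extended)
        "WB_PC4": pc + 4,   # return address
        "WB_CSR": csr_out,  # CSR read result
    }
    return sources.get(wb_sel, alu) & 0xFFFFFFFF  # default: ALU result


print(hex(select_reg_write("WB_PC4", 0, 0, 0x100, 0)))
print(hex(select_reg_write("WB_MEM", 0, -128, 0, 0)))
```

Any selector that is not listed falls through to the ALU result, mirroring the default argument of `MuxLookup`.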
#### Regfile Control Signals
```scala!
regFile.io.wen := mw_reg.wb_en && !stall
regFile.io.waddr := mw_rd_addr
regFile.io.wdata := regWrite
```
This section controls whether data is written to the register file:
* `wen`: Write enable. The register file is written only if the write-back enable signal (`mw_reg.wb_en`) is true and the pipeline is not stalled (`!stall`).
* `waddr`: The target register address for write-back (`mw_rd_addr`).
* `wdata`: The data to be written, selected by the `regWrite` logic.
# Verifying Custom Programs
## Installing Environment
To test the occurrence of hazards, we want to run several custom programs. However, before doing so, we need to set up the relevant environment.
To compile and assemble the custom programs, the RISC-V tools for the Privileged Architecture 1.7 toolchain need to be installed. Follow the instructions in [Running Your Own Program on riscv-mini](https://github.com/ucb-bar/riscv-mini?tab=readme-ov-file#running-your-own-program-on-riscv-mini) to set up the environment variables and install the toolchain.
However, this part of the process was not explained clearly. After some exploration, the following steps were used to set up the environment.
```shell!
$ export RISCV=$HOME/riscv-tools
$ mkdir -p $RISCV
```
Then we ran the `build-riscv-tools.sh` script from the `riscv-mini` project under `$HOME/riscv-tools`.
The installation took about half an hour. Afterwards, we found that `riscv32-unknown-elf-gcc` had not been built, and the following error was shown:
```shell!
/home/wu/riscv-tools/riscv-tools-priv1.7/riscv-gnu-toolchain/build/src/newlib-gcc/gcc/reload1.c:115:24: error: use of an operand of type ‘bool’ in ‘operator++’ is forbidden in C++17
115 | (this_target_reload->x_spill_indirect_levels)
| ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
```
We suspected that the compiler version was too new, so we downgraded to GCC 10.
```shell!
$ sudo apt install gcc-10 g++-10
$ export CC=gcc-10
$ export CXX=g++-10
```
This still did not work, so we tried to install the latest version of the toolchain instead of priv1.7.
However, building the latest version took too much time. We quickly abandoned this plan and pursued the following two plans in parallel.
The first plan was to download a [precompiled toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain/releases/tag/2025.01.20) and [riscv-tests](https://github.com/riscv-software-src/riscv-tests), then modify the `Makefile` in `custom-bmark`. This works for generating the `.vcd` file but not the `.dump` file, because the precompiled toolchain we found lacks `riscv32-unknown-elf-objdump`.
The second solution worked much better: we installed **GCC 5.3** on **Ubuntu 16.04** in Docker, a combination that matches the RISC-V tools for priv 1.7.
The steps for setting up the Docker environment are as follows.
```shell!
$ sudo apt install docker.io
$ sudo docker run -it -v ~/riscv-mini:/riscv-mini ubuntu:16.04
```
Inside the container:
```shell!
$ apt update
$ apt install gcc-5 g++-5
```
Set up the environment by installing the necessary tools:
```shell!
# install gawk
$ apt update
$ apt install -y gawk
```
Install wget and curl:
```shell!
$ apt install -y wget curl
```
Install git, texinfo, and file:
```shell!
$ apt install -y git
```
```shell!
$ apt install -y texinfo
```
```shell!
$ apt install -y file
```
Then build the riscv-tools:
```shell!
$ export RISCV=~/riscv-mini
$ ./build-riscv-tools.sh
```
Finally, we can test the custom programs.
## Modify custom programs
```
riscv-mini/custom-bmark/
├── main.c
├── add.S
├── Makefile
```
Here, `add.S` provides a function that is called by `main.c`. To test a function in a file named `test.S` instead, first replace the file name `add.S` with `test.S` on the 7th line of the `Makefile`:
```
CUSTOM_BMARK_S_SRC ?= test.S crt.S
```
Then update the function name in `main.c` accordingly.
## Run custom assembly code on riscv-mini
With everything set up, we can run our own assembly code on the riscv-mini CPU.
In the `custom-bmark` directory, write the custom assembly code in `add.S`.
Next, run `make` in `custom-bmark` to compile the program and generate the binary, dump, and hex files.
Note that `VTile` fails to start when the system's `libstdc++` and `glibc` are older than the versions it was built against, with errors such as:
```
/mnt/riscv_mini_extended/riscv-mini/VTile: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /mnt/riscv_mini_extended/riscv-mini/VTile)
/mnt/riscv_mini_extended/riscv-mini/VTile: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found (required by /mnt/riscv_mini_extended/riscv-mini/VTile)
/mnt/riscv_mini_extended/riscv-mini/VTile: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /mnt/riscv_mini_extended/riscv-mini/VTile)
```
## Check Waveform
After running `make` in `custom-bmark` to generate the hex file, we moved it into `~/riscv-mini` and used `VTile` to generate the `.vcd` waveform.
```shell!
$ ./VTile <hex file> [<vcd file> 2> <log file>]
```
Take the following code as an example:
``` assemble!
# add.S
    .text
    .align 2
    .globl add
    .type add,@function
add:
    add a0, a0, a1
    ret
```
```c!
// main.c
int add(int a, int b);
int main(int argc, char** argv) {
  int res = add(3, 2);
  return res == 5 ? 6543 : -1;
}
```
Using GTKWave, we checked the waveforms of `a0` and `a1` in the register file at two points.

Here is the change during the execution of `add a0, a0, a1`, where 3+2=5.

And here is the signal just before completion. You can see that `a0` ends at `0x198F`, which is 6543 in decimal.
The return value of `main` can also be checked in VTile's terminal output (`TOHOST`).
```shell!
wu@samsung:~/riscv-mini$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2646 (cycle 264)
TOHOST = 6543
Finishing simulation!
```
# Testing Instruction Sequences That May Cause Hazards
## Read-After-Write (RAW) Hazard
`t0` is written by the first instruction and read by the very next one.
```c
raw:
    add t0, a0, a1
    add a0, t0, a0
    ret
```
```c!
int raw(int a, int b);
int main(int argc, char** argv) {
  int res = raw(13, 25);
  return res == 51 ? 6543 : 123;
}
```
The result is correct:
```shell!
wu@samsung:~/riscv-mini$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2642 (cycle 264)
TOHOST = 6543
Finishing simulation!
```
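The expected value 51 can be derived with a small reference model of the two `add` instructions. This is a sketch for checking the test, not something run on the simulator; `raw_without_forwarding` is a hypothetical variant that shows what a pipeline lacking forwarding would compute if it read a stale `t0`.

```python
def to_int32(x):
    """Wrap an integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def raw(a0, a1):
    """Reference model: add t0, a0, a1 ; add a0, t0, a0."""
    t0 = to_int32(a0 + a1)
    return to_int32(t0 + a0)


def raw_without_forwarding(a0, a1, stale_t0):
    """Hypothetical faulty pipeline: the second add reads the old t0."""
    return to_int32(stale_t0 + a0)


print(raw(13, 25))                        # (13 + 25) + 13
print(raw_without_forwarding(13, 25, 0))  # differs from 51, so the test would fail
```

Because the faulty variant returns a different value for these inputs, a `TOHOST` of 6543 indicates that the dependent `add` received the forwarded `t0`.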
## Load-Use Hazard
Here a value is used immediately after it is loaded from memory.
```c
load_use_hazard:
    lw t0, 0(a0)
    lw t1, 0(a1)
    add a0, t0, t1
    ret
```
```c
int load_use_hazard(int* ptr1, int* ptr2);
int main(int argc, char** argv) {
  int data1 = 13;
  int data2 = 25;
  int res = load_use_hazard(&data1, &data2);
  return res == 38 ? 6543 : 123;
}
```
```shell!
wu@samsung:~/riscv-mini$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2654 (cycle 265)
TOHOST = 6543
Finishing simulation!
```
The result equals 38, so the final output is 6543, which is correct.
## Jump Branch Hazard
In this section, we test whether branches and jumps work correctly.
**Testing the taken branch**
```c
branch_jump_hazard:
    add t0, a0, a1
    bge t0, zero, skip
    li a0, 1117
    ret
skip:
    li a0, 8885
    ret
```
```c
int branch_jump_hazard(int a, int b);
int main(int argc, char** argv) {
  int res = branch_jump_hazard(13, 25);
  return res == 8885 ? 6543 : 123;
}
```
Simulation result:
```shell!
Enabling waves...
Starting simulation!
Simulation completed at time 2644 (cycle 264)
TOHOST = 6543
Finishing simulation!
```
**Testing the not-taken branch**
```c
branch_jump_hazard:
    add t0, a0, a1
    bge t0, zero, skip
    li a0, 1117
    ret
skip:
    li a0, 8885
    ret
```
```c
int branch_jump_hazard(int a, int b);
int main(int argc, char** argv) {
  // The sum must be negative for bge to fall through (example inputs).
  int res = branch_jump_hazard(-13, -25);
  return res == 1117 ? 6543 : 123;
}
```
Simulation result:
```shell!
Enabling waves...
Starting simulation!
Simulation completed at time 2650 (cycle 265)
TOHOST = 6543
Finishing simulation!
```
Result:
Both simulations output 6543, which means that in both cases execution continued at the correct location in our test code.
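A reference model of the assembly makes it easy to see which inputs exercise which path. This is an illustrative sketch; the negative inputs are example values chosen here and are not taken from the simulation runs.

```python
def to_int32(x):
    """Wrap an integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def branch_jump_hazard(a, b):
    """Reference model: bge t0, zero, skip compares t0 as a signed value."""
    t0 = to_int32(a + b)
    if t0 >= 0:
        return 8885  # branch taken: execution continues at skip
    return 1117      # branch not taken: falls through


print(branch_jump_hazard(13, 25))    # sum is 38, so the branch is taken
print(branch_jump_hazard(-13, -25))  # sum is -38, so the branch falls through
```

Since `bge` is a signed comparison against zero, only inputs with a negative sum reach the fall-through path.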
## Verify 3 RISC-V Programs from the Course Quiz
### [Quiz2 Problem D](https://hackmd.io/@sysprog/arch2024-quiz2-sol#Problem-D)
Assembly code:
```c
    .text
    .align 2
    .globl eg_mul
    .type eg_mul,@function
eg_mul:
    # Begin the main code in the text section
    add t0, x0, a0         # Load the first number (num1) into register t0
    add t1, x0, a1         # Load the second number (num2) into register t1
    li t2, 0               # Initialize the result (t2) to 0
loop:
    # Check if the least significant bit of t0 (num1) is 1 (i.e., if the number is odd)
    andi t3, t0, 1
    beq t3, x0, skip_add   # If the bit is 0 (even), skip the addition
    # If the number is odd, add the value in t1 (num2) to the result in t2
    add t2, t2, t1
skip_add:
    # Perform a right shift on t0 (num1), effectively dividing it by 2
    srli t0, t0, 1         # D02
    # Perform a left shift on t1 (num2), effectively multiplying it by 2
    slli t1, t1, 1         # D03
    # If t0 (num1) is not zero, repeat the loop
    bnez t0, loop
    # Store the final result in a0
    add a0, x0, t2
    ret
```
```c!
int eg_mul(int a, int b);
int main(int argc, char** argv) {
  int res = eg_mul(13, 7);
  return res;
}
```
```shell!
$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2678 (cycle 267)
TOHOST = 91
Finishing simulation!
```
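The `TOHOST` value can be cross-checked against a software model of the same shift-and-add loop. This is a sketch written for this check; the 32-bit mask models the register width, and the Python function simply mirrors the assembly register by register.

```python
MASK32 = 0xFFFFFFFF


def eg_mul(a, b):
    """Shift-and-add multiplication, following the assembly step by step."""
    t0, t1, t2 = a & MASK32, b & MASK32, 0
    while True:
        if t0 & 1:                   # andi / beq: add only when the low bit is set
            t2 = (t2 + t1) & MASK32
        t0 >>= 1                     # srli t0, t0, 1
        t1 = (t1 << 1) & MASK32      # slli t1, t1, 1
        if t0 == 0:                  # bnez t0, loop
            break
    return t2


print(eg_mul(13, 7))
```

For 13 = 0b1101, the loop adds 7, 28, and 56, which gives 91 and matches the simulated result.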
## Conclusion
In this project, we delved deep into the architecture of RISC-V Mini, starting from the fundamental components such as the Arithmetic Logic Unit (ALU), data cache (D-cache), instruction cache (I-cache), branch condition unit (BrCond), and control and status registers (CSR). We progressively analyzed their operational principles.
We placed particular emphasis on pipeline design and optimization. Beginning with a three-stage pipeline (IF, EX, WB), we expanded it into a five-stage pipeline (IF, ID, EX, MEM, WB) and analyzed the performance improvements and the increased design complexity that result from this modification. We also explored the stall mechanism used to address hazards such as data dependencies, and the use of NOP instructions to fill pipeline bubbles.
To gain a deeper understanding of hazard types and handling methods, we systematically investigated various common hazard scenarios. Using tools such as Verilator and GTKWave, we closely observed the signal waveforms in each stage, which let us see how the system responds when hazards occur. Finally, we applied this knowledge to practical cases, successfully verifying multiple instruction combinations that could cause hazards on the five-stage pipeline version of RISC-V Mini. Additionally, we simulated the execution of a course quiz program to validate the correctness of the design.
# References
* [riscv-mini](https://github.com/ucb-bar/riscv-mini)
* [riscv-mini in five-stage pipeline](https://github.com/YangKefan-rk/riscv-mini)
* [Lab3: Construct a single-cycle RISC-V CPU with Chisel](https://hackmd.io/@sysprog/r1mlr3I7p)