蕭維昭, 吳柏漢
riscv-mini is a basic 3-stage RISC-V pipeline implemented in Chisel. Its pipeline has only the Fetch, Execute, and Write Back stages, which keeps the design and implementation straightforward and makes it well suited to teaching and prototyping.
However, compared with a 5-stage design, a 3-stage pipeline CPU may suffer from lower instruction throughput and a lower achievable clock frequency, because each of its stages has to do more work.
In this project we therefore address this problem by extending riscv-mini to a 5-stage CPU.
We based our work on the implementation by YangKefan-rk and looked for specific flaws in this extended implementation. We also attempted to address these issues so that the various hazard cases are handled robustly.
From this diagram, we can see a 3-stage pipeline implementing a simplified RISC-V CPU. The pipeline is divided into three main stages: Fetch, Execute, and Write Back.
Here, we follow the classic RISC architecture and plan to extend riscv-mini by splitting Execute into Decode and Execute, and splitting Write Back into Memory and Write Back.
The main code of the Fetch stage is shown below:
pc := next_pc
io.icache.req.bits.addr := next_pc
io.icache.req.bits.data := 0.U
io.icache.req.bits.mask := 0.U
io.icache.req.valid := !stall
io.icache.abort := false.B
Here is the description:
pc := next_pc: Updates the program counter (pc) with the computed value of next_pc.
io.icache.req.bits.addr := next_pc: Sends the address of the next instruction (next_pc) to the instruction cache (io.icache.req.bits.addr).
io.icache.req.bits.data := 0.U: Sets the data field of the instruction cache request to 0.U (zero).
io.icache.req.bits.mask := 0.U: Sets the mask field of the instruction cache request to 0.U (zero).
io.icache.req.valid := !stall: Sets the valid signal of the instruction cache request to true if there is no stall (!stall).
io.icache.abort := false.B: Sets the abort signal of the instruction cache to false.
The value of next_pc is dynamically selected based on various conditions using MuxCase.
val next_pc = MuxCase(
pc + 4.U,
IndexedSeq(
stall -> pc,
csr.io.expt -> csr.io.evec,
(em_reg.pc_sel === PC_EPC) -> csr.io.epc,
((de_reg.pc_sel === PC_ALU) || (de_reg.taken)) -> (alu.io.sum >> 1.U << 1.U),
(io.ctrl.pc_sel === PC_0) -> pc
)
)
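MuxCase returns the value paired with the first condition that is true and falls back to the default (pc + 4.U) otherwise, so the order of the entries defines their priority. The following is a minimal software sketch of that selection, not the hardware itself; the struct and its field names are hypothetical and only mirror the Chisel signals noted in the comments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bundle of the signals that feed the next-PC selection. */
typedef struct {
    uint32_t pc;            /* current program counter */
    bool     stall;         /* stall */
    bool     expt;          /* csr.io.expt */
    uint32_t evec;          /* csr.io.evec */
    bool     ret_from_trap; /* em_reg.pc_sel === PC_EPC */
    uint32_t epc;           /* csr.io.epc */
    bool     redirect;      /* de_reg.pc_sel === PC_ALU || de_reg.taken */
    uint32_t alu_sum;       /* alu.io.sum */
    bool     hold;          /* io.ctrl.pc_sel === PC_0 */
} next_pc_in;

/* MuxCase semantics: the first true condition wins, else the default. */
uint32_t next_pc(const next_pc_in *in) {
    if (in->stall)         return in->pc;
    if (in->expt)          return in->evec;
    if (in->ret_from_trap) return in->epc;
    if (in->redirect)      return in->alu_sum & ~UINT32_C(1); /* >> 1 << 1 clears bit 0 */
    if (in->hold)          return in->pc;
    return in->pc + 4;     /* default: next sequential instruction */
}
```

In this model a stall overrides everything else, and an exception overrides a branch or jump redirect that happens in the same cycle.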
This part also decides when the fetched instruction is replaced by a NOP:
val inst = Mux(started || io.ctrl.inst_kill || (io.ctrl.pc_sel === PC_ALU) || de_reg.taken || brCond.io.taken || csr.io.expt || de_reg.pc_sel === PC_ALU || de_reg.pc_sel === PC_EPC || em_reg.pc_sel === PC_EPC,
Instructions.NOP, io.icache.resp.bits.data)
At the end of this part, the stall condition is handled: the Fetch/Decode pipeline register is updated only when the pipeline is not stalled.
when(!stall) {
fd_reg.pc := pc
fd_reg.inst := inst
}
io.ctrl.inst := fd_reg.inst
Assigns the fetched instruction (fd_reg.inst) to the control unit (io.ctrl.inst).
val rd_addr = fd_reg.inst(11, 7)
val rs1_addr = fd_reg.inst(19, 15)
val rs2_addr = fd_reg.inst(24, 20)
regFile.io.raddr1 := rs1_addr
regFile.io.raddr2 := rs2_addr
This code decodes the instruction to extract the source and destination register addresses. It then assigns the source register addresses (rs1_addr and rs2_addr) to the register file read ports (regFile.io.raddr1 and regFile.io.raddr2).
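The bit ranges used here are the fixed register fields of the RISC-V base instruction formats. As a small illustration, the sketch below extracts the same fields in C; the helper names are hypothetical, and the sample word 0x00B50533 is the encoding of add a0, a0, a1 (a0 is x10, a1 is x11).

```c
#include <stdint.h>

/* Extract bits hi..lo (inclusive) of an instruction word, like Chisel's
   inst(hi, lo). The field width must be smaller than 32 bits. */
static uint32_t bits(uint32_t inst, unsigned hi, unsigned lo) {
    return (inst >> lo) & ((UINT32_C(1) << (hi - lo + 1)) - 1);
}

uint32_t rd_addr(uint32_t inst)  { return bits(inst, 11, 7);  } /* destination */
uint32_t rs1_addr(uint32_t inst) { return bits(inst, 19, 15); } /* source 1 */
uint32_t rs2_addr(uint32_t inst) { return bits(inst, 24, 20); } /* source 2 */
```

For 0x00B50533 these helpers give rd = 10, rs1 = 10 and rs2 = 11.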
immGen.io.inst := fd_reg.inst
immGen.io.sel := io.ctrl.imm_sel
This code generates the immediate value based on the instruction and the control signals.
With the main blocks of the Decode stage described, the operand values are selected by the following logic:
val rs1 = MuxCase(regFile.io.rdata1, Seq(
(de_rs1hazard) -> alu.io.out,
(em_rs1hazard) -> em_regWrite,
(mw_rs1hazard) -> regWrite
))
val rs2 = MuxCase(regFile.io.rdata2, Seq(
(de_rs2hazard) -> alu.io.out,
(em_rs2hazard) -> em_regWrite,
(mw_rs2hazard) -> regWrite
))
where rs1 and rs2 are the final values of the source registers after the forwarding logic is applied.
The forwarding sources are described below:
alu.io.out: ALU result from the Execute stage (EX/DE hazard).
em_regWrite: Result from the Memory stage (MEM/DE hazard).
regWrite: Result from the Writeback stage (WB/DE hazard).
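Because MuxCase picks the first true condition, the result closest to the Decode stage wins when several stages hold a pending write to the same register. The sketch below models that priority in C; the struct is hypothetical, and the hazard flags are taken as inputs because their definitions are not shown in this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bundle of hazard flags and the values they forward. */
typedef struct {
    bool     de_hazard;    /* de_rs1hazard or de_rs2hazard */
    uint32_t alu_out;      /* alu.io.out, Execute stage */
    bool     em_hazard;    /* em_rs1hazard or em_rs2hazard */
    uint32_t em_reg_write; /* em_regWrite, Memory stage */
    bool     mw_hazard;    /* mw_rs1hazard or mw_rs2hazard */
    uint32_t reg_write;    /* regWrite, Writeback stage */
} fwd_in;

/* Select the operand value: newest in-flight result first,
   register file read data as the fallback. */
uint32_t forward(uint32_t rdata, const fwd_in *f) {
    if (f->de_hazard) return f->alu_out;
    if (f->em_hazard) return f->em_reg_write;
    if (f->mw_hazard) return f->reg_write;
    return rdata;
}
```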
When a reset signal (reset.asBool) or an exception (csr.io.expt) occurs, the control fields of the Decode-Execute pipeline register are cleared to prevent invalid values from propagating. Under normal operation, the register is updated with valid data, including the ALU inputs (alu_a, alu_b) and operation (alu_op), the control signals for memory access (st_type, ld_type), the writeback enable signal (wb_en), and the value of the second source register (rs2), which is later used as the store data. This ensures proper execution of instructions and maintains the integrity of the pipeline.
when(reset.asBool || !stall && csr.io.expt) {
// Reset DE pipeline registers or clear on exception
de_reg.st_type := 0.U
de_reg.csr_cmd := 0.U
de_reg.illegal := false.B
de_reg.pc_check := false.B
de_reg.ld_type := 0.U
de_reg.wb_en := false.B
de_reg.taken := false.B
de_reg.pc_sel := 0.U
}.elsewhen(!stall && !csr.io.expt) {
// Update DE pipeline registers when no stall or exception
de_reg.pc := fd_reg.pc
de_reg.inst := fd_reg.inst
de_reg.alu_a := Mux(io.ctrl.A_sel === A_RS1, rs1, fd_reg.pc)
de_reg.alu_b := Mux(io.ctrl.B_sel === B_RS2, rs2, immGen.io.out)
de_reg.csr_in := Mux(io.ctrl.imm_sel === IMM_Z, immGen.io.out, rs1)
de_reg.alu_op := io.ctrl.alu_op
de_reg.st_type := io.ctrl.st_type
de_reg.csr_cmd := io.ctrl.csr_cmd
de_reg.illegal := io.ctrl.illegal
de_reg.pc_check := io.ctrl.pc_sel === PC_ALU
de_reg.ld_type := io.ctrl.ld_type
de_reg.wb_sel := io.ctrl.wb_sel
de_reg.wb_en := io.ctrl.wb_en
de_reg.taken := brCond.io.taken
de_reg.pc_sel := io.ctrl.pc_sel
de_reg.rs2 := rs2
}
This part sets up the inputs and operation type for the ALU
alu.io.A := de_reg.alu_a
alu.io.B := de_reg.alu_b
alu.io.alu_op := de_reg.alu_op
This section configures the inputs of the ALU (Arithmetic Logic Unit) with values from the decode stage register (de_reg), where:
alu_a and alu_b are the two operands for the ALU.
alu_op specifies the type of operation the ALU should perform.
val woffset = (alu.io.sum(1) << 4.U).asUInt | (alu.io.sum(0) << 3.U).asUInt
val daddr = alu.io.sum >> 2.U << 2.U
io.dcache.req.valid := !stall && (de_reg.st_type.orR || de_reg.ld_type.orR)
io.dcache.req.bits.addr := daddr
io.dcache.req.bits.data := de_reg.rs2 << woffset
io.dcache.req.bits.mask := MuxLookup(de_reg.st_type, "b0000".U)(
Seq(
ST_SW -> "b1111".U,
ST_SH -> ("b11".U << alu.io.sum(1, 0)),
ST_SB -> ("b1".U << alu.io.sum(1, 0)))
)
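The store request logic above can be cross-checked with a small software model. This is only a sketch: the st_type enum encoding and the store_req struct are hypothetical, and the final "& 0xF" models the truncation of the shifted mask to the 4-bit mask field.

```c
#include <stdint.h>

typedef enum { ST_NONE, ST_SW, ST_SH, ST_SB } st_type; /* hypothetical encoding */

typedef struct { uint32_t addr, data, mask; } store_req;

/* Build the data cache request for a store at address alu_sum. */
store_req make_store(uint32_t alu_sum, uint32_t rs2, st_type st) {
    store_req r;
    unsigned byte    = alu_sum & 3u;  /* alu.io.sum(1, 0) */
    unsigned woffset = byte * 8u;     /* bit offset: 0, 8, 16 or 24 */
    r.addr = alu_sum & ~UINT32_C(3);  /* >> 2 << 2: word-aligned address */
    r.data = rs2 << woffset;          /* move the data into its byte lane */
    switch (st) {
    case ST_SW: r.mask = 0xFu;                  break;
    case ST_SH: r.mask = (0x3u << byte) & 0xFu; break;
    case ST_SB: r.mask = (0x1u << byte) & 0xFu; break;
    default:    r.mask = 0x0u;                  break; /* not a store */
    }
    return r;
}
```

For example, a byte store of 0xAB to address 0x1002 becomes a request for word address 0x1000 with data 0x00AB0000 and mask 0b0100.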
woffset: The bit offset of the store data within the 32-bit word, determined by the two low bits of the ALU sum. It takes one of four values: 0, 8, 16, or 24.
io.dcache.req.bits.addr: The word-aligned address of the data.
io.dcache.req.valid: Indicates whether the request is valid, that is, the pipeline is not stalled and the instruction is a load or a store.
io.dcache.req.bits.data: The data to be stored, shifted left by woffset.
io.dcache.req.bits.mask: Selects which bytes are written: 4 bytes (0b1111), 2 bytes (0b11), a single byte (0b1), or none (0b0000). The 2-byte and 1-byte masks are shifted left by the two low bits of the ALU sum.
The MEM section starts at line 321 in datapath.scala, where the Load operation is described.
val loffset = (em_reg.alu(1) << 4.U).asUInt | (em_reg.alu(0) << 3.U).asUInt
val load_reg_valid = !io.icache.resp.valid && io.dcache.resp.valid && em_reg.ld_type =/= LD_XXX
Here, loffset calculates the bit offset from the address computed by the ALU, which is used to align the data loaded from memory.
Next, load_reg_valid checks whether the loaded data is valid. The conditions are that the instruction cache response is not valid (io.icache.resp.valid is false), the data cache response is valid (io.dcache.resp.valid is true), and the current operation is an actual load (ld_type is not LD_XXX).
The load state is then checked:
val load_state = RegInit(false.B)
switch(load_state) {
is(false.B){
when(load_reg_valid){
load_state := true.B
}
}
is(true.B){
when(!stall){
load_state := false.B
}
}
}
Since loading data from memory or cache takes time, load_state is defined to track the loading condition. When load_reg_valid is true, the load state is entered. The state stays set while the pipeline is stalled and resets to false.B once there is no stall (!stall).
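The state register has only two states, so its update rule fits in a single function. The following sketch models one clock edge of load_state in C; the function name is hypothetical.

```c
#include <stdbool.h>

/* Value of load_state after one clock edge. */
bool load_state_next(bool load_state, bool load_reg_valid, bool stall) {
    if (!load_state)
        return load_reg_valid; /* enter the load state when data arrives */
    return stall;              /* stay while stalled, leave otherwise */
}
```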
Next, data source selection and conversion are performed:
val load_reg = RegEnable(io.dcache.resp.bits.data, 0.U(conf.xlen.W), load_reg_valid && !load_state)
val load_data = Mux(load_state, load_reg, io.dcache.resp.bits.data)
val lshift = load_data >> loffset
load := MuxLookup(em_reg.ld_type, load_data.zext)(
Seq(
LD_LH -> lshift(15, 0).asSInt,
LD_LB -> lshift(7, 0).asSInt,
LD_LHU -> lshift(15, 0).zext,
LD_LBU -> lshift(7, 0).zext
)
)
load_data selects the data source based on load_state. In the load state, the data stored in the register is used; otherwise, the data from the cache is used. The data is then right-shifted by loffset for proper alignment, and ld_type selects the width and the extension:
LD_LH and LD_LB indicate signed loading of a halfword and a byte, respectively.
LD_LHU and LD_LBU indicate unsigned loading of a halfword and a byte, respectively.
The Control and Status Registers (CSR) manage system-level operations and configurations, providing essential information about the processor state. These registers handle privileged operations such as interrupt processing, exception management, timer configuration, and system monitoring.
csr.io.stall := stall
csr.io.in := em_reg.csr_in
csr.io.cmd := em_reg.csr_cmd
csr.io.inst := em_reg.inst
csr.io.pc := em_reg.pc
csr.io.addr := em_reg.alu
csr.io.illegal := em_reg.illegal
csr.io.pc_check := em_reg.pc_check
csr.io.ld_type := em_reg.ld_type
csr.io.st_type := em_reg.st_type
io.host <> csr.io.host
In the MEM stage, the above code describes the interaction with the CSR.
csr.io.stall: Notifies the CSR whether to pause operations based on the pipeline's execution status.
csr.io.cmd: Command signal for executing instructions, specifying whether the CSR should read, write, or perform other actions.
csr.io.inst and csr.io.pc: Provide the current instruction and the corresponding program counter, enabling the CSR to process instruction-related tasks.
csr.io.addr: Supplies the address calculated by the ALU, which the CSR uses when checking memory accesses and jump targets.
csr.io.illegal: Flags whether the current operation is illegal. If true, the CSR can trigger an exception or abort the operation.
Additional signals such as csr.io.pc_check, csr.io.ld_type, and csr.io.st_type assist the CSR in handling specific operations like branch checking, load type determination, and store type processing.
In this stage, the code focuses on writing back to the regFile and on the control signals of the regFile.
regWrite :=
MuxLookup(mw_reg.wb_sel, mw_reg.alu.zext)(
Seq(
WB_MEM -> mw_reg.load,
WB_PC4 -> (mw_reg.pc + 4.U).zext,
WB_CSR -> mw_reg.csr_out.zext
)
).asUInt
During the write-back operation, regWrite determines the final data to be written to the register file.
Data source selection:
WB_MEM: Writes back data loaded from memory (mw_reg.load).
WB_PC4: Writes back the value of PC + 4, which is the return address of jump instructions (JAL and JALR).
WB_CSR: Writes back the result of a CSR operation (mw_reg.csr_out).
By default, the ALU result (mw_reg.alu.zext) is written back.
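The same selection can be written as a plain switch. This is a minimal C sketch under the assumption of a hypothetical wb_sel enum encoding and a struct that mirrors the mw_reg fields used above.

```c
#include <stdint.h>

typedef enum { WB_ALU, WB_MEM, WB_PC4, WB_CSR } wb_sel; /* hypothetical encoding */

/* Fields of the MEM/WB pipeline register used by the selection. */
typedef struct { uint32_t alu, load, pc, csr_out; } mw_regs;

/* Choose the value written to the register file. */
uint32_t reg_write(wb_sel sel, const mw_regs *mw) {
    switch (sel) {
    case WB_MEM: return mw->load;    /* loaded data */
    case WB_PC4: return mw->pc + 4;  /* return address */
    case WB_CSR: return mw->csr_out; /* CSR read value */
    default:     return mw->alu;     /* ALU result */
    }
}
```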
regFile.io.wen := mw_reg.wb_en && !stall
regFile.io.waddr := mw_rd_addr
regFile.io.wdata := regWrite
This section controls whether data is written to the register file:
Enable write-back (wen): The register file allows write operations only if the write-back enable signal (mw_reg.wb_en) is true and no stall (!stall) occurs.
Write-back address and data:
waddr: Specifies the target register address for write-back (mw_rd_addr).
wdata: Specifies the data content to be written, selected via the regWrite logic.
To test the occurrence of hazards, we want to run several custom programs. However, before doing so, we need to set up the relevant environment.
To compile and assemble the custom programs, the RISC-V tools for the Privileged Architecture 1.7 toolchain need to be installed. Follow the instructions in Running Your Own Program on riscv-mini to set up the environment variables and install the toolchain.
However, this part of the process was not explained clearly. After some exploration, the following steps were used to set up the environment.
$ export RISCV=$HOME/riscv-tools
$ mkdir -p $RISCV
Then, we ran the build-riscv-tools.sh script of the riscv-mini project under $HOME/riscv-tools.
The installation took about half an hour. Afterwards, we found that riscv32-unknown-elf-gcc had not been built, and the following error was shown:
/home/wu/riscv-tools/riscv-tools-priv1.7/riscv-gnu-toolchain/build/src/newlib-gcc/gcc/reload1.c:115:24: error: use of an operand of type ‘bool’ in ‘operator++’ is forbidden in C++17
115 | (this_target_reload->x_spill_indirect_levels)
| ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
We suspected that the host compiler was too new, so we downgraded it to GCC 10.
$ sudo apt install gcc-10 g++-10
$ export CC=gcc-10
$ export CXX=g++-10
This still did not work, so we tried to install the latest version of the toolchain instead of priv1.7.
However, installing the latest version took too much time. We quickly abandoned this plan and pursued the following two plans in parallel.
The first plan was to download a precompiled toolchain and riscv-tests, and then modify the Makefile in custom-bmark. This can generate the .vcd file but not the .dump file, because the precompiled toolchain we found lacks riscv32-unknown-elf-objdump.
The second solution worked much better: we installed GCC 5.3 on Ubuntu 16.04 in Docker, which matches the RISC-V tools for priv 1.7 perfectly.
The steps for setting up the Docker environment are as follows.
$ sudo apt install docker.io
$ sudo docker run -it -v ~/riscv-mini:/riscv-mini ubuntu:16.04
In docker:
$ apt update
$ apt install gcc-5 g++-5
Setting up the environment
Install the necessary tools
# install gawk
$ apt update
$ apt install -y gawk
# install wget and curl
$ apt install -y wget curl
# install git, texinfo, and file
$ apt install -y git
$ apt install -y texinfo
$ apt install -y file
Then install the riscv-tools:
$ export RISCV=~/riscv-mini
$ ./build-riscv-tools.sh
Finally, we can test the custom programs.
riscv-mini/custom-bmark/
├── main.c
├── add.S
├── Makefile
In this part, add.S is used as a function to be called by main.c. To test a function file named test.S, we first need to replace the file name add.S with test.S on the 7th line of the Makefile.
CUSTOM_BMARK_S_SRC ?= test.S crt.S
Then, update the function name in main.c accordingly.
After setting everything up, we can run our own assembly code on the riscv-mini CPU.
In the custom-bmark directory, we can edit our custom assembly code in add.S.
Next, to compile your program, run make in custom-bmark to generate the binary, dump, and hex files.
Note that VTile fails to start when the system libraries are older than the ones it was built against, with errors such as:
/mnt/riscv_mini_extended/riscv-mini/VTile: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /mnt/riscv_mini_extended/riscv-mini/VTile)
/mnt/riscv_mini_extended/riscv-mini/VTile: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found (required by /mnt/riscv_mini_extended/riscv-mini/VTile)
/mnt/riscv_mini_extended/riscv-mini/VTile: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /mnt/riscv_mini_extended/riscv-mini/VTile)
After running make in custom-bmark to generate the hex file, we moved it into ~/riscv-mini and used VTile to generate the .vcd waveform.
$ ./VTile <hex file> [<vcd file> 2> <log file>]
Take the following code as an example:
# add.S
.text
.align 2
.globl add
.type add,@function
add:
add a0, a0, a1
ret
// main.c
int add(int a, int b);
int main(int argc, char** argv) {
int res = add(3, 2);
return res == 5 ? 6543 : -1;
}
We checked the waveforms of a0 and a1 in regFile at two points using GTKWave.
Here is the change during the execution of add a0, a0, a1, where 3+2=5.
And here is the signal before completion. You can see that a0 ends at 198F, which is 6543 in decimal.
The terminal output of VTile can also be used to check the return value of main.
wu@samsung:~/riscv-mini$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2646 (cycle 264)
TOHOST = 6543
Finishing simulation!
t0 is written by the first instruction and read by the very next instruction, which is a read-after-write (RAW) hazard.
raw:
add t0, a0, a1
add a0, t0, a0
ret
int raw(int a, int b);
int main(int argc, char** argv) {
int res = raw(13, 25);
return res == 51 ? 6543 : 123;
}
The result is correct:
wu@samsung:~/riscv-mini$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2642 (cycle 264)
TOHOST = 6543
Finishing simulation!
After a load from memory, the loaded value is used immediately.
load_use_hazard:
lw t0, 0(a0)
lw t1, 0(a1)
add a0, t0, t1
ret
int load_use_hazard(int* ptr1, int* ptr2);
int main(int argc, char** argv) {
int data1 = 13;
int data2 = 25;
int res = load_use_hazard(&data1, &data2);
return res == 38 ? 6543 : 123;
}
wu@samsung:~/riscv-mini$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2654 (cycle 265)
TOHOST = 6543
Finishing simulation!
The result is equal to 38, so the final output is 6543, which is correct.
In this section, we test whether branches and jumps work correctly.
Testing the case where the branch is taken
branch_jump_hazard:
add t0, a0, a1
bge t0, zero, skip
li a0, 1117
ret
skip:
li a0, 8885
ret
int branch_jump_hazard(int a, int b);
int main(int argc, char** argv) {
int res = branch_jump_hazard(13, 25);
return res == 8885 ? 6543 : 123;
}
Simulation result
Enabling waves...
Starting simulation!
Simulation completed at time 2644 (cycle 264)
TOHOST = 6543
Finishing simulation!
Testing the case where the branch is not taken
branch_jump_hazard:
add t0, a0, a1
bge t0, zero, skip
li a0, 1117
ret
skip:
li a0, 8885
ret
int branch_jump_hazard(int a, int b);
int main(int argc, char** argv) {
int res = branch_jump_hazard(13, 25);
return res == 1117 ? 6543 : 123;
}
Simulation result
Enabling waves...
Starting simulation!
Simulation completed at time 2650 (cycle 265)
TOHOST = 6543
Finishing simulation!
Result:
Both simulations report 6543, which means that in both cases the program reaches the expected position in our test code.
Assembly code
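To make the expected return values explicit, the assembly routine can be mirrored in C. This is only a reference sketch with a hypothetical function name; it assumes inputs small enough that the sum does not overflow. With the listing as printed, the branch is taken whenever a0 + a1 is non-negative, so the fall-through path needs a negative sum, for example the hypothetical inputs (-13, -25).

```c
#include <stdint.h>

/* C rendering of the branch_jump_hazard assembly routine. */
int32_t branch_jump_hazard_ref(int32_t a, int32_t b) {
    int32_t t0 = a + b; /* add t0, a0, a1 */
    if (t0 >= 0)        /* bge t0, zero, skip */
        return 8885;    /* skip: li a0, 8885 */
    return 1117;        /* fall through: li a0, 1117 */
}
```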
.text
.align 2
.globl eg_mul
.type eg_mul,@function
eg_mul:
# Begin the main code in the text section
add t0, x0, a0 # Load the first number (num1) into register t0
add t1, x0, a1 # Load the second number (num2) into register t1
li t2, 0 # Initialize the result (t2) to 0
loop:
# Check if the least significant bit of t0 (num1) is 1 (i.e., if the number is odd)
andi t3, t0, 1
beq t3, x0, skip_add # If the bit is 0 (even), skip the addition
# If the number is odd, add the value in t1 (num2) to the result in t2
add t2, t2, t1
skip_add:
# Perform a right shift on t0 (num1), effectively dividing it by 2
srli t0, t0, 1 # D02
# Perform a left shift on t1 (num2), effectively multiplying it by 2
slli t1, t1, 1 # D03
# If t0 (num1) is not zero, repeat the loop
bnez t0, loop
# Store the final result in a0
add a0, x0,t2
ret
int eg_mul(int a, int b);
int main(int argc, char** argv) {
int res = eg_mul(13, 7);
return res;
}
$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2678 (cycle 267)
TOHOST = 91
Finishing simulation!
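The routine is a shift-and-add multiplication, so the reported value can be checked against a C reference model. The sketch below follows the assembly loop step by step; the function name is hypothetical.

```c
#include <stdint.h>

/* Shift-and-add multiplication, mirroring the eg_mul assembly loop. */
uint32_t eg_mul_ref(uint32_t a, uint32_t b) {
    uint32_t result = 0;  /* li t2, 0 */
    do {
        if (a & 1u)       /* andi t3, t0, 1 and beq t3, x0, skip_add */
            result += b;  /* add t2, t2, t1 */
        a >>= 1;          /* srli t0, t0, 1 */
        b <<= 1;          /* slli t1, t1, 1 */
    } while (a != 0);     /* bnez t0, loop */
    return result;
}
```

For the inputs 13 and 7 the partial sums are 7, 35, and 91, which matches TOHOST = 91 in the simulation output.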
In this project, we delved deep into the architecture of RISC-V Mini, starting from the fundamental components such as the Arithmetic Logic Unit (ALU), data cache (D-cache), instruction cache (I-cache), branch condition unit (BrCond), and control and status registers (CSR). We progressively analyzed their operational principles.
We placed particular emphasis on pipeline design and optimization. Beginning with a three-stage pipeline (IF, EX, WB), we gradually expanded it into a five-stage pipeline (IF, ID, EX, MEM, WB), conducting a detailed analysis of the performance improvements and increased design complexity resulting from this modification. Concurrently, we explored the stall mechanism within pipelines to address hazards like data dependencies, and the application of NOP instructions to fill pipeline bubbles.
To gain a deeper understanding of hazard types and handling methods, we systematically investigated various common hazard scenarios. Utilizing tools like Verilator and GTKWave, we closely observed signal waveforms in each stage, enabling us to intuitively comprehend the system's response when hazards occur. Finally, we applied our knowledge to practical cases, successfully verifying multiple instruction combinations that could potentially cause hazards on the five-stage pipeline version of RISC-V Mini. Additionally, we simulated the execution of a classroom exam program to validate the correctness of our design.