Extend riscv-mini

蕭維昭, 吳柏漢

Introduction

riscv-mini is a basic 3-stage RISC-V pipeline implemented in Chisel. Its pipeline has only Fetch, Execute, and Write Back stages, which keeps the design and implementation straightforward and makes it well suited to teaching and prototyping.

However, compared with a 5-stage pipeline, a 3-stage CPU may suffer from lower instruction throughput and a tighter clock-frequency limit, because each stage has to do more work per cycle.
In this project we address this problem by extending riscv-mini to a 5-stage CPU.

Our work is based on the implementation by YangKefan-rk. We looked for specific flaws in this extended implementation and attempted to fix them so that the various hazards are handled robustly.

Data path of the 3-stage pipeline

[Image: data path of the 3-stage pipeline (not available)]

From this diagram, we can see a 3-stage pipeline implementing a simplified RISC-V CPU. The pipeline is divided into three main stages: Fetch, Execute, and Write Back.

Extend RISC-V mini into 5 stages

Here we follow the classic RISC architecture and plan to extend riscv-mini by splitting Execute into Decode and Execute, and splitting Write Back into Memory and Write Back.

Data path of extended RISC-V mini

Fetch

The core of the Fetch stage is written as follows:

  pc := next_pc
  io.icache.req.bits.addr := next_pc
  io.icache.req.bits.data := 0.U
  io.icache.req.bits.mask := 0.U
  io.icache.req.valid := !stall
  io.icache.abort := false.B

Each line does the following:

  • pc := next_pc : Updates the program counter (pc) with the computed value of next_pc.
  • io.icache.req.bits.addr := next_pc : Sends the address of the next instruction (next_pc) to the instruction cache (io.icache.req.bits.addr).
  • io.icache.req.bits.data := 0.U : Sets the data field of the instruction cache request to 0.U (zero).
  • io.icache.req.bits.mask := 0.U : Sets the mask field of the instruction cache request to 0.U (zero).
  • io.icache.req.valid := !stall : Sets the valid signal of the instruction cache request to true if there is no stall (!stall).
  • io.icache.abort := false.B : Sets the abort signal of the instruction cache to false.

The value of next_pc is dynamically selected based on various conditions using MuxCase.

 val next_pc = MuxCase(
    pc + 4.U,
    IndexedSeq(
      stall -> pc,
      csr.io.expt -> csr.io.evec,
      (em_reg.pc_sel === PC_EPC) -> csr.io.epc,
      ((de_reg.pc_sel === PC_ALU) || (de_reg.taken)) -> (alu.io.sum >> 1.U << 1.U),
      (io.ctrl.pc_sel === PC_0) -> pc
    )
  )
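To make the priority order concrete, here is a minimal C sketch of the same selection. MuxCase returns the value paired with the first condition that is true and falls back to the default (pc + 4) otherwise. The PcConds struct, its field names, and the function signature are hypothetical names chosen for illustration; they are not part of riscv-mini.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bundle of the conditions used by the next_pc MuxCase. */
typedef struct {
    bool stall;          /* pipeline is stalled                        */
    bool expt;           /* csr.io.expt                                */
    bool em_is_epc;      /* em_reg.pc_sel === PC_EPC                   */
    bool de_jump_taken;  /* de_reg.pc_sel === PC_ALU || de_reg.taken   */
    bool ctrl_is_pc0;    /* io.ctrl.pc_sel === PC_0                    */
} PcConds;

/* Software model of the selection: the first true condition wins. */
uint32_t next_pc(uint32_t pc, PcConds c,
                 uint32_t evec, uint32_t epc, uint32_t alu_sum) {
    if (c.stall)         return pc;                      /* hold the PC      */
    if (c.expt)          return evec;                    /* exception vector */
    if (c.em_is_epc)     return epc;                     /* return from trap */
    if (c.de_jump_taken) return alu_sum & ~(uint32_t)1;  /* same as >> 1 << 1 */
    if (c.ctrl_is_pc0)   return pc;
    return pc + 4u;                                      /* default case     */
}
```

Because stall is listed first, a stalled pipeline holds the PC even if an exception or a jump is pending in the same cycle.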

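The Fetch stage also substitutes a NOP for the fetched instruction at start-up and whenever a jump, a taken branch, or an exception flushes the pipeline: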

  val inst = Mux(started || io.ctrl.inst_kill || (io.ctrl.pc_sel === PC_ALU) || de_reg.taken || brCond.io.taken || csr.io.expt || de_reg.pc_sel === PC_ALU || de_reg.pc_sel === PC_EPC || em_reg.pc_sel === PC_EPC, 
    Instructions.NOP, io.icache.resp.bits.data)

Finally, the Fetch/Decode pipeline register is updated only when the pipeline is not stalled:

  when(!stall) { 
    fd_reg.pc := pc
    fd_reg.inst := inst
  }

Decode

Fetch-Decode (FD) Instruction Assignment

 io.ctrl.inst := fd_reg.inst

Assigns the fetched instruction (fd_reg.inst) to the control unit (io.ctrl.inst).

Register File Read

 val rd_addr = fd_reg.inst(11, 7)
  val rs1_addr = fd_reg.inst(19, 15)
  val rs2_addr = fd_reg.inst(24, 20)
  regFile.io.raddr1 := rs1_addr
  regFile.io.raddr2 := rs2_addr

This code decodes the instruction to extract the destination and source register addresses. It then assigns the source register addresses (rs1_addr and rs2_addr) to the register file read ports (regFile.io.raddr1 and regFile.io.raddr2).
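As a quick check of the bit ranges, the following C sketch extracts the same three fields from an instruction word. The helper names are hypothetical; inst_bits mirrors the Chisel inst(hi, lo) notation and is only valid for fields narrower than 32 bits.

```c
#include <stdint.h>

/* Return bits [hi:lo] of inst, like inst(hi, lo) in Chisel.
   Assumes the field is narrower than 32 bits. */
uint32_t inst_bits(uint32_t inst, unsigned hi, unsigned lo) {
    return (inst >> lo) & ((1u << (hi - lo + 1u)) - 1u);
}

uint32_t rd_addr(uint32_t inst)  { return inst_bits(inst, 11, 7);  }
uint32_t rs1_addr(uint32_t inst) { return inst_bits(inst, 19, 15); }
uint32_t rs2_addr(uint32_t inst) { return inst_bits(inst, 24, 20); }
```

For example, `add a0, a0, a1` is encoded as 0x00b50533, which yields rd = 10 (a0), rs1 = 10 (a0), and rs2 = 11 (a1).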

Immediate Generation

  immGen.io.inst := fd_reg.inst
  immGen.io.sel := io.ctrl.imm_sel

This code generates the immediate value from the instruction and the control signal imm_sel.
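The internals of immGen are not shown here, so the following is only a sketch of what immediate generation means for two of the RISC-V base formats (I-type and S-type), following the bit layout in the RISC-V ISA specification. The function names are hypothetical and the other formats (B, U, J, Z) are omitted.

```c
#include <stdint.h>

/* Sign-extend the low `width` bits of value (1 <= width <= 32). */
int32_t sign_extend(uint32_t value, unsigned width) {
    uint32_t sign = 1u << (width - 1u);
    return (int32_t)((int64_t)(value ^ sign) - (int64_t)sign);
}

/* I-type: imm[11:0] = inst[31:20] */
int32_t imm_i(uint32_t inst) {
    return sign_extend(inst >> 20, 12);
}

/* S-type: imm[11:5] = inst[31:25], imm[4:0] = inst[11:7] */
int32_t imm_s(uint32_t inst) {
    uint32_t raw = ((inst >> 25) << 5) | ((inst >> 7) & 0x1Fu);
    return sign_extend(raw, 12);
}
```

For instance, `addi a0, a0, -1` (0xfff50513) gives an I-type immediate of -1, and `sw a1, 8(a0)` (0x00b52423) gives an S-type immediate of 8.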

Forwarding Logic

After the main blocks of the Decode stage, the operand values are selected by the following forwarding logic:

val rs1 = MuxCase(regFile.io.rdata1, Seq(
    (de_rs1hazard) -> alu.io.out,
    (em_rs1hazard) -> em_regWrite,
    (mw_rs1hazard) -> regWrite
  ))
  val rs2 = MuxCase(regFile.io.rdata2, Seq(
    (de_rs2hazard) -> alu.io.out,
    (em_rs2hazard) -> em_regWrite,
    (mw_rs2hazard) -> regWrite
  ))

Here rs1 and rs2 are the final values of the source registers after forwarding. MuxCase selects the first condition that is true, so the most recently produced result takes priority. The forwarding sources are:

  • alu.io.out: ALU result from the Execute stage (EX/DE hazard).
  • em_regWrite: Result from the Memory stage (MEM/DE hazard).
  • regWrite: Result from the Writeback stage (WB/DE hazard).
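The forwarding choice can be modeled in a few lines of C. This is a minimal sketch with a hypothetical function name and argument order; the hazard flags correspond to de_rs1hazard, em_rs1hazard, and mw_rs1hazard (or their rs2 counterparts), whose computation is not shown in this note.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pick the operand value for one source register.
   Earlier checks win, matching the order of the MuxCase entries. */
uint32_t forward_operand(uint32_t regfile_value,
                         bool de_hazard, uint32_t alu_out,
                         bool em_hazard, uint32_t em_reg_write,
                         bool mw_hazard, uint32_t reg_write) {
    if (de_hazard) return alu_out;       /* producer is in Execute   */
    if (em_hazard) return em_reg_write;  /* producer is in Memory    */
    if (mw_hazard) return reg_write;     /* producer is in Writeback */
    return regfile_value;                /* no hazard: register file */
}
```

When the same register is being written by instructions in several stages at once, the Execute-stage value is chosen because it belongs to the youngest of those instructions.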

Pipelining De/Ex

When a reset (reset.asBool) or an exception (csr.io.expt) occurs, the control fields of the Decode-Execute pipeline register are cleared so that an invalid instruction has no side effects. Under normal operation (no stall and no exception), the register is updated with the ALU inputs (alu_a, alu_b) and operation (alu_op), the control signals for memory access (st_type, ld_type), the write-back enable signal (wb_en), and the value of rs2, which the Execute stage uses as store data. This ensures proper execution of instructions and maintains the integrity of the pipeline.

when(reset.asBool || !stall && csr.io.expt) {
  // Reset DE pipeline registers or clear on exception
  de_reg.st_type := 0.U
  de_reg.csr_cmd := 0.U
  de_reg.illegal := false.B
  de_reg.pc_check := false.B
  de_reg.ld_type := 0.U
  de_reg.wb_en := false.B
  de_reg.taken := false.B
  de_reg.pc_sel := 0.U
}.elsewhen(!stall && !csr.io.expt) {
  // Update DE pipeline registers when no stall or exception
  de_reg.pc := fd_reg.pc
  de_reg.inst := fd_reg.inst
  de_reg.alu_a := Mux(io.ctrl.A_sel === A_RS1, rs1, fd_reg.pc)
  de_reg.alu_b := Mux(io.ctrl.B_sel === B_RS2, rs2, immGen.io.out)
  de_reg.csr_in := Mux(io.ctrl.imm_sel === IMM_Z, immGen.io.out, rs1)
  de_reg.alu_op := io.ctrl.alu_op
  de_reg.st_type := io.ctrl.st_type
  de_reg.csr_cmd := io.ctrl.csr_cmd
  de_reg.illegal := io.ctrl.illegal
  de_reg.pc_check := io.ctrl.pc_sel === PC_ALU
  de_reg.ld_type := io.ctrl.ld_type
  de_reg.wb_sel := io.ctrl.wb_sel
  de_reg.wb_en := io.ctrl.wb_en
  de_reg.taken := brCond.io.taken
  de_reg.pc_sel := io.ctrl.pc_sel
  de_reg.rs2 := rs2
}

Execute

ALU Calculation

This part sets up the inputs and operation type for the ALU

alu.io.A := de_reg.alu_a
alu.io.B := de_reg.alu_b
alu.io.alu_op := de_reg.alu_op

These assignments drive the ALU (Arithmetic Logic Unit) from the Decode/Execute pipeline register (de_reg), where:

  • alu_a and alu_b are the two operands for the ALU.
  • alu_op specifies the type of operation the ALU should perform.

Dcache Access

val woffset = (alu.io.sum(1) << 4.U).asUInt | (alu.io.sum(0) << 3.U).asUInt
val daddr = alu.io.sum >> 2.U << 2.U
io.dcache.req.valid := !stall && (de_reg.st_type.orR || de_reg.ld_type.orR)
io.dcache.req.bits.addr := daddr
io.dcache.req.bits.data := de_reg.rs2 << woffset
io.dcache.req.bits.mask := MuxLookup(de_reg.st_type, "b0000".U)(
  Seq(
      ST_SW -> "b1111".U, 
      ST_SH -> ("b11".U << alu.io.sum(1, 0)),
      ST_SB -> ("b1".U << alu.io.sum(1, 0)))
        )

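  • woffset : the bit offset of the store data within the 32-bit word. It is derived from the two low bits of the ALU sum and takes one of four values: 0, 8, 16, or 24.
  • io.dcache.req.bits.addr : the word-aligned address of the access (daddr, the ALU sum with its two low bits cleared).
  • io.dcache.req.valid : asserted when the pipeline is not stalled and the instruction is a load or a store.
  • io.dcache.req.bits.data : the store data (de_reg.rs2), shifted left by woffset into the addressed byte lane.
  • io.dcache.req.bits.mask : the byte write mask: 4 bytes (0b1111) for SW, 2 bytes (0b11) for SH, and a single byte (0b1) for SB, the last two shifted to the addressed byte; 0b0000 means nothing is written.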
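A small C model makes the store alignment easier to follow. It is a sketch under stated assumptions: the StoreType enum values and the StoreReq struct are hypothetical stand-ins for the Chisel constants and the cache request bundle, and the mask is truncated to 4 bits as it would be when assigned to a 4-bit signal.

```c
#include <stdint.h>

/* Hypothetical encoding; only the names echo the Chisel constants. */
typedef enum { ST_NONE, ST_SW, ST_SH, ST_SB } StoreType;

typedef struct {
    uint32_t addr;  /* word-aligned address      */
    uint32_t data;  /* store data in byte lane   */
    uint8_t  mask;  /* one bit per byte to write */
} StoreReq;

StoreReq make_store(uint32_t alu_sum, uint32_t rs2, StoreType st) {
    unsigned byte_off = alu_sum & 3u;        /* alu.io.sum(1, 0)      */
    StoreReq r;
    r.addr = alu_sum & ~(uint32_t)3;         /* sum >> 2 << 2         */
    r.data = rs2 << (8u * byte_off);         /* woffset = 8 * offset  */
    switch (st) {
    case ST_SW: r.mask = 0xFu; break;
    case ST_SH: r.mask = (uint8_t)((0x3u << byte_off) & 0xFu); break;
    case ST_SB: r.mask = (uint8_t)((0x1u << byte_off) & 0xFu); break;
    default:    r.mask = 0x0u; break;        /* not a store           */
    }
    return r;
}
```

For example, a byte store of 0xAB to address 0x1002 becomes a request to address 0x1000 with data 0x00AB0000 and mask 0b0100.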

Memory

Load

The MEM section starts at line 321 in datapath.scala, where the Load operation is described.

val loffset = (em_reg.alu(1) << 4.U).asUInt | (em_reg.alu(0) << 3.U).asUInt
val load_reg_valid = !io.icache.resp.valid && io.dcache.resp.valid && em_reg.ld_type =/= LD_XXX

Here, loffset is the bit offset computed from the two low bits of the address produced by the ALU; it is used to align the loaded data.
Next, load_reg_valid indicates that valid load data is available: the instruction cache response is not valid (io.icache.resp.valid is false), the data cache response is valid (io.dcache.resp.valid is true), and the instruction is a load (ld_type is not LD_XXX).

The load state is then checked:

val load_state = RegInit(false.B)
switch(load_state) {
  is(false.B){
    when(load_reg_valid){
      load_state := true.B
    }
  }
  is(true.B){
    when(!stall){
      load_state := false.B
    }
  }
}

Since loading data from memory or cache takes time, load_state records whether the load data has already been captured. When load_reg_valid is true, the state becomes true.B. It returns to false.B once the pipeline is no longer stalled (!stall).
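The transition rule of this two-state machine can be written as a single next-state function. The function name is hypothetical; the arguments mirror the load_state, load_reg_valid, and stall signals.

```c
#include <stdbool.h>

/* Next value of load_state after one clock edge. */
bool next_load_state(bool load_state, bool load_reg_valid, bool stall) {
    if (!load_state) {
        return load_reg_valid;  /* enter the load state when data arrives */
    }
    return stall;               /* stay until the pipeline is released    */
}
```

In other words, the state is set by the arrival of load data and held for as long as the pipeline remains stalled.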

Next, data source selection and conversion are performed:

val load_reg = RegEnable(io.dcache.resp.bits.data, 0.U(conf.xlen.W), load_reg_valid && !load_state)
val load_data = Mux(load_state, load_reg, io.dcache.resp.bits.data)
val lshift = load_data >> loffset
load := MuxLookup(em_reg.ld_type, load_data.zext)(
  Seq(
    LD_LH -> lshift(15, 0).asSInt,
    LD_LB -> lshift(7, 0).asSInt,
    LD_LHU -> lshift(15, 0).zext,
    LD_LBU -> lshift(7, 0).zext
  )
)

load_data selects the data source based on load_state: when load_state is set, the data saved in load_reg is used; otherwise the data comes directly from the cache response. The data is then right-shifted by loffset for alignment, and ld_type selects the width and whether the value is sign-extended or zero-extended.

LD_LH and LD_LB indicate signed loading of halfword and byte, respectively.
LD_LHU and LD_LBU indicate unsigned loading of halfword and byte, respectively.
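The shift-then-extend step can be sketched in C as follows. The LoadType enum values are a hypothetical encoding, and the result is returned as the raw 32-bit register value so that sign extension stays within well-defined unsigned arithmetic.

```c
#include <stdint.h>

/* Hypothetical encoding; only the names echo the Chisel constants. */
typedef enum { LD_LW, LD_LH, LD_LB, LD_LHU, LD_LBU } LoadType;

/* Sign-extend the low `width` bits (width < 32) to 32 bits. */
static uint32_t sext32(uint32_t value, unsigned width) {
    uint32_t sign = 1u << (width - 1u);
    return (value ^ sign) - sign;   /* wraps modulo 2^32 by design */
}

/* word: 32-bit data returned by the cache; addr: address from the ALU. */
uint32_t extract_load(uint32_t word, uint32_t addr, LoadType ld) {
    uint32_t shifted = word >> (8u * (addr & 3u));  /* load_data >> loffset */
    switch (ld) {
    case LD_LH:  return sext32(shifted & 0xFFFFu, 16);
    case LD_LB:  return sext32(shifted & 0xFFu, 8);
    case LD_LHU: return shifted & 0xFFFFu;
    case LD_LBU: return shifted & 0xFFu;
    default:     return word;                       /* full word (LW) */
    }
}
```

With word = 0x8001FF7F, a signed byte load at byte offset 1 returns 0xFFFFFFFF, while the unsigned version returns 0x000000FF.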

CSR

The Control and Status Registers (CSR) manage system-level operations and configurations, providing essential information about the processor state. These registers handle privileged operations such as interrupt processing, exception management, timer configuration, and system monitoring.

csr.io.stall := stall
csr.io.in := em_reg.csr_in
csr.io.cmd := em_reg.csr_cmd
csr.io.inst := em_reg.inst
csr.io.pc := em_reg.pc
csr.io.addr := em_reg.alu
csr.io.illegal := em_reg.illegal
csr.io.pc_check := em_reg.pc_check
csr.io.ld_type := em_reg.ld_type
csr.io.st_type := em_reg.st_type
io.host <> csr.io.host

The code above connects the MEM stage to the CSR module:

  • csr.io.stall: Notifies the CSR whether to pause operations based on the pipeline's execution status.
  • csr.io.cmd: Command signal specifying whether the CSR should read, write, or perform other actions.
  • csr.io.inst and csr.io.pc: Provide the current instruction and the corresponding program counter, enabling the CSR to process instruction-related tasks.
  • csr.io.addr: Supplies the address calculated by the ALU to the CSR.
  • csr.io.illegal: Flags whether the current instruction is illegal. If true, the CSR can trigger an exception or abort the operation.
  • csr.io.pc_check, csr.io.ld_type, and csr.io.st_type: Additional signals that assist the CSR in handling specific operations such as branch checking, load type determination, and store type processing.

Write Back

This stage writes the result back to regFile and drives the control signals of regFile.

Regfile Write

regWrite :=
  MuxLookup(mw_reg.wb_sel, mw_reg.alu.zext)(
    Seq(
      WB_MEM -> mw_reg.load,
      WB_PC4 -> (mw_reg.pc + 4.U).zext,
      WB_CSR -> mw_reg.csr_out.zext
    )
  ).asUInt

During the write-back operation, regWrite determines the final data to be written to the register file. The data source is selected by mw_reg.wb_sel:

  • WB_MEM: Writes back data loaded from memory (mw_reg.load).
  • WB_PC4: Writes back the value of PC + 4, the return address saved by jump-and-link instructions (JAL and JALR).
  • WB_CSR: Writes back the result of a CSR operation (mw_reg.csr_out).
  • By default, the ALU result (mw_reg.alu.zext) is written back.
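The same selection, expressed as a C switch for readers less familiar with MuxLookup. The WbSel enum values are a hypothetical encoding and the function name is illustrative.

```c
#include <stdint.h>

/* Hypothetical encoding; only the names echo the Chisel constants. */
typedef enum { WB_ALU, WB_MEM, WB_PC4, WB_CSR } WbSel;

/* Choose the value written to the register file. */
uint32_t select_writeback(WbSel sel, uint32_t alu, uint32_t load,
                          uint32_t pc, uint32_t csr_out) {
    switch (sel) {
    case WB_MEM: return load;      /* loaded data          */
    case WB_PC4: return pc + 4u;   /* link address         */
    case WB_CSR: return csr_out;   /* CSR read value       */
    default:     return alu;       /* ALU result (default) */
    }
}
```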

Regfile Control Signals

regFile.io.wen := mw_reg.wb_en && !stall
regFile.io.waddr := mw_rd_addr
regFile.io.wdata := regWrite

This section controls whether data is written to the register file:

  • wen: Enables write-back. The register file allows write operations only if the write-back enable signal (mw_reg.wb_en) is true and no stalls (!stall) occur.
  • waddr: Specifies the target register address for write-back (mw_rd_addr).
  • wdata: Specifies the data content to be written, selected via the regWrite logic.

Verifying Custom Programs

Installing Environment

To test the occurrence of hazards, we want to run several custom programs. However, before doing so, we need to set up the relevant environment.

To compile and assemble the custom programs, the RISC-V tools for the Privileged Architecture 1.7 toolchain need to be installed. Follow the instructions in Running Your Own Program on riscv-mini to set up the environment variables and install the toolchain.
However, this part of the process was not explained clearly. After some exploration, the following steps were used to set up the environment.

$ export RISCV=$HOME/riscv-tools
$ mkdir -p $RISCV

Then we ran the build-riscv-tools.sh script from the riscv-mini project under $HOME/riscv-tools.
The installation took about half an hour. Afterwards we found that riscv32-unknown-elf-gcc had not been built,
and the following error was shown:

/home/wu/riscv-tools/riscv-tools-priv1.7/riscv-gnu-toolchain/build/src/newlib-gcc/gcc/reload1.c:115:24: error: use of an operand of type ‘bool’ in ‘operator++’ is forbidden in C++17
  115 |   (this_target_reload->x_spill_indirect_levels)
      |   ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~

We suspect the compiler version is too new, so we downgraded it to GCC 10.

$ sudo apt install gcc-10 g++-10
$ export CC=gcc-10
$ export CXX=g++-10

This still did not work, so we tried to install the latest version of the toolchain instead of priv1.7.
However, installing the latest version took too much time, so we quickly abandoned this plan and pursued the following two plans in parallel.
The first plan was to download a precompiled toolchain and riscv-tests, then modify the Makefile in custom-bmark. This works for generating .vcd files but not .dump files, because the precompiled toolchain we found lacks riscv32-unknown-elf-objdump.
The second plan worked much better: we set up GCC 5.3 on Ubuntu 16.04 in Docker, which matches the RISC-V tools for priv 1.7.
The following steps set up the Docker environment.

$ sudo apt install docker.io
$ sudo docker run -it -v ~/riscv-mini:/riscv-mini ubuntu:16.04

In docker:

$ apt update
$ apt install gcc-5 g++-5

Set up the environment by installing the necessary tools:

# install gawk
$ apt update
$ apt install -y gawk

Install wget and curl:

$ apt install -y wget curl

Install git, texinfo, and file:

$ apt install -y git
$ apt install -y texinfo
$ apt install -y file

Then install the riscv-tools:

$ export RISCV=~/riscv-mini
$ ./build-riscv-tools.sh

Finally, we can test the custom programs.

Modify custom programs

riscv-mini/custom-bmark/
├── main.c
├── add.S
├── Makefile

In this layout, add.S provides a function that is called from main.c. To test a function file named test.S instead, first replace add.S with test.S on line 7 of the Makefile:

CUSTOM_BMARK_S_SRC ?= test.S crt.S

Then update the function name in main.c accordingly.

Run custom assembly code on riscv-mini

With everything set up, we can now run our own assembly code on the riscv-mini CPU.
In the custom-bmark directory, we can write our own assembly code in add.S.
Next, to compile the program, run make in custom-bmark to generate the binary, dump, and hex files.

/mnt/riscv_mini_extended/riscv-mini/VTile: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /mnt/riscv_mini_extended/riscv-mini/VTile)
/mnt/riscv_mini_extended/riscv-mini/VTile: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found (required by /mnt/riscv_mini_extended/riscv-mini/VTile)
/mnt/riscv_mini_extended/riscv-mini/VTile: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /mnt/riscv_mini_extended/riscv-mini/VTile)

Check Waveform

After running make in custom-bmark to generate the hex file, we moved it into ~/riscv-mini and used VTile to generate a .vcd waveform.

$ ./VTile <hex file> [<vcd file> 2> <log file>]

Take the following code as an example:

# add.S
.text
.align 2

.globl add
.type  add,@function

add:
  add a0, a0, a1
  ret

// main.c
int add(int a, int b);

int main(int argc, char** argv) {
  int res = add(3, 2);
  return res == 5 ? 6543 : -1;
}

We checked the waveforms of a0 and a1 in regFile at two points using GTKWave.
image
Here is the change during the execution of add a0, a0, a1, where 3+2=5.
image
And here are the signals just before completion. You can see that a0 ends at 0x198F, which is 6543 in decimal.
The return value of main can also be checked in VTile's terminal output.

wu@samsung:~/riscv-mini$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2646 (cycle 264)
TOHOST = 6543
Finishing simulation!

Testing Instruction Sequences That May Cause Hazards

Read-After-Write (RAW) Hazard

t0 is written in the first instruction but used in the next instruction.

raw:
  add t0, a0, a1
  add a0, t0, a0
  ret

int raw(int a, int b);

int main(int argc, char** argv) {
  int res = raw(13, 25);
  return res == 51 ? 6543 : 123;
}

The result is correct:

wu@samsung:~/riscv-mini$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2642 (cycle 264)
TOHOST = 6543
Finishing simulation!

Load-Use Hazard

A value is used immediately after it is loaded from memory.

load_use_hazard:
  lw t0, 0(a0)
  lw t1, 0(a1)
  add a0, t0, t1
  ret

int load_use_hazard(int* ptr1, int* ptr2);

int main(int argc, char** argv) {
  int data1 = 13;
  int data2 = 25;
  int res = load_use_hazard(&data1, &data2);
  return res == 38 ? 6543 : 123;
}

wu@samsung:~/riscv-mini$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2654 (cycle 265)
TOHOST = 6543
Finishing simulation!

The result equals 38, so the final output is 6543 and the result is correct.

Jump Branch Hazard

In this section, we test whether branches and jumps work correctly.
Testing the taken branch

branch_jump_hazard:
  add t0, a0, a1        
  bge t0, zero, skip    
  li a0, 1117             
  ret                   
skip:
  li a0, 8885              
  ret

int branch_jump_hazard(int a, int b);

int main(int argc, char** argv) {
  int res = branch_jump_hazard(13, 25);
  return res == 8885 ? 6543 : 123;
}

simulation result

Enabling waves...
Starting simulation!
Simulation completed at time 2644 (cycle 264)
TOHOST = 6543
Finishing simulation!

Testing the not-taken branch

branch_jump_hazard:
  add t0, a0, a1        
  bge t0, zero, skip    
  li a0, 1117             
  ret                   
skip:
  li a0, 8885              
  ret

int branch_jump_hazard(int a, int b);

int main(int argc, char** argv) {
  int res = branch_jump_hazard(13, -25);
  return res == 1117 ? 6543 : 123;
}

simulation result

Enabling waves...
Starting simulation!
Simulation completed at time 2650 (cycle 265)
TOHOST = 6543
Finishing simulation!

Result:
Both simulations return 6543, which means that in each case execution continued at the correct location in our test code.

Verify 3 RISC-V Programs from the Course Quiz

Quiz2 Problem D

Assembly code

.text
.align 2

.globl eg_mul
.type  eg_mul,@function

eg_mul:
    # Begin the main code in the text section
    add t0, x0, a0         # Load the first number (num1) into register t0
    add t1, x0, a1         # Load the second number (num2) into register t1
    li t2, 0            # Initialize the result (t2) to 0

loop:                                         
    # Check if the least significant bit of t0 (num1) is 1 (i.e., if the number is odd)
    andi t3, t0, 1
    beq t3, x0, skip_add  # If the bit is 0 (even), skip the addition
    # If the number is odd, add the value in t1 (num2) to the result in t2
    add t2, t2, t1 

skip_add:
    # Perform a right shift on t0 (num1), effectively dividing it by 2
    srli t0, t0, 1 # D02
    # Perform a left shift on t1 (num2), effectively multiplying it by 2
    slli t1, t1, 1 # D03
    # If t0 (num1) is not zero, repeat the loop
    bnez t0, loop

    # Store the final result in a0
    add a0, x0,t2
    ret

int eg_mul(int a, int b);

int main(int argc, char** argv) {
  int res = eg_mul(13, 7);
  return res;
}

$ ./VTile main.hex
Enabling waves...
Starting simulation!
Simulation completed at time 2678 (cycle 267)
TOHOST = 91
Finishing simulation!
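As a cross-check of the expected value, the same shift-and-add multiplication can be modeled in C. The function name eg_mul_model is hypothetical; the loop mirrors the assembly, including the fact that the loop condition is tested at the end of each iteration.

```c
#include <stdint.h>

/* Shift-and-add multiplication, following the structure of eg_mul. */
uint32_t eg_mul_model(uint32_t a, uint32_t b) {
    uint32_t result = 0;          /* t2 */
    do {
        if (a & 1u) {             /* andi t3, t0, 1 / beq t3, x0, skip_add */
            result += b;          /* add t2, t2, t1 */
        }
        a >>= 1;                  /* srli t0, t0, 1 */
        b <<= 1;                  /* slli t1, t1, 1 */
    } while (a != 0);             /* bnez t0, loop  */
    return result;
}
```

For the inputs used above, 13 is 0b1101, so the result is 7 + 28 + 56 = 91, which matches the TOHOST value reported by the simulation.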

Conclusion

In this project, we delved deep into the architecture of RISC-V Mini, starting from the fundamental components such as the Arithmetic Logic Unit (ALU), data cache (D-cache), instruction cache (I-cache), branch condition unit (BrCond), and control and status registers (CSR). We progressively analyzed their operational principles.

We placed particular emphasis on pipeline design and optimization. Beginning with a three-stage pipeline (IF, EX, WB), we gradually expanded it into a five-stage pipeline (IF, ID, EX, MEM, WB), conducting a detailed analysis of the performance improvements and increased design complexity resulting from this modification. Concurrently, we explored the stall mechanism within pipelines to address hazards like data dependencies, and the application of NOP instructions to fill pipeline bubbles.

To gain a deeper understanding of hazard types and handling methods, we systematically investigated various common hazard scenarios. Utilizing tools like Verilator and GTKWave, we closely observed signal waveforms in each stage, enabling us to intuitively comprehend the system's response when hazards occur. Finally, we applied our knowledge to practical cases, successfully verifying multiple instruction combinations that could potentially cause hazards on the five-stage pipeline version of RISC-V Mini. Additionally, we simulated the execution of a program from the course quiz to validate the correctness of our design.
