蕭維昭, 吳柏漢
riscv-mini is a basic 3-stage RISC-V pipeline implemented in Chisel. Its pipeline has only Fetch, Execute, and Write Back stages, which keeps the design and implementation straightforward and makes it well suited to teaching and prototyping.
However, compared with a 5-stage design, a 3-stage pipeline CPU may suffer from lower instruction throughput and a more constrained clock frequency, because each stage has to do more work than a stage in the 5-stage design.
As a result, in this project we address this problem by extending riscv-mini to a 5-stage CPU.
We based our work on the implementation by YangKefan-rk, exploring potential specific flaws in this extended implementation. We also attempted to address these issues to ensure robust handling of various hazard-related challenges.
From this diagram, we can see a 3-stage pipeline implementing a simplified RISC-V CPU. The pipeline is divided into three main stages:
Fetch, Execute, Write Back
Here, we follow the classic RISC architecture and plan to extend riscv-mini by splitting Execute into Decode and Execute, and splitting Write Back into Memory and Write Back.
The main code of Fetch can be written as below:
Here is the description:
pc := next_pc: Updates the program counter (pc) with the computed value of next_pc.
io.icache.req.bits.addr := next_pc: Sends the address of the next instruction (next_pc) to the instruction cache (io.icache.req.bits.addr).
io.icache.req.bits.data := 0.U: Sets the data field of the instruction cache request to 0.U (zero).
io.icache.req.bits.mask := 0.U: Sets the mask field of the instruction cache request to 0.U (zero).
io.icache.req.valid := !stall: Sets the valid signal of the instruction cache request to true if there is no stall (!stall).
io.icache.abort := false.B: Sets the abort signal of the instruction cache to false.
The value of next_pc is dynamically selected based on various conditions using MuxCase.
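To make the priority behaviour of MuxCase concrete, here is a minimal Python model of it (not Chisel). MuxCase returns the value paired with the first condition that is true and falls back to a default otherwise. The conditions and their order in select_next_pc (stall, exception, taken branch) and all parameter names are assumptions chosen for illustration; they are not copied from the datapath.

```python
def mux_case(default, cases):
    """Return the value of the first (condition, value) pair whose
    condition is true; return default if none is true."""
    for condition, value in cases:
        if condition:
            return value
    return default


def select_next_pc(pc, stall, exception, evec, taken, target):
    """Hypothetical next_pc selection, for illustration only.

    Assumed priority: stall > exception > taken branch/jump > pc + 4.
    """
    return mux_case(
        (pc + 4) & 0xFFFFFFFF,   # default: sequential fetch
        [
            (stall, pc),         # hold the current pc while stalled
            (exception, evec),   # redirect to the exception vector
            (taken, target),     # redirect to the branch/jump target
        ],
    )


if __name__ == "__main__":
    # A stall wins over a taken branch because it is listed first.
    print(hex(select_next_pc(0x200, True, False, 0x100, True, 0x300)))
```

Because the list is searched in order, moving a pair up or down changes which condition wins when several are true at once.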
This part also defines the NOP instruction.
At the end of this part, it handles the stall condition.
Assigns the fetched instruction (fd_reg.inst) to the control unit (io.ctrl.inst).
This code decodes the instruction to extract the source and destination register addresses. It then assigns the source register addresses (rs1_addr and rs2_addr) to the register file read ports (regFile.io.raddr1 and regFile.io.raddr2).
The purpose of this code is to generate the immediate value based on the instruction and control signals.
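As a reference for what immediate generation computes, the following Python sketch extracts the five standard RV32I immediate formats (I, S, B, U, J) from a 32-bit instruction word. The bit positions follow the RISC-V base ISA encoding. The function names are our own and do not correspond to signals in the Chisel ImmGen module.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as two's complement."""
    sign_bit = 1 << (width - 1)
    value &= (1 << width) - 1
    return (value ^ sign_bit) - sign_bit


def bits(inst, hi, lo):
    """Extract inst[hi:lo] (inclusive), like inst(hi, lo) in Chisel."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def imm_i(inst):
    return sign_extend(bits(inst, 31, 20), 12)


def imm_s(inst):
    return sign_extend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)


def imm_b(inst):
    raw = ((bits(inst, 31, 31) << 12) | (bits(inst, 7, 7) << 11)
           | (bits(inst, 30, 25) << 5) | (bits(inst, 11, 8) << 1))
    return sign_extend(raw, 13)


def imm_u(inst):
    return sign_extend(bits(inst, 31, 12) << 12, 32)


def imm_j(inst):
    raw = ((bits(inst, 31, 31) << 20) | (bits(inst, 19, 12) << 12)
           | (bits(inst, 20, 20) << 11) | (bits(inst, 30, 21) << 1))
    return sign_extend(raw, 21)


if __name__ == "__main__":
    print(imm_i(0xFFF50513))  # addi a0, a0, -1
    print(imm_b(0xFE000EE3))  # beq x0, x0, -4
```

In hardware the control unit selects one of these formats per instruction; here each format is a separate function so it can be checked against a known encoding.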
After describing the main blocks of the decode part, the signal flow follows this logic:
where rs1 and rs2 are the final values of the source registers after applying the forwarding logic.
The forward sources are described below:
alu.io.out: ALU result from the Execute stage (EX/DE hazard).
em_regWrite: Result from the Memory stage (MEM/DE hazard).
regWrite: Result from the Writeback stage (WB/DE hazard).
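The three forward sources above can be modelled as a priority selection in which the youngest in-flight result wins. The Python sketch below is a simplified model under assumptions: each stage is described by a hypothetical (write_enable, rd_addr, value) tuple, and x0 is never forwarded. The actual Chisel conditions may include further checks that are not shown here.

```python
def forward(rs_addr, rf_value, ex, mem, wb):
    """Return the value the Decode stage should use for register rs_addr.

    ex, mem and wb are (write_enable, rd_addr, value) tuples describing
    the instruction currently in the Execute, Memory and Writeback stage.
    The Execute result is the newest, so it is checked first.
    """
    if rs_addr == 0:
        return 0  # assumption: x0 is hard-wired to zero, never forwarded
    for write_enable, rd_addr, value in (ex, mem, wb):
        if write_enable and rd_addr == rs_addr:
            return value
    return rf_value  # no hazard: use the value read from the register file


if __name__ == "__main__":
    # x5 is being written in both EX and WB; the EX value (10) is newer.
    print(forward(5, 1, (True, 5, 10), (False, 0, 0), (True, 5, 30)))
```

The check order matters: if two older instructions write the same register, only the most recent write is architecturally visible to the instruction in Decode.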
When a reset signal (reset.asBool) or an exception (csr.io.expt) occurs, the Decode-Execute pipeline registers are cleared to prevent the propagation of invalid values. Under normal operation, these registers are updated with valid data, including ALU inputs (alu_a, alu_b) and operation (alu_op), control signals for memory access (st_type, ld_type), the writeback enable signal (wb_en), and the destination register (rs2). This ensures proper execution of instructions and maintains the integrity of the pipeline.
This part sets up the inputs and operation type for the ALU.
This section configures the inputs of the ALU (Arithmetic Logic Unit) with values from the decode stage register (de_reg):
Where:
alu_a and alu_b are the two operands for the ALU.
alu_op specifies the type of operation the ALU should perform.
woffset: Defines the word offset, which is determined by the ALU output. It takes one of four values: 0, 8, 16, or 24.
io.dcache.req.bits.addr: The address of the data.
io.dcache.req.valid: Indicates whether the data access request is valid.
io.dcache.req.bits.data: The specific data to be accessed.
io.dcache.req.bits.mask: Selects how many bytes are written: 4 bytes (0b1111), 2 bytes (0b11), a single byte (0b1), or none (0b0000).
The MEM section starts at line 321 in datapath.scala, where the Load operation is described.
Here, loffset calculates the offset from the address passed by the ALU, used for aligning memory load data.
Next, load_reg_valid checks whether the loaded data is valid. Its conditions are that the instruction cache response is not valid (io.icache.resp.valid is false), the data cache response is valid (io.dcache.resp.valid is true), and the current operation is not a no-op load type (LD_XXX).
The load state is then checked:
Since loading data from memory or cache takes time, load_state is defined to help determine the loading condition. When load_reg_valid is true, the load state is entered. If no stalls (!stall) occur, the state resets to false.B.
Next, data source selection and conversion are performed:
load_data selects the data source based on load_state. If in a load state, data stored in the register is used; otherwise, the data from the cache is used. After obtaining the data, it is right-shifted according to ld_type for proper alignment.
LD_LH and LD_LB indicate signed loading of halfword and byte, respectively.
LD_LHU and LD_LBU indicate unsigned loading of halfword and byte, respectively.
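The alignment and extension steps described above can be sketched in Python as follows. This is a behavioural model, not the Chisel code: the load types are represented as strings, the result is returned as a 32-bit pattern, and the offset is assumed to be the low two address bits multiplied by 8.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as two's complement."""
    sign_bit = 1 << (width - 1)
    value &= (1 << width) - 1
    return (value ^ sign_bit) - sign_bit


def load_value(word, addr, ld_type):
    """Extract the loaded value from a 32-bit cache word.

    word    : 32-bit word returned by the data cache
    addr    : byte address computed by the ALU
    ld_type : "LD_LW", "LD_LH", "LD_LB", "LD_LHU" or "LD_LBU"
    """
    loffset = (addr & 0x3) * 8           # byte offset in bits: 0, 8, 16, 24
    shifted = (word & 0xFFFFFFFF) >> loffset
    if ld_type == "LD_LH":               # signed halfword
        return sign_extend(shifted, 16) & 0xFFFFFFFF
    if ld_type == "LD_LB":               # signed byte
        return sign_extend(shifted, 8) & 0xFFFFFFFF
    if ld_type == "LD_LHU":              # unsigned halfword
        return shifted & 0xFFFF
    if ld_type == "LD_LBU":              # unsigned byte
        return shifted & 0xFF
    return shifted                       # LD_LW: whole word


if __name__ == "__main__":
    print(hex(load_value(0x8012F0AB, 0, "LD_LB")))   # signed byte 0xAB
    print(hex(load_value(0x8012F0AB, 0, "LD_LBU")))  # unsigned byte 0xAB
```

The signed and unsigned variants differ only in how the upper bits are filled, which is why the same shifted value feeds all four cases.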
The Control and Status Registers (CSR) manage system-level operations and configurations, providing essential information about the processor state. These registers handle privileged operations such as interrupt processing, exception management, timer configuration, and system monitoring.
In the MEM stage, the above code describes the interaction with the CSR.
csr.io.stall: Notifies the CSR whether to pause operations based on the pipeline's execution status.
csr.io.cmd: Command signal for executing instructions, specifying whether the CSR should read, write, or perform other actions.
csr.io.inst and csr.io.pc: Provide the current instruction and corresponding program counter, enabling the CSR to process instruction-related tasks.
csr.io.addr: Supplies the target address for CSR access, calculated by the ALU.
csr.io.illegal: Flags whether the current operation is illegal. If true, the CSR can trigger exceptions or abort the operation.
Additional signals such as csr.io.pc_check, csr.io.ld_type, and csr.io.st_type assist the CSR in handling specific operations like branch checking, load type determination, and store type processing.
In this stage, the code focuses on writing back to the regFile and on the control signals of the regFile.
During the write-back operation, regWrite determines the final data to be written to the register file:
Data source selection:
WB_MEM: Writes back data loaded from memory (mw_reg.load).
WB_PC4: Writes back the value of PC + 4, typically used for branch or jump instructions.
WB_CSR: Writes back the result of a CSR operation (mw_reg.csr_out).
By default, the ALU result (mw_reg.alu.zext) is written back.
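The data source selection above behaves like a lookup with a default. A minimal Python model, with the selector represented as a string and hypothetical parameter names standing in for the mw_reg fields, could look like this:

```python
def select_writeback(wb_sel, alu, load, pc, csr_out):
    """Choose the value written back to the register file.

    wb_sel is "WB_MEM", "WB_PC4", "WB_CSR" or anything else (ALU default).
    All values are treated as 32-bit patterns.
    """
    if wb_sel == "WB_MEM":
        return load                      # data loaded from memory
    if wb_sel == "WB_PC4":
        return (pc + 4) & 0xFFFFFFFF     # return address for jumps
    if wb_sel == "WB_CSR":
        return csr_out                   # result of a CSR operation
    return alu                           # default: ALU result


if __name__ == "__main__":
    print(hex(select_writeback("WB_PC4", 0x5, 0x26, 0x200, 0x0)))
```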
This section controls whether data is written to the register file:
Enable write-back (wen): The register file allows write operations only if the write-back enable signal (mw_reg.wb_en) is true and no stalls (!stall) occur.
Write-back address and data:
waddr: Specifies the target register address for write-back (mw_rd_addr).
wdata: Specifies the data content to be written, selected via the regWrite logic.
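The write-enable gating can be illustrated with a small Python register file model. The class and method names are hypothetical, and ignoring writes to x0 is an assumption based on the RISC-V convention that x0 always reads as zero; how the Chisel RegFile enforces this is not shown in this section.

```python
class RegFileModel:
    """Behavioural model of a 32-entry register file write port."""

    def __init__(self):
        self.regs = [0] * 32

    def write(self, wb_en, stall, waddr, wdata):
        """Apply one write-back cycle and return the effective wen."""
        wen = wb_en and not stall        # write only when enabled and not stalled
        if wen and waddr != 0:           # assumption: writes to x0 are dropped
            self.regs[waddr] = wdata & 0xFFFFFFFF
        return wen


if __name__ == "__main__":
    rf = RegFileModel()
    rf.write(True, False, 10, 5)   # a0 <- 5
    rf.write(True, True, 10, 9)    # stalled: a0 keeps its old value
    print(rf.regs[10])
```

Gating the write with !stall keeps a stalled instruction from committing its result before the pipeline is allowed to advance.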
To test the occurrence of hazards, we want to run several custom programs. However, before doing so, we need to set up the relevant environment.
To compile and assemble the custom programs, the RISC-V tools for the Privileged Architecture 1.7 toolchain need to be installed. Follow the instructions in Running Your Own Program on riscv-mini to set up the environment variables and install the toolchain.
However, this part of the process was not explained clearly. After some exploration, the following steps were used to set up the environment.
Then, we ran the build-riscv-tools.sh script from the riscv-mini project under $HOME/riscv-tools.
The installation takes about half an hour. Afterwards, we found that riscv32-unknown-elf-gcc had not been built.
The following error was shown:
We suspected that the compiler version was too new, so we downgraded it to GCC 10.
This still did not work, so we tried installing the latest version of the toolchain instead of priv 1.7.
However, installing the latest version took too much time. We quickly abandoned this plan and pursued the following two plans in parallel.
The first plan was to download a precompiled toolchain and riscv-tests, and then modify the Makefile in custom-bmark. This works for generating the .vcd file but not the .dump file, because the precompiled toolchain we found lacks riscv32-unknown-elf-objdump.
The second solution works much better: we installed GCC 5.3 and Ubuntu 16.04 in Docker, which match the RISC-V tools for priv 1.7 perfectly.
The steps for setting up the Docker environment are as follows.
In Docker:
Set up the environment.
Install the necessary tools.
Install wget and curl.
Install git, texinfo, and file.
Then install riscv-tools.
Finally, we can test the custom programs.
In this part, add.S is used as a function called by main.c. To test a function file named test.S, first replace the file name add.S with test.S on the 7th line of the Makefile.
Then update the function name in main.c accordingly.
After setting everything up, we can run our own assembly code on the riscv-mini CPU.
In the custom-bmark directory, we can edit our custom assembly code in add.S.
Next, to compile the program, run make in custom-bmark to generate the binary, dump, and hex files.
After running make in custom-bmark, we moved the generated hex file into ~/riscv-mini and used VTile to generate the .vcd waveform.
Take the following code as an example:
We checked the waveform at two points for a0 and a1 in regFile using GTKWave.
Here is the change during the execution of add a0, a0, a1, where 3+2=5.
And here is the signal just before completion. You can see that a0 ends at 198F (hexadecimal), which is 6543 in decimal.
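GTKWave shows register values in hexadecimal, so the conversion is worth checking once. As a quick sanity check in Python (the variable name is only illustrative):

```python
# 0x198F = 1*4096 + 9*256 + 8*16 + 15
a0_final = int("198F", 16)
print(a0_final)  # 6543
```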
VTile's terminal output can also be used to check the return value of main.
t0 is written by the first instruction and used by the very next instruction.
The result is correct:
After a load from memory, the loaded value is used immediately.
It can be seen that the loaded result equals 38 and the final output is 6543, so the result is correct.
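For reference, the textbook load-use condition in a classic five-stage pipeline can be written as a simple predicate. The Python sketch below is that generic check, not necessarily how the extended riscv-mini detects or resolves this case; all names are hypothetical.

```python
def load_use_hazard(ex_is_load, ex_rd, id_rs1, id_rs2):
    """Textbook load-use check for a classic five-stage pipeline.

    Returns True when the instruction in Execute is a load whose
    destination register is read by the instruction in Decode.
    """
    return bool(ex_is_load and ex_rd != 0 and ex_rd in (id_rs1, id_rs2))


if __name__ == "__main__":
    # lw t0, 0(a0) followed by add t1, t0, t0  (t0 = x5)
    print(load_use_hazard(True, 5, 5, 5))
    # add t0, a0, a1 followed by add t1, t0, t0: not a load, so no load-use hazard
    print(load_use_hazard(False, 5, 5, 5))
```

A loaded value is only available after the memory access completes, which is why this case is usually treated separately from ordinary ALU-to-ALU forwarding.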
In this section, we test whether branch jumps work correctly.
Testing the case where the branch is taken:
Simulation result:
Testing the case where the branch is not taken:
Simulation result:
Result:
Both simulations return 6543, which means that in both cases execution reaches the proper position in our test code.
Assembly code:
In this project, we delved deep into the architecture of RISC-V Mini, starting from the fundamental components such as the Arithmetic Logic Unit (ALU), data cache (D-cache), instruction cache (I-cache), branch condition unit (BrCond), and control and status registers (CSR). We progressively analyzed their operational principles.
We placed particular emphasis on pipeline design and optimization. Beginning with a three-stage pipeline (IF, EX, WB), we gradually expanded it into a five-stage pipeline (IF, ID, EX, MEM, WB), conducting a detailed analysis of the performance improvements and increased design complexity resulting from this modification. Concurrently, we explored the stall mechanism within pipelines to address hazards like data dependencies, and the application of NOP instructions to fill pipeline bubbles.
To gain a deeper understanding of hazard types and handling methods, we systematically investigated various common hazard scenarios. Utilizing tools like Verilator and GTKWave, we closely observed signal waveforms in each stage, enabling us to intuitively comprehend the system's response when hazards occur. Finally, we applied our knowledge to practical cases, successfully verifying multiple instruction combinations that could potentially cause hazards on the five-stage pipeline version of RISC-V Mini. Additionally, we simulated the execution of a classroom exam program to validate the correctness of our design.