Lab3: `srv32` - RISCV `RV32IM` Soft CPU

srv32 (MIT license): Simple 3-stage pipeline RISC-V processor

A simple RISC-V 3-stage pipeline processor featuring:

- Three-stage pipeline processor
- RV32IM instruction sets
- Passes the RV32IM compliance tests
- Trap exception
- Interrupt handler
- FreeRTOS support
- ISS simulator

Prerequisites

Install RISC-V toolchains. You can do either of the following:

Use the pre-built GNU toolchain from xPack GNU RISC-V Embedded GCC, then define the following environment variable in advance:
```
export CROSS_COMPILE=riscv-none-elf-
```

Build from source, taking Ubuntu Linux as an example:

sudo apt install autoconf automake autotools-dev curl gawk git \
                 build-essential bison flex texinfo gperf libtool patchutils bc \
                 libmpc-dev libmpfr-dev libgmp-dev zlib1g-dev libexpat1-dev
git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain
mkdir -p build && cd build
../configure --prefix=/opt/riscv --with-isa-spec=20191213 \
    --with-multilib-generator="rv32i-ilp32--;rv32im-ilp32--;rv32imac-ilp32--;rv32im_zicsr-ilp32--;rv32imac_zicsr-ilp32--;rv64imac-lp64--;rv64imac_zicsr-lp64--"
make -j$(nproc)

Install the dependent packages.

For macOS

brew install ccache verilator gawk lcov gtkwave

For Ubuntu Linux

sudo apt install build-essential lcov ccache libsystemc-dev

Never run apt install verilator on Ubuntu Linux: the default package provided by Ubuntu is too old to work here. Hence, we have to build Verilator from source. See Installation: Obtain Source → Auto Configure → Eventual Installation Options.

Assume you are in the home directory:

cd $HOME
git clone https://github.com/verilator/verilator
cd verilator
git checkout stable
export VERILATOR_ROOT=`pwd`
autoconf        # a git checkout has no ./configure until autoconf creates it
./configure
make

You don't have to run make install, because VERILATOR_ROOT points at the build tree.

Then, you can set environment variables in advance.

export VERILATOR_ROOT=$HOME/verilator
export PATH=$VERILATOR_ROOT/bin:$PATH

Make sure the Verilator version is 5.002 or later.

$ verilator --version
Verilator 5.002 2022-10-29 rev v5.002-29-gdb39d70c7

Get the Source

git clone https://github.com/sysprog21/srv32

Read the srv32 project page carefully.

Run RTL sim

The simulator generated by Verilator is called sim. It is an RTL-level simulator capable of simulating the execution of a RISC-V binary at the RTL level.
RTL (register-transfer level) simulation is done with either Verilator (default) or Icarus Verilog.
The RTL simulator is located in the sim/ directory.
- Result: the make all command builds the core and runs the RTL simulation; all simulations passed.

The command below generates the VCD/FST dump. You can browse the resulting wave.fst file with GTKWave.

cd sim && ./sim +dump

Check tobychui's note as well.

Run ISS sim

This repo also comes with a software RISC-V simulator (ISS) capable of simulating the execution of a RISC-V binary at the software level.
The source code of the ISS is located in the tools directory.
Result: various benchmarks, a hello-world program, and the riscv-compliance tests were run on the ISS.
- Benchmark
  - CoreMark score obtained: 2.681152 CoreMark/MHz
  - Dhrystone score obtained:
```
Number_Of_Runs: 100
User_Time: 31249 cycles, 26443 insn
Cycles_Per_Instruction: 1.181
Dhrystones_Per_Second_Per_MHz: 3200
DMIPS_Per_MHz: 1.821
```
- The result of RISC-V compliance tests will be covered in the following section.
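The Dhrystone figures in the log above are related by simple arithmetic, which the following sketch reproduces. The constant 1757 (Dhrystones per second of the VAX 11/780 reference machine) is the conventional DMIPS normalisation and is an assumption here, since the log itself does not state it; the function names are illustrative only.

```python
# Figures copied from the Dhrystone log above.
NUMBER_OF_RUNS = 100
USER_CYCLES = 31249
INSTRUCTIONS = 26443

# Assumption: conventional DMIPS normalisation (VAX 11/780 = 1757 Dhrystones/s).
VAX_DHRYSTONES_PER_SECOND = 1757


def cycles_per_instruction(cycles: int, instructions: int) -> float:
    """Average number of clock cycles spent per retired instruction."""
    return cycles / instructions


def dhrystones_per_second_per_mhz(runs: int, cycles: int) -> float:
    """At 1 MHz the core executes 1_000_000 cycles per second."""
    return runs * 1_000_000 / cycles


def dmips_per_mhz(runs: int, cycles: int) -> float:
    """Normalise the Dhrystone rate against the reference machine."""
    return dhrystones_per_second_per_mhz(runs, cycles) / VAX_DHRYSTONES_PER_SECOND


if __name__ == "__main__":
    print("CPI:", cycles_per_instruction(USER_CYCLES, INSTRUCTIONS))
    print("Dhrystones/s/MHz:", dhrystones_per_second_per_mhz(NUMBER_OF_RUNS, USER_CYCLES))
    print("DMIPS/MHz:", dmips_per_mhz(NUMBER_OF_RUNS, USER_CYCLES))
```

The computed values agree with the log once the log's truncation to three decimals (CPI) and to an integer (Dhrystones per second per MHz) is taken into account.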

Run RISC-V compliance test (v2.x)

There are two ways of running a RISC-V binary: the RTL simulator called sim, generated by Verilator, and the software RISC-V simulator called rvsim, located in the tools directory.

Run compliance tests on RTL simulator

This repo tests the compliance of the hardware implementation by comparing the outputs of both simulators. To be precise, when you type make tests in the ./tests directory, the compliance tests run on the RTL simulator (sim), and the output is compared with the reference output specified by riscv-compliance AND with the output of the software simulator (rvsim).

The memory dump of the RTL simulator, dump.txt, is renamed to *.signature.output and automatically compared with the reference output provided by the riscv-compliance repo.

The output of the RTL simulator (sim) is stored in trace.log, while the output of the software simulator (rvsim) is stored in trace_sw.log. These two files contain detailed information about each instruction, such as the value written to a register or to a memory address. The two files are compared with a diff --brief command.

In summary, the memory dump from the RTL simulator is first compared with the reference output; then the outputs of the RTL simulator and the software simulator are compared with each other. Note that a failure of the first comparison is reported by riscv-compliance, whereas a failure of the second comparison makes the make command fail (the error is raised by diff --brief).
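The two-step comparison can be sketched in a few lines of Python. This is only an illustration of the flow described above, not the repo's actual Makefile logic: the function names and return strings are hypothetical, and the byte-for-byte comparison stands in for what diff --brief reports.

```python
import filecmp


def files_match(path_a: str, path_b: str) -> bool:
    """Byte-for-byte comparison, similar in spirit to `diff --brief`."""
    # shallow=False forces the contents to be read instead of trusting os.stat().
    return filecmp.cmp(path_a, path_b, shallow=False)


def check_compliance(signature: str, reference: str,
                     rtl_trace: str, iss_trace: str) -> str:
    """Hypothetical helper mirroring the two comparisons in the text."""
    # Step 1: memory dump of the RTL simulator vs. the reference signature.
    if not files_match(signature, reference):
        return "signature mismatch"
    # Step 2: per-instruction trace of sim vs. the trace of rvsim.
    if not files_match(rtl_trace, iss_trace):
        return "trace mismatch"
    return "ok"
```

A caller would pass the paths of the *.signature.output file, the reference signature, trace.log, and trace_sw.log, in that order.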

Run compliance tests on SW simulator

When you type make tests-sw in the ./tests directory, the compliance tests run on the software simulator. The output is compared with the reference output provided by the riscv-compliance repo; the trace comparison is trivial in this mode because rvsim is compared with itself.

Result

make tests-sw

OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i
OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im
OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr

make tests

OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i
OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im
OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr

Analyze `srv32` RV32 core

Memory modeling

At the time of writing, the memory of srv32 is divided into instruction memory (I-MEM) and data memory (D-MEM). Both I-MEM and D-MEM are modelled using the mem2ports Verilog module as follows:

module mem2ports # (
    parameter SIZE  = 4096,
    parameter FILE  = "memory.bin"
) (
    input               clk,
    input               resetb,

    input               rready,
    input               wready,
    output reg          rresp,
    output reg  [31: 0] rdata,
    input       [31: 2] raddr,
    input       [31: 2] waddr,
    input       [31: 0] wdata,
    input       [ 3: 0] wstrb
);

Notice that the signals raddr and waddr are both 30 bits wide. The lower 2 bits are omitted because every read or write to memory is word-aligned (4-byte aligned).
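To make the word addressing concrete, here is a small Python model of a memory with the same port shape. It is a sketch under assumptions: the excerpt above shows only the port list, so the per-byte meaning of wstrb (bit n enables byte lane n) is the conventional interpretation rather than something confirmed by the source, and the class name WordMemory is made up for illustration.

```python
class WordMemory:
    """Toy model of a word-addressed memory with per-byte write strobes."""

    def __init__(self, size_bytes: int = 4096):
        # One list entry per 32-bit word.
        self.words = [0] * (size_bytes // 4)

    def read(self, byte_addr: int) -> int:
        # Dropping bits [1:0] mirrors the [31:2] address ports:
        # all four byte addresses inside a word select the same entry.
        return self.words[byte_addr >> 2]

    def write(self, byte_addr: int, wdata: int, wstrb: int) -> None:
        index = byte_addr >> 2
        word = self.words[index]
        for lane in range(4):
            # Assumption: strobe bit `lane` enables writing byte `lane`.
            if wstrb & (1 << lane):
                mask = 0xFF << (8 * lane)
                word = (word & ~mask) | (wdata & mask)
        self.words[index] = word & 0xFFFFFFFF
```

With this model, a byte or halfword store becomes a full-word access plus a strobe pattern, which is why the memory never needs the two low address bits.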

Pipeline architecture

srv32 is a 3-stage pipeline architecture with IF/ID, EX, and WB stages. The following diagram marks some important signals for later discussion.

[Figure: srv32 pipeline diagram with the signals discussed below]

Forwarding

Data hazard

srv32 supports full forwarding, which means RAW data hazards can be resolved WITHOUT stalling the processor. Only RAW hazards are possible here; WAW and WAR hazards cannot occur on a single-issue, in-order processor.

The implementation of register forwarding is as follows:

// register reading @ execution stage and register forwarding
// When the execution result accesses the same register,
// the execution result is directly forwarded from the previous
// instruction (at write back stage)
assign reg_rdata1[31: 0]    = (ex_src1_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src1_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src1_sel];
assign reg_rdata2[31: 0]    = (ex_src2_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src2_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src2_sel];

Consider the following instruction sequence:

| IF/ID | EX | WB |
| --- | --- | --- |
| `add x4, x5, x6` | `and x3, x2, x4` | `addi x2, x2, -3` |

The instruction and x3, x2, x4 at the EX stage and the instruction addi x2, x2, -3 at the WB stage have a RAW data hazard on register x2. The latest value of x2 (from addi x2, x2, -3) is held in signal wb_result at the WB stage. Since (wb_dst_sel == ex_src1_sel) is true and wb_mem2reg is false, wb_result is forwarded as the value of x2 to the EX stage (and x3, x2, x4), where it appears on reg_rdata1.

The timing diagram of the above instruction sequence is as follows:

| Instruction | c1 | c2 | c3 | c4 | c5 |
| --- | --- | --- | --- | --- | --- |
| `addi x2, x2, -3` | IF/ID | EX | WB⬂ | | |
| `and x3, x2, x4` | | IF/ID | EX⬃ | WB | |
| `add x4, x5, x6` | | | IF/ID | EX | WB |
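The selection logic of the forwarding assignments can be mirrored in Python to check the example by hand. This is a behavioural sketch of one operand read only; the function name read_operand is hypothetical, and the signal names are reused from the Verilog as plain arguments.

```python
def read_operand(regs, src_sel, wb_dst_sel, wb_alu2reg, wb_mem2reg,
                 wb_flush, wb_result, wb_rdata):
    """Return the operand value seen by the EX stage for register src_sel."""
    # x0 is hard-wired to zero and is never forwarded.
    if src_sel == 0:
        return 0
    # Forward when the instruction in WB writes the register being read
    # and has not been flushed.
    if (not wb_flush) and wb_alu2reg and wb_dst_sel == src_sel:
        # A load forwards the data read from D-MEM, anything else the ALU result.
        return wb_rdata if wb_mem2reg else wb_result
    # No hazard: use the value stored in the register file.
    return regs[src_sel]


if __name__ == "__main__":
    regs = [0] * 32
    regs[2] = 10  # stale x2, not yet updated by `addi x2, x2, -3`
    # `and x3, x2, x4` in EX reads x2 while `addi` (result 7) sits in WB.
    print(read_operand(regs, src_sel=2, wb_dst_sel=2, wb_alu2reg=True,
                       wb_mem2reg=False, wb_flush=False,
                       wb_result=7, wb_rdata=0))
```

In the example, the stale register-file value 10 is bypassed and the EX stage sees the fresh ALU result 7.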

Load-use hazard

A load-use hazard is NOT an issue in the srv32 core because D-MEM is read at the WB stage, and the register file is also read at the WB stage. A single MUX selects between the two operand sources (the register file and D-MEM), so a load-use hazard can be resolved WITHOUT stalling the processor.

Consider the following instruction sequence:

| IF/ID | EX | WB |
| --- | --- | --- |
| `add x4, x5, x6` | `and x3, x2, x4` | `lw x2 0(x5)` |

The instruction and x3, x2, x4 at the EX stage and the instruction lw x2 0(x5) at the WB stage have a load-use data hazard on register x2. The value of x2 is read from D-MEM in the WB stage and held in signal wb_rdata. Since (wb_dst_sel == ex_src1_sel) is true and wb_mem2reg is true, wb_rdata is forwarded as the value of x2 to the EX stage, where it appears on reg_rdata1.

The Verilog code is shown again for your reference:

assign reg_rdata1[31: 0]    = (ex_src1_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src1_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src1_sel];
assign reg_rdata2[31: 0]    = (ex_src2_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src2_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src2_sel];

The timing diagram of the above instruction sequence is as follows:

| Instruction | c1 | c2 | c3 | c4 | c5 |
| --- | --- | --- | --- | --- | --- |
| `lw x2 0(x5)` | IF/ID | EX | WB | | |
| `and x3, x2, x4` | | IF/ID | EX | WB | |
| `add x4, x5, x6` | | | IF/ID | EX | WB |

Branch penalty

Branch penalty is the number of instructions killed after a branch instruction when the branch is TAKEN. The branch outcome is resolved by the ALU at the end of the EX stage, so the instruction in IF/ID must be killed if the branch is taken. In this processor, however, the address of the next instruction (next PC) must also be fed into I-MEM one cycle ahead. By the time the branch is resolved, one instruction has already been fetched into the pipeline and the address of another has already been issued to I-MEM. Thus, the branch penalty for srv32 is 2: the two instructions following a taken branch are killed (that is, turned into NOPs).

Consider the following instruction sequence:

| | | IF/ID | EX | WB |
| --- | --- | --- | --- | --- |
| next_pc | fetch_pc (imem_addr) | `if_pc` | `ex_pc` | `wb_pc` |
| xxx | `add x4, x5, x6` | `and x3, x2, x4` | `beq x5, x6 (taken)` | |

(Notice that an additional row is inserted above the instructions. It lists the PC variables in the pipeline.)

The branch instruction beq x5, x6 (taken) is resolved by the END of the EX stage. By the time it is resolved, the two consecutive instructions and x3, x2, x4 and add x4, x5, x6 have already been fetched from I-MEM. These two instructions must be killed if the branch is taken.

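The timing diagram of the above instruction sequence is as follows:

| Instruction | c1 | c2 | c3 | c4 | c5 | c6 |
| --- | --- | --- | --- | --- | --- | --- |
| `beq x5, x6 (taken)` | IF/ID | EX | WB | | | |
| `and x3, x2, x4` | | NOP | NOP | NOP | | |
| `add x4, x5, x6` | | | NOP | NOP | NOP | |
| `exec if branch taken` | | | | IF/ID | EX | WB |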
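The timing diagram can be reproduced with a tiny cycle counter. This is an idealised sketch, assuming one instruction enters the pipeline per cycle, a fixed two-cycle bubble after every taken branch, and no other stalls; the function name total_cycles and its input format are made up for illustration.

```python
PIPELINE_STAGES = 3   # IF/ID, EX, WB
BRANCH_PENALTY = 2    # instructions killed after a taken branch


def total_cycles(taken_flags):
    """Cycle in which the last retired instruction leaves WB.

    taken_flags holds one bool per retired instruction, in program order:
    True if that instruction is a taken branch, False otherwise.
    """
    if not taken_flags:
        return 0
    fetch_cycle = 1  # the first instruction enters IF/ID in cycle 1
    for taken in taken_flags[:-1]:
        # The next useful instruction enters one cycle later, plus the
        # bubble slots that a taken branch leaves behind.
        fetch_cycle += 1 + (BRANCH_PENALTY if taken else 0)
    # The last instruction still has to travel through EX and WB.
    return fetch_cycle + (PIPELINE_STAGES - 1)


if __name__ == "__main__":
    # beq (taken) followed by the branch target, as in the diagram above.
    print(total_cycles([True, False]))
```

For the diagram above, the taken beq followed by its target finishes in cycle 6, whereas three independent instructions finish in cycle 5.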

Port `count_bits` into srv32

Because we are moving our code from Ripes to a bare-metal environment, we need to follow the calling convention carefully.

Thus, we should make sure that saved registers and temporary registers are all used properly.

The revised code can be found at 6cc195f

We can first debug our code on the ISS infrastructure. The result is:

$ tools/rvsim count_bits.elf
[0,1,1,2,1]

Excuting 4726 instructions, 5988 cycles, 1.267 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.000 s
Simulation cycles: 5988
Simulation speed : 15.008 MHz

Next, we run make count_bits.run under the sim/ directory, which copies the memory layout of count_bits into the directory and starts the simulation.

$ make count_bits.run
[0,1,1,2,1]

Excuting 4726 instructions, 5988 cycles, 1.267 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.068 s
Simulation cycles: 5999
Simulation speed : 0.0882206 MHz

The output matches the ISS result.

Run the following command to generate wave.fst (the +dump option is the one that produces the VCD/FST dump):

$ ./sim +dump

Analyze the waveform

Latency of instruction fetching

As the waveform shows, we can treat imem as a combinational circuit on this CPU:
the instruction at a given address (raddr) is available in the same clock period.

Reason of stall

Since srv32 has full bypassing, no stall results from data hazards. All NOPs are generated after control hazards happen.

So we should try to reduce the stalls caused by control hazards.

Control hazards

srv32 is a 3-stage pipeline architecture with F/D, E, and WB stages.

Its branch penalty is 2 cycles, since it has to flush the wrongly fetched instructions and then load the correct one.

In the following waveform, we can see that when the signal branch_taken is set, it generates a flush signal that takes 2 cycles to complete. The right instruction is then reloaded after the flush.

Performance Improvements

Branchless popcount

As discussed in the previous section, all stalls are caused by control hazards. To reduce the cycle count, we can try a branchless algorithm to compute popcount.

For comparison, we first count branch_taken in the original code: 632 branches are taken.

Then we apply the algorithm mentioned in amacc to revise our code. The code now becomes:

popcount:
    srli        t0, a0, 1
    li          t1, 0x55555555
    and         t0, t0, t1
    sub         a0, a0, t0
    li          t1, 0x33333333
    srli        t2, a0, 2
    and         t2, t2, t1
    and         a0, a0, t1
    add         a0, a0, t2
    srli        t0, a0, 4
    add         a0, a0, t0
    li          t0, 0x0f0f0f0f
    and         a0, a0, t0
    li          t0, 0x01010101
    mul         a0, a0, t0
    srli        a0, a0, 24
    ret

We can see from the output of the simulation:

[0,1,1,2,1]

Excuting 4781 instructions, 6023 cycles, 1.259 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.043 s
Simulation cycles: 6034
Simulation speed : 0.140326 MHz

Although the count of branch_taken is slightly reduced to 622, we need more instructions to compute the popcount of each number. As a result, the extra instructions cancel out the benefit of the branchless implementation, and the total cycle count increases from 5988 to 6023.
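To make the bit tricks in the branchless popcount routine easier to follow, here is a Python model of the same sequence of shifts, masks, and the final multiply. It is a sketch for checking the arithmetic, not a replacement for the assembly; the explicit 32-bit masking is needed only because Python integers are unbounded, and the function name popcount32 is illustrative.

```python
def popcount32(x: int) -> int:
    """Count the set bits of a 32-bit value without any branches."""
    x &= 0xFFFFFFFF
    # Each 2-bit field now holds the number of set bits it contained.
    x = x - ((x >> 1) & 0x55555555)
    # Sum neighbouring 2-bit fields into 4-bit fields.
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
    # Sum neighbouring 4-bit fields into bytes.
    x = (x + (x >> 4)) & 0x0F0F0F0F
    # The multiply accumulates all four byte counts into the top byte.
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24


if __name__ == "__main__":
    # The inputs 0..4 correspond to the list printed by the program.
    print([popcount32(i) for i in range(5)])
```

The routine executes a fixed number of instructions regardless of the input, which is why it removes branches but does not necessarily reduce the instruction count for small inputs.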

Loop unrolling

To eliminate the stalls caused by the for-loop, we simply unroll the loop body and rewrite the count_bits and print parts.

[0,1,1,2,1]

Excuting 4754 instructions, 5982 cycles, 1.258 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.000 s
Simulation cycles: 5982
Simulation speed : 14.918 MHz

As shown above, this modification saves 41 cycles (from 6023 to 5982). The count of branch_taken is reduced to 615.

Inline function

Instead of adding the inline keyword to the function prototype, I use a macro in place of the popcount call to avoid the jump to the function body. The macro behaves like the function it replaces but does not result in a function call: it is expanded at assembly time, at the cost of a larger code size.

.macro pop_cnt
    srli        t0, a0, 1
    li          t1, 0x55555555
    and         t0, t0, t1
    sub         a0, a0, t0
    li          t1, 0x33333333
    srli        t2, a0, 2
    and         t2, t2, t1
    and         a0, a0, t1
    add         a0, a0, t2
    srli        t0, a0, 4
    add         a0, a0, t0
    li          t0, 0x0f0f0f0f
    and         a0, a0, t0
    li          t0, 0x01010101
    mul         a0, a0, t0
    srli        a0, a0, 24
.endm

The simulation shows a further reduction of 30 cycles.

[0,1,1,2,1]

Excuting 4744 instructions, 5952 cycles, 1.255 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.000 s
Simulation cycles: 5952
Simulation speed : 15.030 MHz

The count of branch_taken is down to 605.
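As a sanity check on the numbers quoted in this section, the cycle counts can be compared against a simple model in which every instruction takes one cycle and every taken branch adds two bubble cycles. This is a back-of-the-envelope sketch, assuming the branch_taken counts cover every control transfer that flushes the pipeline; the dictionary keys are labels made up for illustration.

```python
BRANCH_PENALTY = 2  # cycles lost per taken branch on srv32

# (instructions, branch_taken count, reported cycles), copied from the text.
RUNS = {
    "original": (4726, 632, 5988),
    "branchless popcount": (4781, 622, 6023),
    "loop unrolling": (4754, 615, 5982),
    "macro instead of call": (4744, 605, 5952),
}


def estimated_cycles(instructions: int, taken_branches: int,
                     penalty: int = BRANCH_PENALTY) -> int:
    """One cycle per instruction plus a fixed penalty per taken branch."""
    return instructions + penalty * taken_branches


def compare(runs=RUNS):
    """Return (name, estimate, reported, difference) for every variant."""
    rows = []
    for name, (insn, taken, reported) in runs.items():
        estimate = estimated_cycles(insn, taken)
        rows.append((name, estimate, reported, estimate - reported))
    return rows


if __name__ == "__main__":
    for row in compare():
        print(row)
```

For all four variants the estimate lands within two cycles of the reported count, which supports the observation that taken branches account for essentially all of the cycles beyond one per instruction.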
