Try   HackMD

Lab3: srv32 - RISCV RV32IM Soft CPU

srv32MIT: Simple 3-stage pipeline RISC-V processor

A simple RISC-V 3-stage pipeline processor featuring:

  • Three-stage pipeline processor
  • RV32IM instruction sets
  • Pass RV32IM compliance test
  • Trap exception
  • Interrupt handler
  • FreeRTOS support
  • ISS simulator

Prerequisites

Install RISC-V toolchains. You can do either of the following:

  1. Use pre-built GNU Toolchain via xPack GNU RISC-V Embedded GCC. Then, you can define an environment variable in advance:
    ​​​​export CROSS_COMPILE=riscv-none-elf-
    
  2. Build from source. Take Ubuntu Linux for example.
    ​​​​sudo apt install autoconf automake autotools-dev curl gawk git \
    ​​​​                 build-essential bison flex texinfo gperf libtool patchutils bc git \
    ​​​​                 libmpc-dev libmpfr-dev libgmp-dev gawk zlib1g-dev libexpat1-dev
    ​​​​git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
    ​​​​cd riscv-gnu-toolchain
    ​​​​mkdir -p build && cd build
    ​​​​../configure --prefix=/opt/riscv --with-isa-spec=20191213 \
    ​​​​    --with-multilib-generator="rv32i-ilp32--;rv32im-ilp32--;rv32imac-ilp32--;rv32im_zicsr-ilp32--;rv32imac_zicsr-ilp32--;rv64imac-lp64--;rv64imac_zicsr-lp64--"
    ​​​​make -j$(nproc)
    

Install the dependent packages.

  • For macOS
    ​​​​brew install ccache verilator gawk lcov gtkwave
    
  • For Ubuntu Linux
    ​​​​sudo apt install build-essential lcov ccache libsystemc-dev
    

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
Never run apt install verilator on Ubuntu Linux, otherwise you may get older versions, which would not fit.

Package verilator is required, but the default package provided by Ubuntu Linux was too old. Hence, we have to build verilator from source. See Installation Obtain Source

Auto Configure
Eventual Installation Options

Assume you are in the home directory:

cd $HOME
git clone https://github.com/verilator/verilator
cd verilator
git checkout stable
export VERILATOR_ROOT=`pwd`
./configure
make

You don't have to run make install

Then, you can set environment variables in advance.

export VERILATOR_ROOT=$HOME/verilator
export PATH=$VERILATOR_ROOT/bin:$PATH

Make sure the version of Verilator >= 5.002.

$ verilator --version
Verilator 5.002 2022-10-29 rev v5.002-29-gdb39d70c7

Get the Source

git clone https://github.com/sysprog21/srv32

Read the srv32 project page carefully.

Run RTL sim

  • The simulator generated by Verilator is called sim. This is a RTL-level generator that is capable of simulating the execution of RISC-V binary at RTL level.
  • RTL (Register-Transistor-Level) simulation is done on either Verilator (default) or Icarus Verilog.
  • RTL simulator is located in sim/ directory.
    • Result make all command build the core and run RTL sim, all simulation passed.

The command below will generate the VCD/FST dump. You can browse file wave.fst via GTKWave.

cd sim && ./sim +dump

Check tobychui's note as well.

Run ISS sim

  • This repo also comes with a software RISC-V simulator that is capable of simulating the execution of RISC-V binary in software level.
  • The source code of ISS simulator is located in tools directory.
  • Result Various of benchmarks/hello world program/riscv-compliance tests are run on the ISS simulator.
    • Benchmark
      • Coremark score obtained: 2.681152 CoreMark/MHz
      • Dhrystone score obtained:
        ​​​​​​Number_Of_Runs: 100
        ​​​​​​User_Time: 31249 cycles, 26443 insn
        ​​​​​​Cycles_Per_Instruction: 1.181
        ​​​​​​Dhrystones_Per_Second_Per_MHz: 3200
        ​​​​​​DMIPS_Per_MHz: 1.821
        
    • The result of RISC-V compliance tests will be covered in the following section.

Run RISC-V compliance test (v2.x)

There are two ways of running RISC-V binary. Namely, the RTL simulator called sim generated by Verilator. Another one is the software RISC-V simulator called rvsim located in tools directory.

Run compliance tests on RTL simulator

This repo test the compliance of hardware implementation by comparing the output results running on both simulator. To be precise, when type make tests in ./tests directory, compliance tests will be run on the RTL simulator (sim) and the output will be compared with the reference output specified by riscv-compliance AND the output of software simulator (rvsim).

The memory dump of RTL simulator dump.txt will be renamed to *.signature.output and will be automatically compared to the reference output provided by riscv-compliance repo.

The output of RTL simulator (sim) is stored in trace.log file while the output of software simulator (rvsim) is stored in trace_sw.log file. These two files contains detailed information of each instruction such as: value write to a certain register, value write to a certain memory address etc. These two files will be compared through a diff --brief command.

In summary, the memory dump files from RTL simulator will be compared with the reference output. Then, the output between RTL simulator and software simulator will be compared. Notice if the first comparison fails, the error will be signaled by riscv-compliance ; however, a failure on second comparison will results in a failed make command (The second failure is raised by diff --brief command).

Run compliance tests on SW simulator

When one types make tests-sw in ./tests directory, compliance tests will be run on the software simulator. The output results compared with itself AND the the reference output provided by riscv-compliance repo.

  • Result
    • make tests-sw
    ​​OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i
    ​​OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im
    ​​OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr
    
    • make tests
    ​​OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i
    ​​OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im
    ​​OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr
    

Analyze srv32 RV32 core

Memory modeling

As the time of writing, the memory of srv32 is divided into instruction memory (I-MEM) and data memory (D-MEM). Both I-MEM and D-MEM are modelled using mem2ports verilog module as follow:

module mem2ports # (
    parameter SIZE  = 4096,
    parameter FILE  = "memory.bin"
) (
    input               clk,
    input               resetb,

    input               rready,
    input               wready,
    output reg          rresp,
    output reg  [31: 0] rdata,
    input       [31: 2] raddr,
    input       [31: 2] waddr,
    input       [31: 0] wdata,
    input       [ 3: 0] wstrb
);

Notice the signal raddr and waddr are both 30 bits long. The omission of lower 2 bits shows read or write to memory are word-aligned or 4-byte aligned.

Pipeline architecture

srv32 is a 3-stage pipeline architecture with IF/ID, EX, WB stages. The follwing diagram marks some important signals for later discussion.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Forwarding

Data hazard

srv32 supports full forwarding, which means RAW data hazard can be resolved WITHOUT stalling the processor. Notice only RAW data hazard is possible, other hazard (WAW, WAR) isn't possible on single issue processor.

The implementation of register forwarding is as follow:

// register reading @ execution stage and register forwarding
// When the execution result accesses the same register,
// the execution result is directly forwarded from the previous
// instruction (at write back stage)
assign reg_rdata1[31: 0]    = (ex_src1_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src1_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src1_sel];
assign reg_rdata2[31: 0]    = (ex_src2_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src2_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src2_sel];

Consider the following instruction sequence:

IF/ID EX WB
add x4, x5, x6 and x3, x2, x4 addi x2, x2, -3

Instruction and x3, x2, x4 at EX stage and instruction addi x2, x2, -3 at WB stage have RAW data hazard on register x2. The latest result of x2 (from addi x2, x2, -3) is stored in signal wb_result at WB stage. Since (wb_dst_sel == ex_src1_sel) is true and wb_mem2reg is false. wb_result is forward to x2 register in EX stage (and x3, x2, x4). The value of x2 in EX stage is stored in reg_rdata1.

The timing diagram of the above instruction sequence is as follow:

Instruction cycle 1 c2 c3 c4 c5
addi x2, x2, -3 IF/ID EX WB
and x3, x2, x4 IF/ID EX WB
add x4, x5, x6 IF/ID EX WB

Load-use hazard

Load-use hazard is NOT an issue in srv32 core because D-MEM is read at WB stage, and register file is also read at WB stage. A single MUX is used to switch between 2 operands (operand from register file and operand from D-MEM). Load-use hazard can be resolved WITHOUT stalling the processor.

Consider the following instruction sequence:

IF/ID EX WB
add x4, x5, x6 and x3, x2, x4 lw x2 0(x5)

Instruction and x3, x2, x4 at EX stage and instruction lw x2 0(x5) at WB stage have load-use data hazard on register x2. The result of x2 is read from D-MEM in WB stage and stored in signal wb_rdata. Since (wb_dst_sel == ex_src1_sel) is true and wb_mem2reg is true. wb_rdata is forward to x2 register in EX stage. The value of x2 in EX stage is reg_rdata1.

The verilog code is shown again for your reference:

assign reg_rdata1[31: 0]    = (ex_src1_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src1_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src1_sel];
assign reg_rdata2[31: 0]    = (ex_src2_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src2_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src2_sel];

The timing diagram of the above instruction sequence is as follow:

Instruction cycle 1 c2 c3 c4 c5
lw x2 0(x5) IF/ID EX WB
and x3, x2, x4 IF/ID EX WB
add x4, x5, x6 IF/ID EX WB

Branch penalty

Branch penalty is the number of instructions killed after a branch instruction if a branch is TAKEN. Branch result is resolved at the end EX stage by ALU so the instruction fetch in IF/ID might need to be killed if a branch is taken. In this processor; however, the address of next instruction (next PC) should be fed into I-MEM a cycle ahead. Thus, the branch penalty for srv32 is 2. To clarify, by the time next PC is resolved, one instruction has been fetch into pipeline and another PC has been calculated because address should be computed one cycle ahead. The number of instructions that should be killed (a.k.a. set to NOP) is 2 instruction after a branch instruction if the branch is actually taken.

Consider the following instruction sequence:

IF/ID EX WB
next_pc fetch_pc (imem_addr) if_pc ex_pc wb_pc
xxx add x4, x5, x6 and x3, x2, x4 beq x5, x6 (taken)

(Notice an additional column is inserted above the instruction. These are the PC variables in pipeline)

Branch instruction beq x5, x6 (taken) is resolved by the END of EX stage. By the time branch instruction is resolved, two consequtive instructions, namely add x4, x5, x6 and and x3, x2, x4 will be fetched from I-MEM. These two instructions should be killed if branch is taken.

The timing diagram of the above instruction sequence is as follow:

Instruction c1 c2 c3 c4 c5 c6
beq x5, x6 (taken) IF/ID EX WB
and x3, x2, x4 NOP NOP NOP
add x4, x5, x6 NOP NOP NOP
exec if branch taken IF/ID EX WB

Port count_bits into srv32

Because we now move our code from Ripes to bare metal environment, we need to follow the calling convention carefully.

Thus we should ensure that all we use saved registers and temporary registers properly.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
The revised code can be found at 6cc195f

We can first debug our code on ISS infrastrature. The result is:

$ tools/rvsim count_bits.elf
[0,1,1,2,1]

Excuting 4726 instructions, 5988 cycles, 1.267 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.000 s
Simulation cycles: 5988
Simulation speed : 15.008 MHz

Next, we enter make count_bits.run under sim/ directory to copy the memory layout of count_bits to the directory and start to simulate the result.

$ make count_bits.run
[0,1,1,2,1]

Excuting 4726 instructions, 5988 cycles, 1.267 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.068 s
Simulation cycles: 5999
Simulation speed : 0.0882206 MHz

We can get the result similar to ISS one.

Run the following command to generate wave.fst

$ ./sim +trace

Analyze the waveform

Latency of instrcution fetching

As the below waveform shows, we can consider imem is a combinatial circuit on this CPU.
We can retrieve the instruction with given address (raddr) in the same clock period.

Reason of stall

Since srv32 does have fully bypassing, there is no stall resulting from data hazards. All nops are generated after control hazards happended.

So we should try to reduce the stalls because of control hazards.

Control hazards

srv32 is a 3 staged pipeline architecture. It has F/D, E, and WB stages

Its branch penalty will be 2 cycles since it has to flush the incorrect instruction in F/D and then load the new instruction.

In the following waveform, we can see that when the signal branch_taken is set, it generates a flush signal which takes 2 cycles to complete. Then reload the right instruction after flushing.

Performance Improvements

Branchless popcount

According to the section we discussed before, we know that the stalls are all generated by control hazards. To reduce the executing cycles, we can try to apply branchless algorithm to compute popcount

For comparison, we first compute the branch_taken count in the original code: there are 632 branches been taken.

Then we apply the algorithm mentioned in amacc to revise our code. The code is now changed to:

popcount:
    srli        t0, a0, 1
    li          t1, 0x55555555
    and         t0, t0, t1
    sub         a0, a0, t0
    li          t1, 0x33333333
    srli        t2, a0, 2
    and         t2, t2, t1
    and         a0, a0, t1
    add         a0, a0, t2
    srli        t0, a0, 4
    add         a0, a0, t0
    li          t0, 0x0f0f0f0f
    and         a0, a0, t0
    li          t0, 0x01010101
    mul         a0, a0, t0
    srli        a0, a0, 24
    ret

We can see from the output of the simulation:

[0,1,1,2,1]

Excuting 4781 instructions, 6023 cycles, 1.259 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.043 s
Simulation cycles: 6034
Simulation speed : 0.140326 MHz

Although the count of branch_taken is slightly reduced to 622, we need more instructions to compute the popcount for each numbers. In result, it cancels out the benefit of branchless implementation and cause the total executing cycles to increase.

Loop unrolling

In order to eliminate the stall caused by for-loop, we simply expand the loop body and rewrite the count_bits and print part.

[0,1,1,2,1]

Excuting 4754 instructions, 5982 cycles, 1.258 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.000 s
Simulation cycles: 5982
Simulation speed : 14.918 MHz

As the result shown above, we get 41 cycles reduced after this modification. The count of branch_taken is eliminated to 615

Inline function

Instead of adding inline prefix in function prototype, I use a macro to replace the use of popcount to reduce the jump instruction to function body. The behavior of the macro is similar to count_bits function, but it will not result in function call. It expand in compiling time and increase the code size instead.

.macro pop_cnt
    srli        t0, a0, 1
    li          t1, 0x55555555
    and         t0, t0, t1
    sub         a0, a0, t0
    li          t1, 0x33333333
    srli        t2, a0, 2
    and         t2, t2, t1
    and         a0, a0, t1
    add         a0, a0, t2
    srli        t0, a0, 4
    add         a0, a0, t0
    li          t0, 0x0f0f0f0f
    and         a0, a0, t0
    li          t0, 0x01010101
    mul         a0, a0, t0
    srli        a0, a0, 24
.endm

In the result of simulation, we receive another 30 cycles reduction.

[0,1,1,2,1]

Excuting 4744 instructions, 5952 cycles, 1.255 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.000 s
Simulation cycles: 5952
Simulation speed : 15.030 MHz

The count of branch_taken is down to 605.