--- tags: computer-arch --- # Lab3: `srv32` - RISCV `RV32IM` Soft CPU ## [srv32](https://github.com/sysprog21/srv32)^MIT^: Simple 3-stage pipeline RISC-V processor A simple RISC-V 3-stage pipeline processor featuring: * Three-stage pipeline processor * RV32IM instruction sets * Pass RV32IM compliance test * Trap exception * Interrupt handler * [FreeRTOS](https://www.freertos.org/) support * ISS simulator ## Prerequisites Install RISC-V toolchains. You can do either of the following: 1. Use pre-built GNU Toolchain via [xPack GNU RISC-V Embedded GCC](https://xpack.github.io/riscv-none-elf-gcc/releases/). Then, you can define an environment variable in advance: ```shell export CROSS_COMPILE=riscv-none-elf- ``` 2. Build from source. Take Ubuntu Linux for example. ```shell sudo apt install autoconf automake autotools-dev curl gawk git \ build-essential bison flex texinfo gperf libtool patchutils bc git \ libmpc-dev libmpfr-dev libgmp-dev gawk zlib1g-dev libexpat1-dev git clone --recursive https://github.com/riscv/riscv-gnu-toolchain cd riscv-gnu-toolchain mkdir -p build && cd build ../configure --prefix=/opt/riscv --with-isa-spec=20191213 \ --with-multilib-generator="rv32i-ilp32--;rv32im-ilp32--;rv32imac-ilp32--;rv32im_zicsr-ilp32--;rv32imac_zicsr-ilp32--;rv64imac-lp64--;rv64imac_zicsr-lp64--" make -j$(nproc) ``` Install the dependent packages. * For macOS ```shell brew install ccache verilator gawk lcov gtkwave ``` * For Ubuntu Linux ```shell sudo apt install build-essential lcov ccache libsystemc-dev ``` :::danger :warning: Never run ==`apt install verilator`== on Ubuntu Linux, otherwise you may get older versions, which would not fit. ::: Package `verilator` is required, but the default package provided by Ubuntu Linux was too old. Hence, we have to build `verilator` from source. See [Installation](https://verilator.org/guide/latest/install.html) `Obtain Source` $\to$ `Auto Configure` $\to$ `Eventual Installation Options` Assume you are in the home directory: ```shell cd $HOME git clone https://github.com/verilator/verilator cd verilator git checkout stable export VERILATOR_ROOT=`pwd` ./configure make ``` :::info You don't have to run `make install` ::: Then, you can set environment variables in advance. ```shell export VERILATOR_ROOT=$HOME/verilator export PATH=$VERILATOR_ROOT/bin:$PATH ``` Make sure the version of Verilator >= `5.002`. ``` $ verilator --version Verilator 5.002 2022-10-29 rev v5.002-29-gdb39d70c7 ``` ## Get the Source ```shell git clone https://github.com/sysprog21/srv32 ``` Read the [srv32 project page](https://github.com/sysprog21/srv32) carefully. ## Run RTL sim - The simulator generated by Verilator is called `sim`. This is a RTL-level generator that is capable of simulating the execution of RISC-V binary at RTL level. - RTL (Register-Transistor-Level) simulation is done on either [Verilator](https://www.veripool.org/wiki/verilator) (default) or Icarus Verilog. - RTL simulator is located in `sim/` directory. - **Result** `make all` command build the core and run RTL sim, all simulation passed. The command below will generate the VCD/FST dump. You can browse file `wave.fst` via [GTKWave](http://gtkwave.sourceforge.net/). ```shell cd sim && ./sim +dump ``` Check [tobychui's note](https://hackmd.io/@wIVnCcUaTouAktrkMVLEMA/Hy6vD5DtF) as well. ## Run ISS sim - This repo also comes with a software RISC-V simulator that is capable of simulating the execution of RISC-V binary in software level. - The source code of ISS simulator is located in `tools` directory. - **Result** Various of benchmarks/hello world program/riscv-compliance tests are run on the ISS simulator. - Benchmark - Coremark score obtained: `2.681152 CoreMark/MHz` - Dhrystone score obtained: ```c Number_Of_Runs: 100 User_Time: 31249 cycles, 26443 insn Cycles_Per_Instruction: 1.181 Dhrystones_Per_Second_Per_MHz: 3200 DMIPS_Per_MHz: 1.821 ``` - The result of RISC-V compliance tests will be covered in the following section. ## Run RISC-V compliance test (v2.x) There are two ways of running RISC-V binary. Namely, the RTL simulator called `sim` generated by Verilator. Another one is the software RISC-V simulator called `rvsim` located in `tools` directory. ### Run compliance tests on RTL simulator This repo test the compliance of hardware implementation by comparing the output results running on both simulator. To be precise, when type `make tests` in `./tests` directory, compliance tests will be run on the RTL simulator (`sim`) and the output will be compared with the reference output specified by `riscv-compliance` AND the output of software simulator (`rvsim`). The memory dump of RTL simulator `dump.txt` will be renamed to `*.signature.output` and will be automatically compared to the reference output provided by `riscv-compliance` repo. The output of RTL simulator (`sim`) is stored in `trace.log` file while the output of software simulator (`rvsim`) is stored in `trace_sw.log` file. These two files contains detailed information of each instruction such as: value write to a certain register, value write to a certain memory address etc. These two files will be compared through a `diff --brief` command. In summary, the memory dump files from RTL simulator will be compared with the reference output. Then, the output between RTL simulator and software simulator will be compared. Notice if the first comparison fails, the error will be signaled by `riscv-compliance` ; however, a failure on second comparison will results in a failed `make` command (The second failure is raised by `diff --brief` command). ### Run compliance tests on SW simulator When one types `make tests-sw` in `./tests` directory, compliance tests will be run on the software simulator. The output results compared with itself AND the the reference output provided by `riscv-compliance` repo. - **Result** - `make tests-sw` ```c OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr ``` - `make tests` ```c OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr ``` --- ## Analyze `srv32` RV32 core ### Memory modeling As the time of writing, the memory of `srv32` is divided into instruction memory (I-MEM) and data memory (D-MEM). Both I-MEM and D-MEM are modelled using `mem2ports` verilog module as follow: ```c module mem2ports # ( parameter SIZE = 4096, parameter FILE = "memory.bin" ) ( input clk, input resetb, input rready, input wready, output reg rresp, output reg [31: 0] rdata, input [31: 2] raddr, input [31: 2] waddr, input [31: 0] wdata, input [ 3: 0] wstrb ); ``` Notice the signal `raddr` and `waddr` are both 30 bits long. The omission of lower 2 bits shows read or write to memory are **word-aligned** or **4-byte aligned**. ### Pipeline architecture `srv32` is a 3-stage pipeline architecture with IF/ID, EX, WB stages. The follwing diagram marks some important signals for later discussion. ![](https://i.imgur.com/9lbFKBM.jpg) ### Forwarding #### Data hazard `srv32` supports full forwarding, which means RAW data hazard can be resolved WITHOUT stalling the processor. Notice only RAW data hazard is possible, other hazard (WAW, WAR) isn't possible on single issue processor. The implementation of register forwarding is as follow: ```c // register reading @ execution stage and register forwarding // When the execution result accesses the same register, // the execution result is directly forwarded from the previous // instruction (at write back stage) assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src1_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src1_sel]; assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src2_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src2_sel]; ``` Consider the following instruction sequence: | IF/ID | EX | WB | | ---------------- | ---------------- | ----------------- | | `add x4, x5, x6` | `and x3, x2, x4` | `addi x2, x2, -3` | Instruction `and x3, x2, x4` at EX stage and instruction `addi x2, x2, -3` at WB stage have RAW data hazard on register `x2`. The latest result of `x2` (from `addi x2, x2, -3`) is stored in signal `wb_result` at WB stage. Since `(wb_dst_sel == ex_src1_sel)` is true and `wb_mem2reg` is false. `wb_result` is forward to `x2` register in EX stage (`and x3, x2, x4`). The value of `x2` in EX stage is stored in `reg_rdata1`. The timing diagram of the above instruction sequence is as follow: | Instruction |cycle 1| c2 | c3 | c4 | c5 | | ----------------- | ----- | --- | --- | --- | --- | | `addi x2, x2, -3` | IF/ID | EX |**WB**⬂| | | | `and x3, x2, x4` | |IF/ID|**EX**⬃| WB | | | `add x4, x5, x6` | | |IF/ID| EX | WB | #### Load-use hazard Load-use hazard is NOT an issue in `srv32` core because D-MEM is read at WB stage, and register file is also read at WB stage. A single MUX is used to switch between 2 operands (operand from register file and operand from D-MEM). Load-use hazard can be resolved WITHOUT stalling the processor. ![](https://i.imgur.com/naYkEhn.png) Consider the following instruction sequence: | IF/ID | EX | WB | | ---------------- | ---------------- | ----------------- | | `add x4, x5, x6` | `and x3, x2, x4` | `lw x2 0(x5)` | Instruction `and x3, x2, x4` at EX stage and instruction `lw x2 0(x5)` at WB stage have load-use data hazard on register `x2`. The result of `x2` is read from D-MEM in WB stage and stored in signal `wb_rdata`. Since `(wb_dst_sel == ex_src1_sel)` is true and `wb_mem2reg` is true. `wb_rdata` is forward to `x2` register in EX stage. The value of `x2` in EX stage is `reg_rdata1`. The verilog code is shown again for your reference: ```c assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src1_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src1_sel]; assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src2_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src2_sel]; ``` The timing diagram of the above instruction sequence is as follow: | Instruction |cycle 1| c2 | c3 | c4 | c5 | | ----------------- | ----- | --- | --- | --- | --- | | `lw x2 0(x5)` | IF/ID | EX | WB | | | | `and x3, x2, x4` | |IF/ID| EX | WB | | | `add x4, x5, x6` | | |IF/ID| EX | WB | ### Branch penalty Branch penalty is the number of instructions killed after a branch instruction if a branch is TAKEN. Branch result is resolved at the end EX stage by ALU so the instruction fetch in IF/ID might need to be killed if a branch is taken. In this processor; however, the address of next instruction (next PC) should be fed into I-MEM a cycle ahead. Thus, the branch penalty for `srv32` is 2. To clarify, by the time next PC is resolved, one instruction has been fetch into pipeline and another PC has been calculated because address should be computed one cycle ahead. The number of instructions that should be killed (a.k.a. set to NOP) is 2 instruction after a branch instruction if the branch is actually taken. Consider the following instruction sequence: | | | IF/ID | EX | WB | | ------- | ------------------ | ------------------ | ------------------ | ----- | | next_pc |fetch_pc (imem_addr)|`if_pc` | `ex_pc` |`wb_pc`| | xxx |`add x4, x5, x6` |`and x3, x2, x4` |`beq x5, x6 (taken)`| | (Notice an additional column is inserted above the instruction. These are the PC variables in pipeline) Branch instruction `beq x5, x6 (taken)` is resolved by the END of EX stage. By the time branch instruction is resolved, two consequtive instructions, namely `add x4, x5, x6` and `and x3, x2, x4` will be fetched from I-MEM. These two instructions should be killed if branch is taken. The timing diagram of the above instruction sequence is as follow: | Instruction | c1 | c2 | c3 | c4 | c5 | c6 | | --------------------- | --- | --- | --- | --- | --- | --- | |`beq x5, x6 (taken)` |IF/ID| EX | WB | | | | | `and x3, x2, x4` | | NOP | NOP | NOP | | | | `add x4, x5, x6` | | | NOP | NOP | NOP | | | `exec if branch taken`| | | |IF/ID| EX | WB | --- ## Port `count_bits` into srv32 Because we now move our code from Ripes to bare metal environment, we need to follow the calling convention carefully. Thus we should ensure that all we use saved registers and temporary registers properly. :::info :pencil: The revised code can be found at [`6cc195f`](https://github.com/OscarShiang/srv32/blob/6cc195f/sw/count_bits/count_bits.s) ::: We can first debug our code on ISS infrastrature. The result is: ```shell $ tools/rvsim count_bits.elf [0,1,1,2,1] Excuting 4726 instructions, 5988 cycles, 1.267 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 5988 Simulation speed : 15.008 MHz ``` Next, we enter `make count_bits.run` under `sim/` directory to copy the memory layout of `count_bits` to the directory and start to simulate the result. ```shell $ make count_bits.run [0,1,1,2,1] Excuting 4726 instructions, 5988 cycles, 1.267 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.068 s Simulation cycles: 5999 Simulation speed : 0.0882206 MHz ``` We can get the result similar to ISS one. Run the following command to generate `wave.fst` ```shell $ ./sim +trace ``` ## Analyze the waveform ### Latency of instrcution fetching As the below waveform shows, we can consider `imem` is a combinatial circuit on this CPU. We can retrieve the instruction with given address (`raddr`) in the same clock period. ![](https://i.imgur.com/fAfJYwe.png) ### Reason of stall Since srv32 does have fully bypassing, there is no stall resulting from data hazards. All nops are generated after control hazards happended. ![](https://i.imgur.com/WfbO8IY.png) So we should try to reduce the stalls because of control hazards. ### Control hazards srv32 is a 3 staged pipeline architecture. It has `F/D`, `E`, and `WB` stages ![](https://github.com/sysprog21/srv32/raw/devel/images/forwarding.svg) Its branch penalty will be 2 cycles since it has to flush the incorrect instruction in `F/D` and then load the new instruction. ![](https://github.com/sysprog21/srv32/raw/devel/images/branch.svg) In the following waveform, we can see that when the signal `branch_taken` is set, it generates a flush signal which takes 2 cycles to complete. Then reload the right instruction after flushing. ![](https://i.imgur.com/ceSrNSu.png) ## Performance Improvements ### Branchless popcount According to the section we discussed before, we know that the stalls are all generated by control hazards. To reduce the executing cycles, we can try to apply branchless algorithm to compute `popcount` For comparison, we first compute the branch_taken count in the original code: there are `632` branches been taken. Then we apply the algorithm mentioned in [amacc](https://github.com/jserv/amacc/blob/master/amacc.c#L504) to revise our code. The code is now changed to: ```c popcount: srli t0, a0, 1 li t1, 0x55555555 and t0, t0, t1 sub a0, a0, t0 li t1, 0x33333333 srli t2, a0, 2 and t2, t2, t1 and a0, a0, t1 add a0, a0, t2 srli t0, a0, 4 add a0, a0, t0 li t0, 0x0f0f0f0f and a0, a0, t0 li t0, 0x01010101 mul a0, a0, t0 srli a0, a0, 24 ret ``` We can see from the output of the simulation: ``` [0,1,1,2,1] Excuting 4781 instructions, 6023 cycles, 1.259 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.043 s Simulation cycles: 6034 Simulation speed : 0.140326 MHz ``` Although the count of branch_taken is slightly reduced to `622`, we need more instructions to compute the popcount for each numbers. In result, it cancels out the benefit of branchless implementation and cause the total executing cycles to increase. ### Loop unrolling In order to eliminate the stall caused by for-loop, we simply expand the loop body and rewrite the `count_bits` and `print` part. ``` [0,1,1,2,1] Excuting 4754 instructions, 5982 cycles, 1.258 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 5982 Simulation speed : 14.918 MHz ``` As the result shown above, we get 41 cycles reduced after this modification. The count of `branch_taken` is eliminated to `615` ### Inline function Instead of adding `inline` prefix in function prototype, I use a macro to replace the use of `popcount` to reduce the jump instruction to function body. The behavior of the macro is similar to `count_bits` function, but it will not result in function call. It expand in compiling time and increase the code size instead. ``` .macro pop_cnt srli t0, a0, 1 li t1, 0x55555555 and t0, t0, t1 sub a0, a0, t0 li t1, 0x33333333 srli t2, a0, 2 and t2, t2, t1 and a0, a0, t1 add a0, a0, t2 srli t0, a0, 4 add a0, a0, t0 li t0, 0x0f0f0f0f and a0, a0, t0 li t0, 0x01010101 mul a0, a0, t0 srli a0, a0, 24 .endm ``` In the result of simulation, we receive another 30 cycles reduction. ``` [0,1,1,2,1] Excuting 4744 instructions, 5952 cycles, 1.255 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 5952 Simulation speed : 15.030 MHz ``` The count of `branch_taken` is down to `605`.