`srv32` - RISCV RV32IM Soft CPU

A simple RISC-V 3-stage pipeline processor featuring:
Install RISC-V toolchains. You can do either of the following:
Install the dependent packages.
Do not use `apt install verilator` on Ubuntu Linux; otherwise you may get an older version, which will not work.
The `verilator` package is required, but the default package provided by Ubuntu Linux is too old. Hence, we have to build `verilator` from source. See Installation: Obtain Source, Auto Configure, and Eventual Installation Options.
Assume you are in the home directory:
You don't have to run `make install`.
Then, you can set environment variables in advance.
Make sure the version of Verilator is >= 5.002.
Read the srv32 project page carefully.
`sim`. This is an RTL-level generator that is capable of simulating the execution of a RISC-V binary at RTL level.
`sim/` directory.
The `make all` command builds the core and runs the RTL simulation; all simulations passed.

The command below generates the VCD/FST dump. You can browse the file `wave.fst` via GTKWave.
Check tobychui's note as well.
`tools` directory.

2.681152 CoreMark/MHz
There are two ways of running a RISC-V binary. One is the RTL simulator called `sim`, generated by Verilator. The other is the software RISC-V simulator called `rvsim`, located in the `tools` directory.
This repo tests the compliance of the hardware implementation by comparing the outputs of both simulators. To be precise, when you type `make tests` in the `./tests` directory, the compliance tests are run on the RTL simulator (`sim`), and the output is compared with the reference output specified by `riscv-compliance` AND the output of the software simulator (`rvsim`).
The memory dump of the RTL simulator, `dump.txt`, is renamed to `*.signature.output` and is automatically compared to the reference output provided by the `riscv-compliance` repo.
The output of the RTL simulator (`sim`) is stored in the `trace.log` file, while the output of the software simulator (`rvsim`) is stored in the `trace_sw.log` file. These two files contain detailed information about each instruction, such as the value written to a certain register or the value written to a certain memory address. The two files are compared with a `diff --brief` command.
In summary, the memory dump files from the RTL simulator are compared with the reference output. Then, the outputs of the RTL simulator and the software simulator are compared with each other. Notice that if the first comparison fails, the error is signaled by `riscv-compliance`; a failure in the second comparison, however, results in a failed `make` command (the second failure is raised by the `diff --brief` command).
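The two-step check described above can be sketched in a few lines. This is only an illustration of the comparison order, not the repo's actual Makefile logic: the function name `check_test` and the returned status strings are hypothetical, and the inputs are assumed to be the already-loaded lines of the signature, reference, `trace.log`, and `trace_sw.log` files.

```python
def check_test(signature, reference, rtl_trace, sw_trace):
    """Mimic the two comparisons made for one compliance test.

    All four arguments are lists of text lines (assumed to be read
    from the corresponding files beforehand). The status strings
    returned here are hypothetical labels, not real tool output.
    """
    # First comparison: RTL memory dump vs. riscv-compliance reference.
    if signature != reference:
        return "signature mismatch"
    # Second comparison: RTL trace vs. software-simulator trace.
    # Like `diff --brief`, this only reports whether the two differ.
    if rtl_trace != sw_trace:
        return "trace mismatch"
    return "pass"


print(check_test(["00000001"], ["00000001"], ["x2 <= 5"], ["x2 <= 5"]))
```

In this sketch the signature check runs first, so a trace difference is only reported when the memory dump already agrees with the reference.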
When you type `make tests-sw` in the `./tests` directory, the compliance tests are run on the software simulator. The output is compared with itself AND the reference output provided by the `riscv-compliance` repo.
make tests-sw
make tests
`srv32` RV32 core

At the time of writing, the memory of `srv32` is divided into instruction memory (I-MEM) and data memory (D-MEM). Both I-MEM and D-MEM are modelled using the `mem2ports` Verilog module as follows:
Notice that the signals `raddr` and `waddr` are both 30 bits long. The lower 2 bits are omitted because reads and writes to memory are word-aligned (4-byte aligned).
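As a small illustration of what dropping the lower 2 bits means, the sketch below converts a 32-bit byte address into a 30-bit word address. The helper name `byte_to_word_addr` is hypothetical and is not taken from the `mem2ports` module; it only models the address arithmetic.

```python
WORD_ADDR_MASK = (1 << 30) - 1  # 30-bit word address, as for raddr/waddr


def byte_to_word_addr(byte_addr):
    """Drop the lower 2 bits of a 32-bit byte address.

    All four bytes of one 32-bit word map to the same word address,
    which is why the memory ports only need 30 address bits.
    """
    return (byte_addr >> 2) & WORD_ADDR_MASK


# Bytes 0x1004..0x1007 all live in the same word.
print(hex(byte_to_word_addr(0x1004)), hex(byte_to_word_addr(0x1007)))
```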
`srv32` is a 3-stage pipeline architecture with IF/ID, EX, and WB stages. The following diagram marks some important signals for later discussion.
`srv32` supports full forwarding, which means RAW data hazards can be resolved WITHOUT stalling the processor. Notice that only RAW data hazards are possible; the other hazards (WAW, WAR) cannot occur on a single-issue processor.

The implementation of register forwarding is as follows:
Consider the following instruction sequence:
| IF/ID | EX | WB |
|---|---|---|
| add x4, x5, x6 | and x3, x2, x4 | addi x2, x2, -3 |
The instruction `and x3, x2, x4` in the EX stage and the instruction `addi x2, x2, -3` in the WB stage have a RAW data hazard on register `x2`. The latest value of `x2` (from `addi x2, x2, -3`) is held in the signal `wb_result` in the WB stage. Since `(wb_dst_sel == ex_src1_sel)` is true and `wb_mem2reg` is false, `wb_result` is forwarded as the `x2` operand of the instruction in the EX stage (`and x3, x2, x4`). The value of `x2` seen in the EX stage is `reg_rdata1`.
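The selection described above can be modelled as a small function. This is a behavioural sketch, not the project's Verilog: the signal names `wb_dst_sel`, `ex_src1_sel`, `wb_mem2reg`, `wb_result`, and `wb_rdata` come from the text, while the function name `select_src1`, the `wb_writes_reg` flag, and the check that `x0` is never forwarded are assumptions added for illustration.

```python
def select_src1(regfile_value, ex_src1_sel, wb_dst_sel,
                wb_writes_reg, wb_mem2reg, wb_result, wb_rdata):
    """Return the value used for source operand 1 in the EX stage."""
    # Assumption: forwarding only applies when the WB instruction
    # really writes a register, and never for the hard-wired x0.
    hazard = (wb_writes_reg and wb_dst_sel != 0
              and wb_dst_sel == ex_src1_sel)
    if not hazard:
        return regfile_value  # no hazard: use the register file value
    # wb_mem2reg false: forward the ALU result (e.g. from addi).
    # wb_mem2reg true: forward the word loaded from D-MEM instead.
    return wb_rdata if wb_mem2reg else wb_result


# `addi x2, x2, -3` in WB, `and x3, x2, x4` in EX: x2 is forwarded.
print(select_src1(regfile_value=10, ex_src1_sel=2, wb_dst_sel=2,
                  wb_writes_reg=True, wb_mem2reg=False,
                  wb_result=7, wb_rdata=99))
```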
The timing diagram of the above instruction sequence is as follows:

| Instruction | cycle 1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| addi x2, x2, -3 | IF/ID | EX | WB⬂ | | |
| and x3, x2, x4 | | IF/ID | EX⬃ | WB | |
| add x4, x5, x6 | | | IF/ID | EX | WB |
The load-use hazard is NOT an issue in the `srv32` core because D-MEM is read at the WB stage, and the register file is also read at the WB stage. A single MUX switches between the 2 operands (the operand from the register file and the operand from D-MEM). The load-use hazard can therefore be resolved WITHOUT stalling the processor.
Consider the following instruction sequence:
| IF/ID | EX | WB |
|---|---|---|
| add x4, x5, x6 | and x3, x2, x4 | lw x2, 0(x5) |
The instruction `and x3, x2, x4` in the EX stage and the instruction `lw x2, 0(x5)` in the WB stage have a load-use data hazard on register `x2`. The value of `x2` is read from D-MEM in the WB stage and held in the signal `wb_rdata`. Since `(wb_dst_sel == ex_src1_sel)` is true and `wb_mem2reg` is true, `wb_rdata` is forwarded as the `x2` operand of the instruction in the EX stage. The value of `x2` seen in the EX stage is `reg_rdata1`.
The Verilog code is shown again for your reference:

The timing diagram of the above instruction sequence is as follows:

| Instruction | cycle 1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| lw x2, 0(x5) | IF/ID | EX | WB | | |
| and x3, x2, x4 | | IF/ID | EX | WB | |
| add x4, x5, x6 | | | IF/ID | EX | WB |
Branch penalty is the number of instructions killed after a branch instruction when the branch is TAKEN. The branch outcome is resolved by the ALU at the end of the EX stage, so the instruction fetched in IF/ID may need to be killed if the branch is taken. In this processor, however, the address of the next instruction (next PC) must be fed into I-MEM one cycle ahead. Thus, the branch penalty for `srv32` is 2. To clarify, by the time the next PC is resolved, one instruction has already been fetched into the pipeline and another PC has already been calculated, because the address must be computed one cycle ahead. Therefore, 2 instructions after a branch instruction must be killed (i.e., set to NOP) if the branch is actually taken.
Consider the following instruction sequence:
| | | IF/ID | EX | WB |
|---|---|---|---|---|
| next_pc | fetch_pc (imem_addr) | if_pc | ex_pc | wb_pc |
| xxx | add x4, x5, x6 | and x3, x2, x4 | beq x5, x6 (taken) | |

(Notice that an additional row is inserted above the instructions. These are the PC variables in the pipeline.)
The branch instruction `beq x5, x6 (taken)` is resolved by the END of the EX stage. By the time the branch instruction is resolved, two consecutive instructions, namely `add x4, x5, x6` and `and x3, x2, x4`, will have been fetched from I-MEM. These two instructions must be killed if the branch is taken.
The timing diagram of the above instruction sequence is as follows:

| Instruction | c1 | c2 | c3 | c4 | c5 | c6 |
|---|---|---|---|---|---|---|
| beq x5, x6 (taken) | IF/ID | EX | WB | | | |
| and x3, x2, x4 | | NOP | NOP | NOP | | |
| add x4, x5, x6 | | | NOP | NOP | NOP | |
| exec if branch taken | | | | IF/ID | EX | WB |
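The timing diagrams suggest a simple cycle estimate, sketched below. It is an idealized model under stated assumptions, not a formula taken from the project: it assumes every retired instruction takes one cycle, that the only lost cycles are the 2 killed slots per taken branch, and that the pipeline needs 2 extra cycles to drain. The name `total_cycles` is hypothetical.

```python
PIPELINE_DEPTH = 3   # IF/ID, EX, WB
BRANCH_PENALTY = 2   # instructions killed per taken branch


def total_cycles(instructions_retired, taken_branches):
    """Estimate the cycle count of an ideal 3-stage pipeline.

    Assumes no stalls other than the slots killed after each
    taken branch (full forwarding removes data-hazard stalls).
    """
    if instructions_retired == 0:
        return 0
    drain = PIPELINE_DEPTH - 1  # cycles for the last instruction to reach WB
    return instructions_retired + drain + BRANCH_PENALTY * taken_branches


# beq (taken) plus its branch target: finishes in cycle 6, as in the table.
print(total_cycles(instructions_retired=2, taken_branches=1))
```

With no taken branch, three back-to-back instructions give 5 cycles in this model, which matches the forwarding timing diagrams.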
`count_bits` into srv32

Because we now move our code from Ripes to a bare-metal environment, we need to follow the calling convention carefully.
Thus we should ensure that we use all saved registers and temporary registers properly.
6cc195f
We can first debug our code on the ISS infrastructure. The result is:
Next, we run `make count_bits.run` under the `sim/` directory to copy the memory layout of `count_bits` into that directory and start the simulation.
The result is similar to the one from the ISS.
Run the following command to generate `wave.fst`:
As the waveform below shows, we can regard `imem` as a combinational circuit on this CPU.
We can retrieve the instruction at a given address (`raddr`) within the same clock period.
Since srv32 has full bypassing, no stalls result from data hazards. All NOPs are generated after control hazards have happened.
So we should try to reduce the stalls caused by control hazards.
srv32 is a 3-stage pipeline architecture with `F/D`, `E`, and `WB` stages.
Its branch penalty is 2 cycles, since it has to flush the incorrect instruction in `F/D` and then load the new instruction.
In the following waveform, we can see that when the signal `branch_taken` is set, it generates a flush signal that takes 2 cycles to complete. The correct instruction is then reloaded after the flush.
As discussed in the previous section, all stalls are generated by control hazards. To reduce the number of executed cycles, we can try to apply a branchless algorithm to compute `popcount`.
For comparison, we first count `branch_taken` in the original code: 632 branches are taken.
Then we apply the algorithm mentioned in amacc to revise our code. The code is now changed to:
We can see from the output of the simulation:
Although the `branch_taken` count is slightly reduced to 622, we need more instructions to compute the popcount of each number. As a result, this cancels out the benefit of the branchless implementation and causes the total number of executed cycles to increase.
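To make the trade-off concrete, here is a sketch of one well-known branchless bit count (the parallel, or SWAR, method). It is an assumption that this resembles the routine referred to above; the exact code is not reproduced here, and the name `popcount32` is hypothetical. It has no loop and no branch, but every call runs the same fixed sequence of shifts, masks, adds, and one multiply, regardless of the input.

```python
MASK32 = 0xFFFFFFFF


def popcount32(x):
    """Count set bits in a 32-bit value without loops or branches."""
    x &= MASK32
    x = x - ((x >> 1) & 0x55555555)                  # 2-bit partial sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0F                  # 8-bit partial sums
    # Summing the four bytes via a multiply; the top byte holds the total.
    return ((x * 0x01010101) & MASK32) >> 24


print(popcount32(0xF0F0))
```

A loop-based count can exit early for small inputs, whereas this version always pays the full instruction sequence, which is consistent with the instruction count going up.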
To eliminate the stalls caused by the for-loop, we simply unroll the loop body and rewrite the `count_bits` and `print` parts.
As the result above shows, this modification saves 41 cycles. The `branch_taken` count is reduced to 615.
Instead of adding the `inline` prefix to the function prototype, I use a macro to replace the use of `popcount`, which removes the jump to the function body. The macro behaves like the `count_bits` function but does not result in a function call. It is expanded at compile time and increases the code size instead.
In the simulation result, we gain another 30-cycle reduction.
The `branch_taken` count is down to 605.