contributed by < dhiptmc
>
Install RISC-V toolchains.
Use pre-built GNU Toolchain via xPack GNU RISC-V Embedded GCC. Then, define an environment variable in advance:
Install the dependent packages.
For Ubuntu Linux,
Assume you are in the home directory:
Then, you can set environment variables in advance.
Make sure the version of Verilator >= 5.002
.
Fetched from HW3-1.dis
srv32
is a 3-stage pipeline architecture with IF/ID, EX, WB stages. The follwing diagram marks some important signals for later discussion.
srv32
supports full forwarding, which means RAW data hazard can be resolved WITHOUT stalling the processor. Notice only RAW data hazard is possible, other hazard (WAW, WAR) isn't possible on single issue processor.
Consider the following instruction sequence:
IF/ID | EX | WB |
---|---|---|
auipc t1,0x22 |
addi t0,t0,-2004 |
auipc t0,0x22 |
addi t0,t0,-2004
at EX stage and instruction auipc t0,0x22
at WB stage have RAW data hazard on register t0
.t0
(from auipc t0,0x22
) is stored in signal wb_result
at WB stage.(wb_dst_sel == ex_src1_sel)
is true and wb_mem2reg
is false, wb_result
is forward to t0
register in EX stage (addi t0,t0,-2004
). The value of t0
in EX stage is stored in reg_rdata1
.The timing diagram of the above instruction sequence is as follow:
Instruction | cycle 1 | c2 | c3 | c4 | c5 |
---|---|---|---|---|---|
10: auipc t0,0x22 |
IF/ID | EX | WB⬂ | ||
14: addi t0,t0,-2004 |
IF/ID | EX⬃ | WB | ||
18: auipc t1,0x22 |
IF/ID | EX | WB |
Load-use hazard is NOT an issue in srv32
core because D-MEM is read at WB stage, and register file is also read at WB stage. A single MUX is used to switch between 2 operands (operand from register file and operand from D-MEM). Load-use hazard can be resolved WITHOUT stalling the processor.
Consider the following instruction sequence:
IF/ID | EX | WB |
---|---|---|
bltu s6,s1,4dc |
andi s6,a5,-4 |
lw a5,4(s0) |
andi s6,a5,-4
at EX stage and instruction lw a5,4(s0)
at WB stage have load-use data hazard on register a5
.a5
is read from D-MEM in WB stage and stored in signal wb_rdata
.(wb_dst_sel == ex_src1_sel)
is true and wb_mem2reg
is true, wb_rdata
is forward to a5
register in EX stage. The value of a5
in EX stage is reg_rdata1
.The timing diagram of the above instruction sequence is as follow:
Instruction | cycle 1 | c2 | c3 | c4 | c5 |
---|---|---|---|---|---|
4c4: lw a5,4(s0) |
IF/ID | EX | WB | ||
4c8: andi s6,a5,-4 |
IF/ID | EX | WB | ||
4cc: bltu s6,s1,4dc |
IF/ID | EX | WB |
Branch penalty is the number of instructions killed after a branch instruction if a branch is TAKEN. Branch result is resolved at the end EX stage by ALU so the instruction fetch in IF/ID might need to be killed if a branch is taken. In this processor; however, the address of next instruction (next PC) should be fed into I-MEM a cycle ahead. Thus, the branch penalty for srv32
is 2. To clarify, by the time next PC is resolved, one instruction has been fetch into pipeline and another PC has been calculated because address should be computed one cycle ahead. The number of instructions that should be killed (a.k.a. set to NOP) is 2 instruction after a branch instruction if the branch is actually taken.
Consider the following instruction sequence:
IF/ID | EX | WB | ||
---|---|---|---|---|
next_pc | fetch_pc (imem_addr) | if_pc |
ex_pc |
wb_pc |
xxx | addi sp,sp,-44 |
auipc sp,0x40 |
bltu t0,t1,20(taken) |
(Notice an additional column is inserted above the instruction. These are the PC variables in pipeline)
Branch instruction bltu t0,t1,20(taken)
is resolved by the END of EX stage. By the time branch instruction is resolved, two consequtive instructions, namely auipc sp,0x40
and addi sp,sp,-44
will be fetched from I-MEM. These two instructions should be killed if branch is taken.
The timing diagram of the above instruction sequence is as follow:
Instruction | c1 | c2 | c3 | c4 | c5 | c6 |
---|---|---|---|---|---|---|
bltu t0,t1,20(taken) |
IF/ID | EX | WB | |||
auipc sp,0x40 |
NOP | NOP | NOP | |||
addi sp,sp,-44 |
NOP | NOP | NOP | |||
exec if branch taken |
IF/ID | EX | WB |
Change the placement for result[0] = nums[0];
, which eventually decreases the branch penalty. 24 instructions and cycles are saved.
2417
instructions and 2945
cycles. Notice that the code is based on a larger testing set. Dealing with a smaller testing set, e.g., nums[3] = {1,2,1} may have smaller or even no benefit from the new code.:warning: There are some problems in the modified code, while the numsSize became larger, there are inevitably going to occur overflow error for the
It makes sense since the binomial coefficient grows really quick.
Overall, the code takes advantages of the larger array, but fails at the even more larger array. There should be solution for this approach.