Try   HackMD

Assignment3: SoftCPU

contributed by < dhiptmc >

Environment

Install RISC-V toolchains.
Use pre-built GNU Toolchain via xPack GNU RISC-V Embedded GCC. Then, define an environment variable in advance:

$ export CROSS_COMPILE=riscv-none-elf-

Install the dependent packages.
For Ubuntu Linux,

$ sudo apt install build-essential lcov ccache libsystemc-dev
$ sudo apt install gtkwave

Assume you are in the home directory:

$ cd $HOME
$ git clone https://github.com/verilator/verilator
$ cd verilator
$ git checkout stable
$ export VERILATOR_ROOT=`pwd`
$ autoconf
$ ./configure
$ make

Then, you can set environment variables in advance.

$ cd $HOME
$ export VERILATOR_ROOT=$HOME/verilator
$ export PATH=$VERILATOR_ROOT/bin:$PATH

Make sure the version of Verilator >= 5.002.

$ verilator --version
Verilator 5.002 2022-10-29 rev v5.002-29-gdb39d70c7

Get the Source

$ git clone https://github.com/sysprog21/srv32

Derived from Assignment2

C code

#include <stdlib.h> #include <stdio.h> int* runningSum(int* nums, int numsSize){ int* result = (int*)malloc(sizeof(int)*numsSize); for(int i=0; i<numsSize; i++){ if(i==0) result[0] = nums[0]; else result[i] = result[i-1] + nums[i]; } return result; } int main(int argc, char *argv[]){ int nums1[5] = {1,2,3,4,5}; int nums2[6] = {1,3,5,7,9,11}; int nums3[7] = {0,2,4,6,8,10,12}; int* result1 = runningSum(nums1, 5); int* result2 = runningSum(nums2, 6); int* result3 = runningSum(nums3, 7); for(int j=0; j<5; j++){ printf("%d ", result1[j]); } printf("\n"); for(int j=0; j<6; j++){ printf("%d ", result2[j]); } printf("\n"); for(int j=0; j<7; j++){ printf("%d ", result3[j]); } printf("\n"); }

Add on a Makefile

include ../common/Makefile.common

EXE      = .elf
SRC      = HW3-1.c
CFLAGS  += -L../common
LDFLAGS += -T ../common/default.ld
TARGET   = HW3-1
OUTPUT   = $(TARGET)$(EXE)

.PHONY: all clean

all: $(TARGET)

$(TARGET): $(SRC)
	$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS)
	$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
	$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
	$(OBJCOPY) -O binary $(OUTPUT) memory.bin
	$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
	$(READELF) -a $(OUTPUT) > $(TARGET).symbol

clean:
	$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin

Run the code

$ cd srv32
~/srv32$ make HW3-1

Run RTL sim

1 3 6 10 15 
1 4 9 16 25 36 
0 2 6 12 20 30 42 

Excuting 15968 instructions, 20974 cycles, 1.313 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.105 s
Simulation cycles: 20985
Simulation speed : 0.199857 MHz

Run ISS sim

./rvsim --memsize 128 -l trace.log ../sw/HW3-1/HW3-1.elf
1 3 6 10 15 
1 4 9 16 25 36 
0 2 6 12 20 30 42 

Excuting 15968 instructions, 20974 cycles, 1.314 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.003 s
Simulation cycles: 20974
Simulation speed : 6.069 MHz

make[1]: Leaving directory '/home/dhiptmc/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===

Assembly Code Generated By srv32

Fetched from HW3-1.dis

0000003c <runningSum>:
      3c:	ff010113          	addi	sp,sp,-16
      40:	00812423          	sw	s0,8(sp)
      44:	00912223          	sw	s1,4(sp)
      48:	00050413          	mv	s0,a0
      4c:	00058493          	mv	s1,a1
      50:	00259513          	slli	a0,a1,0x2
      54:	00112623          	sw	ra,12(sp)
      58:	1c4000ef          	jal	ra,21c <malloc>
      5c:	04905063          	blez	s1,9c <runningSum+0x60>
      60:	00042683          	lw	a3,0(s0)
      64:	00100713          	li	a4,1
      68:	00450793          	addi	a5,a0,4
      6c:	00d52023          	sw	a3,0(a0)
      70:	00440693          	addi	a3,s0,4
      74:	02e48463          	beq	s1,a4,9c <runningSum+0x60>
      78:	00100613          	li	a2,1
      7c:	ffc7a703          	lw	a4,-4(a5)
      80:	0006a803          	lw	a6,0(a3)
      84:	00478793          	addi	a5,a5,4
      88:	00160613          	addi	a2,a2,1
      8c:	01070733          	add	a4,a4,a6
      90:	fee7ae23          	sw	a4,-4(a5)
      94:	00468693          	addi	a3,a3,4
      98:	fec492e3          	bne	s1,a2,7c <runningSum+0x40>
      9c:	00c12083          	lw	ra,12(sp)
      a0:	00812403          	lw	s0,8(sp)
      a4:	00412483          	lw	s1,4(sp)
      a8:	01010113          	addi	sp,sp,16
      ac:	00008067          	ret

srv32 (Simple 3-stage pipeline RISC-V processor)

srv32 is a 3-stage pipeline architecture with IF/ID, EX, WB stages. The follwing diagram marks some important signals for later discussion.

Forwarding

Data Hazard

srv32 supports full forwarding, which means RAW data hazard can be resolved WITHOUT stalling the processor. Notice only RAW data hazard is possible, other hazard (WAW, WAR) isn't possible on single issue processor.

Consider the following instruction sequence:

IF/ID EX WB
auipc t1,0x22 addi t0,t0,-2004 auipc t0,0x22
  • Instruction addi t0,t0,-2004 at EX stage and instruction auipc t0,0x22 at WB stage have RAW data hazard on register t0.
  • The latest result of t0 (from auipc t0,0x22) is stored in signal wb_result at WB stage.
  • Since (wb_dst_sel == ex_src1_sel) is true and wb_mem2reg is false, wb_result is forward to t0 register in EX stage (addi t0,t0,-2004). The value of t0 in EX stage is stored in reg_rdata1.

The timing diagram of the above instruction sequence is as follow:

Instruction cycle 1 c2 c3 c4 c5
10: auipc t0,0x22 IF/ID EX WB
14: addi t0,t0,-2004 IF/ID EX WB
18: auipc t1,0x22 IF/ID EX WB

Load-Use Hazard

Load-use hazard is NOT an issue in srv32 core because D-MEM is read at WB stage, and register file is also read at WB stage. A single MUX is used to switch between 2 operands (operand from register file and operand from D-MEM). Load-use hazard can be resolved WITHOUT stalling the processor.

Consider the following instruction sequence:

IF/ID EX WB
bltu s6,s1,4dc andi s6,a5,-4 lw a5,4(s0)
  • Instruction andi s6,a5,-4 at EX stage and instruction lw a5,4(s0) at WB stage have load-use data hazard on register a5.
  • The result of a5 is read from D-MEM in WB stage and stored in signal wb_rdata.
  • Since (wb_dst_sel == ex_src1_sel) is true and wb_mem2reg is true, wb_rdata is forward to a5 register in EX stage. The value of a5 in EX stage is reg_rdata1.

The timing diagram of the above instruction sequence is as follow:

Instruction cycle 1 c2 c3 c4 c5
4c4: lw a5,4(s0) IF/ID EX WB
4c8: andi s6,a5,-4 IF/ID EX WB
4cc: bltu s6,s1,4dc IF/ID EX WB

Branch

Branch penalty is the number of instructions killed after a branch instruction if a branch is TAKEN. Branch result is resolved at the end EX stage by ALU so the instruction fetch in IF/ID might need to be killed if a branch is taken. In this processor; however, the address of next instruction (next PC) should be fed into I-MEM a cycle ahead. Thus, the branch penalty for srv32 is 2. To clarify, by the time next PC is resolved, one instruction has been fetch into pipeline and another PC has been calculated because address should be computed one cycle ahead. The number of instructions that should be killed (a.k.a. set to NOP) is 2 instruction after a branch instruction if the branch is actually taken.

Consider the following instruction sequence:

IF/ID EX WB
next_pc fetch_pc (imem_addr) if_pc ex_pc wb_pc
xxx addi sp,sp,-44 auipc sp,0x40 bltu t0,t1,20(taken)

(Notice an additional column is inserted above the instruction. These are the PC variables in pipeline)

Branch instruction bltu t0,t1,20(taken) is resolved by the END of EX stage. By the time branch instruction is resolved, two consequtive instructions, namely auipc sp,0x40 and addi sp,sp,-44 will be fetched from I-MEM. These two instructions should be killed if branch is taken.

The timing diagram of the above instruction sequence is as follow:

Instruction c1 c2 c3 c4 c5 c6
bltu t0,t1,20(taken) IF/ID EX WB
auipc sp,0x40 NOP NOP NOP
addi sp,sp,-44 NOP NOP NOP
exec if branch taken IF/ID EX WB

Optimized for Assignment2

Change the placement for result[0] = nums[0];, which eventually decreases the branch penalty. 24 instructions and cycles are saved.

int* runningSum(int* nums, int numsSize){ int* result = (int*)malloc(sizeof(int)*numsSize); result[0] = nums[0]; for(int i=1; i<numsSize; i++){ result[i] = result[i-1] + nums[i]; } return result; }

Run RTL sim

1 3 6 10 15 
1 4 9 16 25 36 
0 2 6 12 20 30 42 

Excuting 15944 instructions, 20950 cycles, 1.313 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.091 s
Simulation cycles: 20961
Simulation speed : 0.230341 MHz

Run ISS sim

./rvsim --memsize 128 -l trace.log ../sw/HW3-1/HW3-1.elf
1 3 6 10 15 
1 4 9 16 25 36 
0 2 6 12 20 30 42 

Excuting 15944 instructions, 20950 cycles, 1.314 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.004 s
Simulation cycles: 20950
Simulation speed : 4.668 MHz

make[1]: Leaving directory '/home/dhiptmc/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===

LeetCode problem with medium difficulty - 2221. Find Triangular Sum of an Array

Origin code:

#include <stdlib.h> #include <stdio.h> int triangularSum(int* nums, int numsSize){ for (; numsSize != 1; numsSize--) for (int i = 0; i < numsSize - 1; i++) nums[i] = (nums[i] + nums[i + 1]) % 10; return nums[0]; } int main(int argc, char *argv[]){ int nums1[5] = {1,2,3,4,5}; int nums2[6] = {3,2,3,9,1,8}; int nums3[7] = {7,4,2,6,5,5,2}; int nums4[30] = {1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1 }; int result1 = triangularSum(nums1, 5); int result2 = triangularSum(nums2, 6); int result3 = triangularSum(nums3, 7); int result4 = triangularSum(nums4, 30); printf("%d", result1); printf("\n"); printf("%d", result2); printf("\n"); printf("%d", result3); printf("\n"); printf("%d", result4); printf("\n"); }

Run RTL sim

8
6
8
2

Excuting 7767 instructions, 10121 cycles, 1.303 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.049 s
Simulation cycles: 10132
Simulation speed : 0.206776 MHz

Run ISS sim

./rvsim --memsize 128 -l trace.log ../sw/HW3-2-org/HW3-2-org.elf
8
6
8
2

Excuting 7767 instructions, 10121 cycles, 1.303 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.002 s
Simulation cycles: 10121
Simulation speed : 6.115 MHz

make[1]: Leaving directory '/home/dhiptmc/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===

Modified code:

#include <stdlib.h> #include <stdio.h> int triangularSum(int* nums, int numsSize){ int* coef = (int*)malloc(sizeof(int)*numsSize); coef[0] = 1; coef[numsSize-1] = 1; for (int i = 1; i < numsSize-1; i++) coef[i] = coef[i-1] * (numsSize - i) / i ; int result = 0; for (int i = 0; i < numsSize; i++) result += nums[i] * coef[i]; return result % 10; } int main(int argc, char *argv[]){ int nums1[5] = {1,2,3,4,5}; int nums2[6] = {3,2,3,9,1,8}; int nums3[7] = {7,4,2,6,5,5,2}; int nums4[30] = {1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1 }; int result1 = triangularSum(nums1, 5); int result2 = triangularSum(nums2, 6); int result3 = triangularSum(nums3, 7); int result4 = triangularSum(nums4, 30); printf("%d", result1); printf("\n"); printf("%d", result2); printf("\n"); printf("%d", result3); printf("\n"); printf("%d", result4); printf("\n"); }

Run RTL sim

8
6
8
2

Excuting 5350 instructions, 7176 cycles, 1.341 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.035 s
Simulation cycles: 7187
Simulation speed : 0.205343 MHz

Run ISS sim

./rvsim --memsize 128 -l trace.log ../sw/HW3-2/HW3-2.elf
8
6
8
2

Excuting 5350 instructions, 7176 cycles, 1.341 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.001 s
Simulation cycles: 7176
Simulation speed : 5.863 MHz

make[1]: Leaving directory '/home/dhiptmc/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===

Software optimizations

  • We exposed the binomial coefficient feature in this problem. Since we are dedicated to decrease the branch activities, nested loops are simply unacceptable.
  • Based on the changes, we can save up to 2417 instructions and 2945 cycles. Notice that the code is based on a larger testing set. Dealing with a smaller testing set, e.g., nums[3] = {1,2,1} may have smaller or even no benefit from the new code.
  • Time complexity is improved from O(n^2) to O(n).

:warning: There are some problems in the modified code, while the numsSize became larger, there are inevitably going to occur overflow error for the

coef[i] = coef[i-1] * (numsSize - i) / i ;

It makes sense since the binomial coefficient grows really quick.
Overall, the code takes advantages of the larger array, but fails at the even more larger array. There should be solution for this approach.

Explain how srv32 works with Verilator

  • The Verilator package converts Verilog and SystemVerilog hardware description language (HDL) designs into a C++ or SystemC model that after compiling can be executed. Verilator is not a traditional simulator, but a compiler.
  • In srv32, it uses Verilator to generate the RTL-level simulator.
  • The RTL-level simulator is capable of simulating the execution of RISC-V binary at RTL-level.
  • C program is processed and copied into /sim directory(simulation environment generated by Verilator) and simulated on srv32 hardware virtually.