# Assignment3: SoftCPU
contributed by < [`dhiptmc`](https://github.com/dhiptmc) >
## Environment
Install RISC-V toolchains.
Use pre-built GNU Toolchain via [xPack GNU RISC-V Embedded GCC](https://xpack.github.io/riscv-none-elf-gcc/releases/). Then, define an environment variable in advance:
```shell
$ export CROSS_COMPILE=riscv-none-elf-
```
Install the dependent packages.
For Ubuntu Linux,
```shell
$ sudo apt install build-essential lcov ccache libsystemc-dev
$ sudo apt install gtkwave
```
Assume you are in the home directory:
```shell
$ cd $HOME
$ git clone https://github.com/verilator/verilator
$ cd verilator
$ git checkout stable
$ export VERILATOR_ROOT=`pwd`
$ autoconf
$ ./configure
$ make
```
Then, you can set environment variables in advance.
```shell
$ cd $HOME
$ export VERILATOR_ROOT=$HOME/verilator
$ export PATH=$VERILATOR_ROOT/bin:$PATH
```
Make sure the version of Verilator >= `5.002`.
```
$ verilator --version
Verilator 5.002 2022-10-29 rev v5.002-29-gdb39d70c7
```
## Get the Source
```shell
$ git clone https://github.com/sysprog21/srv32
```
## Derived from Assignment2
### C code
```clike=
#include <stdlib.h>
#include <stdio.h>
int* runningSum(int* nums, int numsSize){
int* result = (int*)malloc(sizeof(int)*numsSize);
for(int i=0; i<numsSize; i++){
if(i==0)
result[0] = nums[0];
else
result[i] = result[i-1] + nums[i];
}
return result;
}
int main(int argc, char *argv[]){
int nums1[5] = {1,2,3,4,5};
int nums2[6] = {1,3,5,7,9,11};
int nums3[7] = {0,2,4,6,8,10,12};
int* result1 = runningSum(nums1, 5);
int* result2 = runningSum(nums2, 6);
int* result3 = runningSum(nums3, 7);
for(int j=0; j<5; j++){
printf("%d ", result1[j]);
}
printf("\n");
for(int j=0; j<6; j++){
printf("%d ", result2[j]);
}
printf("\n");
for(int j=0; j<7; j++){
printf("%d ", result3[j]);
}
printf("\n");
}
```
### Add on a Makefile
```
include ../common/Makefile.common
EXE = .elf
SRC = HW3-1.c
CFLAGS += -L../common
LDFLAGS += -T ../common/default.ld
TARGET = HW3-1
OUTPUT = $(TARGET)$(EXE)
.PHONY: all clean
all: $(TARGET)
$(TARGET): $(SRC)
$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS)
$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
$(OBJCOPY) -O binary $(OUTPUT) memory.bin
$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
$(READELF) -a $(OUTPUT) > $(TARGET).symbol
clean:
$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin
```
### Run the code
```shell
$ cd srv32
~/srv32$ make HW3-1
```
#### Run RTL sim
```
1 3 6 10 15
1 4 9 16 25 36
0 2 6 12 20 30 42
Excuting 15968 instructions, 20974 cycles, 1.313 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.105 s
Simulation cycles: 20985
Simulation speed : 0.199857 MHz
```
#### Run ISS sim
```
./rvsim --memsize 128 -l trace.log ../sw/HW3-1/HW3-1.elf
1 3 6 10 15
1 4 9 16 25 36
0 2 6 12 20 30 42
Excuting 15968 instructions, 20974 cycles, 1.314 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.003 s
Simulation cycles: 20974
Simulation speed : 6.069 MHz
make[1]: Leaving directory '/home/dhiptmc/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### Assembly Code Generated By srv32
Fetched from HW3-1.dis
```
0000003c <runningSum>:
3c: ff010113 addi sp,sp,-16
40: 00812423 sw s0,8(sp)
44: 00912223 sw s1,4(sp)
48: 00050413 mv s0,a0
4c: 00058493 mv s1,a1
50: 00259513 slli a0,a1,0x2
54: 00112623 sw ra,12(sp)
58: 1c4000ef jal ra,21c <malloc>
5c: 04905063 blez s1,9c <runningSum+0x60>
60: 00042683 lw a3,0(s0)
64: 00100713 li a4,1
68: 00450793 addi a5,a0,4
6c: 00d52023 sw a3,0(a0)
70: 00440693 addi a3,s0,4
74: 02e48463 beq s1,a4,9c <runningSum+0x60>
78: 00100613 li a2,1
7c: ffc7a703 lw a4,-4(a5)
80: 0006a803 lw a6,0(a3)
84: 00478793 addi a5,a5,4
88: 00160613 addi a2,a2,1
8c: 01070733 add a4,a4,a6
90: fee7ae23 sw a4,-4(a5)
94: 00468693 addi a3,a3,4
98: fec492e3 bne s1,a2,7c <runningSum+0x40>
9c: 00c12083 lw ra,12(sp)
a0: 00812403 lw s0,8(sp)
a4: 00412483 lw s1,4(sp)
a8: 01010113 addi sp,sp,16
ac: 00008067 ret
```
## srv32 (Simple 3-stage pipeline RISC-V processor)
`srv32` is a 3-stage pipeline architecture with IF/ID, EX, WB stages. The follwing diagram marks some important signals for later discussion.

### Forwarding
#### Data Hazard
`srv32` supports full forwarding, which means RAW data hazard can be resolved WITHOUT stalling the processor. Notice only RAW data hazard is possible, other hazard (WAW, WAR) isn't possible on single issue processor.
Consider the following instruction sequence:
| IF/ID | EX | WB |
| ---------------- | ---------------- | ----------------- |
| `auipc t1,0x22` | `addi t0,t0,-2004` | `auipc t0,0x22` |
* Instruction `addi t0,t0,-2004` at EX stage and instruction `auipc t0,0x22` at WB stage have RAW data hazard on register `t0`.
* The latest result of `t0` (from `auipc t0,0x22`) is stored in signal `wb_result` at WB stage.
* Since `(wb_dst_sel == ex_src1_sel)` is true and `wb_mem2reg` is false, `wb_result` is forward to `t0` register in EX stage (`addi t0,t0,-2004`). The value of `t0` in EX stage is stored in `reg_rdata1`.
The timing diagram of the above instruction sequence is as follow:
| Instruction |cycle 1| c2 | c3 | c4 | c5 |
| ----------------- | ----- | --- | --- | --- | --- |
| `10: auipc t0,0x22` | IF/ID | EX |**WB**⬂| | |
| `14: addi t0,t0,-2004` | |IF/ID|**EX**⬃| WB | |
| `18: auipc t1,0x22` | | |IF/ID| EX | WB |

#### Load-Use Hazard
Load-use hazard is NOT an issue in `srv32` core because D-MEM is read at WB stage, and register file is also read at WB stage. A single MUX is used to switch between 2 operands (operand from register file and operand from D-MEM). Load-use hazard can be resolved WITHOUT stalling the processor.
Consider the following instruction sequence:
| IF/ID | EX | WB |
| ---------------- | ---------------- | ----------------- |
| `bltu s6,s1,4dc` | `andi s6,a5,-4` | `lw a5,4(s0)` |
* Instruction `andi s6,a5,-4` at EX stage and instruction `lw a5,4(s0)` at WB stage have load-use data hazard on register `a5`.
* The result of `a5` is read from D-MEM in WB stage and stored in signal `wb_rdata`.
* Since `(wb_dst_sel == ex_src1_sel)` is true and `wb_mem2reg` is true, `wb_rdata` is forward to `a5` register in EX stage. The value of `a5` in EX stage is `reg_rdata1`.
The timing diagram of the above instruction sequence is as follow:
| Instruction |cycle 1| c2 | c3 | c4 | c5 |
| ----------------- | ----- | --- | --- | --- | --- |
| `4c4: lw a5,4(s0)` | IF/ID | EX | WB | | |
| `4c8: andi s6,a5,-4` | |IF/ID| EX | WB | |
| `4cc: bltu s6,s1,4dc`| | |IF/ID| EX | WB |

### Branch
Branch penalty is the number of instructions killed after a branch instruction if a branch is TAKEN. Branch result is resolved at the end EX stage by ALU so the instruction fetch in IF/ID might need to be killed if a branch is taken. In this processor; however, the address of next instruction (next PC) should be fed into I-MEM a cycle ahead. Thus, the branch penalty for `srv32` is 2. To clarify, by the time next PC is resolved, one instruction has been fetch into pipeline and another PC has been calculated because address should be computed one cycle ahead. The number of instructions that should be killed (a.k.a. set to NOP) is 2 instruction after a branch instruction if the branch is actually taken.
Consider the following instruction sequence:
| | | IF/ID | EX | WB |
| ------- | ------------------ | ------------------ | ------------------ | ----- |
| next_pc |fetch_pc (imem_addr)|`if_pc` | `ex_pc` |`wb_pc`|
| xxx |`addi sp,sp,-44` |`auipc sp,0x40` |`bltu t0,t1,20(taken)`| |
(Notice an additional column is inserted above the instruction. These are the PC variables in pipeline)
Branch instruction `bltu t0,t1,20(taken)` is resolved by the END of EX stage. By the time branch instruction is resolved, two consequtive instructions, namely `auipc sp,0x40` and `addi sp,sp,-44` will be fetched from I-MEM. These two instructions should be killed if branch is taken.
The timing diagram of the above instruction sequence is as follow:
| Instruction | c1 | c2 | c3 | c4 | c5 | c6 |
| --------------------- | --- | --- | --- | --- | --- | --- |
|`bltu t0,t1,20(taken)`|IF/ID| EX | WB | | | |
| `auipc sp,0x40` | | NOP | NOP | NOP | | |
| `addi sp,sp,-44` | | | NOP | NOP | NOP | |
| `exec if branch taken`| | | |IF/ID| EX | WB |

## Optimized for Assignment2
Change the placement for `result[0] = nums[0];`, which eventually decreases the branch penalty. **24** instructions and cycles are saved.
```clike=
int* runningSum(int* nums, int numsSize){
int* result = (int*)malloc(sizeof(int)*numsSize);
result[0] = nums[0];
for(int i=1; i<numsSize; i++){
result[i] = result[i-1] + nums[i];
}
return result;
}
```
#### Run RTL sim
```
1 3 6 10 15
1 4 9 16 25 36
0 2 6 12 20 30 42
Excuting 15944 instructions, 20950 cycles, 1.313 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.091 s
Simulation cycles: 20961
Simulation speed : 0.230341 MHz
```
#### Run ISS sim
```
./rvsim --memsize 128 -l trace.log ../sw/HW3-1/HW3-1.elf
1 3 6 10 15
1 4 9 16 25 36
0 2 6 12 20 30 42
Excuting 15944 instructions, 20950 cycles, 1.314 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.004 s
Simulation cycles: 20950
Simulation speed : 4.668 MHz
make[1]: Leaving directory '/home/dhiptmc/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
## LeetCode problem with medium difficulty --- [2221. Find Triangular Sum of an Array](https://leetcode.com/problems/find-triangular-sum-of-an-array/)
### Origin code:
```clike=
#include <stdlib.h>
#include <stdio.h>
int triangularSum(int* nums, int numsSize){
for (; numsSize != 1; numsSize--)
for (int i = 0; i < numsSize - 1; i++)
nums[i] = (nums[i] + nums[i + 1]) % 10;
return nums[0];
}
int main(int argc, char *argv[]){
int nums1[5] = {1,2,3,4,5};
int nums2[6] = {3,2,3,9,1,8};
int nums3[7] = {7,4,2,6,5,5,2};
int nums4[30] = {1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1
};
int result1 = triangularSum(nums1, 5);
int result2 = triangularSum(nums2, 6);
int result3 = triangularSum(nums3, 7);
int result4 = triangularSum(nums4, 30);
printf("%d", result1);
printf("\n");
printf("%d", result2);
printf("\n");
printf("%d", result3);
printf("\n");
printf("%d", result4);
printf("\n");
}
```
#### Run RTL sim
```
8
6
8
2
Excuting 7767 instructions, 10121 cycles, 1.303 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.049 s
Simulation cycles: 10132
Simulation speed : 0.206776 MHz
```
#### Run ISS sim
```
./rvsim --memsize 128 -l trace.log ../sw/HW3-2-org/HW3-2-org.elf
8
6
8
2
Excuting 7767 instructions, 10121 cycles, 1.303 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 10121
Simulation speed : 6.115 MHz
make[1]: Leaving directory '/home/dhiptmc/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### Modified code:
```clike=
#include <stdlib.h>
#include <stdio.h>
int triangularSum(int* nums, int numsSize){
int* coef = (int*)malloc(sizeof(int)*numsSize);
coef[0] = 1;
coef[numsSize-1] = 1;
for (int i = 1; i < numsSize-1; i++)
coef[i] = coef[i-1] * (numsSize - i) / i ;
int result = 0;
for (int i = 0; i < numsSize; i++)
result += nums[i] * coef[i];
return result % 10;
}
int main(int argc, char *argv[]){
int nums1[5] = {1,2,3,4,5};
int nums2[6] = {3,2,3,9,1,8};
int nums3[7] = {7,4,2,6,5,5,2};
int nums4[30] = {1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1
};
int result1 = triangularSum(nums1, 5);
int result2 = triangularSum(nums2, 6);
int result3 = triangularSum(nums3, 7);
int result4 = triangularSum(nums4, 30);
printf("%d", result1);
printf("\n");
printf("%d", result2);
printf("\n");
printf("%d", result3);
printf("\n");
printf("%d", result4);
printf("\n");
}
```
#### Run RTL sim
```
8
6
8
2
Excuting 5350 instructions, 7176 cycles, 1.341 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.035 s
Simulation cycles: 7187
Simulation speed : 0.205343 MHz
```
#### Run ISS sim
```
./rvsim --memsize 128 -l trace.log ../sw/HW3-2/HW3-2.elf
8
6
8
2
Excuting 5350 instructions, 7176 cycles, 1.341 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.001 s
Simulation cycles: 7176
Simulation speed : 5.863 MHz
make[1]: Leaving directory '/home/dhiptmc/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### Software optimizations
* We exposed the binomial coefficient feature in this problem. Since we are dedicated to decrease the branch activities, nested loops are simply unacceptable.
* Based on the changes, we can save up to `2417` instructions and `2945` cycles. Notice that the code is based on a larger testing set. Dealing with a smaller testing set, e.g., nums[3] = {1,2,1} may have smaller or even no benefit from the new code.
* Time complexity is improved from O(n^2) to O(n).
:::warning
:warning: There are some problems in the modified code, while the numsSize became larger, there are inevitably going to occur overflow error for the
```
coef[i] = coef[i-1] * (numsSize - i) / i ;
```
It makes sense since the binomial coefficient grows really quick.
Overall, the code takes advantages of the larger array, but fails at the even more larger array. There should be solution for this approach.
:::
## Explain how srv32 works with Verilator
* The Verilator package converts Verilog and SystemVerilog hardware description language (HDL) designs into a C++ or SystemC model that after compiling can be executed. Verilator is not a traditional simulator, but a compiler.
* In srv32, it uses Verilator to generate the RTL-level simulator.
* The RTL-level simulator is capable of simulating the execution of RISC-V binary at RTL-level.
* C program is processed and copied into /sim directory(simulation environment generated by Verilator) and simulated on srv32 hardware virtually.