# Lab3: SoftCPU
###### tags: `Course` `RISC-V` `Computer Architecture 2022`
## Initialize
1. update $PATH environment
```bash=
cd $HOME
source riscv-none-elf-gcc/setenv
```
2. download Srv32
```bash=
git clone https://github.com/sysprog21/srv32
```
3. download packages
```linux=
sudo apt install gtkwave # a fully featured GTK+ based wave viewer
sudo apt install lcov # a graphical front-end for GCC's coverage testing tool
```
4. Modify the makefile.common
```makefile=
# can change CFLAG if neen other opitmizer
rv32c ?= 0
CC := $(CROSS_COMPILE)gcc
AS := $(CROSS_COMPILE)as
AR := $(CROSS_COMPILE)ar
OBJDUMP := $(CROSS_COMPILE)objdump
OBJCOPY := $(CROSS_COMPILE)objcopy
READELF := $(CROSS_COMPILE)readelf
ifeq ($(rv32c), 1)
ARCH := -march=rv32imac_zicsr -mabi=ilp32 $(EXTRA_CFLAGS)
else
ARCH := -march=rv32im_zicsr -mabi=ilp32 $(EXTRA_CFLAGS)
endif
CFLAGS := -O3 -Wall $(ARCH) -nostartfiles -nostdlib
ASFLAGS := $(ARCH)
LDFLAGS := -lc -lm -lgcc -lsys
```
5. Install Verilator
```linux=
sudo apt-get install verilator # Will report version error, dont use
# use below instr.
git clone https://github.com/verilator/verilator # first time install
# Every time you need to build:
unsetenv VERILATOR_ROOT # For csh; ignore error if on bash
unset VERILATOR_ROOT # For bash
cd verilator
git pull # Make sure git repository is up-to-date
git tag # See what versions exist
git checkout v5.002 # Switch to specified release version
sudo apt-get install git perl python3 make autoconf g++ flex bison ccache
autoconf # Create ./configure script
./configure # Configure and create Makefile
make -j `nproc` # Build Verilator itself (if error, try just 'make')
sudo make install
```
## Analysis srv32 CPU
### **Datapath and Pipeline**
The picture below is the architecture of **S**imple **r**isc-**v** **32**-bit CPU.
There are only 3 stages `IF/ID`,`EX`, `WB`. And `I-MEM` and `D-MEM` are out of the file.

`I-MEM` : active when clk raise. Give the Instructions to deocder according to the fetch pc.
`ex_reg` : Get register data for ALU to execute. Can forward `ex_result` after ALU complete the calculation.
`wb_reg` : Get the `ex_result` and `wdata`. Also can forword to `ex` stage.
`fetch_pc`, `Register File` : Cahnge pc and `ex_flush` or `wb_flush` when branch is taken.
Write data into `Register File`
### **Hazzard 、 Stall**
More details of Hazzard and stall will show in the next part `Derived from Assignment2/Waveform analysis`. With the waveform result, we can know how it works in intuitive ways.
## Derived from Assignment2
### Modify Makefile
* Makefile
```makefile=
include ../common/Makefile.common
EXE = .elf
SRC = HW3.c
CFLAGS += -L../common
LDFLAGS += -T ../common/default.ld
TARGET = HW3
OUTPUT = $(TARGET)$(EXE)
.PHONY: all clean
all: $(TARGET)
$(TARGET): $(SRC)
$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS)
$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
$(OBJCOPY) -O binary $(OUTPUT) memory.bin
$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
$(READELF) -a $(OUTPUT) > $(TARGET).symbol
clean:
$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin
```
### Run RTL SIM
* Oringin C code
```c=
void decode(int *encoded, int encodedSize, int first){
int result [encodedSize+1];
result[0]=first;
printf("%d ", result[0])
for(int i=0;i<encodedSize;i++){
result[i+1]=result[i] ^ encoded[i];
printf("%d ", result[i+1]);
}
}
int main(){
int nums1[4] = {6,2,7,3};
decode(&nums1[0], 4, 4);
}
```
* Run code
```linux=
make common <!--if report error of "cannot find -lsys"-->
make HW3
```
* Result
```linux=
Excuting 5477 instructions, 7381 cycles, 1.347 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.025 s
Simulation cycles: 7392
Simulation speed : 0.29568 MHz
```
### Run ISS
* Result
```linux=
./rvsim --memsize 128 -l trace.log ../sw/HW3/HW3.elf
Excuting 5477 instructions, 7381 cycles, 1.348 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 7381
Simulation speed : 4.259 MHz
```
### Waveform analysis
* **Branch Penalty**
Branch Penalty will occur when the branch/jump is TAKEN, which will let the pipeline to waste some cycles. In `Srv32`, the penalty will be `2` due to the `next_pc` should be fetch one cycle ahead before into I-MEM.
```
130: 00400613 li a2,4
134: 00400593 li a1,4
138: f05ff0ef jal ra,3c <decode>
13c: 01c12083 lw ra,28(sp)
140: 00000513 li a0,0
```

Let's see picture above. we can found that there are `2` cycle are wasted <font color=yellow>(The yellow line).</font> Then for jump instr. `next_pc` = `ex_pc` + `ex_imm`. At above case, we cam found `next_pc` = `0x1387` + `0xFFFFFF04` = `0x0000003C`<font color=cake>(The red retangles).</font> After two cycle, the `next_pc` value will be sent to `if_pc`.And flush instructions in `ex` & `wb`<font color=green>(The blue & green line in the picture).</font>
* **Forwarding**
Srv32 have support full forwarding, so that there are no stall when executing data hazard instructions.
(load-use won't stall either.Details will explain at next part)
```
3c: 00158613 addi a2,a1,1
40: 00261613 slli a2,a2,0x2
```

The picture above shows that the `ex_stall` is not HIGH and `wb_mem2reg` isn't HIGH either. Which means that in this case, the instructions hazard is not load-use.
Then we observe `ex_dst_sel` and `ex_src1_sel`.<font color=cake>(The red retangles)</font> We notice that there are data hazard occur due to the same value of `0x0C` at `ex_dst_sel` when `ex_pc` is `0x3C` and `ex_src1_sel` when `ex_pc` is `0x40`.
However, stall isn't happen. Therefore, we can discover the `ex_result` has pass to `wb_result` then `reg_rdata2`, which is used at the followiing cycle.<font color=yellow>(The yellow lines)</font>
* **Load-use hazard**
When we learn about 5-stage pipeline, we know that there will occur a stall when load-use. But now, we have a 3-stage pipeline, which won't lead to a stall that wasting cycle.
At `Analysis srv32 CPU`, we notice that `D-MEM` read at `WB` stage and `RegisterFile` read at `WB` stage too. In addition, a single MUX is used to classify 2 operands, which resolves the problem of load-use hazard.
```
64: fd842683 lw a3,-40(s0)
68: 00168693 addi a3,a3,1
```

Then we can start to figure out how this Srv32 avoid the stall. First, the value of `ex_dst_sel` and `ex_src1_sel` are same, showing that the hazard is happen.<font color=yellow>(The yellow retangles)</font> Then we check the `wb_mem2reg` and the signal is active HIGH, telling us this case is load-use hazard.<font color=green>(The green lines)</font> Then we see the `wb_rdata` is passed to `reg_rdata1`. Therefore, we don't need to stall to wait for `D-MEM` to get the value.<font color=blue>(The blue retangle)</font>
### Optimize & Result
* At the wave simulation, I found that my C code will use lots of times to branch, While each of them will cost me 2 cycles. Thus, I flaten the for loop to make it cost less cycles.

```c=
//The old C code
// result[0] = first;
// for(int i=0;i<encodedSize;i++){
// result[i+1]=result[i] ^ encoded[i];
// }
result[0] = first;
result[1]=result[0] ^ encoded[0];
result[2]=result[1] ^ encoded[1];
result[3]=result[2] ^ encoded[2];
result[4]=result[3] ^ encoded[3];
```
* RTL Simulation result
```linux=
Excuting 5430 instructions, 7318 cycles, 1.347 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.028 s
Simulation cycles: 7329
Simulation speed : 0.26175 MHz
```
* ISS result
```linux=
./rvsim --memsize 128 -l trace.log ../sw/HW3/HW3.elf
Excuting 5430 instructions, 7318 cycles, 1.348 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 7318
Simulation speed : 3.547 MHz
```
The Executing instructions has reduce `47` instructions, `63` cycles
## Leetcode ([Longest Substring Without Repeating Characters](https://leetcode.com/problems/longest-substring-without-repeating-characters/description/))
### Modify Makefile
* Makwfile
```makefile=
include ../common/Makefile.common
EXE = .elf
SRC = HW3-1.c
CFLAGS += -L../common
LDFLAGS += -T ../common/default.ld
TARGET = HW3-1
OUTPUT = $(TARGET)$(EXE)
.PHONY: all clean
all: $(TARGET)
$(TARGET): $(SRC)
$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS)
$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
$(OBJCOPY) -O binary $(OUTPUT) memory.bin
$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
$(READELF) -a $(OUTPUT) > $(TARGET).symbol
clean:
$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin
```
* C code
```c=
#include <stdio.h>
int max(int a, int b) {
return a > b ? a : b;
}
int lengthOfLongestSubstring(char * s){
int last_use[128] = {0};
int ret = 0;
int last_rep = 0;
char ch;
for(int i = 0; (ch = s[i]) != 0; ++i) {
last_rep = max(last_use[ch], last_rep);
ret = max(ret, i - last_rep + 1);
last_use[ch] = i + 1;
}
return ret;
}
int main()
{
char str[8] = "abcabcbb";
int result;
result = lengthOfLongestSubstring(str);
printf("The result is : %d", result);
}
```
### Run RTL SIM
* Result
```linux=
Excuting 2504 instructions, 3520 cycles, 1.405 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.014 s
Simulation cycles: 3531
Simulation speed : 0.252214 MHz
```
### Run ISS
* Result
```linux=
./rvsim --memsize 128 -l trace.log ../sw/HW3-1/HW3-1.elf
Excuting 2504 instructions, 3520 cycles, 1.406 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.001 s
Simulation cycles: 3520
Simulation speed : 4.156 MHz
```
### Optimize & Result
* Because I have already discuss the detail of waveform above. So in this part, I skip the waveform analysis so as to focus on the Optimize & Result.
* **Reduce the parameter**
In the for loop, we use `(ch = s[i]) != 0`, which will produce more cycle to mv the value and compare to 0. So I am thinking wether can i build a struct to concur it.
```c=
typedef struct {
char *str;
int length;
}String;
...
int lengthOfLongestSubstring(String s){
...
for(int i = 0; i<s.length-1; ++i) {
ch = s.str[i];
last_rep = max(last_use[ch], last_rep);
ret = max(ret, i - last_rep + 1);
last_use[ch] = i + 1;
}
...
}
int main()
{
String str = {"abcabcbb", 8};
int result;
result = lengthOfLongestSubstring(str);
printf("The result is : %d", result);
}
```
* Result
```linux=
Excuting 2488 instructions, 3500 cycles, 1.406 CPI
```
Reduce `16` instructions and `20` cycles.
* **Useless instruction**
Then, I found that there are some useless code which will cause assembly will have some unnecessary branch. so I modify the code as follow. I delete the "max" function and modify the way to give `last_rep` and `ret` values.
```c=
// The old one
// int max(int a, int b) {
// return a > b ? a : b;
// }
// ...
// last_rep = max(last_use[ch], last_rep);
// ret = max(ret, i - last_rep + 1);
if (last_use[ch] > last_rep) last_rep = last_use[ch];
if (ret < i - last_rep + 1) ret = i - last_rep + 1;
```
* Result
```linux=
Excuting 2481 instructions, 3493 cycles, 1.407 CPI
```
Reduce `7` instructions and `7` cycles.
* Loop unrolling
We can flaten the loop so that we can reduce the cycle penalty of branch.
* Result
```linux=
Excuting 2464 instructions, 3464 cycles, 1.406 CPI
```
Reduce `17` instructions and `29` cycles.
* **The Optimize totally reduce `40` instructions and `56` cycles.**
## Explain how srv32 works with Verilator
* Verilator is a used to verify the Verilog or SystemVerilog code. However, We need to write C++/SystemC file, which is compiled by a C compiler.The resulting executable performs the design simulation.
* In addition, when we type `make` below the file of srv32/ , it will call the makefile with `make $@.run` command, which calls the makefile in srv32/sim directory.
* For the description of other files, we can reference to the [website](https://verilator.org/guide/latest/files.html#files-read-written).