# Lab3: SoftCPU ###### tags: `Course` `RISC-V` `Computer Architecture 2022` ## Initialize 1. update $PATH environment ```bash= cd $HOME source riscv-none-elf-gcc/setenv ``` 2. download Srv32 ```bash= git clone https://github.com/sysprog21/srv32 ``` 3. download packages ```linux= sudo apt install gtkwave # a fully featured GTK+ based wave viewer sudo apt install lcov # a graphical front-end for GCC's coverage testing tool ``` 4. Modify the makefile.common ```makefile= # can change CFLAG if neen other opitmizer rv32c ?= 0 CC := $(CROSS_COMPILE)gcc AS := $(CROSS_COMPILE)as AR := $(CROSS_COMPILE)ar OBJDUMP := $(CROSS_COMPILE)objdump OBJCOPY := $(CROSS_COMPILE)objcopy READELF := $(CROSS_COMPILE)readelf ifeq ($(rv32c), 1) ARCH := -march=rv32imac_zicsr -mabi=ilp32 $(EXTRA_CFLAGS) else ARCH := -march=rv32im_zicsr -mabi=ilp32 $(EXTRA_CFLAGS) endif CFLAGS := -O3 -Wall $(ARCH) -nostartfiles -nostdlib ASFLAGS := $(ARCH) LDFLAGS := -lc -lm -lgcc -lsys ``` 5. Install Verilator ```linux= sudo apt-get install verilator # Will report version error, dont use # use below instr. git clone https://github.com/verilator/verilator # first time install # Every time you need to build: unsetenv VERILATOR_ROOT # For csh; ignore error if on bash unset VERILATOR_ROOT # For bash cd verilator git pull # Make sure git repository is up-to-date git tag # See what versions exist git checkout v5.002 # Switch to specified release version sudo apt-get install git perl python3 make autoconf g++ flex bison ccache autoconf # Create ./configure script ./configure # Configure and create Makefile make -j `nproc` # Build Verilator itself (if error, try just 'make') sudo make install ``` ## Analysis srv32 CPU ### **Datapath and Pipeline** The picture below is the architecture of **S**imple **r**isc-**v** **32**-bit CPU. There are only 3 stages `IF/ID`,`EX`, `WB`. And `I-MEM` and `D-MEM` are out of the file. ![](https://i.imgur.com/toF0IJ0.png) `I-MEM` : active when clk raise. Give the Instructions to deocder according to the fetch pc. `ex_reg` : Get register data for ALU to execute. Can forward `ex_result` after ALU complete the calculation. `wb_reg` : Get the `ex_result` and `wdata`. Also can forword to `ex` stage. `fetch_pc`, `Register File` : Cahnge pc and `ex_flush` or `wb_flush` when branch is taken. Write data into `Register File` ### **Hazzard 、 Stall** More details of Hazzard and stall will show in the next part `Derived from Assignment2/Waveform analysis`. With the waveform result, we can know how it works in intuitive ways. ## Derived from Assignment2 ### Modify Makefile * Makefile ```makefile= include ../common/Makefile.common EXE = .elf SRC = HW3.c CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = HW3 OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` ### Run RTL SIM * Oringin C code ```c= void decode(int *encoded, int encodedSize, int first){ int result [encodedSize+1]; result[0]=first; printf("%d ", result[0]) for(int i=0;i<encodedSize;i++){ result[i+1]=result[i] ^ encoded[i]; printf("%d ", result[i+1]); } } int main(){ int nums1[4] = {6,2,7,3}; decode(&nums1[0], 4, 4); } ``` * Run code ```linux= make common <!--if report error of "cannot find -lsys"--> make HW3 ``` * Result ```linux= Excuting 5477 instructions, 7381 cycles, 1.347 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.025 s Simulation cycles: 7392 Simulation speed : 0.29568 MHz ``` ### Run ISS * Result ```linux= ./rvsim --memsize 128 -l trace.log ../sw/HW3/HW3.elf Excuting 5477 instructions, 7381 cycles, 1.348 CPI Program terminate Simulation statistics ===================== Simulation time : 0.002 s Simulation cycles: 7381 Simulation speed : 4.259 MHz ``` ### Waveform analysis * **Branch Penalty** Branch Penalty will occur when the branch/jump is TAKEN, which will let the pipeline to waste some cycles. In `Srv32`, the penalty will be `2` due to the `next_pc` should be fetch one cycle ahead before into I-MEM. ``` 130: 00400613 li a2,4 134: 00400593 li a1,4 138: f05ff0ef jal ra,3c <decode> 13c: 01c12083 lw ra,28(sp) 140: 00000513 li a0,0 ``` ![](https://i.imgur.com/K2iSQPQ.png) Let's see picture above. we can found that there are `2` cycle are wasted <font color=yellow>(The yellow line).</font> Then for jump instr. `next_pc` = `ex_pc` + `ex_imm`. At above case, we cam found `next_pc` = `0x1387` + `0xFFFFFF04` = `0x0000003C`<font color=cake>(The red retangles).</font> After two cycle, the `next_pc` value will be sent to `if_pc`.And flush instructions in `ex` & `wb`<font color=green>(The blue & green line in the picture).</font> * **Forwarding** Srv32 have support full forwarding, so that there are no stall when executing data hazard instructions. (load-use won't stall either.Details will explain at next part) ``` 3c: 00158613 addi a2,a1,1 40: 00261613 slli a2,a2,0x2 ``` ![](https://i.imgur.com/EZdfh5C.png) The picture above shows that the `ex_stall` is not HIGH and `wb_mem2reg` isn't HIGH either. Which means that in this case, the instructions hazard is not load-use. Then we observe `ex_dst_sel` and `ex_src1_sel`.<font color=cake>(The red retangles)</font> We notice that there are data hazard occur due to the same value of `0x0C` at `ex_dst_sel` when `ex_pc` is `0x3C` and `ex_src1_sel` when `ex_pc` is `0x40`. However, stall isn't happen. Therefore, we can discover the `ex_result` has pass to `wb_result` then `reg_rdata2`, which is used at the followiing cycle.<font color=yellow>(The yellow lines)</font> * **Load-use hazard** When we learn about 5-stage pipeline, we know that there will occur a stall when load-use. But now, we have a 3-stage pipeline, which won't lead to a stall that wasting cycle. At `Analysis srv32 CPU`, we notice that `D-MEM` read at `WB` stage and `RegisterFile` read at `WB` stage too. In addition, a single MUX is used to classify 2 operands, which resolves the problem of load-use hazard. ``` 64: fd842683 lw a3,-40(s0) 68: 00168693 addi a3,a3,1 ``` ![](https://i.imgur.com/bvcBSJJ.png) Then we can start to figure out how this Srv32 avoid the stall. First, the value of `ex_dst_sel` and `ex_src1_sel` are same, showing that the hazard is happen.<font color=yellow>(The yellow retangles)</font> Then we check the `wb_mem2reg` and the signal is active HIGH, telling us this case is load-use hazard.<font color=green>(The green lines)</font> Then we see the `wb_rdata` is passed to `reg_rdata1`. Therefore, we don't need to stall to wait for `D-MEM` to get the value.<font color=blue>(The blue retangle)</font> ### Optimize & Result * At the wave simulation, I found that my C code will use lots of times to branch, While each of them will cost me 2 cycles. Thus, I flaten the for loop to make it cost less cycles. ![](https://i.imgur.com/IgjqmBN.png) ```c= //The old C code // result[0] = first; // for(int i=0;i<encodedSize;i++){ // result[i+1]=result[i] ^ encoded[i]; // } result[0] = first; result[1]=result[0] ^ encoded[0]; result[2]=result[1] ^ encoded[1]; result[3]=result[2] ^ encoded[2]; result[4]=result[3] ^ encoded[3]; ``` * RTL Simulation result ```linux= Excuting 5430 instructions, 7318 cycles, 1.347 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.028 s Simulation cycles: 7329 Simulation speed : 0.26175 MHz ``` * ISS result ```linux= ./rvsim --memsize 128 -l trace.log ../sw/HW3/HW3.elf Excuting 5430 instructions, 7318 cycles, 1.348 CPI Program terminate Simulation statistics ===================== Simulation time : 0.002 s Simulation cycles: 7318 Simulation speed : 3.547 MHz ``` The Executing instructions has reduce `47` instructions, `63` cycles ## Leetcode ([Longest Substring Without Repeating Characters](https://leetcode.com/problems/longest-substring-without-repeating-characters/description/)) ### Modify Makefile * Makwfile ```makefile= include ../common/Makefile.common EXE = .elf SRC = HW3-1.c CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = HW3-1 OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` * C code ```c= #include <stdio.h> int max(int a, int b) { return a > b ? a : b; } int lengthOfLongestSubstring(char * s){ int last_use[128] = {0}; int ret = 0; int last_rep = 0; char ch; for(int i = 0; (ch = s[i]) != 0; ++i) { last_rep = max(last_use[ch], last_rep); ret = max(ret, i - last_rep + 1); last_use[ch] = i + 1; } return ret; } int main() { char str[8] = "abcabcbb"; int result; result = lengthOfLongestSubstring(str); printf("The result is : %d", result); } ``` ### Run RTL SIM * Result ```linux= Excuting 2504 instructions, 3520 cycles, 1.405 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.014 s Simulation cycles: 3531 Simulation speed : 0.252214 MHz ``` ### Run ISS * Result ```linux= ./rvsim --memsize 128 -l trace.log ../sw/HW3-1/HW3-1.elf Excuting 2504 instructions, 3520 cycles, 1.406 CPI Program terminate Simulation statistics ===================== Simulation time : 0.001 s Simulation cycles: 3520 Simulation speed : 4.156 MHz ``` ### Optimize & Result * Because I have already discuss the detail of waveform above. So in this part, I skip the waveform analysis so as to focus on the Optimize & Result. * **Reduce the parameter** In the for loop, we use `(ch = s[i]) != 0`, which will produce more cycle to mv the value and compare to 0. So I am thinking wether can i build a struct to concur it. ```c= typedef struct { char *str; int length; }String; ... int lengthOfLongestSubstring(String s){ ... for(int i = 0; i<s.length-1; ++i) { ch = s.str[i]; last_rep = max(last_use[ch], last_rep); ret = max(ret, i - last_rep + 1); last_use[ch] = i + 1; } ... } int main() { String str = {"abcabcbb", 8}; int result; result = lengthOfLongestSubstring(str); printf("The result is : %d", result); } ``` * Result ```linux= Excuting 2488 instructions, 3500 cycles, 1.406 CPI ``` Reduce `16` instructions and `20` cycles. * **Useless instruction** Then, I found that there are some useless code which will cause assembly will have some unnecessary branch. so I modify the code as follow. I delete the "max" function and modify the way to give `last_rep` and `ret` values. ```c= // The old one // int max(int a, int b) { // return a > b ? a : b; // } // ... // last_rep = max(last_use[ch], last_rep); // ret = max(ret, i - last_rep + 1); if (last_use[ch] > last_rep) last_rep = last_use[ch]; if (ret < i - last_rep + 1) ret = i - last_rep + 1; ``` * Result ```linux= Excuting 2481 instructions, 3493 cycles, 1.407 CPI ``` Reduce `7` instructions and `7` cycles. * Loop unrolling We can flaten the loop so that we can reduce the cycle penalty of branch. * Result ```linux= Excuting 2464 instructions, 3464 cycles, 1.406 CPI ``` Reduce `17` instructions and `29` cycles. * **The Optimize totally reduce `40` instructions and `56` cycles.** ## Explain how srv32 works with Verilator * Verilator is a used to verify the Verilog or SystemVerilog code. However, We need to write C++/SystemC file, which is compiled by a C compiler.The resulting executable performs the design simulation. * In addition, when we type `make` below the file of srv32/ , it will call the makefile with `make $@.run` command, which calls the makefile in srv32/sim directory. * For the description of other files, we can reference to the [website](https://verilator.org/guide/latest/files.html#files-read-written).