# Assignment3: SoftCPU ###### tags: `RISC-V` `Computer Architecture 2022` ## Prerequisites 1. Prepare GNU Toolchain for RISC-V. See [The xPack GNU RISC-V Embedded GCC](https://xpack.github.io/riscv-none-elf-gcc/) ```shell $ cd /tmp $ wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v12.2.0-1/xpack-riscv-none-elf-gcc-12.2.0-1-linux-x64.tar.gz $ tar zxvf xpack-riscv-none-elf-gcc-12.2.0-1-linux-x64.tar.gz $ cp -af xpack-riscv-none-elf-gcc-12.2.0-1 $HOME/riscv-none-elf-gcc ``` Configure `$PATH` ```shell $ cd $HOME/riscv-none-elf-gcc $ echo "export PATH=`pwd`/bin:$PATH" > setenv ``` Once step (1) and (2) are complete, you can simply update `$PATH` environment variable via: ```shell $ cd $HOME $ source riscv-none-elf-gcc/setenv ``` Check `$PATH` at the first time: ```shell $ riscv-none-elf-gcc -v ``` You shall be able to see the following messages: ``` gcc version 12.2.0 (xPack GNU RISC-V Embedded GCC x86_64) ``` 2. Install [Verilator](https://verilator.org/guide/latest/install.html) packages Package verilator is required, but the default package provided by Ubuntu Linux was too old. Use below: ``` cd $HOME git clone https://github.com/verilator/verilator cd verilator git checkout stable export VERILATOR_ROOT=`pwd` ./configure make ``` Then, you can set environment variables in advance. ``` export VERILATOR_ROOT=$HOME/verilator export PATH=$VERILATOR_ROOT/bin:$PATH ``` Make sure the version of Verilator >= 5.002. ``` $ verilator --version Verilator 5.002 2022-10-29 rev v5.002-29-gdb39d70c7 ``` 3. Install the dependent packages ``` sudo apt install gtkwave sudo apt install lcov ``` :::info Run every time when open a new terminal: ```shell cd $HOME source riscv-none-elf-gcc/setenv ``` ```shell export VERILATOR_ROOT=$HOME/verilator export PATH=$VERILATOR_ROOT/bin:$PATH ``` ::: ## Leetcode [96 Unique Binary Search Trees](https://leetcode.com/problems/unique-binary-search-trees/) ### C code ```c= #include <stdio.h> int numTrees(int n){ if (n <= 1) return 1; int res = 0; for (int i=0; i<n ; i++) res += numTrees(i)*numTrees(n-i-1); return res; } int main(){ int n = 3; int num = numTrees(n); printf("The answer is %d\n", num); return 0; } ``` ### Start Put C code in `srv32/sw` ```bash # In srv32/ mkdir sw/hw3 # Copy existing make file to your directory cp sw/hello/Makefile sw/hw3/ # Modify the makefile vim sw/hw3/Makefile # Write down your code vim sw/hw3/hw3.c # Run it make hw3 ``` modify `sw/hw3/Makefile`: ``` include ../common/Makefile.common EXE = .elf SRC = hw3.c CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = hw3 OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` Now we can run the simulation and compare the result of RTL simulation and ISS simulation. Run the code by the simulator: ```bash # Run the make command at the root path srv32 directory make hw3 ``` ### Validate the results #### Run RTL sim ```shell cd sim && ./sim +dump ``` ![](https://i.imgur.com/CDahuNE.png =75%x) #### Run ISS sim ![](https://i.imgur.com/a50CGwI.png =75%x) ### Use GTKWave to observe the wave After the command ```make hw3```, SRV32 would generate ```wave.fst```. We can find and open the file in ```./sim/wave.fst```. #### Forwarding srv32 supports full forwarding, which means RAW data hazard can be resolved WITHOUT stalling the processor. Notice only RAW data hazard is possible, other hazard (WAW, WAR) isn’t possible on single issue processor. ##### Data hazard ``` 24: 00428293 addi t0,t0,4 28: fe62ece3 bltu t0,t1,20 <_bss_clear> ``` There is a data hazard on register `t0` between intruction `addi t0, t0, 4` and `bltu t0, t1, 20`. The implementation of register forwarding which can be found in `./rtl/riscv.v` is as follow: ```verilog assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src1_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src1_sel]; assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src2_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src2_sel]; ``` ![](https://i.imgur.com/5xW8Z9w.jpg) * Instruction `bltu t0, t1, 20` at EX stage and instruction`addi t0, t0, 4` at WB stage have RAW data hazard on register `t0`. * The latest result of `t0` (from `addi t0, t0, 4`) is stored in signal `wb_result` at WB stage. * Since (`!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src1_sel)`) is true and `wb_mem2reg` is false, `reg_rdata1` is set to be equal to `wb_result`. That is, `wb_result` is forward to `t0` register in EX stage (`bltu t0, t1, 20`). ##### Load-use hazard ``` 9c: 01812783 lw a5,24(sp) a0: 00f44463 blt s0,a5,a8 <numTrees.part.0+0x6c> a4: 22c0106f j 12d0 <numTrees.part.0+0x1294> ``` Load-use hazard is NOT an issue in `srv32` core because D-MEM is read at WB stage, and register file is also read at WB stage. A single MUX is used to switch between 2 operands (operand from register file and operand from D-MEM). Load-use hazard can be resolved **WITHOUT** stalling the processor. Consider the following instruction sequence(address `9c - a4`): | IF/ID | EX | WB | | -------- | -------- | -------- | | `j 12d0` | `blt s0, a5, a8` | `lw a5, 24(sp)` | Corresponding waveform is as follow: ![](https://i.imgur.com/CstdXZp.jpg) * Instruction `blt s0, a5, a8` at EX stage and instruction `lw a5, 24(sp)` at WB stage have load-use data hazard on register `a5`. * The result of `a5` is read from D-MEM in WB stage and stored in signal `wb_rdata`. * Since (`wb_dst_sel == ex_src2_sel`) is true and `wb_mem2reg` is true, `reg_rdata2` is set to be equal to `wb_rdata`. That is, `wb_rdata` is forward to `a5` register in EX stage (`blt s0, a5, a8`). The verilog is shown for reference and can be found in `./rtl/riscv.v`. ```verilog assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src1_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src1_sel]; assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src2_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src2_sel]; ``` #### Branch penalty When the branch is taken during the execute phase, it needs to flush the instructions that have been fetched into the pipeline, which causes a delay of two instructions, so the branch penalty for srv32 is two. Consider the following instruction sequence and its corresponding waveform. Take the instruction`jal ra,3c` for example: ``` 000014bc <main>: 14bc: ff010113 addi sp,sp,-16 14c0: 00300513 li a0,3 14c4: 00112623 sw ra,12(sp) 14c8: b75fe0ef jal ra,3c <numTrees.part.0> 14cc: 00050593 mv a1,a0 14d0: 00020537 lui a0,0x20 ``` Jump instruction `jal ra, 3c (taken)` is resolved by the END of EX stage. By the time jump instruction is resolved, two consequtive instructions, namely `mv a1, a0` and `lui a0, 0x20` will be fetched from I-MEM. These two instructions should be killed. ![](https://i.imgur.com/WOl49rH.jpg) The verilog is shown for reference and can be found in `./rtl/riscv.v`. We can find that if ```ex_jal = 1'b1```, ```next_pc``` will equal to ```ex_pc + ex_imm```. ```verilog=378 always @* begin branch_taken = !ex_flush; next_pc = fetch_pc + `IF_NEXT_PC; ex_ill_branch = 1'b0; case(1'b1) ex_jal : next_pc = ex_pc + ex_imm; ex_jalr : next_pc = alu_op1 + ex_imm; ex_branch: begin case(ex_alu_op) OP_BEQ : begin next_pc = (result_subs[32: 0] == 'd0) ? ex_pc + ex_imm : fetch_pc + `IF_NEXT_PC; if (result_subs[32: 0] != 'd0) branch_taken = 1'b0; end ... ``` ### Software optimizations #### modified C code ```c= #include <stdio.h> /* original numTrees function int numTrees(int n){ if (n <= 1) return 1; int res = 0; for (int i = 0; i < n ; i++) res += numTrees(i) * numTrees(n-i-1); return res; } */ int numTrees(int n){ int cache[n+1]; cache[0] = 1; for (int k = 1; k <= n; k++){ int sum = 0; for (int i = 1; i <= k; i++){ sum += cache[i-1] * cache[k-i]; } cache[k] = sum; } int ans = cache[n]; return ans; } int main(){ int n = 3; int num = numTrees(n); printf("The answer is %d\n", num); return 0; } ``` #### Run RTL sim ![](https://i.imgur.com/N8mFZL3.png =75%x) #### Run ISS sim ![](https://i.imgur.com/AkFlBDg.png =75%x) * Saving **`54`** instructions and **`86`** cycles. --- ## Assignment2 ### c code in `sw/hw3-1/hw3-1.c` ```c= #include <stdio.h> int lengthOfLastWord(char * s){ int length = 0,i=1; if(s[0]!=' ')length++; while(s[i]!='\0'){ if(s[i]!=' '){ if(s[i-1]==' '){ length = 1; } else{ length++; } } i++; } return length; } int main(){ char s[] = "Hello World"; printf("The length of last word is %d\n", lengthOfLastWord(s)); return 0; } ``` ### Validate the results #### Run RTL sim ![](https://i.imgur.com/RPJrSvQ.png =75%x) #### Run ISS sim ![](https://i.imgur.com/X6pn7dU.png =75%x) ### software optimizations #### modified C code ```c= #include <stdio.h> int lengthOfLastWord(char * s){ int length = 0,i=1; if(s[0]!=' ')length++; while(s[i]!='\0'){ if(s[i]!=' '){ if(s[i-1]==' '){ length = 1; } else{ length++; } } i++; } return length; } int main(){ char *s = "Hello World"; printf("The length of last word is %d\n", lengthOfLastWord(s)); return 0; } ``` #### Run RTL sim ![](https://i.imgur.com/AcAlmmX.png =75%x) #### Run ISS sim ![](https://i.imgur.com/oSdqA50.png =75%x) * Saving **`8`** instructions and **`8`** cycles.