Homework3: SoftCPU

# Homework3: SoftCPU ###### tags: `110-1_RISC-V` ## Installation * According to README.md from [SRV32 - Simple 3-stage pipeline RISC-V processor](https://github.com/sysprog21/srv32), I install RISC-V toolchain via pre-built GNU Toolchain [xPack GNU RISC-V Embedded GCC v10.2.0-1.2](https://github.com/xpack-dev-tools/riscv-none-embed-gcc-xpack/releases/tag/v10.2.0-1.2) * First, download [xpack-riscv-none-embed-gcc-10.2.0-1.2-linux-x64.tar.gz](https://github.com/xpack-dev-tools/riscv-none-embed-gcc-xpack/releases/download/v10.2.0-1.2/xpack-riscv-none-embed-gcc-10.2.0-1.2-linux-x64.tar.gz) * Untar the file ```c= $ sudo tar -xvf xpack-riscv-none-embed-gcc-10.2.0-1.2-linux-x64.tar.gz ``` * Create Softlink to set the PATH ```c= $ ln -s /opt/xpack-riscv-none-embed-gcc-10.2.0-1.2/ riscv-none-embed ``` * Set the PATH in `./.bashrc` ```c= $ echo "export PATH=/opt/riscv-none-embed/bin:$PATH" ``` * Install the required packages ```c= $ sudo apt install build-essential ccache $ sudo apt install lcov $ sudo apt install gtkwave $ sudo apt install autoconf automake autotools-dev curl gawk git \ build-essential bison flex texinfo gperf libtool patchutils bc git \ libmpc-dev libmpfr-dev libgmp-dev gawk zlib1g-dev libexpat1-dev ``` * [SRV32 - Simple 3-stage pipeline RISC-V processor](https://github.com/sysprog21/srv32) ```c= $ cd ~ $ git clone https://github.com/sysprog21/srv32.git ``` ## Rewrite the C program * [Homework1:Find Numbers with Even Number of Digits](https://hackmd.io/IW1RyxsuSnG2I-dbbCFnIA) ```c= #include <stdio.h> #include <stdlib.h> int main() { volatile int output=0; volatile int number_array[6]={12,345,1771,2,6,7896}; for(volatile int i = 0;i<6;i++) { volatile int counter=0; while(number_array[i]>0) { counter++; number_array[i]/=10; } if(counter%2==0) output++; } printf("The number of even digits are: %d\n",output); return 0; } ``` * Create directory for C program and Makefile :::spoiler **Makefile** ```c= include ../common/Makefile.common EXE = .elf SRC = N_EN.c CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = N_EN OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` ::: ![](https://i.imgur.com/wnNNDg6.png) * Run the ISS sim ```c= $ make all-sw ``` ![](https://i.imgur.com/Rq99DmC.png) * Run the RTL sim ```c= $ make all ``` ![](https://i.imgur.com/MYXb4EA.png) :::spoiler **Main Program Assembly Code Generated By SRV32** ```c= 0000003c <main>: 3c: 000207b7 lui a5,0x20 40: 06078793 addi a5,a5,96 # 20060 <__malloc_trim_threshold+0x28> 44: 0007a683 lw a3,0(a5) 48: fc010113 addi sp,sp,-64 4c: 0047a703 lw a4,4(a5) 50: 00012623 sw zero,12(sp) 54: 0087a603 lw a2,8(a5) 58: 00d12c23 sw a3,24(sp) 5c: 00c7a683 lw a3,12(a5) 60: 00e12e23 sw a4,28(sp) 64: 0107a703 lw a4,16(a5) 68: 02c12023 sw a2,32(sp) 6c: 0147a783 lw a5,20(a5) 70: 02d12223 sw a3,36(sp) 74: 02e12423 sw a4,40(sp) 78: 02f12623 sw a5,44(sp) 7c: 00012823 sw zero,16(sp) 80: 01012703 lw a4,16(sp) 84: 02112e23 sw ra,60(sp) 88: 00500793 li a5,5 8c: 08e7c063 blt a5,a4,10c <main+0xd0> 90: 00a00693 li a3,10 94: 00500613 li a2,5 98: 00012a23 sw zero,20(sp) 9c: 02c0006f j c8 <main+0x8c> a0: 01412783 lw a5,20(sp) a4: 03010713 addi a4,sp,48 a8: 00178793 addi a5,a5,1 ac: 00f12a23 sw a5,20(sp) b0: 01012783 lw a5,16(sp) b4: 00279793 slli a5,a5,0x2 b8: 00f707b3 add a5,a4,a5 bc: fe87a703 lw a4,-24(a5) c0: 02d74733 div a4,a4,a3 c4: fee7a423 sw a4,-24(a5) c8: 01012783 lw a5,16(sp) cc: 03010713 addi a4,sp,48 d0: 00279793 slli a5,a5,0x2 d4: 00f707b3 add a5,a4,a5 d8: fe87a783 lw a5,-24(a5) dc: fcf042e3 bgtz a5,a0 <main+0x64> e0: 01412783 lw a5,20(sp) e4: 0017f793 andi a5,a5,1 e8: 00079863 bnez a5,f8 <main+0xbc> ec: 00c12783 lw a5,12(sp) f0: 00178793 addi a5,a5,1 f4: 00f12623 sw a5,12(sp) f8: 01012783 lw a5,16(sp) fc: 00178793 addi a5,a5,1 100: 00f12823 sw a5,16(sp) 104: 01012783 lw a5,16(sp) 108: f8f658e3 bge a2,a5,98 <main+0x5c> 10c: 00c12583 lw a1,12(sp) 110: 00020537 lui a0,0x20 114: 03c50513 addi a0,a0,60 # 2003c <__malloc_trim_threshold+0x4> 118: 054000ef jal ra,16c <printf> 11c: 03c12083 lw ra,60(sp) 120: 00000513 li a0,0 124: 04010113 addi sp,sp,64 128: 00008067 ret ``` ::: ## Use GTKWave to observe the wave * After the command `make all`, SRV32 would generate `wave.fst` * Open GTKWave and select `./sim/wave.fst`, to observe how instruction is processed ### SRV32 - Simple 3-stage pipeline RISC-V processor Architecture * SRV32 is a 3-stage pipeline with IF/ID,EX,WB stages * The picture below shows the pipeline architecture and some important signals of SRV32 ![](https://i.imgur.com/mweY9p4.png) ### Branch * In SRV32 the branch penalty is 2. To clarify,when next PC is resolved, one instruction has been fetch into pipeline and another PC has been calculated because address should be computed one cycle ahead * So, when branch occur it should flush 2 instruction(set as NOP) * For example, in my main program, `dc: fcf042e3 bgtz a5,a0 <main+0x64>`,this branch instruction should be taken ![](https://i.imgur.com/6FyHGe4.png) Branch instruction`bgtz a5,a0 <main+0x64>`is resolved at the EX stage, so the following two instrcution, `lw a5,20(sp)`,`andi a5,a5,1` should be flushed * The timing diagram of above instruction sequence show as below: |Address| Instruction | Cycle1 | Cycle2 |Cycle3 | Cycle4 | Cycle5 |Cycle6 | |-------| -------- | -------- | -------- |-------- |-------- |-------- |-------- | |DC | bgtz a5,a0 | IF/ID | EXE |WB | | | | |E0 | lw a5,20(sp)| | NOP |NOP |NOP | | | |E4 | andi a5,a5,1| | |NOP |NOP |NOP | | |A0 | lw a5,20(sp)| | | |IF/ID |EXE |WB | ### Forwarding * Consider instruction sequence in my main program below: ``` d0: 00279793 slli a5,a5,0x2 d4: 00f707b3 add a5,a4,a5 d8: fe87a783 lw a5,-24(a5) ``` #### Address d0 and Address d4 data hazard * `add a5,a4,a5` at EXE stage and `slli a5,a5,0x2` at WB stage have RAW data hazard on register `a5` * The correct data of `a5` (result from `slli a5,a5,0x2`)is stored in signal `wb_result`at WB stage * Because (`wb_dst_sel[4:0]==ex_src1_sel[4:0]`) is True(both equal to 0F) and `wb_mem2reg` is False * So `wb_result` is forwarded to `a5` register(store in `reg_rdata1[31:0]=00000000`) #### Address d4 and Address d8 data hazard * `lw a5,-24(a5)` at EXE stage and `add a5,a4,a5` at WB stage have RAW data hazard on register `a5` * The correct data of `a5` (result from `add a5,a4,a5`)is stored in signal `wb_result`at WB stage * Because (`wb_dst_sel[4:0]==ex_src1_sel[4:0]`) is True(both equal to 0E) and `wb_mem2reg` is False * So `wb_result` is forwarded to `a5` register(store in `reg_rdata1[31:0]=0003FFF0`) ![](https://i.imgur.com/NL8eylA.png) * The timing diagram of above instruction sequence show as below: |Address| Instruction | Cycle1 | Cycle2 |Cycle3 | Cycle4 | Cycle5 | |-------| ----------- | -------- | -------- |-------- |--------|-------- | |D0 | slli a5,a5,0x2| IF/ID | EXE |**WB**⬂ | | | |D4 | add a5,a4,a5 | | IF/ID |**EXE**⬃ |**WB**⬂ | | |D8 | lw a5,-24(a5) | | |IF/ID |**EXE**⬃|WB | ### C program result * According to main program assembly code, once the program_counter reach `118`,the result would be generated and store in register`a1` ![](https://i.imgur.com/axOZMvm.png) ![](https://i.imgur.com/xEaF1qC.png) ## Proposed software optimization :::spoiler **Original C code** ```c= #include <stdio.h> #include <stdlib.h> int main() { volatile int output=0; volatile int number_array[6]={12,345,1771,2,6,7896}; for(volatile int i = 0;i<6;i++) { volatile int counter=0; while(number_array[i]>0) { counter++; number_array[i]/=10; } if(counter%2==0) output++; } printf("The number of even digits are: %d\n",output); return 0; } ``` ::: * I found that in my C code, if I use divide and modulo operation, it will make riscv-none-embed-gcc to take a lot of instructions and cycle to finish divide and modulo operation without using `div` instruction * As a consequence, I replace all the divide and modulo operation with new approachs * In the new approach, I use the property that if the number smaller than 10,it must only has 1 digits, and if the number greater than 10 and smaller than 100, it must has 2 digits, so on and so forth * By this approach, I can only use `if...else...` comparison to decide how many even digits are there inside the array :::spoiler **New C code** ```c= #include <stdio.h> #include <stdlib.h> int main() { volatile int number_array[6]={12,345,1771,2,6,7896}; volatile int counter = 0; for(volatile int i = 0;i<6;i++) { if((number_array[i] > 9) & (number_array[i] < 100)) counter++; else if((number_array[i] > 99) & (number_array[i] < 1000)) counter = counter; else if(number_array[i] < 10) counter = counter; else counter++; } printf("The number of even digits are: %d\n",counter); return 0; } ``` ::: * Result of Original C code ![](https://i.imgur.com/dOfIFHy.png) * Result of New C code ![](https://i.imgur.com/EeZEjQ1.png) * After Optimizing the C program the total instruction reduce `2098-1971=127`, and the number of cycles reduce `2698-2545=153`,and the CPI improve to 1.291 :::spoiler **Further C code optimization** ```c= #include <stdio.h> #include <stdlib.h> int main() { int output=0; int number_array[6]={12,345,1771,2,6,7896}; int counter = 0; if((number_array[0] > 9) & (number_array[0] < 100)) counter++; else if((number_array[0] > 99) & (number_array[0] < 1000)) counter = counter; else if(number_array[0] < 10) counter = counter; else counter++; if((number_array[1] > 9) & (number_array[1] < 100)) counter++; else if((number_array[1] > 99) & (number_array[1] < 1000)) counter = counter; else if(number_array[1] < 10) counter = counter; else counter++; if((number_array[2] > 9) & (number_array[2] < 100)) counter++; else if((number_array[2] > 99) & (number_array[2] < 1000)) counter = counter; else if(number_array[2] < 10) counter = counter; else counter++; if((number_array[3] > 9) & (number_array[3] < 100)) counter++; else if((number_array[3] > 99) & (number_array[3] < 1000)) counter = counter; else if(number_array[3] < 10) counter = counter; else counter++; if((number_array[4] > 9) & (number_array[4] < 100)) counter++; else if((number_array[4] > 99) & (number_array[4] < 1000)) counter = counter; else if(number_array[4] < 10) counter = counter; else counter++; if((number_array[5] > 9) & (number_array[5] < 100)) counter++; else if((number_array[5] > 99) & (number_array[5] < 1000)) counter = counter; else if(number_array[5] < 10) counter = counter; else counter++; printf("The number of even digits are: %d\n",counter); return 0; } ``` ::: * For further optimization, as I mentioned above, the branch penalty in srv32 is 2 * This means that if branch is taken, it has to flush following 2 instruction, it is a waste in pipeline * Consequently, I remove the `for-loop`, and use loop unrolling technique * Result of the No-loop C code ![](https://i.imgur.com/yu5BtSp.png) * From the result, we can see the total instructions reduce `2098-1732=366`,and the number of cycles reduce `2698-2274=424`, furthermore, the CPI become 1.313 ## How srv32 works with Verilator * Verilator is a compiler that can convert verilog HDL to a cycle-accurate behavioral model in C++ for simulation. * In srv32, it use Verilator to generate the RTL-level simulator * This RTL-level simulator is capable of simulating the execution of RISC-V binary at RTL-level * Our C program is processed and copied into `/sim` directory(simulation environment generated by Verilator) and simulated on srv32 hardware virtually ## RISC-V Compliance test * RISC-V compliance tests are a set of rules that ensures a claimed RISC-V implementation fits the basic standard and plays along with other RISC-V implementation in the ecosystem. * From the [README.adoc](https://github.com/riscv-non-isa/riscv-arch-test/tree/master/doc), we can have better understanding of the compliance tests >At the heart of the testing infrastructure is the detailed compliance test. This is the RISC-V assembler code that is executed on the processor and that provides results in a **defined memory area (the signature)**. The test should only use **the minimum of instructions** and only those absolutely necessary. It should only use instructions and registers from the ISA instruction set on which it is targeted. ### What is signature? ==**signature** is a defined memory area where the result of a test suite is stored== ### How to run RISC-V compliance tests? According to the [README.md](https://github.com/riscv/riscv-compliance) in the [riscv-compliance](https://github.com/riscv/riscv-compliance) repo, these are the requirements to run the RISCV compliance tests - Specify the toolchain in the [Makefile](https://github.com/riscv/riscv-compliance/blob/master/Makefile) - `export RISCV_PREFIX ?= riscv64-unknown-elf-` - `export RISCV_TARGET_FLAGS ?=` - `export RISCV_ASSERT ?= 0` - Specify where is the target device in the [Makefile](https://github.com/riscv/riscv-compliance/blob/master/Makefile) - `export RISCV_TARGET ?= riscvOVPsim` - `export RISCV_DEVICE ?= rv32i` - Set the ISA of target device - Install `riscvOVPsim` simulator - `riscvOVPsim` is a simulator used here. It has to be installed parallelled to the `riscv-compliance` directory. - Install `riscvOVPsim` by `git clone https://github.com/riscv-ovpsim/imperas-riscv-tests.git` - Also, the environment variable `TARGET_SIM` should point to the executable `/riscv-ovpsim/bin/Linux64/riscvOVPsim.exe` Eventually, a modified version of Makefile can be found [here](https://github.com/WeiCheng14159/riscv-compliance/blob/rv32i-arch/Makefile) There're currently 48 test suites in the repository, run all the tests by running the command ```bash= $ export TARGET_SIM=/path-on-ur-computer/imperas-riscv-tests/riscv-ovpsim/bin/Linux64/riscvOVPsim.exe $ make stimulate verify ``` The result will be `OK: 48/48 RISCV_TARGET=riscvOVPsim RISCV_DEVICE=rv32i RISCV_ISA=`