Assignment2: RISC-V Toolchain

# Assignment2: RISC-V Toolchain contributed by < `kdnvt` > ## Rewrite [Monotonic Array](https://leetcode.com/problems/monotonic-array/) I choose the problem [Monotonic Array](https://leetcode.com/problems/monotonic-array/) from [張邦翰](https://hackmd.io/@0M4xubFKTNeuc7WRhLY9lA/rk5VinbXs) , because I think I can use a more efficient and concise method to solve this problem. In addition, I want to apply loop unrolling and strip mining on a length-known array, because the linked list problem I chose for [hw1](https://hackmd.io/@kdnvt/2022-arch-hw1) last time did not turn out to be very effective. Original C code: ```c bool isMonotonic(int* nums, int numsSize){ bool a=false,d=false; int eq=0; for(int i=0;i<numsSize-1;++i){ if(nums[i]<nums[i+1]) a=true; if(nums[i]>nums[i+1]) d=true; if(nums[i]==nums[i+1]) eq++; } return a^d|(eq==numsSize-1); } ``` Originally, the `isMonotonic` checks whether the relation of `nums[i]` and `nums[i+1]` is `<`, `>` or `==` on each iteration. Moreover, it traverses the whole array everytime. If it can return false immediately after it determine this array is not a monotonic array, it will save many time. Here's my optimizing C code: ```c bool isMonotonic(int* nums, int numsSize){ int dir = 1; int begin = 0; int end = numsSize-1; if(nums[0] > nums[numsSize-1]){ dir = -1; begin = numsSize-1; end = 0; } for(int i=begin;i!=end;i += dir){ if(nums[i] > nums[i+dir]) return false; } return true; } ``` First, I determine whether this array is increasing or decreasing by checking the relation between first and last elements. Then, I can determine whether to compare elements from begin to end or otherwise. Therefore, I only have to compare once per iteration. The monotonic array which all elements are the same won't be a problem, too. What's more, I will return immediately the moment I detect elements violate the monotonic rule. To evaluate the performance when the array is very long, I add another test data which is an array with 1000 same value elements. ```c #include <stdio.h> #include <stdbool.h> int i[]={1,2,5,5,6}; int j[]={600,240,43,25,1}; int k[]={1,3,2,4,5}; int l[]={34,34,34,34,34}; int m[]={1,0,-1,0}; int test[]={ 34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34 /*....................................*/ ,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34 }; bool res[5]; bool isMonotonic(int* nums, int numsSize){ /* ........... */ } int main(){ res[0]=isMonotonic(i,5); res[1]=isMonotonic(j,5); res[2]=isMonotonic(k,5); res[3]=isMonotonic(l,5); res[4]=isMonotonic(m,4); for(bool *it=res;it<res+5;++it){ if(*it) printf("true\n"); else printf("false\n"); } printf("%d\n",isMonotonic(test,1000)); } ``` I also introduce a shell script to compare between different optimization levels. ```bash riscv-none-elf-gcc mono.c -o mono0 -O0 riscv-none-elf-gcc mono.c -o mono1 -O1 riscv-none-elf-gcc mono.c -o mono2 -O2 riscv-none-elf-gcc mono.c -o mono3 -O3 riscv-none-elf-gcc mono.c -o monof -Ofast riscv-none-elf-gcc opt_mono.c -o opt_mono0 -O0 riscv-none-elf-gcc opt_mono.c -o opt_mono1 -O1 riscv-none-elf-gcc opt_mono.c -o opt_mono2 -O2 riscv-none-elf-gcc opt_mono.c -o opt_mono3 -O3 riscv-none-elf-gcc opt_mono.c -o opt_monof -Ofast ./rv32emu --stats mono0 ./rv32emu --stats mono1 ./rv32emu --stats mono2 ./rv32emu --stats mono3 ./rv32emu --stats monof echo "\n" ./rv32emu --stats opt_mono0 ./rv32emu --stats opt_mono1 ./rv32emu --stats opt_mono2 ./rv32emu --stats opt_mono3 ./rv32emu --stats opt_monof ``` Here's the result: | Optimization level | original cycles number | modified cycles number | | ------------------ | ---------------- | ---------------- | | -O0 | 46577 | 22099 | | -O1 | 9763 | 7737 | | -O2 | 7761 | 7743 | | -O3 | 7761 | 7674 | | -Ofast | 7761 | 7674 | The -O0 level of modified code outperforms the original one as I expected. However, to my suprise, as the optimization being aggressive, the different between two is not very much. ## Hand written assembly code To improve my code, I use `riscv-none-elf-objdump` to show the assembly code. ```c 000103b2 <isMonotonic>: 103b2: 00259793 slli a5,a1,0x2 103b6: 97aa add a5,a5,a0 103b8: 4118 lw a4,0(a0) 103ba: ffc7a783 lw a5,-4(a5) 103be: 15fd addi a1,a1,-1 103c0: 00e7de63 bge a5,a4,103dc <isMonotonic+0x2a> 103c4: 00259793 slli a5,a1,0x2 103c8: 953e add a0,a0,a5 103ca: c585 beqz a1,103f2 <isMonotonic+0x40> 103cc: 4118 lw a4,0(a0) 103ce: 15fd addi a1,a1,-1 103d0: 1571 addi a0,a0,-4 103d2: 411c lw a5,0(a0) 103d4: fee7dbe3 bge a5,a4,103ca <isMonotonic+0x18> 103d8: 4501 li a0,0 103da: 8082 ret 103dc: 4781 li a5,0 103de: 00f58a63 beq a1,a5,103f2 <isMonotonic+0x40> 103e2: 4114 lw a3,0(a0) 103e4: 0785 addi a5,a5,1 103e6: 0511 addi a0,a0,4 103e8: 4118 lw a4,0(a0) 103ea: fed747e3 blt a4,a3,103d8 <isMonotonic+0x26> 103ee: fef59ae3 bne a1,a5,103e2 <isMonotonic+0x30> 103f2: 4505 li a0,1 103f4: 8082 ret ``` Different from source code, the assembly code is split into two loop, one for increasing the other for decreasing. I use `riscv-none-elf-gcc -S opt_mono.c -Ofast -o hand_opt_monof.s` to get the assembly source code: ```c isMonotonic: slli a5,a1,2 add a5,a0,a5 lw a4,0(a0) lw a5,-4(a5) addi a1,a1,-1 ble a4,a5,.L12 slli a5,a1,2 add a0,a0,a5 .L4: beq a1,zero,.L8 lw a4,0(a0) addi a1,a1,-1 addi a0,a0,-4 lw a5,0(a0) ble a4,a5,.L4 # L4 loop for decreasing .L9: li a0,0 ret .L12: li a5,0 beq a1,a5,.L8 .L13: lw a3,0(a0) addi a5,a5,1 addi a0,a0,4 lw a4,0(a0) bgt a3,a4,.L9 bne a1,a5,.L13 # L13 loop for increasing .L8: li a0,1 ret ``` The compiler use `a1` and `a5` in two loops to determine whether to leave the loop. However, by using the address of `a0`, it is possible to delete `addi a1,a1,-1` and `addi a5,a5,1` two instructions. Modified version: ```c isMonotonic: slli a5,a1,2 add a5,a0,a5 mv a6,a5 # added instrucion addi a6,a6,-4 # added instrucion mv a7,a0 # added instrucion lw a4,0(a0) lw a5,-4(a5) addi a1,a1,-1 ble a4,a5,.L12 slli a5,a1,2 add a0,a0,a5 .L4: beq a0,a7,.L8 # modified instruction lw a4,0(a0) addi a0,a0,-4 lw a5,0(a0) ble a4,a5,.L4 .L9: li a0,0 ret .L12: li a5,0 beq a1,a5,.L8 .L13: lw a3,0(a0) addi a0,a0,4 lw a4,0(a0) bgt a3,a4,.L9 bne a0,a6,.L13 # modified instruction .L8: li a0,1 ret ``` After modified the assembly source code, I compile it into executable file. ```bash $ riscv-none-elf-gcc -c hand_opt_monof.s -o hand_opt_monof.o $ riscv-none-elf-gcc hand_opt_monof.o -o hand_opt_monof ``` To my suprise, the performance is the same. ```bash $ ./rv32emu --stats hand_opt_monof true true false true false 1 inferior exit code 0 CSR cycle count: 7674 ``` Therefore, I use `objdump` to see the assembly code of `main` function. ```c 000100b2 <main>: 100b2: 67f5 lui a5,0x1d 100b4: 01078793 addi a5,a5,16 # 1d010 <i> 100b8: 4394 lw a3,0(a5) 100ba: 4b90 lw a2,16(a5) 100bc: 1101 addi sp,sp,-32 100be: ce06 sw ra,28(sp) 100c0: cc22 sw s0,24(sp) 100c2: ca26 sw s1,20(sp) 100c4: c84a sw s2,16(sp) 100c6: c64e sw s3,12(sp) 100c8: 12d65e63 bge a2,a3,10204 <main+0x152> 100cc: 47d4 lw a3,12(a5) 100ce: 00c78713 addi a4,a5,12 100d2: 14c6ce63 blt a3,a2,1022e <main+0x17c> 100d6: 468d li a3,3 100d8: 430c lw a1,0(a4) 100da: 16fd addi a3,a3,-1 100dc: 1771 addi a4,a4,-4 100de: 4310 lw a2,0(a4) 100e0: 14b64763 blt a2,a1,1022e <main+0x17c> 100e4: faf5 bnez a3,100d8 <main+0x26> ................... ................... ``` Aside from not seeing any instrucion calling `isMonotonic`, I saw two instrucion which is familiar to me: ```c 100da: 16fd addi a3,a3,-1 100dc: 1771 addi a4,a4,-4 ``` It seems like that the aggressive optimization inline the `isMonotonic` in `main`. Therefore, simply modifying the `isMonotonic` in `hand_opt_monof.s` didn't affect the execution of program. In addition to `-Ofast`, `-O3` inlines the funciton, too. Finally, I decided to improve the code from `-O2` level. After modifying the `isMonotonic` in `-O2` level, I reduce the CSR cycles successfully. ```bash $ ./rv32emu --stats hand_opt_mono2 true true false true false 1 inferior exit code 0 CSR cycle count: 6721 ``` ### loop unrolling and strip mining Like hw1, I applied loop unrolling on this program. Unlike linked list in hw1, I think there are more room for optimization in this problem. For example, see the below code: ```c .L11: lw a3,0(a0) addi a0,a0,4 lw a4,0(a0) bgt a3,a4,.L9 bne a0,a6,.L11 # ``` There are two `lw` each iteration. However, the value of `a4` in this iteration is actually the same as the value of `a3` in next iteration. If we can unrolling the loop and keep the value in register, it could reduce a lot of instruction and obviate the heavy overhead of memory accessing. After unrolling four times and using strip mining, the modified code is shown as below: ```c isMonotonic: slli a5,a1,2 addi t0,a1,-1 andi t0,t0,3 # t0 = numsSize%4 add a5,a0,a5 mv a6,a5 # addi a6,a6,-4 # mv a7,a0 # lw a4,0(a0) lw a5,-4(a5) addi a1,a1,-1 ble a4,a5,.L12 slli a5,a1,2 add a0,a0,a5 .L20: # strip mining beqz t0,.L4 lw a4,0(a0) addi a0,a0,-4 lw a5,0(a0) bgt a4,a5,.L9 addi t0,t0,-1 j .L20 .L4: lw t0,0(a0) lw t1,-4(a0) bgt t0,t1,.L9 lw t2,-8(a0) bgt t1,t2,.L9 lw t3,-12(a0) bgt t2,t3,.L9 lw t4,-16(a0) bgt t3,t4,.L9 addi a0,a0,-16 beq a0,a7,.L8 j .L4 .L9: li a0,0 ret .L12: li a5,0 beq a1,a5,.L8 .L21: # strip mining increasing beqz t0,.L11 lw a3,0(a0) addi a0,a0,4 lw a4,0(a0) bgt a3,a4,.L9 addi t0,t0,-1 j .L21 .L11: lw t0,0(a0) lw t1,4(a0) bgt t0,t1,.L9 lw t2,8(a0) bgt t1,t2,.L9 lw t3,12(a0) bgt t2,t3,.L9 lw t4,16(a0) bgt t3,t4,.L9 addi a0,a0,16 bne a0,a6,.L11 # .L8: li a0,1 ret ``` As we can see, loop `.L11` only need 11 instrucions each iteration, including 5 `lw`. On the contrary, to iterate four times with original code, we need 5x4 = 20 instrucions, including 2x4 = 8 `lw`. The result CSR cycles is shown as below: ```bash $ ./rv32emu --stats hand_opt_mono2_unrolling true true false true false 1 inferior exit code 0 CSR cycle count: 5221 ``` To see only the effect on the long array, I measured the cycles number without calling `isMonotonic` on long array by modifying assembly source code file: ```c main: # ...... # ...... .L16: addi s0,s0,1 call puts bne s0,s1,.L13 lui a0,%hi(test) li a1,1000 addi a0,a0,%lo(test) # call isMonotonic # delete this line mv a1,a0 lui a0,%hi(.LC2) addi a0,a0,%lo(.LC2) call printf # ...... # ...... ``` Result (CSR cycles): | measured program | w/o long array | with long array | difference | | ---------------------------------------- | -------------- | --------------- | ---------- | | -O2 | 4036 | 7743 | 3707 | | -O2 with hand written code | 4015 | 6721 | 2706 | | -O2 with hand written code and unrolling | 4004 | 5221 | 1217 |