# Assignment2: RISC-V Toolchain
contributed by < `kdnvt` >
## Rewrite [Monotonic Array](https://leetcode.com/problems/monotonic-array/)
I choose the problem [Monotonic Array](https://leetcode.com/problems/monotonic-array/) from [張邦翰](https://hackmd.io/@0M4xubFKTNeuc7WRhLY9lA/rk5VinbXs) , because I think I can use a more efficient and concise method to solve this problem. In addition, I want to apply loop unrolling and strip mining on a length-known array, because the linked list problem I chose for [hw1](https://hackmd.io/@kdnvt/2022-arch-hw1) last time did not turn out to be very effective.
Original C code:
```c
bool isMonotonic(int* nums, int numsSize){
bool a=false,d=false;
int eq=0;
for(int i=0;i<numsSize-1;++i){
if(nums[i]<nums[i+1])
a=true;
if(nums[i]>nums[i+1])
d=true;
if(nums[i]==nums[i+1])
eq++;
}
return a^d|(eq==numsSize-1);
}
```
Originally, the `isMonotonic` checks whether the relation of `nums[i]` and `nums[i+1]` is `<`, `>` or `==` on each iteration. Moreover, it traverses the whole array everytime. If it can return false immediately after it determine this array is not a monotonic array, it will save many time.
Here's my optimizing C code:
```c
bool isMonotonic(int* nums, int numsSize){
int dir = 1;
int begin = 0;
int end = numsSize-1;
if(nums[0] > nums[numsSize-1]){
dir = -1;
begin = numsSize-1;
end = 0;
}
for(int i=begin;i!=end;i += dir){
if(nums[i] > nums[i+dir])
return false;
}
return true;
}
```
First, I determine whether this array is increasing or decreasing by checking the relation between first and last elements. Then, I can determine whether to compare elements from begin to end or otherwise. Therefore, I only have to compare once per iteration. The monotonic array which all elements are the same won't be a problem, too.
What's more, I will return immediately the moment I detect elements violate the monotonic rule.
To evaluate the performance when the array is very long, I add another test data which is an array with 1000 same value elements.
```c
#include <stdio.h>
#include <stdbool.h>
int i[]={1,2,5,5,6};
int j[]={600,240,43,25,1};
int k[]={1,3,2,4,5};
int l[]={34,34,34,34,34};
int m[]={1,0,-1,0};
int test[]={
34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34
/*....................................*/
,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34
};
bool res[5];
bool isMonotonic(int* nums, int numsSize){
/* ........... */
}
int main(){
res[0]=isMonotonic(i,5);
res[1]=isMonotonic(j,5);
res[2]=isMonotonic(k,5);
res[3]=isMonotonic(l,5);
res[4]=isMonotonic(m,4);
for(bool *it=res;it<res+5;++it){
if(*it)
printf("true\n");
else
printf("false\n");
}
printf("%d\n",isMonotonic(test,1000));
}
```
I also introduce a shell script to compare between different optimization levels.
```bash
riscv-none-elf-gcc mono.c -o mono0 -O0
riscv-none-elf-gcc mono.c -o mono1 -O1
riscv-none-elf-gcc mono.c -o mono2 -O2
riscv-none-elf-gcc mono.c -o mono3 -O3
riscv-none-elf-gcc mono.c -o monof -Ofast
riscv-none-elf-gcc opt_mono.c -o opt_mono0 -O0
riscv-none-elf-gcc opt_mono.c -o opt_mono1 -O1
riscv-none-elf-gcc opt_mono.c -o opt_mono2 -O2
riscv-none-elf-gcc opt_mono.c -o opt_mono3 -O3
riscv-none-elf-gcc opt_mono.c -o opt_monof -Ofast
./rv32emu --stats mono0
./rv32emu --stats mono1
./rv32emu --stats mono2
./rv32emu --stats mono3
./rv32emu --stats monof
echo "\n"
./rv32emu --stats opt_mono0
./rv32emu --stats opt_mono1
./rv32emu --stats opt_mono2
./rv32emu --stats opt_mono3
./rv32emu --stats opt_monof
```
Here's the result:
| Optimization level | original cycles number | modified cycles number |
| ------------------ | ---------------- | ---------------- |
| -O0 | 46577 | 22099 |
| -O1 | 9763 | 7737 |
| -O2 | 7761 | 7743 |
| -O3 | 7761 | 7674 |
| -Ofast | 7761 | 7674 |
The -O0 level of modified code outperforms the original one as I expected. However, to my suprise, as the optimization being aggressive, the different between two is not very much.
## Hand written assembly code
To improve my code, I use `riscv-none-elf-objdump` to show the assembly code.
```c
000103b2 <isMonotonic>:
103b2: 00259793 slli a5,a1,0x2
103b6: 97aa add a5,a5,a0
103b8: 4118 lw a4,0(a0)
103ba: ffc7a783 lw a5,-4(a5)
103be: 15fd addi a1,a1,-1
103c0: 00e7de63 bge a5,a4,103dc <isMonotonic+0x2a>
103c4: 00259793 slli a5,a1,0x2
103c8: 953e add a0,a0,a5
103ca: c585 beqz a1,103f2 <isMonotonic+0x40>
103cc: 4118 lw a4,0(a0)
103ce: 15fd addi a1,a1,-1
103d0: 1571 addi a0,a0,-4
103d2: 411c lw a5,0(a0)
103d4: fee7dbe3 bge a5,a4,103ca <isMonotonic+0x18>
103d8: 4501 li a0,0
103da: 8082 ret
103dc: 4781 li a5,0
103de: 00f58a63 beq a1,a5,103f2 <isMonotonic+0x40>
103e2: 4114 lw a3,0(a0)
103e4: 0785 addi a5,a5,1
103e6: 0511 addi a0,a0,4
103e8: 4118 lw a4,0(a0)
103ea: fed747e3 blt a4,a3,103d8 <isMonotonic+0x26>
103ee: fef59ae3 bne a1,a5,103e2 <isMonotonic+0x30>
103f2: 4505 li a0,1
103f4: 8082 ret
```
Different from source code, the assembly code is split into two loop, one for increasing the other for decreasing. I use `riscv-none-elf-gcc -S opt_mono.c -Ofast -o hand_opt_monof.s` to get the assembly source code:
```c
isMonotonic:
slli a5,a1,2
add a5,a0,a5
lw a4,0(a0)
lw a5,-4(a5)
addi a1,a1,-1
ble a4,a5,.L12
slli a5,a1,2
add a0,a0,a5
.L4:
beq a1,zero,.L8
lw a4,0(a0)
addi a1,a1,-1
addi a0,a0,-4
lw a5,0(a0)
ble a4,a5,.L4 # L4 loop for decreasing
.L9:
li a0,0
ret
.L12:
li a5,0
beq a1,a5,.L8
.L13:
lw a3,0(a0)
addi a5,a5,1
addi a0,a0,4
lw a4,0(a0)
bgt a3,a4,.L9
bne a1,a5,.L13 # L13 loop for increasing
.L8:
li a0,1
ret
```
The compiler use `a1` and `a5` in two loops to determine whether to leave the loop. However, by using the address of `a0`, it is possible to delete `addi a1,a1,-1` and `addi a5,a5,1` two instructions.
Modified version:
```c
isMonotonic:
slli a5,a1,2
add a5,a0,a5
mv a6,a5 # added instrucion
addi a6,a6,-4 # added instrucion
mv a7,a0 # added instrucion
lw a4,0(a0)
lw a5,-4(a5)
addi a1,a1,-1
ble a4,a5,.L12
slli a5,a1,2
add a0,a0,a5
.L4:
beq a0,a7,.L8 # modified instruction
lw a4,0(a0)
addi a0,a0,-4
lw a5,0(a0)
ble a4,a5,.L4
.L9:
li a0,0
ret
.L12:
li a5,0
beq a1,a5,.L8
.L13:
lw a3,0(a0)
addi a0,a0,4
lw a4,0(a0)
bgt a3,a4,.L9
bne a0,a6,.L13 # modified instruction
.L8:
li a0,1
ret
```
After modified the assembly source code, I compile it into executable file.
```bash
$ riscv-none-elf-gcc -c hand_opt_monof.s -o hand_opt_monof.o
$ riscv-none-elf-gcc hand_opt_monof.o -o hand_opt_monof
```
To my suprise, the performance is the same.
```bash
$ ./rv32emu --stats hand_opt_monof
true
true
false
true
false
1
inferior exit code 0
CSR cycle count: 7674
```
Therefore, I use `objdump` to see the assembly code of `main` function.
```c
000100b2 <main>:
100b2: 67f5 lui a5,0x1d
100b4: 01078793 addi a5,a5,16 # 1d010 <i>
100b8: 4394 lw a3,0(a5)
100ba: 4b90 lw a2,16(a5)
100bc: 1101 addi sp,sp,-32
100be: ce06 sw ra,28(sp)
100c0: cc22 sw s0,24(sp)
100c2: ca26 sw s1,20(sp)
100c4: c84a sw s2,16(sp)
100c6: c64e sw s3,12(sp)
100c8: 12d65e63 bge a2,a3,10204 <main+0x152>
100cc: 47d4 lw a3,12(a5)
100ce: 00c78713 addi a4,a5,12
100d2: 14c6ce63 blt a3,a2,1022e <main+0x17c>
100d6: 468d li a3,3
100d8: 430c lw a1,0(a4)
100da: 16fd addi a3,a3,-1
100dc: 1771 addi a4,a4,-4
100de: 4310 lw a2,0(a4)
100e0: 14b64763 blt a2,a1,1022e <main+0x17c>
100e4: faf5 bnez a3,100d8 <main+0x26>
...................
...................
```
Aside from not seeing any instrucion calling `isMonotonic`, I saw two instrucion which is familiar to me:
```c
100da: 16fd addi a3,a3,-1
100dc: 1771 addi a4,a4,-4
```
It seems like that the aggressive optimization inline the `isMonotonic` in `main`. Therefore, simply modifying the `isMonotonic` in `hand_opt_monof.s` didn't affect the execution of program. In addition to `-Ofast`, `-O3` inlines the funciton, too. Finally, I decided to improve the code from `-O2` level. After modifying the `isMonotonic` in `-O2` level, I reduce the CSR cycles successfully.
```bash
$ ./rv32emu --stats hand_opt_mono2
true
true
false
true
false
1
inferior exit code 0
CSR cycle count: 6721
```
### loop unrolling and strip mining
Like hw1, I applied loop unrolling on this program. Unlike linked list in hw1, I think there are more room for optimization in this problem.
For example, see the below code:
```c
.L11:
lw a3,0(a0)
addi a0,a0,4
lw a4,0(a0)
bgt a3,a4,.L9
bne a0,a6,.L11 #
```
There are two `lw` each iteration. However, the value of `a4` in this iteration is actually the same as the value of `a3` in next iteration. If we can unrolling the loop and keep the value in register, it could reduce a lot of instruction and obviate the heavy overhead of memory accessing.
After unrolling four times and using strip mining, the modified code is shown as below:
```c
isMonotonic:
slli a5,a1,2
addi t0,a1,-1
andi t0,t0,3 # t0 = numsSize%4
add a5,a0,a5
mv a6,a5 #
addi a6,a6,-4 #
mv a7,a0 #
lw a4,0(a0)
lw a5,-4(a5)
addi a1,a1,-1
ble a4,a5,.L12
slli a5,a1,2
add a0,a0,a5
.L20: # strip mining
beqz t0,.L4
lw a4,0(a0)
addi a0,a0,-4
lw a5,0(a0)
bgt a4,a5,.L9
addi t0,t0,-1
j .L20
.L4:
lw t0,0(a0)
lw t1,-4(a0)
bgt t0,t1,.L9
lw t2,-8(a0)
bgt t1,t2,.L9
lw t3,-12(a0)
bgt t2,t3,.L9
lw t4,-16(a0)
bgt t3,t4,.L9
addi a0,a0,-16
beq a0,a7,.L8
j .L4
.L9:
li a0,0
ret
.L12:
li a5,0
beq a1,a5,.L8
.L21: # strip mining increasing
beqz t0,.L11
lw a3,0(a0)
addi a0,a0,4
lw a4,0(a0)
bgt a3,a4,.L9
addi t0,t0,-1
j .L21
.L11:
lw t0,0(a0)
lw t1,4(a0)
bgt t0,t1,.L9
lw t2,8(a0)
bgt t1,t2,.L9
lw t3,12(a0)
bgt t2,t3,.L9
lw t4,16(a0)
bgt t3,t4,.L9
addi a0,a0,16
bne a0,a6,.L11 #
.L8:
li a0,1
ret
```
As we can see, loop `.L11` only need 11 instrucions each iteration, including 5 `lw`. On the contrary, to iterate four times with original code, we need 5x4 = 20 instrucions, including 2x4 = 8 `lw`.
The result CSR cycles is shown as below:
```bash
$ ./rv32emu --stats hand_opt_mono2_unrolling
true
true
false
true
false
1
inferior exit code 0
CSR cycle count: 5221
```
To see only the effect on the long array, I measured the cycles number without calling `isMonotonic` on long array by modifying assembly source code file:
```c
main:
# ......
# ......
.L16:
addi s0,s0,1
call puts
bne s0,s1,.L13
lui a0,%hi(test)
li a1,1000
addi a0,a0,%lo(test)
# call isMonotonic # delete this line
mv a1,a0
lui a0,%hi(.LC2)
addi a0,a0,%lo(.LC2)
call printf
# ......
# ......
```
Result (CSR cycles):
| measured program | w/o long array | with long array | difference |
| ---------------------------------------- | -------------- | --------------- | ---------- |
| -O2 | 4036 | 7743 | 3707 |
| -O2 with hand written code | 4015 | 6721 | 2706 |
| -O2 with hand written code and unrolling | 4004 | 5221 | 1217 |