# Assignment2: RISC-V Toolchain
contributed by < [`kaeteyaruyo`](https://github.com/kaeteyaruyo) >
## Rewrite [Linear Search](https://hackmd.io/@Rwbh0z6QRXqUP7ovs7txiQ/Hkt8_swrw)
I rewrite [this topic](https://hackmd.io/@Rwbh0z6QRXqUP7ovs7txiQ/Hkt8_swrw) into rv32emu-compatible C implementation. Like this:
```c=
int _start()
{
int max = 0;
int arr[10] = {9, 28, 63, 88, 52, 9, 75, 6, 26, 7};
for (int i = 0; i < 10; i++)
{
if (arr[i] > max)
{
max = arr[i];
}
}
volatile char *tx = (volatile char *)0x40002000;
const char *str = "The largest number in the array is ";
while (*str)
{
*tx = *str;
str++;
}
volatile int *tx2 = (volatile int *)0x40000008;
*tx2 = max;
return 0;
}
```
And the execution result is:
```
The largest number in the array is 88
>>> Execution time: 179285 ns
>>> Instruction count: 197 (IPS=1098809)
>>> Jumps: 38 (19.29%) - 3 forwards, 35 backwards
>>> Branching T=37 (84.09%) F=7 (15.91%)
```
:::warning
I modify some source code in `emu-rv32i.c` so it can print out integer. But it seems not an appropriate modification. For more information, check out [this PR](https://github.com/sysprog21/rv32emu/pull/10).
:::
## Original Assembly Code
I adjust the output string part to match my rewritten version:
```asm
.data
array: .word 9, 28, 63, 88, 52, 9, 75, 6, 26, 7
range: .word 10
str: .string "The largest number in the array is "
.text
#s1 = Base address of current array
#s2 = i
#s3 = Range to compare (10)
#t0 = Flag
#t1 = Current Address
.global _start
_start:
la s1, array # s1 = base address of array
add s2, x0, x0 # i = 0
add t0, x0, x0 # define Flag
addi s3, s3, 10 # size = 10
jal ra, forloop # Jump-and-link to the 'forloop' label
li a7, 10 # Exit program
ecall
forloop:
addi s2, s2, 1 # i = i+1
lw t1,0(s1) # t1 = next value of the array
addi s1, s1, 4 # update new position of the next address
bge t0, t1, then # if Flag >= A[i] jump to 'then'
add t0, x0, t1 # else Flag = A[i]
beq s2, s3, printResult # if i = 10 => stop
j forloop # back to for loop
then:
beq s2, s3, printResult # if i = 10 => stop
j forloop # else continue run for loop
printResult:
la a0, str
li a7, 4 # print string
ecall
mv a0, t0
li a7, 1
ecall
ret
```
Since `emu-rv32i` cannot perform `ecall`, so I remove the output part and modify it to a `emu-rv32i` compatible version:
```asm
```
## `-O3` Optimized Assembly Code
* Complie: `riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -O3 -nostdlib linear-search.c -o linear-search`
* Disassemble: `riscv-none-embed-objdump -D linear-search >> linear-search.s`
* Assembly code with comment:
```asm=
00010054 <_start>:
10054: lui a5,0x10
10058: addi a5,a5,336 # a5 = &arr
1005c: lw a4,0(a5) # a4 = arr[0]
10060: lw t5,8(a5) # t5 = arr[2]
10064: lw t4,12(a5) # t4 = arr[3]
10068: lw t3,16(a5) # t3 = arr[4]
1006c: lw t1,20(a5) # t1 = arr[5]
10070: lw a7,24(a5) # a7 = arr[6]
10074: lw a6,28(a5) # a6 = arr[7]
10078: lw a0,32(a5) # a0 = arr[8]
1007c: lw a1,36(a5) # a1 = arr[9]
10080: not a2,a4 # a2 = ~arr[0] (for 1008c)
10084: lw a3,4(a5) # a3 = arr[1]
10088: addi sp,sp,-48 # ready for store something into stack
1008c: srai a5,a2,0x1f # a5 = (~arr[0])'s sign extension (for 100b0)
10090: sw t5,16(sp) # store arr[2] into stack
10094: sw t4,20(sp) # store arr[3] into stack
10098: sw t3,24(sp) # store arr[4] into stack
1009c: sw t1,28(sp) # store arr[5] into stack
100a0: sw a7,32(sp) # store arr[6] into stack
100a4: sw a6,36(sp) # store arr[7] into stack
100a8: sw a0,40(sp) # store arr[8] into stack
100ac: sw a1,44(sp) # store arr[9] into stack
100b0: and a4,a4,a5 # a4 = max(arr[0], max = 0) (mind blowing!!)
100b4: bge a3,a4,100bc # if (arr[1] >= a4) goto 100bc
100b8: mv a3,a4 # else a3 = a4
100bc: lw a5,16(sp) # a5 = arr[2]
# => a3 = max(arr[1], a4), and then load arr[2] to a5
100c0: bge a5,a3,100c8 # if (arr[2] >= a3) goto 100c8
100c4: mv a5,a3 # else a5 = a3
100c8: lw a4,20(sp) # a4 = arr[3]
# => a5 = max(arr[2], a3), and then load arr[3] to a4
100cc: bge a4,a5,100d4 # if (arr[3] >= a5) goto 100d4
100d0: mv a4,a5 # else a4 = a5
100d4: lw a5,24(sp) # a5 = arr[4]
# => a4 = max(arr[3], a5), and then load arr[4] to a5
100d8: bge a5,a4,100e0 # if (arr[4] >= a4) goto 100e0
100dc: mv a5,a4 # else a5 = a4
100e0: lw a4,28(sp) # a4 = arr[5]
# => a5 = max(arr[4], a4), and then load arr[5] to a4
100e4: bge a4,a5,100ec # if (arr[5] >= a5) goto 100ec
100e8: mv a4,a5 # else a4 = a5
100ec: lw a5,32(sp) # a5 = arr[6]
# => a4 = max(arr[5], a5), and then load arr[6] to a5
100f0: bge a5,a4,100f8 # if (arr[6] >= a4) goto 100f8
100f4: mv a5,a4 # else a5 = a4
100f8: lw a4,36(sp) # a4 = arr[7]
# => a5 = max(arr[6], a4), and then load arr[7] to a4
100fc: bge a4,a5,10104 # if (arr[7] >= a5) goto 10104
10100: mv a4,a5 # else a4 = a5
10104: lw a5,40(sp) # a5 = arr[8]
# => a4 = max(arr[7], a5), and then load arr[8] to a5
10108: bge a5,a4,10110 # if (arr[8] >= a4) goto 10110
1010c: mv a5,a4 # else a5 = a4
10110: lw a2,44(sp) # a2 = arr[9]
# => a5 = max(arr[8], a4), and then load arr[9] to a2
10114: bge a2,a5,1011c # if (arr[9] >= a5) goto 1011c
10118: mv a2,a5 # else a2 = a5
# => a2 = max(arr[9], a5), so a2 = the largest number in arr
1011c: lui a5,0x10
10120: addi a5,a5,376 # a5 = str
10124: li a4,84 # a4 = "T" (first char in str)
10128: lui a3,0x40002 # a3 = tx
1012c: sb a4,0(a3) # send a4 to tx
10130: addi a5,a5,1 # str++
10134: lbu a4,0(a5) # a4 = next char in str
10138: bnez a4,1012c # while (*str)
1013c: lui a5,0x40000 # a5 = tx2
10140: sw a2,8(a5) # send a2 to tx2
10144: li a0,0 # set return value
10148: addi sp,sp,48 # reset stack pointer
1014c: ret # return 0
```
* Data section:
```=
10150: 00000009 # 9
10154: 0000001c # 28
10158: 0000003f # 63
1015c: 00000058 # 88
10160: 00000034 # 52
10164: 00000009 # 9
10168: 0000004b # 75
1016c: 00000006 # 6
10170: 0000001a # 26
10174: 00000007 # 7
10178: 54 # 'T'
10179: 68 # 'h'
1017a: 65 # 'e'
1017b: 20 # ' '
1017c: 6c # 'l'
1017d: 61 # 'a'
1017e: 72 # 'r'
1017f: 67 # 'g'
10180: 65 # 'e'
10181: 73 # 's'
10182: 74 # 't'
10183: 20 # ' '
10184: 6e # 'n'
10185: 75 # 'u'
10186: 6d # 'm'
10187: 62 # 'b'
10188: 65 # 'e'
10189: 72 # 'r'
1018a: 20 # ' '
1018b: 69 # 'i'
1018c: 6e # 'n'
1018d: 20 # ' '
1018e: 74 # 't'
1018f: 68 # 'h'
10190: 65 # 'e'
10191: 20 # ' '
10192: 61 # 'a'
10193: 72 # 'r'
10194: 72 # 'r'
10195: 61 # 'a'
10196: 79 # 'y'
10197: 20 # ' '
10198: 69 # 'i'
10199: 73 # 's'
1019a: 20 # ' '
1019b: 00 # '\0'
```
:::info
The `objdump` interprets data section's bit patterns as instruction and did some unneccessarily padding. But they should be seen as numbers. Therefore I just modify the output manually to make the bit patterns look more meaningful.
:::
* Observation
* The loop was unrolled. With loop unrolled:
* Less branches needed. In this case, the times of branch taken dropped by half. The execution result is like (both without access to UART or any output):
```
$ emu-rv32i ../RISC-V/hw2/original
>>> Execution time: 14229 ns
>>> Instruction count: 73 (IPS=5130367)
>>> Jumps: 17 (23.29%) - 8 forwards, 9 backwards
>>> Branching T=7 (35.00%) F=13 (65.00%)
$ emu-rv32i ../RISC-V/hw2/linear-search
>>> Execution time: 11881 ns
>>> Instruction count: 61 (IPS=5134247)
>>> Jumps: 7 (11.48%) - 6 forwards, 1 backwards
>>> Branching T=6 (60.00%) F=4 (40.00%)
```
* Less branches also implies less branch prediction failure.
* Less registers were used. In original design, 5 registers were used in loop, but it only took 2 in this optimized verison.
* Overall, it took only 4 common registers (`a2` - `a5`) to complete every task in this code (besides `sp` and data loading part).
* The most incredible thing I found in this code snippet is that:
```c
10080: not a2,a4 -> a2 = (a4 >= 0 ? 1 : 0)
...
1008c: srai a5,a2,0x1f -> a5 = (a4 >= 0 ? 0x1111... : 0x0000...)
...
100b0: and a4,a4,a5 -> a4 = (a4 >= 0 ? a4 : 0)
```
and they become an `max(a[0], 0)`. It even use `max` with out assign it to a fixed register. That is what I've never seen before.
* To prevent data hazard, it also keep these lines from appearing one right after the other. The hazards are:
* RAW hazard on `a2` between `srai` and `not`
* RAW hazard on `a5` between `ans` and `srai`
* Question
* Why load array from data section than store it into stack? Why don't just load array from data section when needed?
* Why load `arr[1]` so late?
* In Ripes, `la` is translated to `auipc + addi`, but here it was translated to `lui + addi`. What's the difference?
* How does the compiler choose which register to use? Why use `aX` regitster instead of `tX` ?
2. Disassemble the ELF files generated by C compiler and compare the assembly listing between handwritten and compiler optimized one.
* You can modify the arguments specified by `RV32I_CFLAGS` to experiment. e.g. Change `-O3` (optimized for speed) to `-Os` (optimized for size).
* Describe your obserations and explain.
* Function calls and `ecall` shall be taken into considerations.
3. Check the results of `emu-rv32i` for the statistics of execution flow and explain the internal counters such as `true_counter`, `false_counter` (crucial for [branch prediction](https://en.wikipedia.org/wiki/Branch_predictor)), `jump_counter`, etc. Then, try to optimize the generated assembly. You shall read [RISC-V Assembly Programmer's Manual](https://github.com/riscv/riscv-asm-manual/blob/master/riscv-asm.md) carefully and modify Makefile in order to append new assembly targets.
* Would you encounter data hazards? Can you do [instruction scheduling](https://en.wikipedia.org/wiki/Instruction_scheduling) manually?
4. Can you improve the assembly code generated by gcc with optimizations? Or, can you write even faster/smaller programs in RISC-V assembly?
## Reference