Assignment2: RISC-V Toolchain

# Assignment2: RISC-V Toolchain contributed by < [`kaeteyaruyo`](https://github.com/kaeteyaruyo) > ## Rewrite [Linear Search](https://hackmd.io/@Rwbh0z6QRXqUP7ovs7txiQ/Hkt8_swrw) I rewrite [this topic](https://hackmd.io/@Rwbh0z6QRXqUP7ovs7txiQ/Hkt8_swrw) into rv32emu-compatible C implementation. Like this: ```c= int _start() { int max = 0; int arr[10] = {9, 28, 63, 88, 52, 9, 75, 6, 26, 7}; for (int i = 0; i < 10; i++) { if (arr[i] > max) { max = arr[i]; } } volatile char *tx = (volatile char *)0x40002000; const char *str = "The largest number in the array is "; while (*str) { *tx = *str; str++; } volatile int *tx2 = (volatile int *)0x40000008; *tx2 = max; return 0; } ``` And the execution result is: ``` The largest number in the array is 88 >>> Execution time: 179285 ns >>> Instruction count: 197 (IPS=1098809) >>> Jumps: 38 (19.29%) - 3 forwards, 35 backwards >>> Branching T=37 (84.09%) F=7 (15.91%) ``` :::warning I modify some source code in `emu-rv32i.c` so it can print out integer. But it seems not an appropriate modification. For more information, check out [this PR](https://github.com/sysprog21/rv32emu/pull/10). ::: ## Original Assembly Code I adjust the output string part to match my rewritten version: ```asm .data array: .word 9, 28, 63, 88, 52, 9, 75, 6, 26, 7 range: .word 10 str: .string "The largest number in the array is " .text #s1 = Base address of current array #s2 = i #s3 = Range to compare (10) #t0 = Flag #t1 = Current Address .global _start _start: la s1, array # s1 = base address of array add s2, x0, x0 # i = 0 add t0, x0, x0 # define Flag addi s3, s3, 10 # size = 10 jal ra, forloop # Jump-and-link to the 'forloop' label li a7, 10 # Exit program ecall forloop: addi s2, s2, 1 # i = i+1 lw t1,0(s1) # t1 = next value of the array addi s1, s1, 4 # update new position of the next address bge t0, t1, then # if Flag >= A[i] jump to 'then' add t0, x0, t1 # else Flag = A[i] beq s2, s3, printResult # if i = 10 => stop j forloop # back to for loop then: beq s2, s3, printResult # if i = 10 => stop j forloop # else continue run for loop printResult: la a0, str li a7, 4 # print string ecall mv a0, t0 li a7, 1 ecall ret ``` Since `emu-rv32i` cannot perform `ecall`, so I remove the output part and modify it to a `emu-rv32i` compatible version: ```asm ``` ## `-O3` Optimized Assembly Code * Complie: `riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -O3 -nostdlib linear-search.c -o linear-search` * Disassemble: `riscv-none-embed-objdump -D linear-search >> linear-search.s` * Assembly code with comment: ```asm= 00010054 <_start>: 10054: lui a5,0x10 10058: addi a5,a5,336 # a5 = &arr 1005c: lw a4,0(a5) # a4 = arr[0] 10060: lw t5,8(a5) # t5 = arr[2] 10064: lw t4,12(a5) # t4 = arr[3] 10068: lw t3,16(a5) # t3 = arr[4] 1006c: lw t1,20(a5) # t1 = arr[5] 10070: lw a7,24(a5) # a7 = arr[6] 10074: lw a6,28(a5) # a6 = arr[7] 10078: lw a0,32(a5) # a0 = arr[8] 1007c: lw a1,36(a5) # a1 = arr[9] 10080: not a2,a4 # a2 = ~arr[0] (for 1008c) 10084: lw a3,4(a5) # a3 = arr[1] 10088: addi sp,sp,-48 # ready for store something into stack 1008c: srai a5,a2,0x1f # a5 = (~arr[0])'s sign extension (for 100b0) 10090: sw t5,16(sp) # store arr[2] into stack 10094: sw t4,20(sp) # store arr[3] into stack 10098: sw t3,24(sp) # store arr[4] into stack 1009c: sw t1,28(sp) # store arr[5] into stack 100a0: sw a7,32(sp) # store arr[6] into stack 100a4: sw a6,36(sp) # store arr[7] into stack 100a8: sw a0,40(sp) # store arr[8] into stack 100ac: sw a1,44(sp) # store arr[9] into stack 100b0: and a4,a4,a5 # a4 = max(arr[0], max = 0) (mind blowing!!) 100b4: bge a3,a4,100bc # if (arr[1] >= a4) goto 100bc 100b8: mv a3,a4 # else a3 = a4 100bc: lw a5,16(sp) # a5 = arr[2] # => a3 = max(arr[1], a4), and then load arr[2] to a5 100c0: bge a5,a3,100c8 # if (arr[2] >= a3) goto 100c8 100c4: mv a5,a3 # else a5 = a3 100c8: lw a4,20(sp) # a4 = arr[3] # => a5 = max(arr[2], a3), and then load arr[3] to a4 100cc: bge a4,a5,100d4 # if (arr[3] >= a5) goto 100d4 100d0: mv a4,a5 # else a4 = a5 100d4: lw a5,24(sp) # a5 = arr[4] # => a4 = max(arr[3], a5), and then load arr[4] to a5 100d8: bge a5,a4,100e0 # if (arr[4] >= a4) goto 100e0 100dc: mv a5,a4 # else a5 = a4 100e0: lw a4,28(sp) # a4 = arr[5] # => a5 = max(arr[4], a4), and then load arr[5] to a4 100e4: bge a4,a5,100ec # if (arr[5] >= a5) goto 100ec 100e8: mv a4,a5 # else a4 = a5 100ec: lw a5,32(sp) # a5 = arr[6] # => a4 = max(arr[5], a5), and then load arr[6] to a5 100f0: bge a5,a4,100f8 # if (arr[6] >= a4) goto 100f8 100f4: mv a5,a4 # else a5 = a4 100f8: lw a4,36(sp) # a4 = arr[7] # => a5 = max(arr[6], a4), and then load arr[7] to a4 100fc: bge a4,a5,10104 # if (arr[7] >= a5) goto 10104 10100: mv a4,a5 # else a4 = a5 10104: lw a5,40(sp) # a5 = arr[8] # => a4 = max(arr[7], a5), and then load arr[8] to a5 10108: bge a5,a4,10110 # if (arr[8] >= a4) goto 10110 1010c: mv a5,a4 # else a5 = a4 10110: lw a2,44(sp) # a2 = arr[9] # => a5 = max(arr[8], a4), and then load arr[9] to a2 10114: bge a2,a5,1011c # if (arr[9] >= a5) goto 1011c 10118: mv a2,a5 # else a2 = a5 # => a2 = max(arr[9], a5), so a2 = the largest number in arr 1011c: lui a5,0x10 10120: addi a5,a5,376 # a5 = str 10124: li a4,84 # a4 = "T" （first char in str） 10128: lui a3,0x40002 # a3 = tx 1012c: sb a4,0(a3) # send a4 to tx 10130: addi a5,a5,1 # str++ 10134: lbu a4,0(a5) # a4 = next char in str 10138: bnez a4,1012c # while (*str) 1013c: lui a5,0x40000 # a5 = tx2 10140: sw a2,8(a5) # send a2 to tx2 10144: li a0,0 # set return value 10148: addi sp,sp,48 # reset stack pointer 1014c: ret # return 0 ``` * Data section: ```= 10150: 00000009 # 9 10154: 0000001c # 28 10158: 0000003f # 63 1015c: 00000058 # 88 10160: 00000034 # 52 10164: 00000009 # 9 10168: 0000004b # 75 1016c: 00000006 # 6 10170: 0000001a # 26 10174: 00000007 # 7 10178: 54 # 'T' 10179: 68 # 'h' 1017a: 65 # 'e' 1017b: 20 # ' ' 1017c: 6c # 'l' 1017d: 61 # 'a' 1017e: 72 # 'r' 1017f: 67 # 'g' 10180: 65 # 'e' 10181: 73 # 's' 10182: 74 # 't' 10183: 20 # ' ' 10184: 6e # 'n' 10185: 75 # 'u' 10186: 6d # 'm' 10187: 62 # 'b' 10188: 65 # 'e' 10189: 72 # 'r' 1018a: 20 # ' ' 1018b: 69 # 'i' 1018c: 6e # 'n' 1018d: 20 # ' ' 1018e: 74 # 't' 1018f: 68 # 'h' 10190: 65 # 'e' 10191: 20 # ' ' 10192: 61 # 'a' 10193: 72 # 'r' 10194: 72 # 'r' 10195: 61 # 'a' 10196: 79 # 'y' 10197: 20 # ' ' 10198: 69 # 'i' 10199: 73 # 's' 1019a: 20 # ' ' 1019b: 00 # '\0' ``` :::info The `objdump` interprets data section's bit patterns as instruction and did some unneccessarily padding. But they should be seen as numbers. Therefore I just modify the output manually to make the bit patterns look more meaningful. ::: * Observation * The loop was unrolled. With loop unrolled: * Less branches needed. In this case, the times of branch taken dropped by half. The execution result is like (both without access to UART or any output): ``` $ emu-rv32i ../RISC-V/hw2/original >>> Execution time: 14229 ns >>> Instruction count: 73 (IPS=5130367) >>> Jumps: 17 (23.29%) - 8 forwards, 9 backwards >>> Branching T=7 (35.00%) F=13 (65.00%) $ emu-rv32i ../RISC-V/hw2/linear-search >>> Execution time: 11881 ns >>> Instruction count: 61 (IPS=5134247) >>> Jumps: 7 (11.48%) - 6 forwards, 1 backwards >>> Branching T=6 (60.00%) F=4 (40.00%) ``` * Less branches also implies less branch prediction failure. * Less registers were used. In original design, 5 registers were used in loop, but it only took 2 in this optimized verison. * Overall, it took only 4 common registers (`a2` - `a5`) to complete every task in this code (besides `sp` and data loading part). * The most incredible thing I found in this code snippet is that: ```c 10080: not a2,a4 -> a2 = (a4 >= 0 ? 1 : 0) ... 1008c: srai a5,a2,0x1f -> a5 = (a4 >= 0 ? 0x1111... : 0x0000...) ... 100b0: and a4,a4,a5 -> a4 = (a4 >= 0 ? a4 : 0) ``` and they become an `max(a[0], 0)`. It even use `max` with out assign it to a fixed register. That is what I've never seen before. * To prevent data hazard, it also keep these lines from appearing one right after the other. The hazards are: * RAW hazard on `a2` between `srai` and `not` * RAW hazard on `a5` between `ans` and `srai` * Question * Why load array from data section than store it into stack? Why don't just load array from data section when needed? * Why load `arr[1]` so late? * In Ripes, `la` is translated to `auipc + addi`, but here it was translated to `lui + addi`. What's the difference? * How does the compiler choose which register to use? Why use `aX` regitster instead of `tX` ? 2. Disassemble the ELF files generated by C compiler and compare the assembly listing between handwritten and compiler optimized one. * You can modify the arguments specified by `RV32I_CFLAGS` to experiment. e.g. Change `-O3` (optimized for speed) to `-Os` (optimized for size). * Describe your obserations and explain. * Function calls and `ecall` shall be taken into considerations. 3. Check the results of `emu-rv32i` for the statistics of execution flow and explain the internal counters such as `true_counter`, `false_counter` (crucial for [branch prediction](https://en.wikipedia.org/wiki/Branch_predictor)), `jump_counter`, etc. Then, try to optimize the generated assembly. You shall read [RISC-V Assembly Programmer's Manual](https://github.com/riscv/riscv-asm-manual/blob/master/riscv-asm.md) carefully and modify Makefile in order to append new assembly targets. * Would you encounter data hazards? Can you do [instruction scheduling](https://en.wikipedia.org/wiki/Instruction_scheduling) manually? 4. Can you improve the assembly code generated by gcc with optimizations? Or, can you write even faster/smaller programs in RISC-V assembly? ## Reference