Assignment2: RISC-V Toolchain

# Assignment2: RISC-V Toolchain contributed by < `OscarShiang` > ## Rewrite [Sqrt(x)](https://hackmd.io/@30vhEV7FQECcWeCF1eAN5A/ry44_j8rK) I implement my code based on [李政憲](https://hackmd.io/@30vhEV7FQECcWeCF1eAN5A/ry44_j8rK)'s one. Because I am curious about the way to compute the square root of a number without using brute force. I have rewritten the code to find `floor(sqrt(x))` with binary search approach. ```c #include <stdint.h> #include "math.h" #include "print.h" uint32_t mysqrt(uint32_t x) { uint64_t lo = 0, hi = x + 1u; while (lo < hi) { uint64_t mi = (hi - lo) / 2 + lo; if (mul(mi, mi) <= x) lo = mi + 1; else hi = mi; } return lo - 1; } void _start() { int a = 4096; print_str("sqrt("); print_int(a); print_str(") = "); print_int(mysqrt(a)); print_str("\n"); } ``` ### Execution Result ```shell $ emu-rv32i ./sqrt sqrt(4096) = 64 >>> Execution time: 4930683 ns >>> Instruction count: 78360 (IPS=15892321) >>> Jumps: 4700 (6.00%) - 110 forwards, 4590 backwards >>> Branching T=4592 (98.18%) F=85 (1.82%) ``` ## GNU tools ### Optimization The code is generated by `riscv-none-embed-gcc` with `-O3` optimization and disassembled bt `riscv-none-embed-objdump` ``` ./sqrt: file format elf32-littleriscv Disassembly of section .text: 00010054 <mysqrt>: 10054: fe010113 addi sp,sp,-32 10058: 01512223 sw s5,4(sp) 1005c: 00112e23 sw ra,28(sp) 10060: 00812c23 sw s0,24(sp) 10064: 00912a23 sw s1,20(sp) 10068: 01212823 sw s2,16(sp) 1006c: 01312623 sw s3,12(sp) 10070: 01412423 sw s4,8(sp) 10074: 01612023 sw s6,0(sp) 10078: 00150a93 addi s5,a0,1 1007c: 0a0a8e63 beqz s5,10138 <mysqrt+0xe4> 10080: 00050993 mv s3,a0 10084: 00000b13 li s6,0 10088: 00000a13 li s4,0 1008c: 00000913 li s2,0 10090: 414a87b3 sub a5,s5,s4 10094: 00fab733 sltu a4,s5,a5 10098: 412b0433 sub s0,s6,s2 1009c: 40e40433 sub s0,s0,a4 100a0: 01f41713 slli a4,s0,0x1f 100a4: 0017d793 srli a5,a5,0x1 100a8: 00f767b3 or a5,a4,a5 100ac: 014784b3 add s1,a5,s4 100b0: 00145413 srli s0,s0,0x1 100b4: 01240433 add s0,s0,s2 100b8: 00f4b7b3 sltu a5,s1,a5 100bc: 00048593 mv a1,s1 100c0: 00048513 mv a0,s1 100c4: 00878433 add s0,a5,s0 100c8: 50c000ef jal ra,105d4 <mul> 100cc: 04059863 bnez a1,1011c <mysqrt+0xc8> 100d0: 00148793 addi a5,s1,1 100d4: 0097b733 sltu a4,a5,s1 100d8: 04a9e263 bltu s3,a0,1011c <mysqrt+0xc8> 100dc: 00870933 add s2,a4,s0 100e0: 00078a13 mv s4,a5 100e4: fb6966e3 bltu s2,s6,10090 <mysqrt+0x3c> 100e8: 012b1463 bne s6,s2,100f0 <mysqrt+0x9c> 100ec: fb57e2e3 bltu a5,s5,10090 <mysqrt+0x3c> 100f0: fffa0513 addi a0,s4,-1 100f4: 01c12083 lw ra,28(sp) 100f8: 01812403 lw s0,24(sp) 100fc: 01412483 lw s1,20(sp) 10100: 01012903 lw s2,16(sp) 10104: 00c12983 lw s3,12(sp) 10108: 00812a03 lw s4,8(sp) 1010c: 00412a83 lw s5,4(sp) 10110: 00012b03 lw s6,0(sp) 10114: 02010113 addi sp,sp,32 10118: 00008067 ret 1011c: 00897863 bgeu s2,s0,1012c <mysqrt+0xd8> 10120: 00048a93 mv s5,s1 10124: 00040b13 mv s6,s0 10128: f69ff06f j 10090 <mysqrt+0x3c> 1012c: fd2412e3 bne s0,s2,100f0 <mysqrt+0x9c> 10130: fe9a68e3 bltu s4,s1,10120 <mysqrt+0xcc> 10134: fbdff06f j 100f0 <mysqrt+0x9c> 10138: fff00513 li a0,-1 1013c: fb9ff06f j 100f4 <mysqrt+0xa0> ``` Compare the code generated with `-O3` and `-O0`, I find that - the `-O3` code size is bigger than `-O0` one. Causing several loops to be unrolled. - `-O3` code deals with corner cases. (case `0` at `1007c` in above code) - the compiler with `-O3` try to detect the functions whose usage is similar to gcc builtin and replace them with gcc ones. - Functions like (`put_char`) may be inlined to eliminate the allocation and release of stack memory. - The constant data like the address of `tx` has been coded into `.text` section instead of storing in `.data` section. Accoring to the above differences, the code with `O3` optimization has less branches and smaller execution time. ```shell $ emu-rv32i ./sqrt sqrt(4096) = 64 >>> Execution time: 331105 ns >>> Instruction count: 2105 (IPS=6357499) >>> Jumps: 341 (16.20%) - 164 forwards, 177 backwards >>> Branching T=260 (71.04%) F=106 (28.96%) ``` ### Code size We can see that the size of text section ```shell # with -O0 flag $ riscv-none-embed-size ./sqrt text data bss dec hex filename 1192 4 0 1196 4ac ./sqrt # with -O3 flag $ riscv-none-embed-size ./sqrt text data bss dec hex filename 1666 0 0 1666 682 ./sqrt ``` ## Implement print function To print out the result of computing, we need to implement our own print function. `rv32emu` provides the address of UART output pin, which is at `0x40002000`. Base on this address, a simple print function , which can print integer in decimal form, can be implemented as below: ```cpp void print_int(int a) { bool printed = false; for (int i = 0; i < 10; i++) { if (a >= pow10[i]) { int rem; printed = true; put_char(div(a, pow10[i], &rem) + '0'); a = rem; } else if (printed) { put_char('0'); } } if (!printed) put_char('0'); } ``` The function `put_char` is a wrapper of the following statement: ```cpp static void put_char(char c) { *tx = c; } ``` The reason we named `put_char` is to avoid confliction to the glibc `putchar` signature. Since we need to print the number in decimal form, we must do division and take remainder. However we only use `rv32i` as the default architecture. I append `div` and `mul` function which implemented by loops to archieve it. The `div` function can return the remainder if we provide a address of a variable to the third parameter of that function. ```cpp int div(int dividend, int divisor, int *rem) { int quotient = 0; while (dividend >= divisor) { quotient++; dividend -= divisor; } if (rem) { *rem = dividend; } return quotient; } uint64_t mul(uint32_t a, uint32_t b) { uint64_t res = 0; while (b--) { res += a; } return res; } ``` As a result, we can have some output like ```shell sqrt(4096) = 64 ``` ## Reference ###### tags: `arch2021`