Lab2: RISC-V RV32I[MA] emulator with ELF support

# Lab2: RISC-V RV32I[MA] emulator with ELF support ###### tags: `2020-Computer-Architecture` contributed by < `uduru0522` > ---- :::success ### Configuration of PATH + Required after every boot. ```shell cd $HOME/riscv-none-embed-gcc echo "export PATH=`pwd`/8.2.0-3.1/bin:$PATH" > setenv cd $HOME source riscv-none-embed-gcc/setenv riscv-none-embed-gcc -v ``` > The following text should be found at the last output line: ```shell gcc version 8.2.0 (xPack GNU RISC-V Embedded GCC, 64-bit) ``` ### Compiling + Compile with command: ```shell riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib ``` > Combined with other compile options, e.g, `-O3`. - `-march` **---** Specify RISC-V ISA strings. + Look [Here](https://en.wikichip.org/wiki/risc-v/standard_extensions) for extension string orders. + `rv32i` used due to the emulator we're using. - `-mabi` **---** Specify RISC-V ABI strings - Specify bit length of `int`, `long` and pointers. + `ilp32` -- `int`, `long`, and pointers are 32-bit. </br>               `char` are 8-bit, short` are 16-bit long, `long long` are 64-bit. + `lp64` -- `long` and pointers are 64-bit. Others are the same as `ilp32` - Specifing which floating point types are passed into registers + `(empty)` -- No floating point numbers are passed into. + `f` -- **F Extension Required**. 32-bit or shorter floating point values are passed into registers + `d` -- **D Extension Required**. 64-bit or shorter floating point values are passed into registers ::: ---- # Bubble Sort > From < `Wder4` > ## Rewrited C Code ```c= void print_arr(int size, int* arr, volatile char* tx){ for(int i = 0; i < size; ++i){ *tx = arr[i] + '0'; *tx = ' '; } *tx = '\n'; } void bubble_sort(int size, int *arr){ for(int i = 0; i < size; ++i){ for(int j = i + 1; j < size; ++j){ if(arr[i] > arr[j]){ arr[i] ^= arr[j]; arr[j] ^= arr[i]; arr[i] ^= arr[j]; } } } } void _start(){ volatile char* tx = (volatile char*) 0x40002000; static int arr[10] = {2, 8, 5, 1, 0, 6, 9, 7, 4, 9}; int arr_size = 10; print_arr(10, arr, tx) bubble_sort(10, arr); print_arr(10, arr, tx); } ``` ## Testing w/ `-O0` ~ `-O3` And `-Os`, Printing Out Results ```shell riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O0 bubble-sort-riscv.c -o test-O0 ./emu-rv32i test-O0 2 8 5 1 0 6 9 7 4 9 0 1 2 4 5 6 7 8 9 9 >>> Execution time: 489719 ns >>> Instruction count: 2121 (IPS=4331055) >>> Jumps: 124 (5.85%) - 45 forwards, 79 backwards >>> Branching T=104 (78.20%) F=29 (21.80%) riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O1 bubble-sort-riscv.c -o test-O1 ./emu-rv32i test-O1 2 8 5 1 0 6 9 7 4 9 0 1 2 4 5 6 7 8 9 9 >>> Execution time: 88914 ns >>> Instruction count: 588 (IPS=6613131) >>> Jumps: 90 (15.31%) - 14 forwards, 76 backwards >>> Branching T=57 (46.34%) F=66 (53.66%) riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O2 bubble-sort-riscv.c -o test-O2 ./emu-rv32i test-O2 2 8 5 1 0 6 9 7 4 9 0 1 2 4 5 6 7 8 9 9 >>> Execution time: 146506 ns >>> Instruction count: 544 (IPS=3713158) >>> Jumps: 94 (17.28%) - 30 forwards, 64 backwards >>> Branching T=91 (75.21%) F=30 (24.79%) riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O3 bubble-sort-riscv.c -o test-O3 ./emu-rv32i test-O3 2 8 5 1 0 6 9 7 4 9 0 1 2 4 5 6 7 8 9 9 >>> Execution time: 138569 ns >>> Instruction count: 363 (IPS=2619633) >>> Jumps: 46 (12.67%) - 37 forwards, 9 backwards >>> Branching T=45 (45.45%) F=54 (54.55%) riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -Os bubble-sort-riscv.c -o test-Os ./emu-rv32i test-Os 2 8 5 1 0 6 9 7 4 9 0 1 2 4 5 6 7 8 9 9 >>> Execution time: 105596 ns >>> Instruction count: 747 (IPS=7074131) >>> Jumps: 185 (24.77%) - 106 forwards, 79 backwards >>> Branching T=104 (78.20%) F=29 (21.80%) ``` Within several attempts, the optimize option `-O1` mostly yields the best performance in terms of execution time, and `-Os` getting the highest IPS. ## Not Printing Out The Results Thinking of that `printf()` often is the bottleneck of performance in C-codes, I tried comment out all printing-related functions and tested again: ```shell riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O0 bubble-sort-riscv.c -o test-O0 ./emu-rv32i test-O0 >>> Execution time: 54608 ns >>> Instruction count: 1695 (IPS=31039408) >>> Jumps: 98 (5.78%) - 41 forwards, 57 backwards >>> Branching T=84 (75.68%) F=27 (24.32%) riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O1 bubble-sort-riscv.c -o test-O1 ./emu-rv32i test-O1 >>> Execution time: 13542 ns >>> Instruction count: 422 (IPS=31162309) >>> Jumps: 68 (16.11%) - 12 forwards, 56 backwards >>> Branching T=39 (38.61%) F=62 (61.39%) riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O2 bubble-sort-riscv.c -o test-O2 ./emu-rv32i test-O2 >>> Execution time: 11959 ns >>> Instruction count: 384 (IPS=32109708) >>> Jumps: 75 (19.53%) - 29 forwards, 46 backwards >>> Branching T=73 (72.28%) F=28 (27.72%) riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O3 bubble-sort-riscv.c -o test-O3 ./emu-rv32i test-O3 >>> Execution time: 9194 ns >>> Instruction count: 307 (IPS=33391342) >>> Jumps: 46 (14.98%) - 37 forwards, 9 backwards >>> Branching T=45 (45.45%) F=54 (54.55%) riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -Os bubble-sort-riscv.c -o test-Os ./emu-rv32i test-Os >>> Execution time: 15149 ns >>> Instruction count: 521 (IPS=34391709) >>> Jumps: 141 (27.06%) - 84 forwards, 57 backwards >>> Branching T=84 (75.68%) F=27 (24.32%) ``` This time, the optimization clearly fastens the execution as the optimize level goes up. However, within ~20 testings, sometimes the speed of `-O3` gets very high, almost getting close to the `-O0` speed. I assume it is the result of prediction failure, since only the branching of `-O3` has almost a 50-50 probability, which is bad for predicting. ## Dumping Object Files And Improving Generated Code By comparing the `-O2` and `-O3` object file: :::spoiler `-O2` objdump ```nasm Disassembly of section .text: 00010074 <bubble_sort>: 10074: 04a05a63 blez a0,100c8 <bubble_sort+0x54> 10078: 00251793 slli a5,a0,0x2 1007c: 00458513 addi a0,a1,4 10080: 00f585b3 add a1,a1,a5 10084: 04b50263 beq a0,a1,100c8 <bubble_sort+0x54> 10088: ffc52703 lw a4,-4(a0) 1008c: 00050793 mv a5,a0 10090: 0007a603 lw a2,0(a5) 10094: 00c746b3 xor a3,a4,a2 10098: 02e65063 bge a2,a4,100b8 <bubble_sort+0x44> 1009c: fed52e23 sw a3,-4(a0) 100a0: 0007a703 lw a4,0(a5) 100a4: 00e6c6b3 xor a3,a3,a4 100a8: 00d7a023 sw a3,0(a5) 100ac: ffc52703 lw a4,-4(a0) 100b0: 00d74733 xor a4,a4,a3 100b4: fee52e23 sw a4,-4(a0) 100b8: 00478793 addi a5,a5,4 100bc: fcb79ae3 bne a5,a1,10090 <bubble_sort+0x1c> 100c0: 00450513 addi a0,a0,4 100c4: fcb512e3 bne a0,a1,10088 <bubble_sort+0x14> 100c8: 00008067 ret 000100cc <_start>: 100cc: 000115b7 lui a1,0x11 100d0: 0dc58593 addi a1,a1,220 # 110dc <__DATA_BEGIN__> 100d4: 00a00513 li a0,10 100d8: f9dff06f j 10074 <bubble_sort> ``` ::: :::spoiler `-O3` objdump ```nasm 00010074 <bubble_sort>: 10074: 06a05c63 blez a0,100ec <bubble_sort+0x78> 10078: 00100793 li a5,1 1007c: 06f50863 beq a0,a5,100ec <bubble_sort+0x78> 10080: 00458e13 addi t3,a1,4 10084: 00000893 li a7,0 10088: 00100313 li t1,1 1008c: 00289893 slli a7,a7,0x2 10090: 011588b3 add a7,a1,a7 10094: 0008a703 lw a4,0(a7) 10098: 000e0793 mv a5,t3 1009c: 00030613 mv a2,t1 100a0: 0007a803 lw a6,0(a5) 100a4: 00160613 addi a2,a2,1 100a8: 00e846b3 xor a3,a6,a4 100ac: 02e85063 bge a6,a4,100cc <bubble_sort+0x58> 100b0: 00d8a023 sw a3,0(a7) 100b4: 0007a703 lw a4,0(a5) 100b8: 00e6c6b3 xor a3,a3,a4 100bc: 00d7a023 sw a3,0(a5) 100c0: 0008a703 lw a4,0(a7) 100c4: 00e6c733 xor a4,a3,a4 100c8: 00e8a023 sw a4,0(a7) 100cc: 00478793 addi a5,a5,4 100d0: fca648e3 blt a2,a0,100a0 <bubble_sort+0x2c> 100d4: 00130793 addi a5,t1,1 100d8: 004e0e13 addi t3,t3,4 100dc: 00030893 mv a7,t1 100e0: 00f50663 beq a0,a5,100ec <bubble_sort+0x78> 100e4: 00078313 mv t1,a5 100e8: fa5ff06f j 1008c <bubble_sort+0x18> 100ec: 00008067 ret 000100f0 <_start>: 100f0: 00000613 li a2,0 100f4: 000117b7 lui a5,0x11 100f8: 00a00693 li a3,10 100fc: 00160613 addi a2,a2,1 10100: 23078793 addi a5,a5,560 # 11230 <__DATA_BEGIN__> 10104: 00900f13 li t5,9 10108: 00800e93 li t4,8 1010c: 00700e13 li t3,7 10110: 00600313 li t1,6 10114: 00500893 li a7,5 10118: 00400813 li a6,4 1011c: 00300513 li a0,3 10120: 00200593 li a1,2 10124: 00900293 li t0,9 10128: 10d60263 beq a2,a3,1022c <_start+0x13c> 1012c: 0007a703 lw a4,0(a5) 10130: 0047af83 lw t6,4(a5) 10134: 00efd863 bge t6,a4,10144 <_start+0x54> 10138: 00e7a223 sw a4,4(a5) 1013c: 01f7a023 sw t6,0(a5) 10140: 000f8713 mv a4,t6 10144: 0ad58e63 beq a1,a3,10200 <_start+0x110> 10148: 0087af83 lw t6,8(a5) 1014c: 00efd863 bge t6,a4,1015c <_start+0x6c> 10150: 00e7a423 sw a4,8(a5) 10154: 01f7a023 sw t6,0(a5) 10158: 000f8713 mv a4,t6 1015c: 0ad50263 beq a0,a3,10200 <_start+0x110> 10160: 00c7af83 lw t6,12(a5) 10164: 00efd863 bge t6,a4,10174 <_start+0x84> 10168: 00e7a623 sw a4,12(a5) 1016c: 01f7a023 sw t6,0(a5) 10170: 000f8713 mv a4,t6 10174: 08d80663 beq a6,a3,10200 <_start+0x110> 10178: 0107af83 lw t6,16(a5) 1017c: 00efd863 bge t6,a4,1018c <_start+0x9c> 10180: 00e7a823 sw a4,16(a5) 10184: 01f7a023 sw t6,0(a5) 10188: 000f8713 mv a4,t6 1018c: 06d88a63 beq a7,a3,10200 <_start+0x110> 10190: 0147af83 lw t6,20(a5) 10194: 00efd863 bge t6,a4,101a4 <_start+0xb4> 10198: 00e7aa23 sw a4,20(a5) 1019c: 01f7a023 sw t6,0(a5) 101a0: 000f8713 mv a4,t6 101a4: 04d30e63 beq t1,a3,10200 <_start+0x110> 101a8: 0187af83 lw t6,24(a5) 101ac: 00efd863 bge t6,a4,101bc <_start+0xcc> 101b0: 00e7ac23 sw a4,24(a5) 101b4: 01f7a023 sw t6,0(a5) 101b8: 000f8713 mv a4,t6 101bc: 04de0263 beq t3,a3,10200 <_start+0x110> 101c0: 01c7af83 lw t6,28(a5) 101c4: 00efd863 bge t6,a4,101d4 <_start+0xe4> 101c8: 00e7ae23 sw a4,28(a5) 101cc: 01f7a023 sw t6,0(a5) 101d0: 000f8713 mv a4,t6 101d4: 02de8663 beq t4,a3,10200 <_start+0x110> 101d8: 0207af83 lw t6,32(a5) 101dc: 00efd863 bge t6,a4,101ec <_start+0xfc> 101e0: 02e7a023 sw a4,32(a5) 101e4: 01f7a023 sw t6,0(a5) 101e8: 000f8713 mv a4,t6 101ec: 005f1a63 bne t5,t0,10200 <_start+0x110> 101f0: 0247af83 lw t6,36(a5) 101f4: 00efd663 bge t6,a4,10200 <_start+0x110> 101f8: 02e7a223 sw a4,36(a5) 101fc: 01f7a023 sw t6,0(a5) 10200: 00160613 addi a2,a2,1 10204: 00158593 addi a1,a1,1 10208: 00150513 addi a0,a0,1 1020c: 00180813 addi a6,a6,1 10210: 00188893 addi a7,a7,1 10214: 00130313 addi t1,t1,1 10218: 001e0e13 addi t3,t3,1 1021c: 001e8e93 addi t4,t4,1 10220: 001f0f13 addi t5,t5,1 10224: 00478793 addi a5,a5,4 10228: f0d612e3 bne a2,a3,1012c <_start+0x3c> 1022c: 00008067 ret ``` ::: I would describe `-O2` optimization as "making thinks short and clean", and `-O3` as "Anarchy", inlining the entire `bubble_sort()` into `_start`. Running with Ripes, it was clear that the `_bubble_sort` section is not executed entirely. Which a eazy way to optimize for file size is, simply delete the section, saving filesize of around 25 instructions. ### Object File Information ```shell # Display object file information riscv-none-embed-size test-O0 text data bss dec hex filename 412 40 0 452 1c4 test-O0 riscv-none-embed-size test-O1 text data bss dec hex filename 132 40 0 172 ac test-O1 riscv-none-embed-size test-O2 text data bss dec hex filename 104 40 0 144 90 test-O2 riscv-none-embed-size test-O3 text data bss dec hex filename 444 40 0 484 1e4 test-O3 riscv-none-embed-size test-Os text data bss dec hex filename 108 40 0 148 94 test-Os ``` The huge filesize of `-O3` optimized object file seems to be blamed for the aggressive inlining. ---- # Two Sum > From < `joe-U16` > ## C Code Implementation (Original) ```c= int* twoSum(int* nums, int numsSize, int target, int* returnSize){ for (int i=0; i < numsSize; i++) { for (int j=i+1; j < numsSize; j++) { if ((nums[i]+nums[j]) == target) { *returnSize = 2; int *arr = malloc(sizeof(int) * 2); arr[0] = i; arr[1] = j; return arr; } } } return 0; } ``` ## Rewrited C Code ```c= void two_sum(int *arr, int size, int target, int* return_arr){ for(int i = 0; i < size; ++i){ for(int j = i + 1; j < size; ++j){ if(arr[i] + arr[j] == target){ return_arr[0] = i; return_arr[1] = j; return; } } } return_arr[0] = return_arr[1] = -1; return; } void _start(){ volatile char* tx = (volatile char*) 0x40002000; static int arr[] = {1, 2, 3, 5, 7, 11, 16, 21, 26, 33}; int arr_size = 10; int return_arr[2]; two_sum(arr, arr_size, 33, return_arr); *tx = return_arr[0] + '0'; *tx = ' '; *tx = return_arr[1] + '0'; } ``` ## Testing w/ `-O0` ~ `-O3` And `-Os`s ```shell $rv32igcc -O0 $FILENAME -o test-O0 ./emu-rv32i test-O0 4 8 >>> Execution time: 42385 ns >>> Instruction count: 758 (IPS=17883685) >>> Jumps: 82 (10.82%) - 41 forwards, 41 backwards >>> Branching T=72 (93.51%) F=5 (6.49%) $rv-size test-O0 text data bss dec hex filename 368 40 0 408 198 test-O0 $rv32igcc -O1 $FILENAME -o test-O1 ./emu-rv32i test-O1 4 8 >>> Execution time: 15339 ns >>> Instruction count: 262 (IPS=17080644) >>> Jumps: 39 (14.89%) - 5 forwards, 34 backwards >>> Branching T=32 (43.84%) F=41 (56.16%) $rv-size test-O1 text data bss dec hex filename 200 40 0 240 f0 test-O1 $rv32igcc -O2 $FILENAME -o test-O2 ./emu-rv32i test-O2 4 8 >>> Execution time: 15627 ns >>> Instruction count: 251 (IPS=16061944) >>> Jumps: 40 (15.94%) - 6 forwards, 34 backwards >>> Branching T=34 (47.89%) F=37 (52.11%) $rv-size test-O2 text data bss dec hex filename 304 40 0 344 158 test-O2 $rv32igcc -O3 $FILENAME -o test-O3 ./emu-rv32i test-O3 4 8 >>> Execution time: 13062 ns >>> Instruction count: 211 (IPS=16153728) >>> Jumps: 10 (4.74%) - 5 forwards, 5 backwards >>> Branching T=5 (7.46%) F=62 (92.54%) $rv-size test-O3 text data bss dec hex filename 412 40 0 452 1c4 test-O3 $rv32igcc -Os $FILENAME -o test-Os ./emu-rv32i test-Os 4 8 >>> Execution time: 48654 ns >>> Instruction count: 363 (IPS=7460845) >>> Jumps: 112 (30.85%) - 73 forwards, 39 backwards >>> Branching T=72 (93.51%) F=5 (6.49%) $rv-size test-Os text data bss dec hex filename 184 40 0 224 e0 test-Os ``` This time, with much less printing action, I skipped the elimination of printing related codes. Also the options are also functioning correctly, with `-O3` being the fast and `-Os` having the least instructions. ## Object File Dumps :::spoiler `-O3` Dump ```nasm uduru@uduru-GL553VD:~/rv32emu$ riscv-none-embed-objdump -d test-O3 test-O3: file format elf32-littleriscv Disassembly of section .text: 00010074 <two_sum>: 10074: 04b05a63 blez a1,100c8 <two_sum+0x54> 10078: 00000e13 li t3,0 1007c: 001e0313 addi t1,t3,1 10080: 04658463 beq a1,t1,100c8 <two_sum+0x54> 10084: 00052883 lw a7,0(a0) 10088: 00452783 lw a5,4(a0) 1008c: 00f887b3 add a5,a7,a5 10090: 04f60463 beq a2,a5,100d8 <two_sum+0x64> 10094: 00050813 mv a6,a0 10098: 00030713 mv a4,t1 1009c: 0140006f j 100b0 <two_sum+0x3c> 100a0: 00882783 lw a5,8(a6) 100a4: 00480813 addi a6,a6,4 100a8: 00f887b3 add a5,a7,a5 100ac: 02c78863 beq a5,a2,100dc <two_sum+0x68> 100b0: 00170713 addi a4,a4,1 100b4: fee596e3 bne a1,a4,100a0 <two_sum+0x2c> 100b8: 00030e13 mv t3,t1 100bc: 001e0313 addi t1,t3,1 100c0: 00450513 addi a0,a0,4 100c4: fc6590e3 bne a1,t1,10084 <two_sum+0x10> 100c8: fff00793 li a5,-1 100cc: 00f6a223 sw a5,4(a3) 100d0: 00f6a023 sw a5,0(a3) 100d4: 00008067 ret 100d8: 00030713 mv a4,t1 100dc: 01c6a023 sw t3,0(a3) 100e0: 00e6a223 sw a4,4(a3) 100e4: 00008067 ret 000100e8 <_start>: 100e8: 00011737 lui a4,0x11 100ec: 21070713 addi a4,a4,528 # 11210 <__DATA_BEGIN__> 100f0: 00200693 li a3,2 100f4: 02100593 li a1,33 100f8: 00a00513 li a0,10 100fc: 00072603 lw a2,0(a4) 10100: 00472783 lw a5,4(a4) 10104: ffe68813 addi a6,a3,-2 10108: 00f607b3 add a5,a2,a5 1010c: 0eb78263 beq a5,a1,101f0 <_start+0x108> 10110: 00068793 mv a5,a3 10114: 0ea68863 beq a3,a0,10204 <_start+0x11c> 10118: 00872883 lw a7,8(a4) 1011c: 011608b3 add a7,a2,a7 10120: 08b88e63 beq a7,a1,101bc <_start+0xd4> 10124: 00168313 addi t1,a3,1 10128: 00030793 mv a5,t1 1012c: 0aa30c63 beq t1,a0,101e4 <_start+0xfc> 10130: 00c72883 lw a7,12(a4) 10134: 011608b3 add a7,a2,a7 10138: 08b88263 beq a7,a1,101bc <_start+0xd4> 1013c: 00268793 addi a5,a3,2 10140: 0aa78263 beq a5,a0,101e4 <_start+0xfc> 10144: 01072883 lw a7,16(a4) 10148: 011608b3 add a7,a2,a7 1014c: 06b88863 beq a7,a1,101bc <_start+0xd4> 10150: 00368793 addi a5,a3,3 10154: 08a78863 beq a5,a0,101e4 <_start+0xfc> 10158: 01472883 lw a7,20(a4) 1015c: 011608b3 add a7,a2,a7 10160: 04b88e63 beq a7,a1,101bc <_start+0xd4> 10164: 00468793 addi a5,a3,4 10168: 06a78e63 beq a5,a0,101e4 <_start+0xfc> 1016c: 01872883 lw a7,24(a4) 10170: 011608b3 add a7,a2,a7 10174: 04b88463 beq a7,a1,101bc <_start+0xd4> 10178: 00568793 addi a5,a3,5 1017c: 06a78463 beq a5,a0,101e4 <_start+0xfc> 10180: 01c72883 lw a7,28(a4) 10184: 011608b3 add a7,a2,a7 10188: 02b88a63 beq a7,a1,101bc <_start+0xd4> 1018c: 00668793 addi a5,a3,6 10190: 04a78a63 beq a5,a0,101e4 <_start+0xfc> 10194: 02072883 lw a7,32(a4) 10198: 011608b3 add a7,a2,a7 1019c: 02b88063 beq a7,a1,101bc <_start+0xd4> 101a0: 00768793 addi a5,a3,7 101a4: 04a78063 beq a5,a0,101e4 <_start+0xfc> 101a8: 02472883 lw a7,36(a4) 101ac: 01160633 add a2,a2,a7 101b0: 00b60663 beq a2,a1,101bc <_start+0xd4> 101b4: 00868793 addi a5,a3,8 101b8: 02a78663 beq a5,a0,101e4 <_start+0xfc> 101bc: 03080813 addi a6,a6,48 101c0: 03078793 addi a5,a5,48 101c4: 0ff87813 andi a6,a6,255 101c8: 0ff7f793 andi a5,a5,255 101cc: 40002737 lui a4,0x40002 101d0: 01070023 sb a6,0(a4) # 40002000 <__global_pointer$+0x3fff05f0> 101d4: 02000693 li a3,32 101d8: 00d70023 sb a3,0(a4) 101dc: 00f70023 sb a5,0(a4) 101e0: 00008067 ret 101e4: 00030693 mv a3,t1 101e8: 00470713 addi a4,a4,4 101ec: f11ff06f j 100fc <_start+0x14> 101f0: 02e68813 addi a6,a3,46 101f4: 02f68793 addi a5,a3,47 101f8: 0ff87813 andi a6,a6,255 101fc: 0ff7f793 andi a5,a5,255 10200: fcdff06f j 101cc <_start+0xe4> 10204: 02f00793 li a5,47 10208: 02f00813 li a6,47 1020c: fc1ff06f j 101cc <_start+0xe4> ``` ::: Again the `-O3` option inlined the intire searching function and left the `two_sum` part not be called entirely. # Encoutered Issues ## Optimizer `memcpy` Call In `-Os` & `-O3` Optimization Originally, the array is initialized like this: ```c= int arr[] = {2, 8, 5, 1, 0, 6, 9, 7, 4, 9}; ``` Which `call memcpy` appers to be generated within the compiled assembly file, emitting an undefined refrenece error. By the option `-S`, I located the position where it calls the `memcpy` subroutine: ```nasm= _start: addi sp,sp,-64 # Allocating memory for stack frame lui a1,%hi(.LANCHOR0) # Loading 32-bit array address li a2,40 # Load array size addi a1,a1,%lo(.LANCHOR0) # Loading 32-bit array address addi a0,sp,8 # Store array at sp + 8 sw ra,60(sp) call memcpy li a2,1073750016 # Loading arguments for print_arr() addi a1,sp,8 # li a0,10 # call print_arr addi a1,sp,8 # Loading arguments for bubble_sort() li a0,10 # call bubble_sort addi a1,sp,8 # Loading arguments for print_arr() li a2,1073750016 # li a0,10 # call print_arr lw ra,60(sp) addi sp,sp,64 jr ra .size _start, .-_start .section .rodata .align 2 .set .LANCHOR0,. + 0 # Starting address for .rodata section .LC0: .word 2 .word 8 .word 5 .word 1 .word 0 .word 6 .word 9 .word 7 .word 4 .word 9 ``` > + `.rodata` denotes the **data-segement** > + `.set .LANCHOR0, . + 0` stands for setting value of `.LANCHOR0` to `. + 0` > + The dot symbol `.` means "the current address". > + That is, `.LANCHOR0` is set to be the starting address of the following data section. I found that, with the written code, the array body actually exists 2 copy: 1. One starts at `.LANCHOR0` in the `.rodata` section, 2. And one `sp + 8`, **which the `memcpy` call is used to copy the array body into the stack.** > This copy happens becauses that `.rodata` denotes the **Constant Data** section. And heres some effort i tried but to no avail: + Adding the `-fno-builtin -ffreestanding` option + Changing the array declaration to global and do not pass into functions With these succed: + Declaring the array as `static int[]`, which I assume tells the compiler to **keep only one instance**, a.k.a., do not store in `.rodata`. + Declairing only the array(`int arr[10];`), and fill in the array body one-by-one, as for `arr[0] = 2; arr[1] = 8;`, etc. **The static array way is chosen due to the better cleaness of assembly-code and c-code.** :::info The discription of the expressions are refreneced from https://www.esi-risc.com/wp-content/uploads/as.pdf ::: ## `_mulsi3` Call in `-O1` And Higher Optimization Since we are using the rv32i instructions, we are not weaponed with `MUL`, `DIV` and `MOD` calculations. Thus, I first tried to implement the `int _mul(int a, int b)` function: ```c= int _mul(int a, int b){ int result = 0; while(b--){ result += a; } return result; } ``` with optimization on, the compiled assembly code appears to know whats going on, and tries to call the `__mulsi3()` function, which is the builtin branchless unsigined integer multiplication function. The following is the [GCC Implementation](https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/epiphany/mulsi3.c): ```c= unsigned int __mulsi3 (unsigned int a, unsigned int b) { unsigned int r = 0; while (a){ if (a & 1) r += b; a >>= 1; b <<= 1; } return r; } ``` To prevent this optimization from happening, I found that we can add `__attribute((optimize("O0")))__` to the function, or, surprisingly, inlining the function also works.