# Lab2: RISC-V RV32I[MA] emulator with ELF support
###### tags: `2020-Computer-Architecture`
contributed by < `uduru0522` >
----
:::success
### Configuration of PATH
+ Required after every boot.
```shell
cd $HOME/riscv-none-embed-gcc
echo "export PATH=`pwd`/8.2.0-3.1/bin:$PATH" > setenv
cd $HOME
source riscv-none-embed-gcc/setenv
riscv-none-embed-gcc -v
```
> The following text should be found at the last output line:
```shell
gcc version 8.2.0 (xPack GNU RISC-V Embedded GCC, 64-bit)
```
### Compiling
+ Compile with command:
```shell
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib
```
> Combined with other compile options, e.g, `-O3`.
- `-march` **---** Specify RISC-V ISA strings.
+ Look [Here](https://en.wikichip.org/wiki/risc-v/standard_extensions) for extension string orders.
+ `rv32i` used due to the emulator we're using.
- `-mabi` **---** Specify RISC-V ABI strings
- Specify bit length of `int`, `long` and pointers.
+ `ilp32` -- `int`, `long`, and pointers are 32-bit. </br> `char` are 8-bit, short` are 16-bit long, `long long` are 64-bit.
+ `lp64` -- `long` and pointers are 64-bit. Others are the same as `ilp32`
- Specifing which floating point types are passed into registers
+ `(empty)` -- No floating point numbers are passed into.
+ `f` -- **F Extension Required**. 32-bit or shorter floating point values are passed into registers
+ `d` -- **D Extension Required**. 64-bit or shorter floating point values are passed into registers
:::
----
# Bubble Sort
> From < `Wder4` >
## Rewrited C Code
```c=
void print_arr(int size, int* arr, volatile char* tx){
for(int i = 0; i < size; ++i){
*tx = arr[i] + '0';
*tx = ' ';
}
*tx = '\n';
}
void bubble_sort(int size, int *arr){
for(int i = 0; i < size; ++i){
for(int j = i + 1; j < size; ++j){
if(arr[i] > arr[j]){
arr[i] ^= arr[j];
arr[j] ^= arr[i];
arr[i] ^= arr[j];
}
}
}
}
void _start(){
volatile char* tx = (volatile char*) 0x40002000;
static int arr[10] = {2, 8, 5, 1, 0, 6, 9, 7, 4, 9};
int arr_size = 10;
print_arr(10, arr, tx)
bubble_sort(10, arr);
print_arr(10, arr, tx);
}
```
## Testing w/ `-O0` ~ `-O3` And `-Os`, Printing Out Results
```shell
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O0 bubble-sort-riscv.c -o test-O0
./emu-rv32i test-O0
2 8 5 1 0 6 9 7 4 9
0 1 2 4 5 6 7 8 9 9
>>> Execution time: 489719 ns
>>> Instruction count: 2121 (IPS=4331055)
>>> Jumps: 124 (5.85%) - 45 forwards, 79 backwards
>>> Branching T=104 (78.20%) F=29 (21.80%)
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O1 bubble-sort-riscv.c -o test-O1
./emu-rv32i test-O1
2 8 5 1 0 6 9 7 4 9
0 1 2 4 5 6 7 8 9 9
>>> Execution time: 88914 ns
>>> Instruction count: 588 (IPS=6613131)
>>> Jumps: 90 (15.31%) - 14 forwards, 76 backwards
>>> Branching T=57 (46.34%) F=66 (53.66%)
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O2 bubble-sort-riscv.c -o test-O2
./emu-rv32i test-O2
2 8 5 1 0 6 9 7 4 9
0 1 2 4 5 6 7 8 9 9
>>> Execution time: 146506 ns
>>> Instruction count: 544 (IPS=3713158)
>>> Jumps: 94 (17.28%) - 30 forwards, 64 backwards
>>> Branching T=91 (75.21%) F=30 (24.79%)
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O3 bubble-sort-riscv.c -o test-O3
./emu-rv32i test-O3
2 8 5 1 0 6 9 7 4 9
0 1 2 4 5 6 7 8 9 9
>>> Execution time: 138569 ns
>>> Instruction count: 363 (IPS=2619633)
>>> Jumps: 46 (12.67%) - 37 forwards, 9 backwards
>>> Branching T=45 (45.45%) F=54 (54.55%)
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -Os bubble-sort-riscv.c -o test-Os
./emu-rv32i test-Os
2 8 5 1 0 6 9 7 4 9
0 1 2 4 5 6 7 8 9 9
>>> Execution time: 105596 ns
>>> Instruction count: 747 (IPS=7074131)
>>> Jumps: 185 (24.77%) - 106 forwards, 79 backwards
>>> Branching T=104 (78.20%) F=29 (21.80%)
```
Within several attempts, the optimize option `-O1` mostly yields the best performance in terms of execution time, and `-Os` getting the highest IPS.
## Not Printing Out The Results
Thinking of that `printf()` often is the bottleneck of performance in C-codes, I tried comment out all printing-related functions and tested again:
```shell
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O0 bubble-sort-riscv.c -o test-O0
./emu-rv32i test-O0
>>> Execution time: 54608 ns
>>> Instruction count: 1695 (IPS=31039408)
>>> Jumps: 98 (5.78%) - 41 forwards, 57 backwards
>>> Branching T=84 (75.68%) F=27 (24.32%)
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O1 bubble-sort-riscv.c -o test-O1
./emu-rv32i test-O1
>>> Execution time: 13542 ns
>>> Instruction count: 422 (IPS=31162309)
>>> Jumps: 68 (16.11%) - 12 forwards, 56 backwards
>>> Branching T=39 (38.61%) F=62 (61.39%)
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O2 bubble-sort-riscv.c -o test-O2
./emu-rv32i test-O2
>>> Execution time: 11959 ns
>>> Instruction count: 384 (IPS=32109708)
>>> Jumps: 75 (19.53%) - 29 forwards, 46 backwards
>>> Branching T=73 (72.28%) F=28 (27.72%)
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -O3 bubble-sort-riscv.c -o test-O3
./emu-rv32i test-O3
>>> Execution time: 9194 ns
>>> Instruction count: 307 (IPS=33391342)
>>> Jumps: 46 (14.98%) - 37 forwards, 9 backwards
>>> Branching T=45 (45.45%) F=54 (54.55%)
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -nostdlib -Os bubble-sort-riscv.c -o test-Os
./emu-rv32i test-Os
>>> Execution time: 15149 ns
>>> Instruction count: 521 (IPS=34391709)
>>> Jumps: 141 (27.06%) - 84 forwards, 57 backwards
>>> Branching T=84 (75.68%) F=27 (24.32%)
```
This time, the optimization clearly fastens the execution as the optimize level goes up.
However, within ~20 testings, sometimes the speed of `-O3` gets very high, almost getting close to the `-O0` speed. I assume it is the result of prediction failure, since only the branching of `-O3` has almost a 50-50 probability, which is bad for predicting.
## Dumping Object Files And Improving Generated Code
By comparing the `-O2` and `-O3` object file:
:::spoiler `-O2` objdump
```nasm
Disassembly of section .text:
00010074 <bubble_sort>:
10074: 04a05a63 blez a0,100c8 <bubble_sort+0x54>
10078: 00251793 slli a5,a0,0x2
1007c: 00458513 addi a0,a1,4
10080: 00f585b3 add a1,a1,a5
10084: 04b50263 beq a0,a1,100c8 <bubble_sort+0x54>
10088: ffc52703 lw a4,-4(a0)
1008c: 00050793 mv a5,a0
10090: 0007a603 lw a2,0(a5)
10094: 00c746b3 xor a3,a4,a2
10098: 02e65063 bge a2,a4,100b8 <bubble_sort+0x44>
1009c: fed52e23 sw a3,-4(a0)
100a0: 0007a703 lw a4,0(a5)
100a4: 00e6c6b3 xor a3,a3,a4
100a8: 00d7a023 sw a3,0(a5)
100ac: ffc52703 lw a4,-4(a0)
100b0: 00d74733 xor a4,a4,a3
100b4: fee52e23 sw a4,-4(a0)
100b8: 00478793 addi a5,a5,4
100bc: fcb79ae3 bne a5,a1,10090 <bubble_sort+0x1c>
100c0: 00450513 addi a0,a0,4
100c4: fcb512e3 bne a0,a1,10088 <bubble_sort+0x14>
100c8: 00008067 ret
000100cc <_start>:
100cc: 000115b7 lui a1,0x11
100d0: 0dc58593 addi a1,a1,220 # 110dc <__DATA_BEGIN__>
100d4: 00a00513 li a0,10
100d8: f9dff06f j 10074 <bubble_sort>
```
:::
:::spoiler `-O3` objdump
```nasm
00010074 <bubble_sort>:
10074: 06a05c63 blez a0,100ec <bubble_sort+0x78>
10078: 00100793 li a5,1
1007c: 06f50863 beq a0,a5,100ec <bubble_sort+0x78>
10080: 00458e13 addi t3,a1,4
10084: 00000893 li a7,0
10088: 00100313 li t1,1
1008c: 00289893 slli a7,a7,0x2
10090: 011588b3 add a7,a1,a7
10094: 0008a703 lw a4,0(a7)
10098: 000e0793 mv a5,t3
1009c: 00030613 mv a2,t1
100a0: 0007a803 lw a6,0(a5)
100a4: 00160613 addi a2,a2,1
100a8: 00e846b3 xor a3,a6,a4
100ac: 02e85063 bge a6,a4,100cc <bubble_sort+0x58>
100b0: 00d8a023 sw a3,0(a7)
100b4: 0007a703 lw a4,0(a5)
100b8: 00e6c6b3 xor a3,a3,a4
100bc: 00d7a023 sw a3,0(a5)
100c0: 0008a703 lw a4,0(a7)
100c4: 00e6c733 xor a4,a3,a4
100c8: 00e8a023 sw a4,0(a7)
100cc: 00478793 addi a5,a5,4
100d0: fca648e3 blt a2,a0,100a0 <bubble_sort+0x2c>
100d4: 00130793 addi a5,t1,1
100d8: 004e0e13 addi t3,t3,4
100dc: 00030893 mv a7,t1
100e0: 00f50663 beq a0,a5,100ec <bubble_sort+0x78>
100e4: 00078313 mv t1,a5
100e8: fa5ff06f j 1008c <bubble_sort+0x18>
100ec: 00008067 ret
000100f0 <_start>:
100f0: 00000613 li a2,0
100f4: 000117b7 lui a5,0x11
100f8: 00a00693 li a3,10
100fc: 00160613 addi a2,a2,1
10100: 23078793 addi a5,a5,560 # 11230 <__DATA_BEGIN__>
10104: 00900f13 li t5,9
10108: 00800e93 li t4,8
1010c: 00700e13 li t3,7
10110: 00600313 li t1,6
10114: 00500893 li a7,5
10118: 00400813 li a6,4
1011c: 00300513 li a0,3
10120: 00200593 li a1,2
10124: 00900293 li t0,9
10128: 10d60263 beq a2,a3,1022c <_start+0x13c>
1012c: 0007a703 lw a4,0(a5)
10130: 0047af83 lw t6,4(a5)
10134: 00efd863 bge t6,a4,10144 <_start+0x54>
10138: 00e7a223 sw a4,4(a5)
1013c: 01f7a023 sw t6,0(a5)
10140: 000f8713 mv a4,t6
10144: 0ad58e63 beq a1,a3,10200 <_start+0x110>
10148: 0087af83 lw t6,8(a5)
1014c: 00efd863 bge t6,a4,1015c <_start+0x6c>
10150: 00e7a423 sw a4,8(a5)
10154: 01f7a023 sw t6,0(a5)
10158: 000f8713 mv a4,t6
1015c: 0ad50263 beq a0,a3,10200 <_start+0x110>
10160: 00c7af83 lw t6,12(a5)
10164: 00efd863 bge t6,a4,10174 <_start+0x84>
10168: 00e7a623 sw a4,12(a5)
1016c: 01f7a023 sw t6,0(a5)
10170: 000f8713 mv a4,t6
10174: 08d80663 beq a6,a3,10200 <_start+0x110>
10178: 0107af83 lw t6,16(a5)
1017c: 00efd863 bge t6,a4,1018c <_start+0x9c>
10180: 00e7a823 sw a4,16(a5)
10184: 01f7a023 sw t6,0(a5)
10188: 000f8713 mv a4,t6
1018c: 06d88a63 beq a7,a3,10200 <_start+0x110>
10190: 0147af83 lw t6,20(a5)
10194: 00efd863 bge t6,a4,101a4 <_start+0xb4>
10198: 00e7aa23 sw a4,20(a5)
1019c: 01f7a023 sw t6,0(a5)
101a0: 000f8713 mv a4,t6
101a4: 04d30e63 beq t1,a3,10200 <_start+0x110>
101a8: 0187af83 lw t6,24(a5)
101ac: 00efd863 bge t6,a4,101bc <_start+0xcc>
101b0: 00e7ac23 sw a4,24(a5)
101b4: 01f7a023 sw t6,0(a5)
101b8: 000f8713 mv a4,t6
101bc: 04de0263 beq t3,a3,10200 <_start+0x110>
101c0: 01c7af83 lw t6,28(a5)
101c4: 00efd863 bge t6,a4,101d4 <_start+0xe4>
101c8: 00e7ae23 sw a4,28(a5)
101cc: 01f7a023 sw t6,0(a5)
101d0: 000f8713 mv a4,t6
101d4: 02de8663 beq t4,a3,10200 <_start+0x110>
101d8: 0207af83 lw t6,32(a5)
101dc: 00efd863 bge t6,a4,101ec <_start+0xfc>
101e0: 02e7a023 sw a4,32(a5)
101e4: 01f7a023 sw t6,0(a5)
101e8: 000f8713 mv a4,t6
101ec: 005f1a63 bne t5,t0,10200 <_start+0x110>
101f0: 0247af83 lw t6,36(a5)
101f4: 00efd663 bge t6,a4,10200 <_start+0x110>
101f8: 02e7a223 sw a4,36(a5)
101fc: 01f7a023 sw t6,0(a5)
10200: 00160613 addi a2,a2,1
10204: 00158593 addi a1,a1,1
10208: 00150513 addi a0,a0,1
1020c: 00180813 addi a6,a6,1
10210: 00188893 addi a7,a7,1
10214: 00130313 addi t1,t1,1
10218: 001e0e13 addi t3,t3,1
1021c: 001e8e93 addi t4,t4,1
10220: 001f0f13 addi t5,t5,1
10224: 00478793 addi a5,a5,4
10228: f0d612e3 bne a2,a3,1012c <_start+0x3c>
1022c: 00008067 ret
```
:::
I would describe `-O2` optimization as "making thinks short and clean", and `-O3` as "Anarchy", inlining the entire `bubble_sort()` into `_start`.
Running with Ripes, it was clear that the `_bubble_sort` section is not executed entirely.
Which a eazy way to optimize for file size is, simply delete the section, saving filesize of around 25 instructions.
### Object File Information
```shell
# Display object file information
riscv-none-embed-size test-O0
text data bss dec hex filename
412 40 0 452 1c4 test-O0
riscv-none-embed-size test-O1
text data bss dec hex filename
132 40 0 172 ac test-O1
riscv-none-embed-size test-O2
text data bss dec hex filename
104 40 0 144 90 test-O2
riscv-none-embed-size test-O3
text data bss dec hex filename
444 40 0 484 1e4 test-O3
riscv-none-embed-size test-Os
text data bss dec hex filename
108 40 0 148 94 test-Os
```
The huge filesize of `-O3` optimized object file seems to be blamed for the aggressive inlining.
----
# Two Sum
> From < `joe-U16` >
## C Code Implementation (Original)
```c=
int* twoSum(int* nums, int numsSize, int target, int* returnSize){
for (int i=0; i < numsSize; i++)
{
for (int j=i+1; j < numsSize; j++)
{
if ((nums[i]+nums[j]) == target) {
*returnSize = 2;
int *arr = malloc(sizeof(int) * 2);
arr[0] = i;
arr[1] = j;
return arr;
}
}
}
return 0;
}
```
## Rewrited C Code
```c=
void two_sum(int *arr, int size, int target, int* return_arr){
for(int i = 0; i < size; ++i){
for(int j = i + 1; j < size; ++j){
if(arr[i] + arr[j] == target){
return_arr[0] = i;
return_arr[1] = j;
return;
}
}
}
return_arr[0] = return_arr[1] = -1;
return;
}
void _start(){
volatile char* tx = (volatile char*) 0x40002000;
static int arr[] = {1, 2, 3, 5, 7, 11, 16, 21, 26, 33};
int arr_size = 10;
int return_arr[2];
two_sum(arr, arr_size, 33, return_arr);
*tx = return_arr[0] + '0';
*tx = ' ';
*tx = return_arr[1] + '0';
}
```
## Testing w/ `-O0` ~ `-O3` And `-Os`s
```shell
$rv32igcc -O0 $FILENAME -o test-O0
./emu-rv32i test-O0
4 8
>>> Execution time: 42385 ns
>>> Instruction count: 758 (IPS=17883685)
>>> Jumps: 82 (10.82%) - 41 forwards, 41 backwards
>>> Branching T=72 (93.51%) F=5 (6.49%)
$rv-size test-O0
text data bss dec hex filename
368 40 0 408 198 test-O0
$rv32igcc -O1 $FILENAME -o test-O1
./emu-rv32i test-O1
4 8
>>> Execution time: 15339 ns
>>> Instruction count: 262 (IPS=17080644)
>>> Jumps: 39 (14.89%) - 5 forwards, 34 backwards
>>> Branching T=32 (43.84%) F=41 (56.16%)
$rv-size test-O1
text data bss dec hex filename
200 40 0 240 f0 test-O1
$rv32igcc -O2 $FILENAME -o test-O2
./emu-rv32i test-O2
4 8
>>> Execution time: 15627 ns
>>> Instruction count: 251 (IPS=16061944)
>>> Jumps: 40 (15.94%) - 6 forwards, 34 backwards
>>> Branching T=34 (47.89%) F=37 (52.11%)
$rv-size test-O2
text data bss dec hex filename
304 40 0 344 158 test-O2
$rv32igcc -O3 $FILENAME -o test-O3
./emu-rv32i test-O3
4 8
>>> Execution time: 13062 ns
>>> Instruction count: 211 (IPS=16153728)
>>> Jumps: 10 (4.74%) - 5 forwards, 5 backwards
>>> Branching T=5 (7.46%) F=62 (92.54%)
$rv-size test-O3
text data bss dec hex filename
412 40 0 452 1c4 test-O3
$rv32igcc -Os $FILENAME -o test-Os
./emu-rv32i test-Os
4 8
>>> Execution time: 48654 ns
>>> Instruction count: 363 (IPS=7460845)
>>> Jumps: 112 (30.85%) - 73 forwards, 39 backwards
>>> Branching T=72 (93.51%) F=5 (6.49%)
$rv-size test-Os
text data bss dec hex filename
184 40 0 224 e0 test-Os
```
This time, with much less printing action, I skipped the elimination of printing related codes.
Also the options are also functioning correctly, with `-O3` being the fast and `-Os` having the least instructions.
## Object File Dumps
:::spoiler `-O3` Dump
```nasm
uduru@uduru-GL553VD:~/rv32emu$ riscv-none-embed-objdump -d test-O3
test-O3: file format elf32-littleriscv
Disassembly of section .text:
00010074 <two_sum>:
10074: 04b05a63 blez a1,100c8 <two_sum+0x54>
10078: 00000e13 li t3,0
1007c: 001e0313 addi t1,t3,1
10080: 04658463 beq a1,t1,100c8 <two_sum+0x54>
10084: 00052883 lw a7,0(a0)
10088: 00452783 lw a5,4(a0)
1008c: 00f887b3 add a5,a7,a5
10090: 04f60463 beq a2,a5,100d8 <two_sum+0x64>
10094: 00050813 mv a6,a0
10098: 00030713 mv a4,t1
1009c: 0140006f j 100b0 <two_sum+0x3c>
100a0: 00882783 lw a5,8(a6)
100a4: 00480813 addi a6,a6,4
100a8: 00f887b3 add a5,a7,a5
100ac: 02c78863 beq a5,a2,100dc <two_sum+0x68>
100b0: 00170713 addi a4,a4,1
100b4: fee596e3 bne a1,a4,100a0 <two_sum+0x2c>
100b8: 00030e13 mv t3,t1
100bc: 001e0313 addi t1,t3,1
100c0: 00450513 addi a0,a0,4
100c4: fc6590e3 bne a1,t1,10084 <two_sum+0x10>
100c8: fff00793 li a5,-1
100cc: 00f6a223 sw a5,4(a3)
100d0: 00f6a023 sw a5,0(a3)
100d4: 00008067 ret
100d8: 00030713 mv a4,t1
100dc: 01c6a023 sw t3,0(a3)
100e0: 00e6a223 sw a4,4(a3)
100e4: 00008067 ret
000100e8 <_start>:
100e8: 00011737 lui a4,0x11
100ec: 21070713 addi a4,a4,528 # 11210 <__DATA_BEGIN__>
100f0: 00200693 li a3,2
100f4: 02100593 li a1,33
100f8: 00a00513 li a0,10
100fc: 00072603 lw a2,0(a4)
10100: 00472783 lw a5,4(a4)
10104: ffe68813 addi a6,a3,-2
10108: 00f607b3 add a5,a2,a5
1010c: 0eb78263 beq a5,a1,101f0 <_start+0x108>
10110: 00068793 mv a5,a3
10114: 0ea68863 beq a3,a0,10204 <_start+0x11c>
10118: 00872883 lw a7,8(a4)
1011c: 011608b3 add a7,a2,a7
10120: 08b88e63 beq a7,a1,101bc <_start+0xd4>
10124: 00168313 addi t1,a3,1
10128: 00030793 mv a5,t1
1012c: 0aa30c63 beq t1,a0,101e4 <_start+0xfc>
10130: 00c72883 lw a7,12(a4)
10134: 011608b3 add a7,a2,a7
10138: 08b88263 beq a7,a1,101bc <_start+0xd4>
1013c: 00268793 addi a5,a3,2
10140: 0aa78263 beq a5,a0,101e4 <_start+0xfc>
10144: 01072883 lw a7,16(a4)
10148: 011608b3 add a7,a2,a7
1014c: 06b88863 beq a7,a1,101bc <_start+0xd4>
10150: 00368793 addi a5,a3,3
10154: 08a78863 beq a5,a0,101e4 <_start+0xfc>
10158: 01472883 lw a7,20(a4)
1015c: 011608b3 add a7,a2,a7
10160: 04b88e63 beq a7,a1,101bc <_start+0xd4>
10164: 00468793 addi a5,a3,4
10168: 06a78e63 beq a5,a0,101e4 <_start+0xfc>
1016c: 01872883 lw a7,24(a4)
10170: 011608b3 add a7,a2,a7
10174: 04b88463 beq a7,a1,101bc <_start+0xd4>
10178: 00568793 addi a5,a3,5
1017c: 06a78463 beq a5,a0,101e4 <_start+0xfc>
10180: 01c72883 lw a7,28(a4)
10184: 011608b3 add a7,a2,a7
10188: 02b88a63 beq a7,a1,101bc <_start+0xd4>
1018c: 00668793 addi a5,a3,6
10190: 04a78a63 beq a5,a0,101e4 <_start+0xfc>
10194: 02072883 lw a7,32(a4)
10198: 011608b3 add a7,a2,a7
1019c: 02b88063 beq a7,a1,101bc <_start+0xd4>
101a0: 00768793 addi a5,a3,7
101a4: 04a78063 beq a5,a0,101e4 <_start+0xfc>
101a8: 02472883 lw a7,36(a4)
101ac: 01160633 add a2,a2,a7
101b0: 00b60663 beq a2,a1,101bc <_start+0xd4>
101b4: 00868793 addi a5,a3,8
101b8: 02a78663 beq a5,a0,101e4 <_start+0xfc>
101bc: 03080813 addi a6,a6,48
101c0: 03078793 addi a5,a5,48
101c4: 0ff87813 andi a6,a6,255
101c8: 0ff7f793 andi a5,a5,255
101cc: 40002737 lui a4,0x40002
101d0: 01070023 sb a6,0(a4) # 40002000 <__global_pointer$+0x3fff05f0>
101d4: 02000693 li a3,32
101d8: 00d70023 sb a3,0(a4)
101dc: 00f70023 sb a5,0(a4)
101e0: 00008067 ret
101e4: 00030693 mv a3,t1
101e8: 00470713 addi a4,a4,4
101ec: f11ff06f j 100fc <_start+0x14>
101f0: 02e68813 addi a6,a3,46
101f4: 02f68793 addi a5,a3,47
101f8: 0ff87813 andi a6,a6,255
101fc: 0ff7f793 andi a5,a5,255
10200: fcdff06f j 101cc <_start+0xe4>
10204: 02f00793 li a5,47
10208: 02f00813 li a6,47
1020c: fc1ff06f j 101cc <_start+0xe4>
```
:::
Again the `-O3` option inlined the intire searching function and left the `two_sum` part not be called entirely.
# Encoutered Issues
## Optimizer `memcpy` Call In `-Os` & `-O3` Optimization
Originally, the array is initialized like this:
```c=
int arr[] = {2, 8, 5, 1, 0, 6, 9, 7, 4, 9};
```
Which `call memcpy` appers to be generated within the compiled assembly file, emitting an undefined refrenece error.
By the option `-S`, I located the position where it calls the `memcpy` subroutine:
```nasm=
_start:
addi sp,sp,-64 # Allocating memory for stack frame
lui a1,%hi(.LANCHOR0) # Loading 32-bit array address
li a2,40 # Load array size
addi a1,a1,%lo(.LANCHOR0) # Loading 32-bit array address
addi a0,sp,8 # Store array at sp + 8
sw ra,60(sp)
call memcpy
li a2,1073750016 # Loading arguments for print_arr()
addi a1,sp,8 #
li a0,10 #
call print_arr
addi a1,sp,8 # Loading arguments for bubble_sort()
li a0,10 #
call bubble_sort
addi a1,sp,8 # Loading arguments for print_arr()
li a2,1073750016 #
li a0,10 #
call print_arr
lw ra,60(sp)
addi sp,sp,64
jr ra
.size _start, .-_start
.section .rodata
.align 2
.set .LANCHOR0,. + 0 # Starting address for .rodata section
.LC0:
.word 2
.word 8
.word 5
.word 1
.word 0
.word 6
.word 9
.word 7
.word 4
.word 9
```
> + `.rodata` denotes the **data-segement**
> + `.set .LANCHOR0, . + 0` stands for setting value of `.LANCHOR0` to `. + 0`
> + The dot symbol `.` means "the current address".
> + That is, `.LANCHOR0` is set to be the starting address of the following data section.
I found that, with the written code, the array body actually exists 2 copy:
1. One starts at `.LANCHOR0` in the `.rodata` section,
2. And one `sp + 8`, **which the `memcpy` call is used to copy the array body into the stack.**
> This copy happens becauses that `.rodata` denotes the **Constant Data** section.
And heres some effort i tried but to no avail:
+ Adding the `-fno-builtin -ffreestanding` option
+ Changing the array declaration to global and do not pass into functions
With these succed:
+ Declaring the array as `static int[]`, which I assume tells the compiler to **keep only one instance**, a.k.a., do not store in `.rodata`.
+ Declairing only the array(`int arr[10];`), and fill in the array body one-by-one, as for `arr[0] = 2; arr[1] = 8;`, etc.
**The static array way is chosen due to the better cleaness of assembly-code and c-code.**
:::info
The discription of the expressions are refreneced from
https://www.esi-risc.com/wp-content/uploads/as.pdf
:::
## `_mulsi3` Call in `-O1` And Higher Optimization
Since we are using the rv32i instructions, we are not weaponed with `MUL`, `DIV` and `MOD` calculations.
Thus, I first tried to implement the `int _mul(int a, int b)` function:
```c=
int _mul(int a, int b){
int result = 0;
while(b--){
result += a;
}
return result;
}
```
with optimization on, the compiled assembly code appears to know whats going on, and tries to call the `__mulsi3()` function,
which is the builtin branchless unsigined integer multiplication function.
The following is the [GCC Implementation](https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/epiphany/mulsi3.c):
```c=
unsigned int __mulsi3 (unsigned int a, unsigned int b)
{
unsigned int r = 0;
while (a){
if (a & 1)
r += b;
a >>= 1;
b <<= 1;
}
return r;
}
```
To prevent this optimization from happening, I found that we can add `__attribute((optimize("O0")))__` to the function,
or, surprisingly, inlining the function also works.