# Assignment2: RISC-V Toolchain
Contributed by [raphael89918](https://github.com/raphael89918)
> [source code](https://github.com/raphael89918/Computer_Architecture_11201/tree/main/homework_2)
## Pick a Question
The following question is picked from the Assignment 1
> 陳燦仁 Shell sort with FP32 in BF16 format
> Target to sort float number list with shell sort, and using bf16 format to sort instead of using normal calculate.
For Example.
float number list 1.6, -1.5, 1.4, -1.3, 1.2, -1.1.
list length is 6,so the gap will be 3.
after the first round, gap will divide 2 = 1 until gap = 0.
gap = 3 : final result : -1.3, -1.5, -1.1, 1.6, 1.2, 1.4.
gap = 1 : final result : -1.5, -1.3, -1.1, 1.2, 1.4, 1.6.
[Source code](https://github.com/TRChen11011/Shell-sort-with-FP32-in-BF16-format/tree/main)
The original implementation of the questions is as follows.
```c
#include <stdio.h>
#include <stdbool.h>
#define array_size 6
unsigned int fp32_to_bf16(float x)
{
float y = x;
int *p = (int *) &y;
unsigned int exp = *p & 0x7F800000;
unsigned int man = *p & 0x007FFFFF;
if (exp == 0 && man == 0) /* zero */
return *p;
if (exp == 0x7F800000) /* infinity or NaN */
return *p;
float r = x;
int *pr = (int *) &r;
*pr &= 0xFF800000; /* r has the same exp as x */
r /= 0x100;
y = x + r;
*p &= 0xFFFF0000;
return *p;
}
bool BOS(float fp32_1, float fp32_2){
unsigned int bf16_1, bf16_2, sig1, sig2, exp1, exp2, man1, man2;
bf16_1 = fp32_to_bf16(fp32_1);
bf16_2 = fp32_to_bf16(fp32_2);
sig1 = bf16_1 & 0x80000000;
sig2 = bf16_2 & 0x80000000;
exp1 = bf16_1 & 0x7F800000;
exp2 = bf16_2 & 0x7F800000;
man1 = bf16_1 & 0x007F0000;
man2 = bf16_2 & 0x007F0000;
if (sig1 < sig2) return 1;
else if (sig1 == 0 && sig1 == sig2 && exp1 > exp2)
return 1;
else if (sig1 != 0 && sig1 == sig2 && exp1 < exp2)
return 1;
else if (sig1 == 0 && sig1 == sig2 && exp1 == exp2 && man1 > man2)
return 1;
else if (sig1 != 0 && sig1 == sig2 && exp1 == exp2 && man1 < man2)
return 1;
else return 0;
}
void ShellSort(float array[array_size]){
int interval = array_size / 2;
while (interval >0){
for(int i = interval;i<array_size;i++){
int j = i;
float temp = array[i];
int Flag = BOS(array[j-interval], temp);
while (j>= interval && Flag == 1){
array[j] = array[j-interval];
j = j-interval;
Flag = BOS(array[j-interval], temp);
}
array[j] = temp;
}
interval = interval / 2;
}
}
void main()
{
float array[array_size] = {1.6,-1.5,1.4,-1.3,1.2,-1.1};
ShellSort(array);
for (int i = 0;i<array_size;i++){
printf("%x\n",fp32_to_bf16(array[i]));
}
return 0;
}
```
I chose this code simply because I am interested in the [Shell Sort](https://hackmd.io/@Aquamay/H1nxBOLcO/https%3A%2F%2Fhackmd.io%2F%40Aquamay%2FrkgO8fpcu). Compared to bubble sort, heap sort, and others, I haven't used this sorting method yet.
The official documentation of GCC provides a detailed explanation of these optimization options. [here](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html)
### gcc O0 optimization
```text=
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100d8
Start of program headers: 52 (bytes into file)
Start of section headers: 98776 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
text data bss dec hex filename
78976 2320 1528 82824 14388 hw1.elf
```
#### RDCYCLE
```text=
bfc00000
bfa60000
bf8d0000
3f9a0000
3fb30000
3fcd0000
Start Cycle: 110
End Cycle: 21535
Cycles Elapsed: 21425
inferior exit code 0
```
:::warning
:warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text.
:notes: jserv
:::
### gcc O1 optimization
```text
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2s complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100d8
Start of program headers: 52 (bytes into file)
Start of section headers: 98780 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
text data bss dec hex filename
78216 2324 1528 82068 14094 hw1.elf
```
#### RDCYCLE
```text=
bfc00000
bfa60000
bf8d0000
3f9a0000
3fb30000
3fcd0000
Start Cycle: 105
End Cycle: 16471
Cycles Elapsed: 16366
inferior exit code
```
I guess using the O1 optimizer is not necessary for the program to be optimized.
### gcc O2 optimization
```text
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x101f0
Start of program headers: 52 (bytes into file)
Start of section headers: 98796 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
text data bss dec hex filename
78488 2324 1528 82340 141a4 hw1.elf
```
#### RDCYCLE
```text=
bfc00000
bfa60000
bf8d0000
3f9a0000
3fb30000
3fcd0000
Start Cycle: 105
End Cycle: 16224
Cycles Elapsed: 16119
inferior exit code 0
```
### gcc O3 optimization
```text=
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x101f0
Start of program headers: 52 (bytes into file)
Start of section headers: 98796 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
text data bss dec hex filename
79012 2324 1528 82864 143b0 hw1.elf
```
#### RDCYCLE
```text=
bfc00000
bfa60000
bf8d0000
3f9a0000
3fb30000
3fcd0000
Start Cycle: 105
End Cycle: 16302
Cycles Elapsed: 16197
inferior exit code 0
```
### gcc Os optimization
```text=
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10144
Start of program headers: 52 (bytes into file)
Start of section headers: 98796 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
text data bss dec hex filename
78124 2324 1528 81976 14038 hw1.elf
```
#### RDCYCLE
```text=
bfc00000
bfa60000
bf8d0000
3f9a0000
3fb30000
3fcd0000
Start Cycle: 105
End Cycle: 16526
Cycles Elapsed: 16421
inferior exit code 0
```
### gcc Ofast optimization
```text=
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x101f0
Start of program headers: 52 (bytes into file)
Start of section headers: 98796 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
text data bss dec hex filename
79068 2324 1528 82920 143e8 hw1.elf
```
#### RDCYCLE
```text=
bfc00000
bfa60000
bf8d0000
3f9a0000
3fb30000
3fcd0000
Start Cycle: 105
End Cycle: 16324
Cycles Elapsed: 16219
inferior exit code 0
```
1. The text section with -Ofast is larger than with -Os. This is likely due to -Ofast performing more optimizations, such as inlining, loop unrolling, etc., with the expectation of achieving better runtime performance, but at the cost of increased code size.
2. The data and bss sections are the same in both cases, which implies that the optimizations did not change the layout or size of static data.
3. The ELF file size generated by -Os is smaller than that by -Ofast, which is as expected since the primary goal of -Os is to reduce code size.
## The Hand Written Assembly
The hand written assembly looks like this
```c
.data
dataset1: .word 0x3fcccccd #1.6
dataset2: .word 0xbfc00000 #-1.5
dataset3: .word 0x3fb33333 #1.4
dataset4: .word 0xbfa66666 #-1.3
dataset5: .word 0x3f99999a #1.2
dataset6: .word 0xbf8ccccd #-1.1
array_size: .word 0x00000006
sign: .word 0x80000000
exp: .word 0x7F800000
man: .word 0x007FFFFF
bf16man: .word 0x007F0000
bf16format: .word 0xFFFF0000
_EOL: .string "\n"
.text
main:
addi sp,sp,-24
lw s0, dataset1
lw s1, dataset2
lw s2, dataset3
lw s3, dataset4
lw s4, dataset5
lw s5, dataset6
lw a0, sign
lw a1, exp
lw a2, man
lw a3, bf16man
lw a4, bf16format
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
lw s11, array_size
jal ShellSort
ShellSort:
lw t0, array_size
srli t0, t0, 1
Whileinterval:
beq t0, zero, Print
add t1, zero, t0
Foriarraysize:
beq t1, s11, Interval_div2
add t2, zero, t1
slli s0, t1, 2
add s1, sp, s0
lw t5, 0(s1)
ReadData_Temp_done:
sub t3, t2, t0
slli s0, t3, 2
add s1, sp, s0
lw t4, 0(s1)
jal fp32_to_bf16
WhilejandFlag:
beq s10, zero, arrayjtotemp
bltu t2, t0, arrayjtotemp
and t4, t4, a4
slli s0, t2, 2
add s1, sp, s0
sw t4, 0(s1)
sub t2, t2, t0
jal ReadData_Temp_done
arrayjtotemp:
and t5, t5, a4
slli s0, t2, 2
add s1, sp, s0
sw t5, 0(s1)
addi t1, t1, 1
jal Foriarraysize
Interval_div2:
srli t0, t0, 1
jal Whileinterval
fp32_to_bf16:
and s0, a0, t4
and s1, a0, t5
and s2, a1, t4
and s3, a1, t5
and s4, a2, t4
and s5, a2, t5
and s6, a3, t4
and s7, a3, t5
jal BOS
ret
BOS:
blt s0, s1, SMALL
beq s0, s1, Sigsame
jal BIG
Sigsame:
beq s2, s3, Expsame
beq s0, zero, Exp1
jal Exp2
Expsame:
beq s0, zero, Man1
jal Man2
Exp1:
blt s2, s3, BIG
jal SMALL
Exp2:
blt s2, s3, SMALL
jal BIG
Man1:
blt s6, s7, SMALL
jal BIG
Man2:
blt s6, s7, BIG
jal SMALL
BIG:
li s10, 1
jalr ra
SMALL:
li s10, 0
jalr ra
Print:
slli s0, t6, 2
add s1, sp, s0
lw a0, 0(s1)
li a7, 2
ecall
addi t6, t6, 1
la a0, _EOL
li a7, 4
ecall
beq t6, s11, End
jal Print
End:
# Exit code can be placed here
li a7, 10
ecall
```
### Compiling The Handwritten Assembly using RISC-V
```shell
$ riscv-none-elf-objdump -d hw2.elf
hw2.elf: file format elf32-littleriscv
Disassembly of section .text:
00000000 <dataset1>:
0: 3fcccccd .word 0x3fcccccd
00000004 <dataset2>:
4: bfc00000 .word 0xbfc00000
00000008 <dataset3>:
8: 3fb33333 .word 0x3fb33333
0000000c <dataset4>:
c: bfa66666 .word 0xbfa66666
00000010 <dataset5>:
10: 3f99999a .word 0x3f99999a
00000014 <dataset6>:
14: bf8ccccd .word 0xbf8ccccd
00000018 <array_size>:
18: 00000006 .word 0x00000006
0000001c <sign>:
1c: 80000000 .word 0x80000000
00000020 <exp>:
20: 7f800000 .word 0x7f800000
00000024 <man>:
24: 007fffff .word 0x007fffff
00000028 <bf16man>:
28: 007f0000 .word 0x007f0000
0000002c <bf16format>:
2c: ffff0000 .word 0xffff0000
00000030 <_EOL>:
30: 000a .short 0x000a
......
```
### Analysis The Handwritten Assembly
```text
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x32
Start of program headers: 52 (bytes into file)
Start of section headers: 5576 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 2
Size of section headers: 40 (bytes)
Number of section headers: 6
Section header string table index: 5
text data bss dec hex filename
480 0 0 480 1e0 hw2.elf
```
Based on the output from "riscv-none-elf-objdump", the data appears to be stored in the program's ".text" section, rather than the ".data" section. I assume this is because the data is defined using the ".word" directive
### Optimization The Handwritten Assembly
In the original version, ".data" and ".text" were used directly. In the revised version, just by using ".section .data" and ".section .text", the usage of .text was reduced. In any case, ".section" in assembly language provides a way for programmers to explicitly differentiate and control how different parts of the program are laid out in memory and how they are handled by the operating system.
```c
.section .data
dataset1: .word 0x3fcccccd #1.6
dataset2: .word 0xbfc00000 #-1.5
dataset3: .word 0x3fb33333 #1.4
dataset4: .word 0xbfa66666 #-1.3
dataset5: .word 0x3f99999a #1.2
dataset6: .word 0xbf8ccccd #-1.1
array_size: .word 0x00000006
sign: .word 0x80000000
exp: .word 0x7F800000
man: .word 0x007FFFFF
bf16man: .word 0x007F0000
bf16format: .word 0xFFFF0000
_EOL: .string "\n"
.section .text
```
```text=
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x0
Start of program headers: 52 (bytes into file)
Start of section headers: 5176 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 2
Size of section headers: 40 (bytes)
Number of section headers: 6
Section header string table index: 5
text data bss dec hex filename
426 0 0 426 1aa hw2_optimization.elf
```
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::
update c code to count cycle
```clike=
static inline uint32_t read_cycle(void) {
uint32_t cycle;
asm volatile ("rdcycle %0" : "=r" (cycle));
return cycle;
}
```