# Assignment2: RISC-V Toolchain Contributed by [raphael89918](https://github.com/raphael89918) > [source code](https://github.com/raphael89918/Computer_Architecture_11201/tree/main/homework_2) ## Pick a Question The following question is picked from the Assignment 1 > 陳燦仁 Shell sort with FP32 in BF16 format > Target to sort float number list with shell sort, and using bf16 format to sort instead of using normal calculate. For Example. float number list 1.6, -1.5, 1.4, -1.3, 1.2, -1.1. list length is 6,so the gap will be 3. after the first round, gap will divide 2 = 1 until gap = 0. gap = 3 : final result : -1.3, -1.5, -1.1, 1.6, 1.2, 1.4. gap = 1 : final result : -1.5, -1.3, -1.1, 1.2, 1.4, 1.6. [Source code](https://github.com/TRChen11011/Shell-sort-with-FP32-in-BF16-format/tree/main) The original implementation of the questions is as follows. ```c #include <stdio.h> #include <stdbool.h> #define array_size 6 unsigned int fp32_to_bf16(float x) { float y = x; int *p = (int *) &y; unsigned int exp = *p & 0x7F800000; unsigned int man = *p & 0x007FFFFF; if (exp == 0 && man == 0) /* zero */ return *p; if (exp == 0x7F800000) /* infinity or NaN */ return *p; float r = x; int *pr = (int *) &r; *pr &= 0xFF800000; /* r has the same exp as x */ r /= 0x100; y = x + r; *p &= 0xFFFF0000; return *p; } bool BOS(float fp32_1, float fp32_2){ unsigned int bf16_1, bf16_2, sig1, sig2, exp1, exp2, man1, man2; bf16_1 = fp32_to_bf16(fp32_1); bf16_2 = fp32_to_bf16(fp32_2); sig1 = bf16_1 & 0x80000000; sig2 = bf16_2 & 0x80000000; exp1 = bf16_1 & 0x7F800000; exp2 = bf16_2 & 0x7F800000; man1 = bf16_1 & 0x007F0000; man2 = bf16_2 & 0x007F0000; if (sig1 < sig2) return 1; else if (sig1 == 0 && sig1 == sig2 && exp1 > exp2) return 1; else if (sig1 != 0 && sig1 == sig2 && exp1 < exp2) return 1; else if (sig1 == 0 && sig1 == sig2 && exp1 == exp2 && man1 > man2) return 1; else if (sig1 != 0 && sig1 == sig2 && exp1 == exp2 && man1 < man2) return 1; else return 0; } void ShellSort(float array[array_size]){ int interval = array_size / 2; while (interval >0){ for(int i = interval;i<array_size;i++){ int j = i; float temp = array[i]; int Flag = BOS(array[j-interval], temp); while (j>= interval && Flag == 1){ array[j] = array[j-interval]; j = j-interval; Flag = BOS(array[j-interval], temp); } array[j] = temp; } interval = interval / 2; } } void main() { float array[array_size] = {1.6,-1.5,1.4,-1.3,1.2,-1.1}; ShellSort(array); for (int i = 0;i<array_size;i++){ printf("%x\n",fp32_to_bf16(array[i])); } return 0; } ``` I chose this code simply because I am interested in the [Shell Sort](https://hackmd.io/@Aquamay/H1nxBOLcO/https%3A%2F%2Fhackmd.io%2F%40Aquamay%2FrkgO8fpcu). Compared to bubble sort, heap sort, and others, I haven't used this sorting method yet. The official documentation of GCC provides a detailed explanation of these optimization options. [here](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) ### gcc O0 optimization ```text= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100d8 Start of program headers: 52 (bytes into file) Start of section headers: 98776 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 text data bss dec hex filename 78976 2320 1528 82824 14388 hw1.elf ``` #### RDCYCLE ```text= bfc00000 bfa60000 bf8d0000 3f9a0000 3fb30000 3fcd0000 Start Cycle: 110 End Cycle: 21535 Cycles Elapsed: 21425 inferior exit code 0 ``` :::warning :warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text. :notes: jserv ::: ### gcc O1 optimization ```text ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2s complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100d8 Start of program headers: 52 (bytes into file) Start of section headers: 98780 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 text data bss dec hex filename 78216 2324 1528 82068 14094 hw1.elf ``` #### RDCYCLE ```text= bfc00000 bfa60000 bf8d0000 3f9a0000 3fb30000 3fcd0000 Start Cycle: 105 End Cycle: 16471 Cycles Elapsed: 16366 inferior exit code ``` I guess using the O1 optimizer is not necessary for the program to be optimized. ### gcc O2 optimization ```text ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x101f0 Start of program headers: 52 (bytes into file) Start of section headers: 98796 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 text data bss dec hex filename 78488 2324 1528 82340 141a4 hw1.elf ``` #### RDCYCLE ```text= bfc00000 bfa60000 bf8d0000 3f9a0000 3fb30000 3fcd0000 Start Cycle: 105 End Cycle: 16224 Cycles Elapsed: 16119 inferior exit code 0 ``` ### gcc O3 optimization ```text= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x101f0 Start of program headers: 52 (bytes into file) Start of section headers: 98796 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 text data bss dec hex filename 79012 2324 1528 82864 143b0 hw1.elf ``` #### RDCYCLE ```text= bfc00000 bfa60000 bf8d0000 3f9a0000 3fb30000 3fcd0000 Start Cycle: 105 End Cycle: 16302 Cycles Elapsed: 16197 inferior exit code 0 ``` ### gcc Os optimization ```text= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10144 Start of program headers: 52 (bytes into file) Start of section headers: 98796 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 text data bss dec hex filename 78124 2324 1528 81976 14038 hw1.elf ``` #### RDCYCLE ```text= bfc00000 bfa60000 bf8d0000 3f9a0000 3fb30000 3fcd0000 Start Cycle: 105 End Cycle: 16526 Cycles Elapsed: 16421 inferior exit code 0 ``` ### gcc Ofast optimization ```text= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x101f0 Start of program headers: 52 (bytes into file) Start of section headers: 98796 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 text data bss dec hex filename 79068 2324 1528 82920 143e8 hw1.elf ``` #### RDCYCLE ```text= bfc00000 bfa60000 bf8d0000 3f9a0000 3fb30000 3fcd0000 Start Cycle: 105 End Cycle: 16324 Cycles Elapsed: 16219 inferior exit code 0 ``` 1. The text section with -Ofast is larger than with -Os. This is likely due to -Ofast performing more optimizations, such as inlining, loop unrolling, etc., with the expectation of achieving better runtime performance, but at the cost of increased code size. 2. The data and bss sections are the same in both cases, which implies that the optimizations did not change the layout or size of static data. 3. The ELF file size generated by -Os is smaller than that by -Ofast, which is as expected since the primary goal of -Os is to reduce code size. ## The Hand Written Assembly The hand written assembly looks like this ```c .data dataset1: .word 0x3fcccccd #1.6 dataset2: .word 0xbfc00000 #-1.5 dataset3: .word 0x3fb33333 #1.4 dataset4: .word 0xbfa66666 #-1.3 dataset5: .word 0x3f99999a #1.2 dataset6: .word 0xbf8ccccd #-1.1 array_size: .word 0x00000006 sign: .word 0x80000000 exp: .word 0x7F800000 man: .word 0x007FFFFF bf16man: .word 0x007F0000 bf16format: .word 0xFFFF0000 _EOL: .string "\n" .text main: addi sp,sp,-24 lw s0, dataset1 lw s1, dataset2 lw s2, dataset3 lw s3, dataset4 lw s4, dataset5 lw s5, dataset6 lw a0, sign lw a1, exp lw a2, man lw a3, bf16man lw a4, bf16format sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) lw s11, array_size jal ShellSort ShellSort: lw t0, array_size srli t0, t0, 1 Whileinterval: beq t0, zero, Print add t1, zero, t0 Foriarraysize: beq t1, s11, Interval_div2 add t2, zero, t1 slli s0, t1, 2 add s1, sp, s0 lw t5, 0(s1) ReadData_Temp_done: sub t3, t2, t0 slli s0, t3, 2 add s1, sp, s0 lw t4, 0(s1) jal fp32_to_bf16 WhilejandFlag: beq s10, zero, arrayjtotemp bltu t2, t0, arrayjtotemp and t4, t4, a4 slli s0, t2, 2 add s1, sp, s0 sw t4, 0(s1) sub t2, t2, t0 jal ReadData_Temp_done arrayjtotemp: and t5, t5, a4 slli s0, t2, 2 add s1, sp, s0 sw t5, 0(s1) addi t1, t1, 1 jal Foriarraysize Interval_div2: srli t0, t0, 1 jal Whileinterval fp32_to_bf16: and s0, a0, t4 and s1, a0, t5 and s2, a1, t4 and s3, a1, t5 and s4, a2, t4 and s5, a2, t5 and s6, a3, t4 and s7, a3, t5 jal BOS ret BOS: blt s0, s1, SMALL beq s0, s1, Sigsame jal BIG Sigsame: beq s2, s3, Expsame beq s0, zero, Exp1 jal Exp2 Expsame: beq s0, zero, Man1 jal Man2 Exp1: blt s2, s3, BIG jal SMALL Exp2: blt s2, s3, SMALL jal BIG Man1: blt s6, s7, SMALL jal BIG Man2: blt s6, s7, BIG jal SMALL BIG: li s10, 1 jalr ra SMALL: li s10, 0 jalr ra Print: slli s0, t6, 2 add s1, sp, s0 lw a0, 0(s1) li a7, 2 ecall addi t6, t6, 1 la a0, _EOL li a7, 4 ecall beq t6, s11, End jal Print End: # Exit code can be placed here li a7, 10 ecall ``` ### Compiling The Handwritten Assembly using RISC-V ```shell $ riscv-none-elf-objdump -d hw2.elf hw2.elf: file format elf32-littleriscv Disassembly of section .text: 00000000 <dataset1>: 0: 3fcccccd .word 0x3fcccccd 00000004 <dataset2>: 4: bfc00000 .word 0xbfc00000 00000008 <dataset3>: 8: 3fb33333 .word 0x3fb33333 0000000c <dataset4>: c: bfa66666 .word 0xbfa66666 00000010 <dataset5>: 10: 3f99999a .word 0x3f99999a 00000014 <dataset6>: 14: bf8ccccd .word 0xbf8ccccd 00000018 <array_size>: 18: 00000006 .word 0x00000006 0000001c <sign>: 1c: 80000000 .word 0x80000000 00000020 <exp>: 20: 7f800000 .word 0x7f800000 00000024 <man>: 24: 007fffff .word 0x007fffff 00000028 <bf16man>: 28: 007f0000 .word 0x007f0000 0000002c <bf16format>: 2c: ffff0000 .word 0xffff0000 00000030 <_EOL>: 30: 000a .short 0x000a ...... ``` ### Analysis The Handwritten Assembly ```text ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x32 Start of program headers: 52 (bytes into file) Start of section headers: 5576 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 2 Size of section headers: 40 (bytes) Number of section headers: 6 Section header string table index: 5 text data bss dec hex filename 480 0 0 480 1e0 hw2.elf ``` Based on the output from "riscv-none-elf-objdump", the data appears to be stored in the program's ".text" section, rather than the ".data" section. I assume this is because the data is defined using the ".word" directive ### Optimization The Handwritten Assembly In the original version, ".data" and ".text" were used directly. In the revised version, just by using ".section .data" and ".section .text", the usage of .text was reduced. In any case, ".section" in assembly language provides a way for programmers to explicitly differentiate and control how different parts of the program are laid out in memory and how they are handled by the operating system. ```c .section .data dataset1: .word 0x3fcccccd #1.6 dataset2: .word 0xbfc00000 #-1.5 dataset3: .word 0x3fb33333 #1.4 dataset4: .word 0xbfa66666 #-1.3 dataset5: .word 0x3f99999a #1.2 dataset6: .word 0xbf8ccccd #-1.1 array_size: .word 0x00000006 sign: .word 0x80000000 exp: .word 0x7F800000 man: .word 0x007FFFFF bf16man: .word 0x007F0000 bf16format: .word 0xFFFF0000 _EOL: .string "\n" .section .text ``` ```text= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x0 Start of program headers: 52 (bytes into file) Start of section headers: 5176 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 2 Size of section headers: 40 (bytes) Number of section headers: 6 Section header string table index: 5 text data bss dec hex filename 426 0 0 426 1aa hw2_optimization.elf ``` :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv ::: update c code to count cycle ```clike= static inline uint32_t read_cycle(void) { uint32_t cycle; asm volatile ("rdcycle %0" : "=r" (cycle)); return cycle; } ```