Assignment2: GNU Toolchain

# Assignment2: GNU Toolchain contributed by [weiweiwei68](https://github.com/weiweiwei68/RISC-V-Assembly-and-Instruction-Pipeline) In this assignment, i chose the subject ["Multiplication overflow prediction for unsigned int using CLZ"](https://hackmd.io/@hungyuhang/risc-v-hw1) from [洪佑杭](https://github.com/hungyuhang) as basis for my work. ## Motivation The prediction of multiplication is a simple but much applicable. As the author of "Multiplication overflow prediction for unsigned int using CLZ" mentiones in the section ["Compared With the Multiplication Method"]((https://hackmd.io/@hungyuhang/risc-v-hw1#Analysis---Compared-With-the-Multiplication-Method)). This proposal can significantly reduce the cycles for a program to exercute. Since it does not really calculate the exact value of multiplication operation, the application might be strictly restraint. We only need the high computation of this program instead of its functionality, so the time improvement become more important in this situation. ## Rewrite C code The following is the origin C code ```c #include <stdint.h> #include <stdbool.h> // test case a: no overflow, predict result is false uint64_t a_x0 = 0x0000000000000000; uint64_t a_x1 = 0x0000000000000000; // test case b: no overflow, predict result is false uint64_t b_x0 = 0x0000000000000001; uint64_t b_x1 = 0x0000000000000010; // test case c: no overflow, but predict result is true uint64_t c_x0 = 0x0000000000000002; uint64_t c_x1 = 0x4000000000000000; // test case d: overflow, and predict result is true uint64_t d_x0 = 0x0000000000000003; uint64_t d_x1 = 0x7FFFFFFFFFFFFFFF; uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); /* count ones (population count) */ x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (64 - (x & 0x7f)); } bool predict_if_mul_overflow(uint64_t *x0, uint64_t *x1) { int32_t exp_x0 = 63 - (int32_t)count_leading_zeros(*x0); int32_t exp_x1 = 63 - (int32_t)count_leading_zeros(*x1); if ((exp_x0 + 1) + (exp_x1 + 1) >= 64) return true; else return false; } void main() { printf("%d\n", predict_if_mul_overflow(&a_x0, &a_x1)); printf("%d\n", predict_if_mul_overflow(&b_x0, &b_x1)); printf("%d\n", predict_if_mul_overflow(&c_x0, &c_x1)); printf("%d\n", predict_if_mul_overflow(&d_x0, &d_x1)); return; } ``` I just do a minor adjustment to the function called ```predict_if_mul_overflow```. I removed the if-else statement to simplify the upcoming loop. Below is the C code after modification. ```c bool predict_if_mul_overflow(uint64_t *x0, uint64_t *x1) { return ((int32_t)(64 - (int32_t)count_leading_zeros(*x0) - (int32_t)count_leading_zeros(*x1)) >= 0); } ``` ## Improvement of assembly language In this section, i optimized the calculation in the label of "pimo" to reduce the use of registers and instructions for predicting multiplication overflow. For instance, the original version utilized the following instructions to obtain the values of ```exp_x0``` and ```exp_x1```. Ultimately, both values could be placed into the function ```(exp_x0 + 1) + (exp_x1 + 1) >= 64``` to obtain the boolean value. ##### Original version: ``` pimo: addi sp, sp, -20 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) mv s0, a0 # s0 is address of x0 mv s1, a1 # s1 is address of x1 lw a0, 0(s0) lw a1, 4(s0) # a0 a1 is now the value of x0 jal ra, clz li s2, 63 sub s2, s2, a0 # s2 is now exp_x0 lw a0, 0(s1) lw a1, 4(s1) # a1 a0 is now the value of x1 jal ra, clz li s3, 63 sub s3, s3, a0 # s3 is now exp_x1 add s2, s2, s3 addi s2, s2, 2 # s2 is (exp_x0 + 1) + (exp_x1 + 1) li t0, 64 bge s2, t0, pimo_ret_t li a0, 0 # return false j pimo_end ``` In the optimized version, i simplified the function ```(exp_x0 + 1) + (exp_x1 + 1) >= 64``` into ```clz(x0) + clz(x1) <= 64```. As a result, the relative modification in assembly is presented in the following. ##### Optimized version: ``` pimo: addi sp, sp, -20 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) mv s0, a0 # s0 is address of x0 mv s1, a1 # s1 is address of x1 lw a0, 0(s0) lw a1, 4(s0) # a0 a1 is now the value of x0 jal ra, clz mv s2, a0 #s2 is now clz(x0) lw a0, 0(s1) lw a1, 4(s1) # a1 a0 is now the value of x1 jal ra, clz mv s3, a0 # s3 is now clz(x1) add s2, s2, s3 # s2 = clz(x0) + xlz(x1) li t0, 64 bge t0, s2, pimo_ret_t li a0, 0 # return false j pimo_end ``` ##### Complete version The complete version of the assembly language code and the related files required to generate an ELF (Executable and Linkable Format) file can be accessed on [GitHub](https://github.com/weiweiwei68/GNU-Toolchain). :::danger DON'T DO THAT. You shall put the source code into a GitHub repository. HackMD note should keep the critical code instead. :notes: jserv ::: ##### Output ``` $ build/rv32emu tests/pimo2/pimo.elf 0011inferior exit code 0 ``` **Aseembly language** 1. Size ``` $ riscv-none-elf-size pimo.elf text data bss dec hex filename 760 0 0 760 2f8 pimo.elf ``` 2. ELF file header ``` $ riscv-none-elf-readelf -h pimo.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x0 Start of program headers: 52 (bytes into file) Start of section headers: 5556 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 2 Size of section headers: 40 (bytes) Number of section headers: 6 Section header string table index: 5 ``` ## Optimization In this section, i will compile both the original code and the modified code and then compare the different between them. The website about ["optimization"](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) provides a wealth of optimization options that can be applied to enhance the performance of our code. We can utilize the information from the provided link to explore various optimization options available for the GCC compiler, allowing us to make exact decisions about how to improve the efficiency of our code. After compiling the code with different optimization option flags, we can analyze factors such as execution time and code size to determine the impact of different optimization. :::warning Improve your English writing. Utilize ChatGPT or [Quillbot](https://quillbot.com/). :notes: jserv ::: **O0 optimization of original code** 1. Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 original.c -o original.elf ``` 2. Size ``` $ricev-none-elf-size original.elf text data bss dec hex filename 76688 2372 1548 80608 13ae0 original.elf ``` 3. ELF file header ``` $ riscv-none-elf-readelf -h original.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100d8 Start of program headers: 52 (bytes into file) Start of section headers: 94772 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` **O0 optimization of modified code** 1. Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 modification.c -o modification.elf ``` 2. Size ``` $ riscv-none-elf-size modification.elf text data bss dec hex filename 76664 2372 1548 80584 13ac8 modification.elf ``` 3. ELF file header ``` $ riscv-none-elf-readelf -h modification.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100d8 Start of program headers: 52 (bytes into file) Start of section headers: 94776 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` **Comparison** | filename | text | data | bss | dec | hex | | ---- | ---- | ---- | ---- | ---- | ------| | original.elf | 76688 | 2372 | 1548 | 80608 | 13ae0 | | modification.elf | 76664 | 2372 | 1548 | 80584 | 13ae8 | Form the table provided above, we can observe that the size of the text section has slightly decreased after the modification, which corresponds to the optimization of the C code mentioned before. Based on the result, it is evident that the improvement made to the original C code has indeed contributed to reducing the code size. This reduction in code size can have various benefits, such as improved memory usage and potentially faster execution times. ## Different level of optimizations In this section, i will utilize different optimization levels, including ```-O1```, ```-O2```, ```-O3```, ```-Ofast```, and ```-Os```, to investigate the relationship between code size and the level of optimization. By varying the optimization levels in the compilation, we can analyze how each level affects the resulting code size. **O1 optimizations** ``` $ riscv-none-elf-size modification_O1.elf text data bss dec hex filename 75780 2372 1548 79700 13754 modification_O1.elf ``` **O2 optimizations** ``` $ riscv-none-elf-size modification_O2.elf text data bss dec hex filename 75776 2372 1548 79696 13750 modification_O2.elf ``` **O3 optimizations** ``` $ riscv-none-elf-size modification_O3.elf text data bss dec hex filename 76428 2372 1548 80348 139dc modification_O3.elf ``` **Ofast optimizations** ``` $ riscv-none-elf-size modification_Ofast.elf text data bss dec hex filename 76428 2372 1548 80348 139dc modification_Ofast.elf ``` **Os optimizations** ``` $ riscv-none-elf-size modification_Os.elf text data bss dec hex filename 75776 2372 1548 79696 13750 modification_Os.elf ``` | Level | Assemly | -O0 | -O1 | -O2 | -O3 | -Ofast | -Os | | -------- | -------- | ----- |-----|-----|-----|--------|-----| | Text size | :bulb:760 | 76664 |75780|75776|76428|76428 |75776| :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv ::: ## Execution time In this section, i will employ the [tick.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) for the statistics of my program's execution. **O0 optimizations** ``` $ build/rv32emu tests/hw2/csr.elf 0 0 1 1 elapsed cycle: 602060 inferior exit code 22 ``` **O1 optimizations** ``` $ build/rv32emu tests/hw2/csr_o1.elf 0 0 1 1 elapsed cycle: 351760 inferior exit code 22 ``` **O2 optimizations** ``` $ build/rv32emu tests/hw2/csr_o2.elf 0 0 1 1 elapsed cycle: 136942 inferior exit code 22 ``` **O3 optimizations** ``` $ build/rv32emu tests/hw2/csr_o3.elf 0 0 1 1 elapsed cycle: 151043 inferior exit code 22 ``` **Ofast optimizations** ``` $ build/rv32emu tests/hw2/csr_ofast.elf 0 0 1 1 elapsed cycle: 151043 inferior exit code 22 ``` **Os optimizations** ``` $ build/rv32emu tests/hw2/csr_os.elf 0 0 1 1 elapsed cycle: 182657 inferior exit code 22 ``` | Level | -O0 | -O1 | -O2 | -O3 | -Ofast | -Os | | -------- | ------ |-------|-------------|-------|----------|-------| | Cycle | 602,060|351,760|:bulb:136,942|151,043|151,043 |182,657|