# Assignment2: GNU Toolchain contributed by < [`jimmylu890303`](https://github.com/jimmylu890303) > ## Question selection I chose the `implementation of multiplication overflow prediction for unsigned integers using CLZ` as my homework assignment 2 from [洪佑杭](https://github.com/hungyuhang). **Motivation:** The main reason why I chose this is that I am curious about how counting leading zeros can be applied to overflow prediction. So I chose this. ## Anaylsis origin code ### Modified the original **C** on the rv32emu Because I need to analyze the performance of his code, I define a extern function(getcycles.s) to get current cycle. - getcycles.s ```code= .text .globl get_cycles .align 2 get_cycles: csrr a1, cycleh csrr a0, cycle csrr a2, cycleh bne a1, a2, get_cycles ret .size get_cycles,.-get_cycles ``` - main.c ```code= #include <stdint.h> #include <stdbool.h> #include <stdio.h> // define extern function get_cycles() in getcycles.S extern uint64_t get_cycles(); ... omission int main() { // Get init cycle before execution uint64_t oldcount = get_cycles(); printf("%d\n", predict_if_mul_overflow(&a_x0, &a_x1)); printf("%d\n", predict_if_mul_overflow(&b_x0, &b_x1)); printf("%d\n", predict_if_mul_overflow(&c_x0, &c_x1)); printf("%d\n", predict_if_mul_overflow(&d_x0, &d_x1)); // Obtain the total number of execution cycles uint64_t cyclecount = get_cycles() - oldcount; printf("cycle count: %u\n", (unsigned int) cyclecount); return 0; } ``` I have defined an external function **get_cycles()** in my C program. Therefore, I need to link main.c and getcycles.s when compiling with GCC. ```code= // Convert getcycles.s to getcycles.o $ riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o getcycles.o getcycles.s // Convert main.c to main.s $ riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -S -o main.s main.c // Convert main.s to main.o $ riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o main.o main.s // Compile & link with main.o and getcycles.o to main.elf $ riscv-none-elf-gcc -o main.elf getcycles.o main.o ``` - Result of execution on rv32emu ```code= $ ~/rv32emu/build/rv32emu main.elf 0 0 1 1 cycle count: 6190 inferior exit code 0 ``` - Disassembly of main.elf ```code= $ riscv-none-elf-objdump -d main.elf // output Disassembly of section .text: 00010094 <exit>: 10094: 1141 add sp,sp,-16 10096: 4581 li a1,0 10098: c422 sw s0,8(sp) 1009a: c606 sw ra,12(sp) 1009c: 842a mv s0,a0 1009e: 6fd000ef jal 10f9a <__call_exitprocs> 100a2: f981a783 lw a5,-104(gp) # 1d7a8 <__stdio_exit_handler> 100a6: c391 beqz a5,100aa <exit+0x16> 100a8: 9782 jalr a5 100aa: 8522 mv a0,s0 100ac: 5b1090ef jal 19e5c <_exit> ... omission 000100c2 <_start>: 100c2: 0000d197 auipc gp,0xd 100c6: 74e18193 add gp,gp,1870 # 1d810 <__global_pointer$> 100ca: f7c18513 add a0,gp,-132 # 1d78c <__bss_start> 100ce: 58c18613 add a2,gp,1420 # 1dd9c <__BSS_END__> 100d2: 8e09 sub a2,a2,a0 100d4: 4581 li a1,0 100d6: 58b000ef jal 10e60 <memset> 100da: 00001517 auipc a0,0x1 100de: d3650513 add a0,a0,-714 # 10e10 <__libc_fini_array> 100e2: 255d jal 10788 <atexit> 100e4: 4c3000ef jal 10da6 <__libc_init_array> 100e8: 4502 lw a0,0(sp) 100ea: 004c add a1,sp,4 100ec: 4601 li a2,0 100ee: 2b79 jal 1068c <main> 100f0: b755 j 10094 <exit> ... omission 00010144 <get_cycles>: 10144: c80025f3 rdcycleh a1 10148: c0002573 rdcycle a0 1014c: c8002673 rdcycleh a2 10150: fec59ae3 bne a1,a2,10144 <get_cycles> 10154: 00008067 ret ``` - Display the main.elf header ```code= $ riscv-none-elf-readelf -h main.elf // output ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100c2 Start of program headers: 52 (bytes into file) Start of section headers: 69612 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` - Program size ```code= $ riscv-none-elf-size main.elf // output text data bss dec hex filename 52564 1928 1548 56040 dae8 main.elf ``` ### Modified the original **Assembly** on the rv32emu I added a `get_cycles` section to the original code in order to analyze its performance, allowing me to retrieve the current cycle. - get_cycles section ```code= get_cycles: csrr a1, cycleh csrr a0, cycle csrr a2, cycleh bne a1, a2, get_cycles ret ``` However, rv32emu doesn't support system calls for printing integers, so I added a **convert_ascii** section to `convert integers to ASCII`. ```code= # Predefine 4-byte buffer .data buffer: .byte 0, 0, 0, 0 # convert number to ascii code # Input: # a0: number # a1: number of bytes in buffer convert_ascii: addi sp, sp, -8 sw a0, 0(sp) # save dividend sw a1, 4(sp) # save counter convert_loop: lw t0, 0(sp) # load dividend li t1, 0 # t1 = quotient li t2, 0 # t2 = reminder lw t3, 4(sp) # load counter li t4, 10 # divisor divide_loop: # if x > 10 => go x-=10 bge t0, t4, divide_subtract # else reminder be found mv t2, t0 j divide_loop_done divide_subtract: # x = x - 10 sub t0, t0, t4 # quotient += 1 addi t1, t1, 1 j divide_loop divide_loop_done: # save quotient to stack sw t1, 0(sp) # convert ascii code this round and save to buffer addi t2, t2, 48 la a0, buffer add a0, a0, t3 addi a0, a0, -1 sb t2, 0(a0) # counter = 0 => exit addi t3, t3, -1 sw t3, 4(sp) beqz t3, convert_loop_done j convert_loop convert_loop_done: addi sp, sp, 8 ret ``` - Result of execution on rv32emu ```code= $ ~/rv32emu/build/rv32emu main.elf 0 0 1 1 cycle count: 876 inferior exit code 0 ``` - Disassembly of main.elf ```code= $ riscv-none-elf-objdump -d main.elf // output main.elf: file format elf32-littleriscv Disassembly of section .text: 00000000 <main>: 0: 100000ef jal 100 <get_cycles> 4: ffc10113 add sp,sp,-4 8: 00a12023 sw a0,0(sp) c: fec10113 add sp,sp,-20 10: 0f0000ef jal 100 <get_cycles> 14: 00a12823 sw a0,16(sp) 18: 00000297 auipc t0,0x0 1c: 3c828293 add t0,t0,968 # 3e0 <cmp_data_1> 20: 00512023 sw t0,0(sp) 24: 00000297 auipc t0,0x0 28: 3cc28293 add t0,t0,972 # 3f0 <cmp_data_2> 2c: 00512223 sw t0,4(sp) 30: 00000297 auipc t0,0x0 34: 3d028293 add t0,t0,976 # 400 <cmp_data_3> 38: 00512423 sw t0,8(sp) 3c: 00000297 auipc t0,0x0 40: 3d428293 add t0,t0,980 # 410 <cmp_data_4> 44: 00512623 sw t0,12(sp) 48: 00400413 li s0,4 4c: 00000493 li s1,0 50: 00010913 mv s2,sp ``` - Display the main.elf header ```code= $ riscv-none-elf-readelf -h main.elf // output ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x0 Start of program headers: 52 (bytes into file) Start of section headers: 6312 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 2 Size of section headers: 40 (bytes) Number of section headers: 7 Section header string table index: 6 ``` - Program size ```code= $ riscv-none-elf-size main.elf // output text data bss dec hex filename 1077 0 0 1077 435 main.elf ``` ### Compare C and Assembly - Program size: The size of the assembly implementation (text size: 1077) is significantly smaller than the size of the C implementation (text size: 52564). The main reason I think why the size of the C implementation is larger is that there are `many libraries included` in the C program. - Cycle count: The cycle count of the assembly implementation(876) is significantly less than the cycle count of the C implementation(6190). The main reason why the cycle count of the C implemtation is more is that `printf` cause many cycles. ## Optimized by riscv-none-elf-gcc ### -O0 Optimized Don't do any optimization, this is the default compile option - size ```code= text data bss dec hex filename 52564 1928 1548 56040 dae8 main_O0.elf ``` - cycle count ```code= $ ~/rv32emu/build/rv32emu main_O0.elf 0 0 1 1 cycle count: 6190 inferior exit code 0 ``` - ELF header ```code= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100c2 Start of program headers: 52 (bytes into file) Start of section headers: 69612 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` - Disassembly code Full file at [objdump_O0.txt](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw2/c/analysis/objdump_O0.txt). ### -O1 Optimized - size ```code= text data bss dec hex filename 51608 1928 1548 55084 d72c main_O1.elf ``` - cycle count ```code= $ ~/rv32emu/build/rv32emu main_O1.elf 0 0 1 1 cycle count: 4524 inferior exit code 0 ``` - ELF header ```code= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100c2 Start of program headers: 52 (bytes into file) Start of section headers: 69612 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` - Disassembly code Full file at [objdump_O1.txt](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw2/c/analysis/objdump_O1.txt). ### -O2 Optimized - size ```code= text data bss dec hex filename 51608 1928 1548 55084 d72c main_O2.elf ``` - cycle count ```code= $ ~/rv32emu/build/rv32emu main_O2.elf 0 0 1 1 cycle count: 4524 inferior exit code 0 ``` - ELF header ```code= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x1016a Start of program headers: 52 (bytes into file) Start of section headers: 69628 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` - Disassembly code Full file at [objdump_O2.txt](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw2/c/analysis/objdump_O2.txt). ### -O3 Optimized - size ```code= text data bss dec hex filename 52258 1928 1548 55734 d9b6 main_O3.elf ``` - cycle count ```code= $ ~/rv32emu/build/rv32emu main_O3.elf 0 0 1 1 cycle count: 4440 inferior exit code 0 ``` - ELF header ```code= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x1016a Start of program headers: 52 (bytes into file) Start of section headers: 69628 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` - Disassembly code Full file at [objdump_O3.txt](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw2/c/analysis/objdump_O3.txt). ### -Ofast Optimized - size ```code= text data bss dec hex filename 52258 1928 1548 55734 d9b6 main_Ofast.elf ``` - cycle count ```code= $ ~/rv32emu/build/rv32emu main_Ofast.elf 0 0 1 1 cycle count: 4440 inferior exit code 0 ``` - ELF header ```code= ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x1016a Start of program headers: 52 (bytes into file) Start of section headers: 69628 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` - Disassembly code Full file at [objdump_Ofast.txt](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw2/c/analysis/objdump_Ofast.txt). ## Rewrite C code and Optimize ### Observation of origin C Code - Origin C code ```code= int main() { // Get init cycle before execution uint64_t oldcount = get_cycles(); printf("%d\n", predict_if_mul_overflow(&a_x0, &a_x1)); printf("%d\n", predict_if_mul_overflow(&b_x0, &b_x1)); printf("%d\n", predict_if_mul_overflow(&c_x0, &c_x1)); printf("%d\n", predict_if_mul_overflow(&d_x0, &d_x1)); // Obtain the total number of execution cycles uint64_t cyclecount = get_cycles() - oldcount; printf("cycle count: %u\n", (unsigned int) cyclecount); return 0; } ``` We have noticed that the printf function is called five times. First, these calls ==consume many cycles== to print values. Second, including the stdio.h library ==increases the size of the .elf file==. Therefore, I decided to optimize the code using ==inline assembly==. ### Rewrite C Code I converted the output of predict_if_mul_overflow() to ASCII and replaced all instances of printf with assembly code. ```code= int main() { uint64_t oldcount = get_cycles(); int predict = (int)predict_if_mul_overflow(&a_x0, &a_x1); char buffer[2]={0,'\n'}; buffer[0] = predict+48; asm("li a0, 1 \n" "mv a1, %0 \n" "li a2, 2 \n" "li a7, 64 \n" "ecall \n" : :"r"(&buffer)); predict = (int)predict_if_mul_overflow(&b_x0, &b_x1); buffer[0] = predict+48; asm("li a0, 1 \n" "mv a1, %0 \n" "li a2, 2 \n" "li a7, 64 \n" "ecall \n" : :"r"(&buffer)); predict = (int)predict_if_mul_overflow(&c_x0, &c_x1); buffer[0] = predict+48; asm("li a0, 1 \n" "mv a1, %0 \n" "li a2, 2 \n" "li a7, 64 \n" "ecall \n" : :"r"(&buffer)); predict = (int)predict_if_mul_overflow(&d_x0, &d_x1); buffer[0] = predict+48; asm("li a0, 1 \n" "mv a1, %0 \n" "li a2, 2 \n" "li a7, 64 \n" "ecall \n" : :"r"(&buffer)); uint64_t cyclecount = get_cycles() - oldcount; char cycle_str[] = "cycle count: "; asm("li a0, 1 \n" "mv a1, %0 \n" "li a2, 13 \n" "li a7, 64 \n" "ecall \n" : :"r"(&cycle_str)); char *ptr = convert_ascii((unsigned int) cyclecount); asm("li a0, 1 \n" "mv a1, %0 \n" "li a2, 5 \n" "li a7, 64 \n" "ecall \n" : :"r"(ptr)); free(ptr); return 0; } ``` ### The Optimized Result of Rewritten C Code Full implementation at [objdump_O0.txt](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw2/c/main_opt.c). #### Compare cycle count - Cycle count of origin C code(-O0 optimized) ```code= $ ~/rv32emu/build/rv32emu main_O0.elf 0 0 1 1 cycle count: 6190 inferior exit code 0 ``` - Cycle count of rewritten C code ```code= $ ~/rv32emu/build/rv32emu main_opt.elf 0 0 1 1 cycle count: 2561 inferior exit code 0 ``` #### Compare code size - Code size of origin C code(-O0 optimized) ```code= text data bss dec hex filename 52564 1928 1548 56040 dae8 main_O0.elf ``` - Code size of rewritten C code ```code= text data bss dec hex filename 8026 1424 860 10310 2846 main_opt.elf ```