# Assignment2: GNU Toolchain contributed by < [aa860630](https://github.com/aa860630/2023-computer-architecture.git) > ## Rewrite [Multiplication Overflow Prediction](https://github.com/hungyuhang/computer_architecture_2023/tree/main/hw1) The reason I chose this question from [洪佑杭](https://hackmd.io/@hungyuhang/risc-v-hw1) is because I have experience with floating point multiplication, and overflow has always been one of the biggest concerns. After reviewing the code written by others, I found that the author's method of detecting overflow requires minimal computation, albeit at the expense of slight accuracy. Overall, it is an efficient and precise approach. ### C code In the original C code, "getcycle" instructions were inserted at the beginning and end to obtain the total cycle count of program execution. ```c #include <stdint.h> #include <stdbool.h> #include <stdio.h> extern uint64_t get_cycles(); // test case a: no overflow, predict result is false uint64_t a_x0 = 0x0000000000000000; uint64_t a_x1 = 0x0000000000000000; // test case b: no overflow, predict result is false uint64_t b_x0 = 0x0000000000000001; uint64_t b_x1 = 0x0000000000000010; // test case c: no overflow, but predict result is true uint64_t c_x0 = 0x0000000000000002; uint64_t c_x1 = 0x4000000000000000; // test case d: overflow, and predict result is true uint64_t d_x0 = 0x0000000000000003; uint64_t d_x1 = 0x7FFFFFFFFFFFFFFF; uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); /* count ones (population count) */ x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (64 - (x & 0x7f)); } bool predict_if_mul_overflow(uint64_t *x0, uint64_t *x1) { int32_t exp_x0 = 63 - (int32_t)count_leading_zeros(*x0); int32_t exp_x1 = 63 - (int32_t)count_leading_zeros(*x1); if ((exp_x0 + 1) + (exp_x1 + 1) >= 64) return true; else return false; } int main() { uint64_t oldcount = get_cycles(); printf("%d\n", predict_if_mul_overflow(&a_x0, &a_x1)); printf("%d\n", predict_if_mul_overflow(&b_x0, &b_x1)); printf("%d\n", predict_if_mul_overflow(&c_x0, &c_x1)); printf("%d\n", predict_if_mul_overflow(&d_x0, &d_x1)); uint64_t cyclecount = get_cycles() - oldcount; printf("cycle count: %u\n", (unsigned int) cyclecount); return 0; } ``` ## Assembly code When transitioning the original assembly code from ```Ripes``` to ```rv32emu```, some modifications are necessary. In ```rv32emu```, it can only output characters. Therefore, if you want to output a decimal number, you need to split the output into 8-bit segments and then add ```48``` to each segment to match the ASCII code in order to successfully output the desired number. When executing an ecall, the registers that need to be modified include ```a0```, ```a1```, ```a2```, and ```a7```. The content of ```a1``` should represent the starting position for the output, ```a2``` should store the output length, and ```a7``` should be set to SYSWRITE(64). The following program demonstrates the process of converting cycle counts to decimal. ```c to_deci: sub s10, s11, s10 la s9,cycle_address add t0, x0, x0 keep1: addi s10, s10, -1000 # t0 keep minus 1000 blt s10, x0, nega1 # if t0 0 jump to nega addi t0, t0, 1 j keep1 nega1: addi s10, s10, 1000 # plus 1000 to take the number that (original_number%10) addi t0, t0, 48 # fit in ascii sb t0, 0(s9) addi s9, s9, 1 add t0, x0, x0 keep2: addi s10, s10, -100 bltz s10, nega2 addi t0, t0, 1 j keep2 nega2: addi s10, s10, 100 # plus 100 to take the number that (original_number%10) addi t0, t0, 48 sb t0, 0(s9) addi s9, s9, 1 add t0, x0, x0 keep3: addi s10, s10, -10 bltz s10, nega3 addi t0, t0, 1 j keep3 nega3: addi s10, s10, 10 # plus 10 to take the number that (original_number%10) addi t0, t0, 48 sb t0, 0(s9) addi s9, s9, 1 add t0, x0, x0 keep4: addi s10, s10, -1 bltz s10, nega4 addi t0, t0, 1 j keep4 nega4: addi s10, s10, 1 # plus 1 to take the number that (original_number%10) addi t0, t0, 48 sb t0, 0(s9) li a0, 1 la a1, cycle_address li a2, 5 li a7, SYSWRITE # tell ecall to print char ecall li a0, 1 la a1, next_line li a2, 1 li a7, SYSWRITE # tell ecall to print char ecall ret ``` The total cycle is : <s> ![](https://hackmd.io/_uploads/Bk20N3Lfa.png) </s> :::warning :warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text. :notes: jserv ::: ## Using GNU Toolchain You can use some commands from the GNU Toolchain to inspect information about different programs. Here are a few commonly used commands as examples: Display the ELF file header ``` $ riscv-none-elf-readelf -h main.elf ``` Expected output: ![](https://hackmd.io/_uploads/SkiPLSUf6.png) Display the assembler mnemonics for the machine instructions ``` $ riscv-none-elf-objdump -d build/hello.elf ``` Expected output: ``` main.elf: file format elf32-littleriscv Disassembly of section .text: 00010094 <exit>: 10094: 1141 add sp,sp,-16 10096: 4581 li a1,0 10098: c422 sw s0,8(sp) 1009a: c606 sw ra,12(sp) 1009c: 842a mv s0,a0 1009e: 6fd000ef jal 10f9a <__call_exitprocs> 100a2: f981a783 lw a5,-104(gp) # 1d7a8 <__stdio_exit_handler> 100a6: c391 beqz a5,100aa <exit+0x16> 100a8: 9782 jalr a5 100aa: 8522 mv a0,s0 100ac: 5b1090ef jal 19e5c <_exit> ... ``` Through these commands, you can gain a clearer understanding of how the original C code is compiled into assembly language, and use it to compare with the assembly you've written. At the same time, by changing the optimizer in the ```MAKEFILE```, you can make the original program faster. Below are demonstrations of O1, O2, O3, and Ofast. ### Original * Program size ``` text data bss dec hex filename 52564 1928 1548 56040 dae8 main.elf ``` * cycle count ``` cycle count: 6190 ``` ### O1 Optimized * Program size ``` text data bss dec hex filename 51608 1928 1548 55084 d72c main.elf ``` * cycle count ``` cycle count: 4524 ``` ### O2 Optimized * Program size ``` text data bss dec hex filename 51608 1928 1548 55084 d72c main.elf ``` * cycle count ``` cycle count: 4524 ``` ### O3 Optimized * Program size ``` text data bss dec hex filename 52258 1928 1548 55734 d9b6 main.elf ``` * cycle count ``` cycle count: 4440 ``` ### Ofast Optimized * Program size ``` text data bss dec hex filename 52258 1928 1548 55734 d9b6 main.elf ``` * cycle count ``` cycle count: 4440 ``` ## Comparing C code and assembly code Even with the use of the fastest optimizer to speed up the C code, the cycle count (4440) is still significantly higher than the cycle count of the assembly code (874), indicating that even with optimized instructions in C, the output performance will not surpass that of riscv assembly.