# Assignment2: GNU Toolchain
contributed by < [aa860630](https://github.com/aa860630/2023-computer-architecture.git) >
## Rewrite [Multiplication Overflow Prediction](https://github.com/hungyuhang/computer_architecture_2023/tree/main/hw1)
The reason I chose this question from [洪佑杭](https://hackmd.io/@hungyuhang/risc-v-hw1) is because I have experience with floating point multiplication, and overflow has always been one of the biggest concerns. After reviewing the code written by others, I found that the author's method of detecting overflow requires minimal computation, albeit at the expense of slight accuracy. Overall, it is an efficient and precise approach.
### C code
In the original C code, "getcycle" instructions were inserted at the beginning and end to obtain the total cycle count of program execution.
```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
extern uint64_t get_cycles();
// test case a: no overflow, predict result is false
uint64_t a_x0 = 0x0000000000000000;
uint64_t a_x1 = 0x0000000000000000;
// test case b: no overflow, predict result is false
uint64_t b_x0 = 0x0000000000000001;
uint64_t b_x1 = 0x0000000000000010;
// test case c: no overflow, but predict result is true
uint64_t c_x0 = 0x0000000000000002;
uint64_t c_x1 = 0x4000000000000000;
// test case d: overflow, and predict result is true
uint64_t d_x0 = 0x0000000000000003;
uint64_t d_x1 = 0x7FFFFFFFFFFFFFFF;
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
bool predict_if_mul_overflow(uint64_t *x0, uint64_t *x1)
{
int32_t exp_x0 = 63 - (int32_t)count_leading_zeros(*x0);
int32_t exp_x1 = 63 - (int32_t)count_leading_zeros(*x1);
if ((exp_x0 + 1) + (exp_x1 + 1) >= 64)
return true;
else
return false;
}
int main()
{
uint64_t oldcount = get_cycles();
printf("%d\n", predict_if_mul_overflow(&a_x0, &a_x1));
printf("%d\n", predict_if_mul_overflow(&b_x0, &b_x1));
printf("%d\n", predict_if_mul_overflow(&c_x0, &c_x1));
printf("%d\n", predict_if_mul_overflow(&d_x0, &d_x1));
uint64_t cyclecount = get_cycles() - oldcount;
printf("cycle count: %u\n", (unsigned int) cyclecount);
return 0;
}
```
## Assembly code
When transitioning the original assembly code from ```Ripes``` to ```rv32emu```, some modifications are necessary. In ```rv32emu```, it can only output characters. Therefore, if you want to output a decimal number, you need to split the output into 8-bit segments and then add ```48``` to each segment to match the ASCII code in order to successfully output the desired number. When executing an ecall, the registers that need to be modified include ```a0```, ```a1```, ```a2```, and ```a7```. The content of ```a1``` should represent the starting position for the output, ```a2``` should store the output length, and ```a7``` should be set to SYSWRITE(64).
The following program demonstrates the process of converting cycle counts to decimal.
```c
to_deci:
sub s10, s11, s10
la s9,cycle_address
add t0, x0, x0
keep1:
addi s10, s10, -1000 # t0 keep minus 1000
blt s10, x0, nega1 # if t0 0 jump to nega
addi t0, t0, 1
j keep1
nega1:
addi s10, s10, 1000 # plus 1000 to take the number that (original_number%10)
addi t0, t0, 48 # fit in ascii
sb t0, 0(s9)
addi s9, s9, 1
add t0, x0, x0
keep2:
addi s10, s10, -100
bltz s10, nega2
addi t0, t0, 1
j keep2
nega2:
addi s10, s10, 100 # plus 100 to take the number that (original_number%10)
addi t0, t0, 48
sb t0, 0(s9)
addi s9, s9, 1
add t0, x0, x0
keep3:
addi s10, s10, -10
bltz s10, nega3
addi t0, t0, 1
j keep3
nega3:
addi s10, s10, 10 # plus 10 to take the number that (original_number%10)
addi t0, t0, 48
sb t0, 0(s9)
addi s9, s9, 1
add t0, x0, x0
keep4:
addi s10, s10, -1
bltz s10, nega4
addi t0, t0, 1
j keep4
nega4:
addi s10, s10, 1 # plus 1 to take the number that (original_number%10)
addi t0, t0, 48
sb t0, 0(s9)
li a0, 1
la a1, cycle_address
li a2, 5
li a7, SYSWRITE # tell ecall to print char
ecall
li a0, 1
la a1, next_line
li a2, 1
li a7, SYSWRITE # tell ecall to print char
ecall
ret
```
The total cycle is :
<s>

</s>
:::warning
:warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text.
:notes: jserv
:::
## Using GNU Toolchain
You can use some commands from the GNU Toolchain to inspect information about different programs. Here are a few commonly used commands as examples:
Display the ELF file header
```
$ riscv-none-elf-readelf -h main.elf
```
Expected output:

Display the assembler mnemonics for the machine instructions
```
$ riscv-none-elf-objdump -d build/hello.elf
```
Expected output:
```
main.elf: file format elf32-littleriscv
Disassembly of section .text:
00010094 <exit>:
10094: 1141 add sp,sp,-16
10096: 4581 li a1,0
10098: c422 sw s0,8(sp)
1009a: c606 sw ra,12(sp)
1009c: 842a mv s0,a0
1009e: 6fd000ef jal 10f9a <__call_exitprocs>
100a2: f981a783 lw a5,-104(gp) # 1d7a8 <__stdio_exit_handler>
100a6: c391 beqz a5,100aa <exit+0x16>
100a8: 9782 jalr a5
100aa: 8522 mv a0,s0
100ac: 5b1090ef jal 19e5c <_exit>
...
```
Through these commands, you can gain a clearer understanding of how the original C code is compiled into assembly language, and use it to compare with the assembly you've written.
At the same time, by changing the optimizer in the ```MAKEFILE```, you can make the original program faster. Below are demonstrations of O1, O2, O3, and Ofast.
### Original
* Program size
```
text data bss dec hex filename
52564 1928 1548 56040 dae8 main.elf
```
* cycle count
```
cycle count: 6190
```
### O1 Optimized
* Program size
```
text data bss dec hex filename
51608 1928 1548 55084 d72c main.elf
```
* cycle count
```
cycle count: 4524
```
### O2 Optimized
* Program size
```
text data bss dec hex filename
51608 1928 1548 55084 d72c main.elf
```
* cycle count
```
cycle count: 4524
```
### O3 Optimized
* Program size
```
text data bss dec hex filename
52258 1928 1548 55734 d9b6 main.elf
```
* cycle count
```
cycle count: 4440
```
### Ofast Optimized
* Program size
```
text data bss dec hex filename
52258 1928 1548 55734 d9b6 main.elf
```
* cycle count
```
cycle count: 4440
```
## Comparing C code and assembly code
Even with the use of the fastest optimizer to speed up the C code, the cycle count (4440) is still significantly higher than the cycle count of the assembly code (874), indicating that even with optimized instructions in C, the output performance will not surpass that of riscv assembly.