# Assignment2: GNU Toolchain
contributed by [weiweiwei68](https://github.com/weiweiwei68/RISC-V-Assembly-and-Instruction-Pipeline)
In this assignment, i chose the subject ["Multiplication overflow prediction for unsigned int using CLZ"](https://hackmd.io/@hungyuhang/risc-v-hw1) from [洪佑杭](https://github.com/hungyuhang) as basis for my work.
## Motivation
The prediction of multiplication is a simple but much applicable. As the author of "Multiplication overflow prediction for unsigned int using CLZ" mentiones in the section ["Compared With the Multiplication Method"]((https://hackmd.io/@hungyuhang/risc-v-hw1#Analysis---Compared-With-the-Multiplication-Method)). This proposal can significantly reduce the cycles for a program to exercute. Since it does not really calculate the exact value of multiplication operation, the application might be strictly restraint. We only need the high computation of this program instead of its functionality, so the time improvement become more important in this situation.
## Rewrite C code
The following is the origin C code
```c
#include <stdint.h>
#include <stdbool.h>
// test case a: no overflow, predict result is false
uint64_t a_x0 = 0x0000000000000000;
uint64_t a_x1 = 0x0000000000000000;
// test case b: no overflow, predict result is false
uint64_t b_x0 = 0x0000000000000001;
uint64_t b_x1 = 0x0000000000000010;
// test case c: no overflow, but predict result is true
uint64_t c_x0 = 0x0000000000000002;
uint64_t c_x1 = 0x4000000000000000;
// test case d: overflow, and predict result is true
uint64_t d_x0 = 0x0000000000000003;
uint64_t d_x1 = 0x7FFFFFFFFFFFFFFF;
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
bool predict_if_mul_overflow(uint64_t *x0, uint64_t *x1)
{
int32_t exp_x0 = 63 - (int32_t)count_leading_zeros(*x0);
int32_t exp_x1 = 63 - (int32_t)count_leading_zeros(*x1);
if ((exp_x0 + 1) + (exp_x1 + 1) >= 64)
return true;
else
return false;
}
void main()
{
printf("%d\n", predict_if_mul_overflow(&a_x0, &a_x1));
printf("%d\n", predict_if_mul_overflow(&b_x0, &b_x1));
printf("%d\n", predict_if_mul_overflow(&c_x0, &c_x1));
printf("%d\n", predict_if_mul_overflow(&d_x0, &d_x1));
return;
}
```
I just do a minor adjustment to the function called ```predict_if_mul_overflow```. I removed the if-else statement to simplify the upcoming loop. Below is the C code after modification.
```c
bool predict_if_mul_overflow(uint64_t *x0, uint64_t *x1)
{
return ((int32_t)(64 - (int32_t)count_leading_zeros(*x0) - (int32_t)count_leading_zeros(*x1)) >= 0);
}
```
## Improvement of assembly language
In this section, i optimized the calculation in the label of "pimo" to reduce the use of registers and instructions for predicting multiplication overflow.
For instance, the original version utilized the following instructions to obtain the values of ```exp_x0``` and ```exp_x1```. Ultimately, both values could be placed into the function ```(exp_x0 + 1) + (exp_x1 + 1) >= 64``` to obtain the boolean value.
##### Original version:
```
pimo:
addi sp, sp, -20
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
mv s0, a0 # s0 is address of x0
mv s1, a1 # s1 is address of x1
lw a0, 0(s0)
lw a1, 4(s0) # a0 a1 is now the value of x0
jal ra, clz
li s2, 63
sub s2, s2, a0 # s2 is now exp_x0
lw a0, 0(s1)
lw a1, 4(s1) # a1 a0 is now the value of x1
jal ra, clz
li s3, 63
sub s3, s3, a0 # s3 is now exp_x1
add s2, s2, s3
addi s2, s2, 2 # s2 is (exp_x0 + 1) + (exp_x1 + 1)
li t0, 64
bge s2, t0, pimo_ret_t
li a0, 0 # return false
j pimo_end
```
In the optimized version, i simplified the function ```(exp_x0 + 1) + (exp_x1 + 1) >= 64``` into ```clz(x0) + clz(x1) <= 64```. As a result, the relative modification in assembly is presented in the following.
##### Optimized version:
```
pimo:
addi sp, sp, -20
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
mv s0, a0 # s0 is address of x0
mv s1, a1 # s1 is address of x1
lw a0, 0(s0)
lw a1, 4(s0) # a0 a1 is now the value of x0
jal ra, clz
mv s2, a0 #s2 is now clz(x0)
lw a0, 0(s1)
lw a1, 4(s1) # a1 a0 is now the value of x1
jal ra, clz
mv s3, a0 # s3 is now clz(x1)
add s2, s2, s3 # s2 = clz(x0) + xlz(x1)
li t0, 64
bge t0, s2, pimo_ret_t
li a0, 0 # return false
j pimo_end
```
##### Complete version
The complete version of the assembly language code and the related files required to generate an ELF (Executable and Linkable Format) file can be accessed on [GitHub](https://github.com/weiweiwei68/GNU-Toolchain).
:::danger
DON'T DO THAT.
You shall put the source code into a GitHub repository. HackMD note should keep the critical code instead.
:notes: jserv
:::
##### Output
```
$ build/rv32emu tests/pimo2/pimo.elf
0011inferior exit code 0
```
**Aseembly language**
1. Size
```
$ riscv-none-elf-size pimo.elf
text data bss dec hex filename
760 0 0 760 2f8 pimo.elf
```
2. ELF file header
```
$ riscv-none-elf-readelf -h pimo.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x0
Start of program headers: 52 (bytes into file)
Start of section headers: 5556 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 2
Size of section headers: 40 (bytes)
Number of section headers: 6
Section header string table index: 5
```
## Optimization
In this section, i will compile both the original code and the modified code and then compare the different between them. The website about ["optimization"](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) provides a wealth of optimization options that can be applied to enhance the performance of our code.
We can utilize the information from the provided link to explore various optimization options available for the GCC compiler, allowing us to make exact decisions about how to improve the efficiency of our code. After compiling the code with different optimization option flags, we can analyze factors such as execution time and code size to determine the impact of different optimization.
:::warning
Improve your English writing. Utilize ChatGPT or [Quillbot](https://quillbot.com/).
:notes: jserv
:::
**O0 optimization of original code**
1. Compile
```
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 original.c -o original.elf
```
2. Size
```
$ricev-none-elf-size original.elf
text data bss dec hex filename
76688 2372 1548 80608 13ae0 original.elf
```
3. ELF file header
```
$ riscv-none-elf-readelf -h original.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100d8
Start of program headers: 52 (bytes into file)
Start of section headers: 94772 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
**O0 optimization of modified code**
1. Compile
```
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 modification.c -o modification.elf
```
2. Size
```
$ riscv-none-elf-size modification.elf
text data bss dec hex filename
76664 2372 1548 80584 13ac8 modification.elf
```
3. ELF file header
```
$ riscv-none-elf-readelf -h modification.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100d8
Start of program headers: 52 (bytes into file)
Start of section headers: 94776 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
**Comparison**
| filename | text | data | bss | dec | hex |
| ---- | ---- | ---- | ---- | ---- | ------|
| original.elf | 76688 | 2372 | 1548 | 80608 | 13ae0 |
| modification.elf | 76664 | 2372 | 1548 | 80584 | 13ae8 |
Form the table provided above, we can observe that the size of the text section has slightly decreased after the modification, which corresponds to the optimization of the C code mentioned before. Based on the result, it is evident that the improvement made to the original C code has indeed contributed to reducing the code size. This reduction in code size can have various benefits, such as improved memory usage and potentially faster execution times.
## Different level of optimizations
In this section, i will utilize different optimization levels, including ```-O1```, ```-O2```, ```-O3```, ```-Ofast```, and ```-Os```, to investigate the relationship between code size and the level of optimization. By varying the optimization levels in the compilation, we can analyze how each level affects the resulting code size.
**O1 optimizations**
```
$ riscv-none-elf-size modification_O1.elf
text data bss dec hex filename
75780 2372 1548 79700 13754 modification_O1.elf
```
**O2 optimizations**
```
$ riscv-none-elf-size modification_O2.elf
text data bss dec hex filename
75776 2372 1548 79696 13750 modification_O2.elf
```
**O3 optimizations**
```
$ riscv-none-elf-size modification_O3.elf
text data bss dec hex filename
76428 2372 1548 80348 139dc modification_O3.elf
```
**Ofast optimizations**
```
$ riscv-none-elf-size modification_Ofast.elf
text data bss dec hex filename
76428 2372 1548 80348 139dc modification_Ofast.elf
```
**Os optimizations**
```
$ riscv-none-elf-size modification_Os.elf
text data bss dec hex filename
75776 2372 1548 79696 13750 modification_Os.elf
```
| Level | Assemly | -O0 | -O1 | -O2 | -O3 | -Ofast | -Os |
| -------- | -------- | ----- |-----|-----|-----|--------|-----|
| Text size | :bulb:760 | 76664 |75780|75776|76428|76428 |75776|
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::
## Execution time
In this section, i will employ the [tick.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) for the statistics of my program's execution.
**O0 optimizations**
```
$ build/rv32emu tests/hw2/csr.elf
0
0
1
1
elapsed cycle: 602060
inferior exit code 22
```
**O1 optimizations**
```
$ build/rv32emu tests/hw2/csr_o1.elf
0
0
1
1
elapsed cycle: 351760
inferior exit code 22
```
**O2 optimizations**
```
$ build/rv32emu tests/hw2/csr_o2.elf
0
0
1
1
elapsed cycle: 136942
inferior exit code 22
```
**O3 optimizations**
```
$ build/rv32emu tests/hw2/csr_o3.elf
0
0
1
1
elapsed cycle: 151043
inferior exit code 22
```
**Ofast optimizations**
```
$ build/rv32emu tests/hw2/csr_ofast.elf
0
0
1
1
elapsed cycle: 151043
inferior exit code 22
```
**Os optimizations**
```
$ build/rv32emu tests/hw2/csr_os.elf
0
0
1
1
elapsed cycle: 182657
inferior exit code 22
```
| Level | -O0 | -O1 | -O2 | -O3 | -Ofast | -Os |
| -------- | ------ |-------|-------------|-------|----------|-------|
| Cycle | 602,060|351,760|:bulb:136,942|151,043|151,043 |182,657|