# Assignment2: GNU Toolchain
contributed by <[`JoshuaLee0321`](https://github.com/JoshuaLee0321/fall-2023-computer-arch/tree/main/hw2)>
## Question Selection
Data compression is one of the greatest developments in human history. It is ubiquitous not just in `.zip / .tar / .7z ...etc` files, but also in most ICs (developed by large companies such as `NovaTek` or `Realtek`). Both `audio` and `monitor` ICs need data compression and decompression to be fast and reliable. For this reason, I chose [`Scottgood333's`](https://hackmd.io/@scottgood333) work [`Bits Compression Using CLZ`](https://hackmd.io/@scottgood333/SkM6KdVMT) to work on.
## Analysis of the Original Code
```c=
#include <stdint.h>
#include <stdio.h>
uint16_t count_leading_zeros(uint64_t x)
{
    ...
}
uint32_t cvrt_uint32(uint64_t x)
{
    return (uint32_t) x;
}
uint16_t cvrt_uint16(uint64_t x)
{
    return (uint16_t) x;
}
int main()
{
    uint64_t test_data[3] = {0x0fffffffffffffff, 0x00003567, 0xa};
    uint16_t n;
    uint16_t bit_count = 64 * (sizeof(test_data) / 8), bit_reduced = 0;
    for (int i = 0; i < 3; i++)
        printf("%llu\n", test_data[i]);
    printf("original bit count: %hu\n", (unsigned short int)bit_count);
    bit_count = 0;
    for (int i = 0; i < 3; i++)
    {
        n = count_leading_zeros(test_data[i]);
        printf("leading zeros of test data[%d]: %hu\n", i, n);
        if (n >= 48)
        {
            /* at least 48 leading zeros: the value fits in 16 bits */
            uint16_t tmp_out = cvrt_uint16(test_data[i]);
            bit_count += 16;
            bit_reduced += 48;
        }
        else if (n >= 32)
        {
            /* at least 32 leading zeros: the value fits in 32 bits */
            uint32_t tmp_out = cvrt_uint32(test_data[i]);
            bit_count += 32;
            bit_reduced += 32;
        }
        else
            bit_count += 64;
    }
    printf("bit count after compressed: %hu\n", (unsigned short int)(bit_count));
    printf("bit reduced after compressed: %hu\n", (unsigned short int)(bit_reduced));
}
```
The core idea of this code is to `reduce unused fields`: as the two helpers `cvrt_uint32` and `cvrt_uint16` above show, this simple ==compression== step just narrows the datatype once the leading-zero count proves the upper bits are unused.
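The body of `count_leading_zeros` is elided in the listing above. For reference, here is a minimal C sketch of the shift-or-then-popcount approach that the assembly shown later follows; the original implementation may differ in its details.
```c
#include <stdint.h>

/* Sketch of a branch-free 64-bit CLZ (one common implementation;
 * not necessarily identical to the original code). */
uint16_t count_leading_zeros(uint64_t x)
{
    /* Smear the highest set bit into every lower position. */
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);
    /* Count the set bits (popcount). */
    x -= ((x >> 1) & 0x5555555555555555ULL);
    x = ((x >> 2) & 0x3333333333333333ULL) + (x & 0x3333333333333333ULL);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0fULL;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);
    return (uint16_t)(64 - (x & 0x7f));
}
```
After the six shift-or steps, every bit below the leading one is set, so the leading-zero count is simply 64 minus the population count.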
-------
## Improvement
After checking out the original implementation [`original`](https://github.com/scottgood333/computer_architecture/blob/main/hw1/bit_compression_v1.s), I almost immediately noticed an unnecessary loop in the `clz` function. The first modification is therefore [`loop unrolling`](https://en.wikipedia.org/wiki/Loop_unrolling): removing the loop eliminates the branch instructions and, with them, the potential overhead of [`branch prediction`](https://en.wikipedia.org/wiki/Predication_(computer_architecture)).
```asm
# before
CLZ:
...
loop:
#x |= (x>>2, 4, 8, 16, 32)
srl t1, t1, t2
or t0, t1, t0
slli t2, t2, 1
bne t2, t3, loop
...
```
```asm
# after
CLZ:
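# x |= x >> 1, 2, 4, 8, 16, 32: smear the highest set bit into all lower positions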
srli t1, t0, 1
or t0, t1, t0
srli t1, t0, 2
or t0, t1, t0
srli t1, t0, 4
or t0, t1, t0
srli t1, t0, 8
or t0, t1, t0
srli t1, t0, 16
or t0, t1, t0
srli t1, t0, 32
or t0, t1, t0
...
```
:::info
### Comparison
#### Before

#### After

We gain roughly a `1.1x` speedup after this change.
:::
Back to business. In order to analyze the performance of this code, we can define an external function such as `getcycle.s` to read the current cycle count (following [`jimmylu0303`](https://hackmd.io/@jimmylu0303/assignment2)).
:::warning
Improve your English writing.
:notes: jserv
:::
So I define a function in `main.c`:
```c
uint64_t read_cycles(void)
{
    unsigned long cycles;
    /* read the RISC-V cycle CSR */
    asm volatile ("rdcycle %0" : "=r" (cycles));
    return cycles;
}
```
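As a usage sketch (this harness is illustrative and not taken from the original repository), the counter can be sampled before and after the code under test:
```c
#include <stdint.h>
#include <stdio.h>

uint64_t read_cycles(void);               /* defined above */
uint16_t count_leading_zeros(uint64_t x); /* from the original listing */

int main(void)
{
    uint64_t test_data[3] = {0x0fffffffffffffff, 0x00003567, 0xa};

    uint64_t start = read_cycles();
    for (int i = 0; i < 3; i++)
        count_leading_zeros(test_data[i]);  /* code under measurement */
    uint64_t end = read_cycles();

    printf("cycles elapsed: %llu\n", (unsigned long long)(end - start));
    return 0;
}
```
Taking the difference of the two samples gives the cycle count of the measured region.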
# Optimization comparison
## Without optimization
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/noOpt.txt)
## O1
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/O1.txt)
## O2
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/O2.txt)
## O3
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/O3.txt)
## O4
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/O4.txt)
From the [gcc man page](https://linux.die.net/man/1/gcc), there are eight different valid `-O` options you can give to gcc:
* -O (same as -O1)
* -O0 (do no optimization; the default if no optimization level is specified)
* -O1 (optimize minimally)
* -O2 (optimize more)
* -O3 (optimize even more)
* -Ofast (optimize very aggressively, to the point of breaking standards compliance)
* -Os (optimize for size)
* -Og (optimize the debugging experience)
### Checking the results of different optimization strategies
* Check the size statistics of all 8 optimization levels with `riscv-none-elf-size`.
* It is hard to deduce from these numbers alone which optimizations the compiler applied for each strategy.
```
text data bss dec hex filename
77292 2320 1528 81140 13cf4 main.s
77292 2320 1528 81140 13cf4 O0.s
76204 2320 1528 80052 138b4 O1.s
76180 2320 1528 80028 1389c O2.s
76560 2320 1528 80408 13a18 O3.s
76560 2320 1528 80408 13a18 O4.s
76560 2320 1528 80408 13a18 Ofast.s
76156 2320 1528 80004 13884 Os.s
```
* From the table above, it can be observed that the `-Os` option indeed optimizes for file size. Note also that `-O3`, `-O4`, and `-Ofast` produce identical sizes here; GCC has no separate `-O4` level and treats any level above 3 as `-O3`.