# Assignment2: GNU Toolchain
contributed by <[`JoshuaLee0321`](https://github.com/JoshuaLee0321/fall-2023-computer-arch/tree/main/hw2)>
## Question Selection
Data compression is one of the greatest developments in human history. It is ubiquitous not just in `.zip / .tar / .7z ...etc` files, but also in most ICs (developed by large companies such as `NovaTek` or `Realtek`). Both `audio` and `monitor` ICs need data compression and decompression to be fast and reliable. For this reason, I chose [`Scottgood333's`](https://hackmd.io/@scottgood333) work [`Bits Compression Using CLZ`](https://hackmd.io/@scottgood333/SkM6KdVMT) to work on.
## Analysis of the Original Code
```c=
#include <stdint.h>
#include <stdio.h>
uint16_t count_leading_zeros(uint64_t x)
{
    ...
}
uint32_t cvrt_uint32(uint64_t x)
{
    return (uint32_t) x;
}
uint16_t cvrt_uint16(uint64_t x)
{
    return (uint16_t) x;
}
int main()
{
    uint64_t test_data[3] = {0x0fffffffffffffff, 0x00003567, 0xa};
    uint16_t n;
    uint16_t bit_count = 64 * (sizeof(test_data) / 8), bit_reduced = 0;
    for (int i = 0; i < 3; i++)
        printf("%llu\n", test_data[i]);
    printf("original bit count: %hu\n", (unsigned short int)bit_count);
    bit_count = 0;
    for (int i = 0; i < 3; i++)
    {
        n = count_leading_zeros(test_data[i]);
        printf("leading zeros of test data[%d]: %hu\n", i, n);
        if (n >= 48)
        {
            /* at least 48 leading zeros: the value fits in 16 bits */
            uint16_t tmp_out = cvrt_uint16(test_data[i]);
            bit_count += 16;
            bit_reduced += 48;
        }
        else if (n >= 32)
        {
            /* at least 32 leading zeros: the value fits in 32 bits */
            uint32_t tmp_out = cvrt_uint32(test_data[i]);
            bit_count += 32;
            bit_reduced += 32;
        }
        else
            bit_count += 64;
    }
    printf("bit count after compressed: %hu\n", (unsigned short int)(bit_count));
    printf("bit reduced after compressed: %hu\n", (unsigned short int)(bit_reduced));
}
```
The core idea of this code is to `reduce unused fields`: as the two helpers `cvrt_uint32` and `cvrt_uint16` above show, this simple ==compression== step just narrows the datatype once the leading-zero count proves the upper bits are unused.
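The body of `count_leading_zeros` is elided in the listing above. For reference, here is a minimal C sketch of the shift-or-then-popcount approach that the assembly shown later follows; the original implementation may differ in its details.
```c
#include <stdint.h>

/* Sketch of a branch-free 64-bit CLZ (one common implementation;
 * not necessarily identical to the original code). */
uint16_t count_leading_zeros(uint64_t x)
{
    /* Smear the highest set bit into every lower position. */
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);
    /* Count the set bits (popcount). */
    x -= ((x >> 1) & 0x5555555555555555ULL);
    x = ((x >> 2) & 0x3333333333333333ULL) + (x & 0x3333333333333333ULL);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0fULL;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);
    return (uint16_t)(64 - (x & 0x7f));
}
```
After the six shift-or steps, every bit below the leading one is set, so the leading-zero count is simply 64 minus the population count.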
-------
## Improvement
After checking out the original implementation [`original`](https://github.com/scottgood333/computer_architecture/blob/main/hw1/bit_compression_v1.s), I almost immediately noticed an unnecessary loop in the `clz` function. The first modification is therefore [`loop unrolling`](https://en.wikipedia.org/wiki/Loop_unrolling): removing the loop eliminates the branch instructions and, with them, the potential overhead of [`branch prediction`](https://en.wikipedia.org/wiki/Predication_(computer_architecture)).
```asm
# before
CLZ:
...
loop:
#x |= (x>>2, 4, 8, 16, 32)
srl t1, t1, t2
or t0, t1, t0
slli t2, t2, 1
bne t2, t3, loop
...
```
```asm
# after
CLZ:
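# x |= x >> 1, 2, 4, 8, 16, 32: smear the highest set bit into all lower positions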
srli t1, t0, 1
or t0, t1, t0
srli t1, t0, 2
or t0, t1, t0
srli t1, t0, 4
or t0, t1, t0
srli t1, t0, 8
or t0, t1, t0
srli t1, t0, 16
or t0, t1, t0
srli t1, t0, 32
or t0, t1, t0
...
```
:::info
### Comparison
#### Before

#### After

We gain roughly a `1.1x` speedup after this change.
:::
Back to business. In order to analyze the performance of this code, we can define an external function such as `getcycle.s` to read the current cycle count (following [`jimmylu0303`](https://hackmd.io/@jimmylu0303/assignment2)).
:::warning
Improve your English writing.
:notes: jserv
:::
So I define a function in `main.c`:
```c
uint64_t read_cycles(void)
{
    unsigned long cycles;
    /* read the RISC-V cycle CSR */
    asm volatile ("rdcycle %0" : "=r" (cycles));
    return cycles;
}
```
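As a usage sketch (this harness is illustrative and not taken from the original repository), the counter can be sampled before and after the code under test:
```c
#include <stdint.h>
#include <stdio.h>

uint64_t read_cycles(void);               /* defined above */
uint16_t count_leading_zeros(uint64_t x); /* from the original listing */

int main(void)
{
    uint64_t test_data[3] = {0x0fffffffffffffff, 0x00003567, 0xa};

    uint64_t start = read_cycles();
    for (int i = 0; i < 3; i++)
        count_leading_zeros(test_data[i]);  /* code under measurement */
    uint64_t end = read_cycles();

    printf("cycles elapsed: %llu\n", (unsigned long long)(end - start));
    return 0;
}
```
Taking the difference of the two samples gives the cycle count of the measured region.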
# Optimization comparison
## Without optimization
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/noOpt.txt)
## O1
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/O1.txt)
## O2
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/O2.txt)
## O3
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/O3.txt)
## O4
### assembly code
[link](https://github.com/JoshuaLee0321/fall-2023-computer-arch/blob/main/hw2/opts/O4.txt)
From the [gcc man page](https://linux.die.net/man/1/gcc), there are eight different valid `-O` options you can give to gcc:
* -O (same as -O1)
* -O0 (do no optimization; the default if no optimization level is specified)
* -O1 (optimize minimally)
* -O2 (optimize more)
* -O3 (optimize even more)
* -Ofast (optimize very aggressively, to the point of breaking standards compliance)
* -Os (optimize for size)
* -Og (optimize the debugging experience)
### Checking the results of different optimization strategies
* Check the size statistics of all 8 optimization levels with `riscv-none-elf-size`.
* It is hard to deduce from these numbers alone which optimizations the compiler applied for each strategy.
```
text data bss dec hex filename
77292 2320 1528 81140 13cf4 main.s
77292 2320 1528 81140 13cf4 O0.s
76204 2320 1528 80052 138b4 O1.s
76180 2320 1528 80028 1389c O2.s
76560 2320 1528 80408 13a18 O3.s
76560 2320 1528 80408 13a18 O4.s
76560 2320 1528 80408 13a18 Ofast.s
76156 2320 1528 80004 13884 Os.s
```
* From the table above, it can be observed that the `-Os` option indeed optimizes for file size. Note also that `-O3`, `-O4`, and `-Ofast` produce identical sizes here; GCC has no separate `-O4` level and treats any level above 3 as `-O3`.