Assignment2: GNU Toolchain

contributed by <JoshuaLee0321>

Question Selection

Data compression is one of the greatest developnment in human history. It is ubiquitous in not just .zip / .tar / .7z ...etc file, but also in most IC (developed by large company such as NovaTek or Realtek). Both audio IC and monitor IC needs data compression and decompression to make them both fast and reliable. To this reason, I choose Scottgood333's work Bits Compression Using CLZ to work on.

Analyzation of Original code

#include <stdint.h> #include <stdio.h> uint16_t count_leading_zeros(uint64_t x) { ... } uint32_t cvrt_uint32(uint64_t x) { return (uint32_t) x; } uint16_t cvrt_uint16(uint64_t x) { return (uint16_t) x; } int main() { uint64_t test_data[3] = {0x0fffffffffffffff, 0x00003567, 0xa}; uint16_t n; uint16_t bit_count = 64 * (sizeof(test_data) / 8), bit_reduced = 0; for (int i = 0; i < 3; i++) printf("%llu\n", test_data[i]); printf("original bit count: %hu\n", (unsigned short int)bit_count); bit_count = 0; for (int i = 0; i < 3; i++) { n = count_leading_zeros(test_data[i]); printf("leading zeros of test data[%d]: %hu\n", i, n); if (n >= 48) { uint16_t tmp_out = cvrt_uint16(test_data[i]); bit_count += 16; bit_reduced += 48; } else if (n >= 32) { uint32_t tmp_out = cvrt_uint32(test_data[i]); bit_count += 32; bit_reduced += 32; } else bit_count += 64; } printf("bit count after compressed: %hu\n", (unsigned short int)(bit_count)); printf("bit reduced after compressed: %hu\n", (unsigned short int)(bit_reduced)); }

The core idea of this code is to reduce unused field. as you can see, the code above, specifcally focus on line 9 and 14, this simple compression function is to convert the datatype.


Improvement

after checkout the original code implementation original, almost immediately, I notice that he implemented unnecessary loop in the funciont clz, so the first modification will be loop unrolling, that is, to reduce cycle by deleting unnecessary branch instruction, hence to remove the potential overhead brought by branch prediction

# before
CLZ:
    ...
    
    loop:    
        #x |= (x>>2, 4, 8, 16, 32)
        srl t1, t1, t2
        or t0, t1, t0  
        slli t2, t2, 1
        bne t2, t3, loop
    
    ...
# after

CLZ: 
    srli t1, t0, 1
    or t0, t1, t0  
    srli t1, t0, 2
    or t0, t1, t0
    srli t1, t0, 4
    or t0, t1, t0
    srli t1, t0, 8
    or t0, t1, t0
    srli t1, t0, 16
    or t0, t1, t0
    srli t1, t0, 32
    or t0, t1, t0
    
    ...

Comparison

Before

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

After

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

we gain around 1.1 speedup after the changes.

back to the business~

In order to analyze the performance of his code, we can define a extern function getcycle.s to get the current cycle (by jimmylu0303)

Improve your English writing.
:notes: jserv

so i define a function in main.c

uint64_t read_cycles(void)
{
  unsigned long cycles;
  asm volatile ("rdcycle %0" : "=r" (cycles));
  return cycles;
}

Optimization comparison

without optimization

assembly code

link

O1

assembly code

link

O2

assembly code

link

O3

assembly code

link

O4

assembly code

link

From gcc man page, there's 8 different valid -O options you can give to gcc

  • -O (Same as -O1)
  • -O0 (do no optimization, the default if no optimization level is specified)
  • -O1 (optimize minimally)
  • -O2 (optimize more)
  • -O3 (optimize even more)
  • -Ofast (optimize very aggressively to the point of breaking standard compliance)
  • -Os (Optimize for size)
  • -Og (Optimize debugging experience)

Check differenct optimization strategy result

  • Check all 8 optimization stats by riscv-none-elf-size
  • It's challenging to deduce from this result what optimizations the compiler has implemented for different strategies.
   text    data     bss     dec     hex filename
  77292    2320    1528   81140   13cf4 main.s
  77292    2320    1528   81140   13cf4 O0.s
  76204    2320    1528   80052   138b4 O1.s
  76180    2320    1528   80028   1389c O2.s
  76560    2320    1528   80408   13a18 O3.s
  76560    2320    1528   80408   13a18 O4.s
  76560    2320    1528   80408   13a18 Ofast.s
  76156    2320    1528   80004   13884 Os.s
  • From the table above, it can be observed that the '-Os' option indeed optimized the file size.