Assignment2: GNU Toolchain

contributed by <JoshuaLee0321>

Question Selection

Data compression is one of the greatest developnment in human history. It is ubiquitous in not just .zip / .tar / .7z ...etc file, but also in most IC (developed by large company such as NovaTek or Realtek). Both audio IC and monitor IC needs data compression and decompression to make them both fast and reliable. To this reason, I choose Scottgood333's work Bits Compression Using CLZ to work on.

Analyzation of Original code





















































#include <stdint.h>
#include <stdio.h>

uint16_t count_leading_zeros(uint64_t x) 
{
    ...
}

uint32_t cvrt_uint32(uint64_t x)
{
    return (uint32_t) x;
}

uint16_t cvrt_uint16(uint64_t x)
{
    return (uint16_t) x;
}

int main()
{
    uint64_t test_data[3] = {0x0fffffffffffffff, 0x00003567, 0xa};
    uint16_t n;
    uint16_t bit_count = 64 * (sizeof(test_data) / 8), bit_reduced = 0;

    for (int i = 0; i < 3; i++)
        printf("%llu\n", test_data[i]);

    printf("original bit count: %hu\n", (unsigned short int)bit_count);
    bit_count = 0;
    for (int i = 0; i < 3; i++)
    {
        n = count_leading_zeros(test_data[i]);
        printf("leading zeros of test data[%d]: %hu\n", i, n);

        if (n >= 48)
        {
            uint16_t tmp_out = cvrt_uint16(test_data[i]);
            bit_count += 16;
            bit_reduced += 48;
        }
        else if (n >= 32)
        {
            uint32_t tmp_out = cvrt_uint32(test_data[i]);
            bit_count += 32;
            bit_reduced += 32;
        }
        else
            bit_count += 64;
    }
    printf("bit count after compressed: %hu\n", (unsigned short int)(bit_count));
    printf("bit reduced after compressed: %hu\n", (unsigned short int)(bit_reduced));
}

The core idea of this code is to reduce unused field. as you can see, the code above, specifcally focus on line 9 and 14, this simple compression function is to convert the datatype.

Improvement

after checkout the original code implementation original, almost immediately, I notice that he implemented unnecessary loop in the funciont clz, so the first modification will be loop unrolling, that is, to reduce cycle by deleting unnecessary branch instruction, hence to remove the potential overhead brought by branch prediction

# before
CLZ:
    ...
    
    loop:    
        #x |= (x>>2, 4, 8, 16, 32)
        srl t1, t1, t2
        or t0, t1, t0  
        slli t2, t2, 1
        bne t2, t3, loop
    
    ...

# after

CLZ: 
    srli t1, t0, 1
    or t0, t1, t0  
    srli t1, t0, 2
    or t0, t1, t0
    srli t1, t0, 4
    or t0, t1, t0
    srli t1, t0, 8
    or t0, t1, t0
    srli t1, t0, 16
    or t0, t1, t0
    srli t1, t0, 32
    or t0, t1, t0
    
    ...

Comparison

Before

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

After

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

we gain around 1.1 speedup after the changes.

back to the business~

In order to analyze the performance of his code, we can define a extern function getcycle.s to get the current cycle (by jimmylu0303)

Improve your English writing.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

so i define a function in main.c

uint64_t read_cycles(void)
{
  unsigned long cycles;
  asm volatile ("rdcycle %0" : "=r" (cycles));
  return cycles;
}

Optimization comparison

without optimization

assembly code

link

O1

assembly code

link

O2

assembly code

link

O3

assembly code

link

O4

assembly code

link

From gcc man page, there's 8 different valid -O options you can give to gcc

-O (Same as -O1)
-O0 (do no optimization, the default if no optimization level is specified)
-O1 (optimize minimally)
-O2 (optimize more)
-O3 (optimize even more)
-Ofast (optimize very aggressively to the point of breaking standard compliance)
-Os (Optimize for size)
-Og (Optimize debugging experience)

Check differenct optimization strategy result

Check all 8 optimization stats by riscv-none-elf-size
It's challenging to deduce from this result what optimizations the compiler has implemented for different strategies.

   text    data     bss     dec     hex filename
  77292    2320    1528   81140   13cf4 main.s
  77292    2320    1528   81140   13cf4 O0.s
  76204    2320    1528   80052   138b4 O1.s
  76180    2320    1528   80028   1389c O2.s
  76560    2320    1528   80408   13a18 O3.s
  76560    2320    1528   80408   13a18 O4.s
  76560    2320    1528   80408   13a18 Ofast.s
  76156    2320    1528   80004   13884 Os.s

From the table above, it can be observed that the '-Os' option indeed optimized the file size.

Assignment2: GNU Toolchain

Question Selection

Analyzation of Original code

Improvement

Comparison

Before

After

Optimization comparison

without optimization

assembly code

O1

assembly code

O2

assembly code

O3

assembly code

O4

assembly code

Check differenct optimization strategy result

Read more

李熙堃的網誌－首頁

李熙堃的網誌 - 履歷

2022 年 [「資訊科技產業專案設計」](https://hackmd.io/@sysprog/info2022/https%3A%2F%2Fhackmd.io%2F%40sysprog%2FBJLSJ3ggi)作業三

2023q1 Homework1 (lab0)