Assignment2: RISC-V Toolchain

contributed by < ranvd >

Rewrite Find Leftmost 0-byte using CLZ

Thanks to the contribution from 陳川曜 (hugo0406), the assembly code he provided is already efficient enough. There isn't much room for me to improve it further. The following section provides the current optimal code I could come up with.

C Implementation

The modification is shown below. I found that, there is one redundent operation in zbytel. We can look closer to the operation in byte. Since 0x7F is 0111_1111 in binary representation.

Now assume we have a byte variable called x. If x = x & 0b01111111 operation is applied, the MSB in x will always be 0. It can be regard as the truncate of 0~6 bits from x, we represent it as x[0:6] in following paragraphs. Then the operation of x[0:6] + 0x7F will make x[7] to 1 only when the x[0:6] is not zero.

After kowning this premise, y | x | 0x7f and y | 0x7f is identically the same, since y[7] will be 1 state while x is non-zero. If x[7] is 1, x is non-zero, then y[7] will be 1 without a doubt.

uint16_t count_leading_zeros(uint64_t x)
{
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);

    /* count ones (population count) */
    x -= ((x >> 1) & 0x5555555555555555);
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);

    return (64 - (x & 0x7f));
}

uint16_t zbytel(uint64_t x)
{
    uint64_t y;
    uint16_t n;
    y = (x & 0x7F7F7F7F7F7F7F7F) +
        0x7F7F7F7F7F7F7F7F;             // convert each 0-byte to 0x80
-    y = ~(y | x | 0x7F7F7F7F7F7F7F7F);  // and each nonzero byte to 0x00
+    y = ~(y | 0x7F7F7F7F7F7F7F7F);    
    n = count_leading_zeros(y) >> 3;    // use number of leading zeros
    return n;                           // n = 0 ... 8 , 8 if x has no 0-byte.
}

Compare performance

I compared the execution cycle and the size between original and modified C implementation in different optimalize trategy.

Note that the size is the object file of the c implementation not the execution file. Because we need a driver the inspect the cycle count of our implementation, which will inflate the code size. This will make our improvment become insignificant.

In the form below, (o) and (m) represent original and modified version repectively.

Level	cycle (o)	cycle (m)	size (o)	size (m)
-O0	324	320	1256	1240
-O1	130	128	480	472
-O2	130	128	480	472
-O3	122	120	812	804
-Ofast	122	120	812	804
-Os	130	128	480	472

Assembly Implementation

I just do the minimum modify to fit the performance analysis program. The full version code can be seen here. I just paste the most important part in below.

The biggest different is shown below. Since we need a C code driver to observer the cycle conuts. So, we need to follow the calling convention strictly. So I save all s0-s5 register in the function prologure and epilogue, which is callee saved register. Without doing this, the program may goes wrong, especially the s0/fp.

.text
.globl zbyte
zbyte:
loop:
    # Prologue
    addi sp, sp, -28
    sw ra, 0(sp)
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw s2, 12(sp)
    sw s3, 16(sp)
    sw s4, 20(sp)
    sw s5, 24(sp)
    jal  zbytel

    # Epilogue
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw s2, 12(sp)
    lw s3, 16(sp)
    lw s4, 20(sp)
    lw s5, 24(sp)
    addi sp, sp, 28
    ret

performance

Level	cycle	size (.o file)
bare assembly	154	564

Improvements

1. Inlining the function

This is common strategy in compiler. By inlining the function, we can save the prologue and epilogue overhead and eliminate the control hazard from jumping instruction.

Level	cycle	size(.o file)
bare assembly	138	500

2. Don't use the callee save register

Although we inlined all function, we still have the prologue and epilogue when we call the zbytel function. How about we don't use any callee save register? Without using any callee save, we can save 14 instructions, which is 56 bytes. The result matches our expectation.

In this way, the code size is the minimum in all implementation.

Levle	cycle	size(.o file)
bare assembly	124	444

3. Use static data

Since li is a pseudo instruction, it can be combined with lui and addi when the immediate value can't be represented in 12 bits. If we define it as global data, we can obtain the value in one instruction. This could be a valuable improvement, especially if we need to execute this function many times and the cache mechanism is functioning properly.

Levle	cycle	size(.o file)
bare assembly	120	448

Compare

Note that the y-axis of historgram not start from 0.

Upon examining the histogram, it is evident that my implementation is better than the other implementations, regardless of cycle or size.

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →