contributed by < ranvd
>
Thanks to the contribution from 陳川曜 (hugo0406
), the assembly code he provided is already efficient enough. There isn't much room for me to improve it further. The following section provides the current optimal code I could come up with.
The modification is shown below. I found that, there is one redundent operation in zbytel
. We can look closer to the operation in byte. Since 0x7F
is 0111_1111
in binary representation.
Now assume we have a byte variable called x
. If x = x & 0b01111111
operation is applied, the MSB in x
will always be 0. It can be regard as the truncate of 0~6 bits from x
, we represent it as x[0:6]
in following paragraphs. Then the operation of x[0:6] + 0x7F
will make x[7]
to 1 only when the x[0:6]
is not zero.
After kowning this premise, y | x | 0x7f
and y | 0x7f
is identically the same, since y[7]
will be 1 state while x is non-zero. If x[7]
is 1, x is non-zero, then y[7]
will be 1 without a doubt.
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
uint16_t zbytel(uint64_t x)
{
uint64_t y;
uint16_t n;
y = (x & 0x7F7F7F7F7F7F7F7F) +
0x7F7F7F7F7F7F7F7F; // convert each 0-byte to 0x80
- y = ~(y | x | 0x7F7F7F7F7F7F7F7F); // and each nonzero byte to 0x00
+ y = ~(y | 0x7F7F7F7F7F7F7F7F);
n = count_leading_zeros(y) >> 3; // use number of leading zeros
return n; // n = 0 ... 8 , 8 if x has no 0-byte.
}
I compared the execution cycle and the size between original and modified C implementation in different optimalize trategy.
Note that the size is the object file of the c implementation not the execution file. Because we need a driver the inspect the cycle count of our implementation, which will inflate the code size. This will make our improvment become insignificant.
In the form below, (o) and (m) represent original and modified version repectively.
Level | cycle (o) | cycle (m) | size (o) | size (m) |
---|---|---|---|---|
-O0 | 324 | 320 | 1256 | 1240 |
-O1 | 130 | 128 | 480 | 472 |
-O2 | 130 | 128 | 480 | 472 |
-O3 | 122 | 120 | 812 | 804 |
-Ofast | 122 | 120 | 812 | 804 |
-Os | 130 | 128 | 480 | 472 |
I just do the minimum modify to fit the performance analysis program. The full version code can be seen here. I just paste the most important part in below.
The biggest different is shown below. Since we need a C code driver to observer the cycle conuts. So, we need to follow the calling convention strictly. So I save all s0-s5 register in the function prologure and epilogue, which is callee saved register. Without doing this, the program may goes wrong, especially the s0/fp.
.text
.globl zbyte
zbyte:
loop:
# Prologue
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
jal zbytel
# Epilogue
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 28
ret
Level | cycle | size (.o file) |
---|---|---|
bare assembly | 154 | 564 |
This is common strategy in compiler. By inlining the function, we can save the prologue and epilogue overhead and eliminate the control hazard from jumping instruction.
Level | cycle | size(.o file) |
---|---|---|
bare assembly | 138 | 500 |
Although we inlined all function, we still have the prologue and epilogue when we call the zbytel
function. How about we don't use any callee save register? Without using any callee save, we can save 14 instructions, which is 56 bytes. The result matches our expectation.
In this way, the code size is the minimum in all implementation.
Levle | cycle | size(.o file) |
---|---|---|
bare assembly | 124 | 444 |
Since li
is a pseudo instruction, it can be combined with lui
and addi
when the immediate value can't be represented in 12 bits. If we define it as global data, we can obtain the value in one instruction. This could be a valuable improvement, especially if we need to execute this function many times and the cache mechanism is functioning properly.
Levle | cycle | size(.o file) |
---|---|---|
bare assembly | 120 | 448 |
Note that the y-axis of historgram not start from 0.
Upon examining the histogram, it is evident that my implementation is better than the other implementations, regardless of cycle or size.