# Assignment2: RISC-V Toolchain
contributed by < [`ranvd`](https://github.com/ranvd/ComputerArch) >
## Rewrite Find Leftmost 0-byte using CLZ
Thanks to the contribution from 陳川曜 ([`hugo0406`](https://hackmd.io/@cychen/computer_architecture_hw1)), the assembly code he provided is already efficient enough. There isn't much room for me to improve it further. The following section provides the current optimal code I could come up with.
## C Implementation
The modification is shown below. I found that, there is one redundent operation in `zbytel`. We can look closer to the operation in byte. Since `0x7F` is `0111_1111` in binary representation.
Now assume we have a byte variable called `x`. If `x = x & 0b01111111` operation is applied, the MSB in `x` will always be 0. It can be regard as the truncate of 0~6 bits from `x`, we represent it as `x[0:6]` in following paragraphs. Then the operation of `x[0:6] + 0x7F` will make `x[7]` to 1 only when the `x[0:6]` is not zero.
After kowning this premise, `y | x | 0x7f` and `y | 0x7f` is identically the same, since `y[7]` will be 1 state while x is non-zero. If `x[7]` is 1, x is non-zero, then `y[7]` will be 1 without a doubt.
```diff
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
uint16_t zbytel(uint64_t x)
{
uint64_t y;
uint16_t n;
y = (x & 0x7F7F7F7F7F7F7F7F) +
0x7F7F7F7F7F7F7F7F; // convert each 0-byte to 0x80
- y = ~(y | x | 0x7F7F7F7F7F7F7F7F); // and each nonzero byte to 0x00
+ y = ~(y | 0x7F7F7F7F7F7F7F7F);
n = count_leading_zeros(y) >> 3; // use number of leading zeros
return n; // n = 0 ... 8 , 8 if x has no 0-byte.
}
```
### Compare performance
I compared the execution cycle and the size between original and modified C implementation in different optimalize trategy.
Note that the size is the object file of the c implementation not the execution file. Because we need a driver the inspect the cycle count of our implementation, which will inflate the code size. This will make our improvment become insignificant.
In the form below, (o) and (m) represent original and modified version repectively.
| Level | cycle (o) | cycle (m) | size (o) | size (m) |
| ------ | --------- | --------- | -------- | -------- |
| -O0 | 324 | 320 | 1256 | 1240 |
| -O1 | 130 | 128 | 480 | 472 |
| -O2 | 130 | 128 | 480 | 472 |
| -O3 | 122 | 120 | 812 | 804 |
| -Ofast | 122 | 120 | 812 | 804 |
| -Os | 130 | 128 | 480 | 472 |
<!-- Furthermore, the 64-bit instants, eg. 0x5555555555555555, 0x0f0f0f0f0f0f0f0f, etc, can be write moved as global variable. By doing this. We can reduce the instruction from 4(lui+addi) to 2(lw).
311 -->
## Assembly Implementation
I just do the minimum modify to fit the performance analysis program. The full version code can be seen [here](https://github.com/ranvd/ComputerArch/blob/main/hw2/origin_asm/zbyte.S). I just paste the most important part in below.
The biggest different is shown below. Since we need a C code driver to observer the cycle conuts. So, we need to follow the calling convention strictly. So I save all s0-s5 register in the function prologure and epilogue, which is callee saved register. **Without doing this, the program may goes wrong, especially the s0/fp.**
```clike
.text
.globl zbyte
zbyte:
loop:
# Prologue
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
jal zbytel
# Epilogue
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 28
ret
```
### performance
| Level | cycle | size (.o file) |
| ------------- | ----- | -------------- |
| bare assembly | 154 | 564 |
## Improvements
### 1. Inlining the function
This is common strategy in compiler. By inlining the function, we can save the prologue and epilogue overhead and eliminate the control hazard from jumping instruction.
| Level | cycle | size(.o file) |
| ------------- | ----- | ------------- |
| bare assembly | 138 | 500 |
### 2. Don't use the callee save register
Although we inlined all function, we still have the prologue and epilogue when we call the `zbytel` function. How about we don't use any callee save register? Without using any callee save, we can save 14 instructions, which is 56 bytes. The result matches our expectation.
In this way, the code size is the minimum in all implementation.
| Levle | cycle | size(.o file) |
| ------------- | ----- | ------------- |
| bare assembly | 124 | 444 |
### 3. Use static data
Since `li` is a pseudo instruction, it can be combined with `lui` and `addi` when the immediate value can't be represented in 12 bits. If we define it as global data, we can obtain the value in one instruction. This could be a valuable improvement, especially if we need to execute this function many times and the cache mechanism is functioning properly.
| Levle | cycle | size(.o file) |
| ------------- | ----- | ------------- |
| bare assembly | 120 | 448 |
## Compare
> Note that the y-axis of historgram not start from 0.
Upon examining the histogram, it is evident that my implementation is better than the other implementations, regardless of cycle or size.

