# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <`SuNsHiNe-75`>
## [Quiz 1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) - Problem C
:::danger
Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data.
:::
### C Code
The C code for the `__builtin_clz` function mentioned in Problem C is as follows:
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
:::danger
Show the full C source code corresponding to the original problem set.
:::
### RISC-V Assembly Code
```c
.text
.globl main
main:
li a0, 1
call my_clz
li a7, 10
ecall
my_clz:
li t0, 0
li t1, 31
loop:
li t2, 1
sll t2, t2, t1
and t3, a0, t2
bnez t3, done
addi t0, t0, 1
addi t1, t1, -1
blt t1, zero, done
j loop
done:
mv a0, t0
ret
```
:::danger
You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.
:::
A brief analysis of registers of the above assembly code:
- `a0` can be used to specify the input parameter `x`.
- `t0` represents the `count`.
- `t1` serves as the loop index `i`.
- `t3` is used to detect whether the currently observed bit is `1`. If it is, the calculation is completed, and the loop is exited.
The program logic **calculates the leading zeroes of `x`** (`a0`) and stores the result back in `a0`, then ends program execution using `li a7, 10` + `ecall`.
:::info
It is important to note that the built-in C function `__builtin_clz` results in undefined behavior when testing a value of 0. Therefore, in this assembly program, the input value cannot be 0.
:::
### RISC-V Assembly Code Improvement
```c
.text
.globl main
main:
li a0, 1
call my_clz
li a7, 10
ecall
my_clz:
li t0, 0
li t1, 31
li t2, 0xFFFF0000
srli t3, a0, 16
beqz t3, L1
li t4, 16
sub t1, t1, t4
mv a0, t3
L1:
li t2, 0xFF00
srli t3, a0, 8
beqz t3, L2
li t4, 8
sub t1, t1, t4
mv a0, t3
L2:
li t2, 0xF0
srli t3, a0, 4
beqz t3, L3
li t4, 4
sub t1, t1, t4
mv a0, t3
L3:
li t2, 0xC
srli t3, a0, 2
beqz t3, L4
li t4, 2
sub t1, t1, t4
mv a0, t3
L4:
li t2, 0x2
srli t3, a0, 1
beqz t3, L5
li t4, 1
sub t1, t1, t4
L5:
mv a0, t1
ret
```
This assembly program uses **bitwise operations** to avoid the loop execution of `my_clz`.
| Assembly program | Cycles | Instrs. retired | IPC | CPI |
| ------------------- | ------ | --------------- | ----- | ---- |
| `my_clz` | 335 | 261 | 0.78 | 1.28 |
| branchless `my_clz` | 45 | 25 | 0.556 | 1.8 |
In the same test scenario of "calculating the number of leading zeroes for 1," we can observed the **Execution Info.** simulated in *Ripes* and organized the table as shown above.
It is clear that the cycle count for the branchless version of `my_clz` is only $\frac 1 7$ of that for the original version with the `for` loop logic `my_clz`.
However, when examining the CPI and other indicators, the overall execution performance of the branchless `my_clz` is slightly inferior to that of the original `my_clz`, indicating that the performance bottleneck needs to be clarified.
:::danger
You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.
:::
## Use Case
### LeetCode: [Factorial Trailing Zeroes](https://leetcode.com/problems/factorial-trailing-zeroes/description/)
> Given an integer `n`, return the number of trailing zeroes in `n!`.
### C Code
```c
int trailingZeroes(int n) {
int ans = 0;
while (n) {
ans += n / 5;
n /= 5;
}
return ans;
}
```
### RISC-V Assembly Code
```c
.text
.globl trailingZeroes
trailingZeroes:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
sw a0, 4(sp)
li s0, 0
lw t0, 4(sp)
loop:
beq t0, zero, end
li t2, 5
div t1, t0, t2
add s0, s0, t1
mv t0, t1
j loop
end:
mv a0, s0
lw ra, 12(sp)
lw s0, 8(sp)
addi sp, sp, 16
li a7, 10
ecall
jr ra
```