Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by <`SuNsHiNe-75`> ## [Quiz 1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) - Problem C :::danger Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data. ::: ### C Code The C code for the `__builtin_clz` function mentioned in Problem C is as follows: ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` :::danger Show the full C source code corresponding to the original problem set. ::: ### RISC-V Assembly Code ```c .text .globl main main: li a0, 1 call my_clz li a7, 10 ecall my_clz: li t0, 0 li t1, 31 loop: li t2, 1 sll t2, t2, t1 and t3, a0, t2 bnez t3, done addi t0, t0, 1 addi t1, t1, -1 blt t1, zero, done j loop done: mv a0, t0 ret ``` :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: A brief analysis of registers of the above assembly code: - `a0` can be used to specify the input parameter `x`. - `t0` represents the `count`. - `t1` serves as the loop index `i`. - `t3` is used to detect whether the currently observed bit is `1`. If it is, the calculation is completed, and the loop is exited. The program logic **calculates the leading zeroes of `x`** (`a0`) and stores the result back in `a0`, then ends program execution using `li a7, 10` + `ecall`. :::info It is important to note that the built-in C function `__builtin_clz` results in undefined behavior when testing a value of 0. Therefore, in this assembly program, the input value cannot be 0. ::: ### RISC-V Assembly Code Improvement ```c .text .globl main main: li a0, 1 call my_clz li a7, 10 ecall my_clz: li t0, 0 li t1, 31 li t2, 0xFFFF0000 srli t3, a0, 16 beqz t3, L1 li t4, 16 sub t1, t1, t4 mv a0, t3 L1: li t2, 0xFF00 srli t3, a0, 8 beqz t3, L2 li t4, 8 sub t1, t1, t4 mv a0, t3 L2: li t2, 0xF0 srli t3, a0, 4 beqz t3, L3 li t4, 4 sub t1, t1, t4 mv a0, t3 L3: li t2, 0xC srli t3, a0, 2 beqz t3, L4 li t4, 2 sub t1, t1, t4 mv a0, t3 L4: li t2, 0x2 srli t3, a0, 1 beqz t3, L5 li t4, 1 sub t1, t1, t4 L5: mv a0, t1 ret ``` This assembly program uses **bitwise operations** to avoid the loop execution of `my_clz`. | Assembly program | Cycles | Instrs. retired | IPC | CPI | | ------------------- | ------ | --------------- | ----- | ---- | | `my_clz` | 335 | 261 | 0.78 | 1.28 | | branchless `my_clz` | 45 | 25 | 0.556 | 1.8 | In the same test scenario of "calculating the number of leading zeroes for 1," we can observed the **Execution Info.** simulated in *Ripes* and organized the table as shown above. It is clear that the cycle count for the branchless version of `my_clz` is only $\frac 1 7$ of that for the original version with the `for` loop logic `my_clz`. However, when examining the CPI and other indicators, the overall execution performance of the branchless `my_clz` is slightly inferior to that of the original `my_clz`, indicating that the performance bottleneck needs to be clarified. :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: ## Use Case ### LeetCode: [Factorial Trailing Zeroes](https://leetcode.com/problems/factorial-trailing-zeroes/description/) > Given an integer `n`, return the number of trailing zeroes in `n!`. ### C Code ```c int trailingZeroes(int n) { int ans = 0; while (n) { ans += n / 5; n /= 5; } return ans; } ``` ### RISC-V Assembly Code ```c .text .globl trailingZeroes trailingZeroes: addi sp, sp, -16 sw ra, 12(sp) sw s0, 8(sp) sw a0, 4(sp) li s0, 0 lw t0, 4(sp) loop: beq t0, zero, end li t2, 5 div t1, t0, t2 add s0, s0, t1 mv t0, t1 j loop end: mv a0, s0 lw ra, 12(sp) lw s0, 8(sp) addi sp, sp, 16 li a7, 10 ecall jr ra ```