# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [bentck719](https://github.com/Galafala) > (Tzu-Ching Kuo)
## Abstract
In this document, I approach the challenge of translating a C function for counting leading zeros (clz) into optimized RISC-V assembly code. My goal is to explore two distinct strategies: a loop-based implementation and a more efficient branchless approach using binary search. Additionally, I delve into the “Valid Perfect Square” problem from LeetCode, investigating Newton’s method and leveraging clz optimizations for faster convergence. Throughout the process, I aim to minimize cycle counts and instruction overhead, focusing on performance gains critical for low-level optimization tasks.
## Problem C to RISC-V
### clz32 w/ loop
The CLZ operation is a simple yet powerful concept in computing that counts the number of zeros before the first ‘1’ in a binary number. It has wide-ranging applications in mathematics, computer science, and engineering, particularly in areas that require efficient processing of numerical data and bit-level operations.
#### C code
The function iterates over each bit of the 32-bit unsigned integer x, starting from the most significant bit. It counts how many consecutive zeros appear before the first 1 bit is encountered. This is a straightforward way to implement the “Count Leading Zeros” (CLZ) operation without using any built-in functions or hardware instructions.
```c=
static inline int clz32(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i)) break;
        count++;
    }
    return count;
}
```
#### RISC-V code
```s=
clz32:
bnez a3, none_zero # if a3 != 0, jump to none_zero
is_zero:
li t0, 32 # t0 = 32
sw t0, 0(sp) # save the clz answer
jalr zero, ra, 0 # return
none_zero:
li t2, 0 # count
li t0, 31 # int i = 31
clz32_loop:
li t1, 1 # 1U
sll t1, t1, t0 # (1U << i)
and t3, a3, t1 # x & (1U << i)
addi t0, t0, -1 # i--
addi t2, t2, 1 # count++
beqz t3, clz32_loop # keep looping while (x & (1U << i)) == 0
addi t2, t2, -1 # undo the extra count++ from the final iteration
sw t2, 0(sp) # save count
jalr zero, ra, 0 # return
```
### fabsf (Not applied in this project)
#### Structure of FP32
An FP32 number consists of three components:
1. Sign Bit (1 bit): Indicates the sign of the number.
• 0 for positive numbers
• 1 for negative numbers
2. Exponent (8 bits): Encodes the exponent portion, allowing the representation of a wide range of magnitudes. It uses a bias of 127.
3. Significand or Mantissa (23 bits): Represents the significant digits of the number. In normalized form, there is an implicit leading 1.
Binary layout:
| Sign (1 bit) | Exponent (8 bits) | Significand (23 bits) |
Therefore, in `fabsf`, we only need to deal with the sign bit.
#### C code
The C code uses a pointer to access the binary representation of the float. Performing a bitwise AND with 0x7FFFFFFF clears the sign bit, making the number non-negative. Finally, the modified unsigned integer `i` is reinterpreted as a float.
```c=
float fabsf(float x) {
    uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
    i &= 0x7FFFFFFF;              // Clear the sign bit to get the absolute value
    x = *(float *)&i;             // Write the modified bits back into the float
    return x;
}
```
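One caveat worth noting: casting `&x` to `uint32_t *` technically violates C's strict-aliasing rule, so compilers are allowed to miscompile it at higher optimization levels. A portable variant (my own sketch, using a hypothetical name `fabsf_portable` to avoid clashing with the standard library) does the same bit manipulation through `memcpy`:

```c
#include <stdint.h>
#include <string.h>

/* Portable bit-level absolute value: memcpy avoids the
 * strict-aliasing pitfall of casting &x to uint32_t *. */
static float fabsf_portable(float x) {
    uint32_t i;
    memcpy(&i, &x, sizeof i);  /* read the float's bit pattern */
    i &= 0x7FFFFFFFu;          /* clear the sign bit */
    memcpy(&x, &i, sizeof x);  /* write the bits back */
    return x;
}
```

Most compilers recognize this pattern and emit the same single AND instruction, so portability costs nothing here.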
#### RISC-V code
```s=
fabsf:
li t0, 0x7fffffff
and a0, s0, t0 # n &= 0x7fffffff
# Store the result into stack
mv t1, a0
sw t1, 4(sp)
jalr zero, ra, 0
```
## Code Optimization
### clz32 branchless [1]
#### C code
:::info
The approach uses a binary search to count the number of zeros to the left of the first set bit (the "**WALL**").
:::
The branchless method begins by checking whether the top 16 bits are zero; if so, it adds 16 to the result r and shifts x left by 16 bits to bring the remaining bits into view. It then checks the next 8 bits, adds 8 to r if they are zero, and shifts x left by 8 bits. The process continues with checks of 4 bits, 2 bits, and 1 bit, each time adding to r and shifting x as needed. Finally, if x is still zero after all the shifts, the input itself was zero, and one more is added to r so the result is 32. By using shifts and comparisons instead of a loop, this method computes the number of leading zeros of x in a fixed number of steps.
**Amazing...**
```c=
int clz32(uint32_t x) {
    int r = 0, c;
    c = (x < 0x00010000) << 4;
    r += c;
    x <<= c; // off 16
    c = (x < 0x01000000) << 3;
    r += c;
    x <<= c; // off 8
    c = (x < 0x10000000) << 2;
    r += c;
    x <<= c; // off 4
    c = (x < 0x40000000) << 1;
    r += c;
    x <<= c; // off 2
    c = x < 0x80000000;
    r += c;
    x <<= c; // off 1
    r += x == 0; // If all bits are zero, then the code adds one to the result.
    return r;
}
```
#### RISC-V code
```s=
clz32:
li t0, 0 # r = 0
mv t3, a3 # t3 = a3
li t2, 0x00010000 # 0000 0000 0000 0001 0000 0000 0000 0000
sltu t1, t3, t2 # c = (x < 0x00010000) ? 1 : 0
slli t1, t1, 4 # c = (x < 0x00010000) << 4
add t0, t0, t1 # r += c
sll t3, t3, t1 # x <<= c
li t2, 0x01000000 # 0000 0001 0000 0000 0000 0000 0000 0000
sltu t1, t3, t2 # c = (x < 0x01000000) ? 1 : 0
slli t1, t1, 3 # c = (x < 0x01000000) << 3
add t0, t0, t1 # r += c
sll t3, t3, t1 # x <<= c
li t2, 0x10000000 # 0001 0000 0000 0000 0000 0000 0000 0000
sltu t1, t3, t2 # c = (x < 0x10000000) ? 1 : 0
slli t1, t1, 2 # c = (x < 0x10000000) << 2
add t0, t0, t1 # r += c
sll t3, t3, t1 # x <<= c
li t2, 0x40000000 # 0100 0000 0000 0000 0000 0000 0000 0000
sltu t1, t3, t2 # c = (x < 0x40000000) ? 1 : 0
slli t1, t1, 1 # c = (x < 0x40000000) << 1
add t0, t0, t1 # r += c
sll t3, t3, t1 # x <<= c
li t2, 0x80000000 # 1000 0000 0000 0000 0000 0000 0000 0000
sltu t1, t3, t2 # c = (x < 0x80000000) ? 1 : 0
add t0, t0, t1 # r += c
sll t3, t3, t1 # x <<= c
sltiu t1, t3, 1 # t1 = (x == 0) ? 1 : 0
add t0, t0, t1 # r += (x == 0)
sw t0, 0(sp)
jalr zero, ra, 0
```
### Analysis
The optimized version uses 382 fewer cycles than the original, a reduction of about 43.6%, so the optimized program runs significantly faster. In terms of instructions retired, the optimized version executes 250 fewer instructions, a reduction of about 40.8%, which suggests the optimized code eliminates unnecessary operations through a better algorithm.
The CPI has decreased slightly in the optimized version, indicating that each instruction, on average, takes fewer cycles to execute. This could be due to improved instruction scheduling or better utilization of the CPU’s capabilities. The IPC has increased in the optimized version, meaning the CPU is executing more instructions in each cycle. This reflects improved parallelism and efficiency in instruction execution.
Overall, the data shows that the optimized version of the program is significantly more efficient than the original. By reducing the number of cycles and instructions, and improving both CPI and IPC, the optimized code runs faster and makes better use of the CPU’s resources.
| Metric | Original | Optimized |
| -------- | -------- | -------- |
| Cycle | 876 | 494 |
| Instructions Retired| 612 | 362 |
| CPI | 1.43 | 1.36 |
| IPC | 0.699 | 0.733 |
## Compare with Compiler
I downloaded the `riscv64-unknown-elf-gcc` compiler [4] to translate the following C code into RISC-V assembly, and compared the compiler's output against the assembly I wrote by hand:
```c=
#include "stdint.h"
#include "stdio.h"
int clz32(uint32_t x) {
    int r = 0, c;
    c = (x < 0x00010000) << 4;
    r += c;
    x <<= c; // off 16
    c = (x < 0x01000000) << 3;
    r += c;
    x <<= c; // off 8
    c = (x < 0x10000000) << 2;
    r += c;
    x <<= c; // off 4
    c = (x < 0x40000000) << 1;
    r += c;
    x <<= c; // off 2
    c = x < 0x80000000;
    r += c;
    x <<= c; // off 1
    r += x == 0;
    return r;
}
int main() {
    uint32_t test[6] = {0, 0x345, 0x80000000, 0x1, 0xc2440000, 0xf00001};
    for (int i = 0; i < 6; i++) {
        int r = clz32(test[i]);
        if (r == 32) {
            printf("The number of %d's leading zero is undefined\n", i);
            continue;
        }
        printf("The number of %d's leading zero is %d\n", i, r);
    }
    return 0;
}
```
Although the "Myself" implementation has a slightly higher CPI (1.36 vs. 1.34) and a slightly lower IPC (0.733 vs. 0.749) than the "Compiler" version—each instruction takes marginally more cycles on average—it is vastly more efficient overall because it executes far fewer instructions. Retiring only 362 instructions instead of 11,011 produces a drastic drop in total cycles (from 14,708 to 494), so the hand-written version completes its task much faster. In general, instructions that perform more work per operation can reduce the total instruction count while slightly increasing CPI. Overall, the comparison shows that total cycles and instructions retired dominate here: efficient algorithm design and lean code paths have a far greater impact on performance than the marginal differences in CPI and IPC.
| Metric | Compiler | Myself |
| -------- | -------- | -------- |
| Cycle | 14708 | 494 |
| Instructions Retired| 11011 | 362 |
| CPI | 1.34 | 1.36 |
| IPC | 0.749 | 0.733 |
## LeetCode Problem: Valid Perfect Square
> [LeetCode 367](https://leetcode.com/problems/valid-perfect-square/description/)
Given a positive integer num, return true if num is a perfect square or false otherwise.
A perfect square is an integer that is the square of an integer. In other words, it is the product of some integer with itself.
You must not use any built-in library function, such as sqrt.
**Example 1**
Input: num = 16
Output: true
Explanation: We return true because 4 * 4 = 16 and 4 is an integer.
**Example 2**
Input: num = 14
Output: false
Explanation: We return false because 3.742 * 3.742 = 14 and 3.742 is not an integer.
## LeetCode Solution
### Newton's method
#### What is Newton's method?
Newton’s method is a root-finding algorithm in numerical analysis. The algorithm iteratively generates successively better approximations to the roots (zeros) of a real-valued function. The basic version begins with a real-valued function $f$, its derivative $f'$, and an initial guess $x_0$ for a root of $f$. If $f$ satisfies certain conditions and the initial guess is sufficiently close to an actual root, the next approximation $x_1$ is calculated using:
$$
x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}
$$
This new value $x_1$ is a better approximation of the root than $x_0$. Geometrically, $x_1$ is the $x$-intercept of the tangent line to the graph of $f$ at the point $(x_0, f(x_0))$; in other words, $x_1$ is the root of the linear approximation of $f$ at $x_0$. The process is then repeated iteratively:
$$
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
$$
until the desired level of precision is achieved.

An illustration of Newton's method. [2]
#### Newton's method for square root
In square root cases, we have to design an equation. For example, if we want to find the square root of $x_0$, we should let an equation:
$$
f(x) = x^2-x_0
$$
And find its derivative:
$$
f'(x)= 2x
$$
We use the derivative to find the slope of the tangent line, then combine that slope with the point $(x_0, f(x_0))$ to define the tangent line:
$$
f(x) - f(x_0) = f'(x_0)(x-x_0)
$$
To generalize, we replace $x_0$ with $x_i$ and $x$ with $x_{i+1}$, and substitute these values into the equation:
$$
\begin{align*}
f(x_{i+1}) - f(x_i) &= f'(x_i)(x_{i+1}-x_i)\\
f(x_{i+1}) - ({x_i}^2-x_0) &= 2x_i(x_{i+1}-x_i)
\end{align*}
$$
Since we want to find the root of the tangent line, we let $f(x_{i+1})=0$:
$$
-({x_i}^2-x_0) = 2x_i(x_{i+1}-x_i)
$$
Then, we rearrange the equation by moving terms:
$$
\begin{align*}
-({x_i}^2-x_0) &= 2x_i(x_{i+1}-x_i)\\
-\frac{({x_i}^2-x_0)}{2x_i} &= (x_{i+1}-x_i)\\
x_{i+1} &= -\frac{({x_i}^2-x_0)}{2x_i} + x_i\\
x_{i+1} &= -\frac{x_i}{2} + \frac{x_0}{2x_i} + x_i\\
&= \frac{x_i}{2} + \frac{x_0}{2x_i}\\
&= (x_i + \frac{x_0}{x_i})\times\frac{1}{2}
\end{align*}
$$
After moving terms, we get the final equation to find the square root iteratively:
$$
x_{i+1} = (x_i + \frac{x_0}{x_i})\times\frac{1}{2}
$$
#### C code
```c=
bool isPerfectSquare(int num) {
    if (num < 2) return true;    // num = 1 is a perfect square; also avoids starting at x = 0
    int count_x = 0;             // record iteration times
    uint64_t x = (num >> 1);     // initial guess: num / 2
    while (x * x > num) {
        x = (x + num / x) / 2;
        count_x++;
    }
    printf("count_x: %d\n", count_x);
    return x * x == num;
}
```
### Optimized Newton's method
#### Motivation
This assignment proposes two methods to solve LeetCode problem 367. Both are based on Newton's approach to finding the root of a function, where selecting an appropriate initial value is crucial for reducing the number of iterations. An article about the history of square roots [3] inspired me: it discusses finding a number's approximate square root to use as the starting point for iteration. Building on that idea, I combined it with the count-leading-zeros (clz) technique to accelerate Newton's method.
Since we aim to find the square root, an approximate square root makes a good initial guess and yields faster convergence. In binary, finding such an approximation is easy: the square root has about half as many bits as the original number. For example, if a number occupies 16 bits, its square root occupies roughly 8. A suitable initial guess can therefore be obtained by right-shifting the number by half its bit length, which makes determining the number's bit length essential. The clz function gives the bit length as 32 minus the clz result, so an approximate square root is easy to derive.
#### C code
```c=
bool isPerfectSquare(int num) {
    int count_x = 0;             // record iteration times
    int clz = myclz(num);
    int shift = (31 - clz) >> 1; // half of the number's bit length
    uint32_t x = (num >> shift); // shift to an approximate square root
    while (x * x > num) {
        x = (x + num / x) >> 1;
        count_x++;
    }
    printf("count_x: %d\n", count_x);
    return x * x == num;
}
```
### Analysis
I tested two initial values: half the number and the approximate square root. As anticipated, the approach using the approximate square root required fewer iterations than starting with half the number, and the gap widens with larger numbers (see Table 1). In summary, using clz to find an approximate square root is an effective way to reduce iteration counts and speed up the calculation.
**Table 1**
| **Metric** | **1 to INT_MAX** | **INT_MAX - 100 to INT_MAX** | **1 to 100** |
|------------------------------------------|-------------------|------------------------------|--------------|
| **Total Iterations<br> (Original)** | 36,115,969,063 | 1,800 | 263 |
| **Total Iterations<br> (Optimized)** | 6,591,883,267 | 400 | 126 |
| **Ratio<br> (Original / Optimized)** | 5.48 | 4.50 | 2.09 |
| **Average Iterations per i (Original)** | 16.82 | 18 | 2.63 |
| **Average Iterations per i (Optimized)** | 3.07 | 4 | 1.26 |
## The 5-stage pipelined processor in RISC-V
### Introduction
The RISC-V 5-stage processor is a simplified model of a CPU architecture that implements the RISC-V instruction set using a pipeline with five distinct stages. This design enhances instruction throughput by allowing multiple instructions to be processed simultaneously, each at a different stage of execution. The five stages are:
1. **Instruction Fetch (IF)**: The processor retrieves the next instruction to execute from memory using the Program Counter (PC). The instruction is fetched from the instruction memory and placed into the pipeline for processing.
2. **Instruction Decode (ID)**: The fetched instruction is decoded to determine the operation to perform. The processor identifies the opcode, source registers, destination register, and any immediate values. It reads the necessary operands from the register file.
3. **Execute (EX)**: The Arithmetic Logic Unit (ALU) performs the computation specified by the instruction. This could involve arithmetic operations, logical operations, or calculating memory addresses for load and store instructions.
4. **Memory Access (MEM)**: If the instruction requires data to be read from or written to memory (as in load or store instructions), the processor accesses the data memory during this stage. For instructions that do not involve memory access, this stage can be bypassed.
5. **Write-Back (WB)**: The result of the computation or memory operation is written back to the destination register in the register file, completing the execution of the instruction.
### Case study
In this section, I analyze the `sltu` instruction.
#### Instruction Fetch (IF) Stage
The execution of the sltu x6, x28, x7 instruction progresses through the five stages of the RISC-V pipeline as follows. In the Instruction Fetch (IF) stage, the Program Counter (PC) holds the address 0x00000070, which points to the instruction 0x007e3333 (sltu x6, x28, x7). This instruction is fetched from the Instruction Memory and stored in the IF/ID Register for the next stage.

#### Instruction Decode (ID) stage
During the Instruction Decode (ID) stage, the fetched instruction is decoded to identify the opcode sltu, the source registers rs1 (x28) and rs2 (x7), and the destination register rd (x6). The values of registers x28 and x7 are read from the register file, and the decoded instruction details along with the operand values are stored in the ID/EX Register.

#### In the Execute (EX) stage
In the Execute (EX) stage, the ALU performs the unsigned comparison x28 < x7 using the operands from rs1 and rs2. The result of this comparison (0x00000001 if true, 0x00000000 if false) is stored in the ALU result register and passed to the EX/MEM Register.

#### In the Memory stage
Since the sltu instruction does not require memory access, the Memory Access (MEM) stage is effectively bypassed; there is no interaction with Data Memory, and the ALU result is forwarded directly through the MEM/WB Register to the next stage.
#### In the Write-Back (WB) stage
Finally, in the Write-Back (WB) stage, the result (0x00000001) is written into the destination register x6, updating the register file. This completes the execution of the instruction, as the final result is now stored in the appropriate register and is available for any subsequent operations.

## Reference
[1] [Branchless count-leading-zeros on 32-bit RISC-V without Zbb extension](https://stackoverflow.com/questions/78589600/branchless-count-leading-zeros-on-32-bit-risc-v-without-zbb-extension)
[2] [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method#)
[3] [Discussing fast computations of square roots based on the existence of the square root of 2](https://hackmd.io/@sysprog/sqrt#%E4%BA%8C%E9%80%B2%E4%BD%8D%E7%B3%BB%E7%B5%B1%E7%9A%84%E5%B9%B3%E6%96%B9%E6%A0%B9)
[4] [riscv64-unknown-elf-gcc compiler](https://github.com/sifive/freedom-tools/releases/tag/v2020.04.0-Toolchain.Only)