Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`Jiegoqqq`](https://github.com/Jiegoqqq/Computer_Architecture) > ## Problem C in [Quiz1]() ### C Code Of Function fabsf The function of **fabsf** is to calculate the absolute value of a single-precision floating-point number (float) by inputting a floating-point number x as a parameter and returning the absolute value of the floating-point number. This function uses inline, which avoids the overhead of function calls and inlines the program directly during the compilation phase. Below is the origin function. ```s static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` ### Assembly Code Of Function fabsf ```s fabsf: la t0, argument lw t0, 0(t0) li t1, 0x7FFFFFFF and t0, t0, t1 ``` ### C Code Of Function my_clz The function of **my_clz** is to counts the number of leading zeros in the binary representation of a given 32-bit unsigned integer (uint32_t).Inside the loop, use `x & (1U << i)` to check if the i-th bit of x is equal 1 and use bitwise AND operation with x determines if that bit in x is 1. If it is 1, add 1 to the count, if not, continue looping. Below is the origin function. ```s static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` ### Assembly Code Of Function my_clz ```s .data Input: .word 0x0FFFFFFF .text la a1, Input lw a1, 0(a1) jal my_clz li a7, 1 ecall li a7, 10 ecall my_clz: addi t0, x0, 31 addi t1, x0, 0 addi t2, x0, 1 loop: blt t0, x0, exit sll t3, t2, t0 and t3, a1, t3 bnez t3, exit addi t0, t0, -1 addi t1, t1, 1 j loop exit: mv a0, t1 ret ``` ### C Code Of Function fp16_to_fp32 The function of **fp16_to_fp32** is to converts a 16-bit floating-point number in IEEE half-precision format (bit representation) to a 32-bit floating-point number in IEEE single-precision format (bit representation). This implementation will avoid using any floating-point arithmetic and will utilize the clz function discussed earlier.Below is the origin function. ```s static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` #### fp16 (Half-precision floating-point) The Definition of **fp16** ：Is a 16-bit floating-point number representation format commonly used in fields such as computer graphics and deep learning that require smaller data sizes. The format of FP16 is based on the IEEE 754 standard and is similar to the more common single-precision (FP32) and double-precision (FP64) floating point numbers, but with lower precision and a smaller data range. ![image](https://hackmd.io/_uploads/r17b4cDkyl.png) * **Sign (s):** 1 bit (highest bit) Occupying 1 bit,in the leftmost 1 bit, it represents the positive and negative sign of the value. 0 represents a positive number and 1 represents a negative number. * **Exponent (E):** 5 bits It occupies 5 digits and represents the exponent part of the value. Using bias encoding, the actual value of the exponent is the stored value minus the offset 15. * **Mantissa (M):** 10 bits It occupies 10 digits and represents the decimal part of the value. As with FP32 and FP64, a "1" is implied before the mantissa, unless the exponent values are all 0 (in which case the value is subnormal or zero). **Structure of FP16:** The numerical expression formula is： $$ \text{Value} = (-1)^{S} \times 2^{(E - 15)} \times \left( 1 + M \times 2^{-10} \right) $$ Among them, the range of the exponent value $E$ is 1 to 30, and the corresponding range of the exponent is $-14$ to 15. **Special condition:** * **Zero:** When the exponent bit and the mantissa bit are both 0, it means positive zero or negative zero. * **Infinity:** The exponent bits are all 1 and the mantissa bits are all 0, indicating positive or negative infinity. * **Not a Number (NaN):** The exponent bits are all 1 and the mantissa bits are non-0, indicating NaN (Not a Number). * **Subnormal numbers:** The exponent bits are all 0 and the mantissa bits are non-0, indicating very small values without an implicit "1". #### fp32 (single-precision floating-point) The Definition of **fp32** ：It is a data format used to represent floating-point numbers and is widely used in computer science and mathematical operations. The definition of FP32 follows the IEEE 754 standard. This format divides 32-bit data into three parts to represent a floating-point number: the sign bit (sign bit), the exponent bit (exponent), and the mantissa bit (mantissa or significand). ![image](https://hackmd.io/_uploads/ryXMVqwyyx.png) * **Sign (s):** 1 bit (highest bit) The sign bit determines whether a floating point number is positive or negative. If the sign bit is 0, it represents a positive number; if it is 1, it represents a negative number. * **Exponent(E):** 8 bits The exponent bit represents the range of values and is used to adjust the size of floating point numbers. FP32 uses bias notation with a bias value of 127. That is, the actual exponent value is the stored exponent minus 127. * **Mantissa(M):** 23 digits The mantissa bit is used to represent the precision part of the value. FP32 uses an implicit "1" to increase the precision of the mantissa, which means that the actual value is equal to 1 plus the mantissa. **Structure of FP32:** The numerical expression formula is： $$ \text{Value} = (-1)^{S} \times 2^{(E - 127)} \times \left( 1 + M \right) $$ **Special condition:** * **Zero:** When the sign bit is 0 or 1, the exponent bit and the mantissa bit are all 0, it means positive zero or negative zero. * **Infinity:** When the exponent bits are all 1 and the mantissa bits are all 0, it means positive infinity or negative infinity. * **Not a numeric value (NaN, Not a Number):** When the exponent bits are all 1 and the mantissa bit is not 0, it means NaN (for example, an invalid operation result, such as 0 divided by 0). ### Assembly Code Of Function fp16_to_fp32 ```s .data InputValue: .word 0x007f # Input value for binaryGap endl: .string "\n" # For printing .text la t0, InputValue lw t0, 0(t0) slli t0, t0, 16 li t1, 0x80000000 and a1, t0, t1 li t2, 0x7FFFFFFF and a2, t0, t2 jal my_clz mv a3, a0 # Move clz result (in a0) to a3 (renorm_shift) li t4, 5 # Load the value 5 into register t1 blt a3, t4, set_zero # If renorm_shift < 5, jump to set_zero sub a3, a3, t4 # renorm_shift = renorm_shift - 5 j done # Jump to done set_zero: li a3, 0 # renorm_shift = 0 done: #nan mask li t1, 0x04000000 # Load constant 0x04000000 into t1 add a4, a3, t1 # a4 = nonsign + 0x04000000 srli a4, a4, 8 # Shift a4 right by 8 bits (a4 >> 8) li t2, 0x7F800000 # Load mask constant 0x7F800000 into t2 and a4, a4, t2 # Perform bitwise AND: a4 = a4 & 0x7F800000 #zero mask addi a5, a3, -1 # a5 = nonsign - 1 srai a5, a5, 31 # Arithmetic shift right by 31 bits (a5 >> 31) # Shift nonsign left by renorm_shift sll a6, a2, a3 # a6 = nonsign << renorm_shift # Then shift right by 3 srli a6, a6, 3 # a6 = a6 >> 3 # Subtract renorm_shift from 0x70 (112 decimal) li a7, 0x70 # a7 = 0x70 (112) sub a7, a7, a3 # a7 = 0x70 - renorm_shift # Shift (0x70 - renorm_shift) by 23 bits slli a7, a7, 23 # a7 = (0x70 - renorm_shift) << 23 # Add to the previous result add a6, a6, a7 # a6 = (nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23) # OR with inf_nan_mask or a6, a6, a4 # a6 = a6 | inf_nan_mask # Compute the NOT of zero_mask not a5, a5 # a5 = ~zero_mask # AND the result with ~zero_mask and a6, a6, a5 # a6 = a6 & ~zero_mask # OR the final result with sign or a0, a1, a6 # a0 = sign | a6 li a7, 1 # Load syscall number for print_int ecall # Exit program li a7, 10 # Syscall for exit ecall my_clz: addi t4, x0, 31 addi t1, x0, 0 addi t2, x0, 1 loop: blt t4, x0, exit sll t3, t2, t4 and t3, a2, t3 bnez t3, exit addi t4, t4, -1 addi t1, t1, 1 j loop exit: mv a0, t1 ret ``` ## Binary Gap ([LeetCode 868](https://leetcode.com/problems/binary-gap/description/)) - Difficulity : easy - Problem description : Given a positive integer n, find and return the longest distance between any two adjacent 1's in the binary representation of n. If there are no two adjacent 1's, return 0. Two 1's are adjacent if there are only 0's separating them (possibly no 0's). The distance between two 1's is the absolute difference between their bit positions. For example, the two 1's in "1001" have a distance of 3. - Test Case : - Example1 : Input : n=22 Output : 2 Explanation : 22 in binary is "10110". The first adjacent pair of 1's is "10110" with a distance of 2. The second adjacent pair of 1's is "10110" with a distance of 1. The answer is the largest of these two distances, which is 2. Note that "10110" is not a valid pair since there is a 1 separating the two 1's underlined. - Example2 : Input : n=8 Output : 0 Explanation :8 in binary is "1000". There are not any adjacent pairs of 1's in the binary representation of 8, so we return 0. - Example3 : Input : n=2098185 Output : 11 Explanation :2098185 in binary is "1000000000010000001001". - Constraints: 1 <= n <= 10<sup>9</sup> ### Use Function Code in Quiz1 Use the Function of```my_clz``` in quiz 1 Problem C , the function ```my_clz``` is to count the number of leading zero in current number and then left shift n bit to accelartion the caculation speed. The following function is the method of counting leading zero in Problem C of the quiz1. ```s static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` and change to not use loop (branchless clz): ```s uint16_t branchless_clz(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } ``` ## C code ### First Version (not use clz function) In first version i use the variable max_gap to store the maximun gap value, and use the variable last_position to record the position where 1 last appeared. In the for loop it traverse from the highest bit of the integer to the lowest bit and sse bitwise operations `n & (1u << current_position)` to check if each bit is 1. If a bit is checked to be 1, calculate the distance gap between it and the last position where 1 occurred. Ultimately, max_gap is returned, which is the distance between the largest consecutive zeros found. ```s int BinaryGap(int n) { int max_gap = 0; int last_position = -1; // use to saved the place of "1", origin set to -1 for (int current_position = 31; current_position >= 0; current_position--) { if (n & (1u << current_position)) { // use the unsigned int of 1 if (last_position != -1) { int gap = last_position - current_position; if (gap > max_gap) { max_gap = gap; } } last_position = current_position; // Update the last occurrence of '1' } } return max_gap; } ``` ### Second Verion (use my_clz function) (with loop) In second version i use the my_clz function to count the leading zero.I try this way to accleration the process. First i use the unsigned 1U and shift it i bits to the left and check if the i-th bit of x is 1.If a bit is found to be 1, the loop ends and the current accumulated count value is returned, which represents the number of leading 0s. ```s // clz function with loop #include <stdio.h> #include <stdint.h> static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int BinaryGap(int n) { // remove the leading zero int clz = my_clz(n); int max_gap = 0; int last_position = -1; // Store the position of the previous '1' // Start scanning from the remaining valid bits for (int i = 31 - clz; i >= 0; --i) { if (n & (1U << i)) { if (last_position != -1) { int gap = last_position - i; if (gap > max_gap) { max_gap = gap; } } last_position = i; // Update the last occurrence of '1' } } return max_gap; } ``` ### Third Verion (use branchless_clz function) (without loop) In third version i referenced [brenchless_clz](https://hackmd.io/@sysprog/arch2023-quiz1-sol#Problem-A) to replaced the my_clzfunction. Compared with my_clz, brenchless_clz does not use loops, it through right shift and bitwise OR operation (|=), all the bits of x are filled with 1 in order from the most significant bit to the least significant bit. That is to say, as long as one bit in x is 1, then the All bits to the right of the bit are also set to 1. This step converts x into a number ending in a series of ones.The next steps utilize the classic Hamming Weight algorithm to calculate the number of bits in x. ```s // clz function without loop #include <stdio.h> #include <stdint.h> uint16_t branchless_clz(uint32_t x) { //fill all bits x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); //perform fast bit counting x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } int BinaryGap(int n) { // remove the leading zero int clz = branchless_clz(n); int max_gap = 0; int last_position = -1; // Store the position of the previous '1' // Start scanning from the remaining valid bits for (int i = 31 - clz; i >= 0; --i) { if (n & (1U << i)) { if (last_position != -1) { int gap = last_position - i; if (gap > max_gap) { max_gap = gap; } } last_position = i; // Update the last occurrence of '1' } } return max_gap; } ``` use the main function to get the **BinaryGap** result ```clike= int main() { int example1 = 22; int binary_gap_result1 = BinaryGap(example1); int example2 = 8; int binary_gap_result2 = BinaryGap(example2); int example3 = 2098185; int binary_gap_result3 = BinaryGap(example3); printf("The Binary Gap of %d is: %d\n", example1, binary_gap_result1); printf("The Binary Gap of %d is: %d\n", example2, binary_gap_result2); printf("The Binary Gap of %d is: %d\n", example3, binary_gap_result3); return 0; } ``` ## Assembly code ### Define Static Data I used three test data and gave three correct answers. When the program runs the test data and gets the correct answer, it will print out the correct answer together with the answer output by the program. ```clike= .data InputValue1: .word 22 # Input value for binaryGap InputValue2: .word 8 InputValue3: .word 2098185 Actual1: .word 2 Actual2: .word 0 Actual3: .word 11 max_gap: .word 0 # Store max_gap last_position: .word -1 # Store the position of the last '1' gap: .word 0 # Store temporary gap calculation result_text1: .string "The input value is " result_text2: .string " and the answer is " result_text3: .string " .The actual answer is " endl: .string "\n" # For printing ``` ### BinaryGap ```clike= .text #First Input la t0, InputValue1 lw t0, 0(t0) #First Actual la a2, Actual1 lw a2, 0(a2) # Initialize max_gap = 0 la t2, max_gap li t3, 0 sw t3, 0(t2) #last_positioan = -1 la t2, last_position li t3, -1 sw t3, 0(t2) jal ra, process_case #Second Input la t0, InputValue2 lw t0, 0(t0) #Second Actual la a2, Actual2 lw a2, 0(a2) # Initialize max_gap = 0 la t2, max_gap li t3, 0 sw t3, 0(t2) #last_position = -1 la t2, last_position li t3, -1 sw t3, 0(t2) jal ra, process_case #Third Input la t0, InputValue3 lw t0, 0(t0) #Third Actual la a2, Actual3 lw a2, 0(a2) # Initialize max_gap = 0 la t2, max_gap li t3, 0 sw t3, 0(t2) #last_position = -1 la t2, last_position li t3, -1 sw t3, 0(t2) jal ra, process_case # Exit program li a7, 10 # Syscall for exit ecall process_case: addi sp, sp, -4 sw ra, 0(sp) # Call count_leading_zero to get clz jal ra, my_clz lw ra, 0(sp) addi sp, sp, 4 # Load clz value (from a0) mv t1, a0 # Store clz (leading zero count) from a0 to t1 # Start scanning from the remaining valid bits: i = 31 - clz li t4, 31 # t4 = 31 sub t4, t4, t1 # t4 = 31 - clz (i = 31 - clz) scan_loop: blt t4, zero, print # If i < 0, exit loop # Check if (n & (1U << i)) li t5, 1 # t5 = 1 sll t5, t5, t4 # t5 = 1 << i and t6, t0, t5 # t6 = n & (1 << i) beqz t6, next # If t6 == 0 (no '1' at this position), skip # Check if last_position != -1 la t2, last_position # Load last_position address lw t3, 0(t2) # t3 = last_position li t2, -1 # Load -1 for comparison beq t3, t2, update_last_pos # If last_position == -1, update it # Calculate gap = last_position - i sub a1, t3, t4 # a1 = last_position - (gap) # Compare gap with max_gap, if gap > max_gap, update max_gap la t2, max_gap # Load max_gap address lw t3, 0(t2) # t3 = max_gap bge t3, a1, update_last_pos # if max_gap >= gap, skip update sw a1, 0(t2) # Store a1 (new max_gap) in max_gap update_last_pos: # Update last_position = i la t2, last_position # Load last_position address sw t4, 0(t2) # Store i in last_position next: # Decrement i addi t4, t4, -1 j scan_loop # Repeat the loop print: # Print descriptive text1 la a0, result_text1 #(result_text1) li a7, 4 #(syscall number for print_string) ecall # Return InputValue mv a0,t0 li a7, 1 # Load syscall number for print_int ecall # Print descriptive text2 la a0, result_text2 #(result_text2) li a7, 4 #(syscall number for print_string) ecall # Return max_gap la t2, max_gap # Load max_gap address lw a0, 0(t2) # a0 = max_gap # Print the result li a7, 1 # Load syscall number for print_int ecall # Make the system call to print a0 (max_gap) # Print descriptive text3 la a0, result_text3 #(result_text3) li a7, 4 #(syscall number for print_string) ecall # Return ActualValue mv a0,a2 li a7, 1 # Load syscall number for print_int ecall # Print newline la a0, endl # Load the address of the newline string (\n) li a7, 4 # Load syscall number for print_string ecall ret ``` I tried to use the my_clz function with a loop and the branchless_clz function without a loop to compare the differences between the two functions. ### my_clz (with loop) ```clike= my_clz: addi t4, x0, 31 addi t1, x0, 0 #set a variable to count zero addi t2, x0, 1 #use unsign 1 to bitwise count loop: blt t4, x0, exit sll t3, t2, t4 and t3, t0, t3 bnez t3, exit addi t4, t4, -1 addi t1, t1, 1 j loop exit: mv a0, t1 ret ``` The output of the test data: ![image](https://hackmd.io/_uploads/H1ltFxdyye.png) ### branchless_clz (without loop) ```clike= branchless_clz: # x |= (x >> 1); srli t1, t0, 1 or a0, t0, t1 # x |= (x >> 1) # x |= (x >> 2); srli t1, a0, 2 # t1 = x >> 2 or a0, a0, t1 # x |= (x >> 2) # x |= (x >> 4); srli t1, a0, 4 # t1 = x >> 4 or a0, a0, t1 # x |= (x >> 4) # x |= (x >> 8); srli t1, a0, 8 # t1 = x >> 8 or a0, a0, t1 # x |= (x >> 8) # x |= (x >> 16); srli t1, a0, 16 # t1 = x >> 16 or a0, a0, t1 # x |= (x >> 16) # x -= ((x >> 1) & 0x55555555); srli t1, a0, 1 # t1 = x >> 1 li t2, 0x55555555 and t1, t1, t2 # t1 = (x >> 1) & 0x55555555 sub a0, a0, t1 # x -= ((x >> 1) & 0x55555555) # x = ((x >> 2) & 0x33333333) + (x & 0x33333333); srli t1, a0, 2 # t1 = x >> 2 li t2, 0x33333333 and t1, t1, t2 # t1 = (x >> 2) & 0x33333333 and a0, a0, t2 # a0 = x & 0x33333333 add a0, t1, a0 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) # x = ((x >> 4) + x) & 0x0f0f0f0f; srli t1, a0, 4 # t1 = x >> 4 add a0, t1, a0 # x = (x >> 4) + x li t1, 0x0f0f0f0f and a0, a0, t1 # x = ((x >> 4) + x) & 0x0f0f0f0f # x += (x >> 8); srli t1, a0, 8 # t1 = x >> 8 add a0, a0, t1 # x += (x >> 8) # x += (x >> 16); srli t1, a0, 16 # t1 = x >> 16 add a0, a0, t1 # x += (x >> 16) # return (32 - (x & 0x3f)); li t1, 32 li t2, 0x3f and a0, a0, t2 # x = x & 0x3f sub a0, t1, a0 # a0 = 32 - (x & 0x3f) ret ``` The output of the test data: ![image](https://hackmd.io/_uploads/rJZwcldyke.png) ## Analysis I use online [Ripes](https://ripes.me/) to simulate. Ripes provides different kinds of processor for users to choose such as single cycle processor, 5-stage processor. ### Five - stage pipelined processor The five-stage pipeline in RISC-V (or other RISC architectures) is a common method to improve CPU performance by breaking down the execution of an instruction into five distinct steps. Each step, or "stage," is responsible for a part of the instruction execution process. By overlapping the execution of multiple instructions, the pipeline allows the CPU to work on several instructions simultaneously, increasing throughput. ![image](https://hackmd.io/_uploads/SkqqAQ51yl.png) * **Instruction Fetch (IF):** In this stage, the CPU retrieves the next instruction to be executed from memory (usually from the instruction cache). The program counter (PC) is incremented to point to the next instruction. * **Instruction Decode (ID):** The fetched instruction is decoded in this stage. The CPU identifies the instruction type (e.g., arithmetic, memory access, etc.) and determines the necessary control signals. Simultaneously, the source registers for the instruction are read from the register file. * **Execution (EX):** This is where the actual computation happens. Depending on the type of instruction, this stage performs arithmetic or logical operations (for example, adding two numbers), computes memory addresses for load/store instructions, or determines the target for branch instructions. * **Memory Access (MEM):** In this stage, if the instruction is a load or store, the CPU accesses data memory. For a load instruction, it fetches the data from memory, and for a store instruction, it writes data to memory. If no memory operation is needed, this stage is bypassed. * **Write Back (WB):** The results of the instruction are written back to the destination register. For example, if the instruction was an arithmetic operation, the result is written back to the register file. If it was a load instruction, the data fetched from memory is written to the register. ### Code Analysis #### use the my_clz function The memory use in use my_clz function: ![image](https://hackmd.io/_uploads/ByPqKgukkg.png) The clock cycles of my_clz program: ![image](https://hackmd.io/_uploads/BkX1qxuyyg.png) #### use the branchless_clz function The memory use in use branchless_clz function: ![image](https://hackmd.io/_uploads/rJks9xukyx.png) The clock cycles of branchless_clz program: ![image](https://hackmd.io/_uploads/B1fnclu11x.png)  ## Reference https://zh.wikipedia.org/zh-tw/%E5%8D%8A%E7%B2%BE%E5%BA%A6%E6%B5%AE%E7%82%B9%E6%95%B0 https://zh.wikipedia.org/zh-tw/%E5%96%AE%E7%B2%BE%E5%BA%A6%E6%B5%AE%E9%BB%9E%E6%95%B8 https://hackmd.io/@sysprog/arch2023-quiz1-sol#Problem-A https://hackmd.io/@sysprog/arch2024-quiz1-sol