Assignment1: RISC-V Assembly and Instruction Pipeline(Problem C from Quiz1)

# Assignment1: RISC-V Assembly and Instruction Pipeline(Problem C from Quiz1) contributed by < [jimmyli88623](https://github.com/jimmy88623) > The fabsf function is designed to compute the absolute value of a give single-precision floating-point number x. Below, I will present the implementation of this function using RISC-V Assembly Language and optimized it , followed by an analysis of their performance. :::danger Don't abuse the notation `==`. ::: ## fabsf Below is the implementation using C , I will using 0x7FFFFFFF mask to set the left bit to zero ### C code fabsf ```c static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` :::danger Don't paste the code snip without comprehensive discussions. ::: Below is the implementation that transform C to RISC-V, and I add input and output data initial, that can automatically verify the answer. ### Origin RISC-V Assembly Language fabsf ```asm .data input: .word 0xf2133212 ,0x8c45f455,0x05151515 output: .word 0x72133212,0x8c45f455,0x05151515 yes_msg: .string "True \n" no_msg: .string "False \n" .text main: la t0, input # load input address la t1, output # load output address li t2, 3 # input length compare_loop: lw t3, 0(t0) lw t4, 0(t1) mv a0, t3 jal ra, fabsf beq a0, t4, print_yes # compare input and fabsf function output j print_no # print true message print_yes: la a0, yes_msg li a7, 4 ecall j next_compare # print false message print_no: la a0, no_msg li a7, 4 ecall j next_compare # next round next_compare: addi t0, t0, 4 addi t1, t1, 4 addi t2, t2, -1 bnez t2, compare_loop li a7, 10 ecall fabsf: addi sp, sp, -12 # create stack sw s0 ,8(sp) # store s0 sw s1 ,4(sp) # store s1 sw ra, 0(sp) # store ra li s1, 0x7FFFFFFF # load mask and s0, a0, s1 # do operation mv a0, s0 # put the operation result to a0 lw s0, 8(sp) # load s0 lw s1, 4(sp) # load s1 lw ra, 0(sp) # load ra addi sp, sp, 12 # release stack ret # return ``` Below is my optimized method, that will reduce the instruction count and cycle number, and also add automatically test to verify the answer. ### Optimized RISC-V Assembly Language fabsf ```asm .data input: .word 0xf2133212 ,0x8c45f455,0x05151515 output: .word 0xf2133212,0x8c45f455,0x05151515 yes_msg: .string "True \n" no_msg: .string "False \n" .text main: la t0, input # load input address la t1, output # load output address li t2, 3 # input length compare_loop: lw a0, 0(t0) lw a1, 0(t1) jal ra, fabsf beq a0, a1, print_yes # compare input and fabsf function output j print_no # print true message print_yes: la a0, yes_msg li a7, 4 ecall j next_compare # print false message print_no: la a0, no_msg li a7, 4 ecall j next_compare # next round next_compare: addi t0, t0, 4 addi t1, t1, 4 addi t2, t2, -1 bnez t2, compare_loop li a7, 10 ecall # system call fabsf: li t3, 0x7FFFFFFF and a0, a0, t3 ret ``` ### Performace analysis | | Origin | Optimized | | -------- | -------- | -------- | | Cycles | 129 | 100 | | Instruction count | 89 | 60 | | CPI | 1.45 | 1.67 | | IPC | 0.69 | 0.6 | In my optimized method, I aim to reduce the number of registers used to improve both performance and code simplicity. Instead of using multiple temporary registers to store intermediate values, I directly load values into the argument registers , and in the origin implementation, the fabsf function involved saving and restoring multiple registers onto the stack. By limiting the usage of registers inside the function, I eliminated the need for stack operations entirely, resulting in faster execution and a more efficient function call. :::danger What inspired you from Quiz 1, as the assignment asks you to use these inspirations to improve the existing problems? ::: ## __builtin_clz Below is the implementation of clz function using C, i will run a for loop to get each bit number, and update the count ### C code __builtin_clz ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` Below is the implementation that transform C to RISC-V Assembly Language, and add automatically input and output data test to verify the answer. ### Origin RISC-V Assembly Language __builtin_clz ```asm .data input: .word 0xf2133212 ,0x1c45f455,0x05151515 output: .word 0x00000000,0x00000004,0x00000005 yes_msg: .string "True \n" no_msg: .string "False \n" .text main: la t0, input # load input address la t1, output # load output address li t2, 3 # input length # loop compare_loop: lw t3, 0(t0) lw t4, 0(t1) mv a0, t3 jal ra, my_clz beq a0, t4, print_yes # compare input and my_clz function output j print_no # print true message print_yes: la a0, yes_msg li a7, 4 ecall j next_compare # print false message print_no: la a0, no_msg li a7, 4 ecall j next_compare # next round next_compare: addi t0, t0, 4 # point to next input data addi t1, t1, 4 # point to next output data addi t2, t2, -1 bnez t2, compare_loop li a7, 10 ecall # Function to count leading zeros (CLZ) my_clz: addi sp, sp, -16 # create stack sw ra, 12(sp) # store ra sw s0, 8(sp) # store s0 sw s1, 4(sp) # store s1 sw s2, 0(sp) # store s2 li s0, 31 # s0 = i li s1, 0 # s1 = count loop: li s2, 0x1 # load 0x1 sll s2, s2, s0 # shift left and s2, a0, s2 # do "and" operation, take the left bit bne s2, x0, done addi s1, s1, 1 # update count addi s0, s0, -1 # update i bge s0, x0, loop # loop condition done: mv a0, s1 # put the answer into a0 lw ra, 12(sp) # load ra lw s0, 8(sp) # load s0 lw s1, 4(sp) # load s1 lw s2, 0(sp) # load s2 addi sp, sp, 16 # release stack ret # return ``` Below is optimized version, I reduce the register used, and also add automatically input and output test to verify the answer. ### Optimized RISC-V Assembly Language __builtin_clz ```asm .data input: .word 0xf2133212 ,0x1c45f455,0x05151515 output: .word 0x00000000,0x00000003,0x00000005 yes_msg: .string "True \n" no_msg: .string "False \n" .text main: la t0, input # load input address la t1, output # load output address li t2, 3 # input length compare_loop: lw a0, 0(t0) lw a1, 0(t1) jal ra, my_clz beq a0, a1, print_yes # compare input and my_clz function output j print_no # print true message print_yes: la a0, yes_msg li a7, 4 ecall j next_compare # print false message print_no: la a0, no_msg li a7, 4 ecall j next_compare # next round next_compare: addi t0, t0, 4 addi t1, t1, 4 addi t2, t2, -1 bnez t2, compare_loop li a7, 10 ecall # Function to count leading zeros (CLZ) my_clz: li t3, 31 # s0 = i li t4, 0 # s1 = count loop: li t5, 0x1 # load 0x1 sll t5, t5, t3 # shift left and t5, a0, t5 # do "and" operation, take the left bit bne t5, x0, done addi t4, t4, 1 # update count addi t3, t3, -1 # update i bge t3, x0, loop # loop condition done: mv a0, t4 # put the answer into a0 ret # return ``` ### Performace analysis | | Origin | Optimized | | -------- | -------- | -------- | | Cycles | 222 | 188 | | Instruction count | 160 | 126 | | CPI | 1.39 | 1.49 | | IPC | 0.721 | 0.67 | In my optimized method, I aim to reduce the number of registers used to improve both performance and code simplicity. Instead of using multiple temporary registers to store intermediate values, I directly load values into the argument registers , and in the origin implementation, the ==my_clz== function involved saving and restoring multiple registers onto the stack. By limiting the usage of registers inside the function, I eliminated the need for stack operations entirely, resulting in faster execution and a more efficient function call. --- Below is the implementation of fp16_to_fp32 function, since fp16 is different with fp32 , so we need to do transform to each part , and in the final will combine each part . ## fp16_to_fp32 ### C code fp16_to_fp32 ```c static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` Below is the implementation that transform C to RISC-V Assembly Language, and add automatically input and output test data to verify the answer. ### Origin RISC-V Assembly Language fp16_to_fp32 ```asm .data input: .word 0x3C00, 0x4000, 0x6219 output: .word 0x3F800000, 0x40000000, 0x44432000 yes_msg: .string "True \n" no_msg: .string "False \n" .text main: la t3, input # load input address la t4, output # load output address li t5, 3 # input length compare_loop: lw a0, 0(t3) lw a1, 0(t4) jal ra, fp16_to_fp32 # compare input and fp16_to_fp32 function output beq a0, a1, print_yes j print_no # print true message print_yes: la a0, yes_msg li a7, 4 ecall j next_compare # print false message print_no: la a0, no_msg li a7, 4 ecall j next_compare # next round next_compare: addi t3, t3, 4 addi t4, t4, 4 addi t5, t5, -1 bnez t5, compare_loop li a7, 10 ecall fp16_to_fp32: addi sp, sp, -32 sw ra, 28(sp) sw s0, 24(sp) sw s1, 20(sp) sw s2, 16(sp) sw s3, 12(sp) sw s4, 8(sp) sw s5, 4(sp) sw s6, 0(sp) mv s0, a0 # s0 store input uint16 slli s1, s0, 16 # s1 store w li t1, 0x80000000 and s2, s1, t1 # s2 store sign li t1, 0x7FFFFFFF and s3, s1, t1 # s3 store nonsign mv a0, s3 jal ra, my_clz mv s4, a0 # s4 store my_clz result(renorm_shift) li s5, 6 # s5 store temp 5 blt s4, s5, set_zero addi s4, s4, -5 # update s4 j go set_zero: li s4, 0 go: li t2, 0x04000000 add s5, s3, t2 srli s5, s5, 8 li t2, 0x7F800000 and s5, s5, t2 # s5 update store inf_nan_mask addi s6, s3, -1 srli s6, s6, 31 # s6 store zero_mask sll s3, s3, s4 # nonsign << renorm_shift srli s3, s3, 3 # s3 store (nonsign << renorm_shift >> 3) li t1, 0x70 sub t1, t1, s4 # (0x70 - renorm_shift) slli t1, t1, 23 # t1 store ((0x70 - renorm_shift) << 23) add s3, s3, t1 or s3, s3, s5 not t1, s6 and s3, s3, t1 or s3, s3, s2 mv a0, s3 lw ra, 28(sp) lw s0, 24(sp) lw s1, 20(sp) lw s2, 16(sp) lw s3, 12(sp) lw s4, 8(sp) lw s5, 4(sp) lw s6, 0(sp) addi sp, sp, 32 ret # Function to count leading zeros (CLZ) my_clz: addi sp, sp, -16 # create stack sw ra, 12(sp) # store ra sw s0, 8(sp) # store s0 sw s1, 4(sp) # store s1 sw s2, 0(sp) # store s2 li s0, 31 # s0 = i li s1, 0 # s1 = count loop: li s2, 0x1 # load 0x1 sll s2, s2, s0 # shift left and s2, a0, s2 # do "and" operation, take the left bit bne s2, x0, done addi s1, s1, 1 # update count addi s0, s0, -1 # update i bge s0, x0, loop # loop condition done: mv a0, s1 # put the answer into a0 lw ra, 12(sp) # load ra lw s0, 8(sp) # load s0 lw s1, 4(sp) # load s1 lw s2, 0(sp) # load s2 addi sp, sp, 16 # release stack ret # return ``` Below is the optimized version of RISC-V program, that I reduce the register number used and can minimize overall register usage in function. ### Optimized RISC-V Assembly Language fp16_to_fp32 ```asm .data input: .word 0x3C00, 0x4000, 0x6219 output: .word 0x3F800000, 0x40000000, 0x44432000 yes_msg: .string "True \n" no_msg: .string "False \n" .text main: la t3, input # load input address la t4, output # load output address li t5, 3 # input length compare_loop: lw a0, 0(t3) lw a1, 0(t4) jal ra, fp16_to_fp32 beq a0, a1, print_yes # compare input and fp16_to_fp32 function output j print_no # print true message print_yes: la a0, yes_msg li a7, 4 ecall j next_compare # print false message print_no: la a0, no_msg li a7, 4 ecall j next_compare # next round next_compare: addi t3, t3, 4 addi t4, t4, 4 addi t5, t5, -1 bnez t5, compare_loop li a7, 10 ecall fp16_to_fp32: addi sp, sp, -16 sw ra, 12(sp) sw a0, 8(sp) lw s0, 8(sp) slli s1, s0, 16 # shift left li t1, 0x80000000 and s2, s1, t1 # s2 store sign li t1, 0x7FFFFFFF and s3, s1, t1 # s3 store nonsign mv a0, s3 jal ra, my_clz mv s4, a0 # s4 store renorm_shift li s5, 6 blt s4, s5, set_zero addi s4, s4, -5 # update renorm_shift value j go set_zero: li s4, 0 go: li t2, 0x04000000 add s5, s3, t2 srli s5, s5, 8 li t2, 0x7F800000 and s5, s5, t2 # s5 store inf_nan_mask addi s6, s3, -1 srli s6, s6, 31 # s6 store zero_mask sll s3, s3, s4 # s3 store (nonsign << renorm_shift ) srli s3, s3, 3 # s3 store (nonsign << renorm_shift >> 3) li t1, 0x70 sub t1, t1, s4 # t1 store (0x70 - renorm_shift) slli t1, t1, 23 # t1 store (0x70 - renorm_shift) << 23) add s3, s3, t1 # (nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23) not t1, s6 # ~zero_mask and s3, s3, t1 # ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask) or s3, s3, s2 # sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask) mv a0, s3 lw ra, 12(sp) addi sp, sp, 16 ret my_clz: addi sp, sp, -16 sw ra, 12(sp) sw a0, 8(sp) li s0, 31 li s1, 0 loop: li s2, 0x1 sll s2, s2, s0 and s2, a0, s2 bne s2, x0, done addi s1, s1, 1 addi s0, s0, -1 bge s0, x0, loop done: mv a0, s1 lw ra, 12(sp) addi sp, sp, 16 ret ``` ### Performace analysis | | Origin | Optimized | | -------- | -------- | -------- | | Cycles | 350 | 296 | | Instruction count | 278 | 221 | | CPI | 1.26 | 1.24 | | IPC | 0.794 | 0.747 | My optimized method aims to reduce the number of registers, thereby improving both performance and code simplicity. By utilizing argument registers, I can replace certain temporary registers and minimize the overall register usage in other functions. ## LeetCode Problem ### [LeetCode 1550. Three Consecutive Odds](https://leetcode.com/problems/three-consecutive-odds/description/) In LeetCode problem 1550 "Three Consecutive Odds," the task is to determine whether the given array contains three consecutive odd numbers. When it comes to checking if a number is odd, using bit manipulation is an efficient approach. Specifically, the least significant bit (LSB) of any odd number is 1. Therefore, we can use bitwise operations to determine if a number is odd, so I choose clz function from Quiz1 to help implementation. ### C code ```c class Solution { public: bool threeConsecutiveOdds(vector<int>& arr) { int count = 0; for(int i=0;i<arr.size();i++){ if(arr[i] & 1 == 1){ count += 1; if(count == 3){ return true; } continue; } count = 0; } return false; } }; ``` ### add clz function ```c int clz(uint32_t x){ int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } bool threeConsecutiveOdds(int* arr, int arrSize) { int count = 0; for (int i = 0; i < arrSize; i++) { uint32_t value = arr[i]; uint32_t shifted_value = (value << 31) >> 31; int leading_zeros = clz(shifted_value); if (leading_zeros == 31) { count++; if (count == 3) { return true; } } else { count = 0; } } return false; } ``` ### Origin RISC-V Assembly Language ```asm .data arr: .word 7,7,7 msg_true: .string "True \n" msg_false: .string "False \n" .text main: la s10, arr # Load address of array into a0 (first argument to function) li a1, 3 # Load array size (6 in this case) into a1 (second argument) call threeConsecutiveOdds # Call the function # Check the returned result and print appropriate message beq a0, x0, print_false # If returned value is 0 (false), jump to print_false j print_true # Else, jump to print_true print_true: li a0,1 la a0, msg_true # Load address of "True" message into a0 li a7, 4 # Syscall for print string ecall # Make the system call li a7, 10 # Syscall for exit ecall # End the program after printing print_false: li a0, 0 la a0, msg_false # Load address of "False" message into a0 li a7, 4 # Syscall for print string ecall li a7, 10 # Syscall for exit ecall # Exit the program threeConsecutiveOdds: addi s8, x0, 0 # Initialize count to 0 (s0 = count) addi s9, x0, 0 # Initialize i to 0 (s1 = loop counter) li s2, 3 # Load 3 into s2 (for checking consecutive count) loop_start: bge s9, a1, print_false # if i >= arrSize, return false slli s3, s9, 2 # Calculate offset for arr[i] (s3 = i * 4) add s3, s10, s3 # Load address of arr[i] into s3 lw s4, 0(s3) # Load arr[i] into s4 slli s5, s4, 31 # Shift value left by 31 bits srli s5, s5, 31 # Shift value right by 31 bits (extract last bit) mv a0,s5 li s7, 0x1f jal ra, my_clz # Use clz to count leading zeros in s5 beq a0, s7, increment_count # If leading zeros == 31, increment count j reset_count reset_count: addi s8, x0, 0 # Reset count to 0 addi s9, s9, 1 # i++ j loop_start # Jump to start of the loop increment_count: addi s8, s8, 1 # count++ beq s8, s2, print_true # If count == 3, return true addi s9, s9, 1 # i++ j loop_start # Jump to start of the loop my_clz: addi sp, sp, -24 # create stack sw ra, 20(sp) # store ra sw s0, 16(sp) # store s0 sw s1, 12(sp) # store s1 sw s2, 8(sp) # store s2 sw s3, 4(sp) # store s3 sw s4, 0(sp) # store s4 li s0, 31 # s0 = i li s1, 0 # s1 = count loop: li s4, 0x1 # load 0x1 sll s2, s4, s0 # shift left and s3, a0, s2 # do "and" operation, take the left bit bne s3, x0, done addi s1, s1, 1 # update count addi s0, s0, -1 # update i bge s0, x0, loop # loop condition done: mv a0, s1 # put the result into a0 lw ra, 20(sp) # load ra lw s0, 16(sp) # load s0 lw s1, 12(sp) # load s1 lw s2, 8(sp) # load s2 lw s3, 4(sp) # load s3 lw s4, 0(sp) # load s4 addi sp, sp, 24 # release stack ret # return ``` ### Optimized RISC-V Assembly Language ```asm .data arr: .word 1,3,4,5,6,8,3,3,1 msg_true: .string "True \n" msg_false: .string "False \n" .text main: la s10, arr # Load address of array into s10 li a1, 9 # Load array size into a1 call threeConsecutiveOdds # Call the function # Check the returned result and print appropriate message beq a0, x0, print_false # If 0 (false), jump to print_false j print_true # Else, jump to print_true print_true: la a0, msg_true # Load address of "True" message into a0 li a7, 4 # Syscall for print string ecall # Make the system call li a7, 10 # Syscall for exit ecall # End the program after printing print_false: la a0, msg_false # Load address of "False" message into a0 li a7, 4 # Syscall for print string ecall li a7, 10 # Syscall for exit ecall # Exit the program threeConsecutiveOdds: addi s8, x0, 0 # Initialize count to 0 addi s9, x0, 0 # Initialize i to 0 li s2, 3 # Load 3 for consecutive check loop_start: bge s9, a1, print_false # if i >= arrSize, return false lw s4, 0(s10) # Load arr[i] into s4 addi s10, s10, 4 # Move to next array element andi s5, s4, 1 # Check if arr[i] is odd (last bit == 1) beqz s5, reset_count # If not odd, reset count addi s8, s8, 1 # Increment count beq s8, s2, print_true # If count == 3, return true addi s9, s9, 1 # i++ j loop_start # Jump to start of the loop reset_count: addi s8, x0, 0 # Reset count to 0 addi s9, s9, 1 # i++ j loop_start # Jump to start of the loop my_clz: addi sp, sp, -8 # Allocate stack space sw ra, 4(sp) # Store return address sw s0, 0(sp) # Store s0 (loop counter) li s0, 31 # Set loop counter to 31 li a0, 0 # Initialize count of leading zeros clz_loop: srl s5, a1, s0 # Shift right by s0 bits andi s5, s5, 1 # Isolate the least significant bit bnez s5, clz_done # If we find a 1, break the loop addi a0, a0, 1 # Increment the count addi s0, s0, -1 # Decrement the loop counter bgez s0, clz_loop # Repeat until s0 >= 0 clz_done: lw ra, 4(sp) # Restore return address lw s0, 0(sp) # Restore s0 addi sp, sp, 8 # Free stack space ret # Return ``` ### Performance Analysis | | Origin | Optimized | | -------- | -------- | -------- | | Cycles | 1001 | 124 | | Instruction count | 772 | 90 | | CPI | 1.3 | 1.38 | | IPC | 0.771 | 0.726 | In my optimized method, I have changed the way array elements are loaded, resulting in a reduction in the instruction count. Additionally, I combined several function calls, further decreasing the overall instruction count.