# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [JerryYuLin](https://github.com/JerryYuLin) >
For this assignment, we are required to select one problem (A, B, or C) from [Quiz 1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate the chosen C code into a fully functional RISC-V assembly program, and include the necessary test data. I have chosen to work on Problem C for this task.
## 1. Quiz1 - Problem C
Problem C consists of three functions: `fabsf`, `my_clz`, and `fp16_to_fp32`. Below are the implementations of these functions in both C and RISC-V assembly language.
### 1.1 `fabsf`
The following C program defines an inline function `fabsf(float x)`, which computes the absolute value of a single-precision floating-point number by clearing the sign bit of its IEEE 754 bit representation.
```c
static inline float fabsf(float x) {
    uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
    i &= 0x7FFFFFFF;              // Clear the sign bit to get the absolute value
    x = *(float *)&i;             // Write the modified bits back into the float
    return x;
}
```
The `fabsf()` function can be translated into RISC-V assembly that performs the same bit manipulation. A direct, unoptimized translation is shown below:
```asm
fabsf:
addi sp, sp, -12
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
mv t0, a0 # t0: x
li t1, 0x7FFFFFFF
and t0, t0, t1
mv a0, t0
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
addi sp, sp, 12
ret
```
To optimize the RISC-V assembly code for `fabsf()`, we can simplify the stack operations by removing unnecessary saved registers. Here is an optimized version:
```asm
fabsf:
addi sp, sp, -8 # Allocate space for saving ra and t0
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
li t0, 0x7FFFFFFF # Load mask to clear the sign bit
and a0, a0, t0 # Clear the sign bit of the float in a0 (absolute value)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
addi sp, sp, 8 # Deallocate stack space
ret # Return from function
```
**Optimizations Made:**
1. Register Usage: Removed `t1`, since the mask can be loaded directly into `t0`.
2. Stack Space: Reduced the stack frame to hold only `ra` and `t0`; there is no longer a `t1` to save.
3. Efficiency: Removed the `mv` instructions by applying the `and` operation directly to the argument register `a0`.
This version uses fewer instructions and less stack space while maintaining the same functionality. Strictly speaking, the RISC-V calling convention makes `t0`–`t6` caller-saved and `fabsf` is a leaf function, so the saves of `ra` and `t0` could be dropped as well, shrinking the body to just the `li`/`and` pair.
| | Original | Optimized |
| -------- | -------- | -------- |
| **Cycles** | 131 | 119 |
| **Instrs. retired** | 88 | 76 |
| **CPI** | 1.49 | 1.57 |
| **IPC** | 0.672 | 0.639 |
The following code includes complete test cases for the function:
```asm
.data
testcase: .word 0xC048F5C3, 0x00000000, 0xC7F1202E
answer: .word 0x4048F5C3, 0x00000000, 0x47F1202E
true: .string "true\n"
false: .string "false\n"
.text
main:
la t0, testcase
la t1, answer
li t2, 3
test_loop:
lw a0, 0(t0)
jal ra, fabsf
lw s0, 0(t1)
beq a0, s0, print_true
j print_false
print_true:
li a7, 4
la a0, true
ecall
j check_test_loop
print_false:
li a7, 4
la a0, false
ecall
j check_test_loop
check_test_loop:
addi t0, t0, 4
addi t1, t1, 4
addi t2, t2, -1
bne t2, x0, test_loop
li a7, 10
ecall
fabsf:
addi sp, sp, -8 # Allocate space for saving ra and t0
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
li t0, 0x7FFFFFFF # Load mask to clear the sign bit
and a0, a0, t0 # Clear the sign bit of the float in a0 (absolute value)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
addi sp, sp, 8 # Deallocate stack space
ret # Return from function
```
### 1.2 `my_clz`
The `my_clz()` function computes the count of leading zeros in a 32-bit unsigned integer. It uses a straightforward loop to check each bit from the most significant bit (MSB) to the least significant bit (LSB) and counts how many leading zeros are present.
```c
static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}
```
The following is the `my_clz()` function translated from C to RISC-V assembly:
```asm
my_clz:
addi sp, sp, -24
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
mv t0, a0 # t0: x
li t1, 0 # t1: count
li t2, 31 # t2: i
clz_loop:
li t3, 1 # t3: 1U
sll t3, t3, t2 # 1U << i
and t4, t0, t3 # t4: x & (1U << i)
bne t4, x0, exit_clz_loop
addi t2, t2, -1
addi t1, t1, 1
bge t2, x0, clz_loop
exit_clz_loop:
mv a0, t1
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
addi sp, sp, 24
ret
```
To optimize the `my_clz` RISC-V assembly code for reduced cycle usage, we can make a few changes to minimize unnecessary operations and improve efficiency. Specifically, we can:
1. **Avoid recalculating the `1 << i` bitmask in each loop iteration:** instead of rebuilding the mask with `sll` every time, keep it in a register and shift it right by one (`srli`) each pass, which saves cycles.
2. **Minimize stack usage:** By only saving the necessary registers and avoiding redundant saves/restores, we reduce the overhead.
Here’s the further optimized version of the code:
```asm
my_clz:
addi sp, sp, -16 # Allocate space for ra, t0, t1, t2
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
sw t1, 8(sp) # Save t1
sw t2, 12(sp) # Save t2
mv t0, a0 # t0: x (input value)
li t1, 0 # t1: count (leading zeros)
li t3, 0x80000000 # t3: starting bitmask (1 << 31)
clz_loop:
and t4, t0, t3 # t4: x & (1 << i)
bne t4, x0, exit_clz # If x & (1 << i) is not 0, exit the loop
addi t1, t1, 1 # Increment leading zero count
srli t3, t3, 1 # Right shift the bitmask (t3 >>= 1)
bnez t3, clz_loop # If the bitmask is non-zero, continue looping
exit_clz:
mv a0, t1 # Move count to return value (a0)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
lw t1, 8(sp) # Restore t1
lw t2, 12(sp) # Restore t2
addi sp, sp, 16 # Deallocate stack space
ret # Return from function
```
| | Original | Optimized |
| -------- | -------- | -------- |
| **Cycles** | 723 | 581 |
| **Instrs. retired** | 552 | 410 |
| **CPI** | 1.31 | 1.42 |
| **IPC** | 0.763 | 0.706 |
The following code includes complete test cases for the function:
```asm
.data
testcase: .word 0x00000000, 0x00000001, 0x80000000
answer: .word 32, 31, 0
true: .string "true\n"
false: .string "false\n"
.text
main:
la t0, testcase
la t1, answer
li t2, 3
test_loop:
lw a0, 0(t0)
jal ra, my_clz
lw s0, 0(t1)
beq a0, s0, print_true
j print_false
print_true:
li a7, 4
la a0, true
ecall
j check_test_loop
print_false:
li a7, 4
la a0, false
ecall
j check_test_loop
check_test_loop:
addi t0, t0, 4
addi t1, t1, 4
addi t2, t2, -1
bne t2, x0, test_loop
li a7, 10
ecall
my_clz:
addi sp, sp, -16 # Allocate space for ra, t0, t1, t2
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
sw t1, 8(sp) # Save t1
sw t2, 12(sp) # Save t2
mv t0, a0 # t0: x (input value)
li t1, 0 # t1: count (leading zeros)
li t3, 0x80000000 # t3: starting bitmask (1 << 31)
clz_loop:
and t4, t0, t3 # t4: x & (1 << i)
bne t4, x0, exit_clz # If x & (1 << i) is not 0, exit the loop
addi t1, t1, 1 # Increment leading zero count
srli t3, t3, 1 # Right shift the bitmask (t3 >>= 1)
bnez t3, clz_loop # If the bitmask is non-zero, continue looping
exit_clz:
mv a0, t1 # Move count to return value (a0)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
lw t1, 8(sp) # Restore t1
lw t2, 12(sp) # Restore t2
addi sp, sp, 16 # Deallocate stack space
ret # Return from function
```
### 1.3 `fp16_to_fp32`
The `fp16_to_fp32` function is designed to convert a 16-bit floating-point number (FP16) to a 32-bit floating-point number (FP32). It takes an FP16 number as input and returns its FP32 equivalent. The function preserves the FP16 sign, exponent, and mantissa while adjusting for the larger range and precision of FP32.
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
    const uint32_t w = (uint32_t) h << 16;
    const uint32_t sign = w & UINT32_C(0x80000000);
    const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
    uint32_t renorm_shift = my_clz(nonsign);
    renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
    const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
    const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
    return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
The following code includes complete test cases for the function:
```asm
.data
testcase: .word 0x3C00, 0x7BFF, 0x0400
answer: .word 0x3F800000, 0x477FE000, 0x38800000
true: .string "true\n"
false: .string "false\n"
.text
main:
la t0, testcase
la t1, answer
li t2, 3
test_loop:
lw a0, 0(t0)
jal ra, fp16_to_fp32
lw s0, 0(t1)
beq a0, s0, print_true
j print_false
print_true:
li a7, 4
la a0, true
ecall
j check_test_loop
print_false:
li a7, 4
la a0, false
ecall
j check_test_loop
check_test_loop:
addi t0, t0, 4
addi t1, t1, 4
addi t2, t2, -1
bne t2, x0, test_loop
li a7, 10
ecall
fp16_to_fp32:
addi sp, sp, -32
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
sw t5, 24(sp)
sw t6, 28(sp)
mv t0, a0 # t0: h
slli t0, t0, 16 # t0: w, w = (uint32_t) h << 16
li t1, 0x80000000 # t1: 0x80000000
and t1, t0, t1 # t1: sign, sign = w & UINT32_C(0x80000000)
li t2, 0x7FFFFFFF # t2: 0x7FFFFFFF
and t2, t0, t2 # t2: nonsign, nonsign = w & UINT32_C(0x7FFFFFFF)
mv a0, t2
jal ra, my_clz # call my_clz(nonsign)
mv t3, a0 # t3: renorm_shift
addi t4, x0, 6
bge t3, t4, sub_5
mv t3, x0 # renorm_shift = 0
j cont
sub_5:
addi t3, t3, -5 # renorm_shift = renorm_shift - 5
cont:
li t4, 0x04000000
add t4, t2, t4 # t4: nonsign + 0x04000000
srai t4, t4, 8 # t4: (int32_t)(nonsign + 0x04000000) >> 8 (arithmetic shift, matching the C cast to int32_t)
li t5, 0x7F800000
and t4, t4, t5 # t4: inf_nan_mask
addi t5, t2, -1 # t5: nonsign - 1
srai t5, t5, 31 # t5: zero_mask = (int32_t)(nonsign - 1) >> 31 (arithmetic shift: all-ones when nonsign == 0)
sll t2, t2, t3 # t2: nonsign << renorm_shift
srli t2, t2, 3 # t2: nonsign << renorm_shift >> 3
li t6, 0x70
sub t3, t6, t3 # t3: 0x70 - renorm_shift
slli t3, t3, 23 # t3: (0x70 - renorm_shift) << 23
add t2, t2, t3 # t2: ((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23))
or t2, t2, t4 # t2: (((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask)
not t5, t5 # ~zero_mask
and t2, t2, t5 # t2: ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask)
or a0, t1, t2
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
lw t5, 24(sp)
lw t6, 28(sp)
addi sp, sp, 32
ret
my_clz:
addi sp, sp, -16 # Allocate space for ra, t0, t1, t2
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
sw t1, 8(sp) # Save t1
sw t2, 12(sp) # Save t2
mv t0, a0 # t0: x (input value)
li t1, 0 # t1: count (leading zeros)
li t3, 0x80000000 # t3: starting bitmask (1 << 31)
clz_loop:
and t4, t0, t3 # t4: x & (1 << i)
bne t4, x0, exit_clz # If x & (1 << i) is not 0, exit the loop
addi t1, t1, 1 # Increment leading zero count
srli t3, t3, 1 # Right shift the bitmask (t3 >>= 1)
bnez t3, clz_loop # If the bitmask is non-zero, continue looping
exit_clz:
mv a0, t1 # Move count to return value (a0)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
lw t1, 8(sp) # Restore t1
lw t2, 12(sp) # Restore t2
addi sp, sp, 16 # Deallocate stack space
ret # Return from function
```
| | `fp16_to_fp32` |
| -------- | -------- |
| **Cycles** | 370 |
| **Instrs. retired** | 287 |
| **CPI** | 1.29 |
| **IPC** | 0.776 |
## 2. LeetCode 29. Divide Two Integers
The problem is to implement division of two integers without using multiplication, division, or modulo operators. Given two integers `dividend` and `divisor`, return the quotient of `dividend` divided by `divisor`.
The division must truncate toward zero (e.g., 5 / 2 = 2 and -7 / 3 = -2). Additionally, the result must fit in a 32-bit signed integer: if the quotient overflows those bounds, return `INT_MAX` or `INT_MIN`.
### 2.1 Original Code
```cpp
#include <climits>
#include <cstdlib>
class Solution {
public:
    int divide(int dividend, int divisor) {
        // Handle overflow case
        if (dividend == INT_MIN && divisor == -1) {
            return INT_MAX;
        }
        // Determine the sign of the result
        int sign = (dividend < 0) == (divisor < 0) ? 1 : -1;
        // Work with absolute values to simplify the logic
        long long absDividend = std::llabs(dividend);
        long long absDivisor = std::llabs(divisor);
        long long quotient = 0;
        while (absDividend >= absDivisor) {
            long long tempDivisor = absDivisor;
            long long multiple = 1;
            while (absDividend >= (tempDivisor << 1)) {
                tempDivisor <<= 1;
                multiple <<= 1;
            }
            absDividend -= tempDivisor;
            quotient += multiple;
        }
        quotient = sign * quotient;
        if (quotient > INT_MAX) return INT_MAX;
        if (quotient < INT_MIN) return INT_MIN;
        return quotient;
    }
};
```
### 2.2 Modified Code
The modified version of the `divide` function includes an optimization using the `clz` (count leading zeros) function to calculate how much the divisor can be shifted left without exceeding the dividend. This reduces the need for a manual loop that iteratively doubles the divisor, improving efficiency.
```cpp
#include <climits>
#include <cstdlib>
#include <bit> // For std::countl_zero
class Solution {
public:
    int divide(int dividend, int divisor) {
        // Handle overflow case
        if (dividend == INT_MIN && divisor == -1) {
            return INT_MAX; // Return 2^31 - 1 because INT_MIN / -1 would overflow
        }
        // Determine the sign of the result
        int sign = (dividend < 0) == (divisor < 0) ? 1 : -1;
        // Work with absolute values to simplify the logic
        long long absDividend = std::llabs(dividend);
        long long absDivisor = std::llabs(divisor);
        long long quotient = 0;
        // Calculate the maximum shift using leading zero count (clz)
        while (absDividend >= absDivisor) {
            // Calculate the maximum left shift we can apply without exceeding the dividend
            int shifts = std::countl_zero(static_cast<unsigned>(absDivisor)) -
                         std::countl_zero(static_cast<unsigned>(absDividend));
            // Ensure the shift value is non-negative and within bounds
            if (shifts < 0) shifts = 0;
            long long tempDivisor = absDivisor << shifts;
            if (tempDivisor > absDividend) {
                // If we overshoot, reduce the shift by 1
                tempDivisor = absDivisor << (shifts - 1);
                shifts -= 1;
            }
            absDividend -= tempDivisor;
            quotient += (1LL << shifts); // Add the corresponding multiple
        }
        // Apply the sign to the result
        quotient = sign * quotient;
        // Ensure the result is within the 32-bit signed integer range
        if (quotient > INT_MAX) return INT_MAX;
        if (quotient < INT_MIN) return INT_MIN;
        return quotient;
    }
};
```
The following is the converted RISC-V assembly code:
```asm
divide:
addi sp, sp, -44 # Allocate stack space
sw ra, 0(sp) # Save return address
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
sw s6, 28(sp)
sw s7, 32(sp)
sw s8, 36(sp)
sw s9, 40(sp)
mv s0, a0 # s0: dividend
mv s1, a1 # s1: divisor
li s2, 0x80000000 # s2: INT_MIN
bne s0, s2, begin_divide # dividend != INT_MIN: no overflow risk
li s2, -1
bne s1, s2, begin_divide # divisor != -1: no overflow risk (must test s1, the divisor)
j return_INT_MAX # INT_MIN / -1 overflows, return INT_MAX
begin_divide:
blt s0, x0, dividend_b_set1
li s2, 0
j exit_dividend_b
dividend_b_set1:
li s2, 1
exit_dividend_b:
blt s1, x0, divisor_b_set1
li s3, 0
j exit_divisor_b
divisor_b_set1:
li s3, 1
exit_divisor_b:
beq s2, s3, set_sign_1 # s2: sign, sign = (dividend < 0) == (divisor < 0) ? 1 : -1
li s2, -1
j exit_set_sign
set_sign_1:
li s2, 1
exit_set_sign:
mv s3, s0
bge s3, x0, dividend_abs_end # s3: absDividend, absDividend = std::llabs(dividend)
neg s3, s3 # negate with sub instead of mul (the problem forbids multiplication)
dividend_abs_end:
mv s4, s1 # s4: absDivisor, absDivisor = std::llabs(divisor)
bge s4, x0, divisor_abs_end
neg s4, s4 # negate with sub instead of mul
divisor_abs_end:
mv s5, x0 # s5: quotient, quotient = 0
bltu s3, s4, divide_loop_end # unsigned compare: absDividend may be 0x80000000
divide_loop:
mv a0, s4
jal ra, my_clz
mv s6, a0 # s6: countl_zero(static_cast<unsigned>(absDivisor))
mv a0, s3
jal ra, my_clz
mv s7, a0 # s7: countl_zero(static_cast<unsigned>(absDividend))
sub s6, s6, s7 # s6: shifts
bge s6, x0, shift_greater_equal
li s6, 0 # if (shifts < 0) shifts = 0
shift_greater_equal:
sll s7, s4, s6 # s7: tempDivisor, tempDivisor = absDivisor << shifts
bgeu s3, s7, temp_less_than_abs # no overshoot if tempDivisor <= absDividend (unsigned; includes the equal case)
addi s8, s6, -1 # s8: shifts - 1
sll s7, s4, s8 # tempDivisor = absDivisor << (shifts - 1)
addi s6, s6, -1 # shifts -= 1
temp_less_than_abs:
sub s3, s3, s7 # absDividend -= tempDivisor
li s9, 1
sll s9, s9, s6 # (1LL << shifts)
add s5, s5, s9
bltu s3, s4, divide_loop_end # unsigned compare
j divide_loop
divide_loop_end:
li t0, 1
beq s2, t0, quotient_signed # sign == 1: quotient already correct
neg s5, s5 # quotient = -quotient (sub instead of mul)
quotient_signed:
mv a0, s5
j divide_end
return_INT_MAX:
li a0, 0x7FFFFFFF
j divide_end
divide_end:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
lw s6, 28(sp)
lw s7, 32(sp)
lw s8, 36(sp)
lw s9, 40(sp)
addi sp, sp, 44
ret
```