# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [JerryYuLin](https://github.com/JerryYuLin) >
For this assignment, we are required to select one problem (A, B, or C) from [Quiz 1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate the chosen C code into a fully functional RISC-V assembly program, and include the necessary test data. I have chosen to work on Problem C for this task.
## 1. Quiz1 - Problem C
Problem C consists of three functions: `fabsf`, `my_clz`, and `fp16_to_fp32`. Below are the implementations of these functions in both C and RISC-V assembly language.
### 1.1 `fabsf`
The following C program defines an inline function `fabsf(float x)`, which computes the absolute value of a single-precision floating-point number by clearing the sign bit of its IEEE 754 bit representation.
```c
static inline float fabsf(float x) {
    uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
    i &= 0x7FFFFFFF;              // Clear the sign bit to get the absolute value
    x = *(float *)&i;             // Write the modified bits back into the float
    return x;
}
```
The `fabsf()` function can be translated into RISC-V assembly that performs the same bit manipulation. A direct, unoptimized translation is shown below:
```asm
fabsf:
addi sp, sp, -12
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
mv t0, a0 # t0: x
li t1, 0x7FFFFFFF
and t0, t0, t1
mv a0, t0
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
addi sp, sp, 12
ret
```
To optimize the RISC-V assembly code for `fabsf()`, we can simplify the stack operations by removing unnecessary saved registers. Here is an optimized version:
```asm
fabsf:
addi sp, sp, -8 # Allocate space for saving ra and t0
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
li t0, 0x7FFFFFFF # Load mask to clear the sign bit
and a0, a0, t0 # Clear the sign bit of the float in a0 (absolute value)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
addi sp, sp, 8 # Deallocate stack space
ret # Return from function
```
**Optimizations Made:**
1. Register Usage: Removed `t1`, since the mask can be loaded directly into `t0`.
2. Stack Space: Reduced the stack frame to hold only `ra` and `t0`; there is no longer a `t1` to save.
3. Efficiency: Removed the `mv` instructions by applying the `and` operation directly to the argument register `a0`.
This version uses fewer instructions and less stack space while maintaining the same functionality. Strictly speaking, the RISC-V calling convention makes `t0`–`t6` caller-saved and `fabsf` is a leaf function, so the saves of `ra` and `t0` could be dropped as well, shrinking the body to just the `li`/`and` pair.
| | Original | Optimized |
| -------- | -------- | -------- |
| **Cycles** | 131 | 119 |
| **Instrs. retired** | 88 | 76 |
| **CPI** | 1.49 | 1.57 |
| **IPC** | 0.672 | 0.639 |
The following code includes complete test cases for the function:
```asm
.data
testcase: .word 0xC048F5C3, 0x00000000, 0xC7F1202E
answer: .word 0x4048F5C3, 0x00000000, 0x47F1202E
true: .string "true\n"
false: .string "false\n"
.text
main:
la t0, testcase
la t1, answer
li t2, 3
test_loop:
lw a0, 0(t0)
jal ra, fabsf
lw s0, 0(t1)
beq a0, s0, print_true
j print_false
print_true:
li a7, 4
la a0, true
ecall
j check_test_loop
print_false:
li a7, 4
la a0, false
ecall
j check_test_loop
check_test_loop:
addi t0, t0, 4
addi t1, t1, 4
addi t2, t2, -1
bne t2, x0, test_loop
li a7, 10
ecall
fabsf:
addi sp, sp, -8 # Allocate space for saving ra and t0
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
li t0, 0x7FFFFFFF # Load mask to clear the sign bit
and a0, a0, t0 # Clear the sign bit of the float in a0 (absolute value)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
addi sp, sp, 8 # Deallocate stack space
ret # Return from function
```
### 1.2 `my_clz`
The `my_clz()` function computes the count of leading zeros in a 32-bit unsigned integer. It uses a straightforward loop to check each bit from the most significant bit (MSB) to the least significant bit (LSB) and counts how many leading zeros are present.
```c
static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}
```
The following is the `my_clz()` function translated from C to RISC-V assembly:
```asm
my_clz:
addi sp, sp, -24
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
mv t0, a0 # t0: x
li t1, 0 # t1: count
li t2, 31 # t2: i
clz_loop:
li t3, 1 # t3: 1U
sll t3, t3, t2 # 1U << i
and t4, t0, t3 # t4: x & (1U << i)
bne t4, x0, exit_clz_loop
addi t2, t2, -1
addi t1, t1, 1
bge t2, x0, clz_loop
exit_clz_loop:
mv a0, t1
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
addi sp, sp, 24
ret
```
To optimize the `my_clz` RISC-V assembly code for reduced cycle usage, we can make a few changes to minimize unnecessary operations and improve efficiency. Specifically, we can:
1. **Avoid recalculating the `1 << i` bitmask in each loop iteration:** instead of rebuilding the mask with `sll` every time, keep it in a register and shift it right by one (`srli`) each pass, which saves cycles.
2. **Minimize stack usage:** By only saving the necessary registers and avoiding redundant saves/restores, we reduce the overhead.
Here’s the further optimized version of the code:
```asm
my_clz:
addi sp, sp, -16 # Allocate space for ra, t0, t1, t2
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
sw t1, 8(sp) # Save t1
sw t2, 12(sp) # Save t2
mv t0, a0 # t0: x (input value)
li t1, 0 # t1: count (leading zeros)
li t3, 0x80000000 # t3: starting bitmask (1 << 31)
clz_loop:
and t4, t0, t3 # t4: x & (1 << i)
bne t4, x0, exit_clz # If x & (1 << i) is not 0, exit the loop
addi t1, t1, 1 # Increment leading zero count
srli t3, t3, 1 # Right shift the bitmask (t3 >>= 1)
bnez t3, clz_loop # If the bitmask is non-zero, continue looping
exit_clz:
mv a0, t1 # Move count to return value (a0)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
lw t1, 8(sp) # Restore t1
lw t2, 12(sp) # Restore t2
addi sp, sp, 16 # Deallocate stack space
ret # Return from function
```
| | Original | Optimized |
| -------- | -------- | -------- |
| **Cycles** | 723 | 581 |
| **Instrs. retired** | 552 | 410 |
| **CPI** | 1.31 | 1.42 |
| **IPC** | 0.763 | 0.706 |
The following code includes complete test cases for the function:
```asm
.data
testcase: .word 0x00000000, 0x00000001, 0x80000000
answer: .word 32, 31, 0
true: .string "true\n"
false: .string "false\n"
.text
main:
la t0, testcase
la t1, answer
li t2, 3
test_loop:
lw a0, 0(t0)
jal ra, my_clz
lw s0, 0(t1)
beq a0, s0, print_true
j print_false
print_true:
li a7, 4
la a0, true
ecall
j check_test_loop
print_false:
li a7, 4
la a0, false
ecall
j check_test_loop
check_test_loop:
addi t0, t0, 4
addi t1, t1, 4
addi t2, t2, -1
bne t2, x0, test_loop
li a7, 10
ecall
my_clz:
addi sp, sp, -16 # Allocate space for ra, t0, t1, t2
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
sw t1, 8(sp) # Save t1
sw t2, 12(sp) # Save t2
mv t0, a0 # t0: x (input value)
li t1, 0 # t1: count (leading zeros)
li t3, 0x80000000 # t3: starting bitmask (1 << 31)
clz_loop:
and t4, t0, t3 # t4: x & (1 << i)
bne t4, x0, exit_clz # If x & (1 << i) is not 0, exit the loop
addi t1, t1, 1 # Increment leading zero count
srli t3, t3, 1 # Right shift the bitmask (t3 >>= 1)
bnez t3, clz_loop # If the bitmask is non-zero, continue looping
exit_clz:
mv a0, t1 # Move count to return value (a0)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
lw t1, 8(sp) # Restore t1
lw t2, 12(sp) # Restore t2
addi sp, sp, 16 # Deallocate stack space
ret # Return from function
```
### 1.3 `fp16_to_fp32`
The `fp16_to_fp32` function is designed to convert a 16-bit floating-point number (FP16) to a 32-bit floating-point number (FP32). It takes an FP16 number as input and returns its FP32 equivalent. The function preserves the FP16 sign, exponent, and mantissa while adjusting for the larger range and precision of FP32.
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
    const uint32_t w = (uint32_t) h << 16;
    const uint32_t sign = w & UINT32_C(0x80000000);
    const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
    uint32_t renorm_shift = my_clz(nonsign);
    renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
    const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
    const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
    return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
The following code includes complete test cases for the function:
```asm
.data
testcase: .word 0x3C00, 0x7BFF, 0x0400
answer: .word 0x3F800000, 0x477FE000, 0x38800000
true: .string "true\n"
false: .string "false\n"
.text
main:
la t0, testcase
la t1, answer
li t2, 3
test_loop:
lw a0, 0(t0)
jal ra, fp16_to_fp32
lw s0, 0(t1)
beq a0, s0, print_true
j print_false
print_true:
li a7, 4
la a0, true
ecall
j check_test_loop
print_false:
li a7, 4
la a0, false
ecall
j check_test_loop
check_test_loop:
addi t0, t0, 4
addi t1, t1, 4
addi t2, t2, -1
bne t2, x0, test_loop
li a7, 10
ecall
fp16_to_fp32:
addi sp, sp, -32
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
sw t5, 24(sp)
sw t6, 28(sp)
mv t0, a0 # t0: h
slli t0, t0, 16 # t0: w, w = (uint32_t) h << 16
li t1, 0x80000000 # t1: 0x80000000
and t1, t0, t1 # t1: sign, sign = w & UINT32_C(0x80000000)
li t2, 0x7FFFFFFF # t2: 0x7FFFFFFF
and t2, t0, t2 # t2: nonsign, nonsign = w & UINT32_C(0x7FFFFFFF)
mv a0, t2
jal ra, my_clz # call my_clz(nonsign)
mv t3, a0 # t3: renorm_shift
addi t4, x0, 6
bge t3, t4, sub_5
mv t3, x0 # renorm_shift = 0
j cont
sub_5:
addi t3, t3, -5 # renorm_shift = renorm_shift - 5
cont:
li t4, 0x04000000
add t4, t2, t4 # t4: nonsign + 0x04000000
srai t4, t4, 8 # t4: (int32_t)(nonsign + 0x04000000) >> 8 (arithmetic shift, matching the C cast to int32_t)
li t5, 0x7F800000
and t4, t4, t5 # t4: inf_nan_mask
addi t5, t2, -1 # t5: nonsign - 1
srai t5, t5, 31 # t5: zero_mask = (int32_t)(nonsign - 1) >> 31 (arithmetic shift: all-ones when nonsign == 0)
sll t2, t2, t3 # t2: nonsign << renorm_shift
srli t2, t2, 3 # t2: nonsign << renorm_shift >> 3
li t6, 0x70
sub t3, t6, t3 # t3: 0x70 - renorm_shift
slli t3, t3, 23 # t3: (0x70 - renorm_shift) << 23
add t2, t2, t3 # t2: ((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23))
or t2, t2, t4 # t2: (((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask)
not t5, t5 # ~zero_mask
and t2, t2, t5 # t2: ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask)
or a0, t1, t2
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
lw t5, 24(sp)
lw t6, 28(sp)
addi sp, sp, 32
ret
my_clz:
addi sp, sp, -16 # Allocate space for ra, t0, t1, t2
sw ra, 0(sp) # Save return address
sw t0, 4(sp) # Save t0
sw t1, 8(sp) # Save t1
sw t2, 12(sp) # Save t2
mv t0, a0 # t0: x (input value)
li t1, 0 # t1: count (leading zeros)
li t3, 0x80000000 # t3: starting bitmask (1 << 31)
clz_loop:
and t4, t0, t3 # t4: x & (1 << i)
bne t4, x0, exit_clz # If x & (1 << i) is not 0, exit the loop
addi t1, t1, 1 # Increment leading zero count
srli t3, t3, 1 # Right shift the bitmask (t3 >>= 1)
bnez t3, clz_loop # If the bitmask is non-zero, continue looping
exit_clz:
mv a0, t1 # Move count to return value (a0)
lw ra, 0(sp) # Restore return address
lw t0, 4(sp) # Restore t0
lw t1, 8(sp) # Restore t1
lw t2, 12(sp) # Restore t2
addi sp, sp, 16 # Deallocate stack space
ret # Return from function
```
| | `fp16_to_fp32` |
| -------- | -------- |
| **Cycles** | 370 |
| **Instrs. retired** | 287 |
| **CPI** | 1.29 |
| **IPC** | 0.776 |
## 2. LeetCode 29. Divide Two Integers
The problem is to implement division of two integers without using multiplication, division, or modulo operators. Given two integers `dividend` and `divisor`, return the quotient of `dividend` divided by `divisor`.
The division must truncate toward zero (e.g., 5 / 2 = 2 and -7 / 3 = -2). Additionally, the result must fit in a 32-bit signed integer: if the quotient overflows those bounds, return `INT_MAX` or `INT_MIN`.
### 2.1 Original Code
```cpp
#include <climits>
#include <cstdlib>
class Solution {
public:
    int divide(int dividend, int divisor) {
        // Handle overflow case
        if (dividend == INT_MIN && divisor == -1) {
            return INT_MAX;
        }
        // Determine the sign of the result
        int sign = (dividend < 0) == (divisor < 0) ? 1 : -1;
        // Work with absolute values to simplify the logic
        long long absDividend = std::llabs(dividend);
        long long absDivisor = std::llabs(divisor);
        long long quotient = 0;
        while (absDividend >= absDivisor) {
            long long tempDivisor = absDivisor;
            long long multiple = 1;
            while (absDividend >= (tempDivisor << 1)) {
                tempDivisor <<= 1;
                multiple <<= 1;
            }
            absDividend -= tempDivisor;
            quotient += multiple;
        }
        quotient = sign * quotient;
        if (quotient > INT_MAX) return INT_MAX;
        if (quotient < INT_MIN) return INT_MIN;
        return quotient;
    }
};
```
### 2.2 Modified Code
The modified version of the `divide` function includes an optimization using the `clz` (count leading zeros) function to calculate how much the divisor can be shifted left without exceeding the dividend. This reduces the need for a manual loop that iteratively doubles the divisor, improving efficiency.
```cpp
#include <climits>
#include <cstdlib>
#include <bit> // For std::countl_zero
class Solution {
public:
    int divide(int dividend, int divisor) {
        // Handle overflow case
        if (dividend == INT_MIN && divisor == -1) {
            return INT_MAX; // Return 2^31 - 1 because INT_MIN / -1 would overflow
        }
        // Determine the sign of the result
        int sign = (dividend < 0) == (divisor < 0) ? 1 : -1;
        // Work with absolute values to simplify the logic
        long long absDividend = std::llabs(dividend);
        long long absDivisor = std::llabs(divisor);
        long long quotient = 0;
        // Calculate the maximum shift using leading zero count (clz)
        while (absDividend >= absDivisor) {
            // Calculate the maximum left shift we can apply without exceeding the dividend
            int shifts = std::countl_zero(static_cast<unsigned>(absDivisor)) -
                         std::countl_zero(static_cast<unsigned>(absDividend));
            // Ensure the shift value is non-negative and within bounds
            if (shifts < 0) shifts = 0;
            long long tempDivisor = absDivisor << shifts;
            if (tempDivisor > absDividend) {
                // If we overshoot, reduce the shift by 1
                tempDivisor = absDivisor << (shifts - 1);
                shifts -= 1;
            }
            absDividend -= tempDivisor;
            quotient += (1LL << shifts); // Add the corresponding multiple
        }
        // Apply the sign to the result
        quotient = sign * quotient;
        // Ensure the result is within the 32-bit signed integer range
        if (quotient > INT_MAX) return INT_MAX;
        if (quotient < INT_MIN) return INT_MIN;
        return quotient;
    }
};
```
The following is the converted RISC-V assembly code:
```asm
divide:
addi sp, sp, -44 # Allocate stack space
sw ra, 0(sp) # Save return address
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
sw s6, 28(sp)
sw s7, 32(sp)
sw s8, 36(sp)
sw s9, 40(sp)
mv s0, a0 # s0: dividend
mv s1, a1 # s1: divisor
li s2, 0x80000000 # s2: INT_MIN
bne s0, s2, begin_divide # dividend != INT_MIN: no overflow risk
li s2, -1
bne s1, s2, begin_divide # divisor != -1: no overflow risk (must test s1, the divisor)
j return_INT_MAX # INT_MIN / -1 overflows, return INT_MAX
begin_divide:
blt s0, x0, dividend_b_set1
li s2, 0
j exit_dividend_b
dividend_b_set1:
li s2, 1
exit_dividend_b:
blt s1, x0, divisor_b_set1
li s3, 0
j exit_divisor_b
divisor_b_set1:
li s3, 1
exit_divisor_b:
beq s2, s3, set_sign_1 # s2: sign, sign = (dividend < 0) == (divisor < 0) ? 1 : -1
li s2, -1
j exit_set_sign
set_sign_1:
li s2, 1
exit_set_sign:
mv s3, s0
bge s3, x0, dividend_abs_end # s3: absDividend, absDividend = std::llabs(dividend)
neg s3, s3 # negate with sub instead of mul (the problem forbids multiplication)
dividend_abs_end:
mv s4, s1 # s4: absDivisor, absDivisor = std::llabs(divisor)
bge s4, x0, divisor_abs_end
neg s4, s4 # negate with sub instead of mul
divisor_abs_end:
mv s5, x0 # s5: quotient, quotient = 0
bltu s3, s4, divide_loop_end # unsigned compare: absDividend may be 0x80000000
divide_loop:
mv a0, s4
jal ra, my_clz
mv s6, a0 # s6: countl_zero(static_cast<unsigned>(absDivisor))
mv a0, s3
jal ra, my_clz
mv s7, a0 # s7: countl_zero(static_cast<unsigned>(absDividend))
sub s6, s6, s7 # s6: shifts
bge s6, x0, shift_greater_equal
li s6, 0 # if (shifts < 0) shifts = 0
shift_greater_equal:
sll s7, s4, s6 # s7: tempDivisor, tempDivisor = absDivisor << shifts
bgeu s3, s7, temp_less_than_abs # no overshoot if tempDivisor <= absDividend (unsigned; includes the equal case)
addi s8, s6, -1 # s8: shifts - 1
sll s7, s4, s8 # tempDivisor = absDivisor << (shifts - 1)
addi s6, s6, -1 # shifts -= 1
temp_less_than_abs:
sub s3, s3, s7 # absDividend -= tempDivisor
li s9, 1
sll s9, s9, s6 # (1LL << shifts)
add s5, s5, s9
bltu s3, s4, divide_loop_end # unsigned compare
j divide_loop
divide_loop_end:
li t0, 1
beq s2, t0, quotient_signed # sign == 1: quotient already correct
neg s5, s5 # quotient = -quotient (sub instead of mul)
quotient_signed:
mv a0, s5
j divide_end
return_INT_MAX:
li a0, 0x7FFFFFFF
j divide_end
divide_end:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
lw s6, 28(sp)
lw s7, 32(sp)
lw s8, 36(sp)
lw s9, 40(sp)
addi sp, sp, 44
ret
```