# Assignment1: RISC-V Assembly and Instruction Pipeline(Problem C from Quiz1)
contributed by < [jimmyli88623](https://github.com/jimmy88623) >
The fabsf function is designed to compute the absolute value of a give single-precision floating-point number x. Below, I will present the implementation of this function using RISC-V Assembly Language and optimized it , followed by an analysis of their performance.
:::danger
Don't abuse the notation `==`.
:::
## fabsf
Below is the implementation using C , I will using 0x7FFFFFFF mask to set the left bit to zero
### C code fabsf
```c
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
```
:::danger
Don't paste the code snip without comprehensive discussions.
:::
Below is the implementation that transform C to RISC-V, and I add input and output data initial, that can automatically verify the answer.
### Origin RISC-V Assembly Language fabsf
```asm
.data
input: .word 0xf2133212 ,0x8c45f455,0x05151515
output: .word 0x72133212,0x8c45f455,0x05151515
yes_msg: .string "True \n"
no_msg: .string "False \n"
.text
main:
la t0, input # load input address
la t1, output # load output address
li t2, 3 # input length
compare_loop:
lw t3, 0(t0)
lw t4, 0(t1)
mv a0, t3
jal ra, fabsf
beq a0, t4, print_yes # compare input and fabsf function output
j print_no
# print true message
print_yes:
la a0, yes_msg
li a7, 4
ecall
j next_compare
# print false message
print_no:
la a0, no_msg
li a7, 4
ecall
j next_compare
# next round
next_compare:
addi t0, t0, 4
addi t1, t1, 4
addi t2, t2, -1
bnez t2, compare_loop
li a7, 10
ecall
fabsf:
addi sp, sp, -12 # create stack
sw s0 ,8(sp) # store s0
sw s1 ,4(sp) # store s1
sw ra, 0(sp) # store ra
li s1, 0x7FFFFFFF # load mask
and s0, a0, s1 # do operation
mv a0, s0 # put the operation result to a0
lw s0, 8(sp) # load s0
lw s1, 4(sp) # load s1
lw ra, 0(sp) # load ra
addi sp, sp, 12 # release stack
ret # return
```
Below is my optimized method, that will reduce the instruction count and cycle number, and also add automatically test to verify the answer.
### Optimized RISC-V Assembly Language fabsf
```asm
.data
input: .word 0xf2133212 ,0x8c45f455,0x05151515
output: .word 0xf2133212,0x8c45f455,0x05151515
yes_msg: .string "True \n"
no_msg: .string "False \n"
.text
main:
la t0, input # load input address
la t1, output # load output address
li t2, 3 # input length
compare_loop:
lw a0, 0(t0)
lw a1, 0(t1)
jal ra, fabsf
beq a0, a1, print_yes # compare input and fabsf function output
j print_no
# print true message
print_yes:
la a0, yes_msg
li a7, 4
ecall
j next_compare
# print false message
print_no:
la a0, no_msg
li a7, 4
ecall
j next_compare
# next round
next_compare:
addi t0, t0, 4
addi t1, t1, 4
addi t2, t2, -1
bnez t2, compare_loop
li a7, 10
ecall # system call
fabsf:
li t3, 0x7FFFFFFF
and a0, a0, t3
ret
```
### Performace analysis
| | Origin | Optimized |
| -------- | -------- | -------- |
| Cycles | 129 | 100 |
| Instruction count | 89 | 60 |
| CPI | 1.45 | 1.67 |
| IPC | 0.69 | 0.6 |
In my optimized method, I aim to reduce the number of registers used to improve both performance and code simplicity. Instead of using multiple temporary registers to store intermediate values, I directly load values into the argument registers , and in the origin implementation, the fabsf function involved saving and restoring multiple registers onto the stack. By limiting the usage of registers inside the function, I eliminated the need for stack operations entirely, resulting in faster execution and a more efficient function call.
:::danger
What inspired you from Quiz 1, as the assignment asks you to use these inspirations to improve the existing problems?
:::
## __builtin_clz
Below is the implementation of clz function using C, i will run a for loop to get each bit number, and update the count
### C code __builtin_clz
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
Below is the implementation that transform C to RISC-V Assembly Language, and add automatically input and output data test to verify the answer.
### Origin RISC-V Assembly Language __builtin_clz
```asm
.data
input: .word 0xf2133212 ,0x1c45f455,0x05151515
output: .word 0x00000000,0x00000004,0x00000005
yes_msg: .string "True \n"
no_msg: .string "False \n"
.text
main:
la t0, input # load input address
la t1, output # load output address
li t2, 3 # input length
# loop
compare_loop:
lw t3, 0(t0)
lw t4, 0(t1)
mv a0, t3
jal ra, my_clz
beq a0, t4, print_yes # compare input and my_clz function output
j print_no
# print true message
print_yes:
la a0, yes_msg
li a7, 4
ecall
j next_compare
# print false message
print_no:
la a0, no_msg
li a7, 4
ecall
j next_compare
# next round
next_compare:
addi t0, t0, 4 # point to next input data
addi t1, t1, 4 # point to next output data
addi t2, t2, -1
bnez t2, compare_loop
li a7, 10
ecall
# Function to count leading zeros (CLZ)
my_clz:
addi sp, sp, -16 # create stack
sw ra, 12(sp) # store ra
sw s0, 8(sp) # store s0
sw s1, 4(sp) # store s1
sw s2, 0(sp) # store s2
li s0, 31 # s0 = i
li s1, 0 # s1 = count
loop:
li s2, 0x1 # load 0x1
sll s2, s2, s0 # shift left
and s2, a0, s2 # do "and" operation, take the left bit
bne s2, x0, done
addi s1, s1, 1 # update count
addi s0, s0, -1 # update i
bge s0, x0, loop # loop condition
done:
mv a0, s1 # put the answer into a0
lw ra, 12(sp) # load ra
lw s0, 8(sp) # load s0
lw s1, 4(sp) # load s1
lw s2, 0(sp) # load s2
addi sp, sp, 16 # release stack
ret # return
```
Below is optimized version, I reduce the register used, and also add automatically input and output test to verify the answer.
### Optimized RISC-V Assembly Language __builtin_clz
```asm
.data
input: .word 0xf2133212 ,0x1c45f455,0x05151515
output: .word 0x00000000,0x00000003,0x00000005
yes_msg: .string "True \n"
no_msg: .string "False \n"
.text
main:
la t0, input # load input address
la t1, output # load output address
li t2, 3 # input length
compare_loop:
lw a0, 0(t0)
lw a1, 0(t1)
jal ra, my_clz
beq a0, a1, print_yes # compare input and my_clz function output
j print_no
# print true message
print_yes:
la a0, yes_msg
li a7, 4
ecall
j next_compare
# print false message
print_no:
la a0, no_msg
li a7, 4
ecall
j next_compare
# next round
next_compare:
addi t0, t0, 4
addi t1, t1, 4
addi t2, t2, -1
bnez t2, compare_loop
li a7, 10
ecall
# Function to count leading zeros (CLZ)
my_clz:
li t3, 31 # s0 = i
li t4, 0 # s1 = count
loop:
li t5, 0x1 # load 0x1
sll t5, t5, t3 # shift left
and t5, a0, t5 # do "and" operation, take the left bit
bne t5, x0, done
addi t4, t4, 1 # update count
addi t3, t3, -1 # update i
bge t3, x0, loop # loop condition
done:
mv a0, t4 # put the answer into a0
ret # return
```
### Performace analysis
| | Origin | Optimized |
| -------- | -------- | -------- |
| Cycles | 222 | 188 |
| Instruction count | 160 | 126 |
| CPI | 1.39 | 1.49 |
| IPC | 0.721 | 0.67 |
In my optimized method, I aim to reduce the number of registers used to improve both performance and code simplicity. Instead of using multiple temporary registers to store intermediate values, I directly load values into the argument registers , and in the origin implementation, the ==my_clz== function involved saving and restoring multiple registers onto the stack. By limiting the usage of registers inside the function, I eliminated the need for stack operations entirely, resulting in faster execution and a more efficient function call.
---
Below is the implementation of fp16_to_fp32 function, since fp16 is different with fp32 , so we need to do transform to each part , and in the final will combine each part .
## fp16_to_fp32
### C code fp16_to_fp32
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
Below is the implementation that transform C to RISC-V Assembly Language, and add automatically input and output test data to verify the answer.
### Origin RISC-V Assembly Language fp16_to_fp32
```asm
.data
input: .word 0x3C00, 0x4000, 0x6219
output: .word 0x3F800000, 0x40000000, 0x44432000
yes_msg: .string "True \n"
no_msg: .string "False \n"
.text
main:
la t3, input # load input address
la t4, output # load output address
li t5, 3 # input length
compare_loop:
lw a0, 0(t3)
lw a1, 0(t4)
jal ra, fp16_to_fp32 # compare input and fp16_to_fp32 function output
beq a0, a1, print_yes
j print_no
# print true message
print_yes:
la a0, yes_msg
li a7, 4
ecall
j next_compare
# print false message
print_no:
la a0, no_msg
li a7, 4
ecall
j next_compare
# next round
next_compare:
addi t3, t3, 4
addi t4, t4, 4
addi t5, t5, -1
bnez t5, compare_loop
li a7, 10
ecall
fp16_to_fp32:
addi sp, sp, -32
sw ra, 28(sp)
sw s0, 24(sp)
sw s1, 20(sp)
sw s2, 16(sp)
sw s3, 12(sp)
sw s4, 8(sp)
sw s5, 4(sp)
sw s6, 0(sp)
mv s0, a0 # s0 store input uint16
slli s1, s0, 16 # s1 store w
li t1, 0x80000000
and s2, s1, t1 # s2 store sign
li t1, 0x7FFFFFFF
and s3, s1, t1 # s3 store nonsign
mv a0, s3
jal ra, my_clz
mv s4, a0 # s4 store my_clz result(renorm_shift)
li s5, 6 # s5 store temp 5
blt s4, s5, set_zero
addi s4, s4, -5 # update s4
j go
set_zero:
li s4, 0
go:
li t2, 0x04000000
add s5, s3, t2
srli s5, s5, 8
li t2, 0x7F800000
and s5, s5, t2 # s5 update store inf_nan_mask
addi s6, s3, -1
srli s6, s6, 31 # s6 store zero_mask
sll s3, s3, s4 # nonsign << renorm_shift
srli s3, s3, 3 # s3 store (nonsign << renorm_shift >> 3)
li t1, 0x70
sub t1, t1, s4 # (0x70 - renorm_shift)
slli t1, t1, 23 # t1 store ((0x70 - renorm_shift) << 23)
add s3, s3, t1
or s3, s3, s5
not t1, s6
and s3, s3, t1
or s3, s3, s2
mv a0, s3
lw ra, 28(sp)
lw s0, 24(sp)
lw s1, 20(sp)
lw s2, 16(sp)
lw s3, 12(sp)
lw s4, 8(sp)
lw s5, 4(sp)
lw s6, 0(sp)
addi sp, sp, 32
ret
# Function to count leading zeros (CLZ)
my_clz:
addi sp, sp, -16 # create stack
sw ra, 12(sp) # store ra
sw s0, 8(sp) # store s0
sw s1, 4(sp) # store s1
sw s2, 0(sp) # store s2
li s0, 31 # s0 = i
li s1, 0 # s1 = count
loop:
li s2, 0x1 # load 0x1
sll s2, s2, s0 # shift left
and s2, a0, s2 # do "and" operation, take the left bit
bne s2, x0, done
addi s1, s1, 1 # update count
addi s0, s0, -1 # update i
bge s0, x0, loop # loop condition
done:
mv a0, s1 # put the answer into a0
lw ra, 12(sp) # load ra
lw s0, 8(sp) # load s0
lw s1, 4(sp) # load s1
lw s2, 0(sp) # load s2
addi sp, sp, 16 # release stack
ret # return
```
Below is the optimized version of RISC-V program, that I reduce the register number used and can minimize overall register usage in function.
### Optimized RISC-V Assembly Language fp16_to_fp32
```asm
.data
input: .word 0x3C00, 0x4000, 0x6219
output: .word 0x3F800000, 0x40000000, 0x44432000
yes_msg: .string "True \n"
no_msg: .string "False \n"
.text
main:
la t3, input # load input address
la t4, output # load output address
li t5, 3 # input length
compare_loop:
lw a0, 0(t3)
lw a1, 0(t4)
jal ra, fp16_to_fp32
beq a0, a1, print_yes # compare input and fp16_to_fp32 function output
j print_no
# print true message
print_yes:
la a0, yes_msg
li a7, 4
ecall
j next_compare
# print false message
print_no:
la a0, no_msg
li a7, 4
ecall
j next_compare
# next round
next_compare:
addi t3, t3, 4
addi t4, t4, 4
addi t5, t5, -1
bnez t5, compare_loop
li a7, 10
ecall
fp16_to_fp32:
addi sp, sp, -16
sw ra, 12(sp)
sw a0, 8(sp)
lw s0, 8(sp)
slli s1, s0, 16 # shift left
li t1, 0x80000000
and s2, s1, t1 # s2 store sign
li t1, 0x7FFFFFFF
and s3, s1, t1 # s3 store nonsign
mv a0, s3
jal ra, my_clz
mv s4, a0 # s4 store renorm_shift
li s5, 6
blt s4, s5, set_zero
addi s4, s4, -5 # update renorm_shift value
j go
set_zero:
li s4, 0
go:
li t2, 0x04000000
add s5, s3, t2
srli s5, s5, 8
li t2, 0x7F800000
and s5, s5, t2 # s5 store inf_nan_mask
addi s6, s3, -1
srli s6, s6, 31 # s6 store zero_mask
sll s3, s3, s4 # s3 store (nonsign << renorm_shift )
srli s3, s3, 3 # s3 store (nonsign << renorm_shift >> 3)
li t1, 0x70
sub t1, t1, s4 # t1 store (0x70 - renorm_shift)
slli t1, t1, 23 # t1 store (0x70 - renorm_shift) << 23)
add s3, s3, t1 # (nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)
not t1, s6 # ~zero_mask
and s3, s3, t1 # ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask)
or s3, s3, s2 # sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask)
mv a0, s3
lw ra, 12(sp)
addi sp, sp, 16
ret
my_clz:
addi sp, sp, -16
sw ra, 12(sp)
sw a0, 8(sp)
li s0, 31
li s1, 0
loop:
li s2, 0x1
sll s2, s2, s0
and s2, a0, s2
bne s2, x0, done
addi s1, s1, 1
addi s0, s0, -1
bge s0, x0, loop
done:
mv a0, s1
lw ra, 12(sp)
addi sp, sp, 16
ret
```
### Performace analysis
| | Origin | Optimized |
| -------- | -------- | -------- |
| Cycles | 350 | 296 |
| Instruction count | 278 | 221 |
| CPI | 1.26 | 1.24 |
| IPC | 0.794 | 0.747 |
My optimized method aims to reduce the number of registers, thereby improving both performance and code simplicity. By utilizing argument registers, I can replace certain temporary registers and minimize the overall register usage in other functions.
## LeetCode Problem
### [LeetCode 1550. Three Consecutive Odds](https://leetcode.com/problems/three-consecutive-odds/description/)
In LeetCode problem 1550 "Three Consecutive Odds," the task is to determine whether the given array contains three consecutive odd numbers.
When it comes to checking if a number is odd, using bit manipulation is an efficient approach. Specifically, the least significant bit (LSB) of any odd number is 1. Therefore, we can use bitwise operations to determine if a number is odd, so I choose clz function from Quiz1 to help implementation.
### C code
```c
class Solution {
public:
bool threeConsecutiveOdds(vector<int>& arr) {
int count = 0;
for(int i=0;i<arr.size();i++){
if(arr[i] & 1 == 1){
count += 1;
if(count == 3){
return true;
}
continue;
}
count = 0;
}
return false;
}
};
```
### add clz function
```c
int clz(uint32_t x){
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
bool threeConsecutiveOdds(int* arr, int arrSize) {
int count = 0;
for (int i = 0; i < arrSize; i++) {
uint32_t value = arr[i];
uint32_t shifted_value = (value << 31) >> 31;
int leading_zeros = clz(shifted_value);
if (leading_zeros == 31) {
count++;
if (count == 3) {
return true;
}
} else {
count = 0;
}
}
return false;
}
```
### Origin RISC-V Assembly Language
```asm
.data
arr: .word 7,7,7
msg_true: .string "True \n"
msg_false: .string "False \n"
.text
main:
la s10, arr # Load address of array into a0 (first argument to function)
li a1, 3 # Load array size (6 in this case) into a1 (second argument)
call threeConsecutiveOdds # Call the function
# Check the returned result and print appropriate message
beq a0, x0, print_false # If returned value is 0 (false), jump to print_false
j print_true # Else, jump to print_true
print_true:
li a0,1
la a0, msg_true # Load address of "True" message into a0
li a7, 4 # Syscall for print string
ecall # Make the system call
li a7, 10 # Syscall for exit
ecall # End the program after printing
print_false:
li a0, 0
la a0, msg_false # Load address of "False" message into a0
li a7, 4 # Syscall for print string
ecall
li a7, 10 # Syscall for exit
ecall
# Exit the program
threeConsecutiveOdds:
addi s8, x0, 0 # Initialize count to 0 (s0 = count)
addi s9, x0, 0 # Initialize i to 0 (s1 = loop counter)
li s2, 3 # Load 3 into s2 (for checking consecutive count)
loop_start:
bge s9, a1, print_false # if i >= arrSize, return false
slli s3, s9, 2 # Calculate offset for arr[i] (s3 = i * 4)
add s3, s10, s3 # Load address of arr[i] into s3
lw s4, 0(s3) # Load arr[i] into s4
slli s5, s4, 31 # Shift value left by 31 bits
srli s5, s5, 31 # Shift value right by 31 bits (extract last bit)
mv a0,s5
li s7, 0x1f
jal ra, my_clz # Use clz to count leading zeros in s5
beq a0, s7, increment_count # If leading zeros == 31, increment count
j reset_count
reset_count:
addi s8, x0, 0 # Reset count to 0
addi s9, s9, 1 # i++
j loop_start # Jump to start of the loop
increment_count:
addi s8, s8, 1 # count++
beq s8, s2, print_true # If count == 3, return true
addi s9, s9, 1 # i++
j loop_start # Jump to start of the loop
my_clz:
addi sp, sp, -24 # create stack
sw ra, 20(sp) # store ra
sw s0, 16(sp) # store s0
sw s1, 12(sp) # store s1
sw s2, 8(sp) # store s2
sw s3, 4(sp) # store s3
sw s4, 0(sp) # store s4
li s0, 31 # s0 = i
li s1, 0 # s1 = count
loop:
li s4, 0x1 # load 0x1
sll s2, s4, s0 # shift left
and s3, a0, s2 # do "and" operation, take the left bit
bne s3, x0, done
addi s1, s1, 1 # update count
addi s0, s0, -1 # update i
bge s0, x0, loop # loop condition
done:
mv a0, s1 # put the result into a0
lw ra, 20(sp) # load ra
lw s0, 16(sp) # load s0
lw s1, 12(sp) # load s1
lw s2, 8(sp) # load s2
lw s3, 4(sp) # load s3
lw s4, 0(sp) # load s4
addi sp, sp, 24 # release stack
ret # return
```
### Optimized RISC-V Assembly Language
```asm
.data
arr: .word 1,3,4,5,6,8,3,3,1
msg_true: .string "True \n"
msg_false: .string "False \n"
.text
main:
la s10, arr # Load address of array into s10
li a1, 9 # Load array size into a1
call threeConsecutiveOdds # Call the function
# Check the returned result and print appropriate message
beq a0, x0, print_false # If 0 (false), jump to print_false
j print_true # Else, jump to print_true
print_true:
la a0, msg_true # Load address of "True" message into a0
li a7, 4 # Syscall for print string
ecall # Make the system call
li a7, 10 # Syscall for exit
ecall # End the program after printing
print_false:
la a0, msg_false # Load address of "False" message into a0
li a7, 4 # Syscall for print string
ecall
li a7, 10 # Syscall for exit
ecall # Exit the program
threeConsecutiveOdds:
addi s8, x0, 0 # Initialize count to 0
addi s9, x0, 0 # Initialize i to 0
li s2, 3 # Load 3 for consecutive check
loop_start:
bge s9, a1, print_false # if i >= arrSize, return false
lw s4, 0(s10) # Load arr[i] into s4
addi s10, s10, 4 # Move to next array element
andi s5, s4, 1 # Check if arr[i] is odd (last bit == 1)
beqz s5, reset_count # If not odd, reset count
addi s8, s8, 1 # Increment count
beq s8, s2, print_true # If count == 3, return true
addi s9, s9, 1 # i++
j loop_start # Jump to start of the loop
reset_count:
addi s8, x0, 0 # Reset count to 0
addi s9, s9, 1 # i++
j loop_start # Jump to start of the loop
my_clz:
addi sp, sp, -8 # Allocate stack space
sw ra, 4(sp) # Store return address
sw s0, 0(sp) # Store s0 (loop counter)
li s0, 31 # Set loop counter to 31
li a0, 0 # Initialize count of leading zeros
clz_loop:
srl s5, a1, s0 # Shift right by s0 bits
andi s5, s5, 1 # Isolate the least significant bit
bnez s5, clz_done # If we find a 1, break the loop
addi a0, a0, 1 # Increment the count
addi s0, s0, -1 # Decrement the loop counter
bgez s0, clz_loop # Repeat until s0 >= 0
clz_done:
lw ra, 4(sp) # Restore return address
lw s0, 0(sp) # Restore s0
addi sp, sp, 8 # Free stack space
ret # Return
```
### Performance Analysis
| | Origin | Optimized |
| -------- | -------- | -------- |
| Cycles | 1001 | 124 |
| Instruction count | 772 | 90 |
| CPI | 1.3 | 1.38 |
| IPC | 0.771 | 0.726 |
In my optimized method, I have changed the way array elements are loaded, resulting in a reduction in the instruction count. Additionally, I combined several function calls, further decreasing the overall instruction count.