---
tags: computer-arch
---
# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`owen1994-tw`](https://github.com/owen1994-tw/2024_Computer_Architecture_FALL) >
## Deriving LeetCode Problem 29 Using Floating-Point : Analyzing Precision Differences Between FP32 and BF16
### 29. Divide Two Integers
> Given two integers `dividend` and `divisor`, divide two integers without using multiplication, division, and mod operator.
### Divide Two Floating Point Number
Inspired by LeetCode Problem `29, Divide Two Integers`, the goal is to convert the original integer data type to the FP32 data type. The conversion functions from `bf16_to_fp32` and `bf16_to_fp32`, as outlined in Quiz 1 - Problem B, will be utilized. This approach allows for the analysis of precision and computational differences between directly dividing floating-point numbers using FP32 and employing a conversion process that transitions from FP32 to BF16 and then back to FP32.
## Quiz 1 - Problem B
The bfloat16 format is a 16-bit floating-point representation, designed to provide a wide dynamic range by using a floating radix point. It is a shortened version of the 32-bit IEEE 754 single-precision format (binary32), aimed at accelerating machine learning.
```cpp=
┌ sign
│
│ ┌ exponent
│ │
│ │ ┌ mantissa
│ │ │
│┌──┴───┐┌─┴───┐
0b0000000000000000 bfloat16
```
Since bfloat16 (bf16) shares the same number of exponent bits as a 32-bit float, encoding and decoding values is relatively simple.
### C Code
```cpp=
static inline float bf16_to_fp32(bf16_t h)
{
union {
float f;
uint32_t i;
} u = {.i = (uint32_t)h.bits << 16};
return u.f;
}
```
### Assembly Code
```:=
.data
A: .word 0x3F80, 0x4000, 0x4080, 0x3F00, 0xBF00
, 0x7F80, 0xFF80 # Multiple floating-point data values
length: .word 7 # Data length
prompt_fp32: .string " fp32 = "
prompt_bf16: .string "\n bf16 = "
.text
.globl fp32_to_bf16
.global main
main:
# Load the base address of the data array and the length
la s0, A # s0 = Address of A (the array of floating-point numbers)
lw s1, length # s1 = length of the array
loop:
# Check if the length is 0, if so, exit the loop
beqz s1, end_loop # If length == 0, exit the loop
# Load the current floating-point number into t1
lw t1, 0(s0) # Load the current floating-point value from A into t1
# Print "bf16 = "
la a0, prompt_bf16 # Load the address of the string " bf16 = "
li a7, 4 # System call code for printing a string
ecall # Print the string
# Print the current bf16 value
mv a0, t1 # Move the original input value (bf16) into a0 for printing
li a7, 34 # System call code for printing a hex value (for bf16)
ecall # Print the hex value (bf16)
# Convert bf16 to fp32
mv a0, t1 # Move bf16 value into a0 (input to bf16_to_fp32)
call bf16_to_fp32 # Call the bf16_to_fp32 conversion function
mv t1, a0 # Move the result (fp32) back into t1
# Print "fp32 = "
la a0, prompt_fp32 # Load the address of the string " fp32 = "
li a7, 4 # System call code for printing a string
ecall # Print the string
# Print the resulting fp32 value
mv a0, t1 # Move the converted fp32 value into a0 for printing
li a7, 34 # System call code for printing a hex value (for fp32)
ecall # Print the hex value (fp32)
# Update loop conditions
addi s0, s0, 4 # Move to the next floating-point number (each is 4 bytes)
addi s1, s1, -1 # Decrease the length by 1
j loop # Jump back to the top of the loop
end_loop:
# Exit the program
li a7, 10
ecall
bf16_to_fp32:
# Input: bf16 value in a0
# Output: fp32 value returned in a0
slli a0, a0, 16 # Shift left by 16 to align bf16 bits for fp32 conversion
ret # Return the result in a0
```
### Test Data
There are 7 test datas in this program:
| BF16 as HEX | FP32 | FP32 as HEX | FP32 |
| -------- | -------- | -------- |-------- |
| 0x3F80 | 1.0 | 0x3f800000 | 1.0 |
| 0x4000 | 2.0 | 0x40000000 | 2.0 |
| 0x4080 | 4.0 | 0x40800000 | 4.0 |
| 0x3F00 | 0.5 | 0x3F000000 | 0.5 |
| 0xBF00 | -0.5 | 0xBF000000 | -0.5 |
| 0x7F80 | +Infinity | 0x7F800000 | +Infinity|
| 0xFF80 | -Infinity | 0xFF800000 | -Infinity|
---
The structure of the bfloat16 floating-point format is as follows.
`fp32_to_bf16` function, which converts a float32 to bfloat16. This conversion should be binary identical to the BFloat16 format. Floats must round to the nearest even value, and NaNs should be quiet. Subnormal values should not be flushed to zero, unless explicitly needed.
### C Code
```cpp=
static inline bf16_t fp32_to_bf16(float s)
{
bf16_t h;
union {
float f;
uint32_t i;
} u = {.f = s};
if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */
h.bits = (u.i >> 16) | 64; /* force to quiet */
return h;
}
h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
return h;
}
```
### Assembly Code
```:=
.data
A: .word 0x3f800000, 0xbf800000, 0x00000000, 0x80000000,
0x40490fdc, 0x477fe000, 0x477fe100 # Multiple floating-point values (FP32 format)
length: .word 7 # Length of the data
prompt_bf32: .string "\n fp32 = " # String prompt for FP32 values
prompt_bf16: .string " bf16 = " # String prompt for BF16 values
.text
.globl fp32_to_bf16
.global main
main:
# Load the address of A and the length of the data
la s0, A # Load the address of array A into s0
lw s1, length # Load the length of the array into s1
loop:
# Check if the length is zero, and if so, exit the loop
beqz s1, end_loop # If s1 (length) is zero, jump to the end of the loop
# Load the current FP32 value
lw t1, 0(s0) # Load the current floating-point value from A into t1
# Print the FP32 prompt
la a0, prompt_bf32 # Load the address of the FP32 prompt string into a0
li a7, 4 # System call for printing a string
ecall # Make the system call
# Print the FP32 value (in hex)
mv a0, t1 # Move the FP32 value to a0 for printing
li a7, 34 # System call for printing a hexadecimal value
ecall # Make the system call
# Convert FP32 to BF16
mv a0, t1 # Move the FP32 value to a0 as input to the conversion function
call fp32_to_bf16 # Call the function to convert FP32 to BF16
mv t1, a0 # Store the converted BF16 value in t1
# Print the BF16 prompt
la a0, prompt_bf16 # Load the address of the BF16 prompt string into a0
li a7, 4 # System call for printing a string
ecall # Make the system call
# Print the BF16 value (in hex)
mv a0, t1 # Move the BF16 value to a0 for printing
li a7, 34 # System call for printing a hexadecimal value
ecall # Make the system call
# Update the loop condition
addi s0, s0, 4 # Move to the next float (increment the pointer by 4 bytes)
addi s1, s1, -1 # Decrement the length by 1
j loop # Jump back to the start of the loop
end_loop:
# Exit the program
li a7, 10 # System call for exiting the program
ecall # Make the system call
fp32_to_bf16:
# Function: Convert FP32 to BF16
# Input: s (FP32 value) in a0
# Output: h (BF16 value) in a0
# Prepare temporary registers
li t0, 0x7fffffff # Mask for extracting the exponent and mantissa
li t1, 0x7f800000 # Mask for maximum exponent (NaN detection)
li t2, 64 # Quiet NaN bit (0x40)
li t5, 0x7fff # Bias adjustment for rounding
# Convert floating-point to integer (IEEE 754 format)
mv t3, a0 # Move FP32 value from a0 to t3 for processing
# Check for NaN
and t6, t0, t3 # Extract exponent and mantissa
bgtu t6, t1, handle_nan # If the value is NaN, jump to NaN handling
# Normal case: rounding and truncation
srli t6, t3, 16 # Shift the FP32 value right by 16 bits
andi t4, t6, 1 # Isolate the least significant bit
add t4, t4, t5 # Add bias for rounding
add t3, t3, t4 # Add rounding adjustment to the original value
srli t3, t3, 16 # Truncate the result to get BF16
# Return the BF16 value
mv a0, t3 # Move the result to a0
ret # Return from the function
handle_nan:
# Handle NaN by converting to a quiet NaN (qNaN)
srli t3, t3, 16 # Truncate the FP32 value to BF16
or t3, t3, t2 # Set the quiet NaN bit
mv a0, t3 # Move the result to a0
ret # Return from the function
```
### Test Data
There are 7 test datas in this program:
| Single-precision (FP32) | FP32 as HEX | BF as HEX |
| -------- | -------- | -------- |
| 1.000000 | 0x3f800000 | 0x3f80 |
| -1.000000 | 0xbf800000 | 0xbf80 |
| 0.000000 | 0x3f800000 | 0x0000 |
| -0.000000 | 0x00000000 | 0x8000 |
| 3.141593 | 0x40490fdc | 0x4049 |
| 65504.0 | 0x477fe000 | 0x4780 |
| 65505.0 | 0x477fe100 | 0x4780 |
## Divide Two Floats
`fdiv32`
In the `fdiv32` function, NaN, inf and denormalized number are not allowed as fmul32. I still provide the basic floating point multiplication by the technique as the mention above.
```cpp=
float fdiv32(float a, float b)
{
int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b;
if (a == 0) return a;
if (b == 0) return *(float*)&(int){0x7f800000};
/* mantissa */
int32_t ma = (ia & 0x7FFFFF) | 0x800000;
int32_t mb = (ib & 0x7FFFFF) | 0x800000;
/* sign and exponent */
int32_t sea = ia & 0xFF800000;
int32_t seb = ib & 0xFF800000;
/* result of mantissa */
int32_t m = idiv24(ma, mb);
int32_t mshift = !getbit(m, 31);
m <<= mshift;
int32_t r = ((sea - seb + 0x3f800000) - (0x800000 & -mshift)) | (m & 0x7fffff00) >> 8;
int32_t ovfl = (ia ^ ib ^ r) >> 31;
r = r ^ ((r ^ 0x7f800000) & ovfl);
return *(float *) &r;
// return a / b;
}
```
`idiv24`
`idiv24` can handle 32 bits integer division. The dividend and quotient can be update by shifting 1 bit left. If the dividend are greater then the divisor, the LSB of quotient should be 1
```cpp=
static int32_t idiv24(int32_t a, int32_t b){
uint32_t r = 0;
for (int i = 0; i < 32; i++){
r <<= 1;
if (a - b < 0){
a <<= 1;
continue;
}
r |= 1;
a -= b;
a <<= 1;
}
return r;
}
```
### C Code
```cpp=
#include <stdio.h>
#include <stdint.h>
// 定義 bf16_t 結構
typedef union {
uint16_t bits; // 16-bit representation
} bf16_t;
static inline float bf16_to_fp32(bf16_t h)
{
union {
float f;
uint32_t i;
} u = {.i = (uint32_t)h.bits << 16};
return u.f;
}
static inline bf16_t fp32_to_bf16(float s) {
bf16_t h;
union {
float f;
uint32_t i;
} u = {.f = s};
if ((u.i & 0x7fffffff) > 0x7f800000)
{ // NaN
h.bits = (u.i >> 16) | 0x0040; // force to quiet
return h;
}
h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
return h;
}
int getbit(int num, int position) {
// Shift the desired bit to the least significant bit position and mask it
return (num >> position) & 1;
}
static int32_t idiv24(int32_t a, int32_t b){
uint32_t r = 0;
for (int i = 0; i < 32; i++){
r <<= 1;
if (a - b < 0){
a <<= 1;
continue;
}
r |= 1;
a -= b;
a <<= 1;
}
return r;
}
float fdiv32(float a, float b)
{
int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b;
if (a == 0) return a;
if (b == 0) return *(float*)&(int){0x7f800000};
/* mantissa */
int32_t ma = (ia & 0x7FFFFF) | 0x800000;
int32_t mb = (ib & 0x7FFFFF) | 0x800000;
/* sign and exponent */
int32_t sea = ia & 0xFF800000;
int32_t seb = ib & 0xFF800000;
/* result of mantissa */
int32_t m = idiv24(ma, mb);
int32_t mshift = !getbit(m, 31);
m <<= mshift;
int32_t r = ((sea - seb + 0x3f800000) - (0x800000 & -mshift)) | (m & 0x7fffff00) >> 8;
int32_t ovfl = (ia ^ ib ^ r) >> 31;
r = r ^ ((r ^ 0x7f800000) & ovfl);
return *(float *) &r;
// return a / b;
}
float fp32_to_bf16_to_fp32(float s)
{
bf16_t bf = fp32_to_bf16(s);
return bf16_to_fp32(bf);
}
int main() {
float dd = 10.78212f;
float dv = 2.0f;
// float result_f = 0.0f;
float result = 0.0f;
result = fdiv32(dd, dv);
printf("fp32 Result: %.6f\n", result); // 輸出結果
dd = fp32_to_bf16_to_fp32(dd);
dv = fp32_to_bf16_to_fp32(dv);
result = fdiv32(dd, dv);
printf("bf16 Result: %.6f\n", result); // 輸出結果
return 0;
}
```
### Assembly Code
```:=
.data
a: .word 0x412c8390 # Dividend (floating-point format)
b: .word 0x40000000 # Divisor (floating-point format)
prompt: .string "answer = "
.text
.globl _start
setup:
li ra,-1
li sp, 0x7fffffff
_start:
# Load floating-point numbers a and b (as unsigned integers)
la t0, a # Load address of a
lw t1, 0(t0) # Load 32-bit data of a into t1
la t0, b # Load address of b
lw t2, 0(t0) # Load 32-bit data of b into t2
# Save return address ra and divisor onto the stack
addi sp, sp, -8
sw ra, 4(sp)
sw t2, 0(sp)
# Pass the dividend t1 to a0, prepare for FP32 to BF16 conversion
mv a0, t1
call fp32_to_bf16_fp32 # Convert dividend from FP32 to BF16
mv t1, a0 # Save the converted dividend into t1
# Restore return address and divisor
lw ra, 4(sp)
lw t2, 0(sp)
sw t1, 0(sp) # Save the converted dividend to the stack
# Pass the divisor t2 to a0 for FP32 to BF16 conversion
mv a0, t2
call fp32_to_bf16_fp32 # Convert divisor from FP32 to BF16
lw ra, 4(sp)
lw t1, 0(sp)
addi sp, sp, 8 # Restore stack pointer
mv t2, a0 # Save the converted divisor into t2
# Set parameters a0 and a1 for floating-point division
mv a0, t1 # Dividend
mv a1, t2 # Divisor
call_fdiv32:
call fdiv32 # Perform floating-point division
mv t1,a0 # Save the result into t1
# Print the string "answer = "
la a0, prompt # Load the address of the string
li a7, 4 # Syscall code 4, print string
ecall # Perform syscall
# Print the floating-point division result (as an integer)
mv a0, t1 # Store the result in a0
li a7, 2 # Syscall code 2, print integer
ecall # Perform syscall
# Exit program
li a7, 10
ecall
# Convert FP32 to BF16 and back to FP32
fp32_to_bf16_fp32:
li t0, 0x7fffffff # Load 0x7fffffff for masking
li t1, 0x7f800000 # Load 0x7f800000 for NaN checking
li t2, 64 # Load 64 for quiet NaN
li t5, 0x7fff # Load 0x7fff
mv t3, a0 # Store FP32 floating-point number in t3
# Check for NaN
and t6, t0, t3 # Perform bitwise AND with 0x7fffffff
bgtu t6, t1, handle_nan # If result > 0x7f800000, branch to NaN handling
# Normal processing
srli t6, t3, 16 # Shift FP32 right by 16 bits
andi t4, t6, 1 # Perform bitwise AND with 1 on (u >> 16)
add t4, t4, t5 # Add 0x7fff to the result
add t3, t3, t4 # Add result to u
srli t3, t3, 16 # Shift right by 16 bits to get BF16
j bf16_to_fp32
# NaN handling
handle_nan:
srli t3, t3, 16 # Shift FP32 right by 16 bits
or t3, t3, t2 # Perform bitwise OR with 64
j bf16_to_fp32
bf16_to_fp32:
slli a0, t3, 16 # Shift BF16 left by 16 bits to align with FP32
ret
# 32-bit floating-point division
fdiv32:
addi sp, sp, -4
sw ra, 0(sp)
beqz a0, f3 # If divisor is 0, return 0x7f800000
bnez a1, f1 # Check if dividend is 0
li a0, 0x7f800000 # Return positive infinity
j f3
f1:
li t2, 0x7FFFFF # Mask for extracting mantissa
and t0, t2, a0 # Extract dividend mantissa
and t1, t2, a1 # Extract divisor mantissa
addi t2, t2, 1 # Add 1 to handle rounding
or t0, t0, t2 # Calculate dividend mantissa
or t1, t1, t2 # Calculate divisor mantissa
idiv24:
li t3, 0 # Initialize quotient and remainder
li t4, 32 # Set loop count for 32 bits
f2:
sub t0, t0, t1 # Subtract divisor from dividend
sltz t2, t0 # Check if result is negative
seqz t2, t2 # t2 becomes 0 or 1
slli t3, t3, 1 # Shift quotient left by 1 bit
or t3, t3, t2 # Update quotient
seqz t2, t2 # Invert t2
neg t2, t2 # Take the negative value
and t5, t2, t1 # If result is negative, set t5 to divisor
add t0, t0, t5 # Restore dividend
slli t0, t0, 1 # Shift dividend left by 1 bit
addi t4, t4, -1 # Decrease loop count
bnez t4, f2 # Continue looping
li t2, 0xFF800000 # Extract exponent
and t0, a0, t2 # Extract dividend exponent
and t1, a1, t2 # Extract divisor exponent
mv a0, t3 # Save quotient result
# Handle result shifts
li a1, 31
jal getbit # Check most significant bit
seqz a0, a0 # Calculate shift amount
sll t3, t3, a0 # Shift quotient left
# Calculate new exponent
li t2, 0x3f800000
sub t4, t0, t1
add t4, t4, t2 # Adjust exponent
neg a0, a0 # Negate shift amount
li t2, 0x800000
and a0, a0, t2 # Handle shift effect
srli t3, t3, 8 # Shift mantissa right by 8 bits
addi t2, t2, -1
and t3, t3, t2 # Align mantissa
sub a0, t4, a0
or a0, a0, t3 # Combine result
# Check for overflow
xor t3, t1, t0
xor t3, t3, a0
srli t3, t3, 31
li t2, 0x7f800000
xor t4, t2, a0
and t4, t4, t3
xor a0, a0, t4
f3:
# Epilogue
lw ra, 0(sp)
addi sp, sp, 4
ret
getbit:
# Prologue
srl a0, a0, a1
andi a0, a0, 0x1
ret
```
Analysis
---

### Relative Error Formula:
$$
Relative\ Error = \frac{\left| \text{Result}_{\text{FP32}} \right|}{\left| \text{Result}_{\text{FP32}} - \text{Result}_{\text{BF16-converted}} \right|}
$$
bf16表示範圍 3.3895313892515354759e+38 ~ 2.34180515202877589478e-38,我們會選取五組數組,並且以相近的數值組合、大範圍數值差異的組合 觀察相對誤差數據
There are 5 test datas in this program:
| Dividend | Divisor | Result | Quantized Result | Relative Error (%) |
| --------- | --------- | -------- | ---------------- | ------------------ |
| 1.0001e+0 | 1.0002e+0 | 9.999e-1 | 1.000e+0 | 1.000e-4 |
| 1.000e+0 | 1.000e+6 | 1.000e-6 | 1.001e-6 | 5.763e-4 |
| 1.000e+6 | 1.000e+0 | 1.000e+6 | 9.994e+5 | 5.760e-4 |
| 2.342e-38 | 1.000e-38 | 2.153e+0 | 2.152e+0 | 4.627e-4 |
| 3.390e+38 | 1.000e+38 | 3.390e+0 | 3.400e+0 | 3.088e-3 |
### Pipeline Stage Explanation
"5-stage processor" to test if the code works well under potential hazards. However, either Ripes handle hazards for me, or it cannot properly simulate hazards..
The block diagram of 5-stage processor in Ripes is like the following figure.

### Control Hazard
When our data is an FP32 **NaN**, this branch instruction will need to jump to the `handle_nan` label. In this case, the instructions in the **ID** and **EX** stages will need to be flushed, and the IF stage will fetch the instructions at the `handle_nan` label.

### Data Hazard
```:=
00000108 <f1>:
108: 008003b7 lui x7 0x800
10c: fff38393 addi x7 x7 -1
110: 00a3f2b3 and x5 x7 x10
114: 00b3f333 and x6 x7 x11
118: 00138393 addi x7 x7 1
```
During the **EX** of the instruction `and x5, x7, x10`, the instruction `addi x7, x7, -1` is still in the **ID**. Since the and instruction needs to use `x7`, it is necessary to write back (write back) the value of `x7` during the **EX** stage, and the correct value will be selected by the MUX for use.


## Reference