--- tags: computer-arch --- # Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`owen1994-tw`](https://github.com/owen1994-tw/2024_Computer_Architecture_FALL) > ## Deriving LeetCode Problem 29 Using Floating-Point : Analyzing Precision Differences Between FP32 and BF16 ### 29. Divide Two Integers > Given two integers `dividend` and `divisor`, divide two integers without using multiplication, division, and mod operator. ### Divide Two Floating Point Number Inspired by LeetCode Problem `29, Divide Two Integers`, the goal is to convert the original integer data type to the FP32 data type. The conversion functions from `bf16_to_fp32` and `bf16_to_fp32`, as outlined in Quiz 1 - Problem B, will be utilized. This approach allows for the analysis of precision and computational differences between directly dividing floating-point numbers using FP32 and employing a conversion process that transitions from FP32 to BF16 and then back to FP32. ## Quiz 1 - Problem B The bfloat16 format is a 16-bit floating-point representation, designed to provide a wide dynamic range by using a floating radix point. It is a shortened version of the 32-bit IEEE 754 single-precision format (binary32), aimed at accelerating machine learning. ```cpp= ┌ sign │ │ ┌ exponent │ │ │ │ ┌ mantissa │ │ │ │┌──┴───┐┌─┴───┐ 0b0000000000000000 bfloat16 ``` Since bfloat16 (bf16) shares the same number of exponent bits as a 32-bit float, encoding and decoding values is relatively simple. ### C Code ```cpp= static inline float bf16_to_fp32(bf16_t h) { union { float f; uint32_t i; } u = {.i = (uint32_t)h.bits << 16}; return u.f; } ``` ### Assembly Code ```:= .data A: .word 0x3F80, 0x4000, 0x4080, 0x3F00, 0xBF00 , 0x7F80, 0xFF80 # Multiple floating-point data values length: .word 7 # Data length prompt_fp32: .string " fp32 = " prompt_bf16: .string "\n bf16 = " .text .globl fp32_to_bf16 .global main main: # Load the base address of the data array and the length la s0, A # s0 = Address of A (the array of floating-point numbers) lw s1, length # s1 = length of the array loop: # Check if the length is 0, if so, exit the loop beqz s1, end_loop # If length == 0, exit the loop # Load the current floating-point number into t1 lw t1, 0(s0) # Load the current floating-point value from A into t1 # Print "bf16 = " la a0, prompt_bf16 # Load the address of the string " bf16 = " li a7, 4 # System call code for printing a string ecall # Print the string # Print the current bf16 value mv a0, t1 # Move the original input value (bf16) into a0 for printing li a7, 34 # System call code for printing a hex value (for bf16) ecall # Print the hex value (bf16) # Convert bf16 to fp32 mv a0, t1 # Move bf16 value into a0 (input to bf16_to_fp32) call bf16_to_fp32 # Call the bf16_to_fp32 conversion function mv t1, a0 # Move the result (fp32) back into t1 # Print "fp32 = " la a0, prompt_fp32 # Load the address of the string " fp32 = " li a7, 4 # System call code for printing a string ecall # Print the string # Print the resulting fp32 value mv a0, t1 # Move the converted fp32 value into a0 for printing li a7, 34 # System call code for printing a hex value (for fp32) ecall # Print the hex value (fp32) # Update loop conditions addi s0, s0, 4 # Move to the next floating-point number (each is 4 bytes) addi s1, s1, -1 # Decrease the length by 1 j loop # Jump back to the top of the loop end_loop: # Exit the program li a7, 10 ecall bf16_to_fp32: # Input: bf16 value in a0 # Output: fp32 value returned in a0 slli a0, a0, 16 # Shift left by 16 to align bf16 bits for fp32 conversion ret # Return the result in a0 ``` ### Test Data There are 7 test datas in this program: | BF16 as HEX | FP32 | FP32 as HEX | FP32 | | -------- | -------- | -------- |-------- | | 0x3F80 | 1.0 | 0x3f800000 | 1.0 | | 0x4000 | 2.0 | 0x40000000 | 2.0 | | 0x4080 | 4.0 | 0x40800000 | 4.0 | | 0x3F00 | 0.5 | 0x3F000000 | 0.5 | | 0xBF00 | -0.5 | 0xBF000000 | -0.5 | | 0x7F80 | +Infinity | 0x7F800000 | +Infinity| | 0xFF80 | -Infinity | 0xFF800000 | -Infinity| --- The structure of the bfloat16 floating-point format is as follows. `fp32_to_bf16` function, which converts a float32 to bfloat16. This conversion should be binary identical to the BFloat16 format. Floats must round to the nearest even value, and NaNs should be quiet. Subnormal values should not be flushed to zero, unless explicitly needed. ### C Code ```cpp= static inline bf16_t fp32_to_bf16(float s) { bf16_t h; union { float f; uint32_t i; } u = {.f = s}; if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */ h.bits = (u.i >> 16) | 64; /* force to quiet */ return h; } h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10; return h; } ``` ### Assembly Code ```:= .data A: .word 0x3f800000, 0xbf800000, 0x00000000, 0x80000000, 0x40490fdc, 0x477fe000, 0x477fe100 # Multiple floating-point values (FP32 format) length: .word 7 # Length of the data prompt_bf32: .string "\n fp32 = " # String prompt for FP32 values prompt_bf16: .string " bf16 = " # String prompt for BF16 values .text .globl fp32_to_bf16 .global main main: # Load the address of A and the length of the data la s0, A # Load the address of array A into s0 lw s1, length # Load the length of the array into s1 loop: # Check if the length is zero, and if so, exit the loop beqz s1, end_loop # If s1 (length) is zero, jump to the end of the loop # Load the current FP32 value lw t1, 0(s0) # Load the current floating-point value from A into t1 # Print the FP32 prompt la a0, prompt_bf32 # Load the address of the FP32 prompt string into a0 li a7, 4 # System call for printing a string ecall # Make the system call # Print the FP32 value (in hex) mv a0, t1 # Move the FP32 value to a0 for printing li a7, 34 # System call for printing a hexadecimal value ecall # Make the system call # Convert FP32 to BF16 mv a0, t1 # Move the FP32 value to a0 as input to the conversion function call fp32_to_bf16 # Call the function to convert FP32 to BF16 mv t1, a0 # Store the converted BF16 value in t1 # Print the BF16 prompt la a0, prompt_bf16 # Load the address of the BF16 prompt string into a0 li a7, 4 # System call for printing a string ecall # Make the system call # Print the BF16 value (in hex) mv a0, t1 # Move the BF16 value to a0 for printing li a7, 34 # System call for printing a hexadecimal value ecall # Make the system call # Update the loop condition addi s0, s0, 4 # Move to the next float (increment the pointer by 4 bytes) addi s1, s1, -1 # Decrement the length by 1 j loop # Jump back to the start of the loop end_loop: # Exit the program li a7, 10 # System call for exiting the program ecall # Make the system call fp32_to_bf16: # Function: Convert FP32 to BF16 # Input: s (FP32 value) in a0 # Output: h (BF16 value) in a0 # Prepare temporary registers li t0, 0x7fffffff # Mask for extracting the exponent and mantissa li t1, 0x7f800000 # Mask for maximum exponent (NaN detection) li t2, 64 # Quiet NaN bit (0x40) li t5, 0x7fff # Bias adjustment for rounding # Convert floating-point to integer (IEEE 754 format) mv t3, a0 # Move FP32 value from a0 to t3 for processing # Check for NaN and t6, t0, t3 # Extract exponent and mantissa bgtu t6, t1, handle_nan # If the value is NaN, jump to NaN handling # Normal case: rounding and truncation srli t6, t3, 16 # Shift the FP32 value right by 16 bits andi t4, t6, 1 # Isolate the least significant bit add t4, t4, t5 # Add bias for rounding add t3, t3, t4 # Add rounding adjustment to the original value srli t3, t3, 16 # Truncate the result to get BF16 # Return the BF16 value mv a0, t3 # Move the result to a0 ret # Return from the function handle_nan: # Handle NaN by converting to a quiet NaN (qNaN) srli t3, t3, 16 # Truncate the FP32 value to BF16 or t3, t3, t2 # Set the quiet NaN bit mv a0, t3 # Move the result to a0 ret # Return from the function ``` ### Test Data There are 7 test datas in this program: | Single-precision (FP32) | FP32 as HEX | BF as HEX | | -------- | -------- | -------- | | 1.000000 | 0x3f800000 | 0x3f80 | | -1.000000 | 0xbf800000 | 0xbf80 | | 0.000000 | 0x3f800000 | 0x0000 | | -0.000000 | 0x00000000 | 0x8000 | | 3.141593 | 0x40490fdc | 0x4049 | | 65504.0 | 0x477fe000 | 0x4780 | | 65505.0 | 0x477fe100 | 0x4780 | ## Divide Two Floats `fdiv32` In the `fdiv32` function, NaN, inf and denormalized number are not allowed as fmul32. I still provide the basic floating point multiplication by the technique as the mention above. ```cpp= float fdiv32(float a, float b) { int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b; if (a == 0) return a; if (b == 0) return *(float*)&(int){0x7f800000}; /* mantissa */ int32_t ma = (ia & 0x7FFFFF) | 0x800000; int32_t mb = (ib & 0x7FFFFF) | 0x800000; /* sign and exponent */ int32_t sea = ia & 0xFF800000; int32_t seb = ib & 0xFF800000; /* result of mantissa */ int32_t m = idiv24(ma, mb); int32_t mshift = !getbit(m, 31); m <<= mshift; int32_t r = ((sea - seb + 0x3f800000) - (0x800000 & -mshift)) | (m & 0x7fffff00) >> 8; int32_t ovfl = (ia ^ ib ^ r) >> 31; r = r ^ ((r ^ 0x7f800000) & ovfl); return *(float *) &r; // return a / b; } ``` `idiv24` `idiv24` can handle 32 bits integer division. The dividend and quotient can be update by shifting 1 bit left. If the dividend are greater then the divisor, the LSB of quotient should be 1 ```cpp= static int32_t idiv24(int32_t a, int32_t b){ uint32_t r = 0; for (int i = 0; i < 32; i++){ r <<= 1; if (a - b < 0){ a <<= 1; continue; } r |= 1; a -= b; a <<= 1; } return r; } ``` ### C Code ```cpp= #include <stdio.h> #include <stdint.h> // 定義 bf16_t 結構 typedef union { uint16_t bits; // 16-bit representation } bf16_t; static inline float bf16_to_fp32(bf16_t h) { union { float f; uint32_t i; } u = {.i = (uint32_t)h.bits << 16}; return u.f; } static inline bf16_t fp32_to_bf16(float s) { bf16_t h; union { float f; uint32_t i; } u = {.f = s}; if ((u.i & 0x7fffffff) > 0x7f800000) { // NaN h.bits = (u.i >> 16) | 0x0040; // force to quiet return h; } h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10; return h; } int getbit(int num, int position) { // Shift the desired bit to the least significant bit position and mask it return (num >> position) & 1; } static int32_t idiv24(int32_t a, int32_t b){ uint32_t r = 0; for (int i = 0; i < 32; i++){ r <<= 1; if (a - b < 0){ a <<= 1; continue; } r |= 1; a -= b; a <<= 1; } return r; } float fdiv32(float a, float b) { int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b; if (a == 0) return a; if (b == 0) return *(float*)&(int){0x7f800000}; /* mantissa */ int32_t ma = (ia & 0x7FFFFF) | 0x800000; int32_t mb = (ib & 0x7FFFFF) | 0x800000; /* sign and exponent */ int32_t sea = ia & 0xFF800000; int32_t seb = ib & 0xFF800000; /* result of mantissa */ int32_t m = idiv24(ma, mb); int32_t mshift = !getbit(m, 31); m <<= mshift; int32_t r = ((sea - seb + 0x3f800000) - (0x800000 & -mshift)) | (m & 0x7fffff00) >> 8; int32_t ovfl = (ia ^ ib ^ r) >> 31; r = r ^ ((r ^ 0x7f800000) & ovfl); return *(float *) &r; // return a / b; } float fp32_to_bf16_to_fp32(float s) { bf16_t bf = fp32_to_bf16(s); return bf16_to_fp32(bf); } int main() { float dd = 10.78212f; float dv = 2.0f; // float result_f = 0.0f; float result = 0.0f; result = fdiv32(dd, dv); printf("fp32 Result: %.6f\n", result); // 輸出結果 dd = fp32_to_bf16_to_fp32(dd); dv = fp32_to_bf16_to_fp32(dv); result = fdiv32(dd, dv); printf("bf16 Result: %.6f\n", result); // 輸出結果 return 0; } ``` ### Assembly Code ```:= .data a: .word 0x412c8390 # Dividend (floating-point format) b: .word 0x40000000 # Divisor (floating-point format) prompt: .string "answer = " .text .globl _start setup: li ra,-1 li sp, 0x7fffffff _start: # Load floating-point numbers a and b (as unsigned integers) la t0, a # Load address of a lw t1, 0(t0) # Load 32-bit data of a into t1 la t0, b # Load address of b lw t2, 0(t0) # Load 32-bit data of b into t2 # Save return address ra and divisor onto the stack addi sp, sp, -8 sw ra, 4(sp) sw t2, 0(sp) # Pass the dividend t1 to a0, prepare for FP32 to BF16 conversion mv a0, t1 call fp32_to_bf16_fp32 # Convert dividend from FP32 to BF16 mv t1, a0 # Save the converted dividend into t1 # Restore return address and divisor lw ra, 4(sp) lw t2, 0(sp) sw t1, 0(sp) # Save the converted dividend to the stack # Pass the divisor t2 to a0 for FP32 to BF16 conversion mv a0, t2 call fp32_to_bf16_fp32 # Convert divisor from FP32 to BF16 lw ra, 4(sp) lw t1, 0(sp) addi sp, sp, 8 # Restore stack pointer mv t2, a0 # Save the converted divisor into t2 # Set parameters a0 and a1 for floating-point division mv a0, t1 # Dividend mv a1, t2 # Divisor call_fdiv32: call fdiv32 # Perform floating-point division mv t1,a0 # Save the result into t1 # Print the string "answer = " la a0, prompt # Load the address of the string li a7, 4 # Syscall code 4, print string ecall # Perform syscall # Print the floating-point division result (as an integer) mv a0, t1 # Store the result in a0 li a7, 2 # Syscall code 2, print integer ecall # Perform syscall # Exit program li a7, 10 ecall # Convert FP32 to BF16 and back to FP32 fp32_to_bf16_fp32: li t0, 0x7fffffff # Load 0x7fffffff for masking li t1, 0x7f800000 # Load 0x7f800000 for NaN checking li t2, 64 # Load 64 for quiet NaN li t5, 0x7fff # Load 0x7fff mv t3, a0 # Store FP32 floating-point number in t3 # Check for NaN and t6, t0, t3 # Perform bitwise AND with 0x7fffffff bgtu t6, t1, handle_nan # If result > 0x7f800000, branch to NaN handling # Normal processing srli t6, t3, 16 # Shift FP32 right by 16 bits andi t4, t6, 1 # Perform bitwise AND with 1 on (u >> 16) add t4, t4, t5 # Add 0x7fff to the result add t3, t3, t4 # Add result to u srli t3, t3, 16 # Shift right by 16 bits to get BF16 j bf16_to_fp32 # NaN handling handle_nan: srli t3, t3, 16 # Shift FP32 right by 16 bits or t3, t3, t2 # Perform bitwise OR with 64 j bf16_to_fp32 bf16_to_fp32: slli a0, t3, 16 # Shift BF16 left by 16 bits to align with FP32 ret # 32-bit floating-point division fdiv32: addi sp, sp, -4 sw ra, 0(sp) beqz a0, f3 # If divisor is 0, return 0x7f800000 bnez a1, f1 # Check if dividend is 0 li a0, 0x7f800000 # Return positive infinity j f3 f1: li t2, 0x7FFFFF # Mask for extracting mantissa and t0, t2, a0 # Extract dividend mantissa and t1, t2, a1 # Extract divisor mantissa addi t2, t2, 1 # Add 1 to handle rounding or t0, t0, t2 # Calculate dividend mantissa or t1, t1, t2 # Calculate divisor mantissa idiv24: li t3, 0 # Initialize quotient and remainder li t4, 32 # Set loop count for 32 bits f2: sub t0, t0, t1 # Subtract divisor from dividend sltz t2, t0 # Check if result is negative seqz t2, t2 # t2 becomes 0 or 1 slli t3, t3, 1 # Shift quotient left by 1 bit or t3, t3, t2 # Update quotient seqz t2, t2 # Invert t2 neg t2, t2 # Take the negative value and t5, t2, t1 # If result is negative, set t5 to divisor add t0, t0, t5 # Restore dividend slli t0, t0, 1 # Shift dividend left by 1 bit addi t4, t4, -1 # Decrease loop count bnez t4, f2 # Continue looping li t2, 0xFF800000 # Extract exponent and t0, a0, t2 # Extract dividend exponent and t1, a1, t2 # Extract divisor exponent mv a0, t3 # Save quotient result # Handle result shifts li a1, 31 jal getbit # Check most significant bit seqz a0, a0 # Calculate shift amount sll t3, t3, a0 # Shift quotient left # Calculate new exponent li t2, 0x3f800000 sub t4, t0, t1 add t4, t4, t2 # Adjust exponent neg a0, a0 # Negate shift amount li t2, 0x800000 and a0, a0, t2 # Handle shift effect srli t3, t3, 8 # Shift mantissa right by 8 bits addi t2, t2, -1 and t3, t3, t2 # Align mantissa sub a0, t4, a0 or a0, a0, t3 # Combine result # Check for overflow xor t3, t1, t0 xor t3, t3, a0 srli t3, t3, 31 li t2, 0x7f800000 xor t4, t2, a0 and t4, t4, t3 xor a0, a0, t4 f3: # Epilogue lw ra, 0(sp) addi sp, sp, 4 ret getbit: # Prologue srl a0, a0, a1 andi a0, a0, 0x1 ret ``` Analysis --- ![image](https://hackmd.io/_uploads/r15lfsIkkg.png) ### Relative Error Formula: $$ Relative\ Error = \frac{\left| \text{Result}_{\text{FP32}} \right|}{\left| \text{Result}_{\text{FP32}} - \text{Result}_{\text{BF16-converted}} \right|} $$ bf16表示範圍 3.3895313892515354759e+38 ~ 2.34180515202877589478e-38,我們會選取五組數組,並且以相近的數值組合、大範圍數值差異的組合 觀察相對誤差數據 There are 5 test datas in this program: | Dividend | Divisor | Result | Quantized Result | Relative Error (%) | | --------- | --------- | -------- | ---------------- | ------------------ | | 1.0001e+0 | 1.0002e+0 | 9.999e-1 | 1.000e+0 | 1.000e-4 | | 1.000e+0 | 1.000e+6 | 1.000e-6 | 1.001e-6 | 5.763e-4 | | 1.000e+6 | 1.000e+0 | 1.000e+6 | 9.994e+5 | 5.760e-4 | | 2.342e-38 | 1.000e-38 | 2.153e+0 | 2.152e+0 | 4.627e-4 | | 3.390e+38 | 1.000e+38 | 3.390e+0 | 3.400e+0 | 3.088e-3 | ### Pipeline Stage Explanation "5-stage processor" to test if the code works well under potential hazards. However, either Ripes handle hazards for me, or it cannot properly simulate hazards.. The block diagram of 5-stage processor in Ripes is like the following figure. ![image](https://hackmd.io/_uploads/SyMWsnIJkx.png) ### Control Hazard When our data is an FP32 **NaN**, this branch instruction will need to jump to the `handle_nan` label. In this case, the instructions in the **ID** and **EX** stages will need to be flushed, and the IF stage will fetch the instructions at the `handle_nan` label. ![image](https://hackmd.io/_uploads/Sy8hlpUyyl.png) ### Data Hazard ```:= 00000108 <f1>: 108: 008003b7 lui x7 0x800 10c: fff38393 addi x7 x7 -1 110: 00a3f2b3 and x5 x7 x10 114: 00b3f333 and x6 x7 x11 118: 00138393 addi x7 x7 1 ``` During the **EX** of the instruction `and x5, x7, x10`, the instruction `addi x7, x7, -1` is still in the **ID**. Since the and instruction needs to use `x7`, it is necessary to write back (write back) the value of `x7` during the **EX** stage, and the correct value will be selected by the MUX for use. ![image](https://hackmd.io/_uploads/HkFN4pUJJl.png) ![image](https://hackmd.io/_uploads/SkieV6Ukkx.png) ## Reference