# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by ???

## Introduction
My original case was to ==implement quantization from bfloat16 to int8==: input single-precision floating-point numbers are first converted to bfloat16 through fp32_to_bf16, and then quantized to 8-bit integers through quant_bf16_to_int8. However, I ran into many issues in the implementation, because every operation had to be performed in the bfloat16 format. This made operations such as finding the maximum value, subtraction, division, and multiplication quite complicated.

Therefore, I changed my case to ==implement a function to find the maximum value in a bfloat16 array for quantization==: input single-precision floating-point numbers are first converted to bfloat16 through fp32_to_bf16, and then the maximum value in the array is found using bf16_findmax, which is the function that will be used in quantization.

Additionally, because quantization needs the maximum absolute value rather than the maximum value, I modified the case to ==implement a function to find the maximum absolute value== in a bfloat16 array for quantization.

In summary, I implemented the C code for quantization, along with the assembly code for finding the maximum value and the maximum absolute value.

## Background
### Quantization
Quantization is the process of converting a representation with a higher number of bits into one with a lower number of bits, typically to accelerate computations.

Quantization can be broken down into three main steps:
1. Find the maximum absolute value in the data.
![](https://hackmd.io/_uploads/rkII1IMZT.png)
2. Calculate the scaling factor (Scale).
![](https://hackmd.io/_uploads/SyxDy8zZT.png)
3. Multiply the data by the Scale and round it to the nearest integer.
![](https://hackmd.io/_uploads/rJRP1IfWa.png)

### BFloat16
IEEE half-precision 16-bit float
![](https://hackmd.io/_uploads/HkPTyLGWT.png)

IEEE 754 single-precision 32-bit float
![](https://hackmd.io/_uploads/rJgsy8GZp.png)

BFloat16
![](https://hackmd.io/_uploads/HkXnJLfbp.png)

As the figures show, bfloat16 consists of 1 bit for the sign, 8 bits for the exponent, and 7 bits for the fraction. What sets it apart from single precision is that it retains only 7 of the 23 fraction bits, reducing the number of bits to accelerate computations. Like half precision, bfloat16 uses 16 bits for its representation. However, it gives the exponent 3 more bits (8 instead of 5) and the fraction 3 fewer (7 instead of 10), sacrificing some precision in exchange for a larger range. This is particularly useful in machine learning, where the slight loss of precision doesn't significantly impact computation results, but the extended range helps prevent overflow, which can have a substantial impact on computations.

# Implementation
## C code (Quantization)
```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define array_size 7
#define range 127 /* 2^(n-1)-1, n: quant bit */

float fp32_to_bf16(float x);
int *quant_bf16_to_int8(float x[]);
float bf16_findmax(float x[]);

int main()
{
    float array[array_size] = {1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000};
    float array2[array_size] = {0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5};
    float array3[array_size] = {3.14159265, 0.12345678, 1.23456789, 0.00000123, 0.00000001, 0.99999999, 0.00000007};
    float array_bf16[array_size] = {0};
    int *after_quant;

    /* data 1 */
    for (int i = 0; i < array_size; i++) {
        array_bf16[i] = fp32_to_bf16(array[i]);
    }
    printf("data 1\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }

    /* data 2 */
    for (int i = 0; i < array_size; i++) {
        array_bf16[i] = fp32_to_bf16(array2[i]);
    }
    printf("data 2\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }

    /* data 3 */
    for (int i = 0; i < array_size; i++) {
        array_bf16[i] = fp32_to_bf16(array3[i]);
    }
    printf("data 3\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }

    system("pause"); /* keeps the console window open (Windows only) */
    return 0;
}

float fp32_to_bf16(float x)
{
    float y = x;
    int *p = (int *)&y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    if (exp == 0 && man == 0) /* zero */
        return x;
    if (exp == 0x7F800000) /* infinity or NaN */
        return x;

    /* Normalized number */
    /* round to nearest */
    float r = x;
    int *pr = (int *)&r;
    *pr &= 0xFF800000; /* r has the same exp as x */
    r /= 0x100;
    y = x + r;

    *p &= 0xFFFF0000;

    return y;
}

int *quant_bf16_to_int8(float x[array_size])
{
    static int after_quant[array_size] = {0};
    float max = fabs(x[0]);
    for (int i = 1; i < array_size; i++) {
        if (fabs(x[i]) > max) {
            max = fabs(x[i]);
        }
    }
    printf("maximum number is %.12f\n", max);
    float scale = range / max;
    for (int i = 0; i < array_size; i++) {
        after_quant[i] = (x[i] * scale); /* the conversion to int truncates toward zero */
    }
    return after_quant;
}
```

## Assembly code (fp32_to_bf16 & find maximum)
```assembly=
.data
array: .word 0x3f99999a, 0x3f9a0000, 0x4013d70a, 0x40140000, 0x405d70a4, 0x405d0000, 0x40b428f6
# test data1: 1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000
array2: .word 0x3dcccccd, 0x3e4ccccd, 0x3f99999a, 0x40400000, 0x40066666, 0xc0866666, 0x40600000
# test data2: 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5
array3: .word 0x40490fdb, 0x3dfcd6e9, 0x3f9e0652, 0x35a5167a, 0x322bcc77, 0x3f800000, 0x339652e8
# test data3: 3.14159265, 0.12345678 , 1.23456789 , 0.00000123, 0.00000001, 0.99999999 , 0.00000007
array_size: .word 7
array_bf16: .word 0, 0, 0, 0, 0, 0, 0
exp_mask: .word 0x7F800000
man_mask: .word 0x007FFFFF
sign_exp_mask: .word 0xFF800000
bf16_mask: .word 0xFFFF0000
next_line: .string "\n"
max_string: .string "maximum number is "
bf16_string: .string "\nbfloat16 number is \n"

.text
main:
    # data 1
    la a0, bf16_string
    addi a7, x0, 4
    ecall
    la s0, array            # array address
    la s2, array_bf16       # array_bf16 address
    la s1, array_size
    lw s1, 0(s1)            # array size
    la s3, exp_mask
    lw s3, 0(s3)
    la s4, man_mask
    lw s4, 0(s4)
    la s5, sign_exp_mask
    lw s5, 0(s5)
    la s6, bf16_mask
    lw s6, 0(s6)
for1:
    lw a0, 0(s0)            # current element
    jal ra, fp32_to_bf16
    sw a0, 0(s2)
    addi a7, x0, 34
    ecall
    la a0, next_line
    addi a7, x0, 4
    ecall
    addi s1, s1, -1
    addi s0, s0, 4
    addi s2, s2, 4
    bne s1, x0, for1
    # invoke find maximum
    la a0, max_string
    addi a7, x0, 4
    ecall
    addi a1, s2, -28        # input array_bf16
    jal ra, bf16_findmax
    addi a7, x0, 34
    ecall

    # data 2
    la a0, bf16_string
    addi a7, x0, 4
    ecall
    la s0, array2           # array address
    la s2, array_bf16       # array_bf16 address
    la s1, array_size
    lw s1, 0(s1)            # array size
    la s3, exp_mask
    lw s3, 0(s3)
    la s4, man_mask
    lw s4, 0(s4)
    la s5, sign_exp_mask
    lw s5, 0(s5)
    la s6, bf16_mask
    lw s6, 0(s6)
data2for1:
    lw a0, 0(s0)            # current element
    jal ra, fp32_to_bf16
    sw a0, 0(s2)
    addi a7, x0, 34
    ecall
    la a0, next_line
    addi a7, x0, 4
    ecall
    addi s1, s1, -1
    addi s0, s0, 4
    addi s2, s2, 4
    bne s1, x0, data2for1
    # invoke find maximum
    la a0, max_string
    addi a7, x0, 4
    ecall
    addi a1, s2, -28        # input array_bf16
    jal ra, bf16_findmax
    addi a7, x0, 34
    ecall

    # data 3
    la a0, bf16_string
    addi a7, x0, 4
    ecall
    la s0, array3           # array address
    la s2, array_bf16       # array_bf16 address
    la s1, array_size
    lw s1, 0(s1)            # array size
    la s3, exp_mask
    lw s3, 0(s3)
    la s4, man_mask
    lw s4, 0(s4)
    la s5, sign_exp_mask
    lw s5, 0(s5)
    la s6, bf16_mask
    lw s6, 0(s6)
data3for1:
    lw a0, 0(s0)            # current element
    jal ra, fp32_to_bf16
    sw a0, 0(s2)
    addi a7, x0, 34
    ecall
    la a0, next_line
    addi a7, x0, 4
    ecall
    addi s1, s1, -1
    addi s0, s0, 4
    addi s2, s2, 4
    bne s1, x0, data3for1
    # invoke find maximum
    la a0, max_string
    addi a7, x0, 4
    ecall
    addi a1, s2, -28        # input array_bf16
    jal ra, bf16_findmax
    addi a7, x0, 34
    ecall

    # Exit program
    li a7, 10
    ecall

fp32_to_bf16:
    # ! don't need point variable concept
    addi sp,sp,-8
    sw a0,0(sp)             # y
    addi t0,sp,0            # p
    lw t2,0(t0)             # *p
    and t6, t2, s3          # exp
    and t4, t2, s4          # man
    # if zero
    bne t6, x0, else        # exp is zero
    bne t4, x0, else
return_x:
    addi sp,sp,8
    jr ra
else:
    # if infinity or NaN
    beq t6, s3, return_x    # exp == 0x7F800000
    # round
    sw a0, 4(sp)            # r
    addi t1, sp, 4          # pr
    lw t3, 0(t1)            # *pr
    and t3, t3, s5
    sw t3, 0(t1)
    lw t3, 4(sp)            # r
    # floating point divide
    # ~~ addi t5, x0, 0x100~~
    # ~~ div t3, t3, t5~~
    li t5, 0x04000000       # 8 << 23: subtracting it divides r by 2^8
    sub t3, t3, t5          # r
    # floating point addition
    # t3: r, t4: a0's man, t6:a0's exp
    # ~~ add t5, a0, t3 # y~~
    and t5, t3, s3          # r's exp
    srli t6, t6, 23         # exp alignment
    srli t5, t5, 23
    sub t2, t6, t5          # exp diff
    and t3, t3, s4          # r's man
    # man alignment
    li s11, 0x00800000      # make up 1 to No.24bit
    or t3, t3, s11
    or t4, t4, s11
    # t2>=0, a0>=r; t2<0, a0<r, smaller man shift right, reserve bigger exp
    bge t2, x0, x_big_r
    srl t4, t4, t2          # a0's man shift right t2 bit
    mv t6, t5               # reserve t5(r's exp)
    j add_man
x_big_r:
    srl t3, t3, t2          # r's man shift right t2 bit
add_man:
    add t3, t3, t4          # mantissa addition
    # check carry
    and t4, t3, s11         # check No.24bit, 0:carry, 1: nocarry
    bne t4, x0, no_carry
    addi t6, t6, 1          # exp+1
    srli t3 ,t3, 1          # man alignment
no_carry:
    slli t6, t6, 23         # exp shift
    and t6, t6, s3          # mask exp
    and t3, t3, s4          # mask man
    or t6, t3, t6           # combine exp & man
    li s11, 0x80000000      # sign mask
    and t3, a0, s11         # sign
    or t5, t3, t6
    # ! integrate divide and addition can be man shift 8 bit
    sw t5, 0(sp)            # y
    lw t2, 0(t0)            # *p
    and t2, t2, s6
    sw t2, 0(t0)
    lw t5, 0(sp)
    add a0, x0, t5
    addi sp,sp,8
    jr ra

bf16_findmax:
    # a1: bf16_array, return a0: max
    # Prologue
    addi sp, sp, -16
    sw ra, 0(sp)
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw s2, 12(sp)
    li s0, 0x80000000       # sign mask
    li s1, 0x7f800000       # exp mask
    li s2, 0x007f0000       # man mask
    mv t2, a1               # input array(t2)
    lw a0, 0(t2)            # max(a0)
    la t4, array_size
    lw t4, 0(t4)            # array size(t4)
    addi t3, x0, 1          # count(t3)
for2:
    addi t3, t3, 1
    addi t2, t2, 4
    lw t1, 0(t2)            # next element(t1)
    # blt t1, a0, max_not_change
    # ! max's sign,exp,man can save
    # bf16_compare
    # a0: max, t1: next
    # compare sign
    and t5, a0, s0
    and t6, t1, s0
    bltu t6, t5, max_change      # t6=0(+), t5=1(-) branch
    bltu t5, t6, max_not_change  # t6=1(-), t5=0(+)
    bne t6, x0, negative         # sign bit set: both values are negative
    # positive
    # compare exp
    and t5, a0, s1
    and t6, t1, s1
    blt t5, t6, max_change       # t5(max.exp)<t6(next.exp) branch
    blt t6, t5, max_not_change
    # compare man
    and t5, a0, s2
    and t6, t1, s2
    blt t5, t6, max_change
    blt t6, t5, max_not_change
    j max_not_change             # equal values: keep the current max
negative:
    # compare exp
    and t5, a0, s1
    and t6, t1, s1
    blt t6, t5, max_change
    blt t5, t6, max_not_change
    # compare man
    and t5, a0, s2
    and t6, t1, s2
    blt t6, t5, max_change
    blt t5, t6, max_not_change
max_change:
    mv a0, t1
max_not_change:
    blt t3, t4, for2
    # Epilogue
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw s2, 12(sp)
    addi sp, sp, 16
    jr ra
```

## Assembly code (fp32_to_bf16 & find maximum absolute value)
```assembly=
.data
array: .word 0x3f99999a, 0x3f9a0000, 0x4013d70a, 0x40140000, 0x405d70a4, 0x405d0000, 0x40b428f6
# test data1: 1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000
array2: .word 0x3dcccccd, 0x3e4ccccd, 0x3f99999a, 0x40400000, 0x40066666, 0xc0866666, 0x40600000
# test data2: 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5
array3: .word 0x40490fdb, 0x3dfcd6e9, 0x3f9e0652, 0x35a5167a, 0x322bcc77, 0x3f800000, 0x339652e8
# test data3: 3.14159265, 0.12345678 , 1.23456789 , 0.00000123, 0.00000001, 0.99999999 , 0.00000007
array_size: .word 7
array_bf16: .word 0, 0, 0, 0, 0, 0, 0
exp_mask: .word 0x7F800000
man_mask: .word 0x007FFFFF
sign_exp_mask: .word 0xFF800000
bf16_mask: .word 0xFFFF0000
next_line: .string "\n"
max_string: .string "maximum number is "
bf16_string: .string "\nbfloat16 number is \n"

.text
main:
    # data 1
    la a0, bf16_string
    addi a7, x0, 4
    ecall
    la s0, array            # array address
    la s2, array_bf16       # array_bf16 address
    la s1, array_size
    lw s1, 0(s1)            # array size
    la s3, exp_mask
    lw s3, 0(s3)
    la s4, man_mask
    lw s4, 0(s4)
    la s5, sign_exp_mask
    lw s5, 0(s5)
    la s6, bf16_mask
    lw s6, 0(s6)
for1:
    lw a0, 0(s0)            # current element
    jal ra, fp32_to_bf16
    sw a0, 0(s2)
    addi a7, x0, 34
    ecall
    la a0, next_line
    addi a7, x0, 4
    ecall
    addi s1, s1, -1
    addi s0, s0, 4
    addi s2, s2, 4
    bne s1, x0, for1
    # invoke find maximum
    la a0, max_string
    addi a7, x0, 4
    ecall
    addi a1, s2, -28        # input array_bf16
    jal ra, bf16_findmax
    addi a7, x0, 34
    ecall

    # data 2
    la a0, bf16_string
    addi a7, x0, 4
    ecall
    la s0, array2           # array address
    la s2, array_bf16       # array_bf16 address
    la s1, array_size
    lw s1, 0(s1)            # array size
    la s3, exp_mask
    lw s3, 0(s3)
    la s4, man_mask
    lw s4, 0(s4)
    la s5, sign_exp_mask
    lw s5, 0(s5)
    la s6, bf16_mask
    lw s6, 0(s6)
data2for1:
    lw a0, 0(s0)            # current element
    jal ra, fp32_to_bf16
    sw a0, 0(s2)
    addi a7, x0, 34
    ecall
    la a0, next_line
    addi a7, x0, 4
    ecall
    addi s1, s1, -1
    addi s0, s0, 4
    addi s2, s2, 4
    bne s1, x0, data2for1
    # invoke find maximum
    la a0, max_string
    addi a7, x0, 4
    ecall
    addi a1, s2, -28        # input array_bf16
    jal ra, bf16_findmax
    addi a7, x0, 34
    ecall

    # data 3
    la a0, bf16_string
    addi a7, x0, 4
    ecall
    la s0, array3           # array address
    la s2, array_bf16       # array_bf16 address
    la s1, array_size
    lw s1, 0(s1)            # array size
    la s3, exp_mask
    lw s3, 0(s3)
    la s4, man_mask
    lw s4, 0(s4)
    la s5, sign_exp_mask
    lw s5, 0(s5)
    la s6, bf16_mask
    lw s6, 0(s6)
data3for1:
    lw a0, 0(s0)            # current element
    jal ra, fp32_to_bf16
    sw a0, 0(s2)
    addi a7, x0, 34
    ecall
    la a0, next_line
    addi a7, x0, 4
    ecall
    addi s1, s1, -1
    addi s0, s0, 4
    addi s2, s2, 4
    bne s1, x0, data3for1
    # invoke find maximum
    la a0, max_string
    addi a7, x0, 4
    ecall
    addi a1, s2, -28        # input array_bf16
    jal ra, bf16_findmax
    addi a7, x0, 34
    ecall

    # Exit program
    li a7, 10
    ecall

fp32_to_bf16:
    # ! don't need point variable concept
    addi sp,sp,-8
    sw a0,0(sp)             # y
    addi t0,sp,0            # p
    lw t2,0(t0)             # *p
    and t6, t2, s3          # exp
    and t4, t2, s4          # man
    # if zero
    bne t6, x0, else        # exp is zero
    bne t4, x0, else
return_x:
    addi sp,sp,8
    jr ra
else:
    # if infinity or NaN
    beq t6, s3, return_x    # exp == 0x7F800000
    # round
    sw a0, 4(sp)            # r
    addi t1, sp, 4          # pr
    lw t3, 0(t1)            # *pr
    and t3, t3, s5
    sw t3, 0(t1)
    lw t3, 4(sp)            # r
    # floating point divide
    # ~~ addi t5, x0, 0x100~~
    # ~~ div t3, t3, t5~~
    li t5, 0x04000000       # 8 << 23: subtracting it divides r by 2^8
    sub t3, t3, t5          # r
    # floating point addition
    # t3: r, t4: a0's man, t6:a0's exp
    # ~~ add t5, a0, t3 # y~~
    and t5, t3, s3          # r's exp
    srli t6, t6, 23         # exp alignment
    srli t5, t5, 23
    sub t2, t6, t5          # exp diff
    and t3, t3, s4          # r's man
    # man alignment
    li s11, 0x00800000      # make up 1 to No.24bit
    or t3, t3, s11
    or t4, t4, s11
    # t2>=0, a0>=r; t2<0, a0<r, smaller man shift right, reserve bigger exp
    bge t2, x0, x_big_r
    srl t4, t4, t2          # a0's man shift right t2 bit
    mv t6, t5               # reserve t5(r's exp)
    j add_man
x_big_r:
    srl t3, t3, t2          # r's man shift right t2 bit
add_man:
    add t3, t3, t4          # mantissa addition
    # check carry
    and t4, t3, s11         # check No.24bit, 0:carry, 1: nocarry
    bne t4, x0, no_carry
    addi t6, t6, 1          # exp+1
    srli t3 ,t3, 1          # man alignment
no_carry:
    slli t6, t6, 23         # exp shift
    and t6, t6, s3          # mask exp
    and t3, t3, s4          # mask man
    or t6, t3, t6           # combine exp & man
    li s11, 0x80000000      # sign mask
    and t3, a0, s11         # sign
    or t5, t3, t6
    # ! integrate divide and addition can be man shift 8 bit
    sw t5, 0(sp)            # y
    lw t2, 0(t0)            # *p
    and t2, t2, s6
    sw t2, 0(t0)
    lw t5, 0(sp)
    add a0, x0, t5
    addi sp,sp,8
    jr ra

bf16_findmax:
    # a1: bf16_array, return a0: max
    # Prologue
    addi sp, sp, -12
    sw ra, 0(sp)
    sw s1, 4(sp)
    sw s2, 8(sp)
    li s1, 0x7f800000       # exp mask
    li s2, 0x007f0000       # man mask
    mv t2, a1               # input array(t2)
    lw a0, 0(t2)            # max(a0)
    la t4, array_size
    lw t4, 0(t4)            # array size(t4)
    addi t3, x0, 1          # count(t3)
for2:
    addi t3, t3, 1
    addi t2, t2, 4
    lw t1, 0(t2)            # next element(t1)
    # blt t1, a0, max_not_change
    # ! max's sign,exp,man can save
    # bf16_compare
    # a0: max, t1: next
    # compare exp
    and t5, a0, s1
    and t6, t1, s1
    blt t5, t6, max_change       # t5(max.exp)<t6(next.exp) branch
    blt t6, t5, max_not_change
    # compare man
    and t5, a0, s2
    and t6, t1, s2
    blt t5, t6, max_change
    blt t6, t5, max_not_change
max_change:
    mv a0, t1
max_not_change:
    blt t3, t4, for2
    # Absolute
    li t0, 0x7fffffff
    and a0, a0, t0
    # Epilogue
    lw ra, 0(sp)
    lw s1, 4(sp)
    lw s2, 8(sp)
    addi sp, sp, 12
    jr ra
```

## Output
### C output
![](https://hackmd.io/_uploads/Bkhs2FGZa.png)

### Assembly output
- **Fp32_to_bf16 & Find maximum**
![](https://hackmd.io/_uploads/SJCesPf-T.png)
- **Fp32_to_bf16 & Find maximum absolute value**
![](https://hackmd.io/_uploads/HkKsbuGWa.png)

test data2: 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5

Test data 2 shows the difference between finding the maximum value and finding the maximum absolute value of a bfloat16 array. When searching for the maximum value, the result is 3.5, represented as 0x40600000. When searching for the maximum absolute value, the result is 4.1875, represented as 0x40860000, which is the bfloat16 form of -4.2 with its sign bit cleared.
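The same comparison can be reproduced at the bit level. The following is a minimal C sketch rather than a translation of the assembly: the helper name `bf16_findmax_abs` is hypothetical, and it assumes that each bfloat16 value sits in the upper 16 bits of a 32-bit word (as in `array_bf16`) and that no NaN is present. Once the sign bit is cleared, a larger magnitude always has a larger bit pattern, so one unsigned comparison gives the same answer as comparing the exponent first and the mantissa second.

```c
#include <stdint.h>

/* Hypothetical helper: maximum absolute value of n bfloat16 values, each
 * stored in the upper 16 bits of a 32-bit word. NaN inputs are not handled. */
uint32_t bf16_findmax_abs(const uint32_t *x, int n)
{
    uint32_t max = x[0] & 0x7FFFFFFFu;      /* clear the sign bit */
    for (int i = 1; i < n; i++) {
        uint32_t mag = x[i] & 0x7FFFFFFFu;  /* magnitude of the next element */
        if (mag > max)                      /* unsigned compare orders exp, then man */
            max = mag;
    }
    return max;
}
```

Called on the bfloat16 form of test data 2 (0x3dcd0000, 0x3e4d0000, 0x3f9a0000, 0x40400000, 0x40060000, 0xc0860000, 0x40600000), it returns 0x40860000, the value reported above.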
## Result
- **Fp32_to_bf16 & Find maximum**
![](https://hackmd.io/_uploads/BkuJxuMWa.png)
- **Fp32_to_bf16 & Find maximum absolute value**
![](https://hackmd.io/_uploads/HJ4kMuMbT.png)

### After optimization
- **Optimal_Fp32_to_bf16 & Find maximum absolute value**
![](https://hackmd.io/_uploads/rkWQeZQZa.png)

# Optimization
```assembly=
.data
array: .word 0x3f99999a, 0x3f9a0000, 0x4013d70a, 0x40140000, 0x405d70a4, 0x405d0000, 0x40b428f6
# test data1: 1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000
array2: .word 0x3dcccccd, 0x3e4ccccd, 0x3f99999a, 0x40400000, 0x40066666, 0xc0866666, 0x40600000
# test data2: 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5
array3: .word 0x40490fdb, 0x3dfcd6e9, 0x3f9e0652, 0x35a5167a, 0x322bcc77, 0x3f800000, 0x339652e8
# test data3: 3.14159265, 0.12345678 , 1.23456789 , 0.00000123, 0.00000001, 0.99999999 , 0.00000007
array_bf16: .word 0, 0, 0, 0, 0, 0, 0
exp_mask: .word 0x7F800000
man_mask: .word 0x007FFFFF
sign_exp_mask: .word 0xFF800000
bf16_mask: .word 0xFFFF0000
next_line: .string "\n"
max_string: .string "maximum number is "
bf16_string: .string "\nbfloat16 number is \n"

.text
main:
    # push data
    addi sp, sp, -12
    la t0, array
    sw t0, 0(sp)
    la t0, array2
    sw t0, 4(sp)
    la t0, array3
    sw t0, 8(sp)
    la s10, array_bf16      # global array_bf16 address(s10)
    addi s11, x0, 3         # data number(s11)
    la s9, exp_mask         # global exp(s9)
    la s8, man_mask         # global man(s8)
    la s6, bf16_mask        # global bf16(s6)
    lw s9, 0(s9)
    lw s8, 0(s8)
    lw s6, 0(s6)
    add s7, x0, sp
main_for:
    la a0, bf16_string
    addi a7, x0, 4
    ecall
    addi a3, x0, 7          # array size(a3)
    lw a1, 0(s7)            # array_data pointer(a1)
    mv a2, s10              # array_bf16 pointer(a2)
    jal ra, fp32_to_bf16_findmax
    addi s11, s11, -1
    addi s7, s7, 4
    bne s11, x0, main_for
    # Exit program
    li a7, 10
    ecall

fp32_to_bf16_findmax:
    # array_data pointer(a1), array_bf16 pointer(a2), array size(a3)
    # prologue
    addi sp, sp, -8
    sw s0, 0(sp)
    sw s1, 4(sp)
    # array loop
for1:
    lw a5, 0(a1)            # x(a5)
    # fp32_to_bf16
    and t0, a5, s9          # x exp(t0)
    and t1, a5, s8          # x man(t1)
    # if zero
    bne t0, x0, else        # exp is zero
    bne t1, x0, else
    j finish_bf16
else:
    # if infinity or NaN
    beq t0, s9, finish_bf16
    # round
    # r = 2^exp / 0x100: only the hidden bit, shifted right 8 bit
    # x+r = x.man + (hidden bit >> 8)
    li t3, 0x00800000       # make up 1 to No.24bit
    or t1, t1, t3
    srli t2, t3, 8          # r(t2) = 0x00008000, r's mantissa is zero as in the C code
    add t1, t1, t2          # x+r
    # check carry
    and t4, t1, t3          # check No.24bit (t4), 0:carry, 1: nocarry
    bne t4, x0, no_carry
    add t0, t0, t3          # exp+1
    srli t1 ,t1, 1          # man alignment
no_carry:
    and t0, t0, s9          # mask exp(t0)
    and t1, t1, s8          # mask man(t1)
    or t2, t0, t1           # combine exp & man
    li t3, 0x80000000       # sign mask
    and t3, a5, t3          # x sign
    or a5, t3, t2           # bfloat16(a5)
    and a5, a5, s6
finish_bf16:
    sw a5, 0(a2)
    mv a0, a5
    addi a7, x0, 34
    ecall
    la a0, next_line
    addi a7, x0, 4
    ecall
    slti t3, a3, 7          # (a3==7) t3=0, (a3<7) t3=1
    bne t3, x0, compare
    # saved first max
    j max_change
compare:
    # compare exp
    blt s0, t0, max_change
    blt t0, s0, max_not_change
    # compare man
    blt s1, t1, max_change
    blt t1, s1, max_not_change
max_change:
    mv s0, t0               # max exp(s0)
    mv s1, t1               # max man(s1)
    mv a4, a5               # max bf16(a4)
max_not_change:
    addi a3, a3, -1
    addi a1, a1, 4
    addi a2, a2, 4
    bne a3, x0, for1
    # Absolute
    li t2, 0x7fffffff
    and a4, a4, t2
    # print
    la a0, max_string
    addi a7, x0, 4
    ecall
    mv a0, a4
    addi a7, x0, 34
    ecall
    # epilogue
    lw s0, 0(sp)
    lw s1, 4(sp)
    addi sp, sp, 8
    jr ra
```

1. Use a loop to read the different test data sets, which significantly reduces redundant code.
2. Merge fp32_to_bf16 and bf16_findmax into a single function, so that one loop over the array is enough.
3. Avoid emulating pointer variables on the stack, which removes unnecessary memory accesses and instructions.
4. Integrate the FP32 division and addition in fp32_to_bf16 into an 8-bit shift followed by an integer addition.
5. Keep the max's exponent and mantissa in registers for reuse, eliminating the need to extract them again.
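Item 4 can be checked at the bit level. In the C version, r keeps only the sign and exponent of x and is then divided by 0x100, so r equals 2^e / 256. In units of the lowest fp32 mantissa bit that is 0x8000, the hidden bit 0x00800000 shifted right by 8. For a normalized x, adding r is therefore the same as adding 0x8000 to the integer bit pattern, and a mantissa carry rolls into the exponent field on its own. The sketch below is an illustration in C under these assumptions, not the code used in this assignment: the function name `fp32_bits_to_bf16_bits` is made up, and denormal inputs are simply truncated.

```c
#include <stdint.h>

/* Hypothetical helper: convert an fp32 bit pattern to bfloat16 (kept in the
 * upper 16 bits) by adding half a bfloat16 ULP and truncating. */
uint32_t fp32_bits_to_bf16_bits(uint32_t bits)
{
    uint32_t exp = bits & 0x7F800000u;
    uint32_t man = bits & 0x007FFFFFu;

    if (exp == 0 && man == 0)       /* zero: returned unchanged */
        return bits;
    if (exp == 0x7F800000u)         /* infinity or NaN: returned unchanged */
        return bits;
    if (exp != 0)                   /* normalized: x + r, with r = 2^e / 256 */
        bits += 0x00008000u;        /* a carry propagates into the exponent */
    return bits & 0xFFFF0000u;      /* keep sign, exponent and 7 fraction bits */
}
```

For example, 0x3f99999a (1.2) becomes 0x3f9a0000 and 0xc0866666 (-4.2) becomes 0xc0860000. For 0x3fffffff the carry reaches the exponent field and the result is 0x40000000.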
# Analysis
## Pipeline
![](https://hackmd.io/_uploads/B1HolqMZp.png)

| stage | description |
| -------- | -------- |
| IF | Instruction Fetch |
| ID | Instruction Decode & Register Read |
| EX | Execution or Address Calculation |
| MEM | Data Memory Access |
| WB | Write Back |

![](https://hackmd.io/_uploads/Skl342zZa.png)

Different types of instructions involve different operations at each stage. Below is an example using "addi."

### IF
![](https://hackmd.io/_uploads/rJIQ_hz-p.png)

The IF (Instruction Fetch) stage fetches the instruction from memory at the address held in the program counter (PC). The PC is also incremented in this stage. In the figure, the output "0x00400893" is the encoding of the "addi x17,x0,4" instruction, and the preceding "0x0000000C" is the address of the next instruction.

### ID
![](https://hackmd.io/_uploads/BkYo_3GZp.png)

The ID (Instruction Decode) stage decodes the instruction, extracting the source register addresses, destination register address, opcode, and immediate value. It also reads the values of the source registers. In the figure, "0x00" is the source register address, "0x11" is the destination register address (x17), "0x00000004" is the immediate value, and "0x00000000" is the value read from register x0.

### EX
![](https://hackmd.io/_uploads/SkoxK2GZp.png)

The EX (Execution) stage performs numerical operations through the ALU. Four multiplexers determine the inputs to the ALU. The upper two mainly decide whether to use the PC value or the register value, while the lower two mainly decide whether to use the immediate value or the register value. Arithmetic instructions perform their calculations here, branch instructions calculate the next PC, and memory instructions calculate memory access addresses. In the figure, OP1 receives "0x00000000," which is the value of x0, and OP2 receives "0x00000004," which is the immediate value.

### MEM
![](https://hackmd.io/_uploads/B11GFnfW6.png)

The MEM (Memory) stage handles memory access operations, such as lw (load word) or sw (store word), based on the address calculated by the ALU. In the case of "addi," nothing is done in this stage, so the result of the operation, "0x00000004," is simply passed on to the next stage, WB (Write Back).

### WB
![](https://hackmd.io/_uploads/By5XKhGZp.png)

The WB (Write Back) stage writes the result into the register file, using the destination register address carried along from the previous stages. A multiplexer decides whether the value from the ALU operation, the PC, or memory is written into the destination register. In the figure, "0x00000004" is being written back into register "0x11."

## Hazard
### Data hazard
In a pipeline, an instruction may require the result of a preceding instruction that has not yet been produced, meaning that the data needed to execute the instruction is not yet available.

![](https://hackmd.io/_uploads/r1sjviGWT.png)

Because "and x31 x7 x19" requires the value of x7, but "lw x7 0 x5" has not yet written back its result to x7, a Read After Write data hazard occurs. The loaded value only becomes available once "lw" has completed its MEM stage (it is written to x7 in WB), so the "and" instruction has to wait for it. This results in the insertion of a nop (no-operation), which adds an extra cycle to the pipeline.

### Structure hazard
Insufficient hardware resources prevent multiple instructions from using the same resource in the same cycle.

![](https://hackmd.io/_uploads/By1HKjGZp.png)

This example illustrates a situation where a "load" instruction is in the MEM stage, trying to access memory, while instruction i+3 needs to enter the IF stage to fetch an instruction from memory. A single memory cannot satisfy two different read requests at the same time, which leads to a structural hazard. This can be resolved by separating instruction memory from data memory.

### Control hazard
While the outcome of a branch instruction is still undetermined, subsequent instructions have already entered the pipeline. If the branch turns out to be taken, those instructions were fetched from the wrong path and must not be executed.

![](https://hackmd.io/_uploads/S151_sfba.png)

The instruction "bne x31 x0 16" calculates its result and decides whether to branch in the EX stage, but by then the subsequent instructions "bne" and "addi" have already entered the pipeline. If the branch is taken, these two instructions must be flushed, which wastes two cycles.

# Future work
Reduce the hazards in order to cut the cycles spent on NOPs. Next, implement bfloat16 multiplication, division, and addition, and then realize quantization from bfloat16 to int8.
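
As a cross-check of the "addi x17,x0,4" walk-through in the Pipeline section, the fields that the ID stage extracts can be recomputed from the encoding 0x00400893. The following is a small C sketch based on the standard RV32I I-type layout. The struct `itype_fields` and the function `decode_itype` are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* Hypothetical container for the fields of an RV32I I-type instruction. */
typedef struct {
    uint32_t opcode;  /* bits 6:0 */
    uint32_t rd;      /* bits 11:7, destination register */
    uint32_t funct3;  /* bits 14:12 */
    uint32_t rs1;     /* bits 19:15, source register */
    int32_t imm;      /* bits 31:20, sign-extended */
} itype_fields;

itype_fields decode_itype(uint32_t insn)
{
    itype_fields f;
    f.opcode = insn & 0x7Fu;
    f.rd = (insn >> 7) & 0x1Fu;
    f.funct3 = (insn >> 12) & 0x7u;
    f.rs1 = (insn >> 15) & 0x1Fu;
    f.imm = (int32_t)(insn >> 20);  /* 12-bit immediate, still unsigned here */
    if (f.imm & 0x800)              /* bit 11 set: the immediate is negative */
        f.imm -= 0x1000;
    return f;
}
```

For 0x00400893 this gives opcode 0x13, rd 0x11, rs1 0x00 and imm 4, which agrees with the register addresses and immediate value read off the ID stage figure.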