# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by ???
## Introduction
My original plan was to ==implement quantization from bfloat16 to int8==: single-precision floating-point inputs are first converted to bfloat16 by fp32_to_bf16 and then quantized to 8-bit integers by quant_bf16_to_int8. However, because every operation had to be performed in the bfloat16 format, finding the maximum value, subtraction, division, and multiplication all became quite complicated to implement.
Therefore, I narrowed the case to ==implementing a function that finds the maximum value in a bfloat16 array for quantization==. Single-precision inputs are first converted to bfloat16 by fp32_to_bf16, and bf16_findmax then finds the maximum value in the array, which is the function that quantization will use.
Additionally, because quantization needs the maximum absolute value, I also ==implemented a function that finds the maximum absolute value== in a bfloat16 array.
In summary, I implemented quantization in C, along with assembly code for finding the maximum value and the maximum absolute value.
## Background
### Quantization
Quantization is the process of converting a representation with a higher number of bits into one with a lower number of bits, typically to accelerate computations.
Quantization can be broken down into three main steps:
1. Find the maximum absolute value in the data.

2. Calculate the scaling factor (Scale).

3. Multiply the data by the Scale and round it to the nearest integer.
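
The three steps above can be sketched in C as follows. This is a minimal illustration on plain `float` inputs, not the assignment's implementation: the helper name `quantize_absmax_int8` is hypothetical, and the target range of 127 assumes signed 8-bit output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Symmetric abs-max quantization of n floats to int8.
 * Hypothetical helper for illustration only. */
static void quantize_absmax_int8(const float *x, int8_t *q, size_t n)
{
    assert(n > 0);

    /* Step 1: find the maximum absolute value in the data. */
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = (x[i] < 0.0f) ? -x[i] : x[i];
        if (a > max_abs)
            max_abs = a;
    }

    /* Step 2: compute the scale so that max_abs maps to 127 = 2^(8-1) - 1.
     * An all-zero input gets scale 0 to avoid dividing by zero. */
    float scale = (max_abs > 0.0f) ? 127.0f / max_abs : 0.0f;

    /* Step 3: multiply by the scale and round to the nearest integer
     * (halves are rounded away from zero). */
    for (size_t i = 0; i < n; i++) {
        float v = x[i] * scale;
        q[i] = (int8_t)((v >= 0.0f) ? (v + 0.5f) : (v - 0.5f));
    }
}
```

Because the scale is derived from the largest magnitude, the element with that magnitude always maps to 127 or -127, and every other element lands strictly inside that range.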

### BFloat16
IEEE half-precision 16-bit float

IEEE 754 single-precision 32-bit float

BFloat16

From the information provided, it's clear that bfloat16 consists of 1 bit for the sign, 8 bits for the exponent, and 7 bits for the fraction.
What sets it apart from single-precision is that it retains only 7 bits for the fraction, reducing the number of bits to accelerate computations.
In comparison to half-precision, bfloat16 also uses 16 bits for its representation. However, it allocates 3 more bits to the exponent (8 instead of 5) and 3 fewer to the fraction (7 instead of 10), sacrificing some precision in exchange for a larger range. This is particularly useful in machine learning, where the slight loss of precision doesn't significantly impact computation results, but the extended range helps prevent overflow, which can have a substantial impact on computations.
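
To make the 1/8/7 layout concrete, here is a small C sketch that keeps the upper 16 bits of an fp32 value and splits them into the three fields. The type `bf16_fields` and the function `bf16_split` are hypothetical names, and the sketch truncates rather than rounds, so it only illustrates the bit layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three bfloat16 fields (hypothetical helper type). */
typedef struct {
    unsigned sign;     /* 1 bit */
    unsigned exponent; /* 8 bits, biased by 127 like fp32 */
    unsigned fraction; /* 7 bits */
} bf16_fields;

/* Keep the upper 16 bits of an fp32 value (truncation, no rounding)
 * and split them into sign, exponent, and fraction. */
static bf16_fields bf16_split(float x)
{
    uint32_t w;
    bf16_fields f;

    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&w, &x, sizeof w);          /* read the bits without pointer casts */

    uint16_t h = (uint16_t)(w >> 16);  /* bfloat16 is the upper half of fp32 */
    f.sign     = (h >> 15) & 0x1u;
    f.exponent = (h >> 7) & 0xFFu;
    f.fraction = h & 0x7Fu;
    return f;
}
```

For example, 3.5 is 1.75 × 2^1, so its biased exponent is 128 and its 7-bit fraction is 0x60 (binary 1100000).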
# Implementation
## C code (Quantization)
``` c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
# define array_size 7
# define range 127 /*2^(n-1)-1, n: quant bit*/
float fp32_to_bf16(float x);
int* quant_bf16_to_int8(float x[]);
float bf16_findmax(float x[]);
int main()
{
float array[array_size] = {1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000};
float array2[array_size] = { 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5};
float array3[array_size] = { 3.14159265, 0.12345678 , 1.23456789 , 0.00000123, 0.00000001, 0.99999999 , 0.00000007 };
float array_bf16[array_size] = {0};
int *after_quant;
/*data 1*/
for (int i = 0; i < 7; i++) {
array_bf16[i] = fp32_to_bf16(array[i]);
}
printf("data 1\nbfloat16 number is \n");
for (int i = 0; i < array_size; i++) {
printf("%.12f\n", array_bf16[i]);
}
after_quant = quant_bf16_to_int8(array_bf16);
printf("after quantization \n");
for (int i = 0; i < array_size; i++) {
printf("%d\n", after_quant[i]);
}
/*data 2*/
for (int i = 0; i < 7; i++) {
array_bf16[i] = fp32_to_bf16(array2[i]);
}
printf("data 2\nbfloat16 number is \n");
for (int i = 0; i < array_size; i++) {
printf("%.12f\n", array_bf16[i]);
}
after_quant = quant_bf16_to_int8(array_bf16);
printf("after quantization \n");
for (int i = 0; i < array_size; i++) {
printf("%d\n", after_quant[i]);
}
/*data 3*/
for (int i = 0; i < 7; i++) {
array_bf16[i] = fp32_to_bf16(array3[i]);
}
printf("data 3\nbfloat16 number is \n");
for (int i = 0; i < array_size; i++) {
printf("%.12f\n", array_bf16[i]);
}
after_quant = quant_bf16_to_int8(array_bf16);
printf("after quantization \n");
for (int i = 0; i < array_size; i++) {
printf("%d\n", after_quant[i]);
}
system("pause");
return 0;
}
float fp32_to_bf16(float x)
{
float y = x;
int *p = (int *)&y;
unsigned int exp = *p & 0x7F800000;
unsigned int man = *p & 0x007FFFFF;
if (exp == 0 && man == 0) /* zero */
return x;
if (exp == 0x7F800000) /* infinity or NaN */
return x;
/* Normalized number */
/* round to nearest */
float r = x;
int *pr = (int *)&r;
*pr &= 0xFF800000; /* r has the same exp as x */
r /= 0x100; /* r = 2^-8 times the power-of-two part of x */
y = x + r;
*p &= 0xFFFF0000;
return y;
}
int* quant_bf16_to_int8(float x[array_size])
{
static int after_quant[array_size] = {0};
float max = fabs(x[0]);
for (int i = 1; i < array_size; i++) {
if (fabs(x[i]) > max) {
max = fabs(x[i]);
}
}
printf("maximum number is %.12f\n", max);
float scale = range / max;
for (int i = 0; i < array_size; i++) {
after_quant[i] = (int)roundf(x[i] * scale); /* round to nearest, as in step 3 */
}
return after_quant;
}
```
## Assembly code (fp32_to_bf16 & find maximum)
```assembly=
.data
array: .word 0x3f99999a, 0x3f9a0000, 0x4013d70a, 0x40140000, 0x405d70a4, 0x405d0000, 0x40b428f6
# test data1: 1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000
array2: .word 0x3dcccccd, 0x3e4ccccd, 0x3f99999a, 0x40400000, 0x40066666, 0xc0866666, 0x40600000
# test data2: 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5
array3: .word 0x40490fdb, 0x3dfcd6e9, 0x3f9e0652, 0x35a5167a, 0x322bcc77, 0x3f800000, 0x339652e8
# test data3: 3.14159265, 0.12345678 , 1.23456789 , 0.00000123, 0.00000001, 0.99999999 , 0.00000007
array_size: .word 7
array_bf16: .word 0, 0, 0, 0, 0, 0, 0
exp_mask: .word 0x7F800000
man_mask: .word 0x007FFFFF
sign_exp_mask: .word 0xFF800000
bf16_mask: .word 0xFFFF0000
next_line: .string "\n"
max_string: .string "maximum number is "
bf16_string: .string "\nbfloat16 number is \n"
.text
main:
# data 1
la a0, bf16_string
addi a7, x0, 4
ecall
la s0, array # array address
la s2, array_bf16 # array_bf16 address
la s1, array_size
lw s1, 0(s1) # array size
la s3, exp_mask
lw s3, 0(s3)
la s4, man_mask
lw s4, 0(s4)
la s5, sign_exp_mask
lw s5, 0(s5)
la s6, bf16_mask
lw s6, 0(s6)
for1:
lw a0, 0(s0) # first element
jal ra, fp32_to_bf16
sw a0, 0(s2)
addi a7, x0, 34
ecall
la a0, next_line
addi a7, x0, 4
ecall
addi s1, s1, -1
addi s0, s0, 4
addi s2, s2, 4
bne s1, x0, for1
# invoke find maximum
la a0, max_string
addi a7, x0, 4
ecall
addi a1, s2, -28 # input array_bf16
jal ra, bf16_findmax
addi a7, x0, 34
ecall
# data 2
la a0, bf16_string
addi a7, x0, 4
ecall
la s0, array2 # array address
la s2, array_bf16 # array_bf16 address
la s1, array_size
lw s1, 0(s1) # array size
la s3, exp_mask
lw s3, 0(s3)
la s4, man_mask
lw s4, 0(s4)
la s5, sign_exp_mask
lw s5, 0(s5)
la s6, bf16_mask
lw s6, 0(s6)
data2for1:
lw a0, 0(s0) # first element
jal ra, fp32_to_bf16
sw a0, 0(s2)
addi a7, x0, 34
ecall
la a0, next_line
addi a7, x0, 4
ecall
addi s1, s1, -1
addi s0, s0, 4
addi s2, s2, 4
bne s1, x0, data2for1
# invoke find maximum
la a0, max_string
addi a7, x0, 4
ecall
addi a1, s2, -28 # input array_bf16
jal ra, bf16_findmax
addi a7, x0, 34
ecall
# data 3
la a0, bf16_string
addi a7, x0, 4
ecall
la s0, array3 # array address
la s2, array_bf16 # array_bf16 address
la s1, array_size
lw s1, 0(s1) # array size
la s3, exp_mask
lw s3, 0(s3)
la s4, man_mask
lw s4, 0(s4)
la s5, sign_exp_mask
lw s5, 0(s5)
la s6, bf16_mask
lw s6, 0(s6)
data3for1:
lw a0, 0(s0) # first element
jal ra, fp32_to_bf16
sw a0, 0(s2)
addi a7, x0, 34
ecall
la a0, next_line
addi a7, x0, 4
ecall
addi s1, s1, -1
addi s0, s0, 4
addi s2, s2, 4
bne s1, x0, data3for1
# invoke find maximum
la a0, max_string
addi a7, x0, 4
ecall
addi a1, s2, -28 # input array_bf16
jal ra, bf16_findmax
addi a7, x0, 34
ecall
# Exit program
li a7, 10
ecall
fp32_to_bf16:
# ! don't need point variable concept
addi sp,sp,-8
sw a0,0(sp) # y
addi t0,sp,0 # p
lw t2,0(t0) # *p
and t6, t2, s3 # exp
and t4, t2, s4 # man
# if zero
bne t6, x0, else
# exp is zero
bne t4, x0, else
return_x:
addi sp,sp,8
jr ra
else:
# if infinity or NaN
beq t6, s3, return_x # exp == 0x7F800000
# round
sw a0, 4(sp) # r
addi t1, sp, 4 # pr
lw t3, 0(t1) # *pr
and t3, t3, s5
sw t3, 0(t1)
lw t3, 4(sp) # r
# floating point divide
# ~~ addi t5, x0, 0x100~~
# ~~ div t3, t3, t5~~
li t5, 0x04000000
sub t3, t3, t5 # r
# floating point addition
# t3: r, t4: a0's man, t6:a0's exp
# ~~ add t5, a0, t3 # y~~
and t5, t3, s3 # r's exp
srli t6, t6, 23 # exp alignment
srli t5, t5, 23
sub t2, t6, t5 # exp diff
and t3, t3, s4 # r's man
# man alignment
li s11, 0x00800000 # make up 1 to No.24bit
or t3, t3, s11
or t4, t4, s11
# t2>=0: a0>=r; t2<0: a0<r. Shift the smaller mantissa right and keep the bigger exponent
bge t2, x0, x_big_r
srl t4, t4, t2 # a0's man shift right t2 bit
mv t6, t5 # reserve t5(r's exp)
j add_man
x_big_r:
srl t3, t3, t2 # r's man shift right t2 bit
add_man:
add t3, t3, t4 # mantissa addition
# check carry
and t4, t3, s11 # check No.24bit, 0:carry, 1: nocarry
bne t4, x0, no_carry
addi t6, t6, 1 # exp+1
srli t3 ,t3, 1 # man alignment
no_carry:
slli t6, t6, 23 # exp shift
and t6, t6, s3 # mask exp
and t3, t3, s4 # mask man
or t6, t3, t6 # combine exp & man
li s11, 0x80000000 # sign mask
and t3, a0, s11 # sign
or t5, t3, t6
# ! integrate divide and addition can be man shift 8 bit
sw t5, 0(sp) # y
lw t2, 0(t0) # *p
and t2, t2, s6
sw t2, 0(t0)
lw t5, 0(sp)
add a0, x0, t5
addi sp,sp,8
jr ra
bf16_findmax:
# a1: bf16_array, return a0: max
# Prologue
addi sp, sp, -16
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
li s0, 0x80000000 # sign mask
li s1, 0x7f800000 # exp mask
li s2, 0x007f0000 # man mask
mv t2, a1 # input array(t2)
lw a0, 0(t2) # max(a0)
la t4, array_size
lw t4, 0(t4) # array size(t4)
addi t3, x0, 1 # count(t3)
for2:
addi t3, t3, 1
addi t2, t2, 4
lw t1, 0(t2) # second element(t1)
# blt t1, a0, max_not_change
# ! max's sign,exp,man can save
# bf16_compare
# a0: max, t1: next
# compare sign
and t5, a0, s0
and t6, t1, s0
bltu t6, t5, max_change # t6=0(+), t5=1(-) branch
bltu t5, t6, max_not_change # t6=1(-), t5=0(+)
bne t6, x0, negative # both signs set (0x80000000 is negative as a signed value, so blt would never branch)
# positive
# compare exp
and t5, a0, s1
and t6, t1, s1
blt t5, t6, max_change # t5(max.exp)<t6(next.exp) branch
blt t6, t5, max_not_change
# compare man
and t5, a0, s2
and t6, t1, s2
blt t5, t6, max_change
blt t6, t5, max_not_change
negative:
# compare exp
and t5, a0, s1
and t6, t1, s1
blt t6, t5, max_change
blt t5, t6, max_not_change
# compare man
and t5, a0, s2
and t6, t1, s2
blt t6, t5, max_change
blt t5, t6, max_not_change
max_change:
mv a0, t1
max_not_change:
blt t3, t4, for2
# Epilogue
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
addi sp, sp, 16
jr ra
```
## Assembly code (fp32_to_bf16 & find maximum absolute value)
```assembly=
.data
array: .word 0x3f99999a, 0x3f9a0000, 0x4013d70a, 0x40140000, 0x405d70a4, 0x405d0000, 0x40b428f6
# test data1: 1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000
array2: .word 0x3dcccccd, 0x3e4ccccd, 0x3f99999a, 0x40400000, 0x40066666, 0xc0866666, 0x40600000
# test data2: 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5
array3: .word 0x40490fdb, 0x3dfcd6e9, 0x3f9e0652, 0x35a5167a, 0x322bcc77, 0x3f800000, 0x339652e8
# test data3: 3.14159265, 0.12345678 , 1.23456789 , 0.00000123, 0.00000001, 0.99999999 , 0.00000007
array_size: .word 7
array_bf16: .word 0, 0, 0, 0, 0, 0, 0
exp_mask: .word 0x7F800000
man_mask: .word 0x007FFFFF
sign_exp_mask: .word 0xFF800000
bf16_mask: .word 0xFFFF0000
next_line: .string "\n"
max_string: .string "maximum number is "
bf16_string: .string "\nbfloat16 number is \n"
.text
main:
# data 1
la a0, bf16_string
addi a7, x0, 4
ecall
la s0, array # array address
la s2, array_bf16 # array_bf16 address
la s1, array_size
lw s1, 0(s1) # array size
la s3, exp_mask
lw s3, 0(s3)
la s4, man_mask
lw s4, 0(s4)
la s5, sign_exp_mask
lw s5, 0(s5)
la s6, bf16_mask
lw s6, 0(s6)
for1:
lw a0, 0(s0) # first element
jal ra, fp32_to_bf16
sw a0, 0(s2)
addi a7, x0, 34
ecall
la a0, next_line
addi a7, x0, 4
ecall
addi s1, s1, -1
addi s0, s0, 4
addi s2, s2, 4
bne s1, x0, for1
# invoke find maximum
la a0, max_string
addi a7, x0, 4
ecall
addi a1, s2, -28 # input array_bf16
jal ra, bf16_findmax
addi a7, x0, 34
ecall
# data 2
la a0, bf16_string
addi a7, x0, 4
ecall
la s0, array2 # array address
la s2, array_bf16 # array_bf16 address
la s1, array_size
lw s1, 0(s1) # array size
la s3, exp_mask
lw s3, 0(s3)
la s4, man_mask
lw s4, 0(s4)
la s5, sign_exp_mask
lw s5, 0(s5)
la s6, bf16_mask
lw s6, 0(s6)
data2for1:
lw a0, 0(s0) # first element
jal ra, fp32_to_bf16
sw a0, 0(s2)
addi a7, x0, 34
ecall
la a0, next_line
addi a7, x0, 4
ecall
addi s1, s1, -1
addi s0, s0, 4
addi s2, s2, 4
bne s1, x0, data2for1
# invoke find maximum
la a0, max_string
addi a7, x0, 4
ecall
addi a1, s2, -28 # input array_bf16
jal ra, bf16_findmax
addi a7, x0, 34
ecall
# data 3
la a0, bf16_string
addi a7, x0, 4
ecall
la s0, array3 # array address
la s2, array_bf16 # array_bf16 address
la s1, array_size
lw s1, 0(s1) # array size
la s3, exp_mask
lw s3, 0(s3)
la s4, man_mask
lw s4, 0(s4)
la s5, sign_exp_mask
lw s5, 0(s5)
la s6, bf16_mask
lw s6, 0(s6)
data3for1:
lw a0, 0(s0) # first element
jal ra, fp32_to_bf16
sw a0, 0(s2)
addi a7, x0, 34
ecall
la a0, next_line
addi a7, x0, 4
ecall
addi s1, s1, -1
addi s0, s0, 4
addi s2, s2, 4
bne s1, x0, data3for1
# invoke find maximum
la a0, max_string
addi a7, x0, 4
ecall
addi a1, s2, -28 # input array_bf16
jal ra, bf16_findmax
addi a7, x0, 34
ecall
# Exit program
li a7, 10
ecall
fp32_to_bf16:
# ! don't need point variable concept
addi sp,sp,-8
sw a0,0(sp) # y
addi t0,sp,0 # p
lw t2,0(t0) # *p
and t6, t2, s3 # exp
and t4, t2, s4 # man
# if zero
bne t6, x0, else
# exp is zero
bne t4, x0, else
return_x:
addi sp,sp,8
jr ra
else:
# if infinity or NaN
beq t6, s3, return_x # exp == 0x7F800000
# round
sw a0, 4(sp) # r
addi t1, sp, 4 # pr
lw t3, 0(t1) # *pr
and t3, t3, s5
sw t3, 0(t1)
lw t3, 4(sp) # r
# floating point divide
# ~~ addi t5, x0, 0x100~~
# ~~ div t3, t3, t5~~
li t5, 0x04000000
sub t3, t3, t5 # r
# floating point addition
# t3: r, t4: a0's man, t6:a0's exp
# ~~ add t5, a0, t3 # y~~
and t5, t3, s3 # r's exp
srli t6, t6, 23 # exp alignment
srli t5, t5, 23
sub t2, t6, t5 # exp diff
and t3, t3, s4 # r's man
# man alignment
li s11, 0x00800000 # make up 1 to No.24bit
or t3, t3, s11
or t4, t4, s11
# t2>=0: a0>=r; t2<0: a0<r. Shift the smaller mantissa right and keep the bigger exponent
bge t2, x0, x_big_r
srl t4, t4, t2 # a0's man shift right t2 bit
mv t6, t5 # reserve t5(r's exp)
j add_man
x_big_r:
srl t3, t3, t2 # r's man shift right t2 bit
add_man:
add t3, t3, t4 # mantissa addition
# check carry
and t4, t3, s11 # check No.24bit, 0:carry, 1: nocarry
bne t4, x0, no_carry
addi t6, t6, 1 # exp+1
srli t3 ,t3, 1 # man alignment
no_carry:
slli t6, t6, 23 # exp shift
and t6, t6, s3 # mask exp
and t3, t3, s4 # mask man
or t6, t3, t6 # combine exp & man
li s11, 0x80000000 # sign mask
and t3, a0, s11 # sign
or t5, t3, t6
# ! integrate divide and addition can be man shift 8 bit
sw t5, 0(sp) # y
lw t2, 0(t0) # *p
and t2, t2, s6
sw t2, 0(t0)
lw t5, 0(sp)
add a0, x0, t5
addi sp,sp,8
jr ra
bf16_findmax:
# a1: bf16_array, return a0: max
# Prologue
addi sp, sp, -12
sw ra, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
li s1, 0x7f800000 # exp mask
li s2, 0x007f0000 # man mask
mv t2, a1 # input array(t2)
lw a0, 0(t2) # max(a0)
la t4, array_size
lw t4, 0(t4) # array size(t4)
addi t3, x0, 1 # count(t3)
for2:
addi t3, t3, 1
addi t2, t2, 4
lw t1, 0(t2) # second element(t1)
# blt t1, a0, max_not_change
# ! max's sign,exp,man can save
# bf16_compare
# a0: max, t1: next
# compare exp
and t5, a0, s1
and t6, t1, s1
blt t5, t6, max_change # t5(max.exp)<t6(next.exp) branch
blt t6, t5, max_not_change
# compare man
and t5, a0, s2
and t6, t1, s2
blt t5, t6, max_change
blt t6, t5, max_not_change
max_change:
mv a0, t1
max_not_change:
blt t3, t4, for2
# Absolute
li t0, 0x7fffffff
and a0, a0, t0
# Epilogue
lw ra, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
jr ra
```
## Output
### C output

### Assembly output
- **Fp32_to_bf16 & Find maximum**

- **Fp32_to_bf16 & Find maximum absolute value**

test data2: 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5
Test data 2 shows the difference between finding the maximum value and finding the maximum absolute value for bfloat16.
When searching for the maximum value, the result is 3.5, represented as 0x40600000.
When searching for the maximum absolute value, the result is 4.1875, represented as 0x40860000. This is the magnitude of -4.2 after conversion to bfloat16.
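
The difference can also be reproduced with a short C sketch that compares the bit patterns directly. The function names are hypothetical, NaN inputs are assumed not to occur, and the input words below are bfloat16 values that I derived by hand from test data 2 using the rounding in fp32_to_bf16, so treat them as an assumption rather than captured output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map a float bit pattern to a key whose unsigned order matches the
 * numeric order: negative values are inverted, non-negative values get
 * the top bit set so that they sort above all negatives. */
static uint32_t order_key(uint32_t w)
{
    return (w & 0x80000000u) ? ~w : (w | 0x80000000u);
}

/* Signed maximum of n bfloat16 words stored in the upper 16 bits. */
static uint32_t bf16_max_bits(const uint32_t *a, size_t n)
{
    assert(n > 0);
    uint32_t best = a[0];
    for (size_t i = 1; i < n; i++)
        if (order_key(a[i]) > order_key(best))
            best = a[i];
    return best;
}

/* Maximum absolute value: clear the sign bit, then compare as unsigned
 * integers (exponent first, then mantissa, in one comparison). */
static uint32_t bf16_absmax_bits(const uint32_t *a, size_t n)
{
    assert(n > 0);
    uint32_t best = a[0] & 0x7FFFFFFFu;
    for (size_t i = 1; i < n; i++) {
        uint32_t m = a[i] & 0x7FFFFFFFu;
        if (m > best)
            best = m;
    }
    return best;
}

/* Assumed bfloat16 words for test data 2 (hand-derived). */
static const uint32_t data2_bf16[7] = {
    0x3dcd0000u, 0x3e4d0000u, 0x3f9a0000u, 0x40400000u,
    0x40060000u, 0xc0860000u, 0x40600000u
};
```

With these inputs, `bf16_max_bits` selects 0x40600000 (3.5) and `bf16_absmax_bits` selects 0x40860000 (4.1875), matching the two results above.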
## Result
- **Fp32_to_bf16 & Find maximum**

- **Fp32_to_bf16 & Find maximum absolute value**

### After optimization
- **Optimal_Fp32_to_bf16 & Find maximum absolute value**

# Optimization
```assembly=
.data
array: .word 0x3f99999a, 0x3f9a0000, 0x4013d70a, 0x40140000, 0x405d70a4, 0x405d0000, 0x40b428f6
# test data1: 1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000
array2: .word 0x3dcccccd, 0x3e4ccccd, 0x3f99999a, 0x40400000, 0x40066666, 0xc0866666, 0x40600000
# test data2: 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5
array3: .word 0x40490fdb, 0x3dfcd6e9, 0x3f9e0652, 0x35a5167a, 0x322bcc77, 0x3f800000, 0x339652e8
# test data3: 3.14159265, 0.12345678 , 1.23456789 , 0.00000123, 0.00000001, 0.99999999 , 0.00000007
array_bf16: .word 0, 0, 0, 0, 0, 0, 0
exp_mask: .word 0x7F800000
man_mask: .word 0x007FFFFF
sign_exp_mask: .word 0xFF800000
bf16_mask: .word 0xFFFF0000
next_line: .string "\n"
max_string: .string "maximum number is "
bf16_string: .string "\nbfloat16 number is \n"
.text
main:
# push data
addi sp, sp, -12
la t0, array
sw t0, 0(sp)
la t0, array2
sw t0, 4(sp)
la t0, array3
sw t0, 8(sp)
la s10, array_bf16 # global array_bf16 address(s10)
addi s11, x0, 3 # data number(s11)
la s9, exp_mask # global exp(s9)
la s8, man_mask # global man(s8)
la s6, bf16_mask # global bf16(s6)
lw s9, 0(s9)
lw s8, 0(s8)
lw s6, 0(s6)
add s7, x0, sp
main_for:
la a0, bf16_string
addi a7, x0, 4
ecall
addi a3, x0, 7 # array size(a3)
lw a1, 0(s7) # array_data pointer(a1)
mv a2, s10 # array_bf16 pointer(a2)
jal ra, fp32_to_bf16_findmax
addi s11, s11, -1
addi s7, s7, 4
bne s11, x0, main_for
# Exit program
li a7, 10
ecall
fp32_to_bf16_findmax:
# array_data pointer(a1), array_bf16 pointer(a2), array size(a3)
# prologue
addi sp, sp, -8
sw s0, 0(sp)
sw s1, 4(sp)
# array loop
for1:
lw a5, 0(a1) # x(a5)
# fp32_to_bf16
and t0, a5, s9 # x exp(t0)
and t1, a5, s8 # x man(t1)
# if zero
bne t0, x0, else
# exp is zero
bne t1, x0, else
j finish_bf16
else:
# if infinity or NaN
beq t0, s9, finish_bf16
# round
# r = x.man shift right 8 bit
# x+r = x.man + x.man>>8
li t3, 0x00800000 # make up 1 to No.24bit
or t1, t1, t3
srli t2, t1, 8 # r(t2)
add t1, t1, t2 # x+r
# check carry
and t4, t1, t3 # check No.24bit (t4), 0:carry, 1: nocarry
bne t4, x0, no_carry
add t0, t0, t3 # exp+1
srli t1 ,t1, 1 # man alignment
no_carry:
and t0, t0, s9 # mask exp(t0)
and t1, t1, s8 # mask man(t1)
or t2, t0, t1 # combine exp & man
li t3, 0x80000000 # sign mask
and t3, a5, t3 # x sign
or a5, t3, t2 # bfloat16(a5)
and a5, a5, s6
finish_bf16:
sw a5, 0(a2)
mv a0, a5
addi a7, x0, 34
ecall
la a0, next_line
addi a7, x0, 4
ecall
slti t3, a3, 7 # (a3==7) t3=0, (a3<7) t3=1
bne t3, x0, compare
# saved first max
j max_change
compare:
# compare exp
blt s0, t0, max_change
blt t0, s0, max_not_change
# compare man
blt s1, t1, max_change
blt t1, s1, max_not_change
max_change:
mv s0, t0 # max exp(s0)
mv s1, t1 # max man(s1)
mv a4, a5 # max bf16(a4)
max_not_change:
addi a3, a3, -1
addi a1, a1, 4
addi a2, a2, 4
bne a3, x0, for1
# Absolute
li t2, 0x7fffffff
and a4, a4, t2
#print
la a0, max_string
addi a7, x0, 4
ecall
mv a0, a4
addi a7, x0, 34
ecall
# epilogue
lw s0, 0(sp)
lw s1, 4(sp)
addi sp, sp, 8
jr ra
```
1. Using a loop to iterate over the test data sets removes the redundant code that was repeated for each data set.
2. Merging fp32_to_bf16 and bf16_findmax into a single function removes one pass over the array.
3. Operating on registers directly, instead of emulating C pointer variables on the stack, avoids unnecessary memory accesses and instructions.
4. The FP32 division and addition in fp32_to_bf16 (r = x / 256, y = x + r) are combined into an 8-bit right shift of the mantissa followed by an integer addition.
5. The exponent and mantissa of the current maximum are kept in registers for reuse, so they do not have to be extracted again for every comparison.
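
Point 4 can be sketched in C on raw bit patterns. This mirrors the structure of the optimized assembly as I read it; the function name `fp32_bits_to_bf16` is hypothetical, and subnormal inputs are not treated specially, which is also an assumption carried over from the code above.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an fp32 bit pattern to a bfloat16 bit pattern kept in the
 * upper 16 bits, using integer operations only.
 * Rounding follows the y = x + x/256 scheme, then truncation. */
static uint32_t fp32_bits_to_bf16(uint32_t x)
{
    const uint32_t EXP_MASK = 0x7F800000u;
    const uint32_t MAN_MASK = 0x007FFFFFu;
    const uint32_t HIDDEN   = 0x00800000u; /* implicit leading 1 (bit 23) */

    uint32_t exp = x & EXP_MASK;
    uint32_t man = x & MAN_MASK;

    /* Zero, infinity, and NaN are returned unchanged. */
    if ((exp == 0 && man == 0) || exp == EXP_MASK)
        return x;

    man |= HIDDEN;      /* restore the implicit 1 */
    man += man >> 8;    /* x + x/256 with a shared exponent */

    if (man & 0x01000000u) { /* carry out of bit 23 */
        exp += HIDDEN;       /* exponent + 1 */
        man >>= 1;           /* renormalize the mantissa */
    }

    uint32_t y = (x & 0x80000000u) | (exp & EXP_MASK) | (man & MAN_MASK);
    y &= 0xFFFF0000u;        /* keep only the bfloat16 half */
    assert((y & 0x0000FFFFu) == 0);
    return y;
}
```

Since r always has the same exponent as x minus 8, aligning the two mantissas is just a fixed shift by 8, which is why the general floating-point addition in the first version is not needed.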
# Analysis
## Pipeline

| stage | description |
| -------- | -------- |
| IF | Instruction Fetch |
| ID | Instruction Decode & Register Read |
| EX | Execution or Address Calculation |
| MEM | Data Memory Access |
| WB | Write Back |

Different types of instructions perform different operations in each stage. Below is an example using "addi".
### IF

The IF (Instruction Fetch) stage fetches the instruction from memory at the address held in the program counter (PC), and also increments the PC. Therefore, the output "0x00400893" is the encoding of the "addi x17,x0,4" instruction, and the "0x0000000C" shown before it is the address of the next instruction.
### ID

The ID (Instruction Decode) stage primarily involves decoding the instruction, extracting the source register addresses, destination register address, opcode, and immediate value. It also retrieves values from the registers. Therefore, in this context, you can see that "0x00" represents the source register address, "0x11" is the destination register address, "0x00000004" is the immediate value, and "0x00000000" is the value taken from register x0.
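
The field extraction done in this stage can be checked by hand with a short C sketch that decodes an I-type instruction word such as 0x00400893. The names `itype_fields` and `decode_itype` are hypothetical; the bit positions follow the standard RV32I I-type format.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an RV32I I-type instruction (hypothetical helper type). */
typedef struct {
    uint32_t opcode; /* bits  6..0  */
    uint32_t rd;     /* bits 11..7  */
    uint32_t funct3; /* bits 14..12 */
    uint32_t rs1;    /* bits 19..15 */
    int32_t  imm;    /* bits 31..20, sign-extended */
} itype_fields;

static itype_fields decode_itype(uint32_t insn)
{
    itype_fields f;

    /* Uncompressed 32-bit instructions have the two lowest bits set. */
    assert((insn & 0x3u) == 0x3u);

    f.opcode = insn & 0x7Fu;
    f.rd     = (insn >> 7) & 0x1Fu;
    f.funct3 = (insn >> 12) & 0x7u;
    f.rs1    = (insn >> 15) & 0x1Fu;

    /* Sign-extend the 12-bit immediate without relying on signed shifts. */
    f.imm = (int32_t)(insn >> 20);
    if (f.imm & 0x800)
        f.imm -= 0x1000;
    return f;
}
```

Decoding 0x00400893 this way gives rs1 = 0x00, rd = 0x11, and an immediate of 4, which agrees with the values described for the ID stage.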
### EX

The EX (Execution) stage primarily performs numerical operations through the ALU. Four multiplexers determine the inputs to the ALU: the upper two mainly decide whether to use the PC value or the register value, while the lower two mainly decide whether to use the immediate value or the register value. Arithmetic instructions perform their calculations here, branch instructions calculate the next PC, and memory instructions calculate memory access addresses. Therefore, in this context, OP1 receives "0x00000000", which is the value of x0, and OP2 receives "0x00000004", which is the immediate value.
### MEM

The MEM (Memory) stage primarily handles memory access operations, such as lw (load word) or sw (store word), based on the address calculated by the ALU. In the case of "addi," there are no operations in this stage. Therefore, the result of the operation, which in this case is "0x00000004," will be passed on to the next stage, which is the WB (Write Back) stage.
### WB

The WB (Write Back) stage primarily performs the write-back operation to registers based on the destination register address received from the previous stage. A multiplexer decides whether to write the value from the ALU operation, the PC, or memory into the destination register. Therefore, in this context, you can see that "0x00000004" is being written back into register "0x11."
## Hazard
### Data hazard
In a pipeline, a certain instruction may require the results of preceding instructions that have not yet been generated, meaning that the data needed for the execution of the instruction is not yet available.

Because "and x31 x7 x19" reads x7 while the preceding "lw x7 0 x5" has not yet produced it, a Read After Write (RAW) data hazard occurs. A load only obtains its value from memory in the MEM stage, so the value cannot reach the "and" instruction in time for its EX stage, even with forwarding. The pipeline therefore inserts a nop (no-operation), which adds an extra cycle.
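
The condition that triggers this stall can be written as a one-line check. The sketch below uses the textbook load-use hazard condition for a five-stage pipeline with forwarding; the function name `load_use_stall` is hypothetical, and it is not taken from the simulator's source.

```c
#include <assert.h>

/* Returns 1 when the instruction in ID must stall for one cycle because
 * the instruction directly ahead of it is a load whose destination
 * register is one of its source registers. */
static int load_use_stall(int prev_is_load, unsigned prev_rd,
                          unsigned rs1, unsigned rs2)
{
    assert(prev_rd < 32 && rs1 < 32 && rs2 < 32);

    /* x0 is hard-wired to zero, so a load into x0 never causes a hazard. */
    return prev_is_load && prev_rd != 0 &&
           (prev_rd == rs1 || prev_rd == rs2);
}
```

For the pair above, `load_use_stall(1, 7, 7, 19)` reports a stall because "lw" writes x7 and "and" reads x7. If the first instruction were an ALU instruction instead of a load, its result could be forwarded and no stall would be needed.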
### Structural hazard
Insufficient hardware resources prevent multiple instructions from using the same resource in the same cycle.

This example illustrates a situation where a "load" instruction is in the MEM stage, trying to access memory, while instruction i+3 needs to enter the IF stage to fetch an instruction from memory. A single memory cannot serve two different read requests at the same time, which leads to a structural hazard. This can be resolved by separating instruction memory from data memory.
### Control hazard
When the result of a branch instruction has not yet been determined, subsequent instructions have already entered the pipeline. If the branch is then taken, those instructions were fetched from the wrong path and must be discarded.

Because "bne x31 x0 16" only computes its result and decides whether to branch in the EX stage, the two following instructions, "bne" and "addi", have already entered the pipeline by then. If the branch is taken, they must be flushed.
# Future work
First, reduce the hazards described above in order to cut the cycles lost to NOPs. Next, implement bfloat16 multiplication, division, and addition, and then realize the full quantization from bfloat16 to int8.