# Performing multiplication using BF16 (brain floating point) in RISC-V assembly language
contributed by < [aa860630](https://github.com/aa860630/2023-computer-architecture.git) >
## Introduction
First, we need to understand why BF16 matters. A BF16 value occupies 16 bits, half the width of FP32, so it needs less storage and memory bandwidth, and its arithmetic works on a much shorter mantissa.
In addition, we should also understand the differences between BF16 and FP16. As shown in the diagram below, BF16 has 8 bits in the exponent region, while FP16 only has 5 bits. This means that BF16 can represent a larger range of numerical values. Regarding the mantissa, FP16's mantissa region contains 10 bits, whereas BF16 has only 7 bits, indicating that BF16 does not have the same level of precision as FP16.
![](https://hackmd.io/_uploads/S1FhwxheT.png)
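To make the two layouts concrete, here is a minimal C sketch that splits a 16-bit pattern into its sign, exponent, and mantissa fields for either format. The type `half_fields` and the function `split16` are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* Fields of a 16-bit floating-point encoding (hypothetical helper type). */
typedef struct {
    unsigned sign;     /* 1 bit */
    unsigned exponent; /* biased exponent field */
    unsigned mantissa; /* fraction field, without the implicit leading 1 */
} half_fields;

/* Split a 16-bit pattern given the exponent and mantissa widths.
 * BF16: exp_bits = 8, man_bits = 7.  FP16: exp_bits = 5, man_bits = 10. */
half_fields split16(uint16_t bits, int exp_bits, int man_bits)
{
    half_fields f;
    f.sign = bits >> 15;
    f.exponent = (bits >> man_bits) & ((1u << exp_bits) - 1u);
    f.mantissa = bits & ((1u << man_bits) - 1u);
    return f;
}
```

For example, `split16(0x3F9A, 8, 7)` reads the pattern as BF16, while `split16(0x3C00, 5, 10)` reads a pattern as FP16.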
## Converting from FP32 to BF16
### Written in C language
```
float fp32_to_bf16(float x)
{
    float y = x;
    int *p = (int *) &y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    if (exp == 0 && man == 0) /* zero */
        return x;
    if (exp == 0x7F800000) /* infinity or NaN */
        return x;

    /* Normalized number */
    /* round to nearest */
    float r = x;
    int *pr = (int *) &r;
    *pr &= 0xFF800000; /* r has the same exp as x */
    r /= 0x100; /* r is now half of the last BF16 mantissa step */
    y = x + r;

    *p &= 0xFFFF0000; /* keep only the upper 16 bits */
    return y;
}
```
### Written in RISC-V
Click [here](https://github.com/aa860630/2023-computer-architecture/tree/main/HW1) for the complete code.
Data is preset in the ```.data``` section. Unlike in high-level languages, RISC-V assembly has no data type declarations: a directive such as ```.word``` only reserves and initializes 32 bits. This means you must keep track of how each value is interpreted and perform any conversions yourself.
```
.data
test: .word 0x3f99999a
exp_mask: .word 0x7F800000
man_mask: .word 0x007FFFFF
```
To facilitate calculations in the sign, exponent, and mantissa regions, you need to perform AND operations with their respective masks. First, you need to load the address of the corresponding mask and then load the value at that address.
```
la s0 test
lw s0 0(s0)
la s1 exp_mask
lw s1 0(s1)
la s2 man_mask
lw s2 0(s2)
and s1 s0 s1 #Only retrieve the EXPONENT part of TEST.
and s2 s0 s2 #Only retrieve the MANTISSA part of TEST.
```
The sign bit can be obtained directly by right-shifting the value by 31 bits.
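The same field extraction can be sketched in C. The masks are the ones from the `.data` section above; the function names are hypothetical and only mirror the assembly for illustration.

```c
#include <stdint.h>

#define EXP_MASK 0x7F800000u /* same value as exp_mask */
#define MAN_MASK 0x007FFFFFu /* same value as man_mask */

/* Sign bit, moved down to bit 0. */
uint32_t fp32_sign(uint32_t bits) { return bits >> 31; }

/* Exponent field, left in place (bits 30..23). */
uint32_t fp32_exp_part(uint32_t bits) { return bits & EXP_MASK; }

/* Mantissa field (bits 22..0). */
uint32_t fp32_man_part(uint32_t bits) { return bits & MAN_MASK; }
```

Calling these on the test value `0x3f99999a` corresponds to the contents of `s1` and `s2` after the two `and` instructions.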
## Floating Point Encoding Summary

The following code is used to determine if a value is zero.
```
bnez s1 exp_isnt_0 #If exp is 0, proceed to the next line for the second evaluation
beqz s2 return_x #If man is also 0, return x
exp_isnt_0:
```
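As a rough C counterpart, the sketch below classifies an FP32 bit pattern from its exponent and mantissa fields, following the standard IEEE 754 encoding. The enum and function names are hypothetical, and the assembly above only performs the zero check.

```c
#include <stdint.h>

/* Hypothetical labels for the encoding classes. */
typedef enum {
    FP_KIND_ZERO,
    FP_KIND_DENORMAL,
    FP_KIND_NORMAL,
    FP_KIND_INF,
    FP_KIND_NAN
} fp_kind;

fp_kind classify_fp32(uint32_t bits)
{
    uint32_t exp = bits & 0x7F800000u;
    uint32_t man = bits & 0x007FFFFFu;

    if (exp == 0)            /* exponent field all zeros */
        return man == 0 ? FP_KIND_ZERO : FP_KIND_DENORMAL;
    if (exp == 0x7F800000u)  /* exponent field all ones */
        return man == 0 ? FP_KIND_INF : FP_KIND_NAN;
    return FP_KIND_NORMAL;
}
```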
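Discard the rightmost 16 bits by ANDing with a mask that keeps only the upper 16 bits (```0xFFFF0000``` in the C version). Note that if rounding is to be performed, first check whether the highest discarded bit (bit 15) is 1 and, if so, round up before masking.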
```
la t0, result_mask #32->16
lw t0, 0(t0)
and s0, s0, t0
```
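The difference between plain truncation and rounding on bit 15 can be sketched in C as follows. This is an illustrative integer-only version with hypothetical function names, assuming a normalized, finite input; it is not the exact floating-point addition used in the C function earlier.

```c
#include <stdint.h>

/* Keep the upper 16 bits and drop the rest. */
uint32_t bf16_truncate(uint32_t bits)
{
    return bits & 0xFFFF0000u;
}

/* Round up when the highest discarded bit (bit 15) is 1, then truncate.
 * Assumption: bits encodes a normalized, finite FP32 value. A carry out
 * of the mantissa moves into the exponent field, which is the intended
 * behaviour when the mantissa overflows. */
uint32_t bf16_round_half_up(uint32_t bits)
{
    if (bits & 0x00008000u)
        bits += 0x00010000u;
    return bits & 0xFFFF0000u;
}
```

For the test value `0x3f99999a`, truncation and rounding give different upper halves because bit 15 is set.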
## Applications of BF16 Multiplication
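The sign of the product is computed with the XOR instruction: operands with the same sign give a positive result (two negatives make a positive), while operands with different signs give a negative result.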
```
xor t0 s0 s1
and s10 t0 s2 # s2 = 0x80000000
```
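In C, the same sign computation is a one-liner. The function name is hypothetical; the mask is the `0x80000000` value noted in the comment above.

```c
#include <stdint.h>

/* Sign bit of a * b: XOR the operands and keep only bit 31. */
uint32_t product_sign(uint32_t a, uint32_t b)
{
    return (a ^ b) & 0x80000000u;
}
```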

For the exponent part, first mask out the exponent field and right-shift it by 23 bits so that it sits in the lowest bits, then subtract the bias (127) from each exponent. Because multiplying powers of the same base adds their exponents, t0 and t1 can then be added directly.
Since the mantissa multiplication may still produce a carry into the exponent, the exponent is not shifted back to its original position yet.
```
and t0 s0 s2 #s2 = 0x7f800000
and t1 s1 s2
srli t0 t0 23
srli t1 t1 23
addi t0 t0 -127 # subtract the bias (127)
addi t1 t1 -127
```
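A minimal C sketch of the exponent step is shown below. The function names are hypothetical, and adding the bias back once after the sum is an assumption made here so that the result is again a biased exponent; the snippet above does not show that step.

```c
#include <stdint.h>

/* Biased exponent field shifted down to the low bits, minus the bias. */
int32_t unbiased_exp(uint32_t bits)
{
    return (int32_t)((bits & 0x7F800000u) >> 23) - 127;
}

/* Exponent of the product before normalization.
 * Assumption: the bias (127) is added back once after the sum. */
int32_t product_exp_biased(uint32_t a, uint32_t b)
{
    return unbiased_exp(a) + unbiased_exp(b) + 127;
}
```

For 1.5 (`0x3FC00000`) and 2.0 (`0x40000000`), the unbiased exponents are 0 and 1, so the biased exponent of the product is 128.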
Although RISC-V provides multiplication instructions (in the M extension), a multiplication ultimately boils down to additions, so we can also implement it with shifts and additions. The method is as follows: examine the rightmost bit of the multiplier; if it is 1, add the multiplicand to the product, and if it is 0, do nothing. Then left-shift the multiplicand by one bit, right-shift the multiplier by one bit, and repeat.
Note that registers are only 32 bits wide. Multiplying two FP32 significands (24 bits each, including the implicit leading 1) yields up to 48 bits, which exceeds a single register and requires additional handling. With BF16, each significand is only 8 bits, so the 16-bit product fits in one register.

The mantissa field has 7 bits; together with the implicit leading integer bit, the significand has 8 bits, so eight rounds of shift-and-add are required. Because the product may carry into an extra bit, it must be normalized afterwards.
```s0 ```: multiplier,
```t3 ```: multiplicand,
Initial value of ```s7``` : 0
Initial value of ```s8``` : 8
```
andi t1 s0 1 # t1 = last_bit
srli s0 s0 1 # right shift multiplier 1 bit
beqz t1 loop # if last bit is 0, then jump to loop
add s1 t3 s1 #product = product + multiplicand (t3 holds the multiplicand)
loop:
slli t3 t3 1 #left shift multiplicand 1 bit
bge s7 s8 normalize
addi s7 s7 1
andi t1 s0 1 # t1 = last_bit
srli s0 s0 1 #right shift multiplier 1 bit
beqz t1, loop
add s1 t3 s1
j loop
```
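The shift-and-add idea can be sketched in C as follows, assuming both significands are 8-bit values (implicit leading 1 plus 7 mantissa bits). The function name is hypothetical, and the loop structure is a simplified equivalent rather than a line-by-line translation of the assembly.

```c
#include <stdint.h>

/* Multiply two 8-bit significands using only shifts and additions. */
uint32_t shift_add_mul8(uint32_t multiplier, uint32_t multiplicand)
{
    uint32_t product = 0;

    for (int i = 0; i < 8; i++) {
        if (multiplier & 1u)         /* rightmost bit of the multiplier */
            product += multiplicand; /* product = product + multiplicand */
        multiplier >>= 1;            /* move to the next multiplier bit */
        multiplicand <<= 1;          /* weight of the next bit doubles */
    }
    return product;
}
```

For instance, the significands of 1.5 and 1.0 are `0xC0` and `0x80`, and their product is `0x6000`.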
Normalization mainly distinguishes whether the product has 15 bits or 16 bits, so a mask with the value 0x00008000 is needed to test bit 15. (Each 8-bit significand has its leading bit set, so the product is at least 2^14 and less than 2^16.) Since BF16 stores only a 7-bit mantissa, the product is first right-shifted by 7 bits to drop the unneeded fraction bits. If the product has 16 bits, the exponent is incremented by 1 and the product is then left-shifted by 24 bits; if it has 15 bits, it is left-shifted by 25 bits. This pushes the integer bit out of the register and keeps only the fraction bits, which are finally right-shifted by 9 bits into the mantissa field.
```
normalize:
la s0 norm_mask
lw s0 0(s0)
and s0 s0 t5 # test bit 15: does the product have 16 bits (carry)?
beqz s0 bits_15
addi s11 s11 1 # carry: increment the exponent
slli s11 s11 24 # drop any bits above the 8-bit exponent
srli s11 s11 1 # move the exponent to bits 30..23
srli t5 t5 7 # discard unnecessary digits
slli t5 t5 24 # after carrying one bit, only have to shift left 24 bits
srli t5 t5 9 # corresponding to the position of the mantissa
j combine
bits_15:
slli s11 s11 24
srli s11 s11 1
srli t5 t5 7 # discard unnecessary digits
slli t5 t5 25
srli t5 t5 9
```
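The two normalization cases can be sketched in C as below. This is a simplified illustration with hypothetical names: it truncates to exactly 7 mantissa bits and returns the fields unshifted, whereas the assembly moves them directly into their final bit positions.

```c
#include <stdint.h>

/* Hypothetical result type: biased exponent and 7-bit mantissa field. */
typedef struct {
    uint32_t exponent;
    uint32_t mantissa;
} norm_result;

/* product: 16-bit result of multiplying two 8-bit significands.
 * exponent: biased exponent of the product before normalization. */
norm_result normalize_product(uint32_t product, uint32_t exponent)
{
    norm_result r;

    if (product & 0x00008000u) {             /* 16 bits: form 1x.xxxx */
        r.exponent = exponent + 1;           /* carry into the exponent */
        r.mantissa = (product >> 8) & 0x7Fu; /* 7 bits after the leading 1 */
    } else {                                 /* 15 bits: form 1.xxxx */
        r.exponent = exponent;
        r.mantissa = (product >> 7) & 0x7Fu;
    }
    return r;
}
```

With the product `0x9000` (1.5 × 1.5) and exponent 127, the result is exponent 128 and mantissa `0x10`, that is 1.125 × 2 = 2.25.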
Combining the sign block, exponent block, and mantissa block results in a computed BF16.
```
combine:
or s10 s11 s10 # combine sign and exponent
or s10 s10 t5 # combine with mantissa
```
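The final packing step can be sketched in C as follows. The function name and the argument layout (sign already at bit 31, exponent and mantissa as small unshifted fields) are assumptions made for this illustration; in the assembly the three blocks are already in position before the `or` instructions.

```c
#include <stdint.h>

/* sign:     0 or 0x80000000 (already at bit 31)
 * exponent: biased 8-bit exponent
 * mantissa: 7-bit mantissa field
 * Returns the BF16 value stored in the upper 16 bits of a 32-bit word. */
uint32_t combine_bf16(uint32_t sign, uint32_t exponent, uint32_t mantissa)
{
    return sign | ((exponent & 0xFFu) << 23) | ((mantissa & 0x7Fu) << 16);
}
```

For example, sign 0, exponent 128, and mantissa `0x10` give `0x40100000`, which is 2.25 when read as FP32.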
# Optimize
Loop unrolling is a simple and common optimization technique: the loop body is replicated so that each pass does the work of several original iterations, which reduces loop-control overhead (counter updates, comparisons, and branches) and can improve performance. Since this loop has only 8 iterations, the benefit here is limited, but loops with many more iterations can gain considerably more.
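As a C analogue, the eight-round shift-and-add multiplication can be fully unrolled so that no counter or backward branch remains. This is only an illustrative sketch with a hypothetical function name, not the unrolled assembly itself.

```c
#include <stdint.h>

/* Fully unrolled shift-and-add multiply for 8-bit significands:
 * one test-and-add per multiplier bit, no loop counter. */
uint32_t mul8_unrolled(uint32_t multiplier, uint32_t multiplicand)
{
    uint32_t product = 0;

    if (multiplier & 0x01u) product += multiplicand;
    if (multiplier & 0x02u) product += multiplicand << 1;
    if (multiplier & 0x04u) product += multiplicand << 2;
    if (multiplier & 0x08u) product += multiplicand << 3;
    if (multiplier & 0x10u) product += multiplicand << 4;
    if (multiplier & 0x20u) product += multiplicand << 5;
    if (multiplier & 0x40u) product += multiplicand << 6;
    if (multiplier & 0x80u) product += multiplicand << 7;
    return product;
}
```

The trade-off is code size: the body is eight times longer in exchange for removing the per-iteration counter update and branch.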
## Before

## After
