contributed by < `jimmylu890303` >
| Type | Sign bit | Exponent bits | Significand bits |
|---|---|---|---|
| Float32 | 1 | 8 | 23 |
| BFloat16 | 1 | 8 | 7 |
Disadvantage: As the table shows, BFloat16 has only 7 significand bits, which results in lower precision than Float32.
Advantage: BFloat16 halves the storage requirement and speeds up machine-learning calculations, while its 8 exponent bits preserve the same dynamic range as Float32.
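To make the size trade-off concrete, here is a minimal C sketch of a Float32-to-BFloat16 conversion by truncation (the actual `fp32_to_bf16` routine mentioned later may round rather than truncate; this is only an assumption for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Minimal sketch: convert a Float32 to BFloat16 by truncation.
 * BFloat16 is simply the upper 16 bits of an IEEE 754 single:
 * same sign bit and 8-bit exponent, significand cut to 7 bits. */
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* reinterpret the float's bit pattern */
    return (uint16_t)(bits >> 16);   /* keep sign, exponent, top 7 mantissa bits */
}
```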
The simple algorithm for IEEE 754 32-bit multiplication involves the following steps:
1. Extract the sign, exponent, and mantissa fields of the two operands A and B.
2. Compute the result sign: `signC = signA XOR signB`.
3. Add the exponents and subtract the bias: `expC = expA + expB - 127`.
4. Restore the implicit leading 1 of each mantissa (the `1.M` form).
5. Multiply the mantissas: `Cmantissa = Amantissa * Bmantissa`.
6. Normalize the product, adjust the exponent accordingly, and round.
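As a reference, here is a hedged C sketch of these steps (the helper name `fp32_mul` is invented; zeros, subnormals, infinities, NaNs, and proper rounding are deliberately ignored):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of IEEE 754 single-precision multiplication following the
 * steps above. Special values and correct rounding are omitted. */
static float fp32_mul(float a, float b) {
    uint32_t ua, ub;
    memcpy(&ua, &a, 4);
    memcpy(&ub, &b, 4);

    /* Step 1: extract fields. */
    uint32_t signA = ua >> 31,           signB = ub >> 31;
    int32_t  expA  = (ua >> 23) & 0xFF;  int32_t expB = (ub >> 23) & 0xFF;
    uint32_t manA  = ua & 0x7FFFFF,      manB  = ub & 0x7FFFFF;

    /* Step 2: result sign. */
    uint32_t signC = signA ^ signB;

    /* Step 3: add exponents, remove the double-counted bias. */
    int32_t expC = expA + expB - 127;

    /* Step 4: restore the implicit leading 1 (24-bit significands). */
    manA |= 1u << 23;
    manB |= 1u << 23;

    /* Step 5: multiply the 24-bit mantissas -> up to a 48-bit product. */
    uint64_t prod = (uint64_t)manA * (uint64_t)manB;

    /* Step 6: normalize (product lies in [2^46, 2^48)) and truncate. */
    if (prod & (1ull << 47)) {   /* product >= 2.0: shift and bump exponent */
        prod >>= 1;
        expC += 1;
    }
    uint32_t manC = (uint32_t)((prod >> 23) & 0x7FFFFF); /* drop leading 1 */

    uint32_t uc = (signC << 31) | ((uint32_t)expC << 23) | manC;
    float c;
    memcpy(&c, &uc, 4);
    return c;
}
```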
When multiplying two Float32 numbers, step 5 (computing `Cmantissa`) requires two 32-bit registers to hold the result of `Amantissa * Bmantissa`, because the 24-bit × 24-bit product (1 implicit integer bit plus 23 mantissa bits per operand) can be up to 48 bits wide. The shift-and-add multiplication must also iterate 24 times.
With BFloat16, however, step 5 needs only one 32-bit register to hold the result of `Amantissa * Bmantissa`, since the 8-bit × 8-bit product (1 implicit integer bit plus 7 mantissa bits per operand) fits in 16 bits, and the multiplication iterates only 8 times.
In conclusion, BFloat16 is more hardware-friendly (one register instead of two) and more time-efficient (8 iterations instead of 24) when performing multiplication.
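The following C sketch stands in for the assembly listing in the repository and shows the shift-and-add idea the next paragraph analyzes (the function name is made up):

```c
#include <stdint.h>

/* Shift-and-add product of two 24-bit mantissas.
 * The result can be 48 bits wide, so a 64-bit accumulator
 * (two 32-bit registers on RV32) is required. */
static uint64_t mul_mantissa24(uint32_t a, uint32_t b) {
    uint64_t acc = 0;
    uint64_t multiplicand = a;
    for (int i = 0; i < 24; i++) {   /* 24 iterations */
        if (b & 1)
            acc += multiplicand;     /* add the shifted multiplicand */
        multiplicand <<= 1;
        b >>= 1;
    }
    return acc;                      /* up to 48 significant bits */
}
```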
From the code above, we can see that a 64-bit accumulator (two 32-bit registers) is needed to store the product of two 24-bit mantissas, and that the loop iterates 24 times to complete the multiplication.
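And the corresponding BFloat16 version, again as a sketch in place of the original assembly:

```c
#include <stdint.h>

/* Shift-and-add product of two 8-bit mantissas.
 * The result is at most 16 bits, so a single 32-bit
 * register is enough to hold it. */
static uint32_t mul_mantissa8(uint32_t a, uint32_t b) {
    uint32_t acc = 0;
    for (int i = 0; i < 8; i++) {   /* only 8 iterations */
        if (b & 1)
            acc += a;
        a <<= 1;
        b >>= 1;
    }
    return acc;                     /* up to 16 significant bits */
}
```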
From the code above, we can see that a single 32-bit register is enough to store the product of two 8-bit mantissas, and that the loop iterates only 8 times.
We can clearly see that BFloat16 multiplication is significantly faster. Although some computational precision is sacrificed, the trade-off is worthwhile for the benefits gained.
Follow my GitHub to check out the full implementation.
The code was tested using the Ripes simulator.
Here are the first few lines of the disassembled executable code:
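For the walkthrough below, the entry that matters is the first one, shown here in approximate form (reconstructed from the address, machine word, and mnemonic discussed next):

```
0x00000000:   170000ef   jal x1 368 <test>
```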
Let's take a look at how the `jal x1 368 <test>` instruction, located at address `0x0`, works during execution.

- **IF stage:** The PC is `0x0`, meaning that we are going to fetch the instruction at `0x0`, which is `jal x1 368 <test>` (machine code `0x170000ef`). The PC+4 adder computes `0x4` as the fall-through address for the next cycle.
- **ID stage:** `0x170000ef` is decoded by the Decoder. The register indices `0x00` (`zero`) and `0x10` (`a6`) are sent to the Register circuit to read their values. Since the instruction is a `jal`, the offset of the label `test` is decoded, and its value is `0x170`.
- **EX stage:** The ALU takes `0x0` (the PC) and `0x170`, and sums them to obtain the new address `0x170` (label: `test`). The value of `a6` and the destination register `ra` are sent to the next stage.
- **MEM stage:** `jal` needs no memory access, so `0x4` (the return address) and `0x1` (the `ra` index) are simply sent to the next stage.
- **WB stage:** The processor writes `0x4` (the return address) into register `0x1` (the `ra` register). By this point, the fetch stage has already moved on to the instruction at the address `test`+4.
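To double-check these numbers, here is a small C sketch (written for this walkthrough) that decodes the `jal` fields out of the machine word `0x170000ef`:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t insn = 0x170000EF;                 /* jal x1 368 <test> */
    uint32_t rd = (insn >> 7) & 0x1F;           /* bits 11:7 -> destination register */
    /* J-immediate layout: imm[20|10:1|11|19:12]; bit 0 is always 0.
     * Sign extension is omitted since imm[20] is 0 for this word. */
    uint32_t imm = (((insn >> 31) & 0x1)   << 20)
                 | (((insn >> 12) & 0xFF)  << 12)
                 | (((insn >> 20) & 0x1)   << 11)
                 | (((insn >> 21) & 0x3FF) << 1);
    printf("rd = x%u, offset = %u\n", rd, imm); /* prints: rd = x1, offset = 368 */
    return 0;
}
```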
Here I compare the performance of the optimized code against the previous version, which converts the two Float32 inputs to BFloat16 (by invoking the `fp32_to_bf16` function twice) and then performs the BFloat16 multiplication.
The complete optimized code is in `main_optimized.s`.
We can observe that there is no longer any need to flush the two instructions behind `jal ra, checkZero`, resulting in a reduction in the total cycle count.
We can also observe that the number of flushed instructions is reduced when loop unrolling is employed.
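To illustrate why unrolling reduces flushes on a pipeline that resolves branches late (each taken branch or jump on the Ripes 5-stage processor costs two flushed instructions, as seen above), here is a hedged C sketch; the function names and loop body are invented for the example:

```c
/* Rolled loop: one backward branch per element. On a 5-stage
 * pipeline that resolves branches in EX, every taken branch
 * flushes the two instructions fetched after it. */
int sum_rolled(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: one backward branch per four elements, so
 * roughly a quarter of the branch-related flushes. Assumes n
 * is a multiple of 4 to keep the sketch short. */
int sum_unrolled(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}
```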