Assignment1: RISC-V Assembly and Instruction Pipeline
Brief topic: Bfloat16 Logarithm
Descriptive topic: Approximation of logarithm of bfloat16 numbers using the RV32I instruction set
contributed by < `coding-ray` > (黃柏叡)
tags: `Bfloat16`, `RISC-V`, `RV32I`, `Natural Logarithm`, `Remez Algorithm`
1. Background
1.1. Bfloat16 floating-point format
The bfloat16 (brain floating point, bf16) floating-point format occupies only 16 bits in computer memory.
Bfloat16 has the following format: 1 sign bit, 8 exponent bits, and 7 mantissa (significand) bits.
Compared with the standard IEEE 754 single-precision 32-bit float (fp32), bf16 has almost the same dynamic range. However, the precision of bf16 is about 65,000x worse than that of fp32. The details are shown in the following table. (John, 2021)
The values above are calculated by the following steps.
IEEE 754 floating-point format
IEEE Standard for Floating-Point Arithmetic (IEEE 754) defines the format of floating-point numbers.
For a given floating-point format with 1 sign bit \(s\), \(n_e\) exponent bits, and \(n_m\) mantissa (significand) bits, a number in such a format is generally as follows.
\[ (-1)^s \times 1.\underbrace{xxx \cdots xxx}_{n_m \text{ bits}} \,_2 \times 2^{(exp - 2^{(n_e - 1)} + 1)} \]
, whose binary representation (encoding) in computer memory is as follows.
\[ \underbrace{s}_\text{1 bit} \ \ \underbrace{eee \cdots eee}_{exp_2 ,\ n_e \text{ bits}} \ \ \underbrace{xxx \cdots xxx}_{n_m \text{ bits}} \]
, where \(exp_2\) is the right-aligned binary encoding of \(exp\), and \(exp \in [1, (2^{n_e} - 2)]\) for normal numbers.
For \(exp = 0\) with nonzero \(mantissa\), the number is subnormal, generally as follows.
\[ (-1)^s \times 0.\underbrace{xxx \cdots xxx}_{n_m \text{ bits}} \,_2 \times 2^{(2 - 2^{(n_e - 1)})} \]
(For \(n_e = 8\), as in both fp32 and bf16, this exponent is \(-126\).)
For \(exp = mantissa = 0\), the number is either positive or negative zero, depending on the sign bit.
\[ (-1)^s \times 0 \]
For \(exp = 2^{n_e} - 1\), there are two cases. If \(mantissa = 0\), the number is either positive or negative infinity:
\[ (-1)^s \times \infty \]
Otherwise (\(mantissa \neq 0\)), the number is NaN (not a number).
(IEEE Computer Society, 2019)
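To make the encoding concrete, here is a short C sketch (my addition; the helper name `bf16_decode` is hypothetical) that splits a raw bf16 value into its three fields:

```c
#include <stdint.h>
#include <stdio.h>

/* Split a raw bf16 encoding (1 sign, 8 exponent, 7 mantissa bits)
   into its three fields. */
static void bf16_decode(uint16_t bits, int *s, int *exp, int *mantissa) {
    *s        = (bits >> 15) & 0x1;  /* sign bit */
    *exp      = (bits >> 7) & 0xFF;  /* biased exponent; bias = 127 */
    *mantissa = bits & 0x7F;         /* 7 explicit mantissa bits */
}

int main(void) {
    int s, exp, m;
    bf16_decode(0x3F80, &s, &exp, &m); /* 0x3F80 encodes 1.0 */
    printf("s=%d exp=%d mantissa=%d\n", s, exp, m); /* s=0 exp=127 mantissa=0 */
    return 0;
}
```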
Dynamic ranges of fp32 and bf16
The largest normal value in fp32 format is as follows.
\[ \begin{align} L_\text{fp32} &= 0 \ \underbrace{11111110}_{8 \text{ bits}} \ \underbrace{111 \cdots 111}_{23 \text{ bits}} \\\ &= (2 - 2^{-23}) \times 2^{((2^8 - 2) - 2^7 + 1)} \\\ &= (2 - 2^{-23}) \times 2^{127} \\\ &\approx 3.403 \times 10^{38} \end{align} \]
The smallest positive subnormal number in fp32 format is as follows.
\[ \begin{align} S_\text{fp32} &= 0 \ \underbrace{00000000}_{8 \text{ bits}} \ \underbrace{000 \cdots 001}_{23 \text{ bits}} \\\ &= 2^{-23} \times 2^{-126} \\\ &= 2^{-149} \\\ &\approx 1.401 \times 10^{-45} \end{align} \]
So, the dynamic range of fp32 is
\[ \begin{align} \text{DR}_\text{fp32} & = \log_{10}\frac{L_\text{fp32}}{S_\text{fp32}} \\\ & \approx \log_{10} \frac {3.403 \times 10^{38}} {1.401 \times 10^{-45}} \\\ & \approx 83.39 \text{ dB} \end{align} \]
Similarly, for bf16, we have the following values.
\[ \begin{align} L_\text{bf16} &= 0 \ \underbrace{11111110}_{8 \text{ bits}} \ \underbrace{1111111}_{7 \text{ bits}} \\\ &= (2 - 2^{-7}) \times 2^{((2^8 - 2) - 2^7 + 1)} \\\ &= (2 - 2^{-7}) \times 2^{127} \\\ &\approx 3.390 \times 10^{38} \\\ \ \ \\\ S_\text{bf16} &= 0 \ \underbrace{00000000}_{8 \text{ bits}} \ \underbrace{0000001}_{7 \text{ bits}} \\\ &= 2^{-7} \times 2^{-126} \\\ &= 2^{-133} \\\ &\approx 9.184 \times 10^{-41} \\\ \ \ \\ \text{DR}_\text{bf16} & = \log_{10}\frac{L_\text{bf16}}{S_\text{bf16}} \\\ & \approx \log_{10} \frac {3.390 \times 10^{38}} {9.184 \times 10^{-41}} \\\ & \approx 78.57 \text{ dB} \end{align} \]
Precision of fp32 and bf16
(Both sides of each inequality below are numbers in the same format, either fp32 or bf16.)
\[ \begin{align} &\text{In fp32, } \underset{\epsilon} {\arg\min} \{ (1 + \epsilon) > 1 \} = 2^{-23} = 0.00000011921 \\\ &\text{In bf16, } \underset{\epsilon} {\arg\min} \{ (1 + \epsilon) > 1 \} = 2^{-7}\,\, = 0.0078125 \end{align} \]
, where \(2^{-23}\) in fp32 and \(2^{-7}\) in bf16 are the values of the least significant mantissa bit when \(exp = (2^{(n_e - 1)} - 1)\), i.e., when the unbiased exponent is 0. (John, 2021)
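The fp32 half of this claim can be verified with a few lines of C (my addition); storing each sum in a `float` forces the rounding:

```c
#include <stdio.h>

int main(void) {
    float eps23 = 1.0f / (1 << 23); /* 2^-23 */
    float eps24 = eps23 / 2.0f;     /* 2^-24 */
    float a = 1.0f + eps23;         /* representable: 1 + ulp(1) */
    float b = 1.0f + eps24;         /* rounds back to 1.0f (ties to even) */
    printf("%d %d\n", a > 1.0f, b > 1.0f); /* prints: 1 0 */
    return 0;
}
```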
1.2. Approximation of natural logarithm by Taylor series
For \(x\) in the range \((0, 2]\), the natural logarithm can be approximated by the following Taylor series. (Milton et al., 1965; Wikipedia contributors, 2023)
\[ \begin{align} ln(x) &= \sum_{k=1}^\infty { \frac {(-1)^{k-1} (x-1)^k }{k} } \\ \ &= (x-1) - \frac{(x-1)^2}{2} + \frac{(x-1)^3}{3} - \frac{(x-1)^4}{4} + \frac{(x-1)^5}{5} - \cdots \end{align} \]
\[ \forall x \in (0, 2] \]
Let \(la(x)\) denote the approximation of \(ln(x)\) by the first five terms of the Taylor series. That is,
\[ ln(x) \approx la(x) = \sum_{k=1}^5 { \frac {(-1)^{k-1} (x-1)^k }{k} } \]
To reduce the number of multiplications (at the cost of evaluating the terms in parallel), expand \(la(x)\) in the nested (Horner) form
\[ la(x) = -2.28333 + ( 5 - ( 5 - ( 3.33333 - ( 1.25 - 0.2 x ) x ) x ) x ) x \]
Applying this formula for \(x \in [0.1, 2]\), I got
\[ \begin{align} &\text{max(error)} \approx 0.47, \text{ for } x \in [0.1, 2] \\\ &\text{max(error)} \approx 0.09, \text{ for } x \in [1, 2] \end{align} \]
To achieve \(\text{max(error)} < 0.01\), \(x\) must be in the range \([0.44, 1.68]\).
As a result, to apply this algorithm to a float with significand \(s \in [1, 2)\) (by definition) and exponent \(p\), \(ln(x)\) can be approximated as follows.
\[ \begin{align} &\text{Given } x = 2^p \times s, \text{ where } s \in [1, 2) \\\ &ln(x) = p \times ln(2) + ln (s), \text{ where } \\\ &ln(s)\approx \begin{cases} la(s) &,\text{ if } s \le 1.68 \\ la(s/2) + ln(2) &, \text{ otherwise.} \end{cases} \end{align} \]
Note: \(ln(2) \approx 0.693\) can be stored statically in memory (data or text section).
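The whole scheme of this section fits in a few lines of C; the sketch below (my illustration; `la` and `ln_approx` are hypothetical names) uses double precision:

```c
#include <stdio.h>

#define LN2 0.693 /* the statically stored ln(2), as noted above */

/* Five-term Taylor approximation la(x), expanded in the nested form above. */
static double la(double x) {
    return -2.28333 + (5 - (5 - (3.33333 - (1.25 - 0.2 * x) * x) * x) * x) * x;
}

/* ln(x) ~= p * ln(2) + ln(s), where x = 2^p * s and s in [1, 2). */
static double ln_approx(double s, int p) {
    double ln_s = (s <= 1.68) ? la(s) : la(s / 2) + LN2;
    return p * LN2 + ln_s;
}

int main(void) {
    /* 10 = 2^3 * 1.25, so p = 3, s = 1.25; ln(10) is about 2.3026 */
    printf("%f\n", ln_approx(1.25, 3));
    return 0;
}
```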
1.3. Approximation of natural logarithm by Remez algorithm (method)
I will calculate the approximation of \(ln(x)\) in the bfloat16 (bf16) format, which has \(\epsilon \approx 0.0078\) for precision. This error accumulates as the number of arithmetic operations increases. That is, the maximal error from the arithmetic operations is summed up as
\[ ME(n) = 0.0078 n \]
, where \(n\) is the number of arithmetic operations.
Hence, to further decrease the maximum error, I need a polynomial of lower order that, moreover, applies to the whole input range \(x \in [1, 2]\) with a single function.
Crouching (2017) utilized the computer algebra program Maple (1982) and the Remez algorithm (Remez, 1934; Wikipedia contributors, 2023) to generate the following 4th-order polynomial approximation of \(ln(x)\) for \(x \in [1, 2]\).
\[ ln(x) \approx -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 x) x) x) x \]
It achieved
\[ \text{max(error)} = 6.101 \times 10^{-5}, \text{ for } x \in [1, 2] \]
Evil et al. (2021) utilized the Remez algorithm in Boost C++ Libraries (John, 2010) to get the following 3rd-order approximation for \(x \in [1, 2]\).
\[ ln(x) \approx -1.49278 + (2.11263 + (-0.729104 + 0.10969 x) x) x \]
And it achieved
\[ \text{max(error)} = 4.5 \times 10^{-4}, \text{ for } x \in [1, 2] \]
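As a quick sanity check (my addition, not part of the original assignment), the polynomial can be transcribed into C and scanned over \([1, 2]\) against the library `log`:

```c
#include <math.h>
#include <stdio.h>

/* 3rd-order Remez approximation of ln(x) on [1, 2] (Evil et al., 2021). */
static double ln_remez3(double x) {
    return -1.49278 + (2.11263 + (-0.729104 + 0.10969 * x) * x) * x;
}

int main(void) {
    double max_err = 0;
    for (double x = 1.0; x <= 2.0; x += 1e-4) {
        double err = fabs(ln_remez3(x) - log(x));
        if (err > max_err) max_err = err;
    }
    printf("max error ~= %g\n", max_err); /* about 4.5e-4 */
    return 0;
}
```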
The following table summarizes (1) the methods to approximate the natural logarithm \(ln(x) \ \forall x \in [1, 2]\), (2) their minimal numbers of arithmetic operations (\(+\ -\ \times\)), (3) the maximal errors introduced by the methods themselves, and (4) the maximal errors from the arithmetic operations.
Referring to the table above, I conclude that the 3rd-order polynomial approximation from the Remez method is the best algorithm in my case, since it gives the minimal overall ME: 3 multiplications and 3 additions give \(ME(6) \approx 0.047\).
2. Implementation
The source code is hosted on my GitHub repository. Feel free to fork and modify it.
2.1. Algorithm
I adopted the 3rd-order solution from Evil et al. (2021). However, the precision of this algorithm is limited by the precision of bf16, so the following numbers are not as precise as the numbers in the original solution.
\[ \begin{align} &\text{Given } x = 2^p \times s, \text{ where } s \in [1, 2), \\\ &\ \ \ \ \ \ ln(x) \approx lnc0 + (lnc1 + (lnc2 + lnc3 s) s) s + ln2 \times p \end{align} \]
, where
2.2. 32-bit integer multiplication
The RV32I instruction set architecture (ISA) does not provide a multiplication (\(mul\)) instruction. Thus, in this section, I implement multiplication of two unsigned 32-bit integers in RV32I.
Assembly code - sum v0.0
Given multiplier \(a_0\) and multiplicand \(a_1\), their product can be calculated by summing \(a_0\) repeatedly, \(a_1\) times. That is,
\[ a_0 \times a_1 = \sum_{i = 1}^{a_1} a_0 \]
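In C, the summing method looks as follows (my sketch of the idea; `mul_sum_u32` is a hypothetical name, and the operand swap mirrors lines 7-11 of the assembly discussed below):

```c
#include <stdint.h>

/* Multiply by repeated addition: add a0 to the product a1 times.
   Swapping so that the smaller operand drives the loop keeps the
   iteration count low. */
uint32_t mul_sum_u32(uint32_t a0, uint32_t a1) {
    if (a0 < a1) { /* make a1 <= a0 */
        uint32_t t = a0;
        a0 = a1;
        a1 = t;
    }
    uint32_t product = 0;
    while (a1--)
        product += a0;
    return product;
}
```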
In lines 7-11 of the assembly, I ensure \(a_1 < a_0\) to decrease the number of executed instructions; the overhead of this check is less than one loop iteration.
One iteration of `muu_loop` takes 5 CPU cycles when it branches from line 19 back to line 17.
Assembly code - shift v0.0
To improve the speed for large numbers, I made another implementation of multiplication using shift operations.
It performs binary long multiplication, as illustrated below.
\[ \begin{split} 110 \\ \times )\ \ \ \ \ \ \ \ 101 \\ \hline 110 \\ 000 \ \ \\ \,+ ) \ \ \ \ 110 \ \ \ \ \\ \hline 11110 \end{split} \]
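A C sketch of this shift-and-add method (my illustration, reusing the name `mul_shift_u32` used later in this note): for every set bit \(i\) of the multiplier, the partial product \(a_0 \ll i\) is accumulated, exactly as in the long multiplication above.

```c
#include <stdint.h>

/* Shift-and-add multiplication: one loop iteration per bit of a1,
   so about lg(a1) iterations instead of a1. */
uint32_t mul_shift_u32(uint32_t a0, uint32_t a1) {
    uint32_t product = 0;
    while (a1) {
        if (a1 & 1)        /* current multiplier bit is set */
            product += a0; /* accumulate the shifted multiplicand */
        a0 <<= 1;          /* next partial product */
        a1 >>= 1;          /* next multiplier bit */
    }
    return product;
}
```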
Performance comparison
Given input number \(n\), the average time complexity of the summing method is \(O(n)\), while that of the shifting method is \(O(\lg n)\). By comparing the average time complexities, the shifting method is faster than the summing method when \(n\) is large enough. As for how large \(n\) must be, I discuss it in the following paragraphs.
When \(a_0 = a_1 = 5_{10} = 101_2\), summing is better than shifting in comparison of CPU cycles.
When \(a_0 = a_1 = 6_{10} = 110_2\), the summing method takes 5 additional cycles, while the shifting method takes the same number of cycles as in the previous case.
By testing all values in the range \([1, 18]\), the resulting numbers of CPU cycles for both the summing and shifting methods are shown in the following figure.

Referring to the figure above, I have the following two observations, given that \(a_0 = a_1\):
As a result, in the following code that requires the unsigned integer multiplication library, I will utilize `mul_shift_u32` by default.
2.3. Bf16 addition and subtraction
There are no floating-point operations in the RV32I ISA, nor any bf16 operations. Hence, I have to implement my own bf16 arithmetic operations (\(+\ -\ \times\)) in sections 2.3 and 2.4.
C code
v0.0: initial version
File 1: `type_def.h`
This file will be used in the following sections.
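A plausible minimal sketch of `type_def.h` (my assumption; the actual file may define more):

```c
/* type_def.h (sketch): a bf16 value is passed around as its raw
   16-bit encoding. */
#ifndef TYPE_DEF_H
#define TYPE_DEF_H

#include <stdint.h>

typedef uint16_t bf16_t; /* 1 sign + 8 exponent + 7 mantissa bits */

#endif
```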
File 2: `add_sub_bf16.c`
File 3: `main.c`
This file contains the main and testing functions to test the functionality of the above programs.
Result:
Test for add_sub_bf16.c passed.
v0.1: add and sub wrapper
The major features in v0.1 of `add_sub_bf16.c` are the addition and subtraction wrappers `add_bf16` and `sub_bf16`. Utilizing these wrappers to call `add_sub_bf16` has the overhead of one extra function call and one extra argument. Nevertheless, developers don't need to pass the `to_add` argument each time they call `add_bf16` or `sub_bf16`.
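The wrappers might look like the sketch below (my assumption; for self-containment, the core `add_sub_bf16` is stood in here by a float-based emulation, whereas the real routine works at the bit level):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t;

/* Stand-in core routine: widen bf16 to fp32, add or subtract, then
   truncate back to bf16. */
static bf16_t add_sub_bf16(bf16_t a, bf16_t b, bool to_add) {
    uint32_t ua = (uint32_t)a << 16, ub = (uint32_t)b << 16, ur;
    float fa, fb, fr;
    memcpy(&fa, &ua, sizeof fa);
    memcpy(&fb, &ub, sizeof fb);
    fr = to_add ? fa + fb : fa - fb;
    memcpy(&ur, &fr, sizeof ur);
    return (bf16_t)(ur >> 16);
}

/* Wrappers: one extra call and one extra argument of overhead, but
   callers never touch to_add themselves. */
bf16_t add_bf16(bf16_t a, bf16_t b) { return add_sub_bf16(a, b, true); }
bf16_t sub_bf16(bf16_t a, bf16_t b) { return add_sub_bf16(a, b, false); }
```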
Assembly code
2.4. BF16 multiplication
C code
v0.0: initial version
File 1: `mul_bf16.c`
File 2: `main.c`
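As a rough illustration of the core idea (my sketch, not the author's `mul_bf16`): XOR the signs, add the exponents, multiply the 8-bit significands, and renormalize. Subnormals, overflow, infinity, and NaN are ignored here.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint16_t bf16_t;

bf16_t mul_bf16_sketch(bf16_t a, bf16_t b) {
    uint16_t sign = (a ^ b) & 0x8000;
    int ea = (a >> 7) & 0xFF, eb = (b >> 7) & 0xFF;
    if (ea == 0 || eb == 0) return sign;  /* zero in -> signed zero out */
    uint32_t ma = 0x80u | (a & 0x7F);     /* significand in [128, 255] */
    uint32_t mb = 0x80u | (b & 0x7F);
    uint32_t m = ma * mb;                 /* product in [2^14, 2^16) */
    int e = ea + eb - 127;                /* remove the doubled bias */
    if (m & 0x8000u) { m >>= 8; e += 1; } /* significand in [2, 4) */
    else             { m >>= 7; }         /* significand in [1, 2) */
    return (bf16_t)(sign | (e << 7) | (m & 0x7F)); /* truncating rounding */
}

int main(void) {
    /* 1.5 * 1.5 = 2.25: 0x3FC0 * 0x3FC0 gives 0x4010 */
    printf("0x%04X\n", mul_bf16_sketch(0x3FC0, 0x3FC0));
    return 0;
}
```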
v0.1: handle input and result of zero
In the `mul_bf16` function in `mul_bf16.c`, the major features are as follows.
Assembly code
2.5. BF16 natural logarithm
C code
Evil et al. (2021) provided a C implementation of the above-mentioned 3rd-order polynomial approximation of \(ln(x)\); however, his implementation can only deal with fp32 numbers.
As a result, I modified his code in order to apply it on bf16 numbers. (In the following code, I also added some of his comments and my refinement.)
File 1: `ln_bf16.c`
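For illustration (my sketch, not the contents of `ln_bf16.c`), the overall flow can be emulated in C by rounding every intermediate fp32 value to bf16 precision via truncation:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Emulate bf16 by keeping only the top 16 bits of the fp32 encoding. */
static float to_bf16(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    u &= 0xFFFF0000u;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* ln(x) ~= p * ln(2) + poly(s), where x = 2^p * s and s in [1, 2);
   every intermediate result is rounded to bf16 precision. */
static float ln_bf16_sketch(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    int p = (int)((u >> 23) & 0xFF) - 127;  /* unbiased exponent */
    u = (u & 0x807FFFFFu) | (127u << 23);   /* force exponent to 0 */
    float s;
    memcpy(&s, &u, sizeof s);
    float r = to_bf16(0.10969f);            /* Horner, 3rd-order Remez */
    r = to_bf16(to_bf16(r * s) + to_bf16(-0.729104f));
    r = to_bf16(to_bf16(r * s) + to_bf16(2.11263f));
    r = to_bf16(to_bf16(r * s) + to_bf16(-1.49278f));
    return to_bf16(r + to_bf16((float)p * 0.693147f));
}

int main(void) {
    printf("%f\n", ln_bf16_sketch(10.0f)); /* ln(10) is about 2.302585 */
    return 0;
}
```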
File 2: `main.c`
Result:
Assembly code
2.6. Complete assembly code
3. Analysis
3.1. Result correctness
Utilizing the testing suite in `ln_bf16.c`, I have the results as shown in the following table.
Average error: 0.007
Maximal error: 0.031
As for the assembly, all the designed tests passed using the 5-stage processor in the Ripes simulator.
3.2. Pseudo-instructions
With the help of Ripes, I see that pseudo-instructions (p.110) are transformed into the equivalent instructions that RISC-V processors can understand.
Moreover, registers referred to by their ABI names are also transformed into the register names that processors can understand.
In the figure above, `li t6, 0x80000000` is converted to `lui x31 0x80000`, and `beqz t1, asb_not_invert_mb_1` is converted to `beq x6 x0 8 <asb_not_invert_mb_1>`.
3.3. 5-stage pipelined processor
To test assembly code in Ripes, I first test it with the "single cycle processor", since registers are updated as soon as each instruction is executed. It is useful for proof of concept.
Later, I chose the "5-stage processor" to test whether the code works well under potential hazards. However, either Ripes handles the hazards for me, or it cannot properly simulate them. What a pity.
The block diagram of the 5-stage processor in Ripes is shown in the following figure.
(not finished…)
4. Learned concepts
`ret` is equivalent to `jr ra` and `jalr x0, ra, 0`. It is used to return from (i.e., end) a function (subroutine). (Reference)
5. Future works
6. References