# Approximating sine function in BF16 data type
> [source code](https://github.com/Max042004/bf16_approximation.git)
## Introduction
In current frameworks such as PyTorch, computing the sine function on BF16 tensors casts the values to float before calling `std::sin`. My goal is to address the intolerable precision loss that occurs when $\sin$ is computed directly in BF16.
The target of this article is to develop a BF16 version of sine that matches `std::sin` in precision while being fast and using less memory.
## First try: approximating through Taylor expansion
> [commit 180ee7e](https://github.com/Max042004/bf16_approximation/commit/180ee7ec333fc132804accc67aa65fac1c4d7140)
```c
/* Using Taylor expansion:
 * sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
 * Each new term = previous term * (-x^2) / (4n^2 + 2n).
 *
 * Possible optimizations:
 * loop unrolling, and skipping the remaining terms once a Taylor term
 * becomes too small to matter.
 */
static bf16_t bf16_sin(bf16_t a)
{
    /* TODO: restrict a to [-2*pi, +2*pi]; the range-reduction check below was
     * never completed in this first attempt (it had no loop body):
     *   while (((a >> 7) & 0xFF) > 0x4080) { ... }
     */
    bf16_t result = a;
    bf16_t bf16_one = (bf16_t) {.bits = 0x3F80};  /* 1.0 */
    bf16_t bf16_two = (bf16_t) {.bits = 0x4000};  /* 2.0 */
    bf16_t bf16_four = (bf16_t) {.bits = 0x4080}; /* 4.0 */
    bf16_t it_n = bf16_one;                       /* term index n          */
    bf16_t it_sign = (bf16_t) {.bits = 0x8000};   /* sign-flip mask        */
    bf16_t it_mole = bf16_mul(a, a);              /* numerator factor x^2  */
    bf16_t it_deno;                               /* denominator 4n^2 + 2n */
    bf16_t it = a;                                /* current Taylor term   */
    unsigned i;
    for (i = 1; i < 100; i++) {
        it_deno = bf16_add(bf16_mul(bf16_four, bf16_mul(it_n, it_n)),
                           bf16_mul(bf16_two, it_n));
        if (bf16_isnan(it_deno) || bf16_isinf(it_deno) || bf16_iszero(it_deno))
            break;
        it = bf16_mul(it, it_mole);
        if (bf16_isnan(it) || bf16_isinf(it) || bf16_iszero(it))
            break;
        it = bf16_div(it, it_deno);
        if (bf16_isnan(it) || bf16_isinf(it) || bf16_iszero(it))
            break;
        it.bits ^= it_sign.bits;      /* alternate the sign of the term */
        it_n = bf16_add(it_n, bf16_one);
        /* accumulate this term into the result */
        result = bf16_add(result, it);
    }
    printf("\niteration times: %d\n", i);
    return result;
}
```
When the input argument becomes large, the partial sums of the Taylor series oscillate, causing significant numerical errors.
If the input is reduced modulo π/2 (using long division), the finite-precision representation of π introduces cumulative error. When the input value is near an integer multiple of π/2, the result may lose most of its significant digits because of this error.
Therefore, careful range reduction is necessary to constrain the input angle to [−π/2, +π/2].
### precision concern
According to the definition of BF16, the bit allocation is as follows: 1 bit for the sign, 8 bits for the exponent, and 7 bits for the mantissa. The numerical value represented by this format is given by:
$$
(-1)^{\text{sign}} * 2^\text{(biased exponent-bias)} * (1+\text{mantissa})
$$
where the biased exponent is defined in IEEE 754-2019.
> "biased exponent: The sum of the exponent and a constant (bias) chosen to make the biased exponent’s range non-negative." - IEEE 754-2019
**biased exponent**
In IEEE 754-2019, Section 3.4 (Binary interchange format encodings), the exponent field of a floating-point type consists of the **bias** and the **actual exponent** as follows:
$$
\text{biased exponent} = \text{actual exponent} + \text{bias}
$$
By the definition of the biased exponent, the bias of BF16 is 127. For the number $1$, which is positive and has an actual exponent of $0$, the biased exponent is calculated as $0+127=127$. Therefore, the exponent field in binary is `01111111`. Similarly, for the number $\frac{1}{2}$, the actual exponent is $–1$, so the biased exponent becomes $-1+127=126$, which corresponds to `01111110` in binary.
**mantissa**
The minimal mantissa is $0$, corresponding to `0000000` in binary, and the maximal mantissa is $1 - 2^{-7} = \frac{127}{128}$, corresponding to `1111111` in binary.
**precision**
Since the range of $\sin{x}$ and $\cos{x}$ is $[-1, 1]$ for arbitrary $x\in \mathbf{R}$, for any value $y\in[-1,1]$ the actual exponent is always at most $0$, that is
$$
\text{biased exponent} - 127 = \text{actual exponent} \leq 0,
$$
equivalently,
$$
\text{biased exponent} \leq 127.
$$
**Case 1: the biased exponent is fixed at 126**
The numerical value is represented by
$$
(-1)^0 * 2^{-1}*(1+\text{mantissa}).
$$
Given that the mantissa ranges from $0$ to $\frac{127}{128}$, the minimal numerical value is
$$
2^{-1}*(1+0)=0.5,
$$
and the maximal numerical value is
$$
2^{-1}*(1+\frac{127}{128})=0.99609375.
$$
Furthermore, the interval between two consecutive numerical values is
$$
2^{-7}*2^{-1}=2^{-8}=0.00390625.
$$
**Case 2: the biased exponent is fixed at 125**
The numerical value is represented by
$$
(-1)^0 * 2^{-2}*(1+\text{mantissa}).
$$
Similar to Case 1, the minimal numerical value is
$$
2^{-2}*(1+0)=0.25,
$$
the maximal numerical value is
$$
2^{-2}*(1+\frac{127}{128})=0.498046875,
$$
and the interval is $2^{-7}*2^{-2}=2^{-9}=0.001953125$.
A visualizer tool for the BF16 is introduced in [PR#1](https://github.com/Max042004/bf16_approximation/pull/1).
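To make the bit layout concrete, below is a small, hypothetical decoder (an illustration only, not part of the repository) that splits a raw BF16 pattern into its three fields and reconstructs the value; it reproduces the Case 1 numbers above.
```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decode a raw BF16 bit pattern into
 * sign / biased exponent / mantissa and rebuild the value as
 * (-1)^sign * 2^(biased exponent - 127) * (1 + mantissa/128).
 * Subnormals and special values (exponent 0 or 255) are ignored for brevity. */
static void bf16_decode(uint16_t bits)
{
    unsigned sign = bits >> 15;
    unsigned biased_exp = (bits >> 7) & 0xFF;
    unsigned mantissa = bits & 0x7F;
    double value = (sign ? -1.0 : 1.0) *
                   ldexp(1.0 + mantissa / 128.0, (int)biased_exp - 127);
    printf("bits=0x%04X sign=%u exp=%3u mantissa=%3u value=%.10g\n",
           bits, sign, biased_exp, mantissa, value);
}

int main(void)
{
    bf16_decode(0x3F80); /* 1.0: biased exponent 127, mantissa 0        */
    bf16_decode(0x3F00); /* 0.5: biased exponent 126, mantissa 0        */
    bf16_decode(0x3F7F); /* largest value with exponent 126: 0.99609375 */
    return 0;
}
```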

<!-- See this reference: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html or this more concise one: https://matthew-brett.github.io/teaching/floating_error.html -->
So we do not measure rounding error with absolute error; we measure it with relative error.
Absolute error: $V_{true}-V_{approximate}$
Relative error: $\frac{V_{true}-V_{approximate}}{V_{true}}$
For BF16, the relative error should be less than $2^{-8}$ so that the maximum error is $0.5$ ULP.
Pushing the error below $0.5$ ULP would require more sophisticated methods, so we do not discuss that further.
Based on the behavior of $\sin(x)$ near zero, if $x \to 0$ then $\sin(x)\approx x$, so we can choose $0.140625$ in BF16 as a threshold below which we directly return $x$: approximating $\sin(0.140625)$ by $0.140625$ gives a relative error less than $2^{-8}$.
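A quick sanity check on that threshold: for small $x$, $\sin(x) \approx x - \frac{x^3}{6}$, so the relative error of returning $x$ is roughly $\frac{x^2}{6}$. At the chosen threshold:
$$
\frac{x-\sin(x)}{\sin(x)} \approx \frac{x^2}{6} = \frac{0.140625^2}{6} \approx 0.0033 < 2^{-8} = 0.00390625
$$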
## Introducing the range reduction algorithm
Approximating $\sin(x)$ with a Taylor expansion:
$$
\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!}- \frac{x^7}{7!} + \cdots
$$
or equivalently
$$
\sin(x) = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!} x^{2n+1}
$$
As $x$ grows larger, the partial sums of the series oscillate more strongly, so $n$ must be large enough (i.e., many more terms must be computed) for the series to converge.
**To reduce the oscillation, range reduction method is necessary.**
In the Taylor expansion, there is the Lagrange remainder term:
$$
R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-a)^{\,n+1}, \quad \xi \in (a, x)
$$
Since $f^{(n+1)}(\xi)$ is either $\pm\sin(\xi)$ or $\pm\cos(\xi)$, and $\vert\sin(\xi)\vert \leq 1$, $\vert\cos(\xi)\vert \leq 1$, with $a = 0$
the remainder can be bounded by $\vert R_n(x)\vert \leq \frac{\vert x\vert^{\,n+1}}{(n+1)!}$.
For $x = 2$ and $n = 5$, the fifth-order Taylor expansion of $\sin x$ has an error not exceeding $\frac{2^6}{6!} \approx 0.088$.
Program output confirms this:
```
iteration times: 5
x = 2.000000, sin(x) = 0.914062
```
The value of $\sin(2)$ computed by a high-precision method is $0.909297$; the difference is about $0.0048 < 0.088$, within the theoretical bound.
However, for fixed order $n$, the remainder bound grows quickly as $x$ increases.
At $x = 8$, the theoretical upper bound becomes $364.088$
, and the code indeed shows a large truncation error:
```
iteration times: 5
x = 8.000000, sin(x) = 150.000000
```
The high-precision value is $\sin(8)=0.9893$, so the difference is about $149.01$, still below the theoretical bound, yet clearly unacceptable.
Because $\sin$ is periodic with period $2\pi$, we can reduce $x$ to a smaller equivalent argument to decrease the oscillation amplitude and error in the Taylor approximation.

Suppose we wish to constrain $x$ to $x^{\ast}\in \left[-C,C\right]$
A straightforward way is to use long division to find $k$ so that:
$$
\begin{align}
k &= \left\lfloor \frac{x}{C} \right\rfloor \\
x^{\ast} &= x - kC
\end{align}
$$
<!-- In other words, $x^{\ast} = x \text{ mod }C$ -->
Implementation reference: [commit ad47fa2](https://github.com/Max042004/C-practice/commit/ad47fa2e0a9738d04eeb7534ef9e79e5b3802251)
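For reference, a minimal double-precision sketch of this naive reduction (an illustration of the formula above, not the code in the commit):
```c
#include <math.h>

/* Naive range reduction sketch: k = floor(x / C), x* = x - k*C in [0, C).
 * Shown here in double precision; in BF16 the final subtraction cancels
 * almost all significant bits once x is large, which is the problem
 * discussed below. */
static double naive_reduce(double x, double C, long *k_out)
{
    long k = (long)floor(x / C);
    double x_star = x - (double)k * C;
    if (k_out)
        *k_out = k;
    return x_star;
}
```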
With range reduction, loop unrolling becomes possible; even a fixed 4-term Taylor expansion can achieve roughly one-decimal accuracy for inputs like $x = 8$.
However, BF16 provides only limited precision; larger values of $x$ cause $x-kC$ to lose almost all of its significant digits, resulting in severe deviation in the computed result.
To avoid this issue, one must either extend the bit width or introduce additional precision terms to preserve numerical accuracy.
### Map to discrete degrees
Currently the BF16 input stores radian values, and our goal is to limit its range to $[-\frac{\pi}{2}, \frac{\pi}{2}]$.
Referring to [Mado's implementation](https://github.com/sysprog21/mado/blob/main/src/trig.c#L75), we can map radians to discrete degrees.
To map radians in the second to fourth quadrants onto discrete degrees, a bitwise AND performs the modulo operation.
```c
double t = x * (N / (2.0 * M_PI));          // radians -> ticks, N = 4096
uint32_t i = (uint32_t)lrint(t) & (N - 1);  // wrap into [0, N), i.e. mod 4096
double new_x = i * 2.0 * M_PI / N;          // wrapped ticks -> radians
```
Representing the whole circle with $4096$ ticks, one tick is
$\Delta =\frac{2\pi}{4096}≈0.0015339808$ rad.
The worst-case angular error from this discretization is half a tick, $\frac{\Delta}{2}$. An angular error of $\frac{\Delta}{2}$ causes at most $\sin(\frac{\Delta}{2}) \approx 0.000767$ error in the result.
### Cody and Waite’s Method
While doing range reduction, we do not want the result to lose almost all of its precision.
Cody and Waite's method therefore introduces additional precision terms to preserve numerical accuracy. The idea is to represent $C$ with two constants $C_1$ and $C_2$ that are exactly representable in the floating-point system.
Instead of evaluating $x − kC$, we evaluate
$$
(x − kC_1) − kC_2
$$
As a result, even when $x-kC_1$ cancels most of the leading digits, the $-kC_2$ term preserves the remaining precision.
For example, if $C = \pi$, we can choose $C_1=201/64=3.140625$ and $C_2 = 9.67653589793 \times 10^{−4}$.
The Cody and Waite implementation is in [commit 68eb815](https://github.com/Max042004/bf16_approximation/commit/68eb815ced32de87c3aa98820d39997dfc60a4e5).
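As a simplified illustration of the two-constant scheme (using the $C_1$, $C_2$ values above in double precision; this is not the code in the referenced commit):
```c
#include <math.h>

/* Cody-Waite style reduction sketch for C = pi, split as C1 + C2 where
 * C1 = 201/64 has only a few mantissa bits, so k*C1 introduces little or no
 * rounding error for moderate k, and C2 = pi - C1 carries the correction. */
static double cody_waite_reduce_pi(double x, long *k_out)
{
    const double C1 = 201.0 / 64.0;     /* 3.140625, exact in binary */
    const double C2 = 9.67653589793e-4; /* pi - C1                   */
    long k = (long)floor(x / 3.14159265358979323846);
    double r = (x - (double)k * C1) - (double)k * C2; /* (x - k*C1) - k*C2 */
    if (k_out)
        *k_out = k;
    return r;
}
```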
The double-precision CRLIBM library uses four possible methods, depending on the magnitude of the input number:
* Cody and Waite’s method with two constants (the fastest);
* Cody and Waite’s method with three constants (almost as fast);
* Cody and Waite’s method with three constants, using a double-double arithmetic and a 64-bit integer format for $k$;
* Payne and Hanek’s algorithm (see Section 9.4) for the largest arguments.
### Payne-Hanek reduction algorithm
> David Defour, Peter Kornerup, Jean-Michel Muller, Nathalie Revol. A new range reduction algorithm. [Research Report] Laboratoire de l’informatique du parallélisme. 2001, 2+10p. hal-02101916
For larger arguments that require range reduction, the Payne–Hanek algorithm is appropriate.
For a positive floating-point number $x$ in the IEEE 754 format, let $e$ be its unbiased exponent and let the mantissa have $n$ bits (including the implicit leading 1 bit; for example, $n=24$ for FP32). Then the integer significand $X$ and the real value $x$ satisfy:
$$
x = X*2^{e-n+1}
$$
This follows from writing the number as $1.bbb...b\times2^{e}$ and multiplying the mantissa part by $2^{n-1}$ ($2^{23}$ for FP32); $X$ is an $n$-bit integer satisfying $2^{n-1}\leq X \leq 2^{n}-1$.
Assume $e \geq -1$; if $e < -1$, the value $2^e \times 1.\text{mantissa}$ is less than $0.5$, in which case a simple Taylor expansion already converges well and no oscillation occurs.
Next, define $\alpha=2/\pi$, represented by its binary expansion $0.\alpha_{-1}\alpha_{-2}\alpha_{-3}\cdots$, where the bit $\alpha_{i}$ has weight $2^{i}$.
We then decompose $\alpha$ as $\text{Left}(e,p)\times2^{n-e+1}+(\text{Med}(e,p)+\text{Right}(e,p))\times2^{-n-e-2-p}$,
where
$$
\quad
\begin{cases}
\text{Left}(e,p) &= \alpha_{-1}\cdots\alpha_{n-e+1}, \\[6pt]
\text{Med}(e,p) &= \alpha_{n-e}\cdots\alpha_{\,-n-e-2-p}, \\[6pt]
\text{Right}(e,p) &= 0.\alpha_{\,-n-e-3-p}\alpha_{\,-n-e-4-p}\alpha_{\,-n-e-5-p}\cdots .
\end{cases}
$$
Hence $\frac{2}{\pi}x = \frac{2}{\pi}(X*2^{e-n+1})$
We obtain $$x=\text{Left}(e, p) \times X \times 2\pi + \text{Med}(e, p) \times X \times 2^{-2n-p-1}\times\frac{\pi}{2} + \text{Right}(e, p) \times X \times 2^{-2n-p-1}\times\frac{\pi}{2}$$
Because of the periodicity $2\pi$, the term $\text{Left}(e, p) \times X \times 2\pi$ has no effect on the result. The last term $\text{Right}(e, p) \times X \times 2^{-2n-p-1}\times\frac{\pi}{2}$ is smaller than $2^{-n-p-1}$; thus, choosing a sufficiently large $p$ makes this error arbitrarily small.
For instance, if we want a reduced argument with at least m
significant bits, and if the number $-\log_2(\beta)$ found using the algorithm presented
in [Section 9.3.2](https://nzdr.ru/data/media/biblio/kolxoz/M/MN/Muller%20J.M.%20Elementary%20Functions%20..%20Algorithms%20and%20Implementation(2ed.,%20Birkhauser,%202005)(ISBN%209780817644086)(O)(274s)_MN_.pdf) is less than some integer $j$, then $p = j + m − n$ is convenient.
For $C = \pi/2$, the number of bits in $\text{Med}(e,p)$ is $(n-e)-(-n-e-1-p)+1=2n+2+p$.
Substituting back into the original expression gives:
$x^{\ast} = \text{Med}(e, p) \times X \times 2^{-2n-p-1}\times\frac{\pi}{2} - \frac{\pi}{2}k=\frac{\pi}{2}(\text{Med}(e, p) \times X \times 2^{-2n-p-1} - k)$
where $k = \lfloor \frac{2}{\pi}x \rfloor$, i.e., the integer part of $\text{Med}(e, p) \times X \times 2^{-2n-p-1}$.
Let $h=\text{Med}(e, p) \times X \times 2^{-2n-p-1}$;
then $x^{\ast}=\frac{\pi}{2}\,\mathrm{frac}(h)$, which is $\frac{\pi}{2}$ times the fractional part of $h$.
Finally, we determine the quadrant.
Since $C=\frac{\pi}{2}$ divides the coordinate plane into four quadrants, the value of $k\bmod 4$ identifies the quadrant of the original $x$. Using the general angle-conversion identities between $\sin$ and $\cos$, we can compute the final result.
For example, if $k \bmod 4=2$, then $x$ (mod $2\pi$) lies between $\pi$ and $\frac{3\pi}{2}$, so $\sin(x)$ must be computed as $-\sin(x^{\ast})$.
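In general, writing $x = \frac{\pi}{2}k + x^{\ast}$ with $x^{\ast}\in[0,\frac{\pi}{2})$, the four cases are:
$$
\sin(x)=
\begin{cases}
\sin(x^{\ast}), & k \bmod 4 = 0\\
\cos(x^{\ast}), & k \bmod 4 = 1\\
-\sin(x^{\ast}), & k \bmod 4 = 2\\
-\cos(x^{\ast}), & k \bmod 4 = 3
\end{cases}
$$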
My implementation is in [commit 604b613](https://github.com/Max042004/bf16_approximation/commit/604b61321ea20e00feb9b73ca2a715bc67e14ef8).
However, testing shows it has huge numerical error, while the implementation from [StackOverflow](https://stackoverflow.com/questions/30463616/payne-hanek-algorithm-implementation-in-c/30465751#30465751) gives much more accurate results, so something in my implementation must have gone wrong.
Therefore, I directly use the StackOverflow implementation as my Payne-Hanek range reduction method.
## Learn from Mado
After the first try at sine approximation using range reduction and Taylor expansion, my instructor told me to study some techniques from the Mado project.
[Mado](https://github.com/sysprog21/mado) (Minimalistic Application Display Orchestrator) is a window system designed to run on resource-constrained hardware.
[trig.c](https://github.com/sysprog21/mado/blob/main/src/trig.c) is where its trigonometric functions are computed.
`twin_sin` is a wrapper function.
```c
twin_fixed_t twin_sin(twin_angle_t a)
{
    twin_fixed_t sin_val = 0;
    twin_sincos(a, &sin_val, NULL);
    return sin_val;
}
```
You can see that the parameter and return types are `twin_angle_t` and `twin_fixed_t`. Their definitions are:
```c
typedef int32_t twin_fixed_t; /* Fixed-point number (16.16 format) */
```
```c
/**
 * Angle representation for rotations and trigonometry
 * Range: -2048 to 2048 representing -180 to 180 degrees
 *
 * Fixed-point angle system optimized for embedded systems without FPU.
 * One full rotation (360 degrees) equals TWIN_ANGLE_360 units.
 */
typedef int16_t twin_angle_t;
```
The unit of `twin_angle_t` is degrees, not radians: a whole circle is $2\pi$ in radians but $360^{\circ}$ in degrees, and degrees are easier to represent in fixed point.
`twin_angle_t` represents $360^{\circ}$ with 4096 integer ticks, so incrementing a `twin_angle_t` by 1 adds $\frac{360}{4096}^{\circ} = 0.087890625^{\circ}$.
Since `twin_angle_t` is a 16-bit integer, the whole range it can represent is about $(-2880^{\circ}, 2880^{\circ})$, a span of $5760^{\circ}$.
The main logic located in [twin_sincos](https://github.com/sysprog21/mado/blob/main/src/trig.c#L68)
First, limit the angle to $[0^{\circ}, 360^{\circ})$. Instead of using $a = a \% 360^{\circ}$, a bitwise operation is more efficient:
```c
/* limit to [0..360) */
a = a & (TWIN_ANGLE_360 - 1);
```
then it deals with $90$ and $270$ degrees:
```c
if ((a & ~(TWIN_ANGLE_180)) == TWIN_ANGLE_90) {
    sin_val = TWIN_FIXED_ONE;
    cos_val = 0;
```
For the other angles, mirror the second- and third-quadrant values across the y axis:
```c
if (a & TWIN_ANGLE_90)
    a = TWIN_ANGLE_180 - a;
```
Mirror the fourth-quadrant values across the y axis:
```c
twin_angle_t x = a & (TWIN_ANGLE_90 - 1);
```
Then compute $\sin$ and $\cos$, using $\cos(\theta) = \sin(90^{\circ}-\theta)$:
```c
if (sin)
    sin_val = sin_poly(x);
if (cos)
    cos_val = sin_poly(TWIN_ANGLE_90 - x);
```
In `sin_poly`, the sine computation is shortened by evaluating a low-degree polynomial with Horner's method, with coefficients obtained from Hermite-style interpolation conditions.
Instead of representing the function as $\sin(x)$, we write it as $\sin(\frac{\pi}{2}u)$ with $0\leq u\leq1$, where $u = x \cdot 2^{-10}$ ($x$ is the tick count and $90^{\circ}$ corresponds to $1024$ ticks).
Let $s(u)=\sin(\frac{\pi}{2}u)$
and approximate it by $P(u) = u(a−bu^2+cu^4)$.
Since $s(1) = 1$ and $s'(1)=0$, we require $P(1) = 1$ and $P'(1)=0$.
So $a−b+c=1$ and $P'(1)=a−3b+5c=0$, giving $c=a−\frac{3}{2}$ and $b=2a−\frac{5}{2}$.
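A plain-float sketch of this constrained polynomial (assuming $a=\pi/2$, which makes $P'(0)$ match $s'(0)$; Mado's actual fixed-point constants may differ):
```c
/* P(u) = u*(a - b*u^2 + c*u^4) with the constraints P(1) = 1, P'(1) = 0,
 * i.e. b = 2a - 5/2 and c = a - 3/2; a = pi/2 is an assumption made here
 * so that the slope at u = 0 matches sin(pi/2 * u). */
static float sin_poly_sketch(int ticks) /* ticks in [0, 1024] ~ [0, 90 deg] */
{
    const float a = 1.57079632679f;  /* pi/2 (assumed)       */
    const float b = 2.0f * a - 2.5f; /* from the constraints */
    const float c = a - 1.5f;
    float u = (float)ticks * (1.0f / 1024.0f); /* u = x * 2^-10 */
    float u2 = u * u;
    return u * (a - b * u2 + c * u2 * u2);
}
```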
## Second try: approximation through sum of Chebyshev polynomials
After reading the source code of Mado and some other implementations, I realized the Taylor expansion is not the best choice for approximating the sine function.
Most approximations choose a Chebyshev expansion instead: it converges faster, its numerical error is spread uniformly over the interval, and it is close to the minimax approximation.
* Based on the Weierstrass approximation theorem:
For any continuous function $f$ defined on $[a,b]$ and any $\epsilon > 0$, there exists a polynomial $p$ such that $\vert f(x)-p(x)\vert<\epsilon$ for every $x \in[a,b]$.
So we can write $\sin(x) ≈ p(x) = a_1x+a_2x^2+...+a_nx^n$ for $x \in[0, \pi/2]$, where $\epsilon$ is the maximum absolute error over $0\leq x\leq\pi/2$.
In other words, to make $p(x)$ a good approximation, we need to make $\epsilon$ as small as possible.
To construct a good enough $p(x)$, we use interpolation with Chebyshev nodes.
Before that, we first discuss about what is Chebyshev polynomials.
### Chebyshev polynomials
* Definition:
For each natural number $n$, define $T_n:[-1,1] \to [-1,1]$ by $T_n(x) =\cos(n\cos^{-1}x)$, and write $\theta=\cos^{-1}x$, i.e., $\cos\theta=x$.

By definition, $T_n(x)$ is a cosine in $\theta$, so its minimum and maximum are $\pm1$, attained when $n\theta=k\pi$, i.e., $\theta=0, \frac{\pi}{n}, \frac{2\pi}{n}, \ldots, \pi$. So $T_n(x)$ reaches an extremum $n+1$ times on $[-1,1]$.
$$\cos(n\theta)+i\sin(n\theta)=e^{in\theta}=(e^{i\theta})^n=(\cos\theta+i\sin\theta)^n
=\cos^n\theta+\binom{n}{1}i\sin\theta\cos^{n-1}\theta-\binom{n}{2}\sin^2\theta\cos^{n-2}\theta+\cdots+\binom{n}{n}i^n\sin^n\theta$$
Taking the real part,
$$\cos(n\theta)= \cos^n\theta-\binom{n}{2}(1-\cos^2\theta)\cos^{n-2}\theta+\binom{n}{4}(1-\cos^2\theta)^2\cos^{n-4}\theta-\cdots$$
and with $x=\cos\theta$
we get the polynomial representation of $T_n(x)$: $$T_n(x)=\cos(n\theta)=x^n-\binom{n}{2}(1-x^2)x^{n-2}+\binom{n}{4}(1-x^2)^2x^{n-4}-\cdots$$
In addition, Chebyshev polynomials satisfy a recurrence.
$$T_{n+1}(x)=\cos(n\theta+\theta)=\cos(n\theta)\cos\theta-\sin(n\theta)\sin\theta$$
$$T_{n-1}(x)=\cos(n\theta-\theta)=\cos(n\theta)\cos\theta+\sin(n\theta)\sin\theta$$
$$T_{n+1}(x)+T_{n-1}(x)=2\cos(n\theta)\cos\theta=2xT_n(x)$$
so we get the recursion: $$T_{n+1}(x)=2xT_n(x)-T_{n-1}(x)$$
Listing the first few Chebyshev polynomials:
$$T_0(x)=1$$
$$T_1(x)=x$$
$$T_2(x)=2x^2-1$$
$$T_3(x)=4x^3-3x$$
We can easily observe that the leading coefficient is $2^{n-1}$
We then let $\tilde{T}_n(x)=T_n(x)/2^{n-1}$, so $\tilde{T}_n(x)$ becomes a monic polynomial, i.e., its leading coefficient is $1$.
For $\tilde{T}_n(x)$, there is a property called Minimal ∞-norm
* Minimal ∞-norm
For any given $n\geq1$, among the polynomials of degree $n$ with leading coefficient $1$ (monic polynomials),
$f(x)=\tilde{T}_n(x)=\frac{T_n(x)}{2^{n-1}}$
is the one whose maximal absolute value on the interval $[-1, 1]$ is minimal.
This maximal absolute value is $\frac{1}{2^{n-1}}$,
and $\vert f(x)\vert$ reaches this maximum exactly $n+1$ times, at $x=\cos(\frac{k\pi}{n}), 0\leq k\leq n$.
The minimal ∞-norm property is important when approximating with Chebyshev polynomials.
Assume we have $f(t)$, $t \in[-1,1]$, which we want to approximate by $p(t)$ using interpolation.
By interpolation, we write $$p(t)=c_0+c_1(t-t_1)+c_2(t-t_1)(t-t_2)+...+c_{n-1}(t-t_1)...(t-t_{n-1})$$
where $t_1, t_2,\ldots,t_{n}$ are interpolation nodes on $f(t)$ and $p(t)$ has degree $n-1$,
with the error estimate $f(t)-p(t)=\frac{f^{(n)}(\xi)}{n!}\prod_{i=1}^{n}(t-t_i)$.
To make this error smallest, by the minimal ∞-norm property we should let $\prod_{i=1}^{n}(t-t_i) = \tilde{T}_n(t)$; then $\vert f(t)-p(t) \vert \leq\frac{1}{n!\,2^{n-1}}\max_{y\in[-1,1]}\vert f^{(n)}(y)\vert$, so the $t_i$ should be the roots of $T_{n}(t)$.
$$T_n(t) = \cos(n\theta)=0$$
$$\theta = \frac{(2k-1)\pi}{2n}, \quad k= 1, 2, 3, \ldots,n$$
and $t=\cos(\theta)$, $$t_1 = \cos\left(\frac{\pi}{2n}\right),\; t_2=\cos\left(\frac{3\pi}{2n}\right),\; t_3=\cos\left(\frac{5\pi}{2n}\right),\;\ldots,\; t_{n}=\cos\left(\frac{(2n-1)\pi}{2n}\right)$$
These $t_i$ are called [Chebyshev nodes](https://en.wikipedia.org/wiki/Chebyshev_nodes).
Then we can compute the coefficients: $c_0$ from $p(t_1)=f(t_1)$, $c_1$ from $p(t_2)=f(t_2)$, ..., $c_{n-1}$ from $p(t_{n})=f(t_{n})$, and so obtain $p(t)$.
But our desired range is $f(x)$ with $x\in[a,b]$. We define the map $l:[-1, 1] \to [a, b]$ by $$x = l(t)=\frac{b-a}{2}t+\frac{b+a}{2}$$
then substituting $$t=\frac{2x-(b+a)}{b-a}$$ into $p(t)$, we get $p(x)$, which approximates $f(x)$.
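Putting these steps together, here is a minimal double-precision sketch (illustrative only, not the repository code) that places Chebyshev nodes on $[a,b]$, interpolates $\sin$ with Newton divided differences, and evaluates the result:
```c
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 6 /* number of Chebyshev nodes -> degree N-1 polynomial */

/* Sketch of the construction above: Chebyshev nodes on [a, b], Newton
 * divided differences for the coefficients, Horner-style evaluation. */
static double cheb_interp_sin(double x, double a, double b)
{
    double t[N], c[N];
    /* Chebyshev nodes cos((2k+1)pi/(2N)) mapped from [-1, 1] onto [a, b] */
    for (int k = 0; k < N; k++) {
        double u = cos((2.0 * k + 1.0) * M_PI / (2.0 * N));
        t[k] = 0.5 * (b - a) * u + 0.5 * (b + a);
        c[k] = sin(t[k]); /* start from the function values */
    }
    /* In-place Newton divided differences: c[j] becomes f[t_0,...,t_j] */
    for (int j = 1; j < N; j++)
        for (int k = N - 1; k >= j; k--)
            c[k] = (c[k] - c[k - 1]) / (t[k] - t[k - j]);
    /* Evaluate the Newton form with Horner's scheme */
    double p = c[N - 1];
    for (int k = N - 2; k >= 0; k--)
        p = p * (x - t[k]) + c[k];
    return p;
}

int main(void) /* compile with -lm */
{
    double x = 1.0;
    printf("p(1.0) = %.9f, sin(1.0) = %.9f\n",
           cheb_interp_sin(x, 0.0, M_PI / 2.0), sin(x));
    return 0;
}
```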
Going back to the error estimate $\vert f(t)-p(t) \vert \leq\frac{\vert f^{(n)}(\xi)\vert}{n!\,2^{n-1}}$, we now have $$\vert f(x)-p(x) \vert \leq\frac{2C_n}{n!}\left(\frac{b-a}{4}\right)^n, \quad C_n=\max_{y\in[a,b]}\vert f^{(n)}(y)\vert$$
Following the above process with $n=6$, $f(x)=\sin(x)$, $x\in[0,\pi/2]$, the six-node Chebyshev interpolation polynomial has a theoretical maximum numerical error of $\frac{2}{6!}(\frac{\pi}{8})^6=0.00001018724$. Written with Horner's method, $P_6(x)$ is:
$$
\begin{aligned}
P_6(x)
&=\Bigl(\Bigl(\Bigl(\Bigl(\Bigl(-9.6066424092\times10^{-4}\,x+ 1.02485642486\times10^{-2}\Bigr)x - 1.9022700092\times10^{-3}\Bigr)x \\&\qquad\ - 1.65722549388\times10^{-1}\Bigr)x- 2.06058500652\times10^{-4}\Bigr)x+ 1.00001322248\Bigr)x
\end{aligned}
$$
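In C, the same polynomial evaluated with Horner's method (the coefficients are copied directly from the expression above):
```c
/* FP32 Horner evaluation of P_6(x) for x in [0.140625, pi/2]. */
static float p6_sin(float x)
{
    float p = -9.6066424092e-4f;
    p = p * x + 1.02485642486e-2f;
    p = p * x - 1.9022700092e-3f;
    p = p * x - 1.65722549388e-1f;
    p = p * x - 2.06058500652e-4f;
    p = p * x + 1.00001322248f;
    return p * x; /* the whole expression is multiplied by x once more */
}
```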
Additionally, we can handle special cases to reduce computation: $P_6(x)$ only needs to cover $x \in [0.140625, \pi/2]$. In BF16 binary representation this range is `0011111000010000` to `0011111111001001`, only $441$ representable numbers.
For $x > \pi/2$ we use range reduction to bring the argument back into $[0, \pi/2]$. For $x < 1.1754944\times10^{-38}$ the input is subnormal, and since BF16 normally does not represent subnormals, we flush the result to zero.
For $1.1754944\times10^{-38} \leq x < 0.140625$, we can just return $x$, because the numerical difference is less than $0.5$ ULP in BF16.
Also, for negative $x$ we use $\sin(x)=-\sin(-x)$.
$\quad
\begin{cases}
x < 1.1754944\times10^{-38}, & \sin(x) = 0\\[6pt]
1.1754944\times10^{-38} \leq x < 0.140625, & \sin(x) = x\\[6pt]
0.140625 \leq x \leq 1.570312, & \sin(x) = P_6(x) \\[6pt]
1.570312 < x \leq \text{BF16\_MAX}, & \sin(x) = P_6(x^{\ast})
\end{cases}$
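A sketch of this piecewise dispatch in plain float (for readability; `payne_hanek_reduce` is a hypothetical helper returning the reduced argument $x^{\ast}\in[0,\pi/2)$ and the quadrant index $k$, and `p6_sin` is the Horner polynomial shown earlier):
```c
float p6_sin(float x);                     /* Horner polynomial shown above */
float payne_hanek_reduce(float x, int *k); /* hypothetical helper           */

static float sin_dispatch(float x)
{
    if (x < 0.0f)               /* sin is odd: sin(x) = -sin(-x)          */
        return -sin_dispatch(-x);
    if (x < 1.1754944e-38f)     /* subnormal in BF16: flush to zero       */
        return 0.0f;
    if (x < 0.140625f)          /* small angle: sin(x) ~ x within 0.5 ULP */
        return x;
    if (x <= 1.570312f)         /* polynomial range [0.140625, pi/2]      */
        return p6_sin(x);

    int k;
    float r = payne_hanek_reduce(x, &k); /* large argument                */
    float s = p6_sin(r);                 /* sin(x*); very small r is not
                                            special-cased here for brevity */
    float c = p6_sin(1.570312f - r);     /* cos(x*) = sin(pi/2 - x*)      */
    switch (k & 3) {                     /* quadrant selection            */
    case 0:  return s;
    case 1:  return c;
    case 2:  return -s;
    default: return -c;
    }
}
```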
We compare the results against the glibc library function [sinf(x)](https://man7.org/linux/man-pages/man3/sin.3.html), truncated from FP32 to BF16, and the result is excellent: a straight line at zero, meaning no numerical difference.

Importantly, this only tells us that the result is the same as the truncated glibc `sinf` result. It is not absolutely accurate compared with the true $\sin(x)$, because BF16 cannot represent all numbers, so the result definitely has numerical error relative to the true $\sin(x)$.
Meanwhile, the earlier Taylor-expansion result, even with a degree-15 expansion, still shows numerical differences for larger $x$.

We have now confirmed that the Chebyshev approximation of $\sin(x)$ is accurate. Next we need to test the inputs above $\pi/2$, which must go through range reduction. After testing for a while, I found that the Payne-Hanek method gives very accurate results for all $x$ in $(\pi/2, \infty)$, so I chose Payne-Hanek as the only range reduction strategy.
In BF16 there are $0111111101111111 - 0011111111001010 = 16309$ representable numbers between $\pi/2$ and infinity.
The result is (example test case):
```
x after payne_hanek range reduction: -0.652344
sin(r) = -0.609375, cos(r) = 0.792969, k_int = 1, mod_k = 1
x = 330977770950444052353047958009805799424.000000, bf16_sin = 0.792969 ,glibc_sin = 0.792969
```

Compared with glibc `sinf`, the maximum difference is $0.003906$ and the average numerical difference is $0.00046032507664009806$. In total, $2740$ BF16 numbers show a numerical difference.
We can observe that $0.003906 = 2^{-8}$, which is exactly 1 ULP for values in $[0.5, 1)$. So the result shows that the numerical difference is at most 1 ULP.
We know that for $x$ that does not need range reduction, $\sin(x)$ matches glibc `sinf` exactly, so the numerical difference must come from range reduction.
However, glibc `sinf` truncated to BF16 definitely has numerical error of its own, so the chart and data above only tell us how many $\sin(x)$ values differ between truncated glibc `sinf` and my implementation.
My code still has many places to improve, such as using less memory or removing some redundant computation.
> commit [d53ed08](https://github.com/Max042004/bf16_approximation/commit/d53ed08496790fddaa13760dbad34c08ad94d79a)
## Third try: computing in 16 bit
Both the Payne-Hanek method and the Chebyshev evaluation above use FP32 for computation and then convert the result to BF16.
But if we compute directly in BF16, the numerical error of our degree-6 Chebyshev interpolation polynomial rises to at most 4 ULP, which is not ideal.
**Concerning bandwidth**, if we can compute mostly in a 16-bit data type, the interconnect can transfer more values for the same number of bytes, improving efficiency.
> commit [c560c59](https://github.com/Max042004/bf16_approximation/commit/c560c59aceed9dcc3437898e97fd634c5ae645dc)
In the second try, I interpolated 6 Chebyshev nodes to approximate $f(x)=\sin(x)$. With the remainder estimate $$\vert f(x)-p(x) \vert \leq\frac{2C_n}{n!}\left(\frac{b-a}{4}\right)^n, \quad C_n=\max_{y\in[a,b]}\vert f^{(n)}(y)\vert$$ the theoretical maximum numerical error is $\frac{2}{6!}(\frac{\pi}{8})^6=0.00001018724$.
In this third try, I want to compute in 16 bits as much as possible, so I store the coefficients in 16-bit fixed point. This increases the rounding error compared to storing coefficients in 32-bit floating point, so I increase the number of Chebyshev nodes to 8 and compute new coefficients.
In addition, I also try to use fewer branches. Currently the approximation function is $p(x)$, $x\in[0, \pi/2]$, so one modulo operation is needed to decide the quadrant of the original angle.
```c
unsigned mod_k = k % 4;
switch (mod_k) {
case 0: /* first quadrant:  sin(x) =  sin(x*) */
    result = sin_x;
    break;
case 1: /* second quadrant: sin(x) =  cos(x*) */
    result = cos_x;
    break;
case 2: /* third quadrant:  sin(x) = -sin(x*) */
    result = (bf16_t) {.bits = sin_x.bits ^ 0x8000};
    break;
case 3: /* fourth quadrant: sin(x) = -cos(x*) */
    result = (bf16_t) {.bits = cos_x.bits ^ 0x8000};
    break;
}
```
In order to remove this branch, we can let the approximation function cover $p(x)$, $x\in[0,\pi]$. Then the result only involves $\sin(x^{\ast})$ (no $\cos$), so the four-way branch is not needed.
If $p(x)$, $x\in[0,\pi]$, uses 8 Chebyshev nodes for interpolation, the theoretical maximum numerical error is $$\frac{2}{8!}\left(\frac{\pi}{4}\right)^8=0.00000718172$$
Then we can represent $p(x)$ as an even function $p(y)$: since $\sin(x)$ for $x\in[0,\pi]$ is symmetric about $x=\pi/2$, we let $y=x-\pi/2$ and $p(y)$ becomes an even function.
$$p(y) = c_8y^8+c_6y^6+c_4y^4+c_2y^2+c_0$$
In Horner form:
$$p(y) = (((c_8y^2+c_6)y^2+c_4)y^2+c_2)y^2+c_0$$
which
$$
\begin{align}
c_0&=1, \\
c_2&=-0.499999655713,\\
c_4&=0.041664803572,\\
c_6&=-0.001386160164,\\
c_8&=0.00002331.
\end{align}
$$
I use Q2.14 fixed point to compute the approximation polynomial. The largest number occurring during the computation is $y^2=(\frac{\pi}{2})^2=2.4674011$, while the largest representable Q2.14 number is $2^2-2^{-14}=3.99993896484$, so Q2.14 is appropriate.
But Q2.14 fixed point only resolves about 4 decimal places ($2^{-14}\approx 6\times10^{-5}$), so $c_8$ cannot be represented in Q2.14 at all, and the other coefficients must be rounded to that resolution. I expected the numerical error to be large.
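A minimal sketch (not the repository code) of converting these coefficients to Q2.14 and evaluating the Horner form on $y^2$; the conversion of $c_8$ indeed rounds to zero:
```c
#include <stdint.h>

/* Q2.14 sketch of p(y) = (((c8*y^2 + c6)*y^2 + c4)*y^2 + c2)*y^2 + c0.
 * TO_Q rounds a real coefficient to the nearest multiple of 2^-14;
 * c8 = 0.00002331 < 2^-14, so it becomes 0, illustrating the rounding
 * problem described above. */
#define Q 14
#define TO_Q(x) ((int32_t)((x) * (1 << Q) + ((x) >= 0 ? 0.5 : -0.5)))
#define Q_MUL(a, b) ((int32_t)(((int64_t)(a) * (int64_t)(b)) >> Q))

static int32_t poly_q214(int32_t y) /* y = x - pi/2 in Q2.14 */
{
    const int32_t c0 = TO_Q(1.0);
    const int32_t c2 = TO_Q(-0.499999655713);
    const int32_t c4 = TO_Q(0.041664803572);
    const int32_t c6 = TO_Q(-0.001386160164);
    const int32_t c8 = TO_Q(0.00002331); /* rounds to 0 in Q2.14 */
    int32_t y2 = Q_MUL(y, y);
    int32_t p = Q_MUL(c8, y2) + c6;      /* Horner form on y^2 */
    p = Q_MUL(p, y2) + c4;
    p = Q_MUL(p, y2) + c2;
    p = Q_MUL(p, y2) + c0;
    return p;                            /* ~ cos(y) = sin(x), in Q2.14 */
}
```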
The results prove it.
While the second try, which stores coefficients and computes in FP32, shows no difference, this time, for $x\in[0, \pi]$, the maximum difference compared to glibc `sinf` is $0.003906 = 1$ ULP.

Then I observed that if we ensure $y^2 < 2-\frac{1}{2^{15}}$, we can switch the fixed-point representation to Q1.15, whose largest value is $2-\frac{1}{2^{15}}=1.99996948$.
So, for $x > 2.96875$, let $\sin(x)=\pi-x$ (the small-angle approximation around $\pi$). Then we can ensure $y^2=(\pi/2-x)^2 < 1.99996948$.
Below is the result of using Q1.15 for storing the coefficients and for the computation.
We can see that fewer data points differ from glibc `sinf`. Unfortunately, that is still not enough; we want no difference at all.

I found some methods that may be able to remove the remaining numerical difference, such as the Remez algorithm.
## Testing
To test the result, go to the `floating_point` directory and run `make test`.
## Real World application
The vLLM project is a high-throughput and memory-efficient inference and serving engine.
vLLM uses the RoPE (rotary position embedding) technique to reduce memory usage and computation.
The RoPE technique requires a [precomputed sin/cos cache](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/base.py).
Currently, vLLM computes and stores the sin/cos cache in FP32.
According to [this article](https://medium.com/@nilrekna/%E8%81%8A%E8%81%8A-llama-%E8%A3%A1%E7%9A%84-rope-%E6%80%8E%E9%BA%BC%E5%81%9A%E5%BE%97%E6%9B%B4%E5%BF%AB%E4%B8%80%E9%BB%9E-21f9ec403ba6), RoPE computes $\sin$ and $\cos$ of the angle $\phi_{p,i}$,
so the concern is the range of $\phi_{p,i}$; this determines how much error we can tolerate when computing the $\sin$ and $\cos$ functions.
[Huggingface Transformers PR 29285](https://github.com/huggingface/transformers/pull/29285) points out that BF16 has serious precision problems here.
In PyTorch, the native implementation of the `torch.sin` API uses `std::sin`, which computes sine in FP32 or FP64. [Link1](https://github.com/pytorch/pytorch/blob/40f53912cf797b55f63adf0e5ac0ec08ccfafe4a/aten/src/ATen/native/cuda/UnaryGeometricSinKernel.cu#L24) [Link2](https://github.com/pytorch/pytorch/blob/ddf8de28c25944a58e739ba9996b06753e4199cc/c10/util/BFloat16-math.h#L135)
* My question is, for a model using BF16, can we compute sine with BF16 rather than FP32?
$\to$ Yes, you can compute the sine function in BF16 with a numerical error that is negligible relative to the format's own precision. The analysis shows the maximum error can be kept to **1 ULP** (Unit in the Last Place).
However, achieving this precision is critically dependent on implementing two key strategies:
1. A well-conditioned polynomial approximation: a simple Taylor series expansion is inadequate, causing significant errors and oscillations. As shown above, a Chebyshev polynomial approximation is a superior method, as it converges faster and provides a more uniform, minimized error across the approximation interval.
2. A robust range reduction algorithm: this is the most critical step and the primary source of precision loss in BF16. The experiments above show that the **Payne-Hanek algorithm** is needed to accurately reduce large input angles to a fundamental range (e.g., $[-\pi/2, \pi/2]$) without losing the significant digits that BF16 cannot preserve on its own.
For applications that are highly sensitive to phase or require high precision with large angles, a mixed-precision design remains the safest and most robust compromise. Performing the error-prone range reduction step in FP32 and then converting the result to BF16 for the final polynomial evaluation would effectively mitigate the primary source of error while still leveraging BF16 for the bulk of the computation.
---
## References
* [Elementary Functions](https://nzdr.ru/data/media/biblio/kolxoz/M/MN/Muller%20J.M.%20Elementary%20Functions%20..%20Algorithms%20and%20Implementation(2ed.,%20Birkhauser,%202005)(ISBN%209780817644086)(O)(274s)_MN_.pdf)
* [A new range reduction algorithm](https://inria.hal.science/inria-00072320v1/file/RR2001-33.pdf)
* [Mado: A window system for resource-constrained devices ](https://github.com/sysprog21/mado)
* [HORNER’S METHOD](https://www.math.utoronto.ca/barbeau/hxpol2.pdf)
* [LLM Optimization Part 2 — How to Make RoPE Faster in LLaMA](https://medium.com/@nilrekna/%E8%81%8A%E8%81%8A-llama-%E8%A3%A1%E7%9A%84-rope-%E6%80%8E%E9%BA%BC%E5%81%9A%E5%BE%97%E6%9B%B4%E5%BF%AB%E4%B8%80%E9%BB%9E-21f9ec403ba6)
* [Sollya](https://www.sollya.org/)
* [Stackoverflow: Fastest implementation of sine, cosine and square root in C++ (doesn't need to be much accurate)](https://stackoverflow.com/questions/18662261/fastest-implementation-of-sine-cosine-and-square-root-in-c-doesnt-need-to-b)
* [How do Trigonometric functions work?](https://stackoverflow.com/questions/345085/how-do-trigonometric-functions-work/345117#345117)
* https://mooooo.ooo/chebyshev-sine-approximation/
* [RotaryEmbedding computation is wrong for certain position/feature pairs in reduced precision (both fp16 and bfloat) ](https://github.com/EleutherAI/gpt-neox/issues/1003)
* [Precision issues in Mistral rotary embeddings](https://github.com/huggingface/transformers/issues/29496?utm_source=chatgpt.com)
* [RoPE loses precision for Llama / Gemma + Gemma logits.float](https://github.com/huggingface/transformers/pull/29285)
* [Source code of RoPE in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/base.py)
## TODO
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Introduction-to-FP8
https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
https://github.com/rutgers-apl/rlibm-all/tree/main/source/float
https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf
https://www.jstage.jst.go.jp/article/jasse/12/1/12_113/_pdf#:~:text=Abstract,by%20MATLAB%2C%20GNU%20Octave%2C%20two
https://people.cs.rutgers.edu/~sn349/papers/rlibm-popl-2021.pdf
https://people.cs.rutgers.edu/~sn349/papers/park-etal-vss2025.pdf?utm_source=chatgpt.com
https://www.netlib.org/fdlibm/s_sin.c
https://www.netlib.org/fdlibm/k_sin.c
https://www.netlib.org/fdlibm/e_rem_pio2.c
https://www.netlib.org/fdlibm/k_rem_pio2.c
https://hal.science/hal-04709615/document