一對一問答紀錄:從整數平方根到浮點數開平方根。

# 一對一問答紀錄:從整數平方根到浮點數開平方根。 ## 整數開平方根在接下來的諸多演算法中，可以歸納為以下大綱: 1. 迭代起始值: 二元搜索法 `cls()` 、查表法。 2. 演算法: Newton-Paphson method / Heron's method (以下簡稱牛頓法)、 digit-by-digit method (以下簡稱逐位逼近法)。 3. 透過追蹤迭代變數減少位移運算。 > Qs: why is branch free important? ### 牛頓法以下將跟隨 [Python's integer square root algorithm](https://github.com/mdickinson/snippets/blob/main/papers/isqrt/isqrt.pdf) 逐步探討牛頓法整數開平方根的演進。參閱 [從 √2 的存在談開平方根的快速運算](https://hackmd.io/@sysprog/sqrt?stext=18180%3A3%3A0%3A1747215575%3A9b9x2Q)，Heron's method 保證: $$a^{(k+1)}= \frac{ a^{(k)} + \lfloor \frac{n}{a^{(k)}} \rfloor}{2} $$ 最終 $a^{(k+1)}$ 會收斂到 $\sqrt{n}$，於是有了以下實作。 ```c int mySqrt(int n) { if (n == 0) return 0; if (n < 4) return 1; unsigned int ans = n / 2; if (ans > 65535) // 65535 = 2^16 - 1 ans = 65535; while (ans * ans > n || (ans + 1) * (ans + 1) <= n) ans = (ans + n / ans) / 2; return ans; } ``` 加入用二元搜索法 `clz()` 取得初始值，和追蹤迭代變數 `d` (和 [2025 Quiz2-2](https://hackmd.io/@sysprog/linux2025-quiz2) 中的 `m` 有異曲同工之妙) ，有更快的實作如下: ```c uint64_t simple_newton_sqrt(uint64_t x) { int total_bits = 64; int shift = (total_bits - 1 - clz64(x)); uint64_t a = 1ULL << shift; uint64_t d; while (1) { d = x / a; if (a < d) return a; a = (a + d) << 1; } } ``` #### Listing 4: Adaptive-precision Heron’s method, iterative 文章中基於現有實作，發現每次迭代過程中準確的有效位數增加緩慢，提出以下方法予以優化: 與其直接計算 $\sqrt{n}$ ，改為計算 $\sqrt{\lfloor \frac{n}{j^2} \rfloor}$ ，且隨著迭代，整數 $j\rightarrow 0$，期待 $\sqrt{\lfloor \frac{n}{j^2} \rfloor}\rightarrow \sqrt{n}$。進行證明之前: 1. 定義 **near square root** :若一個正整數 $a$ 是正整數 $n$ 的 near square root 則 $(a-1)^2 \lt n \lt (a+1)^2$。 2. 定義 $b=\lfloor \frac{n}{j^2} \rfloor$ 的 near square root，移項開根號可得 $jb\rightarrow \sqrt{n}$。 3. (Theorem 10) $a=kb+\lfloor \frac{n}{4kb} \rfloor$ 是 n 的 near square root ，若$4k^2 \lt n$ 綜合以上，由 2，不妨令 $j=2k$，$b=\lfloor \frac{n}{4k^2} \rfloor$ 的 near square root ，如果 $4k^2 \lt n$ ， (Theorem 10) 保證 $a=kb+\lfloor \frac{n}{4kb} \rfloor$ 是 n 的 near square root。由 Theorem 10， $a$ 同時是 n 的 near square root，，我們可以有更精密的估計: $$\lfloor \frac{jb+\lfloor \frac{n}{jb} \rfloor}{2} \rfloor \sim \sqrt{n}$$ 為求使用位元運算，令 $k=2^e$ ，根據 3. 的條件，我們可以得到最大的縮小倍率 $k$ : $$\begin{align} 4k^2 \le n &\Longleftrightarrow 2^{2+4e} \lt n \\ &\Longleftrightarrow 2+4e+1 \le length(n) \\ &\Longleftrightarrow e \le \lfloor \frac{length(n)-3}{4} \rfloor \end{align}$$ 定義 `recur_sqrt(x)` 演算法步驟: 1. 根據 $\lfloor \frac{n}{4k^2} \rfloor$ 計算最大 $k=2^e$ 當作估計的縮小倍率。 2. 因為$a=kb+\lfloor \frac{n}{4kb} \rfloor$ 是 n 的 near square root，計算有效位數較少的 `recur_sqrt(b)` (near square root of $\lfloor \frac{n}{4k^2} \rfloor$)，取代 `recur_sqrt(a)` ，引入遞迴寫成 `a` = recur_sqrt($\lfloor \frac{n}{4k^2} \rfloor$)。 3. 回傳正確的近似 $a=kb+\lfloor \frac{n}{4kb} \rfloor$ (near square root of $n$) ```c uint64_t recur_sqrt(uint64_t n) { if (n < 4) return 1; uint64_t e = (64 - clz64(n) - 3) >> 2; // 4k^2 < n uint64_t a = recur_sqrt(n >> (e+2)); return (a << e) + (n >> (e+2)) / a; } uint64_t raw_adaptive_sqrt(uint64_t x) { uint64_t a = recur_sqrt(x); return (a*a <= x) ? a : a-1; } ``` 文中提供不使用遞迴的版本，具體來說做了以下代數代換： $$a=kb+\lfloor \frac{n}{4kb} \rfloor \rightarrow a=kb+\lfloor \frac{n}{2^{2(c-d)}kb} \rfloor$$ $$a = \text{near square root of} \,\,\lfloor \frac{n}{2^{2(c-d)}} \rfloor = \lfloor \frac{n}{4^{(c-d)}} \rfloor$$ $$k=2^e \rightarrow 2^{d-e-1}$$ 注意到會隨迭代改變的變數是 $a$、$d$、$s$ ，其中 $d=\lfloor \frac{c}{2^s} \rfloor$ ，$s$ 隨迭代次數遞減，代表最終 $d \rightarrow c$ ， $k \rightarrow 2^{c-e-1}$。又 $b$ = near square root of $a$，令 $m= \lfloor \frac{n}{4^{(c-d)}} \rfloor$，$a$ = near square root of $n$，於是我們有: $$a=kb+\lfloor \frac{n}{2^{2(c-d)}kb} \rfloor=2^{d-e-1}b+\lfloor \frac{m}{4(2^{d-e-1})b} \rfloor$$ $$a=2^{d-e-1}b+\lfloor \frac{m}{4(2^{d-e-1})b} \rfloor \rightarrow 2^{c-e-1}b+\lfloor \frac{m}{4(2^{c-e-1})b} \rfloor$$ 最後一式已然滿足上述 Theorem 10 的形式，最後證明縮倍比率 $k$ 的最大值為: $$\begin{align} 4k^2=4(4^{d-e-1})^2&=4\times 4^{d-e-1}\times 4^{d-e-1} \\ &\le 4\times 4^{d-e-1}\times 4^{e} \\ &= 4^d \end{align}$$ 至此我們成功證明 $a$ 同時是 $m$ 和 $n$ 的 near square root **注意到這裡沒有推倒過程，文章中直接嘗試證明以下實作的合理性**，不想看推導的可以跳過，根據程式碼定義: * $c=\lfloor\frac{log_2(n)-1}{2}\rfloor\le\lfloor \log_4(n)\rfloor$ (換底公式: $\log_{a^n}b=\log_{a}b^{\frac{1}{n}}$) * $s=\log_2c$ * $d=\lfloor \frac{c}{2^s} \rfloor$ * $e=\lfloor \frac{d}{2} \rfloor$ 利用 $Lemma\,\,12$ 得以證明 $4k^2 \le n$ ， Theorem 10 成立，即所求 $a$ 是 $m$ 的 near square root。 $$ 4k^2 \le n \Longleftrightarrow 4(4^{d-e-1})^2 \le 4^d \lt m $$ ```c uint64_t adaptive_newton_sqrt(uint64_t x) { int total_bits = 64; uint64_t c = (total_bits - clz64(x) - 1) >> 1; uint64_t s = total_bits - clz64(c); uint64_t d = 0, a = 1, e; while (s > 0) { e = d; s--; d = c >> s; a = (a << (d-e-1)) + (x >> ((c<<1) - d - e + 1)) / a; } return (a*a <= x) ? a : (a-1); } ``` 值得一提的是 `d` 基本上沒有數學意義，一樣是我們為了減少運算所紀錄的迭代中間產物。 #### Listing 7: Fixed-precision variant, valid for $0 ≤ n < 2^{64}$ 從上一個實作可以發現迭代次數上限 $s=\lfloor\log_2\lfloor\log_2n\rfloor\rfloor$，以 64 位元無號整數來說，$c=31$，迭代次數必然是 $s=5$，由此我們如果固定美次迭代的縮小倍率 $k=2^{d-e-1}$ ，則我們可以展開迭代成以下形式，讀者可以自行帶入數值證明: ```c uint64_t fix_newton_sqrt(uint64_t x) { int total_bits = 64; /* notice that we want x to shift with interval 2^2 to make sure the */ uint64_t e = clz64(x) >> 1; // e = (32 − n.bit_length ()) / 2 uint64_t m = x << (e << 1); // make x > 2^62 /* for 2^62 < integer < 2^64 , fixed-percision Newton's method is * guarantee to converge with 5 iteration. */ uint64_t a = 1 + (m >> 62); // s=4 a = (a << 1) + (m >> 59) / a; // s=3 a = (a << 3) + (m >> 53) / a; // s=2 a = (a << 7) + (m >> 41) / a; // s=1 a = (a << 15) + (m >> 17) / a; // s=0 a = a >> e; // shift back return (a*a <= x) ? a : (a-1); } ``` **Listing 8: LUT + Fixed percision + Newton's Method** ```c /* isqrt32_tab[k] = isqrt(256*(k+64) - 1) for 0 <= k < 192 */ static const uint8_t isqrt32_tab[192] = { 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 143, 144, 145, 146, 147, 148, 149, 150, 150, 151, 152, 153, 154, 155, 155, 156, 157, 158, 159, 159, 160, 161, 162, 163, 163, 164, 165, 166, 167, 167, 168, 169, 170, 170, 171, 172, 173, 173, 174, 175, 175, 176, 177, 178, 178, 179, 180, 181, 181, 182, 183, 183, 184, 185, 185, 186, 187, 187, 188, 189, 189, 190, 191, 191, 192, 193, 193, 194, 195, 195, 196, 197, 197, 198, 199, 199, 200, 201, 201, 202, 203, 203, 204, 204, 205, 206, 206, 207, 207, 208, 209, 209, 210, 211, 211, 212, 212, 213, 214, 214, 215, 215, 216, 217, 217, 218, 218, 219, 220, 221, 221, 222, 222, 223, 223, 224, 225, 225, 226, 226, 227, 227, 228, 229, 229, 230, 230, 231, 231, 232, 232, 233, 234, 234, 235, 236, 236, 237, 237, 238, 238, 239, 239, 240, 241, 241, 242, 242, 243, 243, 244, 244, 245, 245, 246, 246, 247, 247, 248, 248, 249, 249, 250, 250, 251, 251, 252, 252, 253, 253, 254, 254, 255, }; uint16_t table_sqrti(uint32_t x) { if (x == 0) return x; /* make sure x > 2^30 */ int shift = clz32(x) & 30; x <<= shift; /* get 2^14 < a < 2^16 from table */ printf("loop: %d\n", (int)((x >> 24) - 64)); uint16_t a = 1 + isqrt32_tab[(x >> 24) - 64]; /* Newton's method */ a = (a << 7) + (x >> 9) / a; /* near square correction */ if (x < a * a) a -= 1; return a >> (shift >> 1); } ``` 從 listing 7 可以直覺的想到，因為迭代的精度 (a 準確的位數) 是以固定的位數增長，換句話說，我們可以利用查表取代前幾次的牛頓法迭代運算，具體來說，根據 $Lemma\,13$ 對於一個數 $x,\,\,x > 2^{30}$ ，如果我們先透過查表找到 $t=\lfloor\frac{x}{2^{15}}\rfloor,\,\,2^{14}<t<2^{16}$ ， lemma 保證迭代後的 $t$ 收斂到 $x$ 的 near sqaure root，由此可以發展出 listing 8 實作。 Look Up Table(LUT)製作方法單純是把 $t$ 預先做好 near square root ，注意到因為這邊是使用 8 位元大小存放LUT，所以在最後一項 $t=\lfloor2^{16}-1\rfloor$ 同時有 255 和 256 兩個 near sqaure root 時，避免使用需要 9 位元的 256。 ### Digit-by-digit Method (逐位逼近法) 詳見[測驗 2](https://hackmd.io/a_okd5kAS0CdzP22ZmHlVg?both=&stext=7584%3A4%3A0%3A1747209583%3ATugJuK)。 ```c uint64_t sqrti(uint64_t x) { uint64_t m, y = 0; if (x <= 1) return x; int total_bits = 64; /* clz64(x) returns the count of leading zeros in x. * (total_bits - 1 - clz64(x)) gives the index of the highest set bit. * Rounding that index down to an even number ensures our starting m is a * power of 4. */ int shift = (total_bits - 1 - clz64(x)) & ~1; m = 1ULL << shift; while (m) { uint64_t b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } return y; } ``` 至此我們完整了解牛頓法開平方根，而[整數開平方根 Quiz2-2](https://hackmd.io/@DboOgKS6RtOmMLCltFsBig/SkS4ZDUeeg?fbclid=IwY2xjawKQ_D5leHRuA2FlbQIxMQBicmlkETFHNlRUTW9jQ3c1MXBRYVVvAR5REqHnVCBKAoORuVxYHyyf4ECOZrZiMQcmHoZtCDuGht2VY_eC5UEqU5WerQ_aem_ZvO-9XJyk4KNS977HM_StQ&stext=2%3A15%3A0%3A1747230693%3AWpynbZ)中提到，牛頓法相比逐位逼近法迭代次數更少，運算更快，然而 linux 核心卻仍採用逐位逼近法。 ## 浮點數開平方根 ### IEEE754 ![雙精度浮點數](https://hackmd.io/_uploads/BkJ143Wbge.png) 引述自維基百科，[IEEE754](https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules) 中提供四種近似到最接近整數的 `round()` 方法: * 捨入到最接近：修約到最接近，在一樣接近的情況下偶數優先（Ties To Even，這是預設的修約方式）：會將結果修約為最接近且可以表示的值，但是當存在兩個數一樣接近的時候，則取其中的偶數（在二進位中是以0結尾的）。 * 朝+∞方向捨入：會將結果朝正無限大的方向捨入。 * 朝-∞方向捨入：會將結果朝負無限大的方向捨入。 * 朝0方向捨入：會將結果朝0的方向捨入。相較於無條件進位和無條件捨去，通常默認的修約方法是最接近四捨五入的 **RoundTiesToEven**，以下提供精度到小數點後 8 位元的範例，我們假設換算成整數只需要小數點後三位，即最尾端有四位元需要被近似後捨棄。 ``` LSB 的一半: 1000 # 正常情況大於 1000 進位，小於捨棄對於 1.001_1001，round 後為 1.010。對於 1.001_0111，round 後為 1.001。 # 正好一半: round to nearest even 對於 1.001_1000，經舍入處理後為 1.010。對於 -1.010_1000，經舍入處理後為 -1.010。 ``` 單精度浮點數之非正規數例外情況： ![image](https://hackmd.io/_uploads/SJC9rIo-gx.png) 結論來說，本次單精度浮點數開根號實作需要處理以下例外情況： * 對正負 0 開根號 $\rightarrow$ 回傳正負 0。 * 對負數開根號，根據 IEEE 754 ，屬於數學尚未定義數，回傳 qNaN 。 * 對太小的數根號使發生 Underflow ，回傳小於等於原數的值即可(不需處理)。 * 對正無窮大開根號 $\rightarrow$ 回傳正無窮大。 * 對 NaN 開根號 $\rightarrow$ 回傳 NaN 。 ### 為何需要避免使用 FPU 在作業三 [Linux 核心的浮點數運算](https://hackmd.io/@sysprog/linux2025-kxo/%2F%40sysprog%2Flinux2025-kxo-a%3Fstext%3D3060%253A14%253A0%253A1748433549%253A9UogU2)一文中提到: linux kernel 若要使用 FPU 通常需要觸發 trap(exception) 並切換至 floating point mode ，而這個 trap 的成本並不小，核心通常沒有餘裕負擔，所以幾乎所有的核心程式碼都使用定點數運算。這也代表大部分的時間 FPU context 的內容與 user context 正在進行的 process 不吻合，若有需要要使用 FPU 的情況，需要自行操作對 FPU state 的存取，並且避免 context switch。建議盡量使用定點數運算。舉例來說，有以下可能的問題 * Lazy FP state leak: * Lazy FP state 因亂序執行失效 * 並非所有硬體都具備 FPU ### 牛頓法 ```c typedef union { float value; uint32_t v_int; } ieee_float_shape_type; /* Get a 32 bit int from a float. */ #define EXTRACT_WORDS(ix0, d) \ do { \ ieee_float_shape_type ew_u; \ ew_u.value = (d); \ (ix0) = ew_u.v_int; \ } while (0) float mySqrt(float n) { uint32_t sign_mask = 0x80000000; uint32_t exp_mask = 0x7F800000; uint32_t frac_mask = 0x007fffff; int32_t ix0, exponent, i; uint32_t mentissa; float result = 0.0, x = n, d; EXTRACT_WORDS(ix0, x); /* take care of zero */ if (ix0 <= 0) { if ((ix0 & (~sign_mask)) == 0) return x; /* sqrt(+-0) = +-0 */ if (ix0 < 0) return (x - x) / (x - x); /* sqrt(-ve) = sNaN */ } /* take care of positive infinity */ exponent = ix0 >> 23; mentissa = ix0 & frac_mask; if (exponent == 0) { if ((ix0 & frac_mask) == 0) return x; } /* take care of NaNs */ else if (exponent == 0xFF) { if (mentissa != 0) return x; } printf("1 exponent: %d, ix0: %X \n", exponent,ix0); printf("x: %f, n: %f \n", x,n); for (i = 0; i < 10; i++) { d = n / x; x = (x + d) / 2; printf("x: %f, n: %f \n", x,n); } return x; } ``` 參考以上程式碼，若我們直接用牛頓法實作，相比於 `sqrti` 我們需要考慮收斂的問題，且在輸入小於 1 時特別明顯，以輸入 `1e-19` 為例，可以發現 10 次迭代依然無法收斂。 ``` 1 exponent: 63, ix0: 1FEC1E4A x: 0.000000, n: 0.000000 x: 0.500000, n: 0.000000 x: 0.250000, n: 0.000000 x: 0.125000, n: 0.000000 x: 0.062500, n: 0.000000 x: 0.031250, n: 0.000000 x: 0.015625, n: 0.000000 x: 0.007812, n: 0.000000 x: 0.003906, n: 0.000000 x: 0.001953, n: 0.000000 x: 0.000977, n: 0.000000 correct: 0.000000, my: 0.000977 ``` 顯然，現有程式仍有很大缺陷,我們當然可以透過尋找更好的初始值解決迭代次數的問題，可參考本文最後用倒數平方根做猜想的牛頓法實現。 ### Digit-by-digit Method ```c p = s = 0; /* q = sqrt(x) */ r = 0x01000000; /* r = moving bit from right to left */ while(r!=0) { if(s+r <= ix0) { ix0 -= (s+r); s += 2*r; q += r; } ix0 += ix0; r >>= 1; } ``` 參考[直式開方的步驟原理說明](https://www.youtube.com/watch?v=iQYGKId1c3U)。 ![image](https://hackmd.io/_uploads/S17RGnnblx.png) * $x_{i}$ 是待開平方數，對應圖中 6241。 * $q_{i}^2$ 是已決定的平方根，對應圖中 70。 * $q_{i+1}=q$`y` 代表加入下一位後的商，對應圖中直式除法的商 `7y`。要求下一位平方根，可以總結成以下步驟： 1. 增加兩位，計算當前誤差值 $13(41)=x_{i}^2-q_{i}^2$ 2. 決定下一位元 `y`，若 $(2\times70+y)\times y > 當前差值$ 則商 $q_{i+1}$ 填入對應的 `y`。 **改寫至二進制** 觀察以下二進制小數平方： $$\begin{align} (1.1)^2 &= (1+\frac{1}{2})^2=\frac{9}{4}= 2.25 \\ (1.101)^2 &= (1+\frac{1}{2}+\frac{1}{8})^2=\frac{169}{64}\approx 2.6406 \\ (1.11)^2 &= (1+\frac{1}{2}+\frac{1}{4})^2=\frac{49}{16}\approx 3.0625 \end{align}$$ 可以發現多增加一位小數點的平方 $(1.101_2)^2$ 必定被包在 $(1.10_2)^2$ 和 $(1.11_2)^2$ 之間，故和十進位不同，我們算誤差值只需要一次多觀察一位，也就是說第 i+1 次的迭代誤差值 `x` 可以寫成： $$\begin{align} x^{(i+1)} &= x^{(i)}\times2 - (2q+b_{i+1}\times2 )b_{i+1}\times2 \\ &= x^{(i)}\times2 - (2q+2^{-(i)}\times2 )2^{-(i)}\times2 \end{align}$$ 其中 $x^{(i+1)}$ 代表 $q$ 目前計算到第 i+1 位，也可以說是小數點後第 i 位，對應 $b_{i+1}=2^{-(i)}$。注意到多觀察一位隱含同時把剩下的誤差和原本的商右移，可以參照十進制時，誤差 $13.41\,\, ⭢\,\, 13.41\times10^2$ 和商 $7\,\, ⭢\,\, 70$ ，對應 $x^{(i)}\times2$ 和 $b_{i+1} \times2$。由此，我們可以得到==不需要 FPU== 的實作。 ```c float DBD_sqrtf(float n) { float x = n, z; uint32_t sign_mask = 0x80000000; uint32_t frac_mask = 0x007fffff; int32_t ix0,s,q,m,t,i; uint32_t r; EXTRACT_WORDS(ix0,x); /* take care of zero */ if (ix0 <= 0) { if ((ix0 & (~sign_mask)) == 0) return x; /* sqrt(+-0) = +-0 */ if (ix0 < 0) return (x - x) / (x - x); /* sqrt(-ve) = sNaN */ } /* normalize x */ m = (ix0 >> 23); if (m == 0) { /* subnormal x */ for (i = 0; (ix0 & 0x00800000) == 0; i++) ix0 <<= 1; m -= i - 1; } m -= 127; /* unbias exponent */ /* retrieve correct mentissa 1.XXX */ ix0 = (ix0 & frac_mask) | 0x00800000; if (m & 1) /* odd m, double x to make it even */ ix0 += ix0; m >>= 1; /* (exponent)m = [m/2] */ /* generate sqrt(x) digit by digit */ ix0 += ix0; q = s = 0; /* q = sqrt(x) */ r = 0x01000000; /* r = moving bit from right to left */ while(r!=0) { t = s+r; if(t <= ix0) { // if reamin > add + r s = t+r; // add + r ix0 -= t; // remain - add q += r; // add one bit } ix0 += ix0; r >>= 1; } /* use floating add to find out rounding direction */ if(ix0!=0) { z = one-tiny; /* trigger inexact flag */ if (z>=one) { z = one+tiny; if (z>one) q += 2; else q += (q & 1); } } ix0 = (q >> 1) + 0x3f000000; ix0 += (m << 23); INSERT_WORDS(z,ix0); return z; } ``` ### 倒數平方根 + 牛頓法這部份 [從 √2 的存在談開平方根的快速運算](https://hackmd.io/@sysprog/sqrt?stext=18180%3A3%3A0%3A1747215575%3A9b9x2Q) 已有介紹 `Q_rsqrt(x)`，實作上也很直接，即先用倒數平方根做初始值的猜測 $$\sqrt{N} = \text{Q_rsqrt(x)} \times N$$在加入牛頓法提高精度 >TODO: 比較有無牛頓法的差異 ### TODO：比較 Digit-by-digit 和 Inverse Square Root 實現 `sqrtf()` 之效能差異 ### 參考資料 * [glibc/math_private.h](https://codebrowser.dev/glibc/glibc/sysdeps/generic/math_private.h.html) * [e_sqrtf.c](https://github.com/openbsd/src/blob/dd79ebd3e2ae7cad5e9ae8d576005e34146143b2/lib/libm/src/e_sqrtf.c#L23) * [/ARM/sqrtf.c](https://github.com/simonbyrne/apple-libm/blob/master/Source/ARM/sqrtf.c)