--- ###### tags: `sysprog2022q1` --- # 2022q1 Homework3 (quiz3) contribute by < `93i7xo2` > > Source: [quiz3](https://hackmd.io/@sysprog/linux2022-quiz3) > [作業區](https://hackmd.io/@sysprog/linux2022-homework3) ### 測驗 `11` ```c= static inline unsigned long fls(unsigned long word) { int num = 64 - 1; if (!(word & (~0ul << 32))) { num -= 32; word <<= 32; } if (!(word & (~0ul << (64 - 16)))) { num -= 16; word <<= 16; } if (!(word & (~0ul << (64 - 8)))) { num -= 8; word <<= 8; } if (!(word & (~0ul << (64 - 4)))) { num -= 4; word <<= 4; } if (!(word & (~0ul << (64 - 2)))) { num -= 2; word <<= 2; } if (!(word & (~0ul << (64 - 1)))) num -= 1; return num; } unsigned long i_sqrt(unsigned long x) { unsigned long b, m, y = 0; if (x <= 1) return x; m = 1UL << (fls(x) & ~1UL); while (m != 0) { b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } return y; } ``` :::success 延伸問題: 1. 解釋上述程式碼運作原理,嘗試利用硬體的 clz/ctz 指令改寫 2. 在 Linux 核心找出類似的巨集和程式碼,說明其應用場景 ::: 本題為 [Leetcode 69. Sqrt(x)](https://leetcode.com/problems/sqrtx/),常見作法是以 binary search 在範圍內逐一試誤。先列出要介紹的幾種方法在 Leetcode 的執行時間: Method|Runtime|Memory -|-|- Brute-force|58 ms|5.4 MB Fast Integer Square Root|3 ms|5.5 MB Digit-by-digit calculation|3 ms|5.8 MB Newton's Method|0 ms|5.7 MB IEEE754 + Binary search|6 ms|5.4 MB Fast Inverse Square Root|-| - #### 1. Brute-force 第一種方法雖然是暴力解,但試著縮小範圍,減少猜測的次數。例如 2 進位的 2 位數平方不會超過 4 位數;試著想 1 個 2 位數和 1 個 3 位數 `0b100` 相乘結果為 4 位數,因此 2 位數相乘不會超出 4 位數,同理反推小於 n 位數的開根號數會是 $\frac{n+1}{2}$ 位。現在假設 $x$ 為 n 位數,我們可大幅縮小試誤範圍為 $[0, 2^{\frac{n+1}{2}} - 1]$: ```c int mySqrt(int x){ if (x <= 1) return x; int bits = (32 - __builtin_clz(x) + 1) /2; int i = (1<<bits) - 1; while(i>x/i){ i--; } return i; } ``` #### 2. Fast Integer Square Root 第二種為實現 [Microchip - Fast Integer Square Root](http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf) 演算法。該方法是將每一 bit 逐一試驗,控制時間複雜度在 $O(n)$,從 2 進位的角度來看也是一種 binary search: ``` x 為 8 bit,求得 k = sqrt(x) k = 0 1. 測試第1位,k |= 0b10000000 成功=> x = 0b1XXXXXXX, k = 0b10000000 失敗=> x = 0b0XXXXXXX, k = 0b00000000 2. 測試第2位,k |= 0b01000000 成功=> x = 0bX1XXXXXX, k = 0bX1000000 失敗=> x = 0bX0XXXXXX, k = 0bX0000000 3. 測試第3位,k |= 0b00100000 成功=> x = 0bXX1XXXXX, k = 0bXX100000 失敗=> x = 0bXX0XXXXX, k = 0bXX000000 4. 測試到第8位,k即所求數 ``` ```c int mySqrt(int x){ if (x <= 1) return x; int bits = (32 - __builtin_clz(x) + 1) /2, result = 0; while(--bits >=0){ int try = 1<<bits | result; if (try <= x/try) result |= 1<<bits; } return result; } ``` #### 3. Digit-by-digit Calculation 第三種方法(測驗 `11`)為 [Digit-by-digit calculation](https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation),也就是常見的直式開方法,且可在 [math/int_sqrt.c](https://git.yoctoproject.org/linux-yocto-contrib/plain/lib/math/int_sqrt.c) 找到原始碼。 ```c= unsigned long i_sqrt(unsigned long x) { unsigned long b, m, y = 0; if (x <= 1) return x; m = 1UL << (fls(x) & ~1UL); while (m != 0) { b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } return y; } ``` > `fls` 功能為取得 x 的位數再減去 1,其實可以替換成 `(63 - __builtin_clzll(x))`,但 `__builtin_clzll` 並未對 0 進行定義,需在行 5 判斷: > ```c > #define fls(x) (63 - __builtin_clzll(x)) > ``` 原理是將欲求平方根的 $N^2$ 拆成: $N^2=(a_{n}+a_{n-1}+a_{n-2}+...+{a_0})^2, a_m=2^m\ or\ a_m=0$ 1. 將其乘開得到 $\left[ \begin{array}{ccccc} a_0a_0 & a_1a_0 & a_2a_0 & ... & a_na_0 \\ a_0a_1 & a_1a_1 & a_2a_1 & ... & a_na_1 \\ a_0a_2 & a_1a_2 & a_2a_2 & ... & a_na_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_0a_n & a_1a_n & a_2a_n & ... & a_na_n \\ \end{array} \right]$ 2. 分成幾個部份觀察 $\left[ \begin{array}{c} a_0a_0 \\ \end{array} \right]$ $\left[ \begin{array}{cc} & a_1a_0 \\ a_0a_1 & a_1a_1 \end{array} \right]$ $\left[ \begin{array}{ccc} & & a_2a_0 \\ & & a_2a_1 \\ a_0a_2 & a_1a_2 & a_2a_2 \end{array} \right]$ 3. 原式可整理成 $\begin{split} \\ N^2 =& a_n^2+[2a_n+a_{n-1}]a_{n-1}+[2(a_n+a_{n-1})+a_{n-2}]a_{n-2}+...+[2(\displaystyle\sum_{i=1}^{n}a_i)+a_0]a_0 \\ =& a_n^2+[2P_n+a_{n-1}]a_{n-1}+[2P_{n-1}+a_{n-2}]a_{n-2}+...+[2P_1+a_0]a_0 \\ P_m =& a_n+a_{n-1}+...+a_m \\ \end{split}$ $P_0=a_n+a_{n-1}+...+a_0$ 即為所求平方根 $N$ 4. 很明顯 $P_m = P_{m+1} + 2^m$ 。我們的目的是從試出所有 $a_m$,$a_m=2^m$ or $a_m=0$。從 $m=n$ 一路試到 $m=0$,而求得 $a_m$ 只要試驗 $P_m^2 \leq N^2$ 是否成立,兩種可能性: \begin{cases} P_m = P_{m+1} + 2^m, & \text{if $P_m^2 \leq N^2$}\\ P_m = P_{m+1}, & \text{otherwise} \end{cases} 5. 但總不可能每輪去計算 $N^2 - P_m^2$ 去試驗 $a_m$,成本太高。一種想法是從上一輪的差值 $X_{m+1}$ 減去 $Y_m$ 得到 $X_{m}$ - $X_m = N^2 - P_m^2 = X_{m+1} - Y_m$ - $Y_m = P_m^2 - P_{m+1}^2 = 2P_{m+1}a_m+a_m^2$ 也就是紀錄上一輪的 $P_{m+1}$ 來計算 $Y_m$ 6. 更進一步最佳化,將 $Y_m$ 拆成 $c_m, d_m$ $c_m = P_{m+1}2^{m+1}$ $d_m = (2^m)^2$ $Y_m=\left. \begin{cases} c_m+d_m & \text{if } a_m^2=2^m \\ 0 & \text{if } a_m=0 \end{cases} \right.$ <font color="red">藉由位元運算從 $c_m, d_m$ 推出下一輪$c_{m-1}, d_{m-1}$</font> - $c_{m-1}=P_m2^m=(P_{m+1}+a_m)2^m=P_{m+1}2^m+a_m2^m=\begin{cases} c_m/2+d_m & \text{if }a_m=2^m \\ c_m/2 & \text{if }a_m=0 \end{cases}$ - $d_{m-1}=\dfrac{d_m}{4}$ 7. 結合上述方法撰寫演算法來尋求$a_n+a_{n-1}+...+a_0$,<font color="red">假設從 $a_n$ 開始往下試</font>,至於如何求得 $a_n$ 後面再說明。因為是從 $a_n$ 開始,所以: - $X_{n+1}=N$ - 一開始都沒猜,差值是 N - $n$ 是最大的位數,使得 $a_n^2=(2^n)^2= d_n^2 = 4^n \leq N^2$ 所以用迴圈逼近 $d_n$, ```c int32_t d = 1 << 30; // The second-to-top bit is set. while (d > n) d >>= 2; ``` > 我自己的理解是要猜出最大的 $a_n$ 使得 $P_0=N^2$。 > $N^2$ 最大為 [2n+1, 2n+2] bits,如果用 $4^n$ 去試有 2n+1 bits,是能試出 $a_n$ 的。$4^{n+1}$ 有 2n+3 bits 就會超過 $N^2$。 - $c_n = 0$ - $c_m=P_{m+1}2^{m+1}$。$a_n$ 已是已知最大,所以 $P_{n+1} = 0 \rightarrow c_n=0$ 因為 $d_m$ 恆大於零,使用位移操作 $d_m$ 到最後一定為零,我們可以<font>使用 `d!=0` 作為條件</font>。又 $c_{-1} = P_0 = a_n+a_{n-1}+...+a_0$,算到最後的<font color="red"> $c_{-1}$ 即為所求 $N$,$a_n$ 也一併知曉</font> ```c= // Digit-by-digit_calculation int32_t isqrt(int32_t n) { assert(("sqrt input should be non-negative", n > 0)); // Xₙ₊₁ int32_t x = n; // cₙ int32_t c = 0; // dₙ which starts at the highest power of four <= n int32_t d = 1 << 30; // The second-to-top bit is set. // Same as ((unsigned) INT32_MAX + 1) / 2. while (d > n) d >>= 2; // for dₙ … d₀ while (d != 0) { if (x >= c + d) { // if Xₘ₊₁ ≥ Yₘ then aₘ = 2ᵐ x -= c + d; // Xₘ = Xₘ₊₁ - Yₘ c = (c >> 1) + d; // cₘ₋₁ = cₘ/2 + dₘ (aₘ is 2ᵐ) } else { c >>= 1; // cₘ₋₁ = cₘ/2 (aₘ is 0) } d >>= 2; // dₘ₋₁ = dₘ/4 } return c; // c₋₁ } ``` 對比 quiz `11`,除了變數名稱和使用 `fls()` 來尋找 $d_n$,和上述演算法是一致的。 ```c= unsigned long i_sqrt(unsigned long x) { unsigned long b, m, y = 0; if (x <= 1) return x; m = 1UL << (fls(x) & ~1UL); while (m != 0) { b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } return y; } ``` #### 4. Newton's Method :::spoiler 參考[牛頓法求平方根](https://hackmd.io/@sysprog/sqrt#牛頓法求平方根) ```c= int mySqrt(int n) { if (n == 0) return 0; if (n < 4) return 1; unsigned int ans = n / 2; if (ans > 65535) // 65535 = 2^16 - 1 ans = 65535; while (ans * ans > n || (ans + 1) * (ans + 1) <= n) ans = (ans + n / ans) / 2; return ans; } ``` ::: #### 5. IEEE754 + Binary Search :::spoiler 此方法是結合位元位移和 binary search,參考[二進位系統的平方根](https://hackmd.io/HUzAPdVVSACpsnaj5Omvgw?view#二進位系統的平方根) ```c= typedef union { float value; uint32_t v_int; } ieee_float_shape_type; /* Get a 32 bit int from a float. */ #define EXTRACT_WORDS(ix0, d) \ do { \ ieee_float_shape_type ew_u; \ ew_u.value = (d); \ (ix0) = ew_u.v_int; \ } while (0) int mySqrt(int n) { int32_t sign = 0x80000000; int32_t ix0, m, i; float x = n; EXTRACT_WORDS(ix0, x); /* take care of zero */ if (ix0 <= 0) { if ((ix0 & (~sign)) == 0) return x; /* sqrt(+-0) = +-0 */ if (ix0 < 0) return (x - x) / (x - x); /* sqrt(-ve) = sNaN */ } /* normalize x */ m = (ix0 >> 23); if (m == 0) { /* subnormal x */ for (i = 0; (ix0 & 0x00800000) == 0; i++) ix0 <<= 1; m -= i - 1; } m -= 127; /* unbias exponent */ ix0 = (ix0 & 0x007fffff) | 0x00800000; if (m & 1) /* odd m, double x to make it even */ ix0 += ix0; m >>= 1; /* m = [m/2] */ /* binary search 'm' */ unsigned int head = 1 << m; // 2^m unsigned int tail = 1 << (m + 1); // 2^(m+1) unsigned int mid = (head + tail) >> 1; // same as (2^m + 2^(m+1)) / 2 while (1) { if (n > (mid + 1) * (mid + 1)) { head = mid; mid = (head + tail) >> 1; } else if(n < mid * mid) { tail = mid; mid = (head + tail) >> 1; } else break; } return mid; } ``` ::: #### 6. Fast Inverse Square Root :::spoiler 參考[Fast Inverse Square Root](https://hackmd.io/HUzAPdVVSACpsnaj5Omvgw?view#Fast-Inverse-Square-Root-(平方根倒數)) ![](https://i.imgur.com/m1wi5tu.png) ```c= float InvSqrt(float x) { float xhalf = 0.5f * x; int i = *(int *) &x; i = 0x5f3759df - (i >> 1); x = *(float *) &i; x = x * (1.5f - xhalf * x * x); return x; } ``` ::: 上述是針對倒數開根號所開發的演算法,因此,還需要將回傳值乘上自己以取得開根號 $x\cdot \dfrac{1}{\sqrt{x}}=\sqrt{x}$ #### 7. Babylonian method 參考 [Babylonian method](https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Babylonian_method),此法等同用[牛頓法](https://en.wikipedia.org/wiki/Newton%27s_method)解 $x^2 - S = 0$ 求 $\sqrt{S}$,[Uniswap v2](https://github.com/Uniswap/v2-core/blob/v1.0.1/contracts/libraries/Math.sol) 應用在 Math 函式庫。 ```c= int sqrt(int y) { int z; if (y > 3) { z = y; int x = y / 2 + 1; while (x < z) { z = x; x = (y / x + x) / 2; } } else if (y != 0) { z = 1; } return z; } ``` #### 結論 撇除第 1 種暴力解對各種演算法進行測試: ![](https://i.imgur.com/0JwEu4c.png) Algorithm|Use division/multiplication|Use float/double|Details -|-|-|- Fast Integer Square Root|:heavy_check_mark:||執行時間與數值位數成正比 Digit-by-digit Calculation|||則因為初始 d 值與數值位數相關,與執行時間成正比,又因 d 每次以 2 位元位移且避開乘法,理應耗時低於 Fast Integer Square Root。 IEEE754 + Binary Search|||一如預期的在較大的數值範圍執行時間較長 Fast Inverse Square Root|:heavy_check_mark:|:heavy_check_mark:|以常數時間勝出。 Babylonian method|:heavy_check_mark:||| 如果因為硬體限制可用 Digit-by-digit Calculation 或是 IEEE754 + Binary Search。如果沒有硬體限制且不在意精度可用 Fast Inverse Square Root。 #### Reference - [Calculate sqrt(x) using magic number method](https://gist.github.com/a2468834/f798c9731a0d4a9dc532bcb28930da84)