---
###### tags: `sysprog2022q1`
---
# 2022q1 Homework3 (quiz3)
contribute by < `93i7xo2` >
> Source: [quiz3](https://hackmd.io/@sysprog/linux2022-quiz3)
> [作業區](https://hackmd.io/@sysprog/linux2022-homework3)
### 測驗 `11`
```c=
static inline unsigned long fls(unsigned long word)
{
int num = 64 - 1;
if (!(word & (~0ul << 32))) {
num -= 32;
word <<= 32;
}
if (!(word & (~0ul << (64 - 16)))) {
num -= 16;
word <<= 16;
}
if (!(word & (~0ul << (64 - 8)))) {
num -= 8;
word <<= 8;
}
if (!(word & (~0ul << (64 - 4)))) {
num -= 4;
word <<= 4;
}
if (!(word & (~0ul << (64 - 2)))) {
num -= 2;
word <<= 2;
}
if (!(word & (~0ul << (64 - 1))))
num -= 1;
return num;
}
unsigned long i_sqrt(unsigned long x)
{
unsigned long b, m, y = 0;
if (x <= 1)
return x;
m = 1UL << (fls(x) & ~1UL);
while (m != 0) {
b = y + m;
y >>= 1;
if (x >= b) {
x -= b;
y += m;
}
m >>= 2;
}
return y;
}
```
:::success
延伸問題:
1. 解釋上述程式碼運作原理,嘗試利用硬體的 clz/ctz 指令改寫
2. 在 Linux 核心找出類似的巨集和程式碼,說明其應用場景
:::
本題為 [Leetcode 69. Sqrt(x)](https://leetcode.com/problems/sqrtx/),常見作法是以 binary search 在範圍內逐一試誤。先列出要介紹的幾種方法在 Leetcode 的執行時間:
Method|Runtime|Memory
-|-|-
Brute-force|58 ms|5.4 MB
Fast Integer Square Root|3 ms|5.5 MB
Digit-by-digit calculation|3 ms|5.8 MB
Newton's Method|0 ms|5.7 MB
IEEE754 + Binary search|6 ms|5.4 MB
Fast Inverse Square Root|-| -
#### 1. Brute-force
第一種方法雖然是暴力解,但試著縮小範圍,減少猜測的次數。例如 2 進位的 2 位數平方不會超過 4 位數;試著想 1 個 2 位數和 1 個 3 位數 `0b100` 相乘結果為 4 位數,因此 2 位數相乘不會超出 4 位數,同理反推小於 n 位數的開根號數會是 $\frac{n+1}{2}$ 位。現在假設 $x$ 為 n 位數,我們可大幅縮小試誤範圍為 $[0, 2^{\frac{n+1}{2}} - 1]$:
```c
int mySqrt(int x){
if (x <= 1) return x;
int bits = (32 - __builtin_clz(x) + 1) /2;
int i = (1<<bits) - 1;
while(i>x/i){
i--;
}
return i;
}
```
#### 2. Fast Integer Square Root
第二種為實現 [Microchip - Fast Integer Square Root](http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf) 演算法。該方法是將每一 bit 逐一試驗,控制時間複雜度在 $O(n)$,從 2 進位的角度來看也是一種 binary search:
```
x 為 8 bit,求得 k = sqrt(x)
k = 0
1. 測試第1位,k |= 0b10000000
成功=> x = 0b1XXXXXXX, k = 0b10000000
失敗=> x = 0b0XXXXXXX, k = 0b00000000
2. 測試第2位,k |= 0b01000000
成功=> x = 0bX1XXXXXX, k = 0bX1000000
失敗=> x = 0bX0XXXXXX, k = 0bX0000000
3. 測試第3位,k |= 0b00100000
成功=> x = 0bXX1XXXXX, k = 0bXX100000
失敗=> x = 0bXX0XXXXX, k = 0bXX000000
4. 測試到第8位,k即所求數
```
```c
int mySqrt(int x){
if (x <= 1) return x;
int bits = (32 - __builtin_clz(x) + 1) /2, result = 0;
while(--bits >=0){
int try = 1<<bits | result;
if (try <= x/try) result |= 1<<bits;
}
return result;
}
```
#### 3. Digit-by-digit Calculation
第三種方法(測驗 `11`)為 [Digit-by-digit calculation](https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation),也就是常見的直式開方法,且可在 [math/int_sqrt.c](https://git.yoctoproject.org/linux-yocto-contrib/plain/lib/math/int_sqrt.c) 找到原始碼。
```c=
unsigned long i_sqrt(unsigned long x)
{
unsigned long b, m, y = 0;
if (x <= 1)
return x;
m = 1UL << (fls(x) & ~1UL);
while (m != 0) {
b = y + m;
y >>= 1;
if (x >= b) {
x -= b;
y += m;
}
m >>= 2;
}
return y;
}
```
> `fls` 功能為取得 x 的位數再減去 1,其實可以替換成 `(63 - __builtin_clzll(x))`,但 `__builtin_clzll` 並未對 0 進行定義,需在行 5 判斷:
> ```c
> #define fls(x) (63 - __builtin_clzll(x))
> ```
原理是將欲求平方根的 $N^2$ 拆成:
$N^2=(a_{n}+a_{n-1}+a_{n-2}+...+{a_0})^2, a_m=2^m\ or\ a_m=0$
1. 將其乘開得到
$\left[
\begin{array}{ccccc}
a_0a_0 & a_1a_0 & a_2a_0 & ... & a_na_0 \\
a_0a_1 & a_1a_1 & a_2a_1 & ... & a_na_1 \\
a_0a_2 & a_1a_2 & a_2a_2 & ... & a_na_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_0a_n & a_1a_n & a_2a_n & ... & a_na_n \\
\end{array}
\right]$
2. 分成幾個部份觀察
$\left[
\begin{array}{c}
a_0a_0 \\
\end{array}
\right]$
$\left[
\begin{array}{cc}
& a_1a_0 \\
a_0a_1 & a_1a_1
\end{array}
\right]$
$\left[
\begin{array}{ccc}
& & a_2a_0 \\
& & a_2a_1 \\
a_0a_2 & a_1a_2 & a_2a_2
\end{array}
\right]$
3. 原式可整理成
$\begin{split} \\
N^2 =& a_n^2+[2a_n+a_{n-1}]a_{n-1}+[2(a_n+a_{n-1})+a_{n-2}]a_{n-2}+...+[2(\displaystyle\sum_{i=1}^{n}a_i)+a_0]a_0 \\
=& a_n^2+[2P_n+a_{n-1}]a_{n-1}+[2P_{n-1}+a_{n-2}]a_{n-2}+...+[2P_1+a_0]a_0 \\
P_m =& a_n+a_{n-1}+...+a_m \\
\end{split}$
$P_0=a_n+a_{n-1}+...+a_0$ 即為所求平方根 $N$
4. 很明顯 $P_m = P_{m+1} + 2^m$ 。我們的目的是從試出所有 $a_m$,$a_m=2^m$ or $a_m=0$。從 $m=n$ 一路試到 $m=0$,而求得 $a_m$ 只要試驗 $P_m^2 \leq N^2$ 是否成立,兩種可能性:
\begin{cases}
P_m = P_{m+1} + 2^m, & \text{if $P_m^2 \leq N^2$}\\
P_m = P_{m+1}, & \text{otherwise}
\end{cases}
5. 但總不可能每輪去計算 $N^2 - P_m^2$ 去試驗 $a_m$,成本太高。一種想法是從上一輪的差值 $X_{m+1}$ 減去 $Y_m$ 得到 $X_{m}$
- $X_m = N^2 - P_m^2 = X_{m+1} - Y_m$
- $Y_m = P_m^2 - P_{m+1}^2 = 2P_{m+1}a_m+a_m^2$
也就是紀錄上一輪的 $P_{m+1}$ 來計算 $Y_m$
6. 更進一步最佳化,將 $Y_m$ 拆成 $c_m, d_m$
$c_m = P_{m+1}2^{m+1}$
$d_m = (2^m)^2$
$Y_m=\left.
\begin{cases}
c_m+d_m & \text{if } a_m^2=2^m \\
0 & \text{if } a_m=0
\end{cases}
\right.$
<font color="red">藉由位元運算從 $c_m, d_m$ 推出下一輪$c_{m-1}, d_{m-1}$</font>
- $c_{m-1}=P_m2^m=(P_{m+1}+a_m)2^m=P_{m+1}2^m+a_m2^m=\begin{cases}
c_m/2+d_m & \text{if }a_m=2^m \\
c_m/2 & \text{if }a_m=0
\end{cases}$
- $d_{m-1}=\dfrac{d_m}{4}$
7. 結合上述方法撰寫演算法來尋求$a_n+a_{n-1}+...+a_0$,<font color="red">假設從 $a_n$ 開始往下試</font>,至於如何求得 $a_n$ 後面再說明。因為是從 $a_n$ 開始,所以:
- $X_{n+1}=N$
- 一開始都沒猜,差值是 N
- $n$ 是最大的位數,使得 $a_n^2=(2^n)^2= d_n^2 = 4^n \leq N^2$ 所以用迴圈逼近 $d_n$,
```c
int32_t d = 1 << 30; // The second-to-top bit is set.
while (d > n)
d >>= 2;
```
> 我自己的理解是要猜出最大的 $a_n$ 使得 $P_0=N^2$。
> $N^2$ 最大為 [2n+1, 2n+2] bits,如果用 $4^n$ 去試有 2n+1 bits,是能試出 $a_n$ 的。$4^{n+1}$ 有 2n+3 bits 就會超過 $N^2$。
- $c_n = 0$
- $c_m=P_{m+1}2^{m+1}$。$a_n$ 已是已知最大,所以 $P_{n+1} = 0 \rightarrow c_n=0$
因為 $d_m$ 恆大於零,使用位移操作 $d_m$ 到最後一定為零,我們可以<font>使用 `d!=0` 作為條件</font>。又 $c_{-1} = P_0 = a_n+a_{n-1}+...+a_0$,算到最後的<font color="red"> $c_{-1}$ 即為所求 $N$,$a_n$ 也一併知曉</font>
```c=
// Digit-by-digit_calculation
int32_t isqrt(int32_t n) {
assert(("sqrt input should be non-negative", n > 0));
// Xₙ₊₁
int32_t x = n;
// cₙ
int32_t c = 0;
// dₙ which starts at the highest power of four <= n
int32_t d = 1 << 30; // The second-to-top bit is set.
// Same as ((unsigned) INT32_MAX + 1) / 2.
while (d > n)
d >>= 2;
// for dₙ … d₀
while (d != 0) {
if (x >= c + d) { // if Xₘ₊₁ ≥ Yₘ then aₘ = 2ᵐ
x -= c + d; // Xₘ = Xₘ₊₁ - Yₘ
c = (c >> 1) + d; // cₘ₋₁ = cₘ/2 + dₘ (aₘ is 2ᵐ)
}
else {
c >>= 1; // cₘ₋₁ = cₘ/2 (aₘ is 0)
}
d >>= 2; // dₘ₋₁ = dₘ/4
}
return c; // c₋₁
}
```
對比 quiz `11`,除了變數名稱和使用 `fls()` 來尋找 $d_n$,和上述演算法是一致的。
```c=
unsigned long i_sqrt(unsigned long x)
{
unsigned long b, m, y = 0;
if (x <= 1)
return x;
m = 1UL << (fls(x) & ~1UL);
while (m != 0) {
b = y + m;
y >>= 1;
if (x >= b) {
x -= b;
y += m;
}
m >>= 2;
}
return y;
}
```
#### 4. Newton's Method
:::spoiler 參考[牛頓法求平方根](https://hackmd.io/@sysprog/sqrt#牛頓法求平方根)
```c=
int mySqrt(int n)
{
if (n == 0)
return 0;
if (n < 4)
return 1;
unsigned int ans = n / 2;
if (ans > 65535) // 65535 = 2^16 - 1
ans = 65535;
while (ans * ans > n || (ans + 1) * (ans + 1) <= n)
ans = (ans + n / ans) / 2;
return ans;
}
```
:::
#### 5. IEEE754 + Binary Search
:::spoiler 此方法是結合位元位移和 binary search,參考[二進位系統的平方根](https://hackmd.io/HUzAPdVVSACpsnaj5Omvgw?view#二進位系統的平方根)
```c=
typedef union {
float value;
uint32_t v_int;
} ieee_float_shape_type;
/* Get a 32 bit int from a float. */
#define EXTRACT_WORDS(ix0, d) \
do { \
ieee_float_shape_type ew_u; \
ew_u.value = (d); \
(ix0) = ew_u.v_int; \
} while (0)
int mySqrt(int n)
{
int32_t sign = 0x80000000;
int32_t ix0, m, i;
float x = n;
EXTRACT_WORDS(ix0, x);
/* take care of zero */
if (ix0 <= 0) {
if ((ix0 & (~sign)) == 0)
return x; /* sqrt(+-0) = +-0 */
if (ix0 < 0)
return (x - x) / (x - x); /* sqrt(-ve) = sNaN */
}
/* normalize x */
m = (ix0 >> 23);
if (m == 0) { /* subnormal x */
for (i = 0; (ix0 & 0x00800000) == 0; i++)
ix0 <<= 1;
m -= i - 1;
}
m -= 127; /* unbias exponent */
ix0 = (ix0 & 0x007fffff) | 0x00800000;
if (m & 1) /* odd m, double x to make it even */
ix0 += ix0;
m >>= 1; /* m = [m/2] */
/* binary search 'm' */
unsigned int head = 1 << m; // 2^m
unsigned int tail = 1 << (m + 1); // 2^(m+1)
unsigned int mid = (head + tail) >> 1; // same as (2^m + 2^(m+1)) / 2
while (1) {
if (n > (mid + 1) * (mid + 1)) {
head = mid;
mid = (head + tail) >> 1;
} else if(n < mid * mid) {
tail = mid;
mid = (head + tail) >> 1;
} else
break;
}
return mid;
}
```
:::
#### 6. Fast Inverse Square Root
:::spoiler 參考[Fast Inverse Square Root](https://hackmd.io/HUzAPdVVSACpsnaj5Omvgw?view#Fast-Inverse-Square-Root-(平方根倒數))
![](https://i.imgur.com/m1wi5tu.png)
```c=
float InvSqrt(float x) {
float xhalf = 0.5f * x;
int i = *(int *) &x;
i = 0x5f3759df - (i >> 1);
x = *(float *) &i;
x = x * (1.5f - xhalf * x * x);
return x;
}
```
:::
上述是針對倒數開根號所開發的演算法,因此,還需要將回傳值乘上自己以取得開根號
$x\cdot \dfrac{1}{\sqrt{x}}=\sqrt{x}$
#### 7. Babylonian method
參考 [Babylonian method](https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Babylonian_method),此法等同用[牛頓法](https://en.wikipedia.org/wiki/Newton%27s_method)解 $x^2 - S = 0$ 求 $\sqrt{S}$,[Uniswap v2](https://github.com/Uniswap/v2-core/blob/v1.0.1/contracts/libraries/Math.sol) 應用在 Math 函式庫。
```c=
int sqrt(int y) {
int z;
if (y > 3) {
z = y;
int x = y / 2 + 1;
while (x < z) {
z = x;
x = (y / x + x) / 2;
}
} else if (y != 0) {
z = 1;
}
return z;
}
```
#### 結論
撇除第 1 種暴力解對各種演算法進行測試:
![](https://i.imgur.com/0JwEu4c.png)
Algorithm|Use division/multiplication|Use float/double|Details
-|-|-|-
Fast Integer Square Root|:heavy_check_mark:||執行時間與數值位數成正比
Digit-by-digit Calculation|||則因為初始 d 值與數值位數相關,與執行時間成正比,又因 d 每次以 2 位元位移且避開乘法,理應耗時低於 Fast Integer Square Root。
IEEE754 + Binary Search|||一如預期的在較大的數值範圍執行時間較長
Fast Inverse Square Root|:heavy_check_mark:|:heavy_check_mark:|以常數時間勝出。
Babylonian method|:heavy_check_mark:|||
如果因為硬體限制可用 Digit-by-digit Calculation 或是 IEEE754 + Binary Search。如果沒有硬體限制且不在意精度可用 Fast Inverse Square Root。
#### Reference
- [Calculate sqrt(x) using magic number method](https://gist.github.com/a2468834/f798c9731a0d4a9dc532bcb28930da84)