2022q1 Homework3 (quiz3)

contribute by < 93i7xo2 >

Source: quiz3
作業區

測驗 `11`


















































static inline unsigned long fls(unsigned long word)
{
    int num = 64 - 1;
        
    if (!(word & (~0ul << 32))) {
        num -= 32;
        word <<= 32;
    }
    if (!(word & (~0ul << (64 - 16)))) {
        num -= 16;
        word <<= 16;
    }
    if (!(word & (~0ul << (64 - 8)))) {
        num -= 8;
        word <<= 8;
    }
    if (!(word & (~0ul << (64 - 4)))) {
        num -= 4;
        word <<= 4;
    }   
    if (!(word & (~0ul << (64 - 2)))) {
        num -= 2;
        word <<= 2;
    }   
    if (!(word & (~0ul << (64 - 1))))
        num -= 1;
    return num;
} 

unsigned long i_sqrt(unsigned long x)
{
    unsigned long b, m, y = 0;

    if (x <= 1)
        return x;

    m = 1UL << (fls(x) & ~1UL);
    while (m != 0) {
        b = y + m;
        y >>= 1;

        if (x >= b) {
            x -= b;
            y += m;
        }
        m >>= 2;
    }

    return y;
}

延伸問題:

解釋上述程式碼運作原理，嘗試利用硬體的 clz/ctz 指令改寫
在 Linux 核心找出類似的巨集和程式碼，說明其應用場景

本題為 Leetcode 69. Sqrt(x)，常見作法是以 binary search 在範圍內逐一試誤。先列出要介紹的幾種方法在 Leetcode 的執行時間：

Method	Runtime	Memory
Brute-force	58 ms	5.4 MB
Fast Integer Square Root	3 ms	5.5 MB
Digit-by-digit calculation	3 ms	5.8 MB
Newton's Method	0 ms	5.7 MB
IEEE754 + Binary search	6 ms	5.4 MB
Fast Inverse Square Root	-	-

1. Brute-force

第一種方法雖然是暴力解，但試著縮小範圍，減少猜測的次數。例如 2 進位的 2 位數平方不會超過 4 位數；試著想 1 個 2 位數和 1 個 3 位數 0b100 相乘結果為 4 位數，因此 2 位數相乘不會超出 4 位數，同理反推小於 n 位數的開根號數會是

\frac{n + 1}{2}

位。現在假設

x

為 n 位數，我們可大幅縮小試誤範圍為

[0, 2^{\frac{n + 1}{2}} - 1]

：

int mySqrt(int x){
    if (x <= 1) return x;

    int bits = (32 - __builtin_clz(x) + 1) /2;
    int i = (1<<bits) - 1;
    while(i>x/i){
        i--;
    }
    return i;
}

2. Fast Integer Square Root

第二種為實現 Microchip - Fast Integer Square Root 演算法。該方法是將每一 bit 逐一試驗，控制時間複雜度在

O (n)

，從 2 進位的角度來看也是一種 binary search：

x 為 8 bit，求得 k = sqrt(x)
k = 0

1. 測試第1位，k |= 0b10000000
   成功=> x = 0b1XXXXXXX, k = 0b10000000
   失敗=> x = 0b0XXXXXXX, k = 0b00000000

2. 測試第2位，k |= 0b01000000
   成功=> x = 0bX1XXXXXX, k = 0bX1000000
   失敗=> x = 0bX0XXXXXX, k = 0bX0000000

3. 測試第3位，k |= 0b00100000
   成功=> x = 0bXX1XXXXX, k = 0bXX100000
   失敗=> x = 0bXX0XXXXX, k = 0bXX000000
   
4. 測試到第8位，k即所求數

int mySqrt(int x){
    if (x <= 1) return x;

    int bits = (32 - __builtin_clz(x) + 1) /2, result = 0;
    while(--bits >=0){
        int try = 1<<bits | result;
        if (try <= x/try) result |= 1<<bits;
    }

    return result;
}

3. Digit-by-digit Calculation

第三種方法（測驗 11）為 Digit-by-digit calculation，也就是常見的直式開方法，且可在 math/int_sqrt.c 找到原始碼。





















unsigned long i_sqrt(unsigned long x)
{
    unsigned long b, m, y = 0;

    if (x <= 1)
        return x;

    m = 1UL << (fls(x) & ~1UL);
    while (m != 0) {
        b = y + m;
        y >>= 1;

        if (x >= b) {
            x -= b;
            y += m;
        }
        m >>= 2;
    }

    return y;
}

fls 功能為取得 x 的位數再減去 1，其實可以替換成 (63 - __builtin_clzll(x))，但 __builtin_clzll 並未對 0 進行定義，需在行 5 判斷：
#define fls(x) (63 - __builtin_clzll(x))

原理是將欲求平方根的

N^{2}

拆成：

N^{2} = (a_{n} + a_{n - 1} + a_{n - 2} + . . . + a_{0})^{2}, a_{m} = 2^{m} o r a_{m} = 0

將其乘開得到

$[\begin{array}{ccccc} a_{0} a_{0} & a_{1} a_{0} & a_{2} a_{0} & . . . & a_{n} a_{0} \\ a_{0} a_{1} & a_{1} a_{1} & a_{2} a_{1} & . . . & a_{n} a_{1} \\ a_{0} a_{2} & a_{1} a_{2} & a_{2} a_{2} & . . . & a_{n} a_{2} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ a_{0} a_{n} & a_{1} a_{n} & a_{2} a_{n} & . . . & a_{n} a_{n} \end{array}]$
分成幾個部份觀察

$[\begin{matrix} a_{0} a_{0} \end{matrix}]$

$[\begin{array}{cc} a_{1} a_{0} \\ a_{0} a_{1} & a_{1} a_{1} \end{array}]$

$[\begin{array}{ccc} a_{2} a_{0} \\ a_{2} a_{1} \\ a_{0} a_{2} & a_{1} a_{2} & a_{2} a_{2} \end{array}]$
原式可整理成

$\begin{aligned} N^{2} = & a_{n}^{2} + [2 a_{n} + a_{n - 1}] a_{n - 1} + [2 (a_{n} + a_{n - 1}) + a_{n - 2}] a_{n - 2} + . . . + [2 (\sum_{i = 1}^{n} a_{i}) + a_{0}] a_{0} \\ = & a_{n}^{2} + [2 P_{n} + a_{n - 1}] a_{n - 1} + [2 P_{n - 1} + a_{n - 2}] a_{n - 2} + . . . + [2 P_{1} + a_{0}] a_{0} \\ P_{m} = & a_{n} + a_{n - 1} + . . . + a_{m} \end{aligned}$

$P_{0} = a_{n} + a_{n - 1} + . . . + a_{0}$ 即為所求平方根
$N$
很明顯
$P_{m} = P_{m + 1} + 2^{m}$ 。我們的目的是從試出所有
$a_{m}$ ，
$a_{m} = 2^{m}$ or
$a_{m} = 0$ 。從
$m = n$ 一路試到
$m = 0$ ，而求得
$a_{m}$ 只要試驗
$P_{m}^{2} \leq N^{2}$ 是否成立，兩種可能性：

${\begin{cases} P_{m} = P_{m + 1} + 2^{m}, & if P_{m}^{2} \leq N^{2} \\ P_{m} = P_{m + 1}, & otherwise \end{cases}$
但總不可能每輪去計算
$N^{2} - P_{m}^{2}$ 去試驗
$a_{m}$ ，成本太高。一種想法是從上一輪的差值
$X_{m + 1}$ 減去
$Y_{m}$ 得到
$X_{m}$
- $X_{m} = N^{2} - P_{m}^{2} = X_{m + 1} - Y_{m}$
- $Y_{m} = P_{m}^{2} - P_{m + 1}^{2} = 2 P_{m + 1} a_{m} + a_{m}^{2}$
也就是紀錄上一輪的
$P_{m + 1}$ 來計算
$Y_{m}$
更進一步最佳化，將
$Y_{m}$ 拆成
$c_{m}, d_{m}$

$c_{m} = P_{m + 1} 2^{m + 1}$

$d_{m} = (2^{m})^{2}$

$Y_{m} = {\begin{cases} c_{m} + d_{m} & if a_{m}^{2} = 2^{m} \\ 0 & if a_{m} = 0 \end{cases}$

藉由位元運算從
$c_{m}, d_{m}$ 推出下一輪
$c_{m - 1}, d_{m - 1}$
- $c_{m - 1} = P_{m} 2^{m} = (P_{m + 1} + a_{m}) 2^{m} = P_{m + 1} 2^{m} + a_{m} 2^{m} = {\begin{cases} c_{m} / 2 + d_{m} & if a_{m} = 2^{m} \\ c_{m} / 2 & if a_{m} = 0 \end{cases}$
- $d_{m - 1} = \frac{d_{m}}{4}$

結合上述方法撰寫演算法來尋求

a_{n} + a_{n - 1} + . . . + a_{0}

，假設從

a_{n}

開始往下試，至於如何求得

a_{n}

後面再說明。因為是從

a_{n}

開始，所以：

$X_{n + 1} = N$
- 一開始都沒猜，差值是 N
$n$ 是最大的位數，使得
$a_{n}^{2} = (2^{n})^{2} = d_{n}^{2} = 4^{n} \leq N^{2}$ 所以用迴圈逼近
$d_{n}$ ，
```
int32_t d = 1 << 30; // The second-to-top bit is set.
while (d > n)
    d >>= 2;
```
我自己的理解是要猜出最大的
$a_{n}$ 使得
$P_{0} = N^{2}$ 。

$N^{2}$ 最大為 [2n+1, 2n+2] bits，如果用
$4^{n}$ 去試有 2n+1 bits，是能試出
$a_{n}$ 的。
$4^{n + 1}$ 有 2n+3 bits 就會超過
$N^{2}$ 。
$c_{n} = 0$
- $c_{m} = P_{m + 1} 2^{m + 1}$ 。
  $a_{n}$ 已是已知最大，所以
  $P_{n + 1} = 0 \to c_{n} = 0$

因為

d_{m}

恆大於零，使用位移操作

d_{m}

到最後一定為零，我們可以使用 d!=0 作為條件。又

c_{- 1} = P_{0} = a_{n} + a_{n - 1} + . . . + a_{0}

，算到最後的

c_{- 1}

即為所求

N

，

a_{n}

也一併知曉





























// Digit-by-digit_calculation
int32_t isqrt(int32_t n) {
    assert(("sqrt input should be non-negative", n > 0));

    // Xₙ₊₁
    int32_t x = n;

    // cₙ
    int32_t c = 0;

    // dₙ which starts at the highest power of four <= n
    int32_t d = 1 << 30; // The second-to-top bit is set.
                         // Same as ((unsigned) INT32_MAX + 1) / 2.
    while (d > n)
        d >>= 2;

    // for dₙ … d₀
    while (d != 0) {
        if (x >= c + d) {      // if Xₘ₊₁ ≥ Yₘ then aₘ = 2ᵐ
            x -= c + d;        // Xₘ = Xₘ₊₁ - Yₘ
            c = (c >> 1) + d;  // cₘ₋₁ = cₘ/2 + dₘ (aₘ is 2ᵐ)
        }
        else {
            c >>= 1;           // cₘ₋₁ = cₘ/2      (aₘ is 0)
        }
        d >>= 2;               // dₘ₋₁ = dₘ/4
    }
    return c;                  // c₋₁
}

對比 quiz 11，除了變數名稱和使用 fls() 來尋找

d_{n}

，和上述演算法是一致的。





















unsigned long i_sqrt(unsigned long x)
{
    unsigned long b, m, y = 0;

    if (x <= 1)
        return x;

    m = 1UL << (fls(x) & ~1UL);
    while (m != 0) {
        b = y + m;
        y >>= 1;

        if (x >= b) {
            x -= b;
            y += m;
        }
        m >>= 2;
    }

    return y;
}

4. Newton's Method

參考牛頓法求平方根















int mySqrt(int n)
{
    if (n == 0)
        return 0;
    if (n < 4)
        return 1;

    unsigned int ans = n / 2;
    if (ans > 65535)  // 65535 = 2^16 - 1
        ans = 65535;
    
    while (ans * ans > n  || (ans + 1) * (ans + 1) <= n)
        ans = (ans + n / ans) / 2;
    return ans;
}

5. IEEE754 + Binary Search

此方法是結合位元位移和 binary search，參考二進位系統的平方根




























































typedef union {
    float value;
    uint32_t v_int;
} ieee_float_shape_type;

/* Get a 32 bit int from a float. */
#define EXTRACT_WORDS(ix0, d)        \
    do {                             \
        ieee_float_shape_type ew_u;  \
        ew_u.value = (d);            \
        (ix0) = ew_u.v_int;          \
    } while (0)

int mySqrt(int n)
{ 
    int32_t sign = 0x80000000;
    int32_t ix0, m, i;

    float x = n;
    EXTRACT_WORDS(ix0, x);
    
    /* take care of zero */
    if (ix0 <= 0) {
        if ((ix0 & (~sign)) == 0)
            return x; /* sqrt(+-0) = +-0 */
        if (ix0 < 0)
            return (x - x) / (x - x); /* sqrt(-ve) = sNaN */
    }

    /* normalize x */
    m = (ix0 >> 23);
    if (m == 0) { /* subnormal x */
        for (i = 0; (ix0 & 0x00800000) == 0; i++)
            ix0 <<= 1;
        m -= i - 1;
    }
    m -= 127; /* unbias exponent */
    ix0 = (ix0 & 0x007fffff) | 0x00800000;
    if (m & 1) /* odd m, double x to make it even */
        ix0 += ix0;
    m >>= 1; /* m = [m/2] */
     
    /* binary search 'm' */
    unsigned int head = 1 << m;       // 2^m
    unsigned int tail = 1 << (m + 1);  // 2^(m+1)
    unsigned int mid = (head + tail) >> 1; // same as (2^m + 2^(m+1)) / 2
    while (1) {
        if (n > (mid + 1) * (mid + 1)) {
            head = mid; 
            mid = (head + tail) >> 1;
        } else if(n < mid * mid) {
            tail = mid; 
            mid = (head + tail) >> 1;
        } else
            break;
    }

    return mid;
}

6. Fast Inverse Square Root

參考Fast Inverse Square Root








float InvSqrt(float x) {  
    float xhalf = 0.5f * x;
    int i = *(int *) &x;
    i = 0x5f3759df - (i >> 1);
    x = *(float *) &i;
    x = x * (1.5f - xhalf * x * x);
    return x;
}

上述是針對倒數開根號所開發的演算法，因此，還需要將回傳值乘上自己以取得開根號

x \cdot \frac{1}{\sqrt{x}} = \sqrt{x}

7. Babylonian method

參考 Babylonian method，此法等同用牛頓法解

x^{2} - S = 0

求

\sqrt{S}

，Uniswap v2 應用在 Math 函式庫。














int sqrt(int y) {
  int z;
  if (y > 3) {
    z = y;
    int x = y / 2 + 1;
    while (x < z) {
      z = x;
      x = (y / x + x) / 2;
    }
  } else if (y != 0) {
    z = 1;
  }
  return z;
}

結論

撇除第 1 種暴力解對各種演算法進行測試：

Algorithm	Use division/multiplication	Use float/double	Details
Fast Integer Square Root	Image Not Showing Possible Reasons The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported Learn More →		執行時間與數值位數成正比
Digit-by-digit Calculation			則因為初始 d 值與數值位數相關，與執行時間成正比，又因 d 每次以 2 位元位移且避開乘法，理應耗時低於 Fast Integer Square Root。
IEEE754 + Binary Search			一如預期的在較大的數值範圍執行時間較長
Fast Inverse Square Root	Image Not Showing Possible Reasons The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported Learn More →	Image Not Showing Possible Reasons The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported Learn More →	以常數時間勝出。
Babylonian method	Image Not Showing Possible Reasons The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported Learn More →

如果因為硬體限制可用 Digit-by-digit Calculation 或是 IEEE754 + Binary Search。如果沒有硬體限制且不在意精度可用 Fast Inverse Square Root。

Reference

Calculate sqrt(x) using magic number method

2022q1 Homework3 (quiz3)

測驗 11

1. Brute-force

2. Fast Integer Square Root

3. Digit-by-digit Calculation

4. Newton's Method

5. IEEE754 + Binary Search

6. Fast Inverse Square Root

7. Babylonian method

結論

Reference

Read more

2021q1 Homework4 (quiz4)

JPEG Decoder (3/3)

JPEG Decoder (1/3)

JPEG Decoder (2/3)

測驗 `11`