2023-06-18 測驗

# 2023-06-18 測驗 > 受測者: POCHUN-CHEN ## 測驗 `1` 用 Python 計算開平方根: ```python import math print(math.sqrt(16)) print(math.sqrt(63)) ``` 以下程式碼嘗試計算開平方根: (版本一) ```c #include <math.h> int isqrt(int N) { int msb = (int) log2(N); int a = 1 << msb; int result = 0; while (a != 0) { if ((result + a) * (result + a) <= N) result += a; a >>= 1; } return result; } ``` 對應的測試程式: ```c #include <stdio.h> int main() { int N = 36; printf("%d\n", isqrt(N)); return 0; } ``` 編譯方式: ```shell $ gcc -o isqrt isqrt.c -lm ``` 改寫上述 `isqrt` 函式，使其不依賴 [log2](https://man7.org/linux/man-pages/man3/log2.3p.html) 並保持原本的函式原型 (prototype) 和精度 (precision)。 :arrow_down_small: 作答區 (完整的 `isqrt` 函式，版本二) ```c int isqrt(int N) { int msb = 0; int n = N; while (n > 1) { n >>= 1; msb++; } int a = 1 << msb; int result = 0; while (a != 0) { if ((result + a) * (result + a) <= N) result += a; a >>= 1; } return result; } ``` Linux 核心的 [lib/math/int_sqrt.c](https://github.com/torvalds/linux/blob/master/lib/math/int_sqrt.c) 採用 [Digit-by-digit calculation](https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation)，也就是常見的直式開方法。程式碼改寫如下: (版本三) ```c int isqrt(int x) { if (x <= 1) /* Assume x is always positive */ return x; int y = 0; for (int m = 1UL << ((31 - __builtin_clz(x)) & ~1UL); m; m >>= 2) { int b = y + m; y >>= 1; if (x >= b) x -= b, y += m; } return y; } ``` 直式開方法是把要開方的數拆成 2 的冪相加 (若是十進位就是 10 的冪)，就可以拆成以下的形式： $$ x = (000b_0b_1b_2\cdots b_{n-1}b_{n})^2 $$ 上面的 $000b_0b_1b_2\cdots b_{n-1}b_{n}$ 為 bits pattern ，其中 $b_0$ 為最高位的 1，寫成數學式如下： $$ x = (2^{n}b_0+2^{n-1}b_1+\cdots+2^1b_{n-1}+2^0b_{n})^2 \\ a_i = 2^{n-i}b_i \\ \begin{aligned}x &=(a_0+a_1+\cdots+a_{n-1}+a_{n})^2 \\ &= a_0^{2}+2a_0a_1 + a_1^2 + 2(a_0+a_1)a_2 + a_2^2 + \cdots + a_{n-1}^2 + 2\left(\sum\limits_{i = 0}^{n-1}{a_i}\right)a_n + a_n^2 \\ &=a_0^2 + (2a_0+a_1)a_1 + [2(a_0+a_1)+a_2]a_2 + \cdots + [2\left(\sum\limits_{i = 0}^{n-1}{a_i}\right)+a_n]a_n \end{aligned} $$ 最後的形式為 $[2\left(\sum\limits_{i = 0}^{n-1}{a_i}\right)+a_n]a_n$ 的相加，其中每項就對應到上述程式碼的每一輪迴圈 ```c if (x >= b) { x -= b; y += m; } ``` 要判斷並可能減掉的 `b`。原理是將欲求平方根的 $N^2$ 拆成： $$ N^2=(a_{n}+a_{n-1}+a_{n-2}+...+{a_0})^2, a_m=2^m\ or\ a_m=0 $$ 將其乘開得到 $$ \left[ \begin{array}{ccccc} a_0a_0 & a_1a_0 & a_2a_0 & ... & a_na_0 \\ a_0a_1 & a_1a_1 & a_2a_1 & ... & a_na_1 \\ a_0a_2 & a_1a_2 & a_2a_2 & ... & a_na_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_0a_n & a_1a_n & a_2a_n & ... & a_na_n \\ \end{array} \right] $$ 分成幾個部份觀察 $$ \left[ \begin{array}{c} a_0a_0 \\ \end{array} \right] $$ $$ \left[ \begin{array}{cc} & a_1a_0 \\ a_0a_1 & a_1a_1 \end{array} \right] $$ $$ \left[ \begin{array}{ccc} & & a_2a_0 \\ & & a_2a_1 \\ a_0a_2 & a_1a_2 & a_2a_2 \end{array} \right] $$ 原式可整理成 $$ \begin{split} \\ N^2 =& a_n^2+[2a_n+a_{n-1}]a_{n-1}+[2(a_n+a_{n-1})+a_{n-2}]a_{n-2}+...+[2(\displaystyle\sum_{i=1}^{n}a_i)+a_0]a_0 \\ =& a_n^2+[2P_n+a_{n-1}]a_{n-1}+[2P_{n-1}+a_{n-2}]a_{n-2}+...+[2P_1+a_0]a_0 \\ P_m =& a_n+a_{n-1}+...+a_m \\ \end{split} $$ $P_0=a_n+a_{n-1}+...+a_0$ 即為所求平方根 $N$ 顯然 $P_m = P_{m+1} + a_m$。我們的目的是從試出所有 $a_m$，$a_m=2^m$ or $a_m=0$。從 $m=n$ 一路試到 $m=0$，而求得 $a_m$ 只要試驗 $P_m^2 \leq N^2$ 是否成立，二種可能狀況： $$ \begin{cases} P_m = P_{m+1} + 2^m, & \text{if $P_m^2 \leq N^2$}\\ P_m = P_{m+1}, & \text{otherwise} \end{cases} $$ 但我們不可能每輪去計算 $N^2 - P_m^2$ 去試驗 $a_m$，因為成本太高。一種想法是從上一輪的差值 $X_{m+1}$ 減去 $Y_m$ 得到 $X_{m}$ - $X_m = N^2 - P_m^2 = X_{m+1} - Y_m$ - $Y_m = P_m^2 - P_{m+1}^2 = 2P_{m+1}a_m+a_m^2$ 也就是紀錄上一輪的 $P_{m+1}$ 來計算 $Y_m$ 進一步調整，將 $Y_m$ 拆成 $c_m, d_m$ $$ c_m = P_{m+1}2^{m+1} \\ d_m = (2^m)^2 \\ Y_m=\left. \begin{cases} c_m+d_m & \text{if } a_m^2=2^m \\ 0 & \text{if } a_m=0 \end{cases} \right. $$ 藉由位元運算從 $c_m, d_m$ 推出下一輪$c_{m-1}, d_{m-1}$ - $c_{m-1}=P_m2^m=(P_{m+1}+a_m)2^m=P_{m+1}2^m+a_m2^m=\begin{cases} c_m/2+d_m & \text{if }a_m=2^m \\ c_m/2 & \text{if }a_m=0 \end{cases}$ - $d_{m-1}=\dfrac{d_m}{4}$ 結合上述方法撰寫演算法來尋求$a_n+a_{n-1}+...+a_0$，假設從 $a_n$ 開始往下試，至於如何求得 $a_n$ 後面再說明。因為是從 $a_n$ 開始，所以： - $X_{n+1}=N$ - 一開始都沒猜，差值是 N - $n$ 是最大的位數，使得 $a_n^2=(2^n)^2= d_n^2 = 4^n \leq N^2$ 所以用迴圈逼近 $d_n$， ```c int32_t d = 1 << 30; // The second-to-top bit is set. while (d > n) d >>= 2; ``` - $c_n = 0$ - $c_m=P_{m+1}2^{m+1}$。$a_n$ 已是已知最大，所以 $P_{n+1} = 0 \to c_n=0$ 因為 $d_m$ 恆大於 $0$，使用位移操作 $d_m$ 到最後必為 $0$，我們可以使用 `d!=0` 作為條件。又 $c_{-1} = P_0 = a_n+a_{n-1}+...+a_0$，算到最後的 $c_{-1}$ 即為所求 $N$，$a_n$ 也一併得知。對應程式碼由上述的理論可知，求出根號需要求得 $c_{-1}$ 即為所求 $N$ 。故寫出迭代的關係，即可求得解。首先，先判斷輸入的數值，是否符合求解根號的範圍。 ```c if (x <= 1) /* Assume x is always positive */ return x; ``` 由於開根號，本身在負數與 `0` 是無意義的，故不討論。且小數不能被整數型別 `int` 儲存，所以也不用討論。 `int m` 對應上述理論中的 $d_{m}$ ，由於首項 $d_n = (2^n)^2$，可知 ```c int m = 1UL << ((31 - __builtin_clz(x)) & ~1UL) ``` > [`__builtin_clz(x)`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 為 GCC 的內建函式。 > Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. > 其功能為找到 x 最高位數前面有幾個 0 ， x 為 0 時，未定義。 > `1UL` 為 `(unsigned long) 1` 。 `int m` 為了求得 32 位元下，輸入參數 `x` 的最高有效位數，並將其向下取成偶數。 :::warning 詳細待補的部分： 1. 並將其向下取成偶數，這裡的寫法我還是有點不確定，為什麼奇數會有問題，目前推估是在做 Bitwise >> 2 時的問題。 2. 首項 $d_n = (2^n)^2$ ，但程式碼中，似乎只有取出最高位數（$2^n$），並沒有做平方。 ::: `m >>= 2` 為 $d_{m-1}=\dfrac{d_m}{4}$ 除以 4 的計算。由程式與理論對應可得知， $b=Y_{m}$ $y=c_{m}$ $x=X_{m}$ 最終經由迭代，得知$c_{-1}$ 即為所求 $N$ 。 ### 效能測試使用 `perf stat` 來進行測試。為了不讓 I/O 影響到測試的時間，測試時先移除 `printf("%d\n", isqrt(N));`，並重複執行 100000 次，比較執行時間。編譯後，執行 `perf stat` ```shell $ perf stat -r 100000 ./isqrt1 ``` 版本一 ``` Performance counter stats for './isqrt1' (100000 runs): 0.29 msec task-clock # 0.561 CPUs utilized ( +- 0.01% ) 0 context-switches # 0.000 /sec 0 cpu-migrations # 0.000 /sec 62 page-faults # 211.765 K/sec ( +- 0.00% ) 99,4673 cycles # 3.397 GHz ( +- 0.01% ) 104,6590 instructions # 1.06 insn per cycle ( +- 0.00% ) 19,6820 branches # 672.253 M/sec ( +- 0.00% ) 5070 branch-misses # 2.58% of all branches ( +- 0.03% ) 0.000525393 +- 0.000000108 seconds time elapsed ( +- 0.02% ) ``` 版本二 ``` Performance counter stats for './isqrt2' (100000 runs): 0.24 msec task-clock # 0.500 CPUs utilized ( +- 0.01% ) 0 context-switches # 0.000 /sec 0 cpu-migrations # 0.000 /sec 51 page-faults # 211.142 K/sec ( +- 0.00% ) 79,3884 cycles # 3.287 GHz ( +- 0.01% ) 85,2080 instructions # 1.05 insn per cycle ( +- 0.00% ) 15,9847 branches # 661.772 M/sec ( +- 0.00% ) 4015 branch-misses # 2.51% of all branches ( +- 0.04% ) 0.000470606 +- 0.000000101 seconds time elapsed ( +- 0.02% ) ``` 版本三 ``` Performance counter stats for './isqrt3' (100000 runs): 0.24 msec task-clock # 0.502 CPUs utilized ( +- 0.01% ) 0 context-switches # 0.000 /sec 0 cpu-migrations # 0.000 /sec 51 page-faults # 211.153 K/sec ( +- 0.00% ) 79,5358 cycles # 3.293 GHz ( +- 0.01% ) 84,9166 instructions # 1.04 insn per cycle ( +- 0.00% ) 15,9826 branches # 661.720 M/sec ( +- 0.00% ) 4148 branch-misses # 2.59% of all branches ( +- 0.03% ) 0.0004700040 +- 0.0000000973 seconds time elapsed ( +- 0.02% ) ``` 單從執行 100000 次的測試，可知版本三的指令數量 (instructions) 最少，且執行時間最短（time elapsed）。 --- ## 測驗 `2` 以下程式碼計算整數除 `10`，不使用 `/` (除法) 或 `%` (modulo) 運算子。 ```c static inline int32_t div10(const int32_t x) { return (int32_t) (((int64_t) x * INT64_C(/* Your answer */)) >> INT64_C(35)); } ``` 補完上述程式碼，使其得以運作。 :arrow_down_small: 作答區 (完整的 `div10` 函式) ```c static inline int32_t div10(const int32_t x) { return (int32_t) (((int64_t) x * INT64_C(0xCCCCCCCD)) >> INT64_C(35)); } ``` 敘述 `>> INT64_C(35)` 實際作用為「除 $2^{35}$」，故 `INT64_C(/* Your answer */))` 為 $2^{35}\times\dfrac{1}{10}$ ，這時候答案會是 `0xCCCCCCCC` ，但是經由測試，當**整除**發生時，商數會比預期的小 1。誤差原因： `0xCCCCCCCC` = 3435973836~10~，且 $2^{35} = 34359738368$，利用 $2^{35} \times \dfrac{1}{10}$ 得到 `0xCCCCCCCC` 時，已捨棄掉最低一位數，導致問題出現。舉例來說，若 $x = 10 \times y$，商為 `y` ，則 $$ \begin{split} x \times 3435973836 \div 34359738368 &= y \times 10 \times 3435973836 \div 34359738368 \\ &= y \times 0.099999999976717 \\ \end{split} $$ 這時候即可知 $2^{35}\times\dfrac{1}{10}$ 捨去掉的最低位數，會出現計算商數比正確商數少一的情況 (原本應為 $y\times0.1$)。 `0xCCCCCCCD` 則是在被捨去掉最低一位數字的數值 (`0xCCCCCCCC`) 的狀況，加上 `1` ，確保捨去掉的位數，不會讓原先的數值低於捨去後（此做法建立在商數為整數時，不會因為捨去最低位數，而被無條件捨去成小於正確商數值減 `1` 的商值）。負數商：經由改寫成 `0xCCCCCCCD` 時，帶入負數 `-20` ，答案會出現 `-3`。負數情況，由於商數規則是捨去掉小數位，原本 `-5` 的商值依照二補數，捨去後，計算的商值會是 `-1` ($-10 = -1 \times 10$) ，但這不符合一般數學負數商數的規定，故 `+1` 符合負數商對應正數商的關係。根據 [C11 規格書](https://www.open-std.org/jtc1/sc22/WG14/www/docs/n1570.pdf)第 6.5.5 節 Multiplicative operators 的描述: > 5. The result of the / operator is the quotient from the division of the first operand by the second; the result of the % operator is the remainder. In both operations, if the value of the second operand is zero, the behavior is undefined. > > 6. When integers are divided, the result of the `/` operator is the algebraic quotient with any fractional part discarded. If the quotient `a/b` is representable, the expression `(a/b)*b + a%b` shall equal `a`; otherwise, the behavior of both a/b and a%b is undefined. 根據以上兩個規則敘述，可知： 1. 當除數為 `0` ，為未定義行為。 2. 商數運算是捨棄掉除法運算後的小數部分，如果商數運算不超出資料型別的表示範圍，則符合 `(a / b) * b + a % b = a` 也就是「$商數 \times 除數 + 餘數 = 被除數$」。超出資料型別範圍則未定義。因為商數運算是**捨棄掉除法運算後的小數部分**，可以從此規則上得知 3. 商的結果會朝向 `0` 取整數。 4. 餘數一定和被除數同正負號。故以上調整是合理的，程式碼中加入負數的處理，如下。 ```c static inline int32_t div10(const int32_t x) { int32_t result = (int32_t) (((int64_t) x * INT64_C(0xCCCCCCCD)) >> INT64_C(35)); return (x < 0) ? (result + 1): result; } ``` 將上式改成 branchless 寫法 ```c static inline int32_t div10(const int32_t x) { int32_t result = (int32_t) (((int64_t) x * INT64_C(0xCCCCCCCD)) >> INT64_C(35)); return result + (x >> 31 & 1); } ``` --- ## 測驗 `3` 延續上題，針對除數為 `10`、給定被除數，輸出商與餘數，不得使用除法或 modulo 運算子。 ```c typedef struct { int32_t quotient; int32_t rem; } divmod_t; static inline void divmod10(const int32_t x, divmod_t *ret) { /* Your answer here */ } ``` :arrow_down_small: 作答區 (完整的 `divmod10` 函式) 先以不改寫上題題目為情況下寫此題 ```c static inline void divmod10(const int32_t x, divmod_t *ret) { ret->quotient = (int32_t) (((int64_t) x * INT64_C(0xCCCCCCCD)) >> INT64_C(35)); ret->rem = (int32_t) (x - ret->quotient * 10); } ``` :::warning TODO: 1. 提供必要的錯誤檢查 2. 參照 [第 7 週課堂問答簡記](https://hackmd.io/@sysprog/SyOhklg-3)，改寫上述程式碼，探討執行成本。 :::