2024q1 Homework4 (quiz3+4)

# 2024q1 Homework4 (quiz3+4) contributed by < [`teresasa0731`](https://github.com/teresasa0731) > ## 測驗 3-1: square root ### sol.1 logarithmic search 題目中[第一版程式碼](https://github.com/teresasa0731/2024_Linux-Kernel/blob/main/hw4/1_square_Root/isqrt_1.c)使用二分搜尋法逼近所求的平方根結果，先以 $log_2(N)$取得要開始的最大二次冪`msb`後進入`while()`迴圈逐步逼近，從最大可能的平方根$2^{msb}$開始，每次向右移一位縮小`a`的增幅，直到為零即為 N 的整數部份平方根，時間複雜度為 $O(logN)$。 [第二版程式碼](https://github.com/teresasa0731/2024_Linux-Kernel/blob/main/hw4/1_square_Root/isqrt_2.c)則改以位移計數的方式來計算 MSB ，相對來說比使用$log_2(N)$還要快一些。 ### sol.2 digit-by-digit calculation [第三版程式碼](https://github.com/teresasa0731/2024_Linux-Kernel/blob/main/hw4/1_square_Root/isqrt.c)根據 [Digit-by-digit calculation](https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation) 的原理可以統整為以下：設欲求 $N^2$ 的平方根 N ，可將 $N^2$ 寫為 2 的冪總和，即 $$ N^2=(a_0+a_1+...+a_n)^2,a_m=2^m or 2^m=0, m \in \{0,1,...,n\} $$ 二次方項展開係數可以用矩陣表示 $$ \begin{bmatrix} a_0a_0 & a_1a_0 &...& a_na_0\\ a_0a_1 & a_1a_1 &...& a_na_1\\ a_0a_2 & a_1a_2 &...& a_na_2\\ ... &... &...& ... \\ a_0a_n & a_1a_n &...& a_na_n\\ \end{bmatrix} $$ 所以把原式整理後可以得到 $$ N^2=a_n^2+[2P_n+a_{n-1}]a_{n-1}+...+[2P_1+a_0]a_0 $$ 其中令$P_m=a_n+a_{n-1}+...+a_m=\sum_{i=m}^{n}a_i$，則$P_0=\sum_{i=0}^{n}a_i$ 即為所求平方根 N，顯而易見的是 $P_m = P_{m+1}+a_m$，而目的是試出所有的$a_m＝2^m$ 或 $a_m＝0$。 -- 接下來為了節省運算成本（直觀作法是直接每輪計算$N^2-P_m^2$，但運算成本與前面兩版本相比並沒有減少），透過計算差值 $X_m$ ，並將 $Y_m$ 拆成兩個係數$c_m$ & $d_m$： $$X_m = N^2 -P_m^2 = X_{m+1}-Y_m$$ 整理後得到$\Rightarrow Y_m = P_m^2 - P_{m+1}^2 = 2P_{m+1}a_m+a_m^2=c_m+d_m$，其中 $$c_m = P_{m+1}2^{m+1}, d_m=a_m^2, Y_m= \begin{align} \begin{cases} c_m+d_m & \text{if } a_m=2^m\\ 0 & \text{if } a_m = 0 \\ \end{cases} \end{align} $$ 每次迴圈會更新為： $c_{m-1} = P_m2^m=(P_{m+1}+a_m)2^m= \begin{align} \begin{cases} c_m/2+d_m &\text{if} a_m = 2^m\\ c_m/2 &\text{if} a_m = 0 \\ \end{cases} \end{align}$ $d_{m-1} = d_m/4$ 所以到最後$c_{-1}=P_0=a_n+a_{n-1}+...+a_0$，也就是我們要找的 $N$，而迴圈的截止條件可以用`d!=0`來判斷。 > 完整程式碼 [commit dc666d9](https://github.com/teresasa0731/2024_Linux-Kernel/blob/main/hw4/1_square_Root/isqrt.c) ### 使用 `ffs()`/`fls()` 改寫前面第三版程式碼使用`(31 - __builtin_clz(x))`來尋找最高位前有幾個零後計算出 MSB，為 GNU extension，此處改以`ffs()`從 LSB 找第一個 set bit 位置並回傳 index（由1開始）；在利用 shift 來找出 MSB。 ```cpp int i_sqrt_ffs(int x) { if (x <= 1) return x; int tmp = x, msb = 0; while (tmp) { int i = ffs(tmp); msb += i; tmp >>= i; } msb -= 1; ... } ``` > 完整程式碼 [commit 671c580](https://github.com/teresasa0731/2024_Linux-Kernel/commit/671c5801dc6ab98b18c3d44dc3b0c90bb3351508) ### Linux 核心相關程式碼在參考的 [block](https://github.com/torvalds/linux/tree/master/block) 目錄中搜尋後，找到用來監控特定時間窗口的延遲 [linux/block/blk-wbt.c](https://github.com/torvalds/linux/blob/master/block/blk-wbt.c) 中有平方根的相關動作： ``` - If the minimum latency in the above window exceeds some target, increment * scaling step and scale down queue depth by a factor of 2x. The monitoring * window is then shrunk to 100 / sqrt(scaling step + 1). ``` 當延遲超過目標值時，會增加縮放步驟(scaling step)、將佇列深度以2x因子縮減，並且監控窗口會相應縮小到 100 / sqrt(scaling step + 1)，可以在實際呼叫的`rwb_arm_timer` 中看到具體實現： ```cpp static void rwb_arm_timer(struct rq_wb *rwb) { struct rq_depth *rqd = &rwb->rq_depth; if (rqd->scale_step > 0) { /* * We should speed this up, using some variant of a fast * integer inverse square root calculation. Since we only do * this for every window expiration, it's not a huge deal, * though. */ rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4, int_sqrt((rqd->scale_step + 1) << 8)); } else { /* * For step < 0, we don't want to increase/decrease the * window size. */ rwb->cur_win_nsec = rwb->win_nsec; } blk_stat_activate_nsecs(rwb->cb, rwb->cur_win_nsec); } ``` 而以上呼叫的開根號函式 `int_sqrt` 在 linux 核心的 `/lib/math/int_sqrt.c` 可以看到 ```cpp unsigned long int_sqrt(unsigned long x) { unsigned long b, m, y = 0; if (x <= 1) return x; m = 1UL << (__fls(x) & ~1UL); while (m != 0) { b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } return y; } EXPORT_SYMBOL(int_sqrt); ``` 除了改成使用 `__fls(x)` 以及用 while loop 取代迴圈外，整體算法是相同的。 ### 延伸討論在論文〈[Timing Square Root](https://web.archive.org/web/20210208132927/http://assemblyrequired.crashworks.org/timing-square-root/)〉比較不同開根號方法的效能差異： 1. 編譯器內建 `sqrt()` （x87 FSQRT opcode) 2. SSE 的 opcode `sqrtss` 3. The “magic number” approximation technique 4. 利用 SSE opcode `rsqrtss` 取得 $1/\sqrt{a}$ 的估算值再乘上 $a$ 5. 方法 4. + [Newton-Raphson iteration](https://web.archive.org/web/20210208132927/http://en.wikipedia.org/wiki/Newtons_method) 來增加精度 6. 方法 5. 的第 13 行展開，每次迴圈處理四個浮點數。在測試結果中方法4是最快的，但方法5＆6的平均誤差可以做到0.0000％，且原本內建的`sqrt()`是最慢的，幾乎相差到10倍左右。根據以上的實驗結果，改進開平方根應該是一個可以有效提昇整體運作的提案。 not yet completed... ## 測驗 3-2: `div` & `mod` operation 題目希望以 bitwise operation 來實作除法（題意是"針對 RV32I/RV64I 這類沒有整數乘法和除法指令的指令集，該如何撰寫有效的程式。"，源自[討論區老師的回覆](https://www.facebook.com/groups/system.software2024/?multi_permalinks=1572555173522375&notif_id=1711546821881676&notif_t=feedback_reaction_generic&ref=notif)），但在含有非 2 的冪項的除數情況下，單純的 bitwise shifting 操作會是不準確的。依照題目的推導證明：若存在 $n \in N$，一除數 $x$ 使 $\frac{n}{x}$ 直到小數第一位都是精準的，即 $a.b\le\frac{n}{x}\le a.b9$；則對於任何小於 $n$ 的非負整數 $l$，也會到小數第一位都是精準的。所以在題目的輸入範圍，也就是除數不大於19的情況下，能夠維持小數第一位精準度的除數 $x$ 的範圍是： $$1.9 \le \frac{19}{x} \le 1.99 \rightarrow 9.55 \le x \le 10$$ 接下來解析一下程式碼的計算方法: ```cpp uint32_t x = (in | 1) - (in >> 2); /* div = in/10 ==> div = 0.75*in/8 */ uint32_t q = (x >> 4) + x; ``` 先計算 $0.75*in$，而`(in | 1)` 推測是為了確保輸入是奇數，因為在做 div 10 時會出現一位小數，為確保在移位後不會有整除情況（偶數）。下一步則是進一步算出 $\frac{3}{4}*in+\frac{3}{64}*in=\frac{51}{64}*in \approx 0.79*in$ ```cpp x = q; q = (q >> 8) + x; q = (q >> 8) + x; q = (q >> 8) + x; q = (q >> 8) + x; ``` 接下來為了讓 $x$ 更逼近 $\frac{8}{10}*in$，連續加了四次 $\frac{1}{256}*1.79*in$。 ```cpp *div = (q >> 3); // CCCC *mod = in - ((q & ~0x7) + (*div << 1)); // DDDD ``` 最後將 $q$ 右移三位，也就是除以8，變成 $\frac{1}{10}*in$。餘數則是 $mod = in - div * 10 = in - (div * 8 + div * 2)$ ### 比較依賴除/乘法與否的指令序列 TODO... ### 實作 `% 9` (modulo 9) 和 `% 5` (modulo 5) 程式碼 TODO... ## 測驗 3-3: ilog2 > 完整程式碼 [commit 0a54569](https://github.com/teresasa0731/2024_Linux-Kernel/commit/0a5456932f531e753ec068099d66119f15022e57) #### (版本一) ```cpp int ilog2(int i) { int log = -1; while (i) { i >>= 1; log++; } return log; } ``` 不斷右移(即除以2)直到被除數為 0 為止，迴圈數以 MSB 的位數決定。 #### (版本二) ```cpp static size_t ilog2(size_t i) { size_t result = 0; while (i >= 65536) { // AAAA result += 16; i >>= 16; } while (i >= 256) { // BBBB result += 8; i >>= 8; } while (i >= 16) { // CCCC result += 4; i >>= 4; } while (i >= 2) { result += 1; i >>= 1; } return result; } ``` 透過一次位移 16、8、4、1 個 bits 來加快尋找 MSB 的位數。 #### (版本三) ```cpp int ilog32(uint32_t v) { return (31 - __builtin_clz(v | 1)); // DDDD } ``` 直接透過測驗一提到的`__builtin_clz(x)`來找出第一個非零位以前的 0 的個數，並為確保輸入 0 時不會無效，先將輸入做處理，最後用 31 減去 0 的個數即為 MSB 位數。 ### Linux 核心相關程式碼在[`/include/linux/log2.h`](https://elixir.bootlin.com/linux/latest/source/include/linux/log2.h)中可以看到與 log2 相關的定義： #### arch linux 在 arch 架構下允許 `asm/bitops.h` 用更有效率的 log2 實作覆蓋使用 `fls()` 版本的程式碼 ```cpp #ifndef CONFIG_ARCH_HAS_ILOG2_U32 static __always_inline __attribute__((const)) int __ilog2_u32(u32 n) { return fls(n) - 1; } #endif ``` #### const_ilog2 - log base 2 of 32-bit or a 64-bit ==constant== unsigned value ```cpp #define const_ilog2(n) \ ( \ __builtin_constant_p(n) ? ( \ (n) < 2 ? 0 : \ (n) & (1ULL << 63) ? 63 : \ (n) & (1ULL << 62) ? 62 : \ (n) & (1ULL << 61) ? 61 : \ ... (n) & (1ULL << 4) ? 4 : \ (n) & (1ULL << 3) ? 3 : \ (n) & (1ULL << 2) ? 2 : \ 1) : \ -1) ``` 先使用 `__bulitin_constant_p() 檢測表示式是否為編譯期常數，將常數直接轉換（類似查表的動作），在[你所不知道的 C 語言：編譯器和最佳化原理篇](https://hackmd.io/@sysprog/c-compiler-optimization#From-Source-to-Binary-How-A-Compiler-Works-GNU-Toolchain)中也有提到用將迴圈展開的最佳化手法。 #### ilog2 - log base 2 of 32-bit or a 64-bit unsigned value ```cpp #define ilog2(n) \ ( \ __builtin_constant_p(n) ? \ ((n) < 2 ? 0 : \ 63 - __builtin_clzll(n)) : \ (sizeof(n) <= 4) ? \ __ilog2_u32(n) : \ __ilog2_u64(n) \ ) ``` 同樣先檢查是否為編譯期常數，是則使用版本三的方法直接計算，否則依照輸入數字為 32-bit 或是 64-bit 呼叫函式。 ## 測驗 3-4: EWMA Exponentially Weighted Moving Average (EWMA; 指數加權移動平均) 是一種統計上取平均的手法，並且==越久以前的歷史資料的權重也會越低==，以下為 EWMA 的數學定義: $$ S_t = \begin{aligned} \begin{cases} Y_0 & t=0\\ \alpha Y_t + (1-\alpha)* S_{t-1} & t>0\\ \end{cases} \end{aligned} $$ 根據題目定義 `ewma` 的結構，其中 `internal` 為當前 `ewma` 的值， `factor` 為縮放因子， `weight` 為衰減的歷史加權參數 $\alpha$(the decay rate)。 ```cpp struct ewma { unsigned long internal; unsigned long factor; unsigned long weight; }; ``` 對結構體 `ewma` 進行初始化，檢查 `factor` 與 `weight` 是否為 2 的冪次；並引入 `ilog2()` 對以上兩個參數取 2 的次方。 ```cpp void ewma_init(struct ewma *avg, unsigned long factor, unsigned long weight) { if (!is_power_of_2(weight) || !is_power_of_2(factor)) assert(0 && "weight and factor have to be a power of two!"); avg->weight = ilog2(weight); avg->factor = ilog2(factor); avg->internal = 0; } ``` 添加新資料 `val` 時，若 `avg->internal` 為空則代表為 $Y_0$，此時直接將輸入值位移後填入即可；否則須進行更新，因比原式多了縮放因子，將以上公式調整後可以與程式碼相應： $$ S_t = \begin{aligned} \begin{cases} Y_0 & t=0\\ \alpha Y_t + S_{t-1} - \alpha S_{t-1} & t>0\\ \end{cases} \end{aligned} $$ ```cpp struct ewma *ewma_add(struct ewma *avg, unsigned long val) { avg->internal = avg->internal ? (((avg->internal << avg->weight) - avg->internal) + (val << avg->factor)) >> avg->weight : (val << avg->factor); //EEEE, FFFF return avg; } ``` ### Linux 核心相關程式碼在 Linux 核心中的 [/include/linux/average.h](https://elixir.bootlin.com/linux/latest/source/include/linux/average.h) 有提到與 EWMA 相關的操作，其定義了 `DECLARE_EWMA(name, _precision, _weight_rcp)`的巨集，並可以傳入計算精度 `_precision` ，而數值的更新則依據新值權重為 `1/_weight_rcp` ，舊狀態權重則為 ` 1 - 1/_weight_rcp` 。關於無線網路裝置驅動程式應用部份，在 [/drivers/net/wireless](https://elixir.bootlin.com/linux/latest/source/drivers/net/wireless) 中找到很多與 EWMA 相關的應用，但目前僅能看出設定參數如 `DECLARE_EWMA(rssi, 10, 16)` 是在設定客戶端保持連接所需的信號強度指示，並設定其精度為 10 位以及權重倒數為 16。 ## 測驗 3-4: $\lceil log_2(x) \rceil$ 此題為了計算 $log_2(x)$ 向上取整的值，以二分法的形式逐步將 `x` 移位到為 0 或 1 ，最後回傳值則須`+1`代表向上取整（因一開始 `x` 先減 1 ，因此處理輸入為 2 的冪次時以為整數不須取整的情況。） ```cpp int ceil_ilog2(uint32_t x) { uint32_t r, shift; x--; r = (x > 0xFFFF) << 4; x >>= r; shift = (x > 0xFF) << 3; x >>= shift; r |= shift; shift = (x > 0xF) << 2; x >>= shift; r |= shift; shift = (x > 0x3) << 1; x >>= shift; return (r | shift | x > 1) + 1; // GGGG } ``` #### 改進程式碼，使其得以處理 `x = 0` 的狀況，並仍是 branchless `x = 0`會使 `x--` 執行時變成 0xFFFFFFFF，為了避免這個情況，則須判斷在 x > 0 時才執行 +1 的動作，而為保持 branchless，將該行改為 `x += !(!x)` 先將 x 轉換成布林值後再取邏輯非運算。 ```diff int ceil_ilog2(uint32_t x) { uint32_t r, shift; - x--; + x -= !!x; r = (x > 0xFFFF) << 4; x >>= r; shift = (x > 0xFF) << 3; ... ``` > 完整程式碼 [commit 490873e](https://github.com/teresasa0731/2024_Linux-Kernel/commit/490873e77f47ed55dd0635f6f32242a10696797a) ### Linux 核心相關程式碼在 [/include/linux/sched.h](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L1600) 中有類似的操作： ```cpp static inline char task_index_to_char(unsigned int state) { static const char state_char[] = "RSDTtXZPI"; BUILD_BUG_ON(1 + ilog2(TASK_REPORT_MAX) != sizeof(state_char) - 1); return state_char[state]; } ``` 用來確保 `state_char[]` 的陣列大小與常數 `TASK_REPORT_MAX` 所需的位元數相匹配。