2022q1 Homework2 (quiz2)

--- tags: linux2022 --- # 2022q1 Homework2 (quiz2) contributed by < [laneser](https://github.com/laneser) > > [作業要求](https://hackmd.io/@sysprog/BybKYf5xc) ## 2022q1 第 2 週測驗題 ### 測驗 1 > 考慮以下對二個無號整數取平均值的程式碼: > ```cpp > #include <stdint.h> > uint32_t average(uint32_t a, uint32_t b) > { > return (a + b) / 2; > } > ``` > 改寫為 > ```cpp > #include <stdint.h> > uint32_t average(uint32_t a, uint32_t b) > { > return (a >> 1) + (b >> 1) + (EXP1); > } > ``` #### 想法 & 思考看到 `a >> 1` 就會直覺想到是結果相等於把 `a` 除以 2. 但會無條件捨去最後一個 bit. 因此 EXP1 應該就是彌補了 `a`, `b` 都是無條件捨去的情況下，結果跟 `(a + b) / 2` 的差異。也就是 EXP1 需要利用 `a`, `b` 的最後一個 bit 產生一個彌補值，相當於 `(a + b) / 2` 跟 `(a >> 1) + (b >> 1)` 之間的差異。所以可以列出 `a`, `b` 最後一個 bit 跟我們期望的答案的真值表。 a lowest bit | b lowest bit | EXP1 result -----------|:----------:| ----------- 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 那就很明顯是這兩個 bits 的 `AND` 結果。所以答案就是 `(a & 1) & (b & 1)`, 簡化為 `a & b & 1`。 > 改寫為 > ```cpp > uint32_t average(uint32_t a, uint32_t b) > { > return (EXP2) + ((EXP3) >> 1); > } > ``` 題目要求只能用 `AND`, `OR`, `XOR` 三種 bitwise 運算子，也就意味著這三種方法跟 `+`,`-` 有一定的關聯。只好求救 Google, 找到 [Bitwise XORing two numbers results in sum or difference of the numbers](https://stackoverflow.com/questions/21146103/bitwise-xoring-two-numbers-results-in-sum-or-difference-of-the-numbers), 提到 `x + y = (x ^ y) + ((x & y) << 1)` , 這個等式就可以聯想到我們要求的答案是 `(a + b) >> 1` , 所以將這個等式兩邊都除以 2, 也就是 bitwise 右移一位。`(x + y) >> 1 = (x ^ y) >> 1 + (x & y)`, 這樣就跟我們要的答案非常接近了。`EXP2` 應該就是 `a & b`, 而 `EXP3` 就是 `a ^ b`. #### 延伸問題 >嘗試解讀編譯器最佳化開啟的狀況，對應的組合語言輸出。利用 `gcc -S -O3 test1.c` 可以得到每個 average 實作函式對應的組合語言輸出。以下都移除 assembly directives, 因為那是給編譯器看的，按題目需求應該是專注於給 CPU 執行的指令 (instructions). ##### (a + b) / 2 ```assembly ; uint32_t average(uint32_t a, uint32_t b) {return (a + b) / 2;} average: endbr64 leal (%rdi,%rsi), %eax shrl %eax ret ``` `endbr64` 根據 [What does the endbr64 instruction actually do?](https://stackoverflow.com/questions/56905811/what-does-the-endbr64-instruction-actually-do) 解釋為 End for branch 64 bit. 在後續的其他函式也都是一開始就執行這個指令，在這裡也可以解釋為 `nop`, 沒有任何動作。 `leal` 這個指令是 `load effactive address` 的 64bits 版本，但常常用來取代 `add` 指令, [What's the purpose of the LEA instruction?](https://stackoverflow.com/questions/1658294/whats-the-purpose-of-the-lea-instruction) 可以理解為，為了支持高階語言的陣列功能，x86 指令集特別弄了這個，但被發現跟 `add` 有異曲同工之妙，但又比 `add` 好用。總之在這裡，就意味著把傳進來的兩個參數 `a`,`b`, 分別放在 `esi`,`edi` 這兩個暫存器內，直接加起來把結果存入 `eax` 暫存器。 `shrl` 很好理解，就是右移一個 bit. 把 `eax` 暫存器內容除以 2 的意思。最後回傳值放在 `eax`. ##### low + (high - low) / 2 ```c ;uint32_t average2(uint32_t a, uint32_t b) {return a + (b-a) / 2;} average2: endbr64 movl %esi, %eax subl %edi, %eax shrl %eax addl %edi, %eax ret ``` 前兩行就是把 `b-a` 的結果放在 `eax`, 再右移一個 bit. 最後加上 `a`. 所以如果 `b-a` 的結果會 overflow, 那最後的結果也就不正確了。 #### (a >> 1) + (b >> 1) + (a & b & 1) ```c= ;uint32_t average_EXP1(uint32_t a, uint32_t b){return (a >> 1) + (b >> 1) + (a & b & 1);} average_EXP1: endbr64 movl %edi, %eax movl %esi, %edx andl %esi, %edi shrl %eax shrl %edx andl $1, %edi addl %edx, %eax addl %edi, %eax ret ``` 第 4,7 行實作了 `b >> 1` , 第 5,8 行實作了 `a >> 1`, 而這兩個結果分別放在 `eax`,`edx`. 第 6,9 行實作了 `a & b & 1`, 結果存放於 `edi`. 最後 10,11 行把這三個結果加起來回傳。 #### `(a & b) + ((a ^ b) >> 1)` ```c ;uint32_t average_EXP2_EXP3(uint32_t a, uint32_t b){return (a & b) + ((a ^ b) >> 1);} average_EXP2_EXP3: endbr64 movl %edi, %eax andl %esi, %edi xorl %esi, %eax shrl %eax addl %edi, %eax ret ``` 前三行實作了 `a & b`, `a ^ b`, 分別放在 `edi`,`eax`. 接著再把 `a ^ b` 右移一個 bit. 最後加起來回傳。 #### 研讀 Linux 核心原始程式碼 `include/linux/average.h`，探討其 Exponentially weighted moving average (EWMA) 實作，以及在 Linux 核心的應用先了解 EWMA 的定義: $S_t = \begin{cases} Y_0, & t = 0 \\ \alpha Y_t + (1 - \alpha) \cdot S_{t-1}, & t > 0 \end{cases}$ `average.h` 定義了一個大 macro, 共有三個參數, 分別是 - `name`: 表示整個 macro 定義出來的一組 macro 以及 struct 的名稱。 - `_percision`: 表示用多少個 bits 來表示內部儲存的小數部分。 - `_weight_rcp`: 表示權重 $\alpha$ 的倒數, $\alpha=1 / weight\_rcp$。接著就是拆解整個 macro 的每一部分: ```cpp struct ewma_##name { unsigned long internal; }; ``` 表示這個 macro 會定義出一個 ewma_ 開頭, `name` 參數結尾的 struct, 內部只有一個 unsigned long internal. ```cpp static inline void ewma_##name##_init(struct ewma_##name *e) { BUILD_BUG_ON(!__builtin_constant_p(_precision)); BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp)); /* * Even if you want to feed it just 0/1 you should have * some bits for the non-fractional part... */ BUILD_BUG_ON((_precision) > 30); BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp); e->internal = 0; } ``` 在使用之前，需要先呼叫 `ewma_##name##_init` 這個 incline function. 會在 compiler 時期就檢查所有該檢查的參數，而真正的 code 只有 `e->internal = 0`, 看來 linux kernel 傾向一切能夠在 compiler 時就能做的就盡量做，這樣可以最早發現問題，而且又確保了執行期的速度與正確性。 ```cpp static inline unsigned long ewma_##name##_read(struct ewma_##name *e) { BUILD_BUG_ON(!__builtin_constant_p(_precision)); BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp)); BUILD_BUG_ON((_precision) > 30); BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp); return e->internal >> (_precision); } ``` 看到這裡我有個小疑問，難道這四行檢查碼會因為放在不同的 macro 當中而有所不同嗎? 看起來都一樣的程式為什麼要寫第二遍呢? 我想像中，只要寫一次，應該會在 compiler 時期就會發生錯誤警告了。不過，做完檢查就只剩下一行, 也就是讀取值的時候直接把 `internal` 右移 `_precision` 這麼多個 bits 後回傳。 ```cpp static inline void ewma_##name##_add(struct ewma_##name *e, unsigned long val) { unsigned long internal = READ_ONCE(e->internal); unsigned long weight_rcp = ilog2(_weight_rcp); unsigned long precision = _precision; BUILD_BUG_ON(!__builtin_constant_p(_precision)); BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp)); BUILD_BUG_ON((_precision) > 30); BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp); WRITE_ONCE(e->internal, internal ? (((internal << weight_rcp) - internal) + (val << precision)) >> weight_rcp : (val << precision)); } ``` 這個 macro 就是最主要的功能。其中利用了 `READ_ONCE` 來讀取 `e->internal`, 也利用 `WRITE_ONCE` 來寫入 `e->internal`, 看來是為了避免在不同的 thread 當中存取所發生的問題。另外 [`ilog2`](https://github.com/torvalds/linux/blob/master/include/linux/log2.h#L156) 函式看來也是個 macro 可以在 compiler time 就可以運算出 log2 的值。算出 log2 的值就可以運用 bits 位移加速乘法以及除法運算。為了避免 double evaluation, `_preision` 先指派給 `precision` 取值後再使用， macro 的世界都要小心翼翼的。所以最後一行的意思是 (用 _val 取代 `val << precision`) - internal 沒有值的時候: `e->internal = _val` - internal 有值的時候: `e->internal = (internal*(_weight_rcp - 1) + _val)/_weight_rcp` 如果把 $\_weight\_rcp = 1/\alpha$ e->internal = ($internal*(1/\alpha - 1) + \_val)*\alpha$ e->internal = $internal*(1 - \alpha) + \_val*\alpha$ 這正是 Exponentially weighted moving average (EWMA) 的定義。探究這個 `DECLARE_EWMA` 在 linux kernel 當中的應用，最快的方法就是搜尋它被使用在哪裡。 - 用在 network or network driver, 推論跟 ethernet 有關 - [net/mac80211/sta_info.h](https://github.com/torvalds/linux/blob/master/net/mac80211/sta_info.h#L126) - [drivers/net/wireless/mediatek/mt76/mt76x02_mac.h](https://github.com/torvalds/linux/blob/master/drivers/net/wireless/mediatek/mt76/mt76x02_mac.h#L34) - [drivers/net/wireless/realteck/rtw88/main.h](https://github.com/torvalds/linux/blob/master/drivers/net/wireless/realtek/rtw88/main.h#L642) - [drivers/net/wireless/mediatek/mt7601u/mt7601u.h](https://github.com/torvalds/linux/blob/master/drivers/net/wireless/mediatek/mt7601u/mt7601u.h#L134) - ...還有十幾處都是 net or drivers/net 下方的實作 - gpu - [drivers/gpu/drm/i915/gt/intel_context_types.h](https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/gt/intel_context_types.h#L24) - [drivers/gpu/drm/drm_self_refresh_helper.c](https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/drm_self_refresh_helper.c#L56) 以網路來說, 參考 [sta_info.h](https://github.com/torvalds/linux/blob/6441998e2e37131b0a4c310af9156d79d3351c16/net/mac80211/sta_info.h#L643) 的註解說明, 這個是 average ACK signal 的意思, 在每次 [ack](https://github.com/torvalds/linux/blob/9e9fb7655ed585da8f468e29221f0ba194a5f613/net/mac80211/status.c#L1168) 的時候都會更新一次. Google [文章](https://www.slideserve.com/dianerosa/outline-rtt-estimation-using-ewma-stop-and-wait-sliding-window-algorithms-powerpoint-ppt-presentation) 說是用在估算 Round Trip Time (RTT), 相關資料在 [RFC793](https://datatracker.ietf.org/doc/html/rfc793) page 40 當中提到 Retransmission Timeout 需要動態被估算, 而且需要得到 RTT 這個資訊, 根據 [這份簡報](https://courses.cs.washington.edu/courses/cse461/07wi/lectures/lec12.pdf) 提到 $$ \begin{align} EstimatedRTT = (1-g)(EstimatedRTT) + g(SampleRTT)\\ 0 ≤ g ≤ 1, usually\ g = .1\ or\ .2 \end{align} $$ 這個公式就是 EWMA. 而 gpu 似乎是跟 DRM (Direct Rendering Manager) 有關. --- ### 測驗 2 > 改寫〈解讀計算機編碼〉一文的「不需要分支的設計」一節提供的程式碼 min，我們得到以下實作 (max): > ```cpp > #include <stdint.h> > uint32_t max(uint32_t a, uint32_t b) > { > return a ^ ((EXP4) & -(EXP5)); > } > ``` #### 想法 & 思考既然是取比較大的值, 結果只會有兩種: a , b. 若是 a 比較大, 則結果應該類似 a ^ 0 = a, 若是 b 比較大, 結果應該類似 a ^ (a ^ b) = b. 所以 `((EXP4 & -(EXP5))` 的結果應該會如下: - if a >= b then `((EXP4 & -(EXP5)) = 0` - if a < b then `((EXP4 & -(EXP5)) = (a ^ b)` 這時可以想到 `EXP4` 應該是 `a ^ b` , 而 `-(EXP5)` 應該如下: - if a >= b then `-(EXP5) = 0b` 使得 `(a ^ b) & 0 = 0` - if a < b then `-(EXP5) = 0xFFFFFFFF = -1` 使得 `(a ^ b) & 0xFFFFFFFF = (a ^ b)` 因此 `EXP5` 就是 `a < b` #### 延伸問題 > 針對 32 位元無號/有號整數，撰寫同樣 branchless 的實作 ```cpp #include <stdint.h> int32_t max_signed(int32_t a, int32_t b) { return a ^ ((a ^ b) & -(a < b)); } ``` 不管 a , b 是否為 signed 或 unsigned, `-(a < b)` 的結果就是 signed int, 且為 0x000000 或者 0xFFFFFFFF 這兩種結果之一, 所以實作的方法是一樣的。 > 請在 Linux 核心原始程式碼中，找出更多類似 branchless 的實作手法。請善用 git log 檢索。 [kvm/mmu.h](https://github.com/torvalds/linux/blob/master/arch/x86/kvm/mmu.h#L258) from [commits](https://github.com/torvalds/linux/commit/97ec8c067d322d32effdc1701760d3babbc5595f) ```cpp /* * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1. * * If CPL = 3, SMAP applies to all supervisor-mode data accesses * (these are implicit supervisor accesses) regardless of the value * of EFLAGS.AC. * * This computes (cpl < 3) && (rflags & X86_EFLAGS_AC), leaving * the result in X86_EFLAGS_AC. We then insert it in place of * the PFERR_RSVD_MASK bit; this bit will always be zero in pfec, * but it will be one in index if SMAP checks are being overridden. * It is important to keep this branchless. */ unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC); ``` [net/xfrm.h](https://github.com/torvalds/linux/blob/master/include/net/xfrm.h#L830) from [commits](https://github.com/torvalds/linux/commit/6c786bcb29dd684533ec165057f437d5bb34a4b2) ```cpp static inline bool addr4_match(__be32 a1, __be32 a2, u8 prefixlen) { /* C99 6.5.7 (3): u32 << 32 is undefined behaviour */ if (sizeof(long) == 4 && prefixlen == 0) return true; return !((a1 ^ a2) & htonl(~0UL << (32 - prefixlen))); } ``` 這裡的 if 是為了避免產生 u32 << 32 這種未定義的行為。而且這次 commit 是從組合語言角度說明了 `Branch savings: 1`. [Rework 64 bits shifts to avoid tests and branches](https://github.com/torvalds/linux/commit/e7de0023e1232f42a10ef6af03352538cc27eaf6) - 但這個我真的看不懂還有以前的一些紀錄在 git logs 裡面, 但目前的 linux kernel 似乎已經不存在了: - [Optimize pte permission checks](https://github.com/torvalds/linux/commit/97d64b788114be1c4dc4bfe7a8ba2bf9643fe6af) - [simplify last_pte_bitmap](https://github.com/torvalds/linux/commit/6bb69c9b69c315200ddc2bc79aee14c0184cf5b2) - [improve CRC32 performance for deep pipelines](https://github.com/torvalds/linux/commit/efdb25efc7645b326cd5eb82be5feeabe167c24e) - [Replace int2float() with an optimized version.](https://github.com/torvalds/linux/commit/747f49ba67b8895a5831ab539de551b916f3738c) --- ### 測驗 3 >考慮以下 64 位元 GCD (greatest common divisor, 最大公因數) 求值函式: > > ```cpp > #include <stdint.h> > uint64_t gcd64(uint64_t u, uint64_t v) > { > if (!u || !v) return u | v; > while (v) { > uint64_t t = v; > v = u % v; > u = t; > } > return u; > } > ``` > 改寫為以下等價實作: > ```cpp > #include <stdint.h> > uint64_t gcd64(uint64_t u, uint64_t v) > { > if (!u || !v) return u | v; > int shift; > for (shift = 0; !((u | v) & 1); shift++) { > u /= 2, v /= 2; > } > while (!(u & 1)) > u /= 2; > do { > while (!(v & 1)) > v /= 2; > if (u < v) { > v -= u; > } else { > uint64_t t = u - v; > u = v; > v = t; > } > } while (COND); > return RET; > } > ``` #### 想法 & 思考在前幾個步驟都是根據 GCD 演算法，可以得到很好的理解。 ```cpp if (!u || !v) return u | v; ``` 就實作了兩個步驟: 1. if both x, y are 0, $gcd(0,0) = 0$ 2. $gcd(x,0)=x$ and $gcd(0,y)=y$ ```cpp int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } ``` 這個步驟就是 3. if x, y are both even, $gcd(x,y)=2*gcd(\frac{x}{2},\frac{y}{2})$ 在迴圈執行完畢的時候, `!((u | v) & 1)` 是 false, 意味著 `(u | v)` 的最後一個 bit 是 1, 也就是 `u` 或 `v` 其中一個是奇數. 或者兩者都是奇數。而此時的 $2^{shift}*gcd(u,v)$ 就是原本要求的 GCD。接著的 ```cpp while (!(u & 1)) u /= 2; ``` 以及 ```cpp while (!(v & 1)) v /= 2; ``` 因為此時必定不會存在 `u`, `v` 同時為偶數的狀況，所以其中一個是偶數就可以立刻把 `2 這個非公因數` 給剔除。相當於演算法當中的 4. If x is even and y is odd, then $gcd(x,y)=gcd(\frac{x}{2},y)$ 以及 5. If x is odd and y is even, then $gcd(x,y)=gcd(x,\frac{y}{2})$ 最後迴圈中的 ```cpp if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } ``` 是利用輾轉相除法的原理，只是用了連續減法來找出相除的餘數結果。輾轉相除法就是除到餘數為 0 就是答案。最後一步 `u = v` 的時候, 就是餘數為 0 的時候，此時 `uint64_t t = u - v` 會讓 `t` 為 0, 然後被指派給 `v`. 所以 `COND` 就是餘數 `v` , 而 `u` 在這裡就是最後的那個最大公因數。但要記得真正的答案是 $2^{shift}*gcd(u,v)$ . 因此 `RET` 就是 `u << shift`. #### 延伸問題 > 在 x86_64 上透過 __builtin_ctz 改寫 GCD，分析對效能的提升; 在理解了 `__builtin_ctz` 的意義後，顯然是針對 ```cpp while (!(u & 1)) u /= 2; ``` 這樣的程式碼，可以直接等於 ```cpp u >> __builtin_ctz(u) ``` 所以實作的程式碼為 ```cpp uint64_t gcd64_v3(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int u_ctz = __builtin_ctzl(u); int v_ctz = __builtin_ctzl(v); u = u >> u_ctz; v = v >> v_ctz; int shift = u_ctz > v_ctz ? v_ctz : u_ctz; do { v = v >> __builtin_ctzl(v); if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (v); return u << shift; } ``` 可以連續呼叫後，計算耗費的時間來分析效能: ```cpp float execute_cost(gcd_function func, int times) { clock_t start = clock(); while (times--) { func(rand64(), rand64()); } clock_t end = clock(); return (float)(end - start) / CLOCKS_PER_SEC; } ``` 在我的環境中執行 1000000 次的結果為 ```shell gcd64_v2 cost 0.474353 seconds, __builtin_ctz cost 0.203850 seconds, Uplift performance 132.697067% ``` 大約快了 1.3 倍左右。 > Linux 核心中也內建 GCD (而且還不只一種實作)，例如 `lib/math/gcd.c`，請解釋其實作手法和探討在核心內的應用場景。 [lib/math/gcd.c](https://github.com/torvalds/linux/blob/master/lib/math/gcd.c) 當中主要的 gcd 實作如下: ```cpp unsigned long gcd(unsigned long a, unsigned long b) { unsigned long r = a | b; if (!a || !b) return r; b >>= __ffs(b); if (b == 1) return r & -r; for (;;) { a >>= __ffs(a); if (a == 1) return r & -r; if (a == b) return a << __ffs(r); if (a < b) swap(a, b); a -= b; } } ``` 這段程式碼主要由 [Zeng Zhaoxiu 所提的 commit](https://github.com/torvalds/linux/commit/fff7fb0b2d908dec779783d8eaf3d7725230f75e) 貢獻的. 他在這段 commit log 當中直接把 benchmark 的程式跟結果展示出來，比原本的速度快了很多。提到了加速的主因，就是 `%` 這個求餘數在 CPU 指令集是用除法做的，即使是 x86 這樣的已經針對除法優化的 CPU, 用減法搭配檢查 2 的因數依舊快了 25%, 更別提 Alpha 或者 ARMv6 這種僅僅是利用 `emulation code` 來實作除法的 CPU. 加速效果更是明顯。 ``` There are two variants of the code here, depending on whether a fast __ffs (find least significant set bit) instruction is available. ``` 也提到了 `__ffs` 是利用 CPU 提供的指令，意義上等同 `__builtin_ctz` 的實作。其中的 `r & -r` 讓我思考蠻久的，是把 `r` 跟 `r` 的二進位補數做 bitwise AND 運作，這會使得 `r` 僅保留從低位元數過來的第一個 1 bit，意義上就是 `1 << __builtin_ctz(r)`. 剩下的部分就幾乎跟本次的測驗題目一樣了。 > 核心內對於 gcd 的應用場景 - 對於頻率相關運算，用在影音或無線網路等等，這應該是主要的應用了 - [pcm_timer.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/sound/core/pcm_timer.c#L19) - [adau-utils.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/sound/soc/codecs/adau-utils.c#L15) - [aptina-pll.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/drivers/media/i2c/aptina-pll.c) - [iio-rescale.c](https://github.com/torvalds/linux/blob/79160a603bdb51916226caf4a6616cc4e1c58a58/drivers/iio/afe/iio-rescale.c#L177) - [amdgpu_afmt.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/drivers/gpu/drm/amd/amdgpu/amdgpu_afmt.c#L51) - ... - 數學 - 最小公倍數 [lcm.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/lib/math/lcm.c#L8) 可以利用 gcd 實作 - Load balancer IPVS - [ip_vs_mh.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/net/netfilter/ipvs/ip_vs_mh.c#L344) - 檔案系統中的 blk - [blk-settings.c](https://github.com/torvalds/linux/blob/a379fbbcb88bcf43d886977c6fb91fb43f330b84/block/blk-settings.c#L514) - tty - [vt.c](https://github.com/torvalds/linux/blob/81ff0be4b9e3bcfee022d71cf89d72f7e2ed41ba/drivers/tty/vt/vt.c#L431) 捲動文字螢幕也要用 gcd ?! - 其他我不懂的 - [sparx5_calendar.c](https://github.com/torvalds/linux/blob/dbe69e43372212527abf48609aba7fc39a6daa27/drivers/net/ethernet/microchip/sparx5/sparx5_calendar.c#L247) - [ecc.c](https://github.com/torvalds/linux/blob/bfc484fe6abba4b89ec9330e0e68778e2a9856b2/crypto/ecc.c#L988) 這個註解提到的 link 已經 404 了, 可以考慮用 [From Euclid’s GCD to Montgomery Multiplication to the Great Divide](https://www.researchgate.net/publication/234808933_From_Euclid's_GCD_to_Montgomery_Multiplication_to_the_Great_Divide) 取代嗎，可憐的 sun 被 oracle 買下來後連文件都保留不了，不過這裡只是提到 gcd, 似乎沒使用 gcd. --- ### 測驗 4 > 考慮以下程式碼: > ```cpp > #include <stddef.h> > size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) > { > size_t pos = 0; > for (size_t k = 0; k < bitmapsize; ++k) { > uint64_t bitset = bitmap[k]; > size_t p = k * 64; > for (int i = 0; i < 64; i++) { > if ((bitset >> i) & 0x1) > out[pos++] = p + i; > } > } > return pos; > } > ``` > 改寫的程式碼如下: > ```cpp= > #include <stddef.h> > size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) > { > size_t pos = 0; > uint64_t bitset; > for (size_t k = 0; k < bitmapsize; ++k) { > bitset = bitmap[k]; > while (bitset != 0) { > uint64_t t = EXP6; > int r = __builtin_ctzll(bitset); > out[pos++] = k * 64 + r; > bitset ^= t; > } > } > return pos; > } > ``` > 其中第 9 行的作用是找出目前最低位元的 1，並紀錄到 t 變數。舉例來說，若目前 bitset 為 000101b，那 t 就會變成 000001b，接著就可以透過 XOR 把該位的 1 清掉，其他保留 (此為 XOR 的特性)。 > 請補完 EXP6 #### 想法 & 思考看到這裡剛好跟前一題在 linux kernel 中的 gcd 用到的 `r & -r` 好像。所以就是 `bitset & -bitset`. #### 延伸問題 > 舉出這樣的程式碼用在哪些真實案例中 Linux kernel gcd implement. > 設計實驗，檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進？應考慮到不同的 bitmap density; ```cpp int main() { srand(time(NULL)); int bitmapsize = 100000; int maxoutsize = bitmapsize * sizeof(uint64_t) * 8; uint64_t *bitmap = malloc(sizeof(uint64_t) * bitmapsize); uint32_t *naive_out = malloc(sizeof(uint32_t) * maxoutsize); uint32_t *improved_out = malloc(sizeof(uint32_t) * maxoutsize); for (int i = 0; i < bitmapsize; i++) bitmap[i] = rand64(); for (int i = 0; i < maxoutsize; i++) naive_out[i] = improved_out[i] = 0; clock_t t1 = clock(); naive(bitmap, bitmapsize, naive_out); clock_t t2 = clock(); improved(bitmap, bitmapsize, improved_out); clock_t t3 = clock(); float naive_cost = (float)(t2 - t1) / CLOCKS_PER_SEC; float improved_cost = (float)(t3 - t2) / CLOCKS_PER_SEC; int isPass = 1; for (int i = 0; i < maxoutsize; i++) { if (naive_out[i] != improved_out[i]) { printf("Mismatched at index %d, %u != %u\r\n", i, naive_out[i], improved_out[i]); isPass = 0; break; } } printf(isPass ? "Passed\r\n" : "Failed\r\n"); printf("naive cost %f seconds, improved cost %f seconds,\r\nUplift performance %f%%\r\n", naive_cost, improved_cost, (1 / improved_cost - 1 / naive_cost) * 100 / (1 / naive_cost)); free(bitmap); free(naive_out); free(improved_out); free(improved2_out); } ``` 執行結果 ```shell naive cost 0.020146 seconds, improved cost 0.002941 seconds, Uplift performance 585.005066% ``` > 思考進一步的改進空間; - 稍微寫了一個簡單的 mapping 機制來取代計算 bits, 但果然程式碼變大又變慢。 - 用 `out[pos++] = (k << 6) + r;` 來取代 `out[pos++] = k * 64 + r;` 也沒有比較快。 - 還沒有想到更多的改進空間。 > 閱讀 [Data Structures in the Linux Kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-3.html) 並舉出 Linux 核心使用 bitmap 的案例; - [lib/iommu-helper.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/lib/iommu-helper.c) - [tools/perf/util/pmu.y](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/tools/perf/util/pmu.y) - [drivers/firmware/smccc/kvm_guest.c](https://github.com/torvalds/linux/blob/152d32aa846835987966fd20ee1143b0e05036a0/drivers/firmware/smccc/kvm_guest.c) - [linux/linkmode.h](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/include/linux/linkmode.h) - 非常多且雜, 但基本上只要想要省記憶體, bitmap 是非常好用的資料結構。 --- ### 測驗 5 > 以下是 LeetCode 166. [Fraction to Recurring Decimal](https://leetcode.com/problems/fraction-to-recurring-decimal/) 的可能實作: > ```cpp > #include <stdbool.h> > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include "list.h" > > struct rem_node { > int key; > int index; > struct list_head link; > }; > > static int find(struct list_head *heads, int size, int key) > { > struct rem_node *node; > int hash = key % size; > list_for_each_entry (node, &heads[hash], link) { > if (key == node->key) > return node->index; > } > return -1; > } > > char *fractionToDecimal(int numerator, int denominator) > { > int size = 1024; > char *result = malloc(size); > char *p = result; > > if (denominator == 0) { > result[0] = '\0'; > return result; > } > > if (numerator == 0) { > result[0] = '0'; > result[1] = '\0'; > return result; > } > > /* using long long type make sure there has no integer > overflow */ > long long n = numerator; > long long d = denominator; > > /* deal with negtive cases */ > if (n < 0) > n = -n; > if (d < 0) > d = -d; > > bool sign = (float) numerator / denominator >= 0; > if (!sign) > *p++ = '-'; > > long long remainder = n % d; > long long division = n / d; > > sprintf(p, "%ld", division > 0 ? (long) division : (long) - division); > if (remainder == 0) > return result; > > p = result + strlen(result); > *p++ = '.'; > > /* Using a map to record all of reminders and their position. > * if the reminder appeared before, which means the repeated loop begin, > */ > char *decimal = malloc(size); > memset(decimal, 0, size); > char *q = decimal; > > size = 1333; > struct list_head *heads = malloc(size * sizeof(*heads)); > for (int i = 0; i < size; i++) > INIT_LIST_HEAD(&heads[i]); > > for (int i = 0; remainder; i++) { > int pos = find(heads, size, remainder); > if (pos >= 0) { > while (PPP > 0) > *p++ = *decimal++; > *p++ = '('; > while (*decimal != '\0') > *p++ = *decimal++; > *p++ = ')'; > *p = '\0'; > return result; > } > struct rem_node *node = malloc(sizeof(*node)); > node->key = remainder; > node->index = i; > > MMM(&node->link, EEE); > > *q++ = (remainder * 10) / d + '0'; > remainder = (remainder * 10) % d; > } > > strcpy(p, decimal); > return result; >} >``` > 請補完 > PPP = ? (表示式) > MMM = ? (巨集) > EEE = ? (表示式) #### 想法 & 思考 `find` 函式實作就像 Hash Map, 在 `heads` 當中用 `key` 去找到對應的 `index`. 所以 `int pos = find(heads, size, remainder);` 當 `pos >= 0` 就是找到了循環的小數起點, 也就是之前出現過相同的餘數. 而 `decimal` 放的是之前算出來的餘數除法結果, 所以 ```cpp while (PPP > 0) *p++ = *decimal++; ``` 就是要把非循環的數字放進 p, 直到之前找到的餘數為止, 所以 `PPP` 就跟 `pos` 這個數字有關, 這個迴圈應該要執行 `pos` 這麼多次, `PPP` 應該是 `pos--` 而 `MMM(&node->link, EEE);` 顯然就是沒找到的時候要建立 Hash Map, 此時就是把 `&node->link` 塞入 `heads` 這個 Hash Map, Hash function 就在 `find` 裡面的 `int hash = key % size;`. 因此 `MMM` 就是 `list_add`, `EEE` 就是 `&heads[remainder % size]` #### 延伸問題 > 解釋上述程式碼運作原理，指出其中不足，並予以改進 - 第一眼看 `fractionToDecimal` 最不高興的就是只有 `malloc` 卻沒有 `free`. 所以我們可以寫一個 free head_head Array 的函式: ```cpp void freeListArray(struct list_head *heads, size_t size) { for (int i = 0; i < size; i++) { struct list_head *n, *s; list_for_each_safe(n, s, &heads[i]) { free(container_of(n, struct rem_node, link)); } } free(heads); } ``` - 老師說判斷正負號也可以用更好的方法: ```cpp bool sign = (float)numerator / denominator >= 0; if (!sign) *p++ = '-'; ``` 改為 ```cpp if ((numerator < 0) ^ (denominator < 0)) *p++ = '-'; ``` - 既然都已經 `/* deal with negtive cases */` , 那麼 `long long division = n / d;` 就不會有負的問題 ```cpp sprintf(p, "%ld", division > 0 ? (long)division : (long)-division); ``` 改為 ```cpp sprintf(p, "%ld", (long)division); ``` - 利用 llabs 簡化: ```cpp long long n = llabs(numerator); long long d = llabs(denominator); ``` - 可以考慮不需要先輸出到 `decimal`, 而是直接輸出給結果, 找到重複的 reminder 的時候, 再來插入 `(` 跟 `)` 就好. 最後如下: ```cpp char *fractionToDecimal2(int numerator, int denominator) { int size = 1024; char *result = malloc(size); char *p = result; if (denominator == 0) { result[0] = '\0'; return result; } if (numerator == 0) { result[0] = '0'; result[1] = '\0'; return result; } /* using long long type make sure there has no integer > overflow */ long long n = llabs(numerator); long long d = llabs(denominator); if ((numerator < 0) ^ (denominator < 0)) *p++ = '-'; long long remainder = n % d; long long division = n / d; sprintf(p, "%ld", (long)division); if (remainder == 0) return result; p = result + strlen(result); *p++ = '.'; /* Using a map to record all of reminders and their position. * if the reminder appeared before, which means the repeated loop begin, */ int listsize = 1333; struct list_head *heads = malloc(listsize * sizeof(*heads)); for (int i = 0; i < listsize; i++) INIT_LIST_HEAD(&heads[i]); char *idx_begin = p; int i = 0; for (; remainder; i++) { int pos = find(heads, listsize, remainder); if (pos >= 0) { size_t len = i - pos; memmove(idx_begin + pos + 1, idx_begin + pos, len); idx_begin[pos] = '('; idx_begin[pos + len + 1] = ')'; idx_begin[pos + len + 2] = '\0'; freeListArray(heads, listsize); return result; } struct rem_node *node = malloc(sizeof(*node)); node->key = remainder; node->index = i; list_add(&node->link, &heads[remainder % listsize]); idx_begin[i] = (remainder * 10) / d + '0'; remainder = (remainder * 10) % d; } idx_begin[i] = '\0'; freeListArray(heads, listsize); return result; } ``` > 在 Linux 核心原始程式碼的 mm/ 目錄 (即記憶體管理) 中，找到特定整數比值 (即 fraction) 和 bitwise 操作的程式碼案例，予以探討和解說其應用場景 - [vmscan.c](https://github.com/torvalds/linux/blob/a4fd49cdb5495f36a35bd27b69b3806e383c719b/mm/vmscan.c#L2730) - 看不懂但這裡讓我學到 github 搜尋可以指定目錄, 像是搜尋 [mm/ 目錄中的 fraction](https://github.com/search?q=fraction+repo%3Atorvalds%2Flinux+path%3A%2Fmm&type=Code&ref=advsearch&l=&l=) --- ### 測驗 6 > __alignof__ 是 GNU extension，以下是其可能的實作方式: > ```cpp > /* > * ALIGNOF - get the alignment of a type > * @t: the type to test > * > * This returns a safe alignment for the given type. > */ > #define ALIGNOF(t) \ > ((char *)(&((struct { char c; t _h; } *)0)->M) - (char *)X) > ``` > 請補完上述程式碼，使得功能與 __alignof__ 等價。 #### 想法 & 思考在只有定義一種 type t 的情況下, `alignof` 就是 t 的長度. 既然要算出 type t 的長度, 頭尾相減就是答案. 所以 `M` 表示的就是 type t 的變數: `_h`, 而 `X` 表現的就是頭, 就是 0. #### 延伸問題 > 在 Linux 核心原始程式碼中找出 __alignof__ 的使用案例 2 則，並針對其場景進行解說 - [socket.h](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/tools/perf/include/bpf/linux/socket.h#L9) 利用 `__alignof__` 找出 `struct sockaddr`, 在定義 socket storage 的時候使用. - [signal_compat.c](https://github.com/torvalds/linux/blob/7c636d4d20f8c5acfbfbc60f326fddb0e1cf5daa/arch/x86/kernel/signal_compat.c#L41) 利用 `__alignof__` 強制在 compiler 時期就拒絕做出不合規格的 siginfo_t > 在 Linux 核心源使程式碼找出 ALIGN, ALIGN_DOWN, ALIGN_UP 等巨集，探討其實作機制和用途，並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻，儘量共用相同的巨集 ALIGN, ALIGN_DOWN, ALIGN_UP 應該是在 [align.h](https://github.com/torvalds/linux/blob/a48b0872e69428d3d02994dcfad3519f01def7fa/include/linux/align.h) --- ### 測驗 7 > 考慮貌似簡單卻蘊含實作深度的 FizzBuzz 題目: > > - 從 1 數到 n，如果是 3的倍數，印出 “Fizz” > - 如果是 5 的倍數，印出 “Buzz” > - 如果是 15 的倍數，印出 “FizzBuzz” > - 如果都不是，就印出數字本身 > > 以下是利用 bitwise 和上述技巧實作的 FizzBuzz 程式碼: (bitwise.c) > ```cpp > static inline bool is_divisible(uint32_t n, uint64_t M) > { > return n * M <= M - 1; > } > > static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1; > static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1; > > int main(int argc, char **argv) > { > for (size_t i = 1; i <= 100; i++) { > uint8_t div3 = is_divisible(i, M3); > uint8_t div5 = is_divisible(i, M5); > unsigned int length = (2 << KK1) << KK2; > > char fmt[9]; > strncpy(fmt, &"FizzBuzz%u"[(9 >> div5) >> (KK3)], length); > fmt[length] = '\0'; > > printf(fmt, i); > printf("\n"); > } > return 0; > } > ``` > 作答區 > KK1 = ? (變數名稱) > KK2 = ? (變數名稱) > KK3 = ? (表示式，不包含小括號) #### 想法 & 思考按題意搭配 `strncpy(fmt, &"FizzBuzz%u"[(9 >> div5) >> (KK3)], length);` `length` 應該 - div3 = 1, div5 = 0: `length` = 4 - div3 = 0, div5 = 1: `length` = 4 - div3 = 1, div5 = 1: `length` = 8 - div3 = 0, div5 = 0: `length` = 2 因此 `KK1` 給 `div3`, `KK2` 給 `div5` 就會符合 `length` 的需求. 而 `(9 >> div5) >> (KK3)` 按需求則是 - div3 = 1, div5 = 0 : `(9 >> div5) >> (KK3)` = 0 - div3 = 0, div5 = 1 : `(9 >> div5) >> (KK3)` = 4 - div3 = 1, div5 = 1 : `(9 >> div5) >> (KK3)` = 0 - div3 = 0, div5 = 0 : `(9 >> div5) >> (KK3)` = 8 因此 `9` 要改為 `8`, 然後 `KK3` 等於 `div3 << 2` 就會符合需求... #### 延伸問題 > 解釋上述程式運作原理並評估 naive.c 和 bitwise.c 效能落差 > 避免 stream I/O 帶來的影響，可將 printf 更換為 sprintf > 分析 Faster remainders when the divisor is a constant: beating compilers and libdivide 一文的想法 (可參照同一篇網誌下方的評論)，並設計另一種 bitmask，如「可被 3 整除則末位設為 1」「可被 5 整除則倒數第二位設定為 1」，然後再改寫 bitwise.c 程式碼，試圖運用更少的指令來實作出 branchless; 參照 fastmod: A header file for fast 32-bit division remainders on 64-bit hardware > 研讀 The Fastest FizzBuzz Implementation 及其 Hacker News 討論，提出 throughput (吞吐量) 更高的 Fizzbuzz 實作 > 解析 Linux 核心原始程式碼 kernel/time/timekeeping.c 裡頭涉及到除法運算的機制，探討其快速除法的實作 (注意: 你可能要對照研讀 kernel/time/ 目錄的標頭檔和程式碼) 過程中，你可能會發現可貢獻到 Linux 核心的空間，請充分討論