2022q1 Homework2 (quiz2)

# 2022q1 Homework2 (quiz2) ###### tags: `linux2022` contributed by < `blue76815` > <style> .blue { color: blue; } .red { color: red; } </style> [作業要求](https://hackmd.io/@sysprog/linux2022-quiz2#2022q1-%E7%AC%AC-2-%E9%80%B1%E6%B8%AC%E9%A9%97%E9%A1%8C) 參考資料： * [數位邏輯](http://163.28.10.78/content/vocation/control/tp_nh/control/tp_nh/logic/index.htm) * [3.5 位元運算子](http://kaiching.org/pydoing/c/c-bitwise.html) * [數值系統、Bitwise 操作、浮點數運算、定點數操作](https://hackmd.io/@RusselCK/Bkxl1dDSw) ## 測驗 `1` `EXP1` = `a & b & 1` `EXP2` = `a & b` `EXP3` = `a ^ b` ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + (a & b & 1); } ``` ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a & b) + ((a ^ b) >> 1); } ``` :::success 延伸問題: 1. 解釋上述程式碼運作的原理 2. 比較上述實作在編譯器最佳化開啟的狀況，對應的組合語言輸出，並嘗試解讀 (可研讀 [CS:APP 第 3 章](https://hackmd.io/@sysprog/CSAPP-ch3)) 3. 研讀 Linux 核心原始程式碼 [include/linux/average.h](https://github.com/torvalds/linux/blob/master/include/linux/average.h)，探討其 [Exponentially weighted moving average](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average) (EWMA) 實作，以及在 Linux 核心的應用 ::: ### 延伸問題 1. 解釋上述程式碼運作的原理根據 [位元運算 (bitwise operation)](https://kopu.chat/%e4%bb%a5c%e5%af%a6%e4%bd%9c%e4%ba%8c%e9%80%b2%e4%bd%8d%e5%8a%a0%e6%b3%95/#lwptoc2) > `&` 代表 AND > `|` 代表 OR > `^` 代表 XOR > `!` 代表 NOT > > AND 運算 > ``` > 0 AND 0=0 > 0 AND 1=0 > 1 AND 0=0 > 1 AND 1=1 > ``` > > OR 運算 > ``` > 0 OR 0=0 > 0 OR 1=1 > 1 OR 0=1 > 1 OR 1=1 > ``` > > XOR 運算 > ``` > 0 XOR 0=0 > 0 XOR 1=1 > 1 XOR 0=1 > 1 XOR 1=0 > ``` > > NOT 運算 > ``` > NOT 0=1 > NOT 1=0 > ``` #### `EXP1` ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + (EXP1); } ``` `>> 1` 相當於 $\div2$ ，例如 $a=7=0111_2$ ， $b=9=1001_2$ 則 * $a>>1 \Rightarrow 0111_2>>1$ 經過位移後$\Rightarrow 0011_2=3$， * $b>>1 \Rightarrow 1001_2>>1$ 經過位移後$\Rightarrow 0100_2=4$，其中 a，b 變數 bit 0 的值會無條件捨去，考慮到 a，b 變數 bit 0 的值都為 1 的話，則 bit 0 相加時會進位因此 `EXP1`式子，一開始我寫成 ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + ((a & 0x01) & (b & 0x01)); } ``` ```c (a & 0x01) & (b & 0x01) //判斷 a，b 變數 bit 0 的值都為 1 的話，則保留 bit 0=1 值 ``` 其中 `& 0x01` 是為了只讀 bit 0 值以 $a=7=0111_2$ 為例，則 `(a & 0x01)` ``` 0111 ==> a變數 &) 0001 ==> 0x01，根據 bit mask ，有和 1 做＆運算的 bit 位數才保留下來 --------- 0001 ==>(a & 0x01)，只讀 bit 0 結果 ``` 以 $b=8=1000_2$ 為例，則 `(b & 0x01)` ``` 1001 ==> b 變數 &) 0001 ==> 0x01，根據 bit mask ，有和 1 做＆運算的 bit 位數才保留下來 --------- 0001 ==>(b & 0x01)，只讀bit 0 結果 ``` ```c (a & 0x01) & (b & 0x01) //判斷 a，b 變數 bit 0 的值都為 1 的話 ``` 相當於 ``` 0001 ==> (a & 0x01) &) 0001 ==> (b & 0x01) --------- 0001 ==> (a & 0x01) & (b & 0x01) ``` `EXP1`式子再簡化成 ```c (a & b & 0x01) ``` 相當於 ``` 0111 ==> a 1001 ==> b &) 0001 ==> 0x01 --------- 0001 ==> (a & b & 0x01) ``` #### `EXP2` 和 `EXP3` 在[你所不知道的 C 語言：數值系統篇-運用 bit-wise operator](https://hackmd.io/@sysprog/c-numerics#%E9%81%8B%E7%94%A8-bit-wise-operator) 裡面有提到 > * 避免 overflow > * 比方說 `(x + y) / 2` 這樣的運算，有個致命問題在於 (x + y) 可能會導致 overflow (考慮到 x 和 y 都接近 [UINT32_MAX](https://msdn.microsoft.com/en-us/library/mt764276.aspx)，亦即 32-bit 表示範圍的上限之際) > * [Binary search 的演算法提出之後十年才被驗證](https://www.comp.nus.edu.sg/~sma5503/recitations/01-crct.pdf) > * 於是我們可以改寫為以下: > ```cpp > (x & y) + ((x ^ y) >> 1) > ``` > * 用加法器來思考: `x & y` 是進位, `x ^ y` 是位元和, `/ 2` 是向右移一位 > * 位元相加不進位的總和: `x ^ y`; 位元相加產生的進位值: `(x & y) << 1` > * `x + y = x ^ y + ( x & y ) << 1` > * 所以 (x + y) / 2 > = `(x + y) >> 1` > = `(x ^ y + (x & y) << 1) >> 1` > = `(x & y) + ((x ^ y) >> 1)` 因此 `EXP2` 和 `EXP3` 寫成 ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a & b)+ ((a ^ b) >> 1); } ``` --- ### 延伸問題 2.比較上述實作在編譯器最佳化開啟的狀況，對應的組合語言輸出參考資料： * [GCC 編譯器](https://easonwang.gitbook.io/c/chapter1) * [gcc -O0 -O1 -O2 -O3 四級優化選項及每級分別做什麼優化](https://www.itread01.com/content/1544816905.html) * CS.APP CH3.4 訪問信息 ### x86 架構暫存器簡介在 **CS.APP CH3.4 訪問信息** 章節提到 1. x86_64 架構 cpu，是由16個 **通用目的暫存器(General-purpose registers (GPRs))** 組成 2. 由於指令集演化的發展，後面新增不同的命名規則 * 8086架構：由8個 16 bits 暫存器構成，**名稱為 `%ax`-`%sp` 暫存器** * IA32架構：擴展成 32 bits 暫存器，**暫存器名稱為`%eax`到`%esp`** * x86_64架構： * 原來的8個暫存器擴展為64 bits暫存器，**名稱為`%rax`到`%rsp`** * 除此之外，還**新增8個新的暫存器**，**暫存器名稱為 `%r8`到`%r15`** X86 暫存器表列 | 64-bit模式 | 32-bit模式 | 16-bit模式 | 8-bit模式 | 用途 | | ---------- | ---------- | ---------- | --------- | ------------- | | %rax | %eax | %ax | %al | 返回值 | | %rbx | %ebx | %bx | %bl | 被調用者保存 | | %rcx | %ecx | %cx | %cl | 第4個參數 | | %rdx | %edx | %dx | %dl | 第3個參數 | | %rsi | %esi | %si | %sil | 第2個參數 | | %rdi | %edi | %di | %dil | 第1個參數 | | %rbp | %ebp | %bp | %bpl | 被調用者保存 | | %rsp | %esp | %sp | %spl | stack pointer | | %r8 | %r8d | %r8w | %r8b | 第5個參數 | | %r9 | %r9d | %r9w | %r9b | 第6個參數 | | %r10 | %r10d | %r10w | %r10b | 調用者保存 | | %r11 | %r11d | %r11w | %r11b | 調用者保存 | | %r12 | %r12d | %r12w | %r12b | 被調用者保存 | | %r13 | %r13d | %r13w | %r13b | 被調用者保存 | | %r14 | %r14d | %r14w | %r14b | 被調用者保存 | | %r15 | %r15d | %r15w | %r15b | 被調用者保存 | [X86組合語言/X86架構及暫存器解釋](https://zh.wikibooks.org/zh-tw/X86%E7%B5%84%E5%90%88%E8%AA%9E%E8%A8%80/X86%E6%9E%B6%E6%9E%84%E5%8F%8A%E5%AF%84%E5%AD%98%E5%99%A8%E8%A7%A3%E9%87%8A) > ### 通用暫存器（GPR） - 32位元命名約定 > 8個GPR是： > > 1. 累加器暫存器（AX）。用在算術運算。 > 2. 基址暫存器（BX）。作為一個指向資料的指標（在分段模式下，位於段暫存器DS）。 > 3. 計數器暫存器（CX）。用於移位/迴圈指令和迴圈。 > 4. 資料暫存器（DX）。用在算術運算和I/O操作。 > 5. 堆疊指標暫存器（SP）。用於指向堆疊的頂部。 > 6. 棧基址指標暫存器（BP）。用於指向堆疊的底部。 > 7. 源變址暫存器（SI）。在流操作中用作源的一個指標。 > 8. 目標索引暫存器（DI）。用作在流操作中指向目標的指標。 > > 將它們以這樣的順序列出是有原因的： > **這個順序和堆疊操作中推入棧中的次序相同**，我們以後會講到。 > > 所有暫存器都可以在16位元和32位元模式下被存取。 > * 在16位元模式下，通過上面的列表中兩個字母的縮寫來確定該暫存器。 > * 在32位元模式下，這兩個字母的縮寫名字前有「E」（_extended_，延伸）。例如，「EAX'是累加器暫存器作為一個32位元的值。 > * 類似地，在64位元的版本，「E」被替換為「R」，所以在64位元版本「EAX'被稱為'RAX'。 > > ![](https://hackmd.io/_uploads/HkUW0A3m3.png) ## X86 組語運算 S:source D:destination | 指令 | 效果 | 描述 | | -------- | ------------ | -------------------------------------- | | MOV S,D | D <-- S | 搬運資料存到 D | | leaq S,D | D <-- &S | 加載有效地址 | | INC D | D <-- D+1 | +1 | | DEC D | D <-- D-1 | -1 | | NEG D | D <-- -D | 取負數 | | NOT D | D <-- ~D | 取補數 | | ADD S,D | D <-- D+S | 加 | | SUB S,D | D <-- D-S | 減 | | IMUL S,D | D <-- `D*S` | 乘 | | XOR S,D | D <-- D ^ S | XOR 運算 | | OR S,D | D <-- D or S | OR 運算 | | AND S,D | D <-- D & S | AND 運算 | | SAL k,D | D <-- D << k | 左移k bit | | SHL k,D | D <-- D << k | 左移k bit (效果和SAL 一樣) | | SAR k,D | D <-- D >> k | 算術右移k bit(bit空缺補上sign integer) | | SHR k,D | D <-- D >> k | 邏輯右移k bit(bit空缺補上0) | `average.c` 使用 -O3 最佳化 `$ gcc -O3 -S average.c` 編譯成組語 `average.c`用以下3種寫法編譯看組語內容 #### case1. ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return a + (b - a) / 2; } ``` 生成的`average.s`內容為 ``` average: .LFB0: .cfi_startproc //定義函數開始 endbr64 movl %esi, %eax subl %edi, %eax shrl %eax addl %edi, %eax ret //return 返回 .cfi_endproc //定義函數結束 ``` #### .cfi_startproc 和 .cfi_endproc 用途在 [.cfi_* 汇编指示符](https://blog.csdn.net/zoomdy/article/details/80700750) 提到 > * CFI 即 Call Frame Information，是 DWARF 2.0 定義的函數 stack信息，DWARF 即 Debugging With Attributed Record Formats ，是一種調試信息格式。 > * 組語經常看到 .cfi_ 開頭的組語指示符， > 例如 `.cfi_startproc` 、`.cfi_undefined` 、`.cfi_endproc` 等， > * CFI 即 Call Frame Information，是 DWARF 2.0 定義的函數棧信息， > * DWARF 即 Debugging With Attributed Record Formats ，是一種調試信息格式。 > * `.cfi_` 開頭的組語指示符用來告訴彙編器生成相應的 DWARF 調試信息，主要是和函數有關。 > * **`.cfi_startproc` 定義函數開始，`.cfi_endproc `定義函數結束。** 因為我們的函式是 `uint32_t average(uint32_t a, uint32_t b)` * 回傳值和輸入變數都是 `uint32_t` ，所以組語編譯成 32-bit 模式 * `uint32_t a` 為第1個輸入變數，所以暫存器 `%edi` 存 a 變數 * `uint32_t b` 為第2個輸入變數，所以暫存器 `%esi` 存 b 變數對應用到的 X86 暫存器為 | 32-bit模式 | 用途 | | ---------- | ------------------------ | | %eax | **返回值** | | %ebx | 被調用者保存 | | %esi | **第2個參數(存 b 變數)** | | %edi | **第1個參數(存 a 變數)** | | %edx | 第3個參數 | ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return a + (b - a) / 2; } ``` 所以 case1 生成的`average.s`內容為 ``` average: .LFB0: .cfi_startproc //定義函數開始 endbr64 movl %esi, %eax //變數a載入到 %eax subl %edi, %eax //%eax＝%eax-%edi (相當於做 b=b-a) shrl %eax // %eax>>1 ( 相當於做 (b-a)/2 ) addl %edi, %eax // %eax=%eax+%edi (相當於做 a + (b - a)/2) ret //return 返回 .cfi_endproc //定義函數結束 ``` --- #### case2. ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + (a & b & 1); } ``` 生成的`average.s`內容為 ``` average: .LFB0: .cfi_startproc endbr64 movl %edi, %eax //a變數載入到 %eax返回值 movl %esi, %edx //b變數載入到 %edx第3個參數 andl %esi, %edi // %edi=a & b shrl %eax //(a >> 1) shrl %edx // (b >> 1) andl $1, %edi // %edi=(%edi)&1 (相當於 a & b & 1) addl %edx, %eax //%eax= (b >> 1) + (a >> 1) addl %edi, %eax //%eax+%edi (相當於 (b >> 1) + (a >> 1)+(a & b & 1) ) ret .cfi_endproc ``` --- #### case3. ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a & b)+ ((a ^ b) >> 1); } ``` 生成的`average.s`內容為 %esi:存第2個參數 b ``` average: .LFB0: .cfi_startproc endbr64 movl %edi, %eax //a變數載入到 %eax返回值 andl %esi, %edi // %edi= a & b xorl %esi, %eax // %eax= a ^ b shrl %eax // %eax>>1 相當於 ((a ^ b) >> 1) addl %edi, %eax // %eax= ((a ^ b) >> 1)+ (a & b) ret .cfi_endproc ``` case3 編譯若使用 `gcc -O0` 不做任何最佳化 `$ gcc -O0 -S average.c` 生成的`average.s`內容為 ``` average: .LFB0: .cfi_startproc endbr64 pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movl %edi, -4(%rbp) movl %esi, -8(%rbp) movl -4(%rbp), %eax andl -8(%rbp), %eax movl %eax, %edx movl -4(%rbp), %eax xorl -8(%rbp), %eax shrl %eax addl %edx, %eax popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc ``` 其中 ``` .cfi_def_cfa_offset 16 .cfi_offset 6, -16 .cfi_def_cfa 7, 8 ``` 根據 [the gnu assembler (gas)](https://sourceware.org/binutils/docs-2.31/as/CFI-directives.html#g_t_002ecfi_005fdef_005fcfa_005foffset-offset) > #### 7.10.9 `.cfi_def_cfa register, offset` > > `.cfi_def_cfa` defines a rule for computing CFA as: _take address from register and add offset to it_. > > #### 7.10.11 `.cfi_def_cfa_offset offset` > > `.cfi_def_cfa_offset` modifies a rule for computing CFA. Register remains the same, but offset is new. Note that it is the absolute offset that will be added to a defined register to compute CFA address. > > #### 7.10.13 `.cfi_offset register, offset` > > Previous value of register is saved at offset offset from CFA. --- ### 延伸問題 3.研讀 Linux 核心原始程式碼 [include/linux/average.h](https://github.com/torvalds/linux/blob/master/include/linux/average.h)，探討其 [Exponentially weighted moving average](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average) (EWMA) 實作，以及在 Linux 核心的應用介紹篇幅太長，另開一個共筆 [quiz2 測驗1 EWMA 介紹](https://hackmd.io/@blue76815/SJqcbr-_9) #### EWMA 實作，在 Linux 核心的應用在 linux kernel 中搜尋 [DECLARE_EWMA](https://github.com/torvalds/linux/search?q=DECLARE_EWMA) [drivers/net/wireless/mediatek/mt76/mt76x02_mac.h](https://github.com/torvalds/linux/blob/dbe69e43372212527abf48609aba7fc39a6daa27/drivers/net/wireless/mediatek/mt76/mt76x02_mac.h) ```c=34 DECLARE_EWMA(pktlen, 8, 8); struct mt76x02_sta { struct mt76_wcid wcid; /* must be first */ struct mt76x02_vif *vif; struct mt76x02_tx_status status; int n_frames; struct ewma_pktlen pktlen; }; ``` 依照 `DECLARE_EWMA()` 定義 > `#define DECLARE_EWMA(name, _precision, _weight_rcp)` > 其中3個輸入參數定義 > * `name`: 表示整個 macro 定義出來的一組 macro 以及 struct 的名稱。 > * `_percision`: 表示用多少個 bits 來表示內部儲存的小數部分。 > * `_weight_rcp`: 表示權重 $\alpha$ 的倒數, $\alpha=1 / weight\_rcp$。 `DECLARE_EWMA(pktlen, 8, 8);` 代表 * `name`: pktlen * `_percision`: 表示**用 8 bits** 來表示內部儲存的**小數部分。** * `_weight_rcp`: 表示權重 $\alpha$ 的倒數, $\alpha=1 / weight\_rcp=1 / 8$。 [drivers/net/wireless/mediatek/mt7601u/mt7601u.h](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/drivers/net/wireless/mediatek/mt7601u/mt7601u.h) ```c=134 DECLARE_EWMA(rssi, 10, 4); ``` `DECLARE_EWMA(rssi, 10, 4);` 代表 * `name`: rssi * `_percision`: 表示**用 10 bits** 來表示內部儲存的**小數部分。** * `_weight_rcp`: 表示權重 $\alpha$ 的倒數, $\alpha=1 / weight\_rcp=1 / 4$。 [drivers/net/wireless/realtek/rtw88/main.h](https://github.com/torvalds/linux/blob/7e062cda7d90543ac8c7700fc7c5527d0c0f22ad/drivers/net/wireless/realtek/rtw88/main.h) ```c=642 DECLARE_EWMA(tp, 10, 2); ``` ```c=738 DECLARE_EWMA(rssi, 10, 16); ``` `DECLARE_EWMA(tp, 10, 2);` 代表 * `name`: tp * `_percision`: 表示**用 10 bits** 來表示內部儲存的**小數部分。** * `_weight_rcp`: 表示權重 $\alpha$ 的倒數, $\alpha=1 / weight\_rcp=1 / 2$。 `DECLARE_EWMA(rssi, 10, 16);` 代表 * `name`: rssi * `_percision`: 表示**用 10 bits** 來表示內部儲存的**小數部分。** * `_weight_rcp`: 表示權重 $\alpha$ 的倒數, $\alpha=1 / weight\_rcp=1 / 16$。 --- ## 測驗 `2` `EXP4` = `a ^ b` `EXP5` = `a < b` ```c #include <stdint.h> uint32_t max(uint32_t a, uint32_t b) { return a ^ ((a ^ b) & -(a < b)); } ``` :::success 延伸問題: 1. 解釋上述程式碼運作的原理 2. 針對 32 位元無號/有號整數，撰寫同樣 [branchless](https://en.wikipedia.org/wiki/Branch_(computer_science)) 的實作 3. Linux 核心也有若干 branchless / branch-free 的實作，例如 [lib/sort.c](https://github.com/torvalds/linux/blob/master/lib/sort.c): ```c /* * Logically, we're doing "if (i & lsbit) i -= size;", but since the * branch is unpredictable, it's done with a bit of clever branch-free * code instead. */ __attribute_const__ __always_inline static size_t parent(size_t i, unsigned int lsbit, size_t size) { i -= size; i -= size & -(i & lsbit); return i / 2; } ``` 請在 Linux 核心原始程式碼中，找出更多類似的實作手法。請善用 `git log `檢索。 ::: ### 1. 解釋上述程式碼運作的原理 [如何製作無分支代碼？](https://qa.icopy.site/questions/32107088/how-can-i-make-branchless-code) 在上面的答案中，提到瞭如何通過避免分支來**避免分支預測失敗**. ### 2. 針對 32 位元無號/有號整數，撰寫同樣 [branchless](https://en.wikipedia.org/wiki/Branch_(computer_science)) 的實作 ```c #include <stdint.h> int32_t max(int32_t a, int32_t b) { return a ^ ((a ^ b) & -(a < b)); } ``` branchless 參考資料: * [位元運算整理](https://hackmd.io/@0xff07/BTS/https%3A%2F%2Fhackmd.io%2F%400xff07%2FARIARIARIARI) * [Bit Twiddling Hacks](https://graphics.stanford.edu/~seander/bithacks.html) ### 3. Linux 核心也有若干 branchless / branch-free 的實作，請在 Linux 核心原始程式碼中，找出更多類似的實作手法。請善用 `git log `檢索。參考 [在 Linux 核心原始程式碼中，找出更多類似的實作手法](https://hackmd.io/@cantfindagoodname/linux2022-quiz2#%E5%BB%B6%E4%BC%B8%E5%95%8F%E9%A1%8C1) :::info `git log `檢索方法 * 方法1 直接在 [kernel/git/torvalds/linux.git](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ "kernel/git/torvalds/linux.git") 用 log msg ![](https://hackmd.io/_uploads/BypVARnm2.png) 搜尋 [branchless](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=branchless) 和 [branch-free](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=branch-free) 關鍵字 * 方法2 * [git log命令全解析，log還能這樣美！](https://www.itread01.com/articles/1476835543.html) * [2.3 Git 基礎 - 檢視提交的歷史記錄](https://git-scm.com/book/zh-tw/v2/Git-%E5%9F%BA%E7%A4%8E-%E6%AA%A2%E8%A6%96%E6%8F%90%E4%BA%A4%E7%9A%84%E6%AD%B7%E5%8F%B2%E8%A8%98%E9%8C%84) > `--oneline` 選項將每一筆提交顯示成單獨一行，對於檢視大量的提交時很有 > `--name-only` 在提交訊息後方顯示更動的檔案列表。 > `--grep` 列出提交訊息中符合指定字串的提交。 `$ git log --oneline --name-only -E --grep "branchless | branch-free"` ::: #### branchless 實作 [xfrm: branchless addr4_match() on 64-bit](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c786bcb29dd684533ec165057f437d5bb34a4b2) [powerpc/ebpf32: Rework 64 bits shifts to avoid tests and branches](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e7de0023e1232f42a10ef6af03352538cc27eaf6) #### branch-free 實作 [KVM: MMU: simplify last\_pte\_bitmap](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6bb69c9b69c315200ddc2bc79aee14c0184cf5b2) [KVM: MMU: Optimize pte permission checks](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97d64b788114be1c4dc4bfe7a8ba2bf9643fe6af) ## 測驗 `3` `COND` = `v` `RET` = `u << shift` ```c #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } while (!(u & 1)) u /= 2; do { /*做輾轉相除法*/ while (!(v & 1)) v /= 2; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (v); return u << shift; } ``` :::success 延伸問題: 1. 解釋上述程式運作原理; 2. 在 x86_64 上透過 __builtin_ctz 改寫 GCD，分析對效能的提升; 3. Linux 核心中也內建 GCD (而且還不只一種實作)，例如 [lib/math/gcd.c](https://github.com/torvalds/linux/blob/master/lib/math/gcd.c)，請解釋其實作手法和探討在核心內的應用場景。 ::: ### 1. 解釋上述程式運作原理; 複習數學求最大公因式時，用到輾轉相除法 1. [【概念】用短除法找兩數的最大公因數](https://www.junyiacademy.org/course-compare/junyi-math/math-topic-juni/mjn/mjnfs/mjnfs7b/v/3V-Tp__Gp9s) 2. [輾轉相除法求最大公因數和最小公倍數(影片)](https://www.youtube.com/watch?v=8PPQc16kK-c) 根據 [kevinshieh0225](https://hackmd.io/@Masamaloka/linux2022-quiz2#%E6%B8%AC%E9%A9%97-3-%EF%BC%9A-greatest-common-divisor) 同學介紹從 ```c #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; while (v) { uint64_t t = v; v = u % v; u = t; } return u; } ``` 上面這段用到輾轉相除法，但是 **% 取餘數是很承重的運算** 變成以下等價 1.修改演算法，**避開% 取餘數的使用** ```c #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2;//用 bitwise的|(or運算) &(and運算)，運算成本較低 } while (!(u & 1)) u /= 2; do { /*做輾轉相除法*/ while (!(v & 1)) v /= 2; if (u < v) { v -= u;// if(u<v) , gcd(u,v)=gcd(u,v-u) 因此得修改 v變數值 v=v-u } else { uint64_t t = u - v; u = v;//小數搬到u處 v = t;//大數搬到v處，大數搬到v是為了當"被除數" 才能在下一步再做 v=v-u 除法計算 } } while (v); return u << shift; } ``` 最後總結成 ```c #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { /*用因數分解短除法一直提出2的公因數*/ u /= 2, v /= 2;//u和v變數，每除2一次，就紀錄一次shift (bit 右移次數) //直到!((u | v) & 1) ==>u或v變數的 bit1=1 為奇數時才退出for迴圈 } while (!(u & 1)) u /= 2; do { /*做輾轉相除法*/ while (!(v & 1)) v /= 2; if (u < v) { v -= u; // if(u<v) , gcd(u,v)=gcd(u,v-u) 因此得修改 v變數值 v=v-u } else { uint64_t t = u - v;//else if(u>v), gcd(u,v)=gcd(u-v,v) u = v; //小數搬到u處 v = t; //大數搬到v處，大數搬到v是為了當"被除數" 才能在下一步再做 v=v-u 除法計算 } } while (v); return u << shift;//計算完，要把前面做因數分解短除法時，提出的 2^shift 次方公因數給乘回去（相當於做 bit 左移 shift次） } ``` ## 測驗 `4` `EXP6` = `bitset & -bitset` ```c= #include <stddef.h> size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = bitset & -bitset;//EXP6 int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } ``` :::success 延伸問題: 1. 解釋上述程式運作原理，並舉出這樣的程式碼用在哪些真實案例中; 2. 設計實驗，檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進？應考慮到不同的 [bitmap density](http://www.vki.com/2013/Support/docs/vgltools-3.html); 3. 思考進一步的改進空間; 4. 閱讀 [Data Structures in the Linux Kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-3.html) 並舉出 Linux 核心使用 bitmap 的案例; ::: 在 [Bit Twiddling Hacks 解析 (三)](https://hackmd.io/@0xff07/BTS/https%3A%2F%2Fhackmd.io%2F%400xff07%2FWRYYYYYYYYYY#Count-Tailing-Zeroes-ctz) 中有提到 > - [Count Tailing Zeroes (ctz)](https://hackmd.io/@0xff07/WRYYYYYYYYYY#Count-Tailing-Zeroes-ctz "Count Tailing Zeroes (ctz)") > - [方法一：使用二元搜尋](https://hackmd.io/@0xff07/WRYYYYYYYYYY#%E6%96%B9%E6%B3%95%E4%B8%80%EF%BC%9A%E4%BD%BF%E7%94%A8%E4%BA%8C%E5%85%83%E6%90%9C%E5%B0%8B "方法一：使用二元搜尋") > - [方法一：還是二元搜尋](https://hackmd.io/@0xff07/WRYYYYYYYYYY#%E6%96%B9%E6%B3%95%E4%B8%80%EF%BC%9A%E9%82%84%E6%98%AF%E4%BA%8C%E5%85%83%E6%90%9C%E5%B0%8B "方法一：還是二元搜尋") > - [方法三：使用浮點數](https://hackmd.io/@0xff07/WRYYYYYYYYYY#%E6%96%B9%E6%B3%95%E4%B8%89%EF%BC%9A%E4%BD%BF%E7%94%A8%E6%B5%AE%E9%BB%9E%E6%95%B8 "方法三：使用浮點數") ### 2. 設計實驗，檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進？應考慮到不同的 [bitmap density](http://www.vki.com/2013/Support/docs/vgltools-3.html); ### 實驗比較實驗設定每種 bit density測試時,**分別傳送N筆隨機位置的bitmap data, 傳到 `naive()`和 `improved()`比較耗時差別** bit-desity 從 1個bit為1，但是在64bit 中位置隨機，餵 N 筆資料傳到 `naive()`和 `improved()`比較耗時差別一直實驗運算到 bit-desity 64個bit為1，但是在64bit 中位置隨機，餵 N 筆資料傳到 `naive()`和 `improved()`比較耗時差別 ### case1.bitmap 資料為隨機位置1擺放 [程式碼 c2ecb0e](https://github.com/blue76815/linux2022q2_quiz2/tree/c2ecb0eaae7fe230f5accd3ec5a844579942d102) makefile中，已寫好 `make plot` ，能自動化執行 10000筆，50000筆，100000筆資料量測和 gunplot 繪圖 ``` $ make $ make plot ``` 實驗結果 #### 10000_bitmap.png（N=10000） ![](https://hackmd.io/_uploads/r14LRA37n.png) #### 50000_bitmap.png（N=50000） ![](https://hackmd.io/_uploads/BJSvAA2Q2.png) #### 100000_bitmap.png（N=100000） ![](https://hackmd.io/_uploads/B1SuRC2Qh.png) ### case2.bitmap 資料為連續1 位置擺放 [c2ecb0e](https://github.com/blue76815/linux2022q2_quiz2/commit/3ece33291e044ff4fd588d92284b7db49c390055) 程式碼將 `swap(&array[i], &array[rand_i]);` **注記掉不做洗牌使得bitmap 變數為連續1，而不是隨機位置1擺放** ```c uint64_t make_bitmap(int set_bit_num){ uint64_t array[64]={0}; uint64_t bitmap=0; for (int i = 0; i < set_bit_num; i++){//只設定set_bit_num 個 1 到陣列 array[i] = 1; } for (int i = 0; i < 64; i++){ //shuffle array[] int rand_i=rand() % 64; while(i==rand_i) rand_i=rand() % 64; //swap(&array[i], &array[rand_i]);//注記掉不做洗牌 } for (int i = 0; i < 64; i++){ bitmap^= array[i]<<(63-i) ;//轉換成 Big-Endian } return bitmap; } ``` <span class="blue">bitmap 變數**若為連續1的位置擺放** 則隨著 bit density 提高，就會出現前面同學實驗的情況</span> 例如 [vacantron同學](https://hackmd.io/@vacantron/linux2022-quiz2#%E6%B8%AC%E9%A9%97%E5%9B%9B) > **可以發現當 bitmap density 越小時 ( 1 越稀疏時)，修改過後 (使用 `builtin_ctzl()`) 的程式明顯比其他還有效率，隨著 bitmap 的 1 增加，`builtin_ctzl()` 的影響力相對的就下降了。** 例如 [tinyynoob 同學](https://hackmd.io/@tinyynoob/linux2022q1-quiz2#%E5%AF%A6%E9%A9%97%E6%AF%94%E8%BC%83) > > **若 bitmap 越鬆散 (即 1 越少)，於是 improved 的效益就更高。** #### 10000_bitmap.png（N=10000） ![](https://hackmd.io/_uploads/SkGygJ6m3.png) #### 50000_bitmap.png（N=50000） ![](https://hackmd.io/_uploads/rJZgl1TX3.png) #### 100000_bitmap.png（N=100000） ![](https://hackmd.io/_uploads/Hk9glJpX2.png) 呼應 > **若 bitmap 越鬆散 (即 ~~1 越少~~，<span class="blue">1擺放的位置越分散隨機</span>)，於是 improved 的效益就更高。** ### perf 分析 #### 分析目標用perf分析 1. `uint64_t make_bitmap(int set_bit_num)`內有加入 `swap(&array[i], &array[rand_i]);`洗牌時 2. `uint64_t make_bitmap(int set_bit_num)`內沒加入 `swap(&array[i], &array[rand_i]);`洗牌時，為連續1擺放時 <span class="blue">**用 perf 分析記憶體或是反組譯組語看有啥地方被最佳化**</span> #### perf record 分析 ``` $ sudo taskset -c 0 perf record -g ./main 100000 > data.csv //跑完perf record 分析後對應的資料夾會生成 perf.data $ sudo perf report -i perf.data //用 perf report -i 加上 perf.data ``` 左側有+的地方，鍵盤按`enter`鍵可展開分支 ![](https://hackmd.io/_uploads/HyrbeyTm3.png) #### perf top 分析先輸入 `$ perf top list` 看 `perf top` 指令有哪些設定參數 [perf top 使用方法](https://hackmd.io/Nrc17jxBRzyUz29twpdxqA) #### perf stat 分析參考 [freshLiver(quiz2 測驗3 的perf分析步驟)](https://hackmd.io/@freshLiver/2022-hw2-quiz2#%E4%BD%BF%E7%94%A8-__builtin_ctz-%E6%94%B9%E5%AF%AB%E4%B8%A6%E5%88%86%E6%9E%90%E5%B0%8D%E6%95%88%E8%83%BD%E7%9A%84%E6%8F%90%E5%8D%87) 根據 [perf stat](https://hackmd.io/@sysprog/gnu-linux-dev/https%3A%2F%2Fhackmd.io%2Fs%2FB11109rdg#perf-stat) 指令介紹我使用 `$ sudo taskset -c 0 perf stat -r 1 ./main 100000 > data.csv` 其中 `-r` 代表 repeat 重複執行的次數，因為我的程式`./main 100000` 本身餵10000筆資料時，就有在重複運算10000次因此只設定`perf stat -r 1` 只有repeat 重複執行1次 1. `uint64_t make_bitmap(int set_bit_num)`內有加入 `swap(&array[i], &array[rand_i]);`洗牌時 ``` $ make gcc -g -pthread main.c -o main $ sudo taskset -c 0 perf stat -r 1 ./main 100000 > data.csv Performance counter stats for './main 100000': 7,315.60 msec task-clock # 1.000 CPUs utilized 29 context-switches # 0.004 K/sec 0 cpu-migrations # 0.000 K/sec 6,505 page-faults # 0.889 K/sec 28,709,089,763 cycles # 3.924 GHz 62,499,571,894 instructions # 2.18 insn per cycle 10,424,286,945 branches # 1424.939 M/sec 149,895,478 branch-misses # 1.44% of all branches 7.316339321 seconds time elapsed 7.307770000 seconds user 0.007999000 seconds sys ``` ![](https://hackmd.io/_uploads/ryQMl1TQh.png) 2. `uint64_t make_bitmap(int set_bit_num)`內沒加入 `swap(&array[i], &array[rand_i]);`洗牌時，為連續1擺放時 ``` $ make gcc -g -pthread main.c -o main $ sudo taskset -c 0 perf stat -r 1 ./main 100000 > data.csv Performance counter stats for './main 100000': 5,683.21 msec task-clock # 1.000 CPUs utilized 23 context-switches # 0.004 K/sec 0 cpu-migrations # 0.000 K/sec 6,506 page-faults # 0.001 M/sec 22,331,076,507 cycles # 3.929 GHz 49,800,197,684 instructions # 2.23 insn per cycle 9,604,788,346 branches # 1690.028 M/sec 33,021,520 branch-misses # 0.34% of all branches 5.683874591 seconds time elapsed 5.671411000 seconds user 0.011998000 seconds sys ``` ![](https://hackmd.io/_uploads/SkgXxk673.png) perf stat 比較有明顯差異的數據是在 * 有加入 `swap(&array[i], &array[rand_i]);`洗牌時 ``` 62,499,571,894 instructions # 2.18 insn per cycle 10,424,286,945 branches # 1424.939 M/sec 149,895,478 branch-misses # 1.44% of all branches ``` * 沒加入 `swap(&array[i], &array[rand_i]);`洗牌時 ``` 49,800,197,684 instructions # 2.23 insn per cycle 9,604,788,346 branches # 1690.028 M/sec 33,021,520 branch-misses # 0.34% of all branches ``` 沒洗牌時 (1連續擺放，較密集的情況下) ``` 9,604,788,346 branches # 1690.028 M/sec 33,021,520 branch-misses # 0.34% of all branches ``` branches 和 branch-misses 數據較低，但是解釋不出來，為何在 improved() 函數執行時，運算效能卻比有洗牌時(1隨機位置擺放)還低 --- ## 測驗 `5` `PPP` = `pos--` `MMM` = `list_add` `EEE` = `&heads[remainder % size]` :::success 延伸問題: 1. 解釋上述程式碼運作原理，指出其中不足，並予以改進 > 例如，判斷負號只要寫作 `bool isNegative = numerator < 0 ^ denominator < 0;` > 搭配研讀 [The simple math behind decimal-binary conversion algorithms](https://indepth.dev/posts/1019/the-simple-math-behind-decimal-binary-conversion-algorithms) 2. 在 Linux 核心原始程式碼的 `mm/` 目錄 (即記憶體管理) 中，找到特定整數比值 (即 fraction) 和 bitwise 操作的程式碼案例，予以探討和解說其應用場景 ::: ## 測驗 `6` `M` = `_h` `X` = `0` ```c /* * ALIGNOF - get the alignment of a type * @t: the type to test * * This returns a safe alignment for the given type. */ #define ALIGNOF(t) \ ((char *)(&((struct { char c; t _h; } *)0)->_h) - (char *)0) ``` :::success 延伸問題: 1. 解釋上述程式碼運作原理 2. 在 Linux 核心原始程式碼中找出 [`__alignof__`](https://gcc.gnu.org/onlinedocs/gcc/Alignment.html) 的使用案例 2 則，並針對其場景進行解說 3. 在 Linux 核心源使程式碼找出 `ALIGN`, `ALIGN_DOWN`, `ALIGN_UP` 等巨集，探討其實作機制和用途，並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻，儘量共用相同的巨集 ::: ### 解釋上述程式碼運作原理 [C++ 速查手冊 4.8 - alignof 運算](http://kaiching.org/pydoing/cpp/cpp-alignof.html) > `alignof` 為 C++11 新增的關鍵字，當作運算子使用，**回傳物件對齊所需的位元組數**。用法如下 > > ``` > alignof (type-id);//type-id 可以是型態名稱或變數名稱。 > ``` > 註: **在 64 位元 (bit) 的機器上，處理器抓一次資料就是 64 位元，而 64 位元等於 8 個位元組 (byte) ，各種型態卻不會剛好都是 8 個位元組，因此就有放置資料的對齊 (alignment) 問題。** ``` #define ALIGNOF(t) \ ((char *)(&((struct { char c; t _h; } *)0)->_h) - (char *)0) ``` 根據 [Learn a new trick with the offsetof() macro](https://www.embedded.com/learn-a-new-trick-with-the-offsetof-macro/) > Listing 1: A representative set of offsetof() macro definitions > > // Keil 8051 compiler > `#define offsetof(s,m) (size_t)&(((s *)0)->m)` > > To better understand the magic of the offsetof() macro, consider the details of Keil's definition. The various operators within the macro are evaluated in an order such that the following steps are performed: > > 1. `((s *)0) `takes the integer zero and casts it as a pointer to s . > 2. `((s *)0)->m` dereferences that pointer to point to structure member m . > 3. `&(((s *)0)->m)` computes the address of m . > 4. `(size_t)&(((s *)0)->m)` casts the result to an appropriate data type. 依此類推，`#define ALIGNOF(t)` 拆步驟解析 1. `(struct { char c; t _h; } *)0)`:將 `struct { char c; t _h; }` 設成指標型態 `*`，並將**指標位址設為0** 2. `((struct { char c; t _h; } *)0)->_h`:**取得**該 struct pointer中的 **`_h` 成員** 3. `&((struct { char c; t _h; } *)0)->_h`:加`&` 取得 **`_h` 成員記憶體位址** 4. `(char *)(&((struct { char c; t _h; } *)0)->_h)`:將 **`_h` 成員記憶體位址** 強制轉型成 `(char *)` 型態參照 [The (char *) casting in container_of() macro in linux kernel](https://stackoverflow.com/questions/20421910/the-char-casting-in-container-of-macro-in-linux-kernel) 5. `((char *)(&((struct { char c; t _h; } *)0)->_h) - (char *)0)` :和`(char *)0` 位址相減後，即為 `t _h` 的 alignment ### 問題我卡在步驟4 將 **`_h` 成員記憶體位址** 強制轉型成 `(char *)` 型態，**為何就能轉換出 ALIGNOF 差距的記憶體格數?** ### [解法](https://hackmd.io/@blue76815/Hyt0E3_Pc) - 參考資料: - [Linux 核心原始程式碼巨集: container_of](https://hackmd.io/@sysprog/linux-macro-containerof) - [kevinshieh0225](https://hackmd.io/@Masamaloka/linux2022-quiz2#%E6%B8%AC%E9%A9%97-6%EF%BC%9A__alignof__) - [6.44 Determining the Alignment of Functions, Types or Variables](https://gcc.gnu.org/onlinedocs/gcc/Alignment.html) - [offsetof - wikipedia](https://en.wikipedia.org/wiki/Offsetof) - [hsuedw](https://hackmd.io/@hsuedw/linux2022-quiz2#%E5%BB%B6%E4%BC%B8%E5%95%8F%E9%A1%8C-1-%E8%A7%A3%E9%87%8B%E4%B8%8A%E8%BF%B0%E7%A8%8B%E5%BC%8F%E7%A2%BC%E9%81%8B%E4%BD%9C%E5%8E%9F%E7%90%86) - [ycwu4142](https://hackmd.io/@ycwu4142/linux2022-quiz2#%E5%A1%AB%E7%A9%BA1) - [The (char *) casting in container_of() macro in linux kernel](https://stackoverflow.com/questions/20421910/the-char-casting-in-container-of-macro-in-linux-kernel) - [C 語言 #pragma pack 預處理指令的意義、用法教學與範例程式碼](https://blog.gtwang.org/programming/c-language-pragma-pack-tutorial-and-examples/ "C 語言 #pragma pack 預處理指令的意義、用法教學與範例程式碼") - [你所不知道的 C 語言：記憶體管理、對齊及硬體特性 data alignment](https://hackmd.io/@sysprog/c-memory?type=view#data-alignment) ### 在 Linux 核心原始程式碼中找出 [`__alignof__`](https://gcc.gnu.org/onlinedocs/gcc/Alignment.html) 的使用案例 2 則，並針對其場景進行解說 1. [\[转\]CPU cache ，内存对齐，分支预测](https://www.cnblogs.com/wlzy/p/9688194.html) > `__attribute__((aligned(n)))`表示所定義的變量為n字節對齊； > > **字節對齊的細節和編譯器實現相關，但一般而言，滿足三個準則：** > 1. (結構體)變量的首地址能夠被其(最寬)基本類型成員的大小所整除； > 2. 結構體每個成員相對於結構體首地址的偏移量（offset）都是成員大小的整數倍，如有需要編譯器會在成員之間加上填充字節（internal adding）； > 3. 結構體的總大小為結構體最寬基本類型成員大小的整數倍，如有需要編譯器會在最末一個成員之後加上填充字節（trailing padding）。 1. [net/netfilter/ipset/ip\_set\_list_set.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/net/netfilter/ipset/ip_set_list_set.c "net/netfilter/ipset/ip_set_list_set.c") ```c=25 /* Member elements */ struct set_elem { struct rcu_head rcu; struct list_head list; struct ip_set *set; /* Sigh, in order to cleanup reference */ ip_set_id_t id; } __aligned(__alignof__(u64)); ``` 2. [net/mac80211/ieee80211_i.h](https://github.com/torvalds/linux/blob/7e062cda7d90543ac8c7700fc7c5527d0c0f22ad/net/mac80211/ieee80211_i.h) ```c=283 struct ps_data { /* yes, this looks ugly, but guarantees that we can later use * bitmap_empty :) * NB: don't touch this bitmap, use sta_info_{set,clear}_tim_bit */ u8 tim[sizeof(unsigned long) * BITS_TO_LONGS(IEEE80211_MAX_AID + 1)] __aligned(__alignof__(unsigned long)); struct sk_buff_head bc_buf; atomic_t num_sta_ps; /* number of stations in PS mode */ int dtim_count; bool dtim_bc_mc; }; ``` ### 在 Linux 核心源使程式碼找出 `ALIGN`, `ALIGN_DOWN`, `ALIGN_UP` 等巨集，探討其實作機制和用途，並舉例探討 [include/linux/align.h](https://elixir.bootlin.com/linux/v5.17.5/source/include/linux/align.h#L8) ```c /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_ALIGN_H #define _LINUX_ALIGN_H #include <linux/const.h> /* @a is a power of 2 value */ #define ALIGN(x, a) __ALIGN_KERNEL((x), (a)) #define ALIGN_DOWN(x, a) __ALIGN_KERNEL((x) - ((a) - 1), (a)) #define __ALIGN_MASK(x, mask) __ALIGN_KERNEL_MASK((x), (mask)) #define PTR_ALIGN(p, a) ((typeof(p))ALIGN((unsigned long)(p), (a))) #define PTR_ALIGN_DOWN(p, a) ((typeof(p))ALIGN_DOWN((unsigned long)(p), (a))) #define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0) #endif /* _LINUX_ALIGN_H */ ``` 其中在 [linux/const.h](https://elixir.bootlin.com/linux/v5.17.5/source/include/uapi/linux/const.h#L32) 可查到 `__ALIGN_KERNEL(x, a)` 和 `__ALIGN_KERNEL_MASK(x, mask)` 定義 ```c #define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1) #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask)) ``` ## 測驗 `7` `KK1` = `div3` `KK2` = `div5` `KK3` = `div3 << 2` :::success 延伸問題: 1. 解釋上述程式運作原理並評估 `naive.c` 和 `bitwise.c` 效能落差 * 避免 stream I/O 帶來的影響，可將 printf 更換為 sprintf 2. 分析 [Faster remainders when the divisor is a constant: beating compilers and libdivide](https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/) 一文的想法 (可參照同一篇網誌下方的評論)，並設計另一種 bitmask，如「可被 3 整除則末位設為 1」「可被 5 整除則倒數第二位設定為 1」，然後再改寫 `bitwise.c` 程式碼，試圖運用更少的指令來實作出 branchless; * 參照 [fastmod: A header file for fast 32-bit division remainders on 64-bit hardware](https://github.com/lemire/fastmod) 3. 研讀 [The Fastest FizzBuzz Implementation](https://tech.marksblogg.com/fastest-fizz-buzz.html) 及其 [Hacker News 討論](https://news.ycombinator.com/item?id=29413656)，提出 throughput (吞吐量) 更高的 Fizzbuzz 實作 4. 解析 Linux 核心原始程式碼 [kernel/time/timekeeping.c](https://github.com/torvalds/linux/blob/master/kernel/time/timekeeping.c) 裡頭涉及到除法運算的機制，探討其快速除法的實作 (注意: 你可能要對照研讀 `kernel/time/` 目錄的標頭檔和程式碼) > 過程中，你可能會發現可貢獻到 Linux 核心的空間，請充分討論 :::