Linux 核心專題: 回顧 bitops 並改進

# Linux 核心專題: 回顧 bitops 並改進 > 執行人: Shiritai > [專題解說錄影](https://youtu.be/DpV40zL9ngY) > [投影片](https://www.canva.com/design/DAFmsnSBE4I/Qunm7_g9jHdbrbL0XcNoOw/view?utm_content=DAFmsnSBE4I) > [GitHub](https://github.com/Shiritai/linux-bitops-patch) :::success :question: 提問清單 * `__gen_fls` 與 `gen_fls` 只差在 `bit_len` 這項輸入。想問分開的好處？ > 有兩個好處。 > 1. 呼叫者不需要輸入 `bit_len` 參數，因為這可以從輸入型別直接推得，不過我想您想問的應該不是這點。 > 2. `bit_len` 這項輸入在 `__gen_fls` 中用上兩次，如下 > ```c > #define __gen_fls(in_type, out_type, start_from, fn_name, bit_len) \ > static __always_inline out_type fn_name(in_type x) \ > { \ > /* ... */ \ > shift = bit_len >> 1; \ > bits = bit_len - 1 + start_from; \ > ``` > 如果不以一個參數包裝 `sizeof(in_type) * 8`，我們就需要在 `__gen_fls` 中寫兩遍 `sizeof(in_type) * 8`，這增加了維護程式碼的一咪咪成本。使用間接的巨集生成函式不僅避免這個成本，同時不影響編譯完的結果，所以我採用此方法。 > > 換言之，增加一層間階層同時讓: > 1. 呼叫者呼叫的更優雅 > 2. 實作者簡化實作邏輯 > > 可謂兩邊都受惠 OwO > > 希望這樣有解答您的提問。[name=Shiritai] ::: ## 任務簡述 Linux 核心的 [include/asm-generic/bitops](https://github.com/torvalds/linux/tree/master/include/asm-generic/bitops) 存在若干實作不一致的狀況，本任務嘗試追蹤 git log 和 LKML，以理解發展脈絡，並嘗試用精簡有效的方式改寫，最後嘗試貢獻程式碼到 Linux 核心。 > [準備工作](https://hackmd.io/@Eroiko/linux2023-assessment) ## TODO: 理解 bitops 實作風格和目標 `bitops` 本身為 Linux kernel 中的一個套件，當中實作 `ffs` 系列函式，他們被用在核心的其餘諸多地方，這點可以簡單利用 vscode 對 `ffs`, `__ffs`, `fls`, `__fls`, `fls64`, `ffz` 這些函式進行搜尋便可得知。一個經典的應用是 CFS (completely fair scheduler) 的排程上，會呼叫 `bitops/sched.h` 中的 `sched_find_first_bit` 函式，後者又呼叫 `__ffs`。由 `clz` 系列函式的實作為引子，經翻閱 Linux 原始碼中 `ffs` 系列函式後，整理其概況與問題，列敘如下。 ### `ffs` 及 `__ffs` 加雙底線與否的不同首先先釐清 `ffs` 及 `__ffs` (是否加雙底線) 的不同之處：參考 `bitops` 系列對外的介面: [`arch/arc/include/asm/bitops.h`](https://github.com/torvalds/linux/blob/master/arch/arc/include/asm/bitops.h) 中的註解得知 * `__XXX` 系列: 以 index 0 索引，範圍為 `0 ~ {length of input type - 1}` * `XXX` 系列: 以 index 1 索引，範圍為 `1 ~ {length of input type}` ### 採用硬體實作與否 GCC 除了提供 `__builtin__c{l,t}z` 對硬體指令包裝的內建函式外，實際上也存在 `__builtin_ffs`。就算沒有此函式，用前兩個也足以解決 `ffs` 和 `fls` 的實作。不過並不是所有指令集都實作對應的硬體指令，所以才有 linux 核心中 `ffs` 系列的軟體實作。核心透過條件編譯決定最後應該採用哪種實作，而本子項目的重點將先放在軟體實作的部分。接著來探討現有實作的問題。 ### 詭異的實作不一致性與例外情況凡是掛名 `__builtin_XXX` 的函式本身以 CPU 硬體指令實作，過濾例外情況 $0$ 作為輸入以避免 undefined behavior 是必備手續；同時使用 `ffs`, `fls` 時輸入 $0$ 本身就是極其不合理的事，此時檢查例外情況依然重要。不過有趣的是，`bitops` 目錄下大家眾口一詞：有的說例外情況的判斷交由 caller 檢查，有的沒特別說 (只說其與 builtin ffs 有同樣的 routine)，有的直接為 $0$ 為輸入定義對應輸出... * 例外情況的判斷交由 caller 檢查 ```c // __ffs.h /** * __ffs - find first bit in word. * @word: The word to search * * Undefined if no bit exists, so code should check against 0 first. */ static __always_inline unsigned long __ffs(unsigned long word) ``` * 沒特別說，但自主處理例外情況 (只說其與 `builtin ffs` 有同樣的 routine) ```c // ffs.h /** * ffs - find first bit set * @x: the word to search * * This is defined the same way as * the libc and compiler builtin ffs routines, therefore * differs in spirit from ffz (man ffs). */ static inline int ffs(int x) { int r = 1; if (!x) return 0; ... ``` * 為 $0$ 作為輸入時定義對應輸出 ```c /** * fls - find last (most-significant) bit set * @x: the word to search * * This is defined the same way as ffs. * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32. */ static __always_inline int fls(unsigned int x) ``` 所謂魔鬼藏在細節裡。這樣的不一致性往往導致開發時的不可控風險，屬於 bad smell 的一種。 ### [`constant_fls`](https://github.com/torvalds/linux/blob/master/arch/arc/include/asm/bitops.h) 的存在意義 [`arch/arc/include/asm/bitops.h`](https://github.com/torvalds/linux/blob/master/arch/arc/include/asm/bitops.h) 中存在一函式 `constant_fls`，簽名與部分實作如下: ```c static inline int constant_fls(unsigned int x) { int r = 32; if (!x) return 0; if (!(x & 0xffff0000u)) { x <<= 16; ... } ``` 其目的是利用 [gcc `__builtin_constant_p`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005fconstant_005fp) 協助於編譯時期進行 [Constant folding (propagation)](https://en.wikipedia.org/wiki/Constant_folding)，於[你所不知道的 C 語言：編譯器和最佳化原理篇](https://hackmd.io/@sysprog/c-compiler-optimization?type=view#%E4%BD%A0%E6%89%80%E4%B8%8D%E7%9F%A5%E9%81%93%E7%9A%84-C-%E8%AA%9E%E8%A8%80%EF%BC%9A%E7%B7%A8%E8%AD%AF%E5%99%A8%E5%92%8C%E6%9C%80%E4%BD%B3%E5%8C%96%E5%8E%9F%E7%90%86%E7%AF%87)介紹過其觀念。以下節錄 gcc 官方文件，說明 `__builtin_constant_p` 的使用: > You can use the built-in function `__builtin_constant_p` to determine if a value is known to be constant at compile time and **hence that GCC can perform constant-folding on expressions involving that value**. ... > > ... A **return of `0` does not indicate that the value is not a constant**, but merely that GCC cannot prove it is a constant with the specified value of the `-O` option. > > You may use this built-in function in either a macro or an inline function. **However, if you use it in an inlined function and pass an argument of the function as the argument to the built-in**, GCC never returns `1` when you call the inline function with a string constant or compound literal (see Compound Literals) and **does not return 1 when you pass a constant numeric value to the inline function unless you specify the `-O` option**. 表示當 `__builtin_constant_p` 用於 macro 或 inline function 時，傳入的參數為常數數值，當開啟編譯器最佳化 `-O` 選項後即會進行 constant folding。綜觀 `constant_fls` 函式，我們看不出其與 [`bitops/fls`](https://github.com/torvalds/linux/blob/master/include/asm-generic/bitops/fls.h) 有任何差別。這表示就算不實作 `constant_fls`，也能利用上述 `bitops/fls` 函式協助編譯器進行 constant folding。那麼問題就出現了： * `constant_fls` 實作的意義為何？ * 若調整 `bitops/flz` 或 `constant_fls` 的實作，對於 [`bitops.h` 中 `fls`](https://github.com/torvalds/linux/blob/master/arch/arc/include/asm/bitops.h)傳入參數為 `__builtin_constant_p(x) == 1` 時的效能是否有影響？即是否有可能影響 constant folding？另附上唯一的呼叫者 [`bitops.h` 中 `fls`](https://github.com/torvalds/linux/blob/master/arch/arc/include/asm/bitops.h) 函式以供參考。 ```c /* * fls = Find Last Set in word * @result: [1-32] * fls(1) = 1, fls(0x80000000) = 32, fls(0) = 0 */ static inline __attribute__ ((const)) int fls(unsigned int x) { if (__builtin_constant_p(x)) return constant_fls(x); return 32 - clz(x); } ``` ### 另一細節的不一致性有一個小小細節是: 使用 `if-else` 實作系列函式的一個優點是可以略過部分不必要的操作，比如 `__fls` 函式: (這點 `__ffs` 亦同) ```c static __always_inline unsigned long __fls(unsigned long word) { int num = BITS_PER_LONG - 1; ... if (!(word & (~0ul << (BITS_PER_LONG-2)))) { num -= 2; word <<= 2; } if (!(word & (~0ul << (BITS_PER_LONG-1)))) num -= 1; // bypass "word <<= 1" instruction here return num; } ``` 但隔壁鄰居 `fls` (`ffs` 亦同) 可沒打算那樣做: ```c static __always_inline int fls(unsigned int x) { int r = 32; ... if (!(x & 0xc0000000u)) { x <<= 2; r -= 2; } if (!(x & 0x80000000u)) { x <<= 1; // meaningless?! r -= 1; } return r; } ``` 順帶一提，他的親戚 `constant_fls` (就是前述沒有存在意義的函式) 倒是考慮到了這點。 ### 目標透過觀察前述古怪現象，其中一個解決方法便是使用我分析 `clz` 軟體實作的成果: 以 button-up 的角度分析位元運算版實作，一步步將 Magic Number 抽象出來，在不斷維持原功能的相容性下擴展新功能。好處有以下： * 程式碼簡潔 * 可以輕鬆維護並同步現有不同輸入輸出型別之 `f{f,l}s` 系列函式的軟體實作 * 統一 Undefined behavior 的處理方式 * 為未來 $128$ bits 版的實作乃至於其他型別的實作給出輕鬆擴展的解決方案但若直接移植我的解決方案而不經任何調整，將會有以下缺點： * 舊方案無法處理有無雙底線如 `__ffs` 等函式的實作 * 效能降低對於第一項，我們可以調整函式的生成邏輯，增加可控變因來嘗試完成。對於第二項，目前將透過進一步理解編譯器的優化行為嘗試解決。 ## TODO: 用一致的實作手段改寫 bitops 程式碼 ### 擴展巨集的功能透過定義額外兩個供其他開發者使用的巨集常數 ```c #define __FFLS_START_INDEX_FROM_ZERO 0 #define __FFLS_START_INDEX_FROM_ONE 1 ``` 我調整 `ffs` (`fls` 同理) 函式如下: ```c= /** * Hidden generator of ffs to hide evaluation of size of type */ #define __gen_ffs(in_type, out_type, start_from, fn_name, bit_len) \ static __always_inline out_type fn_name(in_type x) \ { \ int bits, shift; \ in_type mask; \ if (!x) \ return 0; \ bits = start_from; \ shift = bit_len >> 1; \ mask = (in_type) ~(0xFFFFFFFFFFFFFFFF << shift); \ do { \ if (!(x & mask)) { \ bits += shift; \ x >>= shift; \ } \ shift >>= 1; \ mask >>= shift; \ } while (shift); \ return bits; \ } /** * gen_ffs() - Macro to generate function series of * ffs series - find first (least-significant) bit set. * * Generated function parameter: * @x: the object to search * * Macro arguments: * @in_type: input type of ffs * @out_type: output type of ffs * @start_from: starting index of generated ffs to generate both * ffs series (return 1 ~ 64) and __ffs series (return 0 ~ 63) * * Note if: * start_from is __FFLS_START_INDEX_FROM_ZERO, then ffs(0) = 0, * ffs(1) = 0, ffs(object leads with 1) = sizeof object - 1. * start_from is __FFLS_START_INDEX_FROM_ONE, then ffs(0) = 0, * ffs(1) = 1, ffs(object leads with 1) = sizeof object. * @fn_name: Function name to be generated */ #define gen_ffs(in_type, out_type, start_from, fn_name) \ __gen_ffs(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8)) ``` 根據我所寫的 kernel doc，其他開發者可以為 `start_from` 參數指派 `__FFLS_START_INDEX_FROM_ZERO` 或 `__FFLS_START_INDEX_FROM_ONE` 調整其起始索引值，如此一來無論是 `__ffs` 抑或 `ffs` 都能以同樣的規格實作。原理可見程式碼第 11 行，初始化 `bits` (輸出) 時採用不同的起始索引。由於編譯器能自動識別並簡化這些常數，因此這些調整對效能沒有影響。 ### Loop unroll #### 從一個 `x64` 組語小實驗開始我嘗試將自己的舊版 `ffs` 巨集生成函式置於 `util.h` 轉為組語，其在 `main.c` 中被呼叫並印出結果，以下節錄實作：  ```c= // util.h #define __gen_ffs(in_type, out_type, start_from, fn_name, bit_len) \ static __always_inline out_type fn_name(in_type x) \ { \ int bits, shift; \ in_type mask; \ if (!x) \ return 0; \ bits = start_from; \ shift = bit_len >> 1; \ mask = (in_type) ~(_ONES << (bit_len >> 1)); \ while (shift) { \ if (!(x & mask)) { \ bits += shift; \ x >>= shift; \ } \ shift >>= 1; \ mask >>= shift; \ } \ return bits; \ } #define gen_fls(in_type, out_type, start_from, fn_name) \ __gen_fls(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8)) // generate gen_ffs(__unsigned_int, int, START_INDEX_FROM_ONE, my_ffs); // main.c int main(void) { printf("%d\n", my_ffs((rand() % (__INT32_MAX__ - 1)) + 1)); } ``` `main.c` 中故意使用 `srand`、`time(NULL)` 和 `rand` 是為了不希望 compiler 將輸入設為常數；使用 `printf` 是希望 compiler 沒有機會把生成的 inline function 整個優化掉，強迫一定要取得結果並印出。使用 `cc main.c -O -std=c99 -S` 帶優化的編譯，產生的組語節錄如下： ```python ... movl $0, %edi addl $1, %eax je .L2 # if (!x) return 0; movl $5, %esi movl $16, %ecx movl $1, %edi movl $65535, %edx jmp .L4 .L3: sarl %ecx shrl %cl, %edx subl $1, %esi je .L2 .L4: testl %edx, %eax jne .L3 addl %ecx, %edi shrl %cl, %eax jmp .L3 .L2: movl %edi, %edx leaq .LC0(%rip), %rsi movl $1, %edi movl $0, %eax call __printf_chk@PLT ... ``` 這裡可以觀察到一些有趣的現象，以下討論根據 [x86 cheat sheet](https://web.stanford.edu/class/cs107/resources/x86-64-reference.pdf)。 * `__gen_ffs` 中 `mask` 變數確實透過編譯器的協助轉為常數 `65535` (`0xffff`)，這符合預期。 * 更進一步的，將上述 `util.h` 第 13 行改為 `mask = (in_type) ~(_ONES << shift);` 後編譯為組語，發現得到一樣的結果。 * 本來顧慮到不使用全部直接是常數的巨集產開下可能導致 gcc 不會正確預先求出 `mask` 常數，事實證明編譯器實在太聰明了！ * `my_ffs` (使用巨集生成的 `ffs` 函式) 組語迴圈也很有趣。 * 查詢 `x86` 的 [`test` 指令](https://en.wikipedia.org/wiki/TEST_(x86_instruction))發現，原來其非迴圈條件判斷，而是 bitwise AND 運算，對應第 13 行 `x & shift` 的部分。 * 原來編譯器自動幫我將 `while` 轉為 `do-while`，不得不說真的很聰明，直接體會到編譯器的強大。 #### 實作 Loop unroll 編譯器神通廣大，即使沒修過編譯器的我都聽過其中一個知名的功能：loop unrolling。透過將迴圈展開為只向下執行的程式碼，降低指令往前轉跳的 control hazard 和 instruction cache 成本。參考[本討論串](https://stackoverflow.com/questions/24196076/is-gcc-loop-unrolling-flag-really-effective)，其討論到適合使用 loop unrolling 的時機： > 對次數少的小迴圈有效，反之則要付出大代價。平常依靠編譯器自動最佳化即可，盡可能不要進行半吊子的優化，正如 Donald Knuth 所言： > We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. 明顯前述 `gen_ffs` 或者說 `my_ffs` 屬於適合 loop unrolling 的情況，於是我嘗試使用 [`-funroll-loops`](https://gcc.gnu.org/onlinedocs/gcc-4.5.2/gcc/Optimize-Options.html) 這個 gcc 編譯參數，其嘗試在編譯時期展開可預測迴圈執行次數的迴圈: ```bash cc main.c -O -funroll-loops -std=c99 -S ``` (左: 原版，右：unroll loops 版) ![](https://hackmd.io/_uploads/ByRqh5_E2.png) 成功得到展開後的迴圈，而且由於將邏輯迴圈化之前的重複次數本來就很少，展開後才只多七行，除了預見效能的提升外想不到其他可能。更令我驚訝的是，除了能正確的 unroll loop，針對整數輸入型別展開為 5 次 $(i.e. 16, 8, 4, 2, 1)$ 以外，編譯器甚至聰明到把最後一個 $1$ 的情況優化成不去對 `x` 做無謂的運算，也就是 `.L6` 中不存在對 `%eax (x)` 的右移指令。換言之，若將我的巨集生成函式應用於 Linux Kernel `ffs/fls` 系列函式的實作，有可能不會影響其效能。實際上是真的不會，詳見[後續實驗](https://hackmd.io/@sysprog/ByS9w_lHh#%E6%95%88%E8%83%BD%E6%8E%A2%E8%A8%8E)，但在此之前有個問題，就是如果需要調整 kernel 編譯時的編譯參數，恐怕會影響其他程式碼，我們應該限制 loop unroll 只作用在由我們巨集生成函式產生的迴圈上。 #### 精準最佳化：調整 compile 時特定位置採取的優化策略同前述，我們希望 kernel 於編譯時，與 `ffs` 相關的部分採用 GCC 編譯時傳入 `unroll-loops` 參數的效果。直接調整編譯參數有點不切實際，原因如下： * 只希望 unroll 生成出來的函式，不希望影響到其他函式。 * ~~Linux 的 makefile 們 trace 起來好痛苦~~ 另一方面，我想到 gcc 可能提供了額外的標記來解決問題，於是找到[本討論串](https://stackoverflow.com/questions/4071690/tell-gcc-to-specifically-unroll-a-loop)，其介紹了兩種方法： * [`pragma + GCC global optimize + push pop`](https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas): 在函式前為 GCC 增加全域優化設定，函式後取消該新增。 > `pragma` 表示編譯提示。 ```c #pragma GCC push_options #pragma GCC optimize ("unroll-loops") // function with loops to be unrolled #pragma GCC pop_options ``` * 新增 [`attribute`](https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html): 在函式前加上 ```c __attribute__((optimize("unroll-loops"))) // function with loops to be unrolled ``` 於是我嘗試實驗這兩者於以下編譯環境： ```bash > gcc --version gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 ``` 發現後者沒有效果，前者雖然有，但不知為何在巨集生成函式下沒有效果。我嘗試過以下做法： * 直接暴力放 `pragma` 會報錯，在 `define` 中不能用 `#...`，故 `#pragma` 不能使用。(語法錯誤)。 ```c #define ... \ #pragma GCC push_options \ #pragma GCC optimize ("unroll-loops") \ \ /* function with loops to be unrolled */ \ \ #pragma GCC pop_options ``` * 利用 C99 的功能: [`_Pragma` 運算子](https://gcc.gnu.org/onlinedocs/cpp/Pragmas.html)替代 `#pragma` 巨集這下語法不會錯誤了。但奇妙的是，這樣寫編譯後發現沒有效果！ ```c #define ... \ _Pragma("GCC push_options") \ _Pragma("GCC optimize (\"unroll-loops\")") \ \ /* function with loops to be unrolled */ \ \ _Pragma("GCC pop_options") ``` 於是我改為去掉最後一行 `_Pragma("GCC pop_options")`，編譯後發現竟然有效，但這樣調整全域優化設定非常不好，畢竟會影響到其他程式碼，我所追求的是只針對本函式的 loop unroll。 * 利用 GCC [Loop-Specific Pragmas](https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html): 前述方法都是針對整個函式，經查找 GCC 規格書後發現還有此方法，透過標記在迴圈前一行告訴 GCC 要 unroll 幾次，語法為 `#pragma GCC unroll n`。 > 以下節錄 GCC 規格書，說明語法中 `n` 的意義： > n is an integer constant expression specifying the unrolling factor. The values of 0 and 1 block any unrolling of the loop. 於是我結合前述 `_Pragma` 運算子將程式改為： ```c #define ... \ ... \ _Pragma("GCC unroll 6") \ do { \ ... \ } while (shift); \ ... ``` 指定 $n = 6$ 是因為對於 $64$ 位元而言，迴圈將走訪 $shift = 2^5, 2^4, \sim, 2^0$ 這幾項可能。 :::warning 不過此時讀者可能會好奇，這樣不就限制我抽象為巨集生成函式的作法只能用於 $64$ bits？答案是其實不會！雖然規格書沒有詳細提及，不過我以實驗測試 unroll 次數自 0 至 100。結果表明，當可 unroll 的次數 $n_{avai}$ 小於指派的次數，結果只會展開 $n_{avai}$ 次。畢竟 GCC 原本的 `funroll-loops` 本來就有能力分析迴圈合理的展開次數。更進一步的，若未來出現 $128$ 位元的處理器，只要將當中全一立即數和 unroll 次數向上調整即可，生成出的結果不會影響舊有實作。 ::: ### 更進一步，實作 `clz/ffs` 系列函式全家桶 Linux kernel 中尚未針對 `c{l,t}z` 系列函式直接提供軟體實作。以下嘗試沿用 `f{f/s}s` 的成果，以接近相同的實作一口氣實作 `clz/ffs` 系列函式全家桶。 #### `clz/ffs` 全家桶說明常見的 `clz` 系列函式有以下數個，當中 leading、tailing、first、last 的定義存在模糊空間，故標註其針對的是 MSB 抑或 LSB: * count leading zeros (MSB) * count tailing zeros (LSB) * find first set (LSB) * find last set (MSB) * find first zero (LSB) 他們針對不同的 * 輸入 * 輸出 * 未定義行為而有不同的實作。不過一個很有趣的現象是 `clz` 函式們彼此間的關係，以下式說明。 $$ \begin{cases} clz(x) = 64 - fls(x), & x > 0 \\ ctz(x) + 1 = ffs(x), & x > 0 \\ ffz(x) = ffs(\neg x) - 1, & x \ne \rm{all\ ones} \\ \end{cases} $$ 可見這些函式實際上可能只需實作 `clz/ctz` 或 `ffs/fls` 即可，其他函式只需調整一些即可。 ### 實作全家桶在我的實作中，全家桶的關係如下圖：方形表對面向使用者的函式，橢圓表巨集生成函式 (`__ffs`、`__fls`、`ffz`、`constant_fls` 等請見 `bitops-patch` 子項目)。 ```graphviz digraph dg { graph[rankdir=BT]; gen_ffs; gen_fls; gen_clz; gen_ctz; ffs[shape=record]; fls[shape=record]; __ffs[shape=record];__fls[shape=record]; ffz[shape=record]; constant_fls[shape=record]; clz[shape=record]; ctz[shape=record]; __gen_clz -> __gen_fls; gen_fls -> __gen_fls; gen_ffs -> __gen_ffs; ffs -> gen_ffs; fls -> gen_fls; __ffs -> gen_ffs; __fls -> gen_fls; gen_clz -> __gen_clz; clz -> gen_clz; ctz -> gen_ctz; ffz -> __ffs; constant_fls -> gen_fls; gen_ctz -> __gen_ffs; } ``` 透過活用一開始設計來提供使用者客製化用的之巨集常數 `__FFLS_START_INDEX_FROM_{ZERO,ONE}`，以 `f{f,l}s` 實作 `c{l,t}z` 只在彈指之間。節錄實作如下： ```c #define __FFLS_START_INDEX_FROM_ZERO 0 #define __FFLS_START_INDEX_FROM_ONE 1 // hidden implementation #define __gen_ffs(in_type, out_type, start_from, fn_name, bit_len) \ ... // just like ctz // developers should call this #define gen_ffs(in_type, out_type, start_from, fn_name) \ __gen_ffs(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8)) // hidden implementation #define __gen_fls(in_type, out_type, start_from, fn_name, bit_len) \ ... // just like clz // developers should call this #define gen_fls(in_type, out_type, start_from, fn_name) \ __gen_fls(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8)) // hidden implementation #define __gen_clz(in_type, out_type, fn_name, bit_len) \ __gen_fls(in_type, out_type, \ __FFLS_START_INDEX_FROM_ONE, \ __fls_##fn_name, bit_len); \ static __always_inline out_type fn_name(in_type x) \ { \ return bit_len - __fls_##fn_name(x); \ } // developers should call this #define gen_clz(in_type, out_type, fn_name) \ __gen_clz(in_type, out_type, fn_name, (sizeof(in_type) * 8)) // developers should call this #define gen_ctz(in_type, out_type, fn_name) \ __gen_ffs(in_type, out_type, \ __FFLS_START_INDEX_FROM_ZERO, \ fn_name, (sizeof(in_type) * 8)) // hidden implementation #define __gen_ffz(in_type, out_type, fn_name, bit_len) \ __gen_ffs(in_type, out_type, \ __FFLS_START_INDEX_FROM_ZERO, \ __ffs_##fn_name, bit_len); \ static __always_inline out_type fn_name(in_type x) \ { \ return __ffs_##fn_name(~x); \ } #define gen_ffz(in_type, out_type, fn_name) \ __gen_ffz(in_type, out_type, fn_name, (sizeof(in_type) * 8)) ``` 可以發現實作 `clz` 和 `ffz` 的過程中額外生成了幫手 `fls` (`__fls_##fn_name`)、`ffs` (`__ffs_##fn_name`) 函式，至於對於 * 程式碼多餘與否 * 機械碼長度 * 執行時期效能等指標而言是不是一個好的做法，又是一個可以加以探討的話題。 ### 效能探討今以以下簡單的程式來測試，後者 (巨集生成迴圈版) 有兩個版本，一個是尚未使用 loop unroll 的版本，另一個有使用。 ```c // 原始版本 for (long long i = 1; i < 1000000; ++i) { assert(__builtin_ffsll(i) == ffs_ll(i)); assert(__builtin_ffsll(i + __INT32_MAX__) == ffs_ll(i + __INT32_MAX__)); } // 巨集生成迴圈版 (no loop unroll) for (long long i = 1; i < 1000000; ++i) { assert(__builtin_ffsll(i) == my_ffs_ll(i)); assert(__builtin_ffsll(i + __INT32_MAX__) == my_ffs_ll(i + __INT32_MAX__)); } // 巨集生成迴圈版 (loop unroll) for (long long i = 1; i < 1000000; ++i) { assert(__builtin_ffsll(i) == my_ffs_lu_ll(i)); assert(__builtin_ffsll(i + __INT32_MAX__) == my_ffs_lu_ll(i + __INT32_MAX__)); } ``` 實驗環境如下。 :::spoiler ```bash $ gcc --version gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Vendor ID: GenuineIntel Model name: 12th Gen Intel(R) Core(TM) i3-12100 CPU family: 6 Model: 151 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 Stepping: 5 CPU max MHz: 5500.0000 CPU min MHz: 800.0000 ``` ::: 以 `perf` 為測試工具，測試指令如下： ```bash gcc main.c -std=c99 -O -o a -g3 && sudo perf stat --repeat 1000 ./a ``` 分別測試上述三種不同的實作，得到如下結果： |index|原始版本|巨集生成迴圈版 no loop unroll|巨集生成迴圈版 loop unroll| |:-:|:-:|:-:|:-:| |Execution time (ms)|3.48|9.26|3.52| |Branches|15.35M|32.34M|15.34M| |Branch miss (% of all branch)|0.18|0.16|0.19| |IPC|3.65|2.89|3.62| 可見在尚未配合編譯器優化前的實作與原始相比，執行時間大約為原始的 2.6 倍，雖然 branch miss 比率較低，但總 branches 指令為原本的近兩倍 (這是高度可以預期的，因為每輪迴圈都需要一次結束判定)，故 IPC 比原本低，影響執行時長。而採用 loop unroll 編譯器精準優化後，執行時間為原本的 1.094 倍，幾乎不受影響，而 branches 數也說明 loop unroll 成功把迴圈結束判定的邏輯給展開去除，得到近乎一樣的結果，在 branch 數與 branch miss 相近的情況下，IPC 也很接近，執行時長自然相近。實驗結論是：本專題的優化幾乎不會影響效能。 ## TODO: 確認改寫的程式碼得以在 Linux 核心運作 > 應搭配 user-mode-linux 或 virtme (第 7 週教材) 來驗證 ### 實作在實作精準的 loop unroll 後，下一步是實作於 Linux kernel。首先我建立新檔案 `ffls_gen.h`，加入前述 `ffs`、`fls` 巨集生成函式。接著修改 `ffs.h`、`fls.h`、`__ffs.h`、`__fls.h`、`fls64.h`、`bitops.h` 等檔案，調整實作如下兩例，其餘皆依樣畫葫蘆： ```c // in __ffs.h gen_ffs(unsigned long, unsigned long, __FFLS_START_INDEX_FROM_ZERO, __ffs); // in ffs.h gen_ffs(unsigned int, int, __FFLS_START_INDEX_FROM_ONE, ffs); ``` ### 初步測試為了測試方便，我先粗暴的調整 `__builtin-ffs.h`、`__builtin-__ffs.h` 等一系列採用硬體指令版的實作為與軟體版的相同，並以 User mode linux 編譯運行，得到無任何錯誤的編譯、運行結果。 <s>![](https://hackmd.io/_uploads/SyxJl4JB3.png)</s> ## TODO: 統計/預估藉由導入上述程式碼的修改後，現行 Linux 核心原始程式碼可減少幾行程式碼，及簡述提升程式碼維護的效益 ### 減少程式碼行數在 10 處重複實作的程式碼中 (不考慮註解的話): * 實作巨集函式們於 `include/asm-generic/bitops/ffls_gen.h`: 增加 $53$ 行 (不算額外擴展而暫時沒用到的 `clz/ctz/ffz` 巨集，若連註解都算的話增加 $170$ 行) * `tools/include/asm-generic/bitops` 和 `include/asm-generic/bitops` 下一樣的實作 * `__ffs.h`: 減少 $29 \times 2$ 行 * `__fls.h`: 減少 $28 \times 2$ 行 * `fls.h`: 減少 $28 \times 2$ 行 * `fls64.h`: 減少 $13 \times 2$ 行 * `include/asm-generic/bitops/ffs.h` 減少 $26$ 行 * `arch/arc/include/asm/bitops.h` 減少 $24$ 行共計減少 $193$ 行程式碼 (連註解、暫時沒用到的 `clz/ctz/ffz` 巨集都算的話減少 $76$ 行)。 ### 提升程式碼維護的效益 * 對於所有輸入輸出資料型別、長度、起始索引為零或一，統一其未定義行為，保證安全。 * `gen_ffs/gen_ctz` 產生的函式於輸入為 $0$ 時的未定義行為，本實作下輸出統一為 $0$ * `gen_fls` 產生的函式於輸入為 $0$ 時的未定義行為，本實作下輸出統一為 $0$ * (承上) `gen_clz` 輸入為 $0$ 時輸出為資料長度 * `gen_ffz` 輸入為全一時輸出為 $0$ * 針對位元運算迴避有號輸入型別可能產生的符號擴展，保證安全。 * 向舊有程式碼 (起始索引為 0 或 1，各式輸入輸出型別) 兼容的同時，在不用調整編譯參數的情況下擁有近乎相同的效能。 * 對未來比如 128 bits 的資料型別，僅需微調巨集展開次數與全一立即數，即可產生向長度 128 bits 以下兼容的程式碼，優雅。 * 要實作其他輸入輸出型別的 ffs/fls/ffz/cls/ctz 函式，每次僅需撰寫一行即可產生固定邏輯的函式，四兩撥千斤，優雅且利他。我認為是值得出現在 linux kernel 的程式碼。 ## TODO: 提供程式碼貢獻到 LKML > To be continue... [name=Shiritai]