2023q1 Homework5 (assessment)

contributed by <Shiritai>

即 trace linux kernel 中 bitops 後留下的問題提問、實作與實驗集合。

Shiritai

ffs trace code 問題集合

以下討論基於 Linux kernel 6.3.0 source code。

$ make kernelversion
> 6.3.0-rc2

問題 1: 謎之 `builtin-__ffs` 和 `builtin-ffs`

前後兩者的實作…結果竟然不同，i.e. 計算結果不同！

// builtin-__ffs.h
/**
 * __ffs - find first bit in word.
 * @word: The word to search
 *
 * Undefined if no bit exists, so code should check against 0 first.
 */
static __always_inline unsigned long __ffs(unsigned long word)
{
	return __builtin_ctzl(word);
}

// builtin-ffs.c
/**
 * ffs - find first bit set
 * @x: the word to search
 *
 * This is defined the same way as
 * the libc and compiler builtin ffs routines, therefore
 * differs in spirit from ffz (man ffs).
 */
#define ffs(x) __builtin_ffs(x)

照理來說 ctz(x) + 1 == ffs(x)，為何 builtin-__ffs 竟然直接以 ctz 實作？

表示你發現實作缺失，很好！
參照 Patch subject prefixes:

Another common Subject prefix is "RFC". This stands for "Request for Comment", which tells maintainers should review your patch thoroughly, and provide feedback. RFC is typically used when sending feature patches for the first time, or anytime the patch is more than just a simple bug fix. Developers will often use [RFC v2] on the second revision of their feature patchset.

你可提交 RFC patch，讓開發者參與討論，確認你的發現是否合理並值得改正。

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

jserv

報告教授，參考 bitops 系列對外的介面: arch/arc/include/asm/bitops.h，經過更仔細的 trace code 後，我理解其中的原理了： ~~註解寫得很清楚但我眼瞎了 QQ~~

__XXX 系列: 以 index 0 索引，範圍為 0 ~ {length of input type - 1}
XXX 系列: 以 index 1 索引，範圍為 1 ~ {length of input type}

我以此為基礎調整生成 ffs、fls 系列的巨集，目前可適應以同一份巨集於編譯時期生成上述兩種情況的程式碼，並確立所有常數，不因此影響執行時期的效能。

問題 2: 詭異的實作不一致性與例外情況

凡是掛名 __builtin_XXX 的函式本身以 CPU 硬體指令實作，過濾例外情況

0

作為輸入以避免 undefined behavior 是必備手續；同時使用 ffs, fls 時輸入

0

本身就是極其不合理的事，此時檢查例外情況依然重要。

不過有趣的是，bitops 目錄下大家眾口一詞：有的說例外情況的判斷交由 caller 檢查，有的沒特別說 (只說其與 builtin ffs 有同樣的 routine)，有的直接為

0

為輸入定義對應輸出…

例外情況的判斷交由 caller 檢查

// __ffs.h
/**
 * __ffs - find first bit in word.
 * @word: The word to search
 *
 * Undefined if no bit exists, so code should check against 0 first.
 */
static __always_inline unsigned long __ffs(unsigned long word)

沒特別說，但自主處理例外情況 (只說其與 builtin ffs 有同樣的 routine)

// ffs.h
/**
 * ffs - find first bit set
 * @x: the word to search
 *
 * This is defined the same way as
 * the libc and compiler builtin ffs routines, therefore
 * differs in spirit from ffz (man ffs).
 */
static inline int ffs(int x)
{
    int r = 1;

    if (!x)
        return 0;
    ...

為

0

為輸入定義對應輸出

/**
 * fls - find last (most-significant) bit set
 * @x: the word to search
 *
 * This is defined the same way as ffs.
 * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32.
 */
static __always_inline int fls(unsigned int x)

想請問這是否為 Jserv 您於一對一討論時提到不同大公司背景之貢獻者 coding rule 的異同？
還有我看得頭好痛，其他開發者在使用這些函式時難道不會很困惑嗎？

工程背後就是一系列的取捨和溝通

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

jserv

好難…不過現在算理解整個 ffs、fls 與 bitops 系列的關係，實現並驗證了對應的重構程式碼，目標是杜絕所有魔法與重複的程式碼 (i.e. bad smell)。
我將撰寫對應的 trace code 與實作報告，屆時請務必提供您的意見。

問題 3: `BITS_PER_LONG` 的來由

我撰寫 clz 巨集生成系列、inf_bits 時，對於位元數我總是使用 sizeof 運算子的協助。

若今想定義 BITS_PER_LONG 增加程式碼的可讀性，利用 sizeof 運算子正是時候。但 include/asm-generic/bitsperlong.h 的寫法讓我感到疑惑。

#ifdef CONFIG_64BIT
#define BITS_PER_LONG 64
#else
#define BITS_PER_LONG 32
#endif /* CONFIG_64BIT */

看起來他們有意避免使用 sizeof 運算子，想請問為什麼？
感覺可能與編譯器、C 語言規格、data model 有關，但不知道要從何找起…

這與 LP64 有關，這定義可能會被組合語言程式引用，因此用常數形式存在

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

jserv

了解，感謝教授的回答，我想進一步提問：使用 sizeof 與依靠 BITS_PER_LONG 的結果是否一定相同？

問題 4: 如何做出貢獻

如教授所知，我曾實作過巨集展開版 clz、ctz 的軟體實作，目前也已經成功實作且驗證~~合乎我所認識的~~ ffs、fls 巨集生成函式，在理解前述兩問題後，接著就是如何解決相容性問題。

相容性問題與其餘延伸問題整理如下：

問題一延伸：實作差異的調整
- 也許可以進一步將巨集生成函式抽象化？
問題二延伸：例外情況處理邏輯的統一
- 也許可以設法模仿當前紛雜函式們的計算結果？但感覺很讓人頭疼…
- 也許進一步踏出 ffs 系列，調整使用到 ffs 系列函式的核心原始碼？這聽起來是個浩大工程，一想到光是 trace code…
更進一步的，如果能做出貢獻，手序為何？
如果對 bitops 作出顯著的調整，又要如何讓以後的開發者善用我們的調整？

若能成功解決，也許能完成對 ffs 系列函式的重構，並有機會進一步使用這些巨集生成函式於其他型別，讓 ffs 系列更好用也更統一。

問題 5: `constant_fls` 的存在意義

本問題與編譯器有關。

arch/arc/include/asm/bitops.h 中存在一函式 constant_fls，簽名與部分實作如下:

static inline int constant_fls(unsigned int x)
{
    int r = 32;
    if (!x)
        return 0;
    if (!(x & 0xffff0000u)) {
        x <<= 16;
    ...
}

其目的是利用 gcc __builtin_constant_p 協助於編譯時期進行 Constant folding (propagation)，於你所不知道的 C 語言：編譯器和最佳化原理篇介紹過其觀念。以下節錄 gcc 官方文件，說明 __builtin_constant_p 的使用:

You can use the built-in function __builtin_constant_p to determine if a value is known to be constant at compile time and hence that GCC can perform constant-folding on expressions involving that value. …

… A return of 0 does not indicate that the value is not a constant, but merely that GCC cannot prove it is a constant with the specified value of the -O option.

You may use this built-in function in either a macro or an inline function. However, if you use it in an inlined function and pass an argument of the function as the argument to the built-in, GCC never returns 1 when you call the inline function with a string constant or compound literal (see Compound Literals) and does not return 1 when you pass a constant numeric value to the inline function unless you specify the -O option.

表示當 __builtin_constant_p 用於 macro 或 inline function 時，傳入的參數為常數數值，當開啟編譯器最佳化 -O 選項後即會進行 constant folding。

綜觀 constant_fls 函式，我們看不出其與 bitops/fls 有任何差別。這表示就算不實現 constant_fls，也能利用上述 bitops/fls 函式協助編譯器進行 constant folding。

那麼問題就出現了：

constant_fls 實現的意義為何？
若調整 bitops/flz 或 constant_fls 的實作，對於 bitops.h 中 fls傳入參數為 __builtin_constant_p(x) == 1 時的效能是否有影響？即是否有可能影響 constant folding？(我覺得不會)

另附上 bitops.h 中 fls 函式以供參考。

/*
 * fls = Find Last Set in word
 * @result: [1-32]
 * fls(1) = 1, fls(0x80000000) = 32, fls(0) = 0
 */
static inline __attribute__ ((const)) int fls(unsigned int x)
{
    if (__builtin_constant_p(x))
        return constant_fls(x);

    return 32 - clz(x);
}

問題六: 另一個詭異的不一致性

有一個小小細節是: 使用 if-else 實作系列函式的一個優點是可以略過部分不必要的操作，比如 __fls 函式: (這點 __ffs 亦同)


static __always_inline unsigned long __fls(unsigned long word)
{
    int num = BITS_PER_LONG - 1;
    ...
    if (!(word & (~0ul << (BITS_PER_LONG-2)))) {
        num -= 2;
        word <<= 2;
    }
    if (!(word & (~0ul << (BITS_PER_LONG-1))))
        num -= 1;
        // bypass "word <<= 1" instruction here
    return num;
}

但隔壁鄰居 fls (ffs 亦同) 可沒打算那樣做:

static __always_inline int fls(unsigned int x)
{
    int r = 32;
    ...
    if (!(x & 0xc0000000u)) {
        x <<= 2;
        r -= 2;
    }
    if (!(x & 0x80000000u)) {
        x <<= 1; // meaningless?!
        r -= 1;
    }
    return r;
}

更有趣的是他的遠親 constant_fls 倒是考慮到了這點。

這是一個提出一個究極不起眼 patch 的機會嗎?
Shiritai

不過我更有興趣的是怎麼更徹底的解決 bitops 中實作混亂的問題，這引出下一個問題…

問題七: 阿罵，我的 loop unrolling 怎麼沒有感覺…

本問題已經解決，留下實驗與討論紀錄。
讓編譯器為我上一堂程式設計課的同時，驗證了我的實作應用於 Linux Kernel 的可行性。

從一個 `x64` 組語小實驗開始

我嘗試將自己的 loop 版 ffs 置於 util.h 轉為組語，其在 main.c 中被呼叫並印出結果，以下節錄實作：






























// util.h
#define __gen_ffs(in_type, out_type, start_from, fn_name, bit_len) \
    static __always_inline out_type fn_name(in_type x)             \
    {                                                              \
        int bits, shift;                                           \
        in_type mask;                                              \
        if (!x)                                                    \
            return 0;                                              \
        bits = start_from;                                         \
        shift = bit_len >> 1;                                      \
        mask = (in_type) ~(_ONES << (bit_len >> 1));               \
        while (shift) {                                            \
            if (!(x & mask)) {                                     \
                bits += shift;                                     \
                x >>= shift;                                       \
            }                                                      \
            shift >>= 1;                                           \
            mask >>= shift;                                        \
        }                                                          \
        return bits;                                               \
    }
#define gen_fls(in_type, out_type, start_from, fn_name) \
    __gen_fls(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8))
// generate
gen_ffs(__unsigned_int, int, START_INDEX_FROM_ONE, my_ffs);

// main.c
int main(void) {
    printf("%d\n", my_ffs((rand() % (__INT32_MAX__ - 1)) + 1));
}

main.c 中故意使用 srand、time(NULL) 和 rand 是為了不希望 compiler 將輸入設為常數；使用 printf 是希望 compiler 沒有機會把生成的 inline function 整個優化掉，強迫一定要取得結果並印出。

使用 cc main.c -O -std=c99 -S 編譯，產生的組語節錄如下：

	...
	movl	$0, %edi
	addl	$1, %eax
	je	.L2   # if (!x) return 0;
	movl	$5, %esi
	movl	$16, %ecx
	movl	$1, %edi
	movl	$65535, %edx
	jmp	.L4
.L3:
	sarl	%ecx
	shrl	%cl, %edx
	subl	$1, %esi
	je	.L2
.L4:
	testl	%edx, %eax
	jne	.L3
	addl	%ecx, %edi
	shrl	%cl, %eax
	jmp	.L3
.L2:
	movl	%edi, %edx
	leaq	.LC0(%rip), %rsi
	movl	$1, %edi
	movl	$0, %eax
	call	__printf_chk@PLT
	...

這裡可以觀察到一些有趣的現象。

__gen_ffs 中 mask 變數確實透過編譯器的協助轉為常數 65535 (0xffff)，這符合預期。
- 更進一步的，將上述 util.h 第 13 行改為 mask = (in_type) ~(_ONES << shift); 後編譯為組語，發現得到一樣的結果。
- 本來顧慮到不使用全部直接是常數的巨集產開下可能導致 gcc 不會正確預先求出 mask 常數，事實證明編譯器實在太聰明了！
my_ffs (使用巨集生成的 ffs 函式) 組語迴圈也很有趣。
- 本來我舉燭認為 .L4 以降第 16、17 行對應 while (shift)，但在我發現原 C code 還有優化空間 (將 while 轉為 do-while)，調整並重新編譯為組語後，驚愕地發現「完全沒有變化」！
- 由此，我查詢 x86 的 test 指令發現，原來其非條件判斷，而是 bitwise AND 運算，對應第 13 行 x & shift 的部分。
- 重新解讀 x86 後發現，原來編譯器自動幫我將 while 轉為 do-while，不得不說真的很聰明，直接體會到編譯器的強大。

另外，對 x86_64 舉燭實在太累人了，不如把它學起來吧！
承襲於修讀計算機結構時學 RISC-V 的方法，把 cheat sheet 拿出來讀！

為啥我不早點拿出來呢… Shiritai

所以 Unrolling 去哪了

編譯器神通廣大，即使沒修過編譯器的我都聽過其中一個知名的功能：loop unrolling。透過將迴圈展開為只向下執行的程式碼，降低指令往前轉跳的 control hazard 和 instruction cache 成本。

參考本討論串，其討論到適合使用 loop unrolling 的時機：

對次數少的小迴圈有效，反之則要付出大代價。

平常依靠編譯器自動最佳化即可，正如 Donald Knuth 所言：

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Donald Knuth

明顯前述 gen_ffs 或者說 my_ffs 屬於適合 loop unrolling 的情況，於是我嘗試使用 -funroll-loops 這個 gcc 編譯參數，其嘗試在編譯時期展開可預測迴圈執行次數的迴圈:

cc main.c -O -funroll-loops -std=c99 -S

(左: 原版，右：unroll loops 版)

成功得到展開後的迴圈，而且由於將邏輯迴圈化之前的重複次數本來就很少，展開後才只多七行，除了預見效能的提升外想不到其他可能。

等等，為啥現在竟然成功 unroll 了？之前不行欸，我也嘗試把舊版 gen_ffs 拿來測試，怎麼這次又可以了？

極有可能的原因是之前有殘餘一個 util.h.gch 檔，即 precompiled header file，八成是它在搞鬼，導致不會重新編譯 my_ffs，此時自然在不用編譯 my_ffs 的情況下沒有任何迴圈可以展開。

Shiritai

更令我驚訝的是，除了能正確的 unroll loop，針對整數輸入型別展開為 5 次

(i . e . 16, 8, 4, 2, 1)

以外，編譯器甚至聰明到把最後一個

1

的情況優化成不去對 x 做無謂的運算，也就是 .L6 中不存在對 %eax (x) 的右移指令。

換言之，若將我的巨集生成函式應用於 Linux Kernel ffs/fls 系列函式的實作，不僅不會影響其效能，更能簡化原始碼和統一實作方法，同時提供一個能實作目前尚未實作之輸入型別的 ffs/fls，一舉三得！

那麼下一步便是搞清楚 Linux Kernel 是如何編譯這些檔案，以確認迴圈展開與常數傳遞等效果是否真正的被編譯器採用。

在程式碼間調整 compile 時採取的優化策略

希望調整 kernel 於編譯時，至少與 ffs 相關的部分採用 GCC 編譯時傳入 unroll-loops 參數的效果。

直接調整編譯參數有點不切實際，原因如下：

只希望 unroll 生成出來的函式，不希望影響到其他函式。
~~Linux 的 makefile 們 trace 起來好累…~~

另一方面，我想到 gcc 可能提供了額外的標記來解決問題，於是找到本討論串，其介紹了兩種方法：

pragma + GCC global optimize + push pop: 在函式前為 GCC 增加全域優化設定，函式後取消該新增。

pragma 表示編譯提示。

#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")

// function with loops to be unrolled

#pragma GCC pop_options

新增 attribute: 在函式前加上

__attribute__((optimize("unroll-loops")))
// function with loops to be unrolled

於是我嘗試實驗這兩者於以下編譯環境：

> gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

發現後者沒有效果，前者雖然有，但不知為何在巨集生成函式下沒有效果。我嘗試過以下做法：

直接暴力放 pragma
會報錯，在 define 中不能用 #...，故 #pragma 不能使用。(語法錯誤)。

#define ...                                  \
    #pragma GCC push_options                 \
    #pragma GCC optimize ("unroll-loops")    \
                                             \
    /* function with loops to be unrolled */ \
                                             \
    #pragma GCC pop_options

利用 C99 的功能: _Pragma 運算子替代 #pragma 巨集
這下語法不會錯誤了。但奇妙的是，這樣寫編譯後發現沒有效果！

#define ...                                   \
   _Pragma("GCC push_options")                \
   _Pragma("GCC optimize (\"unroll-loops\")") \
                                              \
    /* function with loops to be unrolled */  \
                                              \
   _Pragma("GCC pop_options")

於是我改為去掉最後一行 _Pragma("GCC pop_options")，編譯後發現竟然有效，但這樣調整全域優化設定非常不好，畢竟會影響到其他程式碼，我所追求的是只針對本函式的 loop unroll。

利用 GCC Loop-Specific Pragmas: 前述方法都是針對整個函式，經查找 GCC 規格書後發現還有此方法，透過標記在迴圈前一行告訴 GCC 要 unroll 幾次，語法為 #pragma GCC unroll n。

以下節錄 GCC 規格書，說明語法中 n 的意義：

n is an integer constant expression specifying the unrolling factor. The values of 0 and 1 block any unrolling of the loop.

於是我結合前述 _Pragma 運算子將程式改為：
```
#define ...                 \
    ...                     \
    _Pragma("GCC unroll 6") \
    do {                    \
        ...                 \
    } while (shift);        \
    ...
```
指定
$n = 6$ 是因為對於
$64$ 位元而言，迴圈將走訪
$s h i f t = 2^{5}, 2^{4}, \sim, 2^{0}$ 這幾項可能。

不過此時讀者可能會好奇，這樣不就限制我抽象為巨集生成函式的作法只能用於
$64$ bits？

答案是其實不會！

雖然規格書沒有詳細提及，不過實驗表明，當可 unroll 的次數
$n_{a v a i}$ 小於指派的次數，結果只會展開
$n_{a v a i}$ 次。畢竟 GCC 原本的 funroll-loops 本來就有能力分析迴圈合理的展開次數。

不過這裡的「舉燭」是否有疏漏？！我如何確認此效果於編譯 Linux kernel 時是否有效？
Shiritai