Try   HackMD

Linux 核心專題: 回顧 bitops 並改進

影片講解 (請看文件比較好)

TODO: 重現 回顧 bitops 並改進 實驗,針對 Linux v6.8+ (Ubuntu Linux 24.04 搭配的 Linux 核心版本)

任務簡述

Linux 核心的 include/asm-generic/bitops 存在若干實作不一致的狀況,本任務嘗試追蹤 git log 和 LKML,以理解發展脈絡,並嘗試用精簡有效的方式改寫,最後嘗試貢獻程式碼到 Linux 核心。

先備知識:

shiritai 的文件內提到

ffs__ffs 加雙底線與否的不同
首先先釐清 ffs__ffs (是否加雙底線) 的不同之處:參考 bitops 系列對外的介面: arch/arc/include/asm/bitops.h 中的註解得知
__XXX 系列: 以 index 0 索引,範圍為 0 ~ {length of input type - 1}
XXX 系列: 以 index 1 索引,範圍為 1 ~ {length of input type}

問題

為何會有 0 及 1 開始的問題 ?

讓我們先來看以下兩種 __ffs 實作:

第一種 __ffs

/*
 * __ffs: Similar to ffs, but zero based (0-31)
 */
static inline __attribute__ ((const)) unsigned long __ffs(unsigned long word)
{
	if (!word)
		return word;

	return ffs(word) - 1;
}

第二種 __ffs

/**
 * __ffs - find first bit in word.
 * @word: The word to search
 *
 * Undefined if no bit exists, so code should check against 0 first.
 */
static __always_inline unsigned long __ffs(unsigned long word)
{
	int num = 0;

#if __BITS_PER_LONG == 64
	if ((word & 0xffffffff) == 0) {
		num += 32;
		word >>= 32;
	}
#endif
	if ((word & 0xffff) == 0) {
		num += 16;
		word >>= 16;
	}
	if ((word & 0xff) == 0) {
		num += 8;
		word >>= 8;
	}
	if ((word & 0xf) == 0) {
		num += 4;
		word >>= 4;
	}
	if ((word & 0x3) == 0) {
		num += 2;
		word >>= 2;
	}
	if ((word & 0x1) == 0)
		num += 1;
	return num;
}

上面第一個並沒有說明若輸入 word 為 0 的情況該如何處理,但透過程式碼可以看出就是回傳 word ( 0 ),而下面則明確表示 word 為 0 是例外情況,需自行處理,若強行輸入函式將會出錯。

bottom up trace 第一個 __ffs

/* * Count the number of zeros, starting from MSB * Helper for fls( ) friends * This is a pure count, so (1-32) or (0-31) doesn't apply * It could be 0 to 32, based on num of 0's in there * clz(0x8000_0000) = 0, clz(0xFFFF_FFFF)=0, clz(0) = 32, clz(1) = 31 */ static inline __attribute__ ((const)) int clz(unsigned int x) { unsigned int res; __asm__ __volatile__( " norm.f %0, %1 \n" " mov.n %0, 0 \n" " add.p %0, %0, 1 \n" : "=r"(res) : "r"(x) : "cc"); return res; } static inline int constant_fls(unsigned int x) { int r = 32; if (!x) return 0; if (!(x & 0xffff0000u)) { x <<= 16; r -= 16; } if (!(x & 0xff000000u)) { x <<= 8; r -= 8; } if (!(x & 0xf0000000u)) { x <<= 4; r -= 4; } if (!(x & 0xc0000000u)) { x <<= 2; r -= 2; } if (!(x & 0x80000000u)) r -= 1; return r; } /* * fls = Find Last Set in word * @result: [1-32] * fls(1) = 1, fls(0x80000000) = 32, fls(0) = 0 */ // __builtin_constant_p(x):簡單來說就是去看 x 在編譯時間是否已經是已知常數 // 若是,則使用 constant_fls(x) 計算,反之使用 32 - clz(x) 。 static inline __attribute__ ((const)) int fls(unsigned int x) { if (__builtin_constant_p(x)) return constant_fls(x); return 32 - clz(x); } /* * ffs = Find First Set in word (LSB to MSB) * @result: [1-32], 0 if all 0's */ // 用途:僅保留最低位 set bit ,目的為透過 fls 計算 ffs 。 // 而會透過 fls 計算 ffs 我認為是因為 code size 考量, // 概念相同的函式盡量透過同一個函式達成,對不同情況設計多種函式是不必要的。 #define ffs(x) ({ unsigned long __t = (x); fls(__t & -__t); })

Operating Systems 2 - Synchronization in Linux Kernel 裡提到

There are also nonatomic version of those functions, which names starts with __ (double underscore) prefix. The kernel also provides two function that search for the first set or cleared bit starting at a given address. Those are called respectively find_first_bit() and find_first_zero_bit(). If only one word (32 or 64 bits) should be searched, then the faster functions __ffs() and __ffz() can be applied.

這些 double underscore prefix 的函式是 nonatomic version

而在 Leading Underscores in Linux Kernel Programming 中提到

Identifiers starting with a double underscore are reserved by the C programming language specification. This means, it is illegal to define your own identifiers starting with a double underscore. (The same applies to identifiers starting with a single underscore followed by a capital letter.)

So, technically, this means that the source code of the Linux kernel is not standards-compliant C code.

However, the Linux kernel is very low-level code and it can only be compiled with specific versions of a small number of C compilers, therefore, the Linux kernel developers know exactly which __foo identifiers are used by the compiler and which they can use themselves.

So, the Linux kernel uses identifiers starting with double underscore for the same reason that the C standard forbids them: to make sure, there can be no conflicts with user-defined identifiers.

For example, some Linux kernel header files end up being included in the low-level system libraries (GNU libc, musl, dietlibc, etc.), and using identifiers that are explicitly reserved guarantees that this will not lead to conflicts.

也就是供內部執行並避免命名衝突。

事實上第一種實作的 __ffs 內的檢查 word 是否為 0 是不必要的,因為其呼叫的 ffs 已檢查。下一段中會列舉出 Linux v6.8+ 內include/asm-generic/bitops 內 9 種 ffsfls 實作

在以下列出各種實作方法後可以大致歸納出:

  1. __ prefix 會接收 unsigned long 的變數來計算,而 no-__ prefix 則接收 unsigned int/int 的變數來計算。
  2. __ prefix start from 0 , no-__ prefix start from 1
  3. __ prefix 皆無法處理輸入為 0 的情況,而 no-__ prefix 皆可處理
  4. 若可處理輸入為 0 ,只要輸入為 0 則必須輸出 0 ,因此 no-__ prefix 只要輸出為 0 則代表輸入為 0 ,若 __ prefix 處理輸入為 0 的方法是輸出 0 則會出現以下問題

第一種是可以接受輸入 word 為 0 的情況,那麼有個問題是第一個的回傳值範圍是 0-31 ,若 word 為 0 將會回傳 0 ,那麼這個 0 代表的到底是 first set bit 在第 0 位還是 word 為 0 ?

這樣其實可以大致想出 __ prefix 因為內部使用,在使用前已經避免輸入為 0 的情況因此可以直接使用,而 no-__ prefix 供外部使用可以直接處理輸入為 0 的情況。

因此會分出 0 based 及 1 based 的原因是在於,當使用者在使用 no __ prefix 的函式時可以明確知道其輸入的變數為 0 ,而 __ prefix 則無法得知此一資訊,甚至會因為不知道 0 是代表其 set bit 的位置還是輸入就是 0 。

更新問題至針對 Linux v6.8+

\(\star\) 針對 Linux v6.8+
在目前 Linux 核心版本 6.9.3 中,include/asm-generic/bitops 共有 9 種 ffsfls 實作。
如下:

  1. zero based (不能處理 word 為 0 的情況)
    include/asm-generic/bitops/__ffs.h
    此種為老師教材內的版本,即使用 binary search 版本,可以直接透過 !if 判斷式內寫得更簡潔。
static __always_inline unsigned int generic___ffs(unsigned long word)
{
	unsigned int num = 0;

#if BITS_PER_LONG == 64
	if ((word & 0xffffffff) == 0) {
		num += 32;
		word >>= 32;
	}
#endif
	if ((word & 0xffff) == 0) {
		num += 16;
		word >>= 16;
	}
	if ((word & 0xff) == 0) {
		num += 8;
		word >>= 8;
	}
	if ((word & 0xf) == 0) {
		num += 4;
		word >>= 4;
	}
	if ((word & 0x3) == 0) {
		num += 2;
		word >>= 2;
	}
	if ((word & 0x1) == 0)
		num += 1;
	return num;
}
  1. zero based (不能處理 word 為 0 的情況)
    include/asm-generic/bitops/__fls.h
    同為老師教材內的版本
static __always_inline unsigned int generic___fls(unsigned long word)
{
	unsigned int num = BITS_PER_LONG - 1;

#if BITS_PER_LONG == 64
	if (!(word & (~0ul << 32))) {
		num -= 32;
		word <<= 32;
	}
#endif
	if (!(word & (~0ul << (BITS_PER_LONG-16)))) {
		num -= 16;
		word <<= 16;
	}
	if (!(word & (~0ul << (BITS_PER_LONG-8)))) {
		num -= 8;
		word <<= 8;
	}
	if (!(word & (~0ul << (BITS_PER_LONG-4)))) {
		num -= 4;
		word <<= 4;
	}
	if (!(word & (~0ul << (BITS_PER_LONG-2)))) {
		num -= 2;
		word <<= 2;
	}
	if (!(word & (~0ul << (BITS_PER_LONG-1))))
		num -= 1;
	return num;
}
  1. zero based ( 不能處理 word 為 0 的情況)
    include/asm-generic/bitops/builtin-__ffs.h
    使用了 __builtin_ctzl 計算 tail zero ,即等同 __ffs (這裡是有 prefix ____builtin_ctzl ,是 zero based )
static __always_inline unsigned int __ffs(unsigned long word)
{
	return __builtin_ctzl(word);
}
  1. zero based (不能處理 word 為 0 的情況)
    include/asm-generic/bitops/builtin-__fls.h
    此種 __fls 使用了 __builtin_clzl ( 這裡雖然是有 prefix ____builtin_clzl ,但它居然是 one based !! ) 先計算 leading zero 再透過一個 word size bits 減去 leading zero 及 1 得到 fls
static __always_inline unsigned int __fls(unsigned long word)
{
	return (sizeof(word) * 8) - 1 - __builtin_clzl(word);
}
  1. one based ( 註解未明說,但應該可以處理 word 為 0 的情況 )
    include/asm-generic/bitops/builtin-ffs.h
    直接使用 __builtin_ffs 達成,但具有 prefix ____builtin_ffs ,但它居然是 one based !!
#define ffs(x) __builtin_ffs(x)
  1. one based ( 可處理 word 為 0 的情況 )
    include/asm-generic/bitops/builtin-fls.h
    作法應與上方類似,但大相逕庭,反而與第 4 種較相同。
static __always_inline int fls(unsigned int x)
{
	return x ? sizeof(x) * 8 - __builtin_clz(x) : 0;
}
  1. one based ( 可處理 word 為 0 的情況 )
    include/asm-generic/bitops/ffs.h
    這個是處理有號整數 ( integer ) 的版本,同時可以看到 x >>= 1; 是可以被省略的。

為什麼是 32 bits ???

static inline int generic_ffs(int x)
{
	int r = 1;

	if (!x)
		return 0;
	if (!(x & 0xffff)) {
		x >>= 16;
		r += 16;
	}
	if (!(x & 0xff)) {
		x >>= 8;
		r += 8;
	}
	if (!(x & 0xf)) {
		x >>= 4;
		r += 4;
	}
	if (!(x & 3)) {
		x >>= 2;
		r += 2;
	}
	if (!(x & 1)) {
		x >>= 1;
		r += 1;
	}
	return r;
}
  1. one based ( 可處理 word 為 0 的情況 )
    include/asm-generic/bitops/fls.h

這裡處理無號整數 ( unsigned integer ) ,為何上一種是有號?

我想不太出來為何這裡為什麼會是 ffs 接收一有號整數,而 fls 接收一無號整數,並沒有經過比較運算,以二進制表示的話無論是否 unsigned 皆相同,並沒有必要以不同方式接收。

static __always_inline int generic_fls(unsigned int x)
{
	int r = 32;

	if (!x)
		return 0;
	if (!(x & 0xffff0000u)) {
		x <<= 16;
		r -= 16;
	}
	if (!(x & 0xff000000u)) {
		x <<= 8;
		r -= 8;
	}
	if (!(x & 0xf0000000u)) {
		x <<= 4;
		r -= 4;
	}
	if (!(x & 0xc0000000u)) {
		x <<= 2;
		r -= 2;
	}
	if (!(x & 0x80000000u)) {
		x <<= 1;
		r -= 1;
	}
	return r;
}
  1. one based ( 可處理 word 為 0 的情況 )
    include/asm-generic/bitops/fls64.h
    明明都是 one based ,卻 32 bits 版本使用 fls , 64 bits 版本使用 __fls 再加 1 。
#if BITS_PER_LONG == 32
static __always_inline int fls64(__u64 x)
{
	__u32 h = x >> 32;
	if (h)
		return fls(h) + 32;
	return fls(x);
}
#elif BITS_PER_LONG == 64
static __always_inline int fls64(__u64 x)
{
	if (x == 0)
		return 0;
	return __fls(x) + 1;
}
#else
#error BITS_PER_LONG not 32 or 64
#endif

其中 __builtin_XXX 的函式本身以 CPU 硬體指令實作,過濾例外情況 0 作為輸入以避免 undefined behavior 是必備手續

這幾個版本完全看不出兩者的差別是 nonatomic 及 atomic ( 不可被分割的操作就是 atomic operation ,可以免去使用 spin lock 以防止資料寫入讀取衝突)

6 有對 x 為 0 的情況作處理,而 5 則無。
單單僅是要計算 last set bit 及 first set bit ,就用了多達 9 種函式,這尚未包含其它處理器使用機械碼的版本,如 x86, csky, m68k, s390 等,以 arch/csky/include/asm/bitops.h 為例:
就是用 assembly code 的方式寫

static inline int ffs(int x)
{
	if (!x)
		return 0;

	asm volatile (
		"brev %0\n"
		"ff1  %0\n"
		"addi %0, 1\n"
		: "=&r"(x)
		: "0"(x));
	return x;
}

整合這些複雜的 ffs 及 fls 成一範式將可以使

  • 程式碼簡潔
  • 可以輕鬆維護並同步現有不同輸入輸出型別之 f{f,l}s 系列函式的軟體實作
  • 統一 Undefined behavior 的處理方式
  • 為未來 \(128\) bits 版的實作乃至於其他型別的實作給出輕鬆擴展的解決方案

前篇執行人 <Shiritai> 完成了此一範式並透過理解編譯器優化改善執行上效能的不足。

在重現前為避免隱藏成本 (編譯時間、 code size ) 過高,我閱讀了 Linux kernel coding style 裡的 The inline disease

15) The inline disease

There appears to be a common misperception that gcc has a magic “make me faster” speedup option called inline. While the use of inlines can be appropriate (for example as a means of replacing macros, see Chapter 12), it very often is not. Abundant use of the inline keyword leads to a much bigger kernel, which in turn slows the system as a whole down, due to a bigger icache footprint for the CPU and simply because there is less memory available for the pagecache. Just think about it; a pagecache miss causes a disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles that can go into these 5 milliseconds.

A reasonable rule of thumb is to not put inline at functions that have more than 3 lines of code in them. An exception to this rule are the cases where a parameter is known to be a compiletime constant, and as a result of this constantness you know the compiler will be able to optimize most of your function away at compile time. For a good example of this later case, see the kmalloc() inline function.

Often people argue that adding inline to functions that are static and used only once is always a win since there is no space tradeoff. While this is technically correct, gcc is capable of inlining these automatically without help, and the maintenance issue of removing the inline when a second user appears outweighs the potential value of the hint that tells gcc to do something it would have done anyway.

簡單來說,裡面提到 inline 可以加快運行速度,但如果太多函式使用 inline 會導致核心過大而導致 cache 佔用空間更大,這會直接增大 cache miss 機會,根本得不償失,因此根據經驗,使用 inline 的函式需少於 3 行,而例外情況是函式內大多為常數,因為編譯器可以很好的最佳化這種函式,而上面的版本 1, 2, 7, 8 正使用了大量的常數而使其突破了此限制。

不過在 linux/include/linux/rbtree.h 裡的函式 rb_find_add 似乎無視此限制,況且裡面呼叫了其他函式。因為我並不太清楚這其中明確的 inline 使用與否的界線,所以這裡暫時不去處理,但會盡量簡化其程式碼。

一致的實作手段改寫 bitops 程式碼

#define __FFLS_START_INDEX_FROM_ZERO 0
#define __FFLS_START_INDEX_FROM_ONE 1

/**
 * Hidden generator of ffs to hide evaluation of size of type
 */
#define __gen_ffs(in_type, out_type, start_from, fn_name, bit_len) \
    static __always_inline out_type fn_name(in_type x)             \
    {                                                              \
        int bits, shift;                                           \
        in_type mask;                                              \
        if (!x)                                                    \
            return 0;                                              \
        bits = start_from;                                         \
        shift = bit_len >> 1;                                      \
        mask = (in_type) ~(0xFFFFFFFFFFFFFFFF << shift);           \
        do {                                                       \
            if (!(x & mask)) {                                     \
                bits += shift;                                     \
                x >>= shift;                                       \
            }                                                      \
            shift >>= 1;                                           \
            mask >>= shift;                                        \
        } while (shift);                                           \
        return bits;                                               \
    }

/**
 * gen_ffs() - Macro to generate function series of
 * ffs series - find first (least-significant) bit set.
 *
 * Generated function parameter:
 * @x: the object to search
 *
 * Macro arguments:
 * @in_type: input type of ffs
 * @out_type: output type of ffs
 * @start_from: starting index of generated ffs to generate both
 *              ffs series (return 1 ~ 64) and __ffs series (return 0 ~ 63)
 *
 *              Note if:
 *              start_from is __FFLS_START_INDEX_FROM_ZERO, then ffs(0) = 0,
 *                  ffs(1) = 0, ffs(object leads with 1) = sizeof object - 1.
 *              start_from is __FFLS_START_INDEX_FROM_ONE, then ffs(0) = 0,
 *                  ffs(1) = 1, ffs(object leads with 1) = sizeof object.
 * @fn_name: Function name to be generated
 */
#define gen_ffs(in_type, out_type, start_from, fn_name) \
    __gen_ffs(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8))

Linux 預設使用的 O2 最佳化選項,以下為原先作者版本產出之 Assembly code

$ gcc -S -O2 genmyffs_ori.c genmyffs_ori.s
main:
.LFB52:
	.cfi_startproc
	endbr64
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	$5, %eax
	movl	$16, %ecx
	xorl	%r8d, %r8d
	movl	$65535, %edx
	movl	$4554, %esi
.L2:
	sarl	%ecx
	shrl	%cl, %edx
	subl	$1, %eax
	je	.L7
.L4:
	testl	%esi, %edx
	jne	.L2
	addl	%ecx, %r8d
	shrl	%cl, %esi
	sarl	%ecx
	shrl	%cl, %edx
	subl	$1, %eax
	jne	.L4
.L7:
	movl	%r8d, %edx
	leaq	.LC0(%rip), %rsi
	movl	$1, %edi
	xorl	%eax, %eax
	call	__printf_chk@PLT
	xorl	%eax, %eax
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc

原版程式碼使用 loop unrolling

$ gcc -S -O2 -funroll-loops genmyffs_ori.c genmyffs_ori.s
main:
.LFB52:
	.cfi_startproc
	endbr64
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	$1, %edx
	movl	$1, %edi
	xorl	%eax, %eax
	leaq	.LC0(%rip), %rsi
	call	__printf_chk@PLT
	xorl	%eax, %eax
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc

可以看到原版程式碼因為僅使用少數的常數操作可能會導致較不適合使用 inline ,除了先檢查編譯時是否為已知常數也可直接修改其程式碼。我更新為以下,直接免除了需要 loop unroll 的操作,並且是在使用大量常數的情況下使用 inline ,也使用了原本就用的 ffs/fls 操作,使維護者無需再費心重新理解其演算法。另外為了處理 128 bits 或者更大的的整數類型,我在前面使用了 while 迴圈,後面多個 if 判斷式則是因為 loop unroll ,若超出 32 bits 才進入 while 迴圈。

#define __FFLS_START_INDEX_FROM_ZERO 0 #define __FFLS_START_INDEX_FROM_ONE 1 /** * Hidden generator of ffs to hide evaluation of size of type */ #define __gen_ffs(in_type, out_type, start_from, fn_name, bit_len) \ static __always_inline out_type fn_name(in_type x) \ { \ if (!x) \ return 0; \ out_type num = start_from; \ out_type shift = bit_len >> 1; \ while (shift >= 32) { \ if (!(x & ~(((out_type) -1 >> shift) << shift))) { \ num += shift; \ x >>= shift; \ } \ shift >>= 1; \ } \ if (!(x & 0xffff)) { \ num += 16; \ x >>= 16; \ } \ if (!(x & 0xff)) { \ num += 8; \ x >>= 8; \ } \ if (!(x & 0xf)) { \ num += 4; \ x >>= 4; \ } \ if (!(x & 0x3)) { \ num += 2; \ x >>= 2; \ } \ if (!(x & 0x1)) \ num += 1; \ return num; \ } /** * gen_ffs() - Macro to generate function series of * ffs series - find first (least-significant) bit set. * * Generated function parameter: * @x: the object to search * * Macro arguments: * @in_type: input type of ffs * @out_type: output type of ffs * @start_from: starting index of generated ffs to generate both * ffs series (return 1 ~ 64) and __ffs series (return 0 ~ 63) * * Note if: * start_from is __FFLS_START_INDEX_FROM_ZERO, then ffs(0) = 0, * ffs(1) = 0, ffs(object leads with 1) = sizeof object - 1. * start_from is __FFLS_START_INDEX_FROM_ONE, then ffs(0) = 0, * ffs(1) = 1, ffs(object leads with 1) = sizeof object. * @fn_name: Function name to be generated */ #define gen_ffs(in_type, out_type, start_from, fn_name) \ __gen_ffs(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8))

我的版本可直接使用

$ gcc -S -O2 genmyffs_ori.c genmyffs_ori.s
main:
.LFB52:
	.cfi_startproc
	endbr64
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	$1, %edx
	movl	$1, %edi
	xorl	%eax, %eax
	leaq	.LC0(%rip), %rsi
	call	__printf_chk@PLT
	xorl	%eax, %eax
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc

得到原始版本經過 loop unroll 一模一樣的版本,但無須再經過 loop unroll 的操作。

若不使用常數操作而並合併入 while 迴圈 ( 也較原先版本更簡短 )

#define __gen_ffs(in_type, out_type, start_from, fn_name, bit_len) \
    static __always_inline out_type fn_name(in_type x)             \
    {                                                              \
        if (!x)                                                    \
            return 0;                                              \
        out_type num = start_from;                                 \
        out_type shift = bit_len >> 1;                             \
        while (shift >= 1) {                                      \
            if (!(x & ~(((out_type) -1 >> shift) << shift))) {     \
                num += shift;                                      \
                x >>= shift;                                       \
            }                                                      \
            shift >>= 1;                                           \
        }                                                          \
        return num;                                                \
    }

直接使用

$ gcc -S -O2 genmyffs_ori.c genmyffs_ori.s
main:
.LFB52:
	.cfi_startproc
	endbr64
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	$5, %esi
	xorl	%r8d, %r8d
	movl	$4554, %edi
	movl	$16, %eax
	movl	$-1, %r9d
	.p2align 4,,10
	.p2align 3
.L3:
	movl	%eax, %ecx
	movl	%r9d, %edx
	sall	%cl, %edx
	movl	%eax, %ecx
	sarl	%eax
	notl	%edx
	testl	%edi, %edx
	jne	.L2
	addl	%ecx, %r8d
	shrl	%cl, %edi
.L2:
	subl	$1, %esi
	jne	.L3
	movl	%r8d, %edx
	leaq	.LC0(%rip), %rsi
	movl	$1, %edi
	xorl	%eax, %eax
	call	__printf_chk@PLT
	xorl	%eax, %eax
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc

機器碼則會與原先版本未經 loop unroll 相同,因此不併入並直接將迴圈展開會較佳。

精準最佳化:調整 compile 時特定位置採取的優化策略

接下來, Shiritai 僅希望在特定位置採用 loop unroll ,而採用了之後的 Assembly code 如下:

$ gcc -S -O2 genmyffs_ori.c genmyffs_ori.s
main:
.LFB52:
	.cfi_startproc
	endbr64
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	$1, %edx
	movl	$1, %edi
	xorl	%eax, %eax
	leaq	.LC0(%rip), %rsi
	call	__printf_chk@PLT
	xorl	%eax, %eax
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc

可以看到與修改編譯參數的結果相同,但我的版本仍無需做這些操作。

更進一步,合併 clz/ffz 等系列函式

以下為 ffs/flsffzclz/ctz 的圖示,可以發現其實目的差不多,並且僅差一位
若使用 ffs/fls、clz/ctz 函式,將會得到 (1 based):

16-fls(x)
fls(x)
ffs(x)
ffs(x)-1
ffs(~x)
0b0010100111000000
clz: 2
fls: 14
ffs: 7
ctz: 6
ffz: 1

透過以上關係, Shiritai 認為可以將五種函式結合,先以相同方式實作 fls

#define __gen_fls(in_type, out_type, start_from, fn_name, bit_len) \
    static __always_inline out_type fn_name(in_type x)             \
    {                                                              \
        if (!x)                                                    \
            return 0;                                              \
        out_type num = bit_len + start_from - 1;                   \
        out_type shift = bit_len >> 1;                             \
        while (shift >= 32) {                                      \
            if (!(x & (~0ul << (bit_len-shift)))) {                \
                num -= shift;                                      \
                x <<= shift;                                       \
            }                                                      \
            shift >>= 1;                                           \
        }                                                          \
        if (!(x & (~0ul << (bit_len-16)))) {                       \
            num -= 16;                                             \
            x <<= 16;                                              \
        }                                                          \
        if (!(x & (~0ul << (bit_len-8)))) {                        \
            num -= 8;                                              \
            x <<= 8;                                               \
        }                                                          \
        if (!(x & (~0ul << (bit_len-4)))) {                        \
            num -= 4;                                              \
            x <<= 4;                                               \
        }                                                          \
        if (!(x & (~0ul << (bit_len-2)))) {                        \
            num -= 2;                                              \
            x <<= 2;                                               \
        }                                                          \
        if (!(x & (~0ul << (bit_len-1))))                          \
            num -= 1;                                              \
        return num;                                                \
    }

以下為 Shiritai 筆記中提供的示意圖及實作







G


cluster_gen_ffs

__gen_ffs


cluster_gen_fls

__gen_fls



__gen_ffs

__gen_ffs



gen_ctz

gen_ctz



__gen_ffs->gen_ctz





gen_ffs

gen_ffs



__gen_ffs->gen_ffs





ctz

ctz



gen_ctz->ctz





ffs

ffs



gen_ffs->ffs





__ffs

__ffs



gen_ffs->__ffs





ffz

ffz



__ffs->ffz





__gen_fls

__gen_fls



gen_fls

gen_fls



__gen_fls->gen_fls





__gen_clz

__gen_clz



__gen_fls->__gen_clz





fls

fls



gen_fls->fls





__fls

__fls



gen_fls->__fls





constant_fls

constant_fls



gen_fls->constant_fls





gen_clz

gen_clz



__gen_clz->gen_clz





clz

clz



gen_clz->clz





#define __FFLS_START_INDEX_FROM_ZERO 0
#define __FFLS_START_INDEX_FROM_ONE 1

// hidden implementation
#define __gen_ffs(in_type, out_type, start_from, fn_name, bit_len) \
    ... // above

// developers should call this
#define gen_ffs(in_type, out_type, start_from, fn_name) \
    __gen_ffs(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8))

// hidden implementation
#define __gen_fls(in_type, out_type, start_from, fn_name, bit_len) \
    ... // above

// developers should call this
#define gen_fls(in_type, out_type, start_from, fn_name) \
    __gen_fls(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8))

// hidden implementation
#define __gen_clz(in_type, out_type, fn_name, bit_len)  \
    __gen_fls(in_type, out_type,                        \
            __FFLS_START_INDEX_FROM_ONE,                \
            __fls_##fn_name, bit_len);                  \
    static __always_inline out_type fn_name(in_type x)  \
    {                                                   \
        return bit_len - __fls_##fn_name(x);            \
    }

// developers should call this
#define gen_clz(in_type, out_type, fn_name) \
    __gen_clz(in_type, out_type, fn_name, (sizeof(in_type) * 8))

// developers should call this
#define gen_ctz(in_type, out_type, fn_name) \
    __gen_ffs(in_type, out_type,            \
        __FFLS_START_INDEX_FROM_ZERO,       \
        fn_name, (sizeof(in_type) * 8))

// hidden implementation
#define __gen_ffz(in_type, out_type, fn_name, bit_len)  \
    __gen_ffs(in_type, out_type,                        \
            __FFLS_START_INDEX_FROM_ZERO,               \
            __ffs_##fn_name, bit_len);                  \
    static __always_inline out_type fn_name(in_type x)  \
    {                                                   \
        return __ffs_##fn_name(~x);                     \
    }

#define gen_ffz(in_type, out_type, fn_name) \
    __gen_ffz(in_type, out_type, fn_name, (sizeof(in_type) * 8))

不過事實上 Linux kernel 內的 clz/ctz 已經找不到軟體實作,皆是前綴 __builtin 的硬體實作及因為 CLZ 操作可以通過簡單的硬體電路實現,因此在硬體設計中增加這一指令的成本較低而成為組合語言的指令之一,再去強迫他使用軟體實作非常不實際且多餘,因此我將 clz/ctz 移去,並且因為 ffz 在 Linux kernel 內程式碼僅短短一行,相較合併的版本,我認為原先版本更簡潔,因此以上部份部分我不打算納入改進,而單純嘗試合併一開始的 ffs/fls (前綴 builtin_ 可除外,因為效能會差非常多,並且原程式碼非常簡潔,約為 1 ~ 2 行)。

/*
 * ffz - find first zero in word.
 * @word: The word to search
 *
 * Undefined if no zero exists, so code should check against ~0UL first.
 */
#define ffz(x)  __ffs(~(x))

最後我決定僅合併 __ffs/__fls/ffs/fls 四種

TODO: 針對上述 bitops 改進,提出驗證的方式,如 Linux 核心模組, UML (見建構 User-Mode Linux 的實驗環境) 或 virtme,比對產生的機械碼 (至少是 x86-64 和 Arm64),確保提出的改進方案在「行為」層面相容於原有的實作

使用 UML 實驗環境 ( Linux 版本 6.8.9 )

作法:

  1. 建立一 header file 於 include/asm-generic/bitops 同層目錄,如下
/**
 * Hidden generator of ffs to hide evaluation of size of type
 */
#define __gen_ffs(in_type, out_type, start_from, fn_name, bit_len) \
    static __always_inline out_type fn_name(in_type x)             \
    {                                                              \
        if (!x)                                                    \
            return 0;                                              \
        out_type num = start_from;                                 \

        out_type shift = bit_len >> 1;                             \
        while (shift >= 32) {                                      \
            if (!(x & ~(((out_type) -1 >> shift) << shift))) {     \
                num += shift;                                      \
                x >>= shift;                                       \
            }                                                      \
            shift >>= 1;                                           \
        }                                                          \
        if (!(x & 0xffff)) {                                       \
            num += 16;                                             \
            x >>= 16;                                              \
        }                                                          \
        if (!(x & 0xff)) {                                         \
            num += 8;                                              \
            x >>= 8;                                               \
        }                                                          \
        if (!(x & 0xf)) {                                          \
            num += 4;                                              \
            x >>= 4;                                               \
        }                                                          \
        if (!(x & 0x3)) {                                          \
            num += 2;                                              \
            x >>= 2;                                               \
        }                                                          \
        if (!(x & 0x1))                                            \
            num += 1;                                              \
        return num;                                                \
    }
    
#define __gen_fls(in_type, out_type, start_from, fn_name, bit_len) \
    static __always_inline out_type fn_name(in_type x)             \
    {                                                              \
        if (!x)                                                    \
            return 0;                                              \
        out_type num = bit_len - 1 + start_from;                   \
        out_type shift = bit_len >> 1;                             \
        while (shift >= 32) {                                      \
            if (!(x & (~0ul << (bit_len-shift)))) {                \
                num -= shift;                                      \
                x <<= shift;                                       \
            }                                                      \
            shift >>= 1;                                           \
        }                                                          \
        if (!(x & (~0ul << (bit_len-16)))) {                       \
            num -= 16;                                             \
            x <<= 16;                                              \
        }                                                          \
        if (!(x & (~0ul << (bit_len-8)))) {                        \
            num -= 8;                                              \
            x <<= 8;                                               \
        }                                                          \
        if (!(x & (~0ul << (bit_len-4)))) {                        \
            num -= 4;                                              \
            x <<= 4;                                               \
        }                                                          \
        if (!(x & (~0ul << (bit_len-2)))) {                        \
            num -= 2;                                              \
            x <<= 2;                                               \
        }                                                          \
        if (!(x & (~0ul << (bit_len-1))))                          \
            num -= 1;                                              \
        return num;                                                \
    }
    
/**
 * gen_ffs() - Macro to generate function series of
 * ffs series - find first (least-significant) bit set.
 *
 * Generated function parameter:
 * @x: the object to search
 *
 * Macro arguments:
 * @in_type: input type of ffs
 * @out_type: output type of ffs
 * @start_from: starting index of generated ffs to generate both
 *              ffs series (return 1 ~ 64) and __ffs series (return 0 ~ 63)
 *
 *              Note if:
 *              start_from is __FFLS_START_INDEX_FROM_ZERO, then ffs(0) = 0,
 *                  ffs(1) = 0, ffs(object leads with 1) = sizeof object - 1.
 *              start_from is __FFLS_START_INDEX_FROM_ONE, then ffs(0) = 0,
 *                 ffs(1) = 1, ffs(object leads with 1) = sizeof object.
 * @fn_name: Function name to be generated
 */

#define gen_ffs(in_type, out_type, start_from, fn_name) \
    __gen_ffs(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8))
    
#define gen_fls(in_type, out_type, start_from, fn_name) \
    __gen_fls(in_type, out_type, start_from, fn_name, (sizeof(in_type) * 8))
  1. 將每一個 ffs/fls 皆改為使用上方 header file 產生的版本
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_GENERIC_BITOPS___FFS_H_
#define _ASM_GENERIC_BITOPS___FFS_H_
#include <asm/types.h>
/**
 * __ffs - find first bit in word.
 * @word: The word to search
 *
 * Undefined if no bit exists, so code should check against 0 first.
 */
// static __always_inline unsigned long __ffs(unsigned long word)
// {
// 	int num = 0;
// 
// #if BITS_PER_LONG == 64
// 	if ((word & 0xffffffff) == 0) {
// 		num += 32;
// 		word >>= 32;
// 	}
// #endif
// 	if ((word & 0xffff) == 0) {
// 		num += 16;
// 		word >>= 16;
// 	}
// 	if ((word & 0xff) == 0) {
// 		num += 8;
// 		word >>= 8;
// 	}
// 	if ((word & 0xf) == 0) {
// 		num += 4;
// 		word >>= 4;
// 	}
// 	if ((word & 0x3) == 0) {
// 		num += 2;
// 		word >>= 2;
// 	}
// 	if ((word & 0x1) == 0)
// 		num += 1;
// 	return num;
// 
#include "gen_ffs.h"
gen_ffs(unsigned long, unsigned long, 0, __ffs);

#endif /* _ASM_GENERIC_BITOPS___FFS_H_ */
  1. 編譯後進入 UML 並輸入 dmesg 查看開機訊息
Linux version 6.8.9 (root@venv) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #1 Mon Jun 24 14:39:18 CST 2024
Zone ranges:
  Normal   [mem 0x0000000000000000-0x0000000064a00fff]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000000000-0x0000000004a00fff]
Initmem setup node 0 [mem 0x0000000000000000-0x0000000004a00fff]
random: crng init done
pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
pcpu-alloc: [0] 0 
Kernel command line: ubd0=/dev/null hostname=uml1 eth0=tuntap,tap0 root=/dev/root rootfstype=hostfs hostfs=./rootfs rw mem=64M init=/init.sh quiet console=tty0
Unknown kernel command line parameters "hostfs=./rootfs mem=64M", will be passed to user space.
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
Sorting __ex_table...
Built 1 zonelists, mobility grouping on.  Total pages: 18685
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 58000K/75780K available (3612K kernel code, 971K rwdata, 1244K rodata, 153K init, 227K bss, 17780K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS: 64
clocksource: timer: mask: 0xffffffffffffffff max_cycles: 0x1cd42e205, max_idle_ns: 881590404426 ns
Calibrating delay loop... 3114.59 BogoMIPS (lpj=15572992)
Checking that host ptys support output SIGIO...Yes
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
devtmpfs: initialized
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
futex hash table entries: 256 (order: 0, 6144 bytes, linear)
NET: Registered PF_NETLINK/PF_ROUTE protocol family
pps_core: LinuxPPS API ver. 1 registered
pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
PTP clock support registered
clocksource: Switched to clocksource timer
VFS: Disk quotas dquot_6.6.0
VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
NET: Registered PF_INET protocol family
IP idents hash table entries: 2048 (order: 2, 16384 bytes, linear)
tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
TCP established hash table entries: 1024 (order: 1, 8192 bytes, linear)
TCP bind hash table entries: 1024 (order: 2, 16384 bytes, linear)
TCP: Hash tables configured (established 1024 bind 1024)
UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
NET: Registered PF_UNIX/PF_LOCAL protocol family
printk: legacy console [stderr0] disabled
mconsole (version 2) initialized on /home/vboxuser/.uml/uml0/mconsole
Checking host MADV_REMOVE support...OK
workingset: timestamp_bits=62 max_order=14 bucket_order=0
io scheduler mq-deadline registered
io scheduler kyber registered
NET: Registered PF_PACKET protocol family
Initialized stdio console driver
Console initialized on /dev/tty0
printk: legacy console [tty0] enabled
Initializing software serial port version 1
printk: legacy console [mc-1] enabled
read_cow_header - short header
Choosing a random ethernet address for device eth0
Netdevice 0 (12:06:bd:51:1c:39) : 
TUN/TAP backend - 
register_winch : failed to write synchronization byte, err = 9
VFS: Mounted root (hostfs filesystem) on device 0:13.
devtmpfs: mounted
This architecture does not have kernel memory protection.
Run /init.sh as init process
  with arguments:
    /init.sh
  with environment:
    HOME=/
    TERM=linux
    hostfs=./rootfs
    mem=64M

可以看到並沒有出現錯誤,不過也有可能是因為根本沒被測試到,因為 ffs/fls 並沒有出現在 linux/lib 內 prefix test.c 檔中,且即使我修改的內容有誤,其也沒有發出編譯錯誤的訊息。而因為其功能太過基本,另外寫一個函式測試又顯得多餘,因此嘗試掛載使用 ffs/fls 函式的檔案並檢查是否正常執行。

掛載測試模組至核心

以下為測試模組,不過並沒有測試到非常完全

#include <linux/module.h>   // Needed by all modules
#include <linux/kernel.h>   // Needed for KERN_INFO
#include <linux/init.h>     // Needed for the macros
#include <asm-generic/bitops/__ffs.h>
#include <asm-generic/bitops/__fls.h>
#include <asm-generic/bitops/ffs.h>
#include <asm-generic/bitops/fls.h>

MODULE_LICENSE("GPL");

// Function to be called when the module is loaded
static int __init ffls_module_init(void)
{
    for(unsigned long i = 0; i < 3; i++){
        printk(KERN_INFO "__ffs : %lu, %lu\n", __generic_ffs(i), __ffs(i));
        printk(KERN_INFO "ffs : %d, %d\n", generic_ffs(i), ffs(i));
        printk(KERN_INFO "__fls : %lu, %lu\n", __generic_fls(i), __fls(i));
        printk(KERN_INFO "fls : %d, %d\n\n", generic_fls(i), fls(i));
    }
    for(unsigned long i = 23424512321; i < 23424512324; i++){
        printk(KERN_INFO "__ffs : %lu, %lu\n", __generic_ffs(i), __ffs(i));
        printk(KERN_INFO "ffs : %d, %d\n", generic_ffs(i), ffs(i));
        printk(KERN_INFO "__fls : %lu, %lu\n", __generic_fls(i), __fls(i));
        printk(KERN_INFO "fls : %d, %d\n\n", generic_fls(i), fls(i));
    }

    return 0; // Return 0 means module loaded successfully
}

// Function to be called when the module is removed
static void __exit ffls_module_exit(void)
{
    printk(KERN_INFO "Goodbye, World!\n");
}

// Macros for registering module entry and exit points
module_init(ffls_module_init);
module_exit(ffls_module_exit);

Makefile

obj-m := fflstest.o
KDIR= ../../

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules ARCH=um
clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean ARCH=um

編譯過程輸出

user@venv:~/linux-6.8.9/tests/fflstest$ make

make -C ../../ M=/home/vboxuser/linux-6.8.9/tests/fflstest modules ARCH=um
make[1]: Entering directory '/home/vboxuser/linux-6.8.9'
  CC [M]  /home/vboxuser/linux-6.8.9/tests/fflstest/fflstest.o
  MODPOST /home/vboxuser/linux-6.8.9/tests/fflstest/Module.symvers
  LD [M]  /home/vboxuser/linux-6.8.9/tests/fflstest/fflstest.ko
make[1]: Leaving directory '/home/vboxuser/linux-6.8.9'

以下為掛載後輸出,同一行中兩個值前者為修改後的版本,後者為 arch/x86/include/asm/bitops.h 內的版本,可以看到修改後的版本也更名為 generic_ ,這在較新的 Linux 版本已被實作以避免重複定義函式的問題,透過輸出結果可以看到 arch/x86/include/asm/bitops.h 內 __ prefix 的版本並無法處理輸入為 0 的數值,若直接在前面檢查輸入為 0 時回傳 0 也會導致無區分以下第一及第二個測試的情況,驗證了前所述關於 0 及 1 開始的問題。

fflstest: loading out-of-tree module taints kernel.
__ffs : 0, 64
ffs : 0, 0
__fls : 0, 1615939600
fls : 0, 0

__ffs : 0, 0
ffs : 1, 1
__fls : 0, 0
fls : 1, 1

__ffs : 1, 1
ffs : 2, 2
__fls : 1, 1
fls : 2, 2

__ffs : 0, 0
ffs : 1, 1
__fls : 34, 34
fls : 31, 31

__ffs : 1, 1
ffs : 2, 2
__fls : 34, 34
fls : 31, 31

__ffs : 0, 0
ffs : 1, 1
__fls : 34, 34
fls : 31, 31

如果知道是否有前綴 __ 的差別的話會想為什麼會出現 __fls : 34, 34 fls : 31, 31 ,不是僅差在是否從 1 開始而不該差到這麼大 ? 可以看到上面第 8 種實作,其只接受 unsigned int 作為輸入,因此輸入會被強制轉型成 32 位元大小的值而使其值最大只能到 32 。

比對機械碼

現比對產生機械碼是否相容,為避免互相影響,將兩種實作分成 initexit 兩邊執行,再將機械碼輸出。

// Function to be called when the module is loaded
static int __init ffls_module_init(void)
{
    printk(KERN_INFO "__ffs : %lu\n", __generic_ffs(6540));
    return 0; // Return 0 means module loaded successfully
}

// Function to be called when the module is removed
static void __exit ffls_module_exit(void)
{
    printk(KERN_INFO "__ffs : %lu\n", __ffs(6540));
}

透過指令 objdump -d fflstest.ko 反組譯 .ko 比對產生的機械碼

user@venv:~/linux-6.8.9/tests/fflstest$ objdump -d fflstest.ko

fflstest.ko:     file format elf64-x86-64

Disassembly of section .init.text:

0000000000000000 <init_module>:
   0:	f3 0f 1e fa          	endbr64 
   4:	55                   	push   %rbp
   5:	be 02 00 00 00       	mov    $0x2,%esi
   a:	31 c0                	xor    %eax,%eax
   c:	48 bf 00 00 00 00 00 	movabs $0x0,%rdi
  13:	00 00 00 
  16:	48 ba 00 00 00 00 00 	movabs $0x0,%rdx
  1d:	00 00 00 
  20:	48 89 e5             	mov    %rsp,%rbp
  23:	ff d2                	call   *%rdx
  25:	31 c0                	xor    %eax,%eax
  27:	5d                   	pop    %rbp
  28:	c3                   	ret    

Disassembly of section .exit.text:

0000000000000000 <cleanup_module>:
   0:	f3 0f 1e fa          	endbr64 
   4:	55                   	push   %rbp
   5:	be 02 00 00 00       	mov    $0x2,%esi
   a:	31 c0                	xor    %eax,%eax
   c:	48 bf 00 00 00 00 00 	movabs $0x0,%rdi
  13:	00 00 00 
  16:	48 ba 00 00 00 00 00 	movabs $0x0,%rdx
  1d:	00 00 00 
  20:	48 89 e5             	mov    %rsp,%rbp
  23:	ff d2                	call   *%rdx
  25:	5d                   	pop    %rbp
  26:	c3                   	ret

可以看到基本完全相同。不過若將上方迴圈執行的版本反組譯後將會是使用 tzcntbsrbsf 等相關執行指令。

17:	f3 48 0f bc d3       	tzcnt  %rbx,%rdx

x86 and amd64 instruction reference 給出的指令描述

TZCNT — Count the Number of Trailing Zero Bits

TZCNT counts the number of trailing least significant zero bits in source operand (second operand) and returns the result in destination operand (first operand). TZCNT is an extension of the BSF instruction. The key difference between TZCNT and BSF instruction is that TZCNT provides operand size as output when source operand is zero while in the case of BSF instruction, if source operand is zero, the content of destination operand are undefined. On processors that do not support TZCNT, the instruction byte encoding is executed as BSF.

BSR — Bit Scan Reverse

Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand (first operand). The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. If the content source operand is 0, the content of the destination operand is undefined.
In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

BSF — Bit Scan Forward

Searches the source operand (second operand) for the least significant set bit (1 bit). If a least significant 1 bit is found, its bit index is stored in the destination operand (first operand). The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. If the content of the source operand is 0, the content of the destination operand is undefined.
In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

以上皆為原先 x86 支援的硬體指令,可以推測是因為非迴圈版本輸入值為常數,編譯器可以針對此做最佳化,而迴圈版本輸入值隨著迴圈執行次數變動而無法針對已知常數做最佳化而造成以上差異,不過仍可推測兩者其實相同因此改進是可相容的。

TODO: upstream! based on Quiz 7

memchr 改進

2024q1 第 7 週測驗題 第一題使用 SWAR 技巧的 memchr 改進。
我事先閱讀了 arthurchang09 的筆記
2022q1 Homework5 (quiz8)
原先他提出了使用組語加快執行時間的方法,不過後來因以下原因而改為 C 語言版本

David Laight 將我的程式碼改寫,整題而言較為簡潔易懂,沒有組合語言。也解決 target character 為負時會出錯的問題。

後續不知道為什麼沒有下文,不過最後程式碼應該是以類似 quiz7 形式定案,而 quiz7 類似於以下內文中提到的 glibc-memchr.c 內的寫法。
2022q1 Homework5 (quiz8) answer
由於 Jserv 老師提供的程式碼沒有太多改進空間,因此我嘗試以在不影響程式碼可讀性的前提下盡量減少 code size 及提升效率。

原 linux 核心內程式碼

/** * memchr - Find a character in an area of memory. * @s: The memory area * @c: The byte to search for * @n: The size of the area. * * returns the address of the first occurrence of @c, or %NULL * if @c is not found */ void *memchr(const void *s, int c, size_t n) { const unsigned char *p = s; while (n-- != 0) { if ((unsigned char)c == *p++) { return (void *)(p - 1); } } return NULL; }

以 SWAR 改進程式碼

#include <limits.h> #include <stddef.h> #include <stdint.h> #include <string.h> /* Nonzero if X is not aligned on a "long" boundary */ #define UNALIGNED(X) ((long) X & (sizeof(long) - 1)) /* How many bytes are loaded each iteration of the word copy loop */ #define LBLOCKSIZE (sizeof(long)) /* Threshhold for punting to the bytewise iterator */ #define TOO_SMALL(LEN) ((LEN) < LBLOCKSIZE) #if LONG_MAX == 2147483647L #define DETECT_NULL(X) (((X) - (0x01010101)) & ~(X) & (0x80808080)) #else #if LONG_MAX == 9223372036854775807L /* Nonzero if X (a long int) contains a NULL byte. */ #define DETECT_NULL(X) \ (((X) - (0x0101010101010101)) & ~(X) & (0x8080808080808080)) #else #error long int is not a 32bit or 64bit type. #endif #endif /* @return nonzero if (long)X contains the byte used to fill MASK. */ #define DETECT_CHAR(X, mask) DETECT_NULL(*X ^ mask) void *memchr_opt(const void *str, int c, size_t len) { const unsigned char *src = (const unsigned char *) str; unsigned char d = c; while (UNALIGNED(src)) { if (!len--) return NULL; if (*src == d) return (void *) src; src++; } if (!TOO_SMALL(len)) { /* If we get this far, len is large and src is word-aligned. */ /* The fast code reads the source one word at a time and only performs * a bytewise search on word-sized segments if they contain the search * character. This is detected by XORing the word-sized segment with a * word-sized block of the search character, and then checking for the * presence of NULL in the result. */ unsigned long *asrc = (unsigned long *) src; unsigned long mask = d << 8 | d; mask |= mask << 16; for (unsigned int i = 32; i < LBLOCKSIZE * 8; i <<= 1) mask |= mask << i; while (len >= LBLOCKSIZE) { if (DETECT_CHAR(asrc, mask)) break; asrc++; len -= LBLOCKSIZE; } /* If there are fewer than LBLOCKSIZE characters left, then we resort to * the bytewise loop. */ src = (unsigned char *) asrc; } while (len--) { if (*src == d) return (void *) src; src++; } return NULL; }

簡單介紹概念:

  1. UNALIGNED(X) : 要使用 SWAR 須確保存放記憶體位置是有對齊的,即須照著每 sizeof(long) 個記憶體位置放置一個(在這裡是一個字) ,參照 你所不知道的 C 語言:記憶體管理、對齊及硬體特性

  2. TOO_SMALL(len) : 可以在中間程式碼看到 mask |= mask << 16; 這些設計 mask 的程式碼,假設在 sizeof(long) == 64 的程式碼, 8 byte 經過 mask 的操作將會變成擁有 8 個指定相同字元 (1 char 1 byte) 的 long size mask ,如下:

    ​​​​尋找字元 c 應該要以數字表示,但為方便先將用字符表示
    ​​​​d = c
    ​​​​mask = d << 8 | d;
    ​​​​mask =  | | | | | |c|c
    ​​​​mask |= mask << 16;
    ​​​​mask =  | | | |c|c|c|c
    ​​​​mask |= mask << 32;
    ​​​​mask = c|c|c|c|c|c|c|c
    
  3. #define DETECT_CHAR(X, mask) DETECT_NULL(*X ^ mask) : 這裡就是透過 *X ^ mask 尋找是否有對應的字符,若有則不為 0 ,表示已經找到了。

  4. 若有經過 SWAR 技巧處理的情況下,最後僅需尋找該位置 long sizesrc ,若 len 過小或 UNALIGNED 的情況下則需逐字符 iteration 尋找。

再行改進

#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

- /* Nonzero if X is not aligned on a "long" boundary */
- #define UNALIGNED(X) ((long) X & (sizeof(long) - 1))

- /* How many bytes are loaded each iteration of the word copy loop */
- #define LBLOCKSIZE (sizeof(long))

- /* Threshhold for punting to the bytewise iterator */
- #define TOO_SMALL(LEN) ((LEN) < LBLOCKSIZE)

- #if LONG_MAX == 2147483647L
- #define DETECT_NULL(X) (((X) - (0x01010101)) & ~(X) & (0x80808080))
- #else
- #if LONG_MAX == 9223372036854775807L
- #elif LONG_MAX == 9223372036854775807L
- /* Nonzero if X (a long int) contains a NULL byte. */
- #define DETECT_NULL(X) \
-     (((X) - (0x0101010101010101)) & ~(X) & -(0x8080808080808080))
- #else
- #error long int is not a 32bit or 64bit type.
- #endif
- #endif

- /* @return nonzero if (long)X contains the byte used to fill MASK. */
- #define DETECT_CHAR(X, mask) DETECT_NULL(*X ^ mask)

void *memchr_opt(const void *str, int c, size_t len)
{
    
    const unsigned char *src = (const unsigned char *) str;
-   unsigned char d = c;
+   unsigned long byte_per_long = sizeof(long);

-   while (UNALIGNED(src)) {
+   while ((long) src & (byte_per_long - 1)) {
        if (!len--)
            return NULL;
        if (*src == c)
            return (void *) src;
        src++;
    }
    
-   if (!TOO_SMALL(len)) {
+   if (len >= byte_per_long) {
        /* If we get this far, len is larger than a long word and src is word-aligned. */

        /* The fast code reads the source one word at a time and only performs
         * a bytewise search on word-sized segments if they contain the search
         * character. This is detected by XORing the word-sized segment with a
         * word-sized block of the search character, and then checking for the
         * presence of NULL in the result.
         */
        unsigned long *asrc = (unsigned long *) src;
+       unsigned long mask_one = 0x01010101;
-       unsigned long mask = c << 8 | c;
+       unsigned long mask_c = c << 8 | c;
-       mask |= mask << 16;
-       for (unsigned int i = 32; i < LBLOCKSIZE << 8; i <<= 1)
+       for (unsigned int i = 16; i < byte_per_long << 3; i <<= 1) {
            mask_c |= mask_c << i;
+           mask_one |= mask_one << i;
+       }
            

-       while (len >= LBLOCKSIZE) {
-           if (DETECT_CHAR(asrc, mask))
-               break;
-           asrc++;
-           len -= byte_per_long;
-       }
+       for (;len >= byte_per_long && !(((*asrc ^ mask) - (repeated_one)) & ~(*asrc ^ mask) & (repeated_one << 7)); asrc++, len -= byte_per_long);

        /* If there are fewer than LBLOCKSIZE characters left, then we resort to
         * the bytewise loop.
         */
        src = (unsigned char *) asrc;
    }

        

    while (len--) {
-       if (*src == d)
+       if (*src == c)
            return (void *) src;
        src++;
    }

    return NULL;
}

改動:

  1. code size 是 linux 核心程式碼重視的一個點,因此省去了一些看起來較不需要的 macro (此為 .c 檔,不太確定 macro 至於此是否合適,可以看到原本的 .c 檔並沒有 macro ,同時我使用在 glibc-memchr.c 內看到的程式去改善,如下),同時也合併一些程式碼進迴圈以減少行數。
    原先每次迴圈皆會經過上方的 if else macro 去執行,我不太確定這是否會增加 cycle 數,但是省去後會減少 code size ,因此我選擇移除,並改以 glibc-memchr.c 內的寫法,直接將 mask (0x01010101) 事先處理好,避免在迴圈內做無關迴圈的操作。
- /* Nonzero if X is not aligned on a "long" boundary */
- #define UNALIGNED(X) ((long) X & (sizeof(long) - 1))

- /* How many bytes are loaded each iteration of the word copy loop */
- #define LBLOCKSIZE (sizeof(long))

- /* Threshhold for punting to the bytewise iterator */
- #define TOO_SMALL(LEN) ((LEN) < LBLOCKSIZE)

- #if LONG_MAX == 2147483647L
- #define DETECT_NULL(X) (((X) - (0x01010101)) & ~(X) & (0x80808080))
- #else
- #if LONG_MAX == 9223372036854775807L
- #elif LONG_MAX == 9223372036854775807L
- /* Nonzero if X (a long int) contains a NULL byte. */
- #define DETECT_NULL(X) \
-     (((X) - (0x0101010101010101)) & ~(X) & -(0x8080808080808080))
- #else
- #error long int is not a 32bit or 64bit type.
- #endif
- #endif

- /* @return nonzero if (long)X contains the byte used to fill MASK. */
- #define DETECT_CHAR(X, mask) DETECT_NULL(*X ^ mask)

+ unsigned long mask_one = 0x01010101;

+ for (unsigned int i = 16; i < byte_per_long << 3; i <<= 1) {
+     mask_one |= mask_one << i;
+ }

- while (len >= LBLOCKSIZE) {
-     if (DETECT_CHAR(asrc, mask))
-         break;
-     asrc++;
-     len -= byte_per_long;
- }
+ for (;len >= byte_per_long && !(((*asrc ^ mask) - (repeated_one)) & ~(*asrc ^ mask) & (repeated_one << 7)); asrc++, len -= byte_per_long);
  1. 因為移去了 LBLOCKSIZE ,事先宣告 byte_per_longsizeof(long) ,減少調用 sizeof() 次數。

問題:
會不會有前面項 UNALIGNED ,後面 ALIGNED ?
會不會有前面項 ALIGNED ,後面 UNALIGNED ?
更甚會不會有前面項 ALIGNED ,後面 UNALIGNED ,後面又變成 ALIGNED?
等情況
如果有的話會不會造成改動後的效果變差?

我自己認為如果問題是存在的話,那麼原本的 memchr 也會有出現問題,前面發現 UNALIGNED(src) 不成立,但 len > byte_per_long ,執行到一半卻出現 UNALIGNED ,那麼將會有問題。

最後版本:

void *memchr_opt(const void *str, int c, size_t len) { const unsigned char *src = (const unsigned char *) str; unsigned long byte_per_long = sizeof(long); for (;((long) src & (byte_per_long - 1)); len--, src++) { if (*src == c) return (void *) src; if (!len) return NULL; } if (len >= byte_per_long) { /* If we get this far, len is larger than a long word and src is word-aligned. */ /* The fast code reads the source one word at a time and only performs * a bytewise search on word-sized segments if they contain the search * character. This is detected by XORing the word-sized segment with a * word-sized block of the search character, and then checking for the * presence of NULL in the result. */ unsigned long *asrc = (unsigned long *) src; unsigned long repeated_one = 0x01010101; unsigned long mask = c << 8 | c; mask |= mask << 16; repeated_one |= repeated_one << 16; for (unsigned int i = 32; i < byte_per_long << 3; i <<= 1) { mask |= mask << i; repeated_one |= repeated_one << i; } for (; len >= byte_per_long && !((((*asrc ^ mask) - repeated_one) & ~(*asrc ^ mask)) & (repeated_one << 7)); len -= byte_per_long, ++asrc); /* If there are fewer than LBLOCKSIZE characters left, then we resort to * the bytewise loop. */ src = (unsigned char *) asrc; } for (; len > 0; --len, ++src) { if (*src == c) return (void *) src; } return NULL; }

結果:
memchr_opt2 為我再修改後的程式碼, memchr_opt 則為 quiz8 內的版本,可以看到的確有微幅提升效能

截圖 2024-05-24 下午2.56.03

後來透過參考 arthurchang09 筆記中的
裡面說到

下圖為 memchr_opt 和 glibc-memchr.c 的執行時間比較,測量方式如 memchr 與 strchr 比較 所述,有趣的是兩者執行的時間幾乎相同。
image

筆記也說到

效能分析之模式參考 https://gist.github.com/nico/2624336 ,測量方式依舊使用 struct timespec ,測量執行兩百次之時間再除以兩百得到執行一次之時間。

的確,若照著筆記提供的程式碼做比較 glibc-memchr 與 memchr-opt 效能會較相近
開啟 O2 選項編譯
image

實際使用 https://gist.github.com/nico/2624336 方法測試,同樣開啟 -O2 選項編譯,並執行 5 次

1. 
stupid_memchr: 240.0us
glibc_memchr: 99.0us
memchr_opt: 101.0us
memchr_opt_ori: 106.0us
memchr: 24.0us
2. 
stupid_memchr: 220.0us
glibc_memchr: 85.0us
memchr_opt: 112.0us
memchr_opt_ori: 101.0us
memchr: 21.0us
3. 
stupid_memchr: 234.0us
glibc_memchr: 93.0us
memchr_opt: 103.0us
memchr_opt_ori: 112.0us
memchr: 20.0us
4. 
stupid_memchr: 204.0us
glibc_memchr: 98.0us
memchr_opt: 101.0us
memchr_opt_ori: 99.0us
memchr: 28.0us
5. 
stupid_memchr: 213.0us
glibc_memchr: 115.0us
memchr_opt: 93.0us
memchr_opt_ori: 96.0us
memchr: 27.0us

卻發現 glibc 內的版本的效能其實是非常些微的更佳,由於測試方法大致相同:
給定一連續記憶體,將其初始化為全 0 ,並在第 k (0-1000) 個位置將其設置為 4 ,並使用不同實作去找出 4 的位置。

我認為是優化使機械碼更好的原因

1. glibc_memchr

glibc_memchr: .LFB35: .cfi_startproc endbr64 movq %rdi, %rax movl %esi, %edi testq %rdx, %rdx jne .L17 jmp .L13 .p2align 4,,10 .p2align 3 .L19: cmpb %dil, (%rax) je .L1 addq $1, %rax subq $1, %rdx je .L13 .L17: testb $7, %al jne .L19 movzbl %sil, %esi movl %esi, %ecx sall $8, %ecx orl %esi, %ecx movslq %ecx, %rcx movq %rcx, %rsi salq $16, %rsi orq %rsi, %rcx movq %rcx, %r8 salq $32, %r8 orq %rcx, %r8 cmpq $7, %rdx jbe .L8 movabsq $-72340172838076673, %r10 movabsq $-9187201950435737472, %r9 jmp .L6 .p2align 4,,10 .p2align 3 .L21: subq $8, %rdx addq $8, %rax cmpq $7, %rdx jbe .L20 .L6: movq (%rax), %rcx xorq %r8, %rcx leaq (%rcx,%r10), %rsi notq %rcx andq %rsi, %rcx testq %r9, %rcx je .L21 .L8: addq %rax, %rdx jmp .L7 .p2align 4,,10 .p2align 3 .L22: addq $1, %rax cmpq %rax, %rdx je .L13 .L7: cmpb %dil, (%rax) jne .L22 .L1: ret .p2align 4,,10 .p2align 3 .L20: testq %rdx, %rdx jne .L8 .L13: xorl %eax, %eax ret .cfi_endproc

2. memchr_opt_ori

memchr_opt: .LFB35: .cfi_startproc endbr64 movq %rdi, %rax movl %esi, %edi testb $7, %al je .L10 leaq -1(%rdx), %rcx testq %rdx, %rdx jne .L4 jmp .L14 .p2align 4,,10 .p2align 3 .L19: addq $1, %rax testb $7, %al je .L2 subq $1, %rcx jb .L14 .L4: cmpb %dil, (%rax) jne .L19 ret .p2align 4,,10 .p2align 3 .L14: xorl %eax, %eax ret .L10: movq %rdx, %rcx .p2align 4,,10 .p2align 3 .L2: cmpq $7, %rcx ja .L20 .L6: testq %rcx, %rcx je .L14 .L7: addq %rax, %rcx jmp .L9 .p2align 4,,10 .p2align 3 .L21: addq $1, %rax cmpq %rax, %rcx je .L14 .L9: cmpb %dil, (%rax) jne .L21 ret .p2align 4,,10 .p2align 3 .L20: movabsq $-72340172838076673, %r10 movzbl %sil, %esi movabsq $-9187201950435737472, %r9 movl %esi, %edx sall $8, %edx orl %esi, %edx movslq %edx, %rdx movq %rdx, %rsi salq $16, %rsi orq %rsi, %rdx movq %rdx, %r8 salq $32, %r8 orq %rdx, %r8 .p2align 4,,10 .p2align 3 .L8: movq (%rax), %rdx xorq %r8, %rdx leaq (%rdx,%r10), %rsi notq %rdx andq %rsi, %rdx testq %r9, %rdx jne .L7 subq $8, %rcx addq $8, %rax cmpq $7, %rcx ja .L8 jmp .L6 .cfi_endproc

3. memchr_opt

memchr_opt: .LFB35: .cfi_startproc endbr64 movq %rdi, %rax testb $7, %al je .L11 leaq -1(%rdx), %rcx testq %rdx, %rdx jne .L4 jmp .L15 .p2align 4,,10 .p2align 3 .L19: addq $1, %rax testb $7, %al je .L2 subq $1, %rcx jb .L15 .L4: movzbl (%rax), %edx cmpl %esi, %edx jne .L19 ret .p2align 4,,10 .p2align 3 .L15: xorl %eax, %eax ret .L11: movq %rdx, %rcx .p2align 4,,10 .p2align 3 .L2: cmpq $7, %rcx ja .L20 .L6: testq %rcx, %rcx je .L15 .L9: addq %rax, %rcx jmp .L10 .p2align 4,,10 .p2align 3 .L21: addq $1, %rax cmpq %rcx, %rax je .L15 .L10: movzbl (%rax), %edx cmpl %esi, %edx jne .L21 ret .p2align 4,,10 .p2align 3 .L20: movabsq $-72340172838076673, %r10 movl %esi, %edx movabsq $-9187201950435737472, %r9 sall $8, %edx orl %esi, %edx movslq %edx, %rdx movq %rdx, %rdi salq $16, %rdi orq %rdi, %rdx movq %rdx, %r8 salq $32, %r8 orq %rdx, %r8 .p2align 4,,10 .p2align 3 .L7: movq (%rax), %rdx xorq %r8, %rdx leaq (%rdx,%r10), %rdi notq %rdx andq %rdi, %rdx testq %r9, %rdx jne .L9 subq $8, %rcx addq $8, %rax cmpq $7, %rcx ja .L7 jmp .L6 .cfi_endproc

若直接去看 glibc-memchr.c 內的程式碼,可以看到他寫得相對較為複雜,但開啟 O2 優化,卻發現優化的非常好,反而減少了 10 行,但是是用了什麼方法達成呢?
經過檢查後發現位在最後版本第 10 行的 unaligned 記憶體處理迴圈內的 if(!len) 是關鍵,原本想說儘早發現 len 已經為 0 可以提早離開函式可以避免接下來不必要的判斷式,不過似乎使編譯器比較不容易優化,因此我將其移除統一於最後回傳 NULL ,果然將機械碼減少的比 glibc 的版本更佳,且避免使用乘法及餘數運算。

memchr_opt: .LFB35: .cfi_startproc endbr64 movq %rdi, %rax movl %esi, %edi testq %rdx, %rdx jne .L21 jmp .L15 .p2align 4,,10 .p2align 3 .L23: movzbl (%rax), %ecx cmpl %edi, %ecx je .L1 addq $1, %rax subq $1, %rdx je .L15 .L21: testb $7, %al jne .L23 cmpq $7, %rdx jbe .L11 movabsq $-72340172838076673, %r10 movl %edi, %ecx movabsq $-9187201950435737472, %r9 sall $8, %ecx orl %edi, %ecx movslq %ecx, %rcx movq %rcx, %rsi salq $16, %rsi orq %rsi, %rcx movq %rcx, %r8 salq $32, %r8 orq %rcx, %r8 .p2align 4,,10 .p2align 3 .L5: movq (%rax), %rcx xorq %r8, %rcx leaq (%rcx,%r10), %rsi notq %rcx andq %rsi, %rcx testq %r9, %rcx jne .L11 subq $8, %rdx addq $8, %rax cmpq $7, %rdx ja .L5 testq %rdx, %rdx je .L15 .L11: addq %rax, %rdx jmp .L9 .p2align 4,,10 .p2align 3 .L24: addq $1, %rax cmpq %rdx, %rax je .L15 .L9: movzbl (%rax), %ecx cmpl %edi, %ecx jne .L24 .L1: ret .p2align 4,,10 .p2align 3 .L15: xorl %eax, %eax ret .cfi_endproc

不過經過調整後發現 glibc 內的還是些微的較快一些,而且在 O1 最佳化時都會較差,而一改成 O2 或以上就能反超。不過因為也有反過來的情況,因為影響的情況眾多,可能需要透過 linux 內建測試函式檢查。

我有試著學 glibc 將 12 行判斷是內的操作全部提出判斷式外,如此一來執行時間會更相近,不過想了一下發現提出後只要 len 此時是小於 byte_per_long 的情況,那麼這幾段操作完全不必要,由於沒辦法得知進入的記憶體位置是否為對其狀態,因此維持原版及早離開函示或避免不必要的操作會比較好。

修訂開平方根的快速計算

從 √2 的存在談開平方根的快速運算