Linux 核心專題: 位元操作

執行人: tab08222
專題解說錄影

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

提問清單

任務簡述

第 2 週測驗題嘗試檢驗學員對位元操作的認知，本任務強化相關練習，預計完成以下:

next_pow2 的實作和 Linux 核心相關程式碼分析
SIMD within a register (SWAR) 實作和 Linux 核心相關程式碼分析
bitmask 產生器和 Linux 核心相關程式碼分析

應該一併整理第二次作業中，學員提交的相關成果，清楚標註出處、修正描述及重現實驗。

TODO: `next_pow2` 的實作和 Linux 核心相關程式碼分析

重做第 2 週測驗題測驗一:

解釋程式碼原理，並用 __builtin_clzl 改寫，注意未定義行為
在 Linux 核心原始程式碼找出類似的使用案例並解釋，至少三處，不能只列出程式碼，應當揣摩其作用並解讀
當上述 clz 內建函式已運用時，編譯器能否產生對應的 x86 指令？

解釋程式碼原理

uint64_t next_pow2(uint64_t x)
{
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return x+1;
}

這段程式碼的目標在於找出大於或等於 X 的最小冪值，原理是把最高位元的 bit 一路填到最後一個位元，最後加 1 來達到目標。例如:5的二進制表示法為0b0101，再透過把最高位元的 bit 一路填到最後一個位元之後會變成0b0111，最後 +1，也就式0b1000(8)
根據 builtin_clz 的定義

Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

以 x = 15 為例子，由於 x 需要用到 4 個位元來表示，這代表前面有 28 個連續的0，因此 __builtin_clz(x) 會回傳 28，而 __builtin_clzl 是 unsigned long 因此 __builtin_clzl(x) 會回傳 60，以下是用 __builtin_clzl 改寫的版本

#include <stdint.h>
static inline uint64_t pow2(uint8_t e) { return ((uint64_t)1) << e; }
uint64_t next_pow2(uint64_t x)
{
    return pow2(64- __builtin_clzl(x) );
}

對應的x86指令

以 compiler explorer 產生 x86 指令

next_pow2:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 8
        mov     QWORD PTR [rbp-8], rdi
        bsr     rax, QWORD PTR [rbp-8]
        xor     rax, 63
        mov     edx, eax
        mov     eax, 64
        sub     eax, edx
        movzx   eax, al
        mov     edi, eax
        call    pow2

其中 bsr (Bit Scan Reverse) 的定義為

Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand (first operand).

Linux 核心相關程式碼

以下取自 mm/slab_common.c

static unsigned int calculate_alignment(slab_flags_t flags,
		unsigned int align, unsigned int size)
{
	
	if (flags & SLAB_HWCACHE_ALIGN) {
		unsigned int ralign;

		ralign = cache_line_size();
		while (size <= ralign / 2)
			ralign /= 2;
		align = max(align, ralign);
	}

	align = max(align, arch_slab_minalign());

	return ALIGN(align, sizeof(void *));
}

這段程式碼的目的是在確保配置的記憶體塊可以滿足硬體快取和架構的對齊要求，並以此來達到有效的記憶體配置。

以下取自 drivers/md/bcache/bset.c

static unsigned int __inorder_to_tree(unsigned int j,
				      unsigned int size,
				      unsigned int extra)
{
	unsigned int shift;

	if (j > extra)
		j += j - extra;

	shift = ffs(j);

	j >>= shift;
	j  |= roundup_pow_of_two(size) >> shift;

	return j;
}

bset.c 是用於管理和操作 bcache 中的資料結構，以實現有效的磁碟快取和提高儲存系統的性能。其中 __inorder_to_tree 的目的是確保 j 在特定範圍內並符合樹狀結構的要求。roundup_pow_of_two(size) 與next_pow2的功能不相同。以下列出roundup_pow_of_two的定義

#define roundup_pow_of_two(n)			\
(						\
	__builtin_constant_p(n) ? (		\
		((n) == 1) ? 1 :		\
		(1UL << (ilog2((n) - 1) + 1))	\
				   ) :		\
	__roundup_pow_of_two(n)			\
 )

__roundup_pow_of_two 的定義如下

unsigned long __roundup_pow_of_two(unsigned long n)
{
	return 1UL << fls_long(n - 1);
}

這兩個函式會回傳不大於 n 的 2 冪值，而next_pow2 則是會回傳不小於 n 的 2 的冪值。

以下取自include/asm-generic/bitops/fls.h

static __always_inline int fls(unsigned int x)
{
	int r = 32;

	if (!x)
		return 0;
	if (!(x & 0xffff0000u)) {
		x <<= 16;
		r -= 16;
	}
	if (!(x & 0xff000000u)) {
		x <<= 8;
		r -= 8;
	}
	if (!(x & 0xf0000000u)) {
		x <<= 4;
		r -= 4;
	}
	if (!(x & 0xc0000000u)) {
		x <<= 2;
		r -= 2;
	}
	if (!(x & 0x80000000u)) {
		x <<= 1;
		r -= 1;
	}
	return r;
}

功能與 clz、 clzl 類似，回傳值代表第一個 1 出現在由右數來第幾個 bit，例如15=0b1111，則fls(15) 就會回傳 4 。

TODO: SWAR 實作和 Linux 核心相關程式碼分析

重做第 2 週測驗題測驗三:

解釋程式碼運作原理，比較 SWAR 和原本的實作效能落差，應當使用 perf 一類的工具，充分量化
在 Linux 核心原始程式碼找出 UTF-8 和 Unicode 相關字串處理的程式碼 (至少二處)，探討其原理，並指出可能的改進空間，需要有程式碼產出和充分量化

#include <stddef.h>
#include <stdint.h>

size_t count_utf8(const char *buf, size_t len)
{
    const int8_t *p = (const int8_t *) buf;
    size_t counter = 0;
    for (size_t i = 0; i < len; i++) {
        if (p[i] > -65)
            counter++;
    }
    return counter;
}

utf-8字元可由一個首位元組，和1、2、3個後續位元組組成。這段程式碼的目的在於找出首位元組的數量，也就是開頭不是10的位元組，而首位元組也代表輸入的字串包含幾個字元。-65的二進制表示法為0b10111111，並且所有大於-65的位元組皆不符合後續位元組的定義(0xx.xxxx)，因此以-65為 magic number

size_t swar_count_utf8(const char *buf, size_t len)
{
    const uint64_t *qword = (const uint64_t *) buf;
    const uint64_t *end = qword + len >> 3;

    size_t count = 0;
    for (; qword != end; qword++) {
        const uint64_t t0 = *qword;
        const uint64_t t1 = ~t0;
        const uint64_t t2 = t1 & 0x04040404040404040llu;
        const uint64_t t3 = t2 + t2;
        const uint64_t t4 = t0 & t3;
        count += __builtin_popcountll(t4);
    }

    count = (1 << 3) * (len / 8)  - count;
    count += (len & 3) ? count_utf8((const char *) end, len & 3) : 0;

    return count;
}

在以上程式碼當中，透過觀察變數宣告的部分可以發現此 for 迴圈是以 8 個 byte 為一組進行的，for 迴圈內部的程式碼用於實作 not bit 6 and bit 7，而 count 代表後續位元組的數量，因此通過計算給定的 len 包含幾組再減去後續位元組的數量即為字元數量，再由 count_utf8 來進行計算其他部分。

UTF-8 和 Unicode 相關字串處理的程式碼

以下取自 fs/nls/nls_utf8.c

static int uni2char(wchar_t uni, unsigned char *out, int boundlen)
{
	int n;

	if (boundlen <= 0)
		return -ENAMETOOLONG;

	n = utf32_to_utf8(uni, out, boundlen);
	if (n < 0) {
		*out = '?';
		return -EINVAL;
	}
	return n;
}

這段程式碼式用來將Unicode 字元轉換為 UTF-8 編碼的字節序列

以下取自 fs/unicode/utf8-core.c

int utf8_strncasecmp(const struct unicode_map *um,
		     const struct qstr *s1, const struct qstr *s2)
{
	struct utf8cursor cur1, cur2;
	int c1, c2;

	if (utf8ncursor(&cur1, um, UTF8_NFDICF, s1->name, s1->len) < 0)
		return -EINVAL;

	if (utf8ncursor(&cur2, um, UTF8_NFDICF, s2->name, s2->len) < 0)
		return -EINVAL;

	do {
		c1 = utf8byte(&cur1);
		c2 = utf8byte(&cur2);

		if (c1 < 0 || c2 < 0)
			return -EINVAL;
		if (c1 != c2)
			return 1;
	} while (c1);

	return 0;
}

這段程式碼是用來判斷字串s1、s2是否相等。其中utf8ncursor可以用來判斷輸入是否符合規定，若不符合則 return -EINVAL 。根據errno-base.h ，EINVAL 代表參數錯誤

/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef _ASM_GENERIC_ERRNO_BASE_H
#define _ASM_GENERIC_ERRNO_BASE_H

#define	EPERM		 1	/* Operation not permitted */
#define	ENOENT		 2	/* No such file or directory */
#define	ESRCH		 3	/* No such process */
#define	EINTR		 4	/* Interrupted system call */
#define	EIO		 5	/* I/O error */
#define	ENXIO		 6	/* No such device or address */
#define	E2BIG		 7	/* Argument list too long */
#define	ENOEXEC		 8	/* Exec format error */
#define	EBADF		 9	/* Bad file number */
#define	ECHILD		10	/* No child processes */
#define	EAGAIN		11	/* Try again */
#define	ENOMEM		12	/* Out of memory */
#define	EACCES		13	/* Permission denied */
#define	EFAULT		14	/* Bad address */
#define	ENOTBLK		15	/* Block device required */
#define	EBUSY		16	/* Device or resource busy */
#define	EEXIST		17	/* File exists */
#define	EXDEV		18	/* Cross-device link */
#define	ENODEV		19	/* No such device */
#define	ENOTDIR		20	/* Not a directory */
#define	EISDIR		21	/* Is a directory */
#define	EINVAL		22	/* Invalid argument */
#define	ENFILE		23	/* File table overflow */
#define	EMFILE		24	/* Too many open files */
#define	ENOTTY		25	/* Not a typewriter */
#define	ETXTBSY		26	/* Text file busy */
#define	EFBIG		27	/* File too large */
#define	ENOSPC		28	/* No space left on device */
#define	ESPIPE		29	/* Illegal seek */
#define	EROFS		30	/* Read-only file system */
#define	EMLINK		31	/* Too many links */
#define	EPIPE		32	/* Broken pipe */
#define	EDOM		33	/* Math argument out of domain of func */
#define	ERANGE		34	/* Math result not representable */

#endif

TODO: bitmask 產生器和 Linux 核心相關程式碼分析

重做第 2 週測驗題測驗四:

解釋程式碼運作原理
在 Linux 核心原始程式碼找出相關 bitmask 及產生器，探討應用範疇，至少三處，比照〈課前測驗參考解答: Q1〉

#include <stdint.h>
#include <stdbool.h>

bool is_pattern(uint16_t x)
{
    if (!x)
        return 0;

    for (; x > 0; x <<= 1) {
        if (!(x & 0x8000))
            return false;
    }

    return true;
}

這段程式碼的目的在於檢測 X 是否僅用一段連續的1(長度為 1~16 bits)以及一段連續的 0 (長度為0~15 bits)所組成，另一種解釋方法是，當 X 在不斷被向左位移的同時，必須確保 X 的最高位元是 1，或 x = 0

bool is_pattern(uint16_t x)
{
    const uint16_t n = -x;
    return (n ^ x) < x;
}

若 x 為正確的樣式，則 -x 必定會讓最後一個1 變成 0 ，因此 (n ^ x) < x 必定成立。若 x 不是正確的樣式，雖然最後一個 1 同樣會被替換成 0，但前方至少也會有一個 0 被替換成 1，因此 (n ^ x) > x

Linux 核心原始程式碼相關 bitmask 及產生器

以下取自 linux/include/linux/inetdevice.h

static __inline__ __be32 inet_make_mask(int logmask)
{
	if (logmask)
		return htonl(~((1U<<(32-logmask))-1));
	return 0;
}

這段程式碼用於生成 ipv4 子網路遮罩，子網路的生成有利於進行子網路計算、過濾或路由等操作序。其中 logmask 代表遮罩長度，最後使用 htonl 函式將其轉換為網路字節序。

子網路遮罩用於將IP位址劃分為網絡地址和主機地址，並在網絡中劃分子網路。它是計算機網絡中一個重要的概念，用於確定網絡的範圍、劃分子網路、控制通信流量和實施安全性。
ChatGPT

以下取自 mm/page_alloc.c

int find_next_best_node(int node, nodemask_t *used_node_mask)
{
	int n, val;
	int min_val = INT_MAX;
	int best_node = NUMA_NO_NODE;
	
	if (!node_isset(node, *used_node_mask)) {
		node_set(node, *used_node_mask);
		return node;
	}

	for_each_node_state(n, N_MEMORY) {
		if (node_isset(n, *used_node_mask))
			continue;
		
		val = node_distance(node, n);
		val += (n < node);

		if (!cpumask_empty(cpumask_of_node(n)))
			val += PENALTY_FOR_NODE_WITH_CPUS;
		
		val *= MAX_NUMNODES;
		val += node_load[n];

		if (val < min_val) {
			min_val = val;
			best_node = n;
		}
	}

	if (best_node >= 0)
		node_set(best_node, *used_node_mask);

	return best_node;
}

以上程式碼是用於 NUMA 系統中，找到最佳節點來配置記憶體空間，其中運用 used_node_mask 來分辨那些節點是已被使用的，以及那些節點是可用的。cpumask_of_node 可以產生 cpumask，並作為cpumask_empty 的參數，若此時 node n 沒有配置處理器則回傳 0 ，並為 val 加上 penalty 。

以下取自 kernel/irq/autoprobe.c

unsigned int probe_irq_mask(unsigned long val)
{
	unsigned int mask = 0;
	struct irq_desc *desc;
	int i;

	for_each_irq_desc(i, desc) {
		raw_spin_lock_irq(&desc->lock);
		if (desc->istate & IRQS_AUTODETECT) {
			if (i < 16 && !(desc->istate & IRQS_WAITING))
				mask |= 1 << i;

			desc->istate &= ~IRQS_AUTODETECT;
			irq_shutdown_and_deactivate(desc);
		}
		raw_spin_unlock_irq(&desc->lock);
	}
	mutex_unlock(&probing_active);

	return mask & val;
}

probe_irq_mask 函式用於走訪系統中的中斷描述子，檢查其中被標記為自動探測的中斷，並生成一個中斷遮罩，其中每個被探測到的中斷對應的位元被設置為 1。