2023q1 Homework2 (quiz2)

contributed by < joshyue >

測驗 `1`

參照你所不知道的 C 語言: bitwise 操作，考慮 next_pow2 可針對給定無號 64 位元數值 x，找出最接近且大於等於 2 的冪的值














uint64_t next_pow2(uint64_t x)
{
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return x + 1;
}

64位元之數值x，在此以x = 0b0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000為例。
經過7次x |= x >> 1運算，可保證原先最高位的1後面會有6個1，即x = 0b0111 1111 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000。
再經過x |= x >> 8、x |= x >> 32、x |= x >> 64，則x = 0b0111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111，如此便能使一開始最高位的1右邊所有位元皆指派為1。
此時x + 1才為2的冪，因此最後要return x + 1，即能得到next_pow2所求。

用 __builtin_clzl 改寫

__builtin_clzl : Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

uint64_t next_pow2(uint64_t x)
{
    if (！x)
        return 1;
    int clz = __builtin_clzl(x);
    return (1 << (64 - clz));
}

當x不為0時，__builtin_clzl(x)會回傳最高位開始連續0的個數的功能，假設x=1024時，則clz = 53。
則可以透過求得的clz，回傳(1 << (64 - clz))，即得題目所求。

在 Linux 核心原始程式碼找出類似的使用案例並解釋

在<include/asm-generic/bitops/builtin-fls.h>中的fls函式，有使用到__builtin_clz的功能。

static __always_inline int fls(unsigned int x)
{
    return x ? sizeof(x) * 8 - __builtin_clz(x) : 0;
}

fls - find last (most-significant) bit set

透過sizeof(x) * 8可得到x的位元數量，再減去最高位開始連續0的個數，則可以得到fls函式所求，即最高位元的1的bit index + 1。

上述 clz 內建函式已運用時，編譯器產生之對應x86指令

執行cc -O2 -std=c99 -S next_pow2.c，觀察所得next_pow2.s

	movl	$1, %eax
	testq	%rdi, %rdi
	je	.L1
	bsrq	%rdi, %rdi
	movl	$64, %ecx
	xorq	$63, %rdi
	subl	%edi, %ecx
	sall	%cl, %eax
	cltq

觀察__builtin_clzl(x)所對應的x86指令，其中 bsrq %rdi, %rdi 會將最高位的1之bit index儲存於rdi暫存器。
而後的 xorq $63, %rdi 則是對63與最高位的1之bit index作Exclusive OR，則等同於63 - 1之bit index，因此得到所求，即最高位開始連續0的個數。

測驗 `2`

LeetCode 1680. Concatenation of Consecutive Binary Numbers 給定一整數 n ，回傳將 1 到 n 的二進位表示法依序串接在一起所得到的二進位字串，其所代表的十進位數字 mod
$10^{9} + 7$ 之值。

int concatenatedBinary(int n)
{
    const int M = 1e9 + 7;
    int len = 0; /* the bit length to be shifted */
    /* use long here as it potentially could overflow for int */
    long ans = 0;
    for (int i = 1; i <= n; i++) {
        /* removing the rightmost set bit
         * e.g. 100100 -> 100000
         *      000001 -> 000000
         *      000000 -> 000000
         * after removal, if it is 0, then it means it is power of 2
         * as all power of 2 only contains 1 set bit
         * if it is power of 2, we increase the bit length
         */
        if (!(i & (i - 1)))
            len++;
        ans = (i | (ans << len)) % M;
    }
    return ans;
}

若要判斷i是否為2的冪，現假設i = 8，則二進位表示0b1000，i - 1的二進位表示則為0b0111，可觀察到兩者依序對應的bit皆相異，因此當i為2的冪時，i & (i - 1)必等於0。
並且題目所述1 到 n 的二進位表示法依序串接，於是我們可以將目前的字串（ans）向左shift所需串接的i之位元長度（len）後，再與i作bitwise OR，最後再mod
$10^{9} + 7$ ，即為題目所求。

使用__builtin_{clz,ctz,ffs}改寫，改進mod
$10^{9} + 7$ 的運算

int concatenatedBinary(int n)
{
    const int M = 1e9 + 7;
    int len = 0; /* the bit length to be shifted */
    long ans = 0;
    for (int i = 1; i <= n; i++) {
        int ffs = __builtin_ffsl(i);
        if (!(i >> (ffs)))
            len++;
        ans = (i | (ans << len)) % M; 
    }
}

__builtin_ffsl : Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.

原先使用以上的程式碼，以__builtin_ffsl來取得最低位開始第一個1的bit index + 1，i >> (ffs)若為0，則表示i為2的冪，而後發現此程式碼仍可改進，改進如下。

int concatenatedBinary(int n)
{
    const int M = 1e9 + 7;
    int len = 0; /* the bit length to be shifted */
    long ans = 0;
    for (int i = 1; i <= n; i++){
        ans = (i | (ans << (64 - __builtin_clzl(i)))) % M;
    }
}

以__builtin_clzl得到最高位開始連續0個數後，將i往左shift 64 - __builtin_clzl(i)(其等同先前的程式碼向左shift len )，這樣的改進能省去if的分支，不用判斷i是否為2的冪，使程式碼更加精簡。

測驗 `3`

SIMD within a register (SWAR) 是軟體最佳化技巧之一




















size_t swar_count_utf8(const char *buf, size_t len)
{
    const uint64_t *qword = (const uint64_t *) buf;
    const uint64_t *end = qword + len >> 3;

    size_t count = 0;
    for (; qword != end; qword++) {
        const uint64_t t0 = *qword;
        const uint64_t t1 = ~t0;
        const uint64_t t2 = t1 & 0x04040404040404040llu;
        const uint64_t t3 = t2 + t2;
        const uint64_t t4 = t0 & t3;
        count += __builtin_popcountll(t4);
    }

    count = (1 << 3) * (len / 8)  - count;
    count += (len & 7) ? count_utf8((const char *) end, len & 7) : 0;

    return count;
}

比較 SWAR 和原本的實作效能落差

__builtin_popcountll : Returns the number of 1-bits in x.

第13行 : 透過__builtin_popcountll求得位元組內1的總數，conut即代表題目所述continuation bytes（後續位元組）的個數。
第16行 : 因為我們在處理一個qword時，qword大小為8byte，(len / 8)即表示完整處理的次數(因後續要判斷剩餘未滿1個qword的狀況)，(1 << 3) * (len / 8)表示完整處理的次數再乘以8，即已處理字串的長度，並且題目有提到字串長度減去後續位元組，即可確定字元數量，因此我們要再減去第13行得到的count，得到字元數量。
第17行 : (len & 7)等同於len % 7，若len & 7不為0，表示我們仍有未處理的字元，則呼叫原先的函式count_utf8，回傳剩餘的字元數量。

在 Linux 核心原始程式碼找出 UTF-8 和 Unicode相關字串處理的程式碼，探討其原理

在<fs/unicode/utf8-core.c>中，有提供utf8_strncmp函式。
如果返回值 = 1，則表明兩字串不相等
如果返回值 = 0，則表明兩字串相等

int utf8_strncmp(const struct unicode_map *um,
		 const struct qstr *s1, const struct qstr *s2)
{
	struct utf8cursor cur1, cur2;
	int c1, c2;

	if (utf8ncursor(&cur1, um, UTF8_NFDI, s1->name, s1->len) < 0)
		return -EINVAL;

	if (utf8ncursor(&cur2, um, UTF8_NFDI, s2->name, s2->len) < 0)
		return -EINVAL;

	do {
		c1 = utf8byte(&cur1);
		c2 = utf8byte(&cur2);

		if (c1 < 0 || c2 < 0)
			return -EINVAL;
		if (c1 != c2)
			return 1;
	} while (c1);

	return 0;

utf8ncursor結構表示字串處理的進度，分別透過c1及c2獲取目前處理字串位置的一個位元組，並轉換此位元組的型態為unsigned char，因此若c1或c2小於0，則表示程式無效，回傳-EINVAL，反之的話就是繼續判斷兩個字元是否相等。

測驗 `4`

以下程式碼可判定 16 位元無號整數是否符合以下特定樣式 (pattern):

bool is_pattern(uint16_t x)
{
    if (!x)
        return 0;

    for (; x > 0; x <<= 1) {
        if (!(x & 0x8000))
            return false;
    }

    return true;
}

符合上述程式碼的樣式，如下:

pattern: 8000 (32768)
pattern: c000 (49152)
pattern: e000 (57344)
pattern: f000 (61440)
pattern: f800 (63488)
pattern: fc00 (64512)
pattern: fe00 (65024)
pattern: ff00 (65280)
pattern: ff80 (65408)
pattern: ffc0 (65472)
pattern: ffe0 (65504)
pattern: fff0 (65520)
pattern: fff8 (65528)
pattern: fffc (65532)
pattern: fffe (65534)
pattern: ffff (65535)

改寫上述程式碼，使其等價且更精簡有效，改寫內容如下:

bool is_pattern(uint16_t x)
{
    const uint16_t n = -x;
    return (n ^ x) < x;
}

透過x觀察是否符合程式碼的樣式(pattern)可發現，只要x在二進位下符合最高位開始為1的條件，後面依序接著連續的1(可為0個)與連續的0(可為0個)，則符合上述的樣式。
上述程式碼的-x表示x的二補數，以下方圖表為例，若x符合樣式，則(n ^ x)必定小於x，因此回傳True。

x	n = -x	n ^ x
(1111 0000 0000 0000)₂	(0001 0000 0000 0000)₂	(1110 0000 0000 0000)₂
(1111 0100 0000 0000)₂	(0000 1100 0000 0000)₂	(1111 1000 0000 0000)₂
(1111 0000 0000 0001)₂	(0000 1111 1111 1111)₂	(1111 1111 1111 1110)₂

2023q1 Homework2 (quiz2)

測驗 1

用 __builtin_clzl 改寫

在 Linux 核心原始程式碼找出類似的使用案例並解釋

上述 clz 內建函式已運用時，編譯器產生之對應x86指令

測驗 2

使用__builtin_{clz,ctz,ffs}改寫，改進mod 109+7 的運算

測驗 3

比較 SWAR 和原本的實作效能落差

在 Linux 核心原始程式碼找出 UTF-8 和 Unicode相關字串處理的程式碼，探討其原理

測驗 4

在 Linux 核心原始程式碼找出上述 bitmask 及產生器，探討應用範疇

Read more

2023q1 Homework1 (lab0)

測驗 `1`

測驗 `2`

使用__builtin_{clz,ctz,ffs}改寫，改進mod
$10^{9} + 7$ 的運算

測驗 `3`

測驗 `4`