2023q1 Homework2 (quiz2)

contributed by < Thegoatistasty >

測驗題目

測驗 `1`

1. 解釋上述程式碼原理，並用 `__builtin_clzl` 改寫

題目完整程式碼

//original
uint64_t next_pow2(uint64_t x)
{
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return x + 1;
}
//refined
uint64_t next_pow2(uint64_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return x + 1;
}

所謂最接近且大於等於 2 的冪的值，可以想成在2進位數值系統上，將最左邊的 1 往左移，並將右邊的 bits 全部設為 0。如：

0001xxxxxx
轉為
0010000000

題目的方法是以數值最左邊的 1 為首，將右邊所有的 bits 設為 1，最後回傳該數值 + 1，簡化版過程如下：

假設輸入數字為
  00001xxxxxxxxxxx
程式執行時
->000011xxxxxxxxxx
->00001111xxxxxxxx
->000011111111xxxx
->0000111111111111
最後 + 1 變成
->0001000000000000

builtin_clzl 改寫

uint64_t next_pow2(uint64_t x)
{
    if (!x)
        return 1;
    int k = __builtin_clzl(x);
    return 1 << (64 - k);
}

2.在 Linux 核心原始程式碼找出類似的使用案例並解釋

–––––––––待辦––––––––––

3. 當上述 clz 內建函式已運用時，編譯器能否產生對應的 x86 指令？

有 bsrq

	.file	"next_pow2.c"
	.text
	.p2align 4
	.globl	next_pow2
	.type	next_pow2, @function
next_pow2:
.LFB0:
	.cfi_startproc
	endbr64
	movl	$1, %eax
	testq	%rdi, %rdi
	je	.L1
	bsrq	%rdi, %rdi
	movl	$64, %ecx
	xorq	$63, %rdi
	subl	%edi, %ecx
	sall	%cl, %eax
	cltq

測驗 `2`

1. 解釋上述程式碼運作原理

int concatenatedBinary(int n)
{
    const int M = 1e9 + 7;
    int len = 0; /* the bit length to be shifted */
    /* use long here as it potentially could overflow for int */
    long ans = 0;
    for (int i = 1; i <= n; i++) {
        // removing the rightmost set bit
        // e.g. 100100 -> 100000
        //      000001 -> 000000
        //      000000 -> 000000
        // after removal, if it is 0, then it means it is power of 2
        // as all power of 2 only contains 1 set bit
        // if it is power of 2, we increase the bit length
        if (!(i & (i-1)))
            len++;
        ans = (i | (ans << len)) % M;
    }
    return ans;
}

(1) mod(

$10^{9} + 7$ )

參考資料：https://www.geeksforgeeks.org/modulo-1097-1000000007/
簡單來說，這邊是用來防止overflow的

(2) if (!(i & (i-1)))
如同註解寫的，這裡用來判斷連接時要用幾個 bit

i = 1
去掉最左邊的 1 之後為 0，此時 len = len + 1 = 1，代表當前的數字要用 1 個 bits 表示

i= 2 (10)
去掉最左邊的 1 之後為 0，此時 len = len + 1 = 2，代表當前的數字要用 2 個 bits 表示

i = 3 (11)
去掉最左邊的 1 之後不為 0，len = 2 不變，代表當前的數字要用 2 個 bits 表示
.
.
.
以此類推

2. 嘗試使用 `__builtin_{clz,ctz,ffs}` 改寫，並改進 mod
$10^{9} + 7$ 的運算

int concatenatedBinary(int n)
{
    const int M = 1e9 + 7;
    /* use long here as it potentially could overflow for int */
    long ans = 0;
    for (int i = 1; i <= n; i++) {
        ans = i | (ans << (32 - __builtin_clz(i)));
        ans = (ans > M)? ans : ans % M;
    }
    return ans;
}

modulo改善策略：在有需要的時候做modulo就好
參考資料：Built-in mod ('%') vs custom mod function: improve the performance of modulus operation

測驗 `3`

1. 解釋上述程式碼運作原理，比較 SWAR 和原本的實作效能落差




















size_t swar_count_utf8(const char *buf, size_t len)
{
    const uint64_t *qword = (const uint64_t *) buf;
    const uint64_t *end = qword + len >> 3;

    size_t count = 0;
    for (; qword != end; qword++) {
        const uint64_t t0 = *qword;
        const uint64_t t1 = ~t0;
        const uint64_t t2 = t1 & 0x04040404040404040llu;
        const uint64_t t3 = t2 + t2;
        const uint64_t t4 = t0 & t3;
        count += __builtin_popcountll(t4);
    }

    count = (1 << AAAA) * (len / 8)  - count;
    count += (len & BBBB) ? count_utf8((const char *) end, len & CCCC) : DDDD;

    return count;
}

SWAR版和原本的差別是在它可以一次對比 8 bytes，最後再加上未滿 8 bytes的序列。

第 4 行做的事就是設定以 8 bytes 為單位的結尾，並在接下來的 for loop 計算 "後續位元組數量"。
字元數量則 = 全長 - "後續位元組數量"，得出 AAAA = 3
又因為以 8 bytes 為單位，會有餘數問題，所以在 17 行再加上剩下 bytes 的位元數
所以 BBBB = 7, CCCC = 7, DDDD = 0

2. 在 Linux 核心原始程式碼找出 UTF-8 和 Unicode 相關字串處理的程式碼，探討其原理，並指出可能的改進空間

–––––待辦–––––

測驗 `4`

1. 解釋上述程式碼運作原理

將符合 pattern 的整數轉為 2 進制後就可以很容易看出規律：符合 pattern 的整數存在一個位置，使得左邊的位元都為 1，右邊都為 0

pattern: 8000 (32768) = 1000 0000 0000 0000
pattern: c000 (49152) = 1100 0000 0000 0000
pattern: e000 (57344) = 1110 0000 0000 0000
pattern: f000 (61440) = 1111 0000 0000 0000
pattern: f800 (63488) = 1111 1000 0000 0000
(以下略)

而精簡版程式碼可以寫為

bool is_pattern(uint16_t x)
{
    const uint16_t n = -x;
    return (n ^ x) < x;
}

原理大概是這樣（參考chun61205)

將所有符合的數字列出
1000000000000
1100000000000
1110000000000
.
.
.
1111111111111

二補數後會將他們變成

1000000000000
0100000000000
0010000000000
.
.
.
0000000000001

再跟原本做 XOR

0000000000000
1000000000000
1100000000000
.
.
.
1111111111110

相當於是 clear 最右邊的那個 1
所以結果必小於原本的數字

2023q1 Homework2 (quiz2)

測驗 1

1. 解釋上述程式碼原理，並用 __builtin_clzl 改寫

測驗 2

1. 解釋上述程式碼運作原理

2. 嘗試使用 __builtin_{clz,ctz,ffs} 改寫，並改進 mod 109+7 的運算

測驗 3

1. 解釋上述程式碼運作原理，比較 SWAR 和原本的實作效能落差

2. 在 Linux 核心原始程式碼找出 UTF-8 和 Unicode 相關字串處理的程式碼，探討其原理，並指出可能的改進空間

測驗 4

1. 解釋上述程式碼運作原理

Read more

2023q1 Homework1 (lab0)

測驗 `1`

1. 解釋上述程式碼原理，並用 `__builtin_clzl` 改寫

測驗 `2`

2. 嘗試使用 `__builtin_{clz,ctz,ffs}` 改寫，並改進 mod
$10^{9} + 7$ 的運算

測驗 `3`

測驗 `4`