
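# 2023q1 Homework2 (quiz2)
contributed by < `LiChiiiii` >

> Quiz questions: [quiz2](https://hackmd.io/@sysprog/linux2023-quiz2)

## Problem 1
### How it works
`next_pow2` takes an unsigned 64-bit value x and returns the smallest power of 2 that is greater than or equal to x, for example:
* `next_pow2(7)` = 8
* `next_pow2(13)` = 16
* `next_pow2(42)` = 64

My first idea was to shift x left by 1, keep only the highest set bit, and clear everything else. The implementation from Problem 1 instead uses `shift` and `or` to set every bit from the highest set bit down to the lowest bit to 1, and then returns x+1.

Example:
```
x   = 0000011001000000
x   = 0000011111111111
x+1 = 0000100000000000
```

```c
uint64_t next_pow2(uint64_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return x+1;
}
```

Observation: if x is already a power of 2, the function should return x itself, but the code above returns 2x instead.

Example:
```
x   = 0000010000000000
x   = 0000011111111111
x+1 = 0000100000000000
```

A condition that tests whether x is already a power of 2 can therefore be added at the top, which also skips the unnecessary operations. Note that the test is also true for x = 0, so that case needs its own handling.
```c
if((x & (x - 1)) == 0)
    return x;
```

### Rewriting with `__builtin_clzl`
`__builtin_clzl(x)` returns the number of leading 0-bits in x; its result is undefined for x = 0, hence the explicit check below. Subtracting `__builtin_clzl(x)` from 64 gives the position of the highest set bit, which is the shift amount for the next power of 2. This assumes `unsigned long` is 64 bits wide, as on x86-64 Linux. For x greater than $2^{63}$ the answer does not fit in 64 bits and the shift count becomes 64, which is undefined behavior, so such inputs must be excluded.

```c
uint64_t next_pow2(uint64_t x)
{
    if (x == 0)
        return 1;
    else if ((x & (x - 1)) == 0)
        return x;
    else
        return ((uint64_t)1) << (64 - __builtin_clzl(x));
}
```

### Find similar use cases in the Linux kernel source and explain them
The [The Linux Kernel API](https://www.kernel.org/doc/html/latest/core-api/kernel-api.html#c.is_power_of_2) documentation led me to `__roundup_pow_of_two`. Its source code is in [linux/log2.h](https://github.com/torvalds/linux/blob/master/include/linux/log2.h).

#### `__roundup_pow_of_two`
Given an unsigned integer n, it returns the smallest power of 2 that is greater than or equal to n; if n is already a power of 2, it returns n.
```c
unsigned long __roundup_pow_of_two(unsigned long n)
{
    return 1UL << fls_long(n - 1);
}
```

:::info
:stars: `fls_long` finds the last (most significant) set bit: it returns the 1-based position of that bit, or 0 when no bit is set. This equals the word width minus the number of leading 0s. [linux/bitops.h](https://github.com/torvalds/linux/blob/master/include/linux/bitops.h)
```c
static inline unsigned fls_long(unsigned long l)
{
    if (sizeof(l) == 4)
        return fls(l);
    return fls64(l);
}
```
If `l` is a 32-bit integer (sizeof(l) == 4), `fls` is called. If `l` is a 64-bit integer (sizeof(l) > 4), `fls64` is called.
:::

#### A use case of `__roundup_pow_of_two`
In [commit 8758768](https://github.com/torvalds/linux/commit/8758768ad8aa9fc0d56417315dec65b610fc3a21), the original `i40iw_qp_round_up` is replaced with `roundup_pow_of_two`.

> [color=#e85f71] `i40iw_qp_round_up` follows the same idea as `next_pow2`. Each iteration of the for loop shifts wqdepth right by scount bits, ORs the result into wqdepth, and then doubles scount. The effect is to set every bit from the highest set bit down to the lowest bit to 1. The loop always runs five iterations (scount = 1, 2, 4, 8, 16), that is, O(log w) steps for a word width of w = 32 bits.

> [color=lightgreen] `roundup_pow_of_two` shifts an unsigned long 1 left by `fls_long(n - 1)`, the position of the highest set bit of n - 1. On architectures with a bit-scan instruction this takes O(1) time.

```diff
-/**
- * i40iw_qp_roundup - return round up QP WQ depth
- * @wqdepth: WQ depth in quantas to round up
- */
-static int i40iw_qp_round_up(u32 wqdepth)
-{
-	int scount = 1;
-
-	for (wqdepth--; scount <= 16; scount *= 2)
-		wqdepth |= wqdepth >> scount;
-	return ++wqdepth;
-}
```
```diff
 enum i40iw_status_code i40iw_get_sqdepth(u32 sq_size, u8 shift, u32 *sqdepth)
 {
-	*sqdepth = i40iw_qp_round_up((sq_size << shift) + I40IW_SQ_RSVD);
+	*sqdepth = roundup_pow_of_two((sq_size << shift) + I40IW_SQ_RSVD);
 	...
 }
```

### Once the `clz` builtin is used, can the compiler emit the corresponding x86 instruction?
x86 has an instruction, written `bsrq` in AT&T syntax, that finds the position of the highest set bit. For a 64-bit operand the leading-zero count is 63 minus that position, so the compiler can implement `clz` with `bsrq` followed by `xorq $63`. Running `cc -O2 -std=c99 -S next_pow2.c` confirms this: `bsrq` appears on line 22 of `next_pow2.s`.

:::spoiler next_pow2.s
```s=
	.file	"next_pow2.c"
	.text
	.p2align 4
	.globl	next_pow2
	.type	next_pow2, @function
next_pow2:
.LFB0:
	.cfi_startproc
	endbr64
	movl	$1, %eax
	testq	%rdi, %rdi
	je	.L1
	leaq	-1(%rdi), %rdx
	movq	%rdi, %rax
	testq	%rdi, %rdx
	jne	.L8
.L1:
	ret
	.p2align 4,,10
	.p2align 3
.L8:
	bsrq	%rdi, %rax
	movl	$64, %ecx
	xorq	$63, %rax
	subl	%eax, %ecx
	movl	$1, %eax
	salq	%cl, %rax
	ret
	.cfi_endproc
.LFE0:
	.size	next_pow2, .-next_pow2
	.ident	"GCC: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	1f - 0f
	.long	4f - 1f
	.long	5
0:
	.string	"GNU"
1:
	.align 8
	.long	0xc0000002
	.long	3f - 2f
2:
	.long	0x3
3:
	.align 8
4:
```
:::

:::info
:stars: What is the difference between the `BSR (Bit Scan Reverse)` instruction and the `bsrq` instruction?
> Both BSR (Bit Scan Reverse) and bsrq are bit-manipulation instructions in the x86 instruction set; the main difference between them is the operand size.
>
> Specifically, BSR operates on 16-bit or 32-bit operands, while bsrq operates on 64-bit operands. Each of them finds the position of the most significant bit (MSB) that is set in the given operand and returns it.
>
> [name=chatGPT]

Note: more precisely, `bsrq` is the AT&T-syntax spelling of BSR with the `q` suffix selecting a 64-bit operand; `bsrw` and `bsrl` are the 16-bit and 32-bit forms of the same instruction.
:::

## Problem 2
### How it works
Given an integer n, concatenate the binary representations of 1 through n in order and return the decimal value of the resulting binary string modulo $10^9+7$.

`len` records the bit length of the integer `i`, which is also the number of bits to shift left. The loop runs from i = 1 to i = n, so the bit length only grows. When `i` is a power of 2, its binary representation gains one more bit, so `len++` is executed. Each iteration shifts the result of the previous iterations left by `len` bits, merges `i` in with `or`, and then applies `mod M` to avoid overflow.

```c
int concatenatedBinary(int n)
{
    const int M = 1e9 + 7;
    int len = 0; /* the bit length to be shifted */
    /* use long here as it potentially could overflow for int */
    long ans = 0;
    for (int i = 1; i <= n; i++) {
        if (!(i & (i-1)))
            len++;
        ans = (i | (ans << len)) % M;
    }
    return ans;
}
```

### Rewrite with `__builtin_{clz,ctz,ffs}` and improve the $\bmod\ 10^9+7$ computation

**int __builtin_ffs (int x)**
> Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.

**int __builtin_ctz (unsigned int x)**
> Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.

**int __builtin_clz (unsigned int x)**
> Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.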
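To make the three builtins concrete, here is a minimal sketch that wraps them so the zero case is well defined. The helper names (`ctz_or_width`, `clz_or_width`, `ffs_index`) and the `UINT_BITS` macro are hypothetical and exist only for this illustration; the sketch assumes GCC or Clang.

```c
#include <limits.h>

/* Width of unsigned int in bits (32 on common platforms). */
#define UINT_BITS ((int) (sizeof(unsigned int) * CHAR_BIT))

/* Hypothetical wrapper: number of trailing 0-bits,
 * defined here as UINT_BITS for x == 0 (the builtin is undefined there). */
static inline int ctz_or_width(unsigned int x)
{
    return x ? __builtin_ctz(x) : UINT_BITS;
}

/* Hypothetical wrapper: number of leading 0-bits,
 * defined here as UINT_BITS for x == 0 (the builtin is undefined there). */
static inline int clz_or_width(unsigned int x)
{
    return x ? __builtin_clz(x) : UINT_BITS;
}

/* 1-based index of the lowest set bit; __builtin_ffs already
 * returns 0 for x == 0, so no guard is needed. */
static inline int ffs_index(int x)
{
    return __builtin_ffs(x);
}
```

For example, 40 is `101000` in binary, so with a 32-bit `unsigned int`, `ffs_index(40)` is 4, `ctz_or_width(40)` is 3, and `clz_or_width(40)` is 26. For any nonzero x, `__builtin_ffs(x)` equals `__builtin_ctz(x) + 1`.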
* Idea: use the builtins to count the 0s starting from the most significant bit and the 0s starting from the least significant bit. A power of 2 has exactly one set bit, so for a power of 2 the two counts add up to 31 (the argument type `unsigned int` is 32 bits wide).
* Implementation: use `__builtin_clz(i) + __builtin_ctz(i) == 31` as the condition that tests for a power of 2.

```c
int concatenatedBinary(int n)
{
    const int M = 1e9 + 7;
    int len = 0;
    long ans = 0;
    for (int i = 1; i <= n; i++) {
        if (__builtin_clz(i)+ __builtin_ctz(i) == 31)
            len++;
        ans = (i | (ans << len)) % M;
    }
    return ans;
}
```
This rewrite did not improve performance, so I looked for another approach.
* Idea: the condition in the code only exists to determine the shift amount. Can the shift amount be computed directly, so that the branch disappears?
* Implementation: `__builtin_clz` returns the number of consecutive 0s starting from the most significant bit, so subtracting `__builtin_clz(i)` from 32 gives the number of bits to shift.

```c
int concatenatedBinary(int n)
{
    const int M = 1e9 + 7;
    long ans = 0;
    for (int i = 1; i <= n; i++){
        ans = (i | (ans << (32 - __builtin_clz(i)))) % M;
    }
    return ans;
}
```

## Problem 3
### How it works
A UTF-8 character consists of 1, 2, 3, or 4 bytes. A single-byte UTF-8 character is an ASCII character, and its MSB is always 0. A multi-byte UTF-8 character consists of one lead byte followed by 1, 2, or 3 continuation bytes. The lead byte indicates how many bytes the character occupies. For example, the lead byte of a 2-byte character starts with two consecutive 1s from the most significant bit. The table below lists the rules:

|         | UTF-8                                   |
|:------- |:--------------------------------------- |
| ASCII   | 0xxx.xxxx                               |
| 2 bytes | 110x.xxxx 10xx.xxxx                     |
| 3 bytes | 1110.xxxx 10xx.xxxx 10xx.xxxx           |
| 4 bytes | 1111.0xxx 10xx.xxxx 10xx.xxxx 10xx.xxxx |

The following code uses SWAR (SIMD Within A Register) to count the characters in a given UTF-8 encoded string. The for loop processes 8 bytes at a time, so `len>>3` (that is, `len/8`) first computes how many 8-byte blocks there are. Inside the loop, the `not bit6 and bit7` test is applied to all 8 bytes at once: `t2` keeps the inverted bit 6 of every byte, `t3` moves it up to the bit 7 position, and `t4` ANDs it with the original bit 7. [`__builtin_popcountll`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) then counts the 1s, which is the number of `continuation byte`s (bytes of the form `10xx.xxxx`). Finally, subtracting the number of `continuation byte`s from the total string length gives the number of characters.

```c=
size_t swar_count_utf8(const char *buf, size_t len)
{
    const uint64_t *qword = (const uint64_t *) buf;
    const uint64_t *end = qword + (len >> 3);

    size_t count = 0;
    for (; qword != end; qword++) {
        const uint64_t t0 = *qword;
        const uint64_t t1 = ~t0;
        const uint64_t t2 = t1 & 0x4040404040404040llu;
        const uint64_t t3 = t2 + t2;
        const uint64_t t4 = t0 & t3;
        count += __builtin_popcountll(t4);
    }

    count = (1 << 3) * (len / 8) - count;
    count += (len & 7) ? count_utf8((const char *) end, len & 7) : 0;

    return count;
}
```
> Line 16: `(1 << 3) * (len / 8)` is the number of bytes handled by the loop; the number of `continuation byte`s is subtracted from it.
> Line 17: if `len & 7` is not 0, `len` is not a multiple of 8. In that case `count_utf8` is called to count the bytes the loop did not process, and its return value is added to `count`.

### Comparing the performance of SWAR and the original implementation
`swar_count_utf8` runs fewer loop iterations than `count_utf8`, because it handles 8 bytes per iteration with bitwise operations, whereas `count_utf8` has to visit and test every single byte.
* Experiment: generate a random string of length 1000000, run it through `swar_count_utf8` and `count_utf8`, and use `perf` to repeat each run 100 times while counting `instructions` and `cycles`. The results below show that `count_utf8` executes more instructions and takes longer, although the gap is modest: about 6% more instructions, about 8% more cycles, and about 5% more elapsed time.
```shell
$ sudo perf stat --repeat 100 -e instructions,cycles ./swar_count_utf8

 Performance counter stats for './swar_count_utf8' (100 runs):

         9058,4543      instructions              #    1.47  insn per cycle           ( +-  0.00% )
         6114,8381      cycles                                                        ( +-  0.29% )

         0.0167870 +- 0.0000631 seconds time elapsed  ( +-  0.38% )
```
```shell
$ sudo perf stat --repeat 100 -e instructions,cycles ./count_utf8

 Performance counter stats for './count_utf8' (100 runs):

         9609,5870      instructions              #    1.44  insn per cycle           ( +-  0.00% )
         6602,1809      cycles                                                        ( +-  0.56% )

         0.0175909 +- 0.0000975 seconds time elapsed  ( +-  0.55% )
```

### Find UTF-8 and Unicode string-handling code in the Linux kernel source, discuss how it works, and point out possible improvements
UTF-8 and Unicode string-handling code can be found in [/fs/unicode/utf8-norm.c](https://github.com/torvalds/linux/blob/master/fs/unicode/utf8-norm.c).
`utf8byte()` is a decoder for UTF-8 strings. Each call returns the next single byte of the normalized form of the string, and the function handles the various special cases that can arise while decoding and parsing. Some parts of the code are repeated and could be extracted into helper functions to improve readability: the "emit the current byte" sequence (`if (!u8c->p) u8c->len--; return (unsigned char)*u8c->s++;`) appears twice, and so does the "skip the current character" sequence (`if (!u8c->p) u8c->len -= utf8clen(u8c->s); u8c->s += utf8clen(u8c->s);`).

:::spoiler `utf8byte(struct utf8cursor *u8c)`
```c
int utf8byte(struct utf8cursor *u8c)
{
	utf8leaf_t *leaf;
	int ccc;

	for (;;) {
		/* Check for the end of a decomposed character. */
		if (u8c->p && *u8c->s == '\0') {
			u8c->s = u8c->p;
			u8c->p = NULL;
		}

		/* Check for end-of-string. */
		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
			/* There is no next byte. */
			if (u8c->ccc == STOPPER)
				return 0;
			/* End-of-string during a scan counts as a stopper. */
			ccc = STOPPER;
			goto ccc_mismatch;
		} else if ((*u8c->s & 0xC0) == 0x80) {
			/* This is a continuation of the current character. */
			if (!u8c->p)
				u8c->len--;
			return (unsigned char)*u8c->s++;
		}

		/* Look up the data for the current character. */
		if (u8c->p) {
			leaf = utf8lookup(u8c->um, u8c->n, u8c->hangul, u8c->s);
		} else {
			leaf = utf8nlookup(u8c->um, u8c->n, u8c->hangul,
					   u8c->s, u8c->len);
		}

		/* No leaf found implies that the input is a binary blob. */
		if (!leaf)
			return -1;

		ccc = LEAF_CCC(leaf);
		/* Characters that are too new have CCC 0. */
		if (u8c->um->tables->utf8agetab[LEAF_GEN(leaf)] >
		    u8c->um->ntab[u8c->n]->maxage) {
			ccc = STOPPER;
		} else if (ccc == DECOMPOSE) {
			u8c->len -= utf8clen(u8c->s);
			u8c->p = u8c->s + utf8clen(u8c->s);
			u8c->s = LEAF_STR(leaf);
			/* Empty decomposition implies CCC 0. */
			if (*u8c->s == '\0') {
				if (u8c->ccc == STOPPER)
					continue;
				ccc = STOPPER;
				goto ccc_mismatch;
			}

			leaf = utf8lookup(u8c->um, u8c->n, u8c->hangul, u8c->s);
			if (!leaf)
				return -1;
			ccc = LEAF_CCC(leaf);
		}

		/*
		 * If this is not a stopper, then see if it updates
		 * the next canonical class to be emitted.
		 */
		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
			u8c->nccc = ccc;

		/*
		 * Return the current byte if this is the current
		 * combining class.
		 */
		if (ccc == u8c->ccc) {
			if (!u8c->p)
				u8c->len--;
			return (unsigned char)*u8c->s++;
		}

		/* Current combining class mismatch. */
ccc_mismatch:
		if (u8c->nccc == STOPPER) {
			/*
			 * Scan forward for the first canonical class
			 * to be emitted.  Save the position from
			 * which to restart.
			 */
			u8c->ccc = MINCCC - 1;
			u8c->nccc = ccc;
			u8c->sp = u8c->p;
			u8c->ss = u8c->s;
			u8c->slen = u8c->len;
			if (!u8c->p)
				u8c->len -= utf8clen(u8c->s);
			u8c->s += utf8clen(u8c->s);
		} else if (ccc != STOPPER) {
			/* Not a stopper, and not the ccc we're emitting. */
			if (!u8c->p)
				u8c->len -= utf8clen(u8c->s);
			u8c->s += utf8clen(u8c->s);
		} else if (u8c->nccc != MAXCCC + 1) {
			/* At a stopper, restart for next ccc. */
			u8c->ccc = u8c->nccc;
			u8c->nccc = MAXCCC + 1;
			u8c->s = u8c->ss;
			u8c->p = u8c->sp;
			u8c->len = u8c->slen;
		} else {
			/* All done, proceed from here. */
			u8c->ccc = STOPPER;
			u8c->nccc = STOPPER;
			u8c->sp = NULL;
			u8c->ss = NULL;
			u8c->slen = 0;
		}
	}
}
```
:::

## Problem 4
### How it works
This code checks whether x consists of **exactly one run of consecutive 1s starting at the most significant bit**. Zero is rejected up front. A for loop then keeps shifting x left and checks the most significant bit each time: if that bit is 0 while x is still nonzero, the function returns false; once x becomes zero the loop ends and the function returns true.
```c
#include <stdint.h>
#include <stdbool.h>
bool is_pattern(uint16_t x)
{
    if (!x)
        return false;

    for (; x > 0; x <<= 1) {
        if (!(x & 0x8000))
            return false;
    }

    return true;
}
```
The code can be made much more compact. First compute the two's complement of x ( ~x + 1 ), which keeps the lowest set bit of x and inverts every bit above it. XORing it with x therefore yields 1s in every position above the lowest set bit and 0s elsewhere. That value is smaller than x exactly when the bits of x above its lowest set bit are already all 1s, so a single comparison decides whether x matches the pattern.
```c
bool is_pattern(uint16_t x)
{
    const uint16_t n = ~x + 1;
    return (n ^ x) < x;
}
```
Example:
```c
x        = 1111110000000000
~x+1     = 0000010000000000
x^(~x+1) = 1111100000000000 < x  // matches the pattern
```
```c
x        = 1101110000000000
~x+1     = 0010010000000000
x^(~x+1) = 1111100000000000 > x  // does not match the pattern
```

### Find this bitmask and its generators in the Linux kernel source and discuss where they are used
Following [Data Structures in the Linux Kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-3.html), I found `BITMAP_FIRST_WORD_MASK` and `BITMAP_LAST_WORD_MASK`. Their source code is in [linux/bitmap.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/bitmap.h).

#### `BITMAP_FIRST_WORD_MASK`
This macro produces a **mask of consecutive 1s starting at the most significant bit**. The argument start is the number of 0 bits counted from the least significant bit; all remaining bits are 1. The macro shifts `~0UL` (`0xFFFFFFFF` when `unsigned long` is 32 bits) left by `(start) & (BITS_PER_LONG - 1)`, where `BITS_PER_LONG` is a constant giving the number of bits in the unsigned long type, usually 32 or 64 depending on the machine architecture.
```c
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
```
Example (with `BITS_PER_LONG` equal to 32):
```c
BITMAP_FIRST_WORD_MASK(4) = 0xFFFFFFF0
/* computation steps */
~0UL << ((4) & (32 - 1))
0xFFFFFFFF << (4 & 31)
0xFFFFFFFF << 4
```

#### `BITMAP_LAST_WORD_MASK`
This macro produces a **mask of consecutive 1s starting at the least significant bit**. The argument nbits is the number of 1 bits counted from the least significant bit; all remaining bits are 0.
```c
#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
```
Assuming `BITS_PER_LONG` is 32, the macro shifts `0xFFFFFFFF` right by `(32 - nbits) mod 32`, so an nbits that is a multiple of 32 yields a full mask.
Example:
```c
BITMAP_LAST_WORD_MASK(4) = 0x0000000F
/* computation steps */
~0UL >> (-(4) & (32 - 1))
0xFFFFFFFF >> (-4 & 31)
0xFFFFFFFF >> 28
```

#### Practical use
`__bitmap_clear` clears the bits in a given range of a bitmap.
> `map` : pointer to the bitmap
> `start` : position of the first bit in the bitmap to clear
> `len` : number of bits to clear

The first, possibly partial, word is cleared with `BITMAP_FIRST_WORD_MASK(start)`. Every following full word is cleared with `~0UL`. If bits remain at the end, the current mask is restricted with `BITMAP_LAST_WORD_MASK(size)` so that only the bits below `start + len` are cleared.

```c
void __bitmap_clear(unsigned long *map, unsigned int start, int len)
{
	unsigned long *p = map + BIT_WORD(start);
	const unsigned int size = start + len;
	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);

	while (len - bits_to_clear >= 0) {
		*p &= ~mask_to_clear;
		len -= bits_to_clear;
		bits_to_clear = BITS_PER_LONG;
		mask_to_clear = ~0UL;
		p++;
	}
	if (len) {
		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
		*p &= ~mask_to_clear;
	}
}
```
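The way the two masks combine in the final `if (len)` branch can be tried outside the kernel. The following is a minimal user-space sketch, not kernel code: `BITS_PER_LONG` is redefined here from `sizeof(unsigned long)` as an assumption, and the helpers `word_range_mask` and `clear_in_word` are hypothetical names introduced only for this illustration. They handle a range that lies within a single word.

```c
#include <limits.h>

/* User-space stand-in for the kernel's BITS_PER_LONG (assumption). */
#define BITS_PER_LONG ((unsigned int) (sizeof(unsigned long) * CHAR_BIT))

/* Same definitions as in linux/bitmap.h. */
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

/*
 * Hypothetical helper: mask with bits [start, start + len) set.
 * Requires len > 0 and start + len <= BITS_PER_LONG, i.e. the
 * range must be non-empty and fit in one word.
 */
static inline unsigned long word_range_mask(unsigned int start,
                                            unsigned int len)
{
    return BITMAP_FIRST_WORD_MASK(start) & BITMAP_LAST_WORD_MASK(start + len);
}

/* Hypothetical helper: clear bits [start, start + len) of one word. */
static inline unsigned long clear_in_word(unsigned long word,
                                          unsigned int start,
                                          unsigned int len)
{
    return word & ~word_range_mask(start, len);
}
```

For instance, `word_range_mask(4, 8)` is `0xFF0`, so `clear_in_word(0xFFFF, 4, 8)` gives `0xF00F`. The `len > 0` precondition mirrors the `if (len)` test in `__bitmap_clear`: with `start = 0` and `len = 0` both macros would return a full mask.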