2023q1 Homework2 (quiz2)

# 2023q1 Homework2 (quiz2) contributed by < `paulpeng-popo` > > [題目](https://hackmd.io/@sysprog/linux2023-quiz2) ## 測驗一 ### 解釋程式碼原理輸入一個數，目標是找到最接近且大於等於 2 的冪的值，舉例來說 - $next\_pow2(7) = 8$ - $next\_pow2(13) = 16$ - $next\_pow2(42) = 64$ 最初的想法，就是找出最左邊的 1，然後讓它左移 1 就完成了，程式如下： ```c uint64_t my_next_pow2(uint64_t x) { x |= x >> 32; x |= x >> 16; x |= x >> 8; x |= x >> 4; x |= x >> 2; x |= x >> 1; return (x ^ (x >> 1)) << 1; } ``` 運作如下，利用二分法的概念，逐步設定 (set 1) 每一部份對應的位置 ```c x = 0010000000000000 // input x x = 00100000 00100000 // x |= x >> 8; x = 0010 0010 0010 0010 // x |= x >> 4; x = 00 10 10 10 10 10 10 10 // x |= x >> 2; x = 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 // x |= x >> 1; ``` 當然，細看最後一行 return，會發現有點冗長，最後經過 [JoshuaLee0321](https://hackmd.io/@p4uHLG53RQ2mdlxeWiipWg/quiz2#%E6%B8%AC%E9%A9%97%E4%B8%80) 提醒用 `x + 1` 其實就能取代 `(x ^ (x >> 1)) << 1` 接著回到老師給出的 snippets ```c uint64_t next_pow2(uint64_t x) { x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> AAAA; x |= x >> 16; x |= x >> BBBB; return CCCC; } ``` 可以發現與我先前的程式相似，只是 bottom-up 和 top-down 的差別，因此也能進一步把前 7 次 rshift 1 合併成 ```c x |= x >> 1; x |= x >> 2; x |= x >> 4; ``` 那這邊我發現一個小問題，就是雖然我們的目的是找出**最接近且大於等於 2 的冪的值**，但若輸入本身就是 2 的冪，那應該是要回傳 **自己** 還是 **下一個冪** ? ```c uint64_t next_pow2(uint64_t x) { uint8_t lo = 0, hi = 63; while (lo < hi) { uint8_t test = (lo + hi)/2; if (x < pow2(test)) { hi = test; } else if (pow2(test) < x) { lo = test+1; } else { return pow2(test); } } return pow2(lo); } ``` 我拿 `next_pow2` 跟 bitwise 的 `next_pow2` 比較 ```c next_pow2(64) // output = 64 // bitwise next_pow2(64) // output = 128 ``` 結果沒錯，兩個函式並不等價根據我在 Linux Kernel 看到 next_pow2 的用法，我認為應該 2 的冪應該要回傳 **自己** 因此將 bitwise 的 `next_pow2` 修改成如下： ```c uint64_t y = x; ... return y == ((x+1) >> 1) && y ? y : x + 1; ``` 這樣就可以回傳本來是冪次的 x 參考 [yanjiew1](https://hackmd.io/@yanjiew/linux2023q1-quiz2#%E7%A8%8B%E5%BC%8F%E8%AA%AA%E6%98%8E) 的程式說明後，發現有較簡單的作法，也就是先將輸入的 x 減 1 後，再去做 set，這樣就可以讓最後的結果維持一樣 ```c x--; ... return x + 1; ``` ### 用 __builtin_clzl 改寫 ```c int __builtin_clz (unsigned int x) int __builtin_clzl (unsigned long) ``` > Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. `__builtin_clzl` 會回傳最左邊的 1 其前面有幾個 0，因此我只要把 1 左移 64 - n 個 bit 就可以得到答案了同時也注意到 `__builtin_clzl` 可能會有 undefined behavior，所以我們要先處理 `x = 0` 的情況 ```c uint64_t next_pow2(uint64_t x) { if (!x) return 1; if (!(x & (x - 1))) return x; return 1 << (64 - __builtin_clzl(x)); } ``` ### Linux 核心原始程式碼類似的使用案例在 Dec 17 2014 之前的 commits，我們可以在 [util.h](https://github.com/torvalds/linux/commit/91529834d1dea9afccb72843c3e547e703ec177f) 找到 `next_pow2` 的 definition 這裡的 `next_pow2_l` 使用條件編譯的方式增加了對 unsigned long 的支援 ```c static inline unsigned next_pow2(unsigned x) { if (!x) return 1; return 1ULL << (32 - __builtin_clz(x - 1)); } static inline unsigned long next_pow2_l(unsigned long x) { #if BITS_PER_LONG == 64 if (x <= (1UL << 31)) return next_pow2(x); return (unsigned long)next_pow2(x >> 32) << 32; #else return next_pow2(x); #endif } ``` 而在之後，這兩個函式被 [roundup_pow_of_two](https://github.com/torvalds/linux/commit/bd1857948e7667313f9a7bee9b2492c0848174a6) 取代，其中有提到 `fls_long` 是用來找從 MSB 開始出現第一個 1 的位置，而 `fls_long` 的實作中也可以見到 `clz` 的身影 [`linux/include/linux/bitops.h`](https://github.com/torvalds/linux/blob/master/include/linux/bitops.h#L203) 額外參考資料：[linux2021 彙整學員們的作業成果](https://hackmd.io/@gyes00205/HytYnbKtu#The-Linux-Kernel-API) ```c= static inline __attribute__((const)) unsigned long __roundup_pow_of_two(unsigned long n) { return 1UL << fls_long(n - 1); } #define roundup_pow_of_two(n) \ ( \ __builtin_constant_p(n) ? ( \ (n == 1) ? 1 : \ (1UL << (ilog2((n) - 1) + 1)) \ ) : \ __roundup_pow_of_two(n) \ ) ``` 注意到，`roundup_pow_of_two` 在第 10 行多做一層判斷，也就是當 n = 1 時，讓值直接為 1，而不繼續呼叫 `ilog2((n) - 1)` 來處理，猜測原因除了減少函式執行的 clock cycles 數外，也避免了 `__builtin_clzll` 輸入為 0 的情況 ```c #define ilog2(n) \ ( \ __builtin_constant_p(n) ? \ ((n) < 2 ? 0 : \ 63 - __builtin_clzll(n)) : \ (sizeof(n) <= 4) ? \ __ilog2_u32(n) : \ __ilog2_u64(n) \ ) ``` 在應用方面比如說 [drivers/md/raid5.c](https://github.com/torvalds/linux/blob/master/drivers/md/raid5.c#L6972) 中的 `raid5_store_stripe_size`，他在裡面就有檢查 new 是否為 2 的冪次方大小 ```c if (new % DEFAULT_STRIPE_SIZE != 0 || new > PAGE_SIZE || new == 0 || new != roundup_pow_of_two(new)) return -EINVAL; ``` 引用 chatgpt 的解釋 > raid5_store_stripe_size() 是 Linux 核心中用於處理 RAID5 塊裝置儲存條帶大小的函式。RAID5 是一種磁碟陣列技術，透過將資料條帶化（striping）儲存在多個磁碟中，並使用奇偶校驗（parity）實現資料冗餘和容錯。 > > 該函式的作用是將使用者傳入的儲存條帶大小值（stripe_size）儲存在 RAID5 塊裝置相關的結構體中，從而影響磁碟陣列的讀寫效能和空間利用效率。具體來說，該函式會檢查儲存條帶大小值是否符合一定的規範，例如必須是 2 的冪次方，且不能小於硬體支援的最小條帶大小等。如果輸入的值不符合要求，函式將返回錯誤碼，否則將更新 RAID5 塊裝置結構體中的儲存條帶大小資訊。 ### 編譯器能否產生對應的 x86 指令 ```shell $ gcc -O2 -std=c99 -S test1.c ``` 可以看到第 12 行有 [`bsr`](https://www.felixcloutier.com/x86/bsr#description) 指令出現 ```= next_pow2: .LFB26: .cfi_startproc endbr64 movl $1, %eax testq %rdi, %rdi je .L10 leaq -1(%rdi), %rdx movq %rdi, %rax testq %rdi, %rdx je .L10 bsrq %rdi, %rax movl $64, %ecx xorq $63, %rax subl %eax, %ecx movl $1, %eax sall %cl, %eax cltq .L10: ret .cfi_endproc ``` ## 測驗二 ### 解釋程式碼運作原理 [LeetCode 1680. Concatenation of Consecutive Binary Numbers](https://leetcode.com/problems/concatenation-of-consecutive-binary-numbers/) 給定一整數 n，回傳將 1 到 n 的二進位表示法依序串接在一起所得到的二進位字串，其所代表的十進位數字 mod $10^9 + 7$ 之值。 - 想法： - 每次左移 n 個 bit(s) - 然後跟要 concate 的數字做 OR operation 舉例：concate 1, 2, 3 ``` | v 0000 0000 // 初始值 | v 0000 0000 // 左移 n bit(s)，這時 n = 1 0000 0001 // 跟 1 做 OR | v 0000 0100 // 左移 n bit(s)，這時 n = 2 0000 0110 // 跟 2 做 OR | v 0001 1000 // 左移 n bit(s)，這時 n = 2 0001 1011 // 跟 3 做 OR 輸出 27 ``` - 該怎麼決定 n - 可以發現，當要 concate 2 的冪的時候，n 會比前一次多位移 1 - 而非 2 的冪的時候，n 維持前一次的值即可回到老師給出的 snippets ```c= for (int i = 1; i <= n; i++) { /* removing the rightmost set bit * e.g. 100100 -> 100000 * 000001 -> 000000 * 000000 -> 000000 * after removal, if it is 0, then it means it is power of 2 * as all power of 2 only contains 1 set bit * if it is power of 2, we increase the bit length */ if (!(DDDD)) len++; ans = (i | (EEEE)) % M; } ``` 可以看到第 10 行用來判斷 i 是否為 2 的冪，是的話，長度 len 加 1 而 12 行用來做位移然後相加，最後處理 mod $10^9 + 7$ ### 使用 \_\_builtin_{clz,ctz,ffs} 改寫 ```c int __builtin_ffs (int x); int __builtin_clz (unsigned int x); int __builtin_ctz (unsigned int x); ``` > /\* __builtin_ffs(x) \*/ > Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero. > > /\* __builtin_clz(x) \*/ > Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. > > /\* __builtin_ctz(x) \*/ > Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined. 找出 i 前面連續 0 的個數與後面連續 0 的個數相加是否為 31，如果為 31 代表 i 為 2 的冪次方 ; 接著使用 `__builtin_ffs` 找到需要位移的長度 ```c for (int i = 1; i <= n; i++) { if (__builtin_clz(i) + __builtin_ctz(i) == 31) len = __builtin_ffs(i); ans = (i | (ans << len)) % M; } ``` 參考 [joshualee0321](https://hackmd.io/@p4uHLG53RQ2mdlxeWiipWg/linux-2023-quiz2#%E4%BD%BF%E7%94%A8__builtin-%E6%94%B9%E5%AF%AB) 的作法，發現可以只使用 `__builtin_clz` 便直接得出 len 的大小，同時也減少使用 branch 的開銷 ```c for (int i = 1; i <= n; i++) { len = 32 - __builtin_clz(i); ans = (i | (ans << len)) % M; } ``` ### 改進 mod $10^9 + 7$ 的運算在上述程式碼中，`i | (ans << len)` 的結果有可能在做 modulo M 之前就已經 overflow 了，雖然可能不會造成計算上的錯誤，但讓程式存在 overflow 風險並不是一件好事，因此根據 modulo 的性質 - $(A + B) \% M = (A \% M + B \% M) \% M$ - $(A \ast B) \% M = (A \% M \ast B \% M) \% M$ 可以將程式碼改成如下： ```c ans = ((i % M) | ((ans % M) << (len % M))) % M; ``` ## 測驗三 ### 解釋程式碼原理 ```c size_t count_utf8(const char *buf, size_t len) { const int8_t *p = (const int8_t *) buf; size_t counter = 0; for (size_t i = 0; i < len; i++) { /* -65 is 0b10111111, anything larger in two-complement's should start * new code point. */ if (p[i] > -65) counter++; } return counter; } ``` (待解釋) ```c size_t swar_count_utf8(const char *buf, size_t len) { const uint64_t *qword = (const uint64_t *) buf; const uint64_t *end = qword + len >> 3; size_t count = 0; for (; qword != end; qword++) { const uint64_t t0 = *qword; const uint64_t t1 = ~t0; const uint64_t t2 = t1 & 0x04040404040404040llu; const uint64_t t3 = t2 + t2; const uint64_t t4 = t0 & t3; count += __builtin_popcountll(t4); } count = (1 << 3) * (len / 8) - count; count += (len & 7) ? count_utf8((const char *) end, len & 7) : 0; return count; } ``` (待解釋) ### 比較 SWAR 和原本的實作效能落差可以使用 perf 比較效能 (待解釋) ### 在 Linux 核心原始程式碼找出 UTF-8 和 Unicode 相關字串處理的程式碼，探討其原理，並指出可能的改進空間可以在 [unicode.c](https://github.com/torvalds/linux/blob/master/fs/udf/unicode.c#L46) 中找到有關 unicode 相關的字串處理程式碼 ```c static unicode_t get_utf16_char(const uint8_t *str_i, int str_i_max_len, int str_i_idx, int u_ch, unicode_t *ret) { unicode_t c; int start_idx = str_i_idx; /* Expand OSTA compressed Unicode to Unicode */ c = str_i[str_i_idx++]; if (u_ch > 1) c = (c << 8) | str_i[str_i_idx++]; if ((c & SURROGATE_MASK) == SURROGATE_PAIR) { unicode_t next; /* Trailing surrogate char */ if (str_i_idx >= str_i_max_len) { c = UNICODE_MAX + 1; goto out; } /* Low surrogate must follow the high one... */ if (c & SURROGATE_LOW) { c = UNICODE_MAX + 1; goto out; } WARN_ON_ONCE(u_ch != 2); next = str_i[str_i_idx++] << 8; next |= str_i[str_i_idx++]; if ((next & SURROGATE_MASK) != SURROGATE_PAIR || !(next & SURROGATE_LOW)) { c = UNICODE_MAX + 1; goto out; } c = PLANE_SIZE + ((c & SURROGATE_CHAR_MASK) << SURROGATE_CHAR_BITS) + (next & SURROGATE_CHAR_MASK); } out: *ret = c; return str_i_idx - start_idx; } ``` ## 測驗四 ### 解釋程式碼原理首先觀察老師給出的範例 ```c for (; x > 0; x <<= 1) { if (!(x & 0x8000)) return false; } return true; ``` - 什麼時候 return false - 當 ($x$ & 0x8000) 為 0 時 - $x$ 為 0xxx ... xxxx (binary format 開頭為 0) - 其中 X 不全為 0，否則跳出迴圈 - 什麼時候 return true - 若 $x$ 跳出迴圈且為 0，則 return true - for 迴圈作用 - 將 $x$ 往左移 1 位，但因為 $x$ 為 `uint16_t`，因此作用相當於在 LSB 補一 0，而 MSB 捨去 1 個 bit - 像是 `1100 0000 0000 0000` 的下一次 iteration 就變成 `1000 0000 0000 0000` 接著看到 [測驗-4](https://hackmd.io/@sysprog/linux2023-quiz2#%E6%B8%AC%E9%A9%97-4) 老師給出的 pattern 中，全部的數值的 binary format 其開頭皆為連續的 1-bit，而之後的 bits 皆為 0 這就是我們需要的 pattern 要如何精簡以上程式碼，老實說我真想不到，因此只能看圖說故事 ```c bool is_pattern(uint16_t x) { const uint16_t n = -x; return (n ^ x) < x; } ``` 首先，-x 會被先後操作 NOT x 跟 +1，也就是 x 的 2 補數，在這個情況下，`x` 中的最靠 LSB 側的 1-bit 會維持 set，該 bit 左側皆會被作 NOT，如： - x = `1111 1100 0000 0000` , -x = `0000 0100 0000 0000` - x = `1111 0010 0010 0000` , -x = `0000 1101 1110 0000` - x = `0000 0010 0000 0000` , -x = `1111 1110 0000 0000` 這邊發現，變成 pow of two 的 n，在跟 x 作 XOR 後，the set bit 左側的 bits 都會是 1，而 the set bit 及其右側的 bits 皆為 0，因此 x 總是會比 (n ^ x) 多 1 ```c the set bit | v n = 0000 0100 0000 0000 x = 1111 1100 0000 0000 n ^ x = 1111 1000 0000 0000 ``` 同理，非 pow of two 的 n，在跟 x 作 XOR 後，the set bit 左側的 bits 都會是 1，這時後 x 總是會比 (n ^ x) 小 ```c the set bit | v n = 0000 1101 1110 0000 x = 1111 0010 0010 0000 n ^ x = 1111 1111 1100 0000 the set bit | v n = 1111 1110 0000 0000 x = 0000 0010 0000 0000 n ^ x = 1111 1100 0000 0000 ``` 因此根據以上解釋，我認為程式碼也可以修改如下：直接判斷 n 是否為 pow of two ```c int __builtin_popcount (unsigned int x); ``` > Returns the number of 1-bits in x. ```c bool is_pattern(uint16_t x) { const uint16_t n = -x; return __builtin_popcount(n) == 1; } ``` 或是確認連續 1 和連續 0 是否佔滿 16 個 bits 最後再對 0 作額外判斷 ```c bool is_pattern(uint16_t x) { return x ? (__builtin_clz(~x) + __builtin_ctz(x)) == 16 : 0; } ``` ### 在 Linux 核心原始程式碼找出上述 bitmask 及產生器在 [bits.h](https://github.com/torvalds/linux/blob/master/include/linux/bits.h#L36) 可以找到 `GENMASK` macro 的定義 ```c /* * Create a contiguous bitmask starting at bit position @l and ending at * position @h. For example * GENMASK_ULL(39, 21) gives us the 64bit vector 0x000000ffffe00000. */ #if !defined(__ASSEMBLY__) #include <linux/build_bug.h> #define GENMASK_INPUT_CHECK(h, l) \ (BUILD_BUG_ON_ZERO(__builtin_choose_expr( \ __is_constexpr((l) > (h)), (l) > (h), 0))) #else /* * BUILD_BUG_ON_ZERO is not available in h files included from asm files, * disable the input check if that is the case. */ #define GENMASK_INPUT_CHECK(h, l) 0 #endif ``` 假設 BITS_PER_LONG = 32，則 GENMASK(6, 4) 運作如下 1. 首先利用 `GENMASK_INPUT_CHECK` 檢查 `h` 和 `l` 是否合法 (`h` $\ge$ `l`) 2. ~UL(0) = `1111 1111 1111 1111 1111 1111 1111 1111` 3. UL(1) << 4 = `0000 0000 0000 0000 0000 0000 0001 0000` 4. ~UL(0) - (UL(1) << 4) + 1 = `1111 1111 1111 1111 1111 1111 1111 0000` 5. ~UL(0) >> (BITS_PER_LONG - 1 - 6) = `0000 0000 0000 0000 0000 0000 0111 1111` 6. 最後，將 4 和 5 的結果做 AND $\Rightarrow$ `0000 0000 0000 0000 0000 0000 0111 0000` ```c #define __GENMASK(h, l) \ (((~UL(0)) - (UL(1) << (l)) + 1) & \ (~UL(0) >> (BITS_PER_LONG - 1 - (h)))) #define GENMASK(h, l) \ (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) ```