2023q1 Homework2 (quiz2)

# 2023q1 Homework2 (quiz2) contributed by < [seasonwang0905](https://github.com/seasonwang0905) > > [測驗題目](https://hackmd.io/@sysprog/linux2023-quiz2) ## 測驗 `1` ### 解釋程式碼運作原理根據題目，將 `next_pow2` 改寫為以下程式碼 ```c uint64_t next_pow2(uint64_t x) { x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> AAAA; x |= x >> 16; x |= x >> BBBB; return CCCC; } ``` 此題答案為 * `AAAA` = `8` * `BBBB` = `32` * `CCCC` = `x + 1` 假設輸入值 `x` 為 51 ，則 `next_pow2` 的回傳值為無號數 64 ，這裡用 16 bits 闡述其原理: 1. `x |= x >> 1` 表 **`x` 右移 1 個 bit ，再和 `x` 做 `｜` 位元運算並賦值給 `x`** ```c 0000 0000 0011 0011 | 0000 0000 0001 1001 = 0000 0000 0011 1011 // x = 0000 0000 0011 1011 ``` 2. 做七次後可得 `x = 0000 0000 0011 1111` ，執行到 `x |= x >> 8` 時，因為 `x` 右移 8 bits 會得到 `0000 0000 0000 0000` ，故其結果仍為 `x = 0000 0000 0011 1111` ```c 0000 0000 0011 1111 | 0000 0000 0000 0000 = 0000 0000 0011 1111 // x = 0000 0000 0011 1111 ``` 3. 從第 2 步可得知，**當位移超過其最高位 bit 時，經 `|=` 運算後都會得到原本的值** ```c x |= x >> 16; // x = 0000 0000 0011 1111 x |= x >> 32; // x = 0000 0000 0011 1111 ``` 4. 最後計算 `x + 1` 並以 `uint64_t` 回傳，結果為 `next_pow2(51) = 64` ```c 0000 0000 0011 1111 + 0000 0000 0000 0001 = 0000 0000 0100 0000 // (uint64_t) 0b0000000001000000 = 64 ``` ### 使用 __builtin_clzl 改寫程式碼 * [__builtin_clzl](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 上述 `next_pow2` 函式可以使用 `__builtin_clzl` 簡化，關於此函式的描述如下 > `int __builtin_clz (unsigned int x)` Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. > `int __builtin_clzl (unsigned long)` Similar to __builtin_clz, except the argument type is unsigned long. 根據描述，可以設計以下程式碼 :::spoiler 實際程式碼 ```c uint64_t next_pow2(uint64_t x) { if (!x) return 1; int n = __builtin_clzl(x); return x >> (63 - n) << (64 - n); } ``` ::: * 當 `x` 為 0 時，回傳 `1` * `int n` 用來紀錄 `__builtin_clzl` 回傳的值 * 將 `x` 右移 `(63 - n)` 個位元，保留 leading 1，再左移 `(64 - n)` 個位元，即進位 :::warning 後來發現題目要求回傳**大於或等於** 2 的冪的值，故將程式碼更改為以下 ::: :::spoiler 更新程式碼 ```diff uint64_t next_pow2(uint64_t x) { if (!x) return 1; int n = __builtin_clzl(x); + int k = __builtin_ctzl(x); + if (n + k == 63) + return x; return x >> (63 - n) << (64 - n); } ``` ::: [__builtin_ctzl](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 可以回傳 leading 1 後面的 0 位元之數量，當 `if` 判斷 `x` 是 2 次冪的值時，則回傳 `x` 本身 ### 在 Linux 核心原始程式碼找出類似的使用案例並解釋 * 使用 [CodeBrowser](https://codebrowser.dev/linux/linux/lib/bitmap.c.html#19src) 可幫助理解 linux kernel 的內容 [bitmap](https://zh.wikipedia.org/zh-tw/BMP) 是一種由一系列 bits 所組成的陣列，在 linux kernel 巨量的程式碼之中，時常可以看到 bitmap 的身影，其常被應用在 [VMM (Virtual Memory Manager)](https://en.wikipedia.org/wiki/Virtual_Machine_Manager) 和追蹤硬體資源在 [linux/blob/master/lib/bitmap.c](https://github.com/torvalds/linux/blob/master/lib/bitmap.c) 中可以找到 `__bitmap_shift_right` 的程式碼 ```c /** * __bitmap_shift_right - logical right shift of the bits in a bitmap * @dst : destination bitmap * @src : source bitmap * @shift : shift by this many bits * @nbits : bitmap size, in bits * * Shifting right (dividing) means moving bits in the MS -> LS bit * direction. Zeros are fed into the vacated MS positions and the * LS bits shifted off the bottom are lost. */ void __bitmap_shift_right(unsigned long *dst, const unsigned long *src, unsigned shift, unsigned nbits) { unsigned k, lim = BITS_TO_LONGS(nbits); unsigned off = shift/BITS_PER_LONG, rem = shift % BITS_PER_LONG; unsigned long mask = BITMAP_LAST_WORD_MASK(nbits); for (k = 0; off + k < lim; ++k) { unsigned long upper, lower; /* * If shift is not word aligned, take lower rem bits of * word above and make them the top rem bits of result. */ if (!rem || off + k + 1 >= lim) upper = 0; else { upper = src[off + k + 1]; if (off + k + 1 == lim - 1) upper &= mask; upper <<= (BITS_PER_LONG - rem); } lower = src[off + k]; if (off + k == lim - 1) lower &= mask; lower >>= rem; dst[k] = lower | upper; } if (off) memset(&dst[lim - off], 0, off*sizeof(unsigned long)); } ``` 根據程式碼，推測 `__bitmap_shift_right` 的功能為: * 以 `src` 表示欲進行右位移的 bitmap ， `shift` 為每次右位移的數量 * 當 `off + k` 仍小於總位元數量 `lim` 時，進入 `for` 迴圈。使用 `upper` 以及 `lower` 變數分別儲存位移後的上半段和下半段位元 * 使用 `dst[k]` 紀錄位移後的 bitmap 遮罩 `mask` 的目的為確認是否到 bitmap 的尾端了， `memset` 的功用則是當 `shift` 的大小超過 bitmap 時，在 `dst` 的後端補上 0 位元 ### 編譯器是否能產生對應的 x86 指令輸入命令 `cc -O2 -std=c99 -S test.c` ，產生的 `test.s` 包含下列指令 ``` next_pow2: .LFB14: .cfi_startproc endbr64 movl $1, %eax testq %rdi, %rdi je .L1 bsrq %rdi, %rax movl $63, %ecx xorq $63, %rax subl %eax, %ecx shrq %cl, %rdi movl $64, %ecx subl %eax, %ecx movq %rdi, %rax salq %cl, %rax ``` 檢查有 `bsrq` 指令，所以編譯器可以產生對應的 x86 指令 --- ## 測驗 `2` ### 解釋程式碼運作原理根據題目完成實作 ```c int concatenatedBinary(int n) { const int M = 1e9 + 7; int len = 0; /* the bit length to be shifted */ /* use long here as it potentially could overflow for int */ long ans = 0; for (int i = 1; i <= n; i++) { /* removing the rightmost set bit * e.g. 100100 -> 100000 * 000001 -> 000000 * 000000 -> 000000 * after removal, if it is 0, then it means it is power of 2 * as all power of 2 only contains 1 set bit * if it is power of 2, we increase the bit length */ if (!(DDDD)) len++; ans = (i | (EEEE)) % M; } return ans; } ``` 此題答案為 * `DDDD` = `i & (i - 1)` * `EEEE` = `ans << len` `if` 判斷式的作用為: 1. 在 `i = 1` 時，讓 `len` 增加為 1 2. 檢查 `i` 值是否為 2 次冪，如果是，則 `len` 增加 1 以二進位而言，每當數值恰為 2 次冪時，因為多一個位元，需要位移及串接的長度便會增加， `% M` 是為了防止整數產生 overflow 的情形 :::info [$mod\ \ {10^9+7}$](https://www.geeksforgeeks.org/modulo-1097-1000000007/) 是普遍用來防止整數產生 overflow 的方式，其可以保證可能導致 overflow 的數值輸入到程式中時，它們仍然會在有效的數值範圍中被處理，像這樣的運算方式稱為[「取模運算」(modulo operation)](https://zh.wikipedia.org/zh-tw/%E6%A8%A1%E9%99%A4) ， C 語言中的模數須符合兩個條件: 1. 模數必須在整數數值範圍之內，而且足夠大 2. 模數必須是質數，取模運算後的結果更加離散，即結果不易重複，這可以提高演算法的正確性和可靠性 1000000007 符合上述兩個條件，故常被用來當作模數 ::: 以上述程式碼舉個例子，假設輸入 `n` 等於 3，已知數值不會 overflow ，故可忽略 `% M` 1. `i = 1` ，進入迴圈後遇 `if` 判斷式，判斷式為 `true` ，執行 `len++` ```c 0000 0001 & 0000 0000 = 0000 0000 // i = 0000 0001, i - 1 = 0000 0000 // if (!0) -> execute statement ``` 2. `ans` 左移 1 格再和 `i` 做 `|` 運算，最後賦值給 `ans` ```c 0000 0001 | 0000 0000 = 0000 0001 // ans = 0000 0001 ``` 3. `i = 2` ，`if` 判斷式為 `true` ，執行 `len++` ```c 0000 0010 & 0000 0001 = 0000 0000 // i = 0000 0010, i - 1 = 0000 0001 // if (!0) -> execute statement ``` 4. `ans` 左移 2 格再和 `i` 做 `|` 運算，最後賦值給 `ans` ```c 0000 0010 | 0000 0100 = 0000 0110 // ans = 0000 0110 ``` 5. `i = 3` ，`if` 判斷式為 `false` ```c 0000 0011 & 0000 0010 = 0000 0010 // i = 0000 0011, i - 1 = 0000 0010 // if (!2) -> skip ``` 6. 同 4. ```c 0000 0011 | 0001 1000 = 0001 1011 // ans = 0001 1011 ``` 將 `ans` 換算成十進位可得 `ans = 27` ### 嘗試使用 `__builtin_{clz,ctz,ffs}` 改寫 * [`__builtin_ffs`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) > Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero. 根據描述，`__builtin_ffs` 可以回傳第一個設置的位元之位置，第一個設置的位元指的是從右往第一個為 1 的位元，例如 * 8 在二進位為 `0b1000` ，`__builtin_ffs(8)` 會回傳整數 4 * 3 在二進位為 `0b0011` ，`__builtin_ffs(3)` 則會回傳整數 1 將 `concatenatedBinary` 用 `__builtin_clz`, `__builtin_ctz`, `__builtin_ffs` 改寫，假設先忽略模數 `M` ，即不考慮 overflow ，可以設計以下程式碼 :::spoiler 實際程式碼 ```c int concatenatedBinary(int n) { int len = 0; long ans = 0; for (int i = 1; i <= n; i++) { if (__builtin_clz(i) + __builtin_ctz(i) == 31) len = __builtin_ffs(i); ans = (i | (ans << len)); } return ans; } ``` ::: * `if (__builtin_clz(i) + __builtin_ctz(i) == 31)` 判斷 `i` 是否為 2 次冪，若是，則 `i` 只會有一個 1 位元，再用 `__builtin_ffs(i)` 回傳第一個設置的位元之位置，即 `len` * `ans` 的算法與之前相同測試結果正確無誤，接著加入模數 `M` 並改進取模運算 ### 改進取模運算考慮取模運算的作用，我們需要在可能發生 overflow 地方加上 `% M` :::spoiler 實際程式碼 ```c int concatenatedBinary(int n) { const int M = 1e9 + 7; int len = 0; long ans = 0; for (int i = 1; i <= n; i++) { if (__builtin_clz(i) + __builtin_ctz(i) == 31) len = __builtin_ffs(i); ans = (i | (ans << len) % M); } return ans; } ``` ::: * `n` 和 `i` 在函式內不會發生 overflow ，故不用對其做取模運算 * 若 `ans` 本身夠大，左移時便可能產生 overflow ，故在 `ans << len` 後方加上 `% M` * `(i | (ans << len) % M)` 不會發生 overflow ，後方不需要加上 `% M` 更改 `% M` 的位置後，發現其結果與 [測驗 `2`](https://hackmd.io/@sysprog/linux2023-quiz2#%E6%B8%AC%E9%A9%97-2) 測試資料 3 的結果略有不同測試資料 3: ```c // n = 12 ans = 505379714 ``` 更改後: ```c // n = 12 ans = 505379534 ``` 從結果推測，在測試資料 3 中， `ans` 左移 `len` 時有發生過 overflow 而導致輸出不同 --- ## 測驗 `3` ### 解釋程式碼運作原理測驗題目的程式碼有誤，編譯器編不過，所以這裡先修改成可以運行的程式碼 ```diff size_t swar_count_utf8(const char *buf, size_t len) { const uint64_t *qword = (const uint64_t *) buf; - const uint64_t *end = qword + len >> 3; + const usin64_t *end = qword + (len >> 3); size_t count = 0; for (; qword != end; qword++) { const uint64_t t0 = *qword; const uint64_t t1 = ~t0; const uint64_t t2 = t1 & 0x04040404040404040llu; const uint64_t t3 = t2 + t2; const uint64_t t4 = t0 & t3; count += __builtin_popcountll(t4); } count = (1 << AAAA) * (len / 8) - count; count += (len & BBBB) ? count_utf8((const char *) end, len & CCCC) : DDDD; return count; } ``` 作上述改動後即可運行此題可寫為 * `AAAA` = `3` * `BBBB` = `7` * `CCCC` = `7` * `DDDD` = `0` 將每一行程式碼拆解來看: 1. `len` 為傳入字串 `buf` 的長度，單位為 bytes ，傳進的字串以 64 位元來處理，建立 `qword` 指向 `buf` 的位址， `end` 則指向 `qword[len >> 3]` 之位址。 `len >> 3` 相當於 `len / 8` 2. `for` 迴圈內為計算 `qword` 到 `end` 之間，包含 `end` 指向的 `uint64_t` 中，**不為 UTF-8 的 bytes 數**，演算法配合 [`__builtin_popcountll`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 來計算個數並以 `count` 紀錄 3. `(1 << 3) * (len / 8)` 會得到 8 的整數倍，扣掉 `count` 後即為目前算得的 UTF-8 的個數 4. `len & 7` 在判斷式中可視為 `len % 8` 的反義，若 `len` 為 8 的倍數，則 `len & 7` 為 `false` ，執行 `count += 0`，最後回傳的 `count` 即為 `buf` 中 UTF-8 之總數 5. 若 `len` 不為 8 的倍數，則 `len & 7` 為 `true` ，進入函式 `count_utf8` 以計算剩餘的 bytes 是否含有 UTF-8 ，並加上此函式所得的 `count` ，最後回傳的 `count` 即為 `buf` 中 UTF-8 之總數。 ### 比較 SWAR 和原本的實作效能落差 ### --- ## 測驗 `4` ### 解釋程式碼運作原理根據題目，下列程式碼是用來判斷特定形式的 `uint16_t` ```c bool is_pattern(uint16_t x) { if (!x) return 0; for (; x > 0; x <<= 1) { if (!(x & 0x8000)) return false; } return true; } ``` 經過輸出，特定的形式為以下幾者 ``` pattern: 8000 (32768) pattern: c000 (49152) pattern: e000 (57344) pattern: f000 (61440) pattern: f800 (63488) pattern: fc00 (64512) pattern: fe00 (65024) pattern: ff00 (65280) pattern: ff80 (65408) pattern: ffc0 (65472) pattern: ffe0 (65504) pattern: fff0 (65520) pattern: fff8 (65528) pattern: fffc (65532) pattern: fffe (65534) pattern: ffff (65535) ``` 觀察輸出的形式可知，若換算成二進位，從最高位到最低位皆為緊密排列的 1 位元，如 `0xc000` = `0b1100 0000 0000 0000` ，藉由這個觀察結果，我們可以依據上述觀察結果，將程式碼改寫為 ```c bool is_pattern(uint16_t x) { const uint16_t n = EEEE; return (n ^ x) < FFFF; } ``` 此題可寫入 * `EEEE` = `~x + 1` * `FFFF` = `x` 假設 `x` 為輸入的特定形式，則 1. `n` 取 `x` 之二補數再加 1 ，**此時 `n` 的最高位必定是 `x` 的第一位設置的位元** ```c // if x = 0xc000 1100 0000 0000 0000 // x 0100 0000 0000 0000 // ~x + 1 0100 0000 0000 0000 // n ``` 2. `n` 與 `x` 做 `^` (XOR) 運算，**其結果必小於 `x`** ```c 1000 0000 0000 0000 // n ^ x // (n ^ x) < x returns true ``` ### Linux 原始核心碼 : `bitmap_fill` 原本想寫 [Data Structures in the Linux Kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-3.html) 裡提到的 `bitmap_fill` ，但後來在 Linux Kernel 搜尋時，發現在這段原始碼在 [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) 中改版了，這裡節錄改版的資訊 > **include/linux/bitmap.h: make bitmap_fill() and bitmap_zero() consistent** > Behaviour of bitmap_fill() differs from bitmap_zero() in a way how bits behind bitmap are handed. bitmap_zero() clears entire bitmap by unsigned long boundary, while bitmap_fill() mimics bitmap_set(). > Here we change bitmap_fill() behaviour to be consistent with bitmap_zero() and add a note to documentation. > The change might reveal some bugs in the code where unused bits are handled differently and in such cases bitmap_set() has to be used. 上述提到這次的改版主要是讓 `bitmap_fill` 和 `bitmap_zero` 的行為保持一致性，否則在使用上可能會有 bug，以下為改版前後的程式碼 ```diff= static inline void bitmap_fill(unsigned long *dst, unsigned int nbits) { - unsigned int nlongs = BITS_TO_LONGS(nbits); - if (!small_const_nbits(nbits)) { - unsigned int len = (nlongs - 1) * sizeof(unsigned long); - memset(dst, 0xff, len); + if (small_const_nbits(nbits)) + *dst = ~0UL; + else { + unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); + memset(dst, 0xff, len); } - dst[nlongs - 1] = BITMAP_LAST_WORD_MASK(nbits); } ``` 可以看到兩版本的 `if` 判斷式中的描述互為反義，根據 [`small_const_nbits`](https://github.com/torvalds/linux/blob/master/include/asm-generic/bitsperlong.h) 之定義，若 $1\ <=nbits\ <=\ 8$ 時，則 `small_const_nbits` 回傳 `true` 假設系統為 64 位元，在舊版陳述句中，若 `nbits` 不在前述範圍內，則填 `nlongs - 1` * 8 的長度的 `0xff` 給 `dst` 陣列，並且在最後的 9 個位元做了 mask (即第 13 行)，而新版則因為一致性問題，捨去了 `BITMAP_LAST_WORD_MASK` ，直接將 `dst` 填滿 `~0UL` 或是 `0xff` 附上各巨集的定義以及來源 :::spoiler 巨集 > [BITS_TO_LONGS](https://github.com/torvalds/linux/blob/master/include/linux/bitops.h) ```c #define BITS_TO_LONGS(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(long)) ``` > [__KERNEL_DIV_ROUND_UP](https://github.com/torvalds/linux/blob/master/include/uapi/linux/const.h) ```c #define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d)) ``` > [small_const_nbits](https://github.com/torvalds/linux/blob/master/include/asm-generic/bitsperlong.h) ```c /* * small_const_nbits(n) is true precisely when it is known at compile-time * that BITMAP_SIZE(n) is 1, i.e. 1 <= n <= BITS_PER_LONG. This allows * various bit/bitmap APIs to provide a fast inline implementation. Bitmaps * of size 0 are very rare, and a compile-time-known-size 0 is most likely * a sign of error. They will be handled correctly by the bit/bitmap APIs, * but using the out-of-line functions, so that the inline implementations * can unconditionally dereference the pointer(s). */ #define small_const_nbits(nbits) \ (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0) ``` > [BITMAP_LAST_WORD_MASK](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) ```c #define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1))) ``` :::