2023q1 Homework2 (quiz2)

--- tags: linux kernel, jserv --- # 2023q1 Homework2 (quiz2) contributed by < [`WangHanChi`](https://github.com/WangHanChi) > > [作業要求](https://hackmd.io/@sysprog/linux2023-quiz2) :::info - 測驗 1 - [x] 解釋上述程式碼原理，並用 __builtin_clzl 改寫 - [x] 在 Linux 核心原始程式碼找出類似的使用案例並解釋 - [x] 當上述 clz 內建函式已運用時，編譯器能否產生對應的 x86 指令？ - 測驗 2 - [x] 解釋上述程式碼運作原理 - [x] 嘗試使用 __builtin_{clz,ctz,ffs} 改寫，並改進 mod $10^9+7$ 的運算 - 測驗 3 - [x] 解釋上述程式碼運作原理，比較 SWAR 和原本的實作效能落差 - [ ] 在 Linux 核心原始程式碼找出 UTF-8 和 Unicode 相關字串處理的程式碼，探討其原理，並指出可能的改進空間 - 測驗 4 - [x] 解釋上述程式碼運作原理 - [ ] 在 Linux 核心原始程式碼找出上述 bitmask 及產生器，探討應用範疇 ::: ## 測驗 `1` ### 解釋上述程式碼運作原理題目為給定無號 64 位元數值 `x`，找出最接近且大於等於 2 的冪的值以下為一種實作的方法 ```c #include <stdint.h> static inline uint64_t pow2(uint8_t e) { return ((uint64_t)1) << e; } uint64_t next_pow2(uint64_t x) { uint8_t lo = 0, hi = 63; while (lo < hi) { uint8_t test = (lo + hi)/2; if (x < pow2(test)) { hi = test; } else if (pow2(test) < x) { lo = test+1; } else { return pow2(test); } } return pow2(lo); } ``` 但是為了減少 [branch penalty](https://stackoverflow.com/questions/56412615/what-does-it-mean-by-a-branch-penalty) ，所以我們更傾向使用 branchless 的版本也就是以下 ```c uint64_t next_pow2(uint64_t x) { x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> AAAA; x |= x >> 16; x |= x >> BBBB; return CCCC; } ``` 可以發現程式碼變得更精簡且沒有分支了，那 `AAAA`, `BBBB` 以及 `CCCC` 又是什麼呢？可以從程式的功能來看，輸入 x ，輸出最靠近且大於等於 2 的冪的值，也就是說輸出必定為 `0b0100000` 這類的答案，也就是二進制的表示式中只會有一個 1。接著看到上方的程式碼， `x |= x >> 1;` 它將小於他的位元都變成 1 。以 64 為例子 (預期輸出 128 ) ``` // x = 64 (0100 0000) x = (0100 0000) | (0010 0000) // x = 0110 0000 x = (0110 0000) | (0011 0000) // x = 0111 0000 x = (0111 0000) | (0011 1000) ... // x = 0111 1110 x = (0111 1110) | (0011 1111) // x = 0111 1111 ``` 而 0111 1111 與 128 (1111 1111) 只差了 1 ，因此可以知道 `CCCC` 為 `x + 1` 為了滿足 `x + 1` 為2的冪，於是必須將所的位元都變成 1 ，因此就要重複 `x |= x >> 1` 這個操作多次。而將 `x |= x >> 1` 重複 8 次後，可以保證最高位數的 8 個位元都會是1，再來需要將接下來的 8 位元都設為 1 ，所以需要執行一次 `x |= x >> 8;` ，這邊比較奇怪的是好像題目少了這個步驟就直接執行了 `x |= x >> 16;` 這樣子會在輸入 x 小於 512 時輸出正確的答案，但是只要超過 512，就會輸出錯誤的答案了。 :::info 根據上述，我認為題目少了一行 `x |= x >> 8;` 完整[程式碼](https://github.com/WangHanChi/linux_kernel_quiz/blob/main/quiz2/problem1/pow2.c)應該如下 ```diff uint64_t next_pow2(uint64_t x) { x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; + x |= x >> 8; x |= x >> 16; x |= x >> 32; return x + 1; } ``` ::: ### 用 __builtin_clzl 改寫可以從 [__builtin_clzl](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 這個網站中看到關於這個函式的內容 >Built-in Function: **int __builtin_clz** (unsigned int x) Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. >Built-in Function: int **__builtin_clzl** (unsigned long) Similar to __builtin_clz, except the argument type is unsigned long. 以 `__builtin_clzl ` 為例，輸入一個 unsigned long 型別的數字 x ，它會回傳的型別為 **int** ，而內容就是從 x 的最高位 (MSB) 往最低位 (LSB) 的方向數共有幾的 0 ，如果在數的過程中遇到 1 ，就直接回傳數字。可以用以下程式碼測試 ```c #include <stdio.h> int main(int argc, char **argv) { uint64_t a = 1025; printf("%d\n", 63 - __builtin_clzl(a)); return 0; } ``` 它會回傳從 MSB 開始第一個不為 0 的位數，例如以 1025 為例，可以看到輸出為 10 ```shell $ make builtin_test gcc -O1 -c -o builtin_pow2.o builtin_pow2.c gcc builtin_pow2.c -o builtin_pow2.o ./builtin_pow2.o 10 ``` 接著將其應用在 `next_pow2` 上，於是可以寫出以下[程式碼](https://github.com/WangHanChi/linux_kernel_quiz/blob/main/quiz2/problem1/builtin_test.c) ```c #include <stdint.h> #include <stdio.h> static inline uint64_t pow2(uint8_t e) { return ((uint64_t)1) << e; } uint64_t builtin_pow2(uint64_t x) { if(x == 0) return 0; return pow2(63 - __builtin_clzl(x) + 1); } int main(int argc, char **argv) { uint64_t a = 513; printf("%ld\n", builtin_pow2(a)); return 0; } ``` 可以看到輸出與 `next_pow2` 一致 ```shell $ make gcc -O1 -c -o builtin_pow2.o builtin_pow2.c gcc builtin_pow2.c -o builtin_pow2.o ./builtin_pow2.o 1024 ``` 稍微簡單寫了一個[測試程式](https://github.com/WangHanChi/linux_kernel_quiz/blob/main/quiz2/problem1/compare.c)來檢測兩者的輸出是否為相同的，發現執行結果也是一樣的！ :::spoiler 測試程式 ```c #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <time.h> static inline uint64_t pow2(uint8_t e) { return ((uint64_t)1) << e; } uint64_t next_pow2(uint64_t x) { x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 8; x |= x >> 16; x |= x >> 32; return x + 1; } uint64_t builtin_pow2(uint64_t x) { if(x == 0) return 0; return pow2(63 - __builtin_clzl(x) + 1); } int main(int argc, char ** argv) { srand( time(NULL) ); int num = 0; for(int i = 0; i < 50000; ++i){ uint64_t x = rand(); (next_pow2(x) - builtin_pow2(x)) ? :num++; } num ? printf("PASS!\n") : printf("FAIL\n"); return 0; } ``` ::: ```shell $ make compare gcc -O1 -c -o compare.o compare.c gcc compare.c -o compare.o ./compare.o PASS! ``` ### 在 Linux 核心原始程式碼找出類似的使用案例並解釋我目前發現有使用到 `pow2(x)` 這個類的函式 - 在[linux/arch/ia64/mm/init.c](https://github.com/torvalds/linux/blob/master/arch/ia64/mm/init.c#L331) 的第331行在以下程式碼可以看到第 331 行先定義了 `pow2` 這個巨集，然後他在第 351 行使用到它，主要的用途就是設定 virtual memory 的區域。 ```c=331 # define POW2(n) (1ULL << (n)) impl_va_bits = ffz(~(local_cpu_data->unimpl_va_mask | (7UL << 61))); if (impl_va_bits < 51 || impl_va_bits > 61) panic("CPU has bogus IMPL_VA_MSB value of %lu!\n", impl_va_bits - 1); /* * mapped_space_bits - PAGE_SHIFT is the total number of ptes we need, * which must fit into "vmlpt_bits - pte_bits" slots. Second half of * the test makes sure that our mapped space doesn't overlap the * unimplemented hole in the middle of the region. */ if ((mapped_space_bits - PAGE_SHIFT > vmlpt_bits - pte_bits) || (mapped_space_bits > impl_va_bits - 1)) panic("Cannot build a big enough virtual-linear page table" " to cover mapped address space.\n" " Try using a smaller page size.\n"); /* place the VMLPT at the end of each page-table mapped region: */ pta = POW2(61) - POW2(vmlpt_bits); ``` ### 編譯器能否產生對應的 x86 指令？使用 ``` $ objdump -D builtin_pow2.o > builtin_pow2.dumptxt ``` 來產生 objdump 的反組譯，並且將結果儲存在 `builtin_pow2.dumptxt` 可以看到在 X86-64 的環境下所產生的組合語言如下 ``` 0000000000001165 <builtin_pow2>: 1165: f3 0f 1e fa endbr64 1169: 55 push %rbp 116a: 48 89 e5 mov %rsp,%rbp 116d: 48 83 ec 08 sub $0x8,%rsp 1171: 48 89 7d f8 mov %rdi,-0x8(%rbp) 1175: 48 83 7d f8 00 cmpq $0x0,-0x8(%rbp) 117a: 75 07 jne 1183 <builtin_pow2+0x1e> 117c: b8 00 00 00 00 mov $0x0,%eax 1181: eb 1c jmp 119f <builtin_pow2+0x3a> 1183: 48 0f bd 45 f8 bsr -0x8(%rbp),%rax 1188: 48 83 f0 3f xor $0x3f,%rax 118c: 89 c2 mov %eax,%edx 118e: b8 40 00 00 00 mov $0x40,%eax 1193: 29 d0 sub %edx,%eax 1195: 0f b6 c0 movzbl %al,%eax 1198: 89 c7 mov %eax,%edi 119a: e8 aa ff ff ff call 1149 <pow2> 119f: c9 leave 11a0: c3 ret 00000000000011a1 <main>: 11a1: f3 0f 1e fa endbr64 11a5: 55 push %rbp 11a6: 48 89 e5 mov %rsp,%rbp 11a9: 48 83 ec 20 sub $0x20,%rsp 11ad: 89 7d ec mov %edi,-0x14(%rbp) 11b0: 48 89 75 e0 mov %rsi,-0x20(%rbp) 11b4: 48 c7 45 f8 21 00 00 movq $0x21,-0x8(%rbp) 11bb: 00 11bc: 48 8b 45 f8 mov -0x8(%rbp),%rax 11c0: 48 89 c7 mov %rax,%rdi 11c3: e8 9d ff ff ff call 1165 <builtin_pow2> 11c8: 48 89 c6 mov %rax,%rsi 11cb: 48 8d 05 32 0e 00 00 lea 0xe32(%rip),%rax # 2004 <_IO_stdin_used+0x4> 11d2: 48 89 c7 mov %rax,%rdi 11d5: b8 00 00 00 00 mov $0x0,%eax 11da: e8 71 fe ff ff call 1050 <printf@plt> 11df: b8 00 00 00 00 mov $0x0,%eax 11e4: c9 leave 11e5: c3 ret ``` 可以注意到 `bsr` 這個指令，根據 [BSR — Bit Scan Reverse](https://www.felixcloutier.com/x86/bsr.html) 這裡所提供的資訊可知，它將在第二個引數中搜索 MSB 設置為 1 的數字，並且把他的索引 (第幾個位元為 1 ) 的資訊紀錄在第一個引數中，不難發現這個操作就像是跟 `__builtin_clz(x)` 一樣。因此，可以判斷在 X86-64 的組合語言中 `bsr` 與 GCC 的內建函數 `__builtin_clz` 的功能是一樣的。 --- ## 測驗 `2` ```c int concatenatedBinary(int n) { const int M = 1e9 + 7; int len = 0; /* the bit length to be shifted */ /* use long here as it potentially could overflow for int */ long ans = 0; for (int i = 1; i <= n; i++) { /* removing the rightmost set bit * e.g. 100100 -> 100000 * 000001 -> 000000 * 000000 -> 000000 * after removal, if it is 0, then it means it is power of 2 * as all power of 2 only contains 1 set bit * if it is power of 2, we increase the bit length */ if (!(DDDD)) len++; ans = (i | (EEEE)) % M; } return ans; } ``` 可以從 `if it is power of 2, we increase the bit length` 這段敘述得知 `DDDD` 的條件判斷應該是判斷是否為 2 的冪。要檢驗是否為 2 的冪的方式很簡單，只需要使用 `i & (i - 1)` 這樣的位元操作就可以確定。再來是要將 ans 進行偏移，而方法就是看當前的數字的二進制有幾個位元就向左偏移幾個，所以 `EEEE` 會是 `ans << len` 。 ### 使用 `__builtin_{clz,ctz,ffs}` 改寫 ```c int concatenatedBinary_new(int n) { const int M = 1e9 + 7; int len = 0; /* the bit length to be shifted */ /* use long here as it potentially could overflow for int */ long ans = 0; for (int i = 1; i <= n; i++) { len = 32 - __builtin_clz(i); ans = ((ans << len) % M + i) % M; } return ans; } ``` :::info TODO 改進 mod $10^9 +7$ 的運算 ::: --- ## 測驗 `3` ```c size_t swar_count_utf8(const char *buf, size_t len) { const uint64_t *qword = (const uint64_t *) buf; const uint64_t *end = qword + (len >> 3); size_t count = 0; for (; qword != end; qword++) { const uint64_t t0 = *qword; const uint64_t t1 = ~t0; const uint64_t t2 = t1 & 0x04040404040404040llu; const uint64_t t3 = t2 + t2; const uint64_t t4 = t0 & t3; count += __builtin_popcountll(t4); } count = (1 << AAAA) * (len / 8) - count; count += (len & BBBB) ? count_utf8((const char *) end, len & CCCC) : DDDD; return count; } ``` 由例子來講解程式碼，假如輸入 `char *buf = "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH", len = 32`, 可以知道 end = qword + 8。接下來進行 for-loop ，來計算到底有幾個 1。接著由老師在題目中的敘述 >若輸入的字串是一個有效的 UTF-8 序列，則計算其中的後續位元組數量，並將此數字從總字串長度中減去，即可確定字元數量。可以知道 (1 << 3) 後在乘 (len / 8) 就會是字串原本的長度，其實也就是 len ，因此，這裡的 AAAA 就是要填 `3` ，但是也有可能 len 的長度不是 8 的倍數的情況(例如 35 )，這時候若是套用的公式來看的話，就可以看到會有剩下的，所以就要考慮那些剩下的，因此 `BBBB` 就是要填入 `7` ，利用 `&` 來進行比對，若是有剩下的話，就會再進行一次 `count_utf8((const char *) end, len & CCCC)`，因此 `CCCC` 這邊也是要填入 `7` ，最後如果沒有剩餘的數字的話，就 `+=0`，也就是 `DDDD` 所要填入的。 ### 效能比較這邊測試效能的方式會是使用 perf 來進行，而測試的資料為隨機產生的字串，在藉由兩種函式進行比較 :::spoiler 測試程式碼 ```c #include <stddef.h> #include <stdint.h> #include <stdlib.h> #include <time.h> #include <string.h> #include <stdio.h> #define NUM 100 #define CHUNK_SIZE 10000000 size_t count_utf8(const char *buf, size_t len) { const int8_t *p = (const int8_t *) buf; size_t counter = 0; for (size_t i = 0; i < len; i++) { /* -65 is 0b10111111, anything larger in two-complement's should start * new code point. */ if (p[i] > -65) counter++; } return counter; } void fix_string(char *s) { memset(s, 'a', CHUNK_SIZE - 1); s[CHUNK_SIZE - 1] = '\0'; } void random_string(char *s) { for (size_t i = 0; i < CHUNK_SIZE - 1; i++) { int x = rand(); s[i] = (x % 127) + 1; } s[CHUNK_SIZE - 1] = '\0'; } size_t swar_count_utf8(const char *buf, size_t len) { const uint64_t *qword = (const uint64_t *) buf; const uint64_t *end = qword + (len >> 3); size_t count = 0; for (; qword != end; qword++) { const uint64_t t0 = *qword; const uint64_t t1 = ~t0; const uint64_t t2 = t1 & 0x04040404040404040llu; const uint64_t t3 = t2 + t2; const uint64_t t4 = t0 & t3; count += __builtin_popcountll(t4); } count = (1 << 3) * (len / 8) - count; count += (len & 7) ? count_utf8((const char *) end, len & 7) : 0; return count; } int main(int argc, char **argv) { srand( time(NULL) ); char *s = malloc(sizeof(char) * CHUNK_SIZE); for(int i = 0; i < NUM; ++i){ random_string(s); count_utf8(s, CHUNK_SIZE); // swar_count_utf8(s, CHUNK_SIZE); } free(s); printf("Finish\n"); return 0; } ``` ::: 接著把它繪製成 gnuplot ![](https://i.imgur.com/VzOVOwd.png) 可以看到使用 SWAR 的執行時間確實有比沒有使用的快，但不確定為何進步沒有很大 ==由於上面的測試似乎沒有表現出 SWAR 的強大，所以用另外一種方式來檢測效能，使用了 lab0 中的 cpucycle()來進行檢測== :::spoiler 使用 cpucycle() 檢測 ```c #include <stddef.h> #include <stdint.h> #include <stdlib.h> #include <time.h> #include <string.h> #include <stdio.h> #include "cpucycles.h" #define NUM 100 #ifdef CYCLE #define CHUNK_SIZE CYCLE #else /* DEBUG */ #define CHUNK_SIZE 80000 #endif /* DEBUG */ size_t count_utf8(const char *buf, size_t len) { const int8_t *p = (const int8_t *) buf; size_t counter = 0; for (size_t i = 0; i < len; i++) { /* -65 is 0b10111111, anything larger in two-complement's should start * new code point. */ if (p[i] > -65) counter++; } return counter; } void fix_string(char *s) { memset(s, 'a', CHUNK_SIZE - 1); s[CHUNK_SIZE - 1] = '\0'; } void random_string(char *s) { for (size_t i = 0; i < CHUNK_SIZE - 1; i++) { int x = rand(); s[i] = x /*% 127 + 1*/; } s[CHUNK_SIZE - 1] = '\0'; } size_t swar_count_utf8(const char *buf, size_t len) { const uint64_t *qword = (const uint64_t *) buf; const uint64_t *end = qword + (len >> 3); size_t count = 0; for (; qword != end; qword++) { const uint64_t t0 = *qword; const uint64_t t1 = ~t0; const uint64_t t2 = t1 & 0x04040404040404040llu; const uint64_t t3 = t2 + t2; const uint64_t t4 = t0 & t3; count += __builtin_popcountll(t4); } count = (1 << 3) * (len / 8) - count; count += (len & 7) ? count_utf8((const char *) end, len & 7) : 0; return count; } int main(int argc, char **argv) { srand( time(NULL) ); char *s = malloc(sizeof(char) * CHUNK_SIZE); random_string(s); uint64_t normal_ticks[2], swar_ticks[2]; size_t count0, count1; normal_ticks[0] = cpucycles(); count0 = count_utf8(s, CHUNK_SIZE); normal_ticks[1] = cpucycles(); swar_ticks[0] = cpucycles(); count1 = swar_count_utf8(s, CHUNK_SIZE); swar_ticks[1] = cpucycles(); printf("\n\nNormal\t method: %ld\nCounts of UTF-8 : %ld\n\n", normal_ticks[1] - normal_ticks[0], count0); printf("SWAR\t method: %ld\nCounts of UTF-8 : %ld\n", swar_ticks[1] - swar_ticks[0], count1); free(s); printf("\n\nDone !\n"); return 0; } ``` ::: 以 100000 筆資料來進行比較，可以發現使用 SWAR 確實效能提升了不少 ```shell $ make test CYCLE=100000 gcc -D CYCLE=100000 -O0 cpucycles.h performance.c -o performance.elf ./performance.elf Normal method: 1094090 Counts of UTF-8 : 74985 SWAR method: 262293 Counts of UTF-8 : 74985 Done ! ``` 接著我測試從 10000 筆資料到 100000 筆資料，並使用 gnuplot 繪製成折線圖 ![](https://i.imgur.com/uwGACpD.png) 可以看到確實 SWAR 確實使用了更少的 cpucycles 數 :::info TODO 改進效能 ::: --- ## 測驗 `4` ### 解釋上述程式碼運作原理 ```c #include <stdint.h> #include <stdbool.h> bool is_pattern(uint16_t x) { if (!x) return 0; for (; x > 0; x <<= 1) { if (!(x & 0x8000)) return false; } return true; } ``` 可以看到上面程式碼中的 `0x8000` 轉換成二進制可以表示成 `0b 1000 0000 0000 0000` ，接著檢查 `x & 0x8000` 是否為 1 。這段可以把它視為要找的數字為從 MSB 開始連續的 1 ，例如 `0b 1111 1100 0000 0000` 這類的數字。也可以從符合樣式的資料及來觀察 ``` pattern: 8000 (32768) => 1000 0000 0000 0000 pattern: c000 (49152) => 1100 0000 0000 0000 pattern: e000 (57344) => 1110 0000 0000 0000 pattern: f000 (61440) => 1111 0000 0000 0000 pattern: f800 (63488) => 1111 1000 0000 0000 ``` 接著就把它用不同的方式實作出來 ```c bool is_pattern(uint16_t x) { const uint16_t n = -x; return (n ^ x) < x; } ``` 上面的程式碼也是用二進制會比較好觀察 ``` | Soucre | x | n (也就是 -x) | | ------ | ------------------- | ------------------- | | 8000 | 1000 0000 0000 0000 | 1000 0000 0000 0000 | => ture | c000 | 1100 0000 0000 0000 | 0100 0000 0000 0000 | => ture | e000 | 1110 0000 0000 0000 | 0010 0000 0000 0000 | => ture | f000 | 1111 0000 0000 0000 | 0001 0000 0000 0000 | => ture | f0f0 | 1111 0000 1111 0000 | 0000 1111 0001 0000 | => false ``` 可以看到只要 x 從 MSB 往 LSB 方向中間的連續 1 有中斷的話，就會讓 -x 也就是 n 在斷掉的位子產生 1，如此一來只要進行 `^` 運算，就會使得 `n ^ x` 變得比 x 還要大，所以就會回傳 false 。 ### 在 Linux 核心原始程式碼找出上述 bitmask 及產生器 --- ## Reference