2023q1 Homework2 (quiz2)

--- tags: linux2023 --- # 2023q1 Homework2 (quiz2) contributed by < `fennecJ` > ### 測驗 `1` ```c uint64_t next_pow2(uint64_t x) { x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 1; x |= x >> 8; x |= x >> 16; x |= x >> 32; return x + 1; } ``` ### 原理：將 `x` 與 `x` 右移 1 位元進行或運算，將 `x` 最高位非零位元（為使說明簡潔，以下所有「最高位元」皆指「最高位非零位元」）右側的位元設為 1 並將結果存回 `x` 。重複執行上述步驟 7 次，包含最高位共會有八個位元被設為 1 。因此下一步可以直接將 `x` 與 `x` 右移 8 位元進行或運算，將 `x` 最高位元右側的 16 個位元設為 1。依此類推，分別將 `x` 與右移 16 位元以及 32 位元的 `x` 進行或運算。最多可將 `x` 最高位元右側的 64 個位元設為 1 ，由於 `x` 本身為 `uint64_t` 型別的變數，因此可以確保從 `x` 最高位到最低位的所有位元皆被設為 1 。之後將 x 加上 1 ，即可得到比 x 大的下一個 2 的冪。 ### 進一步改進：由於 `x` 與 `x` 右移 1 位元進行或運算後能使最高 2 位元被設為 1 ，因此接下來可以一次右移兩個位元。依此類推，在下一次可以一次右移 4 、 8 、 16 、 32 個位元。 ```c uint64_t next_pow2(uint64_t x) { x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; x |= x >> 32; return x + 1; } ``` ### 改以 __builtin_clzl 實做上述函式： `__builtin_clz (unsigned int x)` 的用途為： `Returns the number of leading 0-bits in x` 即返回最高位的非零位元之前有幾個位元為 0 。而 `__builtin_clzl(unsigned long x)` 則為 `__builtin_clz` 的 `unsigned long` 版。我們可以用這個函式改進 next_pow2 的實做： ```c uint64_t next_pow2(uint64_t x) { return (uint64_t) 1 << (64 - __builtin_clzl(x)); } ``` 不過上述作法可能會導致問題，關於 `__builtin_clz (unsigned int x)` 的敘述還有一段： `If x is 0, the result is undefined.` 我們可以進一步將其改為： ```c uint64_t next_pow2(uint64_t x) { if(!x) return 1; return (uint64_t) 1 << (64 - __builtin_clzl(x)); } ``` 不過這樣 code 便會產生 branch ，嘗試將其改為 branchless 的實做： ```c uint64_t next_pow2(uint64_t x) { // !x can only be either `1` or `0`. // if(x == 0), (x | !x) = 1 // else (x | !x) = x; int fix = (x | !x) - x; // fix = (x == 0) ? 1 : 0; uint64_t leads = __builtin_clzl(x | 1); leads += fix; return (uint64_t) 1 << ((uint64_t)64 - leads); } ``` 以 perf 比較有 branch 和無 branch 計算 0 ~ (1 << 32) 的 next_pow2 時的執行效能： ```c for(uint64_t u = 0; u <= (uint64_t) 1 << 32; u++){ next_pow2(u); } ``` ```shell # gcc branchLess.c $ sudo perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles,branch-misses ./a.out --- output: Performance counter stats for './a.out' (5 runs): 1,920,238 cache-misses # 89.358 % of all cache refs ( +- 18.46% ) 3,358,733 cache-references ( +- 19.68% ) 158,943,820,480 instructions # 2.80 insn per cycle ( +- 0.01% ) 56,698,618,164 cycles ( +- 0.06% ) 143,546 branch-misses ( +- 27.52% ) 14.677 +- 0.108 seconds time elapsed ( +- 0.73% ) ``` ```shell # gcc branched.c $ sudo perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles,branch-misses ./a.out --- output: Performance counter stats for './a.out' (5 runs): 373,336 cache-misses # 31.788 % of all cache refs ( +- 27.43% ) 757,761 cache-references ( +- 52.57% ) 103,092,710,628 instructions # 2.79 insn per cycle ( +- 0.01% ) 36,953,381,047 cycles ( +- 0.08% ) 55,425 branch-misses ( +- 61.77% ) 9.6003 +- 0.0237 seconds time elapsed ( +- 0.25% ) ``` 可以看到雖然 branchLess 的 branch misses 比較少（應該都是 for loop 是否結束的判斷而已），但他花的時間 (14.61s) 卻比 branched 花的時間 (9.66s) 長。看到 instructions cnt branchLess 的實做雖然有較少的 branch misses，但它每個 funct call 要執行的指令較多，共需執行 `158933887154` 個指令，相較之下，有 branch 的實做僅須執行 `103140140758` 個指令。 $$\frac{158933887154-103140140758}{103140140758} = 0.541$$ branchLess 的實做要比 branched 多 54 ％的指令要執行。 $$\frac{9.6}{14.67} = 0.65$$ branched 所花執行時間僅為 branchLess 的 65 ％。以下是兩種實做編譯後的結果： branchLess: ```= next_pow2_branchLess: push {r4, r5, r6, r7, r8, r9, r10, fp} sub sp, sp, #24 add r7, sp, #0 strd r0, [r7] ldr r0, [r7] ldr r1, [r7, #4] orrs r1, r1, r0 cmp r1, #0 ite eq moveq r1, #1 movne r1, #0 uxtb r1, r1 mov r0, r1 ldr r1, [r7] mvns r1, r1 ands r1, r1, r0 str r1, [r7, #20] ldr r1, [r7] orr r1, r1, #1 clz r1, r1 asrs r0, r1, #31 mov r10, r1 mov fp, r0 strd r10, [r7, #8] ldr r1, [r7, #20] asrs r0, r1, #31 mov r4, r1 mov r5, r0 ldrd r0, [r7, #8] adds r8, r0, r4 adc r9, r1, r5 strd r8, [r7, #8] ldr r1, [r7, #8] rsb r4, r1, #64 mov r0, #1 mov r1, #0 sub r6, r4, #32 rsb r5, r4, #32 lsl r3, r1, r4 lsl r6, r0, r6 orrs r3, r3, r6 lsr r5, r0, r5 orrs r3, r3, r5 lsl r2, r0, r4 mov r0, r2 mov r1, r3 adds r7, r7, #24 mov sp, r7 pop {r4, r5, r6, r7, r8, r9, r10, fp} bx lr ``` branched: ```= next_pow2_branched: push {r4, r5, r6, r7} sub sp, sp, #8 add r7, sp, #0 strd r0, [r7] ldrd r0, [r7] orrs r1, r1, r0 bne .L2 mov r2, #1 mov r3, #0 b .L3 .L2: ldr r1, [r7] clz r1, r1 rsb r4, r1, #64 mov r0, #1 mov r1, #0 sub r6, r4, #32 rsb r5, r4, #32 lsl r3, r1, r4 lsl r6, r0, r6 orrs r3, r3, r6 lsr r5, r0, r5 orrs r3, r3, r5 lsl r2, r0, r4 .L3: mov r0, r2 mov r1, r3 adds r7, r7, #8 mov sp, r7 pop {r4, r5, r6, r7} bx lr ``` 可以明顯看到 branched 即便在最長的分枝（2~6->L2) 下需要執行的指令數量仍顯著比 branchLess 少。題外話：原先我嘗試用 ```c fix = (1 >> x) ``` 實做上述 branchLess 的函式，想法是只有在 x 為 0 時 fix 才會是 1 ，其餘時候 fix 都會是 0 。但執行後發現在 x = 64 時結果不正確，才知道原來右移的數量 >= 1 的位寬時會造成 undefined behavior[^1] ： ```! If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined. ``` 最後附上原本以分別右移 1 、 2 、 4 、 8 、 16 、 32 位元實做 `next_pow2` 的執行時間為 (23.41s) ```shell # gcc naive.c $ sudo perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles,branch-misses ./a.out --- output: Performance counter stats for './a.out' (5 runs): 1,667,730 cache-misses # 69.119 % of all cache refs ( +- 8.41% ) 2,966,903 cache-references ( +- 7.51% ) 141,776,719,950 instructions # 1.56 insn per cycle ( +- 0.00% ) 90,925,330,702 cycles ( +- 0.01% ) 168,508 branch-misses ( +- 5.54% ) 23.7229 +- 0.0316 seconds time elapsed ( +- 0.13% ) ``` 三者比較如下： ```vega { "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "data": { "values":[ { "yname": "branchLess w/o clzl", "time" : 23.41831 }, { "yname": "branchLess w/ clzl", "time" : 14.6 }, { "yname": "branched w/ clzl", "time" : 9.66 } ] }, "height": {"step": 50}, "mark": {"type": "bar", "height": 23}, "encoding": { "y": { "field": "yname", "scale": {"padding": 0}, "title" : "" }, "x": { "field": "time", "type": "quantitative", "axis": {"grid": true}, "title": "time(s)" } } } ``` ### 在 linux 中類似的使用案例在 [include/linux/log2.h](https://elixir.bootlin.com/linux/v6.2.2/source/include/linux/log2.h#L156) 有一個 MACRO `ilog2` ： ```c= #define ilog2(n) \ ( \ __builtin_constant_p(n) ? \ ((n) < 2 ? 0 : \ 63 - __builtin_clzll(n)) : \ (sizeof(n) <= 4) ? \ __ilog2_u32(n) : \ __ilog2_u64(n) \ ) ``` `__builtin_constant_p(expr)` 的用途是判斷 expr 是否為在編譯期就能確定的常數，如果是的話就回傳 1 ，無法確定是否為常數的話就回傳 0 ```! The function returns the integer 1 if the argument is known to be a compile-time constant and 0 if it is not known to be a compile-time constant. A return of 0 does not indicate that the value is not a constant, but merely that GCC cannot prove it is a constant with the specified value of the -O option. ``` MACRO `ilog2` 做的事情便是用 63 - __builtin_clzll(n) 算出其 log2 的結果，注意上述 MACRO 中第 4 行的 `(n) < 2 ? 0 : ` ，這裡之所以把 `n < 2` 另外處理的原因是為了避免 `n = 0` 時的 undefined behavior 造成問題。 ### 編譯後是否會產生 __builtin_clz 對應的組語？以 godblod 檢視產生的組語： https://godbolt.org/z/ber7hKbYM ``` next_pow2: push rbp mov rbp, rsp mov QWORD PTR [rbp-8], rdi cmp QWORD PTR [rbp-8], 0 jne .L2 mov eax, 1 jmp .L3 .L2: bsr rax, QWORD PTR [rbp-8] xor rax, 63 mov edx, eax mov eax, 64 sub eax, edx mov edx, 1 mov ecx, eax sal rdx, cl mov rax, rdx .L3: pop rbp ret ``` 可以在 `.L2:` 看到產生了對應的 `bsr` 改以 ARM 架構進行編譯： ``` next_pow2: push {r4, r5, r6, r7} sub sp, sp, #8 add r7, sp, #0 strd r0, [r7] ldrd r0, [r7] orrs r1, r1, r0 bne .L2 mov r2, #1 mov r3, #0 b .L3 .L2: ldr r1, [r7] clz r1, r1 rsb r4, r1, #64 mov r0, #1 mov r1, #0 sub r6, r4, #32 rsb r5, r4, #32 lsl r3, r1, r4 lsl r6, r0, r6 orrs r3, r3, r6 lsr r5, r0, r5 orrs r3, r3, r5 lsl r2, r0, r4 .L3: mov r0, r2 mov r1, r3 adds r7, r7, #8 mov sp, r7 pop {r4, r5, r6, r7} bx lr ``` 可以看到產生了對應的 `clz` 而若以未使用 __builtin_clzl 實做的函式進行編譯，即便下了 -O3 flag 產生出來的組語也沒有 `bsr` 。 ### 測驗 `2` ```c int concatenatedBinary(int n) { const int M = 1e9 + 7; int len = 0; /* the bit length to be shifted */ /* use long here as it potentially could overflow for int */ long ans = 0; for (int i = 1; i <= n; i++) { /* removing the rightmost set bit * e.g. 100100 -> 100000 * 000001 -> 000000 * 000000 -> 000000 * after removal, if it is 0, then it means it is power of 2 * as all power of 2 only contains 1 set bit * if it is power of 2, we increase the bit length */ if (!(i & (i - 1))) len++; ans = (i | (ans << len)) % M; } return ans; } ``` ### 原理：在這個函式中，變數 len 用來表示在將新數字接在 ans 後面時，需要留多少位元給新數字。因此，我們需要找出一個方法來確定要留下多少位元。通過觀察下表，可以發現當 i 恰等於 2 的冪時，len 才需要增加。如下表所示： | dec | bin(shortest) |len| | ----- | ------ | ----| | 1 | 1 |1 | | 2 | 10 |**2** | | 3 | 11 |2 | | 4 | 100 |**3** | | 5 | 101 |3 | 因此，我們可以利用 !(i & (i - 1)) 來判斷當 i 等於 2 的冪時，讓 len 加 1。因為只有當 i 是 2 的冪時，!(i & (i - 1)) 才會返回 true。 ### 以 `__builtin_{clz,ctz,ffs}` 改寫可以用 `(__builtin_clz(i) + __builtin_ctz(i) == 31)` 取代 `!(i & (i - 1))` 也可以用 `__builtin_popcount(i) == 1` 取代。但這樣還是有 branch，可以直接用 `32 - __builtin_clz(i)` 取代本來的 len 改為 branchLess 的實做： ```c int concatenatedBinary(int n) { const int M = 1e9 + 7; long ans = 0; for (int i = 1; i <= n; i++) { ans = (i | (ans << (32 - __builtin_clz(i)))) % M; } return ans; } ``` ### 改善％ 1e9 + 7 運算 #### 以 barrett_reduce[^2] 改善 $$ a \% b = a − \lfloor a/b \rfloor * b $$ 其中 $$ \lfloor a/b \rfloor \approx ((\lfloor (2^{64}-1) /b \rfloor)*a) / 2^{64} $$ $$ \therefore a \% b \approx a − ((\lfloor (2^{64}-1) /b \rfloor)*a) / 2^{64} * b $$ 當 $((⌊(2^{64}−1)/b⌋)∗a)$ 精度夠高時，其計算結果與實際執行 mod 運算無異。我們可以用這個近似來做快速 modulo 運算，以 `__uint128_t`增加 $((⌊(2^{64}−1)/b⌋)∗a)$ 的精度： ```c #define b 1000000007 uint64_t reduce(uint64_t a) { __uint128_t p = (__uint128_t)(-1ULL / b) * a; uint64_t r = (uint64_t)(p >> 64); return a - r * b; } ``` 上述程式碼中，`-1ULL` $= 2^{64} - 1$ 該實做在 leedcode 上能跑過。以 perf 檢視進行 50000 次隨機數字的 mod 運算效能比較，為了避免額外變因把 srand 的 seed 設為 50 ： ```c int main() { srand(50); for (int i = 0; i < 50000; i++) { uint64_t a = reduce(rand()); } return 0; } ``` ### 直接 % 1e9+7 ```c uint64_t reduce(uint64_t a) { return a % b; } ``` ```bash sudo perf stat --repeat 5 -e instructions,cycles ./a.out Performance counter stats for './a.out' (5 runs): 4,991,514 instructions # 1.27 insn per cycle ( +- 0.07% ) 3,888,311 cycles ( +- 0.43% ) 0.003012 +- 0.000692 seconds time elapsed ( +- 22.97% ) ``` ### 使用 barrett_reduce ```c uint64_t reduce(uint64_t a) { __uint128_t p = (__uint128_t)(-1ULL / b) * a; uint64_t r = (uint64_t)(p >> 64); return a - r * b; } ``` ```bash sudo perf stat --repeat 5 -e instructions,cycles ./a.out Performance counter stats for './a.out' (5 runs): 5,797,471 instructions # 1.41 insn per cycle ( +- 0.04% ) 4,080,097 cycles ( +- 0.56% ) 0.0018122 +- 0.0000300 seconds time elapsed ( +- 1.66% ) ``` 可以看到執行時間 $\dfrac{0.0018122}{0.003012}\approx 60\%$ 約變為本來的 60 ％。 :::info 為何除法運算比 mod 快？ compiler 裡面有很多特別針對除法進行的操作能大幅加速除法的速度，但這些操作不一定能用在 mod 上。 ::: ### 測驗 `3` ```c size_t swar_count_utf8(const char *buf, size_t len) { const uint64_t *qword = (const uint64_t *) buf; const uint64_t *end = qword + len >> 3; size_t count = 0; for (; qword != end; qword++) { const uint64_t t0 = *qword; const uint64_t t1 = ~t0; const uint64_t t2 = t1 & 0x04040404040404040llu; const uint64_t t3 = t2 + t2; const uint64_t t4 = t0 & t3; count += __builtin_popcountll(t4); } count = (1 << 3) * (len / 8) - count; count += (len & 7) ? count_utf8((const char *) end, len & 7) : 0; return count; } ``` --- Ref: [GCC Other-Builtin functs](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) [Compiler Explorer](https://godbolt.org/) [Linux 效能分析工具: Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial) [^1]: [ISO/IEC 9899:1999 (E)](https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf) §6.5.7 [^2]: [Computing binomial coefficients modulo integers quickly](https://simonlindholm.github.io/files/bincoef.pdf)