# 2020q3 Homework3 (quiz3) contributed by < `Rsysz` > ###### tags: `sysprog` > [測驗題](https://hackmd.io/@sysprog/2020-quiz3#%E6%B8%AC%E9%A9%97-3) ## 1. 根據C99標準的6.5.7節 > “The result of `E1` >> `E2` is E1 right-shifted E2 bit positions. … If E1 has a signed type and a negative value, the resulting value is ==implementation-defined==. 對有號型態的物件進行算術右移(Arithmetic Shift Right)屬於 `Unspecified Behavior` 因此想實作一套可跨平台的ASR,代表當輸入為負數時右移必須補 1 ,反之則必須補 0 ```cpp int asr_i(signed int m, unsigned int n) { const int logical = (((int) -1) OP1) > 0; unsigned int fixu = -(logical & (OP2)); int fix = *(int *) &fixu; return (m >> n) | (fix ^ (fix >> n)); } ``` 根據題意,已知m為輸入數值,n為右移次數,由 `return` 回推 * 假設 `m` 為正數若 `m >> n` 為補0,則 `fix ^ (fix >> n)` = 0 若 `fix = 0` 相當於 `fixu = 0` 代表 `logical` 必須等於 0 因此得知 `logical` 為測試此平台上負數右移補 0 或 1 , OP1 = `(b) >> 1` * 假設 `m` 為負數若 `m >> n` 為補0,則 `fix ^ (fix >> n)` 必須補上對應 `n` 位的 1 若 `logical` 為 1 代表右移補 0 , `fix = -OP2` 又 `m < 0` 時 `fix` 必須等於 -1 因此 OP2 = `(c) m < 0` ```graphviz digraph ASR{ node [shape="box", fontcolor=black, fontsize=13, style=filled, fillcolor=aquamarine] start[shape=ellipse, fontcolor=black,style=filled, fillcolor=white,label="start"] init[label="input value = m, shift value = n", fillcolor=white] shift[label="(((int) -1) >> 1) > 0",shape="diamond", fontcolor=black, fillcolor=white] fix[label="fix = fixu = -(logical & (m < 0))", fillcolor=white] m[label="m < 0", shape="diamond", fillcolor=white] m_true[label="m > 0, logical = 0 or 1 m < 0, logical = 0 fix = 0", fillcolor=white] m_false[label="m < 0, logical = 1 fix = -1", fillcolor=white] return[label=" (m >> n) | (fix ^ (fix >> n)) ", fillcolor=white] end[shape=ellipse, fontcolor=black,style=filled, fillcolor=white,label="end"] start->init->shift->fix:w init->m->fix:e fix->m_true->return->end fix->m_false->return } ``` 以$-5≫^{arith}1$為例, `m = -5` , `n = 1` 若假設右移為 ==邏輯右移== ```graphviz digraph SR{ node[shape=record] struct1 [label="<f0>1|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>1|<f7>1"] int [label = "((int) -1)", shape = plaintext] struct2 [label="<f0>0|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>1|<f7>1"] int_s [label = "((int) -1) >> 1", shape = plaintext] {rank = same int_s struct2} logica [label = "logical = 1"] struct1:f0->struct2:f1 struct1:f1->struct2:f2 struct1:f2->struct2:f3 struct1:f3->struct2:f4 struct1:f4->struct2:f5 struct1:f5->struct2:f6 struct1:f6->struct2:f7 struct2:f7:e->logica:w } ``` 因 `m` 為負數, `fix = fixu = -1` ```graphviz digraph SR{ node[shape=record] m [label="<f0>1|<f1>1|<f2>1|<f3>1|<f4>1|<f5>0|<f6>1|<f7>1"] m_text_1 [label = "m(-5)", shape = plaintext] {rank = same m m_text_1} n [label="<f0>0|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>0|<f7>1"] m_text_2 [label = "m >> n(-5 >> 1)", shape = plaintext] {rank = same n m_text_2} result [label="<f0>1|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>0|<f7>1"] result_text [label = "(m >> n) | (fix ^ (fix >> 1)) (-3)", shape = plaintext] {rank = same result result_text} struct1 [label="<f0>1|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>1|<f7>1"] fix [label = "fix", shape = plaintext] struct2 [label="<f0>0|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>1|<f7>1"] fix_s [label = "fix >> 1", shape = plaintext] {rank = same fix_s struct2} struct3 [label="<f0>1|<f1>0|<f2>0|<f3>0|<f4>0|<f5>0|<f6>0|<f7>0"] fix_xor [label = "fix ^ (fix >> 1)", shape = plaintext] {rank = same fix_xor struct3} n->result m:f0->n:f1 m:f1->n:f2 m:f2->n:f3 m:f3->n:f4 m:f4->n:f5 m:f5->n:f6 m:f6->n:f7 struct1:f0->struct2:f1 struct1:f1->struct2:f2 struct1:f2->struct2:f3 struct1:f3->struct2:f4 struct1:f4->struct2:f5 struct1:f5->struct2:f6 struct1:f6->struct2:f7 struct2:f0->struct3:f0 struct2:f1->struct3:f1 struct2:f2->struct3:f2 struct2:f3->struct3:f3 struct2:f4->struct3:f4 struct2:f5->struct3:f5 struct2:f6->struct3:f6 struct2:f7->struct3:f7 } ``` ## 2. ```cpp bool isPowerOfFour(int num) { return num > 0 && (num & (num - 1))==0 && !(__builtin_ctzl(num) OPQ); } ``` $4^n(n為整數)必然大於0$,因此 `num > 0` $4^n可化為2^{2n},若num不為2^n必定不為4^n$, 因此 `(num & (num - 1)) == 0` > Built-in Function: `int __builtin_ctz (unsigned int x)` > * Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined. > > Built-in Function: `int __builtin_clz (unsigned int x)` > * Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. | $4^n$ | Binary | | -------- | -------- | | $4^0$| 0000 0001| | $4^1$| 0000 0100| | $4^2$| 0001 0000| 已知$4^n = {\overbrace{0000....0100}^{尾端數來共2n個0}}$ 以 `!(__builtin_ctz(num) OPQ)` 確保尾端數來皆是偶數個 0 排除 2 的次方數 因此 OPQ = `(f) & 0x1` ### 延伸問題 ```cpp bool isPowerOfFour(int num) { return (num & 0x55555555) && (!(num & (num - 1))); } ``` ![](https://i.imgur.com/moPAh7T.png) 透過 `0x55555555` 確保 `num` 的二進位只有奇數位有值,又 4 的次方數轉為二進位只有一個 1 於是再排除有複數位的情況。 ```cpp int bitwiseComplement(int N){ if(N) return ~N & ((1 << (32 - __builtin_clz(N))) - 1); return 1; } ``` ![](https://i.imgur.com/w7WveKO.png) 找出 MSB 後有效位並對 1 求補數,利用 `1 << (32 - __builtin_clz(N)) - 1` 製造有效位 Mask 並求剩下的數。 ```cpp #define SWAP(a, b) ((a) ^= (b), (b) ^= (a), (a) ^= (b)) int firstMissingPositive(int* nums, int numsSize){ for(int i = 0; i < numsSize; i++){ if(nums[i] > 0 && nums[i] < numsSize && nums[i] != nums[nums[i] - 1]){ int j = nums[i] - 1; SWAP(nums[i], nums[j]); i--; } } for(int i = 0; i < numsSize; i++){ if(nums[i] != i + 1) return i + 1; } return numsSize + 1; } ``` ![](https://i.imgur.com/5SL8iIl.png) 參考[SwappingValuesXOR](http://graphics.stanford.edu/~seander/bithacks.html#SwappingValuesXOR) ## 3. 針對 [LeetCode 1342. Number of Steps to Reduce a Number to Zero](https://leetcode.com/problems/number-of-steps-to-reduce-a-number-to-zero/),我們可運用 gcc 內建函式實作,假設 int 寬度為 32 位元,程式碼列表如下: ```cpp int numberOfSteps (int num) { return num ? AAA + 31 - BBB : 0; } ``` > `__builtin_popcount(x)`: 計算 x 的二進位表示中,總共有幾個 `1` 參考題二得知對應的`__builtin_clz(num)`為計算 MSB 開始到第一個 1 之間 0 的個數 透過`32 - __builtin_clz(num)`求得 MSB 與 0 的距離,再以 `__builtin_popcount(x) - 1` 求得 MSB 以外的 1 與 0 的距離,因此 AAA = `__builtin_popcount(num)`,BBB = `__builtin_clz(num)` ### 延伸問題 ```cpp int numberOfSteps (int num) { return __builtin_popcount(num) + 31 - __builtin_clz(num | 1); } ``` ![](https://i.imgur.com/qkdTj9Y.png) 於 `__builtin_clz(num | 1)` 針對 LSB 為 0 的情況加 1 ,而避免了傳入 0 ```cpp int popcount(int num){ /*__builtin_popcount(num)*/ num = (num & 0x55555555) + ((num >> 1) & 0x55555555); num = (num & 0x33333333) + ((num >> 2) & 0x33333333); num = (num & 0x0F0F0F0F) + ((num >> 4) & 0x0F0F0F0F); num = (num & 0x00FF00FF) + ((num >> 8) & 0x00FF00FF); num = (num & 0x0000FFFF) + ((num >> 16) & 0x0000FFFF); return num; } int clz(int v){ /*31 - __builtin_clz(num | 1)*/ unsigned int r; unsigned int shift; r = (v > 0xFFFF) << 4; v >>= r; shift = (v > 0xFF ) << 3; v >>= shift; r |= shift; shift = (v > 0xF ) << 2; v >>= shift; r |= shift; shift = (v > 0x3 ) << 1; v >>= shift; r |= shift; r |= (v >> 1); return r; } int numberOfSteps (int num) { return popcount(num) + clz(num | 1); } ``` ![](https://i.imgur.com/g5Czm3q.png) 每 `byte` 切割輸入值最終加總為 `popcount` ,而下方 `clz` 則可以計算 `__builtin_clz` 的相反數值 參考[Find the log base 2 of an N-bit integer in O(lg(N)) operations](http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog) ## 4. 考慮以下 64-bit GCD (greatest common divisor, 最大公因數) 求值函式: ```cpp #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; while (v) { uint64_t t = v; v = u % v; u = t; } return u; } ``` 改寫為以下等價實作: ```cpp #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } while (!(u & 1)) u /= 2; do { while (!(v & 1)) v /= 2; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (XXX); return YYY; } ``` 上述程式就是在實作==輾轉相除法==,首先透過 `shift` 判斷當 `u` 和 `v` 都是偶數時,不斷除以 2 直到一方變成奇數,再對 `u` 單獨除到變為奇數,接著以不斷將兩數中較小的數置於 `u` 不斷以 `v -= u` 對 `v` 取餘,因此最終XXX = `(b) v` YYY = `(e) u << shift` 以上 GCD 加速方法可以參考[Binary GCD](https://hackmd.io/@sysprog/gcd-impl#Binary-GCD)利用 > binary gcd 的精神就是當兩數為偶數時,必有一 $2$ 因子。 $$\text{gcd}(x, y) = 2·\text{gcd}(\frac{x}{2}, \frac{y}{2})$$ 且一數為奇數另一數為偶數,則無 $2$ 因子。 $$\text{gcd}(x, y) = \text{gcd}(x, \frac{y}{2})$$ 於是我們可以改良為: $$ \text{even_factor} = \text{min}(\text{ctz}(x), \text{ctz}(y))$$ $$\text{gcd}(x, y) = \text{even_factor}·\text{gcd}((x\gg \text{even_factor}), (y\gg \text{even_factor}))$$ 透過 `shift` 加速 GCD 運算,最終於 `return u << shift` 補回先前右移的數值 ### 延伸問題 ```cpp #include <stdio.h> #include <stdlib.h> #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { unsigned t = u | v; if (!u || !v) return t; while (u) { u >>= __builtin_ctz(u); v >>= __builtin_ctz(v); if (u >= v) u = (u - v) / 2; else v = (v - u) / 2; } return (v << __builtin_ctz(t)); } ``` 透過 `__builtin_ctz` 改寫除 2 部分加速整體運算速度,然而透過 `perf stat` 指令查看兩者在處理==極大的值==時消耗的時間相差甚少,因此於 `main` 新增 ```cpp for (int i = 0; i < 0xFFF; i+=2) { for (int j = 0; j < 0xFFF; j+=2) gcd64(i, j); } ``` 以大量==偶數==數據帶入運算針對 `__builtin_ctz` 與題目所提供的 `shift` 方法比較計算速度 ```bash Performance counter stats for './ex_builtin_ctz': 218.58 msec task-clock # 0.999 CPUs utilized 1 context-switches # 0.005 K/sec 0 cpu-migrations # 0.000 K/sec 49 page-faults # 0.224 K/sec 9,0579,4689 cycles # 4.144 GHz 6,9513,1948 instructions # 0.77 insn per cycle 1,2077,3811 branches # 552.541 M/sec 1246,1088 branch-misses # 10.32% of all branches 0.218806228 seconds time elapsed 0.218822000 seconds user 0.000000000 seconds sys ``` ```bash Performance counter stats for './ex_original': 415.57 msec task-clock # 0.999 CPUs utilized 1 context-switches # 0.002 K/sec 0 cpu-migrations # 0.000 K/sec 48 page-faults # 0.116 K/sec 17,2396,2751 cycles # 4.148 GHz 11,1716,9482 instructions # 0.65 insn per cycle 2,8125,5819 branches # 676.788 M/sec 3811,1653 branch-misses # 13.55% of all branches 0.415865646 seconds time elapsed 0.415875000 seconds user 0.000000000 seconds sys ``` 才能得到較為顯著的測試結果,採用 `__builtin_ctz` 明顯減少了 `CPU cycles` , `fetch instructions` 和 `branch` 的次數,得到效能的提升。這裡值得注意的是如果使用 `printf`將得到的數值印出,將會消耗大量的時間,因此強烈不推薦使用。 ## 5. 在影像處理中,[bit array](https://en.wikipedia.org/wiki/Bit_array) (也稱 `bitset`) 廣泛使用,考慮以下程式碼: ```cpp #include <stddef.h> size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; for (size_t k = 0; k < bitmapsize; ++k) { uint64_t bitset = bitmap[k]; size_t p = k * 64; for (int i = 0; i < 64; i++) { if ((bitset >> i) & 0x1) out[pos++] = p + i; } } return pos; } ``` 可用 clz 改寫為效率更高且等價的程式碼: ```cpp #include <stddef.h> size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = KKK; int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } ``` 上述程式碼相互對應後 `int r = __builtin_ctzll` 取代 `if ((bitset >> i) & 0x1)` 的功能,但改寫後的程式碼對 `out[pos++]` 賦值後還必須 clear bitset 的 LSB 以確保下個 Loop 的正確運行 > 根據題三提到 $-n=∼n+1$ 利用 `bitset & -bitset` 留下 LSB 再於 `bitset ^= t` 清除 LSB 因此 KKK = `(e) bitset & -bitset`