2020q3 Homework (quiz3)

# 2020q3 Homework (quiz3) contributed by < `hankluo6` > ### `測驗 1` ```c int asr_i(signed int m, unsigned int n) { const int logical = (((int) -1) OP1) > 0; unsigned int fixu = -(logical & (OP2)); int fix = *(int *) &fixu; return (m >> n) | (fix ^ (fix >> n)); } ``` #### `OP1` - [x] `(b)` `>> 1` #### `OP2` - [x] `(c)` `m < 0` 在有號型態上進行 ASR 的結果只有兩種可能，一種為在 MSB 補 0，另一種為在 MSB 補 1。因 `OP1` 使用到 `(int)-1` 且後方檢驗是否為正數，可以推測出 `OP1` 用來檢驗當前編譯器是選擇哪種 implementation，所以答案很明顯為 `(b) >> 1`。已知 `logical` 為 0 或 1，接著就是實作兩種情況都有正確答案的程式碼。因 `m > 0` 時不受影響，可以分成以下 2 種： * `m < 0` 且 ASR 為 `MSB = 0`，logical shift * `m < 0` 且 ASR 為 `MSB = 1`，arithemetic shift 我們可以知道第二種情況為正確行為，所以只要讓第一種情況發生時，shift 的部分改為 1 就行了。深入程式碼可發現 `fix` 只有 0 跟 -1 兩種可能，接著當 `fix == -1` 時，`return` 當中 `(fix ^ (fix >> n)` 能讓 shift 產生的位元為 0 或 1 (如果 ASR 的 `MSB = 0` 則產生 1，反之則產生 0)，接著使用 OR 就能把原本該為 1 卻為 0 的部分改為 1。因為我們只要針對 ASR 為 logical shift 時的行為，所以只考慮 `(c) m < 0`。 #### 數學證明當 ASR 為 logical shift 時，可以當成是在 unsigned 上做 arithmetic shift，以 $-5 >> 2$ 為例子，且假設為 4 bit integer，則可以寫成： $$ \begin{align} -5(1011)_{2's} >> 2 &= (-5(1011)_{2's}+2^4-2^4) >> 2 \\ &= (11(1011)_{unsigned}-2^4) >> 2 \\ &= 2(0010)_{uinsigned} - 2^2 \\ &= 2(0010)_{uinsigned} - 2^4 + 2^3 + 2^2 \\ &= -11(0010)_{2's} + 2^3 + 2^2\\ &= -2(1110)_{2's} \end{align} $$ :::info 這邊使用 $2^4$ 來當作 2's complement 與 unsigned 之間的轉換，可以只當成形式上的代換，也可以當成數值上的相加減，詳細二補數規則可參考 [解讀計算機編碼](https://hackmd.io/@sysprog/binary-representation#%E7%AC%AC%E4%B8%80%E9%83%A8%E5%88%86%EF%BC%9A%E4%BA%8C%E8%A3%9C%E6%95%B8%E7%B7%A8%E7%A2%BC) ::: 公式內的 $2^4$ 因我們假設在 unsigned 上做 ASR，所以程式內不需特別寫出，而後方的 $-2^4$ 則是我們在 return 時，將 MSB (或所有 shift 新增的位元) 都設為 1 的這個行為，這個 $-2^4$ 會因為 shift 的關係而減少，最後為了要轉換為 2's complement 而再補齊前面幾個 bit。 #### 延伸問題 * 練習實作其他資料寬度的 ASR，可參照 [C 語言：前置處理器應用篇撰寫通用的巨集以強化程式碼的共用程度](https://hackmd.io/@sysprog/c-preprocessor); ```c #define asr(m, n) \ _Generic((m), \ int8_t: asr_8, \ int16_t: asr_16, \ int32_t: asr_32, \ int64_t: asr_64 \ )(m, n) #define asr_type(type) \ const type logical = (((type) -1) >> 1) > 0; \ u##type fixu = -(logical & (m < 0)); \ type fix = *(type *) &fixu; \ return (m >> n) | (fix ^ (fix >> n)) int_8_t asr_8(int_8_t m, unsigned int n) { asr_type(int8_t); } int16_t asr_16(int16_t m, unsigned int n) { asr_type(int16_t); } int32_t asr_32(int32_t m, unsigned int n) { asr_type(int32_t); } int64_t asr_64(int64_t m, unsigned int n) { asr_type(int64_t); } ``` 使用 `_Generic` 來實現類似 C++ 中 Template 的效果。 ### `測驗 2` ```c bool isPowerOfFour(int num) { return num > 0 && (num & (num - 1))==0 && !(__builtin_ctz(num) OPQ); } ``` #### `OPQ` - [x] `(f)` `& 0x1` power of 4 的條件為總共只有一個 bit 為 1，且該 bit 的位置需為 2, 4, 6, 8...(從 0 開始)才為 4 的倍數。`num & (num - 1)` 可以判斷是否只有一個 bit 為 1。所以 `__builtin_ctz(num)` 必須判斷 bit 的位置，從上述可知 `__builtin_ctz(num)` 的值要為偶數，所以`OPQ` 為 `(f)`。 #### 延伸問題 * 改寫上述程式碼，提供等價功能的不同實作，儘量降低 branch 的數量; `__builtin_ctz(num) & 0x1` 計算從 LSB 數來是否為偶數個 0，可以改用 bitmask 來檢測，因 `num` 只有一個 bit 為 1，所以 `0x55555555 & num` 能夠取得該 bit 是否在偶數位置，且此 bitmask 也能順便考慮 `num == 0` 的情況。 ```c bool isPowerOfFour(int num) { return (0x55555555 & num) && (num & (num - 1)) == 0; } ``` * 練習 LeetCode [1009. Complement of Base 10 Integer](https://leetcode.com/problems/complement-of-base-10-integer/) 和 [41. First Missing Positive](https://leetcode.com/problems/first-missing-positive/)，應善用 clz; 1. LeetCode 1009. Complement of Base 10 Integer 有兩種想法，一種為使用 Not 運算再用 bitmask 將最高位元以前的 1 都 clear；另一種為對 N 的 bit 與 bitmask 做 XOR，且最高位元以前的 bitmask 皆為 0，以後皆為 1。這邊實作第二種方式： ```c int bitwiseComplement(int N) { return N ? ((1 << (32 - __builtin_clz(N))) - 1) ^ N : 1; } ``` 使用 `__builtin_clz` 能讓程式碼更簡潔 (沒有 loop)。 2. LeetCode 41. First Missing Positive 因 missing positive integer 不可能超過陣列長度，且題目要求使用 constant extra space，所以我們可以利用陣列來紀錄資料。 ```c int firstMissingPositive(int* nums, int numsSize) { for (int i = 0; i < numsSize; ++i) { if (nums[i] < 0 || nums[i] > numsSize) nums[i] = 0; } for (int i = 0; i < numsSize; ++i) { int idx = (nums[i] & 0x7fffffff) - 1; if (idx < 0 || idx >= numsSize) continue; nums[idx] |= 0x80000000; } for (int i = 0; i < numsSize; ++i) { if (!nums[i] || __builtin_clz(nums[i])) return i + 1; } return numsSize == 0 ? 1 : numsSize + 1; } ``` 把遇到過的數字應該放置的位置上的值與 `0x80000000` 做 OR 運算來當作標記，這樣不僅不會影響到原本的數字（假設都為正數），也可以紀錄哪些數字出現過。最後在看哪個數字沒有標記就是答案。 * 研讀 2017 年修課學生報告，理解 clz 的實作方式，並舉出 Exponential Golomb coding 的案例說明，需要有對應的 C 原始程式碼和測試報告; 在 [H.264](https://en.wikipedia.org/wiki/Advanced_Video_Coding) 中用到了 Exponential Golomb coding 這種編碼方式，其中內部又能在細分為 * ue(v) * me(v) * se(v) * te(v) 在 H.264 的 document 中有這段描述 >The parsing process for these syntax elements begins with reading the bits starting at the current location in the bitstream up to and including the first non-zero bit, and counting the number of leading bits that are equal to 0. 此部份就很適合使用 `clz` 來計算，翻閱 H.264 的 opensource，在 [openh264/codec/encoder/core/inc/svc_enc_golomb.h](https://github.com/cisco/openh264/blob/master/codec/encoder/core/inc/svc_enc_golomb.h) 中的 `BsSizeUe()` 是用來取得 exponent golomb code，代碼如下： ```cpp static inline uint32_t BsSizeUE (const uint32_t kiValue) { if (256 > kiValue) { return g_kuiGolombUELength[kiValue]; } else { uint32_t n = 0; uint32_t iTmpValue = kiValue + 1; if (iTmpValue & 0xffff0000) { iTmpValue >>= 16; n += 16; } if (iTmpValue & 0xff00) { iTmpValue >>= 8; n += 8; } //n += (g_kuiGolombUELength[iTmpValue] >> 1); n += (g_kuiGolombUELength[iTmpValue - 1] >> 1); return ((n << 1) + 1); } } ``` 可以看到 2 個 if 的作用就是在算 leading zero。 ### `測驗 3` ```c int numberOfSteps (int num) { return num ? AAA + 31 - BBB : 0; } ``` #### `AAA` - [x] `(a)` `__builtin_popcount(num)` #### `BBB` - [x] `(b)` `__builtin_clz(num)` 從 bits 的角度看此題，會發現偶數除 2、奇數減 1 這兩個步驟可以看成當 LSB 為 0 則 right shift one bit、當 LSB 為 1 則減 1。所以我們只需要算有幾個 bit 為 1 (代表減 1 的次數)，以及從 MSB 數來第一個 bit 為 1 的位置 (代表除 2 的次數)，故答案為 `(a)` 與 `(b)`。 #### 延伸問題 * 改寫上述程式碼，提供等價功能的不同實作並解說; * 避免用到編譯器的擴充功能，只用 C99 特徵及 bitwise 運算改寫 LeetCode 1342. Number of Steps to Reduce a Number to Zero，實作出 branchless 解法，儘量縮減程式碼行數和分支數量; ```c int numberOfSteps (int num) { if (!num) return 0; int i = 0, bitmask = 0x80000000; unsigned tmp, c; tmp = num - ((num >> 1) & 0x55555555); // reuse input as temporary tmp = (tmp & 0x33333333) + ((tmp >> 2) & 0x33333333); // temp c = ((tmp + (tmp >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count while (!(num & bitmask)) { ++i; bitmask >>= 1; } return c + 31 - i; } ``` 參考 [Bit Twiddling Hacks](http://graphics.stanford.edu/~seander/bithacks.html) 裡的 [Counting bits set, in parallel](http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) 章節，實作 `__builtin_popcount` function，並使用迴圈來計算 leading zero。 :::info 不知道有沒有 brenchless 的 `__builtin_clz` 實作。 ::: 這邊很意外的運行時間比原本程式還好 (0ms vs 4ms)，試著找出原因，翻閱 [`glibc/stdlib/longlong.h`](https://code.woboq.org/userspace/glibc/stdlib/longlong.h.html) 找出 `__builtin_clz` 原始碼，發現是直接使用組語實作： ```c do { SItype c_; __asm__ ("norm.f\t%0,%1\n\tmov.mi\t%0,-1" : "=r" (c_) : "r" (x) : "cc"); (count) = c_ + 1; } while (0) ``` 所以效能應該不會比我的 `while` 迴圈版本慢，查看 `__builtin_popcount` 發現有很多版本實作，推測是因為 leetcode 中 c compiler 的 `__builtin_popcount` 實作較慢。 :::info TODO: 用適合的實驗證明此推測 ::: ### `測驗 4` ```c #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } while (!(u & 1)) u /= 2; do { while (!(v & 1)) v /= 2; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (XXX); return YYY; } ``` #### `XXX` - [x] `(b)` `v` #### `YYY` - [x] `(e)` `u << shift` 需先知道輾轉相除法有個簡單的形式： $$ \begin{align} gcd(a,b) = gcd(a-b, b) \quad \text{if a > b} \\ gcd(a,b) = gcd(a, b-a) \quad \text{if a < b} \end{align} $$ 寫成 pseudocode： ```p function gcd(a, b) if a = 0 return b while b ≠ 0 if a > b a ← a − b else b ← b − a return a ``` 此測驗中程式內 `do..while` 裡的 `if else` 其實跟這邊的 `if elst` 做的事情一樣，只是測驗中的程式多做了一次 swap 把 `u` 跟 `v` 對調。另外程式還用了 gcd 的另一種特性：$gcd(a,b) = gcd(a/2, b) \quad \text{if b is odd}$，所以只要一方為奇數，就能將另一邊 shift 來減少運算量。 :::info 此程式用到的 gcd 相關技巧可參考 [wikipedia](https://en.wikipedia.org/wiki/Binary_GCD_algorithm)： * gcd(0, v) = v, because everything divides zero, and v is the largest number that divides v. Similarly, gcd(u, 0) = u. * gcd(2u, 2v) = 2·gcd(u, v) * gcd(2u, v) = gcd(u, v), if v is odd (2 is not a common divisor). Similarly, gcd(u, 2v) = gcd(u, v) if u is odd. * gcd(u, v) = gcd(|u − v|, min(u, v)), if u and v are both odd. ::: 我們讓 `u` 保持為奇數，並且維持 `u > v`，這樣當我們 `v == 0` 時，`u` 就自然為 gcd，故 `OOO` 為 `v`；根據上面第二點提到，我們把 2 的倍數提出來，gcd 應藥補回提出的 2 的倍數，所以 `YYY` 為 `u << shift`。 #### 延伸問題 * 在 x86_64 上透過 __builtin_ctz 改寫 GCD，分析對效能的提升; ```c #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; int a = __builtin_ctzl(u), b = __builtin_ctzl(v); shift = b ^ ((a ^ b) & -(a < b)); u >>= shift, v >>= shift; int bit; if (bit = __builtin_ctzl(u)) u >>= bit;; do { if (bit = __builtin_ctzl(v)) v >>= bit; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (v); return u << shift; } ``` 效能分析：隨機產生 64 bit 的資料，並測量 1000000 次的運行時間 | gcd64 | __builtin_ctz gcd64 | |:----------:|:-------------------:| | 0.667711 s | 0.425873 s | 可發現效能有有效的提昇，約提昇 $\frac{0.667711-0.425873}{1000000} = 2.4 * 10^{-7}$ 秒。 ### `測驗 5` ```c #include <stddef.h> size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = KKK; int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } ``` #### `KKK ` - [x] `(e)` `bitset & -bitset` 與參考程式比對，可知用 bit 的版本透過 bit 操作省略了 LSB 為 0 的情形，我們只操作從 LSB 數來第一個為 1 的 bit，而操作完必須 clear 該 bit，所以答案為 `(e)`。 #### 延伸問題 * 思考進一步的改進空間; 程式碼中 `k * 64` 可以改成 bit shift 運算，且清除 LSB 的 bit 可利用 `__builtin_ctzll()` 的結果來達成。改進後的程式碼如下： ```c size_t my_improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t r = __builtin_ctzll(bitset); out[pos++] = k << 6 + r; bitset ^= 1 << r; } } return pos; } ``` * 設計實驗，檢驗 clz 改寫的程式碼相較原本的實作有多少改進？應考慮到不同的 [bitmap density](http://www.vki.com/2013/Support/docs/vgltools-3.html); 這邊我測試 0 ~ 64 個 bit 數量為 1 的情形，將每個為 1 的 bit 隨機分佈在 `bitset` 當中，在 butmapsize 為 128 上測量運行 10000 次的時間，結果如下： ![](https://i.imgur.com/hMsR3zg.png) * naive 的版本為線性時間成長，這是因為此版本一定會跑迴圈 64 次，所以效能影響全部由 bit 數量決定。 * improve 的版本在初期成長緩慢，之後卻急遽遞增，這是因為 bit 的數量越高，`while` 迴圈的次數就越多，而迴圈內部大量的運算就會降低程式效能。 * 改進後的版本，因 `while` 迴圈內部運算成本減少，可以發現在 bit 數高的情況下也能有效的壓低運行時間。 ###### tags: `linux2020`