2020q3 Homework2 (quiz2)

# 2020q3 Homework2 (quiz2) contributed by < [WeiCheng14159](https://github.com/WeiCheng14159) > ###### tags: `sysprog2020` ## Test1 ### 運作原理 ```c=1 while ((i + 8) <= size) { uint64_t payload; memcpy(&payload, str + i, 8); if (payload & MMM) return false; i += 8; } ``` * 如果尚未比較的片段大於或等於一個 word (假設 word size = 8 bytes = 64 bits) 則將一次將一個 word 大小的資料拷貝進 payload 變數 * 對於 payload 變數中的每一個 byte ，如果值 >= 128 (0x80) 則不屬於 ascii * 使用 bit-wise 操作，對於每一個 byte 使用 0x80 為 bit mask 做 bit-wise AND 則可以偵測 byte 之值是否大於 128 * 因為有 8 個 byte 故 0x80 重複 8 次，為 0x8080808080808080 * 剩餘不滿一個 word 的資料則一次一個 byte 讀取判斷 * `MMM=0x8080808080808080` ### 延伸問題 * `為何用到 memcpy 呢？` * `給予一個已知長度的字串，檢測裡頭是否包含有效的英文大小寫字母` * 用到第五題的觀念 * [程式碼](https://github.com/supernovaremnant/quiz2/blob/master/test1_ext1.cc) ```c=1 #define PACKED_BYTE(b) (((uint64_t)(b) & (0xff)) * 0x0101010101010101u) // detect the occurence of english character bool valid_eng(const char s[], size_t n) { for (size_t j = 0; j < n; j++) { if ((s[j] < 'A' || s[j] > 'Z') && (s[j] < 'a' || s[j] > 'z')) return false; } return true; } bool valid_eng_vector(const char s[], size_t n) { size_t k = n / 8; for (size_t i = 0; i < k; i++, s += 8) { uint64_t *chunk = (uint64_t *)s; uint64_t A = *chunk + PACKED_BYTE(128 - 'A' + 0); uint64_t Z = *chunk + PACKED_BYTE(128 - 'Z' + -1); uint64_t a = *chunk + PACKED_BYTE(128 - 'a' + 0); uint64_t z = *chunk + PACKED_BYTE(128 - 'z' + -1); uint64_t res = (((A ^ Z) | (a ^ z)) & PACKED_BYTE(0x80)); // printf("res = %lx\n", res); if (res != PACKED_BYTE(0x80)) return false; } k = n % 8; if (k) return valid_eng(s, k); return true; } ``` * `考慮常見的英文標點符號，判斷輸入字串是否為有效的字元集，避免逐一字元比對` * 假設英文標點符號包含從 `!` 到 `~` 所有字元 * [程式碼](https://github.com/supernovaremnant/quiz2/blob/master/test1_ext2.cc) ```c=1 // detect the occurence of english character + punctuation bool normal_eng(const char s[], size_t n) { for (size_t j = 0; j < n; j++) { if (s[j] < '!' || s[j] > '~') return false; } return true; } bool normal_eng_vector(const char s[], size_t n) { size_t k = n / 8; for (size_t i = 0; i < k; i++, s += 8) { uint64_t *chunk = (uint64_t *)s; uint64_t lower = *chunk + PACKED_BYTE(128 - '!' + 0); uint64_t upper = *chunk + PACKED_BYTE(128 - '~' + -1); uint64_t res = ((lower ^ upper) & PACKED_BYTE(0x80)); // printf("res = %lx\n", res); if (res != PACKED_BYTE(0x80)) return false; } k = n % 8; if (k) return normal_eng(s, k); return true; } ``` ## Test2 ### 運作原理 ```c=1 uint8_t hexchar2val(uint8_t in) { const uint8_t letter = in & MASK; const uint8_t shift = (letter >> AAA) | (letter >> BBB); return (in + shift) & 0xf; } ``` 列出使用到的 ascii 字元 | char | hex | bin | char | hex | bin | char | hex | bin | |------|------|-----------|------|------|-----------|------|------|-----------| | 0 | 0x30 | 0011_0000 | A | 0x41 | 0100_0001 | a | 0x61 | 0110_0001 | | 1 | 0x31 | 0011_0001 | B | 0x42 | 0100_0010 | b | 0x62 | 0110_0010 | | 2 | 0x32 | 0011_0010 | C | 0x43 | 0100_0011 | c | 0x63 | 0110_0011 | | 3 | 0x33 | 0011_0011 | D | 0x44 | 0100_0100 | d | 0x64 | 0110_0100 | | 4 | 0x34 | 0011_0100 | E | 0x45 | 0100_0101 | e | 0x65 | 0110_0101 | | 5 | 0x35 | 0011_0101 | F | 0x46 | 0100_0110 | f | 0x66 | 0110_0110 | | 6 | 0x36 | 0011_0110 | | | | | | | | 7 | 0x37 | 0011_0111 | | | | | | | | 8 | 0x38 | 0011_1000 | | | | | | | | 9 | 0x39 | 0011_1001 | | | | | | | * 只考慮 `0-9` 的話，直接對最低 4 bits 作 bit mask (也就是最後一行的 `0xf` )，就可以直接得到答案，故推測輸入為數字時，`shift` 之值為零 * 接著考慮字母 `F = 0100_0110` 的最低四個 bits ，加上 shift 之後取最低的 4 bits 要輸出 `15=0000_1111` ，故推測當輸入為 `F = 0100_0110` 或 `f = 0110_0110` 時，shift 之值為 `0000_1001` * shift 之值會因為輸入為數字或字母而有所不同，故推測 MASK 的作用是區別數字跟字母（從變數名稱也看得出來） * 觀察數字跟字母的二元表示法，發現 `MASK` 等於 `0100_0000` 或 `0001_0000` 都可以區分兩者 * 數字的第7個位元都是0，字母的第7個位元都是1 * 數字的第5個位元都是1，字母的第5個位元都是0 * 使用 `MASK = 0001_0000` 需要再多做一次反轉 * 取 `MASK=0100_0000=0x40` 則 `shift >> 3 || shift >> 6` 可以湊出 `shift=0000_1001` * `MASK=0x40, AAA=3, BBB=6` ### 延伸問題 * `將 hexchar2val 擴充為允許 "0xFF", "0xCAFEBABE", "0x8BADF00D" 等字串輸入，且輸出對應的十進位數值` ## Test3 ### 運作原理 ```c=1 const uint32_t D = 3; #define M ((uint64_t)(UINT64_C(XXX) / (D) + 1)) /* compute (n mod d) given precomputed M */ uint32_t quickmod(uint32_t n) { uint64_t quotient = ((__uint128_t) M * n) >> 64; return n - quotient * D; } ``` * n 模除 3 的數學推導如下： \begin{align} n \% 3 &= n - \left \lfloor {\frac{n}{3}} \right\rfloor \times 3 \\ &= n - \left \lfloor n \times \frac{ \frac{2^{64}}{3} }{2^{64}} \right\rfloor \times 3\\ &= n - \left \lfloor \left( n \times \frac{2^{64}}{3} \right) \times \frac{1}{2^{64}} \right \rfloor \times 3\\ &= n - ((n \times M) >> 64) \times 3 \\ \end{align} * 其中 $\frac{2^{64}}{3}$ 對應到 `#define M ((uint64_t)(UINT64_C(XXX) / (D) + 1))` 無條件捨去的除法運算再補一對應到"取上界" ，故 $M = \left\lceil \frac{2^{64}}{3} \right \rceil$，`XXX` 約等於 $2^{64}$ * `XXX = 0xFFFFFFFFFFFFFFFF` ### 延伸問題 * 由 Facebook 公司所維護的 jemalloc 是個高效能的記憶體配置器 (memory allocator，即 malloc 和 free 函式對應的實作)，特別在多核多執行緒的環境有優異表現，在其原始程式碼 [include/jemalloc/internal/div.h](https://github.com/jemalloc/jemalloc/blob/dev/include/jemalloc/internal/div.h) 和 [src/div.c](https://github.com/jemalloc/jemalloc/blob/dev/src/div.c) 也運用類似的技巧，請閱讀原始程式碼，並抽離快速除法實作為獨立的程式碼，說明運作原理，也該撰寫測試程式，比較透過處理器的整數除法指令及本技巧的效能差距 * 原理：`jemalloc` 使用的方法跟前一個問題使用的方法相似，已知 $\frac{n}{d} = n \times \frac{\frac{2^k}{d}}{2^k}$ 但是 k 要取多少才能滿足我們要的精確度呢? [src/div.c](https://github.com/jemalloc/jemalloc/blob/dev/src/div.c) 提供了證明 * 假設 $n, d, \frac{n}{d}$ 為整數 \begin{align} \frac{n}{d} &= p, \ p \in Z \\ &= \left \lfloor \left\lceil \frac{2^k}{d}\right\rceil \times \frac{n}{2^k}\right \rfloor , \ \forall k \in N\\ &= \left \lfloor \frac{2^k + r}{d} \times \frac{n}{2^k}\right \rfloor (\because r = - 2^k \mod d)\\ &= \left \lfloor \frac{n}{d} + \frac{r}{d} \times \frac{n}{2^k} \right \rfloor \\ &= \frac{n}{d} + \left \lfloor \frac{r}{d} \times \frac{n}{2^k} \right \rfloor (\because \frac{n}{d} = p \in Z) \\ \end{align} * 證明 $\left\lceil \frac{2^k}{d}\right\rceil = \frac{2^k + r}{d}$ * $\left\lceil \frac{2^k}{d}\right\rceil = \frac{2^k + d-(2^k \mod d)}{d} = \frac{2^k + r}{d} =\frac{2^k + (-2^k \mod d)}{d}$ * $pf: -2^k \mod d= d-(2^k \mod d)$ \begin{align} \Rightarrow & 2^k \mod d = r \\ 2^k &= pd + r \\ -2^k &= (-p)d + -r \\ -2^k &= (-p-1)d + d-r \\ -2^k &= p'd + (d-r) \\ \Rightarrow & (-2^k \mod d ) = d-r\\ \end{align} * $\frac{n}{d} = \frac{n}{d} + \left \lfloor \frac{r}{d} \times \frac{n}{2^k} \right \rfloor$ 成立的條件為 $\frac{r}{d} \times \frac{n}{2^k} < 1$ * 因為 $r = 2^k \mod d$，所以 $\frac{r}{d} <１$ * 如果 $\frac{n}{2^k} < 1$ 則 $\frac{r}{d} \times \frac{n}{2^k} < 1$ 恆成立 * 假設 n 的範圍是 unsigned int 則 k 取 32 可以保證 $\frac{n}{2^k} <1$ * 測試環境： * 使用 lab01 dudect 的時間測量方法 ```c=1 #include <stdint.h> // http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-executio$ static inline int64_t cpucycles(void) { #if defined(__i386__) || defined(__x86_64__) unsigned int hi, lo; __asm__ volatile("rdtsc\n\t" : "=a"(lo), "=d"(hi)); return ((int64_t) lo) | (((int64_t) hi) << 32); #else #error Unsupported Architecture #endif } ``` * 將 jemalloc 的快速除法與普通除法做比較 * 編譯器最佳化程度由 `O0` 到 `O3` 都做測試 * 測量 10000 次畫出時間的散布圖, 發現 jemalloc 的快速除法確實比較快 * 測試結果： * 編譯器旗標 `-O0` * ![](https://i.imgur.com/Dk5PAOJ.png) * 編譯器旗標 `-O1` * ![](https://i.imgur.com/j4nUhde.png) * 編譯器旗標 `-O2` * ![](https://i.imgur.com/EcIwnYM.png) * 編譯器旗標 `-O3` * ![](https://i.imgur.com/MXaXXMq.png) ## Test4 ### 運作原理 ```c=1 const uint32_t D = 3; #define M ((uint64_t)(UINT64_C(XXX) / (D) + 1)) bool divisible(uint32_t n) { return n * M <= YYY; } ``` * 參考 [Faster remainders when the divisor is a constant: beating compilers and libdivide ](https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/) * 因為 \begin{align} \left \lfloor {\frac{n}{3}} \right\rfloor &= (n \times M) >> 64 \\ M &= \left\lceil \frac{2^{64}}{3} \right \rceil \\ &= \left\lfloor \frac{2^{64}}{3} \right\rfloor + 1\\ \end{align} * $(n \times M) >> 64$ 是 $\frac{n}{3}$ 的整數值，$n \times M$ 是 $\frac{n}{3}$ 的小數部份放大 $2^{64}$ 倍（照理來說會被右移 64 bits 後捨棄，但是保留小數部份能夠幫助餘數的運算），$M - 1 = \left\lfloor \frac{2^{64}}{3} \right\rfloor$ 是 $\frac{1}{3}$ 的小數部份放大 $2^{64}$ 倍（將 n=1 帶入算是即可看出） * 比較 $\frac{n}{3}$ 和 $\frac{1}{3}$ 的小數部份，若 $n \times M \leq M - 1$ 則 n 可以被 3 整除 * `YYY=M-1` ## Test5 ### 運作原理 * byte-by-byte 方法： ```c=1 #include <ctype.h> #include <stddef.h> /* in-place implementation for converting all characters into lowercase. */ void strlower(char *s, size_t n) { for (size_t j = 0; j < n; j++) { if (s[j] >= 'A' && s[j] <= 'Z') s[j] ^= 1 << 5; else if ((unsigned) s[j] >= '\x7f') /* extended ASCII */ s[j] = tolower(s[j]); } } ``` * Vectorized 方法： ```c=1 #include <ctype.h> #include <stddef.h> #include <stdint.h> #define PACKED_BYTE(b) (((uint64_t)(b) & (0xff)) * 0x0101010101010101u) /* vectorized implementation for in-place strlower */ void strlower_vector(char *s, size_t n) { size_t k = n / 8; for (size_t i = 0; i < k; i++, s += 8) { uint64_t *chunk = (uint64_t *) s; if ((*chunk & PACKED_BYTE(VV1)) == 0) { /* is ASCII? */ uint64_t A = *chunk + PACKED_BYTE(128 - 'A' + VV2); uint64_t Z = *chunk + PACKED_BYTE(128 - 'Z' + VV3); uint64_t mask = ((A ^ Z) & PACKED_BYTE(VV4)) >> 2; *chunk ^= mask; } else strlower(s, 8); } k = n % 8; if (k) strlower(s, k); } ``` * 延續第一題的想法，判斷一個 byte 是否為 ascii 的方法為對 0x80 做 bit-wise AND 又 `PACKED_BYTE(0x80)` 會產生 `0x8080808080808080` 為針對每一個 byte 的偵測碼 ```c=12 uint64_t A = *chunk + PACKED_BYTE(128 - 'A' + VV2); uint64_t Z = *chunk + PACKED_BYTE(128 - 'Z' + VV3); uint64_t mask = ((A ^ Z) & PACKED_BYTE(VV4)) >> 2; ``` * `128 - 'A' + VV2` 被用來計算 `'A'` 到 128 的距離，令 `c` 為輸入的一個 byte (已知 c 為 ascii，`c < 128` ) 如果 `c + 128 - 'A' >= 128 (0x80) (1000_0000)` 則 `c - 'A' >= 0, c >= 'A'` 。`c + 128 - 'A'` 的 ==MSB 為 1 則 c >= 'A'== * `128 - 'Z' + VV3` 被用來計算 `'Z'` 到 128 的距離，令 `c` 為輸入的一個 byte (已知 c 為 ascii，`c < 128` ) 如果 `c + 128 - 'Z' >= 129 (0x81) (1000_0001)` 則 c 不是 ascii ，因為這樣不好用 bit-wise 做計算，故我們在不等式的兩邊減一，`c + 128 - 'Z' - 1 >= 128 (0x80) (1000_0000)`，如果 `c + 128 - 'Z' - 1` 的 ==MSB 是 1 則 c > 'Z'，MSB 是 0 則 c <= 'Z'== * 綜合以上兩個判斷條件，如果 `c + 128 - 'A'` 的 MSB 為 1 且 `c + 128 - 'Z' - 1` 的 MSB 為 0 則 c 滿足 `c >= 'A' 且 c <= 'Z'` c 是大寫字母 * 故使用 `A^Z` bit-wise XOR 來判斷上述邏輯，若 `c >= 'A' 且 c <= 'Z'` 則 c 被轉換成 `0x80=1000_0000` 反之 `c < 'A' 或 c > 'Z'` 則 c 被轉換成 `0x00=0000_0000` * `VV4=0x80` 被當作 mask 和 `(A^Z)` 做 bit-wise AND，接著 `0x80` 向右 shift 2 bit 變成 `0x20 (0010_0000)` 作為大小寫轉換的 bit flipping mask * `'A'=0x41` 翻轉右邊數來第六個 bit 變成 `'a'=0x61` ，`*chunk` 對 `0x20` 作 bit-wise XOR 將大寫轉換成小寫 * `VV1=0x80, VV2=0, VV3=-1, VV4=0x80` ### 延伸問題 * `將前述測試程式 main 函式的 char str[] 更換為 char *str，會發生什麼事？請參閱 C 語言規格書，說明 string literal 的行為` ## Test 6 ### 運作原理 ```c=1 int singleNumber(int *nums, int numsSize) { int lower = 0, higher = 0; for (int i = 0; i < numsSize; i++) { lower ^= nums[i]; lower &= KKK; higher ^= nums[i]; higher &= JJJ; } return lower; } ``` * 參考這篇 LeetCode 的討論 [Detailed explanation and generalization of the bitwise operation method for single numbers](https://leetcode.com/problems/single-number-ii/discuss/43295/Detailed-explanation-and-generalization-of-the-bitwise-operation-method-for-single-numbers) * 先討論只有一個 bit 的情況 * 假設輸入 i 只有一個 bit * 令 $X = \{x_m, x_{m-1}, ..., x_1\}$ 是一個 m-bit 計數器負責計算 i = 1 的次數 * 令 $K$ 是一個 1-bit mask * 假設 i = {1, 0, 1, 0, 1}，依序輸入五個數字，希望 m-bit 計數器能夠計算 i=1 的次數 * i=0 時 `X ^= i` X 對 i 做 bit-wise XOR 後 $X = \{0,0,...,0\}$ 符合我們的預期 * i=1 時 `X ^= i` X 對 i 做 bit-wise XOR 後 $X = \{1,1,...,1\}$ ，但我們想要的計數器應該是 $X = \{0,0,...,1\}$ 還需要加上額外的運算才能符合我們的預期 * 分析 $x_j = 1$ 的條件是 $x_{j-1} = x_{j-2} = ... = x_0 = 1$ 且 $i=1$ ，這時候 $x_j$ 才需要進位，所以得到下面的公式 * $x_1 = i \oplus x_1$ * $x_2 = x_2 \oplus (i \cdot x_1)$ * $x_3 = x_3 \oplus (i \cdot x_1 \cdot x_2)$ * $x_j = x_j \oplus (i \cdot x_1 \cdot x_2 \cdot ... \cdot x_{j-1})$ * 實際使用公式計算輸入 i = {1, 0, 1, 0, 1} 的情形： | | x5 | x4 | x3 | x2 | x1 | |-----|----|----|----|----|----| | i=1 | 0 | 0 | 0 | 0 | 1 | | i=0 | 0 | 0 | 0 | 0 | 1 | | i=1 | 0 | 0 | 0 | 1 | 0 | | i=0 | 0 | 0 | 0 | 1 | 0 | | i=1 | 0 | 0 | 0 | 1 | 1 | * 又 m-bit 計數器會一直數下去，但是我們只在意計數器數到 3 的次數就好，故需要使用 bit mask 來限制計數器 * mask 設計的想法是當計數器等於 3 時，mask 要把計數器的每一個 bit 遮成零，其他的條件下 mask 不會干擾計數器的運作(等於1) * 我們用 T 表示次數 $T = \{0, 0, 0, 1, 1\}$ * 遮罩對應的是 bit-wise AND 的運算 * $x_1 = k_1 \cdot x_1$ * $x_2 = k_2 \cdot x_2$ * $x_3 = k_3 \cdot x_3$ * $x_j = k_j \cdot x_j$ * mask K 的計算方法如下 \begin{align} K &= \neg (y_m \land y_{m-1} \land ... \land y_1) \\ &y_j = \begin{cases} x_j, & t_j = 1 \\ \neg x_j, & t_j = 0 \\ \end{cases} \end{align} * 如果 $T = \{0, 0, 0, 1, 1\}$ 則 \begin{align} K &= \neg (\neg x_5 \land \neg x_4 \land \neg x_3 \land x_2 \land x_1) \\ &= x_5 \lor x_4 \lor x_3 \lor \neg x_2 \lor \neg x_1 \\ \end{align} 代表只有在計數器的值等於 $X=\{0, 0, 0, 1, 1\}$ 時 $K=0$，計數器對 mask 做 bit-wise AND 之後計數器的值會被歸零，其他的情況下計數器不會被影響 * pseudo code 如下： ```c=1 int nums[5] = {1, 0, 1, 0, 1}; for (int i : nums ) { xm ^= (xm-1 & ... & x1 & i); xm-1 ^= (xm-2 & ... & x1 & i); ..... x1 ^= i; mask = ~(y1 & y2 & ... & ym) where yj = xj if kj = 1, and yj = ~xj if kj = 0 (j = 1 to m) xm &= mask; ...... x1 &= mask; } ``` * 接著我們要從 1-bit 的情況推廣到 32-bit 的情況 * 看到這個題目，直覺的反應是想要用 32 個 32 bit 計數器，計算每一個 bit 中出現 1 的次數，最後比較每一個 bit 中 1 出現的數目就可以知道哪一個數字指出現一次。但是這樣就不能善用 bit-wise operation 的優勢 * 使用一個 2 bit 計數器就能數到 3 次（00, 01, 10, 11），所以希望用 32 個 2 bit 計數器，每一個計數器被用來統計每一個 bit 中 1 出現的數目。 * 又因為 bit-wise operation 對每一個 bit 的操作都是獨立的，所以我們可以把這 32 個 2 bit 計數器分開來，變成 2 個 32 bit 的變數。這兩個變數在我們的程式碼裡頭被命名為 `lower` 和 `higher` 分別表示每一個 bit 的計數器的低位和高位 * 具體的示意圖如下：![](https://i.imgur.com/BZfGbW4.png) * ```graphviz digraph g{ // node node [ shape=rect, fillcolor=skyblue, style=filled ]; dot [label="..."]; // edge (hidden) b32->b31->b30->dot->b3->b2->b1 [style=invis, arrowhead=none]; // cluster_int subgraph cluster_int{ label="32-bit integer"; { rank=same b32 b31 b30 dot b3 b2 b1}; } subgraph counter{ // node node [fillcolor=green, shape=record]; _1_1 [label="1st"]; _2_1 [label="1st"]; _3_1 [label="1st"]; _dot_1 [label="..."]; _30_1 [label="1st"]; _31_1 [label="1st"]; _32_1 [label="1st"]; node [fillcolor=forestgreen, shape=record]; _1_2 [label="2nd"]; _2_2 [label="2nd"]; _3_2 [label="2nd"]; _dot_2 [label="..."]; _30_2 [label="2nd"]; _31_2 [label="2nd"]; _32_2 [label="2nd"]; // edge _1_2->_1_1->b1; _2_2->_2_1->b2; _3_2->_3_1->b3; _dot_2->_dot_1->dot[style=invis]; _30_2->_30_1->b30; _31_2->_31_1->b31; _32_2->_32_1->b32; } // lower subgraph cluster_l { label="lower"; {rank=same _1_1 _2_1 _3_1 _dot_1 _30_1 _31_1 _32_1} } // higher subgraph cluster_h { label="higher"; {rank=same _1_2 _2_2 _3_2 _dot_2 _30_2 _31_2 _32_2} } } ``` * 照上述邏輯的 32-bit 版本的實做程式碼如下 * ```c=1 int singleNumber(int *nums, int numsSize) { int lower = 0, higher = 0, mask = 0; for (int i = 0; i < numsSize; i++) { higher ^= lower & nums[i]; lower ^= nums[i]; mask = ~(lower & higher); higher &= mask; lower &= mask; } return lower; } ``` * `higher` `lower` 分別表示 2-bit 計數器的 lower, higher bit * 但是這和課堂上提供的版本有差異，接下來會推導兩者其實是相同的 * 對於 lower 的推導 ( higher同理 ) \begin{align} mask &= \neg ({ lower } \land { higher }) \\ lower &= lower \land mask \\ &= lower \land \neg ({ lower } \land { higher }) \\ &= lower \land ( \neg lower \lor \neg higher) \\ &= (lower \land \neg lower) \lor (lower \land \neg higher) \\ &= lower \land \neg higher \\ \end{align} * 所以計算 mask 之後再計算 lower 跟 mask 的 bit-wise AND 等價於 lower 對 ~higher 做 bit-wise AND * 所以課堂上的程式碼是精簡之後的版本，只使用到較少的運算 * `KKK=~higher` `JJJ=~lower` ### 延伸問題 * `LeetCode 執行結果正確不超時` ![](https://i.imgur.com/bA4QQZt.png) * `將重複 3 次改為改為重複 4 次或 5 次，並找出可容易推廣到多次的程式碼;` * 題目變更如下：Given a non-empty array of integers, every element appears ==r times== except for one, which appears exactly once. Find that single one. * 推廣到 r 次的[程式碼](https://github.com/supernovaremnant/quiz2/blob/master/test6.cc) 如下 ```c=1 int singleNumber(vector<U_INT> &nums, U_INT r) { // N is the size of counter U_INT N = 32 - __builtin_clz(r + 1); // array of counters vector<U_INT> cnt_arr(N); for (int i = 0; i < nums.size(); i++) { // compute the value of all counter // x_j = x_j ^ (x_j-1 & x_j-2 & ... & x_0 & nums[i]) for (int j = N - 1; j >= 0; j--) { U_INT tmp = nums[i]; for (int z = j - 1; z >= 0; z--) { tmp &= cnt_arr[z]; } cnt_arr[j] ^= tmp; } // compute the value of mask U_INT mask = 0xffffffff; U_INT one = 0x1; for (U_INT j = 0; j < N; j++) { if ((one & r) != 0) mask &= cnt_arr[j]; else mask &= (~cnt_arr[j]); one = one << 1; } mask = ~mask; // bit-wise & mask for (int j = N - 1; j >= 0; j--) { cnt_arr[j] &= mask; } } return cnt_arr[0]; } ```