[2020q3 Homework2 (quiz2)](https://hackmd.io/@sysprog/2020-quiz2)

# [2020q3 Homework2 (quiz2)](https://hackmd.io/@sysprog/2020-quiz2) contributed by < `zhu849` > ## 題目解析 ### ASCII table * 因為下面題目會頻繁使用到，所以放在這裡 ![](https://i.imgur.com/OuUdbwP.gif) ### 測驗 `1` #### 題目 > 目前電腦中用得最廣泛的字元集及其編碼，是美國國家標準局 (ANSI) 制定的 ASCII 碼，被國際標準化組織 (ISO) 採納為 ISO 646 標準。7 位碼 ASCII 是以 7 位元二進位數字進行編碼，可表示 128 個字元，第 128～255 號為擴展字元 (extended ASCII)。如果我們明確界定 7 位元編碼的字元隸屬於 ASCII，可透過以下函式來判斷指定的記憶體範圍內是否全是有效的 ASCII 字元: ```cpp= #include <stddef.h> bool is_ascii(const char str[], size_t size) { if (size == 0) return false; for (int i = 0; i < size; i++) if (str[i] & 0x80) /* i.e. (unsigned) str[i] >= 128 */ return false; return true; } ``` ```graphviz digraph structs { node [shape=record]; rankdir = LR; { rank="same"; D[label="{1|0|1|0|0|1|1|1}"]; A[shape=plaintext,fontcolor=black,label = "&"]; E[label="{1|0|0|0|0|0|0|0}"]; } B[shape=plaintext,fontcolor=black,label = "str[ ]"]; C[shape=plaintext,fontcolor=black,label = "0x80"]; B -> D[style=invisible,dir=none] C -> E[style=invisible,dir=none] } ``` > 其中 `str` 是開頭的記憶體地址，`size` 則是要檢驗的範圍。逐一字元比對的效率低，在 64 位元的架構下，我們嘗試一次比對 64 位元的資料 (即 word size)，程式碼改寫如下: ```cpp= #include <stdbool.h> #include <stddef.h> #include <stdint.h> #include <string.h> bool is_ascii(const char str[], size_t size) { if (size == 0) return false; int i = 0; while ((i + 8) <= size) { uint64_t payload; memcpy(&payload, str + i, 8); if (payload & MMM) return false; i += 8; } while (i < size) { if (str[i] & 0x80) return false; i++; } return true; } ``` * line 12：將 `str` 的第 `i` 個位置開始往後數八個位置的資訊搬移到 `payload` 的記憶體空間上 ==MMM== * `(a)` `0x0080008000800080` ```graphviz digraph structs { node [shape=record]; rankdir = LR; A[label="{0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|...}"]; B[shape=plaintext,fontcolor=black,label = "0x"]; B->A[style=invisible,dir=none] } ``` * `(b)` `0x8000800080008000` ```graphviz digraph structs { node [shape=record]; rankdir = LR; A[label="{1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|...}"]; B[shape=plaintext,fontcolor=black,label = "0x"]; B->A[style=invisible,dir=none] } ``` * `(c)` `0x0808080808080808` ```graphviz digraph structs { node [shape=record]; rankdir = LR; A[label="{0|0|0|0|1|0|0|0|0|0|0|0|1|0|0|0|...}"]; B[shape=plaintext,fontcolor=black,label = "0x"]; B->A[style=invisible,dir=none] } ``` * ***`(d)` `0x8080808080808080`*** ```graphviz digraph structs { node [shape=record]; rankdir = LR; A[label="{1|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|...}"]; B[shape=plaintext,fontcolor=black,label = "0x"]; B->A[style=invisible,dir=none] } ``` * `(e)` `0x8888888888888888` ```graphviz digraph structs { node [shape=record]; rankdir = LR; A[label="{1|0|0|0|1|0|0|0|1|0|0|0|1|0|0|0|...}"]; B[shape=plaintext,fontcolor=black,label = "0x"]; B->A[style=invisible,dir=none] } ``` * `(f)` `0xFFFFFFFFFFFFFFFF` ```graphviz digraph structs { node [shape=record]; rankdir = LR; A[label="{1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|...}"]; B[shape=plaintext,fontcolor=black,label = "0x"]; B->A[style=invisible,dir=none] } ``` * line 13, 14：為了要檢查從 memcpy 複製到 payload 的內容是有效的 ASCII 字元，每次都檢查 8 個 bits ，即便是已經放到 payload 的內容值也會一併檢查，再參考上面的 `is_ascii（）` function，所以選擇（d） :::warning 延伸問題: //TODO 1. 解釋上述程式的運作原理，特別是為何用到 memcpy 呢？提示: 與 data alignment 有關，詳閱 C 語言：記憶體管理、對齊及硬體特性 2. 將上述技巧應用於「給予一個已知長度的字串，檢測裡頭是否包含有效的英文大小寫字母」 3. 承 (2)，考慮常見的英文標點符號，判斷輸入字串是否為有效的字元集，避免逐一字元比對 >提示: Intel SSE 4.2 加入 STTNI (String and Text New Instructions) 指令，可加速字串比對，詳見: Schema Validation with Intel® Streaming SIMD Extensions 4 ::: ### 測驗 `2` > 開發解析器 (parser) 時，常會將十六進位表示的字串 (hexadecimal string) 轉為數值，例如 `'0xF'` (大寫 `F` 字元) 和 `'0xf'` (小寫 `f` 字元) 都該轉換為 `15`。考慮以下不需要分支 (branchless) 的實作: ```cpp= uint8_t hexchar2val(uint8_t in) { const uint8_t letter = in & MASK; const uint8_t shift = (letter >> AAA) | (letter >> BBB); return (in + shift) & 0xf; } ``` > 已知 `in` 一定隸屬於 `'0'`, `'1'`, `'2'`, …, `'9'` 及 `'a'`, `'b'`, `'c'`, …, `'f'` (小寫) 及 `'A'`, `'B'`, `'C'`, …, `'F'` (大寫) 這些字元。預期 `hexchar2val('F')` 和 `hexchar2val('f')` 回傳 `15`，且 `hexchar2val('0')` 回傳 `0`，其他輸入都有對應的數值。 > 以下摘自 ASCII 表格 `'0'`, `'1'`, `'2'`, …, `'9'` 對應到 `0x30`, `0x31`, `0x32`, … `0x39` `'a'`, `'b'`, `'c'`, …, `'f'` (小寫) 對應到 `0x61`, `0x62`, `0x63`, …, `0x66` `'A'`, `'B'`, `'C'`, …, `'F'` 對應到 `0x41`, `0x42`, `0x43`, …, `0x46` * line 3：將 `in` 和 `MACK` 做 and 運算後指派到 letter。先和 line 7 行為一起做觀察，可以猜測這步得到的 `letter` 可能是某個字元 * line 4：將得到的 `letter` 做 shift 運算，並使用到 or 運算組合成 shift，可以猜測這個步驟是做推移，意即將不論是大小寫輸入統一輸出成小寫 * 因為 line 3 - 5 無法直觀知道在執行什麼動作，這裡我使用帶入法輔助解題 * 考慮用 '0','5','9','a','c','f','A','C',F' 等值當作帶入值 * 題目能夠已知的內容只有 line 5: 將 `in` 跟 `shift` 的結果和 `0x0f` 做&運算，最後得到的值要是字元對應 hexadecimal 的值 * 考慮數字字元不需要做大小寫轉換，所以我們可以先將數字字元和英文字元區分開來再做處理 * 若考慮 `shift` 為 `0x00000000` 則 `'a'` 和 `'A'` 要回傳的值為 `0x0a` ==MASK== * `(a)` `0x10` * `(b)` `0x20` * ***`(c)` `0x40`*** * `(d)` `0x60` * `(e)` `0x80` :::danger 文字訊息不要用圖片展現！ :notes: jserv ::: :::success 已修正用圖表表示 ::: 如果選 `(a)`： | $char_{10}$ | $MASK_{10}$ |$char_2$ | $MASK_2$ | letter | | -------- | -------- |-------- |-------- |-------- | | '0' | 0x10 |00110000 |00010000 |00010000 | | '5' | 0x10 |00110101 |00010000 |00010000 | | '9' | 0x10 |00111001 |00010000 |00010000 | | 'a' | 0x10 |01100001 |00010000 |00000000 | | 'c' | 0x10 |01100011 |00010000 |00000000 | | 'f' | 0x10 |01100110 |00010000 |00000000 | | 'A' | 0x10 |01000001 |00010000 |00000000 | | 'C' | 0x10 |01000011 |00010000 |00000000 | | 'F' | 0x10 |01000110 |00010000 |00000000 | * 可以發現位元為數字的 result 皆為 `00010000`，其他英文字元不論大小寫皆為 `00000000` * 選擇這個選項可以讓我們區分數字和英文字元如果選 `(b)`： | $char_{10}$ | $MASK_{10}$ |$char_2$ | $MASK_2$ | letter | | -------- | -------- |-------- |-------- |-------- | | '0' | 0x20 |00110000 |00100000 |00100000 | | '5' | 0x20 |00110101 |00100000 |00100000 | | '9' | 0x20 |00111001 |00100000 |00100000 | | 'a' | 0x20 |01100001 |00100000 |00100000 | | 'c' | 0x20 |01100011 |00100000 |00100000 | | 'f' | 0x20 |01100110 |00100000 |00100000 | | 'A' | 0x20 |01000001 |00100000 |00000000 | | 'C' | 0x20 |01000011 |00100000 |00000000 | | 'F' | 0x20 |01000110 |00100000 |00000000 | * 可以發現大寫英文字元 result 皆為 `00000000`，其他字元輸出皆為 `00100000` * 選擇這個選項可以讓我們區分大寫英文字元和其他字元如果選 `(c)`： | $char_{10}$ | $MASK_{10}$ |$char_2$ | $MASK_2$ | letter | | -------- | -------- |-------- |-------- |-------- | | '0' | 0x40 |00110000 |01000000 |00000000 | | '5' | 0x40 |00110101 |01000000 |00000000 | | '9' | 0x40 |00111001 |01000000 |00000000 | | 'a' | 0x40 |01100001 |01000000 |01000000 | | 'c' | 0x40 |01100011 |01000000 |01000000 | | 'f' | 0x40 |01100110 |01000000 |01000000 | | 'A' | 0x40 |01000001 |01000000 |01000000 | | 'C' | 0x40 |01000011 |01000000 |01000000 | | 'F' | 0x40 |01000110 |01000000 |01000000 | * 可以發現位元為數字的 result 皆為 `00000000`，其他英文字元不論大小寫皆為 `01000000` * 選擇這個選項可以讓我們區分數字和英文字元如果選 `(d)`： | $char_{10}$ | $MASK_{10}$ |$char_2$ | $MASK_2$ | letter | | -------- | -------- |-------- |-------- |-------- | | '0' | 0x60 |00110000 |01100000 |00100000 | | '5' | 0x60 |00110101 |01100000 |00100000 | | '9' | 0x60 |00111001 |01100000 |00100000 | | 'a' | 0x60 |01100001 |01100000 |01100000 | | 'c' | 0x60 |01100011 |01100000 |01100000 | | 'f' | 0x60 |01100110 |01100000 |01100000 | | 'A' | 0x60 |01000001 |01100000 |01000000 | | 'C' | 0x60 |01000011 |01100000 |01000000 | | 'F' | 0x60 |01000110 |01100000 |01000000 | * 可以發現位元為數字的 result 皆為 `00100000`，其他英文小寫字元皆為 `01100000`，英文大學字元皆為 `01000000` * 選擇這個選項可以讓我們區分數字和大寫英文字元及小寫英文字元如果選 `(e)`： | $char_{10}$ | $MASK_{10}$ |$char_2$ | $MASK_2$ | letter | | -------- | -------- |-------- |-------- |-------- | | '0' | 0x80 |00110000 |10000000 |00000000 | | '5' | 0x80 |00110101 |10000000 |00000000 | | '9' | 0x80 |00111001 |10000000 |00000000 | | 'a' | 0x80 |01100001 |10000000 |00000000 | | 'c' | 0x80 |01100011 |10000000 |00000000 | | 'f' | 0x80 |01100110 |10000000 |00000000 | | 'A' | 0x80 |01000001 |10000000 |00000000 | | 'C' | 0x80 |01000011 |10000000 |00000000 | | 'F' | 0x80 |01000110 |10000000 |00000000 | * 可以發現所有的 result 皆為 `00000000` * 選擇這個選項無法讓我們有效辨別字元關係，所以刪去 * 綜合上述分析，為了要區分數字字元和字母字元我們可能的選擇是 `(a)`, `(c)`，但是如果選擇 `(a)` 得到的 letter 會為 `00000000` 在 line 4 的時候無法做有意義的操作(結果會皆為`00000000`)，所以這裡選 `(c)` ==AAA== * `(a)` `3` * `(b)` `2` * `(c)` `1` * `(d)` `0` ==BBB== * `(a)` `7` * `(b)` `6` * `(c)` `5` * `(d)` `4` * 已知得到的 `letter` 數字字元會為 `00000000`，英文字元會為 `01000000`，且數字字元在 line 4,5 可以得到正確的值(因為 shift = `00000000`，相當於用 `in` 直接和 `0x0f` 做 &運算)，所以以下只考慮大小寫的 `shift` 運算 * 觀察 line 5 可以發現若將 `'a'` 當作 `in` 預期回傳值是 `'0x0a'`，可以推得 `shift` 值要為 `0x00001001` 才能得到正確結果 * 則 `AAA` 的答案選擇 `(a)`， `BBB` 的答案選擇 `(b)` 可以使 `shift` 得到正確的值 `0x00001001` :::warning 延伸問題://TODO 1. 解釋上述程式的運作原理; 2. 將 hexchar2val 擴充為允許 "0xFF", "0xCAFEBABE", "0x8BADF00D" 等字串輸入，且輸出對應的十進位數值 >Hexspeak ::: ### 測驗 `3` > 除法運算的成本往往高於加法和乘法，於是改進的策略被引入。其中一種策略是將除法改為乘法運算，例如 $\dfrac{n}{d}$ 在分子和分母分別乘上 $2^N$ 後，得到等價的 $n$ x $\dfrac{\dfrac{2^N}{d}}{2^N}$，其中對 2 的 N 次方 (power of 2) 進行除法，在無號整數的操作就等同於右移 (shift) N 個位元，又當 d (除數) 的數值是已知的狀況下， $\dfrac{2^N}{d}$ 可預先計算，也就是說，$\dfrac{n}{d}$ 可轉換為乘法和位移操作，進而減少運算成本。不過，顯然 $\dfrac{2^N}{d}$ 無法總是用整數來表達 (僅在 d 是 power of 2 的時候才可)，因此，我們可先估算，並將差值補償回去。 > >重新檢視 $\dfrac{n}{d}=n × \dfrac{\dfrac{2^N}{d}}{2^N}$ ，當我們想將 n 除以 4 時，就相當於乘以 0.25，所以我們可將 $\dfrac{5}{4}$ 改為 5×0.25，我們得到整數的部分 (即 1)，和小數部分 (即 0.25)，後者乘以 4 就是 1，也就是餘數。下方程式碼展示這概念: ```cpp= const uint32_t D = 3; #define M ((uint64_t)(UINT64_C(XXX) / (D) + 1)) /* compute (n mod d) given precomputed M */ uint32_t quickmod(uint32_t n) { uint64_t quotient = ((__uint128_t) M * n) >> 64; return n - quotient * D; } ``` > 其中 `__uint128_t` 是 GCC extension，用以表示 128 位元的無號數值。 > 預期 `quickmod(5)` 會得到 `2`, `quickmod(55)` 得到 `1`, `quickmod(555)` 得到 `0` (`555` 是 3 的倍數)。 ==XXX== * `(a)` `0x00F000F000F00080` * `(b)` `0xF000F000F000F000` * `(c)` `0x0F0F0F0F0F0F0F0F` * `(d)` `0xF0F0F0F0F0F0F0F0` * `(e)` `0x8888888888888888` * `(f)` `0xFFFFFFFFFFFFFFFF` > [Macros for Integer Constant Expressions](https://pubs.opengroup.org/onlinepubs/009695399/basedefs/stdint.h.html) The following macros expand to integer constant expressions suitable for initializing objects that have integer types corresponding to types defined in the <stdint.h> header. Each macro name corresponds to a similar type name listed under Minimum-width integer types and Greatest-width integer types. ... The macro INTN_C( value) shall expand to an integer constant expression corresponding to the type int_least N _t. The macro UINTN_C( value) shall expand to an integer constant expression corresponding to the type uint_least N _t. For example, if uint_least64_t is a name for the type unsigned long long, then UINT64_C(0x123) might expand to the integer constant 0x123ULL. * 首先針對 `UINT64_C` 這個 macro 釐清，在不同平台或是不同編譯器的使用會使輸入值擴增至適合的長度 :::danger //TODO define那行為什麼要 +1 ??? 我需要補充其他知識或參考他人共筆才能回答這題 ::: ### 測驗 `4` :::danger 承上題 ::: ### 測驗 `5` > 考慮 `strlower` 函式的作用是將指定的字串 (或部分字串) 的英文字母全改為小寫，其 in-place 的實作如下: ```cpp= #include <ctype.h> #include <stddef.h> /* in-place implementation for converting all characters into lowercase. */ void strlower(char *s, size_t n) { for (size_t j = 0; j < n; j++) { if (s[j] >= 'A' && s[j] <= 'Z') s[j] ^= 1 << 5; else if ((unsigned) s[j] >= '\x7f') /* extended ASCII */ s[j] = tolower(s[j]); } } ``` > 針對 extended ASCII，我們呼叫 C 語言標準函式庫的 tolower，需要留意到在不同的語系 (locale)，字母順序和大小寫的定義可能異於我們認知的英文字母。以下摘自手冊: >> If c is a lowercase letter, toupper() returns its uppercase equivalent, if an uppercase representation exists in the current locale. Otherwise, it returns c. > >語系對程式碼的影響不能輕忽，舉例來說，若語系設定為捷克語，以 “Ch” (屬於 digraph，二合拉丁字母，常見於西歐語言) 開頭的字串要排在 “H” 之後，但單看字母的話，“C” 要在 “B” 之後。不過，如果明確只處理英美語系 (American/British English)，上述程式碼列表的第 9 及第 10 行可略去。 >>捷克字母 (依序) A a Á á B b C c Č č D d Ď ď E e É é Ě ě F f G g H h Ch ch I i Í í J j K k L l M m N n Ň ň O o Ó ó P p Q q R r Ř ř S s Š š T t Ť ť U u Ú ú Ů ů V v W w X x Y y Ý ý Z z Ž ž > > 在 64 位元處理器架構 (以下針對 x86_64, little endian)，我們可引入向量化 (vectorized) 的手段，避免一次比對一個字元，而是直接操作一整個 word (對 x86_64 來說，就是 64 bits，或 8 個位元組)。沿用上述的 strlower 函式，我們用這樣的思路實作向量化的 strlower，程式碼列表如下: ```cpp= #include <ctype.h> #include <stddef.h> #include <stdint.h> #define PACKED_BYTE(b) (((uint64_t)(b) & (0xff)) * 0x0101010101010101u) /* vectorized implementation for in-place strlower */ void strlower_vector(char *s, size_t n) { size_t k = n / 8; for (size_t i = 0; i < k; i++, s += 8) { uint64_t *chunk = (uint64_t *) s; if ((*chunk & PACKED_BYTE(VV1)) == 0) { /* is ASCII? */ uint64_t A = *chunk + PACKED_BYTE(128 - 'A' + VV2); uint64_t Z = *chunk + PACKED_BYTE(128 - 'Z' + VV3); uint64_t mask = ((A ^ Z) & PACKED_BYTE(VV4)) >> 2; *chunk ^= mask; } else strlower(s, 8); } k = n % 8; if (k) strlower(s, k); } ``` > 對應的測試程式碼如下: ```cpp= #include <stdio.h> #include <string.h> int main() { /* quote from https://www.gutenberg.org/files/74/74-h/74-h.htm */ char str[] = "This eBook is for the use of anyone anywhere at no cost and with \ almost no restrictions whatsoever. You may copy it, give it away or \ re-use it under the terms of the Project Gutenberg License included \ with this eBook or online at www.gutenberg.net"; int n = strlen(str); strlower_vector(str, n); puts(str); } ``` >參考執行輸出: >>this ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. you may copy it, give it away or re-use it under the terms of the project gutenberg license included with this ebook or online at www.gutenberg.net * `PACKED_BYTE(b)` 的函數定義是取出輸入 `b` 的後 8 bits，之後再乘型態為 unsigned 的 `0x0101010101010101`，會得到重複循環的 8bit 字節 ```graphviz digraph structs { node [shape=record]; rankdir = LR; E[label="{1|0|0|0|1|1|0|0|1|0|0|0|1|1|0|0|...|1|0|0|0|1|1|0|0|...}"] F[shape=plaintext,fontcolor=black,label = "結果"]; F->E[style=invisible,dir=none] C[label="{1|0|0|0|1|1|0|0}"] D[shape=plaintext,fontcolor=black,label = "字節"]; D->C[style=invisible,dir=none] A[label="{1|0|0|0|0|0|...|1|0|0|0|1|1|0|0}"]; B[shape=plaintext,fontcolor=black,label = "b"]; B->A[style=invisible,dir=none] } ``` ==VV1== * `(a)` `0x10` * `(b)` `0x20` * `(c)` `0x40` * `(d)` `0x60` * `(e)` ***`0x80`*** * 由註解搭配 `PACKED_BYTE(b)` 得知 line 13 是要去判斷輸入進來的 8個 bit 所產生的值是否為有效的 ASCII，又已知 ASCII 如果超過 128 就不會是有效字元。所以我們可以選用 `0x80` 來做 mask ，如果結果為1則 if 為 false，否則 if 為 true ==VV4== * `(a)` `0x10` * `(b)` `0x20` * `(c)` `0x40` * `(d)` `0x60` * `(e)` **`0x80`** * 由 variable `A` 和 `Z` 無法了解要做什麼事情，所以先觀察 line 17 的行為，發現很熟悉的 `^` 操作，聯想到課堂上的 `'A' ^ ' ' = 'a'`，代表 line 17 所做的行為就是將大寫轉成小寫 * 再觀察 line 16，`(A ^ Z) ` 的結果要和 `PACKED_BYTE(VV4)` 做 `&` 操作且結果的值要再做 right shift 2位 * 考慮要將 'A' 轉成 'a', 'Z' 轉成 'z', 'M' 轉成 'm'，不難發現大小寫差別在於左邊數來第三個位元，結合 line 16 需要做 right shift 2位，可以得知答案為 `(e) 0x80` = 0x$10000000$ ， right shift 2 bit 後為 0x $00100000$ ，在交給 line 17做處理 | char | Binary expression | | -------- | -------- | | 'A' | 01**0**0 0001 | | 'a' | 01**1**0 0001 | | 'M' | 01**0**0 1101 | | 'm' | 01**1**0 1101 | | 'Z' | 01**0**1 1010 | | 'z' | 01**1**1 1010 | * 但 line 13 的 if 判斷式，大小寫 ASCII 有效字元都能為 true，所以可以推斷 line 14, 15 是用來區分輸入字元是否為大寫字元 ==VV2== * `(a)` `(-1)` * `(b)` ***`0`*** * `(c)` `1` * 由 line 16 判斷得知，`(A^Z)` 要跟 `0x80` 做 `&`運算，所以我們只會在乎最左邊的 bit 是否為 1 * `A` 的目的是為了要判斷 `*chunk` 內容是否會大於 'A'，所以`PACKED_BYTE(128 - 'A' + VV2) = PACKED_BYTE(63 + VV2)` 的值為 `63 + VV2` 只要加上任何大於等於 65 的值會使最左邊的 bit 為 1，所以`VV2`不需要做任何調整，答案為 (b) 0 * `Z` 的目的是為了要判斷 `*chunk` 內容是否會大於 'Z'，所以`PACKED_BYTE(128 - 'Z' + VV3) = PACKED_BYTE(38 + VV3)` 的值為`38 + VV3` 只要加上任何大於等於 90 的值會使最左邊的 bit 為 1，但是等於90時對應到 'Z' ，所以`VV3`應該要為 (a) -1，這樣才會讓任何大於 90的值都會使最左邊的 bit 為 1 ==VV3== * `(a)` ***`(-1)`*** * `(b)` `0` * `(c)` `1` :::warning 延伸問題://TODO 1. 解釋上述程式的原理; 2. 將前述測試程式 main 函式的 char str[] 更換為 char *str，會發生什麼事？請參閱 C 語言規格書，說明 string literal 的行為 ::: ### 測驗 `6` > LeetCode 137. Single Number II: > Given a non-empty array of integers, every element appears three times except for one, which appears exactly once. Find that single one. Note: Your algorithm should have a linear runtime complexity. Could you implement it without using extra memory? Example 1: Input: [2,2,3,2] Output: 3 > >Example 2: Input: [0,1,0,1,0,1,99] Output: 99 考慮以下程式碼為 LeetCode 137. Single Number II 的題解: ```cpp= int singleNumber(int *nums, int numsSize) { int lower = 0, higher = 0; for (int i = 0; i < numsSize; i++) { lower ^= nums[i]; lower &= KKK; higher ^= nums[i]; higher &= JJJ; } return lower; } ``` * 已知 `^` 和 `&` 運算的順序有差，不可利用結合律 * 隨便舉個反例：假設 `A = 010110` , `B = 101011` ,`C = 100001` 則 `(A ^ B) & C = 100001`，而 `A ^ (B & C) = 110111` * 已知利用 `A^B` 運算的結果再和 `A` 進行一次 `^` 運算會的到 `B` * 意即 `(A^B)^B=A`,`(A^B)^A=B` * 先思考幾個點: * 為什麼需要有 `lower` 和 `higher` 兩個變數? * 為什麼 `lower` 和 `higher` 的運算邏輯相同? * 既然要求的數是僅出現一次，其餘都出現三次，那勢必相同三個數輸入程式後結果會為 `0` * 假設程式僅執行3次迴圈，且將迴圈攤開來觀察 * 將 `num[3] = {010110, 010110, 010110}` 當作例子帶入 * 思考以下程式： ```cpp= //first time lower ^= nums[0]; lower &= KKK; higher ^= nums[0]; higher &= JJJ; //second times lower ^= nums[1]; lower &= KKK; higher ^= nums[1]; higher &= JJJ; //third time lower ^= nums[2]; lower &= KKK; higher ^= nums[2]; higher &= JJJ; ... ``` 老實說看不出什麼規則，但可以知道的是經過這一連串的運算後 `lower` 的值必為 0，幫助下面的想出下面的狀態機。 * 假設 `lower` 和 `higher` 會是兩個集合，我們知道任一數在 `lower` 經過兩次`^` 操作會得到 0，想像 `lower` 是一個 buffer 且具有以下特性 * 用來存放出現到目前為止只出現一次的數值集合 * 若有相同的第二個數字出現，經過 `^` 操作可以將數字從這個集合消除 * 那我們可推測 `higher` 為另外一個集合，目的是存放到目前為止恰出現兩次的數值集合。那在 `higher` 中再對同一個數字做一次 `^` 則可以將數字從 `higher` 這個群消除，這樣就可以處理數字洽出現三次的情形 * 經過上面的討論，我們還有一個問題：雖然我們得到了一點點蛛絲馬跡，但是如何保證加入 `higher` 這個數字不會影響 `lower` 集合？ * 這就是 `&` 操作所達到的功能，我們加入數字到 `lower` 集合後要確認它不存在於 `higher` 集合內(反之， `higher` 內也要確認該數不在 `lower` 內)，所以在 `lower` 做完 `^` 運算後利用 `letter = letter & ~higher` 去排除已經存在於 `higher` 內的數值 * 所以 `lower` 和 `higher` 是兩個功能類似的集合，且一個數只能出現在其中一個集合內 | state num| state name | 說明 |狀態改變條件| | -------- | -------- | -------- | -------- | | 10 | higher | 存放輸入目前為止的洽出現兩次的數值集合| 若有一個數在 `higher` 集合內，且不在 `lower` 集合內，做完該輪運算後，該數會在 `higher` 集合被消除，並將該數移動到狀態 `11` | | 01 | lower | 存放輸入到目前為止的僅出現一次的數值集合| 若有一個數在 `lower` 集合內，且不在 `higher` 集合內，做完該輪運算後，該數會在 `lower` 集合被消除，並將該數移動到狀態 `10` | | 11 | initial | 數值未出現或恰好出現三次的數值集合| 當有新的數加入，判斷該數是否存在 `lower` 和 `higher`的集合，若都不存在則將該數移動到 `lower`，若存在則優先執行狀態 `10` 或 `01` | ```graphviz digraph structs { node [shape=record]; rankdir = LR; A[label="lower"]; B[label="higher"]; C[label="initial"]; C->A->B->C A->A[label="nothing"] B->B[label="nothing"] } ``` * 由以上說明解釋了 line 6 做的是「將數加入 `lower` 集合後，判斷 `higher` 中內是否已存在該值」，若存在則需要將 `lower` 內的該數消除掉 * line8 同理，僅是 `lower` 和 `higher` 角色對調 * 先前我們將 `higher` 想成一個集合，那`~higher` 的選項可以解釋成不在集合內的其他元素，若用 `~higher` 和 `lower` 做 `&` 運算的意義就是：判斷 `lower` 集合內和**非** `higher` 集合內的交集，這麼做自然可以將已存在 `higher` 內的數消除 * 反之 line 8 同理，僅是 `lower` 和 `higher` 角色對調 ==KKK== * `(a)` `higher` * `(b)` `lower` * ***`(c)` `~higher`*** * `(d)` `~lower` * `(e)` `0xFFFFFFFF` ==JJJ== * `(a)` `higher` * `(b)` `lower` * `(c)` `~higher` * ***`(d)` `~lower`*** * `(e)` `0xFFFFFFFF` :::warning 延伸問題://TODO 1. 解釋上述程式的原理; 2. 請撰寫不同上述的程式碼，並確保在 LeetCode 137. Single Number II 執行結果正確且不超時; 3. 延伸題意，將重複 3 次改為改為重複 4 次或 5 次，並找出可容易推廣到多次的程式碼; :::