common bugs - HackMD

# common bugs 鄭余玄補充：關於為定義行為介紹，可以參考：https://blog.regehr.org/archives/213 ## Problem 1 1. (Segmentation fault) - `char *strtok(char *restrict s1, const char *restrict s2);` 函式會修改到 `s1` 所指到的字串。 - `"this is a string"` 是一個 string literals 會分配於 static storage (C99 [6.4.5])，若嘗試修改內容則會導致未定義行為。 - `char *start = "this is a string";` 的`start` 是指向 static storage 的一個指標，因此編譯器不會提出存取警告，而是因為執行期間修改 string literals 造成未定義行為。 - `char start[] = "this is a string";` 中 `start` 是 `char[]` 型態語意會將資料分配於 stack 中（編譯器將 string literals 從 static storage 複製到 stack 中），因此可以做修改。 ```asm .LC0: .string " " .LC1: .string "this is a string" main: ; main push rax ; char *start2 = strtok(start, " "); mov esi, OFFSET FLAT:.LC0 mov edi, OFFSET FLAT:.LC1 call strtok ; } xor eax, eax pop rdx ret ``` ```asm .LC1: .string " " .LC0: .string "this is a string" main: ; main sub rsp, 40 ; char start[] = "this is a string"; mov esi, OFFSET FLAT:.LC0 mov ecx, 17 lea rdi, [rsp+15] rep movsb ; char *start2 = strtok(start, " "); lea rdi, [rsp+15] mov esi, OFFSET FLAT:.LC1 call strtok ; } xor eax, eax add rsp, 40 ret ``` 2. (Segmentation fault) `start2[4] = '\0'` 嘗試修改 string literals，導致未定義行為 (C99 [6.4.5])。 ## Problem 2 `/*` 為 comment 開頭 (C99 [6.4.9])，而且會以最長 token 進行預處理 (C99 [6.4])，因此 `return *a/*b /* a simple division */;` 移除 comment 後會變成 `return *a`。解決方法是避免 `/` 和 `*` 兩個算子相連，例如： `return *a/ *b /* a simple division */;`，以及： ```clike int x = *a/ *b; int y = x /* useless variable */; return x; ``` ## Problem 3 `strtok` (C99 [7.21.5.8]) 會將匹配到的字元以 null 填入，因此在執行完第一組 while 迴圈後，`string` 的空白會改成 `\0`。第二組 while 迴圈開始前的 `strtok` 會先指到第一個不在 token 的位置，因此最後會印出 `this`。 ## Problem 4 - `EOF` (C99 [7.19.1]) 是一個巨集，為 `int` 型態並且是負值，表示 end-of-file - `fgetc` (C99 [7.19.7.1]) 會回傳 `EOF` 或是 `int` (由 `unsigned char` promote) 此題中 `c` 的型態是 `char`，在 glibc 實作中 `EOF` 為 `-1` (https://www.gnu.org/software/libc/manual/html_node/EOF-and-Errors.html)，因此 `(char) -1 == 255`，導致誤判 `255` 為 `EOF` 提前結束迴圈。此題解法為宣告 `int c;`，或甚至在 glibc 實作下使用 `ssize_t c;`，可以避免誤判 `EOF`。此外，若寫檔案改為 ```clike for (i = 255; i >= 0; i--) { fputc(i, fp); } ``` 則 `count` 為 `0`。 ## Problem 5 在 x86_64 gcc 4.8.5 中，若是使用 `O1` 以上優化 flag（例如 `O2`, `O3`），會在編譯時期就做完優化，直接輸出 `yes`： ```asm .LC0: .string "yes" main: mov edi, OFFSET FLAT:.LC0 xor eax, eax jmp printf ``` 這題應該是要考 machine epsilon，當進行浮點數比較時，建議使用 `abs(a * 3, b) < 1e-10` 方式。使用 `-O0` flag 建議編譯器不優化，則才會進行運算比較。但是若稍微追蹤 `-O0` 組譯結果，有使用到 `xmm` 暫存器 (SSE 指令集)。因此再多下一個參數：`-O0 -mno-sse`，此時會直接使用 `fmulp` 等指令運算，結果就會是 `no` 了。 ## Problem 6 Integer constants (C99 [6.4.4.1]) 中定義 octal-constant 為 0 開頭後面接 0~7 的數字。 ![](https://i.imgur.com/v45fodT.png) 因此 `035731603` 的十進位表示是 `7844739`。若要直接印出八進位可以使用： ```clike printf("my lab’s telephone number is %#lo\n", lab_tel); ``` 其中 `#` 會印出前綴 `0`，`lo` 會印出八進位的 `unsigned long int`。 ## Problem 7 此題會分別會輸出 `yes` 和 `no`。 1. `a == &a` `a` 是 pointer to int (`int *`)，指向 array 第一個元素 (C99 [6.3.2.1])。 `&a` 是 pointer to an array (`int (*)[10]`)，也就 address of array object（array object 起始位置），因此會和第一個元素位置相同。此外在規格書的 Equality operators (C99 [6.5.9]) 中也提到： > Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, ... 2. `a + 1 == &a + 1` `a + 1` 就是 `a` 中第二個元素的位置，等於 `&(a[1])`。 `&a + 1` 是 pointer to an array 的下一個元素位置，因此跨越 4 * 10 bytes (`int [10]`)。 ## Problem 8 CRLF (`\r\n`) LF (`\n`) 補充參考：https://stackoverflow.com/questions/5813301/how-c-output-lf-to-stdout-without-being-changed-to-cr-lf ## Problem 9 `char` 不論是採用 `signed char` (`-128~127`) 或是 `unsigned char` (`0~255`) (C99 [5.2.4.2.1])，經過 integer promotion (C99 [6.3.1.1]) 後比較都是小於 `256`。此外，`char` 的 increment 行為是：`c` 會 integer promotion 成 `int` 後加一，之後轉回 `char`。若 `char` 是 `unsigned` 則值會是同餘在值域中 (C99 [6.3.1.3])，若是 `signed` 則是 implementation-defined。gcc 是採用同餘在值域 (https://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html#Integers-implementation)。補充：x86 下 `char` 和 `signed char` 有相同值域範圍，但是可以透過 `-fsigned-char` 和 `-funsigned-char` 編譯參數強制決定。 ## Problem 10 可以使用 `gcc -E -P test.c` 指令，觀察預處理後的結果。 `square(3 + 4)` 在巨集展開後會成為 `(i + j * i + j)`，因此是 `19`。可能修正定義為 `#define square(x) ((x) * (x))`，但是此種修正會造成 `square(inc(i))` 巨集展開後， `((((i)++)) * (((i)++)))` 運算式為未定義行為。因此需要進一步修正為 `#define square(x) ({ typeof(x) y = (x); y * y; })` 避免在巨集中 double evaluation。但是此種寫法還要小心 `y` 不會和區域變數撞名，因此： `#define square(x) ({ typeof(x) _y = (x); _y * _y; })` ## Problem 11 data alignment 在 x86_64 是 8 byte alignment 8 + 8 = 16 8 + 8 + 8 = 24 在 x86 是 4 byte alignment 4 + 4 + 8 = 16 4 + 8 + 4 + 4 = 20 （可以透過 `-O3 -m32` 驗證）此外 x86 允許 data unalignment，讓兩者都為 15。 ```clike #pragma pack(1) struct csie { char c; short s; int i; double e; }; struct ceis { char c; double e; int i; short s; }; #pragma pack(pop) int main(void) { printf("csie = %d\n", sizeof(struct csie)); printf("ceis = %d\n", sizeof(struct ceis)); return 0; } ``` ## Problem 12 `source` 和 `destination` 都是在 stack 上的變數，因為 `strcpy` 本身並不會對位置做檢查，所以 `destination` 實際只存有 `This`，而 `source` 被覆蓋為 `is a string.\0ng.\0`。在印出 `destination` 時會印出到 `\0`，因此有完整的「`This is a string.`」，而印出 `source` 只會印到 `is a sting.`。在 gcc 中若使用 `-O1` 以上 flag，則會顯示 `*** buffer overflow detected ***: ./test terminated`。甚至在 `-O3` 編譯時，會直接顯示： ```shell chengscott@cgi /tmp> gcc test.c -o test -O3 In file included from /usr/include/string.h:635:0, from test.c:2: In function ‘strcpy’, inlined from ‘main’ at test.c:8:13: /usr/include/x86_64-linux-gnu/bits/string3.h:110:10: warning: call to __builtin___memcpy_chk will always overflow destination buffer return __builtin___strcpy_chk (__dest, __src, __bos (__dest)); ^ ``` 透過 `readelf -s ./test` 觀察 symbol 可以發現，會多出 `__strcpy_chk@@GLIBC_2.3.4` 函式。此函式會檢查 buffer overflow，可以透過 `gcc test.c -o test -Os -D_FORTIFY_SOURCE=0` 不進行檢查。此時輸出為 ```shell i is 5 source is [.] destination is [This is a string.] ``` 與上述討論有出入，但是 clang 輸出與上述討論相同。因為編譯器可以多分配空間到 stack 上，所以 gcc 在 stack 上有額外配置多餘空間。 gcc 編譯時額外再加上 `-Os`：`gcc test.c -o test -D_FORTIFY_SOURCE=0 -Os` ```shell i is 5 source is [ is a string.] destination is [This is a string.] ``` 結果就會如上述討論的。 ## Problem 13 在開始之前，稍微提一下標頭檔最好加上 header guard。 `main.c` 和 `impl.c` 都有 `#include "header.h"`，經過編譯器 preprocess 後，兩個 compilation unit 都有 `static int val = 0;` 的變數宣告。`static` 表示變數是 internal linkage，因此兩個檔案的 `val` symbol 都不會參與鏈結，`main` 有獨自一個 `val` 變數，`impl` 有獨自一個 `val` 變數。解決方式是讓 `val` 使用 external linkage 方式，讓兩個檔案都鏈結到相同 symbol 上。例如，宣告為 `extern int val = 0;`、或是因為剛好在 file scope，所以直接宣告為 `int val = 0`。 ## Problem 14 `fgets` 會讀到 trailing newline。 ## Problem 15 - `int` 範圍為 -2147483648 ~ 2147483647 - `unsigned int` 範圍為 0 ~ 4294967295 (3) `ui + 1 > i + 1` 因為表達式左邊出現 unsigned int，所以算式兩側型態都會 promote (C99 [6.3.1.8]) 成 unsigned int：`2147483648 > 2147483648` 為非。 (2) `ui + 1 > 0` `ui + 1` 為 `2147483648` 大於零，會輸出 `ui + 1 > 0`。 (1) `i + 1 < 0` `i + 1` 會溢位，因此是 undefined behavior。 clang 編譯器都會印出 `i + 1 < 0` 字串（有可能是 `-2147483648`）。 GNU gcc、Intel icc 都不會印出 `i + 1 < 0` 字串。 gcc 如果在 `-O0` 下印出組語，會直接被優化移項成為 `cmpl $-1, -4(%rbp)`，也就是數學式的 `i < -1`。在 `-O1` 以上，甚至直接跳過比較，直接不輸出。此外可以在編譯時指定 `-fwrapv`，讓 signed integer overflow 實作行為確定。若額外宣告 `int w = 0;`，接著判斷 `i + 1 < w`，此時 gcc 和 icc 都會正常進行比較： `cmpl -12(%rbp), %eax`。在 `-O1` 以上優化，都會直接印出 `i + 1 < 0` 字串。 ## Problem 16 兩者都會輸出。因為運算式左側出現 unsigned int，所以運算式兩側型態都會 integer promote (C99 [6.3.1.1]) 成 unsigned int。 `(unsigned int) -1 == 4294967295` (是定義行為)，因此 `ui + 1 == 2147483648` 會小於無號整數的 `-1`。 ## Problem 17 對於非正整數進行 `>>` 運算會得到未定義的值 (C99 [6.2.6.2])。 `i` 的型態是有號整數，因此 `>>` 會進行算數右移（補上 signed bit）。 `-13 / 2` 會 round 為 `-6`，而 `-13 >> 1` 會是 `-7`。解決方式可以透過 signed bit 判斷正負：`(i + ((i >> 31) & 1)) >> 1` 更廣義的做法，除以 $2^n$ 的數字：`(x + ((x >> 31) & ((1 << n) + ~0))) >> n` source: https://stackoverflow.com/questions/5061093/dividing-by-power-of-2-using-bit-shifting ## Problem 18 `int` 範圍為 (-2147483647, 2147483648)，而因為 `-2147483640` 和 `50` 或 `100` 相減會溢位為極大正值，所以排序後為 50 < 100 < -2147483640。此處須小心 `int` 溢位是未定義行為，可以透過 `-fwrapv` 參數指定實作行為。 ## Problem 19 `__FILE__` 是 gcc 提供的巨集 (https://gcc.gnu.org/onlinedocs/cpp/Standard-Predefined-Macros.html)，會展開為當前檔案。因此此程式會印出程式碼內容。 ## Problem 20 首先，定義巨集時運算式最好都加上括號，避免意料之外的展開： `#define SWAP(x, y) ((x) ^= (y) ^= (x) ^= (y))` 在進行 xor swap 前，要先確認 `(x)` 和 `(y)` 位置不同，避免 xor 後直接為 `0`。 `#define SWAP(x, y) ((&(x) == &(y)) ? (x): ((x) ^= (y) ^= (x) ^= (y)))` 或是直接使用 logical `&&` 或 `||` 避免 branching： `#define SWAP(x, y) ((&(x) == &(y)) || ((x) ^= (y) ^= (x) ^= (y)))` `#define SWAP(x, y) ((&(x) ^ &(y)) && ((x) ^= (y) ^= (x) ^= (y)))` source: https://graphics.stanford.edu/~seander/bithacks.html#SwappingValuesSubAdd 可以再搭配第十題作法，避免巨集中 operand 重複 evaluation、和與區域變數撞名。 ## Problem 21 在 Assignment operators (C99 [6.5.16]) 提到，此種行為是未定義行為。 > The order of evaluation of the operands is unspecified. If an attempt is made to modify the result of an assignment operator or to access it after the next sequence point, the behavior is undefined ## Problem 22 `defualt` 為 `default` 錯字。但是這個程式仍然可以成功編譯，因為 `defualt:` 是合法的 Labeled statements (C99 [6.8.1]) ![](https://i.imgur.com/ii6lQ1w.png) `switch` 陳述式 (C99 [6.8.4.2]) 只和 `case` 和 `default` 有關；而 `goto` 陳述式 (C99 [6.8.6.1]) 可以直接跳到對應的 label (此時為 `defualt:`)。 ## Problem 23 `bar` 函數中的 `i` 和 `temp` 變數生命週期在離開函數後就結束了（此兩變數配置於 stack 上）。當變數在生命週期以外存取時，此時值為未定 (C99 [6.2.4])。回傳指標會指到 `bar` 函數 stack 上變數位置，但此時值已經是未定的 (undetermined)。以 `gcc -O0` 為例，結果為 ```shell temp is 545574124, (*temp) is 10 a is 545574124, (*a) is 10 a is 545574124, (*a) is 30 ``` （因為 `a` 剛好指到 `bar` stack 上 `i` 的位置）但是 `gcc -O3` 是 ```shell temp is 86999044, (*temp) is 10 a is 86999044, (*a) is 10 a is 86999044, (*a) is 10 ``` 修正方式是確保變數的生命週期，例如回傳 `int` 值、全域變數指標、`static` 變數等等。 ## Problem 24 `scanf("%d", &j)` 會使用到 4 byte 的 stack 空間，但是 `j` 只有 1 byte，因此有可能修改到 `i`，造成 ```shell chengscott@cgi /tmp> ./test 3 no. *** stack smashing detected ***: ./test terminated fish: “./test” terminated by signal SIGABRT (Abort) ``` `$rbp` (`j`) 是在 `0x0x7fffffffec00` ```shell x/8x $rbp - 16 0x7fffffffebf0: 0xffffece0 0x00007fff 0x00000000 0x01000000 0x7fffffffec00: 0x00000000 0x00000000 0xf7a32f45 0x00007fff x/8x $rbp - 16 0x7fffffffebf0: 0xffffece0 0x00007fff 0x00000000 0x00030000 0x7fffffffec00: 0x00000000 0x00000000 0xf7a32f45 0x00007fff ``` 解決方式是分配足夠的空間，例如宣告 `int j;`。 ## Problem 25 `-3 % 2 == -1` 因此不會進入判斷式。可以透過檢查 LSB 為 `1` 是否為奇數：`if (i & 1 == 1)` 此外，有可能遇到[參考網址](http://blog.ez2learn.com/2011/07/08/%E4%BD%A0%E6%B2%92%E9%81%87%E9%81%8E%E7%9A%84%E7%B7%A8%E8%AD%AF%E5%99%A8%E9%AC%BC%E6%89%93%E7%89%86%EF%BC%8C%E8%A8%B1%E5%8A%9F%E8%93%8B%E5%95%8F%E9%A1%8C/)中的「許功蓋問題」。在較早期的編譯器，尚未支援 unicode 時，「許」、「功」、「蓋」等字在 Big5 編碼下，若以 ASCII 轉義尾碼 byte 會是 `\`，恰好在讓註解自動延伸到下一行。 ## Problem 26 The ## operator (c99 [6.10.3.3]) 巨集字串黏合 1. ```clike int main(a) { int _x=a; printf("int"" %d",_x);; return 0; } ``` 額外補充，通常有多個 statment 的巨集，可以改進為 `#define test(x) do { x _x=a; pr##x##f(#x" %d",_x); } while(0)` 如此以來此巨集可以放在 if-else 語句上。 2. 用在生成簡單函數，並且註冊 function pointer。 ```c++ using namespace std; typedef void(*fn)(); map<string, fn> m; void cmd_quit() { printf("cmd: ""quit""\n"); }; void cmd_help() { printf("cmd: ""help""\n"); }; int main() { m["quit"] = cmd_quit;; m["help"] = cmd_help;; string cmd; while ( getline(cin,cmd) ){ if ( m.count(cmd) ) (*m[cmd])(); else printf("Not support %s\n", cmd.c_str()); } return 0; } ``` `##` 巨集可以應用在 c 生成類似 c++ template 函數。例如：https://stackoverflow.com/questions/10950828/simulation-of-templates-in-c-for-a-queue-data-type ```clike #include <stdio.h> #define TEMPLATE_SQUARE(T) \ T square_##T(T x) { \ return x * x; \ } TEMPLATE_SQUARE(int) TEMPLATE_SQUARE(float) int main(void) { printf("%d\n", square_int(5)); printf("%f\n", square_float(12.3)); return 0; } ``` 在 glibc 函式庫中也可以見到大量的 `##` 來接和各種訊息，在此不贅述了。