# 2020q1 Homework2 (quiz2) Contributed by < [AdrianHuang](https://github.com/AdrianHuang/quiz2-linux2020) > ###### tags: `linux2020` ## 測試環境 ```shell $ cat /etc/os-release NAME="Ubuntu" VERSION="18.04.4 LTS (Bionic Beaver)" $ cat /proc/version Linux version 5.3.0-42-generic (buildd@lcy01-amd64-019) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #34~18.04.1-Ubuntu SMP Fri Feb 28 13:42:26 UTC 2020 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 72 On-line CPU(s) list: 0-71 Thread(s) per core: 2 Core(s) per socket: 18 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 79 Model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz Stepping: 1 CPU MHz: 2102.602 CPU max MHz: 2100.0000 CPU min MHz: 1200.0000 BogoMIPS: 4199.77 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 46080K NUMA node0 CPU(s): 0-17,36-53 NUMA node1 CPU(s): 18-35,54-71 $ free -h total used free shared buff/cache available Mem: 125G 1.8G 119G 19M 4.0G 122G Swap: 8.0G 0B 8.0G ``` ## 程式碼運作原理 ```cpp typedef union { /* allow strings up to 15 bytes to stay on the stack * use the last byte as a null terminator and to store flags * much like fbstring: * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md */ char data[16]; struct { uint8_t filler[15], /* how many free bytes in this stack allocated string * same idea as fbstring */ space_left : 4, /* if it is on heap, set to 1 */ is_ptr : 1, flag1 : 1, flag2 : 1, flag3 : 1; }; /* heap allocated */ struct { char *ptr; /* supports strings up to 2^54 - 1 bytes */ size_t size : 54, /* capacity is always a power of 2 (unsigned)-1 */ capacity : 6; /* the last 4 bits are important flags */ }; } xs; ``` `xs` union 定義 16 個位元組,應用與實作細節如下: * 字串長度小於或等於 15 個位元組,則放在 stack。 * 字串長度大於或等於 16 個位元組,則放在 heap (透過 malloc 系統呼叫配置所需字串大小) * heap struct * ptr: 8 個位元組指標 (64位元架構: x86_64/aarch64 等等)。 * size: 字串長度。因定義 54 位元,故最大字串長度為 2^54^ - 1 位元組。 * capacity: 從 heap 配置的空間大小,其單位為 2^capacity^,故用 6 個位元即可涵蓋 size 長度。 * 有四個位元沒有定義,是為了避免覆寫另一結構成員: `is_ptr`, `flag1`, `flag 2` 與 `flag3`。 ```cpp #define xs_literal_empty() \ (xs) { .space_left = 15 } static inline int ilog2(uint32_t n) { return 32 - __builtin_clz(n) - 1; } xs *xs_new(xs *x, const void *p) { *x = xs_literal_empty(); size_t len = strlen(p) + 1; if (len > 15) { x->capacity = ilog2(len) + 1; x->size = len - 1; x->is_ptr = true; x->ptr = malloc((size_t) 1 << x->capacity); memcpy(x->ptr, p, len); } else { memcpy(x->data, p, len); x->space_left = 15 - (len - 1); } return x; } ``` `xs_new()` 初始化給定的 `x` 並將字串 (`p`) 複製到 `data` 或 `ptr` (取決於字串長度是否大於 15)。 `ilog2` 計算給定數值的log2 (以 2 為底的對數)。該函式使用 count leading zero 計算。Linux 核心則使用 fls (find last bit set),詳見 [include/linux/log2.h](https://github.com/torvalds/linux/blob/master/include/linux/log2.h#L14) ```cpp /* * non-constant log of base 2 calculators * - the arch may override these in asm/bitops.h if they can be implemented * more efficiently than using fls() and fls64() * - the arch is not required to handle n==0 if implementing the fallback */ #ifndef CONFIG_ARCH_HAS_ILOG2_U32 static inline __attribute__((const)) int __ilog2_u32(u32 n) { return fls(n) - 1; } #endif #ifndef CONFIG_ARCH_HAS_ILOG2_U64 static inline __attribute__((const)) int __ilog2_u64(u64 n) { return fls64(n) - 1; } #endif ``` `capacity` 用來記錄最小可以滿足該字串長度的2^m^,其中 `m` 為 `ilog2(len) +1`。 ```cpp /* Memory leaks happen if the string is too long but it is still useful for * short strings. * "" causes a compile-time error if x is not a string literal or too long. */ #define xs_tmp(x) \ ((void) ((struct { \ _Static_assert(sizeof(x) <= 16, "it is too big"); \ int dummy; \ }){1}), \ xs_new(&xs_literal_empty(), "" x)) ... int main() { xs string = *xs_tmp("\n foobarbar \n\n\n"); xs_trim(&string, "\n "); printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string)); xs prefix = *xs_tmp("((("), suffix = *xs_tmp(")))"); xs_concat(&string, &prefix, &suffix); printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string)); return 0; } ``` `xs_tmp` 將傳入字串建立一個新的 union。此巨集相當有趣,包含諸多細節: * Static Assertions - `_Static_assert ( constant-expression , string-literal )`: C11規格書新增此關鍵字 (Keyword),詳見C11規格書的 "6.7.10 Static assertions" 。其中,constant-expression 結果不等於 `0`,則編譯時期不回報錯誤。反之,constant-expression 結果等於 `0`,則在編譯時期回報錯誤並輸出 `string-literal` 字串。此次作業範例程式可支援最大字串長度為 2^54^ - 1 位元組,但 `sizeof(x) <= 16` 限制最大傳入字串長度只能 `16` 位元組,如傳入字串長度大於 `16`,編譯時期會回報錯誤,實驗與對應修正如下所示。 ```diff @@ -187,7 +187,7 @@ xs *xs_trim(xs *x, const char *trimset) int main() { - xs string = *xs_tmp("\n foobarbar \n\n\n"); + xs string = *xs_tmp("\n foobarbar test n\n\n"); xs_trim(&string, "\n "); printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string)); ``` ```shell $ make gcc -g -c -o xs.o xs.c xs.c: In function ‘main’: xs.c:78:10: error: static assertion failed: "it is too big" _Static_assert(sizeof(x) <= 16, "it is too big"); \ ^ xs.c:190:18: note: in expansion of macro ‘xs_tmp’ xs string = *xs_tmp("\n foobarbar test n\n\n"); ^~~~~~ <builtin>: recipe for target 'xs.o' failed make: *** [xs.o] Error 1 ``` ```diff @@ -4,6 +4,8 @@ #include <stdlib.h> #include <string.h> +#define MAX_STR_LEN ((1UL << 54) - 1) + typedef union { /* allow strings up to 15 bytes to stay on the stack * use the last byte as a null terminator and to store flags @@ -73,11 +75,11 @@ xs *xs_new(xs *x, const void *p) * short strings. * "" causes a compile-time error if x is not a string literal or too long. */ -#define xs_tmp(x) \ - ((void) ((struct { \ - _Static_assert(sizeof(x) <= 16, "it is too big"); \ - int dummy; \ - }){1}), \ +#define xs_tmp(x) \ + ((void) ((struct { \ + _Static_assert(sizeof(x) <= MAX_STR_LEN, "it is too big"); \ + int dummy; \ + }){1}), \ xs_new(&xs_literal_empty(), "" x)) /* grow up to specified size */ ``` * Compound literals (詳見 C99 規格 6.5.2.5): 由兩個大部分組成 1) parenthesized type name: 用括號來描述該物件型態 2) brace-enclosed list of initializers: 用大括號來初始化對應的物件 ( C99 稱為 initializer: 詳見 C99 規格書 6.7.8 )。這兩個部分被稱為 `unnamed object`。 ```cpp #include <stdio.h> void func(void) { int *p = (int []) {7, 9}; printf("%d %d\n", *p, *(p + 1)); } int main(void) { func(); return 0; } ``` 此程式碼將 `p` 指向一個整數陣列的第一個元素位址。該整數陣列包含兩個元素,其值為 7 與 9。`(int []) {7, 9}` 又稱 `unnamed object`。此範例的 `compound literal` 屬於 `automatic storage duration` (即該物件生命週期只在該函式內,意即該物件被存放在 stack ),有關物件生命週期可參考 C99 規格書 6.2.4。 使用 objdump 確認該物件是否存放在 stack,如下所示 (已加註解): ```shell 00000000000006aa <func>: 6aa: 55 push %rbp 6ab: 48 89 e5 mov %rsp,%rbp 6ae: 48 83 ec 20 sub $0x20,%rsp 6b2: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax 6b9: 00 00 6bb: 48 89 45 f8 mov %rax,-0x8(%rbp) 6bf: 31 c0 xor %eax,%eax 6c1: c7 45 f0 07 00 00 00 movl $0x7,-0x10(%rbp) # move 7 to stack 6c8: c7 45 f4 09 00 00 00 movl $0x9,-0xc(%rbp) # move 9 to stack 6cf: 48 8d 45 f0 lea -0x10(%rbp),%rax # Get adress of the number '7' 6d3: 48 89 45 e8 mov %rax,-0x18(%rbp) # move address of the number to 'p' 6d7: 48 8b 45 e8 mov -0x18(%rbp),%rax 6db: 48 83 c0 04 add $0x4,%rax # p + 1 6df: 8b 10 mov (%rax),%edx # *(p + 1) 6e1: 48 8b 45 e8 mov -0x18(%rbp),%rax # Get the content of p 6e5: 8b 00 mov (%rax),%eax # *p 6e7: 89 c6 mov %eax,%esi 6e9: 48 8d 3d c4 00 00 00 lea 0xc4(%rip),%rdi # 7b4 <_IO_stdin_used+0x4> 6f0: b8 00 00 00 00 mov $0x0,%eax 6f5: e8 86 fe ff ff callq 580 <printf@plt> 6fa: 90 nop 6fb: 48 8b 45 f8 mov -0x8(%rbp),%rax 6ff: 64 48 33 04 25 28 00 xor %fs:0x28,%rax 706: 00 00 708: 74 05 je 70f <func+0x65> 70a: e8 61 fe ff ff callq 570 <__stack_chk_fail@plt> 70f: c9 leaveq 710: c3 retq ``` 使用 gdb 觀察該物件是否存放在 stack ,如下所示: ```shell (gdb) list 1 #include <stdio.h> 2 3 void func(void) 4 { 5 int *p = (int []) {7, 9}; 6 7 printf("%d %d\n", *p, *(p + 1)); 8 } 9 10 int main(void) (gdb) b 7 Breakpoint 1 at 0x6d7: file int_array.c, line 7. (gdb) run Starting program: /home/adrian/archive/jserv/linux2020/tmp/int_array Breakpoint 1, func () at int_array.c:7 7 printf("%d %d\n", *p, *(p + 1)); (gdb) info r rax 0x7ffffffee350 140737488282448 rbx 0x0 0 rcx 0x8000730 134219568 rdx 0x7ffffffee468 140737488282728 rsi 0x7ffffffee458 140737488282712 rdi 0x1 1 rbp 0x7ffffffee360 0x7ffffffee360 rsp 0x7ffffffee340 0x7ffffffee340 r8 0x7fffff3ecd80 140737475693952 r9 0x7fffff3ecd80 140737475693952 r10 0x2 2 r11 0x7 7 r12 0x80005a0 134219168 r13 0x7ffffffee450 140737488282704 r14 0x0 0 r15 0x0 0 rip 0x80006d7 0x80006d7 <func+45> eflags 0x246 [ PF ZF IF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 (gdb) ``` 從 rbp 與 rsp 暫存器可知 func() 使用 stack 空間 0x7ffffffee340-0x7ffffffee360 存放區域變數 `p` 與 compound literal,故將此區段內容印出來: 1. 區域變數 `p`: 其位址為0x7ffffffee348,內容為0x7ffffffee350 (也就是指向整數陣列第一個元素位址) 2. compound literal `(int []) {7, 9}`: 存在放在0x7ffffffee350。 ```shell (gdb) x/8x 0x7ffffffee340 0x7ffffffee340: 0xff4109a0 0x00007fff 0xfffee350 0x00007fff 0x7ffffffee350: 0x00000007 0x00000009 0xe2743500 0xbe27fcf3 ``` *** ```cpp #include <stdio.h> struct ab { int a; int b; }; void func1(struct ab tmp) { printf("%d %d\n", tmp.a, tmp.b); } void func2(struct ab *ptr) { printf("%d %d\n", ptr->a, ptr->b); } int main(void) { func1((struct ab) {.a = 1, .b = 2}); func2(&(struct ab) {.a = 3, .b = 4}); return 0; } ``` 此程式碼的 compound literal 使用 `designator` 初始化結構成員。 *** 回到此次作業程式碼,xs_literal_empty() 便是 compound literal,且 xs_tmp() 的 xs_new(&xs_literal_empty(), "" x) 透過 xs_literal_empty() 配置一個 xs union,並把其位址傳給 xs_new()。 * Comma operator (詳見 C99 規格 6.5.17): 逗號運算子左邊的運算結果會被當作 void 結果 (void expression),逗號運算子右邊的運算結果與型態會被當作最終運算結果與型態。 ```cpp #include <stdio.h> void func(int a) { printf("a: %d\n", a); } int main(void) { int t; func((t = 3, 5)); return 0; } ``` 執行結果: ``` a: 5 ``` ```cpp #include <stdio.h> void func(int a) { printf("a: %d\n", a); } int main(void) { int t; func((t = 3, t + 10)); return 0; } ``` 執行結果 ``` a: 13 ``` *** 回到此次作業程式碼,底下程式碼中,xs_tmp() 即使用 comma operator 定義兩個運算式,其中左邊運算式用來偵測字串長度是否超過所定義的長度,右邊運算式則呼叫 xs_new() 並回傳由 xs_literal_empty() 所初始化 union 的位址。依據 comma operator 特性,xs_tmp() -> xs_new(&xs_literal_empty(), "" x),最後再經由 `indirection * unary operator` 將其 union 的內容複製到 string 變數。 ```cpp /* Memory leaks happen if the string is too long but it is still useful for * short strings. * "" causes a compile-time error if x is not a string literal or too long. */ #define xs_tmp(x) \ ((void) ((struct { \ _Static_assert(sizeof(x) <= MAX_STR_LEN, "it is too big"); \ int dummy; \ }){1}), \ xs_new(&xs_literal_empty(), "" x)) ... xs string = *xs_tmp("\n foobarbar \n\n\n"); ``` ## 提供字串複製 (應考慮到 CoW) - [xs_copy()](https://github.com/AdrianHuang/quiz2-linux2020/blob/ddd7eb9a61c6a10a39b3a5d18471374a24cfc80e/xs.c#L287) xs_copy()實作字串複製功能,並分為三個策略 (參考[FBString.md](https://github.com/facebook/folly/blob/master/folly/docs/FBString.md)): * small strings (<= 15 chars) are stored in-situ without memory allocation. * Medium strings (16 - 255 chars) are stored in malloc-allocated memory and copied eagerly. * Large strings (> 255 chars) are stored in malloc-allocated memory and copied lazily. 對於大字串複製,xs_copy() 使用 CoW 機制,節省複製大字串執行時間,等到使用者要覆寫這個字串時,再進行複製字串動作,其分為兩大部分: 1. 複製 meta 資料: 複製 xs union (16 bytes) 2. 將此共享字串的 reference count 加一: 當執行 malloc() 分配記憶體時,會多分配 4 個位元組,用以記錄此字串的 reference count,其記憶體配置圖如下所示 (n = 字串長度)。其設計好處在於,不需要在 xs union 新增一個成員用以紀錄 reference count (每個 string 都有這個成員,此方法浪費記憶體空間)。 ![](https://i.imgur.com/wpkz9SR.png) ### 測試數據: wo/w CoW 底下為沒有 CoW 測試結果,總共花費 31.9 秒才執行結束: ```shell $ perf stat -e cache-misses,cache-references,instructions,cycles ./xs -d [foobarbar] : 9 [(((foobarbar)))] : 15 -------------- Small string -------------- copy string 10000 times, ref. count: 0 concatenate copied string for 100 sets, ref. count: 0 trim copied string for another 100 sets, ref. count: 0 ------------------------------------------ -------------- Medium string -------------- copy string 10000 times, ref. count: 0 concatenate copied string for 100 sets, ref. count: 0 trim copied string for another 100 sets, ref. count: 0 ------------------------------------------ -------------- Large string -------------- copy string 10000 times, ref. count: 0 concatenate copied string for 100 sets, ref. count: 0 trim copied string for another 100 sets, ref. count: 0 ------------------------------------------ Performance counter stats for './xs -d': 163,769,154 cache-misses # 11.146 % of all cache refs 1,469,306,796 cache-references 38,600,949,051 instructions # 0.58 insn per cycle 67,004,700,370 cycles 31.911450719 seconds time elapsed 7.671942000 seconds user 24.235817000 seconds sys ``` 底下為支援 CoW 測試結果,約0.8秒就能執行完畢: ```shell $ perf stat -e cache-misses,cache-references,instructions,cycles ./xs [foobarbar] : 9 [(((foobarbar)))] : 15 -------------- Small string -------------- copy string 10000 times, ref. count: 0 concatenate copied string for 100 sets, ref. count: 0 trim copied string for another 100 sets, ref. count: 0 ------------------------------------------ -------------- Medium string -------------- copy string 10000 times, ref. count: 0 concatenate copied string for 100 sets, ref. count: 0 trim copied string for another 100 sets, ref. count: 0 ------------------------------------------ -------------- Large string -------------- copy string 10000 times, ref. count: 10001 concatenate copied string for 100 sets, ref. count: 9901 trim copied string for another 100 sets, ref. count: 9801 ------------------------------------------ Performance counter stats for './xs': 2,301,522 cache-misses # 5.489 % of all cache refs 41,931,149 cache-references 1,076,763,293 instructions # 0.67 insn per cycle 1,607,658,444 cycles 0.765898061 seconds time elapsed 0.252623000 seconds user 0.513265000 seconds sys ``` ## 設計實驗,確認上述字串處理最佳化 (針對空間效率) 帶來的效益,應考慮到 locality of reference xs_copy() 實作 reference count 時,將其放在該字串後面。然而,字串指標跟此 reference count 經常被訪問,如果可以把這兩個記憶體分配在附近,就能享有 locality of reference 好處。因此,把原本的記憶體配置圖改為底下: ![](https://i.imgur.com/MDAMCJk.png) 對應實作可以參考此份提交: [Re-arrange memory layout of reference count](https://github.com/AdrianHuang/quiz2-linux2020/commit/fc0b832de53759e66397a1b6db977fd298b0d516)