# 2020q1 Homework2 (quiz2)
Contributed by < [AdrianHuang](https://github.com/AdrianHuang/quiz2-linux2020) >
###### tags: `linux2020`
## 測試環境
```shell
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
$ cat /proc/version
Linux version 5.3.0-42-generic (buildd@lcy01-amd64-019)
(gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1))
#34~18.04.1-Ubuntu SMP Fri Feb 28 13:42:26 UTC 2020
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 2102.602
CPU max MHz: 2100.0000
CPU min MHz: 1200.0000
BogoMIPS: 4199.77
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
$ free -h
total used free shared buff/cache available
Mem: 125G 1.8G 119G 19M 4.0G 122G
Swap: 8.0G 0B 8.0G
```
## 程式碼運作原理
```cpp
typedef union {
/* allow strings up to 15 bytes to stay on the stack
* use the last byte as a null terminator and to store flags
* much like fbstring:
* https://github.com/facebook/folly/blob/master/folly/docs/FBString.md
*/
char data[16];
struct {
uint8_t filler[15],
/* how many free bytes in this stack allocated string
* same idea as fbstring
*/
space_left : 4,
/* if it is on heap, set to 1 */
is_ptr : 1, flag1 : 1, flag2 : 1, flag3 : 1;
};
/* heap allocated */
struct {
char *ptr;
/* supports strings up to 2^54 - 1 bytes */
size_t size : 54,
/* capacity is always a power of 2 (unsigned)-1 */
capacity : 6;
/* the last 4 bits are important flags */
};
} xs;
```
`xs` union 定義 16 個位元組,應用與實作細節如下:
* 字串長度小於或等於 15 個位元組,則放在 stack。
* 字串長度大於或等於 16 個位元組,則放在 heap (透過 malloc 系統呼叫配置所需字串大小)
* heap struct
* ptr: 8 個位元組指標 (64位元架構: x86_64/aarch64 等等)。
* size: 字串長度。因定義 54 位元,故最大字串長度為 2^54^ - 1 位元組。
* capacity: 從 heap 配置的空間大小,其單位為 2^capacity^,故用 6 個位元即可涵蓋 size 長度。
* 有四個位元沒有定義,是為了避免覆寫另一結構成員: `is_ptr`, `flag1`, `flag 2` 與 `flag3`。
```cpp
#define xs_literal_empty() \
(xs) { .space_left = 15 }
static inline int ilog2(uint32_t n) { return 32 - __builtin_clz(n) - 1; }
xs *xs_new(xs *x, const void *p)
{
*x = xs_literal_empty();
size_t len = strlen(p) + 1;
if (len > 15) {
x->capacity = ilog2(len) + 1;
x->size = len - 1;
x->is_ptr = true;
x->ptr = malloc((size_t) 1 << x->capacity);
memcpy(x->ptr, p, len);
} else {
memcpy(x->data, p, len);
x->space_left = 15 - (len - 1);
}
return x;
}
```
`xs_new()` 初始化給定的 `x` 並將字串 (`p`) 複製到 `data` 或 `ptr` (取決於字串長度是否大於 15)。
`ilog2` 計算給定數值的log2 (以 2 為底的對數)。該函式使用 count leading zero 計算。Linux 核心則使用 fls (find last bit set),詳見 [include/linux/log2.h](https://github.com/torvalds/linux/blob/master/include/linux/log2.h#L14)
```cpp
/*
* non-constant log of base 2 calculators
* - the arch may override these in asm/bitops.h if they can be implemented
* more efficiently than using fls() and fls64()
* - the arch is not required to handle n==0 if implementing the fallback
*/
#ifndef CONFIG_ARCH_HAS_ILOG2_U32
static inline __attribute__((const))
int __ilog2_u32(u32 n)
{
return fls(n) - 1;
}
#endif
#ifndef CONFIG_ARCH_HAS_ILOG2_U64
static inline __attribute__((const))
int __ilog2_u64(u64 n)
{
return fls64(n) - 1;
}
#endif
```
`capacity` 用來記錄最小可以滿足該字串長度的2^m^,其中 `m` 為 `ilog2(len) +1`。
```cpp
/* Memory leaks happen if the string is too long but it is still useful for
* short strings.
* "" causes a compile-time error if x is not a string literal or too long.
*/
#define xs_tmp(x) \
((void) ((struct { \
_Static_assert(sizeof(x) <= 16, "it is too big"); \
int dummy; \
}){1}), \
xs_new(&xs_literal_empty(), "" x))
...
int main()
{
xs string = *xs_tmp("\n foobarbar \n\n\n");
xs_trim(&string, "\n ");
printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string));
xs prefix = *xs_tmp("((("), suffix = *xs_tmp(")))");
xs_concat(&string, &prefix, &suffix);
printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string));
return 0;
}
```
`xs_tmp` 將傳入字串建立一個新的 union。此巨集相當有趣,包含諸多細節:
* Static Assertions - `_Static_assert ( constant-expression , string-literal )`: C11規格書新增此關鍵字 (Keyword),詳見C11規格書的 "6.7.10 Static assertions" 。其中,constant-expression 結果不等於 `0`,則編譯時期不回報錯誤。反之,constant-expression 結果等於 `0`,則在編譯時期回報錯誤並輸出 `string-literal` 字串。此次作業範例程式可支援最大字串長度為 2^54^ - 1 位元組,但 `sizeof(x) <= 16` 限制最大傳入字串長度只能 `16` 位元組,如傳入字串長度大於 `16`,編譯時期會回報錯誤,實驗與對應修正如下所示。
```diff
@@ -187,7 +187,7 @@ xs *xs_trim(xs *x, const char *trimset)
int main()
{
- xs string = *xs_tmp("\n foobarbar \n\n\n");
+ xs string = *xs_tmp("\n foobarbar test n\n\n");
xs_trim(&string, "\n ");
printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string));
```
```shell
$ make
gcc -g -c -o xs.o xs.c
xs.c: In function ‘main’:
xs.c:78:10: error: static assertion failed: "it is too big"
_Static_assert(sizeof(x) <= 16, "it is too big"); \
^
xs.c:190:18: note: in expansion of macro ‘xs_tmp’
xs string = *xs_tmp("\n foobarbar test n\n\n");
^~~~~~
<builtin>: recipe for target 'xs.o' failed
make: *** [xs.o] Error 1
```
```diff
@@ -4,6 +4,8 @@
#include <stdlib.h>
#include <string.h>
+#define MAX_STR_LEN ((1UL << 54) - 1)
+
typedef union {
/* allow strings up to 15 bytes to stay on the stack
* use the last byte as a null terminator and to store flags
@@ -73,11 +75,11 @@ xs *xs_new(xs *x, const void *p)
* short strings.
* "" causes a compile-time error if x is not a string literal or too long.
*/
-#define xs_tmp(x) \
- ((void) ((struct { \
- _Static_assert(sizeof(x) <= 16, "it is too big"); \
- int dummy; \
- }){1}), \
+#define xs_tmp(x) \
+ ((void) ((struct { \
+ _Static_assert(sizeof(x) <= MAX_STR_LEN, "it is too big"); \
+ int dummy; \
+ }){1}), \
xs_new(&xs_literal_empty(), "" x))
/* grow up to specified size */
```
* Compound literals (詳見 C99 規格 6.5.2.5): 由兩個大部分組成 1) parenthesized type name: 用括號來描述該物件型態 2) brace-enclosed list of initializers: 用大括號來初始化對應的物件 ( C99 稱為 initializer: 詳見 C99 規格書 6.7.8 )。這兩個部分被稱為 `unnamed object`。
```cpp
#include <stdio.h>
void func(void)
{
int *p = (int []) {7, 9};
printf("%d %d\n", *p, *(p + 1));
}
int main(void)
{
func();
return 0;
}
```
此程式碼將 `p` 指向一個整數陣列的第一個元素位址。該整數陣列包含兩個元素,其值為 7 與 9。`(int []) {7, 9}` 又稱 `unnamed object`。此範例的 `compound literal` 屬於 `automatic storage duration` (即該物件生命週期只在該函式內,意即該物件被存放在 stack ),有關物件生命週期可參考 C99 規格書 6.2.4。
使用 objdump 確認該物件是否存放在 stack,如下所示 (已加註解):
```shell
00000000000006aa <func>:
6aa: 55 push %rbp
6ab: 48 89 e5 mov %rsp,%rbp
6ae: 48 83 ec 20 sub $0x20,%rsp
6b2: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
6b9: 00 00
6bb: 48 89 45 f8 mov %rax,-0x8(%rbp)
6bf: 31 c0 xor %eax,%eax
6c1: c7 45 f0 07 00 00 00 movl $0x7,-0x10(%rbp) # move 7 to stack
6c8: c7 45 f4 09 00 00 00 movl $0x9,-0xc(%rbp) # move 9 to stack
6cf: 48 8d 45 f0 lea -0x10(%rbp),%rax # Get adress of the number '7'
6d3: 48 89 45 e8 mov %rax,-0x18(%rbp) # move address of the number to 'p'
6d7: 48 8b 45 e8 mov -0x18(%rbp),%rax
6db: 48 83 c0 04 add $0x4,%rax # p + 1
6df: 8b 10 mov (%rax),%edx # *(p + 1)
6e1: 48 8b 45 e8 mov -0x18(%rbp),%rax # Get the content of p
6e5: 8b 00 mov (%rax),%eax # *p
6e7: 89 c6 mov %eax,%esi
6e9: 48 8d 3d c4 00 00 00 lea 0xc4(%rip),%rdi # 7b4 <_IO_stdin_used+0x4>
6f0: b8 00 00 00 00 mov $0x0,%eax
6f5: e8 86 fe ff ff callq 580 <printf@plt>
6fa: 90 nop
6fb: 48 8b 45 f8 mov -0x8(%rbp),%rax
6ff: 64 48 33 04 25 28 00 xor %fs:0x28,%rax
706: 00 00
708: 74 05 je 70f <func+0x65>
70a: e8 61 fe ff ff callq 570 <__stack_chk_fail@plt>
70f: c9 leaveq
710: c3 retq
```
使用 gdb 觀察該物件是否存放在 stack ,如下所示:
```shell
(gdb) list
1 #include <stdio.h>
2
3 void func(void)
4 {
5 int *p = (int []) {7, 9};
6
7 printf("%d %d\n", *p, *(p + 1));
8 }
9
10 int main(void)
(gdb) b 7
Breakpoint 1 at 0x6d7: file int_array.c, line 7.
(gdb) run
Starting program: /home/adrian/archive/jserv/linux2020/tmp/int_array
Breakpoint 1, func () at int_array.c:7
7 printf("%d %d\n", *p, *(p + 1));
(gdb) info r
rax 0x7ffffffee350 140737488282448
rbx 0x0 0
rcx 0x8000730 134219568
rdx 0x7ffffffee468 140737488282728
rsi 0x7ffffffee458 140737488282712
rdi 0x1 1
rbp 0x7ffffffee360 0x7ffffffee360
rsp 0x7ffffffee340 0x7ffffffee340
r8 0x7fffff3ecd80 140737475693952
r9 0x7fffff3ecd80 140737475693952
r10 0x2 2
r11 0x7 7
r12 0x80005a0 134219168
r13 0x7ffffffee450 140737488282704
r14 0x0 0
r15 0x0 0
rip 0x80006d7 0x80006d7 <func+45>
eflags 0x246 [ PF ZF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb)
```
從 rbp 與 rsp 暫存器可知 func() 使用 stack 空間 0x7ffffffee340-0x7ffffffee360 存放區域變數 `p` 與 compound literal,故將此區段內容印出來:
1. 區域變數 `p`: 其位址為0x7ffffffee348,內容為0x7ffffffee350 (也就是指向整數陣列第一個元素位址)
2. compound literal `(int []) {7, 9}`: 存在放在0x7ffffffee350。
```shell
(gdb) x/8x 0x7ffffffee340
0x7ffffffee340: 0xff4109a0 0x00007fff 0xfffee350 0x00007fff
0x7ffffffee350: 0x00000007 0x00000009 0xe2743500 0xbe27fcf3
```
***
```cpp
#include <stdio.h>
struct ab {
int a;
int b;
};
void func1(struct ab tmp)
{
printf("%d %d\n", tmp.a, tmp.b);
}
void func2(struct ab *ptr)
{
printf("%d %d\n", ptr->a, ptr->b);
}
int main(void)
{
func1((struct ab) {.a = 1, .b = 2});
func2(&(struct ab) {.a = 3, .b = 4});
return 0;
}
```
此程式碼的 compound literal 使用 `designator` 初始化結構成員。
***
回到此次作業程式碼,xs_literal_empty() 便是 compound literal,且 xs_tmp() 的 xs_new(&xs_literal_empty(), "" x) 透過 xs_literal_empty() 配置一個 xs union,並把其位址傳給 xs_new()。
* Comma operator (詳見 C99 規格 6.5.17): 逗號運算子左邊的運算結果會被當作 void 結果 (void expression),逗號運算子右邊的運算結果與型態會被當作最終運算結果與型態。
```cpp
#include <stdio.h>
void func(int a)
{
printf("a: %d\n", a);
}
int main(void)
{
int t;
func((t = 3, 5));
return 0;
}
```
執行結果:
```
a: 5
```
```cpp
#include <stdio.h>
void func(int a)
{
printf("a: %d\n", a);
}
int main(void)
{
int t;
func((t = 3, t + 10));
return 0;
}
```
執行結果
```
a: 13
```
***
回到此次作業程式碼,底下程式碼中,xs_tmp() 即使用 comma operator 定義兩個運算式,其中左邊運算式用來偵測字串長度是否超過所定義的長度,右邊運算式則呼叫 xs_new() 並回傳由 xs_literal_empty() 所初始化 union 的位址。依據 comma operator 特性,xs_tmp() -> xs_new(&xs_literal_empty(), "" x),最後再經由 `indirection * unary operator` 將其 union 的內容複製到 string 變數。
```cpp
/* Memory leaks happen if the string is too long but it is still useful for
* short strings.
* "" causes a compile-time error if x is not a string literal or too long.
*/
#define xs_tmp(x) \
((void) ((struct { \
_Static_assert(sizeof(x) <= MAX_STR_LEN, "it is too big"); \
int dummy; \
}){1}), \
xs_new(&xs_literal_empty(), "" x))
...
xs string = *xs_tmp("\n foobarbar \n\n\n");
```
## 提供字串複製 (應考慮到 CoW) - [xs_copy()](https://github.com/AdrianHuang/quiz2-linux2020/blob/ddd7eb9a61c6a10a39b3a5d18471374a24cfc80e/xs.c#L287)
xs_copy()實作字串複製功能,並分為三個策略 (參考[FBString.md](https://github.com/facebook/folly/blob/master/folly/docs/FBString.md)):
* small strings (<= 15 chars) are stored in-situ without memory allocation.
* Medium strings (16 - 255 chars) are stored in malloc-allocated memory and copied eagerly.
* Large strings (> 255 chars) are stored in malloc-allocated memory and copied lazily.
對於大字串複製,xs_copy() 使用 CoW 機制,節省複製大字串執行時間,等到使用者要覆寫這個字串時,再進行複製字串動作,其分為兩大部分:
1. 複製 meta 資料: 複製 xs union (16 bytes)
2. 將此共享字串的 reference count 加一: 當執行 malloc() 分配記憶體時,會多分配 4 個位元組,用以記錄此字串的 reference count,其記憶體配置圖如下所示 (n = 字串長度)。其設計好處在於,不需要在 xs union 新增一個成員用以紀錄 reference count (每個 string 都有這個成員,此方法浪費記憶體空間)。

### 測試數據: wo/w CoW
底下為沒有 CoW 測試結果,總共花費 31.9 秒才執行結束:
```shell
$ perf stat -e cache-misses,cache-references,instructions,cycles ./xs -d
[foobarbar] : 9
[(((foobarbar)))] : 15
-------------- Small string --------------
copy string 10000 times, ref. count: 0
concatenate copied string for 100 sets, ref. count: 0
trim copied string for another 100 sets, ref. count: 0
------------------------------------------
-------------- Medium string --------------
copy string 10000 times, ref. count: 0
concatenate copied string for 100 sets, ref. count: 0
trim copied string for another 100 sets, ref. count: 0
------------------------------------------
-------------- Large string --------------
copy string 10000 times, ref. count: 0
concatenate copied string for 100 sets, ref. count: 0
trim copied string for another 100 sets, ref. count: 0
------------------------------------------
Performance counter stats for './xs -d':
163,769,154 cache-misses # 11.146 % of all cache refs
1,469,306,796 cache-references
38,600,949,051 instructions # 0.58 insn per cycle
67,004,700,370 cycles
31.911450719 seconds time elapsed
7.671942000 seconds user
24.235817000 seconds sys
```
底下為支援 CoW 測試結果,約0.8秒就能執行完畢:
```shell
$ perf stat -e cache-misses,cache-references,instructions,cycles ./xs
[foobarbar] : 9
[(((foobarbar)))] : 15
-------------- Small string --------------
copy string 10000 times, ref. count: 0
concatenate copied string for 100 sets, ref. count: 0
trim copied string for another 100 sets, ref. count: 0
------------------------------------------
-------------- Medium string --------------
copy string 10000 times, ref. count: 0
concatenate copied string for 100 sets, ref. count: 0
trim copied string for another 100 sets, ref. count: 0
------------------------------------------
-------------- Large string --------------
copy string 10000 times, ref. count: 10001
concatenate copied string for 100 sets, ref. count: 9901
trim copied string for another 100 sets, ref. count: 9801
------------------------------------------
Performance counter stats for './xs':
2,301,522 cache-misses # 5.489 % of all cache refs
41,931,149 cache-references
1,076,763,293 instructions # 0.67 insn per cycle
1,607,658,444 cycles
0.765898061 seconds time elapsed
0.252623000 seconds user
0.513265000 seconds sys
```
## 設計實驗,確認上述字串處理最佳化 (針對空間效率) 帶來的效益,應考慮到 locality of reference
xs_copy() 實作 reference count 時,將其放在該字串後面。然而,字串指標跟此 reference count 經常被訪問,如果可以把這兩個記憶體分配在附近,就能享有 locality of reference 好處。因此,把原本的記憶體配置圖改為底下:

對應實作可以參考此份提交: [Re-arrange memory layout of reference count](https://github.com/AdrianHuang/quiz2-linux2020/commit/fc0b832de53759e66397a1b6db977fd298b0d516)