2020q3 quiz11 - HackMD

# 2020q3 quiz11 contributed by < `quantabase13` > ## 程式運作原理 ### NEXT_POWER_OF_2(s) 先從巨集 `v` 的實作看起： ```c= #define v(t, s, name, ...) \ union { \ STRUCT_BODY(t); \ struct { \ size_t filler; \ t buf[NEXT_POWER_OF_2(s)]; \ }; \ } name __attribute__((cleanup(vec_free))) = {.buf = {__VA_ARGS__}}; \ name.size = sizeof((__typeof__(name.buf[0])[]){0, __VA_ARGS__}) / \ sizeof(name.buf[0]) - 1; \ name.capacity = sizeof(name.buf) / sizeof(name.buf[0]) ``` 從裡面可以看出 `buf` 的大小決定於 `NEXT_POWER_OF_2(s)` ，其實作如下： ```c= #define NEXT_POWER_OF_2(s) \ (s & (s - 1)) == 0 \ ? s \ : (size_t) 1 << (64 - __builtin_clzl(s)) /* next power of 2 */ ``` 想法是：如果 `s` 為二的冪次，就使用 `s`，否則就選值最接近 `s` 的下一個大於 `s` 的二的冪次。 [declare attribute of functions](https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html) ### v(t,s,name, ...) 這個巨集用來建立並初始化 vector。 `t` 為 vector 裡 element 的 type， `s` 為實際儲存了 vector 元素的 size， `name` 為 vector 的變數名稱。建立時使用 union 是為了節省記憶體，因為 vector 只可能存在 stack 或 heap 其中一個地方。一開始 vector 都是存在 stack。在建立 vector 時，用來儲存 vector 的 buffer 是 union 裡的 `t buf[NEXT_POWER_OF_2(s)];` 這裡透過 `NEXT_POWER_OF_2` 計算出 buffer 大小。這邊程式比較不容易看懂的是 a. ```c= sizeof((__typeof__(name.buf[0])[]){0, __VA_ARGS__}) / \ sizeof(name.buf[0]) - 1; ``` 還有 b. ```c= sizeof((__typeof__(name.buf[0])[]){0, __VA_ARGS__}) ``` 關於 a，根據 [gcc 的 online document](https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Variable-Attributes.html#Variable-Attributes): >The keyword __attribute__ allows you to specify special attributes of variables or structure fields. This keyword is followed by an attribute specification inside double parentheses. 其中有關於 cleanup 功能的介紹： >cleanup (*cleanup*_*function*) >The cleanup attribute runs a function when the variable goes out of scope. This attribute can only be applied to auto function scope variables; it may not be applied to parameters or variables with static storage duration. The function must take one parameter, a pointer to a type compatible with the variable. The return value of the function (if any) is ignored. 在 a 這個例子，`__attribute__((cleanup(vec_free)))` 代表擁有這個 attribute 的 variable 在離開它的 scope 時，就會執行裡面的 function，也就是 `vec_free`。這個 attribute 的作用在這裡是實現自動把記憶體 free 掉的功能。關於b: b 的程式碼目的是計算出 buffer 裡真正儲存的 element 的個數。寫成 `{0, __VA_ARGS__}` 而不是`{__VA_ARGS__}` 是為了處理向量初始化時沒有給元素的狀況。 ### vec_push_back() `vec_push_back` 用來對 `__vec_push_back` 包裝。 `vec_push_back(v, e)` 展開後如下： ```c= __vec_push_back(&v, &(__typeof__(v.buf[0])[]){e}, vec_elemsize(v), \ vec_capacity(v)) ``` 我們先來看 `__vec_push_back` 的定義： ```c= static NON_NULL void __vec_push_back(void *restrict vec, void *restrict e, size_t elemsize, size_t capacity) { union { STRUCT_BODY(char); struct { size_t filler; char buf[]; }; } *v = vec; if (v->on_heap) { if (v->size == capacity) v->ptr = realloc(v->ptr, elemsize * (size_t) 1 << ++v->capacity); memcpy(&v->ptr[v->size++ * elemsize], e, elemsize); } else { if (v->size == capacity) { void *tmp = malloc(elemsize * (size_t) 1 << (v->capacity = capacity + 1)); memcpy(tmp, v->buf, elemsize * v->size); v->ptr = tmp; v->on_heap = 1; memcpy(&v->ptr[v->size++ * elemsize], e, elemsize); } else memcpy(&v->buf[v->size++ * elemsize], e, elemsize); } } ``` 這裡實作上的想法是： 1. 確認 vector 此時是否儲存在 heap 上 2. 如果是，確認此時 size 是否等於 capacity 3. 如果是，代表 heap 裡分配給 vector 的空間已滿，使用 realloc 重新分配記憶體，接著再把新的 element 填入 ; 否則直接填入新的 element 4. 如果 vector 存在 stack 上，確認此時 size 是否等於 capacity 5. 如果是，從 heap 分配記憶體空間，把 stack 裡的 elements 複製到 heap，接著填入新的 element ; 否則直接填入 element。 ### vec_reserve() `vec_reserve` 是對 `__vec_reserve` 的包裝。`__vec_reserve` 的定義如下： ```c= static NON_NULL void __vec_reserve(void *vec, size_t n, size_t elemsize, size_t capacity) { union { STRUCT_BODY(void); struct { size_t filler; char buf[]; }; } *v = vec; if (n > capacity) { if (v->on_heap) { v->ptr = realloc(v->ptr, elemsize * (size_t) 1 << (v->capacity = ilog2(n))); } else { void *tmp = malloc(elemsize * (size_t) 1 << (v->capacity = ilog2(n))); memcpy(tmp, v->buf, elemsize * v->size); v->ptr = tmp; v->on_heap = 1; } } } ``` 這段程式碼是用來「顯式」的決定分配給 vector 的記憶體大小。主要也是分成以下幾步： 1. 確認給定的 `n` 是否大於 `capacity` ，如果不大於的話代表原本的 `capacity` 已經能夠滿足所要求的大小，不用作任何事 ; 否則進入下一步。 2. 確認 `vector` 是否儲存在 heap 裡。如果是，使用 `realloc` 重新分配空間，大小為 `elementsize` *(最接近 n 且大於 n 的 2的冪次) 3. 如果 `vector` 存在 stack 裡，用 `malloc` 從 heap 裡分配空間，把 `vector` 存在 heap 裡 ## 用 Valgrind 來確認程式記憶體的使用情況這是有使用 `GCC extension: attribute cleanup` 的程式碼的 heap profile : ![](https://i.imgur.com/jdDW5ag.png) 把 `__attribute__((cleanup(vec_free)))` 註解掉後的 heap profile : ![](https://i.imgur.com/AHAJhF2.png) 可以看到有明顯的 memory leak ( heap memory 最終使用量不為零)。把原本的程式碼裡從 ```c= vec_push_back(vec1, 4.4); ``` 開始註解，得到下面這張 heap profile ![](https://i.imgur.com/kW3V0Kl.png) 可以發現沒有 `__vec_push_back()` 使用 heap memory 的計錄（因為沒有超過原本分配在 stack 裡 memory 的 capacity） ## Vector 操作相關實現 ### 加法 ```c= #define vec_add( v_dst, v_src1, v_src2) \ __typeof__(vec_pos(v_src1, 0)) tmp1, tmp2,tmp3 ; \ v(__typeof__(vec_pos(v_src1, 0)), 23, v_dst); \ for (size_t i = 0 ; i < vec_size(v_src1); i++) { \ tmp1 = vec_pos(v_src1, i);\ tmp2 = vec_pos(v_src2, i);\ tmp3 = tmp1 + tmp2;\ vec_push_back(v_dst, tmp3);\ } ``` ### 純量乘法 ```c= #define vec_multiply_scalar(v, e)\ for (size_t i = 0; i < vec_size(v); i++){\ vec_pos(v, i) = vec_pos(v, i) * e;\ } ``` 這兩個操作都選擇使用 C 的 macro 來實現。一開始純粹只是模仿原本的程式碼，但後來發現 macro 不像一般的 function，有限制其 arguments 的 type，所以在 macro 裡我們必須使用像 `__typeof__()` 這樣的 gnu c extenstion 來取得 argument 的 type。原本這點讓我覺得綁手綁腳，仔細想過後，決定使用 macro 而非 function 的原因是因為： * 屏蔽實作相關的細節： * 使用 `vector_add` 的程式不必擔心 type 相關問題，只要專注在要實現的數學概念。 * 不必寫各種 type 的 vector 操作： * 如果不用 macro ，以 function 實作必須規定傳入 argument 的 type ，這樣會導致需要實現 `vector_add_int`、`vector_add_float` 等等的情況。 ### 確認 array size 與隨機存取 array element 的花費時間關係這裡做的實驗，是想確認此處 vector 的實作在效能上受到字串長度的具體影響為何：實驗的步驟如下： 1. 對某一特定長度的字串，隨機訪問裡面的元素，重複這個操作到一給定次數，計算所耗時間。 2. 重複（1) 至一給定次數，得出耗時的分佈實驗設計的想法是：如果字串長度在一條 cache line 的大小內，就能充分利用到 spatial locality ; 反之則會遇到 cache miss 的情況。程式碼如下，參考的是[這篇](https://blog.gtwang.org/programming/measure-the-execution-time-in-c-language/2/)： ```c= struct timespec start, end; double time_used; v(int, 16, vec0); for (int i = 0; i < 16; i++) { vec_push_back(vec0, i); } clock_gettime(CLOCK_MONOTONIC, &start); for (int i = 0; i < RUNTIME; i++) { int temp = 0; int rand = random() % 16; temp += vec_pos(vec0, rand); } clock_gettime(CLOCK_MONOTONIC, &end); struct timespec temp = diff(start, end); time_used = temp.tv_sec + (double) temp.tv_nsec / SEC_TO_NANOSEC; printf("%f\t", time_used); ``` `diff` 的實作如下： ```c= struct timespec diff(struct timespec start, struct timespec end) { struct timespec temp; if ((end.tv_nsec-start.tv_nsec)<0) { temp.tv_sec = end.tv_sec-start.tv_sec-1; temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec; } else { temp.tv_sec = end.tv_sec-start.tv_sec; temp.tv_nsec = end.tv_nsec-start.tv_nsec; } return temp; } ``` 使用的重複執行腳本如下，在每個 run 前都有先把 cache 清空： ```shell= #!/bin/bash for i in `seq 0 1 200`; do \ echo 3 > /proc/sys/vm/drop_caches ;\ ./vector >> test done ``` 預期大於 64 byte 的 array 會出現較多 cache miss，耗時會比較高，但做完實驗後卻得到下圖： ![](https://i.imgur.com/IOqfcJl.png) 不懂為什麼 16 bytes 、32 byte 、 64 byte 這些小於 cache line size 的 array 花費時間會比較多... 懷疑是編譯器優化的問題，於是查了 [gcc 7.5 關於optimizaton 的 document](https://gcc.gnu.org/onlinedocs/gcc-7.5.0/gcc/Optimize-Options.html#Optimize-Options)。 >-O0 >Reduce compilation time and make debugging produce the expected results. This is the default. > default 選項是 `-O0`，出來的就是上面的結果。接著試了 `-Og`， >-Og >Optimize debugging experience. -Og enables optimizations that do not interfere with debugging. It should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience. > 從`-Og` 的描述我只能知道它的優化不會干擾到 debugging ，抱著試試看的角度，得到的圖如下： ![](https://i.imgur.com/eIUKIfN.png) 還是出現無法解釋的結果。下圖是把 accessing to random string element 的次數調為 10 倍的結果： ![](https://i.imgur.com/qm1U4Y0.png) 可以看到 512 bytes 的分佈與剩下五個有明顯的不同。這裡出現一個非預期的結果：`array size` 在 512 與 768 bytes 的耗時分佈去掉 outlier 的部份相當一致， ![](https://i.imgur.com/dEEdYFe.png) 但 512 與 1024 bytes 的分佈卻完全不同 ![](https://i.imgur.com/mwFTdAw.png)