# 2021q1 Homework3 (quiz3) contributed by < `qwe661234` > ###### tags: `linux2021` 運作原理: 對字串做 SSO (small string optimization) 和 CoW (copy on write) 優化 CoW 機制的實現: >COW的實現依賴於 reference count(引用計數rc),初始時rc = 1,每次賦值複製時rc ++,當修改時,如果rc > 1,需要申請新的空間並複制一個原來的數據,並且 rc--,當rc == 0時,釋放原內存。 SSO 機制的實現: 將字串 xs 分為3種: 1. 長度 <= 15的字串: * 存放在 stack * size = 15 - space_left(剩餘空間) * data 直接存放於 xs 的 char 陣列中 * is_ptr = 0, is_large_string = 0 2. 長度 > 16 且 < 256 的字串 * 存放在 heap * size = slen(string) * is_ptr = 1, is_large_string = 0 3. 長度 >= 256 的字串 * 存放在 heap * 前 4 bytes 用來存 reference counter * size = slen(string) * is_ptr = 1, is_large_string = 1 >所以當字串較短時,直接將 data 存於 stack 中,而不用去 heap 中申請空間,省去了申請 heap 中空間的開銷。 Data structure: ``` cpp typedef union { /* allow strings up to 15 bytes to stay on the stack * use the last byte as a null terminator and to store flags * much like fbstring: * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md */ /* store the content of short string */ char data[16]; struct { uint8_t filler[15], /* how many free bytes in this stack allocated string * same idea as fbstring */ /* how many length are left for short string */ space_left : 4, /* if it is on heap, set to 1 */ is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1; }; /* heap allocated */ struct { /* store the content of learge string */ char *ptr; /* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */ size_t size : MAX_STR_LEN_BITS, /* capacity is always a power of 2 (unsigned)-1 */ capacity : 6; /* the last 4 bits are important flags */ }; } xs; ``` 其中 filler 是用來對齊的16 bytes 結構 ![](https://i.imgur.com/PFUnPoY.jpg) `xs_data` +4 是因為 x->ptr 前四個 bytes 是用來存取 reference count, 所以要位址+4來拿到 large string 的內容 ```cpp= static inline char *xs_data(const xs *x) { /* short string */ if (!xs_is_ptr(x)) return (char *) x->data; /* large string */ if (xs_is_large_string(x)) return (char *) (x->ptr + 4); /* medium string */ return (char *) x->ptr; } ``` `ilog2` __builtin_clz(n) >Count Leading Zeros (clz), 計算 2 進位數從 MSB 往右數遇到的第一個 1 之前所有 0 的數量 ilog2 是用來計算 xs 所需的 capacity ``` cpp static inline int ilog2(uint32_t n) { return 32 - __builtin_clz(n) - 1; } ``` `xs_allocate_data` allocate 時會區別 medium string 和 large string, large string 要處理 reference counter 的部分 realloc: 避免 memory 的浪費 > ptr -- 這是以前用malloc,calloc或realloc分配的指標,現在重新分配的記憶體給此指標。如果是NULL,分配一個新的記憶體區塊,由該函數返回一個指向它的指標。 size -- 這是新的記憶體區塊大小(以 byte 為單位)。如果它是0 並且 ptr 指向現有的記憶體區塊,指標所指向的記憶體區塊被釋放,並返回一個 NULL 指標。 ```cpp static void xs_allocate_data(xs *x, size_t len, bool reallocate) { /* Medium string */ if (len < LARGE_STRING_LEN) { x->ptr = reallocate ? realloc(x->ptr, (size_t) 1 << x->capacity) : malloc((size_t) 1 << x->capacity); return; } /* Large string */ x->is_large_string = 1; /* The extra 4 bytes are used to store the reference count */ x->ptr = reallocate ? realloc(x->ptr, (size_t)(1 << x->capacity) + 4) : malloc((size_t)(1 << x->capacity) + 4); xs_set_refcnt(x, 1); } ``` `xs_cow_lazy_copy` 用來處理 large string 的內容修改, 當 large string 需要修改且其 reference counter > 1 時, 將 reference counter - 1, 並複製一份內容給新分配的記憶體區塊, 內容的修改會發生在新分配的記憶體區塊 >當 large string 真的有修改時,才真正的分配新的記憶體, 否則同樣的 large string 都是指向同一記憶體位置, 可減少複製 large string data 的開銷 ``` cpp static bool xs_cow_lazy_copy(xs *x, char **data) { if (xs_get_refcnt(x) <= 1) return false; /* Lazy copy */ xs_dec_refcnt(x); xs_allocate_data(x, x->size, 0); if (data) { memcpy(xs_data(x), *data, x->size); /* Update the newly allocated pointer */ *data = xs_data(x); } return true; } ``` `xs_concat` 依造 prefix、 string、 suffix 連接,如果是 large string 要做 CoW, 如果 capacity 不足要先 xs_grow 到足夠的空間 ```cpp s *xs_concat(xs *string, const xs *prefix, const xs *suffix) { size_t pres = xs_size(prefix), sufs = xs_size(suffix), size = xs_size(string), capacity = xs_capacity(string); char *pre = xs_data(prefix), *suf = xs_data(suffix), *data = xs_data(string); xs_cow_lazy_copy(string, &data); if (size + pres + sufs <= capacity) { memmove(data + pres, data, size); memcpy(data, pre, pres); memcpy(data + pres + size, suf, sufs + 1); if (xs_is_ptr(string)) string->size = size + pres + sufs; else string->space_left = 15 - (size + pres + sufs); } else { xs tmps = xs_literal_empty(); xs_grow(&tmps, size + pres + sufs); char *tmpdata = xs_data(&tmps); memcpy(tmpdata + pres, data, size); memcpy(tmpdata, pre, pres); memcpy(tmpdata + pres + size, suf, sufs + 1); xs_free(string); *string = tmps; string->size = size + pres + sufs; } return string; } ``` `xs_trim` 用查表方式來做修剪, set_bit 用來建表, 將8 bit ASCII code 分為前5 bit 和後 3 bit, 前5個 bits 是 mask 的 Index, 後3個 bits 則是對應的 mask 的內容, check_bit 則是查表。 先以 trimset 建表, 接著從欲修剪的字串開頭集結尾查表 修剪部分是以指標的移動及字串長度的縮減, 最後將修剪好的字串以 memmove 複製到原始記憶體位置 ``` cpp xs *xs_trim(xs *x, const char *trimset) { if (!trimset[0]) return x; char *dataptr = xs_data(x), *orig = dataptr; if (xs_cow_lazy_copy(x, &dataptr)) orig = dataptr; /* similar to strspn/strpbrk but it operates on binary data */ uint8_t mask[32] = {0}; #define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8) #define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8) size_t i, slen = xs_size(x), trimlen = strlen(trimset); for (i = 0; i < trimlen; i++) set_bit(trimset[i]); for (i = 0; i < slen; i++) if (!check_bit(dataptr[i])) break; for (; slen > 0; slen--) if (!check_bit(dataptr[slen - 1])) break; dataptr += i; slen -= i; /* reserved space as a buffer on the heap. * Do not reallocate immediately. Instead, reuse it as possible. * Do not shrink to in place if < 16 bytes. */ memmove(orig, dataptr, slen); /* do not dirty memory unless it is needed */ if (orig[slen]) orig[slen] = 0; if (xs_is_ptr(x)) x->size = slen; else x->space_left = 15 - slen; return x; #undef check_bit #undef set_bit } ``` :::success 延伸問題:提出改進方案 ::: `xs_grow` 修改 原本的 `xs_grow` 先將 x->is_ptr 設為 true 再去判斷 xs_is_ptr(x), 這樣 xs_is_ptr(x) 永遠都會是 true, 所以應該先判斷 xs_is_ptr(x), 等 memory allocate 完才能將 x->is_ptr 設為 true ```cpp xs *xs_grow(xs *x, size_t len) { char buf[16]; if (len <= xs_capacity(x)) return x; /* Backup first */ if (!xs_is_ptr(x)) memcpy(buf, x->data, 16); if (xs_is_ptr(x)) { xs_allocate_data(x, len, 1); } else { xs_allocate_data(x, len, 0); memcpy(xs_data(x), buf, 16); } x->is_ptr = true; x->capacity = ilog2(len) + 1; return x; } ``` :::success 延伸問題:以上述程式碼為基礎,提供字串複製 (應考慮到 CoW) 的函式實作 ::: `xs_copy` 針對 src 是 short、 medium、 large string 分別做不同的拷貝方式 1. short string: 直接複製內容, 將 src 的 data copy 到 dest 的 data 中 2. medium string: 直接複製內容, 先分配記憶體給 dest 的 char ptr, 再將 src 的 char ptr 中存放的內容直接 copy 到 dest 的 char ptr 中 3. large string: 須考慮 CoW 機制, 所以將 dest 的 char ptr 直接指向 src 的 char ptr, 並將 reference counter + 1, 等待未來內容有修改時在真正的複製一份哪容給 dest ```cpp xs *xs_copy(xs *dest, xs *src){ /* short string */ if (!xs_is_ptr(src)){ dest = xs_free(dest); dest->is_large_string = 0; dest->is_ptr = 0; dest->space_left = src->space_left; memcpy(dest->data, src->data, xs_size(dest)); if (dest->data[xs_size(dest)]){ dest->data[xs_size(dest)] = 0; } return dest; } /* large string */ if (xs_is_large_string(src)){ dest = xs_free(dest); dest->is_large_string = 1; dest->is_ptr = 1; size_t len = strlen(xs_data(src)) + 1; dest->capacity = ilog2(len) + 1; dest->size = len - 1; dest->ptr = src->ptr; xs_inc_refcnt(src); } /* medium string*/ dest = xs_free(dest); dest->is_ptr = 1; dest->is_large_string = 0; size_t len = strlen(xs_data(src)) + 1; dest->capacity = ilog2(len) + 1; dest->size = len - 1; xs_allocate_data(dest, dest->size, 0); memcpy(xs_data(dest), xs_data(src), len); return dest; } ``` 隨機產生大小為 325 的字串並做1 ~ 10000次來比較以 xs_copy 來複製字串與以 strncpy 來複製所花費的時間比較 可看出大字串以 xs_copy (CoW的機制) 來複製比以 strncpy 來的有效率 ![](https://i.imgur.com/UX78nhC.png) ### Refinement 與老師討論實驗設計的部分後, 從新設計 CoW 的實驗, 再實驗設計上應該著重於重複複製相同的字串才能顯現 CoW 機制的效能, 而一直非複製不同的字串, 因此從新設計實驗, 此實驗為針對字串長度為 325 的字串, 並對相同字串作複製1000次, 並重複此動作 1000次 由圖中可看出, 實行 CoW 機制的 xs_copy 效能較 strncpy 佳 藍色為 xs_copy 橘色為strncpy ![](https://i.imgur.com/OFBqDrc.png) 假設資料為常態分佈, 透過計算將 outlier 去除, 並對剩餘的數據取平均值 | function | 原始平均 | 標準差 | 信賴區間(95%) | 信賴區間(95%) | 去除 outlier 平均值 | | -------- | -------- | -------- | --------------- | --------------- | --- | | xs_copy | 450.555 | 232.4967 | -14.4385 | 915.5465 | 424.81 | | strncpy | 1253.087 | 124.3488 | 1004.3892 | 1501.7847 | 1241.602 | Conclusion: 在此實驗中, 對相同字串進行複製1000次, 實做 CoW 的 xs_copy 較 strncpy 效能快將近3倍 :::success 延伸問題:設計實驗,確認上述字串處理最佳化 (針對空間效率) 帶來的效益,應考慮到 locality of reference,善用 GDB 追蹤和分析,確認字串的確符合預期地配置於 stack 或 heap ::: `追蹤測試程式` ```cpp int main(int argc, char *argv[]) { int *a = (int *) malloc(100); xs str = *xs_tmp("foobarbar"); xs str2 = *xs_tmp("foobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbar"); return 0; } ``` `GDB 追蹤` 動態配置記憶體位置會出現在 heap, 所以先配置變數 a 來確認 heap 的位址, 接著印出 $rbp, $rsp 來確認 stack 的起始與結束位址, 由姐果可看出短字串 str.data 存在於 stack 中, 而中字串 str.ptr 則是指向 heap 中的記憶體位址, 可知其存在於 heap 中 ![](https://i.imgur.com/1AkCPTd.png) :::success 延伸問題:嘗試將 quiz2 提到的 string interning 機制整合,提出對空間高度最佳化的字串處理工具函式庫 ::: interning pool and hash table struct ```cpp struct __xs_node { xs str; char* buffer; uint32_t hash_size; struct __xs_node *next; }; struct __xs_pool { struct __xs_node node[INTERNING_POOL_SIZE]; }; struct __xs_interning { int lock; int index; unsigned size; unsigned total; struct __xs_node **hash; struct __xs_pool *pool; }; static struct __xs_interning __xs_ctx; ``` 考慮到 short string 存於字元 array 中, 且 large string 已經有實做 CoW 機制, 所以 string interning 只針對 medium string 為 function `xs_new` `xs_concat` `xs_trim` `xs_copy` 新增 string interning `xs_concat` `xs_trim` 再對 string 做修改時,就直接複製一份新的 data 去進行修改而不去管 reference count, 讓原始的 data 繼續留在 interning pool 中等待其他 string 使用, 跟 CoW 的機制有點不同 `xs_new` ```cpp xs *xs_new(xs *x, const void *p) { *x = xs_literal_empty(); size_t len = strlen(p) + 1; if (len > 16) { x->capacity = ilog2(len) + 1; x->size = len - 1; x->is_ptr = true; if (len < XS_INTERNING_SIZE) { x = cstr_interning(p, len - 1, hash_blob(p, len - 1)); }else { xs_allocate_data(x, x->size, 0); memcpy(xs_data(x), p, len); } }else { memcpy(x->data, p, len); x->space_left = 15 - (len - 1); } return x; } ``` `xs_concat` ```cpp xs *xs_concat(xs *string, const xs *prefix, const xs *suffix) { size_t pres = xs_size(prefix), sufs = xs_size(suffix), size = xs_size(string), capacity = xs_capacity(string); char *pre = xs_data(prefix), *suf = xs_data(suffix), *data = xs_data(string); xs_cow_lazy_copy(string, &data); if (size + pres + sufs <= capacity) { if (size + pres + sufs < XS_INTERNING_SIZE) { string->size = size + pres + sufs; char *tmp = xalloc(string->size); memcpy(tmp, pre, pres); memcpy(tmp + pres, data, size); memcpy(tmp + pres + size, suf, sufs + 1); return cstr_interning(tmp, string->size, hash_blob(tmp, string->size)); } memmove(data + pres, data, size); memcpy(data, pre, pres); memcpy(data + pres + size, suf, sufs + 1); if (xs_is_ptr(string)) { string->size = size + pres + sufs; } else string->space_left = 15 - (size + pres + sufs); } else { //TODO: fix grow function xs tmps = xs_literal_empty(); xs_grow(&tmps, size + pres + sufs); char *tmpdata = xs_data(&tmps); memcpy(tmpdata + pres, data, size); memcpy(tmpdata, pre, pres); memcpy(tmpdata + pres + size, suf, sufs + 1); if (size + pres + sufs < XS_INTERNING_SIZE) { return cstr_interning(tmpdata, xs_size(&tmps), hash_blob(tmpdata, xs_size(&tmps))); } xs_free(string); *string = tmps; string->size = size + pres + sufs; } return string; } ``` `xs_trim` ```cpp xs *xs_trim(xs *x, const char *trimset) { if (!trimset[0]) return x; char *dataptr = xs_data(x), *orig = dataptr; if (xs_size(x) > 16 && xs_size(x) < XS_INTERNING_SIZE) { char* tmp = xalloc(xs_size(x)); memcpy(tmp, dataptr, xs_size(x)); orig = tmp; } else { if (xs_cow_lazy_copy(x, &dataptr)) orig = dataptr; } /* similar to strspn/strpbrk but it operates on binary data */ uint8_t mask[32] = {0}; #define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8) #define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8) size_t i, slen = xs_size(x), trimlen = strlen(trimset); for (i = 0; i < trimlen; i++) set_bit(trimset[i]); for (i = 0; i < slen; i++) if (!check_bit(dataptr[i])) break; for (; slen > 0; slen--) if (!check_bit(dataptr[slen - 1])) break; dataptr += i; slen -= i; /* reserved space as a buffer on the heap. * Do not reallocate immediately. Instead, reuse it as possible. * Do not shrink to in place if < 16 bytes. */ // memmove is safer than memcpy memmove(orig, dataptr, slen); /* do not dirty memory unless it is needed */ if (orig[slen]) orig[slen] = 0; if (xs_is_ptr(x)) x->size = slen; else x->space_left = 15 - slen; if (slen < XS_INTERNING_SIZE && slen > 16) { x = cstr_interning(orig, slen, hash_blob(orig, slen)); } return x; #undef check_bit #undef set_bit } ``` `xs_copy` ```cpp xs *xs_copy(xs *dest, xs *src){ /* short string */ if (!xs_is_ptr(src)){ dest = xs_free(dest); dest->is_large_string = 0; dest->is_ptr = 0; dest->space_left = src->space_left; memcpy(dest->data, src->data, xs_size(src)); return dest; } /* large string */ if (xs_is_large_string(src)){ dest = xs_free(dest); dest->is_large_string = 1; dest->is_ptr = 1; size_t len = strlen(xs_data(src)) + 1; dest->capacity = ilog2(len) + 1; dest->size = len - 1; dest->ptr = src->ptr; xs_inc_refcnt(src); return dest; } /* Medium */ dest = xs_free(dest); return cstr_interning(xs_data(src), xs_size(src), hash_blob(xs_data(src), xs_size(src))); } ``` xs_copy interning 前後效能比較 橘色為 interning 前藍色為 interning 後, 課發現隨著複製次數越高, interning 優化的效果越大 ![](https://i.imgur.com/Y7QOY2T.png) `Valgrind & Massif` 執行Valgrind >valgrind --tool=massif ./test >massif-visualizer outputfile interning 前的記憶體使用量 ![](https://i.imgur.com/izrhSgW.png) interning 後的記憶體使用量 ![](https://i.imgur.com/zjHiMwj.png) `Result` 60.9MB vs 41.3KB, string interning 後在記憶體開銷上有非常顯著的差異, 因此 string interning 同時節省了時間以及記憶體開銷