# 2021q1 Homework3 (quiz3)
contributed by < `qwe661234` >
###### tags: `linux2021`
運作原理:
對字串做 SSO (small string optimization) 和 CoW (copy on write) 優化
CoW 機制的實現:
>COW的實現依賴於 reference count(引用計數rc),初始時rc = 1,每次賦值複製時rc ++,當修改時,如果rc > 1,需要申請新的空間並複制一個原來的數據,並且 rc--,當rc == 0時,釋放原內存。
SSO 機制的實現:
將字串 xs 分為3種:
1. 長度 <= 15的字串:
* 存放在 stack
* size = 15 - space_left(剩餘空間)
* data 直接存放於 xs 的 char 陣列中
* is_ptr = 0, is_large_string = 0
2. 長度 > 16 且 < 256 的字串
* 存放在 heap
* size = slen(string)
* is_ptr = 1, is_large_string = 0
3. 長度 >= 256 的字串
* 存放在 heap
* 前 4 bytes 用來存 reference counter
* size = slen(string)
* is_ptr = 1, is_large_string = 1
>所以當字串較短時,直接將 data 存於 stack 中,而不用去 heap 中申請空間,省去了申請 heap 中空間的開銷。
Data structure:
``` cpp
typedef union {
/* allow strings up to 15 bytes to stay on the stack
* use the last byte as a null terminator and to store flags
* much like fbstring:
* https://github.com/facebook/folly/blob/master/folly/docs/FBString.md
*/
/* store the content of short string */
char data[16];
struct {
uint8_t filler[15],
/* how many free bytes in this stack allocated string
* same idea as fbstring
*/
/* how many length are left for short string */
space_left : 4,
/* if it is on heap, set to 1 */
is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1;
};
/* heap allocated */
struct {
/* store the content of learge string */
char *ptr;
/* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */
size_t size : MAX_STR_LEN_BITS,
/* capacity is always a power of 2 (unsigned)-1 */
capacity : 6;
/* the last 4 bits are important flags */
};
} xs;
```
其中 filler 是用來對齊的16 bytes 結構

`xs_data`
+4 是因為 x->ptr 前四個 bytes 是用來存取 reference count, 所以要位址+4來拿到 large string 的內容
```cpp=
static inline char *xs_data(const xs *x)
{
/* short string */
if (!xs_is_ptr(x))
return (char *) x->data;
/* large string */
if (xs_is_large_string(x))
return (char *) (x->ptr + 4);
/* medium string */
return (char *) x->ptr;
}
```
`ilog2`
__builtin_clz(n)
>Count Leading Zeros (clz), 計算 2 進位數從 MSB 往右數遇到的第一個 1 之前所有 0 的數量
ilog2 是用來計算 xs 所需的 capacity
``` cpp
static inline int ilog2(uint32_t n) { return 32 - __builtin_clz(n) - 1; }
```
`xs_allocate_data`
allocate 時會區別 medium string 和 large string, large string 要處理 reference counter 的部分
realloc: 避免 memory 的浪費
> ptr --
這是以前用malloc,calloc或realloc分配的指標,現在重新分配的記憶體給此指標。如果是NULL,分配一個新的記憶體區塊,由該函數返回一個指向它的指標。
size --
這是新的記憶體區塊大小(以 byte 為單位)。如果它是0 並且 ptr 指向現有的記憶體區塊,指標所指向的記憶體區塊被釋放,並返回一個 NULL 指標。
```cpp
static void xs_allocate_data(xs *x, size_t len, bool reallocate)
{
/* Medium string */
if (len < LARGE_STRING_LEN) {
x->ptr = reallocate ? realloc(x->ptr, (size_t) 1 << x->capacity)
: malloc((size_t) 1 << x->capacity);
return;
}
/* Large string */
x->is_large_string = 1;
/* The extra 4 bytes are used to store the reference count */
x->ptr = reallocate ? realloc(x->ptr, (size_t)(1 << x->capacity) + 4)
: malloc((size_t)(1 << x->capacity) + 4);
xs_set_refcnt(x, 1);
}
```
`xs_cow_lazy_copy`
用來處理 large string 的內容修改, 當 large string 需要修改且其 reference counter > 1 時, 將 reference counter - 1, 並複製一份內容給新分配的記憶體區塊, 內容的修改會發生在新分配的記憶體區塊
>當 large string 真的有修改時,才真正的分配新的記憶體, 否則同樣的 large string 都是指向同一記憶體位置, 可減少複製 large string data 的開銷
``` cpp
static bool xs_cow_lazy_copy(xs *x, char **data)
{
if (xs_get_refcnt(x) <= 1)
return false;
/* Lazy copy */
xs_dec_refcnt(x);
xs_allocate_data(x, x->size, 0);
if (data) {
memcpy(xs_data(x), *data, x->size);
/* Update the newly allocated pointer */
*data = xs_data(x);
}
return true;
}
```
`xs_concat`
依造 prefix、 string、 suffix 連接,如果是 large string 要做 CoW, 如果 capacity 不足要先 xs_grow 到足夠的空間
```cpp
s *xs_concat(xs *string, const xs *prefix, const xs *suffix)
{
size_t pres = xs_size(prefix), sufs = xs_size(suffix),
size = xs_size(string), capacity = xs_capacity(string);
char *pre = xs_data(prefix), *suf = xs_data(suffix),
*data = xs_data(string);
xs_cow_lazy_copy(string, &data);
if (size + pres + sufs <= capacity) {
memmove(data + pres, data, size);
memcpy(data, pre, pres);
memcpy(data + pres + size, suf, sufs + 1);
if (xs_is_ptr(string))
string->size = size + pres + sufs;
else
string->space_left = 15 - (size + pres + sufs);
} else {
xs tmps = xs_literal_empty();
xs_grow(&tmps, size + pres + sufs);
char *tmpdata = xs_data(&tmps);
memcpy(tmpdata + pres, data, size);
memcpy(tmpdata, pre, pres);
memcpy(tmpdata + pres + size, suf, sufs + 1);
xs_free(string);
*string = tmps;
string->size = size + pres + sufs;
}
return string;
}
```
`xs_trim`
用查表方式來做修剪, set_bit 用來建表, 將8 bit ASCII code 分為前5 bit 和後 3 bit, 前5個 bits 是 mask 的 Index, 後3個 bits 則是對應的 mask 的內容, check_bit 則是查表。 先以 trimset 建表, 接著從欲修剪的字串開頭集結尾查表
修剪部分是以指標的移動及字串長度的縮減, 最後將修剪好的字串以 memmove 複製到原始記憶體位置
``` cpp
xs *xs_trim(xs *x, const char *trimset)
{
if (!trimset[0])
return x;
char *dataptr = xs_data(x), *orig = dataptr;
if (xs_cow_lazy_copy(x, &dataptr))
orig = dataptr;
/* similar to strspn/strpbrk but it operates on binary data */
uint8_t mask[32] = {0};
#define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8)
#define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8)
size_t i, slen = xs_size(x), trimlen = strlen(trimset);
for (i = 0; i < trimlen; i++)
set_bit(trimset[i]);
for (i = 0; i < slen; i++)
if (!check_bit(dataptr[i]))
break;
for (; slen > 0; slen--)
if (!check_bit(dataptr[slen - 1]))
break;
dataptr += i;
slen -= i;
/* reserved space as a buffer on the heap.
* Do not reallocate immediately. Instead, reuse it as possible.
* Do not shrink to in place if < 16 bytes.
*/
memmove(orig, dataptr, slen);
/* do not dirty memory unless it is needed */
if (orig[slen])
orig[slen] = 0;
if (xs_is_ptr(x))
x->size = slen;
else
x->space_left = 15 - slen;
return x;
#undef check_bit
#undef set_bit
}
```
:::success
延伸問題:提出改進方案
:::
`xs_grow` 修改
原本的 `xs_grow` 先將 x->is_ptr 設為 true 再去判斷 xs_is_ptr(x), 這樣 xs_is_ptr(x) 永遠都會是 true, 所以應該先判斷 xs_is_ptr(x), 等 memory allocate 完才能將 x->is_ptr 設為 true
```cpp
xs *xs_grow(xs *x, size_t len)
{
char buf[16];
if (len <= xs_capacity(x))
return x;
/* Backup first */
if (!xs_is_ptr(x))
memcpy(buf, x->data, 16);
if (xs_is_ptr(x)) {
xs_allocate_data(x, len, 1);
} else {
xs_allocate_data(x, len, 0);
memcpy(xs_data(x), buf, 16);
}
x->is_ptr = true;
x->capacity = ilog2(len) + 1;
return x;
}
```
:::success
延伸問題:以上述程式碼為基礎,提供字串複製 (應考慮到 CoW) 的函式實作
:::
`xs_copy`
針對 src 是 short、 medium、 large string 分別做不同的拷貝方式
1. short string: 直接複製內容, 將 src 的 data copy 到 dest 的 data 中
2. medium string: 直接複製內容, 先分配記憶體給 dest 的 char ptr, 再將 src 的 char ptr 中存放的內容直接 copy 到 dest 的 char ptr 中
3. large string: 須考慮 CoW 機制, 所以將 dest 的 char ptr 直接指向 src 的 char ptr, 並將 reference counter + 1, 等待未來內容有修改時在真正的複製一份哪容給 dest
```cpp
xs *xs_copy(xs *dest, xs *src){
/* short string */
if (!xs_is_ptr(src)){
dest = xs_free(dest);
dest->is_large_string = 0;
dest->is_ptr = 0;
dest->space_left = src->space_left;
memcpy(dest->data, src->data, xs_size(dest));
if (dest->data[xs_size(dest)]){
dest->data[xs_size(dest)] = 0;
}
return dest;
}
/* large string */
if (xs_is_large_string(src)){
dest = xs_free(dest);
dest->is_large_string = 1;
dest->is_ptr = 1;
size_t len = strlen(xs_data(src)) + 1;
dest->capacity = ilog2(len) + 1;
dest->size = len - 1;
dest->ptr = src->ptr;
xs_inc_refcnt(src);
}
/* medium string*/
dest = xs_free(dest);
dest->is_ptr = 1;
dest->is_large_string = 0;
size_t len = strlen(xs_data(src)) + 1;
dest->capacity = ilog2(len) + 1;
dest->size = len - 1;
xs_allocate_data(dest, dest->size, 0);
memcpy(xs_data(dest), xs_data(src), len);
return dest;
}
```
隨機產生大小為 325 的字串並做1 ~ 10000次來比較以 xs_copy 來複製字串與以 strncpy 來複製所花費的時間比較
可看出大字串以 xs_copy (CoW的機制) 來複製比以 strncpy 來的有效率

### Refinement
與老師討論實驗設計的部分後, 從新設計 CoW 的實驗, 再實驗設計上應該著重於重複複製相同的字串才能顯現 CoW 機制的效能, 而一直非複製不同的字串, 因此從新設計實驗, 此實驗為針對字串長度為 325 的字串, 並對相同字串作複製1000次, 並重複此動作 1000次
由圖中可看出, 實行 CoW 機制的 xs_copy 效能較 strncpy 佳
藍色為 xs_copy 橘色為strncpy

假設資料為常態分佈, 透過計算將 outlier 去除, 並對剩餘的數據取平均值
| function | 原始平均 | 標準差 | 信賴區間(95%) | 信賴區間(95%) | 去除 outlier 平均值 |
| -------- | -------- | -------- | --------------- | --------------- | --- |
| xs_copy | 450.555 | 232.4967 | -14.4385 | 915.5465 | 424.81 |
| strncpy | 1253.087 | 124.3488 | 1004.3892 | 1501.7847 | 1241.602 |
Conclusion:
在此實驗中, 對相同字串進行複製1000次, 實做 CoW 的 xs_copy 較 strncpy 效能快將近3倍
:::success
延伸問題:設計實驗,確認上述字串處理最佳化 (針對空間效率) 帶來的效益,應考慮到 locality of reference,善用 GDB 追蹤和分析,確認字串的確符合預期地配置於 stack 或 heap
:::
`追蹤測試程式`
```cpp
int main(int argc, char *argv[])
{
int *a = (int *) malloc(100);
xs str = *xs_tmp("foobarbar");
xs str2 = *xs_tmp("foobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbarfoobarbar");
return 0;
}
```
`GDB 追蹤`
動態配置記憶體位置會出現在 heap, 所以先配置變數 a 來確認 heap 的位址, 接著印出 $rbp, $rsp 來確認 stack 的起始與結束位址, 由姐果可看出短字串 str.data 存在於 stack 中, 而中字串 str.ptr 則是指向 heap 中的記憶體位址, 可知其存在於 heap 中

:::success
延伸問題:嘗試將 quiz2 提到的 string interning 機制整合,提出對空間高度最佳化的字串處理工具函式庫
:::
interning pool and hash table struct
```cpp
struct __xs_node {
xs str;
char* buffer;
uint32_t hash_size;
struct __xs_node *next;
};
struct __xs_pool {
struct __xs_node node[INTERNING_POOL_SIZE];
};
struct __xs_interning {
int lock;
int index;
unsigned size;
unsigned total;
struct __xs_node **hash;
struct __xs_pool *pool;
};
static struct __xs_interning __xs_ctx;
```
考慮到 short string 存於字元 array 中, 且 large string 已經有實做 CoW 機制, 所以 string interning 只針對 medium string
為 function `xs_new` `xs_concat` `xs_trim` `xs_copy` 新增 string interning
`xs_concat` `xs_trim` 再對 string 做修改時,就直接複製一份新的 data 去進行修改而不去管 reference count, 讓原始的 data 繼續留在 interning pool 中等待其他 string 使用, 跟 CoW 的機制有點不同
`xs_new`
```cpp
xs *xs_new(xs *x, const void *p)
{
*x = xs_literal_empty();
size_t len = strlen(p) + 1;
if (len > 16) {
x->capacity = ilog2(len) + 1;
x->size = len - 1;
x->is_ptr = true;
if (len < XS_INTERNING_SIZE) {
x = cstr_interning(p, len - 1, hash_blob(p, len - 1));
}else {
xs_allocate_data(x, x->size, 0);
memcpy(xs_data(x), p, len);
}
}else {
memcpy(x->data, p, len);
x->space_left = 15 - (len - 1);
}
return x;
}
```
`xs_concat`
```cpp
xs *xs_concat(xs *string, const xs *prefix, const xs *suffix)
{
size_t pres = xs_size(prefix), sufs = xs_size(suffix),
size = xs_size(string), capacity = xs_capacity(string);
char *pre = xs_data(prefix), *suf = xs_data(suffix),
*data = xs_data(string);
xs_cow_lazy_copy(string, &data);
if (size + pres + sufs <= capacity) {
if (size + pres + sufs < XS_INTERNING_SIZE) {
string->size = size + pres + sufs;
char *tmp = xalloc(string->size);
memcpy(tmp, pre, pres);
memcpy(tmp + pres, data, size);
memcpy(tmp + pres + size, suf, sufs + 1);
return cstr_interning(tmp, string->size, hash_blob(tmp, string->size));
}
memmove(data + pres, data, size);
memcpy(data, pre, pres);
memcpy(data + pres + size, suf, sufs + 1);
if (xs_is_ptr(string)) {
string->size = size + pres + sufs;
} else
string->space_left = 15 - (size + pres + sufs);
} else {
//TODO: fix grow function
xs tmps = xs_literal_empty();
xs_grow(&tmps, size + pres + sufs);
char *tmpdata = xs_data(&tmps);
memcpy(tmpdata + pres, data, size);
memcpy(tmpdata, pre, pres);
memcpy(tmpdata + pres + size, suf, sufs + 1);
if (size + pres + sufs < XS_INTERNING_SIZE) {
return cstr_interning(tmpdata, xs_size(&tmps), hash_blob(tmpdata, xs_size(&tmps)));
}
xs_free(string);
*string = tmps;
string->size = size + pres + sufs;
}
return string;
}
```
`xs_trim`
```cpp
xs *xs_trim(xs *x, const char *trimset)
{
if (!trimset[0])
return x;
char *dataptr = xs_data(x), *orig = dataptr;
if (xs_size(x) > 16 && xs_size(x) < XS_INTERNING_SIZE) {
char* tmp = xalloc(xs_size(x));
memcpy(tmp, dataptr, xs_size(x));
orig = tmp;
} else {
if (xs_cow_lazy_copy(x, &dataptr))
orig = dataptr;
}
/* similar to strspn/strpbrk but it operates on binary data */
uint8_t mask[32] = {0};
#define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8)
#define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8)
size_t i, slen = xs_size(x), trimlen = strlen(trimset);
for (i = 0; i < trimlen; i++)
set_bit(trimset[i]);
for (i = 0; i < slen; i++)
if (!check_bit(dataptr[i]))
break;
for (; slen > 0; slen--)
if (!check_bit(dataptr[slen - 1]))
break;
dataptr += i;
slen -= i;
/* reserved space as a buffer on the heap.
* Do not reallocate immediately. Instead, reuse it as possible.
* Do not shrink to in place if < 16 bytes.
*/
// memmove is safer than memcpy
memmove(orig, dataptr, slen);
/* do not dirty memory unless it is needed */
if (orig[slen])
orig[slen] = 0;
if (xs_is_ptr(x))
x->size = slen;
else
x->space_left = 15 - slen;
if (slen < XS_INTERNING_SIZE && slen > 16) {
x = cstr_interning(orig, slen, hash_blob(orig, slen));
}
return x;
#undef check_bit
#undef set_bit
}
```
`xs_copy`
```cpp
xs *xs_copy(xs *dest, xs *src){
/* short string */
if (!xs_is_ptr(src)){
dest = xs_free(dest);
dest->is_large_string = 0;
dest->is_ptr = 0;
dest->space_left = src->space_left;
memcpy(dest->data, src->data, xs_size(src));
return dest;
}
/* large string */
if (xs_is_large_string(src)){
dest = xs_free(dest);
dest->is_large_string = 1;
dest->is_ptr = 1;
size_t len = strlen(xs_data(src)) + 1;
dest->capacity = ilog2(len) + 1;
dest->size = len - 1;
dest->ptr = src->ptr;
xs_inc_refcnt(src);
return dest;
}
/* Medium */
dest = xs_free(dest);
return cstr_interning(xs_data(src), xs_size(src), hash_blob(xs_data(src), xs_size(src)));
}
```
xs_copy interning 前後效能比較
橘色為 interning 前藍色為 interning 後, 課發現隨著複製次數越高, interning 優化的效果越大

`Valgrind & Massif`
執行Valgrind
>valgrind --tool=massif ./test
>massif-visualizer outputfile
interning 前的記憶體使用量

interning 後的記憶體使用量

`Result`
60.9MB vs 41.3KB, string interning 後在記憶體開銷上有非常顯著的差異, 因此 string interning 同時節省了時間以及記憶體開銷