2021q1 Homework3 (quiz3)

# 2021q1 Homework3 (quiz3) contributed by < [`andy78644`](https://github.com/andy78644) > ###### tags: `2021 linux` --- ## 程式原理首先可以先看到 `xs` 的結構總共可以放入 15 個 byte 大小的字串，其中如果字串小於 15 byte 則放入 stack ，如果超過 15 bytes 則會放入 heap ```cpp typedef union { /* allow strings up to 15 bytes to stay on the stack * use the last byte as a null terminator and to store flags * much like fbstring: * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md */ char data[16]; struct { uint8_t filler[15], /* how many free bytes in this stack allocated string * same idea as fbstring */ space_left : 4, /* if it is on heap, set to 1 */ is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1; }; /* heap allocated */ struct { char *ptr; /* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */ size_t size : MAX_STR_LEN_BITS, /* capacity is always a power of 2 (unsigned)-1 */ capacity : 6; /* the last 4 bits are important flags */ }; } xs; ``` c99 中對於 struct 和 union 的描述中有提到 union 是將由許多成員共用同一空間， stack 中含有 64 個 bit ，因此我們在 heap 的結構中為了避免蓋掉 stack 中最後 4 個用來辨識是否為 head 的 bit，結構只含有 60 個 bit :::warning * ISO/IEC 9899 : [6.2.5] Types > A union type describes an overlapping nonempty set of member objects, each of which has an optionally specified name and possibly distinct type. > As discussed in [6.2.5], a structure is a type consisting of a sequence of members, whose storage is allocated in an ordered sequence, and a union is a type consisting of a sequence of members whose storage overlap. reference: [ISO/IEC 9899:1999](http://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf) ::: c99 提到對於 bitfield 的用法，不能超過物件所包含的空間大小，此結構中為宣告為 `char data[16]` ，而在其他兩個結構中的 bitfield 的宣告皆未超出此結構大小其中也有提到 bit-field 在遇到空間未滿且足夠時，會優先填滿該空間，但如果空間不足時，會依照實作結果去進行變化 :::warning * ISO/IEC 9899 : [6.7.2] Type specifiers > The expression that specifies the width of a bit-field shall be an integer constant expression that has nonnegative value that shall not exceed the number of bits in an object of the type that is specified if the colon and expression are omitted. If the value is zero, the declaration shall have no declarator. * ISO/IEC 9899 : [6.7.2.1] Structure and union specifiers > An implementation may allocate any addressable storage unit large enough to hold a bitfield. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified. reference: [ISO/IEC 9899:1999](http://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf) ::: `xs_allocate_data` 是用來將字串放入 heap 中的函式，首先會先看字串是否為超過256也就是長字串，在 [folly::fbstring](https://github.com/facebook/folly/blob/master/folly/FBString.h) 中，會將字串分成短中長，中長字串皆放在 heap 中，但這邊並未使用到如 `fbstring` 一樣將長字串最佳化，而是直接配置一個空間給長字串，這可以透過 `COW` 的方式來進行改善接著會看字串是否已經分配過空間，如果已經分配過則使用 `realloc` 去重新分配空間大小，如未分配過則使用 `malloc` 去產生一個新的空間來使用最後會將 `x->ptr` 設為 1 代表此字串為放入 heap 的空間在 `xs_set_refcnt` 中將 `x->ptr` 強轉型成 the pointer to int 是為了設定 reference count 而其所代表的大小為 int ，因此要先將 char* 轉為 int* 去進行設定 ```cpp static void xs_allocate_data(xs *x, size_t len, bool reallocate) { /* Medium string */ if (len < LARGE_STRING_LEN) { x->ptr = reallocate ? realloc(x->ptr, (size_t) 1 << x->capacity) : malloc((size_t) 1 << x->capacity); return; } /* Large string */ x->is_large_string = 1; /* The extra 4 bytes are used to store the reference count */ x->ptr = reallocate ? realloc(x->ptr, (size_t)(1 << x->capacity) + 4) : malloc((size_t)(1 << x->capacity) + 4); xs_set_refcnt(x, 1); } static inline void xs_set_refcnt(const xs *x, int val) { *((int *) ((size_t) x->ptr)) = val; } ``` 接下來看建立新的 xs 的方式，依照 main 裡面先呼叫 `xs_tmp` ，其中用到 `_Static_assert` 是定義在 c11 的 <asser.h> 裡，主要用法為當條件不成立時，即會跳出錯誤資訊，詳細如下 :::info [cpp reference : Static assertion](https://en.cppreference.com/w/c/language/_Static_assert) > The constant expression is evaluated at compile time and compared to zero. If it compares equal to zero, a compile-time error occurs and the compiler must display message (if provided) as part of the error message (except that characters not in basic source character set aren't required to be displayed). Otherwise, if expression does not equal zero, nothing happens; no code is emitted. ::: 而這邊會先測試傳入的字串是否超過所能容納的最長字串，長度為一開始所定義的 `MAX_STR_LEN` 為 $2^{54}-1$ ，此長度為字串在 heap 中的 size 為 54 bit 而來，如果是符合字串範圍內，則會利用 `xs_new` 來建立 `xs` ，會先透過 `xs_literal_empty()` 來建立 `xs` ， `space_left` 為零則是空字串，因為存入字串空間的長度為 `15 - space_left` 接下來會比較字串長度是要放入 stack 或是放入 heap ，因為字串最後會有 `\0` ，因此放入 stack 中最長的長度應該為 16，因此 `NNN` 即為16 ```cpp #define MAX_STR_LEN_BITS (54) #define MAX_STR_LEN ((1UL << MAX_STR_LEN_BITS) - 1) #define xs_tmp(x) \ ((void) ((struct { \ _Static_assert(sizeof(x) <= MAX_STR_LEN, "it is too big"); \ int dummy; \ }){1}), \ xs_new(&xs_literal_empty(), x)) #define xs_literal_empty() \ (xs) { .space_left = 15 } xs *xs_new(xs *x, const void *p) { *x = xs_literal_empty(); size_t len = strlen(p) + 1; if (len > NNN) { x->capacity = ilog2(len) + 1; x->size = len - 1; x->is_ptr = true; xs_allocate_data(x, x->size, 0); memcpy(xs_data(x), p, len); } else { memcpy(x->data, p, len); x->space_left = 15 - (len - 1); } return x; } ``` `xs_data` 用來讀取字串，其中分別看短中長，第一個判斷 `xs_is_ptr()` 來判定是否為短字串，如果為短字串則可以直接讀取 stack 上的資料，第二個判斷 `xs_is_large_string(x)` 用來判斷是否為長字串，如果是可以透過 ptr 去讀取字串，而因為在長字串中前面 4 個 byte 也就是一個 int 的長度是存放 reference count，因此須利用 `x->ptr + OFF` 讀取字串，而這個函式的回傳值為指向該字串開頭的指標 ```cpp static inline char *xs_data(const xs *x) { if (!xs_is_ptr(x)) return (char *) x->data; if (xs_is_large_string(x)) return (char *) (x->ptr + OFF); return (char *) x->ptr; } ``` * ccc = mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8 * sss = mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8 `xs_trim` 主要是用來去掉字串頭尾包含 `trimset` 的字串 `xs_cow_lazy_copy` 判斷如果為長字串則複製一份，短中字串則直接修改 `mask[32]` 共有 256 個 bit ，恰好對應到 ASCII 的字元，在 `set_bit` 中將 `trimset` 出現過的字源所對應的數字的 bit 設為 1 ，然後再透過 `checkbit` 去找出是否有一樣的字元，如有一樣字元則將該字元刪除在 `set_bit` 和 `check_bit` 中，先分別將 byte 除以 8 找到在 mask 所在的位元組上，再取 8 的餘數去找到所在位元組的位置而在 18~21 行中，分別從頭和尾去刪除所要刪除的字串在 33 行中用到的 `memmove` ，他會從 src 中複製 n 個 bytes 到 dest ，並且回傳一個指向 dest 的指標 ```clike= void *memmove(void *dest, const void *src, size_t n); ``` :::info 在 man page 中有提到 `memmove()` 有可能產生 overlap 的情形，這個在這支函式即可以看到，當 dataptr 沒有被刪減時，會與 orig 指向的是同個位置，因此複製到 orig 時即是將自己也覆蓋過去 > The memmove() function copies n bytes from memory area src to memory area dest. The memory areas may overlap: copying takes place as though the bytes in src are first copied into a temporary array that does not overlap src or dest, and the bytes are then copied from the temporary array to dest. ::: ```clike= xs *xs_trim(xs *x, const char *trimset) { if (!trimset[0]) return x; char *dataptr = xs_data(x), *orig = dataptr; if (xs_cow_lazy_copy(x, &dataptr)) orig = dataptr; /* similar to strspn/strpbrk but it operates on binary data */ uint8_t mask[32] = {0}; #define check_bit(byte) (CCC) #define set_bit(byte) (SSS) size_t i, slen = xs_size(x), trimlen = strlen(trimset); for (i = 0; i < trimlen; i++) set_bit(trimset[i]); for (i = 0; i < slen; i++) if (!check_bit(dataptr[i])) break; for (; slen > 0; slen--) if (!check_bit(dataptr[slen - 1])) break; dataptr += i; slen -= i; /* reserved space as a buffer on the heap. * Do not reallocate immediately. Instead, reuse it as possible. * Do not shrink to in place if < 16 bytes. */ memmove(orig, dataptr, slen); /* do not dirty memory unless it is needed */ if (orig[slen]) orig[slen] = 0; if (xs_is_ptr(x)) x->size = slen; else x->space_left = 15 - slen; return x; #undef check_bit #undef set_bit } ``` `xs_cow_lazy_copy` 實作 copy on write 的方式，當有對於長字串的修改時，則會產生一個新的空間來存放修改後的字串 ```cpp static bool xs_cow_lazy_copy(xs *x, char **data) { if (xs_get_refcnt(x) <= 1) return false; /* Lazy copy */ xs_dec_refcnt(x); xs_allocate_data(x, x->size, 0); if (data) { memcpy(xs_data(x), *data, x->size); /* Update the newly allocated pointer */ *data = xs_data(x); } return true; } ``` `xs_dec_refcnt` 用來將 reference count 減 1 ```cpp= static inline void xs_inc_refcnt(const xs *x) { if (xs_is_large_string(x)) ++(*(int *) ((size_t) x->ptr)); } static inline int xs_dec_refcnt(const xs *x) { if (!xs_is_large_string(x)) return 0; return --(*(int *) ((size_t) x->ptr)); } static inline int xs_get_refcnt(const xs *x) { if (!xs_is_large_string(x)) return 0; return *(int *) ((size_t) x->ptr); } ``` `xs_concat` 用來將 string 前後分別加上 prefix 和 suffix，在 11-19 行中是當三個字串加起來長度會超出 string 所設下的容量，可以直接將自串接在一起， 12 行是將原 string 的字串往後移 prefix 字串的長度，然後在 13 行時，將 prefix 放入 string 一開始的位置，最後則是將 suffix 放在 string 的尾端如果是超出 string 原本的空間容量，則會到 `grow` 中將容量改成大於字串長度 2 的冪，再將所有字串放入新空間中 ```clike= xs *xs_concat(xs *string, const xs *prefix, const xs *suffix) { size_t pres = xs_size(prefix), sufs = xs_size(suffix), size = xs_size(string), capacity = xs_capacity(string); char *pre = xs_data(prefix), *suf = xs_data(suffix), *data = xs_data(string); xs_cow_lazy_copy(string, &data); if (size + pres + sufs <= capacity) { memmove(data + pres, data, size); memcpy(data, pre, pres); memcpy(data + pres + size, suf, sufs + 1); if (xs_is_ptr(string)) string->size = size + pres + sufs; else string->space_left = 15 - (size + pres + sufs); } else { xs tmps = xs_literal_empty(); xs_grow(&tmps, size + pres + sufs); char *tmpdata = xs_data(&tmps); memcpy(tmpdata + pres, data, size); memcpy(tmpdata, pre, pres); memcpy(tmpdata + pres + size, suf, sufs + 1); xs_free(string); *string = tmps; string->size = size + pres + sufs; } return string; } ```