# 2021q1 Homework3 (quiz3)
contributed by < `AmyLin0210` >
###### tags: `2021q1 Linux`
> [2021q1 第三周測驗題](https://hackmd.io/@sysprog/linux2021-quiz3)
---
## 解釋程式碼運作原理
### 結構定義
`xs` 是一個有 16 byte 的 union,分成三個部份:
- 小字串:字串長度小於等於 15 ,儲存於 stack
- 長度為 `15 - space_left`
- 字串放至於 `data` 字元陣列中
```graphviz
digraph list_add_tail {
nodesep=0.3
rankdir="TD"
node[shape=record]
small [
label = "
{15 bytes | data} |
{ 1 bytes | data with special bits }
"
]
small_1byte [
label = "
{ 4bits | space_left } |
{ 4bits | { is_ptr | is_large_string | flag2 | flag3 } }
"
]
small:b->small_1byte
}
```
- 中字串:
- 長度放置於 `size` 內
-
```graphviz
digraph list_add_tail {
nodesep=0.3
rankdir="TD"
node[shape=record]
medium [
label = "
{ 8 bytes | ptr } |
{ 54 bits | size } |
{ 6 bits | capacity} |
{ 4 bits | <flags> }
"
]
flags [
label = "
{ 4bits | { is_ptr | is_large_string | flag2 | flag3 } }
"
]
medium:flags->flags
}
```
- 長字串:
- 長度放至於 `size` 內
- ptr 最前方的 4 個 bytes 留作 reference count 使用
-
```graphviz
digraph list_add_tail {
nodesep=0.3
rankdir="TD"
node[shape=record]
large [
label = "
{ 8 bytes | <ptr> ptr } |
{ 54 bits | size } |
{ 6 bits | capacity} |
{ 4 bits | <flags> }
"
]
flags [
label = "
{ 4bits | { is_ptr | is_large_string | flag2 | flag3 } }
"
]
ptr [
label = "
{ 4 bytes | reference count } |
{ | string }
"
]
large:flags->flags
large:ptr->ptr
}
```
```cpp
#define MAX_STR_LEN_BITS (54)
typedef union {
/* allow strings up to 15 bytes to stay on the stack
* use the last byte as a null terminator and to store flags
* much like fbstring:
* https://github.com/facebook/folly/blob/master/folly/docs/FBString.md
*/
char data[16];
struct {
uint8_t filler[15],
/* how many free bytes in this stack allocated string
* same idea as fbstring
*/
space_left : 4,
/* if it is on heap, set to 1 */
is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1;
};
/* heap allocated */
struct {
char *ptr;
/* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */
size_t size : MAX_STR_LEN_BITS,
/* capacity is always a power of 2 (unsigned)-1 */
capacity : 6;
/* the last 4 bits are important flags */
};
} xs;
```
### 函式說明
```cpp
int main(int argc, char *argv[])
{
xs string = *xs_tmp("\n foobarbar \n\n\n");
xs_trim(&string, "\n ");
printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string));
xs prefix = *xs_tmp("((("), suffix = *xs_tmp(")))");
xs_concat(&string, &prefix, &suffix);
printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string));
return 0;
}
```
[_Static_assert](https://en.cppreference.com/w/c/language/_Static_assert) 的用途為編譯時期檢測錯誤,若是 expression 等於 0 ,會出現 `compile-time error`。
> _Static_assert ( expression , message )
> The constant expression is evaluated at compile time and compared to zero. If it compares equal to zero, a compile-time error occurs and the compiler must display message (if provided) as part of the error message (except that characters not in basic source character set aren't required to be displayed).
Otherwise, if expression does not equal zero, nothing happens; no code is emitted.
```cpp
/* Memory leaks happen if the string is too long but it is still useful for
* short strings.
*/
#define xs_tmp(x) \
((void) ((struct { \
_Static_assert(sizeof(x) <= MAX_STR_LEN, "it is too big"); \
int dummy; \
}){1}), \
xs_new(&xs_literal_empty(), x))
```
:::success
現在我把 `xs_tmp` 簡化並改寫為以下:
由於 `xs_new(&xs_literal_empty(), x)` 最終會回傳出 `xs` 型態的數值,為了實驗方便,現在處理的數值型態皆為 integer
```cpp
#define test(x) \
((void) ((struct { \
_Static_assert(x <= 0, "it is too big"); \
int dummy; \
}){1}), \
x)
int main () {
int a = xs_tmp(-1);
printf("a = %d\n", a); // -1
}
```
想到那是否也可以將 macro 簡化為
```cpp
((void)param1, param2)
```
測試:
```cpp=
int main () {
int a = ((void)4, -1);
printf("a = %d\n", a); // -1
}
```
在上面測試第 2 行的部份讓我感到不解,又測試了其他的寫法:
```cpp
int main () {
int a = ((void)4, 2, -1);
printf("a = %d\n", a); // -1
}
```
可以發現給 a 的值皆為括號的最後面。
由 [tiffany6022](https://hackmd.io/@tiffany6022/linux2021_quiz3) 同學的筆記中得知,原來這是使用了 [comma operator](https://www.geeksforgeeks.org/comna-in-c-and-c/) 的結果
:::
在這邊使用到了 [Compound Literals](https://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Compound-Literals.html) 來對 struct 做初始化。
```cpp
#define xs_literal_empty() \
(xs) { .space_left = 15 }
```
```cpp
/* lowerbound (floor log2) */
static inline int ilog2(uint32_t n) { return 32 - __builtin_clz(n) - 1; }
```
**xs_new**
把字串儲存方式依照長度分類:
- 短字串:長度小於 16 ,儲存在 stack
- 中字串與長字串:長度大於 16 ,區隔在 內決定,儲存在 stack
```cpp
xs *xs_new(xs *x, const void *p)
{
*x = xs_literal_empty();
size_t len = strlen(p) + 1;
if (len > 16) {
x->capacity = ilog2(len) + 1;
x->size = len - 1;
x->is_ptr = true;
xs_allocate_data(x, x->size, 0);
memcpy(xs_data(x), p, len);
} else {
memcpy(x->data, p, len);
x->space_left = 15 - (len - 1);
}
return x;
}
```
:::warning
TODO: 用 GDB 確認上述程式碼實際的運作,是否符合預期,可善用 `x` 命令
:notes: jserv
:::
:::warning
- 測試程式碼
```cpp
int main(int argc, char *argv[]) {
xs tiny = *xs_tmp("1234567890");
xs mid = *xs_tmp("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx");
xs large = *xs_tmp("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
return 0;
}
```
- 短字串
在使用 gdb 查看的過程中發現最後一個 byte 中 `space_left` 的位置應為最後的 4 個 bits。
```graphviz
digraph list_add_tail {
nodesep=0.3
rankdir="TD"
node[shape=record]
small [
label = "
{15 bytes | data} |
{ 1 bytes | data with special bits }
"
]
small_1byte [
label = "
{ 4bits | { flag2 | flag3 | is_large_string | is_ptr } } |
{ 4bits | space_left }
"
]
small:b->small_1byte
}
```
```cpp
// 印出短字串的架構
(gdb) p tiny
$2 = {data = "1234567890\000\000\000\000\000\005", {filler = "1234567890\000\000\000\000", space_left = 5 '\005', is_ptr = 0 '\000', is_large_string = 0 '\000', flag2 = 0 '\000', flag3 = 0 '\000'}, {ptr = 0x3837363534333231 <error: Cannot access memory at address 0x3837363534333231>, size = 12345, capacity = 20}}
(gdb) x/16c &tiny
0x7fffffffdeb0: 49 '1' 50 '2' 51 '3' 52 '4' 53 '5' 54 '6' 55 '7' 56 '8'
0x7fffffffdeb8: 57 '9' 48 '0' 0 '\000' 0 '\000' 0 '\000' 0 '\000' 0 '\000' 5 '\005'
// 印出最後一個 byte,確認最後 4 個 bits 應為 space_left
(gdb) x/t 0x7fffffffdebf
0x7fffffffdebf: 00000101
```
- 中字串
```graphviz
digraph list_add_tail {
nodesep=0.3
rankdir="TD"
node[shape=record]
medium [
label = "
{ 8 bytes | ptr } |
{ 48 bits | size } |
{ 2 bits | capacity} |
{ 6 bits | size }|
{ 4 bits | <flags> } |
{ 4 bits | capacity }
"
]
flags [
label = "
{ 4bits | { flag2 | flag3 | is_large_string | is_ptr } }
"
]
medium:flags->flags
}
```
```cpp
// 從這邊得到此字串的 `size` 為 19
(gdb) p mid
$3 =
{
data = "\240\242UUUU\000\000\023\000\000\000\000\000@\021",
{
filler = "\240\242UUUU\000\000\023\000\000\000\000\000@",
space_left = 1 '\001',
is_ptr = 1 '\001',
is_large_string = 0 '\000',
flag2 = 0 '\000',
flag3 = 0 '\000'
}, {
ptr = 0x55555555a2a0 'x' <repeats 19 times>,
size = 76,
capacity = 7
}
}
(gdb) x/16 &mid
0x7fffffffded0: 10100000 10100010 01010101 01010101 01010101 01010101 00000000 00000000
0x7fffffffded8: 01001100 00000000 00000000 00000000 00000000 00000000 11000000 00010001
// 取出最後一個 byte 來查看,可以發現 `is_ptr` 位於第 3 個 bit
(gdb) x/t 0x7fffffffdedf
0x7fffffffdedf: 00010001
// 從這邊找到 capacity 的位置為最後兩個 byte 的 11000000 00001111
(gdb) set mid.capacity = 0b111111
(gdb) x/16t &mid
0x7fffffffdc30: 10100000 10100010 01010101 01010101 01010101 01010101 00000000 00000000
0x7fffffffdc38: 01001100 00000000 00000000 00000000 00000000 00000000 11000000 00011111
// 從這邊找到 flag2 位於最後一個 byte 的第 1 個位置
(gdb) set mid.flag2 = 0b1
(gdb) x/16t &mid
0x7fffffffdc30: 10100000 10100010 01010101 01010101 01010101 01010101 00000000 00000000
0x7fffffffdc38: 01001100 00000000 00000000 00000000 00000000 00000000 11000000 01011111
```
- 長字串
```graphviz
digraph list_add_tail {
nodesep=0.3
rankdir="TD"
node[shape=record]
large [
label = "
{ 8 bytes | <ptr> ptr } |
{ 48 bits | size } |
{ 2 bits | capacity } |
{ 6 bits | size} |
{ 4 bits | <flags> } |
{ 4 bits | capacity }
"
]
flags [
label = "
{ 4bits | { flag2 | flag3 | is_large_string | is_ptr } }
"
]
ptr [
label = "
{ 4 bytes | reference count } |
{ | string }
"
]
large:flags->flags
large:ptr->ptr
}
```
```cpp
(gdb) p large
$12 = {
data = "ТUUUU\000\000\000\001\000\000\000\000@2",
{
filler = "ТUUUU\000\000\000\001\000\000\000\000@", space_left = 2 '\002',
is_ptr = 1 '\001',
is_large_string = 1 '\001',
flag2 = 0 '\000',
flag3 = 0 '\000'
},{
ptr = 0x55555555a2d0 "\001",
size = 256,
capacity = 9
}
}
(gdb) x/16t &large
0x7fffffffdef0: 11010000 10100010 01010101 01010101 01010101 01010101 00000000 00000000
0x7fffffffdef8: 00000000 00000001 00000000 00000000 00000000 00000000 01000000 00110010
// 從這邊可以看到 is_large_string 位在第 2 個 bit
(gdb) x/t 0x7fffffffdeff
0x7fffffffdeff: 00110010
// 從這裡可以發現在長字串中的前 4 個 byte 有成功的被 set 程 referecne count
(gdb) xs_set_refcnt(&large, 0xff)
(gdb) x/16t large.ptr
0x55555555a330: 11111111 00000000 00000000 00000000 01100001 01100001 01100001 01100001
0x55555555a338: 01100001 01100001 01100001 01100001 01100001 01100001 01100001 01100001
```
- 字串儲存於 heap 或 stack
```cpp
int main() {
xs mid = *xs_tmp("abcdefghijklmnopqrstuvwxyz");
xs short = *xs_tmp("123");
return 0;
}
```
我使用了 `rbp` 與 `rsp` 來分別印出 stack 的底部與頂部,可以發現在 `short` 短字串中,他的資料被儲存於 stack ,而 `mid` 中字串則被儲存於 heap。
```cpp
(gdb) p $rbp
$1 = (void *) 0x7fffffffdf20
(gdb) p $rsp
$2 = (void *) 0x7fffffffdec0
(gdb) x xs_data(&mid)
0x5555555592a0: 0x64636261
(gdb) x xs_data(&short)
0x7fffffffdef0: 0x00333231
```
:::
**xs_allocate_data**
在這裡將字串分別處理中字串以及長字串,差別在於中字串直接儲存字串的位置,長字串在 `ptr` 的最前面 4 個 byte 預留空間儲存 reference count。
```cpp
#define LARGE_STRING_LEN 256
```
```cpp
static void xs_allocate_data(xs *x, size_t len, bool reallocate)
{
/* Medium string */
if (len < LARGE_STRING_LEN) {
x->ptr = reallocate ? realloc(x->ptr, (size_t) 1 << x->capacity)
: malloc((size_t) 1 << x->capacity);
return;
}
/* Large string */
x->is_large_string = 1;
/* The extra 4 bytes are used to store the reference count */
x->ptr = reallocate ? realloc(x->ptr, (size_t)(1 << x->capacity) + 4)
: malloc((size_t)(1 << x->capacity) + 4);
xs_set_refcnt(x, 1);
}
```
```cpp
static inline void xs_set_refcnt(const xs *x, int val)
{
*((int *) ((size_t) x->ptr)) = val;
}
```
**xs_data**
取出儲存於 heap 空間內的字串,也就是中字串及長字串。
由於長字串的最前面 4 個 byte 用來儲存 reference count ,因此須從指標開始的第四個位置開始取值。
```cpp
static inline char *xs_data(const xs *x)
{
if (!xs_is_ptr(x))
return (char *) x->data;
if (xs_is_large_string(x))
return (char *) (x->ptr + 4);
return (char *) x->ptr;
}
```
**xs_trim**
函式的目標為將指定字串中頭尾含有 `trimset` 內的字元給移除。
- 在 18 行的迴圈,建立 trimset 的表格
- 在 20 行的迴圈,從字串前方查看是否包含須移除的字元
- 在 23 行的迴圈,從字串後方查看是否包含須移除的字元
```cpp=
xs *xs_trim(xs *x, const char *trimset)
{
if (!trimset[0])
return x;
char *dataptr = xs_data(x), *orig = dataptr;
if (xs_cow_lazy_copy(x, &dataptr))
orig = dataptr;
/* similar to strspn/strpbrk but it operates on binary data */
uint8_t mask[32] = {0};
#define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8)
#define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8)
size_t i, slen = xs_size(x), trimlen = strlen(trimset);
for (i = 0; i < trimlen; i++)
set_bit(trimset[i]);
for (i = 0; i < slen; i++)
if (!check_bit(dataptr[i]))
break;
for (; slen > 0; slen--)
if (!check_bit(dataptr[slen - 1]))
break;
dataptr += i;
slen -= i;
/* reserved space as a buffer on the heap.
* Do not reallocate immediately. Instead, reuse it as possible.
* Do not shrink to in place if < 16 bytes.
*/
memmove(orig, dataptr, slen);
/* do not dirty memory unless it is needed */
if (orig[slen])
orig[slen] = 0;
if (xs_is_ptr(x))
x->size = slen;
else
x->space_left = 15 - slen;
return x;
#undef check_bit
#undef set_bit
}
```
**xs_cow_lazy_copy**
檢查 `x` 的 reference count 是否大於 1 ,
若大於 1 ,表示有一個以上的指標指向該字串的位置。
在 reference count 大於一的情況下,需要額外 malloc 出空間,複製一份相同內容的字串來改動。
- reference count 減 1
- 透過 `xs_allocate_data` malloc 出一個新的存字串的位置
- 將 `data` 內的字串複製進 `x` 剛剛創建空間
- `data` 指向 `x` 內字串指標的位置
- 回傳 true
```cpp
static bool xs_cow_lazy_copy(xs *x, char **data)
{
if (xs_get_refcnt(x) <= 1)
return false;
/* Lazy copy */
xs_dec_refcnt(x);
xs_allocate_data(x, x->size, 0);
if (data) {
memcpy(xs_data(x), *data, x->size);
/* Update the newly allocated pointer */
*data = xs_data(x);
}
return true;
}
```
**xs_dec_refcnt**
將 `x` 的 reference count 減 1
```cpp=
static inline int xs_dec_refcnt(const xs *x)
{
if (!xs_is_large_string(x))
return 0;
return --(*(int *) ((size_t) x->ptr));
}
```
**xs_concat**
在第 4 行中的 capacity 會計算出目前字串 `string` 的長度限制,以下分成兩個狀況討論:
- 新字串的長度小於等於舊字串 `capacity`
- 依序將需要連接的字串放入 `string` 內的空間中。
-
```graphviz
digraph list_add_tail {
nodesep=0.3
rankdir="TD"
node[shape=record]
string [
label = "
{ size | } |
{ pres | pre } |
{ size | data } |
{ sufs | suf }
"
]
}
```
- 新字串的度大於舊字串 `capacity`
- 宣告一個新的 `xs tmp`,給予它足夠儲存新字串的空間
- 將字串依序放入空 `xs tmp` 的空間內
- 將舊的 `xs string` 指向的空間 free 掉,將其指向 `tmp`
```cpp=
xs *xs_concat(xs *string, const xs *prefix, const xs *suffix)
{
size_t pres = xs_size(prefix), sufs = xs_size(suffix),
size = xs_size(string), capacity = xs_capacity(string);
char *pre = xs_data(prefix), *suf = xs_data(suffix),
*data = xs_data(string);
xs_cow_lazy_copy(string, &data);
if (size + pres + sufs <= capacity) {
memmove(data + pres, data, size);
memcpy(data, pre, pres);
memcpy(data + pres + size, suf, sufs + 1);
if (xs_is_ptr(string))
string->size = size + pres + sufs;
else
string->space_left = 15 - (size + pres + sufs);
} else {
xs tmps = xs_literal_empty();
xs_grow(&tmps, size + pres + sufs);
char *tmpdata = xs_data(&tmps);
memcpy(tmpdata + pres, data, size);
memcpy(tmpdata, pre, pres);
memcpy(tmpdata + pres + size, suf, sufs + 1);
xs_free(string);
*string = tmps;
string->size = size + pres + sufs;
}
return string;
}
```
:::info
memmove v.s memcpy
- memmove
- The memory areas may overlap: copying takes place as though the bytes in src are first copied into a temporary array that **does not overlap src or dest, and the bytes are then copied from the temporary array to dest**.
- memcpy
- The memcpy() function copies n bytes from memory area src to memory area dest. **The memory areas must not overlap.**
以上節錄自 memmove 與 memcpy 的 Linux Programmer's Manual
在 `xs_concat` 第 12 行的地方目標是將 `data` 往後位移 `pres` 個位置,若 `pres` 小於 `size` , src 與 dist 將有可能重疊,所以才必須使用 `memmove` 而非使用 `memcpy`
:::
**xs_capacity**
- 短字串:`x` 內的字串放置於 stack 空間,回傳 15
- 長字串:`x` 內的字串放置於 heap 空間,回傳 $2^{capacity}$ - 1
```cpp
static inline size_t xs_capacity(const xs *x)
{
return xs_is_ptr(x) ? ((size_t) 1 << x->capacity) - 1 : 15;
}
```
**xs_grow**
在這邊有兩種情況需要被討論
- 原本為短字串,需要增長為中或長字串,並使用 heap 來儲存資料
- 原本就為中或長字串,延伸空間給它
```cpp=
/* grow up to specified size */
xs *xs_grow(xs *x, size_t len)
{
char buf[16];
if (len <= xs_capacity(x))
return x;
/* Backup first */
if (!xs_is_ptr(x))
memcpy(buf, x->data, 16);
x->is_ptr = true;
x->capacity = ilog2(len) + 1;
if (xs_is_ptr(x)) {
xs_allocate_data(x, len, 1);
} else {
xs_allocate_data(x, len, 0);
memcpy(xs_data(x), buf, 16);
}
return x;
}
```
:::warning
在程式碼中的第 16 行目標為檢查字串是否儲存於 heap,但由於第 13 行已經先把 `is_ptr` 改為 `true` 故永遠為真,因此需要修改程式碼為以下:
```diff=
/* grow up to specified size */
xs *xs_grow(xs *x, size_t len)
{
char buf[16];
if (len <= xs_capacity(x))
return x;
/* Backup first */
if (!xs_is_ptr(x))
memcpy(buf, x->data, 16);
- x->is_ptr = true;
x->capacity = ilog2(len) + 1;
if (xs_is_ptr(x)) {
xs_allocate_data(x, len, 1);
} else {
xs_allocate_data(x, len, 0);
memcpy(xs_data(x), buf, 16);
}
+ x->is_ptr = true;
return x;
}
```
:::