2020q2 Homework2 (quiz2)

# 2020q2 Homework2 (quiz2) contributed by < `eecheng87` > ###### tags: `linux2020` ## 解釋程式碼運作原理，並提出改進方案 ### Describe what did each function do * union `xs` ```cpp typedef union { char data[16]; struct { uint8_t filler[15], space_left : 4, /* if it is on heap, set to 1 */ is_ptr : 1, flag1 : 1, flag2 : 1, flag3 : 1; }; /* heap allocated */ struct { char *ptr; /* supports strings up to 2^54 - 1 bytes */ size_t size : 54, /* capacity is always a power of 2 (unsigned)-1 */ capacity : 6; /* the last 4 bits are important flags */ }; } xs; ``` If you are not sure what exactly layout of its member, following is address of member: ```shell (gdb) p &string $1 = (xs *) 0x7fffffffdc20 (gdb) p &string->data $2 = (char (*)[16]) 0x7fffffffdc20 (gdb) p &string->filler $3 = (uint8_t (*)[15]) 0x7fffffffdc20 (gdb) p &string->space_left $4 = (uint8_t *) 0x7fffffffdc2f "" (gdb) p &string->is_ptr $5 = (uint8_t *) 0x7fffffffdc2f "" (gdb) p &string->flag1 $6 = (uint8_t *) 0x7fffffffdc2f "" (gdb) p &string->ptr $7 = (char **) 0x7fffffffdc20 (gdb) p &string->size $8 = (size_t *) 0x7fffffffdc28 (gdb) p &string->capacity $9 = (size_t *) 0x7fffffffdc28 (gdb) p &string->flag2 $10 = (uint8_t *) 0x7fffffffdc2f "" (gdb) ``` * `space_left`, `is_ptr`, `flags` share the same 8 bits `uint8_t` * each struct's start address is as same as `data`. (characteristic of union) * The variable of last struct only use 60 bits. Others 4 bits are used by `is_ptr`, `flag`. * `static inline bool xs_is_ptr(const xs *x)` The purpose of it is as same as its function name, check whether it is pointer which is one bit variable (see also: [bit field](https://hackmd.io/@sysprog/c-bitfield)) in struct. * `static inline size_t xs_size(const xs *x)` * `static inline char *xs_data(const xs *x)` If string isn't short, it stores in heap. When accessing long string, we need read `ptr` in union. * macro `xs_tmp(x)` This macro have two jobs to do which is seperated by comma. First, check whether input string is less than 16. If not, you will get error message by `_Static_assert` after compile. Second, call function `xs_new` and check whether input `x` is valid through concat `""` and `x`. * macro `xs_literal_empty()` Use [compound literal](https://en.cppreference.com/w/c/language/compound_literal) to update value of member `space_left`. * `xs *xs_new(xs *x, const void *p)` Illustrate in several sections * variable `mask` : its complete definition is `uint8_t mask[32]` which contain 256 bits (32 * 8). Each character can map into this array(one to one). The most powerful part is that `mask` can record $2^{256}$ posibilities. e.g. When value of `mask[0]` is 3, it means we don't want character `NUL` and `SOH`. * macro `set_bit(byte)` : map input character into `mask`. e.g. If input is `'\n'` which denotes $1010_b$, `mask[1]` will be set to 4 ($100_b$) through `|` operation. Divide and shift operation in this macro you can view them as hash operation. Hash operation make mapping feasible. * macro `check_bit(byte)` : check bit of `mask` . e.g. If you want to check '\n' is in `mask` or not, just check bit 3 in `mask[1]` through `&` operation. If match, you will get 1 as r-value. * Explain mechanism of the function: * `const char *trimset` store the character you want to delete. e.g. pass `\n ` means you want to remove character '\n' and ' '. * ```cpp // this for loop aim at maping // all charater you don't want // into array mask for (i = 0; i < trimlen; i++) set_bit(trimset[i]); // start from left of input string, // break until encounter character // you want to keep for (i = 0; i < slen; i++) if (!check_bit(dataptr[i])) break; // start from right of input string // break until encounter character // you want to keep for (; slen > 0; slen--) if(!check_bit(dataptr[slen - 1])) break; ``` * `dataptr += i` : re-locate real start point of string. * [memmove](http://tw.gitbook.net/c_standard_library/c_function_memmove.html) * `xs *xs_grow(xs *x, size_t len)` : let allocated memory larger. * If `is_ptr` in `xs` set -> current memory is allocated in heap. Just call `realloc` to acquire larger memory. * If not, move string from stack to heap(`malloc`) * `xs *xs_concat(xs *string, const xs *prefix, const xs *suffix)` * if current size can accommodate string after concat ```cpp // move `data` to middle memmove(data + pres, data, size); // copy prefix to front of string memcpy(data, pre, pres); // append sufix to middle memcpy(data + pres + size, suf, sufs + 1); string->space_left = 15 - (size + pres + sufs); ``` * if not ```cpp xs tmps = xs_literal_empty(); xs_grow(&tmps, size + pres + sufs); char *tmpdata = xs_data(&tmps); // re-arrange string: pre + data + suf // same as above procedure memcpy(tmpdata + pres, data, size); memcpy(tmpdata, pre, pres); memcpy(tmpdata + pres + size, suf, sufs + 1); xs_free(string); *string = tmps; string->size = size + pres + sufs; ``` ## 提供字串複製 (應考慮到 CoW) 和類似 strtok 功能的函式實作 ### xs_cpy * 若為短字串的複製，就直接複製到 `data` * 若為超過 16 個字元的字串，則讓 `ptr` 指到同一個地方。 * 利用預留的 `flag1` 來紀錄此字串為共享記憶體，若接著有寫的動作，則需要重新要一塊記憶體寫上對應的字串，並且清除 `flag1`。 * 其他存在的函數可能都要微調，因為要考慮 CoW 議題。 ```cpp xs *xs_cpy(xs *dest, xs *src) { if (xs_is_ptr(src)){ /* string is on heap */ dest->is_ptr = true; dest->ptr = src->ptr; dest->flag1 = true; dest->size = xs_size(src); } else { /* string is on stack */ memcpy(dest->data, src->data , xs_size(src)); dest->is_ptr = false; dest->space_left = 15 - xs_size(src); } return dest; } ``` ```cpp main() { xs string = *xs_tmp("foobarbar"); xs prefix = *xs_tmp("(((ffss"), suffix = *xs_tmp(")))"); xs_concat(&string, &prefix, &suffix); xs string_cpy = *xs_cpy(&xs_literal_empty(),&string); printf("[%s] : %2zu\n", xs_data(&string_cpy), xs_size(&string_cpy)); } ``` 預期輸出 ``` [(((ffssfoobarbar)))] : 19 ``` 額外檢查兩者的字串是否真的為 copy ```cpp printf("string ptr start at %p, cpy's ptr start at %p\n",string.ptr,string_cpy.ptr); ``` 預期輸出 ``` string ptr start at 0x56098bbac260, cpy's ptr start at 0x56098bbac260 ``` 值得注意的是在 heap 時，要更新 `dest->size` 大小，不能用 `strlen(src->ptr)`，結果會錯，目前尚未找到原因。雖然以上版本的 `xs_cpy` 暫時沒有問題了，但是我們尚未解決其他函數在對複製的字串操作時，應該要做什麼措施。 :::warning TODO: 學習 [folly::fbstring](https://github.com/facebook/folly/blob/master/folly/FBString.h) 的記憶體佈局策略，將 CoW 所需要的 [reference counting](https://en.wikipedia.org/wiki/Reference_counting) 嵌入到結構體成員中，如此空間運用的效率會更高。 :notes: jserv ::: * 當操作複製的字串，原本那份也會被一起改 ```cpp xs string = *xs_new(&string,"aoaoaoaofoobarbar"); xs string_cpy = *xs_cpy(&xs_literal_empty(),&string); xs prefix = *xs_tmp("(((ffss"), suffix = *xs_tmp(")))"); xs_concat(&string_cpy, &prefix, &suffix); printf("[%s] : %2zu\n", xs_data(&string_cpy), xs_size(&string_cpy)); printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string)); ``` ``` [(((ffssaoaoaoaofoobarbar)))] : 17 [(((ffssaoaoaoaofoobarbar)))] : 17 ``` 製作 reference count，我選擇在結構體加一個變數 `refcnt`。 ```cpp struct { char *ptr; size_t size : 54, capacity : 6; int *refcnt; }; ``` 因為共享變數，所以要宣告成 pointer。在每次複製時會遞增，需要在 `xc_cpy` 內增加一些內容 ```cpp if (xs_is_ptr(src)) { ... if (!src->refcnt) { dest->refcnt = src->refcnt = (int *) malloc(sizeof(int)); *(dest->refcnt) = 1; } else { dest->refcnt = src->refcnt; *(dest->refcnt) += 1; } } ``` ```cpp main() { xs string = *xs_new(&string, "aoaoaoaofoobarbar"); xs string_cpy = *xs_cpy(&xs_literal_empty(),&string); xs hihi = *xs_cpy(&xs_literal_empty(), &string); xs qq = *xs_cpy(&xs_literal_empty(), &hihi); printf("%d\n", *string_cpy.refcnt); } ``` ``` 3 ``` 符合預期，總共有四份，複製三次，所以 `refcnt` 應該為 3 沒錯。 * 接著著手修改 `xs_concat` * 若 concat 後大小仍小於 `capacity`，代表我們會到原本那塊共享的記憶體。所以我們要額外在 `malloc` 一塊，當作 concat 結果存放的地方。此外，記得修改一些變數。如：`refcnt` ```cpp if (size + pres + sufs <= capacity) { if (string->flag1 && string->refcnt && *(string->refcnt) > 0) { /* create new space */ *(string->refcnt) -= 1; string->flag1 = false; data = string->ptr = (char *) malloc(sizeof(char) * capacity); if (*(string->refcnt) == 0){ free(string->refcnt); string->refcnt = NULL; } } ... } ``` * 若大於 `capacity`，因為尚未考慮 CoW 版本的程式碼本來就會再要一塊新的記憶體了，所以這裡不用在判斷使否需要額外取得記憶體。但是我們要做的是判斷原本的 `string` 是否為共享。如果是，我們不能 `free` 它，因為別人還在用。 ```cpp else { ... if (string->flag1 && string->refcnt && *(string->refcnt) > 0) { *(string->refcnt) -= 1; if(*(string->refcnt) == 0) { free(string->refcnt); string->refcnt = NULL; } } else { xs_free(string); } ... ```` * 檢查是否正確 ```cpp xs prefix = *xs_tmp("(((ffss"), suffix = *xs_tmp(")))"); xs string = *xs_new(&string,"aoaoaoaofoobarbar"); xs string_cpy = *xs_cpy(&xs_literal_empty(),&string); xs_concat(&string_cpy, &prefix, &suffix); printf("[%s] : %2zu\n", xs_data(&string_cpy), xs_size(&string_cpy)); printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string)); printf("%d\n",*string.refcnt); ``` ``` [(((ffssaoaoaoaofoobarbar)))] : 27 [aoaoaoaofoobarbar] : 17 0 ``` 正確，原本的字串沒有被改到。 * 著手修改 `xs_trim` * 尚未考慮 CoW 遇到的問題 ```cpp xs prefix = *xs_tmp("((((((("), suffix = *xs_tmp("))))))))))))"); xs string = *xs_new(&string,"aoaoaoaofoobarbar"); xs_concat(&string, &prefix, &suffix); xs string_cpy = *xs_cpy(&xs_literal_empty(),&string); xs_trim(&string_cpy,"("); printf("[%s] : %2zu\n", xs_data(&string_cpy), xs_size(&string_cpy)); printf("[%s] : %2zu\n", xs_data(&string), xs_size(&string)); ``` ``` [aoaoaoaofoobarbar))))))))))))] : 29 [aoaoaoaofoobarbar))))))))))))] : 36 ``` 很明顯， trim 改到共享記憶體的內容。 * 開始修改： ```cpp if (x->flag1 && x->refcnt && *(x->refcnt) > 0) { x->ptr = orig = (char *) malloc(sizeof(char) * strlen(x->ptr)); *(x->refcnt) -= 1; if(*(x->refcnt) == 0) { free(x->refcnt); x->refcnt = NULL; } } memmove(orig, dataptr, slen); /* do not dirty memory unless it is needed */ if (orig[slen]) orig[slen] = 0; ... ``` ``` [aoaoaoaofoobarbar))))))))))))] : 29 [(((((((aoaoaoaofoobarbar))))))))))))] : 36 ``` 在 trim 中增加判斷，若要裁剪的字串是共享的，則重新再 `malloc` 一塊。 ### xs_strtok ```cpp char *xs_strtok(char *x, const char *delimit) { static char *lastToken = NULL; /* UNSAFE SHARED STATE! */ char *tmp; /* Skip leading delimiters if new string. */ if (x == NULL) { x = lastToken; if (x == NULL) /* End of story? */ return NULL; } else { x += strspn(x, delimit); } /* Find end of segment */ tmp = strpbrk(x, delimit); if (tmp) { /* Found another delimiter, split string and save state. */ *tmp = '\0'; lastToken = tmp + 1; } else { /* Last segment, remember that. */ lastToken = NULL; } return x; } main() { /* Testing xs_strtok */ xs str = *xs_new(&str, "asd:aee:gdw:tfv:ddd"); char *pch = strtok(xs_data(&str), ":"); while (pch != NULL) { printf ("%s ", pch); pch = strtok (NULL, ":"); } printf("\n"); } ``` ``` asd aee gdw tfv ddd ``` 符合預期！ :::warning 將 `strtok` 換為 `strtok_r`，後者是考慮到 [reentrant](https://en.wikipedia.org/wiki/Reentrancy_(computing)) 的實作，之後整合 reference counting 和 POSIX Thread 後，需要這方面的驗證 :notes: jserv ::: [完整程式碼](https://github.com/eecheng87/xstring) ## 設計實驗，確認上述字串處理最佳化 (針對空間效率) 帶來的效益，應考慮到 locality of reference 雖然在還沒實驗前就知道 `xs_cpy` 一定會比原本的直接複製還要省空間。但是還是用 valgrind 來看一下到底差多少。來觀察記憶體和 locality of reference 在兩種複製方法下的差距。額外寫一個效率差的 copy string 方法： ```cpp void cpy(char **dest, char **src) { *dest = (char *) malloc(sizeof(char) * strlen(*src) + 1); strcpy(*dest, *src); } ``` 實驗：假設某情境複製了十萬次，但因為此十萬個變數都還有後續的運算，所以不 `free` 掉。（但是本實驗並沒有做後續的運算，純粹複製而已） ```cpp /* Experiment */ /* original copy */ printf("This is original copy\n"); int n = 100000; char* test1 = "boobooboobooboobooboobooboobooboobooboobooboobooboo"; char* test2; while (n-->0) { cpy(&test2, &test1); //free(test2); } /* xs copy */ xs xs_test1 = *xs_new(&xs_test1, "boobooboobooboobooboobooboobooboobooboobooboobooboo"); printf("This is xs copy\n"); int n = 100000; while (n-->0) { xs xs_test2 = *xs_cpy(&xs_literal_empty(), &xs_test2); //xs_free(&xs_test2); } ``` * 記憶體量用 valgrind 的 massif 來監測 heap 的使用狀況 * `copy` 的狀況 ``` -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 50 16,807,531 6,854,712 4,950,904 1,903,808 0 51 17,103,631 6,976,536 5,038,888 1,937,648 0 52 17,399,731 7,098,360 5,126,872 1,971,488 0 53 17,652,360 7,201,032 5,201,024 2,000,008 0 ``` * `xs_copy` 的狀況 ``` -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 0 0 0 0 0 0 1 147,863 72 64 8 0 2 148,112 1,104 1,088 16 0 3 15,352,211 1,104 1,088 16 0 ``` 因為沒有 `free` 任何記憶體 (heap)，所以可以直接看 total 的最底。這樣優化的複製大概省了 7000 倍的記憶體量。 * locality of reference * `copy` 的狀況 ``` $ sudo perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./xs ``` ``` Performance counter stats for './xs' (5 runs): 18,3685 cache-misses # 78.011 % of all cache refs ( +- 2.58% ) 23,5461 cache-references ( +- 2.43% ) 3769,3837 instructions # 1.71 insn per cycle ( +- 0.03% ) 2201,8832 cycles ( +- 3.98% ) 0.007494794 seconds time elapsed ( +- 4.24% ) ``` * `xs_copy` 的狀況 ``` Performance counter stats for './xs' (5 runs): 2,0995 cache-misses # 43.695 % of all cache refs ( +- 37.65% ) 4,8049 cache-references ( +- 9.57% ) 1584,7683 instructions # 2.13 insn per cycle ( +- 0.04% ) 744,4251 cycles ( +- 12.67% ) 0.002707433 seconds time elapsed ( +- 14.53% ) ``` 連 cache miss 也是 `xs_cpy` 比較好 ## 在 Linux 核心原始程式碼中，找出類似手法的 SSO 或 CoW 案例並解說我在作業系統課程聽到的 CoW 案例是 fork 會避免馬上 exec 掉，所以採用此機制。我接著再 linux 原始碼中找關於 CoW 和 fork 的應用。找了老半天，終於在 `xfs_inode_fork.c` 底下找的了蛛絲馬跡。若看到名字有 fs 相關的很有可能都和 [btrfs](https://zh.wikipedia.org/wiki/Btrfs) 很有關聯，而這是一種支援寫入時複製（CoW）的檔案系統，原始碼相當龐大，[資料結構](https://kknews.cc/zh-tw/news/roak8gx.html)也十分複雜。 :::warning 從檔案系統實作下手的話，的確太複雜，很容易落得「舉燭」的下場。或許可回頭分析 Linux 核心 fork + exec 這樣的最佳化手法，具體在哪裡實作。當然，程式註解很可能不會直接寫 CoW。 :notes: jserv ::: 首先看到 `xfs_ifork_init_cow`，註解已經講明了 >Initialize an inode's copy-on-write fork. > ```cpp void xfs_ifork_init_cow( struct xfs_inode *ip) { if (ip->i_cowfp) return; ip->i_cowfp = kmem_zone_zalloc(xfs_ifork_zone, KM_NOFS); ip->i_cowfp->if_flags = XFS_IFEXTENTS; ip->i_cformat = XFS_DINODE_FMT_EXTENTS; ip->i_cnextents = 0; } ``` 從初始化來看，作者不希望 `cowfp` 是 NULL，因為在 call 初始化前有先用 `ASSERT` 確認不為 NULL ```cpp=93 ASSERT(ip->i_cowfp == NULL); xfs_ifork_init_cow(ip); ``` 不幸的是，我在這個初始化中被沒有看到 **copy**。這邊預期能夠找到新的節點指向一塊共用的地方，但是反而看到要了一塊新的記憶體 (把 `kmem_zone_zalloc` 往內找會看到 `kmem_cache_size(zone)`)給 `cowfp`。既然找不到 **copy**，乾脆來找 **write** 後要怎麼處理。首先我找到也是在fs系列中的 `xfs_iomap.c` 中發現 `xfs_ilock_for_iomap`。這個函數是專門用來作為 lock 用，之所以要 lock 是因為當 COW writes 時，會發生 allocate 或 delalloc，所以我們需要額外 lock，以免發生錯誤。 ```cpp=692 relock: if (flags & IOMAP_NOWAIT) { if (!xfs_ilock_nowait(ip, mode)) return -EAGAIN; } else { xfs_ilock(ip, mode); } /* * The reflink iflag could have changed since the earlier unlocked * check, so if we got ILOCK_SHARED for a write and but we're now a * reflink inode we have to switch to ILOCK_EXCL and relock. */ if (mode == XFS_ILOCK_SHARED && is_write && xfs_is_cow_inode(ip)) { xfs_iunlock(ip, mode); mode = XFS_ILOCK_EXCL; goto relock; } *lockmode = mode; return 0; } ``` 由 `XFS_ILOCK_SHARED`，`is_write` ，`xfs_is_cow_inode(ip)`，三個參數為依據判斷是否block。而 allocate 為我預期 CoW 被 write 時會做的事，既然已經找到預防 allocate 時會發生問題的函數，那離找到 **COW 如何處理 Write** 又更進一步。接著我在 `xfs_direct_write_iomap_begin()` 找到有使用 block function，所以推測應該處理 **Write** 的程式碼應該在附近。 ```cpp=756 if (imap_needs_cow(ip, flags, &imap, nimaps)) { error = -EAGAIN; if (flags & IOMAP_NOWAIT) goto out_unlock; /* may drop and re-acquire the ilock */ error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared, &lockmode, flags & IOMAP_DIRECT); if (error) goto out_unlock; if (shared) goto out_found_cow; end_fsb = imap.br_startoff + imap.br_blockcount; length = XFS_FSB_TO_B(mp, end_fsb) - offset; } ``` 在 block 呼叫的附近，我找到一行註解 (762)。既然有可能把鎖丟掉，那可能接下來要做的就是 Write 後的處理，更巧的是接著 function name 竟然和 allocate 有關，所以再繼續往內看。在 `xfs_reflink.c` 中有 `xfs_reflink_allocate_cow` 的實作。 ```cpp=379 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp); ``` 這行就是 CoW 遇到對共享資料(唯讀)做**寫**的動作時，其中一樣會發生的事，同時和我們預期的一樣，會配置一塊新的記憶體，然後再把改的資料寫進去。先在把前半部:**拿一塊新的記憶體** 看仔細一點。我們可在 `xfs_trans.c` 中找到 `xfs_trans_alloc` 的實作 ```cpp=266 tp = kmem_zone_zalloc(xfs_trans_zone, 0); ``` 這行就是實際的分配新的記憶體(事實上再往內看實作可以找到要了 `kmem_cache_size(zone)`) 接著我們來找第二部分: **把新資料寫回剛剛拿到的記憶體上**。同樣的，我們再回去 `xfs_reflink_allocate_cow` 找線索 ```cpp=427 convert: xfs_trim_extent(cmap, offset_fsb, count_fsb); /* * COW fork extents are supposed to remain unwritten until we're ready * to initiate a disk write. For direct I/O we are going to write the * data and need the conversion, but for buffered writes we're done. */ if (!convert_now || cmap->br_state == XFS_EXT_NORM) return 0; trace_xfs_reflink_convert_cow(ip, cmap); return xfs_reflink_convert_cow_locked(ip, offset_fsb, count_fsb); ``` 經過 427 行以前的判斷，現在已經可以開始寫了，加上註解的提示，我認為 436, 437 應該就是寫的部分。 Reference: * [What is size_t](https://stackoverflow.com/questions/2550774/what-is-size-t-in-c) * [Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial) * [btrfs](https://lwn.net/Articles/576276/) * [btrfs](https://www.slideshare.net/TsungenHsiao/introduction-to-btrfs-and-zfs)