# 2020q1 Homework2 (quiz2) contributed by < `colinyoyo26` > ###### tags: `linux2020` > [第二週測驗題](https://hackmd.io/@sysprog/linux2020-quiz2) ==第 1 題組的延伸問題== ## 1. 解釋上述程式碼運作原理,可善用 GDB 追蹤和分析 ### `xs_tmp` 在編譯時期檢查 `x` 是 string literal 以及長度小於 16 ,接著呼叫 `xs_new` > 我想請問哪裡可以找到 `xs_new(&xs_literal_empty(), "" x)` `""` 可以檢查 `x` 是不是 string literal 的相關論述 > 已經查過 [C11 規格書](http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1570.pdf)以及 [Using the GNU Compiler Collection](https://gcc.gnu.org/onlinedocs/gcc-7.5.0/gcc.pdf) 沒有找到相關論述 :::warning 用 gdb 做實驗,確認配置的字串在於 stack 還是 read-only section,即可反推是否為 string literal :notes: jserv ::: > 實驗得到 string literal 是在 read-only section ,然後在 [C11 規格書](http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1570.pdf) 5.1.1.2 Translation phases 找到 6. Adjacent string literal tokens are concatenated. 推測出這是編譯器報錯的原因 ```cpp #define xs_tmp(x) \ ((void) ((struct { \ _Static_assert(sizeof(x) <= 16, "it is too big"); \ int dummy; \ }){1}), \ xs_new(&xs_literal_empty(), "" x)) ``` ### `ilog2` 回傳 $\lfloor log_2{n} \rfloor$ ```cpp static inline int ilog2(uint32_t n) { return 32 - __builtin_clz(n) - 1; ``` ### `xs_new` `xs_literal_empty()` 先初始化為空的短字串 (`x->space_left` 設為 15 其他設為 0),若字串長度大於 15 會把字串放到 heap , `malloc` 的空間為 $2^{\lfloor log_2{len} \rfloor + 1}$ (對齊下一個 2 的冪次方),否則會直接放到 stack 因為有 null terminator 所以不用在重設 `is_ptr` ```cpp xs *xs_new(xs *x, const void *p) { *x = xs_literal_empty(); size_t len = strlen(p) + 1; if (len > 16) { x->capacity = ilog2(len) + 1; x->size = len - 1; x->is_ptr = true; x->ptr = malloc((size_t) 1 << x->capacity); memcpy(x->ptr, p, len); } else { memcpy(x->data, p, len); x->space_left = 15 - (len - 1); } return x; } ``` ### `xs_grow` 函式目的為將 `x` 的容量增長到下個二的冪次方 ```cpp xs *xs_grow(xs *x, size_t len) { if (len <= xs_capacity(x)) return x; len = ilog2(len) + 1; if (xs_is_ptr(x)) x->ptr = realloc(x->ptr, (size_t) 1 << len); else { char buf[16]; memcpy(buf, x->data, 16); x->ptr = malloc((size_t) 1 << len); memcpy(x->ptr, buf, 16); } x->is_ptr = true; x->capacity = len; return x; } ``` ### `xs_concat` 函式目的為將 `prefix` 和 `suffix` 串接到 `string` 前後, `string` 的容量夠時只須將字串複製過去即可,否則需要 call `xs_grow` 增加容量到下個二的冪次方,再複製字串,以下是將用到 `memmove` 的原因 Linux Programmer's Manual - The memmove() function copies n bytes from memory area src to memory area dest. The memory areas may overlap ```cpp xs *xs_concat(xs *string, const xs *prefix, const xs *suffix) { size_t pres = xs_size(prefix), sufs = xs_size(suffix), size = xs_size(string), capacity = xs_capacity(string); char *pre = xs_data(prefix), *suf = xs_data(suffix), *data = xs_data(string); if (size + pres + sufs <= capacity) { memmove(data + pres, data, size); memcpy(data, pre, pres); memcpy(data + pres + size, suf, sufs + 1); string->space_left = 15 - (size + pres + sufs); } else { xs tmps = xs_literal_empty(); xs_grow(&tmps, size + pres + sufs); char *tmpdata = xs_data(&tmps); memcpy(tmpdata + pres, data, size); memcpy(tmpdata, pre, pres); memcpy(tmpdata + pres + size, suf, sufs + 1); xs_free(string); *string = tmps; string->size = size + pres + sufs; } return string; } ``` ### `xs_trim` 這個函式的目的是去除掉 `x` 內前後包含於 `trimset` 的字元,函式中兩個 macro 利用 hashing 的手法來檢查字元,首先把一個 `char` 分成 quotient (前 5 個 bits) ,和 remainder (後 3 個 bits) ,再來 `set_bit` 用於插入,`check_bit` 用於確認字元是否被插入 - quotient 用於 index - remainder 數值表示範圍為 0 到 7 , 一一對應 `uint8_t` 的 8 個 bits - 時間複雜度為 $O(slen + trimlen)$ ,若用 brute force 時間複雜度會升到 $O(slen \cdot trimlen)$ ```cpp xs *xs_trim(xs *x, const char *trimset) { if (!trimset[0]) return x; char *dataptr = xs_data(x), *orig = dataptr; /* similar to strspn/strpbrk but it operates on binary data */ uint8_t mask[32] = {0}; #define check_bit(byte) (mask[(uint8_t) byte / 8] CCC 1 << (uint8_t) byte % 8) #define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8) size_t i, slen = xs_size(x), trimlen = strlen(trimset); for (i = 0; i < trimlen; i++) set_bit(trimset[i]); for (i = 0; i < slen; i++) if (!check_bit(dataptr[i])) break; for (; slen > 0; slen--) if (!check_bit(dataptr[slen - 1])) break; dataptr += i; slen -= i; /* reserved space as a buffer on the heap. * Do not reallocate immediately. Instead, reuse it as possible. * Do not shrink to in place if < 16 bytes. */ memmove(orig, dataptr, slen); /* do not dirty memory unless it is needed */ if (orig[slen]) orig[slen] = 0; if (xs_is_ptr(x)) x->size = slen; else x->space_left = 15 - slen; return x; #undef check_bit #undef set_bit } ``` ## 2. 以上述程式碼為基礎,提供字串複製 (應考慮到 CoW) 和類似 [strtok](http://man7.org/linux/man-pages/man3/strtok.3.html) 功能的函式實作 完整的 code 以及編譯方式請見 [GitHub](https://github.com/colinyoyo26/2020q1-linux-course/tree/master/2020q1-quiz2-xstring) 首先是字串複製 `xs_cpy` 的實作以及測試輸出,先從記憶體配置開始 ### `xs` memory layout `malloc` 空間給 `x->ptr` 時會多塞 `sizeof(REFTYPE)` 的空間放 reference count 並把指標移到字串開始位置 ```cpp #define REFTYPE size_t #define OFFSET sizeof(REFTYPE) ``` ```cpp xs->ptr = malloc(((size_t)1 << x->capacity) + OFFSET) + OFFSET; ``` `xs` 的記憶體配置邏輯如下(`xs` 16 bytes, `refcnt` `sizeof(REFTYPE)` bytes) ```graphviz digraph Q { rankdir=RL node [shape=record]; heap [label="refcnt |string..."]; ptr [label = " ptr "]; size [label = "size"]; capacity [label = " capacity "]; flags [lable = "flags"] subgraph cluster_xs { {rank=same ptr size capacity flags} } ptr -> heap } ``` :::warning 改用 mermaid 或 Graphviz 重新製作上方圖例 :notes: jserv ::: > 已用 Graphviz 重繪 以下比較 [folly::fbstring](https://github.com/facebook/folly/blob/master/folly/FBString.h) 的 CoW 機制,看以下五個函式和資料結構就能看出端倪了, `ml_.data_` 會指向 `struct RefCounted` 的 `data_` 當要取 `refCount_` 的值時,會利用計算從 `struct RefCounted` 到 `data_` 的 offset 得到 `refCount_` 的位址 (和 linux linked list 一樣的手法) ```cpp template <class Char> FOLLY_NOINLINE inline void fbstring_core<Char>::initLarge( const Char* const data, const size_t size) { // Large strings are allocated differently size_t effectiveCapacity = size; auto const newRC = RefCounted::create(data, &effectiveCapacity); ml_.data_ = newRC->data_; ml_.size_ = size; ml_.setCapacity(effectiveCapacity, Category::isLarge); ml_.data_[size] = '\0'; } ``` - `data[1]` 是為了留空間給 `\0` ```cpp struct RefCounted { std::atomic<size_t> refCount_; Char data_[1]; ... ``` ```cpp static void incrementRefs(Char* p) { fromData(p)->refCount_.fetch_add(1, std::memory_order_acq_rel); } ``` ```cpp constexpr static size_t getDataOffset() { return offsetof(RefCounted, data_); } ``` ```cpp static RefCounted* fromData(Char* p) { return static_cast<RefCounted*>(static_cast<void*>( static_cast<unsigned char*>(static_cast<void*>(p)) - getDataOffset())); } ``` ```cpp static RefCounted* create(size_t* size) { const size_t allocSize = goodMallocSize(getDataOffset() + (*size + 1) * sizeof(Char)); auto result = static_cast<RefCounted*>(checkedMalloc(allocSize)); result->refCount_.store(1, std::memory_order_release); *size = (allocSize - getDataOffset()) / sizeof(Char) - 1; return result; } ``` ### reference count 以下兩個函式回傳 reference count 和是否具有其他 reference ```cpp extern inline uint8_t xs_refcnt(const xs *x) { return IS_COW && xs_is_ptr(x) ? *(REFTYPE *) (x->ptr - OFFSET) : 1; } ``` ```cpp static inline bool xs_is_ref(const xs *x) { return xs_refcnt(x) > 1; } ``` 增減 reference count , `xs_decr_ref` 發現 reference count 歸零時會呼叫 `xs_free` ```cpp static inline void xs_incr_ref(xs *x) { *(REFTYPE *) (x->ptr - OFFSET) += 1; } ``` ```cpp static inline void xs_decr_ref(xs *x) { if (!--(*(REFTYPE *) (x->ptr - OFFSET))) xs_free(x); } ``` ### CoW 機制 呼叫 `xs_cpy` 時如果不到需要 CoW 的字串長度,則直接複製字串,否則用 CoW 機制直接複製 meta data 並增加 reference count 其中 `!~xs_refcnt(src)` 是為了防止 overflow ```cpp static inline void xs_set_ref(xs *x, xs *ref_to) { if (xs_size(ref_to) < LEN_TO_COW || x->ptr == ref_to->ptr) return; XS_INCR_REF(ref_to); /* copy the meta data */ *x = *ref_to; } ``` ```cpp xs *xs_cpy(xs *dest, xs *src) { if (XS_IS_REF(dest)) XS_DECR_REF(dest); /* too many references or short string * !OFFET for non-COW */ if (!OFFSET || xs_size(src) < LEN_TO_COW || !~xs_refcnt(src) || !xs_is_ptr(src)) { size_t len = xs_size(src); xs_grow(dest, len); memcpy(xs_data(dest), xs_data(src), len); if (xs_is_ptr(dest)) dest->size = len; else dest->space_left = 15 - len; } else xs_set_ref(dest, src); return dest; } ``` 若是字串要被修改時 (e.g. `xs_trim` , `xs_concat`) 會呼叫`xs_cow` 檢查是否有其他參考,若有則需要自己在開一個新的空間 ```cpp static void xs_cow(xs *x) { if (!XS_IS_REF(x)) return; XS_DECR_REF(x); xs_new(x, xs_data(x)); return; } ``` 測試程式: ```cpp xs string, cow1, cow2; void print_info(char* str) { printf("\n%s\n", str); printf("\nstring: %s", xs_data(&string)); printf("\n&string: %p", xs_data(&string)); printf("\nreference count: %d\n", string.ref_cnt); printf("\nstring: %s", xs_data(&cow1)); printf("\n&string: %p", xs_data(&cow1)); printf("\nreference count: %d\n", cow1.ref_cnt); printf("\nstring: %s", xs_data(&cow2)); printf("\n&string: %p", xs_data(&cow2)); printf("\nreference count: %d\n", cow2.ref_cnt); } int main() { xs_new(&string, "gggfoobarbarbarbarbarzzz"); xs_cpy(&cow1, &string); xs_cpy(&cow2, &string); print_info("after cpy from string to cow1 & cow2"); xs prefix = *xs_tmp("((("), suffix = *xs_tmp(")))"); xs_concat(&cow1, &prefix, &suffix); print_info("after concat cow1"); xs_cpy(&cow2, &cow1); print_info("after cpy from cow1 to cow2"); xs_trim(&cow1, ")(zg"); print_info("after trim cow1"); return 0; } ``` 參考執行結果: ``` after cpy from string to cow1 & cow2 #string string: gggfoobarbarbarbarbarzzz &string: 0x1735010 reference count: 3 #cow1 string: gggfoobarbarbarbarbarzzz &string: 0x1735010 reference count: 1 #cow2 string: gggfoobarbarbarbarbarzzz &string: 0x1735010 reference count: 1 after concat cow1 #string string: gggfoobarbarbarbarbarzzz &string: 0x1735010 reference count: 2 #cow1 string: (((gggfoobarbarbarbarbarzzz))) &string: 0x1735450 reference count: 1 #cow2 string: gggfoobarbarbarbarbarzzz &string: 0x1735010 reference count: 1 after cpy from cow1 to cow2 #string string: gggfoobarbarbarbarbarzzz &string: 0x1735010 reference count: 1 #cow1 string: (((gggfoobarbarbarbarbarzzz))) &string: 0x1735450 reference count: 2 #cow2 string: (((gggfoobarbarbarbarbarzzz))) &string: 0x1735450 reference count: 1 after trim cow1 #string string: gggfoobarbarbarbarbarzzz &string: 0x1735010 reference count: 1 #cow1 string: foobarbarbarbarbar &string: 0x1735480 reference count: 1 #cow2 string: (((gggfoobarbarbarbarbarzzz))) &string: 0x1735450 reference count: 1 ``` 接著是 [strtok](http://man7.org/linux/man-pages/man3/strtok.3.html) 的實作 ### `xs_tok` 先簡介 `strtok` 以下是 Linux Programmer's Manual - The `strtok()` function breaks a string into a sequence of zero or more nonempty tokens. On the first call to `strtok()` the string to be parsed should be specified in str. In each subsequent call that should parse the same string, str must be `NULL`. > 第一次呼叫 `strtok` 需要給定欲處理的字串(放在 `str` ),之後 `str` 必須為 `NULL` - The `delim` argument specifies a set of bytes that delimit the tokens in the parsed string. The caller may specify different strings in delim in successive calls that parse the same string. > `delim` 就是字元的集合,用於劃分欲解析字串 - Each call to `strtok()` returns a pointer to a null-terminated string containing the next token. This string does not include the delimiting byte. > 會回傳從第一個 non-delimiting byte. 下一個 delimiting byte (不包含) 之間的字串 - The end of each token is found by scanning forward until either the next delimiter byte is found or until the terminating null byte `\0` is encountered. > 會把下一個 delimiting byte 替換成 `\0` 我照著 `strtok` 的行為實作, `laststr` 指向欲處理字串的開頭, `src_flag` 用於分辨是否要更新 `src->size` (或 `src->leftspace`),並和 `xs_trim` 一樣用 hashing 的方法實作檢查 `delim` > 因為會把下一個 delimiting byte 替換成 `\0` 所以第一次呼叫需要更新 `src->size` ```cpp char *xs_tok(xs *src, const char *delim) { static char *laststr = NULL; char *cur; bool src_flag = 0; if (!src) cur = laststr; else { XS_COW(src); cur = xs_data(src); src_flag = 1; } if (!delim[0] || !cur) return cur; uint8_t mask[32] = {0}; #define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8) #define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8) size_t i, curlen = strlen(cur), delimlen = strlen(delim); for (i = 0; i < delimlen; i++) set_bit(delim[i]); for (i = 0; i < curlen; i++) if (!check_bit(cur[i])) break; cur = cur + i; for (i = 0; i++ < curlen;) if (check_bit(cur[i])) break; *(cur + i) = '\0'; laststr = cur + i + 1; if (src_flag) { if (xs_is_ptr(src)) src->size = i; else src->space_left = 15 - i; } if (!*laststr) laststr = NULL; return cur; } ``` 測試程式: ```cpp #include <stdio.h> #include "xs.h" int main(void) { xs string; char token[8] = "HELLO W", *tem; xs_new(&string, "HELLOW fooHELLOWbarHbarfooLbarLbarObarW"); printf("#token\n%s\n", token); printf("\n#initial string\n%s\n\n", xs_data(&string)); printf("%s\n", xs_tok(&string, token)); while(tem = xs_tok(NULL, token)) printf("%s\n", tem); return 0; } ``` 參考執行結果: ``` #token HELLO W #initial string HELLOW fooHELLOWbarHbarfooLbarLbarObarW foo bar barfoo bar bar bar ``` :::warning TODO: 考慮到之後會在多執行緒環境,用 `strtok_r` 實作手法改寫 :notes: jserv ::: > 已實作 `xs_tok_r` (reentrant version of `xs_tok`),以及測試程式(用 pthread 測試) ### `xs_tok_r` ```cpp char *xs_tok_r(xs *src, const char *delim, char **saveptr) { char *cur; bool src_flag = 0; if (!src) cur = *saveptr; else { XS_COW(src); cur = xs_data(src); src_flag = 1; } if (!delim[0] || !cur) return cur; uint8_t mask[32] = {0}; #define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8) #define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8) size_t i, curlen = strlen(cur), delimlen = strlen(delim); for (i = 0; i < delimlen; i++) set_bit(delim[i]); for (i = 0; i < curlen; i++) if (!check_bit(cur[i])) break; cur = cur + i; for (i = 0; i++ < curlen;) if (check_bit(cur[i])) break; *(cur + i) = '\0'; *saveptr = cur + i + 1; if (src_flag) { if (xs_is_ptr(src)) src->size = i; else src->space_left = 15 - i; } if (!*saveptr) saveptr = NULL; return cur; #undef check_bit #undef set_bit } ``` 測試程式: ```cpp #include <pthread.h> #include "string.h" #include "xs.h" #define NUM 3 const char token[8] = "@#(^&$*"; static void *test_tok(void *args) { char *saveptr; char out[32], *outptr = out; xs str = *(xs *) args; for (xs *p = &str;; p = NULL) { char *tem = xs_tok_r(p, token, &saveptr); if (!*tem) break; strcpy(outptr, tem); outptr += strlen(tem) + 1; *(outptr - 1) = ' '; } *(outptr - 1) = '\0'; printf("%s\n", out); } int main(void) { pthread_t threads[NUM]; xs strs[NUM]; printf("#token: %s\n", token); for (int i = 0; i < NUM; i++) { char buf[64]; sprintf(buf, "$$IM&*^%dth@@@#thread!!!$$", i); xs_new(&strs[i], buf); printf("#initial str of %dth thread: %s\n", i, xs_data(&strs[i])); pthread_create(threads + i, NULL, test_tok, (void *) &strs[i]); } for (int i = 0; i < NUM; i++) pthread_join(threads[i], NULL); return 0; } ``` 參考執行結果: > 這邊輸出順序不重要,重點是 "IM nth thread!!!" string 預期要是完整的 ``` #token: @#(^&$* #initial str of 0th thread: $$IM&*^0th@@@#thread!!!$$ #initial str of 1th thread: $$IM&*^1th@@@#thread!!!$$ #initial str of 2th thread: $$IM&*^2th@@@#thread!!!$$ IM 0th thread!!! IM 2th thread!!! IM 1th thread!!! ``` ## 3. 設計實驗,確認上述字串處理最佳化 (針對空間效率) 帶來的效益,應考慮到 [locality of reference](https://en.wikipedia.org/wiki/Locality_of_reference) 設計實驗呼叫複製字串函式數次計算時間以及 cache-misses 次數,實驗結果是 `xs_cpy` with CoW 的 cache-missed 明顯最少,但是再這個 input 下 `strcpy` 的時間效率略好一些,更下面還有 `make plot` 視覺化不同字串長度下的效率比較 - [source code](https://github.com/colinyoyo26/2020q1-linux-course/blob/master/2020q1-quiz2-xstring/cpy_bench.c) ### `xs_cpy` w/o CoW `$ make bench` ``` MODE: XSTR refcnt: 1 time: 23440.000000 (ms) Performance counter stats for './bench foo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrr 100000 0': 758,747 cache-misses # 85.012 % of all cache refs 892,518 cache-references 0.032626588 seconds time elapsed ``` ### `xs_cpy` w/ CoW ```shell $ make bench COW=1 MODE: XSTR refcnt: 100001 time: 5019.750000 (ms) Performance counter stats for './bench foo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrr 100000 0': 111,351 cache-misses # 57.129 % of all cache refs 194,910 cache-references 0.008941326 seconds time elapsed ``` ### `strcpy` ```shell $ make bench MODE=1 MODE: CSTR time: 4244.250000 (ms) Performance counter stats for './bench foo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrr 100000 1': 980,451 cache-misses # 66.203 % of all cache refs 1,480,977 cache-references 0.019159115 seconds time elapsed ``` ### `make plot` `$ make plot` 因為 CoW 機制下只有複製 metadata 所以時間趨勢如預期是呈現 $O(1)$ ![](https://i.imgur.com/sh5EGtn.png) :::warning 我的實驗還沒有考慮到寫入 shared string 時的效能 ::: ## 4. 在 Linux 核心原始程式碼中,找出類似手法的 SSO 或 CoW 案例並解說; call `fork()` 會先到以下函式設置 `args` 接著呼叫 `_do_fork` - `vfork()` `clone` 等函式也都是設置對應的 `args` 再呼叫 `_do_fork` ```cpp SYSCALL_DEFINE0(fork) { #ifdef CONFIG_MMU struct kernel_clone_args args = { .exit_signal = SIGCHLD, }; return _do_fork(&args); #else /* can not support in nommu mode */ return -EINVAL; #endif } ``` ### `_do_fork` 接著看到 `_do_fork` ,會先經過以下條件式檢查是哪個 system call ```cpp if (!(clone_flags & CLONE_UNTRACED)) { if (clone_flags & CLONE_VFORK) trace = PTRACE_EVENT_VFORK; else if (args->exit_signal != SIGCHLD) trace = PTRACE_EVENT_CLONE; else trace = PTRACE_EVENT_FORK; if (likely(!ptrace_event_enabled(current, trace))) trace = 0; } ``` 然後呼叫 `copy_process` 建造 child process 並回傳指標 ```cpp p = copy_process(NULL, trace, NUMA_NO_NODE, args); ``` 在 `copy_process` 內會呼叫 `dup_task_struct` 其中 `current` 是指向目前運行程式( parent process )的指標 ```cpp p = dup_task_struct(current, node); ``` 喚醒 child process ```cpp wake_up_new_task(p); ``` 如果是 `vfork` 要等待 child process 呼叫 `exec` 或結束 ```cpp if (clone_flags & CLONE_VFORK) { if (!wait_for_vfork_done(p, &vfork)) ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid); ``` ### `copy_process` 忽略一開始的參數檢查, `dup_task_struct` 是真正建造 child process 的函式, parent process 會初始化 child process 後回傳指標 `p` ```cpp p = dup_task_struct(current, node); ``` 實現 CoW 的地方,放在後面說明 ```cpp retval = copy_mm(clone_flags, p); ``` 主要設置 childe process 的 PC (program counter) 以及 register - PC 會指向 return from fork 的地方,返回值會被設成 0 - `copy_thread_tls` 會呼叫 `copy_thread` ,函式實作和 microarchitecture 息息相關 > Reference: - [Linux kernel fork 函式](http://blog.chinaunix.net/uid-69947851-id-5826105.html) ```cpp retval = copy_thread_tls(clone_flags, args->stack, args->stack_size, p, args->tls); ``` ### `dup_task_struct` 得知要從哪個 NUMA 節點配置記憶體 ```cpp if (node == NUMA_NO_NODE) node = tsk_fork_get_node(orig); ``` 為 child process 從剛才得到的 NUMA 節點配置 `struct task_struct` , `stack` , `vm_struct` 的記憶體空間 :::warning 這邊的 `tsk_fork_get_node()` 其實並「不一定」會回傳一個任一個 NUMA node。 假設,`CONFIG_NUMA`已經設置`tsk_fork_get_node()`仍然只會對於 `kthreadd_task` 返回預設的 NUMA node. --- JulianATA ::: > 這裡的 `stack` 是 [kernel stack](https://www.kernel.org/doc/html/latest/x86/kernel-stacks.html) ```cpp tsk = alloc_task_struct_node(node); if (!tsk) return NULL; stack = alloc_thread_stack_node(tsk, node); if (!stack) goto free_tsk; if (memcg_charge_kernel_stack(tsk)) goto free_stack; stack_vm_area = task_stack_vm_area(tsk); ``` 先完全複製當前行程的 `struct task_struct` :::warning 上方的資訊,我們可以發現一件有趣的事情。就算在一個已經設置` CONFIG_NUMA `的系統中,仍然有可能回傳` NUMA_NO_NODE `作為系統預計放置的記憶體的 NUMA node。 那這個 ` NUMA_NO_NODE ` 實際上是 -1 ,我們順著 ` alloc_task_struct_node `以及 `alloc_thread_stack_node` 走下去。 我們會發現,不論是前者透過 `slab`, `slob`, `slub` 的`slab_alloc_node`, `slob_alloc_node`, `slub_alloc_node` 的配置,或是後者通過 `alloc_page_node`做分配,皆會通過一個 `numa_mem_id()` 來自 `include/linux/topology.h` ```clike /* Returns the number of the nearest Node with memory */ static inline int numa_mem_id(void) { return raw_cpu_read(_numa_mem_); } ``` ```clike /* Returns the number of the nearest Node with memory */ static inline int numa_mem_id(void) { return numa_node_id(); } ``` 見註解!可以看到,無論是哪個情況下的 `numa_mem_id` 目標為回傳最近仍然有空間的 NUMA node。 由這件小事,其實可以延伸到一個目前的現況。在 Linux Kernel 裡面,支援對於 NUMA node 的「操作」,除了幾個比較簡單的 [policy](https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html) ,其實還沒有提供實質上的「策略」。 --- JulianATA ::: ```cpp err = arch_dup_task_struct(tsk, orig); ``` 然後改變 child process 的 `stack` 以及 `stack_vm_area` ```cpp tsk->stack = stack; #ifdef CONFIG_VMAP_STACK tsk->stack_vm_area = stack_vm_area; #endif ``` ### `copy_mm` `vfork` 會直接進入條件式內 (`呼叫 vfork 時 args.flags 會被設成 CLONE_VFORK | CLONE_VM`) 直接共用同一個 `mmstruct` 這就是為什麼用 `vfork` child process 和 parent process 會共享記憶體空間的原因,一般的 `fork` 會呼叫 `dup_mm` 複製 parent process 的 `mm_struct` ```cpp if (clone_flags & CLONE_VM) { mmget(oldmm); mm = oldmm; goto good_mm; } retval = -ENOMEM; mm = dup_mm(tsk, current->mm); ... good_mm: tsk->mm = mm; tsk->active_mm = mm; return 0; ``` ### `dup_mm` 先配置新的 `mm_struct` 並從 parent 複製資料 ```cpp mm = allocate_mm(); ... memcpy(mm, oldmm, sizeof(*mm)); ``` 接著呼叫 `dup_mmap` 複製 page table , `vm_area_struct` ,這地方先不深入 trace > Reference: [Understanding the Linux Kernel](https://books.google.com.tw/books?id=9yIEji1UheIC&pg=PA299&lpg=PA299&dq=dup_mmap&source=bl&ots=JFnHxrkNEV&sig=ACfU3U3jG0ocR6NCXRT4Fx9OG6DFqY6BXg&hl=zh-TW&sa=X&ved=2ahUKEwjAvOGf1ZLoAhXIxosBHVIXCncQ6AEwBnoECAoQAQ#v=onepage&q=dup_mmap&f=false) ```cpp err = dup_mmap(mm, oldmm); ``` ### Write to shared frame (physical page) > 這邊依循恐龍書的定義把 virtual page 叫做 page , physical page 叫做 frame 當 parent 或 child 嘗試寫入共用的 frame 時會觸發 page fault 然後通過 `do_page_fault()` `handle_mm_fault()` `handle_pte_fault()` 處理異常 (觸發 CoW) > Reference: [深入了解 Linux-COW 原理](https://blog.leosocy.top/%E6%B7%B1%E5%85%A5%E4%BA%86%E8%A7%A3Linux-COW%E5%86%99%E6%97%B6%E6%8B%B7%E8%B4%9D%E5%AE%9E%E7%8E%B0%E5%8E%9F%E7%90%86/) :::warning 這幾個函式還沒深入 trace ::: ## Reference - [Linux kernel fork 函式](http://blog.chinaunix.net/uid-69947851-id-5826105.html) - [Linux kernel fork CoW 代碼分析](https://blog.csdn.net/dog250/article/details/5303054) - [mm_struct 簡介](https://blog.csdn.net/qq_26768741/article/details/54375524) - [UNIX 作業系統 fork/exec 系統呼叫的前世今生](https://hackmd.io/@sysprog/unix-fork-exec) - Source code: [fork.c](https://github.com/torvalds/linux/blob/9d588f63602778716921bb638cda412cdeacabc8/kernel/fork.c) - Source code: [exec.c](https://github.com/torvalds/linux/blob/7eec11d3a784a283f916590e5aa30b855c2ccfd7/fs/exec.c) - Source code: [memory.c](https://github.com/torvalds/linux/blob/aeb542a1b5c507ea117d21c3e3e012ba16f065ac/mm/memory.c)