# 2020q1 Homework2 (quiz2)
contributed by < `colinyoyo26` >
###### tags: `linux2020`
> [第二週測驗題](https://hackmd.io/@sysprog/linux2020-quiz2)
==第 1 題組的延伸問題==
## 1. 解釋上述程式碼運作原理,可善用 GDB 追蹤和分析
### `xs_tmp`
在編譯時期檢查 `x` 是 string literal 以及長度小於 16 ,接著呼叫 `xs_new`
> 我想請問哪裡可以找到 `xs_new(&xs_literal_empty(), "" x)` `""` 可以檢查 `x` 是不是 string literal 的相關論述
> 已經查過 [C11 規格書](http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1570.pdf)以及 [Using the GNU Compiler Collection](https://gcc.gnu.org/onlinedocs/gcc-7.5.0/gcc.pdf) 沒有找到相關論述
:::warning
用 gdb 做實驗,確認配置的字串在於 stack 還是 read-only section,即可反推是否為 string literal
:notes: jserv
:::
> 實驗得到 string literal 是在 read-only section ,然後在 [C11 規格書](http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1570.pdf) 5.1.1.2 Translation phases 找到 6. Adjacent string literal tokens are concatenated. 推測出這是編譯器報錯的原因
```cpp
#define xs_tmp(x) \
((void) ((struct { \
_Static_assert(sizeof(x) <= 16, "it is too big"); \
int dummy; \
}){1}), \
xs_new(&xs_literal_empty(), "" x))
```
### `ilog2`
回傳 $\lfloor log_2{n} \rfloor$
```cpp
static inline int ilog2(uint32_t n) { return 32 - __builtin_clz(n) - 1;
```
### `xs_new`
`xs_literal_empty()` 先初始化為空的短字串 (`x->space_left` 設為 15 其他設為 0),若字串長度大於 15 會把字串放到 heap , `malloc` 的空間為 $2^{\lfloor log_2{len} \rfloor + 1}$ (對齊下一個 2 的冪次方),否則會直接放到 stack 因為有 null terminator 所以不用在重設 `is_ptr`
```cpp
xs *xs_new(xs *x, const void *p)
{
*x = xs_literal_empty();
size_t len = strlen(p) + 1;
if (len > 16) {
x->capacity = ilog2(len) + 1;
x->size = len - 1;
x->is_ptr = true;
x->ptr = malloc((size_t) 1 << x->capacity);
memcpy(x->ptr, p, len);
} else {
memcpy(x->data, p, len);
x->space_left = 15 - (len - 1);
}
return x;
}
```
### `xs_grow`
函式目的為將 `x` 的容量增長到下個二的冪次方
```cpp
xs *xs_grow(xs *x, size_t len)
{
if (len <= xs_capacity(x))
return x;
len = ilog2(len) + 1;
if (xs_is_ptr(x))
x->ptr = realloc(x->ptr, (size_t) 1 << len);
else {
char buf[16];
memcpy(buf, x->data, 16);
x->ptr = malloc((size_t) 1 << len);
memcpy(x->ptr, buf, 16);
}
x->is_ptr = true;
x->capacity = len;
return x;
}
```
### `xs_concat`
函式目的為將 `prefix` 和 `suffix` 串接到 `string` 前後, `string` 的容量夠時只須將字串複製過去即可,否則需要 call `xs_grow` 增加容量到下個二的冪次方,再複製字串,以下是將用到 `memmove` 的原因
Linux Programmer's Manual
- The memmove() function copies n bytes from memory area src to memory area dest. The memory areas may overlap
```cpp
xs *xs_concat(xs *string, const xs *prefix, const xs *suffix)
{
size_t pres = xs_size(prefix), sufs = xs_size(suffix),
size = xs_size(string), capacity = xs_capacity(string);
char *pre = xs_data(prefix), *suf = xs_data(suffix),
*data = xs_data(string);
if (size + pres + sufs <= capacity) {
memmove(data + pres, data, size);
memcpy(data, pre, pres);
memcpy(data + pres + size, suf, sufs + 1);
string->space_left = 15 - (size + pres + sufs);
} else {
xs tmps = xs_literal_empty();
xs_grow(&tmps, size + pres + sufs);
char *tmpdata = xs_data(&tmps);
memcpy(tmpdata + pres, data, size);
memcpy(tmpdata, pre, pres);
memcpy(tmpdata + pres + size, suf, sufs + 1);
xs_free(string);
*string = tmps;
string->size = size + pres + sufs;
}
return string;
}
```
### `xs_trim`
這個函式的目的是去除掉 `x` 內前後包含於 `trimset` 的字元,函式中兩個 macro 利用 hashing 的手法來檢查字元,首先把一個 `char` 分成 quotient (前 5 個 bits) ,和 remainder (後 3 個 bits) ,再來 `set_bit` 用於插入,`check_bit` 用於確認字元是否被插入
- quotient 用於 index
- remainder 數值表示範圍為 0 到 7 , 一一對應 `uint8_t` 的 8 個 bits
- 時間複雜度為 $O(slen + trimlen)$ ,若用 brute force 時間複雜度會升到 $O(slen \cdot trimlen)$
```cpp
xs *xs_trim(xs *x, const char *trimset)
{
if (!trimset[0])
return x;
char *dataptr = xs_data(x), *orig = dataptr;
/* similar to strspn/strpbrk but it operates on binary data */
uint8_t mask[32] = {0};
#define check_bit(byte) (mask[(uint8_t) byte / 8] CCC 1 << (uint8_t) byte % 8)
#define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8)
size_t i, slen = xs_size(x), trimlen = strlen(trimset);
for (i = 0; i < trimlen; i++)
set_bit(trimset[i]);
for (i = 0; i < slen; i++)
if (!check_bit(dataptr[i]))
break;
for (; slen > 0; slen--)
if (!check_bit(dataptr[slen - 1]))
break;
dataptr += i;
slen -= i;
/* reserved space as a buffer on the heap.
* Do not reallocate immediately. Instead, reuse it as possible.
* Do not shrink to in place if < 16 bytes.
*/
memmove(orig, dataptr, slen);
/* do not dirty memory unless it is needed */
if (orig[slen])
orig[slen] = 0;
if (xs_is_ptr(x))
x->size = slen;
else
x->space_left = 15 - slen;
return x;
#undef check_bit
#undef set_bit
}
```
## 2. 以上述程式碼為基礎,提供字串複製 (應考慮到 CoW) 和類似 [strtok](http://man7.org/linux/man-pages/man3/strtok.3.html) 功能的函式實作
完整的 code 以及編譯方式請見 [GitHub](https://github.com/colinyoyo26/2020q1-linux-course/tree/master/2020q1-quiz2-xstring)
首先是字串複製 `xs_cpy` 的實作以及測試輸出,先從記憶體配置開始
### `xs` memory layout
`malloc` 空間給 `x->ptr` 時會多塞 `sizeof(REFTYPE)` 的空間放 reference count 並把指標移到字串開始位置
```cpp
#define REFTYPE size_t
#define OFFSET sizeof(REFTYPE)
```
```cpp
xs->ptr = malloc(((size_t)1 << x->capacity) + OFFSET) + OFFSET;
```
`xs` 的記憶體配置邏輯如下(`xs` 16 bytes, `refcnt` `sizeof(REFTYPE)` bytes)
```graphviz
digraph Q {
rankdir=RL
node [shape=record];
heap [label="refcnt |string..."];
ptr [label = " ptr "];
size [label = "size"];
capacity [label = " capacity "];
flags [lable = "flags"]
subgraph cluster_xs {
{rank=same ptr size capacity flags}
}
ptr -> heap
}
```
:::warning
改用 mermaid 或 Graphviz 重新製作上方圖例
:notes: jserv
:::
> 已用 Graphviz 重繪
以下比較 [folly::fbstring](https://github.com/facebook/folly/blob/master/folly/FBString.h) 的 CoW 機制,看以下五個函式和資料結構就能看出端倪了, `ml_.data_` 會指向 `struct RefCounted` 的 `data_` 當要取 `refCount_` 的值時,會利用計算從 `struct RefCounted` 到 `data_` 的 offset 得到 `refCount_` 的位址 (和 linux linked list 一樣的手法)
```cpp
template <class Char>
FOLLY_NOINLINE inline void fbstring_core<Char>::initLarge(
const Char* const data,
const size_t size) {
// Large strings are allocated differently
size_t effectiveCapacity = size;
auto const newRC = RefCounted::create(data, &effectiveCapacity);
ml_.data_ = newRC->data_;
ml_.size_ = size;
ml_.setCapacity(effectiveCapacity, Category::isLarge);
ml_.data_[size] = '\0';
}
```
- `data[1]` 是為了留空間給 `\0`
```cpp
struct RefCounted {
std::atomic<size_t> refCount_;
Char data_[1];
...
```
```cpp
static void incrementRefs(Char* p) {
fromData(p)->refCount_.fetch_add(1, std::memory_order_acq_rel);
}
```
```cpp
constexpr static size_t getDataOffset() {
return offsetof(RefCounted, data_);
}
```
```cpp
static RefCounted* fromData(Char* p) {
return static_cast<RefCounted*>(static_cast<void*>(
static_cast<unsigned char*>(static_cast<void*>(p)) -
getDataOffset()));
}
```
```cpp
static RefCounted* create(size_t* size) {
const size_t allocSize =
goodMallocSize(getDataOffset() + (*size + 1) * sizeof(Char));
auto result = static_cast<RefCounted*>(checkedMalloc(allocSize));
result->refCount_.store(1, std::memory_order_release);
*size = (allocSize - getDataOffset()) / sizeof(Char) - 1;
return result;
}
```
### reference count
以下兩個函式回傳 reference count 和是否具有其他 reference
```cpp
extern inline uint8_t xs_refcnt(const xs *x)
{
return IS_COW && xs_is_ptr(x) ? *(REFTYPE *) (x->ptr - OFFSET) : 1;
}
```
```cpp
static inline bool xs_is_ref(const xs *x)
{
return xs_refcnt(x) > 1;
}
```
增減 reference count , `xs_decr_ref` 發現 reference count 歸零時會呼叫 `xs_free`
```cpp
static inline void xs_incr_ref(xs *x)
{
*(REFTYPE *) (x->ptr - OFFSET) += 1;
}
```
```cpp
static inline void xs_decr_ref(xs *x)
{
if (!--(*(REFTYPE *) (x->ptr - OFFSET)))
xs_free(x);
}
```
### CoW 機制
呼叫 `xs_cpy` 時如果不到需要 CoW 的字串長度,則直接複製字串,否則用 CoW 機制直接複製 meta data 並增加 reference count 其中 `!~xs_refcnt(src)` 是為了防止 overflow
```cpp
static inline void xs_set_ref(xs *x, xs *ref_to)
{
if (xs_size(ref_to) < LEN_TO_COW || x->ptr == ref_to->ptr)
return;
XS_INCR_REF(ref_to);
/* copy the meta data */
*x = *ref_to;
}
```
```cpp
xs *xs_cpy(xs *dest, xs *src)
{
if (XS_IS_REF(dest))
XS_DECR_REF(dest);
/* too many references or short string
* !OFFET for non-COW
*/
if (!OFFSET || xs_size(src) < LEN_TO_COW || !~xs_refcnt(src) ||
!xs_is_ptr(src)) {
size_t len = xs_size(src);
xs_grow(dest, len);
memcpy(xs_data(dest), xs_data(src), len);
if (xs_is_ptr(dest))
dest->size = len;
else
dest->space_left = 15 - len;
} else
xs_set_ref(dest, src);
return dest;
}
```
若是字串要被修改時 (e.g. `xs_trim` , `xs_concat`) 會呼叫`xs_cow` 檢查是否有其他參考,若有則需要自己在開一個新的空間
```cpp
static void xs_cow(xs *x)
{
if (!XS_IS_REF(x))
return;
XS_DECR_REF(x);
xs_new(x, xs_data(x));
return;
}
```
測試程式:
```cpp
xs string, cow1, cow2;
void print_info(char* str)
{ printf("\n%s\n", str);
printf("\nstring: %s", xs_data(&string));
printf("\n&string: %p", xs_data(&string));
printf("\nreference count: %d\n", string.ref_cnt);
printf("\nstring: %s", xs_data(&cow1));
printf("\n&string: %p", xs_data(&cow1));
printf("\nreference count: %d\n", cow1.ref_cnt);
printf("\nstring: %s", xs_data(&cow2));
printf("\n&string: %p", xs_data(&cow2));
printf("\nreference count: %d\n", cow2.ref_cnt);
}
int main()
{
xs_new(&string, "gggfoobarbarbarbarbarzzz");
xs_cpy(&cow1, &string);
xs_cpy(&cow2, &string);
print_info("after cpy from string to cow1 & cow2");
xs prefix = *xs_tmp("((("), suffix = *xs_tmp(")))");
xs_concat(&cow1, &prefix, &suffix);
print_info("after concat cow1");
xs_cpy(&cow2, &cow1);
print_info("after cpy from cow1 to cow2");
xs_trim(&cow1, ")(zg");
print_info("after trim cow1");
return 0;
}
```
參考執行結果:
```
after cpy from string to cow1 & cow2
#string
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 3
#cow1
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1
#cow2
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1
after concat cow1
#string
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 2
#cow1
string: (((gggfoobarbarbarbarbarzzz)))
&string: 0x1735450
reference count: 1
#cow2
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1
after cpy from cow1 to cow2
#string
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1
#cow1
string: (((gggfoobarbarbarbarbarzzz)))
&string: 0x1735450
reference count: 2
#cow2
string: (((gggfoobarbarbarbarbarzzz)))
&string: 0x1735450
reference count: 1
after trim cow1
#string
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1
#cow1
string: foobarbarbarbarbar
&string: 0x1735480
reference count: 1
#cow2
string: (((gggfoobarbarbarbarbarzzz)))
&string: 0x1735450
reference count: 1
```
接著是 [strtok](http://man7.org/linux/man-pages/man3/strtok.3.html) 的實作
### `xs_tok`
先簡介 `strtok` 以下是 Linux Programmer's Manual
- The `strtok()` function breaks a string into a sequence of zero or more nonempty tokens. On the first call to `strtok()` the string to be parsed should be specified in str. In each subsequent call that should parse the same string, str must be `NULL`.
> 第一次呼叫 `strtok` 需要給定欲處理的字串(放在 `str` ),之後 `str` 必須為 `NULL`
- The `delim` argument specifies a set of bytes that delimit the tokens in the parsed string. The caller may specify different strings in delim in successive calls that parse the same string.
> `delim` 就是字元的集合,用於劃分欲解析字串
- Each call to `strtok()` returns a pointer to a null-terminated string containing the next token. This string does not include the delimiting byte.
> 會回傳從第一個 non-delimiting byte. 下一個 delimiting byte (不包含) 之間的字串
- The end of each token is found by scanning forward until either the next delimiter byte is found or until the terminating null byte `\0` is encountered.
> 會把下一個 delimiting byte 替換成 `\0`
我照著 `strtok` 的行為實作, `laststr` 指向欲處理字串的開頭, `src_flag` 用於分辨是否要更新 `src->size` (或 `src->leftspace`),並和 `xs_trim` 一樣用 hashing 的方法實作檢查 `delim`
> 因為會把下一個 delimiting byte 替換成 `\0` 所以第一次呼叫需要更新 `src->size`
```cpp
char *xs_tok(xs *src, const char *delim)
{
static char *laststr = NULL;
char *cur;
bool src_flag = 0;
if (!src)
cur = laststr;
else {
XS_COW(src);
cur = xs_data(src);
src_flag = 1;
}
if (!delim[0] || !cur)
return cur;
uint8_t mask[32] = {0};
#define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8)
#define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8)
size_t i, curlen = strlen(cur), delimlen = strlen(delim);
for (i = 0; i < delimlen; i++)
set_bit(delim[i]);
for (i = 0; i < curlen; i++)
if (!check_bit(cur[i]))
break;
cur = cur + i;
for (i = 0; i++ < curlen;)
if (check_bit(cur[i]))
break;
*(cur + i) = '\0';
laststr = cur + i + 1;
if (src_flag) {
if (xs_is_ptr(src))
src->size = i;
else
src->space_left = 15 - i;
}
if (!*laststr)
laststr = NULL;
return cur;
}
```
測試程式:
```cpp
#include <stdio.h>
#include "xs.h"
int main(void)
{
xs string;
char token[8] = "HELLO W", *tem;
xs_new(&string, "HELLOW fooHELLOWbarHbarfooLbarLbarObarW");
printf("#token\n%s\n", token);
printf("\n#initial string\n%s\n\n", xs_data(&string));
printf("%s\n", xs_tok(&string, token));
while(tem = xs_tok(NULL, token))
printf("%s\n", tem);
return 0;
}
```
參考執行結果:
```
#token
HELLO W
#initial string
HELLOW fooHELLOWbarHbarfooLbarLbarObarW
foo
bar
barfoo
bar
bar
bar
```
:::warning
TODO: 考慮到之後會在多執行緒環境,用 `strtok_r` 實作手法改寫
:notes: jserv
:::
> 已實作 `xs_tok_r` (reentrant version of `xs_tok`),以及測試程式(用 pthread 測試)
### `xs_tok_r`
```cpp
char *xs_tok_r(xs *src, const char *delim, char **saveptr)
{
char *cur;
bool src_flag = 0;
if (!src)
cur = *saveptr;
else {
XS_COW(src);
cur = xs_data(src);
src_flag = 1;
}
if (!delim[0] || !cur)
return cur;
uint8_t mask[32] = {0};
#define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8)
#define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8)
size_t i, curlen = strlen(cur), delimlen = strlen(delim);
for (i = 0; i < delimlen; i++)
set_bit(delim[i]);
for (i = 0; i < curlen; i++)
if (!check_bit(cur[i]))
break;
cur = cur + i;
for (i = 0; i++ < curlen;)
if (check_bit(cur[i]))
break;
*(cur + i) = '\0';
*saveptr = cur + i + 1;
if (src_flag) {
if (xs_is_ptr(src))
src->size = i;
else
src->space_left = 15 - i;
}
if (!*saveptr)
saveptr = NULL;
return cur;
#undef check_bit
#undef set_bit
}
```
測試程式:
```cpp
#include <pthread.h>
#include "string.h"
#include "xs.h"
#define NUM 3
const char token[8] = "@#(^&$*";
static void *test_tok(void *args)
{
char *saveptr;
char out[32], *outptr = out;
xs str = *(xs *) args;
for (xs *p = &str;; p = NULL) {
char *tem = xs_tok_r(p, token, &saveptr);
if (!*tem)
break;
strcpy(outptr, tem);
outptr += strlen(tem) + 1;
*(outptr - 1) = ' ';
}
*(outptr - 1) = '\0';
printf("%s\n", out);
}
int main(void)
{
pthread_t threads[NUM];
xs strs[NUM];
printf("#token: %s\n", token);
for (int i = 0; i < NUM; i++) {
char buf[64];
sprintf(buf, "$$IM&*^%dth@@@#thread!!!$$", i);
xs_new(&strs[i], buf);
printf("#initial str of %dth thread: %s\n", i, xs_data(&strs[i]));
pthread_create(threads + i, NULL, test_tok, (void *) &strs[i]);
}
for (int i = 0; i < NUM; i++)
pthread_join(threads[i], NULL);
return 0;
}
```
參考執行結果:
> 這邊輸出順序不重要,重點是 "IM nth thread!!!" string 預期要是完整的
```
#token: @#(^&$*
#initial str of 0th thread: $$IM&*^0th@@@#thread!!!$$
#initial str of 1th thread: $$IM&*^1th@@@#thread!!!$$
#initial str of 2th thread: $$IM&*^2th@@@#thread!!!$$
IM 0th thread!!!
IM 2th thread!!!
IM 1th thread!!!
```
## 3. 設計實驗,確認上述字串處理最佳化 (針對空間效率) 帶來的效益,應考慮到 [locality of reference](https://en.wikipedia.org/wiki/Locality_of_reference)
設計實驗呼叫複製字串函式數次計算時間以及 cache-misses 次數,實驗結果是 `xs_cpy` with CoW 的 cache-missed 明顯最少,但是再這個 input 下 `strcpy` 的時間效率略好一些,更下面還有 `make plot` 視覺化不同字串長度下的效率比較
- [source code](https://github.com/colinyoyo26/2020q1-linux-course/blob/master/2020q1-quiz2-xstring/cpy_bench.c)
### `xs_cpy` w/o CoW
`$ make bench`
```
MODE: XSTR
refcnt: 1
time: 23440.000000 (ms)
Performance counter stats for './bench foo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrr 100000 0':
758,747 cache-misses # 85.012 % of all cache refs
892,518 cache-references
0.032626588 seconds time elapsed
```
### `xs_cpy` w/ CoW
```shell
$ make bench COW=1
MODE: XSTR
refcnt: 100001
time: 5019.750000 (ms)
Performance counter stats for './bench foo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrr 100000 0':
111,351 cache-misses # 57.129 % of all cache refs
194,910 cache-references
0.008941326 seconds time elapsed
```
### `strcpy`
```shell
$ make bench MODE=1
MODE: CSTR
time: 4244.250000 (ms)
Performance counter stats for './bench foo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrr 100000 1':
980,451 cache-misses # 66.203 % of all cache refs
1,480,977 cache-references
0.019159115 seconds time elapsed
```
### `make plot`
`$ make plot`
因為 CoW 機制下只有複製 metadata 所以時間趨勢如預期是呈現 $O(1)$

:::warning
我的實驗還沒有考慮到寫入 shared string 時的效能
:::
## 4. 在 Linux 核心原始程式碼中,找出類似手法的 SSO 或 CoW 案例並解說;
call `fork()` 會先到以下函式設置 `args` 接著呼叫 `_do_fork`
- `vfork()` `clone` 等函式也都是設置對應的 `args` 再呼叫 `_do_fork`
```cpp
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
struct kernel_clone_args args = {
.exit_signal = SIGCHLD,
};
return _do_fork(&args);
#else
/* can not support in nommu mode */
return -EINVAL;
#endif
}
```
### `_do_fork`
接著看到 `_do_fork` ,會先經過以下條件式檢查是哪個 system call
```cpp
if (!(clone_flags & CLONE_UNTRACED)) {
if (clone_flags & CLONE_VFORK)
trace = PTRACE_EVENT_VFORK;
else if (args->exit_signal != SIGCHLD)
trace = PTRACE_EVENT_CLONE;
else
trace = PTRACE_EVENT_FORK;
if (likely(!ptrace_event_enabled(current, trace)))
trace = 0;
}
```
然後呼叫 `copy_process` 建造 child process 並回傳指標
```cpp
p = copy_process(NULL, trace, NUMA_NO_NODE, args);
```
在 `copy_process` 內會呼叫 `dup_task_struct` 其中 `current` 是指向目前運行程式( parent process )的指標
```cpp
p = dup_task_struct(current, node);
```
喚醒 child process
```cpp
wake_up_new_task(p);
```
如果是 `vfork` 要等待 child process 呼叫 `exec` 或結束
```cpp
if (clone_flags & CLONE_VFORK) {
if (!wait_for_vfork_done(p, &vfork))
ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
```
### `copy_process`
忽略一開始的參數檢查, `dup_task_struct` 是真正建造 child process 的函式, parent process 會初始化 child process 後回傳指標 `p`
```cpp
p = dup_task_struct(current, node);
```
實現 CoW 的地方,放在後面說明
```cpp
retval = copy_mm(clone_flags, p);
```
主要設置 childe process 的 PC (program counter) 以及 register
- PC 會指向 return from fork 的地方,返回值會被設成 0
- `copy_thread_tls` 會呼叫 `copy_thread` ,函式實作和 microarchitecture 息息相關
> Reference: - [Linux kernel fork 函式](http://blog.chinaunix.net/uid-69947851-id-5826105.html)
```cpp
retval = copy_thread_tls(clone_flags, args->stack, args->stack_size, p,
args->tls);
```
### `dup_task_struct`
得知要從哪個 NUMA 節點配置記憶體
```cpp
if (node == NUMA_NO_NODE)
node = tsk_fork_get_node(orig);
```
為 child process 從剛才得到的 NUMA 節點配置 `struct task_struct` , `stack` , `vm_struct` 的記憶體空間
:::warning
這邊的 `tsk_fork_get_node()` 其實並「不一定」會回傳一個任一個 NUMA node。
假設,`CONFIG_NUMA`已經設置`tsk_fork_get_node()`仍然只會對於 `kthreadd_task` 返回預設的 NUMA node.
--- JulianATA
:::
> 這裡的 `stack` 是 [kernel stack](https://www.kernel.org/doc/html/latest/x86/kernel-stacks.html)
```cpp
tsk = alloc_task_struct_node(node);
if (!tsk)
return NULL;
stack = alloc_thread_stack_node(tsk, node);
if (!stack)
goto free_tsk;
if (memcg_charge_kernel_stack(tsk))
goto free_stack;
stack_vm_area = task_stack_vm_area(tsk);
```
先完全複製當前行程的 `struct task_struct`
:::warning
上方的資訊,我們可以發現一件有趣的事情。就算在一個已經設置` CONFIG_NUMA `的系統中,仍然有可能回傳` NUMA_NO_NODE `作為系統預計放置的記憶體的 NUMA node。
那這個 ` NUMA_NO_NODE ` 實際上是 -1 ,我們順著 ` alloc_task_struct_node `以及 `alloc_thread_stack_node` 走下去。
我們會發現,不論是前者透過 `slab`, `slob`, `slub` 的`slab_alloc_node`, `slob_alloc_node`, `slub_alloc_node` 的配置,或是後者通過 `alloc_page_node`做分配,皆會通過一個 `numa_mem_id()`
來自 `include/linux/topology.h`
```clike
/* Returns the number of the nearest Node with memory */
static inline int numa_mem_id(void)
{
return raw_cpu_read(_numa_mem_);
}
```
```clike
/* Returns the number of the nearest Node with memory */
static inline int numa_mem_id(void)
{
return numa_node_id();
}
```
見註解!可以看到,無論是哪個情況下的 `numa_mem_id` 目標為回傳最近仍然有空間的 NUMA node。
由這件小事,其實可以延伸到一個目前的現況。在 Linux Kernel 裡面,支援對於 NUMA node 的「操作」,除了幾個比較簡單的 [policy](https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html) ,其實還沒有提供實質上的「策略」。
--- JulianATA
:::
```cpp
err = arch_dup_task_struct(tsk, orig);
```
然後改變 child process 的 `stack` 以及 `stack_vm_area`
```cpp
tsk->stack = stack;
#ifdef CONFIG_VMAP_STACK
tsk->stack_vm_area = stack_vm_area;
#endif
```
### `copy_mm`
`vfork` 會直接進入條件式內 (`呼叫 vfork 時 args.flags 會被設成 CLONE_VFORK | CLONE_VM`) 直接共用同一個 `mmstruct` 這就是為什麼用 `vfork` child process 和 parent process 會共享記憶體空間的原因,一般的 `fork` 會呼叫 `dup_mm` 複製 parent process 的 `mm_struct`
```cpp
if (clone_flags & CLONE_VM) {
mmget(oldmm);
mm = oldmm;
goto good_mm;
}
retval = -ENOMEM;
mm = dup_mm(tsk, current->mm);
...
good_mm:
tsk->mm = mm;
tsk->active_mm = mm;
return 0;
```
### `dup_mm`
先配置新的 `mm_struct` 並從 parent 複製資料
```cpp
mm = allocate_mm();
...
memcpy(mm, oldmm, sizeof(*mm));
```
接著呼叫 `dup_mmap` 複製 page table , `vm_area_struct` ,這地方先不深入 trace
> Reference: [Understanding the Linux Kernel](https://books.google.com.tw/books?id=9yIEji1UheIC&pg=PA299&lpg=PA299&dq=dup_mmap&source=bl&ots=JFnHxrkNEV&sig=ACfU3U3jG0ocR6NCXRT4Fx9OG6DFqY6BXg&hl=zh-TW&sa=X&ved=2ahUKEwjAvOGf1ZLoAhXIxosBHVIXCncQ6AEwBnoECAoQAQ#v=onepage&q=dup_mmap&f=false)
```cpp
err = dup_mmap(mm, oldmm);
```
### Write to shared frame (physical page)
> 這邊依循恐龍書的定義把 virtual page 叫做 page , physical page 叫做 frame
當 parent 或 child 嘗試寫入共用的 frame 時會觸發 page fault 然後通過 `do_page_fault()` `handle_mm_fault()` `handle_pte_fault()` 處理異常 (觸發 CoW)
> Reference: [深入了解 Linux-COW 原理](https://blog.leosocy.top/%E6%B7%B1%E5%85%A5%E4%BA%86%E8%A7%A3Linux-COW%E5%86%99%E6%97%B6%E6%8B%B7%E8%B4%9D%E5%AE%9E%E7%8E%B0%E5%8E%9F%E7%90%86/)
:::warning
這幾個函式還沒深入 trace
:::
## Reference
- [Linux kernel fork 函式](http://blog.chinaunix.net/uid-69947851-id-5826105.html)
- [Linux kernel fork CoW 代碼分析](https://blog.csdn.net/dog250/article/details/5303054)
- [mm_struct 簡介](https://blog.csdn.net/qq_26768741/article/details/54375524)
- [UNIX 作業系統 fork/exec 系統呼叫的前世今生](https://hackmd.io/@sysprog/unix-fork-exec)
- Source code: [fork.c](https://github.com/torvalds/linux/blob/9d588f63602778716921bb638cda412cdeacabc8/kernel/fork.c)
- Source code: [exec.c](https://github.com/torvalds/linux/blob/7eec11d3a784a283f916590e5aa30b855c2ccfd7/fs/exec.c)
- Source code: [memory.c](https://github.com/torvalds/linux/blob/aeb542a1b5c507ea117d21c3e3e012ba16f065ac/mm/memory.c)