2020q2 Homework2 (quiz2)

# 2020q2 Homework2 (quiz2) contributed by < `nickyanggg` > ###### tags: `linux2020` ## [作業要求](https://hackmd.io/@sysprog/BJXNYx5NL) - 重新回答[第 2 週測驗題](https://hackmd.io/@sysprog/linux2020-quiz2)。 - 解釋程式碼運作原理，並提出改進方案。 - 提供字串複製 (CoW) 和類似 `strtok` 功能的函式實作。 - 解說 Linux 核心原始程式碼中類似的案例 (SSO、CoW)。 ## 分析 **xs** - 字串長度 $\leq15$ - 字串將儲存於 `stack` 上 - 字串起始位置為 `data` - 字串長度可由 `15 - space_left` 求得 - 當字串長度為 $15$ 時，`data[15] == '\0'`，然而位於最後一個 byte 的 `space_left` 卻不受影響，因為此時 `space_left` 剛好為 $0$ - 字串長度 $>15$ - `is_ptr` 為 $1$ - 字串將儲存於 `heap` 上 - 字串起始位置為 `ptr` ```cpp union { char data[16]; struct { uint8_t filler[15], /* how many free bytes in this stack allocated string * same idea as fbstring */ space_left : 4, /* if it is on heap, set to 1 */ is_ptr : 1, flag1 : 1, flag2 : 1, flag3 : 1; }; /* heap allocated */ struct { char *ptr; /* supports strings up to 2^54 - 1 bytes */ size_t size : 54, /* capacity is always a power of 2 (unsigned)-1 */ capacity : 6; /* the last 4 bits are important flags */ }; } xs; ``` ## Q1 分析以下程式碼，並判斷 `AAA` 內容為何。 ```c= #define xs_literal_empty() \ (xs) { .space_left = 15 } static inline int ilog2(uint32_t n) { return 32 - __builtin_clz(n) - 1; } xs *xs_new(xs *x, const void *p) { *x = xs_literal_empty(); size_t len = strlen(p) + 1; if (len > AAA) { x->capacity = ilog2(len) + 1; x->size = len - 1; x->is_ptr = true; x->ptr = malloc((size_t) 1 << x->capacity); memcpy(x->ptr, p, len); } else { memcpy(x->data, p, len); x->space_left = 15 - (len - 1); } return x; } ``` - `if` 的 block 中將字串儲存於 `heap` 中，而 `else` 則是將字串 copy 至 `data` 中，可知前者為字串長度 $>15$的狀況，而後者則是長度 $\leq15$ 的狀況。 - `len` 為一個字串真正用到的空間長度，也就是涵蓋結束字元，因此可判斷 `AAA` 應為 $16$。 ## Q2 分析以下程式碼，並判斷 `BBB` 及 `CCC` 內容為何。 ```c= xs *xs_trim(xs *x, const char *trimset) { if (!trimset[0]) return x; char *dataptr = xs_data(x), *orig = dataptr; /* similar to strspn/strpbrk but it operates on binary data */ uint8_t mask[BBB] = {0}; #define check_bit(byte) (mask[(uint8_t) byte / 8] CCC 1 << (uint8_t) byte % 8) #define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8) size_t i, slen = xs_size(x), trimlen = strlen(trimset); for (i = 0; i < trimlen; i++) set_bit(trimset[i]); for (i = 0; i < slen; i++) if (!check_bit(dataptr[i])) break; for (; slen > 0; slen--) if (!check_bit(dataptr[slen - 1])) break; dataptr += i; slen -= i; /* reserved space as a buffer on the heap. * Do not reallocate immediately. Instead, reuse it as possible. * Do not shrink to in place if < 16 bytes. */ memmove(orig, dataptr, slen); /* do not dirty memory unless it is needed */ if (orig[slen]) orig[slen] = 0; if (xs_is_ptr(x)) x->size = slen; else x->space_left = 15 - slen; return x; #undef check_bit #undef set_bit } ``` `mask` - 由 `set_bit` 中可知 ascii 每八個一組對應到 `mask[]` 中。 e.g. A 的 ascii 為 65 (decimal)，因此 `mask[65/8]` 會被 assign 成 `mask[65/8] | 1 << (65 % 8)`，也就是 `mask[8] |= 2`，也就是 `mask[8]` 會等於 xxxxxx1x (二進制)，x 為其他字元所對應到的值，若是欲刪除字元會被設為 1，否則為 0。 - 同一組的字元所代表的二進位數字分別為 1、10、......、10000000，因此若 `mask` 中的每個值代表著一個 byte，那每個字元都會對應到其中一個 byte 中的其中一個 bit。 - 因此在做 `set_bit` 時，若以二進制來看的話每次做 `|=` 都不會影響其他字元所代表的值。 `set_bit` - 將欲刪除字元所對應到在 `mask` 的 bit 更新為 1。 - 由於將 `byte` 轉型成 `uint8_t`，可知 `byte` 範圍落在 0 ~ 255，再加上 ascii 範圍為 0 ~ 255，因此 `BBB` 值為 32 才不會出現 index out of range 的狀況。 `check_bit` - 用來檢查字串的頭尾是否出現連續欲刪除字元。 - 由於上面已經確定每個字元都可以對應到 `mask` 中去看說自己是否為欲刪除字元，因此將 `dataptr` 中的字元一一丟入 `check_bit`，若將自己的 bit 的位置設成 1 其他 bit 為 0 的值去和 `mask` 上自己對應到的 byte 做 `CCC` 時回傳值為 1 代表著為欲刪除字元，因此 `for` 迴圈繼續，若回傳值為 0 則跳出迴圈。 - 由此可知 `CCC` 是在做 `&` 運算。 ## 字串複製實作 (CoW) - [source code](https://github.com/nickyanggg/xstring-CoW/blob/master/xs.c) - 首先先找出能夠儲存 `reference counter` 的位置來實作 `CoW`。 - 發現 `xs_new`、`xs_grow` 在 malloc 一塊空間給長度 $>15$ 的字串時，都會保證即使在極端狀況也會保留最後一個 `byte` 沒被使用，也就是 `'\0'` 頂多位在可使用空間中的倒數第二個 byte。 e.g. `size == 30` => `malloc(32)` `size == 31` => `malloc(64)` `size == 62` => `malloc(64)` `size == 63` => `malloc(128)` - 一個 `byte` 足以讓我們儲存 $\leq255$ 的 `reference counter`。 - 或許作為一個 `reference counter` 255 可能會稍嫌不足，但我這邊就不另外要空間或是開變數去儲存更大的 `reference counter`，重點可能放在如何在 `reference counter` 只有 255 的情況之下避免 memory leak 以及處理 `reference counter` 不夠用的狀況。 ### `xs_new` - `xs_ref_counter` 為取結束字元後的下一個 byte 的值，也就是我們定義的 `reference counter`。 - 在 `xs_new` 中，若存放在 `heap` 中則賦予 `xs_ref_counter` 初始值 1。 ```c= #define xs_tmp(x) \ xs_new(&xs_literal_empty(), x) #define xs_ref_counter(xs) \ *(xs->ptr + xs->size + 1) #define MAX_REF_COUNTER 255 xs *xs_new(xs *x, const void *p) { ... if (len > 16) { x->capacity = ilog2(len) + 1; x->size = len - 1; x->is_ptr = true; x->ptr = malloc((size_t) 1 << x->capacity); memcpy(x->ptr, p, len); xs_ref_counter(x) = 1; } else { ... } return x; } ``` ### `xs_copy` - 功能為將 `src` 的內容複製到 `dest`。 - 若 `src` 的字串長度 $>15$，先更新原先的 `reference counter`，若變為 0 則代表沒有共享空間，就可以釋放掉。 - 考慮 `src` copy 的次數達到極限 (255 次)，因此 duplicate 出一個新的。 - 複製完後，檢查是否需要更新 `reference counter`，只需要更新其中一個，因為此時 `dest` 和 `ref` 的 `reference counter` 是存在相同的地址。 ```c= xs *xs_copy(xs *dest, xs *src) { if (!dest || !src) return dest; if (xs_data(dest) == xs_data(src)) return dest; if (xs_is_ptr(dest)) { xs_ref_counter(dest) -= 1; if (!xs_ref_counter(dest)) free(xs_data(dest)); } if (xs_is_ptr(src) && (unsigned char) xs_ref_counter(src) == MAX_REF_COUNTER) { xs *dup_x = xs_tmp(src->ptr); for (int i = 0; i < 16; ++i) dest->data[i] = dup_x->data[i]; return dest; } for (int i = 0; i < 16; ++i) dest->data[i] = src->data[i]; if (xs_is_ptr(src)) xs_ref_counter(src) += 1; return dest; } ``` ### `xs_cow` - 當發生改寫的狀況，並且`reference counter` $>1$，則會觸發。 - 更新 `reference counter`。 ```c= static inline void xs_cow(xs *x) { *(x->ptr + x->size + 1) -= 1; xs *dup_x = xs_tmp(x->ptr); for (int i = 0; i < 16; ++i) x->data[i] = dup_x->data[i]; } ``` ### `xs_concat` - 檢查是否需要觸發 `xs_cow`。 - 由於做完 concat 之後，`reference counter` 並沒有一併被維護到，因此最後檢查是否需要維護 `reference counter`。 ```c= xs *xs_concat(xs *string, const xs *prefix, const xs *suffix) { if (xs_is_ptr(string) && xs_ref_counter(string) != 1) xs_cow(string); ... if (xs_is_ptr(string)) xs_ref_counter(string) = 1; return string; } ``` ### `xs_trim` - 同 `xs_concat` 注意事項。 ```c= xs *xs_trim(xs *x, const char *trimset) { if (!trimset[0]) return x; if (xs_is_ptr(x) && xs_ref_counter(x) != 1) xs_cow(x); ... if (xs_is_ptr(x)) { x->size = slen; xs_ref_counter(x) = 1; } else x->space_left = 15 - slen; ... } ``` ### 待改進部分 - `reference counter` 僅記錄 1 ~ 255，因此一旦被複製了 254 次，接下來每次的複製都會直接 malloc 一塊新的空間，而不會自動去連接才剛 malloc 且 `reference counter` 為 1 的那塊，也就是第 255 次複製的時候會先去 malloc 一塊空間，而第 256 次複製的時候其實可以跑去和第 255 次複製的共享空間，而非又去 malloc 一塊新的，實作方式如下。 - 首先每次 malloc 的時候要再多開 8 個 byte，因為要儲存下一段一模一樣字串所 malloc 的地址。 - 利用 [circular linked list](https://www.geeksforgeeks.org/circular-linked-list/) 的概念，每個 node 代表著儲存一模一樣字串的可用空間。 - 在呼叫 `xs_new` 的時候，就把下一段的儲存空間先指向自己 malloc 出來的空間，也就是 circular linked list 只有一個 node 的狀況 (指向自己)。 - 在複製時一旦發現 `reference counter` 滿的時候，就去儲存的下一段空間找找看，若發現下一段的 `reference counter` 也滿的時候就再去看下一段...以此類推，一旦發現 traverse 到第一個看的空間時 (也就是走訪了一圈每個 `reference counter` 都滿了)，就 malloc 一塊新的空間，並且將它加入這個 circular linked list。 - 每當要 free 一塊空間時，也要維護它所屬的 circular linked list。 - 可能有些人看到這邊會想說竟然都多開空間了那幹嘛不直接拿那些空間當 `reference counter`，像是多開的 8 個 byte 就可以計數到 $2^{64} - 1$，何必像上面說的那樣搞得那麼麻煩，但我個人認為不管 `reference counter` 能到多大，都不能保證它一定不會滿，而一旦滿的時候若沒妥善處理仍然會浪費掉許多空間，所以才提出上面那種解決辦法。 - `xs_trim` 若字串長度 $>15$ 且有共享空間的話，就一定會去 malloc 一塊新的空間，但其實也可以透過改變自己的 `ptr`、`size` 就可以達成，當然還是需要注意一些細節如下才能實作出來。 - 首先要把 `reference counter` 儲存在擁有空間的最後一個 `byte`，而非 '\0' 後的下一個 byte，這樣共享成員才可以輕鬆去 access 到 `reference counter` (因為各個共享成員的 `size` 可能都不一樣，但仍可透過 `capacity` 去取得最後一位元)。 - 需要另外去存共享空間的 head，也就是有辦法取得起始位置，才可以在需要釋放空間的時候釋放 (因為 `ptr` 被改成指向自己的字串開頭)，所以在 malloc 的時候也要再多增加 8 個 byte 來儲存。 - 基本上目前看來只要在每次 malloc 時多增加 16 個 byte 上面的問題應該都可以被解決。 ## 利用 xs 實作 strtok - 主要流程參考 [strtok_r](https://code.woboq.org/userspace/glibc/string/strtok_r.c.html#__strtok_r) 實作方式。 - 結合上面實作的 `xs_copy` 來達到 type 為 `char*` 的 assign。 - 利用 `xs_trim` 的 `set_bit`、`check_bit` 來實作 `strspn` 以及 `strcspn`。 - 由於 CoW 的機制存在，`reference counter` 一定要在 `old` 複製完後才能維護，否則下次呼叫 `xs_tok` 的回傳值開頭很有可能會變成上一個 token 的 `reference counter`。 ```c= xs *xs_tok(xs *x, const char *delim) { static xs *old = NULL; if (x == NULL) { if (!old) return NULL; x = xs_copy(xs_tmp(""), old); } if (*xs_data(x) == '\0') { xs_copy(old, x); return NULL; } if (!delim[0]) return x; if (xs_is_ptr(x) && xs_ref_counter(x) != 1) xs_cow(x); char *dataptr = xs_data(x), *orig = dataptr; uint8_t mask[32] = {0}; #define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8) #define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8) size_t i, xlen = xs_size(x), dlen = strlen(delim); for (i = 0; i < dlen; i++) set_bit(delim[i]); for (i = 0; i < xlen; i++) if (!check_bit(dataptr[i])) break; dataptr += i; xlen -= i; memmove(orig, dataptr, xlen); orig[xlen] = '\0'; if (xs_is_ptr(x)) x->size = xlen; else x->space_left = 15 - xlen; if (*xs_data(x) == '\0') { xs_copy(old, x); if (xs_is_ptr(x)) xs_ref_counter(x) = 1; return NULL; } dataptr = xs_data(x); for (i = 0; i < xlen; i++) if (check_bit(dataptr[i])) break; xlen = i; if (*(xs_data(x) + xlen) == '\0') { xs_copy(old, xs_tmp("")); if (xs_is_ptr(x)) xs_ref_counter(x) = 1; return x; } orig[xlen] = '\0'; old = xs_tmp(xs_data(x) + xlen + 1); if (xs_is_ptr(x)) { x->size = xlen; xs_ref_counter(x) = 1; } else x->space_left = 15 - xlen; return x; #undef check_bit #undef set_bit } ``` ### 待改進部分 - 和上述 `xs_trim` 的狀況雷同，可以不必去 `malloc` 一塊新的空間，改進方式也大同小異。 ## 實際測試簡單測試以下功能是否正常，並且檢查是否確實實施 CoW，以及處理 `reference counter` 達到極值的狀況。 - `xs_copy` - `xs_trim` - `xs_concat` - `xs_tok` ```c int main() { xs *string1 = xs_tmp(" abcd - efg - hijk -- kmnop "); xs *string2 = xs_tmp("hello hello hello hello"); printf("%-35s %p %d\n", xs_data(string1), xs_data(string1), xs_ref_counter(string1)); printf("%-35s %p %d\n\n", xs_data(string2), xs_data(string2), xs_ref_counter(string2)); xs_copy(string2, string1); printf("%-35s %p %d\n", xs_data(string1), xs_data(string1), xs_ref_counter(string1)); printf("%-35s %p %d\n\n", xs_data(string2), xs_data(string2), xs_ref_counter(string2)); xs_trim(string1, " "); printf("%-35s %p %d\n", xs_data(string1), xs_data(string1), xs_ref_counter(string1)); printf("%-35s %p %d\n\n", xs_data(string2), xs_data(string2), xs_ref_counter(string2)); xs_copy(string2, string1); printf("%-35s %p %d\n", xs_data(string1), xs_data(string1), xs_ref_counter(string1)); printf("%-35s %p %d\n\n", xs_data(string2), xs_data(string2), xs_ref_counter(string2)); xs_concat(string1, xs_tmp("- -="), xs_tmp(" - = ")); printf("%-35s %p %d\n", xs_data(string1), xs_data(string1), xs_ref_counter(string1)); printf("%-35s %p %d\n\n", xs_data(string2), xs_data(string2), xs_ref_counter(string2)); xs *token = xs_tok(string1, "-= "); while (token != NULL) { printf("%s\n", xs_data(token)); token = xs_tok(NULL, "-= "); } printf("\n"); xs *string_arr[255]; for (int i = 0; i < 255; ++i) { string_arr[i] = xs_tmp(""); xs_copy(string_arr[i], string2); printf("%-35s %p %d\n", xs_data(string_arr[i]), xs_data(string_arr[i]), (unsigned char) xs_ref_counter(string_arr[i])); } return 0; } ``` ### 測試結果 ```shell abcd - efg - hijk -- kmnop 0x7fbfac4026b0 1 hello hello hello hello 0x7fbfac4026d0 1 abcd - efg - hijk -- kmnop 0x7fbfac4026b0 2 abcd - efg - hijk -- kmnop 0x7fbfac4026b0 2 abcd - efg - hijk -- kmnop 0x7fbfac4026d0 1 abcd - efg - hijk -- kmnop 0x7fbfac4026b0 1 abcd - efg - hijk -- kmnop 0x7fbfac4026d0 2 abcd - efg - hijk -- kmnop 0x7fbfac4026d0 2 - -=abcd - efg - hijk -- kmnop - = 0x7fbfac4026f0 1 abcd - efg - hijk -- kmnop 0x7fbfac4026d0 1 abcd efg hijk kmnop abcd - efg - hijk -- kmnop 0x7fbfac4026d0 2 abcd - efg - hijk -- kmnop 0x7fbfac4026d0 3 ... abcd - efg - hijk -- kmnop 0x7fbfac4026d0 254 abcd - efg - hijk -- kmnop 0x7fbfac4026d0 255 abcd - efg - hijk -- kmnop 0x7fbfac402790 1 ``` :::warning TODO: 1. 引入 [Cachegrind](https://valgrind.org/docs/manual/cg-manual.html) 來觀察上述實驗結果，注意 locality 表現; 2. reference counter 的範圍很關鍵，你需要思考對應 CoW 操作所配合的 demand paging 機制; :notes: jserv ::: ## 利用 [Valgrind](http://ottoshare.blogspot.com/2012/03/valgrind.html)、[Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial) 觀察 ### xs_copy_max - copy on write，一次複製 254 (`reference counter` 極限值) 次。 - `reference counter` 達到 255 時才會再去要空間。 ### str_copy_max - 為了方便觀察，於是和 `xs_copy_max` 一樣複製 254 次。 - 每次複製都一定會去要空間。 ```c= void str_copy_max(xs *x) { for (int i = 1; i < MAX_REF_COUNTER; i++) { char *c = malloc(sizeof(char) * strlen(xs_data(x)) + 1); strcpy(c, xs_data(x)); } } void xs_copy_max(xs *x) { if ((unsigned char) xs_ref_counter(x) == MAX_REF_COUNTER) { xs *tmp = xs_copy(xs_tmp(""), x); x = tmp; } for (int i = 1; i < MAX_REF_COUNTER; i++) xs_copy(xs_tmp(""), x); } int main() { int n = 400; // copy 400 * 255 - 1 次 xs *string1 = xs_tmp("AAAAAAAAAAAAAAAAAAAA"); for (int i = 0; i < n; i++) { xs_copy_max(string1); // Case 1 str_copy_max(string1); // Case 2 } return 0; } ``` ### 觀察記憶體使用量 `xs_copy_max` ```shell -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 45 41,269,872 14,680 11,744 2,936 0 46 42,168,840 15,000 12,000 3,000 0 47 43,067,808 15,320 12,256 3,064 0 48 43,966,776 15,640 12,512 3,128 0 49 44,865,744 15,960 12,768 3,192 0 ``` `str_copy_max` ```shell -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 90 18,039,730 3,727,320 1,956,854 1,770,466 0 91 18,236,410 3,768,280 1,978,358 1,789,922 0 92 18,433,090 3,809,240 1,999,862 1,809,378 0 93 18,629,770 3,850,200 2,021,366 1,828,834 0 94 18,826,450 3,891,160 2,042,870 1,848,290 0 95 19,023,148 3,932,120 2,064,374 1,867,746 0 96 19,219,828 3,973,080 2,085,878 1,887,202 0 97 19,416,508 4,014,040 2,107,382 1,906,658 0 98 19,613,188 4,055,000 2,128,886 1,926,114 0 ``` 可以發現 `str_copy_max` 所耗費的空間遠大於 `xs_copy_max` 許多。 ### 觀察 cache 命中率 (locality of reference) `xs_copy_max` ```shell ==6956== I refs: 45,215,498 ==6956== I1 misses: 990 ==6956== LLi misses: 973 ==6956== I1 miss rate: 0.00% ==6956== LLi miss rate: 0.00% ==6956== ==6956== D refs: 26,325,688 (20,069,765 rd + 6,255,923 wr) ==6956== D1 misses: 3,469 ( 2,544 rd + 925 wr) ==6956== LLd misses: 2,923 ( 2,063 rd + 860 wr) ==6956== D1 miss rate: 0.0% ( 0.0% + 0.0% ) ==6956== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==6956== ==6956== LL refs: 4,459 ( 3,534 rd + 925 wr) ==6956== LL misses: 3,896 ( 3,036 rd + 860 wr) ==6956== LL miss rate: 0.0% ( 0.0% + 0.0% ) ``` `str_copy_max` ```shell ==6950== I refs: 38,305,625 ==6950== I1 misses: 986 ==6950== LLi misses: 971 ==6950== I1 miss rate: 0.00% ==6950== LLi miss rate: 0.00% ==6950== ==6950== D refs: 13,679,074 (8,687,286 rd + 4,991,788 wr) ==6950== D1 misses: 54,268 ( 2,770 rd + 51,498 wr) ==6950== LLd misses: 53,424 ( 2,063 rd + 51,361 wr) ==6950== D1 miss rate: 0.4% ( 0.0% + 1.0% ) ==6950== LLd miss rate: 0.4% ( 0.0% + 1.0% ) ==6950== ==6950== LL refs: 55,254 ( 3,756 rd + 51,498 wr) ==6950== LL misses: 54,395 ( 3,034 rd + 51,361 wr) ==6950== LL miss rate: 0.1% ( 0.0% + 1.0% ) ``` 可以發現 `xs_copy_max` 在 D1 的命中率高於 `str_copy_max`，我們再使用 perf 測試看看。 `xs_copy_max` ```shell Performance counter stats for './xs' (5 runs): 3899 cache-misses # 8.930 % of all cache refs ( +- 68.79% ) 4,3659 cache-references ( +- 1.17% ) 4566,7919 instructions # 2.10 insn per cycle ( +- 0.01% ) 2179,6722 cycles ( +- 0.24% ) 0.0064362 +- 0.0000249 seconds time elapsed ( +- 0.39% ) ``` `str_copy_max` ```shell Performance counter stats for './xs' (5 runs): 4,0382 cache-misses # 34.039 % of all cache refs ( +- 26.15% ) 11,8632 cache-references ( +- 1.53% ) 3739,1733 instructions # 2.37 insn per cycle ( +- 0.02% ) 1580,7567 cycles ( +- 2.41% ) 0.004620 +- 0.000116 seconds time elapsed ( +- 2.50% ) ``` 可以明顯地看出 `xs_copy_max` 的 cache 命中率是遠優於 `str_copy_max`。 ## Linux 核心類似案例參考 [copy on write](https://www.itread01.com/content/1544029393.html)、[fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c)、[memory.c](https://github.com/torvalds/linux/blob/master/mm/memory.c)、[fork 流程](https://www.twblogs.net/a/5b7ca1862b71770a43dbe650)、[page fault 處理](https://www.lagou.com/lgeduarticle/87749.html)。 ### fork - 建立一個子行程，此行程為父行程的副本。 - Linux 許多行程都是 `fork` 完後，子行程去呼叫 `exec`。 - 呼叫 `exec` 後，不會用到父行程的資料。 - 若每次 `fork` 後都要去複製一份父行程的資料給子行程，結果子行程卻完全沒有用到的話，就浪費了許多時間和空間。 - 呼叫 `fork` 後，若子行程不對記憶體空間寫入，則不複製一份。 - 利用 CoW，使建立子行程的速度快很多。 ### 流程 ```flow func1=>operation: sysfork func2=>operation: do_fork func3=>operation: copy_process func4=>operation: copy_mm func5=>operation: dup_mm func6=>operation: dup_mmap func7=>operation: copy_page_range func8=>operation: copy_p4d_range func9=>operation: copy_pud_range func10=>operation: copy_pmd_range func11=>operation: copy_pte_range func12=>operation: copy_one_pte func1->func2->func3->func4->func5->func6->func7->func8->func9->func10->func11->func12 ``` ### [四級分頁機制](https://www.itread01.com/content/1550321649.html) - Linux 從 2.6.11 開始採用四級分頁表模型 pgd、pud、pmd、pte。 - 分別為頁全域性目錄 (Page Global Directory)、頁上級目錄 (Page Upper Directory)、頁中間目錄 (Page Middle Directory)、頁表 (Page Table)。 - 從呼叫 `copy_page_range`、`copy_pud_range`、`copy_pmd_range`、`copy_pte_range` 都是在 copy 相對應的頁表。 ### copy_one_pte - 若支援 copy on write，就將父行程中所有的記憶體頁的許可權都設為 read-only (唯讀)。 - 之後父行程或子行程要寫入時，會檢測到為 read-only，因此觸發 page fault，並複製一份給此行程 (呼叫 [do_wp_page](https://github.com/torvalds/linux/blob/master/mm/memory.c)、[wp_page_copy](https://github.com/torvalds/linux/blob/master/mm/memory.c))。 ```c static inline unsigned long copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, unsigned long addr, int *rss) { ... /* * If it's a COW mapping, write protect it both * in the parent and the child */ if (is_cow_mapping(vm_flags) && pte_write(pte)) { ptep_set_wrprotect(src_mm, addr, src_pte); pte = pte_wrprotect(pte); } ... } ```