Try   HackMD

2020q1 Homework2 (quiz2)

contributed by < colinyoyo26 >

tags: linux2020

第二週測驗題

第 1 題組的延伸問題

1. 解釋上述程式碼運作原理,可善用 GDB 追蹤和分析

xs_tmp

在編譯時期檢查 x 是 string literal 以及長度小於 16 ,接著呼叫 xs_new

我想請問哪裡可以找到 xs_new(&xs_literal_empty(), "" x) "" 可以檢查 x 是不是 string literal 的相關論述
已經查過 C11 規格書以及 Using the GNU Compiler Collection 沒有找到相關論述

用 gdb 做實驗,確認配置的字串在於 stack 還是 read-only section,即可反推是否為 string literal

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

實驗得到 string literal 是在 read-only section ,然後在 C11 規格書 5.1.1.2 Translation phases 找到 6. Adjacent string literal tokens are concatenated. 推測出這是編譯器報錯的原因

#define xs_tmp(x)                                          \
    ((void) ((struct {                                     \
         _Static_assert(sizeof(x) <= 16, "it is too big"); \
         int dummy;                                        \
     }){1}),                                               \
     xs_new(&xs_literal_empty(), "" x))

ilog2

回傳

log2n

static inline int ilog2(uint32_t n) { return 32 - __builtin_clz(n) - 1;

xs_new

xs_literal_empty() 先初始化為空的短字串 (x->space_left 設為 15 其他設為 0),若字串長度大於 15 會把字串放到 heap , malloc 的空間為

2log2len+1 (對齊下一個 2 的冪次方),否則會直接放到 stack 因為有 null terminator 所以不用在重設 is_ptr

xs *xs_new(xs *x, const void *p)
{
    *x = xs_literal_empty();
    size_t len = strlen(p) + 1;
    if (len > 16) {
        x->capacity = ilog2(len) + 1;
        x->size = len - 1;
        x->is_ptr = true;
        x->ptr = malloc((size_t) 1 << x->capacity);
        memcpy(x->ptr, p, len);
    } else {
        memcpy(x->data, p, len);
        x->space_left = 15 - (len - 1);
    }
    return x;
}

xs_grow

函式目的為將 x 的容量增長到下個二的冪次方

xs *xs_grow(xs *x, size_t len)
{
    if (len <= xs_capacity(x))
        return x;
    len = ilog2(len) + 1;
    if (xs_is_ptr(x))
        x->ptr = realloc(x->ptr, (size_t) 1 << len);
    else {
        char buf[16];
        memcpy(buf, x->data, 16);
        x->ptr = malloc((size_t) 1 << len);
        memcpy(x->ptr, buf, 16);
    }
    x->is_ptr = true;
    x->capacity = len;
    return x;
}

xs_concat

函式目的為將 prefixsuffix 串接到 string 前後, string 的容量夠時只須將字串複製過去即可,否則需要 call xs_grow 增加容量到下個二的冪次方,再複製字串,以下是將用到 memmove 的原因

Linux Programmer's Manual

  • The memmove() function copies n bytes from memory area src to memory area dest. The memory areas may overlap
xs *xs_concat(xs *string, const xs *prefix, const xs *suffix)
{
    size_t pres = xs_size(prefix), sufs = xs_size(suffix),
           size = xs_size(string), capacity = xs_capacity(string);

    char *pre = xs_data(prefix), *suf = xs_data(suffix),
         *data = xs_data(string);

    if (size + pres + sufs <= capacity) {
        memmove(data + pres, data, size);
        memcpy(data, pre, pres);
        memcpy(data + pres + size, suf, sufs + 1);
        string->space_left = 15 - (size + pres + sufs);
    } else {
        xs tmps = xs_literal_empty();
        xs_grow(&tmps, size + pres + sufs);
        char *tmpdata = xs_data(&tmps);
        memcpy(tmpdata + pres, data, size);
        memcpy(tmpdata, pre, pres);
        memcpy(tmpdata + pres + size, suf, sufs + 1);
        xs_free(string);
        *string = tmps;
        string->size = size + pres + sufs;
    }
    return string;
}

xs_trim

這個函式的目的是去除掉 x 內前後包含於 trimset 的字元,函式中兩個 macro 利用 hashing 的手法來檢查字元,首先把一個 char 分成 quotient (前 5 個 bits) ,和 remainder (後 3 個 bits) ,再來 set_bit 用於插入,check_bit 用於確認字元是否被插入

  • quotient 用於 index
  • remainder 數值表示範圍為 0 到 7 , 一一對應 uint8_t 的 8 個 bits
  • 時間複雜度為
    O(slen+trimlen)
    ,若用 brute force 時間複雜度會升到
    O(slentrimlen)
xs *xs_trim(xs *x, const char *trimset)
{
    if (!trimset[0])
        return x;

    char *dataptr = xs_data(x), *orig = dataptr;

    /* similar to strspn/strpbrk but it operates on binary data */
    uint8_t mask[32] = {0};

#define check_bit(byte) (mask[(uint8_t) byte / 8] CCC 1 << (uint8_t) byte % 8)
#define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8)

    size_t i, slen = xs_size(x), trimlen = strlen(trimset);

    for (i = 0; i < trimlen; i++)
        set_bit(trimset[i]);
    for (i = 0; i < slen; i++)
        if (!check_bit(dataptr[i]))
            break;
    for (; slen > 0; slen--)
        if (!check_bit(dataptr[slen - 1]))
            break;
    dataptr += i;
    slen -= i;

    /* reserved space as a buffer on the heap.
     * Do not reallocate immediately. Instead, reuse it as possible.
     * Do not shrink to in place if < 16 bytes.
     */
    memmove(orig, dataptr, slen);
    /* do not dirty memory unless it is needed */
    if (orig[slen])
        orig[slen] = 0;

    if (xs_is_ptr(x))
        x->size = slen;
    else
        x->space_left = 15 - slen;
    return x;
#undef check_bit
#undef set_bit
}

2. 以上述程式碼為基礎,提供字串複製 (應考慮到 CoW) 和類似 strtok 功能的函式實作

完整的 code 以及編譯方式請見 GitHub

首先是字串複製 xs_cpy 的實作以及測試輸出,先從記憶體配置開始

xs memory layout

malloc 空間給 x->ptr 時會多塞 sizeof(REFTYPE) 的空間放 reference count 並把指標移到字串開始位置

#define REFTYPE size_t
#define OFFSET sizeof(REFTYPE)
xs->ptr = malloc(((size_t)1 << x->capacity) + OFFSET) + OFFSET;

xs 的記憶體配置邏輯如下(xs 16 bytes, refcnt sizeof(REFTYPE) bytes)







Q


cluster_xs




heap

refcnt

string...



ptr

ptr



ptr->heap





size

size



capacity

capacity



flags

flags



改用 mermaid 或 Graphviz 重新製作上方圖例

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

已用 Graphviz 重繪

以下比較 folly::fbstring 的 CoW 機制,看以下五個函式和資料結構就能看出端倪了, ml_.data_ 會指向 struct RefCounteddata_ 當要取 refCount_ 的值時,會利用計算從 struct RefCounteddata_ 的 offset 得到 refCount_ 的位址 (和 linux linked list 一樣的手法)

template <class Char>
FOLLY_NOINLINE inline void fbstring_core<Char>::initLarge(
    const Char* const data,
    const size_t size) {
  // Large strings are allocated differently
  size_t effectiveCapacity = size;
  auto const newRC = RefCounted::create(data, &effectiveCapacity);
  ml_.data_ = newRC->data_;
  ml_.size_ = size;
  ml_.setCapacity(effectiveCapacity, Category::isLarge);
  ml_.data_[size] = '\0';
}
  • data[1] 是為了留空間給 \0
struct RefCounted {
    std::atomic<size_t> refCount_;
    Char data_[1];
    ...
static void incrementRefs(Char* p) {
    fromData(p)->refCount_.fetch_add(1, std::memory_order_acq_rel);
}
constexpr static size_t getDataOffset() {
    return offsetof(RefCounted, data_);
}
static RefCounted* fromData(Char* p) {
    return static_cast<RefCounted*>(static_cast<void*>(
          static_cast<unsigned char*>(static_cast<void*>(p)) -
          getDataOffset()));
}
static RefCounted* create(size_t* size) {
    const size_t allocSize =
          goodMallocSize(getDataOffset() + (*size + 1) * sizeof(Char));
    auto result = static_cast<RefCounted*>(checkedMalloc(allocSize));
    result->refCount_.store(1, std::memory_order_release);
    *size = (allocSize - getDataOffset()) / sizeof(Char) - 1;
    return result;
}

reference count

以下兩個函式回傳 reference count 和是否具有其他 reference

extern inline uint8_t xs_refcnt(const xs *x)
{
    return IS_COW && xs_is_ptr(x) ? *(REFTYPE *) (x->ptr - OFFSET) : 1;
}
static inline bool xs_is_ref(const xs *x)
{
    return xs_refcnt(x) > 1;
}

增減 reference count , xs_decr_ref 發現 reference count 歸零時會呼叫 xs_free

static inline void xs_incr_ref(xs *x)
{
    *(REFTYPE *) (x->ptr - OFFSET) += 1;
}
static inline void xs_decr_ref(xs *x)
{
    if (!--(*(REFTYPE *) (x->ptr - OFFSET)))
        xs_free(x);
}

CoW 機制

呼叫 xs_cpy 時如果不到需要 CoW 的字串長度,則直接複製字串,否則用 CoW 機制直接複製 meta data 並增加 reference count 其中 !~xs_refcnt(src) 是為了防止 overflow

static inline void xs_set_ref(xs *x, xs *ref_to)
{
    if (xs_size(ref_to) < LEN_TO_COW || x->ptr == ref_to->ptr)
        return;

    XS_INCR_REF(ref_to);
    /* copy the meta data */
    *x = *ref_to;
}
xs *xs_cpy(xs *dest, xs *src)
{
    if (XS_IS_REF(dest))
        XS_DECR_REF(dest);
    /* too many references or short string
     * !OFFET for non-COW
     */
    if (!OFFSET || xs_size(src) < LEN_TO_COW || !~xs_refcnt(src) ||
        !xs_is_ptr(src)) {
        size_t len = xs_size(src);
        xs_grow(dest, len);
        memcpy(xs_data(dest), xs_data(src), len);
        if (xs_is_ptr(dest))
            dest->size = len;
        else
            dest->space_left = 15 - len;
    } else
        xs_set_ref(dest, src);
    return dest;
}

若是字串要被修改時 (e.g. xs_trim , xs_concat) 會呼叫xs_cow 檢查是否有其他參考,若有則需要自己在開一個新的空間

static void xs_cow(xs *x)
{
    if (!XS_IS_REF(x))
        return;
    XS_DECR_REF(x);
    xs_new(x, xs_data(x));
    return;
}

測試程式:

xs string, cow1, cow2;
void print_info(char* str)
{   printf("\n%s\n", str);
    printf("\nstring: %s", xs_data(&string));
    printf("\n&string: %p", xs_data(&string));
    printf("\nreference count: %d\n", string.ref_cnt);

    printf("\nstring: %s", xs_data(&cow1));
    printf("\n&string: %p", xs_data(&cow1));
    printf("\nreference count: %d\n", cow1.ref_cnt);

    printf("\nstring: %s", xs_data(&cow2));
    printf("\n&string: %p", xs_data(&cow2));
    printf("\nreference count: %d\n", cow2.ref_cnt);
}

int main()
{  
    xs_new(&string, "gggfoobarbarbarbarbarzzz");
    xs_cpy(&cow1, &string);
    xs_cpy(&cow2, &string);
    print_info("after cpy from string to cow1 & cow2");
    xs prefix = *xs_tmp("((("), suffix = *xs_tmp(")))");
    xs_concat(&cow1, &prefix, &suffix);
    print_info("after concat cow1");
    xs_cpy(&cow2, &cow1);
    print_info("after cpy from cow1 to cow2");
    xs_trim(&cow1, ")(zg");
    print_info("after trim cow1");

    return 0;
}

參考執行結果:

after cpy from string to cow1 & cow2

#string
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 3

#cow1
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1

#cow2
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1

after concat cow1

#string
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 2

#cow1
string: (((gggfoobarbarbarbarbarzzz)))
&string: 0x1735450
reference count: 1

#cow2
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1

after cpy from cow1 to cow2

#string
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1

#cow1
string: (((gggfoobarbarbarbarbarzzz)))
&string: 0x1735450
reference count: 2

#cow2
string: (((gggfoobarbarbarbarbarzzz)))
&string: 0x1735450
reference count: 1

after trim cow1

#string
string: gggfoobarbarbarbarbarzzz
&string: 0x1735010
reference count: 1

#cow1
string: foobarbarbarbarbar
&string: 0x1735480
reference count: 1

#cow2
string: (((gggfoobarbarbarbarbarzzz)))
&string: 0x1735450
reference count: 1

接著是 strtok 的實作

xs_tok

先簡介 strtok 以下是 Linux Programmer's Manual

  • The strtok() function breaks a string into a sequence of zero or more nonempty tokens. On the first call to strtok() the string to be parsed should be specified in str. In each subsequent call that should parse the same string, str must be NULL.

第一次呼叫 strtok 需要給定欲處理的字串(放在 str ),之後 str 必須為 NULL

  • The delim argument specifies a set of bytes that delimit the tokens in the parsed string. The caller may specify different strings in delim in successive calls that parse the same string.

delim 就是字元的集合,用於劃分欲解析字串

  • Each call to strtok() returns a pointer to a null-terminated string containing the next token. This string does not include the delimiting byte.

會回傳從第一個 non-delimiting byte. 下一個 delimiting byte (不包含) 之間的字串

  • The end of each token is found by scanning forward until either the next delimiter byte is found or until the terminating null byte \0 is encountered.

會把下一個 delimiting byte 替換成 \0

我照著 strtok 的行為實作, laststr 指向欲處理字串的開頭, src_flag 用於分辨是否要更新 src->size (或 src->leftspace),並和 xs_trim 一樣用 hashing 的方法實作檢查 delim

因為會把下一個 delimiting byte 替換成 \0 所以第一次呼叫需要更新 src->size

char *xs_tok(xs *src, const char *delim)
{
    static char *laststr = NULL;
    char *cur;
    bool src_flag = 0;

    if (!src)
        cur = laststr;
    else {
        XS_COW(src);
        cur = xs_data(src);
        src_flag = 1;
    }

    if (!delim[0] || !cur)
        return cur;

    uint8_t mask[32] = {0};

#define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8)
#define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8)

    size_t i, curlen = strlen(cur), delimlen = strlen(delim);

    for (i = 0; i < delimlen; i++)
        set_bit(delim[i]);
    for (i = 0; i < curlen; i++)
        if (!check_bit(cur[i]))
            break;
    cur = cur + i;
    for (i = 0; i++ < curlen;)
        if (check_bit(cur[i]))
            break;
    *(cur + i) = '\0';
    laststr = cur + i + 1;

    if (src_flag) {
        if (xs_is_ptr(src))
            src->size = i;
        else
            src->space_left = 15 - i;
    }
    if (!*laststr)
        laststr = NULL;
    return cur;
}

測試程式:

#include <stdio.h>
#include "xs.h"

int main(void)
{
    xs string;
    char token[8] = "HELLO W", *tem;
    xs_new(&string, "HELLOW fooHELLOWbarHbarfooLbarLbarObarW");
    printf("#token\n%s\n", token);
    printf("\n#initial string\n%s\n\n", xs_data(&string));

    printf("%s\n", xs_tok(&string, token));
    while(tem = xs_tok(NULL, token))
        printf("%s\n", tem);
    return 0;
}

參考執行結果:

#token
HELLO W

#initial string
HELLOW fooHELLOWbarHbarfooLbarLbarObarW

foo
bar
barfoo
bar
bar
bar

TODO: 考慮到之後會在多執行緒環境,用 strtok_r 實作手法改寫

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

已實作 xs_tok_r (reentrant version of xs_tok),以及測試程式(用 pthread 測試)

xs_tok_r

char *xs_tok_r(xs *src, const char *delim, char **saveptr)
{
    char *cur;
    bool src_flag = 0;

    if (!src)
        cur = *saveptr;
    else {
        XS_COW(src);
        cur = xs_data(src);
        src_flag = 1;
    }

    if (!delim[0] || !cur)
        return cur;

    uint8_t mask[32] = {0};

#define check_bit(byte) (mask[(uint8_t) byte / 8] & 1 << (uint8_t) byte % 8)
#define set_bit(byte) (mask[(uint8_t) byte / 8] |= 1 << (uint8_t) byte % 8)

    size_t i, curlen = strlen(cur), delimlen = strlen(delim);

    for (i = 0; i < delimlen; i++)
        set_bit(delim[i]);
    for (i = 0; i < curlen; i++)
        if (!check_bit(cur[i]))
            break;
    cur = cur + i;
    for (i = 0; i++ < curlen;)
        if (check_bit(cur[i]))
            break;
    *(cur + i) = '\0';
    *saveptr = cur + i + 1;

    if (src_flag) {
        if (xs_is_ptr(src))
            src->size = i;
        else
            src->space_left = 15 - i;
    }
    if (!*saveptr)
        saveptr = NULL;
    return cur;
#undef check_bit
#undef set_bit
}

測試程式:

#include <pthread.h>
#include "string.h"
#include "xs.h"

#define NUM 3

const char token[8] = "@#(^&$*";

static void *test_tok(void *args)
{
    char *saveptr;
    char out[32], *outptr = out;

    xs str = *(xs *) args;
    for (xs *p = &str;; p = NULL) {
        char *tem = xs_tok_r(p, token, &saveptr);
        if (!*tem)
            break;
        strcpy(outptr, tem);
        outptr += strlen(tem) + 1;
        *(outptr - 1) = ' ';
    }
    *(outptr - 1) = '\0';
    printf("%s\n", out);
}

int main(void)
{
    pthread_t threads[NUM];
    xs strs[NUM];
    printf("#token: %s\n", token);
    for (int i = 0; i < NUM; i++) {
        char buf[64];
        sprintf(buf, "$$IM&*^%dth@@@#thread!!!$$", i);
        xs_new(&strs[i], buf);
        printf("#initial str of %dth thread: %s\n", i, xs_data(&strs[i]));
        pthread_create(threads + i, NULL, test_tok, (void *) &strs[i]);
    }
    for (int i = 0; i < NUM; i++)
        pthread_join(threads[i], NULL);

    return 0;
}

參考執行結果:

這邊輸出順序不重要,重點是 "IM nth thread!!!" string 預期要是完整的

#token: @#(^&$*
#initial str of 0th thread: $$IM&*^0th@@@#thread!!!$$
#initial str of 1th thread: $$IM&*^1th@@@#thread!!!$$
#initial str of 2th thread: $$IM&*^2th@@@#thread!!!$$
IM 0th thread!!!
IM 2th thread!!!
IM 1th thread!!!

3. 設計實驗,確認上述字串處理最佳化 (針對空間效率) 帶來的效益,應考慮到 locality of reference

設計實驗呼叫複製字串函式數次計算時間以及 cache-misses 次數,實驗結果是 xs_cpy with CoW 的 cache-missed 明顯最少,但是再這個 input 下 strcpy 的時間效率略好一些,更下面還有 make plot 視覺化不同字串長度下的效率比較

xs_cpy w/o CoW

$ make bench

MODE: XSTR
refcnt: 1
time: 23440.000000 (ms)

 Performance counter stats for './bench foo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrr 100000 0':

           758,747      cache-misses              #   85.012 % of all cache refs    
           892,518      cache-references                                            

       0.032626588 seconds time elapsed

xs_cpy w/ CoW

$ make bench COW=1

MODE: XSTR
refcnt: 100001
time: 5019.750000 (ms)

 Performance counter stats for './bench foo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrr 100000 0':

           111,351      cache-misses              #   57.129 % of all cache refs    
           194,910      cache-references                                            

       0.008941326 seconds time elapsed

strcpy

$ make bench MODE=1

MODE: CSTR
time: 4244.250000 (ms)

 Performance counter stats for './bench foo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrrfoo bar bar barbarbarbar foooooobarrrrrrarrrrrr 100000 1':

           980,451      cache-misses              #   66.203 % of all cache refs    
         1,480,977      cache-references                                            

       0.019159115 seconds time elapsed

make plot

$ make plot

因為 CoW 機制下只有複製 metadata 所以時間趨勢如預期是呈現

O(1)

我的實驗還沒有考慮到寫入 shared string 時的效能

4. 在 Linux 核心原始程式碼中,找出類似手法的 SSO 或 CoW 案例並解說;

call fork() 會先到以下函式設置 args 接著呼叫 _do_fork

  • vfork() clone 等函式也都是設置對應的 args 再呼叫 _do_fork
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
	struct kernel_clone_args args = {
		.exit_signal = SIGCHLD,
	};

	return _do_fork(&args);
#else
	/* can not support in nommu mode */
	return -EINVAL;
#endif
}

_do_fork

接著看到 _do_fork ,會先經過以下條件式檢查是哪個 system call

if (!(clone_flags & CLONE_UNTRACED)) {
	if (clone_flags & CLONE_VFORK)
		trace = PTRACE_EVENT_VFORK;
	else if (args->exit_signal != SIGCHLD)
		trace = PTRACE_EVENT_CLONE;
	else
		trace = PTRACE_EVENT_FORK;

	if (likely(!ptrace_event_enabled(current, trace)))
		trace = 0;
}

然後呼叫 copy_process 建造 child process 並回傳指標

p = copy_process(NULL, trace, NUMA_NO_NODE, args);

copy_process 內會呼叫 dup_task_struct 其中 current 是指向目前運行程式( parent process )的指標

p = dup_task_struct(current, node);

喚醒 child process

wake_up_new_task(p);

如果是 vfork 要等待 child process 呼叫 exec 或結束

if (clone_flags & CLONE_VFORK) {
	if (!wait_for_vfork_done(p, &vfork))
		ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);

copy_process

忽略一開始的參數檢查, dup_task_struct 是真正建造 child process 的函式, parent process 會初始化 child process 後回傳指標 p

p = dup_task_struct(current, node);

實現 CoW 的地方,放在後面說明

retval = copy_mm(clone_flags, p);

主要設置 childe process 的 PC (program counter) 以及 register

  • PC 會指向 return from fork 的地方,返回值會被設成 0
  • copy_thread_tls 會呼叫 copy_thread ,函式實作和 microarchitecture 息息相關

Reference: - Linux kernel fork 函式

retval = copy_thread_tls(clone_flags, args->stack, args->stack_size, p,
				 args->tls);

dup_task_struct

得知要從哪個 NUMA 節點配置記憶體

if (node == NUMA_NO_NODE)
	node = tsk_fork_get_node(orig);

為 child process 從剛才得到的 NUMA 節點配置 struct task_struct , stack , vm_struct 的記憶體空間

這邊的 tsk_fork_get_node() 其實並「不一定」會回傳一個任一個 NUMA node。
假設,CONFIG_NUMA已經設置tsk_fork_get_node()仍然只會對於 kthreadd_task 返回預設的 NUMA node.

- JulianATA

這裡的 stackkernel stack

tsk = alloc_task_struct_node(node);
if (!tsk)
	return NULL;

stack = alloc_thread_stack_node(tsk, node);
if (!stack)
	goto free_tsk;

if (memcg_charge_kernel_stack(tsk))
	goto free_stack;

stack_vm_area = task_stack_vm_area(tsk);

先完全複製當前行程的 struct task_struct

上方的資訊,我們可以發現一件有趣的事情。就算在一個已經設置CONFIG_NUMA的系統中,仍然有可能回傳NUMA_NO_NODE作為系統預計放置的記憶體的 NUMA node。
那這個 NUMA_NO_NODE 實際上是 -1 ,我們順著 alloc_task_struct_node以及 alloc_thread_stack_node 走下去。
我們會發現,不論是前者透過 slab, slob, slubslab_alloc_node, slob_alloc_node, slub_alloc_node 的配置,或是後者通過 alloc_page_node做分配,皆會通過一個 numa_mem_id()

來自 include/linux/topology.h

/* Returns the number of the nearest Node with memory */
static inline int numa_mem_id(void)
{
	return raw_cpu_read(_numa_mem_);
}
/* Returns the number of the nearest Node with memory */
static inline int numa_mem_id(void)
{
	return numa_node_id();
}

見註解!可以看到,無論是哪個情況下的 numa_mem_id 目標為回傳最近仍然有空間的 NUMA node。
由這件小事,其實可以延伸到一個目前的現況。在 Linux Kernel 裡面,支援對於 NUMA node 的「操作」,除了幾個比較簡單的 policy ,其實還沒有提供實質上的「策略」。

- JulianATA

err = arch_dup_task_struct(tsk, orig);

然後改變 child process 的 stack 以及 stack_vm_area

	tsk->stack = stack;
#ifdef CONFIG_VMAP_STACK
	tsk->stack_vm_area = stack_vm_area;
#endif

copy_mm

vfork 會直接進入條件式內 (呼叫 vfork 時 args.flags 會被設成 CLONE_VFORK | CLONE_VM) 直接共用同一個 mmstruct 這就是為什麼用 vfork child process 和 parent process 會共享記憶體空間的原因,一般的 fork 會呼叫 dup_mm 複製 parent process 的 mm_struct

if (clone_flags & CLONE_VM) {
	mmget(oldmm);
	mm = oldmm;
	goto good_mm;
}

retval = -ENOMEM;
mm = dup_mm(tsk, current->mm);

...

good_mm:
	tsk->mm = mm;
	tsk->active_mm = mm;
	return 0;

dup_mm

先配置新的 mm_struct 並從 parent 複製資料

mm = allocate_mm();

...

memcpy(mm, oldmm, sizeof(*mm));

接著呼叫 dup_mmap 複製 page table , vm_area_struct ,這地方先不深入 trace

Reference: Understanding the Linux Kernel

err = dup_mmap(mm, oldmm);

Write to shared frame (physical page)

這邊依循恐龍書的定義把 virtual page 叫做 page , physical page 叫做 frame

當 parent 或 child 嘗試寫入共用的 frame 時會觸發 page fault 然後通過 do_page_fault() handle_mm_fault() handle_pte_fault() 處理異常 (觸發 CoW)

Reference: 深入了解 Linux-COW 原理

這幾個函式還沒深入 trace

Reference