Lock-Free Linked List with Lockless Memory Allocation

contributed by < Korin777 >

GitHub

以第 11 週測驗題第二題的程式碼為基礎，實作有效的並行鏈結串列，並搭配第 13 週測驗題第一題的記憶體管理 (以 mmap 系統呼叫實作) 程式，儘量降低 lock contention。

論文研讀: 〈Lock-Free Linked Lists and Skip Lists〉
撰寫測試程式: 設計實驗量化分析 scability，參考 concurrent-ll
glibc malloc、free 會用到 mutex -> 設計更有效率的 malloc、free 來降低 lock contention
memory barrier: fence(防止 instruction reorder)
madvise: 參考 mmap-benchmark

實驗環境

Architecture: CPU op-mode(s): Byte Order: Address sizes: CPU(s): On-line CPU(s) list: Thread(s) per core: Core(s) per socket: Socket(s): NUMA node(s): Vendor ID: CPU family: Model: Model name: Stepping: CPU MHz: CPU max MHz: CPU min MHz: BogoMIPS: Virtualization: L1d cache: L1i cache: L2 cache: L3 cache: NUMA node0 CPU(s): x86_64
32-bit, 64-bit
Little Endian
39 bits physical, 48 bits virtual
8
0-7
2
4
1
1
GenuineIntel
6
142
Intel® Core™ i5-8250U CPU @ 1.60GHz
10
1600.035
3400.0000
400.0000
3600.00
VT-x
128 KiB
128 KiB
1 MiB
6 MiB
0-7

論文研讀

node->next 被 flag 表示正要移除 node->next
node->next 被 mark 表示 node 已被 logically delete
若 node 已被 logically delete 將它 physically delete 同時把前一個 node 的 flag 消除
node 不能同時被 mark 及 flag
被 mark 或 flag 的 node 的下一個 node 不會被更動 , 其他人會先幫忙把正在移除的 node 移除

撰寫測試程式

測試程式修改自 concurrent-ll 以符合第 11 週測驗題第二題給定的 nblist (non-blocking singly-linked list) 實作，主要修改的地方在 test 函式

top : 取出開頭的 key
push : 在開頭新增 key
pop ：在開頭移除 key

void *test(void *data)
{
    thread_data_t *d = (thread_data_t *) data; 
    uint32_t read_thresh = 256 * finds / 100;
    seeds = seed_rand();
    uint32_t rand_max = max_key;
    val_t the_value;
    int last = -1;

    for (int i = 0; i < d->n_add; ++i) {
        the_value =
            (val_t) my_random(&seeds[0], &seeds[1], &seeds[2]) & rand_max;

        if (push(the_list,the_value) == 0)
            i--;
    }

    barrier_cross(d->barrier);
    while (*running) { 
        the_value = my_random(&seeds[0], &seeds[1], &seeds[2]) & rand_max;
        uint32_t op = my_random(&seeds[0], &seeds[1], &seeds[2]) & 0xff;
        if (op < read_thresh) { // 更換原 list_contain
            top(the_list);
        } else if (last == -1) { // 更換原 list_add
            if (push(the_list, the_value)) {
                d->n_insert++;
                last = 1;
            }
        } else { // 更換原 list_remove
            if(pop(the_list)) {
                d->n_remove++;
                last = -1;
            }
        }
        d->n_ops++;
    }
    return NULL;
}

在多執行緒會發生 Aborted (core dumped) , 原因是 push 允許 nblist_push 失敗的次數太少而執行到 abort() , 將 spins 值變大能避免此問題
在多執行緒程式不會遇到任何 assert 錯誤 , 且測試程式的 expected size 和 actual size 一致。

static struct item *push(struct nblist *list, int value)
{
    struct item *it = malloc(sizeof(*it));
    it->value = value;
    int spins = 0;
    while (!nblist_push(list, nblist_top(list), &it->link)) {
        /* spin */
        printf("%s: repeating", __func__);
        if (++spins == 10)
            abort();
    }
    return it;
}

$ ./out/linked list -n7
Thread 0
  #operations   : 2626095
  #inserts   : 266828
  #removes   : 266827
Thread 1
  #operations   : 1215180
  #inserts   : 123428
  #removes   : 123428
Thread 2
  #operations   : 2301196
  #inserts   : 233250
  #removes   : 233249
Thread 3
  #operations   : 1251367
  #inserts   : 127070
  #removes   : 127069
Thread 4
  #operations   : 1232578
  #inserts   : 124981
  #removes   : 124980
Thread 5
  #operations   : 2235207
  #inserts   : 226891
  #removes   : 226890
Thread 6
  #operations   : 1383432
  #inserts   : 139984
  #removes   : 139983
Duration      : 1000 (ms)
#txs     : 12245055 (12245055.000000 / s)
Expected size: 1029 Actual size: 1029

透過 run_ll.sh 及 create_plots_ll.sh 來觀察 throughput 及 scability

Update: (push+pop)操作次數/100
Size: 一開始 linked list 有多少 key , key 介於 0 ~ 2 * size
只有 top 操作時 throughput 及 scability 隨著執行緒數量增加成長
當有 push 及 pop throughput 及 scability 都隨著 thread 增加而下降
我認為這個結果是因為 push 及 pop 都是對開頭進行變更 , 而就算有多個執行緒同時在工作，每次也只會有一個執行緒能成功執行 , 而其他執行緒就必須 spin , 造成多執行緒產生的開銷大於其所帶來的效益

參考論文實作 `insert` 及 `delete` 函式

以下操作都建立在一個 key 值唯一且由小到大排序的 linked list
search:

struct nblist_node *list_search(val_t val, struct nblist_node *curr_node, struct nblist_node **left_node)
{
    uintptr_t curr_p = atomic_load(&curr_node->next);
    struct nblist_node *next_node = n_ptr(curr_p);

    while(next_node && (container_of(next_node,struct item, link)->value <= val)) {
        uintptr_t next_p = atomic_load(&next_node->next);
        while((next_p & F_MARK) && (curr_p & F_FLAG)) {
            if(n_ptr(curr_node->next) == next_node) {// help physically delete
                rend_the_marked(curr_node,next_node,next_p);
            }
            curr_p = atomic_load(&curr_node->next);
            next_node = n_ptr(curr_p);
            if(next_node == 0) // arrive list end
                break;
            next_p = atomic_load(&next_node->next);
        }
        if(next_node && container_of(next_node,struct item, link)->value <= val) {
            curr_node = next_node;
            curr_p = atomic_load(&curr_node->next);
            next_node = n_ptr(curr_p);
        }
    }

    *left_node = curr_node;
    return next_node;
}

自 curr_node 開始尋找兩相鄰節點 n1, n2
n1.value ≤ k < n2.value

insert

bool list_insert(struct nblist *the_list, val_t val)
{
    struct nblist_node *prev = NULL;
    struct nblist_node *next = list_search(val, &the_list->n, &prev);

    if(container_of(prev,struct item,link)->value == val) // duplicate key
        return false;
    struct item *it = malloc(sizeof(*it));
    if(!it)
        return false;
    it->value = val;
    struct nblist_node *newNode = &it->link;
    while (1) {
        uintptr_t prev_succ = prev->next; 
        if(prev_succ & F_FLAG) // help the corresponding deletion to complete
            clear_flag(prev,n_ptr(prev_succ));
        else {
            newNode->next = (uintptr_t)next;
            if(atomic_compare_exchange_strong_explicit(&prev->next, &next, (uintptr_t) newNode,
                        memory_order_release, memory_order_relaxed)) {
                // ensure prev and next are consecutive and prev isn't marked or flaged
                return true;
            }
            else {
                prev_succ = prev->next; 
                if(prev_succ & F_FLAG)
                    clear_flag(prev,n_ptr(prev_succ)); 
                while(prev->next & F_MARK) // Possibly a failure due to marking. Traverse a chain of backlinks to reach an unmarked node.
                    prev = prev->backlink;
            }
        }
        next = list_search(val,prev,&prev); // search two consecutive node again from prev node
        if(container_of(prev,struct item,link)->value == val) {
            free(it);
            return false;
        }
    }
}

delete

bool list_delete(struct nblist *the_list, val_t val)
{
    struct nblist_node *prev = NULL;
    struct nblist_node *next = list_search(val-1, &the_list->n, &prev);

    if(!next || container_of(next,struct item,link)->value != val) // no such key
        return false;

    bool got = try_flag(&prev, prev->next, next);
    
    if (prev)
        clear_flag(prev, next);
    return got;
}

Thoughput 及 Scability 都有隨著執行緒數量的增加而提升
隨著 linked list size 的增加，Scability 的提升也更顯著，我認為是因為 key 越多，要遇到正在進行移除的節點次數就會比較少，因此要幫助移除正在進行移除的節點及透過 backlink 回到尚未被移除的節點次數也就變少了，而我覺得這些都是可能降低多執行緒 throughput 的原因

try_flag















































static bool try_flag(struct nblist_node **pp,
                     uintptr_t p_val,
                     struct nblist_node *n)
{
    struct nblist_node *p = *pp;
    const uintptr_t new_val = (uintptr_t) n | F_FLAG;
    bool got;
    for (;;) {
        if (p_val == new_val) {
            *pp = p;
            return false;
        }
        uintptr_t old_val = (uintptr_t) n;
        got = atomic_compare_exchange_strong(&p->next, &old_val, new_val);
        if (got || old_val == new_val) {
            *pp = p;
            return got;
        }
        p_val = old_val;
        while ((p_val & F_MARK) != 0) {
            p = atomic_load_explicit(&p->backlink, memory_order_relaxed);
            assert(p);
            p_val = atomic_load_explicit(&p->next, memory_order_relaxed);
        }
        
        struct nblist_node *del_node = list_search(container_of(n,struct item,link)->value-1,p,&p);
        if(del_node != n) {
            *pp = NULL;
            return false;
        }
        
        /* @p is no longer @n's parent. walk forward until the parent is
         * found, or return NULL.
         */
        // assert(n_ptr(p_val));
        // while (n_ptr(p_val) != n) {
        //     p = n_ptr(p_val);
        //     p_val = atomic_load_explicit(&p->next, memory_order_relaxed);
        //     if (!n_ptr(p_val)) {
        //         *pp = NULL;
        //         return false;
        //     }
        // }
    }
    *pp = p;
    return got;
}

這裡嘗試去改進 try_flag，當要被 flag 的 predecessor 被 mark 時，會透過 backlink 來找到未被 mark 的 predecessor(因為 node 不能同時 flag 及 mark，且被 mark 表示正要被 physically delete)，在從這個 predecessor 去找出 del_node(這裡為 n) 新的 predecessor
32 ~ 43 行就是要找這個新的 predecessor，這裡的實作跟論文中不同，走訪過程中不會去幫忙節點的移除，所以當原本正在移除該節點的人停下來且其他人也都沒幫忙移除，就會使 try_flag 停滯不前
26 ~ 30 行就是跟論文一樣，去幫忙移除走訪過程中所遇到的正在移除中的節點

實做自己的 memory allocator

GitHub

memory allocator 結構

vm_head
- vm_t *next : 透過 mmap 配置的 4096 Byte 記憶體空間
  - p_rel array[N_VM_ELEMENTS] : 存放相對位置 , 及第 i 個配置的記憶體地址與 array[i] 地址的差
  - char raw[0] : vm_t 結構中可以讓我們動態配置記憶體的起始位置
- my_stack_t freed[256] : 被釋放的記憶體會根據大小 , push 到對應的 freed[i] , 配置記憶體實會優先重這裡 pop 出來
  - reuse_block_t *next : 被釋放的記憶體區塊

vm_extend_map

static vm_t *vm_extend_map(vm_head_t *head)
{
    vm_t *nod = head->next;
    vm_t *new_nod = vm_new();
    head->next = new_nod;
    return new_nod;
}

這裡將新配置出的 vm_t 加在開頭的位置，來避免在 vm_add 走訪所有的 vm_t

vm_add

uintptr_t vm_add(size_t sz, struct vm_head *head)
{
    sz = align_up(sz);

    uintptr_t block = pop(&head->freed[(sz >> 3) - 1]);
    if(block)
        return block;

    vm_t *nod = head->next;
    if(!nod)
        nod = vm_extend_map(head);

    retry:
    if ((int) ((nod->array[nod->use] + (sizeof(p_rel) * nod->use) +
                sz)) >= PAGESIZE || nod->use >= nod->max) {
        nod->max = nod->use + 1; /* addr > map */
        nod = vm_extend_map(head);
        goto retry;
    }

    char *p = getaddr(nod->array[nod->use]) + sz;
    setaddr(nod->array[nod->use + 1], p);
    nod->use++;
    return getaddr(nod->array[nod->use - 1]);
}

從對應的 my_stack_t freed[256] 或 char raw[0] 來配置記憶體空間，若沒有可用的空間在透過 mmap 配置新的 vm_t
記憶體空間對齊 8 byte

vm_remove

void vm_remove(uintptr_t ptr, int sz, struct vm_head *head) {
    if(sz == 0 || !ptr)
        return;
    sz = align_up(sz);
    push(&head->freed[(sz >> 3) - 1],ptr);
}

將記憶體 push 進對應的 my_stack_t freed[256] 讓後續的記憶體配置能重用此空間

多執行緒的配置及釋放記憶體

目前的作法是每個 thread 都有自己的 memory allocator
, 這樣就不會配置到相同的記憶體位置

測試多執行緒 malloc/free 的 throughput 及 scability

測試程式嘗試配置記憶體並寫值到該記憶體空間，接著去釋放該記憶體
紅線及藍 x 是 glibc 的實做 ; 綠線及紅 + 是我的實做
Malloc_Size : 動態配置記憶體的大小
Count : 連續配置記憶體及釋放記憶體的次數(e.g: count = 1 表示配置完 1 個後接著釋放 1 個，反覆執行)
Throughput： malloc 及 free 操作的次數

可以發現 glibc 的實做雖然有用到 lock，在 thread 數為 4 以下時，隨著 thread 的增加，Scability 還是有很好的成長，因為 glibc 有透過多個 allocation arenas 來避免 lock contention
我的實做雖然沒用到 lock，Scability 卻也在 thread 數為 4 以上時，增長不明顯
這裡還有發現，如果是一次配置一個接著釋放一個這樣輪流做的話，glibc 的實做 throughput 較佳 ; 如果是一次配置一些接著釋放一些這樣輪流做的化，我的實做 throughput 較佳

memory reclamation

be9a14608c 引入 Homework5 研讀過的 hp_list 裡面 hazard pointer 的實做，來幫助記憶體的釋放及重用，目前可以在多執行緒且只有 push 及 pop 的操作下，不觸發 AddressSanitizer 的錯誤

bool nblist_push(struct nblist *list, struct nblist_node *top, struct nblist_node *n, int tid)
{
    assert(((uintptr_t) n & F__MASK) == 0);
    uintptr_t old = atomic_load_explicit(&list->n.next, memory_order_acquire);
    while ((old & F_FLAG) != 0) {
        list_hp_protect_ptr(hp, 1, (uintptr_t) (old & ~F__MASK), tid);
        if(list->n.next == old)
            clear_flag(&list->n, n_ptr(old), tid);
        old = atomic_load(&list->n.next);
    }
    list_hp_clear(hp,tid);
    assert((old & F_MARK) == 0);
    n->next = old;
    n->backlink = NULL;
    return n_ptr(old) == top && atomic_compare_exchange_strong_explicit(
                                    &list->n.next, &old, (uintptr_t) n,
                                    memory_order_release, memory_order_relaxed);
}
struct nblist_node *nblist_pop(struct nblist *list, int tid)
{
    struct nblist_node *p;// predecessor
    uintptr_t p_val;
    struct nblist_node *n; // del_node
retry:
    p = &list->n; 
    p_val = atomic_load(&p->next);
    assert((p_val & F_MARK) == 0); // not dead
    n = n_ptr(p_val);
    list_hp_protect_ptr(hp, 1, (uintptr_t) n, tid);
    if((p->next & ~F_FLAG) != (uintptr_t) n)
        goto retry;
    while (n) {
        if ((p_val & F__MASK) != 0) { // p_val mark or flag
            p = n;
            list_hp_protect_ptr(hp, 0, (uintptr_t) n, tid);
            p_val = atomic_load(&p->next);
        } else if (atomic_compare_exchange_strong(&p->next, &p_val,
                                                  p_val | F_FLAG)) { // 確保 p->next 沒 mark or flag
            break;
        }
        n = n_ptr(p_val);
        list_hp_protect_ptr(hp, 1, (uintptr_t) n, tid);
        if((p->next & ~F_FLAG) != (uintptr_t) n)
            goto retry;
    }
    if (!n) // empty
        return NULL;
    clear_flag(p, n, tid);
    list_hp_clear(hp,tid);
    list_hp_retire(hp,n,tid);
    return n;
}
static bool clear_flag(struct nblist_node *prev, struct nblist_node *n, int tid)
{
    struct nblist_node *old =
        atomic_exchange_explicit(&n->backlink, prev, memory_order_release);
    assert(!old || old == prev);
    uintptr_t nextval = atomic_load_explicit(&n->next, memory_order_relaxed);
    while ((nextval & F_MARK) == 0) { // mark n
        while ((nextval & F_FLAG) != 0) { // n->next 要被移除 , 要先移出它才能 mark n
            list_hp_protect_ptr(hp, 2, nextval & ~F__MASK, tid);
            if(n->next == nextval) {
                clear_flag(n, n_ptr(nextval),tid);
            }
            nextval = atomic_load(&n->next);
        }
        if (atomic_compare_exchange_strong_explicit(
                &n->next, &nextval, nextval | F_MARK, memory_order_release,
                memory_order_relaxed)) {
            nextval |= F_MARK;
        }
    }
    return rend_the_marked(prev, n, nextval);
}

主要就是將要用到的 node 加入 harzard pointer，加入之後要確保 node 還沒被 physically delete，因為 physically delete 的 node 可能在加入前就被釋放了，遇到這個狀況就必須重來
push 最多需要保護兩個 node，而 pop 最多則需要保護 3 個
insert 和 delete 因為會透過 backlink 來往回走，所以只保護 3 個 node 應該是不夠的，且也無法像上面一樣透過 physically delete 來判斷加進 hazard pointer 前，node 是否以被釋放，目前還要想該怎麼做

觀察記憶體使用量

學習 kdnvt 同學，利用 lab0 看過的 valgrind 搭配 massif 來觀察 memory 的使用情況，這裡 thread 數為 4 、 node 數維持在 1024 左右

沒做 memory reclamation
有 memory reclamation

綠色為 push 新的 node 所使用的記憶體量
可以發現沒做 memory reclamation 的話，pop 並不會釋放記憶體，而 push 則一直配置新的記憶體，使記憶體使用量隨時間成長到 132Mib 左右
反之，有做 memory reclamation 的記憶體使用量則維持在 24KiB 左右，這 24 Kib 也就是 node 大小(24B) * node 數(1024左右)，可見 pop 有成功釋放記憶體並讓 push 去重用它們

用自己的 Memory Allocator 配置 linked list 記憶體

2278af42c3 引入自己實做的 memory allocator

運用在 linked list 上時，想到配置及釋放記憶體的 thread 不一定是同一個，且想像 malloc、free 一樣只傳記憶體地址進去，所以配置記憶體時會額外給 16 bytes，4 bytes 存記憶體大小，4 bytes 存 Memory Allocator id，8 bytes 存下一個 reuse block 的位址
也因為可能從多個 thread 同時釋放記憶體到某個 thread 上，在 push reuse_block_t 到 freed[i] 及 pop 出 reuse_block_t 時要透過 CAS 來保護

觀察不同 memory allocator 用在 linked list 上 throughput 及 scability 的差異

紅線及藍 x 是 glibc 的實做 ; 綠線及紅 + 是我的實做

push 及 pop 的 throughput 和 scability

push 和 pop 有做 memory reclamation
Update: 100 次操作中有幾次為 push 或 pop 操作
Size: 一開始 linked list 有多少 key，key 介於 0 ~ 2 * size
Throughput： push 及 pop 操作的次數
兩者的 throughput 並沒有太大的差異，而我的實做 scability 稍微高一些，不過因為前面提到的 push 會 spin 的緣故，兩者的 scability 及 throughput 都是隨著 thread 的增加而下降

insert 及 delete 的 throughput 和 scability

insert 和 delete 沒做 memory reclamation
Update: 100 次操作中有幾次為 push 或 pop 操作
Size: 一開始 linked list 有多少 key，key 介於 0 ~ 2 * size
Count : 連續配置記憶體及釋放記憶體的次數(e.g: count = 1 表示配置完 1 個後接著釋放 1 個，反覆執行)
Throughput： push 及 pop 操作的次數
根據不同的 size、update、count 實驗觀察後無法明確指出什麼情況下我的實做會比較好，但從圖可以看到依據不同的 Size 及 Count，scability 及 throughput 都有些微的差異，有時我的實做較佳(如圖(二)(四)(五))，有時則是 glibc 的實做較佳 (如圖(一)(三)(六))
其中有發現當 size=128/count=50 圖(四) 蠻特別的， glibc 的 scability 會在 thread 數為 4 以上就不再上升，不過 size 變大後就沒有這個情況了 圖(五)、圖(六)