2021q3 Homework2 (hp)

# 2021q3 Homework2 (hp) contributed by < [`linD026`](https://github.com/linD026) > ###### tags: `linux2021` > [2021 年暑期第 2 週隨堂測驗題目](https://hackmd.io/@sysprog/linux2021-summer-quiz2) --- ## 解釋程式碼運作原理 ### tid ```cpp static thread_local int tid_v = TID_UNKNOWN; static atomic_int_fast32_t tid_v_base = ATOMIC_VAR_INIT(0); static inline int tid(void) { if (tid_v == TID_UNKNOWN) { tid_v = atomic_fetch_add(&tid_v_base, 1); assert(tid_v < HP_MAX_THREADS); } return tid_v; } ``` * 每個執行緒各自進行， `tid_v` 擁有 thread-local storage duration 。 * 執行緒初次呼叫此函式會先根據 `tid_v_base` 設定好自己所屬的 `tid_v` 。 ### Harzard Pointer 資料結構 ```cpp typedef struct { int size; uintptr_t *list; } retirelist_t; typedef void(list_hp_deletefunc_t)(void *); typedef struct list_hp { int max_hps; alignas(128) atomic_uintptr_t *hp[HP_MAX_THREADS]; alignas(128) retirelist_t *rl[HP_MAX_THREADS * CLPAD]; list_hp_deletefunc_t *deletefunc; } list_hp_t; ``` * retire list 是最大執行緒數量乘以 `CLPAD` ，因此每個執行緒最少有 `CLPAD` 個 retire list ，此設定是對 cache coherence 做處理。 :::warning `CLPAD` cacheline pad 避免相鄰使得不必要同步的資訊也 cache coherence 。 > [Atomics 操作 - 背後作祟的 cacheline](https://hackmd.io/@sysprog/concurrency-atomics) 根據 [Wikipedia - False sharing](https://en.wikipedia.org/wiki/False_sharing) : > In computer science, false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that is not being altered by another party, but those data share a cache block with data that are being altered, the caching protocol may force the first participant to reload the whole cache block despite a lack of logical necessity. > By far the most common usage of this term is in modern multiprocessor CPU caches, where memory is cached in lines of some small power of two word size (e.g., 64 aligned, contiguous bytes). **If two processors operate on independent data in the same memory address region storable in a single line, the cache coherency mechanisms in the system may force the whole line across the bus or interconnect with every data write, forcing memory stalls in addition to wasting system bandwidth.** In some cases, the elimination of false sharing can result in order-of-magnitude performance improvements. 因此對於兩個位於相同記憶體區塊的資料，在對其中一個資料進行操作時， cache coherency mechanisms 可能會強制把整段記憶體區塊進行搬運進而造成另一個資料也要等待 cache coherence 。因此，若兩著不相干，在它們之間做出區隔會比較好。至於為什麼是 128 byte ，可能跟 intel 架構有關，資料如下： **[False sharing and 128-byte alignment/padding](https://stackoverflow.com/questions/29199779/false-sharing-and-128-byte-alignment-padding)** > As Hans pointed out in a comment, some info about this can be found in ["Intel® 64 and IA-32 architectures optimization reference manual"](https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#optimization), in section 3.7.3 "Hardware Prefetching for Second-Level Cache", about the Intel Core microarchitecture: > > **"Streamer — Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line."** **原文節錄 - 3.7.3 Hardware Prefetching for Second-Level Cache** The Intel Core microarchitecture contains two second-level cache prefetchers: * **Streamer** — **Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes.** The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line. To software, the L2 streamer’s functionality is similar to the adjacent cache line prefetch mechanism found in processors based on Intel NetBurst microarchitecture. * **Data prefetch logic (DPL)** — DPL and L2 Streamer are triggered only by writeback memory type. They prefetch only inside page boundary (4 KBytes). Both L2 prefetchers can be triggered by software prefetch instructions and by prefetch request from DCU prefetchers. DPL can also be triggered by read for ownership (RFO) operations. The L2 Streamer can also be triggered by DPL requests for L2 cache misses. Software can gain from organizing data both according to the instruction pointer and according to line strides. For example, for matrix calculations, columns can be prefetched by IP-based prefetches, and rows can be prefetched by DPL and the L2 streamer. 上述的 [stackoverflow]((https://stackoverflow.com/questions/29199779/false-sharing-and-128-byte-alignment-padding)) 文章也有提到， facebook 的 [folly](https://github.com/cjl3080434008/facebook_folly/blob/4a8fc0f3b26d84ff7b99e276ee1b0fabc49d58ea/folly-master/folly/detail/CacheLocality.h#L130) 也有相關操作： ```cpp /// An attribute that will cause a variable or field to be aligned so that /// it doesn't have false sharing with anything at a smaller memory address. #define FOLLY_ALIGN_TO_AVOID_FALSE_SHARING __attribute__((__aligned__(128))) ``` [facebook_folly/folly-master/folly/detail/CacheLocality.h](https://github.com/cjl3080434008/facebook_folly/blob/4a8fc0f3b26d84ff7b99e276ee1b0fabc49d58ea/folly-master/folly/detail/CacheLocality.h#L113) ```cpp enum { /// Memory locations on the same cache line are subject to false /// sharing, which is very bad for performance. Microbenchmarks /// indicate that pairs of cache lines also see interference under /// heavy use of atomic operations (observed for atomic increment on /// Sandy Bridge). See FOLLY_ALIGN_TO_AVOID_FALSE_SHARING kFalseSharingRange = 128 }; ``` 而在此 [commit](https://github.com/facebook/folly/commit/8091d7199edecafa678d009b82f04c36dd8ce9a7) 當中，改以 C++11 的 `alignas` 形式。 ``` [Folly] Kill `FOLLY_ALIGNED` etc. `alignas` is standardized as of C++11. Let us just use that. Replace: * `FOLLY_ALIGNED` with `alignas` * `FOLLY_ALIGNED_MAX` with `alignas(folly::max_align_v)` * `FOLLY_ALIGN_TO_AVOID_FALSE_SHARING` with `alignas(folly::hardware_destructive_interference_size)` Because where `alignas` may be placed is more restrictive than where attributes may be placed, we also need to move these directives in some cases on top of doing the replacement. ``` ::: ```cpp struct align_struct { alignas(128) int *l[128 * CLPAD]; }; struct normal_struct { alignas(128) int *l[128]; }; int main(void) { struct align_struct align; struct normal_struct normal; for (int i = 0;i < 128;i++) { printf("normal %p, align %p\n", &normal.l[i], &align.l[i * CLPAD]); } return 0; } normal 0x7ffddc16f300, align 0x7ffddc16f700 normal 0x7ffddc16f308, align 0x7ffddc16f780 normal 0x7ffddc16f310, align 0x7ffddc16f800 normal 0x7ffddc16f318, align 0x7ffddc16f880 normal 0x7ffddc16f320, align 0x7ffddc16f900 normal 0x7ffddc16f328, align 0x7ffddc16f980 normal 0x7ffddc16f330, align 0x7ffddc16fa00 normal 0x7ffddc16f338, align 0x7ffddc16fa80 ``` ### Harzard Pointer 函式 ```cpp /* Retire an object that is no longer in use by any thread, calling * the delete function that was specified in list_hp_new(). * * Progress condition: wait-free bounded (by the number of threads squared) */ void list_hp_retire(list_hp_t *hp, uintptr_t ptr) { retirelist_t *rl = hp->rl[tid() * CLPAD]; // append the ptr we want to delete rl->list[rl->size++] = ptr; assert(rl->size < HP_MAX_RETIRED); if (rl->size < HP_THRESHOLD_R) return; for (size_t iret = 0; iret < rl->size; iret++) { uintptr_t obj = rl->list[iret]; bool can_delete = true; for (int itid = 0; itid < HP_MAX_THREADS && can_delete; itid++) { for (int ihp = hp->max_hps - 1; ihp >= 0; ihp--) { // if the thread's hp stored the ptr equal to obj // cannot delete. if (atomic_load(&hp->hp[itid][ihp]) == obj) { // can_delete = false is breaking // the second for loop can_delete = false; break; } } } // when this obj is not in all the hp, delete it. if (can_delete) { // 1 2 3 4 5 6 7 8 // [ ] [ ] [ ] iret [ ] [ ] [ ] size // byte = 4 size_t bytes = (rl->size - iret) * sizeof(rl->list[0]); memmove(&rl->list[iret], &rl->list[iret + 1], bytes); rl->size--; // RRR; hp->deletefunc((void *) obj); } } } ``` * 要釋放的物件增加至呼叫此函式的執行緒的 retire list 裡，隨後針對 retire list 之中的每個物件進行檢查。只要所有執行緒的 harzard pointer 沒有此物件便會以 `memove` 形式覆蓋 retire list 的物件資料，並利用先前設定個 `deletefunc` 釋放。 --- ### List ```cpp #define N_ELEMENTS 128 #define N_THREADS (64 /*NNN*/) #define MAX_THREADS 128 static atomic_uint_fast32_t deletes = 0, inserts = 0; enum { HP_NEXT = 0, HP_CURR = 1, HP_PREV }; #define is_marked(p) (bool) ((uintptr_t)(p) & 0x01) #define get_marked(p) ((uintptr_t)(p) | (0x01)) #define get_unmarked(p) ((uintptr_t)(p) & (~0x01)) #define get_marked_node(p) ((list_node_t *) get_marked(p)) #define get_unmarked_node(p) ((list_node_t *) get_unmarked(p)) ``` ```cpp typedef uintptr_t list_key_t; typedef struct list_node { alignas(128) uint32_t magic; alignas(128) atomic_uintptr_t next; list_key_t key; } list_node_t; typedef struct list { atomic_uintptr_t head, tail; list_hp_t *hp; } list_t; ``` --- ### Insert ```cpp= bool list_insert(list_t *list, list_key_t key) { list_node_t *curr = NULL, *next = NULL; atomic_uintptr_t *prev = NULL; list_node_t *node = list_node_new(key); while (true) { if (__list_find(list, &key, &prev, &curr, &next)) { list_node_destroy(node); list_hp_clear(list->hp); return false; } atomic_store_explicit(&node->next, (uintptr_t)curr, memory_order_relaxed); uintptr_t tmp = get_unmarked(curr); if (atomic_compare_exchange_strong(prev, &tmp, (uintptr_t)node)) { list_hp_clear(list->hp); return true; } } } ``` * 第 18 行會確保前一個節點的指標是否還指著 `curr` 的地址，若依然指著變替換曾插入節點的地址。 ```graphviz digraph main { label = "abstract" node [shape = record] rankdir = LR n1 [label = "{prev|<n> next}"] n2 [label = "{<c> curr|<n> next}"] n3 [label = "{<n> next|next}"] n4 [label = "{node|<n>next}"] n1:n -> n2:c n2:n -> n3:n n1:n -> n4 n4:n -> n2:c } ``` ```cpp= static bool __list_find(list_t *list, list_key_t *key, atomic_uintptr_t **par_prev, list_node_t **par_curr, list_node_t **par_next) { atomic_uintptr_t *prev = NULL; list_node_t *curr = NULL, *next = NULL; try_again: prev = &list->head; curr = (list_node_t *) atomic_load(prev); (void) list_hp_protect_ptr(list->hp, HP_CURR, (uintptr_t) curr); if (atomic_load(prev) != get_unmarked(curr)) goto try_again; while (true) { ... } *par_curr = curr; *par_prev = prev; *par_next = next; return false; } ``` * 在第 15 行儲存 `curr` 指向的節點地址至 `hp` 以確保若其他執行緒刪除此節點時不會立即被釋放。 * 第 12 和 13 行，`prev` 儲存的是 `list->head` 指標的地址。 `atomic_load(ptr)` 是以指標形式傳入，因此若以 `atomic_load(prev)` 呼叫會以 `head` 值回傳。 * 而以 harzard pointer 形式保證 `curr` 的動作分別為第 13 及 14 行，但在第 15 、 16 行有在儲存於 harzard pointer 的儲存空間後再做一次確認，因為若要刪除 `curr` 所指向的節點， `prev` 的值在 `list_delete` 當中一定會作更改，因此若兩者不相同會再次嘗試。 ```cpp= while (true) { if (!get_unmarked_node(curr)) return false; next = (list_node_t *) atomic_load(&get_unmarked_node(curr)->next); (void) list_hp_protect_ptr(list->hp, HP_NEXT, get_unmarked(next)); if (atomic_load(&get_unmarked_node(curr)->next) != (uintptr_t) next) break; if (get_unmarked(next) == atomic_load((atomic_uintptr_t *) &list->tail)) break; if (atomic_load(prev) != get_unmarked(curr)) goto try_again; ... curr = next; (void) list_hp_protect_release(list->hp, HP_CURR, get_unmarked(next)); } ``` * 在第 6 行儲存 `next` 指向的節點地址至 `hp` 以確保若其他執行緒刪除此節點時不會立即被釋放。 * 以下為第一次執行過程： ```graphviz digraph main { node [shape = record] rankdir = LR constraint=fale subgraph cluster_list { label = "list" subgraph cluster_prev { label = "prev" head } tail } hn [label = "{0|next}"] tn [label = "{UINTPTR_MAX|next}"] curr [label = "curr", shape = plaintext] next [label = "next", shape = plaintext] hn -> tn head -> hn tail -> tn curr -> hn next -> tn } ``` * 第一次的插入因第 10 行的判斷會跳出迴圈，並且在 `list_insert` 之中實現： ```graphviz digraph main { label = "first insert" node [shape = record] rankdir = LR constraint=fale subgraph cluster_list { label = "list" subgraph cluster_prev { label = "prev" head } tail } hn [label = "{0|next}"] tn [label = "{UINTPTR_MAX|next}"] n [label = "{<h>1|<n>next}"] curr [label = "curr", shape = plaintext] next [label = "next", shape = plaintext] hn -> tn head -> n:h tail -> tn curr -> hn n:n -> hn next -> tn } ``` :::warning 雖然圖中的 `next` 是直接指向下一個節點，但這方便繪圖才以此標示的。 `next` 值應以 `curr->next` 值為主。 ::: --- * 第二次插入： ```graphviz digraph main { node [shape = record] rankdir = LR constraint=fale subgraph cluster_list { label = "list" subgraph cluster_prev { label = "prev" head } tail } hn [label = "{0|next}"] tn [label = "{UINTPTR_MAX|next}"] n [label = "{<h>1|<n>next}"] curr [label = "curr", shape = plaintext] next [label = "next", shape = plaintext] hn -> tn head -> n:h tail -> tn curr -> n n:n -> hn next -> hn } ``` ```graphviz digraph main { label = "second insert" node [shape = record] rankdir = LR constraint=fale subgraph cluster_list { label = "list" head tail } hn [label = "{0|next}"] tn [label = "{UINTPTR_MAX|next}"] n [label = "{<h>1|<n>next}"] prev [label = "prev", shape = plaintext] curr [label = "curr", shape = plaintext] next [label = "next", shape = plaintext] n2 [label = "{2|<n>next}"] hn -> tn head -> n:h tail -> tn curr -> hn next -> tn prev -> n:n n:n -> n2 n2:n -> hn } ``` --- * 第 8 、9 行的情況在於 `next` 節點被另一個執行緒標記刪除時， `prev` 會有所更改。但這不影響插入的動作。 ```cpp //threadA currA->next => nextA => //threadB |prev => currB(delete) => nextB // currA->next => => nextB // currA->next != nextA if (atomic_load(&get_unmarked_node(curr)->next) != (uintptr_t)next) break; ``` 然而，若是被標記刪除的節點是 `curr` 而非 `next` 時，這會導致 `list_insert` 函式插入的節點的動作導致原先被標記刪除的節點重新連回 `list` 之中： ```graphviz digraph main { node [shape = record] rankdir = LR constraint=fale subgraph cluster_list { label = "list" head tail } hn [label = "{0|next}"] tn [label = "{UINTPTR_MAX|next}"] n [label = "{<h>1|<n>next}"] prev [label = "prev", shape = plaintext] curr [label = "curr", shape = plaintext] next [label = "next", shape = plaintext] n2 [label = "{2|<n>next (marked)}"] hn -> tn head -> n:h tail -> tn curr -> n2 next -> hn prev -> n:n n:n -> hn:n n2:n -> hn } ``` ```graphviz digraph main { node [shape = record] rankdir = LR constraint=fale subgraph cluster_list { label = "list" head tail } hn [label = "{0|next}"] tn [label = "{UINTPTR_MAX|next}"] n [label = "{<h>1|<n>next}"] prev [label = "prev", shape = plaintext] curr [label = "curr", shape = plaintext] next [label = "next", shape = plaintext] n2 [label = "{2|<n>next (marked)}"] n3 [label = "{3|<n>next}"] hn -> tn head -> n:h tail -> tn curr -> n2 next -> hn prev -> n:n n2:n -> hn n3 -> n2 n:n -> n3 } ``` 因此應為下列表示。在`curr` 節點被標為刪除時， `list_insert` 還是把 `curr` 給重新連回 `list` 之中，對於之後或其他執行緒減少在 `list` 之中不必要的走訪。 ```cpp if (atomic_load(&get_unmarked_node(curr)->next) != get_unmarked(next)) break; ``` --- * 當 `*key` 小於 `curr` 節點的值，會插入。例如，在下圖狀況下插入值為 1 時： ```cpp if (!(get_unmarked_node(curr)->key < *key)) { *par_curr = curr; *par_prev = prev; *par_next = next; return get_unmarked_node(curr)->key == *key; } ``` ```graphviz digraph main { node [shape = record] rankdir = LR constraint=fale subgraph cluster_list { label = "list" subgraph cluster_prev { label = "prev" head } tail } hn [label = "{0|next}"] tn [label = "{UINTPTR_MAX|next}"] n [label = "{<h>2|<n>next}"] curr [label = "curr", shape = plaintext] next [label = "next", shape = plaintext] hn -> tn head -> n:h tail -> tn curr -> n n:n -> hn next -> hn } ``` ```graphviz digraph main { node [shape = record] label = "inserted value is smaller than curr's value" rankdir = LR constraint=fale subgraph cluster_list { label = "list" subgraph cluster_prev { label = "prev" head } tail } hn [label = "{0|next}"] tn [label = "{UINTPTR_MAX|next}"] n [label = "{<h>2|<n>next}"] n1 [label = "{<h>1|<n>next}"] curr [label = "curr", shape = plaintext] next [label = "next", shape = plaintext] hn -> tn head -> n1:h n1:n -> n:h tail -> tn curr -> n n:n -> hn next -> hn } ``` --- ### Delete ```graphviz digraph delete { node [shape = box] rankdir = LR n1 [label = "{node 1 |<n> next}", shape = record] subgraph cluster_mark { label = "want to delete" n2 [label = "{<c> curr|<n> next}", shape = record] n2a [label = "{curr| next (marked)}", shape = record] } n3 [label = "{<n> next|next}", shape = record] prev [label = "prev", shape = plaintext] prev -> n1:n n1:n -> n2:c n2:n -> n3:n } ``` ```graphviz digraph delete { label = "delete" node [shape = box] rankdir = LR n1 [label = "{node 1 |<n> next}", shape = record] n3 [label = "{<n> next|next}", shape = record] n2 [label = "{<c> curr|<n> next ( marked )}", shape = record] n0 [label = "{node 0 |<n> next}", shape = record] n0:n -> n1 n2:n -> n3:n n1:n -> n3 prev [label = "prev", shape = plaintext] prev -> n1:n } ``` ```cpp bool list_delete(list_t *list, list_key_t key) { list_node_t *curr, *next; atomic_uintptr_t *prev; while (true) { if (!__list_find(list, &key, &prev, &curr, &next)) { list_hp_clear(list->hp); return false; } // marke delete uintptr_t tmp = get_unmarked(next); if (!atomic_compare_exchange_strong(&curr->next, &tmp, get_marked(next))) continue; tmp = get_unmarked(curr); if (atomic_compare_exchange_strong(prev, &tmp, get_unmarked(next))) { list_hp_clear(list->hp); list_hp_retire(list->hp, get_unmarked(curr)); } else { list_hp_clear(list->hp); } return true; } } ``` `__list_find` 回傳 `false` 就不會刪除。而 `__list_find` 函式只有在下列才會回傳 `true` 。此判斷式是對於刪除擁有 `*key` 值的節點作判斷，當 `*key` 值小於等於 `curr` 節點的值，會進入到此判斷並且確認是否相等，若相等則進行刪除。此行為不影響 `insert` 操作，因回傳相等， `list_insert` 會銷毀要插入的值。 ```cpp if (!(get_unmarked_node(curr)->key < *key)) { *par_curr = curr; *par_prev = prev; *par_next = next; return get_unmarked_node(curr)->key == *key; } ``` 當然也可以把判斷式改為相等時，並直接回傳 `true` 。 ```cpp // before if (!(get_unmarked_node(curr)->key < *key)) { *par_curr = curr; *par_prev = prev; *par_next = next; return get_unmarked_node(curr)->key == *key; } //--------------------------------------------------// // after if (get_unmarked_node(curr)->key == *key) { *par_curr = curr; *par_prev = prev; *par_next = next; return true; } ``` ![](https://imgur.com/aZTEmLH.png) 當 `next` 值已被標記，表示 `curr` 所只的節點已經由 `list_delete` 嘗試刪除過。因此會嘗試讓指向 `curr` 指標的指標的 `prev` 進行 CAS ，以確保前一個節點的 `next` 的正確性。若 CAS 成功會把 `curr` 指標放入 retire list 裡。之後繼續走訪 list 。 ```cpp if (get_unmarked_node(next) == next) { if (!(get_unmarked_node(curr)->key < *key)) { *par_curr = curr; *par_prev = prev; *par_next = next; return get_unmarked_node(curr)->key == *key; } prev = &get_unmarked_node(curr)->next; (void) list_hp_protect_release(list->hp, HP_PREV, get_unmarked(curr)); } else { uintptr_t tmp = get_unmarked(curr); if (!atomic_compare_exchange_strong(prev, &tmp, get_unmarked(next))) goto try_again; list_hp_retire(list->hp, get_unmarked(curr)); } ``` 也有試著使用 `is_marked` marco ，然而與原先差距並不會很大。 ```cpp // before if (get_unmarked_node(next) == next) { // after if (!is_marked(next)) { ``` ![](https://imgur.com/EKeS213.png) ![](https://imgur.com/2LXbWL8.png) :::info **論文** * [A Pragmatic Implementation of Non-Blocking Linked-Lists](https://www.cl.cam.ac.uk/research/srg/netos/papers/2001-caslists.pdf) * [Lock-Free Data Structures with Hazard Pointers](https://erdani.org/publications/cuj-2004-12.pdf) ::: --- ## 嘗試改進 ### Sorting and Binary search 論文裡的改進方式是在尋找之前作排序，之後再利用二元搜尋尋找。 > To optimize that, we can sort the hazard pointers before lookup, and then perform one binary search in it for each retired point. 若以 [merge sort](https://en.wikipedia.org/wiki/Merge_sort) 來看，會是 $th \times hp \times ( rl \times log(rl) + log_2(rl))$ 。 ### Array and Red Black Tree 原先在 `list_hp_retire` 當中尋找使用的物件會以三層迴圈方式搜尋，這會是 $rl \times th \times hp$ 的時間複雜度。因此思考更改 retire list 的資料結構，改由 [rbtree](https://github.com/linD026/linux2021_harzard_pointer/blob/main/rbtree.h) ，並從 hp 開始搜尋 retire list 的方式縮短時間複雜度成 $th \times hp \times 2c \times log(rl)$ 。 * $c$ 為常數 * $2c \times log(rl)$ 為搜尋時間 hp 的節點，以及搜尋到後插入至另一個樹所花費時間紅黑樹版本 [`vrb_listv1.c`](https://github.com/linD026/linux2021_harzard_pointer/blob/main/vrb_listv1.c) ，在 64 執行緒下： ![](https://imgur.com/oMmwiSZ.png) ![](https://imgur.com/LIc4O9r.png) 在 32 執行緒下： ```shell array mean: 146678547.833 rbtree mean: 150426176.169 ``` 以上為 `clock_gettime` 形式紀錄花費時間，可以看到原先版本較低，這可能是函式呼叫成本等維持紅黑樹有關。為了確認猜測，之後以 `perf` 進行分析可以明顯看到紅黑樹的 instructions 以及 cycles 確實比較高： * array ```shell Performance counter stats for './list' (1000 runs): 138,806 cache-misses # 1.356 % of all cache refs ( +- 0.21% ) 10,233,726 cache-references ( +- 0.48% ) 1,402,820,507 instructions # 1.10 insn per cycle ( +- 0.18% ) 1,280,720,776 cycles ( +- 0.20% ) 0.151228 +- 0.000375 seconds time elapsed ( +- 0.25% ) ``` * rbtree ```shell Performance counter stats for './vrb_list' (1000 runs): 126,670 cache-misses # 1.130 % of all cache refs ( +- 0.23% ) 11,207,687 cache-references ( +- 0.43% ) 1,454,082,912 instructions # 1.10 insn per cycle ( +- 0.16% ) 1,322,324,864 cycles ( +- 0.18% ) 0.155687 +- 0.000364 seconds time elapsed ( +- 0.23% ) ``` 後來不採用原先標頭檔定義的刪除，針對此程式的 rbtree 的刪除作特別處理。原先的處理方式是新建一棵樹，把 hp 裡有的節點新增至新的樹當中，之後在對舊的樹清除。最佳化後則是不新建新樹，改以標記方式處理。時間花費： $th \times hp \times c \times log(rl) + rl$ 。 * $c$ 為常數 * $log(rl)$ 為搜尋時間 * $+ rl$ 則是迴圈操作後，更新所有紅黑樹節點的資訊的走訪可見 [vrb_list.c](https://github.com/linD026/linux2021_harzard_pointer/blob/main/vrb_list.c#L147) 。得到： ```shell Performance counter stats for './vrb_list' (1000 runs): 122,556 cache-misses # 1.085 % of all cache refs ( +- 0.24% ) 11,295,186 cache-references ( +- 0.44% ) 1,453,763,246 instructions # 1.12 insn per cycle ( +- 0.17% ) 1,299,941,442 cycles ( +- 0.19% ) 0.154815 +- 0.000368 seconds time elapsed ( +- 0.24% ) ``` 然而若以 `clock_gettime` 方式計算花費時間，雖然更改過後的 rbtree 有比較好，但還是比不過原先方式。 ![](https://imgur.com/lTXa6DE.png) ```shell array mean: 146784669.18 rbtree mean: 150426176.169 rbtree v2 mean: 150391884.431 ``` ### Ordered :::warning 此改進針對[ 7 月 30 日](https://github.com/sysprog21/concurrent-programs/commit/b34b773a9fd0730f77f3a651422169e859a679a3#diff-d5e01af5d18e7feef0cc4f0ab5fb5e71a7df4e3196b24fa3224f34b54dc9de82)的版本，**已在此 [commit](https://github.com/sysprog21/concurrent-programs/commit/83ef91e8cd2bee343c692761416d2e8de1793467#diff-d5e01af5d18e7feef0cc4f0ab5fb5e71a7df4e3196b24fa3224f34b54dc9de82) 被修正**。且此篇 [ordered 的實作手法](https://github.com/linD026/linux2021_harzard_pointer/blob/main/ordered.c#L322)與 [concurrent programs - hp_list](https://github.com/sysprog21/concurrent-programs/tree/master/hp_list) 有些許不同。 ::: `__list_find` 對 linked list 的操作會造成 linked list 的節點無順序。而嘗試以 [A Pragmatic Implementation of Non-Blocking Linked-Lists](https://www.cl.cam.ac.uk/research/srg/netos/papers/2001-caslists.pdf) 裡的形式進行改寫， [ordered.c](https://github.com/linD026/linux2021_harzard_pointer/blob/main/ordered.c#L296) 。以下為 32 執行緒、陣列結構的 retire list 的比較： ```shell Performance counter stats for './list' (1000 runs): 72,349 cache-misses # 0.803 % of all cache refs ( +- 1.72% ) 9,013,861 cache-references ( +- 0.52% ) 1,327,629,956 instructions # 1.11 insn per cycle ( +- 0.27% ) 1,194,012,875 cycles ( +- 0.30% ) 0.141588 +- 0.000380 seconds time elapsed ( +- 0.27% ) ``` ```shell Performance counter stats for './ordered' (1000 runs): 63,494 cache-misses # 0.788 % of all cache refs ( +- 0.36% ) 8,059,975 cache-references ( +- 0.45% ) 1,108,421,582 instructions # 1.05 insn per cycle ( +- 0.20% ) 1,059,742,403 cycles ( +- 0.23% ) 0.111844 +- 0.000155 seconds time elapsed ( +- 0.14% ) ``` 論文中的 `search` ( `__list_find_ordered` )： ```cpp private Node *List::search(KeyType search_key, Node **left_node) { Node *left_node_next, *right_node; search_again: do { Node *t = head; Node *t_next = head.next; /* 1: Find left_node and right_node */ do { if (!is_marked_reference(t_next)) { (*left_node) = t; left_node_next = t_next; } t = get_unmarked_reference(t_next); if (t == tail) break; t_next = t.next; } while (is_marked_reference(t_next) || (t.key < search_key)); /*B1*/ right_node = t; /* 2: Check nodes are adjacent */ if (left_node_next == right_node) if ((right_node != tail) && is_marked_reference(right_node.next)) goto search_again; /*G1*/ else return right_node; /*R1*/ /* 3: Remove one or more marked nodes */ if (CAS(&(left_node.next), left_node_next, right_node)) /*C1*/ if ((right_node != tail) && is_marked_reference(right_node.next)) goto search_again; /*G2*/ else return right_node; /*R2*/ } while (true); /*B2*/ } ``` ### [A more Pragmatic Implementation of the Lock-free, Ordered, Linked List](https://arxiv.org/abs/2010.15755) :::warning 此處 `prev` 節點指 `prev` 指標指向的指標所屬的節點。 ::: 之後，又根據此篇論文些微調整了排序版本的 `__list_find_ordered` 、 `list_delete` 以及 `list_insert` 函式。 * `__list_find_ordered` 對於 CAS 失敗會從 `list->head` 重新開始。然而對於搜尋依據 `*key` 所尋找的位置 `prev` 的節點會小於 `*key` 值，而 `curr` 的節點會是大於等於 `*key` 值。所以如果 `prev` 節點沒有被標記刪除時，而 CAS 失效時表示有其他執行序在 `prev` 和 `curr` 之間插入新的節點，因此從 `prev` 節點的 `next` 值重新更新 `curr` 值的話，表示還可以從 `curr` 開始尋找。 > Upon failure of the `CAS()` operation in the search function, if the item on which the operation failed has not become marked (only the pointer changed due to another thread having had success in linking out the item), there is no reason to restart the search from the head of the list; it suffices to reread the next pointer of the item. * `list_delete` 若以刪除的操作來看，當已尋找目標 `key` 的 `curr` 節點已被標記為刪除，表示當下其他執行緒已經標記刪除了，因此可視為已刪除。而若 `prev` 的節點被標記刪除，在此 singly linked list 沒辦法往回找最近為標記的節點，因此只能重新尋找。但是若 `prev` 的節點未被標記刪除則有可能是在 `prev` 與 `curr` 節點之間被插入新的值，因此從 `prev` 的節點重新開始尋找即可。 > In the `rem()` operation, if the `CAS()` operation fails due to the item having become marked, the delete can be linearized as failed, since the item has been removed by another thread. The linearization point is not the failed `CAS()`, though, but the earlier of this and the point just before an overlapping `add()` operation is about to successfully insert an item with the same key. If instead the `CAS()` failed due to the pointer to the next item having changed, the thread can simply try again to mark the item. This can be repeated until the item to be deleted has become marked, successfully by some thread. Once the node is marked it is logically deleted and will be removed eventually. So a failure of the subsequent `CAS()` to unlink the node from the list can simply be ignored. * `list_insert` 也同於上面兩者，若 CAS 操作失效，繼續從 `prev` 走訪即可。 > In an `add()` operation, if the `CAS()` operation fails due to the next pointer having changed (but not the mark), the search for the position where to add can be continued from the item, there is no reason to go back to the head of the list. All that is needed is to reread the pointer. 為了要讓 linked list 的搜尋能夠從任意節點開始，但起始節點又不被刪除，因此在 hp 新增新的 index `HP_START` 。而在最一開始，起始搜尋從 `list->head` ，但 `list->head` 只會在銷毀整個 linked list 時刪除，因此不加入至 hp 當中。以下為更改過後的程式碼， [orderedv2.c](https://github.com/linD026/linux2021_harzard_pointer/blob/main/orderedv2.c) ： * `list_delete_once` ```cpp bool list_delete_once(list_t *list, list_key_t key) { list_node_t *curr, *next; atomic_uintptr_t *prev; try_again: // first time search if (!__list_find_ordered(list, &key, &list->head, &prev, &curr, &next)) { list_hp_clear(list->hp); return false; } do { // atomic fetch and or will return old value // so when tmp is marked, the curr node is already marked uintptr_t tmp = atomic_fetch_or(&get_unmarked_node(curr)->next, 0x01); if (is_marked(tmp)) return true; tmp = get_unmarked(curr); if (atomic_compare_exchange_strong(prev, &tmp, get_unmarked(next))) { list_hp_clear(list->hp); list_hp_retire(list->hp, get_unmarked(curr)); return true; } // prev node marked if (is_marked(atomic_load(prev))) { list_hp_clear(list->hp); goto try_again; } // continue searching (void)list_hp_protect_release(list->hp, HP_START, get_unmarked(atomic_load(prev))); if (!__list_find_ordered(list, &key, prev, &prev, &curr, &next)) { list_hp_clear(list->hp); return false; } } while (true); } ``` * `list_insert_conti` ```cpp bool list_insert_conti(list_t *list, list_key_t key) { list_node_t *curr = NULL, *next = NULL; atomic_uintptr_t *prev = NULL; list_node_t *node = list_node_new(key); try_again: // first time search if (__list_find_ordered(list, &key, &list->head, &prev, &curr, &next)) { list_node_destroy(node); list_hp_clear(list->hp); return false; } while (true) { atomic_store_explicit(&node->next, (uintptr_t)curr, memory_order_relaxed); uintptr_t tmp = get_unmarked(curr); if (atomic_compare_exchange_strong(prev, &tmp, (uintptr_t)node)) { list_hp_clear(list->hp); return true; } // prev value is smaller than key value // we need restart at prev without marking if (is_marked(atomic_load(prev))) { goto try_again; } // some one inserted node between prev and curr (void)list_hp_protect_release(list->hp, HP_START, get_unmarked(atomic_load(prev))); if (__list_find_ordered(list, &key, prev, &prev, &curr, &next)) { list_node_destroy(node); list_hp_clear(list->hp); return false; } } /* while(true) */ } ``` * `__list_find_ordered` ( orderedv2.c 版本，有針對起始節點走訪做修改) ```cpp static bool __list_find_ordered(list_t *list, list_key_t *key, atomic_uintptr_t *start, atomic_uintptr_t **par_prev, list_node_t **par_curr, list_node_t **par_next) { atomic_uintptr_t *prev = NULL; list_node_t *curr = NULL, *next = NULL; try_again: prev = start; curr = (list_node_t *)atomic_load(prev); (void)list_hp_protect_ptr(list->hp, HP_CURR, (uintptr_t)curr); while (true) { // get next next = (list_node_t *)atomic_load(&get_unmarked_node(curr)->next); (void)list_hp_protect_ptr(list->hp, HP_NEXT, get_unmarked(next)); // find left_node(prev) and right_node (curr) do { if (!is_marked(next)) { (void)list_hp_protect_release(list->hp, HP_PREV, get_unmarked(curr)); prev = &get_unmarked_node(curr)->next; } (void)list_hp_protect_release(list->hp, HP_CURR, get_unmarked(next)); curr = get_unmarked_node(next); if (get_unmarked(curr) == atomic_load((atomic_uintptr_t *)&list->tail)) break; next = (list_node_t *)atomic_load(&get_unmarked_node(curr)->next); (void)list_hp_protect_ptr(list->hp, HP_NEXT, get_unmarked(next)); } while (is_marked(next) || get_unmarked_node(curr)->key < *key); if (atomic_load(prev) == get_unmarked(curr)) { if (get_unmarked(curr) != atomic_load((atomic_uintptr_t *)&list->tail) && is_marked(get_unmarked_node(curr)->next)) goto try_again; else { *par_curr = curr; *par_prev = prev; *par_next = next; return get_unmarked_node(curr)->key == *key; } } uintptr_t tmp = get_unmarked(curr); if (atomic_compare_exchange_strong(prev, &tmp, get_unmarked(curr))) { if (get_unmarked(curr) != atomic_load((atomic_uintptr_t *)&list->tail) && is_marked(get_unmarked_node(curr)->next)) { goto try_again; } else { *par_curr = curr; *par_prev = prev; *par_next = next; return get_unmarked_node(curr)->key == *key; } } curr = (list_node_t *)atomic_load(prev); (void)list_hp_protect_release(list->hp, HP_CURR, get_unmarked(curr)); if (atomic_load(prev) != get_unmarked(curr)) goto try_again; } /*while (true)*/ } ``` 效能比對： ```shell Performance counter stats for './list' (1000 runs): 67,021 cache-misses # 0.736 % of all cache refs ( +- 0.35% ) 9,108,789 cache-references ( +- 0.51% ) 1,324,083,530 instructions # 1.11 insn per cycle ( +- 0.26% ) 1,191,982,050 cycles ( +- 0.28% ) 0.140473 +- 0.000358 seconds time elapsed ( +- 0.26% ) Performance counter stats for './ordered' (1000 runs): 65,674 cache-misses # 0.784 % of all cache refs ( +- 0.36% ) 8,378,900 cache-references ( +- 0.43% ) 1,114,187,575 instructions # 1.04 insn per cycle ( +- 0.19% ) 1,068,030,817 cycles ( +- 0.22% ) 0.112245 +- 0.000150 seconds time elapsed ( +- 0.13% ) Performance counter stats for './orderedv2' (1000 runs): 69,366 cache-misses # 0.820 % of all cache refs ( +- 0.51% ) 8,464,355 cache-references ( +- 0.43% ) 1,111,598,206 instructions # 1.04 insn per cycle ( +- 0.20% ) 1,067,236,423 cycles ( +- 0.22% ) 0.113825 +- 0.000155 seconds time elapsed ( +- 0.14% ) ``` 之後，對參考此篇論文當中對各 operation 的次數分析。 * `rtry` - 從 `list->head` 開始尋找，在此以 `goto try again` 紀錄。 * `cons` - 紀錄走訪過程中，經過多少個被標記為刪除的節點，在此以 `curr` 節點為主。 * `trav` - 紀錄走訪時的總節點數。 * `fail` - CAS 失敗次數。 * `del` - `list_delete` 失敗重新尋找次數。 * `ins` - `list_insert` 失敗重新尋找次數。 * `deletes` - 節點刪除次數。 * `inserts` - 節點新增次數。以下為十次測試後的結果： ```shell ## default list rtry cons trav fail del ins deletes inserts --------------------------------------------------------------------------------------- 359 0 216381787 5 0 5 40980 40980 ## ordered list v1 rtry cons trav fail del ins deletes inserts --------------------------------------------------------------------------------------- 0 0 213499190 159 0 7 40980 40980 ## ordered list v2 rtry cons trav fail del ins deletes inserts --------------------------------------------------------------------------------------- 0 0 211954079 97 0 2 40980 40980 ```