# 2022q1 Homework5 (quiz5)

contributed by < `kdnvt` >

## Problem `2`

This problem introduces a list to record the hazard pointers and a list to record the retired pointers. Notice that every time a thread accesses the shared object (in this case, the object `shared_config` points to), it inserts the address of the object into the hazard pointer list. Thus, the same address may appear several times in the hazard pointer list. When a reader finishes its operation, it removes one occurrence of the address; the object is safe to free once no occurrence of its address remains in the hazard pointer list. The data structure is shown below:

```graphviz
digraph foo {
    rankdir=LR;
    node [shape=record];
    a [label="{ <data> hazard ptr | <ref> }", width=1.2]
    b [label="{ <data> A | <ref> }"];
    c [label="{ <data> B | <ref> }"];
    d [label="{ <data> B | <ref> }"];
    a:ref:c -> b:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2];
    b:ref:c -> c:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false];
    c:ref:c -> d [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false];
}
```

```graphviz
digraph foo {
    rankdir=LR;
    node [shape=record];
    a [label="{ <data> retired | <ref> }", width=1.2]
    b [label="{ <data> A | <ref> }"];
    a:ref:c -> b:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2];
}
```

When the last reader finishes reading A, it removes A from the hazard pointer list. The implementation doesn't actually "remove the node" from the list; it just uses an atomic CAS to reset the recorded address to NULL (a small sketch of this slot-clearing removal follows the diagrams below). This saves the overhead of allocating and releasing a node on every operation (otherwise we might need another garbage-collection mechanism for the nodes themselves). Moreover, it avoids the trouble of the linked list being corrupted when different threads insert and remove nodes simultaneously. The diagrams below show the data structure after A is removed from the hazard pointer list:

```graphviz
digraph foo {
    rankdir=LR;
    node [shape=record];
    a [label="{ <data> hazard ptr | <ref> }", width=1.2]
    b [label="{ <data> NULL | <ref> }"];
    c [label="{ <data> B | <ref> }"];
    d [label="{ <data> B | <ref> }"];
    a:ref:c -> b:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2];
    b:ref:c -> c:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false];
    c:ref:c -> d [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false];
}
```

```graphviz
digraph foo {
    rankdir=LR;
    node [shape=record];
    a [label="{ <data> retired | <ref> }", width=1.2]
    b [label="{ <data> A | <ref> }"];
    a:ref:c -> b:data [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2];
}
```

ref: [graphviz](https://gist.github.com/adamatan/5195647)
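For illustration, here is a minimal sketch of the "clear the slot with CAS" removal described above. The node layout and function name are hypothetical (the quiz's actual `list_remove` may differ in its details); the point is that the node itself is kept for reuse and only its recorded address is reset:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node layout matching the description above: each slot holds
 * one protected address, and empty slots are reused instead of freed. */
typedef struct hp {
    uintptr_t ptr; /* 0 means "empty slot" */
    struct hp *next;
} hp_t;

/* "Remove" a protected address: find a slot holding `ptr` and CAS it back
 * to 0.  The node stays in the list so a later reader can reuse the slot. */
static bool hp_release(hp_t *head, uintptr_t ptr)
{
    for (hp_t *node = head; node; node = node->next) {
        uintptr_t expected = ptr;
        if (__atomic_compare_exchange_n(&node->ptr, &expected, 0, false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            return true; /* slot cleared, node kept for reuse */
    }
    return false; /* this thread did not hold `ptr` */
}
```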
### Memory leak

I used AddressSanitizer and found a memory leak in `list_free`: it only frees the node itself, but it should also free the object that `node->ptr` points to.

```diff
void list_free(hp_t **head)
{
    hp_t *cur = *head;
    while (cur) {
        hp_t *old = cur;
        cur = cur->next;
+       free((void *)(old->ptr));
        free(old);
    }
}
```

### Memory barrier

It's worth mentioning that `atomic_exchange`, which is used in `swap`, uses `__ATOMIC_SEQ_CST` as its memory order.

```c
#define atomic_exchange(ptr, val) \
    __atomic_exchange_n(ptr, val, __ATOMIC_SEQ_CST)

void swap(domain_t *dom, uintptr_t *prot_ptr, uintptr_t new_val, int flags)
{
    const uintptr_t old_obj = atomic_exchange(prot_ptr, new_val);
    cleanup_ptr(dom, old_obj, flags);
}
```

In `writer_thread`, we expect the random values to be written into `new_config` and `shared_config` to then point to it, so that a reader always reads the new, fully initialized object. However, on machines with a weak memory model such as DEC Alpha, Store/Store reordering can happen without a memory barrier; that is, `shared_config` may receive the new object's address before the random values are written into the object. If a reader reads in that window, it may observe the initial contents of `new_config`, which depend on the implementation of `create_config()` (in this case, zero).

```c
static void *writer_thread(void *arg)
{
    (void) arg;

    for (int i = 0; i < N_ITERS / 2; ++i) {
        config_t *new_config = create_config();
        new_config->v1 = rand();
        new_config->v2 = rand();
        new_config->v3 = rand();
        print_config("updating config", new_config);

        swap(config_dom, (uintptr_t *) &shared_config, (uintptr_t) new_config,
             1);
        print_config("updated config ", new_config);
    }

    return NULL;
}
```

Therefore, it is more portable to include a memory barrier; `__ATOMIC_SEQ_CST` is used here. A similar barrier is also used in `load`.

```c
uintptr_t load(domain_t *dom, const uintptr_t *prot_ptr)
{
    const uintptr_t nullptr = 0;

    while (1) {
        uintptr_t val = atomic_load(prot_ptr);
        hp_t *node = list_insert_or_append(&dom->pointers, val);
        if (!node)
            return 0;

        /* Hazard pointer inserted successfully */
        if (atomic_load(prot_ptr) == val)
            return val;

        /*
         * This pointer is being retired by another thread - remove this hazard
         * pointer and try again. We first try to remove the hazard pointer we
         * just used. If someone else used it to drop the same pointer, we walk
         * the list.
         */
        uintptr_t tmp = val;
        if (!atomic_cas(&node->ptr, &tmp, &nullptr))
            list_remove(&dom->pointers, val);
    }
}
```

```c
#define atomic_load(src) __atomic_load_n(src, __ATOMIC_SEQ_CST)
#define atomic_cas(dst, expected, desired)                                   \
    __atomic_compare_exchange(dst, expected, desired, 0, __ATOMIC_SEQ_CST,   \
                              __ATOMIC_SEQ_CST)

hp_t *list_insert_or_append(hp_t **head, uintptr_t ptr)
{
    hp_t *node;
    bool need_alloc = true;

    LIST_ITER (head, node) {
        uintptr_t expected = atomic_load(&node->ptr);
        if (expected == 0 && atomic_cas(&node->ptr, &expected, &ptr)) {
            need_alloc = false;
            break;
        }
    }

    if (need_alloc)
        node = list_append(head, ptr);

    return node;
}
```

The `atomic_load` and `atomic_cas` also use `__ATOMIC_SEQ_CST`. If the second `atomic_load` were reordered before the `atomic_cas`, a bug could occur when the writer changes the shared pointer within that window.

~~I considered replacing `__ATOMIC_SEQ_CST` with `__ATOMIC_RELEASE` in the `atomic_exchange` case, because so far I had only seen a possible error from Store/Store reordering.~~ `__ATOMIC_RELEASE` is a less expensive barrier than `__ATOMIC_SEQ_CST`. After some analysis, however, I think `__ATOMIC_SEQ_CST` is necessary.
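The interleavings analyzed below are instances of the classic store-buffering (Store/Load) litmus test. As a self-contained illustration (hypothetical code, not part of the quiz program), the sketch below mirrors that pattern: each thread stores to one variable and then loads the other. With `seq_cst` ordering at least one thread must observe the other's store, whereas weaker orderings allow both loads to return 0, which is exactly the Store/Load reordering at issue.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y; /* stand-ins for "ptr" and "the list" */
static int r0, r1;

static void *t0(void *arg)
{
    (void) arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);  /* "store on ptr"  */
    r0 = atomic_load_explicit(&y, memory_order_seq_cst); /* "load on list"  */
    return NULL;
}

static void *t1(void *arg)
{
    (void) arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);  /* "store on list" */
    r1 = atomic_load_explicit(&x, memory_order_seq_cst); /* "load on ptr"   */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* With seq_cst, r0 == 0 && r1 == 0 is forbidden.  A single run only
     * samples one interleaving, so a real test must loop many times. */
    printf("r0=%d r1=%d\n", r0, r1);
    return 0;
}
```

If the two stores are weakened to `memory_order_release` (or everything to `memory_order_relaxed`), the outcome `r0 == 0 && r1 == 0` becomes possible on weakly ordered hardware, because nothing then prevents each load from being satisfied before the other thread's store leaves its write buffer.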
Consider the interleavings below: `return val` in the reader and `free(val)` in the writer should never both execute for the same object.

```
        reader             |            writer
---------------------------------------------------
                           | /* store on ptr */
                           | val = atomic_exchange(ptr);
                           |
                           | /* load old */
                           | if(!list_contain(val)){
 /* store on list */       |
 list_insert(val);         |
                           |
 /* load on ptr */         |
 if(ptr == val){           |
     return val;           |
 }                         |
                           |
                           |     free(val);
                           | }
```

Here the program works fine: `ptr == val` in the reader fails, and the reader picks up the new value of `ptr` on the next loop iteration. However, if the Store/Load pair in the writer thread can be reordered, the result may be as follows:

```
        reader             |            writer
---------------------------------------------------
                           | /* store on ptr */
 /* invisible to           | val = atomic_exchange(ptr,new);
    reader */              | /* still in write buffer */
                           |
                           | /* load old */
                           | if(!list_contain(val)){
 /* store on list */       |
 list_insert(val);         |
                           |
 /* load on ptr */         |
 if(ptr == val){           |
     return val;           |
 }                         |
                           |
 /* visible to             |     // ptr = new;
    reader now */          |     free(val);
                           | }
```

This time `ptr == val` in the reader is true, so `return val` and `free(val)` are both executed, which results in an error. Now consider Store/Load reordering in the reader thread:

```
        reader             |            writer
---------------------------------------------------
 /* store in list */       |
 list_insert(val);         |
                           |
 /* load ptr */            |
 if(ptr == val){           |
     return val;           |
 }                         |
                           |
                           | /* store in ptr */
                           | val = atomic_exchange(ptr,new);
                           |
                           | /* load list */
                           | if(!list_contain(val)){
                           |     free(val);
                           | }
```

In program order (above), the writer's `list_contain(val)` sees the reader's insertion, so `free(val)` is not executed. But if the reader's store to the list is still sitting in its write buffer when the writer scans the list:

```
        reader               |            writer
---------------------------------------------------
 /* store in list */         |
 list_insert(val);           | /* invisible to writer */
 /* still in buffer */       |
                             |
 /* load ptr */              |
 if(ptr == val){             |
     return val;             |
 }                           |
                             |
                             | /* store in ptr */
                             | val = atomic_exchange(ptr);
                             |
                             | /* load list */
                             | if(!list_contain(val)){
                             |     free(val);
                             | }
 // list_insert(val);        |
 /* visible to writer now */ |
                             |
```

Store/Load reordering in the reader therefore also results in an error. In conclusion, both sides need `__ATOMIC_SEQ_CST` to prevent Store/Load reordering.

Notice that a release operation and a release fence are two different things. I was confused for a while until reading [Acquire and Release Fences Don't Work the Way You'd Expect](https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/).

I did an experiment to simulate a similar situation: [experiment](https://hackmd.io/@kdnvt/StoreLoad-experiment)

### Multiple readers/writers with Hazard Pointer

With multiple readers and one writer, the program works well (although the problem description says the thread sanitizer will report errors, I don't get any when I run the program). However, with multiple readers and multiple writers, I get error messages caused by conflicts between `free` and `calloc`. Beyond these, I think errors may also occur if multiple writers insert and remove nodes in the retired list simultaneously.

### Study [Lock-Free Data Structures with Hazard Pointers](https://erdani.org/publications/cuj-2004-12.pdf)

The paper describes the implementation of hazard pointers; roughly, its retire/scan logic works as in the sketch below.
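The helper names and fixed-size arrays in this sketch are hypothetical (not the paper's exact code); it only illustrates the structure of the algorithm: each thread keeps a private retired list, and once it reaches a threshold, the thread snapshots every thread's hazard pointers, sorts the snapshot, and frees exactly the retired pointers that do not appear in it.

```c
#include <stdint.h>
#include <stdlib.h>

#define K 2                       /* hazard pointers per thread (assumed) */
#define P 8                       /* number of threads (assumed)          */
#define R_THRESHOLD (2 * K * P)   /* scan once this many pointers retire  */

/* Stand-in for the shared per-thread hazard pointer records; a real
 * implementation reads these slots with atomic loads. */
static uintptr_t hazard[P][K];

static uintptr_t rlist[R_THRESHOLD]; /* this thread's retired pointers */
static size_t rcount;

static int cmp_uintptr(const void *a, const void *b)
{
    uintptr_t x = *(const uintptr_t *) a, y = *(const uintptr_t *) b;
    return (x > y) - (x < y);
}

/* Stage 1: snapshot all hazard pointers; stage 2: free unprotected ones. */
static void scan(void)
{
    uintptr_t plist[P * K];
    size_t pcount = 0;
    for (int t = 0; t < P; t++)
        for (int i = 0; i < K; i++)
            if (hazard[t][i])
                plist[pcount++] = hazard[t][i];
    qsort(plist, pcount, sizeof(plist[0]), cmp_uintptr);

    size_t kept = 0;
    for (size_t i = 0; i < rcount; i++) {
        if (bsearch(&rlist[i], plist, pcount, sizeof(plist[0]), cmp_uintptr))
            rlist[kept++] = rlist[i]; /* still protected: keep for later */
        else
            free((void *) rlist[i]);  /* unprotected: reclaim now        */
    }
    rcount = kept;
}

/* Retiring defers the actual free until enough pointers have accumulated. */
static void retire(uintptr_t ptr)
{
    rlist[rcount++] = ptr;
    if (rcount >= R_THRESHOLD)
        scan();
}
```

With a sorted snapshot, checking one retired pointer costs O(log M); the paper also mentions using a hash table for the snapshot, which makes the check O(1) expected.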
I compared the paper with the corresponding implementation [`hp_list`](https://github.com/sysprog21/concurrent-programs/tree/master/hp_list). One important difference is the threshold at which the retire function (`list_hp_retire` in `hp_list`) scans and releases the retired pointers. In `hp_list`, the following code sets the threshold, and `HP_THRESHOLD_R` is 0, which means `list_hp_retire` scans and releases on every call.

```
    if (rl->size < HP_THRESHOLD_R)
        return;
```

Since `hp_list` traverses every reader thread's hazard pointers instead of using a sorted dictionary structure or a hash table, checking the retired pointers costs O(MR), where M stands for the number of readers (hazard pointer slots) and R stands for the length of the retired list.

To improve the performance, I tried adding a variable `hp_count` that counts the total number of hazard pointers, and deferring the scan until `hp_count` is at most 4/5 of the length of the retired list, ~~so that the amortized time complexity reaches O(M)~~. I had a misunderstanding of amortized cost: I thought it simply meant distributing the cost over all operations, so that the cost of `list_hp_retire` would originally be O(MR/R) per operation. However, when I analyzed the problem carefully with the potential method, I found that this is wrong.

Let $\Phi = 5 \times R$, where $R$ is the length of the retired list, and let the constant $C = 1$.

$$T_{amortized}(insert) = T_{actual}(insert) + C(\Phi(S_{after}) - \Phi(S_{before}))$$

$$T_{amortized}(insert) = 1 + 5C = O(1)$$

$$T_{amortized}(retire) = T_{actual}(retire) + C(\Phi(S_{after}) - \Phi(S_{before}))$$

When using the hash table mentioned in the paper, $T_{actual}(retire)$ is $O(R)$, because checking the hazard pointers for one retired pointer only takes O(1).

$$T_{amortized}(retire) = R + C\left(-5 \times \frac{R}{5}\right) = O(1)$$

Since the number of hazard pointers is less than $\dfrac{4}{5}R$, the scan is guaranteed to release at least $\dfrac{1}{5}R$ pointers, so the potential term $C(\Phi(S_{after}) - \Phi(S_{before}))$ is at most $-CR$ (a decrease of at least $CR$).

However, `hp_list` doesn't use a hash table, so the actual cost of a retire scan is O(MR):

$$T_{amortized}(retire) = MR + C\left(-5 \times \frac{R}{5}\right) = O(MR)$$

To look at it another way: in the hash-table case, the O(R) scan cost is distributed over the R insertions that preceded it, so each operation stays O(1). In this problem, however, distributing MR over those R insertions raises the cost of each insertion from O(1) to O(M), which is not what we want.

In conclusion, to take advantage of the amortized analysis, we need a hash table so that checking the hazard pointers for one retired pointer costs O(1).

reference: [potential method](https://en.wikipedia.org/wiki/Potential_method)
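As a concrete illustration of that conclusion, here is a minimal sketch (hypothetical code, not part of `hp_list`) of an open-addressing hash set that could be rebuilt from the hazard pointer array at the start of each scan, so that each retired pointer is checked in O(1) expected time instead of O(M):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Must be a power of two and comfortably larger than the maximum number of
 * simultaneously held hazard pointers, so that probing always terminates. */
#define HSET_SIZE 256

static uintptr_t hset[HSET_SIZE];

static size_t hset_slot(uintptr_t p)
{
    return (p >> 4) & (HSET_SIZE - 1); /* drop low bits due to alignment */
}

/* Rebuild the set from scratch at the beginning of every scan. */
static void hset_reset(void)
{
    memset(hset, 0, sizeof(hset));
}

static void hset_add(uintptr_t p)
{
    size_t i = hset_slot(p);
    while (hset[i] && hset[i] != p) /* linear probing */
        i = (i + 1) & (HSET_SIZE - 1);
    hset[i] = p;
}

static bool hset_contains(uintptr_t p)
{
    for (size_t i = hset_slot(p); hset[i]; i = (i + 1) & (HSET_SIZE - 1)) {
        if (hset[i] == p)
            return true;
    }
    return false;
}
```

Building the set costs O(M) once per scan, and each of the $R$ retired pointers is then checked in O(1) expected time, matching the $T_{actual}(retire) = O(R)$ assumed in the analysis above.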
#### Try to improve hp_list

This implementation was done before I found that my amortized analysis was wrong. Even though the complexity cannot reach a constant amortized cost, I think it is still good practice to reduce the number of operations by deferring the scan.

First, I introduced a global variable `hp_count`.

```c
int hp_count = 0;
```

`hp_count` is updated with CAS whenever a reader starts protecting a new hazard pointer or finishes reading. Notice that I added an `if (!ptr)` check to `list_hp_protect_ptr` and `list_hp_protect_release`, so they do not protect the NULL pointer.

```c
void list_hp_clear(list_hp_t *hp)
{
    for (int i = 0; i < hp->max_hps; i++) {
        if (atomic_exchange_explicit(&hp->hp[tid()][i], 0,
                                     memory_order_release)) {
            int old;
            do {
                old = atomic_load(&hp_count);
            } while (!atomic_compare_exchange_weak(&hp_count, &old, old - 1));
        }
    }
}
```

```c
uintptr_t list_hp_protect_ptr(list_hp_t *hp, int ihp, uintptr_t ptr)
{
    if (!ptr)
        return ptr;
    if (!atomic_exchange_explicit(&hp->hp[tid()][ihp], ptr,
                                  memory_order_seq_cst)) {
        int old;
        do {
            old = atomic_load(&hp_count);
        } while (!atomic_compare_exchange_weak(&hp_count, &old, old + 1));
    }
    return ptr;
}
```

```c
uintptr_t list_hp_protect_release(list_hp_t *hp, int ihp, uintptr_t ptr)
{
    if (!ptr)
        return ptr;
    if (!atomic_exchange_explicit(&hp->hp[tid()][ihp], ptr,
                                  memory_order_release)) {
        int old;
        do {
            old = atomic_load(&hp_count);
        } while (!atomic_compare_exchange_weak(&hp_count, &old, old + 1));
    }
    return ptr;
}
```

Then I added the new dynamic threshold to `list_hp_retire`, using a shift operation to compute $\dfrac{5}{4}$ `hp_count`. I also used the `TRACE` macro to add runtime statistics for the traversal of retired lists and hazard pointers.

```diff
enum {
    TRACE_nop = 0,
    TRACE_retry,     /* the number of retries in the __list_find function. */
    TRACE_contains,  /* the number of wait-free contains in the __list_find
                        function that curr pointer pointed. */
    TRACE_traversal, /* the number of list element traversal in the
                        __list_find function. */
    TRACE_fail,      /* the number of CAS() failures. */
    TRACE_del,       /* the number of list_delete operation failed and
                        restart again. */
    TRACE_ins,       /* the number of list_insert operation failed and
                        restart again. */
    TRACE_inserts,   /* the number of atomic_load operation in list_delete,
                        list_insert and __list_find. */
    TRACE_deletes,   /* the number of atomic_store operation in list_delete,
                        list_insert and __list_find. */
+   TRACE_retired_traversal
};

struct runtime_statistics {
    atomic_uint_fast64_t retry, contains, traversal, fail;
    atomic_uint_fast64_t del, ins;
    atomic_uint_fast64_t load, store;
+   atomic_uint_fast64_t retired_traversal;
};
```

```diff
void list_hp_retire(list_hp_t *hp, uintptr_t ptr)
{
    retirelist_t *rl = hp->rl[tid() * CLPAD];
    rl->list[rl->size++] = ptr;
    assert(rl->size < HP_MAX_RETIRED);

    if (rl->size < HP_THRESHOLD_R)
        return;
+   if (((unsigned) atomic_load(&hp_count) >> 2) + atomic_load(&hp_count) >
+       rl->size)
+       return;

    for (size_t iret = 0; iret < rl->size; iret++) {
        uintptr_t obj = rl->list[iret];
        bool can_delete = true;
        for (int itid = 0; itid < HP_MAX_THREADS && can_delete; itid++) {
            for (int ihp = hp->max_hps - 1; ihp >= 0; ihp--) {
+               TRACE(retired_traversal);
                if (atomic_load(&hp->hp[itid][ihp]) == obj) {
                    can_delete = false;
                    break;
                }
            }
        }

        if (can_delete) {
            size_t bytes = (rl->size - iret) * sizeof(rl->list[0]);
            memmove(&rl->list[iret], &rl->list[iret + 1], bytes);
            rl->size--;
            hp->deletefunc((void *) obj);
        }
    }
}
```

I used `clock_gettime` in `main` to roughly measure the execution time, but there is still too much interference for the numbers to be a reliable reference.

Without the threshold:

```bash
$ ./list
time=73
retry             : 185
contains          : 0
traversal         : 21064122
fail              : 0
del               : 0
ins               : 0
load              : 63275803
store             : 4096
retired_traversal : 1572864
deletes           : 4098
inserts           : 4098
```

With the threshold:

```bash
$ ./list
time=78
retry             : 157
contains          : 0
traversal         : 21006957
fail              : 0
del               : 0
ins               : 0
load              : 63104185
store             : 4096
retired_traversal : 1572864
deletes           : 4098
inserts           : 4098
```

To my surprise, the number of `retired_traversal` is exactly the same. By printing `hp_count` on each call of `list_hp_retire`, I found that `hp_count` is zero the whole time.
It seems that the readers release their hazard pointers very quickly, so by the time `list_hp_retire` is called it is already safe to release the retired pointers. This phenomenon made me think of another improvement: instead of always traversing the whole hazard pointer array, we can first check `hp_count` and release the pointers directly when it is zero.

```diff
void list_hp_retire(list_hp_t *hp, uintptr_t ptr)
{
    retirelist_t *rl = hp->rl[tid() * CLPAD];
    rl->list[rl->size++] = ptr;
    assert(rl->size < HP_MAX_RETIRED);

    if (rl->size < HP_THRESHOLD_R)
        return;
    if (((unsigned) atomic_load(&hp_count) >> 2) + atomic_load(&hp_count) >
        rl->size)
        return;

    for (size_t iret = 0; iret < rl->size; iret++) {
        uintptr_t obj = rl->list[iret];
        bool can_delete = true;
+       if (atomic_load(&hp_count))
        for (int itid = 0; itid < HP_MAX_THREADS && can_delete; itid++) {
            for (int ihp = hp->max_hps - 1; ihp >= 0; ihp--) {
                TRACE(retired_traversal);
                if (atomic_load(&hp->hp[itid][ihp]) == obj) {
                    can_delete = false;
                    break;
                }
            }
        }

        if (can_delete) {
            size_t bytes = (rl->size - iret) * sizeof(rl->list[0]);
            memmove(&rl->list[iret], &rl->list[iret + 1], bytes);
            rl->size--;
            hp->deletefunc((void *) obj);
        }
    }
}
```

```bash
$ ./list
time=86
retry             : 170
contains          : 0
traversal         : 20976870
fail              : 0
del               : 0
ins               : 0
load              : 63013973
store             : 4096
retired_traversal : 0
deletes           : 4098
inserts           : 4098
```

### Avoid false sharing

`hp_list` also optimizes for the cache. The following code places each `rl` in a different cacheline. Although x86 uses 64-byte cachelines, I guess the 128 in this program is for portability, in case some machines use 128-byte cachelines. ~~However, I am confused about the `alignas(128)` in the structure, since I think `alignas(16)` (the smallest power of two not less than `sizeof(retirelist_t)`) would be sufficient for our requirement.~~ I think the `alignas(128)` is there to place `hp` and `rl` on different cachelines with `alignas`, instead of inserting explicit padding members, which is the approach I am more familiar with.

```c
#define CLPAD (128 / sizeof(uintptr_t))

struct list_hp {
    int max_hps;
    alignas(128) atomic_uintptr_t *hp[HP_MAX_THREADS];
    alignas(128) retirelist_t *rl[HP_MAX_THREADS * CLPAD];
    list_hp_deletefunc_t *deletefunc;
};
```

After reexamining the code, I am of the opinion that the `CLPAD` in this program is superfluous. Since `rl[i]` holds the address of a `retirelist_t`, `rl[i]` itself is never modified after initialization, so the scenario in which writer threads frequently invalidate each other's cachelines through this array does not arise. If the goal is to keep the allocator from placing several `retirelist_t` objects at adjacent addresses, which would cause false sharing because they share a cacheline, it is better to make `retirelist_t` itself as large as a cacheline and align it to `sizeof(size)` + `sizeof(list)` (16 bytes). The latter is not about false sharing; it keeps one writer from having to cache two cachelines for a single `retirelist_t`.

```c
typedef struct {
    alignas(16) int size;
    uintptr_t *list;
    uint8_t padding[128 - sizeof(int) - sizeof(uintptr_t *)];
} retirelist_t;
```
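One caveat about the structure above: with `alignas(16)` on the first member plus the compiler's padding between `size` and `list`, `sizeof(retirelist_t)` can exceed 128 bytes on a typical LP64 target (4 + 4 + 8 + 116 = 132, rounded up to 144), so the object may still straddle two cachelines. A hypothetical alternative (not the `hp_list` code) is to put the alignment on the whole structure, let the compiler insert the trailing padding, and verify the intent at compile time:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Aligning the whole object to 128 bytes forces sizeof() up to a multiple
 * of 128, so one retirelist_t occupies exactly one assumed 128-byte
 * cacheline and never shares it with a neighboring object. */
typedef struct {
    alignas(128) int size;
    uintptr_t *list;
} cacheline_retirelist_t;

static_assert(sizeof(cacheline_retirelist_t) == 128,
              "should fill exactly one assumed 128-byte cacheline");
static_assert(alignof(cacheline_retirelist_t) == 128,
              "should start on a cacheline boundary");
```

Note that heap allocations would additionally need `aligned_alloc` (or a similar aligned allocator) to actually honor the 128-byte alignment at run time.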