# [concurrent-ll](https://github.com/kdnvt/concurrent-ll)

contributed by <`kdnvt`>

## Execute on arm architecture

I tried to execute the program on an arm server. The environment of the server is as follows:

```bash
$ lscpu
Architecture:        aarch64
CPU op-mode(s):      64-bit
Byte Order:          Little Endian
CPU(s):              224
On-line CPU(s) list: 0-223
Thread(s) per core:  4
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
Vendor ID:           Cavium
Model:               1
Model name:          ThunderX2 99xx
Stepping:            0x1
Frequency boost:     disabled
CPU max MHz:         2000.0000
CPU min MHz:         1000.0000
BogoMIPS:            400.00
L1d cache:           1.8 MiB
L1i cache:           1.8 MiB
L2 cache:            14 MiB
L3 cache:            64 MiB
NUMA node0 CPU(s):   0-111
NUMA node1 CPU(s):   112-223
```

Both the lock-based and the lock-free linked list failed to pass the correctness check.

### Memory order

The first thing I suspected was a memory-order issue. Since the x86 architecture uses total store ordering, it prevents many reordering errors. The arm architecture, by contrast, improves performance by leveraging its more relaxed ordering, which requires the programmer to ensure the correctness of the program explicitly.

The original version of CAS_PTR in include/atomics.h will misbehave in some situations. For example, when a new node is inserted into the linked list:

```c=
new_elem->next = right;
if (CAS_PTR(&(left->next), right, new_elem) == right) {
    FAI_U32(&(the_list->size));
    return true;
}
```

Under relaxed ordering, the store on line 1 may still sit in the write buffer after line 2 finishes, making it invisible to other threads. Therefore, I changed `__ATOMIC_RELAXED` to `__ATOMIC_SEQ_CST` in CAS_PTR. There may still be room for improvement by relaxing the memory order in this case; I haven't considered the other memory orders carefully.

The lock-free version of the linked list then passed the test. However, the lock-based linked list still failed.

### Spurious failure

By experiment, I found that CAS_PTR sometimes fails even when the comparison is equal.
When delving into the implementation of CAS_PTR, I found that the weak argument is enabled, which means `__atomic_compare_exchange` may fail "spuriously". The x86 architecture implements CAS with `cmpxchg`, so spurious failures cannot happen there. The arm architecture, on the other hand, implements it with LL/SC-style instructions, so spurious failures can happen.

I used `objdump -S` to examine the assembly code of the two variants of CAS. The strong version is as below:

```c
#define CAS_PTR(a, b, c)                                            \
    __extension__({                                                 \
        typeof(*a) _old = b, _new = c;                              \
        int tmp = __atomic_compare_exchange(a, &_old, &_new, 0,     \
                                            __ATOMIC_SEQ_CST,      \
                                            __ATOMIC_SEQ_CST);     \
        _old;                                                       \
    })

int main()
{
    int a = 0;
    while (CAS_PTR(&a, 0, 1) == 1)
        ;
}
```

```
    while(CAS_PTR(&a,0,1) == 1);
 708:   885ffc01    ldaxr   w1, [x0]
 70c:   35000061    cbnz    w1, 718 <main+0x38>
 710:   8804fc02    stlxr   w4, w2, [x0]
 714:   35ffffa4    cbnz    w4, 708 <main+0x28>
 718:   7100043f    cmp     w1, #0x1
 71c:   54ffff60    b.eq    708 <main+0x28>  // b.none
}
```

The `ldaxr` loads the value and sets the exclusive monitor, while the `stlxr` stores and clears it. The CAS fails if the exclusive monitor is cleared by someone else in between, which means the "atomic" part can fail even when the comparison is equal. The `w4` register in `stlxr` is set to 0 if the store succeeds. As we can see, the strong version of `__atomic_compare_exchange` checks the `w4` register and retries.

The weak version is as below:

```c
#define CAS_PTR(a, b, c)                                            \
    __extension__({                                                 \
        typeof(*a) _old = b, _new = c;                              \
        int tmp = __atomic_compare_exchange(a, &_old, &_new, 1,     \
                                            __ATOMIC_SEQ_CST,      \
                                            __ATOMIC_SEQ_CST);     \
        _old;                                                       \
    })

int main()
{
    int a = 0;
    while (CAS_PTR(&a, 0, 1) == 1)
        ;
}
```

```
    while(CAS_PTR(&a,0,1) == 1);
 708:   885ffc01    ldaxr   w1, [x0]
 70c:   7100003f    cmp     w1, #0x0
 710:   54000041    b.ne    718 <main+0x38>  // b.any
 714:   8804fc02    stlxr   w4, w2, [x0]
 718:   7100043f    cmp     w1, #0x1
 71c:   54ffff60    b.eq    708 <main+0x28>  // b.none
}
```

As we can see, this version doesn't check the `w4` register, so the CAS can fail spuriously.
After switching to the strong version of CAS, the lock-based linked list is able to pass the correctness test.

## Implement [Lock-Free Linked Lists and Skip Lists](http://www.cse.yorku.ca/~ruppert/papers/lfll.pdf)

[Github](https://github.com/kdnvt/concurrent-ll/tree/lockfree2)

I implemented a new version of the lock-free linked list based on [Lock-Free Linked Lists and Skip Lists](http://www.cse.yorku.ca/~ruppert/papers/lfll.pdf). In this project, I usually call it "lockfree2" or "lock-free2" to distinguish it from the first version of the lock-free linked list.

I wanted to apply the original benchmark to it to roughly measure its performance. Thus, I made a small tweak to the original benchmark and added the new commands "make bench2" and "make bench3" to the Makefile. "make bench2" measures the throughput and scalability of the lock-based versus the lock-free2 linked list, while "make bench3" measures the throughput and scalability of the lock-free (first version) versus the lock-free2 linked list.

I ran the new benchmark on the arm server, from a single thread up to 64 threads, in steps of 2 (1, 2, 4, 6, 8, ..., 62, 64). The resulting diagrams are as below: the red lines and blue points represent the throughput and scalability of the lock-free linked list, while the green lines and red points represent the throughput and scalability of the lock-free2 linked list.

Size 128:

![](https://i.imgur.com/cCFY0Dd.png)
![](https://i.imgur.com/UQTTAi4.png)
![](https://i.imgur.com/rX7liDv.png)

Size 1024:

![](https://i.imgur.com/ayoATQh.png)
![](https://i.imgur.com/I7o39Go.png)
![](https://i.imgur.com/GzHl5gD.png)

Size 8192:

![](https://i.imgur.com/ADSySKn.png)
![](https://i.imgur.com/2FUXhvL.png)
![](https://i.imgur.com/VOmuzpm.png)

As we can see, the performance of the two versions is similar when the size and update rate are small. However, with a larger size and a higher update rate, the performance of lock-free2 is usually better than that of lock-free.
Sometimes a diagram shows that the scalability of lock-free is better than that of lock-free2 even though its throughput is worse. I think the reason is that the single-thread throughput of lock-free is worse than that of lock-free2, especially when the size is large and the update rate is high.

size = 1024, update rate = 50%:

```
#        out/test-lockfree                   out/test-lockfree2
#cores   throughput  %linear  scalability    throughput  %linear  scalability
1        282446      100.00   1              423987      100.00   1
```

size = 8192, update rate = 50%:

```
#        out/test-lockfree                   out/test-lockfree2
#cores   throughput  %linear  scalability    throughput  %linear  scalability
1        12146       100.00   1              17559       100.00   1
```

Delving into the implementation details of the two versions, we can see that `list_remove` in lock-free only "logically" removes the node. Therefore, when `list_search` later encounters a "logically" removed node, it has to "physically" remove it and restart the search for the wanted node from the head of the list, which is an overhead. The lock-free2 version, on the other hand, tries to "physically" remove the node right after "logically" removing it, which saves that overhead. I think this is why lock-free2 performs better even with a single thread.

## Implement [Lock-Free Data Structures with Hazard Pointers](https://erdani.org/publications/cuj-2004-12.pdf)

[Github](https://github.com/kdnvt/hazard-pointer)

### Tutorial

#### `hp_pr_t *hp_pr_init();`

Create the protect list, which stores the hazard pointers. Every thread using the same hazard pointers should share the same `hp_pr_t`.

#### `hp_t *hp_init(hp_pr_t *pr, void (*dealloc)(void *));`

Create the `hp_t`, which comprises an existing protect list (hazard pointers) and an empty retired list. The retired list is local to each writer thread, so every writer thread should call `hp_init`. The caller can supply the deallocation function.

#### `hp_addr_t hp_pr_load(hp_pr_t *pr, void *ptr);`

Load the pointer and store it in the protect list (hazard pointers).
The `ptr` argument must be the address of the variable you want to load. The return value of type `hp_addr_t` is the address of the hazard pointer; dereference it to access the value.

#### `void hp_pr_release(hp_pr_t *pr, hp_addr_t ptr_addr);`

Release the hazard pointer. Pass the `hp_addr_t` returned by `hp_pr_load`.

**Example:**

```c
hp_addr_t ptr_addr = hp_pr_load(pr, &data);
int *ptr = atomic_load(ptr_addr);

// do something
printf("%d\n", *ptr);

hp_pr_release(pr, ptr_addr);
```

#### `void hp_retired(hp_t *hp, void *ptr);`

Add `ptr` to the retired list and try to deallocate it.

#### `int hp_scan(hp_t *hp);`

Try to release the pointers in the retired list. The return value is the number of pointers that couldn't be deallocated. Before a writer thread returns, it should call this function until no pointers remain in its retired list.

**Example:**

```c
void *writer(void *arg)
{
    void **argv = (void **) arg;
    hp_pr_t *pr = argv[0];
    hp_t *hp = hp_init(pr, free);

    /* do something */
    /* retire something */

    while (hp_scan(hp))
        ;
    free(hp);
}
```

#### `int hp_pr_size(hp_pr_t *pr);`

Return the number of hazard pointers.

#### `void hp_pr_destroy(hp_pr_t *pr);`

Deallocate the protect list in `pr` and the `hp_pr_t` itself.

#### `hp_addr_t hp_pr_load_mask(hp_pr_t *pr, void *ptr, uintptr_t mask);`

In some applications, addresses follow a certain alignment pattern, so the low bits of an address are always zero. For example, if an address is aligned to four bytes, its final two bits are always zero. Some implementations use these bits as flag bits and treat them as zero when accessing the address. To support such applications, this function clears the bits that are set to 1 in `mask` when loading.

#### `hp_pr_t *hp_get_pr(hp_t *hp);`

Return the protect list of `hp`.

#### `int get_retired_size(hp_t *hp);`

Return the size of the retired list of `hp`.
## Apply hazard pointers to lockfree2

I applied the hazard pointers to [lockfree2](https://github.com/kdnvt/concurrent-ll/tree/lockfree2), and it can now reclaim memory without triggering any warnings or errors from the address sanitizer.

I used `valgrind --tool=massif` to measure the memory usage of lockfree2 with and without hazard pointers. Here are the two diagrams generated by massif-visualizer:

Without hazard pointers:

![](https://i.imgur.com/CKFfwpr.jpg)

With hazard pointers:

![](https://i.imgur.com/Uhkd8UI.jpg)

As we can see, memory consumption without hazard pointers grows linearly, while with hazard pointers it stays below 40 KiB the whole time.