# 2017q1 team7 mergesort-concurrent contributed by <`zmke`、`petermouse`、`0xff07`> ## 使程式能接受字串輸入 參考 [0xff07](https://hackmd.io/s/H154LN6og) 的筆記,修改成可以接受字串輸入 ## 建立字串輸入的驗證機制 ```cmake check-words: sort @uniq test_data/words.txt | sort -R | tail -n $(NUM_OF_DATA) > $(TEST_DATA_FILE) @sort -d $(TEST_DATA_FILE) > $(SORTED_DATA_FILE) @./sort $(THREADS) $(TEST_DATA_FILE) | tail -n +4 > $(SORTED_RESULT) @bash scripts/compare.sh $(SORTED_DATA_FILE) $(SORTED_RESULT) ``` ## mutrace 分析 ``` mutrace: Showing statistics for process sort (pid 4217). mutrace: 3 mutexes used. Mutex #2 (0x0x2407670) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fc25fb344b2] ./sort(tqueue_init+0x46) [0x4014ca] ./sort(tpool_init+0x6a) [0x40172e] ./sort(main+0xe5) [0x401ee4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fc25f56b830] Mutex #0 (0x0x7fc25d13a860) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7fc25fb346b9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7fc25cf36fec] [(nil)] Mutex #1 (0x0x603140) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fc25fb344b2] ./sort(main+0xab) [0x401eaa] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fc25f56b830] mutrace: Showing 3 most contended mutexes: Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags 2 282836 93324 81938 29.010 0.000 0.016 Mx.--. 0 20 16 5 0.003 0.000 0.001 M-.--. 1 6 4 0 0.001 0.000 0.000 M-.--. |||||| /||||| Object: M = Mutex, W = RWLock /|||| State: x = dead, ! = inconsistent /||| Use: R = used in realtime thread /|| Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /| Mutex Protocol: i = INHERIT, p = PROTECT / RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC mutrace: Note that the flags column R is only valid in --track-rt mode! mutrace: Total runtime is 1119.388 ms. mutrace: Results for SMP with 4 processors. ``` `Locked` : mutex 鎖了幾次 `Changed` : mutex 在不同 thread 之間切換的次數 `Cont.` : thread 嘗試要求已鎖住的 mutex 的次數 `tot.Time[ms]` : mutex 鎖住的總時間 `avg.Time[ms]` : 平均鎖住 mutex 的時間 `max.Time[ms]` : 持有 mutex 的最大時間 可以發現到 `Mutex #2` 在使用上非常的頻繁!thread 之間常需要搶奪 mutex,也常因此等待 追蹤原程式碼的 `Mutex #2` 出現的位置 (`$ addr2line -e sort 0x4014ca`) 第一次存取出現在 `/mergesort-concurrent/threadpool.c:45` ```clike=44 pthread_mutex_init(&(the_queue->mutex), NULL); pthread_cond_init(&(the_queue->cond), NULL); the_queue->size = 0; the_queue->num_of_consumed = 0; ``` (每個 mutex 實際對應到程式碼會差一行,不知道為什麼) 可以發現是 threadpool 中負責保護 queue pop 與 push 的 Mutex ## lock-free threadpool 實作 讀完 [A Pragmatic Implementation of Non-Blocking Linked-Lists](https://www.cl.cam.ac.uk/research/srg/netos/papers/2001-caslists.pdf) 的演算法之後想先動手實作看看,關於演算法的解釋 [0xff07](https://hackmd.io/s/H154LN6og) 的筆記有詳細的說明 ### Compare And Swap `bool __sync_bool_compare_and_swap (type *ptr, type oldval type newval, ...)` `type __sync_val_compare_and_swap (type *ptr, type oldval type newval, ...)` 當 `ptr` 和 `oldval` 相同的時候,會把 `ptr` 的內容改寫成 `newval` ,並回傳 `oldval` [Built-in functions for atomic memory access](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html) 的解釋如下 These builtins perform an atomic compare and swap. That is, if the current value of *ptr is oldval, then write newval into *ptr. The “bool” version returns true if the comparison is successful and newval was written. The “val” version returns the contents of *ptr before the operation. ### tqueue_pop() 首先從 `tqueue_pop()` 開始著手修改,目標是把 `tqueue_pop()` 的 mutex 拿掉,參考 [hungry-birds](https://github.com/jserv/hungry-birds) 的實作,並引用 concurrent-ll 內的 `atomic_ops_if.h` 先判斷 queue 裡面是否只有一個 node (head 和 tail 相同),如果只有一個 node 就嘗試把 tail 和 head 設成 NULL ,如果不只一個 node 嘗試把 head 設成 head->next ```clike= task_t *tqueue_pop(tqueue_t * const the_queue) { task_t *ret; ret = the_queue->head; if (ret) { if(CAS_PTR(&(the_queue->tail), ret, NULL) == ret) { if(CAS_PTR(&(the_queue->head), ret, NULL) == ret) { FAD_U32(&(the_queue->size)); FAI_U32(&(the_queue->num_of_consumed)); return ret; } else { return NULL; } } else { if(ret->next) { if(CAS_PTR(&(the_queue->head), ret, ret->next) == ret) { FAD_U32(&(the_queue->size)); FAI_U32(&(the_queue->num_of_consumed)); return ret; } else { return NULL; } } else { return NULL; } } } return ret; } ``` `$ make check-words` 結果為 `Passed!` 再用 mutrace 測試一次 `$ mutrace ./sort 4 random.txt` ``` mutrace: Showing statistics for process sort (pid 15836). mutrace: 3 mutexes used. Mutex #2 (0x0x7f720a85b860) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7f720d2556b9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7f720a657fec] [(nil)] Mutex #0 (0x0xa43670) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f720d2554b2] ./sort(tqueue_init+0x46) [0x4014ca] ./sort(tpool_init+0x6a) [0x40173d] ./sort(main+0xe5) [0x401ef3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f720cc8c830] Mutex #1 (0x0x603140) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f720d2554b2] ./sort(main+0xab) [0x401eb9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f720cc8c830] mutrace: Showing 3 most contended mutexes: Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags 2 20 17 6 0.003 0.000 0.000 M-.--. 0 13 9 1 0.004 0.000 0.001 Mx.--. 1 6 5 0 0.002 0.000 0.000 M-.--. |||||| /||||| Object: M = Mutex, W = RWLock /|||| State: x = dead, ! = inconsistent /||| Use: R = used in realtime thread /|| Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /| Mutex Protocol: i = INHERIT, p = PROTECT / RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC mutrace: Note that the flags column R is only valid in --track-rt mode! mutrace: Total runtime is 1057.670 ms. mutrace: Results for SMP with 4 processors. ``` 發現只有更改 `tqueue_pop()` 就讓 `Mutex #0` (從 address 可以看出就是前面的 `Mutex #2` )的使用率大幅下降 ### tqueue_push() `tqueue_push()` 一開始不知道要怎麼改變 `tqueue_tail->next` 並改變 `tqueue_tail` 之值,最後也是參考到 [hungry-birds](https://github.com/jserv/hungry-birds) 的寫法,才有些想法 其實兩個的操作並不是不行切割,而是可以先選擇改變 `tail` 之值,確定成功之後,取得舊的 `tail` ,再將舊的 `tail` node 指向新的 `tail` 。 ```clike= int tqueue_push(tqueue_t * const the_queue, task_t *task) { task_t *tail = NULL, *head = NULL, *tail_next = NULL; task->next = NULL; do { tail = the_queue->tail; if (CAS_PTR(&(the_queue->tail), tail, task) == tail) break; } while(1); if (tail) { do { tail_next = tail->next; if (CAS_PTR(&(tail->next), tail_next, task) == tail_next) break; } while(1); } else { do { head = the_queue->head; if (CAS_PTR(&(the_queue->head), head, task) == head) break; } while(1); } FAI_U32(&(the_queue->size)); return 0; } ``` ## 整合後發生的問題 我們發現程式碼大多數時候可以成功執行,但有時候 thread 沒有接收到中止的 task 使 thread 中止運作,導致進入無窮迴圈 之後讓程式重複執行500次,發現中間出現 core dumped,用`$ addr2line -e ./sort` 找到 `threadpool.c` 的 `tqueue_free` `free(cur);` 可能發生重複 free 同一塊記憶體,或是 free 到沒有被 malloc 的記憶體 ``` *** Error in `./sort': double free or corruption (fasttop): 0x00007f5f000009c0 *** ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f5f091987e5] /lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f5f091a0e0a] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f5f091a498c] ./sort(tqueue_free+0x41)[0x4016cb] ./sort(tpool_free+0x65)[0x401841] ./sort(main+0x132)[0x401f5f] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f5f09141830] ./sort(_start+0x29)[0x4011a9] ======= Memory map: ======== 00400000-00403000 r-xp 00000000 08:0b 525385 /home/ke/mergesort-concurrent/sort 00602000-00603000 r--p 00002000 08:0b 525385 /home/ke/mergesort-concurrent/sort 00603000-00604000 rw-p 00003000 08:0b 525385 /home/ke/mergesort-concurrent/sort 00ce9000-0224f000 rw-p 00000000 00:00 0 [heap] 7f5ef0000000-7f5ef0021000 rw-p 00000000 00:00 0 7f5ef0021000-7f5ef4000000 ---p 00000000 00:00 0 7f5ef8000000-7f5ef8021000 rw-p 00000000 00:00 0 7f5ef8021000-7f5efc000000 ---p 00000000 00:00 0 7f5efc000000-7f5efc021000 rw-p 00000000 00:00 0 7f5efc021000-7f5f00000000 ---p 00000000 00:00 0 7f5f00000000-7f5f00021000 rw-p 00000000 00:00 0 7f5f00021000-7f5f04000000 ---p 00000000 00:00 0 7f5f06f07000-7f5f06f1d000 r-xp 00000000 08:0a 136909 /lib/x86_64-linux-gnu/libgcc_s.so.1 7f5f06f1d000-7f5f0711c000 ---p 00016000 08:0a 136909 /lib/x86_64-linux-gnu/libgcc_s.so.1 7f5f0711c000-7f5f0711d000 rw-p 00015000 08:0a 136909 /lib/x86_64-linux-gnu/libgcc_s.so.1 7f5f0711d000-7f5f0711e000 ---p 00000000 00:00 0 7f5f0711e000-7f5f0791e000 rw-p 00000000 00:00 0 7f5f0791e000-7f5f0791f000 ---p 00000000 00:00 0 7f5f0791f000-7f5f0811f000 rw-p 00000000 00:00 0 7f5f0811f000-7f5f08120000 ---p 00000000 00:00 0 7f5f08120000-7f5f08920000 rw-p 00000000 00:00 0 7f5f08920000-7f5f08921000 ---p 00000000 00:00 0 7f5f08921000-7f5f09121000 rw-p 00000000 00:00 0 7f5f09121000-7f5f092e0000 r-xp 00000000 08:0a 148362 /lib/x86_64-linux-gnu/libc-2.23.so 7f5f092e0000-7f5f094e0000 ---p 001bf000 08:0a 148362 /lib/x86_64-linux-gnu/libc-2.23.so 7f5f094e0000-7f5f094e4000 r--p 001bf000 08:0a 148362 /lib/x86_64-linux-gnu/libc-2.23.so 7f5f094e4000-7f5f094e6000 rw-p 001c3000 08:0a 148362 /lib/x86_64-linux-gnu/libc-2.23.so 7f5f094e6000-7f5f094ea000 rw-p 00000000 00:00 0 7f5f094ea000-7f5f09502000 r-xp 00000000 08:0a 149114 /lib/x86_64-linux-gnu/libpthread-2.23.so 7f5f09502000-7f5f09701000 ---p 00018000 08:0a 149114 /lib/x86_64-linux-gnu/libpthread-2.23.so 7f5f09701000-7f5f09702000 r--p 00017000 08:0a 149114 /lib/x86_64-linux-gnu/libpthread-2.23.so 7f5f09702000-7f5f09703000 rw-p 00018000 08:0a 149114 /lib/x86_64-linux-gnu/libpthread-2.23.so 7f5f09703000-7f5f09707000 rw-p 00000000 00:00 0 7f5f09707000-7f5f0972d000 r-xp 00000000 08:0a 148365 /lib/x86_64-linux-gnu/ld-2.23.so 7f5f09908000-7f5f0990b000 rw-p 00000000 00:00 0 7f5f09929000-7f5f0992c000 rw-p 00000000 00:00 0 7f5f0992c000-7f5f0992d000 r--p 00025000 08:0a 148365 /lib/x86_64-linux-gnu/ld-2.23.so 7f5f0992d000-7f5f0992e000 rw-p 00026000 08:0a 148365 /lib/x86_64-linux-gnu/ld-2.23.so 7f5f0992e000-7f5f0992f000 rw-p 00000000 00:00 0 7fff9c1a7000-7fff9c1c8000 rw-p 00000000 00:00 0 [stack] 7fff9c1e0000-7fff9c1e2000 r--p 00000000 00:00 0 [vvar] 7fff9c1e2000-7fff9c1e4000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] 已經終止 (core dumped) ``` ## week 10 改進方向 之前的程式還有很多有待改進的空間 * 程式的正確性 * 之前的程式是錯的!!!!! * 也許我們也忘記了 ABA 問題 * 驗證 lock-free 的方式有誤 * 我們一開始只驗證 lock-free 的 push 或 pop 其中一邊 , 另一邊維持著 mutex , 這樣反而造成了 race condition * memory leak * 現在不建議使用 `__sync_val_compare_and_swap` 了 * 應改用 [C11 `_Atomic`](http://en.cppreference.com/w/c/atomic) 老師提到的改善方法 * 可以參考 [LockLess](https://github.com/DeYangLiu/LockLess) 與 [ring](https://github.com/shramov/ring) 的作法 * 提供 benchmark 程式 ## 參考 lock-free queue 實作方法 這邊我們參考了 [Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms](https://www.research.ibm.com/people/m/michael/podc-1996.pdf) 論文,看到很多 lock-free queue 程式碼都建立在這篇論文的基礎上實作,像是 [Boost](https://github.com/boostorg/lockfree/blob/develop/include/boost/lockfree/queue.hpp) 也參考了這篇實作出 lock-free queue ```clike structure pointer_t { ptr: pointer to node_t, count: unsigned integer } structure node_t { value: data type, next: pointer_t } structure queue_t { Head: pointer t, Tail: pointer_t } ``` 原文中有個中介的資料結構叫做 `pointer_t` ,這個資料結構去指向真正的 node (`node_t`) 以及存放一個變數名稱 `count` 代表 reference counting 的次數,可以用來避免 ABA 問題 ```clike initialize(Q: pointer to queue_t) node = new_node() // Allocate a free node node->next.ptr = NULL // Make it the only node in the linked list Q->Head.ptr = Q->Tail.ptr = node // Both Head and Tail point to it ``` 比較特別的是, 一開始初始化 lock-free queue 就會新建一個 node ,這個 node 並沒有任何的作用, head 與 tail 會間接指向這個 node , lock-free queue 至少有一個 node 的存在。 ```clike enqueue(Q: pointer to queue_t, value: data type) E1: node = new_node() // Allocate a new node from the free list E2: node->value = value // Copy enqueued value into node E3: node->next.ptr = NULL // Set next pointer of node to NULL E4: loop // Keep trying until Enqueue is done E5: tail = Q->Tail // Read Tail.ptr and Tail.count together E6: next = tail.ptr->next // Read next ptr and count fields together E7: if tail == Q->Tail // Are tail and next consistent? // Was Tail pointing to the last node? E8: if next.ptr == NULL // Try to link node at the end of the linked list E9: if CAS(&tail.ptr->next, next, <node, next.count+1>) E10: break // Enqueue is done. Exit loop E11: endif E12: else // Tail was not pointing to the last node // Try to swing Tail to the next node E13: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>) E14: endif E15: endif E16: endloop // Enqueue is done. Try to swing Tail to the inserted node E17: CAS(&Q->Tail, tail, <node, tail.count+1>) ``` push 時去確保 [E7] `Q->tail` 還是 `Q->tail` [E8] `Q->tail` 的 `next.ptr` 為 `NULL` 再去嘗試做 [E9] CAS , 失敗重來 特別的是 [E8] 失敗時,也嘗試去更新 `Q->tail` 為 `Q->tail->next.ptr`,失敗就算了沒關係(好隨便啊!) [E17] 試著改變 tail 之值,失敗也沒關係,代表已經被其他人更新 ```clike dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean D1: loop // Keep trying until Dequeue is done D2: head = Q->Head // Read Head D3: tail = Q->Tail // Read Tail D4: next = head.ptr->next // Read Head.ptr->next D5: if head == Q->Head // Are head, tail, and next consistent? D6: if head.ptr == tail.ptr // Is queue empty or Tail falling behind? D7: if next.ptr == NULL // Is queue empty? D8: return FALSE // Queue is empty, couldn't dequeue D9: endif // Tail is falling behind. Try to advance it D10: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>) D11: else // No need to deal with Tail // Read value before CAS // Otherwise, another dequeue might free the next node D12: *pvalue = next.ptr->value // Try to swing Head to the next node D13: if CAS(&Q->Head, head, <next.ptr, head.count+1>) D14: break // Dequeue is done. Exit loop D15: endif D16: endif D17: endif D18: endloop D19: free(head.ptr) // It is safe now to free the old node D20: return TRUE // Queue was not empty, dequeue succeeded ``` 先前提到 queue 會有一個 node , head 會指向這個 dummy node ,所以 pop 會先去抓取 dummy node 的下一個 node, 釋放指向 dummy node 的 pointer (`ptr_t`) , 再去將取得的 node 作為 dummy node ,交給下一次 pop 的釋放 `ptr_t` 確保 [D5] `Q->head` 還是原本的 `Q->head` [D6] `Q->head.ptr` 若等於先前抓到的 `Q->tail.ptr` [D7],代表 queue 可能是空的, 這時進一步去檢查 `tail->next.ptr` 有值也嘗試更新 `Q->tail` ,失敗也是沒關係算了 然而如果 [D6] 是錯的 [D11] ,把值取出,並改變 head, head 可能改變失敗代表其他 thread 已經改變 head ,一切重新來過 ## 實作 [Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms](https://www.research.ibm.com/people/m/michael/podc-1996.pdf) 這邊我們使用 C11 `atomic_compare_exchange_weak` 來替換 `__sync_val_compare_and_swap` ,與論文不同的是我們並沒有使用中介結構來操作 在 `tqueue_init()` 進行初始化的時候先新增一個 `dummy_node` ,並讓 `head` 和 `tail` 指向這個 node ``` clike= task_t *dummy_node = task_new(dummy_function, NULL); dummy_node->next = NULL; the_queue->head = dummy_node; the_queue->tail = dummy_node; ``` ### tqueue_pop() ``` clike= task_t *tqueue_pop(tqueue_t * const the_queue) { task_t *head = NULL, *tail = NULL, *next = NULL, *ret = NULL; while(1) { head = the_queue->head; tail = the_queue->tail; next = head->next; if(head == the_queue->head) { if(head == tail) { if(next == NULL) { return NULL; } atomic_compare_exchange_weak(&(the_queue->tail), &tail, next); } else { ret = next; if(atomic_compare_exchange_weak(&(the_queue->head), &head, next)) { break; } } } } atomic_fetch_sub(&(the_queue->size), 1); atomic_fetch_add(&(the_queue->num_of_consumed), 1); return ret; } ``` ### tqueue_push() ``` clike= int tqueue_push(tqueue_t * const the_queue, task_t *task) { task_t *tail = NULL, *tail_next = NULL; task->next = NULL; while(1) { tail = the_queue->tail; tail_next = tail->next; if(tail == the_queue->tail) { if(tail_next == NULL) { if(atomic_compare_exchange_weak(&(tail->next), &tail_next, task)) { if(atomic_compare_exchange_weak(&(the_queue->tail), &tail, task)) { } break; } } else { if(atomic_compare_exchange_weak(&(the_queue->tail), &tail, tail_next)) { } } } } atomic_fetch_add(&(the_queue->size), 1); return 0; } ``` ### 整合結果 整合後發現最後一個 task 無法正確被 pop ,導致程式無法終止,把原本 `task_run()` 內的 `tqueue_push(pool->queue, _task);` 改成 `tqueue_push(pool->queue, task_new(NULL, NULL));` 就能正確執行了,不知道原因為何 `$ mutrace ./sort 4 random.txt` ``` mutrace: Showing statistics for process sort (pid 5998). mutrace: 3 mutexes used. Mutex #1 (0x0x7f71cf274860) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7f71d1c6e6b9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7f71cf070fec] [(nil)] Mutex #2 (0x0x603140) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f71d1c6e4b2] ./sort(main+0xab) [0x4020bb] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f71d16a5830] mutrace: Showing 2 most contended mutexes: Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags 1 20 14 7 0.003 0.000 0.001 M-.--. 2 6 5 0 0.003 0.000 0.002 M-.--. |||||| /||||| Object: M = Mutex, W = RWLock /|||| State: x = dead, ! = inconsistent /||| Use: R = used in realtime thread /|| Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /| Mutex Protocol: i = INHERIT, p = PROTECT / RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC mutrace: Note that the flags column R is only valid in --track-rt mode! mutrace: Total runtime is 1110.494 ms. mutrace: Results for SMP with 4 processors. ``` 原本的 mutex 已經消失了 ## 存在的問題 * memory leak 根據原本的寫法,我們會 pop 每一個 task 並在 task 做完後 free task,但是最一開始的 dummy node 並沒有 free 掉且再也無法取得 如果 threadpool 被設計成可以中途停止的話,剩下在 task queue 中的 task 也無法被釋放記憶體 * ABA 問題 這個問題我們目前還沒有遇到,但是有可能發生,還不曉得要怎麼解決,可能的解決方式有: * [hazard pointer](https://en.wikipedia.org/wiki/Hazard_pointer) * [reference counting](https://en.wikipedia.org/wiki/Reference_counting) * **效率比原本的還差** pop 與 push 做的事情相當單純,反而 lock-free queue 要不斷地去嘗試 pop 與 push ,過度競爭的狀況使得效益上不減反增,我們對於 lock-free 的結構想得太簡化了,也許要回到每個 thread 有自己的 workqueue 的形式並搭配 ondition variable 等來實現有效率的 lock-free thread pool ``` 52.08% sort sort [.] tqueue_pop 16.94% sort sort [.] task_run 4.71% sort sort [.] split_n_merge 4.39% sort sort [.] sort_n_merge 4.08% sort [kernel.kallsyms] [k] queue_work_on 3.45% sort libc-2.23.so [.] __strcmp_sse2_unaligned 3.45% sort [kernel.kallsyms] [k] tty_ldisc_deref 3.14% sort [kernel.kallsyms] [.] native_irq_return_iret 2.82% sort sort [.] list_get 2.51% sort [kernel.kallsyms] [k] update_blocked_averages 2.20% sort libc-2.23.so [.] vfprintf ``` ## 參考資料 1. [A Pragmatic Implementation of Non-Blocking Linked-Lists](https://www.cl.cam.ac.uk/research/srg/netos/papers/2001-caslists.pdf) 2. [0xff07 筆記](https://hackmd.io/s/H154LN6og) 3. [hungry-birds](https://github.com/jserv/hungry-birds/blob/master/queue.c) 4. [Built-in functions for atomic memory access](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html) 5. [Compare-and-swap](https://en.wikipedia.org/wiki/Compare-and-swap) 6. [Acquire and Release Semantics](http://preshing.com/20120913/acquire-and-release-semantics/) 7. [Toward Concurrency](https://hackmd.io/s/H10MXXoT) 8. [Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms](https://www.research.ibm.com/people/m/michael/podc-1996.pdf) 9. [Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms ppt](http://web.cecs.pdx.edu/~walpole/class/cs510/winter2008/slides/4a.pdf) 10. [intro-to-lock-free-wait-free-and-aba](http://www.hergert.me/blog/2009/12/25/intro-to-lock-free-wait-free-and-aba.html) 11. [Boost lockfree queue](https://github.com/boostorg/lockfree/blob/develop/include/boost/lockfree/queue.hpp) 12. [Lock Free Concurrent Data Structures, CAS and the ABA Problem ](http://users.minet.uni-jena.de/~nwk/LockFree.pdf) 13. [lfqueue](https://github.com/mitghi/lfqueue)