# 2017q1 team7 mergesort-concurrent
contributed by <`zmke`、`petermouse`、`0xff07`>
## 使程式能接受字串輸入
參考 [0xff07](https://hackmd.io/s/H154LN6og) 的筆記,修改成可以接受字串輸入
## 建立字串輸入的驗證機制
```cmake
check-words: sort
@uniq test_data/words.txt | sort -R | tail -n $(NUM_OF_DATA) > $(TEST_DATA_FILE)
@sort -d $(TEST_DATA_FILE) > $(SORTED_DATA_FILE)
@./sort $(THREADS) $(TEST_DATA_FILE) | tail -n +4 > $(SORTED_RESULT)
@bash scripts/compare.sh $(SORTED_DATA_FILE) $(SORTED_RESULT)
```
## mutrace 分析
```
mutrace: Showing statistics for process sort (pid 4217).
mutrace: 3 mutexes used.
Mutex #2 (0x0x2407670) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fc25fb344b2]
./sort(tqueue_init+0x46) [0x4014ca]
./sort(tpool_init+0x6a) [0x40172e]
./sort(main+0xe5) [0x401ee4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fc25f56b830]
Mutex #0 (0x0x7fc25d13a860) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7fc25fb346b9]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7fc25cf36fec]
[(nil)]
Mutex #1 (0x0x603140) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fc25fb344b2]
./sort(main+0xab) [0x401eaa]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fc25f56b830]
mutrace: Showing 3 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
2 282836 93324 81938 29.010 0.000 0.016 Mx.--.
0 20 16 5 0.003 0.000 0.001 M-.--.
1 6 4 0 0.001 0.000 0.000 M-.--.
||||||
/|||||
Object: M = Mutex, W = RWLock /||||
State: x = dead, ! = inconsistent /|||
Use: R = used in realtime thread /||
Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
Mutex Protocol: i = INHERIT, p = PROTECT /
RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC
mutrace: Note that the flags column R is only valid in --track-rt mode!
mutrace: Total runtime is 1119.388 ms.
mutrace: Results for SMP with 4 processors.
```
`Locked` : mutex 鎖了幾次
`Changed` : mutex 在不同 thread 之間切換的次數
`Cont.` : thread 嘗試要求已鎖住的 mutex 的次數
`tot.Time[ms]` : mutex 鎖住的總時間
`avg.Time[ms]` : 平均鎖住 mutex 的時間
`max.Time[ms]` : 持有 mutex 的最大時間
可以發現到 `Mutex #2` 在使用上非常的頻繁!thread 之間常需要搶奪 mutex,也常因此等待
追蹤原程式碼的 `Mutex #2` 出現的位置 (`$ addr2line -e sort 0x4014ca`)
第一次存取出現在 `/mergesort-concurrent/threadpool.c:45`
```clike=44
pthread_mutex_init(&(the_queue->mutex), NULL);
pthread_cond_init(&(the_queue->cond), NULL);
the_queue->size = 0;
the_queue->num_of_consumed = 0;
```
(每個 mutex 實際對應到程式碼會差一行,不知道為什麼)
可以發現是 threadpool 中負責保護 queue pop 與 push 的 Mutex
## lock-free threadpool 實作
讀完 [A Pragmatic Implementation of Non-Blocking Linked-Lists](https://www.cl.cam.ac.uk/research/srg/netos/papers/2001-caslists.pdf) 的演算法之後想先動手實作看看,關於演算法的解釋 [0xff07](https://hackmd.io/s/H154LN6og) 的筆記有詳細的說明
### Compare And Swap
`bool __sync_bool_compare_and_swap (type *ptr, type oldval type newval, ...)`
`type __sync_val_compare_and_swap (type *ptr, type oldval type newval, ...)`
當 `ptr` 和 `oldval` 相同的時候,會把 `ptr` 的內容改寫成 `newval` ,並回傳 `oldval`
[Built-in functions for atomic memory access](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html) 的解釋如下
These builtins perform an atomic compare and swap. That is, if the current value of *ptr is oldval, then write newval into *ptr.
The “bool” version returns true if the comparison is successful and newval was written. The “val” version returns the contents of *ptr before the operation.
### tqueue_pop()
首先從 `tqueue_pop()` 開始著手修改,目標是把 `tqueue_pop()` 的 mutex 拿掉,參考 [hungry-birds](https://github.com/jserv/hungry-birds) 的實作,並引用 concurrent-ll 內的 `atomic_ops_if.h`
先判斷 queue 裡面是否只有一個 node (head 和 tail 相同),如果只有一個 node 就嘗試把 tail 和 head 設成 NULL ,如果不只一個 node 嘗試把 head 設成 head->next
```clike=
task_t *tqueue_pop(tqueue_t * const the_queue)
{
task_t *ret;
ret = the_queue->head;
if (ret) {
if(CAS_PTR(&(the_queue->tail), ret, NULL) == ret) {
if(CAS_PTR(&(the_queue->head), ret, NULL) == ret) {
FAD_U32(&(the_queue->size));
FAI_U32(&(the_queue->num_of_consumed));
return ret;
} else {
return NULL;
}
} else {
if(ret->next) {
if(CAS_PTR(&(the_queue->head), ret, ret->next) == ret) {
FAD_U32(&(the_queue->size));
FAI_U32(&(the_queue->num_of_consumed));
return ret;
} else {
return NULL;
}
} else {
return NULL;
}
}
}
return ret;
}
```
`$ make check-words` 結果為 `Passed!`
再用 mutrace 測試一次
`$ mutrace ./sort 4 random.txt`
```
mutrace: Showing statistics for process sort (pid 15836).
mutrace: 3 mutexes used.
Mutex #2 (0x0x7f720a85b860) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7f720d2556b9]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7f720a657fec]
[(nil)]
Mutex #0 (0x0xa43670) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f720d2554b2]
./sort(tqueue_init+0x46) [0x4014ca]
./sort(tpool_init+0x6a) [0x40173d]
./sort(main+0xe5) [0x401ef3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f720cc8c830]
Mutex #1 (0x0x603140) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f720d2554b2]
./sort(main+0xab) [0x401eb9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f720cc8c830]
mutrace: Showing 3 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
2 20 17 6 0.003 0.000 0.000 M-.--.
0 13 9 1 0.004 0.000 0.001 Mx.--.
1 6 5 0 0.002 0.000 0.000 M-.--.
||||||
/|||||
Object: M = Mutex, W = RWLock /||||
State: x = dead, ! = inconsistent /|||
Use: R = used in realtime thread /||
Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
Mutex Protocol: i = INHERIT, p = PROTECT /
RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC
mutrace: Note that the flags column R is only valid in --track-rt mode!
mutrace: Total runtime is 1057.670 ms.
mutrace: Results for SMP with 4 processors.
```
發現只有更改 `tqueue_pop()` 就讓 `Mutex #0` (從 address 可以看出就是前面的 `Mutex #2` )的使用率大幅下降
### tqueue_push()
`tqueue_push()` 一開始不知道要怎麼改變 `tqueue_tail->next` 並改變 `tqueue_tail` 之值,最後也是參考到 [hungry-birds](https://github.com/jserv/hungry-birds) 的寫法,才有些想法
其實兩個的操作並不是不行切割,而是可以先選擇改變 `tail` 之值,確定成功之後,取得舊的 `tail` ,再將舊的 `tail` node 指向新的 `tail` 。
```clike=
int tqueue_push(tqueue_t * const the_queue, task_t *task)
{
task_t *tail = NULL, *head = NULL, *tail_next = NULL;
task->next = NULL;
do {
tail = the_queue->tail;
if (CAS_PTR(&(the_queue->tail), tail, task) == tail)
break;
} while(1);
if (tail) {
do {
tail_next = tail->next;
if (CAS_PTR(&(tail->next), tail_next, task) == tail_next)
break;
} while(1);
} else {
do {
head = the_queue->head;
if (CAS_PTR(&(the_queue->head), head, task) == head)
break;
} while(1);
}
FAI_U32(&(the_queue->size));
return 0;
}
```
## 整合後發生的問題
我們發現程式碼大多數時候可以成功執行,但有時候 thread 沒有接收到中止的 task 使 thread 中止運作,導致進入無窮迴圈
之後讓程式重複執行500次,發現中間出現 core dumped,用`$ addr2line -e ./sort` 找到
`threadpool.c` 的 `tqueue_free`
`free(cur);`
可能發生重複 free 同一塊記憶體,或是 free 到沒有被 malloc 的記憶體
```
*** Error in `./sort': double free or corruption (fasttop): 0x00007f5f000009c0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f5f091987e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f5f091a0e0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f5f091a498c]
./sort(tqueue_free+0x41)[0x4016cb]
./sort(tpool_free+0x65)[0x401841]
./sort(main+0x132)[0x401f5f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f5f09141830]
./sort(_start+0x29)[0x4011a9]
======= Memory map: ========
00400000-00403000 r-xp 00000000 08:0b 525385 /home/ke/mergesort-concurrent/sort
00602000-00603000 r--p 00002000 08:0b 525385 /home/ke/mergesort-concurrent/sort
00603000-00604000 rw-p 00003000 08:0b 525385 /home/ke/mergesort-concurrent/sort
00ce9000-0224f000 rw-p 00000000 00:00 0 [heap]
7f5ef0000000-7f5ef0021000 rw-p 00000000 00:00 0
7f5ef0021000-7f5ef4000000 ---p 00000000 00:00 0
7f5ef8000000-7f5ef8021000 rw-p 00000000 00:00 0
7f5ef8021000-7f5efc000000 ---p 00000000 00:00 0
7f5efc000000-7f5efc021000 rw-p 00000000 00:00 0
7f5efc021000-7f5f00000000 ---p 00000000 00:00 0
7f5f00000000-7f5f00021000 rw-p 00000000 00:00 0
7f5f00021000-7f5f04000000 ---p 00000000 00:00 0
7f5f06f07000-7f5f06f1d000 r-xp 00000000 08:0a 136909 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f5f06f1d000-7f5f0711c000 ---p 00016000 08:0a 136909 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f5f0711c000-7f5f0711d000 rw-p 00015000 08:0a 136909 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f5f0711d000-7f5f0711e000 ---p 00000000 00:00 0
7f5f0711e000-7f5f0791e000 rw-p 00000000 00:00 0
7f5f0791e000-7f5f0791f000 ---p 00000000 00:00 0
7f5f0791f000-7f5f0811f000 rw-p 00000000 00:00 0
7f5f0811f000-7f5f08120000 ---p 00000000 00:00 0
7f5f08120000-7f5f08920000 rw-p 00000000 00:00 0
7f5f08920000-7f5f08921000 ---p 00000000 00:00 0
7f5f08921000-7f5f09121000 rw-p 00000000 00:00 0
7f5f09121000-7f5f092e0000 r-xp 00000000 08:0a 148362 /lib/x86_64-linux-gnu/libc-2.23.so
7f5f092e0000-7f5f094e0000 ---p 001bf000 08:0a 148362 /lib/x86_64-linux-gnu/libc-2.23.so
7f5f094e0000-7f5f094e4000 r--p 001bf000 08:0a 148362 /lib/x86_64-linux-gnu/libc-2.23.so
7f5f094e4000-7f5f094e6000 rw-p 001c3000 08:0a 148362 /lib/x86_64-linux-gnu/libc-2.23.so
7f5f094e6000-7f5f094ea000 rw-p 00000000 00:00 0
7f5f094ea000-7f5f09502000 r-xp 00000000 08:0a 149114 /lib/x86_64-linux-gnu/libpthread-2.23.so
7f5f09502000-7f5f09701000 ---p 00018000 08:0a 149114 /lib/x86_64-linux-gnu/libpthread-2.23.so
7f5f09701000-7f5f09702000 r--p 00017000 08:0a 149114 /lib/x86_64-linux-gnu/libpthread-2.23.so
7f5f09702000-7f5f09703000 rw-p 00018000 08:0a 149114 /lib/x86_64-linux-gnu/libpthread-2.23.so
7f5f09703000-7f5f09707000 rw-p 00000000 00:00 0
7f5f09707000-7f5f0972d000 r-xp 00000000 08:0a 148365 /lib/x86_64-linux-gnu/ld-2.23.so
7f5f09908000-7f5f0990b000 rw-p 00000000 00:00 0
7f5f09929000-7f5f0992c000 rw-p 00000000 00:00 0
7f5f0992c000-7f5f0992d000 r--p 00025000 08:0a 148365 /lib/x86_64-linux-gnu/ld-2.23.so
7f5f0992d000-7f5f0992e000 rw-p 00026000 08:0a 148365 /lib/x86_64-linux-gnu/ld-2.23.so
7f5f0992e000-7f5f0992f000 rw-p 00000000 00:00 0
7fff9c1a7000-7fff9c1c8000 rw-p 00000000 00:00 0 [stack]
7fff9c1e0000-7fff9c1e2000 r--p 00000000 00:00 0 [vvar]
7fff9c1e2000-7fff9c1e4000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
已經終止 (core dumped)
```
## week 10 改進方向
之前的程式還有很多有待改進的空間
* 程式的正確性
* 之前的程式是錯的!!!!!
* 也許我們也忘記了 ABA 問題
* 驗證 lock-free 的方式有誤
* 我們一開始只驗證 lock-free 的 push 或 pop 其中一邊 , 另一邊維持著 mutex , 這樣反而造成了 race condition
* memory leak
* 現在不建議使用 `__sync_val_compare_and_swap` 了
* 應改用 [C11 `_Atomic`](http://en.cppreference.com/w/c/atomic)
老師提到的改善方法
* 可以參考 [LockLess](https://github.com/DeYangLiu/LockLess) 與 [ring](https://github.com/shramov/ring) 的作法
* 提供 benchmark 程式
## 參考 lock-free queue 實作方法
這邊我們參考了 [Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms](https://www.research.ibm.com/people/m/michael/podc-1996.pdf) 論文,看到很多 lock-free queue 程式碼都建立在這篇論文的基礎上實作,像是 [Boost](https://github.com/boostorg/lockfree/blob/develop/include/boost/lockfree/queue.hpp) 也參考了這篇實作出 lock-free queue
```clike
structure pointer_t { ptr: pointer to node_t, count: unsigned integer }
structure node_t { value: data type, next: pointer_t }
structure queue_t { Head: pointer t, Tail: pointer_t }
```
原文中有個中介的資料結構叫做 `pointer_t` ,這個資料結構去指向真正的 node (`node_t`) 以及存放一個變數名稱 `count` 代表 reference counting 的次數,可以用來避免 ABA 問題
```clike
initialize(Q: pointer to queue_t)
node = new_node() // Allocate a free node
node->next.ptr = NULL // Make it the only node in the linked list
Q->Head.ptr = Q->Tail.ptr = node // Both Head and Tail point to it
```
比較特別的是, 一開始初始化 lock-free queue 就會新建一個 node ,這個 node 並沒有任何的作用, head 與 tail 會間接指向這個 node , lock-free queue 至少有一個 node 的存在。
```clike
enqueue(Q: pointer to queue_t, value: data type)
E1: node = new_node() // Allocate a new node from the free list
E2: node->value = value // Copy enqueued value into node
E3: node->next.ptr = NULL // Set next pointer of node to NULL
E4: loop // Keep trying until Enqueue is done
E5: tail = Q->Tail // Read Tail.ptr and Tail.count together
E6: next = tail.ptr->next // Read next ptr and count fields together
E7: if tail == Q->Tail // Are tail and next consistent?
// Was Tail pointing to the last node?
E8: if next.ptr == NULL
// Try to link node at the end of the linked list
E9: if CAS(&tail.ptr->next, next, <node, next.count+1>)
E10: break // Enqueue is done. Exit loop
E11: endif
E12: else // Tail was not pointing to the last node
// Try to swing Tail to the next node
E13: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
E14: endif
E15: endif
E16: endloop
// Enqueue is done. Try to swing Tail to the inserted node
E17: CAS(&Q->Tail, tail, <node, tail.count+1>)
```
push 時去確保
[E7] `Q->tail` 還是 `Q->tail`
[E8] `Q->tail` 的 `next.ptr` 為 `NULL`
再去嘗試做 [E9] CAS , 失敗重來
特別的是 [E8] 失敗時,也嘗試去更新 `Q->tail` 為 `Q->tail->next.ptr`,失敗就算了沒關係(好隨便啊!)
[E17] 試著改變 tail 之值,失敗也沒關係,代表已經被其他人更新
```clike
dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
D1: loop // Keep trying until Dequeue is done
D2: head = Q->Head // Read Head
D3: tail = Q->Tail // Read Tail
D4: next = head.ptr->next // Read Head.ptr->next
D5: if head == Q->Head // Are head, tail, and next consistent?
D6: if head.ptr == tail.ptr // Is queue empty or Tail falling behind?
D7: if next.ptr == NULL // Is queue empty?
D8: return FALSE // Queue is empty, couldn't dequeue
D9: endif
// Tail is falling behind. Try to advance it
D10: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
D11: else // No need to deal with Tail
// Read value before CAS
// Otherwise, another dequeue might free the next node
D12: *pvalue = next.ptr->value
// Try to swing Head to the next node
D13: if CAS(&Q->Head, head, <next.ptr, head.count+1>)
D14: break // Dequeue is done. Exit loop
D15: endif
D16: endif
D17: endif
D18: endloop
D19: free(head.ptr) // It is safe now to free the old node
D20: return TRUE // Queue was not empty, dequeue succeeded
```
先前提到 queue 會有一個 node , head 會指向這個 dummy node ,所以 pop 會先去抓取 dummy node 的下一個 node, 釋放指向 dummy node 的 pointer (`ptr_t`) , 再去將取得的 node 作為 dummy node ,交給下一次 pop 的釋放 `ptr_t`
確保 [D5] `Q->head` 還是原本的 `Q->head`
[D6] `Q->head.ptr` 若等於先前抓到的 `Q->tail.ptr` [D7],代表 queue 可能是空的, 這時進一步去檢查 `tail->next.ptr` 有值也嘗試更新 `Q->tail` ,失敗也是沒關係算了
然而如果 [D6] 是錯的 [D11] ,把值取出,並改變 head, head 可能改變失敗代表其他 thread 已經改變 head ,一切重新來過
## 實作 [Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms](https://www.research.ibm.com/people/m/michael/podc-1996.pdf)
這邊我們使用 C11 `atomic_compare_exchange_weak` 來替換 `__sync_val_compare_and_swap` ,與論文不同的是我們並沒有使用中介結構來操作
在 `tqueue_init()` 進行初始化的時候先新增一個 `dummy_node` ,並讓 `head` 和 `tail` 指向這個 node
``` clike=
task_t *dummy_node = task_new(dummy_function, NULL);
dummy_node->next = NULL;
the_queue->head = dummy_node;
the_queue->tail = dummy_node;
```
### tqueue_pop()
``` clike=
task_t *tqueue_pop(tqueue_t * const the_queue)
{
task_t *head = NULL, *tail = NULL, *next = NULL, *ret = NULL;
while(1) {
head = the_queue->head;
tail = the_queue->tail;
next = head->next;
if(head == the_queue->head) {
if(head == tail) {
if(next == NULL) {
return NULL;
}
atomic_compare_exchange_weak(&(the_queue->tail), &tail, next);
} else {
ret = next;
if(atomic_compare_exchange_weak(&(the_queue->head), &head, next)) {
break;
}
}
}
}
atomic_fetch_sub(&(the_queue->size), 1);
atomic_fetch_add(&(the_queue->num_of_consumed), 1);
return ret;
}
```
### tqueue_push()
``` clike=
int tqueue_push(tqueue_t * const the_queue, task_t *task)
{
task_t *tail = NULL, *tail_next = NULL;
task->next = NULL;
while(1) {
tail = the_queue->tail;
tail_next = tail->next;
if(tail == the_queue->tail) {
if(tail_next == NULL) {
if(atomic_compare_exchange_weak(&(tail->next), &tail_next, task)) {
if(atomic_compare_exchange_weak(&(the_queue->tail), &tail, task)) {
}
break;
}
} else {
if(atomic_compare_exchange_weak(&(the_queue->tail), &tail, tail_next)) {
}
}
}
}
atomic_fetch_add(&(the_queue->size), 1);
return 0;
}
```
### 整合結果
整合後發現最後一個 task 無法正確被 pop ,導致程式無法終止,把原本 `task_run()` 內的 `tqueue_push(pool->queue, _task);` 改成 `tqueue_push(pool->queue, task_new(NULL, NULL));` 就能正確執行了,不知道原因為何
`$ mutrace ./sort 4 random.txt`
```
mutrace: Showing statistics for process sort (pid 5998).
mutrace: 3 mutexes used.
Mutex #1 (0x0x7f71cf274860) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7f71d1c6e6b9]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7f71cf070fec]
[(nil)]
Mutex #2 (0x0x603140) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f71d1c6e4b2]
./sort(main+0xab) [0x4020bb]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f71d16a5830]
mutrace: Showing 2 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
1 20 14 7 0.003 0.000 0.001 M-.--.
2 6 5 0 0.003 0.000 0.002 M-.--.
||||||
/|||||
Object: M = Mutex, W = RWLock /||||
State: x = dead, ! = inconsistent /|||
Use: R = used in realtime thread /||
Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
Mutex Protocol: i = INHERIT, p = PROTECT /
RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC
mutrace: Note that the flags column R is only valid in --track-rt mode!
mutrace: Total runtime is 1110.494 ms.
mutrace: Results for SMP with 4 processors.
```
原本的 mutex 已經消失了
## 存在的問題
* memory leak
根據原本的寫法,我們會 pop 每一個 task 並在 task 做完後 free task,但是最一開始的 dummy node 並沒有 free 掉且再也無法取得
如果 threadpool 被設計成可以中途停止的話,剩下在 task queue 中的 task 也無法被釋放記憶體
* ABA 問題
這個問題我們目前還沒有遇到,但是有可能發生,還不曉得要怎麼解決,可能的解決方式有:
* [hazard pointer](https://en.wikipedia.org/wiki/Hazard_pointer)
* [reference counting](https://en.wikipedia.org/wiki/Reference_counting)
* **效率比原本的還差**
pop 與 push 做的事情相當單純,反而 lock-free queue 要不斷地去嘗試 pop 與 push ,過度競爭的狀況使得效益上不減反增,我們對於 lock-free 的結構想得太簡化了,也許要回到每個 thread 有自己的 workqueue 的形式並搭配 ondition variable 等來實現有效率的 lock-free thread pool
```
52.08% sort sort [.] tqueue_pop
16.94% sort sort [.] task_run
4.71% sort sort [.] split_n_merge
4.39% sort sort [.] sort_n_merge
4.08% sort [kernel.kallsyms] [k] queue_work_on
3.45% sort libc-2.23.so [.] __strcmp_sse2_unaligned
3.45% sort [kernel.kallsyms] [k] tty_ldisc_deref
3.14% sort [kernel.kallsyms] [.] native_irq_return_iret
2.82% sort sort [.] list_get
2.51% sort [kernel.kallsyms] [k] update_blocked_averages
2.20% sort libc-2.23.so [.] vfprintf
```
## 參考資料
1. [A Pragmatic Implementation of Non-Blocking Linked-Lists](https://www.cl.cam.ac.uk/research/srg/netos/papers/2001-caslists.pdf)
2. [0xff07 筆記](https://hackmd.io/s/H154LN6og)
3. [hungry-birds](https://github.com/jserv/hungry-birds/blob/master/queue.c)
4. [Built-in functions for atomic memory access](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html)
5. [Compare-and-swap](https://en.wikipedia.org/wiki/Compare-and-swap)
6. [Acquire and Release Semantics](http://preshing.com/20120913/acquire-and-release-semantics/)
7. [Toward Concurrency](https://hackmd.io/s/H10MXXoT)
8. [Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms](https://www.research.ibm.com/people/m/michael/podc-1996.pdf)
9. [Simple, Fast, and Practical Non-Blocking and Blocking
Concurrent Queue Algorithms ppt](http://web.cecs.pdx.edu/~walpole/class/cs510/winter2008/slides/4a.pdf)
10. [intro-to-lock-free-wait-free-and-aba](http://www.hergert.me/blog/2009/12/25/intro-to-lock-free-wait-free-and-aba.html)
11. [Boost lockfree queue](https://github.com/boostorg/lockfree/blob/develop/include/boost/lockfree/queue.hpp)
12. [Lock Free Concurrent Data Structures,
CAS and the ABA Problem ](http://users.minet.uni-jena.de/~nwk/LockFree.pdf)
13. [lfqueue](https://github.com/mitghi/lfqueue)