以 concurrent Linked List 實做 merge sort

contributed by<jeff60907>, <chenan00>

記得在分組頁面中補上 github 連結
課程助教

預期目標

研究 thread pool 管理 worker thread 的實做，提出實做層面的不足，並且參照 concurrent-ll，提出 lock-free 的實做
學習 concurrent-ll (concurrent linked-list 實作) 的 scalability 分析方式，透過 gnuplot 製圖比較 merge sort 在不同執行緒數量操作的效能
- 注意到 linked list 每個節點配置的記憶體往往是不連續，思考這對效能分析的影響

不同 thread 數量的效能影響

這張圖的圖例擋到數字囉~
建議可以移至左邊課程助教
好的，已修改完成陳致佑

1-32 threads 效能看起來都差不多，應該是因為mutex lock 關係，lock會等待完成才釋放，所以效能才不會有所差別。但是後面超過32之後建立更多的thread 反而造成更多的負擔。
是會造成怎樣的負擔呢？是因為CPU需花更多的時間做thread的切換跟管理嗎？查更仔細資料

再次複習 lock-free

閱讀老師提供 concurrent-ll程式碼，分析 lock和lock-free版本

測試老師提供的code輸出

可以請你解釋這三張圖分別有什麼含意（線條和十字個代表了什麼）且提出你對這三張圖的分析及想法嗎~
課程助教

十字叉叉代表 scalability（延伸擴展性）線條代表 Throughput (資料吞吐量)，綠色線條為lock-free，紅色線條為lock-based。可以很清楚看到 lock-based 的效能不管多少 thread 都不太會提升效能，而lock-free就會隨著 thread 變多效能提高。但是在Update =0 時候，lock-free 在 thread = 4 會被限制，應該是因為電腦為四核心關係，不過 update 不同的用意還不太了解

補充 scalability擴展性 throughput吞吐量定義介紹：

Throughput:
- Throughput is a measure of how many units of information a system can process in a given amount of time.

在單位時間內可以處理多少的資料量

Scalability:
- Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged in order to accommodate that growth.

可以處理越來越多的資料或工作時的能力或潛力

看了參考資料跟程式碼不懂的地方稍做整理

Non-blocking Algorithmwiki

In computer science, an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread.

lock-free: A non-blocking algorithm is lock-free if there is guaranteed system-wide progress
wait-free: A non-blocking algorithm is wait-free if there is also guaranteed per-thread progress

Non-blocking algorithms use atomic read-modify-write primitives that the hardware must provide, the most notable of which is compare and swap (CAS).

Lock-freewiki

Lock-freedom allows individual threads to starve but guarantees system-wide throughput.

lock-free雖然允許個別的thread可以有starvation，但是要能保證系統的處理能力

An algorithm is lock-free if it satisfies that when the program threads are run sufficiently long at least one of the threads makes progress.

簡單的說，lock-free algorithm 至少要有一個thread能跑出結果

In general, a lock-free algorithm can run in four phases: completing one's own operation, assisting an obstructing operation, aborting an obstructing operation, and waiting. Completing one's own operation is complicated by the possibility of concurrent assistance and abortion, but is invariably the fastest path to completion.

The decision about when to assist, abort or wait when an obstruction is met is the responsibility of a contention manager. This may be very simple (assist higher priority operations, abort lower priority ones), or may be more optimized to achieve better throughput, or lower the latency of prioritized operations.

lock-free演算法的部份

Correct concurrent assistance is typically the most complex part of a lock-free algorithm, and often very costly to execute: not only does the assisting thread slow down, but thanks to the mechanics of shared memory, the thread being assisted will be slowed, too, if it is still running.

支援concurrent是lock-free演算法最複雜的部份，執行的成本很高，一來是支援concurrent使thread變慢，如果是又有用到shared memory的thread又會更慢

CAS （Compare and swap) wiki

an atomic instruction used in multithreading to achieve synchronization.

Algorithms built around CAS typically read some key memory location and remember the old value. Based on that old value, they compute some new value. Then they try to swap in the new value using CAS, where the comparison checks for the location still being equal to the old value. If CAS indicates that the attempt has failed, it has to be repeated from the beginning: the location is re-read, a new value is re-computed and the CAS is tried again.

CAS是用atomic（原子？） instruction 來完成 multithread 的同步問題。主要作法就是會讀memory內的資料並紀錄舊的值，然後計算新的值透過CAS交換，過程中會檢查舊有的值是否跟memory的相等，一樣代表資料未被更動則改變新的值過去，不一樣則CAS會失敗，必須在一輪的讀值檢查CAS動作重複。

ABA problem wiki and lock-free ABA

In multithreaded computing, ABA problem occurs during synchronization, when a location is read twice, has the same value for both reads, and "value is the same" is used to indicate "nothing has changed"

ABA問題是當同一個記憶體資料讀兩次發現值是相同的，會認為沒有任何改變，但可能是另一個thread 已經有更動過了，但是更動後的資料剛好符合"原本的值"，造成沒有更動的錯覺。

這個例子是使用list存放item，如果將 item從list移除並釋放記憶體位址，重新建立一個新的item至list中，可能因為最佳化的關係會把記憶體位址填到原本被刪除的記憶體位址，結果新舊的位址都是一樣，會以為原本item沒有被更動而造成ABA的問題。

memory ordering problem 參考

FAI （Fetch-and-increment）wiki

    Fetch the value at the location x, say xold, into a register;
    add a to xold in the register;
    store the new value of the register back into x.

When one process is doing x = x + a and another is doing x = x + b concurrently, there is a race condition. They might both fetch xold and operate on that, then both store their results with the effect that one overwrites the other and the stored value becomes either xold + a or xold + b, not xold + a + b as might be expected.

當一個做x+a，另一個做x+b，他們使用到的會是xold 不會造成x+a+b

探討update是如何實作及用途

atomic_ops_if.h

裡面使用的函式非常不了解 -> 查gcc介紹
Built-in functions for memory model aware atomic operations

 __sync_fetch_and_add
 __sync_fetch_and_sub

These builtins perform the operation suggested by the name, and returns the value that had previously been in memory

回傳原先在memory的資料

 __sync_add_and_fetch
 __sync_sub_and_fetch

These builtins perform the operation suggested by the name, and return the new value.

回傳後來新的值

utils.h














uint32_t pow2roundup(uint32_t x)
{
    if (x == 0)
        return 1;
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x + 1;
}
/* Round up to next higher power of 2 (return x if it's already a power
 * of 2) for 32-bit numbers */

main.c使用 max_key = pow2roundup(max_key) - 1;
註解：we round the max key up to the nearest power of 2, which makes our random key generation more efficient

不太懂為什麼這樣方法可以提高生產鑰匙的效率，鑰匙的效率提升可以改善什麼地方呢？

READMD.md 提到一篇論文 "A Pragmatic Implementation of Non-Blocking Linked Lists"，你需要先去看 (讀書很重要!)，而不是亂猜，裡頭有提到 key 作為能否存取 linked list 的指引，目前實作的最大值是 2048 jserv
收到陳致佑

lock-free list.c

使用 mark 的方式 Linked Lists: Locking, LockFree,and Beyond …

閱讀論文 “A Pragmatic Implementation of Non-Blocking Linked Lists”

4.1 Implementing Sets
介紹了如何操作search，對於key的作用點
–還沒完整看完吸收

– 參考資料待閱讀
– 參考20組開發紀錄

提出了透過 monitor 管理worker thread，讓 thread 在wait queue中sleep，所以不會卡在CPU等待時間，要再研究一下他們如何實作

自我反省，應該要先好好研究程式碼重構，或者基本效能提升在研究更進階的 lock-free