Merge-sort-concurrent

# Merge-sort-concurrent ##### tag `Yen-Kuan Wu` `HahaSula` `mergesort-concurrent` `data strcuture` # 目標修改 data structure 讓 mergesort 能夠對 insert 和 delete 會更友善。 # 組員 ## 黃呂源 - [ ]~~lock-free 想法?冼鏡光投影片閱讀失敗 concurrentLL 失敗~~ - [ ]~~atom memory access?~~ - [x]分main_WorkQueue，thread_worker,roundrobin) >先將workingQueue & dispatch queue分開讓thread content 分散 >debug中 ## 吳彥寬 - [x]trace ==mergesort==、thread 的 code - [ ]code refactor: mergesort 的主體我將它們下外移(進行中...)有點小麻煩R > 暫緩，我發現盲目下去改會改不好，先熟悉 code。[name=Yen-Kuan Wu] - [x]針對 lock 和 time 的 benchmark ~~(進行中，debug)~~ - [ ]加上 condition variable - [ ]找看看有沒可以改善的 lock > 當我們做出 lock 時看到 contention 並不高，要調整測試條件[name=Yen-Kuan Wu] - [ ]mergesort 熟悉完 lock 狀況後，分析可否使用不同的 data structure 來改寫 --- # Code 觀察 ## main.c main.c 整體架構如下 ![](https://i.imgur.com/T9cJCuO.png) 觀察四條支線 * `list_add` 簡單的 init list * `tpool_init` 簡單的 init threadpool(但裏面很難...囧) >`tpool_init(pool, thread_count, task_run)` 所 pool 有吃到 `task_run` 這個 function [name=Yen-Kuan Wu] * `cut` 可以說是這份 mergesort 的精華，我們看到他有 call 到 `merge` 和 `tqueue_push`，而 `push` 這個動作即是將 task 丟進 threadpool 裡 ```clike= new_task->func = cut; new_task->arg = the_list; tqueue_push(pool->queue, new_task); ``` >所以 `cut` 也被丟到 `pool` 裡去了[name=Yen-Kuan Wu] * `tpool_free` 這裡面有實作 `tqueue_free` 等待所有任務做完 * 我們發現這整份 code 的切割方式很簡單，即是一個 threadpool 當作執行任務的載體，而且特別有丟入 `task_run` 進去，其他 task 就直接丟進去執行，並且使用 lock 的方式確保正確性。 >看完整體架構跟我們的目標後，我們想到要修改這份 code 的切入點應該是，tqueue 是 link-list 和 mergesort 用的也是 link-list，我們將從這兩點切入，先熟悉 code 再提出改善。[name=Yen-Kuan Wu] * 整份 main.c 裡有一些變數看不太懂 - [ ] `max_cut = thread_count * (thread_count <= data_count) + data_count * (thread_count > data_count) - 1;` >~~計算全部thread切了幾刀，兩個data只需要切一刀~~[name=HahaSula] >這我有點聽不懂耶，等等我們討論一下[name=Yen-Kuan Wu] - [ ] `thread_data.cuthread_count = 0;` >~~判斷cutthread_count = max cut 才會把cut轉成merge task push 到tqueue裏面~~[name=HahaSula] >更正cut為把整分資料先大刀切給thread(讓多個thread做merge_sort 去切小塊) >再給merge_sort下去切成實際被sort的的單位 --- ### static void *task_run(void *data) * 簡單的 while 迴圈，`pop` task 出來做 - [x] 這邊我好奇的是，如果 pool 剛 init 時是沒有 task 的吧，那為什麼不會直接結束呢？我想這邊應該要多增加一個解說或是一個 function 來說告知，現在 queue 是屬於什麼狀態，可以 free 還是說只是沒有沒有任務進來。 >沒有task並不回造成break，task->Fun = NULL才會break;[name=HahaSula] >感謝！我在想我們的改進方向，或許可以朝查詢這個 task queue 的狀態來發展。[name=Yen-Kuan Wu] --- ### void cut ( void * data ) ![](https://i.imgur.com/meHgk8C.png) #### 觀察 * 我們發現 `cut` `merge` 都有 call `tqueue_push`，也就是任務有分 `cut` 和 `merge`(這觀察很像是廢話...) * 再來是 `merge_sort` 這看不出來了，再來就是 trace 整份 code 了 * 首先看到我們第一個 critical section ```clike= pthread_mutex_lock(&(thread_data.mutex)); int cutLocal = thread_data.cuthread_count; if (list->size > 1 && cutLocal < max_cut) { ++thread_data.cuthread_count; pthread_mutex_unlock(&(thread_data.mutex)); ``` > 這邊我會們會發現，這邊很簡單就可以改成 atomic 了，由於裡面用到的共用變數只有 `thread_data.cuthread_count`(mutex lock 是針對 `thread_data`)[name=Yen-Kuan Wu] 很順利的我們改出了第1版 lock-free 的改進 ```clike= struct { //pthread_mutex_t mutex; _Atomic int cuthread_count; } thread_data; ``` - [ ] 這邊我再改 code 時發現，我們根本沒有用到 list 的 lock ```clike= struct { pthread_mutex_t mutex;//FIXME: No use in code llist_t *list; } tmp_list; ``` >而且有一個部份的 code 很奇怪，明明 critical section 綁的是 tmp_list，可是卻 lock thread_data.mutex[name=Yen-Kuan Wu] >有趣的是thread_data.mutex 已經cut完不會再被前面使用所以好像不會衝突[name=HahaSula] ```clike= [main.c: 73: merge()] //pthread_mutex_lock(&(thread_data.mutex));FIXME:No need lock llist_t *tmpLocal = tmp_list.list; if (!tmpLocal) { tmp_list.list = list; //pthread_mutex_unlock(&(thread_data.mutex)); } else { tmp_list.list = NULL; //pthread_mutex_unlock(&(thread_data.mutex)); ``` >- [x] 目前我是將其給 comment 掉了，但是是否需要換成 tmp_list.mutex 這我還不能確定。[name=Yen-Kuan Wu] >>我自己解掉他了，我發現這有可能發生 ABA 的問題，所以是需要 lock 的，那這邊一樣，我會想要改成 atomic 的版本[name=Yen-Kuan Wu] >```clike= >struct { > //pthread_mutex_t mutex;//FIXME: No use in code > llist_t * _Atomic list; >} tmp_list; >``` >>[name=Yen-Kuan Wu] - [ ] 經過我這個 Atomic 的修改，我發現我只有解決了不會有 double assign 而沒有解決順序的部份，這可能還要再研究一下，也許要做個 heavy test 來檢查正確性，也許要將這一段 code 都改成 atomic - [ ] heavy test for correction # merge & cut 觀察 #### merge分兩階段 thread1 merge會把拿到的list放到tmp_list.list裏面 thread2 merge會把tmp_list.list的list跟傳入的list做merge_list 將return list做一個merge task push回去tqueue中(把tmp_list.list清空給下一個merge) (結果：每吃兩個merge做一次merge_list) (不知道會不會有merge1 跟merge2 size不一的情況存在) > 呂源，麻煩你詳細一點，不然我也看不懂= =，至少讓我知道那幾個 function 再做什麼或者是他整體架構再幹嘛[name=Yen-Kuan Wu] > 簡單來講，你就要把我當作一個有一點基礎，而且只有略窺過這份 code，不用 detail 但要跟我說你怎麼觀察到的再哪個 fucntion裡，或者能簡介一下這份code 的架構或 functio 再幹嘛。[name=Yen-Kuan Wu] ## thread.c 觀察 condition variable沒有使用 (如果要修改tqueue的lock不知道還有沒有要用) ### 程式描述 thread pool init cut 將資料均分給各個thread(大刀) 各個thread 把自己的資料做merge sort(分別sort) 兩個兩個sort好的thread 合併list(sort完的thread 用merge合併) ## list.c * `node_t` ```clike= typedef intptr_t val_t; typedef struct node { val_t data; struct node *next; } node_t; ``` >至於 `intptr_t` 是指向 int 的 pointer 資料閱讀有解說。[name=Yen-Kuan Wu] * `llist_t` ```clike= typedef struct llist { node_t *head; uint32_t size; } llist_t; ``` >我認為這個命名不是很漂亮，~~也許作者的意思是 leader list 吧，我猜，這是需要 refector 的部份。~~[name=Yen-Kuan Wu] >>我錯了，linux kernel 裡就是用 llist 來代表 link list... >- [x]這邊的 szie 需要確認一下他的用處為何，如果只是 link list 照裡來說是不需要 size 的[name=Yen-Kuan Wu] >>這邊的 size 能用來幫助 check index 是否超過，就不用跑到底發現沒有。[name=Yen-Kuan Wu] >- [x] 修改 llist_t 命名 >> 算了吧，linux kernel 都這樣用了...[name=Yen-Kuan Wu] * `node_t *new_node(val_t val, node_t *next)` > 這邊很特別的是，新創一個node插入再開頭，所以參數命名沒有錯，的確是新創一個 node 且 assign 之後 node->next 就會接到 next 去。[name=Yen-Kuan Wu] >> 剛好看到一個用法 new_node(val, NULL)，這樣的的確就是 new node 了。[name=Yen-Kuan Wu] > 這不光光只有 new node 而且還有串接，所以看 function 名稱是看不出來的。[name=Yen-Kuan Wu] > - [ ] refector，更好的名稱或者拆開 * `llist_t *list_new()` > 基本的 init list[name=Yen-Kuan Wu] * `int list_add(llist_t *the_list, val_t val)` > 簡單的 append[name=Yen-Kuan Wu] * `node_t *list_get(llist_t *the_list, uint32_t index)` > output index 的 node 的 value，所以這邊會用到 size 先check index 是否大於 size 如果有就return NULL [name=Yen-Kuan Wu] ## 改進 link list 的想法 * 觀察完上面的 code 我們會發現他的缺點就是在，search 的時間，因為你要用 link 一個一個移到要找的地方 > 1. arrary 的方式如何呢？但 arrary 的缺點就是 new node 不太方便，目前想法是，可以使用 link list + arrary 的方式，但是這樣會讓 mergesort 很難改，待考慮[name=Yen-Kuan Wu] > 2. tree 的方式如何呢？當然我們可以直接實作 heap sort，但假如我們把他用在 mergesort 也許也會有好處，想像right leaf 跟 left leaf 即是一個分割，這可以考慮看看[name=Yen-Kuan Wu] > - [ ] survey linux kernel 的 alg 和 struct 或許裏面有很多我們沒想到的特殊演算法或 struct[name=Yen-Kuan Wu] # Linux Kernel data structure survey 主要參考 <`lourie`> 在 FB 裡貼出來的連結(http://cstheory.stackexchange.com/questions/19759/core-algorithms-deployed/)，他整理了 linux kernel 裡使用到的重要 data structure 和 algorithm :::info 首先我發現，Linux kernel 裡的 comment 居然都有符合 Doxygen 的格式，代表我能使用 doxygen 輔助我了解 code 的相依性。 ::: # 實驗 ## 極端狀況測試 - [x] 負的可不可以排？ O - [x] thread_num > data_count ? O - [x] 吃得下 35萬個 data 嘛？ O ```clike= (for i in {1..350000};do echo $RANDOM;done )| ./sort 4 350000 ``` - [x] 一個 thread 能跑嘛？ O - [ ] 超過 int 大小的能跑? X ## Dispatch queue實作 Data 1000 thread 1 2 4 8 16(仍然有mutex) ```shell= ori 版本 mergesort for 1000 data with 1 threads take 1509 us mergesort for 1000 data with 2 threads take 1001 us mergesort for 1000 data with 4 threads take 1424 us mergesort for 1000 data with 8 threads take 1103 us mergesort for 1000 data with 16 threads take 1499 us ``` **average = 926** ![](https://i.imgur.com/FuGCgZE.png) **average = 3027** ![](https://i.imgur.com/Esg8HX2.png) **average = 42540** ![](https://i.imgur.com/V7D19kn.png) **average = 181584** ![](https://i.imgur.com/Vxqzzz2.png) **average = 512155** ![](https://i.imgur.com/ZXzd6yh.png) **mutrace** * ori: ```shell Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags 2 419 144 112 0.074 0.000 0.001 Mx.--. 0 40 25 9 0.007 0.000 0.001 M-.--. 1 29 24 4 0.005 0.000 0.000 M-.--. |||||| /||||| Object: M = Mutex, W = RWLock /|||| State: x = dead, ! = inconsistent /||| Use: R = used in realtime thread /|| Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /| Mutex Protocol: i = INHERIT, p = PROTECT / RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC ``` * after: ``` Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags 4 41821 22 19 8.193 0.000 0.018 M-.--. 1 109758 8 8 20.845 0.000 0.049 Mx.--. 0 100650 8 8 21.070 0.000 2.847 Mx.--. 3 83146 6 6 19.873 0.000 5.192 Mx.--. 2 57248 6 3 10.219 0.000 0.073 Mx.--. 6 25 8 1 0.005 0.000 0.001 M-.--. 5 13 11 0 0.005 0.000 0.001 M-.--. |||||| /||||| Object: M = Mutex, W = RWLock /|||| State: x = dead, ! = inconsistent /||| Use: R = used in realtime thread /|| Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /| Mutex Protocol: i = INHERIT, p = PROTECT / RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC ``` **cache miss** * ori ```shell= Performance counter stats for './sort 1 1000': 8516 cache-misses # 25.416 % of all cache refs 3,3507 cache-references 235,0391 instructions # 0.90 insns per cycle 262,4563 cycles 0.003585538 seconds time elapsed mergesort for 1000 data with 2 threads take 714 us Performance counter stats for './sort 2 1000': 6770 cache-misses # 19.269 % of all cache refs 3,5135 cache-references 257,8524 instructions # 0.96 insns per cycle 269,6694 cycles 0.002147490 seconds time elapsed mergesort for 1000 data with 4 threads take 1020 us Performance counter stats for './sort 4 1000': 5765 cache-misses # 12.535 % of all cache refs 4,5991 cache-references 386,5704 instructions # 0.76 insns per cycle 510,7594 cycles 0.002345743 seconds time elapsed mergesort for 1000 data with 8 threads take 788 us Performance counter stats for './sort 8 1000': 5714 cache-misses # 11.888 % of all cache refs 4,8066 cache-references 349,8934 instructions # 0.83 insns per cycle 422,7095 cycles 0.002090579 seconds time elapsed mergesort for 1000 data with 16 threads take 1061 us Performance counter stats for './sort 16 1000': 8564 cache-misses # 12.828 % of all cache refs 6,6762 cache-references 506,4661 instructions # 0.74 insns per cycle 687,1425 cycles 0.002585683 seconds time elapsed ``` * after: ```shell== Performance counter stats for './sort 1 10000': 1,1637 cache-misses # 28.579 % of all cache refs 4,0719 cache-references 3036,3061 instructions # 1.03 insns per cycle 2941,6038 cycles 0.010364469 seconds time elapsed mergesort for 10000 data with 2 threads take 3299 us Performance counter stats for './sort 2 10000': 7315 cache-misses # 15.460 % of all cache refs 4,7316 cache-references 2587,7678 instructions # 1.05 insns per cycle 2460,9844 cycles 0.004680364 seconds time elapsed mergesort for 10000 data with 4 threads take 20934 us Performance counter stats for './sort 4 10000': 2,5720 cache-misses # 31.558 % of all cache refs 8,1501 cache-references 1,4593,1072 instructions # 0.86 insns per cycle 1,6953,7714 cycles 0.022214962 seconds time elapsed mergesort for 10000 data with 8 threads take 140475 us Performance counter stats for './sort 8 10000': 3,9885 cache-misses # 33.887 % of all cache refs 11,7699 cache-references 10,0577,8308 instructions # 0.82 insns per cycle 12,3025,4065 cycles 0.142290916 seconds time elapsed mergesort for 10000 data with 16 threads take 505297 us Performance counter stats for './sort 16 10000': 3,8847 cache-misses # 18.931 % of all cache refs 20,5208 cache-references 36,6412,2022 instructions # 0.80 insns per cycle 45,9637,9790 cycles 0.506762169 seconds time elapsed ``` >hi!簡單幫你們切割了一下數據,加了一些粗體字,應該會更一目了然! >可以再想想怎麼呈現實驗結果,因為資料很多筆,或許嘗試使用表格整理會不錯！目前我也還沒有想到更好的方法啦>"<[color=red][name=課程助教] # Benchmark ### TIMING 看到老師寫的 fixme，才想到裡面多餘的 printf 也是會造成 overhead ```clike= /* FIXME: remove all all occurrences of printf and scanf * in favor of automated test flow. */ ``` * 新增 TIMING 的 flag 來計算時間 ### Lock 我們使用老師給的 `mutrace` 來檢測 ### benchmark 設計這邊使用 bash script 來跑實驗，我設計將實驗用到的參數拉出來 * `bechmark.cfg` ```shell= TEST_SET=(1 2 4 8 16) TEST_DATA_NUM=10000 ``` * benchmark-time.bash ```shell= #!/bin/bash . benchmark.cfg for i in ${TEST_SET[@]}; do ./sort $i ${TEST_DATA_NUM}; done ``` # 資料閱讀 ## Bash 指令 ## GCC 指令 * 3.12 Options Controlling the Preprocessor [-MF -MMD](https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html) > 這邊我找到更好的連結 [ntu gcc解說](http://www.cmlab.csie.ntu.edu.tw/~daniel/linux/gcc.html)，還有[gcc 中文手冊](http://www.cmlab.csie.ntu.edu.tw/~daniel/linux/cman/gcc.html)[name=Yen-Kuan Wu] 引用 `<kobeYu>` ``` -M 生成文件關聯的信息。包含目標文件所依賴的所有源代碼你可以用gcc -M hello.c來測試一下，很簡單。 -MM 和上面的那個一樣，但是它將忽略由#include<file>;造成的依賴關係。 -MD 和-M相同，但是輸出將導入到.d的文件裡面 -MMD 和-MM相同，但是輸出將導入到.d的文件裡面 -MF 指定輸出到某個檔案,否則預設是原始檔案檔名.d ``` > 參考:[kobeYu/mergesort-concurrency共筆](https://hackmd.io/s/Bydtpi4R)[name=Yen-Kuan Wu] main.c 81 ## mutrace 如果想要使用 mutrace 的話要在 gcc 後面加上 `-rdynamic` >Make sure to build your application with -rdynamic to make the backtraces mutrace generates useful. [name=Arthur] * 作者的 [blog](http://0pointer.de/blog/projects/mutrace.html) ``` Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags 1 2734 1342 774 0.715 0.000 0.018 Mx.--. 0 80 55 21 0.019 0.000 0.001 M-.--. 2 61 52 2 0.017 0.000 0.000 M-.--. |||||| /||||| Object: M = Mutex, W = RWLock /|||| State: x = dead, ! = inconsistent /||| Use: R = used in realtime thread /|| Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /| Mutex Protocol: i = INHERIT, p = PROTECT / RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC ``` >~~這邊有個問題是作者的用詞都是 how often 害我不知道他到底是次數還是時間，待我之後測試~~[name=Yen-Kuan Wu] >就後續來看其面三項應該是幾次。(作者後面有說了= =) * ==`Locked`== column tells how often the mutex was locked during the entire runtime of about 10s. >這個 mutex ~~多常~~被 lock幾次。[name=Yen-Kuan Wu] * ==`Changed`== column tells us how often the owning thread of the mutex changed. >這我不太清楚，也許是 mutex 轉移。[name=Yen-Kuan Wu] * ==`Cont.`== column tells us how often the lock was already taken when we tried to take it and we had to wait. > 這個 lock 被 require 時是已經 taken了的次數。[name=Yen-Kuan Wu] * ==`tot.Time` `avg.Time` `max.Time`== 我就不說明了，也就是 lock 的時間 * - [ ] `addr2line` 一直都是亂碼 - [ ]閱讀 [More Mutrace](http://0pointer.net/blog/projects/mutrace2.html) ### 追蹤mutex 實際行數約在mutex_init下面一點問題只有做tqueue threadcount的init 可是卻追蹤到3個mutex ``` shell 原始code Mutex #0 (0x0xa5b870) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fc9977cd4b2] ./sort(tqueue_init+0x38) [0x401356] ./sort(tpool_init+0x6a) [0x401596] ./sort(main+0x156) [0x401e33] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fc997204830] $ addr2line -e sort 0x401356 /home/sula/workspace/mergesort-concurrent-ken/thread.c:16 $ addr2line -e sort 0x401596 /home/sula/workspace/mergesort-concurrent-ken/thread.c:80 $ addr2line -e sort 0x401e33 /home/sula/workspace/mergesort-concurrent-ken/main.c:185 ``` # Atomic ## C11 atomic 閱讀[Atomic C11 PPT](https://www2.informatik.hu-berlin.de/~keil/docs/keil_-_c11_atomics_20140202.pdf) [p.10] * `_Atomic` int ``` _Atomic int* pi; // pointer to atomic int int* _Atomic pi; // atomic pointer (to int) ``` > 這邊我參照這個想法做了一個 muti-thread counter，一個是有 atomic 另一個是沒有的，[code在這邊](https://github.com/c14006078/PracticeMakePerfect/tree/master/c11_atomic)，而且事實比對有 atomic 會是正確的數值。[name=Yen-Kuan Wu] [p.11] * `_Atomic` struct 的 member 不能獨立 access * 看不懂`must be copied to compatible object and then members will be accessed individually` ## Atomic memory access [__sync_add_and_fetch](https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html) 問題？編譯器最佳化OOO===>memory Barrier [簡單範例blog](http://reborn2266.blogspot.tw/2013/03/memory-barrier-in-lock-api.html) atomic memory 的指令會對share_list 更動，不知道share_list是什麼。 **在 barrier 之前的CPU 指令會保證比 barrier 之後的指令先完成。** ## Signal ## lock wiki lock * lock overhead * 也就是只所有 lock 帶來的負擔，space、init/destroy time、time for reqiring and releasing lock ... * lock contention * 當一人握有 lock 時，另一人再進行 require * wiki 有說到，儘量要讓 lock 鎖在細項，而不是整個 table，這樣也能有效減少競爭，但是我想的是越細索是 >在直播聊天室提問了這個問題，我們是想 * deadlock ## intptr_t ? 主要參考 [使用變數型別的良好習慣](http://b8807053.pixnet.net/blog/post/164224857-%E4%BD%BF%E7%94%A8%E8%AE%8A%E6%95%B8%E5%9E%8B%E5%88%A5%E7%9A%84%E8%89%AF%E5%A5%BD%E7%BF%92%E6%85%A3) **一張圖看出端倪** ``` 32 bit 64 bit char 1 1 short 2 2 int 4 4 long 4 8 long long 8 8 pointer 4 8 ``` ```clike= #include <stdio.h> #include <stdint.h> int main() { int i = 10; int *p = &i; intptr_t pp = (intptr_t)p; printf("%p\n", p); printf("%lx\n", pp); printf("%p\n", &i); return 0; } ``` >這邊我們要考慮 ptr 是在 32bit 還 64bit 上，如果想把他轉成 int 的話，用 intptr_t 會比較 protable[name=Yen-Kuan Wu] >裡頭還提到很多，這讓我思考我的變數應該要以更統一的方式來宣告，例如 int16_t 之類的，而不會取決於機器[name=Yen-Kuan Wu]