B08: mergesort-concurrent

# B08: mergesort-concurrent ###### tags: `sysprog2017` :::info 主講人: [jserv](http://wiki.csie.ncku.edu.tw/User/jserv) / 課程討論區: [2017 年系統軟體課程](https://www.facebook.com/groups/system.software2017/) :mega: 返回「[作業系統設計與實作](http://wiki.csie.ncku.edu.tw/sysprog/schedule)」課程進度表 ::: ## 預期目標 * 作為 [concurrency](https://hackmd.io/s/Skh_AaVix) 的展示案例 * 學習 POSIX Thread Programming，特別是 [synchronization object](https://docs.oracle.com/cd/E19683-01/806-6867/6jfpgdcnd/index.html) * 為日後效能分析和 scalability 研究建構基礎建設 * 學習程式品質分析和相關的開發工具 ## 案例分析: [mergesort-concurrent](https://github.com/sysprog21/mergesort-concurrent) (對單向 Linked List) 兩大重點: - 排序的對象是 singly-linked list - 利用 POSIX Thread 處理，需要一併操作 synchronization object 取得程式碼並測試: ```shell $ git clone https://github.com/sysprog21/mergesort-concurrent $ cd mergesort-concurrent $ make $ ./sort ``` 程式輸出，提示以下用法: ``` usage: ./sort [thread_count] [input_count] ``` 比方說 `./sort 4 8`，執行後會看到 "input unsorted data line-by-line" 提示訊息，請輸入 8 組數字，每個都用 Enter (換行) 分隔。之後即可看到輸出。在 GNU bash 中，可善用 `$RANDOM` 環境變數，取得介於 0~32767 之間的亂數，於是我們可透過以下指令來作自動測試: ```shell $ (for i in {1..8}; do echo $RANDOM; done) | ./sort 4 8 ``` 輸出的 "sorted results" 訊息後方應該有 8 組數值，由小到大排列。 * 延伸閱讀: [第十章、認識與學習BASH](http://linux.vbird.org/linux_basic/0320bash.php)，《鳥哥的 Linux 私房菜》 * 在 `Makefile` 中定義了 `check`，所以可用 `make check` 來檢驗程式的正確性。 ## [manager-worker 架構](http://maxim.int.ru/bookshelf/PthreadsProgram/htm/r_19.html) 示意圖如下： ```graphviz digraph hierarchy { nodesep=0.5 // increases the separation between nodes node [color=black,fontname=Courier,shape=box] //All nodes will this shape and colour edge [color=black, style=dashed] // All the lines look like this Manager->{Worker1 Worker2 Worker3} } ``` 為了分配工作，在 worker thread 實作 Task 的機制。每個主要的操作會先被放進 task queue 裡頭，空閒的 thread 再從 task queue 裡頭提取 task 執行，如下面的步驟: ```graphviz digraph { 開始Process[shape="box", style=rounded]; 是否為結束Task[shape="diamond"]; 結束Process[shape="box", style=rounded]; 開始Process->提取Task->是否為結束Task 是否為結束Task->重發結束Task[label="是"] 重發結束Task->結束Process 是否為結束Task->執行Task[label="否"] 執行Task->提取Task; } ``` 只要把我們的操作寫成 task，就能順利執行。 ### Linked List Linked list 是由各個 node，透過指標串聯而成。先定義 node_t 型別 ```c typedef intptr_t val_t; /* 不僅可放數值，也能放指標 */ typedef struct node { val_t data; /* 欲儲存的資料 */ struct node *next; /* next 指標 */ } node_t; ``` node 建立後，就可繼續定義 llist_t: ```c typedef struct llist { node_t *head; uint32_t size; } llist_t; ``` 定義完型別，接著是各種操作: ```c= // 建立新的 list 物件 llist_t *list_new(); // 從 list 插入給定資料 int list_add(llist_t *the_list, val_t val); // 逐一印列 list 內容 void list_print(llist_t *the_list); // 產生新的節點 node_t *new_node(val_t val, node_t *next); // 取得某個 index 的節點資訊 node_t *list_get(llist_t *the_list, uint32_t index); ``` ### Thread pool 定義 task_t 來封裝 task: ```c typedef struct _task { void (*func)(void *); /* 對應到最終執行的函式 */ void *arg; /* 傳入的參數 */ struct _task *next, *last; /* 因為 queue 要用 doubly-linked list，需要儲存 next 和 last */ } task_t; ``` 定義Task的free操作 ```c int task_free(task_t *the_task); ``` 再來是 thread pool 所需的 queue 結構: ```c typedef struct { task_t *head, *tail; pthread_mutex_t mutex; pthread_cond_t cond; uint32_t size; } tqueue_t; int tqueue_init(tqueue_t *the_queue); task_t *tqueue_pop(tqueue_t *the_queue); uint32_t tqueue_size(tqueue_t *the_queue); int tqueue_push(tqueue_t *the_queue, task_t *task); int tqueue_free(tqueue_t *the_queue); ``` 接著把 queue 和 thread 包裝成 thread pool: ```c typedef struct { pthread_t *threads; uint32_t count; tqueue_t *queue; } tpool_t; int tpool_init(tpool_t *the_pool, uint32_t count, void *(*func)(void *)); int tpool_free(tpool_t *the_pool); ``` 之後實做`task_run` 作為稍早提到流程圖的主迴圈 (main-loop) ```C void *task_run(void *data) { task_t *cur_task = NULL; while (1) { cur_task = tqueue_pop(pool->queue); if (cur_task){ if (!cur_task->func) { tqueue_push(pool->queue, cur_task); break; } else{ curTask->func(cur_task->arg); task_free(cur_task); } } } pthread_exit(NULL); } ``` 有了這樣的基礎建設，我們的 mergesort 就很容易透過 task 這樣的包裝，加入 thread pool 中。 ## 分析 mutex contention mutrace 可用來偵測 [lock contention](https://en.wikipedia.org/wiki/Lock_(computer_science)#Granularity)，使用很方便，不需要重新編譯程式碼。 ```shell $ sudo apt-get install mutrace ``` 搭配前述亂數輸入自動測試，執行以下命令: ```shell $ (for i in {1..8}; do echo $RANDOM; done) | mutrace ./sort 4 8 ``` mutrace 的輸出: ``` mutrace: Showing statistics for process sort (pid 8978). mutrace: 3 mutexes used. Mutex #0 (0x0x559cfdae59b0) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xec) [0x7f8d7fc3862c] ./sort(tqueue_init+0x38) [0x559cfc426315] ./sort(tpool_init+0x6a) [0x559cfc42656a] ./sort(main+0x16b) [0x559cfc426d32] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f8d7f6703f1] Mutex #1 (0x0x7f8d7d23f380) first referenced by: Mutex #2 (0x0x559cfc6280a0) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xec) [0x7f8d7fc3862c] ./sort(main+0x125) [0x559cfc426cec] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f8d7f6703f1] mutrace: Showing 3 most contended mutexes: ``` 其中 `tqueue_init+0x38` 就是實際執行的地址，可用 `addr2line` 來找出對原始程式碼的對應，注意，要確保編譯時加入 `-g` 參數，確保包含 debug info 的執行檔正確產生。以這個地址來說，對應的原始程式碼為: ```shell $ addr2line -e sort 0x38 mergesort-concurrent/main.c:167 ``` 延伸閱讀: * [Measuring Lock Contention](http://0pointer.de/blog/projects/mutrace.html) * [More Mutrace](http://0pointer.net/blog/projects/mutrace2.html) ## GraphViz [ [source](http://www.openfoundry.org/tw/foss-programs/8820-graphviz-) ] [Graphviz](http://www.graphviz.org/) 是個依據給定指令的製圖軟體，不過說是繪圖軟體，它能繪的圖並不是一般人想像中的漫畫或 logo，而是數學意義上的 "graph"，比較通俗的說法就是「關係圖」。舉例來說，像是下面這種圖，展示 [Unix 家族](https://www.levenez.com/unix/)： ![](https://i.imgur.com/fMQZUGl.jpg) 用手畫會很痛苦，而 Graphviz 可以替使用者搞定它。[Graphviz](http://www.graphviz.org/) 提供一套語言，讓您能直接陳述圖片上的節點、邊、方向等性質。之後，由它來為您產生整張圖片。 Graphviz 能畫的圖片有許多種，可在[官方網站](http://www.graphviz.org/Gallery.php)找到更多範例。 HackMD 已經支援 GraphViz，本頁 mergesort 的圖例就是用該工具繪製。按右上方 <i class="fa fa-pencil"></i> 之後再按左上方 <i class="fa fa-columns"></i>，查看 GraphViz 的使用。 ## UNIX 指令組合的魔法 * [phonebook-concurrent](https://github.com/sysprog21/phonebook-concurrent) 裡頭的 `dictionary/words.txt` 複製出來，然後透過 UNIX 指令打亂順序，之後重新導向到另一個檔案 ```shell $ uniq words.txt | sort -R > input.txt ``` 這樣我們就有新的資料輸入。 ## 作業要求 * 將 merge sort 的實做改為可接受 [phonebook-concurrent](https://github.com/sysprog21/phonebook-concurrent) 的 35 萬筆資料輸入的資料檔 * 字典檔資料需要事先用 `sort -R` 處理過 * 思考如何得到均勻分佈的亂數排列，並且設計自動測試的機制 * 研究 thread pool 管理 worker thread 的實做，提出實做層面的不足，並且參照 [concurrent-ll](https://github.com/jserv/concurrent-ll)，提出 lock-free 的實做 * 學習 [concurrent-ll](https://github.com/jserv/concurrent-ll) (concurrent linked-list 實作) 的 scalability 分析方式，透過 gnuplot 製圖比較 merge sort 在不同執行緒數量操作的效能 * 注意到 linked list 每個節點配置的記憶體往往是不連續，思考這對效能分析的影響 * 共筆的內容儘量用 GraphViz 製作 * 截止日期： * Mar 27, 2017 (含) 之前 * 越早在 GitHub 上有動態、越早接受 code review，評分越高 ## 共筆選讀 - [ ] mingnus: [共筆](https://hackmd.io/s/SJeOsFOR) * code review - [ ] TempoJiJi: [共筆](https://hackmd.io/s/B1-Kytv0) * 對程式碼做了很多調整，降低 lock contention 發生的機會，不過因為他還沒將 merge sort 內部實做切割，所以 scalability 很差，執行緒越多，排序則越慢。 * 嘗試引入 SuperMalloc 後，他發現整體執行時間縮減了 (從左/上圖到右/下圖)，由此可見，實做細節不能忽略 concurrent memory allocation 的影響。 - [ ] shelly4132: [共筆](https://hackmd.io/s/HkX7l7YA) * 發現 cut_func() 跟 merge_sort() 都進行切割 linked list，而 cut_func 裡面判斷 cut_count 是否小於 max_cut 其實沒有必要，所以把 cut_count 等相關變數刪掉，也把 merge_sort()刪掉，留下 cut_func() 就好。 * 為了驗證程式正確性，她還特別設計 verification 用的程式 - [ ] LanKuDot: [共筆](https://hackmd.io/s/S1ezGhIA) * code refactoring * 建立自動化結果測試 - [ ] andy19950: [共筆](https://hackmd.io/s/rJ6m3IvC)