Mergesort 效能分析

# Mergesort 效能分析 contributed by <`fzzzz`> , <`nekoneko`> , <`abba123`> ###### tags `fzzzz` `nekoneko` `abba123` `sysprog` >>中英文關鍵字之間請記得使用空白隔開 >>[color=red][name=課程助教] ## 目標 - [x] 試圖找出之前實驗時，數據抖動的原因 - [x] 分析 lock contention 對於執行時間的影響 - [ ] 整理semaphore, mutex, condition variable資料與比較三者差異 - [ ] 了解何謂coarse locking ### 找出數據抖動原因在[原本的實驗](https://hackmd.io/s/SJfa8wFR)中，我們測試的時候發現執行時間會有不規律的抖動情況出現，即便是使用 gnu 內建的 `__buildin___clear_cache()` 仍然會存在著不規律抖動，所以我們推測，該抖動的來源可能是來自各 thread 的有效利用率不足。歸類出以下兩種 * 有效執行的 task 在各 thread 中分佈不均勻 * lock contention 導致大量 thread 處於等待鎖釋放的階段 #### 單一數量執行緒測試 ``` for i in `seq 1 1 50`; do ./sort 100 input.txt; done ``` ![](https://i.imgur.com/vLtEmmy.png) ![](https://i.imgur.com/I96Cr1T.png) ### 各thread執行有效task的分佈 ![](https://i.imgur.com/t2HTrmN.png) >> 希望能探討分佈平均與否對於效能的影響 [name=`fzzzz`] >> 多次實驗後發現沒有太大的影響（利用variance來作為分散程度的表示方法） >> 等等補圖 [name=`fzzzz`] #### 分佈均勻度對時間影響 ![](https://i.imgur.com/59bc0MK.png) 就實驗結果看來在大量執行緒時，分佈均勻與否並不影響執行時間 ### 將mutex改成POSIX semaphore的實做將main.c和threadpool.\*的mutex改成POSIX semaphore - benchmark ![](https://i.imgur.com/1KegsWL.png) - 各 thread 執行有效 task 的分佈 ![](https://i.imgur.com/5m6ZMwD.png) > semaphore 與 mutex 的差別，有需要再讀資料了解[name=cheng hung lin] > 信號標和鎖的不同，鎖會有擁有者(task)，semaphore 的話比較像信號標，來告知可不可以過[name=`fzzzz`] >> 上面那行到底在寫什麼？拜託清楚解釋差異，不要用不清不楚的詞彙來自欺欺人 [name=jserv] ### 利用 mutrace and gprof 來找出效能瓶頸 **mutrace** ``` Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags 0 3440715 2706841 1711166 461.433 0.000 4.445 Mx.--. ``` **gprof** ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 43.83 0.07 0.07 2805641 0.00 0.00 tqueue_pop 43.83 0.14 0.07 198755 0.00 0.00 merge_list 6.26 0.15 0.01 349900 0.00 0.00 list_add 6.26 0.16 0.01 main 0.00 0.16 0.00 395224 0.00 0.00 list_new ``` 經多次測量並觀察後發現，mutex contention 次數愈多的，執行時間會愈久，tqueue_pop 的呼叫次數也會變多但 function 的呼叫次數與 contention 相比， contention 的影響是大於無意義的 function call 原因是發生 mutex contention 時，該 thread 會被卡在同一個地方，直到 lock 擁有者釋放 lock 和組員討論過後的第一個想法是使用 condition variable，讓不必要的 thread 進入 wait 的狀態（此時該執行緒就不會使用任何 CPU 資源）這樣就不會浪費 CPU 資源再多次嘗試獲取 mutex 上 --> 使 mutex contention 大幅下降 --> 各 processor 有效執行效率也會變高 >> 晚點補上測資 [name=`fzzzz`] 剩下的問題就是若改成 lock-free 的版本的話，效能是否更進一步提升？ >> 在[andy19950的共筆](https://hackmd.io/s/rJ6m3IvC#future-work)裏有提到改成 lock-free 卻無法提升效能 >> 目前想法是，問題應該是出在沒意義的 function call 就變得更多了（待驗證）[name=`fzzzz`] >> 使用`htop`指令來看 CPU 的使用量，可以發現 mutex 版本的 cpu 使用量會是滿載，而 condition variable 的狀況卻是約一半(每個core)[name=`nekoneko`] >> 這個程式主要探討 lock 的影響，thread queue pop 操作會佔用大部分 CPU 資源的原因是，內部不是採用 coarse locking [name=jserv] ### 利用 condition variable 的實驗以下測試皆為利用 mutrace 對128個 thread 做100次追蹤的結果 ``` mutrace ./sort 128 input.txt ``` :::info 使用到128個執行緒的原因是，我們想觀看大量並行的效率如何 ::: #### 原版的 contention 數量與執行時間的關係 ![](https://i.imgur.com/YJFg703.png) >> 基本上是正相關，但有些許飄動可能是 clock_gettime() 所導致的浮動 #### 換成使用 condition variable 的版本 ![](https://i.imgur.com/naIEVSn.png) >> 這邊的 contention 數量愈多時間基本上還是會上漲，但會上下飄動，理由推斷是 thread 喚醒時的 contex switch 時間不一 [name=`fzzzz`] ### 資料分佈 #### 假設 - cut_func是將link list的資料切成左半llist和右半llist，再透過push task到work queue將資料交給下一個cut_func做處理。若當傳進去的llist 長度為1或是cut_count大於等於max_count時，cut_func則呼叫merge做單個thread mergesort和thread與thread之間的merge。 ```c=103 void cut_func(void *data) { llist_t *list = (llist_t *) data; pthread_mutex_lock(&(data_context.mutex)); int cut_count = data_context.cut_thread_count; if (list->size > 1 && cut_count < max_cut) { ++data_context.cut_thread_count; pthread_mutex_unlock(&(data_context.mutex)); /* cut list */ int mid = list->size / 2; llist_t *_list = list_new(); _list->head = list_nth(list, mid); _list->size = list->size - mid; list_nth(list, mid - 1)->next = NULL;//cut list->size = mid; /* create new task: left */ task_t *_task = (task_t *) malloc(sizeof(task_t)); _task->func = cut_func; _task->arg = _list; tqueue_push(pool->queue, _task); /* create new task: right */ _task = (task_t *) malloc(sizeof(task_t)); _task->func = cut_func; _task->arg = list; tqueue_push(pool->queue, _task); } else { pthread_mutex_unlock(&(data_context.mutex)); merge(merge_sort(list)); } } ``` - 在我們的認知中，mergesort的示意圖應該如下 - list->size > 1 的情況，如左圖 - cut_count < max_count ，其中max_count假設為3(4條thread)，如右圖 ```graphviz digraph mergesort { node [shape = record,height=.1]; node0 [label = "1|2|3|4|5|6"]; node1 [label = "1|2|3"]; node2 [label = "4|5|6"]; node3 [label = "1|2"]; node4 [label = "3"]; node5 [label = "4|5"]; node6 [label = "6"]; node7 [label = "1"]; node8 [label = "2"]; node9 [label = "4"]; node10 [label = "5"]; node0 -> node1; node0 -> node2; node1 -> node3; node1 -> node4; node2 -> node5; node2 -> node6; node3 -> node7; node3 -> node8; node5 -> node9; node5 -> node10; node11 [label = "1|2|3|4|5|6"]; node12 [label = "1|2|3"]; node13 [label = "4|5|6"]; node14 [label = "1|2"]; node15 [label = "3"]; node16 [label = "4|5"]; node17 [label = "6"]; node11 -> node12; node11 -> node13; node12 -> node14; node12 -> node15; node13 -> node16; node13 -> node17; } ``` - 這次的mergesort-concurrency是用thread pool實做。cut_func會不斷新增cut_func到task_queue，給work thread去執行。cut_func停止cut的其一判斷條件為cut_count，cut_count是一個會被多個執行cut_func更動的共享全域變數。所以可能會發生cut_func分到list長度很大，但卻不會繼續切半list，就因為cut_count被其他cut_func更新到等於max_count。所以同樣max_count為3的情況下，list可能會被分成圖下情況 ```graphviz digraph mergesort { node [shape = record,height=.1]; node0 [label = "1|2|3|4|5|6"]; node1 [label = "1|2|3"]; node2 [label = "4|5|6"]; node3 [label = "1|2"]; node4 [label = "3"] node5 [label = "1"]; node6 [label = "2"]; node0 -> node1; node0 -> node2; node1 -> node3; node1 -> node4; node3 -> node5; node3 -> node6; } ``` #### 驗證 - 在cut_func呼叫merge(merge_list(list))前，印出list長度 ```c=133 pthread_mutex_unlock(&(data_context.mutex)); + printf("%d\n", list->size) merge(merge_sort(list)); ``` - 以 128 threads為例，擷取一小段 - 比較好的分佈情況 ``` 2734 1367 2734 1367 1367 1366 1367 2734 1367 1367 2733 2734 2733 1367 1367 1366 2733 ``` - 比較壞的分佈情況 ``` 85 10935 10934 86 10934 10935 342 342 342 86 341 171 341 171 170 171 ``` > 這邊應該要用圖表示比較精確，擷取部份只是為突顯分佈不均的情況[name=cheng hung lin] > 有比對過 contention 數量和分佈的關係嗎? [name=fzzzz] > >還沒比較兩者之間的關係，太急著想改善資料分佈的情況，直接實做相對均勻的分佈，並測試原本和新的兩者之間執行時間的分佈情況[name=cheng hung lin] #### 改善 [code 連結](https://github.com/FZzzz/mergesort-concurrent/blob/data_scala_test/main.c#L109-L145) - cut 策略示意圖，在4條thread的情況，也就是max_count為3 ```graphviz digraph mergesort { node [shape = record,height=.1]; node0 [label = "1|2|3|4|5|6"]; node1 [label = "1|2"]; node2 [label = "3|4|5|6"]; node3 [label = "1"]; node4 [label = "2"]; node5 [label = "3"]; node6 [label = "4|5|6"]; node0 -> node1; node0 -> node2; node1 -> node3; node1 -> node4; node2 -> node5; node2 -> node6; } ``` points ![](https://i.imgur.com/63pc45o.png) linespoints ![](https://i.imgur.com/B6I2AB2.png) > 跟fzzzz討論，認為要將資料分佈的變異數(或標準差)計算出來，因為圖表只能呈現大概的結果，但還是要靠數學量化。[name=cheng hung lin] ## mutrace 參數說明 - `Locked`: how often the mutex was locked during the entire runtime - `Changed`: how often the owning thread of the mutex changed - `Cont.`: how often the lock was already taken when we tried to take it and we had to wait - `tot.Time[ms]`: lock的總時間 - `avg.Time[ms]`: 平均每個lock的時間 - `max.Time[ms]`: lock最久的時間 ## References * 恐龍本 * [Conditional Variable](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-cv.pdf) * [Covariance](http://belleaya.pixnet.net/blog/post/30950579-%5B%E6%95%99%E5%AD%B8%5D-%5B%E7%B5%B1%E8%A8%88%5D-pearson-%E7%9B%B8%E9%97%9C)