Merge-sort-concurrent

tag `Yen-Kuan Wu` `HahaSula` `mergesort-concurrent` `data strcuture`

目標

修改 data structure 讓 mergesort 能夠對 insert 和 delete 會更友善。

組員

黃呂源

[ ]~~lock-free 想法?冼鏡光投影片閱讀失敗 concurrentLL 失敗~~
[ ]~~atom memory access?~~
[x]分main_WorkQueue，thread_worker,roundrobin)

先將workingQueue & dispatch queue分開讓thread content 分散
debug中

吳彥寬

[x]trace mergesort、thread 的 code
[ ]code refactor: mergesort 的主體我將它們下外移(進行中…)有點小麻煩R

暫緩，我發現盲目下去改會改不好，先熟悉 code。Yen-Kuan Wu

[x]針對 lock 和 time 的 benchmark ~~(進行中，debug)~~
[ ]加上 condition variable
[ ]找看看有沒可以改善的 lock

當我們做出 lock 時看到 contention 並不高，要調整測試條件Yen-Kuan Wu

[ ]mergesort 熟悉完 lock 狀況後，分析可否使用不同的 data structure 來改寫

Code 觀察

main.c

main.c 整體架構如下

觀察四條支線

list_add 簡單的 init list
tpool_init 簡單的 init threadpool(但裏面很難…囧)

tpool_init(pool, thread_count, task_run) 所 pool 有吃到 task_run 這個 function Yen-Kuan Wu

cut 可以說是這份 mergesort 的精華，我們看到他有 call 到 merge 和 tqueue_push，而 push 這個動作即是將 task 丟進 threadpool 裡



    new_task->func = cut;
    new_task->arg = the_list;
    tqueue_push(pool->queue, new_task);

所以 cut 也被丟到 pool 裡去了Yen-Kuan Wu

tpool_free 這裡面有實作 tqueue_free 等待所有任務做完
我們發現這整份 code 的切割方式很簡單，即是一個 threadpool 當作執行任務的載體，而且特別有丟入 task_run 進去，其他 task 就直接丟進去執行，並且使用 lock 的方式確保正確性。

看完整體架構跟我們的目標後，我們想到要修改這份 code 的切入點應該是，tqueue 是 link-list 和 mergesort 用的也是 link-list，我們將從這兩點切入，先熟悉 code 再提出改善。Yen-Kuan Wu

整份 main.c 裡有一些變數看不太懂

max_cut = thread_count * (thread_count <= data_count) + data_count * (thread_count > data_count) - 1;

~~計算全部thread切了幾刀，兩個data只需要切一刀~~HahaSula
這我有點聽不懂耶，等等我們討論一下Yen-Kuan Wu

thread_data.cuthread_count = 0;

~~判斷cutthread_count = max cut 才會把cut轉成merge task push 到tqueue裏面~~HahaSula
更正cut為把整分資料先大刀切給thread(讓多個thread做merge_sort 去切小塊)
再給merge_sort下去切成實際被sort的的單位

static void task_run(void data)

簡單的 while 迴圈，pop task 出來做

這邊我好奇的是，如果 pool 剛 init 時是沒有 task 的吧，那為什麼不會直接結束呢？我想這邊應該要多增加一個解說或是一個 function 來說告知，現在 queue 是屬於什麼狀態，可以 free 還是說只是沒有沒有任務進來。

沒有task並不回造成break，task->Fun = NULL才會break;HahaSula
感謝！我在想我們的改進方向，或許可以朝查詢這個 task queue 的狀態來發展。Yen-Kuan Wu

void cut ( void * data )

觀察

我們發現 cut merge 都有 call tqueue_push，也就是任務有分 cut 和 merge(這觀察很像是廢話…)
再來是 merge_sort 這看不出來了，再來就是 trace 整份 code 了
首先看到我們第一個 critical section





    pthread_mutex_lock(&(thread_data.mutex));
    int cutLocal = thread_data.cuthread_count;
    if (list->size > 1 && cutLocal < max_cut) {
        ++thread_data.cuthread_count;
        pthread_mutex_unlock(&(thread_data.mutex));

這邊我會們會發現，這邊很簡單就可以改成 atomic 了，由於裡面用到的共用變數只有 thread_data.cuthread_count(mutex lock 是針對 thread_data)Yen-Kuan Wu

很順利的我們改出了第1版 lock-free 的改進




struct {
    //pthread_mutex_t mutex;
    _Atomic int cuthread_count;
} thread_data;

這邊我再改 code 時發現，我們根本沒有用到 list 的 lock




struct {
    pthread_mutex_t mutex;//FIXME: No use in code
    llist_t *list;
} tmp_list;

而且有一個部份的 code 很奇怪，明明 critical section 綁的是 tmp_list，可是卻 lock thread_data.mutexYen-Kuan Wu
有趣的是thread_data.mutex 已經cut完不會再被前面使用所以好像不會衝突HahaSula









[main.c: 73: merge()]
        //pthread_mutex_lock(&(thread_data.mutex));FIXME:No need lock
        llist_t *tmpLocal = tmp_list.list;
        if (!tmpLocal) {
            tmp_list.list = list;
            //pthread_mutex_unlock(&(thread_data.mutex));
        } else {
            tmp_list.list = NULL;
            //pthread_mutex_unlock(&(thread_data.mutex));

目前我是將其給 comment 掉了，但是是否需要換成 tmp_list.mutex 這我還不能確定。Yen-Kuan Wu

我自己解掉他了，我發現這有可能發生 ABA 的問題，所以是需要 lock 的，那這邊一樣，我會想要改成 atomic 的版本Yen-Kuan Wu
struct {
   //pthread_mutex_t mutex;//FIXME: No use in code
   llist_t * _Atomic list;
} tmp_list;
Yen-Kuan Wu

經過我這個 Atomic 的修改，我發現我只有解決了不會有 double assign 而沒有解決順序的部份，這可能還要再研究一下，也許要做個 heavy test 來檢查正確性，也許要將這一段 code 都改成 atomic
heavy test for correction

merge & cut 觀察

merge分兩階段

thread1 merge會把拿到的list放到tmp_list.list裏面
thread2 merge會把tmp_list.list的list跟傳入的list做merge_list
將return list做一個merge task push回去tqueue中(把tmp_list.list清空給下一個merge)
(結果：每吃兩個merge做一次merge_list)
(不知道會不會有merge1 跟merge2 size不一的情況存在)

呂源，麻煩你詳細一點，不然我也看不懂= =，至少讓我知道那幾個 function 再做什麼或者是他整體架構再幹嘛Yen-Kuan Wu
簡單來講，你就要把我當作一個有一點基礎，而且只有略窺過這份 code，不用 detail 但要跟我說你怎麼觀察到的再哪個 fucntion裡，或者能簡介一下這份code 的架構或 functio 再幹嘛。Yen-Kuan Wu

thread.c 觀察

condition variable沒有使用
(如果要修改tqueue的lock不知道還有沒有要用)

程式描述

thread pool init

cut 將資料均分給各個thread(大刀)

各個thread 把自己的資料做merge sort(分別sort)

兩個兩個sort好的thread 合併list(sort完的thread 用merge合併)

list.c

node_t






typedef intptr_t val_t;

typedef struct node {
    val_t data;
    struct node *next;
} node_t;

至於 intptr_t 是指向 int 的 pointer 資料閱讀有解說。Yen-Kuan Wu

llist_t




typedef struct llist {
    node_t *head;
    uint32_t size;
} llist_t;

我認為這個命名不是很漂亮，~~也許作者的意思是 leader list 吧，我猜，這是需要 refector 的部份。~~Yen-Kuan Wu

我錯了，linux kernel 裡就是用 llist 來代表 link list…

[x]這邊的 szie 需要確認一下他的用處為何，如果只是 link list 照裡來說是不需要 size 的Yen-Kuan Wu

這邊的 size 能用來幫助 check index 是否超過，就不用跑到底發現沒有。Yen-Kuan Wu

修改 llist_t 命名

算了吧，linux kernel 都這樣用了…Yen-Kuan Wu

node_t *new_node(val_t val, node_t *next)

這邊很特別的是，新創一個node插入再開頭，所以參數命名沒有錯，的確是新創一個 node 且 assign 之後 node->next 就會接到 next 去。Yen-Kuan Wu

剛好看到一個用法 new_node(val, NULL)，這樣的的確就是 new node 了。Yen-Kuan Wu

這不光光只有 new node 而且還有串接，所以看 function 名稱是看不出來的。Yen-Kuan Wu

refector，更好的名稱或者拆開

llist_t *list_new()

基本的 init listYen-Kuan Wu

int list_add(llist_t *the_list, val_t val)

簡單的 appendYen-Kuan Wu

node_t *list_get(llist_t *the_list, uint32_t index)

output index 的 node 的 value，所以這邊會用到 size 先check index 是否大於 size 如果有就return NULL Yen-Kuan Wu

改進 link list 的想法

觀察完上面的 code 我們會發現他的缺點就是在，search 的時間，因為你要用 link 一個一個移到要找的地方

arrary 的方式如何呢？但 arrary 的缺點就是 new node 不太方便，目前想法是，可以使用 link list + arrary 的方式，但是這樣會讓 mergesort 很難改，待考慮Yen-Kuan Wu

tree 的方式如何呢？當然我們可以直接實作 heap sort，但假如我們把他用在 mergesort 也許也會有好處，想像right leaf 跟 left leaf 即是一個分割，這可以考慮看看Yen-Kuan Wu

survey linux kernel 的 alg 和 struct 或許裏面有很多我們沒想到的特殊演算法或 structYen-Kuan Wu

Linux Kernel data structure survey

主要參考 <lourie> 在 FB 裡貼出來的連結(http://cstheory.stackexchange.com/questions/19759/core-algorithms-deployed/)，他整理了 linux kernel 裡使用到的重要 data structure 和 algorithm

首先我發現，Linux kernel 裡的 comment 居然都有符合 Doxygen 的格式，代表我能使用 doxygen 輔助我了解 code 的相依性。

實驗

極端狀況測試

負的可不可以排？ O
thread_num > data_count ? O
吃得下 35萬個 data 嘛？ O


(for i in {1..350000};do echo $RANDOM;done )| ./sort 4 350000

一個 thread 能跑嘛？ O
超過 int 大小的能跑? X

Dispatch queue實作 Data 1000 thread 1 2 4 8 16(仍然有mutex)






ori 版本
mergesort for 1000 data with 1 threads take 1509 us
mergesort for 1000 data with 2 threads take 1001 us
mergesort for 1000 data with 4 threads take 1424 us
mergesort for 1000 data with 8 threads take 1103 us
mergesort for 1000 data with 16 threads take 1499 us

average = 926

average = 3027

average = 42540

average = 181584

average = 512155

mutrace

ori:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       2      419      144      112        0.074        0.000        0.001 Mx.--.
       0       40       25        9        0.007        0.000        0.001 M-.--.
       1       29       24        4        0.005        0.000        0.000 M-.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC

after:

  Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       4    41821       22       19        8.193        0.000        0.018 M-.--.
       1   109758        8        8       20.845        0.000        0.049 Mx.--.
       0   100650        8        8       21.070        0.000        2.847 Mx.--.
       3    83146        6        6       19.873        0.000        5.192 Mx.--.
       2    57248        6        3       10.219        0.000        0.073 Mx.--.
       6       25        8        1        0.005        0.000        0.001 M-.--.
       5       13       11        0        0.005        0.000        0.001 M-.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC

cache miss




















































Performance counter stats for './sort 1 1000':

              8516      cache-misses              #   25.416 % of all cache refs    
            3,3507      cache-references                                            
          235,0391      instructions              #    0.90  insns per cycle        
          262,4563      cycles                                                      

       0.003585538 seconds time elapsed

mergesort for 1000 data with 2 threads take 714 us

 Performance counter stats for './sort 2 1000':

              6770      cache-misses              #   19.269 % of all cache refs    
            3,5135      cache-references                                            
          257,8524      instructions              #    0.96  insns per cycle        
          269,6694      cycles                                                      

       0.002147490 seconds time elapsed

mergesort for 1000 data with 4 threads take 1020 us

 Performance counter stats for './sort 4 1000':

              5765      cache-misses              #   12.535 % of all cache refs    
            4,5991      cache-references                                            
          386,5704      instructions              #    0.76  insns per cycle        
          510,7594      cycles                                                      

       0.002345743 seconds time elapsed

mergesort for 1000 data with 8 threads take 788 us

 Performance counter stats for './sort 8 1000':

              5714      cache-misses              #   11.888 % of all cache refs    
            4,8066      cache-references                                            
          349,8934      instructions              #    0.83  insns per cycle        
          422,7095      cycles                                                      

       0.002090579 seconds time elapsed

mergesort for 1000 data with 16 threads take 1061 us

 Performance counter stats for './sort 16 1000':

              8564      cache-misses              #   12.828 % of all cache refs    
            6,6762      cache-references                                            
          506,4661      instructions              #    0.74  insns per cycle        
          687,1425      cycles                                                      

       0.002585683 seconds time elapsed

after:





















































Performance counter stats for './sort 1 10000':

            1,1637      cache-misses              #   28.579 % of all cache refs    
            4,0719      cache-references                                            
         3036,3061      instructions              #    1.03  insns per cycle        
         2941,6038      cycles                                                      

       0.010364469 seconds time elapsed

mergesort for 10000 data with 2 threads take 3299 us

 Performance counter stats for './sort 2 10000':

              7315      cache-misses              #   15.460 % of all cache refs    
            4,7316      cache-references                                            
         2587,7678      instructions              #    1.05  insns per cycle        
         2460,9844      cycles                                                      

       0.004680364 seconds time elapsed

mergesort for 10000 data with 4 threads take 20934 us

 Performance counter stats for './sort 4 10000':

            2,5720      cache-misses              #   31.558 % of all cache refs    
            8,1501      cache-references                                            
       1,4593,1072      instructions              #    0.86  insns per cycle        
       1,6953,7714      cycles                                                      

       0.022214962 seconds time elapsed

mergesort for 10000 data with 8 threads take 140475 us

 Performance counter stats for './sort 8 10000':

            3,9885      cache-misses              #   33.887 % of all cache refs    
           11,7699      cache-references                                            
      10,0577,8308      instructions              #    0.82  insns per cycle        
      12,3025,4065      cycles                                                      

       0.142290916 seconds time elapsed

mergesort for 10000 data with 16 threads take 505297 us

 Performance counter stats for './sort 16 10000':

            3,8847      cache-misses              #   18.931 % of all cache refs    
           20,5208      cache-references                                            
      36,6412,2022      instructions              #    0.80  insns per cycle        
      45,9637,9790      cycles                                                      

       0.506762169 seconds time elapsed

hi!簡單幫你們切割了一下數據,加了一些粗體字,應該會更一目了然!
可以再想想怎麼呈現實驗結果,因為資料很多筆,或許嘗試使用表格整理會不錯！目前我也還沒有想到更好的方法啦>"<課程助教

Benchmark

TIMING

看到老師寫的 fixme，才想到裡面多餘的 printf 也是會造成 overhead



/* FIXME: remove all all occurrences of printf and scanf
     * in favor of automated test flow.
     */

新增 TIMING 的 flag 來計算時間

Lock

我們使用老師給的 mutrace 來檢測

benchmark 設計

這邊使用 bash script 來跑實驗，我設計將實驗用到的參數拉出來

bechmark.cfg


TEST_SET=(1 2 4 8 16)
TEST_DATA_NUM=10000

benchmark-time.bash



#!/bin/bash
. benchmark.cfg
for i in ${TEST_SET[@]}; do ./sort $i ${TEST_DATA_NUM}; done

資料閱讀

Bash 指令

GCC 指令

3.12 Options Controlling the Preprocessor
-MF -MMD

這邊我找到更好的連結 ntu gcc解說，還有gcc 中文手冊Yen-Kuan Wu

引用 <kobeYu>

	-M 
	  生成文件關聯的信息。包含目標文件所依賴的所有源代碼你可以用gcc -M hello.c來測試一下，很簡單。 

	-MM 
	  和上面的那個一樣，但是它將忽略由#include<file>;造成的依賴關係。 

	-MD 
	  和-M相同，但是輸出將導入到.d的文件裡面 

	-MMD 
	  和-MM相同，但是輸出將導入到.d的文件裡面
	  
	-MF 
	 指定輸出到某個檔案,否則預設是原始檔案檔名.d

參考:kobeYu/mergesort-concurrency共筆Yen-Kuan Wu
main.c 81

mutrace

如果想要使用 mutrace 的話要在 gcc 後面加上 -rdynamic

Make sure to build your application with -rdynamic to make the backtraces mutrace generates useful. Arthur

作者的 blog

Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       1     2734     1342      774        0.715        0.000        0.018 Mx.--.
       0       80       55       21        0.019        0.000        0.001 M-.--.
       2       61       52        2        0.017        0.000        0.000 M-.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC

~~這邊有個問題是作者的用詞都是 how often 害我不知道他到底是次數還是時間，待我之後測試~~Yen-Kuan Wu
就後續來看其面三項應該是幾次。(作者後面有說了= =)

Locked column tells how often the mutex was locked during the entire runtime of about 10s.

這個 mutex 多常被 lock幾次。Yen-Kuan Wu

Changed column tells us how often the owning thread of the mutex changed.

這我不太清楚，也許是 mutex 轉移。Yen-Kuan Wu

Cont. column tells us how often the lock was already taken when we tried to take it and we had to wait.

這個 lock 被 require 時是已經 taken了的次數。Yen-Kuan Wu

tot.Time avg.Time max.Time 我就不說明了，也就是 lock 的時間

addr2line 一直都是亂碼
[ ]閱讀 More Mutrace

追蹤mutex

實際行數約在mutex_init下面一點
問題只有做tqueue threadcount的init
可是卻追蹤到3個mutex

原始code
Mutex #0 (0x0xa5b870) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fc9977cd4b2]
	./sort(tqueue_init+0x38) [0x401356]
	./sort(tpool_init+0x6a) [0x401596]
	./sort(main+0x156) [0x401e33]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fc997204830]
$ addr2line -e sort 0x401356
/home/sula/workspace/mergesort-concurrent-ken/thread.c:16
$ addr2line -e sort 0x401596
/home/sula/workspace/mergesort-concurrent-ken/thread.c:80
$ addr2line -e sort 0x401e33
/home/sula/workspace/mergesort-concurrent-ken/main.c:185

Atomic

C11 atomic

閱讀Atomic C11 PPT

[p.10]

_Atomic int

_Atomic int* pi; // pointer to atomic int
int* _Atomic pi; // atomic pointer (to int)

這邊我參照這個想法做了一個 muti-thread counter，一個是有 atomic 另一個是沒有的，code在這邊，而且事實比對有 atomic 會是正確的數值。Yen-Kuan Wu

[p.11]

_Atomic struct 的 member 不能獨立 access
看不懂must be copied to compatible object and then members will be accessed individually

Atomic memory access

__sync_add_and_fetch
問題？
編譯器最佳化OOO===>memory Barrier
簡單範例blog
atomic memory 的指令會對share_list 更動，不知道share_list是什麼。
在 barrier 之前的CPU 指令會保證比 barrier 之後的指令先完成。

Signal

lock

wiki lock

lock overhead
- 也就是只所有 lock 帶來的負擔，space、init/destroy time、time for reqiring and releasing lock …
lock contention
- 當一人握有 lock 時，另一人再進行 require
- wiki 有說到，儘量要讓 lock 鎖在細項，而不是整個 table，這樣也能有效減少競爭，但是我想的是越細索是

在直播聊天室提問了這個問題，我們是想

deadlock

intptr_t ?

主要參考使用變數型別的良好習慣

一張圖看出端倪

               32 bit          64 bit
char             1               1
short            2               2
int              4               4
long             4               8
long long        8               8
pointer          4               8












#include <stdio.h>
#include <stdint.h>
int main()
{
          int i = 10;
          int *p = &i;
          intptr_t pp = (intptr_t)p;
          printf("%p\n", p);
          printf("%lx\n", pp);
          printf("%p\n", &i);
          return 0;
}

這邊我們要考慮 ptr 是在 32bit 還 64bit 上，如果想把他轉成 int 的話，用 intptr_t 會比較 protableYen-Kuan Wu
裡頭還提到很多，這讓我思考我的變數應該要以更統一的方式來宣告，例如 int16_t 之類的，而不會取決於機器Yen-Kuan Wu