mergesort-concurrent

contributed by
<eeuserp>, <ruby0109>, <Sarahyuhancheng>, <bobsonlin>

組員

簡伯丞(eeuserp)
程鈺涵(Sarah)
林伯陞#
施雨妏#(Ruby)

工作分配

放入phonebook, 亂數分配 - by Ruby
Thread pool - by Ruby and eeuserp

指出thread pool 管理 worker thread 實作的不足之處
參照 concurrent-ll，提出 lock-free 的實作
學習 concurrent-ll (concurrent linked-list 實作) 的 scalability 分析方式，透過 gnuplot 製圖比較 merge sort 在不同執行緒數量操作的效能

設計自動測試 supermalloc -by Sarah

Git 學習

參考Git 情境劇

branch 新增
- git branch "Team20"
- git push --set-upstream origin Team20將一個已存在的 branch 設定成 tracking 遠端的branch。
$ git branch –track branch名稱
遠端branch 建立一個 tracking 遠端 branch 的 branch，這樣以後 push/pull都會直接對應到該遠端的branch。
$ git checkout branch名稱
切換到另一個 branch(所有修改過程會被保留)。
情境：想要把 github 上最新版本的專案 pull 下來
- $ git branch : 察看現在所在的 branch
- $ git checkout [branch名稱] : 切換 branch
- $ git pull : 看到終端機顯示Already up-to-date. 代表 pull 成功
- error: Your local changes to the following files would be overwritten by merge :

討論紀錄

Refactoring

參考駝峰式大小寫-比如將thread_count改成ThrdCount
把the_pool 改成pool
加入註解更清楚-有些變數和函式應該更清楚的說明
cut_func
- 裡面的_list和list改寫成Llist和Rlist比較清楚
- left 和right list在註解寫反了
可以另外建mergesort.c 和.h, main 看起來會更乾淨

I have a Mergesort~ I have a phonebook~

參考andy19950同學的共筆
由於Mergesort原本是排序int類型的資料, 考慮到phonebook是輸入lastName, 改為char *
- list.h
  typedef intptr_t val_t 改成 typedef char *val_t




typedef struct node {
    val_t data;
    struct node *next;
} node_t;

改成




typedef struct node {
    char data[MAX_LAST_NAME_SIZE];
    struct node *next;
} node_t;

輸入改成phonebook的words.txt檔















    /* check file opening*/
    fp = fopen(InputFile,"r");
    if(!fp){
        printf("Cannot open the file\n");
        return -1;
    }

    /* Read the File*/
    while(fgets(line,MAX_LAST_NAME_SIZE,fp)){
        line[strlen(len)-1] = '\0';
        list_add(the_list,line);
	/* Get the quantity of the data*/
        DataCount++;
    }

接下來的問題是, 要如何比較兩個character array的大小呢？
- 參考How to sort an array of string alphabetically (case sensitive, nonstandard collation) 裡面自訂排序或是開頭大小寫的排序方法
- 在phonebook 的lastName 中, 理應開頭都是大寫
- 因此用strcmp比對：
  在MergeList(llist_t *a, llist_t *b)中更改





/* If a->head->data smaller than or equel to b->head->data, return 1*/
int cmp = (strcmp(a->head->data, b->head->data) <= 0) ;
llist_t *small = (llist_t *)((intptr_t) a * cmp +
                             (intptr_t) b * (1-cmp));

最後把結果輸出到一個檔案中











void list_print(llist_t *list)
{
    node_t *cur = list->head;
    FILE *output;
    output = fopen("sorted.txt","a");
    while (cur) {
        fprintf(output, " %s\n", cur->data);
        cur = cur->next;
    }
    fclose(output);
}

程式執行時先
```
  $ uniq words.txt | sort -R > input.txt
```
之後再用./sort [ThrdCount] [InputFile]

因為用sort -R 會有其他問題, 改的指令在下面哦～
Ruby

均勻分佈亂數與測試

均勻分佈亂數

參考How can I shuffle the lines of a text file on the Unix command line or in a shell script?與

‘-R’
‘–random-sort’
‘–sort=random’
Sort by hashing the input keys and then sorting the hash values. Choose the hash function at random, ensuring that it is free of collisions so that differing keys have differing hash values. This is like a random permutation of the inputs (see shuf invocation), except that keys with the same value sort together.

If multiple random sort fields are specified, the same random hash function is used for all fields. To use different random hash functions for different fields, you can invoke sort more than once.

The choice of hash function is affected by the –random-source option.
sort -R 是先用Hash functon把內容轉成Hash value再去sort, 所以若有一樣的line就會有不均勻的分佈
$ uniq words.txt | sort -R > input.txt 中的unique會把一樣的資料合併成一個, 可是在phonebook很可能會有lastName相同可是其他資料不相同的情況, 因此這樣的作法可能會不符合實際的需求。
shuf 參考Why does the command shuf file > file leave an empty file, but similar commands do not? 較好,
更改指令如下：
- cat words.txt | shuf > input.txt

不太知道自動測試是測試亂數分佈均不均勻還是程式執行時間？
Ruby

bench:
        for i in 1 2 4 8 16 32 64; do \
                ./sort $$i input.txt; \
        done > bench.csv
        
plot:   bench
        gnuplot scripts/runtime.gp

正確性測試

因為寫的程式是要讓跑出來的檔案跟原本的檔案相同, 所以可以用
$ diff 參數檔案檔案
這個linux指令來看兩個檔案是否相同
輸入
＄ diff --brief 檔案檔案
一開始會顯示不相同, 不過是因為輸出檔案的字前面有加空格, 去掉之後就會相同

Monitor 與 boss/worker thread 介紹

Monitor - 為什麼pthread_cond_wait要unlock mutex
- 參考Monitor (synchronization)
- 普通我們用mutex這種thread佔據lock的方式保護critical section(mutual exclusion)
- 可是當threads運作還要考量到不同情況時（比如queue中有沒有task）, mutex會耗費CPU的資源, 因為一直在loop中
- Monitor除了mutex還有condition variable, 讓等待的threads sleep 減少CPU不必要的工作,
  condition variable:
  - Wait(c,m) c是condition variable ;m是 mutex
    1.Atomically
    - 釋出mutex lock
    - 把threads從c的ready queue移到wait queue
    - sleep this thread
    2.當signal發生, 會再自動重新拿回mutex lock
  - Signal
    - 把1個或多個thread從wait queue拿到ready queue
  - Broadcast
    - 用在a thread waiting for the wrong condition might be woken up and then immediately go back to sleep without waking up a thread waiting for the correct condition that just became true.
boss/worker model理解
- 參考Chapter 2 - Designing Threaded Programs
  如下圖：
  - boss :
    - 接收request（程式的input）
    - create worker threads
    - assigns task 給這些 worker threads
    - 需要時會等待worker threads 結束
  - Worker thread:
    - 等待boss的wake-up sign（運作方式包含剛才的monitor）
  - 在一開始就先create 所有thread可以縮短run-time - 如thread pool
  - 要減少boss和worker thread的溝通頻率（boss不應該被worker blocked）> 減少owrker間相依性
理解Thread Pool
- 可以先參考C 的 Thread Pool 筆記中提到的threadpool-mbrossard程式碼
- 基礎觀念:
  - "使用thread pool" 比 "每運作一個task就產生一個thread" 好
  - 在 thread pool 中的 thread 從 task queue 中拿一個 task 來執行，執行結束後再去 task queue 中拿另一個 task 來執行( Wikipedia原文:The threads in the pool take tasks off the queue, perform them, then return to the queue for their next task.)
- - 程式運作：
    先來看wiki 使用thread pool處理Task的圖
    
    再來看程式的struct, 首先threadpool中會紀錄兩個array, 一個是threads另一個是task queue。

struct threadpool_t {
    pthread_mutex_t lock;
    pthread_cond_t notify;
    pthread_t *threads;//放置所有thread空間的指標
    threadpool_task_t *queue;//放置所有task空間的指標
    int thread_count;
    int queue_size;
    int head;
    int tail;
    int count;
    int shutdown;
    int started;
};

threads的struct是pthread_t;
task queue的struct則是由funciton和argument組成

typedef struct {
    void (*function)(void *);
    void *argument;
} threadpool_task_t;

再來看執行的流程圖

TypeError: y.shiftX is not a function

op5不太理解在等待時threadpool_add時為啥要unlock
eeuserp

我覺得我op6可能寫錯了我再看一下><
Ruby

還好你有問!! 發現自己對於Synchronization Objects根本不太了解。我整理個筆記到上面～
Ruby

程式碼理解

以下是我的理解，有錯請指正～
先了解Mergesort的運作方式, 從main.c 的Launch the first task中可以看到包含 cut_func 和我們要排序的list被放入pool給task的queue中。

cut_func:
- 重複做把list拆解成左右兩個sublist, 再把包含sublist和cut_func的task放到queue中, 很像recursion。
- 直到cut_func運作的條件判斷Llist->size > 1不再成立
- 或是cut_count < max_cut不再成立
- 開始做merge(merge_sort(Llist)) >> 給main thread做
- cut_count < max_cut
  - cut_count是data_context.cut_ThrdCount
  - max_cut 是thread 的數量或是data 的數量(如果data比thread多, 那就是thread的數量;反之data)
- 理解如下：
  thread 會進入if 中 ThrdCount-1次, 之後就是各自再從merge_sort繼續往下切並merge 回來, 以ThrdCount 4 為例, max_cut 是3

merge_sort : 讓threads 切到list->size < 2再做merge。

放入thread pool task queue中的functon
1.llist_t *MergeList(llist_t *a, llist_t *b)
2.llist_t *merge_sort(llist_t *list)
3.void merge(void *data)
4.void cut_func(void *data)
5.static void *task_run(void *data)

1.將兩個已經排序過的 list 做合併排序
2.將一個 list 重複拆解然後排序後重新組合
5.thread 從 task queue 挑選 taskeeuserp

主要運作函式有三sarah

merge_sort()：純粹將取得的 list 一直對半分，直到長度小於 2。結束時呼叫 merge_list() 組合左右兩邊並回傳組合後的 list。
merge_list()：merge local list，將左右兩個 list 做排序合成一個 list，會回傳新的 llist_t，傳入的兩個會被 free 掉。
merge()：merge lists between threads。地一個進來的 thread 會檢查 tmp_list 是否為空，有則暫存自己的 local list，如果沒有則取得暫存的 list 和自己的 local list，配到 merge_list() 作為下一個工作。

問題 & Refactoring

在計算 max_cut 算式過於冗長，可以用 MIN macro 取代
cut_func 在分左右時，註解寫左邊但傳入的是右半的 list，右邊則傳入左半
main.c 過長，可以將 merge sort 的程式碼移至獨立檔案
- 將 merge_list() 與 merge_n_sort() 移到 merge_sort.c/h
- 並重新命名 merge_list() 為 merge_n_sort()，merge_sort() 為 split_n_merge()
在 merge_list() 中被 free 掉的 llist_t，其指標應設為 NULL，避免被使用
merge 兩個 thread 的 list 有沒有比較好的方法?

reference from lankuDotsarah

前面也有提到refactor,可以合併整理成一區,前後都有一些一樣的策略課程助教

Makefile
- -rdynamic ：看了一些資料可是不太懂
  在GCC Manual中
-rdynamic
Pass the flag -export-dynamic to the ELF linker, on targets that support it. This instructs the linker to add all symbols, not only used ones, to the dynamic symbol table. This option is needed for some uses of dlopen or to allow obtaining backtraces from within a program.

參考How to get more detailed backtrace [duplicate]才知道原來是obtaining backtraces需要用的。因為沒有很了解bracktrace, 去查資料找到老師的Who Call Me?, 用程式找funciton返回位址和依據返回位址查函式名稱。不過在這裡是為了給GNU Debugger(GDB)做backtrace。

附上與此相關的網站連結
1.Logan's Blog
2.你所不知道的C語言：動態連結器篇
給大家參考~ bobsonlin

thread_pool.[ch]
起初看thread_pool的queue的操作時，真的還蠻錯亂的，參考mingnus同學的共筆後，決定在命名的地方做相同的修改！
1. 從tail將新的task push進來, 從head將task pop出去
2. 朝tail方向指的pointer改名為prev

管理worker thread與 lock- free實作

以下是在理解程式碼的過程中, 發現可以改善的地方並一步步改善。

加上monitor

突然發現這個程式雖然有initial cond, 但只有用到mutex, 也就是當沒有task時, thread會一直耗費CPU的資源。可以照之前老師在phonebook貼的How to synchronize manager/worker pthreads without a join?來添加monitor
參考threadpool-mbrossard的想法
task_run 加上pthread_cond_wait


































static void *task_run(void *data)
{
    (void) data;
    while (1) {
        /* Threads snatch the lock*/
        pthread_mutex_lock(&(pool->queue->mutex));

        /* If the task queue is empty then unlock, 
           and let other threads come in to be swapped to the wait queue.
           When signal comes, threads wake up and own the lock one by one*/
        while( pool->queue->size == 0){
            pthread_cond_wait(&(pool->queue->cond), &(pool->queue->mutex));
        }
        if (pool->queue->size == 0) break;

        /* Thread grab the task*/
        task_t *_task = tqueue_pop(pool->queue);

        /* Unlock */
        pthread_mutex_unlock(&(pool->queue->mutex));

        if (_task) {
            if (!_task->func) {
                tqueue_push(pool->queue, _task);
                break;
            } else {
                _task->func(_task->arg);
                free(_task);
            }
        }
    }
    pthread_exit(NULL);
}

tqueue_push 加上pthread_cond_signal















int tqueue_push(tqueue_t *queue, task_t *task)
{
    pthread_mutex_lock(&(queue->mutex));
    task->prev = NULL;
    task->next = queue->tail;
    if (queue->tail)
        queue->tail->prev = task;
    queue->tail = task;
    if (queue->size++ == 0)
        queue->head = task;
    pthread_cond_signal(&(queue->cond));
    pthread_mutex_unlock(&(queue->mutex));
    return 0;
}

tpool_free








int tpool_free(tpool_t *pool)
{
    pthread_cond_broadcast(&(pool->queue->cond));
    pthread_mutex_unlock(&(pool->queue->mutex));
	.
	.
	.
}

結果：

發現時間似乎是錯的, 其他人的共筆都是約0.5左右。
回去看程式碼才發現應該要把
clock_gettime(CLOCK_REALTIME, &end);
放在tpool_free(pool);的後面，否則只是紀錄到main thread 的時間而沒有整體的

改完之後：

這張圖片的數據很有意義，請修改其版面配置，使其數據具有可讀性課程助教

好, 已調整～
Ruby

圖片中的數字間距還請你們再調整一下～
另外如果數字還是會打到圖例的話,可以把圖例移至左邊,甚至移到表外也可以喔！
參考連結:http://blog.csdn.net/iemyxie/article/details/41548583
文中set key的部份
課程助教

嗨助教～我有調整了一下~ 不過圖例移出去的話數字會擠在一起, 所以我把取的位數變少, 如果像下面這樣呢～？
Ruby

如果從原本的右上角移至左上角，就不用減少位數了～
課程助教

好開心ˊˇˋ
原本threads數越高, 時間就會飆漲, 改完之後就完全不會了!!!
原因應該是因為monitor可以讓threads在wait queue中sleep，所以不會耗費到CPU的時間～

merge裡面的data_context lock似乎沒有作用
- 刪掉後程式運作正常也正確
tqueue_size 在這個程式沒有用到也不需要

Lock-free實作

先來了解一下lock-free是什麼, 參考wiki的Non-blocking algorithm與 An Introduction to Lock-Free Programming
- lock-free利用atomic read-modify-write，如compare and swap (CAS)來省去lock可能產生的等待時間和lock衍生出的問題
- compare-and-swap (CAS)
  - 是一個atomic operation
  - 比較某個記憶體區塊的內容和一個給予的值是否相同, 如果相同就會把記憶體區塊的內容改成另一個新的值。
- 如果是一些比較弱的結構, 可以不用atomic operation, 比如single-reader single-writer ring buffer FIFO, with a size which evenly divides the overflow of one of the available unsigned integer types, 可以用 memory barrier 做到lock-free。

在wiki看到上面這段, 本來想說試試看memory barrier做lock free, 可是找不太到有沒有其他人這樣做的
Ruby

lock-freedom還沒有很能體會
參考concurrent-ll怎麼作lock-free
- 在linked list concurrent參考linked list
- Harris' algorithm, 參考與圖片來源A Pragmatic Implementation of Non-Blocking Linked-Lists by TL Harris:
  - 原本lock-free用一個CAS來做node deletion可能會發生問題
    
    如上圖所示, 如果想要刪掉10, 可是20也在同時加進去時, 20會遺失, 因為CAS在選擇30之後, 沒有辦法偵測或預防10和30間的變化。
  - 解法：
    - 在刪掉10的過程用兩個CAS
    - 一個CAS 先mark 10 的next指標 ( logically deleted )
    - 一個CAS 再刪掉10 ( physically deleted )
  - 所以在刪掉10的過程中, 若要加入20, 加入20的過程會偵測到10已經被logically deleted, 因此會對10做physically deleted之後再重新加入。
  - 後面的code看不是很懂, 回來先看concurrent-ll怎麼implement
- 在我們mergesort thread pool的實作裡面, task queue是最主要的共享資源, 也是lock free要控制的structure。在控制task queue時因為只有用到add 和delete （tqueue_push和 tqueue_pop）所以參考concurrent-ll會針對list_add和list_remove。
- 發現mergesort-concurrent其實不會有CAS deletion的問題, 因為我們是FIFO, push和pop會在兩端。
- 用單一個 CAS 試試看
  - 在concurrent-ll用的CAS是
  bool __sync_bool_compare_and_swap (type *ptr, type oldval type newval, …)
  type __sync_val_compare_and_swap (type *ptr, type oldval type newval, …)
  These builtins perform an atomic compare and swap. That is, if the current value of *ptr is oldval, then write newval into *ptr.
  The “bool” version returns true if the comparison is successful and newval was written. The “val” version returns the contents of *ptr before the operation.
目標：拿掉tqueue_push和tqueue_pop的lock
- 首先思考怎拿掉原本的monitor,
- 原本的task queue架構如下：
- 一開始想了很多種方式, 還不小心不知道在做什麼把linked list改掉了。
- 需要barrier signal thread work
  - 我的想法是在task run 那邊設簡單的barrier, 然後在tqueue_push的地方一樣有pthread_cond_signal












static void *task_run(void *data)
{
    (void) data;
    while (1) {
        /* Barrier to wait before tasks in queue*/
        pthread_mutex_lock(&(pool->queue->mutex));

        while( pool->queue->size == 0) {
            pthread_cond_wait(&(pool->queue->cond), &(pool->queue->mutex));
        }
        pthread_mutex_unlock(&(pool->queue->mutex));

在tqueue_pop中要有判斷queue->size, 因為可能在裡面的時候被改變
阿阿…胡思亂想試了好多方法, 後來還是在原本原本的的double linked list加上CAS。只是改了好久好久的code, 可是還是沒有出來QAQQ


































task_t *tqueue_pop(tqueue_t *queue)
{
    task_t *head, *headprev;

    while(1) {
        head = queue->head;
        headprev = head->prev;
        if(head){
            if( CAS_PTR(&(queue->head),head,headprev) != head) break;
            FAD_U32(&(queue->size));
            if(headprev) headprev->next = NULL;
        } else {
            queue->tail=NULL;
        }
        return head;
    }
}

int tqueue_push(tqueue_t *queue, task_t *task)
{
    while(1) {
        task->prev = NULL;
        task->next = queue->tail;

        if(CAS_PTR(&(queue->tail),task->next,task) == task->next){
            FAI_U32(&(queue->size));
            CAS_PTR(&(queue->head),NULL,task);
            if(task->next) task->next->prev = task;
            pthread_cond_signal(&(queue->cond));
            return 0;
        }
    }
}

現在的error如下：

5293 Thread 2:
5293 Invalid write of size 8
5293 at 0x4015B3: tqueue_push (threadpool.c:108)
5293 by 0x401AD1: merge (main.c:94)
5293 by 0x401C8C: cut_func (main.c:135)
5293 by 0x401D73: task_run (main.c:163)
5293 by 0x4E3F183: start_thread (pthread_create.c:312)
5293 by 0x514F37C: clone (clone.S:111)
5293 Address 0xc98bc88 is 24 bytes inside a block of size 32 free'd
5293 at 0x4C2BDEC: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
5293 by 0x401D7F: task_run (main.c:164)
5293 by 0x4E3F183: start_thread (pthread_create.c:312)
5293 by 0x514F37C: clone (clone.S:111)
5293
5293 Thread 1:
5293 Invalid read of size 8
5293 at 0x4015F5: tqueue_free (threadpool.c:119)
5293 by 0x4017CB: tpool_free (threadpool.c:151)
5293 by 0x401F40: main (main.c:231)
5293 Address 0xc98bde0 is 16 bytes inside a block of size 32 free'd
5293 at 0x4C2BDEC: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
5293 by 0x40160B: tqueue_free (threadpool.c:120)
5293 by 0x4017CB: tpool_free (threadpool.c:151)
5293 by 0x401F40: main (main.c:231)
5293
5293 Invalid free() / delete / delete[] / realloc()
5293 at 0x4C2BDEC: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
5293 by 0x40160B: tqueue_free (threadpool.c:120)
5293 by 0x4017CB: tpool_free (threadpool.c:151)
5293 by 0x401F40: main (main.c:231)
5293 Address 0xc98bdd0 is 0 bytes inside a block of size 32 free'd
5293 at 0x4C2BDEC: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
5293 by 0x40160B: tqueue_free (threadpool.c:120)
5293 by 0x4017CB: tpool_free (threadpool.c:151)
5293 by 0x401F40: main (main.c:231)
5293
5293
5293 More than 10000000 total errors detected. I'm not reporting any more.
5293 Final error counts will be inaccurate. Go fix your program!
5293 Rerun with –error-limit=no to disable this cutoff. Note
5293 that errors may occur in your program without prior warning from
5293 Valgrind, because errors are no longer being displayed.

詢問老師之後, 發現可以參考之前Toward Concurrency中提到的Hungry Birdlock-free實作的方式

重新改過code之後:




























task_t *tqueue_pop(tqueue_t *queue)
{
    task_t *head;

    head = queue->head;
    if(head){
        /* The situation head equal to tail, which means only one node in queue. */
        if(CAS_PTR(&(queue->tail), head, NULL) == head){
            /* Check whether the head is still the same*/
            if(CAS_PTR(&(queue->head), head, NULL) == head) {
                FAD_U32(&(queue->size));
                return head;
            } else {
                return NULL;
            }
        } else {
            if(CAS_PTR(&(queue->head->prev->next), head, NULL)){
                CAS_PTR(&(queue->head), head, head->prev);
                FAD_U32(&(queue->size));
                return head;
            }
	    else{
		 return NULL;
	    }
        }
    }
    return head;
}
















int tqueue_push(tqueue_t *queue, task_t *task)
{
    task->prev = NULL;
    task->next = queue->tail;

    if(CAS_PTR(&(queue->tail), NULL, task) == NULL){
        CAS_PTR(&(queue->head),NULL,task);
    } else {
        CAS_PTR(&(queue->tail->prev), NULL, task);
        CAS_PTR(&(queue->tail), task->next, task);
    }
    FAI_U32(&(queue->size));
    pthread_cond_signal(&(queue->cond));
    return 0;
}

執行的時候有時候會有錯誤, 可是不知道怎麼修改或用什麼工具來看比較好><

跑得時間：

lock free好像沒有快很多
為什麼呢？

用mutrace下去看

Mutex #2 (0x0x7f938c2813a0) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7f938ec786b9]
	/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7f938c07d07c]
	[(nil)]

Mutex #0 (0x0x1131490) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f938ec784b2]
	./sort(tqueue_init+0x38) [0x401541]
	./sort(tpool_init+0x6b) [0x4017c8]
	./sort(main+0x197) [0x4020ec]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f938e6b3f45]

Mutex #1 (0x0x603160) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f938ec784b2]
	./sort(main+0x153) [0x4020a8]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f938e6b3f45]

mutrace: Showing 3 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       2       20       10        4        0.006        0.000        0.001 M-.--.
       0       22       19        1        0.012        0.001        0.003 M!.--.
       1        7        4        0        0.001        0.000        0.000 M-.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC 

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 287.781 ms.

mutrace: Results for SMP with 4 processors.

mutrace: WARNING: 40 inconsistent mutex uses detected. Results might not be reliable.
mutrace:          Fix your program first!

我不知道要怎麼看是哪一個lock

不過裡面看起來較有問題的是#0
- 重整一次寫code的想法：
  在pop我的想法是照著hungry bird, 先用CAS測試tail和head有沒有一樣, 也就是queue裡面是不是只有一個task。如果只有一個task, 那CAS會先用atomic 的方式把tail放入NULL, 之後queue->head也要放入NULL。不過在這個地方要考慮到測試完tail之後, 另一個thread有可能會同時要pop或是push, 所以用一組CAS保護head的值, 只要有改變就會回傳NULL。如果不只一個task, 則是把prevhead的next改成NULL, 不知道這裡這麼多pointer, 會不會在中間出錯？>> 先改改看這裡>>還是沒變ˊˋ, 繼續。
  在push我用了CAS, 並多加了保護機制。結果還是有inconsistent, 還是沒有抓到問題點可能出在哪。

mutrace 在記憶體區段錯誤時的資訊：

mutrace: Showing statistics for process sort (pid 6806).
mutrace: 3 mutexes used.

Mutex #0 (0x0x7f85550ff3a0) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7f8557af66b9]
	/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7f8554efb07c]
	[(nil)]

Mutex #1 (0x0xe13490) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f8557af64b2]
	./sort(tqueue_init+0x38) [0x401541]
	./sort(tpool_init+0x6b) [0x4017eb]
	./sort(main+0x197) [0x40210f]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f8557531f45]

Mutex #2 (0x0x603160) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f8557af64b2]
	./sort(main+0x153) [0x4020cb]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f8557531f45]

mutrace: Showing 3 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       0       20        9        4        0.006        0.000        0.001 M-.--.
       1       23       20        0 14448984.781   628216.730 14448984.757 M!.--.
       2        7        5        0        0.001        0.000        0.000 M-.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC 

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 298.172 ms.

mutrace: Results for SMP with 4 processors.

用addr2line下去看, 不過跑出來行數的內容蠻奇妙的
- pthread_cond_init(&(queue->cond), NULL);
- pthread_attr_init(&attr);
- clock_gettime(CLOCK_REALTIME, &start);
釐清問題點
- 有時候出來是正確的, 有時後會core dump
- mutrace 出來有lock 鎖住的時間非常高(這個時間數是怎麼出來的呢？)
- gdb 在有問題時出來的資訊是
```
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6ff5700 (LWP 5079)]
0x00000000004015a1 in tqueue_pop (queue=0x604080) at 	threadpool.c:40
warning: Source file is more recent than executable.
40	            if(CAS_PTR(&(headprev->next), head, NULL)) {
```
- 我猜測因為在執行CAS_PTR這一段時在lock中發生錯誤, 所以mutrace total time很高
- 怎麼解決這個問題？

閱讀資料

參考concurrent B+ Tree 裡面亮谷同學有提到atomic operation可能遇到的問題
- 回頭去看lockless的資料
  - Lockless Programming Considerations for Xbox 360 and Microsoft Windows
    - On x86 and x64 there, is no guarantee that reads and writes larger than eight bytes are atomic.（because of the bus）
    - On x86 and x64, there is a single instruction (inc) that can be used to increment a variable in memory. If you use this instruction, incrementing a variable is atomic on single-processor systems, but it is still not atomic on multi-processor systems.
    - Making inc atomic on x86- and x64-based multi-processor systems requires using the lock prefix, which prevents another processor from doing its own read-modify-write sequence between the read and the write of the inc instruction.
    - prevent reordering of reads and writes are called read-acquire and write-release barriers
      - A read-acquire has the barrier after the lock operation, and a write-release has the barrier before
      - If one processor uses a write-release to release a data structure to shared memory, and another processor uses a read-acquire to gain access to that data structure from shared memory, the code will then work properly.
    - An Introduction to Lock-Free Programming
      - 真的下去做才比較了解這篇文章在講什麼, 可是processor間的問題還是需要弄清楚一點
不同task queue結構對效能的影響：
- 解法：
  - 參考之前的threadpool, 用一個ring來儲存task, 不過這樣要一開始就給一個task queue大小並多加判別有沒有滿的機制
  - 以下的處理都參考threadpool-mbrossard
  - 首先改tqueue的想法
    - 在pool initialize的時候先給定tqueue的大小
      - 考量thread的數目介於1~128間, 我把queue size設256
    - tpool_t 中的count名稱容易混淆, 改成thrdCount
    - 若要用成一個ring, queue要變成一個array,
    - 把tpool和tqueue structure合在一起
    - 用array的index(pool->head和pool->tail)來控制ring的head和tail
    - structure 改完之後先來看執行時間有沒有什麼變化:
      
      似乎有下降一些些

加入b自動測試

問題待解

max_cut = thread_count * (thread_count <= data_count) +
data_count * (thread_count > data_count) - 1;
- 為什麼要-1?
- load file若是像phonebook大量的資料是否還要判斷？
在task_run的function內，不懂當task->func為0時，為何要將此task push到queue裏面。是不是func為0的task是要讓下一個在thread pool裡的thread停止task_run。

參考資料

1.andy19950同學的共筆(參考如何修改成適合phonebook的型式)
2.TempoJiJi同學的共筆（參考設計自動測試)
3.LanKuDot /開發記錄(mergesort-concurrent) / github
4. Logan's Blog
5. 你所不知道的C語言：動態連結器篇
6. kobeyu同學的共筆
7. mingnus同學的共筆
8. eeuserp共筆