2016q3 Homework3 (mergesort-concurrent)

contribute by <jeff60907>

作業說明A07: mergesort-concurrent

  • merge sort 的實做改為可接受 phonebook-concurrent 的 35 萬筆資料輸入的資料檔
  • 設計自動測試的機制
  • 一併嘗試重構 (refactor) 給定的程式碼,使得程式更容易閱讀和維護
  • 研究 thread pool 管理 worker thread 的實做,提出 lock-free 的實做

開發環境

作業系統 Lubuntu 16.04
CPU:Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
cpu cores: 4
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K

先閱讀課程資料

發現約讀懂6-7成而已,很多都有印象學過卻很模糊QQ

一句話:「學過,但是從來沒有學會過」 jserv

不懂名詞

The pipeline depth is the number of stages
管線作業架構內含多個硬體階段電路,硬體階段個數就稱為pipeline depth(管線深度)
1.pipeline depth越大,表示各硬體階段完成子工作所需的時間越短,執行效能就越好
2.當pipeline depth增加超過一個臨界值,效能則不升反降
ps:若管線深度為k,通常效能只會增快sqrt(k)倍

  • Out-of-order execution (just OOO for short sometimes written OoO or OOE).

亂序執行 參考wiki 參考百度解釋
可以避免因為獲取下一條程序指令所引起的處理器等待,立即執行處理下一條指令。
例如遇到branch指令3時,可以先把其他非相關聯指令4-8運算好,等到branch指令3結束,就可以直接使用。這種作法可以讓CPU內部運算沒有浪費,但是會造成功耗上升

  • FSB,Front Side Bus

前端匯流排 參考wiki

內文提到
increasing a modern processor's clock speed by a relatively modest 30% can take as much as double the power, and produce double the heat

增加30%的 clock速度 可能會造成兩倍的功耗和熱,所以不能無限制的增加時脈速度。

POWER WALL

Up to a point this increase in power is okay, but at a certain point, currently somewhere around 150-200 watts, the power and heat problems become unmanageable, because it's simply not possible to provide that much power and cooling to a silicon chip in any practical fashion, even if the circuits could, in fact, operate at higher clock speeds. This is called the power wall.

ILP WALL

Pursuing more and more ILP also has definite limits, because unfortunately, normal programs just don't have a lot of fine-grained parallelism in them, due to a combination of load latencies, cache misses, branches and dependencies between instructions. This limit of available instruction-level parallelism is called the ILP wall.

閱讀mergesort-concurrent程式碼

照步驟先測試

$ make 
gcc -std=gnu99 -Wall -g -pthread -o list.o -MMD -MF .list.o.d -c list.c
gcc -std=gnu99 -Wall -g -pthread -o threadpool.o -MMD -MF .threadpool.o.d -c threadpool.c
gcc -std=gnu99 -Wall -g -pthread -o main.o -MMD -MF .main.o.d -c main.c
gcc -std=gnu99 -Wall -g -pthread -o sort list.o threadpool.o main.o -rdynamic

MMD MF 是什麼? 參考kobeyu介紹
-g 在編譯時已經加上去了,底下才能使用 mutrace 查詢到地址對應的行數

mutrace: Showing statistics for process sort (pid 17137).
mutrace: 3 mutexes used.

Mutex #0 (0x0x13fd9b0) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7faa075e84b2]
	./sort(tqueue_init+0x38) [0x401277]
	./sort(tpool_init+0x6a) [0x4014cc]
	./sort(main+0x161) [0x401c74]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7faa0701f830]

Mutex #1 (0x0x603120) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7faa075e84b2]
	./sort(main+0x11d) [0x401c30]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7faa0701f830]

Mutex #2 (0x0x7faa04bee860) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7faa075e86b9]
	/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7faa049eafec]
	[(nil)]

mutrace: Showing 3 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       0      133       15       10        0.026        0.000        0.002 Mx.--.
       1       13        9        1        0.003        0.000        0.000 M-.--.
       2       20        3        1        0.007        0.000        0.002 M-.--.
sort(tqueue_init+0x38) [0x401277]
$ addr2line -e sort 0x38
??:0
$ addr2line -e sort 0x401277
/home/jeff60907/mergesort-concurrent/threadpool.c:15

mutrace: 3 mutexes used. 會顯示使用多少個 mutexes
0x38 反應是??:0 是因為實際執行的地址是後面 401277才正確嗎?

所以要讀書呀!先讀手冊 addr2line,binutils 背後透過 Binary File Descriptor library (BFD) 實做。 jserv

  • 閱讀 addr2line
    If the file name or function name can not be determined, addr2line will print two question marks in their place. If the line number can not be determined, addr2line will print 0.

  • 閱讀Binary File Descriptor library (BFD)
    Big Endian和Little Endian的比較 and 最後一段小故事
    如果是Big Endian的系統:
    存到記憶體會變成 0x12 0x34 0x56 0x78,最高位元組在位址最低位元,最低位元組在位址最高位元,依次排列。
    如果是Little Endian的系統:
    存到記憶體會變成 0x78 0x56 0x34 0x12,最低位元組在最低位元,最高位元組在最高位元,反序排列。

  • 參考Measuring Lock Contention mutrace解釋

    • The 'Locked' column tells how often the mutex was locked during the entire runtime of about 10s.
    • The 'Changed' column tells us how often the owning thread of the mutex changed.
    • The 'Cont.' column tells us how often the lock was already taken when we tried to take it and we had to wait.
    • The fifth column tell us for how long during the entire runtime the lock was locked.
    • The sixth tells us the average lock time.
    • The seventh column tells us the longest time the lock was held.
    • Finally, the last column tells us what kind of mutex this is (recursive, normal or otherwise).

程式分析

list.h

Linked list 是由各個 node,透過指標串聯而成。先定義 node_t 型別

typedef intptr_t val_t; /* 不僅可放數值,也能放指標 */ typedef struct node { val_t data; /* 欲儲存的資料 */ struct node *next; /* next 指標 */ } node_t;

inptr_t 為什麼要使用這個型別呢? 參考SwimGlass很詳盡的內容!!
要將 pointer做型態轉換,因為電腦上 32bit 跟 64bit的pointer長度是不同的,強制轉型可能會造成資料loss的問題,所以透過intptr_t可以正確有效的轉換型態
表示法常見問題

list的操作,詳閱list.c實作內容

// 建立新的 list 物件 llist_t *list_new(); // 從 list 插入給定資料 int list_add(llist_t *the_list, val_t val); // 逐一印列 list 內容 void list_print(llist_t *the_list); // 產生新的節點 node_t *new_node(val_t val, node_t *next); // 取得某個 index 的節點資訊 node_t *list_get(llist_t *the_list, uint32_t index);

Thread pool

定義 task_t 來封裝 task:

typedef struct _task { void (*func)(void *); /* 對應到最終執行的函式 */ void *arg; /* 傳入的參數 */ struct _task *next, *last; /* 因為 queue 要用 doubly-linked list, 需要儲存 next 和 last */ } task_t;

doubly-linked list 用法不太熟悉,回頭複習資料結構課程

參考資料不要列出一堆中文的,請列出「有強度」的英文材料 jserv
收到! 陳致佑

詳閱threadpool.c

LitSnow講解操作內容 閱讀 queue介紹

看了threadpool.c POP 跟 PUSH 有些操作看不太懂參考了LitSnow同學的共筆,但疑惑的是queue 應該是 FIFO 先進先出,所以POP 應該是從head 先拿,push 放進tail才對,但是threadpool.c卻寫相反了

快去改,這個作業還有後續。以後要從 code refactoring 開始 jserv
收到陳致佑

Code Refactoring

修改threadpool.c pop 跟 push,把頭尾交換,pop從head提出來,push從tail放進去。

task_t *tqueue_pop(tqueue_t *the_queue) { task_t *ret; pthread_mutex_lock(&(the_queue->mutex)); ret = the_queue->head; if (ret) { the_queue->head = ret->next; if (!the_queue->head) { the_queue->tail = NULL; } the_queue->size--; } pthread_mutex_unlock(&(the_queue->mutex)); return ret; }
int tqueue_push(tqueue_t *the_queue, task_t *task) { pthread_mutex_lock(&(the_queue->mutex)); task->next = NULL; if (the_queue->tail) the_queue->tail->next = task; the_queue->tail = task; if (the_queue->size++ == 0) the_queue->head = task; pthread_mutex_unlock(&(the_queue->mutex)); return 0; }

嘗試實作可以讀取phonebook-word檔

參考andy19950共筆實作以下修改
修改 list.c and list.h

  • 原本struct node intptr_t 要改成字元可讀的型式 typedef char* val_t;
  • node_new()將資料放進來方式變成strcpy((char *)node->data, val);
  • list_print()方法也要更動輸出為output.txt
void list_print(llist_t *list) { node_t *cur = list->head; /* FIXME: we have to validate the sorted results in advance. */ FILE *fp = fopen(output, "w"); while (cur) { fprintf(fp,"%s\n", (char *)cur->data); cur = cur->next; } fclose(fp); }

修改 main.c

  • 先將word.txt 透過指令亂序$ uniq words.txt | sort -R > input.txt
  • 為了可以讀取input.txt,先開頭定義#define USAGE "usage: ./sort [thread_count] [file_name]\n"接著修改資料抓取方式,將scanf 改成 fgets讀檔案內資料,行的結尾要更改'\0'不然會紀錄原本的換行字元
for (int i = 0; i < data_count; ++i) { long int data; scanf("%ld", &data); list_add(the_list, data); } while(fgets(line, sizeof(line), fp)!= NULL) { line[strlen(line)-1] = '\0'; list_add(the_list, line); }
  • 原先數字比較大小更改為字串比較,andy19950共筆提出了strcmp比較大小回傳1 , 0 , -1實作small的內容很簡單明瞭,覺得很厲害不然原本直覺想到的就是硬幹而已@@
int cmp = strcmp(a->head->data, b->head->data); llist_t *small = (llist_t *) ((intptr_t *) a * (cmp <= 0) + (intptr_t *) b * (cmp > 0));

原本照andy19950程式碼實作出現一些Waring,主要是const char * 類似的型態類別有問題,指定轉換的格式後就OK了,所以資料在轉換過程中要遵循型態不然可能會有waring甚至會有奇怪的錯誤。

設計自動測試的機制

  • 將原本的word.txt 跟 後來產生的output.txt做比較,紀錄第幾筆資料錯誤並顯示
void check_data() { FILE *word, *data; data = fopen("output.txt", "r"); word = fopen("words.txt", "r"); char output[16], answer[16]; int count = 0; while(fgets(output, sizeof(output), data) != NULL){ fgets(answer, sizeof(answer), word); if(strcmp(answer, output) != 0) { printf("Wrong %d data %s\n",++count, output); } ++count; } printf("data are right\n"); fclose(word); fclose(data); }

測試未優化版本 thread 1 - 512

thread 在64以上,時間大幅度上升,代表後面沒有task時, thread會一直耗費CPU的資源

  • 使用 mutrace 了解程式 mutex 的狀況
    $ mutrace ./sort 64 input.txt
mutrace: 0.2 sucessfully initialized for process sort (pid 10023).
64 0.4450

mutrace: Showing statistics for process sort (pid 10023).
mutrace: 3 mutexes used.

Mutex #2 (0x0x5258ba0) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fead81b34b2]
	./sort(tqueue_init+0x38) [0x401477]
	./sort(tpool_init+0x6a) [0x4016ab]
	./sort(main+0x180) [0x402011]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fead7bea830]

Mutex #0 (0x0x7feab777d860) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7fead81b36b9]
	/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7feab7579fec]
	[(nil)]

mutrace: Showing 2 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       2   699815   657024   461951       84.738        0.000        0.020 Mx.--.
       0      320      131       41        0.032        0.000        0.001 M-.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC 

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 529.922 ms.

mutrace: Results for SMP with 8 processors.

mutex 2 被大量 locked 住了,熱點應該是 tqueue 之間的 task ,必須要等待 mutex 被釋放才能繼續進行,所以通通都卡在 lock 等待資源被釋放

嘗試管理 work thread

額外整理Pthread探討
參考20組 使用 Condition variables 的 singal 來控制 mutex
參照How to synchronize manager/worker pthreads without a join?

修改 main.c在 負責 task 運作的 task_run 加上 pthread_cond_wait 等待信號回傳喚醒進行動作,但是調用pthread_cond_wait 前要搭配pthread_mutex_lock使用。

static void *task_run(void *data) { (void) data; while (1) { /* Lock must be taken to wait on conditional variable */ pthread_mutex_lock(&pool->queue->mutex); /* Wait on condition variable, check for spurious wakeups. When returning from pthread_cond_wait(), we own the lock. */ while(pool->queue->size == 0) { pthread_cond_wait(&(pool->queue->cond), &(pool->queue->mutex)); } if (pool->queue->size ==0) break; /*Grab out task */ task_t *_task = tqueue_pop(pool->queue); /* Unlock*/ pthread_mutex_unlock(&(pool->queue->mutex)); if (_task) { if (!_task->func) { tqueue_push(pool->queue, _task); break; } else { _task->func(_task->arg); free(_task); } } } pthread_exit(NULL); }

修改 threadpool.c 將原本 tqueue_pop的 mutex 拿掉,留下 tqueue_push並建立pthread_cond_signal去喚醒 pthread_cond_wait的 thread

int tqueue_push(tqueue_t *the_queue, task_t *task) { pthread_mutex_lock(&(the_queue->mutex)); task->next = NULL; if (the_queue->tail) the_queue->tail->next = task; the_queue->tail = task; if (the_queue->size++ == 0) the_queue->head = task; pthread_cond_signal(&(the_queue->cond)); pthread_mutex_unlock(&(the_queue->mutex)); return 0; }

後面大的 thread 時間大幅度下降了

  • 使用 mutrace 了解用 condition variables 改善 mutex 的狀況
    $ mutrace ./sort 64 input.txt
mutrace: 0.2 sucessfully initialized for process sort (pid 22346).
64 0.3986

mutrace: Showing statistics for process sort (pid 22346).
mutrace: 3 mutexes used.

Mutex #0 (0x0x7f9ecffe5860) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7f9ef0a1b6b9]
	/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7f9ecfde1fec]
	[(nil)]

Mutex #1 (0x0x36adba0) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f9ef0a1b4b2]
	./sort(tqueue_init+0x38) [0x401577]
	./sort(tpool_init+0x6a) [0x4017a7]
	./sort(main+0x180) [0x402197]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f9ef0452830]

mutrace: Showing 2 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       0      320      112       18        0.032        0.000        0.001 M-.--.
       1      195      129        0        0.213        0.001        0.012 Mx.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC 

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 492.383 ms.

mutrace: Results for SMP with 8 processors.

發現 mutex 的 locked 從幾十萬降到剩幾百了,代表之前都是等待 mutex 被釋放才能繼續進行,加入 condition wait/singal 喚醒的機制改善了 lock 等待的問題。

研究 Lock-free 實作

分組作業-以 concurrent Linked List 實做 merge sort

簡介

People often describe lock-free programming as programming without mutexes, which are also referred to as locks. That’s true, but it’s only part of the story. The generally accepted definition, based on academic literature, is a bit more broad. At its essence, lock-free is a property used to describe some code, without saying too much about how that code was actually written.

lock-free 不表示一定沒有 mutex 或 lock 的互斥鎖,但是減少使用互斥鎖也是一種做法,然而最主要的是一種概念,並沒有要求 code 撰寫方式。

If a given part of your code doesn’t satisfy these conditions, then that part is not lock-free.
要完全符合下面的流程,才是一個完整的 lock-free 程式

閱讀老師提供 concurrent-ll程式碼以及20組的共筆資料

  • 導入atomic_ops_if.h 並嘗試在原本的 tqueue_pop tqueue_push 加上 CAS 作法
task_t *tqueue_pop(tqueue_t *the_queue) { task_t *ret; ret = the_queue->head; if (ret) { /*if head == tail means one node in queue*/ if(CAS_PTR(&(the_queue->tail), ret, NULL) == ret) { return NULL; } else { CAS_PTR(&(the_queue->head), ret, ret->last); FAD_U32(&(the_queue->size)); } } return ret; }
nt tqueue_push(tqueue_t *the_queue, task_t *task) { task->next = NULL; if (CAS_PTR(&(the_queue->tail), NULL, task) == NULL) { CAS_PTR(&(the_queue->head), NULL, task); } else { CAS_PTR(&(the_queue->tail->next),the_queue->tail, task); } FAI_U32(&(the_queue->size)); pthread_cond_signal(&(the_queue->cond)); return 0; }

看起來差別不大,可能是因為 CAS用法還不太會,不確定有沒有真正的 lcok-free 還需要再研究一下。

$ mutrace ./sort 64 input.txt

mutrace: 0.2 sucessfully initialized for process sort (pid 13784).
64 0.4127

mutrace: Showing statistics for process sort (pid 13784).
mutrace: 3 mutexes used.

Mutex #1 (0x0x7fd8cd7de860) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7fd8ee2146b9]
	/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7fd8cd5dafec]
	[(nil)]

Mutex #0 (0x0x4c6aba0) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fd8ee2144b2]
	./sort(tqueue_init+0x38) [0x4015a7]
	./sort(tpool_init+0x6a) [0x4017ba]
	./sort(main+0x180) [0x4021b4]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fd8edc4b830]

mutrace: Showing 2 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       1      320      120       36        0.033        0.000        0.001 M-.--.
       0      258      192        0        0.071        0.000        0.015 M!.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC 

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 505.703 ms.

mutrace: Results for SMP with 8 processors.

mutrace: WARNING: 386 inconsistent mutex uses detected. Results might not be reliable.
mutrace:          Fix your program first!

使用 CAS 做法反而多出了mutrace: WARNING: 386 inconsistent mutex uses detected. Results might not be reliable,是因為 threads 執行中 CAS 會去檢查 新舊問題,如果沒問題就直接使用並改變,才造成 不一致的 mutex warning ?


課程內容
電子工程專輯
link-list locality 散落

Select a repo