contributed by <jeff60907>
OS: Lubuntu 16.04
CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
cpu cores: 4
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
I found I understood only about 60-70% of the material; much of it felt familiar from past courses yet stayed fuzzy QQ
In one sentence: "studied it, but never actually learned it." jserv
The pipeline depth is the number of stages
A pipelined architecture contains multiple hardware stages; the number of stages is called the pipeline depth.
1. The larger the pipeline depth, the less time each stage needs to finish its sub-task, so execution performance improves.
2. Once the pipeline depth grows past a certain threshold, however, performance drops instead of rising.
Note: with a pipeline depth of k, the speedup is typically only about sqrt(k).
Out-of-order execution (see the Wikipedia and Baidu entries)
It avoids stalling the processor while the next program instruction is fetched, by immediately executing whatever later instruction is ready to run.
For example, on reaching a branch at instruction 3, the unrelated instructions 4-8 can be computed ahead of time; once branch 3 resolves, those results can be used directly. This keeps the CPU's internal units from idling, at the cost of higher power consumption.
Front-side bus (see Wikipedia)
The article notes:
increasing a modern processor's clock speed by a relatively modest 30% can take as much as double the power, and produce double the heat…
Increasing the clock speed by 30% can cause roughly twice the power consumption and heat, so the clock rate cannot be raised without limit.
Up to a point this increase in power is okay, but at a certain point, currently somewhere around 150-200 watts, the power and heat problems become unmanageable, because it's simply not possible to provide that much power and cooling to a silicon chip in any practical fashion, even if the circuits could, in fact, operate at higher clock speeds. This is called the power wall.
Pursuing more and more ILP also has definite limits, because unfortunately, normal programs just don't have a lot of fine-grained parallelism in them, due to a combination of load latencies, cache misses, branches and dependencies between instructions. This limit of available instruction-level parallelism is called the ILP wall.
Following the steps, test the build first:
$ make
gcc -std=gnu99 -Wall -g -pthread -o list.o -MMD -MF .list.o.d -c list.c
gcc -std=gnu99 -Wall -g -pthread -o threadpool.o -MMD -MF .threadpool.o.d -c threadpool.c
gcc -std=gnu99 -Wall -g -pthread -o main.o -MMD -MF .main.o.d -c main.c
gcc -std=gnu99 -Wall -g -pthread -o sort list.o threadpool.o main.o -rdynamic
What are -MMD and -MF? See kobeyu's introduction: -MMD makes gcc emit a make dependency file (omitting system headers) as a side effect of compilation, and -MF names the file that dependency output goes to.
-g was already added at compile time, which is what lets mutrace below map addresses back to source line numbers.
mutrace: Showing statistics for process sort (pid 17137).
mutrace: 3 mutexes used.
Mutex #0 (0x0x13fd9b0) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7faa075e84b2]
./sort(tqueue_init+0x38) [0x401277]
./sort(tpool_init+0x6a) [0x4014cc]
./sort(main+0x161) [0x401c74]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7faa0701f830]
Mutex #1 (0x0x603120) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7faa075e84b2]
./sort(main+0x11d) [0x401c30]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7faa0701f830]
Mutex #2 (0x0x7faa04bee860) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7faa075e86b9]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7faa049eafec]
[(nil)]
mutrace: Showing 3 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
0 133 15 10 0.026 0.000 0.002 Mx.--.
1 13 9 1 0.003 0.000 0.000 M-.--.
2 20 3 1 0.007 0.000 0.002 M-.--.
sort(tqueue_init+0x38) [0x401277]
$ addr2line -e sort 0x38
??:0
$ addr2line -e sort 0x401277
/home/jeff60907/mergesort-concurrent/threadpool.c:15
mutrace: 3 mutexes used. reports how many mutexes the program used.
Why does 0x38 resolve to ??:0 — is it because 0x401277, the actual runtime address, is the one that maps correctly? — So go read! Start with the addr2line manual; binutils implements it on top of the Binary File Descriptor library (BFD). jserv
Reading addr2line:
If the file name or function name can not be determined, addr2line will print two question marks in their place. If the line number can not be determined, addr2line will print 0.
Reading about the Binary File Descriptor library (BFD)
Comparing big endian and little endian, plus the short story at the end:
On a big-endian system:
a 32-bit value 0x12345678 is stored in memory as 0x12 0x34 0x56 0x78 — the most significant byte at the lowest address and the least significant byte at the highest address, in order.
On a little-endian system:
it is stored as 0x78 0x56 0x34 0x12 — the least significant byte at the lowest address and the most significant byte at the highest address, in reverse byte order.
See Measuring Lock Contention for an explanation of mutrace.
A linked list is built from nodes chained together with pointers. First define the node_t type:
typedef intptr_t val_t; /* can hold a value or a pointer */
typedef struct node {
    val_t data;        /* the data to store */
    struct node *next; /* pointer to the next node */
} node_t;
Why use the intptr_t type? See SwimGlass's very thorough write-up!!
Pointers sometimes have to be converted to integers, and pointers are different sizes on 32-bit and 64-bit machines, so forcing one into a fixed-width integer can lose data. intptr_t is guaranteed wide enough to hold a pointer, so the conversion is correct and safe.
Common questions about representations
For the list operations, see the implementation in list.c:
// create a new list object
llist_t *list_new();
// insert the given value into the list
int list_add(llist_t *the_list, val_t val);
// print the list contents one by one
void list_print(llist_t *the_list);
// create a new node
node_t *new_node(val_t val, node_t *next);
// get the node at the given index
node_t *list_get(llist_t *the_list, uint32_t index);
Define task_t to encapsulate a task:
typedef struct _task {
    void (*func)(void *); /* the function that will eventually run */
    void *arg;            /* the argument passed to it */
    struct _task *next, *last;
    /* the queue is a doubly-linked list,
       so both next and last are stored */
} task_t;
Not yet comfortable with doubly-linked lists; going back to review the data structures course.
Don't list a pile of Chinese references — cite "solid" English material instead. jserv
Got it! 陳致佑
Some of the POP and PUSH operations in threadpool.c were hard to follow, so I consulted LitSnow's notes. What puzzled me: a queue should be FIFO, so pop should take from the head and push should append at the tail, but threadpool.c did the opposite. — Go fix it; this assignment has a sequel. From now on, start with code refactoring. jserv
Got it. 陳致佑
Changed pop and push in threadpool.c to swap head and tail: pop now removes from the head, push appends at the tail.
task_t *tqueue_pop(tqueue_t *the_queue)
{
    task_t *ret;
    pthread_mutex_lock(&(the_queue->mutex));
    ret = the_queue->head;          /* FIFO: take from the head */
    if (ret) {
        the_queue->head = ret->next;
        if (!the_queue->head) {
            the_queue->tail = NULL; /* queue is now empty */
        }
        the_queue->size--;
    }
    pthread_mutex_unlock(&(the_queue->mutex));
    return ret;
}
int tqueue_push(tqueue_t *the_queue, task_t *task)
{
    pthread_mutex_lock(&(the_queue->mutex));
    task->next = NULL;
    if (the_queue->tail)
        the_queue->tail->next = task; /* FIFO: append at the tail */
    the_queue->tail = task;
    if (the_queue->size++ == 0)
        the_queue->head = task;       /* first element: head = tail */
    pthread_mutex_unlock(&(the_queue->mutex));
    return 0;
}
Following andy19950's notes, the changes below were made.
Modify list.c and list.h:
in struct node, change intptr_t to a character type so the data reads as text: typedef char *val_t;
In node_new(), the data is now stored with strcpy((char *)node->data, val);
list_print() also changes, writing its output to output.txt:
void list_print(llist_t *list)
{
    node_t *cur = list->head;
    /* FIXME: we have to validate the sorted results in advance. */
    FILE *fp = fopen(output, "w");
    while (cur) {
        fprintf(fp, "%s\n", (char *) cur->data);
        cur = cur->next;
    }
    fclose(fp);
}
Modify main.c:
shuffle words.txt with $ uniq words.txt | sort -R > input.txt to produce input.txt.
First define #define USAGE "usage: ./sort [thread_count] [file_name]\n" at the top, then change the input reading from scanf to fgets. When reading each line from the file, the trailing newline must be replaced with '\0', otherwise the original newline character is kept in the data.
for (int i = 0; i < data_count; ++i) {
    long int data;
    scanf("%ld", &data);
    list_add(the_list, data);
}

while (fgets(line, sizeof(line), fp) != NULL) {
    line[strlen(line) - 1] = '\0';
    list_add(the_list, line);
}
andy19950's notes show how to use strcmp's positive/zero/negative return value to implement small — simple and clear. Impressive; my first instinct would have been to brute-force it @@
int cmp = strcmp(a->head->data, b->head->data);
llist_t *small = (llist_t *)
    ((intptr_t) a * (cmp <= 0) +
     (intptr_t) b * (cmp > 0));
Implementing it exactly as in andy19950's code at first produced some warnings, mainly type mismatches around const char *; specifying the conversions explicitly fixed them. Conversions have to respect the types involved, or you get warnings and possibly strange bugs.
Compare words.txt with the generated output.txt, reporting which entries differ:
void check_data()
{
    FILE *word, *data;
    data = fopen("output.txt", "r");
    word = fopen("words.txt", "r");
    char output[16], answer[16];
    int count = 0, wrong = 0;
    while (fgets(output, sizeof(output), data) != NULL) {
        fgets(answer, sizeof(answer), word);
        if (strcmp(answer, output) != 0) {
            printf("Wrong %d data %s\n", count, output);
            ++wrong;
        }
        ++count;
    }
    if (!wrong)
        printf("data are right\n");
    fclose(word);
    fclose(data);
}
With 64 or more threads the runtime rises sharply, which suggests that once no tasks remain, the threads keep burning CPU resources.
$ mutrace ./sort 64 input.txt
mutrace: 0.2 sucessfully initialized for process sort (pid 10023).
64 0.4450
mutrace: Showing statistics for process sort (pid 10023).
mutrace: 3 mutexes used.
Mutex #2 (0x0x5258ba0) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fead81b34b2]
./sort(tqueue_init+0x38) [0x401477]
./sort(tpool_init+0x6a) [0x4016ab]
./sort(main+0x180) [0x402011]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fead7bea830]
Mutex #0 (0x0x7feab777d860) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7fead81b36b9]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7feab7579fec]
[(nil)]
mutrace: Showing 2 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
2 699815 657024 461951 84.738 0.000 0.020 Mx.--.
0 320 131 41 0.032 0.000 0.001 M-.--.
||||||
/|||||
Object: M = Mutex, W = RWLock /||||
State: x = dead, ! = inconsistent /|||
Use: R = used in realtime thread /||
Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
Mutex Protocol: i = INHERIT, p = PROTECT /
RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC
mutrace: Note that the flags column R is only valid in --track-rt mode!
mutrace: Total runtime is 529.922 ms.
mutrace: Results for SMP with 8 processors.
Mutex 2 is locked an enormous number of times; the hotspot should be the tasks contending on the tqueue — each thread must wait for the mutex to be released before it can proceed, so everything piles up waiting on the lock.
Additional notes on pthreads
Following group 20, use a condition variable's signal to control the mutex; see "How to synchronize manager/worker pthreads without a join?"
Modify main.c:
in task_run, which executes the tasks, add pthread_cond_wait so the thread sleeps until a signal wakes it; note that pthread_cond_wait must be called with pthread_mutex_lock already held.
static void *task_run(void *data)
{
    (void) data;
    while (1) {
        /* Lock must be taken to wait on conditional variable */
        pthread_mutex_lock(&pool->queue->mutex);
        /* Wait on condition variable, check for spurious wakeups.
           When returning from pthread_cond_wait(), we own the lock. */
        while (pool->queue->size == 0) {
            pthread_cond_wait(&(pool->queue->cond), &(pool->queue->mutex));
        }
        /* Grab our task */
        task_t *_task = tqueue_pop(pool->queue);
        /* Unlock */
        pthread_mutex_unlock(&(pool->queue->mutex));
        if (_task) {
            if (!_task->func) {
                tqueue_push(pool->queue, _task);
                break;
            } else {
                _task->func(_task->arg);
                free(_task);
            }
        }
    }
    pthread_exit(NULL);
}
Modify threadpool.c:
remove the mutex from the original tqueue_pop, keep it in tqueue_push, and add pthread_cond_signal there to wake the threads blocked in pthread_cond_wait:
int tqueue_push(tqueue_t *the_queue, task_t *task)
{
    pthread_mutex_lock(&(the_queue->mutex));
    task->next = NULL;
    if (the_queue->tail)
        the_queue->tail->next = task;
    the_queue->tail = task;
    if (the_queue->size++ == 0)
        the_queue->head = task;
    pthread_cond_signal(&(the_queue->cond));
    pthread_mutex_unlock(&(the_queue->mutex));
    return 0;
}
With larger thread counts, the runtime dropped significantly.
$ mutrace ./sort 64 input.txt
mutrace: 0.2 sucessfully initialized for process sort (pid 22346).
64 0.3986
mutrace: Showing statistics for process sort (pid 22346).
mutrace: 3 mutexes used.
Mutex #0 (0x0x7f9ecffe5860) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7f9ef0a1b6b9]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7f9ecfde1fec]
[(nil)]
Mutex #1 (0x0x36adba0) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7f9ef0a1b4b2]
./sort(tqueue_init+0x38) [0x401577]
./sort(tpool_init+0x6a) [0x4017a7]
./sort(main+0x180) [0x402197]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f9ef0452830]
mutrace: Showing 2 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
0 320 112 18 0.032 0.000 0.001 M-.--.
1 195 129 0 0.213 0.001 0.012 Mx.--.
||||||
/|||||
Object: M = Mutex, W = RWLock /||||
State: x = dead, ! = inconsistent /|||
Use: R = used in realtime thread /||
Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
Mutex Protocol: i = INHERIT, p = PROTECT /
RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC
mutrace: Note that the flags column R is only valid in --track-rt mode!
mutrace: Total runtime is 492.383 ms.
mutrace: Results for SMP with 8 processors.
The mutex lock count dropped from hundreds of thousands to a few hundred — confirming that threads had previously been stuck waiting for the mutex to be released. Adding the condition wait/signal wake-up mechanism fixed the lock-waiting problem.
Group assignment: implement merge sort with a concurrent linked list
People often describe lock-free programming as programming without mutexes, which are also referred to as locks. That’s true, but it’s only part of the story. The generally accepted definition, based on academic literature, is a bit more broad. At its essence, lock-free is a property used to describe some code, without saying too much about how that code was actually written.
Lock-free does not necessarily mean there is no mutex or lock — reducing their use is one approach — but above all it is a property of the code, not a prescription for how the code must be written.
If a given part of your code doesn’t satisfy these conditions, then that part is not lock-free.
Only code that fully satisfies the conditions in the flow below qualifies as a complete lock-free program.
Read the instructor's concurrent-ll code and group 20's notes on atomic_ops_if.h, then try adding CAS to the original tqueue_pop and tqueue_push:
task_t *tqueue_pop(tqueue_t *the_queue)
{
    task_t *ret;
    ret = the_queue->head;
    if (ret) {
        /* if head == tail, there is only one node in the queue */
        if (CAS_PTR(&(the_queue->tail), ret, NULL) == ret) {
            return NULL;
        } else {
            CAS_PTR(&(the_queue->head), ret, ret->last);
            FAD_U32(&(the_queue->size));
        }
    }
    return ret;
}
int tqueue_push(tqueue_t *the_queue, task_t *task)
{
    task->next = NULL;
    if (CAS_PTR(&(the_queue->tail), NULL, task) == NULL) {
        CAS_PTR(&(the_queue->head), NULL, task);
    } else {
        CAS_PTR(&(the_queue->tail->next), the_queue->tail, task);
    }
    FAI_U32(&(the_queue->size));
    pthread_cond_signal(&(the_queue->cond));
    return 0;
}
The results look about the same — probably because I don't really know how to use CAS correctly yet. I'm not sure this is truly lock-free; it needs more study.
$ mutrace ./sort 64 input.txt
mutrace: 0.2 sucessfully initialized for process sort (pid 13784).
64 0.4127
mutrace: Showing statistics for process sort (pid 13784).
mutrace: 3 mutexes used.
Mutex #1 (0x0x7fd8cd7de860) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7fd8ee2146b9]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7fd8cd5dafec]
[(nil)]
Mutex #0 (0x0x4c6aba0) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fd8ee2144b2]
./sort(tqueue_init+0x38) [0x4015a7]
./sort(tpool_init+0x6a) [0x4017ba]
./sort(main+0x180) [0x4021b4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fd8edc4b830]
mutrace: Showing 2 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
1 320 120 36 0.033 0.000 0.001 M-.--.
0 258 192 0 0.071 0.000 0.015 M!.--.
||||||
/|||||
Object: M = Mutex, W = RWLock /||||
State: x = dead, ! = inconsistent /|||
Use: R = used in realtime thread /||
Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
Mutex Protocol: i = INHERIT, p = PROTECT /
RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC
mutrace: Note that the flags column R is only valid in --track-rt mode!
mutrace: Total runtime is 505.703 ms.
mutrace: Results for SMP with 8 processors.
mutrace: WARNING: 386 inconsistent mutex uses detected. Results might not be reliable.
mutrace: Fix your program first!
Using the CAS version instead produced
mutrace: WARNING: 386 inconsistent mutex uses detected. Results might not be reliable
— is that because, while the threads run, CAS checks the old value and swaps in the new one directly, bypassing the mutex discipline, which mutrace then reports as inconsistent mutex use?
Course material
電子工程專輯 (EE Times)
linked-list locality is scattered