Contributed by <jeff60907>
OS: Lubuntu 16.04
CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
I couldn't get through all the slides; after just a few chapters I went back to the original reference material: atomic operations, memory barriers, avoiding the ABA problem. Having read through those algorithms I still don't quite understand them and will need to review them again.
An example the instructor brought up:
Walking while talking on the phone is doing two things at once (walking and talking), and writing a report while chatting online is too. Yet these two examples differ in a subtle way: in the former we really are doing both things at every single instant; in the latter we are not — most people write part of the report, switch to the chat, write some more, chat some more, and so on, and are almost never writing and chatting at the very same instant.
The first style (both tasks advancing side by side) is called parallel; the second interleaves the two tasks so that each keeps making progress over time. Concurrent (using the local translation here) covers both interleaved execution and parallelism.
Concurrency is about dealing with lots of things at once.
Parallelism is about doing lots of things at once.
With threads, the OS scheduler decides which runnable task gets to execute. Execution can also stop partway through a function, go do something else, and later resume from exactly where it left off — in the coroutine case this happens without the kernel performing a context switch.
I've skimmed the course content so far, but my grasp of the concepts is still fuzzy; my English reading was too rough and needs another pass. Time to pick up the pace and get implementing…
"Even written in Chinese you wouldn't necessarily absorb it all — get hands-on!" — jserv
Unmodified version
Following Tempo JiJi's notes, I noticed the record count was wrong. The source code has a show_entry() function that can be called, so I added this to main.c:
#ifdef OPT
show_entry(e);
#endif
The printed result looked strange only because it was unsorted. After sorting with the Linux built-in sort and diffing against words.txt, I was puzzled to find 3 records that could not be found.
The program also crashed: Segmentation fault (core dumped).
0a1,4
> aaaa
> aaaaa
> aaaaaa
> aaaaaaa
349886a349891
> zyxel
349888d349892
< zyxelzyzzogeton
349890a349895
> zyzzogeton
zyxel and zyzzogeton got glued together into zyxelzyzzogeton! And I couldn't tell from the output what was wrong with the string.
The fix is in pHead: both pHead = app[i]->pHead->pNext and etmp->pNext = app[i]->pHead->pNext splice the per-thread lists together starting from head->next instead of head, which is why 4 head entries went missing.
pHead = pHead->pNext;
for (int i = 0; i < THREAD_NUM; i++) {
    if (i == 0) {
        pHead = app[i]->pHead->pNext;
        dprintf("Connect %d head string %s %p\n", i,
                app[i]->pHead->pNext->lastName, app[i]->ptr);
    } else {
        etmp->pNext = app[i]->pHead->pNext;
        dprintf("Connect %d head string %s %p\n", i,
                app[i]->pHead->pNext->lastName, app[i]->ptr);
    }
}
349891d349890
< zyxel
349892a349892
> zyxelzyzzogeton
349895d349894
< zyzzogeton
The build kept printing:
warning: variable ‘cpu_time’ set but not used [-Wunused-but-set-variable]
Classmate HaoTse's notes had the answer: wrap static double diff_in_second(struct timespec t1, struct timespec t2) in phonebook_opt.c and phonebook_opt.h with #ifdef PROFILE / #endif so it only exists when PROFILE mode is enabled. To use DEBUG you now have to run $ make DEBUG=1 PROFILE=1.
Separately, the if/else inside the for loop can be simplified: handle the i == 0 case once before the loop instead of re-testing it on every iteration.
entry *etmp;
pHead = pHead->pNext;
for (int i = 0; i < THREAD_NUM; i++) {
    if (i == 0) {
        pHead = app[i]->pHead;
        dprintf("Connect %d head string %s %p\n", i,
                app[i]->pHead->pNext->lastName, app[i]->ptr);
    } else {
        etmp->pNext = app[i]->pHead;
        dprintf("Connect %d head string %s %p\n", i,
                app[i]->pHead->pNext->lastName, app[i]->ptr);
    }
}
entry *etmp;
pHead = app[0]->pHead;
etmp = app[0]->pLast;
dprintf("Connect %d head string %s %p\n", 0,
        app[0]->pHead->pNext->lastName, app[0]->ptr);
for (int i = 1; i < THREAD_NUM; i++) {
    etmp->pNext = app[i]->pHead;
    dprintf("Connect %d head string %s %p\n", i,
            app[i]->pHead->pNext->lastName, app[i]->ptr);
    etmp = app[i]->pLast;
    dprintf("Connect %d tail string %s %p\n", i,
            app[i]->pLast->lastName, app[i]->ptr);
    dprintf("round %d\n", i);
}
size of entry : 136 bytes
execution time of append() : 0.064879 sec
execution time of findName() : 0.005614 sec
Performance counter stats for './phonebook_orig' (100 runs):
1,880,685 cache-misses # 94.488 % of all cache refs ( +- 0.67% )
1,990,400 cache-references ( +- 0.67% )
260,734,930 instructions # 1.50 insns per cycle ( +- 0.02% )
173,842,586 cycles ( +- 0.26% )
0.081726901 seconds time elapsed ( +- 1.33% )
size of entry : 24 bytes
execution time of append() : 0.002547 sec
execution time of findName() : 0.002512 sec
Performance counter stats for './phonebook_opt' (100 runs):
740,953 cache-misses # 54.827 % of all cache refs ( +- 0.94% )
1,351,431 cache-references ( +- 0.76% )
224,741,794 instructions # 1.36 insns per cycle ( +- 0.04% )
164,993,380 cycles ( +- 0.51% )
0.050511576 seconds time elapsed ( +- 1.82% )
Reading: Pthreads
The subroutines which comprise the Pthreads API can be informally grouped into four major groups:
1. Thread management: Routines that work directly on threads - creating, detaching, joining, etc. They also include functions to set/query thread attributes (joinable, scheduling, etc.)
2. Mutexes: Routines that deal with synchronization, called a "mutex", which is an abbreviation for "mutual exclusion". Mutex functions provide for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes.
3. Condition variables: Routines that address communications between threads that share a mutex, based upon programmer-specified conditions. This group includes functions to create, destroy, wait and signal based upon specified variable values. Functions to set/query condition variable attributes are also included.
4. Synchronization: Routines that manage read/write locks and barriers.
Reading: notes on thread pools in C
Studying how mbrossard-threadpool is implemented
threadpool_t *threadpool_create(int thread_count, int queue_size, int flags)
/* Initialize */
/* Allocate thread and task queue */
/* Initialize mutex and conditional variable first */
/* Start worker threads */
threadpool_create() builds the thread pool: it runs the initialization, allocates the thread array and the task queue at their maximum sizes, initializes the mutex and condition variable, and starts the worker threads, each of which enters the function that executes tasks:
static void *threadpool_thread(void *threadpool)
for(;;) {
    /* Lock must be taken to wait on conditional variable */
    pthread_mutex_lock(&(pool->lock));

    /* Wait on condition variable, check for spurious wakeups.
       When returning from pthread_cond_wait(), we own the lock. */
    while((pool->count == 0) && (!pool->shutdown)) {
        pthread_cond_wait(&(pool->notify), &(pool->lock));
    }

    if((pool->shutdown == immediate_shutdown) ||
       ((pool->shutdown == graceful_shutdown) &&
        (pool->count == 0))) {
        break;
    }

    /* Grab our task */
    task.function = pool->queue[pool->head].function;
    task.argument = pool->queue[pool->head].argument;
    pool->head = (pool->head + 1) % pool->queue_size;
    pool->count -= 1;

    /* Unlock */
    pthread_mutex_unlock(&(pool->lock));

    /* Get to work */
    (*(task.function))(task.argument);
}
Inside for(;;), each thread first contends for pool->lock. When the thread that grabbed pool->lock finds there is no work to do, it calls pthread_cond_wait to wait for a notification; this releases the pool->lock it just acquired, so other threads can come in and contend for pool->lock. When there is no work at all, every thread ends up waiting here.
When a thread is woken up via pthread_cond_signal, it reacquires pool->lock, takes a task out of the queue, and once the task has been removed it unlocks again so that other awakened threads can also grab a task; each thread then executes the work through the task's function pointer.
int threadpool_destroy(threadpool_t *pool, int flags)
/* Already shutting down */
/* Wake up all worker threads */
/* Join all worker thread */
/* Only if everything went well do we deallocate the pool */
threadpool_destroy() uses pthread_cond_broadcast to wake up all the threads, waits for every running thread to finish, and finally tears down the thread pool.
I kept the original opt.c/h contents and added the pieces the thread pool needs:
#if defined THREADPOOL
extern threadpool_t *pool;
extern pthread_mutex_t lock;

pthread_mutex_init(&lock, NULL);
assert((pool = threadpool_create(THREAD_NUM, QUEUE, 0)) != NULL);
append_a **app = (append_a **) malloc(sizeof(append_a *) * THREAD_NUM);
for (int i = 0; i < THREAD_NUM; i++)
    app[i] = new_append_a(map + MAX_LAST_NAME_SIZE * i, map + fs, i,
                          THREAD_NUM, entry_pool + i);
for (int i = 0; i < THREAD_NUM; i++)
    threadpool_add(pool, &append, (void *) app[i], 0);
assert(threadpool_destroy(pool, 1) == 0);
pthread_mutex_destroy(&lock);
entry *etmp;
pHead = app[0]->pHead;
etmp = app[0]->pLast;
dprintf("Connect %d head string %s %p\n", 0,
        app[0]->pHead->pNext->lastName, app[0]->ptr);
for (int i = 1; i < THREAD_NUM; i++) {
    etmp->pNext = app[i]->pHead;
    dprintf("Connect %d head string %s %p\n", i,
            app[i]->pHead->pNext->lastName, app[i]->ptr);
    etmp = app[i]->pLast;
    dprintf("Connect %d tail string %s %p\n", i,
            app[i]->pLast->lastName, app[i]->ptr);
    dprintf("round %d\n", i);
}
clock_gettime(CLOCK_REALTIME, &end);
cpu_time1 = diff_in_second(start, end);
size of entry : 24 bytes
execution time of append() : 0.003975 sec
execution time of findName() : 0.002819 sec
Performance counter stats for './phonebook_threadpool' (100 runs):
611,939 cache-misses # 52.872 % of all cache refs ( +- 0.63% )
1,157,405 cache-references ( +- 0.51% )
223,751,942 instructions # 1.54 insns per cycle ( +- 0.02% )
144,898,361 cycles ( +- 0.29% )
0.049825091 seconds time elapsed ( +- 1.75% )
The threadpool version now seems to behave just like opt, yet its cache misses are almost 10% lower than opt's, which puzzled me.
Found the cause: the size of the queue inside the thread pool affects how long the threads take to process everything. The runs above used a queue size of 16, so tasks placed in the queue were probably sitting there waiting. With a queue size of 4 and 8 threads, the append() time already drops below opt's.
The task-queue size sets how many tasks can be queued at once; making it larger does not automatically make execution faster. With a very large thread count the speed improves a lot but cache performance drops sharply; analysis to follow (TODO).
threadpool queue = 4, threads = 16
$ mutrace ./phonebook_threadpool
mutrace: Application appears to be compiled without -rdynamic. It might be a
mutrace: good idea to recompile with -rdynamic enabled since this produces more
mutrace: useful stack traces.
mutrace: 0.2 sucessfully initialized for process phonebook_threa (pid 5336).
size of entry : 24 bytes
execution time of append() : 0.005265 sec
execution time of findName() : 0.002172 sec
mutrace: Showing statistics for process phonebook_threa (pid 5336).
mutrace: 3 mutexes used.
Mutex #2 (0x0x7ff242b4c860) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7ff24c2ac6b9]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7ff242948fec]
[(nil)]
Mutex #1 (0x0x6031a0) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7ff24c2ac4b2]
./phonebook_threadpool() [0x4014b3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7ff24bce3830]
Mutex #0 (0x0x2252b00) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7ff24c2ac4b2]
./phonebook_threadpool() [0x401dd1]
./phonebook_threadpool() [0x4014c7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7ff24bce3830]
mutrace: Showing 3 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
2 80 35 10 0.015 0.000 0.001 M-.--.
1 7 6 6 4.588 0.655 1.625 Mx.--.
0 56 42 2 0.035 0.001 0.005 Mx.--.
||||||
/|||||
Object: M = Mutex, W = RWLock /||||
State: x = dead, ! = inconsistent /|||
Use: R = used in realtime thread /||
Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
Mutex Protocol: i = INHERIT, p = PROTECT /
RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC
mutrace: Note that the flags column R is only valid in --track-rt mode!
mutrace: Total runtime is 66.471 ms.
mutrace: Results for SMP with 8 processors.
$ addr2line -e phonebook_threadpool 0x4014b3
/home/jeff60907/phonebook-concurrent/main.c:90
$ addr2line -e phonebook_threadpool 0x401dd1
/home/jeff60907/phonebook-concurrent/threadpool.c:122
$ addr2line -e phonebook_threadpool 0x4014c7
/home/jeff60907/phonebook-concurrent/main.c:90
main.c:90
assert((pool=threadpool_create(THREAD_NUM,QUEUE,0))!=NULL);
threadpool.c:122
threadpool_t *threadpool_create(int thread_count, int queue_size, int flags)
/* Initialize mutex and conditional variable first */
if((pthread_mutex_init(&(pool->lock), NULL) != 0) ||
(pthread_cond_init(&(pool->notify), NULL) != 0) ||
(pool->threads == NULL) ||
(pool->queue == NULL)) {
goto err;
}
The mutex is created right at the start in threadpool_create(), but at run time a lot of the time goes to waiting for the mutex to be released before anything can proceed; a future step is to try adding a lock-free mechanism.
Class notes
2016/10/07
cookie
malloc / free
page fault
MMU
man page notes -> commonly used on Linux
PDP series (virtual memory < physical)
cache locality
ARM big.LITTLE
CPU hotplug
mutex vs. semaphore (review in the dinosaur book)
week 3 must-read - SMT
branch prediction
cache VIVT (see the cache notes)