Try   HackMD

2016q3 Homework2 (phonebook-concurrent)

contribute by <jeff60907>

作業說明A06: phonebook-concurrent

  • 學習 POSIX Thread來縮減 alloc() 的時間成本
    • 提示:可透過建立 thread pool 來管理 worker thread
  • 學習效能分析工具
  • code refactoring 練習
  • 探索 clz 的應用
  • 學習 concurrent-ll (concurrent linked-list 實作) 的 scalability 分析方式

開發環境

作業系統 Lubuntu 16.04
CPU:Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K

發現投影片看不完 只先看了幾章就趕緊看原有資料

An Introduction to Lock-Free Programming

Lock-Free Programming Techniques

Atomic Operations , memory barriers , avoiding the ABA problem

看過那些演算法還是有點不太了解,要在重新複習過

Concurrency (並行) vs. Parallelism (平行)

洗老師提到的舉例

一邊走路一邊講電話就同時做兩件事(走路和講電話),一邊寫報告一邊上網聊天也是如此。然而這兩個同時做兩件事的例子卻有著很微妙的差異,前者我們真的是在每一瞬間都是同時進行兩件事(走路和講電話);後者卻不然,大多人都是寫一段報告、就跳過去聊天、再寫報告、再聊天、等等,幾乎不可能在每一瞬間都同時在寫報告和聊天。
兩個例子中前者(兩件事齊頭並進)的處理方式叫做 平行(parallel),後者在兩件事中交錯執行、但每一瞬間每件事都有點進展; 並行(concurrent,此地採用國內的譯名) 指的是交錯執行和平行兩者。

Concurrency is about dealing with lots of things at once.
Parallelism is about doing lots of things at once.

  • Concurrency 在一個CPU執行多數的tasks,每個tasks會切換所以執行時間都不一樣
    *
  • Parallelism 可以當作多個tasks在不同的CPU上同步一起執行。

Process vs. Thread vs. Coroutines

參考LanKuDot共筆

  • With threads, the operating system switches running tasks preemptively according to its scheduling algorithm.

使用threads 是OS會用排程方法去選擇執行哪個running task,

  • With coroutines, the programmer chooses, meaning tasks are cooperatively multitasked by pausing and resuming functions at set points.
    • coroutine switches are cooperative, meaning the programmer controls when a switch will happen.
    • The kernel is not involved in coroutine switches.

可以執行到一半去做其他function,然後再回到剛剛停止的點繼續往下,kernel並不會參與使用context switch。

POSIX Thread

目前課程內容大致上看過但觀念吸收還是很模糊,英文看的很粗略要再重念一遍..抓緊腳步實作

就算用中文寫,你也不見得可以全部吸收,要動手! jserv

實驗測試

未修改版本

Code Refactoring

  • refactoring(重構)的定義:「在不改變軟體的外在行為之下,改善既有軟體的內部設計」
  • 所謂的重構,Martin Fowler的定義是「在不改變軟體外部行為的前提下,改變其內部結構,使其更容易理解且易於修改」。也就是說,在對外的介面上沒有做改變、介面背後的對應行為也沒有改變的情況下,基於可讀性,以及日後更便於修改的目的之下,來改寫內部的程式碼實作。

參考Tempo JiJi共筆
data資料量有錯誤,看了原始程式碼內有show_entry的函式可以呼叫
於是在main.c加入

#ifdef OPT show_entry(e); #endif

顯示出來很怪,原來是沒有排序過後 使用linux 內建的 sort,排序後在diff比較words.txt的差異,納悶是多出了3個沒找到的資料

系統顯示 Segmentation fault (core dumped) 記憶體錯誤

0a1,4
> aaaa
> aaaaa
> aaaaaa
> aaaaaaa
349886a349891
> zyxel
349888d349892
< zyxelzyzzogeton
349890a349895
> zyzzogeton

zyxelzyzzogeton 竟然黏在一起了! 也看不太出來輸出字串有什麼出錯

修改 pHead因為pHead = app[i]->pHead->pNext 跟 etmp->pNext = app[i]->pHead->pNext 這兩個將head連接起來時是head->next而不是head,因此才會少了4個head的entry

pHead = pHead->pNext; for (int i = 0; i < THREAD_NUM; i++) { if (i == 0) { pHead = app[i]->pHead->pNext; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); } else { etmp->pNext = app[i]->pHead->pNext; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); }
349891d349890
< zyxel
349892a349892
> zyxelzyzzogeton
349895d349894
< zyzzogeton

研究程式碼

一直出現
warning: variable ‘cpu_time’ set but not used [-Wunused-but-set-variable]
剛好看到HaoTse同學的共筆

  • phonebook_opt.cphonebook_opt.h 中的 static double diff_in_second(struct timespec t1, struct timespec t2) 前後加入 #ifdef PROFILE 確認是否開啟 PROFILE 模式。如果要使用 DEBUG 需要使用 $ make DEBUG=1 PROFILE=1

for回圈內的 if else 可以把0獨立額外拿出來做即可,不用讓for loop一直去判別

entry *etmp; pHead = pHead->pNext; for (int i = 0; i < THREAD_NUM; i++) { if (i == 0) { pHead = app[i]->pHead; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); } else { etmp->pNext = app[i]->pHead; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); }
entry *etmp; pHead = app[0]->pHead; etmp = app[0]->pLast; dprintf("Connect %d head string %s %p\n", i, app[0]->pHead->pNext->lastName, app[0]->ptr); for (int i = 1; i < THREAD_NUM; i++) { etmp->pNext = app[i]->pHead; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); etmp = app[i]->pLast; dprintf("Connect %d tail string %s %p\n", i, app[i]->pLast->lastName, app[i]->ptr); dprintf("round %d\n", i); }

執行時間跟效能 THREAD_NUM=4

  • orig
size of entry : 136 bytes
execution time of append() : 0.064879 sec
execution time of findName() : 0.005614 sec

 Performance counter stats for './phonebook_orig' (100 runs):

          188,0685      cache-misses              #   94.488 % of all cache refs      ( +-  0.67% )
          199,0400      cache-references                                              ( +-  0.67% )
       2,6073,4930      instructions              #    1.50  insns per cycle          ( +-  0.02% )
       1,7384,2586      cycles                                                        ( +-  0.26% )

       0.081726901 seconds time elapsed                                          ( +-  1.33% )
  • opt
size of entry : 24 bytes
execution time of append() : 0.002547 sec
execution time of findName() : 0.002512 sec

 Performance counter stats for './phonebook_opt' (100 runs):

           74,0953      cache-misses              #   54.827 % of all cache refs      ( +-  0.94% )
          135,1431      cache-references                                              ( +-  0.76% )
       2,2474,1794      instructions              #    1.36  insns per cycle          ( +-  0.04% )
       1,6499,3380      cycles                                                        ( +-  0.51% )

       0.050511576 seconds time elapsed                                          ( +-  1.82% )

透過 POSIX Thread 來縮減 alloc() 的時間成本

額外整理-Pthread 探討

閱讀Pthreads
The subroutines which comprise the Pthreads API can be informally grouped into four major groups:

  • 分為4個主要的設計:執行序管理、互斥鎖、條件變量、同步問題

1.Thread management: Routines that work directly on threads - creating, detaching, joining, etc. They also include functions to set/query thread attributes (joinable, scheduling etc.)

2.Mutexes: Routines that deal with synchronization, called a "mutex", which is an abbreviation for "mutual exclusion". Mutex functions provide for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes.

3.Condition variables: Routines that address communications between threads that share a mutex. Based upon programmer specified conditions. This group includes functions to create, destroy, wait and signal based upon specified variable values. Functions to set/query condition variable attributes are also included.

4.Synchronization: Routines that manage read/write locks and barriers.

建立 thread pool

閱讀C 的 Thread Pool 筆記
研究mbrossard-threadpool如何實作

thread pool.c

threadpool_t *threadpool_create(int thread_count, int queue_size, int flags) /* Initialize */ /* Allocate thread and task queue */ /* Initialize mutex and conditional variable first */ /* Start worker threads */

建立 thread pool ,然後開始初始化的動作,並分配 thread 和 task queue 的最大數量,接著進入到 每個Thread 要執行的 Function task

static void *threadpool_thread(void *threadpool) for(;;) { /* Lock must be taken to wait on conditional variable */ pthread_mutex_lock(&(pool->lock)); /* Wait on condition variable, check for spurious wakeups. When returning from pthread_cond_wait(), we own the lock. */ while((pool->count == 0) && (!pool->shutdown)) { pthread_cond_wait(&(pool->notify), &(pool->lock)); } if((pool->shutdown == immediate_shutdown) || ((pool->shutdown == graceful_shutdown) && (pool->count == 0))) { break; } /* Grab our task */ task.function = pool->queue[pool->head].function; task.argument = pool->queue[pool->head].argument; pool->head = (pool->head + 1) % pool->queue_size; pool->count -= 1; /* Unlock */ pthread_mutex_unlock(&(pool->lock)); /* Get to work */ (*(task.function))(task.argument); }

for(;;) 裡面會先搶奪pool->lock,當搶到pool->lock的 Thread 發現沒有工作可以做的時候,就會執行 pthread_cond_wait來等待通知,然後Unlcok剛搶到的pool->lock,其他的Thread就可以進來搶pool->lock,完全沒有工作的情況下,所有的 Thread 都會在這邊 Waiting

當 Thread 被透過pthread_cond_signal喚醒的時候該 Thread 就會重新取得pool->lock 接著取出 queue 中的 task,等待取出完畢之後,再Unlock讓其他被喚醒的 Thread 也可以去取得 task,執行 task 中的 function pointer工作

int threadpool_destroy(threadpool_t *pool, int flags) /* Already shutting down */ /* Wake up all worker threads */ /* Join all worker thread */ /* Only if everything went well do we deallocate the pool */

使用 pthread_cond_broadcast 喚醒所有的 Thread ,並等待所有正在執行的 Thread 結束,最後銷毀 thread pool

實作參考a530788140共筆

建立 phonebook_threadpool.c/h

使用原先 opt.c/h的內容,並加入 threadpool需要的資訊

實作 main.c

#if defined THREADPOOL extern threadpool_t *pool; extern pthread_mutex_t lock; pthread_mutex_init(&lock,NULL); assert((pool=threadpool_create(THREAD_NUM,QUEUE,0))!=NULL); append_a **app=(append_a **)malloc(sizeof(append_a *)*THREAD_NUM); for(int i = 0;i < THREAD_NUM; i++) app[i] = new_append_a(map + MAX_LAST_NAME_SIZE * i, map + fs, i, THREAD_NUM, entry_pool + i); for(int i = 0; i < THREAD_NUM; i++) threadpool_add(pool, &append, (void *)app[i], 0); assert(threadpool_destroy(pool,1) == 0); pthread_mutex_destroy(&lock); entry *etmp; pHead = app[0]->pHead; etmp = app[0]->pLast; dprintf("Connect %d head string %s %p\n", i, app[0]->pHead->pNext->lastName, app[0]->ptr); for (int i = 1; i < THREAD_NUM; i++) { etmp->pNext = app[i]->pHead; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); etmp = app[i]->pLast; dprintf("Connect %d tail string %s %p\n", i, app[i]->pLast->lastName, app[i]->ptr); dprintf("round %d\n", i); } clock_gettime(CLOCK_REALTIME, &end); cpu_time1 = diff_in_second(start, end);
  • threadpool THREAD_NUM=4
size of entry : 24 bytes
execution time of append() : 0.003975 sec
execution time of findName() : 0.002819 sec

 Performance counter stats for './phonebook_threadpool' (100 runs):

           61,1939      cache-misses              #   52.872 % of all cache refs      ( +-  0.63% )
          115,7405      cache-references                                              ( +-  0.51% )
       2,2375,1942      instructions              #    1.54  insns per cycle          ( +-  0.02% )
       1,4489,8361      cycles                                                        ( +-  0.29% )

       0.049825091 seconds time elapsed                                          ( +-  1.75% )

不同 THREAD 效能比較圖

THREAD = 4

THREAD = 8

THREAD = 16

THREAD = 32

感覺threadpool1變成跟opt一樣,但是cache miss都比opt少快10% 覺得納悶
發現問題 threadpool 內的 queue 大小 會影響 thread 的處裡時間
上方queue 大小為 16 ,可能thread 排進 queue 內的task都在等待

THREAD = 4

THREAD = 8

THREAD = 16

THREAD = 32

使用queue 大小為 4 ,在thread 8 append()時間就比 opt 減少了

  • Size of the task queue設定 thread 可以排進的task數量,並不是開愈大就可以執行愈快

  • 當 thread 開很大速度提升很多但是快取效能卻會大幅下降,探討分析- 待補

mutrace 探討mutex問題

threadpool queue = 4 ,thread = 16
$ mutrace ./phonebook_threadpool

mutrace: Application appears to be compiled without -rdynamic. It might be a
mutrace: good idea to recompile with -rdynamic enabled since this produces more
mutrace: useful stack traces.

mutrace: 0.2 sucessfully initialized for process phonebook_threa (pid 5336).
size of entry : 24 bytes
execution time of append() : 0.005265 sec
execution time of findName() : 0.002172 sec

mutrace: Showing statistics for process phonebook_threa (pid 5336).
mutrace: 3 mutexes used.

Mutex #2 (0x0x7ff242b4c860) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7ff24c2ac6b9]
	/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7ff242948fec]
	[(nil)]

Mutex #1 (0x0x6031a0) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7ff24c2ac4b2]
	./phonebook_threadpool() [0x4014b3]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7ff24bce3830]

Mutex #0 (0x0x2252b00) first referenced by:
	/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7ff24c2ac4b2]
	./phonebook_threadpool() [0x401dd1]
	./phonebook_threadpool() [0x4014c7]
	/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7ff24bce3830]

mutrace: Showing 3 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       2       80       35       10        0.015        0.000        0.001 M-.--.
       1        7        6        6        4.588        0.655        1.625 Mx.--.
       0       56       42        2        0.035        0.001        0.005 Mx.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC 

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 66.471 ms.

mutrace: Results for SMP with 8 processors.
$ addr2line -e phonebook_threadpool 0x4014b3
/home/jeff60907/phonebook-concurrent/main.c:90
$ addr2line -e phonebook_threadpool 0x401dd1
/home/jeff60907/phonebook-concurrent/threadpool.c:122
$ addr2line -e phonebook_threadpool 0x4014c7
/home/jeff60907/phonebook-concurrent/main.c:90

main.c:90

assert((pool=threadpool_create(THREAD_NUM,QUEUE,0))!=NULL);

threadpool.c:122

threadpool_t *threadpool_create(int thread_count, int queue_size, int flags) /* Initialize mutex and conditional variable first */ if((pthread_mutex_init(&(pool->lock), NULL) != 0) || (pthread_cond_init(&(pool->notify), NULL) != 0) || (pool->threads == NULL) || (pool->queue == NULL)) { goto err; }

一開始 mutex 是在 threadpool_create()開始建立,但是執行時很多時間是在等 mutex 釋放才能繼續動作,未來嘗試加入 lock-free 機制


上課筆記
2016/10/07

cookie
malloc free
page fault
MMU
man page notes -> 常用在linux
PDP系列 (虛擬記憶體<實體)
cache locality
ARM-bigLITTLE
cpu hotplog
mutex vs. semaphore ( 恐龍本回去複習)
week3 必讀 - SMT
brach prediction
cache vivt 參考cache