2016q3 Homework2 (phonebook-concurrent)

# 2016q3 Homework2 (phonebook-concurrent) contribute by <`jeff60907`> [作業說明A06: phonebook-concurrent](https://hackmd.io/s/rJsgh0na) * 學習 POSIX Thread來縮減 alloc() 的時間成本 * 提示：可透過建立 thread pool 來管理 worker thread * 學習效能分析工具 * code refactoring 練習 * 探索 clz 的應用 * 學習 concurrent-ll (concurrent linked-list 實作) 的 scalability 分析方式 ## 開發環境 ``` 作業系統 Lubuntu 16.04 CPU:Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 8192K ``` * 學習第二週提及的 [concurrency](/s/H10MXXoT) 程式設計及[冼鏡光老師的投影片](http://blog.dcview.com/article.php?a=BTgBYw1qUWIAaQ%3D%3D) > 發現投影片看不完只先看了幾章就趕緊看原有資料 ## An Introduction to Lock-Free Programming * [An Introduction to Lock-Free Programming](http://preshing.com/20120612/an-introduction-to-lock-free-programming/) * [參考大陸翻譯](http://www.cnblogs.com/gaochundong/p/lock_free_programming.html) * [Concurrency系列(一)~(五)](http://enginechang.logdown.com/posts/784600-concurrency-series-1-the-road-to-understand-concurrency) * [閱讀Tempo JiJi共筆](https://hackmd.io/s/rymKa4aT) If a given part of your code doesn’t satisfy these conditions, then that part is not lock-free. 要完全符合上面的流程，才是一個完整的lock-free ![](https://i.imgur.com/VtJr7NL.png) ### Lock-Free Programming Techniques Atomic Operations , memory barriers , avoiding the ABA problem ![](https://i.imgur.com/nSgLjz4.png) > 看過那些演算法還是有點不太了解，要在重新複習過 ## Concurrency (並行) vs. Parallelism (平行) 洗老師提到的舉例一邊走路一邊講電話就同時做兩件事（走路和講電話），一邊寫報告一邊上網聊天也是如此。然而這兩個同時做兩件事的例子卻有著很微妙的差異，前者我們真的是在每一瞬間都是同時進行兩件事（走路和講電話）；後者卻不然，大多人都是寫一段報告、就跳過去聊天、再寫報告、再聊天、等等，幾乎不可能在每一瞬間都同時在寫報告和聊天。兩個例子中前者（兩件事齊頭並進）的處理方式叫做 **平行（parallel）**，後者在兩件事中交錯執行、但每一瞬間每件事都有點進展； **並行（concurrent，此地採用國內的譯名)** 指的是交錯執行和平行兩者。 Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once. * Concurrency 在一個CPU執行多數的tasks，每個tasks會切換所以執行時間都不一樣 *![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460613316743_p1.png) * Parallelism 可以當作多個tasks在不同的CPU上同步一起執行。 ![](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460613329719_p2.png) ## Process vs. Thread vs. Coroutines [參考LanKuDot共筆](https://hackmd.io/s/HJFhaiAp) * With threads, the operating system switches running tasks preemptively according to its scheduling algorithm. >使用threads 是OS會用排程方法去選擇執行哪個running task， * With coroutines, the programmer chooses, meaning tasks are cooperatively multitasked by pausing and resuming functions at set points. * coroutine switches are cooperative, meaning the programmer controls when a switch will happen. * The kernel is not involved in coroutine switches. > 可以執行到一半去做其他function，然後再回到剛剛停止的點繼續往下，kernel並不會參與使用context switch。 ## POSIX Thread 目前課程內容大致上看過但觀念吸收還是很模糊，英文看的很粗略要再重念一遍..抓緊腳步實作... >> 就算用中文寫，你也不見得可以全部吸收，要動手！ [name=jserv] ## 實驗測試未修改版本 ![](https://i.imgur.com/uSWjLGk.png) ## Code Refactoring * refactoring（重構）的定義：「在不改變軟體的外在行為之下，改善既有軟體的內部設計」 * 所謂的重構，Martin Fowler的定義是「在不改變軟體外部行為的前提下，改變其內部結構，使其更容易理解且易於修改」。也就是說，在對外的介面上沒有做改變、介面背後的對應行為也沒有改變的情況下，基於可讀性，以及日後更便於修改的目的之下，來改寫內部的程式碼實作。 * 《重構》(原文書名: [Refactoring - Improving the Design of Existing Code](https://www.csie.ntu.edu.tw/~r95004/Refactoring_improving_the_design_of_existing_code.pdf)) * [什麼是Refactoring？](http://teddy-chen-tw.blogspot.tw/2014/04/refactoring.html) * [對於重構的兩則常見誤解](http://www.ithome.com.tw/node/79813) [參考Tempo JiJi共筆](https://hackmd.io/s/rymKa4aT) data資料量有錯誤，看了原始程式碼內有`show_entry`的函式可以呼叫於是在`main.c`加入 ```C= #ifdef OPT show_entry(e); #endif ``` 顯示出來很怪，原來是沒有排序過後使用linux 內建的 `sort`，排序後在`diff`比較`words.txt`的差異，納悶是多出了3個沒找到的資料系統顯示 Segmentation fault (core dumped) 記憶體錯誤 ``` 0a1,4 > aaaa > aaaaa > aaaaaa > aaaaaaa 349886a349891 > zyxel 349888d349892 < zyxelzyzzogeton 349890a349895 > zyzzogeton ``` > zyxelzyzzogeton 竟然黏在一起了！也看不太出來輸出字串有什麼出錯修改 pHead因為pHead = app[i]->pHead->pNext 跟 etmp->pNext = app[i]->pHead->pNext 這兩個將head連接起來時是head->next而不是head，因此才會少了4個head的entry ```c= pHead = pHead->pNext; for (int i = 0; i < THREAD_NUM; i++) { if (i == 0) { pHead = app[i]->pHead->pNext; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); } else { etmp->pNext = app[i]->pHead->pNext; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); } ``` ``` 349891d349890 < zyxel 349892a349892 > zyxelzyzzogeton 349895d349894 < zyzzogeton ``` #### 研究程式碼一直出現 warning: variable ‘cpu_time’ set but not used [-Wunused-but-set-variable] 剛好看到[HaoTse同學的共筆](https://hackmd.io/s/Bk-rtQGR) * 將 `phonebook_opt.c` 和 `phonebook_opt.h` 中的 `static double diff_in_second(struct timespec t1, struct timespec t2)` 前後加入 `#ifdef PROFILE` 確認是否開啟 PROFILE 模式。如果要使用 DEBUG 需要使用 $ make DEBUG=1 PROFILE=1 for回圈內的 if else 可以把0獨立額外拿出來做即可，不用讓for loop一直去判別 ```c= entry *etmp; pHead = pHead->pNext; for (int i = 0; i < THREAD_NUM; i++) { if (i == 0) { pHead = app[i]->pHead; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); } else { etmp->pNext = app[i]->pHead; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); } ``` ```c= entry *etmp; pHead = app[0]->pHead; etmp = app[0]->pLast; dprintf("Connect %d head string %s %p\n", i, app[0]->pHead->pNext->lastName, app[0]->ptr); for (int i = 1; i < THREAD_NUM; i++) { etmp->pNext = app[i]->pHead; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); etmp = app[i]->pLast; dprintf("Connect %d tail string %s %p\n", i, app[i]->pLast->lastName, app[i]->ptr); dprintf("round %d\n", i); } ``` #### 執行時間跟效能 THREAD_NUM=4 * orig ``` size of entry : 136 bytes execution time of append() : 0.064879 sec execution time of findName() : 0.005614 sec Performance counter stats for './phonebook_orig' (100 runs): 188,0685 cache-misses # 94.488 % of all cache refs ( +- 0.67% ) 199,0400 cache-references ( +- 0.67% ) 2,6073,4930 instructions # 1.50 insns per cycle ( +- 0.02% ) 1,7384,2586 cycles ( +- 0.26% ) 0.081726901 seconds time elapsed ( +- 1.33% ) ``` * opt ``` size of entry : 24 bytes execution time of append() : 0.002547 sec execution time of findName() : 0.002512 sec Performance counter stats for './phonebook_opt' (100 runs): 74,0953 cache-misses # 54.827 % of all cache refs ( +- 0.94% ) 135,1431 cache-references ( +- 0.76% ) 2,2474,1794 instructions # 1.36 insns per cycle ( +- 0.04% ) 1,6499,3380 cycles ( +- 0.51% ) 0.050511576 seconds time elapsed ( +- 1.82% ) ``` ## 透過 POSIX Thread 來縮減 alloc() 的時間成本 [額外整理-Pthread 探討](https://hackmd.io/OwEwRgbArAhgxgZgLQE5goAxICwFNjIwgpRIwCMYAHGAExQoTYgJA===#) [閱讀Pthreads](https://computing.llnl.gov/tutorials/pthreads/) The subroutines which comprise the Pthreads API can be informally grouped into four major groups: * 分為4個主要的設計：執行序管理、互斥鎖、條件變量、同步問題 1.Thread management: Routines that work directly on threads - creating, detaching, joining, etc. They also include functions to set/query thread attributes (joinable, scheduling etc.) 2.Mutexes: Routines that deal with synchronization, called a "mutex", which is an abbreviation for "mutual exclusion". Mutex functions provide for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes. 3.Condition variables: Routines that address communications between threads that share a mutex. Based upon programmer specified conditions. This group includes functions to create, destroy, wait and signal based upon specified variable values. Functions to set/query condition variable attributes are also included. 4.Synchronization: Routines that manage read/write locks and barriers. ### 建立 thread pool [閱讀C 的 Thread Pool 筆記](http://swind.code-life.info/posts/c-thread-pool.html) [研究mbrossard-threadpool如何實作](https://github.com/mbrossard/threadpool) ![](https://i.imgur.com/SGgqG0R.png) #### thread pool.c ```c= threadpool_t *threadpool_create(int thread_count, int queue_size, int flags) /* Initialize */ /* Allocate thread and task queue */ /* Initialize mutex and conditional variable first */ /* Start worker threads */ ``` > 建立 thread pool ，然後開始初始化的動作，並分配 thread 和 task queue 的最大數量，接著進入到每個Thread 要執行的 Function task ```c= static void *threadpool_thread(void *threadpool) for(;;) { /* Lock must be taken to wait on conditional variable */ pthread_mutex_lock(&(pool->lock)); /* Wait on condition variable, check for spurious wakeups. When returning from pthread_cond_wait(), we own the lock. */ while((pool->count == 0) && (!pool->shutdown)) { pthread_cond_wait(&(pool->notify), &(pool->lock)); } if((pool->shutdown == immediate_shutdown) || ((pool->shutdown == graceful_shutdown) && (pool->count == 0))) { break; } /* Grab our task */ task.function = pool->queue[pool->head].function; task.argument = pool->queue[pool->head].argument; pool->head = (pool->head + 1) % pool->queue_size; pool->count -= 1; /* Unlock */ pthread_mutex_unlock(&(pool->lock)); /* Get to work */ (*(task.function))(task.argument); } ``` > for(;;) 裡面會先搶奪`pool->lock`，當搶到`pool->lock`的 Thread 發現沒有工作可以做的時候，就會執行 `pthread_cond_wait`來等待通知，然後`Unlcok`剛搶到的`pool->lock`，其他的Thread就可以進來搶`pool->lock`，完全沒有工作的情況下，所有的 Thread 都會在這邊 Waiting > 當 Thread 被透過`pthread_cond_signal`喚醒的時候該 Thread 就會重新取得`pool->lock` 接著取出 queue 中的 task，等待取出完畢之後，再`Unlock`讓其他被喚醒的 Thread 也可以去取得 task，執行 task 中的 function pointer工作 ```c= int threadpool_destroy(threadpool_t *pool, int flags) /* Already shutting down */ /* Wake up all worker threads */ /* Join all worker thread */ /* Only if everything went well do we deallocate the pool */ ``` > 使用 `pthread_cond_broadcast` 喚醒所有的 Thread ，並等待所有正在執行的 Thread 結束，最後銷毀 thread pool [實作參考a530788140共筆](https://hackmd.io/JwDgzMCMAmAMCGBaYAmaSAsA2ARpRIArAGbSLSGFbzogbTA5A===#) #### 建立 phonebook_threadpool.c/h 使用原先 opt.c/h的內容，並加入 threadpool需要的資訊 #### 實作 main.c ```c= #if defined THREADPOOL extern threadpool_t *pool; extern pthread_mutex_t lock; pthread_mutex_init(&lock,NULL); assert((pool=threadpool_create(THREAD_NUM,QUEUE,0))!=NULL); append_a **app=(append_a **)malloc(sizeof(append_a *)*THREAD_NUM); for(int i = 0;i < THREAD_NUM; i++) app[i] = new_append_a(map + MAX_LAST_NAME_SIZE * i, map + fs, i, THREAD_NUM, entry_pool + i); for(int i = 0; i < THREAD_NUM; i++) threadpool_add(pool, &append, (void *)app[i], 0); assert(threadpool_destroy(pool,1) == 0); pthread_mutex_destroy(&lock); entry *etmp; pHead = app[0]->pHead; etmp = app[0]->pLast; dprintf("Connect %d head string %s %p\n", i, app[0]->pHead->pNext->lastName, app[0]->ptr); for (int i = 1; i < THREAD_NUM; i++) { etmp->pNext = app[i]->pHead; dprintf("Connect %d head string %s %p\n", i, app[i]->pHead->pNext->lastName, app[i]->ptr); etmp = app[i]->pLast; dprintf("Connect %d tail string %s %p\n", i, app[i]->pLast->lastName, app[i]->ptr); dprintf("round %d\n", i); } clock_gettime(CLOCK_REALTIME, &end); cpu_time1 = diff_in_second(start, end); ``` * threadpool THREAD_NUM=4 ``` size of entry : 24 bytes execution time of append() : 0.003975 sec execution time of findName() : 0.002819 sec Performance counter stats for './phonebook_threadpool' (100 runs): 61,1939 cache-misses # 52.872 % of all cache refs ( +- 0.63% ) 115,7405 cache-references ( +- 0.51% ) 2,2375,1942 instructions # 1.54 insns per cycle ( +- 0.02% ) 1,4489,8361 cycles ( +- 0.29% ) 0.049825091 seconds time elapsed ( +- 1.75% ) ``` ### 不同 THREAD 效能比較圖 #### THREAD = 4 ![](https://i.imgur.com/PoBMAfL.png) #### THREAD = 8 ![](https://i.imgur.com/zN6QRx9.png) #### THREAD = 16 ![](https://i.imgur.com/BXrGyjD.png) #### THREAD = 32 ![](https://i.imgur.com/7YSppYU.png) > 感覺threadpool1變成跟opt一樣，但是cache miss都比opt少快10% 覺得納悶 > 發現問題 `threadpool` 內的 queue 大小會影響 thread 的處裡時間 > 上方queue 大小為 16 ，可能thread 排進 queue 內的task都在等待 #### THREAD = 4 ![](https://i.imgur.com/asNLDMU.png) #### THREAD = 8 ![](https://i.imgur.com/ny4rW0f.png) #### THREAD = 16 ![](https://i.imgur.com/DDZzK0C.png) #### THREAD = 32 ![](https://i.imgur.com/JOpYhLd.png) > 使用queue 大小為 4 ，在thread 8 append()時間就比 opt 減少了 * Size of the task queue設定 thread 可以排進的task數量，並不是開愈大就可以執行愈快 * 當 thread 開很大速度提升很多但是快取效能卻會大幅下降，探討分析- 待補 #### mutrace 探討mutex問題 threadpool queue = 4 ,thread = 16 `$ mutrace ./phonebook_threadpool` ``` mutrace: Application appears to be compiled without -rdynamic. It might be a mutrace: good idea to recompile with -rdynamic enabled since this produces more mutrace: useful stack traces. mutrace: 0.2 sucessfully initialized for process phonebook_threa (pid 5336). size of entry : 24 bytes execution time of append() : 0.005265 sec execution time of findName() : 0.002172 sec mutrace: Showing statistics for process phonebook_threa (pid 5336). mutrace: 3 mutexes used. Mutex #2 (0x0x7ff242b4c860) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7ff24c2ac6b9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7ff242948fec] [(nil)] Mutex #1 (0x0x6031a0) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7ff24c2ac4b2] ./phonebook_threadpool() [0x4014b3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7ff24bce3830] Mutex #0 (0x0x2252b00) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7ff24c2ac4b2] ./phonebook_threadpool() [0x401dd1] ./phonebook_threadpool() [0x4014c7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7ff24bce3830] mutrace: Showing 3 most contended mutexes: Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags 2 80 35 10 0.015 0.000 0.001 M-.--. 1 7 6 6 4.588 0.655 1.625 Mx.--. 0 56 42 2 0.035 0.001 0.005 Mx.--. |||||| /||||| Object: M = Mutex, W = RWLock /|||| State: x = dead, ! = inconsistent /||| Use: R = used in realtime thread /|| Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /| Mutex Protocol: i = INHERIT, p = PROTECT / RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC mutrace: Note that the flags column R is only valid in --track-rt mode! mutrace: Total runtime is 66.471 ms. mutrace: Results for SMP with 8 processors. ``` ``` $ addr2line -e phonebook_threadpool 0x4014b3 /home/jeff60907/phonebook-concurrent/main.c:90 $ addr2line -e phonebook_threadpool 0x401dd1 /home/jeff60907/phonebook-concurrent/threadpool.c:122 $ addr2line -e phonebook_threadpool 0x4014c7 /home/jeff60907/phonebook-concurrent/main.c:90 ``` main.c：90 ```c= assert((pool=threadpool_create(THREAD_NUM,QUEUE,0))!=NULL); ``` threadpool.c:122 ```c= threadpool_t *threadpool_create(int thread_count, int queue_size, int flags) /* Initialize mutex and conditional variable first */ if((pthread_mutex_init(&(pool->lock), NULL) != 0) || (pthread_cond_init(&(pool->notify), NULL) != 0) || (pool->threads == NULL) || (pool->queue == NULL)) { goto err; } ``` >一開始 mutex 是在 threadpool_create()開始建立，但是執行時很多時間是在等 mutex 釋放才能繼續動作，未來嘗試加入 lock-free 機制 --- 上課筆記 2016/10/07 > cookie > malloc free > page fault > MMU > man page notes -> 常用在linux > PDP系列 (虛擬記憶體<實體) > cache locality > ARM-bigLITTLE > cpu hotplog > mutex vs. semaphore ( 恐龍本回去複習) > week3 必讀 - SMT > brach prediction > cache vivt [參考cache](https://chuck42315.wordpress.com/2011/10/29/cache/)

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.