2017q1 Team8(B07: phonebook-concurrent)

2017q1 Team8(B07: phonebook-concurrent) === contribute by <`Cayonliow`、`king1224`、`csielee#`> ###### tags: `sysprog2017` [github](https://github.com/sysprog2017team8/phonebook-concurrent) / [youtube 4/25](https://www.youtube.com/watch?v=yPdjzWUlizk) / [yotube 5/1](https://www.youtube.com/watch?v=F9hKspTdSQ0) ## 當周 youtube {%youtube F9hKspTdSQ0 %} ## 上課問題 2017/5/24 :::info 1. 5分31秒 - `__sync` 開頭是舊版，不推薦使用。建議使用 C11 `__Atomic` 2. 6分39秒 - append 的程式碼有極大的優化空間，可以縮減 while 迴圈的範圍 3. 7分07秒 - 注意 lockfree 標記問題 4. 總結 - 先建立執行緒再進行測試 - 注意正確性 - code 很醜就要改 - 參考論文：[A Pragmatic Implementation Of Non-Blocking Linked-Lists](https://timharris.uk/papers/2001-disc.pdf) - 參考筆記：[Concurrent Linked List 研究](https://hackmd.io/s/rkhBeeZyx) ::: ## 上課紀錄 2017/5/24 :::success - lock_if.h 的 if 是 interface 的意思 - 重新定義 uint32_t ptlock_t 是為了方便可以更換成 pthread_mutex 等等其他 lock 的實作 - static inline 是個特別的方法，能夠將 function 展開，就像 macro 的效果 - 兩大考量 - 程式優雅度 -- 好看好讀 - 型態檢查 -- 用 macro 不會幫忙檢查 ::: ## 任務要求 - 比照 [B07: phonebook-concurrent](https://hackmd.io/s/rkOExTOoe) 要求，實作出刪除 (remove) 特定資料的功能，並且分析 concurrent insert/delete 對效能的影響。留意 race condition 和正確性議題。 ## 討論紀錄 2017/04/15 - 創造 github - 研究 concurrent linked-list lock/lock-free - 初步重構 - 實作 concurrent linked-list lock/lock-free 2017/04/21 - 確認實作方向，先完成 phonebook 使用 append / remove - 下週一錄影 - 持續實作 2017/04/24 - 實作 lock/lockfree 的 Remove 功能 - 撰寫 hackmd - 錄製影片 ## Concurrent linked-list 在多執行緒當中執行 linked-list 的 Append / Remove 會有 Race condition 的問題主要是因為，當不同執行緒同時對某一個 node 進行操作順序的不同，可能導致不同結果這樣就不符合我們的期望也可以說，在 linked-list 當中的 node 資源具有排他性因此執行緒之間需競爭( race ) 底下是 Race condition 的一個例子當 2 個 thread 都想 Append ```sequence participant ThreadA participant ThreadB Note over ThreadA: node *tmp = new_node(); Note over ThreadB: node *tmp = new_node(); Note over ThreadA: list->tail->next = tmp; Note over ThreadB: list->tail->next = tmp; Note over ThreadB: list->tail = tmp; Note over ThreadA: list->tail = tmp; ``` 可以看到因為同時操作 node 會導致 linked-list 的結構造成錯誤為了避免這樣的競爭造成錯誤有了 lock(mutex) 的機制讓同一個資源只能夠讓一個執行緒使用但是 lock(mutex) 在實作上要非常小心，因為不當的操作都有可能造成 deadlock 因此有了 lock-free 這樣的想法不使用 lock(mutex) ，改用底層的 atomic opreation 將底層可能產生的錯誤用 atomic opreation 處理掉也就是在一段執行中不會被打斷/分割，就像是最基本原子不可被分割但是在程式撰寫上就要考慮到資源是不是已經被其他執行緒處理過而要不要重新執行 ## lock ### 介紹 lock 保護 linked list 的手法就是讓一個區塊的資料，同時只能有一個 thread 進入進行修改或存取。好像門的概念，當一條 thread 進入這個區塊並 lock 起來，像是把門關上，其他 threads 必須在門外等候，直到門被打開（unlock），第二條 threads 才能進入可是要注意的是第二條 threads 進入以後也會把門關上，其餘的 threads 需要繼續在門外等待。也就是說，每次只能有一條 thread 在 lock 與 unlock 之間的區塊。 lock [實作方法](https://github.com/jserv/concurrent-ll/tree/master/src/lock) ```clike typedef struct __PHONE_BOOK_ENTRY { char *lastName; struct __PHONE_BOOK_ENTRY *pNext; pdetail dtl; ptlock_t *lock; } entry; ``` ```clike //LOCK static inline uint32_t lock_lock(volatile ptlock_t *l) { while (CAS_U32(l, (uint32_t)0, (uint32_t)1) == 1) ; return 0; } ``` ```clike //UNLOCK static inline uint32_t lock_unlock(volatile ptlock_t *l) { *l = (uint32_t)0; return 0; } ``` lock 的執行時間會相較長，因爲在同一塊區塊只能有一條thread 在工作，所以會有其餘的 threads 在等待。因爲有等待的時間，所以執行時間會變得很久 ## lock-free ### 介紹若使用 lock 來保護住資料，若同時多個 thread 要對同一資料做改變，當一個 thread 將資料 lock 住之後，其餘所有需要此資源的 thread 都會停等下來，等待資源被 unlock。因此有 lock-free 的方法出現，當資源無法使用時，thread 也不會卡在那邊不做事，在某些情況下可以更有效率的利用 cpu cycle。 lock-free [實作方法](http://novoland.github.io/%E5%B9%B6%E5%8F%91/2014/07/26/Lock-Free%20%E7%AE%97%E6%B3%95.html) * 自旋 * 以無限迴圈包住 * 在成功達成目的之前不斷嘗試 * 若失敗則繼續嘗試 * 觀察新值、舊值 * 對於想要更新的對象先取得舊值 * 算出新值 * CAS (compare and swap) * compare 要更動的變數是否仍為舊值 * 若為舊值則可以更新為新值 * 若已被更動過則從新計算新值 ### 如何確保 CAS 成功－ Atomic Builtins 可以利用標準函式庫中支援的原子操作，即使在多線程的情況下，這些操作都會保證正確性，而 CAS 則在 [Atomic Builtins for memory access](https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html) 的歸類中主要用到的函式是： `type __sync_val_compare_and_swap (type *ptr, type oldval, type newval)` 將一個要更新的變數傳入，同時傳入舊值與新值，若變數內仍為舊值，則更新為新值，若變數內已不是舊值，則不做任何更動，且不論何種情況都會回傳舊值。範例： ```clike= int main(){ int a,b,c,d; a = 3; b = 5; c = 1; d = __sync_val_compare_and_swap(&a, b, c); printf("%d %d %d %d\n",a,b,c,d); b = 3; d = __sync_val_compare_and_swap(&a, b, c); printf("%d %d %d %d\n",a,b,c,d); return 0; } ``` results： ``` 3 5 1 3 1 3 1 3 ``` ### 過程在對[ lockfree 範例程式碼](https://github.com/jserv/concurrent-ll/tree/master/src/lockfree)略懂略懂之後，本實作經歷了三個主要階段 * 我相信我可以 * 區區一個 link-list，一定不用半天就可以搞定了 * 但 Bug 多到不知道從哪裡開始找起 * 依樣畫葫蘆 * 還是模仿 ~~(照抄)~~ 範例程式碼中的實作方式 * 成功朝著夢想一步一步邁進 * 感覺差最後一步時卡住半天 * 認真學習 * 從頭檢視並理解程式碼詳細實作原理 * 再無限的 `core dump` 之後，今天開始學著使用 [GDB](https://www.puritys.me/docs-blog/article-329-%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8-GDB-Debug.html) 指令，~~之前只查到進階指令，看不懂就馬上關掉了~~ * 在 gdb 的幫助下，果然成功了呢！ ### 程式碼 [csielee 大大](https://github.com/orgs/sysprog2017team8/people/csielee) 先用 [csielee](https://github.com/csielee/phonebook-concurrent) 的方式幫我們整理重構了原始程式碼，因此需要改變的部份主要分成三個函式 * search 在整個 link-list 中搜尋目標 entry，若有已被刪除的函數也在這裡移除鏈結，在 remove 中僅做標記，回傳目標 entry 或 link-list 的尾巴，並將其中一個 call by reference 的 entry 參數設成目標之前一項，方便 caller 使用 ```clike= while( is_marked_ref(j_next) || strncasecmp(str,j->lastName,len)!=0 ) { if(!is_marked_ref(j_next)) { __sync_val_compare_and_swap(&tmpHead,tmpHead,j); (*left_entry) = j; left_next = j_next; } j = get_unmarked_ref(j_next); if(j == entrytail) break; j_next = j->pNext; } ``` 若該項被標記或與目標不符，則繼續往後搜尋，且每次需拿未標記的位址，否則對標記過得位址存取 data 會 core dump 而其中 strncasecmp 函式只能比較前 len 個字元，當一個字串恰好是另一個的前綴時該怎麼辦？ ```clike= int len = strlen(str); if(strlen(j->lastName) == len) break; ``` 再包上一層迴圈，若 strncasecmp 回傳 0 且兩字串長度相等，才能確定兩字串是真正完全一樣而離開迴圈 ```clike= if(left_next == right) { if(!is_marked_ref(right->pNext)) return right; } else if(__sync_val_compare_and_swap(&((*left_entry)->pNext), left_next, right) == left_next ) { if(!is_marked_ref(right->pNext)) { return right; } } ``` left 為目標前一項，right 為目標，left_next 與 right 相等則表示資料沒被更動過，反之則代表中間有很多被標記為刪除的資料，則將 right 接到 left 之後 * append 將新的 data 接到 link-list 末端，因總資料量太大，因此無檢查是否有重複 ```clike= while(1) { right = search(i, &left, tmpHead); if(strncasecmp(right->lastName,i,len) == 0 && strlen(right->lastName) == len) break; j->pNext = right; if(__sync_val_compare_and_swap(&(left->pNext), right, j) == right) break; } ``` 若得到的目標資料已存在，則不做任何事離開，但我在這邊傳入的 link-list entry 非 link-list 的 head，因此在 search 時並不是從頭開始檢查，即使有重複也檢查不出來 ( right 永遠是 link-list tail ) 若成功更新後就離開迴圈，否則重新計算新值，取得 tail 的前一項 * remove 搜尋目標並在刪除後回傳 1，若搜尋不到則不做任何事並回傳 0 ```clike= while(1) { right = search(lastName, &left, entryHead); if(right == NULL || strncasecmp(right->lastName,lastName,len) != 0){ return 0; } right_succ = right->pNext; if(!is_marked_ref(right_succ)) { if(__sync_val_compare_and_swap(&(right->pNext), right_succ, get_marked_ref(right_succ)) == right_succ){ return 1; } } } ``` 將目標的 pNext 中的資料標記起來參考： [SMP中多线程程序的性能衰退现象之False Sharing](http://www.voidcn.com/blog/yockie/article/p-6328160.html) [Frank Tan's Blog](http://www.cnblogs.com/FrankTan/archive/2010/12/11/1903377.html) ## 分析 |結構|執行方式|資料來源| |:---:|:---:|:---:| |linked-list|多執行緒 分區塊執行|append:人名資料 remove:前一萬筆人名| |concurrent linked-list lock|多執行緒 分區塊執行|append:人名資料 remove:前一萬筆人名| |concurrent linked-list lock-free|多執行緒 分區塊執行|append:人名資料 remove:前一萬筆人名| ### 執行狀況因為我們想要專注在 concurrent linked-list 上對效能的影響因此我們都傳入已經對齊好的文字檔案並且都一次 malloc 好需要的記憶體，讓效能影響的變因減少能夠好好研究 concurrent linked-list 的效能議題 - 呈獻在不同 linked-list 下的 AppendByFile / RemoveByFile 時間 :::warning 因為在 linked-list 難以實作多執行緒的刪除因此時間均為 0 ::: ![](https://i.imgur.com/UcTLasU.png) :::info 觀察後發現在 Append ，lock / lockfree 都有上升推測是因為要處理 race condition 的問題因此執行緒無法全力執行，必定有等待的時間 而在 Remove lock 的 Remove 時間異常的長 lockfree 的 Remove 時間也比 Append 來得長這部份感覺還需要調整一下跟預期的想像不一樣 ::: ## lock-free 優化部份主要優化了兩個部份 * thread_args * 針對 thread_args 使用 memory pool 來實作，一次取得所有會用到的空間，減少 system lock 的情況 * append * 將 append 中一些不需要的程式碼移除，讓各 threads 可以從任何地方 append，不用彼此搶著從同一個地方 append，減少 CAS 的失敗率 #### thread_args memory pool ```clike thread_args_pool = (thread_arg *) malloc(sizeof(thread_arg) * THREAD_NUM); ``` 一次將所需要的空間 malloc 出來，減少 malloc 的呼叫次數 ```clike thread_args[i] = createThread_arg(map + MAX_LAST_NAME_SIZE * i,map + file_size, i,THREAD_NUM, entry_pool + i, thread_args_pool + i); ``` 在 create 新的參數時，傳入事先取得的空間直接使用 ```clike= static thread_arg *createThread_arg(char *data_begin, char *data_end, int threadID, int numOfThread, entry *entryPool,thread_arg *argPool) { argPool->data_begin = data_begin; argPool->data_end = data_end; argPool->threadID = threadID; argPool->numOfThread = numOfThread; argPool->lEntryPool_begin = entryPool; argPool->lEntry_head = argPool->lEntry_tail = entryPool; return argPool; } ``` 從 memory pool 直接取一條 thead_arg 出來使用，取代每次都重新 malloc 空間的方法 #### append ```clike= static void append(void *arg) { thread_arg *t_arg = (thread_arg *) arg; int count = 0, len; entry *left, *right,*tmpHead; entry *j = t_arg->lEntryPool_begin; tmpHead = entryHead; for (char *i = t_arg->data_begin; i < t_arg->data_end; i += MAX_LAST_NAME_SIZE * t_arg->numOfThread, j += t_arg->numOfThread, count++) { left = right = NULL; if(i[strlen(i)-1]=='\n') i[strlen(i)-1] = '\0'; j->lastName = i; while(1) { right = tmpHead->pNext; j->pNext = right; if(__sync_val_compare_and_swap(&(tmpHead->pNext), right, j) == right) break; } tmpHead = j; DEBUG_LOG("thread %d t_argend string = %s\n", t_arg->threadID, new -> lastName); } pthread_exit(NULL); } ``` 每個 threads 都是從 entryHead 開始，但各個 threads 會不斷的將要插入的位置更新為自己上一次插入的 entry 之後，而每個 threads 插入的 entry 位置都不同，因此任兩個 threads 幾乎不會有同時搶著對同一個位置做改變的情況 ## 遇到問題 - 有 memory leak 的問題 - lock 的刪除有正確性的問題 - 無法充份表現 concurrent linked-list - 嘗試用 ThreadSanitizer 去驗證正確性議題 - lock 可以嘗試改用 pthread_mutex ## interface Refactor :::info TODO ::: 統一 lniked-list 的介面 ```clike insertEach(entry *new_element); removeEach(char *name); getuserdata(char *name); ```  <style> .markdown-body h2 { padding-top : 0.15em; padding-bottom : 0.15em; padding-left : 0.2em; background-color : DarkGrey; border: 1px solid transparent; border-radius: 4px; } .markdown-body { } </style>