2017q1 Gruop Work 1 (mergesort-concurrent)

# 2017q1 Gruop Work 1 (mergesort-concurrent) contributed by < `tina0405` >, < `heathcliffYang` > ## 將 merge sort 改成可接收字串排序 ### executable part - 要可以接收字串代表 node_t 裡面的 data 是要可以裝 char ```clike= typedef intptr_t val_t; typedef struct node { val_t data; /**< Data of the node */ struct node *next; /**< Pointer to the next node */ } node_t; ``` 其中在 `stdint.h` 裡面的定義 ```clike= /* Types for `void *' pointers. */ #if __WORDSIZE == 64 # ifndef __intptr_t_defined typedef long int intptr_t; # define __intptr_t_defined # endif typedef unsigned long int uintptr_t; #else # ifndef __intptr_t_defined typedef int intptr_t; # define __intptr_t_defined # endif typedef unsigned int uintptr_t; #endif ``` - main.c 取得資料後的處理 ```clike= static uint32_t build_list_from_file(llist_t *_list, const char *filename) { FILE *fp = fopen(filename, "r"); char buffer[16]; while (fgets(buffer, 16, fp) != NULL) { char *name = (char*)malloc(16); strcpy(name, buffer); list_add(_list, (val_t)name); } fclose(fp); return _list->size; } ``` ==char * -> val_t ??== [pointer](https://www.tutorialspoint.com/cprogramming/c_pointers.htm) & [string](https://www.tutorialspoint.com/cprogramming/c_strings.htm) 1. malloc ```clike= #include <stdlib.h> void *malloc(size_t size); ``` [malloc 說明](https://linux.die.net/man/3/malloc)提到 : The malloc() function **allocates size bytes** and returns a pointer to the allocated memory. The memory is not initialized. 剛好 char 為 1 byte，在 phonebook 作業中，最大的 size 為 16 - list.c 裡面的 list_print() 也需要調整成 char 的形式讓它顯示 ```clike= void list_print(const llist_t * const list) { const node_t *cur = list->head; while (cur) { xprint((char*)cur->data); // adjust to character and disable new line cur = cur->next; } } ``` - merge_sort.c 的比較部分也要改 ```clike= llist_t *small; int tmp = strcmp((char*)(a->head->data), (char*)(b->head->data)); small = (llist_t*) ((intptr_t) a * (tmp <= 0) + (intptr_t) b * (tmp > 0)); ``` 強制轉換參考了 0xff07 的[作業](https://hackmd.io/s/H154LN6og#) & [github](https://github.com/0xff07/mergesort-concurrent)，可以用很簡單的方法、改的部分很少就達到目的真的很厲害!! ### Makefile part ``` # String version test genVerify: sort test_data/input.txt > test_data/result_from_command.txt check_string: ./sort $(THREADS) test_data/input.txt > test_data/result.txt diff test_data/result.txt test_data/result_from_command.txt ``` ``` clean_data: rm -f test_data/input.txt test_data/result.txt test_data/result_from_command.txt ``` ## 輸入資料的分析 ### sort -R 的研究 ### 如何得到均勻分布的亂數排列 - 已知的方法 ``` $ (for i in {1..8}; do echo $RANDOM; done) > test_data/test.txt ``` ### 設計自動測試機制計畫將設計一個小程式，可以運用機率與統計所學以及更多延伸的資料，來測試拿到的資料的特性，例如是否是均勻分布的亂數排列，夠不夠「亂」? 針對這個問題，我想到了 Randomness test。 - Spectral Test - 針對 LCGs (PRNGs 中的一類)，檢測其是否夠 random - NIST 的 15 種 tests - [NIST Test Suite & YW 閱讀筆記](https://hackmd.io/s/r18-RHMAx) 找到了一個網頁 [Test for random number generators](http://i.cs.hku.hk/cisc/projects/va/index.htm#spectral) 提供檢測 RNGs，還有一個 lecture 是 Raj Jain 的 [Testing Random Testing RandomNumber Generators](https://www.cse.wustl.edu/~jain/cse567-08/ftp/k_27trg.pdf)，但正確性有待查證。 1. Standard deviation 先從比較簡單的著手，先分析它的 standard deviation (variance 的平方根) ![](https://i.imgur.com/Jro6ZWR.png) 但是在此次作業的案例來說，已知範圍內每個數字都只出現一次，不適用這個檢測方法 2. Spectral test 要先了解 [LCGs](https://en.wikipedia.org/wiki/Linear_congruential_generator) 是什麼 - Linear congruential generator ![](https://i.imgur.com/8PrpPJY.png) m 是 LCGs 產生的數列的週期長度(period length) ## 程式碼執行與解析 * 在這裡我先建立一個沒有順序數字的資料 input.txt，因為原程式碼並未支援字串。 ~~~ =likec //input.txt 35 66 78 45 55 96 85 22 10 //terminal tina@tina-X550VB:~/mergesort-concurrent$ ./sort 4 input.txt #Total_tasks_consumed: 12 #Elapsed_time: 0.405 ms #Throughput: 29629 (per sec) 10 22 35 45 55 66 78 85 96 ~~~ * 參考[argc和argv](http://jyhshin.pixnet.net/blog/post/26588000-argc%2c-argv-%e8%a7%a3%e6%9e%90%e7%94%a8%e6%b3%95)的解說。 ~~~ 這裡的例子是-> 運行命令為： hello.exe hello world 那麼， argc 的值是 3， argv[0] 是 "hello.exe" ，argv[1]是 "hello" ， argv[2] 是 "world" 。 ~~~ * 所以這邊的意思，就是在3個檔案以下會給我們使用提示： usage: ./sort [thread_count] [input_file] * 而 argv[1] 就是 thread_count 上面放的 4 * 而 argv[2] 就是 input_file 上面放的 input.txt * ((a) < (b)) ? (a) : (b) 這邊是如果 a<b 回傳 a ，則否回傳 b。 ~~~ #define USAGE "usage: ./sort [thread_count] [input_file]\n" #define MIN(a, b) (((a) < (b)) ? (a) : (b)) ... int main(int argc, char const *argv[]) { if (argc < 3) { printf(USAGE); return -1; } thread_count = atoi(argv[1]); /* Read data */ the_list = list_new(); data_count = build_list_from_file(the_list, argv[2]); max_cut = MIN( thread_count, data_count); ... ~~~ * 參考[pthread.h](http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_init.html)解說。 * thread 的 mutex 型態是 pthread_mutex_t. * 在使用前, 記得要初始化: * 對於靜態分配可設置為PTHREAD_MUTEX_INITIALIZER,或呼 pthread_mutex_init，回傳 0 表示此 mutex 被使用（成功）. * 對於動態分配的互斥量, 在申請內存 (malloc) 之後, 通過 pthread_mutex_init 進行初始化, 並且在釋放內存 (free) 前需要呼叫 pthread_mutex_destroy. ~~~ int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr); int pthread_mutex_destroy(pthread_mutex_t *mutex); pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; ~~~ * If attr is NULL, the default mutex attributes are used ~~~ clike= pthread_mutex_init(&(data_context.mutex), NULL); tmp_list = NULL; pool = (tpool_t *) malloc(sizeof(tpool_t)); tpool_init(pool, thread_count, task_run); ~~~ * malloc 好一個空間後開始 tpool_init * 其中 task_run ： * 是一直去詢問有沒有 task ，再指到 task 的 function 和 arg，意思就是分配 thread 下去做事，如果我可以改由 task 主動去要 thread ，就可以減少迴圈的次數。 ~~~ cliike= static void *task_run(void *data __attribute__ ((__unused__))) { while (1) { task_t *_task = tqueue_pop(pool->queue); if (_task) { if (!_task->func) { tqueue_push(pool->queue, _task); break; } else { _task->func(_task->arg); free(_task); } } } pthread_exit(NULL); } ~~~ ### 再來批評一下程式碼先說一下 list, merge_sort, threadpool 等等這些檔案應該要有個別其他資料夾來放置，在找檔案時會比較好看、好找。 #### gprof 檢測不知道是一開始設定的問題，還是程式是用 recursive 的方法做的， call graph 的情況跟我預想的不太一樣，==這需要去找 gprof 的運作方式來查證==。此圖為 4 個 thread ![](https://i.imgur.com/EyETkdR.png) 這可以看出 tqueue_pop 被叫用的次數非常多，一直在 task_run 的詢問，效率就會很差。 #### merge_sort.c 這部分是 merge sort 真正實作的方法，少少的三個函式，其中 merge_sort 是最上層的包裝，由它叫用 merge sort 的第一份工作 split_n_merge : 先用 recursive 的方法去分割 list 直到只剩 1 個 node ；再執行第二份工作 sort_n_merge : 從最小的 list 一路 sort 上來。 - sort_n_merge() 在每次 sort 兩個 list 的時候，都要 malloc 一個新的 list ，這樣對系統造成的 overhead 太高，有改進的空間。 - split_n_merge() 在 concurrency 的實作中，最重要的是讓這些工作可以獨立運行，且不要使被 lock 的 thread 也去影響到其他 thread 的實行， recursive 的 merge sort 寫法前後相依性很高，傳進去的參數、第幾層的叫用都有影響，所以程式可以切割的部份就非常有限，況且 sort 的步驟才是最費時的，這部份無法多執行緒運行是致命傷。 #### main.c 在 task 的分配當中，主要是依照 thread 的數目分 task 並 push 到 queue 裡面，以及最後 merge 各 thread 之間的 list 也會是 queue 裡的 task ，但其實這樣的分配下， queue 裡的 task 不是很多，沒有完全發揮到 task_run 機制。 ## 參考 Concurrent-ll README 中的 Reference --------- [Lock-free linkedlist implementation of Harris' algorithm](https://www.cl.cam.ac.uk/research/srg/netos/papers/2001-caslists.pdf) > "A Pragmatic Implementation of Non-Blocking Linked Lists" > T. Harris, p. 300-314, DISC 2001. --------- * 原本的數字串如下： ![](https://i.imgur.com/4CT1jNs.png) * 如果今天想要 insert Node_20 ，我們會這樣做: ![](https://i.imgur.com/USDNKRB.png) * 但如果今天 insert Node_20 和 delete Node_10 同時發生，就有可能出現以下情況： ![](https://i.imgur.com/yybhpPk.png) * 但我們不樂見，因為我們希望他會走訪 Node_20 * 文中提到的解決方法就是: ~~~ We say that a node is logically deleted after the first stage and that it is physically deleted after the second. ~~~ * logically deleted ：當我們再10的前方掛上一個即將刪除，如此一來沒有真的刪除，所以可以一直運行。 * physically deleted : 再真的刪除之後，就下指令：即將刪除的結點不再去新增Node，如此一來就可以解決以上問題了。 ![](https://i.imgur.com/l5Iukii.png) 1. mark the next field of the deleted node in some way (below, left) 2. excise the node (below, right) * low-order mark bit 即是用來判斷目標 node 是否 logically deleted ### Algorithms * 利用共享記憶體空間的多處理器系統去做讀寫和compare-and-swap ## 嘗試加入 LOCK-FREE 進 list.h * 首先因為參考了[concurrent-ll](https://github.com/tina0405/concurrent-ll) * 我想把我的 llist_t 結構改成有頭( head )有尾( tail ) ~~~ likec= typedef struct { node_t *head; /**< The head of the linked list */ uint32_t size; /**< The size of the linked list */ node_t *tail; } llist_t; ~~~ * new_list 的話就要頭尾串起來 ~~~likec= the_list->head = new_node(INT_MIN, NULL); the_list->tail = new_node(INT_MAX, NULL); the_list->head->next = the_list->tail; the_list->size = 0; ~~~ >> new_node(val_t val, node_t *next); >> NULL should be INT_MIN & INT_MAX which are built in constants. >> Otherwise, data type will be not consistent >>感謝原來如此，已改掉。 [name=Tina] * add_list : 再這邊 add 用在加 input 檔案一個個加進來。 ~~~likec= int list_add(llist_t *the_list, val_t val) { node_t *right, *left; right = left = NULL; node_t *new_elem = new_node(val, NULL); while (1) { right = list_search(the_list, val, &left); if (right != the_list->tail && right->data == val) { return 0; } new_elem->next = right; if (CAS_PTR(&(left->next), right, new_elem) == right) { FAI_U32(&(the_list->size)); return 1; } } } ~~~ * 關於 CAS_PTR 的定義在 atomic_ops_if.h 裡 ~~~ #define CAS_PTR(a, b, c) __sync_val_compare_and_swap(a, b, c) ~~~ * 如果 a 和 b 比一樣的話，就將 c 寫入 a ~~~ #define FAI_U32(a) __sync_fetch_and_add(a, 1) ~~~ * 順便看了一篇文章: [Multithreaded simple data type access and atomic variables](http://www.alexonlinux.com/multithreaded-simple-data-type-access-and-atomic-variables) * 文章裡剛好介紹了一些我不太懂的事：就是我看了一些人寫的程式，他們都說用了gcc內建的 atomic function，就可以使用出無鎖（lock-free）的程式，而 concurrent-ll 中 lock-free 也的確有下列的使用： ~~~likec= // Compare-and-swap #define CAS_PTR(a, b, c) __sync_val_compare_and_swap(a, b, c) #define CAS_U8(a, b, c) __sync_val_compare_and_swap(a, b, c) #define CAS_U16(a, b, c) __sync_val_compare_and_swap(a, b, c) #define CAS_U32(a, b, c) __sync_val_compare_and_swap(a, b, c) #define CAS_U64(a, b, c) __sync_val_compare_and_swap(a, b, c) // Fetch-and-increment #define FAI_U8(a) __sync_fetch_and_add(a, 1) #define FAI_U16(a) __sync_fetch_and_add(a, 1) #define FAI_U32(a) __sync_fetch_and_add(a, 1) #define FAI_U64(a) __sync_fetch_and_add(a, 1) // Fetch-and-decrement #define FAD_U8(a) __sync_fetch_and_sub(a, 1) #define FAD_U16(a) __sync_fetch_and_sub(a, 1) #define FAD_U32(a) __sync_fetch_and_sub(a, 1) #define FAD_U64(a) __sync_fetch_and_sub(a, 1) // Increment-and-fetch #define IAF_U8(a) __sync_add_and_fetch(a, 1) #define IAF_U16(a) __sync_add_and_fetch(a, 1) #define IAF_U32(a) __sync_add_and_fetch(a, 1) #define IAF_U64(a) __sync_add_and_fetch(a, 1) // Decrement-and-fetch #define DAF_U8(a) __sync_sub_and_fetch(a, 1) #define DAF_U16(a) __sync_sub_and_fetch(a, 1) #define DAF_U32(a) __sync_sub_and_fetch(a, 1) #define DAF_U64(a) __sync_sub_and_fetch(a, 1) ~~~ * 文中說：當這些 atomic function 要執行記憶體的拿取時，就會 LOCK 住 FSB （前端匯流排），被鎖住的 FSB 就會預防其他 processor 去執行到它，剛好解釋到這些 function 原來是去鎖住前端匯流排的。 ~~~ FSB stands for Front Serial Bus. This is the bus that processor use to communicate with RAM. I.e. locking FSB will prevent from any other processor (core), and process running on that processor, from accessing RAM. And this is exactly what we need to implement atomic v ariables. ~~~ * 我知道這些後，我想下一步就是把我 main 裏面的執行換成 atomic function，然後就可以改掉那些 mutex 的資源保護，而不是只有在 list.h 裡改那些函式，因為我們不只有加減法還有邏輯運算元和轉換可以使用。 * 附上這次實驗結果，還要繼續改 main.c 檔 * 跟測原檔的 input.txt 是同一個 ~~~likec= //lock-free tina@tina-X550VB:~/mergesort-concurrent$ ./sort 4 input.txt #Total_tasks_consumed: 12 #Elapsed_time: 0.302 ms #Throughput: 39735 (per sec) 10 12 22 35 45 55 66 78 85 96 //original tina@tina-X550VB:~/mergesort-concurrent$ ./sort 4 input.txt #Total_tasks_consumed: 12 #Elapsed_time: 0.405 ms #Throughput: 29629 (per sec) 10 22 35 45 55 66 78 85 96 ~~~ ### 接續 Lock free 與本案例的討論改進的方向有兩個： 1. 把 merge sort 的寫法從 recursion 改成 iteration ，讓工作更容易切割， iteration 的寫法除了可以允許多個 thread 對**同個 list** 做修改，也能減少記憶體需求。 2. 讓 merge 的部份可以多個 thread 一起執行，並且不會出錯。 ## 解決 make check 的 incorrent * 自從加了 lock-free 裡的函式後發現 make check 有問題，但上面跑的結果是對的，因此就想到是不是數據不夠多。 * 馬上來利用 make check 裡有附的亂數產生工具 ~~~ tina@tina-X550VB:~/mergesort-concurrent$ bash scripts/gen-random-numbers.sh 1024 in.txt tina@tina-X550VB:~/mergesort-concurrent$sort -g in.txt > out.txt tina@tina-X550VB:~/mergesort-concurrent$./sort 4 in.txt | tail -n +4 > result.txt tina@tina-X550VB:~/mergesort-concurrent$bash scripts/compare.sh out.txt result.txt ~~~ * 最後我發現錯誤訊息總是出現在重複的數字上。 ~~~ EXAMPLE： //out.txt 1 2 2 3 //result.txt 1 2 3 ~~~ * 原來是因為原本的 add_list 有這麼一個條件我卻沒拿掉： ~~~ if (right != the_list->tail && right->data == val) { return 0; } ~~~ * 如果新加進來的和前一個的數一樣的話，那我們就不做 FAI_U32(&(the_list->size) size 就不加一，相當於重複的數字，我們就只取一個當代表。 ![](https://i.imgur.com/LHMKmCX.png) ## 測試 LOCK 和 MUTEX 工具 * ### 未優化前( thread = 4 , input.txt 為上面的測試檔案) * 來測試之前給的資料 [mutrace](http://0pointer.de/blog/projects/mutrace.html) * 在終端機 git clone ==git://git.0pointer.net/mutrace.git== * README檔上說： ~~~ mutrace profiles lock contention for you. Just use it as prefix for your usual command line and it will profile mutexes used in all child processes. Example: mutrace gedit ~~~ * 所以我就將 mutrace 放置偵測檔案的前頭，不知道有沒有理解錯誤。 * 使用方法：將 mutrace 放置 command line 前方 ~~~ tina@tina-X550VB:~/mergesort-concurrent$ mutrace ./sort 4 in.txt ~~~ * 下面是跑 1024 筆亂數的結果： ~~~ mutrace: Showing statistics for process sort (pid 9173). mutrace: 3 mutexes used. Mutex #2 (0x0x1eb4670) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fbea009b4b2] ./sort(tqueue_init+0x46) [0x40150b] ./sort(tpool_init+0x6a) [0x40176f] ./sort(main+0xe5) [0x401f10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fbe9fad2830] Mutex #0 (0x0x7fbe9d6a1860) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7fbea009b6b9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7fbe9d49dfec] [(nil)] Mutex #1 (0x0x603140) first referenced by: /usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fbea009b4b2] ./sort(main+0xab) [0x401ed6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fbe9fad2830] mutrace: Showing 3 most contended mutexes: Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags 2 143 29 22 0.039 0.000 0.001 Mx.--. 0 20 7 2 0.007 0.000 0.001 M-.--. 1 6 4 0 0.002 0.000 0.000 M-.--. |||||| /||||| Object: M = Mutex, W = RWLock /|||| State: x = dead, ! = inconsistent /||| Use: R = used in realtime thread /|| Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /| Mutex Protocol: i = INHERIT, p = PROTECT / RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC mutrace: Note that the flags column R is only valid in --track-rt mode! mutrace: Total runtime is 5.421 ms. mutrace: Results for SMP with 4 processors. is only valid in --track-rt mode! ~~~ ## Scalability 的分析與製圖 - 參考來源為 [concurrent-ll](https://github.com/heathcliffYang/concurrent-ll) ### src/lockfree - [ ] list.c 1. 以下與 low-order mark bit (logically deleted (1) or not (0))相關的部分，包含 - 判斷是否 marked Reference 中的文件提到關於 is_marked_ref : returns true if and only if r is a marked reference. ```c= int is_marked_ref(long i) { return (int)(i & 0x1L); } ``` 前面部份必會被 mask 為 0 ，只剩下 mark bit ，值則由 i 的 LSB 決定 - mark 與 unmark - unset ```c= i &= ~0x1L; ``` AND (111...0), 代表前面的部分會維持 i 值原來的樣子， LSB 必設定為 0 - set ```c= i |= 0x1L; ``` OR (000...1), 代表前面部分會維持 i 值原來的樣子， LSB 必設定為 1 - 取得目標 marked 或 unmarked 的狀態 - 0x1L ，這個數字的意思是型態為 long integer (4 bytes) 的 hexadecimal 的數字 1 2. list_search 的部分， *left_node 和 left_node_next 只會記錄 unmarked 的 t_next，也就是說， *left_node 的 next (也就是 left_node_next) 永遠不會是指向被 mark 起來的 node 而 search 的結果只會回傳確定 right_node_next 不是 marked 的 right_node ``` CAS_PTR(&((*left_node)->next), left_node_next, right_node); ``` 3. list_add ``` FAI_U32(&(the_list->size)); ``` 這個的意思是 size 加 1 4. list_remove : logically delete ==但在 code 中看不出來哪一個部分有寫到 set_mark== 5. list_delete : empty -> not for now - [ ] include/atomic_ops_if.h __sync_*() 在 [gcc 的線上文件](https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html) 中有提到是 atomic memory access 的內建函式。但要注意的是並不是每個處理器都支援這些 operations 。這些函式大部分都是 full barrier ，也就是說，==no memory operand will be moved across the operation, either forward or backward==，==Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.== 1. `FAI_U32(a)` 則為 `__sync_fetch_and_add(a, 1)` 文件中提到 : These builtins perform the operation suggested by the name, and returns the value that had previously been in memory. That is, ``` { tmp = *ptr; *ptr op= value; return tmp; } { tmp = *ptr; *ptr = ~tmp & value; return tmp; } // nand ``` ==不懂為什麼都要回傳在 operation 之前的值== 2. `CAS_PTR(a,b,c)` 則為 `__sync_val_compare_and_swap (type *ptr, type oldval type newval, ...)` These builtins perform an atomic compare and swap. That is, if the current value of *ptr is oldval, then write newval into *ptr. The “val” version returns the contents of *ptr before the operation. - [ ] src/lockfree/main.c 主要是 main() 裡面先初始化各個需要用到的 object 及變數，再來解讀 command-line 並設定各項參數，接下來是真正要開始執行以及計算 operations 1. thread_data 裡面比較重要的部分 ```c= // counts the number of operations each thread performs unsigned long num_operations; ``` 2. test() 用來產生各 thread 自己的隨機產生數字的 link list - [ ] include/utils.h 在[這個網頁](https://gcc.gnu.org/onlinedocs/gcc-4.0.2/gcc/Type-Attributes.html)的解釋 : This attribute specifies a minimum alignment (in bytes) for variables of the specified type. 不過需要特別注意的是，aligned attribute 的大小會被 linker 限制 (maximum) 1. ALIGNED(64) ```c= ALIGNED(N) __attribute__((aligned(N))) ``` - [ ] test-lock (src/lock/main.c) - 直接執行的輸出訊息 ``` Thread 0 #operations : 253365 #inserts : 12882 #removes : 12881 Duration : 1000 (ms) #txs : 253365 (253365.000000 / s) Expected size: 1024 Actual size: 1024 ``` - txs 是 operations ### Makefile 從 Makefile 來看，`make bench` 可以叫用 benchmarking scripts ``` bench: $(EXEC) bash scripts/run_ll.sh bash scripts/create_plots_ll.sh >/dev/null @echo Check the plots generated in directory 'out/plots'. ``` ### Tools 的 scripts 從 README.md 裡面看到 Tools 的相關描述，發現有許多可以測試實作的 script，下面是一些閱讀理解。 - [ ] run_ll.sh 1. 執行 9 輪的個別參數 ==目前還不知道參數的代表意義== ``` initials="128 1024 8192"; updates="0 10 50" ``` 2. 設定輸出的紀錄檔 : `out="$out_dir/ll.i$initial.u$update.dat"` 3. 執行另外一個 script : scalability2.sh ``` ./scripts/scalability2.sh "$cores" \ ./out/test-lock ./out/test-lockfree \ -d$duration -i$initial -r$range -u$update | tee $out; ``` - [ ] scalability2.sh 原本的範例用法為 `scalability2.sh all ./out/test-lock ./out/test-lockfree -i100` 原本 `make bench` 的輸出 ``` *** Using as a default number of cores: 4 on 1 socket *** bash scripts/run_ll.sh * -i128 ** -u0 ## Use 4 as the number of cores. ## If not correct, change it in scripts/config # ./out/test-lock ./out/test-lockfree #cores throughput %linear scalability throughput %linear scalability 1 2035637 100.00 1 6648282 100.00 1 2 1255928 30.85 0.62 13058967 98.21 1.96 3 1560506 25.55 0.77 15337146 76.90 2.31 4 1774850 21.80 0.87 16416128 61.73 2.47 ``` - throughput ``` thr1a=$($run_script ./$prog $params -n1 | grep "#txs" | cut -d'(' -f2 | cut -d. -f1); ``` 這是要取 throughput 的值，擷取 #txs 括號裡面 `operations * 1000.0 / duration` 數字，並且是只取整數部分；其中， cut 是擷取輸出的指令， -d 的意思是以 " ( " 為分隔， -f2 是取分隔後的第二個部分 - scability : 以單核時的 throughput 為基準計算當 core 數增加之後， throughput 變化的比例 `$thr/$thr1` - %linear ==這邊算法的意義還不太懂== `100*(1-(($c-$scl)/$c))` - [ ] test_correctness.sh 先執行 scripts 裡面的 lock_exec & config 再執行 ./out/test-lock & ./out/lockfree >> 看不太懂 lock_exec 裡面的 /tmp/lock_exec_$(whoami) 是指什麼? - [ ] scalability1.sh : 原本的範例用法為 `scripts/scalability1.sh all ./out/test-lock -i128` - [ ] create_plots_ll.sh ### Scalability test implementation ![](https://i.imgur.com/NX9O7Yp.png) ![](https://i.imgur.com/O5hLSAH.png) ![](https://i.imgur.com/AL4BaQF.png) ![](https://i.imgur.com/OqbaWvo.png) - 此種 lock free 演算法在本案例中實現的可能性探討此種 linked list 的 lock free 演算法，是在處理同時有 delete 與 insert 需求，確保其正確性、降低 lock 的需求。但是在本次 merge sort 的情況，我們只用到 add，並沒有 delete ，如果不加修改就直接用上這個演算法的 code ，反而會降低執行效率，因為 add 之中對 marked node 的檢查是相對複雜的。 ### 學習紀錄 - trap : 一個 shell 內建的 function，可以定義 handlers 以回應收到的硬體訊號或其他事件。 `trap [-lp] [[ARG] SIGNAL_SPEC...]` - ARG 是一個指令在收到 SIGNAL_SPEC 後執行，若沒有 ARG 或只是一個 "-" ，每次收到 SIGNAL_SPEC 之後都會 reset 到他原本的值。 - common signals : 這個是 [Sending and Trapping Signals](http://mywiki.wooledge.org/SignalTrap)的介紹 | Name | Number | Meaning| | ------------- |:-------------:|:-------| |HUP |1 |Hang Up. The controlling terminal has gone away.| |INT |2 |Interrupt. The user has pressed the interrupt key (usually Ctrl-C or DEL).| |QUIT |3 |Quit. The user has pressed the quit key (usually Ctrl-\). Exit and dump core.| |KILL |9 |Kill. This signal cannot be caught or ignored. Unconditionally fatal. No cleanup possible.| |TERM |15 |Terminate. This is the default signal sent by the kill command.| |EXIT |0 |Not really a signal. In a shell script, an EXIT trap is run on any exit, signalled or not.| - shell scripts : - $# ：參數的個數 - "$@" ： "$1" "$2" "$3" "$4" (所有參數)，每個變數是獨立的(用雙引號括起來)； - "$*" ： "$1c$2c$3c$4" 』，其中 c 為分隔字元，預設為空白鍵 - shift：造成參數變數號碼偏移 - bc : An arbitrary precision calculator language - c 裡面 constant 的解釋 It may have a **prefix that specifies its base** and **a suffix that specifies its type**. - Call graph command ``` sudo apt install graphviz gprof ./sort | gprof2dot | dot -T png -o out/plot/gprof1.png ``` ## 新的 lock free 實例 - LockLess Github: [LockLess](https://github.com/DeYangLiu/LockLess) ### Reference 閱讀這篇文章主要是在改善時代回收演算法的效能，時代演算法是垃圾回收演算法的一種，在實作 lock free #### 垃圾回收演算法 1. 時代回收演算法 (Epoch-based algorithm) ## 參考資料 1. [sort man page](http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html#sort-invocation) 2. [The Design of LIL Tests for (Pseudo) Random Generators and Some Experimental Results](http://webpages.uncc.edu/yonwang/papers/liltest.pdf) 3. http://ws680.nist.gov/publication/get_pdf.cfm?pub_id=906762 4. http://jyhshin.pixnet.net/blog/post/26588000-argc%2c-argv-%e8%a7%a3%e6%9e%90%e7%94%a8%e6%b3%95 5. http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_init.html 6. http://columns.chicken-house.net/2007/12/14/threadpool-%e5%af%a6%e4%bd%9c-1-%e5%9f%ba%e6%9c%ac%e6%a6%82%e5%bf%b5/ 7. http://www.computerhope.com/unix/utrap.htm 8. http://linux.vbird.org/linux_basic/0340bashshell-scripts.php#dis3 9. http://www.alexonlinux.com/multithreaded-simple-data-type-access-and-atomic-variables 10. https://darkautism.blogspot.tw/2017/02/lock-free-queue.html 11. [hugikun999 推薦的 Lock free stack](http://nullprogram.com/blog/2014/09/02/) ## 想要看的文件 1. http://hunch.net/~jl/projects/interactive/sidebandits/bandit.pdf 2. https://aturon.github.io/blog/2015/08/27/epoch/ 3. http://concurrencykit.org/presentations/lockfree_introduction/#/ 4. http://www.rdrop.com/~paulmck/RCU/hart_ipdps06.pdf 5. http://www.cs.toronto.edu/~tomhart/papers/tomhart_thesis.pdf 6. https://mechanical-sympathy.blogspot.tw/2011/07/memory-barriersfences.html