2016q3 Homework3 (mergesort-concurrent)

# 2016q3 Homework3 (mergesort-concurrent) contributed by <`LanKuDot`> # 閱讀筆記 ## [計算機組織結構](https://hackmd.io/s/H1sZHv4R) * [Function compelete](https://en.wikipedia.org/wiki/NAND_logic)：只用自己就可以拼出所有邏輯的邏輯閘 * caller：呼叫者；callee：被呼叫者 * 將 ALU31 的 Set bit 接回 ALU0 的 less 是為了 set on less than：撿查 less than 最快就是用減法，所以 a - b < 0 (a < b) 的話 bit 31 是 set，所以接回 ALU0 用 Multiplexer 選 less output (less 是直接接到 result multiplexer 的)，就會輸出 000...001。[參考](http://www-classes.usc.edu/engr/ee-s/457/EE457_Classnotes/EE457_Chapter4/HP_AppB.pdf) * biased notation (位移)：可表示範圍最小的值用 000..00 表示，最大值用 111..11 表示。可表示範圍為 -x ~ 2^N^-1-x 2's complement (補數)：000..00 為 0，111..11 為 -1。可表示範圍為 -2^N-1^ ~ 2^N-1^-1 * [P35. Floating-point 算法詳細](http://sc.tamu.edu/systems/hydra/SC-FP.pdf) * Latency：做某一項動作所需的時間；Thoughput：每單位時間可以做的工作量 * Pipeline 讓很多動作可以一起作，所以是增加 thoughput，latency 是不變的 * 你可以一次烤五片披薩 (thoughput)，但是每片披薩要烤的時間 (latency) 不變 ### 提問的回答 - [x] 為什麼在步驟一要先左移一個 bit? * 我猜是因為會除都是從除 2 開始，所以可以先將 Remainder 左移 - [x] 為什麼 func3, func2, func1, func0 不是 don’t care? * 因為還要考慮到 ALUop1 為 1 但 ALUctr0 為 0 的情況，參考真值總表。 ## [Designing Threaded Programs](http://maxim.int.ru/bookshelf/PthreadsProgram/htm/r_19.html) 沒有絕對的 thread model ### Boss/Worker Model * Boss thread * 建立所有 worker threads * 接收任務，並派送給 worker thread * Worker thread * 由 worker thread 判斷要怎麼處理任務及資料 * 必要的話同步共享資源的資料 * Boss thread 接收到新工作時才 create worker thread，或是先建立好一群 worker threads (thread pool) * 合適 server 的運作 * 注意要 * 減少 worker thread 與 boss thread 的溝通頻率 * 避免所有 worker thread 存取共享資源 ### Peer Model (Workcrew model) * 與 Boss/Worker 不一樣的是 main thread 在建立完 peer thread 之後，也會開始執行任務。 * 每個 thread 都提前(?)知道有 input，可以自己取得任務，或是與其他 peer 共用同一個 input 來源 * 合適固定或是設計良好的 input，如：矩陣乘法、平行搜索資料庫 * 經典問題：可以等分切割的 input，每個 thread 為了修改下一個 thread 處理完的邊界 (bound) 資料，就必須要同步 (確認取得的資料是處理完的) ### Pipeline Model * 工廠的生產線 * Instruction pipeline * Piepeline model 假設 * 輸入是一長串 (stream) 的 * 每個輸入都有一連串的 stages/filters 要處理 * 每個處理的單位在不同時間點可以處理不同的輸入 * 適合需要一連串 stages 處理的資料，如：Image processing * 可以自由增減 stages，也可以開多個 thread 處理同一個 stage * 缺點：如果前一個 stage 處理太慢，後面的 stage 就會沒事做 # 作業 ## UNIX command * `sort`：將輸入資料排序 * `-R` flag：隨機排序 * `uniq`：可以用來將重複的行刪除而只顯示一個 * `-c` flag：顯示重複字詞的出現次數 * `wc`：列出檔案有多少行，多少單字，多少字元 * `-l/w/m`：僅列出行、單字、字元 ## 程式碼 ### Makefile * `:=`：[Simply expanded variables](https://www.gnu.org/software/make/manual/html_node/Flavors.html)，當 reference 這類變數時，其值等於當下被定義的值。 * Recursively expaned variables：當 reference 變數類變數時，Makefile 會去尋找最終可以替換的字串，允許最終字串宣告在 reference 之後。 * `-MMD`：`-M` 的縮減板，列出指定檔案所需要的其他 header file，但只列 user 定義的 header files * `-MF <file>`：配合 `-M` 等 flag 使用，將所需的 header file 輸出到 `file`。[參考](https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html) * `-include <file>`：告訴 Makefile 先忽略找不到 `<file>` 的問題，除非無法完成最後要完成的 target * `-rdynamic`：指定 linker 將所有 label 寫入到 dynamic symbol table。[參考](https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html) * 也就是說平常編譯有可能不會將所有 label 都放入到 dynamic symbol table (.dynsym) ### `list.c` * `intptr_t`：幫助轉換來自 void pointer 所指向的值。 ```c // stdint.h #if __WORDSIZE == 64 # ifndef __intptr_t_defined typedef long int intptr_t; # define __intptr_t_defined # endif typedef unsigned long int uintptr_t; #else # ifndef __intptr_t_defined typedef int intptr_t; # define __intptr_t_defined # endif typedef unsigned int uintptr_t; #endif ``` #### 問題 & Refactoring - [x] `list_add` 的註解提到如果要新增的 value 已經存在的話就不會新增這個 node，但是實做沒有對應的檢查。 * 因為輸入的資料確認是不重複的資料，所以先選擇將這段註解除掉 - [x] `list_add` 的回傳值永遠是 0 * 改為回傳新增 node 之後的 list 長度 - [x] `list_nth` 實際上是回傳第 n+1 個元素，因此當 `idx == list->size` 時回傳是 `NULL` * 為了符合程式人的習慣，會回傳 index 為 `idx` 的元素 (非第 idx 個)，並改名為 `list_get` - [x] 沒有提供 free list 的 function，雖然在程式結束時仍然有有效的 pointer 指向這些區塊，但為求保險還是 free 掉 ### threadpool * `void (*func)(void *)`：pointer to function `void foo(void *arg)` * 要將 `(*func)` 括號起來，否則是宣告一個會回傳 pointer to void 的函式 `void *func(void *)` >> 在 C99/C11 中，有個重要的概念叫做 function designator，詳見 [Value categories](http://en.cppreference.com/w/c/language/value_category) [name=jserv] >> 我一開始設計時是希望能做到像pipe那樣，可以把前一個task的回傳值傳給下一個task，所以在設計的時候的確是要宣告一個回傳pointer to void的函式，只是後來忘記實作就是了...[name=LuisHsu] >> `func` 不是要指定工作開始的函式嗎，然後上個工作回傳的內容透過 `arg` 傳遞。`_task->func(_task->arg);` 像這樣的用法[name=LanKuDot] * 每一個 thread 都是 joinable 的 [\[pthread_create(3)\]](http://man7.org/linux/man-pages/man3/pthread_create.3.html) * Joinable thread：在該 thread 結束之後，其他 thread 可以透過 `pthread_join` 來取得回傳值 * Detached threas：在該 thread 結束之後，其資源自動被系統收走，也無法取得其回傳值。 #### 問題 & Refactoring - [ ] `task_free()` 及 `tqueue_init()` 等函式都只會回傳 0，除非有特別意義，否則可以不用回傳值 * `tqueue_init()` 增加檢查傳入 pointer 是否有效，如果是 `NULL` 則回傳 -1 - [x] `task_free()` 短短的但問題很多 * 程式裡面沒有被用到 * 就算 `task_t` 是 `NULL` 不代表 `task_t->arg` 也是 `NULL` → 第一個 `free()` 會有問題 * 不確定 `task_t->arg` 是不是在 task 結束後還會用到 * 會用到：例如被分配到的 list 區段，task 結束後要合併 * 不會用到：例如僅限於該 task 使用的資訊 * 也有可能不是動態配置的值 * 太危險了，請程式設計者自己決定要在哪裡 free 掉吧 - [x] queue 操作不對，pop 從後面，push 從前面 * Enqueue means to insert an item into the back of the queue, dequeue means removing the front item. [source](https://www.cs.cmu.edu/~adamchik/15-121/lectures/Stacks%20and%20Queues/Stacks%20and%20Queues.html) * Cplusplus 對 [dequeue](http://www.cplusplus.com/reference/queue/queue/pop/) 及 [enqueue](http://www.cplusplus.com/reference/queue/queue/push/) 操作的說明 - [x] `tqueue_init()` 不會配置記憶體給 task queue，只會初始化其 member，應該要檢查傳入指標是否有效 - [x] 用 <s>single linked-llist</s> singly-linked list 也可以實現 queue * 利用 `grep` 來檢查 double linked list 的 `last` pointer 使用，發現都只有 assign。 * `-inr`：ignore case; show line number; recursive ``` $ grep -inr --exclude=*.txt 'last' . ./threadpool.h:21: struct _task *last; ///< Pointer to the previous task ./threadpool.h:39: task_t *tail; ///< Pointer to the last _task\_t_ in queue 二進位格式檔案 ./.threadpool.h.swp 符合 ./threadpool.c:47: the_queue->head->last = NULL; ./threadpool.c:79: task->last = the_queue->tail; 二進位格式檔案 ./.threadpool.c.swp 符合 ``` >> "singly-linked list" 才是正確寫法，請注意 [name=jserv] >> 了解[name=LanKuDot] ### `main.c` 1. 讀入資料，存到 linked list `the_list` 2. 建立 thread pool * 每個 thread 都會執行 `task_run()` * `task_run()` 是一個無限迴圈會一直從 task queue 請求工作，中止條件為收到對應函式為 `NULL` 的工作。執行完成後會將 `task_t` free 掉。 * `(void) data;` 是為了避免編譯器發出未使用參數的警告，表明在函式中不會使用這個參數。[參考](http://stackoverflow.com/questions/4647665/why-cast-an-unused-function-parameter-value-to-void) >> 可透過 gcc 的 `__attribute__ ((unused))` 來抑制警告，請見 [unused parameter warnings in C code](http://stackoverflow.com/questions/3599160/unused-parameter-warnings-in-c-code) [name=jserv] 3. 建立第一個工作 `cut_func()` 切割 `the_list` * 切割次數為 `MIN(thread_count, data_count) - 1`，當 data 比 thread 多時，可以將資料平均分配給 thread。(4 個 thread 分 3 次) * 將切割完的 list push 到 task queue * 雖然還是 push `cut_func()`，但是當切割次數滿了之後就會直接執行 `merge_sort` 和 `merge` 4. 主要運作函式有三 1. `merge_sort()`：純粹將取得的 list 一直對半分，直到長度小於 2。結束時呼叫 `merge_list()` 組合左右兩邊並回傳組合後的 list。 2. `merge_list()`：merge local list，將左右兩個 list 做排序合成一個 list，會回傳新的 `llist_t`，傳入的兩個會被 free 掉。 3. `merge()`：merge lists between threads。地一個進來的 thread 會檢查 `tmp_list` 是否為空，有則暫存自己的 local list，如果沒有則取得暫存的 list 和自己的 local list，配到 `merge_list()` 作為下一個工作。 #### 問題 & Refactoring - [x] 在計算 `max_cut` 算式過於冗長，可以用 `MIN` macro 取代 - [x] `cut_func` 在分左右時，註解寫左邊但傳入的是右半的 list，右邊則傳入左半 - [x] main.c 過長，可以將 merge sort 的程式碼移至獨立檔案 * 將 `merge_list()` 與 `merge_n_sort()` 移到 `merge_sort.c/h` * 並重新命名 `merge_list()` 為 `merge_n_sort()`，`merge_sort()` 為 `split_n_merge()` - [ ] 在 `merge_list()` 中被 free 掉的 `llist_t`，其指標應設為 `NULL`，避免被使用 - [ ] merge 兩個 thread 的 list 有沒有比較好的方法? ### Critical section * task queue * `tmp_list` (for merging lists between threads) * 被 access 太多次，會拖慢平行化的效能 * `cut_count` ## Scalability * Horizontal scalar (scale out/in)：新增/減少節點到系統中。例如：分散式運算，由多台電腦共同執行任務 * Vertical scalar (scale up/down)：增強/減弱單台電腦的性能。例如：使用多核心研究 [concurrent-II](https://github.com/jserv/concurrent-ll) 如何計算 scalabilty ### `scalability.sh` 執行 1~N 個核心的運行測試，並與 1 核心的執行成果作比較。評估 vertical scalability > `scalability1.sh` 與 `scalability2.sh` 差別只有一次執行一個或兩程式 #### 運作 1. 執行時須帶有參數，第一個參數指定 `core`，`shift` 會讓所有參數左移 (原本第二個參數會變第一個) 2. 執行 `config` 及 `lock_exec` script 3. 讀取指定執行程式名到 `prog`，剩下後接的參數到 `params` 4. 運行 1~`core` 核心的測試，並輸出與 core 1 的執行結果比較 5. 一共有四個資訊會輸出 (下標 N 為執行的 core 數) a. #cores：執行的核心數 b. throughput：每單位時間的執行工作數量 c. %linear：$(1-\frac{N-scalaility}{N})\times100\%=(\frac{scalability}{N})\times100\%$，每個 core 的平均 scalability in percent d. scalability：$\frac{throughput_N}{ throughput_1}$，throughput 增加比 6. 執行 `unlock_exec` script ### `lock_exec` 新增一個 `/tmp/lock_exec_$(whoami)` 的空檔案，並設定如果接收到 `SIGINT` 則把這個檔案刪除。但如果在新增檔案的時候該檔案存在的話，就不允許新開的測試進行 (代表前一個測試程式還在運作) * `trap command signal`：如果接收到 `signal` 就執行 `command` * `2> /dev/null`：將 `stderr(fd=2)` 的訊息導入 `/dev/null`，也就是不輸出 ### `unlock_exec` 刪除 `tmp/lock_exec_$(whoami)` > 不知道這兩個 script 的功用，主程式也沒有使用到，純粹創檔刪檔!?[name=LanKuDot] >> 我懂了，目的在於防止使用者同時進行同一個測試。[name=LanKuDot] ### `config` 將從 `run_II.sh` 傳入的 `cores` 指定的項目轉為對應的數列 * [uname](https://linux.die.net/man/1/uname) ## 實作 * 參考[吳彥寬的共筆](https://hackmd.io/s/BykQDVTp)實作 `GENERIC_PRINTF` 的 macro，以針對不同的資料型態作對應的輸出 ### Scalability * 測時：用 [`gettimeofday(2)`](http://man7.org/linux/man-pages/man2/gettimeofday.2.html) 來測量 elapsed time，從第一個 task 被 push 到 queue 中開始。 * 查到可以用 [`getrusage(2)`](https://linux.die.net/man/2/getrusage) 來得到 kernel time 和 user time，用 `USAGE_SELF` 會得到所有 thread 的使用時間**加總**。用 `USAGE_THREAD` 則會得到 calling thread 的使用時間，但 main thread 在等待 join 的這段時間都會一直使用 CPU 嗎? * 在 `task_pop()` 中來計算被消耗的 task 數，新增 member `num_of_consumed` 到 `tqueue_t` 中，在呼叫 `tqueue_free()` 時會回傳這個值。 * 輸出 ``` #Total_tasks_consumed: 14 #Elapsed_time: 0.648 ms #Throughput: 21604 (per sec) ``` * 嘗試之後發現，thoughput 的計算出來會有問題，見提問的探討 #### 提問 - [x] 在 concurrent-mergesort 中，就算固定輸入的資料量，但是會因為 thread 數的增減，讓一個 task 工作量有變化 (分到的 local list 的長度會不同)，這樣 thoughput 的量測公平嗎? * 不公平。因為 thread 的增加讓 list 切的更瑣碎，讓 task 數量增加，而且每個 task 的工作量變小，造成雖然執行時間變高，thoughput 也變高的情況 (但其實不同 thread 數量下的 task 工作量不同) ```bash $ ./sort 4 /tmp/run_input.txt #Total_tasks_consumed: 14 #Elapsed_time: 43.012 ms #Throughput: 325 (per sec) $ ./sort 64 /tmp/run_input.txt #Total_tasks_consumed: 254 #Elapsed_time: 447.635 ms <--- 時間大大增加 #Throughput: 567 (per sec) ``` - [ ] 不同的 task 能否相加當作總 thoughtput? 像 local list 的 merge sort 與 thread 之間的 list 的merge sort 是不同的工作 ### 用專門切割 list 的 task 取代交給每個 task 切為了減少 critical section 的使用，所以作一點變化，不讓個別 task 去切 list，而是建立一個專門切割 list 的 task。只要切好 local list 就會 push 到 task queue。 ### 建立自動化結果測試我在 makefile 中新增一個 `check` target 用以測試排序完的結果是否正確，流程如下： 1. 透過 script `gen-random-numebrs.sh` 產生隨機變數並輸出到測試檔案中 2. 使用 `sort -g` 將測試檔案作排序以作為 ground truth 3. 執行主程式，讀取測試檔案，並輸出結果到另一個檔案 4. `compare.sh` 比較輸出檔與 ground truth，並提示 verify 的結果 ### 讓程式可以接受字串輸入 * 為了不改變 `node->data` 的結構，所以只能是指標或是 `long int`，想要用 hw2 的方法，將輸入資料 map 到 process 的虛擬記憶體，再讓 thread 使用 ## 測試讓程式去排序 10000 筆數字，4 thread * Baseline：3.233 msec * 將切割交給專用的 task：3.216 msec ## 參考資料 * [Scalability, Wikipedia](https://en.wikipedia.org/wiki/Scalability) * [What does it mean scalability?, StackOverflow](http://stackoverflow.com/questions/9420014/what-does-it-mean-scalability) * [Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl%27s_law) ###### tags: `sysprog2016`