2016q3 Homework3 (mergesort-concurrent)

contributed by <LanKuDot>

閱讀筆記

計算機組織結構

Function compelete：只用自己就可以拼出所有邏輯的邏輯閘
caller：呼叫者；callee：被呼叫者
將 ALU31 的 Set bit 接回 ALU0 的 less 是為了 set on less than：撿查 less than 最快就是用減法，所以 a - b < 0 (a < b) 的話 bit 31 是 set，所以接回 ALU0 用 Multiplexer 選 less output (less 是直接接到 result multiplexer 的)，就會輸出 000…001。參考
biased notation (位移)：可表示範圍最小的值用 000..00 表示，最大值用 111..11 表示。可表示範圍為 -x ~ 2^N-1-x
2's complement (補數)：000..00 為 0，111..11 為 -1。可表示範圍為 -2^N-1 ~ 2^N-1-1
P35. Floating-point 算法詳細
Latency：做某一項動作所需的時間；Thoughput：每單位時間可以做的工作量
- Pipeline 讓很多動作可以一起作，所以是增加 thoughput，latency 是不變的
- 你可以一次烤五片披薩 (thoughput)，但是每片披薩要烤的時間 (latency) 不變

提問的回答

為什麼在步驟一要先左移一個 bit?
- 我猜是因為會除都是從除 2 開始，所以可以先將 Remainder 左移
為什麼 func3, func2, func1, func0 不是 don’t care?
- 因為還要考慮到 ALUop1 為 1 但 ALUctr0 為 0 的情況，參考真值總表。

Designing Threaded Programs

沒有絕對的 thread model

Boss/Worker Model

Boss thread
- 建立所有 worker threads
- 接收任務，並派送給 worker thread
Worker thread
- 由 worker thread 判斷要怎麼處理任務及資料
- 必要的話同步共享資源的資料
Boss thread 接收到新工作時才 create worker thread，或是先建立好一群 worker threads (thread pool)
合適 server 的運作
注意要
- 減少 worker thread 與 boss thread 的溝通頻率
- 避免所有 worker thread 存取共享資源

Peer Model (Workcrew model)

與 Boss/Worker 不一樣的是 main thread 在建立完 peer thread 之後，也會開始執行任務。
每個 thread 都提前(?)知道有 input，可以自己取得任務，或是與其他 peer 共用同一個 input 來源
合適固定或是設計良好的 input，如：矩陣乘法、平行搜索資料庫
- 經典問題：可以等分切割的 input，每個 thread 為了修改下一個 thread 處理完的邊界 (bound) 資料，就必須要同步 (確認取得的資料是處理完的)

Pipeline Model

工廠的生產線
Instruction pipeline
Piepeline model 假設
- 輸入是一長串 (stream) 的
- 每個輸入都有一連串的 stages/filters 要處理
- 每個處理的單位在不同時間點可以處理不同的輸入
適合需要一連串 stages 處理的資料，如：Image processing
可以自由增減 stages，也可以開多個 thread 處理同一個 stage
缺點：如果前一個 stage 處理太慢，後面的 stage 就會沒事做

作業

UNIX command

sort：將輸入資料排序
- -R flag：隨機排序
uniq：可以用來將重複的行刪除而只顯示一個
- -c flag：顯示重複字詞的出現次數
wc：列出檔案有多少行，多少單字，多少字元
- -l/w/m：僅列出行、單字、字元

程式碼

Makefile

:=：Simply expanded variables，當 reference 這類變數時，其值等於當下被定義的值。
- Recursively expaned variables：當 reference 變數類變數時，Makefile 會去尋找最終可以替換的字串，允許最終字串宣告在 reference 之後。
-MMD：-M 的縮減板，列出指定檔案所需要的其他 header file，但只列 user 定義的 header files
-MF <file>：配合 -M 等 flag 使用，將所需的 header file 輸出到 file。參考
-include <file>：告訴 Makefile 先忽略找不到 <file> 的問題，除非無法完成最後要完成的 target
-rdynamic：指定 linker 將所有 label 寫入到 dynamic symbol table。參考
- 也就是說平常編譯有可能不會將所有 label 都放入到 dynamic symbol table (.dynsym)

`list.c`

intptr_t：幫助轉換來自 void pointer 所指向的值。

// stdint.h
#if __WORDSIZE == 64
# ifndef __intptr_t_defined
typedef long int        intptr_t;
#  define __intptr_t_defined
# endif
typedef unsigned long int   uintptr_t;
#else
# ifndef __intptr_t_defined
typedef int         intptr_t;
#  define __intptr_t_defined
# endif
typedef unsigned int        uintptr_t;
#endif

問題 & Refactoring

list_add 的註解提到如果要新增的 value 已經存在的話就不會新增這個 node，但是實做沒有對應的檢查。
- 因為輸入的資料確認是不重複的資料，所以先選擇將這段註解除掉
list_add 的回傳值永遠是 0
- 改為回傳新增 node 之後的 list 長度
list_nth 實際上是回傳第 n+1 個元素，因此當 idx == list->size 時回傳是 NULL
- 為了符合程式人的習慣，會回傳 index 為 idx 的元素 (非第 idx 個)，並改名為 list_get
沒有提供 free list 的 function，雖然在程式結束時仍然有有效的 pointer 指向這些區塊，但為求保險還是 free 掉

threadpool

void (*func)(void *)：pointer to function void foo(void *arg)
- 要將 (*func) 括號起來，否則是宣告一個會回傳 pointer to void 的函式 void *func(void *)

在 C99/C11 中，有個重要的概念叫做 function designator，詳見 Value categories jserv
我一開始設計時是希望能做到像pipe那樣，可以把前一個task的回傳值傳給下一個task，所以在設計的時候的確是要宣告一個回傳pointer to void的函式，只是後來忘記實作就是了…LuisHsu
func 不是要指定工作開始的函式嗎，然後上個工作回傳的內容透過 arg 傳遞。_task->func(_task->arg); 像這樣的用法LanKuDot

每一個 thread 都是 joinable 的 [pthread_create(3)]
- Joinable thread：在該 thread 結束之後，其他 thread 可以透過 pthread_join 來取得回傳值
- Detached threas：在該 thread 結束之後，其資源自動被系統收走，也無法取得其回傳值。

問題 & Refactoring

task_free() 及 tqueue_init() 等函式都只會回傳 0，除非有特別意義，否則可以不用回傳值
- tqueue_init() 增加檢查傳入 pointer 是否有效，如果是 NULL 則回傳 -1
task_free() 短短的但問題很多
- 程式裡面沒有被用到
- 就算 task_t 是 NULL 不代表 task_t->arg 也是 NULL → 第一個 free() 會有問題
- 不確定 task_t->arg 是不是在 task 結束後還會用到
  - 會用到：例如被分配到的 list 區段，task 結束後要合併
  - 不會用到：例如僅限於該 task 使用的資訊
  - 也有可能不是動態配置的值
- 太危險了，請程式設計者自己決定要在哪裡 free 掉吧
queue 操作不對，pop 從後面，push 從前面
- Enqueue means to insert an item into the back of the queue, dequeue means removing the front item. source
- Cplusplus 對 dequeue 及 enqueue 操作的說明
tqueue_init() 不會配置記憶體給 task queue，只會初始化其 member，應該要檢查傳入指標是否有效

用 ~~single linked-llist~~ singly-linked list 也可以實現 queue

利用 grep 來檢查 double linked list 的 last pointer 使用，發現都只有 assign。
-inr：ignore case; show line number; recursive

$ grep -inr --exclude=*.txt 'last' .
./threadpool.h:21:    struct _task *last;     ///< Pointer to the previous task
./threadpool.h:39:    task_t *tail;           ///< Pointer to the last _task\_t_ in queue
二進位格式檔案 ./.threadpool.h.swp 符合
./threadpool.c:47:            the_queue->head->last = NULL;
./threadpool.c:79:    task->last = the_queue->tail;
二進位格式檔案 ./.threadpool.c.swp 符合

"singly-linked list" 才是正確寫法，請注意 jserv
了解LanKuDot

`main.c`

讀入資料，存到 linked list the_list
建立 thread pool
- 每個 thread 都會執行 task_run()
- task_run() 是一個無限迴圈會一直從 task queue 請求工作，中止條件為收到對應函式為 NULL 的工作。執行完成後會將 task_t free 掉。
- (void) data; 是為了避免編譯器發出未使用參數的警告，表明在函式中不會使用這個參數。參考

可透過 gcc 的 __attribute__ ((unused)) 來抑制警告，請見 unused parameter warnings in C code jserv

建立第一個工作 cut_func() 切割 the_list
- 切割次數為 MIN(thread_count, data_count) - 1，當 data 比 thread 多時，可以將資料平均分配給 thread。(4 個 thread 分 3 次)
- 將切割完的 list push 到 task queue
- 雖然還是 push cut_func()，但是當切割次數滿了之後就會直接執行 merge_sort 和 merge
主要運作函式有三
1. merge_sort()：純粹將取得的 list 一直對半分，直到長度小於 2。結束時呼叫 merge_list() 組合左右兩邊並回傳組合後的 list。
2. merge_list()：merge local list，將左右兩個 list 做排序合成一個 list，會回傳新的 llist_t，傳入的兩個會被 free 掉。
3. merge()：merge lists between threads。地一個進來的 thread 會檢查 tmp_list 是否為空，有則暫存自己的 local list，如果沒有則取得暫存的 list 和自己的 local list，配到 merge_list() 作為下一個工作。

問題 & Refactoring

在計算 max_cut 算式過於冗長，可以用 MIN macro 取代
cut_func 在分左右時，註解寫左邊但傳入的是右半的 list，右邊則傳入左半
main.c 過長，可以將 merge sort 的程式碼移至獨立檔案
- 將 merge_list() 與 merge_n_sort() 移到 merge_sort.c/h
- 並重新命名 merge_list() 為 merge_n_sort()，merge_sort() 為 split_n_merge()
在 merge_list() 中被 free 掉的 llist_t，其指標應設為 NULL，避免被使用
merge 兩個 thread 的 list 有沒有比較好的方法?

Critical section

task queue
tmp_list (for merging lists between threads)
- 被 access 太多次，會拖慢平行化的效能
cut_count

Scalability

Horizontal scalar (scale out/in)：新增/減少節點到系統中。例如：分散式運算，由多台電腦共同執行任務
Vertical scalar (scale up/down)：增強/減弱單台電腦的性能。例如：使用多核心

研究 concurrent-II 如何計算 scalabilty

`scalability.sh`

執行 1~N 個核心的運行測試，並與 1 核心的執行成果作比較。
評估 vertical scalability

scalability1.sh 與 scalability2.sh 差別只有一次執行一個或兩程式

運作

執行時須帶有參數，第一個參數指定 core，shift 會讓所有參數左移 (原本第二個參數會變第一個)
執行 config 及 lock_exec script
讀取指定執行程式名到 prog，剩下後接的參數到 params
運行 1~core 核心的測試，並輸出與 core 1 的執行結果比較
一共有四個資訊會輸出 (下標 N 為執行的 core 數)
a. #cores：執行的核心數
b. throughput：每單位時間的執行工作數量
c. %linear：
$(1 - \frac{N - s c a l a i l i t y}{N}) \times 100 % = (\frac{s c a l a b i l i t y}{N}) \times 100 %$ ，每個 core 的平均 scalability in percent
d. scalability：
$\frac{t h r o u g h p u t_{N}}{t h r o u g h p u t_{1}}$ ，throughput 增加比
執行 unlock_exec script

`lock_exec`

新增一個 /tmp/lock_exec_$(whoami) 的空檔案，並設定如果接收到 SIGINT 則把這個檔案刪除。但如果在新增檔案的時候該檔案存在的話，就不允許新開的測試進行 (代表前一個測試程式還在運作)

trap command signal：如果接收到 signal 就執行 command
2> /dev/null：將 stderr(fd=2) 的訊息導入 /dev/null，也就是不輸出

`unlock_exec`

刪除 tmp/lock_exec_$(whoami)

不知道這兩個 script 的功用，主程式也沒有使用到，純粹創檔刪檔!?LanKuDot

我懂了，目的在於防止使用者同時進行同一個測試。LanKuDot

`config`

將從 run_II.sh 傳入的 cores 指定的項目轉為對應的數列

uname

實作

參考吳彥寬的共筆實作 GENERIC_PRINTF 的 macro，以針對不同的資料型態作對應的輸出

Scalability

測時：用 gettimeofday(2) 來測量 elapsed time，從第一個 task 被 push 到 queue 中開始。
- 查到可以用 getrusage(2) 來得到 kernel time 和 user time，用 USAGE_SELF 會得到所有 thread 的使用時間加總。用 USAGE_THREAD 則會得到 calling thread 的使用時間，但 main thread 在等待 join 的這段時間都會一直使用 CPU 嗎?
在 task_pop() 中來計算被消耗的 task 數，新增 member num_of_consumed 到 tqueue_t 中，在呼叫 tqueue_free() 時會回傳這個值。
輸出

#Total_tasks_consumed: 14
#Elapsed_time: 0.648 ms
#Throughput: 21604 (per sec)

嘗試之後發現，thoughput 的計算出來會有問題，見提問的探討

提問

在 concurrent-mergesort 中，就算固定輸入的資料量，但是會因為 thread 數的增減，讓一個 task 工作量有變化 (分到的 local list 的長度會不同)，這樣 thoughput 的量測公平嗎?
- 不公平。因為 thread 的增加讓 list 切的更瑣碎，讓 task 數量增加，而且每個 task 的工作量變小，造成雖然執行時間變高，thoughput 也變高的情況 (但其實不同 thread 數量下的 task 工作量不同)

$ ./sort 4 /tmp/run_input.txt 
#Total_tasks_consumed: 14
#Elapsed_time: 43.012 ms
#Throughput: 325 (per sec)
$ ./sort 64 /tmp/run_input.txt 
#Total_tasks_consumed: 254
#Elapsed_time: 447.635 ms <--- 時間大大增加
#Throughput: 567 (per sec)

不同的 task 能否相加當作總 thoughtput? 像 local list 的 merge sort 與 thread 之間的 list 的merge sort 是不同的工作

用專門切割 list 的 task 取代交給每個 task 切

為了減少 critical section 的使用，所以作一點變化，不讓個別 task 去切 list，而是建立一個專門切割 list 的 task。
只要切好 local list 就會 push 到 task queue。

建立自動化結果測試

我在 makefile 中新增一個 check target 用以測試排序完的結果是否正確，流程如下：

透過 script gen-random-numebrs.sh 產生隨機變數並輸出到測試檔案中
使用 sort -g 將測試檔案作排序以作為 ground truth
執行主程式，讀取測試檔案，並輸出結果到另一個檔案
compare.sh 比較輸出檔與 ground truth，並提示 verify 的結果

讓程式可以接受字串輸入

為了不改變 node->data 的結構，所以只能是指標或是 long int，想要用 hw2 的方法，將輸入資料 map 到 process 的虛擬記憶體，再讓 thread 使用

測試

讓程式去排序 10000 筆數字，4 thread

Baseline：3.233 msec
將切割交給專用的 task：3.216 msec

2016q3 Homework3 (mergesort-concurrent)

閱讀筆記

計算機組織結構

提問的回答

Designing Threaded Programs

Boss/Worker Model

Peer Model (Workcrew model)

Pipeline Model

作業

UNIX command

程式碼

Makefile

`list.c`

問題 & Refactoring

threadpool

問題 & Refactoring

`main.c`

問題 & Refactoring

Critical section

Scalability

`scalability.sh`

運作

`lock_exec`

`unlock_exec`

`config`

實作

Scalability

提問

用專門切割 list 的 task 取代交給每個 task 切

建立自動化結果測試

讓程式可以接受字串輸入

測試

參考資料

tags: `sysprog2016`

2016q3 Homework3 (mergesort-concurrent)

閱讀筆記

計算機組織結構

提問的回答

Designing Threaded Programs

Boss/Worker Model

Peer Model (Workcrew model)

Pipeline Model

作業

UNIX command

程式碼

Makefile

list.c

問題 & Refactoring

threadpool

問題 & Refactoring

main.c

問題 & Refactoring

Critical section

Scalability

scalability.sh

運作

lock_exec

unlock_exec

config

實作

Scalability

提問

用專門切割 list 的 task 取代交給每個 task 切

建立自動化結果測試

讓程式可以接受字串輸入

測試

參考資料

tags: sysprog2016

`list.c`

`main.c`

`scalability.sh`

`lock_exec`

`unlock_exec`

`config`

tags: `sysprog2016`