Multi-thread Programming

--- tags: para., pthread, thread, pi, SIMD --- https://nctu-sslab.github.io/PP-f20/HW2/ # Multi-thread Programming ## Q1: ### **In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW.** ![](https://i.imgur.com/mFayLAr.png =350x)![](https://i.imgur.com/siNX3OO.png =350x) **1. Is speedup linear in the number of threads used?** * 加速是非線性的，除了程式會包含不可平行的部分外，Create Thread 也需要時間成本，因此無法靠thread 無限線性加速。 **2. In your writeup hypothesize why this is (or is not) the case? (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread data-point.)** * 從上圖可以看出，Thread 數目增加到3或4左右，速度就沒有顯著提升，甚至可能會有些微比更少的thread還慢，我推測也許程式能平行的部分所花的時間，已經因為平行加速到某一程度**相較於不可平行的部分相當微小**，因此再如何增加thread 都沒有顯著差異。 ## Q2: ### **How do your measurements explain the speedup graph you previously created?** * 要達到thread 的最佳平行，就要讓各個thread 工作量均衡，這樣較不會產生較快的thread 需要等待較慢的thread 產生的bottle neck。為了克服這點我發現，Index Ouput Array 的中段需要花費較高的成本，因為搜尋是連續的，從陣列的始端Index 或從末端Index 都會較快。 * 因此thread 交叉執行鄰近的列會是較平均的做法，Ex. thread 1 做0 2 4 6 8 ....列，thread 2做 1 3 5 7 9 .... 列。 ![](https://i.imgur.com/W6YcIaY.png) 從結果來看效果是不錯的。接著我們觀察各個thread 在 `workerThreadStart()`運行所花的時間 * thread number 1 => total: 461.212 ms ![](https://i.imgur.com/Ly3kCsy.png) * thread number 2 => total: 236.325 ms ![](https://i.imgur.com/yZFCiLf.png) * thread number 3 => total: 158.896 ms ![](https://i.imgur.com/o9s3zux.png) * thread number 4 => total: 121.580 ms ![](https://i.imgur.com/ojlfOxW.png) * thread number 5 => total: 145.258 ms ![](https://i.imgur.com/PZ9A7Nu.png) * thread number 6 => total: 124.721 ms ![](https://i.imgur.com/K2Oz809.png) 從結果發現，在thread 較小的情況下，`workerThreadStart()`所運行的時間都是相差不多的，但是當thread 一變大，也代表`thread 0` 與`thread N` 所Index 的空間差距越大，所以`thread N`花的時間也會比`thread 0`還更久，所以產生bottle neck，所以程式執行時間主要受Index 中段array的影響，繼續平行的效果便不顯著。 ## Q3: ### Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained. 由於程式中段的處理較花時間所以我程式執行方式`thread 0`執行`0, 0+thread_num*1, 0+thread_num*2, 0+thread_num*3......`以此類推。 ``` // 此處 step 設置為1 for(int start=0; start < height;) { if(start+(threadId*step) < height) { mandelbrotSerial(x0, y0, x1, y1, width, height, start+(threadId*step), step, maxIterations, args->output); } start += numThreads; } ``` ``` [mandelbrot serial]: [460.861] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [121.495] ms Wrote image file mandelbrot-thread.ppm (3.79x speedup from 4 threads) ``` 經過實驗`step`參數設置越大速度會越慢，因為會造成`thread0`和`threadN`距離越遠，等待情形越嚴重。 ## Q4: ### Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? (Notice that the workstation server provides 4 cores 4 threads.) | | View 1 | View 2 | | -------- |:-------:|:------:| | Thread 4 | 121.529 | 71.529 | | Thread 8 | 122.546 | 74.458 | 執行表現沒有變得更好，4 cores 4 thread 對於8個thread 並行的需求是足夠的，但是每個thread 所運行的空間越遠，所執行差的時間就越多，即使8個thread 相較於4個thread，每個thread 的平均執行時間較短許多，但主要影響整體執行時間的也還是執行最久的那個thread，這也許和快取記憶體大小有關。 :::info 應該說 4 cores 4 thread 的最大極限就是同時跑 4 thread，所以就算用 8 thread 讓單一執行緒的運算量減少了，因為同時最多 4 個在跑，所以還是會有 4 個需要等其他 4 個去執行，並且也會造成額外的 context switch 成本 > [name=趙益揚] :::