--- tags: para., pthread, thread, pi, SIMD --- https://nctu-sslab.github.io/PP-f20/HW2/ # Multi-thread Programming ## Q1: ### <font color="green"> **In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW.** </font> ![](https://i.imgur.com/mFayLAr.png =350x)![](https://i.imgur.com/siNX3OO.png =350x) **1. Is speedup linear in the number of threads used?** * 加速是非線性的,除了程式會包含不可平行的部分外,Create Thread 也需要時間成本,因此無法靠thread 無限線性加速。 **2. In your writeup hypothesize why this is (or is not) the case? (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread data-point.)** * 從上圖可以看出,Thread 數目增加到3或4左右,速度就沒有顯著提升,甚至可能會有些微比更少的thread還慢,我推測也許程式能平行的部分所花的時間,已經因為平行加速到某一程度**相較於不可平行的部分相當微小**,因此再如何增加thread 都沒有顯著差異。 ## Q2: ### <font color="green"> **How do your measurements explain the speedup graph you previously created?**</font> * 要達到thread 的最佳平行,就要讓各個thread 工作量均衡,這樣較不會產生較快的thread 需要等待較慢的thread 產生的bottle neck。為了克服這點我發現,Index Ouput Array 的中段需要花費較高的成本,因為搜尋是連續的,從陣列的始端Index 或從末端Index 都會較快。 * 因此thread 交叉執行鄰近的列會是較平均的做法,Ex. thread 1 做0 2 4 6 8 ....列,thread 2做 1 3 5 7 9 .... 列。 ![](https://i.imgur.com/W6YcIaY.png) 從結果來看效果是不錯的。 接著我們觀察各個thread 在 `workerThreadStart()`運行所花的時間 * thread number 1 => total: 461.212 ms ![](https://i.imgur.com/Ly3kCsy.png) * thread number 2 => total: 236.325 ms ![](https://i.imgur.com/yZFCiLf.png) * thread number 3 => total: 158.896 ms ![](https://i.imgur.com/o9s3zux.png) * thread number 4 => total: 121.580 ms ![](https://i.imgur.com/ojlfOxW.png) * thread number 5 => total: 145.258 ms ![](https://i.imgur.com/PZ9A7Nu.png) * thread number 6 => total: 124.721 ms ![](https://i.imgur.com/K2Oz809.png) 從結果發現,在thread 較小的情況下,`workerThreadStart()`所運行的時間都是相差不多的,但是當thread 一變大,也代表`thread 0` 與`thread N` 所Index 的空間差距越大,所以`thread N`花的時間也會比`thread 0`還更久,所以產生bottle neck,所以程式執行時間主要受Index 中段array的影響,繼續平行的效果便不顯著。 ## Q3: ### <font color="green"> Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained. </font> 由於程式中段的處理較花時間所以我程式執行方式`thread 0`執行`0, 0+thread_num*1, 0+thread_num*2, 0+thread_num*3......`以此類推。 ``` // 此處 step 設置為1 for(int start=0; start < height;) { if(start+(threadId*step) < height) { mandelbrotSerial(x0, y0, x1, y1, width, height, start+(threadId*step), step, maxIterations, args->output); } start += numThreads; } ``` ``` [mandelbrot serial]: [460.861] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [121.495] ms Wrote image file mandelbrot-thread.ppm (3.79x speedup from 4 threads) ``` 經過實驗`step`參數設置越大速度會越慢,因為會造成`thread0`和`threadN`距離越遠,等待情形越嚴重。 ## Q4: ### <font color="green">Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? (Notice that the workstation server provides 4 cores 4 threads.)</font> | | View 1 | View 2 | | -------- |:-------:|:------:| | Thread 4 | 121.529 | 71.529 | | Thread 8 | 122.546 | 74.458 | 執行表現沒有變得更好,4 cores 4 thread 對於8個thread 並行的需求是足夠的,但是每個thread 所運行的空間越遠,所執行差的時間就越多,即使8個thread 相較於4個thread,每個thread 的平均執行時間較短許多,但主要影響整體執行時間的也還是執行最久的那個thread,這也許和快取記憶體大小有關。 :::info 應該說 4 cores 4 thread 的最大極限就是同時跑 4 thread,所以就算用 8 thread 讓單一執行緒的運算量減少了,因為同時最多 4 個在跑,所以還是會有 4 個需要等其他 4 個去執行,並且也會造成額外的 context switch 成本 > [name=趙益揚] :::