# Parallel Programming @ NYCU - HW2 #### **`0716221 余忠旻`** ### <font color="#3CB371"> In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer.) </font> 我將 image 由上到下均分給各個 thread 計算執行,如下實作: ```cpp= void workerThreadStart(WorkerArgs *const args) { int workload = args->height / args->numThreads; int start = args->threadId * workload; mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, start, workload, args->maxIterations, args->output); } ``` 下圖是不同 number of threads 所產生的相對應的 speedup 比較圖 * <font color="#082567" size=4>VIEW 1</font> ``` $ ./mandelbrot -t 2 [mandelbrot serial]: [474.001] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [239.313] ms Wrote image file mandelbrot-thread.ppm (1.98x speedup from 2 threads) $ ./mandelbrot -t 3 [mandelbrot serial]: [475.400] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [291.510] ms Wrote image file mandelbrot-thread.ppm (1.63x speedup from 3 threads) $ ./mandelbrot -t 4 [mandelbrot serial]: [464.144] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [193.675] ms Wrote image file mandelbrot-thread.ppm (2.40x speedup from 4 threads) ``` ![](https://i.imgur.com/MHQ4DCW.jpg) * <font color="#082567" size=4>VIEW 2</font> ``` $ ./mandelbrot -t 2 --view 2 [mandelbrot serial]: [296.608] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [174.557] ms Wrote image file mandelbrot-thread.ppm (1.70x speedup from 2 threads) $ ./mandelbrot -t 3 --view 2 [mandelbrot serial]: [290.624] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [131.804] ms Wrote image file mandelbrot-thread.ppm (2.20x speedup from 3 threads) $ ./mandelbrot -t 4 --view 2 [mandelbrot serial]: [288.252] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [110.473] ms Wrote image file mandelbrot-thread.ppm (2.61x speedup from 4 threads) ``` ![](https://i.imgur.com/tx1Ws2N.jpg) ___ ### <font color="#3CB371"> Q1: Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case? Hint: take a careful look at the three-thread data-point. </font> :::info * 在VIEW 1中,speedup隨著number of threads上升並沒有線性成長,反而在number of threads=3下降 * 在VIEW 2中,speedup隨著number of threads上升有線性成長 ::: 在 `workerThreadStart()` 中,每一個 thread 都會執行 `mandelbrotSerial()` 而 `mandelbrotSerial()` 會在for迴圈重複呼叫到 `mandel()` 看如下 `mandel()` 函式中,可以發現決定for迴圈的次數是 `if (z_re * z_re + z_im * z_im > 4.f) break;` 這個條件 ```cpp= static inline int mandel(float c_re, float c_im, int count) { float z_re = c_re, z_im = c_im; int i; for (i = 0; i < count; ++i) { if (z_re * z_re + z_im * z_im > 4.f) break; float new_re = z_re * z_re - z_im * z_im; float new_im = 2.f * z_re * z_im; z_re = c_re + new_re; z_im = c_im + new_im; } return i; } ``` **在 8 位元灰階圖片中,0 表示黑色,最大值 255 則表示白色,數值越大顏色越淺** **所以我發現,當 count=256 時, i 可能為 0~255 中任一數,其中數值大小決定該像素顏色深淺** 再比較VIEW 1和 VIEW 2: ![](https://i.imgur.com/jzjx6af.jpg) 可以明顯看出 VIEW 1 的淺色像素遠比 View2 來得多且集中(集中在中間) 接著我們分別來看3個thread計算執行的內容(如下圖) thread1 所計算的image內容**淺色像素多且集中** **因此我認為是 thread1 拖垮整體執行速度,導致 speedup 下降** ![](https://i.imgur.com/pCKFAdj.jpg) :::info Hypothesize: 每個 thread 的工作量與其負責區域所含有的淺色像素數量有關。 &emsp; &emsp; &emsp; &emsp; &emsp;當 number of threads=3 時,thread1 會被分配到淺色像素多且集中的區域, &emsp; &emsp; &emsp; &emsp; &emsp;導致整體執行速度不佳。 ::: ___ ### <font color="#3CB371"> Q2: How do your measurements explain the speedup graph you previously created? </font> * **<font color="#082567" size=4>number of thread=2</font>** ![](https://i.imgur.com/yZmO08H.jpg) ``` $ ./mandelbrot -t 2 [mandelbrot serial]: [462.669] ms Wrote image file mandelbrot-serial.ppm Thread 0: [262.9549] ms Thread 1: [263.1731] ms [mandelbrot thread]: [263.524] ms Wrote image file mandelbrot-thread.ppm (1.76x speedup from 2 threads) ``` * **<font color="#082567" size=4>number of thread=3</font>** ![](https://i.imgur.com/pCKFAdj.jpg) ``` $ ./mandelbrot -t 3 [mandelbrot serial]: [461.782] ms Wrote image file mandelbrot-serial.ppm Thread 0: [94.1784] ms Thread 2: [94.9320] ms Thread 1: [286.3419] ms [mandelbrot thread]: [286.568] ms Wrote image file mandelbrot-thread.ppm (1.61x speedup from 3 threads) ``` * **<font color="#082567" size=4>number of thread=4</font>** ![](https://i.imgur.com/yIgilbF.jpg) ``` $ ./mandelbrot -t 4 [mandelbrot serial]: [474.807] ms Wrote image file mandelbrot-serial.ppm Thread 3: [45.8921] ms Thread 0: [66.6695] ms Thread 1: [194.6871] ms Thread 2: [206.5401] ms [mandelbrot thread]: [219.570] ms Wrote image file mandelbrot-thread.ppm (2.16x speedup from 4 threads) ``` 在 number of thread=3 可以看出 thread1 執行時間也是比 thread0 和 thread2 久 thread1 所計算的image內容**淺色像素多且集中** **因此 thread1 拖垮整體執行速度,導致 speedup 下降** 在 number of threads=4 也可以看出 **thread1 和 thread2是主要的執行速度瓶頸** <br> 接著看 VIEW 2 , number of threads=3 ![](https://i.imgur.com/dxmF3Kd.jpg) ``` $ ./mandelbrot -t 3 --view 2 [mandelbrot serial]: [297.046] ms Wrote image file mandelbrot-serial.ppm Thread 2: [82.0762] ms Thread 1: [87.8727] ms Thread 0: [134.4979] ms [mandelbrot thread]: [134.589] ms Wrote image file mandelbrot-thread.ppm (2.21x speedup from 3 threads) ``` 相對來說 VIEW 2的 thread0 執行時間也是比 thread1 和 thread2 久 (VIEW 2 淺色像素主要在最上面) 但因為淺色像素在整體是相對分散許多 因此執行速度瓶頸雖是 thread0,但不至於拖垮整體執行速度,還是可以看出線性成長的 speedup :::info A: 在 VIEW 1,speedup隨著number of threads上升不會有線性成長, &emsp; 是因每個 thread 工作量分配極度不均的關係。 ::: ___ ### <font color="#3CB371"> Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained. </font> 用一開始的方法,由上到下均分成n個區塊來分配給n個threads計算執行 但在 VIEW 1 由於淺色像素集中於同一區域導致工作量分配不均 因此我將 `workerThreadStart()`修改成如下: ```cpp= void workerThreadStart(WorkerArgs *const args) { for (int i = args->threadId; i < args->height; i += args->numThreads){ mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, i, 1, args->maxIterations, args->output); } } ``` 我將 `workerThreadStart()`每一個thread執行`mandelbrotSerial()`時,傳入的 totalRows=1 我以 number of threads=3為例: ![](https://i.imgur.com/BWZou4h.jpg) 這樣能將原本淺色像素集中區域,3個thread都會被分到部分計算執行, 如此能將工作量平均分配 * Final 4-thread speedup obtained 執行結果如下: ``` $ ./mandelbrot -t 4 [mandelbrot serial]: [462.474] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [121.549] ms Wrote image file mandelbrot-thread.ppm (3.80x speedup from 4 threads) $ ./mandelbrot -t 4 --view 2 [mandelbrot serial]: [288.821] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [76.039] ms Wrote image file mandelbrot-thread.ppm (3.80x speedup from 4 threads) ``` :::info View 1: 3.80x speedup from 4 threads View 2: 3.80x speedup from 4 threads ::: ___ ### <font color="#3CB371"> Q4: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? (Notice that the workstation server provides 4 cores 4 threads.) </font> 當 number of threads=8 執行如下: ``` $ ./mandelbrot -t 8 [mandelbrot serial]: [461.297] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [123.937] ms Wrote image file mandelbrot-thread.ppm (3.72x speedup from 8 threads) $ ./mandelbrot -t 8 --view 2 [mandelbrot serial]: [289.132] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [76.669] ms Wrote image file mandelbrot-thread.ppm (3.77x speedup from 8 threads) ``` **可以發現 number of threads=8 會比 number of threads=4 執行還要慢一些** 我認為當number of threads=8時,speedup沒有上升反而下降, 是因為工作站提供的是 *4 核心 4 執行序的 Server*, 當使用 8 個 threads 時, core 被分配到的threads就可能會大於1 這時執行會多了 **context switch 的 overhead** 這也造成了整體執行時間比 number of threads=4 還要久的原因。 ___ :::info Good Work! >[name=TA] :::