Parallel Programming Assignment II

Parallel Programming Assignment II === ## Q1: ### In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used? **在View 1的情況下，Speedup圖表為非線性**(如下圖表) ![](https://i.imgur.com/R9XVkhv.png) ### In your writeup hypothesize why this is (or is not) the case? (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread data-point.) ![](https://i.imgur.com/3ezNvcO.png) 在View 2的情況下，Speedup圖表卻是線性的(如上圖表)，於是我個別測試在3個Threads的情況下每個Thread所花費的時間如下列顯示 ```. #For View 1: 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 3 [mandelbrot serial]: [461.450] ms Wrote image file mandelbrot-serial.ppm Thread 0 Spends 93.866111 ms Thread 2 Spends 95.322442 ms Thread 1 Spends 283.595570 ms Thread 0 Spends 93.230444 ms Thread 2 Spends 93.850952 ms Thread 1 Spends 283.004296 ms Thread 0 Spends 93.059696 ms Thread 2 Spends 93.739051 ms Thread 1 Spends 283.661404 ms Thread 0 Spends 93.368916 ms Thread 2 Spends 93.862679 ms Thread 1 Spends 284.227995 ms Thread 0 Spends 93.476153 ms Thread 2 Spends 93.971159 ms Thread 1 Spends 286.426691 ms [mandelbrot thread]: [283.222] ms Wrote image file mandelbrot-thread.ppm (1.63x speedup from 3 threads) #For View 2: 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 2 -t 3 [mandelbrot serial]: [288.658] ms Wrote image file mandelbrot-serial.ppm Thread 2 Spends 79.634195 ms Thread 1 Spends 86.424342 ms Thread 0 Spends 131.673226 ms Thread 2 Spends 79.453930 ms Thread 1 Spends 86.561032 ms Thread 0 Spends 131.512419 ms Thread 2 Spends 79.409020 ms Thread 1 Spends 86.292915 ms Thread 0 Spends 131.971097 ms Thread 2 Spends 79.733860 ms Thread 1 Spends 86.886260 ms Thread 0 Spends 131.705595 ms Thread 2 Spends 79.359675 ms Thread 1 Spends 86.126847 ms Thread 0 Spends 132.335796 ms [mandelbrot thread]: [131.585] ms Wrote image file mandelbrot-thread.ppm (2.19x speedup from 3 threads) ``` 由上述結果可以發現對於View1，Thread 1花費了283ms與總花費時間一樣(mandelbrot thread)，但是Thread 0與Thread 1實際只花93ms，所以對於View 1而言Thread 1是整體工作的瓶頸；對於View 2而言，Thread 0也是整體的瓶頸。而Thread ID跟影像計算的位置有關係，所以可以猜測對於View 1，中間部分的計算比較花費時間；對於View 2，上半部分的計算比較花間，相對位置可參考下圖，可以發現View 1的中間區塊與View 2的上半區塊白色的部分比較多，***所以先猜測白色部分的計算比較花時間。*** ![](https://i.imgur.com/febJKmi.png) ## Q2: How do your measurements explain the speedup graph you previously created? ```. //原先的Code double startSeconds = CycleTimer::currentSeconds(); unsigned int totalRows = args->height / args->numThreads; unsigned int startRow = args->threadId*totalRows; unsigned int endRow = (args->threadId == args->numThreads - 1)?args->height:(args->threadId + 1)*totalRows; mandelbrotSerial( args->x0, args->y0, args->x1, args->y1, args->width, args->height, startRow, endRow-startRow, args->maxIterations, args->output ); double endSeconds = CycleTimer::currentSeconds(); printf("Thread %d Spend %f ms %d~%d\n", args->threadId, (endSeconds - startSeconds)*1000.0, startRow, endRow); ``` ```. //使用fackThreadId指派計算部分 //Thread 0 計算中間 or 上半部 //Thread 1 計算上半部 or 下半部 //Thread 2 計算下半部 or 中間 double startSeconds = CycleTimer::currentSeconds(); int fakeThreadId = 0; if (args->threadId == 0) fakeThreadId = 1; //or fakeThreadId = 0; else if (args->threadId == 1) fakeThreadId = 0; //or fakeThreadId = 2; else if (args->threadId == 2) fakeThreadId = 2; //or fakeThreadId = 1; unsigned int totalRows = args->height / args->numThreads; unsigned int startRow = fakeThreadId*totalRows; unsigned int endRow = (fakeThreadId == args->numThreads - 1)?args->height:(fakeThreadId + 1)*totalRows; mandelbrotSerial( args->x0, args->y0, args->x1, args->y1, args->width, args->height, startRow, endRow-startRow, args->maxIterations, args->output); double endSeconds = CycleTimer::currentSeconds(); printf("Thread %d Spend %f ms %d~%d\n", args->threadId, (endSeconds - startSeconds)*1000.0, startRow, endRow); ``` 根據上面Code的更改，對於View 1瓶頸部分將會變成Thread 0 (or Thread 2)，測試結果如下表，可以得知**瓶頸的部分是中間部分的計算**。 ```. #使用fackThreadId指派計算部分 #Thread 0 計算中間(最花時間) #Thread 1 計算上半部 #Thread 2 計算下半部 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 3 [mandelbrot serial]: [460.055] ms Wrote image file mandelbrot-serial.ppm Thread 1 Spends 92.726992 ms Thread 2 Spends 93.311376 ms Thread 0 Spends 282.812513 ms Thread 1 Spends 93.016909 ms Thread 2 Spends 93.671722 ms Thread 0 Spends 283.222068 ms Thread 1 Spends 92.897145 ms Thread 2 Spends 94.007879 ms Thread 0 Spends 282.550075 ms Thread 1 Spends 92.954511 ms Thread 2 Spends 93.521972 ms Thread 0 Spends 282.528924 ms Thread 1 Spends 92.750125 ms Thread 2 Spends 93.374099 ms Thread 0 Spends 282.409656 ms [mandelbrot thread]: [282.488] ms Wrote image file mandelbrot-thread.ppm (1.63x speedup from 3 threads) ``` ```. #使用fackThreadId指派計算部分 #Thread 0 計算上半部 #Thread 1 計算下半部 #Thread 2 計算中間(最花時間) 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 3 [mandelbrot serial]: [461.821] ms Wrote image file mandelbrot-serial.ppm Thread 0 Spends 95.096070 ms Thread 1 Spends 96.659770 ms Thread 2 Spends 290.325732 ms Thread 0 Spends 93.583143 ms Thread 1 Spends 94.589909 ms Thread 2 Spends 284.713729 ms Thread 0 Spends 93.033343 ms Thread 1 Spends 93.939927 ms Thread 2 Spends 283.422427 ms Thread 0 Spends 93.399414 ms Thread 1 Spends 94.107256 ms Thread 2 Spends 284.532574 ms Thread 0 Spends 93.191191 ms Thread 1 Spends 93.803367 ms Thread 2 Spends 283.782858 ms [mandelbrot thread]: [283.648] ms Wrote image file mandelbrot-thread.ppm (1.63x speedup from 3 threads) ``` 更進一步，我在`mandelbrotSerial.cpp`修改Code如下所示。 ```. #include <stdio.h> #include "CycleTimer.h" static inline int mandel(float c_re, float c_im, int count) { double startSeconds = CycleTimer::currentSeconds(); float z_re = c_re, z_im = c_im; int i; for (i = 0; i < count; ++i) { if (z_re * z_re + z_im * z_im > 4.f) break; float new_re = z_re * z_re - z_im * z_im; float new_im = 2.f * z_re * z_im; z_re = c_re + new_re; z_im = c_im + new_im; } double endSeconds = CycleTimer::currentSeconds(); printf("Output=%d => spends %.1f ns\n", i, (endSeconds - startSeconds)*1000000000.0); return i; } ``` 可以得出**For loop的跌代次數就越多，i會越大，也就是Output越大，結果會使得像素值越接近白色**，測試結果如下所示。 ```. Output=8 => spends 22.9 ns Output=9 => spends 21.8 ns Output=10 => spends 24.7 ns Output=11 => spends 57.0 ns Output=12 => spends 67.6 ns Output=13 => spends 65.9 ns Output=13 => spends 65.3 ns Output=14 => spends 67.6 ns Output=15 => spends 73.6 ns Output=16 => spends 82.9 ns Output=17 => spends 78.2 ns Output=18 => spends 83.0 ns Output=18 => spends 83.5 ns Output=19 => spends 84.6 ns Output=20 => spends 88.8 ns Output=22 => spends 96.5 ns Output=89 => spends 342.4 ns Output=121 => spends 452.4 ns Output=27 => spends 114.1 ns Output=28 => spends 117.7 ns Output=63 => spends 243.5 ns Output=147 => spends 554.1 ns Output=172 => spends 641.2 ns Output=60 => spends 233.5 ns Output=77 => spends 296.4 ns Output=256 => spends 934.7 ns Output=256 => spends 935.3 ns ... ``` ## Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained. 所有的Thread依序交錯計算所有的Rows，如下圖所繪(**Row interleaved**)。 ``` 例如有四個Threads， Thread 0計算Row 0,4,8,12,... Thread 1計算Row 1,5,9,13,... Thread 2計算Row 2,6,10,14,... Thread 3計算Row 3,7,11,15,... ``` ![](https://i.imgur.com/caPwGLZ.png) 因為產生影像的Row Pixel是連續變化的，所以使用Row interleaved可以有效降低不同區塊的計算時間差異，測試的結果如下所示。 ```. #對於View 1，3個Threads，使用Row interleaved方法優化後效能分析 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 3 [mandelbrot serial]: [460.446] ms Wrote image file mandelbrot-serial.ppm Thread 0 Spend 159.883626 ms Thread 1 Spend 160.754438 ms Thread 2 Spend 160.908920 ms Thread 2 Spend 158.953618 ms Thread 0 Spend 159.053588 ms Thread 1 Spend 159.023467 ms Thread 2 Spend 158.862734 ms Thread 1 Spend 158.935976 ms Thread 0 Spend 159.149035 ms Thread 2 Spend 158.509741 ms Thread 1 Spend 158.773504 ms Thread 0 Spend 158.828126 ms Thread 1 Spend 159.087707 ms Thread 2 Spend 159.054224 ms Thread 0 Spend 159.618253 ms [mandelbrot thread]: [159.032] ms Wrote image file mandelbrot-thread.ppm (2.90x speedup from 3 threads) ``` **在4個Threads的情況下，View 1與View 2皆可以得到3.79倍的加速**(結果如下表所示)。 ```. 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 4 [mandelbrot serial]: [460.463] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [121.435] ms Wrote image file mandelbrot-thread.ppm (3.79x speedup from 4 threads) 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 2 -t 4 [mandelbrot serial]: [288.155] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [76.009] ms Wrote image file mandelbrot-thread.ppm (3.79x speedup from 4 threads) ``` ## Q4: ### Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? 如下列所示，**很明顯8個Threads的加速對於4個Threads沒有更明顯的加速甚至更慢**。 ```. //4個Threads的結果 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 4 [mandelbrot serial]: [460.463] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [121.435] ms Wrote image file mandelbrot-thread.ppm (3.79x speedup from 4 threads) 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 2 -t 4 [mandelbrot serial]: [288.155] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [76.009] ms Wrote image file mandelbrot-thread.ppm (3.79x speedup from 4 threads) //8個Threads的結果 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 8 [mandelbrot serial]: [460.285] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [122.188] ms Wrote image file mandelbrot-thread.ppm (3.77x speedup from 8 threads) 311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 2 -t 8 [mandelbrot serial]: [288.031] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [78.129] ms Wrote image file mandelbrot-thread.ppm (3.69x speedup from 8 threads) ``` ### Why or why not? (Notice that the workstation server provides 4 cores 4 threads.) 因為測試機器只有4和4執行緒，所以最好的加速應該是4個Threads。當超過4個Threads時，因為Threads需要多的資料同步的處理與記憶體的成本且會有更多像context switch一樣被置換，會造成更多的Overhead，所以4個Threads與8個Threads的CPU使用率應該都是差不多的，且8個Threads可能會有更多開銷。以下為分別修改`main.cpp`中`numThreads = 2; or 3 or 4`，並使用gprof測試效能 ```. //2個Threads Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name 99.79 4.65 4.65 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) 0.21 4.66 0.01 writePPMImage(int*, int, int, char const*, int) 0.00 4.66 0.00 20 0.00 0.00 CycleTimer::secondsPerTick() 0.00 4.66 0.00 1 0.00 0.00 verifyResult(int*, int*, int, int) Call graph granularity: each sample hit covers 4 byte(s) for 0.21% of 4.66 seconds index % time self children called name <spontaneous> [1] 99.8 4.65 0.00 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [1] ----------------------------------------------- <spontaneous> [2] 0.2 0.01 0.00 writePPMImage(int*, int, int, char const*, int) [2] ----------------------------------------------- 0.00 0.00 20/20 main [8] [10] 0.0 0.00 0.00 20 CycleTimer::secondsPerTick() [10] ----------------------------------------------- 0.00 0.00 1/1 main [8] [11] 0.0 0.00 0.00 1 verifyResult(int*, int*, int, int) [11] ----------------------------------------------- Index by function name [11] verifyResult(int*, int*, int, int) [1] mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [2] writePPMImage(int*, int, int, char const*, int) [10] CycleTimer::secondsPerTick() ``` ```. //4個Threads Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name 99.49 3.93 3.93 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) 0.25 3.94 0.01 writePPMImage(int*, int, int, char const*, int) 0.25 3.95 0.01 _init 0.00 3.95 0.00 20 0.00 0.00 CycleTimer::secondsPerTick() 0.00 3.95 0.00 1 0.00 0.00 verifyResult(int*, int*, int, int) Call graph granularity: each sample hit covers 4 byte(s) for 0.25% of 3.95 seconds index % time self children called name <spontaneous> [1] 99.5 3.93 0.00 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [1] ----------------------------------------------- <spontaneous> [2] 0.3 0.01 0.00 writePPMImage(int*, int, int, char const*, int) [2] ----------------------------------------------- <spontaneous> [3] 0.3 0.01 0.00 _init [3] ----------------------------------------------- 0.00 0.00 20/20 main [9] [11] 0.0 0.00 0.00 20 CycleTimer::secondsPerTick() [11] ----------------------------------------------- 0.00 0.00 1/1 main [9] [12] 0.0 0.00 0.00 1 verifyResult(int*, int*, int, int) [12] ----------------------------------------------- Index by function name [12] verifyResult(int*, int*, int, int) [1] mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [3] _init [2] writePPMImage(int*, int, int, char const*, int) [11] CycleTimer::secondsPerTick() ``` ```. //8個Threads Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name 100.00 4.05 4.05 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) 0.00 4.05 0.00 20 0.00 0.00 CycleTimer::secondsPerTick() 0.00 4.05 0.00 1 0.00 0.00 verifyResult(int*, int*, int, int) Call graph granularity: each sample hit covers 4 byte(s) for 0.25% of 4.05 seconds index % time self children called name <spontaneous> [1] 100.0 4.05 0.00 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [1] ----------------------------------------------- 0.00 0.00 20/20 main [7] [9] 0.0 0.00 0.00 20 CycleTimer::secondsPerTick() [9] ----------------------------------------------- 0.00 0.00 1/1 main [7] [10] 0.0 0.00 0.00 1 verifyResult(int*, int*, int, int) [10] ----------------------------------------------- Index by function name [10] verifyResult(int*, int*, int, int) [1] mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [9] CycleTimer::secondsPerTick() ```

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.