Parallel Programming Assignment II
===
## Q1:
### In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used?
**在View 1的情況下,Speedup圖表為非線性**(如下圖表)

### In your writeup hypothesize why this is (or is not) the case? (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread data-point.)

在View 2的情況下,Speedup圖表卻是線性的(如上圖表),
於是我個別測試在3個Threads的情況下每個Thread所花費的時間如下列顯示
```.
#For View 1:
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 3
[mandelbrot serial]: [461.450] ms
Wrote image file mandelbrot-serial.ppm
Thread 0 Spends 93.866111 ms
Thread 2 Spends 95.322442 ms
Thread 1 Spends 283.595570 ms
Thread 0 Spends 93.230444 ms
Thread 2 Spends 93.850952 ms
Thread 1 Spends 283.004296 ms
Thread 0 Spends 93.059696 ms
Thread 2 Spends 93.739051 ms
Thread 1 Spends 283.661404 ms
Thread 0 Spends 93.368916 ms
Thread 2 Spends 93.862679 ms
Thread 1 Spends 284.227995 ms
Thread 0 Spends 93.476153 ms
Thread 2 Spends 93.971159 ms
Thread 1 Spends 286.426691 ms
[mandelbrot thread]: [283.222] ms
Wrote image file mandelbrot-thread.ppm
(1.63x speedup from 3 threads)
#For View 2:
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 2 -t 3
[mandelbrot serial]: [288.658] ms
Wrote image file mandelbrot-serial.ppm
Thread 2 Spends 79.634195 ms
Thread 1 Spends 86.424342 ms
Thread 0 Spends 131.673226 ms
Thread 2 Spends 79.453930 ms
Thread 1 Spends 86.561032 ms
Thread 0 Spends 131.512419 ms
Thread 2 Spends 79.409020 ms
Thread 1 Spends 86.292915 ms
Thread 0 Spends 131.971097 ms
Thread 2 Spends 79.733860 ms
Thread 1 Spends 86.886260 ms
Thread 0 Spends 131.705595 ms
Thread 2 Spends 79.359675 ms
Thread 1 Spends 86.126847 ms
Thread 0 Spends 132.335796 ms
[mandelbrot thread]: [131.585] ms
Wrote image file mandelbrot-thread.ppm
(2.19x speedup from 3 threads)
```
由上述結果可以發現對於View1,Thread 1花費了283ms與總花費時間一樣(mandelbrot thread),但是Thread 0與Thread 1實際只花93ms,所以對於View 1而言Thread 1是整體工作的瓶頸;對於View 2而言,Thread 0也是整體的瓶頸。而Thread ID跟影像計算的位置有關係,所以可以猜測對於View 1,中間部分的計算比較花費時間;對於View 2,上半部分的計算比較花間,相對位置可參考下圖,可以發現View 1的中間區塊與View 2的上半區塊白色的部分比較多,***所以先猜測白色部分的計算比較花時間。***

## Q2: How do your measurements explain the speedup graph you previously created?
```.
//原先的Code
double startSeconds = CycleTimer::currentSeconds();
unsigned int totalRows = args->height / args->numThreads;
unsigned int startRow = args->threadId*totalRows;
unsigned int endRow = (args->threadId == args->numThreads - 1)?args->height:(args->threadId + 1)*totalRows;
mandelbrotSerial(
args->x0, args->y0, args->x1, args->y1,
args->width, args->height,
startRow, endRow-startRow,
args->maxIterations,
args->output
);
double endSeconds = CycleTimer::currentSeconds();
printf("Thread %d Spend %f ms %d~%d\n", args->threadId, (endSeconds - startSeconds)*1000.0, startRow, endRow);
```
```.
//使用fackThreadId指派計算部分
//Thread 0 計算中間 or 上半部
//Thread 1 計算上半部 or 下半部
//Thread 2 計算下半部 or 中間
double startSeconds = CycleTimer::currentSeconds();
int fakeThreadId = 0;
if (args->threadId == 0)
fakeThreadId = 1; //or fakeThreadId = 0;
else if (args->threadId == 1)
fakeThreadId = 0; //or fakeThreadId = 2;
else if (args->threadId == 2)
fakeThreadId = 2; //or fakeThreadId = 1;
unsigned int totalRows = args->height / args->numThreads;
unsigned int startRow = fakeThreadId*totalRows;
unsigned int endRow = (fakeThreadId == args->numThreads - 1)?args->height:(fakeThreadId + 1)*totalRows;
mandelbrotSerial(
args->x0, args->y0, args->x1, args->y1,
args->width, args->height,
startRow, endRow-startRow,
args->maxIterations,
args->output);
double endSeconds = CycleTimer::currentSeconds();
printf("Thread %d Spend %f ms %d~%d\n", args->threadId, (endSeconds - startSeconds)*1000.0, startRow, endRow);
```
根據上面Code的更改,對於View 1瓶頸部分將會變成Thread 0 (or Thread 2),測試結果如下表,可以得知**瓶頸的部分是中間部分的計算**。
```.
#使用fackThreadId指派計算部分
#Thread 0 計算中間(最花時間)
#Thread 1 計算上半部
#Thread 2 計算下半部
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 3
[mandelbrot serial]: [460.055] ms
Wrote image file mandelbrot-serial.ppm
Thread 1 Spends 92.726992 ms
Thread 2 Spends 93.311376 ms
Thread 0 Spends 282.812513 ms
Thread 1 Spends 93.016909 ms
Thread 2 Spends 93.671722 ms
Thread 0 Spends 283.222068 ms
Thread 1 Spends 92.897145 ms
Thread 2 Spends 94.007879 ms
Thread 0 Spends 282.550075 ms
Thread 1 Spends 92.954511 ms
Thread 2 Spends 93.521972 ms
Thread 0 Spends 282.528924 ms
Thread 1 Spends 92.750125 ms
Thread 2 Spends 93.374099 ms
Thread 0 Spends 282.409656 ms
[mandelbrot thread]: [282.488] ms
Wrote image file mandelbrot-thread.ppm
(1.63x speedup from 3 threads)
```
```.
#使用fackThreadId指派計算部分
#Thread 0 計算上半部
#Thread 1 計算下半部
#Thread 2 計算中間(最花時間)
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 3
[mandelbrot serial]: [461.821] ms
Wrote image file mandelbrot-serial.ppm
Thread 0 Spends 95.096070 ms
Thread 1 Spends 96.659770 ms
Thread 2 Spends 290.325732 ms
Thread 0 Spends 93.583143 ms
Thread 1 Spends 94.589909 ms
Thread 2 Spends 284.713729 ms
Thread 0 Spends 93.033343 ms
Thread 1 Spends 93.939927 ms
Thread 2 Spends 283.422427 ms
Thread 0 Spends 93.399414 ms
Thread 1 Spends 94.107256 ms
Thread 2 Spends 284.532574 ms
Thread 0 Spends 93.191191 ms
Thread 1 Spends 93.803367 ms
Thread 2 Spends 283.782858 ms
[mandelbrot thread]: [283.648] ms
Wrote image file mandelbrot-thread.ppm
(1.63x speedup from 3 threads)
```
更進一步,我在`mandelbrotSerial.cpp`修改Code如下所示。
```.
#include <stdio.h>
#include "CycleTimer.h"
static inline int mandel(float c_re, float c_im, int count)
{
double startSeconds = CycleTimer::currentSeconds();
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i)
{
if (z_re * z_re + z_im * z_im > 4.f)
break;
float new_re = z_re * z_re - z_im * z_im;
float new_im = 2.f * z_re * z_im;
z_re = c_re + new_re;
z_im = c_im + new_im;
}
double endSeconds = CycleTimer::currentSeconds();
printf("Output=%d => spends %.1f ns\n", i, (endSeconds - startSeconds)*1000000000.0);
return i;
}
```
可以得出**For loop的跌代次數就越多,i會越大,也就是Output越大,結果會使得像素值越接近白色**,測試結果如下所示。
```.
Output=8 => spends 22.9 ns
Output=9 => spends 21.8 ns
Output=10 => spends 24.7 ns
Output=11 => spends 57.0 ns
Output=12 => spends 67.6 ns
Output=13 => spends 65.9 ns
Output=13 => spends 65.3 ns
Output=14 => spends 67.6 ns
Output=15 => spends 73.6 ns
Output=16 => spends 82.9 ns
Output=17 => spends 78.2 ns
Output=18 => spends 83.0 ns
Output=18 => spends 83.5 ns
Output=19 => spends 84.6 ns
Output=20 => spends 88.8 ns
Output=22 => spends 96.5 ns
Output=89 => spends 342.4 ns
Output=121 => spends 452.4 ns
Output=27 => spends 114.1 ns
Output=28 => spends 117.7 ns
Output=63 => spends 243.5 ns
Output=147 => spends 554.1 ns
Output=172 => spends 641.2 ns
Output=60 => spends 233.5 ns
Output=77 => spends 296.4 ns
Output=256 => spends 934.7 ns
Output=256 => spends 935.3 ns
...
```
## Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained.
所有的Thread依序交錯計算所有的Rows,如下圖所繪(**Row interleaved**)。
```
例如有四個Threads,
Thread 0計算Row 0,4,8,12,...
Thread 1計算Row 1,5,9,13,...
Thread 2計算Row 2,6,10,14,...
Thread 3計算Row 3,7,11,15,...
```

因為產生影像的Row Pixel是連續變化的,所以使用Row interleaved可以有效降低不同區塊的計算時間差異,測試的結果如下所示。
```.
#對於View 1,3個Threads,使用Row interleaved方法優化後效能分析
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 3
[mandelbrot serial]: [460.446] ms
Wrote image file mandelbrot-serial.ppm
Thread 0 Spend 159.883626 ms
Thread 1 Spend 160.754438 ms
Thread 2 Spend 160.908920 ms
Thread 2 Spend 158.953618 ms
Thread 0 Spend 159.053588 ms
Thread 1 Spend 159.023467 ms
Thread 2 Spend 158.862734 ms
Thread 1 Spend 158.935976 ms
Thread 0 Spend 159.149035 ms
Thread 2 Spend 158.509741 ms
Thread 1 Spend 158.773504 ms
Thread 0 Spend 158.828126 ms
Thread 1 Spend 159.087707 ms
Thread 2 Spend 159.054224 ms
Thread 0 Spend 159.618253 ms
[mandelbrot thread]: [159.032] ms
Wrote image file mandelbrot-thread.ppm
(2.90x speedup from 3 threads)
```
**在4個Threads的情況下,View 1與View 2皆可以得到3.79倍的加速**(結果如下表所示)。
```.
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 4
[mandelbrot serial]: [460.463] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [121.435] ms
Wrote image file mandelbrot-thread.ppm
(3.79x speedup from 4 threads)
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 2 -t 4
[mandelbrot serial]: [288.155] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [76.009] ms
Wrote image file mandelbrot-thread.ppm
(3.79x speedup from 4 threads)
```
## Q4:
### Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads?
如下列所示,**很明顯8個Threads的加速對於4個Threads沒有更明顯的加速甚至更慢**。
```.
//4個Threads的結果
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 4
[mandelbrot serial]: [460.463] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [121.435] ms
Wrote image file mandelbrot-thread.ppm
(3.79x speedup from 4 threads)
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 2 -t 4
[mandelbrot serial]: [288.155] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [76.009] ms
Wrote image file mandelbrot-thread.ppm
(3.79x speedup from 4 threads)
//8個Threads的結果
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 1 -t 8
[mandelbrot serial]: [460.285] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [122.188] ms
Wrote image file mandelbrot-thread.ppm
(3.77x speedup from 8 threads)
311551144@pp037-ubuntu:~/HW2/HW2/part2$ ./mandelbrot --view 2 -t 8
[mandelbrot serial]: [288.031] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [78.129] ms
Wrote image file mandelbrot-thread.ppm
(3.69x speedup from 8 threads)
```
### Why or why not? (Notice that the workstation server provides 4 cores 4 threads.)
因為測試機器只有4和4執行緒,所以最好的加速應該是4個Threads。當超過4個Threads時,因為Threads需要多的資料同步的處理與記憶體的成本且會有更多像context switch一樣被置換,會造成更多的Overhead,所以4個Threads與8個Threads的CPU使用率應該都是差不多的,且8個Threads可能會有更多開銷。
以下為分別修改`main.cpp`中`numThreads = 2; or 3 or 4`,並使用gprof測試效能
```.
//2個Threads
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
99.79 4.65 4.65 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*)
0.21 4.66 0.01 writePPMImage(int*, int, int, char const*, int)
0.00 4.66 0.00 20 0.00 0.00 CycleTimer::secondsPerTick()
0.00 4.66 0.00 1 0.00 0.00 verifyResult(int*, int*, int, int)
Call graph
granularity: each sample hit covers 4 byte(s) for 0.21% of 4.66 seconds
index % time self children called name
<spontaneous>
[1] 99.8 4.65 0.00 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [1]
-----------------------------------------------
<spontaneous>
[2] 0.2 0.01 0.00 writePPMImage(int*, int, int, char const*, int) [2]
-----------------------------------------------
0.00 0.00 20/20 main [8]
[10] 0.0 0.00 0.00 20 CycleTimer::secondsPerTick() [10]
-----------------------------------------------
0.00 0.00 1/1 main [8]
[11] 0.0 0.00 0.00 1 verifyResult(int*, int*, int, int) [11]
-----------------------------------------------
Index by function name
[11] verifyResult(int*, int*, int, int) [1] mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*)
[2] writePPMImage(int*, int, int, char const*, int) [10] CycleTimer::secondsPerTick()
```
```.
//4個Threads
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
99.49 3.93 3.93 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*)
0.25 3.94 0.01 writePPMImage(int*, int, int, char const*, int)
0.25 3.95 0.01 _init
0.00 3.95 0.00 20 0.00 0.00 CycleTimer::secondsPerTick()
0.00 3.95 0.00 1 0.00 0.00 verifyResult(int*, int*, int, int)
Call graph
granularity: each sample hit covers 4 byte(s) for 0.25% of 3.95 seconds
index % time self children called name
<spontaneous>
[1] 99.5 3.93 0.00 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [1]
-----------------------------------------------
<spontaneous>
[2] 0.3 0.01 0.00 writePPMImage(int*, int, int, char const*, int) [2]
-----------------------------------------------
<spontaneous>
[3] 0.3 0.01 0.00 _init [3]
-----------------------------------------------
0.00 0.00 20/20 main [9]
[11] 0.0 0.00 0.00 20 CycleTimer::secondsPerTick() [11]
-----------------------------------------------
0.00 0.00 1/1 main [9]
[12] 0.0 0.00 0.00 1 verifyResult(int*, int*, int, int) [12]
-----------------------------------------------
Index by function name
[12] verifyResult(int*, int*, int, int) [1] mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [3] _init
[2] writePPMImage(int*, int, int, char const*, int) [11] CycleTimer::secondsPerTick()
```
```.
//8個Threads
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
100.00 4.05 4.05 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*)
0.00 4.05 0.00 20 0.00 0.00 CycleTimer::secondsPerTick()
0.00 4.05 0.00 1 0.00 0.00 verifyResult(int*, int*, int, int)
Call graph
granularity: each sample hit covers 4 byte(s) for 0.25% of 4.05 seconds
index % time self children called name
<spontaneous>
[1] 100.0 4.05 0.00 mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [1]
-----------------------------------------------
0.00 0.00 20/20 main [7]
[9] 0.0 0.00 0.00 20 CycleTimer::secondsPerTick() [9]
-----------------------------------------------
0.00 0.00 1/1 main [7]
[10] 0.0 0.00 0.00 1 verifyResult(int*, int*, int, int) [10]
-----------------------------------------------
Index by function name
[10] verifyResult(int*, int*, int, int) [1] mandelbrotSerial(float, float, float, float, int, int, int, int, int, int*) [9] CycleTimer::secondsPerTick()
```