#Parrel programmong assigment2
#### Q1
Extend your code to use 2, 3, 4 threads, partitioning the image generation work accordingly (threads should get blocks of the image)In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1
```C++=
int partHeight = args->height / args->numThreads;
int my_first_i = args->threadId * partHeight;
double startTime = CycleTimer::currentSeconds();
mandelbrotSerial(
args->x0, args->y0, args->x1, args->y1,
args->width, args->height, my_first_i, partHeight,
args->maxIterations, args->output
);
```


View2


```
1. view 1:隨著thread數量上升,速度並不會跟著變快,3 threads反而比2 threads還慢,從1.94倍降到1.63倍。
2. view 2:速度的確會隨著thread變多而上升
```
Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case? (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread data-point.)
*****檢視程式碼
每一個thread都會呼叫mandelbrotSerial(),
,而mandelbrotSerial()又會再for迴圈裡呼叫到mandel(),這兩個function極有可能是thread加速無法達到極速的瓶頸,以下分別探討這兩個function。
```C++=
void mandelbrotSerial(
float x0, float y0, float x1, float y1,
int width, int height,
int startRow, int totalRows,
int maxIterations,
int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
int endRow = startRow + totalRows;
for (int j = startRow; j < endRow; j++)
{
for (int i = 0; i < width; ++i)
{
float x = x0 + i * dx;
float y = y0 + j * dy;
int index = (j * width + i);
output[index] = mandel(x, y, maxIterations);
}
}
}
```
由程式的第13行可以得知,mandelbrotSerial()會依據startRow與endRow變動迴圈執行的次數,而每個thread所切分到的執行大小是一樣的,理論上應該達到平行的加速與threads。
```C++=
static inline int mandel(float c_re, float c_im, int count)
{
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i)
{
if (z_re * z_re + z_im * z_im > 4.f)
break;
float new_re = z_re * z_re - z_im * z_im;
float new_im = 2.f * z_re * z_im;
z_re = c_re + new_re;
z_im = c_im + new_im;
}
return i;
}
```
View1

View2

可以明顯看出 VIEW 1 的淺色像素遠比 View2 來得多且集中。
經查看mandle()可以看到每個像素被賦值的過程,以及View圖中得出以下幾點:
* 在 8 位元灰階圖片中,0 表示黑色,最大值 255 則表示白色,數值越大顏色越淺
* 像素值是由迴圈執行次數 i 來決定。
* i 可能為 0~255 中任一數 (因count為256),數值大小決定該像素顏色深淺。
* 綜合所述,一個像素顏色越淺(rgb值越大),迴圈執行次數也越多。
#### Q2
To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart(). Q2: How do your measurements explain the speedup graph you previously created?
**** 利用CycleTimer.h給的方法來計算看看各個thread



根據Threads2、3、4的結果,可以發現在在中間切分部分的區域產生瓶頸,代表在VIEW 1時中間區段帶入的x, y會耗費大部分時間(相比於前、後段)
***** 檢視 view1
View 1 in thread3

小結:
此三張圖代表Thread0, Thread0+Thread1, Thread0+Thread1+Thread3處理的圖片。
* 使用 3 個 thread 時,分配給第 2 個 thread 的中間區塊有非常多的淺色像素,導致該 thread 執行速度拖垮整體速度
* 根據圖片與以上的分析可以得知,只要**白色區域越多計算越耗時**。
#### Q3
> In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained
上述Q2用CycleTimer.h看出thread在中間的時候,執行時間較久,發現thread的workload不平均:
原本的作法
舉例:用3個threads來計算1200個rows
計算每個thread要算幾個row->1200/3=400,400 tasks per thread
分配工作:
1. thread 0 calculate:0~399 row
1. thread 1 calculate:400~799 row
1. thread 2 calculate:800~1199 row
2.
修改後
計算每個thread要算幾個row->1200/3=400
以thread數量當作間隔去分配任務:
ex: 3 threads
1. thread 0 calculate:0,3,6,9…1197 row
1. thread 1 calculate:1,4,7,10…1198 row
1. thread 2 calculate:2,5,8,11…1199 row
此外有可能發生rows數量沒辦法被thread數量整除(ex:thread=7),在最後有一段if去做判斷分配
```C++=
int partHeight = args->height / args->numThreads;
int remains = args->height % args->numThreads;
// double startTime = CycleTimer::currentSeconds();
for(int t =0;t<partHeight;t++){
mandelbrotSerial(
args->x0, args->y0, args->x1, args->y1,
args->width, args->height,
t*(args->numThreads)+args->threadId,//Start from which row
1,
args->maxIterations, args->output
);
}
if((args->threadId+1<=remains)&&(remains>0)){
mandelbrotSerial(
args->x0, args->y0, args->x1, args->y1,
args->width, args->height,
partHeight * args->numThreads + args->threadId,//Start from which row
1,
args->maxIterations, args->output
);
}
```
View1


View2


#### Q4
Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? (Notice that the workstation server provides 4 cores 4 threads.)
View1

View2

執行發現view1 跟 view2 在8 thread 時,speedup 3.78X->3.57X
與3.78X->3.67 兩者都會下降。
* 因為workstation server provides 4 cores 4 threads,如果使用8個thread去執行,反而會發生額外的context switch overhead,因此要開幾個thread必須視cpu核心數量而定,若工作站是8核心的話,那使用thread=8理論上就會比thread=4還要快。