# PP HW2
## Q1 Produce a graph of speedup with view1 and view2
:::success
使用tiv -w 100 -h 30 mandelbrot-thread.ppm 顯示出圖片
:::




---
### Q1 : Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case?
從上方的折線圖可以觀察到,在View1中,speedup並沒有呈現linear成長,而是在thread為3時,他的speedup反而比thread為2時,還要低。
在View2中,他就呈現較為linear的走勢。
```javascript=
void workerThreadStart(WorkerArgs *const args)
{
// TODO FOR PP STUDENTS: Implement the body of the worker
// thread here. Each thread could make a call to mandelbrotSerial()
// to compute a part of the output image. For example, in a
// program that uses two threads, thread 0 could compute the top
// half of the image and thread 1 could compute the bottom half.
// Of course, you can copy mandelbrotSerial() to this file and
// modify it to pursue a better performance.
double minSerial = 1e30;
int totalRows = args->height / args->numThreads;
int remainder = args->height % args->numThreads;
int startRow = totalRows * args->threadId;
if ((args->threadId + 1) == args->numThreads) { //the last thread need to do remain part
totalRows += remainder;
}
printf("Hello world from thread %d ", args->threadId);
}
```
這上方的程式碼為我做平行化的部分,我將不同的thread分配到不同的區域實作,而最後一個thread可能會做一些無法整除的剩餘部分,而主要的分法為利用高度切割,分配由上到下的圖片區段(依照thread數量切等分)給thread。
並且其實我觀察到在View1中,thread為2時,他的speedup很接近2x,且View1他的圖片為上下對稱的圖片,所以這邊造成不能線性加速的原因,極有可能是不同thread處理事情的快慢也不同,儘管我分配一樣大小的區塊給予他們平行化處理,而其中原因可能是跟分配到的圖片像素有關。
### Q2 : How do your measurements explain the speedup graph you previously created?
```javascript=
double minSerial = 1e30;
int totalRows = args->height / args->numThreads;
int remainder = args->height % args->numThreads;
int startRow = totalRows * args->threadId;
if ((args->threadId + 1) == args->numThreads) { //the last thread need to do remain part
totalRows += remainder;
}
double startTime = CycleTimer::currentSeconds();
mandelbrotSerial(args->x0,args->y0,args->x1,args->y1,args->width,args->height,startRow, totalRows,args->maxIterations,args->output);
double endTime = CycleTimer::currentSeconds();
minSerial = std::min(minSerial, endTime - startTime);
printf("Hello world from thread %d ", args->threadId);
printf("[mandelbrot serial]:\t\t[%.3f] ms\n", minSerial * 1000);
```
:::warning
下方圖為View 1 - Threads = 3 的結果
:::

:::warning
下方圖為View 2 - Threads = 3 的結果
:::

我將前段的程式碼加上測量時間,並且觀察結論可以發現不同thread在做相同大小區段的事情,所花費的時間真的差異很大,在View1結論中可以發現thread1(處理圖片中間段的thread)所花費的時間最多,在View2結論中可以發現thread0所花費的時間最多。
觀察到處理的時間跟所分配到的圖片有關時,我回去看了兩張圖,並且觀察到處理時間較長的thread都分配到圖片中出現較為多白色區域的片段(View 1的中間、View 2的上段),可以推斷白色區域較為花費時間。
觀察程式碼,我們可以發現在Q1中我所提供的程式碼裡,我會去呼叫==mandelbrotSerial()== 幫我們處理,而他會再去呼叫==mandel()== ,而在這裡就會回傳該點的pixel值(0~255),而在此function中,break會晚就代表pixel值越大,也代表他越白,所以可以發現出為何當thread分配到結果顯示為的白色區域時,所花費的時間比較長了。
```javascript=
static inline int mandel(float c_re, float c_im, int count)
{
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i)
{
if (z_re * z_re + z_im * z_im > 4.f)
break;
float new_re = z_re * z_re - z_im * z_im;
float new_im = 2.f * z_re * z_im;
z_re = c_re + new_re;
z_im = c_im + new_im;
}
return i;
}
```
### Q3 : In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained.
```javascript=
void workerThreadStart(WorkerArgs *const args)
{
for(unsigned int i=args->threadId;i<args->height;i+=args->numThreads){
mandelbrotSerial(args->x0,args->y0,args->x1,args->y1,args->width,args->height,i, 1,args->maxIterations,args->output);
}
}
```
上方的程市碼是修改後的==workerThreadStart()== ,因為我們發現如果上中下游等比例劃分會有不公平的問題,所以我們直接讓每個thread都均攤整張的每個區域部分,以避免有thread處理較花費時間的部分。
```
以前的方法(負責的height) :
thread0 -> (0 ~ x/3)
thread1 -> (x/3 ~ 2x/3)
thread2 -> (2x/3 ~ x)"
改進的方法(負責的height) :
thread0 -> (0,3,6,9...)
thread1 -> (1,4,7,10...)
thread2 -> (2,5,8,11...)"
```
:::success
下方的圖為decomposition policy後thread = 4的結果,Speedup > 3.5x。
:::

:::info
Decomposition policy graph of speedup with view1 and view2
:::


上方兩張圖可以觀察到,thread的數量與speedup呈linear成長。
### Q4 : Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not?
:::success
下方的圖為thread = 8的結果
:::

觀察到thread = 8,反而沒有thread = 4時來的快。
```
311552055@pp1:~/311552055/HW2/HW2/part2$ nproc
4
```
上方的指令來看,可以發現此linux中只有4核心,所以很有可能每個核心都被分配到2個threads,所以可能會有context switch的overhead,這也是為什麼thread不是越多越好,硬體的不足會限制軟體的加速。