# Parallel Programming @ NYCU - HW2
#### **`0716221 余忠旻`**
### <font color="#3CB371"> In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1.
(You may also wish to produce a graph for VIEW 2 to help you come up with a good answer.)
</font>
我將 image 由上到下均分給各個 thread 計算執行,如下實作:
```cpp=
void workerThreadStart(WorkerArgs *const args)
{
int workload = args->height / args->numThreads;
int start = args->threadId * workload;
mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, start, workload, args->maxIterations, args->output);
}
```
下圖是不同 number of threads 所產生的相對應的 speedup 比較圖
* <font color="#082567" size=4>VIEW 1</font>
```
$ ./mandelbrot -t 2
[mandelbrot serial]: [474.001] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [239.313] ms
Wrote image file mandelbrot-thread.ppm
(1.98x speedup from 2 threads)
$ ./mandelbrot -t 3
[mandelbrot serial]: [475.400] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [291.510] ms
Wrote image file mandelbrot-thread.ppm
(1.63x speedup from 3 threads)
$ ./mandelbrot -t 4
[mandelbrot serial]: [464.144] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [193.675] ms
Wrote image file mandelbrot-thread.ppm
(2.40x speedup from 4 threads)
```

* <font color="#082567" size=4>VIEW 2</font>
```
$ ./mandelbrot -t 2 --view 2
[mandelbrot serial]: [296.608] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [174.557] ms
Wrote image file mandelbrot-thread.ppm
(1.70x speedup from 2 threads)
$ ./mandelbrot -t 3 --view 2
[mandelbrot serial]: [290.624] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [131.804] ms
Wrote image file mandelbrot-thread.ppm
(2.20x speedup from 3 threads)
$ ./mandelbrot -t 4 --view 2
[mandelbrot serial]: [288.252] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [110.473] ms
Wrote image file mandelbrot-thread.ppm
(2.61x speedup from 4 threads)
```

___
### <font color="#3CB371"> Q1: Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case?
Hint: take a careful look at the three-thread data-point.
</font>
:::info
* 在VIEW 1中,speedup隨著number of threads上升並沒有線性成長,反而在number of threads=3下降
* 在VIEW 2中,speedup隨著number of threads上升有線性成長
:::
在 `workerThreadStart()` 中,每一個 thread 都會執行 `mandelbrotSerial()`
而 `mandelbrotSerial()` 會在for迴圈重複呼叫到 `mandel()`
看如下 `mandel()` 函式中,可以發現決定for迴圈的次數是
`if (z_re * z_re + z_im * z_im > 4.f) break;` 這個條件
```cpp=
static inline int mandel(float c_re, float c_im, int count)
{
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i)
{
if (z_re * z_re + z_im * z_im > 4.f)
break;
float new_re = z_re * z_re - z_im * z_im;
float new_im = 2.f * z_re * z_im;
z_re = c_re + new_re;
z_im = c_im + new_im;
}
return i;
}
```
**在 8 位元灰階圖片中,0 表示黑色,最大值 255 則表示白色,數值越大顏色越淺**
**所以我發現,當 count=256 時, i 可能為 0~255 中任一數,其中數值大小決定該像素顏色深淺**
再比較VIEW 1和 VIEW 2:

可以明顯看出 VIEW 1 的淺色像素遠比 View2 來得多且集中(集中在中間)
接著我們分別來看3個thread計算執行的內容(如下圖)
thread1 所計算的image內容**淺色像素多且集中**
**因此我認為是 thread1 拖垮整體執行速度,導致 speedup 下降**

:::info
Hypothesize: 每個 thread 的工作量與其負責區域所含有的淺色像素數量有關。
         當 number of threads=3 時,thread1 會被分配到淺色像素多且集中的區域,
         導致整體執行速度不佳。
:::
___
### <font color="#3CB371"> Q2: How do your measurements explain the speedup graph you previously created?
</font>
* **<font color="#082567" size=4>number of thread=2</font>**

```
$ ./mandelbrot -t 2
[mandelbrot serial]: [462.669] ms
Wrote image file mandelbrot-serial.ppm
Thread 0: [262.9549] ms
Thread 1: [263.1731] ms
[mandelbrot thread]: [263.524] ms
Wrote image file mandelbrot-thread.ppm
(1.76x speedup from 2 threads)
```
* **<font color="#082567" size=4>number of thread=3</font>**

```
$ ./mandelbrot -t 3
[mandelbrot serial]: [461.782] ms
Wrote image file mandelbrot-serial.ppm
Thread 0: [94.1784] ms
Thread 2: [94.9320] ms
Thread 1: [286.3419] ms
[mandelbrot thread]: [286.568] ms
Wrote image file mandelbrot-thread.ppm
(1.61x speedup from 3 threads)
```
* **<font color="#082567" size=4>number of thread=4</font>**

```
$ ./mandelbrot -t 4
[mandelbrot serial]: [474.807] ms
Wrote image file mandelbrot-serial.ppm
Thread 3: [45.8921] ms
Thread 0: [66.6695] ms
Thread 1: [194.6871] ms
Thread 2: [206.5401] ms
[mandelbrot thread]: [219.570] ms
Wrote image file mandelbrot-thread.ppm
(2.16x speedup from 4 threads)
```
在 number of thread=3 可以看出
thread1 執行時間也是比 thread0 和 thread2 久
thread1 所計算的image內容**淺色像素多且集中**
**因此 thread1 拖垮整體執行速度,導致 speedup 下降**
在 number of threads=4 也可以看出
**thread1 和 thread2是主要的執行速度瓶頸**
<br>
接著看 VIEW 2 , number of threads=3

```
$ ./mandelbrot -t 3 --view 2
[mandelbrot serial]: [297.046] ms
Wrote image file mandelbrot-serial.ppm
Thread 2: [82.0762] ms
Thread 1: [87.8727] ms
Thread 0: [134.4979] ms
[mandelbrot thread]: [134.589] ms
Wrote image file mandelbrot-thread.ppm
(2.21x speedup from 3 threads)
```
相對來說 VIEW 2的 thread0 執行時間也是比 thread1 和 thread2 久
(VIEW 2 淺色像素主要在最上面)
但因為淺色像素在整體是相對分散許多
因此執行速度瓶頸雖是 thread0,但不至於拖垮整體執行速度,還是可以看出線性成長的 speedup
:::info
A: 在 VIEW 1,speedup隨著number of threads上升不會有線性成長,
  是因每個 thread 工作量分配極度不均的關係。
:::
___
### <font color="#3CB371"> Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained.
</font>
用一開始的方法,由上到下均分成n個區塊來分配給n個threads計算執行
但在 VIEW 1 由於淺色像素集中於同一區域導致工作量分配不均
因此我將 `workerThreadStart()`修改成如下:
```cpp=
void workerThreadStart(WorkerArgs *const args)
{
for (int i = args->threadId; i < args->height; i += args->numThreads){
mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, i, 1, args->maxIterations, args->output);
}
}
```
我將 `workerThreadStart()`每一個thread執行`mandelbrotSerial()`時,傳入的 totalRows=1
我以 number of threads=3為例:

這樣能將原本淺色像素集中區域,3個thread都會被分到部分計算執行,
如此能將工作量平均分配
* Final 4-thread speedup obtained
執行結果如下:
```
$ ./mandelbrot -t 4
[mandelbrot serial]: [462.474] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [121.549] ms
Wrote image file mandelbrot-thread.ppm
(3.80x speedup from 4 threads)
$ ./mandelbrot -t 4 --view 2
[mandelbrot serial]: [288.821] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [76.039] ms
Wrote image file mandelbrot-thread.ppm
(3.80x speedup from 4 threads)
```
:::info
View 1: 3.80x speedup from 4 threads
View 2: 3.80x speedup from 4 threads
:::
___
### <font color="#3CB371"> Q4: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not?
(Notice that the workstation server provides 4 cores 4 threads.)
</font>
當 number of threads=8 執行如下:
```
$ ./mandelbrot -t 8
[mandelbrot serial]: [461.297] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [123.937] ms
Wrote image file mandelbrot-thread.ppm
(3.72x speedup from 8 threads)
$ ./mandelbrot -t 8 --view 2
[mandelbrot serial]: [289.132] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [76.669] ms
Wrote image file mandelbrot-thread.ppm
(3.77x speedup from 8 threads)
```
**可以發現 number of threads=8 會比 number of threads=4 執行還要慢一些**
我認為當number of threads=8時,speedup沒有上升反而下降,
是因為工作站提供的是 *4 核心 4 執行序的 Server*,
當使用 8 個 threads 時, core 被分配到的threads就可能會大於1
這時執行會多了 **context switch 的 overhead**
這也造成了整體執行時間比 number of threads=4 還要久的原因。
___
:::info
Good Work!
>[name=TA]
:::