# Programming Assignment II: Multi-thread Programming
[ToC]
## Part 1: Part 1: Parallel Counting PI Using Pthreads
## Part 2: Parallel Fractal Generation Using std::thread
:::info
++***Q1***++:
Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case?
:::
++***Ans***++:
以下是使用**spatial decomposition**策略,在**view 1**運行**2, 3, 4 threads**的結果:
```
$ ./mandelbrot -t 2; ./mandelbrot -t 3; ./mandelbrot -t 4
[mandelbrot serial]: [458.519] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [235.864] ms
Wrote image file mandelbrot-thread.ppm
(1.94x speedup from 2 threads)
---
[mandelbrot serial]: [458.581] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [282.049] ms
Wrote image file mandelbrot-thread.ppm
(1.63x speedup from 3 threads)
---
[mandelbrot serial]: [458.834] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [192.713] ms
Wrote image file mandelbrot-thread.ppm
(2.38x speedup from 4 threads)
```
* 下圖是將speedup畫成圖表後的結果:

而以下是在**view 2**運行的結果:
```
$ ./mandelbrot -t 2 --view 2; ./mandelbrot -t 3 --view 2; ./mandelbrot -t 4 --view 2
[mandelbrot serial]: [286.397] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [171.487] ms
Wrote image file mandelbrot-thread.ppm
(1.67x speedup from 2 threads)
---
[mandelbrot serial]: [286.933] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [130.840] ms
Wrote image file mandelbrot-thread.ppm
(2.19x speedup from 3 threads)
---
[mandelbrot serial]: [286.836] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [110.188] ms
Wrote image file mandelbrot-thread.ppm
(2.60x speedup from 4 threads)
```
* 將結果畫為圖表後:

:::success
可以看見其**在View 1並非線性成長,在threads數量為3時相較於2反而速度降低**;反之,**在View 2則接近縣性成長**
:::
++***Analysis***++:
* 以下是Spatial Decomposition的切割方式:

可以注意到以下兩點:
1. 在**View 1**時明顯可看出==中間的Rows的elements遠多於上下兩側==,而**View 2**時則較為均勻分散,只在中上區域有比較密集的情況。
2. 在**View 1**,num of thread為3,也就是==speedup相較2 threads反而下降的時候,此時中間的thread要負責的區域是elements最多的區域==。
:::warning
++***hypothesis***++:
由以上兩點我們假設,**是否線性成長,與各個thread的運算量,也就是thread負責區域的elements數量是否均勻有關。thread負責的區域內的elements越多,則該thread的運算速度越慢**
:::
-----------------------------------------------------
:::info
++***Q2***++:
How do your measurements explain the speedup graph you previously created?
:::
++***Ans:***++
在`mandelbrotThread.cpp`內的`workerThreadStart()`函式中,我在第一和最後一行利用`CycleTimer::currentSeconds()`獲取時間並將時間差印出來,而結果如下:
```
$ ./mandelbrot -t 2
...
Time cost of Thread 0 : 235.187369ms
Time cost of Thread 1 : 236.413963ms
[mandelbrot thread]: [236.601] ms
Wrote image file mandelbrot-thread.ppm
(1.95x speedup from 2 threads)
-------------------------
$ ./mandelbrot -t 3
...
Time cost of Thread 0 : 93.120594ms
Time cost of Thread 2 : 93.743789ms
Time cost of Thread 1 : 283.982157ms
[mandelbrot thread]: [282.919] ms
Wrote image file mandelbrot-thread.ppm
(1.63x speedup from 3 threads)
-------------------------
$ ./mandelbrot -t 4
...
Time cost of Thread 0 : 45.445786ms
Time cost of Thread 3 : 45.856606ms
Time cost of Thread 1 : 192.014935ms
Time cost of Thread 2 : 192.826827ms
[mandelbrot thread]: [193.002] ms
Wrote image file mandelbrot-thread.ppm
(2.39x speedup from 4 threads)
```
由結果可以發現在**中間的thread (t=3時的thread 1,以及t=4時的thread 1, 2),相較於在兩側的thread都需要更長的運算時間**。接下來觀察**View 2**的結果:
```
$ ./mandelbrot -t 2 --view 2
...
Time cost of Thread 1 : 123.394603ms
Time cost of Thread 0 : 173.867517ms
[mandelbrot thread]: [173.279] ms
Wrote image file mandelbrot-thread.ppm
(1.68x speedup from 2 threads)
-------------------------
$ ./mandelbrot -t 3 --view 2
...
Time cost of Thread 2 : 80.076922ms
Time cost of Thread 1 : 87.162575ms
Time cost of Thread 0 : 131.883265ms
[mandelbrot thread]: [131.956] ms
Wrote image file mandelbrot-thread.ppm
(2.20x speedup from 3 threads)
-------------------------
$ ./mandelbrot -t 4 --view 2
...
Time cost of Thread 3 : 61.670141ms
Time cost of Thread 2 : 64.950210ms
Time cost of Thread 1 : 66.122036ms
Time cost of Thread 0 : 110.919788ms
[mandelbrot thread]: [110.640] ms
Wrote image file mandelbrot-thread.ppm
(2.63x speedup from 4 threads)
...
```
由以上結果可以看出相較於**View 1**,整體的時間長度更為平均,並且時間最長的正好是**elements最多的thread 0**
:::success
++***result:***++
由以上我們可以看出,我們在Q1進行的假設是成立的。時間的長度受到elements的分布影響,在elements分布不均勻的情況下,**每個thread的工作負載不均勻,造成某些thread需要更多的時間進行,而已經完成的threads只能等待其完成,進而拉長整體運算時間**
:::
-----------------------------------------------------
:::info
++***Q3:***++
Describe your approach to parallelization and report the final 4-thread speedup obtained.
:::
++***Ans:***++
為了將運算量大的rows分配到各個threads,我將Decomposition policy修改如下。令thread總數為$N$,冒號後面表示負責的rows:
* Thread 0 : $0$, $N$, $2N$, $3N$, ...
* Thread 1 : $1$, $N+1$, $2N+1$, $3N+1$, ...
* Thread 2 : $2$, $N+2$, $2N+2$, $3N+2$, ...
* Thread 3 : $3$, $N+3$, $2N+3$, $3N+3$, ...
…
以此來使各個threads的工作負載更平均。下面是修改後的結果:
```
$ ./mandelbrot -t 4
Time cost of Thread 2 : 121.048429ms
Time cost of Thread 0 : 121.194788ms
Time cost of Thread 3 : 121.366946ms
[mandelbrot thread]: [121.619] ms
Wrote image file mandelbrot-thread.ppm
(3.80x speedup from 4 threads)
```
:::success
修改之後,各thread的工作負載得到更好的分配,使運算量更平衡進而縮短整體運算時間。可以看到**在$N$為4時Speedup達到3.8倍**
:::
-----------------------------------------------------
:::info
++***Q4:***++
Run improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not?
:::
++***Ans:***++
將thread總數調製8之後:
```
$ ./mandelbrot -t 8
Time cost of Thread 3 : 60.737095ms
Time cost of Thread 2 : 61.173907ms
Time cost of Thread 0 : 100.664969ms
Time cost of Thread 5 : 91.760168ms
Time cost of Thread 1 : 111.769905ms
Time cost of Thread 7 : 110.079502ms
Time cost of Thread 6 : 115.579447ms
Time cost of Thread 4 : 106.577762ms
[mandelbrot thread]: [133.270] ms
Wrote image file mandelbrot-thread.ppm
(3.46x speedup from 8 threads)
```
:::success
由上可發現,==Threads總數為4的表現還比總數為8好==。這是因為當電腦只有**4 processors, 4 threads**時,在不考慮I/O,僅僅討論CPU資源的情況下,==**不論threads總數多大,能平行使用的實體計算資源都是相同的**,甚至在threads總數超過實體threads數時,還需考慮**Context switch**的處理時間==。因此在Threads總數為8時,效能的提升並不如為4時來的好。
:::