# Programming Assignment II: Multi-thread Programming [ToC] ## Part 1: Part 1: Parallel Counting PI Using Pthreads ## Part 2: Parallel Fractal Generation Using std::thread :::info ++***Q1***++: Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case? ::: ++***Ans***++: 以下是使用**spatial decomposition**策略,在**view 1**運行**2, 3, 4 threads**的結果: ``` $ ./mandelbrot -t 2; ./mandelbrot -t 3; ./mandelbrot -t 4 [mandelbrot serial]: [458.519] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [235.864] ms Wrote image file mandelbrot-thread.ppm (1.94x speedup from 2 threads) --- [mandelbrot serial]: [458.581] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [282.049] ms Wrote image file mandelbrot-thread.ppm (1.63x speedup from 3 threads) --- [mandelbrot serial]: [458.834] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [192.713] ms Wrote image file mandelbrot-thread.ppm (2.38x speedup from 4 threads) ``` * 下圖是將speedup畫成圖表後的結果: ![image.png](https://hackmd.io/_uploads/HygW7G5mT.png) 而以下是在**view 2**運行的結果: ``` $ ./mandelbrot -t 2 --view 2; ./mandelbrot -t 3 --view 2; ./mandelbrot -t 4 --view 2 [mandelbrot serial]: [286.397] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [171.487] ms Wrote image file mandelbrot-thread.ppm (1.67x speedup from 2 threads) --- [mandelbrot serial]: [286.933] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [130.840] ms Wrote image file mandelbrot-thread.ppm (2.19x speedup from 3 threads) --- [mandelbrot serial]: [286.836] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [110.188] ms Wrote image file mandelbrot-thread.ppm (2.60x speedup from 4 threads) ``` * 將結果畫為圖表後: ![image.png](https://hackmd.io/_uploads/HyDFQGqmT.png) :::success 可以看見其**在View 1並非線性成長,在threads數量為3時相較於2反而速度降低**;反之,**在View 2則接近縣性成長** ::: ++***Analysis***++: * 以下是Spatial Decomposition的切割方式: ![Num of Thread.gif](https://hackmd.io/_uploads/BJrliG57a.gif) 可以注意到以下兩點: 1. 在**View 1**時明顯可看出==中間的Rows的elements遠多於上下兩側==,而**View 2**時則較為均勻分散,只在中上區域有比較密集的情況。 2. 在**View 1**,num of thread為3,也就是==speedup相較2 threads反而下降的時候,此時中間的thread要負責的區域是elements最多的區域==。 :::warning ++***hypothesis***++: 由以上兩點我們假設,**是否線性成長,與各個thread的運算量,也就是thread負責區域的elements數量是否均勻有關。thread負責的區域內的elements越多,則該thread的運算速度越慢** ::: ----------------------------------------------------- :::info ++***Q2***++: How do your measurements explain the speedup graph you previously created? ::: ++***Ans:***++ 在`mandelbrotThread.cpp`內的`workerThreadStart()`函式中,我在第一和最後一行利用`CycleTimer::currentSeconds()`獲取時間並將時間差印出來,而結果如下: ``` $ ./mandelbrot -t 2 ... Time cost of Thread 0 : 235.187369ms Time cost of Thread 1 : 236.413963ms [mandelbrot thread]: [236.601] ms Wrote image file mandelbrot-thread.ppm (1.95x speedup from 2 threads) ------------------------- $ ./mandelbrot -t 3 ... Time cost of Thread 0 : 93.120594ms Time cost of Thread 2 : 93.743789ms Time cost of Thread 1 : 283.982157ms [mandelbrot thread]: [282.919] ms Wrote image file mandelbrot-thread.ppm (1.63x speedup from 3 threads) ------------------------- $ ./mandelbrot -t 4 ... Time cost of Thread 0 : 45.445786ms Time cost of Thread 3 : 45.856606ms Time cost of Thread 1 : 192.014935ms Time cost of Thread 2 : 192.826827ms [mandelbrot thread]: [193.002] ms Wrote image file mandelbrot-thread.ppm (2.39x speedup from 4 threads) ``` 由結果可以發現在**中間的thread (t=3時的thread 1,以及t=4時的thread 1, 2),相較於在兩側的thread都需要更長的運算時間**。接下來觀察**View 2**的結果: ``` $ ./mandelbrot -t 2 --view 2 ... Time cost of Thread 1 : 123.394603ms Time cost of Thread 0 : 173.867517ms [mandelbrot thread]: [173.279] ms Wrote image file mandelbrot-thread.ppm (1.68x speedup from 2 threads) ------------------------- $ ./mandelbrot -t 3 --view 2 ... Time cost of Thread 2 : 80.076922ms Time cost of Thread 1 : 87.162575ms Time cost of Thread 0 : 131.883265ms [mandelbrot thread]: [131.956] ms Wrote image file mandelbrot-thread.ppm (2.20x speedup from 3 threads) ------------------------- $ ./mandelbrot -t 4 --view 2 ... Time cost of Thread 3 : 61.670141ms Time cost of Thread 2 : 64.950210ms Time cost of Thread 1 : 66.122036ms Time cost of Thread 0 : 110.919788ms [mandelbrot thread]: [110.640] ms Wrote image file mandelbrot-thread.ppm (2.63x speedup from 4 threads) ... ``` 由以上結果可以看出相較於**View 1**,整體的時間長度更為平均,並且時間最長的正好是**elements最多的thread 0** :::success ++***result:***++ 由以上我們可以看出,我們在Q1進行的假設是成立的。時間的長度受到elements的分布影響,在elements分布不均勻的情況下,**每個thread的工作負載不均勻,造成某些thread需要更多的時間進行,而已經完成的threads只能等待其完成,進而拉長整體運算時間** ::: ----------------------------------------------------- :::info ++***Q3:***++ Describe your approach to parallelization and report the final 4-thread speedup obtained. ::: ++***Ans:***++ 為了將運算量大的rows分配到各個threads,我將Decomposition policy修改如下。令thread總數為$N$,冒號後面表示負責的rows: * Thread 0 : $0$, $N$, $2N$, $3N$, ... * Thread 1 : $1$, $N+1$, $2N+1$, $3N+1$, ... * Thread 2 : $2$, $N+2$, $2N+2$, $3N+2$, ... * Thread 3 : $3$, $N+3$, $2N+3$, $3N+3$, ... … 以此來使各個threads的工作負載更平均。下面是修改後的結果: ``` $ ./mandelbrot -t 4 Time cost of Thread 2 : 121.048429ms Time cost of Thread 0 : 121.194788ms Time cost of Thread 3 : 121.366946ms [mandelbrot thread]: [121.619] ms Wrote image file mandelbrot-thread.ppm (3.80x speedup from 4 threads) ``` :::success 修改之後,各thread的工作負載得到更好的分配,使運算量更平衡進而縮短整體運算時間。可以看到**在$N$為4時Speedup達到3.8倍** ::: ----------------------------------------------------- :::info ++***Q4:***++ Run improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? ::: ++***Ans:***++ 將thread總數調製8之後: ``` $ ./mandelbrot -t 8 Time cost of Thread 3 : 60.737095ms Time cost of Thread 2 : 61.173907ms Time cost of Thread 0 : 100.664969ms Time cost of Thread 5 : 91.760168ms Time cost of Thread 1 : 111.769905ms Time cost of Thread 7 : 110.079502ms Time cost of Thread 6 : 115.579447ms Time cost of Thread 4 : 106.577762ms [mandelbrot thread]: [133.270] ms Wrote image file mandelbrot-thread.ppm (3.46x speedup from 8 threads) ``` :::success 由上可發現,==Threads總數為4的表現還比總數為8好==。這是因為當電腦只有**4 processors, 4 threads**時,在不考慮I/O,僅僅討論CPU資源的情況下,==**不論threads總數多大,能平行使用的實體計算資源都是相同的**,甚至在threads總數超過實體threads數時,還需考慮**Context switch**的處理時間==。因此在Threads總數為8時,效能的提升並不如為4時來的好。 :::