Assignment II: Multi-thread Programming

# Assignment II: Multi-thread Programming create by panjianting ## Part 2: Parallel Fractal Generation Using std::thread ### Q1: In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case? (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread data-point.) > Extend your code to use 2, 3, 4 threads, partitioning the image generation work accordingly (threads should get blocks of the image) #### **實現多執行緒的方法** ```cpp=30 void workerThreadStart(WorkerArgs *const args) { double startTime = CycleTimer::currentSeconds(); int height = args -> height / args -> numThreads; int startRow = height * (args -> threadId); mandelbrotSerial( args -> x0, args -> y0, args -> x1, args -> y1, args -> width, args -> height, startRow, height, args -> maxIterations, args -> output); } ``` 將總執行次數切分給數個thread執行，例如總執行次數有1200次，而有3個thread，則thread 0 執行 [0-399]，thread 1 執行 [400-799]，thread 2 執行[800-1199]。 #### **Produce a graph for VIEW1 and VIEW2** ##### View1 ``` $ ./mandelbrot -t 2 [mandelbrot thread]: [236.303] ms Wrote image file mandelbrot-thread.ppm (1.94x speedup from 2 threads) $ ./mandelbrot -t 3 [mandelbrot thread]: [282.013] ms Wrote image file mandelbrot-thread.ppm (1.63x speedup from 3 threads) $ ./mandelbrot -t 4 [mandelbrot thread]: [192.673] ms Wrote image file mandelbrot-thread.ppm (2.38x speedup from 4 threads) $ ./mandelbrot -t 8 [mandelbrot thread]: [149.944] ms Wrote image file mandelbrot-thread.ppm (3.06x speedup from 8 threads) ``` ![](https://i.imgur.com/zDcwgzu.png) ##### View2 ``` $ ./mandelbrot -t 2 -v 2 [mandelbrot thread]: [171.183] ms Wrote image file mandelbrot-thread.ppm (1.68x speedup from 2 threads) $ ./mandelbrot -t 3 -v 2 [mandelbrot thread]: [131.039] ms Wrote image file mandelbrot-thread.ppm (2.19x speedup from 3 threads) $ ./mandelbrot -t 4 -v 2 [mandelbrot thread]: [110.285] ms Wrote image file mandelbrot-thread.ppm (2.59x speedup from 4 threads) $ ./mandelbrot -t 8 -v 2 [mandelbrot thread]: [91.563] ms Wrote image file mandelbrot-thread.ppm (3.13x speedup from 8 threads) ``` ![](https://i.imgur.com/NhOeB9p.png) #### 推論： :::warning 從上方兩張圖片可以發現， 1. 在View1中，沒有線性成長，反而在thread數量為3時，速度是下降的。 2. 在View2中，速度隨著thread數量上升而上升。呈線性成長。為什麼？可以從兩部分來推導： 1. 因View2的速度呈線性成長，所以可以排除是程式問題，從而觀察View1和View2兩張圖片的差異。</br>從兩張圖中可以觀察出，在淺色像素中，View1是遠比View2多的。所以可以直覺推論出，或許算淺色部分是非常耗時的。 ![](https://i.imgur.com/VBVYjHb.png) 2. 觀察像素的計算方式： 1. 從下方的程式碼中，可以發現：・像素值是由 i 來決定。・ i 跟迴圈跑的次數有關，而最高為255(因count為256)，而跑越多次，則 i 越大，而像素顏色越淺。・第8行會根據傳入的c_re、c_im來決定是否跳出迴圈，而決定i值範圍為0~255。・綜合以上三點可知，一個像素顏色越淺(像素值越大)，迴圈執行次數也越多。 ```cpp=1 static inline int mandel(float c_re, float c_im, int count) { float z_re = c_re, z_im = c_im; int i; for (i = 0; i < count; ++i) { if (z_re * z_re + z_im * z_im > 4.f) break; float new_re = z_re * z_re - z_im * z_im; float new_im = 2.f * z_re * z_im; z_re = c_re + new_re; z_im = c_im + new_im; } return i; } ``` 從上方兩部分可以推論出，thread裡需處理的淺色像素越多，則越耗時。</br>而在View1中，如果thread數量為3，則有一個thread需處理較多的淺色像素(thread的工作分配是由上而下平均分配。)，而導致整體速度變慢。 ::: ### Q2: How do your measurements explain the speedup graph you previously created? > To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart(). 以下為圖片的工作分配方式，分別為thread number為2、3、4的狀況： | 2 threads | 3 threads | 4 threads | | :--------: | :--------: | :--------: | | ![](https://i.imgur.com/m7QxxLy.png) | ![](https://i.imgur.com/RQvd0Ww.png) | ![](https://i.imgur.com/cdQvcvM.png) | 以下為使用time code量測出來的時間： ``` $ ./mandelbrot -t 2 Thread 0, time: 235.289 ms Thread 1, time: 237.211 ms [mandelbrot thread]: [237.487] ms Wrote image file mandelbrot-thread.ppm (1.93x speedup from 2 threads) $ ./mandelbrot -t 3 Thread 0, time: 93.526 ms Thread 2, time: 94.177 ms Thread 1, time: 282.850 ms [mandelbrot thread]: [283.096] ms Wrote image file mandelbrot-thread.ppm (1.62x speedup from 3 threads) $ ./mandelbrot -t 4 Thread 0, time: 45.398 ms Thread 3, time: 45.839 ms Thread 1, time: 192.071 ms Thread 2, time: 192.933 ms [mandelbrot thread]: [193.204] ms Wrote image file mandelbrot-thread.ppm (2.38x speedup from 4 threads) ``` :::warning 以上可以觀察出： 1. thread數量為2，因工作分配相當，兩個thread的執行時間差不多。 2. thread數量為3，可以發現因中間部分的淺色像素較多(thread 1)，故花的時間較其他兩個thread多。 3. thread數量為4，因中間部分為thread 1和thread 2，也因淺色部分多，故執行時間也較慢。從以上三點可以得出，可以證實在Q1的推論。因thread工作分配不均，而導致速度沒有因thread數量增加而變快。 ::: ### Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained. >Modify the mapping of work to threads to achieve to improve speedup to at about 3-4x on both views of the Mandelbrot set (if you’re above 3.5x that’s fine, don’t sweat it). You may not use any synchronization between threads in your solution. We are expecting you to come up with a single work decomposition policy that will work well for all thread counts—hard coding a solution specific to each configuration is not allowed! (Hint: There is a very simple static assignment that will achieve this goal, and no communication/synchronization among threads is necessary.). 我們可以把程式修改成以下，將thread交替的來處理行數。以下表格為thread數量為3舉例，處理行數代號為0~1199： | Thread ID | Handle Rows | |:---------:|:-------------------------------:| | 0 | 0,3,6,9,12,15,18,21,....,1197 | | 1 | 1,4,7,10,13,16,19,22,....,1198 | | 2 | 2,5,8,11,14,17,20,23,.....,1199 | 示意圖： ![](https://i.imgur.com/xMzqoAd.jpg) 以下為修改後的code: ```cpp=30 void workerThreadStart(WorkerArgs *const args) { int totalThreads = args -> numThreads; int startRow = args -> threadId; int height = args -> height; for (int i = startRow; i < height; i+=totalThreads){ mandelbrotSerial( args -> x0, args -> y0, args -> x1, args -> y1, args -> width, args -> height, i, 1, args -> maxIterations, args -> output); } } ``` 以下為修改後的執行速度： 1. View1： ``` $ ./mandelbrot -t 2 -v 1 [mandelbrot thread]: [239.266] ms Wrote image file mandelbrot-thread.ppm (1.92x speedup from 2 threads) $ ./mandelbrot -t 3 -v 1 [mandelbrot thread]: [158.960] ms Wrote image file mandelbrot-thread.ppm (2.89x speedup from 3 threads) $ ./mandelbrot -t 4 -v 1 [mandelbrot thread]: [121.677] ms Wrote image file mandelbrot-thread.ppm (3.77x speedup from 4 threads) ``` 2. View2： ``` $ ./mandelbrot -t 2 -v 2 [mandelbrot thread]: [148.036] ms Wrote image file mandelbrot-thread.ppm (1.95x speedup from 2 threads) $ ./mandelbrot -t 3 -v 2 [mandelbrot thread]: [99.462] ms Wrote image file mandelbrot-thread.ppm (2.89x speedup from 3 threads) $ ./mandelbrot -t 4 -v 2 [mandelbrot thread]: [76.912] ms Wrote image file mandelbrot-thread.ppm (3.74x speedup from 4 threads) ``` :::warning 從上面執行結果來看： 1. View1 在thread數量為4時，提升到3.77倍，且thread與speedup為線性成長，thread數量為3時，沒有變慢的情況。 2. View2 在thread數量為4時，提升到3.74倍。 ::: ### Q4: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? (Notice that the workstation server provides 4 cores 4 threads.) 以下為thread數量為 4、8、16、32各執行View1 8次的結果： | Thread Number | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | |:-------------:|:-----:|:-----:|:-----:|:-----:| ----- | ----- |:-----:|:-----:| | 4 | 3.78x | 3.78x | 3.77x | 3.77x | 3.78x | 3.77x | 3.78x | 3.78x | | 8 | 3.46x | 3.50x | 3.52x | 3.55x | 3.65x | 3.57x | 3.49x | 3.55x | | 16 | 3.75x | 3.73x | 3.73x | 3.74x | 3.74x | 3.74x | 3.71x | 3.77x | | 32 | 3.72x | 3.73x | 3.75x | 3.75x | 3.73x | 3.74x | 3.69x | 3.68x | ![](https://i.imgur.com/8qvcJ2e.png) :::warning #### 結論：從上面的執行結果來看，8 thread以上的速度對比於4 thread，反而速度沒有提升，且穩定性較差。 #### Why? 造成這樣的原因可能是因為我們的工作站只有4核心，而給超過4核心的thread就會造成Context Switch Overhead，而導致執行速度及穩定性都沒4 thread來得好。 :::