# 2021 Parallel Programming HW2
309706008 李欣盈
All plots available at: [PP HW2](https://docs.google.com/spreadsheets/d/10RDkF99o2d82xOlyK9vr0hqzayxay4UyOgTAPRL1lsI/edit?usp=sharing)
## <font color="#0073FF"> Q1 </font>
### <font color="#0073FF">Produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used for VIEW 1 and VIEW 2. </font>
The following figures show the speedup and execution time for View 1 and View 2 as a function of the number of threads.
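Speedup here is the reference serial time divided by the multithreaded time (this matches the numbers reported by the test harness); for example, for View 1 with 2 threads:

$$
\text{speedup}(n) = \frac{T_{\text{serial}}}{T_{\text{thread}}(n)}, \qquad
\text{speedup}(2) = \frac{462.395\ \text{ms}}{236.736\ \text{ms}} \approx 1.95\times
$$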
### View 1


```
mandelbrot serial
[mandelbrot serial]: [462.395] ms
Wrote image file mandelbrot-serial.ppm
thread = 2
[mandelbrot thread]: [236.736] ms
Wrote image file mandelbrot-thread.ppm
(1.95x speedup from 2 threads)
thread = 3
[mandelbrot thread]: [283.566] ms
Wrote image file mandelbrot-thread.ppm
(1.63x speedup from 3 threads)
thread = 4
[mandelbrot thread]: [194.707] ms
Wrote image file mandelbrot-thread.ppm
(2.38x speedup from 4 threads)
thread = 8
[mandelbrot thread]: [141.096] ms
Wrote image file mandelbrot-thread.ppm
(3.27x speedup from 8 threads)
thread = 16
[mandelbrot thread]: [127.580] ms
Wrote image file mandelbrot-thread.ppm
(3.62x speedup from 16 threads)
```
### View 2


```
mandelbrot serial
[mandelbrot serial]: [288.883] ms
Wrote image file mandelbrot-serial.ppm
thread = 2
[mandelbrot thread]: [172.775] ms
Wrote image file mandelbrot-thread.ppm
(1.67x speedup from 2 threads)
thread = 3
[mandelbrot thread]: [131.590] ms
Wrote image file mandelbrot-thread.ppm
(2.19x speedup from 3 threads)
thread = 4
[mandelbrot thread]: [111.025] ms
Wrote image file mandelbrot-thread.ppm
(2.60x speedup from 4 threads)
thread = 8
[mandelbrot thread]: [79.017] ms
Wrote image file mandelbrot-thread.ppm
(3.66x speedup from 8 threads)
thread = 16
[mandelbrot thread]: [77.061] ms
Wrote image file mandelbrot-thread.ppm
(3.75x speedup from 16 threads)
```
### <font color="#0073FF">Is speedup linear in the number of threads used? </font>
The speedup for **View 1 is not linear**: there is a clear drop in speedup when the number of threads is 3.
The speedup for View 2, however, increases steadily with the number of threads and is much closer to linear.
### <font color="#0073FF">In your writeup hypothesize why this is (or is not) the case?</font>
Because of the different speedup behavior of View 1 and View 2, we can deduce that it is likely related to the **different patterns of the two pictures**.
In View 1, there are more white pixels, and those white pixels are clustered close together.

According to the explanation in [drawing the Mandelbrot set](https://jonisalonen.com/2013/lets-draw-the-mandelbrot-set/):
> Plotting the Mandelbrot set is easy: map each pixel on the screen to a complex number, check if it belongs to the set by iterating the formula, and color the pixel black if it does and white if it doesn’t. Since the iteration may never end we set a maximum number of iterations.

In our code, the `mandel` function decides the color of a pixel. Its `count` argument is equal to `maxIterations`, which is 256, so the body of the for loop shown below executes anywhere from 0 up to `maxIterations` times.
A return value of 0 is rendered as black and the maximum value as white (our renderer uses the opposite color convention from the quote above), so a pixel that should be painted white requires far more loop iterations than a black pixel.
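For reference, a minimal sketch of that kernel, based on the usual starter code (the exact constants and names may differ):

```cpp
// Iterate z = z^2 + c until |z| > 2 or the iteration budget is exhausted.
// The return value (number of iterations) determines the pixel's brightness:
// pixels that never escape return maxIterations and are drawn white.
static inline int mandel(float c_re, float c_im, int count) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < count; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f)   // |z|^2 > 4  <=>  |z| > 2
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}
```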

Because the work is divided by spatial decomposition, **the execution time of a region with more white pixels can be much higher**. This likely explains the non-linear speedup of View 1 with 3 threads: the middle band of View 1 contains many white pixels, so the work is not divided evenly among the 3 threads. One thread takes much longer than the others, which degrades the overall speedup.
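A rough sketch of this block-based (spatial) decomposition, assuming the starter code's `WorkerArgs` fields and its `mandelbrotSerial()` helper (names are illustrative and may differ slightly from our actual code):

```cpp
// Block (spatial) decomposition: each thread computes one contiguous band
// of rows. A band that covers the white middle region does far more work,
// so the threads finish at very different times.
void workerThreadStart(WorkerArgs* const args) {
    int rowsPerThread = (args->height + args->numThreads - 1) / args->numThreads;
    int startRow = args->threadId * rowsPerThread;
    int numRows = rowsPerThread;
    if (startRow + numRows > (int)args->height)   // clamp the last band
        numRows = args->height - startRow;

    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height,
                     startRow, numRows,
                     args->maxIterations, args->output);
}
```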
## <font color="#0073FF"> Q2 </font>
### <font color="#0073FF">Measure the amount of time each thread requires to complete its work. </font>
### View 1
We noticed a degraded speedup when the number of threads is 3. We argued that this could be related to the different workload each thread receives: if the area a thread needs to compute contains more white pixels, its execution time is higher.
We measured the time taken by each thread using `CycleTimer.h`. The following sections show the execution time of each thread for 2, 3, and 4 threads respectively. The computation region assigned to each thread is shown in the tables.
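The per-thread instrumentation looks roughly like this (a sketch assuming the starter code's `CycleTimer::currentSeconds()` and `WorkerArgs`):

```cpp
#include <cstdio>
#include "CycleTimer.h"

void workerThreadStart(WorkerArgs* const args) {
    double startTime = CycleTimer::currentSeconds();

    // ... compute the rows assigned to this thread ...

    double endTime = CycleTimer::currentSeconds();
    std::printf("[thread %d]: [%.3f] ms\n",
                args->threadId, (endTime - startTime) * 1000.0);
}
```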
#### 2 Threads
* The workloads of the two threads are equal, so their execution times are essentially the same.
* Thus, we reach approximately 2x (1.95x) speedup.

Computation area of each thread
| Thread 0 | Thread 1 |
| -------- | -------- |
|||
#### 3 Threads
* The area thread 1 needs to compute has far more white pixels than the areas of the other two threads, so thread 1's execution time is much higher than theirs.
* Due to the slow execution of thread 1, the speedup drops to 1.63x, lower than with 2 threads.

Computation area of each thread
| Thread 0 | Thread 1 | Thread 2 |
| -------- | -------- | -------- |
||||
#### 4 Threads
* The areas thread 1 and thread 2 need to compute have far more white pixels than the other two, so threads 1 and 2 take more time than threads 0 and 3.
* The white-pixel-heavy region is divided evenly between threads 1 and 2, so two threads share the expensive part of the image. Therefore, the speedup (2.38x) is higher than with 2 or 3 threads.

Computation area of each thread
| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
| -------- | -------- | -------- | -------- |
|||||
### <font color="#0073FF"> How do your measurements explain the speedup graph you previously created? </font>
Through this experiment, we can conclude that the execution time of each thread is strongly correlated with the ratio of white area it needs to compute. In the 3-thread scenario, thread 1 computes the region with the highest workload (the most white pixels), which degrades the speedup.
We could therefore distribute the rows with a high white-to-black ratio across all threads to increase the speedup.
### View 2
We further verify this idea with View 2.
We divided View 2 into four segments; the top region contains more white area than the other three.
* Thread 0 received the top area and thus has the highest execution time.

Computation area of each thread
| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
| -------- | -------- | -------- | -------- |
|||||
## <font color="#0073FF"> Q3 </font>
### <font color="#0073FF">Rework the decomposition policy. In your write-up, describe your approach to parallelization</font>
We distribute the rows with a high white-to-black ratio across all threads to increase the speedup.

In our improved version, the first row each thread computes is equal to its own thread ID, e.g. thread 0 starts from row 0 and thread 2 starts from row 2. Each subsequent row a thread computes is (last computed row + number of threads).
For example, with 3 threads, thread 0 computes rows 0, 3 (0+3), 6 (3+3), and so on. The example below illustrates this approach.

Assume there are 10 rows:
* Thread 0 is responsible for rows [0, 3, 6, 9]
* Thread 1 is responsible for rows [1, 4, 7]
* Thread 2 is responsible for rows [2, 5, 8]

So instead of dividing the whole picture into 3 contiguous sections as in the previous approach, we **distribute the middle part of the image (which contains many white pixels and therefore more work) across all threads**. Through this interleaved task distribution, we improve the performance of the computation.
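A minimal sketch of this interleaved assignment, again assuming the starter code's `WorkerArgs` fields and `mandelbrotSerial()` helper (called here one row at a time):

```cpp
// Interleaved (round-robin) row decomposition: thread i computes rows
// i, i + numThreads, i + 2*numThreads, ... so the expensive middle band
// of the image is shared evenly by all threads.
void workerThreadStart(WorkerArgs* const args) {
    for (unsigned int row = args->threadId; row < args->height;
         row += args->numThreads) {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         row, 1,   // startRow = row, totalRows = 1
                         args->maxIterations, args->output);
    }
}
```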

### <font color="#0073FF">Report the final 4-thread speedup obtained </font>
### View 1
3.79x speedup from 4 threads
```
[mandelbrot thread]: [122.095] ms
Wrote image file mandelbrot-thread.ppm
(3.79x speedup from 4 threads)
```
Compared with the previous implementation (Spatial Decomposition):
A: Spatial Decomposition
B: Improved Decomposition Policy

### View 2
3.77x speedup from 4 threads
```
[mandelbrot thread]: [76.626] ms
Wrote image file mandelbrot-thread.ppm
(3.77x speedup from 4 threads)
```
Compared with the previous implementation (Spatial Decomposition):
A: Spatial Decomposition
B: Improved Decomposition Policy

Comparing with the previous implementation (Spatial Decomposition / Block Partition), we see that the improved decomposition policy (red line B) achieves near-linear speedup.
## <font color="#0073FF"> Q4 </font>
### <font color="#0073FF">Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads?</font>
No. The performance is actually **slightly degraded** when using 8 threads, for both View 1 and View 2.
### View 1
3.79x speedup from 4 threads
3.71x speedup from 8 threads
```
[mandelbrot thread]: [124.697] ms
Wrote image file mandelbrot-thread.ppm
(3.71x speedup from 8 threads)
```
### View 2
3.77x speedup from 4 threads
3.70x speedup from 8 threads
```
[mandelbrot thread]: [78.258] ms
Wrote image file mandelbrot-thread.ppm
(3.70x speedup from 8 threads)
```
### <font color="#0073FF">Why or why not? (Notice that the workstation server provides 4 cores, 4 threads.)</font>
The 8-thread speedup is not noticeably greater than the 4-thread speedup because the workstation server has only 4 cores.
Running `lscpu` shows that the server has 1 CPU socket with 4 cores per socket and 1 thread per core. The maximum number of hardware threads is therefore 1 socket × 4 cores × 1 thread per core = 4.
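As a quick cross-check (a standalone snippet, not part of the assignment code), the same count can be queried programmatically:

```cpp
#include <cstdio>
#include <thread>

int main() {
    // std::thread::hardware_concurrency() reports the number of hardware
    // threads; on this workstation (1 socket x 4 cores x 1 thread/core)
    // it should print 4.
    std::printf("hardware threads: %u\n", std::thread::hardware_concurrency());
    return 0;
}
```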

Attempting to create more than 4 threads therefore introduces **context-switch overhead**: each switch takes time because the CPU must save and restore a thread's execution context. This overhead slightly reduces the speedup, resulting in a higher execution time than when using only 4 threads.
:::info
Excellent! You are the god of reports.
Keep going in the future. 💪💪💪
>[name=TA]
:::