---
title: Parallel Programming F23 HW2
tags: Homework, NYCU
---

# Parallel Programming F23 HW2

## Student

|Title|Content|
|-|-|
|ID|109704065|
|Name|李冠緯|

## Question 1

> Q1: In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case? (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread data-point.)

First, I tested View 1 and tabulated the results with Python.

![Imgur](https://i.imgur.com/h0wHjTQ.png)

![Imgur](https://i.imgur.com/WPxamXD.png)

- **3 threads**
  With 3 threads, the speedup clearly dips. My main hypothesis is that I use integer division in the `workerThreadStart` function to assign each thread its section of the image. If `height` is not a multiple of 3, the rows cannot be divided evenly, so the load is unbalanced across threads and performance drops.
- **1 thread**
  There is still a speedup with a single thread because I vectorized the `mandel` and `mandelbrotSerial` functions with SIMD, using GCC vector types with Intel SSE.

## Question 2

> Q2: How do your measurements explain the speedup graph you previously created?

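Per-thread timings for this question can be collected by timing the body of each worker. The following is a minimal sketch, not the code submitted for the assignment: `elapsedSeconds` and `formatReport` are hypothetical helper names, and the report line simply mirrors the `Thread N: elapsed time = ... seconds` shape seen in the log of this write-up.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the wall-clock seconds spent inside `work`.
template <typename Work>
double elapsedSeconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

// Hypothetical helper: formats one report line for a given thread.
std::string formatReport(int threadId, double seconds) {
    assert(threadId >= 0);
    char buf[96];
    std::snprintf(buf, sizeof(buf), "Thread %d: elapsed time = %.4f seconds",
                  threadId, seconds);
    return std::string(buf);
}
```

Inside a worker, the rendering call for that thread's rows would be passed to `elapsedSeconds` as a lambda, and the returned value printed with `formatReport`.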
![Imgur](https://i.imgur.com/mZJ8Jen.png)

```shell
./mandelbrot --view 1 -t 4 > output.out
```

```text
[mandelbrot serial]: [462.736] ms
Wrote image file mandelbrot-serial.ppm
Thread 0: elapsed time = 0.0196 seconds
Thread 3: elapsed time = 0.0198 seconds
Thread 1: elapsed time = 0.0761 seconds
Thread 2: elapsed time = 0.0768 seconds
Thread 0: elapsed time = 0.0196 seconds
Thread 3: elapsed time = 0.0202 seconds
Thread 1: elapsed time = 0.0758 seconds
Thread 2: elapsed time = 0.0762 seconds
Thread 0: elapsed time = 0.0196 seconds
Thread 3: elapsed time = 0.0197 seconds
Thread 1: elapsed time = 0.0760 seconds
Thread 2: elapsed time = 0.0763 seconds
Thread 3: elapsed time = 0.0197 seconds
Thread 0: elapsed time = 0.0204 seconds
Thread 1: elapsed time = 0.0755 seconds
Thread 2: elapsed time = 0.0763 seconds
Thread 3: elapsed time = 0.0197 seconds
Thread 0: elapsed time = 0.0208 seconds
Thread 1: elapsed time = 0.0759 seconds
Thread 2: elapsed time = 0.0760 seconds
[mandelbrot thread]: [76.243] ms
Wrote image file mandelbrot-thread.ppm
(6.07x speedup from 4 threads)
```

The results show that the load differs widely between threads: threads 0 and 3 finish in about 0.02 s, while threads 1 and 2 take about 0.076 s. I believe this happens because the pixels that need more computation are not evenly distributed across the image, while my work assignment simply cuts the image into equal parts, so each thread ends up with a different load. The View 2 results support this hypothesis, because the load in View 2 is much more evenly distributed than in View 1.

![Imgur](https://i.imgur.com/mJ86E69.png)

![Imgur](https://i.imgur.com/LchSsSO.png)

## Question 3

> Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained.

As mentioned in Q1, I used SIMD to speed up the work done by each thread.
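
The SIMD idea can be sketched with GCC vector types as follows. This is a minimal illustration under stated assumptions, not the code submitted for the assignment: `v4f`, `v4i`, and `mandel4` are hypothetical names, and the scalar `mandel` kernel is assumed to iterate z = z² + c until |z|² exceeds 4 or the iteration limit is reached.

```cpp
#include <cassert>

// GCC/Clang vector extension types: four 32-bit lanes in 16 bytes,
// which fits one SSE register on x86.
typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

// Hypothetical 4-wide kernel: iterates z = z^2 + c for four points at once
// and returns the per-lane iteration counts.
v4i mandel4(v4f cRe, v4f cIm, int maxIterations) {
    assert(maxIterations >= 0);
    const v4f four = {4.f, 4.f, 4.f, 4.f};
    const v4f two = {2.f, 2.f, 2.f, 2.f};
    v4f zRe = cRe, zIm = cIm;
    v4i count = {0, 0, 0, 0};
    v4i active = {-1, -1, -1, -1};  // -1 (all bits set) marks a live lane
    for (int i = 0; i < maxIterations; ++i) {
        v4f re2 = zRe * zRe;
        v4f im2 = zIm * zIm;
        // Lane-wise compare yields -1 where |z|^2 <= 4 and 0 elsewhere;
        // the AND keeps a lane inactive once it has escaped.
        active &= ((re2 + im2) <= four);
        // Stop as soon as every lane has escaped.
        if ((active[0] | active[1] | active[2] | active[3]) == 0) break;
        // Subtracting the -1 mask adds 1 only to lanes that are still live.
        count -= active;
        v4f newRe = re2 - im2;
        v4f newIm = two * zRe * zIm;
        zRe = cRe + newRe;
        zIm = cIm + newIm;
    }
    return count;
}
```

For the four points c = 0, 2 + 2i, 1, and -1 with `maxIterations = 256`, the lanes return 256, 0, 2, and 256. Because all four lanes iterate until the slowest one escapes, the gain is largest when neighbouring pixels need similar iteration counts.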
Although I did not achieve load balancing across threads, every pixel is computed independently, so I think SIMD is a very good fit for this assignment. The 4-thread run shown in Q2 reached a 6.07x speedup.

## Question 4

> Q4: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? (Notice that the workstation server provides 4 cores 4 threads.)

According to the graphs in Q1 and the one here, running with 8 threads does increase the speed (9.26x versus 6.07x with 4 threads). The per-thread timings show why: even though the machine has only 4 hardware threads, the work is cut into finer pieces, so the load is spread slightly more evenly across the cores. The whole program therefore runs faster as an indirect effect of the finer partitioning.

![Imgur](https://i.imgur.com/k0l1ur4.png)

```shell
./mandelbrot --view 1 -t 8 > output.out
```

```text
[mandelbrot serial]: [461.526] ms
Wrote image file mandelbrot-serial.ppm
Thread 0: elapsed time = 0.0036 seconds
Thread 7: elapsed time = 0.0036 seconds
Thread 1: elapsed time = 0.0265 seconds
Thread 2: elapsed time = 0.0347 seconds
Thread 5: elapsed time = 0.0312 seconds
Thread 6: elapsed time = 0.0275 seconds
Thread 3: elapsed time = 0.0458 seconds
Thread 4: elapsed time = 0.0614 seconds
Thread 0: elapsed time = 0.0036 seconds
Thread 7: elapsed time = 0.0036 seconds
Thread 1: elapsed time = 0.0162 seconds
Thread 6: elapsed time = 0.0281 seconds
Thread 2: elapsed time = 0.0350 seconds
Thread 3: elapsed time = 0.0463 seconds
Thread 5: elapsed time = 0.0404 seconds
Thread 4: elapsed time = 0.0453 seconds
Thread 0: elapsed time = 0.0036 seconds
Thread 7: elapsed time = 0.0036 seconds
Thread 1: elapsed time = 0.0281 seconds
Thread 6: elapsed time = 0.0281 seconds
Thread 5: elapsed time = 0.0350 seconds
Thread 2: elapsed time = 0.0474 seconds
Thread 3: elapsed time = 0.0495 seconds
Thread 4: elapsed time = 0.0460 seconds
Thread 0: elapsed time = 0.0036 seconds
Thread 7: elapsed time = 0.0036 seconds
Thread 6: elapsed time = 0.0160 seconds
Thread 1: elapsed time = 0.0321 seconds
Thread 2: elapsed time = 0.0464 seconds
Thread 5: elapsed time = 0.0426 seconds
Thread 3: elapsed time = 0.0494 seconds
Thread 4: elapsed time = 0.0462 seconds
Thread 0: elapsed time = 0.0035 seconds
Thread 7: elapsed time = 0.0036 seconds
Thread 1: elapsed time = 0.0196 seconds
Thread 6: elapsed time = 0.0241 seconds
Thread 2: elapsed time = 0.0312 seconds
Thread 3: elapsed time = 0.0456 seconds
Thread 5: elapsed time = 0.0383 seconds
Thread 4: elapsed time = 0.0457 seconds
[mandelbrot thread]: [49.829] ms
Wrote image file mandelbrot-thread.ppm
(9.26x speedup from 8 threads)
```
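
The equal split of rows by integer division discussed in Q1 and Q4 can be sketched as below. This is only an illustration: `RowRange` and `blockRange` are hypothetical names, and because the remainder policy of `workerThreadStart` is not shown in this write-up, giving the leftover rows to the last thread is an assumption.

```cpp
#include <cassert>

// Hypothetical description of the contiguous rows one thread renders.
struct RowRange {
    int startRow;
    int numRows;
};

// Equal split by integer division. Assumption: the last thread also takes
// the remainder rows, so every row is covered exactly once.
RowRange blockRange(int threadId, int numThreads, int height) {
    assert(numThreads > 0 && threadId >= 0 && threadId < numThreads);
    const int rowsPerThread = height / numThreads;
    RowRange range;
    range.startRow = threadId * rowsPerThread;
    range.numRows = (threadId == numThreads - 1) ? height - range.startRow
                                                 : rowsPerThread;
    return range;
}
```

With a hypothetical `height` of 1000 and 3 threads, threads 0 and 1 each get 333 rows and thread 2 gets 334. Increasing `numThreads` shrinks every block, which is the sense in which 8 threads cut the load more finely than 4.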