[NYCU PP-f23] Assignment II: Multi-thread Programming

# [NYCU PP-f23] Assignment II: Multi-thread Programming

`311551174 李元亨`

### Q1: In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case?

View 1 shows that the speedup does not increase linearly with the number of threads. Surprisingly, using 3 threads actually resulted in a longer completion time. In contrast, View 2 exhibits closer-to-linear speedup as threads are added.

![image.png](https://hackmd.io/_uploads/Hyi3Xpe7p.png)

In View 1's visualization, with 3 threads and the current parallelism scheme, Thread 2 handles a larger portion of the white area than the other two threads. White pixels require more iterations, so Thread 2 needs a longer calculation time. The workloads therefore differ significantly among the threads, and some threads (e.g., Threads 1 and 3) finish early and wait for another (e.g., Thread 2) to complete its task.

### Q2: How do your measurements explain the speedup graph you previously created?

Running View 1 with three threads, the time required by each thread is shown below:

|       | Thread 1 | Thread 2 | Thread 3 |
|-------|----------|----------|----------|
| run 1 | 82.7399  | 257.244  | 85.03    |
| run 2 | 83.6573  | 252.272  | 87.5849  |
| run 3 | 82.838   | 252.661  | 86.0711  |
| run 4 | 81.1625  | 252.46   | 82.7004  |
| run 5 | 83.0439  | 251.258  | 83.667   |
| AVG   | 82.68832 | 253.179  | 85.01068 |

The table shows that Thread 2 takes substantially longer to complete than Thread 1 and Thread 3 (about 3x longer), which confirms the hypothesis stated in Q1.

### Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained.

To even out the workload across threads, I make thread n responsible for rows n, n + `#thread`, n + 2 * `#thread`, and so on.

```cpp
// Compute every `skipRows`-th row of the image, starting at `startRow`.
// `mandel()` and `WorkerArgs` are defined elsewhere in the assignment code.
void mandelbrotSkip(
    float x0, float y0, float x1, float y1,
    int width, int height,
    int startRow, int skipRows,
    int maxIterations,
    int output[])
{
    float dx = (x1 - x0) / width;
    float dy = (y1 - y0) / height;

    for (int j = startRow; j < height; j += skipRows) {
        for (int i = 0; i < width; ++i) {
            float x = x0 + i * dx;
            float y = y0 + j * dy;

            int index = (j * width + i);
            output[index] = mandel(x, y, maxIterations);
        }
    }
}

// Thread entry point: thread `threadId` handles rows
// threadId, threadId + numThreads, threadId + 2 * numThreads, ...
void workerThreadStart(WorkerArgs *const args)
{
    int startRow = args->threadId;
    mandelbrotSkip(args->x0, args->y0, args->x1, args->y1,
                   args->width, args->height,
                   startRow, args->numThreads,
                   args->maxIterations, args->output);
}
```

The results show that this achieves a 3.83x speedup using four threads.

```unix
$ ./mandelbrot -t 4
[mandelbrot serial]:    [466.279] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:    [121.592] ms
Wrote image file mandelbrot-thread.ppm
        (3.83x speedup from 4 threads)
```

### Q4: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not?

```unix
$ ./mandelbrot -t 8
[mandelbrot serial]:    [463.234] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:    [124.601] ms
Wrote image file mandelbrot-thread.ppm
        (3.72x speedup from 8 threads)
```

Running 8 threads on the workstation server, which has 4 cores and 4 hardware threads, gives slightly worse performance than running 4 threads (3.72x vs. 3.83x). The server can run only 4 threads simultaneously, so the extra threads add no parallelism. Additionally, there is context-switch overhead when the cores switch between threads.

To further verify this explanation, I ran the same program on a different machine with 6 cores and 12 threads:

```unix
$ ./mandelbrot -t 4
[mandelbrot serial]:    [406.370] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:    [106.646] ms
Wrote image file mandelbrot-thread.ppm
        (3.81x speedup from 4 threads)
$ ./mandelbrot -t 8
[mandelbrot serial]:    [406.748] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:    [56.616] ms
Wrote image file mandelbrot-thread.ppm
        (7.18x speedup from 8 threads)
```

On this machine, increasing the number of threads from 4 to 8 yields a much more substantial speedup (3.81x to 7.18x), because all 8 threads can run simultaneously.
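Since the useful thread count depends on the machine, a program could query the number of hardware threads and cap the requested count accordingly. The following is a minimal sketch under that assumption; `clampThreadCount` is a hypothetical helper and is not part of the assignment code. `std::thread::hardware_concurrency()` is only a hint and may return 0 when the value cannot be determined.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: cap a requested thread count at the number of
// hardware threads. If the hardware count is unknown (0), keep the request.
int clampThreadCount(int requested, unsigned hardwareThreads) {
    int atLeastOne = std::max(requested, 1);
    if (hardwareThreads == 0) {
        return atLeastOne;
    }
    return std::min(atLeastOne, static_cast<int>(hardwareThreads));
}

// Convenience overload that queries the current machine.
int clampThreadCount(int requested) {
    return clampThreadCount(requested, std::thread::hardware_concurrency());
}
```

Under this sketch, a request for 8 threads would be reduced to 4 on a machine reporting 4 hardware threads and left at 8 on a machine reporting 12, which matches the behavior observed in the two experiments.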