Parallel Programming HW2

--- tags: 111上 --- # Parallel Programming HW2 > ID:310552027 Name:Kuo Chia-Liang ### part2 - Modify the starter code to parallelize the Mandelbrot generation using two processors. Specifically, compute the top half of the image in thread 0, and the bottom half of the image in thread 1. This type of problem decomposition is referred to as spatial decomposition since different spatial regions of the image are computed by different processors. - Extend your code to use 2, 3, 4 threads, partitioning the image generation work accordingly (threads should get blocks of the image). #### Execution results - 2,3,4 threads for view1 ```shell $ ./mandelbrot -t 1 -v 1 [mandelbrot serial]: [475.477] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [473.421] ms Wrote image file mandelbrot-thread.ppm (1.00x speedup from 1 threads) $ ./mandelbrot -t 2 -v 1 [mandelbrot serial]: [474.242] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [239.004] ms Wrote image file mandelbrot-thread.ppm (1.98x speedup from 2 threads) $ ./mandelbrot -t 3 -v 1 [mandelbrot serial]: [474.094] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [289.624] ms Wrote image file mandelbrot-thread.ppm (1.64x speedup from 3 threads) $ ./mandelbrot -t 4 -v 1 [mandelbrot serial]: [473.789] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [200.193] ms Wrote image file mandelbrot-thread.ppm (2.37x speedup from 4 threads) ``` - 2,3,4 threads for view2 ```shell $ ./mandelbrot -t 1 -v 2 [mandelbrot serial]: [295.710] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [295.752] ms Wrote image file mandelbrot-thread.ppm (1.00x speedup from 1 threads) $ ./mandelbrot -t 2 -v 2 [mandelbrot serial]: [295.639] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [174.887] ms Wrote image file mandelbrot-thread.ppm (1.69x speedup from 2 threads) $ ./mandelbrot -t 3 -v 2 [mandelbrot serial]: [295.512] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [133.786] ms Wrote image file mandelbrot-thread.ppm (2.21x speedup from 3 threads) $ ./mandelbrot -t 4 -v 2 [mandelbrot serial]: [297.090] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [112.588] ms Wrote image file mandelbrot-thread.ppm (2.64x speedup from 4 threads) ``` #### Q1: In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used? ![](https://i.imgur.com/68HMnUC.png) ![](https://i.imgur.com/XV2Ueea.png) The above figures are the speedup results of 2,3,4 threads for view1 and view2. We can see that in view 1, the number of threads is not proportional to the speedup, even though the speedup is lower in 3 threads compared to 2 threads in view1. On the other hand, the number of threads seems to be proportional to the speedup in view 2. In order to confirm (my hypothesis, I measure the amount of time each thread requires to complete its work. :::spoiler code for adding timer to threads ```c void workerThreadStart(WorkerArgs *const args) { double minThread = 1e30; double startTime = CycleTimer::currentSeconds(); int startRow = args->threadId * (1 + args->height / args->numThreads); int numRows = (args->threadId == args->numThreads - 1) ? (args->height % (1 + args->height / args->numThreads)) : (1 + args->height / args->numThreads); mandelbrotSerial( args->x0, args->y0, args->x1, args->y1, args->width, args->height, startRow, numRows, args->maxIterations, args->output); double endTime = CycleTimer::currentSeconds(); minThread = std::min(minThread, endTime - startTime); printf("[mandelbrot thread %u]:\t\t[%.3f] ms\n", (unsigned)args->threadId, minThread * 1000); } ``` ::: :::spoiler code for auto-running different number of threads in 2 views - `run.sh`: this script will output the log information into `log-[thread number]-[view].txt` ```shell # bin/sh make ts=(2 3 4 5) vs=(1 2) for t in "${ts[@]}" do for v in "${vs[@]}" do echo "$t-$v" ./mandelbrot -t $t -v $v > "log-$t-$v.txt" done done ``` ::: :::spoiler code for parsing and computing the average execution time - `parse_log.py` this code will parse each log.txt file and compute the average execution time for different number of threads and views ```python= import os, glob, json txts = glob.glob('*.txt') txts.sort() whole_dict = { 'v=1': {}, 'v=2': {}, } out_lines = [] for txt in txts: f = open(txt, 'r') t = int(os.path.splitext(txt)[0].split('-')[-2]) v = int(os.path.splitext(txt)[0].split('-')[-1]) lines = f.readlines() data_d = { 'mandelbrot serial': 0, 'mandelbrot thread': 0, 'speedup': 0, } for i in range(t): data_d[f'mandelbrot thread {i}'] = [] for line in lines: if '[mandelbrot serial]' in line: data_d['mandelbrot serial'] = float(line.split('[')[2].split(']')[0]) if '[mandelbrot thread]' in line: data_d['mandelbrot thread'] = float(line.split('[')[2].split(']')[0]) if 'mandelbrot thread ' in line: tid = line.split(' ')[2].split(']')[0] data_d[f'mandelbrot thread {tid}'].append(float(line.split('[')[2].split(']')[0])) if 'speedup' in line: data_d['speedup'] = float(line.split('(')[1].split('x')[0]) for i in range(t): data_d[f'mandelbrot thread {i}\n'] = min(data_d[f'mandelbrot thread {i}']) out_lines.append('./mandelbrot -t {} -v {}\n'.format(t, v)) out_lines.append('[mandelbrot serial]: [{}] ms\n'.format(data_d['mandelbrot serial'])) out_lines.append('Wrote image file mandelbrot-serial.ppm\n') for i in range(t): out_lines.append('[mandelbrot thread {}]: [{}] ms\n'.format(i, min(data_d[f'mandelbrot thread {i}']))) out_lines.append('[mandelbrot thread]: [{}] ms\n'.format(data_d['mandelbrot serial'])) out_lines.append('Wrote image file mandelbrot-thread.ppm\n') out_lines.append(' ({}x speedup from {} threads)\n'.format(data_d['speedup'], t)) out_lines.append('=================================================================\n') # whole_dict[f'v={v}'][f't={t}'] = data_d f_out = open('res.out', 'w') f_out.writelines(out_lines) ``` ::: #### the execution time for each thread when using different thread number for view1: ```shell $ ./mandelbrot -t 2 -v 1 [mandelbrot serial]: [461.342] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [239.233] ms [mandelbrot thread 1]: [236.783] ms [mandelbrot thread]: [239.323] ms Wrote image file mandelbrot-thread.ppm (1.93x speedup from 2 threads) $ ./mandelbrot -t 3 -v 1 [mandelbrot serial]: [463.757] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [94.829] ms [mandelbrot thread 1]: [284.307] ms [mandelbrot thread 2]: [94.142] ms [mandelbrot thread]: [284.509] ms Wrote image file mandelbrot-thread.ppm (1.63x speedup from 3 threads) $ ./mandelbrot -t 4 -v 1 [mandelbrot serial]: [462.084] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [45.716] ms [mandelbrot thread 1]: [194.413] ms [mandelbrot thread 2]: [193.629] ms [mandelbrot thread 3]: [44.705] ms [mandelbrot thread]: [194.517] ms Wrote image file mandelbrot-thread.ppm (2.38x speedup from 4 threads) ``` #### the execution time for each thread when using different thread number for view2: ```shell $ ./mandelbrot -t 2 -v 2 [mandelbrot serial]: [290.398] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [173.412] ms [mandelbrot thread 1]: [122.020] ms [mandelbrot thread]: [173.498] ms Wrote image file mandelbrot-thread.ppm (1.67x speedup from 2 threads) $ ./mandelbrot -t 3 -v 2 [mandelbrot serial]: [290.580] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [131.813] ms [mandelbrot thread 1]: [86.955] ms [mandelbrot thread 2]: [79.258] ms [mandelbrot thread]: [131.904] ms Wrote image file mandelbrot-thread.ppm (2.20x speedup from 3 threads) $ ./mandelbrot -t 4 -v 2 [mandelbrot serial]: [288.737] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [110.610] ms [mandelbrot thread 1]: [65.714] ms [mandelbrot thread 2]: [64.823] ms [mandelbrot thread 3]: [60.256] ms [mandelbrot thread]: [110.662] ms Wrote image file mandelbrot-thread.ppm (2.61x speedup from 4 threads) ``` From the execution results above, we can see that the execution time is almost equal to the time of the slowest thread. Specially, we can see quite different execution times in each thread. For example, when using 3 threads in view1, the execution time in thread 1 is almost 3 times that in thread 0, thread 2. Similarly, when using 4 threads in view1, the execution time in thread 1,2 are more than 4 times than in thread 0, 3. Besides, in view2, we can observe that thread 0 always be the slowest thread. #### Q2: How do your measurements explain the speedup graph you previously created? To find the reason why different threads have quite different execution times, I split view1 and view2 into different parts by thread number (like the pictures below), each part is the sub-picture that one thread should output. For each pair of views and thread number, I also show the execution time for each thread. #### view1, thread number=2, execution time=239.323 ms - thread 0 (left): 239.233 ms - thread 1 (right): 236.783 ms ![](https://i.imgur.com/FBPeUdx.png) #### view1, thread number3, execution time=284.509 ms - thread 0 (left): 94.829 ms - thread 1 (middle): 284.307 ms - thread 2 (right): 94.142 ms ![](https://i.imgur.com/jXRYnMB.png) #### view1, thread number=4, execution time=194.517 ms - thread 0 (first): 45.716 ms - thread 1 (second): 194.413 ms - thread 2 (third): 193.629 ms - thread 3 (fourth): 44.705 ms ![](https://i.imgur.com/pnhEVga.png) #### view2, thread number=2, execution time=173.498 ms - thread 0 (left): 173.412 ms - thread 1 (right): 122.020 ms ![](https://i.imgur.com/9XY7AYr.png) #### view2, thread number3, execution time=131.904 ms - thread 0 (left): 131.813 ms - thread 1 (middle): 86.955 ms - thread 2 (right): 79.258 ms ![](https://i.imgur.com/YAQMgta.png) #### view2, thread number=4, execution time=110.662 ms - thread 0 (first): 110.610 ms - thread 1 (second): 65.714 ms - thread 2 (third): 64.823 ms - thread 3 (fourth): 60.256 ms ![](https://i.imgur.com/IiQReku.png) From the above pictures and the corresponding execution time for each thread, my hypothesis is the more "white part" in the sub-picture, the more execution time is needed. Now we check how one mandelbrot image is created by the code snippets below. We can see that the pixel value is `i`, which is decided by the if-condition in line 47, the more iterations, the brighter the pixels. This analysis is consistent with my assumptions. ```c=40 static inline int mandel(float c_re, float c_im, int count) { float z_re = c_re, z_im = c_im; int i; for (i = 0; i < count; ++i) { if (z_re * z_re + z_im * z_im > 4.f) break; float new_re = z_re * z_re - z_im * z_im; float new_im = 2.f * z_re * z_im; z_re = c_re + new_re; z_im = c_im + new_im; } return i; } ``` #### Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained. From the previous analysis, I observe that the execution time of the parallel version of generating the mandelbrot map is bounded by the execution time of the slowest thread. Besides, the previous parallel method doesn't evenly share the computation cost, which makes some threads need quite more time to complete execution than others. As a result, if we can share the calculation cost evenly, we can improve the overall efficiency. the new implementation is shown below, we can see that one thread generates only one row of mandelbrot picture in each iteration, and each thread shares the computation cost more equally. ```c for (unsigned h = (unsigned)args->threadId; h < args->height; h += args->numThreads) { mandelbrotSerial( args->x0, args->y0, args->x1, args->y1, args->width, args->height, h, 1, args->maxIterations, args->output); } ``` Now, the sub-pictures of the mandelbrot map for each thread are from ![](https://i.imgur.com/CupXyuD.png) to ![](https://i.imgur.com/mgNeAew.png) for view1 when using 4 threads the execution time of view1 and view2 for 4 threads ```shell ./mandelbrot -t 4 -v 1 [mandelbrot serial]: [463.315] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [120.901] ms [mandelbrot thread 1]: [120.799] ms [mandelbrot thread 2]: [120.75] ms [mandelbrot thread 3]: [120.747] ms [mandelbrot thread]: [463.315] ms Wrote image file mandelbrot-thread.ppm (3.82x speedup from 4 threads) ./mandelbrot -t 4 -v 2 [mandelbrot serial]: [289.071] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [75.881] ms [mandelbrot thread 1]: [75.683] ms [mandelbrot thread 2]: [75.119] ms [mandelbrot thread 3]: [75.645] ms [mandelbrot thread]: [289.071] ms Wrote image file mandelbrot-thread.ppm (3.8x speedup from 4 threads) ``` The speedup of view1 and view2 is 3.82x and 3.8x speedup from 4 threads, respectively. #### Q4: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? (Notice that the workstation server provides 4 cores 4 threads.) The execution results of using 4 threads and using 8 threads are shown below: ```shell ./mandelbrot -t 4 -v 1 [mandelbrot serial]: [470.626] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [120.867] ms [mandelbrot thread 1]: [120.819] ms [mandelbrot thread 2]: [120.877] ms [mandelbrot thread 3]: [120.826] ms [mandelbrot thread]: [120.923] ms Wrote image file mandelbrot-thread.ppm (3.86x speedup from 4 threads) ./mandelbrot -t 4 -v 2 [mandelbrot serial]: [288.542] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [75.799] ms [mandelbrot thread 1]: [75.722] ms [mandelbrot thread 2]: [75.49] ms [mandelbrot thread 3]: [75.559] ms [mandelbrot thread]: [75.856] ms Wrote image file mandelbrot-thread.ppm (3.8x speedup from 4 threads) ./mandelbrot -t 8 -v 1 [mandelbrot serial]: [460.689] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [60.605] ms [mandelbrot thread 1]: [111.01] ms [mandelbrot thread 2]: [60.457] ms [mandelbrot thread 3]: [60.415] ms [mandelbrot thread 4]: [95.762] ms [mandelbrot thread 5]: [84.438] ms [mandelbrot thread 6]: [77.891] ms [mandelbrot thread 7]: [88.61] ms [mandelbrot thread]: [111.132] ms Wrote image file mandelbrot-thread.ppm (3.66x speedup from 8 threads) ./mandelbrot -t 8 -v 2 [mandelbrot serial]: [289.154] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread 0]: [38.152] ms [mandelbrot thread 1]: [63.392] ms [mandelbrot thread 2]: [37.925] ms [mandelbrot thread 3]: [37.904] ms [mandelbrot thread 4]: [37.849] ms [mandelbrot thread 5]: [41.916] ms [mandelbrot thread 6]: [47.89] ms [mandelbrot thread 7]: [60.33] ms [mandelbrot thread]: [63.487] ms Wrote image file mandelbrot-thread.ppm (3.57x speedup from 8 threads) ``` We can see that the performance of using 8 threads is not greater than using 4 threads. Contrastively, the performance is even worse than using 4 threads. I think the reason is that the workstation server only provides 4 cores and 4 threads, which may make the server spend more time in context switching when using more than 4 threads.