Programming Assignment II: Multi-thread Programming

# Programming Assignment II: Multi-thread Programming ## Part1: Parallel Counting PI Using Pthreads null ## Part2: Parallel Fractal Generation Using std::thread ### Q1: plot a graph of speedup (compared to the reference sequential implementation) as a function of the number of threads used for VIEW 1. Is the speedup linear to the number of threads used? Hypothesize why this is (or is not) the case. (You may want to plot a graph for VIEW 2 for further insights. ![image](https://hackmd.io/_uploads/Skqyup_all.png) ![image](https://hackmd.io/_uploads/BkLIOxWTel.png) according to the plot, the speedup is not linear to the number of threads used, and we can clearly see that for VIEW 1, there is a significant drop of efficiency when using 3 threads to generate the plot. To further diagnose this behavior, I instrumented each thread with timers. > 2025/11/07: The reason for the significant drop in speedup on View 1 with three threads might be the extremely unbalanced distribution of computational load among the threads, since we use block based load distribution, the thread assigned to the center region might requires much more computation than the others, causing it to finish significantly later, while View 2 does not exhibit this issue because its workload (bright parts) seems more evenly distributed. ### Q2: How do the measurements explain the speedup graph you previously plotted? #### VIEW 1 ``` ./mandelbrot -t 2 Thread 0 elapsed time: 0.314885 seconds Thread 1 elapsed time: 0.316529 seconds [mandelbrot thread]: [317.564] ms ./mandelbrot -t 3 Thread 0 elapsed time: 0.132309 seconds Thread 2 elapsed time: 0.139190 seconds Thread 1 elapsed time: 0.411412 seconds [mandelbrot thread]: [407.149] ms ./mandelbrot -t 4 Thread 0 elapsed time: 0.063319 seconds Thread 3 elapsed time: 0.073332 seconds Thread 2 elapsed time: 0.276186 seconds Thread 1 elapsed time: 0.280731 seconds [mandelbrot thread]: [281.172] ms ``` #### VIEW 2 ``` ./mandelbrot -t 2 -v 2 Thread 1 elapsed time: 0.168188 seconds Thread 0 elapsed time: 0.223842 seconds [mandelbrot thread]: [222.059] ms ./mandelbrot -t 3 -v 2 Thread 1 elapsed time: 0.101139 seconds Thread 2 elapsed time: 0.111719 seconds Thread 0 elapsed time: 0.161664 seconds [mandelbrot thread]: [161.626] ms ./mandelbrot -t 4 -v 2 Thread 3 elapsed time: 0.069525 seconds Thread 1 elapsed time: 0.072434 seconds Thread 2 elapsed time: 0.073156 seconds Thread 0 elapsed time: 0.162869 seconds [mandelbrot thread]: [162.954] ms ``` with the instrumentation, we can clearly see that there are significant elapse time difference between threads when dealing with different chunks of the plot with more then three thread, and since the plot was divided evenly by total thread number, so the problem might be in the core function that calculates Mandelbrot set ``` int mandel(float c_re, float c_im, int count) { float z_re = c_re, z_im = c_im; int i; for (i = 0; i < count; ++i) { if (z_re * z_re + z_im * z_im > 4.f) break; float new_re = (z_re * z_re) - (z_im * z_im); float new_im = 2.f * z_re * z_im; z_re = c_re + new_re; z_im = c_im + new_im; } return i; } ``` in the function `mandel`, the value for count is 256, passed by a constant integer max_iterations, but in the loop there is a `break` statement, causing unbalanced computation between threads where the darker parts (smaller i) require less and brighter parts (larger i) require more iterations, which significantly cuts down the efficiency for VIEW 1 and slightly cuts down the efficiency for VIEW 2, since the chunk was divided as follows: ![image](https://hackmd.io/_uploads/H1l9_0O6ll.png) so for the thread(s) that responsible for the calculating the chunk close to the center, it will have more computation since there are much more brighter pixels in the center. ### Q3: In your write-up, describe your parallelization approach and report the final speedup achieved with 4 threads ![image](https://hackmd.io/_uploads/Hkb_m-Fael.png) Because dividing the plot by blocks could lead to unbalanced computation, so in my second approach, I used an interleaved assignment of rows, where each thread only takes 1 row at a time and jumps to row `N+thread_num` for the next calculation, and here's the speed up: ``` run -c 4 -- ./mandelbrot -t 4 [mandelbrot serial]: [384.880] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [99.226] ms Wrote image file mandelbrot-thread.ppm (3.88x speedup from 4 threads) ``` ``` run -c 4 -- ./mandelbrot -t 4 -v 2 [mandelbrot serial]: [226.486] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [59.497] ms Wrote image file mandelbrot-thread.ppm (3.81x speedup from 4 threads) ``` ### Q4: Is the performance noticeably better than with 6 threads? Why or why not? #### VIEW 1 ``` run -c 6 -- ./mandelbrot -t 6 [mandelbrot serial]: [385.171] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [69.212] ms Wrote image file mandelbrot-thread.ppm (5.56x speedup from 12 threads) run -c 6 -- ./mandelbrot -t 12 [mandelbrot serial]: [385.819] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [79.239] ms Wrote image file mandelbrot-thread.ppm (4.87x speedup from 12 threads) ``` #### VIEW 2 ``` run -c 6 -- ./mandelbrot -t 6 -v 2 [mandelbrot serial]: [225.684] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [40.846] ms Wrote image file mandelbrot-thread.ppm (5.53x speedup from 6 threads) run -c 6 -- ./mandelbrot -t 12 -v 2 [mandelbrot serial]: [225.364] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [41.038] ms Wrote image file mandelbrot-thread.ppm (5.49x speedup from 12 threads) ``` Both VIEW 1 and VIEW 2 shows that the performance got even worse with 12 threads compare to running with only 6 threads, although the cpu i5-10500 support hyper threading, its not turned on according to the output from `lscpu`, and for a system that doesn't do hyper threading (Simultaneous Multi-Threading), each physical core can only execute one thread at a time, so spawning more threads than available cores leads to excessive context switching and synchronization overhead rather than increasing parallelism. ``` run -c 6 -- lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 12 On-line CPU(s) list: 0-11 Vendor ID: GenuineIntel Model name: Intel(R) Core(TM) i5-10500 CPU @ 3.10GHz CPU family: 6 Model: 165 Thread(s) per core: 1 Core(s) per socket: 12 Socket(s): 1 Stepping: 3 BogoMIPS: 6192.05 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat umip md_clear flush_l1d arch_capabilities Virtualization: VT-x Hypervisor vendor: KVM Virtualization type: full L1d cache: 384 KiB (12 instances) L1i cache: 384 KiB (12 instances) L2 cache: 48 MiB (12 instances) L3 cache: 16 MiB (1 instance) NUMA node(s): 1 NUMA node0 CPU(s): 0-11 Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Mitigation; Enhanced IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop Vulnerability Srbds: Unknown: Dependent on hypervisor status Vulnerability Tsx async abort: Not affected ``` ### extended discussion ![mandelbrot-thread](https://hackmd.io/_uploads/Bktabns6ll.png) Looking at the Mandelbrot Set fractal, we can see multiple white regions which represents that for the points inside these white regions are in the Mandelbrot set no matter how many iterations of calculation, since the Mandelbrot set is defined as all complex numbers c where $z_{n+1} = z_n^2 + c$ does not diverge as n (iteration) goes infinity, and we can exploit this property to further improve the speed up. the derivation is far out of the scope of this assignment, so i will just use the Cardioid / bulb checking algebraic inequality condition [here](https://en.wikipedia.org/wiki/Plotting_algorithms_for_the_Mandelbrot_set?utm_source=chatgpt.com#Cardioid_/_bulb_checking) check if $c=x+yi$ is in the main cardioid: ``` float q = (x - 0.25f) * (x - 0.25f) + y * y; if (q * (q + (x - 0.25f)) <= 0.25f * y * y) { g_output[row * g_width + col] = 256; continue; } ``` check if $c=x+yi$ is in the period-2 bulb (main disk): ``` if ((x + 1.0f) * (x + 1.0f) + y * y <= 0.0625f) { g_output[row * g_width + col] = 256; continue; } ``` ![image](https://hackmd.io/_uploads/ryyKVTsaxx.png) and here is the final speed up with these two conditions (interleaved assignment of rows used, and mandelbrot serial also use the condition): #### VIEW 1 ``` run -c 6 -- bash -c "./mandelbrot -t 2; ./mandelbrot -t 3; ./mandelbrot -t 4; ./mandelbrot -t 5; ./mandelbrot -t 6" [mandelbrot serial]: [59.778] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [31.362] ms Wrote image file mandelbrot-thread.ppm (1.91x speedup from 2 threads) [mandelbrot serial]: [59.634] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [21.693] ms Wrote image file mandelbrot-thread.ppm (2.75x speedup from 3 threads) [mandelbrot serial]: [59.906] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [16.324] ms Wrote image file mandelbrot-thread.ppm (3.67x speedup from 4 threads) [mandelbrot serial]: [60.948] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [13.560] ms Wrote image file mandelbrot-thread.ppm (4.49x speedup from 5 threads) [mandelbrot serial]: [60.524] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [11.301] ms Wrote image file mandelbrot-thread.ppm (5.36x speedup from 6 threads) ``` #### VIEW 2 ``` run -c 6 -- bash -c "./mandelbrot -t 2 -v 2; ./mandelbrot -t 3 -v 2; ./mandelbrot -t 4 -v 2; ./mandelbrot -t 5 -v 2; ./mandelbrot -t 6 -v 2" [mandelbrot serial]: [228.577] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [117.871] ms Wrote image file mandelbrot-thread.ppm (1.94x speedup from 2 threads) [mandelbrot serial]: [227.238] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [78.845] ms Wrote image file mandelbrot-thread.ppm (2.88x speedup from 3 threads) [mandelbrot serial]: [227.105] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [59.956] ms Wrote image file mandelbrot-thread.ppm (3.79x speedup from 4 threads) [mandelbrot serial]: [227.194] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [48.753] ms Wrote image file mandelbrot-thread.ppm (4.66x speedup from 5 threads) [mandelbrot serial]: [226.957] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [41.513] ms Wrote image file mandelbrot-thread.ppm (5.47x speedup from 6 threads) ``` And the results recap the feature for VIEW 1 and VIEW 2 again where VIEW 2 didn't get benefit from the conditions, instead, its losses some speed up compare to the non since the calculation for the algebraic inequality conditions are also done for each pixel and are computationally expensive. without condition (for comparison) #### VIEW 1 ``` run -c 6 -- bash -c "./mandelbrot -t 2; ./mandelbrot -t 3; ./mandelbrot -t 4; ./mandelbrot -t 5; ./mandelbrot -t 6" [mandelbrot serial]: [385.196] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [195.584] ms Wrote image file mandelbrot-thread.ppm (1.97x speedup from 2 threads) [mandelbrot serial]: [385.011] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [132.185] ms Wrote image file mandelbrot-thread.ppm (2.91x speedup from 3 threads) [mandelbrot serial]: [385.031] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [105.010] ms Wrote image file mandelbrot-thread.ppm (3.67x speedup from 4 threads) [mandelbrot serial]: [384.861] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [82.569] ms Wrote image file mandelbrot-thread.ppm (4.66x speedup from 5 threads) [mandelbrot serial]: [385.635] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [69.814] ms Wrote image file mandelbrot-thread.ppm (5.52x speedup from 6 threads) ``` #### VIEW 2 ``` run -c 6 -- bash -c "./mandelbrot -t 2 -v 2; ./mandelbrot -t 3 -v 2; ./mandelbrot -t 4 -v 2; ./mandelbrot -t 5 -v 2; ./mandelbrot -t 6 -v 2" [mandelbrot serial]: [226.663] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [117.465] ms Wrote image file mandelbrot-thread.ppm (1.93x speedup from 2 threads) [mandelbrot serial]: [226.669] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [82.366] ms Wrote image file mandelbrot-thread.ppm (2.75x speedup from 3 threads) [mandelbrot serial]: [226.841] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [62.176] ms Wrote image file mandelbrot-thread.ppm (3.65x speedup from 4 threads) [mandelbrot serial]: [226.400] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [52.342] ms Wrote image file mandelbrot-thread.ppm (4.33x speedup from 5 threads) [mandelbrot serial]: [226.631] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [46.746] ms Wrote image file mandelbrot-thread.ppm (4.85x speedup from 6 threads) ``` final execution time comparison ![image](https://hackmd.io/_uploads/H1cVQCiaxg.png) ### SIMD approach for more speed up Additionally, we can use intrinsic functions to gain more speed up, although we can't use AVX2 instructions for 256 bit vectorization since there is no -mavx2 in the CFLAG, we still have SSE instructions for 128 bit vectorization. And we can also translate the algebraic inequality condition with SIMD intrinsics as follows: ``` // check if c=x+yi is in the main cardioid: q*(q+(x-0.25)) <= 0.25*y^2, q=(x-0.25f)^2+y^2 __m128 x025 = _mm_sub_ps(x_vec, _025); __m128 y2 = _mm_mul_ps(y_vec, y_vec); __m128 q = _mm_add_ps(_mm_mul_ps(x025, x025), y2); __m128 left = _mm_mul_ps(q, _mm_add_ps(q, x025)); __m128 right = _mm_mul_ps(_025, y2); __m128 mask_cardioid = _mm_cmple_ps(left, right); // check if c=x+yi is in the period-2 bulb (main disk): (x+1)^2+y^2 <= 0.0625 __m128 xa1 = _mm_add_ps(x_vec, _1); __m128 mask_disk = _mm_cmple_ps(_mm_add_ps(_mm_mul_ps(xa1, xa1), y2), _00625); __m128 mask_inside = _mm_or_ps(mask_cardioid, mask_disk); int inside_mask = _mm_movemask_ps(mask_inside); ``` and here is the final speed up with 1. interleaved assignment of rows 2. algebraic inequality condition 3. SSE SIMD instructions #### VIEW 1 ``` run -c 6 -- bash -c "./mandelbrot -t 2; ./mandelbrot -t 3; ./mandelbrot -t 4; ./mandelbrot -t 5; ./mandelbrot -t 6" [mandelbrot serial]: [106.273] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [55.122] ms Wrote image file mandelbrot-thread.ppm (1.93x speedup from 2 threads) [mandelbrot serial]: [106.426] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [38.003] ms Wrote image file mandelbrot-thread.ppm (2.80x speedup from 3 threads) [mandelbrot serial]: [106.333] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [28.086] ms Wrote image file mandelbrot-thread.ppm (3.79x speedup from 4 threads) [mandelbrot serial]: [106.340] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [22.827] ms Wrote image file mandelbrot-thread.ppm (4.66x speedup from 5 threads) [mandelbrot serial]: [106.297] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [19.444] ms Wrote image file mandelbrot-thread.ppm (5.47x speedup from 6 threads) ``` #### VIEW 2 ``` run -c 6 -- bash -c "./mandelbrot -t 2 -v 2; ./mandelbrot -t 3 -v 2; ./mandelbrot -t 4 -v 2; ./mandelbrot -t 5 -v 2; ./mandelbrot -t 6 -v 2" [mandelbrot serial]: [70.349] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [36.108] ms Wrote image file mandelbrot-thread.ppm (1.95x speedup from 2 threads) [mandelbrot serial]: [70.362] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [24.180] ms Wrote image file mandelbrot-thread.ppm (2.91x speedup from 3 threads) [mandelbrot serial]: [70.376] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [18.302] ms Wrote image file mandelbrot-thread.ppm (3.85x speedup from 4 threads) [mandelbrot serial]: [71.444] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [15.072] ms Wrote image file mandelbrot-thread.ppm (4.74x speedup from 5 threads) [mandelbrot serial]: [70.418] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [12.858] ms Wrote image file mandelbrot-thread.ppm (5.48x speedup from 6 threads) ``` ![image](https://hackmd.io/_uploads/ByDoV16Tgg.png) ![image](https://hackmd.io/_uploads/HJ8RD6pTgx.png) comparing the result between **VIEW1 with condition** and **VIEW1 with condition and SIMD**, we can see that there's a slightly backslide after adding SIMD implementation, which is expected, because SIMD implementation calculates four lanes at the same time, the computation can be omitted only if all four lanes matches the algebraic inequality condition, but SIMD compensate this this behavior, makes it just about 1.72x slower than the implementation without SIMD. while the performance slightly backslide a bit for VIEW1, there is a significant improvement for about 3.23x speed up after adding SIMD to **VIEW2 with condition**, since conditions has way less contribution for VIEW2. In summery, i think its a reasonable trade off to add SIMD where VIEW1 being 1.72x slower and VIEW2 having 3.23x speed up, although the performance for VIEW2 won't be evaluated in the assignment, i will still use this final version for validation as both approach passed R_100. ### Further improvement for vectorized cardioid/disk checking A possible improvement for vectorized cardioid/disk checking is to filter out the points in the cardioid/disk region before getting in to the computation for other regions, which requires re-indexing the column: ``` for each row i: initialize filtered columns list for each column j: if column j is in assign output[i*width+j] with value 256 else: cardioid/disk region: push j to filtered columns list efficiently perform vectored computation of Mandelbrot set for all values in filtered columns list ``` but as mentioned before, the methods proposed above has already passed the R_100, so i tend to not really implement this improvement.