---
title: Parallel Programming HW2
---
Parallel Programming HW2
===
110550047 巫廷翰
# Part 1
## Parallel Counting PI Using Pthreads
```cpp=
#include <iostream>
#include <pthread.h>
#include <atomic>
#include <cstdlib>    // std::strtoll
#include <stdlib.h>   // rand_r (POSIX)
#include <cstdio>     // printf
#include <cstdint>    // intptr_t
#include <string>     // std::stoi
#include <vector>
// Number of points to generate
long long int NUM_POINTS = 1000000;
// Number of threads to use
int NUM_THREADS = 4;
// Atomic counter for points inside the circle
std::atomic<long long int> inside_circle(0);
// Function executed by each thread; arg carries the thread index
void* calculate_pi(void* arg) {
    int thread_id = (int)(intptr_t)arg;
    long long int points_per_thread = NUM_POINTS / NUM_THREADS;
    long long int local_count = 0;
    // Give every thread a distinct seed so they do not all sample the same points
    unsigned int seed = 50 + (unsigned int)thread_id;
    for (long long int i = 0; i < points_per_thread; ++i) {
        // Uniform point in the square [-1, 1] x [-1, 1]
        double x = ((double)rand_r(&seed) / (double)RAND_MAX) * 2.0 - 1.0;
        double y = ((double)rand_r(&seed) / (double)RAND_MAX) * 2.0 - 1.0;
        // Count it if it falls inside the unit circle
        if (x * x + y * y <= 1.0)
            ++local_count;
    }
    // Accumulate into the shared atomic counter once per thread
    inside_circle += local_count;
    return nullptr;
}
int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << "Usage: " << argv[0] << " <num_threads> <total_points>" << std::endl;
        return 1;
    }
    int num_threads = std::stoi(argv[1]);
    char* end;
    long long int total_points = std::strtoll(argv[2], &end, 10);
    if (num_threads <= 0 || total_points <= 0) {
        std::cerr << "Number of threads and total points must be positive integers." << std::endl;
        return 1;
    }
    NUM_THREADS = num_threads;
    NUM_POINTS = total_points;
    std::vector<pthread_t> threads(num_threads);
    inside_circle = 0;
    // Create threads, passing each one its index (used for seeding)
    for (int i = 0; i < num_threads; ++i) {
        pthread_create(&threads[i], nullptr, calculate_pi, (void*)(intptr_t)i);
    }
    // Join threads
    for (int i = 0; i < num_threads; ++i) {
        pthread_join(threads[i], nullptr);
    }
    // Estimate Pi from the fraction of sampled points inside the circle;
    // divide by the number of points the threads actually generated
    long long int sampled_points = (NUM_POINTS / NUM_THREADS) * (long long int)NUM_THREADS;
    double pi = 4.0 * inside_circle / sampled_points;
    printf("Estimated Pi: %.20f\n", pi);
    return 0;
}
```
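A note on the seeding choice above: each worker must have its own random-number state, otherwise every thread samples the identical point sequence. As a sketch of an alternative (not the submitted version), the same Monte Carlo estimate can be written with C++ `<random>`, giving every worker an independently seeded engine; `std::thread` and the base seed 12345 are used here only to keep the illustration short, while the assignment code above uses Pthreads.
```cpp=
#include <atomic>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

std::atomic<long long> inside(0);

// Each worker owns its own engine, seeded with its thread index,
// so the threads sample different points (hypothetical base seed 12345).
void worker(int id, long long samples) {
    std::mt19937 gen(12345u + (unsigned)id);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    long long local = 0;
    for (long long i = 0; i < samples; ++i) {
        double x = dist(gen), y = dist(gen);
        if (x * x + y * y <= 1.0)
            ++local;
    }
    inside += local;
}

int main() {
    const int numThreads = 4;
    const long long totalPoints = 1000000;
    const long long perThread = totalPoints / numThreads;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back(worker, t, perThread);
    for (auto& th : threads)
        th.join();
    // Divide by the number of points actually generated
    std::cout << "Estimated Pi: " << 4.0 * inside / (perThread * numThreads) << "\n";
    return 0;
}
```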
# Part 2
## Parallel Fractal Generation Using std::thread
## Q1 Speed Up Graph
+ View1

+ View2

No, the speedup does not increase linearly: it drops at 3 threads and shows only a small improvement from 4 to 5 threads.
## Q2 Timing Code
```cpp=
// Inside each worker thread: time only the Mandelbrot computation
double start = CycleTimer::currentSeconds();
int startRow = args->threadId * args->height / args->numThreads;
int numRows = args->height / args->numThreads;
mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height,
                 startRow, numRows, args->maxIterations, args->output);
double end = CycleTimer::currentSeconds();
// Accumulate this thread's elapsed time into a per-thread buffer
runtimes[args->threadId] += end - start;
```
The runtimes below are taken from the run, out of 5 repetitions, that shows the greatest difference between threads.
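For reference, the shared `runtimes` buffer used above is assumed to be a global array indexed by thread id and reset before each run; only `runtimes[args->threadId]` appears in the actual code, so the buffer size, the repetition loop, and the printing below are my assumptions, with `std::chrono` and a dummy workload standing in for `CycleTimer` and `mandelbrotSerial`.
```cpp=
#include <cstdio>
#include <chrono>
#include <thread>
#include <vector>

// Assumed shape of the shared timing buffer: one slot per worker thread.
static double runtimes[32];

// Dummy workload standing in for the mandelbrotSerial() call timed above.
static void timedWorker(int threadId) {
    auto start = std::chrono::steady_clock::now();
    volatile double sink = 0.0;
    for (long i = 0; i < 2000000L * (threadId + 1); ++i) sink = sink + i * 0.5;
    auto end = std::chrono::steady_clock::now();
    // Same accumulation pattern as runtimes[args->threadId] += end - start;
    runtimes[threadId] += std::chrono::duration<double>(end - start).count();
}

int main() {
    const int numThreads = 4;
    const int repetitions = 5;
    for (int rep = 0; rep < repetitions; ++rep) {
        for (int t = 0; t < numThreads; ++t) runtimes[t] = 0.0;  // reset per run
        std::vector<std::thread> workers;
        for (int t = 0; t < numThreads; ++t) workers.emplace_back(timedWorker, t);
        for (auto& w : workers) w.join();
        // Print each thread's time and this run's spread; the run with the
        // largest spread is the one quoted below.
        double lo = runtimes[0], hi = runtimes[0];
        for (int t = 0; t < numThreads; ++t) {
            printf("Thread %d runtime: %.5f\n", t, runtimes[t]);
            lo = runtimes[t] < lo ? runtimes[t] : lo;
            hi = runtimes[t] > hi ? runtimes[t] : hi;
        }
        printf("run %d spread: %.5f\n\n", rep, hi - lo);
    }
    return 0;
}
```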
### View 1
+ 2 threads
+ Thread 0 runtime: 0.13077f
+ Thread 1 runtime: 0.13205f
+ 3 threads
+ Thread 0 runtime: 0.05363f
+ Thread 1 runtime: 0.16386f
+ Thread 2 runtime: 0.05346f
+ 4 threads
+ Thread 0 runtime: 0.02505f
+ Thread 1 runtime: 0.10616f
+ Thread 2 runtime: 0.10805f
+ Thread 3 runtime: 0.02528f
+ 5 threads
+ Thread 0 runtime: 0.01290f
+ Thread 1 runtime: 0.07208f
+ Thread 2 runtime: 0.11087f
+ Thread 3 runtime: 0.07184f
+ Thread 4 runtime: 0.01125f
+ 6 threads
+ Thread 0 runtime: 0.00812f
+ Thread 1 runtime: 0.04623f
+ Thread 2 runtime: 0.08540f
+ Thread 3 runtime: 0.08717f
+ Thread 4 runtime: 0.04660f
+ Thread 5 runtime: 0.00692f
### View 2
+ 2 threads
+ Thread 0 runtime: 0.08978f
+ Thread 1 runtime: 0.06078f
+ 3 threads
+ Thread 0 runtime: 0.07056f
+ Thread 1 runtime: 0.04179f
+ Thread 2 runtime: 0.03925f
+ 4 threads
+ Thread 0 runtime: 0.06249f
+ Thread 1 runtime: 0.03412f
+ Thread 2 runtime: 0.03173f
+ Thread 3 runtime: 0.03365f
+ 5 threads
+ Thread 0 runtime: 0.06268f
+ Thread 1 runtime: 0.03144f
+ Thread 2 runtime: 0.03413f
+ Thread 3 runtime: 0.02899f
+ Thread 4 runtime: 0.02921f
+ 6 threads
+ Thread 0 runtime: 0.05945f
+ Thread 1 runtime: 0.02334f
+ Thread 2 runtime: 0.03002f
+ Thread 3 runtime: 0.02887f
+ Thread 4 runtime: 0.02011f
+ Thread 5 runtime: 0.02593f
> The statistics above show that the load is not balanced across threads. In View 1, the middle threads tend to take considerably more time, while in View 2, thread 0 tends to be the most time-consuming one.
## Q3 Speed Up
### Mandel Calculation
From the `mandel()` function below, we can see that returning a larger value requires more iterations, so the bright (white) regions of the image are exactly the pixels that cost the most to compute.
```cpp=
static inline int mandel(float c_re, float c_im, int count)
{
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < count; ++i)
    {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}
```
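To make the cost asymmetry concrete, here is a tiny self-contained check (the two sample coordinates and the iteration limit of 256 are just illustrative): a point inside the set runs the full `count` iterations before returning, while a point far outside escapes on the first test.
```cpp=
#include <cstdio>

// Same mandel() as above, reproduced so this snippet compiles on its own.
static inline int mandel(float c_re, float c_im, int count)
{
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < count; ++i)
    {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}

int main()
{
    // (0, 0) is inside the set: the loop never breaks, so it costs all 256 iterations.
    printf("inside  (0, 0): %d iterations\n", mandel(0.f, 0.f, 256));
    // (2, 2) is far outside: |z|^2 > 4 on the first check, so it costs 0 iterations.
    printf("outside (2, 2): %d iterations\n", mandel(2.f, 2.f, 256));
    return 0;
}
```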
Going back to the observation from the Q2 conclusion, this means that when the view contains a wide white region, the blocky (contiguous-rows) partitioning leads to load imbalance.

Therefore, a better way to partition the task is to interleave the rows across threads: with m threads, thread 0 handles rows 0, m, 2m, ..., thread 1 handles rows 1, m+1, 2m+1, ..., and so on, so every thread receives rows from both cheap and expensive parts of the image.
```cpp=
// Interleaved assignment: thread t renders rows t, t + numThreads, t + 2*numThreads, ...
for (int r = args->threadId; r < args->height; r += args->numThreads)
{
    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height,
                     r, 1, args->maxIterations, args->output);
}
```
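To show the scheme on its own, below is a small self-contained toy version of this interleaved (round-robin) row assignment using `std::thread`. The per-row Mandelbrot work is replaced by a dummy loop that is deliberately heavier in the middle rows (mimicking View 1's white band), since `mandelbrotSerial` and the real argument struct come from the handout and are not reproduced here.
```cpp=
#include <cstdio>
#include <thread>
#include <vector>

// Dummy stand-in for one row of Mandelbrot work; rows in the middle of the
// image are made more expensive to mimic the wide white band in View 1.
static void rowWork(int row, int height) {
    long iters = (row > height / 4 && row < 3 * height / 4) ? 100000 : 1000;
    volatile double acc = 0.0;
    for (long i = 0; i < iters; ++i) acc = acc + i * 0.5;
}

// Interleaved assignment: thread t handles rows t, t + numThreads, t + 2*numThreads, ...
static void worker(int threadId, int numThreads, int height) {
    for (int r = threadId; r < height; r += numThreads)
        rowWork(r, height);
}

int main() {
    const int numThreads = 4;
    const int height = 1200;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back(worker, t, numThreads, height);
    for (auto& t : threads)
        t.join();
    printf("done: %d rows split round-robin across %d threads\n", height, numThreads);
    return 0;
}
```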
### View 1
+ 2 threads
+ Thread 0 runtime: 0.13188f
+ Thread 1 runtime: 0.13114f
+ 3 threads
+ Thread 0 runtime: 0.08857f
+ Thread 1 runtime: 0.08805f
+ Thread 2 runtime: 0.08974f
+ 4 threads
+ Thread 0 runtime: 0.06670f
+ Thread 1 runtime: 0.06544f
+ Thread 2 runtime: 0.06665f
+ Thread 3 runtime: 0.06601f
+ 5 threads
+ Thread 0 runtime: 0.05335f
+ Thread 1 runtime: 0.05271f
+ Thread 2 runtime: 0.05373f
+ Thread 3 runtime: 0.05210f
+ Thread 4 runtime: 0.05311f
+ 6 threads
+ Thread 0 runtime: 0.05804f
+ Thread 1 runtime: 0.05092f
+ Thread 2 runtime: 0.05710f
+ Thread 3 runtime: 0.04825f
+ Thread 4 runtime: 0.05424f
+ Thread 5 runtime: 0.04939f
### View 2
+ 2 threads
+ Thread 0 runtime: 0.07520f
+ Thread 1 runtime: 0.07390f
+ 3 threads
+ Thread 0 runtime: 0.05306f
+ Thread 1 runtime: 0.05212f
+ Thread 2 runtime: 0.05154f
+ 4 threads
+ Thread 0 runtime: 0.03687f
+ Thread 1 runtime: 0.03660f
+ Thread 2 runtime: 0.03895f
+ Thread 3 runtime: 0.03863f
+ 5 threads
+ Thread 0 runtime: 0.04348f
+ Thread 1 runtime: 0.03533f
+ Thread 2 runtime: 0.03325f
+ Thread 3 runtime: 0.03755f
+ Thread 4 runtime: 0.03516f
+ 6 threads
+ Thread 0 runtime: 0.03631f
+ Thread 1 runtime: 0.03483f
+ Thread 2 runtime: 0.03286f
+ Thread 3 runtime: 0.02986f
+ Thread 4 runtime: 0.03376f
+ Thread 5 runtime: 0.02954f
## Q3 Speed Up Graph
+ View1

+ View2

> From the two graphs above, we can conclude that in my implementation the speedup increases roughly linearly as the number of threads goes from 2 to 6. The reason is that the task is now partitioned evenly, keeping every thread's load as balanced as possible. [name=巫廷翰]
> The statistics also show that each thread takes a similar amount of time to finish its work, which confirms the load balancing. [name=巫廷翰]
## Q4 Is the performance noticeably better than with 6 threads?
> No. Looking at the results below, the speed does not improve noticeably beyond 6 threads.
> In the provided environment, the server offers 6 hardware threads on 6 cores. If the program launches more threads than that, the extra threads only add context-switch overhead: they are executed "concurrently" rather than truly in parallel. [name=巫廷翰]
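As a quick sanity check of that claim (not part of the assignment code), `std::thread::hardware_concurrency()` reports how many hardware threads the machine supports; launching more software threads than this number mainly adds scheduling overhead.
```cpp=
#include <iostream>
#include <thread>

int main() {
    // Reports the number of hardware threads (6 on the course server described above);
    // the standard allows it to return 0 if the value cannot be determined.
    unsigned int n = std::thread::hardware_concurrency();
    std::cout << "hardware threads: " << n << "\n";
    return 0;
}
```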
+ View1

+ View2
