---
title: Parallel Programming HW2
---
Parallel Programming HW2
===
110550047 巫廷翰
# Part 1
## Parallel Counting PI Using Pthreads
```cpp=
#include <iostream>
#include <pthread.h>
#include <atomic>
#include <cstdlib>    // std::strtoll
#include <stdlib.h>   // rand_r (POSIX)
#include <cstdio>     // printf
#include <cstdint>    // intptr_t
#include <string>     // std::stoi
#include <vector>
// Number of points to generate
long long int NUM_POINTS = 1000000;
// Number of threads to use
int NUM_THREADS = 4;
// Atomic counter for points inside the circle
std::atomic<long long int> inside_circle(0);
// Function executed by each thread; arg carries the thread index
void* calculate_pi(void* arg) {
    int thread_id = (int)(intptr_t)arg;
    long long int points_per_thread = NUM_POINTS / NUM_THREADS;
    long long int local_count = 0;
    // Give every thread a distinct seed so they do not all sample the same points
    unsigned int seed = 50 + (unsigned int)thread_id;
    for (long long int i = 0; i < points_per_thread; ++i) {
        // Uniform point in the square [-1, 1] x [-1, 1]
        double x = ((double)rand_r(&seed) / (double)RAND_MAX) * 2.0 - 1.0;
        double y = ((double)rand_r(&seed) / (double)RAND_MAX) * 2.0 - 1.0;
        // Count it if it falls inside the unit circle
        if (x * x + y * y <= 1.0)
            ++local_count;
    }
    // Accumulate into the shared atomic counter once per thread
    inside_circle += local_count;
    return nullptr;
}
int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << "Usage: " << argv[0] << " <num_threads> <total_points>" << std::endl;
        return 1;
    }
    int num_threads = std::stoi(argv[1]);
    char* end;
    long long int total_points = std::strtoll(argv[2], &end, 10);
    if (num_threads <= 0 || total_points <= 0) {
        std::cerr << "Number of threads and total points must be positive integers." << std::endl;
        return 1;
    }
    NUM_THREADS = num_threads;
    NUM_POINTS = total_points;
    std::vector<pthread_t> threads(num_threads);
    inside_circle = 0;
    // Create threads, passing each one its index (used for seeding)
    for (int i = 0; i < num_threads; ++i) {
        pthread_create(&threads[i], nullptr, calculate_pi, (void*)(intptr_t)i);
    }
    // Join threads
    for (int i = 0; i < num_threads; ++i) {
        pthread_join(threads[i], nullptr);
    }
    // Estimate Pi from the fraction of sampled points inside the circle;
    // divide by the number of points the threads actually generated
    long long int sampled_points = (NUM_POINTS / NUM_THREADS) * (long long int)NUM_THREADS;
    double pi = 4.0 * inside_circle / sampled_points;
    printf("Estimated Pi: %.20f\n", pi);
    return 0;
}
```
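A note on the seeding choice above: each worker must have its own random-number state, otherwise every thread samples the identical point sequence. As a sketch of an alternative (not the submitted version), the same Monte Carlo estimate can be written with C++ `<random>`, giving every worker an independently seeded engine; `std::thread` and the base seed 12345 are used here only to keep the illustration short, while the assignment code above uses Pthreads.
```cpp=
#include <atomic>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

std::atomic<long long> inside(0);

// Each worker owns its own engine, seeded with its thread index,
// so the threads sample different points (hypothetical base seed 12345).
void worker(int id, long long samples) {
    std::mt19937 gen(12345u + (unsigned)id);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    long long local = 0;
    for (long long i = 0; i < samples; ++i) {
        double x = dist(gen), y = dist(gen);
        if (x * x + y * y <= 1.0)
            ++local;
    }
    inside += local;
}

int main() {
    const int numThreads = 4;
    const long long totalPoints = 1000000;
    const long long perThread = totalPoints / numThreads;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back(worker, t, perThread);
    for (auto& th : threads)
        th.join();
    // Divide by the number of points actually generated
    std::cout << "Estimated Pi: " << 4.0 * inside / (perThread * numThreads) << "\n";
    return 0;
}
```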
# Part 2
## Parallel Fractal Generation Using std::thread
## Q1 Speed Up Graph
+ View1

+ View2

No, the speedup does not increase linearly: it drops at 3 threads and shows only a small improvement from 4 to 5 threads.
## Q2 Timing Code
```cpp=
// Inside each worker thread: time only the Mandelbrot computation
double start = CycleTimer::currentSeconds();
int startRow = args->threadId * args->height / args->numThreads;
int numRows = args->height / args->numThreads;
mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height,
                 startRow, numRows, args->maxIterations, args->output);
double end = CycleTimer::currentSeconds();
// Accumulate this thread's elapsed time into a per-thread buffer
runtimes[args->threadId] += end - start;
```
The runtimes below are taken from the run, out of 5 repetitions, that shows the greatest difference between threads.
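For reference, the shared `runtimes` buffer used above is assumed to be a global array indexed by thread id and reset before each run; only `runtimes[args->threadId]` appears in the actual code, so the buffer size, the repetition loop, and the printing below are my assumptions, with `std::chrono` and a dummy workload standing in for `CycleTimer` and `mandelbrotSerial`.
```cpp=
#include <cstdio>
#include <chrono>
#include <thread>
#include <vector>

// Assumed shape of the shared timing buffer: one slot per worker thread.
static double runtimes[32];

// Dummy workload standing in for the mandelbrotSerial() call timed above.
static void timedWorker(int threadId) {
    auto start = std::chrono::steady_clock::now();
    volatile double sink = 0.0;
    for (long i = 0; i < 2000000L * (threadId + 1); ++i) sink = sink + i * 0.5;
    auto end = std::chrono::steady_clock::now();
    // Same accumulation pattern as runtimes[args->threadId] += end - start;
    runtimes[threadId] += std::chrono::duration<double>(end - start).count();
}

int main() {
    const int numThreads = 4;
    const int repetitions = 5;
    for (int rep = 0; rep < repetitions; ++rep) {
        for (int t = 0; t < numThreads; ++t) runtimes[t] = 0.0;  // reset per run
        std::vector<std::thread> workers;
        for (int t = 0; t < numThreads; ++t) workers.emplace_back(timedWorker, t);
        for (auto& w : workers) w.join();
        // Print each thread's time and this run's spread; the run with the
        // largest spread is the one quoted below.
        double lo = runtimes[0], hi = runtimes[0];
        for (int t = 0; t < numThreads; ++t) {
            printf("Thread %d runtime: %.5f\n", t, runtimes[t]);
            lo = runtimes[t] < lo ? runtimes[t] : lo;
            hi = runtimes[t] > hi ? runtimes[t] : hi;
        }
        printf("run %d spread: %.5f\n\n", rep, hi - lo);
    }
    return 0;
}
```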
### View 1
+ 2 threads
+ Thread 0 runtime: 0.13077f
+ Thread 1 runtime: 0.13205f
+ 3 threads
+ Thread 0 runtime: 0.05363f
+ Thread 1 runtime: 0.16386f
+ Thread 2 runtime: 0.05346f
+ 4 threads
+ Thread 0 runtime: 0.02505f
+ Thread 1 runtime: 0.10616f
+ Thread 2 runtime: 0.10805f
+ Thread 3 runtime: 0.02528f
+ 5 threads
+ Thread 0 runtime: 0.01290f
+ Thread 1 runtime: 0.07208f
+ Thread 2 runtime: 0.11087f
+ Thread 3 runtime: 0.07184f
+ Thread 4 runtime: 0.01125f
+ 6 threads
+ Thread 0 runtime: 0.00812f
+ Thread 1 runtime: 0.04623f
+ Thread 2 runtime: 0.08540f
+ Thread 3 runtime: 0.08717f
+ Thread 4 runtime: 0.04660f
+ Thread 5 runtime: 0.00692f
### View 2
+ 2 threads
+ Thread 0 runtime: 0.08978f
+ Thread 1 runtime: 0.06078f
+ 3 threads
+ Thread 0 runtime: 0.07056f
+ Thread 1 runtime: 0.04179f
+ Thread 2 runtime: 0.03925f
+ 4 threads
+ Thread 0 runtime: 0.06249f
+ Thread 1 runtime: 0.03412f
+ Thread 2 runtime: 0.03173f
+ Thread 3 runtime: 0.03365f
+ 5 threads
+ Thread 0 runtime: 0.06268f
+ Thread 1 runtime: 0.03144f
+ Thread 2 runtime: 0.03413f
+ Thread 3 runtime: 0.02899f
+ Thread 4 runtime: 0.02921f
+ 6 threads
+ Thread 0 runtime: 0.05945f
+ Thread 1 runtime: 0.02334f
+ Thread 2 runtime: 0.03002f
+ Thread 3 runtime: 0.02887f
+ Thread 4 runtime: 0.02011f
+ Thread 5 runtime: 0.02593f
> The statistics above show that the load is not balanced across threads. In View 1, the middle threads tend to take considerably more time, while in View 2, thread 0 tends to be the most time-consuming one.
## Q3 Speed Up
### Mandel Calculation
From the `mandel()` function below, we can see that returning a larger value requires more iterations, so the bright (white) regions of the image are exactly the pixels that cost the most to compute.
```cpp=
static inline int mandel(float c_re, float c_im, int count)
{
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < count; ++i)
    {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}
```
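To make the cost asymmetry concrete, here is a tiny self-contained check (the two sample coordinates and the iteration limit of 256 are just illustrative): a point inside the set runs the full `count` iterations before returning, while a point far outside escapes on the first test.
```cpp=
#include <cstdio>

// Same mandel() as above, reproduced so this snippet compiles on its own.
static inline int mandel(float c_re, float c_im, int count)
{
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < count; ++i)
    {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}

int main()
{
    // (0, 0) is inside the set: the loop never breaks, so it costs all 256 iterations.
    printf("inside  (0, 0): %d iterations\n", mandel(0.f, 0.f, 256));
    // (2, 2) is far outside: |z|^2 > 4 on the first check, so it costs 0 iterations.
    printf("outside (2, 2): %d iterations\n", mandel(2.f, 2.f, 256));
    return 0;
}
```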
Going back to the observation from the Q2 conclusion, this means that when the view contains a wide white region, the blocky (contiguous-rows) partitioning leads to load imbalance.

Therefore, a better way to partition the task is to interleave the rows across threads: with m threads, thread 0 handles rows 0, m, 2m, ..., thread 1 handles rows 1, m+1, 2m+1, ..., and so on, so every thread receives rows from both cheap and expensive parts of the image.
```cpp=
// Interleaved assignment: thread t renders rows t, t + numThreads, t + 2*numThreads, ...
for (int r = args->threadId; r < args->height; r += args->numThreads)
{
    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height,
                     r, 1, args->maxIterations, args->output);
}
```
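To show the scheme on its own, below is a small self-contained toy version of this interleaved (round-robin) row assignment using `std::thread`. The per-row Mandelbrot work is replaced by a dummy loop that is deliberately heavier in the middle rows (mimicking View 1's white band), since `mandelbrotSerial` and the real argument struct come from the handout and are not reproduced here.
```cpp=
#include <cstdio>
#include <thread>
#include <vector>

// Dummy stand-in for one row of Mandelbrot work; rows in the middle of the
// image are made more expensive to mimic the wide white band in View 1.
static void rowWork(int row, int height) {
    long iters = (row > height / 4 && row < 3 * height / 4) ? 100000 : 1000;
    volatile double acc = 0.0;
    for (long i = 0; i < iters; ++i) acc = acc + i * 0.5;
}

// Interleaved assignment: thread t handles rows t, t + numThreads, t + 2*numThreads, ...
static void worker(int threadId, int numThreads, int height) {
    for (int r = threadId; r < height; r += numThreads)
        rowWork(r, height);
}

int main() {
    const int numThreads = 4;
    const int height = 1200;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back(worker, t, numThreads, height);
    for (auto& t : threads)
        t.join();
    printf("done: %d rows split round-robin across %d threads\n", height, numThreads);
    return 0;
}
```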
### View 1
+ 2 threads
+ Thread 0 runtime: 0.13188f
+ Thread 1 runtime: 0.13114f
+ 3 threads
+ Thread 0 runtime: 0.08857f
+ Thread 1 runtime: 0.08805f
+ Thread 2 runtime: 0.08974f
+ 4 threads
+ Thread 0 runtime: 0.06670f
+ Thread 1 runtime: 0.06544f
+ Thread 2 runtime: 0.06665f
+ Thread 3 runtime: 0.06601f
+ 5 threads
+ Thread 0 runtime: 0.05335f
+ Thread 1 runtime: 0.05271f
+ Thread 2 runtime: 0.05373f
+ Thread 3 runtime: 0.05210f
+ Thread 4 runtime: 0.05311f
+ 6 threads
+ Thread 0 runtime: 0.05804f
+ Thread 1 runtime: 0.05092f
+ Thread 2 runtime: 0.05710f
+ Thread 3 runtime: 0.04825f
+ Thread 4 runtime: 0.05424f
+ Thread 5 runtime: 0.04939f
### View 2
+ 2 threads
+ Thread 0 runtime: 0.07520f
+ Thread 1 runtime: 0.07390f
+ 3 threads
+ Thread 0 runtime: 0.05306f
+ Thread 1 runtime: 0.05212f
+ Thread 2 runtime: 0.05154f
+ 4 threads
+ Thread 0 runtime: 0.03687f
+ Thread 1 runtime: 0.03660f
+ Thread 2 runtime: 0.03895f
+ Thread 3 runtime: 0.03863f
+ 5 threads
+ Thread 0 runtime: 0.04348f
+ Thread 1 runtime: 0.03533f
+ Thread 2 runtime: 0.03325f
+ Thread 3 runtime: 0.03755f
+ Thread 4 runtime: 0.03516f
+ 6 threads
+ Thread 0 runtime: 0.03631f
+ Thread 1 runtime: 0.03483f
+ Thread 2 runtime: 0.03286f
+ Thread 3 runtime: 0.02986f
+ Thread 4 runtime: 0.03376f
+ Thread 5 runtime: 0.02954f
## Q3 Speed Up Graph
+ View1

+ View2

> From the two graphs above, we can conclude that in my implementation the speedup increases roughly linearly as the number of threads goes from 2 to 6. The reason is that the task is now partitioned evenly, keeping every thread's load as balanced as possible. [name=巫廷翰]
> The statistics also show that each thread takes a similar amount of time to finish its work, which confirms the load balancing. [name=巫廷翰]
## Q4 Is the performance noticeably better than with 6 threads?
> No. Looking at the results below, the speed does not improve noticeably beyond 6 threads.
> In the provided environment, the server offers 6 hardware threads on 6 cores. If the program launches more threads than that, the extra threads only add context-switch overhead: they are executed "concurrently" rather than truly in parallel. [name=巫廷翰]
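As a quick sanity check of that claim (not part of the assignment code), `std::thread::hardware_concurrency()` reports how many hardware threads the machine supports; launching more software threads than this number mainly adds scheduling overhead.
```cpp=
#include <iostream>
#include <thread>

int main() {
    // Reports the number of hardware threads (6 on the course server described above);
    // the standard allows it to return 0 if the value cannot be determined.
    unsigned int n = std::thread::hardware_concurrency();
    std::cout << "hardware threads: " << n << "\n";
    return 0;
}
```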
+ View1

+ View2
