# Programming Assignment II: Multi-thread Programming
## Part 2: Parallel Fractal Generation Using std::thread
```
// Block decomposition: each thread computes one contiguous band of rows.
void workerThreadStart(WorkerArgs *const args)
{
    // Divide the image height evenly among the threads.
    int rowsPerThread = args->height / args->numThreads;
    int startRow = args->threadId * rowsPerThread;

    // The last thread picks up any leftover rows when the height is not
    // divisible by numThreads.
    int numRows = (args->threadId == args->numThreads - 1)
                      ? (args->height - startRow)
                      : rowsPerThread;

    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height,
                     startRow, numRows,
                     args->maxIterations,
                     args->output);
}
```
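For completeness, this worker is launched once per thread with `std::thread`. The sketch below shows the general spawn-and-join pattern; it assumes a spawning function shaped like the starter code's `mandelbrotThread` and reuses the `WorkerArgs` struct from the code above, so the exact signature and the `MAX_THREADS` limit are assumptions, not the assignment's definitive code.
```
#include <thread>

// Sketch of the spawn/join pattern used to launch the workers above.
// WorkerArgs and workerThreadStart come from the starter code; the
// MAX_THREADS limit and the exact signature here are assumptions.
void mandelbrotThread(int numThreads,
                      float x0, float y0, float x1, float y1,
                      int width, int height, int maxIterations,
                      int output[])
{
    constexpr int MAX_THREADS = 32;
    std::thread workers[MAX_THREADS];
    WorkerArgs args[MAX_THREADS];

    // Fill in per-thread arguments; only threadId differs between threads.
    for (int i = 0; i < numThreads; i++) {
        args[i].x0 = x0;  args[i].y0 = y0;
        args[i].x1 = x1;  args[i].y1 = y1;
        args[i].width = width;
        args[i].height = height;
        args[i].maxIterations = maxIterations;
        args[i].numThreads = numThreads;
        args[i].output = output;
        args[i].threadId = i;
    }

    // Launch one std::thread per worker, then wait for all of them.
    for (int i = 0; i < numThreads; i++)
        workers[i] = std::thread(workerThreadStart, &args[i]);
    for (int i = 0; i < numThreads; i++)
        workers[i].join();
}
```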
> Q1: In your write-up, plot a graph of speedup (compared to the reference sequential implementation) as a function of the number of threads used for VIEW 1. Is the speedup linear to the number of threads used? Hypothesize why this is (or is not) the case. (You may want to plot a graph for VIEW 2 for further insights. Hint: take a careful look at the 3-thread data point.)

*(speedup vs. number of threads for VIEW 1 and VIEW 2)*

In the graph for VIEW 1, the speedup is clearly **not linear**, and it drops noticeably at 3 threads. In VIEW 2, the speedup is much closer to linear.
I think the major reason for the drop in VIEW 1 is **load imbalance** between threads: with the block decomposition, each thread gets a contiguous band of rows, and rows near the middle of VIEW 1 require far more iterations per pixel than rows near the top and bottom. With 3 threads, the middle thread receives most of that expensive region and becomes the bottleneck.
> Q2: How do the measurements explain the speedup graph you previously plotted?

Different threads are responsible for different parts of the image, and as the timing output shows, **their execution times differ significantly**; this is particularly evident with 3 threads.
The threads that finish early must wait for the slowest thread to complete, so the overall runtime is determined by the most heavily loaded thread, which limits the speedup.
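These per-thread measurements can be collected by timing the body of `workerThreadStart`. Below is a minimal sketch using `std::chrono`; this is an illustrative version, not necessarily the exact timing code behind the numbers above, and the starter code's timing utilities would work just as well.
```
#include <chrono>
#include <cstdio>

void workerThreadStart(WorkerArgs *const args)
{
    auto start = std::chrono::steady_clock::now();

    // ... per-thread row computation as in the code blocks above ...

    auto end = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    printf("[thread %d] %.3f ms\n", (int)args->threadId, ms);
}
```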

> Q3: In your write-up, describe your parallelization approach and report the final speedup achieved with 4 threads.
```
// Interleaved (cyclic) decomposition: thread i computes rows
// i, i + numThreads, i + 2*numThreads, ... so expensive and cheap rows
// are spread evenly across all threads.
void workerThreadStart(WorkerArgs *const args)
{
    for (unsigned int j = args->threadId; j < args->height; j += args->numThreads) {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         j, 1,               // compute a single row starting at row j
                         args->maxIterations,
                         args->output);
    }
}
```
*(per-thread timing output and speedup for VIEW 1 and VIEW 2 with the interleaved decomposition)*

This method assigns each thread every `numThreads`-th row of the image, which results in better load balancing and a more even distribution of work.
For example, if `numThreads = 4`, thread 0 handles rows 0, 4, 8, ...; thread 1 handles rows 1, 5, 9, ...; and so on.
As the output shows, the execution time of each thread is nearly uniform, which indicates that the workload is well balanced. Both views achieve around a 3.8x speedup with 4 threads, a significant improvement over the 2.5x speedup obtained with the block decomposition in Q1.
> Q4: Now run your improved code with 12 threads. Is the performance noticeably better than with 6 threads? Why or why not? (Notice that the workstation server provides 6 threads on 6 cores.)

No. As shown in the output, the performance with 12 threads is actually worse than with 6 threads. The workstation provides 6 hardware threads on 6 cores, so with 12 threads multiple threads must share the same physical core. This adds context-switching and scheduling overhead (the OS repeatedly swaps threads in and out of the processor) without providing any additional execution resources.
For CPU-bound work like this, it is generally most efficient to match the number of threads to the number of available cores.
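One practical way to avoid oversubscribing the machine is to query the hardware at runtime instead of hard-coding a thread count. The sketch below is an illustrative addition (not part of the assignment code) using `std::thread::hardware_concurrency`; the helper name and fallback value are my own choices.
```
#include <algorithm>
#include <thread>

// Pick a thread count that does not exceed the number of hardware threads.
// hardware_concurrency() may return 0 when the value is unknown, so fall
// back to an assumed default in that case.
int chooseNumThreads(int requested)
{
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0)
        hw = 4;  // assumed fallback when the hardware count is unavailable
    return std::min(requested, static_cast<int>(hw));
}
```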