# <center>Parallel Programming NYCU Fall 2022 HW2</center>

<p class="text-right">310552035 張竣傑</p>

## **Q1**: Produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used? Why is this (or is it not) the case?

With two threads, thread 0 computes the top half of the image and thread 1 computes the bottom half. In general, I split the image height into contiguous blocks according to the number of threads, so different spatial regions of the image are computed by different threads. Here is how I implemented it:

```cpp
void workerThreadStart(WorkerArgs *const args) {
    int b = args->height / args->numThreads;  // rows per thread
    int s = args->threadId * b;               // this thread's starting row
    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height, s, b,
                     args->maxIterations, args->output);
}
```

* The results for View 1 with different numbers of threads:

![](https://i.imgur.com/tvH5EBf.png)
![](https://i.imgur.com/HiU9CEj.png)

* The results for View 2 with different numbers of threads:

![](https://i.imgur.com/yzP2iEY.png)
![](https://i.imgur.com/pZiavJp.png)

As you can see, the speedup for View 1 is not linear in the number of threads: it drops when three threads are used. On the other hand, the speedup for View 2 is roughly linear in the number of threads.

Analysis: I think the curve is not linear because the workload differs between threads. For every pixel, a thread runs the `mandel` function:

```cpp
static inline int mandel(float c_re, float c_im, int count) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < count; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}
```

When `i` equals 0, the corresponding pixel is black; when `i` equals 255 (the maximum iteration count), the pixel is white. Let's look at View 1 and View 2.
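To make the cost asymmetry concrete, the `mandel` routine can be compiled standalone: a point inside the set (for example the origin) runs all `count` iterations before returning, while a point far outside the set escapes on the very first iteration. This is a minimal sketch reusing the function exactly as listed above:

```cpp
#include <cassert>

// Same escape-time routine as in the assignment code. Its return value
// (the iteration count) is exactly the per-pixel cost, so the spread
// between "inside" and "outside" points is the source of the imbalance.
static inline int mandel(float c_re, float c_im, int count) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < count; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}
```

For instance, `mandel(0.f, 0.f, 256)` never escapes and pays the full 256 iterations, while `mandel(2.f, 2.f, 256)` breaks out immediately and returns 0, a 256× cost difference between two pixels of the same image.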
![](https://i.imgur.com/jhlnJ7s.png)

In View 1, many white pixels are concentrated in the middle of the image. In the 3-thread case, the upper block is done by thread 0, the middle block by thread 1, and the lower block by thread 2. Obviously, thread 1 is the bottleneck: threads 0 and 2 finish sooner than thread 1. I think that is the main reason for the drop.

In View 2, similarly, the upper part of the image costs more time, so in the 2-thread case the workload is also unbalanced and the speedup cannot reach a full 2× over the serial version. However, the imbalance is milder than in View 1, so the speedup suffers less.

## **Q2**: How do your measurements explain the speedup graph you previously created?

In each thread, I print the time that `mandelbrotSerial` costs:

```cpp
void workerThreadStart(WorkerArgs *const args) {
    double startTime = CycleTimer::currentSeconds();
    int b = args->height / args->numThreads;
    int s = args->threadId * b;
    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height, s, b,
                     args->maxIterations, args->output);
    double endTime = CycleTimer::currentSeconds();
    // currentSeconds() returns seconds, so scale to milliseconds
    printf("Thread %d: [%.3f] ms\n", args->threadId,
           (endTime - startTime) * 1000);
}
```

* View 1 with 2 threads: ![](https://i.imgur.com/SKPdfAf.png)
* View 1 with 3 threads: ![](https://i.imgur.com/9x2AjTT.png)
* View 1 with 4 threads: ![](https://i.imgur.com/QlAsurg.png)

With 3 or 4 threads, the middle parts cost more time (thread 1 costs the most in the 3-thread case; threads 1 and 2 cost the most in the 4-thread case). The workload is unbalanced across the threads.

* View 2 with 2 threads: ![](https://i.imgur.com/ZPbmSJ4.png)
* View 2 with 3 threads: ![](https://i.imgur.com/AnhcQA3.png)
* View 2 with 4 threads: ![](https://i.imgur.com/TnPAjsc.png)

Compared to View 1, the upper part is the bottleneck here, but the workload is more balanced, so the speedup graph is more linear.
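As a side note, the block split above computes `height / numThreads` with integer division, so when the height is not divisible by the thread count the last few rows are never rendered. A remainder-aware split can be sketched as follows (the `blockRange` helper is illustrative, not part of the starter code):

```cpp
#include <cassert>

// Sketch: contiguous block decomposition that distributes the remainder
// rows instead of dropping them. Threads with threadId < (height %
// numThreads) each take one extra row, so every row is covered.
void blockRange(int height, int numThreads, int threadId,
                int *startRow, int *numRows) {
    int base = height / numThreads;                // rows every thread gets
    int rem  = height % numThreads;                // leftover rows
    int extra = (threadId < rem) ? threadId : rem; // extra rows before us
    *startRow = threadId * base + extra;
    *numRows  = base + (threadId < rem ? 1 : 0);
}
```

For the 1200-row views in this assignment the height happens to divide evenly for 2, 3, and 4 threads, so the simpler version works, but the guard matters for odd thread counts.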
## **Q3**: Describe the approach to parallelization and report the final 4-thread speedup obtained.

To overcome the imbalance, I divide the image among the threads in a more balanced way: the rows are assigned round-robin, so each thread renders one row and the next row goes to the next thread. Interleaving the rows like this spreads the expensive regions across all threads.

* The more balanced workload with 4 threads in View 1: ![](https://i.imgur.com/q1pzi3V.png)
* In View 2: ![](https://i.imgur.com/S0sqSgG.png)

In both views, the speedup with 4 threads is about 3.79×.

## **Q4**: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? (Notice that the workstation server provides 4 cores 4 threads.)

* The result of using 8 threads in View 1: ![](https://i.imgur.com/svmt9tP.png)
* In View 2: ![](https://i.imgur.com/1itNdLg.png)

As you can see, the 8-thread case takes more time than the 4-thread case. I think the reason is that the workstation server provides 4 cores with 4 hardware threads. When I use 8 threads, more than one thread is assigned to each core, so extra time is wasted on context switches within a core.
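The interleaved assignment described in Q3 can be written as a worker loop of the form `for (int row = args->threadId; row < args->height; row += args->numThreads)`, calling `mandelbrotSerial` for one row at a time. As a standalone sanity check of the scheme (the `rowsPerThread` helper is my own, not part of the starter code), round-robin assignment splits the rows almost perfectly evenly, which is why the speedup stays near 3.79× regardless of where the expensive region sits:

```cpp
#include <cassert>
#include <vector>

// Count how many rows each thread renders under round-robin scheduling:
// row r is rendered by thread (r % numThreads).
std::vector<int> rowsPerThread(int height, int numThreads) {
    std::vector<int> counts(numThreads, 0);
    for (int r = 0; r < height; ++r)
        counts[r % numThreads]++;
    return counts;
}
```

The per-thread row counts differ by at most one, so as long as expensive rows are not all clustered with the same stride as the thread count, every thread sees a similar mix of cheap and expensive rows.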