# PP HW2 ## Q1 Produce a graph of speedup with view1 and view2 :::success 使用tiv -w 100 -h 30 mandelbrot-thread.ppm 顯示出圖片 ::: ![image.png](https://hackmd.io/_uploads/Bywj_SY7T.png) ![image.png](https://hackmd.io/_uploads/HJ2XQIFmT.png) ![image.png](https://hackmd.io/_uploads/SycO9rY7a.png) ![image.png](https://hackmd.io/_uploads/HJIbmIY7a.png) --- ### Q1 : Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case? 從上方的折線圖可以觀察到,在View1中,speedup並沒有呈現linear成長,而是在thread為3時,他的speedup反而比thread為2時,還要低。 在View2中,他就呈現較為linear的走勢。 ```javascript= void workerThreadStart(WorkerArgs *const args) { // TODO FOR PP STUDENTS: Implement the body of the worker // thread here. Each thread could make a call to mandelbrotSerial() // to compute a part of the output image. For example, in a // program that uses two threads, thread 0 could compute the top // half of the image and thread 1 could compute the bottom half. // Of course, you can copy mandelbrotSerial() to this file and // modify it to pursue a better performance. double minSerial = 1e30; int totalRows = args->height / args->numThreads; int remainder = args->height % args->numThreads; int startRow = totalRows * args->threadId; if ((args->threadId + 1) == args->numThreads) { //the last thread need to do remain part totalRows += remainder; } printf("Hello world from thread %d ", args->threadId); } ``` 這上方的程式碼為我做平行化的部分,我將不同的thread分配到不同的區域實作,而最後一個thread可能會做一些無法整除的剩餘部分,而主要的分法為利用高度切割,分配由上到下的圖片區段(依照thread數量切等分)給thread。 並且其實我觀察到在View1中,thread為2時,他的speedup很接近2x,且View1他的圖片為上下對稱的圖片,所以這邊造成不能線性加速的原因,極有可能是不同thread處理事情的快慢也不同,儘管我分配一樣大小的區塊給予他們平行化處理,而其中原因可能是跟分配到的圖片像素有關。 ### Q2 : How do your measurements explain the speedup graph you previously created? ```javascript= double minSerial = 1e30; int totalRows = args->height / args->numThreads; int remainder = args->height % args->numThreads; int startRow = totalRows * args->threadId; if ((args->threadId + 1) == args->numThreads) { //the last thread need to do remain part totalRows += remainder; } double startTime = CycleTimer::currentSeconds(); mandelbrotSerial(args->x0,args->y0,args->x1,args->y1,args->width,args->height,startRow, totalRows,args->maxIterations,args->output); double endTime = CycleTimer::currentSeconds(); minSerial = std::min(minSerial, endTime - startTime); printf("Hello world from thread %d ", args->threadId); printf("[mandelbrot serial]:\t\t[%.3f] ms\n", minSerial * 1000); ``` :::warning 下方圖為View 1 - Threads = 3 的結果 ::: ![image.png](https://hackmd.io/_uploads/B1hbAItma.png) :::warning 下方圖為View 2 - Threads = 3 的結果 ::: ![image.png](https://hackmd.io/_uploads/HJNqWDFma.png) 我將前段的程式碼加上測量時間,並且觀察結論可以發現不同thread在做相同大小區段的事情,所花費的時間真的差異很大,在View1結論中可以發現thread1(處理圖片中間段的thread)所花費的時間最多,在View2結論中可以發現thread0所花費的時間最多。 觀察到處理的時間跟所分配到的圖片有關時,我回去看了兩張圖,並且觀察到處理時間較長的thread都分配到圖片中出現較為多白色區域的片段(View 1的中間、View 2的上段),可以推斷白色區域較為花費時間。 觀察程式碼,我們可以發現在Q1中我所提供的程式碼裡,我會去呼叫==mandelbrotSerial()== 幫我們處理,而他會再去呼叫==mandel()== ,而在這裡就會回傳該點的pixel值(0~255),而在此function中,break會晚就代表pixel值越大,也代表他越白,所以可以發現出為何當thread分配到結果顯示為的白色區域時,所花費的時間比較長了。 ```javascript= static inline int mandel(float c_re, float c_im, int count) { float z_re = c_re, z_im = c_im; int i; for (i = 0; i < count; ++i) { if (z_re * z_re + z_im * z_im > 4.f) break; float new_re = z_re * z_re - z_im * z_im; float new_im = 2.f * z_re * z_im; z_re = c_re + new_re; z_im = c_im + new_im; } return i; } ``` ### Q3 : In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained. ```javascript= void workerThreadStart(WorkerArgs *const args) { for(unsigned int i=args->threadId;i<args->height;i+=args->numThreads){ mandelbrotSerial(args->x0,args->y0,args->x1,args->y1,args->width,args->height,i, 1,args->maxIterations,args->output); } } ``` 上方的程市碼是修改後的==workerThreadStart()== ,因為我們發現如果上中下游等比例劃分會有不公平的問題,所以我們直接讓每個thread都均攤整張的每個區域部分,以避免有thread處理較花費時間的部分。 ``` 以前的方法(負責的height) : thread0 -> (0 ~ x/3) thread1 -> (x/3 ~ 2x/3) thread2 -> (2x/3 ~ x)" 改進的方法(負責的height) : thread0 -> (0,3,6,9...) thread1 -> (1,4,7,10...) thread2 -> (2,5,8,11...)" ``` :::success 下方的圖為decomposition policy後thread = 4的結果,Speedup > 3.5x。 ::: ![image.png](https://hackmd.io/_uploads/S134uPKXa.png) :::info Decomposition policy graph of speedup with view1 and view2 ::: ![image.png](https://hackmd.io/_uploads/rk_s2DtXT.png) ![image.png](https://hackmd.io/_uploads/SJdn2vYQT.png) 上方兩張圖可以觀察到,thread的數量與speedup呈linear成長。 ### Q4 : Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? :::success 下方的圖為thread = 8的結果 ::: ![image.png](https://hackmd.io/_uploads/SkKFTDtm6.png) 觀察到thread = 8,反而沒有thread = 4時來的快。 ``` 311552055@pp1:~/311552055/HW2/HW2/part2$ nproc 4 ``` 上方的指令來看,可以發現此linux中只有4核心,所以很有可能每個核心都被分配到2個threads,所以可能會有context switch的overhead,這也是為什麼thread不是越多越好,硬體的不足會限制軟體的加速。