# Parallel Programming Assignment 2

#### Q1
> Extend your code to use 2, 3, 4 threads, partitioning the image generation work accordingly (threads should get blocks of the image). In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1.

```C++=
int partHeight = args->height / args->numThreads;
int my_first_i = args->threadId * partHeight;
double startTime = CycleTimer::currentSeconds();
mandelbrotSerial(
    args->x0, args->y0, args->x1, args->y1,
    args->width, args->height,
    my_first_i, partHeight,
    args->maxIterations, args->output);
```

View 1
![](https://i.imgur.com/BPV1XXt.jpg)
![](https://i.imgur.com/diElyrf.jpg)

View 2
![](https://i.imgur.com/T9AY7fP.jpg)
![](https://i.imgur.com/S5uIvVI.jpg)

Observations:
1. View 1: speedup does not keep rising with the thread count; 3 threads is actually slower than 2 threads, dropping from 1.94x to 1.63x.
2. View 2: speedup does increase as the number of threads grows.

> Is speedup linear in the number of threads used? In your write-up hypothesize why this is (or is not) the case. (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread data-point.)
**Inspecting the code**

Every thread calls mandelbrotSerial(), and mandelbrotSerial() in turn calls mandel() inside its for loop. These two functions are the most likely bottleneck keeping the threads from reaching ideal speedup, so each is examined below.

```C++=
void mandelbrotSerial(
    float x0, float y0, float x1, float y1,
    int width, int height,
    int startRow, int totalRows,
    int maxIterations,
    int output[])
{
    float dx = (x1 - x0) / width;
    float dy = (y1 - y0) / height;

    int endRow = startRow + totalRows;

    for (int j = startRow; j < endRow; j++) {
        for (int i = 0; i < width; ++i) {
            float x = x0 + i * dx;
            float y = y0 + j * dy;
            int index = (j * width + i);
            output[index] = mandel(x, y, maxIterations);
        }
    }
}
```

Line 13 shows that mandelbrotSerial() varies its loop count according to startRow and endRow. Since every thread is handed an equally sized slice of rows, the speedup should in theory scale with the number of threads.

```C++=
static inline int mandel(float c_re, float c_im, int count) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < count; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}
```

View 1
![](https://i.imgur.com/rpRPC3u.jpg)
View 2
![](https://i.imgur.com/py2AF8Z.jpg)

VIEW 1 clearly has far more light-colored pixels than View 2, and they are more concentrated.

mandel() shows how each pixel gets its value; combined with the view images, this leads to the following points:
* In an 8-bit grayscale image, 0 is black and the maximum value 255 is white; the larger the value, the lighter the color.
* A pixel's value is determined by the loop iteration count i.
* i can be any value from 0 to 255 (since count is 256); its magnitude decides how light or dark the pixel is.
* In short, the lighter a pixel (the larger its value), the more loop iterations were executed for it.

#### Q2
> To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart(). How do your measurements explain the speedup graph you previously created?
Using the methods provided by CycleTimer.h to time each thread:

![](https://i.imgur.com/FjwevqI.jpg)
![](https://i.imgur.com/EbHqY70.jpg)
![](https://i.imgur.com/2jz4VCS.jpg)

The results for 2, 3, and 4 threads show that the block in the middle of the image is the bottleneck: for VIEW 1, the x, y values fed in for the middle rows take up most of the time, compared with the top and bottom sections.

**Examining View 1**

View 1 with 3 threads
![](https://i.imgur.com/jKnVFQn.jpg)

Summary: these three images show the regions processed by Thread0, Thread0+Thread1, and Thread0+Thread1+Thread2 respectively.
* With 3 threads, the middle block assigned to the second thread contains a very large number of light pixels, so that one thread drags down the overall speed.
* From the images and the analysis above: **the more white area, the more time the computation takes**.

#### Q3
> In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained.

The CycleTimer measurements in Q2 showed that the thread handling the middle of the image runs much longer, i.e. the workload is unbalanced.

Original approach, e.g. 3 threads computing 1200 rows:
Rows per thread: 1200 / 3 = 400, so 400 rows per thread.
1. thread 0 computes rows 0–399
2. thread 1 computes rows 400–799
3. thread 2 computes rows 800–1199

Modified approach:
Rows per thread are still 1200 / 3 = 400, but tasks are assigned with the thread count as the stride, e.g. with 3 threads:
1. thread 0 computes rows 0, 3, 6, 9, …, 1197
2. thread 1 computes rows 1, 4, 7, 10, …, 1198
3. thread 2 computes rows 2, 5, 8, 11, …, 1199

In addition, the row count may not divide evenly by the thread count (e.g. 7 threads), so a final if at the end distributes the leftover rows.

```C++=
int partHeight = args->height / args->numThreads;
int remains = args->height % args->numThreads;
// double startTime = CycleTimer::currentSeconds();
for (int t = 0; t < partHeight; t++) {
    mandelbrotSerial(
        args->x0, args->y0, args->x1, args->y1,
        args->width, args->height,
        t * (args->numThreads) + args->threadId, // start from which row
        1,
        args->maxIterations, args->output);
}
if ((args->threadId + 1 <= remains) && (remains > 0)) {
    mandelbrotSerial(
        args->x0, args->y0, args->x1, args->y1,
        args->width, args->height,
        partHeight * args->numThreads + args->threadId, // start from which row
        1,
        args->maxIterations, args->output);
}
```

View 1
![](https://i.imgur.com/HS2zR2J.jpg)
![](https://i.imgur.com/Q9fTtGi.jpg)
View 2
![](https://i.imgur.com/YopmkDx.jpg)
![](https://i.imgur.com/13FOZaD.jpg)

#### Q4
> Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads?
> Why or why not? (Notice that the workstation server provides 4 cores / 4 threads.)

View 1
![](https://i.imgur.com/OecbW13.jpg)
View 2
![](https://i.imgur.com/p2IendM.jpg)

Running with 8 threads, the speedup drops for both views: view 1 falls from 3.78x to 3.57x and view 2 from 3.78x to 3.67x.
* Because the workstation server provides 4 cores / 4 threads, running 8 threads only introduces extra context-switch overhead. The number of threads to spawn should therefore depend on the number of CPU cores; if the workstation had 8 cores, 8 threads should in theory beat 4.