Try   HackMD

Parallel Programming @ NYCU - HW2

0716221 余忠旻

In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1.

(You may also wish to produce a graph for VIEW 2 to help you come up with a good answer.)

我將 image 由上到下均分給各個 thread 計算執行,如下實作:

void workerThreadStart(WorkerArgs *const args) { int workload = args->height / args->numThreads; int start = args->threadId * workload; mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, start, workload, args->maxIterations, args->output); }

下圖是不同 number of threads 所產生的相對應的 speedup 比較圖

  • VIEW 1
$ ./mandelbrot -t 2
[mandelbrot serial]:            [474.001] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [239.313] ms
Wrote image file mandelbrot-thread.ppm
                                (1.98x speedup from 2 threads)
$ ./mandelbrot -t 3
[mandelbrot serial]:            [475.400] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [291.510] ms
Wrote image file mandelbrot-thread.ppm
                                (1.63x speedup from 3 threads)
$ ./mandelbrot -t 4
[mandelbrot serial]:            [464.144] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [193.675] ms
Wrote image file mandelbrot-thread.ppm
                                (2.40x speedup from 4 threads)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • VIEW 2
$ ./mandelbrot -t 2 --view 2
[mandelbrot serial]:            [296.608] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [174.557] ms
Wrote image file mandelbrot-thread.ppm
                                (1.70x speedup from 2 threads)
$ ./mandelbrot -t 3 --view 2
[mandelbrot serial]:            [290.624] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [131.804] ms
Wrote image file mandelbrot-thread.ppm
                                (2.20x speedup from 3 threads)
$ ./mandelbrot -t 4 --view 2
[mandelbrot serial]:            [288.252] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [110.473] ms
Wrote image file mandelbrot-thread.ppm
                                (2.61x speedup from 4 threads)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →


Q1: Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case?

Hint: take a careful look at the three-thread data-point.

  • 在VIEW 1中,speedup隨著number of threads上升並沒有線性成長,反而在number of threads=3下降
  • 在VIEW 2中,speedup隨著number of threads上升有線性成長

workerThreadStart() 中,每一個 thread 都會執行 mandelbrotSerial()
mandelbrotSerial() 會在for迴圈重複呼叫到 mandel()

看如下 mandel() 函式中,可以發現決定for迴圈的次數是
if (z_re * z_re + z_im * z_im > 4.f) break; 這個條件

static inline int mandel(float c_re, float c_im, int count) { float z_re = c_re, z_im = c_im; int i; for (i = 0; i < count; ++i) { if (z_re * z_re + z_im * z_im > 4.f) break; float new_re = z_re * z_re - z_im * z_im; float new_im = 2.f * z_re * z_im; z_re = c_re + new_re; z_im = c_im + new_im; } return i; }

在 8 位元灰階圖片中,0 表示黑色,最大值 255 則表示白色,數值越大顏色越淺
所以我發現,當 count=256 時, i 可能為 0~255 中任一數,其中數值大小決定該像素顏色深淺

再比較VIEW 1和 VIEW 2:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

可以明顯看出 VIEW 1 的淺色像素遠比 View2 來得多且集中(集中在中間)
接著我們分別來看3個thread計算執行的內容(如下圖)
thread1 所計算的image內容淺色像素多且集中
因此我認為是 thread1 拖垮整體執行速度,導致 speedup 下降

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Hypothesize: 每個 thread 的工作量與其負責區域所含有的淺色像素數量有關。
         當 number of threads=3 時,thread1 會被分配到淺色像素多且集中的區域,
         導致整體執行速度不佳。


Q2: How do your measurements explain the speedup graph you previously created?

  • number of thread=2

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

$ ./mandelbrot -t 2
[mandelbrot serial]:            [462.669] ms
Wrote image file mandelbrot-serial.ppm
Thread 0: [262.9549] ms
Thread 1: [263.1731] ms
[mandelbrot thread]:            [263.524] ms
Wrote image file mandelbrot-thread.ppm
                                (1.76x speedup from 2 threads)
  • number of thread=3

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

$ ./mandelbrot -t 3
[mandelbrot serial]:            [461.782] ms
Wrote image file mandelbrot-serial.ppm
Thread 0: [94.1784] ms
Thread 2: [94.9320] ms
Thread 1: [286.3419] ms
[mandelbrot thread]:            [286.568] ms
Wrote image file mandelbrot-thread.ppm
                                (1.61x speedup from 3 threads)
  • number of thread=4

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

$ ./mandelbrot -t 4
[mandelbrot serial]:            [474.807] ms
Wrote image file mandelbrot-serial.ppm
Thread 3: [45.8921] ms
Thread 0: [66.6695] ms
Thread 1: [194.6871] ms
Thread 2: [206.5401] ms
[mandelbrot thread]:            [219.570] ms
Wrote image file mandelbrot-thread.ppm
                                (2.16x speedup from 4 threads)

在 number of thread=3 可以看出
thread1 執行時間也是比 thread0 和 thread2 久
thread1 所計算的image內容淺色像素多且集中
因此 thread1 拖垮整體執行速度,導致 speedup 下降
在 number of threads=4 也可以看出
thread1 和 thread2是主要的執行速度瓶頸


接著看 VIEW 2 , number of threads=3

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

$ ./mandelbrot -t 3 --view 2
[mandelbrot serial]:            [297.046] ms
Wrote image file mandelbrot-serial.ppm
Thread 2: [82.0762] ms
Thread 1: [87.8727] ms
Thread 0: [134.4979] ms
[mandelbrot thread]:            [134.589] ms
Wrote image file mandelbrot-thread.ppm
                                (2.21x speedup from 3 threads)

相對來說 VIEW 2的 thread0 執行時間也是比 thread1 和 thread2 久
(VIEW 2 淺色像素主要在最上面)
但因為淺色像素在整體是相對分散許多
因此執行速度瓶頸雖是 thread0,但不至於拖垮整體執行速度,還是可以看出線性成長的 speedup

A: 在 VIEW 1,speedup隨著number of threads上升不會有線性成長,
  是因每個 thread 工作量分配極度不均的關係。


Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained.

用一開始的方法,由上到下均分成n個區塊來分配給n個threads計算執行
但在 VIEW 1 由於淺色像素集中於同一區域導致工作量分配不均
因此我將 workerThreadStart()修改成如下:

void workerThreadStart(WorkerArgs *const args) { for (int i = args->threadId; i < args->height; i += args->numThreads){ mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, i, 1, args->maxIterations, args->output); } }

我將 workerThreadStart()每一個thread執行mandelbrotSerial()時,傳入的 totalRows=1
我以 number of threads=3為例:


這樣能將原本淺色像素集中區域,3個thread都會被分到部分計算執行,
如此能將工作量平均分配

  • Final 4-thread speedup obtained
    執行結果如下:
$ ./mandelbrot -t 4
[mandelbrot serial]:            [462.474] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [121.549] ms
Wrote image file mandelbrot-thread.ppm
                                (3.80x speedup from 4 threads)
$ ./mandelbrot -t 4 --view 2
[mandelbrot serial]:            [288.821] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [76.039] ms
Wrote image file mandelbrot-thread.ppm
                                (3.80x speedup from 4 threads)

View 1: 3.80x speedup from 4 threads
View 2: 3.80x speedup from 4 threads


Q4: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not?

(Notice that the workstation server provides 4 cores 4 threads.)

當 number of threads=8 執行如下:

$ ./mandelbrot -t 8
[mandelbrot serial]:            [461.297] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [123.937] ms
Wrote image file mandelbrot-thread.ppm
                                (3.72x speedup from 8 threads)
$ ./mandelbrot -t 8 --view 2
[mandelbrot serial]:            [289.132] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [76.669] ms
Wrote image file mandelbrot-thread.ppm
                                (3.77x speedup from 8 threads)

可以發現 number of threads=8 會比 number of threads=4 執行還要慢一些

我認為當number of threads=8時,speedup沒有上升反而下降,
是因為工作站提供的是 4 核心 4 執行序的 Server
當使用 8 個 threads 時, core 被分配到的threads就可能會大於1
這時執行會多了 context switch 的 overhead
這也造成了整體執行時間比 number of threads=4 還要久的原因。


Good Work!

TA