Parallel Programming @ NYCU - HW2

`0716221 余忠旻`

In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1.

(You may also wish to produce a graph for VIEW 2 to help you come up with a good answer.)

我將 image 由上到下均分給各個 thread 計算執行，如下實作:






void workerThreadStart(WorkerArgs *const args)
{
    int workload = args->height / args->numThreads;
    int start = args->threadId * workload;
    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, start, workload, args->maxIterations, args->output);
}

下圖是不同 number of threads 所產生的相對應的 speedup 比較圖

VIEW 1

$ ./mandelbrot -t 2
[mandelbrot serial]:            [474.001] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [239.313] ms
Wrote image file mandelbrot-thread.ppm
                                (1.98x speedup from 2 threads)
$ ./mandelbrot -t 3
[mandelbrot serial]:            [475.400] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [291.510] ms
Wrote image file mandelbrot-thread.ppm
                                (1.63x speedup from 3 threads)
$ ./mandelbrot -t 4
[mandelbrot serial]:            [464.144] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [193.675] ms
Wrote image file mandelbrot-thread.ppm
                                (2.40x speedup from 4 threads)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

VIEW 2

$ ./mandelbrot -t 2 --view 2
[mandelbrot serial]:            [296.608] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [174.557] ms
Wrote image file mandelbrot-thread.ppm
                                (1.70x speedup from 2 threads)
$ ./mandelbrot -t 3 --view 2
[mandelbrot serial]:            [290.624] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [131.804] ms
Wrote image file mandelbrot-thread.ppm
                                (2.20x speedup from 3 threads)
$ ./mandelbrot -t 4 --view 2
[mandelbrot serial]:            [288.252] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [110.473] ms
Wrote image file mandelbrot-thread.ppm
                                (2.61x speedup from 4 threads)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Q1: Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case?

Hint: take a careful look at the three-thread data-point.

在VIEW 1中，speedup隨著number of threads上升並沒有線性成長，反而在number of threads=3下降
在VIEW 2中，speedup隨著number of threads上升有線性成長

在 workerThreadStart() 中，每一個 thread 都會執行 mandelbrotSerial()
而 mandelbrotSerial() 會在for迴圈重複呼叫到 mandel()

看如下 mandel() 函式中，可以發現決定for迴圈的次數是
if (z_re * z_re + z_im * z_im > 4.f) break; 這個條件


















static inline int mandel(float c_re, float c_im, int count)
{
  float z_re = c_re, z_im = c_im;
  int i;
  for (i = 0; i < count; ++i)
  {

    if (z_re * z_re + z_im * z_im > 4.f)
      break;

    float new_re = z_re * z_re - z_im * z_im;
    float new_im = 2.f * z_re * z_im;
    z_re = c_re + new_re;
    z_im = c_im + new_im;
  }

  return i;
}

在 8 位元灰階圖片中，0 表示黑色，最大值 255 則表示白色，數值越大顏色越淺
所以我發現，當 count=256 時， i 可能為 0~255 中任一數，其中數值大小決定該像素顏色深淺

再比較VIEW 1和 VIEW 2:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

可以明顯看出 VIEW 1 的淺色像素遠比 View2 來得多且集中(集中在中間)
接著我們分別來看3個thread計算執行的內容(如下圖)
thread1 所計算的image內容淺色像素多且集中
因此我認為是 thread1 拖垮整體執行速度，導致 speedup 下降

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Hypothesize: 每個 thread 的工作量與其負責區域所含有的淺色像素數量有關。
當 number of threads=3 時，thread1 會被分配到淺色像素多且集中的區域，
導致整體執行速度不佳。

Q2: How do your measurements explain the speedup graph you previously created?

number of thread=2

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

$ ./mandelbrot -t 2
[mandelbrot serial]:            [462.669] ms
Wrote image file mandelbrot-serial.ppm
Thread 0: [262.9549] ms
Thread 1: [263.1731] ms
[mandelbrot thread]:            [263.524] ms
Wrote image file mandelbrot-thread.ppm
                                (1.76x speedup from 2 threads)

number of thread=3

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

$ ./mandelbrot -t 3
[mandelbrot serial]:            [461.782] ms
Wrote image file mandelbrot-serial.ppm
Thread 0: [94.1784] ms
Thread 2: [94.9320] ms
Thread 1: [286.3419] ms
[mandelbrot thread]:            [286.568] ms
Wrote image file mandelbrot-thread.ppm
                                (1.61x speedup from 3 threads)

number of thread=4

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

$ ./mandelbrot -t 4
[mandelbrot serial]:            [474.807] ms
Wrote image file mandelbrot-serial.ppm
Thread 3: [45.8921] ms
Thread 0: [66.6695] ms
Thread 1: [194.6871] ms
Thread 2: [206.5401] ms
[mandelbrot thread]:            [219.570] ms
Wrote image file mandelbrot-thread.ppm
                                (2.16x speedup from 4 threads)

在 number of thread=3 可以看出
thread1 執行時間也是比 thread0 和 thread2 久
thread1 所計算的image內容淺色像素多且集中
因此 thread1 拖垮整體執行速度，導致 speedup 下降
在 number of threads=4 也可以看出
thread1 和 thread2是主要的執行速度瓶頸

接著看 VIEW 2 , number of threads=3

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

$ ./mandelbrot -t 3 --view 2
[mandelbrot serial]:            [297.046] ms
Wrote image file mandelbrot-serial.ppm
Thread 2: [82.0762] ms
Thread 1: [87.8727] ms
Thread 0: [134.4979] ms
[mandelbrot thread]:            [134.589] ms
Wrote image file mandelbrot-thread.ppm
                                (2.21x speedup from 3 threads)

相對來說 VIEW 2的 thread0 執行時間也是比 thread1 和 thread2 久
(VIEW 2 淺色像素主要在最上面)
但因為淺色像素在整體是相對分散許多
因此執行速度瓶頸雖是 thread0，但不至於拖垮整體執行速度，還是可以看出線性成長的 speedup

A: 在 VIEW 1，speedup隨著number of threads上升不會有線性成長，
是因每個 thread 工作量分配極度不均的關係。

Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained.

用一開始的方法，由上到下均分成n個區塊來分配給n個threads計算執行
但在 VIEW 1 由於淺色像素集中於同一區域導致工作量分配不均
因此我將 workerThreadStart()修改成如下:






void workerThreadStart(WorkerArgs *const args)
{
    for (int i = args->threadId; i < args->height; i += args->numThreads){
		mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, i, 1, args->maxIterations, args->output);
    }
}

我將 workerThreadStart()每一個thread執行mandelbrotSerial()時，傳入的 totalRows=1
我以 number of threads=3為例:

這樣能將原本淺色像素集中區域，3個thread都會被分到部分計算執行，
如此能將工作量平均分配

Final 4-thread speedup obtained
執行結果如下:

$ ./mandelbrot -t 4
[mandelbrot serial]:            [462.474] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [121.549] ms
Wrote image file mandelbrot-thread.ppm
                                (3.80x speedup from 4 threads)
$ ./mandelbrot -t 4 --view 2
[mandelbrot serial]:            [288.821] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [76.039] ms
Wrote image file mandelbrot-thread.ppm
                                (3.80x speedup from 4 threads)

View 1: 3.80x speedup from 4 threads
View 2: 3.80x speedup from 4 threads

Q4: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not?

(Notice that the workstation server provides 4 cores 4 threads.)

當 number of threads=8 執行如下:

$ ./mandelbrot -t 8
[mandelbrot serial]:            [461.297] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [123.937] ms
Wrote image file mandelbrot-thread.ppm
                                (3.72x speedup from 8 threads)
$ ./mandelbrot -t 8 --view 2
[mandelbrot serial]:            [289.132] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [76.669] ms
Wrote image file mandelbrot-thread.ppm
                                (3.77x speedup from 8 threads)

可以發現 number of threads=8 會比 number of threads=4 執行還要慢一些

我認為當number of threads=8時，speedup沒有上升反而下降，
是因為工作站提供的是 4 核心 4 執行序的 Server，
當使用 8 個 threads 時， core 被分配到的threads就可能會大於1
這時執行會多了 context switch 的 overhead
這也造成了整體執行時間比 number of threads=4 還要久的原因。

Good Work!

TA

Parallel Programming @ NYCU - HW2

0716221 余忠旻

In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1.

Q1: Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case?

Q2: How do your measurements explain the speedup graph you previously created?

Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained.

Q4: Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not?

Read more

Embedded System Design

RTOS

Via SSH to install Teamviewer

Ubuntu install c++ version of OpenCV + opencv_contrib

`0716221 余忠旻`