
Parallel Programming HW5 @NYCU, 2022 Fall

tags: 2022_PP_NYCU

Q1

What are the pros and cons of the three methods? Give an assumption about their performances.

  • Method 1: malloc + cudaMalloc
    • Pros:
        1. Each thread handles only one pixel, which makes good use of the GPU's parallelism.
        2. No page-locked memory is used, so the impact on overall system performance is small.
    • Cons:
        1. The workload differs from thread to thread, so threads that finish early must wait for the slower ones.
        2. malloc allocates pageable host memory; if a needed page is not resident in memory, the transfer has to wait for it to be paged in.
  • Method 2: cudaHostAlloc + cudaMallocPitch
    • Pros:
        1. cudaHostAlloc allocates page-locked host memory, which guarantees the needed pages stay resident and avoids waiting for paging during transfers.
        2. For 2D data stored with aligned rows, access speed improves.
        3. cudaMallocPitch pads each row of the 2D array so that the address of every row's first element is aligned to a multiple of 256 or 512 bytes, improving access efficiency.
    • Cons:
        1. The workload differs from thread to thread, so threads that finish early must wait for the slower ones.
        2. Allocating too much page-locked memory with cudaHostAlloc degrades system performance.
  • Method 3: cudaHostAlloc + cudaMallocPitch + group
    • Pros:
        1. Each thread processes a group of pixels at a time, reducing the number of transfers between global memory and shared memory.
        2. cudaHostAlloc allocates page-locked host memory, which guarantees the needed pages stay resident and avoids waiting for paging during transfers.
        3. For 2D data stored with aligned rows, access speed improves.
        4. cudaMallocPitch pads each row of the 2D array so that the address of every row's first element is aligned to a multiple of 256 or 512 bytes, improving access efficiency.
    • Cons:
        1. If the per-pixel workload is distributed unevenly across groups, with some groups containing mostly expensive pixels and others mostly cheap ones, performance suffers.
        2. Allocating too much page-locked memory with cudaHostAlloc degrades system performance.

Assumption:

  • Method 3 will suffer from load imbalance: both test images contain unevenly distributed dark and light regions (pixels with very different iteration counts), so Method 3 should perform worst.
  • cudaMallocPitch is more complex than cudaMalloc and takes more time, so I expect Method 1 to outperform Method 2.
  • Expected performance ranking: Method 1 > Method 2 > Method 3

Q2

How are the performances of the three methods? Plot a chart to show the differences among the three methods

View 1 (execution time, ms):

| | Method 1 | Method 2 | Method 3 |
| --- | --- | --- | --- |
| iteration=1000 | 6.556 | 9.012 | 9.065 |
| iteration=10000 | 33.779 | 36.041 | 43.169 |
| iteration=100000 | 310.885 | 314.805 | 382.159 |

(Chart: execution times of the three methods for View 1.)

View 2 (execution time, ms):

| | Method 1 | Method 2 | Method 3 |
| --- | --- | --- | --- |
| iteration=1000 | 3.761 | 6.260 | 6.507 |
| iteration=10000 | 6.745 | 9.221 | 10.714 |
| iteration=100000 | 29.948 | 32.349 | 46.315 |

(Chart: execution times of the three methods for View 2.)


Q3

Explain the performance differences thoroughly based on your experimental results. Does the results match your assumption? Why or why not.

Assumption: Method 1 > Method 2 > Method 3
Experimental result: Method 1 > Method 2 > Method 3

The experimental results match my assumption.

Method 3 is the slowest because each thread has to process a whole group of pixels, so the computational load is distributed unevenly across threads.

Method 2 is slower than Method 1 because its cudaMemcpy2D calls take longer than Method 1's cudaMemcpy calls, as the nvprof results below show.

Method 1:

$ nvprof ./mandelbrot -i 100000 -v 1 -g 1

==162608== NVPROF is profiling process 162608, command: ./mandelbrot -i 100000 -v 1 -g 1
[mandelbrot reference]:         [344.416] ms
Wrote image file mandelbrot-ref.ppm
[mandelbrot thread]:            [307.783] ms
Wrote image file mandelbrot-thread.ppm
                                (1.12x speedup over the reference)
==162608== Profiling application: ./mandelbrot -i 100000 -v 1 -g 1
==162608== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.77%  3.43965s        10  343.97ms  337.66ms  391.15ms  mandelKernelRef(int*, float, float, float, float, int, int, int, int)
                   46.74%  3.04716s        10  304.72ms  303.76ms  305.49ms  mandelKernel(int*, float, float, float, float, int, int)
                    0.49%  31.995ms        20  1.5998ms  1.2946ms  3.6519ms  [CUDA memcpy DtoH]
      API calls:   51.68%  3.43975s        10  343.97ms  337.66ms  391.17ms  cudaDeviceSynchronize
                   46.33%  3.08376s        20  154.19ms  1.4646ms  306.95ms  cudaMemcpy
                    1.93%  128.65ms        20  6.4325ms  108.32us  126.16ms  cudaMalloc
                    ...

Method 2:

$ nvprof ./mandelbrot -i 100000 -v 1 -g 1

==161969== NVPROF is profiling process 161969, command: ./mandelbrot -i 100000 -v 1 -g 1
[mandelbrot reference]:         [345.823] ms
Wrote image file mandelbrot-ref.ppm
[mandelbrot thread]:            [310.246] ms
Wrote image file mandelbrot-thread.ppm
                                (1.11x speedup over the reference)
==161969== Profiling application: ./mandelbrot -i 100000 -v 1 -g 1
==161969== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.99%  3.46045s        10  346.05ms  338.39ms  399.42ms  mandelKernelRef(int*, float, float, float, float, int, int, int, int)
                   46.64%  3.04556s        10  304.56ms  303.44ms  305.56ms  mandelKernel(int*, unsigned long, float, float, float, float, int, int)
                    0.37%  23.860ms        20  1.1930ms  589.64us  3.5027ms  [CUDA memcpy DtoH]
      API calls:   51.61%  3.46055s        10  346.06ms  338.39ms  399.43ms  cudaDeviceSynchronize
                   45.52%  3.05184s        10  305.18ms  304.05ms  306.17ms  cudaMemcpy2D
                    1.94%  129.77ms        10  12.977ms  111.81us  128.72ms  cudaMalloc
                    ...

Q4

Can we do even better? Think a better approach and explain it. Implement your method in kernel4.cu

I modified Method 1 as follows:

    1. Changed the block size from 16 to 8, which gives a small speedup.
    2. Originally both a host image buffer and a device image buffer were allocated, and the result was copied from the host buffer into img at the end. I now allocate only the device buffer and let cudaMemcpy write the device result directly into img, which is slightly faster again.
    3. In mandelKernel, the escape-time computation for a pixel may run for many iterations; applying #pragma unroll to that loop speeds up the whole kernel:
    else if (maxIterations == 10000)
    {
        int i;
#pragma unroll
        for (i = 0; i < maxIterations; i++)
        {
            if (z_re * z_re + z_im * z_im > 4.f)
                break;

            float new_re = z_re * z_re - z_im * z_im;
            float new_im = 2.f * z_re * z_im;
            z_re = c_re + new_re;
            z_im = c_im + new_im;
        }
        d_img[idx] = i;
    }

Experimental results (execution time, ms):

| | View 1 | View 2 |
| --- | --- | --- |
| iteration=1000 | 5.008 | 2.852 |
| iteration=10000 | 23.771 | 4.613 |
| iteration=100000 | 206.298 | 19.752 |

(Charts: kernel4 execution times for View 1 and View 2.)