CUDA Programming

Q1: What are the pros and cons of the three methods? Give an assumption about their performances?

[Figure not available: pageable data transfer (left) and pinned data transfer (right)]

Kernel 1.

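Pros:
Host memory stays pageable, so the operating system can schedule it more flexibly than pinned memory. Since running short of available memory slows the whole system down, pageable allocation is the more reasonable choice on machines with little RAM.
Cons:
When the host buffer comes from malloc, the CUDA driver cannot move it between DRAM and the GPU directly, so every copy happens twice (left figure above): once between pageable memory and a temporary page-locked staging buffer, and once between that staging buffer and the GPU. Copy speed is therefore limited by either the PCIe transfer or the internal staging copy, and the CPU has to take part in the extra step, which adds overhead.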

Kernel 2.

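Pros:
Pinned memory skips the extra internal copy: data moves between the GPU and the host directly over PCIe, and the CUDA API can transfer it to or from the device by DMA. This reduces CPU involvement and saves one copy, so the benefit grows for workloads that transfer data frequently.
Cons:
cudaHostAlloc() allocates pinned memory: the selected host pages are locked and never paged out (right figure above), so system memory can be used up quickly. On a machine with little RAM, overusing pinned memory leaves less pageable memory for everything else, which leads to repeated page faults and slows the whole system down.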
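
The difference between the two host allocation styles can be sketched as follows. This is a minimal illustration, not the assignment's code: the image size and the helper names `check` and `timeCopyToHost` are assumptions, and the measured times depend on the machine.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Time one device-to-host copy of `bytes` bytes, in milliseconds.
static float timeCopyToHost(void *host, const void *dev, size_t bytes) {
    cudaEvent_t start, stop;
    check(cudaEventCreate(&start), "cudaEventCreate");
    check(cudaEventCreate(&stop), "cudaEventCreate");
    check(cudaEventRecord(start), "cudaEventRecord");
    check(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy");
    check(cudaEventRecord(stop), "cudaEventRecord");
    check(cudaEventSynchronize(stop), "cudaEventSynchronize");
    float ms = 0.0f;
    check(cudaEventElapsedTime(&ms, start, stop), "cudaEventElapsedTime");
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 1600 * 1200 * sizeof(int);  // assumed image size

    int *dev = nullptr;
    check(cudaMalloc(reinterpret_cast<void **>(&dev), bytes), "cudaMalloc");
    check(cudaMemset(dev, 0, bytes), "cudaMemset");

    // Kernel 1 style: pageable host memory, the driver stages the copy.
    int *pageable = static_cast<int *>(std::malloc(bytes));
    if (pageable == nullptr) return EXIT_FAILURE;
    float msPageable = timeCopyToHost(pageable, dev, bytes);

    // Kernel 2 style: page-locked host memory, eligible for direct DMA.
    int *pinned = nullptr;
    check(cudaHostAlloc(reinterpret_cast<void **>(&pinned), bytes,
                        cudaHostAllocDefault), "cudaHostAlloc");
    float msPinned = timeCopyToHost(pinned, dev, bytes);

    std::printf("pageable: %.3f ms, pinned: %.3f ms\n", msPageable, msPinned);

    std::free(pageable);
    check(cudaFreeHost(pinned), "cudaFreeHost");
    check(cudaFree(dev), "cudaFree");
    return 0;
}
```

Pinned memory must be released with cudaFreeHost rather than free, which is easy to overlook when switching between the two styles.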

Kernel 3.

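Pros:
Kernel 3 benefits from the aligned layout produced by cudaMallocPitch: each row starts at an aligned address, so accesses to neighboring memory are faster.
Cons:
Each thread has to compute more pixels, so the work per thread goes up and the degree of parallelism goes down, which lengthens the execution time.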
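
A Kernel 3 style layout might look like the sketch below. It is only an illustration under assumptions: the sizes, the horizontal grouping of `groupX` pixels per thread, and the names `rowPtr` and `fillGrouped` are hypothetical, and the stored value is a stand-in for the per-pixel Mandelbrot result.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Pointer to the start of row `y` in a pitched allocation.
// `pitch` is in bytes, so the offset is applied to a char pointer.
__host__ __device__ inline int *rowPtr(int *base, size_t pitch, int y) {
    return reinterpret_cast<int *>(reinterpret_cast<char *>(base) + y * pitch);
}

// Each thread fills a horizontal run of `groupX` pixels in one row.
__global__ void fillGrouped(int *img, size_t pitch, int width, int height,
                            int groupX) {
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * groupX;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (y >= height) return;
    int *row = rowPtr(img, pitch, y);
    for (int x = x0; x < x0 + groupX && x < width; ++x) {
        row[x] = y * width + x;  // placeholder for the per-pixel computation
    }
}

int main() {
    const int width = 1000, height = 8, groupX = 4;  // assumed sizes
    int *dev = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch(reinterpret_cast<void **>(&dev), &pitch,
                        width * sizeof(int), height) != cudaSuccess) {
        return EXIT_FAILURE;
    }

    dim3 block(16, 16);
    int threadsX = (width + groupX - 1) / groupX;  // threads needed per row
    dim3 grid((threadsX + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    fillGrouped<<<grid, block>>>(dev, pitch, width, height, groupX);

    int *host = static_cast<int *>(std::malloc(width * height * sizeof(int)));
    if (host == nullptr) return EXIT_FAILURE;
    // Copy the 2D region back, dropping the padding at the end of each row.
    cudaMemcpy2D(host, width * sizeof(int), dev, pitch,
                 width * sizeof(int), height, cudaMemcpyDeviceToHost);
    std::printf("pitch = %zu bytes, last pixel = %d\n", pitch,
                host[width * height - 1]);

    std::free(host);
    cudaFree(dev);
    return 0;
}
```

With `groupX` pixels per thread the grid needs roughly `1/groupX` as many threads in x, which is the loss of parallelism described in the cons above.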

Q2: How are the performances of the three methods? Plot a chart to show the differences among the three methods

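The left charts show the program's execution time for each iteration count.
The right charts show the speedup over the serial version for each iteration count.
The top two charts are the results for View 1.
The bottom two charts are the results for View 2.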

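View 1 requires more computation than View 2 and takes longer. As the amount of computation grows, Kernel 1 and Kernel 2 perform almost identically, which shows that the gap between them comes mainly from something other than computation.
Ranked by performance, the results are basically Kernel 1 >= Kernel 2 > Kernel 3.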

Q3: Explain the performance differences thoroughly based on your experimental results. Does the results match your assumption? Why or why not.

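I originally expected Kernel 2 to be the fastest, Kernel 1 second, and Kernel 3 last.
In the results, Kernel 1 was slightly better than Kernel 2, although the two performed almost identically on large workloads, and Kernel 3 was indeed last.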

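Why Kernel 2 fell short of expectations:

Kernel 2 uses pinned memory, which transfers data faster, but the host and device only communicate at the very beginning and the very end of the program. Although the memory copy accounts for a large share of the time, it is not frequent, so the bottleneck is no longer memory communication but the computation, and pinning the pages brings little benefit.
Kernel 2 also uses cudaMallocPitch to align memory, but each pixel is still computed by its own thread. Since a thread hardly ever needs to loop over multiple pixels, the alignment brings almost no benefit. The images below show the profiling results of mandelKernel for Kernel 1 and Kernel 2 on View 2 with 10000 iterations; the computation is not noticeably faster after alignment.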

Kernel 1
[Profiling screenshot not available]
Kernel 2
[Profiling screenshot not available]

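The pitch alignment step itself also takes some time, so the latency outside the computation goes up slightly. This cost is amortized as the amount of computation grows, and the nvprof results show that the time spent on the pitch step stays the same regardless of the iteration count.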

1000 iterations:
[nvprof screenshot not available]
10000 iterations:
[nvprof screenshot not available]
100000 iterations:
[nvprof screenshot not available]
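
One way to see why the allocation cost does not scale with the iteration count is that the iteration count is never an argument to the allocation call. The sketch below times the two allocation calls from the host; the image size and the helper name `timeMs` are assumptions, and the numbers it prints will vary between machines and between runs.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Wall-clock time of a host-side call, in milliseconds (hypothetical helper).
template <typename F>
static double timeMs(F &&f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t width = 1600, height = 1200;  // assumed image size

    // Force context creation first so that it is not counted below.
    cudaFree(0);

    int *linear = nullptr;
    int *pitched = nullptr;
    size_t pitch = 0;

    // Plain linear allocation, as in Kernel 1.
    double msLinear = timeMs([&] {
        cudaMalloc(reinterpret_cast<void **>(&linear),
                   width * height * sizeof(int));
    });

    // Pitched allocation, as in Kernel 2 and Kernel 3.
    double msPitched = timeMs([&] {
        cudaMallocPitch(reinterpret_cast<void **>(&pitched), &pitch,
                        width * sizeof(int), height);
    });

    std::printf("cudaMalloc:      %.3f ms\n", msLinear);
    std::printf("cudaMallocPitch: %.3f ms (pitch = %zu bytes)\n",
                msPitched, pitch);

    cudaFree(linear);
    cudaFree(pitched);
    return 0;
}
```

Because both calls depend only on the image dimensions, whatever they cost is a fixed term that shrinks relative to the kernel time as the iteration count rises.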

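Why Kernel 3 is slower:
In Kernel 3 I split the whole image into 512 blocks and used 512 threads for the computation, with each thread working through its block from top to bottom.

Although this is a spatial optimization, it inevitably lowers the degree of parallelism: the speed gained from the spatial optimization is lost again to the reduced parallelism. The conclusion is that parallelism plays the more important role in this task.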

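Why Kernel 1 is the fastest:

Memory communication is not the bottleneck of this task, so not pinning memory did not put Kernel 1 behind the other two kernels.
In addition, because one thread computes one pixel, leaving memory unaligned has very little effect on the program, and it even saves the time spent preparing the memory.
The Kernel 3 experiment shows that spatial optimization does bring some benefit, but less than the parallelism gained from having one thread handle one pixel.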
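
The one-thread-per-pixel mapping discussed above can be sketched like this. It is a minimal illustration rather than the submitted kernel: the image size, view bounds, iteration count, and parameter names are assumptions, and `mandel` is the usual escape-time iteration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Escape-time iteration count for the point c = (cRe, cIm).
__host__ __device__ inline int mandel(float cRe, float cIm, int maxIter) {
    float zRe = cRe, zIm = cIm;
    int i = 0;
    for (; i < maxIter; ++i) {
        if (zRe * zRe + zIm * zIm > 4.0f) break;
        float newRe = zRe * zRe - zIm * zIm;
        float newIm = 2.0f * zRe * zIm;
        zRe = cRe + newRe;
        zIm = cIm + newIm;
    }
    return i;
}

// One thread per pixel: the 2D thread index is the pixel coordinate.
__global__ void mandelKernel(int *out, float x0, float y0, float dx, float dy,
                             int width, int height, int maxIter) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // guard the partial edge blocks
    out[y * width + x] = mandel(x0 + x * dx, y0 + y * dy, maxIter);
}

int main() {
    const int width = 1600, height = 1200, maxIter = 256;     // assumed
    const float x0 = -2.0f, x1 = 1.0f, y0 = -1.0f, y1 = 1.0f;  // assumed view
    const size_t bytes = static_cast<size_t>(width) * height * sizeof(int);

    int *dev = nullptr;
    if (cudaMalloc(reinterpret_cast<void **>(&dev), bytes) != cudaSuccess) {
        return EXIT_FAILURE;
    }

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(dev, x0, y0, (x1 - x0) / width,
                                  (y1 - y0) / height, width, height, maxIter);

    int *host = static_cast<int *>(std::malloc(bytes));
    if (host == nullptr) return EXIT_FAILURE;
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    std::printf("iterations at the center pixel: %d\n",
                host[(height / 2) * width + width / 2]);

    std::free(host);
    cudaFree(dev);
    return 0;
}
```

Each thread writes exactly one int and reads nothing from global memory, which is consistent with the observation that row alignment has little to gain here.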

Q4:Can we do even better? Think a better approach and explain it. Implement your method in `kernel4.cu`.

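I tried many approaches for kernel4, but most of them ran into problems at run time. I first tried computing with vector-type variables (float4 and the like), but the results kept coming out wrong, and I later concluded that these types are meant to speed up memory access rather than arithmetic. In the end I tried various block sizes with Kernel 1 and found that 16 gave the best result. My guess is that this is related to CUDA's thread limits: a block size that is too large leaves too few threads available, while one that is too small does not make full use of the threads. I used this configuration as my kernel 4.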
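
The thread limits mentioned above can be queried from the runtime instead of guessed. The sketch below assumes that the block size of 16 means a 16x16 block (256 threads); `dummyKernel` and `blocksFor` are hypothetical stand-ins, and the suggested block size depends on the kernel and the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; in practice this would be the Mandelbrot kernel.
__global__ void dummyKernel(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Number of blocks needed to cover `n` items with `blockSize` threads each.
inline int blocksFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max threads per SM:    %d\n",
                prop.maxThreadsPerMultiProcessor);

    // Ask the runtime which block size maximizes occupancy for this kernel.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, dummyKernel,
                                       0, 0);
    std::printf("suggested block size:  %d threads\n", blockSize);

    // A 16x16 block uses 256 threads; compare it with the limits above.
    std::printf("16x16 block:           %d threads\n", 16 * 16);
    std::printf("blocks per 1600-pixel row at width 16: %d\n",
                blocksFor(1600, 16));
    return 0;
}
```

The occupancy query is a starting point only; timing a few candidate block sizes, as described above, remains the more direct test.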

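I also noticed that NVIDIA provides SIMD intrinsics for CUDA, which operate on several numbers at once. I thought they might speed up the instructions, but I could not finish this within the time available.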

CUDA_SIMD: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html
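
For reference, a minimal use of one of these intrinsics is sketched below. The intrinsics on that page work on packed 8-bit or 16-bit integer lanes inside a 32-bit word, so they would not apply directly to the floating-point arithmetic of the Mandelbrot iteration. The kernel name, the input values, and the host reference `addBytesRef` are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds four packed 8-bit lanes per 32-bit word with one intrinsic call.
__global__ void addPackedBytes(const unsigned int *a, const unsigned int *b,
                               unsigned int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __vadd4(a[i], b[i]);
}

// Host reference: the same per-byte wrap-around addition, lane by lane.
inline unsigned int addBytesRef(unsigned int a, unsigned int b) {
    unsigned int r = 0;
    for (int lane = 0; lane < 4; ++lane) {
        unsigned int sum = ((a >> (8 * lane)) & 0xFFu) +
                           ((b >> (8 * lane)) & 0xFFu);
        r |= (sum & 0xFFu) << (8 * lane);
    }
    return r;
}

int main() {
    const int n = 3;
    unsigned int ha[n] = {0x01020304u, 0xFF000001u, 0x10203040u};
    unsigned int hb[n] = {0x01010101u, 0x01000001u, 0x01020304u};
    unsigned int hout[n] = {0u, 0u, 0u};

    unsigned int *da = nullptr, *db = nullptr, *dout = nullptr;
    cudaMalloc(reinterpret_cast<void **>(&da), sizeof(ha));
    cudaMalloc(reinterpret_cast<void **>(&db), sizeof(hb));
    cudaMalloc(reinterpret_cast<void **>(&dout), sizeof(hout));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);

    addPackedBytes<<<1, 32>>>(da, db, dout, n);
    cudaMemcpy(hout, dout, sizeof(hout), cudaMemcpyDeviceToHost);

    // Print the device result next to the host reference for comparison.
    for (int i = 0; i < n; ++i) {
        std::printf("device %08x, reference %08x\n", hout[i],
                    addBytesRef(ha[i], hb[i]));
    }

    cudaFree(da);
    cudaFree(db);
    cudaFree(dout);
    return 0;
}
```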

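Many people online have also studied how to apply CUDA to the problem in this assignment. Adaptive Parallel Computation with CUDA Dynamic Parallelism, published in June 2014, is one such method. It is described at https://developer.nvidia.com/blog/introduction-cuda-dynamic-parallelism/ , which covers many CUDA features and how to optimize with them. I do not think my skills are yet good enough to fully understand and implement it, but I may try again in the future if I have time.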
