High-Performance Computing (Spring 2017 Course Notes, NCKU Institute of Applied Mathematics)
===

* TOP 500 List https://www.top500.org/list/2016/11/
* [A Call to Action to Prepare the High-Performance Computing Workforce](http://ieeexplore.ieee.org/document/7723785/)

## Multi-Core CPU & GPU Computing

* https://hackmd.io/xNWY8BvoSqu4ym9mtI30GA
* https://www.youtube.com/watch?v=45OoV8puvOQ
* https://www.youtube.com/watch?v=hXWxbY39uMw&feature=youtu.be
* https://www.youtube.com/watch?v=3H4AuBn4VZA
* https://www.youtube.com/watch?v=04ScT6moAww
* https://www.youtube.com/watch?v=EwgdPrwHW2M
* https://www.youtube.com/watch?v=oheyVGBXGrU
* https://www.youtube.com/watch?v=Q-xFtIXHy_c
* https://www.youtube.com/watch?v=p3xz6cq3qLc
* https://www.youtube.com/watch?v=Am1mTZowhtM
* https://www.youtube.com/watch?v=o-W2Z2P015M
* https://www.youtube.com/watch?v=Ile2LVL-1Zk
* https://www.youtube.com/watch?v=zQP0rX6sV1M
* https://www.youtube.com/watch?v=A49Fbf4jyEc
* https://www.youtube.com/watch?v=IzWidgvrJik

# General Info

## References

* [HPC CPU & GPU Notes - Assignments](https://hackmd.io/s/ByszQkkYl)
* [Prof. Smith's Multi-Core CPU and GPU Programming course page (primary teaching material)](http://140.116.71.64/GPUClass.html)
* [Prof. Smith's lab tutorial page](http://140.116.71.64/downloads.html)
* [Prof. Che-Rung Lee's 2010 course page (NTHU)](http://www.cs.nthu.edu.tw/~cherung/teaching/2010gpucell/index.html)
* [NTU 2017 short course on parallel computing](http://nkl.cc.u-tokyo.ac.jp/NTU2017/)
* http://www.akira.ruc.dk/~keld/teaching/IPDC_f10/
* [NCTS course module 2017-2018 (PDF)](https://www.dropbox.com/s/bo199xqqe9mennh/NCTS%E8%AA%B2%E7%A8%8B%E6%A8%A1%E7%B5%84%202017_2018%20.pdf?dl=0)
* [Development of High-performance Computing Software](http://www.ncts.ntu.edu.tw/events_3_detail.php?nid=107): June 26 - July 7, 2017

## Course Slides

* Class Notes for Multi-CPU + GPU by Prof.
Smith:
  * [Class_Notes_A.pdf](http://140.116.71.64/files/GPU2016/Class_Notes_A.pdf)
  * [Class_Notes_B.pdf](http://140.116.71.64/files/GPU2016/Class_Notes_B.pdf)
  * [Class_Notes_C.pdf](http://140.116.71.64/files/GPU2016/Class_Notes_C.pdf)

## Course Servers

* GPU Server 1: ~~140.116.90.227~~ ~~140.116.90.200~~ 140.116.90.102
  * CPU: Intel(R) Core(TM) i7-4790
  * 12 GB Memory
  * ~~GPU: GeForce GTX TITAN Black~~
    * ~~2880 CUDA Cores~~
    * ~~6 GB Memory~~
  * GPU: GeForce GTX 1080 Ti
    * 3584 CUDA Cores
    * 11 GB Memory
    * CUDA Capability: 6.1

:::info
If you previously compiled your CUDA programs with nvcc -arch=sm_35, please switch to nvcc -arch=sm_61.
:::

* GPU Server 2: 140.116.90.103
  * CPU: Intel(R) Core(TM) i7-4790
  * 8 GB Memory
  * GPU: currently a GeForce GT 740; expected to be upgraded mid-semester to a 1080 Ti or a newer Titan

## NCKU Resources?

* http://supercomputer.ncku.edu.tw/bin/home.php

## Related News

* [Taiwan Neglects Supercomputing](http://spectrum.ieee.org/computing/hardware/taiwan-neglects-supercomputing)

## Workspace

* http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
* [fast.ai's 7 week course, Practical Deep Learning For Coders, Part 1, taught by Jeremy Howard](http://course.fast.ai/)
* [Ryan Zotti | How to Build Your Own Self Driving Toy Car](https://www.youtube.com/watch?v=QbbOxrR0zdA)

---

# Topic 1

## Course Servers

* GPU Server 1: ~~140.116.90.227~~ 140.116.90.102
* GPU Server 2: 140.116.90.103

Shared account: hpcuser, password: gj4vm,6

## Planned Schedule

### Software

#### Windows

* [putty](http://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html); [a short PuTTY tutorial](http://www.cs.nctu.edu.tw/help/putty.html)
* [winscp](http://winscp.net/eng/docs/lang:cht)

#### Linux

* wget: a file-download tool ([quick guide](http://www.ewdna.com/2012/04/wget.html))
* ssh, sftp: remote connections ([鳥哥的 Linux 私房菜](http://linux.vbird.org/linux_server/0310telnetssh.php))

### Connecting

* On Windows, connect with PuTTY.
* After connecting, list the directory contents (ls).
* After connecting, check the system's CPU and memory info:
```
cat /proc/cpuinfo
cat /proc/meminfo
```
* Create a (temporary) personal directory:
```
mkdir yourdir
```
* Launch MATLAB on the remote system.
* On Windows, start VirtualBox first; after logging in, connect with ssh -X.
*
Launch MATLAB or Firefox on the remote system.
* After logging in to a server, you can ssh from it to other servers, for example:
```
ssh -l hpcuser 140.116.90.227
```

### The nano Editor

After connecting, type nano to open the editor.

* [鳥哥的 Linux 私房菜, Ch. 4: First Login and Getting Help Online](http://linux.vbird.org/linux_basic/0160startlinux.php)
* [4.4 The super-simple text editor: nano](http://linux.vbird.org/linux_basic/0160startlinux.php#nano)

### Hello and the Compiler

* Use nano to write a C program that prints "Hello World". Save it in your personal directory as hello.c:
```
#include <stdio.h>

int main(void)
{
    printf("hello, world\n");
    return 0;
}
```
* Use gcc to compile hello.c into the executable a.out:
```
gcc hello.c
```
* Run a.out:
```
./a.out
```

# Topic 2

## Makefile

* [猴子都會寫的Makefile - makefile簡易教學 (1)](http://mropengate.blogspot.tw/2015/06/makefile-makefile.html)
* [A Simple Makefile Tutorial](http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/)

### Example 1

First create a directory under your personal directory for these practice files. Then write a makefile with the following content (each recipe line must be indented with a tab):

```
test:
	echo "TEST"

hello:
	echo "Hello"
```

The echo command prints the string after it to the screen. Save the file, then test with make test and make hello.

### Example 2

Copy last week's hello.c into the current directory (its content has changed), and modify the makefile as follows:

```
test:
	echo "TEST"

hello:
	echo "Hello"
	gcc hello.c
	./a.out
```

Now make hello not only prints Hello as before, but also compiles hello.c into a.out and then runs a.out.

:::info
The "./" in "./a.out" means the current directory.
:::

### Example 3

Sometimes the system has compilers other than gcc, but the makefile contains many occurrences of gcc and editing them one by one is inconvenient. We can define a variable at the top instead:

```
CC = gcc

test:
	echo "TEST"

hello:
	echo "Hello"
	$(CC) hello.c
	./a.out
```

### Example 4

For more advanced examples, see [A Simple Makefile Tutorial](http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/).

# Topic 3

## Vector Processors

* [wiki](https://en.wikipedia.org/wiki/Vector_processor)
* [Cray-1](https://en.wikipedia.org/wiki/Cray-1)

## Vector Computations

* Vector Addition / Subtraction
* Vector Elementwise Multiplication
* Vector Transpose Multiplication (dot product)

## Linear System $Ax=B$

* $x = A^{-1}B$
* Direct solver (Gaussian Elimination)
* Iterative methods
  * Jacobi and Gauss-Seidel
  * Conjugate Gradient Method
  * BiConjugate Gradient Method

### JACOBI METHOD

* [wiki](https://en.wikipedia.org/wiki/Jacobi_method#Description)
* matrix $A$ is diagonally dominant
* none of the diagonal elements are 0
* $A = D + A_{off}$
* D: contains
only the diagonals
* $Ax=B$
  1. $(D+A_{off})x=B$
  2. $Dx+A_{off}x=B$
  3. $x=D^{-1}(B-A_{off}x)$
  4. $x_{New}=D^{-1}(B-A_{off}x_{old})$

### Gauss-Seidel / Jacobi_HT

$$\Delta T = 0$$

```
for runs = 1:1:30
  for i = 2:1:19
    for j = 2:1:19
      T(i,j) = 0.25*( T(i+1,j) + T(i-1,j) + T(i,j+1) + T(i,j-1));
    end
  end
end
```

:::info
See Chapter 5 of [am585winter06.pdf](https://www.dropbox.com/s/jle2ez5cxl5qbk9/am585winter06.pdf?dl=0).
:::

### Conjugate Gradient (CG) Method

* The CG method is considerably faster than Gauss-Seidel / Jacobi / SOR.

:::info
If you are interested in matrix computations / numerical linear algebra, talk to Prof. 王辰樹 or Prof. 林敏雄.
:::

## Numerical PDE and $Ax=B$

:::info
Pp. 95-153. For more on numerical PDEs, consider courses such as Numerical Partial Differential Equations, Discontinuous Finite Element Methods, and Boundary Element Methods.
* Finite Difference Methods for Ordinary and Partial Differential Equations by Randall J. LeVeque
* [am585winter06.pdf](https://www.dropbox.com/s/jle2ez5cxl5qbk9/am585winter06.pdf?dl=0)
* [am586spring06.pdf](https://www.dropbox.com/s/whn7bjytmahyntw/am586spring06.pdf?dl=0)

We skip this material for now and will review it later if needed.
:::

## PARALLEL COMPUTING THEORY

* Taking a single goal (perhaps the calculation of stress in a substance),
* Dividing this computation in such a way that it can be performed simultaneously,
* Taking each part and executing it on different CPU (or other) cores

### TYPES OF COMPUTATION

* Sequential Computations: $a_{n+1}=2 a_n$
* Parallel Computations

:::info
To picture parallel computing, imagine several TAs grading exams at the same time.
:::

### THEORY OF COMPUTATION

Arrays $A$ and $B$ are $N$ elements long. How much computational work is performed when computing $C = A + B$?

* We need to compute C[i] = A[i] + B[i] for each of i = 0 to N-1 (N times). Therefore, we perform N computations.
* Assuming we perform $K_1$ operations per element, the total work is $W = K_1 N$.
* The computational work and time are correlated: $T = K_2 W$
* Here $K_2$ is the speed at which the work is performed. This is directly related to the clock speed and model of your CPU. Again, we don't explicitly need to know what $K_2$ is.

How long will it take to add the arrays $C = A + B$ over $N$ elements?
* $Time = K_2 \cdot Work = K_2 (K_1 N) = K_1 K_2 N = KN$

### PARALLEL COMPUTATION

* speedup: $SU = T_{SERIAL}/T_{PARALLEL}$
* In practice:
  * SU > 1: successful application of parallel computing
  * SU <= 1: danger!

### AMDAHL'S LAW

### GUSTAFSON'S LAW

### PREDICTION OF SPEEDUP FROM CODE

# Topic 4 SSE

Examples demonstrated in class on 3/22:

* [SSE_exercise.cpp](http://www.math.ncku.edu.tw/~mhchen/HPC/Tutorial03/SSE_exercise.cpp)
* [no_SSE.cpp](http://www.math.ncku.edu.tw/~mhchen/HPC/Tutorial03/no_SSE.cpp)
* [makefile](http://www.math.ncku.edu.tw/~mhchen/HPC/Tutorial03/makefile)

---

# Topic 5 OpenMP

Examples demonstrated in class on 3/22:

* [ex1.cpp](http://www.math.ncku.edu.tw/~mhchen/HPC/Tutorial04/ex1.cpp)
* [ex2.cpp](http://www.math.ncku.edu.tw/~mhchen/HPC/Tutorial04/ex2.cpp)
* [ex3.cpp](http://www.math.ncku.edu.tw/~mhchen/HPC/Tutorial04/ex3.cpp)
* [makefile](http://www.math.ncku.edu.tw/~mhchen/HPC/Tutorial04/makefile)

Examples from the 4/5 class:

* [hello.cpp](http://www.math.ncku.edu.tw/~mhchen/HPC/Tutorial04/hello.cpp)

:::info
How to compile and run:
g++ hello.cpp -fopenmp -o hello.run
./hello.run
:::

```=
#include <stdio.h>
#include <omp.h>

#define Iter 500

void print_hello() {
  int tid, k = 0;  // k initialized (it was read uninitialized in the original)
  #pragma omp for
  for (int i = 0; i < 10; i++) {
    // tid = omp_get_thread_num();
    for (int j = 0; j < Iter; j++) {
      k = k + 1;  // busy work
    }
    printf("i = %d, Hello from thread %d\n", i, omp_get_thread_num());
  }
}

int main() {
  int P = 4;  // No. of threads
  int tid;
  int k;      // each thread's private copy starts uninitialized, hence k = 0 below

  omp_set_num_threads(P);

  // Create threads
  #pragma omp parallel private(tid, k)
  {
    tid = omp_get_thread_num();
    printf("Hello from thread %d\n", tid);
    k = 0;

    #pragma omp for
    for (int i = 0; i < 10; i++) {
      for (int j = 0; j < 50; j++) {
        k = k + 1;  // busy work
      }
      tid = omp_get_thread_num();
      printf("i = %d, Hello from thread %d\n", i, tid);
    } // End of omp for.
    // an implicit barrier
    tid = omp_get_thread_num();
    printf("Call print_hello, %d\n", tid);
    tid = omp_get_thread_num();
    printf("Hello from thread %d\n", tid);
    print_hello();
    printf("After Call print_hello, %d\n", omp_get_thread_num());

    #pragma omp barrier
    printf("Good Bye from thread %d\n", omp_get_thread_num());
    // #pragma omp barrier
  } // Destroy threads

  return 0;
}
```

References:

* [Easy program parallelization - OpenMP (5): parallelizing variables](https://kheresy.wordpress.com/2006/09/22/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E4%BA%94%EF%BC%89-%E8%AE%8A%E6%95%B8%E7%9A%84%E5%B9%B3%E8%A1%8C%E5%8C%96/)
* [OpenMP wikibook](https://en.wikibooks.org/wiki/OpenMP)
* [OpenMP/Reductions](https://en.wikibooks.org/wiki/OpenMP/Reductions)
* [Parallel Computing and OpenMP Tutorial](https://idre.ucla.edu/sites/default/files/intro-openmp-2013-02-11.pdf)
* [Shared Memory Parallel Computing](http://www.akira.ruc.dk/~keld/teaching/IPDC_f10/)
  * http://www.akira.ruc.dk/~keld/teaching/IPDC_f10/Slides/pdf4x/4_Performance.4x.pdf
* [OpenMP Tutorial - Lawrence Livermore National Laboratory](https://computing.llnl.gov/tutorials/openMP/)
* [openmp cheat sheet](http://www.openmp.org/wp-content/uploads/OpenMP-4.0-C.pdf)

----

OpenMP usage generally falls into three categories:

* Directives
* Clauses
* Functions

The functions are called on their own and, in ordinary use, are rarely needed. Directives and clauses are used roughly in the form

```
#pragma omp directive [clause]
```

For example, in the earlier #pragma omp parallel for, both parallel and for are directives, so the syntax can be split into two lines: #pragma omp parallel and #pragma omp for.

### Reduction

http://stackoverflow.com/questions/33340718/openmp-reduction-after-parallel-region-declared-outside-function

### False Sharing

* https://www.dartmouth.edu/~rc/classes/intro_openmp/parallel_regions2.html#top
* http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=4

# Topic 6: GPU/CUDA

## NTU CSIE GPU Programming Course Materials

* [github](https://github.com/johnjohnlin/GPGPU_Programming_2016S)
*
[vincent.zip](http://www.math.ncku.edu.tw/~mhchen/HPC/vincent.zip): lab1, generates a Julia set video

{%youtube UQ4Eq45ZBRE %}

* Lecture 1: CUDA Crash Course
* Lecture 2: GPU Architecture & OpenGL / Shader Programming
* Lecture 3: Parallel Computing Overview
* [Lecture 4: CUDA Thread Model](http://www.math.ncku.edu.tw/~mhchen/HPC/CUDA/GPGPU_Lecture4.pdf)
* [Lecture 5: CUDA Memory](http://www.math.ncku.edu.tw/~mhchen/HPC/CUDA/GPGPU_Lecture5.pdf)

[Adaptive Parallel Computation with CUDA Dynamic Parallelism](https://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/)

### Serial Practice

[Serial.zip](http://www.math.ncku.edu.tw/~mhchen/HPC/CUDA0/Serial.zip)

### Shared Memory

* [Efficient Shared Memory Use](https://www.bu.edu/pasi/files/2011/07/Lecture31.pdf)
* [Using Shared Memory in CUDA C/C++](https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/)

### Matrix index errata for [Class_Notes_C.pdf](http://140.116.71.64/files/GPU2016/Class_Notes_C.pdf)

* P. 274-275: HA should be WB
* P. 288: C[ty * wA + tx] should be C[ty * wB + tx];
* To match our convention, the correspondence between the matrix and the 2-D threads is $A_{ij}=A[ty][tx]$ (assuming a single block)

----

### Dot Product / Reduction

[Optimizing Parallel Reduction in CUDA](http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf)

### Some nvcc Options

* To see the memory usage of your kernels, compile your CUDA program with:
```
nvcc --ptxas-options=-v
```
* [Computing SM occupancy](https://cuda105.hackpad.com/ep/pad/static/ziGQlHQwzGm)
* Compiling for a specific architecture: pass a flag such as -arch=sm_XX. For example, CUDA Capability 3.5 corresponds to -arch=sm_35.
* You can check your card's CUDA Capability with NVIDIA's deviceQuery utility.

![image alt](https://imgs.xkcd.com/comics/compiling.png)

---

## Debugging Exercise (5/24)

### Ex 01

* [bug_ex01.cu](http://www.math.ncku.edu.tw/~mhchen/HPC/bug_ex01.cu)
* Structure of the linked program:
  1. On the host, generate h_x and h_y with randomInit() (the original program fills them with random numbers; for debugging, each element has been changed to the constant 1)
  2. Copy h_x and h_y to d_x and d_y on the GPU with cudaMemcpy, then print the first element of h_x and of d_x with printf().
  3.
Finally, display the elapsed time.
* It compiles with nvcc without problems, but running it produces the following (the last line is the system's segmentation-fault message):
```
CUDA error dx (mallocd_a) = no error
CUDA error dy (mallocd_a) = no error
程式記憶體區段錯誤 (core dumped)
```
* Find the bug and fix it.

---

## Save an array as a BMP file

* [How to read/write BMP files with ISO C++? (C/C++) (Image Processing)](http://www.cnblogs.com/oomusou/archive/2007/02/03/639074.html)
* [BMPrwcpp.cpp](http://www.math.ncku.edu.tw/~mhchen/HPC/Image/BMPrwcpp.cpp)
* [BMPrwc.c](http://www.math.ncku.edu.tw/~mhchen/HPC/Image/BMPrwc.c)
* [clena.bmp](http://www.math.ncku.edu.tw/~mhchen/HPC/Image/clena.bmp)
* [Arnold's Cat Map](https://www.math.utk.edu/~ccollins/GS2006/Lab/16.html)

## How to create a gif from the command line

* [How to create a gif from the command line](https://askubuntu.com/questions/648244/how-to-create-a-gif-from-the-command-line)

```
convert -delay 20 -loop 0 *.jpg myimage.gif
```

{%youtube OFusYizJ-bA %}

* [Clena.zip](http://www.math.ncku.edu.tw/~mhchen/HPC/Image/Clena.zip)

----

## Reaction Diffusion Eq

* [reaction_diffusion.m]:
  * [RD2D](http://www.math.ncku.edu.tw/~mhchen/HPC/RD/RD2D.m)
  * [RD2DNB](http://www.math.ncku.edu.tw/~mhchen/HPC/RD/RD2DNB.m)
* [How the Tiger got its Stripes](http://blogs.mathworks.com/graphics/2015/03/16/how-the-tiger-got-its-stripes/)

{%youtube No6LvBBmk5E%}
{%youtube s7QSCAbS-_g%}

----

### Perlin noise

* [Perlin noise](https://en.wikipedia.org/wiki/Perlin_noise)
* [Perlin noise in C++11](https://solarianprogrammer.com/2012/07/18/perlin-noise-cpp-11/)
* [Implementation of Perlin Noise on GPU](http://www.sci.utah.edu/~leenak/IndStudy_reportfall/Perlin%20Noise%20on%20GPU.html)

---

# Final Project

## Topic 1: Poisson Equation

Discretizing the Poisson equation
$$ \Delta u = f $$
with the finite difference method (or finite element method) yields a linear system
$$ A u_h = f $$
We will provide the data (or a program) for the matrix $A$. Write a program (CUDA or OpenMP) that solves this linear system and outputs the solution $u_h$; then plot the solution with MATLAB or another language you are familiar with.

### Reference Programs

* [poisson.m](http://www.math.ncku.edu.tw/~mhchen/HPC/poisson.m): a MATLAB program that solves the Poisson equation; see (3.12) in the notes linked below
* [Elliptic.pdf](http://www.math.ncku.edu.tw/~mhchen/HPC/Elliptic.pdf): excerpted from LeVeque's numerical PDE lecture notes
*
[Final.zip](http://www.math.ncku.edu.tw/~mhchen/HPC/Final.zip): this package contains both C and MATLAB programs. WriteMat.m generates $A$ and $f$ and stores them as binary files; readsys.cpp then reads the two binary files, computes the matrix-vector product $result = A * f$, and stores the result as a binary file; finally, ReadResult.m reads the result and plots it. (The matrix is stored as a 2-D array.)
* [Final_1D.zip](http://www.math.ncku.edu.tw/~mhchen/HPC/Final_1D.zip): Final.zip stores the matrix as a 2-D array, so both MATLAB and C/C++ must declare 2-D arrays when reading and writing the binary files; if MATLAB stores a 2-D array but C/C++ reads it as a 1-D array, the result will be wrong. This version represents the matrix as a 1-D array throughout reading, writing, and computation. In addition, the matrix-generation code in WriteMat_1D.m and ReadResult_1D.m has been factored out into Matrix_Init.m, which is easier to maintain; however, the matrix-related data in readsys_1D.cpp still has to be edited by hand (this could be automated).

#### Other References

https://people.sc.fsu.edu/~jburkardt/cpp_src/poisson_serial/poisson_serial.html
https://stackoverflow.com/questions/14597733/how-to-save-a-c-readable-mat-file

----

### Solving the Linear System $Ax=b$

* [Solution of linear systems of equations in CUDA](http://www.orangeowlsolutions.com/archives/1233)
* [Sample Code - github](https://github.com/OrangeOwlSolutions/Linear-Algebra)
* [Solving linear systems AX = B with CUDA - stackoverflow](https://stackoverflow.com/questions/28794010/solving-linear-systems-ax-b-with-cuda)
* [cuSolver](http://docs.nvidia.com/cuda/cusolver/index.html#abstract)
* [cuSolver.zip](http://www.math.ncku.edu.tw/~mhchen/HPC/cuSolver.zip)
* [cuBLAS](http://docs.nvidia.com/cuda/cublas/index.html#abstract)

----

## Topic 2: Reaction Diffusion Eq

In the [Reaction Diffusion Eq](https://hackmd.io/s/SkM_YbBUg#reaction-diffusion-eq) section we provided a MATLAB program for the RD equation. Rewrite it as a CUDA version and output the results as images or a video.

## Topic 3: Other Possible Topics

First write CUDA kernels that simulate a mathematical model (for example, Perlin noise); then, following

* [Save an array as a BMP file](https://hackmd.io/s/SkM_YbBUg#save-an-array-as-an-bmp-file),
* [How to create a gif from the command line](https://hackmd.io/s/SkM_YbBUg#how-to-create-a-gif-from-the-command-line), or
* [NTU GPU course Lab1](https://hackmd.io/s/SkM_YbBUg#ntu-csie-gpu-programming-課程資料),

output the simulation results as images or a video.

---

## Debugging Exercise (6/7)

### Ex 02

* [bug_ex02.cu](http://www.math.ncku.edu.tw/~mhchen/HPC/bug_ex02.cu)

While debugging their own CG CUDA program, a student suspected the kernel was never executed at all, so they put a printf inside the kernel:

```
__global__ void M2() {
    printf("DONE \n");
}
```

When run, DONE is never printed.

* Download and run the program, try to find the cause of the bug, and fix it.

---

# Optimization

* [CUDA_Occupancy_Calculator.xls](http://www.math.ncku.edu.tw/~mhchen/HPC/CUDA_Occupancy_Calculator.xls)
* nvcc -arch=sm_61 --ptxas-options=-v

---

#### Stages of Debugging

![image alt](http://i.imgur.com/qsLx1Ip.jpg)
![image alt](https://ih1.redbubble.net/image.122493896.1433/sn,x1313-bg,ffffff.u13.jpg)