# GPU Hackathon Brief Tutorial

## How GPU Works in a Program

![](https://i.imgur.com/2NgTOPQ.png)

#### CUDA Kernel
* Parallel portions of an application execute as kernels (many threads)
* The heterogeneous nature of the CUDA programming model

  | Hardware | CUDA term | Role               |
  | -------- | --------- | ------------------ |
  | GPU      | Device    | Executes kernels   |
  | CPU      | Host      | Executes functions |

* **CPU** is a latency-oriented architecture -- optimized for serial tasks
  * ![](https://i.imgur.com/L5FbwfH.png)
* **GPU** is all about high throughput -- optimized for parallel tasks
  * ![](https://i.imgur.com/bjfp5c4.png)
* **Low latency vs. high throughput**
  * ![](https://i.imgur.com/1TidKAm.png)
* CUDA threads
  * Lightweight
  * Fast switching
  * Tens of thousands execute simultaneously
* A **kernel** is a function executed on the GPU
  * As an array of threads, in parallel
  * Each thread executes the same code, but can take a different path
  * ![](https://i.imgur.com/KJtm065.png)
* Threads $\rightarrow$ Blocks $\rightarrow$ Grid
  * A kernel is executed as a grid of blocks of threads
  * ![](https://i.imgur.com/S2hRSv6.png)

## Hello World Program in Parallel Computation

#### Code
In `saxpy.cu`,
```c=
#include <stdio.h>
#include <math.h>

/* Device code */
__global__ void saxpy(int n, float a, float *x, float *y){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

/* Host code */
int main(void){
    int N = 1<<20;

    float *x, *y, *d_x, *d_y;
    x = (float*)malloc(N*sizeof(float));
    y = (float*)malloc(N*sizeof(float));

    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

    /* Launch a kernel: perform SAXPY on 1M elements */
    saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
    printf("Max error: %f\n", maxError);

    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y);
}
```

#### Remarks
1. Within the device code, `__global__` means
    * Runs on the device
    * Called from the host code
2. ![](https://i.imgur.com/4R4QgDg.png)
3. The kernel launch -- execute a function on the GPU
    * `saxpy<<<grid_dim, block_dim>>>(...)`
    * Specifies how many device threads execute the kernel in parallel
    * Asynchronous
    * Refer to [execution configuration](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration)
    ```c=
    kernel_name<<<grid_dim, threadblock_dim, dynamic_shared_memory_size, stream>>>(...);
    ```
4. CUDA GPUs run kernels using blocks of threads whose size is a multiple of 32
5. The data transfers between the host and device
    * `cudaMemcpy()`
    * Synchronous (blocking) transfers
    * Could also use `cudaDeviceSynchronize()`
6. [Grid-stride loop](https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/)
    ```c
    __global__ void saxpy(int n, float a, float *x, float *y){
        int index  = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < n; i += stride)
            y[i] = a*x[i] + y[i];
    }
    ```

#### Compile and Run
1. `module load nvhpc_sdk/20.11`
2. `nvcc saxpy.cu -o saxpy`
3. `srun -p v100 --gres=gpu:2 ./saxpy`
4. `nvprof ./saxpy` -- see also the event-based timing sketch after this list
5. The profile
   ![](https://i.imgur.com/dCIR7wx.png)
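Besides `nvprof`, a kernel can also be timed directly with CUDA events. The snippet below is a minimal sketch (not part of the original `saxpy.cu`) of how the launch inside `main()` could be wrapped:

```c=
/* Hypothetical addition to main() in saxpy.cu: time the kernel with CUDA events */
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                            // enqueue start event in the default stream
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);    // same launch as above
cudaEventRecord(stop);                             // enqueue stop event after the kernel

cudaEventSynchronize(stop);                        // block the host until the stop event completes
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);  // elapsed time between the two events, in ms
printf("saxpy kernel time: %f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded on the device, the measurement is not distorted by the asynchronous kernel launch on the host side.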
## Query Device Properties
```c=
#include <stdio.h>

int main(){
    int nDevices;

    cudaError_t err = cudaGetDeviceCount(&nDevices);
    if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));

    for (int i = 0; i < nDevices; ++i){
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device Number: %d\n", i);
        printf("  Device name: %s\n", prop.name);
        printf("  Memory Clock Rate (kHz): %d\n", prop.memoryClockRate);
        printf("  Memory Bus Width (bits): %d\n", prop.memoryBusWidth);
        printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
               2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
    }
}
```
Info for our cluster's SLURM setup:
![](https://i.imgur.com/fc5mibZ.png)

## Optimize Data Transfers in CUDA

#### Minimizing Data Transfers
* ![](https://i.imgur.com/JavYKmP.png)
* [Allocate pinned host memory](https://stackoverflow.com/questions/5736968/why-is-cuda-pinned-memory-so-fast)
  * `cudaMallocHost()`, `cudaHostAlloc()`, `cudaFreeHost()`
  * Always check for errors, because pinned memory allocation can fail
  * Take a look at [bandwidthtest.cu](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/optimize-data-transfers/bandwidthtest.cu)
  * My result: ![](https://i.imgur.com/y7vioFf.png)

---

#### Batching Small Transfers
* Preferable to batch many small transfers together into a single transfer

---

#### Overlapping Kernel Execution and Data Transfers

##### CUDA Streams
* All device operations in CUDA run in a stream
* The default stream is synchronizing, *i.e.*, no operation in the default stream will begin until all previously issued operations in any stream on the device have completed
  ```c=
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
  myfunc();   // independent host code: can overlap with the asynchronous kernel launch
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  ```
* Non-default streams (see the usage sketch after this list)
  ```c=
  cudaStream_t stream1;
  cudaError_t result;
  result = cudaStreamCreate(&stream1);                                   // create a stream
  result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);  // asynchronous copy issued in stream1
  result = cudaStreamDestroy(stream1);                                   // destroy the stream when done
  ```
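Putting these calls together, a minimal usage sketch (an assumed workflow, not taken from the linked samples) issues an asynchronous copy and a kernel launch on a non-default stream and then waits only on that stream. It reuses the `saxpy` kernel from above and pinned host memory so that the copy is truly asynchronous:

```c=
/* Hypothetical sketch: asynchronous copy + kernel launch in a non-default stream */
int N = 1<<20;
size_t bytes = N*sizeof(float);

float *a, *d_a;
cudaMallocHost(&a, bytes);   // pinned host memory (required for async host<->device copies)
cudaMalloc(&d_a, bytes);
for (int i = 0; i < N; i++) a[i] = 1.0f;

cudaStream_t stream1;
cudaStreamCreate(&stream1);

cudaMemcpyAsync(d_a, a, bytes, cudaMemcpyHostToDevice, stream1);  // returns to the host immediately
saxpy<<<(N+255)/256, 256, 0, stream1>>>(N, 2.0f, d_a, d_a);       // kernel issued in the same stream
cudaStreamSynchronize(stream1);                                   // wait only for work in stream1

cudaStreamDestroy(stream1);
cudaFreeHost(a);
cudaFree(d_a);
```

`cudaStreamSynchronize()` blocks only on `stream1`, a lighter-weight alternative to the `cudaDeviceSynchronize()` discussed below.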
##### Synchronization with streams
* Sometimes we need to synchronize the host code with operations in a stream
* Heavy-hammer way -- `cudaDeviceSynchronize()`
  * Blocks the host code until all previously issued operations on the device have completed
  * Could hurt performance
* [Other, less severe ways](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/)

##### Overlap data transfer
* Take a look at [async.cu](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu)
* My result: ![](https://i.imgur.com/meuEYlx.png)
* [Further reading on concurrency between data transfers and kernel execution](https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/)

## Shared Memory in CUDA

#### Thread Synchronization
* Threads within a block can have race conditions
  * e.g., thread B has not finished writing its element before thread A tries to read it
* `__syncthreads()`
  * A thread's execution can only proceed past a `__syncthreads()` after all threads in its block have executed the `__syncthreads()`
* Example -- [shared-memory.cu](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/shared-memory/shared-memory.cu)
  * **Static shared memory** -- explicitly declare an array of a fixed size
    ```c=
    __global__ void staticReverse(int *d, int n){
        __shared__ int s[64];
        int t  = threadIdx.x;
        int tr = n-t-1;
        s[t] = d[t];
        __syncthreads();
        d[t] = s[tr];
    }
    ```
  * **Dynamic shared memory** -- size is given by the third execution-configuration argument
    ```c=
    __global__ void dynamicReverse(int *d, int n){
        extern __shared__ int s[];
        ...
    }

    dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
    ```
* [Further info about shared memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared)

## An Efficient Matrix Transpose

#### Coalesced Memory Access
* Whenever a thread on the GPU reads or writes global memory, it always accesses a large chunk of memory at once
  * Other threads can reuse that chunk
* The GPU runs more efficiently when **threads read or write contiguous global memory locations**
* Threads are executed in finer groups of 32 called **warps**
  * Multiprocessors on the GPU execute instructions for each warp in a **SIMD** (Single Instruction, Multiple Data) fashion
* [**Spatial locality**](https://en.wikipedia.org/wiki/Locality_of_reference)

#### Matrix Transpose
* Take a look at the code: [matrix transpose](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/transpose/transpose.cu)
  * `TILE_DIM=32`, `BLOCK_ROWS=8`
* **Naive matrix transpose** -- performance suffers from lack of spatial locality in the writes
  ```c=
  __global__ void transposeNaive(float *odata, const float *idata){
      int x = blockIdx.x * TILE_DIM + threadIdx.x;
      int y = blockIdx.y * TILE_DIM + threadIdx.y;
      int width = gridDim.x * TILE_DIM;

      for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
          odata[x*width + (y+j)] = idata[(y+j)*width + x];
  }
  ```
* **Tiled transpose** -- stage each tile in shared memory so both global reads and writes are coalesced
  ![](https://i.imgur.com/Ocx1xBw.png)
  ```c=
  __global__ void transposeCoalesced(float *odata, const float *idata){
      __shared__ float tile[TILE_DIM][TILE_DIM];

      int x = blockIdx.x * TILE_DIM + threadIdx.x;
      int y = blockIdx.y * TILE_DIM + threadIdx.y;
      int width = gridDim.x * TILE_DIM;

      for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
          tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

      __syncthreads();

      x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
      y = blockIdx.x * TILE_DIM + threadIdx.y;

      for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
          odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
  }
  ```
* **Shared memory bank conflicts** -- a padded-tile sketch follows this list
  * ![](https://i.imgur.com/wGUgc8j.png)
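A common fix for the bank conflicts in `transposeCoalesced` is to pad the shared-memory tile by one column, so that the elements of a tile column fall into different banks. The kernel below is a minimal sketch of that padded-tile variant (following the linked transpose sample); only the tile declaration changes:

```c=
__global__ void transposeNoBankConflicts(float *odata, const float *idata){
    // The extra column shifts each row by one bank, so reading a tile column
    // (tile[threadIdx.x][...]) no longer hits a single shared-memory bank.
    __shared__ float tile[TILE_DIM][TILE_DIM+1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}
```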
## Finite Difference Methods
The code: [finite difference](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/finite-difference/finite-difference.cu)

#### Finite Difference
* A weighted sum of function values at neighboring points approximates the derivative at a given point:
  $$\frac{\partial f(x,y,z)}{\partial x} \approx \frac{1}{\Delta x} \sum_{i=-N}^{N} C_i\, f(x + i\,\Delta x, y, z)$$
* Here we use an eighth-order finite difference:
  $$\frac{\partial f(x,y,z)}{\partial x} \approx a_x\,(f_{i+1,j,k} - f_{i-1,j,k}) + b_x\,(f_{i+2,j,k} - f_{i-2,j,k}) + c_x\,(f_{i+3,j,k} - f_{i-3,j,k}) + d_x\,(f_{i+4,j,k} - f_{i-4,j,k})$$
  with
  $$a_x = \frac{4}{5\Delta x}, \quad b_x = -\frac{1}{5\Delta x}, \quad c_x = \frac{4}{105\Delta x}, \quad d_x = -\frac{1}{280\Delta x}$$
* On a $64^3$ periodic grid

#### Data Reuse and Shared Memory
* Tiling approach -- each thread block loads a tile of data from the multidimensional grid into shared memory, so that each thread in the block can access all elements of the shared-memory tile
* ![](https://i.imgur.com/q7HbVHa.png)
* ![](https://i.imgur.com/ZejHgdh.png)

## References
1. [An Even Easier Introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)
2. [An Easy Introduction to CUDA C and C++](https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/)
3. [How to Implement Performance Metrics in CUDA C/C++](https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/)
4. [How to Query Device Properties and Handle Errors in CUDA C/C++](https://developer.nvidia.com/blog/how-query-device-properties-and-handle-errors-cuda-cc/)
5. [How to Optimize Data Transfers in CUDA C/C++](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/)
6. [How to Overlap Data Transfers in CUDA C/C++](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/)
7. [How to Access Global Memory Efficiently in CUDA C/C++ Kernels](https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/)
8. [Using Shared Memory in CUDA C/C++](https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/)
9. [An Efficient Matrix Transpose](https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/)
10. [Finite Difference Methods, Part 1](https://developer.nvidia.com/blog/finite-difference-methods-cuda-cc-part-1/)
11. [Finite Difference Methods, Part 2](https://developer.nvidia.com/blog/finite-difference-methods-cuda-c-part-2/)