# GPU Hackathon Brief Tutorial
## How GPU Works in a Program

#### CUDA Kernel
* Parallel portions of an application execute on the device as kernels (many threads)
* The heterogeneous nature of the CUDA programming model:
| Processor | CUDA term | Role |
| -------- | -------- | -------- |
| GPU | Device | Executes kernels |
| CPU | Host | Executes functions |
* **CPU** is a latency-oriented architecture -- optimized for serial tasks
* **GPU** is a throughput-oriented architecture -- optimized for parallel tasks
* **Low latency vs. high throughput**
* CUDA threads
    * Lightweight
    * Fast switching
    * Tens of thousands execute simultaneously
* A **kernel** is a function executed on the GPU
    * Runs as an array of threads, in parallel
    * Each thread executes the same code, but may take a different path
* Threads $\rightarrow$ Blocks $\rightarrow$ Grid
    * A kernel is executed as a grid of blocks of threads
## Hello World Program in Parallel Computation
#### Code
In ``saxpy.cu``,
```c=
#include <stdio.h>
#include <math.h>

/* Device code: each thread handles one element */
__global__ void saxpy(int n, float a, float *x, float *y){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void){
    int N = 1<<20;
    /* Host code */
    float *x, *y, *d_x, *d_y;
    x = (float*)malloc(N*sizeof(float));
    y = (float*)malloc(N*sizeof(float));
    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));
    /* Initialize the host arrays */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    /* Copy the input data from host to device */
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
    /* Launch the kernel: perform SAXPY on 1M elements */
    saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
    /* Copy the result back to the host */
    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
    /* Check the result: every element should equal 4.0f */
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
    printf("Max error: %f\n", maxError);
    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y);
}
```
#### Remarks
1. Within the device code, `__global__` means
    * Runs on the device
    * Called from the host code
2. The kernel launch -- executes a function on the GPU
    * `saxpy<<<grid_dim, block_dim>>>(...)`
    * Specifies how many device threads execute the kernel in parallel
    * Asynchronous with respect to the host (a launch sketch follows this list)
    * Refer to [execution configuration](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration)
```c=
kernel_name<<<grid_dim, threadblock_dim, dynamic_shared_memory_size, stream>>>(...);
```
3. CUDA GPUs run kernels using blocks of threads that are a multiple of 32 in size
4. Data transfers between the host and device
    * `cudaMemcpy()` performs synchronous (blocking) transfers
    * Could also use `cudaDeviceSynchronize()` to block the host until all previously issued device work has completed
5. [Grid-Stride Loop](https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/)
```c
__global__ void saxpy(int n, float a, float *x, float *y){
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;  /* total number of threads in the grid */
    /* each thread strides through the array, so any grid size covers all n elements */
    for (int i = index; i < n; i += stride)
        y[i] = a*x[i] + y[i];
}
```
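Tying points 2 and 5 together, here is a minimal sketch (my own, not from the original notes) of launching the grid-stride version with a device-dependent grid size and basic error checking; the 32-blocks-per-SM heuristic and the use of `cudaGetLastError()` are assumptions on my part:
```c
/* Sketch: goes inside main() after d_x and d_y have been allocated and filled. */
int blockSize = 256;
int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
int numBlocks = 32 * numSMs;                  /* enough blocks to keep the GPU busy */

saxpy<<<numBlocks, blockSize>>>(N, 2.0f, d_x, d_y);

cudaError_t launchErr = cudaGetLastError();   /* errors from the launch itself */
if (launchErr != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(launchErr));
cudaDeviceSynchronize();                      /* wait for the asynchronous kernel */
```
Because the launch is asynchronous, errors raised inside the kernel only surface after a synchronization point such as `cudaDeviceSynchronize()` or a blocking `cudaMemcpy`.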
#### Compile and Run
1. `module load nvhpc_sdk/20.11`
2. `nvcc saxpy.cu -o saxpy`
3. `srun -p v100 --gres=gpu:2 ./saxpy`
4. `nvprof ./saxpy`
5. The profiler output reports the time spent in each kernel and in the host-device memory copies
## Query Device Properties
```c=
#include <stdio.h>

int main() {
    int nDevices;
    /* Query how many CUDA-capable devices are visible */
    cudaError_t err = cudaGetDeviceCount(&nDevices);
    if (err != cudaSuccess)
        printf("%s\n", cudaGetErrorString(err));
    for (int i = 0; i < nDevices; ++i){
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device Number: %d\n", i);
        printf("  Device name: %s\n", prop.name);
        printf("  Memory Clock Rate (KHz): %d\n", prop.memoryClockRate);
        printf("  Memory Bus Width (bits): %d\n", prop.memoryBusWidth);
        /* DDR memory: 2 transfers per clock; bus width converted to bytes; KHz -> GB/s */
        printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
               2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
    }
}
```
Running this on a GPU node of our SLURM cluster (e.g. with `srun` as above) prints the info for the allocated devices.

## Optimize Data Transfers in CUDA
#### Minimizing Data Transfers
* Minimize the amount of data transferred between the host and device whenever possible
* [Allocate pinned host memory](https://stackoverflow.com/questions/5736968/why-is-cuda-pinned-memory-so-fast) (a sketch follows this list)
    * `cudaMallocHost()`, `cudaHostAlloc()`, `cudaFreeHost()`
    * Always check for errors, because pinned memory allocation can fail
* Take a look at [bandwidthtest.cu](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/optimize-data-transfers/bandwidthtest.cu)
* From my run, pinned-memory transfers reach a noticeably higher bandwidth than pageable-memory transfers

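A minimal sketch of the pinned-memory pattern (the buffer size and variable names are illustrative, not taken from `bandwidthtest.cu`):
```c
/* Sketch: allocate pinned (page-locked) host memory and check for failure. */
#include <stdio.h>

int main(void){
    const int N = 1 << 20;
    float *h_pinned = NULL;

    /* Pinned allocation can fail, so always check the return code */
    cudaError_t status = cudaMallocHost((void**)&h_pinned, N * sizeof(float));
    if (status != cudaSuccess){
        printf("cudaMallocHost failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    /* Transfers from pinned memory avoid the driver's extra staging copy */
    cudaMemcpy(d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);   /* pinned memory is released with cudaFreeHost */
    return 0;
}
```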
---
#### Batching Small Transfers
* It is preferable to batch many small transfers together into a single larger transfer (see the sketch below)
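As a hedged illustration (the array sizes and names are made up), packing two small arrays into one contiguous buffer replaces two transfers, each paying its own latency overhead, with a single one:
```c
/* Sketch: instead of two small cudaMemcpy calls, pack both arrays into one
   contiguous host buffer and issue a single transfer. */
const int NA = 1000, NB = 2000;
float h_packed[NA + NB];          /* h_packed[0..NA-1] holds a, the rest holds b */
float *d_packed;
cudaMalloc(&d_packed, (NA + NB) * sizeof(float));

/* one transfer instead of two */
cudaMemcpy(d_packed, h_packed, (NA + NB) * sizeof(float), cudaMemcpyHostToDevice);

float *d_a = d_packed;            /* device-side views into the packed buffer */
float *d_b = d_packed + NA;
```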
---
#### Overlapping Kernel Execution and Data Transfers
##### CUDA Streams
* All device operations in CUDA run in a stream
* The default stream is synchronizing, i.e., no operation in the default stream will begin until all previously issued operations in any stream on the device have completed
* In the example below, the transfers and the kernel all go to the default stream; from the host's point of view the kernel launch is asynchronous, so `myfunc()` on the CPU overlaps with the kernel execution on the GPU, while the final `cudaMemcpy` blocks until the kernel has finished
```c=
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
myfunc();
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
```
* Non-default streams
```c=
cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1);
/* asynchronous copy: N is in bytes, and the host buffer must be pinned to actually overlap */
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
result = cudaStreamDestroy(stream1);
```
##### Synchronization with streams
* Sometimes we need to synchronize the host code with operations in a stream.
* The heavy-hammer way -- `cudaDeviceSynchronize()`
    * Blocks the host code until all previously issued operations on the device have completed
    * Can hurt performance, because it stalls the host until the whole device is idle
* [Other, less severe ways](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/) -- a sketch of the stream and event primitives follows this list
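A minimal sketch of the lighter-weight options (the stream and event names are mine; `cudaStreamSynchronize`, `cudaEventRecord`, `cudaStreamWaitEvent`, and `cudaEventSynchronize` are standard CUDA runtime calls):
```c
/* Sketch: stream- and event-level synchronization instead of cudaDeviceSynchronize().
   Reuses d_a, a, N from the snippet above. */
cudaStream_t stream1, stream2;
cudaEvent_t event;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaEventCreate(&event);

cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(event, stream1);          /* marks a point in stream1 */

/* block the host only until everything queued in stream1 has finished */
cudaStreamSynchronize(stream1);

/* or: make stream2 wait for the event without blocking the host at all */
cudaStreamWaitEvent(stream2, event, 0);

/* or: block the host until just the recorded event has completed */
cudaEventSynchronize(event);

cudaEventDestroy(event);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```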
##### Overlap data transfer
* Take a look at [async.cu](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu)
* From my run, the asynchronous versions (overlapping transfers with kernel execution) finish noticeably faster than the sequential version; a minimal sketch of the pattern follows this list

* [Further about concurrency between data transfer and kernel execution](https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/)
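A minimal sketch of the overlap pattern used in `async.cu` (the chunk count, stream count, and the reuse of the `saxpy` kernel and `x`, `y`, `d_x`, `d_y` from above are my assumptions; the host arrays must be pinned for the copies to be truly asynchronous):
```c
/* Sketch: split the work into chunks and issue copy/kernel/copy for each chunk into
   its own stream, so transfers of one chunk overlap with kernels of another. */
const int nStreams = 4;
const int chunk = N / nStreams;                 /* assume N divisible by nStreams */
cudaStream_t streams[nStreams];
for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

for (int s = 0; s < nStreams; ++s) {
    int offset = s * chunk;
    cudaMemcpyAsync(&d_x[offset], &x[offset], chunk*sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    cudaMemcpyAsync(&d_y[offset], &y[offset], chunk*sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    saxpy<<<(chunk+255)/256, 256, 0, streams[s]>>>(chunk, 2.0f,
                                                   &d_x[offset], &d_y[offset]);
    cudaMemcpyAsync(&y[offset], &d_y[offset], chunk*sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}

for (int s = 0; s < nStreams; ++s) {
    cudaStreamSynchronize(streams[s]);          /* wait for each stream's chunk */
    cudaStreamDestroy(streams[s]);
}
```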
## Shared Memory in CUDA
#### Thread Synchronization
* Threads within a block can share data through shared memory, which creates potential race conditions
    * e.g., thread A may try to read an element before thread B has finished writing it
* `__syncthreads()`
    * A thread’s execution can only proceed past a `__syncthreads()` after all threads in its block have executed the `__syncthreads()`
* Example -- [shared-memory.cu](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/shared-memory/shared-memory.cu)
* **Static shared memory** -- declare a fixed-size `__shared__` array inside the kernel
```c=
__global__ void staticReverse(int *d, int n){
    /* fixed-size shared array: n must be at most 64 here */
    __shared__ int s[64];
    int t = threadIdx.x;
    int tr = n-t-1;
    s[t] = d[t];
    __syncthreads();   /* ensure every element is written before any is read back */
    d[t] = s[tr];
}
```
* **Dynamic shared memory** -- the size is supplied at launch time as the third execution-configuration parameter
```c=
__global__ void dynamicReverse(int *d, int n){
    /* unsized extern array: its size comes from the kernel launch */
    extern __shared__ int s[];
    ...   /* same body as staticReverse above */
}
/* third launch parameter = dynamic shared memory size in bytes */
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
```
* [Further info about shared memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared)
## An Efficient Matrix Transpose
#### Coalesce Memory Access
* Whenever a thread on the GPU reads or writes global memory, it always accesses a large chunk of memory at once
    * Other threads in the same access can reuse that chunk
* The GPU runs most efficiently when **threads read or write contiguous global memory locations** (coalesced access) -- see the sketch after this list
* Threads are grouped into finer units of 32 threads -- **warps**
    * Multiprocessors on the GPU execute instructions for each warp in a **SIMD** (Single Instruction, Multiple Data) fashion
* [**Spatial Locality**](https://en.wikipedia.org/wiki/Locality_of_reference)
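A hedged sketch of the difference (the kernel names and `stride` parameter are mine): adjacent threads touching adjacent elements coalesce into a few memory transactions, while a strided pattern spreads each warp's accesses over many transactions:
```c
/* Sketch: coalesced vs. strided global memory access. */
__global__ void copyCoalesced(float *out, const float *in, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 /* thread i -> element i: coalesced */
}

__global__ void copyStrided(float *out, const float *in, int n, int stride){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = in[i * stride];  /* scattered: uncoalesced */
}
```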
#### Matrix Transpose
* Take a look at the code [Matrix transpose](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/transpose/transpose.cu)
* `TILE_DIM = 32`, `BLOCK_ROWS = 8`
* **Naive matrix transpose** -- reads from `idata` are coalesced, but writes to `odata` are strided by `width`, so performance suffers from the lack of spatial locality
```c=
__global__ void transposeNaive(float *odata, const float *idata){
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;
    /* coalesced read, strided write */
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[x*width + (y+j)] = idata[(y+j)*width + x];
}
```
* **Tiled transpose** -- stage the tile in shared memory so that both the global read and the global write are coalesced

```c=
__global__ void transposeCoalesced(float *odata, const float *idata)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;
    /* coalesced read of a tile into shared memory */
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];
    __syncthreads();
    x = blockIdx.y * TILE_DIM + threadIdx.x; // transpose block offset
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    /* coalesced write of the transposed tile back to global memory */
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}
```
* **Shared memory bank conflicts**
    * In `transposeCoalesced`, all elements of a tile column map to the same shared-memory bank, so reading a column during the write phase is serialized; padding the tile removes the conflicts (see the sketch below)
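A minimal sketch of the padding fix; as far as I recall, the `transposeNoBankConflicts` kernel in transpose.cu differs from `transposeCoalesced` only in this declaration:
```c
/* Padding the tile by one column shifts each row to a different bank,
   so column-wise reads in the write phase no longer conflict. */
__shared__ float tile[TILE_DIM][TILE_DIM+1];
```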
## Finite Difference Methods
The Code [finite difference](https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/finite-difference/finite-difference.cu)
#### Finite Difference
* A weighted summation of function values at neighboring points to approximate the derivative at a particular point
* $\frac{\partial f(x,y,z)}{\partial x} \approx \frac{1}{\Delta x} \sum _{i=-N}^{N} C_i f(x + i \Delta x, y, z)$
* Here, using an eighth-order finite difference:
$\frac{\partial f(x,y,z)}{\partial x} \approx a_{x}(f_{i+1,j,k} - f_{i-1,j,k}) + b_{x}(f_{i+2,j,k} - f_{i-2,j,k}) + c_{x}(f_{i+3,j,k} - f_{i-3,j,k}) + d_{x}(f_{i+4,j,k} - f_{i-4,j,k})$
$a_x = \frac{4}{5\Delta x}, \quad b_x = -\frac{1}{5\Delta x}, \quad c_x = \frac{4}{105\Delta x}, \quad d_x = -\frac{1}{280\Delta x}$
* On $64^3$ periodic grids
#### Data Reuse and Shared Memory
* Tiling approach -- each thread block loads a tile of data (a pencil along the derivative direction) from the multidimensional grid into shared memory, so that every thread in the block can access all elements of the shared-memory tile when applying the stencil; a minimal sketch follows

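A minimal sketch of a shared-memory x-derivative kernel along these lines (this is my simplified version, not the kernel in finite-difference.cu: one thread block per x-pencil, a periodic 4-point halo, and the coefficients $a_x,\dots,d_x$ passed as arguments):
```c
#define NX 64          /* grid size in x (also the pencil length) */
#define RAD 4          /* stencil radius of the eighth-order difference */

/* One block handles one x-pencil (fixed j,k); x is the fastest-varying index,
   so global reads/writes along the pencil are coalesced. ax..dx are the
   coefficients above, with the 1/dx factor already included. */
__global__ void derivative_x(const float *f, float *df,
                             float ax, float bx, float cx, float dx)
{
    __shared__ float s_f[NX + 2*RAD];     /* pencil plus periodic halo */

    int i  = threadIdx.x;                 /* x index inside the pencil */
    int jk = blockIdx.x;                  /* which (j,k) line this block handles */
    int g  = jk * NX + i;                 /* global index */
    int si = i + RAD;                     /* shifted index into shared memory */

    s_f[si] = f[g];
    if (i < RAD) {                        /* fill the periodic halo cells */
        s_f[i]            = f[jk * NX + (NX - RAD + i)];
        s_f[NX + RAD + i] = f[jk * NX + i];
    }
    __syncthreads();                      /* whole pencil is now in shared memory */

    df[g] = ax * (s_f[si+1] - s_f[si-1])
          + bx * (s_f[si+2] - s_f[si-2])
          + cx * (s_f[si+3] - s_f[si-3])
          + dx * (s_f[si+4] - s_f[si-4]);
}

/* launch: one thread per point of a pencil, one block per (j,k) line, e.g.
   derivative_x<<<NY*NZ, NX>>>(d_f, d_df, ax, bx, cx, dx);  (NY, NZ assumed) */
```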
## Reference
1. [An Even Easier Introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)
2. [An Easy Introduction to CUDA C and C++](https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/)
3. [How to Implement Performance Metrics in CUDA C/C++](https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/)
4. [How to Query Device Properties and Handle Errors in CUDA C/C++](https://developer.nvidia.com/blog/how-query-device-properties-and-handle-errors-cuda-cc/)
5. [How to Optimize Data Transfers in CUDA C/C++](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/)
6. [How to Overlap Data Transfers in CUDA C/C++](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/)
7. [How to Access Global Memory Efficiently in CUDA C/C++ Kernels](https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/)
8. [Using Shared Memory in CUDA C/C++](https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/)
9. [An Efficient Matrix Transpose in CUDA C/C++](https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/)
10. [Finite Difference Methods in CUDA C/C++, Part 1](https://developer.nvidia.com/blog/finite-difference-methods-cuda-cc-part-1/)
11. [Finite Difference Methods in CUDA C/C++, Part 2](https://developer.nvidia.com/blog/finite-difference-methods-cuda-c-part-2/)