> ***Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing***
> * https://www.packtpub.com/product/learn-cudaprogramming/9781788996242
>
> ***Interpreting the "CUDA C Best Practices Guide"***
> * https://www.zhihu.com/column/c_1323748466020315136
>
> ***Introduction to CUDA Programming and Optimization***
> * https://www.zhihu.com/column/c_1522503697624346624
*This post follows the above books to understand and build up CUDA programming skills. In the final part of the chapter, there will be an example of an AI framework and how I plan to use the skills learned from the book to improve the inference framework.*
*OpenCL programming and CUDA programming may get separate posts; there should be cross-references between them.*
:::info
:information_source: **Compiler Setting**
***nvcc***
***-run*** # Run the program after compilation
***-m64*** # Build a 64-bit binary
***-I/usr/local/cuda/samples/common/inc*** # Include the CUDA sample helper headers
***-gencode arch=compute_75,code=sm_75*** # 7.5 for the RTX 20 series (Turing)
***-o executable_name ./filename.cu***
***-Xcompiler -fopenmp*** # Compile the host code with OpenMP
:::
## GPU Hardware & Software Mapping

### CUDA GPU Indexing and OpenCL Indexing
| | OpenCL | CUDA C |
|:--------------------:| ------------------------------ | --------------------------------------------------- |
| Thread Block ID | `get_group_id(uint dimIdx)` | `blockIdx.[xyz]` |
| Thread Local ID | `get_local_id(uint dimIdx)` | `threadIdx.[xyz]` |
| Num Threads in Block | `get_local_size(uint dimIdx)` | `blockDim.[xyz]` |
| Num Blocks in Grid | `get_num_groups(uint dimIdx)` | `gridDim.[xyz]` |
| Thread Global ID | `get_global_id(uint dimIdx)` | `blockIdx.[xyz] * blockDim.[xyz] + threadIdx.[xyz]` |
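As a small illustration of the mapping (my own toy kernel, not from either book), the CUDA column of the table is simply what one writes by hand inside a kernel:
```cpp
// Hypothetical kernel: the index expressions below correspond to
// get_global_id(0) / get_global_id(1) in OpenCL.
__global__ void writeIdentity(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // global ID in dimension 0
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // global ID in dimension 1
    if (x < width && y < height)
        out[y * width + x] = (x == y) ? 1.0f : 0.0f;
}
```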
:::info
:bulb: **GPU Architecture Evolution**
**NVIDIA GPU Architecture Evolution over the Past Decade: from Fermi to Ampere**
> https://zhuanlan.zhihu.com/p/413145211
**A Review of GPU Architecture Changes from an AI-System Perspective: from Fermi to Ampere (V1.2)**
> https://zhuanlan.zhihu.com/p/463629676
>
:::
## Memory Management
### Memory Hierarchy

:::warning
:warning: **Important**
> *Each memory type differs from the others by orders of magnitude in bandwidth, and so does the latency to access it.*
*Shared memory* is very important in CUDA kernel programs because it can be managed explicitly by the programmer.
When programming with CUDA, the ***size, latency, throughput and visibility*** of the different memories should be kept in mind.
:::
### Coalesced Memory Access
A warp (32 threads) should access contiguous data. This way, fewer memory transactions are needed to serve the warp, which gives better performance and more efficient use of the memory bus.
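A minimal sketch of the difference (my own toy kernels, names are hypothetical):
```cpp
// Coalesced: consecutive threads of a warp touch consecutive addresses,
// so the warp's 32 loads are served by a few wide memory transactions.
__global__ void copyCoalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so the same warp may need one transaction per thread.
__global__ void copyStrided(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        dst[i] = src[i];
}
```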

### Shared Memory
Shared memory is a **user-managed cache**. By *loading global memory into shared memory with a coalesced access pattern*, a program can then reuse the data from shared memory at much higher bandwidth and lower latency.
Shared memory is local to a thread block and can be used for communication between the threads of that block.
:::info
:bulb: **Shared Memory Example**
The first example is from the book and from a post on the NVIDIA Developer Blog: https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
The second example is in my GitHub repository, AI `Framework/conv2d.cu`.
**Matrix transpose on shared memory**

```cpp
__global__ void fCUDA::matrixTransposeNaive(int *input, int *output)
{
    int x_global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int y_global_idx = blockIdx.y * blockDim.y + threadIdx.y;
    // input and output index mapping
    int inputIdx = y_global_idx * MATRIX_SIZE + x_global_idx;
    int outputIdx = x_global_idx * MATRIX_SIZE + y_global_idx;
    // At least one of them will be uncoalesced access
    output[outputIdx] = input[inputIdx];
}
```

The tile is stored in shared memory, and the access pattern can be observed in the figure above. Notice that the accesses to both `idata` (`input`) and `odata` (`output`) are coalesced.
```cpp
// In this algorithm, the shared memory tile must be square:
// blockDim.x == blockDim.y == BLOCK_SIZE
__global__ void fCUDA::matrixTransposeShared(int *input, int *output)
{
    __shared__ int sharedMem[BLOCK_SIZE][BLOCK_SIZE];
    int x_global_idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int y_global_idx = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    // transposed global memory index (block indices swapped)
    int x_transpose_idx = blockIdx.y * BLOCK_SIZE + threadIdx.x;
    int y_transpose_idx = blockIdx.x * BLOCK_SIZE + threadIdx.y;
    int inputIdx = y_global_idx * MATRIX_SIZE + x_global_idx;
    int outputIdx = y_transpose_idx * MATRIX_SIZE + x_transpose_idx;
    // global input to shared memory (coalesced global read)
    sharedMem[threadIdx.x][threadIdx.y] = input[inputIdx];
    // wait until the whole tile is loaded before other threads read it
    __syncthreads();
    // shared memory to global output (coalesced global write, column read from sharedMem)
    output[outputIdx] = sharedMem[threadIdx.y][threadIdx.x];
}
```
**2D Convolution Layer**
*Please refer to the github repository for more info*
Uncoalesced access to W (model weight)
```cpp
int indexW = 0;
for (; indexW < wSubSize; ++indexW)
    sharedWeight[threadIdx.z * wSubSize + indexW] = W[cOut * wSubSize + indexW];
__syncthreads();
```
Coalesced access to W (model weight)
```cpp
int sharedNum = wSubSize * blockDim.z;
int threadLocalIdx = threadIdx.z * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
int threadLocalDim = blockDim.x * blockDim.y * blockDim.z;
for (int i = 0; i < sharedNum; i += threadLocalDim) {
    if ((i + threadLocalIdx) < sharedNum)
        sharedWeight[i + threadLocalIdx] = W[(sharedNum * blockIdx.z) + i + threadLocalIdx];
}
__syncthreads();
```
:::
### Shared Memory Bank Conflicts
Shared memory is organized into banks, and each bank can service only one address per cycle. With 32 banks that are each 4 bytes wide, we can have 32 simultaneous 4-byte accesses as long as they fall into different banks.

Padding can be added to avoid bank conflicts:
```cpp
__shared__ int sharedMem[BLOCK_SIZE][BLOCK_SIZE+1];
```
With this allocation, the elements of one tile column (which would sit at linear offsets 0, 32, 64, ... without padding, and therefore in the same bank) now fall into different banks.
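Applied to the `matrixTransposeShared` kernel above, only the declaration changes; a sketch assuming the same `BLOCK_SIZE` / `MATRIX_SIZE` constants:
```cpp
__global__ void matrixTransposeSharedPadded(int *input, int *output)
{
    // +1 padding: the elements of one tile column no longer map to the same bank,
    // so the column-wise store sharedMem[threadIdx.x][threadIdx.y] is conflict-free.
    __shared__ int sharedMem[BLOCK_SIZE][BLOCK_SIZE + 1];

    int x_global_idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int y_global_idx = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int x_transpose_idx = blockIdx.y * BLOCK_SIZE + threadIdx.x;
    int y_transpose_idx = blockIdx.x * BLOCK_SIZE + threadIdx.y;

    sharedMem[threadIdx.x][threadIdx.y] = input[y_global_idx * MATRIX_SIZE + x_global_idx];
    __syncthreads();
    output[y_transpose_idx * MATRIX_SIZE + x_transpose_idx] = sharedMem[threadIdx.y][threadIdx.x];
}
```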

### Registers in GPU
Every SM has a fixed set of registers, and each register has the scope of a single thread. The `nvcc` compiler tries to find the best number of registers per thread. The number of registers used by a CUDA kernel restricts the number of threads that can be scheduled on an SM.
```cpp
__global__ void cudaKernel(float* X, float* Y)
{
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    X[global_idx] += Y[global_idx];
}
```
:::info
:bulb: **Parameter Example**
* `global_idx` : on Register
* `X`, `Y`: on Shared Memory or Register
* `X[global_idx]`: on Global Memory
:::
:::warning
:warning: **Register Spills**
If the local variables and intermediate values in a CUDA kernel exceed the number of registers available, the data is spilled to **local memory (backed by the L1/L2 caches and global memory)**. This is referred to as register spilling.
:::
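Register pressure can be inspected and bounded at build time; a minimal sketch, assuming the standard `nvcc` toolchain (`-Xptxas -v` prints register and spill counts per kernel, `-maxrregcount` caps registers globally, and `__launch_bounds__()` hints the compiler per kernel):
```cpp
// Hint: at most 256 threads per block will be used at launch time, and we want
// at least 2 resident blocks per SM. The compiler may cap register usage per
// thread to satisfy this, possibly spilling the excess to local memory.
__global__ void __launch_bounds__(256, 2) boundedKernel(float* X, float* Y)
{
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    X[global_idx] += Y[global_idx];
}
```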
### Pinned Memory
The default host memory allocation is pageable, which means the pages can be swapped out, so the GPU cannot access pageable memory directly. When a transfer from pageable memory is requested, the CUDA driver allocates a temporary pinned (page-locked) buffer on the host, copies the data from the pageable memory into it, and then transfers it to the device via **Direct Memory Access (DMA)**. The pageable pages may also have been swapped out and need to be loaded back into memory before the transfer can start.
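A minimal host-side sketch (buffer names and sizes are my own) of allocating pinned memory so the DMA engine can read from it directly:
```cpp
#include <cuda_runtime.h>

void pinnedTransferExample()
{
    const size_t bytes = 1 << 20;
    float *h_data = nullptr;
    float *d_data = nullptr;

    // Page-locked host allocation: the driver can DMA from it directly,
    // skipping the staging copy required for pageable memory.
    cudaMallocHost((void**)&h_data, bytes);
    cudaMalloc((void**)&d_data, bytes);

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_data);
    cudaFreeHost(h_data);   // pinned memory must be freed with cudaFreeHost
}
```
Pinned memory is also a prerequisite for overlapping transfers with kernel execution via `cudaMemcpyAsync()`, which is covered in the CUDA Streams section below.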
:::info
:bulb: **External Resource**
This post provides a very good explanation of the different CUDA memory allocation methods.
> https://kaibaoom.tw/2020/07/21/cuda-four-memory-access/
:::
### Unified Memory
Unified memory provides **a single memory address space** that can be accessed by every GPU and CPU in the system. Memory pages are allocated on a first-touch basis: if the GPU touches a variable first, the page is allocated and mapped in the GPU's page table.
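A minimal sketch of the first-touch behaviour (the kernel and sizes are my own):
```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n)   // hypothetical kernel for illustration
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

void unifiedMemoryExample()
{
    const int n = 1 << 20;
    float *data = nullptr;

    // One pointer visible to both CPU and GPU; pages migrate on demand.
    cudaMallocManaged((void**)&data, n * sizeof(float));

    for (int i = 0; i < n; ++i)                       // first touch on the CPU: pages live in host memory
        data[i] = 1.0f;

    scaleKernel<<<(n + 255) / 256, 256>>>(data, n);   // GPU access page-faults and migrates the pages
    cudaDeviceSynchronize();                          // wait before the CPU reads the result

    printf("%f\n", data[0]);                          // CPU access migrates the page back
    cudaFree(data);
}
```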
:::info
:bulb: **External Resource**
This slide deck provides an in-depth view of unified memory.
https://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf

*Architecture **before Pascal***
* No GPU page fault support: move all dirty pages on kernel launch.
*Architecture **Pascal** and **Volta***
* GPU page fault support
* Concurrent access
* Extended VA space (48 bits)
* On-demand migration to accessing processor on first touch
*Architecture **Volta+P9** (IBM Power System AC922)*
* Access counters (Only hot pages will be moved to the GPU)
* Hardware Coherency with NVLink2

* ATS (Address Translation Service) support (CPU and GPU can share a single page table)

:::
#### Over-subscription
> https://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf


#### Data Prefetching
Data can be prefetched with the `cudaMemPrefetchAsync()` API, and usage hints can be given with `cudaMemAdvise()`.
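A minimal sketch combining both APIs (assuming `data` comes from `cudaMallocManaged()`; the function name and stream are my own):
```cpp
#include <cuda_runtime.h>

void prefetchAndAdvise(float *data, size_t bytes, cudaStream_t stream)
{
    int device = 0;
    cudaGetDevice(&device);

    // Hints: the data is mostly read by kernels and should prefer this GPU.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);

    // Migrate the pages to the GPU before the kernels need them.
    cudaMemPrefetchAsync(data, bytes, device, stream);
    // ... launch kernels that read `data` on `stream` here ...

    // Prefetch back to the host (cudaCpuDeviceId) before the CPU reads the result.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);
}
```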
:::info
:bulb: **Advice flags for `cudaMemAdvise()`**
> ***Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing***
> * https://www.packtpub.com/product/learn-cudaprogramming/9781788996242
* `cudaMemAdviseSetReadMostly`: This tells the driver that the data is mostly read-only, so the driver may create read-only copies of it. If the data is written to, the copies are invalidated except on the device that wrote the memory.
* `cudaMemAdviseSetPreferredLocation`: This sets the preferred location of the data to the memory of a given device. The driver then tries to resist migrating the data away from the preferred location.
* `cudaMemAdviseSetAccessedBy`: This implies that the data will be accessed by the given device. That device creates a direct mapping to the data in CPU memory, so no page faults are generated.
:::
:::success
:bulb: **Memory Evolution**
> * https://stackoverflow.com/questions/55742668/l1-cache-in-gpu
> * https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#l1-cache
:::
## SIMT Programming
:::info
:bulb: **External Resource**
**Understanding NVIDIA GPU SMs and CUDA Programming**
> https://bbs.huaweicloud.com/blogs/349996
:::
> ***Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing***
> * https://www.packtpub.com/product/learn-cudaprogramming/9781788996242

The streaming multiprocessors execute thread blocks arbitrarily and concurrently, running as many as the GPU's resources allow. *Therefore, the number of thread blocks that can execute in parallel varies depending on how much of the GPU's resources each block requires and how many resources are available.*

```cpp
__global__ void fCUDA::indexPrint()
{
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_idx = threadIdx.x / warpSize;       // warpSize = 32 is predefined
    int lane_idx = threadIdx.x & (warpSize - 1); // equal to threadIdx.x % 32
    if (global_idx == 0)
        printf("thread, block, warp, lane\n");
#ifdef INDEX_LANE
    if ((lane_idx & (warpSize / 2 - 1)) == 0)
    {
        // thread, block, warp, lane
        printf(" %5d\t%5d\t %2d\t%2d\n", global_idx, blockIdx.x, warp_idx, lane_idx);
    }
#else
    printf(" %5d\t%5d\t %2d\t%2d\n", global_idx, blockIdx.x, warp_idx, lane_idx);
#endif
    return;
}
```
Only lanes 0 and 16 of each warp print a line; if this limitation is removed, each warp prints a consistent sequence of lanes 0 to 31. From the output we can observe:
* **Out-of-order block execution**
* **Out-of-order warp index within a block**
* **Grouped threads executed in a warp**
```
thread, block, warp, lane
32 0 1 0
48 0 1 16
96 1 1 0
112 1 1 16
64 1 0 0
80 1 0 16
0 0 0 0
16 0 0 16
```
### Reduction Example
*TODO: Include the OpenCL book explanation and implement it with CUDA C.*
#### Reduction with Global Memory
A host-side loop is needed to launch the kernel repeatedly, since the block execution order is not guaranteed. **(We can only synchronize threads within a block, so each launch reduces the data by one block-level step.)**

```cpp
__global__ void sharedReductionKernel(float* output, float* input, int size)
{
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    extern __shared__ float sData[];
    if (global_idx < size)
        sData[threadIdx.x] = input[global_idx];
    else
        sData[threadIdx.x] = 0.0f;
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (global_idx % (stride * 2) == 0)
            sData[threadIdx.x] += sData[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        output[blockIdx.x] = sData[0];
}

void fCUDA::sharedReduction(float* output, float* input, int blockSize, int size)
{
    cudaMemcpy(output, input, size * sizeof(float), cudaMemcpyDeviceToDevice);
    while (size > 1)
    {
        int gridSize = (size + blockSize - 1) / blockSize;
        sharedReductionKernel<<<gridSize, blockSize, blockSize * sizeof(float), 0>>>(output, output, size);
        size = gridSize;
    }
}
```
### Warp Divergence

The figure above shows how CUDA cores handle **warp divergence** (*branching*): threads whose branch condition is not met are stalled, while the threads that meet the condition execute the branched code section.
The book provides several options/techniques we can use:
* **Avoid branching, or keep it at the warp level rather than the thread level** (see the sketch after this list).
* **Coalesce the branched part to reduce divergence within a warp.**
* **Shorten the branched part.**
* **Rearrange the data (this introduces an overhead).**
* **Use `tiled_partition` in cooperative groups.**
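As a sketch of the first point (my own toy kernel): branching on the warp index keeps every thread of a warp on the same path, while branching on the thread index splits each warp into two serialized paths.
```cpp
__global__ void branchExample(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_idx = threadIdx.x / warpSize;

    // Thread-level branch (divergent): even/odd lanes of every warp take different paths.
    // if (threadIdx.x % 2 == 0) out[tid] = sinf((float)tid); else out[tid] = cosf((float)tid);

    // Warp-level branch (non-divergent): all 32 threads of a warp take the same path.
    if (warp_idx % 2 == 0)
        out[tid] = sinf((float)tid);
    else
        out[tid] = cosf((float)tid);
}
```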
:::info
:bulb: **External Resource**
This article provides an in-depth view of warp divergence on newer NVIDIA GPU architectures (*Volta, Turing*).
**CUDA Microarchitecture and ISA (5): Independent Thread Scheduling**
> https://zhuanlan.zhihu.com/p/186192189
:::
:::info
:bulb: **Reduction Example in OpenCL**
In this small section, a series of examples is provided for a better understanding of reduction and its implementation considerations. These examples come from the following book.
**OpenCL Parallel Programming Development Cookbook by Raymond Tay**
> https://www.packtpub.com/product/opencl-parallel-programming-development-cookbook/9781849694520

**Reduction Implementation \#0 \[Interleaved Addressing\]**
```cpp=
__kernel void Reduction0(__global uint* X, __global uint* Y, __local uint* L)
{
    unsigned int localIdx = get_local_id(0);
    unsigned int groupIdx = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);
    L[localIdx] = X[globalIdx];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (unsigned int stride = 1; stride < BLOCK_SIZE; stride <<= 1)
    {
        // the modulo operation is slow compared to an addition
        if (localIdx % (2 * stride) == 0)
        {
            L[localIdx] += L[localIdx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // write the result of this work-group
    if (localIdx == 0)
        Y[groupIdx] = L[0];
}
```
In the above code snippet, the **modulo operation in the loop condition takes significantly longer than an addition (#1)**, and the condition introduces **wavefront divergence (#2)**. Moreover, **only part of the work-items do useful work in each iteration (#3)**, so the overhead of copying data from global memory to local memory cannot be hidden.
**Reduction Implementations \#1 & \#2 \[Sequential Addressing\]**
```cpp=
__kernel void Reduction1(__global uint* X, __global uint* Y, __local uint* L)
{
    unsigned int localIdx = get_local_id(0);
    unsigned int groupIdx = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);
    L[localIdx] = X[globalIdx];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (unsigned int stride = 1; stride < BLOCK_SIZE; stride <<= 1)
    {
        int idx = 2 * stride * localIdx;
        if (idx < BLOCK_SIZE)
        {
            L[idx] += L[idx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (localIdx == 0)
        Y[groupIdx] = L[0];
}

__kernel void Reduction2(__global uint* X, __global uint* Y, __local uint* L)
{
    unsigned int localIdx = get_local_id(0);
    unsigned int groupIdx = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);
    L[localIdx] = X[globalIdx];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1)
    {
        // half of the threads are idle from the very first iteration,
        // and the active count is halved again on each iteration
        if (localIdx < stride)
        {
            L[localIdx] += L[localIdx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (localIdx == 0)
        Y[groupIdx] = L[0];
}
```
In these implementations the modulo operation is removed; moreover, the addressing in `Reduction2` eliminates the multiplications inside the loop and further improves performance. Despite these improvements, **half of the threads are already idle in the very first iteration**.
**Reduction Implementation \#3 \[Sequential Addressing\]**

```cpp=
__kernel void Reduction3(__global uint* X, __global uint* Y, __local uint* L)
{
    unsigned int localIdx = get_local_id(0);
    unsigned int groupIdx = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);
    // mitigate idle threads in the first iteration by loading two elements
    // (already added together) into one local memory slot
    unsigned int idx = groupIdx * (BLOCK_SIZE * 2) + localIdx;
    L[localIdx] = X[idx] + X[idx + BLOCK_SIZE];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1)
    {
        if (localIdx < stride)
        {
            L[localIdx] += L[localIdx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (localIdx == 0)
        Y[groupIdx] = L[0];
}
```
The mitigation is to **load (and add) two elements per work-item** from global memory into one local memory slot. In other words, a first level of reduction is already performed during the load, so a work-group of `BLOCK_SIZE` threads now covers `2 * BLOCK_SIZE` input elements.
**Reduction Implementation \#4 \[Sequential Addressing\]**
```cpp=
#define BLOCK_SIZE (256)
__kernel void Reduction4(__global uint* X, __global uint* Y, __local uint* L)
{
    unsigned int localIdx = get_local_id(0);
    unsigned int groupIdx = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);
    unsigned int idx = groupIdx * (BLOCK_SIZE * 2) + localIdx;
    L[localIdx] = X[idx] + X[idx + BLOCK_SIZE];
    barrier(CLK_LOCAL_MEM_FENCE);
    // loop unrolling with wavefront-level synchronization:
    // the "Preferred work group size multiple" reported by clinfo identifies this number
    // (AMD devices use 64 [wavefront size]; NVIDIA and Intel use 32 [warp size])
    for (unsigned int stride = BLOCK_SIZE / 2; stride > 64; stride >>= 1)
    {
        if (localIdx < stride)
        {
            L[localIdx] += L[localIdx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // unroll the last 7 iterations (within a single wavefront)
    if (localIdx < 64)
    {
        if (blockSize >= 128)
            L[localIdx] += L[localIdx + 64];
        if (blockSize >= 64)
            L[localIdx] += L[localIdx + 32];
        if (blockSize >= 32)
            L[localIdx] += L[localIdx + 16];
        if (blockSize >= 16)
            L[localIdx] += L[localIdx + 8];
        if (blockSize >= 8)
            L[localIdx] += L[localIdx + 4];
        if (blockSize >= 4)
            L[localIdx] += L[localIdx + 2];
        if (blockSize >= 2)
            L[localIdx] += L[localIdx + 1];
    }
    if (localIdx == 0)
        Y[groupIdx] = L[0];
}
```
**Loop unrolling** is applied for a known wavefront size and block size. In the code above, the unrolled section **does not contain any barrier**, because a wavefront is the atomic scheduling unit: *on NVIDIA GPUs the warp size is 32, and on AMD GPUs the wavefront size is 64*.
:::
### Warp-level Primitive Programming
* Reference: https://zhuanlan.zhihu.com/p/572820783
On **Pascal and older architectures**, threads were scheduled at warp level and synchronized implicitly within a warp (*32 threads*). The Volta architecture introduced **independent thread scheduling**.

:::info
:bulb: **Nvidia Technical Blog**
**Using CUDA Warp-Level Primitives**
> https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
**Cooperative Groups: Flexible CUDA Thread Programming**
> https://developer.nvidia.com/blog/cooperative-groups/
:::
#### Reduction with Warp-level primitive Programming
:::success
:bulb: **Warp-level Collective Operation**
> Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing
**Collective operation steps:**
1. Identify, mask, or ballot the CUDA threads in a warp that will participate in the operation.
2. Let the CUDA threads exchange data. (The shuffle happens through **registers**, without accessing memory.)
3. Synchronize all the CUDA threads in the warp. (optional)
**Warp-level primitive functions introduced in CUDA 9:**
`__shfl_sync()`, `__shfl_up_sync()`, `__shfl_down_sync()`, `__shfl_xor_sync()`, `__all_sync()`, `__any_sync()`, `__ballot_sync()`, `__match_any_sync()`, `__match_all_sync()`, `__activemask()`
:::

In `blockReductionDevice`, every thread calls `warpReductionDevice`, so each warp performs its own warp-level reduction and every thread receives a value. However, only the first thread (lane 0) of each warp holds the correct partial sum, so only that thread stores its value into shared memory.
```cpp=
__inline__ __device__ float warpReductionDevice(float value)
{
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        unsigned int mask = __activemask();
        value += __shfl_down_sync(mask, value, offset);
    }
    return value;
}

__inline__ __device__ float blockReductionDevice(float value)
{
    static __shared__ float shared[32];
    int laneIdx = threadIdx.x % warpSize;
    int warpIdx = threadIdx.x / warpSize;
    // each warp performs a partial reduction
    value = warpReductionDevice(value);
    if (laneIdx == 0)               // only lane 0 of each warp holds the correct partial sum
        shared[warpIdx] = value;    // write the warp-level partial sum to shared memory
    __syncthreads();
    if (warpIdx == 0) {
        value = (threadIdx.x < blockDim.x / warpSize) ? shared[laneIdx] : 0; // read only for warps that exist
        value = warpReductionDevice(value); // final reduction within the first warp
    }
    return value;
}
```
Next, the reduction kernel should act on the entire input: each thread first accumulates the input with a grid stride of `NUM_LOAD * totalThreadSize` (here `NUM_LOAD` = 4) into a `NUM_LOAD`-element array, and then collapses that array into a single value. The grid-level reduction pattern is shown in the following figure.

Finally, a block-level reduction produces one partial sum per block, and every block stores its partial sum back into the output array. The kernel is launched twice: the second launch runs over all the block partial sums and produces the final scalar value.
```cpp=
__global__ void warplevelPrimitiveReductionKernel(float* output, float* input, int size)
{
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum[NUM_LOAD] = { 0.0f };
    // grid-stride loop: each thread accumulates NUM_LOAD partial sums
    for (int i = global_idx; i < size; i += blockDim.x * gridDim.x * NUM_LOAD)
    {
        for (int step = 0; step < NUM_LOAD; ++step)
            sum[step] += (i + step * blockDim.x * gridDim.x < size) ? input[i + step * blockDim.x * gridDim.x] : 0.0f;
    }
    for (int i = 1; i < NUM_LOAD; ++i)
        sum[0] += sum[i];
    sum[0] = blockReductionDevice(sum[0]);
    if (threadIdx.x == 0)
        output[blockIdx.x] = sum[0];
}

void fCUDA::warplevelPrimitiveReduction(float* output, float* input, int size, int blockSize)
{
    int numSMs;
    int numBlockPerSM;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlockPerSM, warplevelPrimitiveReductionKernel, blockSize, blockSize * sizeof(float));
    int gridSize = min(numBlockPerSM * numSMs, (size + blockSize - 1) / blockSize);
    warplevelPrimitiveReductionKernel<<<gridSize, blockSize>>>(output, input, size);
    // second launch reduces the per-block partial sums now stored in output
    warplevelPrimitiveReductionKernel<<<1, blockSize, blockSize * sizeof(float), 0>>>(output, output, gridSize);
}
```
:::success
:bulb: **Set Kernel Execute Parameters with Device Attribute**
The program retrieves the device attributes for the number of SMs on the target GPU and the maximum number of blocks that can be scheduled per SM.
```cpp
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlockPerSM, warplevelPrimitiveReductionKernel, blockSize, blockSize * sizeof(float));
```
:::
### Cooperative Groups
Reference \:
* Cooperative Groups: Flexible CUDA Thread Programming https://developer.nvidia.com/blog/cooperative-groups/
* Introduction to CUDA Programming: Cooperative Groups (1) https://zhuanlan.zhihu.com/p/572820342
* Introduction to CUDA Programming: Cooperative Groups (2) https://zhuanlan.zhihu.com/p/586770010

One benefit of using cooperative groups is that it can prevent deadlocks. In the figure above, provided by the book, `__syncthreads()` is not reached by all the threads of the block, which causes a deadlock.
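A minimal sketch of the API (my own example, loosely based on the blog post above): partition a thread block into 32-thread tiles and reduce within each tile, synchronizing only the group that actually needs it.
```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cgReduceExample(float *out, const float *in)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    float value = in[block.group_index().x * block.size() + block.thread_rank()];

    // Tile-level shuffle reduction: only the 32 threads of this tile participate.
    for (int offset = tile.size() / 2; offset > 0; offset >>= 1)
        value += tile.shfl_down(value, offset);

    // out[] is assumed to be zero-initialized; one partial sum is accumulated per block.
    if (tile.thread_rank() == 0)
        atomicAdd(&out[block.group_index().x], value);
}
```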
### Loop Unrolling in the CUDA Kernel
Similar to CPU programming, loop unrolling can be used to reduce or remove loop-control overhead (*end-of-loop tests on each iteration, branch penalties*). Many parallel programming and optimization tutorials recommend this technique (refer to the SIMD programming notes).
```cpp
#pragma unroll
for (int offset = group.size() / 2; offset > 0; offset >>= 1)
value += group.shfl_down(value, offset);
```
:::warning
:warning: **Important**
Unrolled code may result in **lower occupancy** due to increased register usage, and it can incur a **higher instruction cache miss** penalty due to the larger code size.
:::
## Kernel Execution Model and Optimization Strategies
### CUDA Streams
:::info
:bulb: **External Resource**
**How to Overlap Data Transfers in CUDA C/C++**
> https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
:::
CUDA operations that do not specify a stream run on the default stream. The default stream is a synchronizing stream: work submitted to it does not start until all previously issued operations in any stream on the device have completed, and it must finish before operations in other streams can begin.
:::success
:bulb: **Requirement for Overlapping Kernel Execution and Data Transfers**
* The kernel execution and the data transfer to be overlapped must both occur in **different, non-default streams**.
* The host memory involved in the data transfer must be **pinned memory**.
* Take the number of copy engines and kernel engines of the device into consideration.
* If there is **only one copy engine**, a dependent **H2D** and **D2H** pair should not be launched back to back.
:::
#### Code Example provided by Nvidia Technical Blog
>https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
```cpp
// Pattern 1: issue copy, kernel, copy per stream
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}

// Pattern 2: batch the copies and the kernels separately
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
}
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}
```
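Both patterns assume the streams and the pinned host buffer have already been set up; a minimal setup sketch (the sizes are my own choice, the variable names follow the blog post):
```cpp
const int blockSize = 256;
const int nStreams = 4;
const int streamSize = 1024 * blockSize;
const size_t streamBytes = streamSize * sizeof(float);
const size_t totalBytes = nStreams * streamBytes;

cudaStream_t stream[nStreams];
float *a = nullptr, *d_a = nullptr;

cudaMallocHost((void**)&a, totalBytes);   // pinned host memory: required for true async copies
cudaMalloc((void**)&d_a, totalBytes);
for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&stream[i]);

// ... pattern 1 or pattern 2 from above ...

for (int i = 0; i < nStreams; ++i)
    cudaStreamDestroy(stream[i]);
cudaFreeHost(a);
cudaFree(d_a);
```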

### CUDA Dynamic Parallelism
Dynamic parallelism can be used to address load imbalance when working on graphs, trees, or quicksort (*recursive workloads*).
A CUDA kernel can launch a child grid, i.e. a CUDA kernel can be launched from within another CUDA kernel. Dynamic-parallelism kernels may conservatively reserve up to 150 MB of device memory to track pending grid launches and the parent grid state when synchronizing on a child grid launch. The nesting depth of kernel launches is limited to 24 levels.
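A minimal sketch of a device-side launch (my own example; the segment layout and names are hypothetical). The file must be compiled with relocatable device code, e.g. `nvcc -rdc=true ... -lcudadevrt`.
```cpp
__global__ void childKernel(float *data, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[offset + i] *= 2.0f;
}

__global__ void parentKernel(float *data, const int *segmentSize, int segmentStride, int numSegments)
{
    int seg = blockIdx.x * blockDim.x + threadIdx.x;
    if (seg < numSegments) {
        // Each parent thread launches a child grid sized to its own segment,
        // so irregular segments do not have to share one fixed launch configuration.
        int count = segmentSize[seg];
        if (count > 0)
            childKernel<<<(count + 255) / 256, 256>>>(data, seg * segmentStride, count);
    }
}
```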
### Grid-level Cooperative Groups
*TODO: Move to the previous section.*
### Multi-Process Service
*TODO: Code Example (Require new hardware and OS ready)*
:::info
:information_source: **External Resources**
NVIDIA MPS Summary
> https://asphelzhn.github.io/2019/04/14/tensor_09_MPS/
:::
Before introducing **MPS**, we need to understand the **CUDA context** and the **Hyper-Q** technology.
A **CUDA context** contains information such as the GPU global memory allocated from the CPU side, the local memory defined in kernels, CUDA streams and events, and the code modules. Each process has its own CUDA context, and each context has its own address space that cannot be accessed by other CUDA contexts.
**Hyper-Q (hardware queues)** allows multiple CPU threads of one process to submit work to a GPU, and CUDA kernels can share the GPU at the same time: *if kernel A is running and the GPU still has enough SMs and resources to run a thread block of kernel B, kernel B will be launched as well*. Hyper-Q was first introduced on Kepler GPUs (compute capability 3.5 and above).
**MPS** collects jobs from different processes and submits them to the GPU as if they came from a single process. Roughly speaking, Hyper-Q supports the OpenMP (multi-threaded) model, while MPS supports the MPI (multi-process) model.
:::info
:bulb: **GPU Operation Mode on Windows**
> https://docs.nvidia.com/nsight-visual-studio-edition/reference/index.html#:~:text=NVIDIA%20GPUs%20can%20exist%20in,mode%3B%20used%20in%20gaming%20graphics.
On Windows, an NVIDIA GPU can operate in one of two modes: **TCC** or **WDDM**. *In WDDM mode, a CUDA kernel's maximum run time is limited to around 5 seconds (display watchdog)*. TCC mode removes the graphics functionality so the card is used purely for computation.
:::
:::info
:bulb: **GPU Utilization Explanation**
**How to Keep Squeezing More Compute out of the GPU**
> https://zhuanlan.zhihu.com/p/346389176
*GPU Utilization :*
Percent of time over the past sample period during which **one or more kernels** was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.
In the following diagram, only one kernel is scheduled on the GPU at any time; this is typically the case when the CUDA code calls `cudaDeviceSynchronize()` between kernel launches.

:::
### CUDA-aware MPI
*TODO: Pending*