Erebustsai
> ***Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing***
> * https://www.packtpub.com/product/learn-cudaprogramming/9781788996242
>
> ***Interpreting the CUDA C Best Practices Guide***
> * https://www.zhihu.com/column/c_1323748466020315136
>
> ***Introduction to CUDA Programming and Optimization***
> * https://www.zhihu.com/column/c_1522503697624346624

*This post follows the book above to understand and build CUDA programming skills. The final part of the chapter contains an AI framework example showing how I plan to use the skills learned from this book to improve the inference framework.*

*OpenCL programming and CUDA programming may get separate posts; there should be cross-references between them.*

:::info
:information_source: **Compiler Settings**
***nvcc***
***-run*** # run the program after compiling
***-m64*** # build the program as x64
***-I/usr/local/cuda/samples/common/inc*** # include the CUDA sample tools
***-gencode arch=compute_75,code=sm_75*** # 7.5 for the RTX 20 series
***-o executivename ./filename.cu***
***-Xcompiler -fopenmp*** # compile the host code with OpenMP
:::

## GPU Hardware & Software Mapping

![](https://hackmd.io/_uploads/ryniMwrE3.png)

### CUDA GPU Indexing and OpenCL Indexing

|                      | OpenCL                         | CUDA C                                                       |
|:--------------------:| ------------------------------ | ------------------------------------------------------------ |
| Thread Block ID      | `get_group_id(uint dimIdx)`    | `blockIdx.[xyz]`                                              |
| Thread Local ID      | `get_local_id(uint dimIdx)`    | `threadIdx.[xyz]`                                             |
| Num Threads in Block | `get_local_size(uint dimIdx)`  | `blockDim.[xyz]`                                              |
| Num Blocks in Grid   | `get_num_groups(uint dimIdx)`  | `gridDim.[xyz]`                                               |
| Thread Global ID     | `get_global_id(uint dimIdx)`   | `blockIdx.[xyz] * blockDim.[xyz] + threadIdx.[xyz]`           |

:::info
:bulb: **GPU Architecture Evolution**
**A Decade of NVIDIA GPU Architecture Evolution, from Fermi to Ampere**
> https://zhuanlan.zhihu.com/p/413145211

**Reviewing GPU Architecture Changes from an AI-System Perspective: from Fermi to Ampere (V1.2)**
> https://zhuanlan.zhihu.com/p/463629676
:::

## Memory Management

### Memory Hierarchy
![](https://hackmd.io/_uploads/SJ5ZL54H3.png)

:::warning
:warning: **Important**
> *Each memory's bandwidth differs by orders of magnitude, and so does its access latency.*

*Shared memory* is very important in CUDA kernel programs because it can be managed by the programmer. When programming with CUDA, keep the ***size, latency, throughput, and visibility*** of the different memories in mind.
:::

### Coalesced Memory Access

A warp (32 threads) should access contiguous data. This achieves better performance through fewer memory fetches and therefore more efficient memory-bus utilization.

![](https://hackmd.io/_uploads/r19ergRS2.png)

### Shared Memory

Shared memory is a **user-managed cache**. By *storing global memory into shared memory with a coalesced access pattern*, a program can exploit shared memory's higher bandwidth. Shared memory is local to a thread block and can be used for communication between the threads of a block.

:::info
:bulb: **Shared Memory Examples**
The first example is provided by the book and a post on the NVIDIA Developer Blog
https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
The second example is in my GitHub repository, `AI Framework/conv2d.cu`

**Matrix transpose on shared memory**

![](https://hackmd.io/_uploads/S1Kht_eI3.png)

```cpp
__global__ void fCUDA::matrixTransposeNaive(int *input, int *output) {
    int x_global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int y_global_idx = blockIdx.y * blockDim.y + threadIdx.y;

    // input and output index mapping
    int inputIdx = y_global_idx * MATRIX_SIZE + x_global_idx;
    int outputIdx = x_global_idx * MATRIX_SIZE + y_global_idx;

    // At least one of the two accesses will be uncoalesced
    output[outputIdx] = input[inputIdx];
}
```

![](https://hackmd.io/_uploads/BkXJqueUh.png)

The tile is in shared memory, and the access pattern can be observed in the picture above. Notice that the accesses to both `idata` and `odata` are coalesced.
```cpp
// In this algorithm, the shared-memory block should be square:
// blockDim.x == blockDim.y == BLOCK_SIZE
__global__ void fCUDA::matrixTransposeShared(int *input, int *output) {
    __shared__ int sharedMem[BLOCK_SIZE][BLOCK_SIZE];

    int x_global_idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int y_global_idx = blockIdx.y * BLOCK_SIZE + threadIdx.y;

    // transposed global memory index
    int x_transpose_idx = blockIdx.y * BLOCK_SIZE + threadIdx.x;
    int y_transpose_idx = blockIdx.x * BLOCK_SIZE + threadIdx.y;

    int inputIdx = y_global_idx * MATRIX_SIZE + x_global_idx;
    int outputIdx = y_transpose_idx * MATRIX_SIZE + x_transpose_idx;

    // global input to shared memory
    sharedMem[threadIdx.x][threadIdx.y] = input[inputIdx];

    // wait until the whole tile is loaded before reading other threads' slots
    __syncthreads();

    // shared memory to global output (access sharedMem's column)
    output[outputIdx] = sharedMem[threadIdx.y][threadIdx.x];
}
```

**2D Convolution Layer**

*Please refer to the GitHub repository for more info.*

Uncoalesced access to W (model weights):

```cpp
int indexW = 0;
for (; indexW < wSubSize; ++indexW)
    sharedWeight[threadIdx.z * wSubSize + indexW] = W[cOut * wSubSize + indexW];
__syncthreads();
```

Coalesced access to W (model weights):

```cpp
int sharedNum = wSubSize * blockDim.z;
int threadLocalIdx = threadIdx.z * (blockDim.x * blockDim.y) +
                     threadIdx.y * blockDim.x + threadIdx.x;
int threadLocalDim = blockDim.x * blockDim.y * blockDim.z;
for (int i = 0; i < sharedNum; i += threadLocalDim) {
    if ((i + threadLocalIdx) < sharedNum)
        sharedWeight[i + threadLocalIdx] = W[(sharedNum * blockIdx.z) + i + threadLocalIdx];
}
__syncthreads();
```
:::

### Shared Memory Bank Conflicts

Shared memory is organized into banks, and each bank can service only one address per cycle. With 32 banks that are 4 bytes wide, we can therefore have 32 simultaneous 4-byte accesses.
![](https://hackmd.io/_uploads/Bkx_WnWL3.png)

Padding can be added to avoid bank conflicts:

```cpp
__shared__ int sharedMem[BLOCK_SIZE][BLOCK_SIZE + 1];
```

The shared-memory allocation above puts elements 0, 32, and 64 into different banks.

![](https://hackmd.io/_uploads/rye6V2bU2.png)

### Registers in GPU

Every SM has a fixed set of registers, each with the scope of a single thread. The `nvcc` compiler tries to find the best number of registers per thread. The number of registers used by a CUDA kernel restricts the number of threads that can be scheduled on an SM.

```cpp
__global__ void cudaKernel(float* X, float* Y) {
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    X[global_idx] += Y[global_idx];
}
```

:::info
:bulb: **Parameter Example**
* `global_idx`: in a register
* `X`, `Y`: in shared memory or registers
* `X[global_idx]`: in global memory
:::

:::warning
:warning: **Register Spills**
If the local variables and intermediates of a CUDA kernel exceed the registers available, the data is pushed out to the **L1/L2 cache or global memory**. This is referred to as register spilling.
:::

### Pinned Memory

The default memory allocation is pageable, which means it can be swapped out; therefore the GPU cannot access pageable memory directly. The CUDA driver allocates temporary pinned memory on the host, copies the data from the pageable memory into it, and transfers it to the device via **Direct Memory Access (DMA)**. The pageable memory might have been swapped out and need to be loaded back into memory when the transfer is requested.

:::info
:bulb: **External Resource**
This post provides a very good explanation of the different CUDA memory allocation methods.
> https://kaibaoom.tw/2020/07/21/cuda-four-memory-access/
:::

### Unified Memory

Unified memory provides **a single memory-space view** that can be accessed by every GPU and CPU in the system. Memory is allocated on a first-touch basis.
If the GPU touches the variable first, the page is allocated and mapped in the GPU page table.

:::info
:bulb: **External Resource**
This slide deck provides an in-depth view of unified memory.
https://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf

![](https://hackmd.io/_uploads/SJXsGeMIn.png)

*Architectures **before Pascal***
* No GPU page-fault support: all dirty pages are moved on kernel launch.

*Architectures **Pascal** and **Volta***
* GPU page-fault support
* Concurrent access
* Extended VA space (48 bits)
* On-demand migration to the accessing processor on first touch

*Architecture **Volta + P9** (IBM Power System AC922)*
* Access counters (only hot pages are moved to the GPU)
* Hardware coherency over NVLink2

![](https://hackmd.io/_uploads/rymCHgzU2.png)

* ATS (Address Translation Service) support (CPU and GPU can share a single page table)

![](https://hackmd.io/_uploads/B1zFUgz83.png)
:::

#### Over-subscription

> https://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf

![](https://hackmd.io/_uploads/ryNwKxz8h.png)
![](https://hackmd.io/_uploads/rkLuYxzLn.png)

#### Data Prefetching

Data can be prefetched with the API `cudaMemPrefetchAsync()`, and hints can be given with the API `cudaMemAdvise()`.

:::info
:bulb: **Advice flags for `cudaMemAdvise()`**
> ***Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing***
> * https://www.packtpub.com/product/learn-cudaprogramming/9781788996242

* `cudaMemAdviseSetReadMostly`: tells the driver that the data is mostly read-only, so the driver creates read-only copies of it. If the data is written to, the page copies are invalidated except on the device that wrote the memory.
* `cudaMemAdviseSetPreferredLocation`: sets the preferred location of the data to a device's memory. The driver tries to resist migrating data away from the preferred location.
* `cudaMemAdviseSetAccessedBy`: implies that the data will be accessed by the device. The device creates a direct mapping of the data in CPU memory, and no page faults are generated.
:::

:::success
:bulb: **Memory Evolution**
> * https://stackoverflow.com/questions/55742668/l1-cache-in-gpu
> * https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#l1-cache
:::

## SIMT Programming

:::info
:bulb: **External Resource**
**Understanding NVIDIA GPU SMs and CUDA Programming**
> https://bbs.huaweicloud.com/blogs/349996
:::

> ***Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing***
> * https://www.packtpub.com/product/learn-cudaprogramming/9781788996242

The streaming multiprocessors execute thread blocks arbitrarily and concurrently, running as many as the GPU's resources can afford. *Therefore, the number of thread blocks executing in parallel varies with how much of the GPU's resources each block requires and how many GPU resources are available.*

![](https://hackmd.io/_uploads/SJtrYIVI3.png)

```cpp
__global__ void fCUDA::indexPrint() {
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_idx = threadIdx.x / warpSize;        // warpSize = 32 is predefined
    int lane_idx = threadIdx.x & (warpSize - 1);  // equal to threadIdx.x % 32

    if (global_idx == 0)
        printf("thread, block, warp, lane\n");

#ifdef INDEX_LANE
    if ((lane_idx & (warpSize / 2 - 1)) == 0) {
        // thread, block, warp, lane
        printf(" %5d\t%5d\t %2d\t%2d\n", global_idx, blockIdx.x, warp_idx, lane_idx);
    }
#else
    printf(" %5d\t%5d\t %2d\t%2d\n", global_idx, blockIdx.x, warp_idx, lane_idx);
#endif
    return;
}
```

With `INDEX_LANE` defined, only lanes 0 and 16 of each warp print (two lines per warp); if this limitation is removed, the output covers a consistent 0–31 range per warp.
* **Out-of-order block execution**
* **Out-of-order warp indices within a block**
* **Grouped threads executed in a warp**

```
thread, block, warp, lane
   32      0    1    0
   48      0    1   16
   96      1    1    0
  112      1    1   16
   64      1    0    0
   80      1    0   16
    0      0    0    0
   16      0    0   16
```

### Reduction Example

*TODO: Include the OpenCL book's explanation and implement the kernels in CUDA C.*

#### Reduction with Shared Memory

A loop on the host relaunches the kernel, since block execution order is not guaranteed. **(We can only sync threads within a block.)**

![](https://hackmd.io/_uploads/Byg-AxsUn.png)

```cpp
__global__ void sharedReductionKernel(float* output, float* input, int size) {
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    extern __shared__ float sData[];

    if (global_idx < size)
        sData[threadIdx.x] = input[global_idx];
    else
        sData[threadIdx.x] = 0.0f;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (global_idx % (stride * 2) == 0)
            sData[threadIdx.x] += sData[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        output[blockIdx.x] = sData[0];
}

void fCUDA::sharedReduction(float* output, float* input, int blockSize, int size) {
    cudaMemcpy(output, input, size * sizeof(float), cudaMemcpyDeviceToDevice);
    while (size > 1) {
        int gridSize = (size + blockSize - 1) / blockSize;
        sharedReductionKernel<<<gridSize, blockSize, blockSize * sizeof(float), 0>>>(output, output, size);
        size = gridSize;
    }
}
```

### Warp Divergence

![](https://hackmd.io/_uploads/rJvO2WsIh.png)

The figure above shows how CUDA cores handle **warp divergence** (*branching*). Threads whose branch condition is not met are stalled, while threads that meet the condition execute the branched code section. The book provides several options/techniques we can use:

* **Avoid branching, or keep branching at the warp level rather than the thread level.**
* **Coalesce the branched part to reduce divergence within a warp.**
* **Shorten the branched part.**
* **Rearrange the data (this introduces overhead).**
* **Use `tiled_partition` in cooperative groups.**

:::info
:bulb: **External Resource**
This article provides an in-depth view of warp divergence on newer NVIDIA GPU architectures (*Volta, Turing*).
**CUDA Microarchitecture and Instruction Set (5): Independent Thread Scheduling**
> https://zhuanlan.zhihu.com/p/186192189
:::

:::info
:bulb: **Reduction Example in OpenCL**
This small section walks through a series of examples for a better understanding of reduction and its implementation considerations. The examples are provided by the following book.
**OpenCL Parallel Programming Development Cookbook by Raymond Tay**
> https://www.packtpub.com/product/opencl-parallel-programming-development-cookbook/9781849694520

![](https://hackmd.io/_uploads/BkMkoNnIn.png)

**Reduction Implementation \#0 \[Interleaved Addressing\]**

```cpp=
__kernel void Reduction0(__global uint* X, __global uint* Y, __local uint* L) {
    unsigned int localIdx  = get_local_id(0);
    unsigned int groupIdx  = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);

    L[localIdx] = X[globalIdx];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int stride = 1; stride < BLOCK_SIZE; stride <<= 1) {
        // the mod operation is slow compared to addition
        if (localIdx % (2 * stride) == 0) {
            L[localIdx] += L[localIdx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write the block's result
    if (localIdx == 0)
        Y[groupIdx] = L[0];
}
```

In the code snippet above, the **modulo operation in the branch condition takes significantly longer than an addition (\#1)**, and it causes **wavefront divergence (\#2)**. **Only part of the local memory is utilized (\#3)**, so the overhead of storing data from global memory into local memory cannot be hidden.
**Reduction Implementations \#1 & \#2 \[Sequential Addressing\]**

```cpp=
__kernel void Reduction1(__global uint* X, __global uint* Y, __local uint* L) {
    unsigned int localIdx  = get_local_id(0);
    unsigned int groupIdx  = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);

    L[localIdx] = X[globalIdx];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int stride = 1; stride < BLOCK_SIZE; stride <<= 1) {
        int idx = 2 * stride * localIdx;
        if (idx < BLOCK_SIZE) {
            L[idx] += L[idx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (localIdx == 0)
        Y[groupIdx] = L[0];
}

__kernel void Reduction2(__global uint* X, __global uint* Y, __local uint* L) {
    unsigned int localIdx  = get_local_id(0);
    unsigned int groupIdx  = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);

    L[localIdx] = X[globalIdx];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        // half of the threads are already idle, and with each iteration
        // the active half is halved again
        if (localIdx < stride) {
            L[localIdx] += L[localIdx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (localIdx == 0)
        Y[groupIdx] = L[0];
}
```

These implementations remove the modulo operation; moreover, the addressing in `Reduction2` eliminates all the multiplications in the loop, further improving performance. Despite all these improvements, **half of the threads are idle from the very first iteration**.

**Reduction Implementation \#3 \[Sequential Addressing\]**

![](https://hackmd.io/_uploads/ryq4hwHPh.png)

```cpp=
__kernel void Reduction3(__global uint* X, __global uint* Y, __local uint* L) {
    unsigned int localIdx  = get_local_id(0);
    unsigned int groupIdx  = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);

    // mitigate idle threads in the first iteration by loading two elements
    // (combined by addition) into one local-memory slot
    unsigned int idx = groupIdx * (BLOCK_SIZE * 2) + localIdx;
    L[localIdx] = X[idx] + X[idx + BLOCK_SIZE];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (localIdx < stride) {
            L[localIdx] += L[localIdx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (localIdx == 0)
        Y[groupIdx] = L[0];
}
```

The mitigation is to **load two elements per thread into local memory** from global memory: while loading, a first level of reduction is performed, so a block of threads merges a span of `2 * blockSize` elements.

**Reduction Implementation \#4 \[Sequential Addressing\]**

```cpp=
#define BLOCK_SIZE (256)

__kernel void Reduction4(__global uint* X, __global uint* Y, __local uint* L) {
    unsigned int localIdx  = get_local_id(0);
    unsigned int groupIdx  = get_group_id(0);
    unsigned int globalIdx = get_global_id(0);
    unsigned int blockSize = get_local_size(0);

    unsigned int idx = groupIdx * (BLOCK_SIZE * 2) + localIdx;
    L[localIdx] = X[idx] + X[idx + BLOCK_SIZE];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int stride = BLOCK_SIZE / 2; stride > 64; stride >>= 1) {
        // loop unrolling with wavefront-level synchronization;
        // "Preferred work group size multiple" from clinfo identifies this number
        // (AMD devices use 64 [wavefront size]; NVIDIA and Intel use 32 [warp size])
        if (localIdx < stride) {
            L[localIdx] += L[localIdx + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // unroll the last 7 iterations
    if (localIdx < 64) {
        if (blockSize >= 128) L[localIdx] += L[localIdx + 64];
        if (blockSize >=  64) L[localIdx] += L[localIdx + 32];
        if (blockSize >=  32) L[localIdx] += L[localIdx + 16];
        if (blockSize >=  16) L[localIdx] += L[localIdx +  8];
        if (blockSize >=   8) L[localIdx] += L[localIdx +  4];
        if (blockSize >=   4) L[localIdx] += L[localIdx +  2];
        if (blockSize >=   2) L[localIdx] += L[localIdx +  1];
    }

    if (localIdx == 0)
        Y[groupIdx] = L[0];
}
```

**Loop unrolling** is applied once the wavefront size and block size are determined. In the code above, notice that the unrolled section **has no barriers**: a wavefront is an atomic unit for the scheduler. *On NVIDIA GPUs the warp size is 32; similarly, on AMD GPUs the wavefront size is 64.*
:::

### Warp-level Primitive Programming

* Reference: https://zhuanlan.zhihu.com/p/572820783

On **Pascal and older architectures**, threads were scheduled at the warp level and synchronized implicitly within a warp (*32 threads*). The Volta architecture introduced **independent thread scheduling**.

![](https://hackmd.io/_uploads/rkVTwfsU3.png)

:::info
:bulb: **NVIDIA Technical Blog**
**Using CUDA Warp-Level Primitives**
> https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

**Cooperative Groups: Flexible CUDA Thread Programming**
> https://developer.nvidia.com/blog/cooperative-groups/
:::

#### Reduction with Warp-level Primitive Programming

:::success
:bulb: **Warp-level Collective Operations**
> Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing

**Collective operation steps:**
1. Identify, mask, or ballot the CUDA threads in a warp that will participate in the operation.
2. Let the CUDA threads shift data. (Shifting happens in **registers** without accessing memory.)
3. All the CUDA threads in the warp are synchronized. (optional)

**Warp-level primitive functions introduced in CUDA 9:**

![](https://hackmd.io/_uploads/SyL9A4nIn.png)
:::

![](https://hackmd.io/_uploads/By9d6MoL2.png)

In `blockReductionDevice`, every thread in a warp calls `warpReductionDevice` and gets a value back. However, only the first thread of each warp holds the correct partial sum (the other lanes contain values we don't care about), so only the first thread in each warp stores its value into shared memory.
```cpp=
__inline__ __device__ float warpReductionDevice(float value) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        unsigned int mask = __activemask();
        value += __shfl_down_sync(mask, value, offset);
    }
    return value;
}

__inline__ __device__ float blockReductionDevice(float value) {
    static __shared__ float shared[32];
    int laneIdx = threadIdx.x % warpSize;
    int warpIdx = threadIdx.x / warpSize;

    // each warp performs a partial reduction
    value = warpReductionDevice(value);

    if (laneIdx == 0)             // only the first thread in a warp has the correct value
        shared[warpIdx] = value;  // write the warp-level partial sum to shared memory
    __syncthreads();

    if (warpIdx == 0) {
        // read from shared memory if that warp exists
        value = (threadIdx.x < blockDim.x / warpSize) ? shared[laneIdx] : 0;
        value = warpReductionDevice(value);  // final reduction within the first warp
    }
    return value;
}
```

Next, the reduction kernel must act on the entire input. It first accumulates the input, striding by `NUM_LOAD * totalThreadSize`, into a `NUM_LOAD`-element (here 4-element) array per thread. The grid-level reduction pattern is shown in the following figure.

![](https://hackmd.io/_uploads/r1YekI3In.png)

Finally, a block reduction produces a partial sum per block, and all block partial sums are stored back to the output array. The kernel is launched twice: the second launch runs on the block partial sums and produces the final scalar value.

```cpp=
__global__ void warplevelPrimitiveReductionKernel(float* output, float* input, int size) {
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;

    float sum[NUM_LOAD] = { 0.0f };
    for (int i = global_idx; i < size; i += blockDim.x * gridDim.x * NUM_LOAD) {
        for (int step = 0; step < NUM_LOAD; ++step)
            sum[step] += (i + step * blockDim.x * gridDim.x < size) ?
                         input[i + step * blockDim.x * gridDim.x] : 0.0f;
    }
    for (int i = 1; i < NUM_LOAD; ++i)
        sum[0] += sum[i];

    sum[0] = blockReductionDevice(sum[0]);

    if (threadIdx.x == 0)
        output[blockIdx.x] = sum[0];
}

void fCUDA::warplevelPrimitiveReduction(float* output, float* input, int size, int blockSize) {
    int numSMs;
    int numBlockPerSM;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlockPerSM,
        warplevelPrimitiveReductionKernel, blockSize, blockSize * sizeof(float));
    int gridSize = min(numBlockPerSM * numSMs, (size + blockSize - 1) / blockSize);

    warplevelPrimitiveReductionKernel<<<gridSize, blockSize>>>(output, input, size);
    // second launch reduces the block partial sums stored in output
    warplevelPrimitiveReductionKernel<<<1, blockSize, blockSize * sizeof(float), 0>>>(output, output, gridSize);
}
```

:::success
:bulb: **Set Kernel Launch Parameters with Device Attributes**
The program queries how many blocks can be scheduled per SM and the number of SMs on the target GPU.

```cpp
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlockPerSM,
    warplevelPrimitiveReductionKernel, blockSize, blockSize * sizeof(float));
```
:::

### Cooperative Groups

References:
* Cooperative Groups: Flexible CUDA Thread Programming
  https://developer.nvidia.com/blog/cooperative-groups/
* Introduction to CUDA Programming: Cooperative Groups (1)
  https://zhuanlan.zhihu.com/p/572820342
* Introduction to CUDA Programming: Cooperative Groups (2)
  https://zhuanlan.zhihu.com/p/586770010

![image](https://hackmd.io/_uploads/S1a9m5qgR.png)

One benefit of using cooperative groups is that we can prevent deadlock. In the picture above, provided by the book, `__syncthreads()` is not reached by all threads, causing a deadlock.

### Loop Unrolling in the CUDA Kernel

As when programming on a CPU, loop unrolling can be used to reduce or remove loop-control overheads.
(*end-of-loop tests on each iteration, branch penalties*) Many parallel-programming and optimization tutorials recommend this technique (refer to the SIMD programming notes).

```cpp
#pragma unroll
for (int offset = group.size() / 2; offset > 0; offset >>= 1)
    value += group.shfl_down(value, offset);
```

:::warning
:warning: **Important**
Unrolled code may result in **lower occupancy** due to increased register usage, and the larger code size can incur a **higher instruction-cache miss** penalty.
:::

## Kernel Execution Model and Optimization Strategies

### CUDA Streams

:::info
:bulb: **External Resource**
**How to Overlap Data Transfers in CUDA C/C++**
> https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
:::

CUDA operations that do not specify a stream run on the default stream. The default stream is a synchronizing stream: no operation in it will begin until all previously issued operations in any stream on the device have completed.

:::success
:bulb: **Requirements for Overlapping Kernel Execution and Data Transfers**
* The kernel execution and the data transfer to be overlapped must both occur **in different, non-default streams**.
* The host memory involved in the data transfer must be **pinned memory**.
* Take the number of copy engines and kernel engines into consideration.
* If there is **only one copy engine**, a related **H2D** and **D2H** pair should not be launched back to back.
:::

#### Code Example Provided by the NVIDIA Technical Blog

> https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/

```cpp
// Pattern 1
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}

// Pattern 2: batch the copies and the compute kernels together
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
}
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}
```

![](https://hackmd.io/_uploads/rJvN3aa83.png)

### CUDA Dynamic Parallelism

Dynamic parallelism can be used to address load imbalance when working on graphs, trees, or quicksort (*recursive workloads*). A CUDA kernel can launch a child grid; that is, a kernel can itself launch CUDA kernels. In general, dynamic parallelism may conservatively reserve up to 150 MB of device memory to track pending grid launches and to save the parent grid's state when it synchronizes on a child grid launch. Nested kernel launches are limited to 24 levels.

### Grid-level Cooperative Groups

*TODO: Move to the previous section.*

### Multi-Process Service

*TODO: Code example (requires new hardware and OS ready)*

:::info
:information_source: **External Resources**
**A Summary of NVIDIA MPS**
> https://asphelzhn.github.io/2019/04/14/tensor_09_MPS/
:::

Before introducing **MPS**, we need to understand **CUDA contexts** and the **Hyper-Q** technology.
A **CUDA context** contains information such as the GPU global memory allocated from the CPU, local memory defined in kernels, CUDA streams and events, and code blocks. Each process has its own CUDA context, and each context has its own address space that cannot be accessed from other CUDA contexts.

**Hyper-Q (Queue)** allows multiple CPU threads within a process to submit work to a GPU, and CUDA kernels can utilize the GPU concurrently: *if kernel A is running and the GPU still has enough SMs and resources to run a thread block of kernel B, kernel B will be launched*. Hyper-Q was first implemented on Kepler GPUs (compute capability 3.5 and above).

**MPS** collects jobs from different processes and deploys them to the GPU as if they came from a single process. We can think of Hyper-Q as supporting OpenMP-style multithreading, while MPS supports MPI-style multiprocessing.

:::info
:bulb: **GPU Operation Mode on Windows**
> https://docs.nvidia.com/nsight-visual-studio-edition/reference/index.html#:~:text=NVIDIA%20GPUs%20can%20exist%20in,mode%3B%20used%20in%20gaming%20graphics.

On Windows, an NVIDIA GPU can run in one of two modes: **TCC** or **WDDM**. *WDDM mode limits the maximum run time of a CUDA kernel to around 5 seconds*. TCC mode removes graphics functionality and dedicates the card purely to computation.
:::

:::info
:bulb: **GPU Utilization Explanation**
**How to keep squeezing compute power out of your GPU (教你如何繼續壓榨GPU的算力)**
> https://zhuanlan.zhihu.com/p/346389176

*GPU Utilization:* the percentage of time over the past sample period during which **one or more kernels** were executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.

In the following diagram, only one kernel is scheduled on the GPU at any time, which suggests the CUDA code synchronizes (e.g., via `cudaDeviceSynchronize()`) after each kernel launch.

![](https://hackmd.io/_uploads/By9IOPOvn.png)
:::

### CUDA-aware MPI

*TODO: Pending*
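
The core idea: with a CUDA-aware MPI implementation (e.g., Open MPI built with CUDA support), device pointers can be passed directly to MPI calls, avoiding explicit staging through host memory. A minimal hypothetical sketch (all names illustrative; error checking omitted):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    float *d_buf;                        // device buffer
    cudaMalloc(&d_buf, N * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   // device ptr
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           // device ptr

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Under the hood, a CUDA-aware MPI can use GPUDirect (peer-to-peer or RDMA) to move the data without a host-side bounce buffer; with a non-CUDA-aware MPI, the same code would require explicit `cudaMemcpy` to host buffers around each MPI call.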
