# IRGPU: Getting started with CUDA

# CUDA overview

## What is CUDA ?

A product
- It enables the use of NVIDIA GPUs for computation

A C/C++ variant
- Mostly compatible with modern C++, with extensions
- and also some restrictions !

An SDK
- A set of compilers and toolchains for various architectures
- Performance analysis tools

A runtime
- An assembly specification
- Computation libraries (linear algebra, etc.)

A new industry standard
- Used by every major deep learning framework
- Replacing OpenCL as Vulkan is replacing OpenGL

## The CUDA ecosystem (2021)

![](https://i.imgur.com/jyZ9il1.png)

## Libraries *or* Compiler Directives *or* Programming Language ?

CUDA is mostly based on a "new" **programming language**: CUDA C (or C++, or Fortran)

> This grants much flexibility and performance

But it also exposes much of the GPU's capabilities through **libraries**

And it supports a few **compiler directives** to facilitate some constructs

## The big idea: Kernels instead of loops

![](https://i.imgur.com/MomimT7.png)
![](https://i.imgur.com/RR1Uc8A.png)
![](https://i.imgur.com/2zpoK0Z.png)

:::success
No more `for` loop !
:::

## Arrays of parallel threads

A CUDA kernel is executed by a **grid** (array) of threads
- All threads in the grid run the same kernel code (Single Program Multiple Data)
- Each thread has indices that are used to compute memory addresses and make control decisions

![](https://i.imgur.com/piRxw0Q.png)

## Thread blocks

Threads are grouped into thread blocks
- Threads within a block cooperate via
  - *shared memory*
  - *atomic operations*
  - *barrier synchronization*
- Threads in different blocks do not interact

![](https://i.imgur.com/tQmgcy8.png)

### A multidimensional grid of computation threads

Each thread uses indices to decide what data to work on:

![](https://i.imgur.com/IYWVCfx.png)

Each index has $x$, $y$ and $z$ attributes

Grids and blocks can have different dimensions, but they are usually two levels of the same work decomposition

![](https://i.imgur.com/UzThWvv.png)

### Examples

![](https://i.imgur.com/jI0Buab.png)
![](https://i.imgur.com/j5pp4Yl.png)

## Block decomposition enables automatic scalability

![](https://i.imgur.com/J2dsHmg.png)

# Architecture

## Programming model

**Block**
A set of *threads* that cooperate:
- Synchronization
- Shared memory
- Block ID = ID in a **grid**

**Grid**
Array of blocks executing the same *kernel*
- Access to global GPU memory
- Synchronization by stopping the current kernel and starting a new one
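As a concrete illustration of "kernels instead of loops" and of thread indexing inside a grid of blocks, here is a minimal sketch of a vector-addition kernel (the name `vecAdd` and its signature are illustrative, not taken from the course material):

```cuda
// Each thread handles one array element instead of one loop iteration.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    // Global index of this thread within the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)              // guard: the grid may contain a few extra threads
        c[i] = a[i] + b[i];
}
```

Launched with enough blocks to cover all `n` elements, this replaces the sequential `for` loop of a CPU implementation.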
![](https://i.imgur.com/w77cxAe.png)
![](https://i.imgur.com/Ol0ccWU.png)

## Mapping the programming model to hardware

![](https://i.imgur.com/PpSdlGS.png)

### The SMs

![](https://i.imgur.com/ECGCbjm.png)

## Zoom on the SM

![](https://i.imgur.com/fIOI7dS.png)

:::info
*warp*: 32 compute units
:::

- The SM organizes **blocks** into **warps**
- 1 warp = group of 32 threads

GTX 920:
- 128 cores = 4 x 32 cores
- Quad warp scheduler selects 4 warps (TLP)
- And 2 independent instructions per warp can be dispatched each cycle (ILP)

> Ex: 1 (logical) *block* of 96 threads maps to: 3 (physical) *warps* of 32 threads

## Zoom on the CUDA cores

![](https://i.imgur.com/aNJ0IX9.png)

- A *warp* executes 32 threads on the 32 CUDA cores
- The threads execute *the same* instructions (*DLP*)
- All instructions are SIMD (width = 32) instructions

Each core:
- Floating point & integer unit
- Fused multiply-add (FMA) instruction
- Logic unit
- Move, compare unit
- Branch unit
- The first IF/ID stage of the pipeline is done by the SM

:::warning
SIMT lets you program the execution of a single thread, while the hardware executes it as SIMD across the warp
:::

## The SIMT Execution Model on CUDA cores

SIMT: you program as if a single thread processed a single data element, but behind the scenes this maps to SIMD instructions (`a + b` becomes the sum of vector `a` with vector `b`)

:::danger
Every thread executes the same kernel and the same instructions
:::

- Divergent code paths (branching) pile up!

![](https://i.imgur.com/HhwTHnQ.png)
![](https://i.imgur.com/V8hUjR1.png)

If/else: all the threads step through both the `if` and the `else` parts together

### If/else

What is the latency of this code in the best and worst case ?

![](https://i.imgur.com/1P8XxsN.png)

- Best case: $a > 0$ is false for every thread. For all threads: `inst-d`
- Worst case: $a > 0$ and $b > 0$ are true for some but not all threads. For all threads: `inst-a`, `inst-b`, `inst-c`, `inst-d`

### Loops

![](https://i.imgur.com/W47FpFq.png)

## Final note about terminology

![](https://i.imgur.com/HR8PZ9G.png)
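To make the cost of divergence concrete, here is a hedged sketch of a divergent kernel (names and operations are illustrative): when threads of the same warp take different branches, the warp executes both paths one after the other with inactive threads masked off, so the latencies add up.

```cuda
__global__ void divergent(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    // Threads of a warp that disagree on this condition force the warp
    // to run both branches sequentially (inactive lanes are masked).
    if (a[i] > 0.f)
        out[i] = a[i] * b[i];
    else
        out[i] = a[i] + b[i];
}
```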
# GPU memory model

## Computation cost vs. memory cost

Power measurements on NVIDIA GT200

![](https://i.imgur.com/8rcKY55.png)

With the same amount of energy:
- Load 1 word from external memory (DRAM), or
- Compute 44 flops

:::success
Must optimize memory first
:::

## External memory: discrete GPU

Classical CPU-GPU model
- Split memory space
- Highest bandwidth from GPU memory
- Transfers to main memory are slower

![](https://i.imgur.com/v7rzntd.png)

> Intel i7 4770 / GTX 780

## External memory: embedded GPU

Most GPUs today:
- Same memory
- May support memory coherence (GPU can read directly from CPU caches)
- More contention on external memory

![](https://i.imgur.com/fmaa5LI.png)

## GPU: on-chip memory

Cache area in CPU vs GPU:

![](https://i.imgur.com/WUxFyh8.png)

But if we include registers:

![](https://i.imgur.com/IKccvrI.png)

:::success
The GPU has many more registers, but made of simpler memory
:::

## Memory model hierarchy

### Hardware

![](https://i.imgur.com/JdjOci5.png)

Cache hierarchy:
- Keeps frequently-accessed data
- Reduces throughput demand on main memory
- Managed by hardware (L1, L2) or software (shared memory)

On CPU, caches are designed to avoid memory latency
On GPU, multi-threading deals with memory latency

### Software

![](https://i.imgur.com/K1c7Aqe.png)
![](https://i.imgur.com/LvmwjeC.png)

## Building and running a simple program

![](https://i.imgur.com/OU0vg4l.png)

## What you need to get started

- NVIDIA GPU hardware
- NVIDIA GPU drivers, properly loaded
- CUDA runtime libraries
- CUDA SDK (NVCC compiler in particular)

# Summary

- Host vs Device $\leftrightarrow$ Separate memory
- GPUs are computation units which require explicit usage, as opposed to a CPU
- Need to load data to and fetch results from the device
- Replace loops with kernels
- Kernel = function computed in relative isolation on small chunks of data
- Divide the work
- Compile and run using the CUDA SDK

# Host view of GPU computation

## Sequential and parallel sections

- We use the GPU(s) as co-processor(s)

![](https://i.imgur.com/kO7Anul.png)

## CUDA memory primitives

![](https://i.imgur.com/NE4pw2T.png)

**Why 2D and 3D variants ?**
- Strong alignment requirements in device memory
- Enables correct loading of memory chunks to SM caches (correct bank alignment)
- Proper stride management in an automated fashion

## Host $\leftrightarrow$ Device memory transfer

![](https://i.imgur.com/y922J5t.png)

## Almost complete code

![](https://i.imgur.com/wWOFzSV.png)

## Checking errors

In practice, check for API errors

![](https://i.imgur.com/0i5xiHN.png)

# Intermission: *Can I use memory management functions inside kernels ?*

**No**: `cudaMalloc()`, `cudaMemcpy()` and `cudaFree()` shall be called from the host only

However, kernels may allocate, use and reclaim memory dynamically using regular `malloc()`

# Fix the kernel invocation line

We want to fix this line:

![](https://i.imgur.com/6GMimst.png)

Kernel invocation syntax:

![](https://i.imgur.com/w89y6Lw.png)

## How to set `gridDim` and `blockDim` properly ?

Lvl 0: Naive trial with as many threads as possible

![](https://i.imgur.com/xtNi10X.png)

- Will fail with large vectors
  - Hardware limitation on the maximum number of threads per block (1024 for compute capability 3.0-7.5)
- Will fail with vectors whose size is not a multiple of the warp size

Lvl 1: It works with just enough blocks

![](https://i.imgur.com/9dpZy3p.png)

Lvl 2: Tune the block size given the kernel requirements and hardware constraints

![](https://i.imgur.com/VvcBIdo.png)
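Putting the host-side steps together (allocation, transfers, the "just enough blocks" launch configuration, and error checking), here is a hedged end-to-end sketch; the kernel (the illustrative `vecAdd` from earlier), the buffer names and the block size of 256 are assumptions, not taken from the slides:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.f; h_b[i] = 2.f; }

    // Device buffers: separate memory space on a discrete GPU
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // "Just enough" blocks: round up so every element gets a thread
    const int blockSize = 256;   // illustrative choice, within the 1024 threads/block limit
    const int gridSize  = (n + blockSize - 1) / blockSize;
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Check the launch itself, then wait for completion
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();

    // Fetch the result back to the host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```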
## But wait...

![](https://i.imgur.com/hGbEKSS.png)

:::warning
This code prints nothing !
:::

:::danger
Kernel invocation is asynchronous
:::

![](https://i.imgur.com/r2N2gWq.png)

:::success
Host code synchronization requires `cudaDeviceSynchronize()` because **kernel invocation is asynchronous** from the host's perspective.
:::

On the device, kernel invocations are strictly sequential (unless you schedule them on different *streams*)

# Intermission: *Can I launch kernels inside kernels ?*

**Yes**. This is the basis of *dynamic parallelism*

Some restrictions on the stack size apply.

Remember that the device runtime is a functional subset of the host runtime, i.e. you can perform device management, kernel launching, device `memcpy`, etc., but with some restrictions

The compiler may inline some of those calls.

# Conclusion about the host-only view

A host-only view of the computation is sufficient in most cases:
1. upload data to the device
2. fire a kernel
3. download output data from the device

Advanced CUDA requires making sure we saturate the SMs, and may imply some kernel study to determine the best:
- number of threads per block
- number of blocks per grid
- work per thread (if applicable)
- ...

This depends on:
- hardware specifications: maximum `gridDim` and `blockDim`, etc.
- kernel code: amount of registers and shared memory used by each thread

# Kernel programming

## Several API levels

We now want to program kernels

There are several APIs available:
- PTX assembly
- Driver API (C)
- Runtime C++ API $\leftarrow$ **let's use this one**

## Function Execution Space Specifiers

![](https://i.imgur.com/iGYXzbR.png)

- `__global__` defines a kernel function
  - Each `__` consists of 2 underscore characters
  - A kernel function must return `void`
  - It may be called from another kernel for devices of compute capability 3.2 or higher (Dynamic Parallelism support)
- `__device__` and `__host__` can be used together
- `__host__` is optional if used alone

## Built-in Vector Types

They make it easy to work with data like images

**Alignment must be respected** in all operations

![](https://i.imgur.com/KoGHgLn.png)
![](https://i.imgur.com/Lvk3u6h.png)
![](https://i.imgur.com/BzDXwjl.png)

:::info
They are all structures
:::

They all come with a constructor function of the form `make_<type name>`

![](https://i.imgur.com/w2vZgUb.png)

The 1st, 2nd, 3rd and 4th components are accessible through the fields $x$, $y$, $z$ and $w$ respectively

## Built-in variables

![](https://i.imgur.com/ZuyLHCx.png)

### Example

![](https://i.imgur.com/BKYIIF9.png)

## Memory hierarchy

![](https://i.imgur.com/5W9tTTJ.png)
![](https://i.imgur.com/pMWK6zQ.png)

## Types of Memory

- Registers
  - Used to store parameters, local variables, etc.
  - Very fast
  - Private to each thread
  - Lots of threads $\Rightarrow$ little memory per thread
- Shared
  - Used to store temporary data
  - Very fast
  - *Shared* among all threads in a block
- Constant
  - A special cache for read-only values
- Global
  - Large and slow
- Caches
  - Transparent use
- Local

## Salient features of Device Memory

![](https://i.imgur.com/GaNsqf7.png)
![](https://i.imgur.com/MgD8NoP.png)

## Cost to access memory

![](https://i.imgur.com/bjn3Pvg.png)

## Variable Memory Space Specifiers

**How to declare CUDA variables**

![](https://i.imgur.com/KnC7phe.png)

*Remarks:*
- `__device__` is optional when used with `__shared__` or `__constant__`
- Automatic variables reside in a register

*Where to declare variables ?* Can the host access it ?

| Yes | No |
|---|---|
| **global** and **constant**: declare outside of any function | **register** and **shared**: use or declare in the kernel |
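The following hedged sketch shows where each specifier typically appears (the names `coeffs`, `globalCounter`, `smooth` and the tile size are made up for illustration):

```cuda
__constant__ float coeffs[16];          // constant memory: written from the host, read-only on device
__device__   int   globalCounter = 0;   // global memory: declared outside any function

__global__ void smooth(const float* in, float* out, int n)
{
    __shared__ float tile[256];         // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.f;    // automatic variable: lives in a register

    tile[threadIdx.x] = x;
    __syncthreads();                    // make the tile visible to the whole block

    if (i < n)
        out[i] = tile[threadIdx.x] * coeffs[0];
}
```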
## What can be shared by whom ?

Possible memory access:
- Among threads in the same grid (a kernel invocation)
  - Global memory
- Among threads in the same block
  - Global memory
  - Shared memory

## Relaxed consistency memory model

The CUDA programming model assumes a device with a **weakly-ordered memory model**, that is, the order in which a CUDA thread writes data to shared memory or global memory is not necessarily the order in which the data is observed being written by another CUDA or host thread

### Example

![](https://i.imgur.com/cS4cxhv.png)
![](https://i.imgur.com/m5MoQtN.png)

*Possible outcomes for thread 2 ?*

![](https://i.imgur.com/G48Oday.png)
![](https://i.imgur.com/jInFIjp.png)

## Memory Fence Functions

Memory fence functions can be used to enforce some ordering on memory accesses

![](https://i.imgur.com/GoKrGxt.png)

Ensures that:
- All writes to all memory made by the calling thread before the call to `__threadfence_block()` are observed by all threads in the block as occurring before all writes made by the calling thread after the call
- All reads from all memory made by the calling thread before the call are ordered before all reads made by the calling thread after the call

## Synchronization functions

![](https://i.imgur.com/e5hbmpM.png)

:::danger
Stronger than `__threadfence()` because it also synchronizes the execution
:::

![](https://i.imgur.com/mVylLzi.png)

## Atomic functions

:::info
Atomic functions perform a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory
:::

Most of the atomic functions are available for all the numerical types

### Arithmetic functions

![](https://i.imgur.com/wmwCTvH.png)

# Debugging, performance analysis and profiling

## `printf`

Possible since Fermi devices (Compute Capability 2.x and higher)

Limited amount of lines:
- circular buffer flushed at particular times

## Global memory write

To dump and then inspect a larger amount of intermediate data

Analysis code should be removed for production

### Example

![](https://i.imgur.com/tRumZ46.png)

## CUDA tools

![](https://i.imgur.com/dg1HAod.png)

## The complete compilation trajectory

![](https://i.imgur.com/bgsNioU.png)
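To tie the atomics and debugging sections together, here is a hedged sketch (the kernel name, the `errorCount` symbol and the use case are made up) of a kernel that counts suspicious values with `atomicAdd()` and reports a few of them with device-side `printf`:

```cuda
#include <cstdio>

__device__ int errorCount = 0;   // counter in global memory

__global__ void checkValues(const float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] < 0.f)
    {
        // Atomic read-modify-write on a 32-bit word in global memory
        int slot = atomicAdd(&errorCount, 1);

        // Device-side printf writes to a circular buffer, flushed e.g. at synchronization
        if (slot < 10)              // keep the output small
            printf("negative value %f at index %d\n", data[i], i);
    }
}
```

The host can read `errorCount` back with `cudaMemcpyFromSymbol()` after a `cudaDeviceSynchronize()`, and such a `.cu` file is compiled as usual with the NVCC toolchain (e.g. `nvcc -o check check.cu`).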