IRGPU: Getting started with CUDA
CUDA overview
What is CUDA?
A product
- It enables the use of NVidia GPUs for computation
A C/C++ variant
- Mostly modern C++ (C++14/17) compatible, with extensions
- and also some restrictions!
An SDK
- A set of compilers and toolchains for various architectures
- Performance analysis tools
A runtime
- An assembly specification
- Computation libraries (linear algebra, etc.)
A new industry standard
- Used by every major deep learning framework
- Replacing OpenCL as Vulkan is replacing OpenGL
The CUDA ecosystem (2021)
Libraries, Compiler Directives, or Programming Language?
CUDA is mostly based on a "new" programming language: CUDA C (or C++, or Fortran)
This grants great flexibility and performance
But it also exposes much of the GPU's capabilities through libraries
And it supports a few compiler directives to facilitate some constructs
The big idea: Kernels instead of loops
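The original illustrations are unavailable. As a minimal stand-in sketch (identifiers ours), here is the same element-wise addition written first as a sequential loop and then as a CUDA kernel:

```cpp
// CPU version: a single thread walks the whole array.
void add_loop(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA version: the loop disappears; each thread handles one element.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard the tail
        c[i] = a[i] + b[i];
}
```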
Arrays of parallel threads
A CUDA kernel is executed by a grid (array) of threads
- All threads in the grid run the same kernel code (Single Program Multiple Data)
- Each thread has indices that are used to compute memory addresses and make control decisions
Thread blocks
Threads are grouped into thread blocks
- Threads within a block cooperate via
- shared memory
- atomic operations
- barrier synchronization
- Threads in different blocks do not interact
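As a hedged sketch of block-level cooperation (kernel name and sizes ours, assuming 256-thread blocks and an input length that is a multiple of 256), threads stage data in shared memory and hit a barrier before reading values written by other threads of their block:

```cpp
// Each block reverses its own 256-element chunk of the array in place.
__global__ void reverse_chunks(int* data) {
    __shared__ int tile[256];             // shared memory: visible to all threads of the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    tile[t] = data[i];                    // stage the value in shared memory
    __syncthreads();                      // barrier: wait for the whole block
    data[i] = tile[blockDim.x - 1 - t];   // read a value written by another thread
}
```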
A multidimensional grid of computation threads
Each thread uses indices to decide what data to work on:
Each index has x, y and z attributes
Grids and blocks can have different dimensionality, but they are usually two levels of the same work decomposition
Examples
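The example figures are not available; as a stand-in sketch (names ours), a 2D grid of 2D blocks where each thread derives its own pixel coordinate:

```cpp
__global__ void invert(unsigned char* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;      // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;      // row
    if (x < width && y < height)                        // partial blocks at the borders
        img[y * width + x] = 255 - img[y * width + x];
}

// Launch side: cover the image with 16x16 blocks.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// invert<<<grid, block>>>(d_img, width, height);
```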
Block decomposition enables automatic scalability
Architecture
Programming model
Block
A set of threads that cooperate:
- Synchronization
- Shared memory
- Block ID = ID in a grid
Grid
Array of blocks executing the same kernel
- Access to global GPU memory
- Synchronization by stopping the kernel and starting a new one
Mapping the programming model to hardware
The SMs (Streaming Multiprocessors)
Zoom on the SM
Warp: 32 execution units
- The SM organizes blocks into warps
- 1 warp = group of 32 threads
GTX 920:
- 128 cores = 4 x 32 cores
- Quad warp scheduler selects 4 warps (TLP)
- 2 independent instructions per warp can be dispatched each cycle (ILP)
Ex: 1 (logical) block of 96 threads maps to 3 (physical) warps of 32 threads
Zoom on the CUDA cores
- A warp executes 32 threads on the 32 CUDA cores
- The threads execute the same instruction (DLP)
- All instructions are SIMD (width = 32) instructions
Each core:
- Floating point & integer unit
- Fused multiply-add (FMA) instruction
- Logic unit
- Move, compare unit
- Branch unit
- The first IF/ID (fetch/decode) stage of the pipeline is done by the SM
SIMT lets the programmer specify the execution as independent scalar threads
The SIMT Execution Model on CUDA cores
SIMT: you program as if each thread processed a single data element, but this is hidden SIMD underneath (a + b becomes the sum of vector a with vector b)
Each thread executes the same kernel and the same instructions
- Divergent code paths (branching) pile up!


If/else: all the threads step through both the if and the else parts together
If/else
What is the latency of this code in the best and worst case?

- Best case: the condition is false for every thread. For all threads: inst-d
- Worst case: the condition is true for some but not all threads. For all threads: inst-a, inst-b, inst-c, inst-d
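The kernel discussed here is not shown; as an illustrative sketch of such a branch (the labels inst-a to inst-d are mapped onto arbitrary statements of ours):

```cpp
__global__ void branchy(const int* flags, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (flags[i]) {         // lanes of the same warp may disagree here
        out[i] += 1;        // inst-a
        out[i] *= 2;        // inst-b (the other lanes sit masked off meanwhile)
    } else {
        out[i] -= 1;        // inst-c
    }
    out[i] ^= 0xFF;         // inst-d: executed by every thread
}
```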
Loops

Final note about terminology

GPU memory model
Computation cost vs. memory cost
Power measurements on NVIDIA GT200

With the same amount of energy, you can either:
- Load 1 word from external memory (DRAM)
- Compute 44 flops
So memory accesses must be optimized first
External memory: discrete GPU
Classical CPU-GPU model
- Split memory space
- Highest bandwidth from GPU memory
- Transfers to main memory are slower

Intel i7 4770 / GTX 780
External memory: embedded GPU
Most GPUs today:
- Same memory
- May support memory coherence (GPU can read directly from CPU caches)
- More contention on external memory

GPU: on-chip memory
Cache area in CPU vs GPU:

But if we include registers:

The GPU has many more registers, but they are made of simpler memory
Memory model hierarchy
Hardware

Cache hierarchy:
- Keep frequently-accessed data close to the cores
- Reduce throughput demand on main memory
- Managed by hardware (L1, L2) or software (shared memory)
On CPU, caches are designed to avoid memory latency
On GPU, multi-threading deals with memory latency
Software


Building and running a simple program

What you need to get started
- NVidia GPU hardware
- NVidia GPU drivers, properly loaded
- CUDA runtime libraries
- CUDA SDK (NVCC compiler in particular)
Summary
- Host vs Device: separate memory
- GPUs are computation units which require explicit management, as opposed to a CPU
- Need to load data and fetch result from device
- Replace loops with kernels
- Kernel = Function computed in relative isolation on small chunks of data
- Divide the work
- Compile and run using CUDA SDK
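As a hedged end-to-end sketch tying these steps together (a standard vector addition, not the course's exact code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host data
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.f; hb[i] = 2.f; }

    // Device buffers and upload
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Replace the loop with a kernel: divide the work into blocks
    const int block = 256;
    const int grid = (n + block - 1) / block;
    add<<<grid, block>>>(da, db, dc, n);

    // Fetch the result (this cudaMemcpy waits for the kernel to finish)
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);   // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```

With the SDK installed, this compiles with something like nvcc vector_add.cu -o vector_add (file name yours to choose).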
Host view of GPU computation
Sequential and parallel sections
- We use the GPU(s) as co-processor(s)

CUDA memory primitives

Why 2D and 3D variants?
- Strong alignment requirements in device memory
- Enables correct loading of memory chunks to SM caches (correct bank alignment)
- Proper stride management in an automated fashion
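For reference, the 1D and 2D allocation primitives of the runtime API look like this (buffer names ours):

```cpp
#include <cuda_runtime.h>

void allocate_buffers(int n, int width, int height) {
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));           // linear (1D) allocation

    float* d_img = nullptr;
    size_t pitch = 0;                                // padded row size in bytes, chosen by the runtime
    cudaMallocPitch(&d_img, &pitch,
                    width * sizeof(float), height);  // aligned 2D allocation

    // Element (x, y) of the pitched buffer would be accessed as:
    //   float* row = (float*)((char*)d_img + y * pitch); float v = row[x];

    cudaFree(d_buf);
    cudaFree(d_img);
}
```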
Host Device memory transfer
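The transfer itself boils down to cudaMemcpy with an explicit direction (the snippet below uses hypothetical buffer names):

```cpp
#include <cuda_runtime.h>

// h_data: host buffer, d_data: device buffer, both `bytes` bytes long (hypothetical names).
void round_trip(float* h_data, float* d_data, size_t bytes) {
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // upload: host to device
    // ... launch kernels working on d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // download: device to host
}
```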

Almost complete code

Checking errors
In practice check for API errors
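A common way to do this (a helper of ours, not part of the CUDA API) is a checking macro wrapped around every runtime call:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_buf, bytes));
//   my_kernel<<<grid, block>>>(d_buf);
//   CUDA_CHECK(cudaGetLastError());   // catches kernel launch errors
```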

Intermission: Can I use memory management functions inside kernels?
No: cudaMalloc(), cudaMemcpy() and cudaFree() shall be called from the host only
However, kernels may allocate, use and reclaim memory dynamically using regular malloc() and free()
Fix the kernel invocation line
We want to fix this line:

Kernel invocation syntax:
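The general triple-chevron form (placeholder names below) is:

```cpp
// kernel_name<<<gridDim, blockDim, sharedMemBytes, stream>>>(arguments...);
// The last two parameters are optional; gridDim and blockDim are dim3 (or plain ints in 1D).
my_kernel<<<numBlocks, threadsPerBlock>>>(d_data, n);
```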

How to set gridDim and blockDim properly?
Lvl 0: Naive trial with as many threads as possible

- Will fail with large vectors
- Hardware limitation on the maximum number of threads per block (1024 for compute capability 3.0-7.5)
- Will fail with vectors whose size is not a multiple of the warp size
Lvl 1: It works with just enough blocks
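A sketch of the usual pattern (kernel and buffer names ours): round the number of blocks up so every element is covered, and guard the tail inside the kernel.

```cpp
__global__ void vec_add(const float* a, const float* b, float* c, int n);  // defined elsewhere

void launch_vec_add(const float* d_a, const float* d_b, float* d_c, int n) {
    const int threadsPerBlock = 256;                                    // a multiple of the warp size
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    vec_add<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);          // the kernel tests i < n
}
```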

Lvl 2: Tune block size given the kernel requirements and hardware constraints
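One possible approach (not necessarily the one used in the course) is the runtime occupancy helper, which suggests a block size from the kernel's register and shared-memory usage:

```cpp
__global__ void vec_add(const float* a, const float* b, float* c, int n);  // defined elsewhere

void launch_tuned(const float* d_a, const float* d_b, float* d_c, int n) {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vec_add, 0, 0);
    const int numBlocks = (n + blockSize - 1) / blockSize;
    vec_add<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
}
```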

But wait…

This code prints nothing!
Kernel invocation is asynchronous

Host code synchronization requires cudaDeviceSynchronize(), because kernel invocation is asynchronous from the host's perspective.
On the device, kernel invocations are strictly sequential (unless you schedule them on different streams)
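A minimal sketch of the issue and its fix:

```cpp
#include <cstdio>

__global__ void hello() {
    printf("Hello from thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();         // returns immediately: the launch is asynchronous
    cudaDeviceSynchronize();   // without this, main() may exit before any output is flushed
    return 0;
}
```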
Intermission: Can I launch kernels from inside kernels?
Yes. This is the basis of dynamic parallelism
Some restrictions over the stack size apply.
Remember that the device runtime is a functional subset of the host runtime, i.e. you can perform device management, kernel launching, device memcpy, etc., but with some restrictions
The compiler may inline some of those calls.
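A hedged sketch (kernel names ours; requires a device of compute capability 3.5+ and compilation with relocatable device code, e.g. nvcc -rdc=true):

```cpp
__global__ void child(int* data) {
    data[threadIdx.x] *= 2;
}

__global__ void parent(int* data) {
    if (threadIdx.x == 0)
        child<<<1, 32>>>(data);   // a kernel launched from device code
}
```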
Conclusion about the host-only view
A host-only view of the computation is sufficient for most of the cases:
- upload data to the device
- fire a kernel
- download output data from the device
Advanced CUDA requires making sure we saturate the SMs, and may imply some kernel analysis to determine the best:
- number of threads per block
- number of blocks per grid
- work per thread (if applicable)
- …
This depends on:
- hardware specifications: maximum gridDim and blockDim, etc.
- kernel code: amount of registers and shared memory used by each thread
Kernel programming
Several API levels
We now want to program kernels
There are several APIs available:
- PTX assembly
- Driver API (C)
- Runtime C++ API (let's use this one)
Function Execution Space Specifiers

__global__ defines a kernel function
- Each __ consists of 2 underscore characters
- A kernel function must return void
- It may be called from another kernel for devices of compute capability 3.2 or higher (Dynamic Parallelism support)
__device__ and __host__ can be used together
__host__ is optional if used alone
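An illustrative sketch of the specifiers in use (function names ours):

```cpp
__device__ float device_helper(int i) {         // callable from device code only
    return 2.0f * i;
}

__host__ __device__ float clamp_pos(float x) {  // compiled for both host and device
    return x < 0.0f ? 0.0f : x;
}

__global__ void kernel_entry(float* out) {      // kernel: returns void, launched from the host
    out[threadIdx.x] = clamp_pos(device_helper(threadIdx.x) - 3.0f);
}
```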
Built-in Vector Types
They make it easy to work with data like images
Alignment must be respected in all operations



They all come with a constructor function of the form make_<type name>

The 1st, 2nd, 3rd and 4th components are accessible through the fields x, y, z and w, respectively
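For instance (standard usage, not the slide's exact code):

```cpp
__host__ __device__ float vector_types_demo() {
    float4 p = make_float4(1.0f, 2.0f, 3.0f, 4.0f);  // constructor function
    uchar4 pixel = make_uchar4(255, 0, 0, 255);      // e.g. an RGBA pixel, 4-byte aligned
    return p.x + p.y + p.z + p.w + pixel.w;          // component access: x, y, z, w
}
```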
Built-in variables

Example
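The original snippet is missing; a classic reconstruction that uses all four built-ins (gridDim, blockIdx, blockDim, threadIdx) is the grid-stride loop:

```cpp
__global__ void scale(float* data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
    int stride = gridDim.x * blockDim.x;            // total number of threads in the grid
    for (; i < n; i += stride)                      // grid-stride loop handles any n
        data[i] *= alpha;
}
```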

Memory hierarchy


Types of Memory
- Registers
- Used to store parameters, local variables, etc.
- Very fast
- Private to each thread
- Lots of threads, so little memory per thread
- Shared
- Used to store temp data
- Very fast
- Shared among all threads in a block
- Constant
- A special cache for read-only values
- Global
- Caches
- Local
Salient features of Device Memory


Cost to access memory

Variable Memory Space Specifiers
How to declare CUDA variables

Remarks:
- __device__ is optional when used with __shared__ or __constant__
- Automatic variables reside in a register
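A sketch of the main specifiers in use (identifiers ours):

```cpp
__constant__ float coeffs[16];      // constant memory: read-only in kernels, set from the host

__device__ int hit_counter;         // global (device) memory variable, visible to all kernels

__global__ void filter(const float* in, float* out) {
    __shared__ float tile[256];     // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];                // automatic variable: lives in a register
    tile[threadIdx.x] = x * coeffs[0];
    __syncthreads();
    out[i] = tile[threadIdx.x];
}
```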
Where to declare variables?
Can host access it?
| Yes | No |
|---|---|
| global and constant: declare outside of any function | register and shared: declare in the kernel |
Which memory can be shared, and by whom?
Possible memory sharing:
- Among threads in the same grid (a kernel invocation): global memory
- Among threads in the same block: shared memory
Relaxed consistency memory model
The CUDA programming model assumes a device with a weakly-ordered memory model, that is, the order in which a CUDA thread writes data to shared or global memory is not necessarily the order in which another CUDA thread or the host observes the data being written
Example


Possible outcomes for thread 2?
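A classic illustration, following the CUDA Programming Guide (the outcomes in the comments assume no fence is used):

```cpp
__device__ int X = 1, Y = 2;

__device__ void writeXY() {   // executed by thread 1
    X = 10;
    Y = 20;
}

__device__ void readXY() {    // executed by thread 2
    int B = Y;
    int A = X;
    // With a weakly-ordered memory model, (A, B) may be observed as
    // (1, 2), (10, 2), (10, 20), or even (1, 20): the write to Y can
    // become visible before the write to X.
}
```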


Memory Fence Functions
Memory fence functions can be used to enforce some ordering on memory accesses

Ensures that:
- All writes to all memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block as occurring before all writes made by the calling thread after the call
- All reads from all memory made by the calling thread before the call are ordered before all reads made by the calling thread after the call
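A minimal sketch of how a fence restores one ordering guarantee in the example above:

```cpp
__device__ int X2 = 1, Y2 = 2;

__device__ void writeXY_fenced() {
    X2 = 10;
    __threadfence();   // any thread that later observes Y2 == 20 also observes X2 == 10
    Y2 = 20;
}
```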
Synchronization functions

__syncthreads() is stronger than __threadfence(), because it also synchronizes the execution (it acts as a barrier for all threads of the block)

Atomic functions
Atomic functions perform a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory
Most of the atomic functions are available for all the numerical types
Arithmetic functions
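For example (a standard histogram pattern, not the course's code):

```cpp
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // read-modify-write with no race between threads
}
```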

printf
Possible since Fermi devices (Compute Capability 2.x and higher)
Limited number of lines:
- circular buffer, flushed at particular times
Global memory write
To dump and then inspect a larger amount of intermediate data
Analysis code should be removed for production
Example
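The original example is missing; a hedged sketch combining both techniques (names ours):

```cpp
#include <cstdio>

__global__ void debug_square(const float* in, float* out, float* dbg, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] * in[i];
        if (i < 8)
            printf("thread %d: v = %f\n", i, v);  // small traces: device-side printf
        dbg[i] = v;                               // larger dumps: write to global memory, copy back, inspect on the host
        out[i] = v;
    }
}
```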


The complete compilation trajectory
