# IRGPU: Getting started with CUDA
# CUDA overview
## What is CUDA ?
**A product**
- It enables the use of NVIDIA GPUs for computation
**A C/C++ variant**
- Mostly C++14/C++17 compatible, with extensions
- and also some restrictions !
**An SDK**
- A set of compilers and toolchains for various architectures
- Performance analysis tools
**A runtime**
- An assembly specification
- Computation libraries (linear algebra, etc.)
**A new industry standard**
- Used by every major deep learning framework
- Replacing OpenCL as Vulkan is replacing OpenGL
## The CUDA ecosystem (2021)

## Libraries *or* Compiler Directives *or* Programming Language ?
CUDA is mostly based on a "new" **programming language**: CUDA C (or C++, or Fortran)
> This grants much flexibility and performance
But it also exposes much of the GPU's goodness through **libraries**
And it supports a few **compiler directives** to facilitate some constructs
## The big idea: Kernels instead of loops



:::success
No more `for` loops !
:::
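As a minimal sketch (the kernel and variable names are illustrative), the body of a CPU `for` loop becomes a kernel executed by one thread per element:

```cpp
// CPU version: an explicit loop over all elements
void vec_add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA version: the loop disappears, each thread handles one index
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: some threads may be extra
        c[i] = a[i] + b[i];
}
```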
## Arrays of parallel threads
A CUDA kernel is executed by a **grid** (array) of threads
- All threads in the grid run the same kernel code (Single Program Multiple Data)
- Each thread has indices that are used to compute memory addresses and make control decisions

## Thread blocks
Threads are grouped into thread blocks (see the sketch after this list)
- Threads within a block cooperate via
    - *shared memory*
    - *atomic operations*
    - *barrier synchronization*
- Threads in different blocks do not interact
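A hedged sketch of block-level cooperation (the kernel is ours, not from the course): threads stage data into shared memory, synchronize on a barrier, then read a value written by a neighbouring thread.

```cpp
__global__ void shift_left(const float* in, float* out, int n) {
    __shared__ float tile[256];                    // shared memory, assumes blockDim.x <= 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];                 // each thread stages one element
    __syncthreads();                               // barrier: wait for the whole block
    if (i + 1 < n && threadIdx.x + 1 < blockDim.x)
        out[i] = tile[threadIdx.x + 1];            // read a value written by another thread
}
```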

### A multidimensional grid of computation threads
Each thread uses indices to decide what data to work on:

Each index has $x$, $y$ and $z$ attributes
The grid and the blocks can have different dimensions, but they are usually two levels of the same work decomposition

### Examples
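As an illustrative example (not necessarily the one from the original slides), a 2D grid of 2D blocks mapped onto an image, where each thread computes its own pixel coordinates:

```cpp
__global__ void to_grayscale(const uchar4* in, unsigned char* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x < width && y < height) {
        uchar4 p = in[y * width + x];
        out[y * width + x] = (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
    }
}

// Typical launch:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// to_grayscale<<<grid, block>>>(d_in, d_out, width, height);
```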


## Block decomposition enables automatic scalability

# Architecture
## Programming model
**Block**
A set of *threads* that cooperate:
- Synchronisation
- Shared memory
- Block ID = ID in a **grid**
**Grid**
Array of blocks executing the same *kernel*
- Access to global GPU memory
- Sync. by stopping the current kernel and starting a new one


## Mapping Programming model to hardware

### The SMs

## Zoom on the SM

:::info
*warp*: 32 compute units
:::
- The SM organizes **blocks** into **warps**
- 1 warp = group of 32 threads
GTX 920:
- 128 cores = 4 x 32 cores
- Quad warp scheduler selects 4 warps (TLP)
- And 2 independent instructions per warp can be dispatched each cycle (ILP)
> Ex: 1 (logical) *block* of 96 threads maps to: 3 (physical) *warps* of 32 threads
## Zoom on the CUDA cores

- A *warp* executes 32 threads on the 32 CUDA cores
- The threads execute *the same* instructions (*DLP*)
- All instructions are SIMD (width = 32) instructions
Each core:
- Floating-point & integer unit
- Fused multiply-add (FMA) instruction
- Logic unit
- Move, compare unit
- Branch unit
- The first stage of the pipeline (instruction fetch/decode, IF/ID) is done by the SM
:::warning
SIMT lets you specify the execution of each thread individually, even though warps execute in SIMD lockstep
:::
## The SIMT Execution Model on CUDA cores
SIMT: you program as if a single thread processed a single data element, but SIMD instructions hide underneath (a + b becomes the sum of vector a with vector b)
:::danger
Every thread executes the same kernel and the same instructions
:::
- Divergent code paths (branching) pile up!


If/else: all the threads execute both the if and the else parts at the same time (inactive lanes are masked)
### If/else
What is the latency of this code in the best and worst case ?
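The original listing is not reproduced here; a plausible shape, consistent with the best/worst cases analysed below, would be (the `inst_*` calls are placeholders for the four instructions named in the slide):

```cpp
if (a > 0) {        // the predicate is evaluated by every thread of the warp
    inst_a();
    if (b > 0) {
        inst_b();   // executed with some lanes masked off if threads disagree
    } else {
        inst_c();
    }
}
inst_d();           // executed by all threads
```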

- Best case: $a\gt0$ is false for every thread. For all threads: `inst-d`
- Worst case: $a\gt0$ and $b\gt0$ is true for some but not all threads. For all threads: `inst-a`, `inst-b`, `inst-c`, `inst-d`
### Loops

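The loop example from the slides is not reproduced here; as a hedged illustration, a loop whose trip count depends on the thread keeps the whole warp busy until its slowest lane is done:

```cpp
__global__ void divergent_loop(const int* counts, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    // Each thread iterates a different number of times: the warp only
    // retires the loop once its longest-running thread has finished.
    for (int k = 0; k < counts[i]; ++k)
        acc += 1.0f / (k + 1);
    out[i] = acc;
}
```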
## Final note about terminology

# GPU memory model
## Computation cost vs. memory cost
Power measurements on NVIDIA GT200

With the same amount of energy:
- Load 1 word from external memory (DRAM)
- Compute 44 flops
:::success
Must optimize memory first
:::
## External memory: discrete GPU
Classical CPU-GPU model
- Split memory space
- Highest bandwidth from GPU memory
- Transfers to main memory are slower

> Intel i7 4770 / GTX 780
## External memory: embedded GPU
Most GPUs today:
- Same memory
- May support memory coherence (GPU can read directly from CPU caches)
- More contention on external memory

## GPU: on-chip memory
Cache area in CPU vs GPU:

But if we include registers:

:::success
The GPU has many more registers, but they are made of simpler memory
:::
## Memory model hierarchy
### Hardware

Cache hierarchy:
- Keep frequently-accessed data close to the cores
- Reduce throughput demand on main memory
- Managed by hardware (L1, L2) or software (shared memory)
On CPU, caches are designed to avoid memory latency
On GPU, multi-threading deals with memory latency
### Software


## Building and running a simple program

## What you need to get started
- NVIDIA GPU hardware
- NVIDIA GPU drivers, properly loaded
- CUDA runtime libraries
- CUDA SDK (NVCC compiler in particular)
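A minimal end-to-end check (file and kernel names are illustrative): compile with `nvcc` and run the resulting binary.

```cpp
// hello.cu -- build with:  nvcc hello.cu -o hello    then run:  ./hello
#include <cstdio>

__global__ void hello_kernel() {
    printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

int main() {
    hello_kernel<<<2, 4>>>();    // 2 blocks of 4 threads
    cudaDeviceSynchronize();     // wait for the kernel and flush device-side printf
    return 0;
}
```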
# Summary
- Host vs Device $\leftrightarrow$ Separate memory
- GPUs are computation units that require explicit usage, as opposed to a CPU
- Need to load data and fetch result from device
- Replace loops with kernels
- Kernel = Function computed in relative isolation on small chunks of data
- Divide the work
- Compile and run using CUDA SDK
# Host view of GPU computation
## Sequential and parallel sections
- We use the GPU(s) as co-processor(s)

## CUDA memory primitives

**Why 2D and 3D variants ?**
- Strong alignment requirements in device memory
- Enables correct loading of memory chunks to SM caches (correct bank alignment)
- Proper striding management in automated fashion
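A hedged sketch of the 1D and 2D allocation primitives (sizes are illustrative):

```cpp
int n = 1024, width = 640, height = 480;

float* d_vec = nullptr;
cudaMalloc((void**)&d_vec, n * sizeof(float));     // linear (1D) device allocation

float* d_img = nullptr;
size_t pitch = 0;                                  // actual row size in bytes, >= width * sizeof(float)
cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);  // padded 2D allocation

// ... use d_vec and d_img, indexing rows of d_img with the returned pitch ...
cudaFree(d_img);
cudaFree(d_vec);
```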
## Host $\leftrightarrow$ Device memory transfer

## Almost complete code

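The full listing from the slides is not reproduced here; the typical shape of the host code (kernel name and sizes are illustrative) is:

```cpp
// Host-side workflow: allocate, upload, launch, download, free
// Assumes a kernel:  __global__ void scale_by_two(const float* in, float* out, int n);
int n = 1 << 20;
size_t bytes = n * sizeof(float);
float* h_in  = (float*)malloc(bytes);
float* h_out = (float*)malloc(bytes);
// ... fill h_in ...

float *d_in, *d_out;
cudaMalloc((void**)&d_in, bytes);
cudaMalloc((void**)&d_out, bytes);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // upload input

int block = 256, grid = (n + block - 1) / block;
scale_by_two<<<grid, block>>>(d_in, d_out, n);

cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // download result
cudaFree(d_in); cudaFree(d_out);
free(h_in); free(h_out);
```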
## Checking errors
In practice check for API errors

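A common pattern (the macro name is ours, not part of the CUDA API) wraps every runtime call and aborts with the error string on failure:

```cpp
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void**)&d_in, bytes));
// my_kernel<<<grid, block>>>(...);
// CUDA_CHECK(cudaGetLastError());   // catches kernel launch errors
```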
# Intermission: *Can I use memory management functions inside kernels ?*
**No**: `cudaMalloc()`, `cudaMemcpy()` and `cudaFree()` shall be called from host only
However, kernels may allocate, use and reclaim memory dynamically using the in-kernel `malloc()` and `free()`
# Fix the kernel invocation line
We want to fix this line:

Kernel invocation syntax:

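The general form is shown below (the last two arguments, dynamic shared memory size and stream, are optional); the `vec_add` kernel from earlier is reused for illustration:

```cpp
// kernel<<< gridDim, blockDim [, sharedMemBytes [, stream]] >>>(arguments...);
dim3 grid(64), block(256);
vec_add<<<grid, block>>>(d_a, d_b, d_c, n);   // most common form: grid and block sizes only
```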
## How to set `gridDim` and `blockDim` properly ?
Lvl 0: Naive trial with as many threads as possible

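For a vector of `N` elements this amounts to something like (illustrative):

```cpp
vec_add<<<1, N>>>(d_a, d_b, d_c, N);   // a single block of N threads
```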
- Will fail with large vectors
- Hardware limitation on the maximum number of threads per block (1024 for compute capability 3.0-7.5)
- Will fail with vectors whose size is not a multiple of the warp size
Lvl 1: It works with just enough blocks

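A hedged sketch of the usual computation: round the number of blocks up so that every element gets a thread, and keep the `i < n` guard inside the kernel.

```cpp
int blockSize = 256;                                 // a typical fixed choice
int gridSize  = (N + blockSize - 1) / blockSize;     // ceil(N / blockSize)
vec_add<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);
```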
Lvl 2: Tune block size given the kernel requirements and hardware constraints

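One possibility (not necessarily what the course recommends) is to let the runtime suggest a block size based on the kernel's register and shared-memory usage, via the occupancy API:

```cpp
int minGridSize = 0, blockSize = 0;
// Suggests a block size that maximizes occupancy for this particular kernel
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vec_add, 0, 0);
int gridSize = (N + blockSize - 1) / blockSize;
vec_add<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);
```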
## But wait...

:::warning
This code prints nothing !
:::
:::danger
Kernel invocation is asynchronous
:::

:::success
Host code synchronization requires `cudaDeviceSynchronize()` because **kernel invocation is asynchronous** from host perspective.
:::
On the device, kernel invocations are strictly sequential (unless you schedule them on different *streams*)
# Intermission: *Can I make kernels inside kernels ?*
**Yes**. This is the basis of *dynamic parallelism*
Some restrictions over the stack size apply.
Remember that the device runtime is a functional subset of the host runtime, i.e. you can perform device management, kernel launching, device `memcpy`, etc., but with some restrictions
The compiler may inline some of those calls.
# Conclusion about the host-only view
A host-only view of the computation is sufficient for most of the cases:
1. upload data to the device
2. fire a kernel
3. download output data from the device
Advanced CUDA requires making sure we saturate the SMs, and may imply some kernel study to determine the best:
- number of threads per block
- number of blocks per grid
- work per thread (if applicable)
- ...
This depends on:
- hardware specifications: maximum `gridDim` and `blockDim`, etc.
- kernel code: amount of registers and shared memory used by each thread
# Kernel programming
## Several API levels
We now want to program kernels
There are several APIs available:
- PTX assembly
- Driver API (C)
- Runtime C++ API $\leftarrow$ **let's use this one**
## Function Execution Space Specifiers

- `__global__` defines a kernel function
- Each `__` consists of 2 underscore characters
- A kernel function must return void
- It may be called from another kernel for devices of compute capability 3.5 or higher (Dynamic Parallelism support)
- `__device__` and `__host__` can be used together
- `__host__` is optional if used alone
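A short illustration of the three specifiers (function names are ours):

```cpp
__device__ float twice(float x) {                  // callable from device code only
    return 2.0f * x;
}

__global__ void my_kernel(float* data) {           // kernel: launched from the host, returns void
    data[threadIdx.x] = twice(data[threadIdx.x]);
}

__host__ __device__ float lerp(float a, float b, float t) {   // compiled for both host and device
    return a + t * (b - a);
}
```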
## Built-in Vector Types
They make it easy to work with data like images
**Alignment must be respected** in all operations



:::info
They all are structures
:::
They all come with a constructor function of the form `make_<type name>`

The 1st, 2nd, 3rd and 4th components are accessible through the fields $x$, $y$, $z$ and $w$ respectively
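For instance (illustrative):

```cpp
float4 p = make_float4(1.0f, 2.0f, 3.0f, 4.0f);   // constructor function
float sum = p.x + p.y + p.z + p.w;                // component access
uchar4 pixel = make_uchar4(255, 0, 0, 255);       // e.g. an RGBA pixel
```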
## Built-in variables

### Example

## Memory hierarchy


## Types of Memory
- Registers
    - Used to store parameters, local variables, etc.
    - Very fast
    - Private to each thread
    - Lots of threads $\Rightarrow$ little memory per thread
- Shared
    - Used to store temporary data
    - Very fast
    - *Shared* among all threads in a block
- Constant
    - A special cache for read-only values
- Global
    - Large and slow
- Caches
    - Transparent use
- Local
## Salient features of Device Memory


## Cost to access memory

## Variable Memory Space Specifiers
**How to declare CUDA variables**

*Remarks:*
- `__device__` is optional when used with `__shared__` or `__constant__`
- Automatic variables reside in a register
*Where to declare variables ?*
Can host access it ?
|Yes|No|
|-|-|
|**global** and **constant**, declare outside of any function|**register** and **shared**, use or declare in the kernel|
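A hedged illustration of where each specifier typically appears (names are ours):

```cpp
__constant__ float coeffs[16];     // constant memory, filled from the host with cudaMemcpyToSymbol
__device__   float bigTable[4096]; // global memory, declared at file scope

__global__ void my_kernel(const float* in, float* out) {
    __shared__ float tile[256];    // shared memory, one copy per block
    int i = threadIdx.x;           // automatic variable -> register
    tile[i] = in[i] * coeffs[i % 16];
    __syncthreads();
    out[i] = tile[i];
}
```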
## Which memory can be shared, and by whom ?
Possible memory access:
- Among threads in the same grid (a kernel invocation)
    - Global memory
- Among threads in the same block
    - Global memory
    - Shared memory
## Relaxed consistency memory model
The CUDA programming model assumes a device with a **weakly-ordered memory model**: the order in which a CUDA thread writes data to shared memory or global memory is not necessarily the order in which another CUDA or host thread observes those writes
### Example

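The example from the slides is not reproduced here; a classic illustration (similar to the one in the CUDA Programming Guide) has two threads writing and reading two global variables:

```cpp
__device__ int X = 1, Y = 2;

__device__ void writeXY() {                // executed by thread 1
    X = 10;
    Y = 20;
}

__device__ void readXY(int* A, int* B) {   // executed by thread 2
    *B = Y;
    *A = X;
}
// With a weakly-ordered memory model, thread 2 may observe Y == 20
// while still reading X == 1: the writes of thread 1 are not guaranteed
// to be seen in program order.
```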

*Possible outcomes for thread 2 ?*


## Memory Fence Functions
Memory fence functions can be used to enforce some ordering on memory accesses

`__threadfence_block()` ensures that:
- All writes to all memory made by the calling thread before the call to `__threadfence_block()` are observed by all threads in the calling thread's block as occurring before all writes made by the calling thread after the call
- All reads from all memory made by the calling thread before the call are ordered before all reads from all memory made by the calling thread after the call
## Synch functions

:::danger
Stronger than `__threadfence()` because it also synchronizes the execution
:::

## Atomic functions
:::info
Atomic functions perform a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory
:::
Most of the atomic functions are available for all the numerical types
### Arithmetic functions

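For example, building a histogram in which many threads may hit the same bin concurrently (a sketch, names are ours):

```cpp
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // the read-modify-write is performed atomically
}
```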
# Debugging, performance analysis and profiling
## `printf`
Possible since Fermi devices (Compute Capability 2.x and higher)
Limited number of lines:
- circular buffer flushed at particular times
## Global memory write
To dump then inspect a larger amount of intermediate data
Analysis code should be removed for production
### Example

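A hedged sketch: reserve a debug buffer in global memory, have each thread dump an intermediate value into it, then copy the buffer back to the host for inspection.

```cpp
__global__ void my_kernel(const float* in, float* out, float* dbg, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float intermediate = in[i] * in[i];   // value we want to inspect
        dbg[i] = intermediate;                // debug write: remove for production
        out[i] = intermediate + 1.0f;
    }
}
// Host side: cudaMalloc a dbg buffer of n floats, pass it to the kernel,
// then cudaMemcpy it back and inspect the values.
```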
## CUDA tools

## The complete compilation trajectory
