All instructions are SIMD (width = 32) instructions
Each core:
Floating point & integer unit
Fused multiply-add (FMA) instruction
Logic unit
Move, compare unit
Branch unit
The first IF/ID (instruction fetch/decode) stage of the pipeline is handled by the SM
SIMT allows the execution to be specified in terms of scalar threads
The SIMT Execution Model on CUDA cores
SIMT: we program as if each thread processed a single data element, but behind the scenes this maps onto SIMD instructions (a + b becomes the sum of vector a with vector b)
Each thread executes the same kernel and the same instructions
Best case: the condition is false for every thread. For all threads: inst-d
Worst case: the condition is true for some but not all threads. For all threads: inst-a, inst-b, inst-c, inst-d
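As a minimal sketch of this divergence pattern (the kernel and variable names below are hypothetical, not from the course material):

__global__ void divergence_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        if (v > 0.0f)          // the condition may differ within a warp
            v = v * 2.0f;      // "inst-a/b/c": masked out for threads where it is false
        out[i] = v + 1.0f;     // "inst-d": executed by every thread
    }
}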
Loops
Final note about terminology
GPU memory model
Computation cost vs. memory cost
Power measurements on NVIDIA GT200
With the same amount of energy:
Load 1 word from external memory (DRAM)
Compute 44 flops
Must optimize memory first
External memory: discrete GPU
Classical CPU-GPU model
Split memory space
Highest bandwidth from GPU memory
Transfers to main memory are slower
Intel i7 4770 / GTX 780
External memory: embedded GPU
Most GPUs today:
Same memory shared by CPU and GPU
May support memory coherence (GPU can read directly from CPU caches)
More contention on external memory
GPU: on-chip memory
Cache area in CPU vs GPU:
But if we include registers:
GPU has many more registers, but they are made of simpler memory
Memory model hierarchy
Hardware
Cache hierarchy:
Keep frequently-accessed data
Reduce throughput demand on main memory
Managed by hardware (L1, L2) or software (shared memory)
On CPU, caches are designed to avoid memory latency. On GPU, multi-threading deals with memory latency.
Software
Building and running a simple program
What you need to get started
NVidia GPU hardware
NVidia GPU drivers, properly loaded
CUDA runtime libraries
CUDA SDK (NVCC compiler in particular)
Summary
Host vs Device: separate memory spaces
GPUs are computation units which require explicit usage, as opposed to a CPU
Need to load data onto the device and fetch results back from it
Replace loops with kernels
Kernel = function computed in relative isolation on small chunks of data (see the sketch after this list)
Divide the work
Compile and run using CUDA SDK
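As an illustration of "replace loops with kernels", here is a minimal sketch (the names vecAdd, a, b, c are assumptions for the example, not taken from the course code):

// Sequential loop: for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
// Kernel version: each thread handles one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard against extra threads in the last block
        c[i] = a[i] + b[i];
}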
Host view of GPU computation
Sequential and parallel sections
We use the GPU(s) as co-processor(s)
CUDA memory primitives
Why 2D and 3D variants ?
Strong alignment requirements in device memory
Enables correct loading of memory chunks to SM caches (correct bank alignment)
Proper striding management in automated fashion
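A minimal sketch of the 2D variants, assuming a tightly-packed host image h_img of size width x height (these names are illustrative):

float *alloc_and_upload(const float *h_img, int width, int height, size_t *pitch)
{
    float *d_img = NULL;
    // The runtime chooses *pitch (padded row size in bytes) so that every
    // row of d_img starts at a properly aligned address.
    cudaMallocPitch((void **)&d_img, pitch, width * sizeof(float), height);
    // Copy the tightly-packed host rows into the pitched device rows.
    cudaMemcpy2D(d_img, *pitch,
                 h_img, width * sizeof(float),
                 width * sizeof(float), height,
                 cudaMemcpyHostToDevice);
    return d_img;
}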
Host ↔ Device memory transfer
Almost complete code
Checking errors
In practice check for API errors
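A common way to do this (the macro below is a widespread convention, not part of the CUDA API itself):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information whenever a runtime API call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage (d_a, h_a and n are assumed to exist):
// CUDA_CHECK(cudaMalloc(&d_a, n * sizeof(float)));
// CUDA_CHECK(cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice));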
Intermission: Can I use memory management functions inside kernels ?
No: cudaMalloc(), cudaMemcpy() and cudaFree() must be called from the host only
However, kernels may allocate, use and reclaim memory dynamically using regular malloc()
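A minimal sketch of device-side allocation (the kernel name is hypothetical; the device heap size can be raised with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) if needed):

__global__ void scratch_kernel(int n)
{
    // Per-thread allocation on the device heap, reclaimed before returning.
    int *buf = (int *)malloc(n * sizeof(int));
    if (buf != NULL) {
        for (int i = 0; i < n; ++i)
            buf[i] = i;
        free(buf);
    }
}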
Fix the kernel invocation line
We want to fix this line:
Kernel invocation syntax:
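The general form, followed by a concrete call for the vector-addition example (vecAdd, d_a, d_b, d_c and the launch dimensions are illustrative assumptions):

// kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(arguments);
// sharedMemBytes and stream are optional launch parameters.
vecAdd<<<128, 256>>>(d_a, d_b, d_c, n);   // 128 blocks of 256 threads each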
How to set gridDim and blockDim properly ?
Lvl 0: Naive trial with as many threads as possible
Will fail with large vectors
Hardware limitation on the maximum number of threads per block (1024 for compute capability 3.0-7.5)
Will fail with vectors whose size is not a multiple of the warp size
Lvl 1: It works with just enough blocks
Lvl 2: Tune block size given the kernel requirements and hardware constraints
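A minimal sketch of the Lvl 1 / Lvl 2 approach, assuming the vecAdd kernel and device pointers from before (the block size of 256 is just an example value to be tuned):

int blockSize = 256;                              // tuned per kernel and hardware limits
int numBlocks = (n + blockSize - 1) / blockSize;  // ceiling division: just enough blocks
vecAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
// The kernel still needs its "if (i < n)" guard: the last block may contain
// threads whose index falls past the end of the vector.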
But wait…
This code prints nothing !
Kernel invocation is asynchronous
Host code synchronization requires cudaDeviceSynchronize() because kernel invocation is asynchronous from host perspective.
On the device, kernel invocations are strictly sequential (unless you schedule them on different streams)
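A minimal sketch of the required host-side synchronization (names reused from the earlier assumed example):

vecAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);  // returns immediately to the host
cudaDeviceSynchronize();                             // block the host until the kernel finishes
// Only now is it safe to rely on the kernel's results; device-side printf
// output is also flushed at this point.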
Intermission: Can I make kernels inside kernels ?
Yes. This is the basis of dynamic parallelism
Some restrictions on the stack size apply. Remember that the device runtime is a functional subset of the host runtime, i.e. you can perform device management, kernel launching, device memcpy, etc., but with some restrictions
The compiler may inline some of those calls.
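A minimal sketch of a device-side launch (names are illustrative; this assumes a device that supports dynamic parallelism and compilation with relocatable device code, e.g. nvcc -rdc=true, linked against cudadevrt):

__global__ void child(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parent(float *data, int n)
{
    // One thread of the parent grid launches the child grid from the device.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        child<<<(n + 255) / 256, 256>>>(data, n);
}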
Conclusion about the host-only view
A host-only view of the computation is sufficient for most of the cases:
upload data to the device
fire a kernel
download output data from the device
Advanced CUDA requires making sure we saturate the SMs, and may imply some study of the kernel to determine the best:
number of threads per block
number of blocks per grid
work per thread (if applicable)
…
This depends on:
hardware specifications: maximum gridDim and blockDim, etc.
kernel code: number of registers and amount of shared memory used by each thread
Kernel programming
Several API levels
We now want to program kernels. There are several APIs available:
It may be called from another kernel for devices of compute capability 3.2 or higher (Dynamic Parallelism support)
__device__ and __host__ can be used together
__host__ is optional if used alone
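A minimal sketch of the three function space qualifiers (the function names are hypothetical):

__device__ float square(float x) { return x * x; }            // callable from device code only

__host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for both host and device

__global__ void transform(float *out, const float *in, int n) // entry point, launched with <<<...>>>
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(twice(in[i]));
}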
Built-in Vector Types
They make it easy to work with data like images. Alignment must be respected in all operations
They all are structures
They all come with a constructor function of the form make_<type name>
The 1st, 2nd, 3rd and 4th components are accessible through the fields x, y, z and w, respectively
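A minimal sketch using float4 (the kernel name is illustrative):

__global__ void brighten(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = in[i];                             // one aligned 128-bit load
        out[i] = make_float4(p.x + 0.1f, p.y + 0.1f,  // constructor function
                             p.z + 0.1f, p.w);        // fields x, y, z, w
    }
}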
Built-in variables
Example
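A minimal sketch combining the built-in variables (the kernel is hypothetical; a grid-stride loop is used so any n is covered):

__global__ void fill(float *out, int n)
{
    // threadIdx: index within the block, blockIdx: index of the block,
    // blockDim: threads per block, gridDim: blocks in the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        out[i] = (float)i;
}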
Memory hierarchy
Types of Memory
Registers
Used to store parameters, local variables, etc.
Very fast
Private to each thread
Lots of threads, hence little memory per thread
Shared
Used to store temp data
Very fast
Shared among all threads in a block
Constant
A special cache for read-only values
Global
Large and slow
Caches
Transparent use
Local
Salient features of Device Memory
Cost to access memory
Variable Memory Space Specifiers
How to declare CUDA variables
Remarks:
__device__ is optional when used with __shared__ or __constant__
Automatic variables reside in a register
Where to declare variables ? Can the host access them ?
Yes (global and constant): declare outside of any function
No (register and shared): declare in the kernel
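A minimal sketch of the specifiers in context (names are illustrative; the shared array assumes blocks of at most 256 threads, and c_coeffs would be filled from the host with cudaMemcpyToSymbol):

__device__   float d_scale = 2.0f;   // global memory, visible to every kernel
__constant__ float c_coeffs[16];     // constant memory, read-only in kernels

__global__ void scale(const float *in, float *out, int n)
{
    __shared__ float tile[256];      // shared by the threads of one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x;                         // automatic variable: lives in a register
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                 // every thread of the block reaches this point
    if (i < n) {
        x = d_scale * c_coeffs[0] * tile[threadIdx.x];
        out[i] = x;
    }
}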
Which memory can be shared by which threads ?
Possible memory access:
Among threads in the same grid (a kernel invocation)
Global memory
Among threads in the same block
Global memory
Shared memory
Relaxed consistency memory model
The CUDA programming model assumes a device with a weakly-ordered memory model: the order in which a CUDA thread writes data to shared memory or global memory is not necessarily the order in which another CUDA or host thread observes those writes
Example
Possible outcomes for thread 2 ?
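A hedged reconstruction of the kind of two-thread example discussed here (the names writeXY/readXY and the initial values are assumptions):

__device__ int X = 1, Y = 2;

__device__ void writeXY()              // executed by thread 1
{
    X = 10;
    Y = 20;
}

__device__ void readXY(int *A, int *B) // executed by thread 2
{
    *B = Y;
    *A = X;
}
// Under the weakly-ordered model, thread 2 may observe (A, B) as (10, 20),
// (10, 2), (1, 2) or even (1, 20): thread 1's two writes are not guaranteed
// to become visible in program order without a memory fence.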
Memory Fence Functions
Memory fence functions can be used to enforce some ordering on memory accesses
Ensures that:
All writes to all memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block as occurring before all writes made by the calling thread after the call
All reads from all memory made by the calling thread before the call are ordered before all reads made by the calling thread after the call
Synch functions
__syncthreads() is stronger than __threadfence() because it also synchronizes the execution
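A minimal sketch of __syncthreads() acting as an execution barrier for a block, here in a block-wide reduction (the kernel name is illustrative; it assumes blockDim.x == 256 and a power-of-two block size):

__global__ void block_sum(const float *in, float *out)
{
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                        // all loads are visible to the block

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) buf[t] += buf[t + s];
        __syncthreads();                    // wait before the next reduction step
    }
    if (t == 0) out[blockIdx.x] = buf[0];   // one partial sum per block
}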
Atomic functions
Atomic functions perform a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory
Most of the atomic functions are available for all numerical types
Arithmetic functions
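A minimal sketch using atomicAdd(), one of the arithmetic atomic functions (the histogram example is an assumption; bins is expected to hold 256 zero-initialized counters):

__global__ void histogram(const unsigned char *values, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[values[i]], 1u);   // concurrent increments are not lost
}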
Debugging, performance analysis and profiling
printf
Possible since Fermi devices (Compute Capability 2.x and higher)
Limited number of lines:
circular buffer flushed at particular times
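A minimal sketch of device-side printf (the kernel is hypothetical; output appears once the buffer is flushed, e.g. at cudaDeviceSynchronize()):

#include <cstdio>

__global__ void debug_kernel(const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] < 0.0f)           // only print suspicious values
        printf("thread %d: data[%d] = %f\n", i, i, data[i]);
}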
Global memory write
To dump and then inspect a larger amount of intermediate data. Analysis code should be removed for production