# CUDA Programming Notes
A good introduction to the CUDA programming model can be found [here](https://www.youtube.com/watch?v=4APkMJdiudU&list=PLC6u37oFvF40BAm7gwVP7uDdzmW83yHPe).
## Kernels
CUDA kernels are functions that execute on the GPU. Each kernel gets assigned a grid of threads. Threads within a grid are organized in equally sized blocks. Blocks can hold at most 1024 threads (see [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy)), and each block is expected to reside on a single Streaming Multiprocessor (SM).
For convenience, blocks and threads can be indexed as one, two or three dimensional arrays.
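As a minimal sketch of this hierarchy (the kernel name `add` and the device pointers `d_a`, `d_b`, `d_c`, assumed to be already allocated with `cudaMalloc`, are placeholders), each thread combines its block and thread indices into a global index:
```cpp=
__global__ void add(const float *a, const float *b, float *c, int n) {
    // Global index of this thread in a one-dimensional grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard against the last, partially used block
        c[i] = a[i] + b[i];
}

// Launch enough 256-thread blocks to cover n elements.
int n = 1 << 20;
int blockDim = 256;
int gridDim = (n + blockDim - 1) / blockDim;
add<<<gridDim, blockDim>>>(d_a, d_b, d_c, n);
```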
### Warps
Each consecutive group of 32 threads that starts at a position that is a multiple of 32 within a block is called a *warp*.
Historically, warps shared the same program counter, meaning that at every clock cycle all of the threads sit at the same instruction. Starting with Nvidia Volta, GPUs have Independent Thread Scheduling. In this programming model threads within a warp start off at the same instruction but have their own program counter. See [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture).
#### Warp divergence for GPUs without Independent Thread Scheduling
The code executed by a thread can branch depending on the thread id, which can lead to threads within the same warp taking different branches. For example, consider an `if {...} else {...}` where the first 16 threads of a warp fall in the `if` part and the remaining 16 threads fall in the `else` part. The two branches are executed sequentially, with the threads that don't belong to the current branch masked out (see [this](https://www.reddit.com/r/CUDA/comments/gkpjxe/what_is_warp_divergence/)). As a consequence, the total runtime is the sum of the runtimes of the `if` and `else` parts. This inefficiency is called *warp divergence*.
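As a small illustrative sketch (the kernel is hypothetical), branching on the lane index inside a warp forces both branch bodies to be executed:
```cpp=
__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Lanes 0-15 of each warp take the `if`, lanes 16-31 take the `else`:
    // on pre-Volta GPUs the warp runs both bodies one after the other,
    // each time with half of its threads masked out.
    if (threadIdx.x % 32 < 16)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}
```
A branch that is uniform per warp (for example, on `threadIdx.x / 32` or on `blockIdx.x`) does not diverge, since all 32 threads of a warp take the same path.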
### Multiple grids and streams
Each kernel launch gets a single grid of threads. One can run multiple kernels in parallel, each on its own grid, but there's a small catch: the following code will have the GPU run the kernels sequentially, because both launches go to the same default stream.
```cpp=
kernel1<<<gridDim, blockDim>>>();
kernel2<<<gridDim, blockDim>>>();
```
To have both kernels running in parallel we need to use streams.
```cpp=
cudaStream_t stream1;
cudaStreamCreate(&stream1);
cudaStream_t stream2;
cudaStreamCreate(&stream2);
kernel1<<<gridDim, blockDim, 0, stream1>>>();
kernel2<<<gridDim, blockDim, 0, stream2>>>();
```
## Thread synchronization
Sometimes, usually to avoid race conditions, threads need to be synchronized: they must wait for all threads in some range to finish a specific instruction before they can safely move on.
Vanilla CUDA has limited functionality to synchronize threads: it can only put a synchronization barrier on all threads of a block, no more, no less. This is achieved with the `__syncthreads()` intrinsic function. A thread that encounters a `__syncthreads()` instruction waits until all threads in its block arrive at the same point.
Since blocks can hold at most 1024 threads, this places a strong restriction on the kind of code that can benefit from synchronization.
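As a sketch of the typical use (the kernel name and the fixed block size of 1024 are our own choices), `__syncthreads()` guards a shared-memory buffer that is written by some threads and read by others within the same block:
```cpp=
// Reverse the elements handled by each block; launch as
// reverseBlock<<<gridDim, 1024>>>(d_data) with gridDim * 1024 elements.
__global__ void reverseBlock(int *data) {
    __shared__ int tile[1024];           // one slot per thread of the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = data[i];                   // every thread writes its element
    __syncthreads();                     // wait until the whole tile is filled
    data[i] = tile[blockDim.x - 1 - t];  // safe: reads another thread's write
}
```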
### Cooperative groups
In CUDA 9, cooperative groups were introduced to better manage synchronization. They provide both
- synchronization across blocks, and
- the ability to tile the threads of a block and synchronize at a more granular level.
A kernel using this functionality is called a *cooperative kernel*. See
[here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups) for the official documentation.
We haven't explicitly found it in the official documentation, but it seems that when tiling the threads of a block into parts of $N$ threads each, the tile to which a thread is assigned has index $\lfloor \text{ThreadId} / N \rfloor$. This is convenient for handling warp divergence, since we can tile the entire block into parts of 32 threads so that tiles and warps coincide. See [here](https://docs.nvidia.com/cuda/archive/9.2/cuda-c-programming-guide/index.html#groups-blocks-cg).
The compiler can also do some extra optimizations when the tile size and the warp size coincide, as explained [here](https://developer.nvidia.com/blog/cooperative-groups/) under the "Optimizing for the GPU Warp Size" heading.
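A minimal sketch of such a tiling with the cooperative groups API (the kernel body is only illustrative):
```cpp=
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tiledKernel(int *out) {
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into tiles of 32 threads, so tiles coincide with warps.
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int tileIndex = block.thread_rank() / 32;   // the formula discussed above
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = tileIndex;
    tile.sync();   // barrier over the 32 threads of this tile only
}
```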
#### `cudaLaunchCooperativeKernel` syntax
A CUDA kernel can have lots of blocks. If there are too many, they won't all be resident on the GPU at the same time: CUDA schedules some blocks, computes, and then replaces them with others. For this reason, CUDA requires cooperative kernels that sync across blocks to fit entirely on the GPU. Therefore, the maximum number of threads for such a kernel varies with each GPU card; it is usually on the order of `#multiprocessors x 1024`. In any case, this means that such kernels are hardware dependent.
In these cases the syntax to launch kernels changes: the `cudaLaunchCooperativeKernel` function needs to be used. See [here](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g504b94170f83285c71031be6d5d15f73) for the official documentation.
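A sketch of a grid-synchronizing kernel and its launch (error handling omitted; the kernel name, arguments and block size are placeholders, and the code must be compiled with relocatable device code, `-rdc=true`):
```cpp=
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void gridKernel(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // phase 1
    grid.sync();                  // barrier across *all* blocks of the grid
    if (i < n) data[i] *= 2.0f;   // phase 2 sees every phase-1 result
}

void launch(float *d_data, int n) {
    int blockDim = 256;
    // gridDim must not exceed the number of blocks that can be resident at once.
    int gridDim = (n + blockDim - 1) / blockDim;
    void *args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void *)gridKernel,
                                dim3(gridDim), dim3(blockDim), args, 0, 0);
}
```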
### Examples
[This](https://developer.nvidia.com/blog/cooperative-groups/) blog post from Nvidia is useful as an introduction to kernels with sync within blocks.
[This](https://www.youtube.com/watch?v=rOUaco3w7mU) video is highly recommended; it shows three implementations of the binary reduction algorithm:
- using only the `__syncthreads()` function,
- using cooperative kernels with sync only within blocks, and
- using sync across blocks with the `cudaLaunchCooperativeKernel` launch API.
## Enabling concurrency with streams
In CUDA, all device calls are either synchronous or asynchronous with respect to the host. In the first case, we enqueue the work we want to do and wait for completion. In the second one, the work is enqueued and we return immediately.
Kernel launches are asynchronous: they automatically overlap with host code.
In order to refine our use of concurrency, we can use **streams**, which are queues of device work (kernel launches, memory copies, etc.). What's the philosophy behind them? The host places work in the queue and continues on immediately, and the device schedules work from streams when resources are free. Operations within the same stream are ordered in a FIFO manner and cannot overlap, whilst operations in different streams are unordered and can overlap.
### How do we create/destroy streams?
- `cudaStream_t stream;` declares a stream handle.
- `cudaStreamCreate(&stream);` allocates a stream.
- `cudaStreamDestroy(stream);` deallocates a stream and synchronizes host until work in stream has completed.
### How do we place work into a stream?
In some API calls, the stream is passed as an argument: `cudaMemcpyAsync(dst, src, size, dir, stream);`. We can also place the stream on the kernel launch: `kernel<<<blocks, threads, smem, stream>>>();`.
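For instance, a copy and a kernel placed in different streams can overlap (a sketch; `kernel`, `size`, `blocks`, `threads`, `stream1` and `stream2` are assumed to be defined already, and the host buffer must be pinned, e.g. allocated with `cudaMallocHost`, for the copy to be truly asynchronous):
```cpp=
float *h_buf, *d_buf;
cudaMallocHost(&h_buf, size);   // pinned host memory, required for overlap
cudaMalloc(&d_buf, size);

cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream1);
kernel<<<blocks, threads, 0, stream2>>>();   // can run concurrently with the copy

cudaStreamSynchronize(stream1);   // wait only for the copy to finish
```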
### Example
```cpp=
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
foo<<<blocks,threads,0,stream1>>>();
foo<<<blocks,threads,0,stream2>>>();
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```
In this case, assuming `foo` only utilizes 50% of the GPU, execution will look like this:
![](https://i.imgur.com/czI8nZB.png)
## How do we synchronize work?
Within host code, we can use different levels of synchronization. From more to less synchronization, some API calls we can use are:
- We can synchronize everything using `cudaDeviceSynchronize()`, which blocks the host until all issued CUDA calls are complete;
- we can synchronize with respect to a specific stream using `cudaStreamSynchronize(stream)`, which blocks the host until all issued CUDA calls in `stream` are complete;
- or we can synchronize the host or individual streams more selectively using **events**.
### Events
CUDA events provide a mechanism to signal when operations have occurred in a stream. They are useful for profiling and synchronization.
Events have a boolean state: "Occurred" (which is the default state) or "Not Occurred".
### Managing events and synchronizing streams
The most common ways to create, destroy and enqueue events are:
- `cudaEventCreate(&event)` creates an `event`;
- `cudaEventDestroy(event)` destroys an `event`;
- `cudaEventRecord(event, stream)`
- sets the `event` state to "Not Occurred",
- enqueues the `event` into a `stream` and
- `event` state is set to occurred when it reaches the front of the `stream`.
How can we make sure that certain events have occurred before continuing execution?
- `cudaEventQuery(event)` returns `cudaSuccess` if `event` has occurred.
- `cudaEventSynchronize(event)` blocks the host until `event` has occurred.
- `cudaStreamWaitEvent(stream, event)`
- blocks all launches on `stream` after this call until `event` occurs
- does not block the host.
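A sketch of cross-stream ordering with an event (the kernels `producer` and `consumer`, the streams and the launch configuration are placeholders):
```cpp=
cudaEvent_t event;
cudaEventCreate(&event);

producer<<<blocks, threads, 0, stream1>>>();   // produce data in stream1
cudaEventRecord(event, stream1);               // mark that point in stream1

cudaStreamWaitEvent(stream2, event, 0);        // stream2 waits; the host does not
consumer<<<blocks, threads, 0, stream2>>>();   // guaranteed to start after producer

cudaEventSynchronize(event);                   // optionally block the host as well
cudaEventDestroy(event);
```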
#### Note
When we don't specify a stream on which to place a device call, it will be placed on the default stream, also known as "Stream 0". This stream is synchronous with all streams, meaning that operations in stream 0 cannot overlap other streams:
![](https://i.imgur.com/CIOFfwB.png)
We can avoid this by creating Stream 1 with the non-blocking flag set, using `cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking)`. This flag is also useful for getting concurrency with libraries outside of your control.
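For instance (same placeholder kernel and launch configuration as above):
```cpp=
cudaStream_t stream1;
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
// Work in stream1 can now overlap with work issued to the default stream.
kernel<<<blocks, threads, 0, stream1>>>();
cudaStreamDestroy(stream1);
```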