# CUDA Programming Notes
A good introduction to the CUDA programming model can be found [here](https://www.youtube.com/watch?v=4APkMJdiudU&list=PLC6u37oFvF40BAm7gwVP7uDdzmW83yHPe).
## Kernels
CUDA kernels are functions that execute on the GPU. Each kernel gets assigned a grid of threads. Threads within a grid are organized in equally sized blocks. Blocks can hold at most 1024 threads (see [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy)), and each block is expected to reside on a single Streaming Multiprocessor (SM).
For convenience, blocks and threads can be indexed as one, two or three dimensional arrays.
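As a minimal sketch of this hierarchy (the kernel name `add` and the device pointers `d_a`, `d_b`, `d_c`, assumed to be already allocated with `cudaMalloc`, are placeholders), each thread combines its block and thread indices into a global index:
```cpp=
__global__ void add(const float *a, const float *b, float *c, int n) {
    // Global index of this thread in a one-dimensional grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard against the last, partially used block
        c[i] = a[i] + b[i];
}

// Launch enough 256-thread blocks to cover n elements.
int n = 1 << 20;
int blockDim = 256;
int gridDim = (n + blockDim - 1) / blockDim;
add<<<gridDim, blockDim>>>(d_a, d_b, d_c, n);
```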
### Warps
Each consecutive group of 32 threads that starts at a position that is a multiple of 32 within a block is called a *warp*.
Historically, warps shared the same program counter, meaning that at every clock cycle all of the threads sit at the same instruction. Starting with Nvidia Volta, GPUs have Independent Thread Scheduling. In this programming model threads within a warp start off at the same instruction but have their own program counter. See [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture).
#### Warp divergence for GPUs without Independent Thread Scheduling
The code executed by a thread can branch depending on the thread id, which can lead to threads within the same warp taking different branches. For example, consider an `if {...} else {...}` where the first 16 threads of a warp fall in the `if` part and the remaining 16 threads fall in the `else` part. The two branches are executed sequentially, with the threads that don't belong to the current branch masked out (see [this](https://www.reddit.com/r/CUDA/comments/gkpjxe/what_is_warp_divergence/)). As a consequence, the total runtime is the sum of the runtimes of the `if` and `else` parts. This inefficiency is called *warp divergence*.
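As a small illustrative sketch (the kernel is hypothetical), branching on the lane index inside a warp forces both branch bodies to be executed:
```cpp=
__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Lanes 0-15 of each warp take the `if`, lanes 16-31 take the `else`:
    // on pre-Volta GPUs the warp runs both bodies one after the other,
    // each time with half of its threads masked out.
    if (threadIdx.x % 32 < 16)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}
```
A branch that is uniform per warp (for example, on `threadIdx.x / 32` or on `blockIdx.x`) does not diverge, since all 32 threads of a warp take the same path.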
### Multiple grids and streams
Each kernel launch gets a single grid of threads. One can run multiple kernels in parallel, each on its own grid, but there's a small catch: the following code will have the GPU run the kernels sequentially, because both launches go to the same default stream.
```cpp=
kernel1<<<gridDim, blockDim>>>();
kernel2<<<gridDim, blockDim>>>();
```
To have both kernels running in parallel we need to use streams.
```cpp=
cudaStream_t stream1;
cudaStreamCreate(&stream1);
cudaStream_t stream2;
cudaStreamCreate(&stream2);
kernel1<<<gridDim, blockDim, 0, stream1>>>();
kernel2<<<gridDim, blockDim, 0, stream2>>>();
```
## Thread synchronization
Sometimes, usually to avoid race conditions, threads need to be synchronized: they must wait for all threads in some range to finish a specific instruction before they can safely move on.
Vanilla CUDA has limited functionality to synchronize threads: it can only put a synchronization barrier on all threads of a block, no more, no less. This is achieved with the `__syncthreads()` intrinsic function. A thread that encounters a `__syncthreads()` instruction waits until all threads in its block arrive at the same point.
Since blocks can hold at most 1024 threads, this places a strong restriction on the kind of code that can benefit from synchronization.
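As a sketch of the typical use (the kernel name and the fixed block size of 1024 are our own choices), `__syncthreads()` guards a shared-memory buffer that is written by some threads and read by others within the same block:
```cpp=
// Reverse the elements handled by each block; launch as
// reverseBlock<<<gridDim, 1024>>>(d_data) with gridDim * 1024 elements.
__global__ void reverseBlock(int *data) {
    __shared__ int tile[1024];           // one slot per thread of the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = data[i];                   // every thread writes its element
    __syncthreads();                     // wait until the whole tile is filled
    data[i] = tile[blockDim.x - 1 - t];  // safe: reads another thread's write
}
```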
### Cooperative groups
In CUDA 9, cooperative groups were introduced to better manage synchronization. They provide both
- synchronization across blocks, and
- the ability to tile the threads of a block and synchronize at a more granular level.
A kernel using this functionality is called a *cooperative kernel*. See
[here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups) for the official documentation.
We haven't explicitly found it in the official documentation, but it seems that when tiling the threads of a block into parts of $N$ threads each, the tile to which a thread is assigned has index $\lfloor \text{ThreadId} / N \rfloor$. This is convenient for handling warp divergence, since we can tile the entire block into parts of 32 threads so that tiles and warps coincide. See [here](https://docs.nvidia.com/cuda/archive/9.2/cuda-c-programming-guide/index.html#groups-blocks-cg).
The compiler can also do some extra optimizations when the tile size and the warp size coincide, as explained [here](https://developer.nvidia.com/blog/cooperative-groups/) under the "Optimizing for the GPU Warp Size" heading.
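A minimal sketch of such a tiling with the cooperative groups API (the kernel body is only illustrative):
```cpp=
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tiledKernel(int *out) {
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into tiles of 32 threads, so tiles coincide with warps.
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int tileIndex = block.thread_rank() / 32;   // the formula discussed above
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = tileIndex;
    tile.sync();   // barrier over the 32 threads of this tile only
}
```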
#### `cudaLaunchCooperativeKernel` syntax
A CUDA kernel can have lots of blocks. If there are too many, they won't all be resident on the GPU at the same time: CUDA schedules some blocks, computes, and then replaces them with others. For this reason, CUDA requires cooperative kernels that sync across blocks to fit entirely on the GPU. Therefore, the maximum number of threads for such a kernel varies with each GPU card; it is usually on the order of `#multiprocessors x 1024`. In any case, this means that such kernels are hardware dependent.
In these cases the syntax to launch kernels changes: the `cudaLaunchCooperativeKernel` function needs to be used. See [here](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g504b94170f83285c71031be6d5d15f73) for the official documentation.
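A sketch of a grid-synchronizing kernel and its launch (error handling omitted; the kernel name, arguments and block size are placeholders, and the code must be compiled with relocatable device code, `-rdc=true`):
```cpp=
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void gridKernel(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // phase 1
    grid.sync();                  // barrier across *all* blocks of the grid
    if (i < n) data[i] *= 2.0f;   // phase 2 sees every phase-1 result
}

void launch(float *d_data, int n) {
    int blockDim = 256;
    // gridDim must not exceed the number of blocks that can be resident at once.
    int gridDim = (n + blockDim - 1) / blockDim;
    void *args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void *)gridKernel,
                                dim3(gridDim), dim3(blockDim), args, 0, 0);
}
```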
### Examples
[This](https://developer.nvidia.com/blog/cooperative-groups/) blog post from Nvidia is useful as an introduction to kernels with sync within blocks.
[This](https://www.youtube.com/watch?v=rOUaco3w7mU) video is highly recommended; it shows three implementations of the binary reduction algorithm:
- using only the `__syncthreads()` function,
- using cooperative kernels with sync only within blocks, and
- using sync across blocks with the `cudaLaunchCooperativeKernel` launch API.
## Enabling concurrency with streams
In CUDA, all device calls are either synchronous or asynchronous with respect to the host. In the first case, we enqueue the work we want to do and wait for completion. In the second one, the work is enqueued and we return immediately.
Kernel launches are asynchronous: they automatically overlap with host code.
In order to refine our use of concurrency, we can use **streams**, which are queues of device work (kernel launches, memory copies, etc.). What's the philosophy behind them? The host places work in the queue and continues on immediately, and the device schedules work from streams when resources are free. Operations within the same stream are ordered in a FIFO manner and cannot overlap, whilst operations in different streams are unordered and can overlap.
### How do we create/destroy streams?
- `cudaStream_t stream;` declares a stream handle.
- `cudaStreamCreate(&stream);` allocates a stream.
- `cudaStreamDestroy(stream);` deallocates a stream and synchronizes host until work in stream has completed.
### How do we place work into a stream?
In some API calls, the stream is passed as an argument: `cudaMemcpyAsync(dst, src, size, dir, stream);`. We can also place the stream on the kernel launch: `kernel<<<blocks, threads, smem, stream>>>();`.
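For instance, a copy and a kernel placed in different streams can overlap (a sketch; `kernel`, `size`, `blocks`, `threads`, `stream1` and `stream2` are assumed to be defined already, and the host buffer must be pinned, e.g. allocated with `cudaMallocHost`, for the copy to be truly asynchronous):
```cpp=
float *h_buf, *d_buf;
cudaMallocHost(&h_buf, size);   // pinned host memory, required for overlap
cudaMalloc(&d_buf, size);

cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream1);
kernel<<<blocks, threads, 0, stream2>>>();   // can run concurrently with the copy

cudaStreamSynchronize(stream1);   // wait only for the copy to finish
```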
### Example
```cpp=
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
foo<<<blocks,threads,0,stream1>>>();
foo<<<blocks,threads,0,stream2>>>();
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```
In this case, assuming `foo` only utilizes 50% of the GPU, execution will look like this:
![](https://i.imgur.com/czI8nZB.png)
## How do we synchronize work?
Within host code, we can use different levels of synchronization. From more to less synchronization, some API calls we can use are:
- We can synchronize everything using `cudaDeviceSynchronize()`, which blocks the host until all issued CUDA calls are complete;
- we can synchronize with respect to a specific stream using `cudaStreamSynchronize(stream)`, which blocks the host until all issued CUDA calls in `stream` are complete;
- or we can synchronize the host or individual streams more selectively using **events**.
### Events
CUDA events provide a mechanism to signal when operations have occurred in a stream. They are useful for profiling and synchronization.
Events have a boolean state: "Occurred" (which is the default state) or "Not Occurred".
### Managing events and synchronizing streams
The most common ways to create, destroy and enqueue events are:
- `cudaEventCreate(&event)` creates an `event`;
- `cudaEventDestroy(event)` destroys an `event`;
- `cudaEventRecord(event, stream)`
- sets the `event` state to "Not Occurred",
- enqueues the `event` into a `stream` and
- `event` state is set to occurred when it reaches the front of the `stream`.
How can we make sure that certain events have occurred before continuing execution?
- `cudaEventQuery(event)` returns `cudaSuccess` if `event` has occurred.
- `cudaEventSynchronize(event)` blocks the host until `event` has occurred.
- `cudaStreamWaitEvent(stream, event)`
- blocks all launches on `stream` after this call until `event` occurs
- does not block the host.
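A sketch of cross-stream ordering with an event (the kernels `producer` and `consumer`, the streams and the launch configuration are placeholders):
```cpp=
cudaEvent_t event;
cudaEventCreate(&event);

producer<<<blocks, threads, 0, stream1>>>();   // produce data in stream1
cudaEventRecord(event, stream1);               // mark that point in stream1

cudaStreamWaitEvent(stream2, event, 0);        // stream2 waits; the host does not
consumer<<<blocks, threads, 0, stream2>>>();   // guaranteed to start after producer

cudaEventSynchronize(event);                   // optionally block the host as well
cudaEventDestroy(event);
```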
#### Note
When we don't specify a stream on which to place a device call, it will be placed on the default stream, also known as "Stream 0". This stream is synchronous with all streams, meaning that operations in stream 0 cannot overlap other streams:
![](https://i.imgur.com/CIOFfwB.png)
We can avoid this by creating Stream 1 with the non-blocking flag set, using `cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking)`. This flag is also useful for getting concurrency with libraries outside of your control.
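For instance (same placeholder kernel and launch configuration as above):
```cpp=
cudaStream_t stream1;
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
// Work in stream1 can now overlap with work issued to the default stream.
kernel<<<blocks, threads, 0, stream1>>>();
cudaStreamDestroy(stream1);
```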