# CUDA Programming Notes
A good introduction to the CUDA programming model can be found here.
## Kernels
CUDA kernels are functions that execute on the GPU. Each kernel gets assigned a grid of threads. Threads within a grid are organized in equally sized blocks. Blocks can hold at most 1024 threads (see here) and are expected to reside in the same Streaming Multiprocessor (SM).
For convenience, blocks and threads can be indexed as one-, two- or three-dimensional arrays.
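As a minimal sketch (the kernel and names here are illustrative, not from the original note), a kernel that adds two vectors can compute a global index from its block and thread indices:

```cpp
#include <cuda_runtime.h>

// Each thread handles one element; the global index is derived from the
// block index, the block size and the thread index within the block.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Launch: a 1D grid of 1D blocks, 256 threads per block (any value <= 1024).
// add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```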
## Warps
Each consecutive group of 32 threads that starts at a position that is a multiple of 32 within a block is called a warp.
Historically, warps shared the same program counter, meaning that at every clock cycle all of a warp's threads sat at the same instruction. Starting with Nvidia Volta, GPUs have Independent Thread Scheduling: threads within a warp start off at the same instruction but have their own program counter. See here.
### Warp divergence for GPUs without Independent Thread Scheduling
The code executed by a thread can branch depending on the thread id, which can lead some threads within a warp to take different branches. For example, consider an `if {...} else {...}` for which the first 16 threads of a block fall in the `if` part and the second 16 threads fall in the `else` part. The two branches are executed sequentially, with the threads that do not belong to the branch currently being executed masked out (see this). As a consequence, the total runtime is the sum of the runtimes of both the `if` and `else` parts. This is inefficient and is called warp divergence; a minimal sketch is shown below.
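A sketch of such a divergent branch (the kernel is illustrative, not from the original note):

```cpp
// Within each 32-thread warp, the first 16 threads take the "if" branch and
// the last 16 take the "else" branch, so the two bodies run one after the
// other with half of the warp masked out each time.
__global__ void divergent(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 32 < 16) {
        data[i] *= 2.0f;   // "if" half
    } else {
        data[i] += 1.0f;   // "else" half
    }
}
```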
## Multiple grids and streams

Each kernel gets a single grid of threads. One can run multiple kernels in parallel, each on its own grid, but there is a small catch: code like the sketch below will have the GPU run the kernels sequentially.
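A minimal sketch of such back-to-back launches (the kernels and sizes are placeholders, not from the original note):

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float* x) { x[threadIdx.x] *= 2.0f; }
__global__ void kernelB(float* y) { y[threadIdx.x] += 1.0f; }

int main() {
    float *d_x, *d_y;
    cudaMalloc(&d_x, 256 * sizeof(float));
    cudaMalloc(&d_y, 256 * sizeof(float));

    // Both launches go to the default stream, so the GPU runs them one
    // after the other, even though each kernel leaves most of the GPU idle.
    kernelA<<<1, 256>>>(d_x);
    kernelB<<<1, 256>>>(d_y);

    cudaDeviceSynchronize();
    cudaFree(d_x);
    cudaFree(d_y);
}
```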
To have both kernels running in parallel we need to use streams, as in the following sketch.
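Continuing the previous sketch, the same two launches placed on separate streams may overlap (assuming, as here, that the kernels work on independent data):

```cpp
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Each launch is queued on its own stream; the device is free to overlap
// them when enough resources are available.
kernelA<<<1, 256, 0, s1>>>(d_x);
kernelB<<<1, 256, 0, s2>>>(d_y);

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```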
## Thread synchronization
Sometimes, usually to avoid race conditions, threads need to be synchronized: they must wait for all threads in some range to finish a specific instruction before they can safely move on.
Vanilla CUDA has limited functionality to synchronize threads: it can only put a synchronization barrier on all threads of a block, no more, no less. This is achieved with the `__syncthreads()` intrinsic function. A thread that encounters the `__syncthreads()` instruction will wait until all threads in its block arrive at the same point. Since blocks can hold at most 1024 threads, this places a strong restriction on the kind of code that can benefit from synchronization.
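A classic illustration (a sketch, not taken from the note, assuming blocks of 256 threads) is a block-level sum in shared memory, where every reduction step must wait for the whole block before partial sums written by other threads are read:

```cpp
__global__ void blockSum(const float* in, float* out) {
    __shared__ float partial[256];            // one slot per thread of the block
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                          // all loads done before anyone reads

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();                      // wait for this level to finish
    }
    if (tid == 0) out[blockIdx.x] = partial[0];
}
```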
## Cooperative groups
In CUDA 9, cooperative groups were introduced to better manage synchronization. They allow both finer-grained synchronization (between tiles of threads within a block) and coarser-grained synchronization (across all the blocks of a grid).
A kernel using this functionality is called a cooperative kernel. See here for the official documentation.
We haven't explicitly found it in the official documentation, but it seems that when tiling the threads of a block into parts of size \(N\), the tile to which each thread is assigned has index \(\lfloor \text{ThreadId} / N \rfloor\). This is convenient for handling warp divergence, since we can tile the entire block into parts of 32 threads so that tiles and warps coincide. See here.
Also, the compiler can do some extra optimizations when the tile size and the warp size coincide, as explained here under the "Optimizing for the GPU Warp Size" title.
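A sketch of warp-sized tiling with cooperative groups (the kernel itself is illustrative; the `cooperative_groups` calls are the standard API):

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tileSum(const float* in, float* out) {
    cg::thread_block block = cg::this_thread_block();

    // Partition the block into tiles of 32 threads, so each tile is a warp.
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    float v = in[blockIdx.x * blockDim.x + block.thread_rank()];
    // Warp-level tree sum using the tile's shuffle primitive.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    // Tile index within the block, consistent with the ThreadId / N rule above.
    int tileIdx = block.thread_rank() / 32;
    if (tile.thread_rank() == 0)
        out[blockIdx.x * (blockDim.x / 32) + tileIdx] = v;
}
```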
### cudaLaunchCooperativeKernel syntax

A CUDA kernel can have lots of blocks; if there are too many, they won't all be resident on the GPU at the same time. CUDA loads blocks, computes, and then offloads them. For this reason, CUDA requires cooperative kernels that synchronize across blocks to fit entirely on the GPU. Therefore, the maximum number of threads for such a kernel varies with each GPU card; it is usually on the order of `#multiprocessors x 1024`. In any case, this means that such kernels are hardware dependent.
In these cases the syntax to launch kernels changes: the method `cudaLaunchCooperativeKernel` needs to be used, as in the sketch below. See here for the official documentation.
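A sketch of a grid-synchronizing kernel and its launch via `cudaLaunchCooperativeKernel` (the kernel body and grid size are illustrative):

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void coopKernel(float* data) {
    cg::grid_group grid = cg::this_grid();
    data[grid.thread_rank()] *= 2.0f;
    grid.sync();                 // barrier across all blocks of the grid
    // ... code that can now safely read values written by other blocks ...
}

void launch(float* d_data) {
    // All blocks must be simultaneously resident, so keep the grid small
    // enough for the card (roughly #multiprocessors x resident blocks per SM).
    dim3 grid(32), block(256);
    void* args[] = { &d_data };
    cudaLaunchCooperativeKernel((void*)coopKernel, grid, block, args,
                                0 /*sharedMem*/, 0 /*stream*/);
}
```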
### Examples

This blog post from Nvidia is useful as an introduction to kernels with synchronization within blocks.
This video is highly recommended; it shows three implementations of the binary reduction algorithm, using the `__syncthreads()` function and the `cudaLaunchCooperativeKernel` syntax.

## Enabling concurrency with streams
In CUDA, all device calls are either synchronous or asynchronous with respect to the host. In the first case, we enqueue the work we want to do and wait for completion. In the second one, the work is enqueued and we return immediately.
Kernel launches are asynchronous: they automatically overlap with the host.
In order to refine our use of concurrency, we can use streams, which are queues of device work (kernel launches, memory copies, etc.). What's the philosophy behind them? The host places work in the queue and continues on immediately, and the device schedules work from streams when resources are free. Operations within the same stream are ordered in a FIFO manner and cannot overlap, whilst operations in different streams are unordered and can overlap.
### How do we create/destroy streams?
- `cudaStream_t stream;` declares a stream handle.
- `cudaStreamCreate(&stream);` allocates a stream.
- `cudaStreamDestroy(stream);` deallocates a stream and synchronizes the host until work in the stream has completed.

### How do we place work into a stream?
In some API calls, the stream is passed in the function: `cudaMemcpyAsync(dst, src, size, dir, stream);`. We can also place the stream on the kernel launch: `kernel<<<blocks, threads, smem, stream>>>();`.

### Example
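The code this example originally referred to is not preserved; a plausible sketch, assuming `foo` is a kernel using roughly half of the GPU and that the buffers (`h_a`, `h_b`, `d_a`, `d_b`) and launch configuration are defined elsewhere:

```cpp
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Copy each half of the input asynchronously and process it on its own
// stream, so the copies and the two foo launches can overlap.
// (Host buffers should be pinned with cudaMallocHost for truly async copies.)
cudaMemcpyAsync(d_a, h_a, half, cudaMemcpyHostToDevice, s1);
foo<<<blocks, threads, 0, s1>>>(d_a);

cudaMemcpyAsync(d_b, h_b, half, cudaMemcpyHostToDevice, s2);
foo<<<blocks, threads, 0, s2>>>(d_b);

cudaDeviceSynchronize();   // wait for all queued work before using the results
```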
In this case, assuming `foo` only utilizes 50% of the GPU, the two launches can execute concurrently.

### How do we synchronize work?
Within host code, we can use different levels of synchronization. From more to less synchronization, some API calls we can use are:

- `cudaDeviceSynchronize()`, which blocks the host until all issued CUDA calls are complete;
- `cudaStreamSynchronize(stream)`, which blocks the host until all issued CUDA calls in `stream` are complete.

## Events
CUDA events provide a mechanism to signal when operations have occurred in a stream. They are useful for profiling and synchronization.
Events have a boolean state: "Occurred" (which is the default state) or "Not Occurred".
### Managing events and synchronizing streams
The most common ways to create, destroy and enqueue events are:

- `cudaEventCreate(&event)` creates an `event`;
- `cudaEventDestroy(event)` destroys an `event`;
- `cudaEventRecord(event, stream)` sets the `event` state to "Not Occurred" and enqueues the `event` into `stream`; the `event` state is set to "Occurred" when it reaches the front of the `stream`.

### How can we make sure that certain events have occurred before continuing execution?
- `cudaEventQuery(event)` returns `cudaSuccess` if `event` has occurred.
- `cudaEventSynchronize(event)` blocks the host until `event` has occurred.
- `cudaStreamWaitEvent(stream, event)` blocks all work placed into `stream` after this call until `event` occurs.
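A sketch combining these calls (`producer`, `consumer`, the streams and `d_buf` are placeholders): `stream2` waits for an event recorded on `stream1` without blocking the host:

```cpp
cudaEvent_t ev;
cudaEventCreate(&ev);

producer<<<blocks, threads, 0, stream1>>>(d_buf);
cudaEventRecord(ev, stream1);        // ev occurs once producer has finished

// Work queued on stream2 after this call waits for ev; the host does not block.
cudaStreamWaitEvent(stream2, ev, 0);
consumer<<<blocks, threads, 0, stream2>>>(d_buf);

cudaEventSynchronize(ev);            // block the host until ev has occurred
cudaEventDestroy(ev);
```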
### Note

When we don't specify a stream on which to place a device call, it is placed on the default stream, also known as "Stream 0". This stream is synchronous with all other streams, meaning that operations in stream 0 cannot overlap with operations in other streams.
We can avoid this by creating stream 1 with the non-blocking flag set, using `cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking)`, as sketched below. This flag is also useful to get concurrency with libraries out of your control.
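A minimal sketch of creating such a stream:

```cpp
cudaStream_t stream1;
// Work placed on stream1 no longer synchronizes with the default stream 0.
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
```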