Hackpad for the N8 CIR GPU Python workshop, based on the Carpentries Incubator lesson.
You can use this hackpad by clicking the edit button.
This presents the hackpad in a split-screen view, with editable markdown on the left-hand side and rendered markdown on the right-hand side.
| Time  | Agenda                               |
|-------|--------------------------------------|
| 10:00 | Welcome and recap                    |
| 10:20 | Your First GPU Kernel                |
| 11:00 | Break                                |
| 11:10 | Registers, Global, and Local Memory  |
| 12:00 | Lunch                                |
| 13:00 | Shared Memory and Synchronization    |
| 14:30 | Break                                |
| 14:40 | Constant Memory                      |
| 15:00 | Concurrent access to the GPU         |
| 16:00 | Close                                |
We'll be working through today's examples using Google Colaboratory.
You'll need to register a Google account to use the service, but once registered you can create a new Colab notebook via colab.research.google.com/.
Code to fetch radio astronomy data:
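No snippet was captured here, so below is a minimal sketch. It assumes the dataset is a FITS image reachable over HTTP; the URL is a placeholder, so substitute the link given in the workshop.

```python
# Minimal sketch: download a FITS image and load it into a NumPy array.
# The URL is a placeholder, not the real workshop dataset.
import numpy as np
from urllib.request import urlretrieve
from astropy.io import fits

url = "https://example.org/radio_image.fits"  # placeholder URL
urlretrieve(url, "radio_image.fits")

with fits.open("radio_image.fits") as hdul:
    data = np.asarray(hdul[0].data, dtype=np.float32)

print(data.shape, data.dtype)
```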
GPU Memory allocation:
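Again, no snippet was captured; a minimal CuPy sketch of host/device allocation and transfer:

```python
# Sketch: allocating GPU memory with CuPy and moving data between host and device.
import numpy as np
import cupy as cp

host_array = np.random.rand(1024).astype(np.float32)  # allocated in host (CPU) RAM
device_array = cp.asarray(host_array)                 # copied into GPU global memory
gpu_zeros = cp.zeros(1024, dtype=cp.float32)          # allocated directly on the GPU

result = cp.asnumpy(device_array + gpu_zeros)         # compute on GPU, copy back to host
print(result[:5])
```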
Registers: small, fast memory private to each thread. Highest performance; can also hold small arrays.
With register spilling, data that doesn't fit is automatically placed in local memory (a per-thread partition of global memory), at a performance penalty.
NVIDIA GPUs are not "cache-coherent": a thread's updates to global memory are not guaranteed to be visible to other threads until execution has completed on all threads (or a synchronization point is reached).
Static allocation of local memory of a specific size can be controlled from the host, e.g. by passing the desired size into the kernel source when it is compiled (see the sketch below).
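A minimal sketch of this, assuming CuPy's `RawKernel` (as used in the lesson): the array size must be known at compile time, so the host injects it into the kernel source before compilation. The kernel name and logic here are illustrative only.

```python
# Sketch: sizing a per-thread local array from the host by substituting the
# size into the kernel source before compiling it with CuPy's RawKernel.
import cupy as cp

LOCAL_SIZE = 16  # decided on the host

source = r'''
extern "C" __global__ void fill_local(float *out) {
    float temp[LOCAL_SIZE];           // per-thread local array, size fixed at compile time
    for (int i = 0; i < LOCAL_SIZE; i++) {
        temp[i] = (float)(threadIdx.x + i);
    }
    out[threadIdx.x] = temp[LOCAL_SIZE - 1];
}
'''.replace("LOCAL_SIZE", str(LOCAL_SIZE))

fill_local = cp.RawKernel(source, "fill_local")
out = cp.zeros(32, dtype=cp.float32)
fill_local((1,), (32,), (out,))
print(out)
```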
Shared memory: readable and writable by every thread within the same block. Declare a variable with `__shared__` in the kernel to place it there.
Applying the keyword `extern` to the declaration (`extern __shared__`) lets a parameter passed from the host at kernel launch determine the size of the shared memory array (see the sketch below).
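A sketch of dynamically sized shared memory, again assuming CuPy's `RawKernel`; the size in bytes is supplied at launch via the `shared_mem` argument. The kernel here (reversing a block of values) is just an illustration.

```python
# Sketch: `extern __shared__` array whose size is set by the host at launch.
import cupy as cp

source = r'''
extern "C" __global__ void reverse_block(const float *in, float *out) {
    extern __shared__ float temp[];    // size supplied by the host at launch
    int item = threadIdx.x;
    temp[item] = in[item];
    __syncthreads();                   // wait until the whole block has written
    out[item] = temp[blockDim.x - 1 - item];
}
'''

reverse_block = cp.RawKernel(source, "reverse_block")
n = 256
in_gpu = cp.arange(n, dtype=cp.float32)
out_gpu = cp.zeros(n, dtype=cp.float32)
reverse_block((1,), (n,), (in_gpu, out_gpu), shared_mem=n * 4)  # n float32s = n * 4 bytes
print(out_gpu[:5])
```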
If using shared memory, take care when writing results back out that different blocks don't overwrite the same locations in global memory; this can be handled with e.g. an appropriate per-block offset when indexing.
`atomicAdd` can be used in CUDA code to prevent threads from overwriting the same output location: conflicting updates are serialized so none are lost.
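A sketch of `atomicAdd` in practice, assuming a simple histogram kernel compiled with CuPy's `RawKernel` (the kernel and bin count are illustrative):

```python
# Sketch: many threads safely incrementing shared output bins with atomicAdd.
import cupy as cp

source = r'''
extern "C" __global__ void histogram(const int *input, int *output, int n) {
    int item = blockIdx.x * blockDim.x + threadIdx.x;
    if (item < n) {
        atomicAdd(&output[input[item]], 1);   // updates are serialized, none are lost
    }
}
'''

histogram = cp.RawKernel(source, "histogram")
n = 2048
values = cp.random.randint(0, 16, n).astype(cp.int32)  # values fall into 16 bins
counts = cp.zeros(16, dtype=cp.int32)
threads = 256
blocks = (n + threads - 1) // threads
histogram((blocks,), (threads,), (values, counts, cp.int32(n)))
print(counts, int(counts.sum()))  # the bin counts should sum to n
```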