---
title: 2. Programming interface
tags: NVIDIA GPU programming
---
# Table of Contents
[toc]
# 2. Programming interface
## Compilation with NVCC
**`nvcc`**. A compiler driver that simplifies the process of compiling C/C++ or PTX code
* *Pros*.
* Provide simple and familiar command line options
* Command line options are executed by invoking the collection of tools, which implement the different compilation stages
### Compilation workflow
**Source files compiled with `nvcc`**. May include a mix of host code, i.e. code that runs on the CPU, and device code, i.e. code that runs on the GPU
**Offline compilation**.
* *`nvcc` basic workflow*.
1. Separate device code from host code
2. Compile the device code into an assembly form, i.e. PTX code, and/or binary form, i.e. cubin object
3. Modify the host code by replacing the `<<<...>>>` syntax with the necessary CUDA C runtime function calls
$\to$ To load and launch each compiled kernel from the PTX code and/or cubin object
* *Output*.
* C code, which is left to be compiled using another tool, or
* Object code (by letting `nvcc` invoke the host compiler during the last compilation stage)
* *Application's jobs*.
* Link to the compiled host code (common case), or
* Ignore the modified host code (if any) and use the CUDA driver API to load and execute the PTX code or cubin object
**Just-in-time compilation**. Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver
* *Cons*. Increase application load time, since any loaded PTX code must be compiled to binary code by the device driver before it can run
* *Pros*.
* Allow the application to benefit from any new compiler improvements coming with each device driver
* The only way for applications to run on devices that did not exist at the time the application was compiled
## CUDA C runtime (cudart)
**CUDA C runtime library**. `cudart`
* *Associated files*. `libcudart.a` or `libcudart.so`
### Runtime initialization
**Initialization time**. There is no explicit initialization function for the runtime
$\to$ The runtime initializes the first time a runtime function is called
>**NOTE**. We need to keep this in mind when timing runtime function calls, and when interpreting the error code from the first call into the runtime
**Initialization mechanism**.
1. The runtime creates a CUDA execution context for each device in the system
$\to$ This is the *primary context* for this device
>**NOTE**. The primary context is shared across all the host threads of the application
2. The device code is just-in-time (JIT) compiled if necessary and loaded into device memory
>**NOTE**. The initialization process happens under the hood
>**NOTE**. The runtime does not expose the primary context to the application
**`cudaDeviceReset()`**. Destroy the primary context of the device that the host thread currently operates on
$\to$ The next runtime function call made by any host thread, which has this device as current, will create a new primary context for this device
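* *Example code*. A minimal sketch of implicit initialization and `cudaDeviceReset()`, assuming a single default device; `cudaFree(0)` is used here only as a conventional way to force initialization and inspect its error code
```c=
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    // The first runtime call implicitly initializes the runtime and creates
    // the primary context of the current device, so its error code may also
    // report initialization failures
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess)
        printf("Runtime initialization failed: %s\n", cudaGetErrorString(err));

    // ... allocate memory, launch kernels, etc. ...

    // Destroy the primary context of the current device; the next runtime call
    // made by any host thread with this device as current re-creates it
    cudaDeviceReset();
    return 0;
}
```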
### Device (global) memory
**Runtime memory functions**. Allocate, deallocate, and copy device memory, as well as transfer data between host memory and device memory
**Device memory organization**. Device memory can be allocated either as linear memory or as CUDA arrays (a pitched-allocation sketch is given after the list below)
* *Linear memory*. Exists on the device in a 40-bit address space
$\to$ Separately allocated entities can reference one another via pointers
* *Operations*.
* *Allocation*.
* `cudaMalloc()`
* `cudaMallocPitch()` (best for 2D array), and `cudaMalloc3D()` (best for 3D array)
* *Explain*. These functions ensure that the allocation is appropriately padded to meet the alignment requirements of CUDA memory accesses
$\to$ This ensures best performance when accessing the row addresses, or performing copies between 2D arrays and other regions of device memory
* *Deallocation*. `cudaFree()`
* *Transfer data*. `cudaMemcpy()`, `cudaMemcpy2D()`, and `cudaMemcpy3D()`
* *CUDA arrays*. Opaque memory layouts optimized for texture fetching
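* *Example code*. A minimal sketch of a pitched 2D allocation, assuming a hypothetical `scale2D` kernel and illustrative sizes; rows must be addressed using the pitch (in bytes) returned by `cudaMallocPitch()`
```c=
#include <cuda_runtime.h>

// Hypothetical kernel: scales every element of a pitched 2D float array
__global__ void scale2D(float* devPtr, size_t pitch, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Rows are separated by `pitch` bytes, not by `width * sizeof(float)`
        float* row = (float*)((char*)devPtr + y * pitch);
        row[x] *= 2.0f;
    }
}

int main() {
    int width = 64, height = 64;
    float* devPtr;
    size_t pitch;
    // The allocation is padded so that each row meets the alignment requirements
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale2D<<<grid, block>>>(devPtr, pitch, width, height);
    cudaDeviceSynchronize();

    cudaFree(devPtr);
    return 0;
}
```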
**Basic CUDA workflow**.
* *Workflow*.
1. Allocate memory buffers in the device
2. Copy the input host buffers to the device memory
3. Run the kernel
4. Copy the output device buffers back to the host memory
* *Example code*. Basic workflow of a CUDA kernel
```c=
#include <cuda_runtime.h>

// someKernel is a placeholder for any kernel operating on d_A, d_B, d_C
int main() {
    int N = 1 << 20;                  // number of elements (illustrative)
    size_t size = N * sizeof(float);
    float *h_A, *h_B, *h_C;
    ...                               // allocate and initialize host buffers
    // 1. Allocate memory buffers on the device
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    // 2. Copy the input host buffers to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // 3. Run the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    someKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // 4. Copy the output device buffer back to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```
### Shared memory
**Shared memory**. Much faster than the global memory space of the device
$\to$ Any opportunity to replace global memory accesses by shared memory accesses should be exploited
**Allocating shared memory**. Use the `__shared__` memory space specifier
* *Example code*.
```c=
#define BLOCK_SIZE 16

// Matrix, GetSubMatrix(), GetElement(), and SetElement() are the helper type
// and device functions defined in the full matrix-multiplication example of
// the CUDA C Programming Guide
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
    int blockRow = blockIdx.y, blockCol = blockIdx.x;
    int row = threadIdx.y, col = threadIdx.x;
    // Each thread block computes one BLOCK_SIZE x BLOCK_SIZE sub-matrix of C,
    // and each thread computes one element of that sub-matrix
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    float Cvalue = 0;
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        // Stage the current sub-matrices in shared memory, one element per thread
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        __syncthreads(); // block-level synchronization barrier
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        __syncthreads(); // wait before overwriting As and Bs in the next iteration
    }
    SetElement(Csub, row, col, Cvalue);
}
```
### Page-locked host memory
**Page-locked host memory**.
* *Characteristics*.
* Page-locked memory by itself is only directly accessible to the host
* Data can be copied between the page-locked memory and the GPU without an intermediate copy by the driver
* *Benefits*.
* Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices
* On some devices, page-locked host memory can be mapped into the address space of the device
$\to$ There is no need to copy it to, or from, device memory
* On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked
>**NOTE**. The bandwidth will be even higher if it is allocated as write-combining
* *Limitations*.
* Page-locked host memory is a scarce resource
* Consuming too much page-locked memory reduces overall system performance
* *Usage*.
* *When to use*. When we need to transfer anything to the GPU multiple times
* *When not to use*.
* If we do not have control over memory allocation of the buffer
* *Explain*. The buffer must be allocated with `cudaMallocHost()` or `cudaHostAlloc()`, unless it can be page-locked afterwards with `cudaHostRegister()`
* If the size of our host buffer needs to be a significant fraction of the total memory in our computer
>**NOTE**. Page-locked memory is, by default, not portable between CUDA contexts
**Functions**.
* *Allocation and deallocation of page-locked host memory*. `cudaHostAlloc()` and `cudaFreeHost()`
* *Page-locking a range of memory allocated by `malloc()`*. `cudaHostRegister()`
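* *Example code*. A minimal sketch (illustrative buffer size) of allocating pinned memory with `cudaHostAlloc()` and page-locking a `malloc()`-allocated range with `cudaHostRegister()`
```c=
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    size_t bytes = 1 << 20;

    // Page-locked (pinned) allocation; flags such as cudaHostAllocPortable or
    // cudaHostAllocMapped (see the subsections below) can be OR'ed into the
    // last argument
    float* h_pinned;
    cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);

    // Alternatively, page-lock a range that was allocated with malloc()
    float* h_regular = (float*)malloc(bytes);
    cudaHostRegister(h_regular, bytes, cudaHostRegisterDefault);

    // ... copies from these buffers can overlap with kernel execution
    //     on devices that support it ...

    cudaHostUnregister(h_regular);
    free(h_regular);
    cudaFreeHost(h_pinned);
    return 0;
}
```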
#### Portable memory
**Application range of page-locked memory**. A block of page-locked memory can be used in conjunction with any device in the system
* *Problem*. By default, the benefits of page-locked memory are only available in conjunction with the device that was current when the block was allocated, and with all devices sharing the same unified address space
$\to$ To make these benefits available to all devices, the block needs to be allocated as portable
* *Explain*. Otherwise, the benefits of pinned memory are only realized by the CUDA context that allocated it
**Functions**.
* *Allocation*. Use `cudaHostAllocPortable` flag in `cudaHostAlloc`
* *Registration*. Use `cudaHostRegisterPortable` flag in `cudaHostRegister`
#### Mapped memory (zero-copy memory)
**Page-locked host memory in device address space**. A block of page-locked host memory can be mapped into the address space of the device
* *Consequence*. Such a block has, in general, two addresses, i.e.
* One in host memory, i.e. returned by `cudaHostAlloc()` or `malloc()`
* One in device memory, i.e. returned by `cudaHostGetDevicePointer()`
$\to$ The device-memory address can be used to access the block from within a kernel
* *Pros*.
* There is no need to allocate a block in device memory and copy data between this block and the block in host memory
* *Explain*. Data transfers are implicitly performed as needed by the kernel
* There is no need to use CUDA streams to overlap data transfers with kernel execution
* *Explain*. The kernel-originated data transfers automatically overlap with kernel execution
* *Cons*. The application must synchronize memory accesses using streams or events to avoid potential read-after-write, write-after-read, or write-after-write hazards
* *Explain*. Mapped page-locked memory is shared between host and device
**Functions**.
* *Allocation*. Use `cudaHostAllocMapped` flag in `cudaHostAlloc`
* *Registration*. Use `cudaHostRegisterMapped` flag in `cudaHostRegister`
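* *Example code*. A minimal sketch of zero-copy access, assuming a hypothetical `touch` kernel; `cudaSetDeviceFlags(cudaDeviceMapHost)` is required before any other CUDA call on devices without unified addressing
```c=
#include <cuda_runtime.h>

// Hypothetical kernel that writes host memory through the mapped device pointer
__global__ void touch(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = (float)i;
}

int main() {
    int n = 1024;

    // Must precede any other CUDA call on devices without unified addressing
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate page-locked memory mapped into the device address space
    float* h_data;
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);

    // Obtain the device-side address of the same physical memory
    float* d_data;
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    // The kernel accesses host memory directly; no cudaMemcpy() is needed
    touch<<<(n + 255) / 256, 256>>>(d_data, n);

    // Synchronize before the host reads, since the block is shared with the device
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
    return 0;
}
```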
### Asynchronous concurrent execution
# Appendix
## Concepts
**PTX**. The CUDA instruction set architecture, in which CUDA kernels can be written directly or into which they are compiled
**Memory paging**. A memory management scheme, by which a computer stores and retrieves data from secondary storage for use in main memory
* *Explain*. The OS retrieves data from secondary storage in same-size blocks called pages
**Unified virtual address space**. In a 64-bit process, a single address space is used for the host and all the devices of compute capability 2.0 and higher
$\to$ All host memory allocations made via CUDA API calls, and all device memory allocations on supported devices, are within this virtual address space
* *Consequence*.
* The location of any memory on the host allocated through CUDA, or any devices using the unified address space, can be determined from the value of the pointer using `cudaPointerGetAttributes()`
* When copying to or from the memory of any device using the unified address space
$\to$ CUDA can automatically determine locations from the pointers, by setting `cudaMemcpyDefault` in `cudaMemcpy*()`
>**NOTE**. This works for host pointers not allocated through CUDA also, as long as the current device uses unified addressing
* Allocations via `cudaHostAlloc()` are automatically portable
* *Mechanism*. Any allocation created in the managed memory space, e.g. via `cudaMallocManaged()`, is automatically migrated to where it is needed (see the sketch at the end of this entry)
* *Explain*.
* Unified memory attempts to optimize memory performance by migrating data towards the device where it is being accessed
$\to$ The system will try to place data in the location where it can most efficiently be accessed without violating coherency
* The physical location of data is invisible to a program, and may be changed at any time, but accesses to the data's virtual address will remain valid and coherent from any processor, regardless of locality
>**NOTE**. Maintaining coherence is the primary requirement, ahead of performance
* *Detailed procedure*. A new GPU page faulting mechanism is introduced that provides more seamless unified memory functionality
$\to$ Combined with the system-wide virtual address space, page faulting provides several benefits
* *Benefits of page faulting*.
* The CUDA system software does not need to synchronize all managed memory allocations to the GPU, before each kernel launch
* *Explain*.
* If a kernel running on the GPU accesses a page, which is not resident in its memory
$\to$ It faults, allowing the page to be automatically migrated to the GPU memory on-demand
* Alternatively, the page may be mapped into the GPU address space for access over the PCIe or NVLink interconnects
* *System allocator*. If supported by the host CPU and the OS, address translation services (ATS) allow the GPU to directly access the CPU's page tables
* *Procedure*.
1. A miss in the GPU MMU will result in an address translation request (ATR) to the CPU
2. The CPU looks in its page tables for the virtual-to-physical mapping for that address
3. The CPU supplies the translation back to the GPU
* *Other benefits*. ATS provides the GPU full access to the system memory, e.g. memory allocated with `malloc()`, memory allocated on the stack, global variables, and file-backed memory
$\to$ An application can query whether the device supports coherently accessing pageable memory via ATS
* *Access counters*. Keep track of the frequency of access that a GPU makes to memory allocated on other processors
* *Usage*.
* Ensure memory pages are moved to the physical memory of the processor, which is accessing the page most frequently
* Guide migrations between CPU and GPU, and between GPUs
* *Performance tuning*.
* Faults should be avoided
* Data should be local to the accessing processor
* Memory thrashing should be prevented
* *Explain*. If data is frequently accessed by multiple processors and has to be constantly migrated around to achieve data locality
$\to$ The overhead of migration may exceed the benefits of locality
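* *Example code*. A minimal sketch (illustrative sizes, hypothetical `increment` kernel) of unified/managed memory: `cudaMallocManaged()` returns a single pointer valid on both host and device, and `cudaDeviceGetAttribute()` with `cudaDevAttrPageableMemoryAccess` queries whether the device can coherently access pageable host memory
```c=
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel: increments every element of a managed array
__global__ void increment(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    int n = 1 << 20;
    float* data;

    // One allocation in the managed memory space; the same pointer is
    // dereferenceable from the host and the device, and pages migrate
    // (or fault) to wherever they are accessed
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 0.0f;      // host access

    increment<<<(n + 255) / 256, 256>>>(data, n);    // device access
    cudaDeviceSynchronize();                         // wait before host reads
    printf("data[0] = %f\n", data[0]);

    // Query whether the device can coherently access pageable host memory,
    // e.g. malloc()'d memory, as on ATS-capable systems
    int pageableAccess = 0;
    cudaDeviceGetAttribute(&pageableAccess, cudaDevAttrPageableMemoryAccess, 0);
    printf("Pageable memory access supported: %d\n", pageableAccess);

    cudaFree(data);
    return 0;
}
```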
## References
* [More about unified memory](https://www.nextplatform.com/2019/01/24/unified-memory-the-final-piece-of-the-gpu-programming-puzzle/)
* [How unified memory works](https://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf)