---
title: 2. Programming interface
tags: NVIDIA GPU programming
---

# Table of Contents
[toc]

# 2. Programming interface
## Compilation with NVCC
**`nvcc`**. A compiler driver which simplifies the process of compiling C or PTX code
* *Pros*.
    * Provides simple and familiar command line options
    * Command line options are executed by invoking a collection of tools, which implement the different compilation stages

### Compilation workflow
**Source files compiled with `nvcc`**. Include a mix of host code and device code

**Offline compilation**.
* *`nvcc` basic workflow*.
    1. Separate device code from host code
    2. Compile the device code into an assembly form, i.e. PTX code, and/or binary form, i.e. cubin object
    3. Modify the host code by replacing `<<<...>>>` syntax by the necessary CUDA C runtime function calls $\to$ To load and launch each compiled kernel from the PTX code and/or cubin object
* *Output*.
    * C code, which is left to be compiled using another tool, or
    * Object code (by letting `nvcc` invoke the host compiler during the last compilation stage)
* *Application's jobs*.
    * Link to the compiled host code (common case), or
    * Ignore the modified host code (if any) and use the CUDA driver API to load and execute the PTX code or cubin object

**Just-in-time compilation**. Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver
* *Cons*. Increases application load time
* *Pros*.
    * Allows the application to benefit from any new compiler improvements coming with each device driver
    * The only way for applications to run on devices that did not exist at the time the application was compiled

## CUDA C runtime (cudart)
**CUDA C runtime library**. `cudart`
* *Associated files*. `libcudart.a` or `libcudart.so`

### Runtime initialization
**Initialization time**. There is no explicit initialization function for the runtime $\to$ The runtime initializes the first time a runtime function is called

>**NOTE**. We need to keep this in mind when timing runtime function calls, and when interpreting the error code from the first call into the runtime

**Initialization mechanism**.
1. The runtime creates a CUDA context for each device in the system $\to$ This is the *primary context* for this device

    >**NOTE**. The primary context is shared across all the host threads of the application
2. The device code is just-in-time (JIT) compiled if necessary and loaded into device memory

>**NOTE**. The initialization process happens under the hood

>**NOTE**. The runtime does not expose the primary context to the application

**`cudaDeviceReset()`**. Destroys the primary context of the device, which the host thread currently operates on $\to$ The next runtime function call made by any host thread, which has this device as current, will create a new primary context for this device

### Device (global) memory
**Runtime memory functions**. Allocate, deallocate, and copy device memory, as well as transfer data between host memory and device memory

**Device memory organization**. Can be allocated either as linear memory or as CUDA arrays
* *Linear memory*. Exists on the device in a 40-bit address space $\to$ Separately allocated entities can reference one another via pointers
    * *Operations*.
        * *Allocation*.
            * `cudaMalloc()`
            * `cudaMallocPitch()` (best for 2D arrays) and `cudaMalloc3D()` (best for 3D arrays), see the example sketch below
            * *Explain*. These functions ensure that the allocation is appropriately padded to meet the alignment requirements of CUDA memory accesses $\to$ This ensures best performance when accessing the row addresses, or performing copies between 2D arrays and other regions of device memory
        * *Deallocation*. `cudaFree()`
        * *Transfer data*. `cudaMemcpy()`, `cudaMemcpy2D()`, and `cudaMemcpy3D()`
* *CUDA arrays*. Opaque memory layouts optimized for texture fetching
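* *Example code (sketch)*. A minimal illustration of pitched 2D allocation, assuming a hypothetical kernel `setRowKernel` and an arbitrary 64x64 `float` array; the pitch returned by `cudaMallocPitch()` is used to step between rows

```c=
#include <stdlib.h>

__global__ void setRowKernel(float *devPtr, size_t pitch, int width, int height) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < height) {
        // pitch is in bytes, so the row start is computed on a char pointer
        float *row = (float *)((char *)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
            row[c] = r * width + c;
    }
}

int main() {
    int width = 64, height = 64;
    float *devPtr;
    size_t pitch;

    // Each row is padded so that row start addresses meet the alignment requirements
    cudaMallocPitch((void **)&devPtr, &pitch, width * sizeof(float), height);
    setRowKernel<<<(height + 255) / 256, 256>>>(devPtr, pitch, width, height);

    // cudaMemcpy2D() takes the source pitch, so the padding is stripped on copy
    float *h_data = (float *)malloc(width * height * sizeof(float));
    cudaMemcpy2D(h_data, width * sizeof(float), devPtr, pitch,
                 width * sizeof(float), height, cudaMemcpyDeviceToHost);

    free(h_data);
    cudaFree(devPtr);
    return 0;
}
```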
**Basic CUDA workflow**.
* *Workflow*.
    1. Allocate memory buffers on the device
    2. Copy the input host buffers to the device memory
    3. Run the kernel
    4. Copy the output device buffers to the host memory
* *Example code*. Basic workflow of a CUDA kernel

```c=
int main() {
    float *h_A, *h_B, *h_C;
    float *d_A, *d_B, *d_C;
    ...    // host setup: define N and size (bytes), allocate and fill h_A and h_B, allocate h_C

    // 1. Allocate memory buffers on the device
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // 2. Copy the input host buffers to the device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // 3. Run the kernel
    someKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // 4. Copy the output device buffer back to the host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```

### Shared memory
**Shared memory**. Much faster than the global memory space of the device $\to$ Any opportunity to replace global memory accesses by shared memory accesses should be exploited

**Allocating shared memory**. Use the `__shared__` memory space specifier
* *Example code*. Tiled matrix multiplication, where each thread block computes one sub-matrix of `C` by staging tiles of `A` and `B` in shared memory

```c=
#define BLOCK_SIZE 16

// Matrix, GetSubMatrix(), GetElement(), and SetElement() are helper
// types/functions defined elsewhere in the full example
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
    int blockRow = blockIdx.y, blockCol = blockIdx.x;
    int row = threadIdx.y, col = threadIdx.x;
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    float Cvalue = 0;
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Each thread loads one element of each sub-matrix into shared memory
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        __syncthreads(); // block level synchronization barrier
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        __syncthreads(); // wait before overwriting As and Bs in the next iteration
    }
    SetElement(Csub, row, col, Cvalue);
}
```

### Page-locked host memory
**Page-locked host memory**.
* *Characteristics*.
    * Page-locked memory by itself is only directly accessible to the host
    * Data can be copied between page-locked memory and the GPU without an intermediate copy by the driver
* *Benefits*.
    * Copies between page-locked host memory and device memory can be performed concurrently with kernel execution on some devices
    * On some devices, page-locked host memory can be mapped into the address space of the device $\to$ There is no need to copy it to, or from, device memory
    * On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked

        >**NOTE**. The bandwidth is even higher if it is allocated as write-combining
* *Limitations*.
    * Page-locked host memory is a scarce resource
    * Consuming too much page-locked memory reduces overall system performance
* *Usage*.
    * *When to use*. When we need to transfer anything to the GPU multiple times
    * *When not to use*.
        * If we do not have control over the memory allocation of the buffer
            * *Explain*. We have to use `cudaMallocHost()` to allocate page-locked memory
        * If the size of our host buffer needs to be a significant fraction of the total memory in our computer

>**NOTE**. Page-locked memory is, by default, not portable between CUDA contexts

**Functions**.
* *Allocation and deallocation of page-locked host memory*. `cudaHostAlloc()` and `cudaFreeHost()`
* *Page-locking a range of memory allocated by `malloc()`*. `cudaHostRegister()`
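* *Example code (sketch)*. A minimal illustration, assuming a hypothetical 1 MiB buffer; `h_buf` is page-locked with `cudaHostAlloc()`, so the subsequent `cudaMemcpyAsync()` can overlap with other work in the stream

```c=
int main() {
    const size_t size = 1 << 20;    // 1 MiB, arbitrary
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaHostAlloc((void **)&h_buf, size, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc((void **)&d_buf, size);

    // Asynchronous copies require page-locked host memory to be truly asynchronous
    cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);    // page-locked memory is freed with cudaFreeHost(), not free()
    cudaStreamDestroy(stream);
    return 0;
}
```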
#### Portable memory
**Application range of page-locked memory**. A block of page-locked memory can be used in conjunction with any device in the system
* *Problem*. By default, the benefits of using page-locked memory are only available in conjunction with the device, which was current when the block was allocated, and with all devices sharing the same unified address space $\to$ To make these advantages available to all devices, the block needs to be portable
    * *Explain*. The benefits of pinned memory could otherwise only be realized on the CUDA context, which allocated it

**Functions**.
* *Allocation*. Use the `cudaHostAllocPortable` flag in `cudaHostAlloc()`
* *Registration*. Use the `cudaHostRegisterPortable` flag in `cudaHostRegister()`

#### Mapped memory (zero-copy memory)
**Page-locked host memory in device address space**. A block of page-locked host memory can be mapped into the address space of the device
* *Consequence*. Such a block has, in general, two addresses, i.e.
    * One in host memory, i.e. returned by `cudaHostAlloc()` or `malloc()`
    * One in device memory, i.e. returned by `cudaHostGetDevicePointer()`

    $\to$ The device-memory address can be used to access the block from within a kernel
* *Pros*.
    * There is no need to allocate a block in device memory and copy data between this block and the block in host memory
        * *Explain*. Data transfers are implicitly performed as needed by the kernel
    * There is no need to use CUDA streams to overlap data transfers with kernel execution
        * *Explain*. The kernel-originated data transfers automatically overlap with kernel execution
* *Cons*. The application must synchronize memory accesses using streams or events to avoid conflicts
    * *Explain*. Mapped page-locked memory is shared between host and device

**Functions**.
* *Allocation*. Use the `cudaHostAllocMapped` flag in `cudaHostAlloc()`
* *Registration*. Use the `cudaHostRegisterMapped` flag in `cudaHostRegister()`
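* *Example code (sketch)*. A minimal illustration of zero-copy access, assuming a hypothetical kernel `scaleKernel` and an arbitrary buffer size; the device pointer returned by `cudaHostGetDevicePointer()` aliases the page-locked host buffer

```c=
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;    // each access goes to host memory over PCIe/NVLink
}

int main() {
    const int N = 1 << 20;
    float *h_data, *d_data;

    cudaSetDeviceFlags(cudaDeviceMapHost);    // enable mapping before the context is created
    cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < N; ++i)
        h_data[i] = 1.0f;
    cudaHostGetDevicePointer((void **)&d_data, h_data, 0);   // device alias of h_data

    scaleKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaDeviceSynchronize();    // synchronize before the host reads the results

    cudaFreeHost(h_data);
    return 0;
}
```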
### Asynchronous concurrent execution

# Appendix
## Concepts
**PTX**. The CUDA instruction set architecture, in which CUDA kernels can be written

**Memory paging**. A memory management scheme, by which a computer stores and retrieves data from secondary storage for use in main memory
* *Explain*. The OS retrieves data from secondary storage in same-size blocks called pages

**Unified virtual address space**. In a 64-bit process, a single address space is used for the host and all the devices of compute capability 2.0 and higher $\to$ All host memory allocations made via CUDA API calls, and all device memory allocations on supported devices, are within this virtual address space
* *Consequences*.
    * The location of any memory on the host allocated through CUDA, or on any device using the unified address space, can be determined from the value of the pointer using `cudaPointerGetAttributes()`
    * When copying to or from the memory of any device using the unified address space $\to$ CUDA can automatically determine locations from the pointers, by setting `cudaMemcpyDefault` in `cudaMemcpy*()`

        >**NOTE**. This also works for host pointers not allocated through CUDA, as long as the current device uses unified addressing
    * Allocations via `cudaHostAlloc()` are automatically portable
* *Mechanism*. Any allocation created in the managed memory space is automatically migrated to where it is needed
    * *Explain*.
        * Unified memory attempts to optimize memory performance by migrating data towards the device where it is being accessed $\to$ The system tries to place data in the location where it can most efficiently be accessed without violating coherency
        * The physical location of data is invisible to a program, and may be changed at any time, but accesses to the data's virtual address remain valid and coherent from any processor, regardless of locality

        >**NOTE**. Maintaining coherence is the primary requirement, ahead of performance
    * *Detailed procedure*. A GPU page-faulting mechanism provides more seamless unified memory functionality $\to$ Combined with the system-wide virtual address space, page faulting provides several benefits
        * *Benefits of page faulting*.
            * The CUDA system software does not need to synchronize all managed memory allocations to the GPU before each kernel launch
                * *Explain*.
                    * If a kernel running on the GPU accesses a page, which is not resident in its memory $\to$ It faults, allowing the page to be automatically migrated to the GPU memory on demand
                    * Alternatively, the page may be mapped into the GPU address space for access over the PCIe or NVLink interconnects
    * *System allocator*. If supported by the host CPU and the OS, address translation services (ATS) allow the GPU to directly access the CPU's page tables
        * *Procedure*.
            1. A miss in the GPU MMU results in an address translation request (ATR) to the CPU
            2. The CPU looks in its page tables for the virtual-to-physical mapping for that address
            3. The CPU supplies the translation back to the GPU
        * *Other benefits*. ATS gives the GPU full access to the system memory, e.g. memory allocated with `malloc()`, memory allocated on the stack, global variables, and file-backed memory $\to$ An application can query whether the device supports coherently accessing pageable memory via ATS
    * *Access counters*. Keep track of the frequency of access that a GPU makes to memory allocated on other processors
        * *Usage*.
            * Ensure memory pages are moved to the physical memory of the processor, which is accessing the pages most frequently
            * Guide migrations between CPU and GPU, and between GPUs
    * *Performance tuning*.
        * Faults should be avoided
        * Data should be local to the accessing processor
        * Memory thrashing should be prevented
            * *Explain*. If data is frequently accessed by multiple processors and has to be constantly migrated around to achieve data locality $\to$ The overhead of migration may exceed the benefits of locality
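* *Example code (sketch)*. A minimal illustration of on-demand migration, assuming a hypothetical kernel `addOne`; the single pointer returned by `cudaMallocManaged()` is valid on both host and device, and pages fault and migrate as each side touches them

```c=
#include <stdio.h>

__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main() {
    const int N = 1 << 20;
    int *data;

    cudaMallocManaged((void **)&data, N * sizeof(int));   // managed (unified) allocation

    for (int i = 0; i < N; ++i)     // first touch on the host
        data[i] = i;

    addOne<<<(N + 255) / 256, 256>>>(data, N);   // pages fault and migrate to the GPU
    cudaDeviceSynchronize();                     // required before the host touches data again

    printf("data[0] = %d\n", data[0]);           // host access migrates pages back
    cudaFree(data);
    return 0;
}
```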