---
title: Appendix
tags: NVIDIA GPU programming
---

# Table of Contents
[toc]

# Appendix
## C++
### Types of pointers
**`auto_ptr`**. A smart pointer that manages an object obtained via a `new` expression and deletes that object when the `auto_ptr` itself is destroyed
* *Structure*. A pointer to an allocated object, which ensures that when the `auto_ptr` goes out of scope
    $\to$ The pointed-to object is automatically destroyed
* *Exclusive ownership model*. `auto_ptr` is based on the exclusive ownership model
* *Example code*.

    ```c=
    #include <memory>

    std::auto_ptr<ClassName> ptr(new ClassName);
    ```
* *Why deprecated*. Due to the exclusive ownership model, assignment or copy transfers ownership and resets the source `auto_ptr` to `nullptr`
    $\to$ `auto_ptr` cannot be used within STL containers since it cannot be copied

**`unique_ptr`**. A replacement for `auto_ptr` with improved safety, i.e. no fake copy assignments, added features, i.e. custom deleters, and support for arrays
* *Structure*. A container for a raw pointer
* *Features*.
    * `unique_ptr` cannot be copied, i.e.

        ```c=
        unique_ptr<A> ptr1(new A);
        unique_ptr<A> ptr2 = ptr1; // ERROR: cannot copy unique_ptr
        ```
    * `unique_ptr` can be moved with `std::move()`, i.e.

        ```c=
        unique_ptr<A> ptr2 = std::move(ptr1);
        ```
* *Usage*. When we want exclusive ownership of a resource

**`shared_ptr`**. A container for a raw pointer, based on the reference counting ownership model
* *Usage*. When we want to share ownership of a resource

**`weak_ptr`**. A copy of a `shared_ptr` which provides access to an object owned by one or more `shared_ptr` instances, but does not participate in reference counting
* *Usage*. To avoid cyclic dependency
* *Cyclic dependency*. Consider the graph below with three shared pointers A, B, and C

    ```
    The rest of the program --> object A --> object B
                                    ^            |
                                     \           |
                                      \          v
                                       +---- object C
    ```

    If the rest of the program is destroyed, the reference count of A only drops from `2` to `1`, so A, B, and C are never destroyed, even though we want everything to be destroyed. Replacing one edge of the cycle, e.g. the pointer from C back to A, with a `weak_ptr` breaks the cycle, as sketched below
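A minimal sketch of breaking such a cycle with `weak_ptr`, using two hypothetical classes `A` and `B` rather than the three-object cycle above for brevity:

```c++
#include <iostream>
#include <memory>

struct B;   // forward declaration

struct A {
    std::shared_ptr<B> b;   // A owns B
    ~A() { std::cout << "A destroyed\n"; }
};

struct B {
    std::weak_ptr<A> a;     // back-reference: observes A without owning it
    ~B() { std::cout << "B destroyed\n"; }
};

int main() {
    auto a = std::make_shared<A>();
    a->b = std::make_shared<B>();
    a->b->a = a;            // does not increase A's reference count
    return 0;               // both destructors run; with a shared_ptr back-reference, neither would
}
```

To use the observed object, call `lock()` on the `weak_ptr`, which returns a `shared_ptr` that is empty if the object has already been destroyed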
### Types of data type casting
**Implicit conversion**.
* *Use cases*. Automatically performed when a value is copied to a compatible type
* *Types of implicit conversion*.
    * *Standard conversion*. Conversions between fundamental types which may not preserve the exact value, e.g. `double` to `int`
    * *Promotion*. A standard conversion from a smaller type to a larger type which can represent all values of the source type, e.g. `char` to `int`, `float` to `double`
* *Implicit conversion between classes*.
    * *Invoking operations*.
        * *Single-argument constructors*. Implicitly convert a particular type to initialize an object, i.e.

            ```c=
            class A {};
            class B {
            public:
                B (const A& x) {}
            };
            ```
        * *Assignment operator*. Implicitly convert a particular type on assignments, i.e.

            ```c=
            B& operator= (const A& x) {return *this;}
            ```
        * *Type-cast operator*. Implicitly convert to a particular type, i.e.

            ```c=
            operator A() {return A();}
            ```
    * *Special cases*.
        * Null pointers can be converted to pointers of any type
        * Pointers to any type can be converted to void pointers
        * Pointers to a derived class can be converted, i.e. upcast, to a pointer of an accessible and unambiguous base class

**Type casting**.
* *Traditional type casting*.

    ```c=
    (new_type) expression   // C-like type casting
    new_type (expression)   // functional type casting
    ```
* *Dynamic casting*.
    * *Ability*.
        * Pointer upcast, in the same way as allowed in implicit conversion
        * Downcast polymorphic classes, i.e. those with virtual members, if and only if the pointed object is a valid complete object of the target type
        * Implicitly cast null pointers between pointer types
        * Cast a pointer of any type to a `void*` pointer
    * *Usage*. Can only be used with pointers and references to classes, or with `void*`
    * *Example*.

        ```c=
        // Base must be polymorphic, i.e. have at least one virtual member
        Base *pba = new Derived;
        Base *pbb = new Base;
        Derived *pd;

        pd = dynamic_cast<Derived*>(pba);   // OK: pba points to a complete Derived object
        pd = dynamic_cast<Derived*>(pbb);   // fails the runtime check and returns nullptr
        ```
* *Static casting*. Perform conversions between pointers to related classes, both upcast and downcast

    >**NOTE**. No checks are performed during runtime to guarantee that the converted object is a full object of the destination type
    >$\to$ Programmers must check this, but static casting does not have the overhead of the type-safety checks of dynamic casting

    * *Ability*.
        * Explicitly call a single-argument constructor or a conversion operator
        * Convert to rvalue references
        * Convert enum class values into integers or floating-point values
        * Convert any type to `void`, evaluating and discarding the value
    * *Difference from dynamic casting*.
        * `static_cast` performs no runtime checks. A conversion between unrelated types is rejected at compile time, while an incorrect downcast compiles but yields undefined behavior
        * `dynamic_cast` performs runtime checks. If a pointer cannot be cast to the target type, it returns a null pointer (a failed reference cast throws `std::bad_cast`)
* *Reinterpret casting*. Convert any pointer type to any other pointer type, even of unrelated classes
    * *Mechanism*. Use a simple binary copy of the value from one pointer to the other
* *Const casting*. Manipulate the constness of the object pointed to by a pointer, i.e. constness can either be set or removed

## TensorRT
### General workflow
<div style="text-align:center">
    <img src="https://i.imgur.com/JLc2L0o.png">
    <figcaption>TensorRT workflow</figcaption>
</div>

**Loading engines**.
1. Create an IRuntime instance `runtime`
2. Deserialize a binary string to an ICudaEngine `engine`

**Infer with engines**. The steps below are sketched in code after this section
1. Create a CUDA stream for async execution
    * *Comment*. Different streams may run concurrently and asynchronously
2. Create an IExecutionContext `exe_context`
    * *Comment*. One engine can have multiple execution contexts
3. Copy the host input buffer to the device input buffer
4. Pass the associated I/O buffers to `exe_context.enqueue()` (async) to execute inference
    * *Alternative methods*.
        * `exe_context.enqueueV2()` (async)
        * `exe_context.execute()` (sync)
        * `exe_context.executeV2()` (sync)
5. Synchronize the CUDA stream
6. Copy the device output buffer to the host output buffer

**Cautions**.
* Using IReduceLayer from TRT is expensive and makes the following computations of the model denser
    $\to$ If we use the reduction implementation from cuda/samples instead, the reduction is cheaper and the model computation stays sparser
* When adding a plugin to a TRT model, TRT will automatically run every method of the added plugin with dummy inputs, e.g. zero tensors
    $\to$ The plugin must be prepared to handle such inputs
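A minimal sketch of the load-and-infer flow above, assuming the serialized engine has already been read into `engineData`/`engineSize`, a logger `gLogger` exists, the engine has exactly one input and one output binding, and device buffers `dInput`/`dOutput` plus host buffers `hInput`/`hOutput` of the right sizes are already allocated (error handling and object destruction omitted; API details may vary slightly across TensorRT versions):

```c++
#include <NvInfer.h>
#include <cuda_runtime_api.h>

void infer(const void *engineData, size_t engineSize, nvinfer1::ILogger &gLogger,
           void *dInput, void *dOutput, void *hInput, void *hOutput,
           size_t inputBytes, size_t outputBytes) {
    // Loading the engine
    nvinfer1::IRuntime *runtime = nvinfer1::createInferRuntime(gLogger);
    nvinfer1::ICudaEngine *engine = runtime->deserializeCudaEngine(engineData, engineSize);

    // Inference
    nvinfer1::IExecutionContext *context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(dInput, hInput, inputBytes, cudaMemcpyHostToDevice, stream);

    void *bindings[] = {dInput, dOutput};           // ordered by binding index
    context->enqueueV2(bindings, stream, nullptr);  // async execution on the stream

    cudaMemcpyAsync(hOutput, dOutput, outputBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```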
## GPU coding
**Tools for CUDA coding**.
* *STL-like library*. `thrust`
* *Image processing with CUDA*. `npp`

### Image processing
**Fundamentals**.
* *Image representation in GPU*. A contiguous array of pixel values
* *Basic flow*.
    1. Copy the image array from host to device
    2. Process the image array on the device
    3. Copy the image array from device to host
* *Example code*. Resize an RGB image with NPP and the CUDA runtime

    ```c=
    #include <cstdio>
    #include <cstdlib>
    #include <iostream>
    #include <cuda_runtime.h>
    #include <npp.h>

    // CUDA_CALL is assumed to be an error-checking macro defined elsewhere

    // source - 3 x 32f, interleaved RGBRGBRGB...
    float* decimate_cuda(float* readbuff, uint32_t nSrcH, uint32_t nSrcW,
                         uint32_t nDstH, uint32_t nDstW)
    {
        size_t srcStep;
        size_t dstStep;

        // rows = height; columns = width
        NppiSize oSrcSize = {(int)nSrcW, (int)nSrcH};
        NppiRect oSrcROI = {0, 0, (int)nSrcW, (int)nSrcH};
        float *devSrc;
        CUDA_CALL(cudaMallocPitch((void**)&devSrc, &srcStep,
                                  3 * nSrcW * sizeof(float), nSrcH));
        CUDA_CALL(cudaMemcpy2D(devSrc, srcStep, readbuff,
                               3 * nSrcW * sizeof(Npp32f),
                               3 * nSrcW * sizeof(Npp32f), nSrcH,
                               cudaMemcpyHostToDevice));

        NppiSize oDstSize = {(int)nDstW, (int)nDstH};
        NppiRect oDstROI = {0, 0, (int)nDstW, (int)nDstH};
        float *devDst;
        CUDA_CALL(cudaMallocPitch((void**)&devDst, &dstStep,
                                  3 * nDstW * sizeof(float), nDstH));

        NppStatus result = nppiResize_32f_C3R(devSrc, srcStep, oSrcSize, oSrcROI,
                                              devDst, dstStep, oDstSize, oDstROI,
                                              NPPI_INTER_SUPER);
        if (result != NPP_SUCCESS) {
            std::cerr << "Unable to run decimate_cuda, error " << result << std::endl;
        }

        Npp64s writesize;
        Npp32f *hostDst;
        writesize = (Npp64s)nDstW * nDstH * 3;   // RGB
        if (NULL == (hostDst = (Npp32f*)malloc(writesize * sizeof(Npp32f)))) {
            printf("Error: Unable to allocate hostDst in decimate_cuda, exiting...\n");
            exit(1);
        }
        CUDA_CALL(cudaMemcpy2D(hostDst, nDstW * 3 * sizeof(Npp32f), devDst, dstStep,
                               nDstW * 3 * sizeof(Npp32f), nDstH,
                               cudaMemcpyDeviceToHost));

        CUDA_CALL(cudaFree(devSrc));
        CUDA_CALL(cudaFree(devDst));
        return hostDst;
    }
    ```

## Memory optimization with CUDA
<div style="text-align:center">
    <img src="https://i.imgur.com/JuPCfhp.png">
    <figcaption>Memory spaces on CUDA device</figcaption>
</div>

### Minimizing data transfer
**Pageable memory**. Host data allocations are pageable by default, but the GPU cannot access data directly from pageable host memory

<div style="text-align:center">
    <img src="https://i.imgur.com/NdlEzQu.png">
    <figcaption>Memory transfer from CPU to GPU</figcaption>
</div>

* *Data transfer from host to device*. When a data transfer from pageable host memory to device memory is invoked
    1. The CUDA driver must first allocate a temporary page-locked, or pinned, host array
    2. The driver then copies the host data to the pinned array
    3. The driver then transfers data from the pinned array to device memory
* *Consequence*. We can avoid the cost of the intermediate copy by directly allocating our host arrays in pinned memory, as sketched after this section
    $\to$ This is where page-locked memory comes in

**Batching small transfers**. Due to the overhead associated with each transfer, it is preferable to batch many small transfers together into a single transfer
* *Idea*. Use a temporary array, preferably pinned, and pack it with the data to be transferred
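A minimal sketch, assuming a plain float array, of allocating pinned host memory with `cudaMallocHost` and transferring it asynchronously (error handling omitted):

```c++
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hPinned, *dData;
    cudaMallocHost(&hPinned, bytes);   // page-locked (pinned) host allocation
    cudaMalloc(&dData, bytes);

    for (size_t i = 0; i < n; ++i) hPinned[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned memory skips the intermediate staging copy and allows
    // transfers that are truly asynchronous with respect to the host
    cudaMemcpyAsync(dData, hPinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(hPinned, dData, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dData);
    cudaFreeHost(hPinned);
    return 0;
}
```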
### Overlapping data transfers
**Overlapping kernel execution and data transfer**.
* *Requirements*.
    * The device must be capable of concurrent copy and kernel execution
        $\to$ This can be queried from the `deviceOverlap` field of the `cudaDeviceProp` struct
    * The kernel execution and the data transfer to be overlapped must both occur in different, non-default streams
    * The host memory involved in the data transfer must be pinned memory
* *Approaches* (sketched after this list).
    * Use multiple streams to carry out execution and data transfer
    * Split the big array into chunks and use different CUDA streams to transfer them
    * Batch similar operations together, issuing all host-to-device transfers first, followed by all kernel launches, and then all device-to-host transfers
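A minimal sketch of the chunked, multi-stream approach, assuming a hypothetical element-wise `kernel`, pinned host memory, and an array size divisible by the number of streams (error handling omitted):

```c++
#include <cuda_runtime.h>

__global__ void kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;    // placeholder per-element work
}

int main() {
    const int nStreams = 4;
    const int n = 1 << 22;
    const int chunk = n / nStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned, required for async overlap
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Batched issue order: all H2D copies, then all kernels, then all D2H copies
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d + offset, h + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
    }
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        kernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + offset, chunk);
    }
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(h + offset, d + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < nStreams; ++s) cudaStreamSynchronize(streams[s]);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

On devices with separate copy and compute engines, the transfer of chunk `s+1` can proceed while the kernel on chunk `s` executes, hiding much of the transfer time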
### Efficient memory access
#### Data structure alignment
**Data structure alignment**. The way data is arranged and accessed in computer memory
* *Efficiency of data access*. The CPU in modern computer hardware performs reads and writes to memory most efficiently when the data is naturally aligned
    * *Explain*. The datum's memory address is a multiple of the datum's size
    * *Example*. In a 32-bit architecture, a 4-byte datum is naturally aligned if it is stored in 4 consecutive bytes and the first byte lies on a 4-byte boundary
* *Data alignment*. The aligning of elements according to their natural alignment
* *Padding*. To ensure natural alignment
    $\to$ It may be required to insert some padding between structure elements or after the last element of a structure
* *Packing*. Instead of padding, which may lead to slower access, we can pack the structure

**Data alignment**. Assume that each primitive datum is a power of 2 bytes long
* *Aligned memory address*. A memory address $a$ is said to be $n$-byte aligned when $a$ is a multiple of $n$ bytes, where $n$ is a power of $2$
* *Aligned memory access*. A memory access is said to be aligned when the data being accessed is $n$ bytes long and the datum address is $n$-byte aligned
* *Aligned memory pointer*. A memory pointer referring to a primitive datum which is $n$ bytes long is said to be aligned if it is only allowed to contain addresses which are $n$-byte aligned

**Problems**. The CPU accesses memory one memory word at a time
$\to$ As long as the memory word size is at least as large as the largest primitive data type supported by the computer, aligned accesses will always access a single memory word
* *Misaligned memory access*. If the highest and lowest bytes of a datum are not within the same memory word
    $\to$ The computer must split the datum access into multiple memory accesses
* *Consequence*. It requires a lot of complex circuitry to generate and coordinate misaligned memory accesses

>**NOTE**. Misaligned memory access may also lead to other problems, e.g. synchronization between reading and writing to memory cells

**Typical alignment of C structs on x86**.
* *Primitive types*.

    | Data type | Alignment |
    | --- | --- |
    | `char` | 1-byte aligned |
    | `short` | 2-byte aligned |
    | `int` | 4-byte aligned |
    | `double` | 8-byte aligned on Windows and 4-byte aligned on Linux |
    | `long long` | 4-byte aligned |
    | `long double` | 8-byte aligned with C++ Builder, 2-byte aligned with DMC, 4-byte aligned with GCC |
* *Structures*.
    * *Before compilation*. Without padding, the structure occupies 8 bytes

        ```c++
        struct MixedData
        {
            char Data1;
            short Data2;
            int Data3;
            char Data4;
        };
        ```
    * *After compilation*. After compilation, the structure is supplemented with padding bytes to ensure proper alignment of each of its members, i.e.

        ```c++
        struct MixedData  /* After compilation in 32-bit x86 machine */
        {
            char Data1;       /* 1 byte */
            char Padding1[1]; /* 1 byte for the following 'short' to be aligned on a
                                 2-byte boundary, assuming that the address where the
                                 structure begins is an even number */
            short Data2;      /* 2 bytes */
            int Data3;        /* 4 bytes - largest structure member */
            char Data4;       /* 1 byte */
            char Padding2[3]; /* 3 bytes to make the total size of the structure 12 bytes */
        };
        ```

        >**NOTE**. The last member is padded with the number of bytes required so that the total size of the structure is a multiple of the largest alignment of any structure member

        >**NOTE**. We can check the size after compilation with `sizeof`
    * *Other tricks*. By reordering structure members, or changing the compiler's alignment of structure members
        $\to$ We can change the alignment of structures to reduce the memory they require
        * *Example*. The compiled size of the structure now matches the pre-compiled size of 8 bytes

            ```c++
            struct MixedData  /* after reordering */
            {
                char Data1;
                char Data4;   /* reordered */
                short Data2;
                int Data3;
            };
            ```

**Allocating memory aligned to cache lines**. It is beneficial to allocate memory aligned to cache lines
* *Cache line*. The unit of data transfer between the cache and main memory
    * *Explain*. The processor reads or writes an entire cache line when any location in the cache-line region is read or written
    * *Typical cache line*. 64 bytes
* *Adaptation to CUDA*. If an array is partitioned for more than one thread to operate on
    $\to$ Having the sub-array boundaries unaligned to cache lines could lead to performance degradation

#### Efficient memory access in CUDA
**Global memory coalescing**. The grouping of threads into warps is relevant not only to computation but also to global memory accesses
* *Global memory accesses*. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth

    >**NOTE**. We should ensure that global memory accesses are coalesced whenever possible
    * *Access requirements for coalescing*.
        * *Compute capability 6.0 or higher*. The concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions required to service all of the threads of the warp

            >**NOTE**. In these devices, L1-caching is the default
        * *Compute capability 3.5, 3.7, and 5.2*. If L1-caching is enabled
            $\to$ The number of required transactions is equal to the number of required 128-byte aligned segments

    >**NOTE**. Choosing sensible block sizes, e.g. a multiple of the warp size, facilitates properly aligned memory accesses by warps
* *Misaligned data access*. Misaligned data accesses can reduce the bandwidth of the data transfer
* *Global memory layout*. Arrays allocated in device memory are aligned to 256-byte memory segments by the CUDA driver for efficiency
    * *Explain*. The pointers allocated by any of the CUDA runtime's device memory allocation functions, e.g. `cudaMalloc` or `cudaMallocPitch`, are guaranteed to be 256-byte aligned, i.e. the address is a multiple of 256

        >**NOTE**. This alignment may differ across different devices
* *Global memory transactions*. The device can access global memory via 32-, 64-, or 128-byte transactions which are aligned to their size
* *Strided memory access*. Large-stride memory access may lead to poor effective bandwidth
    * *Explain*. When concurrent threads simultaneously access memory addresses which are far apart in physical memory
        $\to$ There is no chance for the hardware to combine the accesses
    * *Consequence*. When accessing multidimensional arrays, it is often necessary for threads to index the higher dimensions of the array
        $\to$ Strided access is simply unavoidable
    * *Solution*. Use shared memory, which is an on-chip memory shared by all threads in a thread block
        * *Example*. Use shared memory to extract a 2D tile of a multidimensional array from global memory in a coalesced fashion into shared memory
            $\to$ Then have contiguous threads stride through the shared memory tile (see the sketch below)
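A minimal sketch of this tiling idea, shown as a matrix transpose and assuming a square row-major `width × width` matrix whose dimensions are multiples of the tile size. Both global reads and global writes are coalesced; the strided access pattern only occurs in fast on-chip shared memory:

```c++
#define TILE_DIM 32

// Launch with dim3 block(TILE_DIM, TILE_DIM) and dim3 grid(width/TILE_DIM, width/TILE_DIM)
__global__ void transposeTiled(float *out, const float *in, int width) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column avoids shared memory bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load from global memory

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;              // swap block offsets for the transpose
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store to global memory
}
```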
## Execution configuration optimizations
### Occupancy
**Occupancy**. The ratio of the number of active warps per multiprocessor to the maximum number of possible active warps
$\to$ This is the percentage of the hardware's ability to process warps which is actively in use
* *Query for the maximum number of possible active warps*. `deviceQuery`
* *Values of occupancy*.
    * *High occupancy*. Does not always equate to higher performance, i.e. there is a point above which additional occupancy does not improve performance
    * *Low occupancy*. Always interferes with the ability to hide memory latency, resulting in performance degradation

    >**NOTE**. A kernel running at low occupancy has more registers available per thread than one at higher occupancy, which may result in less spilling to local memory

**Maximum block size**. The per-thread resources required by a CUDA kernel might limit the maximum block size in an unwanted way
$\to$ To maintain forward compatibility to future hardware and toolkits, and to ensure that at least one thread block can run on an SM, we should include `__launch_bounds__(maxThreadsPerBlock)` (see the sketch below)
* **`__launch_bounds__(maxThreadsPerBlock)`**. Specify the largest block size which the kernel will be launched with
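A minimal sketch, using a trivial hypothetical kernel, of specifying launch bounds and checking the resulting occupancy with the CUDA occupancy API:

```c++
#include <cstdio>
#include <cuda_runtime.h>

// Promise the compiler that this kernel is never launched with more than
// 256 threads per block, so it can budget registers accordingly
__global__ void __launch_bounds__(256) scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int blockSize = 256;
    int maxBlocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel can be resident per SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, scale, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = maxBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```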
### Tricks
**Hiding register dependencies**. To hide latency arising from register dependencies, maintain a sufficient number of active threads per multiprocessor, i.e. sufficient occupancy
* *Register dependencies*. Arise when an instruction uses a result stored in a register written by a preceding instruction
* *Consequence*. The latency of most arithmetic instructions is typically 4 cycles on devices of compute capability 7.0
    $\to$ Threads must wait approximately 4 cycles before using an arithmetic result
* *Solution*. Hide the latency caused by register dependencies with the execution of threads in other warps

**Thread and block heuristics**. The number of threads per block should be a multiple of 32 threads, since this provides optimal computing efficiency and facilitates coalescing
* *Motivation*. Latency hiding and occupancy depend on the number of active warps per multiprocessor, which is implicitly determined by the execution parameters along with the resources, i.e. registers and shared memory
    $\to$ Choosing execution parameters is a matter of striking a balance between latency hiding and resource utilization
* *Number of blocks per grid*. The primary concern is keeping the entire GPU busy
    $\to$ The number of blocks in a grid should be larger than the number of SMs
    * *Explain*.
        * All SMs have at least one block to execute
        * There should be multiple active blocks per SM
            $\to$ Blocks which are not waiting for a `__syncthreads()` can keep the hardware busy

    >**NOTE**. To scale to future devices, the grid size per kernel launch should be in the thousands
* *Block size*. Multiple concurrent blocks can reside on an SM
    $\to$ Occupancy is not determined by block size alone

    >**NOTE**. A larger block size does not imply a higher occupancy
* *Rules of thumb*.
    * Threads per block should be a multiple of the warp size to avoid wasting computation on under-populated warps, and to facilitate coalescing
    * A minimum of 64 threads per block should be used, and only if there are multiple concurrent blocks per SM
    * Between 128 and 256 threads per block is a good initial range for experimentation with different block sizes
    * Use several smaller thread blocks, rather than one large thread block per SM, if latency affects performance

    >**NOTE**. This is particularly beneficial to kernels which frequently call `__syncthreads()`

    >**NOTE**. When a thread block allocates more registers than are available on an SM
    >$\to$ The kernel launch fails

**Effects of shared memory**.
* *Usages of shared memory*.
    * Help to coalesce or eliminate redundant accesses to global memory
    * Act as a constraint on occupancy, i.e. increasing the shared memory per block will effectively reduce the occupancy of the kernel
* *Determining the sensitivity of performance to occupancy*. Through experimentation with the amount of dynamically allocated shared memory

**Concurrent kernel execution**. CUDA streams can be used to
* Overlap kernel execution with data transfers
* Execute multiple kernels simultaneously to more fully take advantage of the device's SMs (on devices with concurrent kernel execution ability only)
    $\to$ We can query this capability of the device from the `concurrentKernels` field of the `cudaDeviceProp` struct

**Multiple contexts**.
* *CUDA context*. CUDA work occurs within a process space for a particular GPU
    * *Usage*. Encapsulate kernel launches and memory allocations for the corresponding GPU, as well as supporting constructs, e.g. page tables

    >**NOTE**. The context is explicit in the CUDA Driver API, but is entirely implicit in the CUDA runtime API, which creates and manages contexts automatically
* *Multiple contexts*. If multiple CUDA application processes access the same GPU concurrently, this almost always implies multiple contexts
    * *Explain*. A context is tied to a particular host process, unless the Multi-Process Service is in use
    * *Drawback*. Only one of the contexts can execute work at any given moment on a GPU
        $\to$ Contexts sharing the same GPU are time-sliced
    * *Consequence*. It is best to avoid multiple contexts per GPU within the same CUDA application

## Advanced technologies by NVIDIA
**Tensor cores**. Each tensor core is a programmable compute unit specialized for accelerating machine learning workloads
* *Functionality*. Each tensor core can complete a single $4\times 4$ matrix multiply-and-accumulate (MACC) each clock cycle (see the sketch below)
    * *Formal*. $D = A \times B + C$, where $A, B, C$ are $4\times 4$ matrices
* *References*.
    * *Modeling Deep Learning Accelerator Enabled GPUs* by Md Aamir Raihan, Negar Goli, and Tor M. Aamodt
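Tensor cores are not programmed at the per-core $4\times 4$ granularity; CUDA exposes them through the warp-level `wmma` API, which operates on larger tiles. A minimal sketch, assuming $16\times 16$ half-precision inputs with a float accumulator and a device of compute capability 7.0 or higher:

```c++
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of D = A * B + C (here C is initialized to zero)
// a, b: 16x16 half-precision input matrices; c: 16x16 float result
__global__ void wmmaTile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::fill_fragment(accFrag, 0.0f);
    wmma::load_matrix_sync(aFrag, a, 16);    // leading dimension of A is 16
    wmma::load_matrix_sync(bFrag, b, 16);    // leading dimension of B is 16
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
    wmma::store_matrix_sync(c, accFrag, 16, wmma::mem_row_major);
}
```

Launch with a single warp, e.g. `wmmaTile<<<1, 32>>>(dA, dB, dC)`, and compile for `sm_70` or newer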
**Deep learning accelerator (DLA)**.

## Concepts
**Ownership models for C++ pointers**.
* *Exclusive ownership model*. Two pointers of the same type cannot point to the same resource at the same time
* *Reference counting ownership model*. There is a reference count for the contained pointer, shared by all copies of the `shared_ptr`
    $\to$ Each time a new pointer points to the resource, the counter increases

**Virtual function**. A member function declared within a base class and re-defined, i.e. overridden, by a derived class
$\to$ When we refer to a derived class object using a pointer or a reference to the base class, we can call a virtual function for that object and execute the derived class' version of the function

**CMake**. A cross-platform, free and open-source software tool for managing the build process of software using a compiler-independent method

## External resources
**CUDA**.
* [CUDA documentation](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79)
* [CUDA streams and concurrency](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf)
* [CUDA memory transfer optimization](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/)

**TensorRT**.
* [TensorRT quick guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html)
* [TensorRT API documentation](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#a16b3aeb1359f8d53890e8a96bf6637a8)

**Misc**.
* Nsight Systems to profile CUDA code