Advanced GPGPU Programming

TODO \: Github Link

> ***Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing***
> * https://www.packtpub.com/product/learn-cudaprogramming/9781788996242
>
> ***Programming Massively Parallel Processors: A Hands-on Approach***
> * https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311
>
> ***CUDA Programming: A Developer's Guide to Parallel Computing with GPUs***
> * https://www.amazon.com/CUDA-Programming-Developers-Computing-Applications/dp/0124159338

*This note continues from the books used in CUDA Programming \(https://hackmd.io/QhzlGKAvQcGKLqBcqd2GBg\). Additionally, I include other books published in different years, when the CUDA architecture had different features and considerations. Reading books from different periods on the same topic helps in understanding how the CUDA architecture has changed over time and in predicting where it may focus next.*

:::info
:information_source: **顯示卡是什麼，型號怎麼看？顯卡的規格等級介紹** (What is a graphics card and how do you read its model number? An introduction to GPU specification tiers)
https://wattbrother.com/112740#2
:::

# Advanced CUDA Topics
*In this section, topics are picked and combined from different books.*

## Topic \#1 \: CUDA Memory Management
![image](https://hackmd.io/_uploads/HkUj8iUBC.png)

Reference \: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/#memory_hierarchy

### Memory Access Time Table
*Notice that this table is old \(from 2014\), but the memory hierarchy has not changed.*

| Memory type | Registers | Shared Memory | Texture Memory | Constant Memory | Global Memory |
| ----------- | --------- | ------------- | -------------- | --------------- | ------------- |
| Bandwidth | ~8TB/s | ~1.5TB/s | ~200MB/s | ~200MB/s | ~200MB/s |
| Latency | 1 cycle | 1~32 cycles | 400~600 cycles | 400~600 cycles | 400~600 cycles |

:::success
In *CUDA Programming: A Developer's Guide to Parallel Computing with GPUs*, the author compares CPU hyper-threading \(two threads per core\) with the relationship between an SM and its SPs.
:::

***Correlation between the maximum number of registers used and the number of blocks that can be scheduled on an SM***. Notice that the more blocks we can schedule on an SM, the higher the GPU utilization is. This can also be calculated with the **occupancy calculator** provided by NVIDIA.

![image](https://hackmd.io/_uploads/B1Ukd38BC.png)

### GPU Memory Evolution
> *Learn CUDA page 81*

![image](https://hackmd.io/_uploads/rJtXh2USA.png)

The diagram above shows that the amount of cache has increased over time. Performance also benefits from the unification of the L1 cache and shared memory, which makes the L1 cache as fast as shared memory. To sum up, if the L1 cache is large enough to capture the memory locality, using shared memory might not improve performance.

## Topic \#2 \: CUDA Thread Programming
### CUDA Occupancy
![image](https://hackmd.io/_uploads/H1O4-rnL0.png =x80)

> In general, higher occupancy leads to more effective GPU utilization because more warps are available to hide the latency of stalled warps. However, it might also degrade performance due to the increased resource contention between the CUDA threads.

* **Theoretical Occupancy** \: Can be calculated with the Excel sheet provided by NVIDIA.
* **Achieved Occupancy** \: Can only be determined by profiling. The difference arises because the theoretical value does not consider instruction dependencies or memory bandwidth limitations.

:::success
:bulb: **`--resource-usage`**
```clike
ptxas info : 0 bytes gmem
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_ZN5fCUDA14dGPUTestKernelEPj' for 'sm_52'
ptxas info : Function properties for _ZN5fCUDA14dGPUTestKernelEPj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 328 bytes cmem[0]
```
:::

### CUDA Occupancy Calculator
*TODO*

### Atomic Operation
:::info
:bulb: **NVIDIA Official CUDA C Programming Document**
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomic-functions
:::

A *histogram* example that reduces the scope of atomic contention:
* HackMD https://hackmd.io/@Erebustsai/Byul7e-Up
* Github https://github.com/Chen-KaiTsai/GPGPU_OpenCL

### Low\/Mixed Precision
*Tutorials on this topic were still rare when this post was written \(2024/06/29\), so I am collecting the articles I found online and will revisit the topic later.*

:::info
:bulb: NVIDIA Technical Blog
https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8
> Update, March 25, 2019: The latest Volta and Turing GPUs now incorporate ***Tensor Cores***, which accelerate certain types of FP16 matrix math. This enables faster and easier mixed-precision computation within popular AI frameworks. Making use of ***Tensor Cores*** requires using CUDA 9 or later. NVIDIA has also added automatic mixed precision capabilities to TensorFlow, PyTorch, and MXNet. Interested in learning more or trying it out for yourself? Get tensor core optimized examples for popular AI frameworks here.
:::

> Supporting low-precision computing is possible in specific GPUs, and the precision varies depending on the GPU chipsets. To be specific, GP102 (Tesla P40 and Titan X), GP104 (Tesla P4), and GP106 support INT8; and GP100 (Tesla P100) and GV100 (Tesla V100) support FP16 (half-precision) operations. The Tesla GV100 is compatible with INT8 operation and has no performance degradation.

:::info
:bulb: **深入理解混合精度訓練：從 Tensor Core 到 CUDA 程式設計** (Understanding Mixed-Precision Training in Depth: From Tensor Core to CUDA Programming)
https://aijishu.com/a/1060000000286803
:::

:::info
:bulb: **Programming Tensor Cores in CUDA 9**
https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
> During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores. CUDA exposes these operations as warp-level matrix operations in the CUDA C++ WMMA API.
:::

## Topic \#3 \: CUDA Optimization Considerations
### CUDA Profiling \& Debugging
:::info
:information_source: **Efficient CUDA Debugging\: Memory Initialization and Thread Synchronization with NVIDIA Compute Sanitizer**
https://resources.nvidia.com/en-us-nsight-developer-tools/cuda-debugging-memory-initialization?lx=P1ZhhI&topic=hpc&linkId=100000290009338
:::

### A Checklist of Optimizations
> Reference \:
>
> ***Programming Massively Parallel Processors: A Hands-on Approach***
> * https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311

| Optimization | Benefit to compute cores | Benefit to memory | Strategies |
| --- | --- | --- | --- |
| Maximizing occupancy | More work to hide pipeline latency | More parallel memory accesses to hide DRAM latency | Tuning usage of SM resources such as threads per block, shared memory per block and registers per thread |
| Enabling coalesced global memory accesses | Fewer pipeline stalls waiting for global memory accesses | Less global memory traffic and better utilization of bursts\/cache lines | Transferring between global memory and shared memory in a coalesced manner and performing uncoalesced accesses in shared memory \(e.g., corner turning\). Rearranging the mapping of threads to data. Rearranging the layout of the data |
| Minimizing control divergence | High SIMD efficiency \(fewer idle cores during SIMD execution\) | - | Rearranging the mapping of threads to work and\/or data. Rearranging the layout of the data |
| Tiling of reused data | Fewer pipeline stalls waiting for global memory accesses | Less global memory traffic | Placing data that is reused within a block in shared memory or registers so that it is transferred between global memory and the SM only once |
| Privatization | Fewer pipeline stalls waiting for atomic updates | Less contention and serialization of atomic updates | Applying partial updates to a private copy of the data and then updating the universal copy when done |
| Thread coarsening | Less redundant work, divergence or synchronization | Less redundant global memory traffic | Assigning multiple units of parallelism to each thread to reduce the price of parallelism when it is incurred unnecessarily |

## Topic \#4 \: Multi-GPU Programming
*This topic has been moved to a dedicated post. Please refer to https://hackmd.io/@Erebustsai/SyfClHYCa*

## Topic \#5 \: CUDA Support Vector Data Types
:::info
:information_source: Official Data Source
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-vector-types

To use arithmetic operations on vector data types, `helper_math.h` from the `Common` directory of the CUDA samples is required. Please refer to the following source code for the included functions.
https://github.com/NVIDIA/cuda-samples/blob/master/Common/helper_math.h
:::

## Topic \#6 \: CUDA Tensor Core \& NVLink
> Reference \:
> * 輝達 GPU 詳解 (NVIDIA GPU Explained)
> https://github.com/chenzomi12/AIFoundation/tree/main/01AIChip/04NVIDIA
> * Programming Tensor Cores in CUDA 9
> https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/

Notice that in CUDA 9 \(released in 2017\), the best way to utilize Tensor Cores is through the cuBLAS or cuDNN libraries.
> Reference \: https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/

![image](https://hackmd.io/_uploads/S1v2X3ep0.png)
*A warp performs D=A\*B\+C where A\, B\, C and D are 16x16 matrices*

Notice that in the diagram above, the two 16x16 input matrices are of type *FP16*, while the result can be *FP16* or *FP32*. Therefore, this is considered ***mixed-precision***.

### Programming Interface
> Reference \:
> * WMMA - What does “warp matrix operations” mean?
> https://forums.developer.nvidia.com/t/wmma-what-does-warp-matrix-operations-mean/229732
> * Code Example for the official post in the above reference
> https://github.com/NVIDIA-developer-blog/code-samples/blob/master/posts/tensor-cores/simpleTensorCoreGEMM.cu

![image](https://hackmd.io/_uploads/SJci5nl6C.png =x300)

Notice that synchronization happens before and after the MMA operation. This is because all threads in the warp participate in calculating the MMA. The first post in the reference above describes the details.

:::success
:bulb: In the reference above, *輝達 GPU 詳解* (NVIDIA GPU Explained), the author provides a detailed description of the Tensor Core in different GPU architectures and how it has evolved.
:::

# Advanced OpenCL Topics
> ***OpenCL Parallel Programming Development Cookbook***
> * https://www.amazon.com/OpenCL-Parallel-Programming-Development-Cookbook/dp/1849694524

## Understanding OpenCL Data Types
**Scalar data types \:** `bool, char, short, int, long, uchar, ushort, uint, ulong, float, half, double`.

* `half` \: 16-bit floating-point \(*IEEE 754-2008*\)
* `bool` \: Expands to 1 or 0
* `size_t` \: 32-bit or 64-bit unsigned integer
* `ptrdiff_t` \: 32-bit or 64-bit signed integer representing the result of subtracting two pointers
* `intptr_t` \:
    > A signed integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer.
* `uintptr_t` \:
    > An unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer.

| Type in OpenCL | Type in application |
| ------------------------ | ------------------- |
| `bool` | `n/a` |
| `char` | `cl_char` |
| `unsigned char, uchar` | `cl_uchar` |
| `short` | `cl_short` |
| `unsigned short, ushort` | `cl_ushort` |
| `int` | `cl_int` |
| `unsigned int, uint` | `cl_uint` |
| `long` | `cl_long` |
| `unsigned long, ulong` | `cl_ulong` |
| `float` | `cl_float` |
| `double` | `cl_double` |
| `half` | `cl_half` |
| `size_t` | `n/a` |
| `ptrdiff_t` | `n/a` |
| `intptr_t` | `n/a` |
| `uintptr_t` | `n/a` |
| `void` | `void` |

> OpenCL vector data types consist of multiples of the scalar integral and floating-point data types, and they are `char<N>, short<N>, int<N>, long<N>, uchar<N>, ushort<N>, uint<N>, ulong<N>`, and `float<N>` where `<N>` represents a value of 2, 3, 4, 8, or 16.

> Vectors have another remarkable property, and that is, you can access the individual components through indexes, that is to say if you wish to access each component of a `float4` vector, v, then you would do so via `v.x, v.y, v.z, v.w` respectively, and for larger vectors of 8 or 16 elements we would access those individual elements via `v.s0, v.s1` through to `v.s7`, and `v.s0, v.s1, v.sa` through to `v.sf` respectively. Hence, vectors of type `char2, uchar2, short2, ushort2, int2, uint2, long2, ulong2`, and `float2` can access their `.xy` elements.

## Using OpenCL Functions
:::success
:bulb: **opencl(十)----標量、向量類型的相關運算** (OpenCL \(10\): Operations on Scalar and Vector Types)
https://www.cnblogs.com/feihu-h/p/12092895.html
:::

# Other
*Other material I came across while working on CUDA.*

## CUDA Debug
> Reference \:
> * How to build and debug a simple CUDA application on Jetson Nano. Pt1.
> https://medium.com/@stevechange/how-to-build-and-debug-a-simple-cuda-application-on-jetson-nano-pt1-7bc04444eb79
> * Unable to debug simple CUDA program: CUDBG_ERROR_INITIALIZATION_FAILURE
> https://forums.developer.nvidia.com/t/unable-to-debug-simple-cuda-program-cudbg-error-initialization-failure/222599/3
> * CUDA 番外篇 | Visual Studio Code的CUDA環境 (CUDA Extra: The CUDA Environment in Visual Studio Code)
> https://zhuanlan.zhihu.com/p/508810115

## CUDA for Tegra Programming
```
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 1980 MBytes (2075893760 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
```

### System Diagram
![image](https://hackmd.io/_uploads/HyJYkfNWA.png)

### Memory Selection & Consideration
* > Reference \: CUDA最佳化的冷知識10 | GPU卡和Jetson上視訊記憶體最佳化的特色 (Little-Known Facts of CUDA Optimization 10: Characteristics of Video Memory Optimization on GPU Cards and Jetson)
  https://www.zhihu.com/tardis/zm/art/341602408?source_id=1003

## Linear algebra operation using cuBLAS
:::info
:information_source: **iomanip \(IO Manipulation\)**
https://yuihuang.com/iomanip/
:::

***cuBLAS Host API***
* Level 1 \(vector-vector\)
* Level 2 \(matrix-vector\)
* Level 3 \(matrix-matrix\)

*Single Precision Floating Matrix Multiplication \(SGEMM\)*

Parameters of the following code \:
* matrix `A` has N rows \(nrow, the column size\) and S columns \(ncol, the row size\)
* matrix `B` has S rows and M columns
* matrix `C` has N rows and M columns

*Please refer to the following URL for `cublasSgemm` function parameters.*
> https://zhuanlan.zhihu.com/p/466939638

:::warning
:warning: **Linear buffer**
cuBLAS and many other libraries use a linear buffer for 1D or 2D matrices. We can use the 2D malloc function, which helps us create a dynamic 2D matrix with a single `malloc`.
The details are in the other post **2D Array Dynamic Memory Allocation** https://hackmd.io/knYTeHgvRcyw3tqIdOTRhA
:::

```cpp
// compile with
// nvcc -o cublas main.cpp -lcublas
// getManagedMatrix() and printMatrix() are helper functions that are not shown here.
cublasHandle_t handle;
int N = 8, M = 4, S = 5;
float alpha = 0.1f, beta = 0.0f;

if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
    std::cout << "CUBLAS initialization failed\n\n";
    exit(EXIT_FAILURE);
}

float* A = getManagedMatrix(N, S);
float* B = getManagedMatrix(S, M);
float* C = getManagedMatrix(N, M);

// C (N x M) = alpha * A (N x S) * B (S x M) + beta * C, stored column-major
cublasSgemm(
    handle, CUBLAS_OP_N, CUBLAS_OP_N,
    N, M, S,
    &alpha,
    A, N,   // lda: leading dimension (rows) of A
    B, S,   // ldb: leading dimension (rows) of B
    &beta,
    C, N    // ldc: leading dimension (rows) of C
);

// Wait for the GEMM to finish before reading the results on the host
cudaDeviceSynchronize();

printMatrix(A, N, S);
printMatrix(B, S, M);
printMatrix(C, N, M);

cublasDestroy(handle);
cudaFree(A);
cudaFree(B);
cudaFree(C);
```

***Multi-GPU Operation with cuBLAS***
TODO \: Switch to EPYC Server

***Mixed-precision operation using cuBLAS***
TODO \: Switch to EPYC Server

:::info
:bulb: **cuRAND**
*cuRAND Host API* generates random numbers using host code only, and the results can be used directly by other kernel functions.
*cuRAND Device API* generates random numbers inside a kernel, so CUDA threads can have their own randomly generated numbers during execution.
:::

## Thrust
> Reference \:
> https://docs.nvidia.com/cuda/thrust/index.html#transformations

The CUDA Thrust library is useful when development needs to be fast and only simple ***vector-based*** algorithms are used. Using Thrust for matrix operations is no better than using cuBLAS. The syntax is similar to the C++ STL, which makes the Thrust library easy to use. However, STL functions run natively on the CPU, whereas Thrust functions run on the GPU and require data transfers, which introduce overhead.

### Features
* `thrust::transform` with `<thrust/functional.h>` or custom functors
* `thrust::reduce`
* `<thrust/scan.h>`
* `<thrust/sort.h>`

### Sorting Implementation
:::info
:bulb: **CUDA Thrust Sorting Implementation**
The following forum thread discusses the implementation of Thrust sort.
https://stackoverflow.com/questions/9037906/fast-cuda-thrust-custom-comparison-operator
:::

**Performance on CPU E5-2696v3**
In the following test, the program runs on only one socket, on the `float` data type with `DATASIZE = 1024 * 1024 * 1024`.
```
Hardware Thread Number : 36
------------------------------
Single Thread
------------------------------
Time: 187.271000
------------------------------
Multi Thread / SIMD
------------------------------
Time: 88.364000
```

**Performance on GPU Tesla P40**
```
------------------------------
Transfer data to GPU
------------------------------
Time: 1.226000
------------------------------
GPU Kernel
------------------------------
Time: 2.312000
------------------------------
Transfer data from GPU
------------------------------
Time: 1.240000
```

## Deep Learning with CUDA
*The goal of this section is to understand the cuBLAS and cuDNN libraries in order to improve CUDA_AI_Framework.*

:::info
:bulb: **Installing cudnn in WSL2**
WSL2 安裝 CUDA Toolkit、cuDNN (Installing the CUDA Toolkit and cuDNN on WSL2)
https://hackmd.io/@Kailyn/HkSTXL9xK
:::

The following reference is the code repository provided with the book *Learn CUDA Programming*. Its code has better functionality than my own work, *CUDA_AI_Framework*, so in this section I will only present the concepts of the neural network computation.

Reference \: https://github.com/PacktPublishing/Learn-CUDA-Programming/tree/master/Chapter10/10_deep_learning

### Fully Connected Neural Network
Similar to https://hackmd.io/@Erebustsai/rki6tRAL2, the forward pass can be computed as the matrix multiplication of the layer input and the transpose of the weight matrix. This can be implemented with ***cuBLAS***.
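The forward pass described above can be sketched on the CPU as a plain reference implementation, which is handy for checking a GPU result. This is a minimal sketch under assumptions: the function name `fcForward`, its parameters, and the row-major layout are illustrative and are not taken from the book or from CUDA_AI_Framework.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a fully connected forward pass: Y = X * W^T.
// X is (batch x inF), W is (outF x inF), Y is (batch x outF), all row-major.
// fcForward and its parameter names are hypothetical, chosen for illustration.
std::vector<float> fcForward(const std::vector<float>& x,
                             const std::vector<float>& w,
                             int batch, int inF, int outF)
{
    assert(x.size() == static_cast<std::size_t>(batch) * inF);
    assert(w.size() == static_cast<std::size_t>(outF) * inF);

    std::vector<float> y(static_cast<std::size_t>(batch) * outF, 0.0f);
    for (int b = 0; b < batch; ++b) {
        for (int o = 0; o < outF; ++o) {
            // Dot product of input row b with weight row o (i.e. column o of W^T)
            float acc = 0.0f;
            for (int i = 0; i < inF; ++i) {
                acc += x[b * inF + i] * w[o * inF + i];
            }
            y[b * outF + o] = acc;
        }
    }
    return y;
}
```

On the GPU the same product would typically map onto a single `cublasSgemm` call with one operand marked `CUBLAS_OP_T`; because cuBLAS assumes column-major storage, the operand order and leading dimensions have to be chosen with care.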
### Activation Layer
:::info
:bulb: **cudnn使用的例子** (An Example of Using cuDNN)
https://blog.csdn.net/qq_44632658/article/details/129534944
:::

### Convolution Layer
Reference \: https://github.com/PacktPublishing/Learn-CUDA-Programming/tree/master

![image](https://hackmd.io/_uploads/B1je5Th-0.png)

### AI_Framework Improvement with cuDNN
:::info
:bulb: **Tutorial for using cuDNN**
* **cuDNN Basic** https://www.zhihu.com/column/c_1476248814192988160
* **Example of cuDNN softmax usage** https://gist.github.com/hans/d21fa21c04904d0993c8
* **Convolutions with cuDNN** https://www.goldsborough.me/cuda/ml/cudnn/c++/2017/10/01/14-37-23-convolutions_with_cudnn/
* **Convolutions with cuDNN Github** https://gist.github.com/goldsborough/865e6717e64fbae75cdaf6c9914a130d
* **使用CuDNN進行摺積運算** (Performing Convolution with cuDNN) https://blog.csdn.net/ice__snow/article/details/79699388
:::

> Reference \:
> * CUDA Programming: A Developer's Guide to Parallel Computing with GPUs https://www.amazon.com/CUDA-Programming-Developers-Computing-Applications/dp/0124159338
>

:::info
:bulb: **CUDA_AI_Framework Github Repository**
https://github.com/Chen-KaiTsai/CUDA_AI_Framework
:::

Conv2d Implementation with cuDNN
```cpp
#ifdef USE_CUDNN_CONVOLUTION
void framework::conv2d::run()
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnDataType_t dtype = CUDNN_DATA_FLOAT;
    cudnnTensorFormat_t format = CUDNN_TENSOR_NCHW;

    // Descriptors for the input (previous layer) and output tensors
    cudnnTensorDescriptor_t prvMEM_desc;
    cudnnTensorDescriptor_t gMEM_desc;
    cudnnCreateTensorDescriptor(&prvMEM_desc);
    cudnnCreateTensorDescriptor(&gMEM_desc);
    cudnnSetTensor4dDescriptor(prvMEM_desc, format, dtype, prvShape.N, prvShape.C, prvShape.H, prvShape.W);
    cudnnSetTensor4dDescriptor(gMEM_desc, format, dtype, shape.N, shape.C, shape.H, shape.W);

    // Filter descriptor: (output channels, input channels, kernel height, kernel width)
    cudnnFilterDescriptor_t W_desc;
    cudnnCreateFilterDescriptor(&W_desc);
    cudnnSetFilter4dDescriptor(W_desc, dtype, format, shape.C, prvShape.C, kSize, kSize);

    cudnnConvolutionDescriptor_t conv2d_desc;
    cudnnConvolutionMode_t mode = CUDNN_CROSS_CORRELATION;
    cudnnCreateConvolutionDescriptor(&conv2d_desc);
    cudnnSetConvolution2dDescriptor(conv2d_desc, padSize, padSize, stride, stride, 1, 1, mode, dtype);

    cudnnConvolutionFwdAlgo_t algo;
#if CUDNN_MAJOR >= 8
    // The _v7 query returns candidates sorted by expected performance;
    // returnedAlgoCount must be a valid pointer.
    int returnedAlgoCount = 0;
    cudnnConvolutionFwdAlgoPerf_t algos[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
    cudnnGetConvolutionForwardAlgorithm_v7(handle, prvMEM_desc, W_desc, conv2d_desc, gMEM_desc, CUDNN_CONVOLUTION_FWD_ALGO_COUNT, &returnedAlgoCount, algos);
    algo = algos[0].algo;
#else
    cudnnGetConvolutionForwardAlgorithm(handle, prvMEM_desc, W_desc, conv2d_desc, gMEM_desc, CUDNN_CONVOLUTION_FWD_PREFER_FASTEST, 0, &algo);
#endif

    size_t workspaceBytes = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, prvMEM_desc, W_desc, conv2d_desc, gMEM_desc, algo, &workspaceBytes);
#ifdef __DEBUG__
    printf("CUDNN info from %s\nWorkspace size: %zuMB\n", name.c_str(), (workspaceBytes / 1048576));
#endif

    cudaError_t error_id;
    void* workspaceMEM = nullptr;
    error_id = cudaMalloc(&workspaceMEM, workspaceBytes);
    if (error_id != cudaSuccess) {
        printf("Error %s cudaMalloc() : %d\n%s\n\n", name.c_str(), static_cast<int>(error_id), cudaGetErrorString(error_id));
        exit(EXIT_FAILURE);
    }

    const float alpha = 1.0f, beta = 0.0f;
    float* prvMEM = prvLayer->getGPUMem();

    if (useBias) {
#ifdef __DEBUG__
        printf("Using bias has not been implemented yet. Falling back to the non-cuDNN implementation\n");
#endif
        int sharedMem = kSize * kSize * prvShape.C * dimBlock.z * sizeof(float);
        Conv2D_shared<<<dimGrid, dimBlock, sharedMem>>>(prvShape.H, prvShape.W, prvShape.C, shape.H, shape.W, shape.C, shape.N, stride, kSize, padSize, prvMEM, wMEM, bMEM, gMEM);
    }
    else {
        cudnnConvolutionForward(handle, &alpha, prvMEM_desc, prvMEM, W_desc, wMEM, conv2d_desc, algo, workspaceMEM, workspaceBytes, &beta, gMEM_desc, gMEM);
    }

    // Release the workspace, descriptors and handle
    cudaFree(workspaceMEM);
    cudnnDestroyConvolutionDescriptor(conv2d_desc);
    cudnnDestroyFilterDescriptor(W_desc);
    cudnnDestroyTensorDescriptor(gMEM_desc);
    cudnnDestroyTensorDescriptor(prvMEM_desc);
    cudnnDestroy(handle);
}
#else
```

However, the performance ***is not improved***. Further examination is required and might be covered in later updates.
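The output tensor descriptor passed to cuDNN has to agree with the spatial size implied by the kernel size, padding and stride. As a small sanity check, the usual output-size relation can be sketched as follows. The helper name `convOutDim` is hypothetical and not part of cuDNN; it follows the relation `out = (in + 2 * pad - ((k - 1) * dilation + 1)) / stride + 1`.

```cpp
#include <cassert>

// Output extent of a convolution along one spatial dimension.
// convOutDim is an illustrative helper, not a cuDNN function.
int convOutDim(int in, int kSize, int padSize, int stride, int dilation = 1)
{
    assert(in > 0 && kSize > 0 && padSize >= 0 && stride > 0 && dilation > 0);
    // Size of the kernel footprint once dilation is taken into account
    int effectiveK = (kSize - 1) * dilation + 1;
    assert(in + 2 * padSize >= effectiveK);
    // Integer division floors the result, as expected for positive operands
    return (in + 2 * padSize - effectiveK) / stride + 1;
}
```

For example, a 3x3 kernel with padding 1 and stride 1 keeps a 32-pixel dimension at 32, which is the value `shape.H` or `shape.W` would be expected to hold for such a layer.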
## `helper_cuda.h` \& `helper_math.h`
> Reference \:
> * CUDA Samples大賞之helper_cuda.h檔案 (CUDA Samples Showcase: The helper_cuda.h File)
> https://zhuanlan.zhihu.com/p/460926521
> * CUDA Samples大賞之helper_math.h檔案 (CUDA Samples Showcase: The helper_math.h File)
> https://zhuanlan.zhihu.com/p/461461510
> * CUDA Samples Github Repo
> https://github.com/NVIDIA/cuda-samples

# GPGPU Appendix Document
## Nvidia Driver
> Reference \:
> * 淺談Linux的Nvidia閉源驅動問題，以及nvidia-open、Nouveau、NVK驅動的選擇 (A Brief Look at NVIDIA's Closed-Source Driver Issues on Linux and Choosing Among the nvidia-open, Nouveau, and NVK Drivers)
> https://ivonblog.com/posts/linux-nvidia-driver-issues/
> * NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules
> https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
> * open-gpu-kernel-modules Github Repository
> https://github.com/NVIDIA/open-gpu-kernel-modules

## How to Read Output of `nvidia-smi`
```clike
Wed May 22 22:06:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               Off | 00000000:46:00.0 Off |                  Off |
| 30%   41C    P8              19W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               Off | 00000000:81:00.0 Off |                  Off |
| 30%   39C    P8              22W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

* GPU ID
* Operation Mode
    * Persistence mode \(ON/OFF\) \[https://forums.developer.nvidia.com/t/what-does-off-mean-in-the-output-of-nvidia-smi/37509/3\]
    * Tesla Compute Cluster \(TCC\)\/Windows Display Driver Model \(WDDM\) mode
* Fan Speed
* GPU Temperature
* Performance Mode
* Power Usage and Capacity
* Bus-ID
* Memory Usage and Installed Memory
* Error-correcting code \(ECC\) error count
* GPU Utilization
* Compute Mode

:::info
:bulb: **Top 500 Super Computer**
https://top500.org/lists/top500/2024/06/
:::

## GPU Usage
When considering GPU usage, the following issues should be taken into account.
* **Synchronization** \: Too many atomic operations cause threads to stall and wait. Using atomic operations in too many threads per thread block seriously reduces performance.
* **Thread block dimensions** \: The limit is usually 1024 threads per block, but the best size should be tested for each CUDA kernel.
* **Resources used** \: The registers and shared memory used by a kernel reduce how many thread blocks can reside on an SM.
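The effect of the resource limits listed above can be sketched with a simplified model: the number of resident blocks on an SM is bounded by the thread limit, the block limit, the register file and the shared memory, and the tightest bound wins. This is only a rough sketch. It ignores allocation granularity (warps, register allocation units), so NVIDIA's occupancy calculator or a profiler may report different numbers. The names `SmLimits`, `residentBlocks` and `occupancyPercent` are hypothetical, and the example limits in the usage note are assumptions rather than values taken from the book.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM hardware limits (illustrative struct, to be filled in by the caller).
struct SmLimits {
    int maxThreads;   // maximum resident threads per SM
    int maxBlocks;    // maximum resident thread blocks per SM
    int registers;    // 32-bit registers available per SM
    int sharedBytes;  // shared memory available per SM, in bytes
};

// Number of thread blocks that fit on one SM under the simplified model.
int residentBlocks(const SmLimits& sm, int threadsPerBlock,
                   int regsPerThread, int sharedPerBlock)
{
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedPerBlock >= 0);
    int blocks = std::min(sm.maxBlocks, sm.maxThreads / threadsPerBlock);
    blocks = std::min(blocks, sm.registers / (regsPerThread * threadsPerBlock));
    if (sharedPerBlock > 0) {
        blocks = std::min(blocks, sm.sharedBytes / sharedPerBlock);
    }
    return blocks;
}

// Resident threads as a percentage of the SM's thread capacity.
int occupancyPercent(const SmLimits& sm, int threadsPerBlock, int blocks)
{
    return blocks * threadsPerBlock * 100 / sm.maxThreads;
}
```

With assumed limits of 2048 threads, 16 blocks, 65536 registers and 48 KB of shared memory per SM, a kernel launched with 256 threads per block and 64 registers per thread is register-bound at 4 resident blocks, which gives 50% occupancy in this model.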
> Reference \:
> * CUDA Programming: A Developer's Guide to Parallel Computing with GPUs https://www.amazon.com/CUDA-Programming-Developers-Computing-Applications/dp/0124159338

### Hardware Usage Chart
| Compute Capability\/ Thread Block Dimension| 1.0 | 1.1 | 1.2 | 1.3 | 2.0 | 2.1 | 3.0 |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
|64|67|67|50|50|33|33|50|
|96|100|100|75|75|50|50|75|
|128|100|100|100|100|67|67|100|
|192|100|100|94|94|100|100|94|
|256|100|100|100|100|100|100|100|
|384|100|100|75|75|100|100|94|
|512|67|67|100|100|100|100|100|
|768|N\/A|N\/A|N\/A|N\/A|100|100|75|
|1024|N\/A|N\/A|N\/A|N\/A|67|67|100|

### ThreadBlock per SM
| Compute Capability\/ Thread Block Dimension| 1.0 | 1.1 | 1.2 | 1.3 | 2.0 | 2.1 | 3.0 |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
|64|8|8|8|8|8|8|16|
|96|8|8|8|8|8|8|12|
|128|6|6|8|8|8|8|12|
|192|4|4|5|5|8|8|16|
|256|3|3|4|4|6|6|8|
|384|2|2|2|2|4|4|5|
|512|1|1|2|2|3|3|4|
|768|N\/A|N\/A|1|1|2|2|2|
|1024|N\/A|N\/A|1|1|1|1|2|

In this book, the author deduces that the GPU schedules thread blocks onto SMs in a round-robin fashion.

:::success
:bulb: **Why is max threads per sm larger than max threads per block?**
https://forums.developer.nvidia.com/t/why-is-max-threads-per-sm-larger-than-max-threads-per-block/277817
:::

## GPU General Programming Consideration
> Reference \: https://www.manning.com/books/parallel-and-high-performance-computing

### PCI Express Specification by Generation
![image](https://hackmd.io/_uploads/Sk2fV7npC.png =x250)

:::info
:bulb: **Overhead Rates**
> Reference \: https://www.manning.com/books/parallel-and-high-performance-computing

Transmitting data across the PCI bus requires additional overhead. Generation 1 and 2 standards stipulate that 10 bytes are transmitted for every 8 bytes of useful data. Starting with generation 3, the transfer transmits 130 bytes for every 128 bytes of data.
The overhead factor is the ratio of the number of usable bytes to the total number of bytes transmitted.
:::

## GPU SIMD or Vector Hardware
SIMD operations within the GPU's SIMT model might exist in some GPUs \(**mainly from GPU vendors other than Nvidia**\). Vector operations are exposed in OpenCL, as described in this post, but not in CUDA. A CUDA GPU can emulate these operations, but doing so might not increase performance. For SIMD-like operations in CUDA, the Tensor Core is the unit that can be used, through the WMMA header.

## Optimizing GPU Resource Usage
> Reference \: https://www.manning.com/books/parallel-and-high-performance-computing

**Resource Limitation**
![image](https://hackmd.io/_uploads/Bkjc_qCTC.png =x200)

## Undervolt GPU \(Only for Windows\)
> Reference \:
> * How To UNDERVOLT Your GPU - The Ultimate Easy Guide 2024 \(Nvidia GPU\)
> https://www.youtube.com/watch?v=KPR06CxysMw

Basically, applying a power cap is the easiest way to prevent the card from running too hot.
* On MSI RTX 2080 Ti Sea Hawk X \: This works and reduces the temperature.
* On Tesla P40 \: TODO
