---
slideOptions:
  transition: slide
---

# GPU

## Cuda intro

### GPU introduction

![](https://i.imgur.com/RJaOX7Q.png)

* memory: global --> PBSM (per-block shared memory) --> local registers
* multiple streaming multiprocessors (SMs): ![http://www.ece.lsu.edu/gp/refs/gf100-whitepaper.pdf](https://i.imgur.com/1vfxsXy.png)
  * vector machine, shared register file, programmable cache

### Cuda

* Kernel = Many Concurrent Threads ![](https://i.imgur.com/zx1Cx8y.png)
  * kernel = grid of thread blocks
  * Each thread executes the same code but **on different data based on its threadID**
* threads are grouped into thread blocks ![](https://i.imgur.com/pBZ2VJO.png)
  * blocks are independent of each other ![](https://i.imgur.com/TgO0qAl.png)
* Thread Level Scheduling - Warp
  * ![](https://i.imgur.com/5LjZZGJ.jpg)
  * Once a thread block is scheduled to an SM, threads in the thread block are further partitioned into warps.
  * A warp consists of 32 consecutive threads, and all threads in a warp are executed in SIMT, i.e. all threads execute the same instruction, each on its own private data
  * Threads in a warp will be executing the same instruction
  * Instructions are issued per warp

> Keep the number of threads per block a multiple of warp size (32).
> Avoid small block sizes: Start with at least 128 or 256 threads per block.
> Adjust block size up or down according to kernel resource requirements.

### cuda language

* `kernelFunc<<< nB, nT, nS, Sid >>>(...);`
  * `nB`: number of blocks per grid
  * `nT`: number of threads per block
  * `nS`: shared memory size (in bytes)
  * `Sid`: stream ID
* built-in variables: `threadIdx; blockIdx; blockDim; gridDim`
* eg1
  ```cuda
  dim3 grid(3,2);
  dim3 blk(5,3);
  my_kernel<<< grid, blk >>>();   // dim3 takes (x,y) or (x,y,z)
  ```
* eg2 (a fuller host-side version is sketched right after this list)
  ```cuda
  // Kernel definition
  __global__ void VecAdd(float* A)
  {
      int i = threadIdx.y * blockDim.x + threadIdx.x;   // linear index within the 2D block
      A[i] = A[i] + 1;
  }
  // ...
  int size = 10;
  dim3 blk(size, size);
  VecAdd<<< 1, blk >>>(A);
  ```
  * blockDim.x: number of threads per block in the x dimension
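* host-side usage (a minimal sketch added for completeness, not from the original notes): the host allocates device global memory with `cudaMalloc`, copies data over with `cudaMemcpy`, launches one 10x10 block as in eg2, and copies the result back; the array names and the printout are illustrative assumptions
  ```cuda
  #include <cuda_runtime.h>
  #include <stdio.h>

  // Same kernel as eg2 above.
  __global__ void VecAdd(float* A)
  {
      int i = threadIdx.y * blockDim.x + threadIdx.x;
      A[i] = A[i] + 1;
  }

  int main()
  {
      const int size = 10;
      const int n = size * size;
      float h_A[n];
      for (int i = 0; i < n; ++i) h_A[i] = (float)i;    // fill host array

      float *d_A;
      cudaMalloc((void**)&d_A, n * sizeof(float));      // allocate device global memory
      cudaMemcpy(d_A, h_A, n * sizeof(float), cudaMemcpyHostToDevice);

      dim3 blk(size, size);
      VecAdd<<<1, blk>>>(d_A);                          // one 10x10 thread block

      cudaMemcpy(h_A, d_A, n * sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d_A);

      printf("h_A[0] = %f\n", h_A[0]);                  // expect 1.000000
      return 0;
  }
  ```
  Build with nvcc, e.g. `nvcc vecadd.cu -o vecadd`.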
* function qualifiers:
  * `__device__`: executed on the device, callable from the device only
  * `__global__`: executed on the device, **callable from the host only (must have void return type!)**
* variable qualifiers:
  * `__device__`: in the device's global memory space (shared among SMs)
  * `__constant__`
  * `__shared__`: in the **thread block's** shared memory
* Program compilation
  * ![](https://i.imgur.com/KFZk7wH.png)
* Kernel Time Measurement
  ```cuda
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  kernel<<<block, thread>>>();
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float time;
  cudaEventElapsedTime(&time, start, stop);
  ```
* Dynamic Parallelism
  * launch new grids from the GPU
  * http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-New-Features-in-the-CUDA-Programming-Model.pdf

## GPU Architecture

### Thread execution

* execution model ![](https://i.imgur.com/WmxrPpk.png) ![](https://i.imgur.com/k0ijWsk.png) ![](https://i.imgur.com/lJLOxS2.png) ![](https://i.imgur.com/9W7DzDt.png)
* Hardware schedules thread blocks onto available SMs
* warp
  * At any time, only one warp is executed per SM
  * All threads in a warp execute the same instruction when selected
  * Threads in a warp execute physically in parallel
  * Warps and blocks execute logically in parallel ![](https://i.imgur.com/RdK7TaU.png)

### memory hierarchy

![](https://i.imgur.com/OxKdgHC.png)

* registers
  * consume zero extra clock cycles per instruction, except for
    * register memory bank conflicts
  * not indexable
    * **Array variables** are always allocated in **local memory**
  * Register spilling: local memory is used if the register limit is exceeded
  * limits (compute capability 6.1):
    * 65K registers per block
    * 255 registers per thread
* local memory
  * Usually used when one runs out of SM resources
  * "local" because each thread has its own private area
  * **Array variables** are always allocated in **local memory**
  * Stores are cached in L1
  * Not really a "memory" – the bytes are actually stored in global memory (DRAM) ![](https://i.imgur.com/uMnl1h2.png)
  * eg.
    ```cuda
    __device__ void distance(int m, int n, int *V) {
        int i, j, k;                  // registers
        int a[10], b[10], c[10];      // local memory, off-chip DRAM
        ...
    }
    ```
* shared memory
  * shared by all the **threads in a block**
  * Size: at most 48K per block (compute capability 6.1)
  * declared as `extern __shared__ <type> <name>[]`
  * dynamic allocation: size given by `nS` in `func<<<nB, nT, nS>>>(..);`
    ```cuda
    extern __shared__ int S[];        // dynamic shared memory, sized by nS at launch
    __global__ void FW_APSP(int k, int n, int *D)   // D is an n x n matrix, row-major
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        S[i*n + j] = D[i*n + j];      // move data to shared memory
        __syncthreads();
        // do computation
        if (S[i*n + j] > S[i*n + k] + S[k*n + j])
            D[i*n + j] = S[i*n + k] + S[k*n + j];
    }
    ```
  * architecture
    * Each bank can service one address per cycle
    * Shared memory is as fast as registers if there are no bank conflicts (a padding sketch appears at the end of this section)
    * Shared memory banks are organized such that successive 32-bit words are assigned to successive banks, and the bandwidth is 32 bits per bank per clock cycle.
    * For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16
    * https://stackoverflow.com/questions/7903566/how-is-2d-shared-memory-arranged-in-cuda ![](https://i.imgur.com/j0CG2L0.png)
* global memory
  * memory coalescing
    * Global memory access happens in transactions of 32 or 128 bytes
    * The hardware will try to reduce this to as few transactions as possible
    * Coalesced access (see the sketch below):
      > A group of 32 contiguous threads ("warp") accessing adjacent words.
      > Few transactions and high utilization.
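* coalesced vs. strided access (a sketch, not from the original notes; kernel and parameter names are made up for illustration):
  ```cuda
  // Coalesced: consecutive threads in a warp read consecutive 4-byte words,
  // so the warp's 32 accesses fall into one or two 32/128-byte transactions.
  __global__ void copy_coalesced(const float *in, float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i];
  }

  // Strided: consecutive threads read words `stride` elements apart, so the
  // warp's accesses spread over many segments and need many transactions.
  __global__ void copy_strided(const float *in, float *out, int n, int stride)
  {
      int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
      if (i < n) out[i] = in[i];
  }
  ```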
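* shared-memory bank conflicts in practice (a sketch, not from the original notes): the well-known tile-transpose trick pads each row of a shared tile by one word so that reading a column does not hit the same bank 32 times; the kernel below assumes a square `width` x `width` matrix with `width` a multiple of 32 and a 32x32 thread block
  ```cuda
  #define TILE 32

  __global__ void transpose_tile(const float *in, float *out, int width)
  {
      // Rows of a [TILE][TILE] tile would be exactly 32 words (= 32 banks) apart,
      // so reading a column would be a 32-way bank conflict; the +1 padding
      // shifts every row by one bank and makes the column read conflict-free.
      __shared__ float tile[TILE][TILE + 1];

      int x = blockIdx.x * TILE + threadIdx.x;
      int y = blockIdx.y * TILE + threadIdx.y;
      tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced global read

      __syncthreads();

      x = blockIdx.y * TILE + threadIdx.x;                   // swap block coordinates
      y = blockIdx.x * TILE + threadIdx.y;
      out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write; column read is conflict-free
  }
  ```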
* constant memory
  * Same usage and scope as global memory, except:
    * read only
    * declared with the variable qualifier `__constant__`
    * moved with `cudaMemcpyToSymbol()` & `cudaMemcpyFromSymbol()`
* warp divergence ![](https://i.imgur.com/t2dsPrn.png)

## Shared memory

* ref: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x
* For compute capability 5.x and 6.x:
  * Shared memory has 32 banks that are organized such that successive **32-bit words** map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.
  * bank index = (byte address / 4 bytes per bank) % 32 banks

## Reference

* https://docs.nvidia.com/cuda/
* https://nanxiao.me/en/about/
* http://www.icl.utk.edu/~luszczek/teaching/courses/fall2016/cosc462/pdf/GPU_Fundamentals.pdf
* https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
* https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_WarpsAndOccupancy.pdf
* https://devblogs.nvidia.com/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/
* http://www.enseignement.polytechnique.fr/profs/informatique/Eric.Goubault/Cours09/CUDA/SC07_CUDA_5_Optimization_Harris.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf