# IRGPU: Patterns for massively parallel programming

# Programming patterns & memory optimizations

**The programming patterns**
- Map
- Map + Local reduction
- Reduction
- Scan

**The IP algorithms**
- LUT applications
- Local features extraction
- Histogram
- Integral images

# Map pattern

## Map pattern overview

- One pixel in, one pixel out
- No dependency between pixels

:::info
Map replicates a function over every element of an index set. The computation of each pixel is independent of the others.
:::

`out(x,y) = f(in(x,y))`

![](https://i.imgur.com/fbjNWGZ.png)

![](https://i.imgur.com/Mk2oJIk.png)

*What do you think about `k`'s impact on performance?*

The first kernel accesses `threadIdx.x + k`, the other `threadIdx.x * k`.

- On the left, thread i and thread i+1 access consecutive memory addresses: the access pattern is contiguous (linear)
- On the right: strided access

![](https://i.imgur.com/AsmEcfK.png)

:::success
- Linear sequential access with an offset is fine
- Strided access is bad
:::

### Strided access pattern

![](https://i.imgur.com/UIPbkpm.png)

With kernel 2, a large share of the time is spent on memory handling (poorly organized memory accesses).

# Memory performance

## Memory bandwidth

What you think memory looks like:

![](https://i.imgur.com/B9ci8TU.png)

Reality:

![](https://i.imgur.com/tv7pZTj.png)

## Memory access hierarchy

![](https://i.imgur.com/26hzlga.png)

- L2 memory: intermediate cache
- Low-latency on-chip memory:
  - Registers
  - Shared memory
  - L1 (same physical area as the shared memory)
- We either let the processor manage that on-chip memory (L1 cache) or manage it ourselves, in which case we configure the shared memory ourselves
- L1 acts as a cache between L2 and the compute units

## Cached loads from L1 low-latency

Two things to take into account:
1. whether memory accesses are aligned or not
2. whether memory accesses are coalesced (strided) or not

2 types of global memory loads: **cached** or **uncached**

### Aligned vs misaligned

A load is **aligned** if the first address of the memory access is a multiple of the cache granularity (32 bytes for L2 segments, 128 bytes for L1 cache lines).

- Memory addresses must be type-aligned
- Otherwise, poor performance
- `cudaMalloc` guarantees an alignment of at least 256 bytes

### Coalesced vs uncoalesced

A load is **coalesced** if the threads of a warp access a contiguous chunk of memory, and uncoalesced otherwise.
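To make the offset vs. stride discussion concrete, here is a minimal sketch of the two access patterns (the kernel names and the `2.f * in[i]` map function are illustrative assumptions, not the slide code); only the indexing differs between the two kernels.

```cuda
#include <cuda_runtime.h>

// Map with a constant offset: thread i reads/writes element i + k.
// Consecutive threads still touch consecutive addresses -> coalesced.
__global__ void map_offset(const float* in, float* out, int k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + k;
    if (i < n)
        out[i] = 2.f * in[i];
}

// Map with a stride: thread i reads/writes element i * k.
// Consecutive threads touch addresses k elements apart -> uncoalesced for k > 1.
__global__ void map_strided(const float* in, float* out, int k, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * k;
    if (i < n)
        out[i] = 2.f * in[i];
}

int main()
{
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc((void**)&in, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    // Same amount of work, very different bus utilization.
    map_offset<<<(n + 255) / 256, 256>>>(in, out, /*k=*/1, n);
    map_strided<<<(n + 255) / 256, 256>>>(in, out, /*k=*/8, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Timing both launches (e.g. with CUDA events) would show the strided version reaching only a fraction of the effective bandwidth of the offset version, which matches the profile shown above.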
## Misaligned cached loads from L1

We need a load strategy:

- The 32 threads of a warp each access a 32-bit word = 128 bytes in total
- 128 bytes = L1 bus width (single load, bus utilization = 100%)
- Access permutation has no (or very low) overhead

![](https://i.imgur.com/iwWlZJZ.png)

- If data are not 128-byte aligned, 2 loads are required: addresses 96-224 are needed, but the range 0-256 is loaded

![](https://i.imgur.com/Z4BOxMo.png)

- If data is accessed with a stride

![](https://i.imgur.com/sBfwvO6.png)

![](https://i.imgur.com/oeBdxY5.png)

*Couldn't we just widen the bus?*

The bus width is fixed (hardware).

## Loads from global (uncached) memory

Same idea, but memory is split into segments of 32 bytes.

![](https://i.imgur.com/ri57qfT.png)

## Coalesced memory access (summary)

![](https://i.imgur.com/HzDzK61.png)

![](https://i.imgur.com/yofR2mV.png)

## How memory works for real

- DRAM is organized as a 2D core array
- Each DRAM core array holds about 16M bits

![](https://i.imgur.com/fsGELN3.png)

### Example

- A 4x4 memory cell array
- With a 4-bit pin interface width

![](https://i.imgur.com/H9l2qyo.png)

### DRAM burst

Memory is slow: the core array runs slower than the interface.

- DDR = 1/2 interface speed
- DDR2 = 1/4 interface speed
- DDR3 = 1/8 interface speed

:::success
Solution: bursting
:::

Load N x interface width from **the same row**.

![](https://i.imgur.com/keRA3iG.png)

Instead of getting 1/4, we get 3 x 1/4.

Better, but not enough to saturate the memory bus.

## Summary

- Use **coalesced** (*contiguous* and *aligned*) accesses to memory

![](https://i.imgur.com/D5Nbfwi.png)

*How to make coalesced loads with 2D arrays?*

![](https://i.imgur.com/zVRBvKI.png)

:::danger
This corresponds to the **pitch**
:::

Pitch: the padded row size, chosen so that the beginning of every row falls on a multiple of 32.

# Using shared memory

## Transposition

![](https://i.imgur.com/TEq18Ug.png)

![](https://i.imgur.com/1FkMRoY.png)

*Where are the non-coalesced accesses?*

Within a warp, the `x` values are linear (0-31). Here, what is linear along `x` is the read from `in`. The 32 threads then write 32 non-contiguous elements.

:::success
`a[x][y]`
:::
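As a reference point for the question above, here is a minimal sketch of the naive transposition (names and signature are assumptions, not the slide code): the read from `in` is coalesced along `x`, while the write to `out` is the `a[x][y]`-style access that a warp scatters over non-contiguous addresses.

```cuda
// Naive transposition: one thread per pixel.
// Reads from `in` are coalesced (x varies fastest inside a warp),
// writes to `out` jump by `height` elements between neighboring threads.
__global__ void transpose_naive(const float* in, float* out,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];   // strided (uncoalesced) write
}
```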
## Tiling and memory privatization

:::info
We split the work into sub-blocks (tiles)

![](https://i.imgur.com/AaI69bz.png)
:::

> This works well on images!

For each block:
- Read the tile from global memory to private (shared) block memory
- Process the tile (using shared memory)
- Write the tile back from the private (shared) block memory to global memory

![](https://i.imgur.com/MdMGfx4.png)

### Collaborative loading and writing when BLOCKDIM = TILEDIM

- All threads load one or more data elements
- Accesses must be **coalesced**
- Use barrier synchronization to make sure that all threads are ready to start the next phase

![](https://i.imgur.com/CV8GUbH.png)

*Is writing into the tile linear?*

For each row `y`, `x` varies fastest: it is linear.

### Tiled transposition in shared memory

![](https://i.imgur.com/3kENRo7.png)

The shared-memory transposition algorithm starts by copying the tile into shared memory, transposes it within shared memory, then writes it back to global memory with coalesced accesses.

![](https://i.imgur.com/KscPzBJ.png)

*When are the shared memory accesses aligned?*

We read from global memory in an aligned way; writes are aligned everywhere except the last row, where the access is non-aligned.

![](https://i.imgur.com/gqT5i0W.png)

Performance (GB/s on TESLA K40)

![](https://i.imgur.com/bcKFDsm.png)

There is a 2x speed-up between the version that uses shared memory and the one that does not.

# About shared memory

*How can we close the gap (further) with the TESLA's peak GB/s?*

## DRAM banks

- Bursting: access multiple locations of a line in the DRAM core array (horizontal parallelism)

![](https://i.imgur.com/j7rOC0e.png)

:::success
This lets us use more of the memory's full capacity
:::

## Bank conflicts in shared memory

- If 2 threads try to perform 2 **different** loads in the same bank $\to$ **bank conflict**
- Every bank can provide 64 bits every cycle
- Only 2 modes:
  1. Change bank after 32 bits
  2. Change bank after 64 bits

![](https://i.imgur.com/3lRZhel.png)

:::success
Two different threads requesting the same address is not a problem
:::

Consecutive addresses are not located in the same bank.

### 2-way conflicts

![](https://i.imgur.com/nXkxAjy.png)

Two loads to shared memory are issued.

:::warning
Conflict = serialized accesses, which is bad
:::

![](https://i.imgur.com/YmfxB50.png)

## Concrete example for shared memory

- Bank size: 4B = 4 uint8
- 32 banks
- Many channels
- Warp size = 32 threads

![](https://i.imgur.com/oQ66wiH.png)

![](https://i.imgur.com/O7xreLP.png)

**For the first row of the table:**

*Are there threads requesting different addresses within the same bank?*

No.

*How many loads (memory requests) will be issued?*

A single memory request.

|op|Items|Banks used|Conflict free|#Loads|
|--|-----|----------|-------------|------|
|`load M[tid.x]`|$[0,1,...,31]$|$[0,1,...,7]$|Yes|1|
|`load M[tid.x % 4]`|$[0,1,2,3,0,1,2,3,...]$|$[0]$|Yes|1|
|`load M[tid.x + 1]`|$[1,2,3,...,32]$|$[0,1,...,8]$|Yes|1|
|`load M[tid.x * 2]`|$[0,2,4,...,62]$|$[0,1,...,15]$|Yes|1|
|`load M[tid.x * 8]`|$[0,8,...,248]$|$[0,2,...,30]$|No (conflicts on every bank used)|2|
|`load M[tid.x * 12]`|$[0,12,...,372]$|$[0,1,...,31]$|Yes|1|

`load M[tid.x * 8]`:

![](https://i.imgur.com/jAku2dw.png)

`load M[tid.x * 12]`:

![](https://i.imgur.com/etNDHqA.png)

# Bank conflicts in Transpose

![](https://i.imgur.com/J0S9MsY.png)

![](https://i.imgur.com/dQwmZid.png)

With 32 banks, we only use 2 banks out of 32 and we get many conflicts.

:::warning
Reading a column may cause bank conflicts
:::

## Solution to bank conflicts

### With padding (to WARP_SIZE + 1)

![](https://i.imgur.com/ERSDmSC.png)

*How does reading a column work now?*

With the padding, each row is shifted by one element (we shift by one bank). When reading the first column, every element now falls into a different bank, which avoids the conflicts.
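Here is a minimal sketch of the padded, tiled transposition combining the ideas above: coalesced global reads and writes through a shared-memory tile, plus the `+1` padding so that reading a tile column hits 32 different banks. The tile size, names, and bounds checks are assumptions, not the exact practice-session code.

```cuda
#define TILE_DIM 32

// Tiled transposition through shared memory.
// Global reads and writes are both coalesced; the +1 padding on the tile
// shifts each row by one bank so that reading a column is conflict-free.
__global__ void transpose_tiled(const float* in, float* out,
                                int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // padded to avoid bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    // Swap the block coordinates and read the tile along columns
    // (no bank conflict thanks to the padding).
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```

It would be launched with `dim3 block(TILE_DIM, TILE_DIM)` and one block per tile; removing the `+ 1` brings back the bank conflicts discussed above.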
### Index mapping function

![](https://i.imgur.com/FajHwoH.png)

## Performances (GB/s on TESLA K40)

![](https://i.imgur.com/NlafuWa.png)

## Shared memory (summary)

- Super fast access (almost as fast as registers)
- But a limited resource

### Occupancy

![](https://i.imgur.com/H0OfGMx.png)

# Stencil Pattern

Use cases:
- Dilation/erosion
- Box (mean) / convolution filters
- Bilateral filter
- Gaussian filter
- Sobel filter

![](https://i.imgur.com/k1qCOk7.png)

:::warning
There is a dependency between pixels
:::

## Naive stencil implementation

Local average with a rectangle of radius $r$ (ignoring border problems for now)

![](https://i.imgur.com/CDF9LBE.png)

## Naive stencil performance

Say we have this GPU:
- Peak power: 1500 GFLOPS and memory bandwidth: 200 GB/s

All threads access global memory:
- 1 memory access for 1 FP addition
- Requires 1500G x `sizeof(float)` = 6 TB/s of data
- But only 200 GB/s of memory bandwidth $\to$ 50 GFLOPS (about 3% of the peak)

![](https://i.imgur.com/WaaSHtg.png)

:::warning
Problem: too many accesses to global memory
:::

:::success
Solution: tiling, i.e. copy data from global memory to shared memory
:::

## Handling borders

![](https://i.imgur.com/7Ke7pfY.png)

1. Add a border to the image so that every access stays inside allocated memory
2. Copy the tile + border to shared memory

### The bad way

Each thread copies one value, and the border threads are then idle.

![](https://i.imgur.com/i8tbBEO.png)

### The good way

A thread may copy several pixels.

![](https://i.imgur.com/pvXUUJC.png)

## Stencil pattern with tiling: performance

![](https://i.imgur.com/wN0CVgX.png)

# Reduction Pattern

## Intuition for the reduction pattern

> Reduction combines every element in a collection into one element using an associative operator

## Reduction pattern: solution 1

![](https://i.imgur.com/WeQUfQW.png)

*Is this correct?*

No, we get concurrent accesses (a data race).

### Data race

:::info
Several parts of a program try to access the same data with no predefined order
:::

![](https://i.imgur.com/R1q5wFy.png)

We need to ensure that each of those read-compute-write sequences is **atomic**.

## Atomics reminder

Atomics
- Read, modify, write in 1 operation
- Cannot be mixed with accesses from other threads
- Available on global memory and shared memory
- Atomic operations to the same address are serialized

Operations

![](https://i.imgur.com/J0MTpux.png)

## Reduction Pattern Corrected

**Accumulation in global memory**

![](https://i.imgur.com/8QdeZd4.png)

**Analysis**

![](https://i.imgur.com/EcRZIrW.png)

Time: 5.619 ms

Correct result, but **high contention** on the global atomic variable.

:::success
The execution is actually **sequential**!
:::

## Global atomics: is this really parallel?

This version produces the right result. However, **is it really parallel?**

How our global atomic instruction is executed:
1. lock the memory cell
2. read the old value
3. compute the new value
4. write the new value
5. release the memory cell

Memory cell = cache line / burst

Our kernel generates a lot of collisions on global memory.
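For reference, a minimal sketch of the global-atomic accumulation analyzed above (kernel and variable names are assumptions, not the slide code): every thread issues an `atomicAdd` on the same global counter, which is exactly why the execution degenerates into a serialized sequence. `result` is assumed to be zero-initialized before the launch.

```cuda
// Reduction with a single global accumulator: correct, but every thread
// contends for the same address, so the atomicAdds are serialized.
__global__ void reduce_global_atomic(const int* in, int* result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, in[i]);
}
```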
## Leverage shared memory

Atomic operations are **much** faster on shared memory than on global memory.

## Motivation for output privatization

![](https://i.imgur.com/v8beAV1.png)

![](https://i.imgur.com/mCcY84B.png)

## Using shared memory

![](https://i.imgur.com/xGWVBNv.png)

## Reduction pattern V2: output privatization

![](https://i.imgur.com/R98fyGN.png)

### With sync

![](https://i.imgur.com/hT3muek.png)

![](https://i.imgur.com/qHcQ2Xc.png)

## Reduction functions and trees

Several fragments can be reduced in parallel.

![](https://i.imgur.com/Ps454nh.png)

## Complexity in steps and operations

![](https://i.imgur.com/SDnMiRA.png)

The tree parallel version is:
- work efficient
- not resource efficient
- Average number of threads $\frac{N-1}{\log_2(N)} \ll$ peak requirement ($N/2$)

### Proof of the number of operations

![](https://i.imgur.com/dpfV5lj.png)

## Reduction pattern: tree reduction without atomics

![](https://i.imgur.com/jJNk6qz.png)

- Use a local sum without atomics
- Map the reduction tree to compute units (threads)
- Add to a global atomic once per block

![](https://i.imgur.com/Dv7RNfJ.png)

![](https://i.imgur.com/WEGvBcP.png)

**What is happening?**

The (naive) tree version is slower than the locally sequential version.

## SP starvation

In each iteration, 2 control flow paths are sequentially traversed for each warp:
- Threads that perform the addition and threads that do not
- Threads that do not perform the addition **still consume execution resources**

## Resource-efficient version

![](https://i.imgur.com/vA8Y2QC.png)

All threads are used except at the end. We only have active warps: at each iteration, half of the warps are released.

![](https://i.imgur.com/hKwmj9M.png)

## A quick analysis

For a 1024-thread block:
- No divergence in the first 5 steps
- 1024, 512, 256, 128, 64, 32 consecutive threads are active in each step
- All threads in each warp are either all active or all inactive
- The final 5 steps will still have divergence
- Warp-level optimization can then be used (warp shuffle)

## Limit global collisions

*What happens with very large input arrays?*

**Lots of global atomics**

*How to avoid this?*

**Global array, one cell for each block**
- No more locks
- But requires a second level of reduction

**More work per thread**
- Just fire enough blocks to hide latency
- Sequential reduction, then tree reduction
- "Algorithm cascading"

## Algorithm cascading

Perform a first reduction during the collaborative loading.

:::warning
Warning: kernel launch parameters must be scaled accordingly!
:::

![](https://i.imgur.com/LsGjS5w.png)

## Last optimizations

Loop unrolling:
- Unroll the tree reduction loop for the last warp (less synchronization needed)
- Unroll all tree reduction loops (requires knowing the block size)
- Unroll the sequential reduction loop (knowing the work per thread)
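Pulling these optimizations together, here is a minimal sketch of a cascaded block reduction (block size, the grid-stride pre-reduction loop, and all names are my assumptions, not the course's reference solution): sequential accumulation during the collaborative load, a resource-efficient tree in shared memory, then a single global atomic per block.

```cuda
#define BLOCK_SIZE 256

// One partial sum per thread (algorithm cascading), then a shared-memory
// tree where active threads stay contiguous (resource-efficient version),
// then a single atomicAdd per block.
__global__ void reduce_cascaded(const int* in, int* result, int n)
{
    __shared__ int sdata[BLOCK_SIZE];

    // Sequential pre-reduction: each thread sums several elements
    // with a grid-stride loop (coalesced accesses).
    int tid = threadIdx.x;
    int sum = 0;
    for (int i = blockIdx.x * BLOCK_SIZE + tid; i < n; i += gridDim.x * BLOCK_SIZE)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction: at each step, the first half of the remaining
    // threads does the work, so whole warps retire together.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    // One global atomic per block instead of one per element.
    if (tid == 0)
        atomicAdd(result, sdata[0]);
}
```

It would be launched as `reduce_cascaded<<<numBlocks, BLOCK_SIZE>>>(in, result, n)` with `numBlocks` chosen to keep the GPU busy rather than tied to `n`; the unrolling tricks listed above can then be applied to the tree loop.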
# Histogram computation

## Mandelbrot practice session

During the practice session, you will have to compute the **cumulated histogram** of the image.

1. Compute the histogram $H$
   - Count the number of occurrences of each value
2. Compute the cumulated histogram $C$

![](https://i.imgur.com/cGXIbYP.png)

:::warning
Inefficient, *non-coalesced* memory accesses
:::

## First sample code

![](https://i.imgur.com/eMOR4VG.png)

*What is the issue?*

A data array has to be added.

![](https://i.imgur.com/rnUgffz.png)

## Parallel algorithm using output privatization

### Local histogram

#### Initialization

Shared memory must be initialized.

This can be done with the "comb-like" pattern.

![](https://i.imgur.com/unIYdXO.png)

:::warning
We need synchronization after this stage
:::

#### Computation

Like the previous code, but with local atomics.

![](https://i.imgur.com/ke6xZ3o.png)

:::warning
We need synchronization after this stage
:::

#### Commit to global memory

![](https://i.imgur.com/je24ZHQ.png)

`int n = blockDim.x * threadIdx.y + threadIdx.x`

## Summary

Performance boosters:
- Coalesced accesses
- Output privatization

Requirements:
- atomics
- synchronization

# Scan pattern

## What is a scan?

:::info
Scan computes all partial reductions of a collection
:::

![](https://i.imgur.com/yvFGSq8.png)

Usage:
- Integration (cumulated histogram)
- Resource allocation (memory to parallel threads, camping spots...)
- Base building block for many algorithms (sorts, string comparisons)

## Performance baselines

![](https://i.imgur.com/Cj1a8dh.png)

**Sequential version**

### Naive parallel version

Have every thread add up all the input elements needed for its output element.

## Scan pattern at the warp or block level

### Kogge-Stone

![](https://i.imgur.com/zh1JFSK.png)

- Number of steps: $\log N$ (good)
- Resource efficiency (good)
- Work efficiency: $\sim N\log N$ (bad)

### Brent-Kung

![](https://i.imgur.com/0JJb8j7.png)

- Number of steps: $2\log N$
- Resource efficiency: all warps remain active until the end (bad)
- Work efficiency: $2N$ (good)

### Sklansky

![](https://i.imgur.com/Skxx9gb.png)

- Number of steps: $\log N$
- Resource efficiency: good
- Work efficiency: $\frac{N}{2}\log N$ (better than Kogge-Stone, but still $O(N\log N)$)

## Scan Pattern at the Block or Grid Level

The previous patterns can be applied:
- At the warp level (no sync needed until Volta)
- At the block level (thread sync)

At the global level: multi-level kernel application in global memory
- Scan then propagate
- Reduce then scan

### Scan then propagate

![](https://i.imgur.com/tEcHvf6.png)

### Reduce then scan

![](https://i.imgur.com/qzKo8RB.png)

## Summary

![](https://i.imgur.com/sfbYbdL.png)
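As a closing illustration of the warp/block-level patterns above, here is a minimal sketch of a block-level Kogge-Stone inclusive scan in shared memory (block size, the double-buffering scheme, and all names are assumptions, not the course's reference code); a grid-level scan would combine it with one of the two multi-level schemes shown above.

```cuda
#define BLOCK_SIZE 256

// Kogge-Stone inclusive scan of one block of BLOCK_SIZE elements.
// log2(BLOCK_SIZE) steps; double buffering avoids read/write races
// between threads inside a step.
__global__ void scan_kogge_stone(const int* in, int* out, int n)
{
    __shared__ int buf[2][BLOCK_SIZE];

    int tid = threadIdx.x;
    int gid = blockIdx.x * BLOCK_SIZE + tid;
    buf[0][tid] = (gid < n) ? in[gid] : 0;
    __syncthreads();

    int src = 0;
    for (int stride = 1; stride < BLOCK_SIZE; stride *= 2) {
        int dst = 1 - src;
        // Each element adds the value `stride` positions to its left.
        if (tid >= stride)
            buf[dst][tid] = buf[src][tid] + buf[src][tid - stride];
        else
            buf[dst][tid] = buf[src][tid];
        __syncthreads();
        src = dst;
    }

    if (gid < n)
        out[gid] = buf[src][tid];
    // For a grid-level scan, each block would also publish its total so a
    // second pass can propagate it (scan then propagate) or the totals can
    // be scanned first (reduce then scan).
}
```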