# IRGPU: Patterns for massively parallel programming
# Programming patterns & memory optimizations
**The programming patterns**
- Map
- Map + Local reduction
- Reduction
- Scan
**The IP algorithms**
- LUT applications
- Local feature extraction
- Histogram
- Integral images
# Map pattern
## Map pattern overview
- One pixel in, one pixel out
- No dependency between pixels
:::info
Map replicates a function over every element of an index set.
The computation of each pixel is independent of the others.
:::
`out(x,y) = f(in(x,y))`


*What do you think about `k`'s impact on performance?*
The first kernel computes `threadIdx.x + k`, the other computes `threadIdx.x * k` (both are sketched below).
- On the left, consecutive threads access consecutive elements: **memory addresses are contiguous** (linear)
- On the right: strided access

:::success
- Linear sequential access with an offset is good
- Strided access is bad
:::
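A minimal CUDA sketch of the two access patterns being compared (kernel names and bodies are assumptions, not the course's code):

```cuda
// Offset access: consecutive threads read consecutive addresses.
__global__ void map_offset(const float* in, float* out, int k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k; // linear with offset k
    if (i < n) out[i] = 2.0f * in[i];
}

// Strided access: consecutive threads read addresses k elements apart.
__global__ void map_strided(const float* in, float* out, int k, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * k; // stride k
    if (i < n) out[i] = 2.0f * in[i];
}
```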
### Strided access pattern

In kernel number 2, a large part of the time is spent on memory management (poorly structured memory accesses).
# Memory performance
## Memory bandwidth
What you think memory looks like:

Reality:

## Memory access hierarchy

- L2 memory: intermediate cache
- On-chip low-latency memory:
    - Registers
    - Shared memory
    - L1 (same physical area as the shared memory)
- We can either tell the processor to manage this memory, or manage it ourselves
    - In the latter case, we configure the shared memory ourselves
- L1 serves as a cache between L2 and the shared memory
## Cached Loads from L1 low-latency
Two things to take into account:
1. whether memory accesses are aligned or not
2. whether memory accesses are coalesced or strided
2 types of global memory loads: **cached** or **uncached**
### Aligned vs Misaligned
A load is **aligned** if the first address of the memory access is a multiple of 32 bytes
- Memory addresses must be type-aligned (on the size of the accessed type)
- Otherwise, poor performance
- `cudaMalloc` = alignment on 256 bytes at least
### Coalesced vs uncoalesced
A load is **coalesced** if the warp's accesses fall within a contiguous chunk of memory
## Misaligned cached loads from L1
We need a load strategy:
- The 32 threads of a warp each access a 32-bit word = 128 bytes
- 128 bytes = L1 bus width (single load, bus utilization = 100%)
- Access permutations have no (or very low) overhead

- If data is not 128-byte aligned, 2 loads are required
Addresses 96 to 224 are required... but bytes 0 to 256 are loaded

- If data is accessed with a stride, bus utilization drops further


*Couldn't we increase the bus width?*
The bus is fixed (hardware).
## Loads from global (uncached) memory
Same idea, but memory is split into segments of 32 bytes

## Coalesced Memory Access (summary)


## How memory works for real
- DRAM is organized as a 2D core array
- Each DRAM core array holds about 16M bits

### Example
- A 4x4 memory cell array
- With a 4-bit pin interface width

### DRAM Burst
Memory is slow:
- DDR: core runs at 1/2 the interface speed
- DDR2: 1/4 the interface speed
- DDR3: 1/8 the interface speed
:::success
Solution: Bursting
:::
Load N x interface width of **the same row**

Instead of getting 1/4, we send back 3 × 1/4.
Better, but not enough to saturate the memory bus.
## Summary
- Use **coalesced** (*contiguous* and *aligned*) accesses to memory

*How do we make coalesced loads with 2D arrays?*

:::danger
Ca correspond au **pitch**
:::
Pitch: taille des lignes pour que le debut des lignes correspond a un multiple de 32
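A minimal sketch of pitched allocation with the CUDA runtime (`width` and `height` are hypothetical):

```cuda
int width = 1024, height = 768;   // hypothetical image size
float* d_img;
size_t pitch;                     // actual row size in bytes, chosen by the runtime
cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);

// In a kernel, each row then starts at a properly aligned address:
// float* row = (float*)((char*)d_img + y * pitch);
// float v   = row[x];
```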
# Using shared memory
## Transposition


*Where are the non-coalesced accesses?*
Within a warp, the `x` values are linear (0 to 31). Here, what is linear along `x` is the read from `in`.
The 32 threads will write 32 non-contiguous elements.
:::success
`a[x][y]`
:::
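A minimal sketch of the naive transpose under discussion (name and layout are assumptions; `in` is `w`×`h`, row-major):

```cuda
__global__ void transpose_naive(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[x * h + y] = in[y * w + x]; // coalesced read, strided (uncoalesced) write
}
```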
## Tiling and memory privatization
:::info
We split the work into sub-blocks (tiles).

:::
> This works well on images!
For each block:
- Read the tile from global memory to the private (shared) block memory
- Process the tile (using shared memory)
- Write the tile from the private (shared) block memory back to global memory

### Collaborative loading and writing when BLOCKDIM = TILEDIM
- All threads load one or more data
- Access must be **coalesced**
- Use barrier synch to make sure that all threads are ready to start the phase

*Is writing into the tile linear?*
For each row `y`, `x` varies the fastest: the access is linear.
### Tiled transposition in shared memory

The shared-memory transposition algorithm starts by copying the tile into shared memory, transposes it within shared memory, then writes it back to global memory with coalesced accesses.

*At what point do we make aligned accesses in shared memory?*
We read from global memory in an aligned way; writes are aligned everywhere except the last row, where the access is unaligned.
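A sketch of the tiled version, assuming square TILE×TILE blocks (this plain `tile[TILE][TILE]` layout still has the bank conflicts discussed below):

```cuda
#define TILE 32

__global__ void transpose_tiled(const float* in, float* out, int w, int h) {
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];  // coalesced read
    __syncthreads();

    // Swap the block indices so that the global write is coalesced too.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y]; // transposed read from the tile
}
```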

Performance (GB/s on TESLA K40):

There is a 2× speedup between the version that uses shared memory and the one that does not.
# About shared memory
*How can we close the gap (further) with the Tesla's peak GB/s?*
## DRAM banks
- Bursting: access multiple locations of the same row in the DRAM core array (horizontal parallelism)

:::success
This lets us use the memory closer to its full capacity.
:::
## Bank conflicts in shared memory
- If 2 threads try to perform 2 **different** loads in the same bank $\to$ **bank conflict**
- Every bank can provide 64 bits every cycle
- Only 2 modes:
1. the bank changes every 32 bits
2. the bank changes every 64 bits

:::success
Two different threads requesting the same address is not a problem (the value is broadcast).
:::
Consecutive addresses are not in the same bank.
### 2-way conflicts

Two loads from shared memory must be performed.
:::warning
Conflict = serialized accesses, which is bad
:::

## Concrete example for shared memory
- Bank size: 4 bytes = 4 `uint8` items
- 32 banks, many channels
- Warp size = 32 threads


**For the first case of the table:**
*Are there threads that request different addresses within the same bank?*
No.
*How many loads (memory requests) will be performed?*
A single memory request.
|op|Items|Banks used|Conflict free|#Loads|
|--|-----|---------|-------------|------|
|`load M[tid.x]`|$[0,1,\dots,31]$|$[0,1,\dots,7]$|Yes|1|
|`load M[tid.x % 4]`|$[0,1,2,3,0,1,2,3,\dots]$|$[0]$|Yes|1|
|`load M[tid.x + 1]`|$[1,2,\dots,32]$|$[0,1,\dots,8]$|Yes|1|
|`load M[tid.x * 2]`|$[0,2,4,\dots,62]$|$[0,1,\dots,15]$|Yes|1|
|`load M[tid.x * 8]`|$[0,8,\dots,248]$|$[0,2,\dots,30]$|No (2-way conflict on every bank used)|2|
|`load M[tid.x * 12]`|$[0,12,\dots,372]$|$[0,1,\dots,31]$|Yes|1|
`load M[tid.x * 8]`:

`load M[tid.x * 12]`:

# Bank conflicts in Transpose


With 32 banks, we only use 2 banks out of 32 and get lots of conflicts.
:::warning
Reading a column may cause bank conflict
:::
## Solution to bank conflicts
### With padding (to WARP_SIZE + 1)

*How does reading the columns work?*
With the padding, each row is shifted by 1 (we move over by one bank). When reading the first column, every access lands in a different bank, which avoids the conflicts.
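In code, the fix is a one-line change to the tile declaration (sketch):

```cuda
// One extra column: a column read tile[threadIdx.x][threadIdx.y] now walks
// through addresses with stride TILE + 1, hitting 32 different banks.
__shared__ float tile[TILE][TILE + 1];
```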
### Index mapping function

## Performance (GB/s on TESLA K40)

## Shared memory (summary)
- Super fast access (almost as fast as registers)
- But limited resources
### Occupancy

# Stencil Pattern
Use case:
- Dilation/erosion
- Box (Mean) / Convolution Filters
- Bilateral Filter
- Gaussian Filter
- Sobel Filter

:::warning
There are dependencies between pixels.
:::
## Naive Stencil Implementation
Local average over a rectangle of radius $r$ (ignoring border problems for now)
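A minimal sketch of this naive version (kernel name is an assumption; borders are simply skipped):

```cuda
__global__ void box_mean_naive(const float* in, float* out, int w, int h, int r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < r || y < r || x >= w - r || y >= h - r) return; // ignore borders

    float sum = 0.0f;
    for (int dy = -r; dy <= r; ++dy)        // every read goes to global memory
        for (int dx = -r; dx <= r; ++dx)
            sum += in[(y + dy) * w + (x + dx)];
    out[y * w + x] = sum / ((2 * r + 1) * (2 * r + 1));
}
```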

## Naive Stencil Performance
Say we have this GPU:
- Peak compute power: 1 500 GFLOPS and memory bandwidth: 200 GB/s
All threads access global memory
- 1 memory access for 1 FP addition
- Requires 1500 × `sizeof(float)` = 6 TB/s of data
- But only 200 GB/s of memory bandwidth $\to$ 50 GFLOPS (3% of the peak)

:::warning
Problem: too many accesses to global memory
:::
:::success
Solution: tiling; copy data from global memory to shared memory
:::
## Handling Borders

1. Add a border to the image so that all accesses stay within memory
2. Copy tile + border to shared memory
### The bad way
Each thread copies one value; the border threads are then idle

### The good way
A thread may copy several pixels

## Stencil pattern with tiling performance

# Reduction Pattern
## Intuition for reduction pattern
> Reduction combines every element in a collection into one element using an associative operator
## Reduction pattern: solution 1

*Is this correct?*
No, we get concurrent accesses (a data race).
### Data race
:::info
Several parts of a program try to access the same data with no predefined order.
:::

We need to ensure that each of those read-compute-write sequences is **atomic**.
## Atomics reminder
Atomics
- Read, modify, write in 1 operation
- Cannot be mixed with accesses from other threads
- On global memory and shared memory
- Atomic operations to the same address are serialized
Operations

## Reduction Pattern Corrected
**Accumulation in global memory**
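A minimal sketch of this corrected version (names are assumptions):

```cuda
__global__ void reduce_global_atomic(const float* in, float* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, in[i]); // every thread hits the same global address
}
```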

**Analysis**

Time: 5.619 ms
Correct result but **high contention** on the global atomic variable
:::success
The execution is actually **sequential**!
:::
## Global atomics: is this really parallel?
This version produces the right result. However, **is it really parallel?**
How our global atomic instruction is executed:
1. lock memory cell
2. read old value
3. compute new value
4. write new value
5. release the memory cell
Memory cell = cache line burst
Our kernel generates a lot of collisions on global memory
## Leverage Shared Memory
Atomic operations are **much faster** on shared memory than on global memory.
## Motivation for output privatization


## Using shared memory

## Reduction pattern V2: Output privatization

### With sync
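A minimal sketch of the privatized version with synchronization (names are assumptions):

```cuda
__global__ void reduce_privatized(const float* in, float* result, int n) {
    __shared__ float block_sum;                 // private per-block accumulator
    if (threadIdx.x == 0) block_sum = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&block_sum, in[i]);           // cheap shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(result, block_sum);           // one global atomic per block
}
```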


## Reduction functions and trees
We can reduce several fragments in parallel.

## Complexity in steps and operations

The tree parallel version is:
- work efficient
- not resource efficient
- Average number of active threads $(N-1)/\log_2(N) \ll$ peak requirement ($N/2$)
### Proof of the number of operations
At step $k$, $N/2^k$ threads perform an addition, so the total is $\sum_{k=1}^{\log_2 N} N/2^k = N-1$ operations.

## Reduction pattern: tree reduction without atomics

- Use a local sum without atomics
- Map reduction tree to compute units (threads)
- Add to a global atomic once for each block
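A minimal sketch of this tree version (interleaved addressing; launch with `threads * sizeof(float)` bytes of dynamic shared memory):

```cuda
__global__ void reduce_tree_naive(const float* in, float* result, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;        // collaborative load
    __syncthreads();

    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)                 // fewer threads of each warp work
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(result, sdata[0]);            // one global atomic per block
}
```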


**What is happening?**
The (naive) tree version is slower than the locally sequential version.
## SP starvation
In each iteration, 2 control-flow paths are sequentially traversed for each warp:
- Threads that perform addition and threads that do not
- Threads that do not perform addition **still consume execution resources**
## Resource efficient version

All the threads are used except at the end. Only full warps stay active: at each iteration, half of the warps are released.
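A sketch of the corresponding loop, replacing the interleaved one above (sequential addressing):

```cuda
// Active threads stay contiguous, so whole warps become idle early.
for (int s = blockDim.x / 2; s > 0; s /= 2) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
```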

## A quick analysis
For a 1024 thread block
- No divergence on the first 5 steps
- 1024, 512, 256, 128, 64, 32 consecutive threads are active in each step
- All threads in each warp either all active or all inactive
- The final 5 steps will still have divergence
- Warp-level optimizations can then be used (warp shuffle)
## Limit global collision
*What happens with very large input arrays?*
**Lots of global atomics**
*How can we avoid this?*
**A global array, one cell for each block**
- No more locks
- But requires a second level of reduction
**More work per thread**
- Just fire enough blocks to hide latency
- Sequential reduction, then tree reduction
- "algorithm cascading"
## Algorithm cascading
Perform a first reduction during the collaborative loading (see the sketch below)
:::warning
Warning: kernel launch parameters must be scaled accordingly !
:::
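A sketch of the cascaded load (grid-stride accumulation before the tree step; the grid must be sized so each thread gets several elements):

```cuda
// Each thread sequentially sums several input elements while loading,
// leaving only blockDim.x partial sums for the tree reduction.
float acc = 0.0f;
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
     i += gridDim.x * blockDim.x)
    acc += in[i];
sdata[threadIdx.x] = acc;
__syncthreads();
```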

## Last optimizations
Loop unrolling
- Unroll tree reduction loop for the last warp (less sync needed)
- Unroll all tree reduction loops (need to know block size)
- Unroll the sequential reduction loop (knowing the work per thread)
# Histogram computation
## Mandelbrot practice session
During the practice session, you will have to compute the **cumulated histogram** of the image.
1. Compute the histogram $H$
- Count the number of occurrences of each value
2. Compute the cumulated histogram $C$

:::warning
Inefficient, *non-coalesced* memory access
:::
## First sample code

*What is the issue?*
Add a data array.

## Parallel algorithm using output privatization
### Local histogram
#### Initialization
Shared memory must be initialized
This can be done with the "comb-like" pattern

:::warning
We need synchronization after this stage
:::
#### Computation
Like the previous code, but with local atomics

:::warning
We need synchronization after this stage
:::
#### Commit to global memory

`int n = blockDim.x * threadIdx.y + threadIdx.x`
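A minimal sketch putting the three stages together (names are assumptions; 256 `uint8` bins):

```cuda
__global__ void histogram_privatized(const unsigned char* img, int n,
                                     unsigned int* hist) {
    __shared__ unsigned int local[256];
    int tid     = blockDim.x * threadIdx.y + threadIdx.x;  // linear thread index
    int threads = blockDim.x * blockDim.y;

    for (int b = tid; b < 256; b += threads)   // comb-like initialization
        local[b] = 0;
    __syncthreads();

    int i = blockIdx.x * threads + tid;
    if (i < n)
        atomicAdd(&local[img[i]], 1u);         // local (shared-memory) atomic
    __syncthreads();

    for (int b = tid; b < 256; b += threads)   // commit to global memory
        atomicAdd(&hist[b], local[b]);
}
```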
## Summary
Performance boosters:
- Coalesced accesses
- Output privatization
Requirements:
- atomics
- synchronization
# Scan pattern
## What is a scan ?
:::info
Scan computes all partial reductions of a collection
:::

Usage:
- Integration (cumulated histogram)
- Resource allocation (memory to parallel threads, camping spots...)
- Base building block for many algorithms (sorts, string comparisons)
## Performance baselines

**Sequential version**
### Naive parallel version
Have every thread add up all the input elements needed for its output element
## Scan pattern at the warp or block level
### Kogge-Stone

- Number of steps: $\log N$ (good)
- Resource efficiency: good
- Work efficiency: $\sim N\log N$ (bad)
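A minimal single-block sketch of Kogge-Stone (inclusive scan; launch with `blockDim.x * sizeof(float)` bytes of shared memory and `n <= blockDim.x`):

```cuda
__global__ void scan_kogge_stone(const float* in, float* out, int n) {
    extern __shared__ float tmp[];
    int tid = threadIdx.x;
    tmp[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float v = 0.0f;
        if (tid >= stride) v = tmp[tid - stride];  // read before anyone overwrites
        __syncthreads();
        if (tid >= stride) tmp[tid] += v;
        __syncthreads();
    }
    if (tid < n) out[tid] = tmp[tid];
}
```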
### Brent-Kung

- Number of steps: $2\log N$
- Resource efficiency: all warps remain active till the end (bad)
- Work efficiency: $2N$ (good)
### Sklansky

- Number of steps: $\log N$
- Resource efficiency: good
- Work efficiency: $\frac{N}{2}\log N$ (good)
## Scan Pattern at the Block or Grid Level
The patterns before can be applied:
- At the warp level (no sync until Volta)
- At the block level (thread sync)
At the global level: multi-level kernel applications in global memory
- Scan then propagate
- Reduce then scan
### Scan then propagate

### Reduce then scan

## Summary
