---
slideOptions:
transition: slide
---
# GPU
## CUDA intro
### GPU introduction
![](https://i.imgur.com/RJaOX7Q.png)
* memory hierarchy: global memory --> PBSM (per-block shared memory) --> local registers
* multiple streaming multiprocessors (SMs):
![http://www.ece.lsu.edu/gp/refs/gf100-whitepaper.pdf](https://i.imgur.com/1vfxsXy.png)
* each SM is a vector machine with a shared register file and programmable cache
### CUDA
* Kernel = Many Concurrent Threads
![](https://i.imgur.com/zx1Cx8y.png)
* kernel = grid of thread blocks
* Each thread executes the same code but **on different data, selected by its thread ID**
* threads are grouped into thread blocks
![](https://i.imgur.com/pBZ2VJO.png)
* blocks are independent of each other
![](https://i.imgur.com/TgO0qAl.png)
* Thread Level Scheduling - Warp
* ![](https://i.imgur.com/5LjZZGJ.jpg)
* Once a thread block is scheduled to an SM, threads in the thread block are further partitioned into warps.
* a warp consists of 32 consecutive threads, and all threads in a warp execute in SIMT fashion, i.e., all threads execute the same instruction, each on its own private data
* Threads in a warp will be executing the same instruction
* Instructions are issued per warp
> Keep the number of threads per block a multiple of warp size (32).
> Avoid small block sizes: Start with at least 128 or 256 threads per block.
> Adjust block size up or down according to kernel resource requirements.
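
A minimal sizing sketch following these tips (`N`, `d_A`, and `my_kernel` are illustrative names, not from the source):
```cuda
int nT = 256;                 // threads per block: a multiple of the 32-thread warp size
int nB = (N + nT - 1) / nT;   // round up so that nB * nT >= N covers every element
my_kernel<<< nB, nT >>>(d_A, N);
```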
### CUDA language
* `kernelFunc<<< nB, nT, nS, Sid >>>(...);`
* `nB`: number of blocks per grid
* `nT`: number of threads per block
* `nS`: dynamic shared memory size in bytes (optional)
* `Sid`: stream ID (optional)
* built-in variables: `threadIdx`, `blockIdx`, `blockDim`, `gridDim`
* eg1
```cuda
dim3 grid(3,2);   // blocks per grid (x,y)
dim3 blk(5,3);    // threads per block (x,y)
my_kernel<<< grid, blk >>>();
// dim3 takes (x,y) or (x,y,z)
```
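* With this configuration, the launch creates 3 × 2 = 6 blocks of 5 × 3 = 15 threads each, i.e. 90 threads in total.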
* eg2
```cuda
// Kernel definition: each thread increments one element of A
__global__ void VecAdd(float* A)
{
    // linear index within a 2-D thread block
    int i = threadIdx.y * blockDim.x + threadIdx.x;
    A[i] = A[i] + 1;
}
//...
int size = 10;
dim3 blk(size, size);       // size x size threads in one block
VecAdd<<< 1, blk >>>(A);    // launch the kernel defined above
```
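
eg2 uses a single block; for grids with several blocks, the global index combines `blockIdx`, `blockDim`, and `threadIdx` (a minimal sketch; `VecAdd2`, `d_A`, and `N` are illustrative):
```cuda
__global__ void VecAdd2(float* A, int N)
{
    // global 1-D index: block offset plus position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)              // guard: the last block may extend past N
        A[i] = A[i] + 1;
}
// launch with enough blocks to cover all N elements:
// VecAdd2<<< (N + 255) / 256, 256 >>>(d_A, N);
```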
* `blockDim.x`: number of threads per block in the x dimension
* function qualifier:
* `__device__`: Executed on the device, callable from the device only
* `__global__`: Executed on the device, **callable from the host only (must have void return type!)**
* Variable qualifier:
* `__device__`: in the device's global memory space (shared among all SMs)
* `__constant__`: in constant memory (read-only from kernels)
* `__shared__`: in the **thread block's** shared memory
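* a side-by-side sketch of the three variable qualifiers (names are illustrative, not from the source):
```cuda
__device__   int   g_counter;     // device global memory, visible to all SMs
__constant__ float g_params[4];   // constant memory, read-only from kernels

__global__ void use_them(int *out)
{
    __shared__ int tile[256];     // one copy per thread block
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x] + g_counter + (int)g_params[0];
}
```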
* Program compilation
* ![](https://i.imgur.com/KFZk7wH.png)
* Kernel Time Measurement
```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);                    // timestamp before the launch
kernel<<<block, thread>>>();
cudaEventRecord(stop);                     // timestamp after the launch
cudaEventSynchronize(stop);                // wait until the stop event is reached
float time;                                // elapsed time in milliseconds
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
```
* Dynamic Parallelism
* launch new grids from the GPU
* http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-New-Features-in-the-CUDA-Programming-Model.pdf
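* a minimal sketch of a device-side launch (requires compute capability >= 3.5 and compilation with `-rdc=true`; kernel names are illustrative):
```cuda
#include <cstdio>

__global__ void child()
{
    printf("child block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

__global__ void parent()
{
    // the same <<< >>> launch syntax, but issued from the GPU itself
    if (threadIdx.x == 0)
        child<<< 2, 4 >>>();
}
```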
## GPU Architecture
### Thread execution
* execution model
![](https://i.imgur.com/WmxrPpk.png)
![](https://i.imgur.com/k0ijWsk.png)
![](https://i.imgur.com/lJLOxS2.png)
![](https://i.imgur.com/9W7DzDt.png)
* Hardware schedules thread blocks onto available SMs
* warp
* At any given time, each warp scheduler in an SM issues instructions from at most one warp
* All threads in a warp execute the same instruction when the warp is selected
* Threads in a warp execute physically in parallel
* Warps and blocks execute logically in parallel
![](https://i.imgur.com/RdK7TaU.png)
### memory hierarchy
![](https://i.imgur.com/OxKdgHC.png)
* registers
* consume zero extra clock cycles per instruction, except for
* register memory bank conflicts
* not indexable
* **Array variables** are always allocated in **local memory**
* Register spilling: local memory is used if the register limit is exceeded
* limits (compute capability 6.1):
* 65,536 (64K) 32-bit registers per block
* 255 registers per thread
* local memory
* usually used when a kernel runs out of SM resources
* "local" means that each thread has its own private area
* **Array variables** are always allocated in **local memory**
* Stores are cached in L1
* Not really a “memory” – bytes are stored in global memory (DRAM)
![](https://i.imgur.com/uMnl1h2.png)
* eg.
```cuda
__device__ void distance(int m, int n, int *V){
int i, j, k; // register
int a[10], b[10], c[10]; // local memory, off-chip DRAM
...
}
```
* shared memory
* shared by all the **threads in a block**
* Size: at most 48K per block (compute capability 6.1)
* `extern __shared__ <type> <name>[]`
* dynamic allocation: pass the byte count as `nS` in `func<<<nB, nT, nS>>>(...);`
```cuda
#define N 32   // matrix/block width, fixed at compile time for this sketch
extern __shared__ int S[];   // dynamic shared memory, sized by nS at launch
__global__ void FW_APSP(int k, int *D) {   // D: N x N matrix, row-major
    int i = threadIdx.x;
    int j = threadIdx.y;
    S[i*N + j] = D[i*N + j];   // move data to shared memory
    __syncthreads();           // wait until every thread has loaded its entry
    // do computation: one relaxation step of Floyd-Warshall
    if (S[i*N + j] > S[i*N + k] + S[k*N + j])
        D[i*N + j] = S[i*N + k] + S[k*N + j];
}
```
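
The matching launch supplies the size of `S` through the third launch parameter (`d_D` is a hypothetical device pointer to the distance matrix):
```cuda
// one N x N thread block; N*N ints of dynamic shared memory
FW_APSP<<< 1, dim3(N, N), N * N * sizeof(int) >>>(k, d_D);
```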
* architecture
* Each bank can service one address per cycle
* Shared memory is as fast as registers if there are no bank conflicts
* Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle.
* For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16
* https://stackoverflow.com/questions/7903566/how-is-2d-shared-memory-arranged-in-cuda
![](https://i.imgur.com/j0CG2L0.png)
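* a common workaround when a warp reads a shared tile column-wise is to pad each row by one element, so consecutive rows start in different banks (the standard padding trick, sketched here for a hypothetical 32 × 32 tile transpose):
```cuda
__global__ void transpose_tile(float *out, const float *in)
{
    // 33, not 32: the padding column shifts each row by one bank,
    // so a column-wise read hits 32 different banks instead of one
    __shared__ float tile[32][33];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];   // row-wise write: conflict-free
    __syncthreads();
    out[y * 32 + x] = tile[x][y];  // column-wise read: conflict-free thanks to padding
}
```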
* global memory
* memory coalescing
* Global memory access happens in transactions of 32 or 128 bytes
* The hardware coalesces a warp's accesses into as few transactions as possible
* Coalesced access:
> A group of 32 contiguous threads ("warp") accessing adjacent words. Few transactions and high utilization
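
A sketch contrasting the two patterns (kernel names are illustrative): in the first, a warp's 32 threads touch 32 adjacent words, served by one or two transactions; in the second, each thread lands `stride` words from its neighbor, forcing many transactions:
```cuda
__global__ void coalesced(float *A)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    A[i] = 2.0f * A[i];   // adjacent threads -> adjacent words
}

__global__ void strided(float *A, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    A[i] = 2.0f * A[i];   // adjacent threads are `stride` words apart
}
```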
* constant memory
* Same usage and scope as global memory, except:
* Read only
* Declared with the variable qualifier `__constant__`
* Moved with `cudaMemcpyToSymbol()` & `cudaMemcpyFromSymbol()`
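* a minimal sketch of declaring and filling a constant buffer (`coeff` and `scale` are illustrative names):
```cuda
__constant__ float coeff[16];   // lives in constant memory, read-only in kernels

__global__ void scale(float *A)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // every thread in a block reads the same word: the broadcast fast path
    A[i] *= coeff[blockIdx.x % 16];
}

// host side, before the launch:
// float h_coeff[16] = { /* ... */ };
// cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
```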
* warp divergence
![](https://i.imgur.com/t2dsPrn.png)
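* a minimal example of intra-warp divergence (illustrative kernel): even and odd lanes of the same warp take different branches, so the warp runs both paths serially with the inactive lanes masked off:
```cuda
__global__ void divergent(int *out)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = 2 * i;   // pass 1: even lanes execute, odd lanes wait
    else
        out[i] = 3 * i;   // pass 2: odd lanes execute, even lanes wait
}
// branching on the warp index instead, e.g. (i / 32) % 2, would keep each
// warp on a single path and avoid the serialization
```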
## shared memory
* ref: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x
* For compute capability 5.x and 6.x:
* Shared memory has 32 banks that are organized such that successive **32-bit words** map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.
* bank index = (byte address / 4 bytes per bank) % 32 banks
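* e.g., with `__shared__ float A[64]` starting at byte address 0, `A[0]` maps to bank (0 / 4) % 32 = 0, `A[1]` to bank 1, ..., and `A[32]` wraps back to bank 0. A warp reading `A[2 * threadIdx.x]` therefore touches only the 16 even-numbered banks, two threads per bank: a two-way bank conflict.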
## Reference
* https://docs.nvidia.com/cuda/
* https://nanxiao.me/en/about/
* http://www.icl.utk.edu/~luszczek/teaching/courses/fall2016/cosc462/pdf/GPU_Fundamentals.pdf
* https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
* https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_WarpsAndOccupancy.pdf
* https://devblogs.nvidia.com/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/
* http://www.enseignement.polytechnique.fr/profs/informatique/Eric.Goubault/Cours09/CUDA/SC07_CUDA_5_Optimization_Harris.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf