---
title: GPU
slideOptions:
  transition: slide
---

# GPU

## CUDA intro
### GPU introduction
![](https://i.imgur.com/RJaOX7Q.png)
* memory hierarchy: global memory --> PBSM (per-block shared memory) --> local registers
* multiple streaming multiprocessors (SMs):
   
   ![http://www.ece.lsu.edu/gp/refs/gf100-whitepaper.pdf](https://i.imgur.com/1vfxsXy.png)    
   
    * vector machine, shared register file, programmable cache

### CUDA
* Kernel = Many Concurrent Threads

    ![](https://i.imgur.com/zx1Cx8y.png)
    
    * kernel = grid of thread blocks 
    * Each thread executes the same code but **on different data, selected by its thread ID**
    * threads are grouped into thread blocks
    
    ![](https://i.imgur.com/pBZ2VJO.png)
    
    * blocks are independent of each other 
    
    ![](https://i.imgur.com/TgO0qAl.png)

* Thread Level Scheduling - Warp
* ![](https://i.imgur.com/5LjZZGJ.jpg)

    * Once a thread block is scheduled onto an SM, its threads are further partitioned into warps. 
        * A warp consists of 32 consecutive threads, all executing in SIMT fashion: every thread in the warp executes the same instruction, each on its own private data
        * Instructions are issued per warp
    > Keep the number of threads per block a multiple of warp size (32).
    > Avoid small block sizes: Start with at least 128 or 256 threads per block.
    > Adjust block size up or down according to kernel resource requirements.
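
    A launch-configuration pattern consistent with these guidelines (a minimal sketch; `n`, `d_data`, and `my_kernel` are placeholders, not from the slides):

    ```cuda
    int block = 256;                      // a multiple of the warp size (32)
    int grid  = (n + block - 1) / block;  // ceiling division: cover all n elements
    my_kernel<<< grid, block >>>(d_data, n);
    ```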
### CUDA language
* `kernelFunc<<< nB, nT, nS, Sid >>>(...);`
    * `nB`: number of blocks per grid
    * `nT`: number of threads per block
    * `nS`: shared memory size in bytes (optional, for dynamic allocation)
    * `Sid`: stream ID (optional)
* `threadIdx; blockIdx; blockDim; gridDim`
    * eg1
    ```cuda
    dim3 grid(3, 2);  // 3x2 blocks per grid; dim3 is (x,y) or (x,y,z)
    dim3 blk(5, 3);   // 5x3 threads per block
    my_kernel<<< grid, blk >>>();
    ```
    * eg2
    ```cuda
    // Kernel definition
    __global__ void VecAdd(float* A)
    {
        // linearize the 2-D thread index within the block
        int i = threadIdx.y * blockDim.x + threadIdx.x;
        A[i] = A[i] + 1;
    }
    //...
    int size = 10; dim3 blk(size, size);
    VecAdd<<< 1, blk >>>(A);
    ```
    * `blockDim.x`: number of threads per block in the x dimension
    
* function qualifier:
    * `__device__`: Executed on the device, Callable from the device only
    * `__global__`: Executed on the device, **Callable from the host only (must have void return type!)**
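
    A minimal sketch of the two qualifiers together (`square` and `squares` are illustrative names, not from the slides):

    ```cuda
    __device__ float square(float x) { return x * x; }  // callable from device code only

    __global__ void squares(float *a)                    // launched from the host
    {
        a[threadIdx.x] = square(a[threadIdx.x]);
    }
    ```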
* Variable qualifier:
    * `__device__`: resides in the device’s global memory space (shared among SMs)
    * `__constant__`
    * `__shared__`:  in **thread block's** shared memory
* Program compilation
    * ![](https://i.imgur.com/KFZk7wH.png)
* Kernel Time Measurement

    ```cuda
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);      // enqueue start event
    kernel<<<block, thread>>>();
    cudaEventRecord(stop);       // enqueue stop event after the kernel
    cudaEventSynchronize(stop);  // block the host until stop is reached
    float time;                  // elapsed time in milliseconds
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    ```
* Dynamic Parallelism
    * launch new grids from the GPU
    *  http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-New-Features-in-the-CUDA-Programming-Model.pdf
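
    A hedged sketch (assumes compute capability >= 3.5 and compilation with `-rdc=true`; kernel names are illustrative):

    ```cuda
    __global__ void child(int *data)
    {
        data[threadIdx.x] += 1;
    }

    __global__ void parent(int *data)
    {
        if (threadIdx.x == 0)
            child<<< 1, 32 >>>(data);  // a new grid launched from GPU code
    }
    ```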

## GPU Architecture
### Thread execution
* execution model 

![](https://i.imgur.com/WmxrPpk.png)

![](https://i.imgur.com/k0ijWsk.png)

![](https://i.imgur.com/lJLOxS2.png)

![](https://i.imgur.com/9W7DzDt.png)

* Hardware schedules thread blocks onto available SMs
* warp
    * At any time, only one warp is executed per SM
    * All threads in a Warp execute the same instruction when selected
    * Threads in a warp execute physically in parallel
    * Warps and blocks execute logically in parallel
     
    ![](https://i.imgur.com/RdK7TaU.png)

### memory hierarchy
![](https://i.imgur.com/OxKdgHC.png)
* registers
    * consume zero extra clock cycles per instruction, except on
        * register memory bank conflicts
    * not indexable
        * **Array variables** are therefore allocated in **local memory**
    * Register spilling: local memory is used if the register limit is exceeded
        * limits (compute capability 6.1):
            * 64K (65,536) registers per block
            * 255 registers per thread
* local memory 
    * used when a thread runs out of SM resources (e.g., registers)
    * "local" means each thread has its own private area
    * **Array variables** are always allocated in **local memory**
    * Stores are cached in L1
    * Not really a separate "memory" – the bytes actually live in global memory (DRAM)
    ![](https://i.imgur.com/uMnl1h2.png)
    
    * eg.
    ```cuda
    __device__ void distance(int m, int n, int *V)
    {
        int i, j, k;              // registers
        int a[10], b[10], c[10];  // local memory, off-chip DRAM
        ...
    }
    ```
* shared memory
    * shared by all the **threads in a block**
    * Size: at most 48 KB per block (compute capability 6.1)
    * `extern __shared__ <type> <name>[]` (pg.7)
    * dynamic allocation: `nS` in `func<<<nB, nT, nS>>>(..);`
     
        ```cuda
        extern __shared__ int S[];  // dynamic shared memory must be declared 1-D
        __global__ void FW_APSP(int k, int n, int *D) {
            int i = threadIdx.x;
            int j = threadIdx.y;
            S[i*n + j] = D[i*n + j];  // move data to shared memory
            __syncthreads();
            // do computation (Floyd-Warshall relaxation for intermediate node k)
            if (S[i*n + j] > S[i*n + k] + S[k*n + j])
                D[i*n + j] = S[i*n + k] + S[k*n + j];
        }
        // launch: FW_APSP<<< 1, dim3(n, n), n*n*sizeof(int) >>>(k, n, D);
        ```
    * architecture
        * Each bank can service one address per cycle
        * Shared memory is as fast as register if no bank conflict
        *  Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle. 
        *  For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16
        *  https://stackoverflow.com/questions/7903566/how-is-2d-shared-memory-arranged-in-cuda
        
        ![](https://i.imgur.com/j0CG2L0.png)

* global memory
    * memory coalescing
        * Global memory access happens in transactions of 32 or 128 bytes
        * The hardware will try to reduce to as few transactions as possible
        * Coalesced access:
        > A group of 32 contiguous threads ("warp") accessing adjacent words. Few transactions and high utilization
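
        A sketch contrasting the two patterns (kernel names are illustrative):

        ```cuda
        // Coalesced: consecutive threads read consecutive 4-byte words, so a
        // warp's 32 loads can be served by a single 128-byte transaction
        __global__ void coalesced(float *out, const float *in)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = in[i];
        }

        // Strided: consecutive threads touch words 32 elements apart, so the
        // same warp needs many separate transactions
        __global__ void strided(float *out, const float *in)
        {
            int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
            out[i] = in[i];
        }
        ```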
    
* constant memory
    * Same usage and scope as global memory, except:
        * Read only
        * Declared with the variable qualifier `__constant__`
        * Moved via `cudaMemcpyToSymbol()` / `cudaMemcpyFromSymbol()`
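
    A minimal sketch (`coef`, `scale`, and `setup` are illustrative names):

    ```cuda
    __constant__ float coef[16];  // resides in constant memory

    __global__ void scale(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * coef[i % 16];  // read-only, cached access
    }

    // host side: fill constant memory before launching the kernel
    void setup(const float *h_coef)
    {
        cudaMemcpyToSymbol(coef, h_coef, 16 * sizeof(float));
    }
    ```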


* warp divergence
![](https://i.imgur.com/t2dsPrn.png)
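
A sketch of divergent vs. warp-aligned branching (kernel names are illustrative):

```cuda
// Divergent: even and odd threads of the same warp take different branches,
// so the two paths are serialized within the warp
__global__ void divergent(float *a)
{
    if (threadIdx.x % 2 == 0) a[threadIdx.x] *= 2.0f;
    else                      a[threadIdx.x] += 1.0f;
}

// Warp-aligned: the branch granularity is a whole warp (32 threads), so every
// warp takes a single path and there is no intra-warp divergence
__global__ void aligned(float *a)
{
    if ((threadIdx.x / 32) % 2 == 0) a[threadIdx.x] *= 2.0f;
    else                             a[threadIdx.x] += 1.0f;
}
```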



## Shared memory

* ref: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x
* For compute capability 5.x and 6.x:
    * Shared memory has 32 banks that are organized such that successive **32-bit words** map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle. 
    * bank index = (byte address / 4 bytes per bank) % 32 banks
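
A classic illustration of avoiding bank conflicts with one word of padding (a sketch; the transpose kernel is not from the slides):

```cuda
__global__ void transpose_tile(float *out, const float *in, int n)
{
    // 33 instead of 32: padding each row by one word shifts successive rows
    // to successive banks, so a warp reading a column hits 32 different banks
    __shared__ float tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load

    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free read
}
```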

## References
* https://docs.nvidia.com/cuda/
* https://nanxiao.me/en/about/ 
* http://www.icl.utk.edu/~luszczek/teaching/courses/fall2016/cosc462/pdf/GPU_Fundamentals.pdf
* https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
* https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_WarpsAndOccupancy.pdf
* https://devblogs.nvidia.com/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/
* http://www.enseignement.polytechnique.fr/profs/informatique/Eric.Goubault/Cours09/CUDA/SC07_CUDA_5_Optimization_Harris.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf
* http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf