# Modeling Deep Learning Accelerator Enabled GPUs
###### papers: [link](https://arxiv.org/pdf/1811.08309.pdf)
###### video: `none`
###### slide: `none`
###### tags: `GPUs`
# Outline
* 1. Introduction
* 2. Background
* Volta Microarchitecture
* Warp Matrix Function (WMMA) API
* PTX Instruction Set
* Tensor Core
* 3. DEMYSTIFYING NVIDIA’S TENSOR CORES
* Microbenchmarks
* Operand matrix element mapping
* Machine ISA interface
* HMMA Instruction Analysis
* Discussion
* 4. TENSOR CORE MICROARCHITECTURE
* 5. MODELING AND EVALUATION
# 1. Introduction
* Background of this paper
* Deep neural networks are having an impact in a growing number of areas, but their benefits come at the expense of high computational cost.
* Problem of DNNs
* DNNs require performing a large number of multi-dimensional matrix computations.
* Previous achievements
* Recent research has explored how to accelerate these operations, and many companies are **developing custom hardware** for these workloads.
* The subject of this paper
* This paper investigates the **NVIDIA tensor cores** found in both the Volta and Turing GPU architectures.
* Contribution
* It shows how different threads cooperate in transferring an input matrix to each tensor core.
* It gives an in-depth analysis of the execution of the tensor operation on the tensor cores and describes the microbenchmarks we used to perform our analysis.
* It proposes a microarchitectural model of tensor cores consistent with the characteristics revealed through our microbenchmarks.
* It describes our functional and timing model changes for modeling tensor cores in GPGPU-Sim.
* It describes support we added to enable applications built with NVIDIA’s CUTLASS library to run on GPGPU-Sim.
* It quantifies the accuracy of our modified GPGPU-Sim by running tensor core enabled kernels generated with CUTLASS, thereby demonstrating an IPC correlation of 99.6%.
# 2. Background
## 2.1 Volta Microarchitecture

* NVIDIA GPUs, including Volta, are generally composed of multiple Streaming Multiprocessors (SMs) connected by an on-chip network to multiple memory partitions.
* In comparison to Pascal (the previous-generation GPU), each streaming multiprocessor in Volta has **twice as many scheduling units** and adds **two tensor cores**.
## 2.2 Warp Matrix Function (WMMA) API
* **"warp matrix function" C++ language API** to enable programmers to use the tensor cores on supported GPUs, also called **"warp-level matrix multiply and accumulate" (WMMA)**
* Using the WMMA API, **all threads in a warp cooperatively work together** to perform a matrix-multiply and accumulate operation.
* **CUDA 9.0 supports only one tile size, 16 × 16 × 16**, while later versions allow additional flexibility.
* Each tile is further divided into "fragments" where **a fragment is a set of tile elements that are mapped into the registers of a single thread**. Thus, **input matrices are distributed across different threads** and each thread contains only a portion of a tile.
* The CUDA WMMA API provides three new functions (see the usage sketch after this list):
* load_matrix_sync
* store_matrix_sync
* mma_sync
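
A minimal usage sketch of these three functions (our illustration, not code from the paper): one warp computes a single 16 × 16 tile D = A × B + C in mixed precision. The kernel name, pointer arguments, and the leading dimension of 16 are assumptions for a standalone tile.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a single 16x16x16 tile.
// A and B are FP16; the accumulator is FP32 (mixed-precision mode).
__global__ void wmma_tile(const half *a, const half *b, const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                  // distribute the A tile across the warp
    wmma::load_matrix_sync(b_frag, b, 16);                  // distribute the B tile across the warp
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);     // warp-wide multiply-accumulate on the tensor cores
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```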
## 2.3 PTX Instruction Set



* The WMMA API functions compile down to PTX **wmma.load**, **wmma.store**, and **wmma.mma** instructions.
* The wmma.load instructions load operand matrices A, B and C into registers ra, rb and rc, respectively; pa, pb and pc are the memory addresses where the operand matrices are stored.
* the "**sync**" qualifier indicates that the instruction waits for all threads in the warp to synchronize before beginning execution.
* The "**layout**" qualifier specifies whether **an operand matrix is stored in memory with a row-major or column-major layout**.
* The "**shape**" qualifier represents the fragment size of the operand matrices.
* The "**type**" qualifier indicates the precision of the operand matrices
* The "**stride**" operand specifies the beginning of each row
## 2.4 Tensor Core
* Each tensor core is a programmable compute unit specialized for accelerating machine learning workloads
* Each tensor core can complete **a single 4 × 4 matrix multiply-and-accumulate operation each clock cycle**.
* The tensor cores have two modes of operation: FP16 and mixed precision.
* In FP16 mode, the tensor core reads **three 4 × 4 16-bit floating-point matrices** as source operands
* In mixed-precision mode, it reads **two 4 × 4 16-bit floating-point matrices** along with a third **4 × 4 32-bit floating-point accumulation matrix**.
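
As a plain-arithmetic reference for what one tensor-core operation computes per cycle (a sketch of the math only, not of the hardware datapath): D = A × B + C on 4 × 4 tiles, shown here for mixed-precision mode.

```cuda
#include <cuda_fp16.h>

// Reference for one tensor-core operation: D = A * B + C on 4x4 tiles.
// In mixed-precision mode A and B are FP16 while C and D are FP32;
// each output element is a four-element dot product accumulated in FP32.
__device__ void macc4x4_mixed(const half A[4][4], const half B[4][4],
                              const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];                        // FP32 accumulator input
            for (int k = 0; k < 4; ++k)                 // four-element dot product
                acc += __half2float(A[i][k]) * __half2float(B[k][j]);
            D[i][j] = acc;
        }
}
```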
# 3. DEMYSTIFYING NVIDIA’S TENSOR CORES

* To execute a WMMA operation, the 32 threads in a warp are divided into 8 threadgroups.
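
A minimal sketch of this decomposition, assuming a threadgroup is four consecutive lanes (the exact grouping is our reading of the paper's figures, so treat it as an assumption):

```cuda
// Threadgroup decomposition of a 32-thread warp, assuming a threadgroup is
// four consecutive lanes: lanes 0-3 form threadgroup 0, lanes 4-7 form
// threadgroup 1, and so on, giving 8 threadgroups per warp.
__device__ int threadgroup_of(int lane_id) {
    return lane_id / 4;
}
```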
## 3.1 Microbenchmarks
* We employ two types of microbenchmarks:
1. how data moves into and out of the tensor cores
2. how long the tensor cores take to perform operations
* Fragment-to-thread mapping (see the sketch at the end of this subsection):

* Analyzing machine instructions:

1. The first microbenchmark analyzes **how data is accessed by HMMA instructions**. We use radare2 to replace all HMMA instructions except one with “no operation” (NOP) instructions.

2. The second microbenchmark analyzes **the timing of low-level operations on the tensor cores**.
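
A sketch of the fragment-to-thread mapping probe referenced above (our reconstruction, not the authors' code): the source tile is filled with its own element indices on the host, so after load_matrix_sync each lane can report exactly which elements of A landed in its registers. Swapping matrix_a for matrix_b or accumulator probes the B and C mappings the same way.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
#include <cstdio>
using namespace nvcuda;

// Each lane prints which elements of the 16x16 A tile it holds after the load.
// The host is assumed to have filled a[i] with the value i before launch.
__global__ void probe_a_mapping(const half *a) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::load_matrix_sync(a_frag, a, 16);
    int lane = threadIdx.x % 32;
    for (int i = 0; i < a_frag.num_elements; ++i)
        printf("lane %2d, register slot %d -> A element %4.0f\n",
               lane, i, __half2float(a_frag.x[i]));
}
```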
## 3.2 Operand matrix element mapping
1. Volta Tensor Cores:

2. Turing Tensor Cores:

* Turing's tensor cores support three new precision modes: 1-bit, 4-bit and 8-bit, along with three new tile sizes: **32 × 8 × 16** and **8 × 32 × 16** for the 8- and 16-bit modes and **8 × 8 × 32** for the 4-bit mode.
* For all modes and configurations, each row or column is loaded by a threadgroup, and consecutive threadgroups load consecutive rows or columns.
## 3.3 Machine ISA interface
1. Volta Tensor Cores:


* We found that **wmma.load** and **wmma.store** PTX instructions are implemented by being broken into **a group of normal SASS load** (LD.E.64, LD.E.128, LD.E.SYS) and **store** (ST.E.SYS) instructions.
* The latency of the wmma.mma API in mixed-precision mode is ten cycles lower than in FP16 mode.
2. Turing Tensor Cores:
* For Turing, each PTX wmma.mma instruction is broken into a group of four HMMA instructions for all modes except 4-bit where it is converted into a single HMMA instruction.

## 3.4 HMMA Instruction Analysis
1. Volta:

2. Turing:

## 3.5 Discussion
* Why is execution broken into “sets” and “steps”?
* To exploit the hardware's octet organization: threadgroups work in pairs to compute 8 × 8 subtiles of the result. We call **each such pair of threadgroups an octet**; there are four octets in a warp.


* Matrices A and B are loaded twice, by threads in different threadgroups. This enables each octet to work independently.
# 4. TENSOR CORE MICROARCHITECTURE
* Recall each tensor core completes a 4×4 matrix-multiply and accumulate each cycle.
* To achieve this, each tensor core must be able to perform **16 four-element dot-products (FEDPs)** each cycle.
* Each warp utilizes two tensor cores. We assume two octets within a warp access each tensor core, so a warp's HMMA instruction executes 32 FEDPs per cycle.
* As each tensor core consists of 16 FP16 FEDP units, it is capable of completing one 4 × 4 matrix multiplication each cycle.
* For operand **matrices A and C**, each threadgroup fetches its operands into a **separate buffer**, whereas for operand **matrix B** both threadgroups fetch into a **shared buffer**.
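
Putting these numbers together (our back-of-the-envelope accounting, to make the assumed throughput explicit):

```cuda
// Throughput accounting for the model above: each element of a 4x4 result
// tile is one four-element dot product (FEDP), and a warp drives two tensor cores.
const int fedps_per_tensor_core = 4 * 4;  // 16 FEDP units -> one 4x4x4 MACC per cycle
const int tensor_cores_per_warp = 2;      // two octets feed each tensor core
const int fedps_per_warp_cycle  = fedps_per_tensor_core * tensor_cores_per_warp;  // 32
const int macs_per_warp_cycle   = fedps_per_warp_cycle * 4;                       // 128 multiply-accumulates
```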

# 5. MODELING AND EVALUATION
* We extended the current version of GPGPU-Sim to support 16-bit floating point by using a half-precision C++ header-only library (see the sketch at the end of this section).

* The performance gain obtained using the cuBLAS GEMM kernel is larger than that of the WMMA GEMM implementation.
* We implemented **a model for the Volta tensor cores in GPGPU-Sim** and found its performance agreed well with hardware, obtaining a 99.6% IPC correlation versus a Titan V GPU. As part of our efforts we also enabled CUTLASS, NVIDIA's open-source CUDA C++ template library supporting tensor cores, on GPGPU-Sim.
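
A rough sketch of the half-precision arithmetic the extended functional model needs to emulate (assuming an open-source header-only library such as half.hpp; the library choice and function names here are our assumptions, and the FP16-mode version only captures the storage format, not the exact internal rounding):

```cpp
#include <half.hpp>                // header-only FP16 type (assumed: the "half" library)
using half_float::half;

// Mixed-precision mode: FP16 inputs, FP32 accumulation.
float fma_mixed(half a, half b, float c) {
    return static_cast<float>(a) * static_cast<float>(b) + c;
}

// FP16 mode: FP16 inputs and FP16 accumulator, result rounded back to FP16.
half fma_fp16(half a, half b, half c) {
    return half(static_cast<float>(a) * static_cast<float>(b) + static_cast<float>(c));
}
```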