# Modeling Deep Learning Accelerator Enabled GPUs
###### papers: [link](https://arxiv.org/pdf/1811.08309.pdf)
###### video: `none`
###### slide: `none`
###### tags: `GPUs`
# Outline
* 1. Introduction
* 2. Background
* Volta Microarchitecture
* Warp Matrix Function (WMMA) API
* PTX Instruction Set
* Tensor Core
* 3. DEMYSTIFYING NVIDIA’S TENSOR CORES
* Microbenchmarks
* Operand matrix element mapping
* Machine ISA interface
* HMMA Instruction Analysis
* Discussion
* 4. TENSOR CORE MICROARCHITECTURE
* 5. MODELING AND EVALUATION
# 1. Introduction
* Background of this paper
* Deep neural networks are having an impact in a growing number of areas, but their benefits come at the expense of high computational cost.
* Problem of DNNs
* DNNs require performing a large number of multi-dimensional matrix computations.
* Previous achievements
* Recent research has explored how to accelerate these operations, and many companies are **developing custom hardware** for these workloads.
* The subject of this paper
* This paper investigates the **NVIDIA tensor cores** found in both the Volta and Turing GPU architectures.
* Contribution
* It shows how different threads cooperate in transferring an input matrix to each tensor core.
* It gives an in-depth analysis of the execution of the tensor operation on the tensor cores and describes the microbenchmarks we used to perform our analysis.
* It proposes a microarchitectural model of tensor cores consistent with the characteristics revealed through our microbenchmarks.
* It describes our functional and timing model changes for modeling tensor cores in GPGPU-Sim.
* It describes support we added to enable applications built with NVIDIA’s CUTLASS library to run on GPGPU-Sim.
* It quantifies the accuracy of our modified GPGPU-Sim by running tensor core enabled kernels generated with CUTLASS, thereby demonstrating an IPC correlation of 99.6%.
# 2. Background
## 2.1 Volta Microarchitecture

* NVIDIA GPUs, including Volta, are generally composed of multiple Streaming Multiprocessors (SMs) connected by an on-chip network to multiple memory partitions.
* In comparison to Pascal (the previous-generation GPU), each streaming multiprocessor in Volta has **twice as many scheduling units** and adds **two tensor cores**.
## 2.2 Warp Matrix Function (WMMA) API
* **"warp matrix function" C++ language API** to enable programmers to use the tensor cores on supported GPUs, also called **"warp-level matrix multiply and accumulate" (WMMA)**
* Using the WMMA API, **all threads in a warp cooperatively work together** to perform a matrix-multiply and accumulate operation.
* **CUDA 9.0 supports only one tile size, 16 × 16 × 16**, while later versions allow additional flexibility.
* Each tile is further divided into "fragments" where **a fragment is a set of tile elements that are mapped into the registers of a single thread**. Thus, **input matrices are distributed across different threads** and each thread contains only a portion of a tile.
* The CUDA WMMA API provides three new functions (see the usage sketch after this list):
* load_matrix_sync
* store_matrix_sync
* mma_sync
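
A minimal usage sketch of these three functions (our illustration, not code from the paper): one warp computes a single 16 × 16 tile D = A × B + C in mixed precision. The kernel name, pointer arguments, and the leading dimension of 16 are assumptions for a standalone tile.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a single 16x16x16 tile.
// A and B are FP16; the accumulator is FP32 (mixed-precision mode).
__global__ void wmma_tile(const half *a, const half *b, const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                  // distribute the A tile across the warp
    wmma::load_matrix_sync(b_frag, b, 16);                  // distribute the B tile across the warp
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);     // warp-wide multiply-accumulate on the tensor cores
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```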
## 2.3 PTX Instruction Set



* The WMMA API functions compile down to PTX **wmma.load**, **wmma.store**, and **wmma.mma** instructions.
* The wmma.load instructions load operand matrices A, B and C into registers ra, rb and rc, respectively; pa, pb and pc are the memory addresses where the operand matrices are stored.
* the "**sync**" qualifier indicates that the instruction waits for all threads in the warp to synchronize before beginning execution.
* The "**layout**" qualifier specifies whether **an operand matrix is stored in memory with a row-major or column-major layout**.
* The "**shape**" qualifier represents the fragment size of the operand matrices.
* The "**type**" qualifier indicates the precision of the operand matrices
* The "**stride**" operand specifies the beginning of each row
## 2.4 Tensor Core
* Each tensor core is a programmable compute unit specialized for accelerating machine learning workloads
* Each tensor core can complete **a single 4 × 4 matrix multiply-and-accumulate operation each clock cycle**.
* The tensor cores have two modes of operation: FP16 and mixed precision.
* In FP16 mode, the tensor core reads **three 4 × 4 16-bit floating-point matrices** as source operands
* In mixed-precision mode, it reads **two 4 × 4 16-bit floating-point matrices** along with a third **4 × 4 32-bit floating-point accumulation matrix**.
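
As a plain-arithmetic reference for what one tensor-core operation computes per cycle (a sketch of the math only, not of the hardware datapath): D = A × B + C on 4 × 4 tiles, shown here for mixed-precision mode.

```cuda
#include <cuda_fp16.h>

// Reference for one tensor-core operation: D = A * B + C on 4x4 tiles.
// In mixed-precision mode A and B are FP16 while C and D are FP32;
// each output element is a four-element dot product accumulated in FP32.
__device__ void macc4x4_mixed(const half A[4][4], const half B[4][4],
                              const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];                        // FP32 accumulator input
            for (int k = 0; k < 4; ++k)                 // four-element dot product
                acc += __half2float(A[i][k]) * __half2float(B[k][j]);
            D[i][j] = acc;
        }
}
```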
# 3. DEMYSTIFYING NVIDIA’S TENSOR CORES

* To execute a WMMA operation, the 32 threads in a warp are divided into 8 threadgroups.
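
A minimal sketch of this decomposition, assuming a threadgroup is four consecutive lanes (the exact grouping is our reading of the paper's figures, so treat it as an assumption):

```cuda
// Threadgroup decomposition of a 32-thread warp, assuming a threadgroup is
// four consecutive lanes: lanes 0-3 form threadgroup 0, lanes 4-7 form
// threadgroup 1, and so on, giving 8 threadgroups per warp.
__device__ int threadgroup_of(int lane_id) {
    return lane_id / 4;
}
```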
## 3.1 Microbenchmarks
* We employ two types of microbenchmarks:
1. how data moves into and out of the tensor cores
2. how long the tensor cores take to perform operations
* Fragment-to-thread mapping (see the sketch at the end of this subsection):

* Analyzing machine instructions:

1. The first microbenchmark analyzes **how data is accessed by HMMA instructions**. We use radare2 to replace all HMMA instructions except one with “no operation” (NOP) instructions.

2. The second microbenchmark analyzes **the timing of low-level operations on the tensor cores**.
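
A sketch of the fragment-to-thread mapping probe referenced above (our reconstruction, not the authors' code): the source tile is filled with its own element indices on the host, so after load_matrix_sync each lane can report exactly which elements of A landed in its registers. Swapping matrix_a for matrix_b or accumulator probes the B and C mappings the same way.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
#include <cstdio>
using namespace nvcuda;

// Each lane prints which elements of the 16x16 A tile it holds after the load.
// The host is assumed to have filled a[i] with the value i before launch.
__global__ void probe_a_mapping(const half *a) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::load_matrix_sync(a_frag, a, 16);
    int lane = threadIdx.x % 32;
    for (int i = 0; i < a_frag.num_elements; ++i)
        printf("lane %2d, register slot %d -> A element %4.0f\n",
               lane, i, __half2float(a_frag.x[i]));
}
```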
## 3.2 Operand matrix element mapping
1. Volta Tensor Cores:

2. Turing Tensor Cores:

* Turing's tensor cores support three new precision modes: 1-bit, 4-bit and 8-bit, along with three new tile sizes: **32 × 8 × 16** and **8 × 32 × 16** for the 8- and 16-bit modes and **8 × 8 × 32** for the 4-bit mode.
* For all modes and configurations, each row or column is loaded by a threadgroup, and consecutive threadgroups load consecutive rows or columns.
## 3.3 Machine ISA interface
1. Volta Tensor Cores:


* We found that **wmma.load** and **wmma.store** PTX instructions are implemented by being broken into **a group of normal SASS load** (LD.E.64, LD.E.128, LD.E.SYS) and **store** (ST.E.SYS) instructions.
* The latency of the wmma.mma API in mixed-precision mode is ten cycles lower than in FP16 mode.
2. Turing Tensor Cores:
* For Turing, each PTX wmma.mma instruction is broken into a group of four HMMA instructions for all modes except 4-bit where it is converted into a single HMMA instruction.

## 3.4 HMMA Instruction Analysis
1. Volta:

2. Turing:

## 3.5 Discussion
* Why is execution broken into “sets” and “steps”?
* To exploit the hardware's octet organization: threadgroups work in pairs to compute 8 × 8 subtiles of the result. We call **each such pair of threadgroups an octet**; there are four octets in a warp.


* Matrices A and B are loaded twice, by threads in different threadgroups. This enables each octet to work independently.
# 4. TENSOR CORE MICROARCHITECTURE
* Recall each tensor core completes a 4×4 matrix-multiply and accumulate each cycle.
* To achieve this, each tensor core must be able to perform **16 four-element dot-products (FEDPs)** each cycle.
* Each warp utilizes two tensor cores. We assume two octets within a warp access each tensor core, so a warp's HMMA instruction executes 32 FEDPs per cycle.
* As each tensor core consists of 16 FP16 FEDP units, it is capable of completing one 4 × 4 matrix multiplication each cycle.
* For operand **matrices A and C**, each threadgroup fetches its operands into a **separate buffer**, whereas for operand **matrix B** both threadgroups fetch into a **shared buffer**.
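
Putting these numbers together (our back-of-the-envelope accounting, to make the assumed throughput explicit):

```cuda
// Throughput accounting for the model above: each element of a 4x4 result
// tile is one four-element dot product (FEDP), and a warp drives two tensor cores.
const int fedps_per_tensor_core = 4 * 4;  // 16 FEDP units -> one 4x4x4 MACC per cycle
const int tensor_cores_per_warp = 2;      // two octets feed each tensor core
const int fedps_per_warp_cycle  = fedps_per_tensor_core * tensor_cores_per_warp;  // 32
const int macs_per_warp_cycle   = fedps_per_warp_cycle * 4;                       // 128 multiply-accumulates
```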

# 5. MODELING AND EVALUATION
* We extended the current version of GPGPU-Sim to support 16-bit floating point by using a half-precision C++ header-only library (see the sketch at the end of this section).

* The performance gain obtained using the cuBLAS GEMM kernel is larger than that of the WMMA GEMM implementation.
* We implemented **a model for the Volta tensor cores in GPGPU-Sim** and found its performance agreed well with hardware, obtaining a 99.6% IPC correlation versus a Titan V GPU. As part of our efforts we also enabled CUTLASS, NVIDIA's open-source CUDA C++ template library supporting tensor cores, on GPGPU-Sim.
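
A rough sketch of the half-precision arithmetic the extended functional model needs to emulate (assuming an open-source header-only library such as half.hpp; the library choice and function names here are our assumptions, and the FP16-mode version only captures the storage format, not the exact internal rounding):

```cpp
#include <half.hpp>                // header-only FP16 type (assumed: the "half" library)
using half_float::half;

// Mixed-precision mode: FP16 inputs, FP32 accumulation.
float fma_mixed(half a, half b, float c) {
    return static_cast<float>(a) * static_cast<float>(b) + c;
}

// FP16 mode: FP16 inputs and FP16 accumulator, result rounded back to FP16.
half fma_fp16(half a, half b, half c) {
    return half(static_cast<float>(a) * static_cast<float>(b) + static_cast<float>(c));
}
```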