# Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs

###### papers: [link](https://people.cs.nctu.edu.tw/~ttyeh/course/2020_Fall/IOC5009/file/week4-sparse-tensor-core.pdf)
###### video: `none`
###### slide: `none`
###### tags: `GPUs`

# Outline

* 1. INTRODUCTION
* 2. BACKGROUND AND MOTIVATION
  * 2.1 Sparsity-Centric Optimization for DNNs
  * 2.2 Existing Sparsifying Methods on GPUs
  * 2.3 The Characterization of Sparsity
* 3. VECTORSPARSE PRUNING
  * 3.1 Vector-wise Sparse Matrix Encoding
  * 3.2 VectorSparse: a Methodology of Iterative Vector-wise Sparsifying and Retraining
  * 3.3 GPU Kernel Design
* 4. SPARSE TENSOR CORE
  * 4.1 Baseline Tensor Core Architecture
  * 4.2 The Extension of HMMA Instruction Set
  * 4.3 Micro-architecture Design for Sparse Tensor Core
* 5. EXPERIMENTAL RESULTS

# 1. Introduction

* The background of the paper
  * Deep neural networks (DNNs) have achieved state-of-the-art performance in many different tasks.
* The problem the paper tries to improve
  * The representational power of these networks comes from their **huge parameter space**, which results in an extremely large amount of **computation and memory usage**. There have been plenty of prior works that **improve both the performance and energy efficiency of neural networks on various hardware platforms**.
* The subject of this paper
  * **Sparsity-centric optimization** techniques have achieved outstanding results.
* The problem with existing sparsity optimization
  * High sparsity does not guarantee that a sparse neural network is more **performant** than its dense counterpart, because of **the irregularity of the data layout**.
  * As a result, sparse neural networks can hardly obtain a performance gain on Graphics Processing Units (GPUs).
* The new solution to sparsity optimization: VectorSparse
  * VectorSparse is a SIMD (Single Instruction, Multiple Data)-friendly sparsifying method: it **divides a weight matrix into multiple vectors and prunes each vector to the same sparsity**.
  * The instruction set of Volta is extended so that the sparse networks can run on Tensor Cores, with only small changes to the Tensor Core itself.
* Contributions
  * We go through a comprehensive performance analysis to demonstrate the inefficiency of GPUs when running sparse neural networks.
  * We propose VectorSparse, a novel sparsifying algorithm that achieves a **63% performance** improvement with negligible accuracy drop.
  * We further extend the instruction set of the Volta GPU to support operand indexing in the register file.
  * We also show the details of the **micro-architecture design that mitigates the performance bottleneck**, which achieves a **58% performance** gain with negligible area overhead.

# 2. BACKGROUND AND MOTIVATION

* #### 2.1 Sparsity-Centric Optimization for DNNs
  * Sparsifying methods are classified into two categories:
    1. Generic sparsifying
       ![](https://i.imgur.com/0Isx82D.png)
       * Advantage: high compression rate
       * Disadvantages:
         1. The variation of the row/column lengths of a generic sparse matrix makes it difficult to partition the workload evenly on a GPU.
         2. The amount of computation in a highly sparse matrix is not enough to hide the long memory access latency, so the benefit from the high sparsity vanishes.
         3. The sparse layers are less performant than dense layers, even when the sparsity reaches 96%.
    2. Unified sparsifying
       ![](https://i.imgur.com/BH5rEf3.png)
       * Advantage: it is easy to encode/decode the coordinate information.
       * Disadvantages:
         1. Even though this approach allows more flexibility by adjusting the block size, it cannot use the highly optimized dense libraries to get a good performance boost over the dense counterparts.
         2. It causes an **accuracy drop**.
* #### 2.3 The Characterization of Sparsity
  ![](https://i.imgur.com/nfjVVkj.png)
  * After a network is pruned (generic sparsifying), each row of its weight matrices is split into multiple L-dim vectors.
  * Vector V(y, x) contains the elements with row index y and column indices from x × L to (x + 1) × L − 1 in the M × N weight matrix W. If N is not divisible by L, the residue vectors are padded with zeros.
  * The local sparsity of a vector is defined as the number of zeros in it divided by L (a small sketch of this measurement follows at the end of this section).
  ![](https://i.imgur.com/zJoAmWU.png)
  * When L is larger than 8, fewer than 5% of the vectors have a local sparsity below 75%. This shows that the sparsity of the vectors is sufficiently balanced.
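The local-sparsity measurement above can be reproduced in a few lines of NumPy. This is a minimal sketch, not code from the paper; the weight matrix and the ~90% pruning level below are made up for illustration.

```python
import numpy as np

def local_sparsity(W, L):
    """Split each row of W into L-dim vectors V(y, x) and return the
    local sparsity (fraction of zeros) of every vector."""
    M, N = W.shape
    pad = (-N) % L                       # residue vectors are padded with zeros
    Wp = np.pad(W, ((0, 0), (0, pad)))
    vectors = Wp.reshape(M, -1, L)       # vectors[y, x] corresponds to V(y, x)
    return (vectors == 0).mean(axis=-1)

# Toy stand-in for a generically pruned weight matrix (~90% zeros).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 256)) * (rng.random((64, 256)) > 0.9)

for L in (2, 4, 8, 16):
    s = local_sparsity(W, L)
    print(f"L={L:2d}: {np.mean(s < 0.75):.1%} of vectors below 75% local sparsity")
```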
# 3. VECTORSPARSE PRUNING

* After the generically sparsified matrix is cut into vectors, a large enough L ensures that the sparsity of each vector is balanced.
* The authors therefore propose a balanced vector-wise encoding format for sparse matrices, which simplifies workload partitioning on the GPU.
* ### 3.1 Vector-wise Sparse Matrix Encoding
  ![](https://i.imgur.com/4Vi47Dm.png)
* ### 3.2 VectorSparse: a Methodology of Iterative Vector-wise Sparsifying and Retraining
  ![](https://i.imgur.com/5vPsAOa.png)
  * The following figure compares the sparse-matrix density of generic sparsifying and VectorSparse. VectorSparse has a lower compression rate, but the distribution of its non-zeros is much more regular (see the sketch after this section).
  ![](https://i.imgur.com/34qyIVx.png)
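To make the vector-wise pruning and encoding concrete, here is a minimal NumPy sketch (not the paper's implementation; L, K and the exact layout of the encoding are illustrative). Each L-dim vector keeps its K largest-magnitude elements, so every vector ends up with the same sparsity 1 − K/L, and the survivors are stored as values plus their column offsets inside the vector. In the full VectorSparse methodology this pruning step alternates with retraining.

```python
import numpy as np

def vector_wise_prune_and_encode(W, L, K):
    """Prune each L-dim vector of W to the same sparsity by keeping its
    K largest-magnitude elements, and encode the survivors as
    (values, offsets), where the offsets are column positions 0..L-1
    inside each vector."""
    M, N = W.shape
    assert N % L == 0, "pad N to a multiple of L first"
    vecs = W.reshape(M, -1, L)

    # Column offsets (inside each vector) of the K largest |values|,
    # kept in ascending order so the encoding stays ordered.
    offsets = np.sort(np.argsort(-np.abs(vecs), axis=-1)[..., :K], axis=-1)
    values = np.take_along_axis(vecs, offsets, axis=-1)

    # Dense view of the pruned matrix (useful for the retraining step).
    pruned = np.zeros_like(vecs)
    np.put_along_axis(pruned, offsets, values, axis=-1)
    return values, offsets, pruned.reshape(M, N)

# Example: every 8-dim vector keeps 2 elements, i.e. 75% sparsity everywhere.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16))
values, offsets, W_pruned = vector_wise_prune_and_encode(W, L=8, K=2)
print(values.shape, offsets.shape)   # both (4, 2, 2): (rows, vectors per row, K)
```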
# 4. SPARSE TENSOR CORE

* As discussed above, a generic sparse matrix suffers from poor workload balance and cumbersome coordinate decoding, which limits its overall performance.
* In fact, VectorSparse could present the same issue if the hardware design is unaware of the new algorithm.
* ### 4.1 Baseline Tensor Core Architecture
  ![](https://i.imgur.com/YK6CB1w.png)
  * #### Tensor Core
    * It was first introduced in NVIDIA's Volta architecture.
    * The Tensor Core is a fast hardware functional block for dense matrix multiplication.
    * Each Tensor Core is able to execute a 4×4×4 matrix multiplication and addition in one cycle.
    * In the CUDA programming model, Tensor Cores are exposed to programmers through the **CUDA WMMA API**. The WMMA API includes dedicated **matrix load and store** primitives and **matrix multiply-and-accumulate** operations for Tensor Cores.
  * #### How does a Tensor Core operate?
    * To execute a WMMA operation, the 32 threads in a warp are divided into 8 **threadgroups**.
    * All threads in a threadgroup work together on the computation.
    * For better data reuse, two threadgroups work together as a **worktuple**. Worktuple i consists of threadgroup i and threadgroup i + 4.
    ![](https://i.imgur.com/I4BAVD4.png)
    * For example, to compute D = A × B + C, the WMMA API is lowered to HMMA instructions that drive the Tensor Core, and one WMMA execution uses sets of four HMMA instructions.
    ![](https://i.imgur.com/37hCQK2.png)
    * An HMMA set looks as follows:
    ![](https://i.imgur.com/mjCHPZi.png)
  * How a WMMA operation is mapped to the Tensor Core architecture:
    ![](https://i.imgur.com/K1JiFQ3.png)
    1. There are two octets in a Tensor Core. Inside an octet there are **eight dot-product (DP) units**, each of which can compute a 4-dim vector dot product per cycle.
    2. A worktuple is mapped to one octet, so each threadgroup takes four DP units.
    3. The octet has operand buffers that feed the worktuple with the tiled data when executing one set of HMMA instructions.
    4. **Each operand buffer for A can hold a 4×4 tile**. On the other hand, **the operand buffer dedicated to operand B can hold a 4×8 tile**, and its data are shared by the two threadgroups in the same worktuple.
    5. In a set of HMMA instructions, one threadgroup computes the multiplication of a 4×4 tile by a 4×8 tile, i.e. 4×8 = 32 4-dim dot products. Because the **four DP units compute four 4-dim vector dot products per cycle**, a set of HMMA instructions requires at least 8 clock cycles to finish the 4×8×4 matrix multiplication.
* ### 4.2 The Extension of HMMA Instruction Set
  * To run the vector-wise sparse matrices produced by the VectorSparse algorithm, the paper introduces a new instruction set, **SHMMA**. Below is a set of SHMMA instructions; its amount of computation is equivalent to a set of HMMA instructions.
  ![](https://i.imgur.com/DEJpyeV.png)
  1. The instruction SHMMA.FETCHIDX fetches **the offset indices of the four elements** in a row of A from RO into an implicit, dedicated offset register.
  2. The instruction SHMMA.EXEC first **decodes the offset register** to determine which rows of B to fetch from RB. After the data are loaded into operand buffer B, SHMMA.EXEC.F32.F32 **computes the 16 4-dim dot products and accumulates the results with C** (a small sketch of this gather-and-accumulate data flow is given at the end of these notes).
  ![](https://i.imgur.com/yHxYSHs.png)
  * The SHMMA instructions require some modifications to the original Tensor Core:
    1. Add **dedicated offset registers** to the register file.
    2. Not only **double the buffer size** to accommodate the four rows of buffer B, but also add another buffer to hide the load latency.
    3. Enable **broadcasting of operand buffer A to the four DP units it connects to**, so that all DP units in an octet can read the same row of A.
  * Comparison of the speedup of dense mode and vector-wise sparse mode:
  ![](https://i.imgur.com/AF92buh.png)

# 5. EXPERIMENTAL RESULTS

![](https://i.imgur.com/6N0T17I.png)

* On average, Tensor Core Vector-wise Sparse achieves a 2.57× speedup over the baseline.
  1. With a relaxed spatial constraint, the vector-wise sparse NNs benefit from the high sparsity, so CUDA Core Vector-wise Sparse has a **63% performance** gain over the baseline.
  2. With the customized SHMMA instructions and the micro-architecture design, the vector-wise sparse NNs can take advantage of the powerful Tensor Core, which contributes an additional **58% performance** improvement over CUDA Core Vector-wise Sparse.
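To tie the SHMMA data flow of Section 4.2 back to the vector-wise encoding, here is a minimal NumPy sketch of the semantics for one compressed row (an emulation, not the hardware; the sizes and function names are illustrative). The offset indices play the role of SHMMA.FETCHIDX and select which rows of B to gather; the surviving values of A then multiply those rows and accumulate into C, as SHMMA.EXEC does.

```python
import numpy as np

def shmma_like_row(a_values, a_offsets, B, c_row):
    """Emulate the per-row semantics of a SHMMA set: the offset indices
    pick which rows of B to gather, then the surviving elements of A
    multiply those rows and accumulate into C."""
    gathered = B[a_offsets, :]           # fetch only the rows of B that are needed
    return c_row + a_values @ gathered   # dot products + accumulation with C

# Toy check against the equivalent dense computation.
rng = np.random.default_rng(0)
K, N, keep = 16, 8, 4                    # a 16-long row of A keeps 4 values
a_offsets = np.sort(rng.choice(K, size=keep, replace=False))
a_values = rng.standard_normal(keep)
B = rng.standard_normal((K, N))
c_row = np.zeros(N)

a_dense = np.zeros(K)
a_dense[a_offsets] = a_values
assert np.allclose(shmma_like_row(a_values, a_offsets, B, c_row),
                   c_row + a_dense @ B)
```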