# Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs
###### tags: `Paper` `Presentation`
## Paper Structure
- Abstract
1. INTRODUCTION
2. BACKGROUND AND MOTIVATION
- 2.1 Sparsity-Centric Optimization for DNNs
- 2.2 Existing Sparsifying Methods on GPUs
- 2.3 The Characterization of Sparsity
3. VECTORSPARSE PRUNING
- 3.1 Vector-wise Sparse Matrix Encoding
- 3.2 VectorSparse: a Methodology of Iterative Vector-wise Sparsifying and Retraining
- 3.3 GPU Kernel Design
4. SPARSE TENSOR CORE
- 4.1 Baseline Tensor Core Architecture
- 4.2 The Extension of HMMA Instruction Set
- 4.3 Micro-architecture Design for Sparse Tensor Core
5. EXPERIMENTAL METHODOLOGY
6. EXPERIMENTAL RESULTS
- 6.1 Impact on Accuracy
- 6.2 Performance Evaluation
- 6.3 Design Overhead Analysis
- 6.4 Summary and Discussion
7. CONCLUSION
- REFERENCES
## Details
### Abstract & 1. INTRODUCTION
#### Paragraph 1
- Sets up the context in which this work appears
- Deep neural networks (DNNs) have achieved state-of-the-art performance in many different tasks, such as image recognition...
- States the **problem** observed in this context
- The underlying representational power of these neural networks comes from the **huge parameter space** which results in an extremely **large amount of computation and memory usage**
- Cites prior works that face this problem or have already tackled similar ones
- There have been **plenty of prior works to improve both performance and energy efficiency of neural networks** on various hardware platforms, such as GPUs [9, 18, 38, 44, 60, 73, 78], FPGAs [19, 28, 81], and ASICs [2, 3, 11–14, 20, 21, 23, 32, 34, 35, 45, 47, 49, 54, 57, 59, 61, 63, 65, 68, 69, 75, 82, 83].
- Identifies the topic this paper focuses on, again with references to related work
    - **sparsity-centric optimization** techniques [4, 28–30, 56, 64] have achieved outstanding results
#### Paragraphs 2 and 3
A deeper introduction to the topic the paper addresses
- Nonetheless, such high sparsity does not necessarily guarantee that the sparse neural networks can be more performant than their dense counterparts due to the **irregularity of data layout**
- In particular, the sparse neural networks can **hardly** obtain performance **gain on Graphics Processing Units (GPUs)**.
- the GPU is extremely underutilized when running the sparse library kernels
#### Paragraph 4
Points out prior works that tackle a similar problem, briefly summarizing their methods, results, and remaining weaknesses; these works also serve as baselines in the final performance comparison
- To address the performance issue, structural sparsifying methods [70, 76] have been **proposed by removing entire rows or columns from a matrix**
- we have observed **significant accuracy drop** from large-scale neural networks when the structural pruning is applied
#### Paragraph 5
Briefly explains the different approach this paper proposes to solve the problem, its advantages and performance gains over other methods, and lists the paper's contributions
##### Solutions
- VectorSparse
- In this paper, we propose **VectorSparse, a SIMD (Single Instruction, Multiple Data)-friendly sparsifying method** to tackle the problem.
- Divides a weight matrix into multiple vectors and prunes each vector to the same sparsity.
- Extend the instruction set of Volta
- We extend the instruction set of Volta to allow the VectorSparse neural networks to run on Tensor Core.
- Micro-architecture Design for Sparse Tensor Core
##### Contributions (a contribution does not have to be a solution to the problem)
- We go through a **comprehensive performance analysis** to demonstrate the inefficiency of GPU when running the sparse neural networks.
- We propose **VectorSparse as a novel sparsifying algorithm** that can achieve **63% performance improvement with negligible accuracy drop**.
- We further **extend the instruction sets of the Volta GPU** to support the operand indexing in the register file
- We also show the details of the **micro-architecture design to mitigate the performance bottleneck**, which achieves **58% performance gain** with negligible area overhead
### 2. BACKGROUND AND MOTIVATION
In Chapter 2 the authors mainly do two things: 1. review some prior work; 2. describe the existing sparsifying techniques in detail
#### 2.1 Sparsity-Centric Optimization for DNNs & 2.2 Existing Sparsifying Methods on GPUs
Surveys several papers on `Sparsity-Centric Optimization` and introduces various sparsifying methods, then groups the sparsifying methods into two categories and compares their performance on GPUs
1. generic sparsifying
- Method: keep the N largest weights and set all the others to 0 (a sketch is given after this list)

- Advantage: high compression rate; in the figure above the compression rate reaches 75%
- Disadvantages:
1. The **variation of the row/column length** of a generic sparse matrix makes it difficult to partition the workload evenly into GPUs
2. The number of non-zero elements in each row is **unknown until runtime**, leaving it difficult to choose an optimal tiling scheme for data reuse.
3. The computation amount of a highly sparse matrix is not enough to hide the long memory access latency, and therefore the benefit from the high sparsity vanishes
- Experimental result: the sparse layers are **less performant** than the dense layers, even when the sparsity is 96%
2. unified sparsifying
- Method: select the largest elements in the weight matrix and keep the values of the entire column

- Advantage:
- It is easy to encode/decode the coordinate information
- Disadvantages:
1. Even though this approach allows more flexibility by adjusting the block size, it **cannot use the highly optimized dense libraries** to get a good performance boost over the dense counterparts
2. **Accuracy drop**
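As a concrete reference for the generic approach, here is a minimal sketch of magnitude-based pruning: keep the N largest-magnitude weights of a matrix and zero out the rest. The row-major layout, the `keep_ratio` parameter, and the function name are illustrative assumptions, not the paper's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Generic (magnitude-based) sparsifying sketch: keep the n_keep largest-
// magnitude weights of the whole matrix and set every other weight to zero.
void generic_sparsify(std::vector<float>& w, float keep_ratio) {
    const std::size_t n_keep =
        static_cast<std::size_t>(keep_ratio * static_cast<float>(w.size()));
    if (n_keep == 0 || n_keep >= w.size()) return;

    // Find the magnitude threshold, i.e. the n_keep-th largest |w|.
    std::vector<float> mag(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) mag[i] = std::fabs(w[i]);
    std::nth_element(mag.begin(), mag.begin() + (n_keep - 1), mag.end(),
                     std::greater<float>());
    const float threshold = mag[n_keep - 1];

    // Zero every weight whose magnitude falls below the threshold.
    for (float& v : w)
        if (std::fabs(v) < threshold) v = 0.0f;
}
```

With `keep_ratio = 0.25` this matches the 75% compression example above; the surviving non-zeros can land anywhere in the matrix, which is exactly the irregularity that hurts GPU performance.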
#### 2.3 The Characterization of Sparsity
In this subsection, the authors study the characteristics of the sparse matrices and record their conclusions and experimental results
- The experiment again prunes the weight matrix with the `generic sparsifying` described above
- The result is then partitioned into multiple vectors in order to examine sparsity balance

- The figure shows an M×N weight matrix
- A length L is chosen and every L elements are grouped into one vector
- The results show that once L is `8` or larger, fewer than 5% of the vectors have a sparsity below 75%, which demonstrates that the per-vector sparsity is sufficiently balanced; the results are shown in the figure below, and a sketch of this measurement follows it:

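A minimal sketch of this measurement, under assumed names and a row-major, row-wise split: count, for every length-L vector of an already pruned M×N weight matrix, how many vectors have a sparsity below 75%.

```cpp
#include <cstddef>
#include <vector>

// Characterization sketch: for a pruned M x N row-major weight matrix,
// group every L consecutive elements of a row into one vector and return
// the fraction of vectors whose sparsity (ratio of zeros) is below 75%.
// The function name and the row-wise split direction are assumptions.
double fraction_below_75(const std::vector<float>& w,
                         std::size_t M, std::size_t N, std::size_t L) {
    std::size_t low = 0, total = 0;
    for (std::size_t r = 0; r < M; ++r) {
        for (std::size_t c0 = 0; c0 + L <= N; c0 += L) {
            std::size_t zeros = 0;
            for (std::size_t i = 0; i < L; ++i)
                if (w[r * N + c0 + i] == 0.0f) ++zeros;
            if (static_cast<double>(zeros) / static_cast<double>(L) < 0.75)
                ++low;
            ++total;
        }
    }
    return total ? static_cast<double>(low) / static_cast<double>(total) : 0.0;
}
```

The paper's observation corresponds to this function returning less than 0.05 once L ≥ 8.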
### 3. VECTORSPARSE PRUNING
Based on the experimental results of the previous chapter, once a sparse matrix produced by `generic sparsifying` is partitioned into vectors, a sufficiently large L ensures that the sparsity of every vector is balanced.
The authors therefore unify the sparsity of all vectors and propose a balanced vector-wise encoding format for sparse matrices. This simplifies workload partitioning on the GPU: it keeps the high accuracy of `generic sparsifying` while fixing its irregular weight layout.
#### 3.1 Vector-wise Sparse Matrix Encoding
The encoding steps are as follows (a code sketch is given after the note below):
1. Choose a length L and split the data into vectors of L elements each; the split direction can be along rows or columns
2. Find the vector containing the most non-zero elements; let that count be K, and convert every vector to length K
3. Record each vector's non-zero values together with their indices

> What if K equals L, so that the matrix is not reduced at all?
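A minimal sketch of the encoding, assuming a row-major matrix split row-wise into length-L vectors (with L dividing N and L ≤ 256 so an 8-bit index suffices); the struct and function names are illustrative, not the paper's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Vector-wise sparse encoding sketch: every length-L vector keeps exactly K
// entries, where K is the largest non-zero count found in any vector;
// vectors with fewer non-zeros are zero-padded up to K.
struct VectorSparseMatrix {
    std::size_t K = 0;             // entries kept per vector
    std::vector<float>   values;   // K values per vector, vector by vector
    std::vector<uint8_t> indices;  // position of each value inside its vector (< L)
};

VectorSparseMatrix encode_vector_sparse(const std::vector<float>& w,
                                        std::size_t M, std::size_t N,
                                        std::size_t L) {
    const std::size_t vecs_per_row = N / L;   // assume L divides N
    VectorSparseMatrix out;

    // Pass 1: K = maximum number of non-zeros in any length-L vector.
    for (std::size_t r = 0; r < M; ++r)
        for (std::size_t v = 0; v < vecs_per_row; ++v) {
            std::size_t nnz = 0;
            for (std::size_t i = 0; i < L; ++i)
                if (w[r * N + v * L + i] != 0.0f) ++nnz;
            out.K = std::max(out.K, nnz);
        }

    // Pass 2: store K (value, index) pairs per vector, zero-padding short ones.
    for (std::size_t r = 0; r < M; ++r)
        for (std::size_t v = 0; v < vecs_per_row; ++v) {
            std::size_t kept = 0;
            for (std::size_t i = 0; i < L && kept < out.K; ++i) {
                const float x = w[r * N + v * L + i];
                if (x != 0.0f) {
                    out.values.push_back(x);
                    out.indices.push_back(static_cast<uint8_t>(i));
                    ++kept;
                }
            }
            for (; kept < out.K; ++kept) {   // padding keeps all vectors the same length
                out.values.push_back(0.0f);
                out.indices.push_back(0);
            }
        }
    return out;
}
```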
#### 3.2 VectorSparse: a Methodology of Iterative Vector-wise Sparsifying and Retraining
This subsection details the proposed sparsifying algorithm `VectorSparse` and provides pseudo code (a sketch follows the steps below):
1. Split the weight matrices into vectors of length L
2. The user sets the maximum tolerable accuracy loss E_x
3. Starting from K = L, the algorithm evaluates the accuracy loss E under each K, where K is the number of elements kept in every vector
4. Once K is found, every vector is compressed to size K
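A minimal sketch of this sparsify-and-retrain loop. The three callbacks stand in for framework-specific code (pruning, fine-tuning, measuring accuracy loss against the dense baseline), and the decreasing-K search order is one plausible reading of the steps above, not the paper's exact pseudo code.

```cpp
#include <cstddef>
#include <functional>

// VectorSparse-style loop (sketch): start from K = L and shrink K while the
// accuracy loss E stays within the user-given budget E_x; return the last K
// that still satisfied the budget, to which every vector is then compressed.
std::size_t vector_sparse(
        std::size_t L,                                     // vector length
        double E_x,                                        // tolerated accuracy loss
        const std::function<void(std::size_t)>& prune,     // keep K largest weights per vector
        const std::function<void()>& retrain,              // fine-tune the pruned model
        const std::function<double()>& accuracy_loss) {    // loss E vs. the dense model
    std::size_t best_K = L;
    for (std::size_t K = L; K >= 1; --K) {
        prune(K);                      // prune every length-L vector down to K entries
        retrain();                     // retrain to recover accuracy
        if (accuracy_loss() > E_x)     // stop once the loss budget E_x is exceeded
            break;
        best_K = K;                    // remember the smallest K still within budget
    }
    return best_K;
}
```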
The figure below compares the sparse-matrix density of `generic sparsifying` and `VectorSparse`; although `VectorSparse` has a lower compression rate, its non-zeros are distributed far more evenly

- After `VectorSparse`, the sparse matrix needs far less computation on the GPU, and it can still use `tiling techniques`
### 4. SPARSE TENSOR CORE
Chapter 3 covered the theory and the algorithm; Chapter 4 focuses on how to run the computation of a `VectorSparse` sparse matrix on the GPU.
Without any hardware modification, `VectorSparse` suffers from the same problems as `generic sparsifying`, e.g., extra computation to decode the coordinates.
#### 4.1 Baseline Tensor Core Architecture

- 1 SM
- 4 sub-cores per SM, each with:
  - 1 warp scheduler
  - CUDA cores & SFU
  - 2 tensor cores
  - LD/ST unit
  - register file
#### CUDA WMMA (Warp Matrix Multiply and Accumulate) API
The figure below shows an example of computing D = A × B + C with WMMA. The WMMA API invokes the tensor cores through HMMA instructions, and HMMA instructions are issued in sets of four (a WMMA code sketch is given after the figures).

An HMMA set is shown below:

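For reference, a minimal dense WMMA kernel that computes one 16×16×16 tile of D = A × B + C on Tensor Cores; nvcc lowers each `wmma::mma_sync` into HMMA sets like the ones shown above. The tile size, layouts, and leading dimensions are chosen for brevity.

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile of D = A * B + C on Tensor Cores.
// A and B are fp16 (row-major / col-major); C and D are fp32, row-major.
__global__ void wmma_16x16x16(const half* A, const half* B,
                              const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                         // load A tile
    wmma::load_matrix_sync(b_frag, B, 16);                         // load B tile
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);  // load C tile
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);            // A*B + C, lowered to HMMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major); // write D tile
}
```

Launch with a single warp, e.g. `wmma_16x16x16<<<1, 32>>>(dA, dB, dC, dD)`, on a GPU of compute capability 7.0 or newer.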
#### 4.2 The Extension of HMMA Instruction Set & 4.3 Micro-architecture Design for Sparse Tensor Core
To operate on matrices sparsified by the VectorSparse algorithm, the authors introduce a new instruction, `SHMMA`. The figure below shows one set of SHMMA instructions, whose amount of computation equals one HMMA set (a functional sketch of what such a set computes is given after the list below).

- To support the new `SHMMA` instruction, the hardware needs the following changes:
1. Add offset registers
2. Enlarge the size of buffer B

3. Add a ping-pong buffer

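There is no public SHMMA intrinsic to call, but a functional model of what a vector-wise sparse multiply computes may help: each row of the compressed A operand carries K values plus the offsets recorded by the encoding, and the offsets select which rows of the dense B operand are gathered. The plain CUDA kernel below (one thread per output element) is only a sketch of that arithmetic, not of the Tensor Core data path or the actual SHMMA semantics.

```cpp
#include <cstdint>

// Functional sketch of a vector-wise sparse x dense multiply, D = A_sparse * B + C.
// A is stored in the vector-wise format: each of its M rows keeps K values
// (a_val) plus their original column offsets within the row (a_idx, each < L).
// B is a dense L x N row-major matrix; C and D are dense M x N row-major.
__global__ void vector_sparse_mma(const float* a_val, const uint8_t* a_idx,
                                  const float* B, const float* C, float* D,
                                  int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = C[row * N + col];
    for (int k = 0; k < K; ++k) {
        int b_row = a_idx[row * K + k];               // offset selects the row of B to gather
        acc += a_val[row * K + k] * B[b_row * N + col];
    }
    D[row * N + col] = acc;
}
```

The offset registers listed above would hold this `a_idx`-style metadata, so the hardware can perform the gather from buffer B itself instead of doing the index arithmetic in software.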
### 5. EXPERIMENTAL METHODOLOGY
- Retrain the models with the sparsifying method
- Add the SHMMA instruction to GPGPU-Sim
- Implement vector-wise sparse matrix multiplication kernels based on CUTLASS
### 6. EXPERIMENTAL RESULTS
Performance comparison

Design Overhead
