# Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs

###### tags: `論文` `Presentation`

## Paper Structure

- Abstract
1. INTRODUCTION
2. BACKGROUND AND MOTIVATION
   - 2.1 Sparsity-Centric Optimization for DNNs
   - 2.2 Existing Sparsifying Methods on GPUs
   - 2.3 The Characterization of Sparsity
3. VECTORSPARSE PRUNING
   - 3.1 Vector-wise Sparse Matrix Encoding
   - 3.2 VectorSparse: a Methodology of Iterative Vector-wise Sparsifying and Retraining
   - 3.3 GPU Kernel Design
4. SPARSE TENSOR CORE
   - 4.1 Baseline Tensor Core Architecture
   - 4.2 The Extension of HMMA Instruction Set
   - 4.3 Micro-architecture Design for Sparse Tensor Core
5. EXPERIMENTAL METHODOLOGY
6. EXPERIMENTAL RESULTS
   - 6.1 Impact on Accuracy
   - 6.2 Performance Evaluation
   - 6.3 Design Overhead Analysis
   - 6.4 Summary and Discussion
7. CONCLUSION
- REFERENCES

## Details

### Abstract & 1. INTRODUCTION

#### Paragraph 1

- Sets up the background of the paper
  - Deep neural networks (DNNs) have achieved state-of-the-art performance in many different tasks, such as image recognition...
- States the **problem** observed in this setting
  - The underlying representational power of these neural networks comes from the **huge parameter space**, which results in an extremely **large amount of computation and memory usage**.
- Cites prior works that face this problem or have already addressed similar ones
  - There have been **plenty of prior works to improve both performance and energy efficiency of neural networks** on various hardware platforms, such as GPUs [9, 18, 38, 44, 60, 73, 78], FPGAs [19, 28, 81], and ASICs [2, 3, 11–14, 20, 21, 23, 32, 34, 35, 45, 47, 49, 54, 57, 59, 61, 63, 65, 68, 69, 75, 82, 83].
- Points out the topic this paper focuses on, again citing related work
  - **Sparsity-centric optimization** techniques [4, 28–30, 56, 64] have achieved outstanding results.

#### Paragraphs 2–3

A deeper introduction to the topic the paper focuses on:

- Nonetheless, such high sparsity does not necessarily guarantee that the sparse neural networks can be more performant than their dense counterparts, due to the **irregularity of data layout**.
- In particular, the sparse neural networks can **hardly** obtain performance **gain on Graphics Processing Units (GPUs)**.
- The GPU is extremely underutilized when running the sparse library kernels.

#### Paragraph 4

Points out other works that have tackled a similar problem, briefly describes their methods, results, and remaining weaknesses; these works also serve as the baselines in the final performance comparison.

- To address the performance issue, structural sparsifying methods [70, 76] have been **proposed by removing entire rows or columns from a matrix**.
- However, a **significant accuracy drop** is observed on large-scale neural networks when the structural pruning is applied.

#### Paragraph 5

Briefly explains the different approach this paper proposes to solve the problem, its advantages and performance gains over the other methods, and lists the paper's contributions.

##### Solutions

- VectorSparse
  - In this paper, we propose **VectorSparse, a SIMD (Single Instruction, Multiple Data)-friendly sparsifying method** to tackle the problem.
  - It divides a weight matrix into multiple vectors and prunes each vector to the same sparsity.
- Extending the instruction set of Volta
  - We extend the instruction set of Volta to allow the VectorSparse neural networks to run on Tensor Core.
- Micro-architecture design for Sparse Tensor Core

##### Contributions (a contribution does not have to be a solution to the problem)

- We go through a **comprehensive performance analysis** to demonstrate the inefficiency of the GPU when running sparse neural networks.
- We propose **VectorSparse as a novel sparsifying algorithm** that can achieve **63% performance improvement with negligible accuracy drop**.
- We further **extend the instruction set of the Volta GPU** to support operand indexing in the register file.
- We also show the details of the **micro-architecture design to mitigate the performance bottleneck**, which achieves **58% performance gain** with negligible area overhead.

### 2. BACKGROUND AND MOTIVATION

In Chapter 2 the authors mainly do two things: 1. review some prior work, and 2. describe the existing sparsifying techniques in detail.

#### 2.1 Sparsity-Centric Optimization for DNNs & 2.2 Existing Sparsifying Methods on GPUs

These sections survey several papers on `sparsity-centric optimization` and introduce various `sparsifying methods`; the methods are then grouped into two categories and their GPU performance is compared.

1. Generic sparsifying
   - Method: keep the N largest weights and set all other weights to 0 (a small code sketch of this step appears after this list)
   ![](https://i.imgur.com/tDd2AUM.png)
   - Advantage: high compression ratio; in the figure above, the compression ratio reaches 75%
   - Disadvantages:
     1. The **variation of the row/column length** of a generic sparse matrix makes it difficult to partition the workload evenly across the GPU.
     2. The number of non-zero elements in each row is **unknown until runtime**, making it difficult to choose an optimal tiling scheme for data reuse.
     3. The computation amount of a highly sparse matrix is not enough to hide the long memory access latency, so the benefit from the high sparsity vanishes.
   - Experimental result: the sparse layers are **less performant** than the dense layers, even when the sparsity is 96%.
2. Unified sparsifying
   - Method: select the largest element in the weight matrix and keep the values of its entire column
   ![](https://i.imgur.com/2MGTDpU.png)
   - Advantage:
     - It is easy to encode/decode the coordinate information.
   - Disadvantages:
     - Even though this approach allows more flexibility by adjusting the block size, it **cannot use the highly optimized dense libraries** to get a good performance boost over the dense counterparts.
     - **Accuracy drop**
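To make the generic sparsifying step in item 1 concrete, here is a minimal host-side CUDA/C++ sketch. It is my own illustration, not code from the paper: the helper name `magnitude_prune` and its in-place interface are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Generic (unstructured) sparsifying: keep the n_keep largest-magnitude
// weights of the whole matrix and set every other weight to zero.
// `w` holds the weight matrix in row-major order; pruning is done in place.
void magnitude_prune(std::vector<float>& w, std::size_t n_keep) {
    if (n_keep >= w.size()) return;                       // nothing to prune
    if (n_keep == 0) { std::fill(w.begin(), w.end(), 0.0f); return; }

    // Find the magnitude threshold: the n_keep-th largest |w[i]|.
    std::vector<float> mag(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) mag[i] = std::fabs(w[i]);
    std::nth_element(mag.begin(), mag.begin() + (n_keep - 1), mag.end(),
                     std::greater<float>());
    const float threshold = mag[n_keep - 1];

    // Zero out every weight below the threshold (ties at the threshold are kept).
    for (float& v : w)
        if (std::fabs(v) < threshold) v = 0.0f;
}
```

For the 75% compression example above, `n_keep` would be one quarter of the matrix size; the surviving weights land at arbitrary positions, which is exactly the irregular layout that hurts GPU performance.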
#### 2.3 The Characterization of Sparsity

In this section, the authors study the characteristics of the sparse matrices and record their conclusions and experimental results.

- The experiment again prunes the weight matrices with the `generic sparsifying` described above.
- The result is then divided into multiple vectors to examine the sparsity balance.

![](https://i.imgur.com/s8HO7ne.png)

- The figure shows an M×N weight matrix.
- A length L is chosen and every L elements are grouped into one vector.
- The experimental results show that when L is `8` or larger, fewer than 5% of the vectors have a sparsity below 75%. This demonstrates that the sparsity of the vectors is sufficiently balanced; the results are shown in the figure below.

![](https://i.imgur.com/2gcThcf.png)

### 3. VECTORSPARSE PRUNING

Based on the experimental results of the previous chapter, once the sparse matrix produced by `generic sparsifying` is partitioned into vectors, choosing a large enough L ensures that the sparsity of the individual vectors is balanced.

Therefore, the authors force all vectors to the same sparsity and propose a balanced vector-wise encoding format for sparse matrices. This simplifies workload partitioning on the GPU: it preserves the high accuracy of `generic sparsifying` while fixing the drawback of its irregular weight layout.

#### 3.1 Vector-wise Sparse Matrix Encoding

The encoding steps are as follows:

1. Choose a length L and split the data into vectors of L elements each; the split can be along either rows or columns.
2. Find the vector containing the most non-zero elements; let its non-zero count be K, and resize every vector to length K.
3. Record the non-zero values of each vector together with their indices.

![](https://i.imgur.com/iD7Dvgp.png)

> What if K turns out to be L, so that the matrix does not shrink at all?
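Below is a minimal host-side sketch of this vector-wise encoding. The paper only describes the format conceptually, so the function name `vector_wise_encode`, the row-wise split, and the values/offsets output layout are assumptions for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

// Vector-wise sparse encoding (illustrative sketch).
// The dense row-major matrix `w` (rows x cols, cols divisible by L, L <= 256)
// is split into row-wise vectors of length L. Every vector keeps its k
// largest-magnitude values; `values` stores the kept values and `offsets`
// stores each value's position (0..L-1) inside its vector, in vector order.
void vector_wise_encode(const std::vector<float>& w, int rows, int cols,
                        int L, int k,
                        std::vector<float>& values,
                        std::vector<uint8_t>& offsets) {
    values.clear();
    offsets.clear();
    for (int r = 0; r < rows; ++r) {
        for (int v0 = 0; v0 < cols; v0 += L) {        // one vector per L columns
            const float* vec = &w[r * cols + v0];

            // Rank the positions 0..L-1 of this vector by decreasing |value|.
            std::vector<int> idx(L);
            std::iota(idx.begin(), idx.end(), 0);
            std::sort(idx.begin(), idx.end(), [&](int a, int b) {
                return std::fabs(vec[a]) > std::fabs(vec[b]);
            });

            // Keep the k largest entries, stored in ascending offset order.
            std::sort(idx.begin(), idx.begin() + k);
            for (int i = 0; i < k; ++i) {
                values.push_back(vec[idx[i]]);
                offsets.push_back(static_cast<uint8_t>(idx[i]));
            }
        }
    }
}
```

Every vector contributes exactly k kept values plus one byte of offset metadata each, so the amount of work per vector is fixed, which is what makes the workload easy to partition evenly.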
#### 3.2 VectorSparse: a Methodology of Iterative Vector-wise Sparsifying and Retraining

This section details the authors' sparsifying algorithm `VectorSparse` and provides pseudo code:

1. Split the weight matrices into vectors of length L.
2. The user sets the maximum tolerable accuracy loss E_x.
3. Starting from K = L, the program decreases K step by step and computes the accuracy loss E under each K, where K is the number of elements kept in each vector.
4. Once the smallest K whose loss E stays within E_x is found, every vector is compressed to size K.

The figure below compares the density of the sparse matrices produced by `generic sparsifying` and `VectorSparse`. Although `VectorSparse` has a lower compression ratio, its non-zeros are distributed much more evenly.

![](https://i.imgur.com/75YKaAk.png)

- A sparse matrix produced by `VectorSparse` requires far less computation on the GPU, and it can still use `tiling techniques`.

### 4. SPARSE TENSOR CORE

Chapter 3 covered the theory and the algorithm; Chapter 4 focuses on how to compute with the `VectorSparse` sparse matrices on the GPU.

Without any hardware modification, `VectorSparse` suffers from the same problems as `generic sparsifying`, e.g., extra computation is needed to decode the coordinates.

#### 4.1 Baseline Tensor Core Architecture

![](https://i.imgur.com/bp0n5Hp.png)

- 1 SM contains 4 sub-cores
- Each sub-core contains:
  - 1 warp scheduler
  - CUDA cores & SFU
  - 2 Tensor Cores
  - LD/ST unit
  - register file

#### CUDA WMMA (Warp Matrix Multiply and Accumulate) API

The figure below is an example of computing D = A x B + C with WMMA. Inside the WMMA API, HMMA instructions invoke the Tensor Cores to do the computation, and HMMA instructions execute as sets of 4 instructions. (A minimal CUDA sketch of this API is included at the end of these notes.)

![](https://i.imgur.com/pDkABuE.png)

An HMMA set is shown below:

![](https://i.imgur.com/KNsJojL.png)

#### Micro-architecture Design for Sparse Tensor Core

To operate on the sparse matrices produced by the VectorSparse algorithm, the authors introduce a new instruction, `SHMMA`. The figure below shows one set of SHMMA instructions, whose amount of computation equals one set of HMMA.

![](https://i.imgur.com/E1zxinu.png)

- To support the new `SHMMA` instruction, the hardware needs the following changes:
  1. Add offset registers
  2. Enlarge Buffer B
  ![](https://i.imgur.com/2TQ6fvF.png)
  3. Add a ping-pong buffer
  ![](https://i.imgur.com/Urope8A.png)

### 5. EXPERIMENTAL METHODOLOGY

- Retrain the models with the sparsifying method
- Add the SHMMA instruction to GPGPU-Sim
- Implement vector-wise sparse matrix multiplication kernels based on CUTLASS

### 6. EXPERIMENTAL RESULTS

Performance comparison:

![](https://i.imgur.com/PzQZ48i.png)

Design overhead:

![](https://i.imgur.com/hqV62gK.png)
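As referenced in Section 4.1, here is a minimal CUDA sketch of the WMMA API computing one 16x16 tile of D = A x B + C. This is an independent example of the public `nvcuda::wmma` API, not the paper's kernel or the exact code shown in the figure.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of D = A * B + C.
// A and B are FP16, C and D are FP32; all tiles use leading dimension 16.
__global__ void wmma_tile(const half* A, const half* B,
                          const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    // Load the operand tiles from global memory into per-thread fragments.
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

    // acc = A_tile * B_tile + C_tile.
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    // Store the accumulated 16x16 result tile back to global memory.
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp, e.g.: wmma_tile<<<1, 32>>>(dA, dB, dC, dD);
```

On Volta, the `mma_sync` call is lowered by the compiler into HMMA instruction sets like the one shown in Section 4.1; the SHMMA extension described there keeps this warp-level programming model but additionally consumes offset registers and an enlarged Buffer B.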