# Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs
###### tags: `Paper` `Presentation`
## Paper Structure
- Abstract
1. INTRODUCTION
2. BACKGROUND AND MOTIVATION
- 2.1 Sparsity-Centric Optimization for DNNs
- 2.2 Existing Sparsifying Methods on GPUs
- 2.3 The Characterization of Sparsity
3. VECTORSPARSE PRUNING
- 3.1 Vector-wise Sparse Matrix Encoding
- 3.2 VectorSparse: a Methodology of Iterative Vector-wise Sparsifying and Retraining
- 3.3 GPU Kernel Design
4. SPARSE TENSOR CORE
- 4.1 Baseline Tensor Core Architecture
- 4.2 The Extension of HMMA Instruction Set
- 4.3 Micro-architecture Design for Sparse Tensor Core
5. EXPERIMENTAL METHODOLOGY
6. EXPERIMENTAL RESULTS
- 6.1 Impact on Accuracy
- 6.2 Performance Evaluation
- 6.3 Design Overhead Analysis
- 6.4 Summary and Discussion
7. CONCLUSION
- REFERENCES
## Details
### Abstract & 1. INTRODUCTION
#### Paragraph 1
- Sets up the context in which this work appears
- Deep neural networks (DNNs) have achieved state-of-the-art performance in many different tasks, such as image recognition...
- States the **problem** observed in this context
- The underlying representational power of these neural networks comes from the **huge parameter space** which results in an extremely **large amount of computation and memory usage**
- Cites prior works that face this problem or have already tackled similar ones
- There have been **plenty of prior works to improve both performance and energy efficiency of neural networks** on various hardware platforms, such as GPUs [9, 18, 38, 44, 60, 73, 78], FPGAs [19, 28, 81], and ASICs [2, 3, 11–14, 20, 21, 23, 32, 34, 35, 45, 47, 49, 54, 57, 59, 61, 63, 65, 68, 69, 75, 82, 83].
- Identifies the topic this paper focuses on, again with references to related work
    - **sparsity-centric optimization** techniques [4, 28–30, 56, 64] have achieved outstanding results
#### Paragraphs 2 and 3
A deeper introduction to the topic the paper addresses
- Nonetheless, such high sparsity does not necessarily guarantee that the sparse neural networks can be more performant than their dense counterparts due to the **irregularity of data layout**
- In particular, the sparse neural networks can **hardly** obtain performance **gain on Graphics Processing Units (GPUs)**.
- the GPU is extremely underutilized when running the sparse library kernels
#### Paragraph 4
Points out prior works that tackle a similar problem, briefly summarizing their methods, results, and remaining weaknesses; these works also serve as baselines in the final performance comparison
- To address the performance issue, structural sparsifying methods [70, 76] have been **proposed by removing entire rows or columns from a matrix**
- we have observed **significant accuracy drop** from large-scale neural networks when the structural pruning is applied
#### Paragraph 5
Briefly explains the different approach this paper proposes to solve the problem, its advantages and performance gains over other methods, and lists the paper's contributions
##### Solutions
- VectorSparse
- In this paper, we propose **VectorSparse, a SIMD (Single Instruction, Multiple Data)-friendly sparsifying method** to tackle the problem.
- Divides a weight matrix into multiple vectors and prunes each vector to the same sparsity.
- Extend the instruction set of Volta
- We extend the instruction set of Volta to allow the VectorSparse neural networks to run on Tensor Core.
- Micro-architecture Design for Sparse Tensor Core
##### Contributions (a contribution does not have to be a solution to the problem)
- We go through a **comprehensive performance analysis** to demonstrate the inefficiency of GPU when running the sparse neural networks.
- We propose **VectorSparse as a novel sparsifying algorithm** that can achieve **63% performance improvement with negligible accuracy drop**.
- We further **extend the instruction sets of the Volta GPU** to support the operand indexing in the register file
- We also show the details of the **micro-architecture design to mitigate the performance bottleneck**, which achieves **58% performance gain** with negligible area overhead
### 2. BACKGROUND AND MOTIVATION
In Chapter 2 the authors mainly do two things: 1. review some prior work; 2. describe the existing sparsifying techniques in detail
#### 2.1 Sparsity-Centric Optimization for DNNs & 2.2 Existing Sparsifying Methods on GPUs
Surveys several papers on `Sparsity-Centric Optimization` and introduces various sparsifying methods, then groups the sparsifying methods into two categories and compares their performance on GPUs
1. generic sparsifying
- Method: keep the N largest weights and set all the others to 0 (a sketch is given after this list)

- Advantage: high compression rate; in the figure above the compression rate reaches 75%
- Disadvantages:
1. The **variation of the row/column length** of a generic sparse matrix makes it difficult to partition the workload evenly into GPUs
2. The number of non-zero elements in each row is **unknown until runtime**, leaving it difficult to choose an optimal tiling scheme for data reuse.
3. The computation amount of a highly sparse matrix is not enough to hide the long memory access latency, and therefore the benefit from the high sparsity vanishes
- Experimental result: the sparse layers are **less performant** than the dense layers, even when the sparsity is 96%
2. unified sparsifying
- Method: select the largest elements in the weight matrix and keep the values of the entire column

- Advantage:
- It is easy to encode/decode the coordinate information
- Disadvantages:
1. Even though this approach allows more flexibility by adjusting the block size, it **cannot use the highly optimized dense libraries** to get a good performance boost over the dense counterparts
2. **Accuracy drop**
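As a concrete reference for the generic approach, here is a minimal sketch of magnitude-based pruning: keep the N largest-magnitude weights of a matrix and zero out the rest. The row-major layout, the `keep_ratio` parameter, and the function name are illustrative assumptions, not the paper's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Generic (magnitude-based) sparsifying sketch: keep the n_keep largest-
// magnitude weights of the whole matrix and set every other weight to zero.
void generic_sparsify(std::vector<float>& w, float keep_ratio) {
    const std::size_t n_keep =
        static_cast<std::size_t>(keep_ratio * static_cast<float>(w.size()));
    if (n_keep == 0 || n_keep >= w.size()) return;

    // Find the magnitude threshold, i.e. the n_keep-th largest |w|.
    std::vector<float> mag(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) mag[i] = std::fabs(w[i]);
    std::nth_element(mag.begin(), mag.begin() + (n_keep - 1), mag.end(),
                     std::greater<float>());
    const float threshold = mag[n_keep - 1];

    // Zero every weight whose magnitude falls below the threshold.
    for (float& v : w)
        if (std::fabs(v) < threshold) v = 0.0f;
}
```

With `keep_ratio = 0.25` this matches the 75% compression example above; the surviving non-zeros can land anywhere in the matrix, which is exactly the irregularity that hurts GPU performance.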
#### 2.3 The Characterization of Sparsity
In this subsection, the authors study the characteristics of the sparse matrices and record their conclusions and experimental results
- The experiment again prunes the weight matrix with the `generic sparsifying` described above
- The result is then partitioned into multiple vectors in order to examine sparsity balance

- The figure shows an M×N weight matrix
- A length L is chosen and every L elements are grouped into one vector
- The results show that once L is `8` or larger, fewer than 5% of the vectors have a sparsity below 75%, which demonstrates that the per-vector sparsity is sufficiently balanced; the results are shown in the figure below, and a sketch of this measurement follows it:

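A minimal sketch of this measurement, under assumed names and a row-major, row-wise split: count, for every length-L vector of an already pruned M×N weight matrix, how many vectors have a sparsity below 75%.

```cpp
#include <cstddef>
#include <vector>

// Characterization sketch: for a pruned M x N row-major weight matrix,
// group every L consecutive elements of a row into one vector and return
// the fraction of vectors whose sparsity (ratio of zeros) is below 75%.
// The function name and the row-wise split direction are assumptions.
double fraction_below_75(const std::vector<float>& w,
                         std::size_t M, std::size_t N, std::size_t L) {
    std::size_t low = 0, total = 0;
    for (std::size_t r = 0; r < M; ++r) {
        for (std::size_t c0 = 0; c0 + L <= N; c0 += L) {
            std::size_t zeros = 0;
            for (std::size_t i = 0; i < L; ++i)
                if (w[r * N + c0 + i] == 0.0f) ++zeros;
            if (static_cast<double>(zeros) / static_cast<double>(L) < 0.75)
                ++low;
            ++total;
        }
    }
    return total ? static_cast<double>(low) / static_cast<double>(total) : 0.0;
}
```

The paper's observation corresponds to this function returning less than 0.05 once L ≥ 8.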
### 3. VECTORSPARSE PRUNING
Based on the experimental results of the previous chapter, once a sparse matrix produced by `generic sparsifying` is partitioned into vectors, a sufficiently large L ensures that the sparsity of every vector is balanced.
The authors therefore unify the sparsity of all vectors and propose a balanced vector-wise encoding format for sparse matrices. This simplifies workload partitioning on the GPU: it keeps the high accuracy of `generic sparsifying` while fixing its irregular weight layout.
#### 3.1 Vector-wise Sparse Matrix Encoding
The encoding steps are as follows (a code sketch is given after the note below):
1. Choose a length L and split the data into vectors of L elements each; the split direction can be along rows or columns
2. Find the vector containing the most non-zero elements; let that count be K, and convert every vector to length K
3. Record each vector's non-zero values together with their indices

> What if K equals L, so that the matrix is not reduced at all?
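A minimal sketch of the encoding, assuming a row-major matrix split row-wise into length-L vectors (with L dividing N and L ≤ 256 so an 8-bit index suffices); the struct and function names are illustrative, not the paper's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Vector-wise sparse encoding sketch: every length-L vector keeps exactly K
// entries, where K is the largest non-zero count found in any vector;
// vectors with fewer non-zeros are zero-padded up to K.
struct VectorSparseMatrix {
    std::size_t K = 0;             // entries kept per vector
    std::vector<float>   values;   // K values per vector, vector by vector
    std::vector<uint8_t> indices;  // position of each value inside its vector (< L)
};

VectorSparseMatrix encode_vector_sparse(const std::vector<float>& w,
                                        std::size_t M, std::size_t N,
                                        std::size_t L) {
    const std::size_t vecs_per_row = N / L;   // assume L divides N
    VectorSparseMatrix out;

    // Pass 1: K = maximum number of non-zeros in any length-L vector.
    for (std::size_t r = 0; r < M; ++r)
        for (std::size_t v = 0; v < vecs_per_row; ++v) {
            std::size_t nnz = 0;
            for (std::size_t i = 0; i < L; ++i)
                if (w[r * N + v * L + i] != 0.0f) ++nnz;
            out.K = std::max(out.K, nnz);
        }

    // Pass 2: store K (value, index) pairs per vector, zero-padding short ones.
    for (std::size_t r = 0; r < M; ++r)
        for (std::size_t v = 0; v < vecs_per_row; ++v) {
            std::size_t kept = 0;
            for (std::size_t i = 0; i < L && kept < out.K; ++i) {
                const float x = w[r * N + v * L + i];
                if (x != 0.0f) {
                    out.values.push_back(x);
                    out.indices.push_back(static_cast<uint8_t>(i));
                    ++kept;
                }
            }
            for (; kept < out.K; ++kept) {   // padding keeps all vectors the same length
                out.values.push_back(0.0f);
                out.indices.push_back(0);
            }
        }
    return out;
}
```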
#### 3.2 VectorSparse: a Methodology of Iterative Vector-wise Sparsifying and Retraining
This subsection details the proposed sparsifying algorithm `VectorSparse` and provides pseudo code (a sketch follows the steps below):
1. Split the weight matrices into vectors of length L
2. The user sets the maximum tolerable accuracy loss E_x
3. Starting from K = L, the algorithm evaluates the accuracy loss E under each K, where K is the number of elements kept in every vector
4. Once K is found, every vector is compressed to size K
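A minimal sketch of this sparsify-and-retrain loop. The three callbacks stand in for framework-specific code (pruning, fine-tuning, measuring accuracy loss against the dense baseline), and the decreasing-K search order is one plausible reading of the steps above, not the paper's exact pseudo code.

```cpp
#include <cstddef>
#include <functional>

// VectorSparse-style loop (sketch): start from K = L and shrink K while the
// accuracy loss E stays within the user-given budget E_x; return the last K
// that still satisfied the budget, to which every vector is then compressed.
std::size_t vector_sparse(
        std::size_t L,                                     // vector length
        double E_x,                                        // tolerated accuracy loss
        const std::function<void(std::size_t)>& prune,     // keep K largest weights per vector
        const std::function<void()>& retrain,              // fine-tune the pruned model
        const std::function<double()>& accuracy_loss) {    // loss E vs. the dense model
    std::size_t best_K = L;
    for (std::size_t K = L; K >= 1; --K) {
        prune(K);                      // prune every length-L vector down to K entries
        retrain();                     // retrain to recover accuracy
        if (accuracy_loss() > E_x)     // stop once the loss budget E_x is exceeded
            break;
        best_K = K;                    // remember the smallest K still within budget
    }
    return best_K;
}
```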
The figure below compares the sparse-matrix density of `generic sparsifying` and `VectorSparse`; although `VectorSparse` has a lower compression rate, its non-zeros are distributed far more evenly

- After `VectorSparse`, the sparse matrix needs far less computation on the GPU, and it can still use `tiling techniques`
### 4. SPARSE TENSOR CORE
Chapter 3 covered the theory and the algorithm; Chapter 4 focuses on how to run the computation of a `VectorSparse` sparse matrix on the GPU.
Without any hardware modification, `VectorSparse` suffers from the same problems as `generic sparsifying`, e.g., extra computation to decode the coordinates.
#### 4.1 Baseline Tensor Core Architecture

- 1 SM
- 4 sub-cores per SM, each with:
  - 1 warp scheduler
  - CUDA cores & SFU
  - 2 tensor cores
  - LD/ST unit
  - register file
#### CUDA WMMA (Warp Matrix Multiply and Accumulate) API
The figure below shows an example of computing D = A × B + C with WMMA. The WMMA API invokes the tensor cores through HMMA instructions, and HMMA instructions are issued in sets of four (a WMMA code sketch is given after the figures).

An HMMA set is shown below:

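For reference, a minimal dense WMMA kernel that computes one 16×16×16 tile of D = A × B + C on Tensor Cores; nvcc lowers each `wmma::mma_sync` into HMMA sets like the ones shown above. The tile size, layouts, and leading dimensions are chosen for brevity.

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile of D = A * B + C on Tensor Cores.
// A and B are fp16 (row-major / col-major); C and D are fp32, row-major.
__global__ void wmma_16x16x16(const half* A, const half* B,
                              const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                         // load A tile
    wmma::load_matrix_sync(b_frag, B, 16);                         // load B tile
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);  // load C tile
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);            // A*B + C, lowered to HMMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major); // write D tile
}
```

Launch with a single warp, e.g. `wmma_16x16x16<<<1, 32>>>(dA, dB, dC, dD)`, on a GPU of compute capability 7.0 or newer.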
#### 4.2 The Extension of HMMA Instruction Set & 4.3 Micro-architecture Design for Sparse Tensor Core
To operate on matrices sparsified by the VectorSparse algorithm, the authors introduce a new instruction, `SHMMA`. The figure below shows one set of SHMMA instructions, whose amount of computation equals one HMMA set (a functional sketch of what such a set computes is given after the list below).

- To support the new `SHMMA` instruction, the hardware needs the following changes:
1. Add offset registers
2. Enlarge the size of buffer B

3. Add a ping-pong buffer

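There is no public SHMMA intrinsic to call, but a functional model of what a vector-wise sparse multiply computes may help: each row of the compressed A operand carries K values plus the offsets recorded by the encoding, and the offsets select which rows of the dense B operand are gathered. The plain CUDA kernel below (one thread per output element) is only a sketch of that arithmetic, not of the Tensor Core data path or the actual SHMMA semantics.

```cpp
#include <cstdint>

// Functional sketch of a vector-wise sparse x dense multiply, D = A_sparse * B + C.
// A is stored in the vector-wise format: each of its M rows keeps K values
// (a_val) plus their original column offsets within the row (a_idx, each < L).
// B is a dense L x N row-major matrix; C and D are dense M x N row-major.
__global__ void vector_sparse_mma(const float* a_val, const uint8_t* a_idx,
                                  const float* B, const float* C, float* D,
                                  int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = C[row * N + col];
    for (int k = 0; k < K; ++k) {
        int b_row = a_idx[row * K + k];               // offset selects the row of B to gather
        acc += a_val[row * K + k] * B[b_row * N + col];
    }
    D[row * N + col] = acc;
}
```

The offset registers listed above would hold this `a_idx`-style metadata, so the hardware can perform the gather from buffer B itself instead of doing the index arithmetic in software.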
### 5. EXPERIMENTAL METHODOLOGY
- Retrain the models with the sparsifying method
- Add the SHMMA instruction to GPGPU-Sim
- Implement vector-wise sparse matrix multiplication kernels based on CUTLASS
### 6. EXPERIMENTAL RESULTS
Performance comparison

Design Overhead
