# Sparse Matrix Operation
## Main Observation & Motivation
### Sparsity in Neural Networks
1. Close to 80% of the values in the parameters are zero
2. These zero values still take up a considerable amount of storage

### Model Size Trend
1. As large language models keep scaling up → models are getting bigger and bigger
2. Storing them requires a huge amount of VRAM
3. Current mainstream remedies
    1. With many GPUs available → Megatron-LM (3D Parallelism), DeepSpeed
    2. For ordinary users → Quantization, LoRA
### Model Pruning
1. Besides values that are exactly zero, even more of a model's parameters are merely close to zero, so pruning can exploit this property
2. Unstructured Pruning & Structured Pruning (see the sketch after this list)
    1. Unstructured
        - Prune the values closest to zero first
        - The closer a value is to zero, the smaller its influence on the model's output
        - However, these values are usually distributed rather randomly → no structure
        - Impact on the results is small
    2. Structured
        - To prune in a structured way → pruning is done per row or per column
        - Impact on the results is larger, because values that do matter to the model may also get pruned
        - This kind of approach is therefore usually not adopted
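A minimal sketch of global magnitude-based unstructured pruning with `torch.nn.utils.prune` (the two-layer MLP is a hypothetical stand-in, not necessarily the exact MNIST model used below):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in model; any set of Linear/Conv layers works the same way.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Collect every Linear weight and prune the globally smallest-magnitude 80%.
parameters_to_prune = [
    (m, "weight") for m in model.modules() if isinstance(m, nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,  # magnitude (|w|) based, unstructured
    amount=0.8,                           # global sparsity target
)

# Pruned entries are now held at zero via a mask: weight = weight_orig * weight_mask.
for module, _ in parameters_to_prune:
    frac = float((module.weight == 0).sum()) / module.weight.numel()
    print(module, f"{frac:.2%} zeros")
```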
### Sparsity Performance
- For BERT
    1. Pruning up to 40% has almost no effect on the results
    2. A significant accuracy drop only appears beyond 60%

- Validation with an MNIST handwritten-digit model
| sparsity | original acc | pruned acc | pruned + fine-tuned acc |
| --- | --- | --- | --- |
| 70% | 0.9604 | 0.9566 | 0.9682 |
| 80% | 0.9545 | 0.946 | 0.9656 |
| 90% | 0.9534 | 0.8845 | 0.9615 |
| 95% | 0.9569 | 0.4923 | 0.9395 |
| 98% | 0.957 | 0.2158 | 0.8074 |
| 99% | 0.9565 | 0.0982 | 0.3052 |
1. In most cases, fine-tuning the remaining parameters after pruning brings accuracy back to its original level (a fine-tuning sketch follows this list)
2. Only at 95% to 99% sparsity does the model clearly lose its ability to learn
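A minimal fine-tuning sketch, assuming the `model`, `parameters_to_prune`, and a hypothetical MNIST `train_loader` from the pruning example above; while the prune reparametrization is active, the mask is re-applied on every forward pass, so pruned weights stay at zero and receive no gradient:

```python
import torch
import torch.nn.utils.prune as prune

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for x, y in train_loader:                  # hypothetical MNIST DataLoader
        optimizer.zero_grad()
        logits = model(x.view(x.size(0), -1))  # flatten 28x28 images
        loss_fn(logits, y).backward()          # gradients flow only to unmasked weights
        optimizer.step()

# After fine-tuning, bake the masks in (drops weight_orig / weight_mask).
for module, name in parameters_to_prune:
    prune.remove(module, name)
```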
### Sparse Pattern
- Because pruning globally picks the values closest to zero across all model parameters
- the overall pattern is very random; below is one layer from the MNIST experiment at 80% global sparsity (a per-layer sparsity check is sketched at the end of this section)
- Weight Tensor


- Bias Tensor


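A small sketch (assuming a pruned model like the one above) of how the per-tensor sparsity pattern behind these figures can be inspected:

```python
import torch

def sparsity_report(model: torch.nn.Module) -> None:
    """Print the fraction of exactly-zero entries in every parameter tensor."""
    for name, param in model.named_parameters():
        zeros = int((param == 0).sum())
        print(f"{name:30s} shape={tuple(param.shape)} "
              f"sparsity={zeros / param.numel():.2%}")

# e.g. sparsity_report(model) prints one line per weight/bias tensor.
```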
### Matrix Operations on Sparse Matrices
- To store these nonzero values more efficiently → a sparse matrix needs a special encoding (a small encoding example follows the two formats below)
1. Coordinate List (COO) Format

2. Compressed Sparse Row (CSR) Format

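A small illustration of the two encodings using PyTorch's built-in conversions (the 3x4 matrix is made up for the example):

```python
import torch

dense = torch.tensor([[0., 2., 0., 0.],
                      [0., 0., 0., 1.],
                      [3., 0., 0., 0.]])

# COO: one (row, col) coordinate per nonzero plus a parallel values array.
coo = dense.to_sparse_coo()
print(coo.indices())       # tensor([[0, 1, 2], [1, 3, 0]])
print(coo.values())        # tensor([2., 1., 3.])

# CSR: per-row pointers into the column-index / value arrays.
csr = dense.to_sparse_csr()
print(csr.crow_indices())  # tensor([0, 1, 2, 3]) -> cumulative nnz per row
print(csr.col_indices())   # tensor([1, 3, 0])
print(csr.values())        # tensor([2., 1., 3.])
```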
- The CPU implementation in [pytorch](https://github.com/pytorch/pytorch/blob/f5088d2e45787094e201fe82677c399ce456ef71/aten/src/ATen/native/sparse/SparseMatMul.cpp#L54) currently supports CSR * CSR sparse matrix multiplication
```cpp
template<typename index_t_ptr, typename scalar_t_ptr>
void _csr_matmult(
    const int64_t n_row,
    const int64_t n_col,
    const index_t_ptr Ap,
    const index_t_ptr Aj,
    const scalar_t_ptr Ax,
    const index_t_ptr Bp,
    const index_t_ptr Bj,
    const scalar_t_ptr Bx,
    // NOLINTNEXTLINE(modernize-avoid-c-arrays,cppcoreguidelines-avoid-c-arrays)
    typename index_t_ptr::value_type Cp[],
    // NOLINTNEXTLINE(modernize-avoid-c-arrays,cppcoreguidelines-avoid-c-arrays)
    typename index_t_ptr::value_type Cj[],
    // NOLINTNEXTLINE(modernize-avoid-c-arrays,cppcoreguidelines-avoid-c-arrays)
    typename scalar_t_ptr::value_type Cx[]) {
  /*
    Compute CSR entries for matrix C = A@B.
    The matrices `A` and 'B' should be in proper CSR structure, and their dimensions
    should be compatible.
    Inputs:
      `n_row` - number of rows in A
      `n_col` - number of columns in B
      `Ap[n_row+1]` - row pointer
      `Aj[nnz(A)]` - column indices
      `Ax[nnz(A)]` - nonzeros
      `Bp[?]` - row pointer
      `Bj[nnz(B)]` - column indices
      `Bx[nnz(B)]` - nonzeros
    Outputs:
      `Cp[n_row+1]` - row pointer
      `Cj[nnz(C)]` - column indices
      `Cx[nnz(C)]` - nonzeros
    Note:
      Output arrays Cp, Cj, and Cx must be preallocated
  */
  using index_t = typename index_t_ptr::value_type;
  using scalar_t = typename scalar_t_ptr::value_type;
  std::vector<index_t> next(n_col, -1);
  std::vector<scalar_t> sums(n_col, 0);
  int64_t nnz = 0;
  Cp[0] = 0;
  for (const auto i : c10::irange(n_row)) {
    index_t head = -2;
    index_t length = 0;
    index_t jj_start = Ap[i];
    index_t jj_end = Ap[i + 1];
    for (const auto jj : c10::irange(jj_start, jj_end)) {
      index_t j = Aj[jj];
      scalar_t v = Ax[jj];
      index_t kk_start = Bp[j];
      index_t kk_end = Bp[j + 1];
      for (const auto kk : c10::irange(kk_start, kk_end)) {
        index_t k = Bj[kk];
        sums[k] += v * Bx[kk];
        if (next[k] == -1) {
          next[k] = head;
          head = k;
          length++;
        }
      }
    }
    for (C10_UNUSED const auto jj : c10::irange(length)) {
      // NOTE: the linked list that encodes col indices
      // is not guaranteed to be sorted.
      Cj[nnz] = head;
      Cx[nnz] = sums[head];
      nnz++;
      index_t temp = head;
      head = next[head];
      next[temp] = -1; // clear arrays
      sums[temp] = 0;
    }
    // Make sure that col indices are sorted.
    // TODO: a better approach is to implement a CSR @ CSC kernel.
    // NOTE: Cx arrays are expected to be contiguous!
    auto col_indices_accessor = StridedRandomAccessor<int64_t>(Cj + nnz - length, 1);
    auto val_accessor = StridedRandomAccessor<scalar_t>(Cx + nnz - length, 1);
    auto kv_accessor = CompositeRandomAccessorCPU<
        decltype(col_indices_accessor), decltype(val_accessor)
    >(col_indices_accessor, val_accessor);
    std::sort(kv_accessor, kv_accessor + length, [](const auto& lhs, const auto& rhs) -> bool {
      return get<0>(lhs) < get<0>(rhs);
    });
    Cp[i + 1] = nnz;
  }
}
```
- Compared with dense matrix multiplication, the logic of sparse matrix multiplication is clearly much more complex
- Performance Comparison
- cpu

- gpu

- Observations
    - For both the CPU and GPU algorithms, sparse only has a chance of beating the dense algorithm when the matrices are very small or extremely sparse (a rough timing sketch follows this list)
    - Therefore, today's deep models basically never use sparse matrix multiplication
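Not the exact benchmark behind the plots above, but a rough CPU timing sketch of the same comparison (results vary heavily with hardware, matrix size, sparsity, and PyTorch version):

```python
import time
import torch

def sparsify(n: int, sparsity: float) -> torch.Tensor:
    """Random n x n matrix with roughly the given fraction of exact zeros."""
    m = torch.randn(n, n)
    m[torch.rand(n, n) < sparsity] = 0.0
    return m

def bench(fn, iters: int = 5) -> float:
    fn()  # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

n, sparsity = 2048, 0.9
a, b = sparsify(n, sparsity), sparsify(n, sparsity)
a_csr, a_coo, b_coo = a.to_sparse_csr(), a.to_sparse_coo(), b.to_sparse_coo()

print("dense  @ dense :", bench(lambda: a @ b))
print("sparse @ dense :", bench(lambda: a_csr @ b))                      # CSR x dense
print("sparse @ sparse:", bench(lambda: torch.sparse.mm(a_coo, b_coo)))  # COO x COO
```

On a typical CPU with a tuned BLAS backend, the dense path tends to win at this size and sparsity level, which is consistent with the observation above.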
## Method
1. Replace specific layers with sparse matrices
    - This can save a large amount of memory → (we can first run experiments to verify how much can actually be saved inside a mainstream model, e.g. LLaMA; see the memory-footprint sketch at the end of this section)
    - Since some matrices become sparse while others stay dense → the following three kinds of operations need to be implemented
        1. Sparse x Sparse matrix operation
        2. Sparse x Dense matrix operation
        3. Dense x Sparse matrix operation
    - The algorithms should be designed from a GPU-based (parallel acceleration) point of view
    - In theory this will always be slower than the dense operation, so the goal is to trade an acceptable latency drop for a large enough memory saving
2. Combine with hardware characteristics
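As a starting point for the memory question in item 1, a back-of-the-envelope sketch (hypothetical layer size and sparsity; fp32 values with PyTorch's default int64 indices) comparing the footprint of one weight matrix in dense, COO, and CSR form:

```python
import torch

def footprint_bytes(t: torch.Tensor) -> int:
    """Rough storage cost of a tensor in its current layout."""
    if t.layout == torch.sparse_coo:
        t = t.coalesce()
        return (t.indices().numel() * t.indices().element_size()
                + t.values().numel() * t.values().element_size())
    if t.layout == torch.sparse_csr:
        return (t.crow_indices().numel() * t.crow_indices().element_size()
                + t.col_indices().numel() * t.col_indices().element_size()
                + t.values().numel() * t.values().element_size())
    return t.numel() * t.element_size()  # dense (strided) tensor

n, sparsity = 4096, 0.8        # hypothetical layer size / sparsity
w = torch.randn(n, n)
w[torch.rand(n, n) < sparsity] = 0.0

print("dense:", footprint_bytes(w))
print("coo  :", footprint_bytes(w.to_sparse_coo()))
print("csr  :", footprint_bytes(w.to_sparse_csr()))
```

Note that with fp32 values and int64 indices, COO costs about 20 bytes per stored nonzero versus 4 bytes per element in dense form, so the break-even point is only around 80% sparsity; this is exactly why the savings inside a real model such as LLaMA should be measured first.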
## Study Plan
1. Sparse Attention (Transformer)
    - citation count: 70
    - [[1912.11637] Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection (arxiv.org)](https://arxiv.org/abs/1912.11637)
    - Do not compute every attention entry
2. PyTorch & CUDA Sparse Algorithms
3. Sparse Matrix Multiplication Algorithms
    - [A Systematic Survey of General Sparse Matrix-Matrix Multiplication (arxiv.org)](https://arxiv.org/pdf/2002.11273.pdf)