# Sparse Matrix Operations

## Main Observation & Motivation

### Sparsity in Neural Networks

1. Nearly 80% of the values in the parameters are zero
2. These zero values still occupy a considerable amount of memory

![](https://hackmd.io/_uploads/HyiH5g3MT.png)

### Model Size Trend

1. As large language models advance, models keep getting bigger

    ![](https://hackmd.io/_uploads/rJ7HqehG6.png)
2. Storing them consumes a large amount of VRAM
3. Current mainstream solutions
    1. For users with many GPUs → Megatron-LM (3D parallelism), DeepSpeed
    2. For ordinary users → Quantization, LoRA

### Model Pruning

1. Besides exact zeros, even more of a model's values tend to be close to zero, so pruning can exploit this property
2. Unstructured pruning vs. structured pruning
    1. Unstructured
        - Prune starting from the values closest to zero
        - The closer a value is to zero, the less it affects the model's output
        - The pruned positions are usually distributed randomly → no structure
        - Smaller impact on accuracy
    2. Structured
        - Prunes in a structured way → removes whole rows or columns at a time
        - Larger impact on accuracy, because values that matter to the model may be pruned away
        - This approach is usually not adopted

### Sparsity Performance

- For BERT
    1. Pruning up to 40% has almost no effect on the results
    2. Only above 60% sparsity does a significant accuracy drop appear

    ![](https://hackmd.io/_uploads/BJUI5ehza.png)
- Verification with an MNIST handwritten-digit model (a minimal sketch of this pruning procedure appears at the end of this note)

    | sparsity | original acc | acc after pruning | acc after pruning + fine-tuning |
    | --- | --- | --- | --- |
    | 70% | 0.9604 | 0.9566 | 0.9682 |
    | 80% | 0.9545 | 0.946 | 0.9656 |
    | 90% | 0.9534 | 0.8845 | 0.9615 |
    | 95% | 0.9569 | 0.4923 | 0.9395 |
    | 98% | 0.957 | 0.2158 | 0.8074 |
    | 99% | 0.9565 | 0.0982 | 0.3052 |

1. In most cases, fine-tuning the remaining parameters after pruning recovers the original accuracy level
2. Only at 95~99% sparsity does the model clearly lose its ability to learn

### Sparse Pattern

- Pruning picks the values closest to zero globally across all model parameters
- The overall pattern is therefore very random; below is one layer from the MNIST experiment at 80% global sparsity
- Weight tensor

    ![](https://hackmd.io/_uploads/H1vv9xnM6.png) ![](https://hackmd.io/_uploads/Byswcx3z6.png)
- Bias tensor

    ![](https://hackmd.io/_uploads/HyVu5g2fa.png) ![](https://hackmd.io/_uploads/BJVuqlnfp.png)

### Matrix Operations in Sparse Format

- To store the non-zero values more efficiently, sparse matrices need a special encoding format (a small PyTorch sketch follows this list)
    1. Coordinate List (COO) format ![](https://hackmd.io/_uploads/SkwF5xnfp.png)
    2. Compressed Sparse Row (CSR) format ![](https://hackmd.io/_uploads/BJcY5xhGp.png)
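To make the two encodings concrete, here is a minimal sketch, assuming a PyTorch build with CSR support (roughly 1.10+); the toy matrix is arbitrary and only serves to show the index arrays each format stores.

```python
import torch

# Toy matrix with mostly zero entries.
dense = torch.tensor([
    [0., 2., 0., 0.],
    [3., 0., 0., 4.],
    [0., 0., 0., 0.],
    [0., 5., 6., 0.],
])

# COO: one (row, col) coordinate pair plus one value per non-zero element.
coo = dense.to_sparse()
print(coo.indices())        # shape (2, nnz): row indices and column indices
print(coo.values())         # shape (nnz,): the non-zero values

# CSR: the row coordinates are compressed into a row-pointer array of
# length n_row + 1; the non-zeros of row i live in values[crow[i]:crow[i+1]].
csr = dense.to_sparse_csr()
print(csr.crow_indices())   # tensor([0, 1, 3, 3, 5])
print(csr.col_indices())    # column index of each non-zero
print(csr.values())         # the non-zero values
```

The row-pointer array is what the kernel below relies on: iterating from `Ap[i]` to `Ap[i+1]` walks exactly the non-zeros of row `i` of A.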
- The [pytorch](https://github.com/pytorch/pytorch/blob/f5088d2e45787094e201fe82677c399ce456ef71/aten/src/ATen/native/sparse/SparseMatMul.cpp#L54) CPU backend currently supports CSR × CSR sparse matrix multiplication:

```cpp
template<typename index_t_ptr, typename scalar_t_ptr>
void _csr_matmult(
    const int64_t n_row,
    const int64_t n_col,
    const index_t_ptr Ap,
    const index_t_ptr Aj,
    const scalar_t_ptr Ax,
    const index_t_ptr Bp,
    const index_t_ptr Bj,
    const scalar_t_ptr Bx,
    // NOLINTNEXTLINE(modernize-avoid-c-arrays,cppcoreguidelines-avoid-c-arrays)
    typename index_t_ptr::value_type Cp[],
    // NOLINTNEXTLINE(modernize-avoid-c-arrays,cppcoreguidelines-avoid-c-arrays)
    typename index_t_ptr::value_type Cj[],
    // NOLINTNEXTLINE(modernize-avoid-c-arrays,cppcoreguidelines-avoid-c-arrays)
    typename scalar_t_ptr::value_type Cx[]) {
  /*
    Compute CSR entries for matrix C = A@B.
    The matrices `A` and `B` should be in proper CSR structure,
    and their dimensions should be compatible.

    Inputs:
      `n_row`       - number of rows in A
      `n_col`       - number of columns in B
      `Ap[n_row+1]` - row pointer
      `Aj[nnz(A)]`  - column indices
      `Ax[nnz(A)]`  - nonzeros
      `Bp[?]`       - row pointer
      `Bj[nnz(B)]`  - column indices
      `Bx[nnz(B)]`  - nonzeros
    Outputs:
      `Cp[n_row+1]` - row pointer
      `Cj[nnz(C)]`  - column indices
      `Cx[nnz(C)]`  - nonzeros

    Note:
      Output arrays Cp, Cj, and Cx must be preallocated
  */
  using index_t = typename index_t_ptr::value_type;
  using scalar_t = typename scalar_t_ptr::value_type;

  std::vector<index_t> next(n_col, -1);
  std::vector<scalar_t> sums(n_col, 0);

  int64_t nnz = 0;

  Cp[0] = 0;

  for (const auto i : c10::irange(n_row)) {
    index_t head = -2;
    index_t length = 0;

    index_t jj_start = Ap[i];
    index_t jj_end = Ap[i + 1];
    for (const auto jj : c10::irange(jj_start, jj_end)) {
      index_t j = Aj[jj];
      scalar_t v = Ax[jj];

      index_t kk_start = Bp[j];
      index_t kk_end = Bp[j + 1];
      for (const auto kk : c10::irange(kk_start, kk_end)) {
        index_t k = Bj[kk];

        sums[k] += v * Bx[kk];

        if (next[k] == -1) {
          next[k] = head;
          head = k;
          length++;
        }
      }
    }

    for (C10_UNUSED const auto jj : c10::irange(length)) {
      // NOTE: the linked list that encodes col indices
      // is not guaranteed to be sorted.
      Cj[nnz] = head;
      Cx[nnz] = sums[head];
      nnz++;

      index_t temp = head;
      head = next[head];

      next[temp] = -1; // clear arrays
      sums[temp] = 0;
    }

    // Make sure that col indices are sorted.
    // TODO: a better approach is to implement a CSR @ CSC kernel.
    // NOTE: Cx arrays are expected to be contiguous!
    auto col_indices_accessor = StridedRandomAccessor<int64_t>(Cj + nnz - length, 1);
    auto val_accessor = StridedRandomAccessor<scalar_t>(Cx + nnz - length, 1);
    auto kv_accessor = CompositeRandomAccessorCPU<
      decltype(col_indices_accessor), decltype(val_accessor)
    >(col_indices_accessor, val_accessor);

    std::sort(kv_accessor, kv_accessor + length,
        [](const auto& lhs, const auto& rhs) -> bool {
          return get<0>(lhs) < get<0>(rhs);
        });

    Cp[i + 1] = nnz;
  }
}
```

- Compared with dense matrix multiplication, the logic of sparse matrix multiplication is far more complex: each output row is accumulated through a linked-list sparse accumulator and then sorted by column index
- Performance Comparison
    - CPU ![](https://hackmd.io/_uploads/ByCqcghfp.png)
    - GPU ![](https://hackmd.io/_uploads/rkMo9l3Gp.png)
- Observations (a small benchmark sketch appears at the end of this note)
    - On both GPU and CPU, the sparse algorithm only has a chance of beating the dense one when the matrix is very small or extremely sparse
    - For this reason, today's deep models essentially never use sparse matrix multiplication

## Method

1. Replace specific layers with sparse matrices
    - This can save a large amount of memory → (a first experiment could measure how much can actually be saved inside a mainstream model, e.g. LLaMA)
    - Since some matrices become sparse while others stay dense → the following three kinds of operations need to be implemented
        1. Sparse × Sparse matrix operation
        2. Sparse × Dense matrix operation
        3. Dense × Sparse matrix operation
    - The algorithms should be designed from a GPU-based (parallel-acceleration) point of view
    - In theory they will always be slower than the dense operations, so the goal is to trade an acceptable latency penalty for a large enough memory saving
2. Exploit hardware characteristics

## Study Plan

1. Sparse Attention (Transformer) (~70 citations)
    - [[1912.11637] Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection (arxiv.org)](https://arxiv.org/abs/1912.11637)
    - The idea is to avoid computing attention over every position
2. PyTorch & CUDA sparse algorithms
3. Sparse matrix multiplication algorithms: [A Systematic Survey of General Sparse Matrix-Matrix Multiplication (arxiv.org)](https://arxiv.org/pdf/2002.11273.pdf)
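For reference, a minimal sketch of the kind of global unstructured magnitude pruning used in the MNIST experiment above, assuming `torch.nn.utils.prune`. The MLP architecture and the 80% amount are placeholders, only weight tensors are pruned here (the note above also visualizes bias tensors), and the fine-tuning loop is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder MNIST-style MLP; the architecture in the original experiment may differ.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Weight tensors that take part in global pruning.
params_to_prune = [
    (m, "weight") for m in model.modules() if isinstance(m, nn.Linear)
]

# Unstructured global magnitude pruning: zero out the 80% of weights with the
# smallest absolute value, ranked across all listed tensors at once.
prune.global_unstructured(
    params_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.8,
)

# While the `weight_orig` + `weight_mask` reparametrization is in place, the pruned
# entries stay zero, so fine-tuning of the remaining weights would happen here.
# ... (fine-tuning loop omitted)

# Make the pruning permanent: fold the masks back into `weight`.
for module, name in params_to_prune:
    prune.remove(module, name)

# Measured global weight sparsity after pruning.
weights = [p for p in model.parameters() if p.dim() > 1]
total = sum(p.numel() for p in weights)
zeros = sum((p == 0).sum().item() for p in weights)
print(f"global weight sparsity: {zeros / total:.2%}")
```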
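And as a starting point for study-plan item 2 and for the memory-saving experiment mentioned under Method, a minimal CPU-only sketch that compares dense vs. CSR storage and times the Sparse × Dense case, assuming a PyTorch build where CSR @ dense matmul is available; the size `n = 4096` and the 95% sparsity level are arbitrary choices, and a GPU version would need to move the tensors to CUDA and use whichever sparse layouts the installed build supports there.

```python
import torch
import torch.utils.benchmark as benchmark

n, sparsity = 4096, 0.95

# Random matrix with `sparsity` of its entries zeroed out.
a = torch.randn(n, n)
a[torch.rand(n, n) < sparsity] = 0.0
b = torch.randn(n, n)

a_csr = a.to_sparse_csr()

# Storage: dense keeps n*n values; CSR keeps nnz values, nnz column indices,
# and n_row + 1 row pointers.
dense_bytes = a.numel() * a.element_size()
csr_bytes = (a_csr.values().numel() * a_csr.values().element_size()
             + a_csr.col_indices().numel() * a_csr.col_indices().element_size()
             + a_csr.crow_indices().numel() * a_csr.crow_indices().element_size())
print(f"nnz = {a_csr.values().numel()}")
print(f"dense: {dense_bytes / 2**20:.1f} MiB, CSR: {csr_bytes / 2**20:.1f} MiB")

# Timing: dense @ dense vs. the Sparse x Dense case from the Method section.
t_dense = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b}).blocked_autorange()
t_sparse = benchmark.Timer(stmt="a_csr @ b", globals={"a_csr": a_csr, "b": b}).blocked_autorange()
print(t_dense)
print(t_sparse)
```

With float32 values and int64 indices, each stored non-zero costs 12 bytes instead of 4, so the memory win only materializes at fairly high sparsity, which lines up with the observation above that the sparse kernels also only win in the extremely sparse regime.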