# VIA: A Smart Scratchpad for Vector Units with Application to Sparse Matrix Computations

###### tags: `Accelerators`
###### paper origin: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
###### papers: [link](https://ieeexplore.ieee.org/document/9407226)
###### slides and video: `none`

* Scratchpad
    ![](https://i.imgur.com/Bjy7yPl.png)
* Comparison of compressed storage formats: (a, b) the sparse matrix A in its dense representation under CSR and CSB, respectively; (c) matrix A stored in the CSR format; (d) matrix A stored in the CSB format.
    ![](https://i.imgur.com/8jUm5m2.png)

# 1. INTRODUCTION

## Problem

* Sparse matrix computation challenges (main bottlenecks):
    * **Avoiding inefficient memory indexed instructions.**
        * Vectorized implementation of a CSR-format SpMV using a conventional vector ALU. The figure depicts the flow between the memory, the Vector Register File (VRF), and the Functional Units (FUs).
        * ![](https://i.imgur.com/6nZ5lmt.png)
        * ![](https://i.imgur.com/hfTTNQ7.png)
        * In every iteration, **a gather instruction is executed; this wastes memory bandwidth and hurts the efficiency of traditional vector architectures**.
    * **Computing index matching efficiently.**
        * Vectorizing index matching operations with current state-of-the-art vector ISAs **requires several in-VRF (Vector Register File) data transformations or reordering through memory, which degrades the performance** of the vector ALU.
        * Index matching can also be sidestepped with zero padding, but padding reduces efficiency.

## Solution

* VIA addresses these two bottlenecks with **a smart scratchpad that is tightly coupled to the Vector Functional Units within the core**.

## Contributions

* The VIA design, a novel vector architecture that accelerates sparse matrix computations. VIA requires minimal hardware support, mainly a smart scratchpad memory and a vector unit that accelerates sparse matrix operations.
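As a reference point for the gather bottleneck the Problem section describes, a plain CSR SpMV can be sketched as below. The indexed read of `x` plays the role of the per-iteration gather instruction; function and variable names are ours, a sketch rather than the paper's implementation.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Baseline CSR SpMV sketch. The indexed load x[col_idx[k]]
    models the gather that limits conventional vector units."""
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # gather: one indexed load from x per non-zero in the row
            acc += values[k] * x[col_idx[k]]
        y[r] = acc
    return y
```

For example, A = [[10, 0, 2], [0, 3, 0], [0, 0, 5]] gives the CSR arrays values = [10, 2, 3, 5], col_idx = [0, 2, 1, 2], row_ptr = [0, 2, 3, 4].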
* A detailed design space exploration is performed to size the VIA hardware. **An RTL (Register-Transfer Level) implementation and associated synthesis results confirm that VIA is area- and power-efficient** (0.515 mm² and 0.5 mW, respectively) in a 22nm technology node.
* A rich set of novel instructions to work with VIA. These instructions are **modeled for different Vector Instruction Set Architectures (ISAs)**.
* An exhaustive evaluation with **a full-system cycle-accurate simulator** considering real applications and matrices.
* The evaluation shows that VIA achieves significant speedups of **4.22×, 6.14×, and 6.00× on average for SpMV, Sparse Matrix Addition (SpMA), and Sparse Matrix-Matrix Multiplication (SpMM)**.

# 2. Implementation

* Vectorized implementation of a CSR-format SpMV using a scratchpad memory (SPM). With the SPM, the entire memory bandwidth is used to read the input matrix values.
    ![](https://i.imgur.com/ikdPT9x.png)
    ![](https://i.imgur.com/B32QXxM.png)
* Vectorized implementation of a CSR-format SpMM using a scratchpad memory (SPM) as the building block. Index matching operations are executed while reading the SPM.
    ![](https://i.imgur.com/6i8nYoW.png)
    ![](https://i.imgur.com/TH8nYKK.png)
* VIA building blocks: (a) interconnection between the FIVU, VIA, and the issue logic of an out-of-order core; (b) microarchitecture of the SSPM. It consists of the index tracking mechanism (Index Table, Insert New Idx, and Elements Count), the valid bitmap, and the storage system (SRAMs). Read and write paths are depicted separately. (CAM = Content Addressable Memory)
    ![](https://i.imgur.com/KGCoU5v.png)
* Index Table architecture. The Index Table is implemented as a set of 8-entry banks, and each bank is activated based on the number of tracked elements in the SSPM.
    ![](https://i.imgur.com/cGRDyjt.png)
* FIVU: an extended regular VFU that supports operations with the SSPM.
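A minimal behavioral sketch of the SSPM's two content-mapping modes may help here: direct-map mode keys an entry by `index % n_entries` (sparse-dense kernels), while CAM mode tracks indices associatively (sparse-sparse kernels). The class and method names are ours and this is an idealized model, not the paper's microarchitecture.

```python
class SSPMSketch:
    """Illustrative model of the smart scratchpad (SSPM), assuming:
    - direct-map mode: slot = index % n_entries,
    - CAM mode: a dict stands in for the associative index table,
    - a valid bitmap marks occupied slots.
    The two modes are alternative configurations; do not mix them."""

    def __init__(self, n_entries=64):
        self.n = n_entries
        self.data = [0.0] * n_entries
        self.valid = [False] * n_entries   # valid bitmap
        self.index_table = {}              # CAM-mode index tracking

    # --- direct-map mode ---
    def store_d(self, idx, val):
        slot = idx % self.n
        self.data[slot] = val
        self.valid[slot] = True

    def load_d(self, idx):
        return self.data[idx % self.n]

    # --- CAM mode: match the index, insert a new entry on a miss ---
    def add_c(self, idx, val):
        slot = self.index_table.setdefault(idx, len(self.index_table))
        if self.valid[slot]:
            self.data[slot] += val      # index matched: accumulate in place
        else:
            self.data[slot] = val       # new index: first write
            self.valid[slot] = True

    def load_c(self, idx):
        return self.data[self.index_table[idx]]
```

Note the direct-map mode aliases indices that collide modulo `n_entries`, which is why sizing the SSPM (the design space exploration above) matters.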
    * The Preprocessing 1 and Preprocessing 2 stages read the SSPM, and the Post-processing stage determines whether the operation output is stored in the SSPM or written back directly to the VRF. (VSRC1 is a vector of indices.)
    ![](https://i.imgur.com/HNkn9Mr.png)
* ISA extensions in VIA (AVX2 SIMD ISA in x86-64):
    * vIdxStore.d
    * vIdxLoad.d
    * vIdxStore.c
    * vIdxLoad.i
    * vIdxcount
    * vclear
    * (vIdxAdd, vIdxSub, vIdxMult).X
        * The 'X' suffix is .d for the direct-map mode configuration or .c for the CAM mode.
    * vIdxBlkMult.X
    ![](https://i.imgur.com/U4Cj6OI.png)
* VIA hardware requirements when included in the pipeline of an out-of-order processor.
    ![](https://i.imgur.com/VIjYNbt.png)
    ![](https://i.imgur.com/TtWUqxd.png)
    ![](https://i.imgur.com/qeG1zTB.png)

## EXPERIMENTAL SETUP

* **VIA is modeled and evaluated using gem5**.
* Power consumption is evaluated with McPAT for a **22nm** technology, a voltage of **0.8V**, and the default clock gating scheme.
* The VIA design is implemented in RTL and synthesized with a commercial standard cell library in 22nm technology.
    ![](https://i.imgur.com/o8cYMuE.png)

# 3. Results

* SSPM size configuration speedup comparison for the SpMV, SpMA, and SpMM kernels. Each kernel's results are normalized to its own 4_2p configuration (naming scheme: memorysize_ports).
    ![](https://i.imgur.com/2naZJw2.png)
* Area and power consumption for 4 different SSPM configurations (22nm technology).
    ![](https://i.imgur.com/242l5x1.png)
* Speedup for the VIA SpMV kernel. Results are normalized to the CSR implementation for every category. (The x-axis of Figure 10 shows the median number of non-zero values per block in each category.)
    ![](https://i.imgur.com/nfrD0I2.png)
* Speedup for VIA SpMA and VIA SpMM. Both kernels are normalized to their base CSR implementations.
    ![](https://i.imgur.com/0AMxLPl.png)
* Speedup for the VIA histogram (a) and Gaussian convolution filter (b) kernels.
    ![](https://i.imgur.com/g5dplR3.png)
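To illustrate how the CAM-mode instructions compose in a sparse-sparse kernel, here is a sketch of one SpMA inner step, sparse row addition via index matching. A Python dict stands in for the idealized CAM-mode scratchpad: insert the first row, accumulate matching indices of the second, then drain the tracked indices (this is the role the vIdxStore.c / vIdxAdd.c / vIdxLoad family plays; the function and its structure are ours, not the paper's code).

```python
def sparse_row_add(idx_a, val_a, idx_b, val_b):
    """Sparse-sparse row addition via index matching.
    The dict models an idealized CAM-mode scratchpad:
    no zero padding and no in-register data reordering."""
    cam = {}
    for i, v in zip(idx_a, val_a):         # insert row A (track index, store value)
        cam[i] = v
    for i, v in zip(idx_b, val_b):         # match-and-accumulate row B
        cam[i] = cam.get(i, 0.0) + v
    out_idx = sorted(cam)                  # drain the scratchpad in index order
    return out_idx, [cam[i] for i in out_idx]
```

Only matching indices are combined; non-matching indices simply become new entries, which is the case zero padding handles wastefully in the baseline.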
# 4. Conclusion

* In this paper, we introduce VIA, a specialized vector architecture that significantly improves the performance of sparse linear algebra computations. The main goals of VIA are (1) to reduce the memory traffic incurred by memory indexed operations, and (2) to improve the efficiency of vector architectures on index matching operations. To this end, we develop a smart scratchpad memory specifically designed to tackle both issues. This scratchpad memory uses two different content mapping techniques for the two execution scenarios of sparse-dense and sparse-sparse computations. As a result, VIA greatly reduces the performance overheads of memory indexed operations and index matching operations. The rich set of new VIA instructions provides a simple and general interface to program the hardware, facilitating its adoption in any SIMD ISA on the market. Our evaluation over a diverse set of 1,024 matrices from real applications demonstrates that VIA significantly improves the performance of SpMV, SpMA, and SpMM compared to different state-of-the-art solutions. In addition, we demonstrate the generality and applicability of VIA to other important kernels that share characteristics with sparse matrices: our evaluation with histogram and stencil computations demonstrates the effectiveness of VIA in applications with irregular memory access patterns. Moreover, we believe that VIA is applicable to other application domains such as graph computing and bioinformatics.

# 5. Discussion

* Very good.