# VIA: A Smart Scratchpad for Vector Units with Application to Sparse Matrix Computations
###### tags: `Accelerators`
###### paper origin: 2021 HPCA
###### papers: [link](https://ieeexplore.ieee.org/document/9407226)
## 1. Main Problem
### Challenge 1: Avoiding inefficient memory indexed instructions.



* In every iteration of the sparse kernel, a gather instruction must be executed; these indexed loads waste memory bandwidth and limit the efficiency of traditional vector architectures.
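
To make the pattern concrete, below is a minimal scalar CSR SpMV sketch in C. The names (`row_ptr`, `col_idx`, `vals`) are conventional CSR fields, not identifiers from the paper; the indexed load `x[col_idx[k]]` is what a vectorized version turns into a gather on every iteration.

```c
#include <stddef.h>

/* Minimal CSR SpMV sketch: y = A * x. Illustrative only. */
void spmv_csr(size_t n_rows, const size_t *row_ptr,
              const size_t *col_idx, const double *vals,
              const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double acc = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            /* x[col_idx[k]] is the indexed load: a vectorized version
             * issues a gather instruction here on every iteration. */
            acc += vals[k] * x[col_idx[k]];
        }
        y[i] = acc;
    }
}
```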
### Challenge 2: Computing index matching efficiently.
* Vectorizing index-matching operations with current state-of-the-art vector ISAs requires several in-VRF (Vector Register File) data transformations or reordering through memory, which hurts the utilization of the vector ALU.
* -> The alternative, zero padding, wastes vector lanes on zero elements (efficiency reduced).
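
As a rough illustration of the cost, here is a scalar sketch (in C, with illustrative names) of the index-matching pattern at the core of sparse-sparse kernels: a two-pointer merge over two sorted index/value lists. Vectorizing this merge with a conventional SIMD ISA is what forces the in-VRF reordering or zero padding described above.

```c
#include <stddef.h>

/* Sparse dot product of two index/value lists, both sorted by index.
 * Illustrative sketch of the index-matching pattern only. */
double sparse_dot(const size_t *idx_a, const double *val_a, size_t len_a,
                  const size_t *idx_b, const double *val_b, size_t len_b)
{
    double acc = 0.0;
    size_t i = 0, j = 0;
    while (i < len_a && j < len_b) {
        if (idx_a[i] == idx_b[j])           /* indices match: multiply */
            acc += val_a[i++] * val_b[j++];
        else if (idx_a[i] < idx_b[j])       /* advance the smaller index */
            i++;
        else
            j++;
    }
    return acc;
}
```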
## 2. Solution
* Scratchpad Memories (SPM) are high-speed internal storage structures. Scratchpads have simpler allocation and control mechanisms compared to general cache memories and can be implemented with different mapping techniques.
* The two challenges above (Section 1) can be appropriately addressed using scratchpad memories.
## 3. SPM Design Idea
### Direct-Mapped SPM for Sparse-to-Dense Transformation:
**scenario: SpMV where a sparse matrix is multiplied by a dense vector**
* A direct-mapped SPM can provide enough locality to accelerate the dense data read process with **minimum area overhead** due to the simplicity of scratchpad mapping control.

::: info
**STEP:**
1. The first step loads the input vector into the SPM.
2. Then, we read the values and column indices from row 0; the column indices are used to map and read data from the SPM.
3. Data from the VRF and SPM is multiplied and reduced to generate the output. This output result is stored back to the SPM.
4. Then, the next row is read from memory and computed in the same way.
:::
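
A behavioral sketch (plain C, not the paper's hardware) of these four steps, modeling the direct-mapped SPM as a software array indexed directly by column index; `spm`, `SPM_WORDS`, and the function name are illustrative assumptions:

```c
#include <stddef.h>

#define SPM_WORDS 4096          /* assumed SPM capacity in elements */

static double spm[SPM_WORDS];   /* models the direct-mapped SPM */

/* Behavioral sketch of the SpMV flow above (assumes n_cols <= SPM_WORDS).
 * Because the dense vector lives in the SPM, every indexed read becomes
 * an on-chip SPM access instead of a memory gather. The paper also
 * accumulates outputs in the SPM; here we write to y for brevity. */
void spmv_via_sketch(size_t n_rows, size_t n_cols,
                     const size_t *row_ptr, const size_t *col_idx,
                     const double *vals, const double *x, double *y)
{
    /* Step 1: load the dense input vector into the SPM. */
    for (size_t c = 0; c < n_cols; c++)
        spm[c] = x[c];

    /* Steps 2-4: stream each row's values and column indices;
     * the column indices address the SPM directly. */
    for (size_t i = 0; i < n_rows; i++) {
        double acc = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += vals[k] * spm[col_idx[k]];
        y[i] = acc;             /* multiply-and-reduce result for row i */
    }
}
```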
* The idea of processing the index operations for the dense vector in the SPM instead of accessing the memory hierarchy reduces the overall memory traffic, thus increasing the memory bandwidth utilization to stream the input sparse matrix.
### CAM-based SPM for Sparse-to-Sparse Operations:
**scenario: Challenge 2 manifests when computing with two sparse structures (e.g., SpMA and SpMM).**
* In our approach, allocation and deletion are done in order, whereas a typical out-of-order issue queue allocates and deletes CAM payload data in an out-of-order manner, which complicates the design.

::: info
**STEP:**
1. The SPM stores only the indices and values of the input row. In a first step, we load the input row into the SPM.
2. Then, the row indices and values from the input column are read into the VRF.
3. Next, we use the row indices from the input column to read the SPM. This read performs the index matching against the column indices from the input row.
4. The values from those indices that match are then multiplied and reduced in the FUs.
5. Finally, in a similar way to the SpMV kernel, we accumulate the output results in the SPM to reduce memory bandwidth utilization.
:::
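
A behavioral sketch of these five steps in C, emulating the CAM lookup with a linear scan over the stored indices (the hardware does this match associatively); `CAM_ENTRIES` and all names are illustrative assumptions:

```c
#include <stddef.h>

#define CAM_ENTRIES 64          /* assumed number of tracked entries */

/* Software model of the CAM-mode SSPM: index/value pairs, allocated
 * in order, as described above. */
static size_t cam_idx[CAM_ENTRIES];
static double cam_val[CAM_ENTRIES];
static size_t cam_count = 0;

/* Step 1: load the input row (indices + values) into the SSPM. */
void cam_load_row(const size_t *idx, const double *val, size_t len)
{
    cam_count = (len < CAM_ENTRIES) ? len : CAM_ENTRIES;
    for (size_t i = 0; i < cam_count; i++) {
        cam_idx[i] = idx[i];
        cam_val[i] = val[i];
    }
}

/* Steps 3-4: look up one row index from the input column. The hardware
 * CAM performs this match associatively; here it is a linear scan.
 * Returns the partial product, or 0.0 when no index matches. */
double cam_match_mult(size_t row_index, double col_value)
{
    for (size_t i = 0; i < cam_count; i++)
        if (cam_idx[i] == row_index)
            return cam_val[i] * col_value;
    return 0.0;
}
```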
## 4. SPM Design Implementation
### The Smart Scratchpad Memory

The SSPM consists of three components:
1. the SRAM cells that store the actual data;
2. the valid bitmap that indicates whether an entry in the SRAM has been written before; and
3. the index-tracking logic that provides the CAM-based functionality of the SSPM.
::: success
1. The SRAM is built from **4**-byte blocks, and each block stores a single value regardless of the data length.
2. The SRAM operates in two modes: direct-mapped and CAM.
:::
**Valid bitmap**: This structure is used in direct-mapped mode as a written-value indicator for the entries in the SRAM.
1. Writing in direct-mapped mode: When a value is written to the SSPM, the corresponding bit in the valid bitmap is set.
2. Reading in direct-mapped mode: When a value is read from the SSPM, the corresponding bit in the valid bitmap is checked. If it is set, the value from the SRAM is returned; otherwise a zero value is returned.
3. Clearing SSPM contents in direct-mapped mode: Either the full valid bitmap or a segment of it is cleared by flash zeroing. This operation is accomplished with a single instruction.
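
A small C model of these three operations, with assumed sizes and names; the hardware flash-zeroes the bitmap in a single instruction, which the `memset` below only approximates:

```c
#include <stdint.h>
#include <string.h>

#define SPM_WORDS 4096          /* assumed number of SSPM entries */

/* Model of the direct-mapped SSPM: SRAM words plus a valid bitmap
 * (one bit per entry). Sizes and names are illustrative. */
static uint32_t sram[SPM_WORDS];
static uint64_t valid[SPM_WORDS / 64];

/* 1. Write: store the value and set the entry's valid bit. */
void sspm_write(uint32_t addr, uint32_t value)
{
    sram[addr] = value;
    valid[addr / 64] |= (uint64_t)1 << (addr % 64);
}

/* 2. Read: return the stored value if its valid bit is set, otherwise
 * zero -- this is what makes unwritten entries read as implicit zeros. */
uint32_t sspm_read(uint32_t addr)
{
    uint64_t bit = (valid[addr / 64] >> (addr % 64)) & 1;
    return bit ? sram[addr] : 0;
}

/* 3. Clear: zero the bitmap; the SRAM contents need not be touched. */
void sspm_clear(void)
{
    memset(valid, 0, sizeof valid);
}
```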
### Index table architecture
* Our Index Table is implemented as a set of 8-entry banks, and each bank is activated based on the number of tracked elements in the SSPM.

### FIVU
* The FIVU is a regular VFU extended to support operations with the SSPM. The Pre-processing 1 and Pre-processing 2 stages read the SSPM, and the Post-processing stage determines whether the operation output is stored in the SSPM or written back directly to the VRF. (VSRC1 is a vector of indices.)

### ISA Extensions in VIA (AVX2 SIMD ISA in x86-64)
* vIdxStore.d
* vIdxLoad.d
* vIdxStore.c
* vIdxLoad.i
* vIdxcount
* vclear
* (vIdxAdd, vIdxSub, vIdxMult).X
    * The `X` suffix can be `.d` for the direct-mapped mode configuration or `.c` for the CAM mode.
* vIdxBlkMult.X
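
As a rough illustration, here is a C behavioral model of two of these instructions. The semantics are inferred from this summary only (vector length, SSPM size, and lane-wise behavior are assumptions, not the paper's formal ISA definitions):

```c
#include <stddef.h>

#define VLEN 8                  /* assumed vector length in elements */
#define SPM_WORDS 4096          /* assumed SSPM capacity */

static double spm[SPM_WORDS];   /* direct-mapped SSPM model */

/* Sketch of vIdxMult.d as we read it from this note: for each lane,
 * read the SSPM at the index in vsrc1 and multiply it with the value
 * in vsrc2. */
void vIdxMult_d(const size_t vsrc1[VLEN], const double vsrc2[VLEN],
                double vdst[VLEN])
{
    for (size_t lane = 0; lane < VLEN; lane++)
        vdst[lane] = spm[vsrc1[lane]] * vsrc2[lane];
}

/* Sketch of vIdxStore.d: scatter vsrc2's values into the SSPM at the
 * per-lane indices held in vsrc1. */
void vIdxStore_d(const size_t vsrc1[VLEN], const double vsrc2[VLEN])
{
    for (size_t lane = 0; lane < VLEN; lane++)
        spm[vsrc1[lane]] = vsrc2[lane];
}
```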

* The paper also details the VIA hardware requirements when VIA is included in the pipeline of an out-of-order processor.



## 5. EXPERIMENTAL SETUP
* **Model and evaluate VIA using Gem5**.
* Power consumption is evaluated with McPAT for **22nm** technology, a voltage of **0.8V** and the default clock gating scheme.
* VIA design is implemented in RTL and synthesized on a commercial standard cell library in 22nm technology.

* SSPM size-configuration speedup comparison for the SpMV, SpMA, and SpMM kernels. Each kernel's results are normalized to its own 4_2p configuration (naming scheme: memory size _ number of ports).

* Area and power consumption for four different SSPM configurations (22 nm technology).

* Speedup for the VIA SpMV kernel. Results are normalized to the CSR implementation for every category. (The x-axis of Figure 10 in the paper shows the median number of non-zero values per block in each category.)

* Speedup for VIA SpMA and VIA SpMM. Both kernels are normalized to their base CSR implementations.

* Speedup for the VIA histogram (a) and Gaussian convolution filter (b) kernels.

## 6. Conclusion
* In this paper, we introduce VIA, a specialized vector architecture that significantly improves performance on sparse linear algebra computations. The main goals of VIA are (1) to reduce the memory traffic incurred by memory indexed operations, and (2) to improve the efficiency of vector architectures on index matching operations. To this end, we develop a smart scratchpad memory specifically designed to tackle both issues, using two different content-mapping techniques for the two execution scenarios of sparse-dense and sparse-sparse computations. As a result, VIA greatly reduces the performance overheads of memory indexed operations and index matching operations.
* The rich set of new VIA instructions provides a simple and general interface to program the hardware, facilitating its adoption in any SIMD ISA on the market.
* Our evaluation over a diverse set of 1,024 matrices from real applications demonstrates that VIA significantly improves the performance of SpMV, SpMA, and SpMM compared to different state-of-the-art solutions. In addition, we demonstrate the generality and applicability of VIA to other important kernels that share characteristics with sparse matrices: our evaluation with histogram and stencil computations shows the effectiveness of VIA in applications with irregular memory access patterns. Moreover, we believe VIA is applicable to other application domains such as graph computing and bioinformatics.