# VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs

##### Paper origin: HPCA 2023
##### Paper: [link](https://arxiv.org/pdf/2302.08687.pdf)

[TOC]

## Introduction
* Deep learning (DL) is used across domains including computer vision, recommendation systems, and natural language processing.
* In some deployments, CPUs are more suitable than GPUs/accelerators as the primary processor for running deep neural networks (DNNs).

### Problems
* The compute demand on CPUs keeps growing because matrix multiplications dominate DNN workloads.

### Solutions
* Deploy dense matrix engines alongside the conventional scalar and vector engines to accelerate GEMM (general matrix multiplication), which is at the core of DL models.
* The paper proposes VEGETA, Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs: ISA and engine extensions that add flexible N:M structured-sparsity support on top of such matrix engines.

## Background
* Sparsity pattern. N:M structured sparsity keeps at most N nonzero elements in every group of M consecutive elements.
![](https://hackmd.io/_uploads/BJ3gqiQfT.png)
* VEGETA registers. Sparse tile registers hold the nonzero values together with per-element metadata (offsets within each group); a compression sketch is given at the end of these notes.
![](https://hackmd.io/_uploads/BJEEcsXGp.png)
![](https://hackmd.io/_uploads/H1bIqimMa.png)
* VEGETA instructions. ISA extensions for loading/storing tiles and performing sparse-dense tile GEMM (e.g., `TILE_SPMM`).
![](https://hackmd.io/_uploads/rkp_cimfa.png)

## Architecture
* Processing Unit (PU): a PU is composed of a number of MAC units.
* Processing Element (PE): PUs that share the same eastbound inputs and output buffers are grouped into a PE.
![](https://hackmd.io/_uploads/S14ujsmzp.png)
![](https://hackmd.io/_uploads/r1bk3imf6.png)
![](https://hackmd.io/_uploads/HJGkhiQGT.png)
* Cycle-level visualization of a `TILE_SPMM_V` instruction on VEGETA-S-2-2 with 1:4 structured sparsity for matrix A, with dimensions A: 16×128 (yellow), B: 128×16 (magenta), C: 16×16 (green). A software sketch of what such an instruction computes appears at the end of these notes.
![](https://hackmd.io/_uploads/SkOtnoQfp.png)
![](https://hackmd.io/_uploads/HysKhoQGp.png)
* Overview of VEGETA in a CPU; the parts containing the authors' contributions are highlighted in red.
![](https://hackmd.io/_uploads/B1d33smGa.png)

## Evaluation
* Row-wise achieves 2.36× and 3.28× speedups at 90% and 95% sparsity, respectively. SIGMA performs better than the others at extremely high sparsity degrees (>95%), but it is inefficient at the modest sparsity degrees targeted by this work, indicating that its extra area overhead is not useful there.
* Area and power (normalized to RASA-SM) and frequency for the different VEGETA engines.
![](https://hackmd.io/_uploads/Hy8W6oXM6.png)

## Conclusions
* The VEGETA architecture adds flexible N:M structured-sparsity support to CPU matrix engines via extensions to the ISA and the engine itself.
* The paper explores different VEGETA engine design choices to understand the trade-offs in performance and area.
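
## Appendix: N:M sparsity sketches

To make the N:M format concrete, here is a minimal Python/NumPy sketch (not from the paper) of pruning a matrix to N:M structured sparsity and packing it into the values-plus-metadata layout that a VEGETA sparse tile register conceptually holds. The helper names (`prune_n_m`, `compress_n_m`) and the exact metadata encoding are illustrative assumptions, not the paper's format.

```python
import numpy as np

def prune_n_m(a: np.ndarray, n: int, m: int) -> np.ndarray:
    """Zero out all but the n largest-magnitude entries in each group
    of m consecutive elements along every row (N:M structured pruning)."""
    out = a.copy()
    rows, cols = a.shape
    assert cols % m == 0
    for r in range(rows):
        for g in range(0, cols, m):
            group = out[r, g:g + m]              # view into the row
            drop = np.argsort(np.abs(group))[: m - n]
            group[drop] = 0.0                    # drop the smallest m - n
    return out

def compress_n_m(a: np.ndarray, n: int, m: int):
    """Pack an N:M-pruned matrix into (values, indices):
    values[r, k]  = the k-th kept entry of row r,
    indices[r, k] = its offset (0..m-1) within its group of m.
    This mirrors, at a high level, nonzeros + metadata in a sparse tile."""
    rows, cols = a.shape
    groups = cols // m
    values = np.zeros((rows, groups * n), dtype=a.dtype)
    indices = np.zeros((rows, groups * n), dtype=np.int8)
    for r in range(rows):
        for g in range(groups):
            nz = np.flatnonzero(a[r, g * m:(g + 1) * m])[:n]
            for k, off in enumerate(nz):
                values[r, g * n + k] = a[r, g * m + off]
                indices[r, g * n + k] = off
    return values, indices
```

With 1:4 sparsity (as in the cycle-level figure), the packed tile carries one value and one 2-bit-representable offset per group of four, a 4× reduction in stored values.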
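
Building on the helpers above, the next sketch mimics the arithmetic of a `TILE_SPMM`-style instruction: each stored index selects which of the M B-rows in its group to multiply against, so only N of every M multiply-accumulates are performed. This is a behavioral model under the assumed layout, not the engine's actual PE dataflow.

```python
import numpy as np

def spmm_compressed(values, indices, b, n, m):
    """C = A_sparse @ B, with A in the (values, indices) N:M format
    produced by compress_n_m above. A is logically rows x (groups*m);
    B is (groups*m) x bcols."""
    rows, packed = values.shape
    groups = packed // n
    bcols = b.shape[1]
    c = np.zeros((rows, bcols), dtype=np.result_type(values, b))
    for r in range(rows):
        for g in range(groups):
            for k in range(n):
                v = values[r, g * n + k]
                if v != 0.0:
                    # metadata picks one of the m B-rows in this group
                    c[r] += v * b[g * m + indices[r, g * n + k]]
    return c

# Usage: verify against a dense matmul on a 2:4-pruned matrix.
rng = np.random.default_rng(0)
a = prune_n_m(rng.standard_normal((16, 128)), 2, 4)
b = rng.standard_normal((128, 16))
vals, idx = compress_n_m(a, 2, 4)
assert np.allclose(spmm_compressed(vals, idx, b, 2, 4), a @ b)
```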