# SparseDNN: Fast Sparse Deep Learning Inference on CPUs
## Abstract
The last few years have seen gigantic leaps in algorithms and systems to support efficient deep learning inference. Pruning and quantization algorithms can now consistently compress neural networks by an order of magnitude. For a compressed neural network, a multitude of inference frameworks have been designed to maximize the performance of the target hardware. While we find mature support for quantized neural networks in production frameworks such as OpenVINO and MNN, support for pruned sparse neural networks is still lacking. To tackle this challenge, we present SparseDNN, a sparse deep learning inference engine targeting CPUs.
SparseDNN is an **inference engine** that efficiently supports unstructured sparse deep learning inference. The design of SparseDNN is inspired by state-of-the-art inference systems for dense neural networks such as TensorRT, OpenVINO, and MNN. It leverages both novel network-level optimizations, such as weight permutation, and kernel-level optimizations through a sparse kernel generator that achieves significantly better performance than Libxsmm and SkimCaffe.
## SparseDNN Structure

To set up a neural network for inference in SparseDNN, the network is first implemented using SparseDNN's C++ API. This approach was chosen over parsing a higher-level representation from TensorFlow or PyTorch because of its flexibility; existing high-level frameworks also typically have only incomplete, experimental support for sparse operators.
SparseDNN adopts a preparation-execution paradigm. During the preparation stage, SparseDNN performs network-level optimizations and selects the best kernel implementation for each layer. In the execution stage, SparseDNN allows for both synchronous and asynchronous inference.
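The exact C++ API is not reproduced here; the following is a minimal sketch of what the preparation-execution flow could look like through an API of this kind. All class and method names (`Network`, `add_sparse_linear`, `add_relu`, `compile`, `run`) are hypothetical placeholders, not the actual SparseDNN interface.

```cpp
// Hypothetical sketch of SparseDNN's preparation-execution flow.
// Class and method names are illustrative placeholders, not the real API.
#include <vector>

struct Tensor {
    std::vector<float> data;
    std::vector<int> shape;
};

class Network {
public:
    // Preparation stage: declare the layers, then compile once.
    void add_sparse_linear(const Tensor& sparse_weights) { weights_.push_back(sparse_weights); }
    void add_relu() { /* recorded and later fused into the preceding layer */ }
    void compile() { /* weight permutation, buffer reuse, kernel selection happen here */ }

    // Execution stage: called many times; synchronous here, asynchronous also supported.
    Tensor run(const Tensor& input) const { return input; /* placeholder for real inference */ }

private:
    std::vector<Tensor> weights_;
};

int main() {
    Network net;
    Tensor pruned_weights;              // unstructured-sparse weight matrix from pruning
    net.add_sparse_linear(pruned_weights);
    net.add_relu();
    net.compile();                      // one-time, ahead-of-time optimization

    Tensor input;
    Tensor output = net.run(input);     // low-latency execution stage
    (void)output;
    return 0;
}
```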
### Network Optimizations
SparseDNN supports standard layer fusion strategies such as fusing a convolution layer and a subsequent element-wise layer. SparseDNN also finds the best kernel implementation for each operator in the neural network.
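As a concrete, simplified illustration of this kind of fusion, the sketch below folds a bias add and a ReLU into the epilogue of a matrix-vector product instead of writing an intermediate activation tensor to memory. The function is illustrative rather than a SparseDNN internal.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Unfused: y = W * x, then z = relu(y + b), which writes and re-reads the intermediate y.
// Fused:   the element-wise operations run while each output value is still in a register.
void fused_matvec_bias_relu(const std::vector<std::vector<float>>& W,
                            const std::vector<float>& x,
                            const std::vector<float>& b,
                            std::vector<float>& y) {
    for (std::size_t i = 0; i < W.size(); ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < x.size(); ++j) acc += W[i][j] * x[j];
        y[i] = std::max(acc + b[i], 0.0f);   // bias + ReLU fused into the epilogue
    }
}
```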
#### Activation Buffer Reuse
Activation buffer reuse also has a large impact on sparse inference performance. Sparse operators are typically memory-bound rather than compute-bound; as a result, their performance can often be substantially improved when their inputs and outputs physically reside in cache rather than DRAM.
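A common way to realize this optimization is a greedy assignment of activations to a small pool of physical buffers based on their lifetimes. The sketch below illustrates the idea and is not necessarily the exact algorithm used by SparseDNN.

```cpp
#include <cstddef>
#include <vector>

// Greedy activation-to-buffer assignment. Each activation is live from the layer
// that produces it until the last layer that consumes it; reusing a buffer whose
// previous occupant is already dead keeps hot activations in the same cache lines
// instead of touching fresh DRAM.
struct Lifetime {
    int start;  // index of the producing layer
    int end;    // index of the last consuming layer
};

std::vector<int> assign_buffers(const std::vector<Lifetime>& acts) {
    std::vector<int> assignment(acts.size());
    std::vector<int> buffer_free_at;                    // layer after which each buffer is free
    for (std::size_t i = 0; i < acts.size(); ++i) {     // activations in topological order
        int chosen = -1;
        for (std::size_t b = 0; b < buffer_free_at.size(); ++b) {
            if (buffer_free_at[b] < acts[i].start) {    // previous occupant no longer live
                chosen = static_cast<int>(b);
                break;
            }
        }
        if (chosen < 0) {                               // no free buffer: allocate a new one
            chosen = static_cast<int>(buffer_free_at.size());
            buffer_free_at.push_back(acts[i].end);
        } else {
            buffer_free_at[static_cast<std::size_t>(chosen)] = acts[i].end;
        }
        assignment[i] = chosen;
    }
    return assignment;
}
```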
#### Weight Permutation
For sparse operators, permuting the rows of the sparse weight matrix can substantially change the operator’s performance by changing the sparsity pattern. There have been many recent works in the scientific computing community that examine beneficial reorderings to apply to a sparse matrix.
* Algorithm: SparseDNN chooses the permutation with a greedy algorithm (a sketch follows below).
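The sketch below applies one plausible greedy criterion: ordering rows by nonzero count so that rows packed into the same output tile carry a similar amount of work. The actual SparseDNN greedy algorithm may use a different objective; this is only an illustration of the idea.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// One plausible greedy criterion for row permutation: sort rows by nonzero count
// so that rows grouped into the same output tile do similar work. The permutation
// is applied to the weight matrix offline; at runtime the outputs are simply read
// back in un-permuted order.
std::vector<int> greedy_row_permutation(const std::vector<std::vector<int>>& nnz_cols_per_row) {
    std::vector<int> perm(nnz_cols_per_row.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::sort(perm.begin(), perm.end(), [&](int a, int b) {
        return nnz_cols_per_row[static_cast<std::size_t>(a)].size()
             < nnz_cols_per_row[static_cast<std::size_t>(b)].size();
    });
    return perm;
}
```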

### Sparse Code Generator
After network optimizations, SparseDNN attempts to map each operator in the network to an optimized kernel implementation. Kernels for sparse operators are generated by a sparse code generator at this stage. We recognize that most operators in deep learning can, at their core, be represented as a matrix multiplication, so the sparse code generator focuses on generating high-performance SpMM routines. Sparse convolutions are supported by converting them into an SpMM problem with the direct convolution method, as illustrated below.
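As a simplified illustration of the conversion, the sketch below handles the easiest case, a 1×1 convolution, which is exactly an SpMM between the sparse filter (C_out × C_in) and the activations viewed as a dense C_in × (H·W) matrix. For K×K filters, the direct-convolution view repeats this product once per kernel offset with spatially shifted input pointers (boundary handling omitted here). This illustrates the mapping only; it is not SparseDNN's generated code.

```cpp
#include <cstddef>
#include <vector>

// A 1x1 convolution is exactly an SpMM: sparse filter A (C_out x C_in, CSR)
// times the activations B viewed as a dense (C_in x H*W) row-major matrix.
// KxK filters repeat this product per kernel offset with shifted B pointers.
void conv1x1_as_spmm(const std::vector<int>& row_ptr,     // CSR row pointers of A
                     const std::vector<int>& col_idx,     // CSR column indices (input channels)
                     const std::vector<float>& vals,      // CSR nonzero filter values
                     const float* B, int HW,              // activations, C_in x HW
                     float* C, int C_out) {               // output, C_out x HW, pre-zeroed
    for (int oc = 0; oc < C_out; ++oc) {
        float* c_row = C + static_cast<long>(oc) * HW;
        for (int p = row_ptr[oc]; p < row_ptr[oc + 1]; ++p) {
            const float a = vals[static_cast<std::size_t>(p)];
            const float* b_row = B + static_cast<long>(col_idx[static_cast<std::size_t>(p)]) * HW;
            for (int j = 0; j < HW; ++j) c_row[j] += a * b_row[j];
        }
    }
}
```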
#### High Performance Dense GeMM
High performance GeMM kernels typically start with a highly tuned microkernel which performs a small matrix multiplication using the SIMD instructions of the CPU.
With this tuned microkernel in hand, the GeMM problem reduces to deciding how to decompose the computation into a sequence of microkernel calls. The implementation then typically optimizes for memory locality through tiling. Tiling breaks the matrix multiplication into chunks small enough to fit into the CPU's on-chip caches, which typically offer access an order of magnitude faster than off-chip main memory.
* Tiling Illustration:
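The sketch below shows this structure in scalar C++ for clarity: a small MR×NR microkernel that keeps its output tile in local accumulators (a real microkernel would use SIMD/FMA intrinsics), plus outer loops that decompose the full product into microkernel calls. Packing of A and B panels is omitted, and the dimensions are assumed to be multiples of the tile sizes to keep the sketch short.

```cpp
// Register-tiled GEMM sketch: C += A * B with row-major M x K, K x N, M x N matrices.
// Assumes M % MR == 0 and N % NR == 0; packing and edge cases are omitted.
constexpr int MR = 4, NR = 8;

// Computes one MR x NR output tile; acc[][] stands in for the SIMD register block.
void microkernel(const float* A, int lda, const float* B, int ldb,
                 float* C, int ldc, int K) {
    float acc[MR][NR] = {};                         // output tile stays in "registers"
    for (int k = 0; k < K; ++k)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += A[i * lda + k] * B[k * ldb + j];
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            C[i * ldc + j] += acc[i][j];
}

// The full GEMM is decomposed into a grid of microkernel calls over output tiles.
void gemm(const float* A, const float* B, float* C, int M, int N, int K) {
    for (int i = 0; i < M; i += MR)
        for (int j = 0; j < N; j += NR)
            microkernel(A + i * K, K, B + j, N, C + i * N + j, N, K);
}
```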

#### SpMM Strategy
In SpMM, we adopt the same general strategy as dense matrix multiplication: a sparse weight matrix A is multiplied by a dense input matrix B. Our sparse microkernel is based on the dense microkernel described above. Since A is sparse, only the nonzero elements in each A column slice need to be processed. Since B is dense, the vector registers that load the B row slice are fully packed.
For each A column slice, we iterate over the nonzeros. We broadcast each nonzero to a vector register, multiply it with the vector registers holding the B row slice, and accumulate the result to the output tile. Although the accumulation position depends on the random nonzero position, we can use registers for the output tile by statically encoding the accumulation position in the vector register name.
Similar to dense GeMM, we also employ tiling in SpMM. Since the nonzero locations in an A column slice can be arbitrary under unstructured sparsity, we can expect random memory accesses into the dense B matrix; tiling keeps the accessed portion of B small enough to stay in cache, so these random accesses remain cheap.
* SpMM Strategy Illustration:
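A rough sketch of the broadcast-and-accumulate pattern in AVX2/FMA intrinsics is given below. It processes one output tile of TILE_ROWS rows by 8 float columns, with A stored column-wise (CSC-like) so the loop walks each A column slice and touches only nonzeros. In SparseDNN's generated code, the accumulator for each nonzero is a fixed register name chosen at code-generation time; the runtime array indexing here merely stands in for that. This is an illustrative sketch (build with AVX2 and FMA enabled, e.g. `-mavx2 -mfma`), not the generator's actual output.

```cpp
#include <immintrin.h>

// One output tile of C = A * B: TILE_ROWS rows by 8 float columns (one AVX register).
// A is sparse and stored column-wise: for each column k, row_idx/vals list the
// nonzeros, with row indices local to this tile (0 .. TILE_ROWS-1).
// Each nonzero is broadcast and FMA'd against a fully packed slice of the dense B row.
constexpr int TILE_ROWS = 4;

void spmm_tile(const int* col_ptr, const int* row_idx, const float* vals,  // sparse A
               int K,                                                      // inner dimension
               const float* B, int ldb,                                    // dense B, row-major
               float* C, int ldc) {                                        // dense C, row-major
    __m256 acc[TILE_ROWS];
    for (int i = 0; i < TILE_ROWS; ++i) acc[i] = _mm256_setzero_ps();

    for (int k = 0; k < K; ++k) {                                  // walk A column slices
        const __m256 b = _mm256_loadu_ps(B + static_cast<long>(k) * ldb);  // packed B row slice
        for (int p = col_ptr[k]; p < col_ptr[k + 1]; ++p) {        // nonzeros of this column only
            const __m256 a = _mm256_broadcast_ss(vals + p);        // broadcast the nonzero
            acc[row_idx[p]] = _mm256_fmadd_ps(a, b, acc[row_idx[p]]);
        }
    }
    for (int i = 0; i < TILE_ROWS; ++i)
        _mm256_storeu_ps(C + static_cast<long>(i) * ldc, acc[i]);
}
```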

## Testing Results
We evaluate SparseDNN both at the kernel level and end to end. We consider both single-threaded performance and multi-threaded performance on 4 physical cores; both scenarios are useful for end-to-end network inference. Single-threaded performance matters for asynchronous inference, where each CPU core handles a separate stream of inference requests, while multi-threaded performance matters for synchronous inference, where all cores handle the same stream of requests.
The processors used in these tests are:
1. Intel CPU: Intel Xeon Platinum 8124M
2. AMD CPU: AMD EPYC 7R32
### Kernel Benchmarks
We evaluate our sparse code generator on two different operations, SpMM and sparse convolution, on a suite of typical problems encountered in deep learning.
#### **SpMM Benchmark**
Testing Specification:
1. Sparse Matrix Size: 256x256 ~ 2048x2048
2. Baseline: Intel Libxsmm
3. **Purpose: performance comparison against Libxsmm in single-threaded and multi-threaded settings**
Testing Result Illustration (SpMM Testing):
* Left: single-threaded
* Right: multi-threaded

#### **Sparse Convolution Benchmark**
Testing Specification:
1. Input Channel Number: 32 ~ 256
2. Sparsity: 90%, 95%
3. Testing Image Size: 7x7, 14x14, 28x28, 56x56
4. Baselines:
   a. SkimCaffe sparse convolutional kernel
   b. oneDNN optimized dense convolution routines
5. **Purpose: performance comparison against sparse and dense convolution baselines**
Testing Result Illustration:
1. vs SkimCaffe sparse convolutional kernel:

2. vs oneDNN optimized dense convolution routines:

### Performance Analysis
Testing Specification:
1. Sparsity: 70%, 80%, 90%, 95%
2. Baselines:
   a. Libxsmm
   b. dense BLAS kernel
3. **Purpose: performance comparison against sparse and dense kernels across sparsity levels**
Testing Result Illustration:
* Left: vs Libxsmm
* Right: vs dense BLAS kernel

### End-to-End Benchmarks
1. Comparison frameworks:
   a. OpenVINO
   b. PyTorch
2. Networks:
   a. MobileNet V1
   b. pruneBERT
3. **Purpose: end-to-end latency/throughput comparison across frameworks and networks**
Testing Result Illustration (latency/throughput trade-off):
* a) Intel pruneBERT
* b) AMD pruneBERT
* c) Intel MobileNet V1
* d) AMD MobileNet V1

## Conclusion
1. SparseDNN is a deep learning inference engine that supports unstructured sparsity.
2. SparseDNN relies on a sparse code generator to produce optimized kernels for sparse weights, resulting in significantly faster kernels than current state-of-the-art libraries.
3. Network-level optimizations, such as buffer reuse and weight permutation, further improve end-to-end inference, leading to large latency and throughput gains over highly optimized dense inference engines.
4. SparseDNN does not currently handle activation sparsity induced by activation functions such as ReLU, as this would require dynamic optimization techniques that are incompatible with the static, ahead-of-time optimizations SparseDNN relies on.
5. Work is ongoing to add support for sparse low-precision operators, which combine pruning with quantization for network compression.
6. This work contributes to a growing body of literature suggesting that unstructured sparsity is a viable option for achieving significant speedups on commodity hardware. SparseDNN aims to make it easier to deploy novel pruning algorithms in production and to inspire future research in neural network compression.