# Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization
##### origin: ASPLOS 2019
##### paper: [link](http://www.eecs.harvard.edu/~htk/publication/2019-asplos-kung-mcdanel-zhang.pdf)
###### tags: `Accelerators`

## Introduction

### Problem
* **Weight pruning** techniques have been proposed to further reduce the computation cost of CNN inference by setting the majority of weights to zero.
* The remaining nonzero weights are distributed in an unstructured manner, making it challenging to efficiently utilize the regular structure of systolic arrays.
* Therefore, the weight reduction achieved through pruning does not translate into reduced run time or hardware resources for CNN inference.

### Solution
* We propose **column combining**, which packs sparse convolutional networks for efficient implementation in systolic arrays.
* We optimize the topology of a CNN to fit the structure of the underlying computing hardware, such as **systolic arrays**, while preserving most of its classification accuracy via network retraining.

## Column Combining

### Column Combining Overview
![](https://i.imgur.com/11yCyBP.png)
(a) A filter matrix F is divided along its columns into three groups. (b) Each group is combined into a single column so that it can be loaded into a column of the proposed systolic array.

$\alpha$ : the maximum number of sparse columns that can be combined into a single dense column.
$\gamma$ : the average number of conflicts per row allowed for each group.

### Column Combining Algorithm
![](https://i.imgur.com/snUyOm1.png)
![](https://i.imgur.com/ZstT0JV.png)
![](https://i.imgur.com/QJC6w6I.png)

### Row Permutation for Contiguous Column Groups
* We can permute the rows of the filter matrix of the current layer to ensure that **the columns belonging to the same group for the next layer are output next to each other**.
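To make the column-combining step above concrete, here is a minimal Python sketch of a greedy grouping that respects $\alpha$ and $\gamma$. This is my illustration under stated assumptions, not the paper's exact algorithm: in the paper, the conflicting weights admitted by the limited-conflict condition are then pruned (keeping the largest magnitude per row) and the network is retrained.

```python
def combine_columns(F, alpha=8, gamma=1.0):
    """Greedily pack sparse columns of a filter matrix F (list of rows)
    into groups of at most `alpha` columns each.

    A 'conflict' is a row where a candidate column and the current group
    both hold a nonzero weight. The limited-conflict condition admits a
    column only while the group has at most `gamma` conflicts per row
    on average. Simplified illustration, not the paper's exact procedure.
    """
    n_rows = len(F)
    remaining = list(range(len(F[0])))
    groups = []
    while remaining:
        group = [remaining.pop(0)]          # seed a new group
        conflicts = 0                       # total conflicts in this group
        while len(group) < alpha and remaining:
            # rows already occupied by a nonzero weight from this group
            occ = [any(F[r][c] != 0 for c in group) for r in range(n_rows)]
            # cost of each candidate = number of new conflicting rows
            costs = [sum(1 for r in range(n_rows) if F[r][c] != 0 and occ[r])
                     for c in remaining]
            i = min(range(len(costs)), key=costs.__getitem__)
            # limited-conflict condition: average conflicts per row <= gamma
            if (conflicts + costs[i]) / n_rows > gamma:
                break
            conflicts += costs[i]
            group.append(remaining.pop(i))
        groups.append(group)
    return groups
```

With $\gamma = 0$ only columns whose nonzero patterns are disjoint can be merged; raising $\gamma$ lets the grouping trade a few column-combine-pruned weights for denser packing, mirroring the utilization-vs-accuracy trade-off discussed later.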
![](https://i.imgur.com/dsKsugF.png)
* A relatively expensive **switchbox** function is needed to route the output of layer i to the input of the reduced systolic array for layer i+1.

### Cross-layer Pipelining of CNN Inference under Column Combining and Row Permutation
* In real-time application scenarios, an input sample must be processed as soon as it is received by the system, and therefore cannot wait to be processed in large batches.

![](https://i.imgur.com/8lcGlnd.png)

## Systolic Array System Description for Column Combining
![](https://i.imgur.com/aZf7tKO.png)
* The shift block performs shift operations for shift convolution.

### Bit-serial Systolic Arrays
![](https://i.imgur.com/Jxxsoaz.png)
* **White logic**: bit-serial multiplication between the input Xi and the absolute value of the filter weight.
* **Blue logic**: negates the product based on the sign of the filter weight.
* **Pink logic**: bit-serial addition between the multiplication result and the input accumulation Yi.

![](https://i.imgur.com/w7gmR3V.png)
![](https://i.imgur.com/tiGs4h1.png)
![](https://i.imgur.com/WT5ynUN.png)
* **Fig 8, 9, 10 (a)**: input data, filter weights, and accumulation values all have the same number of bits.
* **Fig 8, 9, 10 (b)**: for high-precision accumulation, bit-serial systolic cells have longer computation time than I/O. Input data and filter weights are 8-bit while accumulation values are 32-bit, so there is a 24-clock gap between the input data streams.
* **Fig 8, 9, 10 (c)**: the gaps are filled in each cell by processing four independent input data streams simultaneously in an interleaved manner.

![](https://i.imgur.com/FTsH94r.png)
* The multiplexed-input (MX) cell takes in two x inputs from two input channels, uses one of them inside each MAC, and forwards both inputs to the cell above (in real ASIC and FPGA designs we pack up to 8 channels).
* This highlights the importance of the bit-serial design: in the case of 8 input channels, each cell takes in only 8 bits per cycle, as opposed to a bit-parallel design where each cell would require 64 input bits per cycle for 8-bit inputs.

### Shift Block & ReLU
![](https://i.imgur.com/4fjwb5a.png)
* We use double buffering to prefetch the next data tile so that the compute time can overlap with the data transfer overhead from the input buffer to the register arrays.

## Performance Analysis for the Column Combining Algorithm
- Datasets & networks

| MNIST | CIFAR-10 | ImageNet |
| -------- | -------- | -------- |
| Lenet-5 | VGG-19 | VGG-19 |

* two-layer structure
![](https://i.imgur.com/UWaeQ1h.png)
* network structures for each dataset
![](https://i.imgur.com/gEqFMC9.png)

### Iterative Training with Column Combining
![](https://i.imgur.com/GfUgdoC.png)
* Classification accuracy on the validation set and number of nonzero weights over training epochs.

### Impact of the Limited-Conflict Condition
* The **limited-conflict condition** allows $\gamma$ conflicting entries per row on average between columns within a group.
![](https://i.imgur.com/LcnQ4oE.png)
* Larger values of $\gamma$ allow more conflicts between the columns in a group and therefore prune more weights, possibly with relatively large magnitudes, in order to achieve higher utilization efficiency across all layers in the CNN.
* Column-combine pruning has a small impact on classification accuracy.

### Dramatic Tiling Reduction in Partitioned Matrix Multiplication with Column Combining
![](https://i.imgur.com/tz5dR5R.png)
![](https://i.imgur.com/Zh5Acmx.png)
* The number of tiles required to perform matrix multiplication with a 64x64 systolic array for each layer in VGG-19 models under three different settings of $\gamma$.
* Generally, this shows that it is difficult to effectively combine sparse columns when a small $\gamma$ value is used, as a few conflicts in any row of a potential group will make the combination invalid.
* By adding a modest amount of column-combine pruning (e.g., $\gamma$ = 1.75), the combining algorithm is able to substantially improve utilization efficiency and decrease the number of tiles.

## Hardware Implementation Experiments and Performance Evaluation

### ASIC Implementation and Evaluation
* 45nm NanGate Open Cell Library
* CACTI 7.0

### Comparison Against Prior Designs for MNIST
![](https://i.imgur.com/Le108Yj.png)
* Generally, our design has both the highest area efficiency and energy efficiency across all the designs.

### FPGA Implementation and Evaluation
* Xilinx VC707
![](https://i.imgur.com/ZOvBQPn.png)
* Our design achieves a top-1 accuracy of 93%, which is around 5-6% higher than the other models.
* Our design achieves a 3x improvement in energy efficiency over the next best design.
![](https://i.imgur.com/eignhRd.png)
* In terms of energy efficiency, our design outperforms the most recent works by 3.34x and 1.93x, respectively.

### Reduction in End-to-end Inference Latency with Cross-layer Pipelining
![](https://i.imgur.com/8Jsm0wQ.png)
* Our design achieves an end-to-end latency over 12x smaller than the best prior implementation, while also obtaining higher classification accuracy.
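As a back-of-the-envelope illustration of why cross-layer pipelining shrinks per-sample latency (an illustrative timing model with made-up numbers, not the paper's analysis): without pipelining, each layer must finish before the next begins, so latencies add up; with pipelining, layer i+1 starts as soon as layer i has produced enough output rows, so only a "fill" fraction of each earlier layer stays on the critical path.

```python
def sequential_latency(layer_cycles):
    """Without pipelining: each layer finishes before the next starts,
    so per-sample latency is the sum of all layer latencies."""
    return sum(layer_cycles)

def pipelined_latency(layer_cycles, fill_fractions):
    """With cross-layer pipelining, only the 'fill' portion of each
    earlier layer sits on the critical path. fill_fractions[i] is the
    (hypothetical) fraction of layer i that must complete before
    layer i+1 can start consuming its output."""
    latency = layer_cycles[-1]                      # last layer runs fully
    for cycles, fill in zip(layer_cycles[:-1], fill_fractions):
        latency += fill * cycles                    # fill delay of layer i
    return latency
```

For example, three 100-cycle layers take 300 cycles sequentially, but if each downstream layer can start after 10% of its predecessor's work, the pipelined latency drops to about 120 cycles, which is the kind of end-to-end reduction the cross-layer pipelining section reports.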