# Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization
##### origin: ASPLOS 2019
##### paper: [link](http://www.eecs.harvard.edu/~htk/publication/2019-asplos-kung-mcdanel-zhang.pdf)
###### tags: `Accelerators`
## Introduction
### Problem
* **Weight pruning** techniques have been proposed to further reduce the computation cost of CNN inference by setting the majority of weights to zero.
* The remaining nonzero weights are distributed in an unstructured manner, making it challenging to efficiently utilize the regular structure of systolic arrays.
* As a result, the weight reduction achieved through pruning does not translate into reduced run time or hardware resources for CNN inference.
### Solution
* We propose **column combining**, which packs sparse convolutional networks for efficient implementation on systolic arrays.
* We optimize the topology of a CNN to fit the structure of the underlying computing hardware, such as **systolic arrays**, while preserving most of its classification accuracy via network retraining.
## Column Combining
### Column Combining Overview

(a) A filter matrix F is divided along its columns into three groups.
(b) Each group is combined into a single column so it can be loaded into one column of the proposed systolic array.
$\alpha$ : the maximum number of sparse columns that can be combined into a single dense column.
$\gamma$ : the average number of conflicts per row allowed for each group.
### Column Combining Algorithm
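The paper's full algorithm jointly retrains the network; the grouping step alone can be sketched as a greedy pass. A minimal Python sketch, assuming a dense NumPy filter matrix (the function names `row_conflicts` and `combine_columns` are ours, not the paper's):

```python
import numpy as np

def row_conflicts(F, cols):
    """Conflicts = nonzero entries beyond the first in each row of the group."""
    nnz_per_row = (F[:, cols] != 0).sum(axis=1)
    return int(np.maximum(nnz_per_row - 1, 0).sum())

def combine_columns(F, alpha, gamma):
    """Greedily pack the sparse columns of filter matrix F into groups.

    A column joins a group only if the group stays within alpha columns
    and the average number of conflicts per row stays at most gamma.
    """
    n_rows, n_cols = F.shape
    groups = []  # each group: list of original column indices
    for c in range(n_cols):
        for g in groups:
            if (len(g) < alpha
                    and row_conflicts(F, g + [c]) / n_rows <= gamma):
                g.append(c)
                break
        else:
            groups.append([c])  # no compatible group: start a new one
    return groups
```

When a group is folded into one physical column, any remaining conflicting entries in a row are resolved by column-combine pruning, i.e. keeping only the largest-magnitude weight in that row.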



### Row Permutation for Contiguous Column Groups
* We can permute the rows of the filter matrix of the current layer to ensure that **the outputs feeding the same column group of the next layer land next to each other**.

* Without this permutation, a relatively expensive **switchbox** is needed to route the output of layer i to the input of the reduced systolic array for layer i+1.
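The permutation itself is simple: concatenate the next layer's input-channel groups and reorder the current layer's output rows accordingly, so each group reads a contiguous block. A minimal sketch (names and the toy data are ours):

```python
import numpy as np

def contiguous_row_permutation(groups):
    """Order layer-i output rows so that each of layer i+1's
    column groups reads a contiguous block of channels."""
    return [ch for g in groups for ch in g]

# Apply the permutation to the rows of layer i's filter matrix;
# layer i then produces its outputs already in grouped order.
F_i = np.arange(12).reshape(4, 3)      # toy filter matrix, 4 output channels
perm = contiguous_row_permutation([[2, 0], [3, 1]])
F_i_permuted = F_i[perm]               # rows reordered: 2, 0, 3, 1
```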
### Cross-layer Pipelining of CNN Inference under Column Combining and Row Permutation
* In real-time application scenarios, each input sample must be processed as soon as it is received by the system, and therefore cannot wait to be processed in large batches.

## Systolic Array System Description for Column Combining

* The shift block performs shift operations for shift convolution.
### Bit-serial Systolic Arrays

* **White logic**: bit-serial multiplication between the input Xi and the absolute value of the filter weight.
* **Blue logic**: negates the product based on the sign of the filter weight.
* **Pink logic**: bit-serial addition between the multiplication result and the input accumulation Yi.
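Functionally, one cell computes Yi + Xi * W. A bit-level Python model of the three stages above (the function name is ours, not the paper's):

```python
def bit_serial_mac(x, w, y_in, n_bits=8):
    """Functional model of one bit-serial systolic cell."""
    # White logic: multiply x (fed LSB-first) by |w|, one bit per cycle.
    product = 0
    for i in range(n_bits):
        x_bit = (x >> i) & 1               # bit of Xi arriving this cycle
        product += (abs(w) << i) * x_bit   # shift-and-add partial product
    # Blue logic: negate the product according to the sign of w.
    if w < 0:
        product = -product
    # Pink logic: bit-serial addition with the incoming accumulation Yi.
    return y_in + product
```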



* **Figs. 8, 9, 10 (a)**: input data, filter weights, and accumulation values all have the same number of bits.
* **Figs. 8, 9, 10 (b)**: for high-precision accumulation, bit-serial systolic cells have a longer computation time than I/O. Input data and filter weights are 8 bits while accumulation values are 32 bits, leaving a 24-clock-cycle gap between the input data streams.
* **Figs. 8, 9, 10 (c)**: the gaps for each cell are filled by processing four independent input data streams simultaneously in an interleaved manner.
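The interleaving factor follows from the bit widths: an 8-bit operand arrives over 8 cycles, but the 32-bit accumulation keeps the cell busy for 32, so 32 / 8 = 4 independent streams fill the gap. A round-robin sketch of this schedule (our illustration, not the paper's code):

```python
ACC_BITS, DATA_BITS = 32, 8
N_STREAMS = ACC_BITS // DATA_BITS   # 4 streams cover the 24-cycle gap

def interleave(streams):
    """Round-robin one bit per cycle across independent input streams:
    cycle t carries bit (t // n) of stream (t % n)."""
    n = len(streams)
    length = len(streams[0])
    return [streams[t % n][t // n] for t in range(n * length)]
```

With four 8-bit streams, the cell receives one bit every cycle with no idle slots.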

* The multiplexed input (MX) cell takes in two x inputs from two input channels, uses one of them inside each MAC, and forwards both inputs to the cell above (in real ASIC and FPGA designs we pack up to 8 channels).
* This highlights the importance of the bit-serial design: in the case of 8 channels, each cell takes in only 8 bits per cycle, whereas a bit-parallel design with 8-bit inputs would require 64 input bits per cycle.
### Shift Block & ReLU

* We use double buffering to prefetch the next data tile, so that the output time can overlap with the data-transfer overhead from the input buffer to the register arrays.
## Performance Analysis for the Column Combining Algorithm
* Datasets & networks:

| MNIST | CIFAR-10 | ImageNet |
| -------- | -------- | -------- |
| LeNet-5 | VGG-19 | VGG-19 |
* two-layer structure

* network structures for each dataset

### Iterative Training with Column Combining

* Classification accuracy on the validation set and number of nonzero weights over training epochs.
### Impact of the Limited-Conflict Condition
* The **limited-conflict condition** allows $\gamma$ conflicting entries per row on average between the columns within a group.

* Larger values of $\gamma$ allow more conflicts between the columns in a group and therefore prune more weights, possibly with relatively large magnitudes, in order to achieve higher utilization efficiency across all layers of the CNN.
* Column-combine pruning has only a small impact on classification accuracy.
### Dramatic Tiling Reduction in Partitioned Matrix Multiplication with Column Combining


* The number of tiles required to perform matrix multiplication with a 64x64 systolic array for each layer in the VGG-19 model, under three different settings of $\gamma$.
* Generally, this shows that it is difficult to effectively combine sparse columns when a small $\gamma$ value is used, as a few conflicts in any row of a potential group will make the combination invalid.
* By adding a modest amount of column-combine pruning (e.g., $\gamma$ = 1.75), the combining algorithm is able to substantially improve the utilization efficiency and decrease the number of tiles.
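The tile count follows directly from the matrix dimensions: an m x n (combined) filter matrix needs ceil(m/64) * ceil(n/64) tiles on a 64x64 array, so shrinking n via column combining reduces the count roughly in proportion. A back-of-the-envelope sketch (the layer sizes below are illustrative, not from the paper):

```python
from math import ceil

def n_tiles(rows, cols, array_dim=64):
    """Tiles needed to map a rows x cols matrix onto an
    array_dim x array_dim systolic array."""
    return ceil(rows / array_dim) * ceil(cols / array_dim)

# e.g. a 512 x 512 layer before combining vs. 512 x 64 after packing
# 8 sparse columns into each dense column:
before = n_tiles(512, 512)   # 8 * 8 = 64 tiles
after = n_tiles(512, 64)     # 8 * 1 = 8 tiles
```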
## Hardware Implementation Experiments and Performance Evaluation
### ASIC Implementation and Evaluation
* 45nm NanGate Open Cell Library
* CACTI 7.0
### Comparison Against Prior Designs for MNIST

* Generally, our design has both the highest area efficiency and the highest energy efficiency across all the designs.
### FPGA Implementation and Evaluation
* Xilinx VC707

* Our design achieves a top-1 accuracy of 93%, around 5-6% higher than the other models.
* Our design achieves a 3x improvement in energy efficiency over the next best design.

* In terms of energy efficiency, our design outperforms the two most recent works by 3.34x and 1.93x, respectively.
### Reduction in End-to-end Inference Latency with Cross-layer Pipelining

* Our design achieves an end-to-end latency over 12x smaller than the best prior implementation, while also obtaining higher classification accuracy.