{%hackmd SybccZ6XD %}
###### tags: `paper`
# Model Compression and Acceleration for Deep Neural Networks
## Introduction
Why do we need to compress models?
- Real-time applications such as online learning and incremental learning.
- Virtual reality
- Portable devices with limited resources
:::warning
**For example: residual network-50 (ResNet-50)**
- 50 convolutional layers
- 95 megabytes of memory
- numerous floating-point multiplications

After discarding some redundant weights, the network
- saves more than 75% of the parameters and
- 50% of the computational time.
:::
The disciplines related to this goal:
- machine learning,
- optimization,
- computer architecture,
- data compression,
- indexing, and
- hardware design.
Approaches:

- Parameter pruning and sharing
- Low-rank factorization
- Transferred/compact convolutional filters
- Knowledge distillation (KD)
## Parameter pruning and sharing
- Reduces network complexity and addresses the overfitting problem.
### Quantization and binarization
Network quantization: reducing the number of bits required to represent each weight
- Gong et al. [6] and Wu et al. [7]: applied k-means scalar quantization to the parameter values (a minimal sketch follows this list).
- Vanhoucke et al. [8]: 8-bit quantization of the parameters can result in significant speedup with minimal loss of accuracy
- The work in [9]: 16-bit fixed-point representation in stochastic rounding-based CNN training
- **The method proposed in [10]**: pruned the unimportant connections and retrained the sparsely connected networks.
- The work in [11] showed that the Hessian weight could be used to measure the importance of network parameters and proposed minimizing the Hessian-weighted quantization error on average when clustering network parameters.
- A novel quantization framework was introduced in [12], which reduced the precision of network weights to ternary values.
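
A minimal NumPy sketch of scalar k-means weight quantization in the spirit of [6], [7]: cluster all weights of a layer into a small codebook and store only the codebook plus per-weight cluster indices. The function name `kmeans_quantize` and the plain Lloyd-iteration details are illustrative, not the papers' exact procedure.

```python
import numpy as np

def kmeans_quantize(weights, n_clusters=16, n_iter=20):
    """Scalar k-means quantization: replace each weight by its nearest
    centroid, so only the codebook (n_clusters floats) and the per-weight
    indices (log2(n_clusters) bits each) need to be stored."""
    w = weights.ravel()
    centroids = np.linspace(w.min(), w.max(), n_clusters)   # simple init
    for _ in range(n_iter):
        labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):                          # skip empty clusters
                centroids[c] = w[labels == c].mean()
    labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids, labels.reshape(weights.shape)

# Usage: a 4-bit (16-entry) codebook for one fully connected layer.
W = np.random.randn(256, 128).astype(np.float32)
codebook, idx = kmeans_quantize(W, n_clusters=16)
W_quantized = codebook[idx]    # weights reconstructed from shared centroids
```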
In the extreme case of a 1-bit representation of each weight, i.e., binary weight neural networks (a sketch follows this list):
- Binary-Connect [13],
- BinaryNet [14], and
- XNOR-Networks [15].
- The systematic study in [16] showed that networks trained with backpropagation can be resilient to specific weight distortions, including binary weights.
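
A minimal sketch of the two common binarization flavors, assuming NumPy only: BinaryConnect-style sign binarization, and an XNOR-Net-style variant that keeps one real-valued scale per filter (the mean absolute value of its weights). The real methods binarize inside training while keeping a full-precision copy of the weights for the gradient updates; this only shows the weight transformation.

```python
import numpy as np

def binarize(weights):
    """BinaryConnect-style deterministic binarization: use sign(w) in the
    forward/backward pass (1 bit per weight)."""
    return np.where(weights >= 0, 1.0, -1.0)

def binarize_with_scale(weights):
    """XNOR-Net-style binarization: one real-valued scale per output filter,
    set to the mean absolute value of that filter's weights."""
    alpha = np.mean(np.abs(weights), axis=(1, 2, 3), keepdims=True)
    return alpha * np.where(weights >= 0, 1.0, -1.0)

W = np.random.randn(64, 3, 3, 3)     # hypothetical conv filters (out, in, h, w)
Wb = binarize(W)                     # 1 bit per weight
Wb_scaled = binarize_with_scale(W)   # 1 bit per weight + 1 float per filter
```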
### Drawbacks (binary nets)
- The accuracy of such binary nets is significantly lowered.
- Existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss.
How to address this issue:
- The work in [17] proposed a proximal Newton algorithm with a diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights.
- The work in [18] significantly reduced the time spent on floating-point multiplication in the training stage by stochastically binarizing weights and converting multiplications in the hidden state computation to sign changes.
### Pruning and sharing
Pruning and sharing reduce network complexity and address the overfitting issue.

Early approaches:
- biased weight decay [19].
- The optimal brain damage [20] and optimal brain surgeon [21] methods reduced the number of connections based on the Hessian of the loss function; their work suggested that such pruning gives higher accuracy than magnitude-based pruning such as the weight decay method. These methods support training from scratch.
A recent trend is to prune redundant, noninformative weights in a pretrained CNN model (a minimal sketch follows).
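
A minimal sketch of magnitude-based pruning in the spirit of [10], assuming NumPy only: drop the connections with the smallest absolute weights and keep a binary mask so the pruned weights stay at zero during retraining. The single global threshold is a simplification; [10] tunes thresholds per layer.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.75):
    """Zero out the smallest-magnitude weights and return the surviving
    weights together with the pruning mask."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

W = np.random.randn(512, 512)
W_pruned, mask = magnitude_prune(W, sparsity=0.75)   # keep ~25% of connections
# During retraining, multiply the weight gradients by `mask` so pruned
# connections are never revived.
```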
### Drawbacks
- Pruning with $l_1$ or $l_2$ regularization requires more iterations to converge.
- All pruning criteria require manual setup of the sensitivity for each layer, which demands fine-tuning of the parameters.
### Designing the structural matrix
A fully connected layer computes $f(x, M) = \sigma(Mx)$, where $M$ is an $m \times n$ matrix of parameters. The idea is to replace the dense, unstructured $M$ with a structured matrix that is described by far fewer than $mn$ parameters and admits a fast matrix-vector product.
How to solve:
- circulant projections (a sketch follows this list)
- the fast Fourier transform (FFT) to speed up the resulting matrix-vector products
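
A minimal NumPy sketch of why a circulant structure helps: an $n \times n$ circulant matrix is fully described by its first column ($n$ parameters instead of $n^2$), and its matrix-vector product is a circular convolution that the FFT computes in $O(n \log n)$ instead of $O(n^2)$.

```python
import numpy as np

n = 8
r = np.random.randn(n)        # n parameters define the whole n x n matrix
x = np.random.randn(n)

# Dense circulant matrix with first column r: O(n^2) storage and multiply.
M = np.stack([np.roll(r, j) for j in range(n)], axis=1)
y_dense = M @ x

# Same product via the convolution theorem: O(n) storage, O(n log n) multiply.
y_fft = np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real

assert np.allclose(y_dense, y_fft)
```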
### Drawbacks
- The structural constraint may cause a loss in accuracy.
- Finding a proper structural matrix is difficult.
## Low-rank factorization and sparsity
- The convolution kernels in a typical CNN form a four-dimensional tensor.
- A fully connected layer can be viewed as a two-dimensional (2-D) matrix.
For example, apply the singular value decomposition (SVD) to a fully connected layer:
- Input $I\in \mathbb{R}^{n\times m}$ and weight matrix $W\in \mathbb{R}^{m\times k}$, so computing $IW$ costs $O(nmk)$.
- Decompose $W = USV^T$ and keep only the largest $t$ singular values, so that $U\in \mathbb{R}^{m\times t}$, $S\in \mathbb{R}^{t\times t}$, $V\in \mathbb{R}^{k\times t}$.
- The cost drops to $O(nmt + nt^2 + ntk)$, which is worthwhile when $t \ll \min(m, k)$.
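
A minimal NumPy sketch of the bookkeeping above: factorize the weight matrix of a fully connected layer with a rank-$t$ truncated SVD and apply the three small factors in sequence. The dimensions are arbitrary example values.

```python
import numpy as np

n, m, k, t = 128, 1024, 1024, 64      # batch size, input dim, output dim, rank
I = np.random.randn(n, m)
W = np.random.randn(m, k)

# Rank-t truncated SVD: W ~= U_t S_t V_t^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_t, S_t, Vt_t = U[:, :t], np.diag(s[:t]), Vt[:t, :]

O_full = I @ W                          # O(n m k) multiply-adds
O_low_rank = ((I @ U_t) @ S_t) @ Vt_t   # O(n m t + n t^2 + n t k) multiply-adds

# Relative approximation error introduced by the truncation.
print(np.linalg.norm(O_full - O_low_rank) / np.linalg.norm(O_full))
```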
### Drawbacks
- The implementation is not easy, since it involves a decomposition operation that is computationally expensive.
- Low-rank approximation is performed layer by layer, so global parameter compression is not possible.
- Factorization requires extensive model retraining to achieve convergence.
## Transferred/compact convolutional filters
Many filter weights are just rotated versions of one another, so there is a lot of redundancy; compact/transferred filter methods learn only a small set of base filters and build the full filter bank by applying a fixed family of transforms to them.
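
A toy NumPy sketch of the transferred-filter idea: learn and store a small set of base filters, then expand them with a fixed transform family (here, the four 90-degree rotations). The helper `build_transferred_filter_bank` and the choice of rotations are illustrative, not the construction of any specific paper.

```python
import numpy as np

def build_transferred_filter_bank(base_filters):
    """Expand base filters (out, in, h, w) with their 0/90/180/270-degree
    rotations, so only the base filters need to be learned and stored."""
    rotations = [np.rot90(base_filters, k, axes=(2, 3)) for k in range(4)]
    return np.concatenate(rotations, axis=0)

base = np.random.randn(16, 3, 3, 3)           # 16 stored base filters
bank = build_transferred_filter_bank(base)    # 64 effective filters
```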

### Drawbacks
- These methods achieve competitive performance for wide/flat architectures (like VGGNet) but not for narrow/special ones (like GoogLeNet and ResNet).
- The transfer assumptions are sometimes too strong to guide the algorithm, making the results unstable on some data sets.
## Knowledge distillation (KD)
Hinton gave a biomimetic analogy: in the larval stage, insects tend to share the same form, one suited to extracting energy and nutrients from the environment; once they reach maturity, they take on a different form adapted to their environment or role. Does an analogous method exist for DNNs? Training can use a large, complex model, but when the model needs to be deployed it can be converted into a much smaller one. He called this approach knowledge distillation (KD): the small student network is trained to reproduce the softened output distribution of the large teacher network.
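
A minimal NumPy sketch of the standard KD loss (Hinton et al.): the student is trained on a weighted sum of (a) cross-entropy against the teacher's temperature-softened probabilities and (b) ordinary cross-entropy against the ground-truth labels. The hyperparameters `T` and `alpha` are illustrative defaults.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target cross-entropy (scaled by T^2 to balance gradient
    magnitudes) plus hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)       # the teacher's soft targets
    p_student = softmax(student_logits, T)
    soft = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1))
    p_hard = softmax(student_logits, 1.0)
    hard = -np.mean(np.log(p_hard[np.arange(len(labels)), labels] + 1e-12))
    return alpha * (T ** 2) * soft + (1 - alpha) * hard

teacher_logits = np.random.randn(32, 10)         # from the large, trained teacher
student_logits = np.random.randn(32, 10)         # from the small student
labels = np.random.randint(0, 10, size=32)
loss = distillation_loss(student_logits, teacher_logits, labels)
```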
### Drawbacks
- KD can only be applied to classification tasks with a softmax loss function.
- The model assumptions are sometimes too strict.
## Other types of approaches
### Dynamic Capacity Networks
$h(x) = g(f(x))$, where $f$ extracts features and $g$ maps them to the output.

We apply the coarse layers $f_c$ to the whole input $x$ and leverage the fine layers $f_f$ only at a few "important" input regions (see the sketch after this list).
- $f_c$: low-capacity subnetworks
- $f_f$: high-capacity subnetworks
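
A toy sketch of the coarse/fine split, assuming NumPy and placeholder callables: the cheap sub-network $f_c$ runs on every input region, a simple salience score picks a few regions, and only those are re-processed by the expensive sub-network $f_f$ before the top layers $g$ aggregate everything. The salience heuristic and the lambdas below are illustrative only.

```python
import numpy as np

def dcn_forward(x_regions, f_coarse, f_fine, g, top_k=4):
    """Apply f_coarse to every region, f_fine only to the top_k regions with
    the largest coarse-feature norm, then combine with g."""
    coarse = np.stack([f_coarse(r) for r in x_regions])
    salience = np.linalg.norm(coarse, axis=-1)
    features = coarse.copy()
    for i in np.argsort(salience)[-top_k:]:       # a few "important" regions
        features[i] = f_fine(x_regions[i])
    return g(features)

# Usage with toy sub-networks; both map a region to an 8-D feature vector.
x = np.random.randn(16, 32)                       # 16 input regions
f_c = lambda r: r[:8]                             # low-capacity stand-in
f_f = lambda r: np.tanh(2.0 * r[:8])              # high-capacity stand-in
g = lambda feats: feats.mean(axis=0)              # top layers
out = dcn_forward(x, f_c, f_f, g)
```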
### Conditional computation
### Dynamic DNNs (D2NNs)
### Reducing convolutional overheads with FFT-based convolutions
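
A minimal NumPy sketch of the convolution theorem behind FFT-based convolution: zero-pad, multiply the spectra, invert. For large kernels this replaces the $O(nm)$ direct convolution with $O((n+m)\log(n+m))$ work; FFT-based CNN kernels apply the same idea in 2-D per channel.

```python
import numpy as np

def fft_conv1d(signal, kernel):
    """Linear convolution via the FFT (convolution theorem)."""
    n = len(signal) + len(kernel) - 1     # pad to avoid circular wrap-around
    return np.fft.ifft(np.fft.fft(signal, n) * np.fft.fft(kernel, n)).real

x = np.random.randn(1024)
k = np.random.randn(31)
assert np.allclose(fft_conv1d(x, k), np.convolve(x, k))
```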
### Fast convolution using the Winograd algorithm
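
A minimal NumPy sketch of Winograd's $F(2, 3)$ minimal filtering: two outputs of a 3-tap filter are computed from a 4-element input tile with 4 multiplications instead of 6, using the standard transform matrices $B^T$, $G$, $A^T$. Production kernels tile 2-D inputs and amortize the transforms across channels; this only checks the 1-D identity.

```python
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

d = np.random.randn(4)     # input tile
g = np.random.randn(3)     # filter taps

y = A_T @ ((G @ g) * (B_T @ d))                          # 4 elementwise multiplies
assert np.allclose(y, np.correlate(d, g, mode='valid'))  # matches direct filtering
```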