{%hackmd SybccZ6XD %}
###### tags: `paper`
# Model Compression and Acceleration for Deep Neural Networks
## Introduction
Why do we need to compress models?
- Real-time applications such as online learning and incremental learning.
- Virtual reality
- Portable devices with limited resources
:::warning
**For example: residual network-50 (ResNet-50)**
- 50 convolutional layers
- 95 megabytes of memory
- numerous floating-point multiplications

After discarding some redundant weights, the network
- saves more than 75% of the parameters and
- 50% of the computational time.
:::
The disciplines related to this goal:
- machine learning,
- optimization,
- computer architecture,
- data compression,
- indexing, and
- hardware design.
Approaches:

- Parameter pruning and sharing
- Low-rank factorization
- Transferred/compact convolutional filters
- Knowledge distillation (KD)
## Parameter pruning and sharing
- Reduces network complexity and addresses the overfitting problem.
### Quantization and binarization
Network quantization: reducing the number of bits required to represent each weight
- Gong et al. [6] and Wu et al. [7]: applied k-means scalar quantization to the parameter values (a minimal sketch follows this list).
- Vanhoucke et al. [8]: 8-bit quantization of the parameters can result in significant speedup with minimal loss of accuracy
- The work in [9]: 16-bit fixed-point representation in stochastic rounding-based CNN training
- **The method proposed in [10]**: pruned the unimportant connections and retrained the sparsely connected networks.
- The work in [11] showed that the Hessian weight could be used to measure the importance of network parameters and proposed minimizing the Hessian-weighted quantization error on average when clustering network parameters.
- A novel quantization framework was introduced in [12], which reduced the precision of network weights to ternary values.
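
A minimal NumPy sketch of scalar k-means weight quantization in the spirit of [6], [7]: cluster all weights of a layer into a small codebook and store only the codebook plus per-weight cluster indices. The function name `kmeans_quantize` and the plain Lloyd-iteration details are illustrative, not the papers' exact procedure.

```python
import numpy as np

def kmeans_quantize(weights, n_clusters=16, n_iter=20):
    """Scalar k-means quantization: replace each weight by its nearest
    centroid, so only the codebook (n_clusters floats) and the per-weight
    indices (log2(n_clusters) bits each) need to be stored."""
    w = weights.ravel()
    centroids = np.linspace(w.min(), w.max(), n_clusters)   # simple init
    for _ in range(n_iter):
        labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):                          # skip empty clusters
                centroids[c] = w[labels == c].mean()
    labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids, labels.reshape(weights.shape)

# Usage: a 4-bit (16-entry) codebook for one fully connected layer.
W = np.random.randn(256, 128).astype(np.float32)
codebook, idx = kmeans_quantize(W, n_clusters=16)
W_quantized = codebook[idx]    # weights reconstructed from shared centroids
```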
In the extreme case of a 1-bit representation of each weight, i.e., binary weight neural networks (a sketch follows this list):
- Binary-Connect [13],
- BinaryNet [14], and
- XNOR-Networks [15].
- The systematic study in [16] showed that networks trained with backpropagation can be resilient to specific weight distortions, including binary weights.
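
A minimal sketch of the two common binarization flavors, assuming NumPy only: BinaryConnect-style sign binarization, and an XNOR-Net-style variant that keeps one real-valued scale per filter (the mean absolute value of its weights). The real methods binarize inside training while keeping a full-precision copy of the weights for the gradient updates; this only shows the weight transformation.

```python
import numpy as np

def binarize(weights):
    """BinaryConnect-style deterministic binarization: use sign(w) in the
    forward/backward pass (1 bit per weight)."""
    return np.where(weights >= 0, 1.0, -1.0)

def binarize_with_scale(weights):
    """XNOR-Net-style binarization: one real-valued scale per output filter,
    set to the mean absolute value of that filter's weights."""
    alpha = np.mean(np.abs(weights), axis=(1, 2, 3), keepdims=True)
    return alpha * np.where(weights >= 0, 1.0, -1.0)

W = np.random.randn(64, 3, 3, 3)     # hypothetical conv filters (out, in, h, w)
Wb = binarize(W)                     # 1 bit per weight
Wb_scaled = binarize_with_scale(W)   # 1 bit per weight + 1 float per filter
```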
### Drawbacks (binary nets)
- The accuracy of such binary nets is significantly lowered.
- Existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss.
How to address this issue:
- The work in [17] proposed a proximal Newton algorithm with a diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights.
- The work in [18] significantly reduced the time spent on floating-point multiplication in the training stage by stochastically binarizing weights and converting multiplications in the hidden state computation to sign changes.
### Pruning and sharing
Pruning and sharing reduce network complexity and address the overfitting issue.

Early approaches:
- biased weight decay [19].
- The optimal brain damage [20] and optimal brain surgeon [21] methods reduced the number of connections based on the Hessian of the loss function; their work suggested that such pruning gives higher accuracy than magnitude-based pruning such as the weight decay method. These methods support training from scratch.
A recent trend is to prune redundant, noninformative weights in a pretrained CNN model (a minimal sketch follows).
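
A minimal sketch of magnitude-based pruning in the spirit of [10], assuming NumPy only: drop the connections with the smallest absolute weights and keep a binary mask so the pruned weights stay at zero during retraining. The single global threshold is a simplification; [10] tunes thresholds per layer.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.75):
    """Zero out the smallest-magnitude weights and return the surviving
    weights together with the pruning mask."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

W = np.random.randn(512, 512)
W_pruned, mask = magnitude_prune(W, sparsity=0.75)   # keep ~25% of connections
# During retraining, multiply the weight gradients by `mask` so pruned
# connections are never revived.
```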
### Drawbacks
- Pruning with $l_1$ or $l_2$ regularization requires more iterations to converge.
- All pruning criteria require manual setup of the sensitivity for each layer, which demands fine-tuning of the parameters.
### Designing the structural matrix
A fully connected layer computes $f(x, M) = \sigma(Mx)$, where $M$ is an $m \times n$ matrix of parameters. The idea is to replace the dense, unstructured $M$ with a structured matrix that is described by far fewer than $mn$ parameters and admits a fast matrix-vector product.
How to solve:
- circulant projections (a sketch follows this list)
- the fast Fourier transform (FFT) to speed up the resulting matrix-vector products
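
A minimal NumPy sketch of why a circulant structure helps: an $n \times n$ circulant matrix is fully described by its first column ($n$ parameters instead of $n^2$), and its matrix-vector product is a circular convolution that the FFT computes in $O(n \log n)$ instead of $O(n^2)$.

```python
import numpy as np

n = 8
r = np.random.randn(n)        # n parameters define the whole n x n matrix
x = np.random.randn(n)

# Dense circulant matrix with first column r: O(n^2) storage and multiply.
M = np.stack([np.roll(r, j) for j in range(n)], axis=1)
y_dense = M @ x

# Same product via the convolution theorem: O(n) storage, O(n log n) multiply.
y_fft = np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real

assert np.allclose(y_dense, y_fft)
```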
### Drawbacks
- The structural constraint may cause a loss in accuracy.
- Finding a proper structural matrix is difficult.
## Low-rank factorization and sparsity
- The convolution kernels in a typical CNN form a four-dimensional tensor.
- A fully connected layer can be viewed as a two-dimensional (2-D) matrix.
For example, apply the singular value decomposition (SVD) to a fully connected layer:
- Input $I\in \mathbb{R}^{n\times m}$ and weight matrix $W\in \mathbb{R}^{m\times k}$, so computing $IW$ costs $O(nmk)$.
- Decompose $W = USV^T$ and keep only the largest $t$ singular values, so that $U\in \mathbb{R}^{m\times t}$, $S\in \mathbb{R}^{t\times t}$, $V\in \mathbb{R}^{k\times t}$.
- The cost drops to $O(nmt + nt^2 + ntk)$, which is worthwhile when $t \ll \min(m, k)$.
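
A minimal NumPy sketch of the bookkeeping above: factorize the weight matrix of a fully connected layer with a rank-$t$ truncated SVD and apply the three small factors in sequence. The dimensions are arbitrary example values.

```python
import numpy as np

n, m, k, t = 128, 1024, 1024, 64      # batch size, input dim, output dim, rank
I = np.random.randn(n, m)
W = np.random.randn(m, k)

# Rank-t truncated SVD: W ~= U_t S_t V_t^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_t, S_t, Vt_t = U[:, :t], np.diag(s[:t]), Vt[:t, :]

O_full = I @ W                          # O(n m k) multiply-adds
O_low_rank = ((I @ U_t) @ S_t) @ Vt_t   # O(n m t + n t^2 + n t k) multiply-adds

# Relative approximation error introduced by the truncation.
print(np.linalg.norm(O_full - O_low_rank) / np.linalg.norm(O_full))
```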
### Drawbacks
- The implementation is not easy, since it involves a decomposition operation that is computationally expensive.
- Low-rank approximation is performed layer by layer, so global parameter compression is not possible.
- Factorization requires extensive model retraining to achieve convergence.
## Transferred/compact convolutional filters
Many filter weights are just rotated versions of one another, so there is a lot of redundancy; compact/transferred filter methods learn only a small set of base filters and build the full filter bank by applying a fixed family of transforms to them.
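
A toy NumPy sketch of the transferred-filter idea: learn and store a small set of base filters, then expand them with a fixed transform family (here, the four 90-degree rotations). The helper `build_transferred_filter_bank` and the choice of rotations are illustrative, not the construction of any specific paper.

```python
import numpy as np

def build_transferred_filter_bank(base_filters):
    """Expand base filters (out, in, h, w) with their 0/90/180/270-degree
    rotations, so only the base filters need to be learned and stored."""
    rotations = [np.rot90(base_filters, k, axes=(2, 3)) for k in range(4)]
    return np.concatenate(rotations, axis=0)

base = np.random.randn(16, 3, 3, 3)           # 16 stored base filters
bank = build_transferred_filter_bank(base)    # 64 effective filters
```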

### Drawbacks
- These methods achieve competitive performance for wide/flat architectures (like VGGNet) but not for narrow/special ones (like GoogLeNet and ResNet).
- The transfer assumptions are sometimes too strong to guide the algorithm, making the results unstable on some data sets.
## Knowledge distillation (KD)
Hinton gave a biomimetic analogy: in the larval stage, insects tend to share the same form, one suited to extracting energy and nutrients from the environment; once they reach maturity, they take on a different form adapted to their environment or role. Does an analogous method exist for DNNs? Training can use a large, complex model, but when the model needs to be deployed it can be converted into a much smaller one. He called this approach knowledge distillation (KD): the small student network is trained to reproduce the softened output distribution of the large teacher network.
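
A minimal NumPy sketch of the standard KD loss (Hinton et al.): the student is trained on a weighted sum of (a) cross-entropy against the teacher's temperature-softened probabilities and (b) ordinary cross-entropy against the ground-truth labels. The hyperparameters `T` and `alpha` are illustrative defaults.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target cross-entropy (scaled by T^2 to balance gradient
    magnitudes) plus hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)       # the teacher's soft targets
    p_student = softmax(student_logits, T)
    soft = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1))
    p_hard = softmax(student_logits, 1.0)
    hard = -np.mean(np.log(p_hard[np.arange(len(labels)), labels] + 1e-12))
    return alpha * (T ** 2) * soft + (1 - alpha) * hard

teacher_logits = np.random.randn(32, 10)         # from the large, trained teacher
student_logits = np.random.randn(32, 10)         # from the small student
labels = np.random.randint(0, 10, size=32)
loss = distillation_loss(student_logits, teacher_logits, labels)
```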
### Drawbacks
- KD can only be applied to classification tasks with a softmax loss function.
- The model assumptions are sometimes too strict.
## Other types of approaches
### Dynamic Capacity Networks
$h(x) = g(f(x))$, where $f$ extracts features and $g$ maps them to the output.

We apply the coarse layers $f_c$ to the whole input $x$ and leverage the fine layers $f_f$ only at a few "important" input regions (see the sketch after this list).
- $f_c$: low-capacity subnetworks
- $f_f$: high-capacity subnetworks
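
A toy sketch of the coarse/fine split, assuming NumPy and placeholder callables: the cheap sub-network $f_c$ runs on every input region, a simple salience score picks a few regions, and only those are re-processed by the expensive sub-network $f_f$ before the top layers $g$ aggregate everything. The salience heuristic and the lambdas below are illustrative only.

```python
import numpy as np

def dcn_forward(x_regions, f_coarse, f_fine, g, top_k=4):
    """Apply f_coarse to every region, f_fine only to the top_k regions with
    the largest coarse-feature norm, then combine with g."""
    coarse = np.stack([f_coarse(r) for r in x_regions])
    salience = np.linalg.norm(coarse, axis=-1)
    features = coarse.copy()
    for i in np.argsort(salience)[-top_k:]:       # a few "important" regions
        features[i] = f_fine(x_regions[i])
    return g(features)

# Usage with toy sub-networks; both map a region to an 8-D feature vector.
x = np.random.randn(16, 32)                       # 16 input regions
f_c = lambda r: r[:8]                             # low-capacity stand-in
f_f = lambda r: np.tanh(2.0 * r[:8])              # high-capacity stand-in
g = lambda feats: feats.mean(axis=0)              # top layers
out = dcn_forward(x, f_c, f_f, g)
```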
### Conditional computation
### Dynamic DNNs (D2NNs)
### Reducing convolutional overheads with FFT-based convolutions
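
A minimal NumPy sketch of the convolution theorem behind FFT-based convolution: zero-pad, multiply the spectra, invert. For large kernels this replaces the $O(nm)$ direct convolution with $O((n+m)\log(n+m))$ work; FFT-based CNN kernels apply the same idea in 2-D per channel.

```python
import numpy as np

def fft_conv1d(signal, kernel):
    """Linear convolution via the FFT (convolution theorem)."""
    n = len(signal) + len(kernel) - 1     # pad to avoid circular wrap-around
    return np.fft.ifft(np.fft.fft(signal, n) * np.fft.fft(kernel, n)).real

x = np.random.randn(1024)
k = np.random.randn(31)
assert np.allclose(fft_conv1d(x, k), np.convolve(x, k))
```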
### Fast convolution using the Winograd algorithm
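
A minimal NumPy sketch of Winograd's $F(2, 3)$ minimal filtering: two outputs of a 3-tap filter are computed from a 4-element input tile with 4 multiplications instead of 6, using the standard transform matrices $B^T$, $G$, $A^T$. Production kernels tile 2-D inputs and amortize the transforms across channels; this only checks the 1-D identity.

```python
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

d = np.random.randn(4)     # input tile
g = np.random.randn(3)     # filter taps

y = A_T @ ((G @ g) * (B_T @ d))                          # 4 elementwise multiplies
assert np.allclose(y, np.correlate(d, g, mode='valid'))  # matches direct filtering
```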