# Extremely Lightweight Quantization Robust Real-Time Single-Image Super Resolution for Mobile Devices

> * https://arxiv.org/abs/2105.10288
> * https://github.com/cxzhou95/XLSR/tree/main

:::info
:bulb: **Mobile AI SISR 2021**
The contest evaluates models on the **Synaptics VS680 Edge AI SoC**, which requires quantization; ***uint8*** is preferred.
The challenge scoring formula:
![](https://hackmd.io/_uploads/S1Leh6lHn.png =350x)
:::

## Assumed Hardware Limitations

* No optimization of elementwise (vector) operations
* No optimization of reshape and transpose operations
* Per-channel quantization is not supported
* Skip-connections should be avoided

## Methods for Designing a Performant Model for Mobile Devices

The author lists a summary of methodologies for designing and running a performant deep learning model:

* Hand-Designed Architectures
* Efficient Building Block Design
    :::info
    > https://sh-tsang.medium.com/reading-deep-roots-improving-cnn-efficiency-with-hierarchical-filter-groups-image-9aba67f23b27

    :bulb: **Convolution with Filter Groups**
    Reduces connectivity (the more filter groups, the fewer connections), computational complexity, and model size.
    In the paper, the author states that:
    > **Depth-wise convolution** can cause large ***quantization error*** without special precaution.

    ![](https://hackmd.io/_uploads/rJuoxRlS2.png =250x)
    *Convolution*
    ![](https://hackmd.io/_uploads/rknlb0lrn.png =250x)
    *Depthwise Convolution*
    ![](https://hackmd.io/_uploads/Syg7bRxr3.png =250x)
    *Convolution with Filter Groups*
    :::
* Network Pruning/Sparsification
* Network Quantization
* Network Architecture Search (NAS)
    :::info
    > https://medium.com/ai-academy-taiwan/%E6%8F%90%E7%85%89%E5%86%8D%E6%8F%90%E7%85%89%E6%BF%83%E7%B8%AE%E5%86%8D%E6%BF%83%E7%B8%AE-neural-architecture-search-%E4%BB%8B%E7%B4%B9-ef366ffdc818

    :bulb: **Neural Architecture Search**
    A method/algorithm that searches a design space for the best neural network architecture, targeting both performance and hardware constraints.
    :::
* Knowledge Distillation
    :::info
    > https://chtseng.wordpress.com/2020/05/12/%E7%9F%A5%E8%AD%98%E8%92%B8%E9%A4%BE-knowledgedistillation/

    :bulb: **Knowledge Distillation**
    Train a bigger model as the teacher and use its outputs as extra targets for the smaller student model, so the student can learn the dataset better. A classification teacher produces a probability distribution that carries ***similarity*** information not provided in the hard labels of the training dataset.
    :::

## Model Architecture

![](https://hackmd.io/_uploads/SJa2t0gSn.png)

Super-resolution models commonly use a linear output activation, which helps the model converge. Although a linear output activation imposes no bound on the output, continued training pushes the output toward the training data, which is bounded to 0–255 (or 0–1.0); this acts as an indirect bound. However, the intermediate activations remain unbounded and can produce outliers. Those outliers widen the quantization range, so nodes with smaller values get quantized to zero and their information is lost.
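To make this concrete, here is a minimal NumPy sketch (not from the paper) using a toy per-tensor uint8 quantizer for non-negative, post-ReLU activations: a single unbounded outlier inflates the quantization scale, and the smaller activations collapse into the zero bin.

```python
import numpy as np

def quantize_uint8(x):
    """Toy per-tensor uint8 quantization for non-negative (post-ReLU) activations."""
    scale = x.max() / 255.0
    q = np.clip(np.round(x / scale), 0, 255).astype(np.uint8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Well-behaved activations: small values survive quantization.
act = np.array([0.02, 0.10, 0.50, 0.90], dtype=np.float32)
q, s = quantize_uint8(act)
print(dequantize(q, s))   # ~[0.021, 0.099, 0.501, 0.900]

# One unbounded outlier inflates the scale; the small values collapse to zero.
act_outlier = np.array([0.02, 0.10, 0.50, 120.0], dtype=np.float32)
q, s = quantize_uint8(act_outlier)
print(dequantize(q, s))   # [0.0, 0.0, 0.47, 120.0]
```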
Therefore, a ***Clipped ReLU*** is used by the author.

:::success
**Clipped ReLU**
![](https://hackmd.io/_uploads/Hy3U0RgBh.png =250x)
*Enforces an upper bound of 1.0*
:::

:::success
**Building Block (GBlock)**
![](https://hackmd.io/_uploads/S12ziCeBh.png =x300)
:::

## Training

:::info
:bulb: **Loss Function**
> https://ieeexplore.ieee.org/document/8100101

**Charbonnier Loss**
![](https://hackmd.io/_uploads/BJb8gRWr3.png)
*Intuition:* a smoother variant of **L1 loss**
:::

### Training Tricks

As stated in the paper,
> One drawback of using Clipped ReLU at the output layer is that, unfortunately, the model is **harder to optimize**.

Therefore, the author uses the following tricks in the training stage (a sketch combining them follows the list).

1. **Triangular cyclic learning rate:**
    > Starts with a *small learning rate*, quickly increases to a peak value in the first 50 epochs, then slowly decreases to a low value until epoch 5000.

    :::info
    :bulb: **Cyclic Learning Rate**
    > https://arxiv.org/pdf/1506.01186

    ![](https://hackmd.io/_uploads/HyaDHC-Bh.png)
    *Intuition:* find a good learning rate range `(base_lr ~ max_lr)` for a specific region of the loss landscape.
    :::
2. **Calculate PSNR every epoch and keep the best checkpoint**
3. **Initialize Conv2D layers with He-Normal using 0.1 variance scaling**
    > to initialize kernels closer to zero and avoid large values, which might in turn create outliers in the activations.

    :::info
    :bulb: **He-Normal**
    > https://arxiv.org/pdf/1502.01852.pdf

    Based on **Xavier initialization**, which intuitively tries to keep the variance of a neuron's input and output the same. **He initialization** targets ReLU, for which Xavier initialization does not work well: it assumes that half of the neurons are deactivated after ReLU and scales the variance accordingly.
    *TODO: maybe another post for this*
    :::
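A minimal TensorFlow/Keras sketch of how these training tricks could be wired together (not the authors' code): `charbonnier_loss`, `triangular_lr`, `psnr`, `small_init`, and `build_xlsr` are illustrative names, and the `eps` and learning-rate values are assumptions; only the 50/5000 epoch numbers follow the description above.

```python
import tensorflow as tf

# Charbonnier loss: a smooth variant of L1 (the eps value is an assumption).
def charbonnier_loss(y_true, y_pred, eps=1e-3):
    return tf.reduce_mean(tf.sqrt(tf.square(y_true - y_pred) + eps * eps))

# Triangular cyclic-style schedule as described in the note: ramp up during the
# first 50 epochs, then decay slowly until epoch 5000. The base/peak/min values
# below are placeholders, not from the paper.
BASE_LR, PEAK_LR, MIN_LR = 1e-4, 1e-3, 1e-5
RAMP_EPOCHS, TOTAL_EPOCHS = 50, 5000

def triangular_lr(epoch, lr=None):
    if epoch < RAMP_EPOCHS:
        return BASE_LR + (PEAK_LR - BASE_LR) * epoch / RAMP_EPOCHS
    frac = (epoch - RAMP_EPOCHS) / (TOTAL_EPOCHS - RAMP_EPOCHS)
    return PEAK_LR + (MIN_LR - PEAK_LR) * frac

# PSNR metric so the "keep the best epoch" trick has something to monitor.
def psnr(y_true, y_pred):
    return tf.image.psnr(y_true, y_pred, max_val=1.0)

# He-normal-like initializer scaled down to 0.1 variance scaling,
# to keep initial kernels close to zero.
small_init = tf.keras.initializers.VarianceScaling(
    scale=0.1, mode="fan_in", distribution="truncated_normal")

# How the pieces would fit together (model and datasets are placeholders):
# model = build_xlsr(kernel_initializer=small_init)
# model.compile(optimizer=tf.keras.optimizers.Adam(BASE_LR),
#               loss=charbonnier_loss, metrics=[psnr])
# callbacks = [
#     tf.keras.callbacks.LearningRateScheduler(triangular_lr),
#     tf.keras.callbacks.ModelCheckpoint("best.h5", monitor="val_psnr",
#                                        mode="max", save_best_only=True),
# ]
# model.fit(train_ds, validation_data=val_ds,
#           epochs=TOTAL_EPOCHS, callbacks=callbacks)
```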