# EtinyNet: Extremely Tiny Network for TinyML
## Introduction
Machine learning on tiny IoT devices based on microcontroller units (MCUs) is appealing but challenging: the **MEMORY** of microcontrollers is 2-3 orders of magnitude smaller than even that of mobile phones.
* Billions of IoT devices around the world are based on microcontrollers
* Low cost: even low-income people can afford access, democratizing AI
* Low power: green AI, reduced carbon footprint

In this paper, the authors first design a parameter-efficient tiny architecture by introducing the **Dense Linear Depthwise Block**. Then, a novel **Adaptive Scale Quantization (ASQ)** method is proposed to further quantize tiny models at aggressively low bit-widths while retaining accuracy.
## Related Work
### MobileNet: Efficient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
* Proposes **depthwise convolution**, an extreme case of group convolution where the group number equals the number of input channels

### MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018]
* Depthwise convolution has much lower capacity than normal convolution.
* Increasing the depthwise convolution's input and output channels improves its capacity.
* Depthwise convolution's cost grows only linearly with the channel count, so the cost remains affordable (see the worked cost comparison below).
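
To see why the cost stays affordable, compare the per-layer multiply costs in MobileNet's notation ($D_K$: kernel size, $D_F$: output feature-map size, $M$/$N$: input/output channels):

$$
\underbrace{D_K^2 \cdot M \cdot N \cdot D_F^2}_{\text{standard conv}} \qquad \text{vs.} \qquad \underbrace{D_K^2 \cdot M \cdot D_F^2}_{\text{depthwise}} + \underbrace{M \cdot N \cdot D_F^2}_{\text{pointwise}}
$$

The depthwise term is linear in the channel count $M$, while a standard convolution scales with the product $M \cdot N$; the overall reduction factor is $\frac{1}{N} + \frac{1}{D_K^2}$.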

### DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [Zhou et al., arXiv 2016]
* Uses the heuristic linear quantization below. Note that in their definition, $x$ is restricted to $[0, 1]$, so they leverage various range transformations to satisfy this condition.
$$
w^q = 2\, Q^k\!\left(\frac{\tanh(w)}{2\max(|\tanh(W)|)} + \frac{1}{2}\right) - 1
$$
where $Q^k$ is the $k$-bit quantization function below, with step size $\Delta = \frac{1}{2^k - 1}$ for values in $[0, 1]$:
$$
Q(x) = \Delta \, \operatorname{round}\!\left(\frac{x}{\Delta}\right)
$$
Before $Q^k$, $W$ is nonlinearly transformed to lie within $[0, 1]$; after quantization, an affine transformation maps the value back to $[-1, 1]$ and constrains it there.
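
A minimal PyTorch sketch of this weight-quantization pipeline (illustrative only; the function name is ours, and the straight-through gradient trick used for training is omitted):

```python
import torch

def dorefa_quantize_weights(w: torch.Tensor, k: int = 8) -> torch.Tensor:
    # Nonlinear transform: squash weights into [0, 1] via tanh
    x = torch.tanh(w) / (2 * torch.tanh(w).abs().max()) + 0.5
    # k-bit uniform quantization on [0, 1], step size 1 / (2^k - 1)
    n = 2 ** k - 1
    x_q = torch.round(x * n) / n
    # Affine transform back to [-1, 1]
    return 2 * x_q - 1
```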
## Methods
To reduce computation and parameter count, the authors build on the block proposed by MobileNet, the **depthwise separable convolution**, which is made up of two operations, depthwise convolution and pointwise convolution (1 × 1 convolution), and develop the **Linear Depthwise Block (LB)**.

Note that in lightweight neural architectures, the **activation layer (ReLU) may harm performance**, because such networks have far fewer parameters than ordinary architectures. Therefore, as in previous works, the authors remove the ReLU layer after the depthwise convolution layer, as sketched below.
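
As an illustration, here is a minimal PyTorch sketch of a depthwise separable block whose depthwise stage is linear (no ReLU); the class name and exact layer ordering are our assumptions, not the paper's exact LB:

```python
import torch
import torch.nn as nn

class LinearDepthwiseBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Depthwise conv: groups == in_channels; NO ReLU afterwards
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.bn_dw = nn.BatchNorm2d(in_ch)
        # Pointwise (1x1) conv mixes channels; ReLU is kept here
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn_pw = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn_dw(self.dw(x))        # linear depthwise stage
        return self.act(self.bn_pw(self.pw(x)))
```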
However, restricted by the total parameter budget, the **width (number of channels)** of the network cannot be too large, which degrades performance.
To alleviate this problem, the authors further introduce the **Dense Linear Depthwise Block (DLB)**.

At the cost of a slight increase in computation, the structure with shortcut connections can be regarded as a **wider network** consisting of sub-networks.
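
A hypothetical sketch of such a shortcut-connected variant, reusing the LB sketch above (the real DLB's dense connections may be wired differently; this is an illustrative assumption):

```python
class DenseLinearDepthwiseBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Input and output widths must match for the shortcut addition
        self.lb = LinearDepthwiseBlock(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # With the shortcut, the block acts like a wider network made of
        # the identity sub-network plus the LB sub-network
        return x + self.lb(x)
```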
Finally, the authors point out that the DoReFa scheme works well when the quantization bit-width is relatively high (8-bit or 16-bit), but it leads to significant quantization error at extremely low bit-widths such as 4-bit.

*(Figure: frequency histograms of model parameters in EtinyNet-1.0.)* Panel (a) shows the distribution of pointwise convolution weights in layer 1.4; panel (b) shows the distribution of pointwise convolution weights in a higher-level layer. Ori. indicates the original distribution, while Quan. is the distribution after quantization. **It is clear that the DoReFa scheme introduces significant quantization error in deeper layers at low bit-widths.**
To alleviate this side effect of DoReFa, the authors propose the novel **Adaptive Scale Quantization (ASQ)**, replacing the clamping step and quantization step with the following:
$$
\dot W = \frac{W}{\lambda \sqrt{\operatorname{Var}(W) + \epsilon}}
$$
$$
\hat W = \frac{\tanh(\dot W)}{\max(|\tanh(\dot W)|)}
$$
$$
Q(\hat W) = \Delta \, \operatorname{round}\!\left(\frac{\hat W}{\Delta}\right)
$$
Note that symmetric quantization is utilized, so for $b$-bit quantization the step size is $\Delta = \frac{1}{2^{b-1}-1}$. More explanation can be found [here](https://www.youtube.com/watch?v=91stHPsxwig&list=PL80kAHvQbh-ocildRaxjjBy6MR1ZsNCU7&index=6).
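
A minimal sketch of the three ASQ steps above (assumptions: $\lambda$ is treated as a fixed hyperparameter `lam`, and the straight-through gradient handling needed for training is omitted):

```python
import torch

def asq_quantize(w: torch.Tensor, bits: int = 4,
                 lam: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    # Step 1: adaptively rescale by lambda * std of the weights
    w_dot = w / (lam * torch.sqrt(w.var() + eps))
    # Step 2: squash into [-1, 1] with tanh, normalized by the max
    w_hat = torch.tanh(w_dot) / torch.tanh(w_dot).abs().max()
    # Step 3: symmetric uniform quantization with step size delta
    delta = 1.0 / (2 ** (bits - 1) - 1)
    return delta * torch.round(w_hat / delta)

# Example: 4-bit quantization of a random pointwise-conv weight tensor
w = torch.randn(64, 64, 1, 1)
print(asq_quantize(w, bits=4).unique().numel())  # at most 2**4 - 1 levels
```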
## Experiments
Results fall mainly into two aspects: higher accuracy and lower **PEAK MEMORY CONSUMPTION**, which can be roughly represented by **PEAK SRAM** usage.


It is worth mentioning that in the TinyML field, the cost of moving data in and out of memory is **much higher** than the cost of additional computation. See more relevant experimental results in this [paper](https://arxiv.org/abs/2007.10319).

## Conclusion
In this paper, the authors aim to develop an efficient tiny model that not only achieves **high accuracy** compared to other lightweight networks, but also **restricts memory utilization** when the model is deployed on a microcontroller unit (MCU).
The paper designs the linear depthwise block (LB) and its variant, the dense linear depthwise block (DLB). Further, a novel adaptive scale quantization (ASQ) method is introduced, pushing weight storage and memory utilization to the state of the art.
To this end, the proposed extremely tiny models make it possible to run CNNs on a single chip, showing the potential for designing small-footprint ASIC CNN accelerators of even higher efficiency.