# **Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (ESPCN)**

## **Abstract**

- Extract feature maps in **LR** space
- Introduce an efficient sub-pixel convolution layer to upscale **LR** to **HR**
> Replaces the bicubic filter with a specifically trained filter

## **Introduction**

- The global **SR** problem assumes **LR** data to be a low-pass filtered (blurred), downsampled and noisy version of **HR** data.
> It is a highly ill-posed problem, due to the loss of high-frequency information that occurs during the non-invertible low-pass filtering and subsampling operations.
- A key assumption in **SR** is that much of the high-frequency data is redundant and thus can be reconstructed from low-frequency components.
- Multi-image **SR**
> Explicit redundancy
- Single image super-resolution (**SISR**)
> Implicit redundancy

### **Related Work**

- Edge-based
- Image statistics-based
- Patch-based
- Sparsity-based
- Neural network
- Random forest

### **Motivations and Contributions**

- Increase the resolution from **LR** to **HR** only at the very end of the network
- Super-resolve **HR** data from **LR** feature maps
> Advantage: reduces computational and memory complexity; the upscaling filters are learned implicitly
- Achieves state-of-the-art performance and executes much faster

## **Method**

![](https://i.imgur.com/Bvx9ySH.png)

- Apply an $L$-layer CNN to the **LR** image
- Apply a sub-pixel convolution layer to upscale the **LR** feature maps to the **SR** image

$f^1(I^{LR};W_1,b_1) = \phi(W_1*I^{LR}+b_1)$
$f^l(I^{LR};W_{1:l},b_{1:l}) = \phi(W_l*f^{l-1}(I^{LR})+b_l)$
> The first $L-1$ layers compute feature maps in **LR** space; the final layer $f^L$ converts the **LR** feature maps to the **HR** image.

### **Deconvolution Layer**

- A popular choice to upscale the **LR** image
- Bicubic interpolation used in **super-resolution using deep convolutional networks** (**SRCNN**) is a special case of the deconvolution layer
- Convolutions after upscaling operate in **HR** space and are therefore expensive

### **Efficient Sub-pixel Convolution Layer**

- Interpolation, perforation or un-pooling can convert **LR** images to **HR** images, but the convolutions that follow then cost $r^2$ times more to compute.
> Convolution happens in **HR** space
- Alternatively, a convolution with stride $\dfrac{1}{r}$ in **LR** space does the same work more efficiently.
> ![](https://i.imgur.com/Ecli0a4.gif)
> $r$ is the upscale ratio
- The number of activation patterns is $r^2$, and each activation pattern has at most $\lceil\dfrac{k_s}{r}\rceil^2$ weights activated.
> $k_s$ is the kernel size
- The proposed method (see the sketch after this section):

$I^{SR} = f^L(I^{LR}) = PS(W_L*f^{L-1}(I^{LR})+b_L)$
> $PS(T)_{x,y,c} = T_{\lfloor x/r\rfloor,\ \lfloor y/r\rfloor,\ C\cdot r\cdot \bmod(y,r)+C\cdot \bmod(x,r)+c}$
> $PS$ is a periodic shuffling operator that rearranges an $H\times W\times C\cdot r^2$ tensor into an $rH\times rW\times C$ tensor, as in Fig. 1.

- Objective function (pixel-wise MSE):

$l(W_{1:L},b_{1:L}) = \dfrac{1}{r^2HW}\sum\limits_{x=1}^{rH}\sum\limits_{y=1}^{rW}\left(I_{x,y}^{HR}-f_{x,y}^L(I^{LR})\right)^2$

- The sub-pixel convolution layer is $\log_2 r^2$ times faster than a deconvolution layer and $r^2$ times faster than other techniques that upscale before convolution.
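Below is a minimal sketch of the method in PyTorch (the framework choice is an assumption; the paper does not prescribe one). The 5-3-3 kernel sizes, 64/32 feature maps and tanh activations follow the paper's configuration, but the padding, the toy training step and the `periodic_shuffle` helper name are illustrative only. `periodic_shuffle` is written out explicitly to mirror the $PS$ operator above.

```python
# Minimal ESPCN sketch: an L-layer CNN in LR space followed by a
# sub-pixel convolution (periodic shuffling) layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


def periodic_shuffle(t: torch.Tensor, r: int) -> torch.Tensor:
    """PS operator: rearrange an (N, C*r^2, H, W) tensor into (N, C, r*H, r*W)."""
    n, cr2, h, w = t.shape
    c = cr2 // (r * r)
    # Split the channel axis into (C, r, r), then interleave the two r-axes
    # with the spatial axes so each group of r^2 channels fills an r x r block.
    t = t.view(n, c, r, r, h, w)
    t = t.permute(0, 1, 4, 2, 5, 3)        # (N, C, H, r, W, r)
    return t.reshape(n, c, h * r, w * r)   # (N, C, rH, rW)


class ESPCN(nn.Module):
    """Feature extraction in LR space; upscaling only in the last layer."""

    def __init__(self, r: int = 3, in_channels: int = 1):
        super().__init__()
        self.r = r
        self.conv1 = nn.Conv2d(in_channels, 64, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(64, 32, kernel_size=3, padding=1)
        # Last layer emits C * r^2 channels, still at LR resolution.
        self.conv3 = nn.Conv2d(32, in_channels * r * r, kernel_size=3, padding=1)

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        x = torch.tanh(self.conv1(lr))      # f^1 = phi(W_1 * I_LR + b_1)
        x = torch.tanh(self.conv2(x))       # f^2
        x = self.conv3(x)                   # W_L * f^{L-1} + b_L (no activation)
        return periodic_shuffle(x, self.r)  # PS(...) -> I_SR in HR space


if __name__ == "__main__":
    r = 3
    model = ESPCN(r=r)
    lr_img = torch.randn(1, 1, 64, 64)          # LR luminance patch
    hr_img = torch.randn(1, 1, 64 * r, 64 * r)  # ground-truth HR patch (dummy)
    sr_img = model(lr_img)
    loss = F.mse_loss(sr_img, hr_img)           # pixel-wise MSE objective
    print(sr_img.shape, loss.item())            # torch.Size([1, 1, 192, 192])
```

In practice the hand-written `periodic_shuffle` can be replaced by PyTorch's built-in `nn.PixelShuffle(r)`, which performs the same rearrangement.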
## **Experiments**

### **Datasets**

Image
- Timofte dataset
- Berkeley segmentation dataset
- Super texture dataset
- ImageNet

Video
- Xiph database
- Ultra Video Group database

### **Benefits of the Sub-pixel Convolution Layer**

![](https://i.imgur.com/bj9eiQH.png)

### **Comparison to the State-of-the-art**

![](https://i.imgur.com/Yho2PDS.png)

### **Video Super-resolution Results**

![](https://i.imgur.com/2uSTPef.png)
![](https://i.imgur.com/JiaHOOa.png)

### **Run Time Evaluations**

- Compared to **SRCNN**, the number of convolution operations required to super-resolve one image is $r^2$ times smaller, and the total number of model parameters is $2.5$ times smaller (see the sketch at the end of this note).
- With an upscale factor of 3, **SRCNN** takes 0.435s per frame whilst the **ESPCN** model takes only 0.038s per frame.

## **Conclusion**

- Fixed-filter upscaling at the first layer does not provide any extra information for **SISR**. Therefore, performing the feature extraction stages in **LR** space instead of **HR** space reduces computational complexity.
- A novel sub-pixel convolution layer was proposed; it super-resolves **LR** data into **HR** space with very little additional computational cost compared to a deconvolution layer.
- The proposed method achieves a significant improvement in speed and performance compared to state-of-the-art models, making it the first CNN model capable of **SR** on HD videos in real time on a single GPU.

## **Future Work**

A spatio-temporal network can be added to improve **ESPCN** on video **SR** problems.
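As a rough back-of-the-envelope check of the run-time figures above, the sketch below counts the weights of the 9-5-5 SRCNN and the 5-3-3 ESPCN configurations reported in the paper (64 and 32 feature maps, single-channel input). Bias terms are ignored and the script is purely illustrative.

```python
# Parameter-count comparison between SRCNN 9-5-5 and ESPCN 5-3-3.

def conv_params(kernel, c_in, c_out):
    """Weights of a single conv layer (bias terms omitted)."""
    return kernel * kernel * c_in * c_out

r = 3  # upscale ratio

# SRCNN 9-5-5 works on the bicubic-upscaled (HR-sized) input.
srcnn = conv_params(9, 1, 64) + conv_params(5, 64, 32) + conv_params(5, 32, 1)

# ESPCN works in LR space and emits r^2 channels for the periodic shuffle.
espcn = conv_params(5, 1, 64) + conv_params(3, 64, 32) + conv_params(3, 32, r * r)

print(f"SRCNN params: {srcnn}, ESPCN params: {espcn}")  # 57184 vs 22624
print(f"parameter ratio: {srcnn / espcn:.1f}x")         # ~2.5x
# On top of the smaller filters, every ESPCN convolution runs on r^2 times
# fewer spatial positions, since all layers stay in LR space.
```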
{"metaMigratedAt":"2023-06-14T23:06:32.653Z","metaMigratedFrom":"Content","title":"**Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (ESPCN)**","breaks":true,"contributors":"[{\"id\":\"b3cca2aa-9f80-4806-ba6d-eca010ebefba\",\"add\":5284,\"del\":628}]"}
    756 views