# Single Image Super Resolution (SISR) using Deep Learning

[TOC]

## Problem statement:
- Resolution is one of the factors that affects Tesseract OCR accuracy
- When the resolution of an image drops below a certain threshold, Tesseract's output quality degrades on text-dense images
- Image super resolution can be used to enhance the quality of low-resolution images
- ![](https://i.imgur.com/wPEQii4.png)

## Problem definition:
Image super-resolution aims at recovering the corresponding HR image from an LR image. Generally, the LR image $I_x$ is modeled as the output of the following degradation:

$I_x = D(I_y; \delta)$

where $D$ denotes a degradation mapping function, $I_y$ is the corresponding HR image and $\delta$ is the parameter of the degradation process.

## Datasets for Super resolution:
![](https://i.imgur.com/1Tqo9EU.png)
- These datasets contain the ground-truth HR images. To synthesize the LR input images, the following steps are taken (minimal sketches of this pipeline and of the learnable upsampling operators appear at the end of the Image Quality Assessment section below):
    - sub-images of a specific dimension $f_{sub} \times f_{sub} \times c$ channels are extracted by randomly cropping the HR images
    - the cropped sub-images are blurred using a Gaussian kernel
    - each blurred sub-image is sub-sampled by the network's upscaling factor, i.e. pixels are removed from the designated rows and columns
    - the sub-sampled image is upscaled by the same factor using an upsampling technique such as bicubic interpolation, transposed convolution or sub-pixel convolution
- Upsampling approaches:
    - Transposed convolution: ![](https://i.imgur.com/2Fofuz0.png)
    - Sub-pixel convolution: ![](https://i.imgur.com/NlCaL3L.png)

## Image Quality Assessment:
- Image quality refers to the visual attributes of images and focuses on the perceptual assessments of viewers
- In general, image quality assessment (IQA) methods include subjective methods based on human perception (i.e., how realistic the image looks) and objective computational methods
- Some commonly used objective IQA metrics:
    - Peak Signal-to-Noise Ratio (PSNR): defined via the maximum pixel value (denoted as $L$) and the mean squared error (MSE) between the images, i.e. the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation
        - PSNR = $10 \log_{10} \frac{L^2}{\frac{1}{N}\sum_{i=1}^N(I(i) - \hat{I}(i))^2}$, where $I$ is the HR image with $N$ pixels and $\hat{I}$ is the reconstruction
        - The ability of PSNR to capture perceptually relevant differences such as high texture detail is very limited, as it is based only on pixel-level differences
    - Structural Similarity Index (SSIM):
        - Measures the structural similarity between images based on comparisons of luminance, contrast and structure
        - Luminance: $\mu_I = \frac{1}{N}\sum_{i=1}^N I(i)$
        - Contrast (standard deviation): $\sigma_I = (\frac{1}{N-1}\sum_{i=1}^N(I(i) - \mu_I)^2)^\frac{1}{2}$
        - Comparison of luminance: $C_l(I, \hat{I}) = \frac{2\mu_I\mu_{\hat{I}}+C_1}{\mu_I^2+\mu_{\hat{I}}^2 + C_1}$
        - Comparison of contrast: $C_c(I, \hat{I}) = \frac{2\sigma_I\sigma_{\hat{I}}+C_2}{\sigma_I^2+\sigma_{\hat{I}}^2 + C_2}$
        - Comparison of structure: $C_s(I, \hat{I}) = \frac{\sigma_{I\hat{I}}+C_3}{\sigma_I\sigma_{\hat{I}} +C_3}$, where $\sigma_{I\hat{I}} = \frac{1}{N-1}\sum_{i=1}^N(I(i) - \mu_I)(\hat{I}(i) - \mu_{\hat{I}})$
        - $SSIM(I, \hat{I}) = (C_l(I, \hat{I}))^\alpha (C_c(I, \hat{I}))^\beta (C_s(I, \hat{I}))^\gamma$
        - Because SSIM evaluates reconstruction quality from the perspective of the HVS (Human Visual System), it better meets the requirements of perceptual assessment
- Mean Opinion Score (MOS) is a subjective IQA metric. To calculate MOS, human raters are asked to assign perceptual quality scores to the tested images. Typically, the scores range from 1 (bad) to 5 (good), and the final MOS is calculated as the arithmetic mean over all ratings.
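To make the two objective metrics above concrete, here is a minimal NumPy sketch of PSNR and of a simplified SSIM that follows the formulas directly, with $\alpha = \beta = \gamma = 1$ and $C_3 = C_2/2$ (common conventions, assumed here). Note that standard SSIM implementations compute the statistics over local windows and average the result, rather than over the whole image as this sketch does.

```python
import numpy as np

def psnr(img, ref, max_val=255.0):
    """PSNR as defined above; max_val plays the role of L."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

def ssim_global(img, ref, max_val=255.0):
    """SSIM over global image statistics (alpha = beta = gamma = 1).

    With C3 = C2 / 2, the contrast and structure comparisons collapse
    into the familiar two-factor form used below.
    """
    x = img.astype(np.float64).ravel()
    y = ref.astype(np.float64).ravel()
    c1 = (0.01 * max_val) ** 2  # C1, stabilizes the luminance term
    c2 = (0.03 * max_val) ** 2  # C2, stabilizes the contrast term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```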
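And here, as referenced in the Datasets section above, is a minimal sketch of the LR/HR pair synthesis pipeline using Pillow and NumPy. The function name and the default values of `f_sub`, `scale` and `sigma` are illustrative assumptions (the defaults loosely follow SRCNN's setup), not values prescribed by the datasets themselves.

```python
import random
import numpy as np
from PIL import Image, ImageFilter

def make_lr_hr_pair(hr_path, f_sub=33, scale=3, sigma=1.0):
    """Synthesize one (LR, HR) training pair from a ground-truth HR image."""
    hr = Image.open(hr_path).convert("RGB")

    # 1. Randomly crop an f_sub x f_sub sub-image from the HR image.
    x = random.randint(0, hr.width - f_sub)
    y = random.randint(0, hr.height - f_sub)
    hr_sub = hr.crop((x, y, x + f_sub, y + f_sub))

    # 2. Blur the sub-image with a Gaussian kernel.
    blurred = hr_sub.filter(ImageFilter.GaussianBlur(radius=sigma))

    # 3. Sub-sample by the upscaling factor: keep every `scale`-th pixel.
    arr = np.asarray(blurred)[::scale, ::scale]

    # 4. Upscale back to f_sub x f_sub with bicubic interpolation so the
    #    LR input and the HR target have matching spatial dimensions.
    lr_sub = Image.fromarray(arr).resize((f_sub, f_sub), Image.BICUBIC)
    return lr_sub, hr_sub
```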
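The two learnable upsampling approaches can likewise be expressed in a few lines of PyTorch. The channel count and the scale factor of 2 below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

scale = 2
x = torch.randn(1, 64, 16, 16)  # a batch with one 64-channel feature map

# Transposed convolution: learns the upsampling filter directly.
# kernel_size=4, stride=2, padding=1 doubles the spatial dimensions.
tconv = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)
print(tconv(x).shape)  # torch.Size([1, 64, 32, 32])

# Sub-pixel convolution: an ordinary convolution produces scale^2 * C
# channels, then PixelShuffle rearranges them into a (2H, 2W) map.
subpixel = nn.Sequential(
    nn.Conv2d(64, 64 * scale ** 2, kernel_size=3, padding=1),
    nn.PixelShuffle(scale),
)
print(subpixel(x).shape)  # torch.Size([1, 64, 32, 32])
```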
## Deep Learning based SISR approaches:

### [SRCNN](https://arxiv.org/abs/1501.00092):
- Proposed a deep learning method for single image super-resolution (SISR). The method directly learns an end-to-end mapping between the low/high-resolution images
- Developed a simple CNN architecture with 3 layers, each layer conceptually performing a specific operation:
    - Layer 1: patch extraction and representation. This layer extracts patches from the upsampled input image and represents each patch as a high-dimensional vector (CNN feature maps)
    - Layer 2: non-linear mapping. This layer non-linearly maps each high-dimensional vector from layer 1 onto another high-dimensional vector
    - Layer 3: reconstruction. The feature maps from layer 2 are aggregated to reconstruct the final high-resolution image
- ![](https://i.imgur.com/ClK3TrU.png)
- Loss function: mean squared error between the pixels of the generated HR image and the ground-truth HR image: $L(\theta) = \frac{1}{n}\sum_{i=1}^n||F(Y_i; \theta) - X_i||^2$
- Metric: PSNR
- Training data: sub-images generated from ImageNet data
- The authors conducted experiments to empirically choose the filter sizes and the number of layers. They adopted the model with a good performance-speed trade-off: a three-layer network with $f_1 = 9$, $f_2 = 5$, $f_3 = 5$, $n_1 = 64$ and $n_2 = 32$, trained on ImageNet (a PyTorch sketch of this architecture follows after this section). For each upscaling factor in $\{2, 3, 4\}$, a specific network is trained
- Conclusion:
    - Proposed the first end-to-end trainable CNN architecture for SISR
    - Further improvements can be achieved with different choices of network architecture and loss function
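A minimal PyTorch sketch of the adopted 9-5-5 architecture, as referenced above. The "same" padding and the single luminance channel are assumptions made here to keep the example self-contained; the original network uses no padding and compares against a cropped ground truth.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN: f1=9, f2=5, f3=5 and n1=64, n2=32 filters.

    The input is the bicubic-upsampled LR image, so the network itself
    performs no resizing.
    """

    def __init__(self, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            # Layer 1: patch extraction and representation
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),
            nn.ReLU(inplace=True),
            # Layer 2: non-linear mapping
            nn.Conv2d(64, 32, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            # Layer 3: reconstruction
            nn.Conv2d(32, channels, kernel_size=5, padding=2),
        )

    def forward(self, x):
        return self.body(x)

model = SRCNN()
loss_fn = nn.MSELoss()  # pixel-wise MSE, matching the loss above
sr = model(torch.randn(1, 1, 33, 33))  # e.g. a batch of 33x33 sub-images
```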
### [SRGAN](https://openaccess.thecvf.com/content_cvpr_2017/papers/Ledig_Photo-Realistic_Single_Image_CVPR_2017_paper.pdf)
- Proposed a GAN-based super resolution network capable of inferring photo-realistic natural images for 4× upscaling factors
- Proposed a perceptual loss function which consists of an adversarial loss and a content loss
    - The adversarial loss pushes the solution towards the natural image manifold using a discriminator network that is trained to differentiate between super-resolved images and original photo-realistic images
    - The content loss is motivated by perceptual similarity instead of similarity in pixel space
    - A pixel-wise MSE loss encourages finding pixel-wise averages of plausible solutions, which are typically overly smooth and thus have poor perceptual quality
- Contributions:
    - Achieved a new state of the art for SISR with high upscaling factors (4×), as measured by PSNR and structural similarity (SSIM), with a 16-block deep ResNet (SRResNet)
    - Proposed SRGAN, a GAN-based network optimized with the proposed perceptual loss. Use of the GAN framework encourages the reconstructions to move towards regions of the search space with a high probability of containing photo-realistic images, and thus closer to the natural image manifold
    - Confirmed with an extensive Mean Opinion Score (MOS) test that SRGAN is the new state of the art for recovering photo-realistic images
- Method:
    - Only high-resolution images $I^{HR}$ are available during training; to generate the low-resolution images $I^{LR}$, a Gaussian filter is applied to $I^{HR}$ followed by a downsampling operation with downsampling factor $r$
    - Adversarial network architecture:
        - The general idea is that it allows one to train a generative model $G$ with the goal of fooling a differentiable discriminator $D$ that is trained to distinguish super-resolved images from real images ![](https://i.imgur.com/3U3cjMG.png)
        - The generator can learn to create solutions that are highly similar to real images
        - The GAN framework encourages perceptually superior solutions residing in the subspace, the manifold, of natural images
        - ![](https://i.imgur.com/KHg8yJN.png)
    - The goal is to train a generating function $G$ that can estimate the HR counterpart of an LR input image. To achieve this, a generator network, i.e. a feed-forward CNN $G_{\theta_{G}}$, is trained using the SR-specific loss function $l^{SR}$ ![](https://i.imgur.com/RYv5k9r.png)
    - $l^{SR}$ is a specifically designed perceptual loss: a weighted combination of several loss components that model distinct desirable characteristics of the recovered SR image
    - Perceptual loss function: a weighted sum of two components, a content loss and an adversarial loss component (a PyTorch sketch follows after this list) ![](https://i.imgur.com/XfxLutv.png)
        - Content loss:
            - MSE loss: the most common choice is the pixel-wise MSE loss ![](https://i.imgur.com/wch17Lu.png) While achieving particularly high PSNR, solutions of MSE optimization problems often lack high-frequency content, which results in perceptually unsatisfying, overly smooth textures ![](https://i.imgur.com/si9vMLW.png)
            - VGG loss: to address this issue, the authors use a VGG loss based on the ReLU activation layers of the pre-trained 19-layer VGG network ![](https://i.imgur.com/a2UDFtP.png) Here $\phi_{i,j}$ denotes the feature map obtained by the j-th convolution (after activation) before the i-th max-pooling layer within the VGG19 network, $G_{\theta_{G}}(I^{LR})$ is the reconstructed SR image and $I^{HR}$ is the HR ground-truth image
        - Adversarial loss:
            - The generative component of the loss encourages the network to favor solutions that reside on the manifold of natural images, by trying to fool the discriminator network
            - $l^{SR}_{Gen}$ is defined based on the probabilities of the discriminator $D_{\theta_{D}}(G_{\theta_{G}}(I^{LR}))$ ![](https://i.imgur.com/Rtyyg7g.png)
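As referenced above, a minimal PyTorch sketch of the perceptual loss, i.e. a VGG content loss plus the adversarial term weighted by $10^{-3}$ as in the paper. The torchvision slice index used to approximate $\phi_{5,4}$, and the assumption that `disc` returns probabilities in $(0, 1)$, are choices of this sketch rather than details fixed by the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGContentLoss(nn.Module):
    """MSE between VGG19 feature maps of the SR and the HR image.

    Slicing features[:36] ends right after the activation of the last
    convolution before the 5th max-pooling layer, which corresponds to
    the phi_{5,4} features of SRGAN-VGG54 (index assumed here; verify
    against your torchvision version). Inputs are expected to be
    ImageNet-normalized RGB tensors.
    """

    def __init__(self, feature_index=36):
        super().__init__()
        self.vgg = vgg19(weights="IMAGENET1K_V1").features[:feature_index].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False  # the VGG network stays frozen
        self.mse = nn.MSELoss()

    def forward(self, sr, hr):
        return self.mse(self.vgg(sr), self.vgg(hr))

def generator_loss(content_loss, disc, sr, hr, adv_weight=1e-3):
    """l_SR = content loss + 10^-3 * adversarial loss.

    `disc` is assumed to map an image batch to probabilities in (0, 1).
    """
    adv = -torch.log(disc(sr) + 1e-8).mean()  # -log D(G(I_LR))
    return content_loss(sr, hr) + adv_weight * adv
```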
- Experiments:
    - Data: the Set5, Set14 and BSD100 datasets were used for benchmarking
    - Training data: the networks were trained on images from the ImageNet dataset
    - SRResNet was trained first; the MSE-based SRResNet network was then used as initialization for the generator when training the actual GAN, to avoid undesired local optima
    - MOS testing: 26 raters were asked to assign an integral score from 1 (bad quality) to 5 (excellent quality) to the super-resolved images
    - Investigation of content loss: to investigate the effect of different content-loss choices, networks (both SRResNet and SRGAN) with different content losses were trained and compared
        - ![](https://i.imgur.com/AyxIdRY.png)
        - SRGAN-MSE: $l^{SR}_{MSE}$, to investigate the adversarial network with the standard MSE as content loss
        - SRGAN-VGG22: $l^{SR}_{VGG/2.2}$ with $\phi_{2,2}$, a loss defined on feature maps representing lower-level features
        - SRGAN-VGG54: $l^{SR}_{VGG/5.4}$ with $\phi_{5,4}$, a loss defined on feature maps of higher-level features from deeper network layers, with more potential to focus on the content of the images
        - SRResNet-MSE and SRResNet-VGG22: generator networks without the adversarial component, to analyse the effect of the adversarial framework
        - Findings:
            - Even combined with the adversarial loss, MSE provides solutions with the highest PSNR values that are, however, perceptually rather smooth and less convincing than the results achieved with a loss component more sensitive to visual perception. This is caused by competition between the MSE-based content loss and the adversarial loss
            - SRGAN-VGG54 significantly outperformed the other SRGAN and SRResNet variants on Set14 in terms of MOS
            - The authors observed a trend that the higher-level VGG feature maps $\phi_{5,4}$ yield better texture detail than $\phi_{2,2}$
            - ![](https://i.imgur.com/Th0ke4C.png)
- Results:
    - SRResNet (SRResNet-MSE) and SRGAN (SRGAN-VGG54) were compared with nearest-neighbor, bicubic, SRCNN, SelfExSR, DRCN and ESPCN
    - ![](https://i.imgur.com/nuZgPWW.png)
    - ![](https://i.imgur.com/NiI5riU.png)
    - The objective IQA metrics (PSNR, SSIM) confirm that SRResNet is the new state of the art
    - For MOS ratings, SRGAN outperforms all reference methods by a large margin and sets a new state of the art for photo-realistic image SR
- Discussion:
    - Standard quantitative IQA measures such as PSNR and SSIM fail to capture and accurately assess image quality with respect to the human visual system
    - The presented model is not optimized for real-time video SR, as the focus of this work was the perceptual quality of super-resolved images rather than computational efficiency
    - Of particular importance when aiming for photo-realistic solutions to the SR problem is the choice of the content loss. In this work, the authors found $l^{SR}_{VGG/5.4}$ to yield the perceptually most convincing results, which is attributed to the potential of deeper network layers to represent features of higher abstraction, away from pixel space
    - The authors speculate that the feature maps of these deeper layers focus purely on the content, leaving the adversarial loss to focus on texture details, which are the main difference between super-resolved images without the adversarial loss and photo-realistic images

## References:
- SRGAN: https://openaccess.thecvf.com/content_cvpr_2017/papers/Ledig_Photo-Realistic_Single_Image_CVPR_2017_paper.pdf
- SRCNN: https://arxiv.org/abs/1501.00092
- Deep learning for image super-resolution survey: https://arxiv.org/pdf/1902.06068.pdf