# Deep Learning Super-Resolution for WhatsApp Image Compression

*Group 43: Lotte Kremer 4861957, Pepijn de Kruijff 4962966, Martijn Lieftinck 4862856*

[toc]
## Introduction

With the ever-increasing digitalization of the world, it has become normal to be able to instantly share photos and videos on the spot, wherever you are. Almost every phone has a camera and is connected to the internet. On top of that, smartphone cameras and screens keep capturing and displaying images at ever higher resolutions. We have reached a point where the "instantaneous" aspect of high-quality image sharing is no longer a given: the trend of increasing resolution comes with increasing file size, and bandwidth has become a limiting factor when sharing photos over the internet.

One of the most popular ways to share images is the messaging platform WhatsApp. To cope with the [billions](https://www2.deloitte.com/content/dam/Deloitte/global/Documents/Technology-Media-Telecommunications/gx-tmt-prediction-online-photo-sharing.pdf) of images shared yearly via WhatsApp, WhatsApp applies a form of image compression before sending an image to the receiver.

### Image compression

There are many ways to compress an image, but every method falls into one of two categories:

- Lossless compression
- Lossy compression

Lossless compression restructures the image data in such a way that no information is lost during compression, for example by removing unnecessary metadata or exploiting redundancy. However, file sizes can only be reduced slightly in a lossless manner. To achieve a more drastic size reduction, one must resort to lossy compression, which only preserves the most important information in an image and discards the rest.

The most common form of lossy image compression is JPEG compression. JPEG compression aims to minimize the perceptible loss while maximizing the compression rate. This is achieved by applying quantization in the frequency domain of the image, effectively removing high-frequency components that contribute relatively little to the overall image content (Figure 1).

<center> <img src="https://hackmd.io/_uploads/H1rvowdHR.png" alt="drawing" width="200"/> </center> <center> <em> Figure 1: JPEG image with a decreasing compression rate from left to right </em> </center> <p></p>

WhatsApp uses a form of JPEG compression as well. With their method, image file sizes can be reduced by 70% (or more, depending on the original image). The use of lossy compression enables users to keep sending pictures back and forth over WhatsApp with very low latency. However, this comes at the cost of image quality: photos downloaded from WhatsApp are noticeably of lower quality than before they were sent, as the resolution of the images is reduced.

### Super-resolution

There is a field of research that aims to combat this loss of quality due to downscaling and compression: super-resolution. Super-resolution (SR) is a class of methods that upscale images, increasing resolution and quality. The challenge in super-resolution is to predict the values of the newly introduced pixels. Straightforward methods such as nearest-neighbor interpolation have been around for a while (a minimal example is sketched below), but lately super-resolution has become a popular topic in machine learning as well.
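As a point of reference, such non-learned upscaling takes only a few lines. The sketch below uses Pillow; the file names are hypothetical placeholders, not part of our pipeline.

```python
from PIL import Image

# Hypothetical input: a compressed photo as downloaded from WhatsApp.
img = Image.open("whatsapp_photo.jpg")
width, height = img.size

# Non-learned upscaling: each new pixel is copied from its nearest neighbor,
# or computed as a weighted average of surrounding pixels (bicubic).
nearest = img.resize((2 * width, 2 * height), Image.Resampling.NEAREST)
bicubic = img.resize((2 * width, 2 * height), Image.Resampling.BICUBIC)

nearest.save("upscaled_nearest.png")
bicubic.save("upscaled_bicubic.png")
```

Both resampling modes only look at existing pixel values around each new pixel, which is exactly the limitation that learned super-resolution tries to overcome.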
A recent [review paper](https://www.sciencedirect.com/science/article/pii/S1566253521001792) on SR methods lists possible applications, including intelligent surveillance, medical imaging, segmentation, detection, and motion tracking. Image decompression, however, is not included in this list. Essentially all state-of-the-art models are trained on standard image libraries such as ImageNet. To construct training data, high-quality images are downsampled by simply deleting pixels, after which the SR models predict the values of the deleted pixels. SR for compressed images is not well studied, and perhaps therefore not actively used in everyday life.

### Proposed solution

In this blog post, we present WASRGAN (WhatsApp-trained Super-Resolution Generative Adversarial Network), a super-resolution deep learning model trained to upscale images that have been compressed by WhatsApp. WASRGAN is based on SRGAN, a state-of-the-art SR model that learns the fine details of high-resolution images and generates an upscaled, high-quality image from a compressed, low-resolution input. We aim to improve SR performance on WhatsApp compression by finetuning a pre-trained model on a new dataset of images that were sent over WhatsApp.

## Related work

### Interpolation

Increasing the resolution of images is not a new problem. The simplest way to do it is through interpolation.
Interpolation estimates the value of new data points based on the values of surrounding known points. In computer vision, bilinear interpolation estimates the value of a pixel from the surrounding pixels along both the x and the y axis. Given a point $(x, y)$ within the unit square defined by $(x_1, y_1)$, $(x_2, y_1)$, $(x_1, y_2)$, and $(x_2, y_2)$ with function values $Q_{11}, Q_{21}, Q_{12},$ and $Q_{22}$ at the corners, the bilinear interpolation formula is:

$$
f(x, y) = \frac{1}{(x_2 - x_1)(y_2 - y_1)} \left( Q_{11}(x_2 - x)(y_2 - y) + Q_{21}(x - x_1)(y_2 - y) + Q_{12}(x_2 - x)(y - y_1) + Q_{22}(x - x_1)(y - y_1) \right)
$$

Where:

- $x_1 \le x \le x_2$
- $y_1 \le y \le y_2$
- $Q_{11} = f(x_1, y_1)$
- $Q_{21} = f(x_2, y_1)$
- $Q_{12} = f(x_1, y_2)$
- $Q_{22} = f(x_2, y_2)$

Bilinear interpolation is a simple and effective method to increase the resolution of existing images. Larger and more complex filters can also be used to calculate the values of new pixels; bicubic interpolation, for example, uses the surrounding 16 pixels. However, determining pixel values purely from surrounding pixels has inherent limitations. Because interpolation averages the values around each new pixel, the resulting image has smooth color transitions, but it fails to reproduce hard edges, since hard edges have pixel values that are very dissimilar from their surroundings. As a result, images upscaled using interpolation have overly smooth textures. The extent of the blurriness depends on the amount of interpolation and on the degree to which the original image contained hard edges.

### Neural nets

Over the years, several deep-learning approaches have been applied in the super-resolution field, and convolutional neural networks in particular have proven to work well. [Dong et al. (2016)](https://pubmed.ncbi.nlm.nih.gov/26761735/) passed the output of a bicubic interpolation through a three-layer convolutional network. Further research showed that having the network learn the upscaling filters directly improves performance significantly. Subsequent research has iteratively improved the state-of-the-art architecture for SR, showing that the use of [batch normalization](https://arxiv.org/abs/1502.03167), [residual blocks](https://ieeexplore.ieee.org/document/7780459) and [skip connections](https://arxiv.org/abs/1603.05027) is beneficial for both performance and speed.

## Methods

### SRGAN

At the core of our experiments lies the model introduced in 2017 as [SRGAN](https://arxiv.org/pdf/1609.04802v5), a generative adversarial network for image super-resolution. Despite being published over seven years ago, SRGAN is still considered one of the state-of-the-art methods in the field of single-image super-resolution. With 12,858 citations, the original model has spawned many tweaks and adaptations tailored to different applications, but we did not find previous work on using SRGAN for WhatsApp-specific image decompression.

The core idea of SRGAN is to use an adversarial discriminative network to determine the loss for the generative network. Other loss functions, such as the pixel-wise MSE, have difficulty recovering high-frequency textures: the generated high-resolution image tends to be somewhat blurry, because blurriness is exactly what minimizes the MSE. Mathieu et al. and Denton et al. proposed using an adversarial network that classifies generated images as real or fake, forcing the generative model not to rely on smoothing but instead to attempt to generate a realistic-looking image.
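To make this difference concrete, the following sketch (PyTorch, with random placeholder tensors standing in for real model outputs) contrasts a plain pixel-wise MSE objective with the adversarial generator objective that SRGAN adds on top of its content loss:

```python
import torch
import torch.nn.functional as F

# Placeholder tensors standing in for a generated SR image, its HR ground
# truth, and the discriminator's probability that the SR image is real.
sr = torch.rand(1, 3, 256, 256)
hr = torch.rand(1, 3, 256, 256)
d_fake = torch.rand(1, 1)

# Pixel-wise MSE: minimized by predicting "averaged" pixel values,
# which is why purely MSE-trained models tend to produce blurry textures.
mse_loss = F.mse_loss(sr, hr)

# Adversarial generator loss: minimized when the discriminator assigns a
# high probability of being real to the generated image.
adversarial_loss = -torch.log(d_fake + 1e-8).mean()
```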
#### Generator

Like any GAN, SRGAN consists of a generator and a discriminator (Figure 2). The generator network contains a series of residual convolutional blocks (rcbs) that sequentially apply the following operations to the input low-resolution image:

1. Convolution (3x3 kernel, 64 feature maps, stride=1, padding=1, no bias)
2. 2D batch normalization
3. Parametric Rectified Linear Unit (PReLU)
4. Convolution (3x3 kernel, 64 feature maps, stride=1, padding=1, no bias)
5. 2D batch normalization
6. Element-wise summation with the rcb input (skip connection)

These six operations form one rcb, and the generator can contain multiple rcbs. After passing through the rcbs, one more convolution, batch normalization and skip connection combination is applied. Next, the input is upsampled to the final resolution. For the upscaling, upsampling blocks are used, composed of the following operations:

1. Convolution (3x3 kernel, 256 feature maps, stride=1, padding=1)
2. PixelShuffle (upscale_factor=2)
3. PReLU

The number of upsampling blocks determines the scaling factor by which the input resolution is multiplied in the output image. The full generator pipeline can thus be summarized as follows: a low-resolution input image passes through one convolution and activation, then through the rcbs, followed by the upsampling blocks, and finally through one last convolution that produces a three-channel output image with a resolution that is a power-of-two multiple of the input resolution.

#### Discriminator

The discriminator consists of two main parts: the first part acts as a feature extractor for the input image, and the last part functions as a linear classification head. The feature extractor contains a chain of convolution-batch norm-activation operations with an increasing number of feature maps. The activation function used in the discriminator is a Leaky ReLU, as opposed to the PReLU used in the generator. The classification head consists of a fully connected layer mapping the extracted feature maps to 1024 output neurons, a Leaky ReLU, and a final fully connected layer with a single output. This output is passed through a sigmoid to obtain the probability that the input image is an original high-resolution image rather than a super-resolution output from the generator.

<center> <img src="https://hackmd.io/_uploads/ByuKWTrBC.png" alt="drawing" width="700"/> </center> <center> <em> Figure 2: Architecture of the generator and discriminator networks, with the kernel size (k), number of feature maps (n) and stride (s) indicated for each convolutional layer. </em> </center> <p></p>

#### Loss function

The loss function consists of an adversarial loss and a content loss. The adversarial loss forces the generator to produce images that lie on the natural image manifold. SRGAN defines the generator's adversarial loss as follows:

$$
l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D} \left( G_{\theta_G} (I^{LR}) \right)
$$

Minimizing this loss maximizes the log probability that the discriminator classifies the generated images as real high-resolution images, i.e., it rewards the generator for fooling the discriminator. Note that this formulation differs slightly from the one suggested by [Goodfellow et al. (2014)](https://arxiv.org/abs/1406.2661), with the goal of improving gradient behavior. The content loss is not defined as the pixel-wise MSE loss, as is frequently done in SR.
Instead, a VGG loss is defined as:

$$
l^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j} (I^{HR})_{x,y} - \phi_{i,j} \left( G_{\theta_G} (I^{LR}) \right)_{x,y} \right)^2
$$

where $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the feature maps within the pre-trained VGG network, and $\phi_{i,j}$ is the feature map after the j-th convolution layer and before the i-th pooling layer. The perceptual loss is the combination of these two losses, weighted by a constant:

$$
l^{SR} = l^{SR}_{VGG} + 10^{-3} \, l^{SR}_{Gen}
$$

### Data acquisition

To train and test super-resolution for WhatsApp-compressed images, we need an appropriate dataset. Since there are no off-the-shelf datasets available with the exact compression and downsampling used by WhatsApp, we constructed our own data from an existing image library. Our starting point was the [Flickr2K](https://www.kaggle.com/datasets/daehoyang/flickr2k) dataset, which contains 2650 high-resolution images depicting nature, people, cityscapes, animals, food, and more. Flickr2K therefore gives a realistic representation of the typical images people might want to share via WhatsApp. We used 500 of these images, sent them over WhatsApp, and downloaded them again on the receiving end. In this way, exactly the right compression (and any pre- or post-processing, if present) is applied to the images, without us having to make assumptions about what WhatsApp does to images beyond JPEG compression.

By design of the upsampling blocks in the generator, SRGAN can only scale up the image resolution by a power of two. During WhatsApp compression, however, the resolution is scaled down by a factor of approximately 0.78. To be able to accurately compare the super-resolution output to the original Flickr2K images, we therefore downscaled the compressed images further using linear interpolation, to half the resolution of the original high-resolution Flickr2K images.

## Experiments

With the newly constructed dataset, we ran SRGAN on the low-resolution images and used the original Flickr2K images as ground truth. We compared the performance of the pre-trained SRGAN, our WhatsApp-finetuned version WASRGAN, and standard linear interpolation on the task of upscaling the downscaled WhatsApp images. As a baseline, we also tested the same methods on images downscaled with bicubic downscaling instead of WhatsApp. From both sets of downscaled images, we created patches of size 128x128, each matching a 256x256 patch of the ground-truth image. The model was both trained and tested on these patches, as proposed in the original paper.

### Finetuning

SRGAN comes with pre-trained weights for different upscaling factors. We chose the available factor closest to the inverse of the empirically determined WhatsApp downscaling factor (0.78), which resulted in an upscaling factor of 2. These weights were obtained by training SRGAN with 16 rcbs and 1 upscaling block for 18 epochs on 350,000 images from ImageNet. We attempted to finetune these weights by continuing training on our dataset of 500 WhatsApp images, with a train-test split of 450-50. Within the limitations of the available computational resources, we managed to complete 20 epochs of finetuning.
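For concreteness, here is a minimal sketch of what one finetuning epoch looks like under our setup. The `Generator`, `Discriminator`, dataset class, and checkpoint file name are hypothetical placeholders for the SRGAN implementation and data pipeline we used; the loss composition follows the perceptual loss defined in the Methods section (input normalization for the VGG network is omitted for brevity).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision.models import vgg19

# Hypothetical imports: an SRGAN implementation and a dataset yielding
# (128x128 LR patch, 256x256 HR patch) tensor pairs.
from srgan_pytorch import Generator, Discriminator
from whatsapp_data import WhatsAppPatchDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

netG = Generator().to(device)
netD = Discriminator().to(device)
# Start from the 2x pre-trained weights (hypothetical file name).
netG.load_state_dict(torch.load("srgan_x2_imagenet.pth", map_location=device))

# Truncated VGG19 used as a fixed feature extractor for the content loss.
vgg = vgg19(weights="IMAGENET1K_V1").features[:36].to(device).eval()
for p in vgg.parameters():
    p.requires_grad = False

mse = nn.MSELoss()
bce = nn.BCELoss()
optG = torch.optim.Adam(netG.parameters(), lr=1e-4)
optD = torch.optim.Adam(netD.parameters(), lr=1e-4)
loader = DataLoader(WhatsAppPatchDataset(split="train"), batch_size=16, shuffle=True)

for epoch in range(20):
    for lr_patch, hr_patch in loader:
        lr_patch, hr_patch = lr_patch.to(device), hr_patch.to(device)

        # Discriminator update: real HR patches vs. generated SR patches.
        sr_patch = netG(lr_patch)
        d_real = netD(hr_patch)
        d_fake = netD(sr_patch.detach())
        lossD = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
        optD.zero_grad()
        lossD.backward()
        optD.step()

        # Generator update: VGG content loss + 1e-3 * adversarial loss.
        content_loss = mse(vgg(sr_patch), vgg(hr_patch))
        adversarial_loss = -torch.log(netD(sr_patch) + 1e-8).mean()
        lossG = content_loss + 1e-3 * adversarial_loss
        optG.zero_grad()
        lossG.backward()
        optG.step()
```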
### Metrics

To assess the performance of the finetuned model and compare it to the original pre-trained version, we looked at two image quality metrics: the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM).

#### PSNR

The PSNR is used as a quality measure between an original image and a compressed or reconstructed version. The original image is considered the signal, and the mean-squared error (MSE) between signal and reconstruction is considered the noise. A large PSNR therefore means a strong signal with little noise and is preferable. PSNR is expressed in terms of the MSE as follows:

<center> <img src="https://hackmd.io/_uploads/ryFsztwH0.png" alt="drawing" width="200"/> </center> <p></p>

where _R_ represents the dynamic range of the image, i.e. the difference between the maximal and minimal pixel values.

#### SSIM

The second metric we used to assess model performance is the SSIM. SSIM separates the structural information in an image into three components:

- Luminance
- Contrast
- Structure

Given two images _x_ and _y_, we can measure the difference in each component. Luminance _l(x, y)_ is measured using the mean intensities of the images, contrast _c(x, y)_ is measured through the standard deviations, and structure _s(x, y)_ is determined by the correlation between the images:

<center> <img src="https://hackmd.io/_uploads/HknOAYwSC.png" alt="drawing" width="200"/><img src="https://hackmd.io/_uploads/H1DJLrOBA.png" alt="drawing" width="100"/><img src="https://hackmd.io/_uploads/SJjSSS_BR.png" alt="drawing" width="100"/> </center> <p></p>

where _c<sub>1</sub>_, _c<sub>2</sub>_ and _c<sub>3</sub>_ are determined by the dynamic range _L_ of the images and the constants _k<sub>1</sub>_=0.01 and _k<sub>2</sub>_=0.03. Multiplying the three components gives the SSIM between the two images:

<center> <img src="https://hackmd.io/_uploads/SJhdAtwHA.png" alt="drawing" width="320"/> </center> <p></p>

The maximal structural similarity between two images is 1, which is the case for two identical images. Therefore, the closer the SSIM is to 1, the better the super-resolution image approximates the ground truth.

## Results

First, we show the results before any finetuning was done. Table 1 and Figure 3 show the performance of both SRGAN and interpolation on the WhatsApp-downscaled and the bicubic-downscaled images.

<table> <thead> <tr> <th></th> <th colspan="2">SRGAN</th> <th colspan="2">Interpolation</th> </tr> <tr> <th></th> <th>WhatsApp</th> <th>Bicubic</th> <th>WhatsApp</th> <th>Bicubic</th> </tr> </thead> <tbody> <tr> <th>PSNR</th> <td>36.6104</td> <td>37.5986</td> <td>34.8772</td> <td>35.1019</td> </tr> <tr> <th>SSIM</th> <td>0.9367</td> <td>0.9460</td> <td>0.9085</td> <td>0.9116</td> </tr> </tbody> </table>

<center> <em> Table 1: Performance of the original SRGAN and interpolation on the WhatsApp dataset and the bicubic dataset. </em> <p></p> </center>

The table shows that the state-of-the-art SRGAN indeed performs better than basic interpolation. This is also clearly visible in the images below: the overly smooth edges that interpolation is known for are apparent in the interpolated patches.

![Comparison of upscaled patches](https://hackmd.io/_uploads/HkVSzFnHA.png)

<center> <em> Figure 3: Comparison of the same patch compressed in two ways and upsampled in two ways. </em> <p></p> </center>
The fact that SRGAN performs worse on the WhatsApp images than on the bicubic images further motivates finetuning on those images, to create a WASRGAN that is especially suited to upscaling WhatsApp images. However, since the WhatsApp images also score lower with interpolation, the downscaling performed for this dataset may simply have lost more information per patch.

After finetuning for 20 epochs on the patches created from the training set of 450 images sent through WhatsApp, the new weights led to the results in Table 2 and Figure 4.

<table> <thead> <tr> <th></th> <th colspan="2">WASRGAN</th> </tr> <tr> <th></th> <th>WhatsApp</th> <th>Bicubic</th> </tr> </thead> <tbody> <tr> <th>PSNR</th> <td>31.9821</td> <td>32.0296</td> </tr> <tr> <th>SSIM</th> <td>0.9009</td> <td>0.8975</td> </tr> </tbody> </table>

<center> <em> Table 2: Performance of WASRGAN on the WhatsApp dataset and the bicubic dataset. </em> <p></p> </center>

The main conclusion from these results is that finetuning actually had a negative effect on performance in terms of both PSNR and SSIM. There could be multiple reasons for this, not all of which could be explored due to the high computational demands. Close inspection of the images shows, for example, that the colors shifted after finetuning, making the output deviate more from the original than it did before finetuning.

![Comparison of SRGAN and WASRGAN output](https://hackmd.io/_uploads/rJJ6EYhSA.png)

<center> <em> Figure 4: Comparison of the same patch compressed through WhatsApp and upsampled with SRGAN and WASRGAN. </em> </center>

## Discussion

Although our WASRGAN did not turn out to be the most suitable model for super-resolution on WhatsApp images, there is still a lot of potential for further research. To effectively use super-resolution for WhatsApp images, multiple directions could be explored.

First of all, it could be beneficial to spend more time tuning the hyperparameters to facilitate better finetuning on WhatsApp images. This could yield an SRGAN that performs at least as well on WhatsApp images as it currently does on bicubically downscaled images. During the span of this project, there was not enough time to explore many different hyperparameter settings.

Secondly, a lot of information in the WhatsApp dataset may have been lost when downscaling from approximately 0.78 times the original resolution (the WhatsApp output) to 0.5 times (the required SRGAN input). This could explain why upsampling from bicubically downsampled images performed better in nearly all of our tests. It could thus be beneficial to design a super-resolution network that can handle WhatsApp's non-power-of-two downscaling factor. However, it could also be that the JPEG compression WhatsApp applies is itself responsible for the information loss.

In any case, it would be interesting for WhatsApp to explore whether they could quickly improve the resolution of sent images using SRGAN or other deep learning models.