# DCGANs with self-attention for image denoising
This blog was written by TU Delft students of group 30 as part of the course Seminar Computer Vision (CS4245).
Authors:
- Dan Sochirca, 5295580, D.Sochirca@student.tudelft.nl
- Petter Reijalt, 5295289, P.J.W.Reijalt@student.tudelft.nl
Model based on paper: https://stanford.edu/class/ee367/Winter2017/yan_wang_ee367_win17_report.pdf
Code based on repository: https://github.com/david-gpu/srez
Our code is available at: https://github.com/DSochirca/dcgan-denoising
Our notebook is also available on Kaggle: https://www.kaggle.com/code/dansochirca/notebook74c6913f2b/
## Introduction
Noise reduction in images is a critical challenge across various domains, from improving photographic quality to enhancing the clarity of medical scans. Traditional noise-reduction techniques can blur essential details, or they may require manual tuning to achieve the desired results. GAN-based techniques (Generative Adversarial Networks), on the other hand, have the advantage of producing results that humans find more visually appealing (more realistic, with fewer artifacts) than those of other techniques, even though their PSNR (peak signal-to-noise ratio) scores are lower.
Currently there is insufficient research on the performance of DCGANs (Deep Convolutional GANs) for image denoising. The emergence of techniques such as StyleGAN and transformers has raised questions about the continued relevance and effectiveness of these models. However, DCGANs benefit from being more lightweight and computationally cheaper than state-of-the-art deep learning techniques. Additionally, we believe leveraging self-attention within DCGANs has the potential to improve the model's ability to capture long-range relationships and improve feature representation. An attention mechanism (AM) tries to improve performance by deciding which elements are most important to attend to; it does this by assigning weights to different parts of the input data. This has successfully been applied to other domains, such as video and, most notably, natural language processing [^12]. Self-attention has previously been shown to increase the accuracy of DCGANs for bearing fault diagnosis, where it was also noted to relieve the demand for ever-larger datasets [^11]. Similar results were found for the fault diagnosis of planetary gearboxes, where DCGANs utilizing self-attention performed better than those that did not [^12].
In this blog post, we **aim** to investigate the denoising performance of DCGANs (Deep Convolutional Generative Adversarial Networks) enhanced with self-attention mechanisms. We adopt the model from the paper *DCGANs for image super-resolution, denoising and deblurring* [^1], a DCGAN already tailored for image denoising. The paper doesn't provide links to any code; it only mentions that the model is based on a different repository [^8]. We translate that implementation to PyTorch, make the necessary adjustments to its architecture, enhance it with self-attention mechanisms, and measure its performance.
Our **experimental questions** are:
1. Does the proposed DCGAN denoising model with self-attention reduce noise effectively for different noise patterns, while preserving important image details compared to the adapted DCGAN method?
2. How does the proposed model generalize to unseen noise patterns?
3. What is the computational efficiency of the proposed model compared to the baseline, in terms of inference time and memory usage?
## Background
### GANs
Increased computational power and the advent of bigger datasets have improved the ability of deep learning methods to do image processing. Generative adversarial networks (GANs) were created to learn deep representations without needing extensively annotated training data. GANs learn by implicitly matching the distribution of a candidate model to the distribution of the real data. A so-called _generator_ tries to 'fool' the _discriminator_ by emulating a sample from the real data set, while the _discriminator_ tries to differentiate between samples produced by the _generator_ and samples drawn from the real data distribution. Early GANs made use of fully connected neural networks for both the generator and the discriminator. This naturally evolved into convolutional GANs, as Convolutional Neural Networks (CNNs) are well suited to image data [^2]. The result was Deep Convolutional GANs (DCGANs), which use strided and fractionally-strided convolutions. DCGANs have been successfully used to build good image representations [^3]. In this blog post, we leverage these capabilities in order to denoise images.
GANs can be formulated as a minimax problem:
$$\min_G \ \max_D f(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
This means the discriminator must assign high probabilities to real data $x$ and low probabilities to the generated data $G(z)$. The generator, in turn, tries to minimize $\log(1-D(G(z)))$ [^4].
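To make the minimax objective concrete, below is a minimal PyTorch sketch of one alternating update. It assumes a generator `G`, a discriminator `D` that outputs probabilities in $(0,1)$, and their optimizers are defined elsewhere; the function name and the small `eps` for numerical stability are our own choices, not part of the original formulation.

```python
import torch

def gan_step(G, D, real, opt_G, opt_D, z_dim=100, eps=1e-8):
    """One alternating update of the minimax objective above."""
    z = torch.randn(real.size(0), z_dim, device=real.device)

    # Discriminator ascends E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e. we descend its negation.
    fake = G(z).detach()  # don't backprop into G on this step
    d_loss = -(torch.log(D(real) + eps).mean()
               + torch.log(1 - D(fake) + eps).mean())
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator descends E[log(1 - D(G(z)))].
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```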
### DCGANs
DCGANs are a class of neural networks that follow certain guidelines for their CNN architecture. The result is a network that, when paired with a GAN, trains stably across various datasets. The guidelines that aid stable training are as follows [^3]:
- Replace pooling layers: use strided convolutions in the discriminator and fractionally-strided convolutions in the generator.
- Use batch normalization in both the generator and the discriminator.
- Remove fully connected hidden layers.
- In the generator, use ReLU everywhere except for the output layer, which uses Tanh.
- In the discriminator, use LeakyReLU everywhere.
Because CNNs emphasize local features, this architecture fits the image denoising task well.
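As an illustration, hypothetical PyTorch building blocks following these guidelines might look as follows; the channel counts, kernel size, and LeakyReLU slope are common choices, not values prescribed by the guidelines.

```python
import torch.nn as nn

def gen_block(in_ch, out_ch):
    # Generator block: fractionally-strided (transposed) convolution
    # instead of unpooling, followed by batchnorm and ReLU.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2,
                           padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def disc_block(in_ch, out_ch):
    # Discriminator block: strided convolution instead of pooling,
    # followed by batchnorm and LeakyReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2,
                  padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )
```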
### DCGANs for image denoising [^1]
Our project builds on the model proposed in the work of *Yan et al.* Their approach involved modifying the loss functions to improve the generator's and discriminator's performance. Additionally, their architecture incorporated deep residual layers, which enabled more effective handling of super-resolution, denoising and deblurring tasks.
#### Architecture
Below we depict the architecture of the adopted DCGAN model.

*Architecture for the generator and discriminator network, proposed by Yan et al.*
**Generator:** The generator consists of a series of transposed convolutional layers designed to upscale the (noisy) input image into a full-resolution restored image. Each transposed convolutional layer is typically followed by batch normalization and a ReLU activation function.
**Discriminator:** The discriminator features strided convolutional layers that progressively downsample the input image in order to distinguish between real and generated images. Each convolutional layer is usually followed by batch normalization and a LeakyReLU activation to provide non-linearity.
#### Loss functions
Certain modifications were made by *Yan et al.* to the loss function over a *vanilla* GAN, to better suit image restoration purposes [^1]. The loss function of the generator is as follows:
$$ l_G = 0.9 \cdot l_{content} + 0.1 \cdot l_{G,adv} $$
where $l_{content}$ is the loss between the generated image and the original one:
$$ l_{content} = \lVert I^{generated} - I^{original} \rVert $$
$l_{G, adv}$ is the adversarial loss:
$$ l_{G, adv} = \sum_{n=1}^{N} -\log D(G(I^{input})) $$
The discriminator loss is as follows:
$$ l_{D} = l_{D, adv} = \sum_{n=1}^{N} \left( \log D(G(I^{input})) + \log\left(1 - D(I^{original})\right) \right) $$
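Translated literally into PyTorch, these losses might look like the sketch below. The paper does not specify which norm $\lVert \cdot \rVert$ is used for $l_{content}$, so the L1 norm here is our assumption; averaging over the batch instead of summing is also our choice.

```python
import torch

def generator_loss(generated, original, d_on_generated, eps=1e-8):
    # l_content: pixel-wise norm between generated and clean image
    # (L1 is our assumption; the paper leaves the norm unspecified).
    l_content = torch.mean(torch.abs(generated - original))
    # l_{G,adv}: -log D(G(I_input)), averaged over the batch.
    l_adv = -torch.log(d_on_generated + eps).mean()
    return 0.9 * l_content + 0.1 * l_adv

def discriminator_loss(d_on_generated, d_on_original, eps=1e-8):
    # l_{D,adv} exactly as in the formula above, averaged over the batch.
    return (torch.log(d_on_generated + eps)
            + torch.log(1 - d_on_original + eps)).mean()
```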
### Self-attention [^9]
Self-attention is a mechanism that has revolutionized several fields of artificial intelligence. Initially popularized in NLP, it has found important applications in image processing and computer vision. Often described with concepts of queries, keys, and values, it can be implemented through a series of matrix operations. In our case, self-attention is applied to 2D data (images), which involves handling spatial relationships between different regions within the image.
Firstly, the input feature map is transformed into three reduced embedding components:
**Query** ($q_i$): Represents the feature currently being compared.
**Key** ($k_j$): Represents features against which the comparison is made.
**Value** ($v_j$): Represents the embedded 'essence' of the input at this position, which is weighted by the calculated attention scores and that contributes to the final output feature representation.
The output $z_i$ at dimension $i$ of the image embedding $z$ is then a sum of values weighted by **attention scores** $A_{ij}$:
$$ z_i = \sum_j A_{ij} \cdot v_j = \sum_j \text{softmax}_j(q_i \cdot k_j) \cdot v_j $$
Here, each output element $z_i$ is a weighted sum of all values $v_j$, with weights $A_{ij}$ determined by the attention scores derived from the dot products of queries and keys. With self-attention, the model can dynamically focus on relevant parts of the input.
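The equation above can be reproduced in a few lines. This toy example (the sizes are arbitrary) computes the attention matrix from queries and keys, then takes the weighted sum over values:

```python
import torch
import torch.nn.functional as F

N, d = 4, 8                      # 4 positions, 8-dim embeddings (arbitrary)
q = torch.randn(N, d)            # queries q_i
k = torch.randn(N, d)            # keys    k_j
v = torch.randn(N, d)            # values  v_j

A = F.softmax(q @ k.T, dim=-1)   # A_ij = softmax_j(q_i . k_j)
z = A @ v                        # z_i  = sum_j A_ij * v_j
print(A.sum(dim=-1))             # each row of A sums to 1
```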
## Methodology
The core modification in our methodology is the integration of self-attention mechanisms into both the generator and discriminator of the DCGAN. This change has the advantage of allowing the network to consider distant parts of the image in its calculations.
### Implementation details
We added **two self-attention layers** to both the generator and the discriminator. For the generator, we added one before the first residual layer and one after the last residual layer (see the architecture picture). We did the same for the discriminator: one before the first hidden block and one after the last. Our reasoning is as follows:
- Placing a layer before the first hidden layer/block helps the model start with a broader understanding of the global dependencies within the noisy image. This allows features to be refined in later layers with an awareness of the overall context.
- Placing a layer after the last hidden layer/block: since the first self-attention layer should already capture some global dependencies, a second one after the hidden layers might help the model make a final assessment with the *refined features*. Of course, should this layer turn out to be redundant, the model can always learn an identity-like self-attention that contributes nothing.
**Tweaking attention to reduce memory overhead:** We couldn't fit the full attention matrix into memory, as its size scales with $B \cdot (HW)^2$, where $B$ is the batch size and $H \times W$ the feature-map resolution. To reduce the memory overhead, we factorized self-attention by applying it separately along the rows and columns of the input feature maps; essentially, we broke the process into a vertical pass followed by a horizontal one. Within each pass, the mechanism remains classical self-attention.
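A minimal sketch of such a factorized layer is shown below. The 1×1-convolution projections, the channel-reduction factor, and the learnable residual gate `gamma` follow common practice (e.g. SAGAN-style attention) and are assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedSelfAttention2d(nn.Module):
    """Attend along columns (vertical pass), then along rows (horizontal
    pass), instead of forming one (HW x HW) attention matrix."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // reduction, 1)
        self.k = nn.Conv2d(channels, channels // reduction, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as identity

    def _attend(self, q, k, v):
        # q, k: (B*, L, C'), v: (B*, L, C); attention over the L axis.
        A = F.softmax(torch.bmm(q, k.transpose(1, 2)), dim=-1)
        return torch.bmm(A, v)

    def forward(self, x):
        B, C, H, W = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        Cr = q.shape[1]

        # Vertical pass: attention along H, independently per column.
        def cols(t, c): return t.permute(0, 3, 2, 1).reshape(B * W, H, c)
        y = self._attend(cols(q, Cr), cols(k, Cr), cols(v, C))
        y = y.reshape(B, W, H, C).permute(0, 3, 2, 1)  # back to (B, C, H, W)

        # Horizontal pass: attention along W, independently per row.
        q2, k2 = self.q(y), self.k(y)
        def rows(t, c): return t.permute(0, 2, 3, 1).reshape(B * H, W, c)
        z = self._attend(rows(q2, Cr), rows(k2, Cr), rows(y, C))
        z = z.reshape(B, H, W, C).permute(0, 3, 1, 2)

        return x + self.gamma * z  # residual connection
```

Each pass only ever materializes a $B \cdot W \cdot H^2$ (or $B \cdot H \cdot W^2$) attention tensor instead of $B \cdot (HW)^2$, which is what makes it fit in memory.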
### Experimental setup
**Dataset:** We used the CelebA [^10] dataset, a large-scale face-attributes dataset with over 200,000 celebrity images of size 178×218. This dataset provides a diverse set of real-world images, well suited for testing our model. In our experiments we resize the images to 128×128 resolution to keep training time manageable.
**The noise functions applied:**
*Gaussian Noise*: A very common real-world noise type in image processing tasks, often present in photographs taken under poor lighting conditions.
*Salt and Pepper Noise*: Characterized by sharp, random fluctuations; it simulates dead pixels or sensor malfunctions in cameras.
Each model was trained on images corrupted with a single noise pattern, applied at either a fixed intensity (0.05) or a variable intensity (ranging from 0.01 to 0.1). A sketch of how such noise can be applied is shown below.
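In the sketch that follows, the exact parameterization (e.g. whether "intensity" means the Gaussian standard deviation or the fraction of corrupted pixels, and clamping to $[0,1]$) is our assumption:

```python
import torch

def add_gaussian_noise(img, sigma=0.05):
    # img in [0, 1]; additive zero-mean Gaussian noise with std sigma.
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)

def add_salt_and_pepper(img, amount=0.05):
    # Corrupt a fraction `amount` of pixels: half to 0 (pepper), half to 1 (salt).
    mask = torch.rand_like(img)
    out = img.clone()
    out[mask < amount / 2] = 0.0
    out[(mask >= amount / 2) & (mask < amount)] = 1.0
    return out

# Variable-intensity training: sample the intensity per image from [0.01, 0.1].
sigma = torch.empty(1).uniform_(0.01, 0.1).item()
```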
**Model Configurations:** We tested the standard DCGAN model without self-attention as our baseline. The modified DCGAN model with self-attention layers incorporated into both the generator and discriminator served as our test model.
**Evaluation Metrics:**
*Peak Signal-to-Noise Ratio (PSNR):* A common metric used to assess the quality of the denoised images compared to the original noise-free images. Higher PSNR indicates better denoising performance.
*Visual Quality Assessment:* We also conducted qualitative evaluations through visual inspection, to confirm the preservation of details and overall image quality.
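For reference, PSNR is computed as $10 \log_{10}(\text{MAX}^2 / \text{MSE})$; a minimal sketch, assuming images scaled to $[0, 1]$:

```python
import torch

def psnr(denoised, original, max_val=1.0):
    # PSNR = 10 * log10(MAX^2 / MSE); higher is better.
    mse = torch.mean((denoised - original) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```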
**Testing Procedure:**
Each model was trained separately on the noisy datasets and evaluated after training to measure the denoising performance. The testing involved two phases:
*Intra-noise Generalization:* Here, models trained on a specific noise type and intensity were tested on the same noise type but different intensities.
*Cross-noise Generalization:* Models were also tested on noise types and intensities they were not explicitly trained on. This is important for understanding the model's generalization ability.
## Results
*1. Seen noise patterns*
Does the proposed DCGAN denoising model with self-attention reduce noise effectively for different noise patterns? In *Table 1* the average PSNR scores are shown for the different noise levels (0.01, 0.05, 0.1). As the noise level increases, the PSNR scores decrease across all experiments. The models that make use of self-attention show, in most cases, a slight improvement over their counterparts that do not. This is particularly noticeable at higher noise levels (0.05 and 0.1). We highlighted improvements in bold for better visibility.
<br/>
| Experiment | Train Noise Intensity | Avg. PSNR (0.01 noise) | Avg. PSNR (0.05 noise) | Avg. PSNR (0.1 noise) |
|---------------------------------|-----------------------|-----------------|-----------------|-----------------|
| Gaussian Noise (Fixed) | 0.05 | **34.94** | 32.80 | 28.02 |
| Gaussian Noise (Fixed, with attention) | 0.05 | 34.81 | **32.86** | **28.69** |
| Gaussian Noise (Variable) | 0.01-0.1 | **35.00** | 32.97 | 29.33 |
| Gaussian Noise (Variable, with attention) | 0.01-0.1 | 34.71 | **33.29** | **30.26** |
| Salt and Pepper Noise (Fixed) | 0.05 | 31.40 | 30.00 | 28.09 |
| Salt and Pepper Noise (Fixed, with attention) | 0.05 | **31.58** | **30.02** | **28.10** |
| Salt and Pepper Noise (Variable)| 0.01-0.1 | 31.64 | 30.07 | **28.37** |
| Salt and Pepper Noise (Variable, with attention) | 0.01-0.1 | **32.14** | **30.23** | 28.25 |
*Table 1. PSNR scores for the various experiments (for the evaluation the same noise type was used as in training). The 'with attention' rows are the self-attention models; bold marks the better score within each pair.*
Interestingly, models trained on variable noise levels tend to perform better across different test scenarios, suggesting better generalization to different noise levels. This effect is also somewhat more pronounced in models equipped with self-attention.
<br />
*2. Unseen noise patterns*
To find out whether our proposed model works well on unseen noise, we trained it on Gaussian noise and tested on salt-and-pepper noise, and vice versa. The relative results can be seen in *Table 2*. Negative values indicate a decrease in PSNR, suggesting a drop in image quality or increased error, while positive values indicate an improvement.
<br />
| Experiment | Train Noise Intensity | ΔPSNR (0.01 Noise) | ΔPSNR (0.05 Noise) | ΔPSNR (0.1 Noise) |
|-------|----------|---------|-----|------|
| Gaussian Noise (Fixed) | 0.05 | -0.11 | -4.87 | -6.35 |
| Gaussian Noise (Fixed, with att.) | 0.05 | -0.09 | -4.49 | -5.91 |
| Gaussian Noise (Variable) | 0.01-0.1 | +0.27 | -2.91 | -4.34 |
| Gaussian Noise (Variable, with att.) | 0.01-0.1 | -0.34 | -3.47 | -4.66 |
| S&P (Fixed) | 0.05 | -2.21 | -1.33 | +0.64 |
| S&P (Fixed, with att.) | 0.05 | -3.02 | -2.07 | -0.15 |
| S&P (Variable) | 0.01-0.1 | -1.91 | -0.82 | +0.62 |
| S&P (Variable, with att. ) | 0.01-0.1 | -2.02 | -1.62 | -1.02 |
*Table 2. The difference in the PSNR score, computed as:* PSNR on unseen noise type $-$ PSNR on the trained noise type.
Across all experiments, the model typically performs worse on unseen noise types. This is expected, because training optimized the model for specific noise patterns. Interestingly, the salt-and-pepper-trained models overall seem to generalize better, in some cases even showing a slight increase in performance (+0.6).
**Effect of Self-Attention.** The addition of self-attention does not consistently improve performance on unseen noise types. While in some cases (like Gaussian Noise Fixed), the drop in PSNR is slightly less severe with self-attention (-4.49 vs. -4.87 at 0.05 noise intensity), in others, the self-attention models experience a larger drop or more significant degradation (S&P Fixed for example, where the decrease goes from -2.21 to -3.02 at 0.01 noise).
**Variable Noise Training.** Models trained on variable noise intensities generally show less degradation in performance. For instance, the Gaussian Noise (Variable) model shows a smaller decrease at 0.05 noise intensity compared to the (Fixed) variant (-2.91 vs. -4.87).
<br />
*3. Computational efficiency*
To assess the computational efficiency of our proposed model, we looked at inference time and peak GPU memory usage; the results are shown in *Table 3*. Self-attention roughly doubles peak GPU memory usage and slightly increases the runtime per sample.
| Experiment | Peak GPU memory (bytes) | Runtime per sample (s) |
|-------|----------|---------|
| Gaussian Noise (Fixed) | 1176872976 | 0.06258 |
| Gaussian Noise (Fixed, with att.) | 2098823168 | 0.07304 |
| Gaussian Noise (Variable) | 1176872976 | 0.06677 |
| Gaussian Noise (Variable, with att.) | 2098823168 | 0.06872 |
| S&P Noise (Fixed) | 1176872976 | 0.06869 |
| S&P Noise (Fixed, with att.) | 2098823168 | 0.07341 |
| S&P Noise (Variable) | 1176872976 | 0.06454 |
| S&P Noise (Variable, with att.) | 2098823168 | 0.06917 |
*Table 3. Peak GPU memory usage (in bytes; roughly 1.10 GiB without attention versus 1.95 GiB with it) and inference runtime (in seconds) per sample for the various experiments.*
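Measurements of this kind can be obtained with PyTorch's built-in counters. A sketch of such a benchmark is below; the warm-up count, iteration count, and synchronization points are our choices, not the exact procedure used:

```python
import time
import torch

@torch.no_grad()
def measure(model, batch, device="cuda", warmup=3, iters=50):
    """Rough per-sample inference time and peak GPU memory for one model."""
    model.eval().to(device)
    batch = batch.to(device)
    torch.cuda.reset_peak_memory_stats(device)

    for _ in range(warmup):              # warm up kernels / allocator
        model(batch)
    torch.cuda.synchronize(device)

    t0 = time.perf_counter()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize(device)       # wait for queued GPU work
    per_sample = (time.perf_counter() - t0) / (iters * batch.size(0))

    return torch.cuda.max_memory_allocated(device), per_sample
```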
*4. Visual inspection*

*Figure 1. Model denoising on images with Gaussian noise.*

*Figure 2. Model denoising on images with salt-and-pepper noise.*
**Key insights from the inspection:**
Effect of attention:
- The Gaussian Noise Attention model: the background appears smoother compared with the baseline, which is particularly noticeable in photos with bokeh effects, although the difference is subtle.
- Salt & Pepper Noise Attention model: Similar to the Gaussian model with attention, there is a subtle smoothing effect.
Model generalization:
- Salt & Pepper model generalization to Gaussian noise: The S&P models seem to handle Gaussian noise quite well but do not completely eliminate it. The model with attention tends to accentuate the roughness of object edges.
- Gaussian Noise model generalization to Salt & Pepper noise: regardless of whether they have attention, the Gaussian-trained models perform similarly and show limited effectiveness, with noticeable noisy 'sprinkles' remaining in the images.
We recommend zooming into the figures for better visibility. In the following figure we present the discussed effect of attention, with background appearing smoother in some cases.

*Figure 3. Denoising examples in which the attention model has a slightly smoother output.*
## Discussion / Conclusion
### Comparability with original paper's experiments
Our experiments build directly on the work of Yan et al., implementing their proposed DCGAN architecture for image denoising. It must be noted, however, that our setup differs slightly: we used an image resolution of 128×128 compared to their 64×64, so the noise applied to our images is less harsh and the noisy images preserve more detail. Despite these differences, our experiments achieved comparable PSNR scores, with an average close to 30 in our baseline versus the 26.2 reported in their study.
### Does self-attention improve performance?
Self-attention shows marginally better PSNR scores across all the models, especially at higher noise levels. This suggests a modest potential in more challenging denoising conditions. Visually, the benefits of self-attention are most apparent on even surfaces, where it tends to smooth out noise better than the baseline models in some cases, although the effect is subtle.
Despite the gains, the overall impact of self-attention is small. This suggests that while it does contribute positively, its full potential may not be realized within the current architecture. This may be due to limitations in how self-attention is integrated, or in the fundamental architecture itself, which may not be flexible enough for this task. Future work might explore alternative integration strategies or modifications to the architecture that could better leverage the strengths of self-attention.
### Model generalizability
The models trained on salt-and-pepper noise generalized better to the other kind of noise, i.e. Gaussian noise, whereas the reverse did not hold. This could be attributed to the more irregular nature of salt-and-pepper noise, which makes models trained on it more robust to other types of noise. Gaussian noise, in contrast, follows a more predictable pattern, which makes a model trained on it harder to apply to more irregular types of noise.
The addition of self-attention did not seem to improve the model's generalizability; in our experiments only one case proved to be an exception to this: the model trained on 0.05 Gaussian noise. Even then, the improvement was small.
### Computational and memory overhead
Employing self-attention increases computational and memory overhead. Being quadratic in the number of spatial positions, it comes at the cost of increased GPU memory usage: our implementation's peak memory usage was nearly twice that of the implementation without self-attention. A small increase in runtime was also noticeable, roughly 0.002 to 0.01 seconds per sample depending on the kind of noise (see Table 3). Whether the extra computational overhead is worthwhile depends on the application.
<br>
In conclusion, image denoising with self-attention-enhanced DCGANs is a promising but complex direction. While self-attention offers some benefits in handling structured noise and improving visual detail, it comes at a cost in generalization and computational efficiency. As the field of deep learning continues to evolve, the iterative refinement of these models in future work remains important.
[^1]: [DCGANs for image super-resolution, denoising and debluring](https://stanford.edu/class/ee367/Winter2017/yan_wang_ee367_win17_report.pdf)
[^2]: [Generative Adversarial Networks: An Overview](https://ieeexplore.ieee.org/abstract/document/8253599)
[^3]: [Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/abs/1511.06434)
[^4]: [Generative Adversarial Networks: Introduction and Outlook](https://ieeexplore.ieee.org/abstract/document/8039016)
[^5]: [Image Denoising using New Adaptive Based Median Filters](https://arxiv.org/abs/1410.2175)
[^6]: [Nonlocal Means-Based Speckle Filtering for Ultrasound Images](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4982678)
[^7]: [Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network](https://openaccess.thecvf.com/content_cvpr_2017/html/Ledig_Photo-Realistic_Single_Image_CVPR_2017_paper.html)
[^8]: [Image super-resolution through deep learning.](https://github.com/david-gpu/srez)
[^9]: [Attention is all you need](https://arxiv.org/abs/1706.03762)
[^10]: [Large-scale CelebFaces Attributes (CelebA) Dataset](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)
[^11]: [Fine-tuning transfer learning based on DCGAN integrated with self-attention and spectral normalization for bearing fault diagnosis](https://www.sciencedirect.com/science/article/abs/pii/S0263224122016189)
[^12]: [Application of self-attention conditional deep convolutional generative adversarial networks in the fault diagnosis of planetary gearboxes](https://journals.sagepub.com/doi/full/10.1177/1748006X221147784?casa_token=e-yUe9aX98MAAAAA%3AL7iPmFZUD7TtFuHDKrOwkHA2fAwUIlhFWroJ8V2tIX51iIs1AWXyLa-fvKzuBXADvA38Xmo4y-8)