We have incorporated additional experiments for the **image super-resolution task** and updated the paper. Quantitative results for the **x2** and **x8** super-resolution tasks have been added in **Sec. K** (Experiments on image super-resolution, Tab. 7 and Tab. 8 in the Appendix). We have also performed a human evaluation for this task in **Sec. K.2** (human evaluation, Fig. 26 in the Appendix). **All changes are highlighted in blue in the updated version of the paper.**
Dear Reviewer,
We hope to hear your feedback after our responses. Please let us know if our replies have sufficiently addressed your concerns and whether you have any further suggestions for improvement. Thank you for your time and consideration.
We thank the AC for initiating the discussion.
As summarized above, we have added quantitative results (Tab. 7 and Tab. 8, Appendix) and a human evaluation (Fig. 26, Appendix) for the **image super-resolution task** in **Sec. K**, with all changes highlighted in blue in the updated version of the paper. We hope to hear from the reviewers soon.
#########################################################
After Rebuttal
We thank the reviewer for their insightful comments. The response is provided below.
**Q1. ['unnatural' is too general]**
Ans: We are not addressing any specific type of artifact; rather, we aim to improve the overall quality of the generated images through the lens of kurtosis-based statistical properties.
Similarly, for validation we rely on standard metrics such as FID, which measures the similarity between the distribution of the generated images and that of the real images, and the MUSIQ quality score, rather than on selectively chosen properties.
We agree that ''naturalness'' is a generic term for image quality; accordingly, image quality metrics such as SSIM [8] and BRISQUE [9] jointly consider luminance, chrominance, contrast, structure, and color. We are not targeting any particular kind of artifact, nor are we claiming that specific artifacts can be mitigated by the KC loss; such a claim would be both narrower and infeasible, given that evaluating ''naturalness'' or image quality is itself a complex problem. Instead, we exploit the KC property to propose the KC loss, which can be added to any standard diffusion pipeline as a plug-and-play component in order to preserve ''naturalness'' and improve the quality of the generated images. The KC loss is thus the cause, and the effect is higher-quality generated images.
We experimentally validate this through quantitative comparisons (FID and MUSIQ scores in Tab. 1, 2, 3 of the main paper), qualitative analysis (Fig. 6, 8, 9, 10 of the main paper), and user studies (Sec. Human evaluation) for three diverse tasks with multiple approaches. Empirically, adding the KC loss yields fewer artifacts along edges (Fig. 18, Appendix), improved contrast and reflections (Fig. 19, Appendix), better color consistency (Fig. 20, Appendix), smoother texture (Fig. 21, Appendix), better preservation of eye structure (Fig. 22, Appendix), and better eye details and skin smoothness (Fig. 23, 24, 25, Appendix). We will clarify this in the updated version of the paper.
**Q2. [But why 'natural images have more concentrated kurtosis values'?]**
Ans: The primary motivation of the paper stems from the kurtosis concentration property of natural images, which is well established both theoretically and experimentally [1, 2, 3].
Natural images can be modeled as zero-mean Gaussian scale mixture (GSM) vectors [2, 4, 7, 10, 11, 12], and under this assumption it can be proved that the projection kurtosis of band-pass-filtered versions of an image is constant (Lemma 1, main paper). We verify this property experimentally on large datasets, e.g., the DreamBooth dataset (Fig. 15, Appendix), the Oxford Flowers dataset (Fig. 16, Appendix), and the FFHQ dataset (Fig. 17, Appendix); the property therefore holds for both object datasets (DreamBooth, Oxford Flowers) and a face dataset (FFHQ). This is reasonable, since the main condition for the property to hold is that the input follow a zero-mean GSM model, which natural images generally do.
We then use this property as a prior in the standard diffusion pipeline, applying it as a constraint in order to generate images of better quality. We observe that the KC loss improves image quality quantitatively, qualitatively, and in user studies across three tasks with multiple approaches, which validates the KC loss as a useful prior in multiple settings and across different datasets.
An intuitive explanation of the projection kurtosis concentration property is as follows: band-pass (DWT) filtered versions of natural images generally follow a generalized Gaussian density of the form [2, 5]
\begin{equation}
p(x, \alpha, \beta) = \frac{\beta}{2\alpha \Gamma(1/\beta)} \exp\left(-\left(\frac{|x|}{\alpha}\right)^{\beta}\right)
\end{equation}
The kurtosis of this function is given by [5],
\begin{equation}
\kappa = \frac{\Gamma(1/\beta) \Gamma(5/\beta)}{\Gamma(3/\beta)^2}
\end{equation}
Empirically, it has been shown that for natural images $\beta$ takes relatively small values, ranging from roughly 0.5 to 1 [2, 6], and the corresponding kurtosis value tends to be constant [1, 2, 3, 6], independent of $\alpha$ or $x$.
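As an illustrative aside (this snippet is ours, not from the paper), the kurtosis expression above can be evaluated numerically with `scipy.special.gamma`; it depends only on the shape parameter $\beta$, and not on $\alpha$ or $x$:
```python
# Illustrative sketch (ours, not the paper's code): evaluate the GGD kurtosis
# kappa(beta) = Gamma(1/beta) * Gamma(5/beta) / Gamma(3/beta)^2.
from scipy.special import gamma

def ggd_kurtosis(beta: float) -> float:
    """Kurtosis of a generalized Gaussian density with shape parameter beta."""
    return gamma(1.0 / beta) * gamma(5.0 / beta) / gamma(3.0 / beta) ** 2

for beta in (0.5, 0.75, 1.0, 2.0):
    # The value is independent of the scale alpha; beta = 2 recovers the
    # Gaussian case, for which the kurtosis is 3.
    print(f"beta = {beta:.2f} -> kurtosis = {ggd_kurtosis(beta):.2f}")
```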
[1] Xing Zhang and Siwei Lyu. Using projection kurtosis concentration of natural images for blind noise covariance matrix estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2870–2876, 2014.
[2] Daniel Zoran and Yair Weiss. Scale invariance and noise in natural images. In 2009 IEEE 12th International Conference on Computer Vision, pp. 2209–2216. IEEE, 2009.
[3] Lyu S, Pan X, Zhang X. Exposing region splicing forgeries with blind local noise estimation. International journal of computer vision. 2014 Nov;110:202-21.
[4] Wainwright MJ, Simoncelli E. Scale mixtures of Gaussians and the statistics of natural images. Advances in neural information processing systems. 1999;12.
[5] Buccigrossi, Robert W., and Eero P. Simoncelli. "Image compression via joint statistical characterization in the wavelet domain." IEEE transactions on Image processing 8.12 (1999): 1688-1701.
[6] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S.-C. Zhu. On advances in statistical modeling of natural images. J. Math. Imaging Vis., 18(1):17–33, 2003.
[7] Ruderman, Daniel L., and William Bialek. "Statistics of natural images: Scaling in the woods." Physical review letters 73.6 (1994): 814.
[8] Wang, Zhou; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. (2004-04-01). "Image quality assessment: from error visibility to structural similarity". IEEE Transactions on Image Processing. 13 (4): 600–612.
[9] Mittal, Anish, Anush Krishna Moorthy, and Alan Conrad Bovik. "No-reference image quality assessment in the spatial domain." IEEE Transactions on image processing 21.12 (2012): 4695-4708.
[10] Mumford, David Bryant. "Empirical statistics and stochastic models for visual signals." (2006).
[11] D. Mumford and B. Gidas, “Stochastic Models for Generic Images,” Quarterly J. Applied Math., vol. LIX, no. 1, pp. 85-111, 2001.
[12] D. Field, “Relations between the Statistics of Natural Images and the Response Properties of Cortical Cells,” J. Optical Soc. Am. A, vol. 4, no. 12, pp. 2379-2394, Dec. 1987.
**Q3. [I just wonder if use KC loss, can we still find a suitable and strict derivations for that?]**
Ans: We would like to highlight that in our work, the underlying theoretical framework behind the forward and reverse diffusion processes remains unchanged; rather, we focus on improving the performance of the denoising neural network used to approximate the reverse diffusion trajectory.
Suppose we have input training images ($\{x\}$) and a conditioning vector $c$. The conditioning vector can be text (text-to-image model), an image (image-to-image model), or none (unconditional diffusion model). In the forward process [13, 14], the noisy version of image $x$ at timestep $t$ is generated as $x_t = \alpha_{t} x +\sigma_{t} \epsilon$, where $\epsilon \sim N(0,I)$.
In the reverse process [13, 14], a denoising autoencoder (${f_{\theta}}$) is trained to predict the denoised version of the image ($x_{t, gen}$) at each timestep $t$ from the noisy image $x_t$, i.e., $x_{t, gen} = f_{\theta} (x_t, c, t)$.
Typically, the denoising autoencoder (${f_{\theta}}$) is trained by minimizing the mean squared error between the real image ($x$) and the generated denoised image at timestep $t$ ($x_{t, gen}$), averaged over timesteps and noise, as denoted by
\begin{equation}
L_{recon} = \mathbb{E}_{x,c, \epsilon, t} [ \ || x_{t, gen} - x ||_{2}^{2}]
\end{equation}
The kurtosis concentration loss is applied to the generated images ($x_{t, gen}$) and can therefore be written as a function ($f'$) of $x_{t, gen}$ as follows:
\begin{equation}
L_{KC} = \mathbb{E}_{x, c, \epsilon, t} [ f'( x_{t, gen}) ]
\end{equation}
Note that the function $f'$ is the difference between the maximum and minimum kurtosis values across the DWT-filtered versions of $x_{t, gen}$.
Therefore, the total loss function can be written as,
\begin{equation}
L_{total} = \mathbb{E}_{x,c, \epsilon, t} [ \ || x_{t, gen} - x ||_{2}^{2}] + \mathbb{E}_{x, c, \epsilon, t} [ f'( x_{t, gen}) ]
\end{equation}
In our work, the above framework remains unchanged. The proposed KC loss acts as an additional regularizer on the training of the denoising neural network, which helps it denoise $x_t$ better (Lemma 2, main paper) and ultimately improves the approximation of $x$, i.e., $x_{t, gen}$, at each timestep $t$.
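For concreteness, below is a minimal PyTorch-style sketch of how such a KC regularizer could be computed; it is our own illustration under stated assumptions (a 2x2 Haar filter bank, a grayscale projection of the generated image, kurtosis computed per sub-band, and a weighting factor `lambda_kc`), not the paper's exact implementation:
```python
# Hedged sketch, assuming a PyTorch pipeline; names and design choices here are
# ours. f' is taken as (max - min) projection kurtosis over DWT sub-bands.
import torch
import torch.nn.functional as F

# 2x2 Haar analysis filters (LL, LH, HL, HH), applied with stride 2.
_HAAR = 0.5 * torch.tensor([
    [[ 1.,  1.], [ 1.,  1.]],   # LL
    [[ 1.,  1.], [-1., -1.]],   # LH
    [[ 1., -1.], [ 1., -1.]],   # HL
    [[ 1., -1.], [-1.,  1.]],   # HH
]).unsqueeze(1)                  # shape (4, 1, 2, 2)

def _kurtosis(x: torch.Tensor) -> torch.Tensor:
    """Sample kurtosis of each image in a sub-band: E[(x-mu)^4] / E[(x-mu)^2]^2."""
    x = x.flatten(1)
    mu = x.mean(dim=1, keepdim=True)
    m2 = ((x - mu) ** 2).mean(dim=1)
    m4 = ((x - mu) ** 4).mean(dim=1)
    return m4 / (m2 ** 2 + 1e-8)

def kc_loss(x_gen: torch.Tensor) -> torch.Tensor:
    """Kurtosis-concentration regularizer on generated images of shape (N, C, H, W)."""
    gray = x_gen.mean(dim=1, keepdim=True)             # assumed grayscale projection
    bands = F.conv2d(gray, _HAAR.to(x_gen), stride=2)  # (N, 4, H/2, W/2)
    kurt = torch.stack([_kurtosis(bands[:, i:i + 1]) for i in range(bands.shape[1])], dim=1)
    return (kurt.max(dim=1).values - kurt.min(dim=1).values).mean()

# In a training loop this term would simply be added to the reconstruction loss,
# mirroring L_total above:
# total_loss = recon_loss + lambda_kc * kc_loss(x_t_gen)   # lambda_kc is assumed
```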
[13] Ruiz, Nataniel, et al. "Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[14] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.
**Q4. [I like the idea to use priors for networks.]**
Ans: We thank the reviewer for appreciating our idea.
**Q5. [However, I think we need to know very clearly what exact problem we want to solve and why these priors are suitable and theorectically correct, rather than just borrow and use like a black box.]**
Ans: We respectfully differ with the reviewer on this point. The problem we address is clearly defined: improving the quality of diffusion-generated images, which is of paramount importance across diffusion tasks. We do not target any specific artifact; rather, we aim to improve overall image quality.
Before proposing the loss function, we verify that the projection kurtosis concentration property actually holds for natural images on large datasets, e.g., the DreamBooth dataset (Fig. 15, Appendix), the Oxford Flowers dataset (Fig. 16, Appendix), and the FFHQ dataset (Fig. 17, Appendix), which together cover a diverse range of object and face images.
Next, using the KC prior we propose the KC loss, which is motivated theoretically and validated experimentally on diverse diffusion tasks across different datasets; it is not used as a black box.
Dear Reviewer,
Thank you for your time and effort. **As you know, the author-reviewer discussion period ends tomorrow.** Please let us know if our rebuttal has sufficiently addressed your concerns and whether you have further questions or suggestions for improvement.
Best,
authors.
###############################
Dear AC and all the reviewers,
We appreciate your time and effort in reviewing our paper. We summarize the positive points identified by all the reviewers. **We have addressed the reviewers' concerns in detail and hope to receive positive feedback from them (4XiE, XKja) soon, since the rebuttal period is coming to an end.** For their convenience, we also summarize the experiments added in the updated version of the paper. **These changes are highlighted in blue in the final updated version.**
**Positive points:**
**1. Novelty: Interesting and novel idea [en6K, 4XiE, XKja, NS8z]**
**2. Well-written: easy to follow [en6K, 4XiE]**
**3. Extensive experiments: Strong results with multiple tasks [XKja, NS8z]**
**Added experiments and analysis:**
**1. Computation complexity**
We analyze the computational complexity of the proposed KC loss. Suppose we are given a batch of $N$ images and need to compute the DWT of each image with $k$ different filters. Since the DWT with the 'haar' wavelet can be computed in linear time, performing the DWT with $k$ filters takes $\mathcal{O}(Nk)$ time (treating the image size as a constant). Computing the difference between the maximum and minimum kurtosis is also linear, so the overall complexity of the KC loss is $\mathcal{O}(Nk)$. This minimal overhead is reflected in the training time analysis below. This is included in **Sec. F (Appendix).**
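As an illustrative back-of-the-envelope figure (the image and batch sizes below are assumptions for illustration, not the paper's settings): for $512 \times 512$ images ($P \approx 2.6 \times 10^5$ pixels), $k = 4$ Haar sub-bands, and a batch of $N = 16$, the DWT costs roughly $N k P \approx 1.7 \times 10^7$ multiply-adds, plus a linear pass to accumulate the kurtosis statistics; this is negligible compared to a single U-Net forward pass.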
**2. Training time analysis**
The run-time analysis is provided in Table 1. Note that the experiments for DreamBooth, Custom Diffusion, and DDPM were performed on a single 24 GB A5000 GPU, while the guided diffusion (GD) and latent diffusion (LD) experiments were run on a server with 8 A5000 GPUs (24 GB each). The results show that incorporating the KC loss induces minimal training overhead. This is included in **Sec. G (Appendix).**
**Table 1: Training time analysis**
| Method | dataset | Training time |
|--------------|-------------|------------|
| DreamBooth [1] | 5-shot finetune DreamBooth dataset | 10m 21s |
| DreamBooth [1] + KC loss | 5-shot finetune DreamBooth dataset | 11m 30s |
| Custom diff [2] | 5-shot finetune DreamBooth dataset | 6m 43s|
| Custom diff [2] + KC loss | 5-shot finetune DreamBooth dataset | 7m 11s|
| DDPM [4] | CelebAfaces | 2d 8h 21 m |
| DDPM [4] + KC loss | CelebAfaces | 2d 9h 19 m |
| DDPM [4] | CelebAHQ | 21h 48m |
| DDPM [4] + KC loss | CelebAHQ | 22h 40m |
| DDPM [4] | Oxford flowers | 6h 17m |
| DDPM [4] + KC loss | Oxford flowers | 6h 39m |
| GD [5] | FFHQ | 23h 10m |
| GD [5] + KC loss | FFHQ | 1d 1h 29m |
| LD [6] | FFHQ | 20h 15m |
| LD [6] + KC loss | FFHQ | 22h 40m |
**3. Kurtosis analysis**
To verify the efficacy of the proposed KC loss, we perform an average kurtosis analysis in **Sec. H (Appendix)**. We compute the average kurtosis deviation of the DWT-filtered versions of the images and plot it in Fig. 15, 16, 17 (Appendix) of the updated paper for the DreamBooth, Oxford Flowers, and FFHQ datasets.
All plots show that adding the KC loss reduces the deviation of kurtosis values in the generated images, and that natural images have the least kurtosis deviation, as predicted by the kurtosis concentration property.
This analysis verifies that minimizing the kurtosis deviation via the KC loss improves the quality of diffusion-generated images.
**4. Convergence analysis**
The main idea of the diffusion model is to train a U-Net that learns to denoise from random noise towards a specific image distribution; more denoising steps yield a better denoised image, e.g., DDPM [4], LDM [6]. In Proposition 1 (main paper), we show that minimizing the projection kurtosis further denoises the input signal. Therefore, the KC loss aids the denoising process and improves convergence speed. We show that adding the KC loss makes the loss converge faster for the DreamBooth task in **Fig. 14 (Appendix)**. This is discussed in **Sec. I (Appendix)** of the updated version of the paper.
**5. Qualitative analysis**
In this section, we provide further qualitative analysis showing that adding the KC loss improves image quality. Zoomed views of the generated images are shown for comparison against the baselines in **Fig. 18, 19, 20, 21, 22, 23, 24, 25 (Appendix)**.
**6. Quantitative analysis**
More experiments with new baselines and additional settings are provided below. Results for the personalized few-shot fine-tuning task are given in Table 2 (Tab. 1 in the main paper). Unconditional image generation experiments are shown in Tables 3, 4, and 5 (Tab. 2 in the main paper).
Additional image super-resolution experiments are reported in Tables 6, 7, and 8 for the x4, x2, and x8 settings, respectively. A human evaluation for the image super-resolution task is shown in **Sec. K.2 (Appendix)**.
**Table 2: Comparison of Personalized few-shot finetuning task**
| Method | FID score | MUSIQ score | DINO | CLIP-I | CLIP-T |
|--------------|-------------|------------|-------|--------|---------|
| DreamBooth [1]| 111.76 | 68.31 | 0.65 | 0.81 | 0.31 |
| DreamBooth [1] + LPIPS [3] | 108.23 | 68.39 | 0.65 | 0.80 | 0.32 |
| DreamBooth [1] + KC loss | **100.08** | **69.78** | **0.68** | **0.84** | **0.34** |
| Custom Diff.[2] | 84.65 | 70.15 | 0.71 | 0.87 | 0.38 |
| Custom Diff.[2] + LPIPS [3] | 80.12 | 70.56 | 0.71 | 0.87 | 0.37 |
| Custom Diff.[2] + KC loss | **75.68** | **72.22** | **0.73** | **0.88** | **0.40** |
**Table 3: Comparison of unconditional image generation task (Oxford flowers dataset)**
| Method | FID score | MUSIQ score |
|--------------|-------------|------------|
| DDPM [4]| 243.43 | 20.67 |
| DDPM [4] + LPIPS [3] | 242.62 | 20.80 |
| DDPM [4] + KC loss | **237.73** | **21.13** |
**Table 4: Comparison of unconditional image generation task (Celeb-faces dataset)**
| Method | FID score | MUSIQ score |
|--------------|-------------|------------|
| DDPM [4]| 202.67 | 19.07 |
| DDPM [4] + LPIPS [3] | 201.55 | 19.21 |
| DDPM [4] + KC loss | **198.23** | **19.52** |
**Table 5: Comparison of unconditional image generation task (CelebAHQ dataset)**
| Method | FID score | MUSIQ score |
|--------------|-------------|------------|
| DDPM [4]| 199.77 | 46.05 |
| DDPM [4] + LPIPS [3] | 197.17 | 46.15 |
| DDPM [4] + KC loss | **190.59** | **46.83** |
**Table 6: Comparison of image super-resolution (x4) task**
| Method | FID score | PSNR | SSIM | LPIPS | MUSIQ score |
|--------------|-------------|------------|-------|--------|---------|
| GD [5]| 121.23 | 18.13 | 0.54 | 0.28 | 57.31 |
| GD [5] + LPIPS [3] | 119.81 | 18.22 | 0.54 | 0.27 | 57.42 |
| GD [5] + KC loss | **103.19** | **18.92** | **0.55** | **0.26** | **58.69** |
| LD [6] | 95.83 | 19.16 | 0.56 | 0.26 | 59.57 |
| LD [6] + LPIPS [3] | 92.77 | 19.42 | 0.57 | 0.25 | 59.82 |
| LD [6] + KC loss | **83.34** | **20.25** | **0.58** | **0.22** | **61.20** |
**Table 7: Comparison of image super-resolution (x2) task**
| Method | FID score | PSNR | SSIM | LPIPS | MUSIQ score |
|--------------|-------------|------------|-------|--------|---------|
| GD [5]| 100.2 | 19.4 | 0.62 | 0.25 | 58.12 |
| GD [5] + KC loss | **80.9** | **20.2** | **0.66** | **0.20** | **59.91** |
| LD [6] | 82.45 | 21.2 | 0.64 | 0.24 | 60.23 |
| LD [6] + KC loss | **70.12** | **22.3** | **0.70** | **0.18** | **62.15** |
**Table 8: Comparison of image super-resolution (x8) task**
| Method | FID score | PSNR | SSIM | LPIPS | MUSIQ score |
|--------------|-------------|------------|-------|--------|---------|
| GD [5]| 140.3 | 17.5 | 0.52 | 0.32 | 55.26 |
| GD [5] + KC loss | **125.5** | **18.7** | **0.56** | **0.27** | **57.33** |
| LD [6] | 103.2 | 18.7 | 0.59 | 0.25 | 58.62 |
| LD [6] + KC loss | **80.1** | **19.5** | **0.67** | **0.20** | **60.31** |
**References:**
[1] Ruiz, Nataniel, et al. "Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[2] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. CVPR 2023
[3] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595, 2018.
[4] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.
[5] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
[6] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusionbased generative models. Advances in Neural Information Processing Systems, 35:26565–26577,2022