# Global rebuttal
We thank all reviewers for their constructive feedback, and we have responded to each reviewer individually. We have also uploaded a Rebuttal PDF that includes:
- $\\textrm{\\color{blue}Figure A}$: Memorization ratios / KNN distances / FIDs of diffusion models trained with different intra-class diversity (for the cat class);
- $\\textrm{\\color{blue}Figure B}$ and $\\textrm{\\color{blue}Figure C}$: Visualization of sample-level memorization reproducibility;
- $\\textrm{\\color{blue}Table A}$: Sample-level memorization analysis (memorization coverage and reproducibility).
Since both reviewers cKDx and JLai suggested providing explanations, discussions, and insights regarding our empirical results, we respond here in a point-by-point manner (we thank reviewer 6xzn for summarizing our findings).
First, we would like to show that the objective of denoising score matching is equivalent to fitting the optimal diffusion model $\\boldsymbol{s}\^*$ (also shown in Appendix B.1). This equivalence may offer a perspective for understanding diffusion model training through the lens of function fitting.
$\\qquad\\arg\\min\_\{\\theta\}\\frac\{1\}\{2N\}\\sum\_\{n=1\}\^\{N\}\\mathbb\{E\}\_\{t\\sim[0, T]\}\\mathbb\{E\}\_ \{\\epsilon\\sim \\mathcal\{N\}(\\mathbf\{0\},\\mathbf\{I\})\}\\left[\\lambda(t)\\left\\|\\boldsymbol\{s\}\_\{\\theta\}(\\alpha\_\{t\}x\_\{n\}+\\sigma\_\{t\}\\epsilon, t)+\\frac\{\\epsilon\}\{\\sigma\_\{t\}\}\\right\\|\_\{2\}\^\{2\}\\right]$
$\\Leftrightarrow \\arg\\min\_\{\\theta\}\\frac\{1\}\{2N\}\\sum\_\{n=1\}\^\{N\}\\mathbb\{E\}\_\{t\\sim[0, T]\}\\mathbb\{E\}\_ \{\\epsilon\\sim \\mathcal\{N\}(\\mathbf\{0\},\\mathbf\{I\})\}\\left[\\lambda(t)\\left\\|\\boldsymbol\{s\}\_\{\\theta\}(\\alpha\_\{t\}x\_\{n\}+\\sigma\_\{t\}\\epsilon, t)-\\boldsymbol\{s\}\^*(\\alpha\_\{t\}x\_\{n\}+\\sigma\_\{t\}\\epsilon, t)\\right\\|\_\{2\}\^\{2\}\\right]$
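To make this equivalence more concrete, the regression target $\\boldsymbol{s}\^*$ has a closed form under the empirical training distribution. Below is a minimal NumPy sketch (not the authors' code; it assumes the forward process $z\_t=\\alpha\_t x\_n+\\sigma\_t\\epsilon$ used above, under which the marginal at time $t$ is a mixture of $N$ Gaussians):

```python
# Minimal NumPy sketch (not the authors' code) of the closed-form optimal score s*(z_t, t)
# for a finite training set {x_n}, assuming the forward process z_t = alpha_t*x_n + sigma_t*eps.
# Under the empirical data distribution, p_t is a mixture of N Gaussians, and its score is
# the posterior-weighted average computed below.
import numpy as np

def optimal_score(z_t, X, alpha_t, sigma_t):
    """z_t: (d,) noisy input; X: (N, d) flattened training set; returns s*(z_t, t)."""
    # log responsibilities proportional to N(z_t; alpha_t * x_n, sigma_t^2 I)
    log_w = -np.sum((z_t - alpha_t * X) ** 2, axis=1) / (2.0 * sigma_t ** 2)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                        # softmax over training samples
    # score of the Gaussian mixture: sum_n w_n * (alpha_t * x_n - z_t) / sigma_t^2
    return (w[:, None] * (alpha_t * X - z_t)).sum(axis=0) / sigma_t ** 2
```

As $\\sigma\_t\\to 0$, the weights collapse onto the nearest training sample, which is consistent with the statement that the optimal model can only replicate training data.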
*Data distribution*
Finding 1: Higher-resolution images lead to significantly less memorization (lower EMM).
Explanation 1: A higher resolution results in a larger dimensionality of the supervision signals in denoising score matching, which makes memorizing the training data more difficult.
Finding 2: For a fixed dataset size, increasing dataset diversity (both by adding more classes of images and by increasing intra-class diversity) leads to less memorization, but the effect is limited.
Explanation 2: In these cases, the supervision dimensionality and the model capacity remain the same; only the image representation manifold changes, which may have a limited impact on the difficulty of memorization.
*Model Configuration*
Finding 3: Increasing model width significantly increases memorization, while increasing model depth has a non-monotonic effect on memorization (EMM increases up to a certain point and then starts decreasing).
Explanation 3: Increasing model width/depth generally increases the model's representation capacity, which allows trained models to fit the optimal solution better. However, further increasing model depth may also introduce optimization difficulties, resulting in a non-monotonic effect.
Finding 4: Learned positional embeddings for the noise level lead to more memorization than random Fourier features.
Explanation 4: Replacing learned positional embeddings with random Fourier features reduces the model capacity.
Finding 5: Increasing the number of skip connections in the U-Net backbone of the diffusion model has a limited effect on memorization. For a fixed number of skip connections, placing them at a higher resolution in the U-Net leads to higher memorization.
Explanation 5: Increasing the number of skip connections does not significantly increase the model's capacity to memorize training data, as long as at least one high-resolution skip connection is retained. This indicates the importance of high-resolution skip connections for memorization.
*Training Procedure*
Finding 6: Increasing the batch size and learning rate in tandem increases memorization (but not significantly).
Explanation 6: Our intuition stems from existing research on the generalization behavior of stochastic gradient descent (SGD). Moreover, as shown in the equivalence above, diffusion models are trained to converge to the optimal solution. Using a larger batch size (or even the full dataset) yields less noisy gradient estimates, so the trained diffusion model tends to be closer to the theoretical optimum, which can only replicate training data.
Finding 7: A large weight decay coefficient leads to very low memorization.
Explanation 7: A weight decay coefficient greater than zero alters the training objective, causing a divergence from the original theoretical optimum. This divergence becomes apparent for large weight decay.
Finding 8: Taking an exponential moving average of model weights during training has a limited effect on memorization.
Explanation 8: EMA is adopted to remove artifacts in generated images, which can stabilize FIDs. It can be regarded as a type of model ensemble, and this ensembling seems to have limited impact on fitting the optimal solution.
*Unconditional vs Conditional*
Finding 9: The number of condition classes is the key to memorization in conditional diffusion models, regardless of whether these conditions are informative.
Explanation 9: As shown in Appendix B.3, we derive the theoretical optimum for class-conditioned diffusion models. We find that training a class-conditioned diffusion model is equivalent to fitting a series of optimal models partitioned by class labels. This partition, whether informative or not, reduces the fitting difficulty. When each image receives a unique condition, memorization becomes much easier. This may explain why Stable Diffusion (SD), trained on millions / billions of text-conditioned images, is still prone to memorizing training data, especially when the input prompt matches the training caption, because each image effectively has a unique text label. This may inspire further theoretical or empirical investigation in the future.
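As a brief illustration of this partition (a sketch under the empirical training distribution assumed in Appendix B.3, not the exact derivation in the paper), conditioning on a class $c$ restricts the Gaussian mixture that defines the optimal score to the class subset $\\mathcal{D}\_c$:

$\\qquad\\boldsymbol\{s\}\^*(z\_\{t\}, t\\mid c)=\\nabla\_\{z\_\{t\}\}\\log\\frac\{1\}\{|\\mathcal\{D\}\_\{c\}|\}\\sum\_\{x\_\{n\}\\in\\mathcal\{D\}\_\{c\}\}\\mathcal\{N\}(z\_\{t\}; \\alpha\_\{t\}x\_\{n\}, \\sigma\_\{t\}\^\{2\}\\mathbf\{I\})\\textrm\{.\}$

With unique conditions, $|\\mathcal{D}\_c|=1$ and the target collapses to the score of a single Gaussian centered at one training image, so replicating that image becomes the optimal behavior.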
# Reviewer cKDx
Thank you for your valuable review and suggestions. Below we respond to the comments in ***Weaknesses (W)*** and ***Questions (Q)***.
---
***W1 & Q1: Contributions in the context of previous works***
Thank you for your suggestions. Below, we clarify our contributions in the context of previous work on memorization in diffusion models (DMs).
[4,5] aim to show that DMs can replicate training data partially or in full, focusing on the inference perspective. In contrast, we focus on the training perspective. Additionally, our motivation is different: we investigate the conditions under which DMs inadvertently approximate the optimal solution outlined in Eq.(2). This optimal model, while theoretically expected, can only replicate training data. Understanding the extent of this theoretical hazard quantitatively in typical DM training scenarios is crucial. Therefore, the comprehensive empirical experiments and findings are our main contribution.
Several other studies also involve how training configurations affect memorization [2,3] or generalization [1,2] in DMs. However, these explorations are conducted in different scenarios with a limited set of factors. In contrast, our paper studies this problem in a unified framework with a rigorous experimental design and a comprehensive set of factors that may impact memorization. Moreover, through this framework, we are able to gauge which factors contribute significantly or marginally to memorization.
Specifically, [3] explored the effects of text conditions, including random captions, on memorization during the fine-tuning of SD models, whereas we conduct experiments by training DMs from scratch, which eliminates possible effects from pre-training; we further show, through additional experimental designs, that the number of classes is the key to memorization, pushing the conclusion further than [3]. [2] explored the effects of model capacity on memorization only in terms of model width; our study complements their experiments by exploring model depth, positional embeddings, and skip connections. [1] explored the effects of image resolution on generalization, but when modifying the resolution they also adjusted the model capacity, whereas we always keep the other factors fixed and isolate the effect of one specific factor.
---
***W2 & Q2: Explanations or solutions to the problem of memorization***
Thank you for raising this. We have included the response in the global rebuttal. Please let us know if you have further concerns.
---
***W3: EMM is adapted from a similar notion previously defined in [2]***
Our definition of EMM includes an expectation, while the memorization capacity in [2] adopts a probabilistic form. The two definitions share similarities, and we cite [2] in our definition. Additionally, EMM only serves as a tool to facilitate our analysis of memorization, so we do not claim it as our main contribution.
---
***W4: The experiments were conducted using only one random seed.***
We would like to highlight that, to conduct such a comprehensive empirical study, we trained hundreds of DMs. It would be extremely expensive to rerun all experiments with multiple random seeds. Moreover, for several experiments we ran three random seeds and found that the variance of the memorization ratio is negligible (\~0.1\%–0.2\%), as reported in Appendix C. Given this minimal fluctuation and the difficulty of repeating all experiments, we run each experiment once in the main paper.
---
***W5, W6, & W7: The findings regarding impact of image resolution, model width, text conditioning have been reported in [1-3], respectively.***
Please refer to our clarifications on the differences between our study and [1-3] in response to ***W1 & Q1***.
---
***W8, Q4, & Q5: The study does not explore key parameters, such as L2 distance and the factor 1/3 that define the EMM metric. And whether the results are sensitive to these choices?***
If a generated sample is a replicate of a training sample, the ratio between its distance to the nearest training sample and its distance to the 2nd-nearest training sample is very small, for either L1 or L2 distance. Based on this intuition, we followed [2] and selected a small threshold of 1/3 with the L2 distance to define the memorization ratio. Empirically, this choice aligns with human perception. Using a different factor or the L1 distance only changes the strictness of the memorization definition and does not change the memorization tendencies reported in our paper.
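For concreteness, a minimal NumPy sketch of this memorization-ratio computation (assuming flattened image arrays; not the exact implementation used in the paper):

```python
import numpy as np

def memorization_ratio(gen, train, thresh=1.0 / 3.0):
    """gen: (M, d) generated samples; train: (N, d) training samples (flattened images)."""
    memorized = 0
    for x in gen:
        d = np.sqrt(((train - x) ** 2).sum(axis=1))  # L2 distances to all training samples
        d1, d2 = np.partition(d, 1)[:2]              # nearest and 2nd-nearest distances
        memorized += (d1 / d2) < thresh              # counted as memorized if the ratio is small
    return memorized / len(gen)
```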
---
***W9: Lack of comparison with other metrics for evaluating memorization***
Besides the memorization ratio, we also consider the KNN distance used in [5] in Appendix G. Our findings remain valid with alternative metrics.
We apologize for deferring the FID and KNN results to Appendices F and G, which may reduce readability. We will reorganize the results for better presentation in the revision.
We note that quality metrics such as FID are also close to optimal when DMs memorize a significant proportion of the training data. For instance, the theoretical optimum achieves an FID of about 0.5, much smaller than the 1.96 achieved by EDM. Therefore, it is difficult to disentangle the effects of memorization and image quality. This intuition is in line with our results in Appendix F.
---
***Q3: Are the findings related to intra-class diversity robust to other classes besides the dog class?***
Yes, we conducted experiments on other classes, e.g., the cat class.
We present the new results in $\\textrm{\\color{blue}Figure A}$. We observe that with higher intra-class diversity in the training data, DMs tend to demonstrate higher memorization; consistent with the main paper, this effect is not significant. We also present the KNN and FID results alongside the memorization ratios.
We chose the dog class for the experiments in the main paper because it has the most subclasses in the ImageNet dataset, which most clearly demonstrates the effects of intra-class diversity.
*References:*\
[1] Kadkhodaie, Zahra, et al. "Generalization in diffusion models arises from geometry-adaptive harmonic representation." ICLR 2024.\
[2] Yoon, TaeHo, et al. "Diffusion probabilistic models generalize when they fail to memorize." ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling.\
[3] Somepalli, Gowthami, et al. "Understanding and mitigating copying in diffusion models." NeurIPS 2023.\
[4] Somepalli, Gowthami, et al. "Diffusion art or digital forgery? investigating data replication in diffusion models." CVPR 2023.\
[5] Carlini, Nicholas, et al. "Extracting training data from diffusion models." USENIX Security 2023.
---
# Reviewer JLai
Thank you for your valuable review and suggestions. Below we respond to the comments in ***Weaknesses (W)*** and ***Questions (Q)***.
---
***W1 (1): It is unclear how this paper significantly advances beyond the existing studies in [1-5].***
Thank you for your suggestions. Here we explain our contributions in the context of previous literature on memorization in diffusion models (DMs).
[1,3] aim to show that DMs can replicate training data partially or in full, focusing on the inference perspective. In contrast, we focus on the training perspective. Additionally, our motivation is different: we investigate the conditions under which diffusion models inadvertently approximate the optimal solution outlined in Eq.(2). This optimal model, while theoretically expected, can only replicate training data. Understanding the extent of this theoretical hazard quantitatively in typical diffusion model training scenarios is crucial. Therefore, the comprehensive empirical experiments and findings are our main contribution.
Several other studies also involve how training configurations affect memorization [2,4] or generalization [4,5] in DMs. However, these explorations are conducted in different scenarios with a limited set of factors. In contrast, our paper studies this problem in a unified framework with a rigorous experimental design and a comprehensive set of factors that may impact memorization. Moreover, through this unified framework, we are able to gauge which factors contribute significantly or marginally to memorization.
Specifically, [2] explored the effects of text conditions, including random captions, on memorization during the fine-tuning of SD models, whereas we conduct experiments by training DMs from scratch, which eliminates possible effects from pre-training; we further show, through additional experimental designs, that the number of classes is the key to memorization, pushing the conclusion further than [2]. The authors in [4] explored the effects of model capacity on memorization only in terms of model width; our study complements their experiments by exploring model depth, positional embeddings, and skip connections. [5] explored the effects of image resolution on generalization, but when modifying the resolution they also adjusted the model capacity, whereas we always keep the other factors fixed and isolate the effect of one specific factor.
---
***W1 (2): Paper should clarify what are the "illuminating results".***
Through our comprehensive experiments, we elucidate the contribution of each factor to memorization. Several findings are not what one would expect, which we consider "illuminating results":
1. Dataset diversity contributes only marginally to memorization.
2. Model depth has a non-monotonic effect on memorization.
3. Skip connections contribute sparsely to model capacity and memorization, and those located at high resolutions play important roles.
4. Although EMA can significantly improve FIDs, it contributes little to memorization.
5. Random labels can induce memorization just like informative labels, and the number of classes is the key; manipulating conditions can significantly trigger memorization in diffusion models.
---
***W2: The current manuscript lacks in-depth explanations, discussion, and insights regarding the observed results.***
Thank you for raising this. Please refer to our global rebuttal for a detailed response and let us know if you have further questions.
---
***W3: Considering more complex notions of copying, not copying the entire image but some of the composing mechanisms.***
In our work, we have established through both empirical and theoretical analysis that the optimal solution for denoising score matching in diffusion models memorizes training data on a pixel-by-pixel basis. Since our motivation is to gauge the extent of this theoretical hazard in typical diffusion model training scenarios, the same pixel-by-pixel memorization is our research focus.
---
***Q1: Can you explicitly list the novel and specific scientific goals and contributions of the work?***
Thanks for your comments. The scientific question in our paper arises from a contradiction: diffusion models are trained to converge to the optimal solution outlined in Eq. (2) (which can only replicate training data), yet empirically they do not demonstrate such memorization behavior. Our scientific goal is therefore to investigate the conditions under which diffusion models inadvertently approximate this optimum.
The contributions of our work are listed below:
1. We conduct comprehensive empirical studies to gauge the quantitative effects of various factors on memorization in a unified framework, thus providing guidance for future research and practitioners.
2. On the theoretical side, we explain why the optimal diffusion model in Eq.(2) can only replicate training data, from the perspective of the backward process. We also show that denoising score matching is equivalent to fitting the optimal diffusion model.
*References:*\
[1] Somepalli, G., Singla, V., Goldblum, M., Geiping, J. and Goldstein, T.. Diffusion art or digital forgery? investigating data replication in diffusion models. CVPR 2023.\
[2] Somepalli, G., Singla, V., Goldblum, M., Geiping, J. and Goldstein, T.. Understanding and mitigating copying in diffusion models. NeurIPS 2023.\
[3] Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramer, F., Balle, B., Ippolito, D. and Wallace, E.. Extracting training data from diffusion models. USENIX Security 2023.\
[4] Yoon, T., Choi, J.Y., Kwon, S. and Ryu, E.K.. Diffusion probabilistic models generalize when they fail to memorize. ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling.\
[5] Kadkhodaie, Z., Guth, F., Simoncelli, E.P. and Mallat, S.. Generalization in diffusion models arises from geometry-adaptive harmonic representation. ICLR 2024.
---
# Reviewer VooV
Thank you for your supportive review and suggestions. Below we respond to the comments in ***Weaknesses (W)*** and ***Questions (Q)***.
---
***W1: The main drawback of this work is the small scale of the experiments.***
Our primary goal is to investigate the conditions under which diffusion models inadvertently approximate the optimal solution outlined in Eq.(2), through a comprehensive empirical study. This optimal model, while theoretically expected, presents potential drawbacks, such as solely replicating training data and potentially leading to privacy or copyright concerns. Understanding the extent of this theoretical hazard in typical diffusion model training scenarios is crucial.
Through our experiments, we find that unconditional diffusion models typically demonstrate a high memorization ratio when the training set contains fewer than 5k images for the CIFAR/FFHQ datasets. For larger training set sizes, the memorization ratios are expected to drop significantly. This is why Carlini et al. [1] extracted only 200-300 training images from $2\^{20}$ images generated by DDPM and its variant when these models were trained on the original datasets.
However, for conditional diffusion models, we show that random or unique labels can significantly trigger memorization. This could explain the memorization behavior of Stable Diffusion (SD), which was trained on millions or billions of images: text conditioning can be regarded as a specific case of unique conditioning. In our study, even when trained on the original dataset, uniquely conditioned diffusion models demonstrate much higher memorization ratios than unconditional ones. Therefore, even when trained on very large-scale data, SD can still replicate training data, especially when the prompt is similar to the training text label.
Since it is computationally infeasible for us to reproduce our experiments by training SD from scratch, we fine-tune SD on customized data, which are not limited to CIFAR10/FFHQ.
---
***W2: Concerning the novelty of the theoretical part.***
Indeed, we do not consider showing that "replicating the training data is an optimal solution for the denoising score matching criterion" as our main contribution; this is why we defer the derivation of the theoretical optimum for the DSM criterion to the appendix. Our main contribution remains the extensive empirical experiments and findings on memorization in diffusion models.
Following the suggestion from Reviewer 6xzn, our theoretical analyses of (i) the equivalence between denoising score matching and fitting the optimal diffusion model during training, and (ii) why the optimal diffusion model only replicates training data (from the perspective of the backward process) can be considered as contributions.
---
***W3 and Q1: It would have been interesting to have a more sample level analysis on memorized samples generated by different diffusion models.***
Thank you for raising this. We have conducted a series of experiments to observe whether the memorized samples stay the same when fixing the training dataset and modifying hyperparameters, e.g., model width/depth, batch size, and EMA values. We conduct two sample-level analyses, which we hope answer your question. We compare four sets of diffusion models trained on two fixed datasets ($|\\mathcal{D}|=$ 1k and $|\\mathcal{D}|=$ 2k) with different model widths (128, 192, 256, 320), model depths (2, 4, 6, 8, 10), batch sizes (128, 256, 384, 512, 640, 768, 896), and EMA values (0.99929, 0.999, 0.99, 0.9, 0.5, 0.1, 0.0).
We first examine the "memorization coverage": $\\textrm{cov}=\\textrm{set}\\left(\\textrm{NN}\_1(x', \\mathcal{D})\\,\\big |\\, \\frac{||x'-\\textrm{NN}\_1(x', \\mathcal{D})||\_2}{||x'-\\textrm{NN}\_2(x', \\mathcal{D})||\_2}<1/3\\right)$, which represents which training samples are replicated during inference. We then evaluate the intersection of the memorization coverage of different models.
As shown in $\\textrm{\\color{blue}Table A}$ (*left*) and (*middle*), we notice that most of the training data can be replicated during inference by the different models, and the ratios of commonly memorized samples are high. We also find that although EMA has little impact on memorization ratios, it changes the memorization coverage.
Secondly, we evaluate the "memorization reproducibility": given the same initial noises, the ratio of commonly memorized samples across different models. As presented in $\\textrm{\\color{blue}Table A}$ (*right*), we notice that across different models, with the same initial noise, a significant number of memorized samples are identical.
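For clarity, a hedged sketch of how these two quantities can be computed (the helper names below are illustrative, not the exact implementation behind $\\textrm{\\color{blue}Table A}$):

```python
import numpy as np

def nn_ratio_and_index(x, train):
    """Distance ratio d1/d2 and index of the nearest training sample for one generation x."""
    d = np.sqrt(((train - x) ** 2).sum(axis=1))
    i1 = int(d.argmin())
    d1, d2 = np.partition(d, 1)[:2]          # nearest and 2nd-nearest distances
    return d1 / max(d2, 1e-12), i1

def memorization_coverage(gen, train, thresh=1.0 / 3.0):
    """Set of training indices replicated by at least one generation (memorization coverage)."""
    cov = set()
    for x in gen:
        r, i = nn_ratio_and_index(x, train)
        if r < thresh:
            cov.add(i)
    return cov

def memorization_reproducibility(gen_a, gen_b, train, thresh=1.0 / 3.0):
    """Fraction of shared initial noises for which both models memorize the same training
    sample; gen_a[k] and gen_b[k] are assumed generated from the same initial noise."""
    common = 0
    for xa, xb in zip(gen_a, gen_b):
        ra, ia = nn_ratio_and_index(xa, train)
        rb, ib = nn_ratio_and_index(xb, train)
        common += (ra < thresh) and (rb < thresh) and (ia == ib)
    return common / len(gen_a)
```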
In $\\textrm{\\color{blue}Figure B}$ and $\\textrm{\\color{blue}Figure C}$, we also include a detailed visualization of memorization reproducibility between two diffusion models with different hyperparameters.
These results indicate that different models may learn very similar diffusion trajectories in the context of memorization, which opens up directions for future research.
---
***W4: This paper could be seen as having limited novelty since it mostly focuses on providing extensive empirical analysis.***
Thank you for your comments. Despite our focus on extensive empirical analysis, we provide quantitative insights into the critical points at which memorization occurs. This awareness is essential for mitigating adverse consequences and refining the practical utility of diffusion models.
Several findings in our paper are not "quite expected", which may deepen the understanding of diffusion models, e.g., those regarding model depth, skip connections, EMA values, and random conditions. These findings can serve as guidelines for future research and practitioners, as also mentioned by reviewer 6xzn.
*References:*\
[1] Carlini, Nicholas, et al. "Extracting training data from diffusion models." USENIX Security 2023.
# Reviewer 6xzn
We really appreciate that you read our paper thoroughly and provide detailed summary and insightful suggestions. Below we respond to the comments in ***Weaknesses (W)*** and ***Questions (Q)***.
---
***W1: Theoretical analyses in the appendix could be more fleshed out in the main paper.***
Thank you for your suggestions. In the main paper, we will highlight our theoretical analyses of (i) the equivalence between denoising score matching and fitting the optimal diffusion model during training, and (ii) why the optimal diffusion model only replicates training data.
---
***W2: Plots \(a) and \(c) in Figure 4 are hard to interpret visually.***
Thank you for your suggestions. We will use tables to present the experimental results of Figure 4 in the revision.
---
***W3: Some readers might consider some of the findings to be unsurprising.***
We really appreciate that you value our work and help explain our empirical findings. We will acknowledge your valuable comments and update the explanations in our revision.
---
***Minor typos and similar mistakes***
We really appreciate your corrections and suggestions. We have already fixed these typos and mistakes and will update them in the revision. Regarding "Different from FIDs" in Line 251, we mean that "although EMA values are significant for improving FIDs, they contribute little to memorization". We will also clarify this in the future version.
---
***Q1: In Definition 2.1 (EMM), with respect to what randomness is the expectation computed?***
For the definition of EMM,
$\\textrm\{EMM\}\_\{\\zeta\}(\\mathbf\{P\}, \\mathcal\{M\}, \\mathcal\{T\}) = \\max\_\{N\} \\left\\\{\\mathbb\{E\}[\\mathcal\{R\}\_\{\\textrm\{Mem\}\}(\\mathcal\{D\}, \\mathcal\{M\}, \\mathcal\{T\})] \\geq 1-\\zeta \\Big|\\mathcal\{D\}\\sim \\mathbf\{P\},|\\mathcal\{D\}|=N \\right\\\}\\textrm\{,\}$
the expectation is computed over different datasets $\\mathcal{D}$ sampled with the same size $|\\mathcal{D}|=N$. Although $\\mathcal{D}$ is sampled from the same data distribution $\\mathbf{P}$, different samples may contribute differently to the memorization ratio. Therefore, EMM should in principle be computed over different $\\mathcal{D}$ of the same size. However, this would significantly increase the computation costs, so in practice we only sample one dataset $\\mathcal{D}$ of size $N$. To rule out the effects of this randomness when estimating EMM, we select a series of nested datasets satisfying $\\mathcal{D}\_1\\subset\\mathcal{D}\_2\\subset...\\subset\\mathcal{D}\_n$.
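For concreteness, a minimal illustrative sketch of this estimation procedure (the `train_and_eval_ratio` helper is hypothetical and stands for training a diffusion model on the first $N$ samples of the nested series and computing its memorization ratio):

```python
def estimate_emm(nested_sizes, train_and_eval_ratio, zeta):
    """Return the largest dataset size N (from a nested series D_1 ⊂ D_2 ⊂ ...) whose
    trained model still reaches a memorization ratio of at least 1 - zeta."""
    emm = 0
    for n in nested_sizes:                  # e.g. increasing sizes of the nested datasets
        if train_and_eval_ratio(n) >= 1.0 - zeta:
            emm = n                         # memorization still (nearly) complete at this size
        else:
            break                           # ratio drops below 1 - zeta for larger datasets
    return emm
```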
---
Thank you for your response. As mentioned in our paper, our experiments explore the effects of data, model, and optimization (or training) on memorization. We would like to provide further clarifications, especially regarding "actionable discussions".
From the data perspective, we find that the data dimensionality matters while the data manifold is less important; this can guide practitioners to train diffusion models at higher resolutions to mitigate memorization. From the model perspective, the findings regarding model width/depth, positional embeddings, skip connections, and LoRA ranks can be summarized as follows: the more capacity a model has, the more it tends to memorize training data. Practitioners may therefore consider a lower-capacity model when training or fine-tuning in low-resource data generation scenarios to prevent heavy memorization. From the optimization perspective, we find that a smaller batch size and adding weight decay can mitigate memorization, while EMA has no significant impact on memorization, so practitioners may take these training hyperparameters into account. Finally, our experiments on different conditions show that conditional diffusion models generally memorize training data more easily, which calls for more attention to memorization. Based on this, one may consider reducing the diversity of class / text conditions; for instance, practitioners can replace sub-class labels with class labels, or label the same visual object (e.g. "cup") with the same text description (rather than captioning it sometimes as "mug" and sometimes as "cup").
Beyond these practical takeaways, we would like to highlight that our findings may also lead to theoretical advances. For instance, one may consider interpolating between the theoretical optimum and learned diffusion models to observe memorization and generalization. More broadly, since the objective of denoising score matching can be transformed into fitting the theoretical optimum, memorization and generative modeling can be understood from the perspective of supervised learning.