# Reviewer p3eW

Thank you for your constructive feedback and valuable questions.

---

***Weakness 1: Isolation of analysis of memorization and generalization or image quality.***

Thank you for pointing this out. We would like to clarify the primary objective of our work: investigating under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model is of no use to practitioners (as also mentioned in the review) and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours.
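For context, the "optimal solution" throughout this response is the closed-form minimizer of the denoising score matching objective over the empirical training distribution. In schematic notation (a sketch only; $s_t$, $\sigma_t$, and the Gaussian perturbation kernel are the standard ingredients, and Eq. (2) of the paper may be parameterized differently), the optimal score at noise level $t$ for a training set $\mathcal{D}=\{\mathbf{x}_i\}_{i=1}^{N}$ is

$$
\mathbf{s}^{*}(\mathbf{x}_t, t) \;=\; \nabla_{\mathbf{x}_t} \log \frac{1}{N}\sum_{i=1}^{N} \mathcal{N}\!\big(\mathbf{x}_t;\; s_t\,\mathbf{x}_i,\; \sigma_t^{2}\mathbf{I}\big),
$$

a mixture over the training points only; sampling its reverse process collapses onto $\{\mathbf{x}_i\}$ as $\sigma_t \to 0$, which is why this optimum can only replicate training data.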
Furthermore, studying memorization also helps the understanding of generalization. A recent work [1] suggests that diffusion models that memorize tend to generalize worse; an increase in the memorization ratio of a diffusion model is therefore expected to imply a diminution of its generalization capability.

As for image quality, it is interleaved with memorization. When a significant proportion of generated samples are replicas of the training data, their image quality is inherently high. Moreover, this yields a generation distribution close to that of the training data, resulting in a low FID score. For instance, the optimal diffusion model, which can only replicate training data (and thus exhibits a memorization ratio of 100%), achieves an FID score of 0.56, substantially lower than the 1.96 attained by the state-of-the-art unconditional diffusion model, EDM. We therefore do not emphasize image quality metrics in our experiments.

---

***Weakness 2 and 3: Extension of conclusions on small-scale datasets to real scenarios.***

As elucidated above, our objective is to gauge the extent of the theoretical hazard in typical diffusion model training scenarios, rather than to understand the memorization behaviors of diffusion models trained on millions of images. Through our extensive experiments, we find that the EMMs of common diffusion model training recipes are generally small, which explains why diffusion models in real scenarios exhibit low memorization ratios.

In addition to CIFAR-10, we have conducted a series of additional experiments using the FFHQ dataset [2], a higher-dimensional face dataset. These new experiments have been included in **Appendix D: More empirical results on FFHQ** of our revised paper. Due to time constraints, the additional experiments focus on the impact of data dimension, time embeddings, and conditioning on the memorization of diffusion models. Notably, the outcomes of these new experiments are in alignment with our initial findings on CIFAR-10.

---

***Weakness 4: Lack of analysis for specific factors.***

Our work was developed as comprehensive, empirical guidance on how factors spanning data distribution, model configuration, training procedure, and conditioning affect memorization in diffusion models. Our findings aim to delineate which factors have a substantial impact on memorization and which contribute more subtly. Therefore, our paper is positioned as an empirical analysis rather than a theoretical analysis of how different factors interact. Additionally, we report several surprising findings, e.g., the significant effect of random labels, which may inspire theoretical researchers to explore further.

---

**References:**

[1] TaeHo Yoon, Joo Young Choi, Sehyun Kwon, and Ernest K. Ryu. Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.\
[2] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

# Reviewer sac3

Thank you for your constructive feedback and valuable questions.

---

***Weakness & Question 2: Lack of detailed comparison with [1] [2] [3].***

Thank you for the question. First, the foundational motivation of our research diverges significantly from that of [1], [2], and [3]. We aim to investigate under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model can only replicate training data and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours.

Second, our paper conducts a comprehensive, quantitative examination of the various factors influencing memorization in diffusion models, spanning data distribution, model configuration, training configuration, and conditioning. In contrast, [1] and [3] primarily focus on demonstrating that diffusion models may replicate training data and on proposing frameworks for detecting or extracting such replication. [2] investigates the impact of various factors on memorization in text-to-image diffusion models, particularly those fine-tuned on new datasets. Compared with [2], our experiments predominantly engage with unconditional diffusion models trained from scratch, and the variables we examine differ from those in [2]. Our findings offer new insights, e.g., on time embeddings, random conditions, and skip connections, into how these factors affect memorization in diffusion models, as detailed in our paper.

We would also like to clarify how our research differs from [1] regarding dataset size and from [2] regarding text conditioning and dataset complexity.

- In [1], the authors conducted a comparative analysis of the memorization tendencies of diffusion models trained on datasets of varying sizes (specifically, 300 versus 3,000 samples). In contrast, our research determines when diffusion models memorize in terms of a novel metric, EMM. This specific training data size reflects the capacity of the model, the algorithm, etc., and discloses the interactions among different factors. Additionally, we monitor memorization ratios throughout the training procedure and show that memorization becomes apparent after a sufficiently extended training duration, particularly when the training data size is sufficiently small.
- The authors of [2] undertook a comparative analysis of diffusion models conditioned on various types of captions. Our research, however, makes the notable discovery that random conditions can effectively induce memorization in class-conditioned diffusion models. We also find that the number of classes plays an important role in memorization. In terms of dataset complexity, [2] compared models trained on LAION-10k and Imagenette, attributing the higher memorization observed in the latter to the structural complexity of its images; there, dataset complexity is assessed at the instance level. In our experiments, by contrast, we meticulously constructed a series of training datasets, each varying in either the number of classes or the intra-class diversity while keeping the other factor constant. In our research, the number of classes and intra-class diversity serve as population-level measures of dataset complexity.

Finally, we introduce a novel metric for memorization, EMM, designed to determine the conditions under which trained diffusion models exhibit memorization behaviors akin to those of the optimal solution.

---

***Question 1: Extension of conclusions on small-scale datasets to real scenarios.***

First, we have run a series of new experiments on the FFHQ dataset [4] during the rebuttal period, which are included in **Appendix D: More empirical results on FFHQ** of the revised paper. The new experimental results support our findings on CIFAR-10. Second, we would like to emphasize that our objective is to gauge the extent of the theoretical hazard in typical diffusion model training scenarios, rather than to understand the memorization behaviors of diffusion models trained on large data. When diffusion models are trained on large amounts of data, they generally do not memorize training data in a pixel-by-pixel manner. This also aligns with the findings in [3]: the authors report (Table 1) that only 200-300 training images out of 1 million generations sampled from DDPM [5] and its variant [6] were extracted successfully, even though both models were trained on a dataset of only 50k CIFAR-10 images. Therefore, much larger training data sizes are out of our research scope.

---

**References:**

[1] Somepalli, Gowthami, et al. "Diffusion art or digital forgery? Investigating data replication in diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.\
[2] Somepalli, Gowthami, et al. "Understanding and mitigating copying in diffusion models." arXiv preprint arXiv:2305.20086 (2023).\
[3] Carlini, Nicolas, et al. "Extracting training data from diffusion models." 32nd USENIX Security Symposium (USENIX Security 23). 2023.\
[4] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.\
[5] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.\
[6] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning (ICML), pp. 8162–8171. PMLR, 2021.

## Second round

Thank you for highlighting the differences between your work and previous work. I understand that this paper introduces a new metric called EMM and that all discussions are centred around it.
But it seems that the new metric has limited application, because memorization in diffusion models, under the definition used in the work, moves towards zero for any reasonable setting expected in the real world. As dataset sizes and image resolutions go up, memorization starts decreasing at a very small scale. The setting of having completely random text conditioning is also not very real-world. Can the authors think of any other applications and benefits of the new metric?

Response: Thank you for your feedback. We would like to highlight that this research engages in a comprehensive empirical analysis in which EMM serves as a metric for evaluating pixel-to-pixel memorization and is instrumental in comparing various experimental settings, including data distribution, model configuration, training procedure, and conditioning. Since our focus is to gauge the extent (in terms of EMM) of the theoretical hazard in typical diffusion model training scenarios, the broader applicability of EMM is not within the immediate scope of this study.

We have extended our study to fine-tuning stable diffusion [a] on ArtBench-10 [b], a high-dimensional art dataset with image resolution $256\times256$. This is a common real-world setting, as stable diffusion was trained on billions of images and practitioners often fine-tune it on small customized datasets. The results have been included in **Appendix F: More empirical results on stable diffusion** in our revised paper. Due to the time limit, we consider training data sizes $|\mathcal{D}|=100, 200, 500$ and two kinds of conditioning, named "plain" and "class". "Plain" conditioning labels every image with the same text prompt, "a painting", while "class" conditioning adds the class of artistic style to the prompt, e.g., "a realism painting". The former is similar to the case of $C=1$ in a class-conditioned diffusion model, while the latter is similar to the case of $C=10$. We observe that stable diffusion reaches a memorization ratio of about $30$\% for $|\mathcal{D}|=500$ with "class" conditioning, indicating that the memorization ratio can still be high in real-world scenarios with large dataset sizes and resolutions. Additionally, we find that the memorization ratio drops as $|\mathcal{D}|$ increases, and that the memorization ratio under "class" conditioning is significantly larger than under "plain" conditioning. These results are consistent with our previous experiments on CIFAR-10 and FFHQ.

Regarding the "setting of having completely random text conditioning", we would like to first clarify that these experiments were conducted on class-conditioned diffusion models rather than text-to-image diffusion models; we therefore employed **random class conditioning**, not random text conditioning. The results demonstrate that diffusion models can maintain a consistent level of memorization even with random labels, provided the number of classes $C$ remains constant. Additionally, an increase in $C$ substantially enhances memorization. Although random class conditioning may not directly mirror real-world scenarios, our results are counter-intuitive and deepen the understanding of the intrinsic nature of memorization in diffusion models. This is similar to [c], where the authors demonstrated that discriminative models can memorize training data even with random labels.
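As a concrete illustration of this setup (a minimal sketch with illustrative names, not the exact experimental code), random class conditioning keeps every image's pixels but replaces its label with one drawn uniformly from the $C$ classes:

```python
import torch

def random_class_labels(num_images: int, num_classes: int, seed: int = 0) -> torch.Tensor:
    """Assign each training image a label drawn uniformly from C classes,
    independent of image content (the 'random class conditioning' setting)."""
    gen = torch.Generator().manual_seed(seed)
    return torch.randint(0, num_classes, (num_images,), generator=gen)

# e.g., 1k training images with C = 10 random classes
labels = random_class_labels(num_images=1000, num_classes=10)
```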
In response to your question concerning other applications and benefits of EMM, we elaborate on its potential for assessing memorization risk. In our work, a strict threshold $\epsilon=0.1$ was chosen to ensure proximity to the optimum, yielding a relatively modest EMM. However, adjusting $\epsilon$ to a more relaxed value could extend the utility of EMM. For example, setting $\epsilon=0.99$ turns EMM into the data size at which the diffusion model's memorization ratio equals $1$\%, which is more practical in scenarios involving large data sizes and high image resolutions. This is left to future research, as our current focus is on memorization behavior akin to that of the optimal model.
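To make the role of $\epsilon$ concrete, the definition sketched above can be written as (our notation here is illustrative; see the paper for the formal definition)

$$
\mathrm{EMM}(\epsilon) \;=\; \max\big\{\, |\mathcal{D}| \;:\; \mathrm{MemRatio}\big(\theta^{*}(\mathcal{D})\big) \,\ge\, 1-\epsilon \,\big\},
$$

where $\theta^{*}(\mathcal{D})$ denotes the model trained on $\mathcal{D}$ under a fixed recipe. With $\epsilon=0.1$ this demands a $90$\% memorization ratio, while $\epsilon=0.99$ relaxes the requirement to $1$\%.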
Regarding "setting of having completely random text conditioning", we would like to first clarify that these were conducted on class-conditioned diffusion models instead of text-to-image diffusion models. Therefore, we employed **random class conditioning** instead of "random text conditioning" in our experiments. The results demonstrate that diffusion models can maintain a consistent level of memorization even with random labels, provided the number of classes $C$ remains constant. Additionally, an increase in $C$ substantially enhance the memorization of diffusion models. Although random class conditioning may not directly mirror real-world scenarios, our results are counter-intuitive and deepen the understanding of intrinsic nature of memorization in diffusion models. This is similar to [a] where the authors demonstrated that discriminative models could memorize training data even with random labels. In response to your question concerning other applications and benefits of EMM, we elaborate its potential in assessing memorization risk. In our work, a strict threshold $\epsilon=0.1$ was chosen to ensure proximity to the optimum, yielding a relatively modest EMM. However, adjusting $\epsilon$ to a more relaxed value could extend the utility of EMM. For example, setting $\epsilon=0.99$ transforms the current EMM into a metric corresponding to the data size where the diffusion model's memorization ratio is at $1$\%, which will be more practical in scenarios involving large data sizes and high image resolutions. This is left to future research, as our current focus is on the memorization behavior akin to the optimal model. Please kindly let us know if there is any further concern, and we will do our best to respond. --- **References:** [a] Zhang, Chiyuan, et al. "Understanding deep learning (still) requires rethinking generalization." Communications of the ACM 64.3 (2021): 107-115.\ [b]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.\ [c] Peiyuan Liao, Xiuyu Li, Xihui Liu, and Kurt Keutzer. The artbench dataset: Benchmarking generative models with artworks. arXiv preprint arXiv:2206.11404, 2022. --> <!-- In contexts of large data sizes and resolutions, diffusion models typically do not exhibit pixel-by-pixel memorization of training data. This finding is consistent with previous research [5] indicating that very few training images can be extracted in DDPM and its variant. Therefore, training datasets with substantial sizes and resolutions fall outside the scope of this study, which concentrates on approximating the theoretical optimum. --> <!-- Through our experiments, we would like to deepen the understanding of memorization in diffusion models and delineate which factors have a substantial impact on memorization and which contribute more subtly. --> --- Thank the authors for the responses. My concerns are not addressed. SOTA text-to-image models are not considered. And no clear and promising mitigation strategies or guidelines are provided. Therefore, I would like to maintain my rating. --- Response: Thank you for your feedback. We have extended to fine-tuning stable diffusion [b] on Artbench-10 [c], a high dimensional art dataset with image resolution of $256\times256$. We have explored the effects of training data size and conditioning on memorization. 
The results have been included in **Appendix F: More empirical results on stable diffusion** in our revised paper. We note that the observations align with our previous experiments on CIFAR-10 and FFHQ.

In response to the inquiry on mitigation strategies: based on our empirical analysis, we suggest adopting random Fourier features as the time embedding in the DDPM++ architecture, or incorporating weight decay or a smaller batch size during training, as these choices typically yield smaller EMMs.
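For reference, a minimal PyTorch sketch of such a random Fourier feature time embedding, in the spirit of the Gaussian Fourier projection used in NCSN++/DDPM++ (hyperparameters such as `embed_dim` and `scale` here are illustrative, not the paper's exact settings):

```python
import math
import torch
import torch.nn as nn

class FourierTimeEmbedding(nn.Module):
    """Random Fourier feature embedding of the diffusion time/noise level.

    The random frequencies W are sampled once at initialization and kept
    fixed (not trained), following the Gaussian Fourier projection design.
    """
    def __init__(self, embed_dim: int = 256, scale: float = 16.0):
        super().__init__()
        # register_buffer: saved with the model but excluded from gradients
        self.register_buffer("W", torch.randn(embed_dim // 2) * scale)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch,) scalar times -> (batch, embed_dim) sin/cos features
        proj = 2.0 * math.pi * t[:, None] * self.W[None, :]
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# usage: emb = FourierTimeEmbedding(256)(torch.rand(8))  # -> (8, 256)
```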
Please kindly let us know if there is any further concern, and we will do our best to respond.

---

Thank you for your suggestion regarding stable diffusion. We have extended our study to fine-tuning stable diffusion on ArtBench-10, a high-dimensional art dataset with image resolution $256\times256$, and have explored the effects of training data size and conditioning on memorization. The results have been included in **Appendix F: More empirical results on stable diffusion** in our revised paper. We note that the observations align with our previous experiments on CIFAR-10 and FFHQ.

---

Dear Reviewers,

We have included new experimental results on fine-tuning stable diffusion on the ArtBench-10 dataset in **Appendix F: More empirical results on stable diffusion** in our revised paper. The new experimental results align with our previous experiments on CIFAR-10 and FFHQ. We sincerely hope these new experiments help address your concerns.

Best,
The Authors

# Reviewer QP3y

Thank you for your supportive feedback and valuable questions.

---

***Weakness 1 and 2: Lack of experiments or analysis on other datasets.***

We have conducted a series of new experiments on the FFHQ dataset [6], a higher-dimensional face dataset. These experiments are included in **Appendix D: More empirical results on FFHQ** of the updated paper. Given the time limit, the investigation primarily focuses on revisiting the effects of data dimension, time embeddings, and conditioning on the memorization of diffusion models. Notably, the outcomes of these new experiments corroborate the findings previously observed on CIFAR-10.

Our objective is to determine the specific training size at which diffusion models demonstrate memorization behaviors similar to those of the theoretical optimum; this specific training size is defined as EMM in our paper. Dataset duplication, however, would introduce ambiguity into this definition.

---

***Weakness 3: Lack of analysis on other metrics related to generation, e.g., the effects of weight decay on quality metrics.***

We would like to clarify the primary objective of our work: investigating under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model can only replicate training data and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours. Consequently, the memorization metric is the focal point of our investigation.

The FID score is conventionally employed to assess the quality and diversity of generations. In the context of our research, however, a low FID score is expected whenever diffusion models extensively memorize training data, because a large number of training data replicas among the generated samples naturally yields a generation distribution close to the training distribution. Furthermore, image quality metrics are also expected to improve, given that replicas of training data are typically of high quality. It is crucial to note that introducing a weight decay greater than zero alters the training objective, leading to a divergence from the original theoretical optimum, which can only replicate training data. This divergence becomes apparent for large weight decay.
However, as mentioned above, the primary focus of this study is not these other metrics but rather the memorization aspect.

---

***Weakness 4: "Several results presented in this paper, especially regarding dataset and model complexity …"***

The relationship between learning outcomes and the complexity of data and models is a topic of enduring interest in the machine learning community. Nevertheless, the literature has not adequately elucidated the relationship between memorization in diffusion models and its various influencing factors. Our study addresses this gap via a thorough analysis of the roles of data distribution, model configuration, training procedure, and conditioning. Moreover, the motivations underpinning our research diverge significantly from those of previous studies, particularly [1] and [2]. While [1] delved into memorization in GANs and [2] investigated similar phenomena in discriminative models, our focus is distinctly on diffusion models. Unlike GANs and discriminative models, which possess infinitely many optimal solutions, diffusion models are characterized by a closed-form optimal solution that exclusively memorizes training data without generalization. This distinct attribute propels our inquiry into the memorization discrepancies between trained diffusion models and their theoretical optimal solution. In contrast to [1], which concentrated on dataset size and complexity, our experimental framework extends to the effects of data distribution, model configuration, training procedure, and conditioning.

---

***Question 1: More explanations on the selected memorization ratio.***

Thank you for your valuable suggestion. The Euclidean $l_2$ distance between a generated image and its nearest training sample (the KNN distance) was used in [5] as a measure of image memorization. In our preliminary experiments, this $l_2$ distance was also employed as a metric for memorization, and our findings are consistent whether we use the raw $l_2$ distance or the $l_2$ distance ratio detailed in our paper. The factor of $\frac{1}{3}$, adopted from [7], was identified by its authors as a threshold that correlates closely with human perceptual recognition of memorization. We acknowledge that determining an exact threshold that clearly differentiates memorized from non-memorized generations is a complex challenge. Consequently, we have incorporated your suggestion and present experimental results using an alternative memorization metric for cross-validation. As updated in **Appendix E: Discussions on memorization criteria** in the revised paper, we re-evaluate the results of the main paper using the above $l_2$ distance as the memorization metric. Lower KNN distances indicate a higher propensity of diffusion models to memorize training data, and these new results are in alignment with our original conclusions under the memorization ratio metric.
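For clarity, a minimal sketch of the $l_2$ distance-ratio criterion as we understand it from [7] (brute force and with illustrative names; the paper states the exact criterion):

```python
import numpy as np

def memorization_ratio(generated: np.ndarray, train: np.ndarray, tau: float = 1/3) -> float:
    """Fraction of generated samples counted as memorized: a sample is
    'memorized' if its l2 distance to the nearest training image is less
    than tau times its distance to the second-nearest training image.

    generated: (m, d) flattened generated images; train: (n, d) flattened
    training images, n >= 2. Brute force for clarity; use approximate
    nearest-neighbor search at scale.
    """
    memorized = 0
    for x in generated:
        dists = np.sort(np.linalg.norm(train - x, axis=1))
        if dists[0] < tau * dists[1]:
            memorized += 1
    return memorized / len(generated)
```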
---

***Question 2: More explanations on experimental results regarding EMA and weight decay.***

The primary aim of our study is to systematically analyze the influence of various factors (data distribution, model configuration, training procedure, and conditioning) on the value of EMM. Our exploration of EMA was motivated by its substantial impact on the FID score and overall image quality, which prompted an investigation into whether it exerts a similarly considerable effect on memorization. Our empirical findings reveal that while EMA substantially influences image quality, its effect on memorization is marginal, a conclusion that is not trivial. Our exploration of weight decay was motivated by the fact that introducing weight decay alters the training objective of diffusion models; consequently, the optimal solution also differs from Eq. (2) of our main paper. This alteration raises the question of how weight decay affects the deviation of trained diffusion models from the optimal solution. Our experiments show that with a small weight decay, diffusion models are still capable of demonstrating memorization behaviors.
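Schematically, with $\gamma$ denoting the weight decay coefficient (the remaining notation is illustrative), the altered objective reads

$$
\min_{\theta}\; \mathbb{E}_{t,\,\mathbf{x}_0,\,\mathbf{x}_t}\Big[\lambda(t)\,\big\|\mathbf{s}_{\theta}(\mathbf{x}_t, t)-\nabla_{\mathbf{x}_t}\log q_t(\mathbf{x}_t \mid \mathbf{x}_0)\big\|_2^2\Big] \;+\; \gamma\,\|\theta\|_2^2,
$$

so for any $\gamma>0$ the minimizer no longer coincides with the closed-form optimum of Eq. (2), and the gap grows with $\gamma$.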
---

***Question 3: New explorations on noise schedule.***

Thank you for your suggestion. We have prioritized experiments involving the FFHQ dataset. Nonetheless, if time allows, we will endeavor to investigate noise schedules in diffusion models and include the results in the revised version of our paper.

---

***Minor question 1: Better demonstrations/visualizations regarding skip connection results.***

Your feedback is highly appreciated. We have amended the figures in our revised paper to enhance clarity and better illustrate the results.

---

***Minor question 2: Better word descriptions on memorization ratio.***

Thank you for your suggestion. We have incorporated detailed descriptions of the memorization metric we used into the revised version of our paper.

---

**References:**

[1] Feng, Qianli, et al. "When do GANs replicate? On the choice of dataset size." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.\
[2] Zhang, Chiyuan, et al. "Understanding deep learning (still) requires rethinking generalization." Communications of the ACM 64.3 (2021): 107-115.\
[3] Somepalli, Gowthami, et al. "Diffusion art or digital forgery? Investigating data replication in diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.\
[4] Somepalli, Gowthami, et al. "Understanding and mitigating copying in diffusion models." arXiv preprint arXiv:2305.20086 (2023).\
[5] Carlini, Nicolas, et al. "Extracting training data from diffusion models." 32nd USENIX Security Symposium (USENIX Security 23). 2023.\
[6] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.\
[7] TaeHo Yoon, Joo Young Choi, Sehyun Kwon, and Ernest K. Ryu. Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.

# Reviewer MQLD

Thank you for your constructive feedback and valuable questions.

---

***Weakness: Lack of experiments on stable diffusion.***

We would like to clarify the primary objective of our work: investigating under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model can only replicate training data and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours.

To advance this research, we trained hundreds of EDMs from scratch; it would be computationally intractable to train a similarly large number of stable diffusion models. Notably, text-to-image diffusion models, which also follow the denoising score matching objective, possess a similar theoretical optimum and are likewise expected to exhibit substantial memorization of training data when the EMM condition is met. As updated in **Appendix D: More empirical results on FFHQ** and **Appendix E: Discussions on memorization criteria** of the revised paper, our new investigations reveal consistent findings when the dataset is changed to FFHQ [1] or the memorization ratio is replaced with the KNN distance. Consequently, we conjecture that similar memorization behaviors are likely observable in state-of-the-art text-to-image diffusion models.

---

***Question: Any strategies to mitigate memorization?***

Thank you for your inquiry. Our experiments offer empirical, quantitative guidance for practitioners who aim to train diffusion models from scratch while preventing heavy memorization. Generally, it is advisable to select a training recipe (encompassing data distribution, model configuration, training procedure, and conditioning) that exhibits a lower EMM while other performance metrics are comparably maintained, since such a recipe carries a lower risk of memorization.

---

**Reference:**

[1] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

# Reviewer gEHa

Thank you for your constructive feedback and valuable questions.

---

***Weakness 1: Limited scientific contributions.***

We would like to clarify the primary objective of our work: investigating under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model can only replicate training data and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours.

Furthermore, our research offers extensive empirical guidance for understanding memorization in diffusion models and for training diffusion models while preventing heavy memorization. Our findings aim to delineate which factors have a substantial impact on memorization and which contribute more subtly. As such, our paper primarily presents an empirical analysis, rather than a theoretical one, of the interrelationships between these factors. In addition, our study reveals surprising findings, e.g., the significant effect of random labels, which may inspire theoretical researchers to explore further.
---

***Weakness 2: Only pixel-by-pixel memorization is explored.***

As presented in the introduction and the Appendix, we establish through both empirical and theoretical analysis that the optimal solution of denoising score matching for diffusion models memorizes training data on a pixel-by-pixel basis. Since our motivation is to gauge the extent of this theoretical hazard in typical diffusion model training scenarios, the same pixel-by-pixel memorization is our research focus.

---

***Weakness 3: Lack of experiments on other datasets.***

Thank you for your valuable suggestions. In addition to CIFAR-10, we have conducted a series of additional experiments using the FFHQ dataset [1], a higher-dimensional face dataset similar to CelebA. These new experiments are included in **Appendix D: More empirical results on FFHQ** of our revised paper. Due to time constraints, the additional experiments focus on the impact of data dimension, time embeddings, and conditioning on the memorization of diffusion models. Notably, the outcomes of these new experiments support our initial findings on CIFAR-10.

---

***Question 1: Whether statistical significance matters?***

In our experiments, we observed that the variance of memorization ratios across repeated trials is negligible. For instance, as depicted in Figure 2(b), with a training data size of $|\mathcal{D}|=1$k and two classes ($C=2$), the memorization ratio of the trained diffusion model is $94.59\pm0.19$\% over three random seeds. Similarly, for $C=5$, the memorization ratio is $92.32\pm0.14$\%. Throughout our research, we trained several hundred diffusion models, requiring substantial GPU hours, so it is impractical to repeat every experiment with multiple random seeds. Given that the observed variance in memorization ratios is minor (less than $0.2$\%), we postulate that this minimal fluctuation is unlikely to affect statistical significance or alter the overarching conclusions of our study.

---

***Question 2: Model performance using large weight decays? Tradeoff between generation quality and memorization.***

We note that model performance also deteriorates when a large weight decay is applied during training. Regarding the tradeoff you mention, we offer further clarification on the relationship between the quality of generated samples and memorization. When a significant proportion of generated samples are replicas of the training data, their image quality is inherently high. Moreover, this yields a generation distribution close to that of the training data, resulting in a low FID score (the FID score is conventionally used to evaluate the quality and diversity of generated images). For instance, the optimal diffusion model, which can only replicate training data (and thus exhibits a memorization ratio of 100%), achieves an FID score of 0.56, substantially lower than the 1.96 attained by the state-of-the-art unconditional diffusion model, EDM. Given that the objective of our research is to explore when diffusion models memorize in terms of our definition of EMM for a training recipe, we consistently employ a memorization metric throughout our study.

---

**Reference:**

[1] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
