# Reviewer p3eW

Thank you for your constructive feedback and valuable questions.

---

***Weakness 1: Isolation of analysis of memorization and generalization or image quality.***

Thank you for pointing this out. We would like to clarify the primary objective of our work: investigating under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model is of no use to practitioners (as also mentioned in the review) and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours.
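For context, the "optimal solution" throughout this response is the closed-form minimizer of the denoising score matching objective over the empirical training distribution. In schematic notation (a sketch only; $s_t$, $\sigma_t$, and the Gaussian perturbation kernel are the standard ingredients, and Eq. (2) of the paper may be parameterized differently), the optimal score at noise level $t$ for a training set $\mathcal{D}=\{\mathbf{x}_i\}_{i=1}^{N}$ is

$$
\mathbf{s}^{*}(\mathbf{x}_t, t) \;=\; \nabla_{\mathbf{x}_t} \log \frac{1}{N}\sum_{i=1}^{N} \mathcal{N}\!\big(\mathbf{x}_t;\; s_t\,\mathbf{x}_i,\; \sigma_t^{2}\mathbf{I}\big),
$$

a mixture over the training points only; sampling its reverse process collapses onto $\{\mathbf{x}_i\}$ as $\sigma_t \to 0$, which is why this optimum can only replicate training data.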
Furthermore, studying memorization also helps the understanding of generalization. A recent work [1] suggests that diffusion models that memorize tend to generalize worse; an increase in the memorization ratio of a diffusion model is therefore expected to imply a diminution of its generalization capability.

As for image quality, it is interleaved with memorization. When a significant proportion of generated samples are replicas of the training data, their image quality is inherently high. Moreover, this yields a generation distribution close to that of the training data, resulting in a low FID score. For instance, the optimal diffusion model, which can only replicate training data (and thus exhibits a memorization ratio of 100%), achieves an FID score of 0.56, substantially lower than the 1.96 attained by the state-of-the-art unconditional diffusion model, EDM. We therefore do not emphasize image quality metrics in our experiments.

---

***Weakness 2 and 3: Extension of conclusions on small-scale datasets to real scenarios.***

As elucidated above, our objective is to gauge the extent of the theoretical hazard in typical diffusion model training scenarios, rather than to understand the memorization behaviors of diffusion models trained on millions of images. Through our extensive experiments, we find that the EMMs of common diffusion model training recipes are generally small, which explains why diffusion models in real scenarios exhibit low memorization ratios.

In addition to CIFAR-10, we have conducted a series of additional experiments using the FFHQ dataset [2], a higher-dimensional face dataset. These new experiments have been included in **Appendix D: More empirical results on FFHQ** of our revised paper. Due to time constraints, the additional experiments focus on the impact of data dimension, time embeddings, and conditioning on the memorization of diffusion models. Notably, the outcomes of these new experiments are in alignment with our initial findings on CIFAR-10.

---

***Weakness 4: Lack of analysis for specific factors.***

Our work was developed as comprehensive, empirical guidance on how factors spanning data distribution, model configuration, training procedure, and conditioning affect memorization in diffusion models. Our findings aim to delineate which factors have a substantial impact on memorization and which contribute more subtly. Therefore, our paper is positioned as an empirical analysis rather than a theoretical analysis of how different factors interact. Additionally, we report several surprising findings, e.g., the significant effect of random labels, which may inspire theoretical researchers to explore further.

---

**References:**

[1] TaeHo Yoon, Joo Young Choi, Sehyun Kwon, and Ernest K. Ryu. Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.\
[2] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

# Reviewer sac3

Thank you for your constructive feedback and valuable questions.

---

***Weakness & Question 2: Lack of detailed comparison with [1] [2] [3].***

Thank you for the question. First, the foundational motivation of our research diverges significantly from that of [1], [2], and [3]. We aim to investigate under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model can only replicate training data and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours.

Second, our paper conducts a comprehensive, quantitative examination of the various factors influencing memorization in diffusion models, spanning data distribution, model configuration, training configuration, and conditioning. In contrast, [1] and [3] primarily focus on demonstrating that diffusion models may replicate training data and on proposing frameworks for detecting or extracting such replication. [2] investigates the impact of various factors on memorization in text-to-image diffusion models, particularly those fine-tuned on new datasets. Compared with [2], our experiments predominantly engage with unconditional diffusion models trained from scratch, and the variables we examine differ from those in [2]. Our findings offer new insights, e.g., on time embeddings, random conditions, and skip connections, into how these factors affect memorization in diffusion models, as detailed in our paper.

We would also like to clarify how our research differs from [1] regarding dataset size and from [2] regarding text conditioning and dataset complexity.

- In [1], the authors conducted a comparative analysis of the memorization tendencies of diffusion models trained on datasets of varying sizes (specifically, 300 versus 3,000 samples). In contrast, our research determines when diffusion models memorize in terms of a novel metric, EMM. This specific training data size reflects the capacity of the model, the algorithm, etc., and discloses the interactions among different factors. Additionally, we monitor memorization ratios throughout the training procedure and show that memorization becomes apparent after a sufficiently extended training duration, particularly when the training data size is sufficiently small.
- The authors of [2] undertook a comparative analysis of diffusion models conditioned on various types of captions. Our research, however, makes the notable discovery that random conditions can effectively induce memorization in class-conditioned diffusion models. We also find that the number of classes plays an important role in memorization. In terms of dataset complexity, [2] compared models trained on LAION-10k and Imagenette, attributing the higher memorization observed in the latter to the structural complexity of its images; there, dataset complexity is assessed at the instance level. In our experiments, by contrast, we meticulously constructed a series of training datasets, each varying in either the number of classes or the intra-class diversity while keeping the other factor constant. In our research, the number of classes and intra-class diversity serve as population-level measures of dataset complexity.

Finally, we introduce a novel metric for memorization, EMM, designed to determine the conditions under which trained diffusion models exhibit memorization behaviors akin to those of the optimal solution.

---

***Question 1: Extension of conclusions on small-scale datasets to real scenarios.***

First, we have run a series of new experiments on the FFHQ dataset [4] during the rebuttal period, which are included in **Appendix D: More empirical results on FFHQ** of the revised paper. The new experimental results support our findings on CIFAR-10. Second, we would like to emphasize that our objective is to gauge the extent of the theoretical hazard in typical diffusion model training scenarios, rather than to understand the memorization behaviors of diffusion models trained on large data. When diffusion models are trained on large amounts of data, they generally do not memorize training data in a pixel-by-pixel manner. This also aligns with the findings in [3]: the authors report (Table 1) that only 200-300 training images out of 1 million generations sampled from DDPM [5] and its variant [6] were extracted successfully, even though both models were trained on a dataset of only 50k CIFAR-10 images. Therefore, much larger training data sizes are out of our research scope.

---

**References:**

[1] Somepalli, Gowthami, et al. "Diffusion art or digital forgery? Investigating data replication in diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.\
[2] Somepalli, Gowthami, et al. "Understanding and mitigating copying in diffusion models." arXiv preprint arXiv:2305.20086 (2023).\
[3] Carlini, Nicolas, et al. "Extracting training data from diffusion models." 32nd USENIX Security Symposium (USENIX Security 23). 2023.\
[4] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.\
[5] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.\
[6] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning (ICML), pp. 8162–8171. PMLR, 2021.

## Second round

Thank you for highlighting the differences between your work and previous work. I understand that this paper introduces a new metric called EMM and that all discussions are centred around it.
But it seems that the new metric has limited application, because memorization in diffusion models, under the definition used in the work, moves towards zero for any reasonable setting expected in the real world. As dataset sizes and image resolutions go up, memorization starts decreasing at a very small scale. The setting of having completely random text conditioning is also not very real-world. Can the authors think of any other applications and benefits of the new metric?

Response: Thank you for your feedback. We would like to highlight that this research engages in a comprehensive empirical analysis in which EMM serves as a metric for evaluating pixel-to-pixel memorization and is instrumental in comparing various experimental settings, including data distribution, model configuration, training procedure, and conditioning. Since our focus is to gauge the extent (in terms of EMM) of the theoretical hazard in typical diffusion model training scenarios, the broader applicability of EMM is not within the immediate scope of this study.

We have extended our study to fine-tuning stable diffusion [a] on ArtBench-10 [b], a high-dimensional art dataset with image resolution $256\times256$. This is a common real-world setting, as stable diffusion was trained on billions of images and practitioners often fine-tune it on small customized datasets. The results have been included in **Appendix F: More empirical results on stable diffusion** in our revised paper. Due to the time limit, we consider training data sizes $|\mathcal{D}|=100, 200, 500$ and two kinds of conditioning, named "plain" and "class". "Plain" conditioning labels every image with the same text prompt, "a painting", while "class" conditioning adds the class of artistic style to the prompt, e.g., "a realism painting". The former is similar to the case of $C=1$ in a class-conditioned diffusion model, while the latter is similar to the case of $C=10$. We observe that stable diffusion reaches a memorization ratio of about $30$\% for $|\mathcal{D}|=500$ with "class" conditioning, indicating that the memorization ratio can still be high in real-world scenarios with large dataset sizes and resolutions. Additionally, we find that the memorization ratio drops as $|\mathcal{D}|$ increases, and that the memorization ratio under "class" conditioning is significantly larger than under "plain" conditioning. These results are consistent with our previous experiments on CIFAR-10 and FFHQ.

Regarding the "setting of having completely random text conditioning", we would like to first clarify that these experiments were conducted on class-conditioned diffusion models rather than text-to-image diffusion models; we therefore employed **random class conditioning**, not random text conditioning. The results demonstrate that diffusion models can maintain a consistent level of memorization even with random labels, provided the number of classes $C$ remains constant. Additionally, an increase in $C$ substantially enhances memorization. Although random class conditioning may not directly mirror real-world scenarios, our results are counter-intuitive and deepen the understanding of the intrinsic nature of memorization in diffusion models. This is similar to [c], where the authors demonstrated that discriminative models can memorize training data even with random labels.
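As a concrete illustration of this setup (a minimal sketch with illustrative names, not the exact experimental code), random class conditioning keeps every image's pixels but replaces its label with one drawn uniformly from the $C$ classes:

```python
import torch

def random_class_labels(num_images: int, num_classes: int, seed: int = 0) -> torch.Tensor:
    """Assign each training image a label drawn uniformly from C classes,
    independent of image content (the 'random class conditioning' setting)."""
    gen = torch.Generator().manual_seed(seed)
    return torch.randint(0, num_classes, (num_images,), generator=gen)

# e.g., 1k training images with C = 10 random classes
labels = random_class_labels(num_images=1000, num_classes=10)
```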
In response to your question concerning other applications and benefits of EMM, we elaborate on its potential for assessing memorization risk. In our work, a strict threshold $\epsilon=0.1$ was chosen to ensure proximity to the optimum, yielding a relatively modest EMM. However, adjusting $\epsilon$ to a more relaxed value could extend the utility of EMM. For example, setting $\epsilon=0.99$ turns EMM into the data size at which the diffusion model's memorization ratio equals $1$\%, which is more practical in scenarios involving large data sizes and high image resolutions. This is left to future research, as our current focus is on memorization behavior akin to that of the optimal model.
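To make the role of $\epsilon$ concrete, the definition sketched above can be written as (our notation here is illustrative; see the paper for the formal definition)

$$
\mathrm{EMM}(\epsilon) \;=\; \max\big\{\, |\mathcal{D}| \;:\; \mathrm{MemRatio}\big(\theta^{*}(\mathcal{D})\big) \,\ge\, 1-\epsilon \,\big\},
$$

where $\theta^{*}(\mathcal{D})$ denotes the model trained on $\mathcal{D}$ under a fixed recipe. With $\epsilon=0.1$ this demands a $90$\% memorization ratio, while $\epsilon=0.99$ relaxes the requirement to $1$\%.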
Regarding "setting of having completely random text conditioning", we would like to first clarify that these were conducted on class-conditioned diffusion models instead of text-to-image diffusion models. Therefore, we employed **random class conditioning** instead of "random text conditioning" in our experiments. The results demonstrate that diffusion models can maintain a consistent level of memorization even with random labels, provided the number of classes $C$ remains constant. Additionally, an increase in $C$ substantially enhance the memorization of diffusion models. Although random class conditioning may not directly mirror real-world scenarios, our results are counter-intuitive and deepen the understanding of intrinsic nature of memorization in diffusion models. This is similar to [a] where the authors demonstrated that discriminative models could memorize training data even with random labels. In response to your question concerning other applications and benefits of EMM, we elaborate its potential in assessing memorization risk. In our work, a strict threshold $\epsilon=0.1$ was chosen to ensure proximity to the optimum, yielding a relatively modest EMM. However, adjusting $\epsilon$ to a more relaxed value could extend the utility of EMM. For example, setting $\epsilon=0.99$ transforms the current EMM into a metric corresponding to the data size where the diffusion model's memorization ratio is at $1$\%, which will be more practical in scenarios involving large data sizes and high image resolutions. This is left to future research, as our current focus is on the memorization behavior akin to the optimal model. Please kindly let us know if there is any further concern, and we will do our best to respond. --- **References:** [a] Zhang, Chiyuan, et al. "Understanding deep learning (still) requires rethinking generalization." Communications of the ACM 64.3 (2021): 107-115.\ [b]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.\ [c] Peiyuan Liao, Xiuyu Li, Xihui Liu, and Kurt Keutzer. The artbench dataset: Benchmarking generative models with artworks. arXiv preprint arXiv:2206.11404, 2022. --> <!-- In contexts of large data sizes and resolutions, diffusion models typically do not exhibit pixel-by-pixel memorization of training data. This finding is consistent with previous research [5] indicating that very few training images can be extracted in DDPM and its variant. Therefore, training datasets with substantial sizes and resolutions fall outside the scope of this study, which concentrates on approximating the theoretical optimum. --> <!-- Through our experiments, we would like to deepen the understanding of memorization in diffusion models and delineate which factors have a substantial impact on memorization and which contribute more subtly. --> --- Thank the authors for the responses. My concerns are not addressed. SOTA text-to-image models are not considered. And no clear and promising mitigation strategies or guidelines are provided. Therefore, I would like to maintain my rating. --- Response: Thank you for your feedback. We have extended to fine-tuning stable diffusion [b] on Artbench-10 [c], a high dimensional art dataset with image resolution of $256\times256$. We have explored the effects of training data size and conditioning on memorization. 
The results have been included in **Appendix F: More empirical results on stable diffusion** in our revised paper. We note that the observations align with our previous experiments on CIFAR-10 and FFHQ.

In response to the inquiry on mitigation strategies: based on our empirical analysis, we suggest adopting random Fourier features as the time embedding in the DDPM++ architecture, or incorporating weight decay or a smaller batch size during training, as these choices typically yield smaller EMMs.
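For reference, a minimal PyTorch sketch of such a random Fourier feature time embedding, in the spirit of the Gaussian Fourier projection used in NCSN++/DDPM++ (hyperparameters such as `embed_dim` and `scale` here are illustrative, not the paper's exact settings):

```python
import math
import torch
import torch.nn as nn

class FourierTimeEmbedding(nn.Module):
    """Random Fourier feature embedding of the diffusion time/noise level.

    The random frequencies W are sampled once at initialization and kept
    fixed (not trained), following the Gaussian Fourier projection design.
    """
    def __init__(self, embed_dim: int = 256, scale: float = 16.0):
        super().__init__()
        # register_buffer: saved with the model but excluded from gradients
        self.register_buffer("W", torch.randn(embed_dim // 2) * scale)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch,) scalar times -> (batch, embed_dim) sin/cos features
        proj = 2.0 * math.pi * t[:, None] * self.W[None, :]
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# usage: emb = FourierTimeEmbedding(256)(torch.rand(8))  # -> (8, 256)
```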
Please kindly let us know if there is any further concern, and we will do our best to respond.

---

Thank you for your suggestion regarding stable diffusion. We have extended our study to fine-tuning stable diffusion on ArtBench-10, a high-dimensional art dataset with image resolution $256\times256$, and have explored the effects of training data size and conditioning on memorization. The results have been included in **Appendix F: More empirical results on stable diffusion** in our revised paper. We note that the observations align with our previous experiments on CIFAR-10 and FFHQ.

---

Dear Reviewers,

We have included new experimental results on fine-tuning stable diffusion on the ArtBench-10 dataset in **Appendix F: More empirical results on stable diffusion** in our revised paper. The new experimental results align with our previous experiments on CIFAR-10 and FFHQ. We sincerely hope these new experiments help address your concerns.

Best,
The Authors

# Reviewer QP3y

Thank you for your supportive feedback and valuable questions.

---

***Weakness 1 and 2: Lack of experiments or analysis on other datasets.***

We have conducted a series of new experiments on the FFHQ dataset [6], a higher-dimensional face dataset. These experiments are included in **Appendix D: More empirical results on FFHQ** of the updated paper. Given the time limit, the investigation primarily focuses on revisiting the effects of data dimension, time embeddings, and conditioning on the memorization of diffusion models. Notably, the outcomes of these new experiments corroborate the findings previously observed on CIFAR-10.

Our objective is to determine the specific training size at which diffusion models demonstrate memorization behaviors similar to those of the theoretical optimum; this specific training size is defined as EMM in our paper. Dataset duplication, however, would introduce ambiguity into this definition.

---

***Weakness 3: Lack of analysis on other metrics related to generation, e.g., the effects of weight decay on quality metrics.***

We would like to clarify the primary objective of our work: investigating under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model can only replicate training data and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours. Consequently, the memorization metric is the focal point of our investigation.

The FID score is conventionally employed to assess the quality and diversity of generations. In the context of our research, however, a low FID score is expected whenever diffusion models extensively memorize training data, because a large number of training data replicas among the generated samples naturally yields a generation distribution close to the training distribution. Furthermore, image quality metrics are also expected to improve, given that replicas of training data are typically of high quality. It is crucial to note that introducing a weight decay greater than zero alters the training objective, leading to a divergence from the original theoretical optimum, which can only replicate training data. This divergence becomes apparent for large weight decay.
However, as mentioned above, the primary focus of this study is not these other metrics but rather the memorization aspect.

---

***Weakness 4: "Several results presented in this paper, especially regarding dataset and model complexity …"***

The relationship between learning outcomes and the complexity of data and models is a topic of enduring interest in the machine learning community. Nevertheless, the literature has not adequately elucidated the relationship between memorization in diffusion models and its various influencing factors. Our study addresses this gap via a thorough analysis of the roles of data distribution, model configuration, training procedure, and conditioning. Moreover, the motivations underpinning our research diverge significantly from those of previous studies, particularly [1] and [2]. While [1] delved into memorization in GANs and [2] investigated similar phenomena in discriminative models, our focus is distinctly on diffusion models. Unlike GANs and discriminative models, which possess infinitely many optimal solutions, diffusion models are characterized by a closed-form optimal solution that exclusively memorizes training data without generalization. This distinct attribute propels our inquiry into the memorization discrepancies between trained diffusion models and their theoretical optimal solution. In contrast to [1], which concentrated on dataset size and complexity, our experimental framework extends to the effects of data distribution, model configuration, training procedure, and conditioning.

---

***Question 1: More explanations on the selected memorization ratio.***

Thank you for your valuable suggestion. The Euclidean $l_2$ distance between a generated image and its nearest training sample (the KNN distance) was used in [5] as a measure of image memorization. In our preliminary experiments, this $l_2$ distance was also employed as a metric for memorization, and our findings are consistent whether we use the raw $l_2$ distance or the $l_2$ distance ratio detailed in our paper. The factor of $\frac{1}{3}$, adopted from [7], was identified by its authors as a threshold that correlates closely with human perceptual recognition of memorization. We acknowledge that determining an exact threshold that clearly differentiates memorized from non-memorized generations is a complex challenge. Consequently, we have incorporated your suggestion and present experimental results using an alternative memorization metric for cross-validation. As updated in **Appendix E: Discussions on memorization criteria** in the revised paper, we re-evaluate the results of the main paper using the above $l_2$ distance as the memorization metric. Lower KNN distances indicate a higher propensity of diffusion models to memorize training data, and these new results are in alignment with our original conclusions under the memorization ratio metric.
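For clarity, a minimal sketch of the $l_2$ distance-ratio criterion as we understand it from [7] (brute force and with illustrative names; the paper states the exact criterion):

```python
import numpy as np

def memorization_ratio(generated: np.ndarray, train: np.ndarray, tau: float = 1/3) -> float:
    """Fraction of generated samples counted as memorized: a sample is
    'memorized' if its l2 distance to the nearest training image is less
    than tau times its distance to the second-nearest training image.

    generated: (m, d) flattened generated images; train: (n, d) flattened
    training images, n >= 2. Brute force for clarity; use approximate
    nearest-neighbor search at scale.
    """
    memorized = 0
    for x in generated:
        dists = np.sort(np.linalg.norm(train - x, axis=1))
        if dists[0] < tau * dists[1]:
            memorized += 1
    return memorized / len(generated)
```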
---

***Question 2: More explanations on experimental results regarding EMA and weight decay.***

The primary aim of our study is to systematically analyze the influence of various factors (data distribution, model configuration, training procedure, and conditioning) on the value of EMM. Our exploration of EMA was motivated by its substantial impact on the FID score and overall image quality, which prompted an investigation into whether it exerts a similarly considerable effect on memorization. Our empirical findings reveal that while EMA substantially influences image quality, its effect on memorization is marginal, a conclusion that is not trivial. Our exploration of weight decay was motivated by the fact that introducing weight decay alters the training objective of diffusion models; consequently, the optimal solution also differs from Eq. (2) of our main paper. This alteration raises the question of how weight decay affects the deviation of trained diffusion models from the optimal solution. Our experiments show that with a small weight decay, diffusion models are still capable of demonstrating memorization behaviors.
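Schematically, with $\gamma$ denoting the weight decay coefficient (the remaining notation is illustrative), the altered objective reads

$$
\min_{\theta}\; \mathbb{E}_{t,\,\mathbf{x}_0,\,\mathbf{x}_t}\Big[\lambda(t)\,\big\|\mathbf{s}_{\theta}(\mathbf{x}_t, t)-\nabla_{\mathbf{x}_t}\log q_t(\mathbf{x}_t \mid \mathbf{x}_0)\big\|_2^2\Big] \;+\; \gamma\,\|\theta\|_2^2,
$$

so for any $\gamma>0$ the minimizer no longer coincides with the closed-form optimum of Eq. (2), and the gap grows with $\gamma$.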
---

***Question 3: New explorations on noise schedule.***

Thank you for your suggestion. We have prioritized experiments involving the FFHQ dataset. Nonetheless, if time allows, we will endeavor to investigate noise schedules in diffusion models and include the results in the revised version of our paper.

---

***Minor question 1: Better demonstrations/visualizations regarding skip connection results.***

Your feedback is highly appreciated. We have amended the figures in our revised paper to enhance clarity and better illustrate the results.

---

***Minor question 2: Better word descriptions on memorization ratio.***

Thank you for your suggestion. We have incorporated detailed descriptions of the memorization metric we used into the revised version of our paper.

---

**References:**

[1] Feng, Qianli, et al. "When do GANs replicate? On the choice of dataset size." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.\
[2] Zhang, Chiyuan, et al. "Understanding deep learning (still) requires rethinking generalization." Communications of the ACM 64.3 (2021): 107-115.\
[3] Somepalli, Gowthami, et al. "Diffusion art or digital forgery? Investigating data replication in diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.\
[4] Somepalli, Gowthami, et al. "Understanding and mitigating copying in diffusion models." arXiv preprint arXiv:2305.20086 (2023).\
[5] Carlini, Nicolas, et al. "Extracting training data from diffusion models." 32nd USENIX Security Symposium (USENIX Security 23). 2023.\
[6] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.\
[7] TaeHo Yoon, Joo Young Choi, Sehyun Kwon, and Ernest K. Ryu. Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.

# Reviewer MQLD

Thank you for your constructive feedback and valuable questions.

---

***Weakness: Lack of experiments on stable diffusion.***

We would like to clarify the primary objective of our work: investigating under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model can only replicate training data and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours.

To advance this research, we trained hundreds of EDMs from scratch; it would be computationally intractable to train a similarly large number of stable diffusion models. Notably, text-to-image diffusion models, which also follow the denoising score matching objective, possess a similar theoretical optimum and are likewise expected to exhibit substantial memorization of training data when the EMM condition is met. As updated in **Appendix D: More empirical results on FFHQ** and **Appendix E: Discussions on memorization criteria** of the revised paper, our new investigations reveal consistent findings when the dataset is changed to FFHQ [1] or the memorization ratio is replaced with the KNN distance. Consequently, we conjecture that similar memorization behaviors are likely observable in state-of-the-art text-to-image diffusion models.

---

***Question: Any strategies to mitigate memorization?***

Thank you for your inquiry. Our experiments offer empirical, quantitative guidance for practitioners who aim to train diffusion models from scratch while preventing heavy memorization. Generally, it is advisable to select a training recipe (encompassing data distribution, model configuration, training procedure, and conditioning) that exhibits a lower EMM while other performance metrics are comparably maintained, since such a recipe carries a lower risk of memorization.

---

**Reference:**

[1] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

# Reviewer gEHa

Thank you for your constructive feedback and valuable questions.

---

***Weakness 1: Limited scientific contributions.***

We would like to clarify the primary objective of our work: investigating under what conditions diffusion models (undesirably) approximate the optimal solution represented by Eq. (2). On the one hand, such an optimal model can only replicate training data and may even lead to privacy/copyright issues. On the other hand, such an optimal model is exactly what is theoretically expected. It is therefore imperative to gauge the extent of this theoretical hazard in typical diffusion model training scenarios. This awareness is vital for mitigating adverse consequences and refining the practical utility of these models, and it calls for quantitative studies such as ours.

Furthermore, our research offers extensive empirical guidance for understanding memorization in diffusion models and for training diffusion models while preventing heavy memorization. Our findings aim to delineate which factors have a substantial impact on memorization and which contribute more subtly. As such, our paper primarily presents an empirical analysis, rather than a theoretical one, of the interrelationships between these factors. In addition, our study reveals surprising findings, e.g., the significant effect of random labels, which may inspire theoretical researchers to explore further.
---

***Weakness 2: Only pixel-by-pixel memorization is explored.***

As presented in the introduction and the Appendix, we establish through both empirical and theoretical analysis that the optimal solution of denoising score matching for diffusion models memorizes training data on a pixel-by-pixel basis. Since our motivation is to gauge the extent of this theoretical hazard in typical diffusion model training scenarios, the same pixel-by-pixel memorization is our research focus.

---

***Weakness 3: Lack of experiments on other datasets.***

Thank you for your valuable suggestions. In addition to CIFAR-10, we have conducted a series of additional experiments using the FFHQ dataset [1], a higher-dimensional face dataset similar to CelebA. These new experiments are included in **Appendix D: More empirical results on FFHQ** of our revised paper. Due to time constraints, the additional experiments focus on the impact of data dimension, time embeddings, and conditioning on the memorization of diffusion models. Notably, the outcomes of these new experiments support our initial findings on CIFAR-10.

---

***Question 1: Whether statistical significance matters?***

In our experiments, we observed that the variance of memorization ratios across repeated trials is negligible. For instance, as depicted in Figure 2(b), with a training data size of $|\mathcal{D}|=1$k and two classes ($C=2$), the memorization ratio of the trained diffusion model is $94.59\pm0.19$\% over three random seeds. Similarly, for $C=5$, the memorization ratio is $92.32\pm0.14$\%. Throughout our research, we trained several hundred diffusion models, requiring substantial GPU hours, so it is impractical to repeat every experiment with multiple random seeds. Given that the observed variance in memorization ratios is minor (less than $0.2$\%), we postulate that this minimal fluctuation is unlikely to affect statistical significance or alter the overarching conclusions of our study.

---

***Question 2: Model performance using large weight decays? Tradeoff between generation quality and memorization.***

We note that model performance also deteriorates when a large weight decay is applied during training. Regarding the tradeoff you mention, we offer further clarification on the relationship between the quality of generated samples and memorization. When a significant proportion of generated samples are replicas of the training data, their image quality is inherently high. Moreover, this yields a generation distribution close to that of the training data, resulting in a low FID score (the FID score is conventionally used to evaluate the quality and diversity of generated images). For instance, the optimal diffusion model, which can only replicate training data (and thus exhibits a memorization ratio of 100%), achieves an FID score of 0.56, substantially lower than the 1.96 attained by the state-of-the-art unconditional diffusion model, EDM. Given that the objective of our research is to explore when diffusion models memorize in terms of our definition of EMM for a training recipe, we consistently employ a memorization metric throughout our study.

---

**Reference:**

[1] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
