# Reviewer 7Cgs

## Summary:

This work is a thorough empirical study of when and how much diffusion models memorize. These experiments use CIFAR-10 for tractability reasons, but also verify some results on larger datasets (FFHQ and ArtBench). Equipped with a model-level metric for memorization, the authors study and analyze the effects of the data distribution (data dimensionality and diversity), model architecture (width, depth, time embedding), training (batch size, weight decay, EMA), and different types of conditioning. These experiments have unearthed some surprising results. For example: memorization does not scale monotonically with model depth; higher-resolution skip connections are more important for memorization than lower-resolution ones; and conditioning on random, unique class labels (one unique class per datapoint) triggers memorization.

## Strengths And Weaknesses:

This is a sound and well-written empirical paper. The experiments are exhaustive and can serve as a reference for others studying memorization. As a result, it represents an obvious contribution to the literature. In particular, the result around conditioning on random labels is interesting and surprising.

I do have some (addressable) concerns:

In my opinion, what this work mainly lacks is information about model performance (in the sense of sample quality) in all of their experiments. Performance is important in the context of memorization. In most use cases when deploying a diffusion model, practitioners need to choose their preferred point on the Pareto frontier of memorization versus performance. Upon discovering that some variable, such as the type of time embedding or model depth, affects memorization, the natural question to ask is whether this variable also affects performance, or whether it serendipitously improves the memorization-performance tradeoff. At the outset, the authors wave off metrics of image quality with the justification that image quality is close to optimal when memorization occurs. I am not so sure this is always true, nor is it a good reason to ignore sample quality metrics. Some of the types of models the authors are training (i.e., DDPMs on CIFAR-10 subsamples on the order of 0 to 10k) can actually have variable sample quality, with the best-looking samples being the "most memorized" ones. Even if they were close to perfect, models of different memorization levels would still be differentiable in terms of sample quality. If they weren't, that would be an interesting finding in itself. For this reason, I believe the paper would be made much stronger if sample quality (e.g., FID) were reported for their models, especially those in Sections 4 and 5.

The definition of effective model memorization is identical to the memorization capacity of Yoon et al. (2023) (already cited in the work, but not for this contribution). This should really be attributed rather than put forward as though it were the authors' own contribution.

The result on random class labels seems to contradict an analogous experiment conditioning on random text strings in Somepalli et al. (2023b), Section 5.3. To clarify what's happening, it would be helpful to replicate their experiment in your setup (i.e., add a new row to Table 2). If this is infeasible, at least a discussion around the discrepancy between your result and that of Somepalli et al. (2023b) would be helpful.

## Questions:

How did you sample from the conditional diffusion models in Section 5? Did you condition on anything at sample time? What's your hypothesis on why memorization does not scale monotonically with depth? Why do you believe conditioning on random classes leads to memorization, while conditioning a finetuned Stable Diffusion model on random captions does not (Somepalli et al., 2023b)?

## Limitations:

Yes, limitations were adequately addressed.

## Ratings:

Ethics Flag: No
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Code Of Conduct: Yes

## Response:

# Reviewer x6WK

## Summary:

Diffusion models have become the leading models for image generation tasks. However, the optimal solution to their training objective is replicating all the samples from the dataset, which causes privacy concerns. This begs the question: when does this issue arise? The authors present numerous empirical results to isolate each design factor's contribution towards memorization. The authors find that dataset size, dimension, model size, time embedding, skip connections, and class conditions are important factors that play a vital role in determining whether the models memorize the training samples.

## Strengths And Weaknesses:

S1: The paper is well-written and very easy to follow. The topic it studies is also an important one, the memorization problem of diffusion models, which currently causes privacy-related issues in all kinds of situations.

S2: The study on CIFAR is very thorough. The authors investigate many important design facets of diffusion model training, and demonstrate each one's unique contribution towards the memorization issue.

W1: It is unsurprising that diffusion models could memorize the training set. After all, that is what they are trained to do. The easier the task is (smaller dataset, bigger models, ...), the more easily this issue arises. I am not sure about the takeaways and contributions of this paper, especially since the memorization issues have been observed before.

1. From the theoretical side, the paper did not offer new insights, since memorization is expected. The more interesting question to ask here is how the models generalize when they do not memorize (of course, this could be out of the scope of this paper).
2. More importantly, from the experimental side, most of the investigation is focused on CIFAR, which is not a real-world use case where memorization can actually cause issues (privacy, etc.). The Stable Diffusion task presented in this paper only concerns fine-tuning on a "small" dataset, which is known to be prone to overfitting, regardless of how many samples the model was pre-trained with. While I appreciate the authors' detailed exploration of the phenomenon, when investigating a real-world problem in a "toyish" setting because real settings are too costly, it would be more meaningful to present some insights that can be beneficial for those real-world settings. For example, what can we do to make the models memorize less? With a fixed dataset, all the trends presented in the paper suggest smaller models, training shorter, and less conditioning. All of these designs will cause a downgrade in quality. It would be more interesting to explore acceptable solutions.

W2: Related to the first point, the authors state that the paper holds practical significance for diffusion model users, but the setting explored in the paper is very far from the real-world use case. The authors also do not touch upon any strategy that could mitigate memorization.

## Questions:

Just a comment: to me, the labels, no matter whether informative or not, are still providing the diffusion model information to work with. Intuitively, if each conditional model is viewed as its own model, it is essentially learning from a smaller dataset. Of course, there is parameter sharing within the model across labels, so the intuition is not exact, but I think this intuition agrees with the authors' observation in Fig. 5b&c.

The current way of writing in the introduction is a bit misleading: I was under the impression that random labels trigger memorization when it does not happen with true labels. I only later found that the authors mean the models with random labels can memorize just as much as the ones with true labels.

## Limitations:

Yes.

## Ratings:

Ethics Flag: No
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 4: Borderline reject: Technically solid paper where reasons to reject, e.g., limited evaluation, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Code Of Conduct: Yes

## Response:

# Reviewer iQ3S

## Summary:

The manuscript is an exhaustive empirical study about memorization in diffusion models, investigating different variations of data distribution, model configuration, and training procedure, as well as the effect of conditioning with true labels, with random labels, and with text prompts. The main paper uses Yoon et al. (2023)'s memorization metric, and the appendix has some results for Carlini et al. (2023)'s memorization metric. There is no independent assessment of image quality (e.g., Fréchet Inception Distance, FID), which the authors justify with the general observation that the metrics related to image quality are close to optimal when a significant proportion of training data has been memorized. Two notable counter-examples are observed: Exponential Moving Average (EMA) (known to stabilize FID and remove artifacts in generated samples) has very minor impact on memorization, and conditional generation (known to decrease FID) increases memorization.

## Strengths And Weaknesses:

The manuscript is very clear. I'm not sure if the color-coding of data distribution, model configuration, and training procedure was necessary, but it definitely doesn't hurt, and I truly appreciate the way these aspects are decoupled in their own sections.

I believe the work to be original: it takes existing ideas but commits to an exhaustive investigation that, to the best of my knowledge, hadn't been done before.

The authors claim that their "study holds practical significance for diffusion model users and offers clues to theoretical research in deep generative models". My understanding is that the work would clearly have had that bilateral significance had it also assessed image quality in parallel with memorization. Nonetheless, two "gold nuggets", the effect of EMA and of conditioning, can still be carved out by cross-referencing with past observations on FID.
So, there is some claim to significance that can be made, but it is a more borderline claim than it could have been with explicit FID results. My Question 1 below delves more into these matters.

Although efforts toward quality clearly show in the manuscript, I believe that both selected memorization metrics unfortunately suffer from the same issue when coupled with variations in the training datasets. The evidence is still there, but it was captured by a lens with notable imperfections. My Question 2 below covers this point.

All in all, I believe that a resubmission addressing these points could make for a solid, needed contribution to the field, but I doubt its feasibility within the restrictive timescale of the present conference submission.

## Questions:

Question 1: Why not include FID results? Taken to the limit, the last paragraph of Section 2 states the truism that perfect memorization implies perfect image quality. But practitioners wish for the best image quality while minimizing memorization, and your very results (EMA, conditioning) showed that this compromise-space is nontrivial. Theorists intuitively understand Eq. (2), but then wish to understand why these models generalize so well in practice. Your work could have been very useful to both these crowds: why not walk that extra step?

Question 2: Are the Yoon et al. (2023) and Carlini et al. (2023) memorization metrics valid for experiments that vary the training dataset (including its size)? I challenge this validity in general, but I also recognize that many of the manuscript's specific results may still hold despite this technical issue. I'll focus on Yoon et al. (2023) in my argument, but the same applies mutatis mutandis to Carlini et al. (2023). And I am aware that there is a precedent in Yoon et al. (2023) to use this metric while varying the training set's size, but that's not the paper I'm reviewing today.

Consider a distribution that is a unidimensional Gaussian with mean 0 and variance 1. We have a training dataset composed of one billion samples from it. We also have two models: one is a neural network with 1000 trainable parameters, and the other was provided as-is by a domain expert and needs no training. It turns out that the expert model "perfectly" samples from a Gaussian of mean 0 and variance 1 (up to numerical precision). We use Eq. (61) to assess if the models memorized, and we get a very small memorization value for both. Ok, now we subsample ten thousand points and apply Eq. (61) again: both models are doing worse! As we cross some dataset-size threshold, something happens with the neural network and its memorization value quickly approaches its maximum. We see that the expert model does better in comparison, but its value keeps going up as we decrease the dataset's size. For very small datasets, there is a lot of variance in the expert model's memorization, and it looks like which samples are kept in the dataset matters a lot; for some random seeds its value gets very close to the maximum! I'll stop there, I think you got the point: we are not training the expert model at all, so any change in the observed value for that model is not a property of the model, but a property of the metric itself.

A similar argument could be made for the kind of variations investigated in Section 3 (Data Distribution): at this point, I have no idea how much can be read from Figure 2. Things are a bit better for Sections 4-6: I can trust Figures 3(a) and 3(c) because they tell the same story at each data size, but I have to be more cautious when interpreting Figure 3(b).

I see an obvious fix to Eq. (61): the left-hand-side denominator inside the indicator function is meant to be a comparison basis, and that comparison basis should remain stable as we vary the training dataset. Let's take the largest training dataset (everything we've got) and use it as the stable comparison basis when assessing the memorization of models trained on its subsets. This would completely "fix" the issue for Sections 4-6, but something fancier would still be required for Section 3.

Minor/curiosity: On the top-right of page 6, you say that a larger batch size correlates with a higher memorization ratio. This is counterintuitive to me. Do you have any hypothesis as to why this could be happening? Are your batches identical across epochs, or are you shuffling?

## Limitations:

The authors' "Impact Statements" section is good, but I suggest adding something along the lines of "This is not meant as a ready-made solution to assess copyright infringement."

## Ratings:

Ethics Flag: No
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: Borderline reject: Technically solid paper where reasons to reject, e.g., limited evaluation, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Code Of Conduct: Yes

## Response:

# Reviewer FKDj

## Summary:

This paper addresses diffusion models' ability to replicate training data, a phenomenon called 'Effective Model Memorization (EMM)', which challenges their generalization capabilities. This paper shows that factors like data size and training labels significantly influence memorization, suggesting that diffusion models' behavior needs deeper understanding.

## Strengths And Weaknesses:

### Strengths

1. Regarding the training dynamics of the diffusion model, it is observed that the optimal solution corresponds to a Dirac-delta distribution, suggesting that memorization is inevitable upon reaching the optimal solution. This paper serves to remind people of this point and mathematically derives it well.
2. The factors influencing memorization were analyzed on CIFAR-10 with EDM models.

### Weaknesses

1. Most of the definitions and hypotheses in this paper do not seem to differ from those in Yoon et al.'s work [1]. For example, the conditions under which memorization occurs appear to be the same as those presented by Yoon et al. While additional experiments and discussions have been conducted, they seem minor to me. What are the differences from Yoon et al.?
2. Experiments were conducted only on the CIFAR-10 dataset, and it is unclear whether the findings hold for more diverse and complex data distributions or in text-to-image (t2i) models.

[1] Diffusion Probabilistic Models Generalize when They Fail to Memorize, https://openreview.net/pdf?id=shciCbSk9h

## Questions:

See weaknesses.

## Limitations:

See weaknesses.

## Ratings:

Ethics Flag: No
Soundness: 3: good
Presentation: 2: fair
Contribution: 1: poor
Rating: 3: Reject: For instance, a paper with technical flaws, weak evaluation, inadequate reproducibility and incompletely addressed ethical considerations.
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Code Of Conduct: Yes

## Response:

## Reviewer 7Cgs

Thank you for your supportive and valuable comments.
Below, we respond to the concerns raised in Weaknesses and Questions.

---

***W1: This work lacks information about model performance (in the sense of sample quality) in all of the experiments.***

Thank you for your insightful comments on the necessity of providing sample quality metrics (e.g., FID). It may take some time for us to organize and analyze the results related to FID, and we will release them upon completion.

The reason why we didn't include the model performance metrics at first is that we were focusing on the timing of occurrence and the influential factors for memorization. In particular, we were interested in when trained diffusion models mimic the theoretical optimum and only replicate training data. Therefore, we only provide the memorization metric to show the differences between a trained diffusion model and the theoretical optimum.

---

***W2: The contribution of Yoon et al. (2023) should be properly cited when defining EMM.***

Thank you for raising this point. Our definition of EMM includes an expectation over the memorization metric, while the memorization capacity in Yoon et al. (2023) adopts a probabilistic form for memorization. These two definitions share several similarities. Additionally, EMM only serves as a tool to facilitate our analysis of memorization, thus we do not claim it as our main contribution. We will cite the contributions of Yoon et al. (2023) properly, including the definition of EMM, in the revision.

---

***W3&Q3: The result on random class labels seems to contradict an analogous experiment in Somepalli et al. (2023b).***

Our finding actually aligns with the observations in (Somepalli et al., 2023b). As shown in Figure 4 (**Right**) in (Somepalli et al., 2023b), when finetuning diffusion models together with the text encoder, conditioning on random captions results in the highest similarity scores, indicating the highest memorization effect. This setup is similar to ours, as we train the whole diffusion model including the class embedding modules.

Additionally, our observations on class-conditioned diffusion models trained from scratch can be used to interpret the results in (Somepalli et al., 2023b). We only consider the above training setup, where the text encoder is also trained. The numbers of classes in fixed caption, class captions, BLIP/original captions, and random captions satisfy: fixed caption $<$ class captions $<$ BLIP/original captions $\approx$ random captions. Therefore, in Figure 4 (**Right**) in (Somepalli et al., 2023b), the similarity scores of the four different types of conditioning follow the same order. Conditioning on random captions results in a slightly higher similarity score than conditioning on BLIP/original captions. This may be attributable to the fact that BLIP/original captions share more semantic-level or object-level similarities among different images, e.g., "An image of a dog running on the lawn" and "An image of a dog walking on a street" describe the same object, while random captions are more diverse and share fewer similarities. This may require further investigation in the future.

---

***Q1: How did you sample from the conditional diffusion models in Section 5? Did you condition on anything at sample time?***

Thanks for your question. As Section 5 is about experiments on the training procedure using unconditional diffusion models, we assume you are asking about the sampling procedure in Section 6. For conditional EDM, we follow Algorithm 2 in [1] to generate samples conditioned on class labels. The class labels are the one-hot encodings of $0, 1, ..., C-1$ (a small sketch of this labeling is given below).
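As a concrete illustration of this labeling, the sketch below builds such one-hot class conditions; the sampler call itself is omitted, and all variable names are placeholders rather than code from the paper.

```python
import numpy as np

# Illustrative sketch only: constructing one-hot class labels for conditional EDM
# sampling. The actual sampler (Algorithm 2 of Karras et al., 2022) is assumed to
# accept a `class_labels` array of shape (num_samples, C).
C = 10                               # number of classes
num_samples = 10_000                 # images generated for the memorization metric
labels = np.arange(num_samples) % C  # cycle through class indices 0, 1, ..., C-1
class_labels = np.eye(C, dtype=np.float32)[labels]  # one-hot matrix, shape (num_samples, C)
# batches of rows from `class_labels` are then fed to the conditional EDM sampler
```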
For Stable Diffusion (SD), we follow the DDIM sampling pipeline [2] with 50 steps. The conditions are either "a painting" for "plain" conditioning or "an <artistic_style> painting" for "class" conditioning. Please let us know if you have further questions regarding sampling.

---

***Q2: What's your hypothesis on why memorization does not scale monotonically with depth?***

We hypothesize that scaling model depth introduces more training difficulty than scaling model width. Although we have set a much longer (10 times) training duration than that in [1], it may take more time for a much deeper diffusion model to converge to the theoretical optimum.

---

**References**

[1] Karras T, Aittala M, Aila T, et al. Elucidating the design space of diffusion-based generative models. NeurIPS 2022.\
[2] Song J, Meng C, Ermon S. Denoising diffusion implicit models. ICLR 2021.

## Reviewer x6WK

Thank you for your valuable comments and questions. Below, we respond to the concerns raised in Weaknesses and Questions.

---

***W1: Clarification on the takeaways and contributions.***

Thanks for your detailed comments. We would like to clarify the primary objectives and contributions of our work. Our primary goal is to conduct a comprehensive investigation into the conditions under which diffusion models inadvertently approximate the optimal solution outlined in Eq. (2). This optimal model, while theoretically expected, presents potential drawbacks such as solely replicating training data and potentially leading to privacy or copyright concerns. Understanding the extent of this theoretical hazard in typical diffusion model training scenarios is crucial. While it may be intuitive to assume that simpler tasks (e.g., smaller datasets, larger models) are more susceptible to this issue, the quantitative threshold for the onset of the memorization phenomenon remains unclear. Our work aims to address this gap by providing quantitative insights into the critical points at which memorization occurs. This awareness is essential for mitigating adverse consequences and refining the practical utility of diffusion models.

---

***W2: The setting explored in the paper is very far from the real-world use case. Memorization mitigation strategies are not touched.***

We would like to clarify that besides the experiments on CIFAR-10 in the main paper, we also conduct experiments on the FFHQ dataset (64$\times$64). This dataset presents a more diverse and complex data distribution, and the results can be found in Appendix D. Moreover, we also conduct experiments by finetuning Stable Diffusion (SD) on ArtBench, which is a realistic use case: people may want to customize a Stable Diffusion model to generate particular-styled or subject-driven images, as in DreamBooth [1]. This requires further fine-tuning Stable Diffusion on a certain type of data, normally at a small scale. For memorization mitigation strategies, generally, opting for a training recipe exhibiting a lower EMM carries a lower risk of memorization, e.g., adopting a relatively small batch size and a weight decay during training.

---

***Q1: Comment on label information impact.***

Thank you for your insightful comment. This intuition does agree with our partial observations in Section 6. From another perspective, with true labels as conditioning, similar training samples tend to share the same "individual model". Intuitively, diffusion models conditioned on true labels should then have higher memorization ratios. However, as presented in Fig. 5(a), diffusion models conditioned on random labels have a similar level of memorization ratio. Finally, since these intuitions are not exact, we believe that the clear quantitative results in Section 6 are worthy of exploration.

---

***Q2: The misleading aspect in the introduction.***

We appreciate your feedback. We will clarify in the revision to avoid any misleading implications about the memorization behavior with random labels compared to true labels.

---

**References**

[1] Ruiz N, Li Y, Jampani V, et al. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. CVPR 2023.

## Reviewer iQ3S

Thank you very much for your insightful comments and valuable questions. Below, we respond to each of the concerns raised in Weaknesses and Questions.

---

***Q1: Why not include FID results?***

Thank you for your insightful discussion on the significance of including both memorization and image quality metrics. To enrich our findings, we have decided to report FID metrics in parallel with memorization in our paper. It may take some time for us to organize and analyze the results related to FID, and we will release them upon completion.

The reason why we didn't include FID at first is that we were focusing on the memorization gap between trained diffusion models and the theoretical optimum. For this, we explored the timing of occurrence and the influential factors for memorization. Therefore, we only provide the memorization metric to indicate when trained diffusion models mimic the theoretical optimum. Moreover, as clarified in Section 2, when diffusion models replicate a large proportion of training data, FID also approaches the optimum.

---

***Q2: Validity of the memorization metrics.***

Thank you for articulating your concerns regarding the validity of the metrics, illustrated with such a concrete example. We agree that the metric $\mathcal{R}^{\mathrm{Mem}}$ is contingent on the dataset $\mathcal{D}$ itself, as demonstrated in the unidimensional Gaussian example. However, we would like to emphasize that the nature of the image space we focus on in this paper differs significantly. Images exist in a high-dimensional space, where each pixel corresponds to a unidimensional case. In such high-dimensional spaces, randomness and variance tend to average out, minimizing the impact of differences in data sizes within our scope of discussion.

To offer an intuition, consider the quantity $\frac{\min_{j:j\neq i}\|x_i-x_j\|_2}{\max_{j:j\neq i}\|x_i-x_j\|_2}$ for $x_i\in\mathcal{D}$, representing the ratio between the minimum and maximum distances of $x_i$ to other data samples. On CIFAR-10, this quantity is approximately $\frac{1}{3}$, indicating that the distance between a natural CIFAR-10 image and other samples in the dataset remains relatively stable (within an order of magnitude). Conversely, for a memorized generation, the distance to its nearest neighbor is about 4 orders of magnitude smaller. Furthermore, this quantity remains essentially unchanged across the range of data sizes tested in our paper (a small sketch of these nearest-neighbor computations is given below).
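To make these quantities concrete, here is a minimal NumPy sketch of the min/max distance ratio and of a nearest-neighbor memorization check in the spirit of $\mathcal{R}^{\mathrm{Mem}}$; the pixel-space L2 distance and the $1/3$ threshold are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sorted_l2_distances(x, data):
    """Sorted L2 distances from one image x to every image in `data` (both flattened)."""
    diffs = data.reshape(len(data), -1) - x.reshape(1, -1)
    return np.sort(np.linalg.norm(diffs, axis=1))

def min_max_ratio(i, data):
    """Ratio between the minimum and maximum distance of training image x_i to the
    other training samples (reported above as roughly 1/3 on CIFAR-10)."""
    d = sorted_l2_distances(data[i], np.delete(data, i, axis=0))
    return d[0] / d[-1]

def is_memorized(x_gen, data, k=1/3):
    """Nearest-neighbor memorization check: a generated sample counts as memorized
    if its nearest training neighbor is much closer than its second-nearest one.
    The threshold k = 1/3 is an assumption for illustration."""
    d = sorted_l2_distances(x_gen, data)
    return d[0] < k * d[1]

# The memorization ratio over a set of generations is the fraction flagged as memorized:
# mem_ratio = np.mean([is_memorized(g, train_data) for g in generations])
```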
Hence, there exists fair flexibility in defining $\mathcal{R}^{\mathrm{Mem}}$ in practice: one can adjust the constant for different dataset sizes without perceptible changes in results. In our work, the metric $\mathcal{R}^{\mathrm{Mem}}$ serves as a proxy for human judgment in distinguishing between memorized and generalized samples (Fig. 1a, Fig. 10a vs. Fig. 1b, Fig. 10b), and we empirically observe strong alignment with human judgment.

We genuinely appreciate your thoughtful consideration and intend to incorporate the suggested definition into our revision. Furthermore, we will acknowledge your valuable contribution in our paper.

---

***Q3: Why does a larger batch size correlate with a higher memorization ratio?***

Our intuition stems from existing research studying the generalization capability of stochastic gradient descent (SGD). As a key aspect of SGD, minibatching introduces noise into the gradient estimation, with smaller batch sizes corresponding to larger noise levels. It has long been believed that stochastic gradient descent can generalize better (and thus avoid memorization) than full-batch gradient descent in deep learning [1, 2, 3]. Moreover, as presented in Eq. (21) in Appendix A.1, diffusion models are trained to converge to the optimal solutions. Using a larger batch size or even the full data size results in less noisy gradient estimation. Therefore, the trained diffusion model tends to be closer to the theoretical optimum, which can only memorize training data.

---

**References**

[1] Heskes, Tom M., and Bert Kappen. "On-line learning processes in artificial neural networks." North-Holland Mathematical Library. Vol. 51. Elsevier, 1993. 199-233.\
[2] Keskar, Nitish Shirish, et al. "On large-batch training for deep learning: Generalization gap and sharp minima." arXiv preprint arXiv:1609.04836 (2016).\
[3] Smith, Samuel L., and Quoc V. Le. "A bayesian perspective on generalization and stochastic gradient descent." arXiv preprint arXiv:1710.06451 (2017).

## Reviewer FKDj

Thank you for your valuable comments. Below, we respond to each of the concerns raised in Weaknesses.

---

***W1: Difference from Yoon et al.'s work [1].***

We appreciate your acknowledgment of Yoon et al.'s work [1]. While our work shares similarities in the definition of memorization, there are several key differences:

- Our work offers a much more thorough quantitative analysis of the memorization phenomenon (as agreed by all the other three reviewers), focusing on the timing of occurrence and the influential factors. Our analysis includes aspects of the training procedure, model architecture, and data distribution, providing a holistic understanding of memorization in diffusion models.
- Yoon et al. focused on the relationship between memorization and generalization of diffusion models. In terms of memorization, they mainly explored the effects of data size and model width on toy data and CIFAR-10, while our work provides a more comprehensive analysis on CIFAR-10/FFHQ/ArtBench-10, as mentioned above. Several interesting findings in our work are also mentioned by other reviewers, such as model depth, skip connections, conditions, etc. Different from the finding in [1] that the memorization-generalization dichotomy may manifest at the level of classes, we find that memorization can be triggered by conditioning training data on completely random and uninformative labels. Moreover, we observe that the number of categories for class/text conditions plays an important role in memorization.

***W2: Experiments were conducted only on the CIFAR-10 dataset, and it is unclear whether the findings hold for more diverse and complex data distributions or in text-to-image (t2i) models.***

We appreciate your feedback regarding the scope of our experiments. We highlight that experiments on the FFHQ dataset (64$\times$64), which presents a more diverse and complex data distribution, can be found in Appendix D. The trends and results observed on FFHQ align closely with our findings on CIFAR-10, thereby reinforcing our conclusions regarding the influential factors on memorization in diffusion models. Furthermore, experiments on Stable Diffusion (SD) can be found at the end of Section 6, where we finetune SD on the ArtBench dataset, a scenario more relevant to large-scale text-to-image models. Our observations align with those from the class-conditioned cases, showing again that conditioning is a potential trigger for memorization. We apologize for the lack of clarity in locating these experiments. We will make it clearer in the revision.

---

**References**

[1] Yoon, TaeHo, et al. "Diffusion probabilistic models generalize when they fail to memorize." ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling. 2023.

---

Dear Reviewers,

Thank you again for your valuable comments and suggestions, which are really helpful for us. We provide the FID results in parallel with the memorization ratios in our paper through this [anonymous google drive](https://drive.google.com/drive/folders/1yDZ4ioV9EcLI4RuWhrX2EUqECkJt7m7P?usp=drive_link). For instance, the corresponding FID results of **Figure 3(a)** are visualized in the file **Figure3a.pdf** in the above google drive link. The corresponding FID results of **Table 1** are visualized in the file **Table1.pdf**.

By analyzing the FID results, we observe that a configuration with a higher memorization ratio tends to result in a lower FID. This observation is in line with the intuition: when diffusion models memorize a significant proportion of training data, FIDs are also close to optimal. Apart from this observation, we note that although EMA values have limited impact on memorization, proper selections lead to lower FIDs, as shown in **Table1.pdf**. This follows the common practices of diffusion models [1, 2, 3]. Interestingly, comparing **Figure 5(a)** in the paper and **Figure5a.pdf** in the google drive, a diffusion model conditioned on true labels ($C=10$) has a similar memorization ratio to the one conditioned on random labels ($C=10$), but the former has a lower FID score. We hypothesize that with true labels as conditioning, similar training samples tend to share the same "individual model", resulting in better "individual models". Meanwhile, a diffusion model conditioned on true labels ($C=10$) has a suboptimal FID compared to the one conditioned on unique labels ($C=|\mathcal{D}|$), as presented in **Table3.pdf**.

We totally understand that this is quite a busy period, so we would deeply appreciate it if you could take some time to return further feedback on whether our responses solve your concerns. If there are any other comments, we will try our best to address them.

Best,
The Authors

---

**References**

[1] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. NeurIPS 2020.\
[2] Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. ICML 2021.\
[3] Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. NeurIPS 2022.

---

We appreciate your detailed comments and suggestions. In the revision, besides the FID results and the mentioned clarifications, we will also include plots of the relationship between FID and memorization, as you suggested. Thank you again!

---

Thank you for your comments. Further responses are provided below:

---

***The overall messages are similar to those of Yoon et al.***

The overall message of our paper is a comprehensive investigation into the conditions under which diffusion models inadvertently approximate the theoretical optimum defined in Eq. (2), while the overall message of Yoon et al.'s work is the relationship between memorization and generalization in diffusion models. We have clarified the differences between our paper and Yoon et al.'s work in the above response. We appreciate your valuable comments and will clearly acknowledge the contributions of Yoon et al.'s work in the revision.

---

***The experiments are insufficient. The stable diffusion experiments were conducted on simple datasets. It needs to be verified whether this is a phenomenon that is found more broadly (i.e., through much more extensive experimentation).***

In our paper, we conducted experiments by finetuning Stable Diffusion (SD) on ArtBench, which is a realistic use case, since it is natural for practitioners to customize SD to generate particular-styled or subject-driven images, as in DreamBooth [1]. This customization requires further fine-tuning Stable Diffusion on a certain type of data, normally at a small scale. Regarding "through much more extensive experimentation", we are considering including additional experiments on other datasets with both LoRA finetuning and full-finetuning strategies. The newly added experiments will be updated in this [anonymous google drive](https://drive.google.com/drive/folders/1HMDxyEM4P-dr0EQOTkFxh_veKMSkHFhp?usp=drive_link). We would appreciate it if you could provide suggestions or instructions on the above experimental designs. We will update more experimental evidence through the same link even after the author-reviewer discussion period.

---

**References**

[1] Ruiz N, Li Y, Jampani V, et al. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. CVPR 2023.

---

Thank you for providing these extra FID results. These will certainly make the paper more impactful, and I believe this resolves all of my concerns. Can you also share the exact protocol used for computing FID? (E.g., between 50k generated samples and 50k training images?)

Thank you for raising the score! To ensure consistency in our analysis, we maintain the same set of samples (10k in our paper) for both evaluating memorization and computing FID scores. Specifically, we compute FID scores using 10k generated samples and the training dataset $\mathcal{D}$ (with varying sizes $|\mathcal{D}|$).

Thank you for your feedback. As detailed in Appendix B, we find that the memorization ratio has negligible variance when the sample size exceeds 10k images, so we generate 10k images to compute memorization ratios throughout this study. To provide the FID results in parallel with the memorization metrics in the paper, we employ the same 10k generated images and the training data $\mathcal{D}$ to compute the FID (see the short sketch after these replies for an illustration of this setup).

---

Thank you for your support and raising the score.

Thank you for raising the score!
We will polish our paper further and incorporate new results into the revision.

---

**We are conducting more experiments; looking forward to your feedback**

Dear Reviewer FKDj,

We are already in the process of conducting the following experiments:

- We employ a new dataset, e.g., subsets of ImageNet, to fine-tune Stable Diffusion (SD) models. Following the experimental design in Table 2, we will re-validate the effects of plain conditioning and text conditioning on memorization using the new dataset.
- We will also adopt LoRA fine-tuning, a common parameter-efficient fine-tuning (PEFT) technique for customizing SD models, on both ArtBench and the new dataset to observe the effects of conditioning on memorization.
- In our paper, we fine-tuned the U-Net in the SD model with the text encoder fixed. We will also conduct the experiments by fine-tuning the U-Net and the text encoder simultaneously.

Since the Author/Reviewer discussion is going to end, we will update the experimental results in this [anonymous google drive](https://drive.google.com/drive/folders/1HMDxyEM4P-dr0EQOTkFxh_veKMSkHFhp?usp=drive_link) during the Reviewer/AC discussion. If the above-listed experiments cannot address your concerns, please feel free to propose new experimental designs. We will try our best to update the experimental results in the above link before the end of the Reviewer/AC discussion.

Thank you!
The Authors

---

# Reviewer-AC discussion

---

### 8 April, 2024: Update on Stable Diffusion experiments on the Imagenette dataset

Besides the ArtBench-10 dataset employed in our paper, we also conducted experiments by fine-tuning Stable Diffusion (SD) models on Imagenette [1]. Imagenette consists of $C=10$ classes from the ImageNet dataset [2]. To construct a training dataset $\mathcal{D}$, we randomly sample $|\mathcal{D}|/C$ images for each class from Imagenette (a small sketch of this subsampling is given below). Afterwards, we fully fine-tune the U-Net in the SD model with the text encoder fixed. We employ DDIM with 50 steps to generate 10k images for computing the memorization metrics. The results are saved in the file **Imagenette_full_finetune_SD.pdf**.
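For illustration, the per-class subsampling described above could look roughly like the sketch below; the folder layout, file extension, and function name are assumptions, not the authors' actual data pipeline.

```python
import random
from collections import defaultdict
from pathlib import Path

def build_subset(imagenette_root, d_size, num_classes=10, seed=0):
    """Sample |D|/C images per class from an Imagenette-style folder tree
    (one sub-folder per class under train/) to form a training set D of size d_size."""
    per_class = d_size // num_classes
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path in Path(imagenette_root).glob("train/*/*.JPEG"):
        by_class[path.parent.name].append(path)
    subset = []
    for _, paths in sorted(by_class.items()):
        subset += rng.sample(paths, per_class)
    return subset  # list of image paths forming D

# e.g., build_subset("imagenette2/", d_size=1000) keeps 100 images from each of the 10 classes
```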
From the table, it is observed that as the training data size $|\mathcal{D}|$ increases, the memorization ratio drops. Furthermore, "class" conditioning is more prone to induce memorization in text-to-image diffusion models than "plain" conditioning.

**References**

[1] https://github.com/fastai/imagenette \
[2] Deng J, Dong W, Socher R, et al. Imagenet: A large-scale hierarchical image database. CVPR 2009.

---

Dear Area Chairs,

Sorry for bothering you. With less than 36 hours before the end of the author-reviewer discussion period, reviewer FKDj responded to our rebuttal with the comments that "…the experiments are insufficient, the stable diffusion experiments were also conducted on simple datasets…". To respond to the reviewer's feedback, we conducted new experiments related to Stable Diffusion models to further validate our findings on the memorization of diffusion models. Due to the time limit, it is impossible for us to provide the experimental results before the end of the author-reviewer discussion period. However, we still would like to report these new experimental results as a response to reviewer FKDj's comments. We have included the results in this [anonymous google drive](https://drive.google.com/drive/folders/1HMDxyEM4P-dr0EQOTkFxh_veKMSkHFhp?usp=drive_link), together with a readme file explaining our experiments, as we promised to reviewer FKDj in our previous official comment. We hope these results can be considered.

In our paper, we have already conducted comprehensive experiments on the CIFAR-10 and FFHQ datasets. We also conducted experiments by fully fine-tuning Stable Diffusion models on the ArtBench-10 dataset. In our newly added experiments, we include a new dataset, Imagenette, which is a subset of ImageNet. Afterwards, we employed both full fine-tuning and LoRA fine-tuning strategies on Stable Diffusion models using Imagenette. These new experimental results are in line with our conclusions drawn from CIFAR-10 / FFHQ / ArtBench-10 in our paper.

Thank you!
The Authors
