alain rakotomamonjy
# Global Response

## Response to Reviews

We would like to thank the reviewers for their insightful, constructive, and timely reviews. We are glad that most reviewers appreciated the paper and the perspective we propose, generally agreeing on its relevance and potential for the community. Below, we respond to each reviewer individually. The readers will find attached to this global response a one-page PDF of figures illustrating some of our discussions, including interpolations in the latent space and an analysis of the training dynamics of Score GANs.

A recurring concern raised by the reviewers deals with the experimental results. We understand that better performance and more thorough experiments could improve the paper and provide further support to our framework. Nonetheless, we would like to remind the reviewers that the proposed models and experiments serve as a "reasonable basic validation of the theory" (Reviewer iGU3), given the more theoretical focus of the paper, as also noted by Reviewers HYQ5 and 6P1y.

Some reviewers spotted typos throughout the paper. We would like to point out that we had already corrected the most important ones in the appendix provided in the supplementary material. We refer the reviewers to this document for a more polished version of the paper.

We look forward to further discussion with the reviewers during the discussion period.

## Correction: Appendix B.3

During our work on this rebuttal, we noticed a mistake in the experiments in Appendix B.3, which changes the conclusion that Discriminator Flows are more efficient in terms of NFE than the diffusion model EDM with a first-order solver. Thus, we report this correction to the reviewers.

In the original submission, the results of the "EDM (Euler solver)" baseline in Figure 5 were computed for the stochastic version of EDM instead of the deterministic one like in the rest of the paper.
The corrected figure with the deterministic baseline can be found in the attached one-page PDF, Figure 1(a). Surprisingly, the first-order version of EDM is actually more efficient in terms of NFE than the second-order version for low NFE values. This baseline was not considered in the EDM paper (Karras et al., 2022), so this is a new observation. We confirmed this phenomenon using the official implementation of EDM on CIFAR10.

Unfortunately, this invalidates our initial claim that Discriminator Flows are more efficient than the first-order EDM with the chosen discretizations. We still expect that, with proper tuning, Discriminator Flows can be more efficient than diffusion models, based on Figure 1 of our paper -- which remains valid. The final image appears quickly after a few steps, but some residual noise still needs to be eliminated in the remaining steps. Nevertheless, this result was secondary and all other results of the paper remain valid, as this mistake only impacted this experiment. We apologize for the confusion and thank the reviewers in advance for taking these new elements into account.

# Response to Reviewer AQyb

We thank the reviewer for their feedback. We address the weaknesses they raised by answering their questions below.

## (1). Wording Clarification

> (1). The following phrase is not clear: Line 51 on Page 2: "In prior work, ...". It is not clear if this is referring to a specific work or prior works.

This is referring to prior works, plural. We meant that the choice of a prior that is easy to sample from is a widely adopted choice in the literature.

## (2). and (3). Equivalence Between Particle Models and GANs

> (2). It is not obvious that the modeling of the generation/inference with score-based models is interchangeable with training generators. Please can you provide more information on this?

> (3). Please provide more details on the equivalence of the interacting particle model of GANs and the diffusion models as stated in Finding 6.
We do not claim that particle model (PM) inference and generator training are equivalent or interchangeable. Instead, we show that interacting particle models (Int-PMs), and in particular GANs, ***generalize*** PM inference during generator training, as worded in Findings 4 and 6. Indeed, the particle evolution of an Int-PM (Eq. (13)) reduces to the particle evolution of a PM (Eq. (1)) with the same functional $h_{\rho}$ when $k_{g_{\theta_t}}(z, z') = \delta_{z - z'}$ (lines 152-158), in which case $\mathcal{A}_{\theta_t}(z)$ disappears from Eq. (13) to obtain Eq. (1). This indicates that a functional $h_{\rho}$ used in an Int-PM may be used in a PM and vice versa. However, note that this does not guarantee successful performance, as the compatibility of the generator with $h_{\rho}$ (through $\mathcal{A}_{\theta_t}(z)$) should also be taken into account. We confirm the possibility of transferring $h_{\rho}$ with the models we introduce: the PM Discriminator Flow is based on an Int-PM GAN functional, and the Int-PM Score GAN is based on a PM score functional.

Furthermore, this is supported by Finding 6, where we notice that functionals $h_{\rho}$ from typical PMs have been shown to be encompassed in the Int-PM GAN framework. Under some hypotheses on the discriminator, Yi et al. (2023) show that the $h_{\rho}$ of $f$-divergence GANs is actually the functional of the forward KL divergence gradient flow (like in the NCSN diffusion model of Song et al. (2019)), and Franceschi et al. (2022) show that it is the one of the squared MMD gradient flow. Considering Finding 4, this therefore means that GANs generalize these PMs.

Nonetheless, we agree that the formulation of Finding 6 should be more specific. We will clarify our explanations in Section 3 and specify exactly which score models and Wasserstein gradient flows are generalized by GANs based on the previous paragraph.

## (4). Hybrid Models

> (4).
> Is it possible to come up with a hybrid structure that combines the two main techniques under the unified PM model? If yes, do you think this could improve the performance?

The introduced Score GAN and Discriminator Flow in Section 4 are examples of combinations of such techniques, as explained above: the PM Discriminator Flow is based on an Int-PM GAN functional, and the Int-PM Score GAN is based on a PM score functional. However, they are only proof-of-concept examples. As highlighted in the conclusion, we believe that finding a hybrid model that outperforms both diffusion and GANs is an interesting direction for future work regarding our framework.

# Response to Reviewer HYQ5

We thank the reviewer for their feedback. We address the questions they raised below.

> The claim that "Discriminator Flows have the expected advantage of converging faster to the target distribution as compared to the state-of-the-art diffusion model EDM, and thus have higher efficiency during inference" may be overstated, as Figure 5 in the Appendix demonstrates that EDM (Heun solver) exhibits comparable efficiency. Which solver is employed for EDM in Figure 1?

In Figure 1, we used the Heun solver for EDM. However, Figure 1 does not correspond to the setting of Figure 5. In Figure 5, we studied two different ways to reduce the number of NFE with Discriminator Flows: by discretizing the generating differential equation of Eq. (20) -- Figure 5(a) -- and by stopping the generation process early, like in Figure 1 -- Figure 5(b). We compared these two versions to the canonical way of reducing NFE in a score-based diffusion model, i.e. by discretizing the generating differential equation of Eq. (8) like Karras et al. (2022), either with the Euler or Heun solver. Those are the two baselines identically reproduced in both Figures 5(a) and 5(b). We included in the one-page PDF attached to our global response, in Figure 1(a), the merged plots of Figures 5(a) and 5(b).
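Since the distinction between the Euler and Heun baselines matters throughout this discussion, here is a minimal sketch of why Heun's method consumes two function evaluations (NFE) per step while Euler consumes one. The toy 1-D vector field below is only a stand-in for the probability-flow drift; this is not the EDM implementation.

```python
import math

def euler_step(f, x, t, dt, counter):
    # One function evaluation per step.
    counter[0] += 1
    return x + dt * f(x, t)

def heun_step(f, x, t, dt, counter):
    # Two function evaluations per step: predictor, then trapezoidal corrector.
    counter[0] += 1
    d1 = f(x, t)
    x_pred = x + dt * d1
    counter[0] += 1
    d2 = f(x_pred, t + dt)
    return x + dt * 0.5 * (d1 + d2)

# Toy drift standing in for the generating ODE's vector field.
f = lambda x, t: -x  # exact solution from x(0)=1 is exp(-t)

x_euler, x_heun = 1.0, 1.0
nfe_euler, nfe_heun = [0], [0]
n_steps, dt = 10, 0.1
for i in range(n_steps):
    t = i * dt
    x_euler = euler_step(f, x_euler, t, dt, nfe_euler)
    x_heun = heun_step(f, x_heun, t, dt, nfe_heun)

# Heun uses twice the NFE of Euler for the same number of steps,
# but is second-order accurate, hence the trade-off studied in Figure 5.
```

At a fixed step count, the Heun baseline thus consumes twice the NFE of the Euler one, which is why both discretizations appear as separate baselines.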
To directly compare performance vs. NFE in the same setting as Figure 1 in the paper, we evaluated EDM with early stopping, similarly to Discriminator Flows, in the one-page PDF attached to our global response, Figure 1(b). As expected, diffusion models need to completely denoise the image to achieve reasonable performance.

Nevertheless, after noticing an unrelated mistake in our experimental setting, we unfortunately have to retract the claim questioned by the reviewer. Please refer to the global response for more details.

> Why both $x \sim p_{\mathrm{data}}$ and $x = g_{\theta}(z)$ exists in Algorithm 1?

We apologize for the mistake. $x \sim p_{\mathrm{data}}$ should not appear in Algorithm 1. We will correct this in the updated version of the manuscript.

# Response to Reviewer kQU7

We thank the reviewer for their feedback. We address the weaknesses and questions they raised below.

## Wasserstein Notation

> Some mathematical symbols (e.g., $\nabla_W$) require further explanation to enhance the readability of this paper for a wider range of readers.

The notation $\nabla_W$ is a standard notation for readers with expertise in Wasserstein gradient flows, but it may indeed be confusing for a wider range of readers. For a functional $\mathcal{F}$ defined over probability measures, $\nabla_W \mathcal{F}(q)$ is the so-called Wasserstein gradient of the functional -- similar to the gradient of a functional defined over a Euclidean space. It is defined as $\nabla_x \frac{\delta \mathcal{F}}{\delta q}$ (Eq. (3)), i.e. the Euclidean gradient of the functional's first variation $\frac{\delta \mathcal{F}}{\delta q}$. We refer to Santambrogio (2017), cited in the paper, for more information.

## Experimental Results

> The experiments in this paper are relatively weak, with a limited number of datasets and tests conducted. Furthermore, according to Table 3, the proposed methods (Discriminator Flow and Score GAN) fail to surpass baseline methods (EDM and GAN).
Providing state-of-the-art performance in our experimental results is beyond the scope of our paper. We include the experimental results in our paper primarily for proof-of-concept purposes, to validate that our proposed Discriminator Flow and Score GAN models are practically feasible. Please also refer to our global answer on this topic.

## Generator as a Smoothing Operator

> In line 29, what does it mean that the generator has the role of a smoothing operator?

This role of the generator is formally defined and explained in Definition 2 (lines 148-151). We note that, compared to the particle model (PM) given in Definition 1 and Equation 1, the use of a generator results in applying a linear operator on top of the vector field $\nabla h_{\rho_t}$. It is this linear operator role that provides the intuition that the generator acts as a smoothing operator (analogous to a smoothing filter). Indeed, we notice in line 151 that this linear operator is a kernel integral operator, which by design smooths the input signal $V$ with a kernel convolution in Eq. (14).

## Training Costs

> What is the time cost of training Score GAN and Discriminator Flows? Do they have a greater training overhead than Score-based Model and GAN?

Indeed, the design of Score GAN and Discriminator Flow induces computational constraints that make each of their training iterations slower than those of baseline models. Score GAN requires pretraining a score network as specified in the paper, and its score-based update remains computationally more demanding than a discriminator-based update like in GANs (since the score function takes values in the data space, while the discriminator output is scalar). As explained in lines 270 and 287, training Discriminator Flow requires sampling at every step from the generating ODE of Eq. (20). This makes its training iterations slower than both diffusion models (which do not require resampling through the ODE/SDE) and GANs (which have fast sampling).
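Returning to the smoothing-operator answer above, the kernel integral operator of Eq. (14) can be illustrated with a Monte-Carlo estimate. The 1-D "generator", the RBF kernel, and the discontinuous vector field below are purely illustrative stand-ins, not our actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

g = lambda z: np.tanh(z)                        # toy 1-D generator
k = lambda z, zp: np.exp(-0.5 * (z - zp) ** 2)  # illustrative RBF kernel in latent space
V = lambda x: np.sign(x)                        # discontinuous vector field to be smoothed

def smoothed_field(z, n=20000):
    # Monte-Carlo estimate of [A(z)](V) = E_{z' ~ p_z}[ k(z, z') V(g(z')) ], as in Eq. (14).
    zp = rng.standard_normal(n)  # z' ~ p_z, here a standard Gaussian prior
    return float(np.mean(k(z, zp) * V(g(zp))))

# V jumps by 2 at the origin, yet the operator output varies smoothly with z.
vals = [smoothed_field(z) for z in np.linspace(-1.0, 1.0, 5)]
```

The kernel average turns the discontinuous field $V$ into a smooth, bounded function of the latent variable, which is the filtering intuition behind Definition 2.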
Besides the cost of individual training iterations, the total temporal cost of training also depends on the number of iterations, which we specify in Appendix C.2. Overall, the approximate amounts of time we used to train the tested models over MNIST on one NVIDIA V100 GPU are:

- for Discriminator Flow, 24 hours;
- for Score GAN, 10 hours (excluding the pretrained diffusion model);
- for EDM, 6 hours;
- for GAN, 1 hour.

We will highlight the elements of this discussion in the updated version of the manuscript.

## Limitations and Societal Impact

> No, the authors have not adequately addressed the limitations and potential negative societal impact of their work.

We would like to point out that **we did address potential negative impacts** of our work in the supplementary material, Appendix Section D. Furthermore, **we did address limitations** of our approach throughout the paper, including:

- the generative performance of the introduced models, lines 279 to 282;
- the training cost efficiency of the introduced models, lines 241, 270 and 287;
- the omission of continuous training of the GAN discriminator, lines 169 and 342-343.

Nonetheless, we welcome specific suggestions from the reviewer to improve this part of the paper.

# Response to Reviewer XiV8

We would like to thank the reviewer for their feedback. We address the weaknesses they raised below.

## Novelty

> Particle GANs are not very novel; [...] this is not really new, but as far as I know, its rarely done

> line 21: 'challenge the conventional view that particle and adversarial generative models are opposed to each other': the Discriminator can be used to learn particles (learn x_fake directly); this is well known as far as I know.

Our understanding is that the reviewer questions the novelty of Section 3.2 and of the introduced Discriminator Flow model. However, they did not provide any reference to support their assertion, so we have no choice but to respectfully maintain our claims.
We welcome any suggestion to improve our positioning w.r.t. related work that we may have missed. Moreover, we would like to point out that we did appropriately cite and build upon prior works studying the particle model aspect of GANs, albeit not as general as our approach: e.g., Franceschi et al. (2022) on line 162, Yi et al. (2023) on line 183, as well as Chu et al. (2020) and Durr et al. (2022) on line 196.

In particular, regarding the novelty of Discriminator Flows, the closest model we are aware of is the one of Heng et al. (2023). We notice in lines 305-306 that their model is a particular case of Discriminator Flows, although the original article did not make the link with discriminators and GANs.

## Experiments and Performance

> Results are much worse for the disc-flow [and the ScoreGAN] method compared to regular GANs and Diffusion and also the best GAN-Hybrid methods [...].

> Results are very limited, with only MNIST and CelebA and 2 baseline methods. [...] more in-depth analyses using different optimizers, initializations for the particles, and architectures on slightly bigger datasets would add a lot.

We understand that better performance and more thorough experiments could only improve the paper and support our framework. However, the reviewer's focus on state-of-the-art experiments differs from the paper's scope, which is mostly theoretical. In our support, we recall the following statement from Reviewer iGU3:

> **Reviewer iGU3:** *As acknowledged by the authors as well, the models are not really contenders for the state of the art at the moment.
> It’s possible (but by no means clear) that they could be significantly improved by a deep dive into engineering the design choices, but at the moment they mostly serve as a reasonable basic validation of the theory.*

## Optimization in the Particle Space

> people don't use [particle GANs] frequently because its hard to optimize in high-dimensional space

> It is well known that optimizing directly in the particle-space is difficult in high dimensions (I sadly do not remember those prior works, but I remember reading this)

We respectfully disagree with the reviewer. We know that gradient descent in particle space is not inherently flawed: after all, diffusion models are also particle models and they are state-of-the-art in image generation (Dhariwal & Nichol, 2021). Furthermore, Heng et al. (2023) (with their model, which we show is a special case of Discriminator Flows) and Fan et al. (2022) also present particle models with competitive performance.

It is indeed known that *some* particle models suffer from the curse of dimensionality, but not all. How the curse of dimensionality affects gradient descent of a functional in the particle space is closely related to how the curse of dimensionality affects the functional itself. Indeed, if the sample complexity of the functional scales badly in the dimension, it is expected that the resulting gradient flow also inherits this curse. A prototypical example of this is given by the 1-Wasserstein distance (Dudley, 1969), which the reviewer may have thought about. However, in our case, the functional is (implicitly) implemented by a neural network, known to be particularly effective in high dimensions. This explains why particle models such as ours can be effective even in high dimensions.

R. M. Dudley. The speed of mean Glivenko-Cantelli convergence. Ann. Math. Statist., 1969.

## Related Work

The reviewer suggests that we discuss the following papers:

1. Wang et al. Diffusion-GAN: Training GANs with diffusion. arXiv, 2022.
2. Xiao et al. Tackling the generative learning trilemma with denoising diffusion GANs. ICLR 2022.
3. Jolicoeur-Martineau et al. Adversarial score matching and improved sampling for image generation. ICLR 2021.

We thank the reviewer for the suggestion. We would like to point out that **we already cite 1. in Section 4.4, line 299** as strongly linked to Score GAN. **We also cite 2. in our introduction.** We will further discuss 2. and 3. in the next version of the manuscript. More specifically, 2. and 3. are not directly related to our unified view of score-based diffusion and GANs as particle models. 2. trains several GANs to successively denoise an image, similarly to diffusion. 3. augments the denoising objective of score-based diffusion models with an adversarial objective to improve the denoised image quality. None of these references leverage the link we make between GANs and particle models.

## Limitations

> The only limitation mentioned is that the generator-less method is slower. They don't mention many limitations. They say the results may be poor because they are too novel and have yet to be fine-tuned. I don't consider those enough.

As most of the reviewers noticed (AQyb, HYQ5, iGU3 and 6P1y), we addressed the limitations of our approach in the paper, including but not limited to the ones the reviewer stated in this quote. We would welcome any suggestion of additional limitations to discuss from the reviewer. Beyond this, we would also like to point out that we addressed potential negative impacts of our work in the supplementary material, Appendix Section D.

## Naming

> 'Score GANs': I would encourage trying to find another name [...]

We would like to thank the reviewer for this remark. We will consider finding a new name and would be happy to hear any suggestion from the reviewer.

# Response to Reviewer iGU3

We would like to thank the reviewer for their feedback.
We concur with the reviewer regarding the weaknesses raised, which are acknowledged in the paper. Notably, we mention that Score GAN can serve as a distillation method in the conclusion. We address the reviewer's questions below.

## Latent Interpolations

We provide examples of interpolations on MNIST for all four tested models in the one-page PDF attached to our global response, Figure 2. We tested four interpolation methods on Gaussian priors considered by Leśniak et al. (2019): linear, spherical, Cauchy-linear, and spherical Cauchy-linear. We reached the same conclusion for each of these methods and thus show in the additional document the result of the most visually appealing one, spherical Cauchy-linear.

We notice that the generator-based models, Score GAN and GAN, show smoother transitions between generated images than the particle models EDM and Discriminator Flow, for which abrupt changes of digit identity and shape can be seen between consecutive interpolation steps. This confirms that interacting particle models (generator-based) can perform feature learning via their smaller latent space, allowing for smoother generation, while particle models operating in the data space are less prone to such a phenomenon. In this regard, Score GAN and Discriminator Flow are no different than their parent method in the same model category. We will add this discussion in the appendix of the next version of our paper.

Leśniak et al. Distribution-Interpolation Trade off in Generative Models. ICLR 2019.

## Theory vs Practice in Algorithm 1

We address the points raised by the reviewer below. As a preamble, we would like to note that these remarks also hold for GANs, regarding the alternating updates of the discriminator and the generator, the use of Adam, and the stochasticity of expectation estimation via mini-batch training. It is then no surprise that Score GANs and GANs share properties like mode collapse, as pointed out in lines 285-286.
### Estimation of the Score

> the estimated score is not trained to completion

We would like to highlight that this is only true at the very beginning of training of Score GANs. Indeed, the score function is continuously optimized as its weights are carried on after each generator update, like a discriminator in GANs. To confirm this, we added an analysis of the training loss of the score function in the one-page PDF attached to our global response, Figure 3. In this experiment, we perform $K = 100$ score network steps in-between generator updates. At the beginning of training, the score function needs a few updates to converge. However, after a low number of generator updates, it is already close to the optimum before its first update.

### On Adam

> I assume you also use something like Adam for training the generator with the update on Line 5 (?) Do you have any insight on whether the theory can accommodate some of these practicalities, or does it come down to intuition and empirics in the end?

We thank the reviewer for the interesting question. In fact, we do use Adam in practice. Theoretically, it is entirely possible to formulate continuous-time equations for Adam. By fixing the values of $(\beta_1, \beta_2)$ and allowing the time step to approach zero, we can recover the continuous version of SignSGD. A relevant example of SignSGD's study within a non-convex context was presented by Bernstein et al. (2018). Furthermore, exploring the scenario where no interaction is present (where $\mathcal{A}_{\theta_t}$ disappears from Eq. (13), as described in lines 156-158) reveals a particle gradient flow with a renormalized gradient. This yields intriguing connections with continuous acceleration, as demonstrated by Wibisono & Wilson (2015). We consider this avenue to be a promising direction for future investigation.

A. Wibisono & A. C. Wilson. On accelerated methods in optimization. arXiv, 2015.\
J. Bernstein et al. signSGD: Compressed optimisation for non-convex problems. ICML 2018.

### Other Factors and Mode Collapse

> Do you have any insight on whether the theory can accommodate some of these practicalities, or does it come down to intuition and empirics in the end? Are things like mode collapse a consequence of inexact adherence to theory?

While various factors, such as the inherent stochasticity of non-full-batch training, could contribute to mode collapse, we suppose that the primary cause lies in the departure from strict theoretical adherence. This departure occurs due to discretization errors introduced during the translation of the gradient flow into a discrete setting. This discrepancy between theory and practice could also be an interesting area for further exploration.

## Typos

We would like to thank the reviewer for highlighting these typos and apologize for the mistakes.

- The first part of Eq. (14) should indeed read as: $[\mathcal{A}_{\theta_t}(z)](V) \triangleq \mathbb{E}_{z' \sim p_z} \left[k_{g_{\theta_t}}(z, z') V(g_{\theta_t}(z'))\right]$.
- All mentions of $x \sim p_{\mathrm{data}}$ in Algorithm 1 should be deleted; we sample from the generated distribution only.
- For the two last typos, the reviewer is correct and they were already corrected in the PDF file of the supplementary material (where the main paper is reproduced).

# Response to Reviewer 6P1y (SHORT VERSION)

We would like to thank the reviewer for their detailed feedback. Due to rebuttal size constraints, we answer the reviewer's main questions below. We will follow up with the rest of the response during the discussion period.

## Training of Discriminator Flows

Discriminator Flows do not attempt to mimic Langevin sampling behavior. The principle of Discriminator Flows is to **make particles follow the same functional gradient $\nabla h_{\rho}$ as particles in GANs**, their corresponding interacting particle model, without the generator smoothing of Eq. (13).
This corresponds to the gradient of the generator's loss that depends on the discriminator (cf. Eq. (20), which relates to Eq. (16) / Finding 5 in GANs). The most direct way to achieve this involves successively learning a discriminator per discretization step of Eq. (20), as follows.

- Sample a batch of $(x_0^b)_b$ i.i.d. from the prior $\pi$.
- For $i = 0$ to $N - 1$:
  - $t = \frac{i}{N}$.
  - Learn a new discriminator $f_{t}$ between the current generated distribution $(x_t^b)_b$ and $p_{\mathrm{data}}$ using a standard discriminator loss from Eq. (15) like WGAN-GP, i.e. like step 6 of Algorithm 3.
  - Create the new batch of samples from the updated distribution following Eq. (20): $x_{t + \frac{1}{N}}^b = x_t^b - \frac{\eta}{N} \nabla_{x_t^b}(c \circ f_t)(x_t^b)$.

However, this is impractical, as the large number of independent networks to learn and the successive learning procedure make learning prohibitively slow. In our proposed version, we aim at simultaneously learning all discriminators to alleviate this issue. To this end, instead of learning multiple independent discriminators to cover all sampling times, we learn in Algorithm 3 a single time-parameterized network $f_{\phi}(\cdot, t)$ on all time steps at once. This may be seen as an alternating scheme, because each update of the discriminator naturally changes all intermediate generated distributions. As specified in step 2 of Algorithm 2, steps 3-4 of Algorithm 3 and lines 263-264, we do so after each update of the discriminator, i.e. $M = 1$.

Overall, Discriminator Flows do not mimic a known sampling behavior a priori. The path that particles take depends on the chosen functions $a$, $b$, and $c$ parameterizing the losses of Eq. (15) and (16). Determining this path for general GAN losses such as WGAN-GP is unfortunately an open problem. Nonetheless, it can be described under simplifying hypotheses.
If $a$, $b$, and $c$ are chosen to implement an $f$-divergence GAN loss, then Discriminator Flows will implement a forward KL divergence gradient flow (Yi et al., 2023); if they correspond to an IPM GAN loss, then Discriminator Flows will implement a squared MMD gradient flow (Franceschi et al., 2022). This makes Discriminator Flows fundamentally different from Denoising Diffusion GANs (Xiao et al., 2022). On the one hand, Denoising Diffusion GANs train time-parameterized GANs to follow the inverse diffusion path. Hence, their discriminators are trained to discriminate between the intermediate generated distribution and the *noisy* data distribution. On the other hand, in Discriminator Flows, discriminators are trained to discriminate between the intermediate generated distribution and the *true*, non-noisy data distribution, thereby learning the path towards $p_{\mathrm{data}}$. ## Integration of the Stochastic Component in our Framework We thank the reviewer for this question, which allows us to develop this aspect of our framework. As the reviewer pointed out, for diffusion models, we focused on the probability flow of Eq. (9) -- which is the true equivalent of Eq. (7), instead of Eq. (8) as stated in most works on diffusion. We put aside the probabilistic formulation of Eq. (7) for purposes of clarity of exposition. Fortunately, it is possible to generalize our framework by taking into account the stochastic component of Eq. (7), as explained in the following. We will add this discussion is the appendix of the updated manuscript. Generally, the following equation shares the same probability path as both Eq. (7) and (9), for any $\alpha \in [0, 1]$: $$ \mathrm{d}x_t = 2 \sigma'(t) \sigma(t) \nabla \log \left[p_{\mathrm{data}} \star k_{\mathrm{RBF}}^{\sigma(t)}\right](x_t) \mathrm{d}t - \alpha \sigma'(t) \sigma(t) \nabla \log \rho_t(x_t) \mathrm{d}t + \sqrt{2 \alpha \sigma'(t) \sigma(t)} \mathrm{d}W_t. $$ This corresponds to an interpolation between Eq. 
(7) and (9) that trades Brownian noise for its deterministic equivalent.

This stochastic component can then be integrated into the formulation of interacting particle models via Eq. (13). The latter equation takes the directions followed by the particles in the particle model and transforms them via the operator $\mathcal{A}_{\theta_t}(z)$. Since this operator is linear, it is possible to integrate a stochastic component into the equation (Klebaner, 2012), allowing us to take into account stochastic particle models:
$$
\mathrm{d}g_{\theta_t}(z) = [\mathcal{A}_{\theta_t}(z)]\left(\nabla h_{\rho_t} \mathrm{d}t + \gamma(t) \mathrm{d}W_t\right),
$$
where $\gamma(t)$ is a scalar function of time. This makes it possible to integrate the stochastic component of diffusion models in Score GANs by interpolating between Gaussian noise and the score of the generated distribution in step 5 of Algorithm 1, similarly to the previous equation using $\alpha$. Nonetheless, this comes with no guarantee on the experimental performance, as we found in preliminary experiments that adding such a stochastic component is often detrimental to the resulting FID. Indeed, to succeed, the chosen gradient vector field followed by the generator must be compatible with the generator architecture (i.e., compatible with the generator preconditioning $\mathcal{A}_{\theta_t}(z)$), which may not be the case with white noise of high variance.

F. C. Klebaner. Introduction to stochastic calculus with applications. Imperial College Press, 2012.

# Response to Reviewer 6P1y (FOLLOW-UP)

The small character limit prevented us from completely answering the reviewer's many and detailed questions. We would like to offer additional elements of discussion in this message to start the discussion phase.

## Training of Score GANs

The reviewer's interpretation is correct.
This score network depends on the chosen noise level $\sigma$, like in EDM, but we sample $\sigma$ at each training step $t$ instead of scheduling it as EDM does. Indeed, $g_{\theta}(z)$ in Algorithm 1 corresponds to $g_{\theta_t}(z)$. The entirety of Algorithm 1 describes a single training iteration $t$. There are $K$ score updates per generator update, similarly to discriminator updates in GANs.

Accordingly, $K$ is an important parameter in Score GANs. First of all, like in GANs, the tuning of $K$ heavily depends on the ratio $r = \frac{\lambda}{\eta}$ between the learning rates of the score network and the generator (Jelassi et al., 2022). A higher ratio may allow us to decrease the necessary number of steps $K$. In our experiments on image data, we use $r \geq 2$ (Appendix, Table 7). However, to gain more empirical intuition, we performed a set of experiments on MNIST with $\lambda$ s.t. $r = 1$, varying the number of steps $K$ from $1$ to $10$. We obtained similar results across this range of values for $K$, close to the values reported in Table 3. This indicates that even low values of $K$ can provide a sufficient approximation of the score of the generated distribution.

To observe this qualitatively, we plotted in Figure 3 of the document attached to our global response the evolution of the score training loss between generator updates for $K=100$. At the beginning of training, $s_{\phi}^{\rho}$ needs around 20 updates to converge. However, after a small number of generator updates, it is already close to the optimum before its first update. This confirms that the continuous update of the score of the generated distribution $s_{\phi}^{\rho}$, like a discriminator, coupled with a learning rate ratio $r > 1$, makes a small number of score updates $K$ between generator updates sufficient.

S. Jelassi et al. Dissecting adaptive methods in GANs. arXiv, 2022.
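To make this alternating scheme concrete, here is a deliberately simplified 1-D Gaussian sketch of the loop structure of Algorithm 1 (illustrative only: the names `mu_data` and `m_hat`, the linear score model, and all hyperparameter values are assumptions of this toy, not our actual implementation).

```python
import numpy as np

# Toy: data distribution N(2, 1); generator g_theta(z) = theta + z, so the
# generated distribution is N(theta, 1). The score of N(m, 1) is -(x - m),
# so the score model of the generated distribution is s_phi(x) = -(x - m_hat).
rng = np.random.default_rng(0)
mu_data = 2.0
theta, m_hat = 0.0, 0.0
eta, lam, K = 0.05, 0.1, 5   # generator lr, score lr, score steps per iteration

for _ in range(400):                      # one loop body = one training iteration t
    x = theta + rng.standard_normal(256)  # generated batch
    for _ in range(K):                    # K score updates (ratio r = lam / eta)
        # implicit score matching for s(x) = -(x - m) has gradient m - E[x]
        m_hat -= lam * (m_hat - x.mean())
    # generator particles follow grad log p_data - s_phi, averaged over the batch
    direction = -(x - mu_data) - (-(x - m_hat))
    theta += eta * direction.mean()

print(round(theta, 2))  # theta has moved close to mu_data
```

Even with small $K$, the score parameter `m_hat` tracks the moving generated mean closely enough for the generator to converge, which mirrors the behavior observed in our MNIST ablation.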
## Further Information on Discriminator Flows

For time embeddings, we followed the implementation of EDM (Karras et al., 2022), which embeds time / noise levels using either Fourier random features (Tancik et al., 2020) or positional encodings (Vaswani et al., 2017).

It is possible to train and infer the model using a second-order solver to solve Eq. (20) (step 2 of Algorithm 2). We experimented with this during the rebuttal period but did not observe a significant enough improvement in terms of NFE. Furthermore, we draw the reviewer's attention to our global response, where we explain that we must retract our claim that Discriminator Flows are more efficient than first-order EDM.

A. Vaswani et al. Attention is all you need. NIPS 2017.
M. Tancik et al. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS 2020.

## Discretization

Studying how discretizing the considered continuous phenomena could affect our formulations is an interesting perspective for future work. We initiate a discussion on this topic below.

Choosing the best discretization method for diffusion models is challenging (Karras et al., 2022): it depends on the chosen $\sigma(t)$, its efficiency is assessed w.r.t. NFE instead of the number of discretization steps as in standard numerical methods, and the final purpose of discretization (generating realistic data) differs from its initial one (approximating a solution to an SDE). Hence, standard approaches like EDM rely on empirical discretization grids and custom solvers tailored to the generation task. Our framework, by identifying the true probability flow PDE of Eq. (9), may help diffusion models cope with discretization errors through the score of the generated distribution (as explained in line 109).

The previous discussion, however, only holds for score-based diffusion models, for which the probability path is known in advance.
For other particle models, this is not possible, and studying the convergence properties of their discretizations, as Arbel et al. (2019) do for MMD, is non-trivial. Adding a generator, as suggested, is an alternative way to solve the underlying particle model PDE. By generalizing the parallel between Wasserstein and Stein gradient flows in Section 3.3, generators can be seen in our framework as a preconditioning of the particle model PDE via the linear operator $\mathcal{A}_{\theta_t}(z)$ in Eq. (13). A well-chosen architecture, adapted to the particle flow $\nabla h$, may speed up and simplify the dynamics towards the data distribution.

## Typos

We will correct typos in the next version. The first part of Eq. (14) should indeed read as: $[\mathcal{A}_{\theta_t}(z)] (V) \triangleq \mathbb{E}_{z' \sim p_z} \left[k_{g_{\theta_t}}(z, z') V(g_{\theta_t}(z'))\right]$.

# Response to Reviewer 6P1y (FULL VERSION)

We would like to thank the reviewer for their detailed feedback. We answer their comments below.

## Clarifications and Insights on Training Algorithms / Questions

We apologize for the confusion regarding the training algorithms. We answer your requests for clarification below. We will improve the presentation in the next version of the paper, and we will include the additional elements of discussion addressed in this response. We suggest that the reviewer refer to the complete paper PDF in the supplementary material, where some typos (pointed out by other reviewers as well) in Section 4 were corrected.

### Training of Score GANs

> First, in score GANs, if I understand correctly, the score of the generator would be required at each training step $t$, to approximate the score of a sampling iteration index $t$, correct?

The reviewer's interpretation is correct. This score network explicitly depends on the chosen noise level $\sigma$, like in EDM (Karras et al., 2022).
We choose to sample $\sigma$s at each training step $t$ instead of choosing a schedule for $\sigma$ like in particle diffusion models, in line with the reviewer's comment.

> From Algorithm 1, I understand that $s_{\phi}^{\rho}$ is trained for K steps on a DMS loss, when the iterations on $t$ are replaced with noise levels $\sigma$, starting with $x = g_{\theta}(z)$. Is this g_{\theta_t}(z)?

Indeed, $g_{\theta}(z)$ in Algorithm 1 corresponds to $g_{\theta_t}(z)$. We omitted $t$ for better readability, but will include it again to improve clarity. The entirety of Algorithm 1 describes a single training iteration $t$.

> Does this mean that there are K steps of update per generator update?

Yes, it does, similarly to GANs having K steps of discriminator updates per generator update.

> How large would K need to be to estimate the generator score accurately? Have the authors considered ablation experiments on this, at least say, on Gaussian data? I suspect the need for larger K as the data complexity increases.

As the reviewer suggests, $K$ is an important parameter in Score GANs, like in GANs. First of all, we would like to point out that, like in GANs, the tuning of $K$ heavily depends on the ratio $\frac{\lambda}{\eta}$ (cf. Algorithm 1) between the learning rates of the score network and the generator (Jelassi et al., 2022). A higher ratio may allow us to decrease the necessary number of steps $K$. In our experiments on image data, we use $\frac{\lambda}{\eta} \geq 2$ (Appendix, Table 7). However, to gain more empirical intuition, we performed a set of experiments on MNIST with $\lambda$ s.t. $\frac{\lambda}{\eta} = 1$, varying the number of steps $K$ from $1$ to $10$. We obtained similar results across this range of values for $K$, close to the values reported in Table 3. This indicates that even low values of $K$ can provide a sufficient approximation of the score of the generated distribution.
To observe this qualitatively, we provide in the one-page PDF attached to our global response, Figure 3, a plot of the evolution of the score training loss between generator updates for $K=100$. At the beginning of training, $s_{\phi}^{\rho}$ needs around 20 updates to converge. However, after a small number of generator updates, $s_{\phi}^{\rho}$ is already close to the optimum before its first update. This confirms that the continuous update of the score of the generated distribution $s_{\phi}^{\rho}$, like a discriminator, coupled with a learning rate ratio $\frac{\lambda}{\eta} > 1$, makes a small number of score updates $K$ between generator updates sufficient.

S. Jelassi et al. Dissecting adaptive methods in GANs. arXiv, 2022.

### Training of Discriminator Flows

> If my understand is correct, the explanations provided in the appendix indicate that the discriminator, provided with batches of data, considers random $t$-step Langevin sampling, and is trained such that the discriminator mimics this sampling behavior using a chosen loss?

It appears there is a confusion in the design principles of Discriminator Flows, which we try to clarify in the following. Discriminator Flows do not attempt to mimic Langevin sampling behavior. The principle of Discriminator Flows is to **make particles follow the same functional gradient $\nabla h_{\rho}$ as particles in GANs**, their corresponding interacting particle model, without the generator smoothing of Eq. (13). This corresponds to the gradient of the generator's loss that depends on the discriminator (cf. Eq. (20), which relates to Eq. (16) / Finding 5 in GANs). The most direct way to achieve this involves successively learning a discriminator per discretization step of Eq. (20), as follows.

- Sample a batch of $(x_0^b)_b$ i.i.d. from the prior $\pi$.
- For $i = 0$ to $N - 1$:
  - $t = \frac{i}{N}$.
  - Learn a new discriminator $f_{t}$ between the current generated distribution $(x_t^b)_b$ and $p_{\mathrm{data}}$ using the discriminator loss of Eq. (15).
  - Create the new batch of samples from the updated generated distribution following Eq. (20): $x_{t + \frac{1}{N}}^b = x_t^b - \frac{\eta}{N} \nabla_{x_t^b}(c \circ f_t)(x_t^b)$.

Inference would then amount to discretizing Eq. (20) using all learned discriminators $f_0, f_{\frac{1}{N}}, \ldots, f_{\frac{N-1}{N}}$. However, this is impractical: the large number of independent networks and the successive learning procedure make learning prohibitively slow. In our proposed version, we aim to learn all discriminators simultaneously to alleviate this issue. To this end, instead of learning multiple independent discriminators to cover all sampling times, we learn a single time-parameterized network $f_{\phi}(\cdot, t)$ on all time steps at once. This is how we obtain Algorithm 3.

Overall, Discriminator Flows do not mimic a known sampling behavior a priori. The path that particles take depends on the chosen functions $a$, $b$, and $c$ parameterizing the losses of Eq. (15) and (16). Determining this path for general GAN losses is unfortunately an open problem. Nonetheless, it can be described under simplifying hypotheses. If $a$, $b$, and $c$ are chosen to implement an $f$-divergence GAN loss, then Discriminator Flows implement a forward KL divergence gradient flow (Yi et al., 2023); if they correspond to an IPM GAN loss, then Discriminator Flows implement a squared MMD gradient flow (Franceschi et al., 2022).

We hope this clarifies the motivation and design of the Discriminator Flow learning algorithm. This allows us to answer the reviewer's follow-up questions.

> What step 6 in Algorithm 3 actually simplifies to, is unclear. It might help to add some intuition on what this loss amounts to minimizing, in the GAN context, (assuming the WGAN-GP loss, as the authors do, as per Table 4).
Step 6 in Algorithm 3 corresponds to training the discriminator, for each sampling time $t = \frac{i_b}{N}$, to discriminate between the intermediate generated distribution $\rho_t$ (composed of the samples $x_t^b$) and $p_{\mathrm{data}}$ (composed of the $x_b$s), using a standard GAN discriminator loss from Eq. (15). Therefore, in the tested models using the WGAN-GP loss, step 6 amounts to training a WGAN-GP discriminator between the intermediate generated distribution $\rho_t$ and $p_{\mathrm{data}}$, as in a standard GAN.

> Again, would this be an alternating scheme? Since is part of the sampler in lines 3-5, Algo. 3, would we need to reevaluate these samples after ever M steps of update on the discriminator? Is there an interplay between N and the number we would need for M?

Indeed, we may say that Discriminator Flows rely on an alternating scheme, because updating the discriminator naturally changes all intermediate generated distributions. Therefore, the reviewer is right when stating that we need to reevaluate samples. As specified in step 2 of Algorithm 2, steps 3-4 of Algorithm 3 and lines 263-264, we do so after each update of the discriminator, i.e. $M = 1$. We may consider increasing $M$ as suggested by the reviewer, but leave this for future work on decreasing the computational cost of Discriminator Flows.

### Practical Implementation of the Discriminator in Discriminator Flows

#### Time Embeddings

For time embeddings, we followed the implementation of EDM (Karras et al., 2022), which embeds time / noise levels using either Fourier random features (Tancik et al., 2020) or positional encodings (Vaswani et al., 2017).

A. Vaswani et al. Attention is all you need. NIPS 2017.
M. Tancik et al. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS 2020.
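For concreteness, a minimal sketch of such a random Fourier feature time embedding, in the spirit of Tancik et al. (2020), could look as follows; the dimension and frequency scale below are illustrative choices, not the values of our implementation.

```python
import numpy as np

def fourier_time_embedding(t, dim=64, scale=16.0, seed=0):
    """Embed a scalar time / noise level with random Fourier features.

    The random frequencies are drawn once from a fixed seed and kept
    frozen during training, as in Tancik et al. (2020); `dim` and
    `scale` are illustrative hyperparameters.
    """
    rng = np.random.default_rng(seed)
    freqs = rng.standard_normal(dim // 2) * scale  # fixed random frequencies
    angles = 2.0 * np.pi * freqs * t
    return np.concatenate([np.cos(angles), np.sin(angles)])

emb = fourier_time_embedding(0.5)  # a 64-dimensional embedding of t = 0.5
```

The conditioned network then typically injects this embedding into its hidden activations (e.g. added after a small MLP), which is how the single network $f_{\phi}(\cdot, t)$ covers all sampling times at once.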
#### Relationship to Denoising Diffusion GANs

> how different would the proposed discriminator be, from the formulation of Denoising Diffusion GANs (Xiao et al., ICLR 2022)?

According to our understanding, the discriminator in Denoising Diffusion GANs is fundamentally different from the one in Discriminator Flows. On the one hand, Denoising Diffusion GANs train time-parameterized GANs to successively denoise an image, following the inverse diffusion path. Hence, their discriminators are trained to discriminate between the intermediate generated distribution and the *noisy* data distribution. On the other hand, Discriminator Flows do not mimic the inverse diffusion path but rather learn the particle path via the discriminator. In Discriminator Flows, discriminators are trained to discriminate between the intermediate generated distribution and the *true*, non-noisy data distribution.

#### Second-Order Methods

> I get that the performance of the discriminator flow is one par with second order methods, but can we say more about this? Would it be feasible to explore training the discriminator with samples generated via second order discretization?

It is possible to train and infer Discriminator Flows using a second-order solver, by changing the discretization method for Eq. (20) (step 2 of Algorithm 2). We experimented with this during the rebuttal period but unfortunately did not observe a significant enough improvement in terms of NFE. Furthermore, we draw the reviewer's attention to our global response, where we explain that we must retract our claim that Discriminator Flows are more efficient than first-order EDM.

## Discretization of Generative Differential Equations

We agree that it would be interesting to study how discretizing the studied continuous phenomena could affect our formulations. We leave a complete study for future work. Nonetheless, we will add the following elements of discussion in the updated version of the manuscript.
Choosing the best discretization method for diffusion models is a challenging task (Karras et al., 2022). First, it depends on the chosen noise schedule $\sigma(t)$. Second, unlike with standard numerical methods, the efficiency of the solver and discretization grid should be assessed w.r.t. NFE instead of the number of discretization steps. Third, the final purpose of discretization (generating realistic data) differs from its initial purpose (approximating a solution to a differential equation). For this reason, state-of-the-art approaches like EDM rely on empirical discretization grids and custom solvers tailored to the generation task. In this regard, our framework, by identifying the true probability flow PDE of Eq. (9), may help diffusion models cope with discretization errors, which could be analyzed and handled through the score of the generated distribution (as explained in line 109).

The previous discussion, however, is only valid for score-based diffusion models, for which the probability path is known in advance. For other particle models, this is not possible, and studying the convergence properties of discretization schemes for one such model, as Arbel et al. (2019) do for MMD, is non-trivial. The addition of a generator, as the reviewer suggests, is an alternative way to solve the underlying particle model PDE. By generalizing the parallel between Wasserstein and Stein gradient flows that we analyze in Section 3.3, generators can be seen in our framework as a preconditioning of the particle model PDE via the linear operator $\mathcal{A}_{\theta_t}(z)$ in Eq. (13). A well-chosen architecture, well adapted to the particle flow $\nabla h$, may considerably speed up and simplify the dynamics towards the data distribution.

## Integration of the Stochastic Component in our Framework

We thank the reviewer for this question, which allows us to develop this aspect of our framework.
As the reviewer pointed out, for diffusion models, we focused on the probability flow of Eq. (9) -- which is the true equivalent of Eq. (7), instead of Eq. (8) as stated in most works on diffusion. We put aside the probabilistic formulation of Eq. (7) for clarity of exposition. Fortunately, it is possible to generalize our framework to take into account the stochastic component of Eq. (7), as explained in the following. We will add this discussion in the appendix of the updated manuscript.

Generally, the following equation shares the same probability path as both Eq. (7) and (9), for any $\alpha \in [0, 1]$:
$$
\mathrm{d}x_t = 2 \sigma'(t) \sigma(t) \nabla \log \left[p_{\mathrm{data}} \star k_{\mathrm{RBF}}^{\sigma(t)}\right](x_t) \mathrm{d}t - \alpha \sigma'(t) \sigma(t) \nabla \log \rho_t(x_t) \mathrm{d}t + \sqrt{2 \alpha \sigma'(t) \sigma(t)} \mathrm{d}W_t.
$$
This corresponds to an interpolation between Eq. (7) and (9) that trades Brownian noise for its deterministic equivalent.

This stochastic component can then be integrated into the formulation of interacting particle models via Eq. (13). The latter equation takes the directions followed by the particles in the particle model and transforms them via the operator $\mathcal{A}_{\theta_t}(z)$. Since this operator is linear, it is possible to integrate a stochastic component into the equation (Klebaner, 2012), allowing us to take into account stochastic particle models:
$$
\mathrm{d}g_{\theta_t}(z) = [\mathcal{A}_{\theta_t}(z)]\left(\nabla h_{\rho_t} \mathrm{d}t + \gamma(t) \mathrm{d}W_t\right),
$$
where $\gamma(t)$ is a scalar function of time. This makes it possible to integrate the stochastic component of diffusion models in Score GANs by interpolating between Gaussian noise and the score of the generated distribution in step 5 of Algorithm 1, similarly to the previous equation using $\alpha$.
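To make the discretization of the interpolated equation concrete, a single Euler--Maruyama step could be sketched as follows (a sketch under stated assumptions: the two score arguments are placeholders for $\nabla \log [p_{\mathrm{data}} \star k_{\mathrm{RBF}}^{\sigma(t)}]$ and $\nabla \log \rho_t$, which are networks in practice, and all names are ours).

```python
import numpy as np

def em_step(x, t, dt, alpha, sigma, dsigma, score_data, score_gen, rng):
    """One Euler--Maruyama step of the interpolated SDE.

    `score_data` and `score_gen` stand in for the smoothed data score and
    the score of the generated distribution (placeholder callables).
    alpha = 0 yields a purely deterministic step; alpha = 1 recovers the
    fully stochastic formulation.
    """
    g = dsigma(t) * sigma(t)                          # sigma'(t) * sigma(t)
    drift = 2.0 * g * score_data(x, t) - alpha * g * score_gen(x, t)
    noise = np.sqrt(2.0 * alpha * g * dt) * rng.standard_normal(x.shape)
    return x + drift * dt + noise

# Deterministic sanity check (alpha = 0): pure drift along the data score.
x0 = np.ones(3)
x1 = em_step(x0, t=1.0, dt=0.01, alpha=0.0,
             sigma=lambda t: t, dsigma=lambda t: 1.0,
             score_data=lambda x, t: -x, score_gen=lambda x, t: -x,
             rng=np.random.default_rng(0))
# with these choices: drift = -2 * x0, so x1 = 0.98 * x0
```

Intermediate values of $\alpha$ then trade the Brownian increment for its deterministic score-based equivalent, which is the interpolation we use in step 5 of Algorithm 1.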
Nonetheless, this comes with no guarantee on the experimental performance, as we found in preliminary experiments that adding such a stochastic component is often detrimental to the resulting FID. Indeed, to succeed, the chosen gradient vector field followed by the generator must be compatible with the generator architecture (i.e., compatible with the generator preconditioning $\mathcal{A}_{\theta_t}(z)$), which may not be the case with white noise of high variance.

F. C. Klebaner. Introduction to stochastic calculus with applications. Imperial College Press, 2012.

## Minor Comments

- We will mention what $p_{\sigma}$ is in the main text.
- NFE indeed stands for number of function evaluations.
- The first part of Eq. (14) should indeed read as: $[\mathcal{A}_{\theta_t}(z)](V) \triangleq \mathbb{E}_{z' \sim p_z} \left[k_{g_{\theta_t}}(z, z') V(g_{\theta_t}(z'))\right]$.
