## Global response ##
We thank all reviewers for their time and valuable feedback, and appreciate that reviewers found our work novel and relevant.
We have addressed each reviewer's questions and concerns individually below.
Please let us know if there are any further questions.
Best wishes,
The Authors
## TODOs low prio ##
- make nice figure as requested by Reviewer CPFx
## Reviewer CPFx ##
**Notes / Weaknesses mentioned**
> How useful is “extracting relative reward functions from two diffusion models”? Applications here don’t seem quite compelling.
R1: Two potential use cases our work attempts to demonstrate are:
(1) Extracting reward functions from decision-making diffusion models: Obtaining a reward function allows for interpreting behavioral differences, for composing and manipulating reward functions, and for either training agents from scratch or fine-tuning existing policies.
(2) Better understanding diffusion models by contrasting them: The biases of large models trained on different datasets are not always obvious, and our method aids interpretability and auditing of models by revealing the differences between the outputs they are producing.
> Diffuser already introduce the idea of extracting the cumulative reward function. When is relative reward superior?
R2: We believe there is a misunderstanding here. The Diffuser paper [A] does not extract reward functions. In fact, the MLP network used in Diffuser to predict the cumulative reward of a trajectory is trained using the ground-truth reward. Our method, in contrast, extracts the relative reward function without assuming access to the ground-truth reward.
> Limited numbers of baselines – e.g. no comparison to other RRF techniques, or BC, or Janner et al numbers. Janner et al compare with many baselines [..]. No quantifiable comparison with Diffuser?
R3: We are not aware of any other RRF techniques -- we kindly invite the reviewer to point to any references, and we will consider them. Note that the baselines used in Diffuser [A] do not apply -- our problem setting is distinct from that in Diffuser (we extract reward functions, whereas they produce policies), so they are not comparable.
> No comparison to reward heatmaps learned in other ways, such as using method of Janner et al*
R4: We show learned reward maps in Figures 3 and 6. As mentioned above, Janner [A] does not learn reward heatmaps.
**Questions**
> Is there any concept of “relative reward function” in existing literature?
R5: Generally, learned reward functions are never absolute, as optimal policies are invariant to certain transformations of the reward function (see Definition 3.3 in [C]). Previous work [C] studies how to quantify differences of given reward functions, but does not extract reward functions from demonstrations.
> L264 why are 32 expert models required? L263 why are 8 expert models required?
R6: In the Maze2D experiments, we evaluate our method’s ability to model diverse sets of 8 goals for each maze configuration. As such, for each maze, 8 expert models are needed; one for each goal position. Since there are 4 mazes in total, this gives 32 expert models across all mazes. As we run this for 5 random seeds, this gives a total of 160 expert models.
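In summary:
$$\underbrace{4}_{\text{mazes}} \times \underbrace{8}_{\text{goals per maze}} = 32 \ \text{expert models}, \qquad 32 \times \underbrace{5}_{\text{seeds}} = 160 \ \text{expert models in total}.$$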
> Why is access to the original dataset required for Algorithm 1?
R7: Generally, our problem setting assumes that two diffusion models are given. However, training the relative reward function requires input samples. In the Stable Diffusion experiments, we show how to obtain these samples from the pre-trained diffusion models themselves, while in Maze2D and Locomotion we simply use the given datasets.
> Is no steering occurring in Section 5.1, on Maze2D?
R8: That is correct -- no steering is performed in Section 5.1. In Maze2D's low-dimensional state space, we can evaluate reward functions directly, unlike in the high-dimensional Locomotion environments. We therefore chose to assess the learned reward functions directly rather than use them to guide the base model.
We conducted additional experiments, which show that agents can be trained from scratch in Maze2D using our extracted reward function. Agents trained this way with PPO [D] achieve 73.72% of the reward obtained by agents trained with the ground-truth reward function, underscoring the robustness of our extracted function.
Average performance of agents trained with different reward functions (note that a random policy obtains near zero reward):
| Environment | Groundtruth Reward | Relative Reward (Ours) |
|-------------|--------------- |------------------------|
| OpenMaze | 92.89 ± 11.79 | 76.45 ± 19.10 |
| UMaze | 94.94 ± 8.89 | 74.52 ± 18.32 |
| MediumMaze | 423.21 ± 51.30 | 276.10 ± 65.21 |
| LargeMaze | 388.76 ± 121.39 | 267.56 ± 98.45 |
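For concreteness, a minimal sketch of the training setup, assuming a frozen extracted reward network `reward_net` and a helper `make_maze_env()` returning a flat-observation Maze2D environment (both names are illustrative; this is a simplified sketch rather than the exact code used to produce the numbers above):

```python
import gymnasium as gym
import numpy as np
import torch
from stable_baselines3 import PPO

class LearnedRewardWrapper(gym.Wrapper):
    """Replaces the environment reward with the frozen, extracted relative reward."""

    def __init__(self, env, reward_net):
        super().__init__(env)
        self.reward_net = reward_net.eval()

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        with torch.no_grad():
            # reward_net is assumed to map a flat (state, action) vector to a scalar
            x = torch.as_tensor(np.concatenate([obs, action]), dtype=torch.float32)
            reward = float(self.reward_net(x))
        return obs, reward, terminated, truncated, info

env = LearnedRewardWrapper(make_maze_env(), reward_net)
agent = PPO("MlpPolicy", env, verbose=1)
agent.learn(total_timesteps=1_000_000)
```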
> L334 why use “the dataset to generate sets of image embeddings rather than actual images”?
R9: Our method for reward learning leverages the gradual sampling process of diffusion models. In the case of Stable Diffusion and Safe Stable Diffusion, this process happens in latent space, which is why we only need the image embeddings to train our model. This also simplifies the optimization problem, as the embeddings are lower dimensional than the decoded images.
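To make the dimensionality argument concrete: Stable Diffusion operates on 4×64×64 latents (16,384 values) rather than 3×512×512 images (786,432 values), so the reward model can be a small network over the embeddings. A minimal sketch (the architecture shown is illustrative, not the exact network from the paper):

```python
import math
import torch
import torch.nn as nn

class LatentRewardHead(nn.Module):
    """Small reward model over Stable Diffusion latent embeddings (4 x 64 x 64)."""

    def __init__(self, latent_shape=(4, 64, 64), hidden=512):
        super().__init__()
        in_dim = math.prod(latent_shape)  # 16,384 vs. 786,432 for decoded 512x512 RGB images
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),  # scalar relative reward
        )

    def forward(self, z):  # z: (batch, 4, 64, 64) latent embeddings
        return self.net(z).squeeze(-1)
```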
> do not discuss any limitations of their work, and they do not discuss any potential negative societal impact.
R10: Please refer to the Conclusion section, which discusses the limitations. We have now added an additional sentence stating that we expect the potential positive effects of enabling the learning of relative reward functions of large pre-trained models, e.g., for understanding and aligning generated outputs, to outweigh the potential negative effects.
> No checklist.
R11: This year, the checklist is submitted on OpenReview rather than attached to the paper; we kindly refer the reviewer to the submission page to view it.
**Conclusion**
Given the clarifications around the distinction of our method from Janner's work (Diffuser) and the additional experiments demonstrating that the learned reward function can be used to train agents from scratch, we kindly ask the reviewer to consider revising the review score.
**Citations:**
[A] "Planning with Diffusion for Flexible Behavior Synthesis", Janner et al., ICML 2022
[B] "Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models", Schramowski et al., CVPR 2023
[C] "Quantifying differences in reward functions", Gleave et al., ICLR 2021
[D] "Proximal Policy Optimization", Schulman et al. 2017
## Reviewer qcPp ##
**Notes/weaknesses**
> .. explanation or justification to a *relative* reward function ..
R1: The motivation behind the setting is that the *base* model represents general, unguided exploratory behavior (which could intuitively be thought of as a high-entropy exploration policy in the absence of a reward function). In that case, the learned relative reward function can be thought of as an absolute reward function (as shown in Maze2D).
Another justification is that learning absolute reward functions is generally ill-posed, as optimal policies are invariant to certain transformations of the reward function (see Definition 3.3 in [A] for more details).
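As a concrete instance of such a transformation, potential-based shaping leaves optimal policies unchanged for any potential function $\Phi$ (this is one of the transformations covered by Definition 3.3 in [A]):
$$R'(s, a, s') = R(s, a, s') + \gamma\,\Phi(s') - \Phi(s),$$
so at best an equivalence class of reward functions -- i.e., a relative notion of reward -- can be identified from behavior.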
> .. experiments primarily assess directional effect of the reward guidance ..
R2: Utilizing stochastic gradient descent, we might not reach the global minimum in parameter space. However, we prove that the global minimum of the optimization objective in function space, stated in Equation 14, approximates the relative reward function that "most closely" matches the optimal relative reward gradient.
Further improving the proposed optimization procedure constitutes an interesting direction for future work. Given the new experiments demonstrating that the learned reward functions allow training agents from scratch (see below), we believe our method yields good results in practice even with the current procedure.
**Questions**
> It's not clear to me whether the learned per-step reward function can indeed be used as a reward function beyond guiding diffusion models[...].
R3: We have added an additional evaluation showing that the learned reward functions allow agents to be successfully trained from scratch in Maze2D, achieving 73.72% of the performance of agents trained with the ground-truth reward (see table below).
> I would like to see results comparing the performance of expert model vs base guided with reward.
R4: We would like to point out that we also compare to a baseline using Discriminator guidance (see Table 1). We have now also added the performance of the expert model to the main paper (achieving > 95% on all three environments), but we do not consider this a comparable baseline, as its performance does not result from an extracted reward function.
> It would be super interesting to see a comparison of reward learning from this method versus IRL or other classical methods given a common dataset of expert demonstrations / examples. Do you have a hypothesis about the relative strengths of reward learning via diffusion vs RL?
R5: We've now benchmarked our method against AIRL [B] in Maze2D and also included results from training with PPO [C] using the ground-truth reward. For a fair comparison, we ran a grid search over four values each for four relevant hyperparameters. Our approach notably outperforms AIRL (see the table below). However, it is worth noting that IRL assumes both environment access and a dataset of expert demonstrations. In contrast, our method only assumes pre-trained diffusion models, without needing environment access.
Average performance of agents trained with different reward functions (note that a random policy obtains near zero reward):
| Environment | Groundtruth Reward | Relative Reward (Ours) | AIRL |
|-------------|-- |--|--|
| OpenMaze | 92.89 ± 11.79 | 76.45 ± 19.10 | 53.42 ± 33.75 |
| UMaze | 94.94 ± 8.89 | 74.52 ± 18.32 | 69.62 ± 29.75 |
| MediumMaze | 423.21 ± 51.30 | 276.10 ± 65.21 | 175.49 ± 133.79 |
| LargeMaze | 388.76 ± 121.39 | 267.56 ± 98.45 | 139.59 ± 137.79 |
> Is there any way to implement your method using classifier-free guidance?
R6: Our method is agnostic to the architecture of the diffusion models, as pointed out in L47-L48. In fact, the application of our method to Stable Diffusion (see Section 5.3) involves classifier-free guidance: Safe Stable Diffusion uses a modified form of classifier-free guidance to produce safe images.
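For reference (standard classifier-free guidance notation, not notation from our paper), the guided denoiser combines the conditional and unconditional predictions as
$$\tilde{\epsilon}_\theta(x_t, y) = \epsilon_\theta(x_t) + w\,\bigl(\epsilon_\theta(x_t, y) - \epsilon_\theta(x_t)\bigr),$$
and our method only requires the resulting denoising outputs of the base and expert models at each step, regardless of whether they are produced with classifier guidance or classifier-free guidance.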
> What would happen if you guided the expert model with the relative reward? Or an intermediate model that captures some knowledge from expert but not all? Would the reward guidance echo (overweight) the effect of the already-learned features or would it have no effect?
R7: We've added the experiments in the table below. Using the learned reward function, the medium-expert model improves by 7.34%, and the expert model improves by 0.76% (details omitted due to space constraints). This shows that the reward function boosts the performance of varied base models, notably the medium-expert and the base model, without compromising the performance of near-optimal models like the expert.
Performance in unsteered and RRF steered scenario for medium-performance base diffusion model:
| Environment | Unsteered | Reward (Ours) |
|-------------|---------------|-------------------|
| Halfcheetah | 59.41 ± 0.87 | 69.32 ± 0.80 |
| Hopper | 58.80 ± 1.01 | 64.97 ± 1.15 |
| Walker2d | 96.12 ± 0.92 | 102.05 ± 1.15 |
| Mean | 71.44 | 78.78 |
> In the first sentence under section 4 (methods) [...] is y the optimality variable of the entire trajectory?
R8: Yes, $y$ corresponds to $\mathcal{O}_{1:T}$, i.e. the vector of optimality variables for all timesteps of the trajectory. This connection is explored in the "Planning with Diffusion" section (L146-L154).
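In the standard control-as-inference formulation, the optimality variables are typically defined via
$$p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp\bigl(r(s_t, a_t)\bigr), \qquad y = \mathcal{O}_{1:T},$$
so conditioning on $y$ concentrates the trajectory distribution on high-reward behavior.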
**Conclusion:**
We would like to politely ask whether the reviewer would be willing to update the score, given the added experimental results showing favorable performance compared to AIRL and the ability to use the learned reward function to train agents from scratch. We will also be happy to address any remaining questions.
**Citations:**
[A] "Quantifying differences in reward functions", Gleave et al., ICLR 2021
[B] "Learning Robust Rewards with Adversarial Inverse Reinforcement Learning", Fu et al., ICLR 2018
[C] "Proximal Policy Optimization", Schulman et al. 2017
## Reviewer rnSj ##
**Notes/ Weaknesses mentioned**
> The paper assumes that the two diffusion models have the same architecture and are trained on the same data distribution. This may limit the applicability of the method to scenarios where the diffusion models have different architectures or are trained on different data distributions.
R1: This is not actually the case -- our method does not assume that the models have the same architecture, as we point out in L45-L48. Our problem setting requires two models trained on different distributions; the models only need to have the same output dimension.
Applying our method to scenarios where the base and expert model have different output dimensions, however, would constitute an interesting direction for future research.
> The paper does not provide any ablation studies on the hyperparameters of the proposed method.
R2: We already provide an ablation with respect to the size of the dataset used to train the diffusion models in Appendix F.0.1. Due to computational constraints, we did not run additional ablation studies over all hyperparameters (note that we trained 180 diffusion models for Maze2D and 30 for the Locomotion environments). We have now conducted additional ablations with respect to t_stopgrad and the guide scale in the Locomotion environments, using one decreased and one increased value each (hence four additional experiments per environment) relative to the optimal values reported in the paper. We have added these experiments to the appendix and report the lower bound, mean, and upper bound in the table below, which indicate the robustness of our results.
Performance bounds for ablation of t-stopgrad and guide scales:
| Environment | Lower bound | Mean | Upper bound |
|-------------|---------------|----------------|---------------|
| Halfcheetah | 30.24 ± 0.39 | 30.62 | 31.5 ± 0.35 |
| Hopper | 19.65 ± 0.27 | 22.32 | 25.03 ± 0.65 |
| Walker2d | 31.65 ± 0.92 | 34.35 | 38.14 ± 1.08 |
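For clarity on where these two hyperparameters enter, the following is a schematic sketch of a reward-gradient guidance step during sampling (variable names are illustrative and this is a simplification of the actual sampling code):

```python
import torch

def guided_mean(base_mean, x_t, t, reward_net, guide_scale, t_stopgrad):
    """Shift the predicted denoising mean along the gradient of the learned relative reward.

    guide_scale scales the reward gradient; for t < t_stopgrad the guidance is
    disabled so that the final denoising steps remain untouched.
    """
    if t < t_stopgrad:
        return base_mean
    x = x_t.detach().requires_grad_(True)
    reward = reward_net(x).sum()
    grad = torch.autograd.grad(reward, x)[0]
    return base_mean + guide_scale * grad
```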
We further conducted additional experiments in the Locomotion domains in which we steer a **new, unseen** (during training) medium-performance diffusion model, using the relative reward function learned from the base diffusion model and the expert diffusion model. We find that our learned reward function significantly improves performance, by 7.34% on average, which is even larger than the performance increase of 4.61% found in the experiments in the main paper, underlining the robustness of our method.
Performance in unsteered and RRF steered scenario for medium-performance base diffusion model:
| Environment | Unsteered | Relative Reward (Ours) |
|-------------|---------------|-------------------|
| Halfcheetah | 59.41 ± 0.87 | 69.32 ± 0.80 |
| Hopper | 58.80 ± 1.01 | 64.97 ± 1.15 |
| Walker2d | 96.12 ± 0.92 | 102.05 ± 1.15 |
| Mean | 71.44 | 78.78 |
> The paper does not provide any qualitative analysis or visualization of the learned reward functions.
R3: Figures 3 and 6 already show a qualitative analysis of the learned reward functions for Maze2D. The visualized points are colored by the learned reward value.
Note that such a visualization is not possible in the Locomotion environments due to their high dimensionality (in which case we instead measure performance improvements when steering the base models).
**Questions**
> Can this method be used to reduce images with bad quality in the diffusion process? How?
R4: We have not considered this application yet. We believe that a general “image quality improvement reward function” might in principle be learnable from two diffusion models that produce low and high-quality images respectively. This reward function could then be used to improve the image quality of other, potentially more domain-specific image-generation diffusion models. This constitutes an interesting direction for future work.
**Conclusion**
We would like to politely ask the reviewer to consider updating the score, given the additional experiments that demonstrate the robustness of our method and considering the qualitative analysis presented in the paper. We will be happy to answer any additional questions.
## Reviewer TdMi ##
**Weaknesses/ Questions**
>What's the benefit of using diffusion models rather than other probabilistic generative models, e.g. VAE, FLOW and GAN, for reward function extraction?
R1: Diffusion models produce samples by gradual denoising, as opposed to VAEs and GANs, which generate samples in a single forward pass. The denoising network can thus be seen as a vector field that continuously steers samples during generation. This interpretation allows us to extract a relative reward by considering the difference between these vector fields. Such an approach would not be feasible with VAEs or GANs.
Furthermore, diffusion models have recently been successfully applied to large-scale generative modeling, as well as decision making, allowing for broad and interesting applications of any methods focused on them.
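As a schematic illustration of this vector-field view (names and signatures are illustrative; this is not the exact objective from the paper, and sign/scaling conventions are omitted), the relative reward network can be trained so that its input gradient matches the difference between the two models' denoising outputs:

```python
import torch
import torch.nn.functional as F

def rrf_training_step(reward_net, base_model, expert_model, x_t, t, optimizer):
    """One gradient-matching step: the gradient of the learned reward is pushed
    towards the difference between the expert's and the base model's vector fields."""
    with torch.no_grad():
        target = expert_model(x_t, t) - base_model(x_t, t)  # difference of denoising outputs
    x = x_t.detach().requires_grad_(True)
    reward = reward_net(x, t).sum()
    grad = torch.autograd.grad(reward, x, create_graph=True)[0]
    loss = F.mse_loss(grad, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```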
>The drawback of using diffusion models is not mentioned. Will diffusion models slowdown the reward function calculation because iterative denoising is super time-consuming. Can the proposed method be applied to real-time applications?
R2: The learned reward function is a feed-forward network, and not a diffusion model, hence it has much faster inference and allows real-time applications. Furthermore, even though diffusion-model-based policies may be slower due to the denoising process, this does not slow down learning the relative reward function, as all denoising steps are utilized (and not only the final denoised sample).
**Conclusion**:
Given these clarifications and in light of the other reviews, we wanted to kindly ask the reviewer to consider updating the score. Please let us know of any further questions.
## Reviewer NVyM ##
**Weaknesses/ Notes**
> The experiments lack proper evaluation on classifier guidance except for the locomotion tasks. As stated in the problem setting, the objective is to extract the reward function that can steer the base model to the expert model through classifier guidance, so this evaluation should be included in more domains.
R1: We would like to point out that the ultimate goal is to learn the relative reward function, which we prove is equivalent to finding the reward function that would allow for "most closely" steering the base model into the expert model.
As the low-dimensional state of Maze2D allows a quantitative and qualitative evaluation of the learned reward functions (in contrast to the high-dimensional Locomotion environments), we directly evaluate the reward functions instead of using them to steer the base model.
We have now conducted additional experiments that demonstrate that the learned reward function in Maze2D can be used to train agents from scratch with PPO [A], achieving 73.72% of the performance of an agent trained with the ground truth reward function (see table below).
Average performance of agents trained with different reward functions (note that a random policy obtains near zero reward):
| Environment | Groundtruth Reward | Relative Reward (Ours) |
|-------------|--------------- |------------------------|
| OpenMaze | 92.89 ± 11.79 | 76.45 ± 19.10 |
| UMaze | 94.94 ± 8.89 | 74.52 ± 18.32 |
| MediumMaze | 423.21 ± 51.30 | 276.10 ± 65.21 |
| LargeMaze | 388.76 ± 121.39 | 267.56 ± 98.45 |
**Questions**
>For the locomotion experiments, what do the expert policy rewards look like? Does your method represent a significant improvement to the policy? It would also be helpful to have an upper bound figure using a ground truth classifier.
R2: The expert policy consistently achieves > 95% across all three Locomotion environments.
We have now conducted additional experiments in which we steer a medium-performance diffusion model (trained on the medium-expert D4RL datasets) with the learned relative reward function; see table below. We find that our learned reward function significantly improves performance, by 7.34% on average, which is even larger than the performance increase of 4.61% found in the experiments in the main paper.
Performance in unsteered and RRF steered scenario for medium-performance base diffusion model:
| Environment | Unsteered | Relative Reward (Ours) |
|-------------|---------------|-------------------|
| Halfcheetah | 59.41 ± 0.87 | 69.32 ± 0.80 |
| Hopper | 58.80 ± 1.01 | 64.97 ± 1.15 |
| Walker2d | 96.12 ± 0.92 | 102.05 ± 1.15 |
| Mean | 71.44 | 78.78 |
>The problem setting, motivation, and evaluation criteria are all a bit unclear. Defining the problem explicitly (what makes a good reward function?), with convincing use cases, and bringing the experiments in line would make this a strong paper.
R3: The problem statement, broadly speaking, is to learn a reward function that quantifies the difference between two diffusion models. When one of these is taken to model a more "general" distribution and the other a more narrow distribution, we can interpret the latter as an expert model, and the relative reward then encodes the objective that the expert is trying to achieve.
In the Maze2D experiments, we qualitatively evaluate the learned reward functions in Figures 3 and 6. We have now also added an experiment demonstrating that the learned reward function can be used to train agents from scratch (achieving 73.72% of the ground-truth performance, see table below). In the high-dimensional Locomotion experiments, we evaluate the learned reward function by steering a lower-performance base model, and we now also have additional experiments in this setting (as outlined above). In the large-scale Stable Diffusion experiment, the expert model is tailored to avoid producing inappropriate images, and accordingly the learned reward function penalizes inappropriate images.
Two potential use cases our work attempts to demonstrate are:
(1) Extracting reward functions from decision-making diffusion models: Obtaining a reward function allows for interpreting behavioral differences, for composing and manipulating reward functions, and for either training agents from scratch or fine-tuning existing policies.
(2) Better understanding diffusion models by contrasting them: The biases of large models trained on different datasets are not always obvious, and our method may aid interpretability and auditing of models by revealing the differences between the outputs they are producing.
Average performance of agents trained with different reward functions (note that a random policy obtains near zero reward):
| Environment | Groundtruth Reward | Relative Reward (Ours) |
|-------------|--------------- |------------------------|
| OpenMaze | 92.89 ± 11.79 | 76.45 ± 19.10 |
| UMaze | 94.94 ± 8.89 | 74.52 ± 18.32 |
| MediumMaze | 423.21 ± 51.30 | 276.10 ± 65.21 |
| LargeMaze | 388.76 ± 121.39 | 267.56 ± 98.45 |
**Citations**
[A] "Proximal Policy Optimization", Schulman et al. 2017
## Reviewer rtGf ##
**Weaknessess/ comments:**
>IRL algorithms typically formulate the problem as a two-player minimax game, assuming bounded rationality of expert data through a maximum entropy framework. In stochastic environments, these methods often analyze causal entropy. However, it appears that such methods have not been thoroughly compared or investigated in this work.
R1: As pointed out by the reviewer in the next question, IRL methods typically rely on environment access and access to a set of expert demonstrations. In contrast, our method does not require environment access, as it assumes that the expert behavior is modeled by a pre-trained diffusion model. We have now clarified this in the respective parts of the introduction and related work sections.
We have now also added an experimental comparison with IRL, benchmarking against the classic AIRL [B] method in Maze2D. To ensure a fair comparison, we ran a grid search over four values each for the parameters n_disc_updates_per_round, gen_replay_buffer_capacity, demo_batch_size, and gen_train_timesteps. We find that our method outperforms AIRL by a clear margin (see the table at the bottom of this response). We would, however, like to point out that we do not see IRL as a directly comparable baseline, as it makes different assumptions.
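For reproducibility, the following sketches the AIRL setup, assuming the `imitation` package (whose AIRL implementation exposes exactly these hyperparameter names) and pre-built `venv` and `rollouts` objects for the Maze2D environment and expert demonstrations; the grid values shown are placeholders, not the selected ones:

```python
from itertools import product

from imitation.algorithms.adversarial.airl import AIRL
from imitation.rewards.reward_nets import BasicShapedRewardNet
from imitation.util.networks import RunningNorm
from stable_baselines3 import PPO

grid = {
    "n_disc_updates_per_round": [2, 4, 8, 16],
    "gen_replay_buffer_capacity": [512, 1024, 2048, 4096],
    "demo_batch_size": [64, 128, 256, 512],
    "gen_train_timesteps": [1024, 2048, 4096, 8192],
}

for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    learner = PPO("MlpPolicy", venv)  # venv: vectorized Maze2D environment (assumed given)
    reward_net = BasicShapedRewardNet(
        observation_space=venv.observation_space,
        action_space=venv.action_space,
        normalize_input_layer=RunningNorm,
    )
    trainer = AIRL(
        demonstrations=rollouts,  # rollouts: expert demonstrations (assumed given)
        venv=venv,
        gen_algo=learner,
        reward_net=reward_net,
        **params,
    )
    trainer.train(500_000)
```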
> In IRL, researchers typically have access to an environment for interaction, which allows for exploration and the generation of nominal trajectories. In the context of diffusion models, there seems to be no such interaction, or at least none observed within the algorithm. Consequently, this paper may be more akin to inverse optimal control, where preference learning is commonly implemented offline without any interaction.
R2: We agree that there are some parallels to IOC. However we use comparisons to IRL mostly to motivate the problem of reward function extraction. We have now stated this more clearly in the introduction, also referencing IOC [C] as a related problem setting.
**Questions:**
>I am wondering whether the diffuision models can achieve comparable performance with popular RL methods like SAC, PPO or DDPG.
R3: In accordance with [A], we found that our trained expert diffusion models achieve similar performance to RL agents trained with PPO from scratch in both Maze2D and Locomotion environments.
We further conducted additional experiments in Maze2D, where we trained RL agents from scratch with PPO [D] using the reward functions learned with our method. We found that these agents achieve 73.72% of the performance of PPO agents trained with the ground-truth reward function.
Average performance of agents trained with different reward functions (note that a random policy obtains near zero reward):
| Environment | Groundtruth Reward | Relative Reward (Ours) | AIRL |
|-------------|--------------- |------------------------|----------|
| OpenMaze | 92.89 ± 11.79 | 76.45 ± 19.10 | 53.42 ± 33.75 |
| UMaze | 94.94 ± 8.89 | 74.52 ± 18.32 | 69.62 ± 29.75 |
| MediumMaze | 423.21 ± 51.30 | 276.10 ± 65.21 | 175.49 ± 133.79 |
| LargeMaze | 388.76 ± 121.39 | 267.56 ± 98.45 | 139.59 ± 137.79 |
**Citations:**
[A] "Planning with Diffusion for Flexible Behavior Synthesis", Janner et al., ICML 2022
[B] "Learning Robust Rewards with Adversarial Inverse Reinforcement Learning", Fu et al., ICLR 2018
[C] "From inverse optimal control to inverse reinforcement learning: A historical review", ab Azar et al., Annual Reviews in Control 2020
[D] "Proximal Policy Optimization", Schulman et al. 2017
## Reviewer CPFx (additional comments, will not go into Rebuttal) ##
**Additional clarifications and citations**
> Appendix is unclear about details given about which diffusion framework used, start/end of schedules, etc? Was Karras-style preconditioning used?
Following Janner et al., Section 3, we use the DDPM framework of Ho et al. 2020 with a cosine beta schedule. We do not use preconditioning in the sense of Section 5 of Karras et al. 2022. However, we do clip the denoised latents $\mathbf{x}_t$ during sampling and apply scaling to trajectories, as per the Appendix (L711). We will add these details and release the source code, which should alleviate any reproducibility concerns.
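For reference, the cosine schedule in question (Nichol & Dhariwal, 2021) sets
$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right), \qquad \beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}},$$
with a small offset $s$ and $\beta_t$ clipped to avoid singularities near $t = T$.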
> Requires access to the training dataset for marquee algorithm (Algorithm 1) Why is Algo 2 in the appendix, and not in main paper? Seems like the more useful version, than Algo 1?
We placed Algorithm 1 in the main paper as it is the version more frequently used in the experiments. We appreciate your feedback and will move Algorithm 2 to the main paper for the camera-ready version (for which we will have an additional page).
> Hard to tell whether it is actually unsafe content since it’s blurred out, overblurred to be not so convincing
We applied a strong blur because even when blurred it can be easy to guess the disturbing content. The images contain violence and hateful symbolism, which we feel should not be imposed on the reader of a scientific paper. Please refer to the original paper [B] for more detailed descriptions of the unsafe (blurred) images.
> Quantitative results and eval are very sparse. Figure 2: Histogram caption should say on which dataset. One qualitative histogram is not so easy to tell about quantitative performance – why no distributional distances measured?
Thank you for this remark. We have now computed the Wasserstein 2 distance for the distributions displayed in Figure 2, which evaluates to 17.74. We added this information to the main paper and updated the caption.
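For completeness, the distance can be computed from two equal-sized one-dimensional samples of reward values via the empirical quantile coupling (a minimal sketch, not necessarily the exact script we used):

```python
import numpy as np

def wasserstein_2(a: np.ndarray, b: np.ndarray) -> float:
    """Empirical 2-Wasserstein distance between two equal-sized 1-D samples."""
    a_sorted, b_sorted = np.sort(a), np.sort(b)
    return float(np.sqrt(np.mean((a_sorted - b_sorted) ** 2)))
```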
> No comparison with other baselines for the safe image generation experiment.
We are not aware of relevant baselines in this scenario, but are happy to add such a comparison if pointed to specific baselines.
> Missing an architecture figure with the two models? Need some sort of graphical illustration / system figure showing training vs. inference behavior and which models are being learned (too much text)
We have added such a Figure to the paper, highlighting that both models are required for training the reward function and that only the base diffusion model and the learned reward function are used for inference.
> Bit too much background on diffusion overpowers contribution. I recommend that the proof on page 5 can be moved to appendix. Main algorithm is only shown on page 7, seems it should be on page 3.
We have made the overview of diffusion models more concise. However, we would prefer to keep the proof on page 5 in the main paper, as we consider it central to our contribution, unless there is a strong opinion to change this.
> Content placement: Figure 2 is very far from where it is discussed (page 7 vs. page 9)
We have rearranged the Figures accordingly.
> Line 278 – “peaks occurring at true goal” -> no discussion of within what neighborhood / distance?
As discussed in Appendix E, we used a tolerance of 1 grid cell, similar to the tolerance used in the original environment reward function. We have now stated this more clearly.
> Weird italicization and capitalization: “Physics informed neural networks
We have updated the capitalization in accordance with the rest of the paper; thank you for the remark.
> Keeps citing Sohl-Dickstein [59] for classifier guidance L24, L47, L143, why not cite Dhariwal et al “Diffusion Models Beat GANs on Image Synthesis”? Where does Sohl-Dickstein mention anything about classifier guidance?
Sohl-Dickstein et al. [59] introduce the classifier guidance equation in Appendix C, Eq. 61, which is why we cited them over Dhariwal et al. However, recognizing the significant contribution of Dhariwal et al. to the practical implementation, we have also included them in the citation.
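For context, the relation in question is the decomposition of the conditional score that underlies classifier guidance,
$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t),$$
which Dhariwal et al. later turned into a practical sampling procedure with a trained classifier.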
> L321 incorrect statement “Safe Stable Diffusion [58], a modified version of Stable Diffusion designed to mitigate the generation of shocking images.” [...] Inappropriate images are the focus.
Thank you for this remark, we have updated our wording.
## Clarification Reviewer 2 ##
We thank the reviewer for their response. We would like to make a few clarifications:
> … while the paper claims to be able to match the denoising process at every step …
We would like to clarify that the paper does not claim to exactly match the denoising process at every step. Instead, we train a network to do this approximately, and justify why doing this allows us to extract a canonical notion of relative reward. However, it is nowhere claimed in the paper that the learned reward must exactly match the denoising process.
> … I think it would help to mention the fact that the method works for CG and CFG.
We thank the reviewer for the remark about the applicability of our method to both classifier guidance and classifier-free guidance. We consider it a strength of our method that it is agnostic to the architecture and sampling method used by the diffusion models involved.
> … In order to obtain such a model, you would need to have the reward in hand already …
We would like to clarify that the work of Janner et al. presents a way of obtaining an expert diffusion model from purely observational data, without access to the reward function. Their method only uses the ground-truth reward to do classifier guidance on such a diffusion model, but does not use the reward to train the diffusion model itself.
As such, it is possible to obtain an expert diffusion model from demonstrations alone, without needing the reward at hand. The expert diffusion models used in our Maze2D and Locomotion experiments are obtained in this way: we use the offline D4RL datasets, without accessing any ground-truth reward, to obtain diffusion models reproducing expert behavior.
> … but the unrealistic requirement of having an expert diffusion model has not been motivated satisfactorily …
Taking the above into consideration, we believe that access to an expert diffusion model is not as unrealistic as suggested, since such a model can be obtained from optimal demonstrations, as per Janner et al.
Hence, we would like to ask if the reviewer would consider revising their score, given our clarifications concerning the main points of improvement raised.