# General Response
We would like to thank all reviewers for their feedback and constructive suggestions on our manuscript. We are glad that the reviewers find our paper to be:
(1) addressing an interesting and relevant problem (Reviewer 7ZgY, Reviewer XUtn)
(2) novel and well-motivated (Reviewer 7ZgY, Reviewer kwpT, Reviewer ifEq, Reviewer XUtn)
(3) having informative and good-quality experiments (Reviewer 7ZgY, Reviewer kwpT, Reviewer ifEq, Reviewer XUtn)
As a recap, this work has made the following major contributions:
(1) We present DDiffPG, the first actor-critic algorithm that learns multimodal policies, parameterized as diffusion models, from scratch.
(2) We introduce diffusion policy gradient, a novel way to train diffusion models following the RL objective.
(3) We propose to use hierarchical clustering with a Dynamic Time Warping (DTW) distance measure to discover different modes from trajectories, necessary for crafting multimodal training batches.
(4) We utilize mode-specific Q-functions and multimodal training batches to guarantee learning across all modes. We additionally maintain a dedicated exploratory mode for exploration, effectively decoupling exploration-exploitation.
(5) We allow the learned multimodal policy to either act autonomously based on the underlying multimodal action distribution or execute a specific mode explicitly by conditioning the policy on a mode-specific latent embedding, which we show to be beneficial for online replanning.
To better illustrate and highlight the methodological components of our proposed algorithm, we have updated Figure 1: https://imgur.com/a/89We5iQ.
### Additional Experiments
(1) We extended the results for robotic manipulation tasks in multimodal domains and added a *Cabinet-open* task. In the visualization at https://imgur.com/a/UyvuKBi, we observe the multimodal behaviors learned by our proposed method, while the baselines cannot capture the existing multimodality. DDiffPG has comparable performance to the baselines on all four tasks, while capturing all possible task solutions. Given the varying complexities of these tasks and their hard-exploration nature due to sparse rewards, we conclude that there is no significant performance gap between methods.
(2) We addressed the success rate issue raised by reviewer 7ZgY and reviewer XUtn. As shown in the figure and table at https://imgur.com/a/AMAkMm3, we observed that DDiffPG achieves a 100% success rate after convergence. The previously lower success rate was because the exploratory mode was mistakenly included during evaluation.
(3) We provided an ablation study on the update rule by substituting the diffusion policy gradient with a standard policy gradient that maximizes the Q-value, as suggested by reviewer ifEq. This new baseline is referred to as DiffQ. In our experiments (https://imgur.com/a/sAIxIEI), we note that the actor gradient diminishes with an increase in diffusion iterations for DiffQ. Crucially, we observed that even with $N=5$, the actor gradient remains zero throughout training in some seeds, resulting in high variance in performance. Overall, DDiffPG has proven to be both more sample efficient and more stable.
(4) We conducted an evaluation of computational time. We use an NVIDIA GeForce RTX 4090 for all experiments. From the figure at https://imgur.com/a/Z9S6DRd, we observe that on AntMaze DDiffPG is approximately 5 times slower than TD3 and SAC. Due to the small number of diffusion iterations ($N=5$), the inference speed of the diffusion policy has no significant impact on wall-clock time. However, DDiffPG requires more wall-clock time for computing the target action and for trajectory processing, e.g., clustering. We also point out that DDiffPG's computational time increases linearly with the number of discovered modes.
We commit to adding these additional experiments and their conclusions to the revised paper. Moreover, upon paper acceptance, we will open-source the codebases and environments to the community.
# Response to Reviewer 7ZgY
We sincerely thank reviewer 7ZgY for their feedback on our paper. We address the reviewer's concerns as follows.
> *Paper is not polished. There is a big gap between Section 3 and Section 4, as the connection between the general diffusion probabilistic models’ formulation to how it is applied in RL is not clarified.*
We thank the reviewer for the insightful suggestions, which we took into account for updating our paper in terms of clarity and overall presentation. Unfortunately, ICML's rebuttal phase does not allow us to upload a revised version of the paper. In the revised version, we have added more explanation of the connection between the general diffusion model and our formulation in the Method section, and we describe these changes here:
The conventional diffusion model is trained using a dataset that includes supervised labels. In offline decision-making, a diffusion policy predicts actions given states, with the dataset providing both the state and the corresponding ground-truth action as the target (label). This approach allows the diffusion policy to mimic behaviors within the dataset. In contrast, our online setting does not provide a predefined ground-truth action. Instead, we need to discover good actions in an online RL setting while ensuring the policy retains the multimodality of the action distribution (if any). To adapt without altering the supervised training framework of conventional diffusion models (Ho et al., 2020), we generate a target action, $a^{target}$, by computing the gradient of a Q-function w.r.t. the action $a$ and performing gradient ascent on it. The action $a^{target}$, derived through gradient ascent, represents an improved choice based on the current Q-function. During training, we additionally store $a^{target}$ in the replay buffer and continuously update it based on its preceding values, ensuring continuity of learning. Intuitively, we are chasing the target by replacing it with the newly computed $a^{target}$ in this online RL setting rather than learning a static target as in the offline setting.
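For concreteness, here is a minimal PyTorch-style sketch of this target-action computation, assuming a critic network `q_net(state, action)` that returns Q-values; the function name, step size, and number of ascent steps are illustrative choices, not the exact values used in the paper.

```python
import torch

def compute_target_action(q_net, state, action, step_size=0.05, n_steps=1):
    # Gradient ascent on Q(s, a) w.r.t. the action to obtain a^target.
    # step_size and n_steps are hypothetical hyper-parameters.
    a = action.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        q_value = q_net(state, a).sum()            # sum to get a scalar for autograd
        (grad_a,) = torch.autograd.grad(q_value, a)
        a = (a + step_size * grad_a).detach().requires_grad_(True)
    return a.detach()                              # stored/updated in the replay buffer
```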
> *Moreover, several symbols in Algorithm 1 are not specified (what does G stand for?) and there is inconsistency between what symbols mean in different parts of the paper (t in Algorithm 1 and t in Equation 1).*
We thank the reviewer for pointing out these inconsistencies. We are committed to polishing the writing and will make the following changes in the revised version of our paper:
1. In Algorithm 1: $G$ is the update per data-collection ratio, i.e., the number of updates we perform after every data collection. We have provided an ablation study on this hyper-parameter in Section 5.6.
2. In Algorithm 1: $S$ is the Q-function scheduler. Since each Q-function is linked to a specific mode, we initialize/remove Q-functions according to the outcomes of the clustering process. For example, if a new mode is discovered, we initialize a corresponding Q-function; if two modes are merged, we assign the Q-function from the mode with the larger data proportion within the merged mode as the new Q-function of the merged mode (a minimal sketch of this scheduler is given after this list).
3. In Algorithm 1 and Equation 1, we will clarify the difference between the timestep $t$ in forward/reverse diffusion process and MDP, and use a different notation for the timestep in MDP.
4. In Algorithm 1 and Figure 1, we will use a consistent symbol $a^{target}$ to represent the target action (the updated Figure 1 is on https://imgur.com/a/89We5iQ).
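To make the role of the scheduler $S$ in point 2 concrete, below is a minimal sketch of the initialize/merge logic; the data structures (clusters as sets of trajectory ids) and the helper `make_q_func` are illustrative assumptions rather than our exact implementation.

```python
def schedule_q_functions(prev_q_funcs, prev_clusters, new_clusters, make_q_func):
    # prev_clusters / new_clusters: dict mode_id -> set of trajectory ids
    # prev_q_funcs: dict mode_id -> Q-network; make_q_func() builds a fresh one
    new_q_funcs = {}
    for mode_id, traj_ids in new_clusters.items():
        if mode_id in prev_q_funcs:
            new_q_funcs[mode_id] = prev_q_funcs[mode_id]      # mode persists
            continue
        # previous modes whose trajectories were merged into this new cluster
        parents = [m for m, ids in prev_clusters.items() if ids & traj_ids]
        if parents:
            # inherit the Q-function of the parent with the larger data proportion
            largest = max(parents, key=lambda m: len(prev_clusters[m]))
            new_q_funcs[mode_id] = prev_q_funcs[largest]
        else:
            new_q_funcs[mode_id] = make_q_func()              # newly discovered mode
    return new_q_funcs
```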
> *Experiment results are not well elaborated. Firstly, the method seems to be majorly evaluated on four AntMaze tasks (the experiment results for the robotic tasks are included as a single figure in Appendix and are never explained or referred to). It’s hard to tell if the model generalizes to other higher dimensional tasks such as Kitchen from D4RL.*
We thank the reviewer for this comment and agree with them about the importance of experimenting with high-dimensional tasks. To this end:
First, we introduce an additional robotic control task, *Cabinet-open* (similar to the Kitchen setup in D4RL), and extend the results for robotic manipulation tasks in multimodal domains. Visualizations of the multimodal behaviors obtained by DDiffPG, as well as performance plots against the baselines, can be found at https://imgur.com/a/UyvuKBi. Note that all four manipulation tasks use **sparse rewards** and are **goal-agnostic** MDPs, making the learning process very challenging, especially when learning multimodal behaviors.
Second, we will provide a more detailed discussion on the four manipulation tasks in the revised version as follows:
1. In Section 5.4: Beyond the AntMaze tasks, DDiffPG demonstrates consistent exploration and acquisition of multiple behaviors in four robotic manipulation tasks with sparse rewards. Specifically, in *Reach*, the agent navigates around a fixed cross-shaped obstacle from various directions; in *Peg-in-hole*, it learns to insert a peg into one of the two possible holes; in *Drawer-close*, it chooses to close one of the four drawers randomly; and, in *Cabinet-open*, the agent can move the arm to either layer and subsequently pull the door open.
2. We will include a comprehensive description of each task in the Appendix (see description on https://imgur.com/a/UyvuKBi).
3. We will add a table for manipulation tasks, similar to the table for AntMaze tasks, showing that DDiffPG can discover and master multiple modes while achieving a 100% success rate after convergence (see table on https://imgur.com/a/UyvuKBi).
While the Kitchen task from D4RL is a suitable benchmark for offline RL, it represents a sequential long-horizon challenge that is difficult to tackle with sparse rewards when learning from scratch, and it also lacks clear multimodal opportunities. Therefore, we designed new, challenging manipulation tasks, extending an existing codebase [1] to sparse-reward MDPs with multimodal goals, to better highlight the challenges of learning and maintaining multimodal behaviors in high-dimensional robotic tasks. Upon paper acceptance, we will publicly release both our full codebase and environments to benefit further research toward multimodal behavior learning.
We would like to further emphasize that the AntMaze tasks are nontrivial: the agent must solve the Ant locomotion task (learning to walk through high-dimensional joint control), a notoriously hard continuous control problem, while also learning to steer the locomotion gaits to discover goals in complex mazes, all from sparse rewards. We chose to highlight these results in the main paper because the mazes lend themselves to better visualization, providing better insights into the exploration and learning abilities of our method.
[1] Gallouedec, Quentin et al. “panda-gym: Open-source goal-conditioned environments for robotic learning.” 4th Robot Learning Workshop: Self-Supervised and Lifelong Learning at NeurIPS (2021).
> *Proposed method achieves a lower success rate and longer episode lengths compared to baselines in Figure 3.*
Regarding the lower success rate, after a thorough investigation following the paper submission, we found that the exploratory mode was mistakenly included during evaluation. In this mode, the agent's objective is not to complete the task but to explore uncharted areas. Consequently, executing the behavior imposed by the exploratory mode during evaluation does not lead to goal completion and decreases the success rate. After removing the exploratory mode from the evaluation phase, we observed that DDiffPG achieved a 100% success rate on both the AntMaze (https://imgur.com/a/AMAkMm3) and Robotic (https://imgur.com/a/UyvuKBi) tasks after convergence.
Regarding the longer episode lengths, our analysis indicated that this phenomenon was due to the presence of successful paths of different lengths. Taking AntMaze-v3 as an example, multiple paths exist within the maze, but not all of them are of the same length. DDiffPG is capable of learning and freely executing all these paths, which leads to an increase in episode length. Nonetheless, this issue can be mitigated given our ability to control the agent's behavior through the mode embeddings.
> *I understand that the main contribution of the proposed method is generating multimodal solutions, and the maze figures are added to demonstrate this. However, I believe a more thorough evaluation on a multimodality metric is required for better understanding the significance of the model, not just on the AntMaze task but a set of different tasks.*
We thank the reviewer for the constructive comment. We tested our method on four AntMaze tasks and four robot manipulation tasks. On these tasks, we demonstrated that our algorithm was the only method that successfully mastered multimodal behaviors, as shown in the provided visualizations at https://imgur.com/a/UyvuKBi.
We agree with the reviewer on the necessity of showcasing the significance of the model. To better highlight the need for multimodal behaviors, we provided in our original submission, in Section 5.5 "Online Replanning with Multimodal Policy", an online application of the learned multimodal policy in nonstationary mazes, in which we randomly sample obstacles unseen during training. The ability of DDiffPG to solve those tasks is due to its multimodal behavior, which strengthens the importance of multimodal policies in real-world scenarios where certain solutions may be unavailable and alternatives are required to solve the task (visualization: https://imgur.com/a/ODo5D8D).
> *The hierarchical clustering method requires some background knowledge about the environment, making this method harder to generalize to new tasks and environments.*
We kindly remark that the hierarchical clustering approach requires only an appropriate distance metric and a hyper-parameter for the distance threshold. This means that this unsupervised clustering approach requires the least amount of domain knowledge -- a key motivation for selecting it -- compared to methods like k-means that require a priori knowledge of the expected number of modes. Although the threshold has to be set manually, we found that certain thresholds generally work well across different tasks, as shown in the table below. Moreover, the threshold is easy to tune via the hierarchical clustering dendrogram.
| Task | Clustering Threshold |
| -------- | -------- |
| *AntMaze-v1* | 60 |
| *AntMaze-v2* | 60 |
| *AntMaze-v3* | 60 |
| *AntMaze-v4* | 50 |
| *Reach* | default |
| *Peg-in-hole* | default |
| *Drawer-close* | default |
| *Cabinet-open* | default |
The default threshold is set to 0.7 * max(Z[:, 2]), corresponding to MATLAB(TM) behavior, where Z is the linkage matrix [2].
[2] The MathWorks Inc. Statistics and Machine Learning Toolbox Documentation, Natick, Massachusetts: The MathWorks Inc. https://www.mathworks.com/help/stats/linkage.html. (2024).
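For reference, a minimal sketch of this clustering step using SciPy's `linkage`/`fcluster` and a plain NumPy DTW; the `average` linkage method and the variable names are assumptions for illustration, not necessarily our exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(traj_a, traj_b):
    # Classic DTW between trajectories of shape (T_a, d) and (T_b, d).
    n, m = len(traj_a), len(traj_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def cluster_trajectories(trajectories, threshold=None):
    # Pairwise DTW distance matrix, then agglomerative clustering.
    n = len(trajectories)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(trajectories[i], trajectories[j])
    Z = linkage(squareform(dist), method="average")               # linkage matrix
    t = 0.7 * Z[:, 2].max() if threshold is None else threshold   # MATLAB-style default
    return fcluster(Z, t=t, criterion="distance")                 # mode label per trajectory
```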
> *Can the authors elaborate on how the results of Figure 14 compare to the results of Figure 3?*
In Figure 3 (updated https://imgur.com/a/AMAkMm3) and Figure 14 (updated https://imgur.com/a/UyvuKBi), we observe that DDiffPG has a comparable performance to the baselines on all eight tasks, but it acquires multimodal behaviors. Specifically, in the AntMaze tasks, the sample efficiency of all algorithms is very similar, except for RPG. In *AntMaze-v1* and *AntMaze-v3*, TD3 and DDiffPG(v) surpass the performance of others; in *AntMaze-v2* and *AntMaze-v4*, TD3 and DDiffPG are the most sample-efficient. Overall, TD3 consistently delivers strong performance, whereas SAC and RPG tend to lag behind. In the robotic manipulation tasks, a similar pattern emerges: DDiffPG leads in *Peg-in-hole*, DDiffPG(v) excels in *Reach*, and TD3 stands out in both *Drawer-close* and *Cabinet-open*.
In general, DDiffPG demonstrates lower sample efficiency compared to the baselines in tasks that pose significant exploration challenges. For example, in *AntMaze-v2*, the route to the top-left goal is longer, and in *Reach*, the robot dynamics make it difficult for the agent to explore the bottom paths. For other tasks with a comparable exploration space, DDiffPG achieves similar performance to the baselines, despite learning multiple solutions. Given the varying complexities of these tasks and their hard-exploration nature due to sparse rewards, we conclude that there is no significant performance gap between methods.
RPG employs a VAE for policy parameterization, in which the latent variable facilitates action multimodality and exploration. However, it cannot consistently solve the tasks, indicating that policy learning under such conditioning remains challenging.
> *Can you provide ablations on how the addition of hierarchical clustering and mode-specific Q-functions affect the results?*
We agree with the reviewer on the importance of investigating the effect of the hierarchical clustering and mode-specific Q-functions. Given that the hierarchical clustering, mode-specific Q-functions, and multimodal training batch are tightly coupled and cannot be ablated individually, we provided DDiffPG(v) as a baseline. It follows the same learning procedure, design choices, and hyper-parameters as DDiffPG (e.g., diffusion policy gradient, double Q-learning), but lacks the hierarchical clustering, mode-specific Q-functions, and multimodal training batch. Our experiments revealed that although DDiffPG(v) also uses a diffusion model as the policy parameterization, it was unable to learn multimodal behaviors, verifying the effectiveness of our algorithmic choices in the full DDiffPG model.
# Response to Reviewer kwpT
We thank the reviewer for their insightful feedback. We would like to address the reviewer's concerns as follows.
> *The method offers a relatively simple way to train a diffusion model as policies, and the use of multiple Q functions prevents the collapse to a single mode of behavior which would remove the advantage of using this more expressive policy class.*
In this paper, we are interested in learning a multimodal policy parameterized as a diffusion model, while the Q-functions are implemented as MLP networks.
First, we learn mode-specific Q-functions to prevent the value function from collapsing to a single mode and to ensure improvement across all discovered modes. We additionally maintain an exploratory mode, which allows the agent to keep exploring new modes while continuing to optimize the existing ones.
Second, we construct a multimodal training batch for policy learning, exposing the diffusion policy to data from all modes. The expressiveness of the diffusion model enables the policy to capture the multimodal action distribution in the training batch. However, the baseline DDiffPG(v), which is the vanilla version without the hierarchical clustering, mode-specific Q-functions, and multimodal training batch, is unable to learn multimodal behaviors. This contrast verifies the effectiveness of our method in exploiting the expressiveness of the diffusion model.
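As an illustration of this second point, here is a minimal sketch of how such a multimodal training batch could be assembled; `mode_buffers` and its `sample` interface are hypothetical names, not our exact code.

```python
import torch

def build_multimodal_batch(mode_buffers, batch_size):
    # mode_buffers: dict mode_id -> buffer with sample(n) -> (states, target_actions)
    per_mode = batch_size // len(mode_buffers)   # equal proportion per discovered mode
    states, targets = [], []
    for buffer in mode_buffers.values():
        s, a_target = buffer.sample(per_mode)
        states.append(s)
        targets.append(a_target)
    # the diffusion policy is trained on data from all modes in a single batch
    return torch.cat(states, dim=0), torch.cat(targets, dim=0)
```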
> *As noted by the authors in their limitations, the use of a separate trajectory clustering as the means to separate modes for the different Q functions requires computing similarity between trajectories. There is no clear way to do this for higher dimensional state spaces, or if similarity cannot be captured through direct distance metrics on the states.*
We acknowledge the reviewer's point that clustering high-dimensional trajectories is challenging; however, this has been an active research area for years, with approaches spanning both representations and metrics for various applications. Therefore, we respectfully disagree that there is no clear way to handle such cases. In particular, one research direction encodes high-dimensional states into embeddings and then performs clustering at the embedding level. For instance, [1] employed a sequence encoder (such as a decision transformer or trajectory transformer) to convert trajectories from the D4RL dataset into trajectory embeddings and used X-means for clustering. For higher-dimensional inputs, e.g., images, [2] showed that the representation space of a random encoder effectively captures similarity between images and that k-nearest neighbors provides meaningful clusterings. Another important research avenue focuses on developing appropriate distance metrics for computing similarities, such as [3].
However, through our empirical evaluation, we found that our simple hierarchical clustering with classic Dynamic Time Warping (DTW) effectively addresses the clustering challenge in both AntMaze locomotion (https://imgur.com/a/AMAkMm3) and Robotic manipulation (https://imgur.com/a/UyvuKBi) tasks, with robust hyper-parameters. We provide a table below to list the clustering threshold for each task.
| Task | Clustering Threshold |
| -------- | -------- |
| *AntMaze-v1* | 60 |
| *AntMaze-v2* | 60 |
| *AntMaze-v3* | 60 |
| *AntMaze-v4* | 50 |
| *Reach* | default |
| *Peg-in-hole* | default |
| *Drawer-close* | default |
| *Cabinet-open* | default |
The default threshold is set to 0.7 * max(Z[:, 2]), corresponding to MATLAB(TM) behavior, where Z is the linkage matrix [4].
Finally, we wish to emphasize that the main contribution of our paper is to provide a framework for learning multimodal diffusion policies online, ensuring policy improvement across all discovered behavioral modes. Given the good performance of our hierarchical clustering approach in already challenging tasks like the robotic manipulation ones, and the fact that the components of our method, as detailed in Sec. 4, are decoupled from one another, we leave the exploration of alternative clustering methodologies for future work.
[1] Deshmukh, Shripad et al. “Explaining RL Decisions with Trajectories.” ArXiv abs/2305.04073 (2023).
[2] Seo, Younggyo et al. “State Entropy Maximization with Random Encoders for Efficient Exploration.” International Conference on Machine Learning (2021).
[3] Zelch, Christoph et al. “Clustering of Motion Trajectories by a Distance Measure Based on Semantic Features.” 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids) (2023).
[4] The MathWorks Inc. Statistics and Machine Learning Toolbox Documentation, Natick, Massachusetts: The MathWorks Inc. https://www.mathworks.com/help/stats/linkage.html. (2024).
> *A clever way to leverage maximum entropy or diversity between the Q functions could have greatly benefited the algorithm. As it stands, only the clustering separates the Q functions. This also means that the multimodal policy learnt does not by default weigh the different modes in a way such that it is more likely to pick a mode with higher reward, which would be preferable.*
We agree with the reviewer that exploring the interplay between Q-functions, e.g., ensemble methods, disagreement among Q-functions, or symmetry learning, could benefit the algorithm. However, we would like to point out that each Q-function is trained on a non-overlapping replay buffer, which implicitly ensures diversity. It would also be interesting to investigate whether we can train a global Q-function that captures the shared information while overcoming the greediness of the RL objective, which could be a promising future direction.
To allow the multimodal policy to be more likely to pick a mode with higher reward, one approach involves adjusting the proportion of each mode in the multimodal training batch. Essentially, the probability of a mode being selected correlates with its proportion in the multimodal training batch. In the original setup, data from each mode is equally represented, resulting in uniform probabilities of mode execution during inference. However, by adjusting the data proportion in the batch according to the returns of each mode — a metric we can approximate given our knowledge of each mode's trajectories — we can prioritize modes with higher returns. This approach increases the probability of these high-return modes being selected by the policy.
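As a sketch of this idea, one could allocate the per-mode sub-batch sizes with a softmax over (approximate) mode returns; the softmax weighting and the `temperature` knob are our illustrative choices here, not part of the paper's current setup.

```python
import numpy as np

def return_weighted_batch_sizes(mode_returns, batch_size, temperature=1.0):
    # mode_returns: dict mode_id -> estimated average return of that mode
    modes = list(mode_returns)
    r = np.array([mode_returns[m] for m in modes], dtype=np.float64)
    w = np.exp((r - r.max()) / temperature)          # numerically stable softmax
    w /= w.sum()
    sizes = np.floor(w * batch_size).astype(int)
    sizes[np.argmax(w)] += batch_size - sizes.sum()  # give rounding remainder to best mode
    return dict(zip(modes, sizes))
```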
> *The algorithm is presented as a way to leverage diffusion policies for RL, but the proposed multiple Q functions to prevent mode collapse can be applied for simpler policy classes too, which has not been investigated. For the relatively simple navigation tasks in the paper, it is highly likely that VAEs for example could work. In fact, you could just learn a separate unimodal actor for each Q function and just randomly pick between them during exploration. I note that this is not a major weakness.*
We agree with the reviewer that there might be other candidate model parameterizations that can learn multimodal behaviors. While Variational Autoencoders (VAEs), which condition the policy on a latent variable, offer a pathway to multimodality, they can sometimes generate non-existent modes. This consideration led us to adopt a simpler yet effective clustering approach for explicit mode discovery. For instance, RPG, which employs a VAE for policy parameterization, illustrates that despite the latent variable facilitating action multimodality and exploration, policy learning under such conditioning remains challenging.
Moreover, we would like to emphasize that the navigation tasks are nontrivial -- these tasks correspond to the Ant locomotion task of Mujoco, which is a standard benchmark for continuous control widely known to be challenging, with the additional complexity of how to steer the locomotion gait for navigation only through sparse rewards.
Although learning separate unimodal policies could yield similar behaviors, we advocate for the benefits of learning a unified model:
1. Learning a single model allows sharing information (e.g., representations) across modes, which is particularly beneficial for future research, e.g., on image-based-observation tasks [1]. Our multimodal policy learning also draws an analogy to multitask RL, in which the objective is to solve multiple tasks with a single policy rather than separate policies per task, benefiting from knowledge sharing [2].
2. Learning separate unimodal policies would significantly increase the computational time and agent interactions with the environment. With separate policies, we need to iteratively update them; however, only one backpropagation is needed for a single multimodal policy.
3. When a new mode is discovered, our diffusion model continues learning, incorporating this mode into its landscape without forgetting previous knowledge. With separate policies, however, we would have to initialize a new policy and learn it from scratch, or introduce additional techniques to transfer knowledge from previous training.
[1] Kalashnikov, Dmitry et al. “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation.” Conference on Robot Learning (2018).
[2] Hendawy, Ahmed et al. “Multi-Task Reinforcement Learning with Mixture of Orthogonal Experts.” International Conference on Learning Representations (2024).
> *It is interesting that the ablations shows generally less diffusion steps are actually better, with 5 steps outperforming the rest. Do the authors have an explanation for this? This further indicates to me that a less expressive class of algorithms could have worked here.*
We thank the reviewer for this comment and have thoroughly considered this outcome. We believe the sample efficiency achieved by shorter noising/denoising chains is due to the complexity of the state space in our control tasks, which is much higher than that of the 2-dimensional manifolds of image generation tasks. In essence, with such a high-dimensional state manifold, every noising step moves the samples out of this complex manifold, and during the reverse process the model needs to procedurally invert this process to pull the samples back to the original manifold. There is substantial ongoing work on manifold representations and the application of diffusion models to complex manifolds, and we find this an exciting future research direction.
The longer the chain, the more complex this pullback becomes; as the results of Fig. 6b show, longer chains eventually also learn the task, but at the cost of more computation time and samples. We empirically demonstrated that 5 steps are enough to learn and maintain multimodal behaviors, which is consistent with the findings of other diffusion policy learning papers [1, 2], while also showing that simpler policy parameterizations like the RPG model [3] do not work out of the box in our challenging control tasks.
[1] Ding, Zihan, and Chi Jin. "Consistency models as a rich and efficient policy class for reinforcement learning." International Conference on Learning Representations (2024).
[2] Wang, Zhendong et al. “Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning.” International Conference on Learning Representations (2022).
[3] Huang, Zhiao et al. “Reparameterized Policy Learning for Multimodal Trajectory Optimization.” International Conference on Machine Learning (2023).
> *Were the antmaze tasks trained with sparse rewards?*
Yes, all tasks, both AntMaze and the Robotic manipulation ones, use a sparse 0-1 reward, with 1 given to the agent when finding a goal.
> *It is mentioned that your approach will exclusively train the Q function with successful trajectories and those close to it while other approaches will not. This seems like an unfair advantage that is more of a data buffer prioritization scheme, and can be used even if there is only 1 Q function like in other methods. Is this understanding correct?*
Yes, each Q-function is trained on its own mode data, which consists of the goal-reached trajectories alongside unsuccessful ones that are closely related. However, we would like to emphasize that we keep the total replay buffer size equal to 1 million for all algorithms for a fair comparison. Since each mode's data does not overlap, one can view each mode-specific Q-function in DDiffPG as having its own replay buffer, whose size is roughly $\frac{\text{total replay buffer size}}{\text{number of modes}}$ and is therefore smaller than that of the baselines.
The impact of the replay buffer size in off-policy training has been actively investigated [1]. The trade-off is as follows: while a smaller buffer may speed up learning due to the higher relevance of its data, it could also increase the risk of forgetting previous samples and lead to overfitting. Conversely, a larger buffer can enhance generalization and stability by providing a more diverse dataset, but might reduce the data's relevance. Therefore, it is hard to argue that this constitutes an advantage for DDiffPG.
We agree with the reviewer that a similar sampling strategy can be used for baselines. However, it is practically challenging to set the prioritized buffer size for fair comparison, as it depends on the varying number of modes DDiffPG discovers throughout the training. For example, at the early stage, no mode is discovered, and all of the data is used to train the exploratory mode. As the training proceeds, more modes are discovered and the size of prioritized data decreases. The inherent randomness in exploration, compounded by sparse rewards, further complicates setting the prioritized data buffer size for fair comparisons.
Given these considerations, we decided to keep the total buffer size the same rather than restricting the baselines to a specific subset of data as was the case with DDiffPG.
[1] Li, Zechu et al. “Parallel Q-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation.” International Conference on Machine Learning (2023).
# Response to Reviewer ifEq
We sincerely thank reviewer ifEq for the helpful feedback and suggestions on our paper. We address the reviewer's concerns below.
> *Experimental results are not strong. First, the performance of DDiffPG does not seem to yield a strong improvement on sample efficiency over previous algorithms like TD3 in Fig. 3. Second, no results are presented for robotic environments in the main paper.*
While we acknowledge that the performance of DDiffPG is only comparable with the baselines, we would like to emphasize our main contribution, which is to learn multimodal behaviors from scratch using a diffusion model. In our experiments, we demonstrate that DDiffPG successfully exhibits multimodal behaviors and solves a given task in multiple ways, while the baselines converge to a unimodal behavior. Furthermore, we showcase a proof-of-concept online replanning application of the multimodal policy in non-stationary environments, in which we randomly sample obstacles unseen during training. The ability of DDiffPG to solve those tasks is due to its multimodal behavior, which strengthens the importance of multimodal policies in real-world scenarios where certain solutions may be unavailable and alternatives are required to solve the task (visualization: https://imgur.com/a/ODo5D8D).
We appreciate the reviewer's careful reading and plan to provide a more detailed discussion on the four manipulation tasks in the revised version as follows:
1. In Section 5.4: Beyond the AntMaze tasks, DDiffPG demonstrates consistent exploration and acquisition of multiple behaviors in four robotic manipulation tasks with sparse rewards. Specifically, in *Reach*, the agent navigates around a fixed cross-shaped obstacle from various directions; in *Peg-in-hole*, it learns to insert a peg into one of two possible holes; in *Drawer-close*, it chooses to close one of four drawers randomly; and, in *Cabinet-open*, the agent can move the arm to either layer and subsequently pull the door open.
2. We will include a comprehensive description of each task in the Appendix (see description of tasks on https://imgur.com/a/UyvuKBi).
3. We will add a table for the manipulation tasks, similar to the table for the AntMaze tasks, showing that DDiffPG can discover and master multiple modes while achieving a 100% success rate after convergence (see table on https://imgur.com/a/UyvuKBi).
> *Lack of details for gradient vanishing with standard policy gradient via maximizing Q. I think this is an essential claim for supporting the key choice of using action gradient ascent without gradient backpropogated to the policy network. There exists a missing reference [1] that actually did this for online RL with diffusion policy. Is this gradient vanishing problem mainly caused by the large number of diffusion iterations (e.g. 100) for policy learning? If so, please show some experimental evidence for this, like the gradient norm with different numbers of iterations. Also, given the hyperparameter sweep results in Fig. 6, it seems only using iteration=5 will lead to the best sample efficiency. Does the gradient vanishing still happen in this case? Please prove the necessity of taking action ascent when a large number of diffusion iterations is not helpful.*
We thank the reviewer for the insightful suggestions. In response, we have included an ablation study on the update rule by substituting the diffusion policy gradient with a standard policy gradient that maximizes the Q-value. This new baseline, referred to as DiffQ, only differs from DDiffPG in the update rule, while retaining the hierarchical clustering, mode-specific Q-functions, and multimodal training batches. To ensure a fair comparison, we directly utilized the code from [1] specifically for the update rule.
In our experiments conducted on *AntMaze-v3* and *Cabinet-open*, as depicted in the figure at https://imgur.com/a/sAIxIEI, we note that the actor gradient diminishes with an increase in diffusion iterations for DiffQ. Crucially, we observed that even with $N=5$, the actor gradient remains zero throughout training in some seeds, resulting in high variance in performance. In contrast, our diffusion policy gradient approach, which turns the training objective into minimizing an MSE loss, demonstrates significantly greater stability. Regarding performance, DDiffPG showcases superior sample efficiency. One hypothesis is that training a single policy with different Q-functions using the standard policy gradient could lead to conflicting gradient signals, which hurts performance. Overall, DDiffPG has proven to be both more sample efficient and more stable.
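To make the contrast concrete, below are hedged sketches of the two actor updates; the policy interface (`sample`, `q_sample`, `num_steps`, `predict_noise`) is a generic DDPM-style abstraction we assume for illustration, not the exact code of either method.

```python
import torch
import torch.nn.functional as F

def diffq_actor_loss(policy, q_net, states):
    # DiffQ-style ablation: maximize Q of the fully denoised action; gradients
    # flow back through all N denoising steps and tend to vanish as N grows.
    actions = policy.sample(states)               # differentiable N-step denoising
    return -q_net(states, actions).mean()

def ddiffpg_actor_loss(policy, states, target_actions):
    # Diffusion policy gradient: a standard DDPM denoising (MSE) loss that
    # regresses toward the target actions obtained by Q-gradient ascent.
    noise = torch.randn_like(target_actions)
    t = torch.randint(0, policy.num_steps, (states.shape[0],), device=states.device)
    noisy_actions = policy.q_sample(target_actions, t, noise)  # forward diffusion
    pred_noise = policy.predict_noise(noisy_actions, t, states)
    return F.mse_loss(pred_noise, noise)
```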
> *Missing baselines. Given that Yang et al. 2023 paper has a very similar algorithm, also diffusion policy for online RL in [1], I think it is important to directly compare the proposed DDiffPG with these two baselines. I’m not saying the proposed method is weaker but if there are some advantages over the previous two algorithms, it can be good for convincing me the effectiveness of the proposed techniques.*
We agree with the relevance of the two papers highlighted by the reviewer. Regarding DIPO as proposed by Yang et al. in 2023, our decision to exclude it as a baseline is due to two main considerations:
1. **Inconsistency in MDP Dynamics and Reward Function**: According to Algorithm 2 in Yang et al. 2023, DIPO stores the transition $(s_t, a_t, s_{t+1}, r(s_{t+1}|s_t, a_t))$ in the replay buffer, performs gradient ascent on $a_t$, and replaces $a_t$ in the original buffer. Therefore, the transition $(s_t, a_t, s_{t+1}, r(s_{t+1}|s_t, a_t))$ no longer aligns with the current MDP dynamics and reward function, due to the replacement of $a_t$. Given that DIPO is an off-policy algorithm, reusing these replaced transitions for training the Q-function could be problematic, as the agent learns values for actions that have never actually been played out in the environment and whose true outcomes (reward and next state) are unknown.
2. **Lack of Coherent Updates Between Q-function and Policy**: DIPO performs updates on the Q-function and the policy using different batches, as detailed in Algorithm 2, which does not guarantee coherent updates.
To this end, we chose DDiffPG(v) as our baseline instead of DIPO. DDiffPG(v) follows similar learning procedures as DIPO but addresses the aforementioned problems. For (1), DDiffPG(v) stores transition $(s_t, a_t, a^{target}_t, s_{t+1}, r(s_{t+1}|s_t, a_t))$, replaces $a^{target}_t$, and uses the original transitions to train the Q-function. For (2), DDiffPG(v) uses a consistent batch to train both Q-function and policy, aligning with traditional actor-critic algorithms. On the other hand, DDiffPG(v) differs from DDiffPG by lacking the hierarchical clustering, mode-specific Q-functions, and multimodal training batch, which also serves as an ablation for our framework.
We thank the reviewer for the suggestion regarding [1]; we are working on obtaining results for this baseline and will include them in the revised version.
> *I’m also unclear if the desired multimodal behavior can be achieved with other multimodal representation of policy, like using variational autoencoder, or Gaussian mixture. Experiments showing comparison with those in Fig. 4 and 5, as well as mode statistics in Tab. 2 can be useful. I think Gaussian mixture policy can easily handle the multi-goal setting in Antmaze environments.*
We agree with the reviewer that other multimodal representations may show similar multimodal behaviors; however, additional techniques and knowledge are required and need to be investigated. In our experiments, we include RPG as a baseline, which employs a VAE for policy parameterization. In Figure 3, its performance in the AntMaze tasks illustrates that policy learning remains challenging even though it demonstrates good exploration. One potential reason is that VAEs condition the policy on a latent variable, which offers a pathway to multimodality but sometimes leads to non-existent modes. This consideration led us to adopt a simpler yet effective clustering approach for explicit mode discovery.
Gaussian mixture models have the potential to address the multi-goal scenario in the AntMaze tasks; however, they often necessitate substantial prior knowledge, such as the number of modes. In this paper, we are particularly interested in self-emerging behaviors and aim to require as little prior knowledge as possible. This objective motivates us to employ the unsupervised clustering approach. Furthermore, specifying the number of modes could be problematic in the early stages of training, particularly when certain modes remain unexplored. Therefore, additional techniques would need to be introduced to handle the case where only one or two modes have been discovered.
> *The writing can be improved. Some typos need to be fixed, like line 147 in Sec. 3.*
> *Fig. 6 is not referred to in Sec. 5.6.*
We appreciate your feedback very much and will polish the writing in the revised version.
> *In Yang et al. 2023, the algorithm pseudocode states that the Bellman Q-learning is before the action gradient ascent. Why does this paper claim that the previous paper is using target actions for Bellman update, instead of using original actions?*
We believe that DIPO in Yang et al. 2023 is an off-policy algorithm, in which a replay buffer is introduced and the transitions inside can be reused for training. When a transition is sampled a second time, the Bellman update will use the inconsistent transition with the replaced action rather than the original one. While the pseudocode states that the Bellman update comes before the action gradient ascent, this ordering only guarantees that the first sampling of a transition uses the original action.
To address this issue, we propose to store the transition as $(s_t, a_t, a^{target}_t, s_{t+1}, r(s_{t+1}|s_t, a_t))$, which additionally includes $a^{target}_t$. When the action is improved via gradient ascent, we replace $a^{target}_t$ rather than $a_t$; when the Q-function is updated, we use the original transition to ensure consistency.
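A minimal sketch of this bookkeeping, with an illustrative list-based buffer (our actual implementation differs in details such as storage and batching):

```python
class TargetActionBuffer:
    # Stores (s, a, a^target, s', r); only a^target is ever overwritten,
    # so Bellman updates always use the action actually executed.
    def __init__(self):
        self.transitions = []

    def add(self, s, a, s_next, r):
        self.transitions.append({"s": s, "a": a, "a_target": a,
                                 "s_next": s_next, "r": r})

    def refresh_targets(self, q_net, compute_target_action):
        # "chase the target": re-run Q-gradient ascent from the previous a^target
        for tr in self.transitions:
            tr["a_target"] = compute_target_action(q_net, tr["s"], tr["a_target"])

    def critic_batch(self):
        # Q-function training uses the original transitions
        return [(tr["s"], tr["a"], tr["s_next"], tr["r"]) for tr in self.transitions]

    def actor_batch(self):
        # policy training regresses toward the continuously updated a^target
        return [(tr["s"], tr["a_target"]) for tr in self.transitions]
```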
> *How are the design choices different from Yang et al. 2023 verified to be better?*
As stated above, we identified two issues with the DIPO algorithm proposed by Yang et al. 2023: **inconsistency in MDP dynamics and reward function** and **lack of coherent updates between Q-function and policy**. Our design choices address both issues, making our method more principled.
> *Limitations of the current method are discussed in Sec. 6. For the limitation of computational time, providing the evaluation of computational time for the proposed method can be good.*
We thank the reviewer for their insightful suggestions. In this manuscript, we have developed a novel actor-critic algorithm, DDiffPG, that parameterizes the policy as a diffusion model and can learn multimodal behaviors from scratch. One well-known limitation of diffusion models is that forwarding and backpropagating through the whole Markov chain during training is computationally inefficient [2]; we therefore provide an evaluation of computational time compared with the baselines at https://imgur.com/a/Z9S6DRd. We use an NVIDIA GeForce RTX 4090 for all experiments.
From the figure, we observe that on AntMaze DDiffPG is approximately 5 times slower than TD3 and SAC.
* For data collection, we observe that DDiffPG needs more wall-clock time than the others, due to trajectory processing, clustering, etc. We note that DDiffPG(v) has a wall-clock time similar to that of TD3 and SAC, implying that the inference speed of the diffusion model has no significant impact.
* For the policy update, DDiffPG and DDiffPG(v) require less computational time. This is because TD3 and SAC need to estimate the Q-value during the policy update, while DDiffPG and DDiffPG(v) only need to minimize the MSE loss.
* For the critic update, DDiffPG requires more wall-clock time due to its multiple mode-specific Q-functions.
* For the target action update, only DDiffPG and DDiffPG(v) need to compute the target action. DDiffPG requires more wall-clock time because it uses the multimodal batch and needs to compute the target action for each sub-batch.
Note that DDiffPG's computational time will increase linearly as the number of discovered modes increases.
[2] Kang, Bingyi et al. “Efficient Diffusion Policies for Offline Reinforcement Learning.” Neural Information Processing Systems (2023).
# Response to Reviewer XUtn
Thank you for your insightful feedback. We would like to address your concerns as follows.
> *It seems like the proposed method cannot achieve a very high success rate compared to baselines. Why there is a tradeoff between multimodality and success rate if the policy converges? Why multimodality hurts performance?*
We thoroughly investigated this phenomenon after the paper submission and found that the exploratory mode was mistakenly included during evaluation. In this exploratory mode, the agent's objective is not to complete the task but to explore uncharted areas. Consequently, executing the behavior imposed by the exploratory mode during evaluation does not lead to goal completion and decreases the success rate. After removing the exploratory mode from the evaluation phase, we observed that DDiffPG achieved a 100% success rate on both the AntMaze (https://imgur.com/a/AMAkMm3) and Robotic (https://imgur.com/a/UyvuKBi) tasks after convergence.
> *What is Q-function scheduler S in Algo.1?*
We appreciate the careful reading from the reviewer. To achieve mode-specific Q-functions, we use a scheduler $S$ to allocate Q-functions for each mode. The scheduler $S$ takes the previous and current clusters and Q-functions as inputs and returns the new Q-functions. For example, if a new mode is discovered, the scheduler $S$ will initialize a corresponding Q-function; if two modes are merged, the scheduler $S$ will assign the Q-function from the mode with a larger data proportion within the merged mode as the new Q-function.
We will add the description in the revised version.
> *In Algo. 1 and Fig. 1, authors use both $a^{target}$ and $a_{target}$.*
We thank the reviewer for pointing out this inconsistency. In Algorithm 1 and Figure 1, we will use the consistent symbol $a^{target}$ to represent the target action (the updated Figure 1 is at https://imgur.com/a/89We5iQ).
> *If I understand correctly, DDiffPG(v) will quickly converge to one mode it first explored. Then why DDIffPG is more sample efficient than DDiffPG(v)?*
In Figure 3 (updated: https://imgur.com/a/AMAkMm3) and Figure 14 (updated: https://imgur.com/a/UyvuKBi), we observe that DDiffPG has comparable performance to DDiffPG(v) on most tasks. This could be counter-intuitive, as DDiffPG learns multiple solutions. The reason is as follows: taking *AntMaze-v1* as an example, all methods initially explore both paths, indicating a similar exploration space. Once the goal is reached, our method focuses on training a Q-function using this goal-reached trajectory alongside unsuccessful ones that are closely related, effectively narrowing the scope of the replay buffer and facilitating faster convergence. In other methods, the success signal is propagated to the Q-function with a delay due to uniform sampling from the replay buffer.
However, in tasks that present significant exploration challenges, DDiffPG demonstrates lower sample efficiency compared to DDiffPG(v). Specifically, in *AntMaze-v2*, the route to the top-left goal is longer, and in *Reach*, the robot dynamics make it difficult for the agent to explore the bottom paths. Consequently, DDiffPG requires more samples for effective exploration in these tasks, becoming less sample efficient.