# Plug-and-play Safety Control Rebuttal
## 1.Reviewer cMPZ (Rating 6, Confidence 2)
**Response to reviewer cMPZ**
We appreciate your constructive and detailed comments. Our point-wise responses are provided below:
**Q1:"... Can the proposed method be applied to RL fields other than safety? ... Therefore, the proposed method can be applied to any RL fields that pursue auxiliary rewards."**
> **A1:** Sincere thanks for your constructive suggestion. Our method can be applied to other RL settings if a function of $p_{data}(x)$ is used as an intrinsic reward, and we will explore this in future work. However, in this paper the training process of our method is completely independent of the baseline algorithm. Modifying rewards to encourage safer actions would affect the training process of the baseline algorithm, which is inconsistent with the plug-and-play idea of this paper.
**L1:"... only qualitatively guarantees that the state-action pairs are made in or closer to the safe distribution. But a quantitative analysis may further enhance the solidness of your method. That is, $p_{\text{data}}\left(s_{t}^{\prime}, a_{t}+K_{1} a_{1}^{t}+K_{2} a_{2}^{t}\right)-p_{\text{data}}\left(s_{t}, a_{t}\right) \geq$ some function of $p_{\text{data}}, s_{t}, a_{t}, \ldots$"**
> **A:** Thank you very much for pointing out this limitation of our theory. A quantitative analysis can be obtained under additional assumptions. Suppose the first derivatives are continuous and, since $K_1$ and $K_2$ are very small, assume $\frac{\partial p_{\text{data}}(s, a)}{\partial s} = \frac{\partial p_{\text{data}}(s+K_2\Delta s, a+K_1a_1)}{\partial s}$ and $\frac{\partial p_{\text{data}}(s, a)}{\partial a} = \frac{\partial p_{\text{data}}(s+K_2\Delta s, a+K_1a_1)}{\partial a}$. Then
> $$
> \begin{aligned}
> & p_{\text{data}}(s_{t}+K_{2}\Delta s_{t},\, a_{t}+K_{1}a_{1}^{t}+K_{2}a_{2}^{t}) - p_{\text{data}}(s_{t}, a_{t}) \\
> =\ & p_{\text{data}}(s_{t}+K_{2}\Delta s_{t},\, a_{t}+K_{1}a_{1}^{t}+o(a_{t})) - p_{\text{data}}(s_{t}, a_{t}) \\
> =\ & K_{2}\Delta s_{t}\,\frac{\partial p_{\text{data}}(s_{t}+hK_{2}\Delta s_{t},\, a_{t}+gK_{1}a_{1}^{t})}{\partial s} + K_{1}a_{1}^{t}\,\frac{\partial p_{\text{data}}(s_{t}+hK_{2}\Delta s_{t},\, a_{t}+gK_{1}a_{1}^{t})}{\partial a} \\
> =\ & \frac{K_{2}}{p_{\text{data}}(s_{t},a_{t})}\frac{\partial p_{\text{data}}(s_{t},a_{t})}{\partial s}\,\frac{\partial p_{\text{data}}(s_{t}+hK_{2}\Delta s_{t},\, a_{t}+gK_{1}a_{1}^{t})}{\partial s} + \frac{K_{1}}{p_{\text{data}}(s_{t},a_{t})}\frac{\partial p_{\text{data}}(s_{t},a_{t})}{\partial a}\,\frac{\partial p_{\text{data}}(s_{t}+hK_{2}\Delta s_{t},\, a_{t}+gK_{1}a_{1}^{t})}{\partial a} \\
> =\ & \frac{K_{2}}{p_{\text{data}}(s_{t},a_{t})}\left\|\frac{\partial p_{\text{data}}(s_{t},a_{t})}{\partial s}\right\|^{2} + \frac{K_{1}}{p_{\text{data}}(s_{t},a_{t})}\left\|\frac{\partial p_{\text{data}}(s_{t},a_{t})}{\partial a}\right\|^{2}, \qquad h,g\in[0,1].
> \end{aligned}
> $$
>
> We will replace the qualitative analysis with a quantitative one and update the lemma, assumptions, and theorem accordingly in the revision.
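>
> For intuition, the following is a minimal numerical sketch (a toy example of our own, not code from the paper) in which an isotropic 2-D Gaussian stands in for $p_{\text{data}}(s,a)$. Stepping the state-action pair along the score directions $\frac{1}{p_{\text{data}}}\nabla p_{\text{data}}$ by small amounts $K_2\Delta s_t$ and $K_1 a_1^t$, as in the derivation above, increases the density by approximately the first-order quantity on the last line:
> ```python
> import numpy as np
>
> # Toy stand-in for p_data(s, a): an isotropic 2-D Gaussian over scalar (s, a).
> # In our method the true density is unknown and only its score is estimated.
> def p(s, a):
>     return np.exp(-0.5 * (s ** 2 + a ** 2)) / (2 * np.pi)
>
> def grad_p(s, a):
>     # Analytic partial derivatives of the toy density w.r.t. s and a.
>     return -s * p(s, a), -a * p(s, a)
>
> s, a = 1.2, -0.8      # an arbitrary current state-action pair
> K1, K2 = 1e-3, 1e-3   # small gains, as assumed in the derivation
>
> dp_ds, dp_da = grad_p(s, a)
> delta_s = dp_ds / p(s, a)   # perturbation along the state score direction
> a1 = dp_da / p(s, a)        # auxiliary action along the action score direction
>
> actual = p(s + K2 * delta_s, a + K1 * a1) - p(s, a)
> predicted = (K2 * dp_ds ** 2 + K1 * dp_da ** 2) / p(s, a)
>
> print(f"actual density increase : {actual:.3e}")
> print(f"first-order prediction  : {predicted:.3e}")  # agree up to O(K^2) terms
> ```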
## 2.Reviewer vi5j (Rating 3, Confidence 3)
**Response to reviewer vi5j**
We appreciate your constructive and detailed comments. Our point-wise responses are provided below:
**Q1:"What is the connection between the gradient method and the potential function method?"**
> **A1:** Our method uses the density of the dataset as the potential function, but only calculates its gradient, rather than the potential function itself, to achieve safe control.
>
> However, we argue that our method differs from the potential function method in key respects and has several advantages:
>
> Firstly, by using the score-based approach, we reconstruct the gradient field of the density function rather than the density function itself (a minimal sketch is given at the end of this answer). Naively reconstructing the density function fails for many reasons; one major reason is the sparseness of the collected trajectories.
>
> Secondly, we minimize the use of human knowledge by permitting access only to the collected dataset, rather than using artificially designed attractive and repulsive functions as in the APF (Artificial Potential Field) method.
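>
> To make this concrete, below is a minimal sketch (our own illustration with placeholder shapes and hyperparameters, not the exact training code of the paper) of how the gradient field $\nabla_x \log p_{\text{data}}(x)$ over state-action pairs $x=(s,a)$ can be estimated with denoising score matching, without ever reconstructing $p_{\text{data}}$ itself:
> ```python
> import torch
> import torch.nn as nn
>
> # Placeholder batch of state-action pairs from the collected safe dataset,
> # flattened to vectors x = (s, a); the dimension 8 is illustrative only.
> x_data = torch.randn(10000, 8)
> sigma = 0.1  # noise level used for denoising score matching
>
> score_net = nn.Sequential(            # small MLP approximating grad_x log p(x)
>     nn.Linear(8, 128), nn.ReLU(),
>     nn.Linear(128, 128), nn.ReLU(),
>     nn.Linear(128, 8),
> )
> opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)
>
> for step in range(1000):
>     idx = torch.randint(0, x_data.shape[0], (256,))
>     x = x_data[idx]
>     noise = torch.randn_like(x) * sigma
>     # Denoising score matching target: grad log q(x_noisy | x) = -noise / sigma^2
>     target = -noise / sigma ** 2
>     loss = ((score_net(x + noise) - target) ** 2).sum(dim=1).mean()
>     opt.zero_grad()
>     loss.backward()
>     opt.step()
>
> # score_net now approximates the gradient field of the (smoothed) data density;
> # no explicit estimate of p_data(x) is ever formed.
> ```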
**Q2:"In the abstract, the authors mentioned that existing methods is inapplicable to ... while their method is generalizable. I do not see any proof of this claim ... Can the authors formally show this in the technical section?"**
> **A2:** Thanks for your question. By "... works focused on embedding risk-averse reasoning into the training ...", we are referring to the distributional RL and constrained RL methods discussed in the related work. Both the distributional critic network and the discounted penalty function depend on a training policy $\pi$, which differs from the behavior policy $\pi_b$ and an initialized policy $\pi_0$. Therefore, even if the safety knowledge embedded in these network structures could be extracted independently for another task requiring the same safety knowledge, the networks could not be reused because $\pi$ is absent. To address any potential confusion caused by the relevant text, we will make corrections in the revision.
**Q3:"It is not clear what are the theoretical analysis ... Can the author provide some quantitative measurement of safety?"**
> **A3:** Our method does not aim to directly optimize a particular safety measurement. Instead, we use such measurements to demonstrate the safety of our algorithm empirically. While a theoretical analysis of the safety measurement itself is not feasible, we can theoretically show that our method increases the likelihood of the agent's state-action pairs under the safe dataset distribution.
**Q4:"It is not clear throughout the paper what is the input of the model."**
> **A4:** We apologize for the confusion. The input of the model is the pair of the state and the original action output by the baseline algorithm. The model then outputs two auxiliary actions, which help to increase the probability of the agent's state-action pair under the dataset distribution. We will add an illustrative figure in the revision.
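>
> For clarity, here is a minimal sketch of the interface we have in mind (the names `score_model`, `K1`, and `K2` are illustrative rather than the paper's exact code). The baseline policy produces the original action, our model produces the two auxiliary actions, and only the executed action changes; the baseline's training is untouched:
> ```python
> import numpy as np
>
> def safe_action(state, base_action, score_model, K1=0.01, K2=0.01):
>     """Plug-and-play correction of the baseline action.
>
>     `score_model(state, action)` is assumed to return the two auxiliary
>     actions (a1, a2) derived from the learned gradient field of the safe
>     dataset distribution; the exact construction follows the paper.
>     """
>     a1, a2 = score_model(state, base_action)
>     return np.asarray(base_action) + K1 * np.asarray(a1) + K2 * np.asarray(a2)
>
> # Usage, with the baseline algorithm left completely unchanged:
> #   a_base = baseline_policy(state)
> #   a_exec = safe_action(state, a_base, score_model)
> #   next_state, reward, done, info = env.step(a_exec)
> ```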
**Q5:"In experiments, the scores seem to be marginal improvement compare to POR. Why do the authors compare the scores instead of the safety metric?"**
> **A5:** We have also compared a safety metric, value at risk (VaR), in our paper. The results of IQL and SRAM (IQL) are shown in Figure 2, and the results of POR and SRAM (POR) are shown in Figure 4 of Appendix B. We compare the scores to directly show the overall effectiveness of the algorithm, and then compare VaR, the safety metric, to demonstrate that our algorithm improves safety.
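>
> For reference, the sketch below shows how such a VaR-style metric can be computed from evaluation returns; here we assume VaR(0.1) denotes the 0.1-quantile of the episode-return distribution (higher is safer), which is how the reported numbers should be read:
> ```python
> import numpy as np
>
> def value_at_risk(returns, alpha=0.1):
>     """alpha-quantile of episode returns: the return that the worst
>     alpha fraction of evaluation episodes fails to exceed."""
>     return np.quantile(np.asarray(returns), alpha)
>
> # e.g. returns collected from 100 evaluation episodes (synthetic numbers)
> episode_returns = np.random.default_rng(0).normal(loc=20.0, scale=2.0, size=100)
> print(f"VaR(0.1) = {value_at_risk(episode_returns):.2f}")
> ```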
**W1:"... This work does not have a good literature review regarding safety control ... "**
>**A:** We apologize for that, and we will expand the literature review during the revision.
**W2:"It looks like the gradient in this work only considers the locations, while ingoring other system states such as speed and acceleration ..."**
>**A:** We apologize for any confusion caused by the visualization of certain parts of our experiments and methodology. However, our method is able to consider factors other than location. In particular, Section 6.2 is based on the D4RL benchmark, whose offline MuJoCo datasets already incorporate factors such as velocity. We incorporate these factors into an offline dataset, rather than relying on human analysis, which allows our method to work in environments whose dynamics are intractable.
>
>To support our claims, we conducted additional experiments in the online setting, using Safety Gym as the environment and PPO as the baseline algorithm. The results are presented at https://sites.google.com/view/additional-results-of-sram, and the following table concisely demonstrates the advantages of our algorithm:
>| VaR(0.1) | Goal0 | Goal1 | Goal2 | Button0 | Button1 | Button2 | Push0 | Push1 | Push2 |
>| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
>| PPO | 19.9±0.15 | 18.7±0.23 | 13.3±1.01 | 7.09±1.56 | 11.2±1.22 | 8.89±0.95 | -0.16±0.02 | -2.07±0.24 | -2.66±0.29 |
>| SRAM (PPO) | 20.3±0.10 | 19.1±0.09 | 14.9±0.67 | 8.36±1.46 | 12.4±0.35 | 9.96±1.41 | -0.14±0.04 | -1.66±0.29 | -2.06±0.49 |
## 3.Reviewer FZfq (Rating 3, Confidence 4)
**Response to reviewer FZfq**
We appreciate the constructive comments and questions you raised. Our point-wise responses are provided below:
**Q:There have been many works on this topic. What's the novelty of your work compared to previous works? Two surveys on this topic can be found at: https://arxiv.org/pdf/2108.06266.pdf https://openreview.net/pdf?id=UGp6FDaxB0f The method is very similar to safety-shield-based methods (see above two surveys). What's the difference and advantage of your method compared with these methods?**
>**A:** Thanks for your question. We summarize the differences and advantages below:
>
>1) The way of extracting safety information is different. Safety-shield-based methods need a manually designed energy function to determine which states are safe and then generate a safe set; the goal of these methods is to choose actions that keep the state in the safe set. In our method, we do not pre-specify which states are safe and do not constrain states to lie in a designated set. We use a diffusion model to extract the gradient field of the dataset distribution, and all safety information comes from the dataset.
>
>2) The idea for improving safety is different. Safety-shield-based methods learn the safe control set at every state and choose actions within this set, which improves the safety of the action. The implicit safe set algorithm (ISSA) modifies the action by following gradient directions, sampled from a Gaussian distribution, until the action is in the safe control set. In contrast, we do not require the action to belong to any set. We use a score-based diffusion model to generate auxiliary safe actions, which increases the probability of the state-action pair under the dataset distribution and therefore the probability that the agent stays in safe and familiar scenarios.
>
>Advantages:
>
>1) Our approach does not need specially designed functions for each environment. For safety-shield-based methods, the energy function must be designed manually and properly to separate safe and unsafe states; our method needs no such pre-designed functions.
>
>2) Our approach requires no prior knowledge of the environment. Safety-shield-based methods require information about the environment, such as the positions of obstacles and the position, velocity, and acceleration of the agent, but in some environments this information is unknown. The input of our method is only the dataset, which already exists in the offline setting; in the online setting, we propose an algorithm that generates a safe dataset for learning without any additional information about the environment.
>
>3) Our approach works in environments without explicit risky regions. For instance, the MuJoCo environments contain no obstacles, so it is hard for safety-shield-based methods to define the set of safe states, whereas our method handles this case: both the average scores and the VaR are improved by our approach, as shown in the experiments.
>
>4) Our approach takes less time. At every state, safety-shield-based methods need to sample many directions to find a safe action in the safe control set, which is time-consuming. In our method, the auxiliary actions for the current state are obtained directly as the output of our model, as illustrated in the toy sketch below.
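>
>The toy sketch below (entirely illustrative, with a placeholder safety check and a placeholder auxiliary-action model, not code from any cited work) contrasts the two per-step procedures: a sampling-based shield may evaluate many candidate perturbations before accepting one, whereas our correction is a single model call:
>```python
>import numpy as np
>
>rng = np.random.default_rng(0)
>
>def shield_style_action(state, base_action, is_safe, n_samples=100, scale=0.1):
>    """Caricature of a sampling-based shield: perturb the action at random
>    until a (placeholder) safety check accepts a candidate."""
>    for _ in range(n_samples):
>        candidate = base_action + scale * rng.normal(size=base_action.shape)
>        if is_safe(state, candidate):        # one check per candidate
>            return candidate
>    return base_action
>
>def sram_style_action(state, base_action, aux_model, K1=0.01, K2=0.01):
>    """Our setting: the auxiliary actions come from a single forward pass."""
>    a1, a2 = aux_model(state, base_action)
>    return base_action + K1 * a1 + K2 * a2
>
># Placeholder callables, only to make the comparison runnable:
>state, base = np.zeros(4), np.zeros(2)
>a_shield = shield_style_action(state, base, is_safe=lambda s, a: np.linalg.norm(a) < 0.05)
>a_sram = sram_style_action(state, base, aux_model=lambda s, a: (np.ones(2), -np.ones(2)))
>```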
<!-- ## Request for an additional review
Dear PCs, ACs, and SACs:
Thank you for your efforts in organizing the review process and making the conference a success.
We understand that, given the huge number of submissions this year, the review task is not easy. However, we were disappointed to receive some unprofessional comments at a top machine learning conference. In particular, the comments from reviewer FZfq seem confusing: very few statements are given, only one question is raised, and the opinions offered are simple summary statements. Because of the logical error in this criticism, we are confident that we can refute the question solidly.
In the question, the reviewer provides links to two papers and asks how our approach differs from them. The differences are clear. The first link, https://arxiv.org/pdf/2108.06266.pdf, surveys safe decision-making from both the control-theory and RL perspectives; the RL methods it covers are safe exploration and safe optimization, risk-averse RL, uncertainty-aware RL, and constrained-MDP methods, which are very different from our approach.
The second link, https://openreview.net/pdf?id=UGp6FDaxB0f (ISSA), describes an implicit safe control algorithm that does not require a white-box model of the environment. However, that work needs human knowledge to design a safety index, whose calculation requires the speed and acceleration; the ISSA method therefore relies on a dynamics-aware setting with a hand-crafted safety index. In contrast, our method is essentially an offline RL algorithm and does not require access to the environment dynamics; with some model-based tricks, a completely black-box environment is allowed. Moreover, our method is restricted to accessing only the offline dataset, so in the offline experiment (Section 6.2 of our paper) it competes fairly with other conservative offline algorithms. In other words, in the offline setting, our method is exactly a novel conservative offline algorithm.
We therefore hope the PCs can, if possible, assign an additional reviewer to our work, or pay closer attention to reviewer FZfq.
Thanks. -->