# Plug-and-play Safety Control Rebuttal
## 1.Reviewer cMPZ (Rating 6, Confidence 2)
**Response to reviewer cMPZ**
We appreciate your constructive and detailed comments. Our point-wise responses are provided below:
**Q1:"... Can the proposed method be applied to RL fields other than safety? ... Therefore, the proposed method can be applied to any RL fields that pursue auxiliary rewards."**
> **A1:** Sincere thanks for your constructive suggestion. Our method can be applied to other RL settings if a function of $p_{data}(x)$ is used as an intrinsic reward, and we will explore this in future work. However, in this paper the training process of our method is completely independent of the baseline algorithm. Modifying rewards to encourage safer actions would affect the training process of the baseline algorithm, which is inconsistent with the plug-and-play idea of this paper.
**L1:"... only qualitatively guarantees that the state-action pairs are made in or closer to the safe distribution. But a quantitative analysis may further enhance the solidness of your method. That is, $p_{\text{data}}\left(s_{t}^{\prime}, a_{t}+K_{1} a_{1}^{t}+K_{2} a_{2}^{t}\right)-p_{\text{data}}\left(s_{t}, a_{t}\right) \geq$ some function of $p_{\text{data}}, s_{t}, a_{t}, \ldots$"**
> **A:** Thank you very much for pointing out this limitation of our theory. A quantitative analysis can be obtained under additional assumptions. Suppose the first derivatives are continuous and, since $K_1$ and $K_2$ are very small, assume $\frac{\partial p_{\text{data}}(s, a)}{\partial s} = \frac{\partial p_{\text{data}}(s+K_2\Delta s, a+K_1a_1)}{\partial s}$ and $\frac{\partial p_{\text{data}}(s, a)}{\partial a} = \frac{\partial p_{\text{data}}(s+K_2\Delta s, a+K_1a_1)}{\partial a}$. Then
> $$
> \begin{aligned}
> & p_{\text{data}}(s_{t}+K_{2}\Delta s_{t},\, a_{t}+K_{1}a_{1}^{t}+K_{2}a_{2}^{t}) - p_{\text{data}}(s_{t}, a_{t}) \\
> =\ & p_{\text{data}}(s_{t}+K_{2}\Delta s_{t},\, a_{t}+K_{1}a_{1}^{t}+o(a_{t})) - p_{\text{data}}(s_{t}, a_{t}) \\
> =\ & K_{2}\Delta s_{t}\,\frac{\partial p_{\text{data}}(s_{t}+hK_{2}\Delta s_{t},\, a_{t}+gK_{1}a_{1}^{t})}{\partial s} + K_{1}a_{1}^{t}\,\frac{\partial p_{\text{data}}(s_{t}+hK_{2}\Delta s_{t},\, a_{t}+gK_{1}a_{1}^{t})}{\partial a} \\
> =\ & \frac{K_{2}}{p_{\text{data}}(s_{t},a_{t})}\frac{\partial p_{\text{data}}(s_{t},a_{t})}{\partial s}\,\frac{\partial p_{\text{data}}(s_{t}+hK_{2}\Delta s_{t},\, a_{t}+gK_{1}a_{1}^{t})}{\partial s} + \frac{K_{1}}{p_{\text{data}}(s_{t},a_{t})}\frac{\partial p_{\text{data}}(s_{t},a_{t})}{\partial a}\,\frac{\partial p_{\text{data}}(s_{t}+hK_{2}\Delta s_{t},\, a_{t}+gK_{1}a_{1}^{t})}{\partial a} \\
> =\ & \frac{K_{2}}{p_{\text{data}}(s_{t},a_{t})}\left\|\frac{\partial p_{\text{data}}(s_{t},a_{t})}{\partial s}\right\|^{2} + \frac{K_{1}}{p_{\text{data}}(s_{t},a_{t})}\left\|\frac{\partial p_{\text{data}}(s_{t},a_{t})}{\partial a}\right\|^{2}, \qquad h,g\in[0,1].
> \end{aligned}
> $$
>
> We will replace the qualitative analysis with a quantitative one and update the lemma, assumptions, and theorem accordingly in the revision.
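>
> For intuition, the following is a minimal numerical sketch (a toy example of our own, not code from the paper) in which an isotropic 2-D Gaussian stands in for $p_{\text{data}}(s,a)$. Stepping the state-action pair along the score directions $\frac{1}{p_{\text{data}}}\nabla p_{\text{data}}$ by small amounts $K_2\Delta s_t$ and $K_1 a_1^t$, as in the derivation above, increases the density by approximately the first-order quantity on the last line:
> ```python
> import numpy as np
>
> # Toy stand-in for p_data(s, a): an isotropic 2-D Gaussian over scalar (s, a).
> # In our method the true density is unknown and only its score is estimated.
> def p(s, a):
>     return np.exp(-0.5 * (s ** 2 + a ** 2)) / (2 * np.pi)
>
> def grad_p(s, a):
>     # Analytic partial derivatives of the toy density w.r.t. s and a.
>     return -s * p(s, a), -a * p(s, a)
>
> s, a = 1.2, -0.8      # an arbitrary current state-action pair
> K1, K2 = 1e-3, 1e-3   # small gains, as assumed in the derivation
>
> dp_ds, dp_da = grad_p(s, a)
> delta_s = dp_ds / p(s, a)   # perturbation along the state score direction
> a1 = dp_da / p(s, a)        # auxiliary action along the action score direction
>
> actual = p(s + K2 * delta_s, a + K1 * a1) - p(s, a)
> predicted = (K2 * dp_ds ** 2 + K1 * dp_da ** 2) / p(s, a)
>
> print(f"actual density increase : {actual:.3e}")
> print(f"first-order prediction  : {predicted:.3e}")  # agree up to O(K^2) terms
> ```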
## 2.Reviewer vi5j (Rating 3, Confidence 3)
**Response to reviewer vi5j**
We appreciate your constructive and detailed comments. Our point-wise responses are provided below:
**Q1:"What is the connection between the gradient method and the potential function method?"**
> **A1:** Our method uses the density of the dataset as the potential function, but only calculates its gradient, rather than the potential function itself, to achieve safe control.
>
> However, we argue that our method differs from the potential function method in key respects and has several advantages:
>
> Firstly, by using the score-based approach, we reconstruct the gradient field of the density function rather than the density function itself (a minimal sketch is given at the end of this answer). Naively reconstructing the density function fails for many reasons; one major reason is the sparseness of the collected trajectories.
>
> Secondly, we minimize the use of human knowledge by permitting access only to the collected dataset, rather than using artificially designed attractive and repulsive functions as in the APF (Artificial Potential Field) method.
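>
> To make this concrete, below is a minimal sketch (our own illustration with placeholder shapes and hyperparameters, not the exact training code of the paper) of how the gradient field $\nabla_x \log p_{\text{data}}(x)$ over state-action pairs $x=(s,a)$ can be estimated with denoising score matching, without ever reconstructing $p_{\text{data}}$ itself:
> ```python
> import torch
> import torch.nn as nn
>
> # Placeholder batch of state-action pairs from the collected safe dataset,
> # flattened to vectors x = (s, a); the dimension 8 is illustrative only.
> x_data = torch.randn(10000, 8)
> sigma = 0.1  # noise level used for denoising score matching
>
> score_net = nn.Sequential(            # small MLP approximating grad_x log p(x)
>     nn.Linear(8, 128), nn.ReLU(),
>     nn.Linear(128, 128), nn.ReLU(),
>     nn.Linear(128, 8),
> )
> opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)
>
> for step in range(1000):
>     idx = torch.randint(0, x_data.shape[0], (256,))
>     x = x_data[idx]
>     noise = torch.randn_like(x) * sigma
>     # Denoising score matching target: grad log q(x_noisy | x) = -noise / sigma^2
>     target = -noise / sigma ** 2
>     loss = ((score_net(x + noise) - target) ** 2).sum(dim=1).mean()
>     opt.zero_grad()
>     loss.backward()
>     opt.step()
>
> # score_net now approximates the gradient field of the (smoothed) data density;
> # no explicit estimate of p_data(x) is ever formed.
> ```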
**Q2:"In the abstract, the authors mentioned that existing methods is inapplicable to ... while their method is generalizable. I do not see any proof of this claim ... Can the authors formally show this in the technical section?"**
> **A2:** Thanks for your question. By "... works focused on embedding risk-averse reasoning into the training ...", we are referring to the distributional RL and constrained RL methods discussed in the related work. Both the distributional critic network and the discounted penalty function depend on a training policy $\pi$, which differs from the behavior policy $\pi_b$ and an initialized policy $\pi_0$. Therefore, even if the safety knowledge embedded in these network structures could be extracted independently for another task requiring the same safety knowledge, the networks could not be reused because $\pi$ is absent. To address any potential confusion caused by the relevant text, we will make corrections in the revision.
**Q3:"It is not clear what are the theoretical analysis ... Can the author provide some quantitative measurement of safety?"**
> **A3:** Our method does not aim to directly optimize a particular safety measurement. Instead, we use such measurements to demonstrate the safety of our algorithm empirically. While a theoretical analysis of the safety measurement itself is not feasible, we can theoretically show that our method increases the likelihood of the agent's state-action pairs under the safe dataset distribution.
**Q4:"It is not clear throughout the paper what is the input of the model."**
> **A4:** We apologize for the confusion. The input of the model is the pair of the state and the original action output by the baseline algorithm. The model then outputs two auxiliary actions, which help to increase the probability of the agent's state-action pair under the dataset distribution. We will add an illustrative figure in the revision.
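>
> For clarity, here is a minimal sketch of the interface we have in mind (the names `score_model`, `K1`, and `K2` are illustrative rather than the paper's exact code). The baseline policy produces the original action, our model produces the two auxiliary actions, and only the executed action changes; the baseline's training is untouched:
> ```python
> import numpy as np
>
> def safe_action(state, base_action, score_model, K1=0.01, K2=0.01):
>     """Plug-and-play correction of the baseline action.
>
>     `score_model(state, action)` is assumed to return the two auxiliary
>     actions (a1, a2) derived from the learned gradient field of the safe
>     dataset distribution; the exact construction follows the paper.
>     """
>     a1, a2 = score_model(state, base_action)
>     return np.asarray(base_action) + K1 * np.asarray(a1) + K2 * np.asarray(a2)
>
> # Usage, with the baseline algorithm left completely unchanged:
> #   a_base = baseline_policy(state)
> #   a_exec = safe_action(state, a_base, score_model)
> #   next_state, reward, done, info = env.step(a_exec)
> ```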
**Q5:"In experiments, the scores seem to be marginal improvement compare to POR. Why do the authors compare the scores instead of the safety metric?"**
> **A5:** We have also compared a safety metric, value at risk (VaR), in our paper. The results of IQL and SRAM (IQL) are shown in Figure 2, and the results of POR and SRAM (POR) are shown in Figure 4 of Appendix B. We compare the scores to directly show the overall effectiveness of the algorithm, and then compare VaR, the safety metric, to demonstrate that our algorithm improves safety.
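>
> For reference, the sketch below shows how such a VaR-style metric can be computed from evaluation returns; here we assume VaR(0.1) denotes the 0.1-quantile of the episode-return distribution (higher is safer), which is how the reported numbers should be read:
> ```python
> import numpy as np
>
> def value_at_risk(returns, alpha=0.1):
>     """alpha-quantile of episode returns: the return that the worst
>     alpha fraction of evaluation episodes fails to exceed."""
>     return np.quantile(np.asarray(returns), alpha)
>
> # e.g. returns collected from 100 evaluation episodes (synthetic numbers)
> episode_returns = np.random.default_rng(0).normal(loc=20.0, scale=2.0, size=100)
> print(f"VaR(0.1) = {value_at_risk(episode_returns):.2f}")
> ```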
**W1:"... This work does not have a good literature review regarding safety control ... "**
>**A:** We apologize for that, and we will expand the literature review during the revision.
**W2:"It looks like the gradient in this work only considers the locations, while ingoring other system states such as speed and acceleration ..."**
>**A:** We apologize for any confusion caused by the visualization of certain parts of our experiments and methodology. However, our method is able to consider factors other than location. In particular, Section 6.2 is based on the D4RL benchmark, whose offline MuJoCo datasets already incorporate factors such as velocity. We incorporate these factors into an offline dataset, rather than relying on human analysis, which allows our method to work in environments whose dynamics are intractable.
>
>To support our claims, we conducted additional experiments in the online setting, using Safety Gym as the environment and PPO as the baseline algorithm. The results are presented at https://sites.google.com/view/additional-results-of-sram, and the following table concisely demonstrates the advantages of our algorithm:
>| VaR(0.1) | Goal0 | Goal1 | Goal2 | Button0 | Button1 | Button2 | Push0 | Push1 | Push2 |
>| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
>| PPO | 19.9±0.15 | 18.7±0.23 | 13.3±1.01 | 7.09±1.56 | 11.2±1.22 | 8.89±0.95 | -0.16±0.02 | -2.07±0.24 | -2.66±0.29 |
>| SRAM (PPO) | 20.3±0.10 | 19.1±0.09 | 14.9±0.67 | 8.36±1.46 | 12.4±0.35 | 9.96±1.41 | -0.14±0.04 | -1.66±0.29 | -2.06±0.49 |
## 3.Reviewer FZfq (Rating 3, Confidence 4)
**Response to reviewer FZfq**
We appreciate the constructive comments and questions you raised. Our point-wise responses are provided below:
**Q:There have been many works on this topic. What's the novelty of your work compared to previous works? Two surveys on this topic can be found at: https://arxiv.org/pdf/2108.06266.pdf https://openreview.net/pdf?id=UGp6FDaxB0f The method is very similar to safety-shield-based methods (see above two surveys). What's the difference and advantage of your method compared with these methods?**
>**A:** Thanks for your question. We summarize the differences and advantages below:
>
>1) The way of extracting safety information is different. Safety-shield-based methods need a manually designed energy function to determine which states are safe and then generate a safe set; the goal of these methods is to choose actions that keep the state in the safe set. In our method, we do not pre-specify which states are safe and do not constrain states to lie in a designated set. We use a diffusion model to extract the gradient field of the dataset distribution, and all safety information comes from the dataset.
>
>2) The idea for improving safety is different. Safety-shield-based methods learn the safe control set at every state and choose actions within this set, which improves the safety of the action. The implicit safe set algorithm (ISSA) modifies the action by following gradient directions, sampled from a Gaussian distribution, until the action is in the safe control set. In contrast, we do not require the action to belong to any set. We use a score-based diffusion model to generate auxiliary safe actions, which increases the probability of the state-action pair under the dataset distribution and therefore the probability that the agent stays in safe and familiar scenarios.
>
>Advantages:
>
>1) Our approach does not need specially designed functions for each environment. For safety-shield-based methods, the energy function must be designed manually and properly to separate safe and unsafe states; our method needs no such pre-designed functions.
>
>2) Our approach requires no prior knowledge of the environment. Safety-shield-based methods require information about the environment, such as the positions of obstacles and the position, velocity, and acceleration of the agent, but in some environments this information is unknown. The input of our method is only the dataset, which already exists in the offline setting; in the online setting, we propose an algorithm that generates a safe dataset for learning without any additional information about the environment.
>
>3) Our approach works in environments without explicit risky regions. For instance, the MuJoCo environments contain no obstacles, so it is hard for safety-shield-based methods to define the set of safe states, whereas our method handles this case: both the average scores and the VaR are improved by our approach, as shown in the experiments.
>
>4) Our approach takes less time. At every state, safety-shield-based methods need to sample many directions to find a safe action in the safe control set, which is time-consuming. In our method, the auxiliary actions for the current state are obtained directly as the output of our model, as illustrated in the toy sketch below.
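>
>The toy sketch below (entirely illustrative, with a placeholder safety check and a placeholder auxiliary-action model, not code from any cited work) contrasts the two per-step procedures: a sampling-based shield may evaluate many candidate perturbations before accepting one, whereas our correction is a single model call:
>```python
>import numpy as np
>
>rng = np.random.default_rng(0)
>
>def shield_style_action(state, base_action, is_safe, n_samples=100, scale=0.1):
>    """Caricature of a sampling-based shield: perturb the action at random
>    until a (placeholder) safety check accepts a candidate."""
>    for _ in range(n_samples):
>        candidate = base_action + scale * rng.normal(size=base_action.shape)
>        if is_safe(state, candidate):        # one check per candidate
>            return candidate
>    return base_action
>
>def sram_style_action(state, base_action, aux_model, K1=0.01, K2=0.01):
>    """Our setting: the auxiliary actions come from a single forward pass."""
>    a1, a2 = aux_model(state, base_action)
>    return base_action + K1 * a1 + K2 * a2
>
># Placeholder callables, only to make the comparison runnable:
>state, base = np.zeros(4), np.zeros(2)
>a_shield = shield_style_action(state, base, is_safe=lambda s, a: np.linalg.norm(a) < 0.05)
>a_sram = sram_style_action(state, base, aux_model=lambda s, a: (np.ones(2), -np.ones(2)))
>```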
<!-- ## Request for an additional review
Dear PCs, ACs, and SACs:
Thank you for your efforts in organizing the review process and making the conference a success.
We understand that, given the huge number of submissions this year, the review task is not easy. However, we were disappointed to receive some unprofessional comments at a top machine learning conference. In particular, the comments from reviewer FZfq seem confusing: very few statements are given, only one question is raised, and the opinions offered are simple summary statements. Because of the logical error in this criticism, we are confident that we can refute the question solidly.
In the question, the reviewer provides links to two papers and asks how our approach differs from them. The differences are clear. The first link, https://arxiv.org/pdf/2108.06266.pdf, surveys safe decision-making from both the control-theory and RL perspectives; the RL methods it covers are safe exploration and safe optimization, risk-averse RL, uncertainty-aware RL, and constrained-MDP methods, which are very different from our approach.
The second link, https://openreview.net/pdf?id=UGp6FDaxB0f (ISSA), describes an implicit safe control algorithm that does not require a white-box model of the environment. However, that work needs human knowledge to design a safety index, whose calculation requires the speed and acceleration; the ISSA method therefore relies on a dynamics-aware setting with a hand-crafted safety index. In contrast, our method is essentially an offline RL algorithm and does not require access to the environment dynamics; with some model-based tricks, a completely black-box environment is allowed. Moreover, our method is restricted to accessing only the offline dataset, so in the offline experiment (Section 6.2 of our paper) it competes fairly with other conservative offline algorithms. In other words, in the offline setting, our method is exactly a novel conservative offline algorithm.
We therefore hope the PCs can, if possible, assign an additional reviewer to our work, or pay closer attention to reviewer FZfq.
Thanks. -->