# General Comment
We thank the reviewers for their helpful comments and feedback. Overall, the reviewers found our perspective to be novel (ZEGo, Voa2, Lizq), our paper to be well-written (ZEGo, QuAp, Voa2, Lizq) and the quality of our figures and experiments to be high (Voa2, Lizq).
Multiple reviewers (ZEGo, ErLG, Voa2) were curious to know whether our method of analysis based on post-update return distributions, and our results, generalize to other domains. We provide an affirmative answer by showing that the tools we introduce are readily applicable to discrete-action environments (see Figures 1 (b), 4, and 5 in the pdf), and that meaningful variations in the post-update return can be found across different settings.
As discrete-action environments, we employ four games from the standard ALE benchmark (Bellemare et al., 2013). We run PPO on these environments for 5 runs of 10 million steps each, and measure post-update return distribution statistics using the same protocol employed in the rest of our paper: we evaluate 10 policies evenly spaced across training, and perform 1000 independent updates to each policy. We find that post-update return distributions computed for these environments still exhibit a remarkable degree of variation, as captured by their standard deviation. At the same time, the shape of the resulting distributions can differ substantially from the ones observed in robotic locomotion tasks, and is therefore not necessarily described in a rich way by metrics such as the LTP (see Figure 5).
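For concreteness, below is a minimal sketch of this measurement protocol, not our actual implementation: `agent_update` and `evaluate_return` are hypothetical placeholders standing in for one optimizer update of the algorithm (e.g. a single PPO update) and a single evaluation rollout.

```python
import numpy as np

def post_update_returns(checkpoint_params, num_updates=1000,
                        agent_update=None, evaluate_return=None):
    """Estimate the post-update return distribution of one policy checkpoint:
    perform `num_updates` independent single updates, each starting from the
    same checkpoint, and evaluate the return of each resulting policy."""
    returns = np.empty(num_updates)
    for i in range(num_updates):
        updated_params = agent_update(checkpoint_params, seed=i)
        returns[i] = evaluate_return(updated_params)
    return returns
```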
Overall, these results reinforce the utility of return landscapes and post-update return distributions as a general tool to understand policies produced by deep reinforcement learning algorithms.
Reviewers also raised a range of individual issues, which we have addressed. This includes clarifying the motivation behind the paper, reporting more precise quantitative evidence for our findings, and clearing up some misunderstandings. We address these in replies to each reviewer below.
# Reviewer ZEGo
Thank you for your feedback!
> In my understanding the concept the authors propose apply to stochastic environments and discrete actions as well.
We thank the reviewer for the suggestion. We have run two experiments (Brax with stochastic policies and four games from the ALE, see General Response) to confirm that this is the case.
For the stochastic setting, we generalize our experimental protocol for computing the post-update return distribution to the case in which the policy is non-deterministic. For each policy checkpoint from a TD3 training run, we produce 100 policies with the TD3 update, and then evaluate each stochastic policy (obtained with the same random perturbation scale used during training) 10 times, computing the resulting distribution statistics. We find that, for a given policy, this alternative post-update return distribution yields very similar results to the one based on its deterministic version (see Figure 2 in the pdf). In the paper, we focus on deterministic environments in order to understand the relationship between policy parameters and return without the confounding effect of environmental stochasticity.
> Maybe it's better to rename the work as "Policy Optimization in a Noisy Neighborhood: On Return Landscapes in MuJoCo" since the work in its current form does not have any other domains.
We appreciate the point from the reviewer. Including the results in the Appendix and the new results on the discrete-action ALE, we now have a total of three different simulators (Brax, MuJoCo via DMC, and Atari). We believe this shows that the method of analysis is general, even though, for computational reasons, we have run most of our experiments on the efficient Brax simulator.
> In other words, it is interesting to see whether the observation is general across various domains, especially the stability of interpolation of policies inside a single run.
We thank the reviewer for the suggestion. To test whether the phenomenon is also present in other domains, we run an interpolation experiment on the ALE under the same PPO-based setting described in the general response. Following the suggestion of other reviewers, we quantify the phenomenon with the following procedure. For each of the four games, we sample at least 20 pairs of policies from the same runs and 20 pairs of policies from different runs of the algorithm. Then, we linearly interpolate between the policies in each pair, producing 50 intermediate policies, and randomly perturb them using Gaussian noise with standard deviation $0.0003$ to obtain an estimate of the mean of their (random) post-update return distribution. For each pair of policies, we measure how frequently the return collapses between the two endpoints, by counting how many times it drops below 10\% of the minimum return of the two original policies. We then average this _Below-Threshold Proportion_ across pairs, and across environments using rliable (Agarwal et al., 2021). Figure 1 (b) shows that the phenomenon, properly quantified, is still present in a very different class of environments (discrete-action, game-based).
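For clarity, here is a minimal sketch of the Below-Threshold Proportion computation described above. It assumes the mean of a policy's post-update return distribution is approximated by averaging returns of a few Gaussian perturbations of its parameters; `evaluate_return` and the number of perturbations per interpolation point are illustrative placeholders not specified in the text.

```python
import numpy as np

def below_threshold_proportion(theta_a, theta_b, evaluate_return,
                               num_points=50, noise_std=3e-4, rel_threshold=0.1,
                               num_perturbations=10, rng=None):
    """Linearly interpolate between two policy parameter vectors and measure how
    often the estimated mean post-update return collapses at interior points."""
    rng = np.random.default_rng() if rng is None else rng
    # Collapse threshold: 10% of the smaller endpoint return.
    threshold = rel_threshold * min(evaluate_return(theta_a), evaluate_return(theta_b))
    collapses = 0
    for alpha in np.linspace(0.0, 1.0, num_points + 2)[1:-1]:  # interior points only
        theta = (1.0 - alpha) * theta_a + alpha * theta_b
        # Estimate the mean post-update return via small Gaussian perturbations.
        returns = [evaluate_return(theta + noise_std * rng.standard_normal(theta.shape))
                   for _ in range(num_perturbations)]
        if np.mean(returns) < threshold:
            collapses += 1
    return collapses / num_points
```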
# Reviewer ErLG
Thank you for your feedback!
> The idea of looking at return surfaces is not original (e.g., see [1], [2], and [3] among many others).
While the return landscape has been studied in other works, our distributional perspective on it has no precedent. It is also important to distinguish between works which study a loss landscape (e.g. [2]), and a landscape of returns from the environment, which we study.
Of the works in the latter category, [1] emphasizes visualizations of the landscape and [3] studies its relationship with the optimization landscape. Neither of these works shows how distributional properties of the landscape can characterize policies, nor do they demonstrate the macro-scale structure of the landscape that emerges when interpolating between policies.
> I encourage the authors to dedicate the last paragraph of their introduction to list 3 or 4 main contributions, and commit to delivering them in the entire paper.
Our introduction already explicitly states our contributions, in the form of three paragraphs with bold titles, each corresponding to a specific contribution.
> the manuscript looks more like a technical report draft rather than a conference paper.
> most NeurIPS papers focus on driving the main message in the initial sections, and avoid detailing the methods/data until the experiments section.
We do not present any experimental detail in the introduction, and simply enumerate and explain our findings. Note that the overall presentation strategy has been praised by most of the other reviewers.
> vague terms are consistently used throughout the paper, including, but not limited to, (1) the "quality" of a policy, (2) "noisy" neighborhoods, (3) "failure" of a policy or a trajectory, (4) "safer" behavior
These concepts are defined in the paper or in related work. Notions such as "stability" and "failure" are entailed by the post-update return distributions and the LTP. "Noisy neighborhood" is a name defined in line 31. "Safer" is used in the sense reported in the related work. Better "quality" means a lower LTP or standard deviation for a given level of mean post-update return.
> variations in the return are inherently undesirable, and I don't necessarily agree with that. In my experience, Deep RL algorithms such as PPO, SAC, and TD3 gravitate towards such "noisy" regions
An important point behind our analysis based on post-update return distributions is showing that a policy optimization algorithm produces policies with the same level of return but very different levels of variability under further updates. Our experiments have the goal of quantifying and understanding this phenomenon, which is generally undesirable when deploying a policy, but also compelling from a purely scientific perspective.
> The paper mainly focuses on oscillations in the deterministic landscape, whereas the effectively optimized landscape can be much smoother
Our paper is about the return landscape resulting from evaluating a policy in the environment, regardless of how smooth the optimization objective used by the algorithm that produced this policy was.
> interpolating parameters within the same run in a deep classification task can lead to reasonably-performing parameters
The reviewer is correct that a similar phenomenon exists in deep classification (see citations). However, the RL setting is different: the optimization objective is non-stationary, and the evaluation metric (the return rather than the loss) depends on an environment and on multiple forward passes of the neural network. We believe this makes the existence of a phenomenon akin to mode connectivity in RL far from certain.
> Line 245 speculates about the existence of "paths" (I'm not even sure what it means)
A path in this context simply refers to a trajectory in parameter space. Thus, a linear path is one that traverses parameter space along a line, given by the interpolation between two parameter vectors.
> Many depicted training curves are based upon single trainings.
Please note that our paper does not report any single training curve, since they are not the object of study of our work.
> 20 seeds were used for running the experiments. I encourage the authors to run the experiments for at least a 100 seeds.
We disagree that additional seeds are required to support the results presented in the paper. Using 100 seeds instead of 20 would only populate the scatter plot in Figure 2 with more points; it would not increase the significance of any of the new plots in the pdf, while greatly increasing the computational cost of our experiments without any clear benefit.
> Finding ways to incorporate confidence intervals in the plots is important for this paper.
We incorporated error bars in the experiments in the pdf using bootstrapped CIs (as in Agarwal et al., 2021). First, we compute the uncertainty of the statistics of the post-update return distribution (see Figure 6), showing it is very small; second, we quantify the presence of return drops when interpolating across pairs of policies (see Figure 1 and the response to Voa2); third, we show that the improvement provided by Algorithm 1 across environments is statistically significant (Figure 8); fourth, we measure the number of rejections made by Algorithm 1 (see Figure 7).
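As an illustration of how such intervals are typically obtained with the rliable package, here is a minimal sketch; the `scores` dictionary contains dummy arrays standing in for the actual per-run, per-environment measurements (e.g. Below-Threshold Proportions), and the number of bootstrap repetitions is illustrative.

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Placeholder data: one (num_runs, num_envs) score matrix per condition.
scores = {
    "same-run pairs": np.random.rand(20, 4),
    "different-run pairs": np.random.rand(20, 4),
}

# Stratified bootstrap (Agarwal et al., 2021): point estimates and 95% CIs
# for the mean and IQM, aggregated over runs and environments.
aggregate = lambda x: np.array([metrics.aggregate_mean(x), metrics.aggregate_iqm(x)])
point_estimates, interval_estimates = rly.get_interval_estimates(
    scores, aggregate, reps=2000)
```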
> The number of environments considered in the paper are limited.
See Appendix and general response.
> while standard gym benchmarks are a good start, they do not depict realistic robotic artifacts.
> the authors never discuss the physical aspects of each environment
Our paper is mainly directed at the deep RL community. Since simple robotic locomotion tasks or Atari games are standard in this community, we believe the extension of our analysis to more complex robotic artifacts is an interesting avenue for future research, but outside the scope of our paper.
> I encourage the authors to release an open-source link to the code
We will release code once the paper is published.
# AC Comment
Dear Area Chair,
We want to bring your attention to the review by ErLG. The review contains several subjective or inaccurate statements that likely influenced the negative score given by the reviewer.
To give a few examples, the reviewer wrote
> I encourage the authors to dedicate the last paragraph of their introduction to list 3 or 4 main contributions, and commit to delivering them in the entire paper.
despite our introduction containing clearly stated contributions, or:
> As is, the manuscript looks more like a technical report draft rather than a conference paper. In particular, note that most NeurIPS papers focus on driving the main message in the initial sections, and avoid detailing the methods/data until the experiments section.
which is a personal opinion on how a NeurIPS paper should look, and which has been perceived completely differently by other reviewers.
The reviewer also brings up points which are atypical for the deep RL community, such as:
> while standard gym benchmarks are a good start, they do not depict realistic robotic artifacts.
> While the paper is specifically targeting continuous control, the authors never discuss the physical aspects of each environment and do not make physical variations to the underlying physical models.
We do not think that using the most realistic robotic artifacts, or analyzing the physical aspects of the environments in depth, is a requirement for a deep RL paper to be accepted at a machine learning conference. The reviewer also says:
> 20 seeds were used for running the experiments. I encourage the authors to run the experiments for at least a 100 seeds.
We believe that at least 100 seeds per experiment is an unreasonable request for most RL papers. Furthermore, where applicable, we have included appropriate confidence intervals in our rebuttal PDF which demonstrate that the data provided is more than enough to present statistically significant results.
Overall, we would be grateful if this was considered during the discussion period or the decision on our paper.
Thank you,
The Authors
## Second response to ErLG
Thank you for your response!
1. We committed to making some changes to our paper in the rebuttals. In particular, we will include all the figures contained in the rebuttal pdf in the final version of the paper. The rest of this answer also includes additional modifications that we will be making to the paper.
2. We appreciate your point on nomenclature, but largely disagree with it.
- The "noisy" term can be indeed seen as being associated with the stochasticity of returns from the post-update return distribution; even if the return is actually a function, a defining feature of our study is the use of distributional tools to characterize it. Therefore, we believe that "noisy neighborhood" has an appropriate degree of accuracy.
- On "stability", we will make more clear in the paper that the concept of stability we refer to is the one associated to the Left-Tail Probabilility of the post-update return distribution of a policy. Note that related notions of stability have been employed in other work, beyond (Nikishin et al., 2018), for instance as "variability within training runs" in (Chan et al, 2019) or in (Khanna et al, 2022).
- To avoid confusion, we will only keep the term "safety" to refer to previous work and substitute any reference to it with "stability" in the LTP sense.
- The word "paths" has been largely used in previous work on mode connectivity, to denote a sequence of vectors that is connected in a parameter space, for instance by a linear interpolation. To provide concrete examples, (Frankle et al, 2019) writes "networks are connected by linear paths of constant test error" and the recent Git Re-Basin paper (Ainsworth et al., 2022) writes "connected by paths of near-constant loss".
3. We apologize for the confusion, which was due to the lack of space in the original rebuttal. The connection between the optimization objective and the return landscape is indeed at the heart of our work. The post-update return distribution is the bridge between optimization (through the execution of multiple updates) and returns (through the evaluation of the resulting policies), even in the absence of an explicit visualization of the optimization landscape.
4. We have indeed already provided them through Figure 1 in the rebuttal pdf and the response to reviewer Voa2, where we describe the protocol we used to quantify our findings on interpolation between policies. In general, our results demonstrate large and statistically significant differences in the number of return collapses when interpolating between policies from the same vs. different runs, across environments. We believe that the inclusion of some anecdotal results has illustrative value, but we agree that the conclusions here are best supported by rigorous statistical comparisons.
5. While we agree that aggregate metrics are a commonly-used tool for making comparisons, we strongly believe that results showing the behavior of individual policies or training runs can be valuable.
To be precise, in our context, we stated that more seeds would "cause the scatter plot in Figure 2 to be populated by more points." This figure, in particular, is an intentionally disaggregated result which illustrates that the post-update return distributions of policies obtained by three popular algorithms exhibit diverse profiles and statistics. We hold that the 20 seeds per algorithm shown are enough to support this claim.
During the rebuttal period, we presented several new results aggregated over multiple seeds and environments, along with appropriate confidence intervals. These include: first, providing CIs for the statistics of the post-update return distribution (see pdf Figure 6), showing they are very small; second, quantifying the presence of return drops when interpolating across pairs of policies (see pdf Figure 1 and response to Voa2); third, demonstrating the benefit provided by Algorithm 1 across environments is statistically significant (see pdf Figure 8); fourth, measuring the proportion of rejections from Algorithm 1 (see pdf Figure 7). In all cases, the computed confidence intervals demonstrate that we have used enough independent runs to provide statistically significant comparisons, refuting the proposition that additional seeds would benefit our work.
6. We acknowledge the reviewer's suggestion. In order to accommodate the several new results from the rebuttal, we will condense lines 126-140, decrease the size of Figure 1 and, if needed, remove Figure 4.
## Third response to ErLG
We are glad to hear the suggestions from the reviewer. However, we disagree that every aspect of the proposed changes would be useful for improving the paper. To respond to the reviewer's points:
- We still believe the use of the expression "noisy neighborhood" is appropriate for scientific communication reasons. The term is related to stochasticity as already discussed, and overall it is just a proper name we have chosen.
- When its meaning is ambiguous, we will substitute the term "stability" with appropriate expressions such as "post-update return variability".
- We will further highlight the limitations of the work in terms of comparisons, where applicable (e.g., it would be interesting to know whether our conclusions generalize to more environments and algorithms). However, note that our sentence "our conclusions here are best supported by rigorous statistical comparisons" referred to the statistical comparisons we already carried out for the rebuttal, which can be found in the rebuttal pdf. This might not be clear if the sentence is taken out of context.
- As stated in the previous response, we will apply the other changes and experiments added during the rebuttal period to the paper. We will also run more seeds for the ALE experiments to align them with the Brax/MuJoCo ones.
We hope these comments clarify our position on the issues highlighted by the reviewer.
# Reviewer QuAp
Thank you for your feedback!
> This paper would benefit from providing additional explanation and motivation regarding the significance of investigating the mapping between policy parameters and return. A more thorough exploration of why this particular landscape is crucial to the focus of the work is needed.
By investigating the return landscape with the set of tools we introduce in the paper, we can both advance our understanding of the policies traversed by policy optimization algorithms and improve their stability. Our work offers a new perspective on the previously-studied unstable learning dynamics of deep RL agents (e.g. Chan et al. 2020, Henderson et al. 2018), as well as a new way to look at the stability of a given policy in terms of its post-update return distribution, which we measure using statistics such as the left-tail probability and show to correspond to the qualitative robustness of an agent's behavior. Through this landscape-oriented view, we can improve a policy along that dimension of policy quality, as we propose to do in Algorithm 1. We are happy to revise the manuscript to emphasize the central role of the return landscape in the results obtained.
> How do those findings relate to the findings in this paper? (For reference, look at “Understanding the Evolution of Linear Regions in Deep Reinforcement Learning” from NeurIPS’22)
We appreciate the reference. Although this work is similar in spirit to ours, in that it aims to characterize policies beyond common evaluation metrics, it studies the complexity of the learned policy as a function of the input state, whereas we study the return in the environment as a function of the policy parameters. Naturally, these objects are related: we conjecture that a more complex policy may exhibit greater variability in returns under a single update to its parameters. We take this as an interesting direction for future work, and will cite this work in the final version of the paper.
> I personally find figure 2 hard to understand.
To clarify, Figure 2 presents several related objects. Each color in the legend corresponds to an algorithm, and each point corresponds to a policy produced by that algorithm. For each policy, we compute its post-update return distribution. In the three main scatter plots, we plot each policy according to different statistics of this distribution: skewness, standard deviation, and left-tail probability, each plotted against the mean. Therefore, each policy appears once in each scatter plot. We select 6 policies of interest, indicated by stars, and plot histograms of their post-update return distributions at the bottom to establish a correspondence between the statistics and the visual appearance of the distribution. We will consider whether it is possible to simplify the presentation of Figure 2 for the final paper.
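To make the quantities on the axes concrete, here is a small sketch of the statistics computed for each policy from samples of its post-update return distribution; `ltp_threshold` stands in for the low-return cutoff defined in the paper, which we do not restate here.

```python
import numpy as np
from scipy.stats import skew

def distribution_statistics(post_update_returns, ltp_threshold):
    """Per-policy summary statistics plotted in the scatter plots: mean,
    standard deviation, skewness, and left-tail probability (LTP)."""
    r = np.asarray(post_update_returns)
    return {"mean": r.mean(), "std": r.std(), "skewness": skew(r),
            "ltp": float(np.mean(r < ltp_threshold))}
```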
> What will the results look like among policies trained with different algorithms on the same task?
Figure 2 depicts precisely how policies obtained by different algorithms trained on the Brax `ant` task compare to each other. Similar figures are reproduced for other environments across Brax and DeepMind Control in the appendix.
> It would be invaluable if the same experiments would be repeated for policies trained with behavior cloning
We thank the reviewer for the compelling suggestion. We were especially curious to understand whether behavior cloning produces policies which occupy less noisy neighborhoods of the return landscape. To do so, we conducted a set of additional experiments on 4 Brax environments. The protocol was as follows: for each environment, we consider 10 independent training runs of TD3, and 5 policies distributed evenly throughout each run. For each of these policies, we train a new agent using behavior cloning for 1 million gradient steps on the data logged up until the collection time of the teacher policy, replacing the actions in the dataset with the actions of the teacher policy. We log 10 policies throughout each behavior cloning run. To compute the post-update return distribution for the policies obtained by behavior cloning, we use one additional gradient step on the MSE-based BC objective, and 1000 samples.
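As a rough sketch of the cloning step (the names, optimizer, and batching below are illustrative placeholders rather than our exact implementation), the key point is that logged states are kept while actions are relabeled with the deterministic teacher policy and the student is fit with an MSE objective:

```python
import torch

def behavior_clone(teacher_policy, logged_states, student_policy, optimizer,
                   num_steps=1_000_000, batch_size=256):
    """Fit the student policy to the teacher's actions on the logged states."""
    with torch.no_grad():
        # Relabel the dataset: discard logged actions, query the teacher instead.
        teacher_actions = teacher_policy(logged_states)
    for _ in range(num_steps):
        idx = torch.randint(0, logged_states.shape[0], (batch_size,))
        loss = torch.nn.functional.mse_loss(student_policy(logged_states[idx]),
                                            teacher_actions[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return student_policy
```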
In Figure 3 (left), we compare statistics of the post-update return distributions for all pairs of policies $(\pi^{BC}_{i, j}, \pi^{TD3}_i)$ where policy $\pi^{BC}_{i, j}$ is obtained by behavior cloning $\pi^{TD3}_i$. We compute the Pearson correlation coefficient of statistics of the post-update return distributions of these policies: between the mean of each pair, and between the LTPs of each pair. We find that the means are highly correlated and that the learned BC policies are comparable in performance to their teacher policy. We additionally show that correlation in the LTP is much more variable across environments -- in general, cloning a policy of high or low LTP does not always lead to a cloned policy of the same LTP.
But does BC produce policies that occupy less noisy neighborhoods of the return landscape, with correspondingly lower LTP overall? In Figure 3 (right), we show the mean LTP across policies and environments, along with its 95% bootstrapped confidence interval following the recommendations from *rliable* (Agarwal et al., 2021). Our results demonstrate that BC does not produce fundamentally more stable policies, as measured by the LTP.
# Reviewer Voa2
Thank you for your feedback!
> I suspect there is some way to quantify the claims in this section across a larger population of policy pairs, which I think would significantly strengthen this section.
We thank the reviewer for the suggestion. To quantify whether there is a statistically significant difference in the proportion of return collapses encountered when interpolating between policies, we use the following experimental design. For each environment, we sample a set of 500 pairs of policies from the same runs and a set of 500 pairs of policies from different runs. Then, we linearly interpolate between the policies in each pair, producing 100 intermediate policies, and randomly perturb them using Gaussian noise with standard deviation $0.0003$ to obtain an estimate of the mean of their (random) post-update return distribution. For each pair of policies, we measure how frequently the return collapses between the two endpoints, by counting how many times it drops below 10\% of the minimum return of the two original policies. We then average this _Below-Threshold Proportion_ across pairs, and across environments using rliable (Agarwal et al., 2021). Figure 1 (a) shows a proper quantification of the phenomenon: on Brax, there is on average almost no drop in return when interpolating between policies from the same run. Additionally, similar results on Atari games in Figure 1 (b) (see the response to ZEGo) show that the phenomenon also exists in other domains.
> I felt the experiments could be more comprehensive w.r.t environment diversity.
We provide additional results in the pdf using a set of games from the ALE (see General Response). We also have additional results on DeepMind Control Suite in the Appendix.
> Or perhaps I'm misunderstanding, and Algorithm 1 is meant to by applied after a training run in order to robustify an already learned policy?
This is correct -- the purpose of Algorithm 1 is to transport an existing good policy to a less noisy neighborhood of parameter space. Training an agent from scratch with the procedure in Algorithm 1 would slow down progress, especially at the beginning of training. The pipeline we study involves achieving a 'good' policy (with respect to its return) using existing RL techniques, and then 'fine-tuning' the resulting policy with Algorithm 1.
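To illustrate the intended usage, below is a hedged sketch of the rejection principle, not the paper's exact Algorithm 1: `propose_update` and `estimate_ltp` are hypothetical placeholders for the RL update step and a Monte Carlo LTP estimate from environment rollouts, and the acceptance test is a simplification.

```python
def finetune_with_rejection(theta, propose_update, estimate_ltp, num_steps=40):
    """Starting from an already-trained policy, keep a proposed update only if
    it does not worsen the estimated LTP of the post-update return distribution."""
    current_ltp = estimate_ltp(theta)
    for _ in range(num_steps):
        candidate = propose_update(theta)
        candidate_ltp = estimate_ltp(candidate)
        if candidate_ltp <= current_ltp:   # accept only non-worsening updates
            theta, current_ltp = candidate, candidate_ltp
        # otherwise reject the update and keep the current parameters
    return theta
```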
> I think you should probably mention distributional RL [1] as related work.
Thanks for the suggestion, we will add that to the related work.
> In Figure 3, labeling the successful/failing trajectory either in the plot or in the caption would be nice.
Thank you for the suggestion, we will add this to the final version of the paper.
# Reviewer Lizq
Thank you for your feedback!
> "while Figure 5 demonstrates that we can smoothly interpolate between levels of stability along a single gradient direction"
We apologize for the confusion. This is a typo in the original version of our paper: the direction used to interpolate between policies is not a gradient direction, but the direction going from one parameter vector to the other.
> can we not only reject policy updates that create noisier neighborhoods but also leverage policy updates that result in more stable neighborhoods?
We appreciate the suggestion. We believe this is indeed an exciting avenue for future research, but in our experimentation we found this approach to be much less stable than the one presented in Algorithm 1. Moreover, we leverage Algorithm 1 to show that many, but not all, of the directions proposed by policy optimization algorithms during training are actually sensible directions of improvement in terms of LTP. We believe this observation could be useful for practitioners who wish to improve this class of algorithms.
> Therefore, a more thorough comparison or discussion with related work is necessary. For example, in the related work section, the authors have mentioned that their work is related to studies based on rejection/backtracking strategies
We appreciate the suggestion. We will expand our discussion in the related work section.
> I am particularly interested in whether distributional RL algorithms exhibit the issues mentioned in the paper. Given that distributional RL methods take into account reward distribution during policy optimization (albeit with significant differences from the post-update reward distribution studied in this paper), can these methods produce smoother reward landscapes?
This is an excellent question, and indeed, it is an approach that we explored extensively. We experimented with a distributional extension of TD3, in which the critic was replaced with a distributional critic. Additionally, we experimented with risk-averse policy updates using the distributional critic, reinforcing the Conditional Value at Risk (CVaR) of the return distribution rather than the mean return, with the hypothesis that risk-averse policies should navigate towards smoother neighborhoods. Unfortunately, our findings showed no significant improvement in the smoothness of the neighborhoods encountered by the distributional agents.
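For reference, the risk-averse objective we describe is based on the CVaR of the critic's return distribution; below is a minimal sketch of one common way to compute it from quantile estimates (e.g. a quantile-regression-style critic). Our exact estimator may have differed, so this is illustrative only.

```python
import numpy as np

def cvar_from_quantiles(quantile_values, alpha=0.1):
    """CVaR at level alpha: the mean of the lowest alpha-fraction of the
    critic's quantile estimates of the return distribution."""
    q = np.sort(np.asarray(quantile_values))
    k = max(1, int(np.ceil(alpha * q.size)))
    return q[:k].mean()
```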
We also compared the post-update return distributions of the policies visited by TD3 and distributional TD3, and observed that there is no substantial difference in neighborhoods reached by the distributional algorithms relative to TD3. We found that the distributional critic does not provide a reliable estimate of the post-update return distribution. Indeed, the distributional Bellman equation characterizing the return distributions in distributional RL does not model such an object explicitly.
> Can the authors analyze the limitations of Algorithm 1? In particular, are the main conclusions and Algorithm 1 still valid for policies updated using different methods (beyond PPO, SAC, and TD3)?
To understand the limitations of Algorithm 1 in an extreme case, we ran the same rejection procedure, but using random directions proposed by simple Gaussian perturbations of the policy parameters instead of the directions produced by policy optimization algorithms. In this extreme case, we found that Algorithm 1 is no longer an effective procedure for improving the LTP, suggesting that the algorithm might be sensitive to the presence of bad update directions.
Another limitation of the algorithm, which does not create particular problems for the evaluation setting used in our paper but might become problematic when sample efficiency is a priority, is the need to obtain rollouts from the environment to evaluate the post-update return distribution statistics and reject an update.
> On average, how many times will Algorithm 1 reject an update during a single policy parameter update in actual operation? Will the efficiency of the algorithm be severely affected if the number of rejected updates becomes too high?
Thanks for the question. We compute the statistics for the frequency of rejections across five seeds in Figure 7 in the pdf.
Indeed, the rate of improvement of the algorithm (with respect to the policy return) may be reduced when using Algorithm 1. However, Algorithm 1 is only meant as a procedure to improve an existing acceptable policy with respect to its LTP -- we are not looking for fast improvement at this stage. Having said that, as shown in Figure 6, the difference in mean return after 40 gradient steps of Algorithm 1 tends to be comparable to that of TD3.
# Reply to Reviewer Lizq
> Thank you for your detailed rebuttal, which has addressed most of my concerns regarding your manuscript. I appreciate the effort and time spent on explaining the key aspects of your work.
>
> Regarding the comparison and analysis of the proposed algorithm with existing algorithms, it would be beneficial for you to include a brief yet focused discussion in the rebuttal. If you can clearly highlight the differences and advantages of your algorithm compared to others in the field, it could potentially lead to a higher evaluation score.
>
> Incorporating this information in your rebuttal would not only strengthen the manuscript by providing better context for readers but will also aid in emphasizing the novelty and significance of your work. I look forward to your response on this particular aspect.
### Response draft
Thank you for your response and for providing us the opportunity to clarify the relationship between our work and related work. We agree that adding further discussion about the distinctions between Algorithm 1 and existing algorithms in the literature will make both Algorithm 1 and the analytical tools introduced in the paper clearer.
Algorithm 1 is fundamentally based on the principle of rejection sampling. In the context of reinforcement learning, previous work made use of rejection sampling in the action space, with respect to a critic, in order to choose safe actions [4, 5]. Instead, our Algorithm 1 performs rejection in the space of policy parameters, in a way related to success-story algorithms [7], in which updates to the policy are discarded if they lead to undesirable return levels.
The main distinction between Algorithm 1 and previous algorithms based on rejecting policy updates is that Algorithm 1 explicitly guides policies away from unstable regions in the parameter space. The most closely related prior work based on selective improvements to policies is the EVEREST algorithm of [1]. Their approach is significantly different, in that it works by always updating a training policy, but only evaluating a specific subset of policies encountered during training. The goal of EVEREST is to obtain monotonically increasing evaluation returns, but it does not consider the robustness of the obtained policies to further updates. Our algorithm functions by rejecting updates according to statistics of the post-update return distribution, and therefore produces policies which are stable under further training and exhibit qualitatively more robust behaviors.
Moreover, rather than rejection sampling, one might consider finding stable policies by optimizing the policy against an estimate of the post-update return distribution online, for example using the techniques of distributional RL [9]. In our setting, approaches for distributional risk-sensitive control such as WCSAC or SDAC [3, 8] could be appropriate for this task. However, these methods do not consider the return landscape, and therefore do not produce policies which are robust under updates. In our experiments, we found that estimation of the post-update return distribution with distributional RL techniques is quite difficult, and that the resulting algorithms did not reliably produce policies of improved LTP.
> [Feel free to remove]
Another benefit of Algorithm 1 is that it does not depend on any risk-sensitive algorithm design specified *a priori*. For example, an approach based on distributional RL would require the implementation of a distributional critic and perhaps a choice of risk-sensitive objective to optimize. In contrast, we applied Algorithm 1 on top of TD3 policies that were trained with standard hyperparameters chosen before Algorithm 1 was conceived.
### Response draft 2
Thank you for your response and for providing us the opportunity to clarify the relationship between our work and related work. We agree that adding further discussion about the distinctions between Algorithm 1 and existing algorithms in the literature will strengthen the paper.
Our algorithm is best contextualized against two classes of existing methods: 1) policy gradient methods which optimize a risk criterion and 2) methods based on rejection sampling.
Methods in the first class mainly aim to improve the stability of a policy by optimizing against a distributional critic [7], such as WCSAC or SDAC [2, 6]. Importantly, these algorithms do not consider the return landscape: They estimate only the distribution of returns for a single policy parameter and therefore do not produce policies which are robust under updates. In our experiments, we found that estimation of the true post-update return distribution with distributional RL techniques is quite difficult, and that the resulting algorithms did not reliably produce policies of improved LTP. Therefore, an advantage of Algorithm 1 is that it does not depend on any risk-sensitive algorithm design specified *a priori*, since these approaches typically require a well-behaved distributional critic and risk-sensitive objective to optimize. In contrast, we applied Algorithm 1 on top of TD3 policies that were trained with standard hyperparameters.
Our algorithm is most closely related to works based on rejection sampling. In the context of reinforcement learning, previous work made use of rejection sampling in the action space, with respect to a critic, in order to choose safe actions [3, 4]. Instead, our Algorithm 1 performs rejection in the space of policy parameters, related to success-story algorithms [5], in which updates to the policy are discarded if they lead to worse average returns. A key related work based on selective improvements to policies is the EVEREST algorithm of [1]. Our Algorithm 1 admits several advantages when compared to their procedure: First, EVEREST is based on rejecting policies which are not likely to improve the return, but it does not produce policies which are inherently robust to further updates. Second, EVEREST performs rejection at a frequency on the order of once per thousands of updates; our algorithm is instead able to significantly improve the LTP over a small number of update steps (40).
[1] Khanna, Pranav, et al. "Never Worse, Mostly Better: Stable Policy Improvement in Deep Reinforcement Learning." arXiv preprint arXiv:1910.01062 (2019).
[2] Yang, Qisong, et al. "WCSAC: Worst-case soft actor critic for safety-constrained reinforcement learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 12. 2021.
[3] Bharadhwaj, Homanga, et al. "Conservative Safety Critics for Exploration." International Conference on Learning Representations. 2021.
[4] Srinivasan, Krishnan, et al. "Learning to be safe: Deep RL with a safety critic." arXiv preprint arXiv:2010.14603 (2020).
[5] Schmidhuber, Jürgen, Jieyu Zhao, and Marco Wiering. "Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement." Machine Learning 28 (1997): 105-130.
[6] Kim, Dohyeong, Kyungjae Lee, and Songhwai Oh. "Efficient Trust Region-Based Safe Reinforcement Learning with Low-Bias Distributional Actor-Critic." arXiv preprint arXiv:2301.10923 (2023).
[7] Bellemare, Marc G., Will Dabney, and Rémi Munos. "A distributional perspective on reinforcement learning." International conference on machine learning. PMLR, 2017.
### old stuff / notes
* Approaches based on rejection sampling (in action space)
* Approaches based on rejection sampling in policy space
Algorithm 1 is fundamentally based on the principle of rejection sampling. In reinforcement learning, some approaches make use of rejection sampling in the action space [4, 5, 6]. [6] performs rejection sampling on the outputs of a language model to obtain preference data used for RLHF training.
The closest approach to ours is to perform rejection sampling in policy space [1].
Answer: Why is it better to select a policy based on LTP, rather than just its return?
* alg 1 is meant to apply to an existing policy (vs. "online" methods)
| Alg. 1 | EVEREST |
| -------- | -------- |
| PURD | Stochastic rollouts of $\pi_\theta$ |
| "Offline" | "Online" |
| Updates policy | Updates "target network" (I think this is just the policy) |
| Should produce policies which are robust to updates| Should not guarantee policy is robust to future updates |
| Policies produced by Algorithm 1 do not achieve any bad returns after an update | Evaluation procedure: they do not smooth the curves or average over seeds, but do they average evaluation returns from many rollouts for each point on the curve? "As we analyze the behavior of the training process, we need to ensure that any observed performance degradation is due to the policy optimization and not due to the sampling techniques. Hence, throughout training, each evaluation is performed over 100 episodes." |
Existing algorithms, such as EVEREST [1] and variants of Safe Policy Iteration [2], employ "safe policy updates" where safety is defined with respect to statistics of the return for a fixed policy parameter, and *not* with respect to the return landscape of the policy parameter. As such, these algorithms have no mechanism for disfavoring policies in noisier neighborhoods on the return landscape, which can for instance lead to less robust gaits as argued in Section 3.2.
In contrast, we showed that by rejecting policy updates according to samples from the post-update return distribution, we can reliably guide policies to less noisy landscapes with substantially lower LTP.
Another benefit of Algorithm 1 relative to risk-sensitive control algorithms is that it can easily be applied on top of any RL algorithm of choice -- for instance, in our experiments, we used plain TD3 to learn good policies with respect to their mean return, and then employed Algorithm 1 to quickly reduce the LTP of the TD3 policies. We could have done the same with any policy optimization algorithm, and the procedure does not depend at all on the algorithm or architecture that was used to learn the starting policy. **(it's risky to say this because we don't actually know that this works)**
> [name=Harley] True, this is a good point. My intention when writing this was more about the fact that we were able to take an existing policy and implement our procedure without any architectural changes. Do you think there is any way to mention this without over-promising?