# DPO Rebuttal
## Reviewer KEqn
**Q1** Is it possible that the optimal learning mechanism (i.e., the optimal drift function) is different for every environment?
> **A1** It is definitely possible. However, meta-learning becomes practical when a mechanism that is meta-learnt in some domains transfers to unseen ones. In the most desirable, extreme scenario, it suffices to meta-learn a mechanism in one environment only. Our LPO achieves this goal---it is trained on Ant, yet performs well in all environments.
**Q2** Why does your drift function use the inputs defined in line 180?
> **A2** Firstly, we wanted to use inputs that are invariant across all possible environments and policy designs (discrete/continuous). A drift function with such inputs, once meta-learnt on one task, can be deployed on others. The policy ratio can always be obtained when stochastic policies are considered, and the normalised advantage function commonly appears in deep-RL implementations. Information like the *state* can vary from one environment to another, so we did not include it as an input.
> Secondly, the input vector is constructed so that it is zero when the old and candidate policies are the same, which helps us satisfy drift condition 1.
> Thirdly, we used different non-linear transformations of the ratio and advantage to help the drift function learn more complex mappings (a brief sketch is given below).
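> For illustration, a minimal sketch of how such an input vector can be assembled is given below; the specific features and names are hypothetical placeholders rather than the exact transformations defined in line 180.
> ```python
> # Hypothetical sketch of drift-function inputs built from the policy ratio r and
> # the normalised advantage A. Every feature below vanishes at r = 1, so the input
> # vector is zero whenever the candidate policy equals the old one (condition 1).
> # The concrete transformations used in the paper are those defined in line 180.
> import jax.numpy as jnp
>
> def drift_inputs(ratio, adv):
>     x = ratio - 1.0
>     log_r = jnp.log(ratio)
>     return jnp.stack([
>         x,              # linear term in the ratio
>         x ** 2,         # quadratic term in the ratio
>         x * adv,        # ratio-advantage interaction
>         log_r,          # log-ratio, also zero at r = 1
>         log_r ** 2,
>         log_r * adv,
>     ], axis=-1)
> ```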
## Reviewer H27p
**Q1** If the inputs to the drift function from line 180 are changed, does the resulting algorithm still work well?
> **A1** Our choice of the input arguments was deliberate. The choice of invariant quantities---the policy ratio $r$ and the (normalised) advantage function $A_{\pi}$---enabled us to learn a drift function that is applicable to multiple environments. The nonlinear mappings of these (line 180) helped us learn more complex features. We considered increasing the maximal degree of the polynomial in $r$ (from quadratic to cubic), but it did not bring any improvement. Lastly, finding alternative, invariant inputs to the drift function is an interesting avenue of future research.
**Q2** Is the presence of $\bar{\pi}$ in line 137 a typo?
> **A2** Yes, it is. Thank you for pointing it out.
**Q3** The variable $r$ is used for both the reward and the policy ratio. That is confusing.
> **A3** We will change the reward notation to $R$.
**Q4** I suggest moving the training of LPO-Zero to the appendix. It is not an important part of the paper.
> **A4** **Here CHRIS**
>## Response to Response
>> **Followup for A1** You should run and include experiments confirming your intuition about other inputs not making much difference.
>> **Our Answer** As we alluded to in A1, we currently do not know of inputs, other than the ratio $r$ and the advantage $A_{\pi}$, that would guarantee invariance across environments. We have already tried polynomials in $r$ of higher order, but they did not bring improvements. This suggests that other smooth functions would not bring much improvement either, as such functions can be approximated by polynomials by Taylor's theorem (see the expansion below). We are running some experiments to confirm this.
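>> To make this reasoning explicit, any smooth candidate input $g(r)$ can be expanded around $r=1$, the point at which the candidate and old policies coincide:
>> $$
>> g(r) = g(1) + g'(1)\,(r-1) + \tfrac{1}{2}\,g''(1)\,(r-1)^2 + \mathcal{O}\big((r-1)^3\big),
>> $$
>> so, locally, it is already captured by the polynomial features in $(r-1)$ that we have evaluated.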
>>**Followup for A4** You can move the results on LPO-Zero to the appendix.
>>**Our Answer** Another reason why we would like to keep it in the main text is that its objective is similar to that of LPO (see Figures 2 & 3). We think it is an important finding, showing that our meta-learning method indeed leads to the discovery of RL concepts.
>>**More Experiments** To verify the robustness and improvement of LPO/DPO with respect to PPO, you should run experiments in other domains where PPO is known to work. Also, what do you mean when you say that you struggle to make Brax's PPO work?
>>**Our Answer** Following the reviewer's advice, we implemented DPO in PyTorch and tested it in MuJoCo. Unfortunately, it did not work that well (at least with PPO hyperparameters). Although it is possible that more extensive hyperparameter tuning would improve the performance, it is likely that in the meta-learning process LPO has learnt an algorithm (and discovered DPO) that is optimal for the specific implementation and testing frameworks, just as PPO tuned for Brax works poorly in MuJoCo. We chose JAX and Brax because they enabled us to train multiple inner-loop agents quickly and in parallel [Freeman et al., 2021] (a sketch of this parallelism is given below). Once other development tools, like PyTorch and MuJoCo, allow for it, our method can easily be applied to them.
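>> For reference, the kind of parallelism we rely on is sketched below; the function and variable names are hypothetical, and the loss is a toy stand-in for an inner-loop agent update.
>> ```python
>> # Minimal sketch (hypothetical names, toy loss) of why JAX matters for us: a
>> # single-agent update can be vectorised over a whole population of inner-loop
>> # agents with jax.vmap and compiled with jax.jit, so all agents step in parallel.
>> import jax
>> import jax.numpy as jnp
>>
>> def inner_update(params, batch, lr=1e-3):
>>     """One gradient step on a toy quadratic loss, standing in for an RL update."""
>>     loss_fn = lambda p: jnp.mean((batch @ p) ** 2)
>>     grads = jax.grad(loss_fn)(params)
>>     return params - lr * grads
>>
>> # Vectorise over a population of agents; each has its own parameters and data.
>> parallel_update = jax.jit(jax.vmap(inner_update, in_axes=(0, 0)))
>>
>> key = jax.random.PRNGKey(0)
>> params = jax.random.normal(key, (32, 8))       # 32 inner-loop agents, 8 params each
>> data = jax.random.normal(key, (32, 128, 8))    # one batch of toy data per agent
>> params = parallel_update(params, data)         # every agent updated in one call
>> ```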
[Freeman et al., 2021] C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, Olivier Bachem; "Brax -- A Differentiable Physics Engine for Large Scale Rigid Body Simulation". 2021.
## Reviewer XLZN
**Q1** It would be interesting to visualise and analyse what the drift function $\tilde{f}_{\phi}$ converges to, and to contrast it with PPO.
> **A1** We believe that this question is already addressed in the paper. Although we did not explicitly visualise $\tilde{f}_{\phi}$, we visualised the optimisation objective of LPO, which involves both the advantage and the converged drift function (see Figure 3). Specifically, we plot its derivative with respect to the policy ratio, which explicitly shows how the objective affects learning (a sketch of how such a plot is produced is given below). We believe that such a plot is more informative than a naive plot of the objective, as the latter would potentially involve "noisy" components that are independent of the policy variable. The same visualisation was conducted for PPO (Figure 1), whose objective was contrasted with LPO's in Section 6.
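> For concreteness, a plot of this kind can be produced as in the sketch below; the PPO-clip surrogate is used as the example objective, and the names and grid ranges are arbitrary illustrations.
> ```python
> # Illustrative sketch: evaluating d(objective)/d(ratio) on a grid of
> # (ratio, advantage) pairs, i.e. the quantity visualised in Figures 1 and 3.
> # The PPO-clip surrogate is used as an example; grid ranges are arbitrary.
> import jax
> import jax.numpy as jnp
>
> def ppo_clip_objective(ratio, adv, eps=0.2):
>     return jnp.minimum(ratio * adv, jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
>
> # Derivative of the objective with respect to the policy ratio.
> dobj_dratio = jax.grad(ppo_clip_objective, argnums=0)
>
> ratios = jnp.linspace(0.5, 1.5, 101)
> advantages = jnp.linspace(-1.0, 1.0, 101)
> grid = jax.vmap(lambda a: jax.vmap(lambda r: dobj_dratio(r, a))(ratios))(advantages)
> # `grid` is a (101, 101) array that can be rendered as a heat map.
> ```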
**Q2** How would a fine-tuned PPO perform in comparison to a fine-tuned DPO?
> **A2** As we have not fine-tuned DPO, we cannot answer this question directly. However, since we used Brax's fine-tuned PPO hyperparameters, simply adopted them for DPO, and obtained better performance, we expect that a fine-tuned DPO would perform better still.
**Q3** Can the authors add the corresponding PPO heat map to Figure 5?
> **A3** That heat map has already been plotted in Figure 1. However, if the reviewer feels that duplicating it next to Figure 5 would help, we can add it.
**Q4** The inputs to the drift function of LPO are cherry-picked.
> **A4** We chose the policy ratio and the (normalised) advantage function as inputs because they are invariant across environments. Indeed, we trained such a drift function on Ant only, and transferred it to other environments. The nonlinear transformations in line 180, which we chose quite arbitrarily, were meant to help the drift function learn complicated mappings. Designing other, transferable inputs to the drift function is an interesting avenue of future research.
**Q5** Why does DPO not use $(r-1)^n$ as an input?
> **A5** DPO is no longer a meta-trainable algorithm (like LPO). It is, instead, an analytical approximation of LPO. The functions that it uses (line 302) were chosen to reproduce the LPO optimisation objective; mappings like $(r-1)^n$ were not necessary.
**Q6** Can you provide some studies on how different values of $\alpha$ and $\beta$ affect the performance of DPO?
> **A6** Although we write the variables $\alpha$ and $\beta$ symbolically for clarity, it is important that they take the values $\alpha=2$ and $\beta=0.6$ to reproduce LPO accurately. We do not consider them hyperparameters to tune (at least, not in this paper), but rather elements of our analytical model (an illustrative sketch is given below).
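> For illustration only, the sketch below shows where $\alpha$ and $\beta$ enter a drift function of this kind; the functional form is a hypothetical stand-in written for this response, and the actual expressions are those given in line 302 of the paper.
> ```python
> # Hypothetical sketch of an analytical drift parameterised by alpha and beta.
> # The form below is illustrative only; the actual DPO drift is defined in
> # line 302 of the paper. alpha = 2.0 and beta = 0.6 are fixed constants of the
> # analytical model, not hyperparameters that we tune.
> import jax.numpy as jnp
>
> def analytical_drift(ratio, adv, alpha=2.0, beta=0.6):
>     relu = lambda x: jnp.maximum(x, 0.0)
>     pos = relu((ratio - 1.0) * adv - alpha * jnp.tanh((ratio - 1.0) * adv / alpha))
>     neg = relu(jnp.log(ratio) * adv - beta * jnp.tanh(jnp.log(ratio) * adv / beta))
>     return jnp.where(adv >= 0.0, pos, neg)
> ```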
**Q7** Too few experiments were conducted to justify the generalisation claims about LPO. Can you run some experiments on Procgen?
>**A7** We believe that this is an unfair comment. We provided experiments on 8 tasks in total, with a remarkably high unseen:seen ratio ($7:1$). Other works usually use fewer---for example, [Kirsch et al., 2020] use $4$ tasks in total, with an unseen:seen ratio of only up to $2:1$. We will include more experiments on Procgen in the final version of the paper.
**Q8** Could you add a properly referenced appendix and include equation numbering?
>**A8** The lack of these is a clerical mistake on our behalf. We apologise and will address this concern in the paper's final version.