# Thanks to R4 for raising their score
Thank you for the update!
The minor issue with Definition 5 is now fixed.
>I believe that the paper could benefit from including some theoretical justifications to the loss function and data collection scheme (though I completely understand the difficulty of theoretically justify deep RL algorithm). For example:
> Assuming all the loss functions can be optimized to optimal, will the policy converge to optimal or near-optimal solutions?
(TODO @ Simon do you agree?) Yes, this is a corollary of Proposition 1, inherited from the properties of the SAC losses. In fact, it is probably possible to show from Proposition 1 that the policy converges to the optimum faster than it does under SAC.
> Assuming the value net can be optimized to optimal, how the resampling process change the gradient of policy net?
(TODO) In Proposition 1, we initially wanted to prove that the gradient of the loss is better in DCAC than in SAC, but we ended up showing that the loss itself is better. This is not quite equivalent, but it is still a step in that direction.
> In which case would the on-policy sample with truncated trajectory (i.e., the value function computed by Eq. (8) where the length of the trajectory is $n$) out-perform off-policy sample with full trajectory (i.e. SAC without resampling)? If I understand correctly, without resampling the error of the value net suffers from the amplification caused by distribution mismatch (which is potentially exponential?).
(TODO) SAC using a full, non-resampled off-policy trajectory with no further precaution would be biased (indeed because of the distribution mismatch between $p^\mu_n(\cdot|x_0)$ and $p^\pi_n(\cdot|x_0)$). One possibility would be to use methods such as Importance Sampling or Retrace to obtain unbiased multi-step estimates, but these suffer from issues that do not affect the partial resampling operator (e.g. variance explosion). Of course, partial resampling is only possible as long as the condition of Theorem 1 holds, so it is not applicable to full trajectories. To use longer trajectories, one should probably adopt a hybrid approach: partial resampling as long as Theorem 1 holds, then Importance Sampling-like methods for the remaining part of the trajectory.
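To make this hybrid idea concrete, here is a minimal Python sketch. It is purely illustrative and not part of the submission: the function name, the delay arrays and the action probabilities are all hypothetical. It determines the longest prefix of a fragment on which the condition of Theorem 1 holds, and applies plain importance-sampling ratios to the remaining steps; the growing product of ratios over the suffix is exactly where the variance explosion mentioned above would come from.

```python
import numpy as np

def hybrid_n_step_weights(alpha, omega, pi_prob, mu_prob):
    """Illustrative only (hypothetical helper, not the DCAC implementation).
    Splits an off-policy fragment of length n into a resampled prefix, where
    alpha_t + omega_t >= t (cf. Theorem 1), and an importance-sampled suffix.

    alpha, omega     : per-step delays of the fragment, shape (n,)
    pi_prob, mu_prob : probabilities of the logged actions under the current
                       policy pi and the behaviour policy mu, shape (n,)
    """
    n = len(alpha)
    # Longest prefix on which the Theorem 1 condition holds (t is 1-indexed).
    prefix = 0
    for t in range(1, n + 1):
        if alpha[t - 1] + omega[t - 1] < t:
            break
        prefix = t
    # Plain importance-sampling correction for the remaining steps; the
    # cumulative product of ratios is what can make the variance explode.
    ratios = pi_prob[prefix:] / mu_prob[prefix:]
    is_weights = np.cumprod(ratios)
    return prefix, is_weights

# Toy usage with made-up numbers.
alpha = np.array([1, 0, 2, 0, 1])
omega = np.array([1, 2, 0, 1, 0])
pi_prob = np.array([0.4, 0.5, 0.3, 0.6, 0.2])
mu_prob = np.array([0.5, 0.4, 0.3, 0.5, 0.4])
prefix, w = hybrid_n_step_weights(alpha, omega, pi_prob, mu_prob)
print(prefix, w)  # resample the first `prefix` steps, weight the rest by `w`
```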
> And with resampling, would the error of value net come from the truncation?
(TODO)
# Answer to R1
Under constant conditions, the resampling always happens over the full length of the action-buffer (because under constant conditions the total delay is always $K$).
Under non-constant conditions, the length of the resampling depends on the trajectory fragment sampled from the replay memory.
The average resampling length depends on the delay distributions, but even with uniform distributions (cf. Appendix) it is often fairly long.
This is because the condition of Theorem 1 starts very loose (for the first action of the buffer it is $\alpha_1 + \omega_1 \geq 1$, which is trivially satisfied) and tightens as we advance in the trajectory fragment (for the last action of the buffer it is $\alpha_K + \omega_K \geq K$, i.e. $\alpha_K + \omega_K = K$).
In Theorem 1, $t$ is not really a free variable; it is an index ranging from $1$ to $n$ within a given trajectory fragment $\tau^\*_n$ (the condition must hold for each $\alpha^\*_t, \omega^\*_t$ in $\tau^\*_n$ for the theorem to apply to $\tau^\*_n$). We have updated the submission to define $t$ explicitly in Theorem 1.
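For illustration only, here is a short sketch of how the valid resampling length of a fragment can be computed from the condition of Theorem 1, together with a rough Monte Carlo estimate of its average under random delays. The delay-sampling scheme below is made up (it only enforces $\alpha_t + \omega_t \leq K$) and is not the uniform scheme of the Appendix; the value of $K$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8  # action-buffer length; arbitrary value for this sketch

def resampling_length(alpha, omega):
    """Longest prefix of a fragment over which alpha_t + omega_t >= t holds for
    every t (1-indexed), i.e. one way to read the condition of Theorem 1, which
    is always satisfied at t = 1 and tightens as t grows."""
    length = 0
    for t, (a, o) in enumerate(zip(alpha, omega), start=1):
        if a + o < t:
            break
        length = t
    return length

# Rough Monte Carlo estimate of the average resampling length under random
# delays (made-up sampling scheme, only enforcing alpha_t + omega_t <= K).
lengths = []
for _ in range(10_000):
    omega = rng.integers(0, K + 1, size=K)
    alpha = np.array([rng.integers(0, K - o + 1) for o in omega])
    lengths.append(resampling_length(alpha, omega))
print(np.mean(lengths))  # often a sizeable fraction of K
```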
# Authors' conclusion
As the discussion phase comes to an end, we want to thank the reviewers again for their involvement and interesting feedback.
In particular, the discussion led to changes in several parts of the text that make the paper easier to follow, to relevant additions to the related work section, and to the addition of Figures 4 and 5 (which the reviewers found informative and which we believe will be helpful to new readers).
The discussion with R1 and R4 was particularly constructive. We thank them for their engagement and humbly believe that all of their concerns have now been addressed.
R3 had no difficulty understanding the paper, and we thank them for their straightforward review. We hope our response adequately addressed their few concerns.
We regret that R2 did not take part in the discussion with us. We hope that the changes we have made to the paper will help them better appreciate our contribution in the post-rebuttal phase.