> **Comment:** I still have doubt on the claim, since linear dependency on $d$ is proven to be a lower bound in the linear bandit case: see Proposition 1 in [1]. The authors mentioned three works in Response to Reviewer YNwE that match their new update. But [A] depends on both $d_E$ and $d_K$, and [B][C] are from tabular settings, not the continuous setting. I doubt whether the claimed update is valid given the existing lower bound in [1].
[1] Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent bellman error. arXiv preprint arXiv:2003.00153, 2020.
[A] Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014.
[B] Osband, I., Van Roy, B., and Russo, D. (More) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3003–3011, USA, 2013. Curran Associates Inc.
[C] Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pages 2701–2710. PMLR, 2017.
**Response:** We thank the reviewer for the comment. We remark that we are not aware of any lower bound specific to model-based RL in continuous spaces. Moreover, we believe the lower bound provided in [A] (the reviewer's [1]) holds for the frequentist setting rather than the Bayesian setting that we consider in our work.
We agree that the absence of a linear dependence on $d$ may look surprising and may appear to contradict the lower bound. However, we would like to remark that the applicable lower bound depends on the problem structure and the assumptions under which we operate. For instance, [B] obtains a regret bound with a $\sqrt{d}$ term, and we would also refer the reviewer to Table 1 in [C], where lower bounds for several bandit and RL settings with $\sqrt{d}$ dependence are listed.
In this work, we operate in the Stein class of kernels and utilize the kernelized Stein discrepancy (KSD) upper bounds from [D, E] (which also carry no explicit dependence on $d$) to obtain our regret results. We remark that our analysis is entirely new for the RL setting and does not follow the standard analysis of existing model-based RL methods for upper bounding the distance between the estimated and true posteriors.
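For the reviewer's convenience, we briefly recall the discrepancy we rely on, following [D]; the symbols $q$, $p$, $k$, $s_p$, and $u_p$ below are the notation of [D] used for this recap, not symbols from our submission. For a target density $p$ with score $s_p(x) = \nabla_x \log p(x)$ and a kernel $k$ in the Stein class of $p$, the (squared) KSD between a candidate density $q$ and $p$ can be written as
$$
\mathrm{KSD}(q \,\|\, p) \;=\; \mathbb{E}_{x, x' \sim q}\!\left[u_p(x, x')\right],
$$
$$
u_p(x, x') \;=\; s_p(x)^\top k(x, x')\, s_p(x') \;+\; s_p(x)^\top \nabla_{x'} k(x, x') \;+\; \nabla_{x} k(x, x')^\top s_p(x') \;+\; \operatorname{tr}\!\big(\nabla_x \nabla_{x'} k(x, x')\big).
$$
Note that $u_p$ involves the target only through its score function and the kernel derivatives; this is the sense in which the bounds from [D, E] carry no explicit $d$ factor, as noted above.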
We are not claiming that our result holds in the general setting: it is restricted to the Stein class of kernels with target densities satisfying specific assumptions (stated in Lemmas 4.1 and 4.2 and Theorem 4.3), and our improved results are attributable to these specific conditions, which at least cover the Gaussian target (hence our comparison to [F]). It is this restriction to the Stein class of transition models that permits us to derive the improved dependence on the input dimension. We apologize for insufficiently emphasizing this point in our analysis, and we will make sure to do so there as well as in the abstract, introduction, and summary of contributions. We believe properly explaining this point will substantially strengthen the manuscript.
To draw an analogy in which a similar lower bound can be circumvented, consider online convex optimization [G]. For general convex costs the regret lower bound is $\Omega(\sqrt{T})$, yet the regret can be improved to $\mathcal{O}(\log T)$ [G] when one imposes strong convexity on the instantaneous costs. Our Stein class of kernels plays a role analogous to the strongly convex setting here. In other words, we have identified a structured setting in which it is possible to obtain better bounds.
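Concretely, the instance we have in mind from [G], stated here informally (the symbols $f_t$, $H$, $G$, $\eta_t$, and $\mathcal{K}$ are illustrative notation for this analogy, not quantities from our submission): for $H$-strongly convex costs $f_t$ over a convex set $\mathcal{K}$ with gradients bounded by $G$, online gradient descent with step sizes $\eta_t = 1/(Ht)$ achieves
$$
\sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) \;\le\; \frac{G^2}{2H}\left(1 + \log T\right),
$$
whereas without strong convexity no algorithm can beat the $\Omega(\sqrt{T})$ rate. The additional structure, rather than any contradiction of the lower bound, is what enables the faster rate, and the same logic applies to our Stein-class assumption.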
**References:**
[A] Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020.
[B] Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In International Conference on Machine Learning. PMLR, 2017.
[C] Foster, D. J., Kakade, S. M., Qian, J., and Rakhlin, A. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
[D] Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning. PMLR, 2016.
[E] Chen, W. Y., Barp, A., Briol, F. X., Gorham, J., Girolami, M., Mackey, L., and Oates, C. Stein point Markov chain Monte Carlo. In International Conference on Machine Learning, pp. 1011–1021. PMLR, 2019.
[F] Fan, Y. and Ming, Y. Model-based reinforcement learning for continuous control with posterior sampling. In International Conference on Machine Learning. PMLR, 2021.
[G] Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.