Amrit Singh Bedi
> **Comment 1:** Thank you for your response. It is now clear to me that you do not need any assumption on the value functions. I would encourage you to rewrite the statement of your Lemma 1 and define all the terms properly so that no confusion arises. Now, seeing the argument that Theorem 2 in [B] holds under the Stein operator depending on the posterior distribution, I feel the posterior distribution should satisfy some specific assumptions (with relation to the Stein set) for the result to hold. Can the authors comment on that and give a justification why it should hold?

**Reply:** We thank the reviewer for the valuable comments. We are extremely glad that we are able to resolve the **primary concern** of the reviewer regarding the requirement of $\mathbb{E}_{P^*}[V(\cdot)]=0$ (Weaknesses 1, 2, 3, 5). As suggested by the reviewer, we revise the assumptions of Lemma 4.1. For instance, the target measure not only has to be smooth (as required in [B]), it also has to be strongly log-concave, which means that $\log(P^*)$ has to be strongly concave with bounded and continuous third and fourth derivatives (following [A]). We modify the statement of Lemma 4.1 to include these assumptions as follows.

---

***Lemma 4.1** (Lipschitz in Kernel Stein Discrepancy). Recall the definition of the future value function in (3). Under the assumptions that (a) the posterior distributions are smooth, and (b) strongly log-concave with bounded and continuous third and fourth derivatives, the future value function estimation error of $P^k$ with respect to $P^*$ is upper bounded in terms of the KSD of $P^k$ as*
$$
U_i^k(P^k(h_i)) - U_i^k(P^*(h_i)) \leq H R_{\max}\,\|P^{k}(\cdot\mid h_i) - P^*(\cdot\mid h_i)\|,
$$
*where $H$ is the horizon length and $R_{\max}$ is the maximum reward.* (A short numerical illustration of assumption (b) follows the references below.)

---

**Justification of Assumptions:** This is a good point. In this work, our focus is to develop a mathematical approach for applying Stein's method in the model-based RL setting (to the best of our knowledge, this is the first work in that direction). The smoothness assumption on the transition dynamics is common in the model-based RL literature and has been widely used together with Lipschitzness and/or Gaussian assumptions on the dynamics [C, D, E, F], which are much more restrictive. Observe that our technical setting permits the transition density to belong to a family of mixture models, which contains the Gaussian as a special case. Moreover, it is important to note that the sufficient conditions of Theorem 2 in [A] are certainly not necessary for lower bounding the classical Stein discrepancy and can be generalized to other large classes of multivariate target distributions with some modifications, as mentioned in [A]. Additionally, as highlighted by the reviewer in the summary, our empirical evaluations strongly support the claims detailed above.

**References:**
[A] Jackson Gorham and Lester Mackey. Measuring sample quality with Stein's method. Advances in Neural Information Processing Systems, 28, 2015.
[B] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284. PMLR, 2016.
[C] Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014.
[D] Fan, Ying, and Yifei Ming. "Model-based Reinforcement Learning for Continuous Control with Posterior Sampling." International Conference on Machine Learning. PMLR, 2021.
[E] Sayak Ray Chowdhury and Aditya Gopalan. Online learning in kernelized Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3197–3205, 2019.
[F] Asadi, K., Misra, D., and Littman, M. Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning, 2018.
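For concreteness, here is a small numerical illustration of assumption (b) (our own sketch, not code from the paper) for a Gaussian target $\mathcal{N}(\mu,\sigma^2 I)$: the Hessian of $\log P$ is $-\sigma^{-2}I$, so $\log P$ is strongly concave with parameter $1/\sigma^2$, and its third and fourth derivatives vanish, hence are trivially bounded and continuous.

```python
import numpy as np

def log_p(x, mu, sigma2):
    """Unnormalized Gaussian log-density: log P(x) = -||x - mu||^2 / (2 sigma2) + const."""
    return -np.sum((x - mu) ** 2) / (2 * sigma2)

def numerical_hessian(f, x, eps=1e-3):
    """Central finite-difference Hessian of a scalar function f at x."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

d, sigma2 = 3, 0.5
mu, x = np.zeros(d), np.array([0.3, -1.2, 0.7])
H = numerical_hessian(lambda z: log_p(z, mu, sigma2), x)
print(np.linalg.eigvalsh(H))  # all approximately -1/sigma2 = -2.0 < 0,
                              # i.e. log P is strongly concave (strongly log-concave target)
```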
---

> **Comment 2:** I beg to differ with the author that hardness results for bandits only apply to the frequentist setting. In fact the lower bounds proved in both [1] and [2] are for the Bayesian setting (and hence also apply to the frequentist setting). To sum up, be it the Bayesian setting or the frequentist setting, the problem complexity should show up in any estimation procedure.
> [1] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Rocco Servedio and Tong Zhang, editors, Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 355–366, 2008.
> [2] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

**Reply:** Thank you for this insight. After thinking about it carefully, we agree that this was missing from our result, and we apologize for that. Taking a careful look, it turned out to be a quick fix in the analysis, and we obtain the following regret bound, which shows the dependence on the dimension $d$:
$$
\mathbb{E}[\text{Regret}_T]=\mathcal{O}\!\left(\sqrt{d}\cdot T^{1-(\alpha/2)}H^{1+(\alpha/2)}\right),
$$
where $\alpha\in[0,1]$. For $\alpha=1$, we obtain $\mathbb{E}[\text{Regret}_T]=\mathcal{O}(\sqrt{d}\cdot T^{1/2}H^{3/2})$, which matches the lower bound dependence in terms of the dimension $d$ and $T$. The source of this issue was an absolute-bound assumption on the Stein kernel, $\sup_{x,x'\in\mathcal{X}}|k_0(x,x')| \leq B<\infty$, which is in fact of order $\mathcal{O}(d)$ and not a constant.

**A simple example:** To see this, recall the definition of the Stein kernel,
$$
k_0(x,x')=S_{P}(x)^{\top}\kappa(x,x')S_{P}(x')+S_{P}(x)^{\top}\nabla_{x'}\kappa(x,x')+\nabla_{x}\kappa(x,x')^{\top}S_{P}(x')+\text{trace}(\nabla_{x,x'}\kappa(x,x')).
$$
Even for a linear kernel $\kappa(x,y)=x^{\top}y$, a compact space $\mathcal{X}$ such that $\|x\|\leq U$ for all $x\in\mathcal{X}$, and a Gaussian target distribution (so that $P(x)\propto\exp(-\|x-\mu\|^2/(2\sigma^2))$), we have
$$
|k_0(x,x')|\leq \|S_{P}(x)\|\cdot |\kappa(x,x')|\cdot \|S_{P}(x')\|+\|S_{P}(x)\|\cdot \|\nabla_{x'}\kappa(x,x')\|+\|\nabla_{x}\kappa(x,x')\|\cdot \|S_{P}(x')\|+|\text{trace}(\nabla_{x,x'}\kappa(x,x'))|,
$$
where we use the triangle and Cauchy–Schwarz inequalities. Since the space is compact, $|\kappa(x,x')|\leq U^2$, $\|\nabla_{x}\kappa(x,x')\|\leq U$, and $\|S_{P}(x)\|=\|\nabla \log P(x)\|\leq \mathcal{O}(U)$. Therefore, we can write the above inequality as
$$
|k_0(x,x')|\leq \mathcal{O}(U^4)+\mathcal{O}(U^2)+\mathcal{O}(U^2)+|\text{trace}(\nabla_{x,x'}\kappa(x,x'))|.
$$
The last term satisfies $|\text{trace}(\nabla_{x,x'}\kappa(x,x'))|=\mathcal{O}(d)$, since it is the trace of a $d\times d$ matrix. Therefore,
$$
|k_0(x,x')|\leq \mathcal{O}(U^4)+\mathcal{O}(U^2)+\mathcal{O}(U^2)+\mathcal{O}(d),
$$
and hence, in the simplest possible form,
$$
\sup_{x,x'\in\mathcal{X}}|k_0(x,x')|\leq \mathcal{O}(d),
$$
where we suppress the ($U$-dependent) constants. In the earlier version we were suppressing the $d$ dependence as well, which we have now made explicit.
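To make the scaling concrete, the following is a minimal numerical sketch (our illustration, not code from the paper) of the Stein kernel above for the linear kernel $\kappa(x,x')=x^{\top}x'$ and a Gaussian target $\mathcal{N}(\mu,\sigma^2 I)$ on a ball of radius $U$: the trace term contributes exactly $d$, so $|k_0(x,x')|$ grows linearly with the dimension while the remaining terms stay bounded in terms of $U$.

```python
import numpy as np

def stein_kernel_linear(x, xp, mu, sigma2):
    """Stein kernel k_0 for kappa(x, x') = x^T x' and target N(mu, sigma2 * I).

    k_0(x,x') = S(x)^T kappa(x,x') S(x') + S(x)^T grad_{x'} kappa
                + grad_x kappa^T S(x') + trace(grad_{x,x'} kappa),
    with score S(x) = -(x - mu) / sigma2, grad_{x'} kappa = x,
    grad_x kappa = x', and grad_{x,x'} kappa = I_d (so the trace equals d).
    """
    d = x.shape[0]
    s_x = -(x - mu) / sigma2
    s_xp = -(xp - mu) / sigma2
    return (s_x @ s_xp) * (x @ xp) + s_x @ x + xp @ s_xp + d

rng = np.random.default_rng(0)
U, sigma2 = 1.0, 1.0
for d in [1, 10, 100, 1000]:
    mu = np.zeros(d)
    # random points on the sphere of radius U (inside the compact domain)
    x = rng.normal(size=d); x *= U / np.linalg.norm(x)
    xp = rng.normal(size=d); xp *= U / np.linalg.norm(xp)
    print(d, stein_kernel_linear(x, xp, mu, sigma2))
# The printed values grow linearly in d: the trace term, not the radius U,
# drives sup |k_0|, which is why the bound B is O(d) rather than a constant.
```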
We thank the reviewer for pointing this out again.

**Modification in the Analysis:** We remark that the whole analysis remains exactly the same, except that in equation (24), instead of using $k_0(h_k^m,h_k^m) \leq S_k^2$, we need to use $k_0(h_k^m,h_k^m) \leq d\,S_k^2$. This results in the following statement of Lemma 4.2.

***Lemma 4.2.** Under the assumptions of Lemma 4.1 and thinning budget $\epsilon_{k}=\frac{\log(k)}{f(k)^2}$, for the iterates of the proposed Algorithm 1 it holds that*
$$
\mathbb{E}\left[\text{KSD}(\phi_{\mathcal{D}_k})\right] = \mathcal{O} \left(\frac{\sqrt{dk\log(k)}}{f(k)}\right),
$$
*where $f(k)$ lower bounds the growth rate of the model complexity (dictionary size): $|\mathcal{D}_k|\geq f(k)$.*

This eventually makes the regret $\mathbb{E}[\text{Regret}_{T}]$ of order $\mathcal{O}\left(\sqrt{d}\cdot T^{1-\frac{\alpha}{2}} H^{1+\frac{\alpha}{2}}\right)$, which for $\alpha=1$ matches the existing lower bound in terms of $d$ and $T$ shown in [A, B, C] (a quick numerical check of this rate follows this reply). **Additionally**, we would like to mention that it is possible to obtain a refined and more general bound that depends on the decay rate of the series defined by the eigenvalues of the kernel matrix (this would affect the trace-term upper bound in the Stein kernel definition), but we defer this to future work.

**References:**
[A] Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014.
[B] Ian Osband, Benjamin Van Roy, and Daniel Russo. (More) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), pages 3003–3011, 2013.
[C] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pages 2701–2710. PMLR, 2017.

**Note:** We want to thank the reviewer for the detailed and critical comments on our analysis, which not only helped us improve the correctness of the regret bound but also improved the quality of the analysis in the paper. We are really excited and believe we have been able to resolve all the issues raised by the reviewer regarding the value function in the RKHS, the smoothness assumptions on the dynamics, and finally the dimensionality correction in the regret. We sincerely believe that the discussions have immensely strengthened our analysis and description. We kindly request the reviewer to consider raising the score, and we are happy to address any further questions.
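As the quick numerical check promised above (a sketch of ours, assuming the dictionary growth choice $f(k)=\sqrt{k^{\alpha+1}\log(k)}$ quoted later in this discussion), the snippet below confirms that the summed per-episode KSD bound of Lemma 4.2, $\sum_{k\le K}\sqrt{k\log(k)}/f(k)=\sum_{k\le K}k^{-\alpha/2}$, grows like $K^{1-\alpha/2}$; with $K=\lceil T/H\rceil$ episodes this is exactly the $T^{1-\alpha/2}$ factor in the stated regret.

```python
import numpy as np

def regret_sum(K, alpha):
    """Sum_{k=1}^K sqrt(k log k) / f(k) with f(k) = sqrt(k^(alpha+1) log k),
    which simplifies to sum_{k=1}^K k^(-alpha/2)."""
    k = np.arange(1, K + 1)
    return np.sum(k ** (-alpha / 2))

for alpha in [0.5, 1.0]:
    for K in [10**3, 10**4, 10**5]:
        s = regret_sum(K, alpha)
        rate = K ** (1 - alpha / 2) / (1 - alpha / 2)  # integral approximation
        print(f"alpha={alpha}, K={K}: sum={s:.1f}, K^(1-a/2)/(1-a/2)={rate:.1f}")
# The ratio sum / rate approaches 1, so the summed KSD bound is Theta(K^{1-alpha/2}),
# matching the T^{1-alpha/2} factor in the regret (with K = T/H episodes).
```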
The primary issue was with defining the search space $\mathcal{Y}_m$ with bounded kernels, which induces an additional error as a function of the dimensionality $d$, since the upper bound on the Stein kernel depends on the state-space dimensionality. With this understanding, we refine the upper bound on the Kernelized Stein Discrepancy in our analysis: starting with equation (24), we update the bound to $k_0(h_k^m,h_k^m) \leq S_k^2 + \beta d$, set $S_k' = S_k^2 + \beta d$ for the remaining analysis, and update equations (33)–(39) accordingly. Equation (40) can then be rewritten as
$$
T_1 \leq \frac{1}{ f(k)^2}\sum_{i=1}^{k} \left( (H/M+4r_i)(S_i^2 + \beta d)+\frac{4br_i H}{\gamma M} \exp\!\left(-\frac{\gamma}{2}(S_i^2 + \beta d)\right)\right).
$$
On simplification, under the assumptions detailed in the paper, we now get
$$
T_1 \leq \mathcal{O} \left(\frac{k\log(k) + kd}{f(k)^2}\right).
$$
Finally, after collecting all the constants as above, equation (51) can be written as
$$
\mathbb{E}\left[\text{KSD}(\phi_{\mathcal{D}_k})^2 \right] \leq \mathcal{O}\left(\frac{k\log(k)+ kd}{f(k)^2}\right), \qquad
\mathbb{E}\left[\text{KSD}(\phi_{\mathcal{D}_k})\right] \leq \sqrt{\mathcal{O}\left(\frac{k\log(k)+ kd}{f(k)^2}\right)}.
$$
Applying the inequality $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ for all $a,b \geq 0$, we get the updated bound
$$
\mathbb{E}\left[\text{KSD}(\phi_{\mathcal{D}_k})\right] \leq \sqrt{\mathcal{O}\left(\frac{k\log(k)}{f(k)^2}\right)} + \sqrt{d}\, \sqrt{\mathcal{O}\left(\frac{k}{f(k)^2}\right)}.
$$
The analysis of the first term is already presented in the paper, and the second term can be trivially bounded as $\sqrt{d} \sqrt{\mathcal{O}\left(\frac{k}{f(k)^2}\right)} \leq \sqrt{d} \sqrt{\mathcal{O}\left(\frac{k\log(k)}{f(k)^2}\right)}$. Following exactly the same analysis as in the paper, we obtain the final modified regret bound (equation (18)) as $\mathbb{E}[\text{Regret}_{T}] = \mathcal{O}\left(T^{\frac{1}{2}} d^{\frac{1}{2}} H^{\frac{3}{2}}\right)$, which matches the existing lower bound shown in [A, B, C]. **Additionally**, we would like to mention that it is possible to obtain a refined and more general bound that depends on the decay rate of the series defined by the eigenvalues of the kernel matrix (this would affect the trace-term upper bound in the Stein kernel definition), but we defer this to future work.

[A] Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014.
[B] Ian Osband, Benjamin Van Roy, and Daniel Russo. (More) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), pages 3003–3011, 2013.
[C] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pages 2701–2710. PMLR, 2017.

---

**Note:** We specifically want to thank the reviewer for the detailed and critical comments on our analysis, which not only helped us improve the correctness of the regret bound and achieve the lower bound, but also helped improve the quality of our paper. We are really excited and thankful to the reviewer, and believe we have been able to explain all the details, including the confusion regarding the value function in the RKHS, the smoothness assumptions on the dynamics, and finally the correction to the regret with the added dependence on dimensionality.
We sincerely believe the questions and the discussions have immensely strengthened our analysis and description, and we kindly request the reviewer to consider raising the score. We are happy to address any further questions.

---

> **New Comment:** Thank the authors for the updates. I'm still not sure about the dimensionality part. As stated by the authors, $M(T)$ is the number of data points needed to store, which should have dependency on the dimensionality. In the updated result, there's a new term $\sqrt{d}$ in the regret which is not from $M(T)$. So the overall dependency on dimensionality is still not clear, which raises doubt on the claim of the improvement on the dependency of the dimensionality in theory.

**Reply:** We thank the reviewer for getting back to us with this concern, and we apologize for the confusion created by our earlier response. Let us clarify in detail. The reviewer is correct that $M(T)$ is the number of elements we need to store in the dictionary to estimate the posterior, but it is not true that $M(T)$ is completely independent of $d$. The regret and $M(T)$ are connected implicitly via $\alpha$, which is the common parameter for the regret and the model complexity. To be specific, consider the common expression from which we derive both the regret and the $M(T)$ expressions. The updated version of Eq. (60) reads
$$
\mathbb{E}[\text{Regret}_{T}]\leq H^2 R_{\max} \sqrt{d} \sum_{k=1}^{\lceil\frac{T}{H}\rceil} \frac{\sqrt{k\log(k)}}{f(k)},
$$
where $f(k)$ controls the dictionary growth and is selected to be $f(k)=\sqrt{k^{\alpha+1}\log(k)}$. This selection gives rise to our final expression $M(T)=\tilde\Omega(\sqrt{T^{\alpha+1}})$. Therefore, the regret and $M(T)$ both come from one expression that depends on $d$. Hence, the dependence on dimensionality is inherent in the expressions for the regret and $M(T)$, and $\alpha$ is the common parameter connecting the two.

**To be specific,** we expand upon this connection here for quick reference. Consider the final expressions for the regret and $M(T)$:
$$
\mathbb{E}[\text{Regret}_{T}]=\mathcal{O}\!\left(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}}\right) \ \ \text{and}\ \ M(T)= \Omega{\left(T^{\frac{1+\alpha}{2}}\right)},
$$
where $\alpha$ ranges between $0$ and $1$. To see the connection, consider a scenario where we want to achieve $\mathbb{E}[\text{Regret}_{T}]\leq \mathcal{O} (T^{1-\epsilon})$ for some $\epsilon> 0$; note that $\epsilon>0$ yields sublinear regret. Let us see how this affects the selection of $\alpha$. We want
$$
\mathbb{E}[\text{Regret}_{T}]=\mathcal{O}\!\left(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}}\right)\leq \mathcal{O} (T^{1-\epsilon}),
$$
which holds if there exists an $\alpha$ such that (ignoring constants) $\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}\leq T^{1-\epsilon}$, because $\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}$ is an upper bound on $\mathcal{O}(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}})$ (we do this for simplicity). Taking $\log(\cdot)$ on both sides of
$$
\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}\leq T^{{1}-\epsilon}
$$
and collecting common terms on one side, we can write
$$
\log(\sqrt{d}\, H^2) \leq {\left(\frac{\alpha}{2}-\epsilon\right)}\log(T).
$$
Rearranging the terms again, we obtain
$$
\frac{2\log(\sqrt{d}\, H^2)}{\log(T)} +2\epsilon \leq \alpha,
$$
which provides a lower bound on $\alpha$ that depends on $d$. This in turn implies a lower bound on $M(T)= \Omega{\left(T^{\frac{1+\alpha}{2}}\right)}$, which increases as the dimension $d$ increases. So the regret and $M(T)$ are connected via $\alpha$ and share the dependence on the dimension $d$.
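To make this connection tangible, here is a small illustrative computation (ours, with hypothetical values of $d$, $H$, $T$, and $\epsilon$) that evaluates the lower bound $\alpha \geq 2\log(\sqrt{d}H^2)/\log(T)+2\epsilon$ derived above, together with the implied exponent of $M(T)=\Omega(T^{(1+\alpha)/2})$: a larger $d$ forces a larger $\alpha$, and hence a larger dictionary.

```python
import math

def alpha_lower_bound(d, H, T, eps):
    """Smallest alpha satisfying sqrt(d) * T^(1 - alpha/2) * H^2 <= T^(1 - eps),
    i.e. alpha >= 2*log(sqrt(d)*H^2)/log(T) + 2*eps (constants ignored)."""
    return 2 * math.log(math.sqrt(d) * H**2) / math.log(T) + 2 * eps

T, H, eps = 10**8, 5, 0.1   # hypothetical values, chosen so alpha stays in [0, 1]
for d in [10, 100, 1000]:
    a = alpha_lower_bound(d, H, T, eps)
    m_exponent = (1 + a) / 2            # M(T) = Omega(T^((1+alpha)/2))
    print(f"d={d:5d}: alpha >= {a:.3f}, M(T) = Omega(T^{m_exponent:.3f})")
# As d grows, the required alpha (and hence the dictionary size M(T)) grows,
# while the achievable regret exponent 1 - alpha/2 shrinks; alpha is capped at 1.
```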
**Remark:** Further, we note that in the above analysis we derive a lower bound on $M(T)$ by fixing the regret. One can also derive a lower bound on the regret for a fixed value of $M(T)$; this would result in an upper bound on $\alpha$. Therefore, there is a tradeoff between the regret and $M(T)$ (also detailed in Table 2 in the paper): a lower value of $\alpha$ results in a smaller $M(T)$ but a higher regret, and a higher value of $\alpha$ results in a larger $M(T)$ but a lower regret. This is precisely the tradeoff we characterize in this work.

**Reply to Claim of Improvement:** We would like to mention that the state-of-the-art results for continuous state-action spaces in model-based RL via posterior sampling have a linear dependence on the dimension $d$ in the regret [A]. In [A], the authors do not discuss the term $M(T)$ at all and in fact have $M(T)=\Omega(T)$, which is covered by our results when we select $\alpha=1$. But even for $\alpha=1$, our dependence on $d$ in the regret is sublinear ($\sqrt{d}$), which is better than the linear dependence on $d$ in [A]. ***This is our claim of improvement in terms of the dimension $d$, which matches the lower bound in terms of $d$ and $T$.***

We hope that we have addressed all the concerns of the reviewer. We are happy to address any additional concerns as well, and kindly request the reviewer to consider raising the score.

[A] Ying Fan and Yifei Ming. Model-based reinforcement learning for continuous control with posterior sampling. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3078–3087. PMLR, 2021.

---

> **New Comment 2:** I am a little confused about the dimensionality dependence of the bound. In response to reviewer YNwE you argued that the smoothness properties are responsible for the lack of dependence. Later, you updated your response on August 7, noting that the bound should depend on the dimension, as per the lower bounds pointed out to you. Given that the current dependence on d is similar to previous results, please highlight the key advantages over other methods, rendering it the method of choice.

**Reply:** We thank the reviewer for the insight provided and apologize that our earlier bound did not explicitly capture the dimensionality. Specifically, the source was an absolute-bound assumption on the Stein kernel, $\sup_{x,x'\in\mathcal{X}}|k_0(x,x')| \leq B<\infty$, which is in fact of order $\mathcal{O}(d)$ and not a constant. Hence, we incorporated the additional dependence and modified our regret expression to $\mathcal{O}\left(\sqrt{d}\cdot T^{1-\frac{\alpha}{2}} H^{1+\frac{\alpha}{2}}\right)$, which matches the lower bound.

**Key advantages (novelties) over other methods:**
1. **Our result is still the best known regret bound for PSRL with continuous state-action spaces.** We note that the state-of-the-art results for continuous state-action spaces in model-based RL via posterior sampling have a linear dependence on the dimension $d$ in the regret [A]. In [A], the authors do not discuss the term $M(T)$ at all and in fact have $M(T)=\Omega(T)$, which is covered by our results when we select $\alpha=1$. But even for $\alpha=1$, our dependence on $d$ in the regret is sublinear ($\sqrt{d}$), better than the linear dependence in [A]. ***This is our claim of improvement in terms of the dimension $d$, which matches the lower bound in terms of $d$ and $T$.***

2. **Application to a general class of transition probabilities.** The existing results in the literature are restricted to settings where the true posterior model is Gaussian or Lipschitz [B, C, D, E]. Our space of true transition models is more general than that.

3. **First use of Stein's method in RL.** Our algorithm introduces Stein's method into MBRL for the first time, which relaxes the need for smoothness or Gaussian assumptions and allows the transition dynamics to be represented by complex mixture models. A critical point is that in almost all prior model-based RL analyses, one needs to bound
$$
U_i^k(P^k(h_i))-U_i^k(P^*(h_i)) \leq H R_{\max}\,\|{P}^{k}(\cdot\mid h_i)-P^*(\cdot\mid h_i)\|,
$$
where $H$ is the horizon length and $R_{\max}$ is the maximum reward. However, exact computation of the right-hand side is not feasible, since we never know the true dynamics; prior methods therefore design concentration inequalities under strong assumptions to approximate this bound, as in [B, C, D], which is restrictive and not generalizable. Thanks to the kernelized Stein discrepancy, we can estimate the above without access to the true dynamics and without designing complex sub-Gaussian concentration inequalities (a minimal estimator sketch follows the reference list below).

4. **Mathematical tradeoff between regret and model complexity $M(T)$.** None of the prior model-based RL via posterior sampling methods focus on the posterior representational (model) complexity, which is always $M(T)=\Omega(T)$. We mathematically characterize the precise tradeoff between the regret and $M(T)$ and introduce a tuning parameter $\alpha$ that can be tuned according to the application at hand. As demonstrated in the experiments, we achieve performance similar to state-of-the-art methods with far fewer points needed to estimate the posterior distribution, which is of great importance for the practical applicability of Bayesian approaches in the real world.

[A] Ying Fan and Yifei Ming. Model-based reinforcement learning for continuous control with posterior sampling. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3078–3087. PMLR, 2021.
[B] Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014.
[C] Fan, Ying, and Yifei Ming. "Model-based Reinforcement Learning for Continuous Control with Posterior Sampling." International Conference on Machine Learning. PMLR, 2021.
[D] Sayak Ray Chowdhury and Aditya Gopalan. Online learning in kernelized Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3197–3205, 2019.
[E] Asadi, K., Misra, D., and Littman, M. Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning, 2018.
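As the estimator sketch promised in item 3, the snippet below implements the standard V-statistic estimate of the kernelized Stein discrepancy from Liu, Lee, and Jordan (2016) (reference [B] in our reply to Comment 1), using an RBF base kernel and a Gaussian target for illustration (our sketch, not the paper's code): it needs only samples from the candidate model and the score function of the target, never the transition density itself.

```python
import numpy as np

def ksd_vstat(X, score, h=1.0):
    """V-statistic estimate of KSD^2 for samples X (n x d) against a target
    whose score function (gradient of the log-density) is `score`.
    Base kernel: RBF k(x,y) = exp(-||x - y||^2 / (2 h^2))."""
    n, d = X.shape
    S = np.apply_along_axis(score, 1, X)                   # n x d matrix of scores
    diff = X[:, None, :] - X[None, :, :]                   # pairwise x_i - x_j
    sqd = np.sum(diff**2, axis=-1)                         # ||x_i - x_j||^2
    K = np.exp(-sqd / (2 * h**2))
    term1 = (S @ S.T) * K                                  # s(x)^T k(x,y) s(y)
    term2 = np.einsum('id,ijd->ij', S, diff) * K / h**2    # s(x)^T grad_y k
    term3 = -np.einsum('jd,ijd->ij', S, diff) * K / h**2   # grad_x k^T s(y)
    term4 = K * (d / h**2 - sqd / h**4)                    # trace(grad_x grad_y k)
    return np.mean(term1 + term2 + term3 + term4)

# Example: standard Gaussian target, score(x) = -x.
rng = np.random.default_rng(0)
score = lambda x: -x
good = rng.normal(size=(500, 2))         # samples from the target itself
bad = rng.normal(size=(500, 2)) + 2.0    # samples from a shifted model
print(ksd_vstat(good, score), ksd_vstat(bad, score))  # near zero vs. clearly larger
```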
---

In summary, this work is the first to derive optimal regret bounds (in terms of $d$ and $T$), introduce Stein's method to model-based RL, and precisely characterize the tradeoff between the regret and the model complexity $M(T)$, relating the two via $\alpha$. We also propose a compression technique that makes it possible to apply posterior-sampling-based MBRL methods to large-scale training with far less memory, as demonstrated in the experiments. We hope that we have addressed all the concerns of the reviewer. We are happy to address any additional concerns as well, and kindly request the reviewer to consider raising the score.

---

> **New Comment 2:** If d is the aggregate dimension, then it is not fair to compare with prior work and claim an improvement.

We believe this is not entirely true. Prior research [A, B, C] has also considered $d$ to be the aggregate dimension of the state-action space in the analysis, and we clearly improve upon the best known regret for continuous state-action spaces (PSRL), which holds for Gaussian dynamics. Our assumption is that the class of test functions lies in the Stein class, of which the Gaussian is a special case. Hence, we believe that we improve upon the previous results for PSRL with continuous state-action spaces. We are also thankful to the reviewer for the insightful comments and agree that our claim of achieving the lower bound in terms of $d$ was wrong; it cannot be directly compared to the lower bound, as stated by the reviewer, and we will remove any such claims from the final draft.

> Furthermore, in that case, the complexity of estimating the unknown transition distribution is still not evident in the bound. For example, if a transition model is harder to estimate than that of a competitor, the number of data points needed should be higher. I fail to see where the current regret bound captures this phenomenon.

We thank the reviewer for the comment, but we believe there is some confusion, either on our part or on the reviewer's part. Let us explain our understanding in detail: our bound does capture the hardness of estimating the transition model. As we already highlighted, our estimation of the transition dynamics is non-parametric; hence, as is common in non-parametric statistics, the complexity or hardness of estimating the dynamics does not depend explicitly on the dimensionality but rather on the sample complexity (take, for example, the Rademacher complexity of a function class in an RKHS). Now, as clearly stated in our paper, the transition-dynamics estimation depends on the compressed dictionary, which is controlled by $f(k)$ (which controls the number of samples used to estimate the posterior) and is selected to be $f(k)=\sqrt{k^{\alpha+1}\log(k)}$; thus the transition-dynamics estimation clearly depends on $f(k)$. Similarly, the hardness is clearly represented in the regret as well, as shown in the updated version of equation (60):
$$
\mathbb{E}[\text{Regret}_{T}]\leq H^2 R_{\max} \sqrt{d} \sum_{k=1}^{\lceil\frac{T}{H}\rceil} \frac{\sqrt{k\log(k)}}{f(k)},
$$
where $f(k)$ controls the dictionary growth and is selected to be $f(k)=\sqrt{k^{\alpha+1}\log(k)}$.
It is evident from the updated equation that if we increase the dictionary growth $f(k)$, the regret decreases; similarly, if we store more samples by controlling $f(k)$, then as the number of samples $N \to \infty$ the estimation error for the transition dynamics goes to $0$. Hence, the complexity of estimating the transition dynamics, as well as the regret, clearly depends on $f(k)$, which precisely characterizes the hardness the reviewer is referring to, we believe.

Now we explain how $f(k)$ is related to the dimensionality $d$, in other words, how the relation between the complexity and the dimensionality can be characterized. The regret and $M(T)$ are connected implicitly via $\alpha$, which is the common parameter for the regret and the model complexity. Therefore, the regret and $M(T)$ both come from one expression that depends on $d$. Hence, the dependence on dimensionality is inherent in the expressions for the regret and $M(T)$, and $\alpha$ is the common parameter connecting the two.

**To be specific,** we expand upon this connection here for quick reference. Consider the final expressions for the regret and $M(T)$:
$$
\mathbb{E}[\text{Regret}_{T}]=\mathcal{O}\!\left(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}}\right) \ \ \text{and}\ \ M(T)= \Omega{\left(T^{\frac{1+\alpha}{2}}\right)},
$$
where $\alpha$ ranges between $0$ and $1$. To see the connection, consider a scenario where we want to achieve $\mathbb{E}[\text{Regret}_{T}]\leq \mathcal{O} (T^{1-\epsilon})$ for some $\epsilon> 0$; note that $\epsilon>0$ yields sublinear regret. Let us see how this affects the selection of $\alpha$. We want
$$
\mathbb{E}[\text{Regret}_{T}]=\mathcal{O}\!\left(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}}\right)\leq \mathcal{O} (T^{1-\epsilon}),
$$
which holds if there exists an $\alpha$ such that (ignoring constants) $\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}\leq T^{1-\epsilon}$, because $\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}$ is an upper bound on $\mathcal{O}(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}})$ (we do this for simplicity). Taking $\log(\cdot)$ on both sides and collecting common terms on one side, we can write
$$
\log(\sqrt{d}\, H^2) \leq {\left(\frac{\alpha}{2}-\epsilon\right)}\log(T).
$$
Rearranging the terms again, we obtain
$$
\frac{2\log(\sqrt{d}\, H^2)}{\log(T)} +2\epsilon \leq \alpha,
$$
which provides a lower bound on $\alpha$ that depends on $d$. This in turn implies a lower bound on $M(T)= \Omega{\left(T^{\frac{1+\alpha}{2}}\right)}$, which increases as the dimension $d$ increases. So the regret and $M(T)$ are connected via $\alpha$ and share the dependence on the dimension $d$.

**Remark:** Further, we note that in the above analysis we derive a lower bound on $M(T)$ by fixing the regret. One can also derive a lower bound on the regret for a fixed value of $M(T)$; this would result in an upper bound on $\alpha$. Therefore, there is a tradeoff between the regret and $M(T)$ (also detailed in Table 2 in the paper): a lower value of $\alpha$ results in a smaller $M(T)$ but a higher regret, and a higher value of $\alpha$ results in a larger $M(T)$ but a lower regret. This is precisely the tradeoff we characterize in this work.
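To summarize the tradeoff numerically, the short sketch below tabulates, for a grid of $\alpha\in[0,1]$, the regret exponent $1-\alpha/2$ and the model-complexity exponent $(1+\alpha)/2$ from the two expressions above; this is an illustrative table of ours, analogous in spirit to Table 2 in the paper.

```python
# Regret exponent vs. model-complexity exponent as alpha varies:
#   E[Regret_T] = O(sqrt(d) * T^(1 - alpha/2) * H^(1 + alpha/2)),  M(T) = Omega(T^((1+alpha)/2))
print(f"{'alpha':>6} {'regret exp (T)':>15} {'M(T) exp':>10}")
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"{alpha:>6.2f} {1 - alpha / 2:>15.3f} {(1 + alpha) / 2:>10.3f}")
# alpha = 0: linear-in-T regret but only O(sqrt(T)) stored points;
# alpha = 1: sqrt(T) regret but M(T) = Omega(T) -- the two ends of the tradeoff.
```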
