Amrit Singh Bedi
> **Comment 1:** Thank you for your response. It is now clear to me that you do not need any assumption on the value functions. I would encourage you to rewrite the statement of your Lemma 1 and define all the terms properly so that no confusion arises. Now, seeing the argument that Theorem 2 in [B] holds under the Stein operator depending on the posterior distribution, I feel the posterior distribution should satisfy some specific assumptions (with relation to the Stein set) for the result to hold. Can the authors comment on that and give a justification why it should hold?

**Reply:** We thank the reviewer for the valuable comments. We are extremely glad that we are able to resolve the **primary concern** of the reviewer regarding the requirement of $\mathbb{E}_{P^*}[V(\cdot)]=0$ (Weaknesses 1, 2, 3, 5). As suggested by the reviewer, we revise the assumptions of Lemma 4.1. For instance, the target measure not only has to be smooth (as required in [B]), it also has to be strongly log-concave, which means that $\log(P^*)$ has to be strongly concave with bounded and continuous third and fourth derivatives (following [A]). We modify the statement of Lemma 4.1 to include these assumptions as follows.

---

***Lemma 4.1** (Lipschitz in Kernel Stein Discrepancy). Recall the definition of the future value function in (3). Under the assumptions that (a) the posterior distributions are smooth, and (b) strongly log-concave with bounded and continuous third and fourth derivatives, the future value function estimation error of $P^k$ with respect to $P^*$ is upper bounded in terms of the KSD of $P^k$ as*
$$
U_i^k(P^k(h_i)) - U_i^k(P^*(h_i)) \leq H R_{\max}\,\|P^{k}(\cdot\mid h_i) - P^*(\cdot\mid h_i)\|,
$$
*where $H$ is the horizon length and $R_{\max}$ is the maximum reward.* (A short numerical illustration of assumption (b) follows the references below.)

---

**Justification of Assumptions:** This is a good point. In this work, our focus is to develop a mathematical approach for applying Stein's method in the model-based RL setting (to the best of our knowledge, this is the first work in that direction). The smoothness assumption on the transition dynamics is common in the model-based RL literature and has been widely used together with Lipschitzness and/or Gaussian assumptions on the dynamics [C, D, E, F], which are much more restrictive. Observe that our technical setting permits the transition density to belong to a family of mixture models, which contains the Gaussian as a special case. Moreover, it is important to note that the sufficient conditions of Theorem 2 in [A] are certainly not necessary for lower bounding the classical Stein discrepancy and can be generalized to other large classes of multivariate target distributions with some modifications, as mentioned in [A]. Additionally, as highlighted by the reviewer in the summary, our empirical evaluations strongly support the claims detailed above.

**References:**
[A] Jackson Gorham and Lester Mackey. Measuring sample quality with Stein's method. Advances in Neural Information Processing Systems, 28, 2015.
[B] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284. PMLR, 2016.
[C] Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014.
[D] Fan, Ying, and Yifei Ming. "Model-based Reinforcement Learning for Continuous Control with Posterior Sampling." International Conference on Machine Learning. PMLR, 2021.
[E] Sayak Ray Chowdhury and Aditya Gopalan. Online learning in kernelized Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3197–3205, 2019.
[F] Asadi, K., Misra, D., and Littman, M. Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning, 2018.
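For concreteness, here is a small numerical illustration of assumption (b) (our own sketch, not code from the paper) for a Gaussian target $\mathcal{N}(\mu,\sigma^2 I)$: the Hessian of $\log P$ is $-\sigma^{-2}I$, so $\log P$ is strongly concave with parameter $1/\sigma^2$, and its third and fourth derivatives vanish, hence are trivially bounded and continuous.

```python
import numpy as np

def log_p(x, mu, sigma2):
    """Unnormalized Gaussian log-density: log P(x) = -||x - mu||^2 / (2 sigma2) + const."""
    return -np.sum((x - mu) ** 2) / (2 * sigma2)

def numerical_hessian(f, x, eps=1e-3):
    """Central finite-difference Hessian of a scalar function f at x."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

d, sigma2 = 3, 0.5
mu, x = np.zeros(d), np.array([0.3, -1.2, 0.7])
H = numerical_hessian(lambda z: log_p(z, mu, sigma2), x)
print(np.linalg.eigvalsh(H))  # all approximately -1/sigma2 = -2.0 < 0,
                              # i.e. log P is strongly concave (strongly log-concave target)
```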
---

> **Comment 2:** I beg to differ with the author that hardness results for bandits only apply to the frequentist setting. In fact the lower bounds proved in both [1] and [2] are for the Bayesian setting (and hence also apply to the frequentist setting). To sum up, be it the Bayesian setting or the frequentist setting, the problem complexity should show up in any estimation procedure.
> [1] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Rocco Servedio and Tong Zhang, editors, Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 355–366, 2008.
> [2] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

**Reply:** Thank you for this insight. After thinking about it carefully, we agree that this was missing from our result, and we apologize for that. Taking a careful look, it turned out to be a quick fix in the analysis, and we obtain the following regret bound, which shows the dependence on the dimension $d$:
$$
\mathbb{E}[\text{Regret}_T]=\mathcal{O}\!\left(\sqrt{d}\cdot T^{1-(\alpha/2)}H^{1+(\alpha/2)}\right),
$$
where $\alpha\in[0,1]$. For $\alpha=1$, we obtain $\mathbb{E}[\text{Regret}_T]=\mathcal{O}(\sqrt{d}\cdot T^{1/2}H^{3/2})$, which matches the lower bound dependence in terms of the dimension $d$ and $T$. The source of this issue was an absolute-bound assumption on the Stein kernel, $\sup_{x,x'\in\mathcal{X}}|k_0(x,x')| \leq B<\infty$, which is in fact of order $\mathcal{O}(d)$ and not a constant.

**A simple example:** To see this, recall the definition of the Stein kernel,
$$
k_0(x,x')=S_{P}(x)^{\top}\kappa(x,x')S_{P}(x')+S_{P}(x)^{\top}\nabla_{x'}\kappa(x,x')+\nabla_{x}\kappa(x,x')^{\top}S_{P}(x')+\text{trace}(\nabla_{x,x'}\kappa(x,x')).
$$
Even for a linear kernel $\kappa(x,y)=x^{\top}y$, a compact space $\mathcal{X}$ such that $\|x\|\leq U$ for all $x\in\mathcal{X}$, and a Gaussian target distribution (so that $P(x)\propto\exp(-\|x-\mu\|^2/(2\sigma^2))$), we have
$$
|k_0(x,x')|\leq \|S_{P}(x)\|\cdot |\kappa(x,x')|\cdot \|S_{P}(x')\|+\|S_{P}(x)\|\cdot \|\nabla_{x'}\kappa(x,x')\|+\|\nabla_{x}\kappa(x,x')\|\cdot \|S_{P}(x')\|+|\text{trace}(\nabla_{x,x'}\kappa(x,x'))|,
$$
where we use the triangle and Cauchy–Schwarz inequalities. Since the space is compact, $|\kappa(x,x')|\leq U^2$, $\|\nabla_{x}\kappa(x,x')\|\leq U$, and $\|S_{P}(x)\|=\|\nabla \log P(x)\|\leq \mathcal{O}(U)$. Therefore, we can write the above inequality as
$$
|k_0(x,x')|\leq \mathcal{O}(U^4)+\mathcal{O}(U^2)+\mathcal{O}(U^2)+|\text{trace}(\nabla_{x,x'}\kappa(x,x'))|.
$$
The last term satisfies $|\text{trace}(\nabla_{x,x'}\kappa(x,x'))|=\mathcal{O}(d)$, since it is the trace of a $d\times d$ matrix. Therefore,
$$
|k_0(x,x')|\leq \mathcal{O}(U^4)+\mathcal{O}(U^2)+\mathcal{O}(U^2)+\mathcal{O}(d),
$$
and hence, in the simplest possible form,
$$
\sup_{x,x'\in\mathcal{X}}|k_0(x,x')|\leq \mathcal{O}(d),
$$
where we suppress the ($U$-dependent) constants. In the earlier version we were suppressing the $d$ dependence as well, which we have now made explicit.
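To make the scaling concrete, the following is a minimal numerical sketch (our illustration, not code from the paper) of the Stein kernel above for the linear kernel $\kappa(x,x')=x^{\top}x'$ and a Gaussian target $\mathcal{N}(\mu,\sigma^2 I)$ on a ball of radius $U$: the trace term contributes exactly $d$, so $|k_0(x,x')|$ grows linearly with the dimension while the remaining terms stay bounded in terms of $U$.

```python
import numpy as np

def stein_kernel_linear(x, xp, mu, sigma2):
    """Stein kernel k_0 for kappa(x, x') = x^T x' and target N(mu, sigma2 * I).

    k_0(x,x') = S(x)^T kappa(x,x') S(x') + S(x)^T grad_{x'} kappa
                + grad_x kappa^T S(x') + trace(grad_{x,x'} kappa),
    with score S(x) = -(x - mu) / sigma2, grad_{x'} kappa = x,
    grad_x kappa = x', and grad_{x,x'} kappa = I_d (so the trace equals d).
    """
    d = x.shape[0]
    s_x = -(x - mu) / sigma2
    s_xp = -(xp - mu) / sigma2
    return (s_x @ s_xp) * (x @ xp) + s_x @ x + xp @ s_xp + d

rng = np.random.default_rng(0)
U, sigma2 = 1.0, 1.0
for d in [1, 10, 100, 1000]:
    mu = np.zeros(d)
    # random points on the sphere of radius U (inside the compact domain)
    x = rng.normal(size=d); x *= U / np.linalg.norm(x)
    xp = rng.normal(size=d); xp *= U / np.linalg.norm(xp)
    print(d, stein_kernel_linear(x, xp, mu, sigma2))
# The printed values grow linearly in d: the trace term, not the radius U,
# drives sup |k_0|, which is why the bound B is O(d) rather than a constant.
```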
We thank the reviewer for pointing this out again.

**Modification in the Analysis:** We remark that the whole analysis remains exactly the same, except that in equation (24), instead of using $k_0(h_k^m,h_k^m) \leq S_k^2$, we need to use $k_0(h_k^m,h_k^m) \leq d\,S_k^2$. This results in the following statement of Lemma 4.2.

***Lemma 4.2.** Under the assumptions of Lemma 4.1 and thinning budget $\epsilon_{k}=\frac{\log(k)}{f(k)^2}$, for the iterates of the proposed Algorithm 1 it holds that*
$$
\mathbb{E}\left[\text{KSD}(\phi_{\mathcal{D}_k})\right] = \mathcal{O} \left(\frac{\sqrt{dk\log(k)}}{f(k)}\right),
$$
*where $f(k)$ lower bounds the growth rate of the model complexity (dictionary size): $|\mathcal{D}_k|\geq f(k)$.*

This eventually makes the regret $\mathbb{E}[\text{Regret}_{T}]$ of order $\mathcal{O}\left(\sqrt{d}\cdot T^{1-\frac{\alpha}{2}} H^{1+\frac{\alpha}{2}}\right)$, which for $\alpha=1$ matches the existing lower bound in terms of $d$ and $T$ shown in [A, B, C] (a quick numerical check of this rate follows this reply). **Additionally**, we would like to mention that it is possible to obtain a refined and more general bound that depends on the decay rate of the series defined by the eigenvalues of the kernel matrix (this would affect the trace-term upper bound in the Stein kernel definition), but we defer this to future work.

**References:**
[A] Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014.
[B] Ian Osband, Benjamin Van Roy, and Daniel Russo. (More) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), pages 3003–3011, 2013.
[C] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pages 2701–2710. PMLR, 2017.

**Note:** We want to thank the reviewer for the detailed and critical comments on our analysis, which not only helped us improve the correctness of the regret bound but also improved the quality of the analysis in the paper. We are really excited and believe we have been able to resolve all the issues raised by the reviewer regarding the value function in the RKHS, the smoothness assumptions on the dynamics, and finally the dimensionality correction in the regret. We sincerely believe that the discussions have immensely strengthened our analysis and description. We kindly request the reviewer to consider raising the score, and we are happy to address any further questions.
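As the quick numerical check promised above (a sketch of ours, assuming the dictionary growth choice $f(k)=\sqrt{k^{\alpha+1}\log(k)}$ quoted later in this discussion), the snippet below confirms that the summed per-episode KSD bound of Lemma 4.2, $\sum_{k\le K}\sqrt{k\log(k)}/f(k)=\sum_{k\le K}k^{-\alpha/2}$, grows like $K^{1-\alpha/2}$; with $K=\lceil T/H\rceil$ episodes this is exactly the $T^{1-\alpha/2}$ factor in the stated regret.

```python
import numpy as np

def regret_sum(K, alpha):
    """Sum_{k=1}^K sqrt(k log k) / f(k) with f(k) = sqrt(k^(alpha+1) log k),
    which simplifies to sum_{k=1}^K k^(-alpha/2)."""
    k = np.arange(1, K + 1)
    return np.sum(k ** (-alpha / 2))

for alpha in [0.5, 1.0]:
    for K in [10**3, 10**4, 10**5]:
        s = regret_sum(K, alpha)
        rate = K ** (1 - alpha / 2) / (1 - alpha / 2)  # integral approximation
        print(f"alpha={alpha}, K={K}: sum={s:.1f}, K^(1-a/2)/(1-a/2)={rate:.1f}")
# The ratio sum / rate approaches 1, so the summed KSD bound is Theta(K^{1-alpha/2}),
# matching the T^{1-alpha/2} factor in the regret (with K = T/H episodes).
```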
The primary issue was with defining the search space $\mathcal{Y}_m$ with bounded kernels, which induces an additional error as a function of the dimensionality $d$, since the upper bound on the Stein kernel depends on the state-space dimensionality. With this understanding, we refine the upper bound on the Kernelized Stein Discrepancy in our analysis: starting with equation (24), we update the bound to $k_0(h_k^m,h_k^m) \leq S_k^2 + \beta d$, set $S_k' = S_k^2 + \beta d$ for the remaining analysis, and update equations (33)–(39) accordingly. Equation (40) can then be rewritten as
$$
T_1 \leq \frac{1}{ f(k)^2}\sum_{i=1}^{k} \left( (H/M+4r_i)(S_i^2 + \beta d)+\frac{4br_i H}{\gamma M} \exp\!\left(-\frac{\gamma}{2}(S_i^2 + \beta d)\right)\right).
$$
On simplification, under the assumptions detailed in the paper, we now get
$$
T_1 \leq \mathcal{O} \left(\frac{k\log(k) + kd}{f(k)^2}\right).
$$
Finally, after collecting all the constants as above, equation (51) can be written as
$$
\mathbb{E}\left[\text{KSD}(\phi_{\mathcal{D}_k})^2 \right] \leq \mathcal{O}\left(\frac{k\log(k)+ kd}{f(k)^2}\right), \qquad
\mathbb{E}\left[\text{KSD}(\phi_{\mathcal{D}_k})\right] \leq \sqrt{\mathcal{O}\left(\frac{k\log(k)+ kd}{f(k)^2}\right)}.
$$
Applying the inequality $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ for all $a,b \geq 0$, we get the updated bound
$$
\mathbb{E}\left[\text{KSD}(\phi_{\mathcal{D}_k})\right] \leq \sqrt{\mathcal{O}\left(\frac{k\log(k)}{f(k)^2}\right)} + \sqrt{d}\, \sqrt{\mathcal{O}\left(\frac{k}{f(k)^2}\right)}.
$$
The analysis of the first term is already presented in the paper, and the second term can be trivially bounded as $\sqrt{d} \sqrt{\mathcal{O}\left(\frac{k}{f(k)^2}\right)} \leq \sqrt{d} \sqrt{\mathcal{O}\left(\frac{k\log(k)}{f(k)^2}\right)}$. Following exactly the same analysis as in the paper, we obtain the final modified regret bound (equation (18)) as $\mathbb{E}[\text{Regret}_{T}] = \mathcal{O}\left(T^{\frac{1}{2}} d^{\frac{1}{2}} H^{\frac{3}{2}}\right)$, which matches the existing lower bound shown in [A, B, C]. **Additionally**, we would like to mention that it is possible to obtain a refined and more general bound that depends on the decay rate of the series defined by the eigenvalues of the kernel matrix (this would affect the trace-term upper bound in the Stein kernel definition), but we defer this to future work.

[A] Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014.
[B] Ian Osband, Benjamin Van Roy, and Daniel Russo. (More) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), pages 3003–3011, 2013.
[C] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pages 2701–2710. PMLR, 2017.

---

**Note:** We specifically want to thank the reviewer for the detailed and critical comments on our analysis, which not only helped us improve the correctness of the regret bound and achieve the lower bound, but also helped improve the quality of our paper. We are really excited and thankful to the reviewer, and believe we have been able to explain all the details, including the confusion regarding the value function in the RKHS, the smoothness assumptions on the dynamics, and finally the correction to the regret with the added dependence on dimensionality.
We sincerely believe the questions and the discussions have immensely strengthened our analysis and description, and we kindly request the reviewer to consider raising the score. We are happy to address any further questions.

---

> **New Comment:** Thank the authors for the updates. I'm still not sure about the dimensionality part. As stated by the authors, $M(T)$ is the number of data points needed to store, which should have dependency on the dimensionality. In the updated result, there's a new term $\sqrt{d}$ in the regret which is not from $M(T)$. So the overall dependency on dimensionality is still not clear, which raises doubt on the claim of the improvement on the dependency of the dimensionality in theory.

**Reply:** We thank the reviewer for getting back to us with this concern, and we apologize for the confusion created by our earlier response. Let us clarify in detail. The reviewer is correct that $M(T)$ is the number of elements we need to store in the dictionary to estimate the posterior, but it is not true that $M(T)$ is completely independent of $d$. The regret and $M(T)$ are connected implicitly via $\alpha$, which is the common parameter for the regret and the model complexity. To be specific, consider the common expression from which we derive both the regret and the $M(T)$ expressions. The updated version of Eq. (60) reads
$$
\mathbb{E}[\text{Regret}_{T}]\leq H^2 R_{\max} \sqrt{d} \sum_{k=1}^{\lceil\frac{T}{H}\rceil} \frac{\sqrt{k\log(k)}}{f(k)},
$$
where $f(k)$ controls the dictionary growth and is selected to be $f(k)=\sqrt{k^{\alpha+1}\log(k)}$. This selection gives rise to our final expression $M(T)=\tilde\Omega(\sqrt{T^{\alpha+1}})$. Therefore, the regret and $M(T)$ both come from one expression that depends on $d$. Hence, the dependence on dimensionality is inherent in the expressions for the regret and $M(T)$, and $\alpha$ is the common parameter connecting the two.

**To be specific,** we expand upon this connection here for quick reference. Consider the final expressions for the regret and $M(T)$:
$$
\mathbb{E}[\text{Regret}_{T}]=\mathcal{O}\!\left(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}}\right) \ \ \text{and}\ \ M(T)= \Omega{\left(T^{\frac{1+\alpha}{2}}\right)},
$$
where $\alpha$ ranges between $0$ and $1$. To see the connection, consider a scenario where we want to achieve $\mathbb{E}[\text{Regret}_{T}]\leq \mathcal{O} (T^{1-\epsilon})$ for some $\epsilon> 0$; note that $\epsilon>0$ yields sublinear regret. Let us see how this affects the selection of $\alpha$. We want
$$
\mathbb{E}[\text{Regret}_{T}]=\mathcal{O}\!\left(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}}\right)\leq \mathcal{O} (T^{1-\epsilon}),
$$
which holds if there exists an $\alpha$ such that (ignoring constants) $\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}\leq T^{1-\epsilon}$, because $\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}$ is an upper bound on $\mathcal{O}(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}})$ (we do this for simplicity). Taking $\log(\cdot)$ on both sides of
$$
\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}\leq T^{{1}-\epsilon}
$$
and collecting common terms on one side, we can write
$$
\log(\sqrt{d}\, H^2) \leq {\left(\frac{\alpha}{2}-\epsilon\right)}\log(T).
$$
Rearranging the terms again, we obtain
$$
\frac{2\log(\sqrt{d}\, H^2)}{\log(T)} +2\epsilon \leq \alpha,
$$
which provides a lower bound on $\alpha$ that depends on $d$. This in turn implies a lower bound on $M(T)= \Omega{\left(T^{\frac{1+\alpha}{2}}\right)}$, which increases as the dimension $d$ increases. So the regret and $M(T)$ are connected via $\alpha$ and share the dependence on the dimension $d$.
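To make this connection tangible, here is a small illustrative computation (ours, with hypothetical values of $d$, $H$, $T$, and $\epsilon$) that evaluates the lower bound $\alpha \geq 2\log(\sqrt{d}H^2)/\log(T)+2\epsilon$ derived above, together with the implied exponent of $M(T)=\Omega(T^{(1+\alpha)/2})$: a larger $d$ forces a larger $\alpha$, and hence a larger dictionary.

```python
import math

def alpha_lower_bound(d, H, T, eps):
    """Smallest alpha satisfying sqrt(d) * T^(1 - alpha/2) * H^2 <= T^(1 - eps),
    i.e. alpha >= 2*log(sqrt(d)*H^2)/log(T) + 2*eps (constants ignored)."""
    return 2 * math.log(math.sqrt(d) * H**2) / math.log(T) + 2 * eps

T, H, eps = 10**8, 5, 0.1   # hypothetical values, chosen so alpha stays in [0, 1]
for d in [10, 100, 1000]:
    a = alpha_lower_bound(d, H, T, eps)
    m_exponent = (1 + a) / 2            # M(T) = Omega(T^((1+alpha)/2))
    print(f"d={d:5d}: alpha >= {a:.3f}, M(T) = Omega(T^{m_exponent:.3f})")
# As d grows, the required alpha (and hence the dictionary size M(T)) grows,
# while the achievable regret exponent 1 - alpha/2 shrinks; alpha is capped at 1.
```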
**Remark:** Further, we note that in the above analysis we derive a lower bound on $M(T)$ by fixing the regret. One can also derive a lower bound on the regret for a fixed value of $M(T)$; this would result in an upper bound on $\alpha$. Therefore, there is a tradeoff between the regret and $M(T)$ (also detailed in Table 2 in the paper): a lower value of $\alpha$ results in a smaller $M(T)$ but a higher regret, and a higher value of $\alpha$ results in a larger $M(T)$ but a lower regret. This is precisely the tradeoff we characterize in this work.

**Reply to Claim of Improvement:** We would like to mention that the state-of-the-art results for continuous state-action spaces in model-based RL via posterior sampling have a linear dependence on the dimension $d$ in the regret [A]. In [A], the authors do not discuss the term $M(T)$ at all and in fact have $M(T)=\Omega(T)$, which is covered by our results when we select $\alpha=1$. But even for $\alpha=1$, our dependence on $d$ in the regret is sublinear ($\sqrt{d}$), which is better than the linear dependence on $d$ in [A]. ***This is our claim of improvement in terms of the dimension $d$, which matches the lower bound in terms of $d$ and $T$.***

We hope that we have addressed all the concerns of the reviewer. We are happy to address any additional concerns as well, and kindly request the reviewer to consider raising the score.

[A] Ying Fan and Yifei Ming. Model-based reinforcement learning for continuous control with posterior sampling. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3078–3087. PMLR, 2021.

---

> **New Comment 2:** I am a little confused about the dimensionality dependence of the bound. In response to reviewer YNwE you argued that the smoothness properties are responsible for the lack of dependence. Later, you updated your response on August 7, noting that the bound should depend on the dimension, as per the lower bounds pointed out to you. Given that the current dependence on d is similar to previous results, please highlight the key advantages over other methods, rendering it the method of choice.

**Reply:** We thank the reviewer for the insight provided and apologize that our earlier bound did not explicitly capture the dimensionality. Specifically, the source was an absolute-bound assumption on the Stein kernel, $\sup_{x,x'\in\mathcal{X}}|k_0(x,x')| \leq B<\infty$, which is in fact of order $\mathcal{O}(d)$ and not a constant. Hence, we incorporated the additional dependence and modified our regret expression to $\mathcal{O}\left(\sqrt{d}\cdot T^{1-\frac{\alpha}{2}} H^{1+\frac{\alpha}{2}}\right)$, which matches the lower bound.

**Key advantages (novelties) over other methods:**
1. **Our result is still the best known regret bound for PSRL with continuous state-action spaces.** We note that the state-of-the-art results for continuous state-action spaces in model-based RL via posterior sampling have a linear dependence on the dimension $d$ in the regret [A]. In [A], the authors do not discuss the term $M(T)$ at all and in fact have $M(T)=\Omega(T)$, which is covered by our results when we select $\alpha=1$. But even for $\alpha=1$, our dependence on $d$ in the regret is sublinear ($\sqrt{d}$), better than the linear dependence in [A]. ***This is our claim of improvement in terms of the dimension $d$, which matches the lower bound in terms of $d$ and $T$.***

2. **Application to a general class of transition probabilities.** The existing results in the literature are restricted to settings where the true posterior model is Gaussian or Lipschitz [B, C, D, E]. Our space of true transition models is more general than that.

3. **First use of Stein's method in RL.** Our algorithm introduces Stein's method into MBRL for the first time, which relaxes the need for smoothness or Gaussian assumptions and allows the transition dynamics to be represented by complex mixture models. A critical point is that in almost all prior model-based RL analyses, one needs to bound
$$
U_i^k(P^k(h_i))-U_i^k(P^*(h_i)) \leq H R_{\max}\,\|{P}^{k}(\cdot\mid h_i)-P^*(\cdot\mid h_i)\|,
$$
where $H$ is the horizon length and $R_{\max}$ is the maximum reward. However, exact computation of the right-hand side is not feasible, since we never know the true dynamics; prior methods therefore design concentration inequalities under strong assumptions to approximate this bound, as in [B, C, D], which is restrictive and not generalizable. Thanks to the kernelized Stein discrepancy, we can estimate the above without access to the true dynamics and without designing complex sub-Gaussian concentration inequalities (a minimal estimator sketch follows the reference list below).

4. **Mathematical tradeoff between regret and model complexity $M(T)$.** None of the prior model-based RL via posterior sampling methods focus on the posterior representational (model) complexity, which is always $M(T)=\Omega(T)$. We mathematically characterize the precise tradeoff between the regret and $M(T)$ and introduce a tuning parameter $\alpha$ that can be tuned according to the application at hand. As demonstrated in the experiments, we achieve performance similar to state-of-the-art methods with far fewer points needed to estimate the posterior distribution, which is of great importance for the practical applicability of Bayesian approaches in the real world.

[A] Ying Fan and Yifei Ming. Model-based reinforcement learning for continuous control with posterior sampling. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3078–3087. PMLR, 2021.
[B] Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014.
[C] Fan, Ying, and Yifei Ming. "Model-based Reinforcement Learning for Continuous Control with Posterior Sampling." International Conference on Machine Learning. PMLR, 2021.
[D] Sayak Ray Chowdhury and Aditya Gopalan. Online learning in kernelized Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3197–3205, 2019.
[E] Asadi, K., Misra, D., and Littman, M. Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning, 2018.
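As the estimator sketch promised in item 3, the snippet below implements the standard V-statistic estimate of the kernelized Stein discrepancy from Liu, Lee, and Jordan (2016) (reference [B] in our reply to Comment 1), using an RBF base kernel and a Gaussian target for illustration (our sketch, not the paper's code): it needs only samples from the candidate model and the score function of the target, never the transition density itself.

```python
import numpy as np

def ksd_vstat(X, score, h=1.0):
    """V-statistic estimate of KSD^2 for samples X (n x d) against a target
    whose score function (gradient of the log-density) is `score`.
    Base kernel: RBF k(x,y) = exp(-||x - y||^2 / (2 h^2))."""
    n, d = X.shape
    S = np.apply_along_axis(score, 1, X)                   # n x d matrix of scores
    diff = X[:, None, :] - X[None, :, :]                   # pairwise x_i - x_j
    sqd = np.sum(diff**2, axis=-1)                         # ||x_i - x_j||^2
    K = np.exp(-sqd / (2 * h**2))
    term1 = (S @ S.T) * K                                  # s(x)^T k(x,y) s(y)
    term2 = np.einsum('id,ijd->ij', S, diff) * K / h**2    # s(x)^T grad_y k
    term3 = -np.einsum('jd,ijd->ij', S, diff) * K / h**2   # grad_x k^T s(y)
    term4 = K * (d / h**2 - sqd / h**4)                    # trace(grad_x grad_y k)
    return np.mean(term1 + term2 + term3 + term4)

# Example: standard Gaussian target, score(x) = -x.
rng = np.random.default_rng(0)
score = lambda x: -x
good = rng.normal(size=(500, 2))         # samples from the target itself
bad = rng.normal(size=(500, 2)) + 2.0    # samples from a shifted model
print(ksd_vstat(good, score), ksd_vstat(bad, score))  # near zero vs. clearly larger
```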
---

In summary, this work is the first to derive optimal regret bounds (in terms of $d$ and $T$), introduce Stein's method to model-based RL, and precisely characterize the tradeoff between the regret and the model complexity $M(T)$, relating the two via $\alpha$. We also propose a compression technique that makes it possible to apply posterior-sampling-based MBRL methods to large-scale training with far less memory, as demonstrated in the experiments. We hope that we have addressed all the concerns of the reviewer. We are happy to address any additional concerns as well, and kindly request the reviewer to consider raising the score.

---

> **New Comment 2:** If d is the aggregate dimension, then it is not fair to compare with prior work and claim an improvement.

We believe this is not entirely true. Prior research [A, B, C] has also considered $d$ to be the aggregate dimension of the state-action space in the analysis, and we clearly improve upon the best known regret for continuous state-action spaces (PSRL), which holds for Gaussian dynamics. Our assumption is that the class of test functions lies in the Stein class, of which the Gaussian is a special case. Hence, we believe that we improve upon the previous results for PSRL with continuous state-action spaces. We are also thankful to the reviewer for the insightful comments and agree that our claim of achieving the lower bound in terms of $d$ was wrong; it cannot be directly compared to the lower bound, as stated by the reviewer, and we will remove any such claims from the final draft.

> Furthermore, in that case, the complexity of estimating the unknown transition distribution is still not evident in the bound. For example, if a transition model is harder to estimate than that of a competitor, the number of data points needed should be higher. I fail to see where the current regret bound captures this phenomenon.

We thank the reviewer for the comment, but we believe there is some confusion, either on our part or on the reviewer's part. Let us explain our understanding in detail: our bound does capture the hardness of estimating the transition model. As we already highlighted, our estimation of the transition dynamics is non-parametric; hence, as is common in non-parametric statistics, the complexity or hardness of estimating the dynamics does not depend explicitly on the dimensionality but rather on the sample complexity (take, for example, the Rademacher complexity of a function class in an RKHS). Now, as clearly stated in our paper, the transition-dynamics estimation depends on the compressed dictionary, which is controlled by $f(k)$ (which controls the number of samples used to estimate the posterior) and is selected to be $f(k)=\sqrt{k^{\alpha+1}\log(k)}$; thus the transition-dynamics estimation clearly depends on $f(k)$. Similarly, the hardness is clearly represented in the regret as well, as shown in the updated version of equation (60):
$$
\mathbb{E}[\text{Regret}_{T}]\leq H^2 R_{\max} \sqrt{d} \sum_{k=1}^{\lceil\frac{T}{H}\rceil} \frac{\sqrt{k\log(k)}}{f(k)},
$$
where $f(k)$ controls the dictionary growth and is selected to be $f(k)=\sqrt{k^{\alpha+1}\log(k)}$.
It is evident from the updated equation that if we increase the dictionary growth $f(k)$, the regret decreases; similarly, if we store more samples by controlling $f(k)$, then as the number of samples $N \to \infty$ the estimation error for the transition dynamics goes to $0$. Hence, the complexity of estimating the transition dynamics, as well as the regret, clearly depends on $f(k)$, which precisely characterizes the hardness the reviewer is referring to, we believe.

Now we explain how $f(k)$ is related to the dimensionality $d$, in other words, how the relation between the complexity and the dimensionality can be characterized. The regret and $M(T)$ are connected implicitly via $\alpha$, which is the common parameter for the regret and the model complexity. Therefore, the regret and $M(T)$ both come from one expression that depends on $d$. Hence, the dependence on dimensionality is inherent in the expressions for the regret and $M(T)$, and $\alpha$ is the common parameter connecting the two.

**To be specific,** we expand upon this connection here for quick reference. Consider the final expressions for the regret and $M(T)$:
$$
\mathbb{E}[\text{Regret}_{T}]=\mathcal{O}\!\left(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}}\right) \ \ \text{and}\ \ M(T)= \Omega{\left(T^{\frac{1+\alpha}{2}}\right)},
$$
where $\alpha$ ranges between $0$ and $1$. To see the connection, consider a scenario where we want to achieve $\mathbb{E}[\text{Regret}_{T}]\leq \mathcal{O} (T^{1-\epsilon})$ for some $\epsilon> 0$; note that $\epsilon>0$ yields sublinear regret. Let us see how this affects the selection of $\alpha$. We want
$$
\mathbb{E}[\text{Regret}_{T}]=\mathcal{O}\!\left(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}}\right)\leq \mathcal{O} (T^{1-\epsilon}),
$$
which holds if there exists an $\alpha$ such that (ignoring constants) $\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}\leq T^{1-\epsilon}$, because $\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{2}$ is an upper bound on $\mathcal{O}(\sqrt{d} \cdot T^{1-\frac{\alpha}{2}}H^{1+\frac{\alpha}{2}})$ (we do this for simplicity). Taking $\log(\cdot)$ on both sides and collecting common terms on one side, we can write
$$
\log(\sqrt{d}\, H^2) \leq {\left(\frac{\alpha}{2}-\epsilon\right)}\log(T).
$$
Rearranging the terms again, we obtain
$$
\frac{2\log(\sqrt{d}\, H^2)}{\log(T)} +2\epsilon \leq \alpha,
$$
which provides a lower bound on $\alpha$ that depends on $d$. This in turn implies a lower bound on $M(T)= \Omega{\left(T^{\frac{1+\alpha}{2}}\right)}$, which increases as the dimension $d$ increases. So the regret and $M(T)$ are connected via $\alpha$ and share the dependence on the dimension $d$.

**Remark:** Further, we note that in the above analysis we derive a lower bound on $M(T)$ by fixing the regret. One can also derive a lower bound on the regret for a fixed value of $M(T)$; this would result in an upper bound on $\alpha$. Therefore, there is a tradeoff between the regret and $M(T)$ (also detailed in Table 2 in the paper): a lower value of $\alpha$ results in a smaller $M(T)$ but a higher regret, and a higher value of $\alpha$ results in a larger $M(T)$ but a lower regret. This is precisely the tradeoff we characterize in this work.
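To summarize the tradeoff numerically, the short sketch below tabulates, for a grid of $\alpha\in[0,1]$, the regret exponent $1-\alpha/2$ and the model-complexity exponent $(1+\alpha)/2$ from the two expressions above; this is an illustrative table of ours, analogous in spirit to Table 2 in the paper.

```python
# Regret exponent vs. model-complexity exponent as alpha varies:
#   E[Regret_T] = O(sqrt(d) * T^(1 - alpha/2) * H^(1 + alpha/2)),  M(T) = Omega(T^((1+alpha)/2))
print(f"{'alpha':>6} {'regret exp (T)':>15} {'M(T) exp':>10}")
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"{alpha:>6.2f} {1 - alpha / 2:>15.3f} {(1 + alpha) / 2:>10.3f}")
# alpha = 0: linear-in-T regret but only O(sqrt(T)) stored points;
# alpha = 1: sqrt(T) regret but M(T) = Omega(T) -- the two ends of the tradeoff.
```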
