# Rebuttal for STEERING Paper
## Comment for AC and SAC [Discuss an existing review with ACs and SACs]
Dear ACs and SACs,
Thank you for your time and effort in organizing the reviews for our ICML submission. We greatly appreciate your service and dedication.
We are writing to respectfully discuss the feedback provided by **Reviewer LqGz**. While we understand that their review was made with good intentions, we have some concerns about the validity of the criticisms and the thoroughness of their evaluation.
We feel that the reviewer's decision was based on a quick high-level review of our manuscript, as evidenced by the comments made. While the reviewer acknowledged the novelty of our approach, the criticisms provided were high-level and easily addressable. We have carefully addressed all of the reviewer's comments in detail in our rebuttal.
We feel that the reviewer was overly harsh on minor typos that are easily fixable. For example, the reviewer pointed out a minor typo in the definition of $\pi^*$ in **Question 2**, which we have promptly corrected. Additionally, the reviewer mentioned that the formula of information gain on line 127 is incorrect in **Question 3**, but this was simply a confusion due to a lack of explanation, which we have now clarified.
We also respectfully disagree with the reviewer's statement that $M$ and $M^\star$ in our paper are the same in **Question 5**, and we have provided a detailed response in our rebuttal.
Furthermore, we feel that the ***reviewer overlooked one of our major contributions***, which is the development of the first computationally tractable information directed sampling (IDS) algorithm, ***a contribution appreciated by both of the other reviewers***, who mention "*The paper is with clear motivation to address the computational intractability of IDS algorithms for solving hard exploration problems (Reviewer MRFt)*" and "*the introduction of KSD magically makes it computationally tractable (Reviewer TJXq)*". We also provided a detailed experimental evaluation in the main body and appendix showing the improvement over all prior methods, which the reviewer appears to have missed entirely but which the other reviewers appreciated.
Lastly, we are concerned with the reviewer's comment in the weaknesses section that "I am not sure that the authors have an accurate understanding of the literature of IDS," as it lacks a concrete reason and contradicts the appreciation for the novelty of our approach expressed earlier in the review. We have provided a detailed discussion of the related work, with additional context, in Appendix A on page 13 of our paper.
We would appreciate your help in addressing these concerns with the reviewer, and we thank you for your consideration.
Sincerely,
Authors of Paper ID 2813
================================================================
## Response to Reviewer MRFt
We would like to thank the reviewer for dedicating their valuable time and effort towards reviewing our manuscript and acknowledging the motivation and contributions of our research. We deeply appreciate the feedback provided by the reviewer, which has helped us to strengthen the exposition of our paper, and we provide detailed responses to all of the reviewer's comments below.
> **Weakness 1:** From table 1, we see that the regret of STEERING has a higher dependency on S, H, and A than some of the previous algorithms. It makes me wonder about the theoretical advantage of STEERING compared to previous methods, especially PSRL.
**Response to Weakness 1:** This is a good point, and thank you for raising it. Let us recall Table 1 from the paper, where we list the regrets achieved by the different approaches.
**Table 1:** *This table lists the Bayesian regrets of the baseline approaches PSRL (Osband et al., 2013), PSRL2 (Osband & Van Roy, 2017a), and TSDE (Ouyang et al., 2017), and the information-theoretic Bayesian regrets of IDS-RL1 (Lu & Van Roy, 2019a) and IDS-RL2 (Hao & Lattimore, 2022).*
<div style="text-align:center">
| **Approaches** | **Regularization** | **Regret** | **Information Theoretic Bayesian Regret** |
|----------|----------|----------|----------|
| PSRL | No | $\tilde{ \mathcal{O}}( \sqrt{H^3S^2AK})$ | No |
| PSRL2 | No | $\tilde{\mathcal{O}}( \sqrt{H^3SAK})$ | No |
| TSDE | No | $\tilde{\mathcal{O}}( \sqrt{H^3S^2AK})$ | No |
| IDS-RL1 | Mutual Information | $\tilde{ \mathcal{O}}(\sqrt{H^3 SAK})$ | Yes |
| IDS-RL2 | Mutual Information | $\tilde{ \mathcal{O}}(\sqrt{H^4 S^3 A^2 K})$ | Yes |
| STEERING | Stein Information | $\tilde{ \mathcal{O}}(\sqrt{H^4 S^2A K})$ | Yes |
</div>
We will first elaborate on the theoretical benefits of STEERING, and then illustrate STEERING's advantage over PSRL through concrete examples.
***Theoretical benefit of STEERING:***
1. **The Information Theoretic Bayesian Regret Bound used in STEERING (which considers sparse feedback/rewards) is not directly comparable to the Bayesian Regret Bound used in PSRL, PSRL2, and TSDE (which do not consider sparse feedback/rewards)**. We note from Table 1 above that there is no information regularization in the first three methods (PSRL, PSRL2, and TSDE), and hence they only study Bayesian regrets. However, as motivated in the seminal works IDS-RL1 (Lu & Van Roy, 2019a) and IDS-RL2 (Hao & Lattimore, 2022), optimizing the Bayesian regret alone is insufficient for solving sequential decision-making problems, especially with sparse feedback. Therefore, the notion of *Information Theoretic Bayesian Regret* was utilized in IDS-RL1 and IDS-RL2 to characterize performance. Although not directly comparable, we still list the Bayesian Regret Bounds of PSRL, PSRL2, and TSDE to contrast with the Information-Theoretic Bayesian Regret Bounds of IDS-RL1, IDS-RL2, and STEERING for completeness.
2. **STEERING is prior-free IDS while IDS-RL1 considers a specific Dirichlet prior**. Among the information theoretic approaches (IDS-RL1, IDS-RL2, and STEERING), IDS-RL1 operates under a specific Dirichlet prior, which helps it obtain an improved regret bound. The first prior-free information theoretic Bayesian regret bound is reported in IDS-RL2 and is, as expected, worse than that of IDS-RL1 (Remark 3.5 in Hao & Lattimore, 2022).
3. **STEERING achieves a tighter bound than the state-of-the-art prior-free IDS method, IDS-RL2.** STEERING uses the Stein information gain as a regularization term, as compared to the mutual information gain in IDS-RL2, and obtains a tighter regret bound.
4. **STEERING is the first computationally tractable IDS approach.** Both IDS-RL1 and IDS-RL2 suffer from computational tractability issues. In contrast, STEERING is a novel computationally tractable IDS based approach, which uses the Stein information gain as a regularization term.
***Motivating example to highlight the benefit of STEERING or IDS based approaches as compared to PSRL:***
1. **IDS based approaches consider maximizing an extra information gain term, in addition to minimizing the regret term considered in PSRL.** To better understand the importance of IDS based approaches, we need to look at the policy selection procedure for both IDS and PSRL at each episode $k$, given by
\begin{align}
\pi^k=\arg\max_\pi V_{1,\pi}^{M_k}(s_1^k) \ \ \ \ \ \ \ \ \ \ (\text{for PSRL}),
\end{align}
and
\begin{align}
\pi^k=\arg\min_\pi{\Gamma_k(\pi, M):=\frac{\big(\mathbb E_k[V_{1,\pi^*}^{{M}}(s_1^k)-V_{1,\pi}^{M}(s_1^k)]\big)^2}{\mathbb I_k^{\pi}({M}; \mathcal{H}_{k, H})}} \ \ \ \ \ \ \ \ \ \ (\text{for IDS}).
\end{align}
Therefore, IDS based approaches focus on maximizing the information gain (via the ratio objective) in addition to minimizing the regret when selecting the policy at each episode $k$; a minimal sketch of this selection rule is provided after this list.
2. **A concrete example for an intuitive understanding of the benefit of IDS.** Let us consider the **Treasure hunt example** from Lu's thesis (Lu, X., 2020), where the environment has a binary tree structure of depth $d$, i.e., with $2^d$ leaves, only one of which contains a reward of $1$ while all other nodes give zero reward, hence the environment *exhibits a sparse reward structure*. In addition to observing sparse rewards, the agent sees a treasure map if it reaches the first leaf node, which indicates the location of the treasure. Based on the above explanation, it can be shown that UCB or Thompson sampling driven methods such as UCRL and PSRL only explore policies that are potentially high-valued without leveraging information directed exploration, thereby leading to suboptimal exploration and slower convergence. *In contrast, information directed approaches (such as IDS) not only consider high-valued policies but also value high-exploration policies by optimizing the information gain.* We establish this fact by showing exploration heatmaps on various sparse reward environments in Figure 5 (page 8), with detailed analysis in Figures 10-12 (page 23) of our paper.
3. ***Empirical validation of our claims:*** We would like to emphasize that, in contrast to existing PSRL or IDS based approaches, which are not computationally tractable, our proposed STEERING allows, for the first time, an extensive empirical validation of an IDS method.
- **3a.** For example, in Figure 2 on page 3, we establish for the first time that posterior sampling based approaches are the most suitable ones for the challenging practical setting of sparse rewards. Even PSRL, however, suffers considerably when the sparsity is high.
- **3b.** We then establish in Figure 3 that our proposed STEERING is able to overcome the difficulties faced by existing approaches and outperforms other methods even in sparse reward environments.
- **3c.** Further, we established the directed exploration characteristic of STEERING in Figure 5, and also showed its benefit in settings with a weak belief prior in Figure 6 on page 8. We expanded upon the experimental results in Appendix I on page 20 and have performed detailed ablation studies for the STEERING algorithm. ***This is only possible due to the computational tractability of STEERING, which is not achieved by any other algorithm in the literature.***
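To make the contrast above concrete, here is a minimal Python sketch (our illustration with hypothetical helper names and made-up numbers, not the paper's implementation) of how IDS-style policy selection differs from PSRL over a finite set of candidate policies:

```python
import numpy as np

# Minimal sketch (hypothetical helpers, not the paper's code): IDS-style policy
# selection minimizes the information ratio (expected regret)^2 / information
# gain, whereas PSRL simply maximizes the value under the sampled MDP M_k.

def select_policy_ids(candidates, expected_regret, info_gain, eps=1e-8):
    """Pick the candidate minimizing (E_k[V*_1 - V^pi_1])^2 / I_k^pi(M; H)."""
    ratios = [expected_regret[pi] ** 2 / (info_gain[pi] + eps) for pi in candidates]
    return candidates[int(np.argmin(ratios))]

def select_policy_psrl(candidates, value_under_Mk):
    """PSRL baseline: pick the candidate with the highest value under M_k."""
    return max(candidates, key=lambda pi: value_under_Mk[pi])

# Toy usage with made-up numbers: a high-exploration policy can win under IDS
# even when its expected regret is no smaller than that of a greedier policy.
candidates = ["pi_explore", "pi_greedy"]
expected_regret = {"pi_explore": 0.9, "pi_greedy": 0.8}
info_gain = {"pi_explore": 2.0, "pi_greedy": 0.5}
print(select_policy_ids(candidates, expected_regret, info_gain))  # pi_explore
```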
**New references:**
Lu, X., 2020. Information-directed sampling for reinforcement learning. Stanford University.
> **Weakness 2:** Specifically, this paper aims to address the computational intractability of IDS methods. But a formal statement is never given about when and why IDS is preferred over PSRL. As a consequence, from the sublinear regret analysis, it is hard to tell the unique advantage of STEERING compared with PSRL and UCRL.
**Response to Weakness 2:** This is a good point. Let us start by emphasizing that in IDS based approaches we study the information theoretic Bayesian regret, whereas in PSRL we only study the Bayesian regret (as detailed in our response to Weakness 1).
***When IDS is preferred over PSRL:*** This is an important point, which is exactly what we address in the results shown in Figure 2 in the main body of our paper. We note from Figure 2$(a)$ that most of the existing approaches work well in dense reward settings and PSRL works the best. But as we move towards sparse reward settings in Figure 2$(b)$, all of the existing techniques, including PSRL, suffer considerably and exhibit larger regret as compared to the dense settings. In Figure 2$(c)$, we further show that PSRL suffers greatly when the reward sparsity in the environment is high. Therefore, we need an approach that takes this issue into account and does not exhibit such performance degradation. This is where IDS based approaches come to the rescue, but they suffer from computational tractability issues, which we address in this work by developing a novel algorithm called STEERING.
***Advantages of IDS over PSRL/UCRL:*** Referring to the discussions in PSRL (Osband et al., 2013) and PSRL2 (Osband & Van Roy, 2017a), we know that posterior sampling and UCB based optimistic algorithms (UCRL) are the two broad classes of algorithms that are able to effectively balance the exploration-exploitation tradeoff and achieve near-optimal performance in various sequential decision-making tasks. However, because they all focus on value-function-driven optimality without a precise characterization of the information gain, they explore insufficiently in important practical scenarios such as sparse or delayed feedback (as established in Figure 1 on page 3 in the main body of the paper). Therefore, the Bayesian regret analysis of PSRL and UCRL fails to capture the information gained over the episodes, which is one of their major drawbacks.
IDS based approaches, via information gain based regularization, allow us to develop information theoretic Bayesian regret bounds, which are missing from the analysis of PSRL/UCRL. We would again like to refer back to the Treasure hunt example provided in the response to Weakness 1 above.
We thank the reviewer for raising this point and will update the paper with the additional explanations as required to expand upon the points mentioned above.
> **Weakness 3:** Line 55-56 is not clear to me. What is the "underlying target distribution"? Why focusing on the posterior variance will lead to insufficiently informative about such target distribution?
**Response to Weakness 3:** We apologize for the confusion here. The term "underlying target distribution" simply refers to the ***true transition dynamics***. The posterior variance is used as a lower bound on the information gain in practice to make IDS based approaches computationally tractable, as detailed in Limitation 2 (L2) on line 175 in the main body of the paper. For instance, consider a setting where we start with a strong belief prior (Figure 6 in our paper); then, since the variance is already low, an IDS based approach (implemented with the posterior variance lower bound for computational tractability) would not add any benefit on top of PSRL-based approaches.
In contrast, in this work, we introduce the novel notion of Stein information gain for which we don't need to rely on any lower bound (as in (Achiam & Sastry, 2017)). We utilize the ideas of kernel Stein discrepancy to evaluate Stein information gain directly as detailed in Section 3 in the paper.
> **Weakness 4:** I recommend the authors also discuss and compare (both theoretically and empirically) with UCRL. Can UCRL be applied to address the sparse reward MDPs? Since no formal statement of when IDS is preferred -- as the authors only mentioned that PSRL struggles if many trajectories are uninformative -- it is not clear if UCRL is better than vanilla PSRL in these uninformative cases.
**Response to Weakness 4:** We clarified the motivation for and importance of IDS over existing methods in our responses to Weaknesses 1 & 2. Even in UCRL or PSRL, the algorithm does not characterize the information gain over the episodes, which leads to undirected exploration, as explained in the example in the response to Weakness 2 above.
In order to validate this hypothesis, as suggested by the reviewer, we performed an additional experiment on the sparse grid-world problem (see New Table B below in the response to Weakness 5). We compared STEERING with the UCRL and UCBVI algorithms, and observe that UCBVI and UCRL underperform relative to PSRL. A similar finding has also been observed in (Figure 1, Tiapkin et al., 2022).
***Limitations of UCRL:*** We thank the reviewer for mentioning UCRL based algorithms, and we note that UCB based algorithms indeed incentivize exploration via the optimism in the face of uncertainty principle. However, they suffer from statistical limitations, as beautifully detailed in the seminal paper by Osband & Van Roy (2017a). They mention that one of the major challenges in UCB driven approaches lies in the sub-optimal construction of the confidence sets $\mathcal{M}_k$, which is extremely critical for UCB based algorithms. The construction of such optimal confidence sets requires designing complex concentration inequalities, which can be derived in theory but are computationally intractable in practice. More specifically, in Section 4 of Osband & Van Roy (2017a), they show that the confidence sets suggested by the state-of-the-art UCRL algorithm become miscalibrated as the cardinality of the state space increases, as can also be observed in Figure 3 of Osband & Van Roy (2017a).
In addition to the above limitation of computational tractability, UCRL and even PSRL based algorithms cannot characterize the information gain, and hence we leverage IDS based approaches for our sparse scenarios.
**New references:**
Tiapkin, Daniil, et al. "Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees." Advances in Neural Information Processing Systems, 2022.
> **Weakness 5:** Since STEERING is claimed to work well in sparse reward tabular settings, can the authors also provide additional experimental evaluations in the Maze or Four Room [1] environments, which are classical sparse-reward tabular tasks to measure the exploration ability?
**Response to Weakness 5:** We thank the reviewer for pointing this out, and we have now compared the proposed algorithm STEERING with other algorithms (including UCRL) on the maze environment. We report the results in *New Table B*, which reports regret values (hence a lower number is better). We note that the proposed STEERING achieves the lowest regret.
***New Table B**. Comparison of algorithm performance in terms of regret incurred after varying numbers of iterations, for selected reinforcement learning algorithms including UCBVI, UCRL, RLSVI, PSRL, IDS, and STEERING (lower is better).* We will add a figure version of this table in the final version.
| Algorithms | Iter-500 | Iter-2000 | Iter-4000 | Iter-6000 | Iter-8000 |
|--------------|----------|-----------|-----------|-----------|-----------|
| UCBVI | 3000 | 37590 | 56560 | 65500 | 68900 |
| UCRL | 2500 | 29000 | 32500 | 34780 | 34570 |
| RLSVI | 2350 | 18000 | 23050 | 28790 | 34500 |
| PSRL | 2290 | 9700 | 11200 | 13100 | 15000 |
| IDS | 2300 | 8600 | 9500 | 12100 | 13100 |
| STEERING (proposed) | 2500 | 7600 | 9500 | 10050 | 11200 |
**New experiment details:** To obtain *New Table B*, we run all the mentioned algorithms on a sparse grid-world environment with $100$ states and $4$ actions, for $H = 50$ and transition noise $0.2$. In this problem, an agent starts at position $(1, 1)$ in a grid world. When taking an action, the agent moves in the corresponding direction with probability $1 − \epsilon$, and moves to a randomly chosen neighbor state with probability $\epsilon$. The reward is $1$ if the agent reaches the state $(10, 10)$ and $0$ otherwise (*hence the reward is extremely sparse*). From the results in *New Table B*, we note that UCRL is not better than even vanilla PSRL, and STEERING outperforms all the other baselines. More specifically, the performance of STEERING is better than that of IDS and PSRL on the sparse-reward gridworld environment.
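For reference, the dynamics described above can be summarized by the following minimal Python sketch (an illustrative approximation of the setup, not the exact experimental code; in particular, the noise here is modeled as a random direction rather than a random neighbor):

```python
import numpy as np

ACTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}  # up, down, right, left

def step(state, action, eps=0.2, size=10, goal=(10, 10), rng=np.random):
    """One transition of the 10x10 sparse grid world (1-indexed coordinates)."""
    if rng.random() < eps:                       # transition noise
        action = rng.randint(4)
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 1), size)         # clip to the grid
    y = min(max(state[1] + dy, 1), size)
    next_state = (x, y)
    reward = 1.0 if next_state == goal else 0.0  # reward only at the goal
    return next_state, reward

# Toy usage: one episode of length H = 50 from the start state (1, 1).
s, total_reward = (1, 1), 0.0
for _ in range(50):
    s, r = step(s, action=np.random.randint(4))
    total_reward += r
print(total_reward)
```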
> **Question 1:** How is IDS with information gain-based exploration implemented in practice? Do they all base on optimizing the evidence lower-bound?
**Response to Question 1:** We thank the reviewer for this important observation and question. As detailed in Section 2.1 in the main body of our paper, it is challenging to estimate the information gain for practical implementation. This is essentially the major motivation for our work which led to the development of a computationally tractable approach in this paper.
The reviewer correctly pointed out that the existing IDS based approaches (Hao & Lattimore, 2022; Lu & Van Roy, 2019b) approximate the information gain by its lower bound via Pinsker's inequality, i.e.,
$$\mathbb E_{k}[D_{KL}(P^M(\cdot|s_h^k,a_h^k)||{P}^ {\bar{M}_k}(\cdot|s_h^k,a_h^k))] \geq \sum_{s'} \text{Var}(P^M(s'|s_h^k,a_h^k)).$$ However, [3] shows that any high-confidence lower bound requires a number of samples exponential in the mutual information, which is a critical concern from a practical standpoint and is not addressed properly in the prior literature on IDS. Hence, there has been a ***significant gap between theory and practice*** for the existing IDS based approaches, which we try to ***overcome through our research*** in this paper.
> **Question 2:** Can the authors provide intuition behind their point selection strategy? This can provide more insights for readers who are not familiar with SPMCMC.
**Response to Question 2:** We thank the reviewer for raising this important point. As described in Algorithm 1, STEERING interacts with the environment in Step 6 and collects the data in $\mathcal{C}$. After collecting the state-action pair data in $\mathcal{C}$, we perform an intelligent point-selection strategy called SPMCMC, where we select the most informative samples from $\mathcal{C}$ to represent the transition dynamics instead of relying on the entire set.
***Intuition behind point selection:*** We remark that by utilizing an intelligent point selection method (as mentioned in the proof of Lemma 4.2 in Appendix E), we achieve better exploration and tighter bounds, as previously demonstrated in different contexts by Koppel et al. (2021) and Chen et al. (2019). Specifically, at each episode $k$, our SPMCMC based point selection approach uses the objective $\inf_{(x,y)} \sum_{(x_i, y_i) \in {\mathcal{D}}_{k-1}}\kappa_M((x_i, y_i),(x, y))$ to choose the new sample $(x, y)$ as the point that minimizes the similarity (in the RKHS spanned by the Stein kernel, denoted by $\kappa_M(\cdot, \cdot)$) to the current samples in the dictionary ${\mathcal{D}}_{k-1}$, resulting in directed exploration.
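A minimal Python sketch of this greedy selection rule is given below (our illustration under simplifying assumptions: a generic RBF kernel stands in for the Stein kernel $\kappa_M$, and points are plain numeric tuples):

```python
import numpy as np

def rbf_kernel(u, v, lengthscale=1.0):
    """Placeholder similarity kernel standing in for the Stein kernel kappa_M."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * lengthscale ** 2))

def select_most_informative(candidates, dictionary, kappa=rbf_kernel):
    """Greedy SPMCMC-style rule: return the candidate (x, y) minimizing
    sum over (x_i, y_i) in the dictionary of kappa((x_i, y_i), (x, y))."""
    if not dictionary:                       # empty dictionary: any point is novel
        return candidates[0]
    scores = [sum(kappa(z, c) for z in dictionary) for c in candidates]
    return candidates[int(np.argmin(scores))]

# Toy usage: the point far from the current dictionary is preferred.
dictionary = [(1.0, 1.0), (1.2, 0.9)]
candidates = [(1.1, 1.0), (5.0, 5.0)]
print(select_most_informative(candidates, dictionary))  # (5.0, 5.0)
```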
This is an important point, and we will further expand upon its discussion in our final revised version to provide more insights to the readers.
## Response to Reviewer LqGz
Thank you for taking the time to provide feedback on our paper. We appreciate your comments and provide detailed response to your queries as follows.
> **Weakness:** I am not sure that the authors have an accurate understanding of the literature of IDS.
**Response to Weakness:** We thank the reviewer for the feedback. We would like to point out that we have conducted a thorough review of the relevant literature in the field and have included a detailed discussion of related work in the introduction of our paper. We also have included an extended discussion of related works in Appendix A on page 13. However, we welcome any specific suggestions the reviewer may have to help us improve our understanding and analysis of the IDS literature.
> **Question 1:** Line 86 implies that the paper focuses on time-homogeneous MDPs. If so, the comparisons of regret bounds in Table 1 are a bit unfair since some regret bounds there are proved for time-inhomogeneous MDPs, e.g., the one for IDS-RL2.
**Response to Question 1:** The reviewer is correct that we focus on the time-homogeneous setting. We apologize for omitting this discussion around Table 1 in the main body of the paper. However, we would like to remark that all the references mentioned in Table 1 consider the time-homogeneous setting except IDS-RL2 (Hao & Lattimore, 2022). We have carefully checked and note that the same regret expression we mention in Table 1 for IDS-RL2 also holds for the time-homogeneous version of IDS-RL2. We will clarify this in the revised final version of our paper. Hence, our comparison in Table 1 is fair. We are happy to address any additional discrepancies.
> **Question 2:** The definition of $\pi^*$ in (2) is not accurate. Since RHS depends on episode, LHS should also depends on $i$
**Response to Question 2:** This is a typo. The rectified version of the equation is
\begin{align}
\pi^\star:=\arg\max_{\pi} V^{M}_{\pi, 1}(s).
\end{align}
Thanks for pointing this out; we will update it in the revised version.
>**Question 3:** The formula of information gain on line 127 is incorrect. Here the information gain is the one given history's realization instead of conditional information gain.
**Response to Question 3:** We apologize for the confusion raised by our expression and loose notation. What we meant to write is the information gain given by
$$\mathbb{I}_{k}^{\pi}(M; \mathcal{H}_{k, h}) = D_{KL}( \mathbb P_{k,\pi}((M, H_{k, h})\in \cdot|D_k)||\mathbb P_{k,\pi}(M=\cdot|D_k)\otimes\mathbb P_{k,\pi}(H_{k, h}=\cdot|D_k)).$$We have used the above expression only in our analysis, which is the same as used in (Hao & Lattimore, 2022).
We would like to emphasize that even if our sentence regarding the information gain in terms of the reduction in entropy was confusing, it does not affect any of the results in our paper. We thank the reviewer for pointing this out and will update it in the revised final version.
>**Question 4:** On line 143, the authors describe what two expectations are taken with respect to, but it is unclear. Could the authors elaborate more on it?
**Response to Question 4:** Let us consider the expression on line 143, which is given by
$$\mathbb{I}_k^{\pi}\left({M}; \mathcal{H}_{k, H}\right)=\sum_{h=1}^H \mathbb{E}_{k}\Big[\mathbb{E}_{\pi}^{\overline{M}_k}\Big[D_{\text{KL}}\Big(P^M(\cdot|s_h^k,a_h^k)||P^{\overline{M}_k}(\cdot|s_h^k,a_h^k)\Big)\Big]\Big].$$ In the above expression (which is exactly same as given in Lemma A.1 in Hao & Lattimore, 2022 and explained in the proof sketch of Lemma 3.2), $M$ denotes the environment MDP and $\overline{M}_k$ denotes the corresponding mean of the posterior $\Phi(\cdot|\mathcal{D}_k)$ over MDPs.
We apologize for any confusion caused by the statement in the paper that *"$\mathbb{E}_{\pi}^{\overline{M}_k}$ is taken with respect to $s_h^k,a_h^k$, and $\mathbb{E}_{k}$ is with respect to $\pi$ and environment $M$."*
First, we remark that the information gain expression provided above holds for any policy $\pi$. The expectation $\mathbb{E}_{\pi}^{\overline{M}_k}$ is with respect to the state-action trajectories we collect by following policy $\pi$ under the environment $\overline{M}_k$, and $\mathbb{E}_{k}$ is with respect to the posterior over $M$ at episode $k$.
We thank the reviewer for pointing this out and we will make it explicitly clear in the final version of our paper.
>**Question 5:** What are the difference between the underlying MDP $M$ and the true MDP $M^*$? To me, the paper focuses on the Bayesian setting, and thus they are the same. If my understanding is correct, I cannot understand Stein information gain defined in (8). Also I don't understand the meaning of directed exploration illustrated in Figure 1.
**Response to Question 5:** We thank the reviewer for this point, and apologies for any confusion. We provide a detailed answer below.
* ***$M$ and $M^\star$ are not the same:*** Let us recall the definition of the information gain at episode $k$ (cf. Equation (5) in our paper), from Lemma A.1 of Hao and Lattimore (2022), given by
\begin{align}
\mathbb{I}_k^{\pi}\left({M}; \mathcal{H}_{k, H}\right)=\sum_{h=1}^H \mathbb E_{k}\left[\mathbb E_{\pi}^{\overline M_k}\left[D_{KL}\left(P^{M}(\cdot|s_h,a_h)||P^{\overline{M}_k}(\cdot|s_h,a_h)\right)\right]\right],
\end{align}where $M$ follows the posterior distribution $\phi(\cdot~|\mathcal{D}_k)$. ***In our paper, $M^\star$ denotes*** the MDP drawn from the prior; it is the true MDP from which we collect the trajectory data at each episode, and the posterior will eventually concentrate on $M^\star$. ***Therefore, in our paper, $M$ and $M^\star$ are not the same.*** Alternatively, we also refer to $M^\star$ as the true dynamics of the environment, from which we collect the data to update our posterior estimate of $M$.
* ***Clarification regarding Figure 1:*** In Figure 1 on page 1 of the paper, we show the true MDP as a fixed $M^\star$ on the x-axis. The y-axis plots the posterior distribution $\Phi(\cdot|\mathcal{D}_k)$ of $M$ at episode $k$, from which we sample $M_k$ at episode $k$ to determine the policy $\pi^k$.
* ***Concept of Stein information gain:*** To understand the concept of Stein information gain, let us consider the definition of information gain mentioned above (Equation (5) in our paper and Lemma A.1 from Hao and Lattimore (2022)), where the KL divergence between $P^M$ and $P^{\overline{M}_k}$ is computationally intractable to estimate in the general setting. In the existing literature, a lower bound via Pinsker's inequality is used to deal with the computational issues, but it boils down to just considering the posterior variance. However, in our case, we propose to consider the *Stein information gain* instead, given by \begin{align}
\mathbb{K}_k^{\pi}\left({M}; \mathcal{H}_{k, H}\right)=\sum_{h=1}^H \mathbb E_{k}\Big[\mathbb E_{\pi}^{{M}^\star}\left[KSD\left({P}^{M}(\cdot|s_h^k,a_h^k), {P}^{{M}^\star}(\cdot|s_h^k,a_h^k)\right)\right]\Big],
\end{align}where we evaluate the kernel Stein discrepancy (KSD) with respect to the fixed $M^\star$ from which data is collected (*hence introducing directed exploration towards the true dynamics*). This is interesting because, by utilizing the power of Stein's method, we can compute the Stein information gain directly rather than resorting to approximate lower bounds to develop information directed approaches (see the illustrative estimator after this list). Note that this helps drive exploration towards $M^\star$, rather than only defining uncertainty in terms of the current posterior variance of $M$, as is common in the prior art.
* ***Directed exploration connection:*** We call it directed exploration because the ultimate objective of any model based RL method is to learn the true unknown MDP $M^\star$, around which the posterior eventually concentrates. Owing to the power of Stein's method, we can estimate the KSD in closed form with respect to the true dynamics and introduce an exploration bonus that moves our model estimates towards $M^\star$, since the samples are drawn from $M^\star$.
* ***We support our claims via extensive empirical evaluations:*** Our claims are supported by an extensive empirical evaluation on several environments in the main body of our paper (pages 7-8) and a detailed ablation in the appendix (pages 22-29), which shows the benefit of directed exploration. For example, refer to Figure 5 on page 8, where we plot heatmaps showing the distribution of state-action visits in the initial and final episodes of training for PSRL, IDS, and STEERING. We show that our algorithm can effectively identify and visit the most important states in a much more efficient way in comparison to PSRL and IDS.
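For concreteness, the KSD term above admits the standard sample-based estimator from the KSD literature (stated here in our notation purely as an illustration; $\kappa_{M^\star}$ denotes the Stein kernel induced by ${P}^{{M}^\star}(\cdot|s,a)$, and $x_1,\dots,x_n$ are the samples used in the comparison at the pair $(s,a)$):
\begin{align}
\widehat{KSD}^2\big({P}^{M}(\cdot|s,a),\,{P}^{{M}^\star}(\cdot|s,a)\big)=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\kappa_{M^\star}(x_i,x_j),
\end{align}
which requires only kernel (and score) evaluations rather than density ratios or variational lower bounds; this is what makes the Stein information gain directly computable.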
>**Question 6:** Could the authors first intuitively explain why the proposed IDS has better regret bound than those in the literature? Then technically speaking, is the uniform bound of Stein information ratio better? Or is the bound of cumulative Stein information gain is better? What make one of them or both better?
**Response to Question 6:** We thank the reviewer for this point. We address it in detail as follows.
* First, we want to emphasize that when we mention in the abstract that STEERING achieves an improved regret as compared to prior approaches, we do not mean that we improve the regret achieved by IDS-RL2. Our regret is better with the proposed *Stein information gain* as the regularization, while IDS-RL2 utilizes the standard information gain as the regularization term. Therefore, the IDS techniques in IDS-RL1, IDS-RL2, and STEERING differ in the regularization used but serve the same purpose.
* Our major motivation for proposing the Stein information gain (rather than using the traditional information gain) is to resolve the computational intractability of IDS based approaches in the existing literature. The existing methods are computationally intractable, and the lower bound typically used requires exponential samples for a good estimate, as detailed in Ozair et al. (2019). Hence, the utilization of a modified notion of information gain helps to obtain better information theoretic regret bounds. Moreover, we have shown with extensive experiments in the main body (Figures 3-7) and appendix (Figures 10-22) of the paper that the proposed notion of Stein information gain induces directed exploration and is able to converge to the desired solution faster.
* Technically speaking, when we decompose the information theoretic regret into the product of two terms in the statement of Theorem 4.1 (Equation (14)), it is the second term that is associated with the Stein information gain. We are able to bound it by a tighter upper bound using machinery from the convergence analysis of the kernelized Stein discrepancy (Lemma 4.3), which also constitutes one of the novel contributions of this work. We have also discussed further insights into the improved regret in Remark 1 on page 7 in the main body of the paper.
## Response to Reviewer TJXq
We are grateful to the reviewer for the time and effort to thoroughly review our manuscript. We appreciate the reviewer's valuable feedback and comments. We would like to acknowledge the reviewer for identifying the major motivation and contribution of our work, which is addressing the issue of computational intractability in IDS based approaches in a novel manner. We are pleased that the reviewer recognized the significance and relevance of this contribution to the field of RL.
> **Weakness 1:** One major concern for the proposed method is scalability. The analysis and the experiments are for tabular setup. But it doesn’t seem to indicate any way to extend it to more general setup neither theoretically nor experimentally. Is there any way to implement the method beyond tabular setup?
**Response to Weakness 1:** Thank you for your insightful comment regarding the scalability of our approach in continuous settings. In this paper, our primary objective is to provide a comprehensive theoretical and empirical evaluation of IDS in the tabular setting, and demonstrate its effectiveness compared to other state-of-the-art exploration methods. Prior to this study, a detailed empirical investigation of IDS even in the tabular setting was missing from existing literature.
However, since our analysis relies upon computing the kernelized Stein discrepancy (KSD) to quantify and characterize the Stein information, we are optimistic about extending our method to continuous state spaces. Under certain smoothness assumptions on the transition dynamics, the KSD can be used to model the dynamics of the system in continuous state spaces, as demonstrated by recent literature (Koppel et al., 2021; Chen et al., 2019; Chakraborty et al., 2022c).
> **Question 1:** Is there any way to implement the method beyond tabular setup?
**Response to Question 1:** This is an interesting question. By leveraging the insights from tabular settings, we can compute the divergence and quantify the Stein information gain in continuous settings. However, a precise theoretical characterization of IDS with KSD under continuous state and continuous action space MDPs is not straightforward, and is a point of our current research. Importantly, none of the past IDS methods have been able to provide theoretical bounds for such settings (to the best of our knowledge). Nonetheless, we believe that our approach can serve as a starting point towards providing initial theoretical bounds for IDS in continuous state and action spaces.
> **Minor Comment:** In line 131, second column, it should be arg min, right?
**Response to Minor comment:** Yes, good catch, we will correct this typo in the revised final version.
------------------------
## Discussion Phase
---------------------------------
## Response to Reviewer MRFt
> **Comment:** Thanks for the author's effort in writing the detailed replies. I am satisfied with the authors' response. This addresses most of my previous concerns.
**Response to Comment:** We are thankful to the reviewer for appreciating our efforts and happy that our responses have addressed most of the reviewer's concerns. We are grateful to the reviewer for their invaluable input and thoughtful questioning, which have helped us to strengthen the quality and comprehensiveness of our paper. The incorporation of additional details arising from the reviewer's inquiries would definitely provide more insights to the readers.
> **Question 1:** It seems not clear to me if and why including the Stein information gain in the Bayes regret bound can conclude the advantage in sparse-reward settings. What is the uniqueness compared with previous non-information-theoretic Bayes regret such as the ones established in PSRL, which doesn't distinguish between sparse and non-sparse settings? Is it possible to provide some more explicit result that depends on some measure of sparsity? It seems that we can only directly compare different algorithms in sparse and non-sparse settings as in experiments, e.g. Figure 2.
**Response to Question 1:** Thank you for this interesting and insightful question about the advantage of the information directed augmentation specifically in the sparse reward setting and its comparison with non-information theoretic approaches like PSRL.
***Uniqueness:*** We note that we not only include the Stein information gain in the Bayes regret bound, but also modify the way we select the policy at each episode as compared to PSRL. To further clarify, one of the key differences between information theoretic approaches (such as IDS and STEERING) and non-information theoretic approaches (such as PSRL) lies in the policy optimization step of the algorithm at each episode $k$. The PSRL method only considers value function based optimality (which depends on rewards) to select the policy for data collection at each episode, given by
\begin{align}
\pi^k=\arg\max_\pi V_{1,\pi}^{M_k}(s_1^k) \ \ \ \ \ \ \ \ \ \ (\text{for PSRL})\tag{E.1}
\end{align}
On the other hand, STEERING considers the following problem for the policy optimization at each episode $k$ given by
\begin{align}
\pi^k=\arg\min_\pi{\Bigg[\frac{\big(\mathbb E_k[V_{1,\pi^*}^{{M}}(s_1^k)-V_{1,\pi}^{M}(s_1^k)]\big)^2}{\mathbb{K}_k^{\pi}({M}; \mathcal{H}_{k, H})}\Bigg]} \ \ \ \ \ \ \ \ \ \ (\text{for STEERING}),\tag{E.2}
\end{align}
We remark that the objective for IDS (STEERING) in Equation (E.2) takes into account not only the optimality of the value function (which depends on rewards), by measuring regret in the numerator, but also the coverage of the state space, by incorporating the Stein information gain term in the denominator (which is independent of rewards and depends on the coverage of the space).
***Connection to Sparsity:*** We agree with the reviewer's insightful observation regarding the connection between the regret of IDS and reward sparsity. We concede that establishing a mathematical characterization of sparsity is a hard open problem in general. We emphasize that our research in this paper, for the first time, establishes a connection between the benefits of IDS approaches and sparse reward settings, with an empirical justification illustrated in Figure 2. This result served as motivation to further study IDS based approaches ***and to develop a computationally tractable solution as the main goal of this work***. We consider the connection to reward sparsity a valid direction for our future research.
***Intuitive Illustration of the Connection of Sparsity to IDS based approaches:*** We explain how an information directed approach can help specifically in sparse reward scenarios through an illustration.
We note that a critical step for both PSRL and IDS based approaches is the policy optimization (mentioned in Equation (E.1) and (E.2)). In this step, we select our policy for episode $k$ to interact with the environment and collect state action pair data in order to update our posterior estimates. This policy optimization step is essentially selecting the best policy amongst candidate policies via optimizing the mentioned objective.
Now let's consider a Sparse reward navigation problem in a Gridworld where the agent only receives a reward of $1$ upon reaching the goal and zero otherwise.
To understand the benefit of the approach in Equation (E.2) over that in Equation (E.1), let us assume that we have two policies ($\pi_1$ and $\pi_2$) to choose from at episode $k$. Policy $\pi_1$ heavily explores the state space without visiting the goal state, while policy $\pi_2$ is almost stuck in a close neighborhood of the starting state $s_1$. Due to the reward sparsity, both policies receive zero reward. The challenge in this setting is to identify the better policy, which would help us reach the goal. We can agree that in this case we should select policy $\pi_1$, but the challenge is how to do that in practice and incorporate this understanding into the algorithm.
This is where IDS based approaches (such as STEERING) come to the rescue. A traditional value-function driven regret is not enough to differentiate between these non-goal-reaching policies, as it fails to account for state-space coverage in the objective and only focuses on the cumulative reward, which is zero in the above example for both policies. To overcome this issue, we use the concept of information-theoretic regret, which considers the state-space coverage of both policies in terms of the Stein information gain.
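As a purely illustrative calculation with hypothetical numbers (not taken from the paper): suppose at episode $k$ both policies attain the same squared expected regret, say $\big(\mathbb E_k[V_{1,\pi^*}^{{M}}(s_1^k)-V_{1,\pi}^{M}(s_1^k)]\big)^2 = 1$ (both collect zero reward), while the exploratory policy achieves $\mathbb{K}_k^{\pi_1}({M}; \mathcal{H}_{k, H}) = 2$ and the stuck policy only $\mathbb{K}_k^{\pi_2}({M}; \mathcal{H}_{k, H}) = 0.1$. Then
\begin{align}
\frac{1}{2}=0.5 \ < \ \frac{1}{0.1}=10,
\end{align}
so the objective in Equation (E.2) selects $\pi_1$, whereas the objective in Equation (E.1) cannot distinguish the two policies since both have zero estimated value.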
We observe the above described behavior in Figure **[MENTION THE FIGURE]**, where we note that exploration is better in both IDS based methods as compared to PSRL.
We again thank the reviewer for their interest and insights about our work. We are happy to engage further and clarify any doubts.
-----------------------------------------
## Response to Reviewer LqGz
> **Question 1:** Thanks the authors for the responses. As mentioned in the rebuttal before this one, $M^*$ is a random variable drawn from the prior, so the inner expectation in Stein information gain is taken with respect to the prior over $M^*$?
**Response:** We apologize for the confusion. By drawn we do not mean that we sample $M^\star$ from the prior; we only mean that $M^\star$ lies in the support of the prior and is realizable. We emphasize that $M^\star$ is the true, unknown ground-truth model (which is fixed throughout the algorithm's learning iterates) from which we collect the state-action trajectory data. Alternatively, we note that $M^\star$ is the MDP model around which our posterior would concentrate as $k\rightarrow\infty$. For the approach to work, we just need the standard assumption that $M^\star$ lies in the support of the prior (which we intend to show in Figure 1 in the paper). We apologize if our earlier explanation led to any confusion.
***Details of Stein information gain:*** Let us recall the definition of *Stein information gain* as follows
\begin{align}
\mathbb{K}_k^{\pi}\left({M}; \mathcal{H}_{k, H}\right)=\sum_{h=1}^H \mathbb E_{k}\Big[\mathbb E_{\pi}^{{M}^\star}\left[KSD\left({P}^{M}(\cdot|s_h^k,a_h^k), {P}^{{M}^\star}(\cdot|s_h^k,a_h^k)\right)\right]\Big]\tag{E.1}
\end{align}
We note that the Stein information gain expression provided above holds for any policy $\pi$. We make another important remark: $M^\star$ is fixed for all episodes $k$. The inner expectation $\mathbb{E}_{\pi}^{{M}^\star}$ is with respect to the state-action trajectories we collect by interacting with the environment (which is $M^\star$ and is fixed) following policy $\pi$. The outer expectation $\mathbb{E}_{k}$ is with respect to the posterior over $M$ at episode $k$.
We remark that the above definition directly parallels the standard information gain expression given by
\begin{align}
\mathbb{I}_k^{\pi}\left({M}; \mathcal{H}_{k, H}\right)=\sum_{h=1}^H \mathbb E_{k}\left[\mathbb E_{\pi}^{\overline M_k}\left[D_{KL}\left(P^{M}(\cdot|s_h,a_h)||P^{\overline{M}_k}(\cdot|s_h,a_h)\right)\right]\right], \tag{E.2}
\end{align}
but with $\overline{M}_k$ (which is the mean of the posterior at episode $k$ and is also fixed at $k$) replaced by $M^\star$, and the KL divergence replaced by the kernelized Stein discrepancy (KSD).
> **Question 2:** If this understanding is correct, then for any realization of $M^*$, you need to apply the policy $\pi$ to each to generate the trajectory $\{s_h^k,a_h^k\}_{h=1,\ldots,H}$, and thus the computational complexity of calculating Stein information gain is much heavier than that of the canonical information gain?
***Response to Question 2:*** We respectfully disagree with the reviewer; this is not the case in our setting. We do not need to take an expectation with respect to $M^\star$ because it is fixed at episode $k$. This is in line with the standard definition of information gain in Equation (E.2), where $\overline{M}_k$ is fixed. For the inner expectation, we just need to compute it with respect to the state-action trajectories we collect by using $\pi$.
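To illustrate this point, here is a minimal Python sketch (our illustration with a hypothetical `ksd` helper, not the paper's code) of how the Stein information gain in (E.1) could be estimated: one trajectory batch collected under $\pi$ in the real environment suffices for the inner expectation, and the outer expectation is a Monte Carlo average over posterior samples of $M$, with no sampling over $M^\star$.

```python
def stein_information_gain(posterior_samples, trajectory, ksd):
    """Estimate (E.1): average over posterior draws of M (outer expectation)
    of the summed KSD along one trajectory collected under pi in M_star.
    `ksd(M, s, a)` is a hypothetical helper returning
    KSD(P^M(.|s, a), P^{M_star}(.|s, a))."""
    total = 0.0
    for M in posterior_samples:        # E_k: Monte Carlo over the posterior of M
        for (s, a) in trajectory:      # empirical version of sum_h E_pi^{M_star}[...]
            total += ksd(M, s, a)
    return total / max(len(posterior_samples), 1)

# Toy usage with dummy placeholders: three posterior draws, a length-2 trajectory,
# and a constant ksd; note that M_star itself is never sampled.
posterior_samples = ["M_1", "M_2", "M_3"]
trajectory = [((1, 1), 0), ((1, 2), 3)]
print(stein_information_gain(posterior_samples, trajectory, ksd=lambda M, s, a: 0.5))  # 1.0
```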
We are thankful to the reviewer for their time. We are happy to engage further and clarify any additional doubts.
[**OPTIONAL**] ***Additional Comment Regarding the Point of Departure in Our Paper:*** In relation to the existing state-of-the-art result by Hao \& Lattimore (2022), we emphasize that our definition of Bayesian regret is exactly the same as the one used in Hao \& Lattimore (2022). To highlight the point of departure from the existing analysis of IDS based approaches, let us consider the first step of the proof sketch mentioned in the proof of Lemma 3.2 in Hao \& Lattimore (2022), where we can write
\begin{align}
\mathbb E_k\left[V_{1,\pi^*}^{M}(s_1^k)-V_{1,\pi^k}^{M}(s_1^k)\right]= \underbrace{\mathbb E_k\left[V_{1,\pi^*}^{M}(s_1^k)-V_{1,\pi^k}^{{\overline{M}_k}}(s_1^k)\right]}_{I_1} + \underbrace{\mathbb E_k\left[V_{1,\pi^k}^{{\overline{M}_k}}(s_1^k)-V_{1,\pi^k}^{M}(s_1^k)\right]}_{I_2}\,,
\end{align}
where the $\mathbb E_k\left[V_{1,\pi^k}^{{\overline{M}_k}}(s_1^k)\right]$ term is added and subtracted to obtain the right-hand side. In other words, the regret expression at episode $k$ on the left-hand side is decomposed around the mean ${\overline{M}_k}$ of the posterior for the analysis. In contrast, in our paper, we obtain the following:
\begin{align}
\mathbb E_k\left[V_{1,\pi^*}^{M}(s_1^k)-V_{1,\pi^k}^{M}(s_1^k)\right]= \underbrace{\mathbb E_k\left[V_{1,\pi^*}^{M}(s_1^k)-V_{1,\pi^k}^{{M}^\star}(s_1^k)\right]}_{I_1} + \underbrace{\mathbb E_k\left[V_{1,\pi^k}^{{M}^\star}(s_1^k)-V_{1,\pi^k}^{M}(s_1^k)\right]}_{I_2}\,.
\end{align}
where the $\mathbb E_k\left[V_{1,\pi^k}^{{{M}^\star}}(s_1^k)\right]$ term is added and subtracted to obtain the right-hand side. In other words, in our setting, we decompose the regret expression at episode $k$ on the left-hand side around the true fixed model ${{M}^\star}$. We bound both terms utilizing tools from the kernelized Stein discrepancy literature.