> Question 1: The complexity of MDP is $S^2A$, while Lemma 4.3 shows that the order of total Stein information gain is at most $S$ and independent of $A$? Can you intuitively explain why such result could hold?
**Response to Question 1:** We apologize for missing this comment; it was overlooked because it somehow did not appear in the usual discussion thread. Thank you for the reminder. Intuitively, this bound arises because we upper bound the Stein information gain via the kernelized Stein discrepancy based analysis (see Lemma 4.2). Interestingly, even for the standard information gain, one can revisit the upper bound in Lemma 20 of Lu & Roy (2019), which is of order $\mathcal{O}\left(S^2A\log(1+T/(SA))\right)$, where $T$ denotes the total number of samples. There, one can trade the dependence on $SA$ by using the upper bound $\log(1+T/(SA)) \leq T/(SA)$ to obtain $\mathcal{O}(ST)$, which is independent of $A$ but now linear in $T$. The proposed Stein based method, by contrast, utilizes an intelligent point selection scheme (SPMCMC) to improve the dependence on the number of samples.
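For completeness, the arithmetic behind this trade-off can be displayed explicitly (with $S$, $A$, $T$ as above, and using $\log(1+x) \leq x$ for $x \geq 0$):

$$
\mathcal{O}\!\left(S^2 A \,\log\!\left(1 + \frac{T}{SA}\right)\right)
\;\leq\;
\mathcal{O}\!\left(S^2 A \cdot \frac{T}{SA}\right)
\;=\;
\mathcal{O}(ST).
$$

The factor $SA$ cancels inside the product, which is why the resulting bound is independent of $A$ but linear in $T$.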
> Question 2: Thank you for the confirmation. Now I understand that STEERING requires a simulator that can simulate the unknown true environment, which is not required by online RL algorithms, e.g., UCRL, PSRL, canonical IDS, that can only learn the environment by sequentially interacting with the unknown environment. Hence, I think it is very unfair to compare STEERING with pure online RL algorithm. Furthermore, if you already have a simulator that can do planning and others under the true environment, you don't even need STEERING to learn the environment.
**Response to Question 2:** Dear reviewer, ***there is a serious misunderstanding of our paper and results here.*** We have been trying to convey that STEERING interacts with the environment only to obtain samples, exactly as in the online setting. ***So there is no unfairness.***
We emphasize that ***STEERING is indeed an online RL algorithm***: just like UCRL, PSRL, and canonical IDS, it interacts with the unknown environment, and it clearly does not know $M^\star$. To experiment with any online RL method, one needs a simulator to provide an environment for running empirical experiments. We provided the code snippet in the previous response only to show how the interaction with the real environment is implemented in an RL experiment (as in any RL experiment), since the reviewer asked how samples are obtained from the true unknown environment. We hope this discussion helps clear up the misunderstanding.
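To make the interaction model concrete, here is a minimal, hypothetical sketch of the online interaction loop common to all such algorithms (UCRL, PSRL, IDS, and STEERING alike). The class and function names are illustrative, not from our codebase; the key point is that the agent only observes transition tuples and never reads the environment's internal dynamics:

```python
import random


class UnknownEnv:
    """Stands in for the true MDP M*; the agent never inspects its internals."""

    def __init__(self, n_states=3, seed=0):
        self.rng = random.Random(seed)
        self.n_states = n_states
        self.state = 0

    def step(self, action):
        # Transition dynamics are hidden from the agent; it only sees the outcome.
        next_state = self.rng.randrange(self.n_states)
        reward = 1.0 if next_state == 0 else 0.0
        self.state = next_state
        return next_state, reward


def collect_samples(env, policy, horizon):
    """Online data collection: the agent learns only from observed transitions."""
    data = []
    s = env.state
    for _ in range(horizon):
        a = policy(s)
        s_next, r = env.step(a)
        data.append((s, a, r, s_next))
        s = s_next
    return data


data = collect_samples(UnknownEnv(), policy=lambda s: 0, horizon=10)
print(len(data))  # 10 observed (s, a, r, s') tuples
```

The simulator here plays the role of the real environment in an experiment; the learning algorithm's access is identical to the purely online setting.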