# Rebuttal for ICML 4224: Information-Directed Pessimism for Offline Reinforcement Learning
Summary of changes to the paper:
* In response to reviewers G419, Rj1u, and NzhM, we clarified throughout the manuscript that our intended interpretation of sub-Gaussianity concerns unimodality, not a specific characterization of tail-probability decay rates. We agree with the reviewers that in finite spaces all distributions are sub-Gaussian, but standard pessimism operates upon a unimodal representation of the mismatch, whereas our approach permits a mixture-family representation.
* As requested by reviewer Gfwv, we added an algorithm complexity analysis to Table 3 in Appendix A.
* Other issues related to typos, narrative exposition, and relationship to prior art are contained in the reviewer-specific rebuttals.
Please see https://www.dropbox.com/scl/fi/cdp17x8fpqe1sx99bfwp0/ICML2024__Information_Directed_Pessimism__8_pages_without_refs_-2.pdf?rlkey=jgdvruinz4c3syih3ifsot1cd&dl=0 for the revision.
## Response to Reviewer G419
### Weaknesses:
> * I find the paper to lacking in terms of presentation. Overall, I believe it's not immediate to grasp immediately what is the main result/message and its application. Several sections, esp. sec 3, can be shortened/improved in clarity to help with readability.
**Response to Weakness 1:** Our main goal is to point out that, when conducting policy search from offline data, the choice of penalty matters experimentally, even if it is theoretically optimal to use LCB and its variants. In particular, LCB takes the deviation of the transition model from its expected value under the behavioral policy as a good representation of mismatch in theory. In practice, however, we find that there is a large gap between this choice of representation and one that retains fuller distributional information about the mismatch. To enable this finding, we adapt machinery from kernelized Stein points in Markov Chain Monte Carlo to the offline RL setting for the first time, and establish, both in theory and in practice, that there are situations where it generalizes the regret guarantees of prior art, and otherwise matches them, depending on the underlying distributional properties of the MDP transition dynamics.
> * Several times the authors mention non-Gaussianity, but I believe they mean non-sub-Gaussianity, although this does not seem to be well-defined in the paper.
**Response to Weakness 2:** We will clarify throughout the paper that we mean multimodal distributions. Sub-Gaussianity refers to the decay rate of the tails, which is always satisfied in tabular settings; however, the associated concentration bounds presuppose a central tendency. To be more precise, concentration inequalities are well-calibrated for unimodal distributions and form a part of most offline RL algorithms that use pessimism. The main contribution of the paper is to show that there are situations where the transition distribution is multimodal and a pessimistic penalty based on the Stein discrepancy does better both theoretically (Theorem 4.3) and empirically. Throughout, we have clarified that the attribute of sub-Gaussianity we focus on is unimodality, so as to avoid this unintended alternate interpretation.
> * Related work section/discussion is missing or mixed with the introduction.
**Response to Weakness 3:** Expanded discussion of related work is in Appendix A, which is referred to at the end of the introduction following the itemized list.
#### Other:
> The authors mention the deepsea environment in the results, but it seems more like the riverswim environment.
**Response:** Our instantiation of the DeepSea environment is a variant of that in Osband [46], taken from https://github.com/stratisMarkou/sample-efficient-bayesian-rl
We agree this variant is similar to RiverSwim, and will add a citation to the following reference:
A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
## Response to Reviewer Rj1u
### Weaknesses:
> * The writing can be improved.
> * The description of discrete Stein discrepancy is not clear. I get lost following the usage of subscript $i$
**Response to Weakness 1:** We agree that the concept of the discrete Stein discrepancy is intricate, and hinges upon first understanding the discrete score function as the analogue of the gradient of the log-likelihood of a probability distribution. These concepts require a fuller explanation, but due to space constraints, this connective tissue was deferred to Appendix B.2. We will revise the explanation of the discrete Stein discrepancy to better refer to these explanations.
Regarding the subscript $i$, we note that the state $s\in\mathcal{S}$ is represented as a vector of length $|\mathcal{S}|$; e.g., in the case of a random walk over $|\mathcal{S}|$ integers, $s$ may be represented through its one-hot encoding, i.e., a vector in $\mathbb{R}^{|\mathcal{S}|}$ with $1$ in position $i$ and $0$ elsewhere. The rest of the definitions are computable with this encoding. We will add a footnote detailing this point, as it clarifies the subsequent evaluations.
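For concreteness, a minimal illustrative sketch of this encoding (the variable names are ours, not taken from the paper):
```python
import numpy as np

num_states = 5   # |S|: number of states in the random walk
i = 2            # index of the current state

# One-hot encoding of state i: a vector in R^{|S|} with a 1 in position i.
s = np.zeros(num_states)
s[i] = 1.0

# Each coordinate s_j is then a scalar, so coordinate-wise quantities
# (e.g., the entries indexed by the subscript in question) are computable.
print(s)  # [0. 0. 1. 0. 0.]
```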
>* Some statements are unclear. For example, "algorithms that achieve this lower bound by dividing by mutual information, i.e., IDS and its variants" in line 212 is an oversimplified summary of information-directed sampling. In line 272, it is unclear what "high-fidelity model estimators are tuned to have a model" means.
**Response to Weakness 2:** We agree that this description of IDS is a simplification of the methodology. We have caveated the explanation as follows:
> line 212: "algorithms that achieve this lower bound, i.e., IDS and its variants, employ intrinsic exploration incentives in the form of mutual information, which augment sampling by uncertainty estimates with respect to the optimal policy."
**Response to Weakness 3:** Regarding line 272, we agree that this phrasing was ambiguous, and have refined it as follows:
> line 272: "Knowledge of the 'true' score function is feasible in certain financial applications [58], where access to a generative model is provided by a physics or market simulation engine."
> * There are some minor typos, like in line 123, "the state $\mathcal{S}$ space" should be "the state space $\mathcal{S}$."
**Response to Weakness 4:** This has been corrected.
> * The proposed method does not improve the regret bound compared to previous methods; it is even worse. While it does not require the sub-Gaussian assumption, the regret bound is only available in tabular settings, where the distribution is always sub-Gaussian due to a finite (therefore bounded) support.
**Response to Weakness 5:** While the distribution of bounded rewards is always sub-Gaussian, it can still be multimodal. Concentration bounds are well-calibrated for unimodal distributions, and under unimodality assumptions VI-LCB is optimal, as shown in [17]. However, as we point out, there are examples in the tabular setting (Fig. 2) that necessitate characterizing distributional distances instead of concentration bounds, due to multimodality. This work seeks to address these scenarios by incorporating a penalty fashioned from the Stein discrepancy. While the algorithmic elements remain close to VI-LCB, we obtain a better rate in terms of $(1-\gamma)$, coming close to the lower bound.
>* The proposed method requires an additional assumption about uniform ergodicity, while previous methods do not require such an assumption.
**Response to Weakness 6:** It is important to note that prior works, especially [15], [17], [18], suppose exponentially fast mixing of the transition model under a fixed policy to the occupancy measure induced by that policy, without explicitly stating that ergodicity is required for this to hold; see [18], Assumption 1. We note that ergodicity is only required for the convergence of IDP-Q, not IDP-V, whose sufficient conditions are more comparable to [15], [17], which are value iteration-type.
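For reference, the standard form of uniform (geometric) ergodicity is as follows (stated informally here; notation may differ from the manuscript and from [18]):
$$\sup_{s \in \mathcal{S}} \big\| P_\pi^t(s, \cdot) - \rho_\pi \big\|_{\mathrm{TV}} \;\le\; M r^t \quad \text{for all } t \ge 0,$$
for some $M < \infty$ and $r \in (0,1)$, where $P_\pi$ is the transition kernel induced by a fixed policy $\pi$ and $\rho_\pi$ is its stationary occupancy measure, i.e., mixing to $\rho_\pi$ occurs at an exponential rate.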
### Questions:
>* What do $\mathbb{S}$ and $\Delta^*$ in Equation 3.5 mean without subscript ?
**Response to Question 1:** $\mathbb{S}$ is the notation used to define the score function of a distribution $\mathbb{P}$, which we represent equivalently here through its transition probability matrix $P_{s,a}$ due to the tabular MDP setting. $\mathbb{S}$ may be defined with respect to continuous distributions as well -- see Appendix B.2, as well as [32].
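For concreteness, one standard instantiation of these objects from the discrete Stein discrepancy literature is the following (shown purely as an illustration; signs and notation may differ from Eq. (3.5) and [32]). For a pmf $p$ on a finite cyclic set,
$$(\Delta^* p)(x) \;=\; p(x) - p(x \ominus 1), \qquad \mathbb{S}[p](x) \;=\; \frac{(\Delta^* p)(x)}{p(x)},$$
where $\ominus$ denotes cyclic decrement, so that $\mathbb{S}[p]$ plays the role of $\nabla \log p$ in the continuous kernelized Stein discrepancy.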
>* Why is a uniform ergodicity assumption required for the proposed method? How much does the proposed method rely on this assumption compared to previous methods like Li et al. [1]?
**Response to Question 2:** Uniform ergodicity is only required for the convergence analysis of IDP-Q, not IDP-V. This condition also appears in [18]. We will better clarify how the necessity of the assumptions maps to our distinct theorem statements in Section 4, as currently this has not been delineated in sufficient detail.
> * Can the regret bound be proved in settings with an infinite state space like linear MDPs or kernelized MDPs?
**Response to Question 3:** The KSD machinery is much more natural in continuous spaces, and hence the pessimistic penalty based upon it would be readily definable in that case. We believe linear and kernelized MDPs are meaningful extensions of this work for which our proposed machinery may similarly be able to handle multimodal mismatch, and hence they are important avenues for future work.
>* When is the proposed method provably better than previous methods?
> [1] Li, Gen, et al. "Settling the sample complexity of model-based offline reinforcement learning." The Annals of Statistics, 52(1):233–260, 2024.
**Response to Question 4:**
As seen from the extensive numerical studies, there are multiple scenarios where multimodality in the distribution mismatch degrades the return of traditional concentration-based approaches. While the distributional-distance-based penalty is applicable in general, the performance improvement is most conspicuous in the multimodal setting (see Fig. 3, Fig. 5). Concentration inequalities only loosely capture the discrepancy due to multimodality, and this looseness translates to poor empirical performance. Theoretically, using the same algorithmic elements and under the same concentrability assumptions, pessimism based on the Stein discrepancy in Theorem 4.3 improves the order in terms of $(1-\gamma)$.
## Response to Reviewer Gfwv
### Questions
>1) In the discussion of your method's computational cost, could you possibly include a comparative analysis within the existing literature in Table 3?
**Response to Question 1:** We agree this comparison helps clarify how much additional computation our method requires in order to model multimodal distributions, i.e., what the cost of our improved empirical performance is. We have added this to Table 3; however, the table is very large and clumsy to depict in the OpenReview format. Therefore, we have provided the updated version of Table 3 in a revision of the paper available at the anonymous link below:
https://www.dropbox.com/scl/fi/cdp17x8fpqe1sx99bfwp0/ICML2024__Information_Directed_Pessimism__8_pages_without_refs_-2.pdf?rlkey=jgdvruinz4c3syih3ifsot1cd&dl=0
>2) Would it be possible to incorporate the standard deviation of your experimental results?
**Response to Question 2:** $\pm$ one standard deviation is presented as shaded regions in the plots.
>3) I find myself puzzled by the presentation in Table 6, specifically regarding the labeling of the first row as "Hard." Additionally, could you elucidate the criteria or rationale behind the selection of the specific five rows in Table 2?
**Response to Question 3:** This terminology comes from Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine, "Conservative Q-learning for offline reinforcement learning," Advances in Neural Information Processing Systems, 33:1179–1191, 2020,
and is meant to indicate that datasets containing more trajectories from the optimal policy and from random policies exhibit (i) a smaller concentrability coefficient (a ratio closer to 1 between the data's occupancy measure and the optimal policy's occupancy measure); and (ii) better coverage of the state and action space (in the sense of a persistent exploration condition), which is known to be related to ensuring that the mixing time to the stationary distribution induced by the current policy is not too large. We will add the interpretation of these aspects in the supplementary material, where we describe the experiments in an expanded format.
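For reference, the single-policy concentrability coefficient we have in mind takes the standard form (notation may differ from the paper):
$$C^{\star} \;=\; \max_{s,a} \frac{d^{\pi^{\star}}(s,a)}{\mu(s,a)},$$
where $d^{\pi^{\star}}$ is the occupancy measure of the optimal policy and $\mu$ is that of the dataset; datasets containing more optimal-policy trajectories drive this ratio toward 1.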
The specific rows in Table 2 were chosen to depict the minimal settings (namely, the environment and dataset) in which the multimodal nature of our problems arises.
## Response to Reviewer NzhM
### Weakness:
>* The presentation of the paper could be improved.
**Response to Weakness:** We have revised the manuscript to improve the presentation; please see the summary of changes at the top of this rebuttal and the revision at the anonymous link above.
### Questions:
>* The authors mentioned that Theorem 4.4 allows the mismatch to be in a mixture family, which is strictly more general than sub-Gaussian conditions in prior art. Could the authors provide a concrete theoretical example where prior work such as VI-LCB (which is shown to be optimal in theory) performs worse than IDP-VI and IDP-Q?
**Response to Question:** Mismatch is inherently a function of the MDP transition model. While all distributions over a finite set are sub-Gaussian in theory, in practice they must also be unimodal for the concentration-bound-based penalty to encapsulate the mismatch well. As a concrete example, consider the true transition matrix to be drawn from a multimodal distribution of the type in the PriorMDP benchmark environment. In this case, using concentration inequalities to model the mismatch when defining a penalty leads to a miscalibrated bound, as it captures the deviation from a unimodal Beta whose mean is the average of the individual Beta distributions. This looseness translates to poor performance, as seen from Fig. 1 (left) and Figure 5.
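To illustrate the miscalibration, here is a purely illustrative sketch (not taken from the paper's experiments; the mixture parameters are ours):
```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bimodal mismatch: an equal-weight mixture of Beta(2, 20)
# and Beta(20, 2), with modes near 0.05 and 0.95.
n = 100_000
comp = rng.integers(0, 2, size=n)
samples = np.where(comp == 0, rng.beta(2, 20, size=n), rng.beta(20, 2, size=n))

mean, std = samples.mean(), samples.std()
print(f"mean = {mean:.3f}, std = {std:.3f}")

# The mean sits between the two modes, where the mixture places almost no
# probability mass, and the +/- std width is far larger than the spread
# around either mode -- so a penalty calibrated to (mean, std) poorly
# reflects where the distribution actually concentrates.
print("mass within 0.1 of the mean:", np.mean(np.abs(samples - mean) < 0.1))
```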