# Rebuttal-DRO-KL-ICLR-2023
# General response:
We thank the reviewers for their careful reading of the paper and their insightful and valuable feedback. Below we summarize the main changes and highlight our contributions in comparison with prior art.
### New numerical results
We added a new section of experiments to corroborate our theory and demonstrate the practical benefits of our algorithm in terms of better sample efficiency. This is highlighted in red in Section 4 of the revision.
### A clearer summary and highlight of our results
Our work addresses the sample size requirement of distributionally robust offline RL when the history dataset potentially has only partial coverage of the state-action space, which is a much weaker assumption than the full coverage required by prior analyses. To make the comparison clearer for the reviewers and readers, we now benchmark our results separately in the setting of prior works (full coverage) and in our setting (partial coverage).
* **1) <span style="color:blue">full coverage of the history dataset</span> (the setting of prior works):** We highlight that our method can be directly applied in this setting and leads to near-optimal sample complexity (shown in [**Table 2**](https://i.imgur.com/E0RmX1D.png)). In particular, our claim follows from verifying that the robust single-policy concentrability coefficient $C_{\mathrm{rob}}^\star$ in this setting becomes $C_{\mathrm{rob}}^\star = A$. Consequently, we achieve a dramatic improvement of the sample complexity, from a quadratic dependency on $S$ in all prior analyses to a linear one. In addition, we improve the exponential dependency on the horizon in (Zhou et al, 2021; Panaganti & Kalathil, 2022) to a polynomial one, and the quadratic dependency on $1/P_{\mathrm{min}}$ in (Yang et al, 2021) to a linear one.
<!--  -->
* **2) <span style="color:blue">partial coverage of the history dataset</span> (new setting of this work):** We want to highlight that our work deals with a much more challenging setting of the robust offline RL problem than prior works. In fact, prior analyses no longer provide provable guarantees in this setting (their sample complexity bounds blow up to infinity), while our method still achieves near minimax-optimal sample complexity (shown in Table 1 of the paper and recalled here in [**Table 3**](https://i.imgur.com/sWrgv8V.png)).
<!--  -->
### Difference with vanilla offline RL
We want to clarify that our work addresses the problem of robust offline RL, which aims to learn the robust optimal policy of a robust MDP; this is dramatically different from the vanilla offline RL problem, which focuses on learning an optimal policy of a standard MDP. **This is best seen from the difference in the performance metric: vanilla offline RL works bound the standard value gap $V^{\pi^*} - V^{\hat{\pi}}$ in the standard MDP, while we bound the robust value gap $V^{\pi^*,\sigma} - V^{\hat{\pi},\sigma}$ in the robust MDP.** Therefore, several existing works mentioned by the reviewers, which focus on vanilla offline RL, do not apply to our setting targeting the robust MDP and are therefore not directly comparable.
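For concreteness, writing schematically in the finite-horizon setting (with nominal kernel $P^0$ and KL uncertainty set $\mathcal{U}^{\sigma}(P^0)$; we omit the rectangularity details here), the two notions of value being compared are
$V^{\pi}(s) = \mathbb{E}_{\pi, P^0}\Big[ \sum_{h=1}^{H} r_h(s_h, a_h) \,\Big|\, s_1 = s \Big], \qquad V^{\pi,\sigma}(s) = \inf_{P \in \mathcal{U}^{\sigma}(P^0)} \mathbb{E}_{\pi, P}\Big[ \sum_{h=1}^{H} r_h(s_h, a_h) \,\Big|\, s_1 = s \Big],$
so the robust value gap measures suboptimality against the worst-case environment in the uncertainty set rather than against the nominal one.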
We would be grateful if the reviewers could take a look at the responses and consider raising their scores if we have addressed their concerns adequately. We would be glad to discuss further if there are additional concerns.
# Review 1:
We gratefully thank the reviewer for the suggestions! In what follows, we provide our responses to the reviewer's comments.
### significance of studying tabular MDPs
Thanks for raising this important question! We believe the design of provably efficient robust offline RL algorithms in the tabular setting is highly nontrivial and worth investigating in its own right. Provably efficient tabular RL has been an active line of research for decades [1-4], playing an essential role in shaping the theoretical foundations of RL. Following this line of work, we target tabular robust RL/MDPs, which are further compounded by the pressing challenge of achieving robustness from offline datasets. Theoretical underpinnings of robust offline RL in the tabular setting are critically needed before addressing more complex settings that incorporate, for example, function approximation. However, all prior analyses of robust RL, even in this tabular setting, fail to achieve sample efficiency under the KL divergence uncertainty set: their sample complexities scale undesirably with salient problem parameters, whereas ours is the first to achieve near-optimal sample complexity while further allowing partial coverage. It is certainly of great interest to investigate provably efficient algorithms for robust offline RL with function approximation, and we believe our work lays a solid foundation for this future direction.
### The core concepts of pessimism are not new
We agree with the reviewer that the concept of pessimism is not new, but **as the reviewer has also pointed out, the study of pessimism in robust offline RL and its theoretical guarantees are novel**. We believe designing the right form of the pessimistic penalty for the robust offline RL setting is a significant contribution in its own right, as it provably yields near-optimal sample complexity (complemented by an information-theoretic lower bound).
### numerical evaluations
Thanks so much for raising this question. To evaluate our provable method in practice, we uploaded a revision with a new section (Section 4), which conducts experiments on the Gambler problem, a benchmark widely considered in the robust RL literature [5,6]. The results show that, compared to distributionally robust value iteration (DRVI), our method achieves much better accuracy using the same batch dataset, corroborating the sample efficiency brought by our tailored pessimistic penalty.
### missing references
Thanks so much for raising this question! Due to the space limit, the related work section is included in the appendix of the submission, and we wonder if the reviewer might have missed it in the initial review. Focusing on algorithms with provable performance guarantees, we have extensively cited theory-oriented papers on offline model-based RL and distributionally robust (offline) RL, to the best of our knowledge. We will be more than happy to screen the literature more carefully and add more references upon the reviewer's suggestion. However, we did not survey the empirical RL literature, as it is beyond our scope.
Regarding the mentioned paper "Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment", note that this is an empirical paper that addresses a very different problem/setting from ours. We target provable algorithms with theoretical analysis for tabular robust offline RL using the KL divergence as the distance, while they target practical neural-network-based algorithms using heuristic approaches (RAD/RADS) to mimic the distance. Nonetheless, we are happy to add this paper to the "Related works" section upon your request.
### Tone down the claim of "To the best of our knowledge, our paper is the first work to execute the principle of pessimism in a data-driven manner for robust offline RL, leading to the first provably near-optimal algorithm that learns under simultaneous model uncertainty and partial coverage."
Thank you very much for raising this question. Although partial coverage, uncertainty, and pessimism are widely acknowledged empirically in the robust offline RL literature, no existing algorithm comes with **provable performance guarantees** in this setting. This is perhaps also why our work is interesting: it provides the theoretical footing for robust offline RL in these settings via nearly-matching information-theoretic lower bounds and achievability bounds built on the principle of pessimism.
[1] Kearns, Michael, and Satinder Singh. "Near-optimal reinforcement learning in polynomial time." Machine learning 49.2 (2002): 209-232.
[2] Kakade, Sham Machandranath. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003.
[3] Jin, Chi, et al. "Is Q-learning provably efficient?." Advances in neural information processing systems 31 (2018).
[4] Li, Gen, et al. "Breaking the sample size barrier in model-based reinforcement learning with a generative model." Advances in neural information processing systems 33 (2020): 12861-12872.
[5] Zhou, Zhengqing, et al. "Finite-sample regret bound for distributionally robust offline tabular reinforcement learning." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
[6] Panaganti, Kishan, and Dileep Kalathil. "Sample Complexity of Robust Reinforcement Learning with a Generative Model." International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
# Review 2:
We thank the reviewer for the very helpful comments and constructive suggestions! In what follows, we provide our responses to the reviewer's comments. We would be grateful if the reviewer could take a look at this response and consider raising the score if we have addressed your concerns adequately.
### Comparisons with prior methods
Thanks for this great question! Please refer to the general response that provides a clearer comparison with prior art under both full coverage and partial coverage assumptions.
### Comparisons on some benchmark problem
Thanks so much for raising this question. To evaluate our provable method in practice, we uploaded a revision with a new section (Section 4), which conducts experiments on the Gambler problem, a benchmark widely considered in the robust RL literature [1,2]. The results show that, compared to distributionally robust value iteration (DRVI), our method achieves much better accuracy using the same dataset, corroborating the sample efficiency brought by our tailored pessimistic penalty.
### Improving the clarity
Thank you so much for the various suggestions on improving the clarity! We will strengthen the link between the main text and the proofs as much as possible. Due to the page limit, we have to defer the analysis to the appendix. We have polished the organization and writing of the paper in this revision to address this concern.
### The description of the penalty term is hard to grasp
Thank you for raising this question. By data-driven design, we mean that the penalty term is estimated and constructed using the samples in the history dataset. This can be observed directly from the definition of the penalty term in (21), which is determined by $\widehat{P}_{\mathsf{min},h}$ and $N_h$, both constructed from the samples:
$b_h(s,a)= \begin{cases}
\min\left\{ c_b \frac{H}{\sigma} \sqrt{\frac{\log(\frac{KHS}{\delta})}{ \widehat{P}_{\mathsf{min},h}(s,a) N_h(s,a)}}~,~H\right\} & \text{if } N_h(s,a) > 0, \\
H & \text{ otherwise},
\end{cases}$
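For illustration only, here is a minimal sketch of how a penalty of this form could be evaluated from the empirical quantities (the function and argument names, and the placeholder value of $c_b$, are ours and not taken from the paper's implementation):

```python
import numpy as np

def penalty(N_hsa, P_min_hat_hsa, H, sigma, delta, K, S, c_b=1.0):
    """Pessimistic penalty b_h(s,a) following the form of (21);
    c_b is the unspecified universal constant (placeholder value here)."""
    if N_hsa == 0:
        return H  # unvisited (s, a) pairs receive the maximal penalty
    bonus = c_b * (H / sigma) * np.sqrt(
        np.log(K * H * S / delta) / (P_min_hat_hsa * N_hsa)
    )
    return min(bonus, H)  # the penalty is always capped at H
```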
<!--  -->
The penalty term takes the problem structure of the robust MDP into consideration, as it is tailored to be an upper bound on the statistical uncertainty of the *robust* Bellman operator: $\left| \inf_{ \mathcal{P} \in \mathcal{U}^{\sigma}(\widehat{P}_{h,s,a}^o)} \mathcal{P} \widehat{V}- \inf_{ \mathcal{P} \in \mathcal{U}^{\sigma}(P_{h,s,a}^o)} \mathcal{P} \widehat{V} \right|$, where $\widehat{V}$ is the value estimate at any step. The upper bound requires a careful calculation to ensure tightness (see Lemma 8). To appreciate our novelty, note that in sharp contrast to vanilla RL, the statistical uncertainty term to be controlled in our robust RL problem has no closed form and is highly nonlinear w.r.t. the transition kernel error, while the corresponding term in vanilla RL is linear w.r.t. the transition kernel error:
$\text{Vanilla RL:} \quad \quad \underset{\color{blue}{\bf \text{ linear}} \text{ w.r.t. } P^o_{h,s,a} - \widehat{P}^o_{h,s,a} }{\underbrace{\Big| P^o_{h,s,a}\widehat{V} -\widehat{P}^o_{h,s,a}\widehat{V} \Big|}}$
$\text{Robust RL:} \quad \underset{\color{red}{\bf \text{No closed form}} \text{ w.r.t. } P^o_{h,s,a} - \widehat{P}^o_{h,s,a} \text{ due to complex form of } \mathcal{U}^\sigma(\cdot)}{\underbrace{ \Big|\inf_{\mathcal{P}\in \mathcal{U}^\sigma\left(P^o_{h,s,a} \right)} \mathcal{P}\widehat{V}- \inf_{\mathcal{P} \in \mathcal{U}^\sigma\left(\widehat{P}^o_{h, s,a} \right)} \mathcal{P}\widehat{V} \Big|}}$
[1] Zhou, Zhengqing, et al. "Finite-sample regret bound for distributionally robust offline tabular reinforcement learning." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
[2] Panaganti, Kishan, and Dileep Kalathil. "Sample Complexity of Robust Reinforcement Learning with a Generative Model." International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
# Review 3:
### trade-off between coverage and robustness
Thank you so much for the wonderful question! We are very happy to explain this point. We agree with the reviewer that the definition of the robust single-policy concentrability coefficient captures both the coverage of the dataset and structural properties of the robust MDP. However, this assumption is **strictly weaker than full coverage**: under full coverage, $C_{\mathrm{rob}}^{\star}$ simply reduces to the size of the action space $A$, and our bounds still provide significant savings in sample complexity compared with prior work, as mentioned in the general response.
### non-triviality of the algorithm design
Thank you so much for the careful and detailed review of our method and for raising this significant question. We are glad to explain and discuss further.
* **Tailoring the penalty term is the key to sample efficiency.** The reviewer is right that the general idea of our algorithm is to add the penalty term in (21) to the robust value iteration in (19) to promote pessimism. Our main contribution is to design the penalty term $b_h(s,a)$ for the specific robust MDP problem in this work, ensuring that the penalty serves as a tight upper bound on the statistical uncertainty, which in turn leads to a tight sample complexity.
* **Challenges in the penalty term design.** To highlight the challenge that is unique to the robust RL case (and not present in the commonly studied vanilla RL case): the penalty term upper bounds the statistical uncertainty of the *robust* Bellman operator, $\left| \inf_{ \mathcal{P} \in \mathcal{U}^{\sigma}(\widehat{P}_{h,s,a}^o)} \mathcal{P} \widehat{V}- \inf_{ \mathcal{P} \in \mathcal{U}^{\sigma}(P_{h,s,a}^o)} \mathcal{P} \widehat{V} \right|$, where $\widehat{V}$ is the value estimate at any step. The upper bound requires a careful calculation to ensure tightness (see Lemma 8). To appreciate our novelty, note that in sharp contrast to vanilla RL, the statistical uncertainty term to be controlled in our robust RL problem has no closed form and is highly nonlinear w.r.t. the transition kernel error, while the corresponding term in vanilla RL is linear w.r.t. the transition kernel error:
$\text{Vanilla RL:} \quad \quad \underset{\color{blue}{\bf \text{ linear}} \text{ w.r.t. } P^o_{h,s,a} - \widehat{P}^o_{h,s,a} }{\underbrace{\Big| P^o_{h,s,a}\widehat{V} -\widehat{P}^o_{h,s,a}\widehat{V} \Big|}}$
$\text{Robust RL:} \quad \underset{\color{red}{\bf \text{No closed form}} \text{ w.r.t. } P^o_{h,s,a} - \widehat{P}^o_{h,s,a} \text{ due to complex form of } \mathcal{U}^\sigma(\cdot)}{\underbrace{ \Big|\inf_{\mathcal{P}\in \mathcal{U}^\sigma\left(P^o_{h,s,a} \right)} \mathcal{P}\widehat{V}- \inf_{\mathcal{P} \in \mathcal{U}^\sigma\left(\widehat{P}^o_{h, s,a} \right)} \mathcal{P}\widehat{V} \Big|}}$
To address this, we design the penalty term by transferring the primal robust value iteration problem into the dual space and carefully controlling the uncertainty term there (see Lemma 8 in the paper), as sketched below.
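For intuition only, here is a minimal sketch of how the inner minimization over the KL uncertainty set can be evaluated through its standard one-dimensional dual (the exponential/log-sum-exp form); the function name and numerical treatment are ours for illustration and not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_robust_backup(P_hat, V, sigma, lam_max=100.0):
    """Evaluate inf_{Q: KL(Q || P_hat) <= sigma} <Q, V> via its dual
    sup_{lam >= 0} { -lam * log E_{P_hat}[exp(-V/lam)] - lam * sigma }."""
    support = P_hat > 0
    v_min = V[support].min()

    def neg_dual(lam):
        # shift by v_min inside the exponential for numerical stability
        weighted = np.dot(P_hat[support], np.exp(-(V[support] - v_min) / lam))
        return -(v_min - lam * np.log(weighted) - lam * sigma)

    # the dual variable is searched over a bounded interval for simplicity
    res = minimize_scalar(neg_dual, bounds=(1e-8, lam_max), method="bounded")
    return -res.fun
```

Roughly speaking, applying this dual backup with the estimated kernel and then subtracting the penalty $b_h(s,a)$ in (21) yields the pessimistic value update, and Lemma 8 controls the gap between the dual backups under the estimated and true kernels.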
### novelty of the proof
We thank the reviewer for the detailed and careful review. We would like to argue that our technical proof involves results that are new and of interest for future work, in comparison with prior works, as detailed below.
* **The first minimax lower bound for this problem:** The proof of this lower bound requires a challenging control, due to the lack of a closed form over the uncertainty set, that does not appear in vanilla RL or other robust RL settings. It is the first minimax lower bound for this problem, providing significant intuition about the fundamental limits and serving as a benchmark against which algorithms for this problem can be evaluated.
* **Improvement of the sample complexity from $S^2$ to $S$:** Note that this significant improvement in sample efficiency cannot be obtained by simply following previous proof pipelines in offline RL or robust RL. To the best of our knowledge, we introduce a tailored leave-one-out argument to achieve the first sample complexity with linear dependency on $S$ for provable tabular robust RL, a technique that is new and promising for future work not only in model-based RL, but also in model-free and policy-based RL.
### how to modify the method for infinite state space?
Thank you for raising this important point! We agree that addressing infinite state space will be a great future direction to study, and existing works on handling infinite state space in standard RL might provide useful entry points when combined with our work. Our focus is on providing a comprehensive understanding in the tabular setting in this paper, which is already quite extensive on its own.
# Review 4:
### More clarification about the comparison of the concentrability and the results with existing work could help
Thank you for raising this question! Please refer to the general response that provides a clearer comparison with prior art under both full coverage and partial coverage assumptions.
### Can you discuss the concentrability parameter in Assumption 1 and how it differs from other similar notions in related work?
Thank you very much for raising this important question! Several variants of single-policy concentrability coefficients have been proposed in the literature on _vanilla offline RL_ to address the issue of partial coverage, e.g., [2-4]. This kind of assumption is weaker than earlier ones requiring full coverage [1] (which are used in prior works on robust offline RL), since it only needs to compare against a single optimal policy rather than all policies. Ours can be viewed as an adapted version tailored to the context of the **robust MDP**, which, to the best of our knowledge, has not been considered before.
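For reference, the single-policy concentrability coefficient in vanilla offline RL (e.g., [2]) can be written schematically as
$C^{\star} = \max_{h,s,a} \frac{d_h^{\pi^{\star}}(s,a)}{d_h^{\mathrm{b}}(s,a)},$
where $d_h^{\pi^{\star}}$ and $d_h^{\mathrm{b}}$ denote the occupancy distributions of the optimal policy and the behavior policy, respectively (the clipped variant in [4] further caps the numerator). Our robust single-policy concentrability coefficient $C_{\mathrm{rob}}^{\star}$ adapts this notion to the robust MDP by accounting for the occupancies induced by the robust optimal policy over the uncertainty set; please see Assumption 1 in the paper for the exact definition.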
### The distance between the learned policy and the optimal policy.
Thank you for raising this wonderful question! Indeed, our guarantee bounds the robust value gap, not the distance on the policy itself. It is an interesting future direction to translate the robust value gap into a distance on the policy, which we believe hinges on structural properties of the robust MDP itself rather than on the concentrability coefficient (which concerns the history dataset).
### Difference with Xie et al., (2021), Zanette et al. (2021), Uehara et al., (2021)
Thank you so much for mentioning these related works! We have now cited all of these papers in the revision. We would like to highlight that we target a very different problem, namely **robust offline RL**, which maximizes the worst-case performance over an uncertainty set of environments, rather than the *vanilla offline RL* problem considered in these works [Xie et al., (2021); Zanette et al. (2021); Uehara et al., (2021)]. Therefore, the results in these papers do not apply to the robust offline RL problem and cannot be compared directly with ours. To the best of our knowledge, our work is indeed the first provable near-optimal algorithm for the robust offline RL problem under partial coverage.
[1] Chen, Jinglin, and Nan Jiang. "Information-theoretic considerations in batch reinforcement learning." International Conference on Machine Learning. PMLR, 2019.
[2] Rashidinejad, Paria, et al. "Bridging offline reinforcement learning and imitation learning: A tale of pessimism." Advances in Neural Information Processing Systems 34 (2021): 11702-11716.
[3] Xie, Tengyang, et al. "Policy finetuning: Bridging sample-efficient offline and online reinforcement learning." arXiv preprint arXiv:2106.04895 (2021).
[4] Li, Gen, et al. "Settling the sample complexity of model-based offline reinforcement learning." arXiv preprint arXiv:2204.05275 (2022).
[5] Uehara, Masatoshi, and Wen Sun. "Pessimistic model-based offline reinforcement learning under partial coverage." arXiv preprint arXiv:2107.06226 (2021).
# Review 5:
### Comparisons to [1]
We believe there might be some confusion in comparing our work with [1]. Although the algorithm in [1] contains a min-max formulation, it in fact targets a very different problem, vanilla offline RL, while we consider robust offline RL, i.e., being robust w.r.t. possible inherent model/transition uncertainty (which is not present in vanilla offline RL). **This is best seen from the resulting performance metric: [1] bounds the standard value gap $V^{\pi^*} - V^{\hat{\pi}}$ in the standard MDP, while we bound the robust value gap $V^{\pi^*,\sigma} - V^{\hat{\pi},\sigma}$ in the robust MDP.** Therefore, the algorithm in [1] cannot be applied to solve the robust offline RL problem that we focus on, and the theoretical results in [1] and ours are not directly comparable.
[1] Uehara, Masatoshi, and Wen Sun. "Pessimistic model-based offline reinforcement learning under partial coverage." arXiv preprint arXiv:2107.06226 (2021).
### Clarification of contributions
Our contribution lies in designing and analyzing the first provably efficient algorithm for robust offline RL under partial coverage, as backed by theoretical analysis of sample size requirements with both upper and lower bounds. Please refer to the general response that provides a clearer comparison with prior art under both full coverage and partial coverage assumptions.
### applications and numerical experiments
Thanks so much for raising this question. To evaluate our provable method in practice, we **uploaded a revision with a new section (Section 4)**, which conducts experiments on the Gambler problem, a benchmark widely considered in the robust RL literature. The results show that, compared to distributionally robust value iteration (DRVI), our method achieves much better accuracy using the same dataset, corroborating the sample efficiency brought by our tailored pessimistic penalty.
### Is the pessimistic penalty (21) necessary or important?
Indeed, the pessimistic penalty is necessary and important both for empirical performance and theoretical analysis, which we elaborate below.
* **The role of pessimism in practical performance**
The reviewer is right that the pessimistic penalty in (21) is designed to address the partial coverage of the history dataset arising from the offline sampling mechanism. The pessimistic penalty (21) is the key to achieving better accuracy with the same sample size, corroborating the sample efficiency of the proposed method, **shown in [Fig. 1(a-b) in the revision](https://i.imgur.com/e7WU0ke.png).** Here, DRVI is the baseline without the pessimistic penalty, and DRVI-LCB is our method with the tailored pessimistic penalty added.
<!--  -->
* **The role of pessimism in theoretical sample efficiency**
The tailored pessimistic penalty in (21) is one of the keys to achieving near-optimal sample complexity in robust offline RL, especially for addressing the partial-coverage challenge of the history dataset. Its significance is also corroborated by the comparison with prior work in terms of sample complexity, as detailed in the general response.
### Further clarification of the role of the penalty term in the theoretical results
Thank you so much for raising this question. We recall that the performance gap guarantee critically depends on the size of the pessimistic penalty (see (54)), which comes from upper bounding the perturbation in Lemma 8. Therefore, being able to tightly control this penalty term leads to a tighter performance guarantee.
Regarding the $c_b$ term in (21), it is a universal constant that is already absorbed in the large enough constant $c_0$ in Theorem 1 (cf. (26)).
# To AC:
Dear Area Chairs,
We thank all the reviewers / Area Chairs / Meta Reviewers for their time and effort in reviewing our paper. We believe this work matches the community's interest and can contribute to a lot of meaningful future work. Accordingly, we carefully <span style="color:blue">summarized and clarified our contributions and added numerical experiments as requested by the reviewers.</span> However, up to now, we have not received any replies or discussions. Since the deadline is approaching, we would really appreciate it if the ACs could encourage the reviewers to engage in the discussion. We also hope the ACs will consider our responses when making the decision if there is no reply from the reviewers after the rebuttal stage.
In particular, we want to point out that Reviewer 8J56's review raises ethical concerns and shows that the reviewer not only misses the core contribution of this paper but also dismisses it as a 'survey'. We certainly welcome suggestions and concerns, but not such dismissive remarks. We would really appreciate it if the ACs could encourage this reviewer to engage in the discussion, as we sincerely hope to address their concerns.
We respectfully hope the Area Chair / Meta Reviewer can take the quality of the reviews into account when making the decision. Thank you so much!
Best regards,
Authors