# Rebuttal-DRO-KL-ICLR-2023
# General response:
We thank the reviewers for their careful reading of the paper and their insightful and valuable feedback. Below we summarize the main changes and highlight our contributions in comparison with prior art.
### New numerical results
We added a new section of experiments to corroborate our theory and demonstrate the practical benefits of our algorithm in terms of better sample efficiency. This is highlighted in red in Section 4 of the revision.
### A clearer summary and highlight of our results
Our work addresses the sample size requirement of distributionally robust offline RL when the history dataset potentially has only partial coverage of the state-action space, which is a much weaker assumption than the full coverage required by prior analyses. To make the comparison clearer for the reviewers and readers, we now benchmark our results separately in the setting of prior works (full coverage) and in our setting (partial coverage).
* **1) <span style="color:blue">full coverage of the history dataset</span> (the setting of prior works):** We highlight that our method can be directly applied in this setting and leads to near-optimal sample complexity (shown in [**Table 2**](https://i.imgur.com/E0RmX1D.png)). In particular, our claim follows from verifying that the robust single-policy concentrability coefficient $C_{\mathrm{rob}}^\star$ in this setting becomes $C_{\mathrm{rob}}^\star = A$. Consequently, we achieve a dramatic improvement of the sample complexity, from a quadratic dependency on $S$ in all prior analyses to a linear one. In addition, we improve the exponential dependency on the horizon in (Zhou et al, 2021; Panaganti & Kalathil, 2022) to a polynomial one, and the quadratic dependency on $1/P_{\mathrm{min}}$ in (Yang et al, 2021) to a linear one.
<!--  -->
* **2) <span style="color:blue">partial coverage of the history dataset</span> (new setting of this work):** We want to highlight that our work deals with a much more challenging setting of the robust offline RL problem than prior works. In fact, prior analyses no longer provide provable guarantees in this setting (their sample complexity bounds blow up to infinity), while our method still achieves near minimax-optimal sample complexity (shown in Table 1 of the paper and recalled here in [**Table 3**](https://i.imgur.com/sWrgv8V.png)).
<!--  -->
### Difference with vanilla offline RL
We want to clarify that our work addresses the problem of robust offline RL, which aims to learn the robust optimal policy of a robust MDP; this is dramatically different from the vanilla offline RL problem, which focuses on learning an optimal policy of a standard MDP. **This is best seen from the difference in the performance metric: vanilla offline RL works bound the standard value gap $V^{\pi^*} - V^{\hat{\pi}}$ in the standard MDP, while we bound the robust value gap $V^{\pi^*,\sigma} - V^{\hat{\pi},\sigma}$ in the robust MDP.** Therefore, several existing works mentioned by the reviewers, which focus on vanilla offline RL, do not apply to our setting targeting the robust MDP and are therefore not directly comparable.
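For concreteness, writing schematically in the finite-horizon setting (with nominal kernel $P^0$ and KL uncertainty set $\mathcal{U}^{\sigma}(P^0)$; we omit the rectangularity details here), the two notions of value being compared are
$V^{\pi}(s) = \mathbb{E}_{\pi, P^0}\Big[ \sum_{h=1}^{H} r_h(s_h, a_h) \,\Big|\, s_1 = s \Big], \qquad V^{\pi,\sigma}(s) = \inf_{P \in \mathcal{U}^{\sigma}(P^0)} \mathbb{E}_{\pi, P}\Big[ \sum_{h=1}^{H} r_h(s_h, a_h) \,\Big|\, s_1 = s \Big],$
so the robust value gap measures suboptimality against the worst-case environment in the uncertainty set rather than against the nominal one.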
We would be grateful if the reviewers could take a look at the responses and consider raising their scores if we have addressed their concerns adequately. We would be glad to discuss further if there are additional concerns.
# Review 1:
We gratefully thank the reviewer for the suggestions! In what follows, we provide our responses to the reviewer's comments.
### significance of studying tabular MDPs
Thanks for raising this important question! We believe the design of provably efficient robust offline RL algorithms in the tabular setting is highly nontrivial and worth investigating in its own right. Provably efficient tabular RL has been an active line of research for decades [1-4], playing an essential role in shaping the theoretical foundations of RL. Following this line of work, we target tabular robust RL/MDPs, which are further compounded by the pressing challenge of achieving robustness from offline datasets. Theoretical underpinnings of robust offline RL in the tabular setting are critically needed before addressing more complex settings that incorporate, for example, function approximation. However, all prior analyses of robust RL, even in this tabular setting, fail to achieve sample efficiency under the KL divergence uncertainty set: their sample complexities scale undesirably with salient problem parameters, whereas ours is the first to achieve near-optimal sample complexity while further allowing partial coverage. It is certainly of great interest to investigate provably efficient algorithms for robust offline RL with function approximation, and we believe our work lays a solid foundation for this future direction.
### The core concepts of pessimism are not new
We agree with the reviewer that the concept of pessimism is not new, but **as the reviewer has also pointed out, the study of pessimism in robust offline RL and its theoretical guarantees are novel**. We believe designing the right form of the pessimistic penalty for the robust offline RL setting is a significant contribution in its own right, as it provably yields near-optimal sample complexity (complemented by an information-theoretic lower bound).
### numerical evaluations
Thanks so much for raising this question. To evaluate our provable method in practice, we uploaded a revision with a new section (Section 4), which conducts experiments on the Gambler problem, a benchmark widely considered in the robust RL literature [5,6]. The results show that, compared to distributionally robust value iteration (DRVI), our method achieves much better accuracy using the same batch dataset, corroborating the sample efficiency brought by our tailored pessimistic penalty.
### missing references
Thanks so much for raising this question! Due to the space limit, the related work section is included in the appendix of the submission, and we wonder if the reviewer might have missed it in the initial review. Focusing on algorithms with provable performance guarantees, we have extensively cited theory-oriented papers on offline model-based RL and distributionally robust (offline) RL, to the best of our knowledge. We will be more than happy to screen the literature more carefully and add more references upon the reviewer's suggestion. However, we did not survey the empirical RL literature, as it is beyond our scope.
Regarding the mentioned paper "Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment", note that this is an empirical paper that addresses a very different problem/setting from ours. We target provable algorithms with theoretical analysis for tabular robust offline RL using the KL divergence as the distance, while they target practical neural-network-based algorithms using heuristic approaches (RAD/RADS) to mimic the distance. Nonetheless, we are happy to add this paper to the "Related works" section upon your request.
### Tone down the claim of "To the best of our knowledge, our paper is the first work to execute the principle of pessimism in a data-driven manner for robust offline RL, leading to the first provably near-optimal algorithm that learns under simultaneous model uncertainty and partial coverage."
Thank you very much for raising this question. Although partial coverage, uncertainty, and pessimism are widely acknowledged empirically in the robust offline RL literature, no existing algorithm comes with **provable performance guarantees** in this setting. This is perhaps also why our work is interesting: it provides the theoretical footing for robust offline RL in these settings via nearly-matching information-theoretic lower bounds and achievability bounds built on the principle of pessimism.
[1] Kearns, Michael, and Satinder Singh. "Near-optimal reinforcement learning in polynomial time." Machine learning 49.2 (2002): 209-232.
[2] Kakade, Sham Machandranath. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003.
[3] Jin, Chi, et al. "Is Q-learning provably efficient?." Advances in neural information processing systems 31 (2018).
[4] Li, Gen, et al. "Breaking the sample size barrier in model-based reinforcement learning with a generative model." Advances in neural information processing systems 33 (2020): 12861-12872.
[5] Zhou, Zhengqing, et al. "Finite-sample regret bound for distributionally robust offline tabular reinforcement learning." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
[6] Panaganti, Kishan, and Dileep Kalathil. "Sample Complexity of Robust Reinforcement Learning with a Generative Model." International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
# Review 2:
We thank the reviewer for the very helpful comments and constructive suggestions! In what follows, we provide our responses to the reviewer's comments. We would be grateful if the reviewer could take a look at this response and consider raising the score if we have addressed your concerns adequately.
### Comparisons with prior methods
Thanks for this great question! Please refer to the general response that provides a clearer comparison with prior art under both full coverage and partial coverage assumptions.
### Comparisons on some benchmark problem
Thanks so much for raising this question. To evaluate our provable method in practice, we uploaded a revision with a new section (Section 4), which conducts experiments on the Gambler problem, a benchmark widely considered in the robust RL literature [1,2]. The results show that, compared to distributionally robust value iteration (DRVI), our method achieves much better accuracy using the same dataset, corroborating the sample efficiency brought by our tailored pessimistic penalty.
### Improving the clarity
Thank you so much for the various suggestions on improving the clarity! We will strengthen the link between the main text and the proofs as much as possible. Due to the page limit, we have to defer the analysis to the appendix. We have polished the organization and writing of the paper in this revision to address this concern.
### The description of the penalty term is hard to grasp
Thank you for raising this question. By data-driven design, we mean that the penalty term is estimated and constructed using the samples in the history dataset. This can be observed directly from the definition of the penalty term in (21), which is determined by $\widehat{P}_{\mathsf{min},h}$ and $N_h$, both constructed from the samples:
$b_h(s,a)= \begin{cases}
\min\left\{ c_b \frac{H}{\sigma} \sqrt{\frac{\log(\frac{KHS}{\delta})}{ \widehat{P}_{\mathsf{min},h}(s,a) N_h(s,a)}}~,~H\right\} & \text{if } N_h(s,a) > 0, \\
H & \text{ otherwise},
\end{cases}$
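For illustration only, here is a minimal sketch of how a penalty of this form could be evaluated from the empirical quantities (the function and argument names, and the placeholder value of $c_b$, are ours and not taken from the paper's implementation):

```python
import numpy as np

def penalty(N_hsa, P_min_hat_hsa, H, sigma, delta, K, S, c_b=1.0):
    """Pessimistic penalty b_h(s,a) following the form of (21);
    c_b is the unspecified universal constant (placeholder value here)."""
    if N_hsa == 0:
        return H  # unvisited (s, a) pairs receive the maximal penalty
    bonus = c_b * (H / sigma) * np.sqrt(
        np.log(K * H * S / delta) / (P_min_hat_hsa * N_hsa)
    )
    return min(bonus, H)  # the penalty is always capped at H
```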
<!--  -->
The penalty term takes the problem structure of the robust MDP into consideration, as it is tailored to be an upper bound on the statistical uncertainty of the *robust* Bellman operator: $\left| \inf_{ \mathcal{P} \in \mathcal{U}^{\sigma}(\widehat{P}_{h,s,a}^o)} \mathcal{P} \widehat{V}- \inf_{ \mathcal{P} \in \mathcal{U}^{\sigma}(P_{h,s,a}^o)} \mathcal{P} \widehat{V} \right|$, where $\widehat{V}$ is the value estimate at any step. The upper bound requires a careful calculation to ensure tightness (see Lemma 8). To appreciate our novelty, note that in sharp contrast to vanilla RL, the statistical uncertainty term to be controlled in our robust RL problem has no closed form and is highly nonlinear w.r.t. the transition kernel error, while the corresponding term in vanilla RL is linear w.r.t. the transition kernel error:
$\text{Vanilla RL:} \quad \quad \underset{\color{blue}{\bf \text{ linear}} \text{ w.r.t. } P^o_{h,s,a} - \widehat{P}^o_{h,s,a} }{\underbrace{\Big| P^o_{h,s,a}\widehat{V} -\widehat{P}^o_{h,s,a}\widehat{V} \Big|}}$
$\text{Robust RL:} \quad \underset{\color{red}{\bf \text{No closed form}} \text{ w.r.t. } P^o_{h,s,a} - \widehat{P}^o_{h,s,a} \text{ due to complex form of } \mathcal{U}^\sigma(\cdot)}{\underbrace{ \Big|\inf_{\mathcal{P}\in \mathcal{U}^\sigma\left(P^o_{h,s,a} \right)} \mathcal{P}\widehat{V}- \inf_{\mathcal{P} \in \mathcal{U}^\sigma\left(\widehat{P}^o_{h, s,a} \right)} \mathcal{P}\widehat{V} \Big|}}$
[1] Zhou, Zhengqing, et al. "Finite-sample regret bound for distributionally robust offline tabular reinforcement learning." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
[2] Panaganti, Kishan, and Dileep Kalathil. "Sample Complexity of Robust Reinforcement Learning with a Generative Model." International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
# Review 3:
### trade-off between coverage and robustness
Thank you so much for the wonderful question! We are very happy to explain this point. We agree with the reviewer that the definition of the robust single-policy concentrability coefficient captures both the coverage of the dataset and structural properties of the robust MDP. However, this assumption is **strictly weaker than full coverage**: under full coverage, $C_{\mathrm{rob}}^{\star}$ simply reduces to the size of the action space $A$, and our bounds still provide significant savings in sample complexity compared with prior work, as mentioned in the general response.
### non-triviality of the algorithm design
Thank you so much for the careful and detailed review of our method and for raising this significant question. We are glad to explain and discuss further.
* **Tailoring the penalty term is the key to sample efficiency.** The reviewer is right that the general idea of our algorithm is to add the penalty term in (21) to the robust value iteration in (19) to promote pessimism. Our main contribution is to design the penalty term $b_h(s,a)$ for the specific robust MDP problem in this work, ensuring that the penalty serves as a tight upper bound on the statistical uncertainty, which in turn leads to a tight sample complexity.
* **Challenges in the penalty term design.** To highlight the challenge that is unique to the robust RL case (and not present in the commonly studied vanilla RL case): the penalty term upper bounds the statistical uncertainty of the *robust* Bellman operator, $\left| \inf_{ \mathcal{P} \in \mathcal{U}^{\sigma}(\widehat{P}_{h,s,a}^o)} \mathcal{P} \widehat{V}- \inf_{ \mathcal{P} \in \mathcal{U}^{\sigma}(P_{h,s,a}^o)} \mathcal{P} \widehat{V} \right|$, where $\widehat{V}$ is the value estimate at any step. The upper bound requires a careful calculation to ensure tightness (see Lemma 8). To appreciate our novelty, note that in sharp contrast to vanilla RL, the statistical uncertainty term to be controlled in our robust RL problem has no closed form and is highly nonlinear w.r.t. the transition kernel error, while the corresponding term in vanilla RL is linear w.r.t. the transition kernel error:
$\text{Vanilla RL:} \quad \quad \underset{\color{blue}{\bf \text{ linear}} \text{ w.r.t. } P^o_{h,s,a} - \widehat{P}^o_{h,s,a} }{\underbrace{\Big| P^o_{h,s,a}\widehat{V} -\widehat{P}^o_{h,s,a}\widehat{V} \Big|}}$
$\text{Robust RL:} \quad \underset{\color{red}{\bf \text{No closed form}} \text{ w.r.t. } P^o_{h,s,a} - \widehat{P}^o_{h,s,a} \text{ due to complex form of } \mathcal{U}^\sigma(\cdot)}{\underbrace{ \Big|\inf_{\mathcal{P}\in \mathcal{U}^\sigma\left(P^o_{h,s,a} \right)} \mathcal{P}\widehat{V}- \inf_{\mathcal{P} \in \mathcal{U}^\sigma\left(\widehat{P}^o_{h, s,a} \right)} \mathcal{P}\widehat{V} \Big|}}$
To address this, we design the penalty term by transferring the primal robust value iteration problem into the dual space and carefully controlling the uncertainty term there (see Lemma 8 in the paper), as sketched below.
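For intuition only, here is a minimal sketch of how the inner minimization over the KL uncertainty set can be evaluated through its standard one-dimensional dual (the exponential/log-sum-exp form); the function name and numerical treatment are ours for illustration and not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_robust_backup(P_hat, V, sigma, lam_max=100.0):
    """Evaluate inf_{Q: KL(Q || P_hat) <= sigma} <Q, V> via its dual
    sup_{lam >= 0} { -lam * log E_{P_hat}[exp(-V/lam)] - lam * sigma }."""
    support = P_hat > 0
    v_min = V[support].min()

    def neg_dual(lam):
        # shift by v_min inside the exponential for numerical stability
        weighted = np.dot(P_hat[support], np.exp(-(V[support] - v_min) / lam))
        return -(v_min - lam * np.log(weighted) - lam * sigma)

    # the dual variable is searched over a bounded interval for simplicity
    res = minimize_scalar(neg_dual, bounds=(1e-8, lam_max), method="bounded")
    return -res.fun
```

Roughly speaking, applying this dual backup with the estimated kernel and then subtracting the penalty $b_h(s,a)$ in (21) yields the pessimistic value update, and Lemma 8 controls the gap between the dual backups under the estimated and true kernels.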
### novelty of the proof
We thank the reviewer for the detailed and careful review. We would like to argue that our technical proof involves results that are new and of interest for future work, in comparison with prior works, as detailed below.
* **The first minimax lower bound for this problem:** The proof of this lower bound requires a challenging control, due to the lack of a closed form over the uncertainty set, that does not appear in vanilla RL or other robust RL settings. It is the first minimax lower bound for this problem, providing significant intuition about the fundamental limits and serving as a benchmark against which algorithms for this problem can be evaluated.
* **Improvement of the sample complexity from $S^2$ to $S$:** Note that this significant improvement in sample efficiency cannot be obtained by simply following previous proof pipelines in offline RL or robust RL. To the best of our knowledge, we introduce a tailored leave-one-out argument to achieve the first sample complexity with linear dependency on $S$ for provable tabular robust RL, a technique that is new and promising for future work not only in model-based RL, but also in model-free and policy-based RL.
### how to modify the method for infinite state space?
Thank you for raising this important point! We agree that addressing infinite state space will be a great future direction to study, and existing works on handling infinite state space in standard RL might provide useful entry points when combined with our work. Our focus is on providing a comprehensive understanding in the tabular setting in this paper, which is already quite extensive on its own.
# Review 4:
### More clarification about the comparison of the concentrability and the results with existing work could help
Thank you for raising this question! Please refer to the general response that provides a clearer comparison with prior art under both full coverage and partial coverage assumptions.
### Can you discuss the concentrability parameter in Assumption 1 and how it differs from other similar notions in related work?
Thank you very much for raising this important question! Several variants of single-policy concentrability coefficients have been proposed in the literature on _vanilla offline RL_ to address the issue of partial coverage, e.g., [2-4]. This kind of assumption is weaker than earlier ones requiring full coverage [1] (which are used in prior works on robust offline RL), since it only needs to compare against a single optimal policy rather than all policies. Ours can be viewed as an adapted version tailored to the context of the **robust MDP**, which, to the best of our knowledge, has not been considered before.
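For reference, the single-policy concentrability coefficient in vanilla offline RL (e.g., [2]) can be written schematically as
$C^{\star} = \max_{h,s,a} \frac{d_h^{\pi^{\star}}(s,a)}{d_h^{\mathrm{b}}(s,a)},$
where $d_h^{\pi^{\star}}$ and $d_h^{\mathrm{b}}$ denote the occupancy distributions of the optimal policy and the behavior policy, respectively (the clipped variant in [4] further caps the numerator). Our robust single-policy concentrability coefficient $C_{\mathrm{rob}}^{\star}$ adapts this notion to the robust MDP by accounting for the occupancies induced by the robust optimal policy over the uncertainty set; please see Assumption 1 in the paper for the exact definition.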
### The distance between the learned policy and the optimal policy.
Thank you for raising this wonderful question! Indeed, our guarantee bounds the robust value gap, not the distance on the policy itself. It is an interesting future direction to translate the robust value gap into a distance on the policy, which we believe hinges on structural properties of the robust MDP itself rather than on the concentrability coefficient (which concerns the history dataset).
### Difference with Xie et al., (2021), Zanette et al. (2021), Uehara et al., (2021)
Thank you so much for mentioning these related works! We have now cited all of these papers in the revision. We would like to highlight that we target a very different problem, namely **robust offline RL**, which maximizes the worst-case performance over an uncertainty set of environments, rather than the *vanilla offline RL* problem considered in these works [Xie et al., (2021); Zanette et al. (2021); Uehara et al., (2021)]. Therefore, the results in these papers do not apply to the robust offline RL problem and cannot be compared directly with ours. To the best of our knowledge, our work is indeed the first provable near-optimal algorithm for the robust offline RL problem under partial coverage.
[1] Chen, Jinglin, and Nan Jiang. "Information-theoretic considerations in batch reinforcement learning." International Conference on Machine Learning. PMLR, 2019.
[2] Rashidinejad, Paria, et al. "Bridging offline reinforcement learning and imitation learning: A tale of pessimism." Advances in Neural Information Processing Systems 34 (2021): 11702-11716.
[3] Xie, Tengyang, et al. "Policy finetuning: Bridging sample-efficient offline and online reinforcement learning." arXiv preprint arXiv:2106.04895 (2021).
[4] Li, Gen, et al. "Settling the sample complexity of model-based offline reinforcement learning." arXiv preprint arXiv:2204.05275 (2022).
[5] Uehara, Masatoshi, and Wen Sun. "Pessimistic model-based offline reinforcement learning under partial coverage." arXiv preprint arXiv:2107.06226 (2021).
# Review 5:
### Comparisons to [1]
We believe there might be some confusion in comparing our work with [1]. Although the algorithm in [1] contains a min-max formulation, it in fact targets a very different problem, vanilla offline RL, while we consider robust offline RL, i.e., being robust w.r.t. possible inherent model/transition uncertainty (which is not present in vanilla offline RL). **This is best seen from the resulting performance metric: [1] bounds the standard value gap $V^{\pi^*} - V^{\hat{\pi}}$ in the standard MDP, while we bound the robust value gap $V^{\pi^*,\sigma} - V^{\hat{\pi},\sigma}$ in the robust MDP.** Therefore, the algorithm in [1] cannot be applied to solve the robust offline RL problem that we focus on, and the theoretical results in [1] and ours are not directly comparable.
[1] Uehara, Masatoshi, and Wen Sun. "Pessimistic model-based offline reinforcement learning under partial coverage." arXiv preprint arXiv:2107.06226 (2021).
### Clarification of contributions
Our contribution lies in designing and analyzing the first provably efficient algorithm for robust offline RL under partial coverage, as backed by theoretical analysis of sample size requirements with both upper and lower bounds. Please refer to the general response that provides a clearer comparison with prior art under both full coverage and partial coverage assumptions.
### applications and numerical experiments
Thanks so much for raising this question. To evaluate our provable method in practice, we **uploaded a revision with a new section (Section 4)**, which conducts experiments on the Gambler problem, a benchmark widely considered in the robust RL literature. The results show that, compared to distributionally robust value iteration (DRVI), our method achieves much better accuracy using the same dataset, corroborating the sample efficiency brought by our tailored pessimistic penalty.
### Is the pessimistic penalty (21) necessary or important?
Indeed, the pessimistic penalty is necessary and important both for empirical performance and theoretical analysis, which we elaborate below.
* **The role of pessimism in practical performance**
The reviewer is right that the pessimistic penalty in (21) is designed to address the partial coverage of the history dataset arising from the offline sampling mechanism. The pessimistic penalty (21) is the key to achieving better accuracy with the same sample size, corroborating the sample efficiency of the proposed method, **shown in [Fig. 1(a-b) in the revision](https://i.imgur.com/e7WU0ke.png).** Here, DRVI is the baseline without the pessimistic penalty, and DRVI-LCB is our method with the tailored pessimistic penalty added.
<!--  -->
* **The role of pessimism in theoretical sample efficiency**
The tailored pessimistic penalty in (21) is one of the keys to achieving near-optimal sample complexity in robust offline RL, especially for addressing the partial-coverage challenge of the history dataset. Its significance is also corroborated by the comparison with prior work in terms of sample complexity, as detailed in the general response.
### Further clarification of the role of the penalty term in the theoretical results
Thank you so much for raising this question. We recall that the performance gap guarantee critically depends on the size of the pessimistic penalty (see (54)), which comes from upper bounding the perturbation in Lemma 8. Therefore, being able to tightly control this penalty term leads to a tighter performance guarantee.
Regarding the $c_b$ term in (21), it is a universal constant that is already absorbed in the large enough constant $c_0$ in Theorem 1 (cf. (26)).
# To AC:
Dear Area Chairs,
We thank all the reviewers / Area Chairs / Meta Reviewers for their time and effort in reviewing our paper. We believe this work matches the community's interest and can contribute to a lot of meaningful future work. Accordingly, we carefully <span style="color:blue">summarized and clarified our contributions and added numerical experiments as requested by the reviewers.</span> However, up to now, we have not received any replies or discussions. Since the deadline is approaching, we would really appreciate it if the ACs could encourage the reviewers to engage in the discussion. We also hope the ACs will consider our responses when making the decision if there is no reply from the reviewers after the rebuttal stage.
In particular, we want to point out that Reviewer 8J56's review raises ethical concerns and shows that the reviewer not only misses the core contribution of this paper but also dismisses it as a 'survey'. We certainly welcome suggestions and concerns, but not such dismissive remarks. We would really appreciate it if the ACs could encourage this reviewer to engage in the discussion, as we sincerely hope to address their concerns.
We respectfully hope the Area Chair / Meta Reviewer can take the quality of the reviews into account when making the decision. Thank you so much!
Best regards,
Authors