# Rebuttal for "Robust MDP", ICML 2023
### Reviewer 3tXt: Score 4: Post-rebuttal Update
* Our contribution.
> We agree with the reviewer on all points about the contribution of the paper:
> * An exact characterization of the optimal policy in robust MDPs.
> * Our methods are faster than those in the existing literature and are as efficient as non-robust MDPs (up to logarithmic factors).
> * Our methods clearly outperform generic numerical methods such as LP solvers and convex optimization.
* On Literature review.
> We thank the reviewer for the suggestions and would be happy to incorporate them in the updated draft.
* On presentation of the appendix.
> * We agree that the paper is technical. There is also a fair amount of bookkeeping (the cases $p=1,\infty$) and several extensions (Appendices C, D, E, F, H). Our main technical results are Lemma I.1 and Lemma J.1, which we will emphasize more, with more intuitive explanations, as suggested by the reviewer.
* I think with sufficient polish on the literature review and the presentation of the results, especially the proofs, the paper would be significantly improved and made much more accessible to the research community. I would like to keep my current score.
> * We find the reviewer's suggestions valuable and, at the same time, easily implementable within a day's work. The reviewer acknowledges that our work solves a well-known challenging problem and provides a superior method with a faster rate and an exact characterization of the optimal policy.
>* Hence, we respectfully disagree that these minor points are fair grounds for rejecting the work.
We thank the reviewer for responding to our rebuttal and reply to the comments below.
> It can be argued that the paper is superior to existing literature since it has a faster rate and it contains an exact characterization of the optimal policy.
Thanks for the suggestion! We believe this sentence is a perfect summary of our contributions and we will revise our paper accordingly.
> On literature review
Thanks for the feedback! We will incorporate these changes in the final revision.
> I think rearranging the materials and modifying the appendix (e.g. Appendix B, reducing the number of Appendix sections, and providing more intuition connecting each section/result) would make the results much easier to digest for researchers in the area.
Thanks for the suggestions! We will revise our paper as suggested and make the results more accessible.
We feel that the above improvements are minor and can be accomplished within a day's work or so. As the reviewer clearly appreciates the contributions of our work, we would like to encourage the reviewer to consider raising the score. As the reviewer summarizes, our work has a faster rate and contains an exact characterization of the optimal policy, advancing the existing literature. Hence, rejecting it for minor cosmetic issues that can be easily fixed seems like a waste.
### Message to the AC:
Dear AC,
Thank you for your time and effort in handling our submission! We are writing to bring Reviewer 3tXt's evaluation to your attention.
The reviewer is satisfied with our contributions and rebuttal. The remaining comments concern the literature review and the presentation of the appendix, which we believe are minor and can easily be incorporated in the final version. Thus, we are somewhat confused by the reviewer's decision to keep the negative score. Rejecting the paper for minor cosmetic issues that can be easily fixed seems like a waste.
Thank you for your time and consideration.
Best regards,
Paper 1212 Authors
Our work solves s-rectangular robust MDPs, which are well known to be challenging problems, as Reviewer 3tXt notes. The reviewer further finds the proposed methods sound, with theoretical guarantees matching those of non-robust MDPs both in theory and in numerical simulations. Moreover, Reviewer d5VS finds our work very well written and providing plenty of insights for future directions. In addition, our work avoids an unrealistic assumption on the kernel, which reduces the gap between theory and practice, as pointed out by Reviewer KexR.
Reviewer 3tXt suggested some polishing of the literature review and the appendix. These, we believe, are very minor improvements, which we will incorporate in the final version. Apart from this, the paper attracted no criticism from any reviewer. Hence, we disagree that rejection by Reviewer 3tXt is a fair evaluation of the work.
## Reviewer d5VS, Score 7
### Strengths And Weaknesses:
* The paper is very well-written and the proposed method provides plenty of insights for future directions. I can see how the new robust Bellman operator will allow deriving other popular non-robust methods such as Q-learning and policy gradients on the robust MDP. The new closed-form expression also makes Lp robust MDP seem not really “harder” than a regular MDP as it can be seen with a non-robust operator with an extra regularization term.
* The only weakness I can think of is the limitation to the Lp uncertainty set. The proposed methods and techniques pretty much rely on the uncertainty set defined through Lp distance. It is thus unclear whether this idea can be generalized to other uncertainty set or if it suggests Lp robust MDP is a very special case of robust MDP.
### Questions:
* Could you comment on the potential possibilities of extending this to other uncertainty sets or whether Lp robust MDP is a special case of robust MDP?
> * We thank the reviewer for the suggestion to generalize our results to general norms. We agree that the work considers uncertainty sets constrained by the $L_p$ norm, which we believe is an important and natural family of norms to consider. Moreover, the work can be trivially generalized to any norm $\lVert \cdot \rVert$. We demonstrate it here for the sa-rectangular case: the $\kappa_q$ in Theorem 3.1 would change to
$$\kappa_{\lVert\cdot\rVert}(v) = \min_{\lVert c \rVert \leq 1, \langle 1,c\rangle =0} \langle c, v\rangle.$$
Similarly, the generalization to the s-rectangular case is also possible. For $L_p$ norms these terms ($\kappa_q$, etc.) are exactly computable; a numerical sketch for a general norm is given after this response. Thanks to the reviewer, we would be very happy to include this discussion in the revised draft.
>* This work also opens up the direction of generalizing these results to KL-divergence-constrained uncertainty sets. Note that the KL divergence is not a norm.
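To make the norm-agnostic definition of $\kappa$ above concrete, below is a minimal Python sketch (not from the paper) that evaluates $\kappa_{\lVert\cdot\rVert}(v)$ with a generic constrained solver. The function name `kappa_general` and the use of `scipy.optimize.minimize` are our illustrative choices here; for $L_p$ norms the paper instead uses exact, efficiently computable expressions.

```python
import numpy as np
from scipy.optimize import minimize

def kappa_general(v, norm=lambda c: np.linalg.norm(c, ord=2)):
    """Illustrative sketch: kappa_{||.||}(v) = min_{||c|| <= 1, <1,c> = 0} <c, v>.
    For L_p norms the paper computes this quantity exactly; this generic
    solver only mirrors the norm-agnostic definition above."""
    v = np.asarray(v, dtype=float)
    # Feasible, non-degenerate starting point: negated centered values,
    # rescaled into the unit ball of the chosen norm.
    c0 = -(v - v.mean())
    if norm(c0) > 0:
        c0 = c0 / norm(c0)
    constraints = (
        {"type": "eq",   "fun": lambda c: c.sum()},         # <1, c> = 0
        {"type": "ineq", "fun": lambda c: 1.0 - norm(c)},   # ||c|| <= 1
    )
    res = minimize(lambda c: c @ v, c0, constraints=constraints, method="SLSQP")
    return res.fun

# Example: kappa_general(np.array([1.0, 2.0, 4.0]))
# For the L_2 norm this equals -||v - mean(v)*1||_2.
```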
## Reviewer KexR, Score 6
### Strengths And Weaknesses:
* The paper avoids an unrealistic assumption on the noise transition kernel, which assumes each row of the noisy transition kernel to be zero. It is hard to ensure such an assumption is satisfied in practice; thus, removing this assumption in the proposed solution reduces the gap between theory and practice.
* The paper provides a detailed analysis of the proposed method, and both theoretical and numerical experiments are included.
* In the meanwhile, there remains a point unclear to me. The assumption on the noise transition kernel is much weaker than previous work by removing the requirement that the sum of each row in the noise transition kernel be zero. However, the noise kernel is still affecting the learning process. According to Theorem 3.3 and equation 8 $(\mathcal{T}^*_\mathcal{U}v)(s)$
is the minimum value of $x$ such that $$ [\sum_{a}(Q(s,a)-x)^p\mathbb{1}(Q(s,a)\leq x)]^{\frac{1}{p}} = \sigma, $$ where $\sigma$ is defined as $\alpha_s +\gamma\beta_s\kappa_q(v)$, and $\alpha$ and $\beta$ are the noise radii of the reward and transition matrices, respectively. On the left column of page 6, the paper explains that the solution of equation 8 can be approximated by a binary search over the interval $[\max_{a}Q(s,a)-\sigma, \max_{a}Q(s,a)]$. Therefore, when the noise ($\alpha$ and $\beta$) is larger, $\sigma$ will be larger and the interval to search in will be larger. Given these conditions, it remains unclear to me why the radius vector does not play a role in the time complexity of the proposed solution. I would appreciate it if the authors could explain it a bit more. In practice, this radius vector can vary a lot. If it affects the time complexity as well, it might be worth discussing it.
> We thank the reviewer for the question. The reviewer is right to observe that the time complexity of the binary search depends on $\sigma$. However, it only adds an additional $O(\log(\sigma))$ term to the time complexity of evaluating the optimal robust Bellman operator. Further, $\sigma$ can be bounded above by $O(\frac{\max_{s}\alpha_{s}}{1-\gamma})$. Note that $\beta_s \leq 1$, as it is the noise radius of the kernel.
>* We will add this discussion to the appendix in detail in the revised version; an illustrative sketch of the binary search is given below.
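As an illustration of the binary-search step discussed above, here is a minimal, hedged sketch (not the paper's code; function and argument names are ours). It assumes the defining equation uses the positive part $(Q(s,a)-x)_+$, i.e. the indicator $Q(s,a)\geq x$, which makes the left-hand side decreasing in $x$ and is consistent with the stated search interval $[\max_{a}Q(s,a)-\sigma, \max_{a}Q(s,a)]$; the number of iterations scales as $O(\log(\sigma/\mathrm{tol}))$, matching the extra logarithmic factor mentioned above.

```python
import numpy as np

def robust_backup_binary_search(q_row, sigma, p, tol=1e-8):
    """Sketch: find x in [max_a Q - sigma, max_a Q] such that
    (sum_a max(Q(s,a) - x, 0)^p)^(1/p) = sigma  (assumed form of Eq. 8)."""
    f = lambda x: float((np.clip(q_row - x, 0.0, None) ** p).sum() ** (1.0 / p))
    lo, hi = q_row.max() - sigma, q_row.max()
    while hi - lo > tol:              # O(log(sigma / tol)) iterations
        mid = 0.5 * (lo + hi)
        if f(mid) > sigma:            # f is decreasing in x: move right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: robust_backup_binary_search(np.array([1.0, 0.5, 0.2]), sigma=0.3, p=2)
```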
#### Small Things:
On the left column on page 6, the left quotation marks in “sub-optimality distance” and “uncertainty penalty” should be ``.
> Thanks for the observation; we will fix it in the revised version.
## Reviewer 3tXt, Score 4
### Strengths
* s-rectangular robust MDPs are well-known to be challenging problems.
* The proposed methods are sound and the theoretical guarantees match those of non-robust MDPs both in theory and in numerical simulations.
### Weaknesses
* The literature review part of the work could be improved. For instance, in lines 76 - 77, the authors commented on the existing methods' reliance on black-box solvers for solving linear programs, yet the discussion was somewhat vague. Pointing to the experiments done in the paper itself or comparing the computation complexity bounds for the approaches could better justify the claims made in the literature review section. Please see below for additional questions on the work's relationship with existing literature.
>* [Related work] We thank the reviewer for these suggestions; we will improve the literature review accordingly (see also Appendix A) in the revised version.
* It is unclear what the technical novelties are. One of the key results is the generalization of the "water-filling" lemma, which seems to be the result of directly applying the KKT conditions to a convex optimization problem to derive the exact solutions. Can the authors comment a bit more on the technical challenges encountered?
> * The reviewer is right to observe that we obtained our technical results (the generalized water-filling lemma and others) using simple tools such as the KKT conditions. These results help us unveil the nature of optimal robust policies and facilitate efficient computation of the robust Bellman operator. The work makes a useful and important contribution to robust MDPs using simple mathematical tools, which we consider a salient feature of the work.
* The presentation of the work could be improved significantly. Currently the remarks and definitions are somewhat cluttered. For instance, in Proposition I.2 it might be useful to point out that the proposition holds only for finite $p$'s. Lemma J.1, one of the key results in the paper, is really difficult to understand as is. Perhaps the lemma could be broken down into a collection of smaller lemmas to facilitate understanding? See also additional questions below.
>* Yes, Proposition I.2 holds for $p\in[1,\infty)$. Thanks, we will state this explicitly in the updated version.
>* We agree with the reviewer that Lemma J.1 is long. We tried to collect all the properties of (63) in one place for easier referencing. We will restructure this lemma as a subsection, broken into smaller results, with more intuitive explanations in the next version.
* While the submission and the supplements should be anonymized, it is hard to verify that the proposed method does significantly outperform LP-based approaches as Table 9 suggests. A tarball containing the code used for the experiments could better validate the experiment results.
>* We will make all our code and experimental results public on GitHub. The outperformance of our methods over LP-based and other numerical methods is intuitive, as our methods exploit the structure of robust MDPs to compute the optimal robust Bellman operator, whereas numerical methods rely on brute force.
### Questions:
* Line 85 - 87. Could the authors clarify what they meant by the policy improvement algorithm being very slow and unstable? Can the authors provide empirical justification or theoretical bounds that show why these existing works with LP-based approaches are inferior? For instance, in non-robust MDP we know LP-based approaches are arguably inferior because they converge at $O(S^3A\log(\frac{1}{\epsilon}))$. Can the authors provide a more rigorous discussion of this paper's advantage? In particular, the paper [1] (which is cited by [2]) shows that projecting on to the probability simplex can be efficient.
> * [Numerical approaches] As the reviewer pointed out, LP-based approaches take at least $O(S^3A\log(\frac{1}{\epsilon}))$ time to converge in robust MDPs, whereas our methods have complexity $O(S^2A\log(\frac{1}{\epsilon}))$, up to logarithmic factors in $S$ and $A$. Hence, our methods are superior to LP-based approaches.
> * [Derman et al. 2021] This is discussed in detail in Appendix A (related work). The paper provides a policy gradient method for reward-robust MDPs. For general robust MDPs, it proposes the $R^2$MPI algorithm (Algorithm 1), which performs policy improvement via simplex projection and policy evaluation $m$ times in every iteration; the time complexity of this approach is not provided in the paper. In contrast, we compute the optimal robust Bellman operator $\mathcal{T}^*_{\mathcal{U}}v$ directly (in closed form or by binary search) at every iteration.
>* The advantage of our methods is that they provide insight into the problem and do not rely on the brute force of numerical methods. Further, our methods have the potential to scale to large problems (online and deep learning settings), while it is not clear how the existing approaches ($R^2$MPI, LP-based methods) can be generalized to those settings.
* Lines 136 - 139. Are the uncertainty sets assumed to be s-rectangular or sa-rectangular throughout the paper?
> Thanks for pointing this out. The uncertainty set is assumed to be s-rectangular throughout the paper; the sa-rectangular case is a special case, for which simplified results are given in Section 3.1.
* Line 184. Is this a typo? Should the simplex condition restrict $\mathcal{P}_{s,a}$ to sum to one instead?
> We are sorry for the confusion and will improve the presentation of this part in the revised version. $\mathcal{P}_{sa}$ is the noise set of the kernel, hence its rows sum to $0$. Since $P_0$ is a valid kernel whose rows sum to $1$, every kernel in $(P_0 + \mathcal{P})$ has rows summing to $1$.
* The experimental results are a bit hard to parse at the moment. How many times were the experiments repeated? Can the authors comment a bit more on the standard deviations in the relative performances? Can the authors comment a bit more on the linear programs? What are these linear programs and how do these linear programs related to the existing methods mentioned in the literature review section?
>* Each experiment was repeated 500 times. The typical relative standard deviation was
>* The optimal robust Bellman operator is not an LP objective in general.
>>* For the sa-rectangular $L_1$ optimal robust operator: we exploited the structure that greedy policies are deterministic. Hence, we formulated the robust Bellman update as
$$v_{n+1}(s) = \max_{a}\Bigl[R_{0}(s,a)-\alpha_{sa}-\gamma\beta_{sa}\kappa(v_n) + \gamma\sum_{s'}P_0(s'|s,a)v_n(s')\Bigr],$$
where we solved $\kappa(v) = \min_{p\in \mathcal{P}_{sa}}\langle p, v\rangle$ via linear programming; a minimal sketch of this LP appears after this list.
>>* For the s-rectangular $L_1$ optimal robust operator: we used scipy.optimize.minimize for the policy maximization and an LP for the minimization over the uncertainty sets.
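For concreteness, below is a minimal, hedged sketch of the kind of LP we used for $\kappa(v)$ above (not the exact experiment code; the function name is ours). It assumes the sa-rectangular $L_1$ noise set has the form $\{c : \lVert c\rVert_1 \le \beta,\ \langle 1, c\rangle = 0\}$; the experiments may additionally keep $P_0 + c$ a valid kernel. The $L_1$ ball is handled with the standard $|c_i|\le t_i$ reformulation and scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def kappa_L1_via_lp(v, beta):
    """Minimize <c, v> over {c : ||c||_1 <= beta, <1, c> = 0} as an LP,
    with decision variables z = [c, t] in R^{2n} and |c_i| <= t_i."""
    v = np.asarray(v, dtype=float)
    n = v.size
    cost = np.concatenate([v, np.zeros(n)])              # minimize <c, v>
    A_ub = np.block([
        [np.eye(n),        -np.eye(n)],                  #  c_i - t_i <= 0
        [-np.eye(n),       -np.eye(n)],                  # -c_i - t_i <= 0
        [np.zeros((1, n)),  np.ones((1, n))],            #  sum_i t_i <= beta
    ])
    b_ub = np.concatenate([np.zeros(2 * n), [beta]])
    A_eq = np.block([[np.ones((1, n)), np.zeros((1, n))]])   # <1, c> = 0
    b_eq = np.zeros(1)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.fun

# Example: kappa_L1_via_lp(np.array([1.0, 2.0, 4.0]), beta=0.5)  # -> -0.75
```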
[1] Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (pp. 272-279).
[2] Derman, E., Geist, M., & Mannor, S. (2021). Twice regularized MDPs and the equivalence between robustness and regularization. Advances in Neural Information Processing Systems, 34, 22274-22287.