    # Rebuttal for "Robust MDP", ICML 2023 ### Reviever 3tXt: Score 4: Post-rebuttal Update * Our contribution. > We agree with the reviewer on all points about the contribution of the paper: > * An exact characterization of the optimal policy in robust MDPs. > * Our methods are faster than existing literature, which is as efficient as non-robust MDPs (upto logarithmic factors). > * Our methods are trivially superior than numerical methods such as LP solver, convex optimization, etc. * On Literature review. > We thank the reviwer for suggestion. We would happy to incorporate the suggestions in the updated draft. * On presentation of the appendix. > * We agree that the paper is technical. Also there are many book-keepings (cases $p=1,\infty$) and extentions (appendix C,D,E,F,H). Our main technical results are Lemma I.1 and Lemma J.1, which we will try to put more emphasis on, with more inuitive explanations, as suggested by the reviewer. * I think with sufficient polish on the literature review and the presentation of the results, especially the proofs, the paper would be significantly improved and made much more accessible to the research community. I would like to keep my current score.e > * We find the reviewers suggestions valuable and at the same time easily implementable in a day work. The reviewer finds our work solving a well known challenging problem, and provides a superior method with faster rate and an exact characterization of the optimal policy. >* Hence, we disagree with the reviewer, that these minor points are the fair ground for rejecting the work. We thank the reviewer for responding to our rebuttal and reply to the comments below. > It can be argued that the paper is superior to existing literature since it has a faster rate and it contains an exact characterization of the optimal policy. Thanks for the suggestion! We believe this sentence is a perfect summary of our contributions and we would revise our paper accordingly. > On literature review Thanks for the feedback! We will incorporate these changes in the final revision. > I think rearranging the materials and modifying the appendix (e.g. Appendix B, reducing the number of Appendix sections, and providing more intuition connecting each section/result) would make the results much easier to digest for researchers in the area. Thanks for the suggestions! We will revise our paper as suggested and make the results more accessible. We feel that the above improvements are minor and can be accomplished within a day’s work or so. As the reviewer clearly appreciates the contributions of our work, we would like to encourage the reviewer to consider raising the score. As the reviewer summarizes, our work has a faster rate and contains an exact characterization of the optimal policy, advancing existing literature. Hence, we believe rejecting it for minor cosmetic issues that can be easily fixed, seems like a waste. ### Message to the AC: Dear AC, Thanks for your time and efforts in handling our sumbission! We are writing to raise your attention on Review 3tXt's evaluations. The reviewer is satisfied with our contributions and rebuttal. The remaining comments are about literature review and the presentation of appendix, which we believe are minor and can be easily incorporated in the final verison. Thus, we are a little bit confused with the reviewer's decision of keeping the negative score. We believe rejecting it for minor cosmetic issues that can be easily fixed, seems like a waste. Thank you for your time and consideration. 
Best regards,
Paper 1212 Authors

Our work solves s-rectangular robust MDPs, which are well known to be challenging problems, according to reviewer 3tXt. The reviewer further finds the proposed methods sound, with theoretical guarantees matching those of non-robust MDPs both in theory and in numerical simulations. Moreover, reviewer d5VS finds our work very well written and providing plenty of insights for future directions. In addition, our work avoids an unrealistic assumption on the kernel, which reduces the gap between theory and practice, as pointed out by reviewer KexR. Reviewer 3tXt suggested some polishing of the literature review and the appendix. We believe these are very minor improvements, which we will incorporate in the final version. Apart from this, the paper attracts no criticism from any reviewer. Hence, we disagree that rejection by reviewer 3tXt is a fair evaluation of the work.

## Reviewer d5VS, Score 7

### Strengths And Weaknesses:

* The paper is very well-written and the proposed method provides plenty of insights for future directions. I can see how the new robust Bellman operator will allow deriving other popular non-robust methods such as Q-learning and policy gradients on the robust MDP. The new closed-form expression also makes the Lp robust MDP seem not really "harder" than a regular MDP, as it can be seen as a non-robust operator with an extra regularization term.
* The only weakness I can think of is the limitation to the Lp uncertainty set. The proposed methods and techniques pretty much rely on the uncertainty set defined through the Lp distance. It is thus unclear whether this idea can be generalized to other uncertainty sets or if it suggests the Lp robust MDP is a very special case of robust MDPs.

### Questions:

* Could you comment on the potential possibilities of extending this to other uncertainty sets or whether the Lp robust MDP is a special case of robust MDPs?
  > * We thank the reviewer for the suggestion to generalize our results to general norms. We agree that the work considers uncertainty sets constrained by the $L_p$ norm; we believe it is an important and natural family of norms to consider. Moreover, the work can trivially be generalized to any norm $\lVert \cdot \rVert$. We demonstrate this here for the sa-rectangular case: the $\kappa_q$ in Theorem 3.1 would change to $$\kappa_{\lVert\cdot\rVert}(v) = \min_{\lVert c \rVert \leq 1,\ \langle 1,c\rangle =0} \langle c, v\rangle.$$ Similarly, the generalization to the s-rectangular case is also possible. For $L_p$ norms these terms ($\kappa$, etc.) are exactly computable; for a general norm the minimization can be evaluated numerically (see the sketch below). Thanks to the reviewer, we would be very happy to include this discussion in the revised draft.
  > * This work also opens up the direction of generalizing these results to KL-divergence-constrained uncertainty sets. Note that the KL divergence is not a norm.
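For illustration only (this is our sketch, not part of the paper), the general-norm $\kappa$ above can be evaluated numerically with an off-the-shelf constrained solver; the function name `kappa_general_norm` and the choice of `scipy.optimize.minimize` with SLSQP are assumptions of this sketch.

```python
# Minimal sketch: numerically evaluate
#   kappa_{||.||}(v) = min_{||c|| <= 1, <1, c> = 0} <c, v>
# for an arbitrary norm (here defaulting to L2 as an example).
import numpy as np
from scipy.optimize import minimize

def kappa_general_norm(v, norm=lambda c: np.linalg.norm(c, 2)):
    """Minimize <c, v> over {c : ||c|| <= 1, sum(c) = 0}."""
    v = np.asarray(v, dtype=float)
    c0 = -(v - v.mean())                       # centered, descent-aligned start
    c0 = c0 / (norm(c0) + 1e-12)               # scale into the unit norm ball
    constraints = [
        {"type": "ineq", "fun": lambda c: 1.0 - norm(c)},  # ||c|| <= 1
        {"type": "eq",   "fun": lambda c: np.sum(c)},      # <1, c> = 0
    ]
    res = minimize(lambda c: float(np.dot(c, v)), x0=c0,
                   constraints=constraints, method="SLSQP")
    return res.fun

# Sanity check: for the L2 norm, the minimum is -||v - mean(v)||_2 in closed form.
v = np.array([1.0, 2.0, 4.0])
print(kappa_general_norm(v), -np.linalg.norm(v - v.mean()))
```

For the $L_2$ norm the closed form $-\lVert v - \bar v\rVert_2$ (with $\bar v$ the mean of $v$) gives a quick check of the numerical value.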
## Reviewer KexR, Score 6

### Strengths And Weaknesses:

* The paper avoids an unrealistic assumption on the noise transition kernel, which requires each row of the noise transition kernel to sum to zero. It is hard to ensure such an assumption is satisfied in practice; thus, removing this assumption in the proposed solution reduces the gap between theory and practice.
* The paper provides a detailed analysis of the proposed method, and both theoretical and numerical experiments are included.
* In the meanwhile, there remains a point unclear to me. The assumption on the noise transition kernel is much weaker than in previous work, by removing the requirement that the sum of each row in the noise transition kernel be zero. However, the noise kernel still affects the learning process. According to Theorem 3.3 and equation 8, $(\mathcal{T}^*_\mathcal{U}v)(s)$ is the minimum value of $x$ such that $$ \Big[\sum_{a}\big(Q(s,a)-x\big)^p\,\mathbb{1}\big(Q(s,a)\leq x\big)\Big]^{\frac{1}{p}} = \sigma, $$ where $\sigma$ is defined to be $\alpha_s +\gamma\beta_s\kappa_q(v)$, and $\alpha$ and $\beta$ are the noise radii of the reward and the transition kernel, respectively. In the left column of page 6, the paper explains that the solution of equation 8 can be approximated by a binary search over the interval $[\max_{a}Q(s,a)-\sigma, \max_{a}Q(s,a)]$. Therefore, when the noise ($\alpha$ and $\beta$) is larger, $\sigma$ will be larger and the interval to search over will be larger. Given these conditions, it remains unclear to me why the radius vector does not play a role in the time complexity of the proposed solution. I would appreciate it if the authors could explain it a bit more. In practice, this radius vector can vary a lot. If it affects the time complexity as well, it might be worth discussing it.
  > * We thank the reviewer for the question. The reviewer is right to observe that the time complexity of the binary search depends on $\sigma$. But it only adds an additional $O(\log(\sigma))$ to the time complexity of evaluating the optimal robust Bellman operator (see the sketch below). Further, $\sigma$ can be bounded above by $O(\frac{\max_{s}\alpha_{s}}{1-\gamma})$. Note that $\beta_s \leq 1$, as it is the noise in the kernel.
  > * We will add this discussion to the appendix in detail, in the revised version.
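To make the complexity point above concrete, here is a minimal sketch of the binary search (ours, not the paper's code), reading the penalty term in equation 8 as the hinge $(x - Q(s,a))_+^p$; the function name, the tolerance `eps`, and the example numbers are assumptions of this sketch.

```python
# Minimal sketch: bisection for equation (8) at a fixed state s.
# Q is the vector (Q(s, a))_a; sigma = alpha_s + gamma * beta_s * kappa_q(v).
import numpy as np

def robust_operator_value(Q, sigma, p=2, eps=1e-8):
    """Smallest x in [max(Q) - sigma, max(Q)] with
    (sum_a max(x - Q_a, 0)^p)^(1/p) = sigma, assuming a root lies in that interval."""
    f = lambda x: np.sum(np.maximum(x - Q, 0.0) ** p) ** (1.0 / p)
    lo, hi = np.max(Q) - sigma, np.max(Q)
    # f is nondecreasing in x, so bisection needs O(log(sigma / eps)) iterations.
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if f(mid) < sigma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example usage with made-up numbers:
Q = np.array([1.0, 0.9, 0.5])
print(robust_operator_value(Q, sigma=0.3, p=2))
```

Since the left-hand side is nondecreasing in $x$, the number of iterations is $O(\log(\sigma/\epsilon))$, which is the $O(\log \sigma)$ overhead mentioned in the response above.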
#### Small Things:

On the left column on page 6, the left quotation marks in "sub-optimality distance" and "uncertainty penalty" should be ``.
> Thanks for the observation; we will update it in the revised version.

## Reviewer 3tXt, Score 4

### Strengths

* s-rectangular robust MDPs are well-known to be challenging problems.
* The proposed methods are sound and the theoretical guarantees match those of non-robust MDPs both in theory and in numerical simulations.

### Weaknesses

* The literature review part of the work could be improved. For instance, in lines 76 - 77, the authors commented on the existing methods' reliance on black-box solvers for solving linear programs, yet the discussion was somewhat vague. Pointing to the experiments done in the paper itself or comparing the computational complexity bounds of the approaches could better justify the claims made in the literature review section. Please see below for additional questions on the work's relationship with existing literature.
  > * Related work: please see our responses to the questions below.
* It is unclear what the technical novelties are. One of the key results is the generalization of the "water-filling" lemma, which seems to be the result of directly applying the KKT conditions to a convex optimization problem to derive the exact solutions. Can the authors comment a bit more on the technical challenges encountered?
  > * The reviewer is right to observe that we obtained our technical results (the generalized water-filling lemma and others) using simple tools such as KKT conditions. These results help us unveil the nature of optimal robust policies and facilitate efficient computation of the robust Bellman operator. This work makes a useful and important contribution to robust MDPs using simple mathematical tools, which is a salient feature of the work.
* The presentation of the work could be improved significantly. Currently the remarks and definitions are somewhat cluttered. For instance, in Proposition I.2 it might be useful to point out that the proposition holds only for finite $p$'s. Lemma J.1, one of the key results in the paper, is really difficult to understand as is. Perhaps the lemma could be broken down into a collection of smaller lemmas to facilitate understanding? See also additional questions below.
  > * Yes, Proposition I.2 holds for $p\in[1,\infty)$. Thanks, we will mention it in the updated version.
  > * We agree with the reviewer that Lemma J.1 is long. We tried to put all the properties of (63) in one place for better referencing. We will try to make this lemma a sub-section in the next version, with a more intuitive explanation.
* While the submission and the supplements should be anonymized, it is hard to verify that the proposed method does significantly outperform LP-based approaches as Table 9 suggests. A tarball containing the code used for the experiments could better validate the experiment results.
  > * We will make all our code and experimental results public on GitHub. The outperformance of our methods over LP and numerical methods is intuitive, as our methods exploit the structure of robust MDPs to compute optimal robust policies, whereas numerical methods rely on brute force.

### Questions:

* Line 85 - 87. Could the authors clarify what they meant by the policy improvement algorithm being very slow and unstable? Can the authors provide empirical justification or theoretical bounds that show why these existing works with LP-based approaches are inferior? For instance, in non-robust MDPs we know LP-based approaches are arguably inferior because they converge at $O(S^3A\log(\frac{1}{\epsilon}))$. Can the authors provide a more rigorous discussion of this paper's advantage? In particular, the paper [1] (which is cited by [2]) shows that projecting onto the probability simplex can be efficient.
  > * [Numerical approaches] As the reviewer pointed out, LP-based approaches must take at least $O(S^3A\log(\frac{1}{\epsilon}))$ to converge in robust MDPs, while our methods have complexity $O(S^2A\log(\frac{1}{\epsilon}))$, ignoring logarithmic factors in $S$ and $A$. Hence, our methods are superior to LP approaches.
  > * [Derman et al. 2021] This is discussed in detail in Appendix A (related work). The paper provides policy gradient for reward-robust MDPs, and for general robust MDPs it proposes the $R^2$MPI algorithm (Algorithm 1). That algorithm performs policy improvement via simplex projection and policy evaluation $m$ times in every iteration. The time complexity of this approach is not provided in the paper. In contrast, we compute the optimal robust Bellman operator $\mathcal{T}^*_{\mathcal{U}}v$ directly (in closed form or by binary search) at every iteration.
  > * The advantage of our methods is that they provide insight into the problem and do not rely on the brute force of numerical methods. Further, our methods have the potential to scale to large problems (online and deep learning settings), while it is not clear how the other works ($R^2$MPI, LP methods) can be generalized to those settings.
* Lines 136 - 139. Are the uncertainty sets assumed to be s-rectangular or sa-rectangular throughout the paper?
  > Thanks for pointing it out. The uncertainty set is assumed to be s-rectangular throughout the paper; the sa-rectangular case is a special case, for which simplified results are given in Section 3.1.
* Line 184. Is this a typo? Should the simplex condition restrict $\mathcal{P}_{s,a}$ to sum to one instead?
  > We are sorry for the confusion; we will improve the presentation of this part in the revised version. $\mathcal{P}_{sa}$ is the noise set in the kernel, hence its rows sum to $0$. Since $P_0$ is a valid kernel whose rows sum to 1, every kernel in $(P_0 + \mathcal{P})$ has rows summing to 1.
* The experimental results are a bit hard to parse at the moment. How many times were the experiments repeated? Can the authors comment a bit more on the standard deviations in the relative performances? Can the authors comment a bit more on the linear programs? What are these linear programs and how do they relate to the existing methods mentioned in the literature review section?
  > * Each experiment was repeated 500 times. The typical relative standard deviation was
  > * The optimal robust Bellman operator is not an LP objective in general.
  >   * For the sa-rectangular $L_1$ optimal robust operator: we exploited the structural fact that greedy policies are deterministic. Hence, we formulated robust Q-learning as $$Q_{n+1}(s,a) = \max_{a}\Big[R_{0}(s,a)-\alpha_{sa}-\gamma\beta_{sa}\kappa(v) + \gamma\sum_{s'}P_0(s'|s,a)v(s')\Big],$$ where we solved $\kappa(v) = \min_{p\in \mathcal{P}_{sa}}\langle p, v\rangle$ via linear programming (a sketch of such an LP is given after this list). This allowed us to solve the problem by LP.
  >   * For the sa-rectangular $L_1$ optimal robust operator: we used scipy.optimize.minimize for policy maximization and LP for the minimization over the uncertainty sets.
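For concreteness, here is a minimal sketch of such an LP using `scipy.optimize.linprog` (ours, not the paper's code). It assumes the sa-rectangular $L_1$ noise set is $\{p : \langle 1, p\rangle = 0,\ \lVert p \rVert_1 \leq \beta_{sa},\ P_0(\cdot|s,a) + p \geq 0\}$; the function name, the constraint encoding, and the example numbers are assumptions of this sketch.

```python
# Minimal sketch: kappa(v) = min_{p in P_sa} <p, v> as an LP,
# with P_sa = {p : sum(p) = 0, ||p||_1 <= beta, P0 + p >= 0}.
# The L1 constraint is linearized by splitting p = u - w with u, w >= 0.
import numpy as np
from scipy.optimize import linprog

def kappa_l1_lp(v, P0, beta):
    """Return kappa(v) and the minimizing noise vector p."""
    n = len(v)
    c = np.concatenate([v, -v])                      # <p, v> = <u, v> - <w, v>
    A_ub = np.vstack([
        np.ones(2 * n),                              # sum(u) + sum(w) <= beta  (i.e. ||p||_1 <= beta)
        np.hstack([-np.eye(n), np.eye(n)]),          # w - u <= P0              (i.e. P0 + p >= 0)
    ])
    b_ub = np.concatenate([[beta], P0])
    A_eq = np.concatenate([np.ones(n), -np.ones(n)]).reshape(1, -1)  # sum(p) = 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[0.0],
                  bounds=[(0, None)] * (2 * n), method="highs")
    u, w = res.x[:n], res.x[n:]
    return res.fun, u - w

# Example usage with made-up numbers: mass moves from the highest-value
# successor state to the lowest-value one, so kappa(v) = -0.2 here.
v = np.array([0.0, 1.0, 2.0])
P0 = np.array([0.5, 0.3, 0.2])
print(kappa_l1_lp(v, P0, beta=0.2)[0])
```

The split-variable trick ($p = u - w$) is a standard way to express an $L_1$ constraint in a linear program.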
[1] Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (pp. 272-279).

[2] Derman, E., Geist, M., & Mannor, S. (2021). Twice regularized MDPs and the equivalence between robustness and regularization. Advances in Neural Information Processing Systems, 34, 22274-22287.

Limitations:
