    # Rebuttal "PG for RMDP" ## Reviewer : f5qY : Score 7 ### Strengths: * Derivation of policy gradient theorems (theorems 3.6 and 3.7) for s-rectangular and sa-rectangular RMDPs with uncertainty set constrained by the $L_p$ norm is quite interesting, and many other policy gradient methods can be derived with these theorems. * It is noteworthy that the adversarial kernel for such an ambiguity set is a rank-one perturbation of the nominal kernel. ### Weakness: * The paper has a limited scope, which makes it less ground-breaking. In particular, it only works on uncertainty sets constrained by the $L_p$ norm. In another paper I am reviewing, a policy gradient method is proposed for a more general class of s-rectangular ambiguity sets. Their method, however, does not propose a closed form equation for RPG (i.e. the policy gradient theorem). However, I believe it is a solid piece of work. > We would like to thank the reviewer for the positive feedback. > > *"[the paper] only works on uncertainty sets constrained by the \(L_p\) norm"* > > Thank you for suggesting an extension to more general uncertainty sets. Although this work focuses on $L_p$-norm constrained uncertainty sets, it actually applies to any norm $\lVert \cdot \rVert$. We demonstrate it here for the $(s,a)$-rectangular case. The robust Q-value would satisfy the following equality: $Q^\pi_{\mathcal{U}^{sa}_{\lVert \cdot\rVert}}(s,a) = R_0(s,a)-\alpha_{sa}-\gamma\beta_{sa}\kappa_{\lVert\cdot\rVert}(v^\pi_{\mathcal{U}^{sa}_{\lVert \cdot\rVert}}) + \gamma\sum_{s'\in\mathcal{S}}P_0(s'|s,a)\pi(a'|s')Q^\pi_{\mathcal{U}^{sa}_{\lVert \cdot\rVert}}(s',a'),$ where $\kappa_{\lVert\cdot\rVert}(v) := \min_{\lVert c \rVert \leq 1, \langle \mathbf{1},c\rangle =0} \langle c, v\rangle$ (to be compared with the robust Q-value from Appx. A). On the other hand, the worst kernel in Thm 3.1 would change to: $P^\pi_{\mathcal{U}}(\cdot|s,a) = P_0(s,a) -\beta_{sa}u,$ where $u \in \arg\min_{\lVert c \rVert \leq 1, \langle \mathbf{1},c\rangle =0} \langle c, v^\pi_{\mathcal{U}}\rangle$ (to be compared with Lemma B.2). It is still a rank-one perturbation of the nominal kernel, so we can get the robust occupation measure from Lemma 3.5. Combining the robust Q-value with the robust occupation measure, we can get an RPG as in Thm 3.6. We can similarly derive an RPG for $s$-rectangular uncertainty sets with balls constrained by an arbitrary norm. In the $L_p$-case, it can additionally be computed in closed form when $p\in\{1,2,\infty\}$, thanks to the explicit expression of the $\kappa$-function (Tabs. 1 and 2). We thank the reviewer for this useful remark and would be very happy to include this discussion in the final version. ### Questions: * This paper relies heavily on properties of $L_p$ norm. I wonder do you have any idea for extending this paper to a more general uncertainty sets? > *"do you have any idea for extending this paper to a more general uncertainty sets?"* > As established above, our $L_p$-norm results can be generalized to arbitrary norms. We believe our work opens up an avenue for even more general uncertainty sets such as $f$-divergence constraints. Even in the particular case of KL, one challenge for deriving RPG lies in the fact that the basic properties of a distance are no longer satisfied. ## Reviewer : 1fLD : Score 4 ### Strengths * s-rectangular robust MDPs are a class of hard problems in robust RL. The paper features some novel techniques not found in existing works. 
## Reviewer : 1fLD : Score 4

### Strengths
* s-rectangular robust MDPs are a class of hard problems in robust RL. The paper features some novel techniques not found in existing works.
* The numerical results obtained in the paper are nice and illustrate the empirical performance of the proposed methods.

### Weaknesses
* While the paper is theoretical in nature, the experiment results are somewhat limited with crucial information missing, such as the exact number of repetitions and the standard deviation of the observed run times.
* Some of the results seem incomplete or poorly stated. Particularly, Lemma 3.5 references the non-existent Lemma 3.1 for the terms $b$ and $k$. What exactly are $b$ and $k$ referring to? If we look at Theorem 3.1, the terms in the upperbounds are $\alpha$ and $\beta$, which are both scalars, whereas $\alpha$ and $\beta$ are both $\lvert \mathcal{S}\rvert$-dimensional vectors. Additionally, what do $P_1$ and $P_2$ refer to? Is $P_1$ the transition probability of some arbitrary policy under $P_0$ and $P_2$ the transition probability under $P^\pi_\mathcal{U}$? For $D_i$, should we only consider $i \in \{1,2\}$? Is $D$ essentially $d$ (occupancy measure)?

> We thank the reviewer for the detailed comments. We shall address them below in order.
> * *"the exact number of repetitions and the standard deviation of the observed run times"*
> Thank you for pointing this out. Experiments were repeated 100 times, with standard deviations of 1-10%. We will get back to this below, in our response regarding lines 368-370.
> * *"Lemma 3.5 references the non-existent Lemma 3.1 for the terms $b$ and $k$. What exactly are $b$ and $k$ referring to?"*
> The reviewer is right. There is a typo in the sentence: "by plugging the values of $b$ and $k$ in the upper bound of Lemma 3.1" should instead be "by plugging the values of $b$ and $k$ from Theorems 3.1 and 3.2."
> * *"what do $P_1$ and $P_2$ refer to?"*
> Lemma 3.5 is a general statement expressing one occupancy matrix in terms of another, provided that their corresponding kernels differ by a rank-one matrix (i.e., $P_2$ is a rank-one perturbation of $P_1$). Then, leveraging the results in Thms. 3.1 and 3.2, which establish the worst kernel as a rank-one perturbation of the nominal, we can derive the robust occupancy matrix as follows (please see Appx. E for more details):
<!-- > * In the sa-rectangular case, we can get the robust occupancy matrix by setting $b(s) := \sum_{a}\pi(a|s)\beta_{s,a}$ and $k := u^\pi_{\mathcal{U}^{sa}_p}$ in Lemma 3.5. -->
>> * In the $(s,a)$-rectangular case, i.e., $\mathcal{U} = \mathcal{U}^{sa}_p$, we get the robust occupancy matrix by setting $b(s) := \sum_{a}\pi(a|s)\beta_{s,a}$, $k := u^\pi_{\mathcal{U}}$, $P_2(s'|s) := \sum_{a}\pi(a|s)P^\pi_\mathcal{U}(s'|s,a)$ and $P_1(s'|s) := \sum_{a}\pi(a|s)P_0(s'|s,a)$ in Lemma 3.5.
>> * In the $s$-rectangular case, i.e., $\mathcal{U} = \mathcal{U}^{s}_p$, we get the robust occupancy matrix by setting $b(s) := \lVert\pi_s\rVert_q\beta_{s}$, $k := u^\pi_{\mathcal{U}}$, $P_2(s'|s) := \sum_{a}\pi(a|s)P^\pi_\mathcal{U}(s'|s,a)$ and $P_1(s'|s) := \sum_{a}\pi(a|s)P_0(s'|s,a)$ in Lemma 3.5.
> Thank you for pointing out this potential source of confusion. We will accordingly expand Sec. 3.3 and include the above discussion in the final version (a small numerical illustration of this rank-one update follows this response).
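The rank-one structure invoked in this response lends itself to a Sherman-Morrison-style update of the occupancy matrix. The snippet below is a minimal editorial sketch, not the paper's Lemma 3.5 itself: it assumes the occupancy matrices take the form $D_i = (I-\gamma P_i)^{-1}$ and that $P_2 = P_1 - b k^\top$ with $\langle \mathbf{1}, k\rangle = 0$, and checks numerically that $D_2$ can be recovered from $D_1$ without a fresh matrix inversion. All array names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, S = 0.9, 6

# Nominal (policy-averaged) state-to-state kernel P1 and its occupancy matrix D1.
P1 = rng.random((S, S))
P1 /= P1.sum(axis=1, keepdims=True)
D1 = np.linalg.inv(np.eye(S) - gamma * P1)

# Rank-one perturbation P2 = P1 - b k^T, with <1, k> = 0 so rows of P2 still sum to 1.
b = 0.05 * rng.random(S)               # plays the role of the policy-weighted radii
k = rng.standard_normal(S)
k -= k.mean()                          # sum-zero direction
k /= np.abs(k).sum()                   # keep the perturbation small
P2 = P1 - np.outer(b, k)

# Sherman-Morrison: recover D2 = (I - gamma*P2)^{-1} from D1 without a fresh inversion.
D1b = D1 @ b
D2_sm = D1 - gamma * np.outer(D1b, k @ D1) / (1.0 + gamma * (k @ D1b))
D2_direct = np.linalg.inv(np.eye(S) - gamma * P2)
print(np.max(np.abs(D2_sm - D2_direct)))  # ~1e-15: rank-one update matches direct inversion
```

Plugging in the $b$, $k$, $P_1$ and $P_2$ listed in the two bullets above would then give the robust occupancy matrix, as the response describes.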
> * *"upperbounds are $\alpha$ and $\beta$, which are both scalars, whereas $\alpha$ and $\beta$ are both $\lvert \mathcal{S}\rvert$-dimensional vectors"*
> Response here
> * *"For $D_i$, should we only consider $i \in \{1,2\}$? Is $D$ essentially $d$ (occupancy measure)?"*
> Response here

* It is probably due to my limited background in robust MDP and I have some questions on the proofs of Theorems 3.1 and 3.2 listed below.
<!-- >* We have happily addressed them below. -->

### Questions:
* Lines 322 - 324. Could the authors clarify the sentence "We conclude this subsection by noting that the robust Q-value function can be computed as easily from the robust value functions which are efficiently computable?"

> *"Lines 322 - 324. Could the authors clarify the sentence..."*
> There is a typo here as well; it should be "... can be computed as easily from the **non-robust** value functions...". Indeed, the robust Q-value is the unique limit of the recursion:
> $$Q_{n+1}(s,a) = R^\pi_\mathcal{U}(s,a)+ \gamma\sum_{s',a'}P^\pi_\mathcal{U}(s'|s,a)\pi(a'|s')Q_n(s',a'),$$
> i.e., $Q_n \to_n Q^\pi_\mathcal{U}$. As in non-robust MDPs, the convergence rate is linear because of the contraction property. Additionally, this holds for every fixed kernel, reward and policy. Thus, thanks to Thms. 3.1-3.2, Property 1, and taking $v_n(s) = \sum_a\pi(a|s)Q_n(s,a)$, we have:
>> * In the $(s,a)$-rectangular case, i.e., $\mathcal{U} = \mathcal{U}^{sa}_p$:
>> $$Q_{n+1}(s,a) = R_0(s,a) -\alpha_{sa} - \gamma\beta_{sa}\kappa_q(v_n) + \gamma\sum_{s',a'}P_0(s'|s,a)\pi(a'|s')Q_n(s',a')$$
>> * In the $s$-rectangular case, i.e., $\mathcal{U} = \mathcal{U}^{s}_p$:
>> $$Q_{n+1}(s,a) = R_0(s,a) -\bigl[\alpha_{s} + \gamma\beta_{s}\kappa_q(v_n)\bigr]\Bigl[\frac{\pi(a|s)}{\lVert \pi_s\rVert_q}\Bigr]^{q-1} + \gamma\sum_{s',a'}P_0(s'|s,a)\pi(a'|s')Q_n(s',a')$$
> These recursions highlight how the robust Q-value can be computed with similar efficiency as the non-robust Q-value (see the sketch below). We refer the reviewer to Appx. A for further details.
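To make the $(s,a)$-rectangular recursion above concrete, here is a minimal sketch for the Euclidean case $p=q=2$, where the maximum defining $\kappa_2$ (see the response to reviewer f5qY) reduces to the norm of the mean-centered value vector; this closed form is our own shortcut, and the paper's Tabs. 1 and 2 should be consulted for the exact $\kappa_q$ expressions. The model, radii, and policy are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, S, A = 0.9, 5, 3

# Illustrative nominal model, uncertainty radii and a fixed policy.
P0 = rng.random((S, A, S))
P0 /= P0.sum(axis=2, keepdims=True)
R0 = rng.random((S, A))
alpha = 0.01 * rng.random((S, A))   # reward-uncertainty radii alpha_{sa}
beta = 0.01 * rng.random((S, A))    # kernel-uncertainty radii beta_{sa}
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)

def kappa_2(v):
    # For p = q = 2, max <c, v> over ||c||_2 <= 1, <1, c> = 0 is the norm of the centered vector.
    return np.linalg.norm(v - v.mean())

# (s,a)-rectangular robust Q-iteration, following the recursion above.
Q = np.zeros((S, A))
for _ in range(500):
    v = (pi * Q).sum(axis=1)                               # v_n(s) = sum_a pi(a|s) Q_n(s,a)
    Q = R0 - alpha - gamma * beta * kappa_2(v) + gamma * (P0 @ v)
print(Q)
```

Each sweep costs essentially one nominal Bellman backup plus one scalar $\kappa_q$ evaluation, which is the efficiency point made in the response.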
* Lines 21 - 26 mention that "However, their scope remains limited ... because of their limitation to planning (Derman et al., 2021; Kumar et al., 2022)." However, it seems like this paper also focuses on planning, as computing the robust policy gradients requires knowledge of the underlying environment. Can the authors comment a bit more on the work's relationship with prior literature?

> *"it seems like this paper also focuses on planning, as computing the robust policy gradients requires knowledge of the underlying environment"*
> You are right, we also rely on a known nominal model. We apologize for the confusion. What we mean here is planning in the sense of "value-based with a given nominal model". In this work, we provide a policy-based method that leverages regularization to derive an efficient RPG. To be more precise about our contributions with respect to prior work:
>> * Derman et al. [2021] is the first work that efficiently evaluates the robust Bellman operator $T^{\pi}_{\mathcal{U}}v$ via regularization. However, their algorithmic contribution focuses on modified policy iteration (an extension of PI and VI), which is value-based and relies on a known nominal model. Additionally, although Derman et al. [2021] proposed a robust policy gradient, their result is restricted to "reward-robust MDPs", i.e., it does not account for transition uncertainty (see Prop. 3.2 and our response to reviewer B9Kt).
>> * Kumar et al. [2022] build upon that work by additionally accounting for the simplex constraint (which Derman et al. [2021] do not). Moreover, they provide a closed-form expression for the optimal robust Bellman operator $T^*v =\max_{\pi}T^{\pi}_{\mathcal{U}}v$, whereas Derman et al. [2021] suggest projecting onto the (action) simplex to obtain a greedy policy.
>> * Wang and Zou [2022] proposed a policy gradient for R-contaminated MDPs, i.e., uncertain models of the form $(1-R)P_0 + P$, where $P\in\Delta_{\mathcal{S}}^{\mathcal{S}\times\mathcal{A}}$ is an arbitrary kernel and $P_0$ the nominal. By construction, this is $(s,a)$-rectangular and included in an $L_{\infty}$-ball. The particular case they focus on enables them to derive a converging RPG and make the gradient explicit, whereas we aim to extend this to $s$-rectangular balls with an arbitrary norm.
>> * As a result, our work is the first to provide an explicit closed-form expression of the policy gradient for s-rectangular robust MDPs. We also deal with a broader class of uncertainty sets, namely, s-rectangular balls of arbitrary norm.
> We thank the reviewer for the comment. We will further detail this literature review.

* Lines 368 - 370 mention that the results are averaged after several runs. Can the authors also provide (1) the exact number of runs and (2) the standard deviation of the run time? Without these it is harder to appreciate the computation speedup of the proposed method.

> *"(1) the exact number of runs and (2) the standard deviation of the run time"*
> Each result was averaged over 100 runs, and the standard deviation (std) was typically around 2-10%. The table below presents the relative std.
>
> | S | A | $\mathcal{U}^{\text{sa}}_1$ | $\mathcal{U}^{\text{sa}}_2$ | $\mathcal{U}^{\text{sa}}_5$ | $\mathcal{U}^{\text{sa}}_{10}$ | $\mathcal{U}^{\text{sa}}_\infty$ | $\mathcal{U}^{\text{s}}_1$ | $\mathcal{U}^{\text{s}}_2$ | $\mathcal{U}^{\text{s}}_5$ | $\mathcal{U}^{\text{s}}_{10}$ | $\mathcal{U}^{\text{s}}_\infty$ |
> | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
> | 10 | 10 | 0.018 | 0.062 | 0.039 | 0.037 | 0.015 | 0.016 | 0.016 | 0.021 | 0.017 | 0.011 |
> | 30 | 10 | 0.016 | 0.013 | 0.029 | 0.019 | 0.013 | 0.007 | 0.032 | 0.023 | 0.017 | 0.015 |
> | 50 | 10 | 0.046 | 0.053 | 0.018 | 0.089 | 0.031 | 0.052 | 0.007 | 0.016 | 0.014 | 0.010 |
> | 100 | 20 | 0.015 | 0.015 | 0.006 | 0.007 | 0.007 | 0.013 | 0.006 | 0.006 | 0.029 | 0.006 |
>
> We will add these details in the final version.

* For the experiments, how is run time defined? Are we comparing the runtime of a single iteration? Or are we comparing the run time of achieving an $\epsilon$-optimal policy?

> *"how is run time defined?"*
> We are comparing the run time of a single iteration.

* How should we interpret Lemma 3.5? What exactly are $b$ and $k$?

> *"How should we interpret Lemma 3.5? What exactly are $b$ and $k$?"*
> We refer to the response in the weakness section above.

* It is probably due to my limited background in robust MDP, but can the authors provide some intuition on line 644 in the appendix? Shouldn't the optimization problem for $P^*,R^*$ be different from eq (7) and dependent on $P_0,R_0$? Two cross terms,
  $$ R_0(s,a)+\sum_{s'} P^*(s'|s,a)v^\pi_\mathcal{U}(s') \quad\text{and}\quad R^*(s,a)+\sum_{s'} P_0(s'|s,a)v^\pi_\mathcal{U}(s')$$
  seem to be missing. Why can we remove these terms?

> *"some intuition on line 644 in the appendix"*
> The model $(P^*,R^*)$ does not depend on $(P_0,R_0)$, meaning that the noise and the nominal values can be separated. This is the central insight of this line of work [ours, Derman et al. 2021, Kumar et al. 2022].
> We sketch the proof. From (7), we have
> $$(P^\pi_\mathcal{U},R^\pi_\mathcal{U}) = \arg\min_{P \in (P_0+\mathcal{P}),\,R \in (R_0+\mathcal{R})} \sum_{a}\pi(a|s)\Bigm[R(s,a) + \gamma\sum_{s'}P(s'|s,a)v(s') \Bigm]$$
> $$=(P_0,R_0) + \arg\min_{p \in \mathcal{P},\,r \in \mathcal{R}} \sum_{a}\pi(a|s)\Bigm[R_0(s,a)+ \gamma\sum_{s'}P_0(s'|s,a)v(s') + r(s,a) + \gamma\sum_{s'}p(s'|s,a)v(s') \Bigm]$$
> $$= (P_0,R_0) + \arg\min_{p \in \mathcal{P},\,r \in \mathcal{R}} \sum_{a}\pi(a|s)\Bigm[ r(s,a) + \gamma\sum_{s'}p(s'|s,a)v(s') \Bigm]$$
> $$= (P_0,R_0) + (P^*,R^*).$$
> Eq. (7) characterizes the worst model, and $(P^*,R^*)$ is the worst noise. They are related as $(P^\pi_\mathcal{U},R^\pi_\mathcal{U})=(P_0,R_0) + (P^*,R^*)$. There are no cross terms because the uncertainties in the reward and in the kernel are assumed to be independent, that is, $\mathcal{U} = (P_0+\mathcal{P})\times(R_0+\mathcal{R})$.

* Additionally, why can we treat $v^\pi_\mathcal{U}$ as a fixed vector with no additional constraints? Shouldn't we also use an equality constraint to ensure that the value function under $P^\pi_{\mathcal{U}},R^\pi_{\mathcal{U}}$ equals to $v^\pi_{\mathcal{U}}$? Otherwise isn't it possible that the solution to eq (7) does not have the same value function as $P^\pi_{\mathcal{U}}$.

> *"why can we treat $v^\pi_\mathcal{U}$ as a fixed vector with no additional constraints?"*
> For an s-rectangular uncertainty set $\mathcal{U}$, the worst model $(P^\pi_\mathcal{U},R^\pi_\mathcal{U})$ for policy $\pi$ is given by
> $$(P^\pi_\mathcal{U},R^\pi_\mathcal{U})\in \arg\min_{(P,R)\in\mathcal{U}}\mathcal{T}^\pi_{(P,R)}v^\pi_\mathcal{U}$$
> (Theorem 2 of [Wiesemann, 2012]). This resembles the following result in non-robust MDPs:
> $$\pi^* \in \arg\max_{\pi} \mathcal{T}^\pi v^*.$$
> Thank you for pointing this out. We will add this missing detail in the preliminaries (Sec. 2.2.2).

## Reviewer : B9Kt : Score 4

### Strengths And Weaknesses:
* The paper is well-written and the results are easy to follow. The policy gradient theorems are important to robust MDP and the proposed methods provide a viable way for policy-based methods in robust MDP.
* My main concern is on the technical contribution. To my best knowledge, the robust bellman operator, and robust Q value was first provided in Kumar 2022 for sa and s rectangular set. These serve as important ingredients to the proposed policy gradient method, along with the robust occupancy measure. I wonder if the authors could discuss the technical challenges of deriving the robust occupancy measure and highlight the technical contribution of this.

> *"discuss the technical challenges of deriving the robust occupancy measure and highlight the technical contribution of this"*
> We thank the reviewer for the constructive remarks. Although we agree that this work builds upon [Kumar et al. 2022], who made the robust values explicit without solving the inner minimization problem, it adds the following contributions:
>> * [Minor contributions] The robust Q-value defined in [Kumar et al. 2022] is different from ours. Nonetheless, we compute it using their tools.
>> * [Central Technical Contribution] In addition to [Kumar et al. 2022], we provide an explicit expression of the worst kernel. This novelty enables us to observe that it is a rank-one perturbation of the nominal kernel, which in turn motivates us to develop Lemma 3.5 and make the robust occupation measure explicit with a short and elegant proof. This constitutes our core technical contribution, as it is key to deriving an explicit expression of RPG.
>> * [Technical Challenges] Similarly to what is done in [Wang and Zou 2022], we may be tempted to directly derive RPG as:
>> $$\frac{d v^\pi(s)}{d\pi} = \frac{d}{d\pi}\sum_{a}\pi(a|s)\Bigm[R_0(s,a) -\alpha_{sa} -\gamma\beta_{sa}\kappa_q(v^\pi) + \gamma\sum_{s'} P_0(s'|s,a)v^\pi(s')\Bigm].$$
>> In the R-contamination case of Wang and Zou [2022] (which is an R-mixture between the nominal and the worst kernel from the simplex; in particular, it is $(s,a)$-rectangular), the regularizer has a simple form ($\max_{s}v(s)$). Differently, for more general $s$-rectangular balls, it depends on the $\kappa$-function (and on the policy in the $s$-rectangular setting). This makes the direct technique intractable. In fact, we notice that this is the approach Derman et al. [2021] took to derive their reward-robust policy gradient. As mentioned in Appx. A.5 there, their attempt to derive an RPG under both reward and transition uncertainty was unfruitful, because of an interdependence between the gradient of the regularizer and that of the value function.

* With the robust policy gradient method, I am not sure if the global convergence results still hold for robust MDP. Could the author also comment on the impact of using subgradients and having this regularized gradient on the convergence rate?

> *"global convergence results [...] for robust MDP ... impact of using subgradients and having this regularized gradient on the convergence rate"*
> We note that [Wang et al., 2022] guarantee convergence of RPG to a globally optimal robust policy, under conditions similar to those for non-robust MDPs [Agarwal et al. 2020]. They also obtain a slower convergence rate than non-robust PG, but this is a topic of independent research.

* There is also a missing literature. Esther 2021 also proposed a policy gradient theorem for robust MDP. Esther Derman, Matthieu Geist, and Shie Mannor. Twice regularized mdps and the equivalence between robustness and regularization, 2021.

> *"... missing literature. Esther 2021 also proposed a policy gradient theorem for robust MDP"*
> Derman et al. [2021] proposed a policy gradient for reward-robust MDPs only, i.e., with uncertain reward but **fixed** transition kernel (Prop. 3.2). In fact, as mentioned in Appx. A.5 there, their attempt to account for uncertain transitions was unfruitful, as the gradient expression of their regularizer depended on that of the value function and vice versa (see the related discussion above). We will make this difference clearer in the final version.

> *References*
> [Kumar et al., 2022] Kumar, Navdeep, et al. "Efficient policy iteration for robust Markov decision processes via regularization." arXiv preprint arXiv:2205.14327 (2022).
> [Derman et al., 2021] Derman, Esther, Matthieu Geist, and Shie Mannor. "Twice regularized MDPs and the equivalence between robustness and regularization." Advances in Neural Information Processing Systems 34 (2021): 22274-22287.
> [Agarwal et al., 2020] Agarwal, Alekh, et al. "Optimality and approximation with policy gradient methods in Markov decision processes." Conference on Learning Theory. PMLR, 2020.

### Second response to reviewer

*Can you provide the specific location in Derman et al. [2021] or Kumar et al. [2022] where the authors explicitly show in detail the correctness of line 644?*

Derman et al. [2021] establish a similar result to line 644 in Sec. C.3. We should note two differences with our proof:
1. They build their proof upon the robust optimization problem underlying the robust value function. The independence in state-action pairs thus appears in the computation of the robust counterpart, denoted there by $F$ (second equality in p. 25).
2. They do not account for the simplex constraint. In their work, $\mathcal{P}(s,a) = \{ P(\cdot|s,a)\in\mathbb{R}^{\mathcal{S}} \mid \lVert P(\cdot|s,a)\rVert\leq \beta_{s,a} \}$, whereas we additionally require the uncertainty set to contain transition kernels only, i.e., $\mathcal{P}(s,a) = \{ P(\cdot|s,a)\in\mathbb{R}^{\mathcal{S}} \mid \lVert P(\cdot|s,a)\rVert\leq \beta_{s,a},\ \langle P(\cdot|s,a), \mathbf{1}\rangle = 0\}$. This results in a different regularization term, namely, $-\beta_{s,a} \lVert v^{\pi}_{\mathcal{U}}\rVert_q$ instead of $-\beta_{s,a}\kappa_{q}(v^{\pi}_{\mathcal{U}})$ (see Kumar et al. [2022]).

Similar independence appears in Kumar et al. [2022], Eq. (86). More generally, this derivation comes from the following observation. Suppose we aim to optimize an objective function $\langle P(\cdot| s,a), v\rangle$ over a Cartesian product of sets $\times_{(s,a)} \mathcal{P}(s,a)$. Since the objective function is *independent* of the uncertainty sets $\mathcal{P}(\bar{s},\bar{a})$ such that $(\bar{s},\bar{a})\neq (s,a)$, and the constraint set is *independently* defined for all state-action pairs, we can restrict the optimization to the given pair $(s,a)$.

*I am also not 100% convinced by the proof sketch: the second equation, at least to my understanding, also seems incorrect*

#### Alternative proof of Theorem 3.1

By definition, we have
$$T^\pi_{\mathcal{U}^{sa}_p}(v) = \min_{(P,R)\in \mathcal{U}^{sa}_p}T^\pi_{P,R}v,$$
and from [Kumar et al. 2022][Thm 3.1], it holds that
$$ (T^\pi_{\mathcal{U}^{sa}_p}v^\pi_{\mathcal{U}^{sa}_p})(s) = \sum_{a}\pi(a|s)\Bigm[-\alpha_{sa}-\gamma\beta_{sa}\kappa_q(v^\pi_{\mathcal{U}^{sa}_p}) + R_0(s,a)+\gamma\sum_{s'}P_0(s'|s,a)v^\pi_{\mathcal{U}^{sa}_p}(s')\Bigm],$$
by taking $v = v^\pi_{\mathcal{U}^{sa}_p}$. Now, we re-derive the above expression by plugging in the worst-case model from our Theorem 3.1:
$$ (T^\pi_{(P^\pi_{\mathcal{U}^{sa}_p},R^\pi_{\mathcal{U}^{sa}_p})}v^\pi_{\mathcal{U}^{sa}_p})(s) = \sum_{a}\pi(a|s)\Bigm[ R^\pi_{\mathcal{U}^{sa}_p}(s,a)+\gamma\sum_{s'}P^\pi_{\mathcal{U}^{sa}_p}(s'|s,a)v^\pi_{\mathcal{U}^{sa}_p}(s')\Bigm], \qquad \text{(by definition)}$$
$$= \sum_{a}\pi(a|s)\Bigm[ R_0(s,a)-\alpha_{sa}+\gamma\sum_{s'}P_0(s'|s,a)v^\pi_{\mathcal{U}^{sa}_p}(s') - \gamma\beta_{sa}\sum_{s'}u^\pi_{\mathcal{U}^{sa}_p}(s')v^\pi_{\mathcal{U}^{sa}_p}(s')\Bigm], \qquad \text{(Theorem 3.1)}$$
$$= \sum_{a}\pi(a|s)\Bigm[ R_0(s,a)-\alpha_{sa}+\gamma\sum_{s'}P_0(s'|s,a)v^\pi_{\mathcal{U}^{sa}_p}(s') - \gamma\beta_{sa}\kappa_q(v^\pi_{\mathcal{U}^{sa}_p})\Bigm]. \qquad \text{(Property 1)}$$
We thus showed that
$$ T^\pi_{(P^\pi_{\mathcal{U}^{sa}_p},R^\pi_{\mathcal{U}^{sa}_p})}v^\pi_{\mathcal{U}^{sa}_p} = \min_{(P,R)\in \mathcal{U}^{sa}_p}T^\pi_{P,R}v^\pi_{\mathcal{U}^{sa}_p}, $$
which establishes that
$$ (P^\pi_{\mathcal{U}^{sa}_p},R^\pi_{\mathcal{U}^{sa}_p})\in \arg\min_{(P,R)\in \mathcal{U}^{sa}_p}T^\pi_{P,R}v^\pi_{\mathcal{U}^{sa}_p}.$$
We will add this alternative proof in the final version in case you find it more convincing (a small numerical check of the worst-kernel identity used here is sketched below).
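As an editorial aside, the worst-kernel identity used in the (Theorem 3.1) and (Property 1) steps can be checked numerically in the Euclidean case $p=q=2$, where $\kappa_2(v)=\lVert v-\bar v\mathbf{1}\rVert_2$ is attained at $u=(v-\bar v\mathbf{1})/\lVert v-\bar v\mathbf{1}\rVert_2$ (our own closed form for this case, not taken from the paper). The sketch below compares $\langle P_0(\cdot|s,a)-\beta_{sa}u,\,v\rangle$ with $\langle P_0(\cdot|s,a),v\rangle-\beta_{sa}\kappa_2(v)$ and with a Monte-Carlo minimum over the constrained ball; all names and numbers are illustrative, and nonnegativity of the perturbed row is not enforced, consistent with the norm-ball-plus-sum-zero constraint discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)
S = 7
v = rng.random(S)                      # stand-in for v^pi_U
p0 = rng.random(S)
p0 /= p0.sum()                         # nominal row P_0(.|s,a)
beta = 0.05                            # radius beta_{sa}

# For p = q = 2: kappa_2(v) is attained at the normalized mean-centered value vector.
u = v - v.mean()
kappa = np.linalg.norm(u)
u /= kappa

worst_row = p0 - beta * u                # Theorem 3.1-style worst kernel row
closed_form = p0 @ v - beta * kappa      # <P_0, v> - beta_{sa} * kappa_2(v)
print(abs(worst_row @ v - closed_form))  # ~0: the two sides of the identity agree

# Monte-Carlo check: no sampled feasible perturbation beats the closed-form minimum.
best = np.inf
for _ in range(100_000):
    c = rng.standard_normal(S)
    c -= c.mean()                        # <1, c> = 0
    c /= max(np.linalg.norm(c), 1.0)     # ||c||_2 <= 1
    best = min(best, (p0 + beta * c) @ v)
print(best - closed_form)                # small and nonnegative
```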
