    # Rebuttal "PG for RMDP" ## Reviewer : f5qY : Score 7 ### Strengths: * Derivation of policy gradient theorems (theorems 3.6 and 3.7) for s-rectangular and sa-rectangular RMDPs with uncertainty set constrained by the $L_p$ norm is quite interesting, and many other policy gradient methods can be derived with these theorems. * It is noteworthy that the adversarial kernel for such an ambiguity set is a rank-one perturbation of the nominal kernel. ### Weakness: * The paper has a limited scope, which makes it less ground-breaking. In particular, it only works on uncertainty sets constrained by the $L_p$ norm. In another paper I am reviewing, a policy gradient method is proposed for a more general class of s-rectangular ambiguity sets. Their method, however, does not propose a closed form equation for RPG (i.e. the policy gradient theorem). However, I believe it is a solid piece of work. > We would like to thank the reviewer for the positive feedback. > > *"[the paper] only works on uncertainty sets constrained by the \(L_p\) norm"* > > Thank you for suggesting an extension to more general uncertainty sets. Although this work focuses on $L_p$-norm constrained uncertainty sets, it actually applies to any norm $\lVert \cdot \rVert$. We demonstrate it here for the $(s,a)$-rectangular case. The robust Q-value would satisfy the following equality: $Q^\pi_{\mathcal{U}^{sa}_{\lVert \cdot\rVert}}(s,a) = R_0(s,a)-\alpha_{sa}-\gamma\beta_{sa}\kappa_{\lVert\cdot\rVert}(v^\pi_{\mathcal{U}^{sa}_{\lVert \cdot\rVert}}) + \gamma\sum_{s'\in\mathcal{S}}P_0(s'|s,a)\pi(a'|s')Q^\pi_{\mathcal{U}^{sa}_{\lVert \cdot\rVert}}(s',a'),$ where $\kappa_{\lVert\cdot\rVert}(v) := \min_{\lVert c \rVert \leq 1, \langle \mathbf{1},c\rangle =0} \langle c, v\rangle$ (to be compared with the robust Q-value from Appx. A). On the other hand, the worst kernel in Thm 3.1 would change to: $P^\pi_{\mathcal{U}}(\cdot|s,a) = P_0(s,a) -\beta_{sa}u,$ where $u \in \arg\min_{\lVert c \rVert \leq 1, \langle \mathbf{1},c\rangle =0} \langle c, v^\pi_{\mathcal{U}}\rangle$ (to be compared with Lemma B.2). It is still a rank-one perturbation of the nominal kernel, so we can get the robust occupation measure from Lemma 3.5. Combining the robust Q-value with the robust occupation measure, we can get an RPG as in Thm 3.6. We can similarly derive an RPG for $s$-rectangular uncertainty sets with balls constrained by an arbitrary norm. In the $L_p$-case, it can additionally be computed in closed form when $p\in\{1,2,\infty\}$, thanks to the explicit expression of the $\kappa$-function (Tabs. 1 and 2). We thank the reviewer for this useful remark and would be very happy to include this discussion in the final version. ### Questions: * This paper relies heavily on properties of $L_p$ norm. I wonder do you have any idea for extending this paper to a more general uncertainty sets? > *"do you have any idea for extending this paper to a more general uncertainty sets?"* > As established above, our $L_p$-norm results can be generalized to arbitrary norms. We believe our work opens up an avenue for even more general uncertainty sets such as $f$-divergence constraints. Even in the particular case of KL, one challenge for deriving RPG lies in the fact that the basic properties of a distance are no longer satisfied. ## Reviewer : 1fLD : Score 4 ### Strengths * s-rectangular robust MDPs are a class of hard problems in robust RL. The paper features some novel techniques not found in existing works. 
## Reviewer : 1fLD : Score 4

### Strengths
* s-rectangular robust MDPs are a class of hard problems in robust RL. The paper features some novel techniques not found in existing works.
* The numerical results obtained in the paper are nice and illustrate the empirical performance of the proposed methods.

### Weaknesses
* While the paper is theoretical in nature, the experiment results are somewhat limited with crucial information missing, such as the exact number of repetitions and the standard deviation of the observed run times.
* Some of the results seem incomplete or poorly stated. Particularly, Lemma 3.5 references the non-existent Lemma 3.1 for the terms $b$ and $k$. What exactly are $b$ and $k$ referring to? If we look at Theorem 3.1, the terms in the upperbounds are $\alpha$ and $\beta$, which are both scalars, whereas $\alpha$ and $\beta$ are both $\lvert \mathcal{S}\rvert$-dimensional vectors. Additionally, what do $P_1$ and $P_2$ refer to? Is $P_1$ the transition probability of some arbitrary policy under $P_0$ and $P_2$ the transition probability under $P^\pi_\mathcal{U}$? For $D_i$, should we only consider $i \in \{1,2\}$? Is $D$ essentially $d$ (occupancy measure)?

> We thank the reviewer for the detailed comments. We shall address them below in order.
> * *"the exact number of repetitions and the standard deviation of the observed run times"*
> Thank you for pointing this out. Experiments were repeated 100 times, with standard deviations of 1-10%. We will get back to this below, in our response regarding lines 368-370.
> * *"Lemma 3.5 references the non-existent Lemma 3.1 for the terms $b$ and $k$. What exactly are $b$ and $k$ referring to?"*
> The reviewer is right. There is a typo in the sentence: "by plugging the values of $b$ and $k$ in the upper bound of Lemma 3.1" should instead be "by plugging the values of $b$ and $k$ from Theorems 3.1 and 3.2."
> * *"what do $P_1$ and $P_2$ refer to?"*
> Lemma 3.5 is a general statement expressing one occupancy matrix in terms of another, provided that their corresponding kernels differ by a rank-one matrix (i.e., $P_2$ is a rank-one perturbation of $P_1$). Then, leveraging the results in Thms. 3.1 and 3.2, which establish the worst kernel as a rank-one perturbation of the nominal, we can derive the robust occupancy matrix as follows (please see Appx. E for more details):
<!-- > * In the sa-rectangular case, we can get the robust occupancy matrix by setting $b(s) := \sum_{a}\pi(a|s)\beta_{s,a}$ and $k := u^\pi_{\mathcal{U}^{sa}_p}$ in Lemma 3.5. -->
>> * In the $(s,a)$-rectangular case, i.e., $\mathcal{U} = \mathcal{U}^{sa}_p$, we get the robust occupancy matrix by setting $b(s) := \sum_{a}\pi(a|s)\beta_{s,a}$, $k := u^\pi_{\mathcal{U}}$, $P_2(s'|s) := \sum_{a}\pi(a|s)P^\pi_\mathcal{U}(s'|s,a)$ and $P_1(s'|s) := \sum_{a}\pi(a|s)P_0(s'|s,a)$ in Lemma 3.5.
>> * In the $s$-rectangular case, i.e., $\mathcal{U} = \mathcal{U}^{s}_p$, we get the robust occupancy matrix by setting $b(s) := \lVert\pi_s\rVert_q\beta_{s}$, $k := u^\pi_{\mathcal{U}}$, $P_2(s'|s) := \sum_{a}\pi(a|s)P^\pi_\mathcal{U}(s'|s,a)$ and $P_1(s'|s) := \sum_{a}\pi(a|s)P_0(s'|s,a)$ in Lemma 3.5.
> Thank you for pointing out this potential source of confusion. We will accordingly expand Sec. 3.3 and include the above discussion in the final version (a small numerical illustration of this rank-one update follows this response).
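The rank-one structure invoked in this response lends itself to a Sherman-Morrison-style update of the occupancy matrix. The snippet below is a minimal editorial sketch, not the paper's Lemma 3.5 itself: it assumes the occupancy matrices take the form $D_i = (I-\gamma P_i)^{-1}$ and that $P_2 = P_1 - b k^\top$ with $\langle \mathbf{1}, k\rangle = 0$, and checks numerically that $D_2$ can be recovered from $D_1$ without a fresh matrix inversion. All array names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, S = 0.9, 6

# Nominal (policy-averaged) state-to-state kernel P1 and its occupancy matrix D1.
P1 = rng.random((S, S))
P1 /= P1.sum(axis=1, keepdims=True)
D1 = np.linalg.inv(np.eye(S) - gamma * P1)

# Rank-one perturbation P2 = P1 - b k^T, with <1, k> = 0 so rows of P2 still sum to 1.
b = 0.05 * rng.random(S)               # plays the role of the policy-weighted radii
k = rng.standard_normal(S)
k -= k.mean()                          # sum-zero direction
k /= np.abs(k).sum()                   # keep the perturbation small
P2 = P1 - np.outer(b, k)

# Sherman-Morrison: recover D2 = (I - gamma*P2)^{-1} from D1 without a fresh inversion.
D1b = D1 @ b
D2_sm = D1 - gamma * np.outer(D1b, k @ D1) / (1.0 + gamma * (k @ D1b))
D2_direct = np.linalg.inv(np.eye(S) - gamma * P2)
print(np.max(np.abs(D2_sm - D2_direct)))  # ~1e-15: rank-one update matches direct inversion
```

Plugging in the $b$, $k$, $P_1$ and $P_2$ listed in the two bullets above would then give the robust occupancy matrix, as the response describes.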
> * *"upperbounds are $\alpha$ and $\beta$, which are both scalars, whereas $\alpha$ and $\beta$ are both $\lvert \mathcal{S}\rvert$-dimensional vectors"*
> Response here
> * *"For $D_i$, should we only consider $i \in \{1,2\}$? Is $D$ essentially $d$ (occupancy measure)?"*
> Response here

* It is probably due to my limited background in robust MDP and I have some questions on the proofs of Theorems 3.1 and 3.2 listed below.
<!-- >* We have happily addressed them below. -->

### Questions:
* Lines 322 - 324. Could the authors clarify the sentence "We conclude this subsection by noting that the robust Q-value function can be computed as easily from the robust value functions which are efficiently computable?"

> *"Lines 322 - 324. Could the authors clarify the sentence..."*
> There is a typo here as well; it should be "... can be computed as easily from the **non-robust** value functions...". Indeed, the robust Q-value is the unique limit of the recursion:
> $$Q_{n+1}(s,a) = R^\pi_\mathcal{U}(s,a)+ \gamma\sum_{s',a'}P^\pi_\mathcal{U}(s'|s,a)\pi(a'|s')Q_n(s',a'),$$
> i.e., $Q_n \to_n Q^\pi_\mathcal{U}$. As in non-robust MDPs, the convergence rate is linear because of the contraction property. Additionally, this holds for every fixed kernel, reward and policy. Thus, thanks to Thms. 3.1-3.2, Property 1, and taking $v_n(s) = \sum_a\pi(a|s)Q_n(s,a)$, we have:
>> * In the $(s,a)$-rectangular case, i.e., $\mathcal{U} = \mathcal{U}^{sa}_p$:
>> $$Q_{n+1}(s,a) = R_0(s,a) -\alpha_{sa} - \gamma\beta_{sa}\kappa_q(v_n) + \gamma\sum_{s',a'}P_0(s'|s,a)\pi(a'|s')Q_n(s',a')$$
>> * In the $s$-rectangular case, i.e., $\mathcal{U} = \mathcal{U}^{s}_p$:
>> $$Q_{n+1}(s,a) = R_0(s,a) -\bigl[\alpha_{s} + \gamma\beta_{s}\kappa_q(v_n)\bigr]\Bigl[\frac{\pi(a|s)}{\lVert \pi_s\rVert_q}\Bigr]^{q-1} + \gamma\sum_{s',a'}P_0(s'|s,a)\pi(a'|s')Q_n(s',a')$$
> These recursions highlight how the robust Q-value can be computed with similar efficiency as the non-robust Q-value (see the sketch below). We refer the reviewer to Appx. A for further details.
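To make the $(s,a)$-rectangular recursion above concrete, here is a minimal sketch for the Euclidean case $p=q=2$, where the maximum defining $\kappa_2$ (see the response to reviewer f5qY) reduces to the norm of the mean-centered value vector; this closed form is our own shortcut, and the paper's Tabs. 1 and 2 should be consulted for the exact $\kappa_q$ expressions. The model, radii, and policy are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, S, A = 0.9, 5, 3

# Illustrative nominal model, uncertainty radii and a fixed policy.
P0 = rng.random((S, A, S))
P0 /= P0.sum(axis=2, keepdims=True)
R0 = rng.random((S, A))
alpha = 0.01 * rng.random((S, A))   # reward-uncertainty radii alpha_{sa}
beta = 0.01 * rng.random((S, A))    # kernel-uncertainty radii beta_{sa}
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)

def kappa_2(v):
    # For p = q = 2, max <c, v> over ||c||_2 <= 1, <1, c> = 0 is the norm of the centered vector.
    return np.linalg.norm(v - v.mean())

# (s,a)-rectangular robust Q-iteration, following the recursion above.
Q = np.zeros((S, A))
for _ in range(500):
    v = (pi * Q).sum(axis=1)                               # v_n(s) = sum_a pi(a|s) Q_n(s,a)
    Q = R0 - alpha - gamma * beta * kappa_2(v) + gamma * (P0 @ v)
print(Q)
```

Each sweep costs essentially one nominal Bellman backup plus one scalar $\kappa_q$ evaluation, which is the efficiency point made in the response.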
* Lines 21 - 26 mention that "However, their scope remains limited ... because of their limitation to planning (Derman et al., 2021; Kumar et al., 2022)." However, it seems like this paper also focuses on planning, as computing the robust policy gradients requires knowledge of the underlying environment. Can the authors comment a bit more on the work's relationship with prior literature?

> *"it seems like this paper also focuses on planning, as computing the robust policy gradients requires knowledge of the underlying environment"*
> You are right, we also rely on a known nominal model. We apologize for the confusion. What we mean here is planning in the sense of "value-based with a given nominal model". In this work, we provide a policy-based method that leverages regularization to derive an efficient RPG. To be more precise about our contributions with respect to prior work:
>> * Derman et al. [2021] is the first work that efficiently evaluates the robust Bellman operator $T^{\pi}_{\mathcal{U}}v$ via regularization. However, their algorithmic contribution focuses on modified policy iteration (an extension of PI and VI), which is value-based and relies on a known nominal model. Additionally, although Derman et al. [2021] proposed a robust policy gradient, their result is restricted to "reward-robust MDPs", i.e., it does not account for transition uncertainty (see Prop. 3.2 and our response to reviewer B9Kt).
>> * Kumar et al. [2022] build upon that work by additionally accounting for the simplex constraint (which Derman et al. [2021] do not). Moreover, they provide a closed-form expression for the optimal robust Bellman operator $T^*v =\max_{\pi}T^{\pi}_{\mathcal{U}}v$, whereas Derman et al. [2021] suggest projecting onto the (action) simplex to obtain a greedy policy.
>> * Wang and Zou [2022] proposed a policy gradient for R-contaminated MDPs, i.e., uncertain models of the form $(1-R)P_0 + P$, where $P\in\Delta_{\mathcal{S}}^{\mathcal{S}\times\mathcal{A}}$ is an arbitrary kernel and $P_0$ the nominal. By construction, this is $(s,a)$-rectangular and included in an $L_{\infty}$-ball. The particular case they focus on enables them to derive a converging RPG and make the gradient explicit, whereas we aim to extend this to $s$-rectangular balls with an arbitrary norm.
>> * As a result, our work is the first to provide an explicit closed-form expression of the policy gradient for s-rectangular robust MDPs. We also deal with a broader class of uncertainty sets, namely, s-rectangular balls of arbitrary norm.
> We thank the reviewer for the comment. We will further detail this literature review.

* Lines 368 - 370 mention that the results are averaged after several runs. Can the authors also provide (1) the exact number of runs and (2) the standard deviation of the run time? Without these it is harder to appreciate the computation speedup of the proposed method.

> *"(1) the exact number of runs and (2) the standard deviation of the run time"*
> Each result was averaged over 100 runs, and the standard deviation (std) was typically around 2-10%. The table below presents the relative std.
>
> | S | A | $\mathcal{U}^{\text{sa}}_1$ | $\mathcal{U}^{\text{sa}}_2$ | $\mathcal{U}^{\text{sa}}_5$ | $\mathcal{U}^{\text{sa}}_{10}$ | $\mathcal{U}^{\text{sa}}_\infty$ | $\mathcal{U}^{\text{s}}_1$ | $\mathcal{U}^{\text{s}}_2$ | $\mathcal{U}^{\text{s}}_5$ | $\mathcal{U}^{\text{s}}_{10}$ | $\mathcal{U}^{\text{s}}_\infty$ |
> | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
> | 10 | 10 | 0.018 | 0.062 | 0.039 | 0.037 | 0.015 | 0.016 | 0.016 | 0.021 | 0.017 | 0.011 |
> | 30 | 10 | 0.016 | 0.013 | 0.029 | 0.019 | 0.013 | 0.007 | 0.032 | 0.023 | 0.017 | 0.015 |
> | 50 | 10 | 0.046 | 0.053 | 0.018 | 0.089 | 0.031 | 0.052 | 0.007 | 0.016 | 0.014 | 0.010 |
> | 100 | 20 | 0.015 | 0.015 | 0.006 | 0.007 | 0.007 | 0.013 | 0.006 | 0.006 | 0.029 | 0.006 |
>
> We will add these details in the final version.

* For the experiments, how is run time defined? Are we comparing the runtime of a single iteration? Or are we comparing the run time of achieving an $\epsilon$-optimal policy?

> *"how is run time defined?"*
> We are comparing the run time of a single iteration.

* How should we interpret Lemma 3.5? What exactly are $b$ and $k$?

> *"How should we interpret Lemma 3.5? What exactly are $b$ and $k$?"*
> We refer to the response in the weakness section above.

* It is probably due to my limited background in robust MDP, but can the authors provide some intuition on line 644 in the appendix? Shouldn't the optimization problem for $P^*,R^*$ be different from eq (7) and dependent on $P_0,R_0$? Two cross terms,
  $$ R_0(s,a)+\sum_{s'} P^*(s'|s,a)v^\pi_\mathcal{U}(s') \quad\text{and}\quad R^*(s,a)+\sum_{s'} P_0(s'|s,a)v^\pi_\mathcal{U}(s')$$
  seem to be missing. Why can we remove these terms?

> *"some intuition on line 644 in the appendix"*
> The model $(P^*,R^*)$ does not depend on $(P_0,R_0)$, meaning that the noise and the nominal values can be separated. This is the central insight of this line of work [ours, Derman et al. 2021, Kumar et al. 2022].
> We sketch the proof. From (7), we have
> $$(P^\pi_\mathcal{U},R^\pi_\mathcal{U}) = \arg\min_{P \in (P_0+\mathcal{P}),\,R \in (R_0+\mathcal{R})} \sum_{a}\pi(a|s)\Bigm[R(s,a) + \gamma\sum_{s'}P(s'|s,a)v(s') \Bigm]$$
> $$=(P_0,R_0) + \arg\min_{p \in \mathcal{P},\,r \in \mathcal{R}} \sum_{a}\pi(a|s)\Bigm[R_0(s,a)+ \gamma\sum_{s'}P_0(s'|s,a)v(s') + r(s,a) + \gamma\sum_{s'}p(s'|s,a)v(s') \Bigm]$$
> $$= (P_0,R_0) + \arg\min_{p \in \mathcal{P},\,r \in \mathcal{R}} \sum_{a}\pi(a|s)\Bigm[ r(s,a) + \gamma\sum_{s'}p(s'|s,a)v(s') \Bigm]$$
> $$= (P_0,R_0) + (P^*,R^*).$$
> Eq. (7) characterizes the worst model, and $(P^*,R^*)$ is the worst noise. They are related as $(P^\pi_\mathcal{U},R^\pi_\mathcal{U})=(P_0,R_0) + (P^*,R^*)$. There are no cross terms because the uncertainties in the reward and in the kernel are assumed to be independent, that is, $\mathcal{U} = (P_0+\mathcal{P})\times(R_0+\mathcal{R})$.

* Additionally, why can we treat $v^\pi_\mathcal{U}$ as a fixed vector with no additional constraints? Shouldn't we also use an equality constraint to ensure that the value function under $P^\pi_{\mathcal{U}},R^\pi_{\mathcal{U}}$ equals to $v^\pi_{\mathcal{U}}$? Otherwise isn't it possible that the solution to eq (7) does not have the same value function as $P^\pi_{\mathcal{U}}$.

> *"why can we treat $v^\pi_\mathcal{U}$ as a fixed vector with no additional constraints?"*
> For an s-rectangular uncertainty set $\mathcal{U}$, the worst model $(P^\pi_\mathcal{U},R^\pi_\mathcal{U})$ for policy $\pi$ is given by
> $$(P^\pi_\mathcal{U},R^\pi_\mathcal{U})\in \arg\min_{(P,R)\in\mathcal{U}}\mathcal{T}^\pi_{(P,R)}v^\pi_\mathcal{U}$$
> (Theorem 2 of [Wiesemann, 2012]). This resembles the following result in non-robust MDPs:
> $$\pi^* \in \arg\max_{\pi} \mathcal{T}^\pi v^*.$$
> Thank you for pointing this out. We will add this missing detail in the preliminaries (Sec. 2.2.2).

## Reviewer : B9Kt : Score 4

### Strengths And Weaknesses:
* The paper is well-written and the results are easy to follow. The policy gradient theorems are important to robust MDP and the proposed methods provide a viable way for policy-based methods in robust MDP.
* My main concern is on the technical contribution. To my best knowledge, the robust bellman operator, and robust Q value was first provided in Kumar 2022 for sa and s rectangular set. These serve as important ingredients to the proposed policy gradient method, along with the robust occupancy measure. I wonder if the authors could discuss the technical challenges of deriving the robust occupancy measure and highlight the technical contribution of this.

> *"discuss the technical challenges of deriving the robust occupancy measure and highlight the technical contribution of this"*
> We thank the reviewer for the constructive remarks. Although we agree that this work builds upon [Kumar et al. 2022], who made the robust values explicit without solving the inner minimization problem, it adds the following contributions:
>> * [Minor contributions] The robust Q-value defined in [Kumar et al. 2022] is different from ours. Nonetheless, we compute it using their tools.
>> * [Central Technical Contribution] In addition to [Kumar et al. 2022], we provide an explicit expression of the worst kernel. This novelty enables us to observe that it is a rank-one perturbation of the nominal kernel, which in turn motivates us to develop Lemma 3.5 and make the robust occupation measure explicit with a short and elegant proof. This constitutes our core technical contribution, as it is key to deriving an explicit expression of RPG.
>> * [Technical Challenges] Similarly to what is done in [Wang and Zou 2022], we may be tempted to directly derive RPG as:
>> $$\frac{d v^\pi(s)}{d\pi} = \frac{d}{d\pi}\sum_{a}\pi(a|s)\Bigm[R_0(s,a) -\alpha_{sa} -\gamma\beta_{sa}\kappa_q(v^\pi) + \gamma\sum_{s'} P_0(s'|s,a)v^\pi(s')\Bigm].$$
>> In the R-contamination case of Wang and Zou [2022] (which is an R-mixture between the nominal and the worst kernel from the simplex; in particular, it is $(s,a)$-rectangular), the regularizer has a simple form ($\max_{s}v(s)$). Differently, for more general $s$-rectangular balls, it depends on the $\kappa$-function (and on the policy in the $s$-rectangular setting). This makes the direct technique intractable. In fact, we notice that this is the approach Derman et al. [2021] took to derive their reward-robust policy gradient. As mentioned in Appx. A.5 there, their attempt to derive an RPG under both reward and transition uncertainty was unfruitful, because of an interdependence between the gradient of the regularizer and that of the value function.

* With the robust policy gradient method, I am not sure if the global convergence results still hold for robust MDP. Could the author also comment on the impact of using subgradients and having this regularized gradient on the convergence rate?

> *"global convergence results [...] for robust MDP ... impact of using subgradients and having this regularized gradient on the convergence rate"*
> We note that [Wang et al., 2022] guarantee convergence of RPG to a globally optimal robust policy, under conditions similar to those for non-robust MDPs [Agarwal et al. 2020]. They also obtain a slower convergence rate than non-robust PG, but this is a topic of independent research.

* There is also a missing literature. Esther 2021 also proposed a policy gradient theorem for robust MDP. Esther Derman, Matthieu Geist, and Shie Mannor. Twice regularized mdps and the equivalence between robustness and regularization, 2021.

> *"... missing literature. Esther 2021 also proposed a policy gradient theorem for robust MDP"*
> Derman et al. [2021] proposed a policy gradient for reward-robust MDPs only, i.e., with uncertain reward but **fixed** transition kernel (Prop. 3.2). In fact, as mentioned in Appx. A.5 there, their attempt to account for uncertain transitions was unfruitful, as the gradient expression of their regularizer depended on that of the value function and vice versa (see the related discussion above). We will make this difference clearer in the final version.

> *References*
> [Kumar et al., 2022] Kumar, Navdeep, et al. "Efficient policy iteration for robust Markov decision processes via regularization." arXiv preprint arXiv:2205.14327 (2022).
> [Derman et al., 2021] Derman, Esther, Matthieu Geist, and Shie Mannor. "Twice regularized MDPs and the equivalence between robustness and regularization." Advances in Neural Information Processing Systems 34 (2021): 22274-22287.
> [Agarwal et al., 2020] Agarwal, Alekh, et al. "Optimality and approximation with policy gradient methods in Markov decision processes." Conference on Learning Theory. PMLR, 2020.

### Second response to reviewer

*Can you provide the specific location in Derman et al. [2021] or Kumar et al. [2022] where the authors explicitly show in detail the correctness of line 644?*

Derman et al. [2021] establish a similar result to line 644 in Sec. C.3. We should note two differences with our proof:
1. They build their proof upon the robust optimization problem underlying the robust value function. The independence in state-action pairs thus appears in the computation of the robust counterpart, denoted there by $F$ (second equality in p. 25).
2. They do not account for the simplex constraint. In their work, $\mathcal{P}(s,a) = \{ P(\cdot|s,a)\in\mathbb{R}^{\mathcal{S}} \mid \lVert P(\cdot|s,a)\rVert\leq \beta_{s,a} \}$, whereas we additionally require the uncertainty set to contain transition kernels only, i.e., $\mathcal{P}(s,a) = \{ P(\cdot|s,a)\in\mathbb{R}^{\mathcal{S}} \mid \lVert P(\cdot|s,a)\rVert\leq \beta_{s,a},\ \langle P(\cdot|s,a), \mathbf{1}\rangle = 0\}$. This results in a different regularization term, namely, $-\beta_{s,a} \lVert v^{\pi}_{\mathcal{U}}\rVert_q$ instead of $-\beta_{s,a}\kappa_{q}(v^{\pi}_{\mathcal{U}})$ (see Kumar et al. [2022]).

Similar independence appears in Kumar et al. [2022], Eq. (86). More generally, this derivation comes from the following observation. Suppose we aim to optimize an objective function $\langle P(\cdot| s,a), v\rangle$ over a Cartesian product of sets $\times_{(s,a)} \mathcal{P}(s,a)$. Since the objective function is *independent* of the uncertainty sets $\mathcal{P}(\bar{s},\bar{a})$ such that $(\bar{s},\bar{a})\neq (s,a)$, and the constraint set is *independently* defined for all state-action pairs, we can restrict the optimization to the given pair $(s,a)$.

*I am also not 100% convinced by the proof sketch: the second equation, at least to my understanding, also seems incorrect*

#### Alternative proof of Theorem 3.1

By definition, we have
$$T^\pi_{\mathcal{U}^{sa}_p}(v) = \min_{(P,R)\in \mathcal{U}^{sa}_p}T^\pi_{P,R}v,$$
and from [Kumar et al. 2022][Thm 3.1], it holds that
$$ (T^\pi_{\mathcal{U}^{sa}_p}v^\pi_{\mathcal{U}^{sa}_p})(s) = \sum_{a}\pi(a|s)\Bigm[-\alpha_{sa}-\gamma\beta_{sa}\kappa_q(v^\pi_{\mathcal{U}^{sa}_p}) + R_0(s,a)+\gamma\sum_{s'}P_0(s'|s,a)v^\pi_{\mathcal{U}^{sa}_p}(s')\Bigm],$$
by taking $v = v^\pi_{\mathcal{U}^{sa}_p}$. Now, we re-derive the above expression by plugging in the worst-case model from our Theorem 3.1:
$$ (T^\pi_{(P^\pi_{\mathcal{U}^{sa}_p},R^\pi_{\mathcal{U}^{sa}_p})}v^\pi_{\mathcal{U}^{sa}_p})(s) = \sum_{a}\pi(a|s)\Bigm[ R^\pi_{\mathcal{U}^{sa}_p}(s,a)+\gamma\sum_{s'}P^\pi_{\mathcal{U}^{sa}_p}(s'|s,a)v^\pi_{\mathcal{U}^{sa}_p}(s')\Bigm], \qquad \text{(by definition)}$$
$$= \sum_{a}\pi(a|s)\Bigm[ R_0(s,a)-\alpha_{sa}+\gamma\sum_{s'}P_0(s'|s,a)v^\pi_{\mathcal{U}^{sa}_p}(s') - \gamma\beta_{sa}\sum_{s'}u^\pi_{\mathcal{U}^{sa}_p}(s')v^\pi_{\mathcal{U}^{sa}_p}(s')\Bigm], \qquad \text{(Theorem 3.1)}$$
$$= \sum_{a}\pi(a|s)\Bigm[ R_0(s,a)-\alpha_{sa}+\gamma\sum_{s'}P_0(s'|s,a)v^\pi_{\mathcal{U}^{sa}_p}(s') - \gamma\beta_{sa}\kappa_q(v^\pi_{\mathcal{U}^{sa}_p})\Bigm]. \qquad \text{(Property 1)}$$
We thus showed that
$$ T^\pi_{(P^\pi_{\mathcal{U}^{sa}_p},R^\pi_{\mathcal{U}^{sa}_p})}v^\pi_{\mathcal{U}^{sa}_p} = \min_{(P,R)\in \mathcal{U}^{sa}_p}T^\pi_{P,R}v^\pi_{\mathcal{U}^{sa}_p}, $$
which establishes that
$$ (P^\pi_{\mathcal{U}^{sa}_p},R^\pi_{\mathcal{U}^{sa}_p})\in \arg\min_{(P,R)\in \mathcal{U}^{sa}_p}T^\pi_{P,R}v^\pi_{\mathcal{U}^{sa}_p}.$$
We will add this alternative proof in the final version in case you find it more convincing (a small numerical check of the worst-kernel identity used here is sketched below).
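As an editorial aside, the worst-kernel identity used in the (Theorem 3.1) and (Property 1) steps can be checked numerically in the Euclidean case $p=q=2$, where $\kappa_2(v)=\lVert v-\bar v\mathbf{1}\rVert_2$ is attained at $u=(v-\bar v\mathbf{1})/\lVert v-\bar v\mathbf{1}\rVert_2$ (our own closed form for this case, not taken from the paper). The sketch below compares $\langle P_0(\cdot|s,a)-\beta_{sa}u,\,v\rangle$ with $\langle P_0(\cdot|s,a),v\rangle-\beta_{sa}\kappa_2(v)$ and with a Monte-Carlo minimum over the constrained ball; all names and numbers are illustrative, and nonnegativity of the perturbed row is not enforced, consistent with the norm-ball-plus-sum-zero constraint discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)
S = 7
v = rng.random(S)                      # stand-in for v^pi_U
p0 = rng.random(S)
p0 /= p0.sum()                         # nominal row P_0(.|s,a)
beta = 0.05                            # radius beta_{sa}

# For p = q = 2: kappa_2(v) is attained at the normalized mean-centered value vector.
u = v - v.mean()
kappa = np.linalg.norm(u)
u /= kappa

worst_row = p0 - beta * u                # Theorem 3.1-style worst kernel row
closed_form = p0 @ v - beta * kappa      # <P_0, v> - beta_{sa} * kappa_2(v)
print(abs(worst_row @ v - closed_form))  # ~0: the two sides of the identity agree

# Monte-Carlo check: no sampled feasible perturbation beats the closed-form minimum.
best = np.inf
for _ in range(100_000):
    c = rng.standard_normal(S)
    c -= c.mean()                        # <1, c> = 0
    c /= max(np.linalg.norm(c), 1.0)     # ||c||_2 <= 1
    best = min(best, (p0 + beta * c) @ v)
print(best - closed_form)                # small and nonnegative
```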
