# **NeurIPS 2022 Rebuttal, RMDPs**
### Experiments
While our contribution is mostly theoretical, given the common criticism about the lack of experiments, we propose to add the following numerical experiments.
The table below reports the ***relative cost (time) of robust value iteration*** w.r.t. non-robust MDPs, for randomly generated kernels and reward functions with $S$ states and $A$ actions.
| S | A | non-robust |$\mathcal{U}^{sa}_\infty$ LP|$\mathcal{U}^{sa}_1$ LP|$\mathcal{U}^{s}_1$ LP|$\mathcal{U}^{sa}_1$|$\mathcal{U}^{sa}_2$|$\mathcal{U}^{sa}_\infty$ |$\mathcal{U}^{s}_1$|$\mathcal{U}^{s}_2$|$\mathcal{U}^{s}_\infty$|$\mathcal{U}^{sa}_5$|$\mathcal{U}^{sa}_{10}$|$\mathcal{U}^{s}_5$|$\mathcal{U}^{s}_{10}$|
| -------- | -------- |-------- |-------- | -------- | -------- |-------|-----|-------|------|-------|-----|-------|------|-------|-----|
| 10 | 10 | 1| 1374 |1438|72625 |1.77|1.51|1.58|1.41|2.63|1.41|5.4|5.56|33.30|33.59|
| 30 | 10 | 1 |2282 |6616|629890 |1.38|1.43|1.48|1.58|2.82|3.04|4.91|5.29|89.23|78.17|
| 50 | 10 | 1 |2848 |6622|4904004|1.54|1.91|1.37|1.20|2.49|2.25|4.14|4.15|40.22|41.07|
| 100 | 20 | 1 |6930 |16714|NA|1.45|1.59|1.58|1.16|2.18|1.50|4.06|3.26|41.22|41.10|
**Notations:**
*S*: number of states,
*A*: number of actions,
*$\mathcal{U}^{sa}_p$ LP*: sa-rectangular $L_p$ robust MDPs solved by linear programming,
*$\mathcal{U}^{s}_p$ LP*: s-rectangular $L_p$ robust MDPs solved by linear programming and other numerical methods,
*$\mathcal{U}^{sa/s}_{p=1,2,\infty}$*: sa/s-rectangular $L_1/L_2/L_\infty$ robust MDPs solved by our closed-form method (see table 2, theorem 3 of the paper),
*$\mathcal{U}^{sa/s}_{p=5,10}$*: sa/s-rectangular $L_5/L_{10}$ robust MDPs solved by our binary-search method (see table 2, theorem 3 of the paper).
**Observations**:
1. For s/sa-rectangular $L_1/L_2/L_\infty$ robust MDPs, one iteration of our value iteration takes roughly the same time (1-3 times) as one iteration of non-robust value iteration. This confirms our complexity analysis (see table 4 of the paper).
2. Our binary-search method for sa-rectangular $L_5/L_{10}$ robust MDPs takes around 4-6 times more time than the non-robust counterpart. This is due to the extra iterations required to find the p-variance function $\kappa_p(v)$ through binary search.
3. Our binary-search method for s-rectangular $L_5/L_{10}$ robust MDPs takes around 30-100 times more time than the non-robust counterpart. This is due to the extra iterations required to find the p-variance function $\kappa_p(v)$ through binary search inside the Bellman operator.
4. A common feature of our methods is that the time scales moderately with the problem size, as guaranteed by our complexity analysis.
5. Linear programming methods for sa-rectangular $L_1/L_\infty$ robust MDPs take at least 1000 times longer than our methods even for small state-action spaces, and the gap grows quickly.
6. Numerical methods (linear programming for the minimization over the uncertainty and 'scipy.optimize.minimize' for the maximization over the policy) for s-rectangular $L_1$ robust MDPs take 4-5 orders of magnitude more time than our methods (and non-robust MDPs) even for very small state-action spaces, and scale up very fast. The reason is clear: they must solve two optimization problems, a minimization over the uncertainty and a maximization over the policy, whereas in the sa-rectangular case only the minimization over the uncertainty is required. This confirms that the s-rectangular uncertainty set is much more challenging.
The rate of convergence was approximately $\gamma = 0.9$ for all methods, as predicted by theory. This is well illustrated by the ***relative rate of convergence*** w.r.t. the non-robust MDP in the table below.
| S | A | non-robust |$\mathcal{U}^{sa}_1$|$\mathcal{U}^{sa}_2$|$\mathcal{U}^{sa}_\infty$ |$\mathcal{U}^{s}_1$|$\mathcal{U}^{s}_2$|$\mathcal{U}^{s}_\infty$|$\mathcal{U}^{sa}_5$|$\mathcal{U}^{sa}_{10}$|$\mathcal{U}^{s}_5$|$\mathcal{U}^{s}_{10}$|
| -------- | -------- | -------- |-------|-----|-------|------|-------|-----|-------|------|-------|-----|
| 10 | 10 | 1 |0.999|0.999|1.000|0.999|0.999|1.000|0.999|1.000|1.000|1.000|
| 100 | 20 | 1 |0.999|0.999|0.998|0.999|0.999|0.998|0.995|0.997|0.998|0.995|
In the above experiments, Bellman updates for sa/s-rectangular $L_1/L_2/L_\infty$ were done in closed form, and for $L_5/L_{10}$ by binary search, as suggested by table 2 and theorem 3.
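For concreteness, here is a minimal sketch of the two routines used for $\kappa_q(v)$, assuming the q-variance $\kappa_q(v)=\min_{\omega\in\mathbb{R}}\lVert v-\omega\mathbf{1}\rVert_q$ (consistent with the closed forms $\kappa_2(v)=\sqrt{S}\,\mathrm{std}(v)$ and $\kappa_\infty(v)=0.5[\max_s v(s)-\min_s v(s)]$ quoted in our response to reviewer GQmV); the function names are ours, for illustration only.

```python
import numpy as np

def kappa_closed_form(v, q):
    """Closed-form q-variance for q in {2, inf}: kappa_q(v) = min_w ||v - w*1||_q."""
    if q == 2:
        return np.linalg.norm(v - v.mean())      # = sqrt(S) * std(v)
    if q == np.inf:
        return 0.5 * (v.max() - v.min())
    raise ValueError("use kappa_binary_search for general q")

def kappa_binary_search(v, q, tol=1e-5):
    """General q: minimize the convex map w -> ||v - w*1||_q by binary search on
    the sign of its subgradient; the minimizer lies in [min(v), max(v)]."""
    lo, hi = v.min(), v.max()
    while hi - lo > tol:
        w = 0.5 * (lo + hi)
        # derivative of sum_i |v_i - w|^q w.r.t. w (up to a positive factor q)
        g = -np.sum(np.sign(v - w) * np.abs(v - w) ** (q - 1))
        if g > 0:
            hi = w      # past the minimizer, move the upper end down
        else:
            lo = w
    w = 0.5 * (lo + hi)
    return np.linalg.norm(v - w, ord=q)
```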
Note: the results above are from a few runs, hence they contain some stochasticity, but the general trend is clear. In the final version we will average over many runs to reduce the stochastic noise. Results for many different runs can be found at https://github.com/**********.
Note that the above experiments were done without much parallelization; there is ample scope to fine-tune and improve the performance of robust MDPs. The above experiments confirm the theoretical complexity provided in Table 4 of the paper. The code and results can be found at https://github.com/**********.
**Experiment parameters:** number of states $S$ (variable), number of actions $A$ (variable), transition kernel and reward function generated randomly, discount factor $0.9$, uncertainty radius $0.1$ (for all states and actions, for convenience), number of iterations $= 100$, tolerance for binary search $= 0.00001$.
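For reference, the random instances are generated roughly as follows (a sketch; the exact sampling distribution is a convenience choice and not prescribed by the paper):

```python
import numpy as np

def random_mdp(S, A, seed=0):
    """Random instance: transition kernel rows sampled and normalized onto the
    simplex, rewards uniform in [0, 1] (a convenience choice)."""
    rng = np.random.default_rng(seed)
    P0 = rng.random((S, A, S))
    P0 /= P0.sum(axis=-1, keepdims=True)   # normalize each (s, a) row
    R0 = rng.random((S, A))
    return P0, R0
```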
**Hardware:** Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz, 64 bits, 7862 MiB memory.
**Software:** Experiments were done in Python, using numpy, scipy.optimize.linprog for linear programming in policy evaluation for s/sa-rectangular robust MDPs, and scipy.optimize.minimize with scipy.optimize.LinearConstraint for policy improvement in s-rectangular $L_1$ robust MDPs.
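As an illustration of the LP baseline, here is a minimal sketch of the inner minimization that scipy.optimize.linprog solves for a single $(s,a)$ pair in the sa-rectangular $L_1$ case (the variable split and names are ours; our actual code may differ in details):

```python
import numpy as np
from scipy.optimize import linprog

def sa_L1_worst_case_lp(p0, v, beta):
    """LP baseline: min_p p.v  s.t.  p in the simplex, ||p - p0||_1 <= beta.
    Variables are x = [p, t] with t_i >= |p_i - p0_i|."""
    S = len(p0)
    c = np.concatenate([v, np.zeros(S)])                # minimize p.v
    I = np.eye(S)
    A_ub = np.vstack([
        np.hstack([ I, -I]),                            #  p - t <= p0
        np.hstack([-I, -I]),                            # -p - t <= -p0
        np.hstack([np.zeros(S), np.ones(S)])[None],     # sum(t) <= beta
    ])
    b_ub = np.concatenate([p0, -p0, [beta]])
    A_eq = np.hstack([np.ones(S), np.zeros(S)])[None]   # sum(p) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (2 * S), method="highs")
    return res.fun                                      # worst-case expected value
```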
## *Reviewer 1 (kzqs)
>Rating 5, confidence 3, contribution 3, soundness 3, presentation 2
We thank reviewer kzqs for the encouraging review. We are glad that the reviewer finds our main results "novel, significant and capable of advancing our understanding of the robust MDPs". We concur with the reviewer that our work can be used to derive sample-based algorithms for s/sa-rectangular $L_1/L_2$ robust MDPs.
Below we respond to the questions.
`>>> W1:` **The scope might be limited by the fact that the state-action space is assumed to be finite and the focus is on the dynamic programming with given models (no sample-bases approximation /standard RL interaction)**
`>>> R1:` **Finite state space:** Essentially all previous works on robust MDPs (tens of contributions over more than 15 years, starting from the Nilim and El Ghaoui 2005 paper [3], with the exception of Tamar et al. [8]) have focused on the finite case. We believe that our work will open up the opportunity for solving robust MDPs beyond finite state spaces, especially in the s-rectangular case.
**Sample-based algorithms:** For sa-rectangular $L_1/L_2$ robust MDPs, the convergence proofs and algorithms would follow the same lines as the model-free algorithm that exists for the R-contamination (sa-rectangular) uncertainty set [2]. Further, the work can easily be generalized to obtain model-free algorithms for s-rectangular $L_1/L_2$ robust MDPs. We may attach model-free algorithms in the supplementary material.
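As an illustration only, the following rough sketch (our notation, not an algorithm stated in the paper) shows what a tabular sa-rectangular robust Q-learning update could look like, with the sampled reward penalized by $\alpha_{s,a}+\beta_{s,a}\kappa_q(v)$ computed from the current value estimate:

```python
import numpy as np

def robust_q_learning_step(Q, s, a, r, s_next, alpha_sa, beta_sa, gamma, lr, q_norm=np.inf):
    """One tabular Q-learning update for an sa-rectangular L_p robust MDP (sketch).
    The reward is penalized by alpha + beta * kappa_q(v), where v(s) = max_a Q(s, a)
    is the current value estimate; kappa_inf corresponds to the L_1 uncertainty set,
    kappa_2 to the L_2 one."""
    v = Q.max(axis=1)                               # current value estimate
    if q_norm == np.inf:
        kappa = 0.5 * (v.max() - v.min())           # kappa_inf, online-computable
    else:                                           # q_norm == 2
        kappa = np.linalg.norm(v - v.mean())        # kappa_2 = sqrt(S) * std(v)
    target = (r - alpha_sa - beta_sa * kappa) + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])
    return Q
```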
`>>> W2:`**No Experiments**
`>>> R2:` While our contribution is mostly theoretical, we propose to add the experiments in the experiment section above, which confirm our theoretical claims.
`>>> Q1:` ***Regarding the quality of writing ...***
`>>> A1:` We thank the reviewer for pointing our attention there; we plan to revise the paper and implement all the suggestions.
`>>> Q2:` ***I think adding some experiments would be of great value for a paper like this. ..... how it behaves empirically in these two settings (and different values of p).***
`>>> A2:` Please see the results in the experiments section above. We will add similar and more extensive experiments in the final version.
`>>> Q4:` **Any idea how can the results here be generalized beyond finite MDPs?**
`>>> A4:` This is the most important question in the subfield of robust MDPs. We think that the new operators proposed in Theorems 1 and 2 may lead to a gradient-style operator (perhaps of Bellman-error style) that would enable a deep learning scheme. In fact, our goal in this line of research has been to develop operators that we could later plug into a gradient-based deep learning approach. It turned out to be more complicated than expected because of the different normalizations needed for the Bellman-like operators.
`>>> Q5:` **The main step in the proof of, e.g., Theorem 1 seems to be a separation between the nominal model and the noise. Is this possible only because of this specific definition of uncertainty set where noise and nominal model seem to be disjoint? For instance, if we define the uncertainty set as $\{p:\lVert p-p_0\rVert_p \leq \beta\}$, can we still prove the same thing?**
`>>> A5:` Yes, you are right: this separation is possible due to the crucial s- (or sa-) rectangularity assumption on the noise (uncertainty set). Solving robust MDPs for general noise is proven to be NP-hard [4,5]. The uncertainty set $\{p:\lVert p-p_0\rVert_p \leq \beta\}$ is not s/sa-rectangular, which can make things very difficult: many structural properties of MDPs that hold for s-rectangular robust MDPs may break here. We believe this is an interesting future direction that can be built upon our s-rectangular robust MDP results.
`>>> Q6:` **A line of works that I believe should be acknowledged here is the one of optimistic exploration in MDPs (eg regret minimization). There a connection between regularization and robustness (well, not really robustness but a similar thing) already appeared way before these recent works. For instance, if we think about the UCRL algorithm by Auer et al. or the UCBVI algorithm by Azar et al., the main idea is to build upper confidence bounds over value functions from confidence intervals over P and R (which are really rectangular uncertainty sets). Then, one derives an extended value iteration scheme which amounts to finding, at each state, the action an models (P, R) maximizing the one-step application of the Bellman operator. This is not really robust (as we have max vs min) but it is conceptually very similar. For that problem, it is known that the max over models can be resolved by simply adding an exploration bonus (ie by regularizing).**
`>>> A6:` Yes, this line of work might be relevant for the conceptual understanding of MDPs under uncertainty in general. Our results have a striking similarity (appending the reward with a penalty/bonus), but the motivation and setting are quite different. If there is a way to unify the two lines of work, this would be an awesome result.
There is also a striking difference: the above line of work and the work on sa-rectangularity (our $L_p$ robust and the R-contamination uncertainty set [2]) suggest that appending the reward solves everything, whereas our s-rectangular results suggest that, besides appending the reward, in the face of uncertainty we should not rely entirely on the best action but rather on some top $k$ actions. That is very novel.
## *Reviewer 2 (4h9F)
>Rating 5, confidence 2, contribution 2, soundness 2, presentation 2
We thank the reviewer for his/her comments. We are pleased to read that the reviewer thinks our paper is well organised and summarized through Tables 2-4, and that the paper contributes by describing the policy improvement step and the time complexity of $L_p$ robust MDPs.
`>>> W1` **The paper is constrained to theoretical results. It could be improved by including numerical experiments to showcase the proposed approach's viability, improving the paper's clarity.**
`>>> R1` The main contribution of the paper is to provide algorithms for sa- and s-rectangular $L_p$ robust MDPs that can be generalized to model-free regimes and that consume resources similar to non-robust MDPs.
Although the time complexity provides a worst-case analysis, we plan to add experiments similar to the ones above (experiment section) in the final draft. They show that robust MDPs converge as fast as non-robust MDPs and take only a constant (and modest, around 2) factor more resources for sa/s-rectangular $L_1/L_2/L_\infty$ robust MDPs. For general $p$ (s-rectangular), we have to perform a binary search for each state, and one binary search takes around 20-100 steps; that is why it takes around 50 times more resources than non-robust MDPs. Each binary search is currently done from scratch, so this can be greatly improved by fine-tuning the binary search.
`>>> W2` ***Only the introduction discusses related work. Perhaps this could be extended in a related work section, which would help clarify the significance and originality of the contribution.***
`>>> R2` We concur with the reviewer; the paper is very dense for the current page limit. We will definitely use the extra page available for the final version to discuss related work, which would highlight that the s-rectangular uncertainty set is much more challenging and that this is the first work that effectively tackles it, with some novel insights (in the face of uncertainty, do not rely on the best action, but on the top $k$ actions). We will also add experiments to show that our robust algorithms are on par with their non-robust counterparts in terms of resource consumption and performance.
`>>> Q1` ***Line 101: what 'it' refers to?***
`>>> A1` 'It' refers to the sa-rectangular uncertainty set.
`>>>Q2` ***line 40: should "did" be removed?***
`>>>A2` Thank you, we will remove it in the updated draft.
`>>>Q3` ***line 125: should it say "tables 2 and 3"?***
`>>> A3` Yes, thank you, we will update it.
## *Reviewer 3 (2VQH)
>Rating 4, confidence 3, contribution 2, soundness 2, presentation 2
We thank the reviewer for the review. We are pleased that the reviewer thinks our paper considers a challenging task.
`>>>W1:` ***The definition of the uncertainty set is unrealistic: condition A is not enough to guarantee the set is well-defined, because there may be some , and the set isn't an uncertainty set. One way is to relax this set but will introduce additional assumptions [1]; Another way is to use some other uncertainty set type like [2]. I’m not sure how this relaxation will impact in this paper, hence I am not totally convinced by the following results.***
`>>> R1` Previous work [6] has made such an assumption, namely that the nominal kernel lies well inside the simplex.
Alternatively, we can impose an extra condition to ensure that $P+\mathcal{P}$ has all positive entries; this amounts to taking a projection of $\kappa_q(v)$ onto a linear space (computationally easy). This will not affect the nature of our results; only the reward regularizer $\kappa_q(v)$ will change slightly. We will add this discussion in the appendix.
`>>>W2:` ***The value of Q is used to solve the problem. Does Q is assumed to be known?***
`>>> R2` We don't assume $Q$ to be known. $Q$ here is a temporary variable; we can state the results without it. For example, in the non-robust counterpart we have
$$ v(s) = \max_{a}[R(s,a) + \gamma\sum_{s'}P(s'|s,a)v(s') ], $$
we can write it as
$$v(s) = \max_{a}Q(s,a), \text{where}\quad Q(s,a) = R(s,a) + \gamma\sum_{s'}P(s'|s,a)v(s').$$
Here, $Q$ has a well-known meaning (the Q-value), but in our setting we have not yet given it any meaning; it is just a temporary variable, defined as
$$Q(s,a) = R_0(s,a) + \gamma\sum_{s'}P_0(s'|s,a)v(s'),$$
while computing the Bellman update $\mathcal{T}_{\mathcal{U}}v$ at the value function $v$. We compute it as above (since $R_0$ and $P_0$ are known), just as in non-robust value iteration.
`>>> Q1:`***How do you fix the problems in the Weakness part?***
`>>> A1:` For now, we add an extra assumption in the main paper that the nominal kernel lies well inside the simplex, and discuss its relaxation in the appendix or supplementary material. The relaxation requires an extra condition in the optimization, which affects the reward regularizer $\kappa_p(v)$ slightly. We will add this discussion in the appendix.
`>>> Q2:` ***Is $P_0$ assumed to be known as the paper needs ? If so, the uncertainty set is known, why not use dynamic programming? If not, how do you obtain $Q$? This should be clarified.***
`>>> A2` Yes, we assume $P_0$ to be known.
**Sa rectangular case:**
We have value iteration as
$$(\mathcal{T}_{\mathcal{U}^{sa}_p}v)(s) = \max_{a}[R_0(s,a) -\alpha_{s,a}-\beta_{s,a}\kappa_{q}(v)+ \gamma\sum_{s'}P_0(s'|s,a)v(s')],$$
which can be rewritten as
$$(\mathcal{T}_{\mathcal{U}^{sa}_p}Q)(s,a) = R_0(s,a) -\alpha_{s,a}-\beta_{s,a}\kappa_{q}(v)+ \gamma\sum_{s'}P_0(s'|s,a)\max_{a'}Q(s',a'), $$
where $v(s):=\max_{a}Q(s,a).$
We can then use dynamic programming (Q-value iteration) to obtain $Q$ for sa-rectangular $L_p$ robust MDPs, just as for non-robust MDPs and the R-contamination uncertainty set [2]; for this we can use Q-learning or TD(0), etc. This can easily be generalized to model-free regimes too, provided we can compute $\kappa_q$ online, which is the case for $L_1$ and $L_2$: $\kappa_\infty$ and $\kappa_2$ can be computed online, which lets us generalize the above Q-value iteration to the model-free setting. The convergence proof may follow directly from [2].
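Concretely, a minimal sketch of this Q-value iteration (the naming and array shapes are ours; `kappa_q` stands for any routine returning the q-variance of a value vector, e.g. the closed forms for $q\in\{2,\infty\}$):

```python
import numpy as np

def sa_robust_q_value_iteration(R0, P0, alpha, beta, gamma, kappa_q, n_iters=100):
    """Model-based Q-value iteration for sa-rectangular L_p robust MDPs,
    following the recursion displayed above (sketch).
    Shapes: R0, alpha, beta are (S, A); P0 is (S, A, S)."""
    S, A = R0.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        v = Q.max(axis=1)                        # v(s) = max_a Q(s, a)
        penalty = alpha + beta * kappa_q(v)      # reward regularizer
        Q = R0 - penalty + gamma * (P0 @ v)      # Bellman update
    return Q
```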
***S rectangular case***
We compute the $Q$-value as illustrated in algorithm 2.
Things are not as easy as in the sa-rectangular case, since the greedy policy does not simply take the best action; see theorem 3 and property 2. Let us focus on the $L_1$ case, as in algorithm 2. We have
$$(\mathcal{T}^*_{\mathcal{U}^s_1}v)(s) =\max_{m} \frac{\sum_{i=1}^mQ(s,a_i) - \alpha_s-\beta_s\gamma\kappa_\infty(v)}{m},$$
where
$$Q(s,a) = R_0(s,a) + \gamma\sum_{s'}P_0(s'|s,a)v(s').$$
Now, if we try to derive a Q-learning update from the above, we get
$$Q(s,a) = R_0(s,a) + \gamma\sum_{s'}P_0(s'|s,a)\max_{m} \frac{\sum_{i=1}^mQ(s',a_i) - \alpha_{s'}-\beta_{s'}\gamma\kappa_\infty(v)}{m},$$
where
$$v(s) = \max_{m} \frac{\sum_{i=1}^mQ(s,a_i) - \alpha_s-\beta_s\gamma\kappa_\infty(v)}{m}.$$
There seems to be no way around also computing the value function, so TD(0) and Q-learning may not be directly possible in the s-rectangular case. But this does not prevent us from getting a model-based algorithm (as in algorithm 2) and a model-free algorithm, which we may attach in the supplementary material. The main idea for generalizing to the model-free regime is to estimate both $Q$ and $v$.
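For concreteness, a minimal sketch of one application of the s-rectangular $L_1$ robust Bellman operator displayed above (vectorized over states; the array shapes and names are ours):

```python
import numpy as np

def s_L1_robust_bellman(v, R0, P0, alpha, beta, gamma):
    """One application of the s-rectangular L_1 robust Bellman operator (sketch).
    Actions are sorted by Q-value and the best penalized top-m average is taken.
    Shapes: R0 is (S, A), P0 is (S, A, S), alpha and beta are (S,)."""
    S, A = R0.shape
    kappa_inf = 0.5 * (v.max() - v.min())
    Q = R0 + gamma * (P0 @ v)                    # nominal Q at value v
    Q_sorted = -np.sort(-Q, axis=1)              # descending per state
    m = np.arange(1, A + 1)
    # candidate value for every m: (sum of top-m Q - penalty) / m
    candidates = (np.cumsum(Q_sorted, axis=1)
                  - alpha[:, None] - beta[:, None] * gamma * kappa_inf) / m
    return candidates.max(axis=1)                # new value, shape (S,)
```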
`>>> Q3:` ***(Assume the question (1) is fixed) Why not use TD(0) or Q-learning directly (like in [1,2] above) since the Bellman equation is clear?***
`>>> A3` We can use them in the sa-rectangular case but perhaps not in the s-rectangular case, as argued in A2: the Bellman equation for the s-rectangular case is much messier.
## *Reviewer 4 (GQmV)
>Rating 4, confidence 5, contribution 2, soundness 3, presentation 3
We thank the reviewer for the review. We are encouraged to learn that the reviewer thinks our paper is mathematically rigorous, well-written, and has new and interesting results that would make good contributions to this line of research.
`>>> W1` ***No experiments to support their algorithms.***
`>>> R1` Please see the experimental results in the tables (in the experiment section above). They clearly illustrate that our robust algorithms are resource efficient (up to a constant factor, around 2), as guaranteed by our complexity analysis, and that the rate of convergence is similar to non-robust MDPs, as guaranteed in the literature. We will add exhaustive experiments of this kind in the final version.
`>>> W2:` ***Why our policy improvement step is better, why binary search is better?***
`>>> R2:` We want to emphasize that we can compute the policy improvement step in closed form for s/sa-rectangular $L_1/L_2/L_\infty$ robust MDPs, with performance very close (up to a constant/log factor) to the non-robust counterpart. For general $p$, we resort to binary search, whose performance is still very good compared to black-box solvers such as linear programming. Most importantly, our policy improvement step scales well with the state-action space.
The previous works that solve s-rectangular noise [6,7] use black-box solvers such as LP and projections. Our binary search takes around 30-50 iterations to reach a very good tolerance (estimation error). This is also confirmed by the experiments in the experiment section above. Hence, our methods have clear advantages over those.
`>>> W3:` ***Why there is no comparision with existing algorithms?***
`>>> R3:` There are works that propose algorithms for the sa-rectangular case [2,6,7]. Our main contribution is the policy improvement step in the s-rectangular case: previous works [6,7] do policy improvement through a black-box projection subroutine, while we do it in closed form (for $L_1/L_2/L_\infty$), which is obviously faster. Hence, we think our contribution is clear on this front. A comparison in the sa-rectangular case would not be fair to the other algorithms, as we would have much better performance.
`>>> Q1 a):` ***What is the implication of Theorem 4?***
`>>> A1 a):` We know that the greedy (also optimal) policy is deterministic (take the best action) in non-robust and sa-rectangular MDPs. It is also known that the greedy (also optimal) policy may be stochastic in the s-rectangular case, but nothing was known about its nature: whether it is a softmax, or something else. Theorem 4 reveals that it is of a threshold type (for $L_p$ robust MDPs), where the actions above some threshold are taken with probability proportional to their advantage. This result indicates that it suffices to find this threshold and the magnitude of the advantage to derive an optimal policy, which may allow much faster computation.
`>>> Q1 b)`***Is this useful for the algorithm developments?***
`>>> A1 b)` In our paper, we can do robust value iteration without this result, as we can compute the Bellman operator directly. But we believe this result says that, in the face of uncertainty, one should not rely only on the best choice but on the top $k$ choices. It reveals a very interesting property of the optimal (and greedy) policy that may guide the design of robust algorithms in more complicated scenarios in the future.
`>>> Q1 c):` ***It is not surprising that the optimal policy is a threshold policy, but can we compute the threshold efficiently? The theorem does not say anything about this?***
`>>> A1 c):` It is known in the literature that the greedy (optimal) policy in the s-rectangular case may be stochastic, but the fact that it is a threshold policy is new. Yes, we can compute it efficiently. To compute this policy (see theorem 4), we need the advantage function
$$A(s,a) = R_0(s,a) + \gamma \sum_{s'}P_0(s'|s,a)v(s') - (\mathcal{T}^*_{\mathcal{U}^s_p}v)(s).$$
We assume $R_0,P_0$ to be known, and theorem 3 says we can apply the Bellman operator efficiently.
In the model-free setting, we can keep track of both the Q-value and the value function $v$, so the advantage may look something like (not rigorous, left for future work)
$$A(s,a) = Q(s,a)- v(s).$$
But this is the first work that effectively tackles s-rectangular robust MDPs, which are much more challenging than the sa-rectangular setting. Our algorithms for the sa-rectangular setting share the same promises (complexity, ability to generalize to model-free regimes, etc.) as [2]. Besides, the work provides model-based algorithms for s-rectangular $L_p$ robust MDPs, where for special cases such as $L_1/L_2/L_\infty$ the algorithms compute both the policy evaluation and the policy improvement in closed form (or by a simple procedure). Further, for s-rectangular $L_1/L_2$ robust MDPs the algorithms can be generalized to the model-free regime, as $\kappa_2(v) = \sqrt{S}\,\mathrm{std}(v)$ and $\kappa_\infty(v) = 0.5[\max_s v(s)-\min_{s}v(s)]$ can be estimated online. If, in the future, we find a way to compute $\kappa_q$ online for general $q$, then s-rectangular $L_p$ robust MDPs could also be handled in the model-free regime.
`>>> Q2 a)` ***Implication, usefullness and computation of property 1?***
`>>> A2 a)` Recall that $\chi_p(v,s)$ is the number of active actions in state $s$ at value function $v$. Property 1 states that it can be computed directly and efficiently from equation 13, without computing the greedy policy or the Bellman operator.
Now, we know that in the $L_1$ robust case the greedy policy takes the top $\chi_p(v,s)$ actions uniformly at value function $v$ and state $s$. So we can compute $\chi$ from equation 13, then get the greedy policy (the top $\chi_p(v,s)$ actions, uniformly), and then compute the Bellman operator as
$$(\mathcal{T}_{\mathcal{U}^s_1}v)(s) = \sum_{i=1}^{\chi_p(v,s)}\frac{Q(s,a_i) - \alpha_s -\beta_s\gamma\kappa_\infty(v)}{\chi_p(v,s)},$$
where $Q(s,a) = R_0(s,a) + \gamma\sum_{s'}P_0(s'|s,a)v(s')$ and $Q(s,a_1)\geq Q(s,a_2)\geq\cdots.$
Note that this reverses the usual order, where one first computes the Bellman operator, then the advantage function, then the greedy policy, and only then the number of active actions.
Apart from these computational aspects, we can look directly at equation 13 and try to understand how the number of active actions depends on the level of uncertainty, that is, how many more actions become active as the uncertainty increases.
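To make this concrete, here is a sketch that recovers $\chi$ as the maximizing $m$ in the Bellman expression above (rather than via equation 13, which we do not restate here) and then forms the top-$\chi$ uniform greedy policy; the names and shapes are ours:

```python
import numpy as np

def s_L1_greedy_policy(v, R0, P0, alpha, beta, gamma):
    """Sketch: number of active actions chi(s) and the greedy policy for the
    s-rectangular L_1 case. chi is taken as the maximizing m in the Bellman
    expression; the greedy policy plays the top-chi actions uniformly."""
    S, A = R0.shape
    kappa_inf = 0.5 * (v.max() - v.min())
    Q = R0 + gamma * (P0 @ v)
    order = np.argsort(-Q, axis=1)                        # actions by decreasing Q
    Q_sorted = np.take_along_axis(Q, order, axis=1)
    m = np.arange(1, A + 1)
    candidates = (np.cumsum(Q_sorted, axis=1)
                  - alpha[:, None] - beta[:, None] * gamma * kappa_inf) / m
    chi = candidates.argmax(axis=1) + 1                   # number of active actions
    pi = np.zeros((S, A))
    for s in range(S):
        pi[s, order[s, :chi[s]]] = 1.0 / chi[s]           # top-chi actions, uniform
    return chi, pi, candidates.max(axis=1)                # chi, greedy policy, T*v
```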
`>>> Q2 b)` ***Implication, usefullness and computation of property 2?***
`>>> A2 b):` This is also a conceptual result. In non-robust MDPs, we know that the value function is the maximal Q-value, that is,
$$(\mathcal{T}^*v)(s) = \max_{a}Q(s,a),$$
where $Q(s,a) = R(s,a) + \gamma\sum_{s'}P(s'|s,a)v(s')$.
Something similar is true in the sa-rectangular case: the value function is the best regularized Q-value, that is,
$$(\mathcal{T}^*_{\mathcal{U}^{sa}_p}v)(s) = \max_{a}Q(s,a),$$
where $Q(s,a) = R_0(s,a) - \alpha_{s,a}-\beta_{s,a}\kappa_{q}(v) + \gamma\sum_{s'}P_0(s'|s,a)v(s')$.
What about the s-rectangular case? Property 2 is precisely what relates the value function to the Q-value. It says that the value at state $s$ lies between the Q-values of the $\chi_p(v,s)$-th and the $(\chi_p(v,s)+1)$-th top actions:
$$Q(s, a_{\chi_p(v,s)}) \geq (\mathcal{T}^*_{\mathcal{U}^s_p}v)(s) \geq Q(s, a_{\chi_p(v,s) +1}),$$
where $Q(s,a) = R_0(s,a) + \gamma\sum_{s'}P_0(s'|s,a)v(s')$.
This is an inequality, not an equality, and it comes from equation 13. New and different! We know that the value is a regularized average of the top $\chi_p(v,s)$ actions (theorem 4), but that result says nothing of this sort.
# References
[1] Roy, A., Xu, H., and Pokutta, S. Reinforcement learning under model mismatch. Advances in Neural Information Processing Systems, 30, 2017.
[2] Wang, Y. and Zou, S. Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193-7206, 2021.
[3] Nilim, A. and El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53:780-798, 2005.
[4] Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257-280, 2005.
[5] Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153-183, 2013.
[6] Ho, C. P., Petrik, M., and Wiesemann, W. Partial policy iteration for L1-robust Markov decision processes. Journal of Machine Learning Research, 22(275):1-46, 2021.
[7] Derman, E., Geist, M., and Mannor, S. Twice regularized MDPs and the equivalence between robustness and regularization. Advances in Neural Information Processing Systems, 2021. https://openreview.net/forum?id=x00mCNwbH8Q
[8] Tamar, A., Xu, H., and Mannor, S. Scaling up robust MDPs by reinforcement learning. arXiv:1306.6189, 2013. http://arxiv.org/abs/1306.6189