---
title: Quantile MDP with Markovian Policy reduces to Quantile-based Distributional RL
description: This post shows the connection between Quantile MDP and Distributional RL
tags: Distributional RL
---

# Quantile MDP with Markovian Policy reduces to Quantile-based Distributional RL

## 1. Notation and General Assumptions

Random variables are marked with a tilde, e.g., $\tilde{x}$. The extended real line is denoted by $\bar{\mathbb{R}}:=\mathbb{R}\cup\{-\infty,\infty\}$.

**Quantiles and Value-at-Risk**. A quantile of a random variable $\tilde{x}$ at level $\alpha\in[0,1]$ is any $\tau\in \mathbb{R}$ such that $\mathbb{P}[\tilde{x}\leq \tau]\geq \alpha$ and $\mathbb{P}[\tilde{x}<\tau]\leq \alpha$. It need not be unique and lies in the interval $[q^-_\alpha(\tilde{x}), q^+_\alpha(\tilde{x})]$, where

$$
\begin{aligned}
q^-_\alpha(\tilde{x}):=&\min\{\tau\in \bar{\mathbb{R}}\mid \mathbb{P}[\tilde{x}\leq \tau]\geq \alpha\},\\
q^+_\alpha(\tilde{x}):=&\max\{\tau\in \bar{\mathbb{R}}\mid \mathbb{P}[\tilde{x}< \tau]\leq \alpha\}.
\end{aligned}
$$

$\mathrm{VaR}_\alpha[\tilde{x}]$ is defined as the largest $1-\alpha$ confidence lower bound on the value of $\tilde{x}$, i.e.,

$$
\begin{aligned}
\mathrm{VaR}_\alpha[\tilde{x}]:=q^+_\alpha(\tilde{x}).
\end{aligned}
$$

For ease of presentation, we assume the quantile is unique, i.e., $q^-_\alpha(\tilde{x})=q^+_\alpha(\tilde{x})$, for any $\tilde{x}$. In this case, the value-at-risk (VaR) equals the quantile. For a Markov Decision Process (MDP), we also assume that the state space, the action space, and the support of the random reward are finite.

## 2. Quantile-based Distributional RL

Distributional reinforcement learning has grown popular since the seminal work of [Bellemare et al., 2017], which proposed the distributional Bellman equation

\begin{equation}
\tag{1}
\tilde{z}(s,a)\overset{d}{=} \tilde{r}(s,a) + \gamma \tilde{z}(\tilde{s}',\tilde{a}')
\end{equation}

In this equation, $\tilde{z}(s,a):=\big[\sum_{t=0}^\infty \gamma^t \tilde{r}(\tilde{s}_t,\tilde{a}_t)\,\big|\, \tilde{s}_0=s,\tilde{a}_0=a\big]$ is the random return, and $\overset{d}{=}$ means equality in distribution. [Bellemare et al., 2017] noted that when $\tilde{a}'$ comes from a fixed policy $\pi(\cdot|\tilde{s}')$, i.e., in policy evaluation, the equation enjoys the contraction property. However, in the control setting, where $\tilde{a}'$ is the optimal risk-neutral action, the two sides of Eq.(1) need not be equal in distribution.

[Bellemare et al., 2017] used a categorical distribution to represent the value distribution. Later, [Dabney et al., 2018] proposed to represent the value distribution by its empirical inverse CDF (quantile function), and to update the quantile estimates by quantile regression given a sampled $(s,a,r,s')$ as

\begin{equation}
\tag{2}
q_\alpha(s,a)\leftarrow q_\alpha(s,a) - \eta \cdot \partial_y\mathbb{E}_{u\sim U[0,1]}\Big[l_\alpha\big(r+\gamma q_u(s',a') - y\big)\Big]\Big |_{y=q_\alpha(s,a)},~\forall \alpha\in[0,1]
\end{equation}

where $q_\alpha(s,a):=\mathrm{VaR}_\alpha[\tilde{z}(s,a)]$ is the $\alpha$-quantile of the state-action value; $\eta$ is the learning rate; $U[0,1]$ is the uniform distribution on $[0,1]$; and $l_\alpha(\cdot)$ is the quantile regression loss, given by

$$
l_\alpha(x-y):=(\alpha-\mathbb{I}\{x<y\})(x-y);
$$

$a'$ is drawn from $\pi(\cdot|s')$ in policy evaluation, and $a'=\arg\max_{a} \mathbb{E}_u[q_u(s',a)]$ in optimal control. The quantile-based approach has become one of the mainstream approaches to distributional RL since it was introduced.
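
To make Eq.(2) concrete, here is a minimal tabular sketch of one update step. It is an illustration under stated assumptions, not the implementation of [Dabney et al., 2018]: the quantile levels are discretized to midpoints $\alpha_k=(k+0.5)/K$, the expectation over $u$ is replaced by an average over the same grid, and the function name `qr_td_update` and the array layout are hypothetical.

```python
import numpy as np

def qr_td_update(q, s, a, r, s_next, a_next, gamma=0.9, eta=0.05):
    """One tabular quantile-regression TD step in the spirit of Eq.(2).

    q has shape (n_states, n_actions, n_levels); level k stands for
    alpha_k = (k + 0.5) / n_levels (a discretization assumption).
    """
    n_levels = q.shape[-1]
    alphas = (np.arange(n_levels) + 0.5) / n_levels
    # Targets r + gamma * q_u(s', a'), one per level u on the grid.
    targets = r + gamma * q[s_next, a_next]           # shape (n_levels,)
    y = q[s, a].copy()                                # shape (n_levels,)
    # d/dy l_alpha(x - y) = -(alpha - 1{x < y}) almost everywhere.
    below = (targets[None, :] < y[:, None]).astype(float)
    grad = -(alphas[:, None] - below).mean(axis=1)    # average over u
    q[s, a] = y - eta * grad
    return q

# Example call: a' is supplied by the caller, e.g. sampled from pi(.|s')
# for policy evaluation or chosen greedily w.r.t. the mean for control.
q_table = np.zeros((3, 2, 5))
q_table = qr_td_update(q_table, s=0, a=1, r=1.0, s_next=2, a_next=0)
```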
Recently, [Rowland et al., 2024] provided a theoretical analysis of quantile TD learning (Eq.(2)).

## 3. Quantile MDP

Consider a more general problem: given an MDP, what is the optimal quantile value that can be achieved for every $\alpha\in[0,1]$ and every $(s,a)$? That is,

\begin{equation}
\tag{3}
q^*_\alpha(s,a) := \max_{\pi_{\mathrm{h}}} \mathrm{VaR}_\alpha^{\pi_{\mathrm{h}}}\Big[\sum_{t=0}^\infty \gamma^t \tilde{r}(\tilde{s}_t,\tilde{a}_t)\,\Big|\,\tilde{s}_0=s,\tilde{a}_0=a\Big].
\end{equation}

The optimal policy for Eq.(3) is history-dependent because the $\mathrm{VaR}_\alpha$ operator is non-linear; it is therefore denoted $\pi_\mathrm{h}$ (h for history) in Eq.(3). [Li et al., 2022] and [Hau et al., 2023] showed that this static VaR problem enjoys a dynamic decomposition that admits a Bellman-like equation. However, the equations in [Li et al., 2022] and [Hau et al., 2023] are model-based: a constrained optimization problem involving the transition probabilities needs to be solved. To address this limitation, [Hau et al., 2025] proposed a nested VaR Bellman optimality equation [^1]

\begin{equation}
\tag{4}
q^*(s,\alpha,a) = \mathrm{VaR}_\alpha\Big[ \tilde{r}(s,a)+\gamma \max_{a'} q^*(\tilde{s}',\tilde{u},a') \Big],
\end{equation}

where $\tilde{u}$ follows $U[0,1]$ and $\mathrm{VaR}_\alpha$ is applied to the joint distribution of $\tilde{r}(s,a)$, $\tilde{s}'$ and $\tilde{u}$. We have moved the quantile level $\alpha$ inside the $q$ function to treat it as an augmentation of the state space. The fixed point of Eq.(4) is the optimal $\alpha$-quantile for the pair $(s,a)$, i.e., Eq.(3).

Eq.(4) is not directly amenable to a $Q$-learning style algorithm, since the VaR operator is generally unavailable in closed form.
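
For intuition about what the operator in Eq.(4) computes, the sketch below evaluates its right-hand side exactly when the model is known and everything is discrete. This is only an illustration under assumptions: the level $\tilde{u}$ is approximated by $K$ midpoints, the outcome list standing in for $P(r,s'|s,a)$ is supplied by the caller, and the helper names `upper_quantile` and `var_bellman_rhs` are hypothetical. Note that the maximization over $a'$ is done separately for each level $u$.

```python
import numpy as np

def upper_quantile(values, probs, alpha):
    """q^+_alpha of a discrete distribution: smallest atom v with P[x <= v] > alpha."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    cdf = np.cumsum(np.asarray(probs, dtype=float)[order])
    idx = np.searchsorted(cdf, alpha, side="right")
    # Clip at the largest atom (for alpha = 1 the true q^+ would be +inf).
    return v[min(idx, len(v) - 1)]

def var_bellman_rhs(q, outcomes, alpha, gamma=0.9):
    """Right-hand side of Eq.(4) for one (s, a) pair, given a known model.

    q: array (n_states, n_levels, n_actions); level k stands for
       u_k = (k + 0.5) / n_levels (a discretization assumption).
    outcomes: list of (prob, reward, next_state) describing P(r, s' | s, a).
    """
    n_levels = q.shape[1]
    values, probs = [], []
    for p, r, s_next in outcomes:
        best = q[s_next].max(axis=1)      # max over a', one value per level u
        values.extend(r + gamma * best)
        probs.extend([p / n_levels] * n_levels)
    return upper_quantile(values, probs, alpha)

# Toy inputs (made up): 2 states, 2 levels, 2 actions.
q_toy = np.array([[[0.0, 1.0], [2.0, 1.0]],
                  [[5.0, 4.0], [3.0, 6.0]]])
outcomes_toy = [(0.5, 1.0, 0), (0.5, 0.0, 1)]
print(var_bellman_rhs(q_toy, outcomes_toy, alpha=0.3, gamma=1.0))
```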
However, notice that the $\alpha$-quantile is the argmin of the quantile regression loss. Eq.(4) can therefore be expressed equivalently as

\begin{equation}
\tag{5}
q^*(s,\alpha,a)=\arg\min_y \mathbb{E}\Big[l_\alpha\big(\tilde{r}(s,a)+\gamma \max_{a'}q^*(\tilde{s}',\tilde{u},a')-y\big)\Big].
\end{equation}

As a result, the quantile value can be updated by gradient descent on the quantile regression loss. This can be done with stochastic gradients given a sampled $(s,\alpha,a,r,s')$ (note that using sampled $r$ and $s'$ still gives an unbiased gradient). The quantile value is updated by

\begin{equation}
\tag{6}
q(s,\alpha,a)\leftarrow q(s,\alpha,a) - \eta \cdot \partial_y\mathbb{E}_{u\sim U[0,1]}\Big[l_\alpha\big(r+\gamma \max_{a'}q(s',u,a') -y\big)\Big]\Big|_{y=q(s,\alpha,a)},
\end{equation}

where $\eta$ is the learning rate.

Eq.(2) and Eq.(6) share a very similar structure. The only difference is that Eq.(6) picks the optimal action $a'$ separately for each quantile level $u\in[0,1]$, while Eq.(2), in the control setting, uses the single optimal risk-neutral action $a'=\arg\max_a \mathbb{E}_{u}[q(s',u,a)]$. This subtle difference implies that the quantile-based distributional RL approach, in the control setting, is **not** learning the optimal quantile values of the MDP. Another explanation is that the optimal quantile policy is history-dependent while distributional RL uses a Markovian policy; see Algo. 1 of [Hau et al., 2025] for how to obtain the optimal history-dependent VaR policy.

## 4. Quantile Decomposition under Markovian Policy

We show that when the policy is Markovian, the quantile decomposition technique of [Li et al., 2022] used in quantile MDP reduces to the quantile-based distributional RL approach in policy evaluation.
$$
\begin{aligned}
q^\pi(s,\alpha,a)&=\mathrm{VaR}_\alpha\Big[\sum_{t=0}^\infty \gamma^{t}\tilde{r}(\tilde{s}_t,\tilde{a}_t)\,\Big|\,\tilde{s}_0=s,\tilde{a}_0=a\Big]\\
&=\mathrm{VaR}_\alpha\Big[\tilde{r}(s,a) + \gamma \sum_{t={1}}^\infty \gamma^{t-1}\tilde{r}(\tilde{s}_t,\tilde{a}_t) \Big],~\tilde{a}_t \sim \pi(\cdot|\tilde{s}_t)\\
&\overset{(a)}{=}\mathrm{VaR}_\alpha\Big[\sum_{(r_i,s'_i,a'_i)}\pi(a'_i|s'_i)P(r_i,s'_i|s,a) \Big(r_i + \gamma \big[\sum_{t=1}^\infty \gamma^{t-1} \tilde{r}(\tilde{s}_t,\tilde{a}_t)\,\big|\,\tilde{s}_1=s'_i, \tilde{a}_1=a'_i\big]\Big)\Big],~\tilde{a}_t \sim \pi(\cdot|\tilde{s}_t)\\
&\overset{(b)}{=}\mathrm{VaR}_\alpha\Big[\sum_{(r_i,s'_i,a'_i)}P(r_i,s'_i,a'_i|s,a) \Big(r_i + \gamma q^\pi(s'_i,\tilde{u},a'_i)\Big)\Big]\\
&\overset{(c)}{=}\max_{\boldsymbol{\beta}} \min_i \mathrm{VaR}_{\beta_i}\big[r_i + \gamma q^\pi(s'_i,\tilde{u},a'_i)\big]~\mathrm{with}~\sum_i \beta_i \cdot P(r_i,s'_i,a'_i|s,a) \leq \alpha\\
&=\max_{\boldsymbol{\beta}} \min_i \big\{r_i + \gamma q^\pi(s'_i, \beta_i, a'_i)\big\} \\
&\overset{(d)}{=}\mathrm{VaR}_\alpha\Big[\tilde{r} + \gamma q^\pi(\tilde{s}',\tilde{u},\tilde{a}')\Big]~\mathrm{with}~ \tilde{a}'\sim \pi(\cdot|\tilde{s}'),
\end{aligned}
$$

where (a) enumerates all possible combinations of $(r,s',a')$ (indexed by $i$) under the joint distribution given by the policy $\pi(a'|s')$ and the transition $P(r,s'|s,a)$, with the weighted sum inside $\mathrm{VaR}_\alpha$ read as a mixture over the outcomes $i$; (b) replaces $\pi(a'_i|s'_i)P(r_i,s'_i|s,a)$ by $P(r_i,s'_i,a'_i|s,a)$ and replaces the distribution of $\big[\sum_{t=1}^\infty \gamma^{t-1} \tilde{r}(\tilde{s}_t,\tilde{a}_t)\,\big|\,\tilde{s}_1=s'_i,\tilde{a}_1=a'_i\big]$ by its equivalent representation $q^\pi(s'_i, \tilde{u}, a'_i)$; (c) follows from the quantile decomposition theory, i.e., Theorem 1 and Lemma 2 in [Li et al., 2022]; and (d) follows from Lemma B.4 of [Hau et al., 2025]. The unlabeled equality between (c) and (d) keeps the same constraint on $\boldsymbol{\beta}$ and uses that $r_i$ is a constant and $q^\pi(s'_i,u,a'_i)$ is nondecreasing in $u$, so that $\mathrm{VaR}_{\beta_i}[r_i + \gamma q^\pi(s'_i,\tilde{u},a'_i)] = r_i + \gamma q^\pi(s'_i,\beta_i,a'_i)$.
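
The identity between the first and last lines of this derivation can be sanity-checked numerically on a tiny discrete example. The sketch below is only an illustration: the successor return distributions `z_next` and the joint outcome list `outcomes` are made-up numbers, $\tilde{u}$ is approximated by $K=8$ midpoints (which happens to reproduce these particular distributions exactly), and the helper names are hypothetical.

```python
import numpy as np

def upper_quantile(values, probs, alpha):
    """q^+_alpha of a discrete distribution: smallest atom v with P[x <= v] > alpha."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    cdf = np.cumsum(np.asarray(probs, dtype=float)[order])
    return v[min(np.searchsorted(cdf, alpha, side="right"), len(v) - 1)]

gamma = 0.5
# Made-up return distributions z(s', a') at two successor pairs: (atoms, probs).
z_next = {0: ([0.0, 4.0], [0.25, 0.75]),
          1: ([1.0, 2.0, 8.0], [0.5, 0.25, 0.25])}
# Made-up joint P(r, s', a' | s, a) with a' already drawn from pi(.|s'):
# entries are (prob, reward, successor index).
outcomes = [(0.5, 1.0, 0), (0.5, 0.0, 1)]
K = 8
u_grid = (np.arange(K) + 0.5) / K

def both_sides(alpha):
    # First line: quantile of the exact return distribution r + gamma * z(s', a').
    lhs_vals = [r + gamma * v for p, r, j in outcomes for v in z_next[j][0]]
    lhs_prob = [p * w for p, r, j in outcomes for w in z_next[j][1]]
    # Last line: z(s', a') replaced by its quantile function q(s', u, a'), u on the grid.
    rhs_vals = [r + gamma * upper_quantile(*z_next[j], u)
                for p, r, j in outcomes for u in u_grid]
    rhs_prob = [p / K for p, r, j in outcomes for u in u_grid]
    return (upper_quantile(lhs_vals, lhs_prob, alpha),
            upper_quantile(rhs_vals, rhs_prob, alpha))

for alpha in (0.1, 0.3, 0.6, 0.9):
    lhs, rhs = both_sides(alpha)
    print(alpha, lhs, rhs)
```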
Following the same idea as Eq.(5) and Eq.(6), we have

\begin{equation}
\begin{aligned}
q^\pi(s,\alpha,a) &= \mathrm{VaR}_\alpha\Big[\tilde{r}(s,a)+\gamma q^\pi(\tilde{s}',\tilde{u},\tilde{a}')\Big]\\
&= \arg\min_y \mathbb{E}\Big[l_\alpha\big(\tilde{r}(s,a)+\gamma q^\pi(\tilde{s}',\tilde{u},\tilde{a}')-y\big)\Big],
\end{aligned}
\end{equation}

which leads to the same update rule as Eq.(2), given a sampled $(s,a,r,s')$ and $a'\sim\pi(\cdot|s')$, when performing policy evaluation. Therefore, quantile MDP with a Markovian policy reduces to quantile-based distributional RL under policy evaluation.

## 5. Remark

Note that some technical details are omitted. Please refer to [Hau et al., 2025] for how to handle the non-smoothness of the quantile regression loss and for the convergence analysis of the $Q$-learning algorithm in quantile MDP.

[^1]: See [Luo and Delage, 2026] for the justification of including the random reward $\tilde{r}$ in this Bellman equation.

## References

[Bellemare et al., 2017] A Distributional Perspective on Reinforcement Learning, ICML 2017

[Dabney et al., 2018] Distributional Reinforcement Learning with Quantile Regression, AAAI 2018

[Li et al., 2022] Quantile Markov Decision Processes, Operations Research 2022

[Hau et al., 2023] On Dynamic Programming Decompositions of Static Risk Measures in Markov Decision Processes, NeurIPS 2023

[Rowland et al., 2024] An Analysis of Quantile Temporal-Difference Learning, JMLR 2024

[Hau et al., 2025] Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis, AISTATS 2025

[Luo and Delage, 2026] Boosting CVaR Policy Optimization with Quantile Gradients, ICML 2026