# Summary Note [Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism (2023)]

Reinforcement Learning with Human Feedback (RLHF) is now widely used in large language models, recommendation systems, and autonomous driving. However, RLHF suffers from several issues:
- Large state spaces with limited human feedback
- Bounded rationality of human decisions
- Off-policy distribution shift

To overcome these issues, this paper proposes a method called Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO).

# Introduction

In RLHF, unlike conventional offline RL, the learner does not have direct access to the reward signal. In fact, the learner only observes a historical record of visited states and human-preferred actions. This raises a question: how can we extract reward knowledge from such human feedback? The authors' idea is to use the Dynamic Discrete Choice (DDC) model to model and understand human choices. The DDC model is widely used in econometrics and is closely related to reward learning in RLHF. In the DDC model, the human agent is assumed to make decisions in the presence of Gumbel noise, choosing

<font color=Black> $$a_{h} = argmax_{a}\{Q_{h}(s_{h}, a)+\epsilon_{h}(a)\}$$</font>

where $\epsilon_{h}$ is an unobservable random noise and $Q_{h}$ is the value function.

Main Challenges
---
Here, we focus on the challenges of RLHF within the context of the DDC model:
- The agent must first learn the human behavior policy from feedback data.
- Since the reward cannot be directly observed, it must be estimated from the learned behavior policy.
- Insufficient dataset coverage & large state space.

Main Contributions
---
To resolve the problems mentioned above, the paper makes the following contributions:
- To learn the behavior policy, it uses maximum likelihood estimation (MLE) to estimate the state/action value functions with function approximation.
- Using the value functions learned in (1.), it minimizes the Bellman mean squared error (BMSE) through linear regression to recover the unobservable reward from the learned policy.
- To handle insufficient coverage, it follows the principle of pessimism by adding a penalty to the value function during value iteration, and establishes a high-probability suboptimality bound under single-policy coverage.
- It gives the first theoretical guarantee for off-policy, offline RLHF under the DDC model.

Related Work
---
- RLHF
    - There are many ways to incorporate human preferences into an agent's decision-making process; however, the convergence of these algorithms requires sufficient coverage.
    - The majority of previous RLHF research only considers "bandit cases" and has not studied "MDP cases with transition dynamics".
    <font color=Black>
    >The main difference between the "bandit case" and the "MDP case with transition dynamics" is that the bandit case assumes no state transitions and a single-step decision process (immediate reward), whereas the MDP case assumes state transitions and a multi-step decision process (cumulative reward).
    </font>
<br>

- DDC model and CCP algorithm
    - The DDC model is a choice model in econometrics that is closely related to reward learning in Inverse Reinforcement Learning (IRL) and RLHF.
    - The Conditional Choice Probability (CCP) algorithm is highly related to this paper: the learner first estimates choice probabilities from data and then estimates the value function from those choices.
    However, this method suffers from computational burdens in large state spaces, and previous studies focus only on the estimated utility without accounting for the estimation error.
<br>

- Offline RL and Pessimism
    - Many studies indicate that pessimism can deal with distribution shift in offline RL.
    - Some research shows that pessimism is sufficient to eliminate spurious correlation and intrinsic uncertainty during value iteration.

# Preliminaries

Let's begin with some useful notations used in the paper.

><font color=Black>**MDP Model (finite-horizon):**
>$$M\ =\ (S, A, H, \{P_{h}\}_{h\in [H]}, \{r_{h}\}_{h\in [H]})$$ $$where\ H\ is\ the\ horizon\ length$$</font>

><font color=Black>**Policy:** $$\pi\ =\ \{\pi_{h}\}_{h \in [H]}$$</font>

At each step $h$, starting from state $s_{h}$, the agent takes action $a_{h}$ with probability $\pi_{h}(a_{h}|s_{h})$, receives reward $r_{h}(s_{h}, a_{h})$, and transitions to the next state $s'$ with probability $P_{h}(s'|s_{h}, a_{h})$.
<br><br>

><font color=Black>**State Value Function:** $$V^{\pi}_{h}(s)\ =\ \mathbb{E}_{\pi}[\sum^{H}_{t=h}r_{t}(s_{t}, a_{t})|s_{h}=s]$$</font>

><font color=Black>**Action Value Function:** $$Q^{\pi}_{h}(s, a)\ =\ \mathbb{E}_{\pi}[\sum^{H}_{t=h}r_{t}(s_{t}, a_{t})|s_{h}=s, a_{h}=a]$$</font>

Here, the expectation $\mathbb{E}_{\pi}$ is over the randomness of the trajectory induced by $\pi$, i.e. taking action $a_{t}$ ~ $\pi_{t}(\cdot|s_{t})$ and observing $s_{t+1}$ ~ $P_{t}(\cdot|s_{t}, a_{t})$.
<br><br>

><font color=Black>**Bellman Equations:** $$V^{\pi}_{h}(s)\ =\ \langle\pi_{h}(\cdot|s), Q^{\pi}_{h}(s, \cdot)\rangle_{A}$$$$Q^{\pi}_{h}(s, a)\ =\ r_{h}(s, a)\ +\ \mathbb{P}_{h}V^{\pi}_{h+1}(s, a)$$</font>

><font color=Black>**Performance Metric:** $$SubOpt(\pi)\ =\ V^{\pi^{*}}_{1}\ -\ V^{\pi}_{1}$$$$where\ \pi^{*}\ is\ the\ optimal\ policy.$$</font>

Problem Formulation
---
In this paper, we learn from a dataset of human choices under the DDC model, and our goal is to learn the optimal policy $\pi^{*}$. The dataset $D$ contains $n$ trajectories collected by observing a single human's behavior under the DDC model:
$$D = \{D_{h} = \{s^{i}_{h},a^{i}_{h}\}_{i\in[n]}\}_{h\in[H]}$$
We assume that the agent is bounded-rational and makes decisions according to the DDC model; its behavior policy is characterized by
$$\pi_{b, h}(a|s) = \dfrac{exp\left(Q^{\pi_{b}, \gamma}_{h}(s, a)\right)}{\sum_{a'\in A}exp\left(Q^{\pi_{b}, \gamma}_{h}(s, a')\right)}$$
and the discounted Bellman equations:
$$V^{\pi_{b}, \gamma}_{h}(s)\ =\ \langle\pi_{b, h}(\cdot|s), Q^{\pi_{b}, \gamma}_{h}(s, \cdot)\rangle_{A}$$ $$Q^{\pi_{b}, \gamma}_{h}(s, a)\ =\ r_{h}(s, a)\ +\ \gamma\cdot\mathbb{P}_{h}V^{\pi_{b}, \gamma}_{h+1}(s, a)$$

Reward Learning from Human Dynamic Choices
---
Here, we construct an offline framework for learning the reward of the MDP in 2 steps:
1. Estimate the agent's behavior policy $\pi_{b, h}$ from the dataset $D$ by maximum likelihood estimation (MLE). Additionally, recover the state value functions $\{V^{\pi_{b}, \gamma}_{h}\}_{h\in[H]}$ and action value functions $\{Q^{\pi_{b}, \gamma}_{h}\}_{h\in[H]}$ by function approximation.
2. Estimate the reward from the model class $M$ by minimizing the Bellman mean squared error (BMSE) with the value functions learned in step 1.

><font color=Black>**Assumption (Function Approximation Model Class):**
>A model class $M = \{M_{h}\}_{h\in[H]}$ contains functions $f: S \times A \to [0, H]\ \forall\ h \in[H]$.
>And $M_{h}$ is rich enough s.t. $M_{h}$ contains the true model, i.e.
>$r_{h} \in M_{h},\ Q_{h} \in M_{h}$</font>
<br>

For step 1, we use MLE to estimate $\pi_{b, h}$, which corresponds to some $Q^{\pi_{b}, \gamma}_{h}(s, a)$ in a general model class $M_{h}$, at each step. The log-likelihood function is:
$$L_{h}(Q)=\dfrac{1}{n}\sum^{n}_{i=1}log\left(\dfrac{exp(Q(s^{i}_{h}, a^{i}_{h}))}{\sum_{a'\in A}exp(Q(s^{i}_{h},a'))}\right)$$
By maximizing $L_{h}$, we obtain the estimate $\hat{Q}_{h}$:
$$\hat{Q}_{h}=argmax_{Q\in M_{h}}\dfrac{1}{n}\sum^{n}_{i=1}Q(s^{i}_{h}, a^{i}_{h})-log\left(\sum_{a'\in A}exp(Q(s^{i}_{h}, a'))\right)$$
<br>

Then we get the estimated $\pi_{b, h}$ through $\hat{Q}_{h}$:
$$\hat{\pi}_{h}(a_{h}|s_{h})= exp(\hat{Q}_{h}(s_{h}, a_{h}))/\sum_{a'\in A}exp(\hat{Q}_{h}(s_{h}, a'))$$
<br>

And we estimate $V^{\pi_{b}, \gamma}_{h}(s)$ via the Bellman equation:
$$\hat{V}_{h}(s_{h})=\langle\hat{Q}_{h}(s_{h}, \cdot), \hat{\pi}_{h}(\cdot|s_{h})\rangle_{A}$$
<br>

For step 2, we estimate the reward function $r_{h}$ by regressing $\hat{Q}_{h}-\gamma\cdot\hat{V}_{h+1}$ on $(s, a)$; here we focus on the linear model class.

><font color=Black>**Assumption (Linear Model Class):**
>Consider the function class $M_{h} = \{f(\cdot)=\phi(\cdot)^T\theta:S\times A\to \mathbb{R}, \theta\in\Theta \}$ for $h\in[H]$,
>where $\phi\in\mathbb{R}^d$ is a feature vector defined on $S\times A$ and $d$ is the feature dimension.</font>

In the linear model case, the MLE of step 1 becomes a logistic regression:
$$\hat{\theta}_{h}=argmax_{\theta\in\Theta}\dfrac{1}{n}\sum^{n}_{i=1}\phi(s^{i}_{h}, a^{i}_{h})\cdot\theta-log\left(\sum_{a'\in A}exp(\phi(s^{i}_{h}, a')\cdot\theta)\right)$$
<br>

Therefore, the ridge regression for the Bellman MSE is:
$$\hat{w}_{h}=argmin_{w}\left\{\sum^{n}_{i=1}\left(\phi(s^{i}_{h}, a^{i}_{h})\cdot w+\gamma\cdot\hat{V}_{h+1}(s^{i}_{h+1})-\hat{Q}_{h}(s^{i}_{h}, a^{i}_{h})\right)^2+\lambda\Vert w\Vert^2_2\right\}$$
<br>

By setting $\hat{r}_{h}(s_{h}, a_{h})=\phi(s_{h}, a_{h})\cdot\hat{w}_{h}$ and $\rho(\phi\cdot w)=\Vert w\Vert^2_2$, we get:
$$\hat{r}_{h}(s_{h}, a_{h})=argmin_{r\in M_{h}}\{\sum^{n}_{i=1}\left(r(s^{i}_{h}, a^{i}_{h})+\gamma\cdot\hat{V}_{h+1}(s^{i}_{h+1})-\hat{Q}_{h}(s^{i}_{h}, a^{i}_{h})\right)^2+\lambda\rho(r)\}$$
<br>

Eventually, we have the following algorithm:

><font color=Black>**Algorithm 1 (DCPPO: Reward Learning for General Model Class)**
>Require: Dataset $\{D_{h}=\{s^{i}_{h},a^{i}_{h}\}_{i\in[n]}\}_{h\in[H]}$, constant $\lambda>0$, penalty function $\rho(\cdot)$, parameter $\beta$.
>
>$for\ step\ h\ =\ H, ..., 1\ do$
>$\qquad set \quad \hat{Q}_{h}=argmax_{Q\in M_{h}}\dfrac{1}{n}\sum^{n}_{i=1}Q(s^{i}_{h}, a^{i}_{h})-log\left(\sum_{a'\in A}exp(Q(s^{i}_{h}, a'))\right)$
>$\qquad set \quad \hat{\pi}_{h}(a_{h}|s_{h})= exp(\hat{Q}_{h}(s_{h}, a_{h}))/\sum_{a'\in A}exp(\hat{Q}_{h}(s_{h}, a'))$
>$\qquad set \quad \hat{V}_{h}(s_{h})=\langle\hat{Q}_{h}(s_{h}, \cdot), \hat{\pi}_{h}(\cdot|s_{h})\rangle_{A}$
>$\qquad set \quad \hat{r}_{h}(s_{h}, a_{h})=argmin_{r\in M_{h}}\{\sum^{n}_{i=1}\left(r(s^{i}_{h}, a^{i}_{h})+\gamma\cdot\hat{V}_{h+1}(s^{i}_{h+1})-\hat{Q}_{h}(s^{i}_{h}, a^{i}_{h})\right)^2+\lambda\rho(r)\}$
>$end\ for$
>$output: \{\hat{r}_{h}\}_{h\in[H]}$</font>
<br>

Policy Learning from Dynamic Choices via Pessimistic Value Iteration
---
Now, we introduce a pessimistic value iteration algorithm, which subtracts a penalty function $\Gamma_{h}$ from the value function when choosing the best action. By using the penalty $\Gamma_{h}$ for pessimistic planning, we obtain a conservative estimate of the value function. $\Gamma_{h}$ is an uncertainty quantifier for the value functions $\{\tilde{V}_{h}\}_{h\in [H]}$ such that
$$\vert\left(\hat{r}_{h}+\tilde{\mathbb{P}}_{h}\tilde{V}_{h+1}\right)(s, a)-\left(r_{h}+\mathbb{P}_{h}\tilde{V}_{h+1}\right)(s, a)\vert \le \Gamma_{h}(s, a)\quad \forall (s, a)\in S\times A$$
<br>

In short, the penalty $\Gamma_{h}$ accounts for the uncertainty in the environment (i.e. worst-case scenarios). Therefore, we have the second algorithm as follows:

><font color=Black>**Algorithm 2 (DCPPO: Pessimistic Value Iteration)**
>Require: Surrogate rewards $\{\hat{r}_{h}(s_{h}, a_{h})\}_{h\in[H]}$ learned in Algorithm 1, collected dataset $\{(s^{i}_{h},a^{i}_{h})\}_{i\in[n],h\in[H]}$, constant $\lambda>0$, penalty function $\rho(\cdot)$, parameter $\beta$.
>Initialize: Set $\tilde{V}_{H+1}(s_{H+1})=0$.
>
>$for\ step\ h\ =\ H, ..., 1\ do$
>$\qquad set \quad \tilde{\mathbb{P}}_{h}\tilde{V}_{h+1}(s_{h}, a_{h})\ =\ argmin_{f}\sum_{i\in[n]}\left(f(s^{i}_{h}, a^{i}_{h})-\tilde{V}_{h+1}(s^{i}_{h+1})\right)^2+\lambda\cdot\rho(f)$
>$\qquad construct \quad \Gamma_{h}(s_{h}, a_{h})\ based\ on\ D$
>$\qquad set \quad \tilde{Q}_{h}(s_{h}, a_{h}) = min\{\hat{r}_{h}(s_{h}, a_{h})+\tilde{\mathbb{P}}_{h}\tilde{V}_{h+1}(s_{h}, a_{h})-\Gamma_{h}(s_{h}, a_{h}), H-h+1\}_+$
>$\qquad set \quad \tilde{\pi}_{h}(\cdot|s_{h})=argmax_{\pi_{h}}\langle\tilde{Q}_{h}(s_{h}, \cdot), \pi_{h}(\cdot|s_{h})\rangle$
>$\qquad set \quad \tilde{V}_{h}(s_{h})=\langle\tilde{Q}_{h}(s_{h}, \cdot), \tilde{\pi}_{h}(\cdot|s_{h})\rangle_A$
>$end\ for$
>$output: \{\tilde{\pi}_{h}\}_{h\in[H]}$</font>
<br>

DCPPO for RKHS
---
The paper also considers the Reproducing Kernel Hilbert Space (RKHS) case, in which the model classes $M = \{M_h\}_{h\in[H]}$ are assumed to be subsets of an RKHS. (For simplicity we skip the proof of this part; it is quite similar to **Theorem 2** and **Theorem 3** in the next section.) The result again matches standard pessimistic planning in the RKHS case where the reward is observable.
<br>

# Supporting Lemmas and Theoretical Analysis

In this section, we dive into the theorems proposed in the paper and their proofs.

Theorem 1: Policy and Value Functions Recovery from Choice Model
---
To show the efficiency of learning $\pi_{b,h}$, we have the following assumption and theorem:

><font color=Black>**Assumption (Model Identification):**
>We assume that $\exists\ a_0\in A\ s.t.\ Q(s,a_0)=0\ \forall s\in S$.</font>

><font color=Black>**Theorem (Policy and Value Functions Recovery from Choice Model)**
>With **Algorithm 1**, we have:
$$\mathbb{E}_{D_{h}}[\Vert\hat{\pi}_{h}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1]\le O\left(\dfrac{log(H\cdot N(M_{h}, \Vert\cdot\Vert_{\infty}, 1/n)/\delta)}{n}\right)$$$$\mathbb{E}_{D_{h}}[\Vert\hat{Q}_{h}(s_{h}, \cdot)-Q^{\pi_{b}, \gamma}_{h}(s_{h}, \cdot)\Vert^2_1]\le O\left(\dfrac{H^2e^{2H}\cdot\vert A\vert^2\cdot log(H\cdot N(M_{h}, \Vert\cdot\Vert_{\infty}, 1/n)/\delta)}{n}\right)$$
>These hold for every $h\in[H]$ with probability at least $1-\delta$.</font>

This theorem shows that we have an $O(1/n)$ guarantee (up to logarithmic factors) for learning $\pi_{b,h}$ under the identification assumption, and therefore we can also provably recover the value functions.
<br>

### Proof of Theorem 1
This theorem can be regarded as an MLE guarantee under the dataset distribution; we break the proof into 2 steps:
1. An MLE guarantee under the data-generating distribution, i.e. $s_{h}\sim\pi_{b}$, bounding the estimation error in expectation (a code sketch of the MLE step being analyzed appears below).
2. Transferring the expectation bound to a bound under the empirical dataset distribution.
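Before going through the two steps, here is a minimal, self-contained sketch (not the authors' code) of the MLE step that the lemma in Step 1 analyzes, specialized to the linear model class $M_{h}=\{\phi(\cdot)^{T}\theta\}$ at a single step $h$. The feature map `phi`, the toy data-generating process, and all constants are illustrative assumptions rather than quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_actions = 500, 8, 4          # dataset size, feature dimension, |A| (toy values)

def phi(s, a):
    """Hypothetical feature map phi(s, a) in R^d (a stand-in for the paper's phi)."""
    v = np.zeros(d)
    v[: d // 2] = s[: d // 2]
    v[d // 2 :] = s[: d // 2] * (a + 1) / n_actions
    return v

def softmax_policy(theta, s):
    """pi_theta(. | s) proportional to exp(phi(s, a) . theta), as in the DDC model."""
    logits = np.array([phi(s, a) @ theta for a in range(n_actions)])
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Toy behavior data for one step h: states s_h^i and actions a_h^i ~ pi_b(. | s_h^i).
theta_star = rng.normal(size=d)
states = rng.normal(size=(n, d))
actions = np.array([rng.choice(n_actions, p=softmax_policy(theta_star, s)) for s in states])

# Precompute the feature tensor F[i, a] = phi(s_h^i, a).
F = np.array([[phi(s, a) for a in range(n_actions)] for s in states])   # (n, |A|, d)

def grad_neg_log_likelihood(theta):
    """Gradient of -L_h(theta), the negative empirical log-likelihood of Algorithm 1."""
    logits = F @ theta
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)                 # pi_theta(a | s_h^i)
    chosen = F[np.arange(n), actions]                 # phi(s_h^i, a_h^i)
    expected = np.einsum("ia,iad->id", p, F)          # E_{a ~ pi_theta}[phi(s_h^i, a)]
    return -(chosen - expected).mean(axis=0)

# Plain gradient descent on the negative log-likelihood (any optimizer would do).
theta_hat = np.zeros(d)
for _ in range(2000):
    theta_hat -= 0.1 * grad_neg_log_likelihood(theta_hat)

# Recovered objects for this step: Q_hat, pi_hat, V_hat (these feed step 2 of Algorithm 1).
def Q_hat(s, a): return phi(s, a) @ theta_hat
def pi_hat(s):   return softmax_policy(theta_hat, s)
def V_hat(s):    return pi_hat(s) @ np.array([Q_hat(s, a) for a in range(n_actions)])

# Average L1 policy error over the dataset states -- the quantity Theorem 1 bounds.
l1_err = np.mean([np.abs(pi_hat(s) - softmax_policy(theta_star, s)).sum() for s in states])
print(f"average ||pi_hat - pi_b||_1 on the dataset: {l1_err:.3f}")
```

Rerunning the sketch with a larger $n$ shrinks the printed policy error, which is the qualitative behavior that the $O(\log(\cdot)/n)$ bound describes.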
<br>

### Step 1 : MLE distribution bound

><font color=Black>**Lemma (MLE distribution bound):**
>For $\hat{Q}_{h}$ estimated in **Algorithm 1**, we have:
>$$\mathbb{E}_{s_{h}\sim\pi_{b}}[\Vert\hat{\pi}_{h}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1]\le c\cdot\dfrac{log(H\cdot N(M, \Vert\cdot\Vert_{\infty}, 1/n)/\delta)}{n}$$
>with probability at least $1-\delta$.</font>

**Proof.** First, we define:
$$\prod_{h}:=\{\pi_{Q}(a|s)=exp(Q(s, a))/\sum_{a'\in A}exp(Q(s, a'))\ for\ some\ Q\in M_{h}\}\quad \forall\ h\in[H]$$$$N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n) :=\ smallest\ 1/n\ upper\ bracketing\ of\ \prod_{h}$$
<br>

By the MLE guarantee (the MLE maximizes the empirical log-likelihood), we have:
$$\dfrac{1}{n}\sum_{i=1}^{n}log\left(\dfrac{\hat{\pi}_{h}(a^i_h|s^i_h)}{\pi_{b, h}(a^i_h|s^i_h)}\right) \ge 0$$
<br>

Selecting $\bar{\pi}\in N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n)$ and applying Markov's and Boole's inequalities, with probability $\geq 1-\delta$ we have for all $\bar{\pi}$:
$$\sum_{i=1}^{n}\dfrac{1}{2}log\left(\dfrac{\bar{\pi}_{h}(a^i_h|s^i_h)}{\pi_{b, h}(a^i_h|s^i_h)}\right) \leq n\cdot log\left(\mathbb{E}_{\pi_{b}}\left[exp\left(\dfrac{1}{2}log\left(\dfrac{\bar{\pi}(\cdot|\cdot)}{\pi_{b, h}(\cdot|\cdot)}\right)\right)\right]\right) + log\left(\dfrac{N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n)}{\delta}\right)$$ $$\leq n\cdot log\left(\mathbb{E}_{\pi_{b}}\left[\sqrt{\dfrac{\bar{\pi}(\cdot|\cdot)}{\pi_{b, h}(\cdot|\cdot)}}\right]\right) + log\left(\dfrac{N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n)}{\delta}\right)$$ $$= n\cdot log\left(\mathbb{E}_{\pi_{b}}\left[\sum_{a\in A}\sqrt{\bar{\pi}(a|s_{h})\cdot\pi_{b, h}(a|s_{h})}\right]\right) + log\left(\dfrac{N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n)}{\delta}\right)$$
<br>

Leveraging $log\ x\leq x-1$, we have:
$$1-\mathbb{E}_{\pi_{b}}\left[\sum_{a\in A}\sqrt{\bar{\pi}(a|s_{h})\cdot\pi_{b, h}(a|s_{h})}\right] \leq \dfrac{1}{n}log\left(\dfrac{N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n)}{\delta}\right)$$
<br>

Thus we bound the Hellinger distance between $\pi_{b, h}$ and $\bar{\pi}$:
$$\mathbb{E}_{s_{h}\sim\pi_{b}}[\sum_{a\in A}(\bar{\pi}(a|s_{h})^{1/2} - \pi_{b,h}(a|s_{h})^{1/2})^2] \leq \dfrac{2}{n}log\left(\dfrac{N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n)}{\delta}\right) + 1/n \qquad\qquad\qquad (1)$$
<br>

And by the fact that $\bar{\pi}$ is an upper bracket of $\prod_{h}$:
$$\mathbb{E}_{s_{h}\sim\pi_{b}}[\sum_{a\in A}(\bar{\pi}(a|s_{h})^{1/2} + \pi_{b,h}(a|s_{h})^{1/2})^2] \leq \dfrac{2}{n} + 4 \qquad\qquad\qquad (2)$$
<br>

Combining (1) and (2), by the Cauchy-Schwarz inequality, we have:
$$\mathbb{E}_{s_{h}\sim\pi_{b}}[\Vert\bar{\pi}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1] \leq \dfrac{15}{n}\cdot log\left(\dfrac{N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n)}{\delta}\right)$$
<br>

Also, by the definition of the norm we have:
$$\Vert\bar{\pi}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1 - \Vert\hat{\pi}_{h}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1 \leq (4+\dfrac{1}{n})\cdot\dfrac{1}{n} \leq \dfrac{5}{n}$$
<br>

Eventually, we have:
$$\mathbb{E}_{s_{h}\sim\pi_{b}}[\Vert\hat{\pi}_{h}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1] \leq \dfrac{20}{n}\cdot log\left(\dfrac{N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n)}{\delta}\right)$$
<br>

Since $N(M_{h}, \Vert\cdot\Vert_{\infty}, 1/n)$ is the covering number of the model class $M_{h}$:
$$N_{[]}(\prod_{h}, \Vert\cdot\Vert_{\infty}, 1/n) \leq N(M_{h}, \Vert\cdot\Vert_{\infty}, 1/4n)$$
<br>

Now, we finally prove that:
$$\mathbb{E}_{s_{h}\sim\pi_{b}}[\Vert\hat{\pi}_{h}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1]\le
O\left(\dfrac{log(H\cdot N(M, \Vert\cdot\Vert_{\infty}, 1/n)/\delta)}{n}\right)$$

### Step 2 : Expectation bound on dataset

Let
$$A(\hat{\pi}_h) := \vert \mathbb{E}_{s_{h}\sim\pi_{b}}[\Vert\hat{\pi}_{h}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1] - \mathbb{E}_{D_{h}}[\Vert\hat{\pi}_{h}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1]\vert$$
<br>

With probability $1-\delta$, by Bernstein's inequality, we have:
$$A(\hat{\pi}_h) \leq O\left(\dfrac{log(H\cdot N(M, \Vert\cdot\Vert_{\infty}, 1/n)/\delta)}{n}\right)$$
<br>

From the lemma proved in Step 1, with probability $1-\delta$ we have:
$$\mathbb{E}_{s_{h}\sim\pi_{b}}[\Vert\hat{\pi}_{h}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1]\le O\left(\dfrac{log(H\cdot N(M, \Vert\cdot\Vert_{\infty}, 1/n)/\delta)}{n}\right)$$
<br>

Therefore, with probability $1-\delta$, we obtain:
$$\mathbb{E}_{D_{h}}[\Vert\hat{\pi}_{h}(\cdot|s_{h})-\pi_{b,h}(\cdot|s_{h})\Vert^2_1]\le O\left(\dfrac{log(H\cdot N(M_{h}, \Vert\cdot\Vert_{\infty}, 1/n)/\delta)}{n}\right)$$
<br>

By the definition of $\hat{\pi}_{h}$ and $\pi_{b,h}$, we have:
$$\vert \hat{Q}_h(s, a) - Q^{\pi_b, \gamma}_h(s, a)\vert = \vert log\left(\dfrac{\hat{\pi}_h(a|s)}{\pi_{b, h}(a|s)}\right) - log\left(\dfrac{\hat{\pi}_h(a_0|s)}{\pi_{b, h}(a_0|s)}\right)\vert$$
<br>

By the fact that $ln(x/y) \leq x/y - 1$ and $\pi_h(a|s)\in[e^{-H},1]$, we have:
$$\vert \hat{Q}_h(s, a) - Q^{\pi_b, \gamma}_h(s, a)\vert \leq e^H\cdot\left(\vert\pi_{b,h}(a|s)-\hat{\pi}_h(a|s)\vert+\vert\pi_{b,h}(a_0|s)-\hat{\pi}_h(a_0|s)\vert\right)$$
<br>

Taking the summation over $a\in A$, we prove:
$$\mathbb{E}_{D_{h}}[\Vert\hat{Q}_{h}(s_{h}, \cdot)-Q^{\pi_{b}, \gamma}_{h}(s_{h}, \cdot)\Vert^2_1]\le O\left(\dfrac{H^2e^{2H}\cdot\vert A\vert^2\cdot log(H\cdot N(M_{h}, \Vert\cdot\Vert_{\infty}, 1/n)/\delta)}{n}\right)$$
<br>

Theorem 2: Reward Estimation for Linear Model MDP
---
To show that the estimated reward function indeed approximates the true reward function, we have the following assumptions and theorem:

><font color=Black>**Assumption (Function Approximation Model Class):**
>A model class $M = \{M_{h}\}_{h\in[H]}$ contains functions $f: S \times A \to [0, H]\ \forall\ h \in[H]$.
>And $M_{h}$ is rich enough s.t. $M_{h}$ contains the true model, i.e. $r_{h} \in M_{h},\ Q_{h} \in M_{h}$</font>

><font color=Black>**Assumption (Reward Conditions):**
>1. $\Vert\theta\Vert_2\leq H\sqrt{d}\quad \forall\ \theta\in \Theta$, and $\Vert w_h\Vert_2 \leq \sqrt{d}$ for the reward $r_h=\phi\cdot w_h$
>2. $\Vert\phi(s_h, a_h)\Vert_2 \leq 1\quad \forall\ (s_h, a_h) \in S\times A$
>3. $\exists\ constant\ c\ s.t.\ log\ N(\Theta, \Vert\cdot\Vert_{\infty}, 1/n) \leq c\cdot d\log\ n\quad \forall\ n>0$</font>

><font color=Black>**Theorem (Reward Estimation for Linear Model MDP)**
>Under the assumptions above, the following holds with probability $1-\delta$ for all $(s, a)\in S\times A$ and $\lambda > 0$:
$$\vert r_h(s,a)-\hat{r}_h(s,a) \vert \leq \Vert \phi(s,a)\Vert_{(\Lambda+\lambda I)^{-1}}\cdot O\left(\sqrt{\lambda d}+(1+\gamma)\cdot He^H\cdot \vert A\vert\cdot d\sqrt{log(nH/\lambda\delta)}\right)$$</font>

By this theorem, the reward estimation error is bounded by the elliptical term $\Vert \phi(s,a)\Vert_{(\Lambda+\lambda I)^{-1}}$ times the normalizing term $O\left(He^H\cdot \vert A\vert\cdot d\sqrt{log(nH/\lambda\delta)}\right)$. The authors also mention that the exponential factor $O(e^H\vert A\vert)$, which comes from estimating $Q^{\pi_b, \gamma}_h$, may possibly be improved.
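To make the quantities in this bound concrete, here is a minimal sketch (under the linear model assumption, with toy placeholder arrays standing in for the $\hat{Q}_{h}$ and $\hat{V}_{h+1}$ produced by step 1) of the closed-form ridge solution $\hat{w}_{h}$ used in step 2 of Algorithm 1, together with the elliptical term $\Vert\phi(s,a)\Vert_{(\Lambda_{h}+\lambda I)^{-1}}$ that scales the error. The arrays and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 8                 # toy dataset size and feature dimension
gamma, lam = 0.9, 1.0         # discount factor and ridge parameter lambda

Phi        = rng.normal(size=(n, d)) / np.sqrt(d)   # rows: phi(s_h^i, a_h^i)
Q_hat      = rng.uniform(0, 1, size=n)              # placeholder \hat{Q}_h(s_h^i, a_h^i)
V_hat_next = rng.uniform(0, 1, size=n)              # placeholder \hat{V}_{h+1}(s_{h+1}^i)

# Closed-form ridge solution of the BMSE objective:
#   w_hat = (Lambda_h + lam*I)^{-1} sum_i phi_i * (Q_hat_i - gamma * V_hat_next_i),
#   with Lambda_h = sum_i phi_i phi_i^T.
Lambda_h = Phi.T @ Phi
A = Lambda_h + lam * np.eye(d)
w_hat = np.linalg.solve(A, Phi.T @ (Q_hat - gamma * V_hat_next))

def r_hat(phi_sa):
    """Surrogate reward \hat{r}_h(s, a) = phi(s, a) . w_hat."""
    return phi_sa @ w_hat

def elliptical_width(phi_sa):
    """||phi(s, a)||_{(Lambda_h + lam*I)^{-1}}: the data-coverage term in Theorem 2."""
    return float(np.sqrt(phi_sa @ np.linalg.solve(A, phi_sa)))

phi_query = rng.normal(size=d) / np.sqrt(d)          # a query feature phi(s, a)
print(r_hat(phi_query), elliptical_width(phi_query))
```

The elliptical width is small in feature directions that are well covered by the dataset and large in poorly covered directions, which is exactly how the theorem localizes the reward estimation error.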
### Proof of Theorem 2

Recall the ridge regression for the BMSE:
$$\hat{w}_{h}=argmin_{w}\left\{\sum^{n}_{i=1}\left(\phi(s^{i}_{h}, a^{i}_{h})\cdot w+\gamma\cdot\hat{V}_{h+1}(s^{i}_{h+1})-\hat{Q}_{h}(s^{i}_{h}, a^{i}_{h})\right)^2+\lambda\Vert w\Vert^2_2\right\}$$
Its optimization has a closed-form solution:
$$\hat{w}_{h}=(\Lambda_h+\lambda I)^{-1}\left(\sum^n_{i=1}\phi(s^i_h,a^i_h)(\hat{Q}_h(s^i_h,a^i_h)-\gamma\cdot\hat{V}_{h+1}(s^i_{h+1}))\right)$$$$where\ \Lambda_h=\sum^n_{i=1}\phi(s^i_h,a^i_h)\phi(s^i_h,a^i_h)^T$$
By assumption, there exists $w_h\in\mathbb{R}^d$ s.t. $r_h(s,a)=\phi(s,a)\cdot w_h$. By the definition of $\hat{r}_h$ in **Algorithm 1**, we have:
$$\vert r_h(s,a) - \hat{r}_h(s,a)\vert = \vert\phi(s,a)(w_h-\hat{w}_h)\vert$$ $$= \vert\phi(s,a)(\Lambda_h+\lambda I)^{-1}\left(\lambda\cdot w_h + \sum^n_{i=1}\phi(s^i_h,a^i_h)(\hat{Q}_h(s^i_h,a^i_h)-\gamma\cdot\hat{V}_{h+1}(s^i_{h+1})-r_h(s^i_h,a^i_h))\right)\vert$$ $$\leq\underbrace{\lambda\cdot\vert\phi(s,a)(\Lambda_h+\lambda I)^{-1}\cdot w_h\vert}_{(1)}\ +\ \underbrace{\vert\phi(s,a)(\Lambda_h+\lambda I)^{-1}\left(\sum^n_{i=1}\phi(s^i_h,a^i_h)(\hat{Q}_h(s^i_h,a^i_h)-\gamma\cdot\hat{V}_{h+1}(s^i_{h+1})-r_h(s^i_h,a^i_h))\right)\vert}_{(2)}$$
For the elliptical term (1), by the Cauchy-Schwarz inequality and $\Vert w_h\Vert_2\leq\sqrt{d}$, we have:
$$(1)\leq\lambda\cdot\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}\cdot\Vert w_h\Vert_{(\Lambda_h+\lambda I)^{-1}}$$ $$\leq\sqrt{\lambda d}\cdot\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}$$
Also, for the normalizing term (2), we have:
$$(2)\leq\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}\cdot\Vert\sum^n_{i=1}\phi(s^i_h,a^i_h)(\hat{Q}_h(s^i_h,a^i_h)-\gamma\cdot\hat{V}_{h+1}(s^i_{h+1})-r_h(s^i_h,a^i_h))\Vert_{(\Lambda_h+\lambda I)^{-1}}$$ $$=\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}\cdot\Vert\sum^n_{i=1}\phi(s^i_h,a^i_h)\left((\hat{Q}_h(s^i_h, a^i_h)-Q^{\pi_b, \gamma}_h(s^i_h, a^i_h))-\gamma\cdot(\hat{V}_{h+1}(s^i_{h+1})-\mathbb{P}_hV^{\pi_b, \gamma}_{h+1}(s^i_h, a^i_h))\right)\Vert_{(\Lambda_h+\lambda I)^{-1}}$$ $$\leq\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}\cdot\left( \underbrace{\Vert\sum^n_{i=1}\phi(s^i_h,a^i_h)\left(\hat{Q}_h(s^i_h, a^i_h)-Q^{\pi_b, \gamma}_h(s^i_h, a^i_h)\right)\Vert_{(\Lambda_h+\lambda I)^{-1}}}_{(3)}\ +\ \underbrace{\gamma\cdot\Vert\sum^n_{i=1}\phi(s^i_h,a^i_h)\left( \hat{V}_{h+1}(s^i_{h+1}) - V^{\pi_b, \gamma}_{h+1}(s^i_{h+1})\right)\Vert_{(\Lambda_h+\lambda I)^{-1}}}_{(4)}\ +\ \underbrace{\gamma\cdot\Vert\sum^n_{i=1}\phi(s^i_h,a^i_h)\left( V^{\pi_b, \gamma}_{h+1}(s^i_{h+1}) - \mathbb{P}_hV^{\pi_b, \gamma}_{h+1}(s^i_h, a^i_h)\right)\Vert_{(\Lambda_h+\lambda I)^{-1}}}_{(5)}\right)$$
Bounding (3) and (4) yields a term of order:
$$\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}\cdot\left(\sqrt{d}He^H\cdot\vert A\vert\cdot\sqrt{log(N(\Theta, \Vert\cdot\Vert_{\infty}, 1/n)/\delta)}\right)$$
and bounding (5) yields a term of order:
$$\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}\cdot O\left((1+\gamma)\cdot dHe^H\cdot\sqrt{\vert A\vert log(nH/\lambda\delta)}\right)$$
Therefore, combining these two bounds we have:
$$(2)\leq\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}\cdot O\left((1+\gamma)\cdot\vert A\vert\cdot dHe^H\cdot\sqrt{log(nH/\lambda\delta)}\right)$$
Eventually, we prove:
$$\vert r_h(s,a)-\hat{r}_h(s,a) \vert \leq \Vert \phi(s,a)\Vert_{(\Lambda+\lambda I)^{-1}}\cdot O\left(\sqrt{\lambda d}+(1+\gamma)\cdot He^H\cdot \vert A\vert\cdot d\sqrt{log(nH/\lambda\delta)}\right)$$
<br>

Theorem 3: Suboptimality Gap for DCPPO
---
To prove that the uncertainty quantifier $\Gamma_h$ in
**Algorithm 2** with pessimistic value iteration can achieve an $O(n^{-1/2})$ suboptimality gap without an observable reward signal, we have the following assumptions and theorem:

><font color=Black>**Assumption (Model Identification):**
>We assume that $\exists\ a_0\in A\ s.t.\ Q(s,a_0)=0\ \forall s\in S$.</font>

><font color=Black>**Assumption (Reward Conditions):**
>1. $\Vert\theta\Vert_2\leq H\sqrt{d}\quad \forall\ \theta\in \Theta$, and $\Vert w_h\Vert_2 \leq \sqrt{d}$ for the reward $r_h=\phi\cdot w_h$
>2. $\Vert\phi(s_h, a_h)\Vert_2 \leq 1\quad \forall\ (s_h, a_h) \in S\times A$
>3. $\exists\ constant\ c$ s.t. $log\ N(\Theta, \Vert\cdot\Vert_{\infty}, 1/n) \leq c\cdot d\log\ n\quad \forall\ n>0$</font>

><font color=Black>**Assumption (Linear MDP):**
>For the underlying MDP, assume that $\forall\ V_{h+1}:S\to[0, H-h]$, $\exists\ u_h\in \mathbb{R}^d$ s.t. $\forall\ (s,a)\in S\times A$: $$\mathbb{P}_hV_{h+1}(s,a)=\phi(s,a)\cdot u_h$$
>And assume that $\Vert u_h\Vert\leq(H-h+1)\cdot\sqrt{d}$ for all $h\in[H]$.</font>

><font color=Black>**Assumption (Single-Policy Coverage):**
>Suppose there exists an absolute constant $c^\dagger >0$ s.t. with probability at least $1-\delta/2$: $$\Lambda_h \geq c^\dagger\cdot n\cdot\mathbb{E}_{\pi^*}[\phi(s_h, a_h)\phi(s_h, a_h)^T]$$</font>

><font color=Black>**Theorem (Suboptimality Gap for DCPPO)**
>Under the assumptions above, with $\lambda=1$, $\beta=O(He^H\cdot\vert A\vert\cdot d \sqrt{log(nH/\delta)})$, and $$\Gamma_h(s,a)=\beta\cdot(\phi(s,a)^T(\Lambda_h+\lambda I)^{-1}\phi(s,a))^{1/2},$$ the bound $$SubOpt(\{\hat{\pi}_h\}_{h\in[H]}) \leq c\cdot(1+\gamma)\cdot\vert A\vert d^{3/2}H^2e^Hn^{-1/2}\sqrt{\xi}$$ holds for **Algorithm 2** with probability at least $1-\delta$.</font>

This theorem shows that, in the face of insufficient dataset coverage, DCPPO nearly matches the standard result for pessimistic offline RL with observable rewards. In conclusion, **Algorithms 1 & 2** almost match the suboptimality gap of standard pessimistic planning with an observable reward.

### Proof of Theorem 3

Here, we leverage **Theorem C.1** in [Is Pessimism Provably Efficient for Offline RL?](https://arxiv.org/abs/2012.15085).

><font color=Black>**Theorem C.1**
>Suppose $\{ \Gamma_h \}^H_{h=1}$ in **Algorithm 2** is an uncertainty quantifier; then the suboptimality of **Algorithm 2** satisfies: $$SubOpt(\{\hat{\pi}_h\}_{h\in[H]}) \leq 2\cdot\sum^H_{h=1}\mathbb{E}_{\pi^*}[\Gamma_h(s_h, a_h)]$$
>Here, $\mathbb{E}_{\pi^*}$ is taken with respect to the trajectory induced by $\pi^*$ in the underlying MDP.</font>

By this theorem, the proof of **Theorem 3** can be divided into 2 steps:
1. Prove that $\Gamma_h$ with the above $\beta$ is an uncertainty quantifier, with probability at least $1-\delta/2$ (a code sketch of the pessimistic value iteration that uses this $\Gamma_h$ is given below).
2. Prove that, under the single-policy coverage assumption, $\sum^H_{h=1}\mathbb{E}_{\pi^*}[\Gamma_h(s_h, a_h)]$ can be bounded with probability at least $1-\delta/2$.
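As a concrete picture of what the two steps analyze, here is a minimal sketch (toy finite state/action space; the feature table, surrogate rewards, and dataset are hypothetical placeholders, not the paper's experiments) of Algorithm 2's backward pessimistic value iteration with the linear-feature penalty $\Gamma_{h}(s,a)=\beta\cdot\Vert\phi(s,a)\Vert_{(\Lambda_{h}+\lambda I)^{-1}}$ specified in Theorem 3.

```python
import numpy as np

rng = np.random.default_rng(2)
H, n_states, n_actions, d = 4, 6, 3, 5
n, lam, beta = 400, 1.0, 1.0            # toy dataset size, ridge lambda, penalty scale beta

phi_table = rng.normal(size=(n_states, n_actions, d)) / np.sqrt(d)   # phi(s, a)
r_hat = rng.uniform(0, 1, size=(H, n_states, n_actions))             # surrogate rewards from Algorithm 1 (placeholder)

# Offline dataset: for each step h, tuples (s, a, s') generated by the behavior policy.
data = [[(rng.integers(n_states), rng.integers(n_actions), rng.integers(n_states))
         for _ in range(n)] for _ in range(H)]

V_tilde = np.zeros((H + 1, n_states))           # V_tilde[H] = 0 (terminal)
pi_tilde = np.zeros((H, n_states), dtype=int)   # greedy policy per (step, state)

for h in reversed(range(H)):                    # backward induction h = H, ..., 1
    Phi = np.stack([phi_table[s, a] for s, a, _ in data[h]])            # n x d
    y = np.array([V_tilde[h + 1, s_next] for _, _, s_next in data[h]])  # regression targets
    A = Phi.T @ Phi + lam * np.eye(d)            # Lambda_h + lam * I
    u_hat = np.linalg.solve(A, Phi.T @ y)        # ridge estimate of P_h V_tilde_{h+1}

    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            p = phi_table[s, a]
            pv = p @ u_hat                                         # estimated (P_h V_tilde_{h+1})(s, a)
            gamma_pen = beta * np.sqrt(p @ np.linalg.solve(A, p))  # penalty Gamma_h(s, a)
            # truncate to [0, H - h + 1] in the paper's 1-indexed notation; h is 0-indexed here
            q[a] = np.clip(r_hat[h, s, a] + pv - gamma_pen, 0.0, H - h)
        pi_tilde[h, s] = int(np.argmax(q))       # greedy w.r.t. the pessimistic Q
        V_tilde[h, s] = q[pi_tilde[h, s]]

print(pi_tilde)   # pessimistic policy for each (step, state)
```

The penalty is larger for $(s, a)$ pairs whose features are poorly covered by the dataset, so the learned policy steers away from them; Step 2 below shows why, under single-policy coverage of $\pi^{*}$, this conservatism costs only $O(n^{-1/2})$.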
<br>

### Step 1 : Uncertainty quantifier

With $\beta=O(He^H\cdot\vert A\vert\cdot d \sqrt{log(nH/\delta)})$, we have:
$$\vert\left(\hat{r}_{h}+\tilde{\mathbb{P}}_{h}\tilde{V}_{h+1}\right)(s, a)-\left(r_{h}+\mathbb{P}_{h}\tilde{V}_{h+1}\right)(s, a)\vert \leq \underbrace{\vert\hat{r}_h(s,a) - r_h(s,a)\vert}_{(1)} + \underbrace{\vert \tilde{\mathbb{P}}_{h}\tilde{V}_{h+1}(s,a) - \mathbb{P}_{h}\tilde{V}_{h+1}(s,a)\vert}_{(2)}$$
Term (1) is bounded by the guarantee from the proof of **Theorem 2**:
$$(1)\leq\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}\cdot O\left((1+\gamma)\cdot\vert A\vert\cdot dHe^H\cdot\sqrt{log(nH/\lambda\delta)}\right)$$
<br>

For (2), we have:
$$(2)\leq\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}\cdot O\left(dH\cdot log(ndH/\delta) + H\sqrt{d}\right)$$
<br>

Combining (1) and (2), with probability at least $1-\delta$, we have:
$$\vert\hat{r}_h(s,a) - r_h(s,a)\vert + \vert \tilde{\mathbb{P}}_{h}\tilde{V}_{h+1}(s,a) - \mathbb{P}_{h}\tilde{V}_{h+1}(s,a)\vert \leq \beta\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}$$
<br>

Therefore, we have proved that $\Gamma_h(s,a)=\beta\Vert\phi(s,a)\Vert_{(\Lambda_h+\lambda I)^{-1}}$ is an uncertainty quantifier.
<br>

### Step 2 : Bounding the total expectation

With the uncertainty quantifier $\Gamma_h$ from Step 1, the total expectation can be written as:
$$\sum^H_{h=1}\mathbb{E}_{\pi^*}[\Gamma_h(s_h, a_h)] = \sum^H_{h=1}\beta\cdot\mathbb{E}_{\pi^*}[\Vert\phi(s_h,a_h)\Vert_{(\Lambda_h+\lambda I)^{-1}}]$$
<br>

By the Cauchy-Schwarz inequality, we have:
$$\mathbb{E}_{\pi^*}[\Vert\phi(s_h,a_h)\Vert_{(\Lambda_h+\lambda I)^{-1}}] \leq \sqrt{Tr\left(\mathbb{E}_{\pi^*}[\phi(s_h, a_h)\phi(s_h, a_h)^T](\Lambda_h+\lambda I)^{-1}\right)}$$
<br>

Conditioning on the result in Step 1 and the single-policy coverage assumption, we have:
$$SubOpt(\{\hat{\pi}_h\}_{h\in[H]}) \leq 2\beta \sum^H_{h=1}\sqrt{\sum^d_{j=1}\dfrac{\lambda_{h,j}}{1+c^\dagger\cdot n\cdot\lambda_{h,j}}}$$
where $\{\lambda_{h,j}\}^d_{j=1}$ are the eigenvalues of $\mathbb{E}_{\pi^*}[\phi(s_h, a_h)\phi(s_h, a_h)^T]$.
<br>

And by Jensen's inequality, we have:
$$SubOpt(\{\hat{\pi}_h\}_{h\in[H]}) \leq 2\beta \sum^H_{h=1}\sqrt{\sum^d_{j=1}\dfrac{1}{1+c^\dagger\cdot n}} \leq c'\cdot(1+\gamma)\cdot\vert A\vert d^{3/2}H^2e^Hn^{-1/2}\sqrt{\xi}$$
where $\xi=log(dHn/\delta)$.
<br>

Eventually, by bounding $\sum^H_{h=1}\mathbb{E}_{\pi^*}[\Gamma_h(s_h, a_h)]$, we have proved the suboptimality gap.
<br>

# Discussions and Takeaways

In this summary note, we presented:
1. A brief overview of DCPPO as proposed by the paper.
2. Proofs that the algorithm is sample-efficient in the linear MDP case and the RKHS case.

However, applications of DCPPO are not covered in this note. In my opinion, DCPPO could be powerful for applications involving complex, sequential decision-making, e.g. treatment planning in healthcare, which involves many sequential treatment decisions. Moreover, as a student in a department of information management and finance who has studied numerous investment and risk-management policies, I find this method quite promising in the financial field as well.
My insight is that this algorithm could provide "investment advising" based on a person's investing preferences, which matches the concept of pessimism (when people invest, they tend to act conservatively).

# References

This blog post clearly explains the differences between on-policy RL, off-policy RL, and offline RL:
[Off-policy vs On-Policy vs Offline Reinforcement Learning Demystified!](https://kowshikchilamkurthy.medium.com/off-policy-vs-on-policy-vs-offline-reinforcement-learning-demystified-f7f87e275b48)

For beginners who want to fully understand how offline RL works, this video helps a lot:
[Offline Reinforcement Learning](https://www.youtube.com/watch?v=NV4oSWe1H9o)

Blog introducing IRL (in Chinese):
[深入淺出Inverse Reinforcement Learning 第一課](https://medium.com/@houyitong/%E6%B7%B1%E5%85%A5%E6%B7%BA%E5%87%BAinverse-reinforcement-learning-%E7%AC%AC%E4%B8%80%E8%AA%B2-995fa7fafa60)

Introduction to MLE:
[Maximum Likelihood Estimation](https://purelyvivid.github.io/2019/08/13/MLE/)

About CCP:
[Conditional Choice Probability (CCP) estimation, Joan Llull](https://joanllull.github.io/Structural_micro/Structural_micro_4.pdf)

About Uncertainty Quantification:
[不确定性量化(Uncertainty Quantification):前言](https://blog.csdn.net/waitingwinter/article/details/109319707)

Other papers:
[Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons](https://arxiv.org/abs/2301.11270)
[Is Pessimism Provably Efficient for Offline RL?](https://arxiv.org/abs/2012.15085)