# Model-based Reinforcement Learning and the Eluder Dimension
https://arxiv.org/abs/1406.1853
## Problem formulation
- We consider the problem of learning to optimize a random finite horizon MDP $M=$ $\left(\mathcal{S}, \mathcal{A}, R^M, P^M, \tau, \rho\right)$ in repeated finite episodes of interaction.
- $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R^M(s, a)$ is the reward distribution over $\mathbb{R}$, and $P^M(\cdot \mid s, a)$ is the transition distribution over $\mathcal{S}$ when selecting action $a$ in state $s$; $\tau$ is the time horizon, and $\rho$ is the initial state distribution.
- All random variables we will consider are on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
- A policy $\mu$ is a function mapping each state $s \in \mathcal{S}$ and $i=1, \ldots, \tau$ to an action $a \in \mathcal{A}$.
- We define $\bar{r}^M(s, a):=\mathbb{E}\left[r \mid r \sim R^M(s, a)\right]$ to be the mean reward; in the expectation below, the subscripts indicate that $a_j=\mu\left(s_j, j\right)$ and $s_{j+1} \sim P^M\left(\cdot \mid s_j, a_j\right)$ for $j=i, \ldots, \tau$.
- For each MDP $M$ and policy $\mu$, we define a value function $V$:
$$
V_{\mu, i}^M(s):=\mathbb{E}_{M, \mu}\left[\sum_{j=i}^\tau \bar{r}^M\left(s_j, a_j\right) \mid s_i=s\right]
$$
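Because the horizon is finite, $V_{\mu,i}^M$ can be computed exactly by backward induction from $i=\tau$ down to $i=1$. Below is a minimal sketch for a finite (tabular) MDP; the array encodings of $P^M$, $\bar{r}^M$, and $\mu$ are assumptions made for the example, not notation from the paper.

```python
import numpy as np

def evaluate_policy(P, r_bar, mu, tau):
    """Backward induction for V_{mu,i}^M on a finite MDP.

    P     : array (S, A, S), P[s, a, s2] = P^M(s2 | s, a)
    r_bar : array (S, A),    mean rewards r_bar^M(s, a)
    mu    : array (tau, S) of action indices, mu[i, s] = mu(s, i+1)
    Returns V with V[i, s] = V_{mu, i+1}^M(s) (rows are 0-indexed timesteps).
    """
    S = r_bar.shape[0]
    V = np.zeros((tau + 1, S))              # V[tau] = 0: no reward beyond the horizon
    for i in range(tau - 1, -1, -1):
        a = mu[i]                           # action taken in each state at this step
        Pa = P[np.arange(S), a]             # Pa[s, :] = P^M(. | s, mu(s, i+1))
        V[i] = r_bar[np.arange(S), a] + Pa @ V[i + 1]
    return V[:tau]
```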
- A policy $\mu$ is said to be optimal for MDP $M$ if $V_{\mu, i}^M(s)=\max _{\mu^{\prime}} V_{\mu^{\prime}, i}^M(s)$ for all $s \in \mathcal{S}$ and $i=1, \ldots, \tau$.
- We will associate with each MDP $M$ a policy $\mu^M$ that is optimal for $M$.
- For any distribution $\Phi$ over $\mathcal{S}$, we define the one step future value function $U$ to be the expected value of the optimal policy with the next state distributed according to $\Phi$.
$$
U_i^M(\Phi):=\mathbb{E}_{M, \mu^M}\left[V_{\mu^M, i+1}^M(s) \mid s \sim \Phi\right]
$$
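In the same finite-state setting, $U_i^M(\Phi)$ is just the $\Phi$-weighted average of the optimal values at the next step. A tiny sketch; here `V_next` is assumed to hold $V_{\mu^M,i+1}^M(\cdot)$, e.g. produced by a planner for $M$.

```python
import numpy as np

def future_value(Phi, V_next):
    """U_i^M(Phi) = E_{s ~ Phi}[ V_{mu^M, i+1}^M(s) ] over a finite state space.

    Phi    : array (S,), a distribution over the next state
    V_next : array (S,), V_next[s] = V_{mu^M, i+1}^M(s)
    """
    return float(Phi @ V_next)
```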
- One natural regularity condition for learning is that the future values of similar distributions should be similar.
- We formalize this idea through a Lipschitz condition on the future value function with respect to the means of these state distributions.
- We write $\mathcal{E}(\Phi):=\mathbb{E}[s \mid s \sim \Phi] \in \mathcal{S}$ for the mean of a distribution $\Phi$ and express the Lipschitz continuity for $U_i^M$ with respect to the $\|\cdot\|_2$-norm of the mean:
$$
\left|U_i^M(\Phi)-U_i^M(\tilde{\Phi})\right| \leq K_i^M(\mathcal{D})\,\|\mathcal{E}(\Phi)-\mathcal{E}(\tilde{\Phi})\|_2 \quad \text{for all } \Phi, \tilde{\Phi} \in \mathcal{D}.
$$
- We define $K^M(\mathcal{D}):=\max _i K_i^M(\mathcal{D})$ to be a global Lipschitz constant for the future value function with state distributions from $\mathcal{D}$ (see the sketch below).
- We write $K^M:=K^M(\mathcal{D}(M))$, where $\mathcal{D}(M):=\left\{P^M(\cdot \mid s, a) \mid s \in \mathcal{S}, a \in \mathcal{A}\right\}$ is the set of all possible one-step state distributions under the MDP $M$.
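With states embedded as vectors, $K_i^M(\mathcal{D})$ can be estimated over a finite collection of one-step distributions by taking the largest ratio appearing in the Lipschitz condition above (and $K^M(\mathcal{D})$ by maximizing over $i$). A rough sketch; the collection `dists`, the state embedding `states`, and `V_next` (as in the previous snippet) are assumptions for the example:

```python
import itertools
import numpy as np

def estimate_lipschitz(dists, states, V_next):
    """max over pairs of |U_i(Phi) - U_i(Phi~)| / ||E(Phi) - E(Phi~)||_2.

    dists  : array (m, S) of probability vectors, a finite stand-in for D
    states : array (S, d), states[s] = vector embedding of state s
    V_next : array (S,), optimal values at step i+1
    """
    K = 0.0
    for Phi, Phi2 in itertools.combinations(dists, 2):
        gap = abs(Phi @ V_next - Phi2 @ V_next)                  # |U_i^M(Phi) - U_i^M(Phi~)|
        mean_gap = np.linalg.norm(Phi @ states - Phi2 @ states)  # ||E(Phi) - E(Phi~)||_2
        if mean_gap > 0:
            K = max(K, gap / mean_gap)
    return K
```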
- The reinforcement learning agent interacts with the MDP over episodes that begin at times $t_k=(k-1) \tau+1, k=1,2, \ldots$
- Let $H_t=\left(s_1, a_1, r_1, \ldots, s_{t-1}, a_{t-1}, r_{t-1}\right)$ denote the history of observations made prior to time $t$.
- A reinforcement learning algorithm is a deterministic sequence $\left\{\pi_k \mid k=1,2, \ldots\right\}$ of functions, each mapping $H_{t_k}$ to a probability distribution $\pi_k\left(H_{t_k}\right)$ over policies which the agent will employ during the $k$ th episode.
- $\Delta_k$ denotes the regret over the $k$th episode, defined with respect to the true MDP $M^*$ by
$$
\Delta_k:=\int_{\mathcal{S}} \rho(s)\left(V_{\mu^*, 1}^{M^*}-V_{\mu_k, 1}^{M^*}\right)(s)\,ds,
$$
where $\mu^*=\mu^{M^*}$ and $\mu_k$ is the policy the agent follows during episode $k$.
- Note that regret is not deterministic since it can depend on the random MDP $M^*$, the algorithm's internal random sampling and, through the history $H_{t_k}$, on previous random transitions and random rewards.
- We define the regret incurred by a reinforcement learning algorithm $\pi$ up to time $T$ to be
$$
\operatorname{Regret}\left(T, \pi, M^*\right):=\sum_{k=1}^{\lceil T / \tau\rceil} \Delta_k
$$
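For a finite state space these definitions transcribe directly; in the sketch below `rho`, `V_star_1`, and `V_mu_k_1` are illustrative length-$|\mathcal{S}|$ arrays holding $\rho$ and the step-1 value functions of $\mu^*$ and $\mu_k$ under $M^*$.

```python
import numpy as np

def episode_regret(rho, V_star_1, V_mu_k_1):
    """Delta_k: the rho-weighted gap between optimal and followed value at step 1."""
    return float(rho @ (V_star_1 - V_mu_k_1))

def total_regret(deltas):
    """Regret(T, pi, M*): sum of Delta_k over the ceil(T / tau) episodes played."""
    return float(np.sum(deltas))
```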
## Main results
***
Algorithm: Posterior Sampling for Reinforcement Learning (PSRL)
```
Input: prior distribution phi for M*, t = 1
for episodes k = 1, 2, ... do
    sample M_k ~ phi(· | H_t)
    compute mu_k = mu^{M_k}
    for timesteps j = 1, ..., tau do
        apply a_t ~ mu_k(s_t, j)
        observe r_t and s_{t+1}
        advance t = t + 1
    end for
end for
```
***
- At the start of episode $k$, PSRL samples an MDP $M_k$ from the posterior.
- PSRL then follows the policy $\mu_k=\mu^{M_k}$ which is optimal for this sampled MDP during episode $k$.
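As a concrete illustration, here is a minimal PSRL sketch for a finite (tabular) MDP. The conjugate prior choices (a Dirichlet prior over each transition distribution and a unit-variance Gaussian prior over each mean reward), the `env` sampler for the unknown MDP, and the representation of $\rho$ as a probability vector are assumptions made for the example; the paper's analysis covers general parameterized families. Planning for the sampled MDP is finite-horizon backward induction.

```python
import numpy as np

def solve_mdp(P, r_bar, tau):
    """Optimal policy for a finite-horizon tabular MDP via backward induction."""
    S, A = r_bar.shape
    V = np.zeros((tau + 1, S))
    mu = np.zeros((tau, S), dtype=int)
    for i in range(tau - 1, -1, -1):
        Q = r_bar + P @ V[i + 1]          # Q[s, a] = r_bar(s, a) + E[V_{i+1}(s') | s, a]
        mu[i] = Q.argmax(axis=1)
        V[i] = Q.max(axis=1)
    return V, mu

def psrl(env, rho, S, A, tau, n_episodes, seed=0):
    """PSRL with Dirichlet transition / Gaussian reward posteriors (illustrative).

    env(s, a) -> (r, s_next) samples the unknown MDP M*; rho is the initial
    state distribution as a length-S probability vector.
    """
    rng = np.random.default_rng(seed)
    alpha = np.ones((S, A, S))            # Dirichlet counts for P(. | s, a)
    r_sum = np.zeros((S, A))              # sum of observed rewards per (s, a)
    r_cnt = np.zeros((S, A))              # number of observations per (s, a)
    for k in range(n_episodes):
        # Sample M_k ~ phi(. | H_t): independent Dirichlet / Gaussian draws per (s, a).
        g = rng.gamma(alpha)
        P_k = g / g.sum(axis=-1, keepdims=True)
        prec = 1.0 + r_cnt                # posterior precision (prior N(0, 1), unit noise variance)
        r_k = rng.normal(r_sum / prec, 1.0 / np.sqrt(prec))
        # Compute mu_k = mu^{M_k} and follow it for one episode.
        _, mu_k = solve_mdp(P_k, r_k, tau)
        s = rng.choice(S, p=rho)          # s_1 ~ rho
        for j in range(tau):
            a = mu_k[j, s]
            r, s_next = env(s, a)
            r_sum[s, a] += r
            r_cnt[s, a] += 1
            alpha[s, a, s_next] += 1      # posterior update from the observed transition
            s = s_next
```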
- To state our results we first introduce some notation.
- For any set $\mathcal{X}$ and $\mathcal{Y} \subseteq \mathbb{R}^d$ for $d$ finite, let $\mathcal{P}_{\mathcal{X}, \mathcal{Y}}^{C, \sigma}$ be the family of distributions from $\mathcal{X}$ to $\mathcal{Y}$ with mean $\|\cdot\|_2$-bounded in $[0, C]$ and additive $\sigma$-sub-Gaussian noise.
- We let $N\left(\mathcal{F}, \alpha,\|\cdot\|_2\right)$ be the $\alpha$-covering number of $\mathcal{F}$ with respect to the $\|\cdot\|_2$-norm and write
$$
n_{\mathcal{F}}=\log \left(8 N\left(\mathcal{F}, 1 / T^2,\|\cdot\|_2\right) T\right)
$$
- Finally we write $d_E(\mathcal{F})=\operatorname{dim}_E\left(\mathcal{F}, T^{-1}\right)$ for the eluder dimension of $\mathcal{F}$ at precision $T^{-1}$, a notion of dimension specialized to sequential measurements.
***
Theorem 1 (Expected regret for PSRL in parameterized MDPs).
Fix a state space $\mathcal{S}$, action space $\mathcal{A}$, function families $\mathcal{R} \subseteq \mathcal{P}_{\mathcal{S} \times \mathcal{A}, \mathbb{R}}^{C_{\mathcal{R}}, \sigma_{\mathcal{R}}}$ and $\mathcal{P} \subseteq \mathcal{P}_{\mathcal{S} \times \mathcal{A}, \mathcal{S}}^{C_{\mathcal{P}}, \sigma_{\mathcal{P}}}$ for any $C_{\mathcal{R}}, C_{\mathcal{P}}, \sigma_{\mathcal{R}}, \sigma_{\mathcal{P}}>0$. Let $M^*$ be an MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, rewards $R^* \in \mathcal{R}$ and transitions $P^* \in \mathcal{P}$. If $\phi$ is the prior distribution of $M^*$ and $K^*=K^{M^*}$ is a global Lipschitz constant for the future value function, as in the Lipschitz condition above, then:
$$
\begin{aligned}
&\mathbb{E}\left[\operatorname{Regret}\left(T, \pi^{P S}, M^*\right)\right] \\
&\leq\left[C_{\mathcal{R}}+C_{\mathcal{P}}\right]+\tilde{D}(\mathcal{R})\\
&+\mathbb{E}\left[K^*\right]\left(1+\frac{1}{T-1}\right) \tilde{D}(\mathcal{P})
\end{aligned}
$$
where, for $\mathcal{F}$ equal to either $\mathcal{R}$ or $\mathcal{P}$, we use the shorthand:
$$
\begin{aligned}
& \tilde{D}(\mathcal{F}):=1+\tau C_{\mathcal{F}} d_E(\mathcal{F})\\
&+8\sqrt{d_E(\mathcal{F})\left(4 C_{\mathcal{F}}+\sqrt{2 \sigma_{\mathcal{F}}^2 \log \left(32 T^3\right)}\right)}\\
&+8 \sqrt{2 \sigma_{\mathcal{F}}^2 n_{\mathcal{F}} d_E(\mathcal{F}) T}
\end{aligned}
$$
***
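To get a feel for how the bound scales with $T$, $\tau$, the eluder dimension, and the sub-Gaussian parameters, one can transcribe it literally. The helper names below are just for illustration, and the inputs (eluder dimensions, covering numbers, $\mathbb{E}[K^*]$) are problem-specific quantities that this sketch does not compute.

```python
import math

def D_tilde(T, tau, C_F, sigma_F, dE_F, N_F):
    """The shorthand D~(F) from Theorem 1; N_F is the 1/T^2 covering number of F."""
    n_F = math.log(8 * N_F * T)           # n_F = log(8 * N(F, 1/T^2, ||.||_2) * T)
    return (1
            + tau * C_F * dE_F
            + 8 * math.sqrt(dE_F * (4 * C_F + math.sqrt(2 * sigma_F**2 * math.log(32 * T**3))))
            + 8 * math.sqrt(2 * sigma_F**2 * n_F * dE_F * T))

def psrl_regret_bound(T, tau, C_R, sigma_R, dE_R, N_R, C_P, sigma_P, dE_P, N_P, E_K):
    """Right-hand side of Theorem 1: a bound on E[Regret(T, pi_PS, M*)]."""
    return ((C_R + C_P)
            + D_tilde(T, tau, C_R, sigma_R, dE_R, N_R)
            + E_K * (1 + 1.0 / (T - 1)) * D_tilde(T, tau, C_P, sigma_P, dE_P, N_P))
```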