# Model-based Reinforcement Learning and the Eluder Dimension
https://arxiv.org/abs/1406.1853
## Problem formulation
- We consider the problem of learning to optimize a random finite horizon MDP $M=$ $\left(\mathcal{S}, \mathcal{A}, R^M, P^M, \tau, \rho\right)$ in repeated finite episodes of interaction.
- $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R^M(s, a)$ is the reward distribution over $\mathbb{R}$, and $P^M(\cdot \mid s, a)$ is the transition distribution over $\mathcal{S}$ when selecting action $a$ in state $s$; $\tau$ is the time horizon, and $\rho$ is the initial state distribution.
- All random variables we will consider are on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
- A policy $\mu$ is a function mapping each state $s \in \mathcal{S}$ and $i=1, \ldots, \tau$ to an action $a \in \mathcal{A}$.
- We define $\bar{r}^M(s, a):=\mathbb{E}\left[r \mid r \sim R^M(s, a)\right]$ to be the mean reward; in the expectation below, the subscripts indicate that $a_j=\mu\left(s_j, j\right)$ and $s_{j+1} \sim P^M\left(\cdot \mid s_j, a_j\right)$ for $j=i, \ldots, \tau$.
- For each MDP $M$ and policy $\mu$, we define a value function $V$:
$$
V_{\mu, i}^M(s):=\mathbb{E}_{M, \mu}\left[\sum_{j=i}^\tau \bar{r}^M\left(s_j, a_j\right) \mid s_i=s\right]
$$
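Because the horizon is finite, $V_{\mu,i}^M$ can be computed exactly by backward induction from $i=\tau$ down to $i=1$. Below is a minimal sketch for a finite (tabular) MDP; the array encodings of $P^M$, $\bar{r}^M$, and $\mu$ are assumptions made for the example, not notation from the paper.

```python
import numpy as np

def evaluate_policy(P, r_bar, mu, tau):
    """Backward induction for V_{mu,i}^M on a finite MDP.

    P     : array (S, A, S), P[s, a, s2] = P^M(s2 | s, a)
    r_bar : array (S, A),    mean rewards r_bar^M(s, a)
    mu    : array (tau, S) of action indices, mu[i, s] = mu(s, i+1)
    Returns V with V[i, s] = V_{mu, i+1}^M(s) (rows are 0-indexed timesteps).
    """
    S = r_bar.shape[0]
    V = np.zeros((tau + 1, S))              # V[tau] = 0: no reward beyond the horizon
    for i in range(tau - 1, -1, -1):
        a = mu[i]                           # action taken in each state at this step
        Pa = P[np.arange(S), a]             # Pa[s, :] = P^M(. | s, mu(s, i+1))
        V[i] = r_bar[np.arange(S), a] + Pa @ V[i + 1]
    return V[:tau]
```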
- A policy $\mu$ is said to be optimal for MDP $M$ if $V_{\mu, i}^M(s)=\max _{\mu^{\prime}} V_{\mu^{\prime}, i}^M(s)$ for all $s \in \mathcal{S}$ and $i=1, \ldots, \tau$.
- We will associate with each MDP $M$ a policy $\mu^M$ that is optimal for $M$.
- For any distribution $\Phi$ over $\mathcal{S}$, we define the one step future value function $U$ to be the expected value of the optimal policy with the next state distributed according to $\Phi$.
$$
U_i^M(\Phi):=\mathbb{E}_{M, \mu^M}\left[V_{\mu^M, i+1}^M(s) \mid s \sim \Phi\right]
$$
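In the same finite-state setting, $U_i^M(\Phi)$ is just the $\Phi$-weighted average of the optimal values at the next step. A tiny sketch; here `V_next` is assumed to hold $V_{\mu^M,i+1}^M(\cdot)$, e.g. produced by a planner for $M$.

```python
import numpy as np

def future_value(Phi, V_next):
    """U_i^M(Phi) = E_{s ~ Phi}[ V_{mu^M, i+1}^M(s) ] over a finite state space.

    Phi    : array (S,), a distribution over the next state
    V_next : array (S,), V_next[s] = V_{mu^M, i+1}^M(s)
    """
    return float(Phi @ V_next)
```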
- One natural regularity condition for learning is that the future values of similar distributions should be similar.
- We formalize this idea through a Lipschitz condition on the future value function with respect to the means of these state distributions.
- We write $\mathcal{E}(\Phi):=\mathbb{E}[s \mid s \sim \Phi] \in \mathcal{S}$ for the mean of a distribution $\Phi$ and express the Lipschitz continuity for $U_i^M$ with respect to the $\|\cdot\|_2$-norm of the mean:
$$
\left|U_i^M(\Phi)-U_i^M(\tilde{\Phi})\right| \leq K_i^M(\mathcal{D})\,\|\mathcal{E}(\Phi)-\mathcal{E}(\tilde{\Phi})\|_2 \quad \text{for all } \Phi, \tilde{\Phi} \in \mathcal{D}.
$$
- We define $K^M(\mathcal{D}):=\max _i K_i^M(\mathcal{D})$ to be a global Lipschitz constant for the future value function with state distributions from $\mathcal{D}$ (see the sketch below).
- We write $K^M:=K^M(\mathcal{D}(M))$, where $\mathcal{D}(M):=\left\{P^M(\cdot \mid s, a) \mid s \in \mathcal{S}, a \in \mathcal{A}\right\}$ is the set of all possible one-step state distributions under the MDP $M$.
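With states embedded as vectors, $K_i^M(\mathcal{D})$ can be estimated over a finite collection of one-step distributions by taking the largest ratio appearing in the Lipschitz condition above (and $K^M(\mathcal{D})$ by maximizing over $i$). A rough sketch; the collection `dists`, the state embedding `states`, and `V_next` (as in the previous snippet) are assumptions for the example:

```python
import itertools
import numpy as np

def estimate_lipschitz(dists, states, V_next):
    """max over pairs of |U_i(Phi) - U_i(Phi~)| / ||E(Phi) - E(Phi~)||_2.

    dists  : array (m, S) of probability vectors, a finite stand-in for D
    states : array (S, d), states[s] = vector embedding of state s
    V_next : array (S,), optimal values at step i+1
    """
    K = 0.0
    for Phi, Phi2 in itertools.combinations(dists, 2):
        gap = abs(Phi @ V_next - Phi2 @ V_next)                  # |U_i^M(Phi) - U_i^M(Phi~)|
        mean_gap = np.linalg.norm(Phi @ states - Phi2 @ states)  # ||E(Phi) - E(Phi~)||_2
        if mean_gap > 0:
            K = max(K, gap / mean_gap)
    return K
```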
- The reinforcement learning agent interacts with the MDP over episodes that begin at times $t_k=(k-1) \tau+1, k=1,2, \ldots$
- Let $H_t=\left(s_1, a_1, r_1, \ldots, s_{t-1}, a_{t-1}, r_{t-1}\right)$ denote the history of observations made prior to time $t$.
- A reinforcement learning algorithm is a deterministic sequence $\left\{\pi_k \mid k=1,2, \ldots\right\}$ of functions, each mapping $H_{t_k}$ to a probability distribution $\pi_k\left(H_{t_k}\right)$ over policies which the agent will employ during the $k$ th episode.
- $\Delta_k$ denotes the regret over the $k$th episode, defined with respect to the true MDP $M^*$ by
$$
\Delta_k:=\int_{\mathcal{S}} \rho(s)\left(V_{\mu^*, 1}^{M^*}-V_{\mu_k, 1}^{M^*}\right)(s)\,ds,
$$
where $\mu^*=\mu^{M^*}$ and $\mu_k$ is the policy the agent follows during episode $k$.
- Note that regret is not deterministic since it can depend on the random MDP $M^*$, the algorithm's internal random sampling and, through the history $H_{t_k}$, on previous random transitions and random rewards.
- We define the regret incurred by a reinforcement learning algorithm $\pi$ up to time $T$ to be
$$
\operatorname{Regret}\left(T, \pi, M^*\right):=\sum_{k=1}^{\lceil T / \tau\rceil} \Delta_k
$$
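For a finite state space these definitions transcribe directly; in the sketch below `rho`, `V_star_1`, and `V_mu_k_1` are illustrative length-$|\mathcal{S}|$ arrays holding $\rho$ and the step-1 value functions of $\mu^*$ and $\mu_k$ under $M^*$.

```python
import numpy as np

def episode_regret(rho, V_star_1, V_mu_k_1):
    """Delta_k: the rho-weighted gap between optimal and followed value at step 1."""
    return float(rho @ (V_star_1 - V_mu_k_1))

def total_regret(deltas):
    """Regret(T, pi, M*): sum of Delta_k over the ceil(T / tau) episodes played."""
    return float(np.sum(deltas))
```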
## Main results
***
Algorithm: Posterior Sampling for Reinforcement Learning (PSRL)
```
Input: prior distribution phi for M*, t = 1
for episodes k = 1, 2, ... do
    sample M_k ~ phi(· | H_t)
    compute mu_k = mu^{M_k}
    for timesteps j = 1, ..., tau do
        apply a_t ~ mu_k(s_t, j)
        observe r_t and s_{t+1}
        advance t = t + 1
    end for
end for
```
***
- At the start of episode $k$, PSRL samples an MDP $M_k$ from the posterior.
- PSRL then follows the policy $\mu_k=\mu^{M_k}$ which is optimal for this sampled MDP during episode $k$.
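As a concrete illustration, here is a minimal PSRL sketch for a finite (tabular) MDP. The conjugate prior choices (a Dirichlet prior over each transition distribution and a unit-variance Gaussian prior over each mean reward), the `env` sampler for the unknown MDP, and the representation of $\rho$ as a probability vector are assumptions made for the example; the paper's analysis covers general parameterized families. Planning for the sampled MDP is finite-horizon backward induction.

```python
import numpy as np

def solve_mdp(P, r_bar, tau):
    """Optimal policy for a finite-horizon tabular MDP via backward induction."""
    S, A = r_bar.shape
    V = np.zeros((tau + 1, S))
    mu = np.zeros((tau, S), dtype=int)
    for i in range(tau - 1, -1, -1):
        Q = r_bar + P @ V[i + 1]          # Q[s, a] = r_bar(s, a) + E[V_{i+1}(s') | s, a]
        mu[i] = Q.argmax(axis=1)
        V[i] = Q.max(axis=1)
    return V, mu

def psrl(env, rho, S, A, tau, n_episodes, seed=0):
    """PSRL with Dirichlet transition / Gaussian reward posteriors (illustrative).

    env(s, a) -> (r, s_next) samples the unknown MDP M*; rho is the initial
    state distribution as a length-S probability vector.
    """
    rng = np.random.default_rng(seed)
    alpha = np.ones((S, A, S))            # Dirichlet counts for P(. | s, a)
    r_sum = np.zeros((S, A))              # sum of observed rewards per (s, a)
    r_cnt = np.zeros((S, A))              # number of observations per (s, a)
    for k in range(n_episodes):
        # Sample M_k ~ phi(. | H_t): independent Dirichlet / Gaussian draws per (s, a).
        g = rng.gamma(alpha)
        P_k = g / g.sum(axis=-1, keepdims=True)
        prec = 1.0 + r_cnt                # posterior precision (prior N(0, 1), unit noise variance)
        r_k = rng.normal(r_sum / prec, 1.0 / np.sqrt(prec))
        # Compute mu_k = mu^{M_k} and follow it for one episode.
        _, mu_k = solve_mdp(P_k, r_k, tau)
        s = rng.choice(S, p=rho)          # s_1 ~ rho
        for j in range(tau):
            a = mu_k[j, s]
            r, s_next = env(s, a)
            r_sum[s, a] += r
            r_cnt[s, a] += 1
            alpha[s, a, s_next] += 1      # posterior update from the observed transition
            s = s_next
```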
- To state our results we first introduce some notation.
- For any set $\mathcal{X}$ and $\mathcal{Y} \subseteq \mathbb{R}^d$ for $d$ finite, let $\mathcal{P}_{\mathcal{X}, \mathcal{Y}}^{C, \sigma}$ be the family of distributions from $\mathcal{X}$ to $\mathcal{Y}$ with mean $\|\cdot\|_2$-bounded in $[0, C]$ and additive $\sigma$-sub-Gaussian noise.
- We let $N\left(\mathcal{F}, \alpha,\|\cdot\|_2\right)$ be the $\alpha$-covering number of $\mathcal{F}$ with respect to the $\|\cdot\|_2$-norm and write
$$
n_{\mathcal{F}}=\log \left(8 N\left(\mathcal{F}, 1 / T^2,\|\cdot\|_2\right) T\right)
$$
- Finally we write $d_E(\mathcal{F})=\operatorname{dim}_E\left(\mathcal{F}, T^{-1}\right)$ for the eluder dimension of $\mathcal{F}$ at precision $T^{-1}$, a notion of dimension specialized to sequential measurements.
***
Theorem 1 (Expected regret for PSRL in parameterized MDPs).
Fix a state space $\mathcal{S}$, action space $\mathcal{A}$, function families $\mathcal{R} \subseteq \mathcal{P}_{\mathcal{S} \times \mathcal{A}, \mathbb{R}}^{C_{\mathcal{R}}, \sigma_{\mathcal{R}}}$ and $\mathcal{P} \subseteq \mathcal{P}_{\mathcal{S} \times \mathcal{A}, \mathcal{S}}^{C_{\mathcal{P}}, \sigma_{\mathcal{P}}}$ for any $C_{\mathcal{R}}, C_{\mathcal{P}}, \sigma_{\mathcal{R}}, \sigma_{\mathcal{P}}>0$. Let $M^*$ be an MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, rewards $R^* \in \mathcal{R}$ and transitions $P^* \in \mathcal{P}$. If $\phi$ is the prior distribution of $M^*$ and $K^*=K^{M^*}$ is a global Lipschitz constant for the future value function, as in the Lipschitz condition above, then:
$$
\begin{aligned}
&\mathbb{E}\left[\operatorname{Regret}\left(T, \pi^{P S}, M^*\right)\right] \\
&\leq\left[C_{\mathcal{R}}+C_{\mathcal{P}}\right]+\tilde{D}(\mathcal{R})\\
&+\mathbb{E}\left[K^*\right]\left(1+\frac{1}{T-1}\right) \tilde{D}(\mathcal{P})
\end{aligned}
$$
where, for $\mathcal{F}$ equal to either $\mathcal{R}$ or $\mathcal{P}$, we use the shorthand:
$$
\begin{aligned}
& \tilde{D}(\mathcal{F}):=1+\tau C_{\mathcal{F}} d_E(\mathcal{F})\\
&+8\sqrt{d_E(\mathcal{F})\left(4 C_{\mathcal{F}}+\sqrt{2 \sigma_{\mathcal{F}}^2 \log \left(32 T^3\right)}\right)}\\
&+8 \sqrt{2 \sigma_{\mathcal{F}}^2 n_{\mathcal{F}} d_E(\mathcal{F}) T}
\end{aligned}
$$
***
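To get a feel for how the bound scales with $T$, $\tau$, the eluder dimension, and the sub-Gaussian parameters, one can transcribe it literally. The helper names below are just for illustration, and the inputs (eluder dimensions, covering numbers, $\mathbb{E}[K^*]$) are problem-specific quantities that this sketch does not compute.

```python
import math

def D_tilde(T, tau, C_F, sigma_F, dE_F, N_F):
    """The shorthand D~(F) from Theorem 1; N_F is the 1/T^2 covering number of F."""
    n_F = math.log(8 * N_F * T)           # n_F = log(8 * N(F, 1/T^2, ||.||_2) * T)
    return (1
            + tau * C_F * dE_F
            + 8 * math.sqrt(dE_F * (4 * C_F + math.sqrt(2 * sigma_F**2 * math.log(32 * T**3))))
            + 8 * math.sqrt(2 * sigma_F**2 * n_F * dE_F * T))

def psrl_regret_bound(T, tau, C_R, sigma_R, dE_R, N_R, C_P, sigma_P, dE_P, N_P, E_K):
    """Right-hand side of Theorem 1: a bound on E[Regret(T, pi_PS, M*)]."""
    return ((C_R + C_P)
            + D_tilde(T, tau, C_R, sigma_R, dE_R, N_R)
            + E_K * (1 + 1.0 / (T - 1)) * D_tilde(T, tau, C_P, sigma_P, dE_P, N_P))
```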