---
marp: true
math: katex
header-includes:
- \setbeamertemplate{footline}[page number]
---
# Model-based RBMLE
---
## The original estimator of RBMLE
$J_{opt}(\mu)=\sum_{(s,a)} d_{\mu}(s,a)Q^*(s,a)$
$d_{\mu}(s,a)$: state-action discounted visitation distribution
$\mu$: initial state distribution
$\mathbb{P}(s'|s,a) = \langle \phi(s'|s,a),\mathbf{\theta}^{*} \rangle$
$\phi_{V}(s,a) = \sum_{s'} \phi(s'|s,a)V(s')$
$\mathbf{\theta}_{t,s,a} = \text{argmax}_{\mathbf{\theta} \in \mathcal{B}} \left\{ -\sum\limits_{i=1}^{t-1} \left( \langle \mathbf{\theta}, \phi_{V_i}(s_i,a_i) \rangle - V_{i}(s_{i+1}) \right)^{2} - \lambda\lVert \mathbf{\theta} \rVert_2^2 + \alpha(t)\cdot\langle \mathbf{\theta}, \phi_{V_{t-1}}(s,a) \rangle \right\}, \forall s,a$
$\mathbf{\theta}_{t} = \text{argmax}_{\mathbf{\theta} \in \mathcal{B}} \left\{ -\sum\limits_{i=1}^{t-1} \left( \langle \mathbf{\theta}, \phi_{V_i}(s_i,a_i) \rangle - V_{i}(s_{i+1}) \right)^{2} - \lambda\lVert \mathbf{\theta} \rVert_2^2 + \alpha(t)\cdot\max_{s,a}\langle \mathbf{\theta}, \phi_{V_{t-1}}(s,a) \rangle \right\}$
$\mathbf{\theta}'_{t,s} = \text{argmax}_{\mathbf{\theta} \in \mathcal{B}} \left\{ -\sum\limits_{i=1}^{t-1} \left( \langle \mathbf{\theta}, \phi_{V_i}(s_i,a_i) \rangle - V_{i}(s_{i+1}) \right)^{2} - \lambda\lVert \mathbf{\theta} \rVert_2^2 + \alpha(t)\cdot\max_{a}\langle \mathbf{\theta}, \phi_{V_{t-1}}(s,a) \rangle \right\}$
$\delta\cdot\max_{s,a}\langle \mathbf{\theta}, \phi_{V_{t-1}}(s,a) \rangle\leq \max_{a}\langle \mathbf{\theta}, \phi_{V_{t-1}}(s,a) \rangle\leq \max_{s,a}\langle \mathbf{\theta}, \phi_{V_{t-1}}(s,a) \rangle$
For all $s,s'$ and all feasible $\mathbf{\theta}$: $\max_{a}\langle \mathbf{\theta}, \phi_{V_{t-1}}(s,a) \rangle / \max_{a}\langle \mathbf{\theta}, \phi_{V_{t-1}}(s',a) \rangle \in [c_1,c_2]$
Ergodicity
Mars Rover / N-chain
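---
## Sanity check: closed form of the biased objective
If the constraint $\mathbf{\theta} \in \mathcal{B}$ is inactive (an assumption; the projected solution may differ), the reward-biased least-squares objective above has a closed-form maximizer, since the bias term is linear in $\mathbf{\theta}$. A minimal numpy sketch, with rows of `Phi` standing in for $\phi_{V_i}(s_i,a_i)$ and `y` for the targets $V_i(s_{i+1})$:

```python
import numpy as np

def rbmle_estimate(Phi, y, phi_bias, alpha, lam):
    """Unconstrained argmax of
        -sum_i (<theta, phi_i> - y_i)^2 - lam * ||theta||_2^2 + alpha * <theta, phi_bias>.
    Setting the gradient to zero gives
        theta = (Phi^T Phi + lam * I)^{-1} (Phi^T y + (alpha / 2) * phi_bias).
    """
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)
    return np.linalg.solve(A, Phi.T @ y + 0.5 * alpha * phi_bias)

# With alpha = 0 this reduces to ridge regression, recovering theta* on noiseless data.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -0.5, 0.3, 0.0])
Phi = rng.normal(size=(50, 4))
y = Phi @ theta_star
theta_hat = rbmle_estimate(Phi, y, np.ones(4), alpha=0.0, lam=1e-6)
```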
---
## The approximated estimator of RBMLE
Let $a' = \text{argmax}_{a} \langle \mathbf{\theta}_{t-1}, \phi_{V_{t-1}}(s,a) \rangle$
$\mathbf{\theta}_{t,s} = \text{argmax}_{\mathbf{\theta} \in \mathcal{B}} \left\{ -\sum\limits_{i=1}^{t-1} \left( \langle \mathbf{\theta}, \phi_{V_i}(s_i,a_i) \rangle - V_{i}(s_{i+1}) \right)^{2} - \lambda\lVert \mathbf{\theta} \rVert_2^2 + \alpha(t)\cdot\langle \mathbf{\theta}, \phi_{V_{t-1}}(s,a') \rangle \right\}$
$$\Vert \mathbf{\theta}_{t,s} - \mathbf{\theta}'_{t,s}\Vert \leq \delta'$$
$$\Vert \mathbf{\hat{\theta}}_{t} - \mathbf{\hat{\theta}}_{t-1}\Vert$$
$$\Vert \mathbf{\theta}'_{t-1,s} - \mathbf{\hat{\theta}}_{t-1} \Vert\text{, need to know }\alpha(t)$$
$$\Vert \mathbf{\theta}_{t-1,s} - \mathbf{\hat{\theta}}_{t-1} \Vert\text{, need to know }\alpha(t)$$
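---
## Sanity check: sensitivity to the bias feature
If the constraint $\mathbf{\theta} \in \mathcal{B}$ is inactive (an assumption), the closed form $\mathbf{\theta} = A_t^{-1}\bigl(b_t + \tfrac{\alpha(t)}{2}\phi_{\text{bias}}\bigr)$ of the biased ridge objective shows that $\mathbf{\theta}_{t,s}$ and $\mathbf{\theta}'_{t,s}$ differ only through their bias features, giving a candidate $\delta' = \tfrac{\alpha(t)}{2}\Vert A_t^{-1}\Vert_2 \, \Vert \phi_1 - \phi_2 \Vert$. A numerical check of this bound:

```python
import numpy as np

def rbmle_closed_form(A, b, phi_bias, alpha):
    # Unconstrained maximizer: theta = A^{-1} (b + (alpha / 2) * phi_bias)
    return np.linalg.solve(A, b + 0.5 * alpha * phi_bias)

rng = np.random.default_rng(1)
d, alpha = 4, 2.0
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                # positive-definite regularized Gram matrix
b = rng.normal(size=d)
phi1, phi2 = rng.normal(size=d), rng.normal(size=d)  # two candidate bias features

diff = rbmle_closed_form(A, b, phi1, alpha) - rbmle_closed_form(A, b, phi2, alpha)
# The difference is exactly (alpha/2) * A^{-1} (phi1 - phi2),
# hence bounded by the operator norm of A^{-1}:
bound = 0.5 * alpha * np.linalg.norm(np.linalg.inv(A), 2) * np.linalg.norm(phi1 - phi2)
```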
---
# Challenges
1. The RBMLE estimator should satisfy the analogues of Lemma 6.1, Lemma 6.2, and Lemma 6.3 in [Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping](https://arxiv.org/pdf/2006.13165.pdf).
2. To provide a practical RBMLE method, we should quantify the difference between the exact RBMLE estimator and the approximated estimator.
---
# Challenges 1
* Lemma 6.1: $$\mathbf{\theta}^{*} \in \mathcal{C}_{t} \cap \mathcal{B}$$
* Lemma 6.2 (Upper bound of $Q$): $$ \frac{1}{1-\gamma} \geq Q_{t}(s,a) \geq Q^{*}(s,a)$$
* Lemma 6.3 (Convergence of EVI): $$Q_{t}(s_t,a_t) \leq r(s_t,a_t) + \gamma\langle \mathbf{\theta}^{\text{RBMLE}},\phi_V(s_t,a_t)\rangle + 2\gamma^{U}$$
* With $\alpha(t) = \sqrt{\log{t}}$, we need to show $\mathbf{\theta}^{\text{RBMLE}} \in \mathcal{C}_{t} \cap \mathcal{B}$.
* How to get the lower bound of $\alpha(t)$?
---
# Challenges 2
How to quantify the difference between the exact RBMLE estimator and the approximated estimator?
1. Leverage the convexity of the loss function to bound the variation of the learned estimator.
2. Question: Do we need to consider the max over the previous state and action?
3. Question: How to handle the terms that involve two different contexts?


---
# How to get the lower bound of $\alpha(t)$
From the proof of Lemma 6.2, we need either
$$ \theta^{\text{RBMLE}} \in\mathcal{C}_{t} \cap \mathcal{B}$$
or
$$ \langle \theta^{\text{RBMLE}},\phi_{V}(s,a) \rangle \geq \langle \theta^{*},\phi_{V}(s,a) \rangle.$$
---
# How to quantify $\Vert \mathbf{\theta}_{t,s} - \mathbf{\theta}'_{t,s}\Vert \leq \delta'$
One-step MDP:
$$
\Vert \mathbf{\theta}_{t,s} - \mathbf{\theta}'_{t,s}\Vert \leq \alpha(t) \cdot \Vert x_{t,a_{t-1}} - x_{t-1}\Vert
$$
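If the constraint set is inactive (an assumption), the biased ridge objective gives the two estimators in closed form, so their difference is exactly $\tfrac{\alpha(t)}{2} A_t^{-1}(x_{t,a_{t-1}} - x_{t-1})$ with $A_t$ the regularized Gram matrix; since $\Vert A_t^{-1}\Vert_2 \leq 1/\lambda$, taking $\lambda \geq 1/2$ suffices for the displayed inequality. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(2)
d, alpha, lam = 5, 3.0, 0.5
Phi = rng.normal(size=(20, d))
A = Phi.T @ Phi + lam * np.eye(d)        # regularized Gram matrix, A >= lam * I
b = Phi.T @ rng.normal(size=20)
x1, x2 = rng.normal(size=d), rng.normal(size=d)  # the two contexts being compared

theta1 = np.linalg.solve(A, b + 0.5 * alpha * x1)
theta2 = np.linalg.solve(A, b + 0.5 * alpha * x2)
# ||theta1 - theta2|| = (alpha/2) ||A^{-1}(x1 - x2)|| <= (alpha/(2*lam)) ||x1 - x2||
```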



$$
\begin{aligned}
\theta^{\text{RBMLE}} &= \text{argmax}_{\theta} \{ \ell_{\lambda}(\theta) + \alpha(t)\,\text{bias}(\theta) \} \\
\ell_{\lambda}(\theta^{\text{RBMLE}}) + \alpha(t)\,\text{bias}(\theta^{\text{RBMLE}}) &\geq \ell_{\lambda}(\theta^{\text{UCB}}) + \alpha(t)\,\text{bias}(\theta^{\text{UCB}}) \\
\ell_{\lambda}(\theta^{\text{UCB}}) \geq \ell_{\lambda}(\theta^{\text{RBMLE}}) &\implies \alpha(t)\,\text{bias}(\theta^{\text{RBMLE}}) \geq \alpha(t)\,\text{bias}(\theta^{\text{UCB}}) \geq \alpha(t)\,\text{bias}(\theta^{*})
\end{aligned}
$$