---
title: CSRL Literature Review
tags: rl
---

# Batch-Constrained Q-Learning

[[Paper 1](https://arxiv.org/abs/1812.02900)] [[Paper 2](https://arxiv.org/abs/1910.01708)] [[Code](https://github.com/sfujim/BCQ)]

#### Extrapolation Error

(1) Missing data: if $(s, \pi(s))$ is absent from the batch, the estimate of $Q(s, \pi(s))$ can be arbitrarily bad
(2) Model bias: $\mathcal{T}^\pi Q(s,a) = E_{s' \sim B}[r + \gamma Q(s', \pi(s'))]$, where $s' \sim B$ rather than $s' \sim \mu(s)$
(3) Training mismatch: $L = \frac{1}{|B|}\sum_{(s,a,r,s') \sim B}|r + \gamma Q(s', \pi(s')) - Q(s,a)|^2$; if the data distribution in $B$ does not match that under $\pi$, $Q(s,a)$ may be a poor estimate of the actual action values

BCQ intuition: choose actions such that they
(1) are similar to those in the batch,
(2) lead to familiar states, and
(3) maximize the $Q$-value.

#### Extrapolation in Finite MDPs

The paper recasts learning $Q(s,a)$ from a batch $B$ as learning on a new MDP $M_B$. The tabular extrapolation error is $\epsilon_{MDP}(s,a) = Q^\pi(s,a) - Q^\pi_B(s,a)$. Recursively, we can express this as

$\epsilon_{MDP}(s,a) = \sum_{s'}\Big((p_M(s'|s,a) - p_B(s'|s,a))\big(r(s,a,s') + \gamma \sum_{a'}\pi(a'|s')Q^\pi_B(s',a')\big) + p_M(s'|s,a)\,\gamma \sum_{a'}\pi(a'|s')\epsilon_{MDP}(s',a')\Big)$

BCQL update (a tabular sketch appears under Code Sketches below):

$Q(s,a) \leftarrow (1 - \alpha)Q(s,a) + \alpha\big(r + \gamma \max_{a' : (s',a') \in B}Q(s',a')\big)$

The proofs essentially show that BCQL converges to the optimal batch-constrained policy, i.e. the best policy $\pi$ for which $\mu_\pi(s) > 0, \pi(a|s) > 0$ implies $(s,a) \in B$, analogous to the sampling requirements for Q-learning.

#### Batch-Constrained Deep Reinforcement Learning (BCQ)

(1) Train a VAE $G_\omega(s)$ on the $(s,a)$ pairs in batch $B$
(2) Use the VAE to generate $n$ actions $a_i$ similar to those in $B$
(3) Train a perturbation model $\mathcal{E}_\phi(s', a_i, \Phi)$ to pick an optimal action in $[a_i - \Phi, a_i + \Phi]$
(4) Value target $y$, based on a variant of Clipped Double-Q from TD3 ($\lambda = 1$ recovers it): $y = r + \gamma \max_{a_i}\big[\lambda \min_{j=1,2}Q_{\theta'_j}(s', a_i) + (1 - \lambda) \max_{j=1,2}Q_{\theta'_j}(s', a_i)\big]$ (sketched under Code Sketches below)
(5) Update $\theta_1, \theta_2$: $\theta \leftarrow \arg\min_\theta \sum(y - Q_\theta(s,a))^2$
(6) Update $\phi$ using the DPG update: $\phi \leftarrow \arg\max_\phi \sum Q_{\theta_1}(s, a + \mathcal{E}_\phi(s,a,\Phi)), \; a \sim G_\omega(s)$

**Note**: $\Phi = a_{max} - a_{min}, n \to \infty$ recovers Q-learning; $\Phi = 0, n = 1$ recovers behavior cloning.

![BCQ Algorithm](https://i.imgur.com/FuSlYQ4.png)

#### Discrete BCQ on ALE (Paper 2)

(1) $G_\omega(a|s)$ can now be computed over all actions; the constraint becomes keeping only actions $a$ with $\frac{G_\omega(a|s)}{\max_{a'}G_\omega(a'|s)} > \tau$; $\tau = 0$ gives Q-learning, $\tau = 1$ gives behavior cloning (sketched under Code Sketches below)
(2) Replace Clipped Double-Q with Double DQN

#### Notes

(1) I can't follow the whole recursive expression for $\epsilon_{MDP}$, but it looks about right
(2) Why only $Q_{\theta_1}$ in the DPG update? The paper points to the TD3 paper for an explanation
(3) Why $G_\omega$ followed by $\mathcal{E}_\phi$? I understand we want to pick actions similar to the batch, then choose optimal actions near them using $Q$, but the specific method seems a little awkward. "This enables access to actions in a constrained region, without having to sample from the generative model a prohibitive number of times." If we're already choosing the highest-valued action among $G_\omega$'s samples, why perturb them?
(4) Related to (3): the algorithm samples from $G_\omega$ twice, once to find actions $G_\omega(s')$ to pass to $\mathcal{E}_\phi$ for the target, and once for the DPG update, $G_\omega(s)$?
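#### Code Sketches

A minimal tabular sketch of the BCQL update above, assuming a `defaultdict(float)` Q-table keyed by `(s, a)` and a list of `(s, a, r, s')` tuples; `alpha` and `gamma` are placeholder values, not taken from the paper.

```python
from collections import defaultdict

def bcql_sweep(Q, batch, alpha=0.1, gamma=0.99):
    """One sweep of the tabular BCQL update over a batch of (s, a, r, s_next) tuples."""
    # Batch constraint: the only admissible a' at s' are those observed at s' in B.
    seen_actions = defaultdict(set)
    for s, a, _, _ in batch:
        seen_actions[s].add(a)

    for s, a, r, s_next in batch:
        # max_{a' : (s', a') in B} Q(s', a'); fall back to 0 if s' never appears in B.
        candidates = [Q[(s_next, a_next)] for a_next in seen_actions[s_next]]
        target = r + gamma * max(candidates, default=0.0)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q

# Usage: Q = defaultdict(float); call bcql_sweep(Q, batch) repeatedly until the table converges.
```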
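A sketch of the value target in step (4), under assumed interfaces that do not match the released code exactly: `vae_decode(s)` samples an action from $G_\omega(s)$, `perturb(s, a)` returns the (already clipped) perturbation $\mathcal{E}_\phi(s, a, \Phi)$, and `q1_target` / `q2_target` are the target critics $Q_{\theta'_1}, Q_{\theta'_2}$. The defaults for `lam` and `n` are illustrative only.

```python
import torch

@torch.no_grad()
def bcq_target(reward, not_done, next_state, q1_target, q2_target,
               vae_decode, perturb, gamma=0.99, lam=0.75, n=10):
    """Soft clipped double-Q target over n sampled and perturbed candidate actions."""
    # Duplicate each next state n times and draw candidate actions near the batch.
    s = next_state.repeat_interleave(n, dim=0)
    a = vae_decode(s)            # a_i ~ G_omega(s')
    a = a + perturb(s, a)        # a_i + E_phi(s', a_i, Phi), assumed clipped inside perturb

    # Convex combination of min and max of the two target critics; lam = 1 is Clipped Double-Q.
    q1, q2 = q1_target(s, a), q2_target(s, a)
    q = lam * torch.min(q1, q2) + (1.0 - lam) * torch.max(q1, q2)

    # Max over the n candidates for each next state, then the Bellman backup.
    q = q.reshape(next_state.shape[0], n).max(dim=1, keepdim=True)[0]
    return reward + not_done * gamma * q
```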
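A sketch of the discrete variant's target (Paper 2): keep only actions whose relative probability under $G_\omega(a|s')$ exceeds $\tau$, then apply Double DQN over the remainder. `g_net` is assumed to output logits for $G_\omega(a|s)$, and $\tau = 0.3$ is a placeholder threshold.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def discrete_bcq_target(reward, not_done, next_state, q_net, q_target, g_net,
                        gamma=0.99, tau=0.3):
    """Double-DQN target restricted to actions the generative model deems likely enough."""
    probs = F.softmax(g_net(next_state), dim=1)                  # G_omega(a | s')
    allowed = probs / probs.max(dim=1, keepdim=True)[0] > tau    # relative-probability filter

    # Double DQN: choose a* with the online network, restricted to allowed actions.
    q_online = q_net(next_state).masked_fill(~allowed, float('-inf'))
    a_star = q_online.argmax(dim=1, keepdim=True)

    # Evaluate a* with the target network and form the backup.
    q_next = q_target(next_state).gather(1, a_star)
    return reward + not_done * gamma * q_next
```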
# Generative Adversarial Tree Search (GATS)

[[Paper](https://arxiv.org/abs/1806.05780)]

# Deep Bisimulation for Control (DBC)

[[Paper](https://arxiv.org/abs/2006.10742)] [[Site](https://sites.google.com/view/deepbisim4control)] [[Code](https://github.com/facebookresearch/deep_bisim4control)]

# UCRL2