---
title: CSRL Literature Review
tags: rl
---
# Batch-Constrained Q-Learning
[[Paper 1](https://arxiv.org/abs/1812.02900)] [[Paper 2](https://arxiv.org/abs/1910.01708)] [[Code](https://github.com/sfujim/BCQ)]
#### Extrapolation Error
(1) Missing data:
If $(s, \pi(s))$ is missing from $B$, the estimate of $Q(s, \pi(s))$ can be arbitrarily bad
(2) Model bias:
$\mathcal{T}^\pi Q(s,a) = E_{s' \sim B}[r ~ + ~ \gamma Q(s', \pi(s'))]$ where $s' \sim B$ rather than the true dynamics $p_M(\cdot|s,a)$
(3) Training mismatch:
$L = \frac{1}{|B|}\sum_{(s,a,r,s') \sim B}|r + \gamma Q(s', \pi(s')) - Q(s,a)|^2$; if the data distribution in $B$ does not match that under $\pi$, $Q(s,a)$ may be a poor estimate of the actual action values
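A minimal sketch of this objective (my own, assuming discrete actions, a PyTorch Q-network `q_net`, target network `q_target`, and a policy `pi` returning action indices; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def batch_td_loss(q_net, q_target, pi, batch, gamma=0.99):
    """TD loss computed only on transitions stored in B (discrete actions).

    The expectation over next states is implicitly taken over whatever
    transitions happen to be in B, not the true dynamics, and (s,a) pairs
    visited by pi but absent from B are never corrected -- the mismatch
    described in (1)-(3).
    """
    s, a, r, s_next, done = batch                         # tensors sampled from B
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        a_pi = pi(s_next)                                 # action indices chosen by pi
        q_next = q_target(s_next).gather(1, a_pi.unsqueeze(1)).squeeze(1)
        target = r + gamma * (1 - done) * q_next
    return F.mse_loss(q_sa, target)
```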
BCQ Intuition: Choose actions such that they:
(1) Are similar to those in batch
(2) Lead to familiar states
(3) Maximize the $Q$-value
#### Extrapolation in Finite MDPs
The paper recasts learning $Q(s,a)$ from a batch $B$ in terms of learning on a new MDP $M_B$. $\epsilon_{MDP}(s,a) = Q^\pi(s,a) - Q^\pi_B(s,a)$ is the tabular extrapolation error. Recursively, we can express this as:
$\epsilon_{MDP}(s,a) = \sum_{s'}\Big((p_M(s'|s,a) - p_B(s'|s,a))\big(r(s,a,s') + \gamma \sum_{a'}\pi(a'|s')Q^\pi_B(s',a')\big) + p_M(s'|s,a)\gamma \sum_{a'}\pi(a'|s')\epsilon_{MDP}(s',a')\Big)$
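To sanity-check the definition numerically, a small NumPy sketch (my own, not from the paper): evaluate $Q^\pi$ under the true dynamics $p_M$ and under the empirical batch dynamics $p_B$, then subtract. Rewards are assumed to be $r(s,a)$ and the paper's terminal-state handling of unseen $(s,a)$ pairs is glossed over.

```python
import numpy as np

def policy_eval(p, r, pi, gamma=0.99, iters=2000):
    """Iterate the Bellman expectation operator for Q^pi under dynamics p.

    p:  (S, A, S) transition probabilities
    r:  (S, A)    expected rewards
    pi: (S, A)    policy probabilities
    """
    S, A, _ = p.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        v = (pi * q).sum(axis=1)   # V(s') = sum_a' pi(a'|s') Q(s',a')
        q = r + gamma * p @ v      # Q(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
    return q

# epsilon_MDP = Q^pi under the true MDP minus Q^pi under the batch MDP:
# eps = policy_eval(p_M, r, pi) - policy_eval(p_B, r, pi)
```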
BCQL update:
$Q(s,a) \leftarrow (1 - \alpha)Q(s,a) + \alpha\big(r + \gamma \max_{a' : (s',a') \in B}Q(s',a')\big)$
The proofs show that BCQL converges to the optimal batch-constrained policy, i.e. the best policy among those satisfying: if $\mu_\pi(s) > 0$ and $\pi(a|s) > 0$ then $(s,a) \in B$. This is analogous to the sampling requirements for Q-learning.
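A tabular sketch of the BCQL update above (my own illustration; `seen` is the set of $(s,a)$ pairs occurring in $B$):

```python
import numpy as np

def bcql_update(Q, transition, seen, alpha=0.1, gamma=0.99):
    """One batch-constrained Q-learning update on a single batch transition.

    Q:          (S, A) array of value estimates
    transition: (s, a, r, s_next) sampled from B
    seen:       set of (s, a) pairs occurring in B; the max over a' is
                restricted to actions with (s_next, a') in B
    """
    s, a, r, s_next = transition
    allowed = [a2 for a2 in range(Q.shape[1]) if (s_next, a2) in seen]
    if allowed:
        target = r + gamma * max(Q[s_next, a2] for a2 in allowed)
    else:
        target = r  # assumption for this sketch: no batch action available at s_next
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q
```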
#### Batch-Constrained Deep Reinforcement Learning (BCQ)
(1) Train a conditional VAE $G_\omega(s)$ on the $(s,a)$ pairs in $B$ to model the action distribution of the batch
(2) Use VAE to generate $n$ actions $a_i$ similar to those in $B$
(3) Train a perturbation $\mathcal{E}_\phi(s',a_i, \Phi)$ to pick an optimal action in $[a_i - \Phi, a_i + \Phi]$
(4) Value target $y$, based on a variant of Clipped Double-Q from TD3 ($\lambda = 1$ recovers Clipped Double-Q):
$y = r + \gamma \max_{a_i}\big[\lambda \min_{j=1,2}Q_{\theta'_j}(s', a_i) + (1 - \lambda) \max_{j=1,2}Q_{\theta'_j}(s', a_i)\big]$
(5) Update $\theta_1, \theta_2$:
$\theta \leftarrow \arg\min_\theta \sum(y - Q_\theta(s,a))^2$
(6) Update $\phi$ using DPG update:
$\phi \leftarrow \arg\max_\phi \sum Q_{\theta_1}(s, a + \mathcal{E}_\phi(s,a,\Phi)), \quad a \sim G_\omega(s)$
**Note**: with $\Phi = a_{max} - a_{min}$ and $n \to \infty$, BCQ approaches Q-learning; with $\Phi = 0$ and $n = 1$, it reduces to behavior cloning
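A minimal PyTorch sketch of the target computation in (2)-(4); `vae_decode`, `perturb_net`, `q1_target`, `q2_target` stand in for $G_\omega$, $\mathcal{E}_\phi$ and the target critics, and the names, shapes, and external clamp are my own assumptions rather than the reference implementation:

```python
import torch

def bcq_target(r, s_next, done, vae_decode, perturb_net, q1_target, q2_target,
               n=10, phi=0.05, lam=0.75, gamma=0.99):
    """Compute the BCQ value target y for a batch of transitions.

    For each next state, sample n candidate actions from the generative model,
    perturb them within [-phi, phi], evaluate both target critics, mix min/max
    with lam, and take the max over the n candidates.
    """
    batch_size = s_next.shape[0]
    with torch.no_grad():
        # Repeat each next state n times: (batch * n, state_dim)
        s_rep = s_next.repeat_interleave(n, dim=0)
        a = vae_decode(s_rep)                              # candidate actions ~ G_w(s')
        a = a + perturb_net(s_rep, a).clamp(-phi, phi)     # constrained perturbation
        q1, q2 = q1_target(s_rep, a), q2_target(s_rep, a)  # (batch * n, 1)
        q = lam * torch.min(q1, q2) + (1 - lam) * torch.max(q1, q2)
        q = q.reshape(batch_size, n).max(dim=1).values     # max over candidates
        y = r + gamma * (1 - done) * q
    return y
```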

#### Discrete BCQ on ALE (Paper 2)
(1) $G_\omega(a|s)$ can now be computed over all actions; change the criterion to only consider actions $a$ with $\frac{G_\omega(a|s)}{\max_{\hat{a}}G_\omega(\hat{a}|s)} > \tau$; $\tau = 0$ gives Q-learning, $\tau = 1$ gives behavior cloning
(2) Change CDQ to Double-DQN
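A small sketch of the discrete-BCQ action filter in (1) (illustrative names; `probs` are the outputs of $G_\omega(\cdot|s)$, `q_values` from the current Q-network):

```python
import torch

def discrete_bcq_action(q_values, probs, tau=0.3):
    """Greedy action restricted to actions whose relative probability under
    the generative model exceeds tau. tau=0 keeps every action (plain
    Q-learning); tau=1 keeps only the most likely batch action (behavior
    cloning). >= is used so the tau=1 edge case is well defined.
    """
    mask = probs / probs.max(dim=-1, keepdim=True).values >= tau
    neg_inf = torch.full_like(q_values, float('-inf'))
    return torch.where(mask, q_values, neg_inf).argmax(dim=-1)
```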
#### Notes
(1) Can't follow the whole recursive expression for $\epsilon$, but looks about right
(2) Why only $Q_{\theta_1}$ in DPG update? Paper points to TD3 paper for explanation
(3) Why $G_\omega$ then $\mathcal{E}_\phi$? I understand we want to pick actions similar to those in the batch, then choose high-value actions near them using $Q$, but the specific method seems a little awkward. "This enables access to actions in a constrained region, without having to sample from the generative model a prohibitive number of times."? If we're already choosing the highest-valued action among the samples from $G_\omega$, why also perturb them?
(4) Related to (3): the algorithm as written samples from $G_\omega$ twice, once to get actions $G_\omega(s')$ to pass to $\mathcal{E}_\phi$ for the target, and once for the DPG update via $G_\omega(s)$?
# Generative Adversarial Tree Search (GATS)
[[Paper](https://arxiv.org/abs/1806.05780)]
# Deep Bisimulation for Control (DBC)
[[Paper](https://arxiv.org/abs/2006.10742)] [[Site](https://sites.google.com/view/deepbisim4control)] [[Code](https://github.com/facebookresearch/deep_bisim4control)]
# UCRL2