--- title : "Addressing Distribution Shift in Online Reinforcement Learning with Offline Datasets - Notes" tags : "IvLabs, RL" --- # Addressing Distribution Shift in Online Reinforcement Learning with Offline Datasets Link to the [Research Paper](https://offline-rl-neurips.github.io/pdf/13.pdf) {%pdf https://offline-rl-neurips.github.io/pdf/13.pdf%} Strong RL agents can be made on previously-collected, static datasets (Offline RL), it is often desirable to improve such offline RL agents with further online interaction ## Introduction Offline RL may be suboptimal - The dataset they were trained on may only contain suboptimal data - The environment in which the data set is developed may be different from the environment in which dataset was generated This calls for Fine-Tuning Challenges - Offline RL algorithms based on modeling the dataset-generating policy are not amenable to fine tuning - due to difficulty of modeling the dataset-generating policy in the online setup - Conservative Q-Learning - does not require explicit behaviour modeling - amenable to fine tuning - In a non trivial task - due to distribution shift - The agent encounters out of distribution samples - loses its good initial policy - Can be attributed to bootstrapping error - error introduced when Q-function is updated with an inaccurate target value evaluated at unfamiliar states and actions Appeal of Offline RL lies in safe deployment at test time ### Contribution - Demonstration of fine-tuning a CQL - unstable training - Balanced replay scheme and an ensemble distillation scheme - Separate replay buffers for offline and online samples - modulate the sampling ratio to balance the effect - Widening the data distribution the agent sees (offline data) - Exploiting the environment feedback (online data) - Ensemble distillation - Learn Ensemble of independent CQL agents - Distill the multiple polices into a single policy - Policy is improved using the mean of Q-functions - policy updates are more robust to error in each individual Q-function ## Background ### Reinforcement Learning Off-policy RL algorithms - Train an agent with samples generated by any behaviour policy - Well suited for fine tuning a pre trained RL agent - can leverage both offline and aonline samples ### Soft Actor-Critic* - Off-policy actor-critic algorithm that learns a soft Q-function $Q_{\theta} (s,a)$ parameterized by $\theta$ and a stochastic policy $\pi_\theta$ modeled as a Gaussian with its parameters $\phi$ - SAC alternates between a soft policy evaluation and a soft policy improvement - Soft Policy evaluation - $\theta$ updated to minimize the following $\mathcal L_{\text{critic}}^{\text{SAC}}(\theta) = \Bbb E_{\tau_t\sim\mathcal B}[\mathcal L_Q(\tau_t,\theta)]$ $\mathcal L_Q^{\text{SAC}}(\tau_t,\theta) = (Q_\theta(s_t, a_t) - (r_t + \gamma \Bbb E_{\alpha_t\sim\pi_\phi}[Q_{\bar\theta}(s_{t+1}, a_{t+1}) - \alpha\text{ log }\pi_\phi(a_{t+1}|s_{t+1})]))^2$ $\tau_t = (s_t, a_t, r_t, s_{t+1})$ $\mathcal B$ - Replay Buffer $\bar \theta$ - Moving target of soft Q-function parameter $\theta$ $\alpha$ - Temperature parameter - Soft Policy improvement - $\phi$ is updated to minimize the following $\mathcal L_{\text{actor}}^{\text{SAC}}(\phi) = \Bbb E_{s_t\sim\mathcal B}[\mathcal L_\pi(s_t,\phi)]$ $\mathcal L_\pi(s_t,\phi) = \Bbb E_{\alpha_t\sim\pi_\phi}[\alpha \text{ log }\pi_\phi(a_t|s_t) - Q_\theta(s_t, a_t)]$ ### Conservative Q-Learning CQL - Offline Rl algorithm that learns a lower bound of the Q-function $Q_\theta(s,a)$ - To prevent extrapolation 
## Challenge: Distribution Shift

Distribution shift
- The offline RL agent encounters data distributed away from the offline data when interacting with the environment
- It involves an interplay between actor and critic updates with newly collected out-of-distribution samples
- It occurs because there is a shift between the offline and online data distributions

When using both online and offline data
- The chance of the agent seeing online samples for an update becomes too low
- This prevents timely updates at the unfamiliar states encountered online

When using online data exclusively
- The agent is exposed only to unseen samples, for which the Q-function does not provide a reliable value estimate
    - bootstrapping error

There is a need to balance this trade-off.

## BRED: Balanced Replay with Ensemble Distillation

- Addresses the distribution shift
- Separate offline and online replay buffers
    - to select a balanced mix of samples
    - Advantage
        - updates the Q-function with a wide distribution of samples
        - Q-values are updated at novel, unseen states from online interaction
- Multiple actor-critic models are trained together and their policies are distilled into a single policy
    - this distilled policy is then improved via the Q-ensemble

### Balancing experiences from online and offline replay buffers

At timestep $t$
- $B\cdot(1-\rho_t^{\text{on}})$ samples are drawn from the offline replay buffer and $B\cdot\rho_t^{\text{on}}$ samples from the online replay buffer
    - $B$ - batch size
    - $\rho_t^{\text{on}}$ - fraction of online samples

$\rho_t^{\text{on}} = \rho_0^{\text{on}} + (1-\rho_0^{\text{on}})\cdot\dfrac{\min(t,\,t_{\text{final}})}{t_{\text{final}}}$

- $t_{\text{final}}$ - final timestep of the annealing schedule
- $\rho_0^{\text{on}}$ - initial fraction of online samples

Effect
- Better Q-function updates with a wide distribution of both offline and online samples
- Eventually exploits the online samples once enough of them have been gathered

(A sampling sketch using the hyperparameters reported in the Experiments section appears at the end of these notes.)

### Ensemble of offline RL agents for online fine-tuning

- During fine-tuning, each individual Q-function may be inaccurate due to bootstrapping error from unfamiliar online samples
- Consider an ensemble of $N$ pre-trained CQL agents
- The ensemble of independent policies is distilled into a single policy by minimizing the following before online interaction

$\mathcal L_{\text{distill}}^{\text{pd}}(\phi_{\text{pd}}) = \Bbb E_{s_t\sim\mathcal D}\big[\|\mu_{\phi_{\text{pd}}}(s_t) - \hat\mu(s_t)\|^2 + \|\sigma_{\phi_{\text{pd}}}(s_t) - \hat\sigma(s_t)\|^2\big]$

where

$\hat\mu(s_t) = \dfrac 1N\displaystyle\sum_{i=1}^{N}\mu_{\phi_i}(s_t)$

$\hat\sigma^2(s_t) = \dfrac 1N\displaystyle\sum_{i=1}^{N}\big(\sigma^2_{\phi_i}(s_t) + \mu^2_{\phi_i}(s_t)\big) - \hat\mu(s_t)^2$

- The distilled policy is then updated by minimizing the following

$\mathcal L_{\text{actor}}^{\text{pd}}(\phi_{\text{pd}}) = \Bbb E_{s_t\sim\mathcal B}[\mathcal L_\pi^{\text{pd}}(s_t,\phi_{\text{pd}})]$

$\mathcal L_\pi^{\text{pd}}(s_t,\phi_{\text{pd}}) = \Bbb E_{a_t\sim\pi_{\phi_{\text{pd}}}}\Big[\alpha\log\pi_{\phi_{\text{pd}}}(a_t|s_t) - \dfrac 1N\displaystyle\sum_{i=1}^{N} Q_{\theta_i}(s_t, a_t)\Big]$

- A separate target Q-function $Q_{\bar\theta_i}$ is kept for each $Q_{\theta_i}$, and each critic loss is minimized independently
    - to preserve diversity among the Q-functions
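Below is a minimal sketch of these two updates, reusing the illustrative `GaussianPolicy` and `QNetwork` classes from the CQL sketch above. The `policies`, `q_functions`, and `distilled_policy` names are assumptions; the loss terms follow the equations in these notes rather than the authors' code.

```python
import torch


def distillation_loss(distilled_policy, policies, states):
    """Match the distilled Gaussian policy to the ensemble mean and std."""
    mus, sigmas = zip(*[pi.params(states) for pi in policies])
    mus, sigmas = torch.stack(mus), torch.stack(sigmas)          # (N, B, act_dim)

    mu_hat = mus.mean(dim=0)                                     # (1/N) * sum_i mu_i(s)
    # sigma_hat^2 = (1/N) * sum_i (sigma_i^2 + mu_i^2) - mu_hat^2  (mixture second moment)
    sigma_hat = ((sigmas ** 2 + mus ** 2).mean(dim=0) - mu_hat ** 2).clamp(min=1e-8).sqrt()

    mu_pd, sigma_pd = distilled_policy.params(states)
    return ((mu_pd - mu_hat) ** 2 + (sigma_pd - sigma_hat) ** 2).sum(dim=-1).mean()


def distilled_actor_loss(distilled_policy, q_functions, states, alpha=0.2):
    """Soft policy improvement against the mean of the N Q-functions."""
    actions, log_prob = distilled_policy.sample(states)
    q_mean = torch.stack([q(states, actions) for q in q_functions]).mean(dim=0)
    return (alpha * log_prob - q_mean).mean()
```

Averaging the critics in `distilled_actor_loss` is what makes the policy update robust to error in any single Q-function, while each critic is still trained against its own target network, as noted above.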
## Related Work

Offline RL
- [CQL](https://dl.acm.org/doi/pdf/10.5555/3495724.3495824)

Online RL with offline datasets
- [Optimality of Dataset](https://arxiv.org/pdf/2006.09359.pdf)

Replay buffer
- [Hard Exploration Problem](https://arxiv.org/pdf/1707.08817.pdf)
- [Continual Learning](https://papers.nips.cc/paper/2019/file/fa7cdfad1a5aaf8370ebeda47a1ff1c3-Paper.pdf)

Ensemble methods
- Addressing the Q-function's [overestimation bias](https://papers.nips.cc/paper/2010/file/091d584fced301b442654dd8c23b3fc9-Paper.pdf)
- [Better exploration and reducing bootstrap error propagation](https://arxiv.org/pdf/2002.06487.pdf)

## Experiments

### Setups

#### Tasks and Implementation

MuJoCo locomotion tasks
- `halfcheetah`, `hopper`, `walker2d`
- D4RL benchmark

Dataset types
- `random`
- `medium`
- `medium-replay`
- `medium-expert`

Offline RL agents
- Pre-trained for 1000 epochs without early stopping
- $N = 5$
- $\rho_0^{\text{on}} \in \{0.5, 0.75\}$
- $t_{\text{final}} = 125K$
- Mean and standard deviation reported across four runs over 250K online timesteps

#### Baselines

- Advantage-Weighted Actor Critic (AWAC)
    - actor-critic scheme for fine-tuning
    - policy is trained to imitate actions with high advantage estimates
    - comparing BRED to AWAC shows the benefit of exploiting the generalization ability of the Q-function for policy learning
- Batch-Constrained deep Q-learning (BCQ-ft)
    - offline RL method
    - updates the policy by modeling the data-generating policy with a conditional [VAE](https://papers.nips.cc/paper/2015/file/8d55a249e6baa5c06772297520da2051-Paper.pdf)
- CQL
    - CQL-trained agent fine-tuned with SAC, i.e. the CQL regularization is excluded during fine-tuning
- SAC
    - SAC agent trained from scratch with no access to the offline dataset
    - shows the benefit of fine-tuning a pre-trained agent in terms of sample efficiency
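To close, here is a minimal sketch of the balanced replay sampling from the BRED section, plugged with the reported settings ($\rho_0^{\text{on}} = 0.75$, $t_{\text{final}} = 125K$). The buffer class, the batch size of 256, and sampling with replacement are illustrative assumptions, not details taken from the paper.

```python
import random


def online_fraction(t, rho_0=0.75, t_final=125_000):
    # rho_t = rho_0 + (1 - rho_0) * min(t, t_final) / t_final, annealed from rho_0 to 1.
    return rho_0 + (1.0 - rho_0) * min(t, t_final) / t_final


def sample_balanced_batch(offline_buffer, online_buffer, t, batch_size=256):
    # B * rho_t samples from the online buffer, B * (1 - rho_t) from the offline buffer.
    rho_t = online_fraction(t)
    n_online = round(batch_size * rho_t)
    n_offline = batch_size - n_online
    return online_buffer.sample(n_online) + offline_buffer.sample(n_offline)


class ReplayBuffer:
    """Illustrative list-backed buffer; sampling is with replacement for brevity."""
    def __init__(self):
        self.storage = []

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, n):
        return random.choices(self.storage, k=n) if n > 0 else []
```

With $\rho_0^{\text{on}} = 0.75$, a quarter of each batch still comes from the offline buffer at $t = 0$; after $t_{\text{final}}$ the batch is drawn from online samples only.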