Progress_11/28
===

## Ways to prevent extrapolation error

- What is **extrapolation error**?
    1. The estimated value of actions that the policy might select but that are <font color="#3338FF">not contained in the training dataset</font> tends to be <font color="#FF0000">overestimated</font>.
    2. This corrupts policy improvement: the agent learns to prefer OOD actions whose values have been overestimated, which results in poor performance.
- **Behavior regularization** is a prominent approach to offline RL that aims to address this problem by using appropriate regularizers to <font color="#3338FF">compel the learned policy to stay close to the data</font>.
- There are two common ways to incorporate behavior regularization into the actor-critic framework: via <font color="#FF0000">policy regularization</font> or via a <font color="#FF0000">critic penalty</font>.
- **Policy Regularization / Policy Constraints**
    1. [Batch-Constrained deep Q-Learning (BCQ)](https://arxiv.org/pdf/1812.02900.pdf)
        -- Constrains the actions the agent can select by using a VAE to generate actions similar to those in the training data.
        -- A perturbation model ξ is used to enhance the diversity of the generated actions (see the action-selection sketch at the end of this note).
        ![](https://i.imgur.com/bTzyMV0.png)
        ![](https://i.imgur.com/S1gse3G.png)
- **Critic Penalty**
    1. [Conservative Q-Learning (CQL)](https://proceedings.neurips.cc/paper/2020/file/0d2b2061826a5df3221116a5085a6052-Paper.pdf)
        -- Adds a regularization term to the Bellman error so that the learned Q-value is a lower bound of the true Q-value (see the critic-loss sketch at the end of this note).
        ![](https://i.imgur.com/9YydPLb.png)
- **Approximation**
    ![](https://i.imgur.com/Jkin8FI.png)
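
To make the two approaches concrete, here is a minimal sketch of BCQ-style action selection, assuming pre-trained components; `vae_decode(states, z)`, `perturb(states, actions)`, and `q_net(states, actions)` are placeholder names I chose, not code from the paper:

```python
# Minimal sketch of BCQ-style action selection (not the authors' code).
import torch

def bcq_select_action(state, vae_decode, perturb, q_net,
                      latent_dim, n_samples=10, phi=0.05):
    """Pick the highest-valued action among VAE-generated candidates."""
    # Score every candidate action against the same state.
    states = state.unsqueeze(0).repeat(n_samples, 1)           # (n, state_dim)

    # 1. Generate candidate actions close to the behavior data with the VAE
    #    decoder (the latent is clipped to keep samples near the data).
    z = torch.randn(n_samples, latent_dim).clamp(-0.5, 0.5)
    candidates = vae_decode(states, z)                         # (n, action_dim)

    # 2. Perturb each candidate with xi; tanh bounds the shift to [-phi, phi].
    candidates = candidates + phi * torch.tanh(perturb(states, candidates))

    # 3. Greedy selection with respect to the learned Q-function.
    q_values = q_net(states, candidates).squeeze(-1)           # (n,)
    return candidates[q_values.argmax()]
```

And a minimal sketch of a CQL(H)-style critic loss for a discrete-action Q-network, where `alpha` weights the conservative penalty; this is my paraphrase of the penalty term, not the authors' implementation:

```python
# Minimal sketch of a CQL(H)-style critic loss for discrete actions.
import torch
import torch.nn.functional as F

def cql_critic_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """Bellman error plus a penalty that pushes Q down on all actions
    while pushing it up on actions actually present in the dataset."""
    s, a, r, s_next, done = batch   # a: (B,) tensor of action indices

    # Standard TD target from the target network.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values

    q_all = q_net(s)                                      # (B, num_actions)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) on dataset actions

    bellman_error = F.mse_loss(q_data, target)

    # CQL(H) regularizer: logsumexp over actions minus the dataset-action Q.
    # Minimizing it drives the learned Q toward a lower bound of the true Q.
    conservative_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return bellman_error + alpha * conservative_penalty
```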