Progress_11/28
===

## Ways to prevent extrapolation error

- What is **extrapolation error**?
    1. The estimated value of actions that the policy might select but that are <font color="#3338FF">not contained in the training dataset</font> tends to be <font color="#FF0000">overestimated</font>.
    2. This corrupts policy improvement: the agent learns to prefer OOD actions whose values have been overestimated, which results in poor performance.
- **Behavior regularization** is a prominent approach to offline RL that aims to address this problem by using appropriate regularizers to <font color="#3338FF">compel the learned policy to stay close to the data</font>.
- There are two common ways to incorporate behavior regularization into the actor-critic framework: via <font color="#FF0000">policy regularization</font> or via a <font color="#FF0000">critic penalty</font>.
- **Policy Regularization / Policy Constraints**
    1. [Batch-Constrained deep Q-Learning (BCQ)](https://arxiv.org/pdf/1812.02900.pdf)
        -- Constrains the actions the agent can select by using a VAE to generate actions similar to those in the training data.
        -- A perturbation model ξ is used to enhance the diversity of the generated actions (see the action-selection sketch at the end of this note).
        ![](https://i.imgur.com/bTzyMV0.png)
        ![](https://i.imgur.com/S1gse3G.png)
- **Critic Penalty**
    1. [Conservative Q-Learning (CQL)](https://proceedings.neurips.cc/paper/2020/file/0d2b2061826a5df3221116a5085a6052-Paper.pdf)
        -- Adds a regularization term to the Bellman error so that the learned Q-value is a lower bound of the true Q-value (see the critic-loss sketch at the end of this note).
        ![](https://i.imgur.com/9YydPLb.png)
- **Approximation**
    ![](https://i.imgur.com/Jkin8FI.png)
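
To make the two approaches concrete, here is a minimal sketch of BCQ-style action selection, assuming pre-trained components; `vae_decode(states, z)`, `perturb(states, actions)`, and `q_net(states, actions)` are placeholder names I chose, not code from the paper:

```python
# Minimal sketch of BCQ-style action selection (not the authors' code).
import torch

def bcq_select_action(state, vae_decode, perturb, q_net,
                      latent_dim, n_samples=10, phi=0.05):
    """Pick the highest-valued action among VAE-generated candidates."""
    # Score every candidate action against the same state.
    states = state.unsqueeze(0).repeat(n_samples, 1)           # (n, state_dim)

    # 1. Generate candidate actions close to the behavior data with the VAE
    #    decoder (the latent is clipped to keep samples near the data).
    z = torch.randn(n_samples, latent_dim).clamp(-0.5, 0.5)
    candidates = vae_decode(states, z)                         # (n, action_dim)

    # 2. Perturb each candidate with xi; tanh bounds the shift to [-phi, phi].
    candidates = candidates + phi * torch.tanh(perturb(states, candidates))

    # 3. Greedy selection with respect to the learned Q-function.
    q_values = q_net(states, candidates).squeeze(-1)           # (n,)
    return candidates[q_values.argmax()]
```

And a minimal sketch of a CQL(H)-style critic loss for a discrete-action Q-network, where `alpha` weights the conservative penalty; this is my paraphrase of the penalty term, not the authors' implementation:

```python
# Minimal sketch of a CQL(H)-style critic loss for discrete actions.
import torch
import torch.nn.functional as F

def cql_critic_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """Bellman error plus a penalty that pushes Q down on all actions
    while pushing it up on actions actually present in the dataset."""
    s, a, r, s_next, done = batch   # a: (B,) tensor of action indices

    # Standard TD target from the target network.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values

    q_all = q_net(s)                                      # (B, num_actions)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) on dataset actions

    bellman_error = F.mse_loss(q_data, target)

    # CQL(H) regularizer: logsumexp over actions minus the dataset-action Q.
    # Minimizing it drives the learned Q toward a lower bound of the true Q.
    conservative_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return bellman_error + alpha * conservative_penalty
```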