Progress_11/28
===
## Ways to prevent extrapolation errors
- What is **extrapolation error**?
1. The estimated values of actions that the policy might select but that are <font color="#3338FF">not contained in the training dataset</font> tend to be <font color="#FF0000">overestimated</font>.
2. This in turn corrupts policy improvement: the agent learns to prefer OOD actions whose values have been overestimated, which results in poor performance (the Bellman backup below makes this concrete).
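The overestimation propagates through the standard Q-learning Bellman backup (written here in its generic form, not tied to any particular method):

$$
Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')
$$

The maximization over $a'$ is free to pick actions never observed in the dataset, so their erroneous Q estimates leak into the targets and compound over iterations.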
- **Behavior regularization** is a prominent approach to offline RL that aims to address this problem by using appropriate regularizers to <font color="#3338FF">compel the learned policy to stay close to the data</font>.
- There are two common ways to incorporate behavior regularization into the actor-critic framework – via <font color="#FF0000">policy regularization</font> or via a <font color="#FF0000">critic penalty</font>.
- **Policy Regularization / Policy Constraints**
1. [Batch-Constrained deep Q-Learning (BCQ)](https://arxiv.org/pdf/1812.02900.pdf)
-- Constrain the actions the agent can select by using a VAE to generate actions similar to those in the training data.
-- A perturbation model, ξ, is used to enhance the diversity of the generated actions (see the sketch below).
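A minimal sketch of BCQ-style action selection in PyTorch, assuming a pretrained VAE decoder is available as a `state -> action` callable; the class names, layer sizes, candidate count, and the clipping constant `phi` are illustrative assumptions, not the authors' implementation:

```python
# Sketch only: illustrative networks and constants, not the official BCQ code.
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class Perturbation(nn.Module):
    """Perturbation model xi: adds a small bounded correction to a VAE action."""
    def __init__(self, state_dim, action_dim, max_action, phi=0.05):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action, self.phi = max_action, phi

    def forward(self, state, action):
        # The correction is clipped to [-phi, phi] * max_action, so the final
        # action stays close to what the VAE (behavior-like) decoder produced.
        delta = self.phi * self.max_action * self.net(torch.cat([state, action], dim=-1))
        return (action + delta).clamp(-self.max_action, self.max_action)

def select_action(state, vae_decoder, perturb, q_net, n_candidates=10):
    """Pick the highest-valued action among VAE-generated, perturbed candidates."""
    with torch.no_grad():
        states = state.repeat(n_candidates, 1)        # state: shape (1, state_dim)
        candidates = vae_decoder(states)              # actions resembling the dataset
        candidates = perturb(states, candidates)      # xi adds bounded diversity
        q_values = q_net(states, candidates)          # evaluate each candidate
        return candidates[q_values.argmax()]          # greedy over the candidate set
```

The key design point is that the Q-network only ever evaluates (perturbed) VAE samples, so the greedy maximization is restricted to actions the behavior policy could plausibly have produced.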


- **Critic Penalty**
1. [Conservative Q-Learning (CQL)](https://proceedings.neurips.cc/paper/2020/file/0d2b2061826a5df3221116a5085a6052-Paper.pdf)
-- Add a regularization term to the Bellman error so that the learned Q-value is a lower bound of the true Q-value (see the sketch below).
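A minimal sketch of a CQL-style critic loss for a discrete-action Q-network in PyTorch; the weight `alpha`, the network interfaces, and the batch layout are assumptions for illustration rather than the paper's reference code:

```python
# Sketch only: discrete-action case, illustrative hyperparameters.
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    # batch: states, dataset actions (LongTensor), rewards, next states, done flags (float)
    s, a, r, s_next, done = batch

    q_all = q_net(s)                                     # Q(s, .) for all actions
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q on dataset actions

    # Standard Bellman error against a target network.
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values
    bellman_error = F.mse_loss(q_data, target)

    # Conservative penalty: log-sum-exp pushes down Q over all actions (a soft
    # maximum), while the -q_data term pushes Q back up on dataset actions.
    conservative_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return bellman_error + alpha * conservative_penalty
```

Pushing down on all actions while pushing up on dataset actions is what drives the learned Q toward a lower bound on the true value for out-of-distribution actions.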

- **Approximation**
