# Notes on "[The Value Equivalence Principle for Model-Based Reinforcement Learning](https://proceedings.neurips.cc/paper/2020/file/3bb585ea00014b0e3ebe4c6dd165a358-Paper.pdf)"

###### tags: `MBRL` `Value-Equivalence`

[Reproduced results](https://github.com/RajGhugare19/VE-principle-for-model-based-RL)

Details about the results, ablations, methodology, and hyper-parameters can be found in the following report: [Report on reproducibility](https://openreview.net/forum?id=IU5y7hIIZqS)

## Introduction:

In this paper the authors argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning, rather than to predict the transition probabilities as accurately as possible. Their major contribution is the following theorem.

Equivalence theorem: two models are value equivalent with respect to a set of functions and policies if they yield the same Bellman updates.

Truly intelligent agents should learn a model of the environment to allow fast re-planning. The authors are essentially saying that it is better to build the model according to its future use. For a given set of functions (value functions, I suppose) and policies, they set requirements on how accurate the model really needs to be. They further show that as this set (of policies and functions) grows, the space of prospective models shrinks down to the single true model of the MDP.

## Background:

1) A standard MDP and the usual notation are assumed.
2) $\Pi$ is the set of all possible policies (an uncountable set even for finite MDPs).
3) The agent's goal is to find the policy which has the maximum value for all states.
4) In model-based RL, the agent learns approximations of the transition probabilities and reward function and uses them for policy evaluation and improvement.

## Value Equivalence:

Given a state space and an action space we can define a model $m = (r, p)$. A model together with a discount factor is enough to define Bellman updates. Value equivalence aims to approximate models based on the policies and functions to which the agent will apply them.

Let $\Pi$ be a set of policies and $V$ be a set of functions. Given a discount factor $\gamma$, two models $m, m'$ are said to be value equivalent if for all $\pi \in \Pi$ and $v \in V$, $T_{\pi}^{m}v = T_{\pi}^{m'}v$. (A minimal sketch of this check for tabular models is included at the end of these notes.)

### Space of Value equivalent models:

Given a model $m$, the set of all models which are value equivalent to $m$ is called the space of value equivalent models. $\Pi$ and $V$ are pre-defined. $m^{*}$ is the true model of the environment. $M(\Pi, V)$ is the space of models which are value equivalent to $m^{*}$ for all policies and functions belonging to $\Pi$ and $V$. This is usually the only set we are interested in.

The reproducibility study can be divided into three types of experiments:

- $span(V) \approx V$ and finite state space.
- $span(V) \approx V$ and finite state space, using linear function approximation.
- $span(V) \approx V$ and infinite state space, using neural networks.

### Conclusion:

The value equivalence principle is shown to work better than MLE-based models in constrained model spaces on the simple cart-pole environment. The theory of value-equivalent models is promising and complete, but its application to creating a different class of model-based algorithms remains unclear.
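
### Sketch: checking value equivalence for tabular models

The definition above can be verified directly for small, tabular MDPs. Below is a minimal NumPy sketch (not from the paper or the linked repository; the function names and array shapes are my own assumptions) that tests whether two tabular models yield the same Bellman update for every policy and function in the given sets.

```python
# A minimal sketch, assuming tabular models m = (r, p) with
#   r: (S, A) expected rewards, p: (S, A, S) transition probabilities.
# Function names and shapes are illustrative, not taken from the paper or repo.
import numpy as np

def bellman_update(r, p, pi, v, gamma):
    """Apply the Bellman operator T_pi^m to v for the model m = (r, p).

    pi: (S, A) stochastic policy, pi[s, a] = probability of action a in state s.
    v:  (S,) value function.
    Returns the updated (S,) value function.
    """
    q = r + gamma * (p @ v)        # (S, A): one-step backup for every state-action pair
    return (pi * q).sum(axis=1)    # expectation over actions under pi

def value_equivalent(m1, m2, policies, functions, gamma, tol=1e-8):
    """True iff the two models yield the same Bellman updates for
    every policy in `policies` and every function in `functions`."""
    (r1, p1), (r2, p2) = m1, m2
    return all(
        np.allclose(bellman_update(r1, p1, pi, v, gamma),
                    bellman_update(r2, p2, pi, v, gamma), atol=tol)
        for pi in policies
        for v in functions
    )
```

For example, `policies` could be a list of (S, A) arrays whose rows sum to one, and `functions` a list of (S,) arrays; the check then follows the definition $T_{\pi}^{m}v = T_{\pi}^{m'}v$ element-wise, up to a numerical tolerance.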