# VALUE GRADIENT WEIGHTED MODEL-BASED REINFORCEMENT LEARNING
2023/2/14
###### tags: `RL Group meeting`
**Published as a conference paper at ICLR 2022**
## Outline
- Abstract
- Introduction
- Background
- Value-Gradient weighted Model loss (VaGraM)
- Experiment: Model Learning In Low-Dimensional Problem
- Experiment: Model-Based Continuous Control
- Conclusion
## Abstract
- Model-based reinforcement learning (MBRL) is a sample efficient technique to obtain control policies, yet unavoidable modeling errors often lead to performance deterioration.
- The Value-Gradient weighted Model loss (VaGraM) improves performance in challenging settings such as **small model capacity** and **the presence of distracting state dimensions**.
- We analyze both MLE and value-aware approaches and demonstrate how they fail to account for sample coverage and the behavior of function approximation when learning value-aware models.
## Introduction
- **MBRL** splits the control optimization problem into two interleaved stages:
    - In the model **learning stage**, an approximate model of the environment is learned.
    - The model is then utilized in the **planning stage** to generate new experience without having to query the original environment.
- The accuracy of the model directly influences the quality of the learned policy or plan.
- Function approximation limits both model and value learning algorithms.
    - The model class cannot capture the full distribution over dynamics functions perfectly, and only finite datasets are available.
- Most MBRL algorithms use maximum likelihood estimation (**MLE**) to learn a parametric model of the environment, without involving any information from the planning process.
- We present the Value-Gradient weighted Model loss (**VaGraM**), which **rescales the mean squared error loss function with gradient information** from the current value function estimate.
- We analyze the optimization behavior of the Value-Aware Model Learning (VAML) framework.
- We study how the VaGraM loss impacts the resulting state and value prediction accuracy.
## Background


The reward function is either known or learned by mean squared error minimization.
### 1. Model-Based Reinforcement Learning
- $\hat{p}$: trained from data to represent the unknown transition function $p$.
- 'model': refers to the learned approximation.
- 'environment': refers to the unknown MDP transition function.

Given a dataset $\mathcal{D}=\left\{\left(s_i, a_i, r_i, s_i^{\prime}\right)\right\}_{i=1}^N$, the model parameters are obtained by maximum likelihood estimation:
$$
\theta^*=\arg \max _\theta \sum_{i=1}^N \log \hat{p}_\theta\left(s_i^{\prime}, r_i \mid s_i, a_i\right)
$$
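As a concrete illustration, here is a minimal sketch of how this MLE objective is commonly implemented, assuming a Gaussian dynamics model parameterized by a small PyTorch network; the class and function names are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Illustrative Gaussian model p_theta(s', r | s, a)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        out_dim = state_dim + 1  # predict next state and reward
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, out_dim)
        self.log_std = nn.Linear(hidden, out_dim)

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)

def mle_loss(model, s, a, s_next, r):
    """Negative log-likelihood of (s', r); minimizing it is the MLE objective above."""
    mean, log_std = model(s, a)
    target = torch.cat([s_next, r], dim=-1)        # r has shape (batch, 1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    return -dist.log_prob(target).sum(dim=-1).mean()
```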
### 2. Key Insight: Model Mismatch Problem
- Model errors propagate and compound when the model is used for planning.
- The impact of a model error depends on its size and on the local behavior of the value function.
- We can motivate the use of MLE as a loss function by an upper bound:
$$
\sup _{V \in \mathcal{F}}|\langle p-\hat{p}, V\rangle| \leq\|p-\hat{p}\|_1 \sup _{V \in \mathcal{F}}\|V\|_{\infty} \leq \sqrt{2 \operatorname{KL}(p \| \hat{p})} \sup _{V \in \mathcal{F}}\|V\|_{\infty}
$$
- This bound is loose and does not account for the geometry of the problem’s value function.
### 3. Value-Aware Model Learning
- The core idea of VAML is to penalize a model prediction by the resulting difference in the value function:
$$
\mathcal{L}_{\mathrm{VAML}}(\hat{p})=\mathbb{E}_{(s, a)}\left[\sup _{V \in \mathcal{F}}\langle p(\cdot \mid s, a)-\hat{p}(\cdot \mid s, a), V\rangle^2\right]
$$
- A drawback of this approach is that it relies on the value function, which is not known a priori while learning the model.
- Farahmand introduced a modification of VAML called Iterative Value-Aware Model Learning (**IterVAML**), where the supremum is replaced with the current estimate of the value function.
- In each iteration, the value function is updated based on the model, and the model is trained using the loss function based on the last iteration’s value function.
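A minimal sketch of the model-update step of this iteration, assuming a deterministic model `f_theta` and a value network `value_fn` held fixed from the previous iteration (both names are placeholders for illustration):

```python
import torch

def itervaml_loss(f_theta, value_fn, s, a, s_next):
    """IterVAML-style model loss with a deterministic model: penalize the
    value difference between the predicted and the observed next state.
    value_fn is the previous iteration's estimate; its weights stay frozen."""
    s_pred = f_theta(s, a)                 # model prediction of s'
    v_pred = value_fn(s_pred)              # V at the predicted next state
    v_target = value_fn(s_next).detach()   # V(s') treated as a constant target
    return ((v_pred - v_target) ** 2).mean()
```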
## Value-Gradient weighted Model loss (VaGraM)
IterVAML suffers from two practical issues:
1. Value function evaluation outside of the empirical state-action distribution.
2. Suboptimal local minima of the loss surface.
- When the value function is updated, the model loss should force the model prediction to a new solution, but due to the non-convex nature of the VAML loss, the model can get stuck or even diverge.
### 1. Approximating a value-aware loss with the value function gradient

- To keep the value function evaluation on the data it was trained on, VaGraM replaces $V(\hat{s}')$ with its first-order Taylor expansion around the observed next state: $V(\hat{s}') \approx V(s_i') + \left.\nabla_s V(s)\right|_{s_i'}^{\top}(\hat{s}' - s_i')$.
- All $s'_i$ are in the dataset the value function is trained on, which solves the first problem of the VAML paradigm.
- Inserting the expansion into the value-aware loss yields:
$$
\sum_i\left(\left(\left.\nabla_s V(s)\right|_{s_i^{\prime}}\right)^{\top}\left(f_\theta\left(s_i, a_i\right)-s_i^{\prime}\right)\right)^2
$$
- The gradient vector $\left.\nabla_s V(s)\right|_{s_i'}$ can be interpreted as a measure of the sensitivity of the value function at each data point and in each state dimension.
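A minimal sketch of this gradient-weighted objective, assuming a deterministic PyTorch model `f_theta` and a differentiable value network `value_fn` (illustrative names, not the authors' code):

```python
import torch

def value_gradient(value_fn, s_next):
    """Gradient of the value function at each observed next state s'_i."""
    s = s_next.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(value_fn(s).sum(), s)
    return grad.detach()                          # treated as constant weights

def vagram_inner_product_loss(f_theta, value_fn, s, a, s_next):
    """First-order (Taylor) approximation of the value-aware loss:
    ((grad V(s'_i))^T (f_theta(s_i, a_i) - s'_i))^2, averaged over the batch."""
    grad_v = value_gradient(value_fn, s_next)     # (batch, state_dim)
    residual = f_theta(s, a) - s_next             # model prediction error
    return ((grad_v * residual).sum(dim=-1) ** 2).mean()
```

Note that the gradient is taken with respect to the state input only, so the value network's parameters are not updated during model training.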
### 2. Preventing spurious local minima
- Applying the Cauchy-Schwarz inequality yields an upper bound that retains the minimum at $f_\theta(s_i, a_i)=s_i^{\prime}$ and removes the spurious minima (here $\odot$ denotes element-wise multiplication):
$$
\sum_i\left\|\left.\nabla_s V(s)\right|_{s_i^{\prime}} \odot\left(f_\theta\left(s_i, a_i\right)-s_i^{\prime}\right)\right\|^2
$$
- This reformulation is equivalent to a mean squared error loss function with a per-sample diagonal scaling matrix.

- The VAML loss has a complicated shape that depends on the exact values of the value function, while both MSE and our proposal have a paraboloid shape.
- Compared to MSE, our proposed loss function is rescaled to account for the larger gradient of the value function along the $\theta$ axis.
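Reusing the `value_gradient` helper from the previous sketch, the resulting diagonal-scaled loss can be written as follows; this is one plausible implementation under the same assumptions, not the authors' reference code.

```python
def vagram_loss(f_theta, value_fn, s, a, s_next):
    """Gradient-weighted MSE: each state dimension of the squared residual is
    rescaled by the squared value-function gradient at the observed next state."""
    grad_v = value_gradient(value_fn, s_next)     # per-sample diagonal scaling
    residual = f_theta(s, a) - s_next
    return ((grad_v * residual) ** 2).sum(dim=-1).mean()
```

Compared to the inner-product form, the square is applied per dimension before summing, which is what removes the spurious minima and reduces the loss to a rescaled MSE.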

## Experiment: Model Learning In Low-Dimensional Problem
- We compare the performance of VaGraM with both MSE and VAML on a pedagogical environment with a small state space and smooth dynamics, to gain qualitative insight into the loss surfaces.
- We investigate the model losses in isolation, without model-based value function learning.
- We used the SAC algorithm in a model-free setup to estimate the value function.

- In the linear setting, VAML achieves the lowest VAML error, while VaGraM is able to significantly outperform MSE.
- In the NN setting, VAML diverges rapidly, while VaGraM and MSE converge to approximately the same solution.
- Using a flexible function approximation, the VAML loss converges in the first iteration with the given value function, but then rapidly diverges once the value function is updated.
- VaGraM remains stable even with flexible function approximation and achieves a lower VAML error than the MSE baseline.
- Convergence to a single solution (no spurious local minima).
## Experiment: Model-Based Continuous Control
- To test whether our loss function is superior to a maximum likelihood approach under reduced model capacity and distracting state dimensions, we used the Hopper environment.
### 1. Hopper With Reduced Model Capacity

- When reducing the model size, the maximum likelihood models quickly lose performance, completely failing to even stabilize the Hopper for a short period in the smallest setting, while VaGraM retains almost its original performance.
### 2. Hopper With Distracting Dimensions

- When increasing the number of dimensions, the performance of the MLE model deteriorates, as more and more of its capacity is used to model the added dynamics.
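For context, one way such distracting dimensions could be appended to the observation is sketched below as a Gymnasium wrapper; the distractor dynamics used here (a stable random linear system) and all names are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np
import gymnasium as gym

class DistractionWrapper(gym.ObservationWrapper):
    """Appends extra observation dimensions that evolve independently of the task."""
    def __init__(self, env, n_distract=15, seed=0):
        super().__init__(env)
        rng = np.random.default_rng(seed)
        A = rng.normal(size=(n_distract, n_distract))
        self.A = 0.9 * A / np.linalg.norm(A, 2)   # stable random linear dynamics
        self.z = rng.normal(size=n_distract)
        low = np.concatenate([env.observation_space.low, np.full(n_distract, -np.inf)])
        high = np.concatenate([env.observation_space.high, np.full(n_distract, np.inf)])
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float64)

    def observation(self, obs):
        self.z = self.A @ self.z                  # distractors evolve on their own
        return np.concatenate([obs, self.z])
```

Usage would look like `env = DistractionWrapper(gym.make("Hopper-v4"), n_distract=15)` before collecting data for model learning.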
**VaGraM is able to deal with challenging distractions and reduced model capacity significantly better than an MLE baseline.**
## Conclusion
- We presented VaGraM, a loss function for training models that are accurate where it matters for the control problem.
- We highlighted how VaGraM counters the issues of prior value-aware losses and showed the increased stability of the training procedure when using our loss in a pedagogical environment.
- In future work, we seek to scale our loss function to image-based RL.
- We also seek to derive a related value-aware approach for partially observable domains that can take the state inference problem into account.
[Reference](https://openreview.net/pdf?id=4-D6CZkRXxI)