# VALUE GRADIENT WEIGHTED MODEL-BASED REINFORCEMENT LEARNING
2023/2/14
###### tags: `RL Group meeting`

**Published as a conference paper at ICLR 2022**

## Outline
- Abstract
- Introduction
- Background
- Value-Gradient weighted Model loss (VaGraM)
- Experiment: Model Learning In Low-Dimensional Problem
- Experiment: Model-Based Continuous Control
- Conclusion

## Abstract
- Model-based reinforcement learning (MBRL) is a sample-efficient technique to obtain control policies, yet unavoidable modeling errors often lead to performance deterioration.
- The Value-Gradient weighted Model loss (VaGraM) improves performance in settings where MBRL typically struggles, such as **small model capacity** and **the presence of distracting state dimensions**.
- We analyze both MLE and value-aware approaches and demonstrate how they fail to account for sample coverage and the behavior of function approximation when learning value-aware models.

## Introduction
- **MBRL** splits the control problem into two interleaved stages:
    - In the model **learning stage**, an approximate model of the environment is learned.
    - The model is then used in the **planning stage** to generate new experience without having to query the original environment.
- The accuracy of the model directly influences the quality of the learned policy or plan.
- Function approximation in model and value learning is inherently limited:
    - The model cannot capture the full distribution over dynamics functions perfectly, and only finite datasets are available.
- Most MBRL methods use maximum likelihood estimation (**MLE**) to learn a parametric model of the environment, without involving any information from the planning process.
- We present the Value-Gradient weighted Model loss (**VaGraM**), which **rescales the mean squared error loss function with gradient information** from the current value function estimate.
- We analyze the optimization behavior of the Value-Aware Model Learning (VAML) framework.
- We show how the VaGraM loss impacts the resulting state and value prediction accuracy.

## Background
![](https://i.imgur.com/Di0dBlG.png)
![](https://i.imgur.com/IKuuL0o.png)

The reward function is either known or learned by mean squared error minimization.

### 1. Model-Based Reinforcement Learning
- $\hat{p}$: trained from data to represent the unknown transition function $p$.
- ‘model’: refers to the learned approximation.
- ‘environment’: refers to the unknown MDP transition function.
- Given a dataset $\mathcal{D}$, the model is fit by maximum likelihood:

$$
\theta^*=\arg \max _\theta \sum_{i=1}^N \log \hat{p}_\theta\left(s_i^{\prime}, r_i \mid s_i, a_i\right)
$$

### 2. Key Insight: Model Mismatch Problem
- Model errors propagate and compound when the model is used for planning.
- The impact depends on the size of the error and the local behavior of the value function.
- We can motivate the use of MLE as a loss function by an upper bound:

$$
\sup _{V \in \mathcal{F}}|\langle p-\hat{p}, V\rangle| \leq \|p-\hat{p}\|_1 \sup _{V \in \mathcal{F}}\|V\|_{\infty} \leq \sqrt{2 \operatorname{KL}(p \,\|\, \hat{p})} \sup _{V \in \mathcal{F}}\|V\|_{\infty}
$$

- This bound is loose and does not account for the geometry of the problem’s value function.

### 3. Value-Aware Model Learning
- The idea of VAML is to penalize a model prediction by the resulting difference in the value function.

![](https://i.imgur.com/T0alDAK.png)
![](https://i.imgur.com/WddrMp5.png)

- The difficulty with this approach is that it relies on the value function, which is not known a priori while learning the model.
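To make the contrast concrete, here is a minimal PyTorch-style sketch of the MLE objective above versus a value-aware loss in which the unknown value function is replaced by a fixed current estimate (the iterative scheme introduced next). All names (`model`, `value_fn`, the batch tensors) are illustrative assumptions, not the authors' code.

```python
import torch

def mle_loss(model, s, a, s_next):
    # Standard MLE objective: negative log-likelihood of the observed next
    # state under a Gaussian model p_hat(s' | s, a) = N(mean, std^2).
    mean, log_std = model(s, a)
    dist = torch.distributions.Normal(mean, log_std.exp())
    return -dist.log_prob(s_next).sum(dim=-1).mean()

def value_aware_loss(model, value_fn, s, a, s_next):
    # Value-aware objective with the supremum replaced by the current value
    # estimate: penalize the value difference induced by the model error.
    mean, _ = model(s, a)                      # predicted next state
    with torch.no_grad():
        v_data = value_fn(s_next)              # V at the observed next state
    v_model = value_fn(mean)                   # V at the predicted next state
    return ((v_model - v_data) ** 2).mean()
```

Note that `value_fn(mean)` evaluates the value function at model predictions that may lie far outside the data the value function was trained on; this is exactly the issue the VaGraM section below addresses.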
- Farahmand introduced a modification of VAML called Iterative Value-Aware Model Learning (**IterVAML**), where the supremum is replaced with the current estimate of the value function.
- In each iteration, the value function is updated based on the model, and the model is trained using a loss based on the previous iteration's value function.

## Value-Gradient weighted Model loss (VaGraM)
Two issues arise when using IterVAML in practice:
1. Value function evaluation outside of the empirical state-action distribution.
2. Suboptimal local minima of the loss.

- We expect that the updated model loss forces the model prediction to a new solution, but due to the non-convex nature of the VAML loss, the model can get stuck or even diverge.

### 1. Approximating a value-aware loss with the value function gradient
![](https://i.imgur.com/QYkWcaq.png)
- All $s'_i$ are in the dataset the value function is trained on, so the value function is only evaluated at observed data points; this solves the first problem with the VAML paradigm.

$$
\sum_i\left(\left(\left.\nabla_s V(s)\right|_{s_i^{\prime}}\right)^{\top}\left(f_\theta\left(s_i, a_i\right)-s_i^{\prime}\right)\right)^2
$$

- The gradient vector can be interpreted as a measure of the sensitivity of the value function at each data point and dimension.

### 2. Preventing spurious local minima
- Apply the Cauchy-Schwarz inequality:

![](https://i.imgur.com/Ys7WFVj.png)

- This reformulation is equivalent to a mean squared error loss function with a per-sample diagonal scaling matrix (a code sketch of both objectives is given after the continuous-control experiments below).

![](https://i.imgur.com/3PppIz3.png)

- The VAML loss has a complicated shape that depends on the exact values of the value function, while both MSE and our proposal have a paraboloid shape.
- Compared to MSE, our proposed loss function is rescaled to account for the larger gradient of the value function along the $\theta$ axis.

![](https://i.imgur.com/g0reVZl.png)

## Experiment: Model Learning In Low-Dimensional Problem
- We compare the performance of VaGraM with both MSE and VAML on a pedagogical environment with a small state space and smooth dynamics to gain qualitative insight into the loss surfaces.
- We investigate the model losses without model-based value function learning.
- We use the SAC algorithm in a model-free setup to estimate the value function.

![](https://i.imgur.com/2e6oqZK.png)

- In the linear setting, VAML achieves the lowest VAML error, while VaGraM significantly outperforms MSE.
- In the NN setting, VAML diverges rapidly, while VaGraM and MSE converge to approximately the same solution.
- With flexible function approximation, the VAML loss converges in the first iteration with the given value function, but then rapidly diverges once the value function is updated.
- VaGraM remains stable even with flexible function approximation and achieves a lower VAML error than the MSE baseline.
- Unlike VAML, VaGraM converges to a single solution instead of getting stuck in spurious local minima.

## Experiment: Model-Based Continuous Control
- To test whether our loss function is superior to a maximum likelihood approach under reduced model capacity and distracting observations, we used the Hopper environment.

### 1. Hopper With Reduced Model Capacity
![](https://i.imgur.com/Gx0xsXD.png)
- When reducing the model size, the maximum likelihood models quickly lose performance, completely failing to even stabilize the Hopper for a short period in the smallest setting, while VaGraM retains almost its original performance.

### 2. Hopper With Distracting Dimensions
![](https://i.imgur.com/lh8bNp0.png)
- When increasing the number of distracting dimensions, the performance of the MLE model deteriorates, as more and more of its capacity is used to model the added dynamics.
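For reference, here is a minimal PyTorch-style sketch of the two VaGraM objectives derived above: the first-order value-gradient inner-product loss, and its Cauchy-Schwarz upper bound, which is an MSE with a per-sample diagonal scaling given by the squared value gradient at the observed next states. Names (`model`, `value_fn`) are illustrative assumptions, not the authors' implementation.

```python
import torch

def value_gradient(value_fn, s_next):
    # Gradient of V evaluated only at the *observed* next states, so the value
    # function is never queried outside the data it was trained on.
    s = s_next.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(value_fn(s).sum(), s)
    return grad.detach()                               # shape (batch, state_dim)

def vagram_inner_product_loss(model, value_fn, s, a, s_next):
    # First-order approximation: ((grad V(s'_i))^T (f_theta(s_i, a_i) - s'_i))^2
    g = value_gradient(value_fn, s_next)
    mean, _ = model(s, a)
    diff = mean - s_next
    return ((g * diff).sum(dim=-1) ** 2).mean()

def vagram_upper_bound_loss(model, value_fn, s, a, s_next):
    # Cauchy-Schwarz upper bound: MSE with a per-sample diagonal scaling by the
    # squared value gradient.
    g = value_gradient(value_fn, s_next)
    mean, _ = model(s, a)
    diff = mean - s_next
    return ((g ** 2) * diff ** 2).sum(dim=-1).mean()
```

The upper-bound form corresponds to the "preventing spurious local minima" reformulation: the plain inner product is zero whenever the prediction error is orthogonal to the value gradient, regardless of how large that error is, while the diagonally-weighted MSE does not admit such solutions.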
**VaGraM is able to deal with challenging distractions and reduced model capacity significantly better than an MLE baseline.**

## Conclusion
- We presented VaGraM, a loss function for training models that approximate the dynamics where it matters for the control problem.
- We highlighted how VaGraM counters the issues of previous value-aware losses and showed the increased stability of the training procedure when using our loss in a pedagogical environment.
- In future work, we seek to scale our loss function to image-based RL.
- We also seek to derive a related value-aware approach for partially observable domains that can take the state inference problem into account.

[Reference](https://openreview.net/pdf?id=4-D6CZkRXxI)