VALUE GRADIENT WEIGHTED MODEL-BASED REINFORCEMENT LEARNING
2023/2/14
Published as a conference paper at ICLR 2022
Outline
- Abstract
- Introduction
- Background
- Value-Gradient weighted Model loss (VaGraM)
- Experiment: Model Learning In Low-Dimensional Problem
- Experiment: Model-Based Continuous Control
- Conclusion
Abstract
- Model-based reinforcement learning (MBRL) is a sample-efficient technique to obtain control policies, yet unavoidable modeling errors often lead to performance deterioration.
- The Value-Gradient weighted Model loss (VaGraM) improves performance in challenging settings such as small model capacity and the presence of distracting state dimensions.
- We analyze both MLE and value-aware approaches and demonstrate how they fail to account for sample coverage and the behavior of function approximation when learning value-aware models.
Introduction
- MBRL splits the control problem into two interleaved stages (see the sketch after this list):
- In the model learning stage, an approximate model of the environment is learned from data.
- The model is then used in the planning stage to generate new experience without having to query the original environment.
- The accuracy of the model directly influences the quality of the learned policy or plan.
- Function approximation in model and value learning is limited: the model class cannot capture the true dynamics perfectly, and only finite datasets are available.
- Most MBRL algorithms use maximum likelihood estimation (MLE) to learn a parametric model of the environment, without involving information from the planning process.
- We present the Value-Gradient weighted Model loss (VaGraM) which rescales the mean squared error loss function with gradient information from the current value function estimate.
- We analyze the optimization behavior of the Value-Aware Model Learning (VAML) framework.
- We study how the VaGraM loss impacts the resulting state and value prediction accuracy.
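A minimal Dyna-style sketch of this interleaved loop (illustrative only; the callables below are hypothetical placeholders, not the algorithm or code used in the paper):

```python
# Minimal Dyna-style MBRL loop (illustrative sketch; all callables are
# hypothetical placeholders, not the paper's implementation).

def mbrl_loop(collect_real_data, fit_model, model_step, policy, update_agent,
              n_iterations=100, n_branch_points=50, rollout_len=5):
    replay = []  # real transitions (s, a, r, s_next) from the true environment
    for _ in range(n_iterations):
        # Interact with the real environment and store the data.
        replay.extend(collect_real_data(policy))

        # Model learning stage: fit an approximate dynamics model to the data,
        # e.g. by maximum likelihood or by a value-aware loss such as VaGraM.
        fit_model(replay)

        # Planning stage: branch short imagined rollouts from observed states
        # and update the value function / policy on model-generated experience,
        # without querying the real environment again.
        for s, _, _, _ in replay[-n_branch_points:]:
            state = s
            for _ in range(rollout_len):
                action = policy(state)
                next_state, reward = model_step(state, action)
                update_agent(state, action, reward, next_state)
                state = next_state
    return policy
```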
Background
The reward function is either known or learned by mean squared error minimization.
1. Model-Based Reinforcement Learning
- A parametric model is trained from data to represent the unknown transition function p.
- ‘model’ refers to the learned approximation.
- ‘environment’ refers to the unknown MDP transition function.
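The defining equations originally shown here were images that did not survive export; as a stand-in, the standard notation (a reconstruction consistent with the text, not copied from the paper) is roughly:

$$
\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma), \qquad
\hat{p}_\theta(s' \mid s, a) \approx p(s' \mid s, a),
$$

$$
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\ s' \sim p(\cdot \mid s, a)}
\big[ r(s, a) + \gamma V^{\pi}(s') \big],
$$

where $\hat{p}_\theta$ is the learned model and $V^{\pi}$ is the value function of policy $\pi$.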
2. Key Insight: Model Mismatch Problem
- Model errors propagate and compound when the model is used for planning.
- The impact of these errors depends on the size of the error and the local behavior of the value function.
- The use of MLE as a model loss can be motivated by an upper bound on the value error (a reconstructed sketch of the argument follows below).
- This bound is loose, however, and does not account for the geometry of the problem’s value function.
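A reconstructed sketch of the standard argument behind this bound (via Hölder's and Pinsker's inequalities; the exact form in the paper may differ):

$$
\Big| \mathbb{E}_{s' \sim \hat{p}(\cdot \mid s,a)}\big[V(s')\big] - \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[V(s')\big] \Big|
\;\le\; \|V\|_{\infty} \big\| \hat{p}(\cdot \mid s,a) - p(\cdot \mid s,a) \big\|_{1}
\;\le\; \|V\|_{\infty} \sqrt{2\, D_{\mathrm{KL}}\big(p(\cdot \mid s,a) \,\|\, \hat{p}(\cdot \mid s,a)\big)}.
$$

Minimizing the KL divergence (i.e., MLE) shrinks this upper bound, but the bound only depends on $\|V\|_{\infty}$ and ignores where the value function is actually sensitive.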
3. Value-Aware Model Learning
- The idea of VAML is to penalize a model prediction by the resulting difference in the value function.
- A difficulty of this approach is that it relies on the value function, which is not known a priori while learning the model.
- Farahmand introduced a modification of VAML called Iterative Value-Aware Model Learning (IterVAML), where the supremum is replaced with the current estimate of the value function.
- In each iteration, the value function is updated based on the model, and the model is trained using the loss function based on the last iteration’s value function.
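The loss definitions originally shown here as images are missing; reconstructed sketches of the two objectives (notation may deviate slightly from the paper's) are:

$$
\mathcal{L}_{\mathrm{VAML}}(\hat{p}) =
\mathbb{E}_{(s,a) \sim \mathcal{D}}
\left[ \sup_{V \in \mathcal{F}}
\Big| \int \big( p(s' \mid s,a) - \hat{p}(s' \mid s,a) \big)\, V(s')\, \mathrm{d}s' \Big|^{2} \right],
$$

$$
\mathcal{L}_{\mathrm{IterVAML}}(\hat{p}) =
\sum_{(s,a,s') \in \mathcal{D}}
\Big( \mathbb{E}_{\hat{s}' \sim \hat{p}(\cdot \mid s,a)}\big[\hat{V}(\hat{s}')\big] - \hat{V}(s') \Big)^{2},
$$

where $\mathcal{F}$ is the value function class and $\hat{V}$ is the value estimate from the last iteration.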
Value-Gradient weighted Model loss (VaGraM)
- Two issues prevent the direct use of IterVAML in practice: (1) value function evaluation outside of the empirical state-action distribution, and (2) suboptimal local minima of the loss.
- We expect the updated model loss to force the model prediction to a new solution, but due to the non-convex nature of the VAML loss, the model can get stuck or even diverge.
1. Approximating a value-aware loss with the value function gradient
- The value gradient is evaluated only at observed next states, which are all in the dataset the value function is trained on; this solves the first problem with the VAML paradigm.
- This gradient vector can be interpreted as a measure of the sensitivity of the value function at each data point and dimension.
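The approximation originally shown as an image is, in essence, a first-order Taylor expansion of the value function around the observed next state (a reconstructed sketch, assuming a deterministic model prediction $\hat{s}' = \hat{f}_\theta(s,a)$):

$$
\hat{V}(\hat{s}') \;\approx\; \hat{V}(s') + \nabla_{s} \hat{V}(s)\big|_{s=s'}^{\top} (\hat{s}' - s')
\quad\Longrightarrow\quad
\big( \hat{V}(\hat{s}') - \hat{V}(s') \big)^{2} \;\approx\;
\Big( \nabla_{s} \hat{V}(s)\big|_{s=s'}^{\top} (\hat{s}' - s') \Big)^{2}.
$$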
2. Preventing spurious local minima
- To prevent these minima, we apply the Cauchy-Schwarz inequality (sketched below).
- This reformulation is equivalent to a mean squared error loss function with a per-sample diagonal scaling matrix.
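A reconstructed sketch of this step (the equation here was originally an image; the exact constants may differ from the paper): applying the Cauchy-Schwarz inequality to the Taylor-approximated loss removes the spurious minima in directions orthogonal to the value gradient,

$$
\Big( \nabla_{s} \hat{V}(s')^{\top} (\hat{s}' - s') \Big)^{2}
\;\le\; d \sum_{j=1}^{d} \Big( \tfrac{\partial \hat{V}}{\partial s_{j}}\Big|_{s'} \Big)^{2} (\hat{s}'_{j} - s'_{j})^{2}
\;=\; d \, \big\| \nabla_{s} \hat{V}(s') \odot (\hat{s}' - s') \big\|_{2}^{2},
$$

whose right-hand side is a squared error with a per-sample diagonal weight $\mathrm{diag}\big(\nabla_{s}\hat{V}(s') \odot \nabla_{s}\hat{V}(s')\big)$ and whose only minimum is at $\hat{s}' = s'$.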
- The VAML loss has a complicated shape that depends on the exact values of the value function, while both MSE and our proposal have a paraboloid shape.
- Compared to MSE, our proposed loss is rescaled to account for the larger gradient of the value function along certain state dimensions (a minimal implementation sketch follows).
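A minimal sketch of how such a value-gradient weighted squared error could be computed (hypothetical PyTorch code assuming a deterministic model and a differentiable value network; not the authors' released implementation):

```python
import torch

def vagram_loss(model, value_fn, states, actions, next_states):
    """Value-gradient weighted squared error (illustrative sketch).

    model:     callable mapping (states, actions) -> predicted next states
    value_fn:  current value function estimate V(s)
    states, actions, next_states: batched tensors of real transitions
    """
    # Sensitivity of the value function at the observed next states.
    s_next = next_states.clone().requires_grad_(True)
    grad_v = torch.autograd.grad(value_fn(s_next).sum(), s_next)[0]
    grad_v = grad_v.detach()  # used as weights, not as a training signal

    # Per-sample, per-dimension rescaled squared error (diagonal scaling).
    pred_next = model(states, actions)
    weighted_err = grad_v * (pred_next - next_states)
    return (weighted_err ** 2).sum(dim=-1).mean()

def mse_loss(model, states, actions, next_states):
    # Plain MSE baseline: every state dimension is weighted equally.
    return ((model(states, actions) - next_states) ** 2).sum(dim=-1).mean()
```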
Experiment: Model Learning In Low-Dimensional Problem
- We compare the performance of VaGraM with both MSE and VAML on a pedagogical environment with a small state space and smooth dynamics to gain qualitative insight into the loss surfaces.
- We investigate the model losses without model-based value function learning.
- We used the SAC algorithm in a model-free setup to estimate the value function.
- In the linear setting, VAML achieves the lowest VAML error, while VaGraM is able to significantly outperform MSE.
- In the NN setting, VAML diverges rapidly, while VaGraM and MSE converge to approximately the same solution.
- Using a flexible function approximation, the VAML loss converges in the first iteration with the given value function, but then rapidly diverges once the value function is updated.
- VaGraM remains stable even with flexible function approximation and achieves a lower VAML error than the MSE baseline.
- Single solution convergence.
Experiment: Model-Based Continuous Control
- To test whether our loss function is superior to a maximum likelihood approach under reduced model capacity and distracting state dimensions, we used the Hopper environment.
1. Hopper With Reduced Model Capacity
- When reducing the model size, the maximum likelihood models quickly lose performance, completely failing to even stabilize the Hopper for a short period in the smallest setting, while VaGraM retains almost its original performance.
2. Hopper With Distracting Dimensions
- When increasing the number of dimensions, the performance of the MLE model deteriorates, as more and more of its capacity is used to model the added dynamics.
- VaGraM is able to deal with challenging distractions and reduced model capacity significantly better than an MLE baseline.
Conclusion
- We presented VaGraM, a loss function that trains models to fit the dynamics where it matters for the control problem.
- We analyzed the issues that prevent value-aware losses from working with function approximation, highlighted how VaGraM counters these issues, and showed the increased stability of the training procedure when using our loss in a pedagogical environment.
- In future work, we seek to scale our loss function to image-based RL.
- We also seek to derive a related value-aware approach for partially observable domains that can take the state inference problem into account.
Reference
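- C. Voelcker, V. Liao, A. Garg, A.-m. Farahmand. Value Gradient weighted Model-Based Reinforcement Learning. ICLR 2022.
- A.-m. Farahmand. Iterative Value-Aware Model Learning. NeurIPS 2018.
- A.-m. Farahmand, A. Barreto, D. Nikovski. Value-Aware Loss Function for Model-based Reinforcement Learning. AISTATS 2017.
- T. Haarnoja, A. Zhou, P. Abbeel, S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018.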