# VALUE GRADIENT WEIGHTED MODEL-BASED REINFORCEMENT LEARNING
2023/2/14
###### tags: `RL Group meeting`
**Published as a conference paper at ICLR 2022**
## Outline
- Abstract
- Introduction
- Background
- Value-Gradient weighted Model loss (VaGraM)
- Experiment: Model Learning In Low-Dimensional Problem
- Experiment: Model-Based Continuous Control
- Conclusion
## Abstract
- Model-based reinforcement learning (MBRL) is a sample efficient technique to obtain control policies, yet unavoidable modeling errors often lead to performance deterioration.
- The Value-Gradient weighted Model loss (VaGraM) improves performance in challenging settings such as **small model capacity** and **the presence of distracting state dimensions**.
- We analyze both MLE and value-aware approaches and demonstrate how they fail to account for sample coverage and the behavior of function approximation when learning value-aware models.
## Introduction
- **MBRL** splits the control optimization problem into two interleaved stages:
    - In the model **learning stage**, an approximate model of the environment is learned.
    - The model is then utilized in the **planning stage** to generate new experience without having to query the original environment.
- The accuracy of the model directly influences the quality of the learned policy or plan.
- Function approximation limits both model and value learning algorithms.
    - The model class cannot capture the full distribution over dynamics functions perfectly, and only finite datasets are available.
- Most MBRL algorithms use maximum likelihood estimation (**MLE**) to learn a parametric model of the environment, without involving any information from the planning process.
- We present the Value-Gradient weighted Model loss (**VaGraM**), which **rescales the mean squared error loss function with gradient information** from the current value function estimate.
- We analyze the optimization behavior of the Value-Aware Model Learning (VAML) framework.
- We study how the VaGraM loss impacts the resulting state and value prediction accuracy.
## Background


The reward function is either known or learned by mean squared error minimization.
### 1. Model-Based Reinforcement Learning
- $\hat{p}$: trained from data to represent the unknown transition function $p$.
- 'model': refers to the learned approximation.
- 'environment': refers to the unknown MDP transition function.

Given a dataset $\mathcal{D}=\left\{\left(s_i, a_i, r_i, s_i^{\prime}\right)\right\}_{i=1}^N$, the model parameters are obtained by maximum likelihood estimation:
$$
\theta^*=\arg \max _\theta \sum_{i=1}^N \log \hat{p}_\theta\left(s_i^{\prime}, r_i \mid s_i, a_i\right)
$$
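As a concrete illustration, here is a minimal sketch of how this MLE objective is commonly implemented, assuming a Gaussian dynamics model parameterized by a small PyTorch network; the class and function names are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Illustrative Gaussian model p_theta(s', r | s, a)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        out_dim = state_dim + 1  # predict next state and reward
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, out_dim)
        self.log_std = nn.Linear(hidden, out_dim)

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)

def mle_loss(model, s, a, s_next, r):
    """Negative log-likelihood of (s', r); minimizing it is the MLE objective above."""
    mean, log_std = model(s, a)
    target = torch.cat([s_next, r], dim=-1)        # r has shape (batch, 1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    return -dist.log_prob(target).sum(dim=-1).mean()
```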
### 2. Key Insight: Model Mismatch Problem
- Model errors propagate and compound when the model is used for planning.
- The impact of a model error depends on its size and on the local behavior of the value function.
- We can motivate the use of MLE as a loss function by an upper bound:
$$
\sup _{V \in \mathcal{F}}|\langle p-\hat{p}, V\rangle| \leq\|p-\hat{p}\|_1 \sup _{V \in \mathcal{F}}\|V\|_{\infty} \leq \sqrt{2 \operatorname{KL}(p \| \hat{p})} \sup _{V \in \mathcal{F}}\|V\|_{\infty}
$$
- This bound is loose and does not account for the geometry of the problem’s value function.
### 3. Value-Aware Model Learning
- The core idea of VAML is to penalize a model prediction by the resulting difference in the value function:
$$
\mathcal{L}_{\mathrm{VAML}}(\hat{p})=\mathbb{E}_{(s, a)}\left[\sup _{V \in \mathcal{F}}\langle p(\cdot \mid s, a)-\hat{p}(\cdot \mid s, a), V\rangle^2\right]
$$
- A drawback of this approach is that it relies on the value function, which is not known a priori while learning the model.
- Farahmand introduced a modification of VAML called Iterative Value-Aware Model Learning (**IterVAML**), where the supremum is replaced with the current estimate of the value function.
- In each iteration, the value function is updated based on the model, and the model is trained using the loss function based on the last iteration’s value function.
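A minimal sketch of the model-update step of this iteration, assuming a deterministic model `f_theta` and a value network `value_fn` held fixed from the previous iteration (both names are placeholders for illustration):

```python
import torch

def itervaml_loss(f_theta, value_fn, s, a, s_next):
    """IterVAML-style model loss with a deterministic model: penalize the
    value difference between the predicted and the observed next state.
    value_fn is the previous iteration's estimate; its weights stay frozen."""
    s_pred = f_theta(s, a)                 # model prediction of s'
    v_pred = value_fn(s_pred)              # V at the predicted next state
    v_target = value_fn(s_next).detach()   # V(s') treated as a constant target
    return ((v_pred - v_target) ** 2).mean()
```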
## Value-Gradient weighted Model loss (VaGraM)
IterVAML suffers from two practical issues:
1. Value function evaluation outside of the empirical state-action distribution.
2. Suboptimal local minima of the loss surface.
- When the value function is updated, the model loss should force the model prediction to a new solution, but due to the non-convex nature of the VAML loss, the model can get stuck or even diverge.
### 1. Approximating a value-aware loss with the value function gradient

- To keep the value function evaluation on the data it was trained on, VaGraM replaces $V(\hat{s}')$ with its first-order Taylor expansion around the observed next state: $V(\hat{s}') \approx V(s_i') + \left.\nabla_s V(s)\right|_{s_i'}^{\top}(\hat{s}' - s_i')$.
- All $s'_i$ are in the dataset the value function is trained on, which solves the first problem of the VAML paradigm.
- Inserting the expansion into the value-aware loss yields:
$$
\sum_i\left(\left(\left.\nabla_s V(s)\right|_{s_i^{\prime}}\right)^{\top}\left(f_\theta\left(s_i, a_i\right)-s_i^{\prime}\right)\right)^2
$$
- The gradient vector $\left.\nabla_s V(s)\right|_{s_i'}$ can be interpreted as a measure of the sensitivity of the value function at each data point and in each state dimension.
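A minimal sketch of this gradient-weighted objective, assuming a deterministic PyTorch model `f_theta` and a differentiable value network `value_fn` (illustrative names, not the authors' code):

```python
import torch

def value_gradient(value_fn, s_next):
    """Gradient of the value function at each observed next state s'_i."""
    s = s_next.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(value_fn(s).sum(), s)
    return grad.detach()                          # treated as constant weights

def vagram_inner_product_loss(f_theta, value_fn, s, a, s_next):
    """First-order (Taylor) approximation of the value-aware loss:
    ((grad V(s'_i))^T (f_theta(s_i, a_i) - s'_i))^2, averaged over the batch."""
    grad_v = value_gradient(value_fn, s_next)     # (batch, state_dim)
    residual = f_theta(s, a) - s_next             # model prediction error
    return ((grad_v * residual).sum(dim=-1) ** 2).mean()
```

Note that the gradient is taken with respect to the state input only, so the value network's parameters are not updated during model training.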
### 2. Preventing spurious local minima
- Applying the Cauchy-Schwarz inequality yields an upper bound that retains the minimum at $f_\theta(s_i, a_i)=s_i^{\prime}$ and removes the spurious minima (here $\odot$ denotes element-wise multiplication):
$$
\sum_i\left\|\left.\nabla_s V(s)\right|_{s_i^{\prime}} \odot\left(f_\theta\left(s_i, a_i\right)-s_i^{\prime}\right)\right\|^2
$$
- This reformulation is equivalent to a mean squared error loss function with a per-sample diagonal scaling matrix.

- The VAML loss has a complicated shape that depends on the exact values of the value function, while both MSE and our proposal have a paraboloid shape.
- Compared to MSE, our proposed loss function is rescaled to account for the larger gradient of the value function along the $\theta$ axis.
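Reusing the `value_gradient` helper from the previous sketch, the resulting diagonal-scaled loss can be written as follows; this is one plausible implementation under the same assumptions, not the authors' reference code.

```python
def vagram_loss(f_theta, value_fn, s, a, s_next):
    """Gradient-weighted MSE: each state dimension of the squared residual is
    rescaled by the squared value-function gradient at the observed next state."""
    grad_v = value_gradient(value_fn, s_next)     # per-sample diagonal scaling
    residual = f_theta(s, a) - s_next
    return ((grad_v * residual) ** 2).sum(dim=-1).mean()
```

Compared to the inner-product form, the square is applied per dimension before summing, which is what removes the spurious minima and reduces the loss to a rescaled MSE.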

## Experiment: Model Learning In Low-Dimensional Problem
- We compare the performance of VaGraM with both MSE and VAML on a pedagogical environment with a small state space and smooth dynamics, to gain qualitative insight into the loss surfaces.
- We investigate the model losses in isolation, without model-based value function learning.
- We used the SAC algorithm in a model-free setup to estimate the value function.

- In the linear setting, VAML achieves the lowest VAML error, while VaGraM is able to significantly outperform MSE.
- In the NN setting, VAML diverges rapidly, while VaGraM and MSE converge to approximately the same solution.
- Using a flexible function approximation, the VAML loss converges in the first iteration with the given value function, but then rapidly diverges once the value function is updated.
- VaGraM remains stable even with flexible function approximation and achieves a lower VAML error than the MSE baseline.
- Convergence to a single solution (no spurious local minima).
## Experiment: Model-Based Continuous Control
- To test whether our loss function is superior to a maximum likelihood approach under reduced model capacity and distracting state dimensions, we used the Hopper environment.
### 1. Hopper With Reduced Model Capacity

- When reducing the model size, the maximum likelihood models quickly lose performance, completely failing to even stabilize the Hopper for a short period in the smallest setting, while VaGraM retains almost its original performance.
### 2. Hopper With Distracting Dimensions

- When increasing the number of dimensions, the performance of the MLE model deteriorates, as more and more of its capacity is used to model the added dynamics.
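For context, one way such distracting dimensions could be appended to the observation is sketched below as a Gymnasium wrapper; the distractor dynamics used here (a stable random linear system) and all names are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np
import gymnasium as gym

class DistractionWrapper(gym.ObservationWrapper):
    """Appends extra observation dimensions that evolve independently of the task."""
    def __init__(self, env, n_distract=15, seed=0):
        super().__init__(env)
        rng = np.random.default_rng(seed)
        A = rng.normal(size=(n_distract, n_distract))
        self.A = 0.9 * A / np.linalg.norm(A, 2)   # stable random linear dynamics
        self.z = rng.normal(size=n_distract)
        low = np.concatenate([env.observation_space.low, np.full(n_distract, -np.inf)])
        high = np.concatenate([env.observation_space.high, np.full(n_distract, np.inf)])
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float64)

    def observation(self, obs):
        self.z = self.A @ self.z                  # distractors evolve on their own
        return np.concatenate([obs, self.z])
```

Usage would look like `env = DistractionWrapper(gym.make("Hopper-v4"), n_distract=15)` before collecting data for model learning.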
**VaGraM is able to deal with challenging distractions and reduced model capacity significantly better than an MLE baseline.**
## Conclusion
- We presented VaGraM, a loss function for training models that are accurate where it matters for the control problem.
- We highlighted how VaGraM counters the issues of prior value-aware losses and showed the increased stability of the training procedure when using our loss in a pedagogical environment.
- In future work, we seek to scale our loss function to image-based RL.
- We also seek to derive a related value-aware approach for partially observable domains that can take the state inference problem into account.
[Reference](https://openreview.net/pdf?id=4-D6CZkRXxI)