Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism
2022/12/27
Abstract
- The new framework is centered around a weak version of the concentrability coefficient that measures the deviation of the behavior policy from the expert policy alone.
- We consider a lower confidence bound (LCB) algorithm, developed based on the principle of pessimism in the face of uncertainty, for offline RL.
Outline
1. Introduction
2. Problem setup
3. A warm-up: LCB in multi-armed bandits
4. LCB in contextual bandits
5. LCB in Markov decision processes
6. Proof techniques and conjecture
7. Discussion
1. Introduction
- The online paradigm falls short of leveraging previously-collected datasets.
- The key component of offline RL is a pre-collected dataset from an unknown stochastic environment.
- Expert data : imitation-learning algorithms (e.g., behavior cloning) achieve the minimal sub-optimality rate of 1/N in episodic Markov decision processes (MDPs).
- Uniform coverage data : algorithms developed for this regime are shown to achieve a 1/√N sub-optimality competing with the optimal policy.
Motivating questions
Q1. Can we propose an offline RL framework that accommodates the entire data composition range?
- We characterize the data composition in terms of the ratio between the state-action occupancy density of an optimal policy π* and the density μ of the data distribution.
- C* is the smallest constant that satisfies d^{π*}(s, a)/μ(s, a) ≤ C* for all (s, a).
- Existing works on offline RL either do not specify how the sub-optimality depends on the data coverage, or rely on batch data coverage assumptions that do not accommodate the entire data spectrum.
- Unifying offline RL and imitation learning via a single algorithm is thus beneficial.
Q2. Can we design algorithms that achieve the minimal sub-optimality when facing different dataset compositions, without knowing C* beforehand?
- We analyze a pessimistic variant of a value-based method.
- We first form a lower confidence bound (LCB) on the value function of each policy using the batch data.
- Then, we seek a policy that maximizes the LCB (see the schematic below).
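Schematically, the pessimism principle reads as follows (the notation V̂_LCB is ours and stands for any data-driven lower bound that is valid with high probability):
$$
\hat{\pi} \in \arg\max_{\pi} \widehat{V}_{\mathrm{LCB}}(\pi),
\qquad \text{where } \widehat{V}_{\mathrm{LCB}}(\pi) \le V^{\pi}(\rho) \text{ with high probability for every policy } \pi .
$$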
2. Problem setup
Markov decision processes
- Value function: V^π(s) := E[ ∑_{t≥0} γ^t r(s_t, a_t) | s_0 = s, a_t ∼ π(·|s_t) for all t ≥ 0 ]
- Action-value function: Q^π(s, a) := E[ ∑_{t≥0} γ^t r(s_t, a_t) | s_0 = s, a_0 = a, a_t ∼ π(·|s_t) for all t ≥ 1 ]
Offline data and offline RL
- The goal of offline RL is to find a policy π̂, based on the dataset D, so as to minimize the expected sub-optimality with respect to the optimal policy π*, i.e., E[V^{π*}(ρ) − V^{π̂}(ρ)], where ρ is the initial state distribution and V^π(ρ) := E_{s∼ρ}[V^π(s)].
Dataset coverage assumption
- Definition 1 (Single-policy concentrability). Given a policy π, define C^π to be the smallest constant that satisfies d^π(s, a)/μ(s, a) ≤ C^π for all s ∈ S and a ∈ A.
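A compact restatement, assuming the standard (normalized) discounted occupancy measure d^π with initial distribution ρ, data distribution μ, and an optimal policy π*:
$$
d^{\pi}(s, a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{P}\big(s_t = s,\, a_t = a \mid s_0 \sim \rho,\ \pi\big),
\qquad
C^{\pi} := \max_{s, a} \frac{d^{\pi}(s, a)}{\mu(s, a)},
\qquad
C^{\star} := C^{\pi^{\star}} .
$$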
3. A warm-up: LCB in multi-armed bandits
- The multi-armed bandit (MAB) model, where S = 1 and γ = 0
- The goal of offline learning in MAB is to select an arm π̂ that minimizes the expected sub-optimality E[V^{π*}(ρ) − V^{π̂}(ρ)] = E[r(a*) − r(π̂)].
Why does the best empirical arm fail?
- The empirical best arm is quite sensitive to arms with a small sample count N(a).
- Proposition 1. For any ε < 0.3 and N ≥ 500, there exists a two-armed bandit problem such that, for π̂ = argmax_a r̂(a), one has E[r(a*) − r(π̂)] ≥ ε.
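The failure mode is easy to reproduce numerically. Below is a toy simulation (our own instance with hand-picked means 0.6 and 0.3, not the exact construction behind Proposition 1): the optimal arm is well sampled, while a worse arm is observed only once, so with constant probability its empirical mean looks best no matter how large N is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-armed bandit: arm 0 is optimal (mean 0.6), arm 1 is worse (mean 0.3).
true_means = np.array([0.6, 0.3])
N = 1000          # dataset size
n_trials = 20000  # Monte Carlo repetitions

subopt = []
for _ in range(n_trials):
    # The behavior policy hardly ever plays arm 1: a single observation.
    n_pulls = np.array([N - 1, 1])
    # Empirical means from Bernoulli rewards.
    r_hat = np.array([
        rng.binomial(n_pulls[0], true_means[0]) / n_pulls[0],
        rng.binomial(n_pulls[1], true_means[1]) / n_pulls[1],
    ])
    pi_hat = int(np.argmax(r_hat))          # empirical best arm
    subopt.append(true_means[0] - true_means[pi_hat])

# With prob. ~0.3 the single observation of arm 1 equals 1 and beats arm 0,
# so the expected sub-optimality stays around 0.3 * 0.3 = 0.09 for every N.
print("expected sub-optimality:", np.mean(subopt))
```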
LCB: The benefit of pessimism
- Pessimism can be deployed by first constructing a penalty function b(a) for each arm.
- When b(a) captures the confidence level of the empirical reward, r̂(a) − b(a) can be viewed as a lower confidence bound (LCB) on the true mean reward r(a), and the LCB rule picks π̂ ∈ argmax_a { r̂(a) − b(a) }.
- The penalty function originates from Hoeffding’s inequality; a sketch of the resulting rule is given below.
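A minimal sketch of the LCB rule for the offline multi-armed bandit, assuming a Hoeffding-style penalty; the function name `lcb_arm` and the exact constant inside the penalty are our own choices, not necessarily those used in the paper:

```python
import numpy as np

def lcb_arm(rewards_by_arm, delta=0.1):
    """Pick the arm maximizing the lower confidence bound r_hat(a) - b(a).

    rewards_by_arm: list of 1-D arrays, rewards observed for each arm (in [0, 1]).
    delta: confidence level for the Hoeffding-style penalty (an assumption,
           not necessarily the paper's exact choice).
    """
    n_arms = len(rewards_by_arm)
    lcb = np.full(n_arms, -np.inf)
    for a, rewards in enumerate(rewards_by_arm):
        n = len(rewards)
        if n == 0:
            continue  # unseen arms keep LCB = -inf, so they are never selected
        r_hat = float(np.mean(rewards))
        b = np.sqrt(np.log(2 * n_arms / delta) / (2 * n))  # Hoeffding penalty
        lcb[a] = r_hat - b
    return int(np.argmax(lcb))
```

For comparison, the empirical best arm corresponds to dropping the penalty b; the simulation above illustrates why that is unsafe when some arms are barely observed.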
Is LCB optimal for solving offline multi-armed bandits?
- The LCB rule is optimal up to a logarithmic factor when C* ≥ 2.
Imitation learning in bandits: the most played arm achieves a better rate
- Case: C* ∈ [1, 2).
- Since d^{π*}(a*) = 1, the assumption 1/μ(a*) ≤ C* implies μ(a*) ≥ 1/C* > 1/2.
- Take π̂ = argmax_a N(a), the most played arm.
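Why the most played arm works in this regime (a standard Hoeffding argument; the constants here are ours): N(a*) ∼ Binomial(N, μ(a*)) with μ(a*) ≥ 1/C* > 1/2, hence
$$
\mathbb{P}\big(N(a^{\star}) \le N/2\big) \;\le\; \exp\!\big(-2N\,(\mu(a^{\star}) - 1/2)^{2}\big) \;\le\; \exp\!\big(-2N\,(1/C^{\star} - 1/2)^{2}\big),
$$
and whenever N(a*) > N/2 the optimal arm is necessarily the (unique) most played arm, so the expected sub-optimality decays exponentially in N rather than at the 1/√N rate.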
Non-adaptivity of LCB
- C* = 1.5
- C* = 6
- It is impossible for LCB to achieve the optimal rate in both the C* ∈ (1, 2) and C* ≥ 2 regimes simultaneously.
4. LCB in contextual bandits
- The offline learning objective in CB is to find a policy π̂ based on the dataset D.
- The pessimism principle introduced for MAB extends naturally to CB by subtracting a penalty function b(s, a) from the empirical rewards r̂(s, a) and returning π̂(s) ∈ argmax_a { r̂(s, a) − b(s, a) } for every state s.
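In symbols, the state-wise LCB rule is the following (the precise penalty used in the paper is not reproduced; a Hoeffding-style choice is shown as an assumption):
$$
\hat{\pi}(s) \in \arg\max_{a \in \mathcal{A}} \big\{ \hat{r}(s, a) - b(s, a) \big\},
\qquad
b(s, a) \propto \sqrt{\frac{\log(SA/\delta)}{N(s, a)}} ,
$$
with the convention that state-action pairs with N(s, a) = 0 receive a vacuous lower bound.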
- The first term has the usual statistical estimation rate of 1/√N.
- The second term is due to the missing mass, i.e., states whose optimal action never appears in the data.
Optimality of LCB for solving offline contextual bandits
- CB(C*) := { contextual bandit instances such that d^{π*}(s, a)/μ(s, a) ≤ C* for all (s, a) }, the class over which the minimax risk is evaluated.
- On closer inspection, in the C* ∈ [1, 2) regime there is a clear separation between the information-theoretic difficulty of offline learning in MAB, where the rate is exponential in N, and in CB with at least two states, where the rate is 1/N.
5. LCB in Markov decision processes
Offline value iteration with LCB
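A rough sketch of the idea behind offline value iteration with LCB: build empirical rewards and transitions from the batch data and run value iteration with a penalty subtracted. The function name `vi_lcb`, the array-based inputs, and the penalty constants below are our own choices; details of the paper's algorithm (exact penalty, handling of unvisited pairs, possible data splitting across iterations) are not reproduced here.

```python
import numpy as np

def vi_lcb(counts, reward_sums, transition_counts, gamma=0.9, delta=0.1, iters=200):
    """Pessimistic value iteration (a sketch of the VI-LCB idea, not the paper's exact algorithm).

    counts:            (S, A) array, visit counts N(s, a) in the batch data.
    reward_sums:       (S, A) array, sum of observed rewards for each (s, a).
    transition_counts: (S, A, S) array, observed next-state counts.
    """
    S, A = counts.shape
    n = np.maximum(counts, 1)                        # avoid division by zero
    r_hat = reward_sums / n                          # empirical mean rewards
    p_hat = transition_counts / n[:, :, None]        # empirical transition kernel
    # Hoeffding-style penalty; unvisited (s, a) pairs get the maximal penalty.
    b = (1.0 / (1.0 - gamma)) * np.sqrt(np.log(2 * S * A / delta) / n)
    b[counts == 0] = 1.0 / (1.0 - gamma)

    v = np.zeros(S)
    for _ in range(iters):
        q = r_hat - b + gamma * p_hat @ v            # pessimistic Bellman update
        q = np.clip(q, 0.0, 1.0 / (1.0 - gamma))     # keep values in a valid range
        v = q.max(axis=1)
    return q.argmax(axis=1), v                       # greedy policy w.r.t. pessimistic Q
```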
- Assume d^{π*}(s, a)/μ(s, a) ≤ C* for all (s, a).
- The guarantee for VI-LCB holds for all C* ≥ 1.
- Information-theoretic limit (minimax lower bound) for MDPs.

- The first term captures the imitation-learning regime, in which a fast 1/N rate is expected, while the second term handles the large-C* regime, with a 1/√N rate.
- The sample-complexity bound for VI-LCB is loose by an extra horizon-dependent factor relative to the information-theoretic lower bound.
6. Proof techniques and conjecture

- 1. When N(s, π*(s)) = 0, the error at state s is incurred by the missing mass: the optimal action at s never appears in the data (see the decomposition sketched below).
- 2. Standard concentration arguments show that, with high probability, the total weight of such states is small, so the missing-mass term is controlled.
- 3. This allows us to focus on the states with large occupancy weights.
- 4. Thanks to the LCB penalty, the optimal action is chosen at those states with high probability.
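For the contextual-bandit case, the decomposition behind steps 1–3 can be sketched as follows (our rewriting, with Δ(s) := r(s, π*(s)) − r(s, π̂(s)) ≥ 0 denoting the per-state regret):
$$
\mathbb{E}\big[V^{\pi^{\star}}(\rho) - V^{\hat{\pi}}(\rho)\big]
= \mathbb{E}\Big[\sum_{s} \rho(s)\,\Delta(s)\,\mathbf{1}\{N(s, \pi^{\star}(s)) = 0\}\Big]
+ \mathbb{E}\Big[\sum_{s} \rho(s)\,\Delta(s)\,\mathbf{1}\{N(s, \pi^{\star}(s)) \ge 1\}\Big],
$$
where the first term is the missing-mass contribution and the second is handled by the concentration argument behind the LCB penalty.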
Conjecture
- The LCB approach, together with value iteration, is adaptively optimal for solving offline MDPs across the entire range of C*.
7. Discussion
- We propose a new batch RL framework that smoothly interpolates between the two extremes of data composition.
- Focusing on the LCB approach inspired by the principle of pessimism, we find that LCB is adaptively minimax optimal for offline learning in most of the settings considered.
- Open problem: provide a tighter bound for LCB in MDPs.
- We expect our characterization of offline RL to be extended to the function approximation setting and to inform the development of new offline RL algorithms that only require partial coverage.
- Another direction is to analyze whether alternative conservative methods, such as value regularization, can achieve adaptivity and/or minimax optimality.