Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism
2022/12/27
Abstract
- The new framework is centered around a weak version of the concentrability coefficient that measures the deviation of the behavior policy from the expert policy alone.
- We consider a lower confidence bound (LCB) algorithm, developed based on the principle of pessimism in the face of uncertainty, for offline RL.
Outline
1. Introduction
2. Problem setup
3. A warm-up: LCB in multi-armed bandits
4. LCB in contextual bandits
5. LCB in Markov decision processes
6. Proof techniques and conjecture
7. Discussion
1. Introduction
- The online paradigm falls short of leveraging previously-collected datasets.
- The key component of offline RL is a pre-collected dataset from an unknown stochastic environment.
- Expert data : imitation-learning algorithms (e.g., behavior cloning) achieve the minimal sub-optimality rate of 1/N in episodic Markov decision processes (MDPs).
- Uniform coverage data : algorithms developed for this regime are shown to achieve a 1/√N sub-optimality competing with the optimal policy.
Motivating questions
Q1. Can we propose an offline RL framework that accommodates the entire data composition range?
- We characterize the data composition in terms of the ratio between the state-action occupancy density of an optimal policy π* and the density μ of the data distribution.
- C* is the smallest constant that satisfies d^{π*}(s, a)/μ(s, a) ≤ C* for all (s, a).
- Existing works on offline RL either do not specify how the sub-optimality depends on the data coverage, or rely on batch data coverage assumptions that do not accommodate the entire data spectrum.
- Unifying offline RL and imitation learning via a single algorithm is thus beneficial.
Q2. Can we design algorithms that achieve the minimal sub-optimality when facing different dataset compositions, without knowing C* beforehand?
- We analyze a pessimistic variant of a value-based method.
- We first form a lower confidence bound (LCB) on the value function of each policy using the batch data.
- Then, we seek a policy that maximizes the LCB (see the schematic below).
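Schematically, the pessimism principle reads as follows (the notation V̂_LCB is ours and stands for any data-driven lower bound that is valid with high probability):
$$
\hat{\pi} \in \arg\max_{\pi} \widehat{V}_{\mathrm{LCB}}(\pi),
\qquad \text{where } \widehat{V}_{\mathrm{LCB}}(\pi) \le V^{\pi}(\rho) \text{ with high probability for every policy } \pi .
$$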
2. Problem setup
Markov decision processes
- Value function: V^π(s) := E[ ∑_{t≥0} γ^t r(s_t, a_t) | s_0 = s, a_t ∼ π(·|s_t) for all t ≥ 0 ]
- Action-value function: Q^π(s, a) := E[ ∑_{t≥0} γ^t r(s_t, a_t) | s_0 = s, a_0 = a, a_t ∼ π(·|s_t) for all t ≥ 1 ]
Offline data and offline RL
- The goal of offline RL is to find a policy π̂, based on the dataset D, so as to minimize the expected sub-optimality with respect to the optimal policy π*, i.e., E[V^{π*}(ρ) − V^{π̂}(ρ)], where ρ is the initial state distribution and V^π(ρ) := E_{s∼ρ}[V^π(s)].
Dataset coverage assumption
- Definition 1 (Single-policy concentrability). Given a policy π, define C^π to be the smallest constant that satisfies d^π(s, a)/μ(s, a) ≤ C^π for all s ∈ S and a ∈ A.
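A compact restatement, assuming the standard (normalized) discounted occupancy measure d^π with initial distribution ρ, data distribution μ, and an optimal policy π*:
$$
d^{\pi}(s, a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{P}\big(s_t = s,\, a_t = a \mid s_0 \sim \rho,\ \pi\big),
\qquad
C^{\pi} := \max_{s, a} \frac{d^{\pi}(s, a)}{\mu(s, a)},
\qquad
C^{\star} := C^{\pi^{\star}} .
$$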
3. A warm-up: LCB in multi-armed bandits
- The multi-armed bandit (MAB) model, where S = 1 and γ = 0
- The goal of offline learning in MAB is to select an arm π̂ that minimizes the expected sub-optimality E[V^{π*}(ρ) − V^{π̂}(ρ)] = E[r(a*) − r(π̂)].
Why does the best empirical arm fail?
- The empirical best arm is quite sensitive to arms with a small sample count N(a).
- Proposition 1. For any ε < 0.3 and N ≥ 500, there exists a two-armed bandit problem such that, for π̂ = argmax_a r̂(a), one has E[r(a*) − r(π̂)] ≥ ε.
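The failure mode is easy to reproduce numerically. Below is a toy simulation (our own instance with hand-picked means 0.6 and 0.3, not the exact construction behind Proposition 1): the optimal arm is well sampled, while a worse arm is observed only once, so with constant probability its empirical mean looks best no matter how large N is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-armed bandit: arm 0 is optimal (mean 0.6), arm 1 is worse (mean 0.3).
true_means = np.array([0.6, 0.3])
N = 1000          # dataset size
n_trials = 20000  # Monte Carlo repetitions

subopt = []
for _ in range(n_trials):
    # The behavior policy hardly ever plays arm 1: a single observation.
    n_pulls = np.array([N - 1, 1])
    # Empirical means from Bernoulli rewards.
    r_hat = np.array([
        rng.binomial(n_pulls[0], true_means[0]) / n_pulls[0],
        rng.binomial(n_pulls[1], true_means[1]) / n_pulls[1],
    ])
    pi_hat = int(np.argmax(r_hat))          # empirical best arm
    subopt.append(true_means[0] - true_means[pi_hat])

# With prob. ~0.3 the single observation of arm 1 equals 1 and beats arm 0,
# so the expected sub-optimality stays around 0.3 * 0.3 = 0.09 for every N.
print("expected sub-optimality:", np.mean(subopt))
```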
LCB: The benefit of pessimism
- Pessimism can be deployed by first constructing a penalty function b(a) for each arm.
- When b(a) captures the confidence level of the empirical reward, r̂(a) − b(a) can be viewed as a lower confidence bound (LCB) on the true mean reward r(a), and the LCB rule picks π̂ ∈ argmax_a { r̂(a) − b(a) }.
- The penalty function originates from Hoeffding’s inequality; a sketch of the resulting rule is given below.
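A minimal sketch of the LCB rule for the offline multi-armed bandit, assuming a Hoeffding-style penalty; the function name `lcb_arm` and the exact constant inside the penalty are our own choices, not necessarily those used in the paper:

```python
import numpy as np

def lcb_arm(rewards_by_arm, delta=0.1):
    """Pick the arm maximizing the lower confidence bound r_hat(a) - b(a).

    rewards_by_arm: list of 1-D arrays, rewards observed for each arm (in [0, 1]).
    delta: confidence level for the Hoeffding-style penalty (an assumption,
           not necessarily the paper's exact choice).
    """
    n_arms = len(rewards_by_arm)
    lcb = np.full(n_arms, -np.inf)
    for a, rewards in enumerate(rewards_by_arm):
        n = len(rewards)
        if n == 0:
            continue  # unseen arms keep LCB = -inf, so they are never selected
        r_hat = float(np.mean(rewards))
        b = np.sqrt(np.log(2 * n_arms / delta) / (2 * n))  # Hoeffding penalty
        lcb[a] = r_hat - b
    return int(np.argmax(lcb))
```

For comparison, the empirical best arm corresponds to dropping the penalty b; the simulation above illustrates why that is unsafe when some arms are barely observed.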
Is LCB optimal for solving offline multi-armed bandits?
- The LCB rule is optimal up to a logarithmic factor when C* ≥ 2.
Imitation learning in bandits: the most played arm achieves a better rate
- Case: C* ∈ [1, 2).
- Since d^{π*}(a*) = 1, the assumption 1/μ(a*) ≤ C* implies μ(a*) ≥ 1/C* > 1/2.
- Take π̂ = argmax_a N(a), the most played arm.
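Why the most played arm works in this regime (a standard Hoeffding argument; the constants here are ours): N(a*) ∼ Binomial(N, μ(a*)) with μ(a*) ≥ 1/C* > 1/2, hence
$$
\mathbb{P}\big(N(a^{\star}) \le N/2\big) \;\le\; \exp\!\big(-2N\,(\mu(a^{\star}) - 1/2)^{2}\big) \;\le\; \exp\!\big(-2N\,(1/C^{\star} - 1/2)^{2}\big),
$$
and whenever N(a*) > N/2 the optimal arm is necessarily the (unique) most played arm, so the expected sub-optimality decays exponentially in N rather than at the 1/√N rate.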
Non-adaptivity of LCB
- C* = 1.5
- C* = 6
- It is impossible for LCB to achieve the optimal rate in both the C* ∈ (1, 2) and C* ≥ 2 regimes simultaneously.
4. LCB in contextual bandits
- The offline learning objective in CB is to find a policy π̂ based on the dataset D.
- The pessimism principle introduced for MAB extends naturally to CB by subtracting a penalty function b(s, a) from the empirical rewards r̂(s, a) and returning π̂(s) ∈ argmax_a { r̂(s, a) − b(s, a) } for every state s.
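In symbols, the state-wise LCB rule is the following (the precise penalty used in the paper is not reproduced; a Hoeffding-style choice is shown as an assumption):
$$
\hat{\pi}(s) \in \arg\max_{a \in \mathcal{A}} \big\{ \hat{r}(s, a) - b(s, a) \big\},
\qquad
b(s, a) \propto \sqrt{\frac{\log(SA/\delta)}{N(s, a)}} ,
$$
with the convention that state-action pairs with N(s, a) = 0 receive a vacuous lower bound.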
- The first term has the usual statistical estimation rate of 1/√N.
- The second term is due to the missing mass, i.e., states whose optimal action never appears in the data.
Optimality of LCB for solving offline contextual bandits
- CB(C*) := { contextual bandit instances such that d^{π*}(s, a)/μ(s, a) ≤ C* for all (s, a) }, the class over which the minimax risk is evaluated.
- On closer inspection, in the C* ∈ [1, 2) regime there is a clear separation between the information-theoretic difficulty of offline learning in MAB, where the rate is exponential in N, and in CB with at least two states, where the rate is 1/N.
5. LCB in Markov decision processes
Offline value iteration with LCB
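A rough sketch of the idea behind offline value iteration with LCB: build empirical rewards and transitions from the batch data and run value iteration with a penalty subtracted. The function name `vi_lcb`, the array-based inputs, and the penalty constants below are our own choices; details of the paper's algorithm (exact penalty, handling of unvisited pairs, possible data splitting across iterations) are not reproduced here.

```python
import numpy as np

def vi_lcb(counts, reward_sums, transition_counts, gamma=0.9, delta=0.1, iters=200):
    """Pessimistic value iteration (a sketch of the VI-LCB idea, not the paper's exact algorithm).

    counts:            (S, A) array, visit counts N(s, a) in the batch data.
    reward_sums:       (S, A) array, sum of observed rewards for each (s, a).
    transition_counts: (S, A, S) array, observed next-state counts.
    """
    S, A = counts.shape
    n = np.maximum(counts, 1)                        # avoid division by zero
    r_hat = reward_sums / n                          # empirical mean rewards
    p_hat = transition_counts / n[:, :, None]        # empirical transition kernel
    # Hoeffding-style penalty; unvisited (s, a) pairs get the maximal penalty.
    b = (1.0 / (1.0 - gamma)) * np.sqrt(np.log(2 * S * A / delta) / n)
    b[counts == 0] = 1.0 / (1.0 - gamma)

    v = np.zeros(S)
    for _ in range(iters):
        q = r_hat - b + gamma * p_hat @ v            # pessimistic Bellman update
        q = np.clip(q, 0.0, 1.0 / (1.0 - gamma))     # keep values in a valid range
        v = q.max(axis=1)
    return q.argmax(axis=1), v                       # greedy policy w.r.t. pessimistic Q
```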
- Assume d^{π*}(s, a)/μ(s, a) ≤ C* for all (s, a).
- The guarantee for VI-LCB holds for all C* ≥ 1.
- Information-theoretic limit (minimax lower bound) for MDPs.

- The first term captures the imitation-learning regime, in which a fast 1/N rate is expected, while the second term handles the large-C* regime, with a 1/√N rate.
- The sample-complexity bound for VI-LCB is loose by an extra horizon-dependent factor relative to the information-theoretic lower bound.
6. Proof techniques and conjecture

- 1. When N(s, π*(s)) = 0, the error at state s is incurred by the missing mass: the optimal action at s never appears in the data (see the decomposition sketched below).
- 2. Standard concentration arguments show that, with high probability, the total weight of such states is small, so the missing-mass term is controlled.
- 3. This allows us to focus on the states with large occupancy weights.
- 4. Thanks to the LCB penalty, the optimal action is chosen at those states with high probability.
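For the contextual-bandit case, the decomposition behind steps 1–3 can be sketched as follows (our rewriting, with Δ(s) := r(s, π*(s)) − r(s, π̂(s)) ≥ 0 denoting the per-state regret):
$$
\mathbb{E}\big[V^{\pi^{\star}}(\rho) - V^{\hat{\pi}}(\rho)\big]
= \mathbb{E}\Big[\sum_{s} \rho(s)\,\Delta(s)\,\mathbf{1}\{N(s, \pi^{\star}(s)) = 0\}\Big]
+ \mathbb{E}\Big[\sum_{s} \rho(s)\,\Delta(s)\,\mathbf{1}\{N(s, \pi^{\star}(s)) \ge 1\}\Big],
$$
where the first term is the missing-mass contribution and the second is handled by the concentration argument behind the LCB penalty.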
Conjecture
- The LCB approach, together with value iteration, is adaptively optimal for solving offline MDPs across the entire range of C*.
7. Discussion
- We propose a new batch RL framework that smoothly interpolates between the two extremes of data composition.
- Focusing on the LCB approach inspired by the principle of pessimism, we find that LCB is adaptively minimax optimal for offline learning in most of the settings considered.
- Open problem: provide a tighter bound for LCB in MDPs.
- We expect our characterization of offline RL to be extended to the function approximation setting and to inform the development of new offline RL algorithms that only require partial coverage.
- Another direction is to analyze whether alternative conservative methods, such as value regularization, can achieve adaptivity and/or minimax optimality.