# A Note on Adversarially Trained Actor Critic for Offline Reinforcement Learning
## Abstract
In reinforcement learning (RL), training effective policies from offline data is challenging. Traditional actor-critic methods often struggle with stability and performance on static datasets. Adversarially Trained Actor Critic (ATAC) addresses these issues by integrating adversarial training into the actor-critic framework. ATAC enhances robustness and stability by combining temporal-difference (TD) errors with an adversarial loss component.
## Introduction
The main difference between Adversarially Trained Actor Critic (ATAC) and other actor-critic methods is the adversarial loss used to train the critic.
Actor-critic has a similar architecture to Generative Adversarial Networks (GANs). In GANs, the discriminator aims to classify between true images and the ones generated by the generator. The critic in an actor-critic setup plays a similar role by evaluating the actions generated by the policy. However, there is a key difference: while a GAN discriminator has access to both real and generated data to make its judgment, the critic in a standard actor-critic method evaluates only the actions generated by the policy, based on their expected values in given states.
ATAC incorporates the adversarial concept from GANs into reinforcement learning, leveraging the availability of offline datasets. In this context, the critic does not solely evaluate the policy-generated actions but also assesses actions from the dataset. This approach enhances the training process by introducing an adversarial component where the critic is tasked with differentiating between the policy's actions and those in the dataset.
The intention behind the critic in ATAC is not only to evaluate the state-action value but also to identify the areas where the actor's policy is inferior to the behavior policy present in the dataset. This makes the learned policy more robust.
We will go through why an adversarial loss in the critic is sensible and walk through its theoretical justification.
## Preliminaries
A **reinforcement learning** problem can be modeled as a Markov Decision Process (MDP), a 5-tuple ($\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma$) with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R$, transition function $P$, and discount factor $\gamma$. We use $\pi : \mathcal{S} \rightarrow \Delta(\mathcal{A})$ as the learner's policy, where $\Delta(\cdot)$ denotes the set of probability distributions. The goal of RL is to maximize the expected return with respect to $\pi$, $$J(\pi) := \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t,a_t) \mid a_t \sim \pi(\cdot \mid s_t) \right].$$ For any policy $\pi$, we define the Q-value function as $$
Q^{\pi}(s, a) := \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t,a_t) \mid s_0=s, a_0=a, a_t \sim \pi(\cdot \mid s_t) \right].
$$ For a policy $\pi$, the Bellman operator is defined as $(T^{\pi} f)(s, a) := R(s, a) + \gamma \mathbb{E}_{s'\sim P(\cdot \mid s, a), a'\sim \pi (\cdot \mid s') } [f(s', a')].$
In addition, we use $d^{\pi}$ to denote the normalized and discounted state-action occupancy measure of the policy $\pi$. That is,
$d^{\pi}(s, a) := (1 - \gamma) \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t \mathbf{1}(s_t = s, a_t = a) \mid a_t \sim \pi(\cdot \mid s_t) \right].$
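To make these definitions concrete, here is a small NumPy sketch (not from the paper) that evaluates $Q^{\pi}$, $d^{\pi}$, and $J(\pi)$ on a randomly generated tabular MDP and checks that $J(\pi) = \frac{1}{1-\gamma}\mathbb{E}_{d^{\pi}}[R]$. The MDP, the policy, and the iteration counts are all made up for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] transition probabilities
R = rng.uniform(size=(S, A))                 # R[s, a] rewards
pi = rng.dirichlet(np.ones(A), size=S)       # pi[s, a] policy probabilities
s0 = 0                                       # fixed initial state

# Q^pi is the fixed point of the Bellman operator T^pi: Q = R + gamma * P V^pi.
Q = np.zeros((S, A))
for _ in range(2000):                        # iterative policy evaluation
    V = (pi * Q).sum(axis=1)                 # V^pi(s')
    Q = R + gamma * P @ V                    # apply T^pi

# d^pi(s, a): normalized discounted state-action occupancy measure.
d_state = np.zeros(S)
p_t = np.zeros(S); p_t[s0] = 1.0             # state distribution at time t
for t in range(2000):
    d_state += (1 - gamma) * gamma**t * p_t
    p_t = ((pi * p_t[:, None])[:, :, None] * P).sum(axis=(0, 1))  # next-state distribution
d_sa = d_state[:, None] * pi                 # d^pi(s, a) = d^pi(s) * pi(a|s)

J = (pi[s0] * Q[s0]).sum()                   # J(pi) = V^pi(s0)
print(np.isclose(J, (d_sa * R).sum() / (1 - gamma)))   # J(pi) = E_{d^pi}[R] / (1 - gamma)
```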
**Function Approximation**
We assume access to a value-function set $\mathcal{F} \subseteq (\mathcal{S} \times \mathcal{A} \rightarrow [0, V_{\text{max}}])$ for modeling the Q-functions of policies. Additionally, we search for optimal policies within a policy set $\Pi \subseteq (\mathcal{S} \rightarrow \Delta(\mathcal{A}))$.
**Offline RL**
We assume the offline data $\mathcal{D}$ consists of $N$ i.i.d. $(s, a, r, s')$ tuples, where $(s, a) \sim \mu$, $r = R(s, a)$, and $s' \sim \mathcal{P}(\cdot \mid s, a)$. Here, $\mu$ represents the discounted state-action occupancy measure of a behavior policy, which we also denote by $\mu$ for simplicity (i.e., $\mu = d^{\mu}$). We also use $a \sim \mu(\cdot \mid s)$ to denote actions drawn from this policy.
One key challenge in offline RL is the off-support error, which occurs when the policy takes actions in states that are not well-covered by the data distribution $\mu$. This leads to inaccurate estimates in those regions due to insufficient data, and we use $\mathcal{C}(\nu; \mu, \mathcal{F}, \pi) := \max_{f \in \mathcal{F}} \frac{\|f - T^\pi f\|_{2,\nu}^2}{\|f - T^\pi f\|_{2,\mu}^2}$ to measure how well a distribution of interest $\nu \text{ (e.g., } d^\pi)$ is covered by the data distribution $\mu \text{ w.r.t. policy } \pi$ and function class $\mathcal{F}$.
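As a small illustration of this coverage coefficient, the sketch below computes $\mathcal{C}(\nu; \mu, \mathcal{F}, \pi)$ for a finite function class, given the Bellman residuals $f - T^{\pi}f$ of each $f \in \mathcal{F}$ as plain arrays over state-action pairs. The residuals and distributions here are placeholder random values, purely for illustration.
```python
import numpy as np

def concentrability(residuals, nu, mu):
    """C(nu; mu, F, pi) for a finite class F, given the Bellman residuals
    (f - T^pi f) of each f in F as flat arrays over state-action pairs."""
    ratios = [np.dot(nu, r**2) / np.dot(mu, r**2) for r in residuals]
    return max(ratios)

# Placeholder example: 3 candidate functions, 6 state-action pairs.
rng = np.random.default_rng(1)
residuals = [rng.normal(size=6) for _ in range(3)]
mu = rng.dirichlet(np.ones(6))       # data distribution
nu = rng.dirichlet(np.ones(6))       # distribution we want covered (e.g., d^pi)
print(concentrability(residuals, nu, mu))
```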
> $\textbf{Assumption 1 (Approximate Realizability)}$ For any policy $\pi \in \Pi$, $\min_{f \in \mathcal{F}} \max_{\text{admissible} \ \nu} \| f - T^{\pi} f \|_{2, \nu}^2 \leq \varepsilon_{\mathcal{F}}$, where an admissible $\nu$ means $\nu \in \{d^{\pi'} : \pi' \in \Pi\}$.
For a quick review, Bellman consistency for $f$ under $\pi$ means $f(s, a) = (T^{\pi} f)(s, a) = \mathbb{E} \left[ R(s, a) + \gamma f(s', \pi(s')) \right]$.
Assumption 1 means that for any state-action occupancy distribution $\nu$ corresponding to some $\pi^{\prime} \in \Pi$, we can find an action-value function $f$ that nearly satisfies Bellman consistency, with a small error $\varepsilon_{\mathcal{F}}$. This is a weaker form of the exact realizability condition $Q^{\pi} \in \mathcal{F}$, for which $\| Q^{\pi} - T^{\pi} Q^{\pi} \|_{2, \nu}^2 = 0$.
> $\textbf{Assumption 2 (Approximate Completeness).}$ For any $\pi \in \Pi$ and $f \in \mathcal{F}$, we have $\min_{g \in \mathcal{F}} \|g - T^\pi f\|_{2,\mu} \leq \epsilon_{\mathcal{F},\mathcal{F}}$.
$\|\cdot\|_{2,\mu} := \sqrt{\mathbb{E}_{\mu}[(\cdot)^2]}$ denotes the $\mu$-weighted 2-norm. Assumption 2, once more, relaxes the usual completeness assumption that $T^\pi f \in \mathcal{F}$ for all $f \in \mathcal{F}$. Unlike Assumption 1, the approximation here only needs to be accurate under the data distribution $\mu$.
**Offline RL as a Stackelberg game**
We now formulate offline RL as a Stackelberg game, i.e., a bilevel optimization problem in which the policy (actor) is the leader and the critic is the follower:
\begin{equation}
\hat{\pi}^* \in \arg\max_{\pi \in \Pi} \mathcal{L}_{\mu}(\pi, f^{\pi})
\end{equation}
\begin{equation}
f^{\pi} \in \arg\min_{f \in \mathcal{F}} \mathcal{L}_{\mu}(\pi, f) + \beta \mathcal{E}_{\mu}(\pi, f)
\end{equation}
where $\beta \geq 0$ is a hyperparameter and the two terms are defined as
\begin{equation}
\mathcal{L}_{\mu}(\pi, f) := \mathbb{E}_{(s, a) \sim \mu} [f(s, \pi(s)) - f(s, a)]
\end{equation}
\begin{equation}
\mathcal{E}_{\mu}(\pi, f) := \mathbb{E}_{(s, a) \sim \mu} \left[ \left( (f - T^{\pi} f)(s, a) \right)^2 \right]
\end{equation}
The above equations reflect the concept mentioned in the Introduction. The critic's goal is to identify areas where the policy $\pi$ is weaker compared to the behavior policy $\mu$. The policy $\hat{\pi}^*$ aims to maximize the value predicted by $f^{\pi}$, whose objective has two components: $\mathcal{L}_{\mu}(\pi, f)$, achieving *relative pessimism*, and $\mathcal{E}_{\mu}(\pi, f)$, ensuring Bellman consistency.
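To make the bilevel objective concrete, here is a minimal NumPy sketch (purely illustrative, not from the paper) that solves the Stackelberg problem by brute force on a made-up bandit instance with a finite critic class $\mathcal{F}$ and three deterministic policies. With $\gamma = 0$ we have $(T^{\pi} f)(s,a) = r(s,a)$, so $\mathcal{E}_{\mu}$ reduces to a squared reward-fitting error.
```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(size=3)                              # true rewards for 3 actions
mu = rng.dirichlet(np.ones(3))                       # behavior policy = data distribution
F = [rng.uniform(size=3) for _ in range(4)]          # a finite critic class
Pi = [np.eye(3)[i] for i in range(3)]                # three deterministic policies
beta = 0.5

L = lambda pi, f: pi @ f - mu @ f                    # L_mu(pi, f) = E_mu[f(s, pi(s)) - f(s, a)]
E = lambda f: mu @ (f - r) ** 2                      # E_mu(pi, f); here T^pi f = r, so it is policy-independent

def critic(pi):                                      # follower best response: the inner argmin
    return min(F, key=lambda f: L(pi, f) + beta * E(f))

pi_hat = max(Pi, key=lambda pi: L(pi, critic(pi)))   # leader: the outer argmax
print(pi_hat, L(pi_hat, critic(pi_hat)))
```
The same max-min structure reappears later in the comparison with CQL, where swapping the order of the max and min changes the answer.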
**Relative Pessimism**
Relative pessimism refers to the critic's objective of evaluating the policy $\pi$ by comparing it to the behavior policy $\mu$, focusing on areas where $\pi$ may underperform relative to $\mu$. This is captured by the term $\mathcal{L}_{\mu}(\pi, f)$ in the optimization problem. Previous work based on absolute pessimism optimizes a pessimistic estimate of $J(\pi)$; ATAC instead optimizes $J(\pi) - J(\mu)$, the performance of $\pi$ relative to $\mu$.
At a high level, ATAC (Algorithm 1 in the paper) alternates two steps: a critic step that (approximately) solves the inner minimization $f_k \in \arg\min_{f \in \mathcal{F}} \mathcal{L}_{\mathcal{D}}(\pi_k, f) + \beta \mathcal{E}_{\mathcal{D}}(\pi_k, f)$ on the dataset, and an actor step that updates $\pi_k$ with a no-regret policy optimization oracle (Definition 3 below).
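As a rough illustration only, here is a self-contained PyTorch sketch of a practical two-timescale variant of this loop on synthetic data. The network sizes, learning rates, and the bootstrapped TD approximation of $\mathcal{E}_{\mathcal{D}}$ are all assumptions made here for illustration; the official implementation additionally uses the double-Q/DQRA loss, entropy regularization, and weight projection discussed later.
```python
import torch
import torch.nn as nn

# Synthetic offline dataset standing in for D; all dimensions and hyperparameters are made up.
torch.manual_seed(0)
S_DIM, A_DIM, N = 4, 2, 512
s = torch.randn(N, S_DIM)
a = torch.rand(N, A_DIM) * 2 - 1
r = torch.randn(N)
s_next = torch.randn(N, S_DIM)

critic = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(S_DIM, 64), nn.ReLU(), nn.Linear(64, A_DIM), nn.Tanh())
f = lambda s_, a_: critic(torch.cat([s_, a_], dim=-1)).squeeze(-1)

# Two-timescale gradient descent-ascent: fast critic (follower), slow actor (leader).
critic_opt = torch.optim.Adam(critic.parameters(), lr=5e-4)
actor_opt = torch.optim.Adam(policy.parameters(), lr=5e-5)
gamma, beta = 0.99, 4.0

for k in range(200):
    idx = torch.randint(0, N, (128,))
    sb, ab, rb, snb = s[idx], a[idx], r[idx], s_next[idx]

    # Critic step: approximate argmin_f  L_D(pi_k, f) + beta * E_D(pi_k, f).
    with torch.no_grad():
        a_pi = policy(sb)
        td_target = rb + gamma * f(snb, policy(snb))
    l_mu = (f(sb, a_pi) - f(sb, ab)).mean()          # relative-pessimism term
    e_mu = ((f(sb, ab) - td_target) ** 2).mean()     # bootstrapped Bellman-error term
    critic_opt.zero_grad()
    (l_mu + beta * e_mu).backward()
    critic_opt.step()

    # Actor step: cheap stand-in for the no-regret oracle; push pi toward higher f(s, pi(s)).
    actor_loss = -f(sb, policy(sb)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```
The critic descends $\mathcal{L}_{\mathcal{D}} + \beta\mathcal{E}_{\mathcal{D}}$ quickly while the actor ascends $\mathcal{L}_{\mathcal{D}}$ slowly, mirroring the leader-follower structure of the Stackelberg game.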

> $\textbf{Definition 3 (No-regret policy optimization oracle).}$
> An algorithm PO is called a no-regret policy optimization oracle if for any sequence of functions $f_1, \ldots, f_K$ with $f_k : S \times A \rightarrow [0, V_{\max}]$, the policies $\pi_1, \ldots, \pi_K$ produced by PO satisfy, for any comparator $\pi \in \Pi$:
> $$
> \epsilon_{\text{opt}}^{\pi} := \frac{1}{1 - \gamma} \sum_{k=1}^K \mathbb{E}_{\pi} \left[ f_k(s, \pi) - f_k(s, \pi_k) \right] = o(K).
> $$
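Definition 3 is abstract, so here is one concrete and classical instance under the simplifying assumption of a single state: the Hedge / multiplicative-weights update, whose average regret against any fixed comparator vanishes as $K$ grows. The payoff sequence and step size below are illustrative.
```python
import numpy as np

# Hedge (multiplicative weights) on a single state with A actions:
# a classic no-regret update, used here as a stand-in instance of Definition 3.
rng = np.random.default_rng(0)
A, K, V_max = 5, 2000, 1.0
eta = np.sqrt(np.log(A) / K) / V_max        # standard step size

logits = np.zeros(A)
cumulative_f = np.zeros(A)                  # sum_k f_k(s, .)
earned = 0.0                                # sum_k E_{a ~ pi_k} f_k(s, a)
for k in range(K):
    f_k = rng.uniform(0, V_max, size=A)     # an arbitrary sequence of payoff functions
    pi_k = np.exp(logits - logits.max())
    pi_k /= pi_k.sum()
    earned += f_k @ pi_k
    cumulative_f += f_k
    logits += eta * f_k                     # exponentiated-gradient update

regret = cumulative_f.max() - earned        # regret vs. the best fixed comparator
print(regret, regret / K)                   # total regret is O(sqrt(K)), i.e. o(K)
```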
## Example: Self-Driving
Before diving into the theory, a motivating example helps highlight the main idea and the intuition.
Imagine we are developing an autonomous driving system that needs to learn a safe driving policy from a dataset of human driving behaviors. (Assume that we have reward in the dataset.) The dataset consists of various driving scenarios, including normal driving, emergency braking, and lane changes. Our goal is to train an autonomous vehicle to drive in a way that is safe, using the principles of ATAC.
**Behavior Policy ($\mu$)**:
The behavior policy $\mu$ represents the driving actions taken by human drivers in the dataset. These actions include accelerating, braking, and steering in response to different traffic conditions.
**State-Action Occupancy ($d^{\pi}$)**:
The state-action occupancy measure $d^{\pi}$ describes how frequently each state-action pair is visited when following a policy $\pi$. For the behavior policy, $d^{\mu}$ tells us, for instance, how often human drivers perform an emergency brake in specific traffic conditions.
**Value Function ($f$)**:
The value function $f$ estimates the expected return (e.g., safety and efficiency) of following a specific policy $\pi$. In our example, $f$ evaluates the safety and efficiency of the autonomous vehicle's driving actions.
**Bellman Consistency**:
Bellman consistency ensures that the value function $f$ aligns with the expected rewards and future values under the policy $\pi$. For instance, if the autonomous vehicle decides to slow down to avoid a collision, Bellman consistency helps evaluate whether this action leads to a safer and more efficient outcome.
A policy $\pi$ learned using ATAC will perform at least as well as the behavior policy $\mu$, provided we can find an action-value function $f$ that nearly satisfies Bellman consistency with a small error $\varepsilon_{\mathcal{F}}$.
* The **leader** (the autonomous vehicle's policy $\pi$) seeks to maximize the expected return $J(\pi)$ by making driving decisions.
* The **follower** (the critic $f$) evaluates the safety and efficiency of these decisions, ensuring that the value function aligns with the observed behavior in the dataset.
By following this adversarial training approach, the autonomous driving system can learn to make driving decisions that not only match but potentially exceed the performance of human drivers, leading to safer and more efficient autonomous driving.
## Supporting Lemmas and Theoretical Analysis
> $\textbf{Proposition 4}$.
> If Assumption 1 holds with $\varepsilon_{\mathcal{F}} = 0$ and $\mu \in \Pi$, then $\mathcal{L}_{\mu}(\pi, f^{\pi}) \leq (1 - \gamma)(J(\pi) - J(\mu)) \ \forall \pi \in \Pi$, for any $\beta \geq 0$. This implies $J(\hat{\pi}^*) \geq J(\mu)$.
***Proof.***
First, we show, step by step, the relationship between the performance difference lemma and the relative-pessimism term $\mathcal{L}_{\mu}(\pi, f)$.
$$
J(\pi) - J(\mu) = V^{\pi}(s_0) - V^{\mu}(s_0).
$$
\begin{align*}
&V^{\pi}(s_0) - V^{\mu}(s_0) \\
&= V^{\pi}(s_0) - \mathbb{E}_{a_0 \sim \mu(\cdot \mid s_0)} \Bigl[ r(s_0, a_0) + \gamma \mathbb{E}_{s' \sim P(s_0, a_0)} V^{\pi}(s') \Bigr] + \mathbb{E}_{a_0 \sim \mu(\cdot \mid s_0)} \Bigl[ r(s_0, a_0) + \gamma \mathbb{E}_{s' \sim P(s_0, a_0)} V^{\pi}(s') \Bigr] - V^{\mu}(s_0) \\
&= V^{\pi}(s_0) - \mathbb{E}_{a_0 \sim \mu(\cdot \mid s_0)} \Bigl[ r(s_0, a_0) + \gamma \mathbb{E}_{s' \sim P(s_0, a_0)} V^{\pi}(s') \Bigr] \\
&\hspace{1.2cm}+ \mathbb{E}_{a_0 \sim \mu(\cdot \mid s_0)} \Bigl[ r(s_0, a_0) + \gamma \mathbb{E}_{s' \sim P(s_0, a_0)} V^{\pi}(s') \Bigr] - \mathbb{E}_{a_0 \sim \mu(\cdot \mid s_0)} \Bigl[ r(s_0, a_0) + \gamma \mathbb{E}_{s' \sim P(s_0, a_0)} V^{\mu}(s') \Bigr] \\
&= \mathbb{E}_{a_0 \sim \pi(\cdot \mid s_0)} \Bigl[ Q^{\pi}(s_0,a_0) \Bigr] - \mathbb{E}_{a_0 \sim \mu(\cdot \mid s_0)} \Bigl[ Q^{\pi}(s_0,a_0) \Bigr]+ \gamma \mathbb{E}_{a_0 \sim \mu(\cdot \mid s_0)} \mathbb{E}_{s_1 \sim P(s_0, a_0)} \Bigl[ V^{\pi}(s_1) - V^{\mu}(s_1) \Bigr]
\end{align*}
The first two terms can be written as
$$\mathbb{E}_{a^{\pi}_{0} \sim \pi(\cdot \mid s_0), a^{\mu}_{0} \sim \mu(\cdot \mid s_0)} \Bigl[ Q^{\pi}(s_0,a_0^{\pi}) - Q^{\pi}(s_0,a_0^{\mu}) \Bigr]$$
For the $\gamma$-term from the last step, we denote by $\mathbb{P}^{\mu}_h(s; s_0)$ the probability of reaching state $s$ at time step $h$ starting from $s_0$ and following $\mu$, i.e., $\mathbb{P}^{\mu}_h(s; s_0) = \sum_a \mathbb{P}^{\mu}_h(s, a; s_0)$ (recall that the expectation over $s_1$ above is taken under $a_0 \sim \mu$), and we get
\begin{align*}
\gamma-\text{term} &= \gamma \mathbb{E}_{s_1 \sim \mathbb{P}^{\mu}_1(\cdot \mid s_0)} \Bigl[ V^{\pi}(s_1) - V^{\mu}(s_1) \Bigr] \\
&= \gamma \mathbb{E}_{s_1 \sim \mathbb{P}^{\mu}_1(\cdot \mid s_0)} \biggl[ \gamma \mathbb{E}_{a_1 \sim \mu(\cdot \mid s_1)} \mathbb{E}_{s_2 \sim P(s_1, a_1)} \Bigl[ V^{\pi}(s_2) - V^{\mu}(s_2) \Bigr] + \mathbb{E}_{a^{\pi}_{1} \sim \pi(\cdot \mid s_1), a^{\mu}_{1} \sim \mu(\cdot \mid s_1)} \Bigl[ Q^{\pi}(s_1,a_1^{\pi}) - Q^{\pi}(s_1,a_1^{\mu}) \Bigr] \biggr] \\
&= \gamma^2 \mathbb{E}_{s_2 \sim \mathbb{P}^{\mu}_2(\cdot \mid s_0)} \Bigl[ V^{\pi}(s_2) - V^{\mu}(s_2) \Bigr] + \gamma \mathbb{E}_{s_1 \sim \mathbb{P}^{\mu}_1(\cdot \mid s_0), a^{\pi}_{1} \sim \pi(\cdot \mid s_1), a^{\mu}_{1} \sim \mu(\cdot \mid s_1)} \Bigl[ Q^{\pi}(s_1,a_1^{\pi}) - Q^{\pi}(s_1,a_1^{\mu}) \Bigr]
\end{align*}
This is a recursion, which we now unroll.
\begin{align*}
V^{\pi}(s_0) - V^{\mu}(s_0)
&= \mathbb{E}_{a^{\pi}_{0} \sim \pi(\cdot \mid s_0), a^{\mu}_{0} \sim \mu(\cdot \mid s_0)} \Bigl[ Q^{\pi}(s_0,a_0^{\pi}) - Q^{\pi}(s_0,a_0^{\mu}) \Bigr] + \gamma \mathbb{E}_{s_1 \sim \mathbb{P}^{\mu}_1(\cdot \mid s_0), a^{\pi}_{1} \sim \pi(\cdot \mid s_1), a^{\mu}_{1} \sim \mu(\cdot \mid s_1)} \Bigl[ Q^{\pi}(s_1,a_1^{\pi}) - Q^{\pi}(s_1,a_1^{\mu}) \Bigr] \\&\quad+ \gamma^2 \mathbb{E}_{s_2 \sim \mathbb{P}^{\mu}_2(\cdot \mid s_0)} \Bigl[ V^{\pi}(s_2) - V^{\mu}(s_2) \Bigr]
\\&\quad \vdots \\
&= \sum_{h=0}^{\infty} \gamma^h \mathbb{E}_{s \sim \mathbb{P}^{\mu}_h(\cdot \mid s_0), a^{\pi} \sim \pi(\cdot \mid s), a^{\mu} \sim \mu(\cdot \mid s)} \Bigl[ Q^{\pi}(s,a^{\pi}) - Q^{\pi}(s,a^{\mu}) \Bigr] \\
&= \frac{1}{1 - \gamma} \mathbb{E}_{(s, a^{\mu}) \sim d^{\mu}_{s_0}, a^{\pi}\sim \pi(\cdot \mid s)} \Bigl[ Q^{\pi}(s,a^{\pi}) - Q^{\pi}(s,a^{\mu}) \Bigr] \\
&= \frac{1}{1 - \gamma} \mathbb{E}_{(s, a) \sim \mu} \Bigl[ Q^{\pi}(s,\pi(s)) - Q^{\pi}(s,a) \Bigr].
\end{align*}
In the last step, we simplify the notation to match the form of $\mathcal{L}_{\mu}(\pi, f)$.
Therefore, if $Q^{\pi} \in \mathcal{F}$ (at least on the states covered by $\mu$), then
$$
(1 - \gamma)(J(\pi) - J(\mu)) = \mathcal{L}_{\mu}(\pi, Q^{\pi}) = \mathcal{L}_{\mu}(\pi, Q^{\pi}) + \beta \mathcal{E}_{\mu}(\pi, Q^{\pi}) \geq \mathcal{L}_{\mu}(\pi, f^{\pi}),
$$
where we use $\mathcal{E}_{\mu}(\pi, Q^{\pi}) = 0$ (by definition of $Q^{\pi}$), $\mathcal{E}_{\mu}(\pi, f) \geq 0$ for any $f \in \mathcal{F}$, and the fact that $f^{\pi}$ minimizes $\mathcal{L}_{\mu}(\pi, f) + \beta \mathcal{E}_{\mu}(\pi, f)$ over $\mathcal{F}$.
Robust policy improvement follows, since
$$
(1 - \gamma)\bigl(J(\hat{\pi}^*) - J(\mu)\bigr) \geq \mathcal{L}_{\mu}(\hat{\pi}^*, f^{\hat{\pi}^*}) \geq \mathcal{L}_{\mu}(\mu, f^{\mu}) = 0,
$$
where the second inequality uses that $\hat{\pi}^*$ maximizes $\mathcal{L}_{\mu}(\pi, f^{\pi})$ over $\Pi \ni \mu$, and the last equality holds because $\mathcal{L}_{\mu}(\mu, f) = \mathbb{E}_{(s,a)\sim\mu}[f(s, \mu(s)) - f(s, a)] = 0$ for any $f$.$\hspace{12cm}\square$
So far, we have confirmed that, for a policy $\pi$, optimizing $\mathcal{L}_{\mu}(\pi, f) := \mathbb{E}_{(s, a) \sim \mu} [f(s, \pi(s)) - f(s, a)]$ amounts to optimizing a lower bound on the performance difference $J(\pi) - J(\mu)$.
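As a sanity check (not part of the paper), the identity $\mathcal{L}_{\mu}(\pi, Q^{\pi}) = (1-\gamma)(J(\pi) - J(\mu))$ can be verified numerically on a random tabular MDP; all quantities below are made up for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, s0 = 5, 3, 0.9, 0
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a, s']
R = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)           # learner policy
mu = rng.dirichlet(np.ones(A), size=S)           # behavior policy

def q_value(policy):
    Q = np.zeros((S, A))
    for _ in range(3000):
        Q = R + gamma * P @ (policy * Q).sum(axis=1)
    return Q

def occupancy(policy):                           # normalized d^policy(s, a)
    d = np.zeros(S); p = np.zeros(S); p[s0] = 1.0
    for t in range(3000):
        d += (1 - gamma) * gamma**t * p
        p = ((policy * p[:, None])[:, :, None] * P).sum(axis=(0, 1))
    return d[:, None] * policy

Q_pi = q_value(pi)
J = lambda policy: (policy[s0] * q_value(policy)[s0]).sum()       # J = V(s0)
d_mu = occupancy(mu)
# L_mu(pi, Q^pi) = E_{(s,a)~mu}[ Q^pi(s, pi(s)) - Q^pi(s, a) ]
L = (d_mu * ((pi * Q_pi).sum(axis=1, keepdims=True) - Q_pi)).sum()
print(np.isclose(L, (1 - gamma) * (J(pi) - J(mu))))               # True
```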
> $\textbf{Lemma 5}$.
> Let $\pi$ be an arbitrary competitor policy, $\hat{\pi} \in \Pi$ be some learned policy, and $f$ be an arbitrary function over $S \times A$. Then we have,
> \begin{align*}
> J(\pi) - J(\hat{\pi}) = \frac{1}{1 - \gamma} \left( \mathbb{E}_\mu \left[ (f - T^{\hat{\pi}} f)(s, a) \right] + \mathbb{E}_{\pi} \left[ (T^{\hat{\pi}} f - f)(s, a) \right] + \mathbb{E}_{\pi} \left[ f(s, \pi) - f(s, \hat{\pi}) \right] + L_\mu(\hat{\pi}, f) - L_\mu(\hat{\pi}, Q^{\hat{\pi}}) \right).
> \end{align*}
***Proof.***
Let $R^{f,\hat{\pi}}(s, a) := f(s, a) - \gamma \mathbb{E}_{s'|s,a}[f(s', \hat{\pi})]$ be a fake reward function given $f$ and $\hat{\pi}$. We use the subscript “$(\cdot)_{R^{f,\hat{\pi}}}$” to denote functions or operators under the true dynamics but the fake reward $R^{f,\hat{\pi}}$. Since $f(s, a) = (T^{\hat{\pi}}_{R^{f,\hat{\pi}}} f)(s, a)$, we know $f \equiv Q^{\hat{\pi}}_{R^{f,\hat{\pi}}}$.
We perform a performance decomposition:
$$ J(\pi) - J(\hat{\pi}) = (J(\pi) - J(\mu)) - (J(\hat{\pi}) - J(\mu)) $$
and rewrite the second term as
\begin{align*}
(1 - \gamma) (J(\hat{\pi}) - J(\mu)) &= L_\mu(\hat{\pi}, Q^{\hat{\pi}}) \\
&= \Delta(\hat{\pi}) + L_\mu(\hat{\pi}, f) \quad (\Delta(\hat{\pi}) := L_\mu(\hat{\pi}, Q^{\hat{\pi}}) - L_\mu(\hat{\pi}, f)) \\
&= \Delta(\hat{\pi}) + \mathbb{E}_\mu[f(s, \hat{\pi}) - f(s, a)] \\
&= \Delta(\hat{\pi}) + (1 - \gamma)(J_{R^{f,\hat{\pi}}}(\hat{\pi}) - J_{R^{f,\hat{\pi}}}(\mu)) \quad (\text{the identity from the proof of Proposition 4, applied in the fake-reward MDP}) \\
&= \Delta(\hat{\pi}) + (1 - \gamma) Q^{\hat{\pi}}_{R^{f,\hat{\pi}}}(s_0, \hat{\pi}) - \mathbb{E}_\mu[R^{f,\hat{\pi}}(s, a)] \\
&= \Delta(\hat{\pi}) + (1 - \gamma) f(s_0, \hat{\pi}) - \mathbb{E}_\mu[R^{f,\hat{\pi}}(s, a)] \quad (\text{by } f \equiv Q^{\hat{\pi}}_{R^{f,\hat{\pi}}})
\end{align*}
Therefore,
$$ (1 - \gamma)(J(\pi) - J(\hat{\pi})) = \underbrace{(1 - \gamma)(J(\pi) - f(s_0, \hat{\pi}))}_{(I)} + \underbrace{\mathbb{E}_\mu[R^{f,\hat{\pi}}(s, a)] - (1 - \gamma)J(\mu)}_{(II)} - \Delta(\hat{\pi}) $$
We first analyze the second term $(II)$. Expanding $R^{f,\hat{\pi}}$ by its definition and using $(1-\gamma)J(\mu) = \mathbb{E}_\mu[R(s,a)]$,
$$ (II) = \mathbb{E}_\mu[R^{f,\hat{\pi}}(s, a)] - (1 - \gamma)J(\mu) = \mathbb{E}_\mu[R^{f,\hat{\pi}}(s, a) - R(s, a)]= \mathbb{E}_\mu[(f - T^{\hat{\pi}} f)(s, a)] $$
We now write the first term $(I)$ as
\begin{align*}
(I) &= (1 - \gamma)(J(\pi) - f(s_0, \hat{\pi})) \\
&= \underbrace{(1 - \gamma)J(\pi) - \mathbb{E}_{d^\pi}[R^{f,\hat{\pi}}(s, a)]}_{(Ia)} + \underbrace{\mathbb{E}_{d^\pi}[R^{f,\hat{\pi}}(s, a)] - (1 - \gamma)f(s_0, \hat{\pi})}_{(Ib)}
\end{align*}
We analyze each of these in turn.
\begin{align*}
(Ib) &= \mathbb{E}_{d^\pi}[R^{f,\hat{\pi}}(s, a)] - (1 - \gamma)f(s_0, \hat{\pi}) \\
&= (1 - \gamma)\bigl(J_{R^{f,\hat{\pi}}}(\pi) - J_{R^{f,\hat{\pi}}}(\hat{\pi})\bigr) \\
&= \mathbb{E}_{d^\pi}[f(s, \pi) - f(s, \hat{\pi})]
\end{align*}
where the last step applies the performance difference lemma in the fake-reward MDP together with $f \equiv Q^{\hat{\pi}}_{R^{f,\hat{\pi}}}$.
On the other hand, we can write
$$\begin{align*}
(Ia) &= (1 - \gamma)J(\pi) - \mathbb{E}_{d^\pi}[R^{f,\hat{\pi}}(s, a)] \\
&= \mathbb{E}_{d^\pi}[R(s, a) - R^{f,\hat{\pi}}(s, a)] \\
&= \mathbb{E}_{d^\pi}[(T^{\hat{\pi}} f - f)(s, a)]
\end{align*}$$
Combining them all, we have
$$\begin{align*} J(\pi) - J(\hat{\pi}) &= \frac{1}{1 - \gamma} \left( (Ia) + (Ib) + (II) - \Delta(\hat{\pi}) \right) \\ &= \frac{1}{1 - \gamma} \left( \mathbb{E}_\mu \left[ (f - T^{\hat{\pi}} f)(s, a) \right] + \mathbb{E}_\pi \left[ (T^{\hat{\pi}} f - f)(s, a) \right] + \mathbb{E}_\pi [f(s, \pi) - f(s, \hat{\pi})] + L_\mu(\hat{\pi}, f) - L_\mu(\hat{\pi}, Q^{\hat{\pi}}) \right) \end{align*}$$
This completes the proof.$\hspace{17.5cm}\square$
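Since Lemma 5 is an exact identity, it can also be checked numerically for an arbitrary $f$, competitor $\pi$, learned $\hat{\pi}$, and behavior $\mu$ on a random tabular MDP. The sketch below (with made-up dimensions and seeds) does exactly that.
```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, s0 = 5, 3, 0.9, 0
P = rng.dirichlet(np.ones(S), size=(S, A))             # P[s, a, s']
R = rng.uniform(size=(S, A))
pi, pi_hat, mu = (rng.dirichlet(np.ones(A), size=S) for _ in range(3))
f = rng.uniform(size=(S, A))                           # arbitrary function over S x A

def q_value(policy):
    Q = np.zeros((S, A))
    for _ in range(3000):
        Q = R + gamma * P @ (policy * Q).sum(axis=1)
    return Q

def occupancy(policy):                                 # normalized d^policy(s, a)
    d = np.zeros(S); p = np.zeros(S); p[s0] = 1.0
    for t in range(3000):
        d += (1 - gamma) * gamma**t * p
        p = ((policy * p[:, None])[:, :, None] * P).sum(axis=(0, 1))
    return d[:, None] * policy

J = lambda policy: (policy[s0] * q_value(policy)[s0]).sum()
Tf = R + gamma * P @ (pi_hat * f).sum(axis=1)          # (T^{pi_hat} f)(s, a)
d_pi, d_mu = occupancy(pi), occupancy(mu)
L_mu = lambda g: (d_mu * ((pi_hat * g).sum(axis=1, keepdims=True) - g)).sum()   # L_mu(pi_hat, g)

rhs = ((d_mu * (f - Tf)).sum()                                        # E_mu[(f - T f)]
       + (d_pi * (Tf - f)).sum()                                      # E_pi[(T f - f)]
       + (d_pi.sum(axis=1) * ((pi - pi_hat) * f).sum(axis=1)).sum()   # E_pi[f(s,pi) - f(s,pi_hat)]
       + L_mu(f) - L_mu(q_value(pi_hat))) / (1 - gamma)
print(np.isclose(J(pi) - J(pi_hat), rhs))              # True: the decomposition holds exactly
```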
> $\textbf{Lemma 6}$.
> Let $\pi$ be an arbitrary competitor policy. Also let $\pi_k$ and $f_k$ be obtained by Algorithm 1 for $k \in [K]$. Then with high probability, for any $k \in [K]$,
> \begin{align*}
> (1 - \gamma) (J(\pi) - J(\pi_k)) &\leq \mathbb{E}_\mu \left[ f_k - T^{\pi_k} f_k \right] + \mathbb{E}_\pi \left[ T^{\pi_k} f_k - f_k \right] + \mathbb{E}_\pi \left[ f_k(s, \pi) - f_k(s, \pi_k) \right] \\
> &\quad + \mathcal{O} \left( V_{\max} \sqrt{ \frac{\log \bigl( |\mathcal{N}_\infty(\mathcal{F}, V_{\max}/N)|\, |\mathcal{N}_{\infty,1}(\Pi, 1/N)| / \delta \bigr)}{N} } + \sqrt{\epsilon_\mathcal{F}} \right) \\
> &\quad + \beta \cdot \mathcal{O} \left( \frac{V_{\max}^2 \log \bigl( |\mathcal{N}_\infty(\mathcal{F}, V_{\max}/N)|\, |\mathcal{N}_{\infty,1}(\Pi, 1/N)| / \delta \bigr)}{N} + \epsilon_\mathcal{F} \right).
> \end{align*}
***Proof***
By Lemma 5, we have
$$ J(\pi) - J(\pi_k) = \frac{1}{1 - \gamma} \left( \mathbb{E}_\mu \left[ f_k - T^{\pi_k} f_k \right] + \mathbb{E}_\pi \left[ T^{\pi_k} f_k - f_k \right] + \mathbb{E}_\pi [f_k(s, \pi) - f_k(s, \pi_k)] + L_\mu(\pi_k, f_k) - L_\mu(\pi_k, Q^{\pi_k}) \right). $$
We now bound the term $L_\mu(\pi_k, f_k) - L_\mu(\pi_k, Q^{\pi_k})$. Define
\begin{align*}
f_{\pi} &:= \arg \min_{f \in \mathcal{F}} \sup_{\text{admissible } \nu} \| f - T^\pi f \|_{2, \nu}^2, \quad \forall \pi \in \Pi, \\
\epsilon_{\text{stat}} &:= \mathcal{O} \left( \frac{V_{\max}^2 \log \bigl( |\mathcal{N}_\infty(\mathcal{F}, V_{\max}/N)|\, |\mathcal{N}_{\infty,1}(\Pi, 1/N)| / \delta \bigr)}{N} \right), \\
\epsilon_{\pi} &:= \epsilon_{\text{stat}} + \mathcal{O} (\epsilon_{\mathcal{F}}).
\end{align*}
Then, by standard concentration arguments, with high probability, for any $k \in [K]$, the empirical Bellman error of $f_{\pi_k}$ is bounded as
$$ \mathcal{E}_\mathcal{D} (\pi_k, f_{\pi_k}) \leq \epsilon_{\pi}. $$
For $L_\mu(\pi_k, Q^{\pi_k}) - L_\mu(\pi_k, f_{\pi_k})$, we have
\begin{align}
L_\mu(\pi_k, Q^{\pi_k}) &= \mathbb{E}_\mu \left[ Q^{\pi_k} (s, \pi_k) - Q^{\pi_k} (s, a) \right] \\
&= (1 - \gamma) \left( J(\pi_k) - J(\mu) \right) \\
&= \mathbb{E}_\mu \left[ f_{\pi_k} (s, \pi_k) - (T^{\pi_k} f_{\pi_k})(s, a) \right] + \mathbb{E}_{d^{\pi_k}} \left[ (T^{\pi_k} f_{\pi_k})(s, a) - f_{\pi_k} (s, a) \right] \\
&= L_\mu(\pi_k, f_{\pi_k}) + \mathbb{E}_\mu \left[ (f_{\pi_k} - T^{\pi_k} f_{\pi_k})(s, a) \right] + \mathbb{E}_{d^{\pi_k}} \left[ (T^{\pi_k} f_{\pi_k} - f_{\pi_k})(s, a) \right],
\end{align}
where the third equality telescopes over trajectories of $\mu$ and $\pi_k$: the first term equals $(1-\gamma)(f_{\pi_k}(s_0, \pi_k) - J(\mu))$ and the second corrects for $J(\pi_k) - f_{\pi_k}(s_0, \pi_k)$. Hence
\begin{align}
\bigl| L_\mu(\pi_k, f_{\pi_k}) - L_\mu(\pi_k, Q^{\pi_k}) \bigr| &\leq \| f_{\pi_k} - T^{\pi_k} f_{\pi_k} \|_{2, \mu} + \| T^{\pi_k} f_{\pi_k} - f_{\pi_k} \|_{2, d^{\pi_k}} \\
&\leq \mathcal{O} (\sqrt{\epsilon_{\pi}}),
\end{align}
where the first step is Jensen's inequality and the last step uses the definition of $f_{\pi_k}$ together with Assumption 1 (with both $\mu$ and $d^{\pi_k}$ treated as admissible distributions). Also, by applying standard concentration inequalities on $\mathcal{L}_{\mathcal{D}}$:
\begin{align}
| \mathcal{L}_\mu(\pi_k, f_k) - \mathcal{L}_{\mathcal{D}}(\pi_k, f_k) | + | \mathcal{L}_\mu(\pi_k, f_{\pi_k}) - \mathcal{L}_{\mathcal{D}}(\pi_k, f_{\pi_k}) | \leq \sqrt{\epsilon_{\text{stat}}}, \quad \forall k \in [K].
\end{align}
Therefore,
\begin{align}
L_\mu(\pi_k, f_k) - L_\mu(\pi_k, Q^{\pi_k}) &\leq L_\mu(\pi_k, f_k) + \beta \mathcal{E}_{\mathcal{D}}(\pi_k, f_k) - L_\mu(\pi_k, Q^{\pi_k}) \\
&\leq L_\mu(\pi_k, f_k) + \beta \mathcal{E}_{\mathcal{D}}(\pi_k, f_k) - L_\mu(\pi_k, f_{\pi_k}) - \beta \mathcal{E}_{\mathcal{D}}(\pi_k, f_{\pi_k}) \\
&\quad + \mathcal{O} (\sqrt{\epsilon_{\mathcal{F}}}) + \sqrt{\epsilon_{\text{stat}}} + \beta \cdot \mathcal{O} (\epsilon_{\text{stat}} + \epsilon_{\mathcal{F}}) \\
&\leq \mathcal{O} (\sqrt{\epsilon_{\mathcal{F}}}) + \sqrt{\epsilon_{\text{stat}}} + \beta \cdot \mathcal{O} (\epsilon_{\text{stat}} + \epsilon_{\mathcal{F}}) \\
&\leq \mathcal{O} \left( V_{\max} \sqrt{ \frac{\log \bigl( |\mathcal{N}_\infty(\mathcal{F}, V_{\max}/N)|\, |\mathcal{N}_{\infty,1}(\Pi, 1/N)| / \delta \bigr)}{N} } + \sqrt{\epsilon_{\mathcal{F}}} \right) \\
&\quad + \beta \cdot \mathcal{O} \left( \frac{V_{\max}^2 \log \bigl( |\mathcal{N}_\infty(\mathcal{F}, V_{\max}/N)|\, |\mathcal{N}_{\infty,1}(\Pi, 1/N)| / \delta \bigr)}{N} + \epsilon_{\mathcal{F}} \right),
\end{align}
where the second inequality substitutes the two bounds above, and the third uses that $f_k$ minimizes the empirical objective $\mathcal{L}_{\mathcal{D}}(\pi_k, \cdot) + \beta \mathcal{E}_{\mathcal{D}}(\pi_k, \cdot)$ over $\mathcal{F}$ (so its value is no larger than that of $f_{\pi_k}$), together with the concentration bound on $\mathcal{L}_{\mathcal{D}}$.
This completes the proof.$\hspace{17.5cm}\square$
We can now state the theorem comparing the performance of an arbitrary comparator policy $\pi \in \Pi$ with the output $\bar{\pi}$ of Algorithm 1 (ATAC), a mixture of the iterates $\pi_1, \ldots, \pi_K$.
> $\textbf{Theorem 7}$.
> Let $|\mathcal{D}| = N$, $C \ge 1$ be any constant, $\nu \in \Delta(S \times A)$ be an arbitrary distribution that satisfies $\max_{k \in [K]} \mathcal{C} (\nu; \mu, \mathcal{F}, \pi_k) \le C$, and $\pi \in \Pi$ be an arbitrary competitor policy. Then, when $\epsilon_{\mathcal{F}} = \epsilon_{\mathcal{F},\mathcal{F}} = 0$, choosing $\beta = \Theta \left( \sqrt[3]{\frac{V_{\max} N^2}{d_{\mathcal{F},\Pi}^2}} \right)$, where $d_{\mathcal{F},\Pi}$ denotes the log covering number term appearing in Lemma 6, with high probability:
> \begin{align*}
> J(\pi) - J(\bar{\pi}) &\le \frac{\epsilon_{\text{opt}}^{\pi}}{K} + \mathcal{O} \left( \frac{V_{\max} \sqrt{C(d_{\mathcal{F},\Pi})^{1/3}}}{(1-\gamma)N^{1/3}} \right) \\
> &\quad + \frac{1}{K(1-\gamma)} \sum_{k=1}^{K} \langle d^{\pi} \setminus \nu, f_k - T^{\pi_k} f_k \rangle,
> \end{align*}
> where $(d^{\pi} \setminus \nu)(s, a) := \max(d^{\pi}(s, a) - \nu(s, a), 0)$, and $\langle d, f \rangle := \sum_{(s,a) \in S \times A} d(s, a) f(s, a)$ for any $d$ and $f$.
At a high level, Theorem 7 shows that ATAC can compete with any policy $\pi \in \Pi$ given a large enough dataset, provided that the optimization regret is low and the data distribution $\mu$ covers $d^\pi$ well. Specifically, choosing $\nu = d^\pi$ eliminates the off-support term, so the guarantee scales with $\max_k \mathcal{C}(d^\pi; \mu, \mathcal{F}, \pi_k)$. We can also benefit from other choices of $\nu$ that are better covered by $\mu$ while leaving only a small off-support mass $\|d^\pi \setminus \nu\|_1$. Moreover, the off-support term can be small when a small Bellman error under $\mu$ implies a small Bellman error out of support, thanks to the structure of $\mathcal{F}$.
## Discussions and Takeaways
The ATAC paper presents a comprehensive approach that combines both theoretical rigor and practical implementation to achieve robust policy improvement in offline RL environments. Here are the key insights and evaluations:
* **Adversarial Loss for Robust Policy Evaluation**:
ATAC integrates adversarial training into the actor-critic framework. By incorporating the adversarial loss, the critic not only estimates the state-action value but also provides a lower bound on the learned policy's performance relative to the behavior policy, thus ensuring that the learned policy improves upon the behavior policy.
* **Practical Implementation and Performance**:
The practical implementation of ATAC demonstrates impressive performance across various values of $\beta$. If we set $\beta = 0$, which amounts to discarding the reward information, ATAC becomes an imitation-learning-style algorithm. More generally, if the function class $\mathcal{F}$ is sufficiently expressive to approximate all bounded Lipschitz functions, then setting $\beta = 0$ makes the objective similar to behavior cloning, where an integral probability metric, as in GANs, is used to match the occupancy measures of $\pi$ and $\mu$.
In addition, the practical implementation of ATAC uses several techniques:
* entropy regularization in the policy update;
* a projection that keeps the critic's weights $\ell_2$-bounded;
* the *DQRA* loss (a sketch in code follows below):
$$\mathbb{E}_{(s,a,r,s') \sim \mu}\Bigl[\tfrac{1}{2}\bigl(Q(s,a) - (r+\gamma\, Q^{Target}(s',\pi(s')))\bigr)^{2} + \tfrac{1}{2}\bigl(Q(s,a) - (r+\gamma\, Q(s',\pi(s')))\bigr)^{2}\Bigr],$$
where $Q^{Target}(s',a')=\min_{k=1,2} Q_{k}^{Target}(s',a')$ is the smaller of the two delayed (target) critics, and in the second, residual term the gradient also flows through $Q(s',\pi(s'))$.
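Here is a rough PyTorch sketch of such a DQRA-style critic loss. The function names, the toy stand-ins, and the equal $1/2$ weighting are assumptions made here for illustration rather than the authors' exact implementation.
```python
import torch

def dqra_loss(q, q_targ1, q_targ2, policy, batch, gamma=0.99):
    """DQRA-style critic loss: half standard double-Q TD loss (delayed targets),
    half residual-gradient loss (gradient also flows through the next-state Q)."""
    s, a, r, s_next = batch
    a_next = policy(s_next)
    with torch.no_grad():                       # double-Q target from the delayed copies
        y_delayed = r + gamma * torch.min(q_targ1(s_next, a_next), q_targ2(s_next, a_next))
    y_residual = r + gamma * q(s_next, a_next)  # no stop-gradient: residual term
    q_sa = q(s, a)
    return 0.5 * ((q_sa - y_delayed) ** 2).mean() + 0.5 * ((q_sa - y_residual) ** 2).mean()

# Tiny usage example with toy stand-ins (hypothetical, just to show the shapes work):
q = lambda s, a: (s * a).sum(-1)
q_targ1 = q_targ2 = q
policy = lambda s: torch.tanh(s)
batch = (torch.randn(8, 3), torch.randn(8, 3), torch.randn(8), torch.randn(8, 3))
print(dqra_loss(q, q_targ1, q_targ2, policy, batch))
```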
**Difference between ATAC and CQL (Conservative Q-Learning)**
ATAC may look similar to CQL (Conservative Q-Learning), so it is worth comparing the two. Theoretically, ATAC seeks the solution of the Stackelberg game defined above, whereas CQL performs the iterative update:
$$f_{k+1} \leftarrow \arg\min_{f \in \mathcal{F}} \max_{\pi \in \Pi} \alpha \mathbb{E}_{\mu} [ f(s, \pi(s)) - f(s, a) ] - \mathcal{R}(\pi) + \mathbb{E}_{\mu} [ ((f - \mathcal{T}^{\pi_k} f_k)(s, a))^2 ]
$$
ATAC states a clear objective, while CQL is defined only through an iterative update. The two algorithms share the same idea: modeling $Q$ so that it gives a lower bound on the value of the policy $\pi$.
Consider a bandit scenario (so that $T^{\pi} f = r$), set $\alpha=\frac{1}{\beta}$ in CQL, and multiply its objective by $\beta$; the two formulations become
\begin{align*}
\hat{\pi}^\star &\in \arg\max_{\pi \in \Pi} \min_{f \in \mathcal{F}} \mathbb{E}_{\mu} [f(s, \pi(s)) - f(s, a)] + \beta \mathbb{E}_{\mu} [((f - r )(s, a))^2] &\quad (\text{ATAC}) \\
\hat{f}^\star &\leftarrow \arg\min_{f \in \mathcal{F}} \max_{\pi \in \Pi} \mathbb{E}_{\mu} [f(s, \pi(s)) - f(s, a)] + \beta \mathbb{E}_{\mu} [((f - r)(s, a))^2] - \beta \mathcal{R}(\pi) &\quad (\text{CQL})
\end{align*}
We then ignore the regularization term $\mathcal{R}(\pi)$ in CQL, since it can be absorbed into the policy class.
The difference between the two approaches lies in the order of maximization and minimization. It is well known that max-min and min-max generally yield different solutions unless the objective is convex-concave with respect to the policy and critic parameterizations. For instance, in this specific bandit problem, suppose the states and actions are tabular. The objective is convex-concave when $\Pi$ and $\mathcal{F}$ contain all tabular functions, but this convex-concave property is lost when $\mathcal{F}$ contains only a finite set of functions. In the latter scenario, CQL and ATAC would result in very different policies, and CQL would not benefit from the desirable properties of ATAC.
We take a simple example with the constraint that $\mathcal{F}$ contains only two tabular functions $\{f_1, f_2\}$:
$S = \{s_1, s_2\}$ (with $s_2$ a terminal state), $A = \{a_1,a_2,a_3\}$; for any action $a_k$, $P(s_2 \mid s_1, a_k)=1$; rewards $r(s_1, a_1) = 0.8$, $\ r(s_1, a_2) = 0.6$, $\ r(s_1, a_3) = 0.2$.
For simplicity, we drop the term $\beta \mathbb{E}_{\mu} [((f - r )(s, a))^2]$ here (e.g., take $\beta$ small), since we only want to illustrate the effect of the order of $\max$ and $\min$.
$$f_1(s, a) = \begin{cases}
0.9 & \text{if } (s, a) = (s_1, a_1) \\
0.1 & \text{if } (s, a) = (s_1, a_2) \\
0.5 & \text{if } (s, a) = (s_1, a_3) \\
0 & \text{if } s = s_2
\end{cases} \quad
f_2(s, a) = \begin{cases}
0.2 & \text{if } (s, a) = (s_1, a_1) \\
1.1 & \text{if } (s, a) = (s_1, a_2) \\
0 & \text{if } (s, a) = (s_1, a_3) \\
0 & \text{if } s = s_2
\end{cases}$$
Behavior policy $\mu(a|s_1) = [\mu(a_1|s_1),\mu(a_2|s_1),\mu(a_3|s_1)]= [0.6, 0.2, 0.2]$
Consider three deterministic policies $\pi(\cdot \mid s_1)=[1,0,0], [0,1,0], [0,0,1]$.
For policy $\pi=[1,0,0]$,
\begin{align*}
\mathbb{E}_{\mu} [f_1(s, \pi(s)) - f_1(s, \mu(s))]
&=0.9−(0.6⋅0.9+0.2⋅0.1+0.2⋅0.5)\\
&=0.9−(0.54+0.02+0.1)\\
&=0.9−0.66=0.24 \\
\mathbb{E}_{\mu} [f_2(s, \pi(s)) - f_2(s, \mu(s))]
&= 0.2 - (0.6 \cdot 0.2 + 0.2 \cdot 1.1 + 0.2 \cdot 0) \\
&= 0.2 - (0.12 + 0.22 + 0) \\
&= 0.2 - 0.34 = -0.14
\end{align*}
For policy $\pi=[0,1,0]$,
\begin{align*}
\mathbb{E}_{\mu} [f_1(s, \pi(s)) - f_1(s, \mu(s))]
&=0.1−(0.6⋅0.9+0.2⋅0.1+0.2⋅0.5) \\
&=0.1−(0.54+0.02+0.1) \\
&=0.1−0.66=−0.56 \\
\mathbb{E}_{\mu} [f_2(s, \pi(s)) - f_2(s, \mu(s))]
&=1.1−(0.6⋅0.2+0.2⋅1.1+0.2⋅0) \\
&=1.1−(0.12+0.22+0) \\
&=1.1−0.34=0.76
\end{align*}
For policy $\pi=[0,0,1]$,
\begin{align*}
\mathbb{E}_{\mu} [f_1(s, \pi(s)) - f_1(s, \mu(s))]
&=0.5−(0.6⋅0.9+0.2⋅0.1+0.2⋅0.5)\\
&=0.5−(0.54+0.02+0.1)\\
&=0.5−0.66=−0.16 \\
\mathbb{E}_{\mu} [f_2(s, \pi(s)) - f_2(s, \mu(s))]
&=0−(0.6⋅0.2+0.2⋅1.1+0.2⋅0)\\
&=0−(0.12+0.22+0) \\
&=0−0.34=−0.34
\end{align*}
| $\pi(s)$ | $\mathbb{E}_{\mu} [f_1(s, \pi(s)) - f_1(s, \mu(s))]$ | $\mathbb{E}_{\mu} [f_2(s, \pi(s)) - f_2(s, \mu(s))]$ | worst case |
| ----------- | ----- | ----- | ----- |
| $[1, 0, 0]$ | 0.24 | -0.14 | -0.14 |
| $[0, 1, 0]$ | -0.56 | 0.76 | -0.56 |
| $[0, 0, 1]$ | -0.16 | -0.34 | -0.34 |
Under $\max \min$ (the ATAC order), which maximizes the worst case over critics, the chosen policy is $\pi(s)=[1,0,0]$.
| $\pi(s)$ | regret of $f_1$ | regret of $f_2$ | max regret|
| ----------- | ---------------- | ---------------- | --- |
| $[1, 0, 0]$ | 0.24-0.24=0 | 0.76-(-0.14)=0.9 | 0.9 |
| $[0, 1, 0]$ | 0.24-(-0.56)=0.8 | 0.76-0.76=0 | 0.8 |
| $[0, 0, 1]$ | 0.24-(-0.16)=0.4 | 0.76-(-0.34)=1.1 | 1.1 |
Under $\min \max$, interpreted here from the policy side as minimizing the maximum regret across critics, the chosen policy is $\pi(s)=[0,1,0]$ — a different answer from the $\max \min$ solution.
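The two tables can be reproduced with a few lines of NumPy; the snippet below simply recomputes the numbers above.
```python
import numpy as np

f1 = np.array([0.9, 0.1, 0.5])            # f_1(s_1, .)
f2 = np.array([0.2, 1.1, 0.0])            # f_2(s_1, .)
mu = np.array([0.6, 0.2, 0.2])            # behavior policy mu(.|s_1)
policies = np.eye(3)                      # the three deterministic policies

# L(pi, f) = E_mu[f(s, pi(s)) - f(s, a)] for each policy (rows) and critic (columns).
L = np.stack([policies @ f - mu @ f for f in (f1, f2)], axis=1)
print(L)                                  # matches the first table

print("max-min policy:", np.argmax(L.min(axis=1)))              # -> 0, i.e. [1, 0, 0]
regret = L.max(axis=0) - L                # regret of each policy under each critic
print("min-max-regret policy:", np.argmin(regret.max(axis=1)))  # -> 1, i.e. [0, 1, 0]
```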
**My thought**
This work does a great job of extending existing work and providing comprehensive, detailed proofs. The code implementation is clean and simple.
**Takeaways**
1. The use of an adversarial component in the offline actor-critic framework, its intention, and the accompanying theoretical analysis.
2. Theoretical guarantees of ATAC (Adversarially Trained Actor Critic).
3. Robust policy improvement in ATAC.
4. An example comparing ATAC with the similar algorithm CQL (Conservative Q-Learning).
## References
1. [Adversarially Trained Actor Critic for Offline Reinforcement Learning](https://arxiv.org/pdf/2202.02446): ATAC
2. [Off-Policy Actor-Critic](https://arxiv.org/abs/1205.4839): Actor critic paper
3. [Conservative Q-Learning for Offline Reinforcement Learning](https://arxiv.org/abs/2006.04779): Similar idea and design to ATAC.
4. [The Approximately Optimal Approximate Reinforcement Learning](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf): Performance difference lemma
5. [Bellman-consistent Pessimism for Offline Reinforcement Learning](https://arxiv.org/abs/2106.06926): Many of the proofs, designs, and assumptions come from this paper, but it uses absolute pessimism, unlike ATAC's relative pessimism.
6. [Wasserstein GAN](https://arxiv.org/abs/1701.07875): When $\beta = 0$, ATAC reduces to a GAN/IPM-style objective.