# Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms
Link: https://arxiv.org/abs/2109.12286
Authors: Liyuan Zheng, Tanner Fiez, Zane Alumbaugh, Benjamin Chasnov, Lillian J. Ratliff
Written by: Pronoma Banerjee
---
### Abstract and Introduction
The paper discusses the following:
1. It models the interaction between the actor and the critic in actor-critic reinforcement learning algorithms as a two-player general-sum Stackelberg game with a leader-follower structure.
2. Modeling contributions: Proposes a framework for Stackelberg actor-critic algorithms in which the leader updates using the total derivative of its objective, defined via the implicit function theorem, while the follower updates using its individual gradient, and contrasts this with the previous approach of using individual gradients for both the actor and the critic.
3. Theoretical contributions: Develops a policy gradient theorem (Theorem 1) for the refined total-derivative update, builds a meta-framework of Stackelberg actor-critic algorithms adapting the standard actor-critic, deep deterministic policy gradient, and soft actor-critic algorithms, and provides a local convergence guarantee (Theorem 2) to a local Stackelberg equilibrium defined by gradient-based sufficient conditions.
4. Experimental contributions: Shows through examples that Stackelberg gradient dynamics mitigate cycling and accelerate convergence. Further experiments show that Stackelberg actor-critic algorithms always perform at least as well as, and often significantly outperform, their standard actor-critic counterparts.
---
### Motivation and Preliminaries
We first state the implicit function theorem, which is used to define the objective function of the leader in a Stackelberg game:
**Implicit function theorem**
Let $f=(f_1, f_2, ..., f_n)$ be a vector-valued function defined on an open set $S$ in $R^{n+k}$ with values in $R^n$. Suppose $f \in C'$ on $S$. Let $(x_0,t_0)$ be a point in $S$ for which $f(x_0, t_0)=0$ and for which the $n \times n$ determinant $det[D_j f_i(x_0, t_0)] \neq 0$. Then there exists a $k$-dimensional open set $T_0$ containing $t_0$ and one and only one vector-valued function $g$, defined on $T_0$ and having values in $R^n$, such that:
1. $g \in C'$ on $T_0$
2. $g(t_0) = x_0$
3. $f(g(t),t)=0$ for all $t \in T_0$
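As a quick illustration of how the theorem is used (a standard one-dimensional textbook example, not from the paper), take $n = k = 1$ and $f(x,t) = x^2 + t^2 - 1$:
```latex
% Standard 1-D illustration of the implicit function theorem (n = k = 1).
\[
f(x,t) = x^2 + t^2 - 1, \qquad (x_0, t_0) = (1, 0): \quad
f(x_0, t_0) = 0, \quad D_x f(x_0, t_0) = 2 x_0 = 2 \neq 0 .
\]
\[
\text{Hence there is a unique } g \text{ on } T_0 = (-1, 1) \text{ with } g(0) = 1
\text{ and } f(g(t), t) = 0, \text{ namely } g(t) = \sqrt{1 - t^2} .
\]
\[
\text{Differentiating } f(g(t), t) = 0 \text{ gives the implicit gradient } \quad
g'(t) = -\frac{D_t f}{D_x f} = -\frac{t}{g(t)} ,
\]
% the same construction used below to obtain the best-response gradient of the follower.
```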
**Game theoretic preliminaries**
A Stackelberg game is a game between two agents where one agent is deemed the leader and the other the follower. The leader optimizes its objective assuming that the follower will play its best response. We frame the leader's objective using the implicit function theorem.
Let $f_1(x_1,x_2)$ and $f_2(x_1,x_2)$ be the objective functions that the leader and follower want to minimize, respectively, where $x_1 ∈ X_1 ⊆ R^{d_1}$ and $x_2 ∈ X_2 ⊆ R^{d_2}$ are their decision variables or strategies and $x = (x_1, x_2) ∈ X_1 × X_2$ is their joint strategy.
The leader L aims to solve the following problem:
$\min_{x_1∈X_1}\{f_1(x_1,x_2) \mid x_2 ∈ \arg\min_{y∈X_2} f_2(x_1,y)\}$
and the follower F aims to solve the following problem:
$\min_{x_2∈X_2} f_2(x_1,x_2)$
Since the leader assumes the follower chooses a best response $x^∗_2(x_1) = \arg\min_y f_2(x_1,y)$, the follower's decision variables are implicitly a function of the leader's. The leader exploits this when deriving sufficient conditions for its optimization problem from the total derivative of its cost function:
$∇f_1(x_1, x^∗_2(x_1)) = ∇_1f_1(x) + (∇x^∗_2(x_1))^⊤∇_2f_1(x)$.
where $∇x^∗_2(x_1) = −(∇^2_2f_2(x))^{−1}∇_{21}f_2(x)$
Hence, a point $x = (x_1, x_2)$ is a local solution to L's problem if $∇f_1(x_1, x^∗_2(x_1)) = 0$ and $∇^2f_1(x_1, x^∗_2(x_1)) > 0$. For the follower’s problem, sufficient conditions for local optimality are $∇_2f_2(x_1, x_2) = 0$ and $∇^2_2f_2(x_1, x_2) > 0$.
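To make the total derivative concrete, here is a small worked example on a toy quadratic game of our own (an illustration, not from the paper), with a constant $a > 0$:
```latex
% Toy quadratic game (illustration only): leader cost f_1, follower cost f_2.
\[
f_1(x_1, x_2) = (x_1 - 1)^2 + x_1 x_2, \qquad
f_2(x_1, x_2) = (x_2 - a x_1)^2 .
\]
% The follower's best response is x_2^*(x_1) = a x_1, and the implicit formula recovers its slope:
\[
\nabla x_2^*(x_1) = -\big(\nabla_2^2 f_2(x)\big)^{-1} \nabla_{21} f_2(x) = -(2)^{-1}(-2a) = a .
\]
% Substituting into the total derivative of the leader's cost:
\[
\nabla f_1(x_1, x_2^*(x_1))
= \underbrace{2(x_1 - 1) + x_2}_{\nabla_1 f_1(x)}
+ \underbrace{a\, x_1}_{(\nabla x_2^*(x_1))^{\top} \nabla_2 f_1(x)} ,
\]
% which, evaluated at x_2 = a x_1, vanishes at x_1^* = 1/(1 + a), x_2^* = a/(1 + a),
% with the second-order condition 2 + 2a > 0 satisfied.
```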
Next we define a concept that gives the sufficient conditions for a local Stackelberg equilibrium:
**Differential Stackelberg Equilibrium**
The joint strategy $x^∗ = (x^∗_1,x^∗_2) ∈ X_1 × X_2$ is a differential Stackelberg equilibrium if $∇f_1(x^∗) = 0, ∇_2f_2(x^∗) = 0, ∇^2f_1(x^∗) > 0,$ and $∇^2_2f_2(x^∗) > 0.$
**Stackelberg gradient dynamics**
These dynamics are derived from the first-order gradient-based sufficient conditions and are given by:
$x_{1,k+1} =x_{1,k} − α_1∇f_1(x_{1,k},x_{2,k})$
$x_{2,k+1} =x_{2,k} − α_2∇_2f_2(x_{1,k},x_{2,k})$
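Below is a minimal Python sketch of both update rules on the toy quadratic game from the worked example above (the costs $f_1$, $f_2$, the constant $a$, the step sizes, and the initialization are illustrative choices of ours, not the paper's). It shows that the leader's total-derivative update drives the iterates to the Stackelberg solution $(1/(1+a),\, a/(1+a))$, while individual gradients settle at a different stationary point.
```python
a = 1.0  # slope of the follower's best response x2*(x1) = a * x1 (illustrative)

# Gradients of the toy costs f1 = (x1 - 1)^2 + x1*x2 and f2 = (x2 - a*x1)^2.
grad1_f1 = lambda x1, x2: 2 * (x1 - 1) + x2      # ∂f1/∂x1
grad2_f1 = lambda x1, x2: x1                     # ∂f1/∂x2
grad2_f2 = lambda x1, x2: 2 * (x2 - a * x1)      # ∂f2/∂x2
hess22_f2 = lambda x1, x2: 2.0                   # ∂²f2/∂x2²
jac21_f2 = lambda x1, x2: -2 * a                 # ∂²f2/∂x2∂x1

def total_derivative_f1(x1, x2):
    # ∇f1 = ∇_1 f1 − ∇_21 f2 (∇_2² f2)⁻¹ ∇_2 f1  (scalar case)
    return grad1_f1(x1, x2) - jac21_f2(x1, x2) / hess22_f2(x1, x2) * grad2_f1(x1, x2)

def run(stackelberg, steps=2000, lr=0.05):
    x1, x2 = 3.0, -2.0  # arbitrary initialization
    for _ in range(steps):
        g1 = total_derivative_f1(x1, x2) if stackelberg else grad1_f1(x1, x2)
        g2 = grad2_f2(x1, x2)            # the follower always uses its individual gradient
        x1, x2 = x1 - lr * g1, x2 - lr * g2
    return x1, x2

print("individual :", run(stackelberg=False))  # -> (2/3, 2/3), individual-gradient stationary point
print("stackelberg:", run(stackelberg=True))   # -> (1/2, 1/2), the Stackelberg solution
```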
**Motivating Examples**
In all actor-critic algorithms, the actor and critic objectives exhibit a simple hidden structure in their parameters. The actor objective typically has a hidden linear structure in terms of the parameters $θ$, which is abstractly of the form $Q_w(θ) = w^⊤μ(θ)$. The critic objective usually has a hidden quadratic structure in the parameters $w$, which is abstractly of the form $(R(θ) − Q_w(θ))^2$. The actor seeks the action that maximizes the value indicated by the critic, while the critic approximates the rewards of the actions generated by the actor. For simplicity, we assume the critic only minimizes the mean square error for the sample actions generated by the current actor $θ$.
Next we briefly go through the various actor-critic algorithms. The vanilla actor-critic and the deep deterministic policy gradient optimize their objectives by performing individual gradient dynamics (gradient ascent and descent) on the actor and critic parameters, respectively. The soft actor-critic exhibits a similar structure as the others, but with entropic regularization in the actor objective.
1. **Actor Critic (AC):**
* Relies on the critic function $Q_w(s,a)$ parameterized by $w$ to approximate $Q^\pi(s,a)$.
* The actor, which is parameterized by $\theta$, has the objective:
$J(θ,w) = E_{s∼ρ,a∼π_θ(·|s)}[Q_w(s,a)]$
(from the policy gradient theorem).
* The objective is optimized using gradient ascent where:
$∇_θJ(θ,w) = E_{s∼ρ,a∼π_θ(·|s)}[∇_θ \log(π_θ(a|s))Q_w(s,a)]$.
* The critic, which is parameterized by $w$, has the objective of minimizing the mean square error between the Q-functions:
$L(θ, w) = E_{s∼ρ,a∼π_θ(·|s)}[(Q_w(s,a)−Q^π(s,a))^2]$,
where $Q^\pi(s,a)$ is approximated by Monte Carlo estimation or bootstrapping.
2. **Deep Deterministic Policy Gradient (DDPG):**
* Deterministic actor $\mu_\theta(s): S \rightarrow A$ with objective:
$J(θ, w) = E_{ξ∼D}[Q_w(s, μ_θ(s))]$.
* Critic objective is the mean square Bellman error:
$L(θ,w) = E_{ξ∼D}[(Q_w(s,a)−(r+γQ_0(s',μ_θ(s'))))^2]$,
where $ξ = (s, a, r, s')$, $D$ is a replay buffer, and $Q_0$ is a target $Q$ network.
3. **Soft Actor Critic (SAC):**
* Exploits double Q-learning and employs entropic regularization to encourage exploration.
* Actor's objective:
$J(θ, w) = E_{ξ∼D}[\min_{i=1,2} Q_{w_i}(s, a_θ(s)) − η \log(π_θ(a_θ(s)|s))]$,
where $a_\theta(s)$ is a sample from $\pi_\theta(.|s)$ and $\eta$ is the entropy regularization coefficient.
* The critic parameter is the union of both Q networks' parameters, $w = \{w_1, w_2\}$, and the critic objective is defined correspondingly by:
$L(θ, w) = E_{ξ∼D}[\sum_{i=1,2}(Q_{w_i}(s, a) − y(r, s'))^2]$,
where
$y(r,s') = r + γ(\min_{i=1,2} Q_{0,i}(s',a_θ(s')) − η \log(π_θ(a_θ(s')|s')))$.
The target networks in DDPG and SAC are updated by taking the Polyak average of the network parameters over the course of training, and the actor and critic networks are updated by individual gradient dynamics:
$θ ← θ + α_θ ∇_θ J (θ, w)$
$w ← w − α_w ∇_w L(θ, w)$
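As a schematic Python sketch of this update form (the quadratic stand-in objectives, dimensions, and step sizes below are our own placeholders, not actual actor or critic losses):
```python
import numpy as np

# Placeholder stand-ins: the "actor" ascends J(theta, w) = w.theta - 0.5*||theta||^2
# and the "critic" descends L(theta, w) = ||w - r||^2 for a fixed vector r.
r = np.ones(4)
grad_theta_J = lambda theta, w: w - theta
grad_w_L = lambda theta, w: 2 * (w - r)

theta = np.zeros(4)
w = np.random.randn(4)
w_target = w.copy()                       # target-network parameters (Polyak-averaged copy)
alpha_theta, alpha_w, tau = 1e-2, 1e-2, 0.005

for _ in range(5000):
    # Individual gradient dynamics: ascent on J for the actor, descent on L for the critic.
    theta = theta + alpha_theta * grad_theta_J(theta, w)
    w = w - alpha_w * grad_w_L(theta, w)
    # Polyak (soft) averaging of the target parameters, as used for the DDPG/SAC targets.
    w_target = tau * w + (1 - tau) * w_target

print(theta, w, w_target)  # all three approach r = (1, 1, 1, 1)
```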
---
### Stackelberg framework
**Meta algorithm**
---
**Algorithm: Stackelberg Actor-Critic Framework**
**Input**: actor-critic algorithm ALG, player designations, and learning rate sequences $α_{θ,k},α_{w,k}$.
**Case 1:** if actor is leader, update actor and critic in ALG with:
$θ_{k+1} =θ_k +α_{θ,k}∇J(θ_k,w_k)$
$w_{k+1} =w_k −α_{w,k}∇_wL(θ_k,w_k)$
**Case 2:** if critic is leader, update actor and critic in ALG with:
$θ_{k+1} =θ_k +α_{θ,k}∇_θJ(θ_k,w_k)$
$w_{k+1} =w_k −α_{w,k}∇L(θ_k,w_k)$
---
When actor is the leader (case 1), the total derivative of the actor is given by:
$∇_θJ(θ, w) − ∇^⊤_{wθ}L(θ, w)(∇^2_wL(θ, w))^{−1}∇_wJ(θ, w)$.
When critic is the leader (case 2), the total derivative of the critic is given by:
$∇_wL(θ,w)−∇^⊤_{θw}J(θ,w)(∇^2_θJ(θ,w))^{−1}∇_θL(θ,w)$.
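A small numpy sketch of the case-1 (actor as leader) total derivative, assuming the four gradient and curvature blocks have already been estimated; the function name, array shapes, and random test data are our own:
```python
import numpy as np

def actor_leader_total_derivative(grad_theta_J, grad_w_J, jac_wtheta_L, hess_w_L):
    """Case 1 (actor as leader): ∇J = ∇_θ J − ∇_{wθ}Lᵀ (∇_w² L)⁻¹ ∇_w J.

    Shapes: grad_theta_J (m_θ,), grad_w_J (m_w,),
            jac_wtheta_L (m_w, m_θ), hess_w_L (m_w, m_w).
    """
    # Solve the linear system rather than forming the explicit inverse.
    correction = jac_wtheta_L.T @ np.linalg.solve(hess_w_L, grad_w_J)
    return grad_theta_J - correction

# Tiny shape check with random data (illustrative only).
m_theta, m_w = 3, 5
rng = np.random.default_rng(0)
B = rng.standard_normal((m_w, m_w))
td = actor_leader_total_derivative(
    rng.standard_normal(m_theta),
    rng.standard_normal(m_w),
    rng.standard_normal((m_w, m_theta)),
    B @ B.T + np.eye(m_w),               # positive-definite stand-in for ∇_w² L
)
print(td.shape)  # (3,) — same shape as the actor parameter gradient
```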
**Stackelberg “Vanilla” Actor-Critic (STAC)**
The critic assists the actor in learning the optimal policy by approximating the value function of the current policy. The critic aims to select a best response $w^∗(θ) = \arg\min_{w'} L(θ, w')$. Thus, the actor plays the role of the leader and the critic that of the follower. For STAC, the additional terms appearing in the actor's total derivative are:
$∇_wJ(θ,w) = E_{s∼ρ,a∼π_θ(·|s)}[∇_wQ_w(s,a)]$
$∇^2_w L(θ, w) = E_{s∼ρ,a∼π_θ (·|s)}[2∇_w Q_w(s, a) ∇^⊤_w Q_w (s, a) +2(Q_w (s, a) − Q^π (s, a))∇^2_w Q_w (s, a)] .$
* **Theorem 1:** Given an MDP and actor-critic parameters $(θ, w)$, the gradient of $L(θ, w)$ with respect to $θ$ is given by:
$∇_θL(θ,w) = E_{τ∼π_θ}[∇_θ \log(π_θ(a_0|s_0))(Q_w(s_0,a_0)−Q^π(s_0,a_0))^2 + \sum^T_{t=1} γ^t∇_θ \log(π_θ(a_t|s_t))(Q^π(s_0,a_0)−Q_w(s_0,a_0))Q^π(s_t,a_t)]$.
Theorem 1 allows us to compute $∇_{θw} L(θ, w)$ directly as $∇_w(∇_θ L(θ, w))$, since the sampling distribution in the expectation for $∇_θ L(θ, w)$ does not depend on $w$, so $∇_w$ can be moved inside the expectation.
The critic in AC is often designed to approximate the state value function $V^π (s)$ which has computational advantages, and the policy gradient can be computed by advantage estimation. In this formulation,
$J(θ,w)=E_{τ∼π_θ}[r(s_0,a_0)+V_w(s_1)]$ and
$L(θ,w)=E_{s∼ρ}[(V_w(s)−V^\pi(s))^2]$.
Then $∇_\theta L(θ,w)$ can be computed by the next proposition.
* **Proposition 1:** Given an MDP and actor-critic parameters $(θ, w)$, if the critic has the objective function $L(θ,w)=E_{s∼ρ}[(V_w(s)−V^π(s))^2]$, then $∇_θ L(θ,w)$ is given by
$E_{\tau \sim \pi_\theta}[2\sum^T_{t=0} γ^t∇_θ \log π_θ(a_t|s_t)(V^π(s_0)−V_w(s_0))Q^π(s_t,a_t)]$.
**Stackelberg DDPG (STDDPG) and SAC (STSAC)**
In comparison to on-policy methods where the critic is designed to evaluate the actor using sampled trajectories generated by the current policy, in off-policy methods the critic minimizes the Bellman error using samples from a replay buffer. Thus, the leader and follower designation between the actor and critic in off-policy methods is not as clear. Given the actor as the leader, the algorithms are similar to policy-based methods, where the critic plays an approximate best response to evaluate the current actor. On the other hand, given the critic as the leader, the actor plays an approximate best response to the critic value, resulting in behavior closely resembling that of the value-based methods.
The objective functions of off-policy methods are defined in expectation over an arbitrary distribution from a replay buffer instead of the distribution induced by the current policy. Thus, each term in the total derivative updates can be computed directly and estimated from samples.
**Convergence guarantee**
The following theorem gives a local convergence guarantee to a local Stackelberg equilibrium under the following assumptions: the maps $∇J : R^m → R^{m_θ}$ and $∇_wL : R^m → R^{m_w}$ are Lipschitz and $||∇J|| < ∞$; the learning rate sequences for the gradient updates satisfy $α_{θ,k} = o(α_{w,k})$ and $\sum_k α_{i,k} = ∞, \sum_k α^2_{i,k} < ∞$ for $i ∈ I = \{θ,w\}$; and the stochastic components in the gradient updates $\{\epsilon_{i,k}\}$ are zero-mean, martingale difference sequences.
*(A stochastic process is a martingale difference sequence if its conditional expectation given the past is zero.)*
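One concrete step-size choice that satisfies these conditions (our own illustration, not prescribed by the paper) is shown below; the actor (leader) runs on the slower timescale relative to the critic (follower):
```latex
\[
\alpha_{\theta,k} = k^{-1}, \qquad \alpha_{w,k} = k^{-2/3}:
\qquad \frac{\alpha_{\theta,k}}{\alpha_{w,k}} = k^{-1/3} \to 0, \qquad
\sum_k \alpha_{i,k} = \infty, \qquad \sum_k \alpha_{i,k}^2 < \infty
\quad (i \in \{\theta, w\}).
\]
```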
* **Theorem 2:** Consider an MDP and actor-critic parameters $(\theta, w).$ Given a locally asymptotically stable Stackelberg equilibrium $(\theta^*, w^*)$ of the continuous-time limiting system $(\dot \theta, \dot w) = (\nabla J(\theta, w), - \nabla_w L(\theta, w)),$ under the above assumptions there exists a neighbourhood $U$ of $(\theta^*, w^*)$ such that the iterates $(\theta_k, w_k)$ of the discrete-time system
$θ_{k+1} =θ_k +α_{θ,k}(∇J(θ_k,w_k)+ε_{θ,k+1})$,
$w_{k+1} =w_k −α_{w,k}(∇_wL(θ_k,w_k)+ε_{w,k+1})$
converge asymptotically to $(\theta^*, w^*)$ almost surely for initializations in $U$.
This result effectively guarantees, via stochastic approximation methods, that the discrete-time dynamics locally converge to a stable, game-theoretically meaningful equilibrium of the continuous-time system, given suitable learning rate sequences and unbiased gradient estimates.
**Implicit Map Regularization**
The total derivative in the Stackelberg gradient dynamics requires computing the inverse of the follower Hessian $∇^2_2f_2(x)$. Since critic networks in practical reinforcement learning problems may be highly non-convex, $(∇^2_2f_2(x))^{−1}$ can be ill-conditioned. Thus, instead of computing this term directly in the Stackelberg actor-critic algorithms, we compute a regularized variant of the form $(∇^2_2f_2(x)+λI)^{−1}$. This regularization can be interpreted as the leader viewing the follower as optimizing a regularized cost $f_2(x) + \frac{λ}{2}||x_2||^2$, while the follower actually optimizes $f_2(x)$. Interestingly, the regularization parameter $λ$ serves to interpolate between the Stackelberg and individual gradient updates for the leader.
* **Proposition 2:** Consider a Stackelberg game where the leader updates using the regularized total derivative $∇^λf_1(x) = ∇_1f_1(x) − ∇^⊤_{21}f_2(x)(∇^2_2f_2(x) + λI)^{−1}∇_2f_1(x)$. As $λ → 0$, $∇^λf_1(x) → ∇f_1(x)$, and as $λ → ∞$, $∇^λf_1(x) → ∇_1f_1(x)$.
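A quick numerical check of this interpolation on random data (the shapes, names, and the positive-definite stand-in Hessian are our own; a sketch, not the paper's implementation):
```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2 = 3, 4
grad1_f1 = rng.standard_normal(d1)            # ∇_1 f_1(x)
grad2_f1 = rng.standard_normal(d2)            # ∇_2 f_1(x)
jac21_f2 = rng.standard_normal((d2, d1))      # ∇_{21} f_2(x)
B = rng.standard_normal((d2, d2))
hess2_f2 = B @ B.T + np.eye(d2)               # ∇_2² f_2(x), positive-definite stand-in

def regularized_total_derivative(lam):
    # ∇^λ f_1(x) = ∇_1 f_1 − ∇_{21} f_2ᵀ (∇_2² f_2 + λ I)⁻¹ ∇_2 f_1
    return grad1_f1 - jac21_f2.T @ np.linalg.solve(hess2_f2 + lam * np.eye(d2), grad2_f1)

stackelberg_grad = regularized_total_derivative(0.0)    # λ = 0: full Stackelberg update
individual_grad = grad1_f1                               # λ → ∞ limit: individual gradient
print(np.allclose(regularized_total_derivative(1e-12), stackelberg_grad))  # True
print(np.allclose(regularized_total_derivative(1e12), individual_grad))    # True
```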
---
### Experiments
**Performance**:
1. STAC, STDDPG and STSAC are implemented on several OpenAI Gym tasks. The critic is unrolled for m steps between each actor step. For each task, STAC with multiple critic unrolling steps performs best. This is because when the critic is closer to its best response, its actual response is closer to the response anticipated by the actor's Stackelberg gradient.
2. STDDPG with either the actor or the critic as the leader overall performs better than DDPG on the same tasks. However, STSAC's performance does not show a clear advantage over SAC.
3. In all experiments, when the actor is the leader, the Stackelberg versions either outperform or are comparable to the existing actor-critic algorithms, showing that the Stackelberg framework has an empirical advantage in most tasks.
**Game-Theoretic Interpretations**
SAC is considered a state-of-the-art model-free reinforcement learning algorithm, and it significantly outperforms DDPG on most tasks. The interpretation for this is as follows:
1. The common interpretation of its advantage is that SAC encourages exploration by penalizing low entropy policies.
2. The authors further suggest that the hidden structures in AC and DDPG may lead to cyclic behavior in their individual gradient dynamics. SAC constructs a more well-conditioned game structure by regularizing the actor objective, which leads the learning dynamics to converge more directly to the equilibrium. This also explains why we observe improved performance with STAC and STDDPG-AL compared to AC and DDPG, while the performance gap between STSAC-AL and SAC is not as significant.
3. Experimentally, the actor as the leader always outperforms the critic as the leader. The critic objective is typically a mean square error, which results in a hidden quadratic structure, whereas the actor's objective typically has a hidden linear structure due to the parameterization of the Q network and the policy. As a result, the critic's cost structure is better suited for computing an approximate local best response, since it is more likely to be well-conditioned. Thus, the critic being the follower is the more natural hierarchical structure of the game.
---
### Conclusion
The paper revisits standard actor-critic algorithms from a game-theoretic perspective to capture their hierarchical interaction structure and introduces a Stackelberg framework for actor-critic algorithms. Through theory and experiments, it shows that Stackelberg actor-critic algorithms with the actor as the leader perform at least as well as, and often significantly better than, their existing counterparts.
---