# Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms
Link: https://arxiv.org/abs/2109.12286
Authors: Liyuan Zheng, Tanner Fiez, Zane Alumbaugh, Benjamin Chasnov, Lillian J. Ratliff
Written by: Pronoma Banerjee
---
### Abstract and Introduction
The paper discusses the following:
1. It models the interaction between the actor and the critic in actor-critic reinforcement learning algorithms as a two-player general-sum Stackelberg game with a leader-follower structure.
2. Modeling contributions: Proposes a framework for Stackelberg actor-critic algorithms in which the leader updates using the total derivative of its objective, defined via the implicit function theorem, while the follower updates using its individual gradient, and contrasts this with the previous approach of using individual gradients for both the actor and the critic.
3. Theoretical contributions: Develops a policy gradient theorem (Theorem 1) for the refined total-derivative update, builds a meta-framework of Stackelberg actor-critic algorithms adapting the standard actor-critic, deep deterministic policy gradient, and soft actor-critic algorithms, and provides a local convergence guarantee (Theorem 2) to a local Stackelberg equilibrium defined by gradient-based sufficient conditions.
4. Experimental contributions: Shows through examples that Stackelberg gradient dynamics mitigate cycling and accelerate convergence. Further experiments show that Stackelberg actor-critic algorithms always perform at least as well as, and often significantly outperform, their standard actor-critic counterparts.
---
### Motivation and Preliminaries
We first state the implicit function theorem, which is used to define the objective function of the leader in a Stackelberg game:
**Implicit function theorem**
Let $f=(f_1, f_2, ..., f_n)$ be a vector-valued function defined on an open set $S$ in $R^{n+k}$ with values in $R^n$. Suppose $f \in C'$ on $S$. Let $(x_0,t_0)$ be a point in $S$ for which $f(x_0, t_0)=0$ and for which the $n \times n$ determinant $det[D_j f_i(x_0, t_0)] \neq 0$. Then there exists a $k$-dimensional open set $T_0$ containing $t_0$ and one and only one vector-valued function $g$, defined on $T_0$ and having values in $R^n$, such that:
1. $g \in C'$ on $T_0$
2. $g(t_0) = x_0$
3. $f(g(t),t)=0$ for all $t \in T_0$
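As a quick illustration of how the theorem is used (a standard one-dimensional textbook example, not from the paper), take $n = k = 1$ and $f(x,t) = x^2 + t^2 - 1$:
```latex
% Standard 1-D illustration of the implicit function theorem (n = k = 1).
\[
f(x,t) = x^2 + t^2 - 1, \qquad (x_0, t_0) = (1, 0): \quad
f(x_0, t_0) = 0, \quad D_x f(x_0, t_0) = 2 x_0 = 2 \neq 0 .
\]
\[
\text{Hence there is a unique } g \text{ on } T_0 = (-1, 1) \text{ with } g(0) = 1
\text{ and } f(g(t), t) = 0, \text{ namely } g(t) = \sqrt{1 - t^2} .
\]
\[
\text{Differentiating } f(g(t), t) = 0 \text{ gives the implicit gradient } \quad
g'(t) = -\frac{D_t f}{D_x f} = -\frac{t}{g(t)} ,
\]
% the same construction used below to obtain the best-response gradient of the follower.
```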
**Game theoretic preliminaries**
A Stackelberg game is a game between two agents where one agent is deemed the leader and the other the follower. The leader optimizes its objective assuming that the follower will play its best response. We frame the leader's objective using the implicit function theorem.
Let $f_1(x_1,x_2)$ and $f_2(x_1,x_2)$ be the objective functions that the leader and follower want to minimize, respectively, where $x_1 ∈ X_1 ⊆ R^{d_1}$ and $x_2 ∈ X_2 ⊆ R^{d_2}$ are their decision variables or strategies and $x = (x_1, x_2) ∈ X_1 × X_2$ is their joint strategy.
The leader L aims to solve the following problem:
$\min_{x_1∈X_1}\{f_1(x_1,x_2) \mid x_2 ∈ \arg\min_{y∈X_2} f_2(x_1,y)\}$
and the follower F aims to solve the following problem:
$\min_{x_2∈X_2} f_2(x_1,x_2)$
Since the leader assumes the follower chooses a best response $x^∗_2(x_1) = \arg\min_y f_2(x_1,y)$, the follower's decision variables are implicitly a function of the leader's. The leader exploits this when deriving sufficient conditions for its optimization problem from the total derivative of its cost function:
$∇f_1(x_1, x^∗_2(x_1)) = ∇_1f_1(x) + (∇x^∗_2(x_1))^⊤∇_2f_1(x)$.
where $∇x^∗_2(x_1) = −(∇^2_2f_2(x))^{−1}∇_{21}f_2(x)$
Hence, a point $x = (x_1, x_2)$ is a local solution to L's problem if $∇f_1(x_1, x^∗_2(x_1)) = 0$ and $∇^2f_1(x_1, x^∗_2(x_1)) > 0$. For the follower’s problem, sufficient conditions for local optimality are $∇_2f_2(x_1, x_2) = 0$ and $∇^2_2f_2(x_1, x_2) > 0$.
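To make the total derivative concrete, here is a small worked example on a toy quadratic game of our own (an illustration, not from the paper), with a constant $a > 0$:
```latex
% Toy quadratic game (illustration only): leader cost f_1, follower cost f_2.
\[
f_1(x_1, x_2) = (x_1 - 1)^2 + x_1 x_2, \qquad
f_2(x_1, x_2) = (x_2 - a x_1)^2 .
\]
% The follower's best response is x_2^*(x_1) = a x_1, and the implicit formula recovers its slope:
\[
\nabla x_2^*(x_1) = -\big(\nabla_2^2 f_2(x)\big)^{-1} \nabla_{21} f_2(x) = -(2)^{-1}(-2a) = a .
\]
% Substituting into the total derivative of the leader's cost:
\[
\nabla f_1(x_1, x_2^*(x_1))
= \underbrace{2(x_1 - 1) + x_2}_{\nabla_1 f_1(x)}
+ \underbrace{a\, x_1}_{(\nabla x_2^*(x_1))^{\top} \nabla_2 f_1(x)} ,
\]
% which, evaluated at x_2 = a x_1, vanishes at x_1^* = 1/(1 + a), x_2^* = a/(1 + a),
% with the second-order condition 2 + 2a > 0 satisfied.
```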
Next we define a concept that gives the sufficient conditions for a local Stackelberg equilibrium:
**Differential Stackelberg Equilibrium**
The joint strategy $x^∗ = (x^∗_1,x^∗_2) ∈ X_1 × X_2$ is a differential Stackelberg equilibrium if $∇f_1(x^∗) = 0, ∇_2f_2(x^∗) = 0, ∇^2f_1(x^∗) > 0,$ and $∇^2_2f_2(x^∗) > 0.$
**Stackelberg gradient dynamics**
These dynamics are derived from the first-order gradient-based sufficient conditions and are given by:
$x_{1,k+1} =x_{1,k} − α_1∇f_1(x_{1,k},x_{2,k})$
$x_{2,k+1} =x_{2,k} − α_2∇_2f_2(x_{1,k},x_{2,k})$
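Below is a minimal Python sketch of both update rules on the toy quadratic game from the worked example above (the costs $f_1$, $f_2$, the constant $a$, the step sizes, and the initialization are illustrative choices of ours, not the paper's). It shows that the leader's total-derivative update drives the iterates to the Stackelberg solution $(1/(1+a),\, a/(1+a))$, while individual gradients settle at a different stationary point.
```python
a = 1.0  # slope of the follower's best response x2*(x1) = a * x1 (illustrative)

# Gradients of the toy costs f1 = (x1 - 1)^2 + x1*x2 and f2 = (x2 - a*x1)^2.
grad1_f1 = lambda x1, x2: 2 * (x1 - 1) + x2      # ∂f1/∂x1
grad2_f1 = lambda x1, x2: x1                     # ∂f1/∂x2
grad2_f2 = lambda x1, x2: 2 * (x2 - a * x1)      # ∂f2/∂x2
hess22_f2 = lambda x1, x2: 2.0                   # ∂²f2/∂x2²
jac21_f2 = lambda x1, x2: -2 * a                 # ∂²f2/∂x2∂x1

def total_derivative_f1(x1, x2):
    # ∇f1 = ∇_1 f1 − ∇_21 f2 (∇_2² f2)⁻¹ ∇_2 f1  (scalar case)
    return grad1_f1(x1, x2) - jac21_f2(x1, x2) / hess22_f2(x1, x2) * grad2_f1(x1, x2)

def run(stackelberg, steps=2000, lr=0.05):
    x1, x2 = 3.0, -2.0  # arbitrary initialization
    for _ in range(steps):
        g1 = total_derivative_f1(x1, x2) if stackelberg else grad1_f1(x1, x2)
        g2 = grad2_f2(x1, x2)            # the follower always uses its individual gradient
        x1, x2 = x1 - lr * g1, x2 - lr * g2
    return x1, x2

print("individual :", run(stackelberg=False))  # -> (2/3, 2/3), individual-gradient stationary point
print("stackelberg:", run(stackelberg=True))   # -> (1/2, 1/2), the Stackelberg solution
```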
**Motivating Examples**
In all actor-critic algorithms, the actor and critic objectives exhibit a simple hidden structure in their parameters. The actor objective typically has a hidden linear structure in terms of the parameters $θ$, which is abstractly of the form $Q_w(θ) = w^⊤μ(θ)$. The critic objective usually has a hidden quadratic structure in the parameters $w$, which is abstractly of the form $(R(θ) − Q_w(θ))^2$. The actor seeks the action that maximizes the value indicated by the critic, while the critic approximates the rewards of the actions generated by the actor. For simplicity, we assume the critic only minimizes the mean square error for the sample actions generated by the current actor $θ$.
Next we briefly go through the various actor-critic algorithms. The vanilla actor-critic and the deep deterministic policy gradient optimize their objectives by performing individual gradient dynamics (gradient ascent and descent) on the actor and critic parameters, respectively. The soft actor-critic exhibits a similar structure as the others, but with entropic regularization in the actor objective.
1. **Actor Critic (AC):**
* Relies on the critic function $Q_w(s,a)$ parameterized by $w$ to approximate $Q^\pi(s,a)$.
* The actor, which is parameterized by $\theta$, has the objective:
$J(θ,w) = E_{s∼ρ,a∼π_θ(·|s)}[Q_w(s,a)]$
(from the policy gradient theorem).
* The objective is optimized using gradient ascent where:
$∇_θJ(θ,w) = E_{s∼ρ,a∼π_θ(·|s)}[∇_θ \log(π_θ(a|s))Q_w(s,a)]$.
* The critic, which is parameterized by $w$, has the objective of minimizing the mean square error between the Q-functions:
$L(θ, w) = E_{s∼ρ,a∼π_θ(·|s)}[(Q_w(s,a)−Q^π(s,a))^2]$,
where $Q^\pi(s,a)$ is approximated by Monte Carlo estimation or bootstrapping.
2. **Deep Deterministic Policy Gradient (DDPG):**
* Deterministic actor $\mu_\theta(s): S \rightarrow A$ with objective:
$J(θ, w) = E_{ξ∼D}[Q_w(s, μ_θ(s))]$.
* Critic objective is the mean square Bellman error:
$L(θ,w) = E_{ξ∼D}[(Q_w(s,a)−(r+γQ_0(s',μ_θ(s'))))^2]$,
where $ξ = (s, a, r, s')$, $D$ is a replay buffer, and $Q_0$ is a target $Q$ network.
3. **Soft Actor Critic (SAC):**
* Exploits double Q-learning and employs entropic regularization to encourage exploration.
* Actor's objective:
$J(θ, w) = E_{ξ∼D}[\min_{i=1,2} Q_{w_i}(s, a_θ(s)) − η \log(π_θ(a_θ(s)|s))]$,
where $a_\theta(s)$ is a sample from $\pi_\theta(.|s)$ and $\eta$ is the entropy regularization coefficient.
* The critic parameter is the union of both Q networks' parameters, $w = \{w_1, w_2\}$, and the critic objective is defined correspondingly by:
$L(θ, w) = E_{ξ∼D}[\sum_{i=1,2}(Q_{w_i}(s, a) − y(r, s'))^2]$,
where
$y(r,s') = r + γ(\min_{i=1,2} Q_{0,i}(s',a_θ(s')) − η \log(π_θ(a_θ(s')|s')))$.
The target networks in DDPG and SAC are updated by taking the Polyak average of the network parameters over the course of training, and the actor and critic networks are updated by individual gradient dynamics:
$θ ← θ + α_θ ∇_θ J (θ, w)$
$w ← w − α_w ∇_w L(θ, w)$
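As a schematic Python sketch of this update form (the quadratic stand-in objectives, dimensions, and step sizes below are our own placeholders, not actual actor or critic losses):
```python
import numpy as np

# Placeholder stand-ins: the "actor" ascends J(theta, w) = w.theta - 0.5*||theta||^2
# and the "critic" descends L(theta, w) = ||w - r||^2 for a fixed vector r.
r = np.ones(4)
grad_theta_J = lambda theta, w: w - theta
grad_w_L = lambda theta, w: 2 * (w - r)

theta = np.zeros(4)
w = np.random.randn(4)
w_target = w.copy()                       # target-network parameters (Polyak-averaged copy)
alpha_theta, alpha_w, tau = 1e-2, 1e-2, 0.005

for _ in range(5000):
    # Individual gradient dynamics: ascent on J for the actor, descent on L for the critic.
    theta = theta + alpha_theta * grad_theta_J(theta, w)
    w = w - alpha_w * grad_w_L(theta, w)
    # Polyak (soft) averaging of the target parameters, as used for the DDPG/SAC targets.
    w_target = tau * w + (1 - tau) * w_target

print(theta, w, w_target)  # all three approach r = (1, 1, 1, 1)
```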
---
### Stackelberg framework
**Meta algorithm**
---
**Algorithm: Stackelberg Actor-Critic Framework**
**Input**: actor-critic algorithm ALG, player designations, and learning rate sequences $α_{θ,k},α_{w,k}$.
**Case 1:** if actor is leader, update actor and critic in ALG with:
$θ_{k+1} =θ_k +α_{θ,k}∇J(θ_k,w_k)$
$w_{k+1} =w_k −α_{w,k}∇_wL(θ_k,w_k)$
**Case 2:** if critic is leader, update actor and critic in ALG with:
$θ_{k+1} =θ_k +α_{θ,k}∇_θJ(θ_k,w_k)$
$w_{k+1} =w_k −α_{w,k}∇L(θ_k,w_k)$
---
When actor is the leader (case 1), the total derivative of the actor is given by:
$∇_θJ(θ, w) − ∇^⊤_{wθ}L(θ, w)(∇^2_wL(θ, w))^{−1}∇_wJ(θ, w)$.
When critic is the leader (case 2), the total derivative of the critic is given by:
$∇_wL(θ,w)−∇^⊤_{θw}J(θ,w)(∇^2_θJ(θ,w))^{−1}∇_θL(θ,w)$.
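A small numpy sketch of the case-1 (actor as leader) total derivative, assuming the four gradient and curvature blocks have already been estimated; the function name, array shapes, and random test data are our own:
```python
import numpy as np

def actor_leader_total_derivative(grad_theta_J, grad_w_J, jac_wtheta_L, hess_w_L):
    """Case 1 (actor as leader): ∇J = ∇_θ J − ∇_{wθ}Lᵀ (∇_w² L)⁻¹ ∇_w J.

    Shapes: grad_theta_J (m_θ,), grad_w_J (m_w,),
            jac_wtheta_L (m_w, m_θ), hess_w_L (m_w, m_w).
    """
    # Solve the linear system rather than forming the explicit inverse.
    correction = jac_wtheta_L.T @ np.linalg.solve(hess_w_L, grad_w_J)
    return grad_theta_J - correction

# Tiny shape check with random data (illustrative only).
m_theta, m_w = 3, 5
rng = np.random.default_rng(0)
B = rng.standard_normal((m_w, m_w))
td = actor_leader_total_derivative(
    rng.standard_normal(m_theta),
    rng.standard_normal(m_w),
    rng.standard_normal((m_w, m_theta)),
    B @ B.T + np.eye(m_w),               # positive-definite stand-in for ∇_w² L
)
print(td.shape)  # (3,) — same shape as the actor parameter gradient
```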
**Stackelberg “Vanilla” Actor-Critic (STAC)**
The critic assists the actor in learning the optimal policy by approximating the value function of the current policy. The critic aims to select a best response $w^∗(θ) = \arg\min_{w'} L(θ, w')$. Thus, the actor plays the role of the leader and the critic that of the follower. For STAC, the additional terms appearing in the actor's total derivative are:
$∇_wJ(θ,w) = E_{s∼ρ,a∼π_θ(·|s)}[∇_wQ_w(s,a)]$
$∇^2_w L(θ, w) = E_{s∼ρ,a∼π_θ (·|s)}[2∇_w Q_w(s, a) ∇^⊤_w Q_w (s, a) +2(Q_w (s, a) − Q^π (s, a))∇^2_w Q_w (s, a)] .$
* **Theorem 1:** Given an MDP and actor-critic parameters $(θ, w)$, the gradient of $L(θ, w)$ with respect to $θ$ is given by:
$∇_θL(θ,w) = E_{τ∼π_θ}[∇_θ \log(π_θ(a_0|s_0))(Q_w(s_0,a_0)−Q^π(s_0,a_0))^2 + \sum^T_{t=1} γ^t∇_θ \log(π_θ(a_t|s_t))(Q^π(s_0,a_0)−Q_w(s_0,a_0))Q^π(s_t,a_t)]$.
Theorem 1 allows us to compute $∇_{θw} L(θ, w)$ directly as $∇_w(∇_θ L(θ, w))$, since the sampling distribution in the expectation for $∇_θ L(θ, w)$ does not depend on $w$, so $∇_w$ can be moved inside the expectation.
The critic in AC is often designed to approximate the state value function $V^π (s)$ which has computational advantages, and the policy gradient can be computed by advantage estimation. In this formulation,
$J(θ,w)=E_{τ∼π_θ}[r(s_0,a_0)+V_w(s_1)]$ and
$L(θ,w)=E_{s∼ρ}[(V_w(s)−V^\pi(s))^2]$.
Then $∇_\theta L(θ,w)$ can be computed by the next proposition.
* **Proposition 1:** Given an MDP and actor-critic parameters $(θ, w)$, if the critic has the objective function $L(θ,w)=E_{s∼ρ}[(V_w(s)−V^π(s))^2]$, then $∇_θ L(θ,w)$ is given by
$E_{\tau \sim \pi_\theta}[2\sum^T_{t=0} γ^t∇_θ \log π_θ(a_t|s_t)(V^π(s_0)−V_w(s_0))Q^π(s_t,a_t)]$.
**Stackelberg DDPG (STDDPG) and SAC (STSAC)**
In comparison to on-policy methods where the critic is designed to evaluate the actor using sampled trajectories generated by the current policy, in off-policy methods the critic minimizes the Bellman error using samples from a replay buffer. Thus, the leader and follower designation between the actor and critic in off-policy methods is not as clear. Given the actor as the leader, the algorithms are similar to policy-based methods, where the critic plays an approximate best response to evaluate the current actor. On the other hand, given the critic as the leader, the actor plays an approximate best response to the critic value, resulting in behavior closely resembling that of the value-based methods.
The objective functions of off-policy methods are defined in expectation over an arbitrary distribution from a replay buffer instead of the distribution induced by the current policy. Thus, each term in the total derivative updates can be computed directly and estimated from samples.
**Convergence guarantee**
The following theorem gives a local convergence guarantee to a local Stackelberg equilibrium under the following assumptions: the maps $∇J : R^m → R^{m_θ}$ and $∇_wL : R^m → R^{m_w}$ are Lipschitz and $||∇J|| < ∞$; the learning rate sequences for the gradient updates satisfy $α_{θ,k} = o(α_{w,k})$ and $\sum_k α_{i,k} = ∞, \sum_k α^2_{i,k} < ∞$ for $i ∈ I = \{θ,w\}$; and the stochastic components in the gradient updates $\{\epsilon_{i,k}\}$ are zero-mean, martingale difference sequences.
*(A stochastic process is a martingale difference sequence if its conditional expectation given the past is zero.)*
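One concrete step-size choice that satisfies these conditions (our own illustration, not prescribed by the paper) is shown below; the actor (leader) runs on the slower timescale relative to the critic (follower):
```latex
\[
\alpha_{\theta,k} = k^{-1}, \qquad \alpha_{w,k} = k^{-2/3}:
\qquad \frac{\alpha_{\theta,k}}{\alpha_{w,k}} = k^{-1/3} \to 0, \qquad
\sum_k \alpha_{i,k} = \infty, \qquad \sum_k \alpha_{i,k}^2 < \infty
\quad (i \in \{\theta, w\}).
\]
```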
* **Theorem 2:** Consider an MDP and actor-critic parameters $(\theta, w).$ Given a locally asymptotically stable Stackelberg equilibrium $(\theta^*, w^*)$ of the continuous-time limiting system $(\dot \theta, \dot w) = (\nabla J(\theta, w), - \nabla_w L(\theta, w)),$ under the above assumptions there exists a neighbourhood $U$ of $(\theta^*, w^*)$ such that the iterates $(\theta_k, w_k)$ of the discrete-time system
$θ_{k+1} =θ_k +α_{θ,k}(∇J(θ_k,w_k)+ε_{θ,k+1})$,
$w_{k+1} =w_k −α_{w,k}(∇_wL(θ_k,w_k)+ε_{w,k+1})$
converge asymptotically to $(\theta^*, w^*)$ almost surely for initializations in $U$.
This result effectively guarantees, via stochastic approximation methods, that the discrete-time dynamics locally converge to a stable, game-theoretically meaningful equilibrium of the continuous-time system, given suitable learning rate sequences and unbiased gradient estimates.
**Implicit Map Regularization**
The total derivative in the Stackelberg gradient dynamics requires computing the inverse of the follower Hessian $∇^2_2f_2(x)$. Since critic networks in practical reinforcement learning problems may be highly non-convex, $(∇^2_2f_2(x))^{−1}$ can be ill-conditioned. Thus, instead of computing this term directly in the Stackelberg actor-critic algorithms, we compute a regularized variant of the form $(∇^2_2f_2(x)+λI)^{−1}$. This regularization can be interpreted as the leader viewing the follower as optimizing a regularized cost $f_2(x) + \frac{λ}{2}||x_2||^2$, while the follower actually optimizes $f_2(x)$. Interestingly, the regularization parameter $λ$ serves to interpolate between the Stackelberg and individual gradient updates for the leader.
* **Proposition 2:** Consider a Stackelberg game where the leader updates using the regularized total derivative $∇^λf_1(x) = ∇_1f_1(x) − ∇^⊤_{21}f_2(x)(∇^2_2f_2(x) + λI)^{−1}∇_2f_1(x)$. As $λ → 0$, $∇^λf_1(x) → ∇f_1(x)$, and as $λ → ∞$, $∇^λf_1(x) → ∇_1f_1(x)$.
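A quick numerical check of this interpolation on random data (the shapes, names, and the positive-definite stand-in Hessian are our own; a sketch, not the paper's implementation):
```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2 = 3, 4
grad1_f1 = rng.standard_normal(d1)            # ∇_1 f_1(x)
grad2_f1 = rng.standard_normal(d2)            # ∇_2 f_1(x)
jac21_f2 = rng.standard_normal((d2, d1))      # ∇_{21} f_2(x)
B = rng.standard_normal((d2, d2))
hess2_f2 = B @ B.T + np.eye(d2)               # ∇_2² f_2(x), positive-definite stand-in

def regularized_total_derivative(lam):
    # ∇^λ f_1(x) = ∇_1 f_1 − ∇_{21} f_2ᵀ (∇_2² f_2 + λ I)⁻¹ ∇_2 f_1
    return grad1_f1 - jac21_f2.T @ np.linalg.solve(hess2_f2 + lam * np.eye(d2), grad2_f1)

stackelberg_grad = regularized_total_derivative(0.0)    # λ = 0: full Stackelberg update
individual_grad = grad1_f1                               # λ → ∞ limit: individual gradient
print(np.allclose(regularized_total_derivative(1e-12), stackelberg_grad))  # True
print(np.allclose(regularized_total_derivative(1e12), individual_grad))    # True
```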
---
### Experiments
**Performance**:
1. STAC, STDDPG and STSAC are implemented on several OpenAI Gym tasks. The critic is unrolled for m steps between each actor step. For each task, STAC with multiple critic unrolling steps performs best. This is because when the critic is closer to its best response, its actual response is closer to the response anticipated by the actor's Stackelberg gradient.
2. STDDPG with either the actor or the critic as the leader overall performs better than DDPG on the same tasks. However, STSAC's performance does not show a clear advantage over SAC.
3. In all experiments, when the actor is the leader, the Stackelberg versions either outperform or are comparable to the existing actor-critic algorithms, showing that the Stackelberg framework has an empirical advantage in most tasks.
**Game-Theoretic Interpretations**
SAC is considered a state-of-the-art model-free reinforcement learning algorithm, and it significantly outperforms DDPG on most tasks. The interpretation for this is as follows:
1. The common interpretation of its advantage is that SAC encourages exploration by penalizing low entropy policies.
2. The authors further suggest that the hidden structures in AC and DDPG may lead to cyclic behavior in their individual gradient dynamics. SAC constructs a more well-conditioned game structure by regularizing the actor objective, which leads the learning dynamics to converge more directly to the equilibrium. This also explains why we observe improved performance with STAC and STDDPG-AL compared to AC and DDPG, while the performance gap between STSAC-AL and SAC is not as significant.
3. Experimentally, the actor as the leader always outperforms the critic as the leader. The critic objective is typically a mean square error, which results in a hidden quadratic structure, whereas the actor's objective typically has a hidden linear structure due to the parameterization of the Q network and the policy. As a result, the critic's cost structure is better suited for computing an approximate local best response, since it is more likely to be well-conditioned. Thus, the critic being the follower is the more natural hierarchical structure of the game.
---
### Conclusion
The paper revisits standard actor-critic algorithms from a game-theoretic perspective to capture their hierarchical interaction structure and introduces a Stackelberg framework for actor-critic algorithms. Through theory and experiments, it shows that Stackelberg actor-critic algorithms with the actor as the leader perform at least as well as, and often significantly better than, their existing counterparts.
---