# Policy Gradient Algorithms
## Introduction
Policy gradient methods are a class of reinforcement learning (RL) algorithms that directly optimize an agent's decision-making policy to maximize cumulative rewards. Unlike value-based methods, which focus on estimating value functions, policy gradient algorithms adjust the parameters of a policy through gradient ascent.
### Why Policy Gradient Methods?
Policy gradient methods excel in complex environments with vast or continuous action spaces, where estimating optimal actions based solely on values becomes computationally intractable. They leverage gradient ascent to iteratively refine the policy, steering it towards actions that promise greater rewards.
### RL Algorithms Landscape
* **Value-Based:** These algorithms learn to estimate the long-term value of states or state-action pairs, providing a foundation for deriving optimal policies indirectly.
* **Policy-Based:** These algorithms parameterize the policy directly and adjust its parameters to maximize expected returns, without requiring an explicit value function.
* **Actor-Critic:** Combining the best of both worlds, actor-critic algorithms learn both value functions (to critique) and policies (to act), synergistically enhancing learning efficiency and performance.
Policy gradient methods can be classified as either purely policy-based or as a specific type of actor-critic algorithm, offering a direct route to policy optimization without relying on intermediate value function estimation.
### Strengths and Limitations
#### Strengths
* **Natural Handling of Stochasticity:** They excel at learning stochastic policies, which prove invaluable in environments with incomplete information or where exploration is essential.
* **Inherent Exploration:** The stochastic nature of the learned policies fosters exploration, allowing agents to discover potentially better actions.
* **Mastery of Complex Actions:** They gracefully handle high-dimensional or continuous action spaces, where value-based methods often struggle.
* **Stable Learning Dynamics:** Gradual, proportional changes in policy parameters contribute to stable and reliable learning.
* **Robust Convergence:** They avoid some of the convergence pitfalls associated with value-based methods that rely on maximizing estimated values.
#### Limitations
* **Local Optima Risk:** The optimization process might converge to suboptimal solutions, preventing the discovery of the globally optimal policy.
* **Variance Challenges:** Gradient estimates can suffer from high variance, leading to slow or unstable learning progress.
* **Sample Inefficiency:** Acquiring accurate gradient estimates can demand a substantial number of samples, potentially hindering practical application.
## The Policy Gradient Theorem (PGT)
The Policy Gradient Theorem (PGT) provides the mathematical foundation for policy gradient algorithms in reinforcement learning. It quantifies how small adjustments to the policy parameters impact the expected cumulative rewards an agent receives.
### Notation and Definitions
Before delving into the theorem, let's clarify key concepts:
* **Policy $\pi(a|s,\theta)$:** The agent's decision-making strategy, a probability distribution over actions $a$ given a state $s$, parameterized by $\theta$.
* **State Distribution $\rho^\pi$:** The probability distribution of states encountered while following policy $\pi$ (formally, the discounted state-visitation distribution induced by $\pi$).
* **Instantaneous reward $\mathcal{R}_s^a$:** Expected reward after taking action $a$ in state $s$.
* **Expected Returns $J(\theta)$:** The expected cumulative sum of discounted rewards under policy $\pi(\theta)$: $$J(\theta) = \mathbb{E}_{\tau \sim \pi(\theta)} \left[ \sum_{t=0}^{\infty} \gamma^t \cdot R_{t+1} \right]$$ where:
* $\tau$ represents a trajectory (sequence of states and actions).
* $R_{t+1}$ is the reward received at time step $t+1$.
* $\gamma$ is the discount factor ($0 \leq \gamma \leq 1$), controlling how much future rewards are valued.
* **State-Value Function $V^\pi(s)$:** The expected return starting from state $s$ and following policy $\pi$:
$$
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=t}^{\infty} \gamma^{k-t} \cdot R_{k+1} | S_t = s \right]
$$
* **Action-Value Function $Q^\pi(s, a)$:** The expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$:
$$
Q^{\pi}(s,a)= \mathbb{E}_{\pi}\left[\sum_{k=t}^{\infty} \gamma^{k-t} \cdot R_{k+1} | S_t = s, A_t = a\right]
$$
* **Advantage Function $A^\pi(s, a)$:** The relative advantage of taking action $a$ in state $s$ compared to the average action under policy $\pi$:
$$
A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)
$$
From these definitions, we may deduce
$$
J(\theta)
= \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}}\pi(s,a;\theta) \cdot \mathcal{R}_s^a
= \mathbb{E}_{s \sim \rho^\pi, a \sim \pi}[\mathcal{R}_s^a]
$$
### The Policy Gradient Theorem
The PGT establishes a fundamental connection between the gradient of the expected return and the policy parameters:
**Theorem (PGT)**
$$
\nabla_{\theta}J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(s, a; \theta) \cdot Q^{\pi}(s, a)
$$
This gradient guides policy parameter updates to maximize expected returns.
#### Intuition
The PGT reveals that to improve the policy, we should:
* **Focus on frequently visited states:** The term $\rho^{\pi}(s)$ emphasizes the importance of states visited often under the current policy.
* **Adjust action probabilities toward higher-value actions:** The term $\nabla_{\theta} \pi(s, a; \theta) \cdot Q^{\pi}(s, a)$ indicates that we should increase the probability of actions with high action-values $Q^{\pi}(s,a)$ and decrease the probability of actions with relatively low action-values.
### The PGT in Action
The PGT serves as the backbone of various policy gradient algorithms. These algorithms differ in how they estimate the action-value function and the policy gradient, but they all share the goal of updating policy parameters to maximize expected returns.
## Score Functions: Simplifying Policy Gradient Calculations
Score functions are a powerful tool in policy gradient methods, offering a more convenient way to express the gradient of the policy with respect to its parameters, $\nabla_{\theta} \pi(s, a; \theta)$. This simplification is achieved by exploiting the specific properties of different policy types, making it easier to implement policy gradient algorithms.
Recall the Policy Gradient Theorem (PGT):
$$
\nabla_{\theta}J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(s, a; \theta) \cdot Q^{\pi}(s, a)
$$
By leveraging the score function, we can rewrite the gradient of the policy:
$$
\nabla_{\theta} \pi(s, a; \theta) = \pi(s, a; \theta) \cdot \nabla_{\theta} \log(\pi(s,a;\theta))
$$
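This identity is simply the chain rule applied to the logarithm: since
$$
\nabla_{\theta} \log \pi(s,a;\theta) = \frac{\nabla_{\theta} \pi(s,a;\theta)}{\pi(s,a;\theta)},
$$
multiplying both sides by $\pi(s,a;\theta)$ recovers the expression above. The quantity $\nabla_{\theta} \log \pi(s,a;\theta)$ is precisely the score function discussed in this section.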
This leads to a reformulated policy gradient expression:
$$
\nabla_{\theta}J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \left( \pi(s, a; \theta) \cdot \nabla_{\theta} \log(\pi(s,a;\theta)) \cdot Q^{\pi}(s, a)\right)
$$
Now, let's delve into score functions for two common policy types:
### Softmax Policy for Finite Action Spaces
* **Policy Choice:** The softmax policy is well-suited for scenarios with a finite set of discrete actions.
* **Parameterization:** We represent policy parameters $\theta$ as an $m$-dimensional vector: $\theta=(\theta_1, \dots, \theta_m)$.
* **Feature Vector:** For each state-action pair $(s,a)$, we define a feature vector $\phi(s, a) = (\phi_1(s, a), \dots, \phi_m(s, a))$ to capture relevant information.
* **Action Probabilities:** The softmax policy calculates action probabilities using an exponentiated weighted sum of features: $$\pi(s, a; \theta) = \frac{e^{\phi(s,a)^T \cdot \theta}}{\sum_{b \in \mathcal{A}} e^{\phi(s,b)^T \cdot \theta}}$$
#### Score Function
The score function for the softmax policy is:
$$
\nabla_{\theta} \log \pi(s, a; \theta) = \phi(s, a) - \sum_{b \in \mathcal{A}} \pi(s, b; \theta) \cdot \phi(s, b) = \phi(s, a) - \mathbb{E}_{\pi}[\phi(s, \cdot)]
$$
This quantifies the difference between the feature vector of the chosen action and the average feature vector of all actions in that state, weighted by their probabilities under the current policy.
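To make this concrete, here is a minimal NumPy sketch (independent of the `rl` library used later) that evaluates a softmax policy and its score function for a single state; the feature values and parameters below are made-up illustrations.
```python=
import numpy as np

def softmax_policy(phi_sa: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Action probabilities pi(s, .; theta) from per-action features.

    phi_sa: array of shape (num_actions, m), where row a holds phi(s, a)
    theta:  parameter vector of shape (m,)
    """
    logits = phi_sa @ theta
    logits = logits - logits.max()          # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def softmax_score(phi_sa: np.ndarray, theta: np.ndarray, action: int) -> np.ndarray:
    """Score function: phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(phi_sa, theta)
    return phi_sa[action] - probs @ phi_sa

# Illustrative numbers: 3 actions, 2 features
phi_sa = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([0.5, -0.2])
print(softmax_policy(phi_sa, theta))
print(softmax_score(phi_sa, theta, action=2))
```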
### Gaussian Policy for Continuous Action Spaces
* **Policy Choice:** Gaussian policies are commonly employed when actions are continuous values, like controlling robotic movements.
* **Parameterization:** Similar to the softmax case, policy parameters are represented as $\theta=(\theta_1, \dots, \theta_m)$.
* **Feature Vector:** We define a feature vector $\phi(s) = (\phi_1(s), \dots, \phi_m(s))$ for each state $s$.
* **Action Distribution:** Actions are sampled from a normal distribution with a mean determined by a linear combination of state features: $$a \sim \mathcal{N}(\phi(s)^T \cdot \theta, \sigma^2)$$ where $\sigma^2$ is the variance (often fixed).
#### Score Function
The score function for the Gaussian policy is:
$$
\nabla_\theta \log \pi(s, a; \theta) = \frac{(a - \phi(s)^T \cdot \theta) \cdot \phi(s)}{\sigma^2}
$$
This measures how much the actual action deviates from the mean predicted by the policy, scaled by state features and inverse variance.
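A matching sketch for the Gaussian score, again with made-up state features and parameters:
```python=
import numpy as np

def gaussian_score(phi_s: np.ndarray, theta: np.ndarray,
                   action: float, sigma: float) -> np.ndarray:
    """Score of a Gaussian policy a ~ N(phi(s)^T theta, sigma^2)."""
    mean = phi_s @ theta
    return (action - mean) * phi_s / (sigma * sigma)

# Illustrative numbers: an action above the current mean yields a score
# that points theta toward making that action more likely
phi_s = np.array([1.0, 0.5])
theta = np.array([0.2, -0.1])
print(gaussian_score(phi_s, theta, action=0.8, sigma=1.0))
```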
### Interpreting Score Functions
In essence, score functions quantify the relative advantage or disadvantage of a specific action compared to the average action choice in a given state. This information is crucial for updating policy parameters in a way that maximizes expected rewards.
## REINFORCE: A Monte Carlo Approach to Policy Gradient Learning
REINFORCE is a fundamental policy gradient algorithm that directly applies the Policy Gradient Theorem (PGT) to optimize policies in reinforcement learning. It estimates the action-value function, $Q^\pi(s, a)$, using **Monte Carlo** sampling of complete episode returns.
### The REINFORCE Strategy
REINFORCE's core principles are:
1. **Stochastic Gradient Ascent:** Policy parameters $\theta$ are iteratively updated using stochastic gradient ascent, guided by the PGT.
2. **Monte Carlo Estimation of Q-values:** Instead of relying on bootstrapping (estimating values based on other estimates), REINFORCE estimates Q-values using the actual cumulative discounted rewards obtained throughout each episode. This approach samples entire trajectories to get an unbiased estimate of the expected return.
3. **Policy Update Rule:** The update for the policy parameters is proportional to the product of:
* The **score function** $\nabla_{\theta}\log\pi(S_t,A_t;\theta)$, which indicates the direction of change in action probabilities.
* The **discounted return** $G_t$, representing the cumulative reward from time step $t$ onwards.
### REINFORCE Algorithm
The REINFORCE algorithm can be formalized as follows:
:::success
<font color=blue>Algorithm: REINFORCE</font>
1. **Initialize:**
* Policy parameters $\theta$
* Learning rate $\alpha$
* Discount factor $\gamma$
2. **Repeat** for each episode:
* Generate an episode following policy $\pi(\theta)$: $\{S_0, A_0, R_1, S_1, ..., S_{T-1}, A_{T-1}, R_T, S_T \}$
* **For** each time step $t$ in the episode:
* Calculate the return $G_t$ starting from time $t$: $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} \cdot R_{k+1}$
* Update the policy parameters $\theta$: $\theta \leftarrow \theta + \alpha \cdot \gamma^t \cdot \nabla_{\theta}\log\pi(S_t,A_t;\theta) \cdot G_t$
:::
### Implementation: REINFORCE with Gaussian Policies
The provided Python code demonstrates a REINFORCE implementation for Gaussian policies, where the policy's mean is represented using a function approximator. Key components include:
* `GaussianPolicyFromApprox`: This class defines a Gaussian policy where the mean is determined by a function approximation and the standard deviation is fixed.
* `reinforce_gaussian`: This function embodies the REINFORCE algorithm for Gaussian policies. It simulates episodes, calculates returns, and updates the policy mean using gradients.
```python=
from typing import Iterator, Iterable, Sequence, TypeVar
from dataclasses import dataclass
from rl.distribution import Gaussian
from rl.function_approx import FunctionApprox, Gradient
from rl.returns import returns
from rl.policy import Policy
from rl.markov_process import NonTerminal
from rl.markov_decision_process import MarkovDecisionProcess, TransitionStep
from rl.approximate_dynamic_programming import NTStateDistribution
import numpy as np
S = TypeVar('S')
@dataclass(frozen=True)
class GaussianPolicyFromApprox(Policy[S, float]):
"""
A policy that selects actions based on a Gaussian distribution,
where the mean is determined by a function approximator.
"""
function_approx: FunctionApprox[NonTerminal[S]]
stdev: float
def act(self, state: NonTerminal[S]) -> Gaussian:
return Gaussian(
μ=self.function_approx(state),
σ=self.stdev
)
def reinforce_gaussian(
mdp: MarkovDecisionProcess[S, float],
policy_mean_approx0: FunctionApprox[NonTerminal[S]],
start_states_distribution: NTStateDistribution[S],
policy_stdev: float,
gamma: float,
episode_length_tolerance: float
) -> Iterator[FunctionApprox[NonTerminal[S]]]:
"""
Implement the REINFORCE algorithm for a Gaussian policy.
:param mdp: The Markov Decision Process.
:param policy_mean_approx0: The initial function approximator for the policy mean.
:param start_states_distribution: The distribution of start states.
:param policy_stdev: The standard deviation of the Gaussian policy.
:param gamma: The discount factor.
:param episode_length_tolerance: The maximum number of steps per episode.
:return: An iterator over the updated function approximators for the policy mean.
"""
policy_mean_approx: FunctionApprox[NonTerminal[S]] = policy_mean_approx0
yield policy_mean_approx
while True:
policy: Policy[S, float] = GaussianPolicyFromApprox(
function_approx=policy_mean_approx,
stdev=policy_stdev
)
# Generate an episode using the current policy
trace: Iterable[TransitionStep[S, float]] = mdp.simulate_actions(
start_states=start_states_distribution,
policy=policy
)
gamma_prod: float = 1.0
# Iterate over the steps in the episode
for step in returns(trace, gamma, episode_length_tolerance):
# Define the objective derivative for the policy mean
def obj_deriv_out(
states: Sequence[NonTerminal[S]],
actions: Sequence[float]
) -> np.ndarray:
return (policy_mean_approx.evaluate(states) -
np.array(actions)) / (policy_stdev * policy_stdev)
# Compute the gradient of the objective with respect to the policy mean
grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
policy_mean_approx.objective_gradient(
xy_vals_seq=[(step.state, step.action)],
obj_deriv_out_fun=obj_deriv_out
)
# Scale the gradient by the return and discount factor
scaled_grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
grad * gamma_prod * step.return_
# Update the policy mean function approximator using the scaled gradient
policy_mean_approx = \
policy_mean_approx.update_with_gradient(scaled_grad)
gamma_prod *= gamma
yield policy_mean_approx
```
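For readers without the book's `rl` package, the following self-contained NumPy sketch applies the same update rule to a deliberately tiny problem: a one-step continuous bandit with reward $-(a-3)^2$ and a single-parameter Gaussian policy. The bandit reward, the constants, and the single-parameter policy are illustrative assumptions, not part of the reference implementation above. Because each episode has a single step, the return $G_0$ is just the immediate reward, so the score-times-return update is easy to see in isolation.
```python=
import numpy as np

rng = np.random.default_rng(0)

theta = 0.0      # single policy parameter: the Gaussian mean mu = theta
sigma = 1.0      # fixed policy standard deviation
alpha = 0.01     # learning rate

def reward(action: float) -> float:
    return -(action - 3.0) ** 2          # best possible action is 3.0

for episode in range(5000):
    action = rng.normal(theta, sigma)    # sample a ~ N(theta, sigma^2)
    G = reward(action)                   # one-step episode, so G_0 = R_1
    score = (action - theta) / sigma**2  # d/dtheta of log pi(a; theta)
    theta += alpha * score * G           # REINFORCE update (gamma^0 = 1)

print(theta)    # should end up near 3.0
```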
### Monte Carlo vs. Bootstrapping: A Tradeoff in Variance
REINFORCE's Monte Carlo estimation is advantageous due to its unbiased nature. However, it suffers from high variance as it considers the entire episode's rewards, including potentially noisy or irrelevant events. Bootstrapping methods, like those used in Temporal Difference (TD) learning, reduce variance by using estimates of future returns. However, they introduce bias into the learning process.
The choice between Monte Carlo and bootstrapping involves a tradeoff between bias and variance. The most suitable method depends on the specific characteristics of the reinforcement learning problem at hand. Actor-Critic methods, discussed later, offer a way to combine these approaches for more effective learning.
## Actor-Critic Methods: Balancing Exploration and Exploitation with Reduced Variance
Actor-Critic (AC) methods are a family of policy gradient algorithms designed to address the high variance issue often encountered in REINFORCE. They achieve this by incorporating a "critic" component that learns to estimate value functions, working in tandem with the "actor" (the policy) to improve both policy and value estimation. This synergistic approach leads to more stable and efficient learning in reinforcement learning tasks.
### The Actor and Critic: A Collaborative Learning Framework
In the Actor-Critic framework:
* **Critic:** The critic, denoted as $V(s;v)$ for state-value function or $Q(s,a;w)$ for action-value function, is responsible for evaluating the quality of actions taken by the actor in different states. It learns to predict the expected cumulative reward given a state or a state-action pair, effectively guiding the actor's decision-making process.
* **Actor:** The actor is the policy itself, denoted as $\pi(s,a;\theta)$, which determines the actions to take based on the current state. It leverages the feedback from the critic's evaluations and adjusts its parameters $\theta$ to make better decisions over time.
The AC framework operates by interleaving policy evaluation (updating the critic's parameters) and policy improvement (updating the actor's parameters). This continuous cycle of adaptation and refinement enables both the actor and critic to learn and improve together.
### Training the Critic and Actor
Training the critic involves estimating the state or action-value function using various methods:
* **Monte Carlo (MC) policy evaluation:** Similar to REINFORCE, this approach utilizes complete episode returns to update the critic's estimates.
* **Temporal Difference (TD) learning:** TD learning updates the critic based on the difference between predicted and observed returns, known as the TD error.
* **TD($\lambda$) with eligibility traces:** This extends TD learning by considering a weighted trace of past experiences, often leading to faster learning convergence.
* **Least Squares Temporal Difference (LSTD):** When the critic is a linear function approximator, LSTD offers an efficient method for estimating its parameters.
Training the actor involves adjusting the policy parameters based on the critic's feedback. The critic's evaluations guide these updates, encouraging the actor to select actions that are expected to yield higher rewards.
### Approximate Policy Gradient and Variance Reduction
The use of the critic's estimated values in the policy gradient update introduces some bias because these estimates might not perfectly align with the true values. However, this bias is often outweighed by the significant reduction in variance achieved through this approach. The approximate policy gradient is expressed as:
$$
\nabla_{\theta}J(\theta) \approx \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(s,a;\theta) \cdot Q(s,a;w)
$$
To further reduce variance, a **baseline function** $B(s)$ is introduced. It is a scalar value dependent solely on the current state, not the chosen action. Subtracting the baseline from the estimated action-value helps center the policy gradient estimates and reduce their variance without introducing bias:
$$
\nabla_{\theta}J(\theta) \approx \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(s,a;\theta) \cdot \left( Q(s,a;w) - B(s) \right)
$$
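The baseline can be subtracted without introducing bias because, for any fixed state $s$, the baseline-weighted term vanishes:
$$
\sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(s,a;\theta) \cdot B(s)
= B(s) \cdot \nabla_{\theta} \sum_{a \in \mathcal{A}} \pi(s,a;\theta)
= B(s) \cdot \nabla_{\theta} 1 = 0
$$
so the baseline affects only the variance of the gradient estimate, not its expectation.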
### Choosing the State-Value Function as a Baseline and Utilizing TD Error
A common and effective choice for the baseline function is the state-value function $V(s; v)$ estimated by the critic. This choice leverages the critic's knowledge about the expected cumulative reward from each state, leading to a more informed and stable policy gradient estimate:
$$
\nabla_{\theta}J(\theta) \approx \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}}
\nabla_{\theta} \pi(s,a;\theta) \cdot A(s,a;w,v)
$$
where $A(s, a; w, v) = Q(s,a;w) - V(s;v)$ is the estimated advantage function.
Furthermore, the **TD error** $\delta^{\pi}$:
$$
\delta^{\pi} = r + \gamma \cdot V^{\pi}(s') -V^{\pi}(s)
$$
which represents the difference between the one-step TD target and the current value estimate, is an unbiased estimator of the **advantage function** $A^{\pi}(s, a)$:
$$
\begin{split}
\mathbb{E}_{\pi}[\delta^{\pi} | s, a]
&= \mathbb{E}_{\pi}[r + \gamma \cdot V^{\pi}(s') | s, a] - V^{\pi}(s) \\
&= Q^{\pi}(s, a) - V^{\pi}(s) \\
&= A^{\pi}(s, a)
\end{split}
$$
This simplifies the actor-critic algorithm as it only requires estimating the state-value function instead of the full action-value function:
$$
\nabla_{\theta}J(\theta) =
\sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}}
\nabla_{\theta} \pi(s,a;\theta) \cdot \mathbb{E}_{\pi}[\delta^{\pi} | s, a]
$$
### Actor-Critic Algorithm with TD Error
:::success
<font color=blue>Algorithm: ACTOR-CRITIC-TD-ERROR</font>
1. **Initialize:**
* Policy parameters $\theta$
* State-Value Function parameters $v$
* Discount factor $\gamma$
* Learning rates $\alpha_v$ (for critic) and $\alpha_\theta$ (for actor)
2. **For each episode:**
* Initialize $s$ (first state of the episode)
* While $s$ is not terminal:
* Sample action $a$ from policy $\pi(s,\cdot;\theta)$
* Take action $a$, observe reward $r$ and next state $s'$
* Calculate TD error $\delta$: $\delta \leftarrow r + \gamma \cdot V(s';v) - V(s;v)$
* Update critic parameters $v$: $v \leftarrow v + \alpha_v \cdot \delta \cdot \nabla_v V(s;v)$
* Update actor parameters $\theta$: $\theta \leftarrow \theta + \alpha_{\theta} \cdot \delta \cdot \nabla_{\theta} \log\pi(s,a;\theta)$
* $s \leftarrow s'$ (Move to the next state)
:::
#### Code Implementation
```python=
from typing import Iterator, Sequence, TypeVar
from rl.distribution import Gaussian
from rl.function_approx import FunctionApprox, Gradient
from rl.markov_process import NonTerminal
from rl.markov_decision_process import MarkovDecisionProcess
from rl.approximate_dynamic_programming import NTStateDistribution, ValueFunctionApprox
import numpy as np

S = TypeVar('S')
def actor_critic_td_error_gaussian(
mdp: MarkovDecisionProcess[S, float],
policy_mean_approx0: FunctionApprox[NonTerminal[S]],
value_func_approx0: ValueFunctionApprox[S],
start_states_distribution: NTStateDistribution[S],
policy_stdev: float,
gamma: float,
max_episode_length: float
) -> Iterator[FunctionApprox[NonTerminal[S]]]:
"""
Implement the Actor-Critic algorithm with TD error for a Gaussian policy.
:param mdp: The Markov Decision Process.
:param policy_mean_approx0: The initial function approximator for the policy mean.
:param value_func_approx0: The initial function approximator for the value function.
:param start_states_distribution: The distribution of start states.
:param policy_stdev: The standard deviation of the Gaussian policy.
:param gamma: The discount factor.
:param max_episode_length: The maximum number of steps per episode.
:return: An iterator over the updated function approximators for the policy mean.
"""
policy_mean_approx: FunctionApprox[NonTerminal[S]] = policy_mean_approx0
yield policy_mean_approx
vf: ValueFunctionApprox[S] = value_func_approx0
while True:
steps: int = 0
gamma_prod: float = 1.0
state: NonTerminal[S] = start_states_distribution.sample()
while isinstance(state, NonTerminal) and steps < max_episode_length:
# Sample an action from the Gaussian policy
action: float = Gaussian(
μ=policy_mean_approx(state),
σ=policy_stdev
).sample()
# Take a step in the MDP and observe the next state and reward
next_state, reward = mdp.step(state, action).sample()
# Calculate the TD target and TD error
if isinstance(next_state, NonTerminal):
td_target: float = reward + gamma * vf(next_state)
else:
td_target = reward
td_error: float = td_target - vf(state)
# Update the value function approximator
vf = vf.update([(state, td_target)])
# Define the objective derivative for the policy mean
def obj_deriv_out(
states: Sequence[NonTerminal[S]],
actions: Sequence[float]
) -> np.ndarray:
return (policy_mean_approx.evaluate(states) -
np.array(actions)) / (policy_stdev * policy_stdev)
# Compute the gradient of the objective with respect to the policy mean
grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
policy_mean_approx.objective_gradient(
xy_vals_seq=[(state, action)],
obj_deriv_out_fun=obj_deriv_out
)
# Scale the gradient by the TD error and the discount factor
scaled_grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
grad * gamma_prod * td_error
# Update the policy mean function approximator using the scaled gradient
policy_mean_approx = \
policy_mean_approx.update_with_gradient(scaled_grad)
yield policy_mean_approx
gamma_prod *= gamma
steps += 1
state = next_state
```
## Eliminating Bias: The Compatible Function Approximation Theorem
In the pursuit of accurate policy gradients within actor-critic methods, various approximations for the action-value function $Q^\pi(s, a)$ have been explored. While these approximations (e.g., $Q(s,a;w)$, advantage function $A(s,a;w,v)$, or TD error $\delta(s,s',r;v)$) can reduce variance, they often introduce bias into policy gradient calculations. The Compatible Function Approximation Theorem offers a solution to this bias.
### The Compatible Function Approximation Theorem
This theorem reveals that under specific conditions, even a biased function approximation for the critic can yield the exact policy gradient. Let's break down the theorem:
**Theorem (Compatible Function Approximation)**
Let $w^*_{\theta}$ denote the critic parameters $w$ that minimize the mean squared error between the true action-value function $Q^\pi(s, a)$ and the approximated critic $Q(s, a; w)$ for a given policy $\pi(s, a; \theta)$:
$$
\sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot (Q^\pi(s, a) - Q(s, a; w))^2
$$
Assume that the data type of $\theta$ is the same as the data type of $w$ and furthermore, assume that for any policy parameters $\theta$, the Critic gradient at $w^*_{\theta}$ is compatible with the Actor score function:
$$
\nabla_w Q(s, a; w^*_\theta) = \nabla_\theta \log \pi(s, a; \theta)
$$
Then, the policy gradient calculated using the critic $Q(s, a; w^*_\theta)$ is exact:
$$
\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q(s, a; w^*_\theta)
$$
### Achieving Compatible Function Approximation
A simple way to satisfy the compatibility condition is to employ a linear function approximator for the critic, where the features are derived directly from the actor's score function:
$$
Q(s, a; w) = \sum_{i=1}^{m} \phi_i(s, a) \cdot w_i = \sum_{i=1}^{m} \frac{\partial \log \pi(s, a; \theta)}{\partial \theta_i} \cdot w_i
$$
This choice ensures alignment between the critic's gradient and the actor's score function, resulting in an unbiased policy gradient.
Consequently, the updates after each step of experience are:
* $\Delta\theta = \alpha_\theta \cdot \gamma^t \cdot SC(S_t, A_t; \theta) \cdot SC(S_t, A_t; \theta)^T \cdot w$
* $\Delta w = \alpha_w \cdot (R_{t+1} + \gamma \cdot SC(S_{t+1}, A_{t+1}; \theta)^T \cdot w - SC(S_t, A_t; \theta)^T \cdot w) \cdot SC(S_t, A_t; \theta)$
where $SC(s, a; \theta)$ denotes the score column vector $\left(\frac{\partial \log \pi(s, a; \theta)}{\partial \theta_i}\right)_{i=1}^{m}$.
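As a NumPy sketch of these two updates (assuming the compatible linear critic $Q(s,a;w) = SC(s,a;\theta)^T \cdot w$, with the score vectors supplied by whatever actor parameterization is in use; the function name `compatible_ac_step` and the demo numbers are illustrative assumptions):
```python=
import numpy as np

def compatible_ac_step(score_t: np.ndarray, score_tp1: np.ndarray,
                       reward: float, w: np.ndarray, gamma: float,
                       gamma_prod: float, alpha_theta: float,
                       alpha_w: float):
    """One actor/critic update with the compatible critic Q(s,a;w) = SC(s,a;theta)^T w.

    score_t, score_tp1: score vectors SC(S_t, A_t; theta) and SC(S_{t+1}, A_{t+1}; theta)
    gamma_prod: the accumulated discount gamma^t for the current step
    """
    q_t = score_t @ w                    # Q(S_t, A_t; w)
    q_tp1 = score_tp1 @ w                # Q(S_{t+1}, A_{t+1}; w)
    delta_w = alpha_w * (reward + gamma * q_tp1 - q_t) * score_t   # critic (TD) update
    delta_theta = alpha_theta * gamma_prod * score_t * q_t         # actor update: SC SC^T w
    return delta_theta, delta_w

# Illustrative call with placeholder numbers
d_theta, d_w = compatible_ac_step(
    score_t=np.array([0.3, -0.1]), score_tp1=np.array([-0.2, 0.4]),
    reward=1.0, w=np.array([0.05, 0.02]), gamma=0.9,
    gamma_prod=1.0, alpha_theta=0.01, alpha_w=0.1
)
```
In a full training loop, the score vectors would be recomputed from the current policy after every actor step, since the critic's features move with $\theta$.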
### Implications and Applications
The Compatible Function Approximation Theorem is pivotal in modern actor-critic methods, offering several benefits:
* **Use Powerful Function Approximators:** Flexible function approximators like neural networks can be employed for the critic without compromising policy gradient accuracy.
* **Efficiently Estimate Advantage Functions:** Compatible linear function approximators can effectively approximate the advantage function, streamlining policy gradient computation.
* **Develop Advanced Algorithms:** This theorem has paved the way for sophisticated actor-critic algorithms such as Natural Policy Gradient (NPG) and Trust Region Policy Optimization (TRPO), leveraging the compatibility condition for stable and efficient learning.
## Policy Gradient Methods in Practice
### Natural Policy Gradient (NPG)
Standard gradient ascent methods can sometimes falter in parameter spaces with complex geometries. **Natural Policy Gradient (NPG)** addresses this by incorporating the inherent structure of the parameter space into the optimization process.
#### Natural Gradient Descent
The core idea behind NPG is to find the update direction that maximizes the improvement in the objective function $J(\theta)$ while constraining the magnitude of the parameter change. This constraint is defined using a metric tensor $G(\theta)$, which captures the local geometry of the parameter space:
$$
|\Delta\theta|^2 = (\Delta\theta)^T \cdot G(\theta) \cdot \Delta\theta
$$
The natural gradient, denoted $\nabla_{\theta}^{nat} J(\theta)$, is the direction that maximizes the improvement in $J(\theta)$ subject to this constraint. It is calculated as:
$$
\nabla_{\theta}^{nat} J(\theta) = G(\theta)^{-1} \cdot \nabla_{\theta} J(\theta)
$$
Since we are maximizing $J(\theta)$, parameter updates ascend along the natural gradient:
$$
\Delta\theta = \alpha \cdot \nabla_{\theta}^{nat} J(\theta)
$$
where $\alpha$ is the learning rate.
#### NPG in Supervised Learning
In supervised learning of a conditional probability distribution, the metric tensor $G(\theta)$ is typically the Fisher Information Matrix (FIM). The same construction applies to a stochastic policy $\pi(s,a;\theta)$, for which the FIM $F(\theta)$ takes the form:
$$
F(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi} \left[SC(s,a;\theta) \cdot SC(s,a;\theta)^T \right]
$$
The FIM quantifies the information that the observed data provides about the model parameters. Thus, the natural gradient in this setting is:
$$
\nabla_{\theta}^{nat} J(\theta) = F(\theta)^{-1} \cdot \nabla_{\theta} J(\theta)
$$
#### NPG in Reinforcement Learning with Compatible Function Approximation
Remarkably, when employing compatible function approximation in policy gradient methods, the Fisher Information Matrix naturally arises in the policy gradient calculation. This leads to a simplified update rule for NPG in reinforcement learning:
* **Critic Update:** Update critic parameters $w$ using the standard TD error-based update:
$$
\Delta w = \alpha_{w} \cdot (R_{t+1} + \gamma \cdot SC(S_{t+1}, A_{t+1}; \theta)^T \cdot w - SC(S_{t}, A_{t}; \theta)^T \cdot w) \cdot SC(S_{t}, A_{t}; \theta)
$$
* **Actor Update:** Update actor parameters $\theta$ directly in the direction of the critic weights $w$:
$$
\Delta\theta = \alpha_{\theta} \cdot w
$$
### Deterministic Policy Gradient (DPG)
Deterministic Policy Gradient (DPG) is specifically designed for environments with continuous action spaces. Unlike stochastic policies that output probability distributions, deterministic policies directly map states to actions.
#### Actor-Critic DPG
In the Actor-Critic DPG framework:
* **Actor $\mu(s; \theta)$:** Represents the deterministic policy function (often the mean of a Gaussian distribution).
* **Critic $Q(s, a; w)$:** Approximates the action-value function.
DPG can be viewed as the limiting case of stochastic policy gradients as the variance of the policy's action distribution approaches zero.
#### Deterministic Policy Gradient Theorem
The Deterministic Policy Gradient Theorem (DPGT) provides the theoretical foundation for DPG, stating that the policy gradient for a deterministic policy is:
$$
\begin{split}
\nabla_\theta J(\theta)
&= \sum_{s \in \mathcal{N}} \rho^{\mu}(s) \cdot \nabla_\theta \mu(s; \theta) \cdot \nabla_a Q^{\mu}(s, a)|_{a=\mu(s;\theta)} \\
&= \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_\theta \mu(s; \theta) \cdot \nabla_a Q^{\mu}(s, a)|_{a=\mu(s;\theta)}\right]
\end{split}
$$
As a result, for Off-Policy Actor-Critic DPG, the Critic parameters $w$ and the Actor parameters $\theta$ are updated via:
$$
\begin{split}
\Delta w &= \alpha_w \cdot \left( R_{t+1}+\gamma \cdot Q(S_{t+1}, \mu(S_{t+1};\theta); w) - Q(S_t, A_t; w) \right) \cdot \nabla_w Q(S_t, A_t; w) \\
\Delta \theta &= \alpha_\theta \cdot \nabla_{\theta} \mu(S_t; \theta) \cdot \nabla_a Q(S_t, a; w)|_{a=\mu(S_t;\theta)}
\end{split}
$$
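A minimal sketch of these two updates with linear approximators follows: a scalar action, a hand-rolled feature map for the critic, and made-up names throughout. This is an assumption-laden illustration rather than a canonical DPG implementation.
```python=
import numpy as np

def dpg_step(phi_s: np.ndarray, phi_sp: np.ndarray, action: float,
             reward: float, theta: np.ndarray, w: np.ndarray,
             gamma: float, alpha_theta: float, alpha_w: float):
    """One off-policy DPG update with mu(s; theta) = theta^T phi(s) and a
    critic linear in the action: Q(s, a; w) = w_v^T phi(s) + a * (w_a^T phi(s)),
    where w = [w_v, w_a]."""
    m = phi_s.shape[0]
    w_v, w_a = w[:m], w[m:]

    def q(phi: np.ndarray, a: float) -> float:
        return w_v @ phi + a * (w_a @ phi)

    mu_sp = theta @ phi_sp                                # deterministic action at s'
    td_error = reward + gamma * q(phi_sp, mu_sp) - q(phi_s, action)
    grad_w_q = np.concatenate([phi_s, action * phi_s])    # grad_w Q(S_t, A_t; w)
    delta_w = alpha_w * td_error * grad_w_q               # critic update
    grad_a_q = w_a @ phi_s                                # grad_a Q(S_t, a; w) at a = mu(S_t)
    delta_theta = alpha_theta * phi_s * grad_a_q          # grad_theta mu(S_t) = phi(S_t)
    return delta_theta, delta_w
```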
#### Key Points
* DPG excels in high-dimensional continuous action spaces, as it eliminates the need to integrate over actions.
* The critic is essential for estimating the Q-value function's gradient, guiding policy updates.
* Compatible function approximation can be applied to DPG to mitigate bias in the policy gradient estimation.
## Evolutionary Strategies: An Alternative Approach to Reinforcement Learning
Evolutionary Strategies (ES) present a compelling alternative to traditional reinforcement learning (RL) algorithms for solving Markov Decision Processes (MDPs). While capable of addressing MDP control problems, ES distinguishes itself from RL through its unique approach.
### Key Characteristics of Evolutionary Strategies
* **Not RL Algorithms:** ES doesn't rely on value function estimation or explicit policy improvement, which are fundamental to RL. Instead, ES is a black-box optimization technique.
* **Inspired by Nature:** ES draws inspiration from biological evolution, mimicking natural selection. A population of candidate solutions iteratively improves through mutation, recombination, and selection.
* **Heuristic Search:** ES can be considered a heuristic search method guided by "survival of the fittest." It doesn't necessitate a deep understanding of the environment's dynamics, making it adaptable to diverse problems.
### Applying ES to MDPs: Natural Evolutionary Strategies (NES)
Natural Evolutionary Strategies (NES), a specific type of ES, is particularly suited for solving MDPs. In NES, the goal is to optimize a policy parameterized by a vector $\psi$, where $\psi$ is sampled from a multivariate Gaussian distribution with mean $\theta$ and isotropic covariance $\sigma^2 I_m$. The objective is to maximize the expected return of the MDP:
$$
\mathbb{E}_{\psi \sim p_\theta} [F(\psi)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_m)} [F(\theta + \sigma \cdot \epsilon)]
$$
### Maximizing Expected Returns with Stochastic Gradient Ascent
NES utilizes stochastic gradient ascent to iteratively refine the policy parameters $\theta$. The gradient of the expected return is estimated by:
1. Sampling multiple perturbation vectors $\epsilon$ from a standard normal distribution.
2. Evaluating the resulting policies.
3. Averaging the returns weighted by the perturbations.
Mathematically:
$$
\nabla_{\theta} J(\theta) = \frac{1}{\sigma} \cdot \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} [\epsilon \cdot F(\theta + \sigma \cdot \epsilon)]
$$
### NES Algorithm
:::success
<font color=blue>Algorithm: Natural Evolution Strategies</font>
1. **Initialize:**
* Learning rate (step size) $\alpha$
* Standard deviation $\sigma$
* Initial parameter vector $\theta_0$
2. **Repeat** for each generation $t$:
* Sample $n$ perturbation vectors: $\epsilon_1, \epsilon_2, \dots, \epsilon_n \sim \mathcal{N}(0, I_m)$
* Evaluate the return $F_i$ for each perturbed policy: $F_i \leftarrow F(\theta_t + \sigma \cdot \epsilon_i)$ for $i = 1, 2, \dots, n$
* Update the policy parameters $\theta$: $\theta_{t+1} \leftarrow \theta_t + \frac{\alpha}{n \sigma} \sum_{i=1}^n \epsilon_i \cdot F_i$
:::
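Below is a compact NumPy sketch of this loop. The objective `F` here is a stand-in quadratic so the example runs on its own; in the MDP setting, `F(theta + sigma * eps)` would instead be the return obtained by running an episode with the perturbed policy parameters.
```python=
import numpy as np

rng = np.random.default_rng(0)

def F(psi: np.ndarray) -> float:
    """Stand-in for the episodic return of the policy parameterized by psi."""
    return -float(np.sum((psi - 1.0) ** 2))    # maximized at psi = 1

theta = np.zeros(3)                 # initial parameter vector theta_0
alpha, sigma, n = 0.02, 0.5, 100    # step size, perturbation scale, population size

for generation in range(300):
    eps = rng.standard_normal((n, theta.size))             # n perturbation vectors
    returns = np.array([F(theta + sigma * e) for e in eps])
    # Monte Carlo estimate of (1 / sigma) * E[eps * F(theta + sigma * eps)]
    grad_estimate = (eps.T @ returns) / (n * sigma)
    theta = theta + alpha * grad_estimate                   # gradient ascent step

print(theta)    # each coordinate should end up close to 1.0
```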
### ES vs. RL: A Comparative Analysis
* **Efficiency:** Traditionally, ES was considered less sample-efficient than RL. However, recent advances have significantly improved ES data efficiency.
* **Implementation:** ES algorithms are often simpler to implement than RL methods, not requiring complex value function approximations or backpropagation. They are also highly parallelizable.
* **Robustness:** ES demonstrates remarkable robustness to challenges like delayed rewards, sparse rewards, and long time horizons.
In conclusion, Evolutionary Strategies offer a distinct and effective approach to solving MDP control problems. While they may not always match the sample efficiency of some RL methods, their simplicity, parallelizability, and robustness make them a valuable tool in the reinforcement learning arsenal.
## Reference
- Chapter 14 of the [RLForFinanceBook](https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf)
- [Policy Gradient Algorithms](https://github.com/coverdrive/technical-documents/blob/master/finance/cme241/Tour-PG.pdf) slides for CME 241: Foundations of Reinforcement Learning with Applications in Finance