# Batch RL, Experience-Replay, DQN, LSPI, Gradient TD
## Introduction
In Reinforcement Learning (RL), agents learn optimal behaviors by interacting with an environment. This learning process often involves estimating value functions, which quantify the expected long-term reward from a given state or state-action pair. We'll focus on two key concepts that enhance the efficiency and effectiveness of value function learning:
* **Experience Replay:** In RL, experience typically arrives as individual episodes or transitions. Experience replay allows us to store these experiences and revisit them later during the learning process. By sampling from this stored experience, we can break correlations between consecutive experiences and reuse valuable data multiple times, leading to more efficient learning.
* **Batch RL:** While some RL algorithms update value functions incrementally after each new experience (Incremental RL), batch RL methods utilize a collection of experiences (a batch) to perform updates. By processing a batch of data simultaneously, we can leverage the entire dataset for each update, potentially leading to more stable and accurate value function estimates.
In this context, we'll delve into several powerful RL algorithms that leverage these concepts:
* **Deep Q-Networks (DQN):** A deep learning approach that utilizes experience replay and fixed Q-learning targets to learn complex value functions.
* **Least Squares Policy Iteration (LSPI):** A batch RL method that efficiently learns optimal policies using least-squares techniques for value function estimation.
* **Gradient Temporal Difference (Gradient TD):** A family of algorithms that directly optimize the projected Bellman error using gradient descent, enabling efficient learning with non-linear function approximation.
## Batch RL and Experience Replay
In Reinforcement Learning, the way we update the value function—incrementally or in batches—is a crucial design choice. Let's explore these options and their implementations.
### Notation
* **(MRP) Value Function:** Our goal is to estimate the value function $V(s;w)$, which predicts the expected return from a state $s$, parameterized by weights $w$.
* **Loss Function:** We often use a squared error loss: $$\mathcal{L}_{(S_i, G_i)}(w) = \frac{1}{2} \cdot (V(S_i; w) - G_i)^2$$ where $G_i$ is the actual return observed from state $S_i$.
### Incremental MC Prediction
* **Data:** A sequence of state-return pairs $\mathcal{D} = \{(S_i, G_i)\}^n_{i=1}$.
* **Update:** After each experience $(S_i, G_i)$, update the weights w using gradient descent: $$\Delta w = \alpha \cdot (G_i - V(S_i; w)) \cdot \nabla_w V(S_i; w)$$ where $\alpha$ is the learning rate.
* **Drawbacks:** Incremental RL can be inefficient, as it essentially discards each data point after a single update.
### Batch MC Prediction
* **Data:** Same as Incremental MC.
* **Update:** After accumulating a batch of experiences, update $w$ by minimizing the average loss over the entire batch: $$w^* = \mathop{\arg \min}_w \frac{1}{n} \sum_{i=1}^n \frac{1}{2} (V(S_i; w) - G_i)^2$$ This results in the update rule: $$\Delta w = \alpha \cdot \frac{1}{n} \sum_{i=1}^n (G_i - V(S_i; w)) \cdot \nabla_w V(S_i; w)$$
* **Key Idea:** Batch MC uses experience replay, where the agent learns from past experiences multiple times.
#### Code Implementation
```python=
import itertools
from typing import Iterable, TypeVar

import rl.markov_process as mp
from rl.returns import returns
from rl.approximate_dynamic_programming import ValueFunctionApprox

S = TypeVar('S')

def batch_mc_prediction(
traces: Iterable[Iterable[mp.TransitionStep[S]]],
approx: ValueFunctionApprox[S],
gamma: float,
episode_length_tolerance: float = 1e-6,
convergence_tolerance: float = 1e-5
) -> ValueFunctionApprox[S]:
"""
Perform batch Monte Carlo prediction using the given traces and approximator.
Args:
traces: A finite iterable of transition step traces.
approx: The value function approximator.
gamma: The discount factor.
episode_length_tolerance: The tolerance for episode length.
convergence_tolerance: The tolerance for convergence.
Returns:
The updated value function approximator.
"""
return_steps: Iterable[mp.ReturnStep[S]] = itertools.chain.from_iterable(
returns(trace, gamma, episode_length_tolerance) for trace in traces
)
return_pairs = [(step.state, step.return_) for step in return_steps]
return approx.solve(return_pairs, convergence_tolerance)
```
### Batch TD Prediction
* **Data:** A sequence of transitions $\mathcal{D} = \{(S_i, R_i, S_i')\}^n_{i=1}$, where $R_i$ is the reward and $S'_i$ the next state.
* **Update:** Similar to Batch MC, but uses the temporal difference error for updates: $$\Delta w = \alpha \cdot \frac{1}{n} \sum_{i=1}^n (R_i + \gamma \cdot V(S'_i; w) - V(S_i; w)) \cdot \nabla_w V(S_i; w)$$ where $\gamma$ is the discount factor.
#### Code Implementation
```python=
import itertools
from typing import Iterable, Sequence, TypeVar

import rl.markov_process as mp
from rl.approximate_dynamic_programming import ValueFunctionApprox, extended_vf
import rl.iterate as iterate

S = TypeVar('S')
def batch_td_prediction(
transitions: Iterable[mp.TransitionStep[S]],
approx_0: ValueFunctionApprox[S],
gamma: float,
convergence_tolerance: float = 1e-5
) -> ValueFunctionApprox[S]:
"""
Perform batch TD prediction using the given transitions and initial approximator.
Args:
transitions: A finite iterable of transition steps.
approx_0: The initial value function approximator.
gamma: The discount factor.
convergence_tolerance: The tolerance for convergence.
Returns:
The updated value function approximator.
"""
def step(
v: ValueFunctionApprox[S],
tr_seq: Sequence[mp.TransitionStep[S]]
) -> ValueFunctionApprox[S]:
return v.update([(
tr.state, tr.reward + gamma * extended_vf(v, tr.next_state)
) for tr in tr_seq])
def done(
a: ValueFunctionApprox[S],
b: ValueFunctionApprox[S],
tolerance: float = convergence_tolerance
) -> bool:
return b.within(a, tolerance)
    # repeatedly sweep the full batch of transitions until the approximation converges
    return iterate.converged(
        iterate.accumulate(
            itertools.repeat(list(transitions)),
            step,
            initial=approx_0
        ),
        done=done
    )
```
### Batch TD($\lambda$) Prediction
* **Data:** A sequence of trajectories $$D = \{(S_{i,0}, R_{i,1}, S_{i,1}, R_{i,2}, S_{i,2}, \ldots, R_{i,T_i}, S_{i,T_i}) \mid 1 \leq i \leq n\}$$ each consisting of multiple transitions.
* **Update:** Extends Batch TD by incorporating eligibility traces \begin{split} E_{i,0} &= \nabla_w V(S_{i,0}; w) \\ E_{i,t} &= \gamma \lambda \cdot E_{i,t-1} + \nabla_w V(S_{i,t}; w) \text{ for } t = 1, 2, \ldots, T_i - 1 \end{split} which enter the update:
$$
\Delta w = \alpha \cdot \frac{1}{n} \sum_{i=1}^n \frac{1}{T_i} \sum_{t=0}^{T_i-1} (R_{i,t+1} + \gamma \cdot V(S_{i,t+1}; w) - V(S_{i,t}; w)) \cdot E_{i,t}
$$
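No library implementation is shown for this variant, so here is a minimal numpy sketch of one batch TD($\lambda$) weight update. It assumes a linear value function $V(s; w) = \phi(s)^T \cdot w$, a hypothetical feature map `phi`, and episodes given as lists of `(state, reward, next_state, is_terminal)` tuples:
```python=
import numpy as np

def batch_td_lambda_update(trajectories, phi, w, gamma, lambd, alpha):
    """One batch TD(lambda) update for a linear V(s; w) = phi(s)^T w.

    trajectories: list of episodes, each a list of (state, reward, next_state, is_terminal)
    phi: hypothetical feature map, state -> np.ndarray of shape (m,)
    """
    n = len(trajectories)
    delta_w = np.zeros_like(w)
    for episode in trajectories:
        eligibility = np.zeros_like(w)
        acc = np.zeros_like(w)
        for state, reward, next_state, is_terminal in episode:
            features = phi(state)
            # for linear V, grad_w V(s; w) = phi(s), so the trace accumulates feature vectors
            eligibility = gamma * lambd * eligibility + features
            v_next = 0.0 if is_terminal else phi(next_state) @ w
            td_error = reward + gamma * v_next - features @ w
            acc += td_error * eligibility
        delta_w += acc / len(episode)        # the 1/T_i weighting from the formula above
    return w + alpha * delta_w / n           # average over the n trajectories
```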
## Least-Squares RL Prediction
In Reinforcement Learning, function approximation is a powerful tool for representing complex value functions. When we assume the value function $V(s; w)$ can be linearly approximated by a set of feature functions of the state space, we can directly solve for the optimal weights $w$ using linear algebra, bypassing the need for iterative updates.
### Notation
* **Sequence of feature functions:** $\phi_j:\mathcal{S} \rightarrow \mathbb{R}$, where $j =1,2, \dots,m$. Each feature function extracts a relevant aspect of the state.
* **Weight vector:** $w = (w_1, w_2, \dots, w_m) \in \mathbb{R}^m$. These weights determine the contribution of each feature to the overall value estimate.
The linear value function approximation is then expressed as:
$$
V(s; w) = \sum_{j=1}^m \phi_j(s) \cdot w_j = \phi(s)^T \cdot w
$$
where $\phi(s) \in \mathbb{R}^m$ is the feature vector for state $s$.
### Least-Squares Monte Carlo (LSMC)
LSMC directly estimates the value function from state-return pairs using linear regression.
* **Data:** A dataset $\mathcal{D} = \{(S_i, G_i)\}^n_{i=1}$ of state-return pairs, where $G_i$ is the return from state $S_i$.
* **Loss Function:** The mean squared error between the approximated value and the actual return: $$\mathcal{L}(w) = \frac{1}{2n} \sum_{i=1}^n \left( \phi(S_i)^T \cdot w - G_i \right)^2$$
* **Solution:** The optimal weights $w^*$ are obtained by minimizing the loss function, leading to the closed-form solution: $$w^* = A^{-1}b$$ where $$A = \sum_{i=1}^n \phi(S_i) \cdot \phi(S_i)^T, \quad b = \sum_{i=1}^n \phi(S_i) \cdot G_i$$
* **Updating Rules (Incremental):** To efficiently update $A^{-1}$ as new data arrives, the Sherman-Morrison formula can be applied:
$$
(A_{i-1} + \phi(S_i) \cdot \phi(S_i)^T)^{-1} = A_{i-1}^{-1} - \frac{A_{i-1}^{-1} \cdot \phi(S_i) \cdot \phi(S_i)^T \cdot A_{i-1}^{-1}}{1 + \phi(S_i)^T \cdot A_{i-1}^{-1} \cdot \phi(S_i)}
$$
#### Code Implementation
```python=
import itertools
from typing import Iterable, TypeVar
from rl.distribution import Categorical
from rl.approximate_dynamic_programming import ValueFunctionApprox
from rl.markov_process import TransitionStep, ReturnStep
from rl.returns import returns
S = TypeVar('S')
def batch_mc_prediction(
traces: Iterable[Iterable[TransitionStep[S]]],
approx: ValueFunctionApprox[S],
gamma: float,
episode_length_tolerance: float = 1e-6,
convergence_tolerance: float = 1e-5
) -> ValueFunctionApprox[S]:
"""
Perform batch Monte Carlo prediction using the given traces and approximator.
Args:
traces: A finite iterable of transition step traces.
approx: The value function approximator (assumed to be LinearFunctionApprox).
gamma: The discount factor.
episode_length_tolerance: The tolerance for episode length.
convergence_tolerance: The tolerance for convergence.
Returns:
The updated value function approximator.
"""
# Set the 'direct_solve' attribute of the approximator to True
approx.direct_solve = True
return_steps: Iterable[ReturnStep[S]] = itertools.chain.from_iterable(
returns(trace, gamma, episode_length_tolerance) for trace in traces
)
return_pairs = [(step.state, step.return_) for step in return_steps]
return approx.solve(return_pairs, convergence_tolerance)
```
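The library call above solves the least-squares problem directly. As a rough sketch of the incremental route via the Sherman-Morrison formula described earlier, assuming a hypothetical feature map `phi` that returns an `m`-dimensional numpy array:
```python=
import numpy as np

def lsmc_weights(state_return_pairs, phi, num_features, epsilon=1e-5):
    """Least-Squares Monte Carlo: maintain A^{-1} incrementally via Sherman-Morrison.

    state_return_pairs: iterable of (state, return) pairs
    phi: hypothetical feature map, state -> np.ndarray of shape (num_features,)
    epsilon: small regularizer so the initial A^{-1} = I / epsilon is well-defined
    """
    a_inv = np.eye(num_features) / epsilon
    b = np.zeros(num_features)
    for state, ret in state_return_pairs:
        features = phi(state)
        # Sherman-Morrison: update (A + phi phi^T)^{-1} directly from A^{-1}
        a_inv_phi = a_inv @ features
        a_inv -= np.outer(a_inv_phi, a_inv_phi) / (1.0 + features @ a_inv_phi)
        b += features * ret
    return a_inv @ b   # w* = A^{-1} b
```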
### Least-Squares Temporal Difference (LSTD)
LSTD, similar to LSMC, finds the optimal weights using linear regression but leverages the temporal difference error for updates.
* **Data:** A dataset $\mathcal{D} = \{(S_i, R_i, S_i')\}^n_{i=1}$ of state-reward-next state triples.
* **Loss Function:** The mean squared TD error:
$$
\mathcal{L}(w) = \frac{1}{2n} \sum_{i=1}^n \left(\phi(S_i)^T \cdot w - (R_i + \gamma \cdot \phi(S_i')^T \cdot w)\right)^2
$$
* **Solution:** The optimal weights $w^*$ are again found by minimizing the loss, with a slight modification to the matrix A:
$$
A = \sum_{i=1}^n \phi(S_i) \cdot (\phi(S_i) - \gamma \cdot \phi(S'_i))^T, \quad b = \sum_{i=1}^n \phi(S_i) \cdot R_i
$$
#### Code Implementation
```python=
from typing import Callable, Iterable, Sequence, TypeVar
import numpy as np
from rl.function_approx import LinearFunctionApprox, Weights
from rl.markov_process import TransitionStep, NonTerminal
S = TypeVar('S')
def least_squares_td(
transitions: Iterable[TransitionStep[S]],
feature_functions: Sequence[Callable[[NonTerminal[S]], float]],
gamma: float,
epsilon: float
) -> LinearFunctionApprox[NonTerminal[S]]:
"""
Perform Least Squares Temporal Difference (LSTD) learning on the given transitions.
Args:
transitions: A finite iterable of transition steps.
feature_functions: A sequence of feature functions.
gamma: The discount factor.
epsilon: The regularization parameter.
Returns:
The learned linear function approximator.
"""
num_features = len(feature_functions)
a_inv = np.eye(num_features) / epsilon
b_vec = np.zeros(num_features)
for tr in transitions:
phi1 = np.array([f(tr.state) for f in feature_functions])
phi2 = phi1 - gamma * np.array([f(tr.next_state) for f in feature_functions]) \
if isinstance(tr.next_state, NonTerminal) else phi1
temp = a_inv.T.dot(phi2)
a_inv -= np.outer(a_inv.dot(phi1), temp) / (1 + phi1.dot(temp))
b_vec += phi1 * tr.reward
opt_weights = a_inv.dot(b_vec)
return LinearFunctionApprox.create(
feature_functions=feature_functions,
weights=Weights.create(opt_weights)
)
```
### LSTD($\lambda$)
* **Data:** A sequence of trajectories $$D = \{(S_{i,0}, R_{i,1}, S_{i,1}, R_{i,2}, S_{i,2}, \ldots, R_{i,T_i}, S_{i,T_i}) \mid 1 \leq i \leq n\}$$ each consisting of multiple transitions.
* **Solution:** Extends LSTD by incorporating eligibility traces: $$A = \sum^n_{i=1} \frac{1}{T_i} \sum_{t=0}^{T_i-1} E_{i,t} \cdot \left(\phi(S_{i,t}) - \gamma \cdot \phi(S_{i,t+1})\right)^T, \quad b = \sum_{i=1}^n \frac{1}{T_i} \sum_{t=0}^{T_i-1} E_{i,t} \cdot R_{i,t+1}$$ where $E_{i,t}$ is the eligibility trace at time $t$ of trajectory $i$ (accumulated as in Batch TD($\lambda$)), allowing us to control the trade-off between bias and variance via $\lambda$.
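No library implementation is shown for LSTD($\lambda$), so here is a minimal numpy sketch of the accumulation above. It assumes a linear value function, a hypothetical feature map `phi`, and episodes given as lists of `(state, reward, next_state, is_terminal)` tuples:
```python=
import numpy as np

def lstd_lambda_weights(trajectories, phi, num_features, gamma, lambd, epsilon=1e-5):
    """LSTD(lambda): accumulate A and b over trajectories using eligibility traces.

    trajectories: list of episodes, each a list of (state, reward, next_state, is_terminal)
    phi: hypothetical feature map, state -> np.ndarray of shape (num_features,)
    """
    a_mat = np.eye(num_features) * epsilon   # small ridge term keeps A invertible
    b_vec = np.zeros(num_features)
    for episode in trajectories:
        eligibility = np.zeros(num_features)
        weight = 1.0 / len(episode)          # the 1/T_i weighting from the formula above
        for state, reward, next_state, is_terminal in episode:
            features = phi(state)
            next_features = np.zeros(num_features) if is_terminal else phi(next_state)
            eligibility = gamma * lambd * eligibility + features
            a_mat += weight * np.outer(eligibility, features - gamma * next_features)
            b_vec += weight * eligibility * reward
    return np.linalg.solve(a_mat, b_vec)     # w* = A^{-1} b
```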
### Convergence of Methods
| On/Off Policy | Algorithm | Tabular | Linear | Non-Linear |
|---------------|----------------|---------|--------|------------|
| **On-Policy** | MC | ✔️ | ✔️ | ✔️ |
| | LSMC | ✔️ | ✔️ | - |
| | TD | ✔️ | ✔️ | ❌ |
| | LSTD | ✔️ | ✔️ | - |
| | Gradient TD | ✔️ | ✔️ | ✔️ |
| **Off-Policy**| MC | ✔️ | ✔️ | ✔️ |
| | LSMC | ✔️ | ❌ | - |
| | TD | ✔️ | ❌ | ❌ |
| | LSTD | ✔️ | ❌ | - |
| | Gradient TD | ✔️ | ✔️ | ✔️ |
## Deep Q-Networks (DQN) Algorithm
Deep Q-Networks (DQN) is a powerful reinforcement learning control algorithm that combines Q-learning with deep neural networks and experience replay. This approach has proven highly effective in learning to play complex games and solve challenging control tasks.
### DQN Architecture
DQN leverages two deep neural networks:
* **Q-Network (Main Network):** This network, parameterized by weights $w$, is responsible for estimating the action-value function, or Q-values, for the current state $s$. The Q-value $Q(s,a;w)$ represents the expected return for taking action $a$ in state $s$.
* **Target Network:** This network, with weights $w^-$ is a periodic copy of the Q-network. It is used to generate the target Q-values for training the Q-network. By keeping the target network fixed for a certain number of updates, it helps stabilize the learning process.
#### Why Update the Target Network Infrequently?
In Q-learning, the target values for updating the Q-values are themselves based on the current Q-values. This creates a moving target problem, where the Q-values and their targets can co-adapt in ways that lead to instability. By updating the target network infrequently, we essentially "freeze" the targets for a while, giving the Q-network a chance to learn towards these stable targets. This approach significantly improves the stability and convergence of the algorithm.
#### Key Advantages of Deep Q-Learning:
* **Increased Representational Power:** Neural networks can model complex, non-linear relationships between states and values.
* **Ability to Handle High-Dimensional State Spaces:** Deep Q-Learning can effectively process large amounts of information, making it suitable for problems with rich state descriptions, such as options on multiple underlying assets or other complex features.
### DQN Algorithm
Here's a simplified outline of the DQN algorithm:
1. **Initialize:** Initialize the Q-network and target network with random weights ($w$ and $w^-$).
2. **Repeat:**
    * **Choose Action:** For the current state $S_t$, select an action $A_t$ using an $\epsilon$-greedy policy based on the Q-network's output: $$A_t = \begin{cases} \mathop{\arg\max}_a Q(S_t, a; w),& \text{ with probability } 1-\epsilon \\ \text{random action},& \text{ with probability } \epsilon \end{cases}$$
* **Execute Action:** Take the chosen action, observe the reward $R_{t+1}$ and the next state $S_{t+1}$.
* **Store Experience:** Store the transition $(S_t, A_t, R_{t+1}, S_{t+1})$ in the experience replay memory $D$.
* **Sample Mini-Batch:** Sample a random mini-batch of transitions $(s_i, a_i, r_i, s'_i)$ from the replay memory.
    * **Compute Targets:** Compute the Q-learning target for each sampled transition using the target network: $$y_i = r_i + \gamma \cdot \max_{a'} Q(s'_i, a'; w^-)$$
    * **Update Q-Network:** Perform a gradient descent step on the Q-network to minimize the mean squared error between the predicted Q-values $Q(s_i, a_i; w)$ and the targets $y_i$: $$w \leftarrow w + \alpha \cdot \frac{1}{|B|}\sum_i \left(y_i - Q(s_i, a_i; w) \right) \cdot \nabla_w Q(s_i, a_i; w)$$
* **Update Target Network:** Every $C$ steps, update the target network weights $w^- \leftarrow w$.
### Code Implementation
```python=
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
from collections import deque
class DQN(nn.Module):
def __init__(self, state_dim, action_dim, hidden_dim):
super(DQN, self).__init__()
self.fc1 = nn.Linear(state_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, hidden_dim)
self.fc3 = nn.Linear(hidden_dim, action_dim)
def forward(self, state):
x = torch.relu(self.fc1(state))
x = torch.relu(self.fc2(x))
return self.fc3(x)
class ReplayMemory:
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
def push(self, transition):
self.memory.append(transition)
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
def __len__(self):
return len(self.memory)
def train_dqn(env, num_episodes, hidden_dim, batch_size, gamma, epsilon, target_update):
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
q_net = DQN(state_dim, action_dim, hidden_dim)
target_net = DQN(state_dim, action_dim, hidden_dim)
target_net.load_state_dict(q_net.state_dict())
optimizer = optim.Adam(q_net.parameters())
memory = ReplayMemory(10000)
for episode in range(num_episodes):
state = env.reset()
done = False
while not done:
if random.random() < epsilon:
action = env.action_space.sample()
else:
with torch.no_grad():
state_tensor = torch.FloatTensor(state).unsqueeze(0)
q_values = q_net(state_tensor)
action = q_values.argmax().item()
next_state, reward, done, _ = env.step(action)
memory.push((state, action, reward, next_state, done))
if len(memory) >= batch_size:
transitions = memory.sample(batch_size)
batch = tuple(zip(*transitions))
state_batch = torch.FloatTensor(batch[0])
action_batch = torch.LongTensor(batch[1]).unsqueeze(1)
reward_batch = torch.FloatTensor(batch[2])
next_state_batch = torch.FloatTensor(batch[3])
done_batch = torch.FloatTensor(batch[4])
q_values = q_net(state_batch).gather(1, action_batch)
with torch.no_grad():
next_q_values = target_net(next_state_batch).max(1)[0]
targets = reward_batch + gamma * next_q_values * (1 - done_batch)
loss = (q_values - targets.unsqueeze(1)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
state = next_state
if episode % target_update == 0:
target_net.load_state_dict(q_net.state_dict())
return q_net
```
## Least-Squares Policy Iteration (LSPI)
Least-Squares Policy Iteration (LSPI) is an off-policy reinforcement learning algorithm that combines the benefits of both Q-learning and least-squares temporal difference (LSTD) methods. In contrast to on-policy methods (like MC or TD), where the agent learns only from the current policy's experience, LSPI allows the agent to learn from any behavior policy, making it highly efficient and flexible.
LSPI operates within the framework of Generalized Policy Iteration (GPI), where the agent iteratively improves its policy by alternating between policy evaluation and policy improvement. However, LSPI distinguishes itself by employing least-squares techniques during the policy evaluation phase.
### Key Features of LSPI
* **Off-Policy Learning:** LSPI can learn from data generated by any behavior policy, not just the one being evaluated. This enables the reuse of past experiences and efficient exploration.
* **Batch Updates:** LSPI performs updates using batches of collected experiences, leading to more stable and reliable value function estimates.
* **Linear Function Approximation:** LSPI assumes a linear representation of the action-value function (Q-function), allowing for efficient updates using linear algebra:
$$
Q^\pi(s, a) \approx Q(s, a; w^*) = \phi(s, a)^T \cdot w^*
$$
### LSPI Algorithm
1. **Initialization:** Start with an initial policy $\pi_0$.
2. **Policy Evaluation (LSTDQ):** Using a batch of experiences $\mathcal{D}= \{(s_i,a_i,r_i,s'_i)\}^n_{i=1}$ and the current deterministic target policy $\pi_D$ (initially $\pi_0$), compute the least-squares solution for the Q-value function:
* The loss function is: \begin{split} \mathcal{L}(w) &= \sum_{i=1}^n \left(Q(s_i, a_i; w) - \left(r_i + \gamma Q(s'_i, \pi_D(s'_i); w)\right)\right)^2 \\ &= \sum_{i=1}^n \left(\phi(s_i, a_i)^T \cdot w - (r_i + \gamma \cdot \phi(s'_i, \pi_D(s'_i))^T \cdot w)\right)^2 \end{split}
* Solve for the optimal weights $w^*$ using the updating rules: \begin{split} A &\leftarrow A + \phi(s_i, a_i) \cdot (\phi(s_i, a_i) - \gamma \cdot \phi(s'_i, \pi_D(s'_i)))^T \\ b &\leftarrow b + \phi(s_i, a_i) \cdot r_i \\ w^* &= A^{-1}b \end{split}
3. **Policy Improvement:** Update the policy greedily with respect to the new Q-value function:
$$
\pi'(s) = \mathop{\arg \max}_a Q(s, a; w^*)
$$
4. **Repeat:** Alternate steps 2 and 3 until the policy (or $w^*$) converges.
#### Code Implementation
```python=
from typing import Callable, Iterable, Iterator, Sequence, Tuple, TypeVar
from rl.markov_decision_process import TransitionStep, NonTerminal
from rl.policy import DeterministicPolicy
from rl.function_approx import LinearFunctionApprox
from rl.td import least_squares_tdq
from rl.policy_gradient import greedy_policy_from_qvf
S = TypeVar('S')
A = TypeVar('A')
def least_squares_policy_iteration(
transitions: Iterable[TransitionStep[S, A]],
actions: Callable[[NonTerminal[S]], Iterable[A]],
feature_functions: Sequence[Callable[[Tuple[NonTerminal[S], A]], float]],
initial_target_policy: DeterministicPolicy[S, A],
gamma: float,
epsilon: float
) -> Iterator[LinearFunctionApprox[Tuple[NonTerminal[S], A]]]:
"""
Perform Least Squares Policy Iteration (LSPI) on the given transitions.
Args:
transitions: A finite iterable of transition steps.
actions: A function that returns the available actions for a given state.
feature_functions: A sequence of feature functions.
initial_target_policy: The initial target policy.
gamma: The discount factor.
epsilon: The regularization parameter.
Yields:
The linear function approximator for each iteration of LSPI.
"""
target_policy = initial_target_policy
transitions_seq = list(transitions)
while True:
q = least_squares_tdq(
transitions=transitions_seq,
feature_functions=feature_functions,
target_policy=target_policy,
gamma=gamma,
epsilon=epsilon,
)
target_policy = greedy_policy_from_qvf(q, actions)
yield q
```
### Convergence of RL Control Algorithms
| Algorithm | Tabular | Linear | Non-Linear |
|-------------|---------|---------|------------|
| MC Control | ✔ | (✔) | ❌ |
| SARSA | ✔ | (✔) | ❌ |
| Q-Learning | ✔ | ❌ | ❌ |
| LSPI | ✔ | (✔) | - |
| Gradient Q-Learning | ✔ | ✔ | ❌ |
(✔) means the algorithm chatters around a near-optimal value function
### Example: The Vampire's Dilemma
Imagine a village plagued by a nightly visit from a vampire. Each morning, the vampire randomly chooses one villager to feed upon. The villagers, desperate to survive, devise a plan: they will poison some of their own each night, hoping the vampire will consume a poisoned individual and perish. However, the poison is fatal to villagers as well, and they die the day after being poisoned.
The villagers face a critical question: How many villagers should they poison each night to maximize the expected number of survivors once the vampire is defeated (or everyone is dead)?
#### Problem Formulation
We can model this scenario as a Markov Decision Process (MDP):
* **State $s$:** The number of villagers remaining on a given night.
* **Terminal State $T$:** Zero villagers remaining (everyone dies).
* **Action $a$:** The number of villagers to poison each night ($0$ to $s-1$).
* **Reward $r$:** Zero if the vampire is alive, and the number of remaining villagers if the vampire dies.
The transition probabilities are:
$$
P_R(s, a, r, s') =
\begin{cases}
\frac{s-a}{s} & \text{if } r = 0 \text{ and } s' = s - a - 1 \\
\frac{a}{s} & \text{if } r = s - a \text{ and } s' = 0 \\
0 & \text{otherwise}
\end{cases}
$$
#### Solving the MDP
We can solve this MDP to find the optimal policy using various methods, including value iteration and LSPI. LSPI is particularly well-suited here due to its ability to learn from batches of data and use linear function approximation.
```python=
from typing import Mapping, Tuple, Iterable, Iterator, Sequence, Callable, \
List
from rl.markov_process import NonTerminal
from rl.markov_decision_process import FiniteMarkovDecisionProcess, \
TransitionStep
from rl.distribution import Categorical, Choose
from rl.function_approx import LinearFunctionApprox
from rl.policy import DeterministicPolicy, FiniteDeterministicPolicy
from rl.dynamic_programming import value_iteration_result, V
from rl.chapter11.control_utils import get_vf_and_policy_from_qvf
from rl.td import least_squares_policy_iteration
from numpy.polynomial.laguerre import lagval
import itertools
import rl.iterate as iterate
import numpy as np
class VampireMDP(FiniteMarkovDecisionProcess[int, int]):
initial_villagers: int
def __init__(self, initial_villagers: int):
self.initial_villagers = initial_villagers
super().__init__(self.mdp_map())
def mdp_map(self) -> \
Mapping[int, Mapping[int, Categorical[Tuple[int, float]]]]:
return {s: {a: Categorical(
{(s - a - 1, 0.): 1 - a / s, (0, float(s - a)): a / s}
) for a in range(s)} for s in range(1, self.initial_villagers + 1)}
def vi_vf_and_policy(self) -> \
Tuple[V[int], FiniteDeterministicPolicy[int, int]]:
return value_iteration_result(self, 1.0)
def lspi_features(
self,
factor1_features: int,
factor2_features: int
) -> Sequence[Callable[[Tuple[NonTerminal[int], int]], float]]:
ret: List[Callable[[Tuple[NonTerminal[int], int]], float]] = []
ident1: np.ndarray = np.eye(factor1_features)
ident2: np.ndarray = np.eye(factor2_features)
for i in range(factor1_features):
def factor1_ff(x: Tuple[NonTerminal[int], int], i=i) -> float:
return lagval(
float((x[0].state - x[1]) ** 2 / x[0].state),
ident1[i]
)
ret.append(factor1_ff)
for j in range(factor2_features):
def factor2_ff(x: Tuple[NonTerminal[int], int], j=j) -> float:
return lagval(
float((x[0].state - x[1]) * x[1] / x[0].state),
ident2[j]
)
ret.append(factor2_ff)
return ret
def lspi_transitions(self) -> Iterator[TransitionStep[int, int]]:
states_distribution: Choose[NonTerminal[int]] = \
Choose(self.non_terminal_states)
while True:
state: NonTerminal[int] = states_distribution.sample()
            action: int = Choose(range(state.state)).sample()
next_state, reward = self.step(state, action).sample()
transition: TransitionStep[int, int] = TransitionStep(
state=state,
action=action,
next_state=next_state,
reward=reward
)
yield transition
def lspi_vf_and_policy(self) -> \
Tuple[V[int], FiniteDeterministicPolicy[int, int]]:
transitions: Iterable[TransitionStep[int, int]] = itertools.islice(
self.lspi_transitions(),
20000
)
qvf_iter: Iterator[LinearFunctionApprox[Tuple[
NonTerminal[int], int]]] = least_squares_policy_iteration(
transitions=transitions,
actions=self.actions,
feature_functions=self.lspi_features(4, 4),
initial_target_policy=DeterministicPolicy(
lambda s: int(s / 2)
),
γ=1.0,
ε=1e-5
)
qvf: LinearFunctionApprox[Tuple[NonTerminal[int], int]] = \
iterate.last(
itertools.islice(
qvf_iter,
20
)
)
return get_vf_and_policy_from_qvf(self, qvf)
```
#### Results
The plot below shows the optimal value function and policy learned using LSPI. The x-axis represents the number of villagers remaining, and the y-axis represents the expected number of survivors under the optimal policy.

The second plot depicts the optimal policy. For each number of villagers, it shows how many should be poisoned to maximize the expected number of survivors.

## RL for Optimal Exercise of American Options
American options offer the unique flexibility of early exercise, meaning they can be exercised at any time before their expiration date. This introduces the concept of path dependence, where the option's value depends not only on the final price of the underlying asset but also on the price trajectory over time. The problem of finding the optimal time to exercise an American option is an optimal stopping problem.
### Traditional Approach: Longstaff-Schwartz Algorithm
The financial industry often relies on the Longstaff-Schwartz algorithm for American option pricing. This method involves:
1. **Sampling Price Traces:** Simulate multiple possible price paths for the underlying asset.
2. **Continuation Value Estimation:** Use function approximation (e.g., regression) to estimate the value of continuing to hold the option (the "continuation value") in states where it's "in-the-money" (profitable to exercise).
3. **Recursive Decision Process:** Starting from the expiration date and moving backward, compare the immediate payoff of exercising with the estimated continuation value. Exercise the option if the payoff is greater.
### Reinforcement Learning Approaches
Given the nature of American option pricing as an optimal stopping problem, Reinforcement Learning (RL) methods like Least Squares Policy Iteration (LSPI) and Deep Q-learning can be applied. Let's explore how LSPI can be adapted for this task.
### LSPI for American Option Pricing
We can model the option pricing problem as a Markov Decision Process (MDP):
* **State $s$:** A combination of the current time and the relevant price history of the underlying asset.
* **Action $a$:** A binary decision:
* $c$: continue holding the option
* $e$: exercise the option
* **Reward $r$:**
* 0 for continuing ($a = c$)
* The option's payoff $g(s)$ for exercising ($a = e$)
* **State Transitions:** Governed by the risk-neutral price dynamics of the underlying asset.
The goal is to learn a policy that maximizes the expected discounted reward, effectively determining the optimal exercise time. LSPI can estimate the Q-values for both continuing and exercising, leading us to the optimal policy.
#### Mathematical Formulation
Let:
* $g(s)$ be the payoff for exercising in state $s$.
* $\hat{Q}(s,a;w)$ be the approximated Q-value, parameterized by weights $w$.
* $\phi(s)$ be the feature vector for state $s$.
We approximate the Q-value as follows:
$$
\hat{Q}(s, a; w) =
\begin{cases}
\phi(s)^T \cdot w & \text{if } a = c \\
g(s) & \text{if } a = e
\end{cases}
$$
#### Updating Rules
We update the weights $w$ using a modified LSPI semi-gradient equation:
$$
0 = \sum_i \phi(s_i) \cdot (\phi(s_i)^T \cdot w^* - \mathbf{1}_{c} \cdot \gamma \cdot \phi(s'_i)^T \cdot w^* - \mathbf{1}_{e} \cdot \gamma \cdot g(s'_i))
$$
where
* $\mathbf{1}_{c}$ equals 1 if it is optimal to continue in the next state $s'_i$ (the continuation value exceeds the exercise payoff there), and 0 otherwise.
* $\mathbf{1}_{e}$ equals 1 if it is optimal to exercise in the next state $s'_i$, and 0 otherwise.
* $\gamma$ is the discount factor.
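As a rough illustration of these rules, here is a minimal numpy sketch that accumulates $A$ and $b$ over a batch of transitions and solves for the continuation-value weights. The feature map `phi`, the payoff function `g`, the transition format, and the treatment of expiry as a forced exercise are all assumptions of the sketch:
```python=
import numpy as np

def lspi_american_option_weights(transitions, phi, g, w, gamma, epsilon=1e-5):
    """One LSTDQ-style solve for the 'continue' value of an American option.

    transitions: iterable of (state, next_state, next_is_terminal) tuples
    phi: hypothetical feature map, state -> np.ndarray of shape (m,)
    g: payoff function g(s) for exercising in state s
    w: weights from the previous LSPI iteration, used to decide continue vs. exercise at s'
    """
    m = len(w)
    a_mat = np.eye(m) * epsilon              # small ridge term keeps A invertible
    b_vec = np.zeros(m)
    for state, next_state, next_is_terminal in transitions:
        features = phi(state)
        if next_is_terminal:
            # at expiry the only choice is to exercise: the target is gamma * g(s')
            a_mat += np.outer(features, features)
            b_vec += gamma * g(next_state) * features
        else:
            next_features = phi(next_state)
            if next_features @ w >= g(next_state):   # indicator 1_c: continue at s'
                a_mat += np.outer(features, features - gamma * next_features)
            else:                                    # indicator 1_e: exercise at s'
                a_mat += np.outer(features, features)
                b_vec += gamma * g(next_state) * features
    return np.linalg.solve(a_mat, b_vec)     # new weights w* for the next LSPI iteration
```
Iterating this solve, feeding each iteration's $w^*$ back in as $w$, gives the LSPI loop specialized to the optimal-stopping structure of the problem.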
### Deep Q-Learning for American Options
Deep Q-learning (DQL) offers a powerful alternative to LSPI, replacing the linear function approximation with a deep neural network $f(s; w)$ to estimate the Q-value for the "continue" action:
$$
\hat{Q}(s, a; w) =
\begin{cases}
f(s; w) & \text{if } a = c \\
g(s) & \text{if } a = e
\end{cases}
$$
#### Updating Rules in Deep Q-Learning
The update rule for the neural network weights $w$ is based on the Q-learning temporal difference error:
$$
\Delta w = \alpha \cdot (\gamma \cdot \hat{Q}(s'_i, \pi(s'_i); w) - f(s_i; w)) \cdot \nabla_w f(s_i; w)
$$
This translates to:
* **Non-terminal state $s'_i$:**
$$
\Delta w = \alpha \cdot (\gamma \cdot \max(g(s'_i), f(s'_i; w)) - f(s_i; w)) \cdot \nabla_w f(s_i; w)
$$
* **Terminal state $s'_i$:**
$$
\Delta w = \alpha \cdot (\gamma \cdot g(s'_i) - f(s_i; w)) \cdot \nabla_w f(s_i; w)
$$
Here, $\alpha$ is the learning rate, and the gradient $\nabla_w f(s_i; w)$ is computed using backpropagation.
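As a rough illustration, here is a minimal PyTorch sketch of this update for a single transition. The continuation-value network `f`, the payoff function `g`, and the tensor-valued `state`/`next_state` inputs are assumptions of the example:
```python=
import torch
import torch.nn as nn

def dql_exercise_update(f: nn.Module, optimizer, state, next_state,
                        next_is_terminal: bool, g, gamma: float) -> float:
    """One semi-gradient update of the continuation-value network f(s; w).

    f: network mapping a state tensor to a scalar continuation value
    g: payoff function returning a float payoff for a state
    """
    with torch.no_grad():
        if next_is_terminal:
            # at expiry, only exercising is possible
            target = gamma * g(next_state)
        else:
            # otherwise the target is the larger of exercising and continuing at s'
            target = gamma * max(g(next_state), f(next_state).item())
    prediction = f(state)                        # f(s_i; w)
    loss = 0.5 * (target - prediction) ** 2      # squared TD error; its gradient matches the rule above
    optimizer.zero_grad()
    loss.backward()                              # backpropagation computes grad_w f(s_i; w)
    optimizer.step()
    return loss.item()
```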
## Value Function Geometry: A Deeper Dive into Function Approximation
In reinforcement learning (RL), the combination of bootstrapping (using current value estimates to update themselves), function approximation, and off-policy learning (learning from data generated by a different policy) is often referred to as the **Deadly Triad**. This triad is notorious for causing instability and divergence in the learning process. To understand the underlying reasons and develop more robust algorithms, we can visualize value functions from a geometric perspective. This approach is particularly insightful for linear function approximation.
### Notation and Definitions
Let's establish some key concepts and notation for our discussion:
* **State Space $\mathcal{S}$:** A finite set of states: $\mathcal{S} = \{s_1,s_2, \dots ,s_n\}$
* **Action Space $\mathcal{A}$:** A finite set of actions.
* **Policy $\pi$:** A mapping from states to probability distributions over actions: $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$.
* **Value Function $V^\pi$:** A function that maps each state to the expected return when following policy $\pi$: $V^\pi: \mathcal{S} \to \mathbb{R}$.
* **Feature Functions $\phi_j$:** Functions that extract relevant features from states: $\phi_j: \mathcal{S} \to \mathbb{R}$, where $j = 1, 2, \dots, m$.
* **Feature Vectors:** The vector of feature values for a state $s$: $\phi(s) = (\phi_1(s), \phi_2(s), \dots, \phi_m(s)) \in \mathbb{R}^m$.
* **Linear Function Approximation:** Approximates the value function as a linear combination of features: $$V_w(s) = \phi(s)^T \cdot w = \sum_{j=1}^m \phi_j(s) \cdot w_j$$ where $w = (w_1, w_2, ..., w_m) \in \mathbb{R}^m$ are the weights.
* **Feature Matrix $\Phi$:** An $n \times m$ matrix where each row is the feature vector of a state.
The goal is to find the weights $w$ that produce a linear approximation $V_w$ that best represents the true value function $V^\pi$. We can visualize this as finding the vector $V_w$ that is closest to $V^\pi$ within the subspace spanned by the columns of the feature matrix $\Phi$.
### Bellman Policy Operator and Projection Operator
#### Bellman Policy Operator $B^\pi$
The Bellman policy operator is a core concept in RL. It updates the value function based on the current policy and the environment's dynamics:
$$
B^\pi(V) = R^\pi + \gamma P^\pi \cdot V
$$
where:
* $R^\pi$ is the expected reward vector under policy $\pi$.
* $P^\pi$ is the transition probability matrix under policy $\pi$.
* $\gamma$ is the discount factor.
The true value function $V^\pi$ is a fixed point of this operator:
$$
B^\pi \cdot V^\pi = V^\pi
$$
#### Projection Operator $\Pi_\Phi$
The projection operator projects a value function onto the subspace spanned by the feature vectors. Given a value function $V$, its projection $\Pi_\Phi(V)$ is the closest linear approximation to V within this subspace:
$$
\Pi_\Phi(V) = \mathop{\arg \min}_{V' \in \text{span}(\Phi)} \| V - V' \|_D
$$
where $\|\cdot\|_D$ is a weighted Euclidean norm, weighted by the state distribution under policy $\pi$.
This projection operator can be expressed in matrix form as:
$$
\Pi_\Phi = \Phi \cdot (\Phi^T \cdot D \cdot \Phi)^{-1} \cdot \Phi^T \cdot D
$$
where $D$ is a diagonal matrix with the state distribution on the diagonal.
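To make the two operators concrete, here is a small numpy sketch on a toy 3-state MRP; all numbers (features, rewards, transitions, state distribution) are made up purely for illustration:
```python=
import numpy as np

# Toy 3-state MRP with 2 features per state (illustrative numbers only)
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])                  # n x m feature matrix
R = np.array([1.0, 0.0, 2.0])                 # expected reward vector R^pi
P = np.array([[0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])               # transition matrix P^pi
mu = np.array([0.3, 0.4, 0.3])                # state distribution (diagonal of D)
D = np.diag(mu)
gamma = 0.9

# Bellman policy operator B^pi(V) = R^pi + gamma * P^pi * V
def bellman(v: np.ndarray) -> np.ndarray:
    return R + gamma * P @ v

# True value function: the fixed point of B^pi, i.e. V^pi = (I - gamma P^pi)^{-1} R^pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P, R)

# Projection matrix Pi_Phi = Phi (Phi^T D Phi)^{-1} Phi^T D
proj = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

print("V^pi:                ", v_pi)
print("Projection of V^pi:  ", proj @ v_pi)              # best linear fit, Phi w_pi
print("One projected backup:", proj @ bellman(proj @ v_pi))
```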
### Four Important Vectors
In the subspace defined by our features, we have four key vectors:
1. **Projection $w_\pi$:** The weights that give the best linear approximation of the true value function $V^\pi$:
$$
w_\pi = \mathop{\arg \min}_w \ \|V^\pi - V_w\|_D
$$
2. **Bellman Error Minimizing $w_{BE}$:** The weights that minimize the Bellman error, the difference between the value function and its Bellman update: $$w_{BE} = \mathop{\arg \min}_w \ \|B^\pi \cdot V_w - V_w\|_D$$
3. **Temporal Difference Error Minimizing $w_{TDE}$:** The weights that minimize the expected squared temporal difference error:
$$
w_{TDE} = \mathop{\arg \min}_{w} \ \mathbb{E}_{\pi}[\delta^2]
$$
4. **Projected Bellman Error Minimizing $w_{PBE}$:** The weights that minimize the projected Bellman error, which is the Bellman error projected back onto the feature space: $$w_{PBE} = \mathop{\arg \min}_w \ \|(\Pi_\Phi \cdot B^\pi) \cdot V_w - V_w\|_D$$ This is the target of the semi-gradient TD algorithm.
### Model-Free Learning Algorithms
Various algorithms aim to learn these weights from experience. For example:
* **Residual Gradient Algorithm:** Learns $w_{BE}$ but is often slow to converge:
\begin{split}
\Delta w
&= -\frac{1}{2} \alpha \cdot \nabla_w (\mathbb{E}_\pi [\delta])^2 \\
&= \alpha \cdot \left( \mathbb{E}_\pi [r + \gamma \cdot \phi(s')^T \cdot w - \phi(s)^T \cdot w] \right) \cdot \left( \phi(s) - \gamma \cdot \mathbb{E}_\pi [\phi(s')] \right)
\end{split}
* **Naive Residual Gradient:** Learns $w_{TDE}$, but shares the drawbacks of RG:
\begin{split}
\Delta w
&= -\frac{1}{2} \alpha \cdot \nabla_w (r + \gamma \phi(s')^T \cdot w - \phi(s)^T \cdot w)^2 \\
&= \alpha \cdot (r + \gamma \phi(s')^T \cdot w - \phi(s)^T \cdot w) \cdot (\phi(s) - \gamma \phi(s'))
\end{split}
* **Semi-gradient TD:** A popular algorithm that converges to $w_{PBE}$, providing a practical and robust solution for learning value functions with linear function approximation:
$$
\Delta w = \alpha \cdot (r + \gamma \cdot \phi(s')^T \cdot w - \phi(s)^T \cdot w) \cdot \phi(s)
$$
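For concreteness, here is a minimal numpy sketch of a single semi-gradient TD(0) step with linear function approximation; the feature map `phi` and the transition format are hypothetical:
```python=
import numpy as np

def semi_gradient_td_update(w, phi, state, reward, next_state, gamma, alpha,
                            next_is_terminal=False):
    """One semi-gradient TD(0) step for a linear V(s; w) = phi(s)^T w."""
    features = phi(state)
    v_next = 0.0 if next_is_terminal else phi(next_state) @ w
    td_error = reward + gamma * v_next - features @ w
    # only phi(s) appears, not phi(s) - gamma * phi(s'); that is the "semi" in semi-gradient
    return w + alpha * td_error * features
```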
By understanding the geometric relationships between these vectors, we can design better algorithms that avoid the pitfalls of the Deadly Triad and achieve stable and accurate value function approximation.

## Gradient Temporal-Difference (Gradient TD) Learning
Recall that in on-policy reinforcement learning (RL) with linear function approximation, we can solve for the Projected Bellman Error (PBE)-minimizing weights $w_{PBE}$ using the semi-gradient TD algorithm. However, when dealing with non-linear function approximation or off-policy learning, we need more sophisticated methods like Gradient TD.
Gradient Temporal-Difference (Gradient TD) is a family of algorithms designed to handle these challenging scenarios. They directly optimize the projected Bellman error using gradient descent, making them well-suited for non-linear function approximation and off-policy learning. There are several variants of Gradient TD, including:
* **GTD:** The original Gradient TD algorithm
* **GTD-2:** A second-generation GTD algorithm with improved convergence properties
* **TDC:** TD with Gradient Correction
### TDC Algorithm
Let's focus on the TDC algorithm. The first step is to define a loss function whose gradient will guide our optimization process using stochastic gradient descent.
#### Loss Function
The goal is to minimize the distance between the projected Bellman update and the current value function estimate:
$$
w_{PBE} = \mathop{\arg\min}_w \ \|\Pi_\Phi \cdot B^\pi \cdot V_w - V_w\|_D = \mathop{\arg\min}_w \ \|\Pi_\Phi \cdot B^\pi \cdot V_w - \Pi_\Phi \cdot V_w\|_D
$$
(the two formulations coincide because $V_w$ already lies in the span of $\Phi$, so $\Pi_\Phi \cdot V_w = V_w$).
To achieve this, we define the loss function based on the temporal difference (TD) error:
$$
\delta_w = B^\pi \cdot V_w - V_w
$$
The TDC loss function is then:
$$
\mathcal{L}(w) = (\Pi_\Phi \cdot \delta_w)^T \cdot D \cdot (\Pi_\Phi \cdot \delta_w)
$$
Expanding and simplifying this expression leads to:
$$
\mathcal{L}(w) = (\Phi^T \cdot D \cdot \delta_w)^T \cdot (\Phi^T \cdot D \cdot \Phi)^{-1} \cdot (\Phi^T \cdot D \cdot \delta_w)
$$
#### Gradient of the Loss Function
The gradient of the loss function with respect to the weights $w$ is:
$$
\nabla_w \mathcal{L}(w) = 2 \cdot \left(\nabla_w (\Phi^T \cdot D \cdot \delta_w)^T \cdot (\Phi^T \cdot D \cdot \Phi)^{-1} \cdot (\Phi^T \cdot D \cdot \delta_w)\right)
$$
To make this gradient practically usable, we express it in terms of expectations over transitions $s \xrightarrow{\pi} (r,s')$, where $\delta_w$ is the TD error $r + \gamma \cdot \phi(s')^T \cdot w - \phi(s)^T \cdot w$:
\begin{split}
\Phi^T \cdot D \cdot \delta_w &= \mathbb{E}[\delta \cdot \phi(s)] \\
\nabla_w (\Phi^T \cdot D \cdot \delta_w)^T &= \mathbb{E}[\nabla_w \delta \cdot \phi(s)^T] = \mathbb{E}[(\gamma \cdot \phi(s') - \phi(s)) \cdot \phi(s)^T] \\
\Phi^T \cdot D \cdot \Phi &= \mathbb{E}[\phi(s) \cdot \phi(s)^T]
\end{split}
The gradient can then be estimated using these expectations.
#### Weight Updates
TDC introduces an auxiliary variable $\theta$ to aid in the weight updates. The updates for $w$ and $\theta$ are as follows:
1. **Compute TD error $\delta$:** For a transition $s \xrightarrow{\pi} (r,s')$, compute:
$$
\delta = r + \gamma \cdot \phi(s')^T \cdot w - \phi(s)^T \cdot w
$$
2. **Update auxiliary variable $\theta$:** $$\Delta \theta = \beta \cdot (\delta - \theta^T \cdot \phi(s)) \cdot \phi(s)$$ where $\beta$ is a second step-size parameter.
3. **Update weights $w$:** $$\Delta w = \alpha \cdot \delta \cdot \phi(s) - \alpha \cdot \gamma \cdot \phi(s') \cdot (\theta^T \cdot \phi(s))$$ where $\alpha$ is the learning rate.
The update rule for $\theta$ is essentially a separate linear TD-style learning process whose fixed point is $\mathbb{E}[\phi(s) \cdot \phi(s)^T]^{-1} \cdot \mathbb{E}[\delta \cdot \phi(s)]$, the product of the last two factors in the gradient expression above.
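Putting the three steps together, here is a minimal numpy sketch of one TDC update for a single transition, assuming a linear value function and a hypothetical feature map `phi`:
```python=
import numpy as np

def tdc_update(w, theta, phi, state, reward, next_state, gamma, alpha, beta,
               next_is_terminal=False):
    """One TDC step: returns updated (w, theta) for a linear V(s; w) = phi(s)^T w."""
    features = phi(state)
    next_features = np.zeros_like(features) if next_is_terminal else phi(next_state)
    td_error = reward + gamma * next_features @ w - features @ w
    # the correction term uses the auxiliary estimate theta^T phi(s)
    w_new = w + alpha * (td_error * features - gamma * next_features * (theta @ features))
    theta_new = theta + beta * (td_error - theta @ features) * features
    return w_new, theta_new
```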
### Key Improvements and Insights
* **Efficiency:** The auxiliary variable $\theta$ acts as an estimator of the TD error, improving the convergence speed compared to the original GTD algorithm.
* **Convergence:** TDC is guaranteed to converge to the true TD solution even under off-policy training with linear function approximation.
* **Flexibility:** Gradient TD methods, like TDC, can be extended to non-linear function approximation as well, making them applicable to a wider range of RL problems.
## Reference
- Chapter 13 of the [RLForFinanceBook](https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf)
- [Batch RL, Experience-Replay, DQN, LSPI, Gradient TD](https://github.com/coverdrive/technical-documents/blob/master/finance/cme241/Tour-Batch.pdf) slides for CME 241: Foundations of Reinforcement Learning with Applications in Finance