# Reinforcement Learning for Control
## Introduction
In reinforcement learning (RL), we extend our study of Monte Carlo (MC) and Temporal Difference (TD) algorithms from prediction to control problems. RL addresses both the curse of dimensionality and the curse of modeling by learning function approximations of the value function directly from individual experiences, making it suitable for large-scale, real-world control problems.
## Policy and Q-Value Function
A policy is a function $\pi: \mathcal{N} \times \mathcal{A} \rightarrow [0,1]$, where
$$
\pi(s,a) = \mathbb{P}[A_t=a|S_t=s]
$$
for all time steps $t=0,1,2,\dots$. This definition assumes the policy is:
* **Markovian:** The decision for the current action depends only on the current state.
* **Stationary:** The decision rule doesn't change over time.
A Q-value function for a fixed policy $\pi$ is a function $Q^\pi: \mathcal{N} \times \mathcal{A} \rightarrow \mathbb{R}$ representing the expected return when starting from state $s$, taking action $a$, and thereafter following policy $\pi$:
$$
Q^\pi(s,a)=\mathbb{E}_{\pi,\mathcal{P}_R}[G_t|(S_t=s,A_t=a)]
$$
for all $s \in \mathcal{N}$ and $t=0,1,2,\dots$.
## Generalized Policy Iteration (GPI)
Generalized Policy Iteration (GPI) is a flexible framework in reinforcement learning.
### What is Policy Iteration?
Traditional policy iteration consists of two main steps:
1. **Policy Evaluation:** Evaluate the value of each state under the current policy.
2. **Policy Improvement:** Improve the policy based on the evaluation results.
This process repeats until convergence.
### Concept of GPI
GPI extends this by allowing:
1. **Flexible Policy Evaluation:** Evaluate states partially or in subsets.
2. **Flexible Policy Improvement:** Improve the policy for a subset of states.
This allows for faster progress towards near-optimal solutions.
### Benefits of GPI
* **Higher computational efficiency:** GPI does not require full policy evaluation and improvement at each iteration, reducing computational cost.
* **Greater adaptability to large-scale problems:** The flexibility of GPI allows for handling large state and action spaces more efficiently.


## GPI with Monte Carlo Evaluation
Early in training, Q-value estimates are biased by the initial random actions and by reward noise, so acting purely greedily with respect to them can lock the policy into suboptimal choices. The $\epsilon$-greedy method addresses this by balancing exploration and exploitation.
### Two Components of GPI with Monte Carlo Evaluation
1. **Estimating Q-value:** The Monte Carlo (MC) method is used to estimate the Q-values based on sample returns from complete episodes.
2. **Exploring New Strategies:** The $\epsilon$-greedy policy is used to select actions. It chooses the action with the highest current Q-value most of the time (exploitation) but also explores other actions with probability $\epsilon$.
### $\epsilon$-Greedy Strategy
$$
\pi'(s, a) =
\begin{cases}
\frac{\epsilon}{|A|} + 1 - \epsilon & \text{if } a = \mathop{\arg\max}_{b \in A} Q(s, b) \\
\frac{\epsilon}{|A|} & \text{otherwise}
\end{cases}
$$
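As a quick sanity check of this formula, the short sketch below computes the $\epsilon$-greedy action probabilities for one state; the Q-values and $\epsilon$ are made-up illustrative numbers.
```python=
# A minimal sketch of the epsilon-greedy distribution for one state.
# The Q-values and epsilon below are made-up illustrative numbers.
epsilon = 0.1
Q_s = {"left": 1.0, "right": 2.5, "stay": 0.7}   # Q(s, a) for each action a

greedy_action = max(Q_s, key=Q_s.get)
pi_prime = {
    a: epsilon / len(Q_s) + (1 - epsilon if a == greedy_action else 0.0)
    for a in Q_s
}
print(pi_prime)   # {'left': 0.033..., 'right': 0.933..., 'stay': 0.033...}
```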
### Q-value Update Steps in GPI with Monte Carlo Evaluation
:::success
<font color=blue>Algorithm: GPI with Monte Carlo Evaluation</font>
1. **Initialization:**
* Set `Q(s, a) = 0` for all state-action pairs $(s, a)$
* Set `Count(s, a) = 0` for all state-action pairs $(s, a)$
* Choose exploration parameter `epsilon`
2. **Episode Loop:** (Repeat for multiple episodes)
* Initialize an empty episode list: `episode = []`
* Observe the initial state `S`
3. **Step Loop:** (Repeat until `S` is terminal within the episode)
* Choose action `A` using $\epsilon$-greedy policy based on `Q(S, a)`
* Execute action `A`, receive reward `R`, and observe next state `S'`
* Append `(S, A, R)` to the `episode` list
* Update current state: `S <- S'`
4. **Monte Carlo Update:** (After the episode terminates)
* For each step `(S_t, A_t, R_t)` in the `episode` (starting from the end):
* Calculate return `G` (sum of rewards from time `t` to the end of the episode)
* Increment the count for the state-action pair: `Count(S_t, A_t) <- Count(S_t, A_t) + 1`
* Update Q-value using the incremental average:
```Q(S_t, A_t) <- Q(S_t, A_t) + (1 / Count(S_t, A_t)) * (G - Q(S_t, A_t))```
:::
### Theoretical Justification of $\epsilon$-Greedy Policy Improvement
**Theorem:** For a finite MDP, if a policy $\pi$ satisfies $\pi(s, a) \geq \frac{\epsilon}{|A|}$ for all states $s \in \mathcal{N}$ and actions $a \in A$, then the $\epsilon$-greedy policy $\pi'$ derived from $Q^\pi$ is an improvement over $\pi$, i.e., $V^{\pi'}(s) \geq V^\pi(s)$ for all $s \in \mathcal{N}$.
**Proof (Sketch):**
1. The proof relies on the convergence of repeated applications of the Bellman policy operator $B^{\pi'}$ to $V^{\pi'}$.
2. It suffices to show that the sequence of value functions generated by applying $B^{\pi'}$ is non-decreasing.
3. This is proven by induction, with the base case demonstrating that $B^{\pi'}(V^\pi) \geq V^{\pi}$.
4. The inductive step utilizes the monotonicity property of the Bellman operator $X \geq Y \Rightarrow B^\pi(X) \geq B^\pi(Y)$.
5. The initial policy condition is satisfied by choosing a uniform random policy.
$\Box$
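The key inequality behind the base case (step 3) can be made explicit: for every state $s \in \mathcal{N}$,
$$
\begin{split}
\sum_{a \in A} \pi'(s,a)\, Q^\pi(s,a)
&= \frac{\epsilon}{|A|} \sum_{a \in A} Q^\pi(s,a) + (1-\epsilon) \max_{a \in A} Q^\pi(s,a) \\
&\geq \frac{\epsilon}{|A|} \sum_{a \in A} Q^\pi(s,a) + (1-\epsilon) \sum_{a \in A} \frac{\pi(s,a) - \frac{\epsilon}{|A|}}{1-\epsilon}\, Q^\pi(s,a) \\
&= \sum_{a \in A} \pi(s,a)\, Q^\pi(s,a) = V^\pi(s)
\end{split}
$$
where the inequality holds because the weights $\frac{\pi(s,a) - \epsilon/|A|}{1-\epsilon}$ are non-negative (using $\pi(s,a) \geq \frac{\epsilon}{|A|}$) and sum to one, so their weighted average of Q-values cannot exceed the maximum. The left-hand side equals $B^{\pi'}(V^\pi)(s)$, which gives $B^{\pi'}(V^\pi) \geq V^\pi$.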
## GLIE Monte-Carlo Control
**Greedy in the Limit with Infinite Exploration (GLIE)** defines an algorithm with two key properties:
1. **Infinite Exploration:** All state-action pairs are explored infinitely many times as the number of episodes increases.
2. **Convergence to Greedy:** The probability of the policy selecting the action with the highest estimated Q-value approaches 1 as the number of episodes approaches infinity: $$\lim_{k \to \infty} \pi_k(s, a) = I_{a=\mathop{\arg\max}_{b \in A} Q(s,b)}$$
A simple way to achieve GLIE behavior is to decrease the exploration probability $\epsilon$ over time, for example, by setting $\epsilon_k = \frac{1}{k}$ for the $k$-th episode.
### GLIE Tabular Monte-Carlo Control Algorithm
:::success
<font color=blue>Algorithm: GLIE Tabular Monte-Carlo Control</font>
1. **Initialization:**
* Set `Q(s, a) = 0` for all state-action pairs $(s, a)$
* Set `Count(s, a) = 0` for all state-action pairs $(s, a)$
2. **Episode Loop:** (Repeat for each episode `k`)
* Set exploration rate `epsilon = 1 / k` (decaying epsilon)
* Sample initial state `S` uniformly from all possible states
* Initialize an empty episode list: `episode = []`
3. **Step Loop:** (Repeat until `S` is terminal within the episode)
* Choose action `A` using ε-greedy policy based on Q(S, a) with the current ε
* Execute action `A`, receive reward `R`, and observe next state `S'`
* Append (S, A, R) to the `episode` list
* Update current state: `S <- S'`
4. **Monte Carlo Update:** (After the episode terminates)
* For each step `(S_t, A_t, R_t)` in the `episode` (starting from the end):
* Calculate discounted return `G` (sum of rewards from time `t` to the end, each multiplied by a discount factor `gamma`)
* Increment the count for the state-action pair: `Count(S_t, A_t) <- Count(S_t, A_t) + 1`
* Update Q-value using the incremental average:
```Q(S_t, A_t) <- Q(S_t, A_t) + (1 / Count(S_t, A_t)) * (G - Q(S_t, A_t))```
:::
**Theorem:** Under suitable conditions, GLIE Tabular Monte-Carlo Control converges to the optimal action-value function and an optimal policy: $Q(s, a) \rightarrow Q^*(s, a)$.
### Extension to Function Approximation
GLIE Monte-Carlo Control can be extended to use function approximation to represent the Q-value function. In this case, we update the parameters w of the Q-value function approximation $Q(s,a;w)$ using:
$$
\Delta w = \alpha (G_t - Q(S_t, A_t; w)) \cdot \nabla_w Q(S_t, A_t; w)
$$
where:
* $\alpha$ is the learning rate.
* $G_t$ is the return observed from state $S_t$ after taking action $A_t$.
* $\nabla_w Q(S_t, A_t; w)$ is the gradient of the Q-value function with respect to the parameters $w$.
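To make this update concrete, here is a minimal sketch for a linear approximation $Q(s, a; w) = w^\top \phi(s, a)$; the feature map `phi`, its dimension, and the sampled `(state, action, return)` triples are hypothetical placeholders for illustration, not part of the library code used below.
```python=
import numpy as np

# Minimal sketch of the MC gradient update for a linear Q-value approximation
# Q(s, a; w) = w . phi(s, a). The feature map and the episode data are
# hypothetical placeholders.
def phi(state, action, dim=8):
    rng = np.random.default_rng(abs(hash((state, action))) % (2 ** 32))
    return rng.standard_normal(dim)            # stand-in feature vector

w = np.zeros(8)
alpha = 0.05                                    # learning rate

# One episode's (state, action, return G_t) triples, e.g. from an MC rollout
episode = [(("s0", "a1"), 4.2), (("s1", "a0"), 3.1), (("s2", "a1"), 1.5)]

for (state, action), G in episode:
    features = phi(state, action)
    error = G - w @ features                    # G_t - Q(S_t, A_t; w)
    w += alpha * error * features               # gradient of linear Q is phi(s, a)
```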
### Implementation
#### GLIE MC Control
```python=
from typing import Callable, Iterable, Iterator
from rl.markov_decision_process import MarkovDecisionProcess
from rl.markov_decision_process import epsilon_greedy_policy, TransitionStep
from rl.approximate_dynamic_programming import QValueFunctionApprox
from rl.approximate_dynamic_programming import NTStateDistribution
from rl.policy import Policy
from rl.returns import returns
def glie_mc_control(
mdp: MarkovDecisionProcess[S, A],
states: NTStateDistribution[S],
approx_0: QValueFunctionApprox[S, A],
gamma: float,
epsilon_as_func_of_episodes: Callable[[int], float],
episode_length_tolerance: float = 1e-6
) -> Iterator[QValueFunctionApprox[S, A]]:
q: QValueFunctionApprox[S, A] = approx_0
p: Policy[S, A] = epsilon_greedy_policy(q, mdp, 1.0)
yield q
num_episodes: int = 0
while True:
trace: Iterable[TransitionStep[S, A]] = \
mdp.simulate_actions(states, p)
num_episodes += 1
for step in returns(trace, gamma, episode_length_tolerance):
q = q.update([((step.state, step.action), step.return_)])
p = epsilon_greedy_policy(
q,
mdp,
epsilon_as_func_of_episodes(num_episodes)
)
yield q
```
#### $\epsilon$-Greedy Policy
```python=
def greedy_policy_from_qvf(
q: QValueFunctionApprox[S, A],
actions: Callable[[NonTerminal[S]], Iterable[A]]
) -> DeterministicPolicy[S, A]:
def optimal_action(s: S) -> A:
_, a = q.argmax((NonTerminal(s), a) for a in actions(NonTerminal(s)))
return a
return DeterministicPolicy(optimal_action)
def epsilon_greedy_policy(
q: QValueFunctionApprox[S, A],
mdp: MarkovDecisionProcess[S, A],
epsilon: float = 0.0
) -> Policy[S, A]:
def explore(s: S, mdp=mdp) -> Iterable[A]:
return mdp.actions(NonTerminal(s))
return RandomPolicy(Categorical(
{UniformPolicy(explore): epsilon,
greedy_policy_from_qvf(q, mdp.actions): 1 - epsilon}
))
```
### Example: Simple Inventory
```python=
capacity: int = 2
poisson_lambda: float = 1.0
holding_cost: float = 1.0
stockout_cost: float = 10.0
gamma: float = 0.9
si_mdp: SimpleInventoryMDPCap = SimpleInventoryMDPCap(
capacity=capacity,
poisson_lambda=poisson_lambda,
holding_cost=holding_cost,
stockout_cost=stockout_cost
)
true_opt_vf, true_opt_policy = value_iteration_result(si_mdp, gamma=gamma)
print("True Optimal Value Function")
pprint(true_opt_vf)
print("True Optimal Policy")
print(true_opt_policy)
```
```
True Optimal Value Function
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -43.59563313047815,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -37.97111179441265,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -37.3284904356655,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -38.97111179441265,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -38.3284904356655,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -39.3284904356655}
True Optimal Policy
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 1
For State InventoryState(on_hand=0, on_order=2): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 1
For State InventoryState(on_hand=1, on_order=1): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 0
```
Run GLIE MC Control with the following parameters:
```python=
episode_length_tolerance: float = 1e-5
epsilon_as_func_of_episodes: Callable[[int], float] = lambda k: k ** -0.5
initial_learning_rate: float = 0.1
half_life: float = 10000.0
exponent: float = 1.0
initial_qvf_dict: Mapping[Tuple[NonTerminal[InventoryState], int], float] = {
(s, a): 0. for s in si_mdp.non_terminal_states for a in si_mdp.actions(s)
}
learning_rate_func: Callable[[int], float] = learning_rate_schedule(
initial_learning_rate=initial_learning_rate,
half_life=half_life,
exponent=exponent
)
qvfs: Iterator[QValueFunctionApprox[InventoryState, int]] = glie_mc_control(
mdp=si_mdp,
states=Choose(si_mdp.non_terminal_states),
approx_0=Tabular(
values_map=initial_qvf_dict,
count_to_weight_func=learning_rate_func
),
gamma=gamma,
epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
episode_length_tolerance=episode_length_tolerance
)
num_episodes = 10000
final_qvf: QValueFunctionApprox[InventoryState, int] = \
iterate.last(itertools.islice(qvfs, num_episodes))
def get_vf_and_policy_from_qvf(
mdp: FiniteMarkovDecisionProcess[S, A],
qvf: QValueFunctionApprox[S, A]
) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
opt_vf: V[S] = {
s: max(qvf((s, a)) for a in mdp.actions(s))
for s in mdp.non_terminal_states
}
opt_policy: FiniteDeterministicPolicy[S, A] = \
FiniteDeterministicPolicy({
s.state: qvf.argmax((s, a) for a in mdp.actions(s))[1]
for s in mdp.non_terminal_states
})
return opt_vf, opt_policy
opt_vf, opt_policy = get_vf_and_policy_from_qvf(
mdp=si_mdp,
qvf=final_qvf
)
print(f"GLIE MC Optimal Value Function with {num_episodes:d} episodes")
pprint(opt_vf)
print(f"GLIE MC Optimal Policy with {num_episodes:d} episodes")
print(opt_policy)
```
```
GLIE MC Optimal Value Function with 10000 episodes
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -44.38581299284649,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -38.39628366868024,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -38.181915947627736,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -39.45527081570233,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -38.68329079171965,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -39.137125440510395}
GLIE MC Optimal Policy with 10000 episodes
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 1
For State InventoryState(on_hand=0, on_order=2): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 1
For State InventoryState(on_hand=1, on_order=1): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 0
```
## SARSA (State-Action-Reward-State-Action)
In reinforcement learning control tasks, we can adapt Temporal Difference (TD) methods, similar to how we used them for prediction. The core idea is to replace the Monte Carlo (MC) target (the actual return $G_t$) with a TD target that bootstraps from the estimated Q-value of the next state-action pair:
$$
R_{t+1} + \gamma \cdot Q(S_{t+1}, A_{t+1}; w)
$$
This TD target is biased, but it reduces variance compared to using the full return. This leads to the SARSA update rule:
$$
\Delta w = \alpha \cdot (R_{t+1} + \gamma \cdot Q(S_{t+1}, A_{t+1}; w) - Q(S_t, A_t; w)) \cdot \nabla_w Q(S_t, A_t; w)
$$
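As a minimal illustration, the tabular special case of this update (one weight per state-action pair) can be sketched as follows; the `env` object, its `reset()`/`step(action)` interface, and `ACTIONS` are hypothetical placeholders standing in for a real environment, not part of any library used in this chapter.
```python=
import random
from collections import defaultdict

# Minimal tabular SARSA sketch. `env`, its reset()/step() interface, and
# ACTIONS are hypothetical placeholders.
def sarsa(env, ACTIONS, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> value

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # TD target bootstraps from Q of the *next* (state, action) pair
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```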
### Key Differences from MC Control
* **Online Updates:** SARSA updates the Q-values and the policy after each step within an episode, enabling continuous learning and adaptation. MC control updates only at the end of each episode.
* **Bootstrap Target:** SARSA uses the estimated Q-value of the next state-action pair in its update target, introducing bias but reducing variance compared to MC control's use of the actual return.
* **Incomplete Episodes:** SARSA can handle continuing tasks where episodes don't always terminate, whereas MC control requires complete episodes.
### Convergence of Tabular SARSA
**Theorem:** Tabular SARSA converges to the optimal action-value function $Q^*(s,a)$ and an optimal deterministic policy $\pi^*$ if the following conditions are met:
* **GLIE Schedule of Policies:** The policy gradually becomes greedy (exploiting the current best actions) while still ensuring infinite exploration of all state-action pairs.
* **Robbins-Monro Schedule of Step-Sizes:** The learning rate $\alpha_t$ satisfies:
* $\sum_{t=1}^{\infty} \alpha_t = \infty$ (ensures sufficient learning over time)
* $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$ (ensures the learning rate decreases fast enough for stability)
**In simpler terms:** SARSA can find the best way to act in every situation if it explores enough options and adjusts its learning pace appropriately.
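For instance, the classic step-size schedule $\alpha_t = \frac{1}{t}$ satisfies both Robbins-Monro conditions, whereas a constant step size satisfies only the first:
$$
\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty
$$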
### Example: Simple Inventory Problem
To visualize this, we can run GLIE MC Control and GLIE SARSA on a simple inventory management problem (`SimpleInventoryMDPCap`) and plot the RMSE of their Q-value function estimates as a function of episode batches.
```python=
def compare_mc_sarsa_ql(
fmdp: FiniteMarkovDecisionProcess[S, A],
method_mask: Tuple[bool, bool, bool],
learning_rates: Sequence[Tuple[float, float, float]],
gamma: float,
epsilon_as_func_of_episodes: Callable[[int], float],
q_learning_epsilon: float,
mc_episode_length_tol: float,
num_episodes: int,
plot_batch: int,
plot_start: int
) -> None:
true_vf: V[S] = value_iteration_result(fmdp, gamma)[0]
states: Sequence[NonTerminal[S]] = fmdp.non_terminal_states
colors: Sequence[str] = ['b', 'g', 'r', 'k', 'c', 'm', 'y']
import matplotlib.pyplot as plt
plt.figure(figsize=(11, 7))
if method_mask[0]:
for k, (init_lr, half_life, exponent) in enumerate(learning_rates):
mc_funcs_it: Iterator[QValueFunctionApprox[S, A]] = \
glie_mc_finite_control_learning_rate(
fmdp=fmdp,
initial_learning_rate=init_lr,
half_life=half_life,
exponent=exponent,
gamma=gamma,
epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
episode_length_tolerance=mc_episode_length_tol
)
mc_errors = []
batch_mc_errs = []
for i, mc_qvf in enumerate(
itertools.islice(mc_funcs_it, num_episodes)
):
mc_vf: V[S] = {
s: max(mc_qvf((s, a)) for a in fmdp.actions(s))
for s in states
}
batch_mc_errs.append(sqrt(sum(
(mc_vf[s] - true_vf[s]) ** 2 for s in states
) / len(states)))
if i % plot_batch == plot_batch - 1:
mc_errors.append(sum(batch_mc_errs) / plot_batch)
batch_mc_errs = []
mc_plot = mc_errors[plot_start:]
label = f"MC InitRate={init_lr:.3f},HalfLife" + \
f"={half_life:.0f},Exp={exponent:.1f}"
plt.plot(
range(len(mc_plot)),
mc_plot,
color=colors[k],
linestyle='-',
label=label
)
sample_episodes: int = 1000
uniform_policy: FinitePolicy[S, A] = \
FinitePolicy(
{s.state: Choose(fmdp.actions(s)) for s in states}
)
fmrp: FiniteMarkovRewardProcess[S] = \
fmdp.apply_finite_policy(uniform_policy)
td_episode_length: int = int(round(sum(
len(list(returns(
trace=fmrp.simulate_reward(Choose(states)),
γ=gamma,
tolerance=mc_episode_length_tol
))) for _ in range(sample_episodes)
) / sample_episodes))
if method_mask[1]:
for k, (init_lr, half_life, exponent) in enumerate(learning_rates):
sarsa_funcs_it: Iterator[QValueFunctionApprox[S, A]] = \
glie_sarsa_finite_learning_rate(
fmdp=fmdp,
initial_learning_rate=init_lr,
half_life=half_life,
exponent=exponent,
gamma=gamma,
epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
max_episode_length=td_episode_length,
)
sarsa_errors = []
transitions_batch = plot_batch * td_episode_length
batch_sarsa_errs = []
for i, sarsa_qvf in enumerate(
itertools.islice(
sarsa_funcs_it,
num_episodes * td_episode_length
)
):
sarsa_vf: V[S] = {
s: max(sarsa_qvf((s, a)) for a in fmdp.actions(s))
for s in states
}
batch_sarsa_errs.append(sqrt(sum(
(sarsa_vf[s] - true_vf[s]) ** 2 for s in states
) / len(states)))
if i % transitions_batch == transitions_batch - 1:
sarsa_errors.append(sum(batch_sarsa_errs) /
transitions_batch)
batch_sarsa_errs = []
sarsa_plot = sarsa_errors[plot_start:]
label = f"SARSA InitRate={init_lr:.3f},HalfLife" + \
f"={half_life:.0f},Exp={exponent:.1f}"
plt.plot(
range(len(sarsa_plot)),
sarsa_plot,
color=colors[k],
linestyle='--',
label=label
)
if method_mask[2]:
for k, (init_lr, half_life, exponent) in enumerate(learning_rates):
ql_funcs_it: Iterator[QValueFunctionApprox[S, A]] = \
q_learning_finite_learning_rate(
fmdp=fmdp,
initial_learning_rate=init_lr,
half_life=half_life,
exponent=exponent,
gamma=gamma,
epsilon=q_learning_epsilon,
max_episode_length=td_episode_length,
)
ql_errors = []
transitions_batch = plot_batch * td_episode_length
batch_ql_errs = []
for i, ql_qvf in enumerate(
itertools.islice(
ql_funcs_it,
num_episodes * td_episode_length
)
):
ql_vf: V[S] = {
s: max(ql_qvf((s, a)) for a in fmdp.actions(s))
for s in states
}
batch_ql_errs.append(sqrt(sum(
(ql_vf[s] - true_vf[s]) ** 2 for s in states
) / len(states)))
if i % transitions_batch == transitions_batch - 1:
ql_errors.append(sum(batch_ql_errs) / transitions_batch)
batch_ql_errs = []
ql_plot = ql_errors[plot_start:]
label = f"Q-Learning InitRate={init_lr:.3f},HalfLife" + \
f"={half_life:.0f},Exp={exponent:.1f}"
plt.plot(
range(len(ql_plot)),
ql_plot,
color=colors[k],
linestyle=':',
label=label
)
plt.xlabel("Episode Batches", fontsize=20)
plt.ylabel("Optimal Value Function RMSE", fontsize=20)
plt.title(
"RMSE as function of episode batches",
fontsize=20
)
plt.grid(True)
plt.legend(fontsize=10)
plt.show()
```
```python=
capacity: int = 2
poisson_lambda: float = 1.0
holding_cost: float = 1.0
stockout_cost: float = 10.0
gamma: float = 0.9
si_mdp: SimpleInventoryMDPCap = SimpleInventoryMDPCap(
capacity=capacity,
poisson_lambda=poisson_lambda,
holding_cost=holding_cost,
stockout_cost=stockout_cost
)
# Parameters
method_mask: Tuple[bool, bool, bool] = (True, True, False)
learning_rates: Sequence[Tuple[float, float, float]] = [(0.05, 1000000, 0.5)]
epsilon_as_func_of_episodes = lambda k: k ** -0.5
q_learning_epsilon: float = 0.1
mc_episode_length_tol: float = 1e-5
num_episodes: int = 500
plot_batch: int = 10
plot_start: int = 0
# call the compare function
compare_mc_sarsa_ql(
fmdp=si_mdp,
method_mask=method_mask,
learning_rates=learning_rates,
gamma=gamma,
epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
q_learning_epsilon=q_learning_epsilon,
mc_episode_length_tol=mc_episode_length_tol,
num_episodes=num_episodes,
plot_batch=plot_batch,
plot_start=plot_start
)
```

### Key observations
* **Higher Variance in MC Control:** The MC Control RMSE curve is noticeably choppier, indicating higher variance in its estimates compared to SARSA.
* **Initial Speed vs. Later Convergence:** MC Control makes rapid progress in the early episodes but then slows down relative to SARSA's steadier convergence, so SARSA typically reaches a lower RMSE sooner.
* **Typical Behavior:** The superior performance of GLIE SARSA over GLIE MC Control with constant learning rates is a common observation in many Markov Decision Process (MDP) control problems.
### Interpretation and Choice of Method
These observations underscore the inherent trade-offs between MC and TD control methods:
* **MC Control:**
* **Advantage:** Provides unbiased estimates by relying on complete episode returns.
* **Disadvantage:** Suffers from higher variance, potentially leading to slower and less stable convergence, especially with constant learning rates.
* **SARSA:**
* **Advantage:** Demonstrates lower variance, often leading to faster and more stable convergence, particularly with constant learning rates.
* **Disadvantage:** Introduces bias into estimates due to bootstrapping (using current estimates to update other estimates).
Choosing between MC Control and SARSA depends on your specific problem and priorities:
* **Prioritize Unbiased Estimates and Have Ample Computational Resources:** MC Control might be preferred, even if it means potentially slower convergence.
* **Prioritize Faster Learning and Stability:** SARSA is often a better choice, especially when dealing with limited data or when computational efficiency is a concern. It's important to note that the bias introduced by SARSA might be acceptable in many practical scenarios.
## SARSA($\lambda$)
SARSA($\lambda$) is a reinforcement learning algorithm that combines the strengths of SARSA and TD($\lambda$) methods, allowing for greater flexibility in controlling the degree of bootstrapping during the learning process.
* $\lambda=0$: SARSA($\lambda$) reverts to the classical (one-step) SARSA algorithm.
* $\lambda=1$: SARSA($\lambda$) behaves similarly to Monte Carlo (MC) control.
* $0 < \lambda < 1$: SARSA($\lambda$) interpolates between the two, leveraging the advantages of both SARSA and MC control.
### Why Use SARSA($\lambda$)?
SARSA($\lambda$) offers several key benefits:
* **Flexibility:** The $\lambda$ parameter lets you tune the learning behavior for your specific problem. Higher $\lambda$ values place more weight on longer, multi-step returns (closer to MC control), while lower values rely more on bootstrapped one-step estimates (closer to classical SARSA).
* **Bias-Variance Trade-off:** Adjusting $\lambda$ enables you to manage the balance between bias (from bootstrapping) and variance (from relying more on actual returns).
* **Efficient Credit Assignment:** By utilizing eligibility traces, SARSA($\lambda$) can efficiently propagate reward signals back to previously visited state-action pairs, leading to faster and more effective learning.
### How Does SARSA($\lambda$) Work?
The central idea is to update the Q-value function using a $\lambda$-return, a weighted average of n-step returns.
#### Key Components
* **n-step Bootstrapped Return:** This represents the cumulative discounted reward obtained over the next $n$ steps, followed by the estimated Q-value of the resulting state-action pair. It is calculated as: $$G_{t,n} = \sum_{i=t+1}^{t+n} \gamma^{i-t-1} \cdot R_i + \gamma^n \cdot Q(S_{t+n}, A_{t+n}; w)$$
* **$\lambda$-Return:** This is a weighted average of $n$-step returns, where the weights decrease exponentially with $n$, controlled by the $\lambda$ parameter: $$G^{(\lambda)}_t = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} \cdot G_{t,n} + \lambda^{T-t-1} \cdot G_t$$
* **Eligibility Traces:** These maintain a record of how often and how recently a state-action pair has been visited. They allow SARSA($\lambda$) to update Q-values for multiple state-action pairs based on a single experience, improving the efficiency of learning.
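Plugging the limiting values of $\lambda$ into the $\lambda$-return above recovers the two special cases listed at the start of this section:
$$
\lambda = 0: \; G^{(0)}_t = G_{t,1} = R_{t+1} + \gamma \cdot Q(S_{t+1}, A_{t+1}; w),
\qquad
\lambda \to 1: \; G^{(1)}_t = G_t
$$
i.e., the one-step SARSA target and the full Monte Carlo return, respectively.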
### Online SARSA($\lambda$) Algorithm (with Eligibility Traces)
:::success
<font color=blue>Algorithm: Online SARSA($\lambda$)</font>
1. **Initialization:**
* Set `Q(s, a) = 0` for all state-action pairs $(s, a)$
* Set `E(s, a) = 0` for all state-action pairs $(s, a)$ (Eligibility traces)
* Choose parameters:
* `alpha` (learning rate)
* `gamma` (discount factor)
* `lambda` (trace decay parameter)
* `epsilon` (exploration rate)
2. **Episode Loop:** (Repeat for each episode)
* Observe the initial state `S`
* Choose action `A` using ε-greedy policy based on `Q(S, a)`
3. **Step Loop:** (Repeat until `S` is terminal within the episode)
* Execute action `A`, receive reward `R`, and observe next state `S'`
* Choose next action `A'` using ε-greedy policy based on `Q(S', a)`
* Calculate TD error: `delta = R + gamma * Q(S', A') - Q(S, A)`
* Update eligibility trace for current state-action pair: `E(S, A) = E(S, A) + 1`
4. **Eligibility Trace and Q-Value Update:**
* For all state-action pairs $(s, a)$:
* Update Q-value: `Q(s, a) = Q(s, a) + alpha * delta * E(s, a)`
* Decay eligibility trace: `E(s, a) = gamma * lambda * E(s, a)`
5. **Transition to Next Step:**
* Set `S = S'` (move to the next state)
* Set `A = A'` (update current action)
:::
### Example: Maze Environment
Let's illustrate SARSA($\lambda$) with a concrete example of an agent navigating a maze.
#### Maze Environment Setup
Consider a simple 4x5 maze:
```
S 0 0 0 G
0 # 0 # 0
0 # 0 # 0
0 0 0 0 0
```
* **S:** Starting position
* **G:** Goal position
* **#:** Obstacles (walls)
* **0:** Open spaces
#### Environment
* **State $S$:** The agent's current position in the maze.
* **Action $A$:** Four possible movements: up, down, left, right.
* **Reward $R$:**
* -1 for each step taken.
* -10 for bumping into a wall.
* +100 for reaching the goal.
#### Goal
The agent's goal is to learn a policy that maximizes the total reward, essentially finding the shortest path from the start to the goal.
#### SARSA($\lambda$) Algorithm Implementation
**Parameters**
- **Learning rate $\alpha$**: 0.1
- **Discount factor $\gamma$**: 0.9
- **Eligibility trace decay parameter $\lambda$**: 0.8
- **Exploration rate $\epsilon$**: 0.1
**Actions**
- **Up**: (-1, 0)
- **Down**: (1, 0)
- **Left**: (0, -1)
- **Right**: (0, 1)
#### Code Implementation
```python=
import numpy as np
# Maze environment
maze = np.array([
['S', '0', '0', '0', 'G'],
['0', '#', '0', '#', '0'],
['0', '#', '0', '#', '0'],
['0', '0', '0', '0', '0']
])
# Initialize Q-values and eligibility traces
Q = np.zeros((4, 5, 4)) # 4x5 maze and 4 actions (up, down, left, right)
E = np.zeros_like(Q)
# Parameters
alpha = 0.1 # Learning rate
gamma = 0.9 # Discount factor
lambda_ = 0.8 # Eligibility trace decay parameter
epsilon = 0.1 # Exploration rate
# Actions corresponding to (up, down, left, right)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
# Function to choose an action using ε-greedy policy
def choose_action(state):
if np.random.rand() < epsilon:
return np.random.choice(4) # Random action
else:
return np.argmax(Q[state]) # Action with highest Q-value
# Function to check if a state is valid
def is_valid_state(state):
return (0 <= state[0] < 4) and (0 <= state[1] < 5) and (maze[state] != '#')
# SARSA(λ) main algorithm
policy_output = []
for episode in range(1000): # Run for 1000 episodes
state = (0, 0) # Starting state
action = choose_action(state)
E.fill(0) # Reset eligibility traces
for t in range(100): # Maximum of 100 steps per episode
# Execute action, get reward and next state
next_state = (state[0] + actions[action][0], state[1] + actions[action][1])
if not is_valid_state(next_state):
next_state = state # Stay in place
reward = -10 # Penalty for hitting an obstacle
elif maze[next_state] == 'G':
reward = 100 # Reward for reaching the goal
else:
reward = -1 # Small penalty for each step
next_action = choose_action(next_state)
# Calculate TD error
delta_t = reward + gamma * Q[next_state + (next_action,)] - Q[state + (action,)]
# Update eligibility traces
E[state + (action,)] += 1
# Update all Q-values
Q += alpha * delta_t * E
# Decay eligibility traces
E *= gamma * lambda_
# Update current state and action
state = next_state
action = next_action
# Termination condition
if maze[state] == 'G':
break
# Output current policy
if (episode + 1) % 100 == 0: # Output policy every 100 episodes
policy = []
for i in range(4):
row = ""
for j in range(5):
if maze[i, j] == '#':
row += " # "
elif maze[i, j] == 'G':
row += " G "
else:
action = np.argmax(Q[i, j])
if action == 0:
row += " ↑ "
elif action == 1:
row += " ↓ "
elif action == 2:
row += " ← "
elif action == 3:
row += " → "
policy.append(row)
policy_output.append((episode + 1, policy))
# Final Q-value function
final_Q = Q
policy_output, final_Q  # notebook-style expression to display the policies and Q-values
```
```
#### Policy after 100 episodes
→ → → → G
↑ # ← # ↑
↓ # ↑ # ↑
→ ↑ ↑ ↑ ↑
#### Policy after 200 episodes
→ → → → G
↑ # ↑ # ↑
↓ # ↑ # ↑
→ ↑ ↑ ↑ ↑
#### Policy after 300 episodes
→ → → → G
↑ # ↑ # ↑
↓ # ↑ # ↑
→ ↑ ↑ ↑ ↑
#### Policy after 400 episodes
→ → → → G
↑ # ↑ # ↑
↓ # ↑ # ↑
→ ↑ ↑ ↑ ↑
#### Policy after 500 episodes
→ → → → G
↑ # ↑ # ↑
↓ # ↑ # ↑
→ ↑ ↑ ↑ ↑
#### Policy after 600 episodes
→ → → → G
↑ # ↑ # ↑
↑ # ↑ # ↑
← → ↑ ↑ ↑
#### Policy after 700 episodes
→ → → → G
↑ # ↑ # ↑
↑ # ↑ # ↑
← → ↑ ↑ ↑
#### Policy after 800 episodes
→ → → → G
↑ # ↑ # ↑
↑ # ↑ # ↑
← → ↑ ↑ ↑
#### Policy after 900 episodes
→ → → → G
↑ # ↑ # ↑
↑ # ↑ # ↑
← → ↑ ↑ ↑
#### Policy after 1000 episodes
→ → → → G
↑ # ↑ # ↑
↑ # ↑ # ↑
← → ↑ ↑ ↑
```
```
array([[[ 44.12549652, 36.06322046, 37.20414908, 67.35616728],
[ 50.00529531, 53.46710381, 57.25914356, 76.34848328],
[ 65.40104996, 58.61159137, 58.79292819, 88.18028235],
[ 73.19972103, 70.51073643, 66.12133247, 100. ],
[ 0. , 0. , 0. , 0. ]],
[[ 49.43385947, -0.3129012 , 0.27593241, -6.60188385],
[ 0. , 0. , 0. , 0. ],
[ 70.95640518, 5.33740981, 12.84874052, 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]],
[[ 1.47237719, -3.83422569, -6.01519792, -7.39609924],
[ 0. , 0. , 0. , 0. ],
[ 13.42083292, 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]],
[[ -2.84541233, -4.49533056, -1.57245685, -2.063482 ],
[ -2.4575131 , 1.56343923, -1.00283656, 3.31996062],
[ 4.7499453 , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]]])
```
## Off-Policy Control
In reinforcement learning, all control algorithms must balance the **exploration-exploitation trade-off:** the need to explore all possible actions versus the need to exploit the actions that appear most rewarding based on the current estimate of the Q-value function. While the ε-greedy policy is a simple way to address this, off-policy control offers a more sophisticated approach by employing two distinct policies.
### Two Policies
* **Target Policy $\pi$:** The policy we aim to learn and optimize. Ideally, this converges to the optimal policy that maximizes expected rewards.
* **Behavior Policy $\mu$:** An exploratory policy used to generate diverse experiences for learning. It might take suboptimal actions in the short term to discover better long-term strategies.
### On-Policy vs. Off-Policy
#### On-Policy
Algorithms like SARSA are classified as **on-policy** because the policy used to select actions (the **behavior policy**) is the same policy that is being evaluated and improved (the **target policy**). The agent learns about the value of the policy it is currently following.
#### Off-Policy
**Off-policy** algorithms decouple the behavior and target policies. The agent learns about the value of the target policy while following a different behavior policy. This separation enables more powerful and flexible learning strategies.
### Advantages of Off-Policy Control
* **Learning from Multiple Sources:** Off-policy methods can learn from demonstrations, historical data, or even other agents, as long as the experiences are somewhat relevant to the target policy.
* **Reusing Experience:** Data collected with an old or suboptimal behavior policy can still be valuable for improving the target policy.
* **Parallel Learning:** Multiple agents can gather data simultaneously using different behavior policies, contributing to the learning of a single, shared target policy.
### Q-Learning: A Key Off-Policy Algorithm
Q-Learning is a prime example of off-policy control. It leverages the behavior policy $\mu$ to gather experiences, but updates the Q-values of the target policy $\pi$ using the maximum possible Q-value of the next state, regardless of the action chosen by the behavior policy. This allows Q-Learning to learn the optimal policy even if the behavior policy is suboptimal.
#### Practical Example
Consider training a robotic arm to grasp objects:
* **Behavior Policy $\mu$:** The arm initially moves randomly, exploring the full range of its movements and potential grasp positions.
* **Target Policy $\pi$:** The arm aims to learn the optimal policy for reliably grasping objects, minimizing errors and maximizing success rates. By learning from the exploratory behavior of the behavior policy, it can refine its grasping strategy.
#### How It Works
Q-Learning learns the optimal action-value function, $Q^*(s, a)$, which gives the expected return for taking action $a$ in state $s$ and following the optimal policy thereafter. The algorithm iteratively updates the Q-values based on the experiences gathered by the behavior policy.
#### Algorithm
:::success
<font color=blue>Algorithm: Off-Policy Q-Learning</font>
1. **Initialization:**
* Set `Q(s, a) = 0` for all state-action pairs $(s, a)$
* Choose parameters:
* `alpha` (learning rate)
* `gamma` (discount factor)
* `epsilon` (exploration rate)
2. **Episode Loop:** (Repeat for each episode)
* Observe the initial state `S`
3. **Step Loop:** (Repeat until `S` is terminal within the episode)
* Choose action `A` using $\epsilon$-greedy policy based on Q(S, a)
* Execute action `A`, receive reward `R`, and observe next state `S'`
* Update Q-value using the following formula:
```Q(S, A) = Q(S, A) + alpha * (R + gamma * max_{a'} Q(S', a') - Q(S, A))```
* Set `S = S'` (move to the next state)
:::
#### Implementation
```python=
PolicyFromQType = Callable[
[QValueFunctionApprox[S, A], MarkovDecisionProcess[S, A]],
Policy[S, A]
]
def q_learning(
mdp: MarkovDecisionProcess[S, A],
policy_from_q: PolicyFromQType,
states: NTStateDistribution[S],
approx_0: QValueFunctionApprox[S, A],
gamma: float,
max_episode_length: int
) -> Iterator[QValueFunctionApprox[S, A]]:
q: QValueFunctionApprox[S, A] = approx_0
yield q
while True:
state: NonTerminal[S] = states.sample()
steps: int = 0
while isinstance(state, NonTerminal) and steps < max_episode_length:
policy: Policy[S, A] = policy_from_q(q, mdp)
action: A = policy.act(state).sample()
next_state, reward = mdp.step(state, action).sample()
next_return: float = max(
q((next_state, a))
for a in mdp.actions(next_state)
) if isinstance(next_state, NonTerminal) else 0.
q = q.update([((state, action), reward + gamma * next_return)])
yield q
steps += 1
state = next_state
```
### Example: Windy Grid Problem
The Windy Grid problem is a classic reinforcement learning environment where an agent navigates a gridworld with varying wind strengths in each column. The agent's objective is to reach a goal state while minimizing the total cost incurred, which includes the number of steps taken and penalties for bumping into obstacles or being pushed out of bounds by the wind.
#### Modeling the Problem
* **States:** Each cell in the grid represents a state.
* **Actions:** The agent can move up, down, left, or right.
* **Rewards:**
* -1 for each step.
* Additional penalty for bumping into obstacles or being pushed out of bounds by the wind.
* Large positive reward for reaching the goal state.
#### Solving with Q-Learning and SARSA
Both Q-Learning and SARSA can be used to solve the Windy Grid problem, but their exploration strategies and learning approaches differ:
* **SARSA (On-Policy):** SARSA learns the value of the policy it is currently following. It tends to be more conservative, avoiding actions that might lead to high penalties during learning. This makes it suitable for real-world scenarios where mistakes are costly.
* **Q-Learning (Off-Policy):** Q-Learning directly learns the optimal policy, regardless of the agent's current behavior. It is more aggressive, exploring actions that could lead to penalties to learn faster. This makes it suitable for simulated environments where the cost of mistakes is low.
#### Implementing
```python=
from typing import Tuple, Callable, Sequence, Set, Mapping, Dict, Iterator
from control_utils import glie_sarsa_finite_learning_rate
from control_utils import q_learning_finite_learning_rate
from control_utils import get_vf_and_policy_from_qvf
from dataclasses import dataclass
from distribution import Categorical
from markov_decision_process import FiniteMarkovDecisionProcess
from policy import FiniteDeterministicPolicy
from approximate_dynamic_programming import QValueFunctionApprox
from dynamic_programming import value_iteration_result, V
import itertools
import iterate as iterate
'''
Cell specifies (row, column) coordinate
'''
Cell = Tuple[int, int]
CellSet = Set[Cell]
Move = Tuple[int, int]
'''
WindSpec specifies a random vertical wind for each column.
Each random vertical wind is specified by a (p1, p2) pair
where p1 specifies probability of Downward Wind (could take you
one step lower in row coordinate unless prevented by a block or
boundary) and p2 specifies probability of Upward Wind (could take
you one step higher in row coordinate unless prevented by a
block or boundary). If one bumps against a block or boundary, one
incurs a bump cost and doesn't move. The remaining probability
1 - p1 - p2 corresponds to No Wind.
'''
WindSpec = Sequence[Tuple[float, float]]
possible_moves: Mapping[Move, str] = {
(-1, 0): 'D',
(1, 0): 'U',
(0, -1): 'L',
(0, 1): 'R'
}
@dataclass(frozen=True)
class WindyGrid:
rows: int # number of grid rows
columns: int # number of grid columns
blocks: CellSet # coordinates of block cells
terminals: CellSet # coordinates of goal cells
wind: WindSpec # spec of vertical random wind for the columns
bump_cost: float # cost of bumping against block or boundary
def validate_spec(self) -> bool:
b1 = self.rows >= 2
b2 = self.columns >= 2
b3 = all(0 <= r < self.rows and 0 <= c < self.columns
for r, c in self.blocks)
b4 = len(self.terminals) >= 1
b5 = all(0 <= r < self.rows and 0 <= c < self.columns and
(r, c) not in self.blocks for r, c in self.terminals)
b6 = len(self.wind) == self.columns
b7 = all(0. <= p1 <= 1. and 0. <= p2 <= 1. and p1 + p2 <= 1.
for p1, p2 in self.wind)
b8 = self.bump_cost > 0.
return all([b1, b2, b3, b4, b5, b6, b7, b8])
def print_wind_and_bumps(self) -> None:
for i, (d, u) in enumerate(self.wind):
print(f"Column {i:d}: Down Prob = {d:.2f}, Up Prob = {u:.2f}")
print(f"Bump Cost = {self.bump_cost:.2f}")
print()
@staticmethod
def add_move_to_cell(cell: Cell, move: Move) -> Cell:
return cell[0] + move[0], cell[1] + move[1]
def is_valid_state(self, cell: Cell) -> bool:
'''
checks if a cell is a valid state of the MDP
'''
return 0 <= cell[0] < self.rows and 0 <= cell[1] < self.columns \
and cell not in self.blocks
def get_all_nt_states(self) -> CellSet:
'''
returns all the non-terminal states
'''
return {(i, j) for i in range(self.rows) for j in range(self.columns)
if (i, j) not in set.union(self.blocks, self.terminals)}
def get_actions_and_next_states(self, nt_state: Cell) \
-> Set[Tuple[Move, Cell]]:
'''
given a non-terminal state, returns the set of all possible
(action, next_state) pairs
'''
temp: Set[Tuple[Move, Cell]] = {(a, WindyGrid.add_move_to_cell(
nt_state,
a
)) for a in possible_moves}
return {(a, s) for a, s in temp if self.is_valid_state(s)}
def get_transition_probabilities(self, nt_state: Cell) \
-> Mapping[Move, Categorical[Tuple[Cell, float]]]:
'''
given a non-terminal state, return a dictionary whose
keys are the valid actions (moves) from the given state
and the corresponding values are the associated probabilities
(following that move) of the (next_state, reward) pairs.
The probabilities are determined from the wind probabilities
of the column one is in after the move. Note that if one moves
to a goal cell (terminal state), then one ends up in that
goal cell with 100% probability (i.e., no wind exposure in a
goal cell).
'''
d: Dict[Move, Categorical[Tuple[Cell, float]]] = {}
for a, (r, c) in self.get_actions_and_next_states(nt_state):
if (r, c) in self.terminals:
d[a] = Categorical({((r, c), -1.): 1.})
else:
down_prob, up_prob = self.wind[c]
stay_prob: float = 1. - down_prob - up_prob
d1: Dict[Tuple[Cell, float], float] = \
{((r, c), -1.): stay_prob}
if self.is_valid_state((r - 1, c)):
d1[((r - 1, c), -1.)] = down_prob
if self.is_valid_state((r + 1, c)):
d1[((r + 1, c), -1.)] = up_prob
d1[((r, c), -1. - self.bump_cost)] = \
down_prob * (1 - self.is_valid_state((r - 1, c))) + \
up_prob * (1 - self.is_valid_state((r + 1, c)))
d[a] = Categorical(d1)
return d
def get_finite_mdp(self) -> FiniteMarkovDecisionProcess[Cell, Move]:
'''
returns the FiniteMarkovDecision object for this windy grid problem
'''
return FiniteMarkovDecisionProcess(
{s: self.get_transition_probabilities(s) for s in
self.get_all_nt_states()}
)
def get_vi_vf_and_policy(self) -> \
Tuple[V[Cell], FiniteDeterministicPolicy[Cell, Move]]:
'''
Performs the Value Iteration DP algorithm returning the
Optimal Value Function (as a V[Cell]) and the Optimal Policy
(as a FiniteDeterministicPolicy[Cell, Move])
'''
return value_iteration_result(self.get_finite_mdp(), gamma=1.)
def get_glie_sarsa_vf_and_policy(
self,
epsilon_as_func_of_episodes: Callable[[int], float],
learning_rate: float,
num_updates: int
) -> Tuple[V[Cell], FiniteDeterministicPolicy[Cell, Move]]:
qvfs: Iterator[QValueFunctionApprox[Cell, Move]] = \
glie_sarsa_finite_learning_rate(
fmdp=self.get_finite_mdp(),
initial_learning_rate=learning_rate,
half_life=1e8,
exponent=1.0,
gamma=1.0,
epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
max_episode_length=int(1e8)
)
final_qvf: QValueFunctionApprox[Cell, Move] = \
iterate.last(itertools.islice(qvfs, num_updates))
return get_vf_and_policy_from_qvf(
mdp=self.get_finite_mdp(),
qvf=final_qvf
)
def get_q_learning_vf_and_policy(
self,
epsilon: float,
learning_rate: float,
num_updates: int
) -> Tuple[V[Cell], FiniteDeterministicPolicy[Cell, Move]]:
qvfs: Iterator[QValueFunctionApprox[Cell, Move]] = \
q_learning_finite_learning_rate(
fmdp=self.get_finite_mdp(),
initial_learning_rate=learning_rate,
half_life=1e8,
exponent=1.0,
gamma=1.0,
epsilon=epsilon,
max_episode_length=int(1e8)
)
final_qvf: QValueFunctionApprox[Cell, Move] = \
iterate.last(itertools.islice(qvfs, num_updates))
return get_vf_and_policy_from_qvf(
mdp=self.get_finite_mdp(),
qvf=final_qvf
)
def print_vf_and_policy(
self,
vf_dict: V[Cell],
policy: FiniteDeterministicPolicy[Cell, Move]
) -> None:
display = "%5.2f"
display1 = "%5d"
vf_full_dict = {
**{s.state: display % -v for s, v in vf_dict.items()},
**{s: display % 0.0 for s in self.terminals},
**{s: 'X' * 5 for s in self.blocks}
}
print(" " + " ".join([display1 % j for j in range(self.columns)]))
for i in range(self.rows - 1, -1, -1):
print("%2d " % i + " ".join(vf_full_dict[(i, j)]
for j in range(self.columns)))
print()
pol_full_dict = {
**{s: possible_moves[policy.action_for[s]]
for s in self.get_all_nt_states()},
**{s: 'T' for s in self.terminals},
**{s: 'X' for s in self.blocks}
}
print(" " + " ".join(["%2d" % j for j in range(self.columns)]))
for i in range(self.rows - 1, -1, -1):
print("%2d " % i + " ".join(pol_full_dict[(i, j)]
for j in range(self.columns)))
print()
if __name__ == '__main__':
wg = WindyGrid(
rows=5,
columns=5,
blocks={(0, 1), (0, 2), (0, 4), (2, 3), (3, 0), (4, 0)},
terminals={(3, 4)},
wind=[(0., 0.9), (0.0, 0.8), (0.7, 0.0), (0.8, 0.0), (0.9, 0.0)],
bump_cost=4.0
)
valid = wg.validate_spec()
if valid:
wg.print_wind_and_bumps()
vi_vf_dict, vi_policy = wg.get_vi_vf_and_policy()
print("Value Iteration\n")
wg.print_vf_and_policy(
vf_dict=vi_vf_dict,
policy=vi_policy
)
epsilon_as_func_of_episodes: Callable[[int], float] = lambda k: 1. / k
learning_rate: float = 0.03
num_updates: int = 100000
sarsa_vf_dict, sarsa_policy = wg.get_glie_sarsa_vf_and_policy(
epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
learning_rate=learning_rate,
num_updates=num_updates
)
print("SARSA\n")
wg.print_vf_and_policy(
vf_dict=sarsa_vf_dict,
policy=sarsa_policy
)
epsilon: float = 0.2
ql_vf_dict, ql_policy = wg.get_q_learning_vf_and_policy(
epsilon=epsilon,
learning_rate=learning_rate,
num_updates=num_updates
)
print("Q-Learning\n")
wg.print_vf_and_policy(
vf_dict=ql_vf_dict,
policy=ql_policy
)
else:
print("Invalid Spec of Windy Grid")
```
```
Column 0: Down Prob = 0.00, Up Prob = 0.90
Column 1: Down Prob = 0.00, Up Prob = 0.80
Column 2: Down Prob = 0.70, Up Prob = 0.00
Column 3: Down Prob = 0.80, Up Prob = 0.00
Column 4: Down Prob = 0.90, Up Prob = 0.00
Bump Cost = 4.00
Value Iteration
0 1 2 3 4
4 XXXXX 5.25 2.02 1.10 1.00
3 XXXXX 8.53 5.20 1.00 0.00
2 9.21 6.90 8.53 XXXXX 1.00
1 8.36 9.21 8.36 12.16 11.00
0 10.12 XXXXX XXXXX 17.16 XXXXX
0 1 2 3 4
4 X R R R D
3 X R R R T
2 R U U X U
1 R U L L U
0 U X X U X
SARSA
0 1 2 3 4
4 XXXXX 5.05 2.02 1.08 1.00
3 XXXXX 8.41 5.22 1.00 0.00
2 9.32 6.68 8.37 XXXXX 1.00
1 8.39 9.15 8.50 12.39 10.29
0 10.08 XXXXX XXXXX 17.50 XXXXX
0 1 2 3 4
4 X R R R D
3 X R R R T
2 R U U X U
1 R U L L U
0 U X X U X
Q-Learning
0 1 2 3 4
4 XXXXX 5.26 2.01 1.04 1.00
3 XXXXX 8.49 4.78 1.00 0.00
2 9.37 6.85 8.23 XXXXX 1.00
1 8.46 9.19 8.49 12.19 10.02
0 10.26 XXXXX XXXXX 17.27 XXXXX
0 1 2 3 4
4 X R R R D
3 X R R R T
2 R U U X U
1 R U L L U
0 U X X U X
```
#### Comparison and Choice
The choice between SARSA and Q-Learning depends on the trade-off between risk and learning speed:
* **Prioritize Safety and Minimize Risk (Real-World):** Choose SARSA when the cost of making errors during training is significant, such as in real-world applications where safety is paramount.
* **Prioritize Fast Learning (Simulation):** Choose Q-Learning when experimenting in a simulated environment where mistakes are less costly and you want the agent to learn quickly.
#### Experiments and Results
Experiments with varying bump costs have revealed:
* **Higher Bump Costs:** Q-Learning often learns significantly faster than SARSA, particularly when the penalty for bumping is high. This is due to Q-Learning's persistent exploration and its focus on the optimal policy's value, even if the current behavior is suboptimal.
* **Lower Bump Costs:** The difference in learning speed between SARSA and Q-Learning becomes less pronounced as the bump cost decreases.
### Importance Sampling
Importance sampling is a technique used in off-policy reinforcement learning to estimate properties of a target probability distribution (e.g., the value function of a target policy) using samples generated from a different distribution (e.g., the behavior policy).
### Mathematical Intuition
The core idea behind importance sampling is to reweight samples from one distribution to make them representative of another. Given a function $f(X)$ and two probability distributions:
* $P$ (target distribution)
* $Q$ (sampling distribution)
The expected value of $f(X)$ under $P$ can be estimated using samples from $Q$:
$$
\begin{split}
\mathbb{E}_{X \sim P}[f(X)]
&= \sum_X P(X)\, f(X) \\
&= \sum_X Q(X)\cdot \frac{P(X)}{Q(X)}\cdot f(X) \\
&= \mathbb{E}_{X \sim Q}\left[\frac{P(X)}{Q(X)} \cdot f(X)\right]
\end{split}
$$
Here, $\frac{P(X)}{Q(X)}$ is the **importance sampling ratio**, which adjusts the weight of each sample from $Q$ to reflect its relative importance under $P$.
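A small numerical sketch of this identity follows; the distributions $P$, $Q$, and the function $f$ are arbitrary made-up choices.
```python=
import numpy as np

# Numerical sketch of importance sampling: estimate E_P[f(X)] using samples
# drawn from Q. The distributions P, Q and the function f are made-up examples.
rng = np.random.default_rng(0)

x_vals = np.array([0, 1, 2])
P = np.array([0.1, 0.3, 0.6])      # target distribution
Q = np.array([0.4, 0.4, 0.2])      # sampling (behavior) distribution
f = lambda x: x ** 2

samples = rng.choice(x_vals, size=100_000, p=Q)
ratios = P[samples] / Q[samples]                  # importance sampling ratios
is_estimate = np.mean(ratios * f(samples))

exact = np.sum(P * f(x_vals))                     # 0.1*0 + 0.3*1 + 0.6*4 = 2.7
print(is_estimate, exact)                         # the estimate is close to 2.7
```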
#### Application to Off-Policy Monte Carlo Prediction
In off-policy Monte Carlo (MC) prediction, we aim to estimate the value function $V^\pi(s)$ of a target policy $\pi$ using returns generated by a different behavior policy $\mu$. To achieve this, we weight each return $G_t$ by the importance sampling ratio accumulated over the remainder of the episode:
$$
\rho_t = \prod_{k=t}^{T-1} \frac{\pi(S_k, A_k)}{\mu(S_k, A_k)}
$$
where:
* $\rho_t$ is the importance sampling ratio at time step $t$
* $T$ is the terminal time step of the episode.
The update rule for the value function then becomes:
$$
\Delta w = \alpha \cdot \rho_t \cdot (G_t - V(S_t; w)) \cdot \nabla_w V(S_t; w)
$$
where:
* $\alpha$ is the learning rate.
* $w$ are the parameters of the value function approximation.
Similarly, for off-policy MC control, the Q-value update is adjusted:
$$
\Delta w = \alpha \cdot \rho_t \cdot (G_t - Q(S_t, A_t; w)) \cdot \nabla_w Q(S_t, A_t; w)
$$
**Important Note:** This method fails if $\mu(S_t,A_t)=0$ when $\pi(S_t, A_t)>0$, as the importance sampling ratio becomes undefined.
#### Reducing Variance with TD Importance Sampling
Off-policy MC with importance sampling can suffer from high variance due to the cumulative product of importance sampling ratios over an episode. To address this, TD targets (instead of full returns) are used to estimate the value function $V^\pi(s)$. The off-policy TD prediction update rule then becomes:
$$
\Delta w = \alpha \cdot \frac{\pi(S_t, A_t)}{\mu(S_t, A_t)} \cdot (R_{t+1} + \gamma \cdot V(S_{t+1}; w) - V(S_t; w)) \cdot \nabla_w V(S_t; w)
$$
A similar update is applied to the Q-value function for TD control.
**Key Advantage:** TD importance sampling only requires the policies to be similar over a single time step, resulting in lower variance than MC importance sampling.
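Below is a minimal tabular sketch of this one-step reweighting for off-policy TD(0) prediction (the tabular case corresponds to one weight per state). The dictionaries `pi_probs`/`mu_probs` mapping $(state, action)$ to probabilities and the `transitions` list of $(S, A, R, S')$ tuples generated under $\mu$ are hypothetical placeholders.
```python=
# Minimal tabular sketch of off-policy TD(0) prediction with a per-step
# importance sampling ratio. All inputs are hypothetical placeholders.
def off_policy_td0(transitions, pi_probs, mu_probs, alpha=0.1, gamma=0.9):
    V = {}
    for state, action, reward, next_state in transitions:
        rho = pi_probs[(state, action)] / mu_probs[(state, action)]
        td_target = reward + gamma * V.get(next_state, 0.0)
        td_error = td_target - V.get(state, 0.0)
        V[state] = V.get(state, 0.0) + alpha * rho * td_error
    return V
```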
## Relationship between DP and TD
The following table summarizes the relationship between Dynamic Programming (DP) and Temporal Difference (TD) learning methods in Reinforcement Learning (RL). It highlights how these approaches address prediction and control problems using either full backups (DP) or sample backups (TD).
| Bellman Equation | Full Backup (DP) | Sample Backup (TD) |
|-----------------------------------|--------------------------|--------------------|
| **Bellman Expectation Equation for $V^π(s)$** | Iterative Policy Evaluation | TD Learning|
| **Bellman Expectation Equation for $Q^π(s, a)$** | Q-Policy Iteration| SARSA |
| **Bellman Optimality Equation for $Q^*(s, a)$** | Q-Value Iteration| Q-Learning |
### Summary of Updates
| Update | **Full Backup (DP)** | **Sample Backup (TD)** |
|--------|----------------------|------------------------|
| **$V(S)$ update** | Iterative Policy Evaluation: $\mathbb{E}[R + \gamma V(S') \mid S]$ | TD Learning: $R + \gamma V(S')$ |
| **$Q(S, A)$ update (expectation)** | Q-Policy Iteration: $\mathbb{E}[R + \gamma Q(S', A') \mid S, A]$ | SARSA: $R + \gamma Q(S', A')$ |
| **$Q(S, A)$ update (optimality)** | Q-Value Iteration: $\mathbb{E}[R + \gamma \max_{a'} Q(S', a') \mid S, A]$ | Q-Learning: $R + \gamma \max_{a'} Q(S', a')$ |
## Convergence of RL Algorithms
### Convergence of Prediction Algorithms
| On/Off Policy | Algorithm | Tabular | Linear | Non-Linear |
|---------------|-------------|:-------:|:------:|:----------:|
| **On-Policy** | **MC** | ✔️ | ✔️ | ✔️ |
| | **TD(0)** | ✔️ | ✔️ | ❌ |
| | **TD(λ)** | ✔️ | ✔️ | ❌ |
| **Off-Policy**| **MC** | ✔️ | ✔️ | ✔️ |
| | **TD(0)** | ✔️ | ❌ | ❌ |
| | **TD(λ)** | ✔️ | ❌ | ❌ |
MC prediction methods generally have strong convergence guarantees, whether on-policy or off-policy, and whether using tabular representations or function approximation. However, TD prediction methods can encounter convergence issues, primarily due to their use of semi-gradient updates (as opposed to true gradient updates).
While on-policy TD prediction can converge with linear function approximation, it may not with non-linear approximation. Off-policy TD prediction lacks convergence guarantees even for linear function approximation.
### The Deadly Triad
The combination of:
1. **Bootstrapping:** Updating a value estimate based on other learned estimates.
2. **Off-Policy Learning:** Learning the target policy from data generated by a different behavior policy.
3. **Function Approximation:** Using a function approximator (e.g., neural network) to represent the value function.
is known as the **deadly triad** because it can lead to instability and divergence in TD learning algorithms. To mitigate these issues, one can:
* **Adjust the TD($\lambda$) parameter:** Increasing $\lambda$ can improve stability by reducing the reliance on bootstrapping.
* **Use Gradient TD:** This method uses true gradient updates instead of semi-gradient updates, improving convergence properties.
| On/Off Policy | Algorithm | Tabular | Linear | Non-Linear |
|---------------|-------------|:-------:|:------:|:----------:|
| **On-Policy** | **MC** | ✔️ | ✔️ | ✔️ |
| | **TD** | ✔️ | ✔️ | ❌ |
| | **Gradient TD** | ✔️ | ✔️ | ✔️ |
| **Off-Policy** | **MC** | ✔️ | ✔️ | ✔️ |
| | **TD** | ✔️ | ❌ | ❌ |
| | **Gradient TD** | ✔️ | ✔️ | ✔️ |
### Convergence of Control Algorithms
| Algorithm | Tabular | Linear | Non-Linear |
|--------------------------|:-------:|:--------:|:----------:|
| **MC Control** | ✔️ | (✔️) | ❌ |
| **SARSA** | ✔️ | (✔️) | ❌ |
| **Q-Learning** | ✔️ | ❌ | ❌ |
| **Gradient Q-Learning** | ✔️ | ✔️ | ❌ |
* (✔️) indicates convergence near the optimal value function, but not exact convergence.
Gradient Q-Learning addresses some convergence issues by using Gradient TD techniques. It combines off-policy learning and bootstrapping but avoids the problems associated with semi-gradient updates. However, even Gradient Q-Learning still struggles with non-linear function approximation in control tasks, underscoring the challenges presented by the deadly triad.
## References
- Chapter 12 of the [RLForFinanceBook](https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf)
- [RL for Control](https://github.com/coverdrive/technical-documents/blob/master/finance/cme241/Tour-RLControl.pdf) slides for CME 241: Foundations of Reinforcement Learning with Applications in Finance