---
title: RL Bible Part II
---
This note provides an overview of the Barto and Sutton book and is not a substitute for reading it. Some passages of the book are deliberately covered in more detail than others.
# Overview of Barto & Sutton:

# Part II
### Tabular Solution Methods: Chapters 5, 6, 7, 8. The full Reinforcement Learning problem starts here.
## 5. Monte Carlo Methods
Monte Carlo methods only require experience (= sample sequences of states, actions, and rewards).
It is often easy to generate experience sampled according to the desired probability distributions, but infeasible to obtain the distributions in explicit form.
Monte Carlo is defined here only for episodic tasks: sample and average complete returns for each state-action pair.
Like bandits, but with multiple states, each acting like a different, interrelated bandit problem: the return after taking an action in one state depends on the actions taken in later states of the same episode.
Each bandit problem is nonstationary because the action selections in later states are themselves undergoing learning.
#### 5.1 Monte Carlo Prediction
First-visit MC vs. every-visit MC; every-visit MC extends more naturally to the function approximation and eligibility traces of Chapters 9 and 12.

The estimate for each state is independent: it does not depend on the estimates of other states (no bootstrapping).
The cost of estimating the value of a single state is independent of the number of states.
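As a concrete illustration (not the book's pseudocode), a minimal Python sketch of first-visit MC prediction, assuming each episode is given as a list of (state, reward) pairs where the reward is the one received on leaving that state:
```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns.

    Each episode is a list of (state, reward) pairs, where the reward
    is the one received on leaving that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:
        # Earliest time step at which each state appears (first visit).
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        # Walk backwards, accumulating the return G_t.
        G = 0.0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            if first_visit[state] == t:
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```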
#### 5.2 Monte Carlo Estimation of Action Values
With no prior knowledge of the environment dynamics, state values alone are not enough: we estimate action values (state-action values) instead.
Same as before, except we talk about visits to state-action pairs.
This introduces an exploration problem: we need to estimate the value of all actions in each state, not just the one a deterministic policy chooses.
Exploring starts: every state-action pair has a nonzero probability of being selected at the beginning of an episode.
This cannot always be relied upon, so we also need stochastic policies with a nonzero probability of selecting every action in each state.
#### 5.3 Monte Carlo Control

Policy Improvement: make the policy greedy with respect to the current action-value function,
$$\pi(s) \doteq \arg\max_a q(s, a)$$
Policy evaluation is performed after each episode, for every state-action pair first visited in that episode.
To keep track of the returns for each state-action pair, it is enough to maintain a mean and a count per pair and update them incrementally as in Section 2.4.
Convergence of Monte Carlo ES has not been formally proved, even though it seems inevitable.
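A minimal sketch of that incremental update (Section 2.4 style), assuming Q and N are plain dictionaries keyed by (state, action) and G is the return observed for that pair; the names are illustrative:
```python
def incremental_update(Q, N, state, action, G):
    """Running-mean update, as in Section 2.4: Q <- Q + (G - Q) / N."""
    key = (state, action)
    N[key] = N.get(key, 0) + 1
    q = Q.get(key, 0.0)
    Q[key] = q + (G - q) / N[key]
```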
#### 5.4 Monte Carlo without Exploring Starts
To ensure that all actions are selected infinitely often:
- On-policy:
Evaluate or improve the policy that is used to make decisions (ex: Monte Carlo ES above and 5.4).
Generally simpler and considered first.
- Off-policy:
Evaluate or improve a policy different from the one used to generate the data (ex: 5.5).
Often of greater variance and slower to converge, but more powerful in general.
On-policy:
Use a soft policy (ex: an eps-greedy policy): stochastic, with pi(a|s) > 0 for all states and actions, gradually shifted toward a deterministic optimal policy.
Policy improvement: instead of moving the policy all the way to the greedy policy, we only move it toward the greedy policy -> eps-greedy policy.
Policy improvement theorem: for any eps-soft policy pi, any eps-greedy policy with respect to the value function of pi is guaranteed to be better than or equal to pi.



Without exploring starts, we can now only achieve the best policy among the eps-soft policies.
This approach learns action values not for the optimal policy but for a near-optimal policy that still explores.
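A sketch of eps-greedy action selection over tabular action values, assuming Q maps (state, action) pairs to values and `actions` lists the available actions (illustrative names, not the book's pseudocode):
```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else a greedy one,
    so every action keeps a selection probability of at least
    epsilon / len(actions)."""
    if random.random() < epsilon:
        return random.choice(actions)
    values = [Q.get((state, a), 0.0) for a in actions]
    best = max(values)
    # Break ties randomly among the greedy actions.
    return random.choice([a for a, v in zip(actions, values) if v == best])
```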
#### 5.5 Off-policy Prediction via Importance Sampling
A more straightforward approach is to use two policies: one that is learned about and becomes the optimal policy (the target policy), and another, more exploratory one, used to generate behavior (the behavior policy).
Off-policy learning allows learning an optimal policy from suboptimal behavior.
Assumption of coverage: pi(a|s) > 0 implies b(a|s) > 0; in particular, b must be stochastic in states where it is not identical to pi.
Almost all off-policy methods utilize importance sampling.
Importance sampling ratio:
$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
The importance sampling ratio depends only on the policies and the sequence, not on the MDP.
It is used in policy evaluation to correct the returns, for example with ordinary importance sampling:
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
(weighted importance sampling divides by the sum of the ratios instead of by $|\mathcal{T}(s)|$).
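A sketch of the importance-sampling ratio for one episode, assuming `pi(a, s)` and `b(a, s)` are callables returning the target and behavior action probabilities (hypothetical names):
```python
def importance_sampling_ratio(episode, pi, b):
    """Product over the episode of pi(A_k | S_k) / b(A_k | S_k).

    `episode` is a list of (state, action) pairs; coverage requires
    b(a, s) > 0 whenever pi(a, s) > 0.
    """
    rho = 1.0
    for state, action in episode:
        rho *= pi(action, state) / b(action, state)
    return rho
```
The corrected return for that episode is then `rho * G`; ordinary importance sampling averages these corrected returns over the episodes visiting the state of interest.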
#### 5.6 Incremental Implementation
Every-visit MC update suitable for nonstationary environments:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$
Do not have to retain a list of returns to compute the average at each update.
#### 5.7 Off-policy Monte Carlo Control
#### 5.10 Summary
Advantages of Monte Carlo methods: they can learn directly from interaction with the environment with no model of its dynamics; they can be used with simulation (sample models); it is easy to focus them on a small subset of the states; and they may be less harmed by violations of the Markov property because they do not bootstrap.
GPI is performed with an alternative, sample-based policy evaluation process.
## 6. Temporal-Difference Learning
TD learning is an idea central and novel to RL: a combination of MC ideas (learning from raw experience) and DP ideas (updating estimates based in part on other learned estimates, i.e., bootstrapping).
Chapter 7: Bridge from TD to MC methods.
Chapter 12 (in Part 2 of the book): TD (lambda) seamlessly unifies them.
Differences between DP, TD and MC methods are primarily in the prediction problem.
#### 6.1 TD Prediction
At time t+1, TD methods form a target immediately upon transition to the new state and receipt of the reward, so that the TD(0) update is:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
The MC target is an estimate of $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ because the expected value is unknown: a sample return is used instead.
The DP target is an estimate of $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ because, even though the expected value is provided by a model of the environment, $v_\pi(S_{t+1})$ is unknown and the current estimate $V(S_{t+1})$ is used instead.
The TD target is an estimate for both reasons; it combines the sampling of MC with the bootstrapping of DP:
- It samples the expected value.
- It uses the current estimate of the next state's value instead of the true one.
MC and TD updates are called sample updates because they involve looking ahead to a sample successor state, in contrast to the expected updates of DP.
$$\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$ is the TD error.
$\delta_t$ is the error in the estimate of $V(S_t)$, and it only becomes available at time t+1.
MC error as a sum of TD errors:
$$G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k$$
This identity only holds approximately if the array V changes during the episode.
Example: Driving home

Algorithm:
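A minimal sketch of tabular TD(0) prediction, assuming a Gym-like environment with `reset()` returning a state and `step(action)` returning `(next_state, reward, done)`, plus a `policy(state)` function; these interface details are assumptions, not the book's pseudocode:
```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Estimate V under `policy` with one-step TD updates."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0): move V(S_t) toward R_{t+1} + gamma * V(S_{t+1}),
            # with V of a terminal state taken to be 0.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```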

#### 6.2 Advantages of TD Prediction Methods
Which of TD and MC is better? This will be answered throughout the book.
MC prediction resembles supervised learning with the complete return as the target; TD instead learns a guess from a guess (bootstrapping).
TD methods are naturally implemented in an online, fully incremental fashion =/= MC.
Only need to wait 1 time step =/= MC (important if long episode or continuing tasks).
Theorem:
For any fixed policy pi, TD(0) has been proved to converge to v[pi] with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions:
$$\sum_{n=1}^{\infty} \alpha_n = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2 < \infty$$
TD target has lower variance than MC target.
Both MC and TD(0) converge.
#### 6.3 Optimality of TD(0)
Batch updating is explained.
Batch MC prediction produces lower error on existing data, while TD produces lower error on future data.
Batch MC always finds the estimates that minimize the mean squared error on the training set.
Batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. A maximum-likelihood estimate of a parameter is the value whose probability of generating the data is greatest. Here, the maximum-likelihood estimate is the model of the Markov process formed from the observed episodes. The value function computed from this model would be exactly correct if the estimated model itself were exactly correct: this is the certainty-equivalence estimate.
In general, batch TD(0) converges to the certainty-equivalence estimate.
In batch form, TD(0) is faster than MC because it computes the true certainty-equivalence estimate.
Nonbatch TD(0) does not compute this estimate exactly, but it can be understood as moving roughly toward it.
Certainty-equivalence estimate is almost never feasible to compute directly.
#### 6.4 Sarsa: On-Policy TD Control
Rather than improving policy after convergence of policy evaluation (iterative policy evaluation) or after an episode (MC), we do it after a single step.
Control problem part following the GPI pattern. On-Policy.
In 6.1 we learned the state-value function, but here we need to learn an action-value function instead.
We considered transitions from state to state; now we consider transitions from state-action pair to state-action pair, hence the name Sarsa, from the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}):
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$
If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as 0.
Sarsa converges to an optimal policy and action-value function if all pairs are visited an infinite number of times and if the policy converges in the limit to the greedy policy (ex: eps = 1/t in eps-greedy policy).

In examples of episodic tasks where it is difficult for the agent to end up in a terminal state, TD is better than MC as it improves its policy step after step.
Note that initializing the action values optimistically will cause extensive exploration, because unseen states are valued higher than frequently visited ones (cf. Section 2.6), even if eps is set to 0.
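A sketch of one Sarsa episode under the same assumed Gym-like interface as the TD(0) sketch above (`reset()`, `step()` returning `(next_state, reward, done)`), with Q stored as a dictionary:
```python
import random

def sarsa_episode(env, Q, actions, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Run one Sarsa episode, updating the action-value dict Q in place."""
    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    state = env.reset()
    action = eps_greedy(state)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        if done:
            target = reward  # Q of a terminal state is defined as 0
        else:
            next_action = eps_greedy(next_state)
            target = reward + gamma * Q.get((next_state, next_action), 0.0)
        q = Q.get((state, action), 0.0)
        Q[(state, action)] = q + alpha * (target - q)
        if not done:
            state, action = next_state, next_action
```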
#### 6.5 Q-learning: Off-policy TD Control
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
Difference with Sarsa: instead of using the value of the next state-action pair in its target, it takes the max of Q over the actions available in the next state.

**Sarsa is a sample-based version of policy iteration.**
**Q-learning is a sample-based version of value iteration.**
Q-learning is an off-policy control method because action selection follows an eps-greedy policy (the behavior policy), while the action-value update uses a different policy that is greedy with respect to the action-value function (the target policy).
However, no need for importance sampling.
Sarsa can do better than Q-learning when performance is measured online. On some examples, Sarsa acts in a less risky way than Q-learning because it learns an eps-greedy policy (cliff walking):

Q-learning ignores the outcomes of exploratory actions, it only learns about the greedy actions.
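For contrast with Sarsa, a sketch of the Q-learning update for a single observed transition; only the target changes, taking the max over next actions (same illustrative dictionary convention for Q):
```python
def q_learning_update(Q, state, action, reward, next_state, done, actions,
                      alpha=0.5, gamma=1.0):
    """One Q-learning update: the target maximizes over next actions."""
    if done:
        target = reward  # value of a terminal state is 0
    else:
        target = reward + gamma * max(Q.get((next_state, a), 0.0) for a in actions)
    q = Q.get((state, action), 0.0)
    Q[(state, action)] = q + alpha * (target - q)
```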
#### 6.6 Expected Sarsa
In Sarsa, we sample the next state because we do not know the model, but we also sample the next action even though we know the current policy.
In Expected Sarsa, we instead calculate the expectation over next actions using the current policy:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
Expected Sarsa has a more stable update target than Sarsa: its target has lower variance, but it is computationally more expensive.
It is more robust than Sarsa at large step sizes.
Expected Sarsa learns both on-policy and off-policy without importance sampling.

Q-learning is a special case of Expected Sarsa, obtained when the target policy is the greedy policy.
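A sketch of the Expected Sarsa target for an eps-greedy target policy, making the expectation over next actions explicit; with a fully greedy target policy the expectation collapses to the max and Q-learning is recovered (illustrative code):
```python
def expected_sarsa_target(Q, reward, next_state, done, actions,
                          gamma=1.0, epsilon=0.1):
    """Target = R + gamma * sum_a pi(a | S') * Q(S', a) for eps-greedy pi."""
    if done:
        return reward
    values = [Q.get((next_state, a), 0.0) for a in actions]
    best = max(values)
    n_best = sum(1 for v in values if v == best)
    expected = 0.0
    for v in values:
        # eps-greedy probabilities: eps / |A| for every action, plus
        # (1 - eps) shared equally among the greedy actions.
        p = epsilon / len(actions) + ((1 - epsilon) / n_best if v == best else 0.0)
        expected += p * v
    return reward + gamma * expected
```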
#### 6.7 Maximization Bias and Double Learning
#### 6.8 Games, Afterstates, and Other Special Cases
#### 6.9 Summary
Sarsa learns Q[pi], is on-policy.
Q-learning learns Q[*], is off-policy.
Expected Sarsa uses the same Bellman equation as Sarsa but samples it differently; it can be used both on-policy and off-policy, as it can learn about any target policy.
## 7. N-step Bootstrapping
A bridge between Chapters 5 and 6: n-step methods span the spectrum between one-step TD and Monte Carlo.
## 8. Planning and Learning with Tabular Methods
Goal: understand the difference between planning (model-based RL) and learning (model-free RL)
#### 8.1 Models and Planning
Model of the environment = anything that an agent can use to predict how the environment will respond to its actions.
A distribution model consists of the probabilities of next states and rewards for possible actions.
A sample model produces single transitions and rewards generated according to these probabilities.
In either case, the model can be used to simulate the environment and produce simulated experience.
Here, planning is used as “State-space planning” and not as “Plan-space planning”.
State-space planning is primarily viewed as a search through the state space for an optimal policy, or an optimal path to a goal. Actions cause transitions from state to state, and value functions are computed over states.
Plan-space planning is instead a search through the space of plans. Operators transform one plan into another, and value functions, if any, are defined over the space of plans. It includes evolutionary methods and partial-order planning.

Common structure in all state-space planning methods:
- All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy
- They compute value functions by updates or backup operations applied to simulated experience.
There are two sources of experience: simulated experience from the model and real experience from the environment.
We can do planning from a sample model: for example, Q-learning learns directly from experience generated by interaction with the environment, while Q-planning performs the same update on experience generated by the model.

This algorithm assumes that we have a sample model of the transition dynamics and a strategy for sampling state-action pairs (uniformly at random, for example).
Random-sample one-step tabular Q-planning:
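A sketch of this loop, assuming a `sample_model(state, action)` function that returns a sampled `(next_state, reward)` and lists `states` and `actions` to draw pairs from (all hypothetical names):
```python
import random

def q_planning(Q, sample_model, states, actions, num_updates,
               alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning: Q-learning updates
    applied to model-generated experience."""
    for _ in range(num_updates):
        # 1. Select a state-action pair at random (search control).
        state, action = random.choice(states), random.choice(actions)
        # 2. Ask the sample model for a next state and reward.
        next_state, reward = sample_model(state, action)
        # 3. One-step tabular Q-learning update with the simulated outcome.
        target = reward + gamma * max(Q.get((next_state, a), 0.0) for a in actions)
        q = Q.get((state, action), 0.0)
        Q[(state, action)] = q + alpha * (target - q)
    return Q
```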

This first seems irrelevant, but the distinction between planning and learning is important when interaction with the environment takes much longer than the time needed to make the updates.
#### 8.2 Dyna: Integrated Planning, Acting, and Learning
The environment is assumed to be deterministic here, but the algorithm can be extended to stochastic environments.
 
Real experience can be used to improve the model, and both simulated and real experience can be used to improve the value function and policy.

Step d: Direct RL.
Step e: Model Learning
Step f: Planning part, with search control = strategy for sampling S and A
In the example of a maze with 0 reward everywhere except on the transition into state E, the Q-learning algorithm takes 184 actions before it updates its first action value. Using model learning and planning updates instead, we can leverage all this experience and update most state-action values without having to further interact with the environment.

Model-based methods like Dyna typically require more memory than model-free methods like Q-learning. The amount of computation per interaction with the environment is also larger in the Dyna-Q algorithm (with non-zero planning steps).
But when compared with model-free methods, model-based methods are relatively more sample efficient as they can achieve a comparable performance with comparatively fewer environmental interactions.
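A sketch of one Dyna-Q iteration (steps d, e, and f above) after a real transition, assuming a deterministic environment so the model simply stores the last observed outcome of each visited pair (illustrative code):
```python
import random

def dyna_q_step(Q, model, state, action, reward, next_state, done, actions,
                n_planning=10, alpha=0.1, gamma=0.95):
    """One Dyna-Q iteration after a real transition: direct RL update,
    model learning, then n planning updates on simulated experience."""
    def q_update(s, a, r, s2, terminal):
        target = r if terminal else r + gamma * max(
            Q.get((s2, b), 0.0) for b in actions)
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + alpha * (target - q)

    # (d) Direct RL: Q-learning update from the real transition.
    q_update(state, action, reward, next_state, done)
    # (e) Model learning: with a deterministic environment, just remember
    #     the last observed outcome of this state-action pair.
    model[(state, action)] = (reward, next_state, done)
    # (f) Planning: n Q-learning updates on transitions sampled from the
    #     model, using only previously observed state-action pairs.
    for _ in range(n_planning):
        (s, a), (r, s2, terminal) = random.choice(list(model.items()))
        q_update(s, a, r, s2, terminal)
```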
#### 8.3 When the Model Is Wrong
If a lot of transitions are missing, the model is incomplete.
The simplest way for Dyna-Q to deal with an incomplete model is to only sample, during planning, state-action pairs that have already been experienced in the interaction with the environment.
If the environment is changing, the model becomes inaccurate, because transitions from the same state and action might not be the same after the environment changes. During planning, the update is based on what the model thinks the outcome of the state-action pair is, rather than on what actually happens in the environment, so an outdated model will improve the policy or value function with respect to the model, but not with respect to the environment.
We can make sure that the model is accurate by making the agent revisit the part of the environment that changed. Moreover, the model is more likely to be wrong in places it has not visited for a long time.
Thus, model accuracy produces a variant of the exploration-exploitation trade-off.
Dyna-Q+ is a solution: within the planning update, it redefines the reward to give an exploration bonus,
$$r + \kappa \sqrt{\tau}$$
where $\tau$ is the number of time steps since that transition was last tried and $\kappa$ is a small constant.
#### 8.4 Prioritized Sweeping
In problems with a large number of states, search control can be done by working backwards from state-action pairs whose values have just had a non-zero change (for example, from the state right before a state leading to the goal state). This avoids the otherwise wasteful computation on state-action pairs whose updates would be zero.
#### 8.5 Expected vs. Sample Updates
Dynamic programming requires a distribution model because it uses expected updates, which involve computing expectations over all the possible next states and rewards.
A sample model, on the other hand, is what is needed to simulate interacting with the environment during which sample updates can be used.
#### 8.6 Trajectory Sampling
Focuses on states or state-action pairs that the agent is likely to visit when controlling its environment. This can allow computation to skip over parts of the state space that are irrelevant to the prediction or control problem.
#### 8.7 Real-time Dynamic Programming
It is an on-policy trajectory sampling version of value iteration illustrating some of the advantages trajectory sampling has over conventional sweep-based policy iteration.
#### 8.8 Planning at Decision Time
#### 8.9 Heuristic Search
#### 8.10 Rollout Algorithms
#### 8.11 Monte Carlo Tree Search
#### 8.12 Summary of the Chapter
Planning requires a model of the environment.
Distribution model vs sample model.
Any of the learning methods can be converted into planning methods simply by applying them to simulated (model-generated) experience rather than to real experience.
They are possibly identical algorithms operating on 2 different sources of experience.
#### 8.13 Summary of Part I: Dimensions
All algorithms presented have 3 key ideas in common:
- Seek to estimate value functions
- Operate by backing up value updates along actual or possible state trajectories
- Follow the general strategy of GPI maintaining an approximate value function and approximate policy and continually trying to improve each on the basis of the other
The methods vary in the kind of update used to improve the value function: sample updates (based on a single sample successor, as with a sample model or real experience) or expected updates (based on the distribution of all possible successors, which requires a distribution model).
Moreover, they vary on the depth of the update.

A third perpendicular dimension could be the binary distinction between on-policy and off-policy methods.
Others could be:
- Definition of return (episodic or continuing, discounted or not)
- Action values vs state values
- Action selection/exploration (eps-greedy, optimistic initialization of values, soft-max, upper confidence bound, etc.)
- Synchronous vs asynchronous updates for all states
- Real vs simulated
- Location of updates (model-free methods can choose only among the states actually encountered, but model-based methods can choose arbitrarily).
- …
In Part 2 of the book, we move from tabular methods to methods using function approximation.