
Lecture Notes 12: Reinforcement Learning

Welcome to our introduction to reinforcement learning! Reinforcement Learning is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

What Can RL Do?

RL methods have recently enjoyed a wide variety of successes. For example, they have been used to teach computers to control robots, both in simulation and in the real world.


It’s also famously been used to create breakthrough AIs for sophisticated strategy games, most notably Go and Dota, taught computers to play Atari games from raw pixels, and trained simulated robots to follow human instructions.

Key Concepts and Terminology

Agent-environment interaction loop.


The main characters of RL are the agent and the environment. The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees an observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.

The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.

To talk more specifically about what RL does, we need to introduce additional terminology. We need to talk about:

(1) states and observations,
(2) action spaces,
(3) policies,
(4) trajectories,
(5) different formulations of return,
(6) the RL optimization problem,
(7) value functions.

(1) States and Observations

A state $s$ is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation $o$ is a partial description of a state, which may omit information.

In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor. For instance, a visual observation could be represented by the RGB matrix of its pixel values; the state of a robot might be represented by its joint angles and velocities.

When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.

Reinforcement learning notation sometimes puts the symbol for state, $s$, in places where it would be technically more appropriate to write the symbol for observation, $o$. Specifically, this happens when talking about how the agent decides an action: we often signal in notation that the action is conditioned on the state, when in practice the action is conditioned on the observation, because the agent does not have access to the state when the environment is partially observed.

(2) Action Spaces

Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the action space. Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent. Other environments, like where the agent controls a robot in a physical world, have continuous action spaces. In continuous spaces, actions are real-valued vectors.

This distinction has some quite-profound consequences for methods in deep RL. Some families of algorithms can only be directly applied in one case, and would have to be substantially reworked for the other.
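
For concreteness, here is a minimal sketch (assuming the OpenAI Gym API and these particular environment names, which may differ across Gym versions) of how the two kinds of action spaces show up in code:

```python
# Inspect the action spaces of a discrete-action and a continuous-action environment.
# Environment names are illustrative and may vary by Gym version.
import gym

discrete_env = gym.make('CartPole-v1')      # discrete action space
continuous_env = gym.make('Pendulum-v1')    # continuous action space

print(discrete_env.action_space)      # e.g. Discrete(2): a finite set of integer actions
print(continuous_env.action_space)    # e.g. Box(-2.0, 2.0, (1,), float32): real-valued action vectors
```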

(3) Policies

A policy is a rule used by an agent to decide what actions to take. It can be deterministic, in which case it is usually denoted by $\mu$:

$$a_t = \mu(s_t),$$

or it may be stochastic, in which case it is usually denoted by $\pi$:

$$a_t \sim \pi(\cdot | s_t).$$

Because the policy is essentially the agent’s brain, it’s not uncommon to substitute the word “policy” for “agent”, eg saying “The policy is trying to maximize reward.”

In deep RL, we deal with parameterized policies: policies whose outputs are computable functions that depend on a set of parameters (eg the weights and biases of a neural network) which we can adjust to change the behavior via some optimization algorithm.


Deterministic Policies

Example: Deterministic Policies. Here is a code snippet for building a simple deterministic policy for a continuous action space in PyTorch, using the torch.nn package:

pi_net = nn.Sequential(
              nn.Linear(obs_dim, 64),
              nn.Tanh(),
              nn.Linear(64, 64),
              nn.Tanh(),
              nn.Linear(64, act_dim)
            )

This builds a multi-layer perceptron (MLP) network with two hidden layers of size 64 and $\tanh$ activation functions. If obs is a NumPy array containing a batch of observations, pi_net can be used to obtain a batch of actions as follows:

obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
actions = pi_net(obs_tensor)

Don't worry if this neural network material is unfamiliar to you; this tutorial focuses on RL, not on the neural network side of things, so you can skip this example and come back to it later. But if you are already familiar with neural networks, it may be a helpful reference.

Stochastic Policies

The two most common kinds of stochastic policies in deep RL are categorical policies and diagonal Gaussian policies.

Categorical policies can be used in discrete action spaces, while diagonal Gaussian policies are used in continuous action spaces.

Two key computations are centrally important for using and training stochastic policies:

- sampling actions from the policy, and
- computing log likelihoods of particular actions, $\log \pi_{\theta}(a|s)$.

In what follows, we'll describe how to do these for both categorical and diagonal Gaussian policies.

Categorical Policies

A categorical policy is like a classifier over discrete actions. You build the neural network for a categorical policy the same way you would for a classifier: the input is the observation, followed by some number of layers (possibly convolutional or densely-connected, depending on the kind of input), and then you have one final linear layer that gives you logits for each action, followed by a softmax to convert the logits into probabilities.

Sampling. Given the probabilities for each action, frameworks like PyTorch and Tensorflow have built-in tools for sampling. For example, see the documentation for Categorical distributions in PyTorch, torch.multinomial, tf.distributions.Categorical, or tf.multinomial.

Log-Likelihood. Denote the last layer of probabilities as $P_{\theta}(s)$. It is a vector with as many entries as there are actions, so we can treat the actions as indices into the vector. The log likelihood for an action $a$ can then be obtained by indexing into the vector:

$$\log \pi_{\theta}(a|s) = \log \left[ P_{\theta}(s) \right]_a.$$
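
As a concrete sketch of both key computations, here is how sampling and log-likelihoods might look with PyTorch's Categorical distribution; logits_net, obs, and act (a network producing logits, a batch of observation tensors, and a batch of integer actions) are assumed to be defined elsewhere:

```python
# Sampling and log-likelihood for a categorical policy (sketch; logits_net, obs,
# and act are assumed to exist).
import torch
from torch.distributions import Categorical

logits = logits_net(obs)             # shape: (batch, n_actions)
dist = Categorical(logits=logits)    # softmax over logits gives action probabilities
sampled = dist.sample()              # one sampled action per observation
logp = dist.log_prob(act)            # log pi_theta(a|s) for the given actions
```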

Diagonal Gaussian Policies

A multivariate Gaussian distribution (or multivariate normal distribution, if you prefer) is described by a mean vector, $\mu$, and a covariance matrix, $\Sigma$. A diagonal Gaussian distribution is a special case where the covariance matrix only has entries on the diagonal. As a result, we can represent it by a vector.

A diagonal Gaussian policy always has a neural network that maps from observations to mean actions, $\mu_{\theta}(s)$. There are two different ways that the covariance matrix is typically represented.

The first way: There is a single vector of log standard deviations, $\log \sigma$, which is not a function of state: the $\log \sigma$ are standalone parameters. (You Should Know: our implementations of VPG, TRPO, and PPO do it this way.)

The second way: There is a neural network that maps from states to log standard deviations, $\log \sigma_{\theta}(s)$. It may optionally share some layers with the mean network.

Note that in both cases we output log standard deviations instead of standard deviations directly. This is because log stds are free to take on any values in $(-\infty, \infty)$, while stds must be nonnegative. It's easier to train parameters if you don't have to enforce those kinds of constraints. The standard deviations can be obtained immediately from the log standard deviations by exponentiating them, so we do not lose anything by representing them this way.

Sampling. Given the mean action $\mu_{\theta}(s)$ and standard deviation $\sigma_{\theta}(s)$, and a vector $z$ of noise from a spherical Gaussian ($z \sim \mathcal{N}(0, I)$), an action sample can be computed with

$$a = \mu_{\theta}(s) + \sigma_{\theta}(s) \odot z,$$

where $\odot$ denotes the elementwise product of two vectors. Standard frameworks have built-in ways to generate the noise vectors, such as torch.normal or tf.random_normal. Alternatively, you can build distribution objects, eg through torch.distributions.Normal or tf.distributions.Normal, and use them to generate samples. (The advantage of the latter approach is that those objects can also calculate log-likelihoods for you.)

Log-Likelihood. The log-likelihood of a $k$-dimensional action $a$, for a diagonal Gaussian with mean $\mu = \mu_{\theta}(s)$ and standard deviation $\sigma = \sigma_{\theta}(s)$, is given by

$$\log \pi_{\theta}(a|s) = -\frac{1}{2}\left( \sum_{i=1}^{k} \left( \frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i \right) + k \log 2\pi \right).$$
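
Putting the pieces together, here is a minimal sketch of a diagonal Gaussian policy in PyTorch using the first representation (a state-independent log-std vector); obs_dim and act_dim are assumed integer dimensions introduced for illustration:

```python
# Diagonal Gaussian policy sketch: mean network plus standalone log-std parameters.
# obs_dim and act_dim are assumed integers; this is an illustration, not a full implementation.
import torch
import torch.nn as nn
from torch.distributions import Normal

mu_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))   # log sigma as standalone parameters

def sample_and_logp(obs):
    mu = mu_net(obs)                       # mean action mu_theta(s)
    std = torch.exp(log_std)               # sigma = exp(log sigma)
    dist = Normal(mu, std)
    a = dist.sample()                      # equivalent to mu + sigma * z, with z ~ N(0, I)
    logp = dist.log_prob(a).sum(axis=-1)   # sum per-dimension log-likelihoods
    return a, logp
```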

(4) Trajectories

A trajectory $\tau$ is a sequence of states and actions in the world,

$$\tau = (s_0, a_0, s_1, a_1, ...).$$

The very first state of the world, $s_0$, is randomly sampled from the start-state distribution, sometimes denoted by $\rho_0$:

$$s_0 \sim \rho_0(\cdot).$$

State transitions (what happens to the world between the state at time $t$, $s_t$, and the state at $t+1$, $s_{t+1}$) are governed by the natural laws of the environment, and depend on only the most recent action, $a_t$. They can be either deterministic,

$$s_{t+1} = f(s_t, a_t),$$

or stochastic,

$$s_{t+1} \sim P(\cdot | s_t, a_t).$$

Actions come from an agent according to its policy.

Trajectories are also frequently called episodes or rollouts.

(5) Reward and Return

The reward function $R$ is critically important in reinforcement learning. It depends on the current state of the world, the action just taken, and the next state of the world:

$$r_t = R(s_t, a_t, s_{t+1}),$$

although frequently this is simplified to just a dependence on the current state, $r_t = R(s_t)$, or state-action pair, $r_t = R(s_t, a_t)$.

The goal of the agent is to maximize some notion of cumulative reward over a trajectory, but this actually can mean a few things. We'll notate all of these cases with $R(\tau)$, and it will either be clear from context which case we mean, or it won't matter (because the same equations will apply to all cases).

One kind of return is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window of steps:

$$R(\tau) = \sum_{t=0}^{T} r_t.$$

Another kind of return is the infinite-horizon discounted return, which is the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they're obtained. This formulation of reward includes a discount factor $\gamma \in (0,1)$:

$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t.$$

Why would we ever want a discount factor, though? Don’t we just want to get all rewards? We do, but the discount factor is both intuitively appealing and mathematically convenient. On an intuitive level: cash now is better than cash later. Mathematically: an infinite-horizon sum of rewards may not converge to a finite value, and is hard to deal with in equations. But with a discount factor and under reasonable conditions, the infinite sum converges.

While the line between these two formulations of return is quite stark in RL formalism, deep RL practice tends to blur the line a fair bit: for instance, we frequently set up algorithms to optimize the undiscounted return, but use discount factors in estimating value functions.
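
As a small, hedged illustration of the two return computations for a single trajectory's list of rewards:

```python
# Finite-horizon undiscounted vs. infinite-horizon discounted return (sketch).
import numpy as np

def undiscounted_return(rews):
    # R(tau) = sum_t r_t over the fixed window of collected rewards
    return float(np.sum(rews))

def discounted_return(rews, gamma=0.99):
    # R(tau) = sum_t gamma^t * r_t
    return float(sum((gamma ** t) * r for t, r in enumerate(rews)))
```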

(6) The RL Problem

Whatever the choice of return measure (whether infinite-horizon discounted, or finite-horizon undiscounted), and whatever the choice of policy, the goal in RL is to select a policy which maximizes expected return when the agent acts according to it.

To talk about expected return, we first have to talk about probability distributions over trajectories.

Let's suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a $T$-step trajectory is:

$$P(\tau|\pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t).$$

The expected return (for whichever measure), denoted by $J(\pi)$, is then:

$$J(\pi) = \int_{\tau} P(\tau|\pi) R(\tau) = E_{\tau \sim \pi}\left[ R(\tau) \right].$$

The central optimization problem in RL can then be expressed by

$$\pi^* = \arg\max_{\pi} J(\pi),$$

with $\pi^*$ being the optimal policy.

(7) Value Functions

It’s often useful to know the value of a state, or state-action pair. By value, we mean the expected return if you start in that state or state-action pair, and then act according to a particular policy forever after. Value functions are used, one way or another, in almost every RL algorithm.

There are four main functions of note here.

The On-Policy Value Function, $V^{\pi}(s)$, which gives the expected return if you start in state $s$ and always act according to policy $\pi$:

$$V^{\pi}(s) = E_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]$$

The On-Policy Action-Value Function, $Q^{\pi}(s,a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$ (which may not have come from the policy), and then forever after act according to policy $\pi$:

$$Q^{\pi}(s,a) = E_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]$$

The Optimal Value Function, $V^*(s)$, which gives the expected return if you start in state $s$ and always act according to the optimal policy in the environment:

$$V^*(s) = \max_{\pi} E_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]$$

The Optimal Action-Value Function, $Q^*(s,a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act according to the optimal policy in the environment:

$$Q^*(s,a) = \max_{\pi} E_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]$$

There are two key connections between the value function and the action-value function that come up pretty often:

$$V^{\pi}(s) = E_{a \sim \pi}\left[ Q^{\pi}(s,a) \right],$$

$$V^*(s) = \max_{a} Q^*(s,a).$$

The Optimal Q-Function and the Optimal Action

There is an important connection between the optimal action-value function $Q^*(s,a)$ and the action selected by the optimal policy. By definition, $Q^*(s,a)$ gives the expected return for starting in state $s$, taking (arbitrary) action $a$, and then acting according to the optimal policy forever after.

The optimal policy in $s$ will select whichever action maximizes the expected return from starting in $s$. As a result, if we have $Q^*$, we can directly obtain the optimal action, $a^*(s)$, via

$$a^*(s) = \arg\max_{a} Q^*(s,a).$$

Note: there may be multiple actions which maximize $Q^*(s,a)$, in which case all of them are optimal, and the optimal policy may randomly select any of them. But there is always an optimal policy which deterministically selects an action.
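
As a small sketch of this idea in code: if a network q_net (an assumption for illustration) approximates $Q^*(s,a)$ for a discrete action space by outputting one value per action, the optimal action is just the argmax over that output:

```python
# Greedy action selection from an approximate optimal Q-function (sketch; q_net and
# obs are assumed to exist).
import torch

with torch.no_grad():
    q_values = q_net(torch.as_tensor(obs, dtype=torch.float32))  # one value per action
    best_action = torch.argmax(q_values).item()                  # a*(s) = argmax_a Q*(s, a)
```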

Bellman Equations

All four of the value functions obey special self-consistency equations called Bellman equations. The basic idea behind the Bellman equations is this:

The value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.
The Bellman equations for the on-policy value functions are

$$V^{\pi}(s) = E_{a \sim \pi,\, s' \sim P}\left[ r(s,a) + \gamma V^{\pi}(s') \right],$$

$$Q^{\pi}(s,a) = E_{s' \sim P}\left[ r(s,a) + \gamma E_{a' \sim \pi}\left[ Q^{\pi}(s',a') \right] \right],$$

where $s' \sim P$ is shorthand for $s' \sim P(\cdot|s,a)$, indicating that the next state $s'$ is sampled from the environment's transition rules; $a \sim \pi$ is shorthand for $a \sim \pi(\cdot|s)$; and $a' \sim \pi$ is shorthand for $a' \sim \pi(\cdot|s')$.

The Bellman equations for the optimal value functions are

$$V^*(s) = \max_{a} E_{s' \sim P}\left[ r(s,a) + \gamma V^*(s') \right],$$

$$Q^*(s,a) = E_{s' \sim P}\left[ r(s,a) + \gamma \max_{a'} Q^*(s',a') \right].$$

The crucial difference between the Bellman equations for the on-policy value functions and for the optimal value functions is the absence or presence of the $\max$ over actions. Its inclusion reflects the fact that whenever the agent gets to choose its action, in order to act optimally, it has to pick whichever action leads to the highest value.

The term “Bellman backup” comes up quite frequently in the RL literature. The Bellman backup for a state, or state-action pair, is the right-hand side of the Bellman equation: the reward-plus-next-value.
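
To make the backup concrete, here is a minimal sketch of a single optimal-value Bellman backup in a tabular setting, assuming known transition probabilities P[s][a] (a dict mapping next states to probabilities) and a reward function r(s, a); these structures are assumptions for illustration:

```python
# One Bellman backup for V*(s) in a tabular MDP (sketch).
def bellman_backup_optimal(V, P, r, s, actions, gamma=0.99):
    # V*(s) = max_a E_{s'~P}[ r(s, a) + gamma * V*(s') ]
    backups = []
    for a in actions:
        expected = sum(prob * (r(s, a) + gamma * V[s_next])
                       for s_next, prob in P[s][a].items())
        backups.append(expected)
    return max(backups)
```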

Advantage Functions

Sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.

The advantage function $A^{\pi}(s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after. Mathematically, the advantage function is defined by

$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).$$

We’ll discuss this more later, but the advantage function is crucially important to policy gradient methods.

Policy Gradient

In this section, we’ll discuss the mathematical foundations of policy optimization algorithms, and connect the material to sample code. We will cover three key results in the theory of policy gradients: the simplest equation describing the gradient of policy performance with respect to policy parameters, a rule which allows us to drop useless terms from that expression, and a rule which allows us to add useful terms to that expression.

In the end, we’ll tie those results together and describe the advantage-based expression for the policy gradient—the version we use in our Vanilla Policy Gradient implementation.

Deriving the Simplest Policy Gradient

Here, we consider the case of a stochastic, parameterized policy, $\pi_{\theta}$. We aim to maximize the expected return $J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[ R(\tau) \right]$. For the purposes of this derivation, we'll take $R(\tau)$ to give the finite-horizon undiscounted return, but the derivation for the infinite-horizon discounted return setting is almost identical.

We would like to optimize the policy by gradient ascent, eg,

$$\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta}) \big|_{\theta_k}.$$

The gradient of policy performance, $\nabla_{\theta} J(\pi_{\theta})$, is called the policy gradient, and algorithms that optimize the policy this way are called policy gradient algorithms.

To actually use this algorithm, we need an expression for the policy gradient which we can numerically compute. This involves two steps: 1) deriving the analytical gradient of policy performance, which turns out to have the form of an expected value, and then 2) forming a sample estimate of that expected value, which can be computed with data from a finite number of agent-environment interaction steps.

In this subsection, we’ll find the simplest form of that expression. In later subsections, we’ll show how to improve on the simplest form to get the version we actually use in standard policy gradient implementations.

We’ll begin by laying out a few facts which are useful for deriving the analytical gradient.

1. Probability of a Trajectory. The probability of a trajectory $\tau = (s_0, a_0, ..., s_{T+1})$, given that actions come from $\pi_{\theta}$, is
   $$P(\tau|\theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1}|s_t, a_t)\, \pi_{\theta}(a_t|s_t).$$

2. The Log-Derivative Trick. The log-derivative trick is based on a simple rule from calculus: the derivative of $\log x$ with respect to $x$ is $1/x$. When rearranged and combined with the chain rule, we get:
   $$\nabla_{\theta} P(\tau|\theta) = P(\tau|\theta)\, \nabla_{\theta} \log P(\tau|\theta).$$

3. Log-Probability of a Trajectory. The log-prob of a trajectory is just
   $$\log P(\tau|\theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \Big( \log P(s_{t+1}|s_t, a_t) + \log \pi_{\theta}(a_t|s_t) \Big).$$

4. Gradients of Environment Functions. The environment has no dependence on $\theta$, so gradients of $\rho_0(s_0)$, $P(s_{t+1}|s_t, a_t)$, and $R(\tau)$ are zero.

5. Grad-Log-Prob of a Trajectory. The gradient of the log-prob of a trajectory is thus
   $$\nabla_{\theta} \log P(\tau|\theta) = \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t).$$

Putting it all together, we derive the following:

Derivation for Basic Policy Gradient

$$
\begin{aligned}
\nabla_{\theta} J(\pi_{\theta}) &= \nabla_{\theta} E_{\tau \sim \pi_{\theta}}\left[ R(\tau) \right] \\
&= \nabla_{\theta} \int_{\tau} P(\tau|\theta) R(\tau) && \text{Expand expectation} \\
&= \int_{\tau} \nabla_{\theta} P(\tau|\theta) R(\tau) && \text{Bring gradient under integral} \\
&= \int_{\tau} P(\tau|\theta)\, \nabla_{\theta} \log P(\tau|\theta)\, R(\tau) && \text{Log-derivative trick} \\
&= E_{\tau \sim \pi_{\theta}}\left[ \nabla_{\theta} \log P(\tau|\theta)\, R(\tau) \right] && \text{Return to expectation form} \\
\therefore \nabla_{\theta} J(\pi_{\theta}) &= E_{\tau \sim \pi_{\theta}}\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, R(\tau) \right] && \text{Expression for grad-log-prob}
\end{aligned}
$$

This is an expectation, which means that we can estimate it with a sample mean. If we collect a set of trajectories $\mathcal{D} = \{\tau_i\}_{i=1,...,N}$, where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_{\theta}$, the policy gradient can be estimated with

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, R(\tau),$$

where $|\mathcal{D}|$ is the number of trajectories in $\mathcal{D}$ (here, $N$).

This last expression is the simplest version of the computable expression we desired. Assuming that we have represented our policy in a way which allows us to calculate $\nabla_{\theta} \log \pi_{\theta}(a|s)$, and if we are able to run the policy in the environment to collect the trajectory dataset, we can compute the policy gradient and take an update step.

Implementing the Simplest Policy Gradient

We give a short PyTorch implementation of this simple version of the policy gradient algorithm. (This section was previously written with a TensorFlow example; the old TensorFlow section can be found here.)

1. Making the Policy Network.

# make core of policy network
logits_net = mlp(sizes=[obs_dim]+hidden_sizes+[n_acts])

# make function to compute action distribution
def get_policy(obs):
​   logits = logits_net(obs)
​   return Categorical(logits=logits)

# make action selection function (outputs int actions, sampled from policy)
def get_action(obs):
​   return get_policy(obs).sample().item()

This block builds modules and functions for using a feedforward neural network categorical policy. (See the Stochastic Policies section in Part 1 for a refresher.) The output from the logits_net module can be used to construct log-probabilities and probabilities for actions, and the get_action function samples actions based on probabilities computed from the logits. (Note: this particular get_action function assumes that there will only be one obs provided, and therefore only one integer action output. That’s why it uses .item(), which is used to get the contents of a Tensor with only one element.)

A lot of work in this example is getting done by the Categorical object constructed in get_policy. This is a PyTorch Distribution object that wraps up some mathematical functions associated with probability distributions. In particular, it has a method for sampling from the distribution and a method for computing log probabilities of given samples. Since PyTorch distributions are really useful for RL, check out their documentation to get a feel for how they work.

Friendly reminder! When we talk about a categorical distribution having "logits," what we mean is that the probabilities for each outcome are given by the softmax function of the logits. That is, the probability for action $j$ under a categorical distribution with logits $x_j$ is:

$$p_j = \frac{\exp(x_j)}{\sum_i \exp(x_i)}.$$
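
A tiny sketch of that conversion in PyTorch:

```python
# Softmax turns logits into categorical probabilities.
import torch

logits = torch.tensor([1.0, 2.0, 0.5])
probs = torch.softmax(logits, dim=-1)   # p_j = exp(x_j) / sum_i exp(x_i)
```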

2. Making the Loss Function.

# make loss function whose gradient, for the right data, is policy gradient
def compute_loss(obs, act, weights):
    logp = get_policy(obs).log_prob(act)
    return -(logp * weights).mean()

In this block, we build a “loss” function for the policy gradient algorithm. When the right data is plugged in, the gradient of this loss is equal to the policy gradient. The right data means a set of (state, action, weight) tuples collected while acting according to the current policy, where the weight for a state-action pair is the return from the episode to which it belongs. (Although as we will show in later subsections, there are other values you can plug in for the weight which also work correctly.)

Even though we describe this as a loss function, it is not a loss function in the typical sense from supervised learning. There are two main differences from standard loss functions.

  1. The data distribution depends on the parameters. A loss function is usually defined on a fixed data distribution which is independent of the parameters we aim to optimize. Not so here, where the data must be sampled on the most recent policy.

  2. It doesn't measure performance. A loss function usually evaluates the performance metric that we care about. Here, we care about expected return, $J(\pi_{\theta})$, but our "loss" function does not approximate this at all, even in expectation. This "loss" function is only useful to us because, when evaluated at the current parameters, with data generated by the current parameters, it has the negative gradient of performance.

But after that first step of gradient descent, there is no more connection to performance. This means that minimizing this "loss" function, for a given batch of data, has no guarantee whatsoever of improving expected return. You can send this loss to $-\infty$ and policy performance could crater; in fact, it usually will. Sometimes a deep RL researcher might describe this outcome as the policy "overfitting" to a batch of data. This is descriptive, but should not be taken literally because it does not refer to generalization error.

We raise this point because it is common for ML practitioners to interpret a loss function as a useful signal during training: "if the loss goes down, all is well." In policy gradients, this intuition is wrong, and you should only care about average return. The loss function means nothing.

The approach used here to make the logp tensor (calling the log_prob method of a PyTorch Categorical object) may require some modification to work with other kinds of distribution objects.

For example, if you are using a Normal distribution (for a diagonal Gaussian policy), the output from calling policy.log_prob(act) will give you a Tensor containing separate log probabilities for each component of each vector-valued action. That is to say, you put in a Tensor of shape (batch, act_dim), and get out a Tensor of shape (batch, act_dim), when what you need for making an RL loss is a Tensor of shape (batch,). In that case, you would sum up the log probabilities of the action components to get the log probabilities of the actions. That is, you would compute:

logp = get_policy(obs).log_prob(act).sum(axis=-1)

3. Running One Epoch of Training.

# for training policy
def train_one_epoch():
    # make some empty lists for logging.
    batch_obs = []          # for observations
    batch_acts = []         # for actions
    batch_weights = []      # for R(tau) weighting in policy gradient
    batch_rets = []         # for measuring episode returns
    batch_lens = []         # for measuring episode lengths

    # reset episode-specific variables
    obs = env.reset()       # first obs comes from starting distribution
    done = False            # signal from environment that episode is over
    ep_rews = []            # list for rewards accrued throughout ep

    # render first episode of each epoch
    finished_rendering_this_epoch = False

    # collect experience by acting in the environment with current policy
    while True:

        # rendering
        if (not finished_rendering_this_epoch) and render:
            env.render()

        # save obs
        batch_obs.append(obs.copy())

        # act in the environment
        act = get_action(torch.as_tensor(obs, dtype=torch.float32))
        obs, rew, done, _ = env.step(act)

        # save action, reward
        batch_acts.append(act)
        ep_rews.append(rew)

        if done:
            # if episode is over, record info about episode
            ep_ret, ep_len = sum(ep_rews), len(ep_rews)
            batch_rets.append(ep_ret)
            batch_lens.append(ep_len)

            # the weight for each logprob(a|s) is R(tau)
            batch_weights += [ep_ret] * ep_len

            # reset episode-specific variables
            obs, done, ep_rews = env.reset(), False, []

            # won't render again this epoch
            finished_rendering_this_epoch = True

            # end experience loop if we have enough of it
            if len(batch_obs) > batch_size:
                break

    # take a single policy gradient update step
    optimizer.zero_grad()
    batch_loss = compute_loss(obs=torch.as_tensor(batch_obs, dtype=torch.float32),
                              act=torch.as_tensor(batch_acts, dtype=torch.int32),
                              weights=torch.as_tensor(batch_weights, dtype=torch.float32)
                              )
    batch_loss.backward()
    optimizer.step()
    return batch_loss, batch_rets, batch_lens

Baselines in Policy Gradients

An immediate consequence of the EGLP lemma is that for any function $b$ which only depends on state,

$$E_{a_t \sim \pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, b(s_t) \right] = 0.$$

This allows us to add or subtract any number of terms like this from our expression for the policy gradient, without changing it in expectation:

$$\nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) \left( \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t) \right) \right].$$

Any function $b$ used in this way is called a baseline.

The most common choice of baseline is the on-policy value function $V^{\pi}(s_t)$. Recall that this is the average return an agent gets if it starts in state $s_t$ and then acts according to policy $\pi$ for the rest of its life.

Empirically, the choice $b(s_t) = V^{\pi}(s_t)$ has the desirable effect of reducing variance in the sample estimate for the policy gradient. This results in faster and more stable policy learning. It is also appealing from a conceptual angle: it encodes the intuition that if an agent gets what it expected, it should "feel" neutral about it.

In practice, $V^{\pi}(s_t)$ cannot be computed exactly, so it has to be approximated. This is usually done with a neural network, $V_{\phi}(s_t)$, which is updated concurrently with the policy (so that the value network always approximates the value function of the most recent policy).

The simplest method for learning $V_{\phi}$, used in most implementations of policy optimization algorithms (including VPG, TRPO, PPO, and A2C), is to minimize a mean-squared-error objective:

$$\phi_k = \arg\min_{\phi} E_{s_t, \hat{R}_t \sim \pi_k}\left[ \left( V_{\phi}(s_t) - \hat{R}_t \right)^2 \right],$$

where $\pi_k$ is the policy at epoch $k$. This is done with one or more steps of gradient descent, starting from the previous value parameters $\phi_{k-1}$.
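
A minimal sketch of that regression step, assuming a value network v_net, an optimizer v_optimizer, and tensors batch_obs and batch_rets collected under the current policy (all hypothetical names for illustration):

```python
# Fit the value network to observed returns with a mean-squared-error objective (sketch).
def update_value_function(v_net, v_optimizer, batch_obs, batch_rets, n_iters=80):
    for _ in range(n_iters):
        v_optimizer.zero_grad()
        v_loss = ((v_net(batch_obs).squeeze(-1) - batch_rets) ** 2).mean()  # MSE to return targets
        v_loss.backward()
        v_optimizer.step()
```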

Other Forms of the Policy Gradient

What we have seen so far is that the policy gradient has the general form

$$\nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, \Phi_t \right],$$

where $\Phi_t$ could be any of

$$\Phi_t = R(\tau),$$

or

$$\Phi_t = \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}),$$

or, with the baseline,

$$\Phi_t = \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t).$$

All of these choices lead to the same expected value for the policy gradient, despite having different variances. It turns out that there are two more valid choices of weights $\Phi_t$ which are important to know.
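
As a small sketch, the second choice of $\Phi_t$ (the reward-to-go) can be computed from a trajectory's rewards like this:

```python
# Reward-to-go weights: for each timestep, sum only the rewards that come after it (sketch).
import numpy as np

def reward_to_go(rews):
    n = len(rews)
    rtgs = np.zeros(n)
    for i in reversed(range(n)):
        rtgs[i] = rews[i] + (rtgs[i + 1] if i + 1 < n else 0)
    return rtgs
```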

Q-learning

So how can we learn the Q-value? One of the most popular methods is Q-learning, which repeats the following steps:
(1) We sample an action.
(2) We observe the reward and the next state.
(3) We take the action with the highest Q-value.

Then we apply dynamic programming to compute the Q-value function iteratively, using the standard Q-learning update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right).$$

Here is the algorithm for Q-learning with function fitting. Step 2 below reduces the variance by using temporal-difference (TD) targets. This also improves sample efficiency compared with the Monte Carlo method, which takes samples until the end of the episode.

Pseudocode
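
As a rough, hedged sketch (not the exact algorithm referenced above), an online Q-learning loop with a function approximator might look like this; q_net, env, optimizer, and n_actions are assumed names for illustration:

```python
# Online Q-learning with a Q-network and epsilon-greedy exploration (sketch).
import random
import torch

gamma, epsilon = 0.99, 0.1

def q_learning_step(obs):
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    # 1) sample an action with an epsilon-greedy exploration policy
    if random.random() < epsilon:
        act = random.randrange(n_actions)
    else:
        act = q_net(obs_t).argmax().item()
    next_obs, rew, done, _ = env.step(act)
    # 2) form the temporal-difference target r + gamma * max_a' Q(s', a')
    with torch.no_grad():
        next_t = torch.as_tensor(next_obs, dtype=torch.float32)
        target = rew + (0.0 if done else gamma * q_net(next_t).max().item())
    # 3) fit Q(s, a) toward the target with one gradient step
    optimizer.zero_grad()
    loss = (q_net(obs_t)[act] - target) ** 2
    loss.backward()
    optimizer.step()
    return next_obs, done
```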

Exploration is very important in RL. Without exploration, you will never know what is better ahead; but if it is overdone, we are wasting time. In Q-learning, we use an exploration policy, like epsilon-greedy, to select the action taken in step 1: we pick the action with the highest $Q$ value, but still allow a small chance of selecting other, random actions. $Q$ is initialized to zero, so no specific action stands out in early training. As training progresses, more promising actions are selected and training shifts from exploration to exploitation.

Deep Q-Network (DQN)

Q-learning is unfortunately not very stable when combined with deep learning. In this section, we will finally put everything together and introduce DQN, which beat humans at some Atari games using only the image frames.

DQN is the poster child for Q-learning with a deep network approximating $Q$. We use supervised learning to fit the Q-value function, hoping to duplicate the success of supervised learning, but RL is different. In supervised deep learning, we randomize the input samples, so the input classes are fairly balanced and stable across training batches. In RL, we search better as we explore more, so the input space and the actions we search are constantly changing. In addition, as we learn more, we update the target Q-values. That is bad news: both the inputs and the targets are changing frequently.

This makes it very hard to learn the Q-value approximator. DQN introduces experience replay and a target network to slow down the changes so we can learn $Q$ gradually. Experience replay stores the last million state-action-reward transitions in a replay buffer. We train $Q$ with batches of random samples from this buffer, so the training samples are randomized and behave closer to those in supervised deep learning.

In addition, we keep two networks for storing the Q-values. One is constantly updated, while the second one, the target network, is synchronized from the first network once in a while. We use the target network to retrieve the Q-value, so that the changes in the target value are less volatile. For those interested, the objective is the squared temporal-difference error

$$L(\theta) = E_{(s,a,r,s') \sim \mathcal{D}}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^2 \right],$$

where $\mathcal{D}$ is the replay buffer and $\theta^{-}$ are the parameters of the target network.

To make training samples independent of each other, we store the latest samples in a replay buffer and randomize them for training. This makes samples less correlated within a training batch. To avoid a moving target, we also delay the Q-value network updates so the target does not change constantly. This topic deserves a whole article of its own, so here we only sketch the idea for reference.
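
A hedged sketch of the DQN loss with a replay buffer and target network might look like the following; q_net, target_net, and a buffer.sample(batch_size) method returning tensors (s, a, r, s2, done) are assumptions for illustration:

```python
# DQN-style loss: squared TD error against a frozen target network (sketch).
import torch

def dqn_loss(q_net, target_net, buffer, batch_size=32, gamma=0.99):
    s, a, r, s2, done = buffer.sample(batch_size)                  # done is a 0/1 float tensor
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():
        q_next = target_net(s2).max(dim=1).values                  # max_a' Q(s', a'; theta^-)
        target = r + gamma * (1.0 - done) * q_next
    return ((q - target) ** 2).mean()                              # squared TD error
```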