
POLICY GRADIENT THEOREM

Author: Raj Ghugare

tags: notes Policy-gradients

Review and research notes

Introduction

  • Policy gradient methods are directly aligned with the goal of RL problems, i.e., finding a policy that obtains the highest expected return.

  • The main idea behind policy gradient methods is to parameterize a policy in different ways and then optimize those parameters to obtain the highest expected return.

  • By parameterizing, we are essentially defining the space of all possible policies in which we want to find the best one. Common choices of parameterization are softmax, Gaussian, or neural-network policies, and often a combination of them; a short sketch of one such parameterization follows after this list.

  • This blog is mostly a detailed explanation of section 13.2 of Sutton and Barto. It would be a good idea to read that part of the book before reading this article.

  • The performance measure used to optimize the parameters of the current policy $\pi_\theta$ is the expected return from the starting state $s_0$ (we assume there is a single starting state):

    $$J(\theta) = V_{\pi_\theta}(s_0)$$

    The value of state $s_0$ depends on our policy as well as on the environment's dynamics, so finding the gradient of $J(\theta)$ is a challenge. This is where the policy gradient theorem comes in.
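
To make the idea of parameterization concrete, here is a minimal sketch of a tabular softmax policy in Python (the state/action counts and the parameter layout are illustrative assumptions, not anything specific to this post):

```python
# A minimal sketch of one common parameterization: a tabular softmax policy.
# The state/action counts and the parameter layout are illustrative assumptions.
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # one preference per (state, action) pair

def pi(theta, s):
    """pi_theta(.|s): a softmax over the action preferences of state s."""
    prefs = theta[s] - theta[s].max()     # subtract the max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

print(pi(theta, 0))   # with theta = 0 this is the uniform distribution [0.5, 0.5]
```

A Gaussian policy for continuous actions would instead parameterize the mean (and possibly the variance) of a distribution over actions, often with a neural network.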

NOTE:

Throughout the next section we won't consider discounting. The policy gradient theorem remains unchanged even with discounting (fairly easy to prove once you understand the undiscounted derivation).

Policy Gradient Theorem

$$\nabla V_\pi(s) = \nabla\Big[\sum_a \pi(a|s)\, q_\pi(s,a)\Big]$$

The above equation holds for all states $s \in S$. Using the product rule of calculus,

$$\nabla V_\pi(s) = \sum_a \big[\nabla\pi(a|s)\, q_\pi(s,a) + \pi(a|s)\, \nabla q_\pi(s,a)\big]$$

$$\nabla V_\pi(s) = \sum_a \Big[\nabla\pi(a|s)\, q_\pi(s,a) + \pi(a|s)\, \nabla \sum_{s',r} p(s',r|s,a)\,\big(r + V_\pi(s')\big)\Big]$$

Since the reward $r$ is independent of $\theta$, and $\sum_r p(s',r|s,a) = p(s'|s,a)$,

$$\nabla V_\pi(s) = \sum_a \Big[\nabla\pi(a|s)\, q_\pi(s,a) + \pi(a|s) \sum_{s'} p(s'|s,a)\, \nabla V_\pi(s')\Big]$$

We can continue to expand $\nabla V_\pi(s')$ in the same way, but if we keep rolling out and expanding this equation for the successive states $s', s'', \ldots$ we will get an infinite series.

$$\nabla V_\pi(s) = \sum_a \Big[\nabla\pi(a|s)\, q_\pi(s,a) + \pi(a|s) \sum_{s'} p(s'|s,a) \sum_{a'} \big[\nabla\pi(a'|s')\, q_\pi(s',a') + \pi(a'|s') \sum_{s''} p(s''|s',a')\, \nabla V_\pi(s'')\big]\Big]$$

As this sum rolls out it becomes difficult to make sense of all the gradients, but Sutton and Barto give another way to look at it:

NOTE:
Wherever I write $\pi$ it actually means $\pi_\theta$, i.e. the current policy.

$\Pr(s \to x, k, \pi)$ is the probability of reaching state $x$, starting from state $s$, in $k$ steps while following policy $\pi$. This term is defined explicitly as a tool that will help us in the derivation.

If $k = 0$, then

$$\Pr(s \to s, 0, \pi) = 1$$

since there is no way to reach another state in 0 time-steps; we will always remain in the same state.

If $k = 1$, then

$$\Pr(s \to x, 1, \pi) = \sum_a \pi(a|s)\, p(x|s,a)$$

This is just the sum, over all actions, of the probability of taking that action times the probability that the environment dynamics then move us from $s$ to $x$ in one step. A small numeric sketch of these quantities follows below.
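
Here is a small numeric sketch of $\Pr(s \to x, k, \pi)$, assuming a made-up two-state MDP and policy; the only point is that the $k$-step reach probabilities are powers of the one-step matrix induced by $\pi$:

```python
# A small numeric sketch of Pr(s -> x, k, pi). The MDP (P) and policy (pi_probs)
# below are made up purely for illustration.
import numpy as np

# p(x | s, a): shape (n_states, n_actions, n_states)
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1 under actions 0 and 1
])
pi_probs = np.array([[0.7, 0.3],    # pi(a | s=0)
                     [0.4, 0.6]])   # pi(a | s=1)

# One-step matrix under pi: P_pi[s, x] = sum_a pi(a|s) p(x|s,a)
P_pi = np.einsum('sa,sax->sx', pi_probs, P)

def reach_prob(k):
    """Matrix whose (s, x) entry is Pr(s -> x, k, pi)."""
    return np.linalg.matrix_power(P_pi, k)   # k = 0 gives the identity matrix

print(reach_prob(0))   # identity: Pr(s -> s, 0, pi) = 1
print(reach_prob(1))   # equals P_pi, i.e. sum_a pi(a|s) p(x|s,a)
print(reach_prob(3))   # three-step reach probabilities
```

The identity used further below, $\sum_{s'} \Pr(s \to s', 1, \pi)\, \Pr(s' \to s'', 1, \pi) = \Pr(s \to s'', 2, \pi)$, is exactly this matrix multiplication written out element-wise.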

Coming back to the derivation,

$$\nabla V_\pi(s) = \sum_a \Big[\nabla\pi(a|s)\, q_\pi(s,a) + \pi(a|s) \sum_{s'} p(s'|s,a)\, \nabla V_\pi(s')\Big]$$

$$\nabla V_\pi(s) = \sum_a \big[\nabla\pi(a|s)\, q_\pi(s,a)\big] + \sum_a \Big[\pi(a|s) \sum_{s'} p(s'|s,a)\, \nabla V_\pi(s')\Big]$$

Re-arranging the summations for the second term

$$\nabla V_\pi(s) = \sum_a \big[\nabla\pi(a|s)\, q_\pi(s,a)\big] + \sum_{s'} \Big[\nabla V_\pi(s') \sum_a p(s'|s,a)\, \pi(a|s)\Big]$$

We just saw that $\sum_a p(s'|s,a)\, \pi(a|s) = \Pr(s \to s', 1, \pi)$. Substituting this into our equation,

$$\nabla V_\pi(s) = \sum_a \big[\nabla\pi(a|s)\, q_\pi(s,a)\big] + \sum_{s'} \big[\nabla V_\pi(s')\, \Pr(s \to s', 1, \pi)\big]$$

We will use this recursive equation again and again, so have a good look at it and try to make yourself comfortable with it.

Using the recursive equation for $\nabla V_\pi(s')$,

$$\nabla V_\pi(s) = \sum_a \nabla\pi(a|s)\, q_\pi(s,a) + \sum_{s'} \Pr(s \to s', 1, \pi) \Big[\sum_{a'} \nabla\pi(a'|s')\, q_\pi(s',a') + \sum_{s''} \nabla V_\pi(s'')\, \Pr(s' \to s'', 1, \pi)\Big]$$

$$\nabla V_\pi(s) = \sum_a \nabla\pi(a|s)\, q_\pi(s,a) + \sum_{s'} \Pr(s \to s', 1, \pi) \Big[\sum_{a'} \nabla\pi(a'|s')\, q_\pi(s',a')\Big] + \sum_{s'} \Pr(s \to s', 1, \pi) \Big[\sum_{s''} \nabla V_\pi(s'')\, \Pr(s' \to s'', 1, \pi)\Big]$$

Marginalizing over the intermediate state $s'$,

$$\sum_{s'} \Pr(s \to s', 1, \pi)\, \Pr(s' \to s'', 1, \pi) = \Pr(s \to s'', 2, \pi)$$

We can substitute this in our last equation

$$= \sum_a \big[\nabla\pi(a|s)\, q_\pi(s,a)\big] + \sum_{s'} \Pr(s \to s', 1, \pi) \Big[\sum_{a'} \nabla\pi(a'|s')\, q_\pi(s',a')\Big] + \sum_{s''} \Pr(s \to s'', 2, \pi)\, \nabla V_\pi(s'')$$

If we unroll this equation one more time, we obtain:

$$\nabla V_\pi(s) = \sum_a \big[\nabla\pi(a|s)\, q_\pi(s,a)\big] + \sum_{s'} \Pr(s \to s', 1, \pi) \Big[\sum_{a'} \nabla\pi(a'|s')\, q_\pi(s',a')\Big] + \sum_{s''} \Pr(s \to s'', 2, \pi) \sum_{a''} \nabla\pi(a''|s'')\, q_\pi(s'',a'') + \sum_{s'''} \Pr(s \to s''', 3, \pi)\, \nabla V_\pi(s''')$$

If we keep on unrolling in this fashion and re-arrange the summation,

$$\nabla V_\pi(s) = \sum_{x \in S} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a|x)\, q_\pi(x,a)$$

Hopefully this makes the derivation easier to follow for anyone who couldn't work through it from the book alone.
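
If you want to convince yourself numerically, the following sketch checks the rolled-out identity on a tiny made-up episodic MDP with a tabular softmax policy (no discounting, one absorbing terminal state); every detail of the MDP and the policy here is an illustrative assumption, not part of the derivation:

```python
# A numeric sanity check of the rolled-out identity
#   grad V_pi(s0) = sum_x sum_k Pr(s0 -> x, k, pi) sum_a grad pi(a|x) q_pi(x, a)
# on a tiny made-up episodic MDP with a tabular softmax policy (no discounting).
import numpy as np

nS, nA = 2, 2                      # non-terminal states; state index 2 is terminal
P = np.zeros((nS, nA, nS + 1))     # p(x | s, a), last column = terminal state
P[0, 0] = [0.6, 0.4, 0.0]
P[0, 1] = [0.0, 0.8, 0.2]
P[1, 0] = [0.1, 0.0, 0.9]
P[1, 1] = [0.0, 0.3, 0.7]
R = np.zeros((nS, nA, nS + 1))     # r(s, a, x)
R[0, 1, 2], R[1, 0, 2], R[1, 1, 1] = 1.0, 2.0, -0.5

def policy(theta):                 # tabular softmax: pi(a|s) = softmax(theta[s])[a]
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def value(theta):
    """Exact V_pi on non-terminal states: V = (I - P_pi)^-1 r_pi."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sax->sx', pi, P)[:, :nS]          # non-terminal next states
    r_pi = np.einsum('sa,sax->s', pi, P * R)               # expected one-step reward
    return np.linalg.solve(np.eye(nS) - P_pi, r_pi)

theta = np.array([[0.3, -0.2], [0.1, 0.5]])
pi, V = policy(theta), value(theta)
q = np.einsum('sax,sax->sa', P, R) + P[:, :, :nS] @ V      # q(s,a) = E[r + V(x)]

P_pi = np.einsum('sa,sax->sx', pi, P)[:, :nS]
eta = np.linalg.solve((np.eye(nS) - P_pi).T, np.eye(nS)[0])  # sum_k Pr(s0 -> x, k, pi)

# Right-hand side: sum_x eta(x) sum_a q(x,a) grad_theta pi(a|x)
rhs = np.zeros_like(theta)
for x in range(nS):
    dpi = np.diag(pi[x]) - np.outer(pi[x], pi[x])          # d pi(a|x) / d theta[x, b]
    rhs[x] = eta[x] * (q[x] @ dpi)

# Left-hand side: finite-difference gradient of V_pi(s0) w.r.t. theta
lhs, eps = np.zeros_like(theta), 1e-6
for i in range(nS):
    for j in range(nA):
        tp, tm = theta.copy(), theta.copy()
        tp[i, j] += eps
        tm[i, j] -= eps
        lhs[i, j] = (value(tp)[0] - value(tm)[0]) / (2 * eps)

print(np.allclose(lhs, rhs, atol=1e-5))                    # True
```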

We know that

$$\nabla J(\theta) = \nabla V_\pi(s_0)$$

$$= \sum_s \Big(\sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)\Big) \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$

Let

$$\eta(s) = \sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)$$

$$\nabla J(\theta) = \sum_s \eta(s) \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$

$\eta(s)$ is a measure of how often state $s$ is reached from $s_0$, in whatever way possible. Think about why $\eta(s)$ isn't a probability distribution over the state space. Hint: think of a simple MDP with only two states where the value of $\eta(s)$ would exceed 1 (a numeric sketch of exactly this appears a little further below). We normalize by $\sum_{s'} \eta(s')$ to obtain a probability distribution $\mu(s)$:

$$\nabla J(\theta) = \sum_{s'} \eta(s') \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')} \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$

Let $\mu(s) = \dfrac{\eta(s)}{\sum_{s'} \eta(s')}$. Since $\sum_{s'} \eta(s')$ is a constant,

$$\nabla J(\theta) \propto \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')} \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$
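
Coming back to the earlier hint about why $\eta$ is not a probability distribution: $\eta(s)$ is an expected visit count and can exceed 1. A minimal numeric sketch, assuming a made-up two-state episodic chain:

```python
# A minimal sketch for the earlier hint: eta is an expected visit count, not a
# probability. The chain below is made up: from state 0 we stay with prob. 0.5
# or move to state 1, and state 1 always terminates.
import numpy as np

P_pi = np.array([[0.5, 0.5],    # one-step matrix over non-terminal states under pi
                 [0.0, 0.0]])   # state 1 goes to the terminal state w.p. 1

# eta(s) = sum_k Pr(s0 -> s, k, pi) = row s0 of (I - P_pi)^-1
eta = np.linalg.inv(np.eye(2) - P_pi)[0]
print(eta)               # [2. 1.]  -> eta(0) = 2 > 1, so eta is not a distribution
mu = eta / eta.sum()
print(mu)                # [0.667 0.333] -> the normalized on-policy distribution
```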

We can work with a quantity that is merely proportional to the gradient, because the constant of proportionality can be absorbed into the learning rate, which is arbitrary. Hence,

$$\nabla J(\theta) = \sum_s \mu(s) \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$

$\mu(s)$ is the probability of being in state $s$ when starting from $s_0$ and following policy $\pi$, i.e. the on-policy state distribution. Hence $\nabla J(\theta)$ can be expressed as an expectation:

$$\nabla J(\theta) = \mathbb{E}_{\pi}\Big[\sum_a \nabla\pi(a|s)\, q_\pi(s,a)\Big], \qquad s \sim \mu$$

The policy gradient theorem thus gives us the gradient of our performance measure in terms of quantities that can be sampled from Monte-Carlo rollouts; a sample-based estimator built directly on this expression is sketched below.
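
For instance, the REINFORCE estimator follows from the expectation above by visiting states through rollouts of $\pi_\theta$, sampling $a \sim \pi_\theta(\cdot|s)$ together with $\nabla\pi = \pi\,\nabla\log\pi$, and replacing $q_\pi$ with the Monte-Carlo return. A minimal sketch, assuming a hypothetical episodic environment object `env` with Gym-style `reset()`/`step(a)` methods and the tabular softmax policy from the earlier sketch (no discounting, as in this post):

```python
# A minimal REINFORCE-style sketch built on the theorem's expectation form:
# sample states and actions by running pi_theta, use the Monte-Carlo return G_t
# in place of q_pi, and use grad pi = pi * grad log pi. The environment object
# `env` (reset() -> s, step(a) -> (s', r, done)) is a hypothetical stand-in.
import numpy as np

def pi(theta, s):
    prefs = theta[s] - theta[s].max()
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    g = np.zeros_like(theta)
    g[s] = -pi(theta, s)                       # d/d theta[s,b] log pi(a|s) = 1[a=b] - pi(b|s)
    g[s, a] += 1.0
    return g

def reinforce_update(theta, env, alpha=0.01):
    """Run one episode with pi_theta and take one gradient ascent step."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = np.random.choice(len(theta[s]), p=pi(theta, s))
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    G, grad = 0.0, np.zeros_like(theta)
    for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
        G = r + G                              # undiscounted return from time t onward
        grad += G * grad_log_pi(theta, s, a)   # one sample of grad J(theta)
    return theta + alpha * grad
```

Averaged over many episodes, the accumulated `grad` is an unbiased sample of the expression given by the policy gradient theorem, which is exactly why the update can be computed from rollouts alone.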