Author notes
Policy Gradients
Policy gradient methods are directly aligned with the goal of an RL problem, i.e., finding a policy that obtains the highest expected return.
The main idea behind policy gradient methods is to parameterize the policy in some way and then optimize those parameters to obtain the highest expected return.
By parameterizing, we are essentially defining the space of all possible policies in which we want to find the best one. Common choices of parameterization are a softmax over action preferences, a Gaussian over continuous actions, neural networks, or often a combination of these.
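To make the idea of a parameterized policy concrete, here is a minimal sketch (not from the book or this derivation) of a softmax policy with linear action preferences. The feature vector, dimensions, and names are all illustrative assumptions.

```python
import numpy as np

def softmax_policy(theta, features):
    """Return action probabilities pi(a|s, theta) for a softmax policy.

    theta    : (num_actions, num_features) parameter matrix
    features : (num_features,) feature vector for the current state
    """
    preferences = theta @ features          # one preference score h(s, a, theta) per action
    preferences -= preferences.max()        # subtract max for numerical stability
    exp_prefs = np.exp(preferences)
    return exp_prefs / exp_prefs.sum()

# Toy usage: 3 actions, 4 state features, random parameters.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))
state_features = np.array([1.0, 0.5, -0.2, 0.3])
print(softmax_policy(theta, state_features))   # a valid probability distribution over actions
```

Changing theta changes which distribution over actions the policy outputs in each state; the set of all values of theta is the space of policies we search over.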
This blog is mostly a detailed explanation of Section 13.2 of Sutton and Barto. It would be a good idea to read that part of the book before reading this article.
The performance measure used to optimize the parameters of the current policy is the value of the start state of the episode: $J(\theta) \doteq v_{\pi_\theta}(s_0)$.
The value of a state can be written in terms of the policy and the action values:
$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)$$
Throughout the next section we won't consider discounting, but the policy gradient theorem remains unchanged even with discounting (fairly easy to prove once you understand this derivation).
The above equation holds for every state $s$.
Using the product rule of calculus (here and throughout, $\nabla$ denotes the gradient with respect to the policy parameters $\theta$):
$$\nabla v_\pi(s) = \nabla \Big[\sum_a \pi(a|s)\, q_\pi(s,a)\Big] = \sum_a \Big[\nabla\pi(a|s)\, q_\pi(s,a) + \pi(a|s)\, \nabla q_\pi(s,a)\Big]$$
Expanding $q_\pi(s,a) = \sum_{s',r} p(s',r|s,a)\,\big(r + v_\pi(s')\big)$, and since the reward ($r$) does not depend on the policy parameters $\theta$, its gradient is zero, so
$$\nabla q_\pi(s,a) = \sum_{s'} p(s'|s,a)\, \nabla v_\pi(s')$$
We can continue to expand $\nabla v_\pi(s')$ in exactly the same way:
$$\nabla v_\pi(s) = \sum_a \Big[\nabla\pi(a|s)\, q_\pi(s,a) + \pi(a|s) \sum_{s'} p(s'|s,a) \sum_{a'} \big[\nabla\pi(a'|s')\, q_\pi(s',a') + \pi(a'|s') \sum_{s''} p(s''|s',a')\, \nabla v_\pi(s'')\big]\Big]$$
As this sum rolls out it becomes difficult to make sense of all the gradients, but Sutton and Barto give a compact way to look at it:
$$\nabla v_\pi(s) = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a|x)\, q_\pi(x,a)$$
NOTE:
Wherever I have mentioned $\Pr(s \to x, k, \pi)$, it denotes the probability of going from state $s$ to state $x$ in exactly $k$ time-steps while following policy $\pi$.
If $k = 0$:
$$\Pr(s \to x, 0, \pi) = \begin{cases} 1 & \text{if } x = s \\ 0 & \text{otherwise} \end{cases}$$
This is because there is no way we can reach another state in 0 time-steps; we always remain in the same state $s$.
If $k = 1$:
$$\Pr(s \to s', 1, \pi) = \sum_a \pi(a|s)\, p(s'|s,a)$$
This is just the sum, over all actions, of the probability of taking each action times the probability that the environment dynamics then move us from $s$ to $s'$.
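One way to build intuition for these $k$-step probabilities (not part of the book's proof, just a sanity check): under a fixed policy, the one-step probabilities form a transition matrix $P_\pi$ with entries $\sum_a \pi(a|s)\, p(s'|s,a)$, and $\Pr(s \to x, k, \pi)$ is the $(s,x)$ entry of $P_\pi^k$. A small NumPy sketch with a made-up two-state, two-action MDP:

```python
import numpy as np

# Made-up MDP: 2 states, 2 actions.
# p[s, a, s'] = probability of landing in s' after taking action a in state s.
p = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
# pi[s, a] = probability of taking action a in state s.
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# One-step transition matrix under pi: P_pi[s, s'] = sum_a pi[s, a] * p[s, a, s'].
P_pi = np.einsum('sa,sax->sx', pi, p)

# Pr(s -> x, k, pi) is the (s, x) entry of the k-th matrix power of P_pi.
k = 3
P_k = np.linalg.matrix_power(P_pi, k)
print(P_k[0, 1])   # probability of being in state 1 after 3 steps, starting from state 0
```

For $k = 0$ the matrix power is the identity, which matches the case above: we stay in the same state with probability 1.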
Then, writing out both terms of the gradient explicitly,
$$\nabla v_\pi(s) = \sum_a \nabla\pi(a|s)\, q_\pi(s,a) + \sum_a \pi(a|s) \sum_{s'} p(s'|s,a)\, \nabla v_\pi(s')$$
Re-arranging the summations in the second term,
$$\sum_a \pi(a|s) \sum_{s'} p(s'|s,a)\, \nabla v_\pi(s') = \sum_{s'} \Big[\sum_a \pi(a|s)\, p(s'|s,a)\Big] \nabla v_\pi(s')$$
We just saw that the bracketed term is exactly $\Pr(s \to s', 1, \pi)$, so
$$\nabla v_\pi(s) = \sum_a \nabla\pi(a|s)\, q_\pi(s,a) + \sum_{s'} \Pr(s \to s', 1, \pi)\, \nabla v_\pi(s')$$
We will use this recursive equation again and again, so have a good look at it and make yourself comfortable with it.
Using the recursive equation for $\nabla v_\pi(s')$ and substituting it into the right-hand side,
$$\nabla v_\pi(s) = \sum_a \nabla\pi(a|s)\, q_\pi(s,a) + \sum_{s'} \Pr(s \to s', 1, \pi) \Big[\sum_{a'} \nabla\pi(a'|s')\, q_\pi(s',a') + \sum_{s''} \Pr(s' \to s'', 1, \pi)\, \nabla v_\pi(s'')\Big]$$
Using the product rule of probability (summing out the intermediate state $s'$),
$$\sum_{s'} \Pr(s \to s', 1, \pi)\, \Pr(s' \to s'', 1, \pi) = \Pr(s \to s'', 2, \pi)$$
We can substitute this in our last equation:
$$\nabla v_\pi(s) = \sum_a \nabla\pi(a|s)\, q_\pi(s,a) + \sum_{s'} \Pr(s \to s', 1, \pi) \sum_{a'} \nabla\pi(a'|s')\, q_\pi(s',a') + \sum_{s''} \Pr(s \to s'', 2, \pi)\, \nabla v_\pi(s'')$$
If we unroll this equation one more time we obtain:
$$\nabla v_\pi(s) = \sum_a \nabla\pi(a|s)\, q_\pi(s,a) + \sum_{s'} \Pr(s \to s', 1, \pi) \sum_{a'} \nabla\pi(a'|s')\, q_\pi(s',a') + \sum_{s''} \Pr(s \to s'', 2, \pi) \sum_{a''} \nabla\pi(a''|s'')\, q_\pi(s'',a'') + \sum_{s'''} \Pr(s \to s''', 3, \pi)\, \nabla v_\pi(s''')$$
If we keep unrolling in this fashion and re-arrange the summations (noting that the first term corresponds to $k = 0$, since $\Pr(s \to s, 0, \pi) = 1$), we get
$$\nabla v_\pi(s) = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a|x)\, q_\pi(x,a)$$
which is exactly the compact form given in the book.
Hopefully this makes it easier to follow if you weren't able to understand it after reading the book.
We know that $J(\theta) = v_\pi(s_0)$, so
$$\nabla J(\theta) = \nabla v_\pi(s_0) = \sum_s \sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi) \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$
Let $\eta(s) = \sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)$, the expected number of time-steps spent in state $s$ during an episode that starts in $s_0$ and follows $\pi$. Then
$$\nabla J(\theta) = \sum_s \eta(s) \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$
Let $\mu(s) = \dfrac{\eta(s)}{\sum_{s'} \eta(s')}$ be the fraction of time spent in $s$ (the on-policy state distribution). Then
$$\nabla J(\theta) = \Big(\sum_{s'} \eta(s')\Big) \sum_s \mu(s) \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$
We can work with a quantity that is merely proportional to the gradient, because the constant of proportionality gets absorbed into the learning-rate, which is arbitrary anyway. Hence,
$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \nabla\pi(a|s)\, q_\pi(s,a)$$
which is the policy gradient theorem.
The policy gradient theorem gives us the gradient of our performance measure in terms of quantities that can be sampled from Monte-Carlo rollouts.
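As a concrete illustration of that last point (a minimal sketch, not code from the book): writing the sum above as an expectation over on-policy samples gives $\nabla J(\theta) \propto \mathbb{E}_\pi\big[G_t\, \nabla \ln \pi(A_t|S_t,\theta)\big]$, which is the REINFORCE estimator discussed in Section 13.3. The snippet below reuses the softmax/linear-features policy sketched earlier and applies one such Monte-Carlo update from a single rollout; all names, dimensions, and hyper-parameters are illustrative assumptions.

```python
import numpy as np

def softmax_policy(theta, features):
    """pi(.|s, theta) for a softmax policy with linear action preferences."""
    prefs = theta @ features
    prefs -= prefs.max()
    probs = np.exp(prefs)
    return probs / probs.sum()

def grad_log_pi(theta, features, action):
    """Gradient of ln pi(action|s, theta) for the softmax/linear parameterization.

    For this parameterization the gradient has the closed form
    (one_hot(action) - pi(.|s)) outer features.
    """
    probs = softmax_policy(theta, features)
    one_hot = np.zeros(theta.shape[0])
    one_hot[action] = 1.0
    return np.outer(one_hot - probs, features)

def reinforce_update(theta, episode, alpha=0.01):
    """One Monte-Carlo (REINFORCE-style) update from a single sampled episode.

    episode : list of (state_features, action, reward) tuples, in order.
    Uses the undiscounted return-to-go G_t as the sample of q_pi(S_t, A_t).
    """
    rewards = [r for (_, _, r) in episode]
    for t, (features, action, _) in enumerate(episode):
        G_t = sum(rewards[t:])                                  # return following time t
        theta = theta + alpha * G_t * grad_log_pi(theta, features, action)
    return theta

# Toy usage with a made-up 2-step episode: 3 actions, 4 state features.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))
episode = [(rng.normal(size=4), 1, 1.0),
           (rng.normal(size=4), 0, 0.5)]
theta = reinforce_update(theta, episode)
```

The key point is that every quantity in the update (states visited, actions taken, returns) comes directly from a rollout of the current policy, which is exactly what the theorem promises: we never need to know the environment dynamics $p(s'|s,a)$ or the state distribution $\mu(s)$ explicitly.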