POLICY GRADIENT THEOREM
===
Author
[Raj Ghugare](https://github.com/Raj19022000)
###### tags: `notes` `Policy-gradients`
## Review and research notes
### Introduction
* Policy gradient methods are directly aligned with the goal of an RL problem, i.e., finding a policy that obtains the highest expected return.
* The main idea behind policy gradient methods is to parameterize the policy in some way and then optimize those parameters to obtain the highest expected return.
* By parameterizing we are essentially defining the space of all possible policies in which we want to find the best one. Common choices of parameterization are softmax or Gaussian policies, neural networks, or often a combination of them.
* This blog is mostly a detailed explanation of [section 13.2, Sutton and Barto](http://incompleteideas.net/book/bookdraft2017nov5.pdf). It would be a good idea to read this part of the book before reading this article.
* The performance measure used to optimize the parameters of the current policy $\pi_{\theta}$ is the expected return from the starting state $S_{0}$. We are assuming that there is only one starting state.
$$J(\theta) = V_{\pi_{\theta}}(S_{0})$$
The value of state $S_{0}$ depends on our policy as well as on the environment's dynamics, so finding the gradient of $J(\theta)$ is a challenge. This is where the policy gradient theorem comes in. A small numerical sketch of this setup is given below.
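To make the setup concrete, here is a minimal sketch with a made-up three-state MDP, made-up rewards, and an arbitrary discount factor $\gamma$ (used only so that the value of this continuing toy example is well defined); names such as `p`, `r`, and `softmax_policy` are illustrative assumptions, not anything from the book.

```python
# A minimal sketch (NOT from the book): a made-up 3-state MDP, a tabular
# softmax policy parameterization, and the performance measure
# J(theta) = V_{pi_theta}(s0).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)

p = rng.random((n_states, n_actions, n_states))   # p[s, a, s'] = p(s' | s, a)
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))             # r[s, a] = expected reward

theta = rng.normal(size=(n_states, n_actions))    # one preference per (s, a) pair

def softmax_policy(theta):
    """pi(a|s) = exp(theta[s, a]) / sum_b exp(theta[s, b])."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def J(theta, s0=0):
    """Expected return from the start state: J(theta) = V_{pi_theta}(s0)."""
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sax->sx', pi, p)          # state-to-state transitions under pi
    r_pi = (pi * r).sum(axis=1)                    # expected one-step reward under pi
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)   # solve the Bellman equation
    return v[s0]

print(J(theta))   # the scalar that policy gradient methods try to maximize
```

Policy gradient methods adjust `theta` to increase this scalar; the rest of the article derives what its gradient looks like.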
#### NOTE:
Throughout the next section we won't consider discounting, but the policy gradient theorem remains unchanged even with discounting (fairly easy to prove once you understand the theorem).
### Policy Gradient Theorem
$$\nabla V_\pi({s}) = \nabla[\Sigma_{a} \pi(a|s) q_{\pi}(s,a)]$$
The above statement is true for all states $s \in S$.
Using the product rule of calculus,
$$\nabla V_\pi({s}) = \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a) + \pi(a|s) \nabla q_{\pi}(s,a) ]$$
$$\nabla V_\pi(s) = \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a) + \pi(a|s) \nabla \Sigma_{r,s'} p(s',r|s,a)(r + V_\pi(s')) ]$$
Since the reward $r$ is independent of $\theta$ and $\Sigma_{r}p(s',r|s,a) = p(s'|s,a)$,
$$\nabla V_\pi(s) = \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a) + \pi(a|s) \Sigma_{s'} p(s'|s,a)\nabla V_\pi(s') ]$$
We can continue to expand $\nabla V_{\pi}(s')$ in the same way, but if we keep rolling out and expanding this equation for successive states $s'', s''', \ldots$ we will get an infinite series.
$$\nabla V_\pi(s) = \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a) + \pi(a|s) \Sigma_{s'} p(s'|s,a)\Sigma_{a'}[ \nabla\pi(a'|s') q_{\pi}(s',a') + \pi(a'|s') \Sigma_{s''} p(s''|s',a')\nabla V_\pi(s'')]]$$
As this sum rolls out it becomes difficult to keep track of all the gradients, but Sutton and Barto give another way to look at it:
#### NOTE:
Wherever I have mentioned $\pi$ it actually means $\pi_{\theta}$, i.e. the current policy.
$Pr(s\rightarrow x,k,\pi)$ is the probability of reaching state $x$, starting from state $s$, in $k$ steps while following policy $\pi$. This term is defined explicitly as a tool that will help us in the derivation.
If $k=0$ then,
$$Pr(s\rightarrow s,0,\pi) = 1$$
There is no way to reach another state in 0 time-steps, so we always remain in the same state.
If $k=1$ then,
$$Pr(s\rightarrow x,1,\pi) = \Sigma_{a}\pi(a|s)p(x|s,a)$$
This is just the sum, over all actions, of the probability of taking that action times the probability that the environment dynamics then move us from $s$ to $x$ in one step. A small sketch of this quantity for general $k$ is given below.
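As an illustration (the 3-state dynamics and the policy below are randomly generated placeholders, not anything specific from the article), $Pr(s\rightarrow x,k,\pi)$ is just the $(s,x)$ entry of the $k$-th power of the one-step transition matrix induced by $\pi$:

```python
# A small illustration: Pr(s -> x, k, pi) via powers of the policy-induced
# transition matrix. Dynamics p and policy pi are random placeholders.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

p = rng.random((n_states, n_actions, n_states))   # p[s, a, x] = p(x | s, a)
p /= p.sum(axis=2, keepdims=True)
pi = rng.random((n_states, n_actions))            # pi[s, a] = pi(a | s)
pi /= pi.sum(axis=1, keepdims=True)

# One-step matrix under pi: P_pi[s, x] = sum_a pi(a|s) p(x|s, a)  (the k = 1 case)
P_pi = np.einsum('sa,sax->sx', pi, p)

def reach_prob(s, x, k):
    """Pr(s -> x, k, pi): probability of being in x after k steps from s."""
    return np.linalg.matrix_power(P_pi, k)[s, x]

print(reach_prob(0, 0, 0))                        # 1.0: in 0 steps we stay where we are
print(reach_prob(0, 2, 1))                        # equals sum_a pi(a|0) p(2|0, a)
print(np.isclose(reach_prob(0, 2, 2),             # True: 2-step prob = sum over the
      sum(P_pi[0, s1] * P_pi[s1, 2]               # intermediate state of 1-step probs
          for s1 in range(n_states))))
```

The last check is the same identity that is used further below to collapse $\Sigma_{s'}Pr(s\rightarrow s',1,\pi)Pr(s'\rightarrow s'',1,\pi)$ into $Pr(s\rightarrow s'',2,\pi)$.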
Returning to the recursion we derived earlier,
$$\nabla V_\pi(s) = \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a) + \pi(a|s) \Sigma_{s'} p(s'|s,a)\nabla V_\pi(s') ]$$
$$\nabla V_\pi(s) = \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a)] +\Sigma_{a}[\pi(a|s) \Sigma_{s'} p(s'|s,a)\nabla V_\pi(s') ]$$
Re-arranging the summations in the second term,
$$\nabla V_\pi(s) = \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a)] +\Sigma_{s'}[\nabla V_\pi(s') \Sigma_{a} p(s'|s,a) \pi(a|s)] $$
We just saw that $\Sigma_{a} p(s'|s,a) \pi(a|s)=Pr(s\rightarrow s',1,\pi)$. Substituting this into our equation,
$$\nabla V_\pi(s) = \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a)] +\Sigma_{s'}[\nabla V_\pi(s') Pr(s\rightarrow s',1,\pi)]$$
We will use this recursive equation again and again, so have a good look at it and make yourself comfortable with it.
Using the recursive equation for $\nabla V_\pi(s')$,
$$\nabla V_\pi(s) = \Sigma_{a} \nabla\pi(a|s) q_{\pi}(s,a) +\Sigma_{s'}Pr(s\rightarrow s',1,\pi)[\Sigma_{a}\nabla\pi(a|s') q_{\pi}(s',a) + \Sigma_{s''} \nabla V_\pi(s'')Pr(s' \rightarrow s'',1,\pi)]$$
$$\nabla V_\pi(s) = \Sigma_{a}\nabla\pi(a|s) q_{\pi}(s,a) + \Sigma_{s'}Pr(s\rightarrow s',1,\pi)[\Sigma_{a}\nabla\pi(a|s') q_{\pi}(s',a)] +
\Sigma_{s'}Pr(s\rightarrow s',1,\pi)[\Sigma_{s''} \nabla V_\pi(s'')Pr(s' \rightarrow s'',1,\pi)]$$
Marginalizing over the intermediate state $s'$ (the law of total probability),
$\Sigma_{s'}Pr(s\rightarrow s',1,\pi)Pr(s'\rightarrow s'',1,\pi) = Pr(s\rightarrow s'',2,\pi)$
We can substitute this into our last equation:
$$= \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a)] + \Sigma_{s'}Pr(s\rightarrow s',1,\pi)[\Sigma_{a}\nabla\pi(a|s') q_{\pi}(s',a)] +
\Sigma_{s''}Pr(s\rightarrow s'',2,\pi)\nabla V_\pi(s'')$$
If we unroll this equation another time we would obtain:
$$\nabla V_\pi(s) = \Sigma_{a}[ \nabla\pi(a|s) q_{\pi}(s,a)] + \Sigma_{s'}Pr(s\rightarrow s',1,\pi)[\Sigma_{a}\nabla\pi(a|s') q_{\pi}(s',a)] +
\Sigma_{s''}Pr(s\rightarrow s'',2,\pi)\Sigma_{a}\nabla \pi(a|s'') q_{\pi}(s'',a) + \Sigma_{s'''}Pr(s\rightarrow s''',3,\pi)\nabla V_{\pi}(s''')$$
If we keep on unrolling in this fashion and re-arrange the summation,
$$= \Sigma_{x\in S} \Sigma_{k=0}^{\infty}Pr(s\rightarrow x,k,\pi)\Sigma_{a}\nabla \pi(a|x) q_\pi(x,a)$$
Hopefully this expanded derivation makes things clearer if the book's version was hard to follow.
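As a sanity check on the expression above, here is a hedged numerical sketch on a made-up two-state episodic MDP with a tabular softmax policy (all dynamics, rewards, and parameter values are invented for illustration). It evaluates $\Sigma_{x}\Sigma_{k}Pr(s_0\rightarrow x,k,\pi)\Sigma_{a}\nabla\pi(a|x)q_\pi(x,a)$ in closed form and compares it with a finite-difference estimate of $\nabla V_\pi(s_0)$:

```python
# Sanity check (made-up two-state episodic MDP, made-up parameters): compute
# sum_x (sum_k Pr(s0 -> x, k, pi)) sum_a grad pi(a|x) q(x, a) in closed form
# and compare it with a finite-difference estimate of grad V_pi(s0).
import numpy as np

n_states, n_actions = 2, 2                         # states 0, 1; index 2 is terminal
P = np.zeros((n_states, n_actions, n_states + 1))  # P[s, a, s'] = p(s' | s, a)
P[0, 0] = [0.7, 0.2, 0.1]
P[0, 1] = [0.1, 0.6, 0.3]
P[1, 0] = [0.3, 0.3, 0.4]
P[1, 1] = [0.0, 0.5, 0.5]
R = np.array([[1.0, 0.0],                          # R[s, a] = expected immediate reward
              [0.5, 2.0]])

def softmax_rows(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def v_q_Ppi(theta):
    pi = softmax_rows(theta)                                  # pi[s, a] = pi(a | s)
    P_pi = np.einsum('sa,sax->sx', pi, P)[:, :n_states]       # transitions among non-terminal states
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(n_states) - P_pi, r_pi)        # v = r_pi + P_pi v (undiscounted, episodic)
    q = R + P[:, :, :n_states] @ v                            # q(s,a) = r(s,a) + sum_s' p(s'|s,a) v(s')
    return v, q, P_pi, pi

theta = np.array([[0.3, -0.2],
                  [0.1,  0.5]])
v, q, P_pi, pi = v_q_Ppi(theta)

# sum_k Pr(s0 -> x, k, pi) collapses to row s0 of (I - P_pi)^-1 because every episode terminates.
visits = np.linalg.inv(np.eye(n_states) - P_pi)[0]
grad = np.zeros_like(theta)
for x in range(n_states):
    for a in range(n_actions):
        # gradient of softmax pi(a|x) with respect to theta[x, :]
        d_pi = pi[x, a] * ((np.arange(n_actions) == a).astype(float) - pi[x])
        grad[x] += visits[x] * d_pi * q[x, a]

# Finite-difference estimate of grad V_pi(s0), parameter by parameter
eps, fd = 1e-5, np.zeros_like(theta)
for s in range(n_states):
    for a in range(n_actions):
        tp, tm = theta.copy(), theta.copy()
        tp[s, a] += eps
        tm[s, a] -= eps
        fd[s, a] = (v_q_Ppi(tp)[0][0] - v_q_Ppi(tm)[0][0]) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-4))            # True: the two computations agree
```

The infinite sum over $k$ converges here because every episode terminates, which is why it can be written as a matrix inverse.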
We know that $J(\theta) = V_\pi(s_0)$, so
$$\nabla J(\theta) = \Sigma_{s} (\Sigma_{k=0}^{\infty}Pr(s_{0}\rightarrow s,k,\pi))\Sigma_{a} \nabla \pi(a|s) q_\pi(s,a)$$
Let $n(s) = \Sigma_{k=0}^{\infty}Pr(s_{0}\rightarrow s,k,\pi)$. Then
$$\nabla J(\theta) = \Sigma_{s} n(s)\Sigma_{a} \nabla \pi(a|s) q_\pi(s,a)$$
$n(s)$ is a measure of how often state $s$ is reached from $s_0$, summed over all possible numbers of steps (the expected number of visits to $s$). Think about why $n(s)$ isn't a probability distribution over the state space. Hint: think of a simple MDP with only two states where the value of $n(s)$ would exceed 1; one such instance is sketched right below.
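For concreteness, here is one possible (made-up) instance of the hint: a single non-terminal state $s_0$ that transitions back to itself with probability $\frac{1}{2}$ and terminates otherwise, under any policy $\pi$. Then
$$n(s_0) = \Sigma_{k=0}^{\infty}Pr(s_{0}\rightarrow s_{0},k,\pi) = \Sigma_{k=0}^{\infty}\left(\frac{1}{2}\right)^{k} = 2 > 1$$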
We normalize by $\Sigma_{s'} n(s')$ to turn $n$ into a probability distribution $u(s)$:
$$\nabla J(\theta) = \Sigma_{s'} n(s') \Sigma_{s}\frac{n(s)}{\Sigma_{s'} n(s')}\Sigma_{a}\nabla \pi(a|s) q_\pi(s,a)$$
Let $u(s) = \frac{n(s)}{\Sigma_{s'}n(s')}$. Since $\Sigma_{s'}n(s')$ is just a constant,
$$\nabla J(\theta) \propto \Sigma_{s}\frac{n(s)}{\Sigma_{s'} n(s')}\Sigma_{a}\nabla \pi(a|s) q_\pi(s,a)$$
$$\nabla J(\theta) \propto \Sigma_{s}u(s)\Sigma_{a}\nabla \pi(a|s) q_\pi(s,a)$$
A quantity that is only proportional to the gradient is all we need, because the constant of proportionality can be absorbed into the learning rate, which is arbitrary anyway.
$u(s)$ is the probability of being in state $s$ when following policy $\pi$ from the start state $s_{0}$ (the on-policy state distribution). Hence the gradient can be expressed as an expectation over states visited under $\pi$:
$$\nabla J(\theta) \propto E_{\pi}[\Sigma_{a}\nabla \pi(a|s) q_\pi(s,a)]$$
The policy gradient theorem thus gives us the gradient of our performance measure in terms of quantities that can be sampled from Monte-Carlo rollouts; a sketch of how this is done in practice follows.
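Below is a minimal, hedged sketch of how this expectation is typically estimated from rollouts, in the spirit of REINFORCE: the state is sampled by following $\pi_{\theta}$, the sum over actions is replaced by the sampled action via $\Sigma_{a}\nabla\pi(a|s)q_\pi(s,a) = E_{a\sim\pi}[\nabla\log\pi(a|s)\,q_\pi(s,a)]$, and $q_\pi$ is replaced by the Monte-Carlo return. The `ToyEpisodicMDP`, its numbers, the learning rate, and the episode count are all made up for illustration and are not from the book or this article.

```python
# A hedged REINFORCE-style sketch: Monte-Carlo returns G_t stand in for
# q_pi(S_t, A_t), and grad log pi(A_t|S_t) replaces the sum over actions.
import numpy as np

rng = np.random.default_rng(0)

class ToyEpisodicMDP:
    """A made-up two-state episodic MDP (state 2 is terminal)."""
    def __init__(self):
        self.p = np.array([[[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],   # p[s, a, s']
                           [[0.3, 0.3, 0.4], [0.0, 0.5, 0.5]]])
        self.r = np.array([[1.0, 0.0], [0.5, 2.0]])              # r[s, a]
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        reward = self.r[self.s, a]
        s_next = rng.choice(3, p=self.p[self.s, a])
        self.s = s_next
        return s_next, reward, s_next == 2

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def episode(env, theta):
    """Roll out one episode with the tabular softmax policy pi_theta."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = rng.choice(theta.shape[1], p=softmax(theta[s]))
        s_next, reward, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(reward)
        s = s_next
    return states, actions, rewards

def reinforce_step(theta, traj, lr=0.01):
    """One stochastic gradient ascent step: G_t stands in for q_pi(S_t, A_t)."""
    states, actions, rewards = traj
    grad, G = np.zeros_like(theta), 0.0
    for s, a, reward in zip(reversed(states), reversed(actions), reversed(rewards)):
        G = reward + G                         # undiscounted return from time t
        grad_log = -softmax(theta[s])          # grad log pi(a|s) for a tabular softmax policy
        grad_log[a] += 1.0
        grad[s] += G * grad_log                # one sample of the policy gradient
    return theta + lr * grad

env, theta = ToyEpisodicMDP(), np.zeros((2, 2))
for _ in range(2000):
    theta = reinforce_step(theta, episode(env, theta))
print(softmax(theta[0]), softmax(theta[1]))    # learned action probabilities per state
```

The tabular softmax policy here matches the parameterizations mentioned in the introduction; swapping in a neural-network policy changes only how $\nabla\log\pi(a|s)$ is computed.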