# Notes on "[Trust Region Policy Optimization](https://arxiv.org/pdf/1502.05477.pdf)"
###### tags: `incomplete` `research paper notes` `Policy gradient` `TRPO`
#### Author
[Raj Ghugare](https://github.com/RajGhugare19)
## Introduction:
The proposed algorithm guarantees iterative monotonic improvement for large nonlinear policy parameterizations like neural networks.
### Preliminaries:
Every MDP is described by the tuple $(S,A,P,r,P_{0},\gamma)$ where,
$S \rightarrow$ set of all states (finite)
$A \rightarrow$ set of all actions (finite)
$P \rightarrow$ transition dynamics of the environment
$r \rightarrow$ reward function
$P_{0} \rightarrow$ probability distribution of initial states
$\gamma \rightarrow$ discount factor
### Other standard definitions:
$\pi \rightarrow$ stochastic policy
$\eta(\pi) = E_{\tau \sim \pi}[\Sigma_{l=0}^{\infty} \gamma^{l} r_{l+1}]$
This can be thought of as a performance measure for any policy, since it takes an expectation of the discounted return over trajectories starting from $s_{0} \sim P_{0}$.
$V_{\pi}(s_{t}) = E_{a_{t},s_{t+1},a_{t+1}...}[\Sigma_{l=0}^{\infty} \gamma^{l} r_{t+l+1}]$
$Q_{\pi}(s_{t},a_{t}) = E_{s_{t+1},a_{t+1},s_{t+2}...}[\Sigma_{l=0}^{\infty} \gamma^{l} r_{t+l+1}]$
$A_{\pi}(s_{t},a_{t}) = Q_{\pi}(s_{t},a_{t}) - V_{\pi}(s_{t})$
The trajectories $\langle s_{0},a_{0},\ldots,s_{t},a_{t},\ldots \rangle$ are sampled using $P_{0}$, $P(s_{t+1}|s_{t},a_{t})$ and $\pi(a_{t}|s_{t})$.
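To make the definitions concrete, here is a minimal NumPy sketch (not from the paper; the array shapes and names are illustrative) that computes a discounted return from a sampled reward sequence and the advantage table $A_{\pi} = Q_{\pi} - V_{\pi}$ from tabular estimates:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum_l gamma^l * r_{l+1} for one sampled trajectory."""
    return sum(gamma**l * r for l, r in enumerate(rewards))

def advantage(Q, V):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s), computed for every (s, a) pair."""
    return Q - V[:, None]   # broadcast V over the action dimension

# toy usage with made-up numbers
gamma = 0.9
rewards = [1.0, 0.0, 2.0]                 # r_1, r_2, r_3 from one rollout
print(discounted_return(rewards, gamma))  # 1.0 + 0.9*0.0 + 0.9**2*2.0 = 2.62

Q = np.array([[1.0, 0.5], [0.2, 0.8]])    # |S| x |A| action-value table
V = np.array([0.9, 0.5])                  # |S| state-value table
print(advantage(Q, V))
```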
### Identity 1:
$$\eta(\tilde \pi) = \eta(\pi) + E_{s_{0},a_{0},\ldots \sim \tilde\pi}[\Sigma_{l=0}^{\infty} \gamma^{l} A_{\pi}(s_{l},a_{l})]$$
The proof of this is given in appendix A:
![](https://i.imgur.com/sQQA6E3.png)
An intuition for this identity comes from understanding what the advantage function of a policy signifies. It tells us how much better or worse taking an action $a_{t}$ is compared to the expected value of acting according to the policy. In the identity, we add up the discounted advantages of actions that are actually sampled using $\tilde \pi$. Hence, it amounts to summing up how much better it is to take the actions sampled by $\tilde \pi$ than to follow policy $\pi$.
### Discounted frequencies:
$P_{\pi}(s) = P(s_{0}=s) + \gamma P(s_{1}=s) + \gamma^{2} P(s_{2}=s) +..$
$P_{\pi}$ is a measure (not a probability distribution) proportional to the expected discounted number of times a state is visited in a trajectory generated by following policy $\pi$.
$(1-\gamma)P_{\pi}$ is a valid probability distribution over states (the discounted weights $1,\gamma,\gamma^{2},\ldots$ sum to $\frac{1}{1-\gamma}$).
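As a hedged illustration (my own, not from the paper; it assumes a small tabular MDP stored as NumPy arrays), the sketch below computes $P_{\pi}$ exactly via the geometric series $p_{0}^{T}(I-\gamma M_{\pi})^{-1}$, where $M_{\pi}$ is the state-to-state transition matrix under $\pi$:

```python
import numpy as np

def discounted_visitation(P, pi, p0, gamma):
    """
    Exact P_pi(s) = sum_t gamma^t P(s_t = s) for a tabular MDP.
    P  : transition tensor, shape (S, A, S), P[s, a, s'] = P(s'|s, a)
    pi : stochastic policy, shape (S, A)
    p0 : initial state distribution, shape (S,)
    """
    S = P.shape[0]
    # state-to-state transition matrix under pi: M[s, s'] = sum_a pi(a|s) P(s'|s, a)
    M = np.einsum('sa,sap->sp', pi, P)
    # sum_t gamma^t p0^T M^t  =  p0^T (I - gamma M)^{-1}
    return np.linalg.solve((np.eye(S) - gamma * M).T, p0)

# toy usage: 2 states, 2 actions, random dynamics
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2)); P /= P.sum(-1, keepdims=True)
pi = rng.random((2, 2));   pi /= pi.sum(-1, keepdims=True)
p0 = np.array([1.0, 0.0])
rho = discounted_visitation(P, pi, p0, gamma=0.9)
print(rho, rho.sum() * (1 - 0.9))  # second value is 1: (1-gamma) * P_pi is a distribution
```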
### Monotonic policy improvement:
We know that,
$\eta(\tilde \pi) = \eta(\pi) + E_{\tau \sim \tilde\pi}[\Sigma_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_{t},a_{t})]$
$\eta(\tilde \pi) = \eta(\pi) + \Sigma_{t=0}^{\infty}\gamma^{t}\Sigma_{s}P(s_{t} = s|\tilde\pi)\Sigma_{a}\tilde\pi(a|s_{t}=s)[A_{\pi}(s,a)]$
The above step just converts the expectation into a weighted sum over all time-steps, states and actions.
Directly re-arranging the first two summations (they are independent of each other),
$\eta(\tilde \pi) = \eta(\pi) + \Sigma_{s}\Sigma_{t=0}^{\infty}\gamma^{t}P(s_{t} = s|\tilde\pi)\Sigma_{a}\tilde\pi(a|s)[A_{\pi}(s,a)]$
$\eta(\tilde \pi) = \eta(\pi) + \Sigma_{s}P_{\tilde\pi}(s)\Sigma_{a}\tilde\pi(a|s)[A_{\pi}(s,a)]$
The above equation tells us that if we have a non-negative expected advantage at every state $s$, then the new policy $\tilde\pi$ is guaranteed to be an improvement over $\pi$. This is similar to the classic policy improvement theorem. The only difference is the expectation over actions, because with neural network parameterizations the policies considered are stochastic rather than deterministic.
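The equation above can be checked numerically. The following self-contained sketch (illustrative only; the tiny random MDP and helper names are my own) computes $\eta$, $A_{\pi}$ and $P_{\tilde\pi}$ exactly for a tabular MDP and verifies that $\eta(\tilde\pi) = \eta(\pi) + \Sigma_{s}P_{\tilde\pi}(s)\Sigma_{a}\tilde\pi(a|s)A_{\pi}(s,a)$:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9

# random tabular MDP: dynamics P(s'|s,a), expected rewards r(s,a), start distribution p0
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
r = rng.random((S, A))
p0 = np.ones(S) / S

def random_policy():
    pi = rng.random((S, A))
    return pi / pi.sum(-1, keepdims=True)

def value(pi):
    """Exact V_pi from the Bellman equation V = r_pi + gamma * M_pi V."""
    M = np.einsum('sa,sap->sp', pi, P)        # state-to-state matrix under pi
    r_pi = (pi * r).sum(-1)                   # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * M, r_pi)

def eta(pi):
    return p0 @ value(pi)                     # performance measure eta(pi)

def visitation(pi):
    M = np.einsum('sa,sap->sp', pi, P)
    return np.linalg.solve((np.eye(S) - gamma * M).T, p0)

pi, pi_new = random_policy(), random_policy()
V = value(pi)
Q = r + gamma * np.einsum('sap,p->sa', P, V)  # Q_pi(s,a)
Adv = Q - V[:, None]                          # A_pi(s,a)

lhs = eta(pi_new)
rhs = eta(pi) + (visitation(pi_new) * (pi_new * Adv).sum(-1)).sum()
print(lhs, rhs)   # the two numbers agree up to floating-point error
```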
### Results of [Kakade and Langford (2002)](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf):
A local approximation of $\eta(\tilde \pi)$ is proposed $\rightarrow L_{\pi}(\tilde\pi)$, such that:
$L_{\pi}(\tilde\pi) = \eta(\pi) + \Sigma_{s}P_{\pi}(s)\Sigma_{a}\tilde\pi(a|s)[A_{\pi}(s,a)]$
This local approximation uses the discounted state visitation frequencies of policy $\pi$, i.e. it ignores the difference in state visitation between $\pi$ and $\tilde \pi$.
$L_{\pi_{\theta}}(\pi_{\theta}) = \eta(\pi_{\theta})$
$\nabla_{\theta}L_{\pi_{\theta_0}}(\pi_{\theta}) = \nabla_{\theta}\eta(\pi_{\theta})$, at $\theta = \theta_{0}$.
This result tells us that a sufficiently small step from $\theta_{0}$ that improves $L_{\pi_{\theta_{0}}}$ will also improve the true performance measure $\eta(\pi_{\theta})$; it does not, however, tell us how large that step can be.
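Both properties can be sanity-checked numerically. The sketch below (again illustrative, not the paper's code) uses a one-parameter mixture path $\pi_{\theta} = (1-\theta)\pi_{0} + \theta\pi_{1}$ through a random tabular MDP and compares finite-difference derivatives of $L_{\pi_{0}}$ and $\eta$ at $\theta = 0$:

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, h = 3, 2, 0.9, 1e-5

P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
r = rng.random((S, A))
p0 = np.ones(S) / S

def norm(x): return x / x.sum(-1, keepdims=True)

def eta(pi):                                   # true performance measure
    M = np.einsum('sa,sap->sp', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * M, (pi * r).sum(-1))
    return p0 @ V

def L(pi_ref, pi):                             # local approximation L_{pi_ref}(pi)
    M = np.einsum('sa,sap->sp', pi_ref, P)
    V = np.linalg.solve(np.eye(S) - gamma * M, (pi_ref * r).sum(-1))
    Q = r + gamma * np.einsum('sap,p->sa', P, V)
    rho = np.linalg.solve((np.eye(S) - gamma * M).T, p0)  # visitation of pi_ref
    return eta(pi_ref) + (rho * (pi * (Q - V[:, None])).sum(-1)).sum()

pi0, pi1 = norm(rng.random((S, A))), norm(rng.random((S, A)))
mix = lambda t: (1 - t) * pi0 + t * pi1        # one-parameter path through pi0

# finite-difference derivatives along the path, evaluated at theta = 0
d_eta = (eta(mix(h)) - eta(mix(0.0))) / h
d_L   = (L(pi0, mix(h)) - L(pi0, mix(0.0))) / h
print(d_eta, d_L)   # should agree up to O(h)
```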
#### Conservative Policy Iteration:
$\pi_{old} \rightarrow$ current policy
$\tilde \pi \rightarrow argmax_{\tilde \pi}L_{\pi_{old}}(\tilde\pi)$
The conservative policy iteration update is then the mixture
$\pi_{new} =(1-\alpha)\pi_{old} + \alpha\tilde\pi$
Kakade and Langford derived the following lower bound:
$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^{2}$
where $\epsilon = max_{s}|E_{a \sim \tilde\pi(.|s)}[A_{\pi_{old}}(s,a)]|$
### Monotonic Improvement Guarantee for General Stochastic Policies:
The main theoretical result of this paper is that a similar bound holds for arbitrary stochastic policy updates. It uses a distance measure between the old and new policies, the total variation divergence, and changes $\epsilon$ appropriately.
#### Total Variation Divergence:
For discrete probability distributions $p,q$
$D_{TV}$ is defined as
$D_{TV}(p||q) = \frac{1}{2}\Sigma_{i}|p(i)-q(i)|$
Defining $D_{TV}^{max}$ as
$D_{TV}^{max}(\pi,\tilde \pi) = max_{s}D_{TV}(\pi(.|s)||\tilde \pi(.|s))$
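A direct translation of these definitions into NumPy (an illustrative sketch; policies are assumed to be $(|S|, |A|)$ tables):

```python
import numpy as np

def tv_divergence(p, q):
    """D_TV(p || q) = 0.5 * sum_i |p(i) - q(i)| for discrete distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def tv_max(pi, pi_new):
    """D_TV^max(pi, pi_new) = max_s D_TV(pi(.|s) || pi_new(.|s)) for (S, A) policy tables."""
    return 0.5 * np.abs(pi - pi_new).sum(axis=1).max()

print(tv_divergence([0.5, 0.5], [0.8, 0.2]))  # 0.3
```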
## Theorem 1:
![](https://i.imgur.com/zvWXa3X.png)
### Binding it all together:
In this section we will discuss the main theoretical results that this paper offers and the entire proof of theorem 1. Sit back, take a sip of your coffee and relax.
We've already proved that,
$\eta(\tilde \pi) = \eta(\pi) + E_{\tau \sim \tilde\pi}[\Sigma_{t=0}^{\infty} \gamma^{t}A_{\pi}(s_{t},a_{t})]$
Let $\overline A(s)$ be the expected advantage of $\tilde \pi$ over $\pi$ at a state $s$.
$\overline A(s) = E_{a \sim \tilde\pi(.|s)}[A_{\pi}(s,a)]$
Hence the topmost equation can be written as,
$\eta(\tilde \pi) = \eta(\pi) + E_{\tau \sim \tilde\pi}[\Sigma_{t=0}^{\infty}\gamma^{t} \overline A(s_{t})]$
Using the approximation $L_{\pi}(\tilde\pi)$ we can write,
$L_{\pi}(\tilde\pi) = \eta(\pi) + E_{\tau \sim \pi}[\Sigma_{t=0}^{\infty}\gamma^{t} \overline A(s_{t})]$
The only thing that changed in the above equation is that the states are now sampled according to policy $\pi$, even though the actions used to compute $\overline A$ are still sampled from $\tilde\pi$ at every step. The notation is abused a little, but it is important to understand the meaning.
Our main goal is to bound the difference between $L_{\pi}(\tilde\pi)$ and $\eta({\tilde\pi})$. For this we define a couple of things first:
#### Alpha-coupled policies:
![](https://i.imgur.com/atR7zau.png)
It should be evident that if $\pi$ and $\tilde\pi$ are similar, then $\alpha$ can be chosen to be small.
In simpler terms, the definition says that if we sample actions at the same state from policies $\pi$ and $\tilde\pi$, then the two actions agree with probability at least $(1-\alpha)$.
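A useful (standard) fact is that for discrete distributions the maximal coupling disagrees with probability exactly equal to the total variation distance, so the smallest $\alpha$ for which two tabular policies are $\alpha$-coupled is $D_{TV}^{max}(\pi,\tilde\pi)$. A small hedged sketch, with made-up policy tables:

```python
import numpy as np

def smallest_alpha(pi, pi_new):
    """
    For tabular policies of shape (S, A), the maximal coupling at each state
    disagrees with probability exactly D_TV(pi(.|s) || pi_new(.|s)), so the
    smallest alpha for which the pair is alpha-coupled is the worst-case TV distance.
    """
    return 0.5 * np.abs(pi - pi_new).sum(axis=1).max()

pi     = np.array([[0.7, 0.3], [0.5, 0.5]])
pi_new = np.array([[0.6, 0.4], [0.5, 0.5]])
print(smallest_alpha(pi, pi_new))  # 0.1: the policies can be coupled to agree 90% of the time
```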
#### Lemma 2:
Let's assume that $\pi$ and $\tilde\pi$ are $\alpha$-coupled policies and let,
$$\overline A(s) = E_{\tilde a \sim \tilde\pi}[A_{\pi}(s,\tilde a)]$$
$\overline A(s)$ is the expected advantage of taking an action according to $\tilde\pi$ over the action taken according to $\pi$ at a state $s$. If it is positive then $\tilde\pi$ works better than $\pi$ at state $s$; if it is zero then $\tilde\pi$ performs as well as $\pi$ at state $s$; and if it is negative then $\pi$ is better at state $s$.
Since $E_{(a \sim \pi)}[A_{\pi}(s,a)] = 0$,
$$\overline A(s) = E_{(a,\tilde a) \sim (\pi,\tilde\pi)}[A_{\pi}(s,\tilde a) - A_{\pi}(s,a)]$$
$$ = p(a \neq \tilde a)E_{(a,\tilde a) \sim (\pi,\tilde\pi)|a \neq \tilde a}[A_{\pi}(s,\tilde a) - A_{\pi}(s,a)] + 0$$
$$|\overline A(s)| \leq \alpha E_{(a,\tilde a) \sim (\pi,\tilde\pi)|a \neq \tilde a}[|A_{\pi}(s,\tilde a) - A_{\pi}(s,a)|]$$
Since the difference of the advantages of any two actions is at most twice the maximum absolute advantage over all actions,
$$|\overline A(s)| \leq \alpha E_{(a,\tilde a) \sim (\pi,\tilde\pi)|a \neq \tilde a}[2max_{a}(|A_{\pi}(s,a)|)]$$
$$ \leq (2\alpha) max_{a}(|A_{\pi}(s,a)|)$$
#### Lemma 3:
We know that,
$L_{\pi}(\tilde\pi) = \eta(\pi) + E_{\tau \sim \pi}[\Sigma_{t=0}^{\infty}\gamma^{t}\overline A(s_{t})]$
To bound the difference between $L_{\pi}(\tilde\pi)$ and $\eta(\tilde\pi)$, we need to bound the difference between the expected advantages of $\tilde\pi$ over $\pi$ at every time-step $t$ under the two state distributions.
Let's say $\tau$ is a trajectory obtained by sampling from $\pi$ and $\tilde\tau$ is a trajectory obtained by sampling from $\tilde\pi$.
Let $n_{t}$ be a random variable denoting the number of times $a_{i} \neq \tilde a_{i}$ for $i<t$,
i.e. it gives the number of times the actions sampled by $\pi$ differ from $\tilde\pi$ at time-steps less than $t$.
Expected advantage at time step $t$ of policy $\tilde\pi$ if policy $\pi$ is followed:
$E_{s_{t} \sim \pi}[\overline A(s_{t})] = p(n_{t}=0)E_{s_{t} \sim \pi|n_t=0}[\overline A(s_{t})] + p(n_{t}>0)E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]$
Expected advantage at time step $t$ of policy $\tilde\pi$ if policy $\tilde\pi$ is followed:
$E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})] = p(n_{t}=0)E_{s_{t} \sim \tilde\pi|n_t=0}[\overline A(s_t)] + p(n_{t}>0)E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_t)]$
In both of the above cases the advantage is that of policy $\tilde \pi$ over policy $\pi$ at the state $s_{t}$. The expectation over the random trajectory is split by conditioning on the two events $n_{t}=0$ and $n_{t}>0$, which are mutually exclusive and exhaustive, as they should be.
The conditional distributions of $s_{t}$ given $n_t=0$ under $\pi$ and under $\tilde\pi$ are the same, hence the corresponding conditional expectations are equal (because, conditioned on $n_t = 0$, the two policies never disagreed before time-step $t$).
Subtracting both equations we obtain:
$E_{s_{t} \sim \pi}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})]= p(n_{t}>0)(E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})])$
Due to the definition of $\alpha$-coupled policies and the multiplication rule of probability,
$p(n_t=0) \geq (1-\alpha)^{t}$
It follows that,
$p(n_t>0) \leq 1-(1-\alpha)^{t}$
Using this and taking modulus on both sides we obtain,
$|E_{s_{t} \sim \pi}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})]| \leq (1-(1-\alpha)^t)|E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})]|$
Using the triangle inequality,
$|(E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})])| \leq |E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]| + |E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})]|$
Using Lemma 2 on each term (and, as the paper does, weakening the bound by taking the maximum over all states as well as actions), we can say that,
$|E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]| + |E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})]| \leq 4\alpha\, max_{s,a}|A_{\pi}(s,a)|$
Substituting this result into the earlier inequality,
$|E_{s_{t} \sim \pi}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})]| \leq 4\alpha(1-(1-\alpha)^{t})\,max_{s,a}|A_{\pi}(s,a)|$
The above bound is for the expected advantage of $\tilde\pi$ at a single time-step $t$ under the two state distributions, but we need a bound for $L_{\pi}(\tilde\pi)$ and $\eta(\tilde\pi)$, which are discounted sums of these terms over all time-steps.
$|L_{\pi}(\tilde\pi) - \eta(\tilde\pi)| \leq \Sigma_{t=0}^{\infty}\gamma^{t}|E_{s_{t} \sim \pi}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})]|$
$|L_{\pi}(\tilde\pi) - \eta(\tilde\pi)|\leq \Sigma_{t=0}^{\infty}\gamma^{t}4\alpha(1-(1-\alpha)^t)max_{s,a}|A_{\pi}(s,a)|$
Let $\epsilon = max_{s,a}|A_{\pi}(s,a)|$
Then,
$|L_{\pi}(\tilde\pi) - \eta(\tilde\pi)| \leq \Sigma_{t=0}^{\infty}\gamma^{t}4\alpha(1-(1-\alpha)^t)\epsilon$
The rest is just summing the geometric series,
$$\Sigma_{t=0}^{\infty}\gamma^{t}(1-(1-\alpha)^{t}) = \frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)} = \frac{\alpha\gamma}{(1-\gamma)(1-\gamma(1-\alpha))} \leq \frac{\alpha\gamma}{(1-\gamma)^{2}}$$
which gives,
$|L_{\pi}(\tilde\pi) - \eta(\tilde\pi)| \leq \frac{4\alpha^{2}\gamma\epsilon}{(1-\gamma)^{2}}$
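As a quick numerical sanity check of the last step (illustrative only, with arbitrary values of $\gamma$, $\alpha$ and $\epsilon$), the truncated series matches the closed form and respects the weaker $\frac{4\alpha^{2}\gamma\epsilon}{(1-\gamma)^{2}}$ bound:

```python
import numpy as np

gamma, alpha, eps = 0.9, 0.1, 1.0

# truncated numerical evaluation of sum_t gamma^t * 4*alpha*(1 - (1-alpha)^t) * eps
t = np.arange(5000)
series = (gamma**t * 4 * alpha * (1 - (1 - alpha)**t) * eps).sum()

exact = 4 * alpha**2 * gamma * eps / ((1 - gamma) * (1 - gamma * (1 - alpha)))
bound = 4 * alpha**2 * gamma * eps / (1 - gamma)**2

print(series, exact, bound)  # series equals the exact sum, and both are <= bound
```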