# Notes on "[Trust Region Policy Optimization](https://arxiv.org/pdf/1502.05477.pdf)"

###### tags: `incomplete` `research paper notes` `Policy gradient` `TRPO`

#### Author

[Raj Ghugare](https://github.com/RajGhugare19)

## Introduction:

The proposed algorithm provides iterative, monotonic improvement even for large nonlinear policy parameterizations such as neural networks.

### Preliminaries:

Every MDP is described by the tuple $(S,A,P,r,P_{0},\gamma)$ where,

$S \rightarrow$ finite set of states

$A \rightarrow$ finite set of actions

$P \rightarrow$ transition dynamics of the environment, $P(s_{t+1}|s_{t},a_{t})$

$r \rightarrow$ reward function

$P_{0} \rightarrow$ probability distribution over initial states

$\gamma \rightarrow$ discount factor

### Other standard definitions:

$\pi \rightarrow$ stochastic policy

$\eta(\pi) = E_{\tau \sim \pi}[\Sigma_{t=0}^{\infty} \gamma^{t} r(s_{t})]$

This can be thought of as a performance measure for any policy, as it is the expected discounted return over trajectories starting from $s_{0} \sim P_{0}$.

$V_{\pi}(s_{t}) = E_{a_{t},s_{t+1},a_{t+1}...}[\Sigma_{l=0}^{\infty} \gamma^{l} r(s_{t+l})]$

$Q_{\pi}(s_{t},a_{t}) = E_{s_{t+1},a_{t+1}...}[\Sigma_{l=0}^{\infty} \gamma^{l} r(s_{t+l})]$

$A_{\pi}(s_{t},a_{t}) = Q_{\pi}(s_{t},a_{t}) - V_{\pi}(s_{t})$

The trajectories $<s_{0},a_{0},...,s_{t},a_{t},...>$ are generated using $P_{0}$, $P(s_{t+1}|s_{t},a_{t})$ and $\pi(a|s)$.

### Identity 1:

$$\eta(\tilde \pi) = \eta(\pi) + E_{s_{0},a_{0},s_{1},a_{1}... \sim \tilde\pi}[\Sigma_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_{t},a_{t})]$$

The proof is given in appendix A:

![](https://i.imgur.com/sQQA6E3.png)

An intuition for this identity comes from what the advantage function of a policy signifies: it tells us how much better or worse taking an action $a_{t}$ is than the expected value of acting according to the policy. In the identity we add up the discounted advantages of the actions that are actually sampled from $\tilde \pi$; hence it sums up how much better taking the actions sampled by $\tilde \pi$ is than following policy $\pi$.

### Discounted frequencies:

$P_{\pi}(s) = P(s_{0}=s) + \gamma P(s_{1}=s) + \gamma^{2} P(s_{2}=s) +...$

$P_{\pi}$ is a measure (not a probability distribution) proportional to the discounted number of times a state is likely to be visited in a trajectory generated by following policy $\pi$. $(1-\gamma)P_{\pi}$ is a valid probability distribution over states.

### Monotonic policy improvement:

We know that,

$\eta(\tilde \pi) = \eta(\pi) + E_{s_{0},a_{0},s_{1},a_{1}... \sim \tilde\pi}[\Sigma_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_{t},a_{t})]$

$\eta(\tilde \pi) = \eta(\pi) + \Sigma_{t=0}^{\infty}\gamma^{t}\Sigma_{s}P(s_{t} = s|\tilde\pi)\Sigma_{a}\tilde\pi(a|s)A_{\pi}(s,a)$

The above step just converts the expectation into a weighted sum over all time-steps, states and actions. Re-arranging the first two summations (they are independent of each other),

$\eta(\tilde \pi) = \eta(\pi) + \Sigma_{s}\Sigma_{t=0}^{\infty}\gamma^{t}P(s_{t} = s|\tilde\pi)\Sigma_{a}\tilde\pi(a|s)A_{\pi}(s,a)$

$\eta(\tilde \pi) = \eta(\pi) + \Sigma_{s}P_{\tilde\pi}(s)\Sigma_{a}\tilde\pi(a|s)A_{\pi}(s,a)$

The above equation tells us that if the expected advantage $\Sigma_{a}\tilde\pi(a|s)A_{\pi}(s,a)$ is non-negative at every state $s$, then the new policy $\tilde\pi$ is an improvement over $\pi$ (or at least no worse). This is similar to the classical policy improvement theorem. The only thing that differs here is the expectation over actions, because with neural network parameterizations we work with stochastic rather than deterministic policies.
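To make Identity 1 and its visitation-frequency form concrete, here is a minimal numerical check on a small random tabular MDP. Everything in it (the sizes, the random policies, the $r(s)$ reward convention) is an illustrative assumption of mine, not code from the paper; it computes $\eta(\tilde\pi)$ exactly and compares it with $\eta(\pi) + \Sigma_{s}P_{\tilde\pi}(s)\Sigma_{a}\tilde\pi(a|s)A_{\pi}(s,a)$.

```python
# Numerical sanity check of Identity 1 on a tiny random tabular MDP.
# All sizes, dynamics and policies below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P(s'|s,a)
r = rng.random(nS)                                                # r(s)
rho0 = rng.random(nS); rho0 /= rho0.sum()                         # initial state dist. P_0

def random_policy():
    pi = rng.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)

pi, pi_new = random_policy(), random_policy()                     # pi and pi_tilde

def value(policy):
    """Exact V_pi from the Bellman equation V = r + gamma * P_policy V."""
    P_pol = np.einsum('sa,sap->sp', policy, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pol, r), P_pol

def eta(policy):
    V, _ = value(policy)
    return rho0 @ V

V_pi, _ = value(pi)
Q_pi = r[:, None] + gamma * P @ V_pi      # Q(s,a) = r(s) + gamma * E[V(s')]
A_pi = Q_pi - V_pi[:, None]               # advantage of pi

# Discounted state visitation of the new policy: P_new(s) = sum_t gamma^t P(s_t = s)
_, P_dyn_new = value(pi_new)
rho_new = rho0 @ np.linalg.inv(np.eye(nS) - gamma * P_dyn_new)

lhs = eta(pi_new)
rhs = eta(pi) + rho_new @ np.einsum('sa,sa->s', pi_new, A_pi)
print(lhs, rhs)
assert np.isclose(lhs, rhs)               # Identity 1 holds exactly
```

Because the identity is exact, the two numbers agree up to floating-point error for any pair of stochastic policies.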
### Results of [Kakade and Langford (2002)](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf):

A local approximation of $\eta(\tilde \pi)$ is proposed, $L_{\pi}(\tilde\pi)$, such that:

$L_{\pi}(\tilde\pi) = \eta(\pi) + \Sigma_{s}P_{\pi}(s)\Sigma_{a}\tilde\pi(a|s)A_{\pi}(s,a)$

This local approximation uses the discounted state visitation frequency of policy $\pi$, i.e. it ignores the difference in state visitation measure between $\pi$ and $\tilde \pi$. It matches $\eta$ to first order:

$L_{\pi_{\theta_{0}}}(\pi_{\theta_{0}}) = \eta(\pi_{\theta_{0}})$

$\nabla_{\theta}L_{\pi_{\theta_0}}(\pi_{\theta}) = \nabla_{\theta}\eta(\pi_{\theta})$, at $\theta = \theta_{0}$.

This result tells us that a sufficiently small step from $\theta_{0}$ that improves $L_{\pi_{\theta_{0}}}$ will also improve the true performance measure $\eta(\pi_{\theta})$.

#### Conservative Policy Iteration:

$\pi_{old} \rightarrow$ current policy

$\tilde \pi \rightarrow argmax_{\tilde \pi}L_{\pi_{old}}(\tilde\pi)$

The update is then the mixture policy $\pi_{new}(a|s) = (1-\alpha)\pi_{old}(a|s) + \alpha\tilde\pi(a|s)$

Kakade and Langford derived the bound:

$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^{2}$

where $\epsilon = max_{s}|E_{a \sim \tilde\pi(\cdot|s)}[A_{\pi_{old}}(s,a)]|$

### Monotonic Improvement Guarantee for General Stochastic Policies:

The main theoretical result of this paper is that a similar bound holds for general stochastic policy updates, not just mixture policies. The authors use a distance measure between the old and new policy, the total variation divergence, and change $\epsilon$ appropriately.

#### Total Variation Divergence:

For discrete probability distributions $p,q$, $D_{TV}$ is defined as

$D_{TV}(p||q) = \frac{1}{2}\Sigma_{i}|p(i)-q(i)|$

Defining $D_{TV}^{max}$ as

$D_{TV}^{max}(\pi,\tilde \pi) = max_{s}D_{TV}(\pi(\cdot|s)||\tilde \pi(\cdot|s))$

## Theorem 1:

![](https://i.imgur.com/zvWXa3X.png)

### Bringing it all together:

In this section we will discuss the main theoretical results that this paper offers and the entire proof of theorem 1. Sit back, take a sip of your coffee and relax.

We've already proved that,

$\eta(\tilde \pi) = \eta(\pi) + E_{\tau \sim \tilde\pi}[\Sigma_{t=0}^{\infty} \gamma^{t}A_{\pi}(s_{t},a_{t})]$

Let $\overline A(s)$ be the expected advantage of $\tilde \pi$ over $\pi$ at a state $s$:

$\overline A(s) = E_{a \sim \tilde\pi(\cdot|s)}[A_{\pi}(s,a)]$

Hence the topmost equation can be written as,

$\eta(\tilde \pi) = \eta(\pi) + E_{\tau \sim \tilde\pi}[\Sigma_{t=0}^{\infty}\gamma^{t} \overline A(s_{t})]$

The approximation $L_{\pi}(\tilde\pi)$ can similarly be written as,

$L_{\pi}(\tilde\pi) = \eta(\pi) + E_{\tau \sim \pi}[\Sigma_{t=0}^{\infty}\gamma^{t} \overline A(s_{t})]$

The only thing that changes in the above equation is that the states are sampled according to policy $\pi$, even though the actions used to compute the advantage are still sampled from $\tilde\pi$ at every step. The notation is abused a little here, but it is important to understand the meaning.

Our main goal is to bound the difference between $L_{\pi}(\tilde\pi)$ and $\eta(\tilde\pi)$. For this we first define a couple of things:

#### Alpha-coupled policies:

![](https://i.imgur.com/atR7zau.png)

It should be evident that if $\pi$ and $\tilde\pi$ are similar then $\alpha$ is small. In simpler terms, the definition says that if we sample one action from each policy at the same state, the two actions agree at least a fraction $(1-\alpha)$ of the time.
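To see why an $\alpha$-coupling always exists with $\alpha = D_{TV}^{max}(\pi,\tilde \pi)$, here is a sketch of the standard maximal-coupling construction for the two action distributions at a single state. This is standard probability rather than code from the paper, and the function name and example numbers are mine; it draws a pair $(a,\tilde a)$ with the correct marginals such that $p(a \neq \tilde a)$ equals the total variation divergence.

```python
# Maximal coupling of two discrete action distributions p = pi(.|s), q = pi_tilde(.|s):
# the pair (a, a_tilde) has the right marginals and P(a != a_tilde) = D_TV(p, q).
import numpy as np

def sample_maximal_coupling(p, q, rng):
    overlap = np.minimum(p, q)            # pointwise min of the two pmfs
    stay = overlap.sum()                  # = 1 - D_TV(p, q)
    if rng.random() < stay:
        a = rng.choice(len(p), p=overlap / stay)
        return a, a                       # agree with probability 1 - D_TV
    # otherwise draw each action from its (normalized) leftover mass;
    # the leftovers have disjoint supports, so the actions always differ here
    a = rng.choice(len(p), p=(p - overlap) / (1 - stay))
    a_tilde = rng.choice(len(q), p=(q - overlap) / (1 - stay))
    return a, a_tilde

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
draws = [sample_maximal_coupling(p, q, rng) for _ in range(100_000)]
disagree = np.mean([a != b for a, b in draws])
print(disagree, 0.5 * np.abs(p - q).sum())   # both close to D_TV(p, q) = 0.2
```

Doing this independently at every state couples the two policies with $\alpha = D_{TV}^{max}(\pi,\tilde\pi)$, which is why the bound of Theorem 1 can be stated in terms of the total variation divergence.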
#### Lemma 2:

Assume that $\pi$ and $\tilde\pi$ are $\alpha$-coupled, and let

$$\overline A(s) = E_{\tilde a \sim \tilde\pi}[A_{\pi}(s,\tilde a)]$$

$\overline A(s)$ is the expected advantage of taking an action according to $\tilde\pi$ over taking an action according to $\pi$ at a state $s$. If it is positive then $\tilde\pi$ works better than $\pi$ at state $s$, if it is zero then the two policies perform equally well there, and if it is negative then $\pi$ is better at state $s$.

Since $E_{a \sim \pi}[A_{\pi}(s,a)] = 0$, we can write

$$\overline A(s) = E_{(a,\tilde a) \sim (\pi,\tilde\pi)}[A_{\pi}(s,\tilde a) - A_{\pi}(s,a)]$$

$$ = p(a \neq \tilde a)E_{(a,\tilde a) \sim (\pi,\tilde\pi)|a \neq \tilde a}[A_{\pi}(s,\tilde a) - A_{\pi}(s,a)] + 0$$

(the term conditioned on $a = \tilde a$ vanishes because the two advantages cancel). Since $p(a \neq \tilde a) \leq \alpha$,

$$|\overline A(s)| \leq \alpha E_{(a,\tilde a) \sim (\pi,\tilde\pi)|a \neq \tilde a}[|A_{\pi}(s,\tilde a) - A_{\pi}(s,a)|]$$

Since the difference of the advantages of any two actions is at most twice the maximum absolute advantage over all actions,

$$|\overline A(s)| \leq \alpha E_{(a,\tilde a) \sim (\pi,\tilde\pi)|a \neq \tilde a}[2max_{a}|A_{\pi}(s,a)|]$$

$$ \leq 2\alpha \, max_{a}|A_{\pi}(s,a)|$$

#### Lemma 3:

We know that,

$L_{\pi}(\tilde\pi) = \eta(\pi) + E_{\tau \sim \pi}[\Sigma_{t=0}^{\infty}\gamma^{t}\overline A(s_{t})]$

To bound the difference between $L_{\pi}(\tilde\pi)$ and $\eta(\tilde\pi)$ we need to bound, at every time-step $t$, the gap between the expected advantage of $\tilde\pi$ under the two state distributions.

Let $\tau$ be a trajectory obtained by sampling from $\pi$ and $\tilde\tau$ a trajectory obtained by sampling from $\tilde\pi$. Let $n_{t}$ be the random variable counting the number of times $a_{i} \neq \tilde a_{i}$ for $i<t$, i.e. the number of time-steps before $t$ at which the actions sampled by the coupled pair $(\pi,\tilde\pi)$ disagree.

Expected advantage at time-step $t$ of policy $\tilde\pi$ if policy $\pi$ is followed:

$E_{s_{t} \sim \pi}[\overline A(s_{t})] = p(n_{t}=0)E_{s_{t} \sim \pi|n_t=0}[\overline A(s_{t})] + p(n_{t}>0)E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]$

Expected advantage at time-step $t$ of policy $\tilde\pi$ if policy $\tilde\pi$ is followed:

$E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})] = p(n_{t}=0)E_{s_{t} \sim \tilde\pi|n_t=0}[\overline A(s_t)] + p(n_{t}>0)E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_t)]$

In both of the above cases the advantage is, as before, the advantage of policy $\tilde \pi$ over policy $\pi$ at the state $s_{t}$. The expectation over states is split into two conditional expectations whose conditions are mutually exclusive and exhaustive, as they should be. The conditional distributions $s_{t} \sim \pi|n_t=0$ and $s_{t} \sim \tilde\pi|n_t=0$ are identical, and hence the corresponding expectations are equal (under this condition the coupled policies never disagreed before time-step $t$, so they visited exactly the same states).
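As a quick aside before the proof continues, here is a single-state numerical check of Lemma 2. The advantage values and action distributions are random illustrative choices of mine, and $\alpha$ is taken to be the total variation divergence at that state, which the maximal-coupling sketch above shows is always achievable.

```python
# Single-state check of Lemma 2: |A_bar(s)| <= 2 * alpha * max_a |A_pi(s,a)|,
# with alpha = D_TV(pi(.|s), pi_tilde(.|s)). All values are illustrative.
import numpy as np

rng = np.random.default_rng(2)
nA = 5
pi_s = rng.random(nA);       pi_s /= pi_s.sum()        # pi(.|s)
pi_new_s = rng.random(nA);   pi_new_s /= pi_new_s.sum()# pi_tilde(.|s)

Q_s = rng.normal(size=nA)               # some Q_pi(s, .) values
A_s = Q_s - pi_s @ Q_s                  # A_pi(s, .), so E_{a~pi}[A_pi(s,a)] = 0

A_bar = pi_new_s @ A_s                  # expected advantage of pi_tilde at s
alpha = 0.5 * np.abs(pi_s - pi_new_s).sum()            # D_TV at this state

print(abs(A_bar), 2 * alpha * np.abs(A_s).max())
assert abs(A_bar) <= 2 * alpha * np.abs(A_s).max()     # Lemma 2
```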
Subtracting the two equations, the $n_t=0$ terms cancel and we obtain:

$E_{s_{t} \sim \pi}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})]= p(n_{t}>0)(E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})])$

By the definition of $\alpha$-coupled policies and the multiplication rule of probability (the coupled pair agrees with probability at least $1-\alpha$ at each of the $t$ earlier steps),

$p(n_t=0) \geq (1-\alpha)^{t}$

It follows that,

$p(n_t>0) \leq 1-(1-\alpha)^{t}$

Using this and taking absolute values on both sides we obtain,

$|E_{s_{t} \sim \pi}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})]| \leq (1-(1-\alpha)^t)|E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})]|$

Using the triangle inequality,

$|E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})]| \leq |E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]| + |E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})]|$

Using lemma 2, and additionally taking the max over states (like the paper does, which only weakens the bound), each of the two terms on the right is at most $2\alpha \, max_{s,a}|A_{\pi}(s,a)|$, so

$|E_{s_{t} \sim \pi|n_t>0}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi|n_t>0}[\overline A(s_{t})]| \leq 4\alpha \, max_{s,a}|A_{\pi}(s,a)|$

Substituting this result,

$|E_{s_{t} \sim \pi}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})]| \leq 4\alpha(1-(1-\alpha)^{t})max_{s,a}|A_{\pi}(s,a)|$

The above bound is for the approximate and true expected advantage of $\tilde\pi$ at a single time-step $t$, but we need the bound for $L_{\pi}(\tilde\pi)$ and $\eta(\tilde\pi)$, which are discounted sums of these terms over all time-steps. Applying the triangle inequality to the sum,

$|L_{\pi}(\tilde\pi) - \eta(\tilde\pi)| \leq \Sigma_{t=0}^{\infty}\gamma^{t}|E_{s_{t} \sim \pi}[\overline A(s_{t})]-E_{s_{t} \sim \tilde\pi}[\overline A(s_{t})]|$

$|L_{\pi}(\tilde\pi) - \eta(\tilde\pi)| \leq \Sigma_{t=0}^{\infty}\gamma^{t}4\alpha(1-(1-\alpha)^t)max_{s,a}|A_{\pi}(s,a)|$

Let $\epsilon = max_{s,a}|A_{\pi}(s,a)|$. Then,

$|L_{\pi}(\tilde\pi) - \eta(\tilde\pi)| \leq 4\alpha\epsilon\Sigma_{t=0}^{\infty}\gamma^{t}(1-(1-\alpha)^t)$

The rest is a geometric-series computation,

$\Sigma_{t=0}^{\infty}\gamma^{t}(1-(1-\alpha)^t) = \frac{1}{1-\gamma} - \frac{1}{1-(1-\alpha)\gamma} = \frac{\alpha\gamma}{(1-\gamma)(1-(1-\alpha)\gamma)} \leq \frac{\alpha\gamma}{(1-\gamma)^{2}}$

which gives

$|L_{\pi}(\tilde\pi) - \eta(\tilde\pi)| \leq \frac{4\alpha^{2}\gamma\epsilon}{(1-\gamma)^{2}}$
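Putting everything together, Theorem 1 follows by coupling $\pi$ and $\tilde\pi$ with $\alpha = D_{TV}^{max}(\pi,\tilde\pi)$, giving $\eta(\tilde\pi) \geq L_{\pi}(\tilde\pi) - \frac{4\epsilon\gamma}{(1-\gamma)^{2}}\alpha^{2}$. Below is an end-to-end numerical check of this bound on a small random tabular MDP; the setup mirrors the earlier Identity 1 snippet and is again an illustrative assumption of mine (random dynamics, $r(s)$ rewards), not the paper's code.

```python
# End-to-end check of the Theorem 1 lower bound on a tiny random tabular MDP.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P(s'|s,a)
r = rng.random(nS)                                                # r(s)
rho0 = rng.random(nS); rho0 /= rho0.sum()                         # P_0
pi, pi_new = rng.random((nS, nA)), rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True); pi_new /= pi_new.sum(axis=1, keepdims=True)

def solve(policy):
    """Return (eta, V, discounted state visitation) for a tabular policy."""
    P_pol = np.einsum('sa,sap->sp', policy, P)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pol, r)
    rho = rho0 @ np.linalg.inv(np.eye(nS) - gamma * P_pol)
    return rho0 @ V, V, rho

eta_pi, V_pi, rho_pi = solve(pi)
eta_new, _, _ = solve(pi_new)
A_pi = r[:, None] + gamma * P @ V_pi - V_pi[:, None]      # A_pi(s,a) = Q - V

L = eta_pi + rho_pi @ np.einsum('sa,sa->s', pi_new, A_pi) # surrogate L_pi(pi_new)
alpha = 0.5 * np.abs(pi - pi_new).sum(axis=1).max()       # D_TV^max(pi, pi_new)
eps = np.abs(A_pi).max()                                  # max_{s,a} |A_pi(s,a)|

bound = L - 4 * eps * gamma / (1 - gamma) ** 2 * alpha ** 2
print(eta_new, L, bound)
assert eta_new >= bound                                   # Theorem 1 lower bound
```

For arbitrary random policies the penalty term tends to be quite loose; the bound is a worst-case guarantee rather than a tight estimate of $\eta(\tilde\pi)$.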