# Notes on "[High-Dimensional Continuous Control Using Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438)"
###### tags: `GAE` `research paper notes` `advantage` `baselines`
#### Author
[Raj Ghugare](https://github.com/RajGhugare19)
## Note:
These notes only cover the the advantage estimation part. Other stuff in the paper about previous state of the art algos like TRPO are not reviewed.
## A brief outline:
The paper addresses the problems of vanilla policy gradient methods.
1) High variance in the gradient estimates.
At the cost of some bias they aim to reduce variance by using the notion of value functions as baselines.
2) Large amount of sample data requirements (trajectories).
Trust region optimization is the solution offered to this problem.
## Preliminaries:
### Performance measure:
The policy gradient method tries to maximize the expected total reward by calculating the gradient with respect to the total return $\nabla_{\theta}[\Sigma_{1}^{\infty}{r_t}]$
### Policy Gradient:
Using the Policy Gradient Theorem we can estimate the gradient in several ways of the form:
$$E[\Sigma_{0}^{\infty} \phi_{t} \nabla log_{e}{\pi(a|s;\theta)}]$$
where $\phi_{t}$ can be any of the following things:
1) $\Sigma_{t=0}^{\infty} r_{t}$
2) $\Sigma_{t'=t}^{\infty} r_{t'}$
3) $\Sigma_{t'=t}^{\infty} r_{t'}-b(s_{t})$
4) $Q^{\pi}(s_t,a_t)$
5) $A^{\pi}(s_t,a_t)$
6) $r_{t} + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$
The fifth choice for $\phi_{t}$, i.e. the advantage, function leads to lowest possible variance.They have provided a good intuition about this that the policy will be moved in the direction of increasing $\pi_{\theta}(a_t|s_t)$ only if the action $a_t$ is expected to be better than the policies expected outcome.
In RL, gamma is many a times considered as a part of the problem rather than as an hyper-parameter. Here,$\gamma$ is used as a variance reducing parameter as in down weighs all the rewards. Introducing $\gamma$ gets us to a lower variance and approximate estimate of the policy evaluation gradient called $g^{\gamma}$.But it comes at the cost of some bias.
$$g^{\gamma} = E_{S_{0:\infty},A_{0:\infty}}[\Sigma_{t=0}^{\infty} A^{\pi,\gamma}(s_{t},a_{t})\nabla log_{e}{\pi(a|s;\theta)}]$$
### $\gamma$ - Just estimators :
What is $\gamma$-just estimator for $A^{\pi,\gamma}(s_{t},a_{t})$?
An estimator for $A^{\pi,\gamma}(s_{t},a_{t})$ is $\gamma$-just if it gives an unbiased estimate of $g^{\gamma}$
i.e.
$$E_{S_{0:\infty},A_{0:\infty}}[\Sigma_{t=0}^{\infty} \hat{A}(s_{t},a_{t})\nabla log_{e}{\pi(a|s;\theta)}] = g^{\gamma}$$
### Proposition 1:
![proposition-1](https://i.imgur.com/8Mu3n2y.png)
The proposition 1 in the paper states another of proving that an estimator of $A^{\pi,\gamma}(s_{t},a_{t})$ is $\gamma$ just or not.
if the following is true
$$\hat{A}(s_{t},a_{t}) = Q_{t}(s_{t},a_{t}) - b_{t}(s_{0:t},a_{0:t-1})$$
such that
$$E_{s_{t:\infty},a_{t+1:\infty}|s_{t},a_{t}}[Q_{t}(s_{t:\infty},a_{t:\infty})] = Q^{\pi,\gamma}(s_{t},a_{t})$$
Then $A^{\pi,\gamma}(s_{t},a_{t})$ is $\gamma$ just.
### Advantage function estimation
This section is about finding an accurate estimate of a $\gamma$ just estimate of the advantage function.
$V$ - Approximate Value function
$\delta^{v}_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t}) \rightarrow$ TD(0) error
If $V = V^{\pi,\gamma}$ then $E[\delta_{t}^{V}]$ is $\gamma$ just.
After adding these TD errors for time t to infinity we get
$$\hat{A_{t}}^{(\infty)} = \Sigma_{l=0}^{\infty} \gamma^{l} \delta^{V}_{t+1} = -V(s_{t}) + \Sigma_{l=0}^{\infty} \gamma^{l} r_{t+l}$$
$\hat{A_{t}}^{(\infty)}$ is an unbiased estimator of the advantage function but would be very noisy due to the sum of all random variables($r_{t}$).
The paper uses an already existing trend in RL (The bridge between monte carlo and TD) to define a generalised advantage estimator GAE($\gamma, \lambda$)
$$\hat{A_{t}}^{GAE(\gamma, \lambda)} = (1-\lambda)(\hat{A_{t}}^{(1)} + \lambda \hat{A_{t}}^{(2)} + \lambda^2 \hat{A_{t}}^{(3)} ....)$$
After simplification we finally reach a pretty compact term
$$\hat{A_{t}}^{GAE(\gamma, \lambda)} = \Sigma_{l=0}^{\infty} (\gamma \lambda)^{l} \delta_{t+l}^{V}$$
The two notable cases are $GAE(\gamma,0)$ and $GAE(\gamma,1)$, where the former is low variance and highly biased while the latter has low bias but high variance.
Empirically they find that the best value for $\lambda$ is much lower than the best value of $\gamma$, likely because higher values of $\lambda$ will add a lot of noise and lower values of $\gamma$ will introduce a lot of bias.
## Interpretation as Reward shaping
### Reward Shaping:
Reward shaping refers to the following transformation of the reward function.
let
$$\phi : S \rightarrow R$$
be and arbitrary scalar valued function on the state space. The transformed reward function is defined by:
$$\tilde{r}(s,a,s') = r(s,a,s') + \gamma \phi(s') - \phi(s)$$
This change in the reward function defines a transformed MDP. When we consider the discounted sum of modified rewards of a trajectory starting from state $s_{t}$, this results in a telescoping sum:
$$\Sigma_{l=0}^{\infty} \gamma^{l} \tilde{r}(s_{t+l},a_{t},s_{t+l+1}) = \Sigma_{l=0}^{\infty} \gamma^{l} r(s_{t+l},a_{t=l},s_{t+l+1}) - \phi(s_{t})$$
The new value function, Q function and advantage function for the modified MDP will be,
$$\tilde{V}^{\pi,\gamma}(s,a) = V^{\pi,\gamma}(s) - \phi(s)$$
$$\tilde{Q}^{\pi,\gamma}(s,a) = Q^{\pi,\gamma}(s,a) - \phi(s)$$
From the above two equations we can say that the advantage function remains the same.
To further analyze the effect of this reward shaping they have introduced a response function $X$
$$X(l;s_{t},a_{t}) = E[r_{t+1}|s_{t},a_{t}] - E[r_{t+1}|s_{t}]$$
$$A^{\pi,\gamma}(s,a) = Q^{\pi,\gamma}(s,a) - V^{\pi,\gamma}(s,a)$$
$$ = E_{s_{1:\infty},a_{0:\infty}}[\Sigma_{l=0}^{\infty}\gamma^{l}r_{l}] - E_{s_{0:\infty}, a_{0:\infty}}[\Sigma_{l=0}^{\infty} \gamma^{l}r_{l}]$$
$$= \Sigma_{l=0}^{\infty} E_{s_{1:\infty},a_{0:\infty}}[\gamma^{l}r_{l}] - \Sigma_{l=0}^{\infty} E_{s_{0:\infty}, a_{0:\infty}}[\gamma^{l}r_{l}]$$
Since $r_l$ is is a random variable which only depends on $s_{l}$ and $a_{l}$ we can modify the equation in the following way,
$$= \Sigma_{l=0}^{\infty}E[\gamma^{l} r_{l}|s_{l},a_{l}] - \Sigma_{l=0}^{\infty} E[\gamma^{l}r_{l}|s_{l}]$$
$$ = \Sigma_{l=0}^{\infty} \gamma^{l} X(l;s,a)$$
The response function will have non zero values for $l>>0$ if the action taken at time $0$ will have late credit assignment.
The value function estimation is done using existing TRPO techniques.
The final algorithm they use is as follows:
![](https://i.imgur.com/YK034Mp.png)