REINFORCE
===
Author [Raj Ghugare](https://github.com/Raj19022000)

Research Paper : [Policy Gradient Methods for Reinforcement Learning with Function Approximation](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf)

###### tags: `notes` `REINFORCE` `Policy-gradients`

## Review and research notes

### Introduction

Policy-based algorithms are the ones where we parameterize our policy and optimize the parameters based on an objective function (the expected return of the policy). REINFORCE is a classical, vanilla approach that works fairly well on episodic problems. The most common ways to parameterize a policy are a softmax distribution when the action space is finite and small, and a Gaussian distribution with parameterized mean and variance when the action space is continuous. Neural-network parameterizations are the most popular due to the recent success of deep learning.

We will look at REINFORCE for episodic problems, i.e. problems which have at least one terminal state. Let's get some basic things cleared first:

* $\pi(a|s;\theta) \rightarrow$ policy parameterized by $\theta$
* Samples will be trajectories of an episode, of the form $<S_{0},A_{0},R_{1},\dots,S_{T-1},A_{T-1},R_{T}>$
* We will measure the performance of our policy through its returns
* We will assume that we have a fixed starting state $S_{0}$ for our algorithm. For example, in the Cart-pole problem the pole always starts in a fixed upright position, and a board game always starts from a fixed starting position.
* Our goal is to find a policy which gives the maximum expected return from the starting state

### Derivation of Gradient-based REINFORCE

Let $J(\theta)$ be the expected return obtained by following a policy $\pi(a|s;\theta)$. Maximizing it becomes an optimization problem, and we normally use gradient ascent to solve it (ascent because we have to move our parameters in the direction that increases $J(\theta)$).

Let us try to express $J(\theta)$ mathematically:

$$J(\theta) = E_{T_{i} \sim \pi_{\theta}}[G_{i}]$$

Basically, $J(\theta)$ is the average of the returns over many trajectories ($T_{i}$'s) generated by following policy $\pi_{\theta}$. $G_{i}$ is the discounted return of a finite episode. The probability of a particular trajectory occurring, given that we are following policy $\pi_{\theta}$, is:

$$P(T_{i}|\theta) = \pi(A_{0}|S_{0};\theta)\,p(S_{1}|S_{0},A_{0})\,\pi(A_{1}|S_{1};\theta)\dots p(S_{T}|S_{T-1},A_{T-1})$$

It is important to note that we don't know the value of this probability because it involves the dynamics of the environment.
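To make the parameterization and the sampling process concrete, here is a minimal sketch (my own illustration, not code from the paper) of a linear-softmax policy. It assumes a hypothetical `env` object whose `reset()` returns the initial state as a feature vector and whose `step(a)` returns `(next_state, reward, done)`. The point of the sketch is that we only ever *sample* transitions by calling the environment; we never need to evaluate $p(s'|s,a)$ itself.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(.|s; theta) for a linear-softmax parameterization.

    theta has shape (n_features, n_actions); s is the feature vector of the state.
    """
    logits = s @ theta                # one score per action
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_trajectory(env, theta, gamma=0.99, rng=None):
    """Roll out one episode <S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T> under pi_theta.

    env is a hypothetical environment: env.reset() -> s_0, env.step(a) -> (s', r, done).
    We only sample from the dynamics; we never evaluate p(s'|s,a).
    """
    rng = rng if rng is not None else np.random.default_rng()
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax_policy(theta, s)
        a = rng.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    # discounted return G_i of the whole episode: R_1 + gamma*R_2 + ...
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    return states, actions, rewards, G
```

A neural-network policy would simply replace `s @ theta` with a forward pass; everything else stays the same.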
The gradient ascent update is:

$$\theta = \theta + \alpha \nabla_{\theta}J(\theta)$$

By writing out the expectation operator, $J(\theta)$ can be expressed as:

$$J(\theta) = \Sigma_{T_{i} \sim \pi_{\theta}} P(T_{i}|\theta)\, G_{i}$$

$$\nabla J(\theta) = \nabla \Sigma_{T_{i} \sim \pi_{\theta}} P(T_{i}|\theta)\, G_{i}$$

$$= \Sigma_{T_{i} \sim \pi_{\theta}} \nabla[P(T_{i}|\theta)]\, G_{i}$$

Dividing and multiplying by $P(T_{i}|\theta)$:

$$= \Sigma_{T_{i} \sim \pi_{\theta}} \frac{\nabla[P(T_{i}|\theta)]\, G_{i}\, P(T_{i}|\theta)}{P(T_{i}|\theta)}$$

Because $\frac{\nabla f(x)}{f(x)} = \nabla \log f(x)$,

$$= \Sigma_{T_{i} \sim \pi_{\theta}} \nabla[\log_e P(T_{i}|\theta)]\, P(T_{i}|\theta)\, G_{i}$$

Simplifying $\nabla \log_e P(T_{i}|\theta)$:

$$\nabla \log_e P(T_{i}|\theta) = \nabla \log_e[\pi(A_{0}|S_{0};\theta)\,p(S_{1}|S_{0},A_{0})\,\pi(A_{1}|S_{1};\theta)\dots p(S_{T}|S_{T-1},A_{T-1})]$$

$$= \nabla \log_e\pi(A_{0}|S_{0};\theta) + \nabla \log_e p(S_{1}|S_{0},A_{0}) + \nabla \log_e\pi(A_{1}|S_{1};\theta) + \dots + \nabla \log_e p(S_{T}|S_{T-1},A_{T-1})$$

The terms which are independent of $\theta$ have a gradient of 0:

$$= \nabla \log_e\pi(A_{0}|S_{0};\theta) + \nabla \log_e\pi(A_{1}|S_{1};\theta) + \dots + \nabla \log_e\pi(A_{T-1}|S_{T-1};\theta)$$

Substituting this in the equation for $\nabla J(\theta)$, where $|T_{i}|$ is the number of time steps in trajectory $T_{i}$:

$$\nabla J(\theta) = \Sigma_{T_{i} \sim \pi_{\theta}}\Sigma_{j=0}^{|T_{i}|-1} \nabla \log_{e}\pi(A_{j}|S_{j};\theta)\, P(T_{i}|\theta)\, G_{i}$$

$$= E_{T_{i} \sim \pi_{\theta}}\Big[\Sigma_{j=0}^{|T_{i}|-1} \nabla \log_{e}\pi(A_{j}|S_{j};\theta)\, G_{i}\Big]$$

To approximate this expectation we can use multiple sample trajectories obtained by following policy $\pi_{\theta}$ and average over them. REINFORCE is like a stochastic gradient ascent algorithm where we use just one trajectory per step to estimate the gradient. Hence:

$$\nabla J(\theta) \approx \Sigma_{j=0}^{|T_{i}|-1} \nabla \log_{e}\pi(A_{j}|S_{j};\theta)\, G_{i}$$

This is the final form of the algorithm:

1) Input: a differentiable policy parameterization $\pi(a|s;\theta)$
2) Algorithm parameter: step size $\alpha > 0$
3) Initialize policy parameter $\theta \in R^{d}$ (e.g. to 0)

Loop forever (for each episode):
* Generate an episode $<S_{0},A_{0},R_{1},\dots,S_{T-1},A_{T-1},R_{T}>$, following policy $\pi(a|s;\theta)$.
* Loop for each step of the episode $t=0,1,\dots,T-1$:
    * $G := \Sigma_{k=t+1}^{T} \gamma^{k-t-1}R_{k}$
    * $\theta := \theta + \alpha \gamma^{t} G\, \nabla\log_e\pi(A_{t}|S_{t};\theta)$

Note that there are some differences between our estimate of $\nabla J(\theta)$ and the algorithm's update. In our estimate we used the total return of the trajectory, $G_{i}$, for every time step, but the algorithm's update at time step $t$ uses only the part of the return obtained after $t$, i.e. the algorithm corresponds to

$$\nabla J(\theta) \approx \Sigma_{j=0}^{|T_{i}|-1} \nabla \log_{e}\pi(A_{j}|S_{j};\theta)\,\gamma^{j}\Sigma_{k=j+1}^{T}\gamma^{k-j-1}R_{k}$$

The part that is left out is:

$$\Sigma_{j=0}^{|T_{i}|-1}\nabla \log_{e}\pi(A_{j}|S_{j};\theta)\,\Sigma_{k=1}^{j}\gamma^{k-1}R_{k}$$

Often the $\gamma^{t}$ factor in the update is also left out, and that surprisingly gives better results. There are various modifications to this algorithm which empirically give better results.
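As a rough illustration of the update rule in the algorithm box above (a sketch under the linear-softmax assumption, not the paper's code), here is one REINFORCE update computed from a single episode. It assumes the hypothetical `softmax_policy` and `sample_trajectory` helpers sketched earlier; for a linear-softmax policy the score is $\nabla_{\theta}\log\pi(a|s;\theta) = s\,(\mathbf{1}_{a} - \pi(\cdot|s;\theta))^{\top}$.

```python
import numpy as np

def grad_log_pi(theta, s, a):
    """grad_theta log pi(a|s; theta) for the linear-softmax policy sketched earlier."""
    probs = softmax_policy(theta, s)          # assumed helper from the earlier sketch
    one_hot = np.zeros_like(probs)
    one_hot[a] = 1.0
    return np.outer(s, one_hot - probs)       # shape (n_features, n_actions), like theta

def reinforce_update(theta, states, actions, rewards, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a single sampled episode.

    rewards[0] is R_1, ..., rewards[T-1] is R_T.  For each step t,
    G is the discounted return obtained *after* t, and
    theta <- theta + alpha * gamma^t * G * grad log pi(A_t|S_t; theta).
    """
    T = len(rewards)
    for t in range(T):
        # G := sum_{k=t+1}^{T} gamma^(k-t-1) R_k
        G = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1))
        theta = theta + alpha * (gamma ** t) * G * grad_log_pi(theta, states[t], actions[t])
    return theta
```

A training loop would simply alternate `sample_trajectory` and `reinforce_update` for many episodes; dropping the `gamma ** t` factor gives the common variant mentioned above.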
The left-out part of the gradient above reinforces the policy towards rewards obtained in the past; it has zero mean and only contributes noise (it has finite variance). The intuition is that rewards which were obtained before any time-step $t$ should not affect the update for the action taken at time-step $t$. I don't know a rigorous mathematical proof of this, but if anyone does please do share :)
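Not a proof, but one ingredient of the claim can be checked numerically: for a fixed state $s$, the expected score $E_{a \sim \pi(\cdot|s)}[\nabla_{\theta}\log\pi(a|s;\theta)]$ is zero, and the past rewards multiplying the score are constants with respect to the action being sampled at that step. The self-contained sketch below (again assuming a hypothetical linear-softmax parameterization) estimates that expectation by sampling and shows it is close to zero.

```python
import numpy as np

def softmax_probs(theta, s):
    """pi(.|s; theta) for a linear-softmax policy (same form as the earlier sketches)."""
    logits = s @ theta
    logits -= logits.max()
    e = np.exp(logits)
    return e / e.sum()

def score(theta, s, a):
    """grad_theta log pi(a|s; theta) for the linear-softmax policy."""
    probs = softmax_probs(theta, s)
    one_hot = np.zeros_like(probs)
    one_hot[a] = 1.0
    return np.outer(s, one_hot - probs)

# Monte-Carlo estimate of E_{a ~ pi(.|s)}[ grad_theta log pi(a|s; theta) ].
# Past rewards are constants w.r.t. the action sampled at time t, so if this
# expectation is ~0, multiplying it by those constants keeps it ~0 in expectation.
rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = rng.normal(size=(n_features, n_actions))
s = rng.normal(size=n_features)

probs = softmax_probs(theta, s)
samples = [score(theta, s, rng.choice(n_actions, p=probs)) for _ in range(100_000)]
print(np.abs(np.mean(samples, axis=0)).max())   # prints a value close to 0
```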