# Policy Gradient Theorem and equivalent Bellman Updates
> [name=Riashat Islam] Policy gradient update can be written as an equivalent Bellman update.
>
TODO:
- Write GVF algorithm with vector cumulants (like [algo 1](#GVF-for-Policy-Gradient) from the GVF paper)
- Write the equivalent vector reward without GVF
- How is it better if we are learning h when in vector form?
- Since h is a cumulant, can we exploit the GVF in terms of transfer learning!? Learn new tasks without learning h???
- Can we learn faster with h instead of vector rewards without bootstrap gradients?
#### What
* We can reformulate the policy gradient update as a Bellman equation itself.
* We can write down an RL-specific optimizer for policy gradients, where we update the policy gradient in a TD-like manner (i.e., a TD-based gradient update, based on the past gradient at $(s,a)$ and the estimated gradient at the next $(s,a)$)
* Can we interpret the TD gradient term, with eligibility traces, as a type of optimizer specifically suited for RL and policy optimization?
#### Why
* We can leverage all of the research in TD learning to learn the policy gradient itself, i.e., we can do value/policy iteration and Q-learning style updates for the policy gradient. This may come with convergence guarantees for PGs. It is also worthwhile to think of it from an optimization perspective (the relation between TD learning and SGD).
\begin{align*}
\def\balpha{\unicode{x1D6C2}}
\def\blambda{\unicode{x1D6CC}}
\def\brho{\unicode{x1D746}}
\def\mat#1{\mathbf{#1}}
\def\beye{\mathbf{I}}
\def\bvartheta{\unicode{x1D751}}
\def\btheta{\unicode{x1D73D}}
\def\ind{\unicode{x1D7D9}}
\def\inv{{-1}}
\end{align*}
<!-- ## Background: Bellman
Recall the Bellman equation for the value function:
\begin{equation}
v_{\pi}(s) = \sum_{a} \pi(a \mid s) \left[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) v_{\pi}(s') \right]
\end{equation}
-->
## Introduction
The starting point is to realize that the policy gradient expression can be written as an equivalent Bellman equation (as also noted in Bacon et al.).
Policy gradient methods can be described as performing approximation in the policy space, and they closely resemble policy iteration: in both cases the search for the optimal policy is guided by a policy evaluation procedure. Policy gradient methods perform their improvement step using the gradient of some objective, whereas policy iteration obtains the improved policy by acting greedily with respect to the current value estimates.
(Sutton et al., 1999): Given an initial state distribution $\alpha$, the discounted objective for policy gradient methods, in its exact (matrix) form, is given by
\begin{align*}
J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1} \mat r_{\theta}
\end{align*}
With the discounted weighting of states, the equivalent form is
\begin{align*}
J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \mat d^{\top}_{\alpha, \gamma, \theta}\mat r_{\theta}
\end{align*}
where $\mat d^{\top}_{\alpha, \gamma, \theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1}$, is the discounted weighting of states.
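As a quick numerical companion to the expression above, here is a minimal sketch (entirely my own; the MDP, the tabular softmax parameterisation, and all sizes are made up) that evaluates $J_{\alpha}(\theta) = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1} \mat r_{\theta}$ for a small tabular MDP.
```python
# A minimal sketch (hypothetical MDP): the exact discounted objective in matrix form.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=(n_states, n_actions))                       # r(s, a)
alpha = np.ones(n_states) / n_states                               # initial state distribution
theta = rng.normal(size=(n_states, n_actions))                     # tabular softmax parameters

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)                        # pi[s, a]

def discounted_objective(theta):
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)                           # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)                                     # expected reward per state
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)      # v_theta = (I - gamma P_theta)^{-1} r_theta
    return alpha @ v

print(discounted_objective(theta))
```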
**TODO** (Matrix derivatives): For the policy gradient, we need $\nabla_{\theta} J_{\alpha}(\theta)$. What if we take the gradient of the above expression directly in matrix form?
Policy gradient theorem as a Bellman Equation :
\begin{align*}
\nabla_{\theta} v_{\theta}(s) &= \nabla_{\theta} \Big[ \sum_{a} \pi_{\theta}(a,s) Q_{\pi}(s,a) \Big] \\
&= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) Q_{\pi}(s,a) + \pi_{\theta}(a,s) \nabla_{\theta} Q_{\pi}(s,a) \Big]\\
&= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) Q_{\pi}(s,a) + \pi_{\theta}(a,s) \nabla_{\theta} [ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) v_{\pi}(s') ] \Big]\\
&= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) Q_{\pi}(s,a) + \pi_{\theta}(a,s) \gamma \sum_{s'} P(s' \mid s,a) \nabla_{\theta} v_{\pi}(s') \Big]
\end{align*}
This is where we deviate from the expression written in Bacon et al., where the next line is written to highlight the common structure with the policy evaluation equations; it also deviates from the proof in Sutton et al. (1999).
We can write the above expression in a similar policy evaluation form:
\begin{align*}
\nabla_{\theta} v_{\theta}(s) &= \sum_{a} \Big[ \pi_{\theta}(a,s) \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a) + \pi_{\theta}(a,s) \gamma \sum_{s'} P(s' \mid s,a) \nabla_{\theta} v_{\pi}(s') \Big]\\
&= \sum_{a} \pi_{\theta}(a,s) \Big[ \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \nabla_{\theta} v_{\pi}(s') \Big]
\end{align*}
In Bacon et al., however, this is written as:
\begin{align*}
\nabla_{\theta} v_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \Big[ \sum_{a'} \nabla_{\theta} \pi_{\theta}(a',s) Q_{\pi}(s,a') + \gamma \sum_{s'} P(s' \mid s, a) \nabla_{\theta} v_{\theta}(s') \Big]
\end{align*}
Note that these two forms are equivalent: in Bacon et al., the first term is kept as a plain sum over actions (it does not depend on the outer action), whereas we applied the log-derivative trick to write it as an expectation under $\pi_{\theta}$.
**TODO** Check if both forms are correct or not : [Andre]: Yes, they are equivalent.
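As a small numerical check of this equivalence (again my own sketch, with a hypothetical single-state softmax policy and arbitrary $Q$-values), we can compare $\sum_a \nabla_\theta \pi_\theta(a \mid s) Q(s,a)$ against $\sum_a \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s) Q(s,a)$:
```python
# A quick sanity check (my own, not from the text) that the two forms agree.
import numpy as np

n_actions = 4
rng = np.random.default_rng(1)
theta = rng.normal(size=n_actions)          # logits for one state
Q = rng.normal(size=n_actions)              # arbitrary action values

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Form 1: finite-difference gradient of pi, then sum_a grad pi(a) * Q(a)
eps = 1e-6
grad_pi = np.zeros((n_actions, n_actions))  # grad_pi[a, j] = d pi(a) / d theta_j
for j in range(n_actions):
    e = np.zeros(n_actions); e[j] = eps
    grad_pi[:, j] = (pi(theta + e) - pi(theta - e)) / (2 * eps)
form1 = grad_pi.T @ Q

# Form 2: log-derivative trick with the analytic softmax gradient
p = pi(theta)
grad_log_pi = np.eye(n_actions) - p[None, :]          # d log pi(a) / d theta_j = 1{a=j} - pi(j)
form2 = (p[:, None] * grad_log_pi * Q[:, None]).sum(axis=0)

print(np.allclose(form1, form2, atol=1e-5))           # True
```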
## Bellman Equation for the gradient $\nabla_{\theta} v_{\theta}(s)$
We take our derivation and write the equivalent Bellman-type recursive expression as follows.
First, let us define:
\begin{align*}
h_{\theta}(s,a) = \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a)
\end{align*}
\begin{align*}
h_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a)
\end{align*}
where $h_{\theta}(s,a) \in \mathbb{R}^{|\theta|}$ is, for each state-action pair, a vector of the same dimension as the policy parameters (stacking over all $(s,a)$ gives an $|\mathcal{S}||\mathcal{A}| \times |\theta|$ matrix). We can write down the equivalent state-indexed form by summing over actions weighted by the policy, where $h_{\theta}(s) \in \mathbb{R}^{|\theta|}$, so that stacking over states gives an $|\mathcal{S}| \times |\theta|$ matrix. **Note $h_\theta$** is defined as a cumulant below (see [here](#GVF-for-Policy-Gradient)).
Let us also define $f_{\theta}(s) = \nabla_{\theta} v_{\theta}(s)$, so that, taking the policy gradient derivation:
\begin{align*}
\nabla_{\theta} v_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \Big[ \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \nabla_{\theta} v_{\pi}(s') \Big]
\end{align*}
This can now be written in a recursive Bellman form as
\begin{align*}
f_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \Big[ h_{\theta}(s) + \gamma \sum_{s'} P(s' \mid s, a) f_{\theta}(s') \Big]
\end{align*}
where note that $h_{\theta}(s)$ and $f_{\theta}(s)$ are vectors rather than scalars. The difference here is that we have vector-valued rewards instead of scalar ones, but vector-valued rewards do not require any special theoretical or algorithmic considerations (Bacon et al.).
### Closed Form solution
We can write this in an equivalent vector form as:
\begin{align*}
\mat f_{\theta} = \mat h_{\theta} + \gamma \mat P_{\theta} \mat f_{\theta}
\end{align*}
Therefore, the exact solution for $\nabla_{\theta} v_{\theta}(s)$ can be written as :
\begin{align*}
\mat f_{\theta} = (\mat I - \gamma \mat P_{\theta})^{-1} \mat h_{\theta} = \nabla_{\theta} v_{\theta}(s)
\end{align*}
We know the policy gradient is given by
\begin{align*}
\nabla_{\theta} J_{\alpha}(\theta) = \balpha^{\top} \nabla_{\theta} \mat v_{\theta} = \balpha^{\top} (\mat I - \gamma \mat P_{\theta})^{-1} \mat h_{\theta}
\end{align*}
where $\mat d^{\top}_{\alpha, \gamma, \theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1}$, is the discounted weighting of states.
This gives us the policy gradient
\begin{align*}
\nabla_{\theta} J_{\alpha} ({\theta}) = \mat d^{\top}_{\alpha, \gamma, \theta} \mat h_{\theta}
\end{align*}
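Continuing the same hypothetical tabular setup as in the earlier sketch, the closed-form gradient $\nabla_{\theta} J_{\alpha}(\theta) = \mat d^{\top}_{\alpha, \gamma, \theta} \mat h_{\theta}$ can be checked against finite differences of $J_{\alpha}(\theta)$; this is purely an illustrative sketch, not part of the original derivation.
```python
# A minimal sketch (hypothetical tabular MDP): exact policy gradient via
# grad J = alpha^T (I - gamma P_theta)^{-1} h_theta, checked against finite differences.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
R = rng.uniform(size=(n_states, n_actions))
alpha = np.ones(n_states) / n_states
theta = rng.normal(size=(n_states, n_actions))

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def solve(theta):
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    q = R + gamma * P @ v                                           # Q_pi[s, a]
    return pi, P_pi, v, q

def exact_gradient(theta):
    pi, P_pi, v, q = solve(theta)
    # h_theta[s] = sum_a pi(a|s) grad log pi(a|s) Q(s,a); theta is flattened to a vector of size S*A.
    h = np.zeros((n_states, theta.size))
    for s in range(n_states):
        grad_log_pi = np.zeros((n_actions, n_states, n_actions))
        grad_log_pi[:, s, :] = np.eye(n_actions) - pi[s][None, :]   # tabular softmax log-gradient
        h[s] = np.einsum('a,aj,a->j', pi[s], grad_log_pi.reshape(n_actions, -1), q[s])
    f = np.linalg.solve(np.eye(n_states) - gamma * P_pi, h)         # f_theta = (I - gamma P_theta)^{-1} h_theta
    return alpha @ f                                                # grad J = alpha^T f_theta

def J(theta):
    _, _, v, _ = solve(theta)
    return alpha @ v

g, eps = exact_gradient(theta), 1e-5
g_fd = np.zeros(theta.size)
for j in range(theta.size):
    e = np.zeros(theta.size); e[j] = eps
    g_fd[j] = (J(theta + e.reshape(theta.shape)) - J(theta - e.reshape(theta.shape))) / (2 * eps)
print(np.allclose(g, g_fd, atol=1e-5))                              # True
```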
#### Policy Gradient Theorem for Non-Stationary Policies
See [Kakade, Thesis, Section 4.2.2]: in the episodic ($T$-epoch) case, where the policy depends explicitly on $t$, i.e. $\pi(a \mid s, t, \theta)$, there is an equivalent expression of the policy gradient theorem for non-stationary policies.
Following our derivation, we can write an equivalent Bellman-equation form of the PGT, where $f_{\theta}(s,a,t)$ would now have an explicit dependency on time.
* What would be the significance of this?
### Recursive Expressions of the Gradient as Bellman Equation
We have previously defined the following :
\begin{align*}
h_{\theta}(s,a) = \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a)
\end{align*}
\begin{align*}
h_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a)
\end{align*}
\begin{align*}
\nabla_{\theta} v_{\theta}(s) = f_{\theta}(s)= \sum_{a} \pi_{\theta}(a,s) \Big[ h_{\theta}(s, a) + \gamma \sum_{s'} P(s' \mid s, a) f_{\theta}(s') \Big]
\end{align*}
The equivalent Bellman equations for the gradient terms are as follows, where we can now define **Gradient-Value** and **Gradient Action-Value** functions:
* Policy Evaluation Expression for $\nabla_{\theta} v_{\pi}(s)$
\begin{align*}
f_{\theta}(s)= \sum_{a} \pi_{\theta}(a,s) \Big[ h_{\theta}(s, a) + \gamma \sum_{s'} P(s' \mid s, a) f_{\theta}(s') \Big]
\end{align*}
* Policy Evaluation Expression for $\nabla_{\theta} q_{\pi}(s,a)$
\begin{align*}
f_{\theta}(s, a)= h_{\theta}(s, a) + \gamma \sum_{s'} P(s' \mid s, a) f_{\theta}(s')
\end{align*}
This means we can update the gradient terms via a Bellman equation. This is interesting because now we can estimate the gradient using an equivalent policy iteration or value iteration algorithm.
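For concreteness, here is a minimal sketch (my own, assuming $h_\theta$ and $P_{\pi}$ are available in tabular form, e.g. as computed in the closed-form sketch above) of evaluating the gradient-value function by repeatedly applying its Bellman operator, exactly as in iterative policy evaluation:
```python
# A minimal sketch: iterative evaluation of the gradient-value function f = h + gamma P_pi f.
import numpy as np

def evaluate_gradient_value(h, P_pi, gamma, n_iters=500):
    """Iteratively solve f = h + gamma P_pi f.

    h    : (n_states, n_params) matrix, h[s] = sum_a pi(a|s) grad log pi(a|s) Q(s,a)
    P_pi : (n_states, n_states) state transition kernel under pi
    """
    f = np.zeros_like(h)
    for _ in range(n_iters):
        f = h + gamma * P_pi @ f        # gamma-contraction; same fixed point as (I - gamma P_pi)^{-1} h
    return f
```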
## Solving Policy Gradients with TD Learning
We can update the gradient action-value function with TD updates, e.g. with Q-learning. Recall that $f_\theta(s)$ outputs $\nabla_\theta v_\pi(s)$. However, in the TD version, $f(s)$ is parameterized as $f_\psi(s)$ and approximates $\nabla_\theta v_\pi(s)$ in an iterative fashion:
\begin{align*}
f_{\psi}(s,a) \leftarrow f_{\psi}(s,a) + \alpha \Big[ h_{\theta}(s,a) + \gamma \max_{a'} f_{\psi}(s', a') - f_{\psi}(s,a) \Big]
\end{align*}
and then update the policy parameters w.r.t. the *estimated* gradient
\begin{align*}
\theta_{t+1} \leftarrow \theta_{t} + \alpha f_{\psi}(s,a)
\end{align*}
Instead of Q-learning, we can perform any value-based update for the gradients $f_{\psi}(s,a)$ or $f_{\psi}(s)$.
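Below is a minimal, self-contained sketch of this idea (my own construction, not an algorithm from the references): a tabular softmax policy, an ordinary TD critic for $Q_\pi$, and a gradient estimator $f(s,a)$ updated with the TD rule above. Since a max over actions is ambiguous for vector-valued $f$, the sketch uses the on-policy next action (a SARSA-like variant) instead.
```python
# A speculative sketch: learn f(s,a) ~ grad_theta Q_pi(s,a) by TD, then ascend the *estimated* gradient.
import numpy as np

def run(P, R, gamma=0.9, lr_f=0.1, lr_q=0.1, lr_theta=0.05, n_steps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    n_params = n_states * n_actions
    theta = np.zeros((n_states, n_actions))          # softmax logits (the policy parameters)
    q = np.zeros((n_states, n_actions))              # ordinary TD critic for Q_pi
    f = np.zeros((n_states, n_actions, n_params))    # gradient estimates f(s, a) in R^{|theta|}

    def policy(s):
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    def grad_log_pi(s, a):                           # d log pi(a|s) / d theta, flattened
        g = np.zeros((n_states, n_actions))
        g[s] = -policy(s); g[s, a] += 1.0
        return g.ravel()

    s = 0
    a = rng.choice(n_actions, p=policy(s))
    for _ in range(n_steps):
        s_next = rng.choice(n_states, p=P[s, a])
        a_next = rng.choice(n_actions, p=policy(s_next))
        # critic: plain SARSA-style TD update for Q_pi
        q[s, a] += lr_q * (R[s, a] + gamma * q[s_next, a_next] - q[s, a])
        # gradient estimator: TD update with cumulant h(s,a) = grad log pi(a|s) * Q(s,a)
        h = grad_log_pi(s, a) * q[s, a]
        f[s, a] += lr_f * (h + gamma * f[s_next, a_next] - f[s, a])
        # actor: ascend the *estimated* gradient
        theta += lr_theta * f[s, a].reshape(n_states, n_actions)
        s, a = s_next, a_next
    return theta

# Hypothetical 3-state, 2-action MDP
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.uniform(size=(3, 2))
print(run(P, R).round(2))
```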
## Meta-gradient training
Interestingly, since $f_\psi (s',a')$ is a gradient-producing function, we can train it for $n$ steps and then continue to train our policy without ever actually computing gradients, relying solely on the gradient estimator $f_\psi$. Naturally, one wonders: how many steps are required to still be able to solve a task? For which tasks does this meta-gradient setup work?
## Theoretical Questions
* With this type of formulation, what is the sample complexity view [Kakade, Thesis]? I.e., if we can update the policy gradient with an equivalent Bellman expression, what sample size would be required to obtain an accurate gradient estimate?
* How many samples would be required before this algorithm, with equivalent Bellman updates, would find a good policy?
### Convergence Proofs :
**TODO**
* We can define a Bellman operator and $\gamma$ contraction mapping for $f_{\theta}(s,a)$. But what does it mean for this term to converge?
## Policy Iteration and Policy Gradients
* Equivalence between policy iteration and Newton's method
* In policy iteration, we take an argmax over the value function. With our interpretation, we basically replace the argmax with a gradient based update $\nabla_{\theta} v_{\theta}(s)$
## Greedy Value Function Methods
(See Sham Kakade, Thesis Chapter 3 for more details on this, especially deriving bounds with the greedy policy update, based on the max norm).
There are a plethora of asynchronous schemes, such as optimistic policy iteration, SARSA, and Dyna-Q, which interleave the policy update and the value update without waiting for the policy evaluation algorithm to converge.
Typically the policy evaluation algorithm estimating $v_{\pi}(s)$ makes slow changes, determined by a learning rate, and the policy is greedy w.r.t. these values. The reasoning behind this is to avoid making drastic policy changes, which is often the obstacle when providing convergence results.
* From our interpretation, when updating the policy gradient with an equivalent gradient evaluation scheme (e.g., policy evaluation of the gradient-value function), perhaps the same reasoning applies: we want to make slow changes to the gradient estimate, or estimate the gradient with a slower learning rate, to avoid drastic changes in the policy?
* From our derived algorithm above, we can use different learning rates for updating $f_{\theta}(s,a)$ and updating $\theta$, where the learning rate for $f_{\theta}(s,a)$ should be lower than for $\theta$. This will allow for less drastic changes in the gradient, resulting in conservative policy updates?
* TODO: See the literature on conservative policy iteration, trust-region approaches, etc., which all deal with avoiding drastic changes to the policy parameters $\theta$.
In most greedy value function methods, chattering occurs, where the policy fluctuates between some set of policies without ever converging. [Bertsekas and Tsitsiklis, 1996] provides the most extensive convergence analysis. For TD learning, convergence results have focused on the quality of policy evaluation for a *single* policy.
In general, the problem with greedy updates is that we cannot control which states the greedy policy visits, and this leads to a devastating max-norm (infinity-norm) greedy-update bound. [Kakade, Thesis]: Under what formal conditions do greedy updates provide good policies when we only have some average guarantee on $J$?
* What would the equivalent bounds mean for $f_{\theta}(s,a)$? Since $f_{\theta}(s,a) = \nabla_{\theta} q_{\theta}(s,a)$, there is no notion of an optimal $\nabla_{\theta} q^{*}_{\theta}$ in this case? (Most bounds are derived based on the optimal $v^{*}$ and $q^{*}$ with the max norm [Bertsekas].)
One drawback of greedy value function methods is that - the performance of a greedy policy derived from some approximate value function can be worse than the old policy by an amount related to the max norm error. This has motivated the use of policy gradient methods which have stronger performance improvement guarantees [Kakade, Thesis].
## TD Approach for Gradients
**Clarify** A TD-learning approach here, as derived above, means that instead of computing an approximation of the next position of $\theta$ (as momentum-based optimizers do), we compute an approximation of the next position of the gradient $f_{\theta}(s,a) = \nabla_{\theta} q_{\pi}(s,a)$.
Our policy gradient is given by :
\begin{align*}
\nabla_{\theta} J(\theta) = \mat \alpha^{\top} \nabla_{\theta} v_{\theta}(s) = \mat \alpha^{\top} f_{\theta}(s)
\end{align*}
where we can write an equivalent TD update for $f_{\theta}$ (considering f(s,a)):
\begin{align*}
f_{\theta}(s,a) \leftarrow f_{\theta}(s,a) + \alpha \Big[ h_{\theta}(s,a) + \gamma \max_{a'} f_{\theta}(s', a') - f_{\theta}(s,a) \Big]
\end{align*}
The overall update for the policy parameters $\theta$, after substituting the TD-updated $f_{\theta}(s,a)$, is therefore:
\begin{align*}
\theta &\leftarrow \theta + \eta f_{\theta}(s,a)\\
&= \theta + \eta \Big( f_{\theta}(s,a) + \alpha \Big[ h_{\theta}(s,a) + \gamma \max_{a'} f_{\theta}(s', a') - f_{\theta}(s,a) \Big] \Big)\\
&= \theta + \eta f_{\theta}(s,a) + \eta \alpha \Big[ h_{\theta}(s,a) + \gamma \max_{a'} f_{\theta}(s', a') - f_{\theta}(s,a) \Big]
\end{align*}
where $f_{\theta}(s,a)$ is the current gradient estimate, $f_{\theta}(s', a')$ is the gradient estimate at the next pair $(s', a')$, and we are updating $\theta$ partly based on the estimated gradient at the next state-action pair.
**TODO:** Is the above the correct way to think about this?
## Interpretation of Gradient-TD
**TODO**
* Intuitive explanation of what Gradient-TD is?
## Relationship with Optimization Algorithms :
Stochastic gradient descent (SGD) is used to solve optimization problems, where the objective function is of the form :
\begin{align*}
f(x) = \mathbb{E} [f_{i}(x)] = \frac{1}{n} \sum_{i=1}^{n} f_{i}(x)
\end{align*}
#### Momentum
SGD is often augmented with the momentum term as well, so that the update becomes
\begin{align*}
x_{t+1} = x_t - \gamma_{t} \nabla f_{i}(x_t) + \beta (x_t - x_{t-1})
\end{align*}
where momentum adds the previous update direction to the gradient. We basically use the previous update direction $x_t - x_{t-1}$ to also update the current parameters $x_t$. This has the useful effect of stabilising the update in the presence of outlier gradients and of escaping regions of flat curvature.
A better way of looking at momentum-based methods is via the following updates. Momentum helps accelerate SGD in the relevant direction by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.
\begin{align*}
v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta)
\end{align*}
\begin{align*}
\theta = \theta - v_t
\end{align*}
In momentum-based methods, the term $\theta - \gamma v_{t-1}$ computes an approximation of the next position of $\theta$.
#### Nesterov's Accelerated Gradient
Computing $\theta - \gamma v_{t-1}$ thus gives us an approximation of the next position of the parameters (the gradient term is missing for the full update), i.e. a rough idea of where our parameters are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters $\theta$ but w.r.t. the approximate future position of our parameters:
\begin{align*}
v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta - \gamma v_{t-1})
\end{align*}
\begin{align*}
\theta \leftarrow \theta - v_t
\end{align*}
In Nesterov's method, we do a lookahead by calculating the gradients not w.r.t. the current parameters but w.r.t. the approximate future position of the parameters $\theta - \gamma v_{t-1}$, giving the term $\nabla_{\theta} J(\theta - \gamma v_{t-1})$. The $\theta$ update is therefore:
\begin{align*}
\theta \leftarrow \theta - \gamma v_{t-1} - \eta \nabla_{\theta} J(\theta - \gamma v_{t-1})
\end{align*}
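For reference, a tiny illustrative sketch (not from the text) of the two optimizer updates on a toy quadratic objective, where `grad` stands in for $\nabla_\theta J(\theta)$:
```python
# A small illustrative sketch: classical momentum vs. Nesterov's accelerated gradient.
import numpy as np

def grad(theta):                      # toy objective: J(theta) = 0.5 * ||theta||^2
    return theta

theta_m, theta_n = np.array([5.0, -3.0]), np.array([5.0, -3.0])
v_m, v_n = np.zeros(2), np.zeros(2)
gamma, eta = 0.9, 0.1

for _ in range(100):
    # classical momentum: gradient at the current parameters
    v_m = gamma * v_m + eta * grad(theta_m)
    theta_m = theta_m - v_m
    # Nesterov: gradient at the look-ahead point theta - gamma * v
    v_n = gamma * v_n + eta * grad(theta_n - gamma * v_n)
    theta_n = theta_n - v_n

print(theta_m, theta_n)               # both approach the minimiser at 0
```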
## Eligibility Traces as Optimizers in Policy Gradients
#### Eligibility Traces in Policy Gradient and Actor-Critic
In policy gradient methods, the gradient is estimated using sequences of states and rewards encountered between visits to some designated recurrent state (i.e., a trajectory of states and rewards that ends at the end of the episode or at an absorbing state, and the gradient is evaluated from this trajectory). Eligibility traces can be used to perform these updates efficiently in an online manner [Marbach and Tsitsiklis, 2001]. Performing gradient updates online may be useful because, in regular policy gradients, the variance of the gradient estimate grows with the recurrence time (i.e., as the trajectory grows). This recurrence time can be large in large-scale problems, and it also depends on the policy, so as the policy improves the recurrence time may increase too.
#### Relation between Eligibility Traces and Optimizers
We consider the TD update with bootstrapping, assuming a linear function approximation $v_{\theta}(s) = \phi(s)^{\top} \theta$. Considering the TD($\lambda$) update with a backward-view implementation, we find an interesting connection: the update equations look very similar to a naive *momentum-based stochastic gradient descent (ascent)* approach.
TD update for Value Functions, with Bootstrapping :
\begin{align*}
\theta_{t+1} \leftarrow \theta_t + \alpha \nabla_{\theta} MSE(\theta)
\end{align*}
where $\nabla_{\theta} MSE(\theta) \approx \Big[ r(s,a) + \gamma v_{\theta}(s') - v_{\theta}(s) \Big] \nabla_{\theta} v_{\theta}(s)$
\begin{align*}
\theta_{t+1} \leftarrow \theta_t + \alpha \delta_{\theta} \nabla_{\theta} v_{\theta}(s)
\end{align*}
where $\delta$ is the one-step TD error. If we consider the $\lambda$-return (forward) view, we can instead write the error as $\delta_{\theta} = G^{\lambda} - v_{\theta}(s)$.
The backward-view equivalent of this with eligibility traces is given by
\begin{align*}
\theta_{t+1} \leftarrow \theta_t + \alpha \delta_{t} e_t
\end{align*}
and we update the e-trace vector as
\begin{align*}
e_t \leftarrow \gamma \lambda e_{t-1} + \nabla_{\theta} v_{\theta}(s)
\end{align*}
We can in fact write the two jointly as
\begin{align*}
\theta_{t+1} &\leftarrow \theta_t + \alpha \delta_t ( \gamma \lambda e_{t-1} + \nabla_{\theta} v_{\theta}(s) )\\
&= \theta_t + \alpha \delta_t \nabla_{\theta} v_{\theta}(s) + \alpha \delta_t \gamma \lambda e_{t-1}
\end{align*}
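A minimal sketch of the backward-view TD($\lambda$) update written above, for linear $v_\theta(s) = \phi(s)^\top \theta$ (so that $\nabla_\theta v_\theta(s) = \phi(s)$); the `transitions` iterable and the feature map `phi` are placeholders of my own:
```python
# A minimal sketch: backward-view TD(lambda) with an eligibility trace, linear value function.
import numpy as np

def td_lambda(transitions, phi, n_features, alpha=0.05, gamma=0.99, lam=0.9):
    """transitions: iterable of (s, r, s_next, done); phi(s): feature vector of size n_features."""
    theta = np.zeros(n_features)
    e = np.zeros(n_features)                      # eligibility trace (the "velocity"-like vector)
    for s, r, s_next, done in transitions:
        v_s = phi(s) @ theta
        v_next = 0.0 if done else phi(s_next) @ theta
        delta = r + gamma * v_next - v_s          # one-step TD error
        e = gamma * lam * e + phi(s)              # accumulate decayed gradients of v
        theta = theta + alpha * delta * e         # semi-gradient update through the trace
        if done:
            e = np.zeros(n_features)
    return theta
```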
#### Momentum-Based Stochastic Gradient Ascent
Interestingly, the above expression has exactly the same form as SGD with a momentum term.
\begin{align*}
\theta_{t+1} \leftarrow \theta_t + \alpha V_t
\end{align*}
where $V_t$ is the *velocity term* (the analogue of the *eligibility trace* in our case), which is updated as
\begin{align*}
V_t \leftarrow \gamma V_{t-1} + \eta \nabla_{\theta} J(\theta)
\end{align*}
Just as before, we can write down the update equations jointly as
\begin{align*}
\theta_{t+1} &\leftarrow \theta_t + \alpha \Big[ \gamma V_{t-1} + \eta \nabla_{\theta} J(\theta) \Big]\\
&= \theta_t + \alpha \eta \nabla_{\theta} J(\theta) + \alpha \gamma V_{t-1}
\end{align*}
The two joint updates, the $\theta$ update for $v_{\theta}$ with eligibility traces and the $\theta$ update in SGD with momentum, have exactly the same form. One (hyperparameter) difference is that the eligibility trace vector is decayed with $\gamma \lambda$, whereas the velocity vector is decayed with $\gamma$ only. The other difference is that the value-function update contains $\delta_t \nabla_{\theta} v_{\theta}(s)$ because TD learning performs a *semi-gradient* update, whereas in SGD we take the full gradient.
The key difference, however, is that eligibility traces capture the temporal relation (i.e., the dependency on time), whereas regular SGD with momentum does not capture this explicitly. For RL, perhaps our optimizers need to capture this temporal correlation?
#### Nesterov's Accelerated Gradient
In the momentum-based method, the gradient was calculated at the current parameters $\theta$, i.e. $\nabla_{\theta} J(\theta)$. In Nesterov's method, we first apply the velocity term $V_{t-1}$ to the parameters $\theta$ to compute the interim parameters $\hat{\theta}$.
We can then compute the gradients at the interim parameters, $\nabla_{\theta} J(\hat{\theta})$. Based on these new gradients, we follow a similar update rule, the only difference being that the gradients are computed with the interim parameters $\hat{\theta}$.
\begin{align*}
\hat{\theta} \leftarrow \theta + \alpha V_{t-1}
\end{align*}
\begin{align*}
V_{t} \leftarrow \gamma V_{t-1} + \eta \nabla_{\theta} J(\hat{\theta})
\end{align*}
\begin{align*}
\theta_{t+1} \leftarrow \theta_t + \alpha V_{t}
\end{align*}
**Equivalent Eligiblity Traces for Nesterov's Accelerated Gradient**
Given we know the similarity between eligibility trace updates and Momentum based updates, what would be the equivalent eligibility trace backward view updates if we use Nesterov's?
We would first compute the interim parameters based on the eligibility trace vector, given by
\begin{align*}
\hat{\theta} \leftarrow \theta + \alpha e_{t-1}
\end{align*}
To compute the gradients with the interim parameters, we would need $\nabla_{\theta} MSE(\theta) \approx \delta_{\hat{\theta}} \nabla_{\theta} v_{\hat{\theta}}(s)$, so following the e-trace updates, we would update the parameters $\theta$ as:
\begin{align*}
\theta_{t+1} &\leftarrow \theta_t + \alpha \delta_{\hat{\theta}} e_t\\
&= \theta_t + \alpha \Big[ r(s,a) + \gamma v_{\hat{\theta}}(s') - v_{\hat{\theta}}(s) \Big] e_t
\end{align*}
where note that in the above we have $v_{\hat{\theta}}(s)$ and not $v_{\theta}(s)$. **TODO**: In the RL case this makes things a bit more involved, since the implication is that we now need to compute the value estimate for state $s$ with the interim parameters $\hat{\theta}$.
Following the update of $\theta$, we can then update the eligibility trace vector
\begin{align*}
e_t \leftarrow \gamma \lambda e_{t-1} + \nabla_{\theta} v_{\hat{\theta}}(s)
\end{align*}
One piece of good news: looking at the above equations, we never actually have to compute $\delta_{\theta}$ or $v_{\theta}(s)$. The only value-function computations we need are the TD error at $\hat{\theta}$ and $v_{\hat{\theta}}(s)$. So there is still only one set of value estimates to compute, as before, with the key difference that we compute it at $\hat{\theta}$ instead of $\theta$.
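A speculative sketch (my own reading of the derivation above, not an established algorithm) of this Nesterov-style trace update for a linear value function; the only change from plain TD($\lambda$) is that the TD error is evaluated at the interim parameters $\hat{\theta} = \theta + \alpha e_{t-1}$:
```python
# A speculative sketch: Nesterov-style TD(lambda), with the TD error computed at theta_hat.
import numpy as np

def nesterov_td_lambda(transitions, phi, n_features, alpha=0.05, gamma=0.99, lam=0.9):
    theta = np.zeros(n_features)
    e = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        theta_hat = theta + alpha * e             # look-ahead ("interim") parameters
        v_s = phi(s) @ theta_hat                  # TD error evaluated at theta_hat, not theta
        v_next = 0.0 if done else phi(s_next) @ theta_hat
        delta_hat = r + gamma * v_next - v_s
        e = gamma * lam * e + phi(s)              # for linear v, grad v at theta_hat is still phi(s)
        theta = theta + alpha * delta_hat * e
        if done:
            e = np.zeros(n_features)
    return theta
```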
## Gradient Temporal Difference Learning in Policy Gradients, with Eligibility Traces and $\lambda$ Returns
Previously, we established that we can write the gradient of the value function as an equivalent Bellman expression, and therefore perform TD updates for the gradient term as follows:
\begin{align*}
f_{\theta}(s) \leftarrow f_{\theta}(s) + \alpha \Big[ h_{\theta}(s) + \gamma f_{\theta}(s') - f_{\theta}(s) \Big]
\end{align*}
and then update the policy parameters w.r.t. the gradient
\begin{align*}
\theta_{t+1} \leftarrow \theta_{t} + \alpha f_{\theta}(s)
\end{align*}
The expression for $f_{\theta}(s)$ is a regular TD update with a one-step return. Note that here the "reward" is $h_{\theta}(s) = \sum_{a} \nabla_{\theta} \pi_{\theta}(a,s) Q_{\theta}(s,a)$, which itself contains a gradient term and is a vector rather than a scalar.
Instead of the one-step return, as with $n$-step backups and $\lambda$-returns (with forward and backward views), can we equivalently compute $\lambda$-returns for these vector-valued returns?
This would take the form $h_{\theta}^{\lambda}(s)$, where $\lambda$ denotes the weighted average of $n$-step returns (or $n$-step gradients in this case).
Forward view of Gradient-TD($\lambda$): a method for averaging all $n$-step backups
\begin{align*}
h_{\theta}^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} h_{\theta}^{n}
\end{align*}
NOTE: The difficulty, however, is that for $\lambda$-gradients we now need the $\nabla_{\theta} \pi_{\theta}(a,s)$ and $q_{\theta}(s,a)$ terms at all of the forward $n$ steps, since the $h_{\theta}(s)$ term contains them. I.e., we need to consider vector-valued returns for the $\lambda$-returns, where the vector terms depend on $\nabla_{\theta}\pi_{\theta}(a,s)$ for all of the $n$-step terms ahead. **TODO**
We can therefore write our Gradient-TD expression with the equivalent backward view with eligibility traces and so on - but it is questionable whether this is worth pursuing or not.
**TODO** It is not clear whether a TD form of the gradient (assuming one-step) is useful. If it is, then we can later think about forward-view gradients over $n$ steps instead of one step, where the mechanism to implement this would be eligibility traces for gradients.
#### PL Thesis, Page 48
See the eligibility trace expression in policy gradients, where we have an eligibility trace vector of the form:
\begin{align*}
z_{\theta}^{k} = \gamma z_{\theta}^{k-1} + \nabla_{\theta} \log \pi_{\theta} (A_k | S_k)
\end{align*}
This comes from the expression of the Monte-Carlo estimator: the trace vector is a trace of past gradients, decayed according to their recency. Plugging this back into the PG expression, we get a form in which the immediate rewards are weighted by an accumulation of *past* gradients, whereas the regular PG expression contains the term $\nabla_{\theta} \log \pi_{\theta}$ weighted by an accumulation of *future* rewards $Q_{\theta}(s,a)$.
We can rewrite the expressions above using an equivalent form of $h_{\theta}(s)$ in which we replace $Q_{\theta}(s,a)$ with Monte-Carlo returns as well (the approach followed in the PL Thesis, page 48).
**TODO** The above also hints that we can write the $h_{\theta}(s)$ term containing gradients using equivalent $\lambda$-returns, considering *past* and *future* gradients in a similar eligibility-trace form.
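For reference, a minimal sketch (my own, not the thesis' exact pseudocode) of the trace-based Monte-Carlo estimator described at the start of this subsection, where each immediate reward is weighted by the accumulation of decayed *past* score-function gradients; `grad_log_pi` is a placeholder helper passed in by the caller:
```python
# A minimal sketch: online policy gradient with a trace of past score-function gradients.
import numpy as np

def online_pg_with_trace(episode, grad_log_pi, theta, eta=0.01, gamma=0.99):
    """episode: list of (s, a, r) tuples generated under pi_theta."""
    z = np.zeros_like(theta)                       # trace of past policy-gradient terms
    for s, a, r in episode:
        z = gamma * z + grad_log_pi(theta, s, a)   # z_k = gamma * z_{k-1} + grad log pi(a_k|s_k)
        theta = theta + eta * r * z                # reward weighted by accumulated past gradients
    return theta
```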
## Relationship with Meta-Learning
[Andre:] It seems we are doing a sort of "learning-to-learn" method, i.e. learning the gradient function as opposed to the function itself. If we learn the "learning", then after $n$ steps of $f_\theta$ updates, we can freeze $f_\theta$ and iterate to update the policy parameters only. Would it converge?
## Knowledge Representation for Reinforcement Learning using General Value Functions
Comanici et al., 2019, [ICLR submission](https://openreview.net/forum?id=rygvZ2RcYm) (withdrawn).
Summary: We can use GVFs for PG, successor features, options, etc. In RL, given a reward $R: S \times A \times S \to [0,1]$ and a discount $\gamma \in [0,1]$, the value function is
$$
v_{\gamma, \pi} = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right]= E_{\pi}\left[\sum_{t=0}^{\infty}\left(\prod_{i=1}^{t} \gamma\right) R_{t+1}\right]
$$
### General Value Function (GVF)
The GVF is a unified way of expressing predictions about signals other than the extrinsic reward. Given a policy $\pi: S \times A \to [0,1]$, a cumulant function $C : S \times A \times S \to \mathbb{R}^{K}$ and a continuation function $\gamma: \mathcal{S} \rightarrow[0,1]$, the GVF predicts the cumulant-based return discounted by the continuation function:
$$
v_{C, \gamma, \pi}(s)=\mathbb{E}\left[\sum_{t=0}^{\infty}\left(\prod_{i=1}^{t} \gamma\left(S_{i}\right)\right) C_{t} \mid S_{0}=s, A_{0: \infty} \sim \pi\right]
$$
For example, **successor features** can be formulated via a cumulant, where the cumulant $\phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}^{K}$ defines a feature representation. The SF vector is defined as:
$$
\psi(s, a ; \pi)=E\left[\sum_{t=0}^{\infty} \gamma^{t} \phi\left(S_{t}, A_{t+1}, S_{t+1}\right) \mid S_{0}=s, A_{1}=a, \pi\right]
$$
This is useful in transfer learning for reward functions linear in features $R\left(s, a, s^{\prime}\right)=\phi\left(s, a, s^{\prime}\right)^{T} \mathbf{w}$ where different tasks correspond to different $\mathbf{w}$.
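Relating this back to the TODO at the top about a GVF algorithm with vector cumulants, here is a minimal sketch (my own, tabular) of TD evaluation of a GVF with a vector-valued cumulant $C(s,a,s') \in \mathbb{R}^K$ and a state-dependent continuation function $\gamma(s')$:
```python
# A minimal sketch: tabular TD evaluation of a GVF with a vector-valued cumulant.
import numpy as np

def gvf_td(transitions, cumulant, continuation, n_states, K, alpha=0.1):
    """transitions: iterable of (s, a, s_next); cumulant(s, a, s_next) in R^K; continuation(s') in [0, 1]."""
    v = np.zeros((n_states, K))                    # v_{C,gamma,pi}(s) is a K-vector per state
    for s, a, s_next in transitions:
        target = cumulant(s, a, s_next) + continuation(s_next) * v[s_next]
        v[s] += alpha * (target - v[s])            # componentwise TD update, one per cumulant dimension
    return v
```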
### GVF for Policy Gradient
**Theorem**: Let $C: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ be a one-dimensional cumulant, and define a new cumulant $\hat{C}$ as
$$
\hat{C}(s, a):=v_{C, \gamma, \pi_{\theta}}(s, a) \nabla_{\theta} \log \pi_{\theta}(a \mid s)
$$
Then the gradient of the GVF $v_{C, \gamma, \pi_{\theta}}(s)$ w.r.t. $\theta$ is itself a GVF with the above cumulant $\hat{C}$, that is:
$$
\nabla_{\theta} v_{C, \gamma, \pi_{\theta}}(s)=v_{\hat{C}, \gamma, \pi_{\theta}}\left(s\right)
$$
The cumulant of one GVF is obtained as the output of another GVF. This is similar to $h_\theta$ in an [earlier section](#Bellman-Equation-for-the-gradient-nabla_theta-v_thetas).

## References
* Policy/Value iteration [notebook](https://drive.google.com/file/d/1UR20JtQRjFyrvCseusVuPBmQIpB3XFAH/view?usp=sharing)
* Eligibility Traces optimizer [notebook](https://drive.google.com/file/d/1UZsVvGEoR5Cg5c4KISbs8U9mVDvr0LuG/view?usp=sharing)
* Pierre-Luc Bacon. "Temporal Representation Learning". PhD Thesis. McGill University, Montreal, June 2018.
* Yann Ollivier. "Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies"
* Baird et al., "Gradient Descent for General Reinforcement Learning" (VAPS Algorithm)
* Jalaj Bhandari and Daniel Russo. "Global Convergence Guarantees for Policy Gradient Methods"
* Equivalence between policy iteration and Newton's method "On the Convergence of Policy Iteration in Stationary Dynamic Programming" (Martin L. Puterman and Shelby L. Brumelle)