# Policy Gradient Theorem and equivalent Bellman Updates

> [name=Riashat Islam] The policy gradient update can be written as an equivalent Bellman update.

> TODO:
> - Write a GVF algorithm with vector cumulants (like [algo 1](#GVF-for-Policy-Gradient) from the GVF paper)
> - Write the equivalent vector reward without GVF
> - How is it better if we are learning $h$ in vector form?
> - Since $h$ is a cumulant, can we exploit the GVF for transfer learning? Learn new tasks without learning $h$?
> - Can we learn faster with $h$ instead of vector rewards, without bootstrapped gradients?

#### What
* We can reformulate the policy gradient update as a Bellman equation itself.
* We can write down an RL-specific optimizer for policy gradients, where we update the policy gradient in a TD-like manner (i.e., a TD-based gradient update, based on the past gradient at $(s,a)$ and the estimated gradient at the next $(s,a)$).
* Can we interpret the TD gradient term, with eligibility traces, as a type of optimizer specifically suited for RL and policy optimization?

#### Why
* We can leverage all of the research in TD learning to learn the policy gradient, i.e., we can do value/policy iteration and Q-learning-style updates for the policy gradient itself. This may come with convergence guarantees for policy gradients. It is worthwhile to think of it from an optimization perspective too (TD learning in SGD).

\begin{align*}
\def\balpha{\unicode{x1D6C2}}
\def\blambda{\unicode{x1D6CC}}
\def\brho{\unicode{x1D746}}
\def\mat#1{\mathbf{#1}}
\def\beye{\mathbf{I}}
\def\bvartheta{\unicode{x1D751}}
\def\btheta{\unicode{x1D73D}}
\def\ind{\unicode{x1D7D9}}
\def\inv{{-1}}
\end{align*}

<!--
## Background: Bellman
Recall the Bellman equation for the value function:
\begin{equation}
v_{\theta}(s) = \sum_{a} \pi_{\theta}(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) v_{\pi}(s') \right]
\end{equation}
-->

## Introduction

The starting point is to realize that the policy gradient expression can be written as an equivalent Bellman equation (also noted in Bacon et al.).

Policy gradient methods can be described as performing approximation in the policy space, and they closely resemble policy iteration. They are analogous since in both cases the search for the optimal policy is guided by a policy evaluation procedure. Policy gradient methods perform their improvement using the gradient of some objective, whereas in policy iteration the improved policy is obtained by acting greedily.

(Sutton et al., 1999): Given an initial state distribution $\alpha$, the discounted objective for policy gradient methods, in its exact form, is given by
\begin{align*}
J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1} \mat r_{\theta}
\end{align*}
With the discounted weighting of states, the equivalent form is
\begin{align*}
J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \mat d^{\top}_{\alpha, \gamma, \theta}\mat r_{\theta}
\end{align*}
where $\mat d^{\top}_{\alpha, \gamma, \theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1}$ is the discounted weighting of states.

**TODO** (Matrix derivatives): For the policy gradient, we need $\nabla_{\theta} J_{\alpha}(\theta)$. What if we take the gradient of the above term directly in matrix form?
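To make the matrix form concrete, here is a minimal numpy sketch (not from the derivation above; the arrays `alpha`, `P`, `r` and the policy `pi` are hypothetical tabular placeholders) that evaluates the exact discounted objective with a linear solve instead of an explicit inverse.

```python
import numpy as np

def discounted_objective(alpha, P, r, pi, gamma):
    """Exact J_alpha(theta) = alpha^T (I - gamma P_pi)^{-1} r_pi for a tabular MDP.

    alpha : (S,)      initial state distribution
    P     : (S, A, S) transition probabilities P(s' | s, a)
    r     : (S, A)    expected rewards r(s, a)
    pi    : (S, A)    policy probabilities pi(a | s)
    """
    P_pi = np.einsum('sa,sap->sp', pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, r)     # expected per-state reward under pi
    v = np.linalg.solve(np.eye(len(alpha)) - gamma * P_pi, r_pi)  # v_theta
    return alpha @ v                        # J_alpha(theta)
```

In the same notation, the discounted weighting of states $\mat d^{\top}_{\alpha, \gamma, \theta}$ would be `alpha @ np.linalg.inv(np.eye(len(alpha)) - gamma * P_pi)`.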
The policy gradient theorem as a Bellman equation:

\begin{align*}
\nabla_{\theta} v_{\theta}(s) &= \nabla_{\theta} \Big[ \sum_{a} \pi_{\theta}(a,s) Q_{\pi}(s,a) \Big] \\
&= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) Q_{\pi}(s,a) + \pi_{\theta}(a,s) \nabla_{\theta} Q_{\pi}(s,a) \Big]\\
&= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) Q_{\pi}(s,a) + \pi_{\theta}(a,s) \nabla_{\theta} \big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) v_{\pi}(s') \big] \Big]\\
&= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) Q_{\pi}(s,a) + \pi_{\theta}(a,s) \gamma \sum_{s'} P(s' \mid s,a) \nabla_{\theta} v_{\pi}(s') \Big]
\end{align*}

This is where we deviate from the expression in Bacon et al., where the next line is written to highlight the common structure with policy evaluation equations. It also deviates from the proof in Sutton et al., 1999. We can write the above expression in a similar policy evaluation form:

\begin{align*}
\nabla_{\theta} v_{\theta}(s) &= \sum_{a} \Big[ \pi_{\theta}(a,s) \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a) + \pi_{\theta}(a,s) \gamma \sum_{s'} P(s' \mid s,a) \nabla_{\theta} v_{\pi}(s') \Big]\\
&= \sum_{a} \pi_{\theta}(a,s) \Big[ \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \nabla_{\theta} v_{\pi}(s') \Big]
\end{align*}

In (Bacon et al.) this is instead written as:

\begin{align*}
\nabla_{\theta} v_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \Big[ \sum_{a} \nabla_{\theta} \pi_{\theta}(a,s) Q_{\pi}(s,a) + \gamma \sum_{s'} P(s' \mid s, a) \nabla_{\theta} v_{\theta}(s') \Big]
\end{align*}

Note that the two are equivalent: in (Bacon et al.) a sum over action probabilities is used, whereas we applied the log-derivative trick.

**TODO** Check if both forms are correct or not. [Andre]: Yes, they are equivalent.

## Bellman Equation for the gradient $\nabla_{\theta} v_{\theta}(s)$

We take our derivation and write the equivalent Bellman-type recursive expression as follows. First let us define:

\begin{align*}
h_{\theta}(s,a) = \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a)
\end{align*}
\begin{align*}
h_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a)
\end{align*}

where $h_{\theta}(s,a) \in \mathbb{R}^{|S| \times |A|}$ is a matrix. We can write down the equivalent vector form by summing over all actions, where $h_{\theta}(s) \in \mathbb{R}^{|S|}$ is a vector with dimension the size of the state space. **Note:** $h_\theta$ is defined as a cumulant below (see [here](#GVF-for-Policy-Gradient)).

Let us also define $f_{\theta}(s) = \nabla_{\theta} v_{\theta}(s)$, so that the policy gradient derivation

\begin{align*}
\nabla_{\theta} v_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \Big[ \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \nabla_{\theta} v_{\pi}(s') \Big]
\end{align*}

can now be written in a recursive Bellman form as

\begin{align*}
f_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \Big[ h_{\theta}(s,a) + \gamma \sum_{s'} P(s' \mid s, a) f_{\theta}(s') \Big]
\end{align*}

where note that $h_{\theta}(s)$ and $f_{\theta}(s)$ are vectors, not scalars. The difference is that we now have a vector-valued reward instead of a scalar one, but vector-valued rewards do not require any special theoretical or algorithmic treatment (Bacon et al.).
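As a concrete illustration of the cumulant just defined, the sketch below (an assumption-laden example, not part of the derivation) builds $h_{\theta}(s,a)$ and $h_{\theta}(s)$ for a tabular softmax policy with one logit per state-action pair, so that each gradient is a vector in $\mathbb{R}^{|S||A|}$.

```python
import numpy as np

def softmax_policy(theta):
    """theta: (S, A) logits; returns pi(a|s) with shape (S, A)."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cumulants(theta, Q):
    """h_theta(s,a) = grad_theta log pi_theta(a|s) * Q(s,a) and
    h_theta(s) = sum_a pi(a|s) h_theta(s,a), for per-state softmax logits.

    Returns
    h_sa : (S, A, S*A)  one flattened gradient vector per (s, a)
    h_s  : (S, S*A)     one flattened gradient vector per state
    """
    S, A = theta.shape
    pi = softmax_policy(theta)
    grad_log_pi = np.zeros((S, A, S, A))
    for s in range(S):
        for a in range(A):
            # d/d theta[s, b] log pi(a|s) = 1[b == a] - pi(b|s); zero for other states
            grad_log_pi[s, a, s, :] = -pi[s]
            grad_log_pi[s, a, s, a] += 1.0
    h_sa = (grad_log_pi * Q[:, :, None, None]).reshape(S, A, S * A)
    h_s = np.einsum('sa,sad->sd', pi, h_sa)
    return h_sa, h_s
```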
### Closed Form solution

We can write this in an equivalent vector form as

\begin{align*}
\mat f_{\theta} = \mat h_{\theta} + \gamma \mat P_{\theta} \mat f_{\theta}
\end{align*}

Therefore, the exact solution for $\nabla_{\theta} v_{\theta}(s)$ can be written as

\begin{align*}
\mat f_{\theta} = (\mat I - \gamma \mat P_{\theta})^{-1} \mat h_{\theta} = \nabla_{\theta} v_{\theta}(s)
\end{align*}

The policy gradient is then given by

\begin{align*}
\nabla_{\theta} J_{\alpha}(\theta) = \balpha^{\top} \nabla_{\theta} v_{\theta}(s) = \balpha^{\top} (\mat I - \gamma \mat P_{\theta})^{-1} \mat h_{\theta}
\end{align*}

where $\mat d^{\top}_{\alpha, \gamma, \theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1}$ is the discounted weighting of states. This gives us the policy gradient

\begin{align*}
\nabla_{\theta} J_{\alpha}(\theta) = \mat d^{\top}_{\alpha, \gamma, \theta} \mat h_{\theta}
\end{align*}

#### Policy Gradient Theorem for Non-Stationary Policies

See [Kakade, Thesis, Section 4.2.2]: considering the episodic (T-epoch) case, where the policy depends explicitly on $t$, $\pi(a \mid s, t, \theta)$, there is an equivalent expression of the policy gradient theorem for non-stationary policies. Following the derivation, we can write an equivalent form of the PGT with Bellman equations, where $f_{\theta}(s,a,t)$ would now have an explicit dependency on time.

* What would be the significance of this?

### Recursive Expressions of the Gradient as Bellman Equation

We have previously defined the following:

\begin{align*}
h_{\theta}(s,a) = \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a)
\end{align*}
\begin{align*}
h_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \nabla_{\theta} \log \pi_{\theta}(a,s) Q_{\pi}(s,a)
\end{align*}
\begin{align*}
\nabla_{\theta} v_{\theta}(s) = f_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \Big[ h_{\theta}(s, a) + \gamma \sum_{s'} P(s' \mid s, a) f_{\theta}(s') \Big]
\end{align*}

The equivalent Bellman equations for the gradient terms are as follows, where we can now define **gradient-value** and **gradient action-value** functions:

* Policy evaluation expression for $\nabla_{\theta} v_{\pi}(s)$:
\begin{align*}
f_{\theta}(s) = \sum_{a} \pi_{\theta}(a,s) \Big[ h_{\theta}(s, a) + \gamma \sum_{s'} P(s' \mid s, a) f_{\theta}(s') \Big]
\end{align*}
* Policy evaluation expression for $\nabla_{\theta} q_{\pi}(s,a)$:
\begin{align*}
f_{\theta}(s, a) = h_{\theta}(s, a) + \gamma \sum_{s'} P(s' \mid s, a) f_{\theta}(s')
\end{align*}

This means we can update the gradient terms with a Bellman equation. This is interesting because we can now estimate the gradient using an equivalent policy iteration or value iteration algorithm.
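Continuing the same hypothetical tabular setup, the closed form above amounts to one linear solve shared across all gradient components: each component of the vector-valued cumulant obeys its own scalar Bellman equation.

```python
import numpy as np

def exact_policy_gradient(alpha, P, pi, h_s, gamma):
    """f_theta = (I - gamma P_pi)^{-1} h_theta and grad J = alpha^T f_theta.

    h_s : (S, D) vector-valued cumulant h_theta(s), one row per state,
          D = number of policy parameters (e.g. S*A from the sketch above).
    """
    S = len(alpha)
    P_pi = np.einsum('sa,sap->sp', pi, P)
    # Each of the D gradient components satisfies its own scalar Bellman equation,
    # so a single solve with an (S, D) right-hand side handles all of them at once.
    f = np.linalg.solve(np.eye(S) - gamma * P_pi, h_s)   # shape (S, D)
    return alpha @ f                                     # shape (D,)
```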
## Solving Policy Gradients with TD Learning

We can update the gradient action-value function with TD updates, e.g. with Q-learning. Recall that $f_\theta(s)$ outputs $\nabla_\theta v_\pi(s)$. In the TD version, however, $f(s)$ is parameterized as $f_\psi(s)$ and approximates $\nabla_\theta v_\pi(s)$ in an iterative fashion:

\begin{align*}
f_{\psi}(s,a) \leftarrow f_{\psi}(s,a) + \alpha \Big[ h_{\theta}(s,a) + \gamma \max_{a'} f_{\psi}(s', a') - f_{\psi}(s,a) \Big]
\end{align*}

and then we update the policy parameters w.r.t. the *estimated* gradient:

\begin{align*}
\theta_{t+1} \leftarrow \theta_{t} + \alpha f_{\psi}(s,a)
\end{align*}

Instead of Q-learning, we can perform any value-based update for the gradients $f_{\psi}(s,a)$ or $f_{\psi}(s)$ (a tabular sketch is given below, after the greedy value-function discussion).

## Meta-gradient training

Interestingly, since $f_\psi(s',a')$ is a gradient-producing function, we can train it for $n$ steps and then continue to train our policy without ever actually computing gradients, relying solely on the gradient estimator $f_\psi$. Naturally, one wonders how many steps are required to still be able to solve a task, and for which tasks this meta-gradient setup works.

## Theoretical Questions

* With this type of formulation, what is the sample complexity view [Kakade, Thesis]? I.e., if we can update the policy gradient with an equivalent Bellman expression, what sample size would be required to obtain an accurate gradient estimate?
* How many samples would be required before this algorithm, with equivalent Bellman updates, finds a good policy?

### Convergence Proofs

**TODO**
* We can define a Bellman operator and a $\gamma$-contraction mapping for $f_{\theta}(s,a)$. But what does it mean for this term to converge?

## Policy Iteration and Policy Gradients

* Equivalence between policy iteration and Newton's method.
* In policy iteration, we take an argmax over the value function. With our interpretation, we basically replace the argmax with a gradient-based update $\nabla_{\theta} v_{\theta}(s)$.

## Greedy Value Function Methods

(See Sham Kakade's thesis, Chapter 3, for more details, especially for bounds on the greedy policy update based on the max norm.)

There is a plethora of asynchronous schemes, such as optimistic policy iteration, SARSA, and Dyna-Q, which interleave the policy update and the value update without waiting for convergence of the policy evaluation algorithm. Typically the policy evaluation algorithm, estimating $v_{\pi}(s)$, makes slow changes determined by a learning rate, and the policy is greedy w.r.t. these values. The reasoning behind this is to avoid making drastic policy changes, which is often the problem when providing convergence results.

* From our interpretation, updating the policy gradient with an equivalent gradient evaluation scheme (e.g., policy evaluation of the gradient-value function), maybe the same reasoning applies: we want to make slow changes to the gradient, or estimate the gradient with a slower learning rate, to avoid drastic changes in the policy?
* From our derived algorithm above, we can use different learning rates for updating $f_{\theta}(s,a)$ and updating $\theta$, where the learning rate for $f_{\theta}(s,a)$ should be lower than for $\theta$. This allows for less drastic changes in the gradient, resulting in conservative policy updates?
* TODO: See the literature on conservative policy iteration, trust-region approaches, etc., which all deal with avoiding drastic changes to the policy parameters $\theta$.
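A possible tabular sketch of the interleaved scheme discussed above, combining the TD-style update of the gradient action-value table from the "Solving Policy Gradients with TD Learning" section with the suggestion that the gradient estimate should change on a slower timescale than the policy (assumptions: an (S, A, D) table for $f$, an on-policy bootstrap in place of the vector-valued max, and `lr_f < lr_theta`):

```python
import numpy as np

def gradient_td_policy_step(f, theta_flat, transition, h_sa, pi, gamma,
                            lr_f=0.05, lr_theta=0.5):
    """One interleaved update of the gradient action-value table and the policy.

    f          : (S, A, D) current estimate of f_theta(s, a) = grad q_pi(s, a)
    theta_flat : (D,)      flattened policy parameters
    transition : (s, a, s_next) observed transition
    h_sa       : (S, A, D) cumulant table h_theta(s, a)
    pi         : (S, A)    current policy probabilities
    """
    s, a, s_next = transition
    # TD target for the gradient: bootstrap with the on-policy expectation over a'
    # (the max over actions is not well defined for vector-valued f).
    bootstrap = np.einsum('a,ad->d', pi[s_next], f[s_next])
    f[s, a] += lr_f * (h_sa[s, a] + gamma * bootstrap - f[s, a])
    # Conservative policy step along the *estimated* gradient.
    theta_flat = theta_flat + lr_theta * f[s, a]
    return f, theta_flat
```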
In most greedy value function methods, chattering occurs, where the policy fluctuates between some set of policies without ever converging. [Bertsekas and Tsitsiklis, 1996] provide the most extensive convergence analysis. For TD learning, convergence results have focused on the quality of policy evaluation for a *single* policy. In general, the problem with greedy updates is that we cannot control which states the greedy policy visits, and this leads to a devastating max-norm (infinity-norm) bound on the greedy update.

[Kakade, Thesis]: Under what formal conditions do greedy updates provide good policies when we have some average guarantee on $J$?

* What would the equivalent bounds mean for $f_{\theta}(s,a)$? Since $f_{\theta}(s,a) = \nabla_{\theta} v_{\theta}(s)$, there is no notion of an optimal $\nabla_{\theta} v^{*}_{\theta}$ in this case? (Most bounds are derived based on the optimal $v^{*}$ and $q^{*}$ with the max norm [Bertsekas].)

One drawback of greedy value function methods is that the performance of a greedy policy derived from some approximate value function can be worse than the old policy by an amount related to the max-norm error. This has motivated the use of policy gradient methods, which have stronger performance improvement guarantees [Kakade, Thesis].

## TD Approach for Gradients

**Clarify** A TD learning approach here, as derived above, would mean that instead of computing an approximation of the next position of $\theta$, we compute an approximation of the next position of the gradient $f_{\theta}(s,a) = \nabla_{\theta} v_{\pi}(s)$.

Our policy gradient is given by:

\begin{align*}
\nabla_{\theta} J(\theta) = \balpha^{\top} \nabla_{\theta} v_{\theta}(s) = \balpha^{\top} f_{\theta}(s)
\end{align*}

where we can write an equivalent TD update for $f_{\theta}$ (considering $f(s,a)$):

\begin{align*}
f_{\theta}(s,a) \leftarrow f_{\theta}(s,a) + \alpha \Big[ h_{\theta}(s,a) + \gamma \max_{a'} f_{\theta}(s', a') - f_{\theta}(s,a) \Big]
\end{align*}

The overall update for the policy parameters $\theta$ is therefore:

\begin{align*}
\theta &\leftarrow \theta + \eta f_{\theta}(s,a)\\
&\leftarrow \theta + \eta f_{\theta}(s,a) + \eta \alpha \Big[ h_{\theta}(s,a) + \gamma \max_{a'} f_{\theta}(s', a') - f_{\theta}(s,a) \Big]\\
&= \theta + \eta f_{\theta}(s,a) + \eta_{\alpha} \Big[ h_{\theta}(s,a) + \gamma \max_{a'} f_{\theta}(s', a') - f_{\theta}(s,a) \Big]
\end{align*}

where $f_{\theta}(s,a)$ is the current gradient, $f_{\theta}(s', a')$ is the gradient at the next $(s', a')$, and we update $\theta$ based on the estimated gradient at the next state-action pair.

**TODO:** Is the above the correct way to think about this?

## Interpretation of Gradient-TD

**TODO**
* Intuitive explanation of what Gradient-TD is.
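Under the same tabular assumptions as the sketch above, the substituted update collapses to a single step that moves $\theta$ along the current gradient estimate plus a TD correction toward the gradient at the next state-action pair; this is only a sketch of that equation, with the on-policy expectation again standing in for the vector-valued max:

```python
import numpy as np

def combined_theta_update(theta_flat, f, transition, h_sa, pi, gamma,
                          eta=0.5, alpha=0.05):
    """theta <- theta + eta*f(s,a) + eta*alpha*[h(s,a) + gamma*f(s',.) - f(s,a)]."""
    s, a, s_next = transition
    bootstrap = np.einsum('a,ad->d', pi[s_next], f[s_next])
    td_correction = h_sa[s, a] + gamma * bootstrap - f[s, a]
    return theta_flat + eta * f[s, a] + eta * alpha * td_correction
```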
## Relationship with Optimization Algorithms

Stochastic gradient descent (SGD) is used to solve optimization problems where the objective function is of the form:

\begin{align*}
f(x) = \mathbb{E} [f_{i}(x)] = \frac{1}{N} \sum_{i=1}^{N} f_{i}(x)
\end{align*}

#### Momentum

SGD is often augmented with a momentum term, so that the update becomes

\begin{align*}
x_{t+1} = x_t - \gamma_{t} \nabla f_{i}(x_t) + \beta (x_t - x_{t-1})
\end{align*}

where momentum adds the previous update direction to the gradient. We basically use the previous direction of update, $x_t - x_{t-1}$, to also update the current parameters $x_t$. This has the useful effect of stabilising the update in the case of outlier gradients and of escaping regions of flat curvature.

A better way of looking at momentum-based methods is via the following updates. Momentum helps accelerate SGD in the relevant direction by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector:

\begin{align*}
v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta)
\end{align*}
\begin{align*}
\theta = \theta - v_t
\end{align*}

In momentum-based methods, the term $\theta - \gamma v_{t-1}$ computes an approximation of the next position of $\theta$.

#### Nesterov's Accelerated Gradient

Computing $\theta - \gamma v_{t-1}$ thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), i.e., a rough idea of where our parameters are going. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters $\theta$ but w.r.t. the approximate future position of our parameters:

\begin{align*}
v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta - \gamma v_{t-1})
\end{align*}
\begin{align*}
\theta \leftarrow \theta - v_t
\end{align*}

In Nesterov's method, we do a lookahead by calculating the gradients not w.r.t. the current parameters but w.r.t. the approximate future position of the parameters $\theta - \gamma v_{t-1}$, via the term $\nabla_{\theta} J(\theta - \gamma v_{t-1})$. The $\theta$ update is therefore:

\begin{align*}
\theta \leftarrow \theta - \gamma v_{t-1} - \eta \nabla_{\theta} J(\theta - \gamma v_{t-1})
\end{align*}
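For reference, the two classical optimizer updates as a short sketch (standard optimizer algebra, nothing specific to this note; `grad_J` is a hypothetical callable returning $\nabla_{\theta} J(\theta)$):

```python
import numpy as np

def momentum_step(theta, v, grad_J, gamma=0.9, eta=0.1):
    """v_t = gamma v_{t-1} + eta grad J(theta);  theta <- theta - v_t."""
    v = gamma * v + eta * grad_J(theta)
    return theta - v, v

def nesterov_step(theta, v, grad_J, gamma=0.9, eta=0.1):
    """Look ahead to the approximate next position theta - gamma v_{t-1}
    and evaluate the gradient there before applying the update."""
    lookahead = theta - gamma * v
    v = gamma * v + eta * grad_J(lookahead)
    return theta - v, v
```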
## Eligibility Traces as Optimizers in Policy Gradients

#### Eligibility Traces in Policy Gradient and Actor-Critic

In policy gradient methods, the gradient is estimated using the sequences of states and rewards encountered between visits to some designated recurrent state (i.e., a trajectory-like quantity where we take the states and rewards of the trajectory, the trajectory ends at the end of the episode or at an absorbing state, and the gradient is evaluated based on that trajectory of states and rewards). Eligibility traces can be used to perform these updates efficiently in an online manner [Marbach and Tsitsiklis, 2001]. Performing gradient updates online may be useful because, in regular policy gradients, the variance of the gradient estimate grows with the recurrence time (i.e., as the trajectory grows). This recurrence time can be large in large-scale problems, and it also depends on the policy, so that as the policy improves it is possible that the recurrence time increases too.

#### Relation between Eligibility Traces and Optimizers

We consider the TD update with bootstrapping, assuming a linear function approximation form $v_{\theta}(s) = \phi(s)^{T} \theta$. Considering the TD($\lambda$) update with a backward-view implementation for $\lambda$, we find the interesting connection that the update equations look very similar to a naive *momentum-based stochastic gradient descent (ascent) approach*.

TD update for value functions, with bootstrapping:

\begin{align*}
\theta_{t+1} \leftarrow \theta_t + \alpha \nabla_{\theta} MSE(\theta)
\end{align*}

where $\nabla_{\theta} MSE(\theta) \approx \Big[ r(s,a) + \gamma v_{\theta}(s') - v_{\theta}(s) \Big] \nabla_{\theta} v_{\theta}(s)$, so

\begin{align*}
\theta_{t+1} \leftarrow \theta_t + \alpha \delta_{\theta} \nabla_{\theta} v_{\theta}(s)
\end{align*}

where $\delta$ is the one-step TD error. If we consider the $\lambda$-returns view (the forward view), we can write the TD error as $\delta_{\theta} = G^{\lambda} - v_{\theta}(s)$.

The backward-view equivalent of this with eligibility traces is given by

\begin{align*}
\theta_{t+1} \leftarrow \theta_t + \alpha \delta_{t} e_t
\end{align*}

and we update the e-trace vector as

\begin{align*}
e_t \leftarrow \gamma \lambda e_{t-1} + \nabla_{\theta} v_{\theta}(s)
\end{align*}

We can in fact write the two jointly as

\begin{align*}
\theta_{t+1} &\leftarrow \theta_t + \alpha \delta_t ( \gamma \lambda e_{t-1} + \nabla_{\theta} v_{\theta}(s) )\\
&= \theta_t + \alpha \delta_t \nabla_{\theta} v_{\theta}(s) + \alpha \delta_t \gamma \lambda e_{t-1}
\end{align*}

#### Momentum Based Stochastic Gradient Ascent

Interestingly, the above expression is exactly equivalent to SGD with a momentum term:

\begin{align*}
\theta_{t+1} \leftarrow \theta_t + \alpha V_t
\end{align*}

where $V_t$ is the *velocity* term (the *eligibility* term in our case), which is updated as

\begin{align*}
V_t \leftarrow \gamma V_{t-1} + \eta \nabla_{\theta} J(\theta)
\end{align*}

Just as before, we can write the update equation jointly as

\begin{align*}
\theta_{t+1} &\leftarrow \theta_t + \alpha \Big[ \gamma V_{t-1} + \eta \nabla_{\theta} J(\theta) \Big]\\
&= \theta_t + \alpha \eta \nabla_{\theta} J(\theta) + \alpha \gamma V_{t-1}
\end{align*}

The two joint updates, the $\theta$ update for $v_{\theta}$ and the $\theta$ update in SGD with momentum, are exactly equivalent, with the only (hyperparameter) difference being that we update the eligibility trace vector with $\gamma \lambda$ whereas we update the velocity vector with $\gamma$ only. The other difference is that in the value function update we have $\delta_t \nabla_{\theta} v_{\theta}(s)$ because with TD learning we do a *semi-gradient* update, whereas in SGD we take the full gradient. The key difference, however, is that eligibility traces capture the temporal relation (i.e., the dependency on time), whereas regular SGD with momentum does not capture this explicitly. For RL, perhaps we need to capture the temporal correlation in our optimizers?
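Putting the two updates side by side makes the structural match explicit. A sketch assuming linear value features `phi(s)` and a generic objective gradient `grad_J` (both hypothetical):

```python
import numpy as np

def td_lambda_step(theta, e, phi_s, phi_s_next, reward, gamma, lam, alpha):
    """Backward-view TD(lambda) with linear values v_theta(s) = phi(s)^T theta."""
    delta = reward + gamma * phi_s_next @ theta - phi_s @ theta  # one-step TD error
    e = gamma * lam * e + phi_s            # e_t = gamma*lambda*e_{t-1} + grad v_theta(s)
    theta = theta + alpha * delta * e      # theta += alpha * delta_t * e_t
    return theta, e

def momentum_ascent_step(theta, V, grad_J, gamma, eta, alpha):
    """SGD ascent with momentum: the velocity V plays the role of the trace e,
    while the scalar TD error delta_t has no counterpart."""
    V = gamma * V + eta * grad_J(theta)    # V_t = gamma*V_{t-1} + eta*grad J(theta)
    theta = theta + alpha * V              # theta += alpha * V_t
    return theta, V
```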
#### Nesterov's Accelerated Gradient

In the momentum-based method, the gradient is calculated using the current parameters $\theta$, i.e., $\nabla_{\theta} J(\theta)$. In Nesterov's method, we first apply the velocity term to the parameters $\theta$ to compute the interim parameters $\hat{\theta}$, and then compute the gradients at the interim parameters, $\nabla_{\theta} J(\hat{\theta})$. Based on these new gradients, we follow the same update rule as before, with the only difference being that the gradients are computed at the interim parameters $\hat{\theta}$:

\begin{align*}
\hat{\theta} \leftarrow \theta + \alpha V_{t-1}
\end{align*}
\begin{align*}
V_{t} \leftarrow \alpha V_{t-1} + \eta \nabla_{\theta} J(\hat{\theta})
\end{align*}
\begin{align*}
\theta_{t+1} \leftarrow \theta_t + V_{t}
\end{align*}

**Equivalent Eligibility Traces for Nesterov's Accelerated Gradient**

Given the similarity between eligibility trace updates and momentum-based updates, what would be the equivalent backward-view eligibility trace updates if we use Nesterov's method? We would first compute the interim parameters based on the eligibility vector:

\begin{align*}
\hat{\theta} \leftarrow \theta + \alpha e_{t-1}
\end{align*}

To compute the gradients with the interim parameters, we would need $\nabla_{\theta} MSE(\theta) = \delta_{\hat{\theta}} \nabla_{\theta} v_{\hat{\theta}}(s)$, so following the e-trace updates, we would update the parameters $\theta$ as:

\begin{align*}
\theta_{t+1} &\leftarrow \theta_t + \alpha \delta_{\hat{\theta}} e_t\\
&= \theta_t + \alpha \Big[ r(s,a) + \gamma v_{\hat{\theta}}(s') - v_{\hat{\theta}}(s) \Big] e_t
\end{align*}

where note that in the above we have $v_{\hat{\theta}}(s)$ and not $v_{\theta}(s)$.

**TODO**: In the RL case this makes things a bit difficult, since the implication is that we now need to compute the expected return with the interim parameters $\hat{\theta}$ for state $s$?

Following the update of $\theta$, we can then update the eligibility trace vector:

\begin{align*}
e_t \leftarrow \gamma \lambda e_{t-1} + \nabla_{\theta} v_{\hat{\theta}}(s)
\end{align*}

One piece of good news is that, looking at the above equations, we never actually have to compute $\delta_{\theta}$ or $v_{\theta}(s)$. The only value-function computation we need is the TD error at $\hat{\theta}$ and $v_{\hat{\theta}}(s)$. So there is still only one expectation to compute, as before, with the key difference that we now compute it with $\hat{\theta}$ instead of $\theta$.

## Gradient Temporal Difference Learning in Policy Gradients, with Eligibility Traces and $\lambda$ Returns

Previously, we established that we can write the gradient term of the value function as an equivalent Bellman expression, and therefore do TD updates for the gradient term as follows:

\begin{align*}
f_{\theta}(s) \leftarrow f_{\theta}(s) + \alpha \Big[ h_{\theta}(s) + \gamma f_{\theta}(s') - f_{\theta}(s) \Big]
\end{align*}

and then update the policy parameters w.r.t. the gradient:

\begin{align*}
\theta_{t+1} \leftarrow \theta_{t} + \alpha f_{\theta}(s)
\end{align*}

The expression for $f_{\theta}(s)$ is a regular TD update with a one-step return. Note that here the return is $h_{\theta}(s) = \sum_{a} \nabla_{\theta} \pi_{\theta}(a,s) Q_{\theta}(s,a)$: the return contains a gradient term itself, and it is in vector form instead of scalar.

Instead of the one-step return, as with n-step backups and $\lambda$-returns with forward and backward views, we can equivalently compute $\lambda$-returns for the vector-valued returns. These would take the form $h_{\theta}^{\lambda}(s)$, where $\lambda$ denotes the weighted average of n-step returns (or n-step gradients in this case).

Forward view of Gradient-TD($\lambda$), a method for averaging all n-step backups:

\begin{align*}
h_{\theta}^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} h_{\theta}^{n}
\end{align*}

NOTE: The difficulty with this is that, for $\lambda$ gradients, we now need the $\nabla_{\theta} \pi_{\theta}(a,s)$ term and the $q_{\theta}(s,a)$ term at all of the forward n steps, since the $h_{\theta}(s)$ term contains these terms. I.e., we need to consider vector-valued returns for the $\lambda$-returns, where the vector terms depend on $\nabla_{\theta}\pi_{\theta}(a,s)$ for all the n-step terms ahead.
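A sketch of this forward view for vector-valued cumulants, using the standard recursive form of the $\lambda$-return, $G^{\lambda}_{t} = h_t + \gamma\big[(1-\lambda) f(s_{t+1}) + \lambda G^{\lambda}_{t+1}\big]$, which is equivalent to the weighted average of n-step backups above (the trajectory arrays and the bootstrap values are assumptions):

```python
import numpy as np

def lambda_returns_for_gradients(h_traj, f_traj, gamma, lam):
    """Compute vector-valued lambda-returns G^lambda_t for a finite trajectory.

    h_traj : (T, D)   cumulant h_theta(s_t) observed along the trajectory
    f_traj : (T+1, D) bootstrap values f_theta(s_t) at each visited state
                      (f_traj[T] is the bootstrap beyond the last transition)
    Returns an (T, D) array of lambda-returns, one gradient-shaped target per step.
    """
    T, D = h_traj.shape
    G = np.zeros((T, D))
    g_next = f_traj[T]                     # bootstrap beyond the last transition
    for t in reversed(range(T)):
        # G_t = h_t + gamma * [(1 - lam) * f(s_{t+1}) + lam * G_{t+1}]
        g_next = h_traj[t] + gamma * ((1 - lam) * f_traj[t + 1] + lam * g_next)
        G[t] = g_next
    return G
```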
**TODO** We can therefore write our Gradient-TD expression with the equivalent backward view with eligibility traces and so on, but it is questionable whether this is worth pursuing.

**TODO** It is not clear whether a TD form of the gradient (assuming one-step) would be useful. If it is, then we can later think about the forward-view gradients over n steps, instead of one step, where the mechanism to implement this would be eligibility traces for gradients.

#### PL Thesis, Page 48

See the eligibility trace expression in policy gradients, where we have an eligibility trace vector of the form:

\begin{align*}
z_{\theta}^{k} = \gamma z_{\theta}^{k-1} + \nabla_{\theta} \log \pi_{\theta} (A_k \mid S_k)
\end{align*}

This comes from the expression of the Monte-Carlo estimator, and the trace vector is a trace of past gradients decayed based on their recency. Plugging this back into the PG expression, we get a form where the immediate rewards are weighted by an accumulation of past gradients, whereas the regular PG expression contains the term $\nabla_{\theta} \log \pi_{\theta}$ weighted by an accumulation of future rewards $Q_{\theta}(s,a)$. We can rewrite the expressions above using an equivalent form for $h_{\theta}(s)$ where we replace $Q_{\theta}(s,a)$ with the Monte-Carlo returns as well (the approach followed in PL Thesis, page 48).

**TODO** The above also hints that we can write the $h_{\theta}(s)$ term containing gradients using equivalent $\lambda$-returns, to consider *past* and *future* gradients in a similar eligibility trace form.

## Relationship with Meta-Learning

[Andre:] It seems we are doing a sort of "learning to learn" method, i.e., learning the gradient function as opposed to the function itself. If we learn the "learning", then after $n$ steps of $f_\theta$ updates, we can freeze $f_\theta$ and iterate to update the policy parameters only. Would it converge?

## Knowledge Representation for Reinforcement Learning using General Value Functions

Comanici et al., 2019, [ICLR submission](https://openreview.net/forum?id=rygvZ2RcYm) (withdrawn).

Summary: We can use GVFs for policy gradients, successor features, options, etc.

In RL, given a reward $R: S \times A \times S \to [0,1]$ and discount $\gamma \in [0,1]$, the value function equals
$$
v_{\gamma, \pi} = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right]= E_{\pi}\left[\sum_{t=0}^{\infty}\left(\prod_{i=1}^{t} \gamma\right) R_{t+1}\right]
$$

### General Value Function (GVF)

The GVF is a unified way of expressing predictions about signals other than the extrinsic reward. Given a policy $\pi: S \times A \to [0,1]$, a cumulant function $C : S \times A \times S \to \mathbb{R}^{K}$ and a continuation function $\gamma: \mathcal{S} \rightarrow[0,1]$, the GVF predicts the cumulant-based return discounted by the continuation function:
$$
v_{C, \gamma, \pi}(s)=\mathbb{E}\left[\sum_{t=0}^{\infty}\left(\prod_{i=1}^{t} \gamma\left(S_{i}\right)\right) C_{t} \mid S_{0}=s, A_{0: \infty} \sim \pi\right]
$$

For example, **successor features** can be formulated via a cumulant $\phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}^{K}$ that defines a feature representation. The SF vector is defined as:
$$
\psi(s, a ; \pi)=E\left[\sum_{t=0}^{\infty} \gamma^{t} \phi\left(S_{t}, A_{t+1}, S_{t+1}\right) \mid S_{0}=s, A_{1}=a, \pi\right]
$$
This is useful in transfer learning for reward functions linear in the features, $R\left(s, a, s^{\prime}\right)=\phi\left(s, a, s^{\prime}\right)^{T} \mathbf{w}$, where different tasks correspond to different $\mathbf{w}$.
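As a small illustration of a vector-valued GVF, here is a TD(0) sketch for successor features under a fixed policy (the feature tensor `phi` and the tables are assumptions; the cumulant is the feature vector and the continuation function is a constant $\gamma$):

```python
import numpy as np

def successor_feature_td_step(psi, transition, phi, pi, gamma, lr=0.1):
    """One TD(0) update of the successor-feature table psi(s, a) of shape (S, A, K).

    phi : (S, A, S, K) feature cumulant phi(s, a, s') -- a vector-valued GVF cumulant.
    pi  : (S, A)       fixed policy probabilities used for the on-policy bootstrap.
    """
    s, a, s_next = transition
    # Expected next successor features under the policy (on-policy bootstrap).
    bootstrap = np.einsum('a,ak->k', pi[s_next], psi[s_next])
    target = phi[s, a, s_next] + gamma * bootstrap
    psi[s, a] += lr * (target - psi[s, a])
    return psi
```

For a task with weights $\mathbf{w}$, the corresponding action-values are then recovered as `psi[s, a] @ w`, since the reward is linear in the features.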
### GVF for Policy Gradient

**Theorem**: Let $C: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ be a one-dimensional cumulant, and define the cumulant
$$
\hat{C}(s, a):=v_{C, \gamma, \pi_{\theta}}(s, a) \nabla_{\theta} \log \pi_{\theta}(a \mid s)
$$
Then the gradient of the GVF $v_{C, \gamma, \pi_{\theta}}(s)$ w.r.t. $\theta$ is the GVF with the above cumulant $\hat{C}$, that is:
$$
\nabla_{\theta} v_{C, \gamma, \pi_{\theta}}(s)=v_{\hat{C}, \gamma, \pi_{\theta}}\left(s\right)
$$
The cumulant of one GVF is obtained as the output of another GVF. This is similar to $h_\theta$ in an [earlier section](#Bellman-Equation-for-the-gradient-nabla_theta-v_thetas).

![](https://i.imgur.com/u0EyndC.png)

## References

* Policy/Value iteration [notebook](https://drive.google.com/file/d/1UR20JtQRjFyrvCseusVuPBmQIpB3XFAH/view?usp=sharing)
* Eligibility traces optimizer [notebook](https://drive.google.com/file/d/1UZsVvGEoR5Cg5c4KISbs8U9mVDvr0LuG/view?usp=sharing)
* Pierre-Luc Bacon. "Temporal Representation Learning". PhD thesis, McGill University, Montreal, June 2018.
* Yann Ollivier. "Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies".
* Baird et al. "Gradient Descent for General Reinforcement Learning" (VAPS algorithm).
* Jalaj Bhandari and Daniel Russo. "Global Convergence Guarantees for Policy Gradient Methods".
* Martin L. Puterman and Shelby L. Brumelle. "On the Convergence of Policy Iteration in Stationary Dynamic Programming" (equivalence between policy iteration and Newton's method).
