# A Framework with Non-Linear Bellman Equations and Bi-Level Optimization for Policy Optimization

> [name=Riashat Islam] Notes on *Non-Linear Mappings of the Bellman Equation for Policy Gradient Algorithms*

\begin{align*}
\def\balpha{\unicode{x1D6C2}}
\def\blambda{\unicode{x1D6CC}}
\def\brho{\unicode{x1D746}}
\def\mat#1{\mathbf{#1}}
\def\beye{\mathbf{I}}
\def\bvartheta{\unicode{x1D751}}
\def\btheta{\unicode{x1D73D}}
\def\ind{\unicode{x1D7D9}}
\def\inv{{-1}}
\end{align*}

The goal of this work is to build a framework with non-linear function mappings of the Bellman equation. This builds on the general class of non-linear Bellman equations (van Hasselt et al., 2019; Van Seijen et al., 2019) to open up the design space of policy gradient algorithms. For example, MaxEnt RL with entropy-regularized policy optimization has had many successes (Soft Q-Learning, Soft Actor-Critic, deep energy-based policies). We hope that opening up this design space with non-linear classes of Bellman equations can lead to a wider class of algorithms, with the potential of also learning these function mappings in the future.

Generalized Value Functions (GVFs) (Sutton et al., 2011) were originally designed to learn many things (cumulants, not just rewards), and non-linear Bellman equations offer a framework for more efficient computation of different predictive questions. In this work, we mainly look into the framework of policy optimization with non-linear classes of Bellman equations.

First, we consider two broad classes of Bellman equations, which can either be non-linear in the rewards and future values, or non-linear mappings of the expected future discounted rewards.
These can be written as follows:

* Non-linear mappings of returns:
\begin{align*} v_{\pi}(s) = \mathbb{E} \Big[ f \Big( \sum_{t=0}^{\infty} \gamma^{t} r(S_t, A_t) \Big) \mid S_0 = s \Big] \end{align*}
* Non-linear mappings of expected returns:
\begin{align*} v_{\pi}(s) = f \Big( \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r(S_t, A_t) \mid S_0 = s \Big] \Big) \end{align*}

Unrolling the infinite sum of rewards to obtain the recursive expression for the value function of a state, we can therefore write two general classes of policy evaluation equations:

* Policy evaluation with non-linearly mapped returns:
\begin{align*} v_{\pi}(s) = \mathbb{E} \big[ f \big( r(S_0, A_0) + \gamma v_{\pi}(S_1) \big) \mid S_0 = s \big] \end{align*}
* Non-linear mappings of policy evaluation equations:
\begin{align*} v_{\pi}(s) = f \big( \mathbb{E} \big[ r(S_0, A_0) + \gamma v_{\pi}(S_1) \mid S_0 = s \big] \big) \end{align*}

We are effectively introducing a new operator $\mathcal{T}^{f}$ (for non-linearly mapped returns) and $\mathcal{T}^{\mathbb{E}}$ (for non-linearly mapped expected returns).

**TODO**: We need to ask whether these operators are contractions (a sufficient condition for convergence (Bertsekas and Tsitsiklis, 1996)), which also depends on the choice of $f(\cdot)$. For example, (Van Seijen et al., 2019) uses a logarithmic mapping for $f(\cdot)$ and shows convergence of logarithmically mapped Q-functions in Q-learning.

Since we consider non-linear mappings of value functions for policy optimization specifically, let us first re-visit the general classes of Monte-Carlo gradient estimation with non-linear function mappings of a general probabilistic objective $\mathcal{F}$.
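As a quick numerical illustration of the two value definitions, the sketch below estimates both $\mathbb{E}[f(G)]$ and $f(\mathbb{E}[G])$ by Monte-Carlo; the toy reward model (i.i.d. per-step rewards of 1 or 2) is a made-up stand-in, not part of the framework:

```python
import math
import random

def mc_value_estimates(f, gamma=0.9, horizon=200, n=5000, seed=0):
    """Estimate E[f(G)] (mapped return) and f(E[G]) (mapped expected
    return) on a hypothetical chain whose per-step reward is 1 or 2
    uniformly at random, independent of state."""
    rng = random.Random(seed)
    returns = []
    for _ in range(n):
        g = sum(gamma ** t * rng.choice([1.0, 2.0]) for t in range(horizon))
        returns.append(g)
    mapped_return = sum(f(g) for g in returns) / n    # E[f(G)]
    mapped_expectation = f(sum(returns) / n)          # f(E[G])
    return mapped_return, mapped_expectation

ef_g, f_eg = mc_value_estimates(math.log)
```

With the concave mapping $f = \log$, the two definitions disagree, and Jensen's inequality gives $\mathbb{E}[\log G] \leq \log \mathbb{E}[G]$ (this also holds exactly on the empirical sample).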
This is the problem of computing the gradient of an expectation of a function with respect to the parameters of the distribution that is integrated over, also known as the problem of sensitivity analysis (Mohamed et al., 2019).

## Monte-Carlo Gradient Estimation

Consider mean-value analysis, in which a function $g$ of an input variable $x$, with structural parameters $\phi$, is evaluated on average with respect to an input distribution $p(x; \theta)$ with distributional parameters $\theta$.

### Non-Linear Mappings of the Function

First, let's consider mappings $f(\cdot)$ of the function $g(\cdot)$:
\begin{align*} \mathcal{F}(\theta) = \int p(x; \theta) f( g(x; \phi) ) dx = \mathbb{E}_{p(x;\theta)} [ f ( g(x; \phi))] \end{align*}
Using the score function, this gives the general-purpose REINFORCE estimator of the gradient, where the nice property is that $f ( g(x; \phi))$ remains intact:
\begin{align*} \nabla_{\theta}\mathcal{F}(\theta) = \mathbb{E}_{p(x;\theta)} [ \nabla_{\theta} \log p(x;\theta) f ( g(x; \phi)) ] \end{align*}

**Variance Analysis**

If we consider the variance of the Monte-Carlo gradient estimate without a function mapping on $g(x;\phi)$, we have
\begin{align*} Var[ \nabla_{\theta} \mathcal{F}(\theta) ] = \mathbb{E}_{p(x;\theta)} [ ( g(x) \nabla_{\theta} \log p(x) )^{2} ] - \big( \nabla_{\theta} \mathcal{F}(\theta) \big)^{2} \end{align*}
With the mapping of $g(x;\phi)$ to $f( g(x;\phi) )$ (for example, if $f(\cdot)$ is $\log(\cdot)$), the variance of the gradient can be significantly lower:
\begin{align*} Var[ \nabla_{\theta} \mathcal{F}(\theta) ] = \mathbb{E}_{p(x;\theta)} [ ( \log (g(x)) \nabla_{\theta} \log p(x) )^{2} ] - \big( \nabla_{\theta} \mathcal{F}(\theta) \big)^{2} \end{align*}

### Non-Linear Mappings of the Expectation of a Function

If we consider mappings $f(\cdot)$
of the expectation $\mathbb{E} [g(\cdot)]$, so that we have $f(\mathbb{E} [g(\cdot)])$:
\begin{align*} \mathcal{F}(\theta) = f \Big( \int p(x; \theta) g(x; \phi) dx \Big) = f ( \mathbb{E}_{p(x;\theta)} [ g(x; \phi)] ) \end{align*}
then we require *sensitivity analysis* of $\mathcal{F}(\theta)$. This also gives an interesting form of the gradient:
\begin{align*} \nabla_{\theta} \mathcal{F}(\theta) = \nabla_{\theta} f ( \mathbb{E}_{p(x; \theta)} [ g(x;\phi)] ) \cdot \nabla_{\theta} \mathbb{E}_{p(x;\theta)} [ g(x;\phi) ] \end{align*}
where the second term is equivalent to the REINFORCE gradient with the score function:
\begin{align*} \nabla_{\theta} \mathcal{F}(\theta) = \nabla_{\theta} f ( \mathbb{E}_{p(x; \theta)} [ g(x;\phi)] ) \cdot \mathbb{E}_{p(x;\theta)} [ \nabla_{\theta} \log p(x;\theta) g(x;\phi) ] \end{align*}

**TODO: Convolution Operators** Is the above form of the gradient useful at all? It is reminiscent of the gradient of a **convolution of functions** (EEE days!), where for $h(x) = (f \circledast g)(x)$ we have $\nabla_{x} h(x) = (\nabla_{x} f \circledast g)(x)$.

### Summary

With non-linear mappings, we therefore get two forms of objectives, where one follows the regular score function gradient estimator, but the other requires sensitivity analysis.
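The score-function case can be checked numerically. The sketch below estimates $\nabla_{\mu} \mathbb{E}_{x \sim \mathcal{N}(\mu,1)}[f(g(x))]$ and sanity-checks it against the closed form for the identity mapping; the Gaussian test problem and all names are made up for illustration:

```python
import random

def score_function_grad(f, g, mu, n=100_000, seed=0):
    """REINFORCE / score-function estimate of
    d/dmu E_{x ~ N(mu, 1)}[ f(g(x)) ].
    For a unit-variance Gaussian, grad_mu log p(x; mu) = (x - mu),
    and f(g(x)) stays intact inside the expectation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu, 1.0)
        total += (x - mu) * f(g(x))   # score * mapped function value
    return total / n

# Identity-mapping check: E[x^2] = mu^2 + 1, so the true gradient
# at mu = 1 is 2*mu = 2.
grad = score_function_grad(lambda y: y, lambda x: x * x, mu=1.0)
```

Swapping in a non-linear $f$ (e.g. `math.log`, for positive $g$) leaves the estimator unchanged; only the target inside the expectation changes.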
Writing these objectives generically, the differences are as follows:

* **Gradient of the Expectation of a Function**
\begin{align*} J(\theta) = \mathbb{E}_{x \sim p(x;\theta)} [g(x;\theta)] \end{align*}
for which we have the gradient
\begin{align*} \nabla_{\theta} J(\theta) = \mathbb{E}_{x \sim p(x;\theta)} [\nabla_{\theta} \log p(x;\theta) g(x;\theta)] \end{align*}
* **Gradient of the Expectation of a Non-Linearly Mapped Function** (i.e., a function of a function)
\begin{align*} J(\theta) = \mathbb{E}_{x \sim p(x;\theta)} [f ( g(x;\theta))] \end{align*}
for which we have the gradient
\begin{align*} \nabla_{\theta} J(\theta) = \mathbb{E}_{x \sim p(x;\theta)} [\nabla_{\theta} \log p(x;\theta) f ( g(x;\theta))] \end{align*}
* **Gradient of a Non-Linearly Mapped Expectation of a Function** (i.e., a function mapping of the expectation of a function)
\begin{align*} J(\theta) = f ( \mathbb{E}_{x \sim p(x;\theta)} [ g(x;\theta) ] ) \end{align*}
for which we have the gradient
\begin{align*} \nabla_{\theta} J(\theta) = \nabla_{\theta} f( \mathbb{E}_{x \sim p(x;\theta)} [g(x;\theta)] ) \cdot \mathbb{E}_{x \sim p(x;\theta)} [ \nabla_{\theta} \log p(x;\theta) g(x;\theta) ] \end{align*}

**NOTE**: From Jensen's inequality, we know that if $f(\cdot)$
is a concave mapping, for example $f(x) = \log(x)$, then $\mathbb{E} [ f(X) ] \leq f( \mathbb{E} [X] )$, so one objective is a smoother form than the other, since
\begin{align*} \mathbb{E}_{x \sim p(x;\theta)} [ f(g(x;\theta)) ] \leq f ( \mathbb{E}_{x \sim p(x;\theta)} [ g(x;\theta) ] ) \end{align*}

## Policy Gradient: REINFORCE

Considering the true objective that we want to optimize, i.e. the infinite-horizon discounted return, the policy gradient objective is given by
\begin{align*} J(\theta) = \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid \pi_{\theta} \Big] = \mathbb{E} [ G_{0} \mid \pi_{\theta} ] \end{align*}
where $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$ denotes the $\gamma$-discounted cumulative return from time $t$ for an infinite-horizon problem. The REINFORCE gradient estimator is given by
\begin{align*} \nabla_{\theta} J(\theta) = \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) \gamma^{t} G_t \Big] \end{align*}
We can further write this (omitting the $\gamma^{t}$ term) as
\begin{align*} \nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim d_{\pi}, a\sim \pi} \Big[ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) G_t \Big] \end{align*}
where $d_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t} p(s_t = s)$ is the un-normalized discounted state visitation frequency. Note that in the above equation we have omitted the $\gamma^{t}$ factor on $G_t$; keeping it gives an unbiased form of the gradient. However, in practice we often use $\gamma=1$, i.e., *undiscounted state visitation frequencies* in the $d_{\pi}$ term, making the gradient estimator biased. (Thomas et al., 2014) argues for the inclusion of the $\gamma^{t}$ term in the REINFORCE gradient estimator, but it turns out that the unbiased form of the gradient estimator (with $\gamma^{t}$ included, $\gamma < 1$) can in fact hurt performance in practice.
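As a minimal end-to-end sketch of the estimator above (and of the log-mapped variant discussed next), the following runs REINFORCE with a return mapping $f$ on a hypothetical two-armed bandit with deterministic positive rewards, so one-step episodes give $G_0 = r > 0$ and $\log G_0$ is well defined; all details of the toy problem are made up:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(f, steps=2000, alpha=0.1, seed=0):
    """REINFORCE with mapped returns on a 2-armed bandit with
    deterministic rewards (2, 8): one-step episodes, G_0 = r, and
    the update theta <- theta + alpha * grad log pi(a) * f(G_0)."""
    rng = random.Random(seed)
    rewards = [2.0, 8.0]
    theta = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(theta)
        a = 0 if rng.random() < probs[0] else 1
        target = f(rewards[a])
        for i in range(2):
            # grad of log pi(a) wrt each logit under a softmax policy
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += alpha * grad_log_pi * target
    return softmax(theta)

probs_log = reinforce_bandit(math.log)   # f = log: the mapped variant
```

With either $f = \mathrm{id}$ or $f = \log$ the policy concentrates on the better arm; the log mapping only shrinks the scale of the update targets.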
**LATER**: We will re-visit this when we define the term **Gradient-Gap** (similar to the **Action-Gap** in value-based methods) in the policy gradient theorem, a term that we argue has significance due to the $\gamma$ term.

### REINFORCE with Non-Linearly Mapped Returns

With a mapping on the returns, $f(G_t)$, we have the following objective (**TODO**: we can choose the function mapping $f(\cdot)$ to get smoother objectives, e.g. a $\log(\cdot)$ mapping to get lower bounds, but smoother forms of gradients):
\begin{align*} J(\theta) = \mathbb{E} \Big[ f \Big( \sum_{t=0}^{\infty} \gamma^{t} r_{t} \Big) \mid \pi_{\theta} \Big] = \mathbb{E} [ f ( G_{0} ) \mid \pi_{\theta} ] \end{align*}
This gives the following gradient, which follows from the score function trick in Monte-Carlo gradients:
\begin{align*} \nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim d_{\pi}, a\sim \pi} \Big[ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) f ( G_t ) \Big] \end{align*}

#### Algorithm Pseudocode - REINFORCE with Logarithmically Mapped Returns

* for each episode $( s_1, a_1, r_2, s_2, a_2, r_3, \ldots) \sim \pi_{\theta}$ do
  * for $t = 1$ to $T-1$ do
    * $\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t, a_t) (\log G_t )$, where $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$
  * end for
* end for

### REINFORCE with Non-Linearly Mapped Expected Returns

If we instead map the expectation of the returns itself, the objective looks like this:
\begin{align*} J(\theta) = f \Big( \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid \pi_{\theta} \Big] \Big) = f ( \mathbb{E} [ G_{0} \mid \pi_{\theta} ] ) \end{align*}
Taking the gradient of this objective, however, is messy and requires the gradient of a non-linear function of an expectation (where we cannot straightforwardly apply the score function trick). **TODO**: Does the literature on sensitivity analysis (e.g. SPSA) allow us to do this?
\begin{align*} \nabla_{\theta} J(\theta) = \nabla_{\theta} f \Big( \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid \pi_{\theta} \Big] \Big) \cdot \mathbb{E}_{s \sim d_{\pi}, a\sim \pi} \Big[ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) G_t \Big] \end{align*}
where in the gradient expression above, we have the term
\begin{align*} \nabla_{\theta} f \Big( \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid \pi_{\theta} \Big] \Big) = \nabla_{\theta} f ( V_{\pi_{\theta}}(s) ) \end{align*}

**TODO**: It is not clear how to compute such gradient terms. Sensitivity analysis and finite difference methods might offer a way. There are a few possible directions we can consider for computing $\nabla_{\theta} f ( V_{\pi_{\theta}}(s) )$.

#### Finite Difference Approximation

\begin{align*} \nabla_{\theta} f ( V_{\pi_{\theta}}(s) ) \approx \frac{f (V_{ \tilde{\pi}_{\theta} } (s)) - f ( V_{\pi_{\theta}}(s))}{\epsilon} \end{align*}
where $\tilde{\pi}_{\theta}$ is the policy $\pi_{\theta}$ perturbed by $\epsilon$. However, this method would be quite sample inefficient, as it requires separate rollouts with $\tilde{\pi}_{\theta}$ and $\pi_{\theta}$ to compute the finite difference gradients.

#### A Semi-Gradient Approach

If we look at the $\nabla_{\theta} J(\theta)$ term above, the second half of the expression is the same gradient as REINFORCE (this is also the semi-gradient term as in (Thomas et al., 2019)). Let's write the two terms in the gradient as $\nabla_{\theta} J_{1}(\theta) \cdot \nabla_{\theta} J_{2}(\theta)$, where $\nabla_{\theta} J_{1}(\theta) = \nabla_{\theta} f ( V_{\pi_{\theta}}(s) )$ and $\nabla_{\theta} J_{2}(\theta) = \mathbb{E}_{s \sim d_{\pi}, a\sim \pi} [ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) G_t ]$.
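The finite-difference idea can be sketched with a hypothetical closed-form $V(\theta)$ standing in for rollout estimates (in the RL setting each evaluation of $V$ would require its own rollouts, which is the sample inefficiency noted above):

```python
import math

def finite_diff_grad(f, V, theta, eps=1e-5):
    """Forward-difference estimate of grad_theta f(V(theta))."""
    return (f(V(theta + eps)) - f(V(theta))) / eps

# Hypothetical closed-form value V(theta) = theta^2 + 1 with f = log,
# so the exact derivative is 2*theta / (theta^2 + 1), i.e. 0.8 at theta = 2.
V = lambda th: th * th + 1.0
g = finite_diff_grad(math.log, V, theta=2.0)
```

In practice both $V$ evaluations are themselves noisy Monte-Carlo estimates, so the division by a small $\epsilon$ amplifies that noise on top of the extra rollout cost.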
We can do a semi-gradient update of $\theta$ to get $\tilde{\theta}$ as
\begin{align*} \tilde{\theta} \leftarrow \tilde{\theta} + \alpha \nabla_{\theta} J_{1}(\theta) \end{align*}
With the obtained $\tilde{\theta}$, we could then use an approach similar to finite differences to eventually update $\theta$. Does this have any similarity to Nesterov and momentum-based updates, where we compute $\tilde{\theta}$ and then use it to update $\theta$? **TODO**: There should be a better way to do this - maybe something along the lines of KL and trust region constraints between $\tilde{\theta}$ and $\theta$?

#### Taylor Series Expansion of $f(V_{\pi_{\theta}}(s))$

Assuming $f(\cdot)$ is a $\log$ mapping, we can perhaps use the Taylor expansion of $\log V_{\pi}$ around a fixed point, and then use the expanded term to compute this gradient.

#### A Deterministic Policy Gradient Approach

We can use an update such as DPG (Silver et al., 2014), where $\pi_{\theta}(s)$ is a deterministic policy (or use the Gumbel-softmax trick for stochastic policies), with a similar DPG-style update:
\begin{align*} \nabla_{\theta} f (V_{\pi_{\theta}}(s)) = \nabla_{\theta} f ( Q (s, \pi_{\theta}(s)) ) \end{align*}
The key to this step is that we can compute a value function with the mapping $f(\cdot)$ applied, and then take the gradient directly through the mapped value function with respect to $\theta$, for example:
\begin{align*} \nabla_{\theta} f ( Q (s, \pi_{\theta}(s)) ) = \nabla_{\theta} \tilde{Q}(s, \pi_{\theta}(s)) = \nabla_{\theta} \pi_{\theta}(s) \nabla_{a} \tilde{Q}(s, a) \end{align*}

#### Summary

The last approach, a DPG-style update to compute $\nabla_{\theta} f ( V_{\pi_{\theta}}(s) )$, seems the most feasible. **TODO**: Why should a gradient term of the form $\nabla_{\theta} J(\theta) = \nabla_{\theta} J_{1}(\theta) \cdot \nabla_{\theta} J_{2}(\theta)$ be interesting at all, though? Is there related literature on optimization and SGD that does something similar?
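A sketch of the DPG-style route, with hypothetical closed-form stand-ins for the critic and policy (in practice $\tilde{Q}$ would be a learned, mapped critic and the derivatives would come from automatic differentiation):

```python
def dpg_mapped_grad(f_prime, q, dq_da, pi, dpi_dtheta, s, theta):
    """Chain rule for grad_theta f(Q(s, pi_theta(s))):
    f'(Q(s, a)) * dQ/da * dpi/dtheta, evaluated at a = pi_theta(s)."""
    a = pi(s, theta)
    return f_prime(q(s, a)) * dq_da(s, a) * dpi_dtheta(s, theta)

# Toy stand-ins: pi_theta(s) = theta * s, Q(s, a) = 10 - (a - 3)^2, f = log.
q = lambda s, a: 10.0 - (a - 3.0) ** 2
dq_da = lambda s, a: -2.0 * (a - 3.0)
pi = lambda s, th: th * s
dpi_dtheta = lambda s, th: s

# At s = 1, theta = 2: a = 2, Q = 9, f'(Q) = 1/9, dQ/da = 2, dpi/dtheta = 1.
g = dpg_mapped_grad(lambda v: 1.0 / v, q, dq_da, pi, dpi_dtheta,
                    s=1.0, theta=2.0)
```

Note that $f$ must be differentiable and $Q$ must stay in the domain of $f$ (positive, for the $\log$ mapping) along the trajectory of updates.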
## Policy Gradient Theorem with Non-Linearly Mapped Value Functions

Policy gradient methods can be described as performing approximation in the policy space, and they closely resemble policy iteration. The two are analogous since in both cases the search for the optimal policy is guided by a policy evaluation procedure. Policy gradient methods perform their improvement step using the gradient of some objective, whereas in policy iteration the improved policy is obtained from the greedy policy. (More on this later.) (Bacon et al.) note that the idea of solving policy gradients in a dynamic programming fashion, or through TD, has not been explored much.

**(For later)** In (Van Seijen et al., 2019), logarithmic Q-learning is used while the Q-learning convergence guarantees still hold. If policy gradients can be written as a TD update, we should in principle be able to write an equivalent PG update with logarithmic mappings and perform gradient updates in a TD-like fashion. We will explore this later.

For now, we first write down the policy gradient theorem, using an equivalent form with non-linear mappings of the value functions. Again, the non-linear mappings can either be on the returns themselves, or on the expected return (i.e., a function mapping of the value function itself).

(Sutton et al., 1999): Given an initial state distribution $\balpha$, the discounted objective for policy gradient methods, for the exact solution, is given by
\begin{align*} J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1} \mat r_{\theta} \end{align*}
With the discounted weighting of states, the equivalent form is
\begin{align*} J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \mat d^{\top}_{\alpha, \gamma, \theta}\mat r_{\theta} \end{align*}
where $\mat d^{\top}_{\alpha, \gamma, \theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1}$ is the discounted weighting of states.
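The exact objective can be evaluated directly on a small example by solving $(\mat I - \gamma \mat P_{\theta}) \mat v = \mat r_{\theta}$; the sketch below uses a hypothetical 2-state chain under a fixed policy, with the 2x2 inverse written out by hand:

```python
def exact_objective(alpha, P, r, gamma):
    """J_alpha = alpha^T (I - gamma P)^{-1} r for a 2-state Markov
    reward process (policy held fixed), via the closed-form 2x2 inverse."""
    a = 1.0 - gamma * P[0][0]
    b = -gamma * P[0][1]
    c = -gamma * P[1][0]
    d = 1.0 - gamma * P[1][1]
    det = a * d - b * c
    v0 = (d * r[0] - b * r[1]) / det
    v1 = (-c * r[0] + a * r[1]) / det
    return alpha[0] * v0 + alpha[1] * v1, (v0, v1)

# Two self-loop states, so v_i = r_i / (1 - gamma): v = (10, 20), J = 15.
J, (v0, v1) = exact_objective([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]],
                              [1.0, 2.0], gamma=0.9)
```

The mapped variants of the objective then correspond to applying $f$ either to `r` before the solve (mapped returns) or to `(v0, v1)` after it (mapped value functions).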
Note that (Bacon et al.) this is not a distribution in itself, or for that matter a stationary distribution; it is a discounted weighting of states. A similar term is called the discounted future state distribution (Kakade et al., 2003) and the discounted state distribution (Thomas et al., 2014).

For the exact policy gradient solution with non-linear mappings, we can write the following forms:

* Non-linear mapping on the return:
\begin{align*} J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1} f ( \mat r_{\theta} ) \end{align*}
* Non-linear mapping on the expected return / value function itself:
\begin{align*} J_{\alpha}(\theta) = \balpha^\top f ( \mat v_{\theta} ) = \balpha^\top f ( (\mat I - \gamma \mat P_{\theta})^{-1} \mat r_{\theta} ) \end{align*}
* (Van Seijen et al., 2019) considers the modified/mapped objective itself, with the mapped value functions. The way to write that would therefore be:
\begin{align*} \tilde{J}_{\alpha}(\theta) = \balpha^{\top} \tilde{\mat v}_{\theta} \end{align*}

The key difference is that in the first two we are writing the original objective $J(\theta)$, but with the mapped returns or mapped value functions, whereas in the third we are writing the modified, mapped version of the objective $\tilde{J}(\theta)$.

**TODO**: Does the above way of writing the non-linear mapping make sense?

In (Van Seijen et al., 2019), they denote $\tilde{Q}(s,a) = f (Q(s,a))$ as the expected return mapped to a different space (i.e., logarithmic space). It is a general approach, also used in (Pohlen et al., 2018), where the target is mapped to a different space and updates are performed in that space instead.
Considering $\tilde{Q}(s,a)$, i.e. the value function mapped to a different space, and defining the policy gradient objective with this mapped value function, we get the policy gradient theorem with mapped value functions (where the expected return is mapped to a different space):
\begin{align*} \nabla_{\theta} \tilde{v}_{\theta}(s) &= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) \tilde{Q}_{\theta}(s,a) + \pi_{\theta}(a,s) \nabla_{\theta} \tilde{Q}_{\theta}(s,a) \Big] \\ &= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) \tilde{Q}_{\theta}(s,a) + \pi_{\theta}(a,s) \gamma \sum_{s'} P(s'\mid s,a) \nabla_{\theta} \tilde{v}_{\theta}(s') \Big] \end{align*}
This gives the following policy gradient theorem with the mapped value functions:
\begin{align*} \nabla_{\theta} \tilde{J}_{\alpha}(\theta) = \sum_{s} \mat d_{\alpha, \gamma, \theta}(s) \sum_{a} \nabla_{\theta} \pi_{\theta} (a,s) \tilde{Q}_{\theta}(s,a) \end{align*}
where $\tilde{Q}(s,a) = f ( Q (s,a) )$, e.g. $\log Q(s,a)$.

**NOTE**: We should highlight that this is not the gradient of the original objective $J(\theta)$, but the gradient of the mapped objective $\tilde{J}(\theta)$, which depends on the choice of $f(\cdot)$ and can be either a good or a bad objective for the policy optimization problem. That is, if $f(\cdot)$ is chosen so as to make the mapped objective a smoother lower bound on the original objective $J(\theta)$, then it may be a good objective to optimize; otherwise not.

What if we want the gradient of the original objective $J(\theta)$ with the mapped value functions or mapped returns, and not of the modified/mapped objective $\tilde{J}(\theta)$ itself?
### Policy Gradient Theorem with Mapped Value Functions

\begin{align*} \nabla_{\theta} \tilde{v}_{\theta}(s) &= \nabla_{\theta} \Big[ \sum_{a} \pi_{\theta}(a,s) f ( Q_{\theta} (s,a)) \Big] \\ &= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) f(Q_{\theta}(s,a)) + \pi_{\theta}(a,s) \nabla_{\theta} f ( Q_{\theta} (s,a) ) \Big]\\ &= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) f(Q_{\theta}(s,a)) + \pi_{\theta}(a,s) f' ( Q_{\theta} (s,a) ) \gamma \sum_{s'} P(s' \mid s, a) \nabla_{\theta} v_{\theta}(s') \Big] \end{align*}
From here, if we keep unrolling following (Sutton et al., 1999) (**TODO**: check this; the recursion involves $\nabla_{\theta} v_{\theta}$ rather than $\nabla_{\theta} \tilde{v}_{\theta}$, so the unrolling is not immediate), we get something like:
\begin{align*} \nabla_{\theta} \tilde{v}_{\theta}(s) \approx \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) f (Q_{\theta} (s,a)) + \pi_{\theta}(a,s) \nabla_{\theta} f ( Q_{\theta}(s,a)) \Big] \end{align*}
The policy gradient with the mapped value functions is therefore given by:
\begin{align*} \nabla_{\theta} J(\theta) = \sum_{s} d_{\pi}(s) \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) f (Q_{\theta} (s,a)) + \pi_{\theta}(a,s) \nabla_{\theta} f ( Q_{\theta}(s,a)) \Big] \end{align*}

## Gradient Gap

We define a term that we call the Gradient-Gap, similar to the Action-Gap in value-based methods, following the argument in (Thomas et al., 2014) that there is a $\gamma^{t}$ term missing in the policy gradient theorem as commonly implemented, which makes the gradient estimates biased. Including the $\gamma^{t}$ term, i.e. the discounted weighting of future states, makes the gradient unbiased. However, in practice it is often observed that including this term hurts performance.
We define the **Gradient-Gap** as
\begin{align*} \bigtriangleup \nabla_{\theta} J_{gap} (\theta) = \nabla_{\theta} J_{\gamma} (\theta) - \nabla_{\theta} J(\theta) \end{align*}
This primarily comes from the fact, as noted by (Thomas et al., 2014), that the policy gradient objective is often defined w.r.t. the discounted weighting of states (and NOT a distribution) $\mat d_{\alpha, \gamma, \theta}$, where $\alpha$ is the starting state distribution. We can normalize $\mat d$ by multiplying by a $1-\gamma$ factor to convert it into a distribution, and then correct the gradient by a factor $\frac{1}{1-\gamma}$ (Kakade et al., 2003):
\begin{align*} \nabla_{\theta} J(\theta) = \frac{1}{1-\gamma} \mathbb{E}_{d_{\pi}} \Big[ \sum_{a} \nabla_{\theta} \pi_{\theta}(a,s) Q_{\theta}(s,a) \Big] \end{align*}
where $d_{\pi}$ is the normalized discounted weighting of states, given by
\begin{align*} \mat d_{\pi} = \balpha^{\top} \sum_{t=0}^{\infty} (1-\gamma) (\gamma \mat P_{\theta})^{t} \end{align*}
where the $\gamma^{t}(1-\gamma)$ weights come from the geometric distribution.

However, in practice we never sample from the normalized discounted weighting of states; we instead sample from the unnormalized one (which is not a distribution). This is because sampling from the normalized discounted weighting would require that, after each policy step, upon entering the next state, we sample from the geometric distribution to decide whether or not to terminate the trajectory. Therefore, in practice, we always estimate the policy gradient from samples along the on-policy undiscounted stationary distribution.

## Implicit Differentiation and Learning Non-Linear Function Mappings

* What if, instead of a fixed function mapping $f(\cdot)$, we parameterize the mapping as $f_{\phi}$ and maximize the overall objective as a bi-level objective, using implicit differentiation?
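Returning to the discounted weighting: the geometric-termination sampler described above can be sketched on a hypothetical deterministic 2-state cycle ($0 \to 1 \to 0$), where the normalized weighting of the start state is $d_{\pi}(0) = (1-\gamma)\sum_{k=0}^{\infty} \gamma^{2k} = 1/(1+\gamma)$:

```python
import random

def sample_discounted_state(gamma, rng):
    """Draw one state from the normalized discounted weighting d_pi:
    at each step, terminate with probability (1 - gamma) and record
    the current state; otherwise take the deterministic transition."""
    s = 0                          # start state
    while rng.random() < gamma:    # continue with probability gamma
        s = 1 - s                  # cycle 0 -> 1 -> 0
    return s

rng = random.Random(0)
n = 50_000
freq0 = sum(sample_discounted_state(0.9, rng) == 0 for _ in range(n)) / n
# Analytic value: d_pi(0) = 1 / (1 + 0.9), about 0.526.
```

This makes concrete why the normalized weighting is rarely sampled in practice: each draw costs a partial rollout of expected length $1/(1-\gamma)$ and yields a single state sample.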
## References

* Harm van Seijen, Mehdi Fatemi, Arash Tavakoli. "Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning." 2019.
* Hado van Hasselt, John Quan, Matteo Hessel, Zhongwen Xu, Diana Borsa, Andre Barreto. "General non-linear Bellman equations." 2019.
