# A Framework with Non-Linear Bellman Equations and Bi-Level Optimization for Policy Optimization
> [name=Riashat Islam] Notes on *Non-Linear Mappings of the Bellman Equation for Policy Gradient Algorithms*
>
\begin{align*}
\def\balpha{\unicode{x1D6C2}}
\def\blambda{\unicode{x1D6CC}}
\def\brho{\unicode{x1D746}}
\def\mat#1{\mathbf{#1}}
\def\beye{\mathbf{I}}
\def\bvartheta{\unicode{x1D751}}
\def\btheta{\unicode{x1D73D}}
\def\ind{\unicode{x1D7D9}}
\def\inv{{-1}}
\end{align*}
The goal of this work is to build a framework with non-linear function mappings of the Bellman equation. This builds on the general class of non-linear Bellman equations (van Hasselt et al., van Seijen et al.) to open up the design space of policy gradient algorithms. For example, MaxEnt RL with entropy-regularized policy optimization has had many successes (Soft Q-Learning, Soft Actor-Critic, deep energy-based policies).
We hope that opening up this design space with a non-linear class of Bellman equations can lead to a wider class of algorithms, with the potential of also learning these function mappings in the future. General Value Functions (GVFs) (Sutton et al., 2011) were originally designed to learn many things (cumulants, not just rewards), and non-linear Bellman equations offer a framework towards more efficient computation of different predictive questions.
In this work, we mainly look into the framework of policy optimization with non-linear classes of Bellman equations. First, we consider two broad classes of Bellman equations, which can either be non-linear in the rewards and future values, or non-linear mappings of the expected future discounted rewards. These are as follows:
* Non-Linear Mappings of Returns:
\begin{align*}
v_{\pi}(s) = \mathbb{E} \Big[ f \Big( \sum_{t=0}^{\infty} \gamma^{t} r(S_t, A_t) \Big) \,\Big|\, S_0 = s \Big]
\end{align*}
* Non-Linear Mappings of Expected Returns
\begin{align*}
v_{\pi}(s) = f \Big( \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r(S_t, A_t) \,\Big|\, S_0 = s \Big] \Big)
\end{align*}
Unrolling the infinite sum of rewards to obtain the recursive expression for the value function of a state, we can therefore write two general classes of policy evaluation equations
* Policy Evaluation with Non-Linearly Mapped Returns
\begin{align*}
v_{\pi}(s) = \mathbb{E} \big[ f \big( r(S_0, A_0) + \gamma v_{\pi}(S_1) \big) \,\big|\, S_0 = s \big]
\end{align*}
* Non-Linear mappings of Policy Evaluation Equations
\begin{align*}
v_{\pi}(s) = f \big( \mathbb{E} \big[ r(S_0, A_0) + \gamma v_{\pi}(S_1) \,\big|\, S_0 = s \big] \big)
\end{align*}
We are effectively introducing two new operators: $\mathcal{T}^{f}$ for non-linearly mapped returns, and $\mathcal{T}^{\mathbb{E}}$ for non-linearly mapped expected returns.
**TODO** : We need to ask whether these operators are contractions or not (sufficient conditions for convergence (Bertsekas and Tsitsiklis, 1996)), which also depends on the choice of f(). For example, (van Seijen et al., 2019) uses a logarithmic mapping for f() and shows convergence of logarithmically mapped Q-functions in Q-learning.
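As a quick empirical check (not a proof) of the contraction question, the sketch below repeatedly applies the operator $\mathcal{T}^{f}$ on an assumed 2-state Markov reward process with $f(x) = \log(1+x)$ (both the MRP and the choice of $f$ are illustrative assumptions) and monitors the ratio of successive sup-norm distances between two value iterates.
```python
import numpy as np

# Assumed toy MRP (fixed policy): 2 states, strictly positive rewards so that
# f(x) = log(1 + x) is well defined on all backed-up values.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # P(s' | s)
r = np.array([1.0, 0.5])            # expected reward in each state
gamma = 0.9
f = np.log1p                        # the non-linear mapping f

def T_f(v):
    # (T^f v)(s) = sum_{s'} P(s'|s) * f( r(s) + gamma * v(s') )
    return (P * f(r[:, None] + gamma * v[None, :])).sum(axis=1)

v, w = np.zeros(2), np.ones(2)
for k in range(10):
    v_new, w_new = T_f(v), T_f(w)
    # a ratio consistently below 1 is (weak) empirical evidence of a contraction
    print(k, np.max(np.abs(v_new - w_new)) / np.max(np.abs(v - w)))
    v, w = v_new, w_new
```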
Since we consider non-linear mappings of value functions for policy optimization specifically, let us first re-visit the general classes of Monte-Carlo gradient estimation with non-linear function mappings of a general probabilistic objective $\mathcal{F}$. This is the problem of computing the gradient of an expectation of a function with respect to the parameters of the distribution being integrated over, also known as sensitivity analysis (Mohamed et al., 2019).
## Monte-Carlo Gradient Estimation
Consider mean-value analysis, in which a function g of an input variable x with structural parameters $\phi$ is evaluated on average with respect to an input distribution $p(x; \theta)$ with distributional parameters $\theta$.
### Non-Linear mappings of the function
First, let's consider mappings $f(\cdot)$ of the function $g(\cdot)$:
\begin{align*}
\mathcal{F}(\theta) = \int p(x; \theta) f( g(x; \phi) ) dx = \mathbb{E}_{p(x;\theta)} [ f ( g(x; \phi))]
\end{align*}
this gives us the gradient (using the score function) that yields the general-purpose REINFORCE estimator, where the nice property is that $f ( g(x; \phi))$ remains intact inside the expectation
\begin{align*}
\nabla_{\theta}\mathcal{F}(\theta) = \mathbb{E}_{p(x;\theta)} [ \nabla_{\theta} \log p(x;\theta) f ( g(x; \phi)) ]
\end{align*}
**Variance Analysis**
If we consider the variance of the Monte-Carlo gradient estimate, without any function mapping of $g(x;\phi)$, we have
\begin{align*}
Var[ \nabla_{\theta} \mathcal{F}(\theta) ] = \mathbb{E}_{p(x;\theta)} \big[ \big( g(x;\phi) \nabla_{\theta} \log p(x;\theta) \big)^{2} \big] - \big( \nabla_{\theta} \mathcal{F}(\theta) \big)^{2}
\end{align*}
but we note that, with the mapping of $g(x;\phi)$ to $f( g(x;\phi) )$, for example if f(.) is $\log(.)$, the variance of the gradient of the mapped objective can be significantly lower
\begin{align*}
Var[ \nabla_{\theta} \mathcal{F}(\theta) ] = \mathbb{E}_{p(x;\theta)} \big[ \big( \log (g(x;\phi)) \nabla_{\theta} \log p(x;\theta) \big)^{2} \big] - \big( \nabla_{\theta} \mathcal{F}(\theta) \big)^{2}
\end{align*}
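A small numerical illustration of this variance claim is sketched below, with assumed toy choices $p(x;\theta)=\mathcal{N}(\theta,1)$ and $g(x)=e^{x}$ (chosen so that $g > 0$ and the $\log$ mapping is well defined); note the two estimators target different objectives, $\mathbb{E}[g]$ versus $\mathbb{E}[\log g]$.
```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.5, 200_000
x = rng.normal(theta, 1.0, size=n)

score = x - theta                 # d/dtheta log N(x; theta, 1)
g = np.exp(x)                     # g(x; phi), strictly positive

grad_plain = score * g            # per-sample REINFORCE estimator for E[g]
grad_logged = score * np.log(g)   # per-sample estimator for the mapped objective E[log g]

print("variance without mapping :", grad_plain.var())
print("variance with log mapping:", grad_logged.var())
```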
### Non-Linear mappings of the Expectation of a function
If we instead consider mappings f(.) of the expectation $\mathbb{E} [g(\cdot)]$, such that we have $f(\mathbb{E} [g(\cdot)])$
\begin{align*}
\mathcal{F}(\theta) = f \Big( \int p(x; \theta) \, g(x; \phi) \, dx \Big) = f \big( \mathbb{E}_{p(x;\theta)} [ g(x; \phi)] \big)
\end{align*}
then we require *sensitivity analysis* of $\mathcal{F}(\theta)$. The chain rule gives an interesting product form of the gradient
\begin{align*}
\nabla_{\theta} \mathcal{F}(\theta) = \nabla_{\theta} f \big( \mathbb{E}_{p(x; \theta)} [ g(x;\phi)] \big) \cdot \nabla_{\theta} \mathbb{E}_{p(x;\theta)} [ g(x;\phi) ]
\end{align*}
where, by the chain rule, the first term should be read as the derivative of $f$ evaluated at the expectation, ie $f'\big(\mathbb{E}_{p(x;\theta)} [ g(x;\phi)]\big)$, and the second term is equivalent to the REINFORCE gradient with the score function
\begin{align*}
\nabla_{\theta} \mathcal{F}(\theta) = \nabla_{\theta} f \big( \mathbb{E}_{p(x; \theta)} [ g(x;\phi)] \big) \cdot \mathbb{E}_{p(x;\theta)} [ \nabla_{\theta} \log p(x;\theta) \, g(x;\phi) ]
\end{align*}
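As a concrete sketch of this product form, the snippet below estimates the inner expectation by Monte Carlo, evaluates the derivative of $f$ at that estimate (the first factor, read as $f'(\mathbb{E}[g])$ as noted above), and multiplies by the score-function estimate of $\nabla_{\theta}\mathbb{E}[g]$. The choices $p(x;\theta)=\mathcal{N}(\theta,1)$, $g(x)=e^{x}$ and $f=\log$ are illustrative assumptions, and the plug-in evaluation of $f'$ at a sample mean is biased in general.
```python
import numpy as np

rng = np.random.default_rng(1)
theta, n = 0.5, 100_000
x = rng.normal(theta, 1.0, size=n)
g = np.exp(x)

mean_g = g.mean()                        # Monte-Carlo estimate of E[g]
f_prime = 1.0 / mean_g                   # f'(u) = 1/u for f = log
grad_inner = ((x - theta) * g).mean()    # score-function estimate of grad_theta E[g]

grad_F = f_prime * grad_inner            # estimate of grad_theta f(E[g])
print(grad_F)                            # analytically 1.0 here, since log E[e^x] = theta + 0.5
```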
**TODO : Convolution Operators** Is the above form of the gradient at all useful? This reminds me of the gradient of a **convolution of functions** (EEE days!), where for $h(x) = f(x) \circledast g(x)$ we have $\nabla_{x} h(x) = \nabla_{x} f(x) \circledast g(x)$.
### Summary
With non-linear mappings, we therefore get two forms of objectives, where one follows the regular score-function gradient estimator, while the other requires sensitivity analysis. Writing these objectives generally, the differences are as follows:
* **Gradient of the Expectation of a Function**
\begin{align*}
J(\theta) = \mathbb{E}_{x \sim p(x;\theta)} [g(x;\theta)]
\end{align*}
for which we have the gradient
\begin{align*}
\nabla_{\theta} J(\theta) = \mathbb{E}_{x \sim p(x;\theta)} [\nabla_{\theta} \log p(x;\theta) g(x;\theta)]
\end{align*}
* **Gradient of the Expectation of a Non-Linearly mapped Function** \
(ie, function of functions)
\begin{align*}
J(\theta) = \mathbb{E}_{x \sim p(x;\theta)} [f ( g(x;\theta))]
\end{align*}
for which we have the gradient
\begin{align*}
\nabla_{\theta} J(\theta) = \mathbb{E}_{x \sim p(x;\theta)} [\nabla_{\theta} \log p(x;\theta) f ( g(x;\theta))]
\end{align*}
* **Gradient of Non-Linearly mapped Expectation of a Function** \
(ie, function mapping of expectation of a function)
\begin{align*}
J(\theta) = f ( \mathbb{E}_{x \sim p(x;\theta)} [ g(x;\theta) ] )
\end{align*}
for which we have the gradient
\begin{align*}
\nabla_{\theta} J(\theta) = \nabla_{\theta} f( \mathbb{E}_{x \sim p(x;\theta)} [g(x;\theta)] ) \cdot \mathbb{E}_{x \sim p(x;\theta)} [ \nabla_{\theta} \log p(x;\theta) \, g(x;\theta) ]
\end{align*}
NOTE : From Jensen's inequality, we know that if f(.) is a concave mapping, for example $f(x) = \log(x)$, then $\mathbb{E} [ f(X) ] \leq f( \mathbb{E} [X] )$, so one objective lower-bounds (and is arguably a smoother form of) the other, since
\begin{align*}
\mathbb{E}_{x \sim p(x;\theta)} [ f(g(x;\theta)) ] \leq f ( \mathbb{E}_{x \sim p(x;\theta)} [ g(x;\theta) ] )
\end{align*}
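A quick numerical check of this inequality, with the illustrative assumptions $x \sim \mathcal{N}(0,1)$, $g(x)=e^{x}$ and $f=\log$:
```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=1_000_000)
g = np.exp(x)

lhs = np.mean(np.log(g))     # E[ f(g(x)) ] -> approx 0.0
rhs = np.log(np.mean(g))     # f( E[g(x)] ) -> approx 0.5
print(lhs, rhs, lhs <= rhs)  # Jensen: E[f(X)] <= f(E[X]) for concave f
```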
## Policy Gradient : REINFORCE
Considering the true objective that we want to optimize, ie the infinite-horizon discounted return, the policy gradient objective is given by
\begin{align*}
J(\theta) = \mathbb{E} [ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid \pi_{\theta} ] = \mathbb{E} [ G_{0} \mid \pi_{\theta} ]
\end{align*}
where $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$ denotes the $\gamma$-discounted cumulative return from time $t$ for an infinite-horizon problem. The REINFORCE gradient estimator is given by
\begin{align*}
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi} [ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) \gamma^{t} G_t ]
\end{align*}
We can further write this term as (omitting the $\gamma^{t}$ term)
\begin{align*}
\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim d_{\pi}, a\sim \pi} [ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) G_t ]
\end{align*}
where we have written $d_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t} p(s_t = s)$ as the un-normalized discounted state visitation frequency. Note that in the above equation, we have omitted the explicit $\gamma^{t}$ term in front of $G_t$, since it is absorbed into $d_{\pi}$. This still gives an unbiased form of the gradient.
However, note that in practice we often use $\gamma=1$ in the $d_{\pi}$ term, ie we use *undiscounted state visitation frequencies*, which makes the gradient estimator biased. (Thomas et al., 2014) argues for the inclusion of the $\gamma^{t}$ term in the REINFORCE gradient estimator, but it turns out that including it (ie, using the unbiased form of the gradient estimator) can in fact often hurt performance.
**LATER** : We will re-visit this later, when we define the term **Gradient-Gap** (similar to the **Action Gap** in value-based methods) in the policy gradient theorem, a term that we argue has significance due to the $\gamma$ term.
### REINFORCE with Non-Linearly Mapped Returns
With a mapping on the returns, $f(G_t)$, we have the following objective (**TODO** : we can choose the function mapping f() to get smoother objectives, e.g. a $\log()$ mapping gives a lower bound but smoother forms of gradients):
\begin{align*}
J(\theta) = \mathbb{E} \Big[ f \Big( \sum_{t=0}^{\infty} \gamma^{t} r_{t} \Big) \,\Big|\, \pi_{\theta} \Big] = \mathbb{E} [ f ( G_{0} ) \mid \pi_{\theta} ]
\end{align*}
This gives the following gradient, which follows from the score function trick in Monte-Carlo gradients :
\begin{align*}
\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim d_{\pi}, a\sim \pi} [ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) f ( G_t ) ]
\end{align*}
#### Algorithm Pseudocode - Reinforce with Logarithmically Mapped Returns
* for each episode $( s_1, a_1, r_2, s_2, a_2, r_3, \dots ) \sim \pi_{\theta}$, do
    * for $t = 0$ to $T-1$ do
        * $\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(a_t, s_t) \, \log G_t$
        * where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$ (assuming $G_t > 0$ so that $\log G_t$ is well defined)
    * end for
* end for
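A minimal runnable sketch of this pseudocode is given below. The tabular chain environment, softmax policy, learning rate and strictly positive rewards (so that $\log G_t$ is well defined) are all assumptions made purely for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, horizon = 4, 2, 0.95, 0.05, 10
theta = np.zeros((n_states, n_actions))   # tabular softmax policy parameters

def step(s, a):
    # toy dynamics: action 1 moves right, action 0 resets; rewards are > 0
    s_next = min(s + 1, n_states - 1) if a == 1 else 0
    return s_next, 1.0 + s_next

def sample_action(s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(n_actions, p=p), p

for episode in range(2000):
    s, traj = 0, []
    for t in range(horizon):
        a, p = sample_action(s)
        s_next, r = step(s, a)
        traj.append((s, a, r, p))
        s = s_next
    # returns-to-go: G_t = sum_k gamma^k r_{t+k+1}, strictly positive here
    G, returns = 0.0, []
    for (_, _, r, _) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # REINFORCE update with the logarithmically mapped return f(G_t) = log(G_t)
    for (s_t, a_t, _, p), G_t in zip(traj, returns):
        grad_log_pi = -p
        grad_log_pi[a_t] += 1.0            # gradient of log softmax at (s_t, a_t)
        theta[s_t] += alpha * grad_log_pi * np.log(G_t)

print(theta)   # action 1 (move right) should end up preferred in each state
```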
### REINFORCE with Non-Linearly Mapped Expected Returns
If we instead map the expectation of the returns itself, our objective looks like this:
\begin{align*}
J(\theta) = f ( \mathbb{E} [ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid \pi_{\theta} ] ) = f ( \mathbb{E} [ G_{0} \mid \pi_{\theta} ] )
\end{align*}
Taking the gradient of this objective, however, is messy and requires differentiating a non-linear function of an expectation (where we cannot straightforwardly apply the score-function trick). **TODO** : does the literature on sensitivity analysis (e.g. SPSA) allow us to do this?
\begin{align*}
\nabla_{\theta} J(\theta) = \nabla_{\theta} f \Big( \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, \pi_{\theta} \Big] \Big) \cdot \mathbb{E}_{s \sim d_{\pi}, a\sim \pi} \Big[ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) G_t \Big]
\end{align*}
where, in the gradient expression above, we have the term
\begin{align*}
\nabla_{\theta} f \Big( \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, \pi_{\theta} \Big] \Big) = \nabla_{\theta} f ( V_{\pi_{\theta}}(s) )
\end{align*}
**TODO** : It is not clear how to compute such gradient terms. Sensitivity analysis and finite-difference methods might offer a way to do this. There are a few possible directions we can consider for computing $\nabla_{\theta} f ( V_{\pi_{\theta}}(s) )$.
#### Finite Difference Approximation
\begin{align*}
\nabla_{\theta} f ( V_{\pi_{\theta}}(s) ) \approx \frac{f (V_{ \tilde{\pi}_{\theta} } (s)) - f ( V_{\pi_{\theta}}(s))}{\epsilon}
\end{align*}
where $\tilde{\pi}_{\theta}$ is the policy $\pi_{\theta}$ with its parameters perturbed by $\epsilon$. However, this method would be quite sample-inefficient, as it requires separate rollouts with $\tilde{\pi}_{\theta}$ and $\pi_{\theta}$ to compute the finite-difference gradients.
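A rough SPSA-style sketch of this idea follows; the `estimate_value(theta)` helper (Monte-Carlo rollouts of $\pi_{\theta}$ returning an estimate of $V_{\pi_{\theta}}(s_0)$) is a hypothetical placeholder, and each call indeed costs a separate batch of rollouts, matching the sample-inefficiency noted above.
```python
import numpy as np

def spsa_gradient(estimate_value, theta, f=np.log, eps=1e-2, rng=None):
    """Two-sided SPSA-style estimate of grad_theta f(V_{pi_theta}(s_0))."""
    rng = rng if rng is not None else np.random.default_rng(0)
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # random +/- perturbation direction
    v_plus = estimate_value(theta + eps * delta)        # rollouts with perturbed policy
    v_minus = estimate_value(theta - eps * delta)       # rollouts with perturbed policy
    return (f(v_plus) - f(v_minus)) / (2.0 * eps) * delta
```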
#### A semi-gradient approach
If we look at the $\nabla_{\theta} J(\theta)$ expression above, the second half is the same gradient as REINFORCE (this is also the semi-gradient term as in (Thomas et al., 2019)). Let's write the two terms in the gradient as $\nabla_{\theta} J_{1}(\theta) \cdot \nabla_{\theta} J_{2}(\theta)$ where \
$\nabla_{\theta} J_{1}(\theta) = \nabla_{\theta} f ( V_{\pi_{\theta}}(s) )$ and $\nabla_{\theta} J_{2}(\theta) = \mathbb{E}_{s \sim d_{\pi}, a\sim \pi} [ \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t, s_t) G_t ]$.
We can do a semi-gradient update of $\theta$ to get $\tilde{\theta}$ as
\begin{align*}
\tilde{\theta} \leftarrow \theta + \alpha \nabla_{\theta} J_{1}(\theta)
\end{align*}
With the obtained $\tilde{\theta}$, we can then use an approach similar to finite differences to eventually update $\theta$. Does this have any similarities to Nesterov or momentum-based updates, where we compute $\tilde{\theta}$ and then use it for updating $\theta$? **TODO** : There should be a better way to do this - maybe something along the lines of KL or trust-region constraints between $\tilde{\theta}$ and $\theta$?
#### Taylor series expansion of $f(V_{\pi_{\theta}}(s))$
Assuming f(.) is a $\log$ mapping, we can perhaps use the Taylor expansion of $\log V_{\pi}$ around a fixed point, and then use the Taylor-expanded term to compute this gradient.
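For instance, a first-order expansion around a reference value $V_{0}(s)$ (for example a held-fixed, target-network-style estimate; this reference point is an assumption for illustration) gives
\begin{align*}
\log V_{\pi_{\theta}}(s) \approx \log V_{0}(s) + \frac{V_{\pi_{\theta}}(s) - V_{0}(s)}{V_{0}(s)}, \qquad \nabla_{\theta} \log V_{\pi_{\theta}}(s) \approx \frac{\nabla_{\theta} V_{\pi_{\theta}}(s)}{V_{0}(s)}
\end{align*}
so the gradient of the mapped value is approximately the ordinary policy gradient $\nabla_{\theta} V_{\pi_{\theta}}(s)$ rescaled by $1/V_{0}(s)$.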
#### A Deterministic Policy Gradient approach
We can use an update such as DPG (Silver et al., 2014), where $\pi_{\theta}(s)$ is a deterministic policy (or use the Gumbel-softmax trick for stochastic policies), with a similar DPG-style update:
\begin{align*}
\nabla_{\theta} f (V_{\pi_{\theta}}(s)) = \nabla_{\theta} f ( Q (s, \pi_{\theta}(s)) )
\end{align*}
where the key to this step is that we can compute a value function with the mapping $f(.)$ already applied, ie $\tilde{Q} = f(Q)$, and then compute the gradient for $\theta$ directly through the mapped value function, for example:
\begin{align*}
\nabla_{\theta} f ( Q (s, \pi_{\theta}(s)) ) = \nabla_{\theta} \tilde{Q}(s, \pi_{\theta}(s)) = \nabla_{\theta} \pi_{\theta}(s) \, \nabla_{a} \tilde{Q}(s, a) \big|_{a = \pi_{\theta}(s)}
\end{align*}
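A minimal numerical sketch of this DPG-style chain rule is below, for a one-dimensional toy case with a linear deterministic policy $a = \theta s$ and an assumed hand-written critic (both are illustrative assumptions, not the construction of any cited paper); the mapped critic is $\tilde{Q} = \log Q$, so $\nabla_a \tilde{Q} = \nabla_a Q / Q$.
```python
import numpy as np

def Q(s, a):
    return 5.0 - (a - s) ** 2          # assumed toy critic, positive near a = s

def dQ_da(s, a):
    return -2.0 * (a - s)              # analytic gradient of the critic in a

theta, s, lr = 0.2, 1.0, 0.5
for _ in range(50):
    a = theta * s                       # deterministic policy pi_theta(s) = theta * s
    q = Q(s, a)
    # chain rule for f = log:  d/dtheta log Q(s, pi_theta(s))
    #   = (d pi_theta / d theta) * dQ/da * f'(Q),  with f'(Q) = 1/Q
    grad = s * dQ_da(s, a) * (1.0 / q)
    theta += lr * grad                  # gradient ascent on the mapped critic
print(theta)                            # approaches 1.0, the maximiser of Q(s, pi_theta(s))
```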
#### Summary
It seems like the last approach, using a DPG-style update to compute $\nabla_{\theta} f ( V_{\pi_{\theta}}(s) )$, might be the most feasible?
**TODO** Why should a gradient term of the form $\nabla_{\theta} J(\theta) = \nabla_{\theta} J_{1}(\theta) \cdot \nabla_{\theta} J_{2}(\theta)$ be at all interesting though? Is there any related literature on optimization and SGD that does something similar?
## Policy Gradient Theorem with Non-Linearly Mapped Value Functions
Policy gradient methods can be described as performing approximation in the policy space, and have a close resemblance to policy iteration. They are analogous since in both cases the search for the optimal policy is guided by a policy evaluation procedure. Policy gradient methods perform their improvement using the gradient of some objective, whereas in policy iteration the improved policy is obtained greedily from the current value function.
(More on this later:) (Bacon et al.,) - the idea of solving policy gradients in a dynamic programming fashion, or through TD, has not been explored much.
**(For later)** In (van Seijen et al., 2019), logarithmic Q-learning is used while Q-learning convergence guarantees still hold. If policy gradients can be written as a TD update, we should in principle be able to write an equivalent PG update with logarithmic mappings and perform gradient updates in a TD-like fashion. We will explore this later.
For now, we first write down the policy gradient theorem in an equivalent form with non-linear mappings of the value functions. Again, the non-linear mappings can either be on the returns themselves, or on the expected return (ie, a function mapping of the value function itself).
(Sutton et al., 1999): Given an initial state distribution $\alpha$, the discounted objective for policy gradient methods, for the exact solution, is given by
\begin{align*}
J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1} \mat r_{\theta}
\end{align*}
With the discounted weighting of states, the equivalent form is
\begin{align*}
J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \mat d^{\top}_{\alpha, \gamma, \theta}\mat r_{\theta}
\end{align*}
where $\mat d^{\top}_{\alpha, \gamma, \theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1}$ is the discounted weighting of states. Note that (Bacon et al.,) this is not a distribution in itself, nor for that matter a stationary distribution, but a discounted weighting of states. A similar term is called the discounted future state distribution (Kakade et al., 2003) and the discounted state distribution (Thomas et al., 2014).
For the exact policy gradient solution, with non-linear mappings, we can write the following two forms :
* Non-Linear mapping on the return :
\begin{align*}
J_{\alpha}(\theta) = \balpha^\top \mat v_{\theta} = \balpha^\top (\mat I - \gamma \mat P_{\theta})^{-1} f ( \mat r_{\theta} )
\end{align*}
* Non-Linear mapping on the expected return/value function itself
\begin{align*}
J_{\alpha}(\theta) = \balpha^\top f ( \mat v_{\theta} ) = \balpha^\top f ( (\mat I - \gamma \mat P_{\theta})^{-1} \mat r_{\theta} )
\end{align*}
* (van Seijen et al.,) considers the modified/mapped objective itself, with the mapped value functions. The way to write that would therefore be:
\begin{align*}
\tilde{J}_{\alpha}(\theta) = \balpha^{\top} \tilde{\mat v}_{\theta}
\end{align*}
The key difference in the above is that in the first two cases we write the original objective $J(\theta)$ but with the mapped returns or mapped value functions, whereas in the third we write the modified, mapped version of the objective $\tilde{J}(\theta)$.
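The matrix-form objectives above can be checked numerically on a tiny example; the sketch below uses an assumed 2-state MDP with a fixed policy and $f = \log(1+\cdot)$, both purely illustrative choices.
```python
import numpy as np

alpha = np.array([0.5, 0.5])           # initial state distribution alpha
P = np.array([[0.8, 0.2],              # P_theta under the (fixed) current policy
              [0.3, 0.7]])
r = np.array([1.0, 2.0])               # expected reward vector r_theta
gamma, f = 0.9, np.log1p               # f is an arbitrary illustrative mapping

resolvent = np.linalg.inv(np.eye(2) - gamma * P)   # (I - gamma P_theta)^{-1}

J_original      = alpha @ resolvent @ r      # alpha^T (I - gamma P)^{-1} r
J_mapped_return = alpha @ resolvent @ f(r)   # non-linear mapping applied to the rewards
J_mapped_value  = alpha @ f(resolvent @ r)   # non-linear mapping applied to v_theta
print(J_original, J_mapped_return, J_mapped_value)
```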
**TODO** Does the above way of writing the non-linear mapping make sense?
In (van Seijen et al.,), $\tilde{Q}(s,a) = f (Q(s,a))$ denotes the expected return mapped to a different space (ie, logarithmic space). This is a general approach, also used in (Pohlen et al., 2018), where the target is mapped to a different space and updates are performed in that space instead.
Considering $\tilde{Q}(s,a)$, ie the value function mapped to a different space, and defining the policy gradient objective with this mapped value function, we therefore get the policy gradient theorem with mapped value functions (where the expected return itself is mapped to a different space)
\begin{align*}
\nabla_{\theta} \tilde{v}_{\theta}(s) &= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) \tilde{Q}(s,a) + \pi_{\theta}(a,s) \nabla_{\theta} \tilde{Q}(s,a) \Big] \\
&= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) \tilde{Q}(s,a) + \pi_{\theta}(a,s) \gamma \sum_{s'} P(s'\mid s,a) \nabla_{\theta} \tilde{v}_{\theta}(s') \Big]
\end{align*}
this gives the following policy gradient theorem with the mapped value functions :
\begin{align*}
\nabla_{\theta} \tilde{J}_{\alpha}(\theta) = \sum_{s} \mat d_{\alpha, \gamma, \theta}(s) \sum_{a} \nabla_{\theta} \pi_{\theta} (a,s) \tilde{Q}_{\theta}(s,a)
\end{align*}
where $\tilde{Q}(s,a) = f ( Q (s,a) )$, e.g. $\tilde{Q}(s,a) = \log Q(s,a)$ for a logarithmic mapping.
**NOTE** We should highlight here that this is not the gradient of the original objective $J(\theta)$, but the gradient of the mapped objective $\tilde{J}_{\alpha}(\theta)$, which depends on the choice of $f(.)$ and can be either a good or a bad objective for the policy optimization problem. That is, if $f(.)$ is chosen so that the mapped objective is a smoother lower bound of the original objective $J(\theta)$, then it may be a good objective to optimize; otherwise it may not.
What if we want the gradient of the original objective $J(\theta)$ with the mapped value functions or mapped returns, rather than of the modified/mapped objective $\tilde{J}(\theta)$ itself?
### Policy Gradient Theorem with Mapped Value Functions
\begin{align*}
\nabla_{\theta} \tilde{v}_{\theta}(s) &= \nabla_{\theta} \Big[ \sum_{a} \pi_{\theta}(a,s) f ( Q_{\theta} (s,a)) \Big] \\
&= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) f(Q(s,a)) + \pi_{\theta}(a,s) \nabla_{\theta} f ( Q_{\theta} (s,a) ) \Big]\\
&= \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) f(Q_{\theta}(s,a)) + \pi_{\theta}(a,s) \, f'( Q_{\theta} (s,a)) \, \gamma \sum_{s'} P(s' \mid s, a) \nabla_{\theta} v_{\theta}(s') \Big]
\end{align*}
where $f'$ denotes the derivative of $f$, using the chain rule $\nabla_{\theta} f(Q_{\theta}(s,a)) = f'(Q_{\theta}(s,a)) \nabla_{\theta} Q_{\theta}(s,a)$. From here, if we keep unrolling following (Sutton et al., 1999), then we get:
\begin{align*}
\nabla_{\theta} \tilde{v}_{\theta}(s) = \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) f(Q_{\theta}(s,a)) + f'( Q_{\theta} (s,a)) \, \pi_{\theta}(a,s) \, \gamma \sum_{s'} P(s' \mid s, a) \nabla_{\theta} v_{\theta}(s') \Big]
\end{align*}
**TODO** Check below : We get something like this?
\begin{align*}
\nabla_{\theta} \tilde{v}_{\theta}(s) = \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) f (Q_{\theta} (s,a)) + \nabla_{\theta} f ( Q_{\theta}(s,a)) \Big]
\end{align*}
The policy gradient with the mapped value functions is therefore given by :
\begin{align*}
\nabla_{\theta} J(\theta) = \sum_{s} d_{\pi}(s) \sum_{a} \Big[ \nabla_{\theta} \pi_{\theta}(a,s) f (Q_{\theta} (s,a)) + \nabla_{\theta} f ( Q_{\theta}(s,a)) \Big]
\end{align*}
## Gradient Gap
We define a term that we call the Gradient-Gap, similar to the Action-Gap in value-based methods, following the argument in (Thomas et al.,) that there is a $\gamma^{t}$ term missing in the policy gradient estimator as used in practice, which makes the gradient estimates biased. Including the $\gamma^{t}$ term, ie the discounted weighting of future states, makes the gradient unbiased. However, in practice, it is often observed that including this term hurts performance.
We define the **Gradient-Gap** as
\begin{align*}
\Delta \nabla_{\theta} J_{\text{gap}} (\theta) = \nabla_{\theta} J_{\gamma} (\theta) - \nabla_{\theta} J(\theta)
\end{align*}
This primarily comes from the fact, as noted by (Thomas et al., 2014), that the policy gradient objective is often defined w.r.t. the discounted weighting of states (which is NOT a distribution) $\mat d_{\alpha, \gamma, \theta}$, where $\alpha$ is the starting state distribution. We can normalize $\mat d$ by multiplying with a $(1-\gamma)$ factor to convert it into a distribution, and then correct the gradient by a factor of $\frac{1}{1-\gamma}$ (Kakade et al., 2003)
\begin{align*}
\nabla_{\theta} J(\theta) = \frac{1}{(1-\gamma)} \mathbb{E}_{d_{\pi}} \Big[ \sum_{a} \nabla_{\theta} \pi_{\theta}(a,s) Q_{\theta}(s,a) \Big]
\end{align*}
where $d_{\pi}(s)$ is the normalized discounted weighting of states, given in the following form:
\begin{align*}
\mat d_{\pi}^{\top} = (1-\gamma) \, \balpha^{\top} \sum_{t=0}^{\infty} (\gamma \mat P_{\theta})^{t}
\end{align*}
where the weights $\gamma^{t}(1-\gamma)$ come from the geometric distribution.
However, in practice we never sample from the normalized discounted weighting of states; instead we sample from the unnormalized one (which is not a distribution). This is because sampling from the normalized discounted weighting of states would require that, after applying the policy at each step, upon entering the next state we sample from the geometric distribution to decide whether or not to terminate the trajectory. Therefore, in practice, we always estimate the policy gradient from samples along the on-policy undiscounted stationary distribution.
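As a small sketch of this distinction, sampling a state from the normalized discounted weighting can be done by terminating each rollout with probability $(1-\gamma)$ per step; the `env_reset`, `env_step` and `policy` callables below are hypothetical placeholders.
```python
import numpy as np

def sample_discounted_state(env_reset, env_step, policy, gamma, rng=None):
    """Draw one state from the normalised discounted weighting d_pi."""
    rng = rng if rng is not None else np.random.default_rng(0)
    s = env_reset()
    while rng.random() < gamma:          # keep going with probability gamma ...
        s, _ = env_step(s, policy(s))    # ... otherwise stop and return the current state
    return s                             # P(returning s_t) = (1 - gamma) * gamma**t
```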
## Implicit Differentiation and Learning Non-Linear Function Mappings
* What if, instead of a fixed function mapping $f()$, we parameterize the mapping as $f_{\phi}$ and maximize the overall objective as a bi-level objective, using implicit differentiation?
## References
* Harm van Seijen, Mehdi Fatemi, Arash Tavakoli. "Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning"
* Hado van Hasselt, John Quan, Matteo Hessel, Zhongwen Xu, Diana Borsa, Andre Barreto. "General non-linear Bellman equations"