# Convexity of the Loss function of GAN on univariate Gaussian Distribution
## Problem setup
### Distribution
* Target: $N_0(x) = \frac{1}{(2 \pi)^{d/2}} e^{-\frac{1}{2}x^Tx}$
* Generator: $N_{\mu}(x) = \frac{1}{(2 \pi)^{d/2}} e^{-\frac{1}{2}(x-\mu)^T(x-\mu)}$
### Loss function Formulation
$$
L(\mu) = \sup_{D} \left(E_{x \sim N_0} \log(D(x)) + E_{x \sim N_{\mu}} \log(1 - D(x))\right )
$$
### Perfect Discriminator:
$$
D^*(\mu; x) = \frac{e^{-\frac{1}{2}x^Tx}}{e^{-\frac{1}{2}(x-\mu)^T(x-\mu)} + e^{-\frac{1}{2}x^Tx}} = \frac{1}{1 + e^{-\frac{1}{2}\mu^T\mu + x^T\mu}}
$$
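A quick numerical sanity check of this simplification (not part of the derivation): the snippet below compares the ratio form and the logistic form of $D^*$ at a few points; the dimension, seed, and test points are arbitrary choices.
```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)          # arbitrary generator mean
x = rng.normal(size=(5, d))      # arbitrary test points

# Unnormalized densities; the (2*pi)^{-d/2} factor cancels in the ratio.
n0 = np.exp(-0.5 * np.sum(x**2, axis=1))
nmu = np.exp(-0.5 * np.sum((x - mu)**2, axis=1))

ratio_form = n0 / (n0 + nmu)
logistic_form = 1.0 / (1.0 + np.exp(-0.5 * mu @ mu + x @ mu))
print(np.max(np.abs(ratio_form - logistic_form)))   # ~1e-15, floating-point rounding
```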
## Derivative for univariate normal (holding the other variable constant)
### Derivative for mean
#### First Order
Using Envelope theorem, we have:
$$
\nabla_{\mu} L(\mu) =
\nabla_{\mu} E_{x \sim N_{\mu}} \log\left( 1 - D^*(\mu; x) \right) = E_{x \sim N_{\mu}}[\log\left( 1 - D^*(\mu; x) \right) (x - \mu)]
$$
Applying Stein's lemma,
\begin{align*}
\nabla_{\mu} L(\mu)
&= E_{x \sim N_{\mu}} [\nabla_x \log\left( 1 - D^*(\mu; x) \right)]
\\ &= E_{x \sim N_{\mu}}\left[ \frac{D^*(\mu; x)^2}{1- D^*(\mu;x)}\ \frac{N_{\mu}(x)}{N_{0}(x)}\ \mu \right]
\\ &= \mu\ E_{x \sim N_{\mu}}[D^*(\mu; x)]
\end{align*}
Observe that $E_{x \sim N_{\mu}}[D^*(\mu; x)] > 0$ for any $\mu \in \mathbb{R}^d$. Therefore, the only critical point of $L(\mu)$ is $\mu = 0$. Moreover, for any direction $v \in \mathbb{R}^d$, the directional derivative $v^T \nabla_{\mu} L(\rho v) = \rho \|v\|^2\, E_{x \sim N_{\rho v}}[D^*(\rho v; x)]$ is negative for $\rho < 0$ and positive for $\rho > 0$. Therefore, $\mu = 0$ is the unique global minimum of $L(\mu)$.
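A quick numerical check of this gradient identity (a sketch, not part of the argument): in one dimension, compare a finite-difference estimate of $L'(\mu)$ with the closed form $\mu\, E_{x \sim N_{\mu}}[D^*(\mu;x)]$. The integration grid and the test value of $\mu$ are arbitrary choices.
```python
import numpy as np
from scipy.stats import norm

def d_star(x, mu):
    # Optimal discriminator between N(0,1) and N(mu,1), in the closed form above.
    return 1.0 / (1.0 + np.exp(-0.5 * mu**2 + x * mu))

def loss(mu, x=None):
    # L(mu) = E_{N(0,1)}[log D*] + E_{N(mu,1)}[log(1 - D*)], by grid integration.
    x = np.linspace(-15.0, 15.0, 30001) if x is None else x
    dx = x[1] - x[0]
    d = d_star(x, mu)
    integrand = norm.pdf(x) * np.log(d) + norm.pdf(x, loc=mu) * np.log(1.0 - d)
    return np.sum(integrand) * dx

mu, eps = 1.3, 1e-5
finite_diff = (loss(mu + eps) - loss(mu - eps)) / (2.0 * eps)

x = np.linspace(-15.0, 15.0, 30001)
dx = x[1] - x[0]
closed_form = mu * np.sum(norm.pdf(x, loc=mu) * d_star(x, mu)) * dx
print(finite_diff, closed_form)   # the two values should agree to several digits
```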
#### Second Order
$$
\begin{align*}
\frac{\partial^2 Loss}{\partial \mu^2}
&= E_{x \sim N_{\mu}}[D^*(x)] + E_{x \sim N_{\mu}}[\mu\, D^*(x)\,(x-\mu)] - E_{x \sim N_{\mu}}\!\left[\mu\, D^*(x)^2\, \frac{N_{\mu}(x)}{N_{0}(x)}\,(x-\mu)\right]
\\ &= E_{x \sim N_{\mu}}[D^*(x)] + E_{x \sim N_{\mu}}\!\left[\mu\, D^*(x)\,(x-\mu)\left(1 - D^*(x)\, \frac{N_{\mu}(x)}{N_{0}(x)}\right)\right]
\\ &= E_{x \sim N_{\mu}}[D^*(x)] + \mu\, E_{x \sim N_{\mu}}[D^*(x)^2\,(x-\mu)]
\end{align*}
$$
Again, applying Stein's lemma:
$$
\begin{align*}
\frac{\partial^2 Loss}{\partial \mu^2}
&= E_{x \sim N_{\mu}}[D^*(x)] + \mu\, E_{x \sim N_{\mu}}\!\left[2\, D^*(x) \cdot (-1) \cdot D^*(x)^2\, \frac{N_{\mu}(x)}{N_{0}(x)}\, \mu\right]
\\ &= E_{x \sim N_{\mu}}\!\left[D^*(x) \left(1 - 2 \mu^2\, D^*(x)\,\left(1 - D^*(x)\right)\right)\right]
\end{align*}
$$
A trivial bound can be obtained to ensure the positivity of the expression. Since $0 \leq D^*(x) \leq 1$, we have $D^*(x) \cdot (1 - D^*(x)) \leq \frac{1}{4}$, so the expression is positive whenever $2 \mu^2 \cdot \frac{1}{4} < 1$, i.e. for $-\sqrt{2} < \mu < \sqrt{2}$. I have tried to plot the expression under different $\mu$ values and obtained the following graph:

~~So it seems like the bound is actually quite tight?~~ (I have made a mistake in my program.)
There should be a more relaxed bound, but I am unable to obtain a symbolic solution.
```python
import matplotlib.pyplot as plt
import scipy.stats as st
import numpy as np
import math

# Number of grid points used to approximate the expectation
sample_size = 1000
# Half-width of the integration grid around mu
half_width = 100

def eva(mu):
    """Approximate E_{x~N(mu,1)}[D*(x) * (1 - 2*mu^2*D*(x)*(1 - D*(x)))] on a grid."""
    mean, std = mu, 1
    x = np.linspace(mean - half_width, mean + half_width, sample_size)
    dx = x[1] - x[0]
    pdf = st.norm.pdf(x, mean, std)

    def D(pt):
        # Optimal discriminator D*(mu; x) = 1 / (1 + exp(-mu^2/2 + x*mu))
        return 1 / (1 + math.exp(-0.5 * mu * mu + pt * mu))

    ret = 0.0
    for pt, p in zip(x, pdf):
        D_pt = D(pt)
        ret += p * (D_pt - 2 * D_pt * D_pt * (1 - D_pt) * mu * mu) * dx
    return ret

# Range of mu to be used in plotting
mus = np.linspace(1, 2, 50)
res = [eva(mu) for mu in mus]

plt.scatter(mus, res, label='Second Derivative')
plt.plot(mus, np.zeros(len(mus)), label='zero')
plt.legend()
plt.show()
```
### Derivative for Variance
Target Distribution: $N_t(\mu^*=0, \sigma^* = 1)$
Initialization: $N_i(0, \sigma )$
$$
\frac{\partial Loss}{\partial \sigma}
= \mathbb{E}_{x \sim N_i}\!\left[\log\left(1 - D\left(x\right) \right) \left(\frac{x^2}{\sigma^3} - \sigma^{-1}\right)\right]
= \mathbb{E}_{x \sim N_i}\!\left[ \sigma^{-2} \cdot \frac{x^2}{\sigma} \cdot \log \left(1 - D(x) \right)\right] - \mathbb{E}_{x \sim N_i}\!\left[ \sigma^{-1}\log(1-D(x))\right]
$$
Applying Stein's lemma with $g(x) = \frac{x}{\sigma} \cdot \log(1 - D(x))$,
$$
\frac{\partial Loss}{\partial \sigma} = \mathbb{E}_{x \sim N_i}\!\left[\sigma^{-1} \cdot D(x) \cdot x^2 \cdot (1 - \sigma^{-2})\right]
$$
Thus we reach the same conclusion as in the case where $\mu$ is unknown and $\sigma = 1$: the gradient vanishes only at $\sigma = 1$, is negative for $\sigma < 1$ and positive for $\sigma > 1$, so $\sigma^* = 1$ is the unique global minimum.
A graph for the gradient:

However, the function is still not convex, since the gradient decays to $0$ as $\sigma$ grows further away from $\sigma^*=1$.
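For reference, a script in the same style as the one above can produce such a graph; it evaluates $\mathbb{E}_{x \sim N_i}[\sigma^{-1} D(x)\, x^2 (1 - \sigma^{-2})]$ on a grid of $\sigma$ values (grid width, resolution, and the $\sigma$ range are arbitrary choices).
```python
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

def sigma_gradient(sigma, half_width=20.0, n=4001):
    # dL/dsigma = E_{x ~ N(0, sigma)}[ sigma^{-1} * D(x) * x^2 * (1 - sigma^{-2}) ]
    x = np.linspace(-half_width * sigma, half_width * sigma, n)
    dx = x[1] - x[0]
    p_target = st.norm.pdf(x, 0.0, 1.0)        # target N(0, 1)
    p_gen = st.norm.pdf(x, 0.0, sigma)         # generator N(0, sigma)
    d_opt = p_target / (p_target + p_gen)      # optimal discriminator
    integrand = p_gen * d_opt * x**2 * (1.0 - sigma**-2) / sigma
    return np.sum(integrand) * dx

sigmas = np.linspace(0.2, 5.0, 100)
grads = [sigma_gradient(s) for s in sigmas]

plt.plot(sigmas, grads, label='dL/dsigma')
plt.axhline(0.0, color='gray', linewidth=0.8)
plt.axvline(1.0, color='gray', linewidth=0.8, linestyle='--')
plt.xlabel('sigma')
plt.legend()
plt.show()
```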
### Derivative when both mean and variance are changing
#### Mean
$$
\frac{\partial Loss}{\partial \mu} = \mathbb{E} D(x) \cdot (x \cdot (1- \sigma ^{-2}) + \frac{\mu}{\sigma^2})
$$
##### Case: $\sigma>1$
We argue that $\mathbb{E}[D(x)\cdot x]$ alone is non-negative (assume, without loss of generality, that $\mu > 0$).
$$
\begin{align*}
\mathbb{E}[D(x)\cdot x]
&= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\frac{x}{\sigma\, e^{\frac{(x-\mu)^2}{2\sigma^2}} + e^{\frac{x^2}{2}}}\,dx \\
&= \frac{1}{\sqrt{2\pi}}\int_{0}^{\infty}\frac{x}{\sigma\, e^{\frac{(x-\mu)^2}{2\sigma^2}} + e^{\frac{x^2}{2}}}\,dx + \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0}\frac{x}{\sigma\, e^{\frac{(x-\mu)^2}{2\sigma^2}} + e^{\frac{x^2}{2}}}\,dx \\
&= \frac{1}{\sqrt{2\pi}}\int_{0}^{\infty}\frac{x}{\sigma\, e^{\frac{(x-\mu)^2}{2\sigma^2}} + e^{\frac{x^2}{2}}}\,dx - \frac{1}{\sqrt{2\pi}}\int_{0}^{\infty}\frac{x}{\sigma\, e^{\frac{(x+\mu)^2}{2\sigma^2}} + e^{\frac{x^2}{2}}}\,dx \\
&= \frac{1}{\sqrt{2\pi}}\int_{0}^{\infty}x \cdot \frac{\sigma\left(e^{\frac{(x+\mu)^2}{2\sigma^2}} - e^{\frac{(x-\mu)^2}{2\sigma^2}}\right)}{\left(\sigma\, e^{\frac{(x-\mu)^2}{2\sigma^2}} + e^{\frac{x^2}{2}}\right)\left(\sigma\, e^{\frac{(x+\mu)^2}{2\sigma^2}} + e^{\frac{x^2}{2}}\right)}\,dx
\end{align*}
$$
Since $\mu > 0$ and $x > 0$ on the domain of integration, $(x+\mu)^2 \geq (x-\mu)^2$, so the integrand is non-negative and therefore the whole integral is non-negative.
##### Case: $\sigma<1$
We instead argue for $\frac{\partial Loss}{\partial \mu} \cdot \sigma^2 = \mathbb{E}[ D(x)\cdot (\mu + x \cdot (\sigma^2-1))]$.
As we know $\sigma^2 \cdot \mathbb{E}[D(x)\, x]$ is non-negative, we have:
$$
\frac{\partial Loss}{\partial \mu} \cdot \sigma^2 \geq \mathbb{E}[ D(x)\cdot (\mu - x)]
= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\frac{\mu - x}{\sigma\, e^{\frac{(x-\mu)^2}{2\sigma^2}} + e^{\frac{x^2}{2}}}\,dx
= -\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\frac{x - \mu}{\sigma\, e^{\frac{(x-\mu)^2}{2\sigma^2}} + e^{\frac{x^2}{2}}}\,dx
$$
Now, substituting $y = x-\mu$, we have:
$$
\frac{\partial Loss}{\partial \mu} \cdot \sigma^2 \geq -\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\frac{y}{\sigma\, e^{\frac{y^2}{2\sigma^2}} + e^{\frac{(y+\mu)^2}{2}}}\,dy
$$
Following an argument similar to Case 1 (split the integral at $0$, pair $y$ with $-y$, and use $(y+\mu)^2 \geq (y-\mu)^2$ for $y, \mu \geq 0$), we have:
$$
\int_{-\infty}^{\infty}\frac{y}{\sigma\, e^{\frac{y^2}{2\sigma^2}} + e^{\frac{(y+\mu)^2}{2}}}\,dy \leq 0
$$
Thus, $\frac{\partial Loss}{\partial \mu} \cdot \sigma^2 \geq 0$.
#### Variance
$$
\frac{\partial Loss}{\partial \sigma} = \mathbb{E} D(x) \cdot (x \cdot (1- \sigma ^{-2}) + \frac{\mu}{\sigma^2}) \cdot \frac{x-\mu}{\sigma}
$$
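A small numerical sketch that tabulates both partial derivatives for a few $(\mu, \sigma)$ pairs, useful for eyeballing the sign claims above (the grid parameters and the $(\mu, \sigma)$ values are arbitrary choices):
```python
import numpy as np
import scipy.stats as st

def partials(mu, sigma, n=40001, half_width=20.0):
    # Grid approximation of the two partial derivatives derived above:
    #   dL/dmu    = E[ D(x) (x (1 - 1/sigma^2) + mu / sigma^2) ]
    #   dL/dsigma = E[ D(x) (x (1 - 1/sigma^2) + mu / sigma^2) (x - mu) / sigma ]
    # with x ~ N(mu, sigma) and target N(0, 1).
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    dx = x[1] - x[0]
    p_gen = st.norm.pdf(x, mu, sigma)
    p_tgt = st.norm.pdf(x, 0.0, 1.0)
    d_opt = p_tgt / (p_tgt + p_gen)
    common = p_gen * d_opt * (x * (1.0 - sigma**-2) + mu / sigma**2)
    return np.sum(common) * dx, np.sum(common * (x - mu) / sigma) * dx

for mu in [0.5, 1.0, 2.0]:
    for sigma in [0.5, 0.8, 1.5, 3.0]:
        dmu, dsigma = partials(mu, sigma)
        print(f"mu={mu:4.1f} sigma={sigma:4.1f}  dL/dmu={dmu:+.4f}  dL/dsigma={dsigma:+.4f}")
```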
## Next Step
* Write the bounds obtained in terms of the target distribution's mean
* Generalize it to higher dimension
* Figure out if there is a similar bound for the covariance/variance
* See what would happen if we are using a truncated normal distribution
## Truncated normal distribution
### First order
$$
\begin{align*}
\frac{\partial Loss}{\partial \mu}
&= \frac{\partial}{\partial \mu}E_{x \sim P_{G}}\!\left[ \log\left( 1 - D^*(x) \right)\right] \\
&= E_{x \sim P_{G}}\!\left[ \log\left( 1 - D^*(x) \right) \cdot (x - \mu)\right] - E_{x \sim P_{G}}\!\left[\log(1-D^*(x))\right] \cdot E_{x \sim P_{G}}\!\left[x-\mu\right] \\
&= E_{x \sim P_{G}}\!\left[ \log\left( 1 - D^*(x) \right) \cdot (x - \mu)\right]
\end{align*}
$$
I guess we may need a rule similar to Stein's lemma, but for a truncated normal distribution.
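One candidate identity (a sketch obtained by integrating by parts against the truncated density on an interval $[a, b]$; the notation $\mathrm{TN}(\mu, 1; a, b)$ for the truncated distribution and its normalizer $Z(\mu)$ is mine, not from the setup above):
$$
E_{x \sim \mathrm{TN}(\mu, 1; a, b)}\left[(x - \mu)\, g(x)\right]
= E_{x \sim \mathrm{TN}(\mu, 1; a, b)}\left[g'(x)\right]
- \frac{1}{Z(\mu)}\Big[ N_{\mu}(x)\, g(x) \Big]_{x=a}^{x=b},
\qquad Z(\mu) = \int_a^b N_{\mu}(x)\, dx
$$
i.e. the usual Stein identity plus a boundary correction term.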
## Derivative for $(\mu, \sigma)$
We have
$$
\frac{\partial}{\partial \mu} L(\mu, \sigma)= E_{x \sim N(\mu, \sigma)}[ D^*(\mu, \sigma;x)(x (1-1/\sigma^2) + \mu/\sigma^2)]
$$
We want to prove that it is positive for $\mu > 0$. First, assume that $\sigma > 1$. Then it suffices to show that $E_{x \sim N(\mu,\sigma)}[D^*(\mu, \sigma;x)x] > 0$.
We have
$$
E_{x \sim N(\mu,\sigma)}[D^*(\mu, \sigma;x)x]
= \int_{-\infty}^{\infty} \frac{1}{\frac{1}{N_0(x)}+ \frac{1}{N(\mu, \sigma;x)}} x d x.
$$
We can lower bound the above integral over $\mu \geq 0$ by making the contribution of the negative $x$'s as negative as possible; that is, we want the density $N(\mu,\sigma;x)$ to be as large as possible for $x<0$. Given that $\mu \geq 0$, the worst case is $\mu = 0$.
Therefore
$$
E_{x \sim N(\mu,\sigma)}[D^*(\mu, \sigma;x)x]
\geq
E_{x \sim N(0,\sigma)}[D^*(0, \sigma;x)x]
$$
Moreover, notice that $D^*(0, \sigma;x) = D^*(0, \sigma; -x)$, and therefore $E_{x \sim N(0,\sigma)}[D^*(0, \sigma;x)x] = 0$.
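A quick numerical check of this chain of inequalities (a sketch; the choice $\sigma = 2$ and the $\mu$ values are arbitrary):
```python
import numpy as np
import scipy.stats as st

def e_dx(mu, sigma, n=40001, half_width=20.0):
    # Grid approximation of E_{x ~ N(mu, sigma)}[ D*(mu, sigma; x) * x ]
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    dx = x[1] - x[0]
    p_gen = st.norm.pdf(x, mu, sigma)
    p_tgt = st.norm.pdf(x, 0.0, 1.0)
    return np.sum(p_gen * p_tgt / (p_tgt + p_gen) * x) * dx

sigma = 2.0
for mu in [0.0, 0.25, 0.5, 1.0, 2.0]:
    print(f"mu={mu:4.2f}  E[D* x] = {e_dx(mu, sigma):+.6f}")
# The mu = 0 value is ~0 and the others stay above it.
```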
## Normal variable with a ReLU layer
### Gradient with respect to $\sigma$
* Formulated with a piecewise function
* No longer an expectation: the integration runs over only half of the space
### Case when both $\mu$ and $\sigma$ are changing (with respect to $\mu$)
$$
\frac{ \partial Loss } {\partial \mu} = \exp\left(-\frac{\mu_g^2}{2\sigma_g^2}\right) \cdot \log\left(\frac{1 - \lim_{x \to 0}D(x)}{1 - D(0)}\right) + \frac{1}{\sigma_g^2} \cdot \int_{0}^{\infty} D(x) \cdot \exp\left(-\frac{(x - \mu_g)^2}{2\sigma_g^2}\right) \cdot (\mu_g - \mu_t \cdot \sigma_g^2 + x \cdot (\sigma_g^2 - 1))\,dx
$$
where $\frac{1 - \lim_{x \to 0}D(x)}{1 - D(0)}$ can be reduced to evaluating:
$x^2 (1 - \sigma_g^2) + 2x(\mu_t \cdot \sigma_g^2 - \mu_g)$
### Simpler case: when $\sigma = 1$
$$
\frac{ \partial Loss } {\partial \mu} = \exp\left(-\frac{\mu_g^2}{2\sigma_g^2}\right) \cdot \log\left(\frac{1 - \lim_{x \to 0}D(x)}{1 - D(0)}\right) + \frac{1}{\sigma_g^2} \cdot \int_{0}^{\infty} D(x) \cdot \exp\left(-\frac{(x - \mu_g)^2}{2\sigma_g^2}\right) \cdot (\mu_g - \mu_t)\,dx
$$
where $\frac{1 - \lim_{x \to 0}D(x)}{1 - D(0)}$ can be reduced to evaluating:
$2x(\mu_t - \mu_g)$
### Case when both $\mu$ and $\sigma$ are changing (with respect to $\sigma$)
$$
\frac{ \partial Loss } {\partial \sigma} = \mu_g \cdot \exp\left(-\frac{\mu_g^2}{2\sigma_g^2}\right) \cdot \log\left(\frac{1 - D(0)}{1 - \lim_{x \to 0}D(x)}\right) + \int_{0}^{\infty} D(x) \cdot \exp\left(-\frac{(x - \mu_g)^2}{2\sigma_g^2}\right) \cdot (x - \mu_g) \cdot (\mu_g - \mu_t \cdot \sigma_g^2 + x \cdot (\sigma_g^2 - 1))\,dx
$$
#### Simpler Case of Case II: $\mu = \mu_g = \mu_t$
$$
\frac{ \partial Loss } {\partial \sigma} = \mu \cdot \exp\left(-\frac{\mu^2}{2\sigma_g^2}\right) \cdot \log\left(\frac{1 - D(0)}{1 - \lim_{x \to 0}D(x)}\right) + \int_{0}^{\infty} D(x) \cdot \exp\left(-\frac{(x - \mu)^2}{2\sigma_g^2}\right) \cdot (x - \mu)^2 \cdot (\sigma_g^2 - 1)\,dx
$$
We focus on showing:
$\frac{1 - D(0)}{1 - \lim_{x \to 0}D(x)} \geq 1$ when $\sigma_g > 1$, and vice versa.
Expanding it we have:
$$
\frac{1 - D(0)}{1 - \lim_{x \to 0}D(x)}
= \frac{ \sigma_g^{-1} \cdot \int_{-\infty}^{0} \exp\left( -\frac{(x - \mu)^2}{2 \sigma_g^2} \right) dx }{ \sigma_g^{-1} \cdot \int_{-\infty}^{0} \exp\left( -\frac{(x - \mu)^2}{2 \sigma_g^2} \right) dx + \int_{-\infty}^{0} \exp\left(-\frac{(x - \mu)^2}{2}\right) dx} \cdot \frac{ \sigma_g^{-1} \cdot \exp\left(-\frac{\mu^2}{2 \sigma_g^2}\right) + \exp\left(-\frac{\mu^2}{2}\right) }{ \sigma_g^{-1} \cdot \exp\left(-\frac{\mu^2}{2 \sigma_g^2}\right) }
$$
**Numerator** - **Denominator** (up to a positive factor) gives us:
$$ \int_{-\infty}^{0} \exp\left( -\frac{(x - \mu)^2 }{2\sigma_g^2} - \frac{\mu^2}{2} \right)dx - \int_{-\infty}^{0} \exp\left( -\frac{(x - \mu)^2}{2} - \frac{\mu^2}{2\sigma_g^2} \right) dx$$
Directly comparing the exponents gives us:
$$
-\frac{(x - \mu)^2}{2\sigma_g^2} - \frac{\mu^2}{2} + \frac{(x - \mu)^2}{2} + \frac{\mu^2}{2\sigma_g^2}
= \frac{\sigma_g^2 - 1}{2\sigma_g^2} \cdot \left( (x - \mu)^2 - \mu^2 \right)
= \frac{\sigma_g^2 - 1}{2\sigma_g^2} \cdot ( x^2 - 2 \mu x)
$$
In case $\mu>0$, as our range of integration is $x \in (-\infty, 0)$, we easily have $(x^2 - 2 \mu x) > 0$. This thus gives us the conclusion we want.
In case $\mu < 0$, we need to rewrite our integration in $$\frac{1 - D(0)}{1 - \lim_{x \to 0}D(x)}$$
Denote $A = \sigma_g^{-1} \cdot \int_{-\infty}^{\infty} \exp\left( -\frac{(x - \mu)^2 }{2 \sigma_g^2} \right) dx$
Denote $B = \int_{-\infty}^{\infty} \exp\left( -\frac{(x - \mu)^2 }{2 } \right) dx$
$$
\frac{1 - D(0)}{1 - \lim_{x \to 0}D(x)}
= \frac{ A - \sigma_g^{-1} \cdot \int_{0}^{\infty} \exp\left( -\frac{(x - \mu)^2}{2\sigma_g^2} \right) dx }{ A + B - \sigma_g^{-1} \cdot \int_{0}^{\infty} \exp\left( -\frac{(x - \mu)^2}{2 \sigma_g^2} \right) dx - \int_{0}^{\infty} \exp\left(-\frac{(x - \mu)^2}{2}\right) dx} \cdot \frac{ \sigma_g^{-1} \cdot \exp\left(-\frac{\mu^2}{2 \sigma_g^2}\right) + \exp\left(-\frac{\mu^2}{2}\right) }{ \sigma_g^{-1} \cdot \exp\left(-\frac{\mu^2}{2 \sigma_g^2}\right) }
$$
**Numerator** - **Denominator** (up to a positive factor) gives us:
$$
\left( A \cdot \sigma_g \exp\left(-\frac{\mu^2}{2}\right) - B \cdot \exp\left( -\frac{\mu^2}{2\sigma_g^2} \right) \right) + \int_{0}^{\infty} \left[ \exp\left(- \frac{(x - \mu)^2}{2} - \frac{\mu^2}{2\sigma_g^2}\right) - \exp\left( -\frac{(x - \mu)^2}{2\sigma_g^2} - \frac{\mu^2}{2} \right) \right] dx
$$
**FACT**: $A = B = \sqrt{2\pi}$, so $A \cdot \sigma_g \geq B$ when $\sigma_g \geq 1$, and vice versa.
Then, the first part is reduced to comparing:
$$ \exp\left(-\frac{\mu^2}{2}\right) < \exp\left(-\frac{\mu^2}{2\sigma_g^2}\right) $$
The second part is reduced to:
$(x^2 - 2x\mu) \cdot (1 - \sigma_g^2)$
Now, we have $x > 0$ and $\mu < 0$.
Both parts have the behavior 'greater than 0' when $\sigma_g < 1$ and vice versa.
The reversed behavior is corrected by the $\mu$ multiplied in front of the log term.
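A numerical spot check of the ratio claim for the $\mu \geq 0$ case, under my reading of the ReLU setup (generator $\max(0, z)$ with $z \sim N(\mu, \sigma_g)$, target $\max(0, z)$ with $z \sim N(\mu, 1)$; $D(0)$ compares the two point masses at $0$, while $\lim_{x \to 0^+} D(x)$ compares the densities of the continuous parts):
```python
import numpy as np
from scipy.stats import norm

def ratio(mu, sigma_g):
    # (1 - D(0)) / (1 - lim_{x -> 0+} D(x)) for relu-pushed Gaussians.
    mass_t = norm.cdf(-mu)                         # P(target = 0)
    mass_g = norm.cdf(-mu / sigma_g)               # P(generator = 0)
    dens_t = norm.pdf(0.0, loc=mu, scale=1.0)      # target density at 0+
    dens_g = norm.pdf(0.0, loc=mu, scale=sigma_g)  # generator density at 0+
    one_minus_D0 = mass_g / (mass_t + mass_g)
    one_minus_lim = dens_g / (dens_t + dens_g)
    return one_minus_D0 / one_minus_lim

for mu in [0.0, 0.5, 1.0, 2.0]:
    for sigma_g in [0.5, 0.8, 1.25, 2.0]:
        print(f"mu={mu:+.1f} sigma_g={sigma_g:4.2f}  ratio={ratio(mu, sigma_g):.4f}")
# For these mu >= 0 the ratio is >= 1 exactly when sigma_g >= 1.
```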
### New observations
Denote $A = \exp\left(-\frac{\mu_g^2}{2\sigma_g^2}\right) \cdot \log\left(\frac{1 - \lim_{x \to 0}D(x)}{1 - D(0)}\right)$ (redefining $A$ and $B$ from above)
Denote $B = D(x) \cdot \exp\left(-\frac{(x - \mu_g)^2}{2\sigma_g^2}\right)$
#### Case I: $\frac{\partial Loss}{\partial \mu_g}$
if $\mu_g \geq \mu_t$, $\frac{\partial Loss}{\partial \mu} \geq A + \sigma_g^{-2} \cdot \int_{0}^{\infty} B \cdot (x - \mu_g) \cdot (\sigma_g^2-1) dx$
if $\mu_g \leq \mu_t$, $\frac{\partial Loss}{\partial \mu} \leq A + \sigma_g^{-2} \cdot \int_{0}^{\infty} B \cdot (x - \mu_g) \cdot (\sigma_g^2-1) dx$
#### Case II: $\frac{\partial Loss}{\partial \sigma_g}$
Still haven't figured out how to bound $\int_{0}^{\infty} B \cdot (x - \mu_g) dx$
Assume $\int_{0}^{\infty} B \cdot (x - \mu_g) dx \geq 0$,
if $\mu_g \geq \mu_t$, $\frac{\partial Loss}{\partial \sigma}\geq -A + \int_{0}^{\infty} B \cdot (x - \mu_g)^2 \cdot (\sigma_g^2 - 1) dx$
if $\mu_g \leq \mu_t$, $\frac{\partial Loss}{\partial \sigma} \leq -A + \int_{0}^{\infty} B \cdot (x - \mu_g)^2 \cdot (\sigma_g^2 - 1) dx$
Assume $\int_{0}^{\infty} B \cdot (x - \mu_g) dx \leq 0$,
if $\mu_g \leq \mu_t$, $\frac{\partial Loss}{\partial \sigma} \geq -A + \int_{0}^{\infty} B \cdot (x - \mu_g)^2 \cdot (\sigma_g^2 - 1) dx$
if $\mu_g \geq \mu_t$, $\frac{\partial Loss}{\partial \sigma} \leq -A + \int_{0}^{\infty} B \cdot (x - \mu_g)^2 \cdot (\sigma_g^2 - 1) dx$
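Since I have no bound yet, here is a numerical exploration of $\int_{0}^{\infty} B \cdot (x - \mu_g)\, dx$ (a sketch; $D(x)$ for $x > 0$ is taken to be the density-ratio discriminator between $N(\mu_t, 1)$ and $N(\mu_g, \sigma_g)$, which is my reading of the setup, and the parameter values are arbitrary):
```python
import numpy as np
from scipy.stats import norm

def integral_B(mu_g, sigma_g, mu_t, n=40001):
    # Approximates  int_0^inf D(x) * exp(-(x - mu_g)^2 / (2 sigma_g^2)) * (x - mu_g) dx
    upper = max(mu_g, mu_t, 0.0) + 15.0 * max(sigma_g, 1.0)
    x = np.linspace(0.0, upper, n)
    dx = x[1] - x[0]
    p_t = norm.pdf(x, loc=mu_t, scale=1.0)
    p_g = norm.pdf(x, loc=mu_g, scale=sigma_g)
    D = p_t / (p_t + p_g)
    return np.sum(D * np.exp(-(x - mu_g) ** 2 / (2.0 * sigma_g**2)) * (x - mu_g)) * dx

for mu_g, mu_t, sigma_g in [(0.5, 0.0, 2.0), (0.5, 0.0, 0.5), (1.0, 1.0, 1.5), (-0.5, 0.5, 2.0)]:
    val = integral_B(mu_g, sigma_g, mu_t)
    print(f"mu_g={mu_g:+.1f} mu_t={mu_t:+.1f} sigma_g={sigma_g:.1f}  integral={val:+.4f}")
```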
It really makes a lot of sense if it is:
$\mathbb{E}_{x \sim N_g} x \cdot ( I - \Sigma^{-1} ) \cdot x^T \cdot \Sigma^{-1} \cdot (W + W^T) \cdot D^*(W;x)$
instead of:
$\mathbb{E}_{x \sim N_g} x^T \cdot ( I - \Sigma^{-1}) \cdot x \cdot \Sigma^{-1} \cdot (W + W^T) \cdot D^*(W;x)$
In the first case, the resulting matrix gradient for diagonal matrix $\Lambda$ will be:
$\Lambda_{ij} = \lambda_{ij} - \frac{1}{\lambda_{ij}}$
## Gradient for Diagonal Matrix
$$
\frac{\partial}{\partial \lambda_j} L(\Lambda) =
\int \log(1- D(x)) N_{\Lambda}(x) \left(\frac{x_j^2}{\lambda_j^3} - \frac{1}{\lambda_j}\right) d x
$$
Reparametrize with respect to $W^{-1}$ (this is the new matrix that we are trying to estimate).
Gradient for matrix $W^{-1}$: let $y = W^{-1}x$ with $x \sim N(0, I)$. Then $y$ is distributed according to $N(0, W^{-1} (W^{-1})^T) = N(0, (W^TW)^{-1})$, i.e. the density is
$$
N_W(x) = \frac{1}{(2 \pi)^{d/2}} \exp\left(-\frac 1 2 x^T W^T W x + \frac 12 \log |W^TW|\right)
$$
Therefore, the derivative with respect to $W$ is
$$
L'(W) = \int \log(1- D(x)) N_W(x) (-W x x^T + (W^{-1})^T ) d x
$$
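A quick Monte Carlo check of the reparametrization above, i.e. that $y = W^{-1}x$ with $x \sim N(0, I)$ has covariance $(W^TW)^{-1}$ (the particular $W$, seed, and sample size are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W = rng.normal(size=(d, d)) + 2.0 * np.eye(d)    # an arbitrary invertible matrix
W_inv = np.linalg.inv(W)

x = rng.normal(size=(200_000, d))                # rows are samples of x ~ N(0, I)
y = x @ W_inv.T                                  # y = W^{-1} x, applied row-wise

empirical_cov = np.cov(y, rowvar=False)
predicted_cov = np.linalg.inv(W.T @ W)           # (W^T W)^{-1}
print(np.max(np.abs(empirical_cov - predicted_cov)))   # small sampling error
```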
We have
$$
\nabla_x N_{\Sigma}(x) = - \Sigma^{-1} x N_{\Sigma}(x).
$$
Therefore, integrating by parts (Stein's lemma) we get
$$
\int N_{\Sigma}(x) x f(x) d x=
\Sigma \int N_{\Sigma}(x) \nabla_x f(x) d x
$$
In particular,
$$
\int N_{\Sigma}(x) x x^T f(x)\, d x =
\Sigma \int N_{\Sigma}(x) f(x) d x +
\Sigma \int N_{\Sigma}(x) \nabla_x f(x) x^T d x
$$
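A Monte Carlo check of the first of these identities (a sketch; the covariance $\Sigma$ and the test function $f(x) = \tanh(a^T x)$ are arbitrary choices):
```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)                  # an arbitrary positive-definite covariance
a = np.array([0.7, -1.3])                    # parameter of the test function

def f(x):                                    # scalar test function, applied row-wise
    return np.tanh(x @ a)

def grad_f(x):                               # its gradient: a * (1 - tanh(a.x)^2)
    return (1.0 - np.tanh(x @ a) ** 2)[:, None] * a

x = rng.multivariate_normal(np.zeros(d), Sigma, size=500_000)
lhs = (x * f(x)[:, None]).mean(axis=0)       # E[x f(x)]
rhs = Sigma @ grad_f(x).mean(axis=0)         # Sigma E[grad f(x)]
print(lhs, rhs)                              # should agree up to Monte Carlo noise
```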
We have
$$
L'(W) = (W^{-1})^T \int N_{W}(x) \nabla_x \left[\log(1-D(x))\right]\ x^T d x
$$
We have
$$
\log(1-D(x)) = \log\left(\frac{1}{1+N_0(x)/N_W(x)} \right)
$$
\begin{align*}
\nabla_x [\log(1 - D(x))]
&= - \frac{1}{1 + N_0(x)/N_\Sigma(x)} \frac{N_0(x)}{N_\Sigma(x)}(\Sigma^{-1} - I)x \\
&= D(x)(I - \Sigma^{-1}) x
\end{align*}
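A finite-difference check of this formula for $\nabla_x \log(1 - D(x))$ in two dimensions (the covariance $\Sigma$ and the test point are arbitrary choices):
```python
import numpy as np
from scipy.stats import multivariate_normal

d = 2
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])   # an arbitrary generator covariance
Sigma_inv = np.linalg.inv(Sigma)
N0 = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))   # target N(0, I)
NS = multivariate_normal(mean=np.zeros(d), cov=Sigma)       # generator N(0, Sigma)

def D(x):
    # Optimal discriminator N_0 / (N_0 + N_Sigma)
    return N0.pdf(x) / (N0.pdf(x) + NS.pdf(x))

def log_one_minus_D(x):
    return np.log(1.0 - D(x))

x = np.array([0.8, -1.1])
eps = 1e-6
numeric = np.array([
    (log_one_minus_D(x + eps * e) - log_one_minus_D(x - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
closed_form = D(x) * ((np.eye(d) - Sigma_inv) @ x)
print(numeric, closed_form)                  # the two gradients should match
```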
---
I don't know if I have done it correctly.
$$
\begin{align*}
L'(W) &= (W^{-1})^T \int N_{W}(x) \nabla_x \left[\log(1-D(x))\right]\ x^T d x \\
&= (W^{-1})^T \int N(\Sigma^{-1};x) \cdot (1 - D(x)) \cdot \frac{N(I^{-1};x)}{N(\Sigma^{-1};x)} \cdot (I^{-1} - \Sigma^{-1}) \cdot xx^T\, dx \\
&= (W^{-1})^T \int \frac{N(\Sigma^{-1};x) \cdot N(I^{-1};x)}{N(\Sigma^{-1};x) + N(I^{-1};x)} \cdot (I^{-1} - \Sigma^{-1}) \cdot xx^T\, dx
\end{align*}
$$
For the diagonal elements, it is obvious that the sign solely depends on $1 - \lambda_{i}$.
For the non-diagonal elements, assume that we have a diagonal matrix; then the density function is
$N(\Lambda^{-1};x) \propto \exp\left(-\frac{1}{2}\sum_i \lambda_i^{-1} x_i^2\right)$, which is symmetric with respect to each entry $x_i$.
Consider entry $(i,j)$ where $i \neq j$ in the resulting gradient matrix.
Observation:
$$
\frac{N(\Sigma^{-1};x) \cdot N(I^{-1};x)}{N(\Sigma^{-1};x) + N(I^{-1};x)}= \frac{N(\Sigma^{-1};-x) \cdot N(I^{-1};-x)}{N(\Sigma^{-1};-x) + N(I^{-1};-x)}
$$
when there is no correlation between entries in the two distributions.
\begin{align*}
L'(W)_{i,j} &= \lambda_i^{-1} \cdot (1 - \lambda_i^{-1}) \int_{x_1} \int_{x_2} \cdots \int_{x_i} \int_{x_j} \frac{N(\Sigma^{-1};x) \cdot N(I^{-1};x)}{N(\Sigma^{-1};x) + N(I^{-1};x)} \cdot x_ix_j \\
&= \lambda_i^{-1} \cdot (1 - \lambda_i^{-1}) \int_{x_1} \int_{x_2} \cdots \int_{x_i} x_i \cdot \int_{x_j} \frac{N(\Sigma^{-1};x) \cdot N(I^{-1};x)}{N(\Sigma^{-1};x) + N(I^{-1};x)} \cdot x_j \\
&= \lambda_i^{-1} \cdot (1 - \lambda_i^{-1}) \int_{x_1} \int_{x_2} \cdots \int_{x_i} x_i \cdot 0 \\
&= 0 \\
\end{align*}
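A Monte Carlo check that the off-diagonal entries indeed vanish for a diagonal $\Lambda$ (a sketch; it estimates $E_{x \sim N_\Lambda}\left[D(x)\,(I - \Lambda^{-1})\,xx^T\right]$, which is the integral above written as an expectation over the generator, and the $\lambda_i$ values are arbitrary):
```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
d = 3
lam = np.array([0.5, 1.0, 2.5])              # arbitrary diagonal covariance entries
Lam = np.diag(lam)
N0 = multivariate_normal(np.zeros(d), np.eye(d))   # target N(0, I)
NL = multivariate_normal(np.zeros(d), Lam)         # generator N(0, Lambda)

x = rng.multivariate_normal(np.zeros(d), Lam, size=400_000)
D = N0.pdf(x) / (N0.pdf(x) + NL.pdf(x))      # optimal discriminator at each sample

E_Dxx = (D[:, None] * x).T @ x / len(x)      # E[D(x) x x^T]
M = (np.eye(d) - np.diag(1.0 / lam)) @ E_Dxx
print(np.round(M, 4))    # off-diagonal entries are ~0 up to Monte Carlo noise
```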
Consider the case of learning a non-diagonal matrix:
we transform our space such that our target becomes a non-diagonal matrix and our generator distribution becomes the identity matrix.
## Missing Gradient under imperfect discriminator
$$
\frac{\partial}{\partial p_g} \log { \frac{p_d}{p_d + p_g \epsilon_x} }
$$
### Missing Gradient
$$
\int (1 - D(x)) \cdot p_g'
$$
$$
\int p_g' \cdot \log(D(x)) + \int (1 - D(x)) \cdot p_g'
$$
$$
\nabla \int p_g \cdot \log(D(x))
$$
Objective:
$$
\nabla \int p_g \cdot \log\left( \frac{p_d}{p_d + p_g \cdot \epsilon_x} \right)
$$