# Parametrization independence of natural gradient flow

## Notation

I'm going to consider two parametrisations, $w$ and $\beta$, as in the separable classification papers. I will assume there is a mapping $P$ between them such that $\beta = P(w)$. I will assume $P$ is differentiable, and I'll denote its Jacobian by $J$, so that

$$
J_{i,j} = \frac{\partial \beta_i}{\partial w_j}
$$

With slightly overloaded notation, I'll write the Fisher information matrix with respect to $\beta$ as $F(\beta)$ or simply $F$, and the Fisher information matrix with respect to $w$ as $F(w)$. In general, the following relationship holds between the two:

$$
F(w) = J^T F(\beta) J
$$

I'll denote the loss function by $\mathcal{L}(\beta)$ or $\mathcal{L}(w)$, and its gradient with respect to $\beta$ by $\nabla_\beta^T \mathcal{L}(\beta)$. I'll write $A^+$ for the Moore-Penrose pseudoinverse of $A$.

## Invariance

Natural gradient flow (NGF) with respect to $\beta$ is defined as:
\begin{align}
\dot{\beta} &= - F^+(\beta) \nabla_\beta^T \mathcal{L}(\beta)
\end{align}
And NGF in $w$ space is:
\begin{align}
\dot{w} &= - F^+(w) \nabla_w^T \mathcal{L}(w)
\end{align}
Now let's see what happens if we run NGF in $w$ space starting from $w_0$ such that $\beta_0 = P(w_0)$, and then map the trajectory $w_t$ to $\beta$-space using the mapping $P$, i.e. $\beta_t = P(w_t)$.
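As a quick sanity check of the transformation rule $F(w) = J^T F(\beta) J$, here is a minimal numerical sketch. The model is a hypothetical choice (a 1-D Bernoulli with success probability $\beta$ and logit $w$, i.e. $\beta = \sigma(w)$), not one fixed by this note; it's just small enough that both sides of the identity can be computed exactly:

```python
import numpy as np

# Hypothetical 1-D example: Bernoulli model where beta is the success
# probability and w the logit, beta = sigmoid(w).
def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

w = 0.7                   # arbitrary point in w-space
beta = sigmoid(w)         # corresponding point in beta-space
J = beta * (1.0 - beta)   # Jacobian d(beta)/d(w) of the sigmoid

# Fisher information of the Bernoulli in the beta-parametrisation.
F_beta = 1.0 / (beta * (1.0 - beta))

# Fisher information in the w-parametrisation, computed directly as the
# expectation of the squared score d/dw log p(x | w) = x - beta over x in {0, 1}.
F_w_direct = sum(p_x * (x - beta) ** 2 for x, p_x in [(0, 1 - beta), (1, beta)])

# The transformation rule F(w) = J^T F(beta) J (scalars here, so just products).
F_w_rule = J * F_beta * J

assert np.isclose(F_w_direct, F_w_rule)
```

Both computations give $\beta(1-\beta)$, the well-known Fisher information of the Bernoulli in the logit parametrisation.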
Using the chain rule:

$$
\dot{\beta} = J \dot{w}
$$

Substituting the differential equation for NGF in $w$ space, then using the chain rule again ($\nabla_w^T \mathcal{L}(w) = J^T \nabla_\beta^T \mathcal{L}(\beta)$) together with the identity for $F(w)$:
\begin{align}
\dot{\beta} &= J \dot{w} \\
&= - J F^+(w) \nabla_w^T \mathcal{L}(w) \\
&= - J \left(J^T F(\beta) J\right)^+ J^T \nabla_\beta^T \mathcal{L}(\beta)
\end{align}
Next, we apply the identity $(AB)^+ = B^+ A^+$ twice. (Caution: this identity does not hold for arbitrary matrices; it is valid here when, for example, $J$ and $F(\beta)$ are both invertible, so that the pseudoinverses are ordinary inverses.) This gives:
\begin{align}
\dot{\beta} &= - J J^+ F(\beta)^+ (J^T)^+ J^T \nabla_\beta^T \mathcal{L}(\beta)
\end{align}
Now, if $J$ is square and full rank, $J J^+ = (J^T)^+ J^T = I$, leaving:
\begin{align}
\dot{\beta} &= - F(\beta)^+ \nabla_\beta^T \mathcal{L}(\beta)
\end{align}
which is exactly natural gradient flow in $\beta$ space.

## Notes on counterexamples

In this part I'll illustrate why the full-rank Jacobian condition matters, and discuss a class of counterexamples to invariance that Francisco mentioned: parametrisations for which not all values of $\beta$ can be represented by some $w$.

Consider the 1-d parametrisation $\beta = \exp(w)$. Here, only positive values of $\beta$ can be represented by a $w$. What is going to happen in this case? Assume we start from an initialisation $w_0$ (for which, necessarily, $\beta_0 > 0$), and the loss function is such that NGF in $\beta$ space converges to a negative $\beta_\infty$. Clearly, no gradient flow of any kind in $w$ can ever converge to a negative $\beta$ (complex numbers not allowed), so invariance cannot hold.
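Before looking at this counterexample in detail, here is a numerical sketch of the invariance result derived in the previous section. All the concrete choices are hypothetical: a linear reparametrisation $\beta = P(w) = Aw$ (so $J = A$ is constant and invertible), a fixed symmetric positive-definite $F(\beta)$, and a quadratic loss $\mathcal{L}(\beta) = \tfrac{1}{2}(\beta - b)^T Q (\beta - b)$. Euler-discretising both flows and mapping the $w$-trajectory through $P$ recovers the $\beta$-trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Hypothetical concrete setting (not from the note): linear reparametrisation
# beta = A w, SPD Fisher F_beta, quadratic loss L(beta) = 0.5 (beta-b)^T Q (beta-b).
A = rng.normal(size=(d, d)) + 3 * np.eye(d)   # invertible with high probability
M = rng.normal(size=(d, d))
F_beta = M @ M.T + np.eye(d)                  # SPD Fisher in beta-space
Q = np.eye(d)
b = rng.normal(size=d)

def grad_beta(beta):
    return Q @ (beta - b)                     # gradient of L w.r.t. beta

F_w = A.T @ F_beta @ A                        # F(w) = J^T F(beta) J

w = rng.normal(size=d)
beta = A @ w                                  # same starting point in both spaces

h, n_steps = 1e-3, 2000
for _ in range(n_steps):
    # NGF in beta-space: beta-dot = -F(beta)^+ grad_beta L
    beta = beta - h * np.linalg.solve(F_beta, grad_beta(beta))
    # NGF in w-space: chain rule gives grad_w L = J^T grad_beta L
    g_w = A.T @ grad_beta(A @ w)
    w = w - h * np.linalg.solve(F_w, g_w)

# Mapping the w-trajectory through P recovers the beta-trajectory.
assert np.allclose(A @ w, beta, atol=1e-8)
```

For this linear $P$ the two discrete iterations coincide exactly (up to floating-point error), since $A(A^T F A)^{-1} A^T = F^{-1}$ step by step; for nonlinear $P$ the discretised trajectories would only agree up to the Euler step error.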
I think what will happen is this: there is a time point $T$ at which $\beta_T = 0$ along the natural gradient flow defined in $\beta$ space. For $t < T$, NGF in $\beta$ and $w$ space are equivalent. The parameter $w_t$ diverges to $-\infty$ as $t \rightarrow T$. At time $T$, the condition for parametrisation invariance no longer holds, since the Jacobian converges to $0$ (a rank-0 matrix of size $1 \times 1$). At this point NGF in $w$ grinds to a halt and stops: $\forall t \geq T: w_t = -\infty$, and the corresponding $\beta_t$ gets stuck at $0$. So: until the conditions for reparametrisation invariance are violated, NGF is in fact invariant to reparametrisation in this case. Once they are violated, bad things happen and the trajectories are no longer the same.

As a sidenote, we can interpret the full-rank Jacobian condition as a generalisation of the parametrisation being strictly monotonic. Indeed, in 1D, the Jacobian being full rank means it is non-zero, which means strict monotonicity (either decreasing or increasing), which in turn implies invertibility of the function if it is continuous. For higher-dimensional spaces, things are a bit more complicated.
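The behaviour described above can be sketched numerically. The specific choices here are hypothetical and only for illustration: I take $F(\beta) = 1$ (any positive metric transformed by $F(w) = J^2 F(\beta)$ exhibits the same phenomenon) and $\mathcal{L}(\beta) = \tfrac{1}{2}(\beta + 1)^2$, whose minimum sits at $\beta = -1 < 0$. With $\beta = \exp(w)$ we get $F(w) = e^{2w}$ and hence $\dot{w} = -(1 + e^{-w})$:

```python
import numpy as np

h = 1e-4
T = np.log(2.0)   # time at which the beta-space flow hits beta = 0, for beta_0 = 1

beta = 1.0        # NGF state in beta-space
w = 0.0           # NGF state in w-space; exp(w) = 1 = beta_0

# Phase 1: before beta reaches 0, the two flows agree (J = exp(w) is non-zero).
for _ in range(int(0.6 / h)):
    beta = beta - h * (beta + 1.0)    # beta-dot = -F(beta)^+ L'(beta) = -(beta + 1)
    w = w - h * (1.0 + np.exp(-w))    # w-dot = -e^{-2w} e^w (e^w + 1) = -(1 + e^{-w})
agree = abs(np.exp(w) - beta)

# Phase 2: past T, beta-space NGF goes negative; the w-space flow cannot follow,
# since exp(w) > 0 always. (w is clipped to avoid floating-point overflow.)
for _ in range(int(1.4 / h)):
    beta = beta - h * (beta + 1.0)
    w = max(w - h * (1.0 + np.exp(-w)), -100.0)

assert agree < 1e-3              # trajectories match while beta > 0
assert beta < 0.0                # beta-space NGF crosses into negative territory
assert 0.0 <= np.exp(w) < 1e-6   # mapped w-space flow is stuck at (numerically) 0
```

As predicted, the mapped $w$-space trajectory tracks the $\beta$-space one until $\beta$ approaches $0$, after which $w$ plunges towards $-\infty$ and $\exp(w)$ stalls at $0$ while the $\beta$-space flow continues into negative territory.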