In a neural network, the following holds (in this notation, I will assume $\nabla_\theta f(x)$ is a row vector and $\theta$ is a column vector):

$$ \nabla_\theta f(x) = \sum_i u_i(x) \sigma_i v_i $$

where the $\sigma^2_i$ are proportional to the eigenvalues, and the $v_i$ are the eigenvectors, of the Fisher information matrix. $\sigma_i$, $v_i$ and $u_i$ all depend on the parameter $\theta$. The $u_i$ are "eigenfunctions" of the neural tangent kernel, and the $\sigma^2_i$ are its eigenvalues, so that the following also holds:

$$ k(x, x') = \sum_i \sigma^2_i u_i(x) u_i(x') $$

Let's linearize this network around $\theta_0$ and consider the following reparametrization:

\begin{align}
f(x; \theta) &\approx f(x; \theta_0) + \nabla_\theta f(x) (\theta - \theta_0) \\
&= f(x; \theta_0) + \sum_i u_i(x) \underbrace{\sigma_i v_i (\theta - \theta_0)}_{w_i}\\
&= f(x; \theta_0) + \sum_i w_i u_i(x)
\end{align}

We call this reparametrized linear model the $u$-features linear model. Why is this interesting? Let's consider what happens to $w$ during one gradient step in $\theta$. First off, we have that:

$$ w^{t+1}_i = w^t_i - \eta \sigma_i v_i \nabla^T_\theta\mathcal{L} $$

Or, writing this in terms of the column vector $w$:

$$ w^{t+1} = w^t - \eta S V \nabla^T_\theta\mathcal{L}, $$

where $S$ is the diagonal matrix containing the $\sigma_i$ and $V$ is the matrix whose rows are the $v_i$. Secondly, by the chain rule,

$$ \nabla^T_\theta \mathcal{L}(\theta) = (\nabla_\theta w)^T \nabla^T_w \mathcal{L}(w) = V^T S^T \nabla^T_w\mathcal{L}(w) $$

Putting these together we have:

\begin{align}
w^{t+1} &= w^t - \eta S V V^T S \nabla^T_w\mathcal{L}(w) \\
&= w^t - \eta S^2 \nabla^T_w\mathcal{L}(w)
\end{align}

Thus, a single gradient step in $\theta$ can be interpreted as a gradient step in the $u$-features model, except with different learning rates for different $u$-features: the learning rate for the coefficient of $u_i$ is modulated by $\sigma^2_i$. Therefore, we can say that:

* $u$-features with a high corresponding eigenvalue describe directions in function space along which $f$ can change fast.
* $u$-features with a low corresponding eigenvalue describe directions in function space along which $f$ can change only very slowly.

Now, if we assume that there is a value $R$ such that $\sigma_j \ll \sigma_i$ whenever $i\leq R < j$, that is, the NTK has effective rank $R$, we can say that the function changes only along the directions $u_i, i\leq R$, and the model locally behaves like a linear model of effective dimensionality $R$. Thus, the effective rank of the NTK can be interpreted as an effective parameter count of some sort. I hypothesize that a low effective rank characterises good generalisation properties: that during the chaotic phase of learning, the NTK gradually becomes low-rank, finding the right feature subspace to learn in, and that this allows good generalisation in the subsequent phase, in which the model's effective dimensionality is limited.
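To make this concrete, here is a minimal numerical sketch (NumPy; the shapes $N$ and $P$, the random Jacobian and the squared loss are made-up stand-ins for a real linearised network). It checks the two claims above: the NTK Gram matrix has eigenvalues $\sigma^2_i$, and one gradient step in $\theta$ reproduces the $\sigma^2_i$-scaled step in the $u$-feature coefficients $w$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 20, 100                    # hypothetical: N data points, P parameters
J = rng.normal(size=(N, P))       # row n stands in for the row vector grad_theta f(x_n)
f0 = rng.normal(size=N)           # f(x_n; theta_0)
y = rng.normal(size=N)            # targets, assuming a squared loss
eta = 0.1

# SVD of the Jacobian: J = U diag(sigma) Vt, i.e. grad_theta f(x) = sum_i u_i(x) sigma_i v_i
U, sigma, Vt = np.linalg.svd(J, full_matrices=False)

# Empirical NTK Gram matrix: k(x_n, x_m) = sum_i sigma_i^2 u_i(x_n) u_i(x_m)
K = J @ J.T
assert np.allclose(np.linalg.eigvalsh(K)[::-1], sigma**2)

# One gradient step in theta on the linearised model with squared loss
theta0 = np.zeros(P)
theta = rng.normal(size=P)
resid = f0 + J @ (theta - theta0) - y       # f(x; theta) - y
theta_next = theta - eta * J.T @ resid      # grad_theta L = resid^T J (a row vector)

# The same step, seen through the u-feature coordinates w_i = sigma_i v_i (theta - theta_0)
w = sigma * (Vt @ (theta - theta0))
w_next_from_theta = sigma * (Vt @ (theta_next - theta0))

# Claimed update rule: w_next = w - eta * sigma^2 * grad_w^T L, with grad_w^T L = U^T resid
w_next_claimed = w - eta * sigma**2 * (U.T @ resid)
assert np.allclose(w_next_from_theta, w_next_claimed)
```

Counting how many of the $\sigma^2_i$ are non-negligible then gives the effective rank $R$ discussed above.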
## Redundant features in the linearized model

Another way of looking at this is to say that the linearized model $f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x) (\theta - \theta_0)$ will have redundant features. This linear model has as many nonlinear features as there are parameters, since $\nabla_\theta f(x)$ has $P$ dimensions. However, some of these features are very similar to one another, or overlap a lot.

Imagine a generalised linear model with two nonlinear features, $a$ and $b$, in which the feature $b$ appears repeated $k$ times with $k$ independent coefficients:

$$ f(x) = w_0 a(x) + \sum_{i=1}^{k} w_i b(x) $$

In such a model, the higher $k$ is, the more the model's gradient updates will "make use of" the feature $b$. This model is equivalent to a two-parameter model with only one copy of $a$ and $b$, but where the effective learning rate of the coefficient in front of $b$ is cranked up by a factor of $k$. Thus, the coefficient for $b$ is going to move a lot faster than the coefficient for $a$. If $a$ and $b$ are equally well suited to solving the task, this will lead to the model converging to a function proportional to $b$.

So my view can be summarised as follows: in the linearised model, the features are not orthogonal, and some features are represented in a redundant fashion. By reparametrising the model to an orthogonal feature space (the $u$-features), this redundancy disappears: it is absorbed into the eigenvalues $\sigma_i$.

## Regularizing effective dimensionality of feature space

If the effective rank of the NTK (which is also the effective rank of the Fisher information matrix) is a good measure of effective model complexity, it might make sense to minimise the $l_1$ norm of the vector of eigenvalues, to encourage sparsity. Since the eigenvalues are non-negative, their $l_1$ norm is simply the trace of the Fisher information matrix, which we can write as an expected squared gradient norm:

$$ \operatorname{tr} F(\theta) = \operatorname{tr} \mathbb{E}_x \mathbb{E}_{y\sim p(y\vert x;\theta)} \left[ \nabla^T_\theta \log p(y\vert x;\theta)\, \nabla_\theta \log p(y\vert x;\theta) \right] = \mathbb{E}_x \mathbb{E}_{y\sim p(y\vert x;\theta)} \left[ \left\lVert \nabla_\theta \log p(y\vert x;\theta) \right\rVert^2 \right] $$
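A quick numerical sanity check on this (again a sketch: it reuses a random Jacobian as a stand-in for the per-example gradients, i.e. the empirical, squared-loss flavour of the Fisher rather than the exact one, which would use $\nabla_\theta \log p(y\vert x;\theta)$ with $y$ sampled from the model): the $l_1$ norm of the eigenvalues is just the sum of squared per-example gradient norms, which can be estimated on a minibatch without any eigendecomposition and could, in principle, be added to the loss as a penalty.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 20, 100
J = rng.normal(size=(N, P))            # rows stand in for per-example gradients

sigma = np.linalg.svd(J, compute_uv=False)

# The eigenvalues of the empirical Fisher / NTK are the sigma_i^2, all non-negative,
# so their l1 norm equals the trace: the sum of squared per-example gradient norms.
l1_of_eigenvalues = np.sum(sigma**2)
trace_estimate = np.sum(np.linalg.norm(J, axis=1)**2)   # sum_n ||grad_theta f(x_n)||^2
assert np.allclose(l1_of_eigenvalues, trace_estimate)
```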