For a neural network $f(x; \theta)$, the gradient admits the following decomposition (in this notation, I will assume $\nabla_\theta f(x)$ is a row vector and $\theta$ is a column vector):
$$
\nabla_\theta f(x) = \sum_i u_i(x) \sigma_i v_i
$$
where the $\sigma^2_i$ are proportional to the eigenvalues of the Fisher information matrix and the $v_i$ are its eigenvectors.
$\sigma_i$, $v_i$ and $u_i$ all depend on the parameter $\theta$.
$u_i$ are "eigenfunctions" of the neural tangent kernel, and $\sigma^2_i$ are its eigenvalues, so that the following also holds:
$$
k(x, x') = \sum_i \sigma^2_i u_i(x) u_i(x')
$$
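To make this concrete, here is a minimal sketch in JAX (a toy MLP and random batch of my own choosing, not anything from the literature) of the finite-sample version of these quantities: the Jacobian over a batch is decomposed with an SVD, the rows of $V$ give the $v_i$, the columns of $U$ give the $u_i$ evaluated on the batch, and the empirical NTK Gram matrix indeed equals $\sum_i \sigma^2_i u_i u_i^T$.
```python
import jax
import jax.numpy as jnp

# Toy single-output MLP, parametrized by a flat parameter vector theta.
D_IN, D_HID = 3, 8
P = D_IN * D_HID + D_HID                      # total number of parameters

def f(theta, x):
    W = theta[:D_IN * D_HID].reshape(D_IN, D_HID)
    a = theta[D_IN * D_HID:]
    return jnp.tanh(x @ W) @ a                # scalar output f(x; theta)

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
theta0 = jax.random.normal(k1, (P,)) / jnp.sqrt(P)
X = jax.random.normal(k2, (20, D_IN))         # a batch of 20 inputs

# Jacobian over the batch: J[n, :] = grad_theta f(x_n), an N x P matrix.
J = jax.vmap(jax.grad(f), in_axes=(None, 0))(theta0, X)

# SVD J = U diag(sigma) V: row i of V is v_i, column i of U is u_i evaluated on the batch.
U, sigma, V = jnp.linalg.svd(J, full_matrices=False)

# Empirical NTK Gram matrix K[n, m] = <grad f(x_n), grad f(x_m)> = sum_i sigma_i^2 u_i(x_n) u_i(x_m).
K = J @ J.T
print(jnp.allclose(K, (U * sigma**2) @ U.T, atol=1e-5))   # True
```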
Let's linearize the network around $\theta_0$ and consider the following reparametrization:
\begin{align}
f(x; \theta) &\approx f(x; \theta_0) + \nabla_\theta f(x) (\theta - \theta_0) \\
&= f(x; \theta_0) + \sum_i u_i(x) \underbrace{\sigma_i v_i (\theta - \theta_0)}_{w_i}\\
&= f(x; \theta_0) + \sum_i w_i u_i(x)
\end{align}
We call this reparametrized linear model the $u$-features linear model.
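Continuing the toy sketch above, the reparametrization can be written out directly: for a parameter vector near $\theta_0$, the $u$-features model with $w = S V (\theta - \theta_0)$ reproduces the first-order Taylor expansion.
```python
# Continuing the sketch: the u-features reparametrization of the linearized model.
theta = theta0 + 0.01 * jax.random.normal(jax.random.PRNGKey(1), (P,))   # nearby parameters

w = (sigma[:, None] * V) @ (theta - theta0)        # w_i = sigma_i v_i (theta - theta0)

f0 = jax.vmap(f, in_axes=(None, 0))(theta0, X)     # f(x; theta0) on the batch
f_u_features = f0 + U @ w                          # f(x; theta0) + sum_i w_i u_i(x)
f_taylor = f0 + J @ (theta - theta0)               # first-order Taylor expansion
print(jnp.allclose(f_u_features, f_taylor, atol=1e-5))   # True: the same linear model
```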
Why is this interesting? Let's consider what happens to $w$ during one gradient step in $\theta$. First off, we have that:
$$
w^{t+1}_i = w^t_i - \eta \sigma_i v_i \nabla^T_\theta\mathcal{L}
$$
Or, writing this in terms of the column vector $w$ we have:
$$
w^{t+1} = w^t - \eta S V \nabla^T_\theta\mathcal{L},
$$
where $S$ is the diagonal matrix with the $\sigma_i$ on its diagonal and $V$ is the matrix whose rows are the $v_i$.
Secondly, we can see that
$$
\nabla^T_\theta \mathcal{L}(\theta) = \left(\frac{\partial w}{\partial \theta}\right)^T \nabla^T_w \mathcal{L}(w) = V^T S \nabla^T_w\mathcal{L}(w)
$$
Putting these together we have:
\begin{align}
w^{t+1} &= w^t - \eta S V V^T S \nabla^T_w\mathcal{L}(w) \\
&= w^t - \eta S^2 \nabla^T_w\mathcal{L}(w)
\end{align}
In the second step we used $V V^T = I$, which holds because the rows of $V$ are orthonormal. Thus, a single gradient step in $\theta$ can be interpreted as a gradient step in the $u$-features model, except with different learning rates for different $u$-features: the learning rate for the coefficient of $u_i$ is modulated by $\sigma^2_i$.
Therefore, we can say that:
* $u$-features with high corresponding eigenvalue describe directions in function space along which $f$ can change fast.
* $u$-features with low corresponding eigenvalue describe directions in function space along which $f$ can change only very slowly.
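The derivation above can be checked numerically in the toy sketch, using a squared-error loss I picked for illustration. Starting at $\theta_0$ (where $w = 0$ and the linearization is exact), one gradient step in $\theta$ moves $w$ exactly as a gradient step in the $u$-features model scaled per-coordinate by $\sigma^2_i$:
```python
# Numerical check: one gradient step in theta vs. a sigma^2-scaled step in w,
# starting from theta0, where w = 0 and the linearization is exact.
y = jax.random.normal(jax.random.PRNGKey(2), (X.shape[0],))   # arbitrary regression targets
eta = 0.1

def loss_theta(theta):
    preds = jax.vmap(f, in_axes=(None, 0))(theta, X)
    return 0.5 * jnp.sum((preds - y) ** 2)

theta1 = theta0 - eta * jax.grad(loss_theta)(theta0)          # one gradient step in theta
w_from_theta_step = (sigma[:, None] * V) @ (theta1 - theta0)  # resulting u-feature coefficients

# Gradient of the u-features model's loss at w = 0: grad_w L = U^T (f(x; theta0) - y).
grad_w = U.T @ (f0 - y)
w_from_w_step = -eta * sigma**2 * grad_w                      # step scaled by sigma_i^2

print(jnp.allclose(w_from_theta_step, w_from_w_step, atol=1e-5))   # True
```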
Now, if we assume that there is a value $R$ such that $\sigma_j \ll \sigma_i$ for all $i\leq R < j$, that is, the NTK has effective rank $R$, then the function changes appreciably only along the directions $u_i, i\leq R$, and the model locally behaves like a linear model of effective dimensionality $R$.
Thus, the effective rank of the NTK can be interpreted as an effective parameter count of some sort.
I hypothesize that a low effective rank characterises good generalisation properties: that in the chaotic phase of learning the NTK gradually becomes low-rank, finding the right feature subspace to learn in, and that this allows good generalisation in the subsequent phase, in which the model's effective dimensionality is limited.
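Continuing the toy sketch, one crude way to track this hypothesis empirically (my own choice of estimator, with an arbitrary threshold) is to count how many eigenvalues of the empirical NTK Gram matrix exceed a small fraction of the largest one:
```python
# Continuing the sketch: a crude effective-rank estimate for the empirical NTK.
def effective_rank(theta, X, rel_threshold=1e-3):
    """Number of NTK eigenvalues above rel_threshold times the largest eigenvalue.

    The threshold is an arbitrary choice; this is only a rough proxy for R.
    """
    J = jax.vmap(jax.grad(f), in_axes=(None, 0))(theta, X)
    eigvals = jnp.linalg.eigvalsh(J @ J.T)     # the sigma_i^2, in ascending order
    return jnp.sum(eigvals > rel_threshold * eigvals[-1])

print(effective_rank(theta0, X))               # could be tracked over the course of training
```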
## Redundant features in the linearized model
Another way of looking at this is to say that the linearized model $f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x) (\theta - \theta_0)$ will have redundant features.
This linear model has as many nonlinear features as there are parameters, since $\nabla_\theta f(x)$ has $P$ dimensions, where $P$ is the number of parameters. However, some of these features are very similar to one another, or overlap a lot.
Imagine a generalised linear model with two nonlinear features, $a$ and $b$, in which the feature $b$ appears $k$ times with $k$ independent coefficients:
$$
f(x) = w_0 a(x) + \sum_{i=1}^{k} w_i b(x)
$$
In such a model, the higher $k$ is, the more the model's gradient updates will "make use of" the feature $b$. This model is equivalent to a two-parameter model with only one copy of $a$ and $b$, but where the effective learning rate of the coefficient in front of $b$ is cranked up by a factor of $k$. Thus, the coefficient for $b$ is going to move a lot faster than the coefficient for $a$. If $a$ and $b$ are equally useful for solving the task, this will lead the model to converge to a solution expressed mostly in terms of $b$.
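A small simulation of this toy model (features, targets and $k$ chosen arbitrarily for illustration) confirms the equivalence: gradient descent on the redundant model tracks gradient descent on the two-parameter model in which $b$'s learning rate is multiplied by $k$.
```python
import jax
import jax.numpy as jnp

# Redundant-feature toy model: f(x) = w_0 a(x) + sum_{i=1..k} w_i b(x).
# Features, targets and k are arbitrary illustrative choices.
k = 5
a = lambda x: jnp.sin(x)
b = lambda x: jnp.cos(x)
xs = jnp.linspace(-2.0, 2.0, 50)
targets = 0.7 * a(xs) + 0.7 * b(xs)

def loss_redundant(w):                         # w has 1 + k entries
    preds = w[0] * a(xs) + jnp.sum(w[1:]) * b(xs)
    return 0.5 * jnp.mean((preds - targets) ** 2)

def loss_small(v):                             # v = (v_a, v_b): one copy of each feature
    preds = v[0] * a(xs) + v[1] * b(xs)
    return 0.5 * jnp.mean((preds - targets) ** 2)

eta = 0.1
w = jnp.zeros(k + 1)
v = jnp.zeros(2)
for _ in range(200):
    w = w - eta * jax.grad(loss_redundant)(w)
    v = v - eta * jnp.array([1.0, k]) * jax.grad(loss_small)(v)   # b's learning rate scaled by k

# The redundant model and the rescaled two-parameter model stay in lockstep.
print(jnp.allclose(jnp.array([w[0], jnp.sum(w[1:])]), v, atol=1e-4))   # True
```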
So my interpretation is: in the linearised model, the features are not orthogonal, and some of them are represented redundantly. By reparametrising the model to an orthogonal feature space (the $u$-features), this redundancy disappears: it is absorbed into the eigenvalues $\sigma_i$.
## Regularizing effective dimensionality of feature space
If the effective rank of the NTK (which is also the effective rank of the Fisher information matrix) is a good measure of effective model complexity, it might make sense to minimise the $\ell_1$ norm of the vector of eigenvalues, to encourage sparsity. Since the eigenvalues are non-negative, this $\ell_1$ norm is simply the trace of the Fisher information matrix, which can be written as the expected squared norm of the score:
$$
\operatorname{tr} F(\theta) = \operatorname{tr} \mathbb{E}_{x, y} \left[ \nabla^T_\theta \log p(y \vert x; \theta)\, \nabla_\theta \log p(y \vert x; \theta)\right] = \mathbb{E}_{x, y} \left[\left\| \nabla_\theta \log p(y \vert x; \theta)\right\|^2\right]
$$
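As a sketch of how this trace could be estimated in practice (a toy softmax classifier of my own choosing; how best to use such an estimate as a differentiable regularizer is left open), one can average the squared norms of per-example score gradients, with labels sampled from the model's own predictive distribution:
```python
import jax
import jax.numpy as jnp

# Monte Carlo estimate of tr F(theta) = E_{x, y ~ p(y|x;theta)} ||grad_theta log p(y|x;theta)||^2,
# using a toy linear softmax classifier (an arbitrary choice for illustration).
D, C = 4, 3                                    # input dimension, number of classes
P = D * C

def log_prob(theta, x, y):
    logits = x @ theta.reshape(D, C)
    return jax.nn.log_softmax(logits)[y]

def trace_fisher(theta, X, key):
    # Labels are sampled from the model's own predictive distribution (the "true" Fisher).
    logits = X @ theta.reshape(D, C)
    ys = jax.random.categorical(key, logits, axis=-1)
    score = jax.vmap(jax.grad(log_prob), in_axes=(None, 0, 0))(theta, X, ys)   # per-example scores
    return jnp.mean(jnp.sum(score ** 2, axis=-1))          # average squared score norm

key = jax.random.PRNGKey(0)
k_theta, k_x, k_y = jax.random.split(key, 3)
theta = 0.1 * jax.random.normal(k_theta, (P,))
X = jax.random.normal(k_x, (128, D))
print(trace_fisher(theta, X, k_y))
```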