Reading Group – Wednesday April 5th 2023
By Ross Viljoen and Vincent Dutordoir
Learning the principal eigenfunctions of an operator is a fundamental problem in various machine learning tasks, from representation learning to Gaussian processes. However, traditional non-parametric solutions suffer from scalability issues, rendering them impractical on large datasets.
This reading group will discuss parametric approaches to approximating eigendecompositions using neural networks. In particular, Spectral Inference Networks (SpIN) offer a scalable method for approximating eigenfunctions of symmetric operators on high-dimensional function spaces using bi-level optimization methods and gradient masking (Pfau et al., 2019).
A recent improvement on SpIN, called NeuralEF, focuses on approximating eigenfunction expansions of kernels (Deng et al., 2022a). The method is applied to modern neural-network-based kernels (NN-GP and NTK) as well as to scaling up the linearised Laplace approximation for deep networks (Deng et al., 2022a). Finally, self-supervised learning can be expressed in terms of approximating a contrastive kernel, which allows NeuralEF to learn structured representations (Deng et al., 2022b).
David Pfau, Stig Petersen, Ashish Agarwal, David G. T. Barrett, and Kimberly L. Stachenfeld. "Spectral inference networks: Unifying deep and spectral learning." ICLR (2019).
Zhijie Deng, Jiaxin Shi, and Jun Zhu. "NeuralEF: Deconstructing kernels by deep neural networks." ICML (2022a).
Zhijie Deng, Jiaxin Shi, Hao Zhang, Peng Cui, Cewu Lu, and Jun Zhu. "Neural Eigenfunctions Are Structured Representation Learners." arXiv preprint arXiv:2210.12637 (2022b).
Spectral methods are ubiquitous:
Machine Learning | Physics/Engineering |
---|---|
*(image not available)* | *(image not available)* |
Find a non-linear embedding of data that lies on a low-dimensional manifold.
A well-known algorithm is Principal Component Analysis (PCA).
Such non-parametric spectral methods scale cubically in compute and quadratically in memory with the number of datapoints $N$: they build and eigendecompose an $N \times N$ matrix.
This reading group will cover three papers in chronological order that build on each other.
Eigenvectors of a matrix $A \in \mathbb{R}^{n \times n}$ are defined as the vectors $\mathbf{u}_i$ satisfying
$$A \mathbf{u}_i = \lambda_i \mathbf{u}_i,$$
where the scalars $\lambda_i$ are the corresponding eigenvalues.
Let $A$ be symmetric. Then the unit-norm vector maximising the quadratic form $\mathbf{u}^\top A \mathbf{u}$ is the eigenvector with the largest eigenvalue:
$$\mathbf{u}_1 = \arg\max_{\|\mathbf{u}\|=1} \; \mathbf{u}^\top A \mathbf{u}.$$
Proof (sketch)
Rewriting the objective as a constrained problem:
$$\max_{\mathbf{u}} \; \mathbf{u}^\top A \mathbf{u} \quad \text{s.t.} \quad \mathbf{u}^\top \mathbf{u} = 1.$$
Now adding a Lagrange multiplier $\lambda$:
$$\mathcal{L}(\mathbf{u}, \lambda) = \mathbf{u}^\top A \mathbf{u} - \lambda\,(\mathbf{u}^\top \mathbf{u} - 1),$$
which can be solved by setting the derivative w.r.t. $\mathbf{u}$ to zero:
$$2A\mathbf{u} - 2\lambda\mathbf{u} = 0 \;\Longleftrightarrow\; A\mathbf{u} = \lambda\mathbf{u}.$$
Every stationary point is therefore an eigenvector, the objective value at an eigenvector equals its eigenvalue, and so the maximum is attained by the eigenvector with the largest eigenvalue. $\square$
The constrained optimisation objective from above is equivalent (up to a scaling factor) to the Rayleigh quotient
$$\max_{\mathbf{u}} \; \frac{\mathbf{u}^\top A \mathbf{u}}{\mathbf{u}^\top \mathbf{u}},$$
because we normalize the vector to lie on the unit hypersphere: $\mathbf{u} \mapsto \mathbf{u} / \|\mathbf{u}\|$.
To compute the top $N$ eigenvectors $U = [\mathbf{u}_1, \ldots, \mathbf{u}_N] \in \mathbb{R}^{n \times N}$ we can maximise $\operatorname{Tr}(U^\top A U)$ subject to the orthogonality constraint $U^\top U = I$.
If we drop the orthogonality constraint we can formulate this as a single unconstrained objective
$$\max_{U} \; \operatorname{Tr}\!\big[(U^\top U)^{-1}\, U^\top A U\big].$$
Following a similar derivation as above, the maximum is given by $\sum_{i=1}^{N} \lambda_i$, attained when the columns of $U$ span the subspace of the top $N$ eigenvectors.
Remark. The objective is invariant to right multiplications (i.e. linear combinations of the basis).
Proof. Set $V = UB$ for any invertible $B \in \mathbb{R}^{N \times N}$. Then
$$\operatorname{Tr}\!\big[(V^\top V)^{-1} V^\top A V\big] = \operatorname{Tr}\!\big[B^{-1}(U^\top U)^{-1} U^\top A U\, B\big] = \operatorname{Tr}\!\big[(U^\top U)^{-1} U^\top A U\big],$$
using $(V^\top V)^{-1} = B^{-1}(U^\top U)^{-1}B^{-\top}$ and the cyclic property of the trace. $\square$
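As a quick numerical sanity check, the following sketch (our own illustration; the variable names are not from the papers) verifies that the unconstrained trace objective attains the sum of the top $N$ eigenvalues and is unchanged by right multiplication with an invertible matrix:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
n, N = 6, 3
S = jax.random.normal(key, (n, n))
A = S @ S.T                                  # random symmetric matrix

def objective(U):
    # Tr[(U^T U)^{-1} U^T A U]
    return jnp.trace(jnp.linalg.solve(U.T @ U, U.T @ A @ U))

lam, V = jnp.linalg.eigh(A)                  # eigenvalues in ascending order
U_top = V[:, -N:]                            # top-N eigenvectors
B = jax.random.normal(jax.random.PRNGKey(1), (N, N))  # arbitrary (a.s. invertible) mixing

print(objective(U_top), jnp.sum(lam[-N:]))   # maximum equals the sum of the top eigenvalues
print(objective(U_top @ B))                  # invariant under right multiplication
```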
The above derivation can be used to optimize for the eigenvectors in (kernel) PCA, where $A$ is the kernel (Gram) matrix $K$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ on the training data.
An embedding for a new datapoint can be found by interpolating the eigenfunctions:
$$u_j(\mathbf{x}^*) = \frac{1}{\lambda_j} \sum_{i=1}^{N} k(\mathbf{x}^*, \mathbf{x}_i)\, u_{ji},$$
which scales as $\mathcal{O}(N)$ per new datapoint and requires keeping the entire training set around.
Spectral Inference Networks (SpIN) overcomes these limitations by using a parametric form for the embeddings.
Suppose that instead of a matrix $A \in \mathbb{R}^{n \times n}$ we have a symmetric kernel $k(\mathbf{x}, \mathbf{x}')$ defined on $\mathcal{X} \times \mathcal{X}$, together with a distribution $p(\mathbf{x})$ over inputs.
Define the inner product on functions as
$$\langle f, g \rangle := \int f(\mathbf{x})\, g(\mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x} = \mathbb{E}_{\mathbf{x}}\big[f(\mathbf{x})\, g(\mathbf{x})\big],$$
and a kernel operator
$$(\mathcal{K}f)(\mathbf{x}) := \int k(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')\, p(\mathbf{x}')\, \mathrm{d}\mathbf{x}' = \mathbb{E}_{\mathbf{x}'}\big[k(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')\big].$$
We are interested in finding the top $N$ eigenfunctions of this operator, i.e. functions $\psi_j$ satisfying $\mathcal{K}\psi_j = \mu_j \psi_j$ with the largest eigenvalues $\mu_1 \ge \mu_2 \ge \ldots \ge \mu_N$.
In vector space we could cast the problem of spectral decomposition to an optimization problem using the following objective:
$$\max_{U} \; \operatorname{Tr}\!\big[(U^\top U)^{-1}\, U^\top A U\big].$$
The analogue of this in function space looks like this:
$$\max_{\psi} \; \operatorname{Tr}\!\big[\langle \psi, \psi \rangle^{-1} \langle \psi, \mathcal{K}\psi \rangle\big].$$
Let $\psi(\cdot) = (\psi_1(\cdot), \ldots, \psi_N(\cdot))^\top$ denote the vector of functions, so that $\langle \psi, \psi\rangle$ and $\langle \psi, \mathcal{K}\psi\rangle$ are $N \times N$ matrices of pairwise inner products.
To simplify notation, define
$$\Sigma := \mathbb{E}_{\mathbf{x}}\big[\psi(\mathbf{x})\,\psi(\mathbf{x})^\top\big], \qquad \Pi := \mathbb{E}_{\mathbf{x},\mathbf{x}'}\big[k(\mathbf{x},\mathbf{x}')\,\psi(\mathbf{x})\,\psi(\mathbf{x}')^\top\big],$$
which gives, parameterizing the eigenfunctions using a neural net $\psi_\theta$ with parameters $\theta$,
$$\max_{\theta} \; \operatorname{Tr}\!\big(\Sigma_\theta^{-1}\,\Pi_\theta\big).$$
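To make the notation concrete, here is a minimal sketch (our own illustration, not the SpIN implementation) of how $\Sigma$, $\Pi$ and the objective $\operatorname{Tr}(\Sigma^{-1}\Pi)$ could be estimated from a single batch of network outputs and the corresponding kernel matrix. As discussed next, plugging such single-batch estimates directly into SGD gives biased gradients.

```python
import jax.numpy as jnp

def spin_objective(Psi, K):
    """Single-batch estimate of Tr(Sigma^{-1} Pi) (illustrative sketch).

    Psi: [B, N] network outputs psi_theta(x_b), one row per datapoint.
    K:   [B, B] kernel matrix k(x_b, x_b').
    """
    B = Psi.shape[0]
    Sigma = Psi.T @ Psi / B          # estimate of E[psi(x) psi(x)^T]
    Pi = Psi.T @ K @ Psi / B**2      # estimate of E[k(x, x') psi(x) psi(x')^T]
    return jnp.trace(jnp.linalg.solve(Sigma, Pi))
```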
Problems
The objective is invariant to linear transformations of the output, so the solution only spans the top $N$ eigenfunctions rather than recovering them individually…
We can use the invariance of the trace under cyclic permutations to rewrite the objective.
Let $\Sigma = LL^\top$ be the Cholesky decomposition of $\Sigma$. Then
$$\operatorname{Tr}\big(\Sigma^{-1}\Pi\big) = \operatorname{Tr}\big(L^{-1}\,\Pi\, L^{-\top}\big) = \sum_{i=1}^{N} \Lambda_{ii},$$
where we used the shorthand $\Lambda := L^{-1}\Pi L^{-\top}$.
By zeroing out the gradients flowing from higher to lower eigenfunctions we can impose an ordering. This is summarized in the following claim (in the appendix):
Forward | Backward |
---|---|
*(image not available)* | *(image not available)* |
The 'masked' gradient can be written down in closed form; see Pfau et al. (2019) for the exact expression.
In SGD we use stochastic gradient estimates computed from minibatches.
The objective is a nonlinear function of multiple expectations, so naively replacing $\Sigma$ and $\Pi$ by their minibatch estimates yields biased gradient estimates.
SpIN therefore reframes the objective as a bi-level optimization, in which moving-average estimates of $\Sigma$ (and of its Jacobian) are maintained on a separate timescale from the usual parameter updates.
Bilevel stochastic optimization is the problem of simultaneously solving two coupled minimization problems.
Traditional optimization solves a single problem, $\min_{x} f(x)$.
Formally, a bilevel optimization problem reads:
$$\min_{x} \; f\big(x, y^*(x)\big) \quad \text{s.t.} \quad y^*(x) \in \arg\min_{y} g(x, y),$$
where $f$ is the outer (upper-level) objective and $g$ is the inner (lower-level) objective.
In bi-level optimization the objective depends on the 'decision' of another actor, whose decision in turn depends on our own action…
Example: Toll Setting
Consider a network of highways that is operated by the government. The government wants to maximize its revenues by choosing the optimal toll setting for the highways. However, the government can maximize its revenues only by taking the highway users' problem into account. For any given tax structure the highway users solve their own optimization problem, where they minimize their traveling costs by deciding between utilizing the highways or an alternative route.
Bilevel stochastic problems are common in machine learning and include actor-critic methods, generative adversarial networks and imitation learning. SpIN relies on a classic result from 1997 (Vivek S. Borkar, "Stochastic approximation with two time scales"). It boils down to having a second objective which is solved using a different (slower) learning rate,
where the separation of timescales guarantees that, in the limit, each set of updates effectively sees the other at its equilibrium; in SpIN, the second objective is the tracking of the expectations ($\Sigma$ and its Jacobian) with moving averages.
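The sketch below is a toy illustration of coupled updates on two different timescales (it is not SpIN's actual update rule): an expectation that the objective depends on nonlinearly is tracked with an exponential moving average using a larger step size, while the other variable takes smaller gradient steps that reuse the tracked value.

```python
import jax
import jax.numpy as jnp

# Toy two-timescale SGD: minimise F(x) = log E_xi[(x - xi)^2 + 1] with xi ~ N(0, 1).
# A naive minibatch estimate of grad F is biased because F is a nonlinear function
# of an expectation; instead, track the inner expectation with its own update rule.
key = jax.random.PRNGKey(0)
x = 3.0                      # analogue of the network parameters
m_bar = 1.0                  # running estimate of E[(x - xi)^2 + 1]
eta, beta = 1e-2, 1e-1       # two different step sizes

for step in range(2000):
    key, sub = jax.random.split(key)
    xi = jax.random.normal(sub, (32,))           # minibatch of noise samples
    m_hat = jnp.mean((x - xi) ** 2 + 1.0)        # minibatch estimate of the expectation
    g_hat = jnp.mean(2.0 * (x - xi))             # minibatch estimate of its gradient in x
    m_bar = (1.0 - beta) * m_bar + beta * m_hat  # track the expectation
    x = x - eta * g_hat / m_bar                  # chain rule through log(.), using m_bar

print(x, m_bar)   # x -> 0 and m_bar -> x^2 + 2 = 2 (approximately)
```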
Problems: the bi-level optimization with moving averages (including a moving average of the Jacobian of $\Sigma$) and the gradient masking make SpIN complicated and expensive to train.
Kernel methods implicitly project data into a high-dimensional feature space $\mathcal{H}$ via a feature map $\varphi: \mathcal{X} \to \mathcal{H}$, and only ever require inner products $k(\mathbf{x}, \mathbf{x}') = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x}') \rangle$.
Often we don't know (or cannot compute) the feature map explicitly; we can only evaluate the kernel.
There is a rich family of kernels.
Classic kernels: Squared Exponential, Matérn family, Linear, Polynomial, etc. These may easily fail when processing real-world data like images and text due to inadequate expressive power and the curse of dimensionality.
Modern kernels have been developed based on the inductive biases of NN architectures, such as the NTK and NN-GP (discussed briefly by Ross and the topic of next week's reading group).
An idea to scale up kernel methods is to approximate the kernel with the inner product of some explicit vector representations of the data,
$$k(\mathbf{x}, \mathbf{x}') \approx \phi(\mathbf{x})^\top \phi(\mathbf{x}'),$$
where $\phi: \mathcal{X} \to \mathbb{R}^{R}$ is a finite-dimensional feature map with $R \ll N$.
Bochner's theorem: a stationary kernel is positive semi-definite if and only if it is the Fourier transform of a non-negative measure.
A kernel $k(\mathbf{x} - \mathbf{x}')$ can therefore be written as
$$k(\mathbf{x} - \mathbf{x}') = \int e^{i\omega^\top(\mathbf{x} - \mathbf{x}')}\, p(\omega)\, \mathrm{d}\omega = \mathbb{E}_{p(\omega)}\big[\cos\big(\omega^\top(\mathbf{x} - \mathbf{x}')\big)\big],$$
which can be estimated with Monte Carlo samples of $\omega$: this gives random Fourier features (RFFs).
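For concreteness, a minimal random-Fourier-feature sketch for the squared-exponential kernel (function names and settings are ours, not from the papers):

```python
import jax
import jax.numpy as jnp

def rff_features(X, key, num_features=512, lengthscale=1.0):
    # Random Fourier features such that phi(x) @ phi(x') approximates
    # exp(-||x - x'||^2 / (2 * lengthscale^2)).
    d = X.shape[-1]
    k_w, k_b = jax.random.split(key)
    W = jax.random.normal(k_w, (d, num_features)) / lengthscale        # spectral samples
    b = jax.random.uniform(k_b, (num_features,), maxval=2.0 * jnp.pi)  # random phases
    return jnp.sqrt(2.0 / num_features) * jnp.cos(X @ W + b)

X = jax.random.normal(jax.random.PRNGKey(1), (5, 3))
Phi = rff_features(X, jax.random.PRNGKey(0))
K_approx = Phi @ Phi.T                                                 # approximate Gram matrix
K_exact = jnp.exp(-0.5 * jnp.sum((X[:, None] - X[None, :]) ** 2, -1))
print(jnp.max(jnp.abs(K_approx - K_exact)))   # error shrinks as num_features grows
```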
Eigenfunctions $\psi_j$ of the kernel operator satisfy
$$\int k(\mathbf{x}, \mathbf{x}')\,\psi_j(\mathbf{x}')\, p(\mathbf{x}')\, \mathrm{d}\mathbf{x}' = \mu_j\, \psi_j(\mathbf{x}), \qquad \int \psi_i(\mathbf{x})\,\psi_j(\mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x} = \delta_{ij}.$$
Using the eigenfunctions corresponding to non-zero eigenvalues, Mercer's theorem gives the expansion
$$k(\mathbf{x}, \mathbf{x}') = \sum_{j} \mu_j\, \psi_j(\mathbf{x})\, \psi_j(\mathbf{x}'),$$
so $\phi_j(\mathbf{x}) = \sqrt{\mu_j}\,\psi_j(\mathbf{x})$ is a valid (possibly truncated) feature map.
We can use the data $\{\mathbf{x}_i\}_{i=1}^N \sim p$ to approximate the integral by Monte Carlo,
$$\frac{1}{N}\sum_{i=1}^{N} k(\mathbf{x}, \mathbf{x}_i)\,\psi_j(\mathbf{x}_i) \approx \mu_j\, \psi_j(\mathbf{x}),$$
which, evaluated at the datapoints themselves, becomes the matrix eigenproblem $\tfrac{1}{N} K \mathbf{u}_j = \mu_j \mathbf{u}_j$ for the Gram matrix $K_{i\ell} = k(\mathbf{x}_i, \mathbf{x}_\ell)$.
Nyström methods approximate the eigenfunctions for out-of-sample points as
$$\psi_j(\mathbf{x}) \approx \frac{\sqrt{N}}{\lambda_j} \sum_{i=1}^{N} k(\mathbf{x}, \mathbf{x}_i)\, u_{ji}, \qquad K\mathbf{u}_j = \lambda_j \mathbf{u}_j, \;\; \|\mathbf{u}_j\| = 1,$$
which approximates the kernel via the truncated expansion $k(\mathbf{x}, \mathbf{x}') \approx \sum_{j=1}^{R} \mu_j\, \psi_j(\mathbf{x})\, \psi_j(\mathbf{x}')$ with $\mu_j \approx \lambda_j / N$.
Unlike RFFs, the Nyström method can be applied to approximate any positive-definite kernel. Yet, it is non-trivial to scale it up.
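A compact sketch of the Nyström recipe above (function names are ours; illustrative only):

```python
import jax
import jax.numpy as jnp

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between two sets of points.
    sq = jnp.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * sq / lengthscale**2)

def nystrom(X, X_new, num_eigs=10):
    # Approximate the top eigenfunctions of the kernel operator from data X
    # and evaluate them at out-of-sample points X_new.
    N = X.shape[0]
    lam, U = jnp.linalg.eigh(rbf_kernel(X, X))          # ascending eigenvalues
    lam, U = lam[::-1][:num_eigs], U[:, ::-1][:, :num_eigs]
    mu = lam / N                                        # operator eigenvalue estimates
    # psi_j(x) ~= sqrt(N) / lam_j * sum_i k(x, x_i) u_{ji}
    Psi_new = rbf_kernel(X_new, X) @ U * (jnp.sqrt(N) / lam)
    return mu, Psi_new

X = jax.random.normal(jax.random.PRNGKey(0), (200, 2))
X_new = jax.random.normal(jax.random.PRNGKey(1), (5, 2))
mu, Psi = nystrom(X, X_new)
K_trunc = Psi @ jnp.diag(mu) @ Psi.T                    # truncated Mercer expansion
print(jnp.max(jnp.abs(K_trunc - rbf_kernel(X_new, X_new))))  # truncation error
```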
Back to the SpIN objective: $\max_{\theta} \operatorname{Tr}\big(\Sigma_\theta^{-1}\Pi_\theta\big)$.
Can we instead enforce an ordering directly in the optimisation objective such that the gradient masking becomes unnecessary?
Characterise the spectral decomposition as an asymmetric maximisation problem: for $j = 1, \ldots, N$,
$$\max_{\hat\psi_j} \; R_{jj} \quad \text{s.t.} \quad R_{ij} = 0 \;\; \text{for } i < j, \qquad \langle \hat\psi_j, \hat\psi_j \rangle = 1,$$
where:
$$R_{ij} := \langle \hat\psi_i, \mathcal{K}\hat\psi_j \rangle = \mathbb{E}_{\mathbf{x}, \mathbf{x}'}\big[\hat\psi_i(\mathbf{x})\, k(\mathbf{x}, \mathbf{x}')\, \hat\psi_j(\mathbf{x}')\big].$$
The claim is that the pairs $(R_{jj}, \hat\psi_j)$ at the optimum are exactly the eigenpairs $(\mu_j, \psi_j)$ of the kernel operator, ordered by decreasing eigenvalue.
Proof
For $j = 1$ there is no orthogonality constraint. Expanding $\hat\psi_1$ in the basis of true eigenfunctions, $\hat\psi_1 = \sum_k w_k \psi_k$, gives
$$R_{11} = \Big\langle \textstyle\sum_k w_k \psi_k,\; \mathcal{K}\textstyle\sum_k w_k \psi_k \Big\rangle = \sum_k w_k^2\, \mu_k,$$
but, since the eigenvalues are sorted in decreasing order, every term satisfies $w_k^2 \mu_k \le w_k^2 \mu_1$.
Recalling the constraint:
$$\langle \hat\psi_1, \hat\psi_1 \rangle = \sum_k w_k^2 = 1,$$
the objective must be maximised by taking $w_1 = 1$ and $w_k = 0$ for $k > 1$, so that $\hat\psi_1 = \psi_1$ and $R_{11} = \mu_1$.
For $j > 1$, the constraints $R_{ij} = 0$ for $i < j$ force $\hat\psi_j$ to have no component along the previously recovered eigenfunctions $\psi_1, \ldots, \psi_{j-1}$, and the same argument applies within the remaining subspace.
So, for every $j$, the maximiser is $\hat\psi_j = \psi_j$ with optimal value $R_{jj} = \mu_j$. $\square$
In practice, turn the orthogonality constraints into penalties:
$$\max_{\hat\psi_j} \; R_{jj} - \sum_{i < j} \frac{R_{ij}^2}{R_{ii}}, \qquad j = 1, \ldots, N,$$
where the scaling factor of $1/R_{ii}$ keeps each penalty term on the same scale as the reward $R_{jj}$, regardless of the magnitudes of the leading eigenvalues.
This objective is derived from a recent reinterpretation of PCA as a multi-player game in Gemp et al. (2020), "EigenGame: PCA as a Nash equilibrium." It can be seen as a function-space version of the same approach.
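In the matrix case, the penalised objective can be checked directly. The following toy sketch (our own code, in the spirit of EigenGame/NeuralEF rather than either paper's implementation) runs projected gradient ascent with unit-norm columns and approximately recovers the top eigenvalues of a random symmetric matrix, in order:

```python
import jax
import jax.numpy as jnp

n, J = 8, 3
S = jax.random.normal(jax.random.PRNGKey(0), (n, n))
A = S @ S.T                                     # random symmetric PSD "kernel matrix"

def utility(U):
    # U: [n, J] with unit-norm columns; R[i, j] = u_i^T A u_j.
    U_sg = jax.lax.stop_gradient(U)
    R = U.T @ A @ U
    R_mix = U_sg.T @ A @ U                      # gradient only flows through the j-th column
    R_sg = jnp.diag(U_sg.T @ A @ U_sg)
    mask = jnp.triu(jnp.ones((J, J)), k=1)      # mask[i, j] = 1 for i < j
    penalties = mask * R_mix**2 / R_sg[:, None]
    return jnp.sum(jnp.diag(R)) - jnp.sum(penalties)

U = jax.random.normal(jax.random.PRNGKey(1), (n, J))
U = U / jnp.linalg.norm(U, axis=0)
for _ in range(5000):
    U = U + 1e-3 * jax.grad(utility)(U)         # ascent step on the penalised objective
    U = U / jnp.linalg.norm(U, axis=0)          # project columns back onto the unit sphere

print(jnp.diag(U.T @ A @ U))                    # approximately the top-3 eigenvalues, in order
print(jnp.linalg.eigvalsh(A)[::-1][:J])
```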
At each optimisation step, the expectations $R_{ij}$ are estimated on a minibatch $\mathbf{X}_B = \{\mathbf{x}_b\}_{b=1}^{B}$:
$$\tilde{R}_{ij} = \frac{1}{B^2}\, \hat\psi_i(\mathbf{X}_B)^\top\, K_{\mathbf{X}_B}\, \hat\psi_j(\mathbf{X}_B),$$
where $K_{\mathbf{X}_B} \in \mathbb{R}^{B \times B}$ is the kernel matrix evaluated on the minibatch and $\hat\psi_j(\mathbf{X}_B) \in \mathbb{R}^{B}$ is the vector of function values on the minibatch.
Tracking the diagonal terms $\tilde{R}_{jj}$ (e.g. with an exponential moving average) during training yields estimates of the eigenvalues $\mu_j$.
Note: minibatching will not give an unbiased estimate of the gradients of the objective. There is no discussion of this in the paper, but it appears that, for large enough minibatch sizes, this is not a problem in practice.
Of course, there is still another constraint to satisfy, normalisation:
This is enforced by adding an L2 batch normalisation layer to the end of each neural net:
$$\hat\psi_j(\mathbf{x}) = \frac{f_{\theta_j}(\mathbf{x})}{\sqrt{\tfrac{1}{B}\sum_{b=1}^{B} f_{\theta_j}(\mathbf{x}_b)^2}},$$
where the normaliser is computed from the current minibatch during training (and from a running average at test time), so that $\tfrac{1}{B}\sum_b \hat\psi_j(\mathbf{x}_b)^2 = 1$.
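In code, such a layer might look like the following (training-mode statistics only; the running averages used at evaluation time are omitted, and the function name is ours):

```python
import jax.numpy as jnp

def l2_batch_norm(F, eps=1e-6):
    # F: [B, N] raw network outputs. Rescale each output dimension so that
    # its mean square over the batch is 1, i.e. (1/B) sum_b psi_j(x_b)^2 = 1.
    return F / jnp.sqrt(jnp.mean(F**2, axis=0, keepdims=True) + eps)
```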
After all the setup above, the training loss on a minibatch becomes:
$$\ell(\theta) = -\sum_{j=1}^{N}\left[\tilde{R}_{jj} - \sum_{i < j}\frac{\tilde{R}_{ij}^2}{\tilde{R}_{ii}}\right],$$
where no gradient is propagated through the $\hat\psi_i$ (and hence $\tilde{R}_{ii}$) appearing inside the penalty terms for $i < j$. The gradient with respect to the function values on the minibatch is
$$\nabla_{\hat\psi_j(\mathbf{X}_B)}\,\ell \;\propto\; -K_{\mathbf{X}_B}\left[\hat\psi_j(\mathbf{X}_B) - \sum_{i < j}\frac{\tilde{R}_{ij}}{\tilde{R}_{ii}}\,\hat\psi_i(\mathbf{X}_B)\right].$$
Notably, the term in the gradient within the brackets is exactly the (generalised) Gram-Schmidt orthogonalisation of $\hat\psi_j$ against the preceding $\hat\psi_1, \ldots, \hat\psi_{j-1}$.
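Putting the pieces together, a NeuralEF-style minibatch loss could be written as the following sketch (our own naming, with stop-gradients implementing the asymmetry described above; not the authors' code):

```python
import jax
import jax.numpy as jnp

def neuralef_style_loss(Psi, K):
    # Psi: [B, J] batch-normalised function values hat-psi_j(x_b).
    # K:   [B, B] kernel matrix on the minibatch.
    B, J = Psi.shape
    Psi_sg = jax.lax.stop_gradient(Psi)
    R = Psi.T @ K @ Psi / B**2                  # reward terms R_jj
    R_mix = Psi_sg.T @ K @ Psi / B**2           # gradients flow only through psi_j
    R_diag_sg = jax.lax.stop_gradient(jnp.diag(R))
    mask = jnp.triu(jnp.ones((J, J)), k=1)      # mask[i, j] = 1 for i < j
    penalties = mask * R_mix**2 / R_diag_sg[:, None]
    return -(jnp.sum(jnp.diag(R)) - jnp.sum(penalties))

# Gradients w.r.t. the network parameters are then obtained with jax.grad,
# applied through Psi = l2_batch_norm(f_theta(X_batch)).
```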
Let $f_{\boldsymbol\theta}$ be a neural network with parameters drawn from a prior, $\boldsymbol\theta \sim p(\boldsymbol\theta)$. The NN-GP kernel is defined as
$$k_{\text{NN-GP}}(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\boldsymbol\theta \sim p(\boldsymbol\theta)}\big[f_{\boldsymbol\theta}(\mathbf{x})\, f_{\boldsymbol\theta}(\mathbf{x}')\big],$$
which can be computed analytically when the number of neurons in each layer goes to infinity. There are many classic papers that do this: Neal (1998), Cho and Saul (2009), …
The Neural Tangent Kernel (NTK) is defined as
$$k_{\text{NTK}}(\mathbf{x}, \mathbf{x}') = \big\langle \nabla_{\boldsymbol\theta} f_{\boldsymbol\theta}(\mathbf{x}),\; \nabla_{\boldsymbol\theta} f_{\boldsymbol\theta}(\mathbf{x}') \big\rangle,$$
which, in the infinite-width limit, becomes independent of the random initialisation and stays constant during training.
Topic of next week's reading group….
The NN-GP and NTK kernels can be very expensive, or even analytically intractable, to compute. A common approach is to estimate them by Monte Carlo sampling with finite-width networks. This is straightforward for the NN-GP:
$$k_{\text{NN-GP}}(\mathbf{x}, \mathbf{x}') \approx \frac{1}{S}\sum_{s=1}^{S} f_{\boldsymbol\theta_s}(\mathbf{x})\, f_{\boldsymbol\theta_s}(\mathbf{x}'), \qquad \boldsymbol\theta_s \sim p(\boldsymbol\theta).$$
The NTK is not so simple to estimate, as it involves the outer product of the Jacobian of the network. For large models, this is very expensive to compute and store exactly. Here we focus on estimating the "empirical NTK" of a finite network, given by
$$\hat{k}_{\text{NTK}}(\mathbf{x}, \mathbf{x}') = \nabla_{\boldsymbol\theta} f_{\boldsymbol\theta}(\mathbf{x})^\top\, \nabla_{\boldsymbol\theta} f_{\boldsymbol\theta}(\mathbf{x}').$$
Observing that, for any random vector $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, I)$ over the parameters,
$$\mathbb{E}_{\mathbf{v}}\Big[\big(\nabla_{\boldsymbol\theta} f_{\boldsymbol\theta}(\mathbf{x})^\top \mathbf{v}\big)\big(\nabla_{\boldsymbol\theta} f_{\boldsymbol\theta}(\mathbf{x}')^\top \mathbf{v}\big)\Big] = \nabla_{\boldsymbol\theta} f_{\boldsymbol\theta}(\mathbf{x})^\top\, \nabla_{\boldsymbol\theta} f_{\boldsymbol\theta}(\mathbf{x}'),$$
the empirical NTK can be estimated from Jacobian-vector products with $S$ samples of $\mathbf{v}$.
Taking a Taylor expansion, each Jacobian-vector product can itself be approximated using forward passes only:
$$\nabla_{\boldsymbol\theta} f_{\boldsymbol\theta}(\mathbf{x})^\top \mathbf{v} \;\approx\; \frac{f_{\boldsymbol\theta + \epsilon\mathbf{v}}(\mathbf{x}) - f_{\boldsymbol\theta}(\mathbf{x})}{\epsilon}.$$
Both of these approximations require only $S$ evaluations of the network (forward passes, or forward-mode Jacobian-vector products) rather than explicitly computing and storing the full Jacobian.
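The following sketch (toy two-layer MLP; all names are ours) estimates the empirical NTK on a small batch with Jacobian-vector products and compares it to the exact Jacobian outer product:

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def f(params, X):                                   # scalar-output MLP, batched over X
    h = jnp.tanh(X @ params["W1"] + params["b1"])
    return (h @ params["W2"])[:, 0]

k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
params = {"W1": jax.random.normal(k1, (2, 32)) / jnp.sqrt(2.0),
          "b1": jnp.zeros(32),
          "W2": jax.random.normal(k2, (32, 1)) / jnp.sqrt(32.0)}
X = jax.random.normal(k3, (4, 2))                   # small batch of inputs

flat, unravel = ravel_pytree(params)
f_flat = lambda p: f(unravel(p), X)

def jvp_f(v):                                       # J(X) v via forward-mode AD
    return jax.jvp(f_flat, (flat,), (v,))[1]

vs = jax.random.normal(k4, (512, flat.shape[0]))    # random parameter-space directions
Jv = jax.vmap(jvp_f)(vs)                            # [num_samples, batch]
K_mc = Jv.T @ Jv / vs.shape[0]                      # Monte Carlo NTK estimate

J = jax.jacobian(f_flat)(flat)                      # exact Jacobian (small model only)
print(jnp.max(jnp.abs(K_mc - J @ J.T)))             # error shrinks with more samples
```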
The LLA is a method to obtain predictive uncertainty estimates for neural networks trained by MAP. The approximate posterior in function space is given by:
$$f(\mathbf{x}^*) \mid \mathcal{D} \;\sim\; \mathcal{N}\Big(f_{\hat{\boldsymbol\theta}}(\mathbf{x}^*),\; \nabla_{\boldsymbol\theta} f_{\hat{\boldsymbol\theta}}(\mathbf{x}^*)^\top\, \Sigma_{\hat{\boldsymbol\theta}}\, \nabla_{\boldsymbol\theta} f_{\hat{\boldsymbol\theta}}(\mathbf{x}^*)\Big),$$
where $\hat{\boldsymbol\theta}$ are the MAP parameters and $\Sigma_{\hat{\boldsymbol\theta}}$ is the (generalised Gauss-Newton) Laplace approximation to the posterior covariance over the parameters, which involves the Jacobians of the network on the whole training set.
This covariance matrix can be approximated using the eigenfunction approximation to the NTK, but now evaluated at the MAP parameters:
$$k_{\text{NTK}}^{\hat{\boldsymbol\theta}}(\mathbf{x}, \mathbf{x}') \approx \sum_{j=1}^{k} \mu_j\, \psi_j(\mathbf{x})\, \psi_j(\mathbf{x}'), \qquad \varphi(\mathbf{x}) := \big(\sqrt{\mu_1}\,\psi_1(\mathbf{x}), \ldots, \sqrt{\mu_k}\,\psi_k(\mathbf{x})\big)^\top.$$
The LLA posterior can now be approximated as:
$$f(\mathbf{x}^*) \mid \mathcal{D} \;\approx\; \mathcal{N}\Big(f_{\hat{\boldsymbol\theta}}(\mathbf{x}^*),\; \varphi(\mathbf{x}^*)^\top \big(\textstyle\sum_{i=1}^{N_{\text{train}}} \varphi(\mathbf{x}_i)\,\Lambda_i\,\varphi(\mathbf{x}_i)^\top + \text{prior precision}\big)^{-1} \varphi(\mathbf{x}^*)\Big),$$
where $\Lambda_i$ is the per-datapoint Hessian of the negative log-likelihood w.r.t. the network output, and the matrix to be inverted now only has dimension $k \times k$ (the number of eigenfunctions), rather than scaling with the number of parameters or datapoints.
Compared to other scalable LLA methods, this method performs favourably in terms of NLL and Expected Calibration Error (ECE); see Deng et al. (2022a) for the experimental results.
In contrastive learning, a clean data point $\bar{\mathbf{x}} \sim p_d$ is transformed into random augmented views (e.g. crops, flips, colour distortions) via an augmentation distribution $p(\mathbf{x} \mid \bar{\mathbf{x}})$,
and the joint distribution over two views of the same clean point defines the contrastive kernel
$$k(\mathbf{x}, \mathbf{x}') = \frac{p(\mathbf{x}, \mathbf{x}')}{p(\mathbf{x})\, p(\mathbf{x}')}, \qquad p(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\bar{\mathbf{x}} \sim p_d}\big[p(\mathbf{x} \mid \bar{\mathbf{x}})\, p(\mathbf{x}' \mid \bar{\mathbf{x}})\big].$$
This kernel gives a measure of how likely it is that $\mathbf{x}$ and $\mathbf{x}'$ are augmentations of the same clean data point.
Plugging this kernel into the eigenfunction loss, the quadratic forms simplify to expectations over pairs of augmentations of the same clean image:
$$R_{ij} = \mathbb{E}_{\mathbf{x}, \mathbf{x}'}\big[\hat\psi_i(\mathbf{x})\, k(\mathbf{x}, \mathbf{x}')\, \hat\psi_j(\mathbf{x}')\big] = \mathbb{E}_{\bar{\mathbf{x}}}\, \mathbb{E}_{\mathbf{x}, \mathbf{x}^+ \mid \bar{\mathbf{x}}}\big[\hat\psi_i(\mathbf{x})\, \hat\psi_j(\mathbf{x}^+)\big],$$
where $\mathbf{x}$ and $\mathbf{x}^+$ are two independent augmentations of the same clean point $\bar{\mathbf{x}}$, so the loss can be estimated from minibatches of positive pairs without ever evaluating the kernel explicitly.
A benefit of this approach is that the feature representations (eigenfunctions) are ordered in terms of importance. This means that they are well suited to be used as adaptive-length codes. An application of this is image retrieval, where a shorter code can lead to much lower time and storage requirements for retrieving the top $k$ most similar images.
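As a toy illustration of adaptive-length codes (random vectors stand in for the learned eigenfunction outputs; all names are ours), retrieval simply truncates the representation to its first $d$ components before computing similarities:

```python
import jax
import jax.numpy as jnp

database = jax.random.normal(jax.random.PRNGKey(0), (1000, 64))  # stand-in for psi(x) of stored images
query = jax.random.normal(jax.random.PRNGKey(1), (64,))          # stand-in for psi(x) of the query

def retrieve(query, database, d=16, k=5):
    # Keep only the first d (most important) features, then rank by cosine similarity.
    q, db = query[:d], database[:, :d]
    scores = db @ q / (jnp.linalg.norm(db, axis=1) * jnp.linalg.norm(q))
    return jnp.argsort(-scores)[:k]               # indices of the k most similar items

print(retrieve(query, database))
```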