Reading Group – Wednesday April 5th 2023
By Ross Viljoen and Vincent Dutordoir
Learning the principal eigenfunctions of an operator is a fundamental problem in various machine learning tasks, from representation learning to Gaussian processes. However, traditional non-parametric solutions suffer from scalability issues, rendering them impractical on large datasets.
This reading group will discuss parametric approaches to approximating eigendecompositions using neural networks. In particular, Spectral Inference Networks (SpIN) offer a scalable method for approximating eigenfunctions of symmetric operators on high-dimensional function spaces using bi-level optimization methods and gradient masking (Pfau et al., 2019).
A recent improvement on SpIN, called NeuralEF, focuses on approximating eigenfunction expansions of kernels (Deng et al., 2022a). The method is applied to modern neural-network based kernels (NN-GP and NTK) as well as scaling up the linearised Laplace approximation for deep networks (Deng et al., 2022a). Finally, self-supervised learning can be expressed in terms of approximating a contrastive kernel, which allows NeuralEF to learn structured representations (Deng et al., 2022b).
David Pfau, Stig Petersen, Ashish Agarwal, David G. T. Barrett, and Kimberly L. Stachenfeld. "Spectral inference networks: Unifying deep and spectral learning." ICLR (2019).
Zhijie Deng, Jiaxin Shi, and Jun Zhu. "NeuralEF: Deconstructing kernels by deep neural networks." ICML (2022a).
Zhijie Deng, Jiaxin Shi, Hao Zhang, Peng Cui, Cewu Lu, Jun Zhu. "Neural Eigenfunctions Are Structured Representation Learners." arXiv preprint arXiv:2210.12637 (2022b).
Spectral methods are ubiquitous, both in machine learning and in physics/engineering.
A typical machine learning example is finding a non-linear embedding of data on a low-dimensional manifold. A well-known algorithm of this kind is Principal Component Analysis (PCA). However, it scales cubically in compute and quadratically in memory with the number of datapoints $N$, and it is not obvious how new data points should be embedded.
This reading group will cover three papers in chronological order that build on each other.
Eigenvectors of a matrix $A \in \mathbb{R}^{n \times n}$ are defined by $A v = \lambda v$ for some scalar $\lambda$, called the eigenvalue.
Let $A$ be symmetric. Then the eigenvector corresponding to the largest eigenvalue is the solution of
$$v_1 = \operatorname*{arg\,max}_{v \,:\, v^\top v = 1} \; v^\top A v.$$
Proof (sketch)
Rewriting the objective:
$$\max_{v} \; v^\top A v \quad \text{s.t.} \quad v^\top v = 1.$$
Now adding a Lagrange multiplier $\lambda$ for the constraint $\|v\| = 1$ (which can be rewritten as $v^\top v - 1 = 0$) gives
$$\mathcal{L}(v, \lambda) = v^\top A v - \lambda \left( v^\top v - 1 \right),$$
which can be solved by setting the derivatives w.r.t. $v$ and $\lambda$ to zero. This gives
$$A v = \lambda v \quad \text{and} \quad v^\top v = 1,$$
so the stationary points are exactly the unit-norm eigenvectors of $A$, and the objective value at each of them is the corresponding eigenvalue.
The constrained optimisation objective from above is equivalent (up to a scaling factor) to
$$\max_{v} \; \frac{v^\top A v}{v^\top v},$$
because we normalize the vector to lie on the unit hypersphere: $v^\top v = 1$. This expression is known as the Rayleigh quotient and, at its maximum, equals the eigenvalue corresponding to the top eigenvector.
To compute the top $N$ eigenvectors $v_1, \ldots, v_N$, we can solve a sequence of maximization problems greedily:
$$v_k = \operatorname*{arg\,max}_{v \,:\, v^\top v = 1,\; v \perp v_1, \ldots, v_{k-1}} \; v^\top A v, \qquad k = 1, \ldots, N.$$
If we drop the orthogonality constraint, we can formulate this as a single unconstrained objective
$$\max_{V \in \mathbb{R}^{n \times N}} \; \operatorname{tr}\!\left( \left( V^\top V \right)^{-1} V^\top A V \right).$$
Following a similar derivation as above, the maximum is given by $\sum_{k=1}^{N} \lambda_k$ (the sum of the $N$ largest eigenvalues). However, without the orthogonality constraint between the columns of $V$, the solution will simply span the same subspace as the true top-$N$ eigenvectors rather than recover them individually.
Remark. The objective is invariant to right multiplication by an invertible matrix (i.e. to taking linear combinations of the basis).
Proof. Set $\tilde{V} = V B$ for an invertible $B \in \mathbb{R}^{N \times N}$ and show that the objective takes the same value for $\tilde{V}$ as for $V$:
$$\operatorname{tr}\!\left( (B^\top V^\top V B)^{-1} B^\top V^\top A V B \right) = \operatorname{tr}\!\left( B^{-1} (V^\top V)^{-1} V^\top A V B \right) = \operatorname{tr}\!\left( (V^\top V)^{-1} V^\top A V \right).$$
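The two facts above are easy to check numerically. The following minimal NumPy snippet (illustrative only, not from the papers) verifies that the Rayleigh quotient is maximised by the top eigenvector, and that the trace objective is invariant to right multiplication, so it only identifies the top subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 50, 5
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric test matrix

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
top_vec = eigvecs[:, -1]

def rayleigh(v, A):
    return v @ A @ v / (v @ v)

assert np.isclose(rayleigh(top_vec, A), eigvals[-1])

def trace_objective(V, A):
    # tr((V^T V)^{-1} V^T A V)
    return np.trace(np.linalg.solve(V.T @ V, V.T @ A @ V))

V = eigvecs[:, -N:]                    # top-N eigenvectors
B = rng.standard_normal((N, N))        # random (almost surely invertible) mixing
print(trace_objective(V, A), trace_objective(V @ B, A), eigvals[-N:].sum())
# all three numbers agree: the objective only identifies the subspace
```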
The above derivation can be used to optimize for the eigenvectors in (kernel) PCA, where $A = X X^\top$ (the Gram/kernel matrix) or $A = X^\top X$ for a feature matrix $X$.
An embedding for a new datapoint $x^*$ can be found by interpolating the eigenfunctions:
$$\hat{\psi}_j(x^*) \propto \frac{1}{\lambda_j} \sum_{i=1}^{N} k(x^*, x_i)\, v_{ij},$$
which scales as $\mathcal{O}(N)$ per new datapoint.
Spectral Inference Networks (SpIN) overcomes these limitations by using a parametric form for the embeddings.
Suppose that instead of a matrix $A$ we have a symmetric (not necessarily positive definite) kernel $k(x, x')$, where $x$ and $x'$ are in some measurable space $\mathcal{X}$, which can be either continuous or discrete.
Define the inner product on $\mathcal{X}$ w.r.t. the input density $p(x)$ as
$$\langle f, g \rangle = \int f(x)\, g(x)\, p(x)\, \mathrm{d}x,$$
and the kernel operator $\mathcal{K}$ as
$$(\mathcal{K} f)(x) = \int k(x, x')\, f(x')\, p(x')\, \mathrm{d}x'.$$
We are interested in finding the top $N$ eigenfunctions of the operator $\mathcal{K}$, for which
$$(\mathcal{K} \phi_j)(x) = \lambda_j\, \phi_j(x), \qquad j = 1, \ldots, N.$$
In vector space we could cast the problem of spectral decomposition as an optimization problem using the following objective:
$$\max_{V \in \mathbb{R}^{n \times N}} \; \operatorname{tr}\!\left( (V^\top V)^{-1}\, V^\top A V \right).$$
The analogue of this in function space looks like this:
$$\max_{u} \; \operatorname{tr}\!\left( \langle u, u \rangle^{-1} \langle u, \mathcal{K} u \rangle \right).$$
Let $u = (u_1, \ldots, u_N)^\top$ represent the $N$ eigenfunctions (or a linear combination thereof, which means that they share the same span), and let $\langle u, u \rangle$ and $\langle u, \mathcal{K} u \rangle$ denote the $N \times N$ matrices of pairwise inner products.
To simplify notation, define
$$\Sigma := \langle u, u \rangle = \mathbb{E}_{x \sim p}\!\left[ u(x)\, u(x)^\top \right], \qquad \Pi := \langle u, \mathcal{K} u \rangle = \mathbb{E}_{x, x' \sim p}\!\left[ u(x)\, k(x, x')\, u(x')^\top \right],$$
which gives, parameterizing the eigenfunctions using a neural network $u_\theta$ with parameters $\theta$, the following objective:
$$\max_{\theta} \; \operatorname{tr}\!\left( \Sigma^{-1} \Pi \right).$$
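A small NumPy sketch (our notation, illustrative only) of the plug-in minibatch estimate of this objective, using a toy stand-in for the network $u_\theta$ and an RBF kernel; this is the quantity SpIN seeks to maximise:

```python
import numpy as np

def u_theta(theta, X):
    """Toy stand-in for the neural network u_theta: R^d -> R^N."""
    return np.tanh(X @ theta)

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(0)
d, N, batch = 2, 4, 256
theta = rng.standard_normal((d, N))
X = rng.standard_normal((batch, d))              # minibatch x ~ p(x)

U = u_theta(theta, X)                            # (batch, N)
Sigma_hat = U.T @ U / batch                      # estimate of E[u u^T]
Pi_hat = U.T @ rbf_kernel(X, X) @ U / batch**2   # estimate of E[u k u^T]
objective = np.trace(np.linalg.solve(Sigma_hat, Pi_hat))
# note: this plug-in estimate (and its gradient) is biased, which is the
# issue the bilevel / two-time-scale machinery below addresses.
```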
Problems
The objective is invariant to linear transformations of the output: the solution only spans the top eigenfunctions rather than identifying them individually. Unbiased stochastic gradients are also hard to obtain (more on this below).
We can use the invariance of the trace under cyclic permutations to rewrite the objective.
Let $L$ be the Cholesky factor of $\Sigma$ (i.e. $\Sigma = L L^\top$); then the objective can be rewritten as
$$\operatorname{tr}\!\left( \Sigma^{-1} \Pi \right) = \operatorname{tr}\!\left( L^{-\top} L^{-1} \Pi \right) = \operatorname{tr}\!\left( L^{-1} \Pi L^{-\top} \right) = \operatorname{tr}\!\left( \Lambda \right),$$
where we used the shorthand $\Lambda = L^{-1} \Pi L^{-\top}$. This matrix has the convenient property that its upper-left $k \times k$ block only depends on the first $k$ eigenfunctions.
By zeroing out the gradients flowing from the higher (later) eigenfunctions to the lower (earlier) ones, we can impose an ordering. This is summarised in a claim in the appendix of the paper (proved by induction): with the masked gradient, each $u_k$ converges to the $k$-th eigenfunction rather than merely to an element of the top-$k$ subspace.
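The structural property behind this claim is easy to check numerically. The following NumPy snippet (illustrative, our notation) verifies that the upper-left $k \times k$ block of $\Lambda = L^{-1} \Pi L^{-\top}$ depends only on the upper-left $k \times k$ blocks of $\Sigma$ and $\Pi$, i.e. only on the first $k$ functions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 6, 3

U = rng.standard_normal((100, N))          # stand-in for function evaluations
Sigma = U.T @ U / 100                      # Sigma = E[u u^T] (positive definite)
Pi = U.T @ rng.standard_normal((100, 100)) @ U / 100**2
Pi = (Pi + Pi.T) / 2                       # symmetric stand-in for E[u k u^T]

def Lambda(Sigma, Pi):
    L = np.linalg.cholesky(Sigma)          # Sigma = L L^T
    Linv = np.linalg.inv(L)
    return Linv @ Pi @ Linv.T

full_block = Lambda(Sigma, Pi)[:k, :k]
small_block = Lambda(Sigma[:k, :k], Pi[:k, :k])
print(np.allclose(full_block, small_block))   # True: an ordering can be imposed
```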
The 'masked' gradient used in the backward pass can be written down in closed form in terms of the Cholesky factor $L$, the matrix $\Lambda$, and the Jacobians of $\Sigma$ and $\Pi$ with respect to $\theta$ (see the SpIN paper for the explicit expression).
In SGD we use stochastic gradient estimates, which are typically unbiased:
$$\mathbb{E}\big[ \hat{\nabla}_\theta \mathcal{L} \big] = \nabla_\theta \mathcal{L}.$$
The objective, however, is a nonlinear function of multiple expectations ($\Sigma$ and $\Pi$), so naively replacing them and their gradients with empirical estimates gives biased gradients.
The fix is to reframe the objective as a bilevel optimization problem, in which running estimates of $\Sigma$ and of its Jacobian $\partial \Sigma / \partial \theta$ are maintained alongside the optimization of $\theta$.
Bilevel stochastic optimization is the problem of simultaneously solving two coupled minimization problems.
Traditional optimization:
$$\min_{x \in X} \; F(x).$$
Formally, a bilevel optimization problem reads:
$$\min_{x \in X} \; F(x, y^*(x)) \quad \text{s.t.} \quad y^*(x) \in S(x),$$
where $S(x)$ is the set of optimal solutions of the $x$-parameterized problem
$$S(x) = \operatorname*{arg\,min}_{y \in Y} \; f(x, y).$$
In bilevel optimization, the objective depends on the 'decision' of another actor, whose decision in turn depends on our action.
Example: Toll Setting
Consider a network of highways operated by the government. The government wants to maximize its revenue by choosing the optimal toll setting for the highways. However, it can do so only by taking the highway users' problem into account: for any given toll structure, the users solve their own optimization problem, minimizing their travelling costs by deciding between using the highways or an alternative route.
Bilevel stochastic problems are common in machine learning and include actor-critic methods, generative adversarial networks and imitation learning. SpIN relies on a classic result from 1997 (Vivek S. Borkar, "Stochastic approximation with two time scales"). It boils down to having a second objective which is solved using a different (slower) learning rate:
$$\bar{\Sigma}_t = (1 - \beta)\, \bar{\Sigma}_{t-1} + \beta\, \hat{\Sigma}_t, \qquad \bar{J}_t = (1 - \beta)\, \bar{J}_{t-1} + \beta\, \widehat{\partial \Sigma / \partial \theta}\big|_t,$$
where $\bar{\Sigma}$ and $\bar{J}$ are moving averages of the covariance $\Sigma$ and of its Jacobian with respect to $\theta$, and the decay rate $\beta$ sets the slower time scale.
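A minimal sketch of the two-time-scale idea in NumPy (our notation; not the SpIN reference implementation, and only the EMA of $\Sigma$ is shown, with the parameter update and the Jacobian EMA omitted):

```python
import numpy as np

def ema_update(avg, new, beta):
    """Exponential moving average with decay rate beta (the slow time scale)."""
    return beta * avg + (1.0 - beta) * new

def features(theta, X):
    """Toy stand-in for the neural network u_theta."""
    return np.tanh(X @ theta)

rng = np.random.default_rng(0)
d, N = 3, 4
theta = rng.standard_normal((d, N))
sigma_avg = np.eye(N)                       # running estimate of E[u u^T]
beta = 0.99                                 # slow EMA decay

for step in range(100):
    X = rng.standard_normal((32, d))        # minibatch
    U = features(theta, X)
    sigma_batch = U.T @ U / len(X)          # noisy single-batch estimate
    sigma_avg = ema_update(sigma_avg, sigma_batch, beta)
    # The parameter update is omitted here; in SpIN the masked gradient would
    # be formed from sigma_avg (and an EMA of its Jacobian) rather than from
    # the single-batch estimate, which is what the two-time-scale analysis
    # justifies.
```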
Problems: the bilevel optimization is complicated to implement and tune, tracking the moving average of the Jacobian $\partial \Sigma / \partial \theta$ is very memory-intensive, and every update requires a Cholesky decomposition of $\Sigma$.
Kernel methods project data into a high-dimensional feature space to enable linear manipulation of nonlinear data. In other words, let $\varphi$ be the feature map; then the model depends linearly on the feature representations of the inputs, e.g. $f(x) = \langle w, \varphi(x) \rangle$.
Often we don't know $\varphi$ explicitly (it is typically infinite-dimensional). The subtlety is that we can leverage the "kernel trick", $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$, to bypass the need to specify $\varphi$ explicitly.
There is a rich family of kernels.
Classic kernels: Squared Exponential, Matérn Family, Linear, Polynomial, etc. They may easily fail when processing real-world data like images and texts due to inadequate expressive power and the curse of dimensionality.
Modern kernels have been developed that capture the inductive biases of NN architectures, such as the NTK and NN-GP (discussed briefly by Ross, and the topic of next week's reading group).
An idea to scale up kernel methods is to approximate the kernel with the inner product of some explicit, finite-dimensional vector representation of the data:
$$k(x, x') \approx \nu(x)^\top \nu(x'),$$
where $\nu(x) \in \mathbb{R}^{R}$ with $R$ small. There are two broad approaches: random features and the Nyström method.
Bochner's theorem: a stationary kernel is positive semi-definite if and only if it is the Fourier transform of a positive measure (the power spectrum):
$$k(x - x') = \int e^{i \omega^\top (x - x')}\, \mathrm{d}\mu(\omega).$$
Sampling frequencies from (a normalized version of) this measure is the basis of random Fourier features (RFFs).
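As an illustration, here is the standard random Fourier feature construction for the squared-exponential kernel in NumPy (the lengthscale and feature count are arbitrary choices for the example, not values from the papers):

```python
import numpy as np

def rff_features(X, num_features, lengthscale=1.0, seed=0):
    """Map X (n, d) to (n, num_features) so that the inner product of features
    approximates exp(-||x - x'||^2 / (2 lengthscale^2))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, num_features)) / lengthscale   # spectral samples
    b = rng.uniform(0, 2 * np.pi, num_features)                # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).standard_normal((5, 2))
Phi = rff_features(X, num_features=2000)
approx = Phi @ Phi.T
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.abs(approx - exact).max())   # error shrinks as num_features grows
```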
A kernel $k$ induces a kernel operator $\mathcal{K}$ (as defined above), and we are interested in finding its eigenfunctions.
Eigenfunctions of the operator obey
$$\int k(x, x')\, \psi_j(x')\, p(x')\, \mathrm{d}x' = \mu_j\, \psi_j(x), \qquad \langle \psi_i, \psi_j \rangle = \delta_{ij}.$$
Using the eigenfunctions corresponding to non-zero eigenvalues, the kernel has the (Mercer) representation
$$k(x, x') = \sum_{j \geq 1} \mu_j\, \psi_j(x)\, \psi_j(x').$$
We can use the data $\{x_i\}_{i=1}^{N}$ to obtain a Monte Carlo estimate of the top $R$ eigenfunctions by taking the top eigenvectors of the Gram matrix $K \in \mathbb{R}^{N \times N}$, $K_{ij} = k(x_i, x_j)$.
The Nyström method approximates the eigenfunctions at out-of-sample points as
$$\hat{\psi}_j(x) \approx \frac{\sqrt{N}}{\lambda_j} \sum_{i=1}^{N} k(x, x_i)\, v_{ij}, \qquad \hat{\mu}_j = \frac{\lambda_j}{N},$$
where $(\lambda_j, v_j)$ are the top eigenpairs of the Gram matrix $K$. This in turn approximates the kernel as
$$k(x, x') \approx \sum_{j=1}^{R} \hat{\mu}_j\, \hat{\psi}_j(x)\, \hat{\psi}_j(x').$$
Unlike RFFs, the Nyström method can be applied to approximate any positive-definite kernel. Yet it is non-trivial to scale up.
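A NumPy sketch of the Nyström recipe above (the RBF kernel, sample size and number of components $R$ are illustrative choices for the example):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))              # samples from p(x)
K = rbf_kernel(X, X)

evals, evecs = np.linalg.eigh(K)               # ascending order
R = 10
evals, evecs = evals[::-1][:R], evecs[:, ::-1][:, :R]   # keep the top R

mu = evals / len(X)                            # operator eigenvalue estimates

def nystrom_eigenfunctions(X_new):
    """psi_j(x*) ≈ sqrt(N)/lambda_j * sum_i k(x*, x_i) v_ij."""
    return rbf_kernel(X_new, X) @ evecs * np.sqrt(len(X)) / evals

X_test = rng.standard_normal((5, 2))
Psi = nystrom_eigenfunctions(X_test)           # (5, R)
K_approx = Psi @ np.diag(mu) @ Psi.T
print(np.abs(K_approx - rbf_kernel(X_test, X_test)).max())  # top-R reconstruction error
```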
Back to the SpIN objective:
$$\max_{\theta} \; \operatorname{tr}\!\left( \Sigma^{-1} \Pi \right).$$
Can we instead enforce an ordering directly in the optimisation objective such that the gradient masking becomes unnecessary?
Characterise the spectral decomposition as an asymmetric maximisation problem: for $j = 1, \ldots, N$,
$$\hat{\psi}_j = \operatorname*{arg\,max}_{\hat{\psi}_j} \; R_{jj} \quad \text{s.t.} \quad \langle \hat{\psi}_j, \hat{\psi}_j \rangle = 1 \quad \text{and} \quad R_{ij} = 0 \;\; \text{for } i < j,$$
where:
$$R_{ij} = \langle \hat{\psi}_i, \mathcal{K} \hat{\psi}_j \rangle = \iint \hat{\psi}_i(x)\, k(x, x')\, \hat{\psi}_j(x')\, p(x)\, p(x')\, \mathrm{d}x\, \mathrm{d}x'.$$
The claim is that the pairs $(R_{jj}, \hat{\psi}_j)$ will converge to the true eigenpairs $(\mu_j, \psi_j)$, ordered such that $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_N$.
Proof
For $j = 1$, there is no orthogonality constraint, so the problem is the eigenfunction version of finding the largest eigenvector by maximisation:
$$\max_{\hat{\psi}_1} \; \langle \hat{\psi}_1, \mathcal{K} \hat{\psi}_1 \rangle \quad \text{s.t.} \quad \langle \hat{\psi}_1, \hat{\psi}_1 \rangle = 1,$$
but, since the true eigenfunctions $\{\psi_k\}$ form an orthonormal basis, we can write $\hat{\psi}_1 = \sum_k w_k\, \psi_k$, so
$$\langle \hat{\psi}_1, \mathcal{K} \hat{\psi}_1 \rangle = \sum_k w_k^2\, \mu_k.$$
Recalling the constraint:
$$\langle \hat{\psi}_1, \hat{\psi}_1 \rangle = \sum_k w_k^2 = 1,$$
the objective must be maximised by taking $w_1 = 1$ and $w_k = 0$ for $k > 1$. Therefore, the optimum is found when $\hat{\psi}_1 = \psi_1$.
For $j = 2$, we now have the additional orthogonality constraint:
$$R_{12} = \langle \hat{\psi}_1, \mathcal{K} \hat{\psi}_2 \rangle = \mu_1 \langle \psi_1, \hat{\psi}_2 \rangle = 0.$$
So, for $j = 2$ we end up solving the same maximisation problem as for $j = 1$, but restricted to the subspace orthogonal to $\psi_1$. Repeating the same logic for $j = 3, \ldots, N$ completes the proof.
In practice, we turn the orthogonality constraints into penalties:
$$\max_{\hat{\psi}_1, \ldots, \hat{\psi}_N} \; \sum_{j=1}^{N} \left( R_{jj} - \sum_{i < j} \frac{R_{ij}^2}{R_{ii}} \right),$$
where the scaling factor of $R_{ii}$ is introduced into the denominator so that the two terms are on the same scale.
This objective is derived from a recent reinterpretation of PCA as a multi-player game in Gemp et al. (2020), "EigenGame: PCA as a Nash equilibrium". It can be seen as a function-space version of the same approach.
At each optimisation step, $R_{ij}$ is estimated by Monte Carlo integration with a minibatch of samples $X_B = \{x_b\}_{b=1}^{B}$:
$$\hat{R}_{ij} = \frac{1}{B^2}\, \hat{\psi}_i(X_B)^\top K_{X_B}\, \hat{\psi}_j(X_B),$$
where $\hat{\psi}_j(X_B) \in \mathbb{R}^{B}$ is the concatenated output of $\hat{\psi}_j$ on the minibatch and $K_{X_B}$ is the kernel Gram matrix evaluated on the minibatch.
Tracking $\hat{R}_{jj}$ with an exponential moving average (EMA) gives an estimate of the eigenvalue $\mu_j$.
Note: Minibatching will not give an unbiased estimate of the gradients of the objective. There is no discussion of this in the paper, but it appears that, for large enough minibatch sizes, this is not a problem in practice.
Of course, there is still another constraint to satisfy, normalisation: $\langle \hat{\psi}_j, \hat{\psi}_j \rangle = 1$. In the minibatch setting, this is estimated as $\frac{1}{B} \lVert \hat{\psi}_j(X_B) \rVert_2^2$, so we want to have
$$\frac{1}{B} \left\lVert \hat{\psi}_j(X_B) \right\rVert_2^2 = 1.$$
This is enforced by adding an L2 batch normalisation layer to the end of each neural net:
$$\hat{\psi}_j(x_b) = \frac{h_j(x_b)}{\sqrt{\tfrac{1}{B} \sum_{b'=1}^{B} h_j(x_{b'})^2}},$$
where $h_j$ is the output of the penultimate layer for the minibatch $X_B$.
After all the setup above, the training loss on a minibatch becomes:
$$\ell(X_B) = \sum_{j=1}^{N} \left( - \hat{R}_{jj} + \sum_{i < j} \frac{\big( \tfrac{1}{B^2}\, \mathrm{sg}\big[\hat{\psi}_i(X_B)\big]^\top K_{X_B}\, \hat{\psi}_j(X_B) \big)^2}{\mathrm{sg}\big[ \hat{R}_{ii} \big]} \right),$$
where $\mathrm{sg}[\cdot]$ denotes the "stop gradient" operation. Its gradient with respect to each network's parameters becomes:
$$\nabla_{\theta_j} \ell = - \frac{2}{B^2} \left( \frac{\partial \hat{\psi}_j(X_B)}{\partial \theta_j} \right)^{\!\top} K_{X_B} \left( \hat{\psi}_j(X_B) - \sum_{i < j} \frac{\hat{R}_{ij}}{\hat{R}_{ii}}\, \hat{\psi}_i(X_B) \right).$$
Notably, the term within the brackets in the gradient is exactly the (generalised) Gram–Schmidt orthogonalisation of $\hat{\psi}_j$ with respect to all $\hat{\psi}_i$ with $i < j$.
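A hedged NumPy sketch of this minibatch objective (our notation; the constants and stop-gradient placement follow the description above rather than the authors' reference code, and the stop-gradient itself is only indicated in comments since plain NumPy has no autodiff):

```python
import numpy as np

def l2_batch_norm(H):
    """Normalise each column h_j so that (1/B) sum_b h_j(x_b)^2 = 1."""
    B = H.shape[0]
    return H * np.sqrt(B) / np.linalg.norm(H, axis=0, keepdims=True)

def neuralef_loss(Psi, K):
    """Psi: (B, N) L2-batch-normalised outputs of the N eigenfunction networks.
    K: (B, B) kernel Gram matrix on the minibatch."""
    B, N = Psi.shape
    R = Psi.T @ K @ Psi / B**2          # R[i, j] ≈ <psi_i, K psi_j>
    loss = 0.0
    for j in range(N):
        penalty = sum(R[i, j] ** 2 / R[i, i] for i in range(j))
        # In the real training loop, psi_i (i < j) and R[i, i] inside the
        # penalty are wrapped in stop_gradient, so function j cannot push
        # the lower-index functions around.
        loss += -R[j, j] + penalty      # maximise R_jj, penalise overlap
    return loss

# example usage with random stand-ins for the network outputs and Gram matrix
rng = np.random.default_rng(0)
H = rng.standard_normal((128, 8))
Z = rng.standard_normal((128, 128))
K = Z @ Z.T / 128                       # any PSD Gram matrix works here
print(neuralef_loss(l2_batch_norm(H), K))
```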
Let $f(x; \theta)$ denote a function represented by an NN with weights $\theta$. Then the Neural Network Gaussian Process (NNGP) kernel is defined as
$$k_{\text{NNGP}}(x, x') = \mathbb{E}_{\theta \sim p(\theta)}\!\left[ f(x; \theta)\, f(x'; \theta) \right],$$
which can be computed analytically when the number of neurons in each layer goes to infinity. There are many classic papers that do this: Neal (1998), Cho and Saul (2009), …
The Neural Tangent Kernel (NTK) is defined as
$$k_{\text{NTK}}(x, x') = \mathbb{E}_{\theta \sim p(\theta)}\!\left[ \nabla_\theta f(x; \theta)^\top\, \nabla_\theta f(x'; \theta) \right].$$
Topic of next week's reading group….
The NNGP and NTK kernels can be very expensive, or even analytically intractable, to compute. A common approach is to estimate them by Monte Carlo sampling with finite-width networks. This is straightforward for the NNGP:
$$k_{\text{NNGP}}(x, x') \approx \frac{1}{S} \sum_{s=1}^{S} f(x; \theta_s)\, f(x'; \theta_s), \qquad \theta_s \sim p(\theta).$$
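A minimal sketch of this Monte Carlo estimate with a hypothetical two-layer ReLU network (the architecture, widths and weight scaling are illustrative assumptions, not the setup used in the papers):

```python
import numpy as np

def init_params(rng, d_in, d_hidden):
    """Draw weights from a simple Gaussian prior with 1/sqrt(fan_in) scaling."""
    return {
        "W1": rng.standard_normal((d_in, d_hidden)) / np.sqrt(d_in),
        "W2": rng.standard_normal((d_hidden, 1)) / np.sqrt(d_hidden),
    }

def forward(params, X):
    return np.maximum(X @ params["W1"], 0.0) @ params["W2"]   # ReLU MLP, scalar output

def mc_nngp(X, num_samples=1000, d_hidden=256, seed=0):
    rng = np.random.default_rng(seed)
    outs = np.stack([forward(init_params(rng, X.shape[1], d_hidden), X)[:, 0]
                     for _ in range(num_samples)])             # (S, n)
    return outs.T @ outs / num_samples                         # (n, n) kernel estimate

X = np.random.default_rng(1).standard_normal((8, 3))
K_nngp = mc_nngp(X)
```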
The NTK is not so simple to estimate, as it involves the outer product of the Jacobian of the network. For large models, this is very expensive to compute and store exactly. Here we focus on estimating the "empirical NTK" of a finite network, given by
$$\hat{k}_{\text{NTK}}(x, x') = \nabla_\theta f(x; \theta)^\top\, \nabla_\theta f(x'; \theta).$$
Observing that, for any random perturbation $\Delta\theta$ such that $\mathbb{E}[\Delta\theta\, \Delta\theta^\top] = \sigma^2 I$:
$$\mathbb{E}_{\Delta\theta}\!\left[ \big( \nabla_\theta f(x; \theta)^\top \Delta\theta \big)\big( \nabla_\theta f(x'; \theta)^\top \Delta\theta \big) \right] = \sigma^2\, \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta).$$
Taking a Taylor expansion, $f(x; \theta + \Delta\theta) - f(x; \theta) \approx \nabla_\theta f(x; \theta)^\top \Delta\theta$, and so
$$\hat{k}_{\text{NTK}}(x, x') \approx \frac{1}{S\, \sigma^2} \sum_{s=1}^{S} \big( f(x; \theta + \Delta\theta_s) - f(x; \theta) \big)\big( f(x'; \theta + \Delta\theta_s) - f(x'; \theta) \big).$$
Both of these approximations require $S$ forward passes of the network with different parameter settings. To obtain good approximations in practice, $S$ must be large. The hope is to be able to approximate this kernel with eigenfunctions to significantly speed up evaluation at test time.
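A hedged NumPy sketch of this perturbation estimate (the network, $\sigma$ and $S$ are illustrative choices of ours, not values from the paper):

```python
import numpy as np

def forward(flat_theta, X, d_in, d_hidden):
    """Hypothetical ReLU MLP with scalar output, parameters given as one flat vector."""
    W1 = flat_theta[: d_in * d_hidden].reshape(d_in, d_hidden)
    W2 = flat_theta[d_in * d_hidden:].reshape(d_hidden, 1)
    return (np.maximum(X @ W1, 0.0) @ W2)[:, 0]

def mc_empirical_ntk(theta, X, d_in, d_hidden, num_samples=2000, sigma=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    f0 = forward(theta, X, d_in, d_hidden)
    diffs = []
    for _ in range(num_samples):
        dtheta = sigma * rng.standard_normal(theta.shape)      # E[dtheta dtheta^T] = sigma^2 I
        diffs.append(forward(theta + dtheta, X, d_in, d_hidden) - f0)
    D = np.stack(diffs)                        # (S, n) first-order differences
    return D.T @ D / (num_samples * sigma**2)  # (n, n) empirical NTK estimate

d_in, d_hidden = 3, 64
rng = np.random.default_rng(1)
theta = rng.standard_normal(d_in * d_hidden + d_hidden) / np.sqrt(d_in)
X = rng.standard_normal((6, d_in))
K_ntk = mc_empirical_ntk(theta, X, d_in, d_hidden)
```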
The LLA is a method to obtain predictive uncertainty estimates for neural networks trained by MAP. The approximate posterior in function space is given by:
$$f(x) \mid \mathcal{D} \;\sim\; \mathcal{N}\!\Big( f(x; \theta_{\text{MAP}}),\; J(x)\, \Sigma\, J(x)^\top \Big), \qquad J(x) = \nabla_\theta f(x; \theta)\big|_{\theta = \theta_{\text{MAP}}},$$
where $\Sigma$ is the inverse of the Gauss–Newton matrix (an approximation to the Hessian). The Gauss–Newton matrix has size $P \times P$, where $P$ is the number of parameters, so for a large network it is expensive to invert. It has the form
$$\Sigma = \left( \sum_{n=1}^{N} J(x_n)^\top\, \Lambda_n\, J(x_n) + \alpha I_P \right)^{-1},$$
where $\Lambda_n$ is the Hessian of the negative log-likelihood at $f(x_n; \theta_{\text{MAP}})$ and $\alpha$ is the prior precision.
This covariance matrix can be approximated using the eigenfunction approximation to the NTK, but now evaluated at the MAP parameters:
$$J(x)\, J(x')^\top = \hat{k}_{\text{NTK}}(x, x') \approx \sum_{r=1}^{R} \hat{\mu}_r\, \hat{\psi}_r(x)\, \hat{\psi}_r(x') = \varphi(x)^\top \varphi(x'), \qquad \varphi_r(x) = \sqrt{\hat{\mu}_r}\, \hat{\psi}_r(x).$$
The LLA posterior can now be approximated as:
$$f(x) \mid \mathcal{D} \;\sim\; \mathcal{N}\!\left( f(x; \theta_{\text{MAP}}),\; \varphi(x)^\top \Big( \sum_{n=1}^{N} \varphi(x_n)\, \Lambda_n\, \varphi(x_n)^\top + \alpha I_R \Big)^{-1} \varphi(x) \right),$$
where the matrix to be inverted now only has dimension $R \times R$.
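A sketch of this low-dimensional computation for the regression case (Gaussian likelihood with noise variance `noise_var` and prior precision `alpha`; these, and the assumption that the features are the scaled eigenfunctions $\varphi_r(x) = \sqrt{\hat{\mu}_r}\, \hat{\psi}_r(x)$, are our illustrative choices, not the authors' implementation):

```python
import numpy as np

def lla_covariance(phi_test, phi_train, noise_var=0.1, alpha=1.0):
    """phi_test: (m, R), phi_train: (n, R) scaled eigenfunction features.
    Returns the (m, m) approximate LLA predictive covariance; only an
    R x R matrix is ever inverted."""
    R = phi_train.shape[1]
    A = phi_train.T @ phi_train / noise_var + alpha * np.eye(R)   # (R, R)
    return phi_test @ np.linalg.solve(A, phi_test.T)

rng = np.random.default_rng(0)
phi_tr, phi_te = rng.standard_normal((500, 10)), rng.standard_normal((4, 10))
print(lla_covariance(phi_te, phi_tr).shape)   # (4, 4) predictive covariance
```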
Compared to other scalable LLA methods, this method performs favourably on NLL and Expected Calibration Error (ECE).
In contrastive learning, a clean data point $\bar{x}$ is used to generate random augmentations according to some distribution $p(x \mid \bar{x})$. The goal of contrastive learning is then to learn similar (in some sense) representations for augmentations which are generated from the same clean data point. One way of phrasing this problem is as learning a kernel function which measures the similarity of data points according to the augmentation distribution $p(x \mid \bar{x})$:
$$k(x, x') = \frac{\mathbb{E}_{\bar{x} \sim p(\bar{x})}\!\left[ p(x \mid \bar{x})\, p(x' \mid \bar{x}) \right]}{p(x)\, p(x')},$$
where $p(\bar{x})$ is the clean data distribution, and $p(x)$ and $p(x')$ are the respective marginals of the augmentation process.
This kernel gives a measure of how likely it is that $x$ and $x'$ were generated from the same clean data point.
Plugging this kernel into the eigenfunction loss, the entries of $R$ become
$$R_{ij} = \langle \hat{\psi}_i, \mathcal{K} \hat{\psi}_j \rangle = \mathbb{E}_{\bar{x} \sim p(\bar{x})}\, \mathbb{E}_{x \mid \bar{x}}\, \mathbb{E}_{x^+ \mid \bar{x}} \left[ \hat{\psi}_i(x)\, \hat{\psi}_j(x^+) \right],$$
where $x$ and $x^+$ here are two independent samples generated from the same clean data point $\bar{x}$. This loss can then be optimised in exactly the same way as the standard NeuralEF objective.
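A hedged sketch of how $R$ is estimated in this contrastive setting (the constants and the symmetrisation are our illustrative choices; the augmentation pipeline producing the paired outputs is assumed and not shown):

```python
import numpy as np

def contrastive_R(Psi_a, Psi_b):
    """Psi_a, Psi_b: (B, N) network outputs on two independent augmentations of
    the same B clean data points. Returns the (N, N) matrix playing the role of
    R in the NeuralEF loss sketched earlier."""
    B = Psi_a.shape[0]
    R = Psi_a.T @ Psi_b / B
    return (R + R.T) / 2          # symmetrise the Monte Carlo estimate

# example with random stand-ins for the paired representations; this R plugs
# into the same -R_jj + sum_{i<j} R_ij^2 / R_ii loss as in the earlier sketch.
rng = np.random.default_rng(0)
Za, Zb = rng.standard_normal((256, 8)), rng.standard_normal((256, 8))
R = contrastive_R(Za, Zb)
```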
A benefit of this approach is that the feature representations (eigenfunctions) are ordered in terms of importance. This means that they are well suited to be used as adaptive-length codes. An application of this is in image retrieval, where a shorter code can lead to much lower time and storage requirements for retrieving the closest examples, while longer codes provide better accuracy when needed.