
Scalable Approaches to Self-Supervised Learning using Spectral Analysis

Reading Group Wednesday April 5th 2023

By Ross Viljoen and Vincent Dutordoir

Abstract

Learning the principal eigenfunctions of an operator is a fundamental problem in various machine learning tasks, from representation learning to Gaussian processes. However, traditional non-parametric solutions suffer from scalability issues - rendering them impractical on large datasets.

This reading group will discuss parametric approaches to approximating eigendecompositions using neural networks. In particular, Spectral Inference Networks (SpIN) offer a scalable method for approximating eigenfunctions of symmetric operators on high-dimensional function spaces using bi-level optimization methods and gradient masking (Pfau et al., 2019).

A recent improvement on SpIN, called NeuralEF, focuses on approximating eigenfunction expansions of kernels (Deng et al., 2022a). The method is applied to modern neural-network based kernels (GP-NN and NTK) as well as scaling up the linearised Laplace approximation for deep networks (Deng et al., 2022a). Finally, self-supervised learning can be expressed in terms of approximating a contrastive kernel, which allows NeuralEF to learn structured representations (Deng et al., 2022b).

References

David Pfau, Stig Petersen, Ashish Agarwal, David G. T. Barrett, and Kimberly L. Stachenfeld. "Spectral inference networks: Unifying deep and spectral learning." ICLR (2019).

Zhijie Deng, Jiaxin Shi, and Jun Zhu. "NeuralEF: Deconstructing kernels by deep neural networks." ICML (2022a).

Zhijie Deng, Jiaxin Shi, Hao Zhang, Peng Cui, Cewu Lu, and Jun Zhu. "Neural Eigenfunctions Are Structured Representation Learners." arXiv preprint arXiv:2210.12637 (2022b).


Introduction

Spectral methods are ubiquitous:

  • Mathematics: the spectral theorem (diagonalisation of compact, self-adjoint operators).
  • Physics: solving PDEs (e.g., the heat, wave, and Schrödinger equations) in the frequency domain (i.e. using spectral methods).
  • Machine learning:
    • Kernels on manifolds use the eigenbasis of the Laplace-Beltrami operator $\Delta_{\mathcal{M}}$.
    • Spectral Mixture Kernels
    • Fourier Neural Operators
[Figures: example applications in machine learning and in physics/engineering.]

Topic of today's reading group


Find a non-linear embedding of data that lies on a low-dimensional manifold.

A well-known algorithm is (kernel) Principal Component Analysis:

  1. Construct the Gram matrix for all pairs of points: $k(X, X) \in \mathbb{R}^{N \times N}$.
  2. Each eigenvector $\boldsymbol{u}$ contains a one-dimensional embedding for each point $x_n$.

This scales cubically in compute and quadratically in memory with $N$. And what do we do with new data?
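
To make the recipe above concrete, here is a minimal NumPy sketch of the nonparametric approach, assuming an RBF (squared-exponential) kernel and synthetic data purely for illustration; the $\mathcal{O}(N^3)$ eigendecomposition and the open question of new points are exactly what the parametric methods below address.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential kernel, chosen here purely for illustration.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # N = 200 points in 5 dimensions

K = rbf_kernel(X, X)                     # 1. Gram matrix: O(N^2) memory
eigvals, eigvecs = np.linalg.eigh(K)     #    eigendecomposition: O(N^3) compute
order = np.argsort(eigvals)[::-1]        # sort eigenpairs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

embedding = eigvecs[:, :3]               # 2. each eigenvector holds one embedding per point
print(embedding.shape)                   # (200, 3)
```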

Overview of reading group

This reading group will cover three papers in chronological order that build on each other.

  1. Pfau et al. (2019). "Spectral inference networks: Unifying deep and spectral learning."
    • Introduces the problem and formulates the problem of finding embeddings (i.e. eigenfunctions) as an optimization problem.
    • Focus is not necessarily on kernel operators.
  2. Deng et al. (2022). "NeuralEF: Deconstructing kernels by deep neural networks."
    • Improve upon SpIN algorithm (more computationally efficient and simpler).
    • Apply the method to popular DNN-inspired kernels.
  3. Deng et al. (2022). Neural Eigenfunctions Are Structured Representation Learners
    • A final improvement on the core algorithm, with applications in self-supervised learning.

Spectral decomposition as Optimization

Eigenvectors of a matrix are defined as

$$A\boldsymbol{u} = \lambda \boldsymbol{u}$$

for some scalar $\lambda$, called the eigenvalue.

Let $A$ be symmetric. Then the eigenvector corresponding to the largest eigenvalue is the solution of

$$\max_{\boldsymbol{u}}\ \boldsymbol{u}^\top A \boldsymbol{u} \quad \text{s.t.} \quad \boldsymbol{u}^\top \boldsymbol{u} = 1$$

Proof (sketch)

  • $A$ is symmetric and real, therefore $A = V \Lambda V^\top$, where $V$ is an orthonormal basis: $V^\top V = I$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ with $\lambda_1 > \lambda_2 > \cdots > \lambda_n$.
  • Let $\boldsymbol{u} = V\boldsymbol{\alpha} = \sum_i \alpha_i \boldsymbol{v}_i$.

Rewriting the objective:

$$\boldsymbol{u}^\top A \boldsymbol{u} = \boldsymbol{\alpha}^\top V^\top V \Lambda V^\top V \boldsymbol{\alpha} = \mathrm{Tr}(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top \Lambda) = \sum_i \alpha_i^2 \lambda_i$$

Now adding a Lagrange multiplier $\ell$ for the constraint (which can be rewritten as $\|\boldsymbol{\alpha}\|_2^2 = 1$) gives

$$\sum_i \alpha_i^2 \lambda_i + \ell\,\big(\|\boldsymbol{\alpha}\|_2^2 - 1\big),$$

which can be solved by setting the derivatives w.r.t. $\boldsymbol{\alpha}$ and $\ell$ to zero. This gives $\alpha_1 = 1$ and $\alpha_i = 0$ for $i > 1$.

The constrained optimisation objective from above is equivalent (up to a scaling factor) to

$$\max_{\boldsymbol{u}}\ \frac{\boldsymbol{u}^\top A \boldsymbol{u}}{\boldsymbol{u}^\top \boldsymbol{u}}$$

because we can always normalize the vector to lie on the unit hypersphere: $\boldsymbol{u} \mapsto \boldsymbol{u}/\|\boldsymbol{u}\| \in \mathcal{S}^{d}$. This expression is known as the Rayleigh quotient and, at its maximum, equals the eigenvalue corresponding to the top eigenvector.

To compute the top $N$ eigenvectors $U = (\boldsymbol{u}_1, \ldots, \boldsymbol{u}_N)$, we can solve a sequence of maximization problems greedily:

$$\boldsymbol{u}_i = \arg\max_{\boldsymbol{u}}\ \frac{\boldsymbol{u}^\top A \boldsymbol{u}}{\boldsymbol{u}^\top \boldsymbol{u}} \quad \text{s.t.} \quad \boldsymbol{u}_1^\top \boldsymbol{u} = 0,\ \ldots,\ \boldsymbol{u}_{i-1}^\top \boldsymbol{u} = 0.$$

If we drop the orthogonality constraints we can formulate this as a single unconstrained objective

$$\arg\max_U\ \mathrm{Tr}\big[(U^\top U)^{-1} U^\top A U\big].$$

Following a similar derivation as above, the maximum is given by $\sum_{i=1}^N \lambda_i$ (the sum of the $N$ largest eigenvalues). However, without the orthogonality constraint $\boldsymbol{u}_i^\top \boldsymbol{u}_j = 0$ for $i \neq j$, $U$ will simply span the same subspace as the top eigenvectors.
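
A quick numerical sanity check of this unconstrained objective (a NumPy sketch with a random symmetric matrix, not taken from the papers): at the optimum the objective equals the sum of the top-$N$ eigenvalues, and it is unchanged if the top eigenvectors are mixed by an arbitrary invertible matrix, which is exactly the right-multiplication invariance noted in the remark below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 10, 3

B = rng.normal(size=(n, n))
A = (B + B.T) / 2                               # random symmetric matrix
eigvals, eigvecs = np.linalg.eigh(A)
U_top = eigvecs[:, -N:]                         # top-N eigenvectors

def objective(U):
    # Tr[(U^T U)^{-1} U^T A U]
    return np.trace(np.linalg.solve(U.T @ U, U.T @ A @ U))

W = rng.normal(size=(N, N))                     # arbitrary (almost surely invertible) mixing
print(objective(U_top), objective(U_top @ W))   # identical values ...
print(eigvals[-N:].sum())                       # ... equal to the sum of the top-N eigenvalues
```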

Remark. The objective is invariant to right multiplication (i.e. taking linear combinations of the basis).

Proof. Set $\tilde{U} = UW$ for an invertible $W$ and show that the objective has the same optimum for $\tilde{U}$ as for $U$.

The above derivation can be used to optimize for the eigenvectors in (kernel) PCA, where $A = XX^\top$ or $A = k(X, X)$ for a feature matrix $X \in \mathbb{R}^{N \times D}$.

An embedding for a new datapoint $x$ can be found by interpolating the eigenvectors:

$$\boldsymbol{u}(x) = \sum_{n=1}^N k(x_n, x)\, \boldsymbol{u}_n,$$

which scales as $\mathcal{O}(N)$.

Spectral Inference Networks (SpIN) overcomes these limitations by using a parametric form for the embeddings.

From eigenvectors to eigenfunctions

Suppose that instead of a matrix $A$ we have a symmetric (not necessarily positive definite) kernel $k(x, x')$, where $x$ and $x'$ live in some measurable space $\Omega$, which can be either continuous or discrete.

Define the inner product on $\Omega$ w.r.t. the input density $p(x)$ as

$$\langle f, g \rangle = \int_\Omega f(x)\, g(x)\, p(x)\, \mathrm{d}x = \mathbb{E}_{x \sim p}[f(x) g(x)]$$

and the kernel operator $\mathcal{K}$ as

$$(\mathcal{K}f)(x) = \langle k(x, \cdot), f \rangle = \int_\Omega k(x, x')\, f(x')\, p(x')\, \mathrm{d}x' = \mathbb{E}_{x' \sim p}[k(x, x') f(x')].$$

We are interested in finding the top $N$ eigenfunctions $u: \Omega \to \mathbb{R}$ of the operator $\mathcal{K}$, which satisfy

$$(\mathcal{K}u)(x) = \int_\Omega k(x, x')\, u(x')\, p(x')\, \mathrm{d}x' = \lambda u(x) \quad\text{and}\quad \langle u_i, u_j \rangle = \delta_{ij}.$$

In vector space

In vector space we could cast the problem of spectral decomposition as an optimization problem using the following objective:

$$\max_U\ \mathrm{Tr}\big[(U^\top U)^{-1} U^\top A U\big] \quad\Longleftrightarrow\quad \max_{U:\, U^\top U = I}\ \mathrm{Tr}\big[U^\top A U\big]$$

In function space

The analogue in function space is as follows.

Let $\boldsymbol{u}: \mathcal{X} \to \mathbb{R}^N$ represent the $N$ eigenfunctions (or a linear combination thereof, which means that they share the same span). Then

$$\max_{\boldsymbol{u}}\ \mathrm{Tr}\Big[\mathbb{E}_x\big[\boldsymbol{u}(x)\boldsymbol{u}(x)^\top\big]^{-1}\, \mathbb{E}_{x,x'}\big[k(x,x')\,\boldsymbol{u}(x)\boldsymbol{u}(x')^\top\big]\Big] \quad\Longleftrightarrow\quad \max_{\boldsymbol{u}}\ \mathrm{Tr}\Big[\mathbb{E}_{x,x'}\big[k(x,x')\,\boldsymbol{u}(x)\boldsymbol{u}(x')^\top\big]\Big] \ \text{ s.t. } \ \mathbb{E}_x\big[\boldsymbol{u}(x)\boldsymbol{u}(x)^\top\big] = I$$

To simplify notation, define

$$\Sigma = \mathbb{E}_x\big[\boldsymbol{u}(x)\boldsymbol{u}(x)^\top\big] \quad\text{and}\quad \Pi = \mathbb{E}_{x,x'}\big[k(x,x')\,\boldsymbol{u}(x)\boldsymbol{u}(x')^\top\big],$$

which gives, parameterizing the eigenfunctions with a neural network with parameters $\theta$, the following objective:

$$\max_{\boldsymbol{u}_\theta}\ \mathcal{L}(\theta) = \mathrm{Tr}\big[\Sigma^{-1}\Pi\big]$$
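
As a concrete illustration (not the SpIN implementation), here is a NumPy sketch of what the Monte Carlo estimates of $\Sigma$ and $\Pi$ look like, with a fixed random feature map standing in for the network $\boldsymbol{u}_\theta$ and an assumed RBF kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_feat, B = 2, 4, 512

W = rng.normal(size=(D, N_feat))              # fixed stand-in "network": u(x) = tanh(x W)
def u(x):
    return np.tanh(x @ W)

def k(x, xp, lengthscale=1.0):                # assumed RBF kernel
    d2 = ((x[:, None, :] - xp[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

x = rng.normal(size=(B, D))                   # x  ~ p(x)
xp = rng.normal(size=(B, D))                  # x' ~ p(x)

U, Up = u(x), u(xp)                           # (B, N_feat) function outputs
Sigma = U.T @ U / B                           # estimate of E_x[u(x) u(x)^T]
Pi = U.T @ k(x, xp) @ Up / B**2               # estimate of E_{x,x'}[k(x,x') u(x) u(x')^T]

print(np.trace(np.linalg.solve(Sigma, Pi)))   # minibatch value of Tr[Sigma^{-1} Pi]
```

Note that $\mathrm{Tr}[\Sigma^{-1}\Pi]$ is a nonlinear function of these two estimates, which is where the bias discussed below comes from.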

Problems

  • The objective is invariant to linear transformations of the features $\boldsymbol{u}(x)$, so optimizing it will only give functions that span the top eigenfunctions of $\mathcal{K}$. There is also no ordering (i.e. the first eigenfunction does not necessarily correspond to the largest eigenvalue).
  • Minibatch estimates of $\mathcal{L}$ do not lead to unbiased estimates of the gradients, so we cannot use traditional SGD.

Ordering through gradient masking

The objective is invariant to linear transformations of the output, so the solution only spans the top eigenfunctions.

We can use the invariance of the trace to cyclic permutations to rewrite the objective. Let $\Sigma = LL^\top$ (Cholesky decomposition); then the objective can be rewritten as

$$\mathrm{Tr}\big(\Sigma^{-1}\Pi\big) = \mathrm{Tr}\big(L^{-1}\Pi L^{-\top}\big) = \mathrm{Tr}\,\Lambda = \sum_i \lambda_i,$$

where we used the shorthand $\Lambda \equiv L^{-1}\Pi L^{-\top}$. This matrix has the convenient property that the upper-left $n \times n$ block only depends on the first $n$ eigenfunctions.

By zeroing out the gradients flowing from higher to lower eigenvalues we can impose an ordering. This is summarized in a claim in the appendix of the SpIN paper, which is proved by induction.


[Figures: the forward and backward passes with gradient masking.]

The 'masked' gradient can be written down in closed form (see Pfau et al., 2019), where $\Sigma = LL^\top$ as before.

Bi-level optimization

In SGD we use stochastic gradient estimates $\tilde{\boldsymbol{g}}$, which are typically unbiased:

$$\mathbb{E}[\tilde{\boldsymbol{g}}] = \nabla_\theta \mathcal{L}.$$

The objective here is a nonlinear function of multiple expectations, so naively replacing $\Sigma$, $\Pi$ and their gradients with empirical estimates gives biased gradient estimates.

The fix is to reframe the objective as a bi-level optimization, in which $\Sigma$ and $\Pi$ are estimated alongside $\theta$.

Bilevel stochastic optimization is the problem of simultaneously solving two coupled minimization problems.

Traditional optimization:

$$\max_x\ f(x) \quad \text{s.t.} \quad g(x) \leq 0.$$

Formally, a bilevel optimization problem reads:

$$\min_{x \in X,\, y}\ F(x, y) \quad\text{s.t.}\quad G(x, y) \leq 0 \ \text{ and } \ y \in S(x),$$

where $S(x)$ is the set of optimal solutions of the $x$-parameterized problem

$$S(x) = \arg\min_{y \in Y} f(x, y).$$

In bi-level optimization the objective depends on the 'decision' of another actor, whose decision in turn depends on our action.

Example: Toll Setting
Consider a network of highways operated by the government. The government wants to maximize its revenue by choosing the optimal toll setting for the highways. However, it can only do so by taking the highway users' problem into account: for any given toll structure, the users solve their own optimization problem, minimizing their travel costs by deciding between the highways and an alternative route.

Bilevel stochastic problems are common in machine learning and include actor-critic methods, generative adversarial networks, and imitation learning. SpIN relies on a classic result from 1997 (Vivek S. Borkar, "Stochastic approximation with two time scales"). It boils down to having a second objective which is solved using a different (slower) learning rate:

$$\min_{\bar{\Sigma},\, \overline{\nabla_\theta \Sigma}}\ \big\|\Sigma - \bar{\Sigma}_t\big\|^2 + \big\|\nabla_\theta \Sigma - \overline{\nabla_\theta \Sigma}_t\big\|^2,$$

where $\bar{\Sigma}_t$ and $\overline{\nabla_\theta \Sigma}_t$ are moving averages.
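
To illustrate the moving-average part of this, here is a toy NumPy sketch, reusing the random-feature stand-in for $\boldsymbol{u}_\theta$ from above and tracking only $\Sigma$; SpIN additionally tracks the Jacobian $\nabla_\theta\Sigma$ in the same way, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_feat, B, steps = 2, 4, 64, 500
beta = 0.05                                    # slow timescale for the moving average

W = rng.normal(size=(D, N_feat))
u = lambda x: np.tanh(x @ W)                   # fixed stand-in for u_theta

X_full = rng.normal(size=(10_000, D))
Sigma_full = u(X_full).T @ u(X_full) / len(X_full)

Sigma_bar = np.zeros((N_feat, N_feat))
for t in range(steps):
    xb = X_full[rng.choice(len(X_full), size=B, replace=False)]
    Sigma_hat = u(xb).T @ u(xb) / B            # noisy minibatch estimate of Sigma
    Sigma_bar = (1 - beta) * Sigma_bar + beta * Sigma_hat

print(np.abs(Sigma_bar - Sigma_full).max())    # the slow EMA tracks the full-data Sigma
```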

SpIN Algorithm

  1. Minimizes the objective end-to-end by stochastic optimization:
    $$\mathrm{Tr}\Big[\mathbb{E}_x\big[\boldsymbol{u}(x)\boldsymbol{u}(x)^\top\big]^{-1}\, \mathbb{E}_{x,x'}\big[k(x,x')\,\boldsymbol{u}(x)\boldsymbol{u}(x')^\top\big]\Big]$$
  2. Performs the optimization over a parametric function class such as deep neural networks.
  3. Uses the modified (masked) gradient to impose an ordering on the learned features.
  4. Uses bilevel optimization to overcome the bias introduced by finite batch sizes.

Problems:

  • Need to Cholesky-decompose $\Sigma$ in every iteration.
  • Need to track the Jacobian $\nabla_\theta \Sigma$.

Kernel methods

Kernel methods project data into a high-dimensional feature space $\mathcal{H}$ to enable linear manipulation of nonlinear data. In other words, let $\phi: \mathcal{X} \to \mathcal{H}$ be the feature map; then $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ depends linearly on the feature representations of the inputs:

$$k(x, x') = \langle \phi(x), \phi(x') \rangle.$$

Often we don't know $\mathcal{H}$ (it is typically infinite-dimensional). The subtlety is that we can leverage the "kernel trick" to bypass the need to specify $\mathcal{H}$ explicitly.

There is a rich family of kernels.

Classic kernels: Squared Exponential, Matérn family, linear, polynomial, etc. They may easily fail when processing real-world data like images and text due to inadequate expressive power and the curse of dimensionality.

Modern kernels have been developed based on the inductive biases of NN architectures, such as the NTK and NN-GP (discussed briefly by Ross, and the topic of next week's reading group).

Kernel approximations

An idea to scale up kernel methods is to approximate the kernel with the inner product of some explicit vector representation of the data,

$$k(x, x') \approx \boldsymbol{\nu}(x)^\top \boldsymbol{\nu}(x'),$$

where $\boldsymbol{\nu}: \mathcal{X} \to \mathbb{R}^R$. There are two broad approaches.

Bochner's theorem

Bochner's theorem: a stationary kernel is positive (semi-)definite if and only if it is the Fourier transform of a positive measure $S(\omega)$ (the power spectrum):

$$k(x - y) = \int S(\omega)\, e^{i \omega^\top (x - y)}\, \mathrm{d}\omega \approx \frac{1}{R}\sum_{r=1}^R e^{i \omega_r^\top (x - y)}, \qquad \omega_r \sim S(\omega),$$

which is the inner product of the feature vectors

$$\Big[\tfrac{1}{\sqrt{R}} e^{i \omega_1^\top x},\ \tfrac{1}{\sqrt{R}} e^{i \omega_2^\top x},\ \ldots,\ \tfrac{1}{\sqrt{R}} e^{i \omega_R^\top x}\Big] \quad\text{and}\quad \Big[\tfrac{1}{\sqrt{R}} e^{-i \omega_1^\top y},\ \tfrac{1}{\sqrt{R}} e^{-i \omega_2^\top y},\ \ldots,\ \tfrac{1}{\sqrt{R}} e^{-i \omega_R^\top y}\Big].$$

  • Can only be used for stationary kernels.
  • Scales exponentially badly with dimension.
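
As a concrete example of the random Fourier feature idea, here is a NumPy sketch for the squared-exponential kernel with unit lengthscale, whose spectral density is a standard Gaussian; the real-valued cosine features are the usual trick for avoiding complex arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 3, 2000

def rbf(x, y, lengthscale=1.0):
    return np.exp(-0.5 * np.sum((x - y) ** 2) / lengthscale**2)

omega = rng.normal(size=(R, D))                # frequencies ~ spectral density of the RBF kernel
phase = rng.uniform(0.0, 2.0 * np.pi, size=R)  # random phases for the real-valued feature map

def nu(x):
    # Random Fourier feature map: k(x, y) ≈ nu(x) @ nu(y)
    return np.sqrt(2.0 / R) * np.cos(omega @ x + phase)

x, y = rng.normal(size=D), rng.normal(size=D)
print(rbf(x, y), nu(x) @ nu(y))                # the two values should be close
```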

Mercer decomposition

A kernel $k$ has an associated kernel operator $\mathcal{K}$, and we are interested in finding its eigenfunctions.

Eigenfunctions $u: \Omega \to \mathbb{R}$ of the operator $\mathcal{K}$ obey

$$(\mathcal{K}u)(x) = \int_\Omega k(x, x')\, u(x')\, p(x')\, \mathrm{d}x' = \lambda u(x) \quad\text{and}\quad \langle u_i, u_j \rangle = \delta_{ij}.$$

Using the eigenfunctions corresponding to non-zero eigenvalues, $k$ has the (Mercer) representation

$$k(x, x') = \sum_i \lambda_i\, u_i(x)\, u_i(x').$$

Nyström approximation

We can use the data $X = \{x_n\}_{n=1}^N$ to obtain a Monte Carlo estimate of the top $R$ eigenfunctions by taking the eigenvectors of $k(X, X)$:

$$\big\{\hat\lambda_i,\ \big(\hat u_i(x_1), \ldots, \hat u_i(x_N)\big)\big\}_{i=1}^R.$$

The Nyström method approximates the eigenfunctions at out-of-sample points $x'$ as

$$\hat u_i(x') = \frac{1}{N \hat\lambda_i} \sum_n k(x', x_n)\, \hat u_i(x_n),$$

which in turn approximates the kernel as

$$k(x, x') \approx \sum_{i=1}^R \hat\lambda_i\, \hat u_i(x)\, \hat u_i(x').$$

Unlike RFFs, the Nyström method can be applied to approximate any positive-definite kernel. Yet it is non-trivial to scale up:

  • the matrix eigendecomposition is costly for even medium-sized training data;
  • evaluating an eigenfunction at a new location entails evaluating $k$ $N$ times, which is unaffordable when coping with modern kernels specified by deep architectures.
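
A minimal NumPy sketch of the Nyström recipe above (again assuming an RBF kernel and synthetic data; the scaling of the eigenvectors follows the convention that $\langle \hat u_i, \hat u_j\rangle \approx \delta_{ij}$ under the empirical measure):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 300, 2, 10

def rbf(X, Y, lengthscale=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

X = rng.normal(size=(N, D))
K = rbf(X, X)
mu, V = np.linalg.eigh(K)                      # eigendecomposition of the Gram matrix
mu, V = mu[::-1][:R], V[:, ::-1][:, :R]        # keep the top-R eigenpairs

lam_hat = mu / N                               # estimates of the operator eigenvalues
U_hat = np.sqrt(N) * V                         # \hat u_i(x_n), scaled so <u_i, u_j> ≈ δ_ij

def u_hat(X_new):
    # Nyström extension to out-of-sample points: O(N) kernel evaluations per point.
    return rbf(X_new, X) @ U_hat / (N * lam_hat)

X_test = rng.normal(size=(5, D))
K_approx = (u_hat(X_test) * lam_hat) @ u_hat(X_test).T
print(np.abs(K_approx - rbf(X_test, X_test)).max())   # small if R captures most of the spectrum
```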

NeuralEF (Deng et al., 2022a)

Back to the SpIN objective:

$$\max_{\boldsymbol{u}}\ \mathrm{Tr}\Big[\mathbb{E}_x\big[\boldsymbol{u}(x)\boldsymbol{u}(x)^\top\big]^{-1}\, \mathbb{E}_{x,x'}\big[k(x,x')\,\boldsymbol{u}(x)\boldsymbol{u}(x')^\top\big]\Big]$$

Can we instead enforce an ordering directly in the optimisation objective, such that the gradient masking becomes unnecessary?

Characterise the spectral decomposition as an asymmetric maximisation problem:

$$\max_{\hat u_j}\ \Pi_{jj} \quad\text{s.t.}\quad \Sigma_{jj} = 1, \quad \Pi_{ij} = 0,\ i < j,$$

where

$$\Sigma_{jj} = \mathbb{E}_x\big[\hat u_j(x)\, \hat u_j(x)\big] \quad\text{and}\quad \Pi_{ij} = \mathbb{E}_{x,x'}\big[k(x,x')\, \hat u_i(x)\, \hat u_j(x')\big].$$

The claim is that the pairs $(\Pi_{jj}, \hat u_j)$ will converge to the true eigenpairs $(\lambda_j, u_j)$, ordered such that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k$.

Proof

For $j = 1$ there is no orthogonality constraint, so the problem is the eigenfunction version of finding the largest eigenvector by maximisation:

$$\begin{aligned}
\Pi_{11} &= \mathbb{E}_{x,x'}\big[k(x,x')\, \hat u_1(x)\, \hat u_1(x')\big] \\
&= \mathbb{E}_{x,x'}\Big[\Big(\textstyle\sum_{j\geq 1}\lambda_j u_j(x) u_j(x')\Big)\, \hat u_1(x)\, \hat u_1(x')\Big] && \text{(Mercer's thm)}\\
&= \sum_{j\geq 1}\lambda_j\, \mathbb{E}_{x,x'}\big[u_j(x) u_j(x')\, \hat u_1(x)\, \hat u_1(x')\big]\\
&= \sum_{j\geq 1}\lambda_j\, \mathbb{E}_{x}\big[u_j(x)\hat u_1(x)\big]\, \mathbb{E}_{x'}\big[u_j(x')\hat u_1(x')\big]\\
&= \sum_{j\geq 1}\lambda_j\, \langle u_j, \hat u_1\rangle^2
\end{aligned}$$

but, since $\{u_j\}$ form an orthonormal basis, we can write $\hat u_1 = \sum_{l\geq 1} w_l u_l$, so

$$\Pi_{11} = \sum_{j\geq 1}\lambda_j\, \Big\langle u_j, \sum_{l\geq 1} w_l u_l\Big\rangle^2 = \sum_{j\geq 1}\lambda_j\, w_j^2.$$

Recalling the constraint

$$\Sigma_{11} = \mathbb{E}_x\big[\hat u_1(x)\, \hat u_1(x)\big] = \langle\hat u_1, \hat u_1\rangle = \sum_{j\geq 1} w_j^2 = 1,$$

the objective must be maximised by taking $w_1 = 1$ and $w_j = 0$ for $j > 1$. Therefore, the optimum is found when $\hat u_1 = u_1$.

For $j = 2$, we now have the additional orthogonality constraint:

$$\Pi_{12} = \mathbb{E}_{x,x'}\big[k(x,x')\, \hat u_1(x)\, \hat u_2(x')\big] = \mathbb{E}_{x,x'}\big[k(x,x')\, u_1(x)\, \hat u_2(x')\big] = \sum_{j\geq 1}\lambda_j\, \langle u_j, u_1\rangle\, \langle u_j, \hat u_2\rangle = \lambda_1 \langle u_1, \hat u_2\rangle = 0.$$

So, for $j = 2$ we end up solving the same maximisation problem as for $j = 1$, but restricted to the orthogonal subspace of $u_1$. Repeating the same logic for $j > 2$ completes the proof.

In practice, the orthogonality constraints are turned into penalties:

$$\max_{\hat u_j}\ \Pi_{jj} - \sum_{i<j} \frac{\Pi_{ij}^2}{\Pi_{ii}} \quad\text{s.t.}\quad \Sigma_{jj} = 1,$$

where the scaling factor of $\Pi_{ii}$ is introduced in the denominator so that the two terms are on the same scale.

This objective is derived from a recent reinterpretation of PCA as a multi-player game in Gemp et al. (2020), "EigenGame: PCA as a Nash equilibrium", and can be seen as a function-space version of the same approach.

Minibatching

At each optimisation step, $\Pi_{ij}$ is estimated by Monte Carlo integration with a minibatch of samples $X = \{x_b\}_{b=1}^B$:

$$\tilde\Pi_{ij} = \frac{1}{B^2}\sum_{b=1}^B\sum_{b'=1}^B k(x_b, x_{b'})\, \hat u_i(x_b)\, \hat u_j(x_{b'}) = \frac{1}{B^2}\, \boldsymbol{u}_i^{X\top}\, K_{X,X}\, \boldsymbol{u}_j^{X},$$

where $\boldsymbol{u}_j^X = [\hat u_j(x_1), \ldots, \hat u_j(x_B)]^\top \in \mathbb{R}^B$ is the concatenated output of $\hat u_j$ on the minibatch.

Tracking $\tilde\Pi_{jj}$ with an exponential moving average (EMA) gives an estimate of $\lambda_j$.

Note: Minibatching will not give an unbiased estimate of the gradients of the objective. There is no discussion of this in the paper, but it appears that - for large enough minibatch sizes - this is not a problem in practice.

Batchnorm

Of course, there is still another constraint to satisfy: normalisation, $\Sigma_{jj} = 1$. In the minibatch setting this is estimated as $\hat\Sigma_{jj} = \frac{1}{B}\, \boldsymbol{u}_j^{X\top}\boldsymbol{u}_j^X$, so we want $\|\boldsymbol{u}_j^X\|_2^2 \approx B$.

This is enforced by adding an L2 batch normalisation layer at the end of each neural net:

$$\boldsymbol{u}_j^X = \frac{\boldsymbol{h}_j^X}{\|\boldsymbol{h}_j^X\|_2}\, \sqrt{B},$$

where $\boldsymbol{h}_j^X$ is the output of the penultimate layer for the minibatch $X$.
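
In code, this normalisation layer is a one-liner (a PyTorch sketch, where `h` is the hypothetical penultimate-layer output for one eigenfunction on the minibatch):

```python
import torch

def l2_batchnorm(h):
    # h: (B,) penultimate-layer outputs of one eigenfunction network on the minibatch.
    # Rescale so that ||u||_2^2 = B, i.e. the minibatch estimate of Sigma_jj equals 1.
    B = h.shape[0]
    return h / h.norm() * B**0.5
```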

The Training Loss and Gradients

After all the setup above, the training loss on a minibatch becomes:

$$\min_{\theta_1,\ldots,\theta_k}\ -\frac{1}{B^2}\sum_{j=1}^k\left(\boldsymbol{u}_j^{X\top} K_{X,X}\, \boldsymbol{u}_j^X \;-\; \sum_{i=1}^{j-1}\frac{\big(\mathrm{sg}(\boldsymbol{u}_i^X)^\top K_{X,X}\, \boldsymbol{u}_j^X\big)^2}{\mathrm{sg}\big(\boldsymbol{u}_i^{X\top} K_{X,X}\, \boldsymbol{u}_i^X\big)}\right),$$

where $\mathrm{sg}$ denotes the "stop gradient" operation. The gradient with respect to each network's parameters becomes:

$$\nabla_{\theta_j} = -\frac{2}{B^2}\left[K_{X,X}\left(\boldsymbol{u}_j^X - \sum_{i=1}^{j-1}\frac{\boldsymbol{u}_i^{X\top} K_{X,X}\, \boldsymbol{u}_j^X}{\boldsymbol{u}_i^{X\top} K_{X,X}\, \boldsymbol{u}_i^X}\, \boldsymbol{u}_i^X\right)\right]^\top \nabla_{\theta_j}\boldsymbol{u}_j^X.$$

Notably, the term in brackets is exactly the (generalised) Gram-Schmidt orthogonalisation of $\boldsymbol{u}_j^X$ with respect to all $\boldsymbol{u}_i^X$ with $i < j$.
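
Putting the pieces together, here is a sketch of this minibatch loss in PyTorch (not the authors' code): `u` is a hypothetical $(B, k)$ tensor whose columns are the L2-batch-normalised outputs $\boldsymbol{u}_j^X$, `K` is the $(B, B)$ Gram matrix on the minibatch, and `.detach()` plays the role of the stop-gradient $\mathrm{sg}$.

```python
import torch

def neuralef_loss(u, K):
    # u: (B, k) batch-normalised eigenfunction outputs, K: (B, B) kernel Gram matrix.
    B, k = u.shape
    loss = u.new_zeros(())
    for j in range(k):
        uj = u[:, j]
        loss = loss - uj @ K @ uj                 # reward term  ~ B^2 * Pi_jj
        for i in range(j):
            ui = u[:, i].detach()                 # stop gradient through \hat u_i
            num = (ui @ K @ uj) ** 2              # (sg(u_i)^T K u_j)^2
            den = (ui @ K @ ui).detach()          # sg(u_i^T K u_i)
            loss = loss + num / den               # orthogonality penalty
    return loss / B**2
```

Autodiff through this loss then reproduces the masked Gram-Schmidt gradient written above.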

Eigendecomposition of Modern NN Kernels

Let $g(\cdot, \theta): \mathcal{X} \to \mathbb{R}^{N_{out}}$ denote a function represented by an NN with weights $\theta \sim p(\theta)$. Then the Neural Network Gaussian Process (NNGP) kernel is defined as

$$k_{\mathrm{NNGP}}(x, x') = \mathbb{E}_{\theta \sim p(\theta)}\big[g(x, \theta)\, g(x', \theta)^\top\big],$$

which can be computed analytically when the number of neurons in each layer goes to infinity. There are many classic papers that do this: Neal (1998), Cho and Saul (2009).

The Neural Tangent Kernel (NTK) is defined as

$$k_{\mathrm{NTK}}(x, x') = \mathbb{E}_{\theta \sim p(\theta)}\big[\partial_\theta g(x, \theta)\, \partial_\theta g(x', \theta)^\top\big].$$

This is the topic of next week's reading group.

The NNGP and NTK kernels can be very expensive, or even analytically intractable, to compute. A common approach is to estimate them by Monte Carlo sampling with finite-width networks. This is straightforward for the NNGP:

$$k_{\mathrm{NNGP}}(x, x') = \mathbb{E}_{\theta \sim p(\theta)}\big[g(x, \theta)\, g(x', \theta)^\top\big] \approx \frac{1}{S}\sum_{s=1}^S g(x, \theta_s)\, g(x', \theta_s)^\top.$$

The NTK is not so simple to estimate, as it involves the outer product of the Jacobian of the network. For large models, this is very expensive to compute and store exactly. Here we focus on estimating the "empirical NTK" of a finite network, given by

$$k_{\mathrm{NTK}}(x, x') = \partial_\theta g(x, \theta)\, \partial_\theta g(x', \theta)^\top.$$

Observe that, for any $p(\boldsymbol{v})$ such that $\mathbb{E}_{\boldsymbol{v} \sim p(\boldsymbol{v})}\big[\boldsymbol{v}\boldsymbol{v}^\top\big] = I_{\dim(\theta)}$:

$$k_{\mathrm{NTK}}(x, x') = \mathbb{E}_{\boldsymbol{v} \sim p(\boldsymbol{v})}\Big[\big(\partial_\theta g(x, \theta)\, \boldsymbol{v}\big)\big(\partial_\theta g(x', \theta)\, \boldsymbol{v}\big)^\top\Big].$$

Taking a Taylor expansion, $\partial_\theta g(x, \theta)\, \boldsymbol{v} \approx \big(g(x, \theta + \epsilon\boldsymbol{v}) - g(x, \theta)\big)/\epsilon$, and so

$$k_{\mathrm{NTK}}(x, x') \approx \frac{1}{S}\sum_{s=1}^S\left[\left(\frac{g(x, \theta + \epsilon\boldsymbol{v}_s) - g(x, \theta)}{\epsilon}\right)\left(\frac{g(x', \theta + \epsilon\boldsymbol{v}_s) - g(x', \theta)}{\epsilon}\right)^\top\right].$$

Both of these approximations require $S$ forward passes of the network with different parameter settings. To obtain good approximations in practice, $S$ must be large (the authors use $S = 4000$). The hope is to be able to approximate this kernel with $k \ll S$ eigenfunctions to significantly speed up evaluation at test time.
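
A self-contained NumPy sketch of the finite-difference estimator for the empirical NTK, using a tiny toy MLP in place of a real model (all names and sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 3, 16                                     # toy input and hidden sizes

W1 = rng.normal(size=(D, H)); b1 = np.zeros(H)
W2 = rng.normal(size=(H, 1)) / np.sqrt(H)
theta = np.concatenate([W1.ravel(), b1, W2.ravel()])

def g(x, theta):
    # One-hidden-layer MLP with a single output, standing in for a real network.
    W1 = theta[:D * H].reshape(D, H)
    b1 = theta[D * H:D * H + H]
    W2 = theta[D * H + H:].reshape(H, 1)
    return np.tanh(x @ W1 + b1) @ W2

def empirical_ntk(x1, x2, theta, S=2000, eps=1e-4):
    # Monte Carlo finite-difference estimate: E_v[(J(x1) v)(J(x2) v)] with E[v v^T] = I.
    acc = 0.0
    for _ in range(S):
        v = rng.normal(size=theta.shape)
        d1 = (g(x1, theta + eps * v) - g(x1, theta)) / eps
        d2 = (g(x2, theta + eps * v) - g(x2, theta)) / eps
        acc += d1 * d2
    return acc / S

x1, x2 = rng.normal(size=D), rng.normal(size=D)
print(empirical_ntk(x1, x2, theta))              # ≈ grad_theta g(x1)^T grad_theta g(x2)
```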

Scaling up the Linearised Laplace Approximation (LLA)

The LLA is a method to obtain predictive uncertainty estimates for neural networks trained by MAP. The approximate posterior in function space is given by:

$$\mathcal{GP}\Big(f \,\Big|\, g(x, \boldsymbol{\theta}_{\mathrm{MAP}}),\ \partial_\theta g(x, \boldsymbol{\theta}_{\mathrm{MAP}})\, \Sigma\, \partial_\theta g(x', \boldsymbol{\theta}_{\mathrm{MAP}})^\top\Big),$$

where $\Sigma$ is the inverse of the Gauss-Newton matrix (an approximation to the Hessian). $\Sigma$ has size $\dim(\boldsymbol{\theta}) \times \dim(\boldsymbol{\theta})$, so for a large network it is expensive to invert. It has the form

$$\Sigma^{-1} = \sum_i \partial_\theta g(x_i, \boldsymbol{\theta}_{\mathrm{MAP}})^\top\, \Lambda_i\, \partial_\theta g(x_i, \boldsymbol{\theta}_{\mathrm{MAP}}) + \frac{1}{\sigma_0^2} I_{\dim(\boldsymbol{\theta})}, \quad\text{where}\quad \Lambda_i = -\nabla^2_{\boldsymbol{f}\boldsymbol{f}} \log p(y_i \,|\, \boldsymbol{f})\Big|_{\boldsymbol{f} = g(x_i, \boldsymbol{\theta}_{\mathrm{MAP}})} \in \mathbb{R}^{N_{out} \times N_{out}}.$$

This covariance can be approximated using the eigenfunction approximation to the NTK, now evaluated at the MAP parameters:

$$k_{\mathrm{NTK}}(x, x') = \partial_\theta g(x, \boldsymbol{\theta}_{\mathrm{MAP}})\, \partial_\theta g(x', \boldsymbol{\theta}_{\mathrm{MAP}})^\top \approx \boldsymbol{\psi}(x)\, \boldsymbol{\psi}(x')^\top, \quad\text{where}\quad \boldsymbol{\psi}(x) = \Big[\sqrt{\hat\lambda_1}\, \hat u_1(x), \ldots, \sqrt{\hat\lambda_k}\, \hat u_k(x)\Big].$$

The LLA posterior can now be approximated as:

$$\mathcal{GP}\left(f \,\middle|\, g(x, \boldsymbol{\theta}_{\mathrm{MAP}}),\ \boldsymbol{\psi}(x)\Big[\sum_i \boldsymbol{\psi}(x_i)^\top \Lambda_i\, \boldsymbol{\psi}(x_i) + \frac{1}{\sigma_0^2} I_k\Big]^{-1} \boldsymbol{\psi}(x')^\top\right),$$

where the matrix to be inverted now only has size $k \times k$.
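
A small NumPy sketch of the resulting $k \times k$ computation for a scalar-output network (the arrays `psi_train`, `psi_test`, and `lambdas` are hypothetical inputs: eigenfeatures $\boldsymbol{\psi}$ evaluated at training and test points, and the per-datum Hessians $\Lambda_i$, which are scalars here):

```python
import numpy as np

def lla_predictive_cov(psi_test, psi_train, lambdas, sigma0=1.0):
    # psi_test:  (M, k) eigenfeatures psi(x*) at test points
    # psi_train: (N, k) eigenfeatures psi(x_i) at training points
    # lambdas:   (N,)   per-datum Hessians Lambda_i (scalar-output case)
    k = psi_train.shape[1]
    A = (psi_train * lambdas[:, None]).T @ psi_train + np.eye(k) / sigma0**2   # k x k
    return psi_test @ np.linalg.solve(A, psi_test.T)                           # M x M covariance

# Toy usage with random stand-in features:
rng = np.random.default_rng(0)
psi_train = rng.normal(size=(1000, 10))
psi_test = rng.normal(size=(5, 10))
lambdas = np.ones(1000)                          # e.g. Gaussian likelihood with unit noise
print(lla_predictive_cov(psi_test, psi_train, lambdas).shape)   # (5, 5)
```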

Compared to other scalable LLA methods, this approach performs favourably in terms of NLL and Expected Calibration Error (ECE).

Structured Representation Learning via Contrastive Learning (Deng et al., 2022b)

In contrastive learning, a clean data point $\bar{\boldsymbol{x}}$ is used to generate random augmentations according to some distribution $p(\boldsymbol{x} \,|\, \bar{\boldsymbol{x}})$. The goal of contrastive learning is then to learn similar (in some sense) representations for augmentations which are generated from the same clean data point. One way of phrasing this problem is as learning a kernel which measures the similarity of data points according to the augmentation distribution $p(\boldsymbol{x} \,|\, \bar{\boldsymbol{x}})$:

$$k(\boldsymbol{x}, \boldsymbol{x}') := \frac{p(\boldsymbol{x}, \boldsymbol{x}')}{p(\boldsymbol{x})\, p(\boldsymbol{x}')} \quad\text{where}\quad p(\boldsymbol{x}, \boldsymbol{x}') := \mathbb{E}_{p_d(\bar{\boldsymbol{x}})}\big[p(\boldsymbol{x} \,|\, \bar{\boldsymbol{x}})\, p(\boldsymbol{x}' \,|\, \bar{\boldsymbol{x}})\big],$$

and $p_d(\bar{\boldsymbol{x}})$ is the clean data distribution; $p(\boldsymbol{x})$ and $p(\boldsymbol{x}')$ are the respective marginals of $p(\boldsymbol{x}, \boldsymbol{x}')$.

This kernel gives a measure of how likely it is that $\boldsymbol{x}$ and $\boldsymbol{x}'$ were generated from the same clean data point.

Plugging this kernel into the eigenfunction loss:

$$\Pi_{ij} = \mathbb{E}_{p(\boldsymbol{x})}\mathbb{E}_{p(\boldsymbol{x}')}\big[k(\boldsymbol{x}, \boldsymbol{x}')\, \hat u_i(\boldsymbol{x})\, \hat u_j(\boldsymbol{x}')\big] = \mathbb{E}_{p(\boldsymbol{x})}\mathbb{E}_{p(\boldsymbol{x}')}\left[\frac{p(\boldsymbol{x}, \boldsymbol{x}')}{p(\boldsymbol{x})\, p(\boldsymbol{x}')}\, \hat u_i(\boldsymbol{x})\, \hat u_j(\boldsymbol{x}')\right] = \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{x}')}\big[\hat u_i(\boldsymbol{x})\, \hat u_j(\boldsymbol{x}')\big] \approx \frac{1}{B}\sum_{b=1}^B \hat u_i(\boldsymbol{x}_b)\, \hat u_j(\boldsymbol{x}_b'),$$

where $\boldsymbol{x}_b$ and $\boldsymbol{x}_b'$ are two independent augmentations generated from the same clean data point $\bar{\boldsymbol{x}}_b$. This loss can then be optimised in exactly the same way as the standard NeuralEF.
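
In code, the only change relative to the earlier NeuralEF sketch is how $\Pi$ is estimated: instead of a Gram matrix, each entry is a simple average over paired augmentations (a PyTorch sketch with hypothetical inputs `u_x` and `u_xp`, the eigenfunction outputs on the two views of each clean point).

```python
import torch

def contrastive_pi(u_x, u_xp):
    # u_x, u_xp: (B, k) eigenfunction outputs on two augmentations of the same clean points.
    # Returns the (k, k) matrix whose (i, j) entry estimates Pi_{ij}.
    B = u_x.shape[0]
    return u_x.T @ u_xp / B
```

This estimate can be plugged into the same reward-plus-penalty loss, with stop-gradients as before.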

A benefit of this approach is that the feature representations (eigenfunctions) are ordered in terms of importance. This means that they are well suited for use as adaptive-length codes. An application of this is image retrieval, where a shorter code can lead to much lower time and storage requirements for retrieving the top $M$ closest examples, while longer codes provide better accuracy when needed.