###### tags: `one-offs` `sampling` `monte carlo` `gaussian processes`
# Low-Rank Approximations for Subsampling MCMC in Gaussian Process Models
**Overview**: In this note, I will describe some challenges with scaling up inference in Gaussian Process models to large datasets. I will then describe an approach to handling these challenges, based on reduced-basis expansions, and then outline a potential application to MCMC-based inference.
## Inference in Gaussian Process Models
In statistical modelling, one often has to model unknown functions. In simple examples, one can get away with linear models, or with expanding the function in a small basis. For more elaborate models, it is sometimes preferable to model functions *nonparametrically*, i.e. with a representation of unbounded complexity. A simple instance of this is to consider expansions in an infinite basis, e.g. writing out the function in a Fourier basis and then inferring the Fourier coefficients. While this can be handled practically by simply truncating the basis expansion at a sufficiently high index, it is often desirable to work directly with the infinite-dimensional object.
One road towards this is to work with *Gaussian Processes*. A Gaussian Process is a distribution over functions $f$, such that for any finite set of inputs $X = \{ x_i \}_{i \in [N]}$, the joint law of $f(X) = \{ f(x_i) \}_{i \in [N]}$ is multivariate Gaussian, i.e.
\begin{align}
P \left( \{ f(x_i) \}_{i \in [N]} \right) = \mathcal{N} \left( \{ f(x_i) \}_{i \in [N]} \vert m_X, K_{XX}\right)
\end{align}
for some vector $m_X$ and some positive semi-definite matrix $K_{XX}$. In this setting, one says that $f$ has the law of a Gaussian process with mean function $m$ and covariance kernel $K$.
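As a concrete illustration of this definition, the snippet below draws the finite-dimensional marginal $f(X) \sim \mathcal{N}(0, K_{XX})$ at a grid of inputs; the squared-exponential kernel and its lengthscale are illustrative assumptions, not part of the definition.

```python
import numpy as np

# The defining property of a GP: for any finite set of inputs X, the vector
# f(X) is multivariate Gaussian with mean m(X) and covariance K(X, X).
def sq_exp_kernel(x, y, lengthscale=0.3):
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)
K_XX = sq_exp_kernel(X, X)

# Draw f(X) ~ N(0, K_XX) via a jittered Cholesky factor of K_XX.
chol = np.linalg.cholesky(K_XX + 1e-6 * np.eye(len(X)))
f_X = chol @ rng.standard_normal(len(X))
```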
A benefit of working with GPs is that one is implicitly working with an expansion in an infinite basis, but the marginal distribution of the function's values at a finite number of locations remains available in closed form. One thus retains substantial flexibility in the range of $f$ which can be modelled.
## Costs of Inference in Gaussian Processes
Consider a simple logistic regression task with a latent Gaussian process, i.e.
\begin{align}
f &\sim \text{GP} \left( m, K \right) \\
\text{for } i \in [N],\quad y_i &\sim \text{Bernoulli} ( \sigma( f(x_i) ) )
\end{align}
where $\sigma(u) = (1 + \exp(-u))^{-1}$. We treat any hyperparameters of $(m, K)$ as fixed for now.
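A minimal simulation of this generative model might look as follows; the kernel, lengthscale, and seed are illustrative assumptions, not fixed by the model above.

```python
import numpy as np

# Simulate the latent-GP logistic regression model:
# f ~ GP(0, K), then y_i ~ Bernoulli(sigma(f(x_i))).
def sq_exp_kernel(x, y, lengthscale=0.3):
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / lengthscale**2)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
N = 200
x = np.sort(rng.uniform(0.0, 1.0, size=N))

# Draw f at the observed inputs via a jittered Cholesky factor
K = sq_exp_kernel(x, x)
f = np.linalg.cholesky(K + 1e-6 * np.eye(N)) @ rng.standard_normal(N)

# Bernoulli observations through the logistic link
y = rng.binomial(1, sigmoid(f))
```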
In principle, life is good: we have a closed-form prior and likelihood, and our posterior measure is even log-concave with respect to $\{ f(x_i) \}_{i \in [N]}$. However, there is a computational hitch: writing out the negative log-posterior explicitly (taking $m \equiv 0$ for simplicity, and writing $\ell_i (f_i) = -\log P(y_i \vert f_i)$),
\begin{align}
-\log P \left( \{ f(x_i) \}_{i \in [N]} \,\vert\, \{ y_i \}_{i \in [N]} \right) &= \frac{1}{2} | f |_{K^{-1}}^2 + \sum_{i \in [N]} \ell_i ( f_i ) + \text{const.},
\end{align}
we see that evaluating this density requires inverting the $N \times N$ matrix $K$, or an operation of similar cost (computing a Cholesky factorisation, an SVD, solving a linear system, etc.). Given that the cost of these operations scales cubically with $N$, this is bad news even for MAP inference in GP models, before one even gets to 'full' Bayesian inference.
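To make the cost concrete, here is a sketch of evaluating this objective with NumPy; the Cholesky factorisation is the cubic-cost step, and the Bernoulli-logistic form of $\ell_i$ follows from the model above.

```python
import numpy as np

# Sketch of the bottleneck: evaluating the negative log-posterior requires
# a solve against the N x N matrix K; the Cholesky step below is O(N^3).
def neg_log_posterior(f, K, y, jitter=1e-8):
    N = len(f)
    chol = np.linalg.cholesky(K + jitter * np.eye(N))        # O(N^3)
    Kinv_f = np.linalg.solve(chol.T, np.linalg.solve(chol, f))
    quad = 0.5 * f @ Kinv_f                                  # (1/2)|f|_{K^{-1}}^2
    # ell_i(f_i) = -log P(y_i | f_i) = log(1 + exp(f_i)) - y_i f_i
    ell = np.logaddexp(0.0, f) - y * f
    return quad + ell.sum()

# Sanity check: with K = I and f = 0, each ell_i equals log(2)
val = neg_log_posterior(np.zeros(3), np.eye(3), np.zeros(3))
```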
## Approximate Inference with Reduced-Basis Expansions
A common strategy involves appealing to the representer theorem to note that the MAP configuration for $f$ is expressible as
\begin{align}
f(x) = \sum_{i \in [N]} \alpha_i K (x, x_i)
\end{align}
for some scalars $\alpha_i$, and thus seeking to write $f$ in a basis of functions of the form $x \mapsto K(x, \bar{x}_j )$, where $\bar{x}_j$ may or may not be contained in the original set of $x_i$. Depending on the specific instantiation, this often corresponds to working with a modified prior of the form
\begin{align}
f &\sim \text{GP} \left( \bar{m}, \bar{K }\right) \\
\text{where } \bar{K} (x, y) &= K(x, \bar{X}) K (\bar{X}, \bar{X} )^{-1} K (\bar{X}, y),
\end{align}
where $\bar{m}$ is some modified mean function. Crucially, the rank of $\bar{K}$ is now bounded above by the cardinality of the set $\bar{X}$, which can be taken much smaller than $N$; this enables more scalable computational strategies. The approximation $\bar{K} \approx K$ is known as the *Nyström approximation*, and has had impact far beyond the world of Gaussian Processes.
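A minimal sketch of the construction, assuming a squared-exponential kernel and a uniform grid of inducing inputs $\bar{X}$ (both illustrative choices), confirms numerically that $\bar{K}$ has rank at most $M = |\bar{X}|$.

```python
import numpy as np

# Nystrom approximation: choose M << N inducing inputs x_bar and form
# K_bar = K(X, X_bar) K(X_bar, X_bar)^{-1} K(X_bar, X).
def sq_exp_kernel(x, y, lengthscale=0.3):
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(2)
N, M = 300, 20
x = np.sort(rng.uniform(0.0, 1.0, size=N))
x_bar = np.linspace(0.0, 1.0, M)          # inducing inputs (a design choice)

K_xb = sq_exp_kernel(x, x_bar)            # N x M
K_bb = sq_exp_kernel(x_bar, x_bar)        # M x M
K_bar = K_xb @ np.linalg.solve(K_bb + 1e-8 * np.eye(M), K_xb.T)

# rank(K_bar) <= M: eigenvalues beyond the M-th are (numerically) zero
eigvals = np.sort(np.linalg.eigvalsh(K_bar))[::-1]
```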
Finally, the corresponding negative log-'posterior' density then looks like
\begin{align}
-\log P \left( \{ f(x_i) \}_{i \in [N]} \,\vert\, \{ y_i \}_{i \in [N]} \right) &= \frac{1}{2} | f |_{\bar{K}^{-1}}^2 + \sum_{i \in [N]} \ell_i ( f_i ) + \text{const.},
\end{align}
where some care should be taken with interpreting the $\frac{1}{2} | f |_{\bar{K}^{-1}}^2$ term, noting the rank-deficiency of $\bar{K}$. With this in mind, one can devise efficient strategies for MAP and Variational Inference in this modified model.
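One concrete way to exercise this care (a choice of this sketch, not prescribed above) is to parameterise $f = K(X, \bar{X}) w$ with $w \sim \mathcal{N}(0, K(\bar{X}, \bar{X})^{-1})$, which reproduces the prior covariance $\bar{K}$ for $f$; the quadratic term then becomes the well-defined $M$-dimensional form $\frac{1}{2} w^\top K(\bar{X}, \bar{X}) w$.

```python
import numpy as np

def sq_exp_kernel(x, y, lengthscale=0.3):
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / lengthscale**2)

# Objective over the M-dimensional weights w, with f = K(X, X_bar) w;
# the term (1/2) w^T K_bb w replaces the rank-deficient |f|_{Kbar^{-1}}^2.
def neg_log_posterior_lowrank(w, x, x_bar, y):
    K_xb = sq_exp_kernel(x, x_bar)        # N x M
    K_bb = sq_exp_kernel(x_bar, x_bar)    # M x M
    f = K_xb @ w                          # rank-M family of function values
    quad = 0.5 * w @ (K_bb @ w)
    ell = np.logaddexp(0.0, f) - y * f    # Bernoulli-logistic losses
    return quad + ell.sum()

# Sanity check: at w = 0, only the N log(2) likelihood terms remain
x = np.linspace(0.0, 1.0, 10)
x_bar = np.linspace(0.0, 1.0, 4)
val = neg_log_posterior_lowrank(np.zeros(4), x, x_bar, np.zeros(10))
```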
## Monte Carlo Inference with the Nyström Approximation
While MAP and Variational Inference can provide useful information about the posterior, for many models they will not tell the full story. To obtain a richer view of the posterior, it is common to make use of Markov chain Monte Carlo (MCMC) methods, which draw approximate samples from the posterior.
For similar reasons to those detailed above, the cost of matrix computations can make the implementation of MCMC for Gaussian Process models prohibitively slow. For example, consider the overdamped Langevin diffusion which targets the posterior on $f | X, Y$:
\begin{align}
df &= \nabla_f \log P ( f | X, Y) dt + \sqrt{2} dW \\
&= -\nabla_f\left( \frac{1}{2} | f |_{K^{-1}}^2 + \sum_{i \in [N]} \ell_i ( f_i ) \right) dt + \sqrt{2} dW \\
&= -\left( K^{-1} f + \sum_{i \in [N]} \nabla_{f_i} \ell_i ( f_i ) \right) dt + \sqrt{2} dW.
\end{align}
Evaluating the drift function involves computing $K^{-1} f$, which costs $\mathcal{O}(N^3)$ when done naively (or $\mathcal{O}(N^2)$ per step, if a factorisation of $K$ is precomputed and cached). Moreover, even if this cost is reduced, we still have to evaluate all $N$ likelihood terms.
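For reference, a single Euler–Maruyama (unadjusted Langevin) step for these dynamics might be sketched as below, assuming the logistic likelihood from earlier; even with a Cholesky factor of $K$ cached up front, the $K^{-1} f$ solve costs $\mathcal{O}(N^2)$ per step.

```python
import numpy as np

# One unadjusted Langevin (ULA) step; K_chol is a precomputed Cholesky
# factor of K, so the K^{-1} f solve is O(N^2) per step.
def ula_step(f, K_chol, y, step, rng):
    Kinv_f = np.linalg.solve(K_chol.T, np.linalg.solve(K_chol, f))
    grad_ell = 1.0 / (1.0 + np.exp(-f)) - y   # ell_i'(f_i) = sigma(f_i) - y_i
    drift = -(Kinv_f + grad_ell)
    noise = np.sqrt(2.0 * step) * rng.standard_normal(len(f))
    return f + step * drift + noise

# Tiny usage example with K = I
rng = np.random.default_rng(0)
f_next = ula_step(np.zeros(5), np.eye(5), np.zeros(5), 1e-2, rng)
```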
One strategy which we might try is to precondition the dynamics, i.e. specify a positive semi-definite matrix $D$ and instead simulate
\begin{align}
df &= D \nabla_f \log P ( f | X, Y) dt + \sqrt{2D} dW.
\end{align}
For any such $D$, this diffusion will also leave the posterior exactly invariant (in continuous time, when simulated exactly).
A natural choice is to use the prior covariance matrix as a preconditioner. Indeed, by taking $D = K$, one obtains
\begin{align}
df &= -\left( f + \sum_{i \in [N]} K \nabla_{f_i} \ell_i ( f_i ) \right) dt + \sqrt{2K} dW,
\end{align}
which renders the first term in the drift trivial to evaluate. However, it i) increases the cost of evaluating the likelihood gradient terms, and ii) increases the cost of evaluating the diffusion term, as one must compute a square root / SVD / etc. of $K$. So we are back to square one!
A potential compromise is the following: precondition the dynamics with $\bar{K} \approx K$, evaluate the diffusion term exactly, but simplify the drift term as:
\begin{align}
df &= \bar{K} \nabla_f \log P ( f | X, Y) dt + \sqrt{2\bar{K}} dW \\
&= -\left( \bar{K} K^{-1} f + \sum_{i \in [N]} \bar{K} \nabla_{f_i} \ell_i ( f_i ) \right) dt + \sqrt{2 \bar{K}} dW \\
&\approx -\left( f + \sum_{i \in [N]} \bar{K} \nabla_{f_i} \ell_i ( f_i ) \right) dt + \sqrt{2 \bar{K}} dW,
\end{align}
which will be a good approximation if $\| \bar{K} K^{-1} - I \|_\text{op}$ is not too large. The main cost now shifts to the likelihood gradient term. However, this is a favourable place to encounter cost, as this term has a finite-sum structure, and is thus amenable to subsampling to further reduce the cost.
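Putting the pieces together, one step of the resulting scheme might be sketched as below; the batch size, kernel, and helper names are illustrative assumptions. Note that only the $M \times M$ matrix $K(\bar{X}, \bar{X})$ is ever factorised, and a square root of $\bar{K}$ comes for free, since $\bar{K} = R R^\top$ with $R = K(X, \bar{X}) L^{-\top}$ for $L$ the Cholesky factor of $K(\bar{X}, \bar{X})$.

```python
import numpy as np

# One step of the Nystrom-preconditioned, subsampled Langevin scheme.
# K_xb = K(X, X_bar) is N x M; K_bb_chol is a Cholesky factor of
# K(X_bar, X_bar). Multiplying by K_bar costs O(NM) per step, and the
# gradient estimate touches only `batch` of the N likelihood terms.
def nystrom_subsampled_step(f, K_xb, K_bb_chol, y, step, batch, rng):
    N, M = K_xb.shape
    idx = rng.choice(N, size=batch, replace=False)
    # Unbiased estimate of the full likelihood-gradient sum
    g = np.zeros(N)
    g[idx] = (N / batch) * (1.0 / (1.0 + np.exp(-f[idx])) - y[idx])
    # K_bar @ g = K_xb K_bb^{-1} K_xb^T g, via the cached factor
    v = np.linalg.solve(K_bb_chol.T, np.linalg.solve(K_bb_chol, K_xb.T @ g))
    drift = -(f + K_xb @ v)
    # sqrt(2 K_bar) dW: K_bar = R R^T with R = K_xb @ inv(K_bb_chol.T)
    xi = rng.standard_normal(M)
    noise = np.sqrt(2.0 * step) * (K_xb @ np.linalg.solve(K_bb_chol.T, xi))
    return f + step * drift + noise

# Tiny usage example with a squared-exponential kernel
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
x_bar = np.linspace(0.0, 1.0, 5)
K_xb = np.exp(-0.5 * (x[:, None] - x_bar[None, :]) ** 2 / 0.3**2)
K_bb = np.exp(-0.5 * (x_bar[:, None] - x_bar[None, :]) ** 2 / 0.3**2)
K_bb_chol = np.linalg.cholesky(K_bb + 1e-8 * np.eye(5))
f_next = nystrom_subsampled_step(np.zeros(40), K_xb, K_bb_chol,
                                 np.zeros(40), 1e-2, 8, rng)
```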
A notable drawback of this approach is that it induces a bias in the invariant measure, which is not easily quantified, nor cheaply corrected. However, the iteration cost is now reduced to $\mathcal{O} (M^2 N)$, where $M = | \bar{X} |$ is the number of inducing points, with potential for further reductions via subsampling. Moreover, there may be scope for providing a richer approximation to the posterior distribution than is available with fixed-form approximations.
## Conclusion
In this note, I have given a brief overview of some of the computational difficulties which are involved in carrying out Bayesian inference in Gaussian process models. I gave a short introduction to the idea of low-rank / reduced-basis expansions in such models, and how they improve computational complexity, at the cost of incurring some bias. Finally, I outlined how some of these ideas might be made useful for carrying out approximate Monte Carlo inference for Gaussian Process models in the large-$N$ setting.
A major oversight in this note is that I have made no structural assumptions on the covariance kernel which is used in the Gaussian Process. In practice, many models use a highly-structured $K$, which can provide substantial computational speedups in specific circumstances.
Finally, I have assumed throughout that the covariance kernel is fixed, and has no free hyperparameters. In practice, hyperparameters play a substantial role in the effective deployment of GP models, and an honest account of computation with GPs should really include a treatment of this issue.

Published on **HackMD**