###### tags: `one-offs` `sampling` `monte carlo` `gaussian processes`

# Low-Rank Approximations for Subsampling MCMC in Gaussian Process Models

**Overview**: In this note, I will describe some challenges with scaling up inference in Gaussian Process models to large datasets. I will then describe an approach to handling these challenges, based on reduced-basis expansions, and outline a potential application to MCMC-based inference.

## Inference in Gaussian Process Models

In statistical modelling, one often has to model unknown functions. In simple examples, one can get away with linear models, or with expanding the function in a small basis. For more elaborate models, it is sometimes preferable to model functions *nonparametrically*, i.e. in a representation of unbounded complexity. A simple instance of this is to consider expansions in an infinite basis, e.g. writing out the function in a Fourier basis and then inferring the Fourier coefficients. While this can be handled practically by simply truncating the basis expansion at a sufficiently high index, it is often desirable to work directly with the infinite-dimensional object.

One road towards this is to work with *Gaussian Processes*. A Gaussian Process is a distribution over functions $f$ such that, for any finite set of inputs $X = \{ x_i \}_{i \in [N]}$, the joint law of $f(X) = \{ f(x_i) \}_{i \in [N]}$ is multivariate Gaussian, i.e.

\begin{align}
P \left( \{ f(x_i) \}_{i \in [N]} \right) = \mathcal{N} \left( \{ f(x_i) \}_{i \in [N]} \vert m_X, K_{XX}\right)
\end{align}

for some vector $m_X$ and some positive semi-definite matrix $K_{XX}$. In this setting, one says that $f$ has the law of a Gaussian Process with mean function $m$ and covariance kernel $K$. A benefit of working with GPs is that one is implicitly working with an expansion in an infinite basis, but the marginal distribution of the function's values at a finite number of locations remains available in closed form.
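For concreteness, these finite-dimensional marginals are easy to work with directly. Here is a minimal sketch in NumPy, assuming (purely for illustration) a zero mean function and a squared-exponential kernel; all names are my own:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential kernel, chosen purely for illustration;
    # any positive semi-definite kernel would do here.
    return np.exp(-0.5 * (X[:, None] - Y[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)              # N = 50 input locations
K_XX = rbf_kernel(X, X)                    # N x N marginal covariance
# Sample f(X) ~ N(0, K_XX); a small jitter on the diagonal keeps the
# Cholesky factorisation numerically stable.
L = np.linalg.cholesky(K_XX + 1e-8 * np.eye(len(X)))
f_X = L @ rng.standard_normal(len(X))
```

Each draw of `f_X` is one realisation of the process observed at the $N$ chosen locations, with no reference to the infinite-dimensional object required.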
One thus retains substantial flexibility in the range of $f$ which can be modelled.

## Costs of Inference in Gaussian Processes

Consider a simple logistic regression task with a latent Gaussian Process, i.e.

\begin{align}
f &\sim \text{GP} \left( m, K \right) \\
\text{for } i \in [N],\quad y_i &\sim \text{Bernoulli} ( \sigma( f(x_i) ) ),
\end{align}

where $\sigma(u) = (1 + \exp(-u))^{-1}$. We treat any hyperparameters of $(m, K)$ as fixed for now.

In principle, life is good: we have a closed-form prior and likelihood, and our posterior measure is even log-concave with respect to $\{ f(x_i) \}_{i \in [N]}$. However, there is a computational hitch. Writing out the negative log-posterior density explicitly (taking $m \equiv 0$ for simplicity, and writing $\ell_i (f_i) = -\log P(y_i \vert f_i)$),

\begin{align}
-\log P \left( \{ f(x_i) \}_{i \in [N]} \,\vert\, Y \right) &= \frac{1}{2} | f |_{K^{-1}}^2 + \sum_{i \in [N]} \ell_i ( f_i ) + \text{const.},
\end{align}

we see that evaluating it requires the inversion of the $N \times N$ matrix $K$, or an operation of similar cost (computing an SVD, solving a linear system, etc.). Given that the cost of such operations scales cubically with $N$, this is bad news even for MAP inference in GP models, before we get to 'full' Bayesian inference.

## Approximate Inference with Reduced-Basis Expansions

A common strategy involves appealing to the representer theorem to note that the MAP configuration for $f$ is expressible as

\begin{align}
f(x) = \sum_{i \in [N]} \alpha_i K (x, x_i)
\end{align}

for some scalars $\alpha_i$, and thus seeking to write $f$ in a basis of functions of the form $x \mapsto K(x, \bar{x}_j )$, where the $\bar{x}_j$ may or may not be contained in the original set of $x_i$.
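Before turning to approximations, it is worth making the exact computation concrete. Below is a minimal sketch of evaluating the negative log-posterior for the logistic model above; the function name is my own, and the Bernoulli terms $\ell_i(f_i) = \log(1 + e^{f_i}) - y_i f_i$ are written in a numerically stabilised form:

```python
import numpy as np

def neg_log_posterior(f, K, y):
    # Quadratic prior term 0.5 * f^T K^{-1} f via a Cholesky solve:
    # the factorisation is the O(N^3) bottleneck discussed above.
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(f)))
    alpha = np.linalg.solve(L, f)          # so that |alpha|^2 = f^T K^{-1} f
    quad = 0.5 * alpha @ alpha
    # Bernoulli-logistic terms ell_i(f_i) = log(1 + exp(f_i)) - y_i f_i,
    # computed stably via logaddexp; this part is only O(N).
    nll = np.sum(np.logaddexp(0.0, f) - y * f)
    return quad + nll
```

In practice one would use a triangular solver (e.g. `scipy.linalg.cho_solve`) to exploit the structure of `L`, but the cubic cost of the factorisation itself remains.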
Depending on the specific instantiation, this often corresponds to working with a modified prior of the form

\begin{align}
f &\sim \text{GP} \left( \bar{m}, \bar{K} \right) \\
\text{where } \bar{K} (x, y) &= K(x, \bar{X}) K (\bar{X}, \bar{X} )^{-1} K (\bar{X}, y),
\end{align}

where $\bar{m}$ is some modified mean function. Crucially, the rank of $\bar{K}$ is now bounded above by $M = | \bar{X} |$, the cardinality of the set $\bar{X}$, which can be taken much smaller than $N$. This enables more scalable computational strategies. The approximation $\bar{K} \approx K$ is known as the *Nyström approximation*, and has had impact far beyond the world of Gaussian Processes.

The corresponding negative log-'posterior' density then looks like

\begin{align}
-\log P \left( \{ f(x_i) \}_{i \in [N]} \,\vert\, Y \right) &= \frac{1}{2} | f |_{\bar{K}^{-1}}^2 + \sum_{i \in [N]} \ell_i ( f_i ),
\end{align}

where some care should be taken in interpreting the $\frac{1}{2} | f |_{\bar{K}^{-1}}^2$ term, given the rank-deficiency of $\bar{K}$ (e.g. reading $\bar{K}^{-1}$ as a pseudo-inverse, with $f$ restricted to the range of $\bar{K}$). With this in mind, one can devise efficient strategies for MAP and Variational Inference in this modified model.

## Monte Carlo Inference with the Nyström Approximation

While MAP and Variational Inference can provide useful information about the posterior, for many models they will not tell the full story. To provide a richer view of the posterior, it is common to make use of Markov Chain Monte Carlo methods, which draw approximate samples from the posterior. For similar reasons to those detailed above, the cost of matrix computations can make the implementation of MCMC for Gaussian Process models prohibitively slow.
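The Nyström construction of the previous section is straightforward to implement. A minimal sketch, again assuming a squared-exponential kernel and an (illustrative) evenly spaced set of reduced-basis locations $\bar{X}$:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential kernel, for illustration only.
    return np.exp(-0.5 * (X[:, None] - Y[None, :]) ** 2 / lengthscale**2)

X = np.linspace(0.0, 1.0, 200)       # N = 200 data locations
X_bar = np.linspace(0.0, 1.0, 10)    # M = 10 reduced-basis locations

K_xb = rbf_kernel(X, X_bar)                               # N x M
K_bb = rbf_kernel(X_bar, X_bar) + 1e-8 * np.eye(len(X_bar))
# Nystrom approximation: K_bar = K(X, Xbar) K(Xbar, Xbar)^{-1} K(Xbar, X).
K_bar = K_xb @ np.linalg.solve(K_bb, K_xb.T)              # N x N, rank <= M
```

By construction the rank of `K_bar` never exceeds $M$, which is what makes the downstream linear algebra cheap; for smooth kernels the approximation can be very accurate even with $M \ll N$.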
For example, consider the overdamped Langevin diffusion which targets the posterior on $f \vert X, Y$:

\begin{align}
df &= \nabla_f \log P ( f \vert X, Y) dt + \sqrt{2} dW \\
&= -\nabla_f\left( \frac{1}{2} | f |_{K^{-1}}^2 + \sum_{i \in [N]} \ell_i ( f_i ) \right) dt + \sqrt{2} dW \\
&= -\left( K^{-1} f + \sum_{i \in [N]} \nabla_{f_i} \ell_i ( f_i ) \right) dt + \sqrt{2} dW.
\end{align}

Evaluating the drift involves computing $K^{-1}f$, which has cost scaling cubically with $N$. Moreover, even if this cost is reduced, we still have to evaluate $N$ likelihood terms.

One strategy we might try is to precondition the dynamics, i.e. to specify a positive semi-definite matrix $D$ and instead simulate

\begin{align}
df &= D \nabla_f \log P ( f \vert X, Y) dt + \sqrt{2D} dW.
\end{align}

For any such $D$, this diffusion also leaves the posterior exactly invariant (in continuous time, when simulated exactly). A natural choice is to use the prior covariance matrix as a preconditioner. Indeed, taking $D = K$, one obtains

\begin{align}
df &= -\left( f + \sum_{i \in [N]} K \nabla_{f_i} \ell_i ( f_i ) \right) dt + \sqrt{2K} dW,
\end{align}

which renders the first term in the drift trivial to evaluate. However, this choice i) increases the cost of evaluating the likelihood gradient terms, and ii) increases the cost of evaluating the diffusion term, as one must compute a square root / SVD / etc. of $K$. So we are back to square one!
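An Euler–Maruyama discretisation makes these trade-offs concrete. The sketch below implements one step of the $D = K$ case above, where the prior part of the drift collapses to $-f$; it takes a generic factor $A$ with $K = A A^\top$ (e.g. a Cholesky factor), so that $\sqrt{2K}\,dW$ over a step of size $\delta t$ can be simulated as $\sqrt{2 \delta t}\, A \xi$ with $\xi$ standard Gaussian. Function and variable names are my own:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def preconditioned_langevin_step(f, A, y, dt, rng):
    # One Euler-Maruyama step of df = -(f + K grad_ll) dt + sqrt(2K) dW,
    # with K = A @ A.T.  For the logistic model, the gradient of
    # ell_i(f_i) = log(1 + exp(f_i)) - y_i f_i is sigmoid(f_i) - y_i.
    grad_ll = sigmoid(f) - y
    # K @ grad_ll computed as A (A^T grad_ll), without forming K explicitly;
    # for an N x M factor A this costs O(N M) rather than O(N^2).
    drift = -(f + A @ (A.T @ grad_ll))
    noise = np.sqrt(2.0 * dt) * (A @ rng.standard_normal(A.shape[1]))
    return f + dt * drift + noise
```

Written this way, the step only ever touches $A$, so the cost is governed by the cost of obtaining and applying a factor of the preconditioner.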
A potential compromise is the following: precondition the dynamics with $\bar{K} \approx K$, evaluate the diffusion term exactly, but simplify the drift term:

\begin{align}
df &= \bar{K} \nabla_f \log P ( f \vert X, Y) dt + \sqrt{2\bar{K}} dW \\
&= -\left( \bar{K} K^{-1} f + \sum_{i \in [N]} \bar{K} \nabla_{f_i} \ell_i ( f_i ) \right) dt + \sqrt{2 \bar{K}} dW \\
&\approx -\left( f + \sum_{i \in [N]} \bar{K} \nabla_{f_i} \ell_i ( f_i ) \right) dt + \sqrt{2 \bar{K}} dW,
\end{align}

which will be a good approximation if $\| \bar{K} K^{-1} - I \|_\text{op}$ is not too large. The main cost now shifts to the likelihood gradient term. However, this is a favourable place to incur cost: the term has a finite-sum structure, and is thus amenable to subsampling to reduce the cost further.

A notable drawback of this approach is that it induces a bias in the invariant measure, which is not easily quantified, nor cheaply corrected. However, the per-iteration cost is now reduced to $\mathcal{O} (M^2 N)$, where $M = | \bar{X} |$ is the rank of the Nyström approximation, with the potential for further reductions via subsampling. Moreover, there may be scope for providing a richer approximation to the posterior distribution than is available with fixed-form approximations.

## Conclusion

In this note, I have given a brief overview of some of the computational difficulties involved in carrying out Bayesian inference in Gaussian Process models. I gave a short introduction to the idea of low-rank / reduced-basis expansions in such models, and to how they improve computational complexity at the cost of incurring some bias. Finally, I outlined how some of these ideas might be made useful for carrying out approximate Monte Carlo inference in Gaussian Process models in the large-$N$ setting.

A major oversight in this note is that I have made no structural assumptions on the covariance kernel which is used in the Gaussian Process.
In practice, many models use a highly structured $K$, which can provide substantial computational speedups in specific circumstances. Finally, I have assumed throughout that the covariance kernel is fixed, with no free hyperparameters. In practice, hyperparameters play a substantial role in the effective deployment of GP models, and an honest account of computation with GPs should really include a treatment of this issue.