A Gaussian Process (GP) is a generalization of the multivariate normal distribution. While a multivariate Gaussian represents a distribution over finite-dimensional vectors, a GP represents a distribution over (uncountably infinite-dimensional) real-valued functions. If the input domain is the real numbers, the random variable described by a GP can be thought of as a "vector" of uncountably infinite extent and infinite resolution that is "indexed" by a real number rather than by a discrete index. Formally, a GP is defined through two real-valued functions: a mean function $\mu(\cdot) : X \to \mathbb{R}$ and a symmetric, positive-definite covariance function (a.k.a. kernel) $k(\cdot, \cdot') : X \times X \to \mathbb{R}$:

$$
f(\cdot)\sim \mathcal{GP}(\mu(\cdot),k(\cdot, \cdot')),
$$

where $\mu(\cdot) = \mathbb{E}[f(\cdot)]$ and $k(\cdot, \cdot')=\mathbb{E}[(f(\cdot)-\mu(\cdot))(f(\cdot')-\mu(\cdot'))]$.

Even though no analytic description of the probability density function of a GP exists in general, GPs have the useful property that evaluating the GP at any finite subset $\{X_1, X_2, \dots, X_N\} \subset X$ of cardinality $N$ yields an $N$-dimensional multivariate Gaussian random variable $\mathbf{f}$:

$$
\mathbf{f} \sim \mathcal{N}(\mathbf{\mu_f},\mathbf{K_{ff}}),
$$

where $\mathbf{\mu_f}$ and $\mathbf{K_{ff}}$ are the mean vector and covariance matrix obtained by evaluating the mean and covariance function at $\{X_1, X_2, \dots, X_N\}$, i.e. $\mathbf{\mu_f}[n]=\mu(X_n)$ and $\mathbf{K_{ff}}[n,m]=k(X_n,X_m)$.

We can partition the uncountably infinite set of random variables represented by a GP into two sets: a finite subset, denoted $\mathbf{u}$, evaluated at a finite set of evaluation points $\{Z_1, Z_2, \dots, Z_M\} \subset X$, such that $\mathbf{u}[m] = f(Z_m)$, with mean $\mathbf{\mu_u}$ and covariance matrix $\mathbf{K_{uu}}$; and the remaining uncountably infinite set of random variables, denoted $f(\cdot)$, evaluated at all locations in $X$ except the finitely many points $Z_m$:

$$
\begin{pmatrix} f(\cdot) \\ \mathbf{u} \end{pmatrix} \sim \mathcal{GP}\Bigg( \begin{pmatrix} \mu(\cdot) \\ \mathbf{\mu_u} \end{pmatrix}, \begin{pmatrix} k(\cdot, \cdot') & \mathbf{k_{\cdot u}} \\ \mathbf{k_{u\cdot'}} & \mathbf{K_{uu}} \end{pmatrix} \Bigg),
$$

where $\mathbf{k_{\cdot u}}$ and $\mathbf{k_{u\cdot'}}$ denote vector-valued functions that express the cross-covariance between the finite-dimensional random variable $\mathbf{u}$ and the uncountably infinite-dimensional random variable $f(\cdot)$, i.e. $\mathbf{k_{\cdot u}}[m] = k(\cdot, Z_m)$ and $\mathbf{k_{u\cdot'}}[m] = k(Z_m, \cdot')$. Note that $\mathbf{k_{\cdot u}}$ is row-vector-valued whereas $\mathbf{k_{u\cdot'}}$ is column-vector-valued, and that $\mathbf{k_{\cdot u}} = \mathbf{k_{u\cdot'}}^\intercal$ holds.

The conditional GP of $f(\cdot)$ given $\mathbf{u}$ corresponding to this joint distribution is:

$$
f(\cdot)\,|\,\mathbf{u} \sim \mathcal{GP}\big(\mu(\cdot)+\mathbf{k_{\cdot u}}\mathbf{K^{-1}_{uu}}(\mathbf{u}-\mathbf{\mu_u}),\; k(\cdot, \cdot')-\mathbf{k_{\cdot u}}\mathbf{K^{-1}_{uu}}\mathbf{k_{u\cdot'}}\big).
$$
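The finite-marginal and conditioning formulas above can be reproduced numerically. The following is a minimal NumPy sketch, assuming a zero mean function and a squared-exponential kernel chosen purely for illustration; it evaluates the GP at finitely many points and computes the conditional mean and covariance of $f$ at test locations given a sampled $\mathbf{u}$ at inducing points $Z$:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel k(a, b) on 1-D inputs (illustrative choice)."""
    sqdist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# Finite evaluation points: the GP restricted to these is an N-dim Gaussian.
X = np.linspace(0, 1, 50)          # test locations (stand-in for "all other" points)
Z = np.array([0.1, 0.5, 0.9])      # inducing locations Z_1, ..., Z_M

K_ff = rbf_kernel(X, X)            # k(., .') evaluated at X
K_uu = rbf_kernel(Z, Z)            # K_uu
K_fu = rbf_kernel(X, Z)            # k_{.u} evaluated at X
mu_f = np.zeros(len(X))            # zero mean function (assumption)
mu_u = np.zeros(len(Z))

# Draw u ~ N(mu_u, K_uu), then condition f(.) on it using the formula above.
u = np.random.multivariate_normal(mu_u, K_uu)
K_uu_inv = np.linalg.inv(K_uu + 1e-8 * np.eye(len(Z)))  # jitter for numerical stability
cond_mean = mu_f + K_fu @ K_uu_inv @ (u - mu_u)
cond_cov = K_ff - K_fu @ K_uu_inv @ K_fu.T

# Samples from the conditional GP pass (almost) exactly through the values u at Z.
f_sample = np.random.multivariate_normal(cond_mean, cond_cov + 1e-8 * np.eye(len(X)))
```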
One can instead place a different marginal distribution $q(\mathbf{u})$ over $\mathbf{u}$, namely another Gaussian with mean $\mathbf{m_u}$ and covariance $\mathbf{S_{uu}}$:

$$
\mathbf{u} \sim \mathcal{N}(\mathbf{m_u},\mathbf{S_{uu}}).
$$

Integrating $\mathbf{u}$ out of the conditional GP with respect to $q(\mathbf{u})$ yields:

$$
f(\cdot) \sim \mathcal{GP}\big(\mu(\cdot)+\mathbf{k_{\cdot u}}\mathbf{K^{-1}_{uu}}(\mathbf{m_u}-\mathbf{\mu_u}),\; k(\cdot, \cdot')-\mathbf{k_{\cdot u}}\mathbf{K^{-1}_{uu}}(\mathbf{K_{uu}}-\mathbf{S_{uu}})\mathbf{K^{-1}_{uu}}\mathbf{k_{u\cdot'}}\big).
$$

## Prediction using Noisy Observations

Consider the more realistic modelling situation in which we want to make predictions based on noisy observations $\mathbf{y}=f(\mathbf{x})+\epsilon$, where $\epsilon$ is additive, independent, identically distributed Gaussian noise with variance $\sigma_n^2$. The covariance function then specifies the covariance between noisy observations as $\text{cov}(\mathbf{y}) = k(\mathbf{x},\mathbf{x})+\sigma_n^2 I$. We can write the joint distribution of the observed target values $\mathbf{y}$ and the function values $\mathbf{f}_*$ at the test locations $\mathbf{x}_*$ under the prior as

$$
\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\Bigg( \mathbf{0} , \begin{pmatrix} k(\mathbf{x},\mathbf{x})+\sigma_n^2 I & k(\mathbf{x},\mathbf{x}_*) \\ k(\mathbf{x}_*,\mathbf{x}) & k(\mathbf{x}_*,\mathbf{x}_*) \end{pmatrix} \Bigg).
$$

Deriving the conditional distribution, we have:

$$
\mathbf{f}_*\,|\,\mathbf{x},\mathbf{y},\mathbf{x}_* \sim \mathcal{N}(\bar{\mathbf{f}}_*,\text{cov}(\mathbf{f}_*)),
$$

where

$$
\bar{\mathbf{f}}_* = k(\mathbf{x}_*,\mathbf{x})[k(\mathbf{x},\mathbf{x})+\sigma_n^2 I]^{-1}\mathbf{y}, \\
\text{cov}(\mathbf{f}_*) = k(\mathbf{x}_*,\mathbf{x}_*) - k(\mathbf{x}_*,\mathbf{x})[k(\mathbf{x},\mathbf{x})+\sigma_n^2 I]^{-1}k(\mathbf{x},\mathbf{x}_*).
$$

:::info
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
:::

:::info
[The difference between Gaussian process regression and other regression](https://stats.stackexchange.com/questions/372339/difference-between-gaussian-process-regression-and-other-regression-techniques): you replace the mean of the linear regression (a parametric line) with a non-parametric GP.
:::

:::info
Reference:
[A Tutorial on Sparse Gaussian Processes and Variational Inference](https://arxiv.org/abs/2012.13962)
[An Introduction to Gaussian Process Models](https://arxiv.org/pdf/2102.05497.pdf) (some commonly used kernel functions in Table 1)
[Gaussian Process and Gaussian Mixture Model](https://roboticsknowledgebase.com/wiki/math/gaussian-process-gaussian-mixture-model/)
Gaussian process regression (GPR): the implementation is based on Algorithm 2.1 of [Gaussian Processes for Machine Learning (GPML)](https://direct.mit.edu/books/book/2320/Gaussian-Processes-for-Machine-Learning) by Rasmussen and Williams.
[A Handbook for Sparse Variational Gaussian Processes](https://tiao.io/post/sparse-variational-gaussian-processes/#prior)
[How to choose kernel for Gaussian Process (Kernel Cookbook)](https://www.cs.toronto.edu/~duvenaud/cookbook/)

Sklearn implementation - prediction using noisy observations:
![](https://hackmd.io/_uploads/By-ZsM1in.png)

The Cholesky decomposition is mainly used for the numerical solution of linear equations $\mathbf{Ax} =\mathbf{b}$ when $\mathbf{A}$ is symmetric and positive definite.
:::
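To make the prediction equations and the role of the Cholesky decomposition concrete, here is a minimal NumPy sketch of GP regression prediction along the lines of Algorithm 2.1 of GPML. A zero prior mean and an RBF kernel with fixed hyperparameters are assumed here purely for illustration:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3, variance=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    sqdist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def gp_predict(x_train, y_train, x_test, noise_var=0.1):
    """GP regression prediction via the Cholesky factorisation of K + sigma_n^2 I."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)            # k(x, x_*)
    K_ss = rbf_kernel(x_test, x_test)            # k(x_*, x_*)

    L = np.linalg.cholesky(K)                    # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                         # mean = k(x_*, x) [K + sigma_n^2 I]^{-1} y
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                         # cov(f_*) from the formula above
    return mean, cov

# Hypothetical toy data, just to exercise the function:
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
mean, cov = gp_predict(x, y, np.linspace(0, 1, 100))
```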
:::info
Cross-validation can be used to tune hyperparameters by splitting the data into training and validation sets and evaluating the model's performance on the validation set for different hyperparameter values. However, cross-validation might not work as expected with Gaussian Process Regression models due to the nature of Gaussian Processes. GPR models do not have explicit parameters that are learned by minimizing a loss function on a training set; instead, Gaussian Processes model the distribution over functions directly. The kernel's hyperparameters and the noise level are usually learned during training by maximizing the model's log-marginal likelihood, a measure of how well the model explains the data. As a result, GPR models usually don't require traditional cross-validation for hyperparameter tuning. Instead, one typically initializes the model with sensible hyperparameters (often chosen using domain knowledge) and lets the training process, which is really an optimization process, find the hyperparameters that maximize the log-marginal likelihood.
:::

:::info
[Difference between Gaussian Process Regression and Kernelized Linear Regression](https://stats.stackexchange.com/questions/231585/is-a-gaussian-process-the-same-as-a-kernelized-generalized-linear-model)
:::

## Sparse Gaussian Process Regression

A pseudocode representation of the Sparse Gaussian Process Regression algorithm:

```plaintext
Initialize training data: X_train, y_train
Initialize inducing inputs: Z
Initialize kernel parameters: θ
Initialize noise variance: σ^2

Calculate the kernel matrix K(X_train, X_train) using the kernel function and θ

for each training iteration:
    Calculate the mean and covariance of the approximate posterior over the inducing outputs:
        m(Z) = K(Z, X_train) * (K(X_train, X_train) + σ^2 * I)^(-1) * y_train
        Σ(Z) = K(Z, Z) - K(Z, X_train) * (K(X_train, X_train) + σ^2 * I)^(-1) * K(X_train, Z)

    Update the inducing inputs Z to maximize the evidence lower bound (ELBO):
        Z = argmax ELBO(Z)

    Update the kernel parameters θ using gradient-based optimization:
        θ = argmax ELBO(θ)

    Update the noise variance σ^2 using maximum likelihood estimation:
        σ^2 = argmax log likelihood(σ^2)
End loop

Calculate the predictive mean and covariance for test inputs X_test
(the "integrating out u" formulas above, with m_u = m(Z) and S_uu = Σ(Z)):
    m(X_test) = K(X_test, Z) * K(Z, Z)^(-1) * m(Z)
    Σ(X_test) = K(X_test, X_test)
                - K(X_test, Z) * K(Z, Z)^(-1) * (K(Z, Z) - Σ(Z)) * K(Z, Z)^(-1) * K(Z, X_test)
                + σ^2 * I

Return the predictive mean m(X_test) and covariance Σ(X_test)
```

In this pseudocode:

- `X_train` and `y_train` represent the training data.
- `Z` represents the inducing inputs.
- `θ` represents the kernel parameters.
- `σ^2` represents the noise variance.
- `K` represents the kernel function.
- `I` is the identity matrix.
- `ELBO` is the evidence lower bound.
- The updates for `Z`, `θ`, and `σ^2` can be performed using optimization techniques such as gradient ascent and maximum likelihood estimation.

Note that this pseudocode is only a high-level overview of Sparse Gaussian Process Regression; the specific choice of kernel function and optimization method depends on the problem and implementation. The prediction step is written out in the sketch below.
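As a minimal NumPy sketch of that prediction step, the following assumes an RBF kernel with fixed hyperparameters and hand-picked inducing points (hyperparameter and inducing-point optimization are omitted). Like the pseudocode, it still forms the full N×N inverse when computing the posterior over the inducing outputs; practical sparse implementations avoid this:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    sqdist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def sparse_gp_predict(X_train, y_train, Z, X_test, noise_var=0.04, jitter=1e-6):
    """Posterior over inducing outputs u = f(Z), then prediction via the
    'integrating out u' formulas with m_u = m(Z) and S_uu = Sigma(Z)."""
    K_xx = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_zx = rbf_kernel(Z, X_train)
    K_zz = rbf_kernel(Z, Z) + jitter * np.eye(len(Z))
    K_sz = rbf_kernel(X_test, Z)
    K_ss = rbf_kernel(X_test, X_test)

    K_xx_inv = np.linalg.inv(K_xx)
    m_z = K_zx @ K_xx_inv @ y_train               # m(Z)
    S_z = K_zz - K_zx @ K_xx_inv @ K_zx.T         # Sigma(Z)

    K_zz_inv = np.linalg.inv(K_zz)
    mean = K_sz @ K_zz_inv @ m_z
    cov = (K_ss
           - K_sz @ K_zz_inv @ (K_zz - S_z) @ K_zz_inv @ K_sz.T
           + noise_var * np.eye(len(X_test)))
    return mean, cov

# Toy usage with hypothetical data:
X_train = np.linspace(0, 1, 200)
y_train = np.sin(2 * np.pi * X_train) + 0.2 * np.random.randn(200)
Z = np.linspace(0, 1, 10)                          # inducing inputs
mean, cov = sparse_gp_predict(X_train, y_train, Z, np.linspace(0, 1, 50))
```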
:::info
ref: [Sparse Gaussian Process](http://krasserm.github.io/2020/12/12/gaussian-processes-sparse/)
:::

### Toy Implementation

```python
import torch
import gpytorch
from gpytorch.models import ApproximateGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import VariationalStrategy
from gpytorch.distributions import MultivariateNormal

# Generate some toy data
train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * (2 * 3.1416)) + torch.randn(train_x.size()) * 0.2

# Define the inducing points
inducing_points = torch.linspace(0, 1, 10)


class GPRegressionModel(ApproximateGP):
    def __init__(self, inducing_points):
        # q(u): a full-covariance Gaussian over the inducing outputs
        variational_distribution = CholeskyVariationalDistribution(inducing_points.size(0))
        variational_strategy = VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True)
        super(GPRegressionModel, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)


# Initialize the model and likelihood
model = GPRegressionModel(inducing_points)
likelihood = gpytorch.likelihoods.GaussianLikelihood()

# Training
model.train()
likelihood.train()

optimizer = torch.optim.Adam([
    {'params': model.parameters()},
    {'params': likelihood.parameters()},
], lr=0.1)

# The variational ELBO is the training objective for sparse variational GPs
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))

# Training loop
for i in range(100):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()

# Testing
model.eval()
likelihood.eval()

# Predictive distribution
with torch.no_grad():
    test_x = torch.linspace(0, 1, 51)
    observed_pred = likelihood(model(test_x))
    lower, upper = observed_pred.confidence_region()

# Plot the results
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.plot(train_x.numpy(), train_y.numpy(), 'k.')
plt.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
plt.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Sparse Gaussian Process Regression')
plt.show()
```

![](https://hackmd.io/_uploads/SJRtYT922.png)