# GP consistency
[toc]
[Link to other papers I found](https://docs.google.com/document/d/1en4Ewx2FKI3UyCkVNXeTh1015iF6q7jRaH8uXnWOYyE/edit?usp=sharing)
## What is consistency?
### General definition
Bayesian consistency is more subtle than frequentist consistency since the quantity of interest is a distribution (i.e., the posterior distribution) rather than a function (i.e., an estimator). I tried to summarize the various definitions in the following general form:
:::success
***Definition*: Bayesian consistency (general form)**: Given a dataset $\mathcal{D}_N$ of $N$ observations drawn from <span style="color:#ff7f0e">[data generation process $P_0$]</span>, the posterior is *consistent* if for any neighborhood $A_{\theta_0}$ <span style="color:#1f77b4">[according to topology $\mathcal{T}$]</span> of the ground truth <span style="color:#1f77b4">[object $\theta_0$]</span>,
\begin{equation}
p(A_{\theta_0} | \mathcal{D}_N ) \to 1 \text{ as } N\to\infty,
\end{equation}
where convergence is of <span style="color:#d62728">[a chosen type]</span> under <span style="color:#ff7f0e">[data generation process $P_0$]</span>.
:::
A few notes:
- The object $\theta_0$ could be a parameter, a function, or even a distribution.
- Think of $p(A_{\theta_0} | \mathcal{D}_N )$, the posterior probability of a neighborhood around the ground truth, as a random variable, since it is a function of the random variable $\mathcal{D}_N \sim P_0$. Convergence of this random variable can be assessed by any of the standard notions of convergence of random variables (e.g., in probability, almost surely). This is the "<span style="color:#d62728">chosen type</span>" of convergence mentioned above.
- Since $p(A_{\theta_0} | \mathcal{D}_N )$ converges to the constant 1, I believe the equation above is equivalent to (and does not merely imply) the posterior distribution converging to a point mass at $\theta_0$. A small worked example of the definition follows.
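To make the definition concrete, here is a minimal Python sketch (my own illustration, not from any of the papers below) in a conjugate Normal model: the ground-truth object $\theta_0$ is a scalar mean, the neighborhood $A_{\theta_0}$ is the interval $(\theta_0-\epsilon, \theta_0+\epsilon)$, and the posterior probability of that neighborhood is available in closed form for a single dataset as $N$ grows. The prior, noise scale, and $\epsilon$ are all assumed choices.
```python
import numpy as np
from scipy import stats

# Assumed illustration: theta_0 is a scalar mean, data are iid N(theta_0, 1),
# and the prior is N(0, 1), so the posterior is Normal in closed form.
rng = np.random.default_rng(0)
theta_0, eps = 1.5, 0.1  # ground truth and neighborhood radius

for N in [10, 100, 1000, 10000]:
    data = rng.normal(theta_0, 1.0, size=N)   # one dataset D_N drawn from P_0
    post_var = 1.0 / (1.0 + N)                # posterior variance (unit prior/noise variance)
    post_mean = post_var * data.sum()         # posterior mean
    post = stats.norm(post_mean, np.sqrt(post_var))
    # Posterior probability of A_{theta_0} = (theta_0 - eps, theta_0 + eps)
    prob = post.cdf(theta_0 + eps) - post.cdf(theta_0 - eps)
    print(f"N = {N:6d}   p(A | D_N) = {prob:.4f}")
```
Note that this prints $p(A_{\theta_0} | \mathcal{D}_N)$ for one realized dataset; the "<span style="color:#d62728">chosen type</span>" of convergence concerns how this random quantity behaves over repeated draws of $\mathcal{D}_N \sim P_0$.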
### Special case: ground truth drawn from the prior
If the true value of the object of interest $\theta_0\in\Theta$ (e.g. a parameter vector) is drawn from the prior, then under fairly general conditions consistency follows for *any* Bayesian model by **Doob's Theorem**.
A common criticism of this result is that it holds only up to a set of prior probability zero, and this exceptional set can in practice be very "large" for two reasons:
1. The prior is very concentrated in a small subset of $\Theta$,
2. The space $\Theta$ is infinite dimensional.
For an extreme example of the first case, if the prior is a point mass $\delta_{\theta_1}$ for some $\theta_1\in\Theta$, then by construction the true parameter is $\theta_1$ (since this is the only value that can be drawn from the prior) and hence the posterior is consistent, but only because it can't really be anything else.
[Miller (2018)](https://arxiv.org/pdf/1801.03122.pdf) however argues that this criticism misses the point, since for a well-chosen prior, consistency can be guaranteed on a very large set.
### Defining consistency for GPs
Next I'll discuss each of the choices that go into defining consistency. To be concrete, we observe data $\mathcal{D}_N = \{(X_n,Y_n)\}_{n=1}^N$ and assume the standard GP model
$$
f(\cdot) \sim \mathcal{G}\mathcal{P}(\mu(\cdot), k(\cdot, \cdot))
$$
used for either regression
$$
Y_n = f(X_n) + \epsilon_n,\quad \epsilon_n \sim \mathcal{N}(0, \sigma^2), \quad n=1,\dotsc,N
$$
or binary classification, with link function $g$ (e.g., logit or probit) and
$$
\mathbb{E}[Y|X=x] = g^{-1}(f(x)).
$$
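For reference, here is a minimal sketch of GP regression under this model (my own illustration, not from the papers below; the squared-exponential kernel, noise level, and $f_0$ are assumed choices), computing the posterior mean and covariance of $f$ at test points.
```python
import numpy as np

def sq_exp_kernel(x1, x2, lengthscale=0.5, variance=1.0):
    """Squared-exponential covariance k(x, x') (an assumed choice, not from the papers)."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_regression_posterior(X, Y, X_test, sigma2=0.1, mu=lambda x: np.zeros_like(x)):
    """Posterior mean and covariance of f at X_test under the standard GP regression model."""
    K = sq_exp_kernel(X, X)
    K_s = sq_exp_kernel(X, X_test)
    K_ss = sq_exp_kernel(X_test, X_test)
    A = K + sigma2 * np.eye(len(X))                    # Gram matrix plus observation noise
    alpha = np.linalg.solve(A, Y - mu(X))
    post_mean = mu(X_test) + K_s.T @ alpha
    post_cov = K_ss - K_s.T @ np.linalg.solve(A, K_s)
    return post_mean, post_cov

# Example: data from a fixed f_0 with iid Gaussian noise (random design on [0, 1]).
rng = np.random.default_rng(1)
f_0 = lambda x: np.sin(2 * np.pi * x)
X = rng.uniform(0, 1, size=50)
Y = f_0(X) + rng.normal(0, np.sqrt(0.1), size=50)
X_test = np.linspace(0, 1, 5)
mean, cov = gp_regression_posterior(X, Y, X_test)
print(np.round(mean, 2), np.round(np.sqrt(np.diag(cov)), 3))
```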
#### <span style="color:#ff7f0e">1) How is the data generated? </span>
Recall $P_0$ denotes the true data generation process. Included in this process are a few choices (see the simulation sketch after this list):
- **How are the errors $\epsilon_1,\dotsc,\epsilon_N$ generated?** Typically assume *iid* Gaussian, but other distributions are possible (e.g., Double Exponential).
- **How are the covariates $X_1,\dotsc,X_N$ generated?** Let $Q_X$ denote their distribution. There are two possibilities:
- *Random design*: covariates are random, typically *iid*, and possibly over a bounded domain.
- *Non-random design*: covariates are fixed. May need to assume spacing between them decreases with $N$.
- **How is the regression function $f_0$ generated?** There are three possibilities:
- $f_0$ is drawn from the prior
- $f_0$ is drawn from some other distribution
- $f_0$ is fixed
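To make these choices concrete, here is a minimal simulation sketch (my own illustration; the domain, noise scale, and $f_0$ are assumptions) that toggles between the design and noise options listed above.
```python
import numpy as np

rng = np.random.default_rng(2)

def generate_data(N, f_0, design="random", noise="gaussian", sigma=0.3):
    """Sketch of a data generation process P_0 with the choices listed above
    (an assumed setup for illustration, not taken from the papers)."""
    if design == "random":
        X = rng.uniform(0, 1, size=N)          # iid covariates on a bounded domain
    else:
        X = np.linspace(0, 1, N)               # fixed design with spacing ~ 1/N
    if noise == "gaussian":
        eps = rng.normal(0, sigma, size=N)
    else:
        # Double Exponential (Laplace) errors; scale chosen so the variance is sigma^2
        eps = rng.laplace(0, sigma / np.sqrt(2), size=N)
    return X, f_0(X) + eps

# Here f_0 is fixed; it could instead be a draw from the GP prior or from another distribution.
X, Y = generate_data(200, f_0=lambda x: np.sin(2 * np.pi * x), design="fixed", noise="laplace")
```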
#### <span style="color:#d62728">2) With respect to the data, what type of convergence do you want to show?</span>
The goal is to show convergence in distribution (i.e. of the posterior distribution), but *with respect to random data* generated by $P_0$. What kind of assurance do we want regarding the randomness in the data?
- **In probability convergence**: The probability of an "unusual" result --- i.e., receiving a dataset under $P_0$ for which the posterior differs by $\epsilon$ from its limit --- goes to zero as $N\to\infty$ (i.e., for all $\epsilon>0$).
- **Almost sure convergence**: The datasets for which the posterior does not converge have probability zero under $P_0$.
<!---
If you want to be technical, let $(\Omega, \mathcal{F}, \text{Pr})$ be the probability space that induces the distribution $P_0$ on the random variable $\mathcal{D}_N:\Omega \to (\mathcal{X}\times\mathcal{Y})^N$, then
$$
\text{Pr}\left( \omega \in \Omega : \lim_{N\to\infty} p(A_{\theta_0} | \mathcal{D}_N(\omega)) = 1 \right)
$$
-->
Almost sure convergence is stronger, so it will likely require stronger assumptions and/or weaker notions of distance (see the next section).
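To see the in-probability notion operationally, here is a sketch building on the conjugate Normal illustration above (again my own assumed setup): it simulates many independent datasets from $P_0$ and reports how often the posterior puts at least $0.95$ of its mass on the neighborhood $A_{\theta_0}$; in-probability convergence means this fraction tends to 1.
```python
import numpy as np
from scipy import stats

# Continuing the conjugate Normal illustration (assumed settings, for illustration only).
rng = np.random.default_rng(4)
theta_0, eps, n_datasets = 1.5, 0.1, 500

for N in [10, 100, 1000, 10000]:
    hits = 0
    for _ in range(n_datasets):
        data = rng.normal(theta_0, 1.0, size=N)
        post_var = 1.0 / (1.0 + N)
        post_mean = post_var * data.sum()
        post = stats.norm(post_mean, np.sqrt(post_var))
        prob = post.cdf(theta_0 + eps) - post.cdf(theta_0 - eps)
        hits += prob >= 0.95
    print(f"N = {N:6d}   fraction of datasets with p(A | D_N) >= 0.95: {hits / n_datasets:.2f}")
```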
#### <span style="color:#1f77b4">3) On which space do you want to define convergence and what topology do you want to use?</span>
- **In function space**. Here convergence is assessed in an equivalence class of functions under some topology. A popular choice of topology is the $L^p(Q_X)$ topology, which has corresponding distance metric
$$ d(f_1, f_2) = \left(\int |f_1(x)-f_2(x)|^p \, dQ_X(x) \right)^{1/p}.
$$
For example, in binary classification, GP consistency has been shown under the $L^1(\lambda)$ topology, where $\lambda$ is the Lebesgue measure.
\
A weaker topology is that of convergence in probability, which can be metrized by
$$
d(f_1, f_2) = \inf\{\epsilon > 0: Q_X(\{x: |f_1(x) - f_2(x)| > \epsilon \}) < \epsilon \}.
$$
It has the property that $f_n(X)$ converges in probability to $f(X)$ (for $X\sim Q_X$) if and only if $\lim_{n\to\infty} d(f_n, f)=0$, hence its name. This topology is used to show GP consistency in a regression context.
- **In function *distribution* space**. Here convergence is assessed in the joint distribution space of $(X, Y)$. If $X\sim Q_X$ and $Y$ has conditional density $q_Y(y\mid x)$ with respect to the Lebesgue measure $\lambda$, then $q_Y(y\mid x)$ is the joint density of $(X,Y)$ with respect to the product measure $Q_X \times \lambda$. The Hellinger distance defines a metric on this space of joint distributions:
$$
d_H(q_1, q_2) = \left(\int\left( \sqrt{q_1(y|x)} - \sqrt{q_2(y|x)} \right)^2~d(Q_X \times \lambda)\right)^{1/2}.
$$
GP regression consistency can be shown for this topology. A Monte Carlo sketch of all three distances follows below.
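As an aside not in the papers, here is a small Python sketch (my own illustration) that estimates the three distances above by Monte Carlo over draws from $Q_X$, assuming a uniform $Q_X$ on $[0,1]$ and, for the Hellinger case, Gaussian conditional densities $q_i(y\mid x) = \mathcal{N}(y; f_i(x), 1)$.
```python
import numpy as np

rng = np.random.default_rng(3)
Qx_samples = rng.uniform(0, 1, size=100_000)   # draws from Q_X (assumed uniform here)

f1 = lambda x: np.sin(2 * np.pi * x)
f2 = lambda x: np.sin(2 * np.pi * x) + 0.2 * x

def lp_distance(f1, f2, x, p=2):
    """L^p(Q_X) distance, estimated by Monte Carlo over Q_X."""
    return np.mean(np.abs(f1(x) - f2(x)) ** p) ** (1 / p)

def ky_fan_distance(f1, f2, x, grid=np.linspace(0, 2, 2001)):
    """In-probability metric: smallest eps on the grid with Q_X(|f1 - f2| > eps) < eps."""
    diff = np.abs(f1(x) - f2(x))
    for eps in grid[1:]:
        if np.mean(diff > eps) < eps:
            return eps
    return grid[-1]

def hellinger_distance(q1, q2, x, y=np.linspace(-5, 5, 2001)):
    """Hellinger distance between conditional densities q1(y|x), q2(y|x):
    quadrature over y, Monte Carlo over Q_X, square root at the end."""
    dy = y[1] - y[0]
    inner = ((np.sqrt(q1(y[None, :], x[:, None])) -
              np.sqrt(q2(y[None, :], x[:, None]))) ** 2).sum(axis=1) * dy
    return np.sqrt(np.mean(inner))

gauss = lambda y, m: np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)
q1 = lambda y, x: gauss(y, f1(x))   # Y | X = x ~ N(f1(x), 1)
q2 = lambda y, x: gauss(y, f2(x))

print(lp_distance(f1, f2, Qx_samples),
      ky_fan_distance(f1, f2, Qx_samples),
      hellinger_distance(q1, q2, Qx_samples[:1000]))
```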
## When are GPs known to be consistent?
### 1D regression [[Choi and Schervish 2007](https://kilthub.cmu.edu/articles/journal_contribution/Posterior_Consistency_in_Nonparametric_Regression_Problems_under_Gaussian_Process_Priors/6586832)]
**Assumptions**
- *Data*:
- $Y\in\mathbb{R}$.
- $X$ belongs to a bounded interval $[0,1]$ (so it's 1-dimensional).
- The covariates are either *iid* distributed over $[0,1]$ (random design) or fixed with the maximum spacing between adjacent design points less than $C/N$ for some constant $C$ (fixed design).
- Errors $\epsilon_n$ are *iid* Gaussian or Double Exponential.
- *Prior*:
- The first derivative of the GP prior mean $\mu$ is differentiable, and the fourth partial derivatives of the GP prior covariance $k$ are differentiable.
- There is a prior over the noise variance $\sigma^2$ with support on $(0,\infty)$.
- *Ground truth*:
- $f_0$ is a continuously differentiable function on $[0,1]$.
**Results**
- If the covariates are fixed, the posterior is almost surely consistent according to the $L^1(Q_X)$ topology.
- If the covariates are random, the posterior is almost surely consistent according to the in probability topology *and* the Hellinger topology.
### Binary classification [[Ghosal and Roy](https://arxiv.org/pdf/math/0702686.pdf)]
**Assumptions**
- *Data*:
- $Y$ is binary.
- $X$ belongs to a compact subset $\mathcal{X}\subset\mathbb{R}^D$.
- The covariates are either *iid* distributed over $\mathcal{X}$ or fixed, under some assumptions about their separation.
- *Prior*:
- The prior covariance is of the form: $k(x,x')=\tau^2 k_0(\tfrac{x}{l},\tfrac{x'}{l})$, where $\tau>0$ and $l>0$ are hyperparameters.
- $\tau$ and $l$ have priors. The prior over $l$ is fully supported on $(0,\infty)$.
- $k_0(x,x')$ has continuous partial derivatives up to order $2\alpha+2$.
- The prior mean $\mu$ belongs to the RKHS, $\bar{\mathcal{H}}$.
- *Ground truth*
- $f_0$ belongs to the RKHS, $\bar{\mathcal{H}}$.
The RKHS can be described via the reproducing kernel map construction. Start with the set of finite kernel expansions
$$
\mathcal{H} = \left\{f(x) = \sum_{m=1}^M \alpha_m k_0(\tfrac{x}{l},\tfrac{x_m}{l}): M\in\mathbb{N},\; x_m\in\mathcal{X},\; \alpha_m\in\mathbb{R},\; l>0 \right\}.
$$
The RKHS $\bar{\mathcal{H}}$ is the closure of this set. Notice that the amplitude $\tau$ does not change the RKHS, so it is excluded from the construction, whereas the length-scale $l$ does change the RKHS, so it is included.
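To make the construction concrete, here is a small Python sketch (my own illustration, with an assumed squared-exponential base kernel $k_0$) of an element of the set $\mathcal{H}$ above, i.e., a finite linear combination of rescaled kernel sections.
```python
import numpy as np

# Assumed base kernel k_0 (squared exponential), purely for illustration.
k0 = lambda u, v: np.exp(-0.5 * (u - v) ** 2)

def rkhs_element(alphas, centers, l):
    """f(x) = sum_m alpha_m k_0(x / l, x_m / l): an element of the pre-RKHS H."""
    alphas, centers = np.asarray(alphas), np.asarray(centers)
    return lambda x: (alphas[None, :] * k0(np.atleast_1d(x)[:, None] / l,
                                           centers[None, :] / l)).sum(axis=1)

f = rkhs_element(alphas=[1.0, -0.5, 2.0], centers=[0.1, 0.4, 0.9], l=0.3)
print(f(np.linspace(0, 1, 5)))

# The RKHS norm of such an element is sqrt(alpha^T K alpha) with K_{mn} = k_0(x_m/l, x_n/l);
# the full RKHS is the closure of these finite sums under this norm.
```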
**Results**
- The posterior is consistent in probability according to the $L^1(\lambda)$ topology (I'm unsure whether this assumes $Q_X = \lambda$).