# GP consistency
[toc]
[Link to other papers I found](https://docs.google.com/document/d/1en4Ewx2FKI3UyCkVNXeTh1015iF6q7jRaH8uXnWOYyE/edit?usp=sharing)
## What is consistency?
### General definition
Bayesian consistency is more subtle than frequentist consistency since the quantity of interest is a distribution (i.e., the posterior distribution) rather than a function (i.e., an estimator). I tried to summarize the various definitions in the following general form:
:::success
***Definition*: Bayesian consistency (general form)**: Given a dataset $\mathcal{D}_N$ of $N$ observations drawn from <span style="color:#ff7f0e">[data generation process $P_0$]</span>, the posterior is *consistent* if for any neighborhood $A_{\theta_0}$ <span style="color:#1f77b4">[according to topology $\mathcal{T}$]</span> of the ground truth <span style="color:#1f77b4">[object $\theta_0$]</span>,
\begin{equation}
p(A_{\theta_0} | \mathcal{D}_N ) \to 1 \text{ as } N\to\infty,
\end{equation}
where convergence is of <span style="color:#d62728">[a chosen type]</span> under <span style="color:#ff7f0e">[data generation process $P_0$]</span>.
:::
A few notes:
- The object $\theta_0$ could be a parameter, a function, or even a distribution.
- Think of $p(A_{\theta_0} | \mathcal{D}_N )$, the posterior probability of a neighborhood around the ground truth, as a random variable, since it is a function of the random variable $\mathcal{D}_N \sim P_0$. Convergence of this random variable can be assessed by any of the standard notions of convergence of random variables (e.g., in probability, almost surely). This is the "<span style="color:#d62728">chosen type</span>" of convergence mentioned above.
- Since $p(A_{\theta_0} | \mathcal{D}_N )$ converges to the constant 1, I believe the equation above is equivalent to (and does not merely imply) the posterior distribution converging to a point mass at $\theta_0$. A small worked example of the definition follows.
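To make the definition concrete, here is a minimal Python sketch (my own illustration, not from any of the papers below) in a conjugate Normal model: the ground-truth object $\theta_0$ is a scalar mean, the neighborhood $A_{\theta_0}$ is the interval $(\theta_0-\epsilon, \theta_0+\epsilon)$, and the posterior probability of that neighborhood is available in closed form for a single dataset as $N$ grows. The prior, noise scale, and $\epsilon$ are all assumed choices.
```python
import numpy as np
from scipy import stats

# Assumed illustration: theta_0 is a scalar mean, data are iid N(theta_0, 1),
# and the prior is N(0, 1), so the posterior is Normal in closed form.
rng = np.random.default_rng(0)
theta_0, eps = 1.5, 0.1  # ground truth and neighborhood radius

for N in [10, 100, 1000, 10000]:
    data = rng.normal(theta_0, 1.0, size=N)   # one dataset D_N drawn from P_0
    post_var = 1.0 / (1.0 + N)                # posterior variance (unit prior/noise variance)
    post_mean = post_var * data.sum()         # posterior mean
    post = stats.norm(post_mean, np.sqrt(post_var))
    # Posterior probability of A_{theta_0} = (theta_0 - eps, theta_0 + eps)
    prob = post.cdf(theta_0 + eps) - post.cdf(theta_0 - eps)
    print(f"N = {N:6d}   p(A | D_N) = {prob:.4f}")
```
Note that this prints $p(A_{\theta_0} | \mathcal{D}_N)$ for one realized dataset; the "<span style="color:#d62728">chosen type</span>" of convergence concerns how this random quantity behaves over repeated draws of $\mathcal{D}_N \sim P_0$.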
### Special case: ground truth drawn from the prior
If the true value of the object of interest $\theta_0\in\Theta$ (e.g. a parameter vector) is drawn from the prior, then under fairly general conditions consistency follows for *any* Bayesian model by **Doob's Theorem**.
A common criticism of this result is that it holds only up to a set of prior probability zero, and this exceptional set can in practice be very "large" for two reasons:
1. The prior is very concentrated in a small subset of $\Theta$,
2. The space $\Theta$ is infinite dimensional.
For an extreme example of the first case, if the prior is a point mass $\delta_{\theta_1}$ for some $\theta_1\in\Theta$, then by construction the true parameter is $\theta_1$ (since this is the only value that can be drawn from the prior) and hence the posterior is consistent, but only because it can't really be anything else.
[Miller (2018)](https://arxiv.org/pdf/1801.03122.pdf) however argues that this criticism misses the point, since for a well-chosen prior, consistency can be guaranteed on a very large set.
### Defining consistency for GPs
Next I'll discuss each of the choices that go into defining consistency. To be concrete, we observe data $\mathcal{D}_N = \{(X_n,Y_n)\}_{n=1}^N$ and assume the standard GP model
$$
f(\cdot) \sim \mathcal{G}\mathcal{P}(\mu(\cdot), k(\cdot, \cdot))
$$
used for either regression
$$
Y_n = f(X_n) + \epsilon_n,\quad \epsilon_n \sim \mathcal{N}(0, \sigma^2), \quad n=1,\dotsc,N
$$
or binary classification, with link function $g$ (e.g., logit or probit) and
$$
\mathbb{E}[Y|X=x] = g^{-1}(f(x)).
$$
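For reference, here is a minimal sketch of GP regression under this model (my own illustration, not from the papers below; the squared-exponential kernel, noise level, and $f_0$ are assumed choices), computing the posterior mean and covariance of $f$ at test points.
```python
import numpy as np

def sq_exp_kernel(x1, x2, lengthscale=0.5, variance=1.0):
    """Squared-exponential covariance k(x, x') (an assumed choice, not from the papers)."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_regression_posterior(X, Y, X_test, sigma2=0.1, mu=lambda x: np.zeros_like(x)):
    """Posterior mean and covariance of f at X_test under the standard GP regression model."""
    K = sq_exp_kernel(X, X)
    K_s = sq_exp_kernel(X, X_test)
    K_ss = sq_exp_kernel(X_test, X_test)
    A = K + sigma2 * np.eye(len(X))                    # Gram matrix plus observation noise
    alpha = np.linalg.solve(A, Y - mu(X))
    post_mean = mu(X_test) + K_s.T @ alpha
    post_cov = K_ss - K_s.T @ np.linalg.solve(A, K_s)
    return post_mean, post_cov

# Example: data from a fixed f_0 with iid Gaussian noise (random design on [0, 1]).
rng = np.random.default_rng(1)
f_0 = lambda x: np.sin(2 * np.pi * x)
X = rng.uniform(0, 1, size=50)
Y = f_0(X) + rng.normal(0, np.sqrt(0.1), size=50)
X_test = np.linspace(0, 1, 5)
mean, cov = gp_regression_posterior(X, Y, X_test)
print(np.round(mean, 2), np.round(np.sqrt(np.diag(cov)), 3))
```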
#### <span style="color:#ff7f0e">1) How is the data generated? </span>
Recall $P_0$ denotes the true data generation process. Included in this process are a few choices (see the simulation sketch after this list):
- **How are the errors $\epsilon_1,\dotsc,\epsilon_N$ generated?** Typically assume *iid* Gaussian, but other distributions are possible (e.g., Double Exponential).
- **How are the covariates $X_1,\dotsc,X_N$ generated?** Let $Q_X$ denote their distribution. There are two possibilities:
- *Random design*: covariates are random, typically *iid*, and possibly over a bounded domain.
- *Non-random design*: covariates are fixed. May need to assume spacing between them decreases with $N$.
- **How is the regression function $f_0$ generated?** There are three possibilities:
- $f_0$ is drawn from the prior
- $f_0$ is drawn from some other distribution
- $f_0$ is fixed
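To make these choices concrete, here is a minimal simulation sketch (my own illustration; the domain, noise scale, and $f_0$ are assumptions) that toggles between the design and noise options listed above.
```python
import numpy as np

rng = np.random.default_rng(2)

def generate_data(N, f_0, design="random", noise="gaussian", sigma=0.3):
    """Sketch of a data generation process P_0 with the choices listed above
    (an assumed setup for illustration, not taken from the papers)."""
    if design == "random":
        X = rng.uniform(0, 1, size=N)          # iid covariates on a bounded domain
    else:
        X = np.linspace(0, 1, N)               # fixed design with spacing ~ 1/N
    if noise == "gaussian":
        eps = rng.normal(0, sigma, size=N)
    else:
        # Double Exponential (Laplace) errors; scale chosen so the variance is sigma^2
        eps = rng.laplace(0, sigma / np.sqrt(2), size=N)
    return X, f_0(X) + eps

# Here f_0 is fixed; it could instead be a draw from the GP prior or from another distribution.
X, Y = generate_data(200, f_0=lambda x: np.sin(2 * np.pi * x), design="fixed", noise="laplace")
```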
#### <span style="color:#d62728">2) With respect to the data, what type of convergence do you want to show?</span>
The goal is to show convergence in distribution (i.e. of the posterior distribution), but *with respect to random data* generated by $P_0$. What kind of assurance do we want regarding the randomness in the data?
- **In probability convergence**: The probability of an "unusual" result --- i.e., receiving a dataset under $P_0$ for which the posterior differs by $\epsilon$ from its limit --- goes to zero as $N\to\infty$ (i.e., for all $\epsilon>0$).
- **Almost sure convergence**: The datasets for which the posterior does not converge have probability zero under $P_0$.
<!---
If you want to be technical, let $(\Omega, \mathcal{F}, \text{Pr})$ be the probability space that induces the distribution $P_0$ on the random variable $\mathcal{D}_N:\Omega \to (\mathcal{X}\times\mathcal{Y})^N$, then
$$
\text{Pr}\left( \omega \in \Omega : \lim_{N\to\infty} p(A_{\theta_0} | \mathcal{D}_N(\omega)) = 1 \right)
$$
-->
Almost sure convergence is stronger, so it will likely require stronger assumptions and/or weaker notions of distance (see the next section).
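To see the in-probability notion operationally, here is a sketch building on the conjugate Normal illustration above (again my own assumed setup): it simulates many independent datasets from $P_0$ and reports how often the posterior puts at least $0.95$ of its mass on the neighborhood $A_{\theta_0}$; in-probability convergence means this fraction tends to 1.
```python
import numpy as np
from scipy import stats

# Continuing the conjugate Normal illustration (assumed settings, for illustration only).
rng = np.random.default_rng(4)
theta_0, eps, n_datasets = 1.5, 0.1, 500

for N in [10, 100, 1000, 10000]:
    hits = 0
    for _ in range(n_datasets):
        data = rng.normal(theta_0, 1.0, size=N)
        post_var = 1.0 / (1.0 + N)
        post_mean = post_var * data.sum()
        post = stats.norm(post_mean, np.sqrt(post_var))
        prob = post.cdf(theta_0 + eps) - post.cdf(theta_0 - eps)
        hits += prob >= 0.95
    print(f"N = {N:6d}   fraction of datasets with p(A | D_N) >= 0.95: {hits / n_datasets:.2f}")
```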
#### <span style="color:#1f77b4">3) On which space do you want to define convergence and what topology do you want to use?</span>
- **In function space**. Here convergence is assessed in an equivalence class of functions under some topology. A popular choice of topology is the $L^p(Q_X)$ topology, which has corresponding distance metric
$$ d(f_1, f_2) = \left(\int |f_1(x)-f_2(x)|^p \, dQ_X(x) \right)^{1/p}.
$$
For example, in binary classification, GP consistency has been shown under the $L^1(\lambda)$ topology, where $\lambda$ is the Lebesgue measure.
\
A weaker topology is that of convergence in probability, which can be metrized by
$$
d(f_1, f_2) = \inf\{\epsilon > 0: Q_X(\{x: |f_1(x) - f_2(x)| > \epsilon \}) < \epsilon \}.
$$
It has the property that $f_n(X)$ converges in probability to $f(X)$ (for $X\sim Q_X$) if and only if $\lim_{n\to\infty} d(f_n, f)=0$, hence its name. This topology is used to show GP consistency in a regression context.
- **In function *distribution* space**. Here convergence is assessed in the joint distribution space of $(X, Y)$. If $X\sim Q_X$ and $Y$ has conditional density $q_Y(y\mid x)$ with respect to the Lebesgue measure $\lambda$, then $q_Y(y\mid x)$ is the joint density of $(X,Y)$ with respect to the product measure $Q_X \times \lambda$. The Hellinger distance defines a metric on this space of joint distributions:
$$
d_H(q_1, q_2) = \left(\int\left( \sqrt{q_1(y|x)} - \sqrt{q_2(y|x)} \right)^2~d(Q_X \times \lambda)\right)^{1/2}.
$$
GP regression consistency can be shown for this topology. A Monte Carlo sketch of all three distances follows below.
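As an aside not in the papers, here is a small Python sketch (my own illustration) that estimates the three distances above by Monte Carlo over draws from $Q_X$, assuming a uniform $Q_X$ on $[0,1]$ and, for the Hellinger case, Gaussian conditional densities $q_i(y\mid x) = \mathcal{N}(y; f_i(x), 1)$.
```python
import numpy as np

rng = np.random.default_rng(3)
Qx_samples = rng.uniform(0, 1, size=100_000)   # draws from Q_X (assumed uniform here)

f1 = lambda x: np.sin(2 * np.pi * x)
f2 = lambda x: np.sin(2 * np.pi * x) + 0.2 * x

def lp_distance(f1, f2, x, p=2):
    """L^p(Q_X) distance, estimated by Monte Carlo over Q_X."""
    return np.mean(np.abs(f1(x) - f2(x)) ** p) ** (1 / p)

def ky_fan_distance(f1, f2, x, grid=np.linspace(0, 2, 2001)):
    """In-probability metric: smallest eps on the grid with Q_X(|f1 - f2| > eps) < eps."""
    diff = np.abs(f1(x) - f2(x))
    for eps in grid[1:]:
        if np.mean(diff > eps) < eps:
            return eps
    return grid[-1]

def hellinger_distance(q1, q2, x, y=np.linspace(-5, 5, 2001)):
    """Hellinger distance between conditional densities q1(y|x), q2(y|x):
    quadrature over y, Monte Carlo over Q_X, square root at the end."""
    dy = y[1] - y[0]
    inner = ((np.sqrt(q1(y[None, :], x[:, None])) -
              np.sqrt(q2(y[None, :], x[:, None]))) ** 2).sum(axis=1) * dy
    return np.sqrt(np.mean(inner))

gauss = lambda y, m: np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)
q1 = lambda y, x: gauss(y, f1(x))   # Y | X = x ~ N(f1(x), 1)
q2 = lambda y, x: gauss(y, f2(x))

print(lp_distance(f1, f2, Qx_samples),
      ky_fan_distance(f1, f2, Qx_samples),
      hellinger_distance(q1, q2, Qx_samples[:1000]))
```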
## When are GPs known to be consistent?
### 1D regression [[Choi and Schervish 2007](https://kilthub.cmu.edu/articles/journal_contribution/Posterior_Consistency_in_Nonparametric_Regression_Problems_under_Gaussian_Process_Priors/6586832)]
**Assumptions**
- *Data*:
- $Y\in\mathbb{R}$.
- $X$ belongs to a bounded interval $[0,1]$ (so it's 1-dimensional).
- The covariates are either *iid* distributed over $[0,1]$ (random design) or fixed with the maximum spacing between adjacent design points less than $C/N$ for some constant $C$ (fixed design).
- Errors $\epsilon_n$ are *iid* Gaussian or Double Exponential.
- *Prior*:
- The first derivative of the GP prior mean $\mu$ is differentiable, and the fourth partial derivatives of the GP prior covariance $k$ are differentiable.
- There is a prior over the noise variance $\sigma^2$ with support on $(0,\infty)$.
- *Ground truth*:
- $f_0$ is a continuously differentiable function on $[0,1]$.
**Results**
- If the covariates are fixed, the posterior is almost surely consistent according to the $L^1(Q_X)$ topology.
- If the covariates are random, the posterior is almost surely consistent according to the in probability topology *and* the Hellinger topology.
### Binary classification [[Ghosal and Roy](https://arxiv.org/pdf/math/0702686.pdf)]
**Assumptions**
- *Data*:
- $Y$ is binary.
- $X$ belongs to a compact subset $\mathcal{X}\subset\mathbb{R}^D$.
- The covariates are either *iid* distributed over $\mathcal{X}$ or fixed, under some assumptions about their separation.
- *Prior*:
- The prior covariance is of the form: $k(x,x')=\tau^2 k_0(\tfrac{x}{l},\tfrac{x'}{l})$, where $\tau>0$ and $l>0$ are hyperparameters.
- $\tau$ and $l$ have priors. The prior over $l$ is fully supported on $(0,\infty)$.
- $k_0(x,x')$ has continuous partial derivatives up to order $2\alpha+2$.
- The prior mean $\mu$ belongs to the RKHS, $\bar{\mathcal{H}}$.
- *Ground truth*
- $f_0$ belongs to the RKHS, $\bar{\mathcal{H}}$.
The RKHS can be described via the reproducing kernel map construction. Start with the set of finite kernel expansions
$$
\mathcal{H} = \left\{f(x) = \sum_{m=1}^M \alpha_m k_0(\tfrac{x}{l},\tfrac{x_m}{l}): M\in\mathbb{N},\; x_m\in\mathcal{X},\; \alpha_m\in\mathbb{R},\; l>0 \right\}.
$$
The RKHS $\bar{\mathcal{H}}$ is the closure of this set. Notice that the amplitude $\tau$ does not change the RKHS, so it is excluded from the construction, whereas the length-scale $l$ does change the RKHS, so it is included.
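To make the construction concrete, here is a small Python sketch (my own illustration, with an assumed squared-exponential base kernel $k_0$) of an element of the set $\mathcal{H}$ above, i.e., a finite linear combination of rescaled kernel sections.
```python
import numpy as np

# Assumed base kernel k_0 (squared exponential), purely for illustration.
k0 = lambda u, v: np.exp(-0.5 * (u - v) ** 2)

def rkhs_element(alphas, centers, l):
    """f(x) = sum_m alpha_m k_0(x / l, x_m / l): an element of the pre-RKHS H."""
    alphas, centers = np.asarray(alphas), np.asarray(centers)
    return lambda x: (alphas[None, :] * k0(np.atleast_1d(x)[:, None] / l,
                                           centers[None, :] / l)).sum(axis=1)

f = rkhs_element(alphas=[1.0, -0.5, 2.0], centers=[0.1, 0.4, 0.9], l=0.3)
print(f(np.linspace(0, 1, 5)))

# The RKHS norm of such an element is sqrt(alpha^T K alpha) with K_{mn} = k_0(x_m/l, x_n/l);
# the full RKHS is the closure of these finite sums under this norm.
```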
**Results**
- The posterior is consistent in probability according to the $L^1(\lambda)$ topology (I'm unsure whether this assumes $Q_X = \lambda$).