# GP consistency

[toc]

[Link to other papers I found](https://docs.google.com/document/d/1en4Ewx2FKI3UyCkVNXeTh1015iF6q7jRaH8uXnWOYyE/edit?usp=sharing)

## What is consistency?

### General definition

Bayesian consistency is more subtle than frequentist consistency since the quantity of interest is a distribution (i.e., the posterior distribution) rather than a function (i.e., an estimator). I tried to summarize the various definitions in the following general form:

:::success
***Definition*: Bayesian consistency (general form)**: Given a dataset $\mathcal{D}_N$ of $N$ observations drawn from <span style="color:#ff7f0e">[data generation process $P_0$]</span>, the posterior is *consistent* if for any neighborhood $A_{\theta_0}$ <span style="color:#1f77b4">[according to topology $\mathcal{T}$]</span> of the ground truth <span style="color:#1f77b4">[object $\theta_0$]</span>,
\begin{equation}
p(A_{\theta_0} | \mathcal{D}_N ) \to 1 \text{ as } N\to\infty,
\end{equation}
where convergence is of <span style="color:#d62728">[a chosen type]</span> under <span style="color:#ff7f0e">[data generation process $P_0$]</span>.
:::

A few notes:
- The object $\theta_0$ could be a parameter, a function, or even a distribution.
- Think of $p(A_{\theta_0} | \mathcal{D}_N )$, the posterior probability of a neighborhood around the ground truth, as a random variable, since it is a function of the random variable $\mathcal{D}_N \sim P_0$. Convergence of this random variable can be assessed by any of the standard notions of convergence of random variables (e.g., in probability, almost sure). This is the "<span style="color:#d62728">chosen type</span>" of convergence mentioned above.
- Since $p(A_{\theta_0} | \mathcal{D}_N )$ converges to the constant 1, I believe the statement above is equivalent to (and does not just imply) the posterior distribution converging to a point mass at $\theta_0$.

### Special case: ground truth drawn from the prior

If the true value of the object of interest $\theta_0\in\Theta$ (e.g., a parameter vector) is drawn from the prior, then under fairly general conditions consistency follows for *any* Bayesian model by **Doob's Theorem**. A common criticism of this result is that it only guarantees consistency outside a set of prior probability zero, and that exceptional set could be very "large" for two reasons:
1. The prior is very concentrated in a small subset of $\Theta$,
2. The space $\Theta$ is infinite dimensional.

For an extreme example of the first case, if the prior is a point mass $\delta_{\theta_1}$ for some $\theta_1\in\Theta$, then by construction the true parameter is $\theta_1$ (since this is the only value that can be drawn from the prior) and hence the posterior is consistent, but only because it can't really be anything else. This is the usual criticism of the practical importance of Doob's theorem. [Miller (2018)](https://arxiv.org/pdf/1801.03122.pdf), however, argues that it misses the point, since for a well-chosen prior consistency can be guaranteed on a very large set.

### Defining consistency for GPs

Next I'll discuss each of the choices that go into defining consistency. To be concrete, we observe data $\mathcal{D}_N = \{(X_n,Y_n)\}_{n=1}^N$ and assume the standard GP model
$$
f(\cdot) \sim \mathcal{G}\mathcal{P}(\mu(\cdot), k(\cdot, \cdot))
$$
used for either regression,
$$
Y_n = f(X_n) + \epsilon_n,\quad \epsilon_n \sim \mathcal{N}(0, \sigma^2), \quad n=1,\dotsc,N,
$$
or binary classification, with
$$
\mathbb{E}[Y|X=x] = g^{-1}(f(x))
$$
for a link function $g$ (e.g., probit or logit).
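To make the regression model concrete, here is a minimal numerical sketch of the GP posterior under the model above. It is not taken from any of the referenced papers; it assumes a zero prior mean $\mu \equiv 0$ and a squared-exponential kernel, and the names `rbf_kernel` and `gp_posterior` and all hyperparameter values are illustrative choices.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.25, variance=1.0):
    """Squared-exponential kernel k(x, x') on 1-D inputs (an illustrative choice)."""
    sq = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def gp_posterior(X, y, X_star, sigma2=0.1, **kernel_args):
    """Posterior mean and covariance of f(X_star) given D_N = {(X_n, Y_n)},
    under Y_n = f(X_n) + eps_n with eps_n ~ N(0, sigma2) and prior mean mu = 0."""
    K = rbf_kernel(X, X, **kernel_args) + sigma2 * np.eye(len(X))
    K_s = rbf_kernel(X, X_star, **kernel_args)
    K_ss = rbf_kernel(X_star, X_star, **kernel_args)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                 # E[f(x*) | D_N]
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                 # Cov[f(x*) | D_N]
    return mean, cov
```

Everything that follows about consistency is a statement about how this posterior behaves as $N\to\infty$.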
#### <span style="color:#ff7f0e">1) How is the data generated?</span>

Recall $P_0$ denotes the true data generation process. Included in this process are a few things:
- **How are the errors $\epsilon_1,\dotsc,\epsilon_N$ generated?** Typically assumed to be *iid* Gaussian, but other distributions are possible (e.g., Double Exponential).
- **How are the covariates $X_1,\dotsc,X_N$ generated?** Let $Q_X$ denote their distribution. There are two possibilities:
    - *Random design*: covariates are random, typically *iid*, and possibly over a bounded domain.
    - *Non-random design*: covariates are fixed. One may need to assume the spacing between them decreases with $N$.
- **How is the regression function $f_0$ generated?** There are three possibilities:
    - $f_0$ is drawn from the prior
    - $f_0$ is drawn from some other distribution
    - $f_0$ is fixed

#### <span style="color:#d62728">2) With respect to the data, what type of convergence do you want to show?</span>

The goal is to show convergence of the posterior distribution, but the posterior itself depends on the *random data* generated by $P_0$. What kind of assurance do we want regarding the randomness in the data?
- **Convergence in probability**: The probability of an "unusual" result --- i.e., receiving a dataset under $P_0$ for which the posterior probability $p(A_{\theta_0} | \mathcal{D}_N)$ differs from its limit by more than $\epsilon$ --- goes to zero as $N\to\infty$, for every $\epsilon>0$.
- **Almost sure convergence**: The datasets for which the posterior does not converge have probability zero under $P_0$.

<!--- If you want to be technical, let $(\Omega, \mathcal{F}, \text{Pr})$ be the probability space that induces the distribution $P_0$ on the random variable $\mathcal{D}_N:\Omega \to (\mathcal{X}\times\mathcal{Y})^N$, then
$$
\text{Pr}\left( \omega \in \Omega : \lim_{N\to\infty} p(A_{\theta_0} | \mathcal{D}_N(\omega)) = 1 \right) = 1.
$$
-->

Almost sure convergence is stronger, so it will likely require stronger assumptions and/or weaker notions of distance (see next section).

#### <span style="color:#1f77b4">3) On which space do you want to define convergence and what topology do you want to use?</span>

- **In function space**. Here convergence is assessed in an equivalence class of functions under some topology. A popular choice is the $L^p(Q_X)$ topology, with corresponding distance metric
    $$
    d(f_1, f_2) = \left(\int |f_1(x)-f_2(x)|^p \, dQ_X(x) \right)^{1/p}.
    $$
    For example, in binary classification, GP consistency has been shown under the $L^1(\lambda)$ topology, where $\lambda$ is the Lebesgue measure.

    A weaker topology is that of convergence in probability, which can be metrized by
    $$
    d(f_1, f_2) = \inf\{\epsilon > 0: Q_X(\{x: |f_1(x) - f_2(x)| > \epsilon \}) \le \epsilon \}.
    $$
    It has the property that $f_n(X)$ converges to $f(X)$ in probability (with $X \sim Q_X$) if and only if $\lim_{n\to\infty} d(f_n, f)=0$, hence its name. This topology is used to show GP consistency in a regression context.
- **In function *distribution* space**. Here convergence is assessed in the space of joint distributions of $(X, Y)$. If $X\sim Q_X$ and $Y$ has conditional density $q_Y(y\mid x)$ with respect to the Lebesgue measure $\lambda$, then $q_Y(y\mid x)$ is the joint density of $(X,Y)$ with respect to the product measure $Q_X \times \lambda$. The Hellinger distance defines a metric on this space of joint distributions:
    $$
    d_H(q_1, q_2) = \left(\int\left( \sqrt{q_1(y|x)} - \sqrt{q_2(y|x)} \right)^2~d(Q_X \times \lambda)\right)^{1/2}.
    $$
    GP regression consistency can be shown for this topology.
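As a concrete illustration of these distances (my own sketch, not from the cited papers), the snippet below estimates the $L^p(Q_X)$ metric from draws $x \sim Q_X$ and the Hellinger metric by combining draws from $Q_X$ with a grid over $y$. The function names, the assumed signature `q(y, x)` for conditional densities, and the grid-based $y$-integration are all illustrative choices.

```python
import numpy as np

def lp_distance(f1, f2, x_samples, p=2):
    """Monte Carlo estimate of the L^p(Q_X) distance, given draws x_samples ~ Q_X."""
    return np.mean(np.abs(f1(x_samples) - f2(x_samples)) ** p) ** (1.0 / p)

def hellinger_distance(q1, q2, x_samples, y_grid):
    """Estimate of d_H(q1, q2) between conditional densities q_i(y | x):
    the y-integral uses a uniform grid, the x-integral uses draws x_samples ~ Q_X."""
    dy = y_grid[1] - y_grid[0]
    # Evaluate the densities on the (x, y) product grid; q_i must accept arrays.
    Y, X = np.meshgrid(y_grid, x_samples)
    sq_diff = (np.sqrt(q1(Y, X)) - np.sqrt(q2(Y, X))) ** 2
    return np.sqrt(np.mean(sq_diff.sum(axis=1) * dy))

# Example: L^1(Q_X) distance between two functions under Q_X = Uniform[0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10_000)
print(lp_distance(np.sin, np.cos, x, p=1))
```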
## When are GPs known to be consistent?

### 1D regression [[Choi and Schervish 2007](https://kilthub.cmu.edu/articles/journal_contribution/Posterior_Consistency_in_Nonparametric_Regression_Problems_under_Gaussian_Process_Priors/6586832)]

**Assumptions**
- *Data*:
    - $Y\in\mathbb{R}$.
    - $X$ belongs to a bounded interval $[0,1]$ (so it's 1-dimensional).
    - The $X$'s are either *iid* distributed over $[0,1]$, or they are fixed with the maximum distance between adjacent $x$'s less than $C/N$ for some constant $C$.
    - Errors $\epsilon_n$ are *iid* Gaussian or Double Exponential.
- *Prior*:
    - The first derivative of the GP prior mean $\mu$ is differentiable, and the 4th partial derivative of the GP prior covariance $k$ is differentiable.
    - There is a prior over the noise variance $\sigma^2$ with support on $(0,\infty)$.
- *Ground truth*:
    - $f_0$ is a continuously differentiable function on $[0,1]$.

**Results**
- If the covariates are fixed, the posterior is almost surely consistent according to the $L^1(Q_X)$ topology.
- If the covariates are random, the posterior is almost surely consistent according to the in-probability topology *and* the Hellinger topology.

### Binary classification [[Ghosal and Roy](https://arxiv.org/pdf/math/0702686.pdf)]

**Assumptions**
- *Data*:
    - $Y$ is binary.
    - $X$ belongs to a compact subset $\mathcal{X}\subset\mathbb{R}^D$.
    - The $X$'s are either *iid* distributed over $\mathcal{X}$ or fixed, under some assumptions about their separation.
- *Prior*:
    - The prior covariance is of the form $k(x,x')=\tau^2 k_0(\tfrac{x}{l},\tfrac{x'}{l})$, where $\tau>0$ and $l>0$ are hyperparameters.
    - $\tau$ and $l$ have priors. The prior over $l$ is fully supported on $(0,\infty)$.
    - $k_0(x,x')$ has continuous partial derivatives up to order $2\alpha+2$.
    - The prior mean $\mu$ belongs to the RKHS, $\bar{\mathcal{H}}$.
- *Ground truth*:
    - $f_0$ belongs to the RKHS, $\bar{\mathcal{H}}$. We can write the RKHS using the reproducing kernel map construction:
    $$
    \mathcal{H} = \left\{f(x) = \sum_{m=1}^M \alpha_m k_0(\tfrac{x}{l},\tfrac{x_m}{l}): M\in\mathbb{N},\; x_m\in\mathcal{X},\; \alpha_m\in\mathbb{R},\; l>0 \right\}.
    $$
    The RKHS is the closure of this set, $\bar{\mathcal{H}}$. Notice that including $\tau$ does not change the RKHS, so it's excluded, but $l$ does change the RKHS, so it's included.

**Results**
- The posterior is consistent in probability according to the $L^1(\lambda)$ topology (I'm confused about whether this assumes $Q_X = \lambda$).
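To connect these results back to the definition, here is a small simulation sketch (my own illustration, not from either paper) that estimates the posterior probability of an $L^2(Q_X)$ neighborhood of a fixed ground truth $f_0$ as $N$ grows, reusing the `gp_posterior` helper sketched earlier. The choice of $f_0$, the radius $\epsilon$, and all hyperparameters are arbitrary.

```python
import numpy as np
# Assumes rbf_kernel and gp_posterior from the earlier sketch are in scope.

rng = np.random.default_rng(0)
f0 = lambda x: np.sin(2 * np.pi * x)          # fixed ground-truth regression function
x_grid = np.linspace(0, 1, 200)               # grid approximating Q_X = Uniform[0, 1]
sigma2, eps = 0.1, 0.1                        # noise variance and neighborhood radius

for N in [10, 50, 250, 1000]:
    X = rng.uniform(0, 1, N)                               # random design
    y = f0(X) + rng.normal(0.0, np.sqrt(sigma2), N)        # iid Gaussian errors
    mean, cov = gp_posterior(X, y, x_grid, sigma2=sigma2)
    # Draw posterior sample paths on the grid and estimate p(A_{f0} | D_N),
    # the posterior probability that f lies within eps of f0 in L^2(Q_X).
    paths = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(x_grid)), size=500)
    l2 = np.sqrt(np.mean((paths - f0(x_grid)) ** 2, axis=1))
    print(N, np.mean(l2 < eps))               # should tend to 1 as N grows if consistent
```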