# Kernel learning project update
## Problem setup
Observe two datasets, $\{x_{0,n}, y_{0,n}\}_{n=1}^{N_0}$ and $\{x_{1,n}, y_{1,n}\}_{n=1}^{N_1}$, generated from two different but somehow related functions $f_0$ and $f_1$:
$$
y_{0,n} = f_0(x_{0,n}) + \epsilon_{0,n}\\
y_{1,n} = f_1(x_{1,n}) + \epsilon_{1,n}
$$
The goal is to predict $f_1$ well ("Stage 1") by "pre-training" on data from $f_0$ ("Stage 0"). Presumably, $N_0\gg N_1$.
This is an abstract problem that I argue could be applied to, e.g., transfer learning, few-shot learning, and continual learning.
## Approach
I will model $f_0$ and $f_1$ as GPs with the same kernel, $K^*$. This is the modeling assumption for how $f_0$ and $f_1$ are related: in Stage 0 you use the $f_0$ data to obtain an estimate $\hat{K}$ of $K^*$, and in Stage 1 you use $\hat{K}$ to estimate $f_1$.
I want to consider a variety of $\hat{K}$ estimates, including kernel hyperparameter learning, linear combinations of kernels, and neural networks (where the last hidden layer gives the feature map of the kernel).
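To make the two-stage procedure concrete, here is a minimal sketch of the hyperparameter-learning variant, assuming a scikit-learn implementation; the kernel family, noise level, sample sizes, and the stand-in functions for $f_0$ and $f_1$ are placeholder choices, not part of the setup above (in particular, the stand-ins are not actual draws from a shared GP).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.default_rng(0)
noise = 0.1

# Stage 0: plenty of data from f_0 (a placeholder function, not a real GP draw).
X0 = rng.uniform(0, 10, size=(200, 1))
y0 = np.sin(X0).ravel() + noise * rng.standard_normal(200)

# Learn kernel hyperparameters (variance, lengthscale) by LML optimization on the Stage 0 data.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp0 = GaussianProcessRegressor(kernel=kernel, alpha=noise**2)
gp0.fit(X0, y0)
K_hat = gp0.kernel_  # the estimated kernel \hat{K}

# Stage 1: scarce data from f_1 (another placeholder); reuse \hat{K} with hyperparameters frozen.
X1 = rng.uniform(0, 10, size=(10, 1))
y1 = np.sin(X1 + 0.3).ravel() + noise * rng.standard_normal(10)

gp1 = GaussianProcessRegressor(kernel=K_hat, alpha=noise**2, optimizer=None)
gp1.fit(X1, y1)
f1_mean, f1_std = gp1.predict(np.linspace(0, 10, 100).reshape(-1, 1), return_std=True)
```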
## Goals
Here are two possible goals, though I'm not sure how easy either is to accomplish.
- Goal 1: Prove learning rates for the whole procedure (Stage 0 + Stage 1). So, for a given true kernel and a kernel learning approach (e.g., log marginal likelihood (LML) optimization for the kernel hyperparameters), what is the expected risk in estimating $f_1$ as a function of $N_0$ and $N_1$ (one way to write this down is below)?
- Goal 2: Show when this approach fails. For example, if the kernel is not identifiable in some sense, or if the ground truth $K^*$ is not in the span of the kernel learning method, then the $\hat{K}$ estimate could be bad and so the $f_1$ estimate could be bad.
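To pin down what Goal 1 is asking for, one way to write the target quantity (my notation, not taken from an existing result) is the expected Stage-1 risk as a function of both sample sizes:
$$
\mathcal{R}(N_0, N_1) = \mathbb{E}\left[\, \big\| \hat{f}_1 - f_1 \big\|_{L_2}^2 \,\right],
$$
where $\hat{f}_1$ is the Stage-1 estimate built from $\hat{K}$ (itself a function of the $N_0$ Stage-0 observations) and the $N_1$ Stage-1 observations, and the expectation is over both datasets and the draws of $f_0$ and $f_1$.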
Along the way, other questions will come up. It's possible these are already answered, but I don't know.
- What's the space of kernels approximated by a neural network? The space of functions approximated by a neural network is well-studied, but I'm not aware of an analogous characterization for the kernels induced by the last hidden layer.
## Current progress
I'll divide progress into two steps: learning the kernel and learning $f_1$ given the kernel.
### Step 1: learning the kernel
I came across the concept of *equivalence* of GPs. I'm still wrapping my head around how this fits into the overall story, but it seems interesting, so I'll say what I've learned so far. An important caveat is that equivalence might go away in higher input dimensions (which would be good for Goal 1 but bad for Goal 2).
#### Equivalence of GPs
Some GPs are equivalent, meaning that no amount of data (from a single function draw) can tell you which GP generated the data; this implies there is no consistent estimator for the kernel hyperparameters. Equivalence does *not* mean the GPs are equally likely, though: equivalent GPs can differ greatly in how well they fit the data. That said, I found that in practice "most" equivalent GPs do fit the data similarly.
For example, GPs with exponential (not squared exponential) kernels are equivalent if the ratio of the variance to the lengthscale is the same. Concretely, suppose the true lengthscale and variance are both 1.0, so the ratio is 1.0. I found that GPs with variance and lengthscale *smaller* than the ground truth (but still fixed at a ratio of 1.0) fit the data poorly, whereas GPs with variance and lengthscale *larger* than the ground truth (also fixed at a 1.0 ratio) fit about the same as the ground-truth GP. The variance and lengthscale can be *much* larger than the ground truth (e.g., 10,000) and the posterior is almost identical to the posterior under the ground truth (isn't that cool?).
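A minimal sketch of that comparison (my reconstruction of the experiment, with placeholder sample size, noise level, and input range), using the Matérn-$1/2$ form of the exponential kernel in scikit-learn:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

rng = np.random.default_rng(0)
noise = 0.1

# Data from the ground-truth GP: exponential kernel (Matern nu=0.5) with variance = lengthscale = 1.
X = np.sort(rng.uniform(0, 10, size=(100, 1)), axis=0)
K_true = np.exp(-np.abs(X - X.T))  # sigma^2 = 1, lengthscale = 1
f = rng.multivariate_normal(np.zeros(100), K_true)
y = f + noise * rng.standard_normal(100)

X_test = np.linspace(0, 10, 200).reshape(-1, 1)

def posterior_mean(variance, lengthscale):
    """GP posterior mean with fixed hyperparameters (no optimization)."""
    kernel = ConstantKernel(variance, constant_value_bounds="fixed") * \
        Matern(length_scale=lengthscale, length_scale_bounds="fixed", nu=0.5)
    gp = GaussianProcessRegressor(kernel=kernel, alpha=noise**2, optimizer=None)
    gp.fit(X, y)
    return gp.predict(X_test)

# Same ratio variance/lengthscale = 1 at wildly different scales: per the observation
# above, the posterior means should nearly coincide.
m_true = posterior_mean(1.0, 1.0)
m_huge = posterior_mean(10_000.0, 10_000.0)
print(np.max(np.abs(m_true - m_huge)))
```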
The question is whether equivalence causes any trouble for the setup I'm working on; if so, that would be progress towards Goal 2. A few candidate issues:
- Because of equivalence, you learn the wrong kernel hyperparameters in Stage 0 (in the example above, it's not crazy to think you could estimate a variance and lengthscale of 10,000 when the ground-truth values are 1). But equivalence also seems to defuse this: even though the estimated hyperparameters are very different from the ground truth, in function space the models perform about the same, so the misestimation may not actually matter.
- I'm still trying to figure out whether learning the wrong kernel hyperparameters could cause a problem elsewhere, maybe something with covariate shift? Even though it's the ratio that defines equivalence, if you learn a variance of 10,000, the variance really is 10,000. That is, the prior is *way* more dispersed than the data, but because the lengthscale is also large the posterior contracts very quickly (a one-observation calculation after this list makes this concrete). There's got to be a way to relate equivalence of GPs to learning rates.
- If there is no GP equivalent to the ground-truth GP in the span of whatever kernel learning method you are using, then your method might be bound to fail.
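For the contraction point in the second bullet, here is the one-observation, noise-free calculation I have in mind (a back-of-envelope check, not a general statement). With the exponential kernel $k(r) = \sigma^2 e^{-r/\ell}$ and a single observation at distance $r$ from the test point, the posterior variance is
$$
\sigma^2 - \frac{k(r)^2}{\sigma^2} = \sigma^2\left(1 - e^{-2r/\ell}\right) \approx \frac{2\sigma^2 r}{\ell} \quad \text{as } \ell \to \infty,
$$
so if the ratio $\sigma^2/\ell$ is held at 1, the huge prior variance is cancelled by the huge lengthscale and the posterior variance tends to $2r$, independent of the scale.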
#### Next steps / questions:
- I think equivalence is interesting, but one concern is that it might go away in higher input dimensions. In the example above it goes away when the input dimension is 4 or greater. So maybe equivalence doesn't matter for problems of interest? The good news is that kernel hyperparameters then *are* identifiable in higher dimensions; it's just that the equivalence story goes away.
- I'm still working on an example where learning the wrong kernel hyperparameters (while still implying a GP equivalent to the ground truth) causes a downstream problem in estimating $f_1$; in a standard setup there is no issue (i.e., if the hyperparameters work for $f_0$ then they work for $f_1$).
- Perhaps I could show something about the equivalence class of kernels that can be learned by a neural network? Like a universal approximation-style result but for kernels implied by the last hidden layer? Does equivalence also go away in higher input dimensions?
### Step 2: learning $f_1$ given the kernel
I'd like to use existing learning curve results, but there are two problems:
- Existing results assume that the target kernel $K^*$ and the model kernel (in my case inferred from data, $\hat{K}$) have the same eigenfunctions. This makes it difficult to analyze any approach except kernel hyperparameter learning (otherwise, how do you know the eigenfunctions are the same as the ground truth? I assume they would not be if the model is a neural network and the target is some arbitrary kernel).
- I don't think there are closed-form expressions for learned kernels, even in the kernel-hyperparameter-learning case (except for the variance hyperparameter, but I feel like that's the least interesting one), so there's nothing to plug in to existing results even when I can assume the eigenfunctions are the same (the plug-in estimator in question is written out below).
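For reference, the Stage-1 estimator whose risk these learning curves would describe is just the standard GP posterior mean with the learned kernel plugged in (my notation):
$$
\hat{f}_1(x) = \hat{K}(x, X_1)\left[\hat{K}(X_1, X_1) + \sigma^2 I\right]^{-1} y_1,
$$
where $X_1$ and $y_1$ collect the Stage-1 inputs and outputs. The difficulty above is characterizing the risk of this estimator when $\hat{K}$ is random and its eigenfunctions differ from those of $K^*$.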
#### Next steps / questions:
- If I can't get closed-form expressions for estimated kernels or kernel hyperparameters, is there anything I can do? Are empirical learning curves interesting at all (it would just be a simulation study; a sketch of what I have in mind is below)? Just trying to figure out whether I've already hit a dead end.
- If I need analytical results, is it possible to derive them when the eigenfunctions differ? I'm worried that if such a result doesn't exist already then there's no way I'd be able to prove it (though maybe no one has tried because the model kernel is usually assumed fixed?).
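Here is a minimal sketch of the kind of simulation study mentioned above (everything in it, from kernels and grid to sample sizes and repetitions, is a placeholder, and the "learned" kernel is a hand-picked stand-in for $\hat{K}$ rather than something fit in Stage 0): draw $f_1$ from the ground-truth kernel, fit with the plug-in kernel, and trace the Stage-1 risk as $N_1$ grows.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
noise = 0.1
X_grid = np.linspace(0, 10, 300).reshape(-1, 1)

def gp_draw(kernel):
    """Draw one function from a zero-mean GP, evaluated on the grid."""
    K = kernel(X_grid)
    return rng.multivariate_normal(np.zeros(len(X_grid)), K + 1e-8 * np.eye(len(X_grid)))

true_kernel = RBF(length_scale=1.0)   # ground-truth K*
model_kernel = RBF(length_scale=2.0)  # stand-in for a (mis)estimated K_hat

def empirical_risk(n1, reps=20):
    """Average squared error on the grid over fresh draws of f_1 and Stage-1 data."""
    errs = []
    for _ in range(reps):
        f1 = gp_draw(true_kernel)
        idx = rng.choice(len(X_grid), size=n1, replace=False)
        y1 = f1[idx] + noise * rng.standard_normal(n1)
        gp = GaussianProcessRegressor(kernel=model_kernel, alpha=noise**2, optimizer=None)
        gp.fit(X_grid[idx], y1)
        errs.append(np.mean((gp.predict(X_grid) - f1) ** 2))
    return np.mean(errs)

for n1 in [5, 10, 20, 40, 80]:
    print(n1, empirical_risk(n1))
```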