# Feature Learning Project
:::success
**Project tagline**
Understanding the ability of kernel learning methods, particularly BNNs, to recover aspects of a ground truth kernel and the implications this has on downstream performance.
:::
:::info
**Motivation**
Kernel methods can perform poorly if mismatched to the data. Bayesian neural networks offer the ability to infer a kernel from the data, but there is little understanding of how this inferred kernel relates to an assumed ground truth kernel, or of how this relationship impacts downstream performance.
**Goals**
For a variety of kernel learning methods, we aim to answer:
1. *What properties of a ground truth kernel can be recovered?* Full consistency is unlikely, but perhaps certain spectral properties can be recovered to some extent.
2. *At what rate can these properties be recovered?* In particular, can the BNN posterior collapse quickly enough to justify such a diffuse prior?
3. *How does inferring these properties impact performance on downstream tasks?* Tasks of interest include standard generalization error, but covariate shift and/or few shot learning could be interesting.
This will provide insight into when each method should be used.
**Methods**
- Empirical analysis of various kernel learning methods, including BNNs, multiple kernel learning, Bayesian model averaging, and hyperpriors for standard kernels.
- Ideally a theoretical analysis of any behavior that seems empirically true.
:::
This is just a running list of questions that have come up:
- Is lengthscale the key property of a kernel that needs to be inferred correctly? Are there others? Can issues with GPs be fixed just by placing a prior over the lengthscale?
- What is the role of architecture (including capacity)? Does width present a tradeoff with kernel learning? Can this tradeoff be managed by controlling other aspects of the model (e.g., depth)? Is double descent a consideration?
- Is multiple kernel learning at all comparable to a BNN? That is, can we think of it as a simplified BNN or are they completely different?
- Will BNNs only outperform GPs if the ground truth kernel is nonstationary?
- Can feedforward BNNs actually outperform properly tuned GPs on tabular data, or is it only, for example, CNNs applied to image data that can outperform?
- Do any of the conclusions change for approximate posteriors?
- Do any of the conclusions change using the log marginal likelihood for model selection?
## Old project overview
*This was based more on double descent / capacity issues in feature learning. The overview above is more recent.*
**Intro** Neural networks converge to kernel methods as the width tends to infinity, with the kernel depending on the architecture of the network, prior and/or initialization of the parameters, and inference method. We seek to understand when and why a finite-width network outperforms an infinitely wide network[^1]. This is challenging because as the width changes, multiple aspects of the model change simultaneously. Moreover, the changes are possibly non-monotonic near the "double descent" interpolation point or even elsewhere. For random features, the bias-variance decomposition of the test error provides a simple answer: use the widest network possible, since it can be shown the variance will decrease beyond the interpolation point while the bias will remain constant near or at zero. However, when the features are learned from the data, a smaller (but still overparameterized) network may outperform if the limiting kernel is poorly matched to the data. This is because the bias actually can increase beyond the interpolation point as the model converges to the limiting kernel[^2], where the features are no longer learned from the data.
[^1]: Or maybe something else :man-shrugging:
[^2]: We think this is true, although it could be the variance.
**Goals** I am currently focusing on the following questions:
* What is the optimal width?
- Once the network is overparameterized, does the width just control the amount of kernel learning that is possible?
- Can you decompose the bias into a "capacity bias" and "kernel bias"?
$$
\text{Bias}(f) = \text{Capacity bias} + \text{Kernel bias}
$$
* Why do finite networks sometimes outperform infinite networks?
- Does it have more to do with learning a kernel outside the "span" of an infinite network, or is it more about learning how to locally vary the kernel over the input space?
* In what applications is feature learning important? (e.g. few-shot learning)
Other directions include a model that does not depend on the number of data points (as discussed this may not make sense), prior design over kernel "properties", feature learning under approximate inference, feature learning in deep ensembles vs. BNNs, and incorporating other labels to learn better features.
**What has been done so far**
* BNNs don't seem to have double descent, as expected
* Uniformly averaging random feature models mitigates double descent better than Bayesian averaging
* Prior variance of the kernel increases with depth and decreases with width, as expected
----------
[toc]
# Prob models updates
## 5/2/2022
Consider the following nonparametric regression model:
$$
y \sim \mathcal{N}(f(x), \sigma^2),\quad
f(x) \sim \mathcal{N}(0, K),\quad
K \sim p_{M_k}(K)
$$
Since there is a hyperprior $p_{M_k}(\cdot)$ over the kernel (written as a prior over the $N\times N$ Gram matrix $K$), this model permits kernel learning. Note that $M_k\in\mathbb{R}$ is just a number (the subscript $k$ just indicates that it relates to the kernel). Suppose that as $M_k\to \infty$, the hyperprior converges to a point mass at a limiting kernel $K_\infty$: $p_{M_k}(K) \to \mathbb{1}[K=K_\infty]$. Therefore, $M_k$ controls to what extent this model resembles standard GP regression with fixed kernel $K_\infty$.
Now suppose $f$ is parametric, with a different number $M_c\in\mathbb{R}$ somehow controlling its capacity. As $M_c \to \infty$, assume $f$ becomes a universal function approximator. To summarize:
- $M_k$: controls concentration of kernel around $K_\infty$
- $M_c$: controls capacity of $f$
We could implement a model like this using a [Fixed Basis Model Average](https://hackmd.io/FzcfjAR8TJuWH7Y0HSPfqg#Fixed-Basis-Model-Average), for example: $M_k$ would control the prior weight on a single model and $M_c$ would equal the number of basis functions in each model.
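To make the roles of $M_k$ and $M_c$ concrete, here is a minimal sketch of one prior draw from such a model. It assumes RFF bases with different lengthscales and a simple mixture prior over models; `rff_features`, `fbma_prior_sample`, and the particular way $M_k$ parameterizes the prior weights are illustrative choices, not the actual implementation.
```python
import numpy as np

rng = np.random.default_rng(0)

def rff_features(x, lengthscale, M_c):
    """Random Fourier features approximating an RBF kernel with this lengthscale."""
    w = rng.normal(size=M_c) / lengthscale
    b = rng.uniform(0, 2 * np.pi, size=M_c)
    return np.sqrt(2.0 / M_c) * np.cos(np.outer(x, w) + b)   # shape (N, M_c)

def fbma_prior_sample(x, lengthscales, M_k, M_c):
    """One function draw from the FBMA prior.

    M_k in [0, 1]: extra prior weight on the first ("limiting") model.
    M_c: number of basis functions in each model (capacity).
    """
    J = len(lengthscales)
    prior_weights = np.full(J, (1.0 - M_k) / J)
    prior_weights[0] += M_k                        # concentrates as M_k -> 1
    j = rng.choice(J, p=prior_weights)             # pick one fixed-basis model
    Phi = rff_features(x, lengthscales[j], M_c)    # that model's fixed basis
    weights = rng.normal(size=M_c)                 # Gaussian weights on the basis
    return Phi @ weights

x = np.linspace(-3, 3, 200)
f = fbma_prior_sample(x, lengthscales=[0.5, 1.0, 2.0], M_k=0.9, M_c=100)
```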
We can also think of this model with both $M_k$ and $M_c$ as an overparameterized BNN, but here's the concern I have: when you change the width $M$, you change $M_k$ and $M_c$ *at the same time.* I think this has important implications for the prior and posterior of a BNN.
- *Implications for the prior:* To better use prior information, I think it's valuable to know how $M$ impacts $M_k$ and $M_c$. Ideally, along with $M$ you could simultaneously adjust other aspects of the prior (e.g., depth or prior weight variance) to control $M_k$ and $M_c$ separately.
- *Implications for the posterior:* I think there are unusual interactions between the $M_k$, $M_c$, the dataset, and the dataset size, $N$, but I've had trouble formalizing this. What if you adjust the dataset complexity so that additional capacity is needed to model the true function? Would this erode the flexibility to learn the kernel?
My next steps are to experiment with the Fixed Basis Model Average.
## 5/9/2022
### Summary
**Goal** Can the bias be decomposed into a "kernel bias" and a "capacity bias"?
**Approach** Empirically study the bias, the variance, and an attempted "kernel bias" of the [Fixed Basis Model Average](https://hackmd.io/FzcfjAR8TJuWH7Y0HSPfqg#Fixed-Basis-Model-Average) (FBMA).
**Conclusions** The dependence of the bias and variance on the number of random features and the prior model weights seems as expected. I was hoping to see something weird in the $N$ dependence, but it looks ok. The definition of the kernel bias I chose seems to generally behave as expected.
### FBMA experiments
Recall the model from last week:
$$
y \sim \mathcal{N}(f(x), \sigma^2),\quad
f(x) \sim \mathcal{N}(0, K),\quad
K \sim p(K)
$$
Notice the kernel $K$ is a random variable.[^3]
[^3]: Slightly subtle point: the kernel is typically defined as the covariance $K(x,x'):=\text{Cov}[f(x),f(x')] = \mathbb{E}[f(x)f(x')]$. In this case, we're doing inference over the neural network feature map $\Phi(X)$, so I'm defining the kernel as the *conditional* covariance $K(x,x'):=\mathbb{E}[f(x)f(x') | \Phi(x)]$, making it a random variable since $\Phi(x)$ is a random variable.
Let $x^*$ be a test point, $X$ be the training data, $\epsilon$ be the noise, $W_0$ be the random weights, and $\hat{\cdot}$ denote the posterior mean.
I will keep track of the following **metrics**:
- Variance: $\mathbb{V}_{x^*, X, \epsilon, W_0}[\hat{f}(x^*)]$ (decomposed into contributions from $X$, $\epsilon$, and $W_0$)[^4]
- Bias: $\mathbb{E}_{x^*}\left[\left(f_{\text{true}}(x^*)-\mathbb{E}_{X, \epsilon, W_0}[\hat{f}(x^*)]\right)^2\right]$
- "Kernel bias": $\mathbb{E}_{x^*}\left[\| K_{\text{true}}(x^*,x^*)-\mathbb{E}_{X, \epsilon, W_0}[\hat{K}(x^*,x^*)]\|\right]$
[^4]: As in [Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime](http://proceedings.mlr.press/v119/d-ascoli20a/d-ascoli20a.pdf)
Note: I don't think this definition of the kernel bias is what we want (in particular, kernel bias $\gg$ bias) but hopefully it captures some of the right idea.
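For reference, here is a rough Monte Carlo sketch of how these metrics could be estimated. `fit`, `f_true`, and `K_true` are placeholders rather than actual code from the experiments, and the decomposition of the variance into $X$, $\epsilon$, and $W_0$ contributions is omitted.
```python
import numpy as np

def estimate_metrics(fit, f_true, K_true, x_star, n_draws=100):
    """fit() resamples (X, epsilon, W_0) and returns the posterior mean function
    and posterior mean kernel diagonal, each callable at the test points x_star."""
    f_hats, K_hats = [], []
    for _ in range(n_draws):
        f_hat, K_hat = fit()
        f_hats.append(f_hat(x_star))
        K_hats.append(K_hat(x_star))
    f_hats, K_hats = np.stack(f_hats), np.stack(K_hats)   # (n_draws, n_test)

    variance = f_hats.var(axis=0).mean()                               # avg over x*
    bias = np.mean((f_true(x_star) - f_hats.mean(axis=0)) ** 2)
    kernel_bias = np.mean(np.abs(K_true(x_star) - K_hats.mean(axis=0)))
    return variance, bias, kernel_bias
```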
I will vary the following **variables**:
- $M_c$: indexes the capacity (number of random features in each model)
- $M_k$: indexes the prior weight around the incorrect limiting kernel
- $N$: Number of observations
Recall, in a BNN, changing the width $M$ is essentially like changing $M_c$ and $M_k$ at the same time. I'm concerned you may want to vary one without the other. I am also concerned there are strange interactions with $N$. To gain better insight, I will vary one at a time in the Fixed Basis Model Average (which is *not* a BNN but hopefully will provide insight anyway).
I will do regression on the following 1D function drawn from GP with an RBF kernel of lengthscale 2:

There are 3 models in the BMA with lengthscales of 0.5, 1, and 2. As $M_k\to 1$, the prior concentrates around the model with lengthscale = 0.5.

#### Changing $M_c$ ($M_k$ and $N$ fixed)
*Hypothesis*: As $M_c$ increases:
- <span style="color:#ff7f0e">Variance</span> will decrease :heavy_check_mark:
- <span style="color:#1f77b4">Bias</span> will decrease :heavy_check_mark:
- <span style="color:#d62728">Kernel bias</span> will decrease? :x:
*Conclusion*: I was expecting all metrics to decrease, in which case you should just set $M_c$ as large as possible. I'm not sure how to explain the increase in the kernel bias.

#### Changing $M_k$ ($M_c$ and $N$ fixed)
*Hypothesis*: As $M_k$ increases:
- <span style="color:#ff7f0e">Variance</span> will increase? :heavy_check_mark:
- <span style="color:#1f77b4">Bias</span> will increase :heavy_check_mark:
- <span style="color:#d62728">Kernel bias</span> will increase :heavy_check_mark:
*Conclusion*: Since the model is misspecified as $M_k\to 1$, you should pick the most flexible model ($M_k=0$).

#### Changing $N$ ($M_k$ and $M_c$ fixed)
*Hypothesis*: As $N$ increases:
- <span style="color:#ff7f0e">Variance</span> will ?
- <span style="color:#1f77b4">Bias</span> will ?
- <span style="color:#d62728">Kernel bias</span> will ?
I'm not sure how the metrics should behave. On one hand you have more data, so it seems the bias and variance (which are evaluated on the test data) should go down. On the other hand, you have less capacity relative to the data, so maybe they should go up?
*Conclusion*: All metrics mostly decrease as you increase $N$. I was hoping to see something weird here but in the overparameterized regime everything seems ok.

## 6/19/2022
### Generalization error $\epsilon$
I'm expanding the calculations from [Sollich 2001](https://papers.nips.cc/paper/2001/file/d68a18275455ae3eaa2c291eebb46e6d-Paper.pdf) to include mismatched lengthscales. I'm assuming a matched observation noise $\sigma^2$ but possibly unmatched lengthscale $l$ and smoothness $r$. I use $[\cdot]_*$ to denote the teacher and $n$ to denote the number of observations.
If the student is rougher $(r<r_*)$, then only the student matters:
$$
\epsilon \propto l^{-1} \left(\frac{n}{\sigma^2}\right)^{-\tfrac1r(r-1)}
$$
If the student is smoother or as smooth $(r\ge r_*)$, then both the student and teacher matter:
$$
\epsilon \propto l^{-1} \left(\frac{l}{l_*} \right)^{r_*} \left(\frac{n}{\sigma^2}\right)^{-\tfrac1r(r_*-1)}
$$
Takeaways:
- The lengthscale ($l$) only impacts the constant term, unlike the smoothness ($r$) that impacts the dependence on the number of observations ($n$).
- There is always a benefit to a larger student lengthscale ($l^{-1}$ term in front), but...
- If the student is rougher $(r<r_*)$ OR the lengthscales are matched $(l= l_*)$, then this is the whole story. There is no other lengthscale dependence.
- If the student is smoother or as smooth $(r\ge r_*)$ AND the lengthscales are mismatched $(l\neq l_*)$, then there is also a consequence to having a larger student lengthscale (the $(l/l_*)^{r_*}$ term), and it gets worse for a smoother teacher (larger $r_*$).
- I'm confused what happens in the RBF case ($r_*\to\infty$) when the student has a larger lengthscale $(l>l_*)$, since the $(l/l_*)^{r_*}$ term blows up. Does that mean the error never converges?
I'm a little confused by what $\propto$ means exactly. I think it is $\Theta(\cdot)$.
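To get a feel for these expressions (and the blow-up I mention above), here is a tiny sketch that evaluates them up to the unknown proportionality constant; all numbers are made up purely for illustration.
```python
def eps_asymptotic(n, sigma2, l, r, l_star, r_star):
    """Evaluate the two asymptotic expressions above, ignoring the constant."""
    if r < r_star:        # student rougher: only the student matters
        return l**-1 * (n / sigma2) ** (-(r - 1) / r)
    else:                 # student as smooth or smoother: the teacher matters too
        return l**-1 * (l / l_star) ** r_star * (n / sigma2) ** (-(r_star - 1) / r)

# The (l / l_*)^{r_*} blow-up when the student lengthscale is too large (l > l_*):
for r_star in [2, 4, 8, 16]:
    print(r_star, eps_asymptotic(n=1e4, sigma2=0.1, l=1.0, r=r_star, l_star=0.5, r_star=r_star))
```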
## 6/26/2022
I'm trying to build the project off a simple observation:
> GPs have poor learning rates when they are too "smooth"
where "smooth" can mean literal smoothness, which could be measured by number of times differentiable, or lengthscale, which could be measured by number of upcrossings.
Why this observation is important:
- Unless you are certain your model is matched to the ground truth "smoothness", you need to learn the kernel.
- Erring on the side of a rough model isn't satisfactory: there might be other reasons to prefer a smooth model (e.g., interpretability?)
We want to compare two ways of learning the kernel:
- Learn the hyperparameters of a standard kernel (i.e., use a GP).
- Learn the hyperparameters of a feature map corresponding to a degenerate kernel (i.e., use a BNN).
In both cases, we could use full Bayesian inference or type-II maximum likelihood. In the BNN case, I think that careful consideration of the architecture will be important, especially the width.
### Enumerating the types of mismatch
There are many forms of model mismatch. I broke it down into a few categories. There is a yes/no answer to each bullet point, giving $2^6=64$ kinds of mismatch.
Properties of interest that could be mismatched:
- Smoothness
- Lengthscale
Types of mismatch:
- Overall level
- Dependence on inputs
Types of mismatch learning:
- Overall level $\leftarrow$ can you easily learn smoothness? (i.e., is it a continuous parameter?)
- Dependence on inputs $\leftarrow$ hard to do with a GP
### Existing work on GP learning rates
I prepared a more [detailed overview](https://hackmd.io/PWlh1Eo9T0O5pHfmTxNa7A) of existing work. Here I just list the relevant things I learned.
- If the model matches the smoothness of the ground truth $f_0$, then convergence is minimax optimal. If not, convergence can be quite poor, even logarithmically slow (but only in the RBF model case).
- There is work on fixed and random $f_0$.
- $f_0$ does not need to belong to the RKHS of the prior kernel $k$. Assuming it does is a very strong assumption. Note that drawing $f_0 \sim \mathcal{G}\mathcal{P}(0, k)$ *does not* imply $f_0$ is in the RKHS of $k$. In fact, it is almost surely not.
- I think the lengthscale will only impact constant terms, so it will have less of an impact than the smoothness. However, inferring the lengthscale can compensate for a mismatch in smoothness.
- Although most of the results are upper bounds, I believe they are considered fairly tight.
Here's a quick summary of results for a fixed $f_0$.
:::success
**Theorem: GP learning rates (simplified)** [[van der Vaart & van Zanten 2011](https://jmlr.org/papers/volume12/vandervaart11a/vandervaart11a.pdf)]
Suppose $f_0$ is a "$\beta$ smooth" function.
- For a Matérn prior with smoothness parameter $\alpha$,
$$
\text{risk}(n) = O\left( n^{-\min(\alpha,\beta)/(2\alpha+D)} \right).
$$
- For an RBF prior,
$$
\text{risk}(n) = O\left( (\log n)^{-\beta/2+D/4} \right).
$$
Note: there are some omitted assumptions about how $\alpha$ and $\beta$ relate to the input dimension $D$.
:::
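As a sanity check on how different the two regimes are, here is a quick numerical comparison of the two rates (constants ignored). The values $\alpha = 3/2$, $\beta = 2$, $D = 1$ are arbitrary examples of mine, not from the paper; the point is just polynomial vs. logarithmic decay.
```python
import numpy as np

alpha, beta, D = 1.5, 2.0, 1          # example values only
n = np.logspace(2, 6, 5)

matern_rate = n ** (-min(alpha, beta) / (2 * alpha + D))
rbf_rate = np.log(n) ** (-beta / 2 + D / 4)

for ni, m, r in zip(n, matern_rate, rbf_rate):
    print(f"n = {ni:.0e}:  Matern {m:.3g},  RBF {r:.3g}")
```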
### Initial experiments
To make things simpler, I tried varying the dataset while keeping the model constant. $f_0$ is drawn from a GP. So far I am using Matern kernels only.
I want to consider overall mismatch and mismatch as a function of $x$. So, for each of the two properties (lengthscale and smoothness), I considered five possibilities regarding $f_0$'s value of the property:
- $f_0$ too low
- $f_0$ just right
- $f_0$ too high
- $f_0$ too low for $x<0$, otherwise too high
- $f_0$ too high for $x<0$, otherwise too low

#### Model: Matern 3/2 kernel
Everything here is a Matern kernel (i.e., both $f$ and $f_0$).
rows: $f_0$ lengthscale, columns: $f_0$ smoothness.
Here's the case of constant values of the properties:

Here's the case of non-constant values of the properties:

Notice the lengthscale can compensate for a mismatch in smoothness. In this case, $f_0$ is too smooth, so the lengthscale increases above the ground truth value.

## 8/15
**Base models**
I am training single-layer neural networks of the form
$$
f_j(x) = \frac{1}{\sqrt{M}}\sum_{m=1}^M w^{(1)}_m\cos( l^{-1}w^{(0)}_m x + b^{(0)}_m) + b^{(1)}
$$
where
$$
w^{(0)}_m \sim \mathcal{N}(0, 1) \\
b^{(0)}_m \sim \mathcal{U}(0, 2\pi) \\
w^{(1)}_m \sim \mathcal{N}(0, 1) \\
b^{(1)}_m \sim \mathcal{N}(0, 1) \\
l^2 \sim \mathcal{IG}(\cdot, \cdot)
$$
If $w^{(0)}_m$, $b^{(0)}_m$, and $l$ are fixed at initialization, then this corresponds to a Random Fourier Feature network with lengthscale $l$.
Note: when the lengthscale is fixed, it is set directly rather than randomly drawn, unlike the input-layer weights and biases (which are drawn once and then fixed).
Note: I accidentally put an inverse gamma prior on $l$ instead of $l^2$.
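Here is a minimal sketch of one draw from this prior. The inverse-gamma shape/scale values are placeholders, the lengthscale can instead be passed in fixed to recover the RFF network, and the $l^2$ hyperprior is as written in the equations above (not the accidental prior on $l$); this is not the actual training code.
```python
import numpy as np

rng = np.random.default_rng(0)

def sample_base_model(M=100, lengthscale=None, ig_shape=2.0, ig_scale=1.0):
    """One prior draw of f_j. If `lengthscale` is given it is fixed (RFF network);
    otherwise l^2 is drawn from an inverse gamma hyperprior."""
    w0 = rng.normal(size=M)                       # input weights  ~ N(0, 1)
    b0 = rng.uniform(0, 2 * np.pi, size=M)        # input biases   ~ U(0, 2pi)
    w1 = rng.normal(size=M)                       # output weights ~ N(0, 1)
    b1 = rng.normal()                             # output bias    ~ N(0, 1)
    if lengthscale is None:
        l = np.sqrt(1.0 / rng.gamma(ig_shape, 1.0 / ig_scale))   # l^2 ~ InvGamma
    else:
        l = lengthscale

    def features(x):                              # last-layer feature map, (N, M)
        return np.cos(np.outer(x, w0) / l + b0) / np.sqrt(M)

    def f(x):                                     # f_j(x), shape (N,)
        return features(x) @ w1 + b1

    return f, features

f_j, phi = sample_base_model(M=100, lengthscale=0.5)
x = np.linspace(-2, 2, 50)
K = phi(x) @ phi(x).T     # kernel implied by the feature map (output-bias term ignored)
```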
**Meta models**
I am also combining functions as:
$$
f(x) = \sum_{j=1}^J \beta_j f_j(x)
$$
where
$$
\beta_j \sim \mathcal{N}(0, 1) \text{ or } \beta\sim\mathcal{D}(1,\dotsc, 1)
$$
In the experiments below I used $J=3$ models with lengthscales (or prior lengthscale means) of 0.25, 0.5, and 1. I used $M=100$ hidden units.
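Continuing the sketch above (it reuses `rng` and `sample_base_model`), the meta model is just a random linear combination of the base models; the Dirichlet draw corresponds to $\beta\sim\mathcal{D}(1,\dotsc,1)$.
```python
lengthscales = [0.25, 0.5, 1.0]                          # J = 3 base models
base = [sample_base_model(M=100, lengthscale=l) for l in lengthscales]

beta_normal = rng.normal(size=len(base))                 # beta_j ~ N(0, 1)
beta_dirichlet = rng.dirichlet(np.ones(len(base)))       # beta   ~ Dir(1, ..., 1)

def f_meta(x, beta):
    """f(x) = sum_j beta_j f_j(x)."""
    return sum(b * f_j(x) for b, (f_j, _) in zip(beta, base))
```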
**Data**
I generate noisy data from a GP with an RBF kernel and lengthscale 0.5.
**Metrics**
There are 3 variables of interest that I want to compare to the ground truth:
- $f$ (aggregate function)
- $\beta$ (weight on different $f_j$'s)
- $K$ (kernel implied by last layer feature map)
For each posterior sample $s$ evaluated on each observation $n$ (if applicable), call the parameter of interest $\theta^{(s,n)}$ and the true value $\theta^{(n)}_0$.
I compute the error
$$
\text{error}(\theta, \theta_0) = \frac{1}{N}\sum_{n=1}^N\text{loss}\left(\frac{1}{S}\sum_{s=1}^S \theta^{(s,n)}, \theta_0^{(n)}\right)
$$
and the risk
$$
\text{risk}(\theta, \theta_0) = \frac{1}{N} \frac{1}{S} \sum_{n=1}^N\sum_{s=1}^S \text{loss}\left( \theta^{(s,n)}, \theta_0^{(n)}\right).
$$
For $f$ and $\beta$ I use the squared error: $\text{loss}(a,b)=(a-b)^2$. For the kernel I use the Frobenius norm $\text{loss}(A,B)=\|A-B\|_F$ and the kernel alignment $\text{loss}(A,B)=(\sum_{ij}a_{ij}b_{ij})/ (\|A\|_F \|B\|_F)$.
For example, the risk of $\theta=f(X_\text{test})\in \mathbb{R}^N$ under the squared error loss is equal to a scaled and shifted version of the marginal likelihood (because of the Gaussian likelihood and because we are not considering the observation noise, if that makes sense).
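A small sketch of the two aggregation schemes and the three losses, assuming the posterior samples are stacked as `theta[s, n]` with scalar entries per observation; for the kernel, the same formulas apply but the loss acts on whole Gram matrices (and the observation index may not apply).
```python
import numpy as np

def error(theta, theta0, loss):
    """loss(posterior mean, truth), averaged over observations. theta: (S, N)."""
    return np.mean([loss(theta[:, n].mean(axis=0), theta0[n]) for n in range(theta.shape[1])])

def risk(theta, theta0, loss):
    """loss(sample, truth), averaged over samples and observations."""
    S, N = theta.shape[:2]
    return np.mean([loss(theta[s, n], theta0[n]) for s in range(S) for n in range(N)])

sq_loss = lambda a, b: (a - b) ** 2                                     # for f and beta
frob_loss = lambda A, B: np.linalg.norm(A - B)                          # Frobenius, for K
align_loss = lambda A, B: np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))
```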
### Single model
**function $f$**


Conclusions:
- When there are zero training points, why are all of the risks the same? Shouldn't the lengthscale impact the risk?
- The larger lengthscale model does better for small training sets, but eventually the ground truth model does best.
**kernel $K$**
I'm only showing the errors because the risks have the same trends.


Conclusions:
- There is an error in the BNNs/RFNNs: the variance was smaller than that of the ground truth GP. That is why the 1.0 lengthscale model has a smaller Frobenius error to the ground truth.
- The Frobenius norm error shows little change with the training set size.
- The alignment shows some improvement with the dataset size.
### Meta model
**function $f$**


Conclusions:
- Error and risk have similar trends
- There is little difference between the two $\beta$ priors
**kernel $K$**
*Frobenius norm*


*Alignment*


Conclusions:
- Error and risk have similar trends
- There is a stronger trend using the Dirichlet prior, with slightly smaller distance to the ground truth
- The normal prior is quite noisy, but possibly not monotonic?
**coefficients $\beta$**


Conclusions:
- Normal prior does not appear to concentrate around the ground truth $\beta$ but the Dirichlet prior does
## 8/23
Hypothesis: A BNN can adapt its kernel to the data, with the amount of adaptability decreasing with width.

Conclusions:
- The smaller-width BNN has a more adaptable kernel. However, if the lengthscale is already correctly specified (0.15), then the inferred kernel gets further from the ground truth. Interestingly, it gets further away with more data.
- The BNN kernel is closer to the ground truth when there is no data because the averaging is also over the input weights.
Hypothesis: When the lengthscale is mismatched, the BNN will perform better than the RFNN

Conclusions:
- BNN outperforms RFNN when the lengthscale is mismatched (first and last columns)
- Even though in the non-mismatched case the BNN kernel gets further from the ground truth, this doesn't hurt performance (middle column).
Hypothesis: when the BNN outperforms the RFNN, the performance gain will be larger when the width is smaller.

Conclusions:
- When the lengthscale is too small, the BNN performs better at smaller width (top left). Otherwise, width doesn't make a difference.
- Although width doesn't make a difference when the lengthscale is too large (bottom right), the BNN is still outperforming the RFNN. It's interesting to me that these models perform about the same as in the lengthscale-too-small case. Maybe this would change with more data?
## 8/29
**Goal**: verify basic hypotheses about kernel learning in BNNs, RFNNs, and RFNNs with priors on the lengthscale (RFNNls). There are no ensemble models.
See 8/15 post for more details on experimental setup.
---
As a precursor to the goal "*what properties of a ground truth kernel can be recovered?*" we want some indication that the posterior learns *something* about the ground truth kernel. We measure that with the MSE of the posterior mean, $\text{error}(K)$.
**Hypothesis 1**: In a mismatched lengthscale setting, as you increase $N_\text{train}$:
1. $\text{error}(K)$ of BNN decreases,
2. $\text{error}(K)$ of RFNNls decreases,
3. $\text{error}(K)$ of RFNN stays the same.
**Verdict**: True.
**Observations**: If the lengthscale is *not* mismatched, the BNN and RFNNls get further from the ground truth as $N_\text{train}$ increases. I think that's ok and would hopefully stop increasing with more data.
**Interpretation**: The BNN posterior of a mismatched prior can somewhat concentrate around the true kernel. That's a good start. If not this project would not be going anywhere.
---
Now that we've seen we can learn something about the ground truth kernel, how does the width impact the extent to which this happens?
**Hypothesis 2**: In a mismatched lengthscale setting, as you increase width:
1. $\text{error}(K)$ of BNN increases,
2. $\text{error}(K)$ of RFNNls decreases,
3. $\text{error}(K)$ of RFNN decreases.
**Verdict**: True.
**Observations**: If the lengthscale is *not* mismatched, the BNN and RFNNls get closer to the ground truth as width increases. I think this makes sense.
**Interpretation**: As expected, width controls the BNN's kernel learning ability but does not control the RFNNls's. The RFNNs benefit from the additional capacity.
---
The following relates to the third goal, "*How does inferring these properties impact performance on downstream tasks?*"
**Hypothesis 3**: In a mismatched lengthscale setting, as you increase width:
1. $\text{error}(f)$ of BNN increases,
2. $\text{error}(f)$ of RFNNls decreases,
3. $\text{error}(f)$ of RFNN decreases.
**Verdict**: Partially true. In the BNN case, width only has an impact when the model lengthscale is too small. This could be dataset specific.
**Observations**:
**Interpretation**: It seems the extra capacity helps for the random feature methods, but it *can* hurt in the BNN case because it comes at the cost of less kernel adaptivity.
---
The following also relates to the third goal, "*How does inferring these properties impact performance on downstream tasks?*"
**Hypothesis 4**: In all mismatched settings, the relative values of $\text{error}(f)$ are: RFNNls < BNN < RFNN
**Verdict**: Somewhat true. When the lengthscale is too small, the BNN actually performs best.
**Observations**: When the lengthscale is *not* mismatched, all models do about the same (i.e., same as RFNN)
**Interpretation**: It's surprising to me that the BNN can outperform the RFNN. I wonder what would happen with more data.
Here are plots used to investigate the above hypotheses. The first two plots show $\text{error}(K)$, the last one shows $\text{error}(f)$. For example, the top left panel of the first set of plots shows the strongest evidence of kernel learning in BNNs (supports Hypothesis 1).



## 9/6
There are two topics in this post:
1. Expanding on the motivation for this project --- *why understanding kernel learning in BNNs is challenging and why you should care*.
2. Providing some intuition for why kernel learning in BNNs is even possible.
For context, note that my experiments so far say:
- *Is there any evidence you can learn the true kernel?* Yes. The error is far from zero, but it decreases with more data.
- *What are the implications for downstream tasks?* Better kernel generalization usually implies better function generalization (sometimes it's not much of a difference).
See 8/29 post for most recent experiments.
#### 1. Why understanding kernel learning in BNNs is challenging and why you should care
It's commonly noted that BNNs aren't consistent in weight space but are in function space, and luckily that's all we really care about. But there's also a seemingly conflicting notion that BNNs are doing kernel learning, where the last hidden layer gives the feature map of the kernel. So does that mean BNNs are consistent in more than just function space, i.e. in "kernel space" too? That seems unlikely. But if not, what kernel are we learning? Does it have no relation to a data generating kernel at all? That also seems unlikely.
To reiterate, it's confusing because I don't expect kernel learning to be consistent (since I would think multiple kernels could have generated the single function you observed, even if you observed the function at every point? See following section for more on this), but yet the feature map of a kernel isn't totally unidentifiable like the weights of a neural network. For example, if you have random Fourier feature maps with different lengthscale parameters, my understanding is they imply different RKHSs, so the lengthscale is identifiable from the RKHS (i.e., $l_1\neq l_2 \implies\text{RKHS}_{l_1}\neq \text{RKHS}_{l_2}$, which is the same as $\text{RKHS}_{l_1} = \text{RKHS}_{l_2} \implies l_1=l_2$, i.e. the lengthscale is identifiable). Of course, you observe a function and not a space of functions / RKHS, but maybe that one function should tell you something about the lengthscale, which is a property of the feature map / kernel.
So hopefully that explains why these questions are challenging. But why should you care? After all, I wrote "luckily [function space] is all we really care about". Is there anything that should make us also care about kernel space, unlike weight space where we actually don't care? In other words, all that matters about the posterior over the weights is the posterior over the function that it implies. You would be satisfied only knowing the posterior over the function. Is the same true of the posterior over the kernel?
Now, I bet you could decompose the function space generalization error into a "kernel error" and "function given kernel error" and that the former could really impact the learning rate of the latter, but, again, if you only care about the function, what's the point of this decomposition?
The example I can think of is transfer learning. There it's important that the last hidden layer is learning something meaningful about the first dataset so it can be transferred to a new dataset. And if I can show that you can learn properties of a ground truth kernel when there is one, then it seems something meaningful can be learned? Is that convincing?
#### 2. Intuition for why kernel learning in BNNs is even possible
Consider the following simple example first.
:::warning
*Illustrative example*
Suppose your data is generated as
$$
\beta \sim \mathcal{N}(0,1) \\
y_n \sim \mathcal{N}(\beta x_n,1),\quad n=1,\dotsc,N.
$$
Then I think any model that places prior mass on the true $\beta$ would be consistent for $\beta$.
Alternatively, suppose your data is generated as
$$
\sigma^2 \sim \mathcal{IG}(\cdot,\cdot) \\
\beta \sim \mathcal{N}(0,\sigma^2) \\
y_n \sim \mathcal{N}(\beta x_n, 1),\quad n=1,\dotsc,N.
$$
Then I think a model is still consistent for $\beta$ if there is prior mass on the true value, but I don't think the same is true for $\sigma^2$, right? Because even if you were told the true value for $\beta$, this would only give you one observation to infer $\sigma^2$.
:::
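A quick numerical check of the second part of the example, assuming a conjugate $\mathcal{IG}(a_0, b_0)$ prior on $\sigma^2$ (the values $a_0 = b_0 = 2$ are arbitrary): even if you are handed the true $\beta$, the posterior over $\sigma^2$ is a single-observation update, so it does not sharpen with $N$.
```python
a0, b0 = 2.0, 2.0                      # illustrative inverse-gamma prior on sigma^2
beta_true = 1.3                        # pretend we were told the true beta

# p(sigma^2 | beta) = InvGamma(a0 + 1/2, b0 + beta^2 / 2): one "observation" of sigma^2,
# no matter how many y_n you collect.
a_post, b_post = a0 + 0.5, b0 + beta_true**2 / 2

post_mean = b_post / (a_post - 1)                       # still close to the prior mean b0 / (a0 - 1)
post_var = b_post**2 / ((a_post - 1) ** 2 * (a_post - 2))
print(post_mean, post_var)             # stays diffuse regardless of N
```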
This is analogous to my setting, where "$f=\beta$" and "$k=\sigma^2$" (where $f$ is the function and $k$ is the kernel). But my setting is also more complicated and as a result I think more observations $y_n$ can actually help you infer the kernel.
To see why, take the extreme example as above, where you are told the true value for the function $f$ (i.e. at all inputs $x$). Then in some sense this is like multiple observations of $f$ if you were to divide up the input space (e.g., one observation for $x<0$ and one observation for $x>0$). Does that make sense? The observations aren't independent because it's the same function, but they're not fully dependent either. I suppose this argument relies on the true kernel being stationary and non-constant (i.e., so the same kernel generated the two, different $f$'s), but maybe a softer version of this is true for nonstationary kernels as long as they don't change too rapidly over the input space (by this argument, I wonder if you could actually get consistency if the true kernel is stationary). So that's an argument for why you can learn the ground truth kernel better with more data, which is what I'm seeing in my experiments.
Side note: the above example is actually equivalent to my case for a linear data generating kernel. But linear kernels are nonstationary, so it makes sense that you can't identify $\sigma^2$.
# Background
Note: this section isn't always up to date.
## Tools
### Fixed Basis Model Average
A model that is a weighted sum of other models, each with their own fixed basis. The idea is you can disentangle which kernels are included a priori (by adjusting which models show up in the weighted sum) from the capacity (by changing the number of basis functions in each model). I'm typically weighting models by their posterior probability (i.e., so this is a Bayesian Model Average), but I also have some experiments with uniform weights.

### Weighted average of fixed basis models
See 8/23 post
### Prior variance over the kernel
For any given network, you can compute its kernel using the last hidden layer as the feature map. By drawing networks from the prior, you can look at the uncertainty in this kernel (evaluated at some fixed inputs).
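A minimal sketch of this computation; the tanh activations and $1/\sqrt{\text{fan-in}}$ weight scaling are illustrative choices, not necessarily the setup used in the experiments.
```python
import numpy as np

rng = np.random.default_rng(0)

def prior_gram(x, width, depth):
    """Gram matrix of one prior draw, using the last hidden layer as the feature map."""
    h = x[:, None]                                        # (N, 1) for 1D inputs
    for _ in range(depth):
        W = rng.normal(size=(h.shape[1], width)) / np.sqrt(h.shape[1])
        b = rng.normal(size=width)
        h = np.tanh(h @ W + b)
    return h @ h.T / width                                # (N, N)

x = np.array([-1.0, 1.0])                                 # 2 fixed inputs -> 2x2 Gram matrix
draws = np.stack([prior_gram(x, width=256, depth=2) for _ in range(500)])
print(draws.mean(axis=0))                                 # prior mean of the kernel
print(draws.var(axis=0))                                  # prior variance of the kernel
```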
## Results
### BNNs do not appear to have double descent, as expected
On an experiment in $d = 5$ dimensions with a linear ground truth function, a BNN does not appear to exhibit double descent, in contrast to randomly sampling the features.

### Uniformly averaging random feature models mitigates double descent better than Bayesian averaging
Using the Fixed Basis Model Average (Tool 2), if the models are uniformly weighted then adding more models quickly eliminates double descent. Weighting the models by the posterior probability decreases double descent but not as significantly for the same number of models. This seems to be because the posterior weights tend to concentrate on a small set of models. My hypothesis is that with an infinite number of models, double descent would go away, and that this corresponds to a BNN (since the posterior is an integral over models).
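For reference, the difference between the two weighting schemes is just the following; the log marginal likelihood values here are hypothetical placeholders.
```python
import numpy as np

log_ml = np.array([-120.0, -118.5, -140.0, -119.0])     # hypothetical per-model values

uniform_w = np.full(len(log_ml), 1 / len(log_ml))        # uniform averaging
posterior_w = np.exp(log_ml - log_ml.max())              # Bayesian averaging:
posterior_w /= posterior_w.sum()                         # softmax of log marginal likelihoods

print(uniform_w)      # [0.25 0.25 0.25 0.25]
print(posterior_w)    # most mass on the highest-evidence models
```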
### Prior variance of the kernel increases with depth and decreases with width, as expected
Using Tool 2, we can see how the prior variance of the kernel scales with the width. This is analogous to Figure 3 in [Why bigger is not always better: on finite and infinite neural networks](https://arxiv.org/pdf/1910.08013.pdf) but for nonlinear networks.
As with linear networks, we see that the variance decreases with width and increases with depth.

In each of the 4 panels, prior mean (left) and variance (right) of the elements of a 2 × 2 Gram matrix, where the feature map of the kernel is the last hidden layer. Each panel corresponds to different depths (rows) or widths (columns). The dashed line corresponds to the limiting kernel (which has zero variance).
## Notation