# Summary of related work
### Infinite-width NNs/NLMs and GPs
---
## GP
**Infinite NN with random initialization**
A NN at random initialization is an **NNGP** with kernel $\mathcal{K}$, so an ensemble of multiple NNs (without training) is equivalent to sampling from this GP.
Conditioning on training data gives a GP with
$$
μ_{GP}(x) = \mathcal{K}(x, \mathcal{X}) \mathcal{K}(\mathcal{X}, \mathcal{X})^{-1} \mathcal{Y}\\
Σ_{GP}(x, x') = \mathcal{K}(x, x') - \mathcal{K}(x, \mathcal{X}) \mathcal{K}(\mathcal{X}, \mathcal{X})^{-1} \mathcal{K}(\mathcal{X}, x')
$$
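A minimal sketch of these two equations in JAX, assuming a kernel function `kernel(A, B)` that returns the Gram matrix $\mathcal{K}(A, B)$; the `jitter` term is an added assumption for numerical stability and not part of the equations above.
```python
import jax.numpy as jnp

def gp_posterior(kernel, X_train, Y_train, X_test, jitter=1e-6):
    """Posterior mean and covariance of a GP prior with covariance `kernel`."""
    K_xx = kernel(X_train, X_train) + jitter * jnp.eye(X_train.shape[0])  # K(X, X)
    K_sx = kernel(X_test, X_train)                                        # K(x, X)
    K_ss = kernel(X_test, X_test)                                         # K(x, x')
    mean = K_sx @ jnp.linalg.solve(K_xx, Y_train)
    cov = K_ss - K_sx @ jnp.linalg.solve(K_xx, K_sx.T)
    return mean, cov
```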
## NTK
A kernel function that describes the NN at initialization and during training:
$$
\hat \Theta_t(x, {x'}) = \nabla_{\theta} f(x, {\theta_t})^\top \nabla_{\theta} f({x'}, {\theta_t})
$$
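A minimal sketch of this empirical kernel in JAX, assuming a model `f(params, x)` with scalar output; `f` and `params` are placeholders, not names from the cited papers.
```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def empirical_ntk(f, params, x1, x2):
    """Θ̂(x1, x2) = ∇θ f(x1, θ)ᵀ ∇θ f(x2, θ) for a scalar-output model f(params, x)."""
    g1, _ = ravel_pytree(jax.grad(f)(params, x1))  # flatten the gradient pytree
    g2, _ = ravel_pytree(jax.grad(f)(params, x2))
    return jnp.dot(g1, g2)

def empirical_ntk_gram(f, params, X1, X2):
    """Pairwise Gram matrix Θ̂(X1, X2) via nested vmap over the input batches."""
    k = lambda x1, x2: empirical_ntk(f, params, x1, x2)
    return jax.vmap(lambda x1: jax.vmap(lambda x2: k(x1, x2))(X2))(X1)
```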
**In general:**
- $\hat \Theta_0$ depends on the random initialization
- $\hat \Theta_t$ changes over time
**Infinite width:**
Deterministic and constant: $\hat \Theta_0 = \hat \Theta_t = \Theta$, which depends only on the architecture and can be defined recursively (see the sketch below).
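As an illustration, a sketch of this recursion for a fully connected ReLU network, using the standard arc-cosine expectations; `depth`, `sigma_w`, and `sigma_b` are assumed hyperparameters, and this is not code from the cited papers.
```python
import jax.numpy as jnp

def relu_nngp_ntk(X1, X2, depth=3, sigma_w=1.0, sigma_b=0.0):
    """Infinite-width NNGP kernel K and NTK Θ between rows of X1 and X2 (ReLU MLP)."""
    d = X1.shape[1]
    # Layer 1: linear kernel on the inputs.
    K = sigma_w**2 * X1 @ X2.T / d + sigma_b**2
    K1 = sigma_w**2 * jnp.sum(X1**2, axis=1) / d + sigma_b**2  # diagonal K(x, x)
    K2 = sigma_w**2 * jnp.sum(X2**2, axis=1) / d + sigma_b**2  # diagonal K(x', x')
    Theta = K
    for _ in range(depth - 1):
        norms = jnp.sqrt(K1[:, None] * K2[None, :])
        cos_t = jnp.clip(K / norms, -1.0, 1.0)
        t = jnp.arccos(cos_t)
        # ReLU expectations E[φ(u)φ(v)] and E[φ'(u)φ'(v)] under the previous layer's kernel.
        K_new = sigma_w**2 * norms * (jnp.sin(t) + (jnp.pi - t) * cos_t) / (2 * jnp.pi) + sigma_b**2
        K_dot = sigma_w**2 * (jnp.pi - t) / (2 * jnp.pi)
        Theta = K_new + Theta * K_dot
        K = K_new
        K1 = sigma_w**2 * K1 / 2 + sigma_b**2  # diagonal update (t = 0)
        K2 = sigma_w**2 * K2 / 2 + sigma_b**2
    return K, Theta
```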
**Weak Training with GD (only last layer)**
The trained network converges to the GP conditioned on the training data (same as above).
**Full Training with GD (all layers)**
The result is also a GP (the NTK-GP), but it is not the posterior of any GP prior; its mean and covariance depend on both $\Theta$ and $\mathcal{K}$.
**Bayesian Deep Ensemble**
Modify the training procedure so that the ensemble converges to the **NTKGP** posterior:
$$
μ_{NTKGP}(x) = \Theta(x, \mathcal{X}) \Theta(\mathcal{X}, \mathcal{X})^{-1} \mathcal{Y}\\
Σ_{NTKGP}(x, x') = \Theta(x, x') - \Theta(x, \mathcal{X}) \Theta(\mathcal{X}, \mathcal{X})^{-1} \Theta(\mathcal{X}, x')
$$
Same form as the conditioned GP above, but with a different prior covariance function ($\Theta$ instead of $\mathcal{K}$).
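With the hypothetical helpers sketched above, swapping the prior covariance is a one-argument change; `X_train`, `Y_train`, and `X_test` are placeholder data.
```python
# Hypothetical kernel wrappers built from relu_nngp_ntk above.
nngp_kernel = lambda A, B: relu_nngp_ntk(A, B)[0]  # K
ntk_kernel  = lambda A, B: relu_nngp_ntk(A, B)[1]  # Θ

mu_gp,    cov_gp    = gp_posterior(nngp_kernel, X_train, Y_train, X_test)  # NNGP posterior
mu_ntkgp, cov_ntkgp = gp_posterior(ntk_kernel,  X_train, Y_train, X_test)  # NTKGP
```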
## Tensor Programs
[Paper Yang 2019](https://arxiv.org/abs/1910.12478)
> If an architecture can be expressed solely via matrix multiplication and coordinatewise nonlinearities (i.e. a tensor program), then its infinite-width limit is a GP.
## Representation Learning
Infinite-width NNs can perform worse because they are less _flexible_ and do not learn a representation. Aitchison (2020) shows a toy example where finite NNs are better and gives a deep-GP description of them. The paper defines a notion of flexibility and shows that it decreases with increasing width for BNNs but not for MAP estimation.