# Summary of related work

### Infinite-width NNs/NLMs and GPs

---

## GP

**Infinite NN with random initialization**

A NN at random initialization is a **NNGP** with kernel $\mathcal{K}$, so an ensemble of multiple NNs (without training) is the same as sampling from this GP. Conditioning on the training data $(\mathcal{X}, \mathcal{Y})$ gives a GP with

$$
\mu_{GP}(x) = \mathcal{K}(x, \mathcal{X}) \mathcal{K}(\mathcal{X}, \mathcal{X})^{-1} \mathcal{Y} \\
\Sigma_{GP}(x, x') = \mathcal{K}(x, x') - \mathcal{K}(x, \mathcal{X}) \mathcal{K}(\mathcal{X}, \mathcal{X})^{-1} \mathcal{K}(\mathcal{X}, x')
$$

## NTK

Kernel function that describes a NN at initialization and during training:

$$
\hat\Theta_t(x, x') = \nabla_{\theta} f(x, \theta_t)^\top \nabla_{\theta} f(x', \theta_t)
$$

**In general:**

- $\hat\Theta_0$ depends on the random initialization
- $\hat\Theta_t$ changes over time

**Infinite width:** Deterministic and constant, $\hat\Theta_0 = \hat\Theta_t = \Theta$; it depends only on the architecture and can be defined recursively.

**Weak training with GD (only the last layer):** The ensemble converges to the GP conditioned on the training data (same as above).

**Full training with GD (all layers):** Also a GP (the NTK-GP), but not a posterior; it depends on both $\Theta$ and $\mathcal{K}$.

**Bayesian Deep Ensemble:** Change the training so that the ensemble instead converges to the **NTKGP**:

$$
\mu_{NTKGP}(x) = \Theta(x, \mathcal{X}) \Theta(\mathcal{X}, \mathcal{X})^{-1} \mathcal{Y} \\
\Sigma_{NTKGP}(x, x') = \Theta(x, x') - \Theta(x, \mathcal{X}) \Theta(\mathcal{X}, \mathcal{X})^{-1} \Theta(\mathcal{X}, x')
$$

Same form as the conditioned GP, but with a different prior covariance function ($\Theta$ instead of $\mathcal{K}$); both are sketched in code at the end of this note.

## Tensor Programs

[Paper Yang 2019](https://arxiv.org/abs/1910.12478)

> If an architecture can be expressed solely via matrix multiplication and coordinatewise nonlinearities (i.e. a tensor program), then it has an infinite-width GP.

## Representation Learning

Infinite NNs can perform worse because they are less _flexible_ and do not learn a representation. Aitchison (2020) shows a toy example where finite NNs are better and gives a deep-GP description of them. They define a notion of flexibility and show that it decreases with increasing width for BNNs but not for MAP estimation.
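A minimal `jax.numpy` sketch (an illustration, not code from any of the cited papers) of the conditioned-GP formulas above. `gram` and `gp_posterior` are hypothetical helper names, and the `jitter` term is an added assumption for numerical stability. Passing the NNGP kernel $\mathcal{K}$ gives $(\mu_{GP}, \Sigma_{GP})$; passing the NTK $\Theta$ gives $(\mu_{NTKGP}, \Sigma_{NTKGP})$.

```python
# Minimal sketch (illustrative, not from the source): kernel-posterior
# conditioning. `kernel` is any positive-definite kernel k(x, x'), e.g. the
# NNGP kernel K or the NTK Theta.
import jax.numpy as jnp
from jax import vmap


def gram(kernel, xs1, xs2):
    """Gram matrix with entries kernel(xs1[i], xs2[j])."""
    return vmap(lambda a: vmap(lambda b: kernel(a, b))(xs2))(xs1)


def gp_posterior(kernel, x_train, y_train, x_test, jitter=1e-6):
    """mu(x)        = k(x, X) k(X, X)^{-1} y
       Sigma(x, x') = k(x, x') - k(x, X) k(X, X)^{-1} k(X, x')"""
    k_xx = gram(kernel, x_train, x_train) + jitter * jnp.eye(len(x_train))
    k_sx = gram(kernel, x_test, x_train)               # k(x, X)
    k_ss = gram(kernel, x_test, x_test)                # k(x, x')
    mean = k_sx @ jnp.linalg.solve(k_xx, y_train)
    cov = k_ss - k_sx @ jnp.linalg.solve(k_xx, k_sx.T)
    return mean, cov
```

The same routine covers both cases because only the prior covariance function changes between the GP and the NTKGP posterior.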
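A minimal JAX sketch (again an illustration, not from the source) of the empirical NTK $\hat\Theta_t(x, x') = \nabla_{\theta} f(x, \theta_t)^\top \nabla_{\theta} f(x', \theta_t)$ for a scalar-output network. `empirical_ntk` is a hypothetical helper, and `f(x, params)` is assumed to take the input first and the parameter pytree second.

```python
# Minimal sketch (illustrative, not from the source): the empirical NTK for a
# scalar-output network f(x, params), computed with plain jax.grad.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree


def empirical_ntk(f, params, x1, x2):
    """Empirical NTK between two single inputs for scalar-output f."""
    g1, _ = ravel_pytree(jax.grad(lambda p: f(x1, p))(params))  # grad wrt theta at x1
    g2, _ = ravel_pytree(jax.grad(lambda p: f(x2, p))(params))  # grad wrt theta at x2
    return jnp.dot(g1, g2)                                      # inner product of flattened grads
```

At finite width this kernel depends on the random initialization and on $t$; in the infinite-width limit it becomes the deterministic, constant $\Theta$ described in the NTK section.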