# Example of a Literature Survey: Deep Bayes Alternatives to MF-VI for BNNs
### Deep GPs
**What they are:** $y = g(x) + \epsilon = f_n \circ \cdots \circ f_1(x) + \epsilon$, where each $f_i \sim \mathcal{GP}(\text{mean}_i, \text{cov}_i)$.
**Inference:** GPyTorch...just kidding. GPyTorch implements doubly stochastic variational inference (Salimbeni & Deisenroth, 2017): samples are propagated through the layers, so correlations between layers are preserved, while within each layer the posterior is simplified via a sparse inducing-point approximation.
**Notable properties:** Deep GPs can yield non-Gaussian marginals. For example, they can model functions with large derivatives (e.g. jump discontinuities) better than a single GP, since the derivative of a GP (when it exists) is itself a GP, so the marginal derivative distributions are Gaussian and place little mass on extreme slopes (see the sketch at the end of this subsection).
**Pros:** Deep GPs define function classes that are more expressive than GPs. One can interpret the latent variables in each layer as representations?
**Cons:** If the latent spaces are too low-dimensional, you risk collapsing important features of the data (Duvenaud et al. 2014).
**Questions:** In terms of performance (generalization, robustness, and efficiency: number of layers vs. number of kernels, computational complexity), how do deep GPs compare to, say, composite kernels?
**Notes:** Is the problem with the marginal derivative distribution being Gaussian just that the kernel we're considering is stationary? There are other limitations of GPs, but do deep GPs actually remove them? Practically speaking, how many jump discontinuities do we need to worry about in a typical dataset? Do we instead just want kernels that vary between chunks of the input space (a softer version of a jump discontinuity)?
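To make the composition concrete, here is a minimal prior-sampling sketch in plain NumPy (not GPyTorch's DSVI machinery; all function names, lengthscales, and sizes are illustrative). Each layer is a draw from a zero-mean RBF GP, and composing just two layers already gives non-Gaussian marginals and sharper, jump-like transitions than a single stationary GP sample typically shows.

```python
import numpy as np

def rbf_kernel(x, xp, lengthscale=1.0, variance=1.0):
    """Stationary RBF kernel k(x, x') = s^2 * exp(-(x - x')^2 / (2 l^2))."""
    d = x[:, None] - xp[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp_layer(inputs, lengthscale=1.0, jitter=1e-6, rng=None):
    """Draw one sample f ~ GP(0, k) evaluated at the 1-D array `inputs`."""
    rng = np.random.default_rng() if rng is None else rng
    K = rbf_kernel(inputs, inputs, lengthscale) + jitter * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(inputs))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)

# g(x) = f_2(f_1(x)): each layer is a GP sample, but the composition has
# non-Gaussian marginals, and a short inner lengthscale can produce sharp,
# near-jump transitions in the output.
f1 = sample_gp_layer(x, lengthscale=0.7, rng=rng)
g = sample_gp_layer(f1, lengthscale=0.7, rng=rng)
```

Plotting `g` against `x` next to the single-layer sample `f1` makes the difference in marginal behavior easy to see.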
### Deep Kernel Learning
**What they are:** $k_{\text{DKL}}(x, x') = k(g_W(x), g_W(x'))$, where $g_W$ is a NN with parameters $W$.
**Inference:** We learn $W$, along with the parameters of $k$ and the output noise, by Type II MLE, i.e. maximizing the marginal log-likelihood (see the sketch at the end of this subsection).
**Pros:** This gives you a more black-box way of constructing a complex kernel that is data/task dependent. There is empirical evidence that this is a good idea. This is faster than learning composite kernels (?).
**Cons:** Type II MLE can overfit, especially when the model is complex. This is already true for vanilla GPs with complex kernels.
**Questions:** The Promises and Pitfalls of DKL paper uses SVI for GP inference; how does this extra layer of approximation interact with Type II MLE? Again, aside from speed, what is the advantage of this type of kernel learning versus composite kernels?
**Notes:** This framework is popular because it allows you to use whatever architecture you've already been using for non-Bayesian modeling and then add a little Bayesian bit on top to get your uncertainty.
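Here is a hedged GPyTorch sketch of the idea on toy data (the architecture, feature dimension, and training loop are placeholders, not the setup from the Promises and Pitfalls paper): the NN $g_W$ warps inputs into a low-dimensional feature space, a standard RBF kernel acts there, and $W$, the kernel hyperparameters, and the noise are trained jointly by maximizing the exact marginal log-likelihood.

```python
import torch
import gpytorch

class FeatureExtractor(torch.nn.Sequential):
    """Small MLP g_W(x): R^d -> R^2 (placeholder architecture)."""
    def __init__(self, d_in, d_out=2):
        super().__init__(
            torch.nn.Linear(d_in, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, d_out),
        )

class DKLRegression(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, d_in):
        super().__init__(train_x, train_y, likelihood)
        self.feature_extractor = FeatureExtractor(d_in)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=2)
        )

    def forward(self, x):
        z = self.feature_extractor(x)  # g_W(x)
        # k_DKL(x, x') = k(g_W(x), g_W(x'))
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z)
        )

# Toy data; Type II MLE jointly optimizes W, kernel hypers, and noise.
train_x = torch.randn(100, 5)
train_y = torch.sin(train_x[:, 0]) + 0.1 * torch.randn(100)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = DKLRegression(train_x, train_y, likelihood, d_in=5)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
likelihood.train()
for _ in range(200):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```

Note that under Type II MLE the whole of $g_W$ is treated as just more kernel hyperparameters, which is exactly where the overfitting concern above comes from.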
### Neural Linear Model
**What they are:** A DKL model with a linear kernel for the GP. Alternatively, they are Bayesian linear regression models fitted on top of learned deep features ($h_W(x)$).
**Inference:** You can do Type II MLE on $W$, but typically people learn $W$ independently via MAP estimation.
**Pros:** Inference is fast. You retain the feature-learning part of deep models. The feature-to-task map can also be meaningfully interpreted since the last layer is linear.
**Cons:** Lots of work, including work from our lab, has shown that the prior predictive distributions of these models have limited expressivity under traditional training.
**Questions:** Since this model is easy to reason about, can we specify sensible task-appropriate priors?
**Notes:** We discussed the benefits of training $W$ independently rather than doing Type II MLE. Maybe try getting the features from some pre-trained model (maybe even a generative model) and then fitting an NLM or just a Bayesian linear model on top of that (see the sketch below). Would this give you reasonable uncertainty? Experiments from past students suggest that yes, this works well in practice on image data. So maybe a research question is to formalize the properties we want out of a set of independently learned feature maps that would support useful uncertainties.
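A minimal NumPy sketch of the "fix the features, then do Bayesian linear regression on top" recipe; the feature map, prior precision, and noise level are placeholders (in practice $h_W$ would be, say, the penultimate layer of a MAP-trained or pre-trained network rather than random features).

```python
import numpy as np

def fit_bayesian_linear(Phi, y, alpha=1.0, sigma2=0.1):
    """Closed-form posterior for y = Phi @ w + eps, with prior
    w ~ N(0, (1/alpha) I) and noise eps ~ N(0, sigma2)."""
    d = Phi.shape[1]
    precision = alpha * np.eye(d) + Phi.T @ Phi / sigma2  # posterior precision
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / sigma2
    return mean, cov

def predict(Phi_star, mean, cov, sigma2=0.1):
    """Predictive mean and variance at new feature rows Phi_star."""
    mu = Phi_star @ mean
    var = np.einsum("nd,dk,nk->n", Phi_star, cov, Phi_star) + sigma2
    return mu, var

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# Placeholder feature map h_W: in practice this would be the penultimate
# layer of an independently (MAP- or pre-) trained network, held fixed.
W_feat = rng.standard_normal((5, 32))
h_W = lambda A: np.tanh(A @ W_feat)

mean, cov = fit_bayesian_linear(h_W(X), y)
mu_star, var_star = predict(h_W(X[:10]), mean, cov)
```

Because everything downstream of the features is conjugate, the posterior and predictive are exact and cheap; the open question in the notes above is which properties of $h_W$ make the resulting uncertainties trustworthy.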
### Laplace Approximation
**What they are:** A full-covariance Gaussian approximation to the posterior, centered at the MAP, with covariance given by the inverse Hessian of the negative log posterior there.
**Inference:** We can place priors on all weights in a NN or just a subset. We can make positive semi-definite approximations of the Hessian (required for inference), e.g. the generalized Gauss-Newton. Hyperparameters are learnt via Type II MLE. We also need something to reduce the variance in the predictive distribution, e.g. linearizing the network around the MAP, since apparently LA posteriors yield high variance when Monte Carlo estimating the posterior predictive (see the sketch at the end of this subsection).
**Pros:** Inference can be fast? Andrew and David showed that LA can yield more desirable uncertainties. Other empirical works show competitive performance under in-distribution, data-shift, and OOD conditions.
**Cons:** Computation of the Hessian, the evidence, and the predictive can be expensive and high variance?
**Questions:** This seems more complicated than DKL and the NLM; is there any advantage to approximating the posterior of a full BNN?
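A self-contained toy sketch of the recipe, assuming nothing beyond PyTorch: hand-rolled MAP training, an explicit full Hessian on a tiny MLP, and a Monte Carlo predictive. In practice one would use a PSD curvature approximation (e.g. GGN/K-FAC), dedicated tooling such as the laplace-torch library, and often the linearized predictive; every architecture and hyperparameter choice below is illustrative.

```python
import torch

torch.manual_seed(0)
X = torch.linspace(-2, 2, 50).unsqueeze(-1)
y = torch.sin(3 * X) + 0.1 * torch.randn_like(X)

sizes = [(1, 8), (8, 1)]                       # illustrative tiny MLP
n_params = sum(i * o + o for i, o in sizes)

def forward(theta, x):
    """Unpack the flat parameter vector theta and run the MLP."""
    idx, h = 0, x
    for k, (i, o) in enumerate(sizes):
        W = theta[idx: idx + i * o].view(i, o); idx += i * o
        b = theta[idx: idx + o]; idx += o
        h = h @ W + b
        if k < len(sizes) - 1:
            h = torch.tanh(h)
    return h

prior_prec, noise_var = 1.0, 0.1 ** 2

def neg_log_posterior(theta):
    resid = forward(theta, X) - y
    return 0.5 * (resid ** 2).sum() / noise_var + 0.5 * prior_prec * (theta ** 2).sum()

# 1) MAP estimate by gradient descent.
theta = (0.1 * torch.randn(n_params)).requires_grad_()
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    neg_log_posterior(theta).backward()
    opt.step()

# 2) Laplace: q(theta) = N(theta_MAP, H^{-1}), with H the Hessian of the
#    negative log posterior at the MAP (jitter added in case H is not quite PD).
theta_map = theta.detach()
H = torch.autograd.functional.hessian(neg_log_posterior, theta_map)
cov = torch.linalg.inv(H + 1e-4 * torch.eye(n_params))
cov = 0.5 * (cov + cov.T)                      # symmetrize for the sampler

# 3) Monte Carlo posterior predictive (this is the high-variance part the
#    notes mention; linearizing around the MAP is the usual fix).
x_test = torch.linspace(-3, 3, 100).unsqueeze(-1)
samples = torch.distributions.MultivariateNormal(theta_map, cov).sample((200,))
preds = torch.stack([forward(s, x_test) for s in samples])
pred_mean, pred_std = preds.mean(0), preds.std(0)
```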
### Sub-Network Inference
**What they are:** We first estimate the MAP for all network weights, then we fix the estimates for a subset of the weights and perform approximate Bayesian inference on its complement.
**Inference:** Do MAP, then select a subset of weights $W_s \subset W$ such that the Wasserstein distance between $q(W_s)$ and $p(W|\text{Data})$ is minimized. Then we do linearized Laplace on the posterior over $W_s$ (see the sketch at the end of this subsection).
**Pros:** Like full LA, but on a subset of weights, which should simplify inference. There is empirical evidence that you get some useful uncertainties.
**Cons:** Selecting the sub-network seems hard. Also, how does one trade off the size of the subnetwork against the quality of the approximation of $p(W|\text{Data})$ by $q(W_s)$?
**Questions:** Besides being faster than full LA when doing the Bayesian inference part, what's the advantage of doing subnetwork LA? When factoring in the cost of minimizing the Wasserstein distance between $q(W_s)$ and $p(W|\text{Data})$ over the choice of subset, how much compute do we actually save?
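A rough, self-contained sketch of the select-then-infer pipeline on a linear toy model (so the Hessian is exact and small). The scoring rule, keeping the weights with the largest variances under a cheap diagonal approximation, is the practical stand-in Daxberger et al. use for the intractable Wasserstein objective; the sizes, constants, and the closed-form MAP are all artifacts of the toy setup.

```python
import torch

torch.manual_seed(0)

# Linear toy model so the Hessian of the negative log posterior is exact and
# tiny; for a real BNN, theta would be all the network weights and theta_map
# would come from standard MAP training.
X = torch.randn(100, 20)
y = X @ torch.randn(20) + 0.1 * torch.randn(100)
noise_var, prior_prec = 0.1 ** 2, 1.0

def neg_log_posterior(theta):
    resid = X @ theta - y
    return 0.5 * (resid ** 2).sum() / noise_var + 0.5 * prior_prec * (theta ** 2).sum()

# 1) MAP estimate (closed form here because the model is linear-Gaussian).
A = X.T @ X / noise_var + prior_prec * torch.eye(20)
theta_map = torch.linalg.solve(A, X.T @ y / noise_var)

# 2) Cheap per-weight variance scores from a diagonal approximation, 1/diag(H):
#    the practical proxy for picking the subset whose posterior best matches
#    the full posterior in Wasserstein distance.
H = torch.autograd.functional.hessian(neg_log_posterior, theta_map)
diag_var = 1.0 / torch.diagonal(H)

# 3) Keep the k highest-variance weights; everything else stays at its MAP value.
k = 5
subset = torch.topk(diag_var, k).indices

# 4) Full-covariance Laplace over the selected subset only.
H_sub = H[subset][:, subset]
cov_sub = torch.linalg.inv(H_sub)
cov_sub = 0.5 * (cov_sub + cov_sub.T)          # symmetrize after inversion
q_sub = torch.distributions.MultivariateNormal(theta_map[subset], cov_sub)
```

For a real NN, step 4 would instead be linearized Laplace over the selected weights, with every other weight frozen at its MAP value.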
### SWAG or SWA-Gaussian
**What they are:** We run SGD with a high, constant learning rate for $T$ iterations, snapshotting at, say, every $t$ iterations. Then we fit a Gaussian to the snapshots.
**Inference:** Do SGD, then compute the mean and covariance structure of the snapshots (see the sketch at the end of this subsection).
**Pros:** Super fast. Under some very strong assumptions, previous works have related this process to posterior sampling, and SWA-based distributions can recover the true posterior in asymptotic regimes. Empirically, SWAG gives you some useful uncertainties; ensembling SGD solutions has also long been shown to capture interesting properties of the loss landscape of NNs.
**Cons:** But are we really guaranteed to recover the true posterior (or any aspect of the true posterior) of BNNs? I don't think so, right? Unlike other approximate inference schemes, it's not super clear what objective function we're optimizing here.
**Questions:** Are we comfortable with the non-transparent nature of the training? Heuristically this seems reasonably motivated, but what is the objective function implied by this inference?
**Notes:** Yes, there is no explicit objective function here, but even when you have an explicit objective function you can still hit mysterious and unanticipated failure modes. So why should we be more bothered by a less "formal" or "principled" inference framework?
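A minimal sketch of the procedure on a toy regression problem. This follows the spirit of Maddox et al.'s SWAG rather than their exact running-moment implementation; the model, learning rate, burn-in, snapshot interval, and the diagonal-plus-low-rank sampling rule below are all illustrative choices.

```python
import torch

torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10) + 0.1 * torch.randn(256)

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=0.05)  # high-ish constant learning rate

def flat_params(m):
    """Flatten all parameters into one copied vector."""
    return torch.cat([p.detach().reshape(-1) for p in m.parameters()])

snapshots = []
for step in range(2000):                 # full-batch for brevity; SWAG uses minibatch SGD
    opt.zero_grad()
    loss = ((model(X).squeeze(-1) - y) ** 2).mean()
    loss.backward()
    opt.step()
    if step >= 1000 and step % 50 == 0:  # snapshot every t iterations after a burn-in
        snapshots.append(flat_params(model))

W = torch.stack(snapshots)               # (num_snapshots, num_params)
swa_mean = W.mean(0)                     # the SWA solution
D = W - swa_mean                         # deviations used for the low-rank part
diag_var = D.pow(2).mean(0)

def sample_swag(scale=0.5):
    """Draw one weight-space sample: mean + diagonal noise + low-rank noise
    built from the snapshot deviation matrix."""
    z1 = torch.randn_like(swa_mean)
    z2 = torch.randn(D.shape[0])
    return swa_mean + scale * (diag_var.sqrt() * z1 + D.t() @ z2 / (D.shape[0] - 1) ** 0.5)

theta_sample = sample_swag()
```

To actually predict with a sample you would copy the flat vector back into the model's parameters (omitted here) and then ensemble predictions over several such samples.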
## Discussion Questions
We've been talking (complaining) about Deep Bayes for many weeks now. Given our discussion and combined experiences/goals/knowledge, can we come up with a list of desiderata for our dream deep Bayesian model?
1. What features of deep models are useful and worth retaining in deep Bayesian models? In particular, can we be more concrete: e.g. instead of saying "deep NNs learn features", can we try to define what we mean by "features"?
2. What features of (non-deep) Bayesian models are useful and worth retaining in deep Bayesian models? Again, can we be more concrete: e.g. instead of saying "let's use informative priors", can we think about what task-relevant, domain-specific information can actually be encoded in a prior (functional or not)? <br>
For example, in many traditional application domains of non-deep Bayesian models, the priors over parameters often come from previous empirical knowledge (e.g. this value has to be between blah and blah). So what would this kind of prior knowledge look like for a deep Bayesian model? If we say this knowledge takes the form of a functional prior p(f), and that p(f) is most conveniently specified in terms of a GP kernel, then why not just work with GPs? Is translating a GP prior into a deep Bayes prior and then doing approximate inference for the deep Bayes model really better/faster than doing approximate GP inference?
3. What are some features we'd like for the inference procedures for our deep Bayesian models?
4. What are some sensible and principled ways for evaluating our deep Bayesian models and our inference?
- How do we independently evaluate our choice of model class (i.e. our prior + likelihood)? How do we test and evaluate the appropriateness of the inductive biases of our models?
- How do we independently evaluate our choice of inference method?
5. Do any of the existing models + inference methods satisfy our list of desiderata? If not, in what ways are they falling short?
6. Should we bother being Bayesian at all? Ensembles are the state of the art in industry when we want a bit of predictive uncertainty, and recent work shows that ensembles (under assumptions) can be interpreted as samples from the posterior of some deep Bayes model.<br>
Specifically, is there a real need for us to separate out sources of uncertainty in the model (epistemic vs. aleatoric)? Are there situations where being Bayesian is more efficient than writing down a good objective function for the deterministic model (as in the case of continual learning)?