# A Bayesian Perspective on Meta-Learning
by Yee Whye Teh
Let's think more about the priors we use.
For BNNs, the prior on the weights is often a zero-mean Gaussian with some variance. However, there are some problems:
- we don't know what an individual neuron means, so it's hard to choose a prior over its weights
- it's hard to translate domain knowledge into priors on weights
Alternative: place priors directly on the functions we want to learn, via a Gaussian process (GP), by specifying the first two moments -- a mean function and a covariance (kernel) function.
Here's an example of a prior whose draws are quadratic functions.
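A minimal sketch of what such a prior can look like (my own illustration, not code from the talk; the variance numbers are arbitrary assumptions): put a Gaussian prior on the weights of the features [1, x, x^2]. This is a degenerate GP with kernel k(x, x') = phi(x)^T Sigma phi(x'), and every draw is exactly a quadratic.

```python
# Minimal sketch (not from the talk): a prior whose draws are quadratic functions.
# All variance choices below are arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 100)
Phi = np.stack([np.ones_like(xs), xs, xs**2], axis=1)   # features [1, x, x^2], shape (100, 3)

Sigma_w = np.diag([1.0, 1.0, 0.5])     # Gaussian prior covariance over the 3 weights
K = Phi @ Sigma_w @ Phi.T              # implied GP covariance: k(x, x') = phi(x)^T Sigma_w phi(x')
m = np.zeros(len(xs))                  # zero mean function

# Each sample from this GP is Phi @ w with w ~ N(0, Sigma_w), i.e. exactly a quadratic in x.
draws = rng.multivariate_normal(m, K + 1e-8 * np.eye(len(xs)), size=5)
```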

One idea is to take GP priors and translate them into BNN priors (see Pierce et al.). But if we already have a GP prior we like, why not just work in the GP framework directly, with no need to translate it into a BNN prior?
Goal: Let's use meta-learning to learn priors over functions.
A stochastic process (SP) is an infinite collection of random variables; we interpret an SP as a random function. To construct an SP, we specify its finite-dimensional marginal distributions and extend them to a full process via the Kolmogorov Extension Theorem. For this to work, the finite-dimensional marginals must satisfy two properties -- exchangeability and consistency.

(the f(x_i)'s in a GP are these finite-dimensional marginals)
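Written out (the standard Kolmogorov conditions, in my wording rather than the speaker's), the finite-dimensional marginals p(f(x_1), ..., f(x_n)) must satisfy:

```latex
% Exchangeability: relabelling the inputs (any permutation \pi) leaves the joint unchanged
p\big(f(x_1), \dots, f(x_n)\big) = p\big(f(x_{\pi(1)}), \dots, f(x_{\pi(n)})\big)

% Consistency: marginalising out an extra point recovers the smaller marginal
\int p\big(f(x_1), \dots, f(x_n), f(x_{n+1})\big)\, df(x_{n+1}) = p\big(f(x_1), \dots, f(x_n)\big)
```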
So, for this talk, Prior = GP = Random function = an SP that we want to construct.
Enter meta-learning: how do we learn a system that can do few-shot learning? (i.e. very little training data, yet it generalises well to a large test set)


(multi-task learning corresponds to a hierarchical Bayesian framework -- sharing parameters across tasks)
To learn, we backprop through the whole procedure.

We train on one set of tasks and then test on a different set of tasks; a rough sketch of one episodic training step is below.
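This toy sketch is entirely my own example, loosely in the spirit of a conditional neural process; `TinyCNP`, `sample_task`, and every hyperparameter here are made up, not the speaker's model.

```python
# Hypothetical sketch of episodic meta-training: sample a task, split it into a
# small context set and a target set, and backprop the target log-likelihood.
import torch
import torch.nn as nn

class TinyCNP(nn.Module):
    """Toy conditional model: encode (x, y) context pairs, mean-pool, decode targets."""
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.decoder = nn.Sequential(nn.Linear(hidden + 1, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, xc, yc, xt):
        r = self.encoder(torch.cat([xc, yc], dim=-1)).mean(dim=0, keepdim=True)  # task summary
        h = self.decoder(torch.cat([xt, r.expand(len(xt), -1)], dim=-1))
        mu, raw_sigma = h.chunk(2, dim=-1)
        return mu, torch.nn.functional.softplus(raw_sigma) + 1e-3

def sample_task(n=32):
    """Made-up task distribution: sinusoids with random amplitude and phase."""
    a, p = 1 + 4 * torch.rand(1), 3.14 * torch.rand(1)
    x = 10 * torch.rand(n, 1) - 5
    return x, a * torch.sin(x + p)

model = TinyCNP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    x, y = sample_task()
    xc, yc, xt, yt = x[:8], y[:8], x[8:], y[8:]      # few-shot context, larger target set
    mu, sigma = model(xc, yc, xt)
    loss = -torch.distributions.Normal(mu, sigma).log_prob(yt).mean()
    opt.zero_grad(); loss.backward(); opt.step()     # "to learn, we backprop"
```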

When looked at from a Bayesian perspective, meta-learning is about learning priors over functions.
# Some rich datasets with distribution shift
- Tabular weather data from 2018-2019 (regression + classification). Shift: the climate shifts over time.
- Machine translation. It's not trivial to get uncertainty estimates. Shift: misspellings, bad grammar, slang, profanity, etc.
- Vehicle motion prediction. Trajectory data with extra information, e.g. acceleration and turning left/right. Shift: people doing weird things on the road.
Small takeaway from the competition: when the model is cheap, run ensembles!
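A tiny illustration of that takeaway (my own sketch, not code from any competition entry): train K cheap models with different seeds and use their average and spread.

```python
# Sketch: a small deep-ensemble-style baseline -- train K cheap models with
# different random seeds, average the predictions, use disagreement as uncertainty.
import numpy as np
from sklearn.neural_network import MLPRegressor

def ensemble_predict(X_train, y_train, X_test, k=10):
    preds = []
    for seed in range(k):
        member = MLPRegressor(hidden_layer_sizes=(64,), random_state=seed, max_iter=500)
        member.fit(X_train, y_train)
        preds.append(member.predict(X_test))
    preds = np.stack(preds)                        # shape (k, n_test)
    return preds.mean(axis=0), preds.std(axis=0)   # prediction + crude uncertainty estimate
```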
# An Automatic Finite-Data Robustness Metric for Bayes and Beyond: Can Dropping a Little Data Change Conclusions?
by Tamara Broderick
Q: If I take away some small subset of data, will the results change totally?
Yes; they gave examples where dropping 0.1% of the data changes the sign of the result.
Q: Can we detect such sensitivity if it exists?
Doing it by brute force (checking all small subsets) is far too expensive. Here they suggest an approximation!
Q: Should we care about this sensitivity?
Yes and no; it depends on the application. Importantly, it should affect our willingness to turn a result into a decision, e.g. when the population your policy applies to is not the same as the population you analysed.
## Method
- We rephrase dropping data as reweighting the data points (weight 1 = keep, weight 0 = drop)
- Form a first-order Taylor approximation of the estimator in those weights
- (see more in the paper; a rough sketch for OLS is below)
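Here's a hedged sketch of the idea specialised to plain OLS (my own simplified illustration, not the authors' code; their method handles more general estimators and also checks changes in significance, not just sign, and the function name here is hypothetical):

```python
# Sketch: "dropping data = setting data weights to zero", then a first-order Taylor
# expansion of the estimator in those weights. For OLS, the first-order effect on
# beta of dropping row i is approximately -(X^T X)^{-1} x_i * residual_i.
import numpy as np

def approx_sign_flip_by_dropping(X, y, coef_index, alpha=0.001):
    """Can dropping at most an alpha-fraction of rows flip the sign of one OLS
    coefficient, according to the linear (influence-function) approximation?"""
    n = len(y)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # influence_i = approximate change in beta[coef_index] if row i is dropped
    influences = -(X @ XtX_inv[:, coef_index]) * resid
    # The approximation is linear in the weights, so influences of dropped rows add.
    # Pick the k rows that push the coefficient hardest towards the opposite sign
    # and see whether that is enough to cross zero.
    k = int(np.floor(alpha * n))
    s = np.sign(beta[coef_index])
    pushes = np.sort(s * influences)          # negative entries oppose the current sign
    worst = pushes[pushes < 0][:k]
    predicted = beta[coef_index] + s * worst.sum()
    return {"original": beta[coef_index],
            "predicted_after_drop": predicted,
            "sign_flips": bool(predicted * s < 0)}
```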
What causes the sensitivity?
Not small sample size, not outliers, not misspecification!
It's... low signal-to-noise ratio (SNR)!
What about when the effects ARE in fact dominated by a small subset -- how do we decide when to use the method and when not to? Well... we need to think about it subjectively, per application. Importantly, we need to be clear and up front about the context: say we study criminals in a big town of 10,000 people; if we only observe 10 criminals, then we should NOT say we have 10,000 data points -- we really have 10.