general
Question we are trying to answer
covariance structure
size of data with respect to width
MAP training (determinant)
claim that only a small number of bases fit the data?
Motivation
Current literature on NLM
Overarching research questions
What is a good basis?
BUT - this is dependent on where the data is
EX1. Consider any basis that is infinitely differentiable in a region: if we consider points close enough together, the basis is locally well approximated by an affine function of the inputs, so the model reduces to piecewise linear regression and we expect neither a good fit nor good uncertainties (we think)
If this is true, it tells us that we need to specify the in-distribution and out-of-distribution regions of the data in order to ask this question meaningfully (a small numerical sketch of the EX1 point follows)
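A minimal numerical sketch of the EX1 intuition, assuming an RBF basis as a stand-in for "any infinitely differentiable basis" (the centers and the interval are illustrative choices): on a tight cluster of inputs the feature map is nearly affine in x, so Bayesian linear regression on these features is locally just linear regression.

```python
import numpy as np

# Assumption: RBF features stand in for a smooth (learned) basis.
rng = np.random.default_rng(0)
centers = rng.uniform(-3, 3, size=20)                      # basis-function centers
phi = lambda x: np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

# Inputs clustered in a small region ("points close enough" to each other).
x = np.linspace(0.0, 0.05, 50)
Phi = phi(x)                                               # (50, 20) feature matrix

# Best affine approximation of every basis function on this region.
A = np.column_stack([np.ones_like(x), x])                  # intercept + slope design
coef, *_ = np.linalg.lstsq(A, Phi, rcond=None)
print("max deviation from an affine map:", np.abs(Phi - A @ coef).max())
# The deviation is tiny, so any linear model built on these features is
# (locally) just an affine function of x: the basis adds no flexibility here.
```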
This leads us to...
Main Ideas:
It is mysterious why NNs can generalize well with so many more parameters than data points
Effective dimensionality (ED) measures the dimensionality of the parameter space determined by the data
Relates ED to posterior contraction in Bayesian deep learning, model selection, width-depth tradeoffs, double descent, and functional diversity in loss surfaces.
ED compares favorably to alternative norm- and flatness-based generalization measures.
Effective Dimensionality (ED)
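A minimal sketch of the ED computation, using the formula from the paper, $N_{eff}(H, z) = \sum_i \lambda_i / (\lambda_i + z)$, where $\lambda_i$ are eigenvalues of the Hessian of the loss and $z > 0$ is a regularization constant (related to the prior precision); the toy eigenvalues below are assumed purely for illustration.

```python
import numpy as np

def effective_dimensionality(hessian_eigenvalues, z=1.0):
    """ED = sum_i lambda_i / (lambda_i + z).

    Eigenvalues well above z each contribute ~1 (a direction determined by the
    data); eigenvalues well below z contribute ~0 (a direction still governed
    by the prior), so ED counts the parameter directions the data has pinned down.
    """
    lam = np.asarray(hessian_eigenvalues, dtype=float)
    return float(np.sum(lam / (lam + z)))

# Toy spectrum (assumed): a few large eigenvalues and many near-zero ones.
eigs = np.concatenate([np.full(10, 1e3), np.full(990, 1e-3)])
print(effective_dimensionality(eigs, z=1.0))   # ~11, despite 1000 parameters
```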
Lectures
UC-Berkeley 20 min primer on Meta Learning
https://www.youtube.com/watch?v=h7qyQeXKxZE
Papers
Meta-Learning in Neural Networks: A Survey
Notes
Hypothesis: Good basis functions should transfer well to different tasks, but data-specific basis functions might not transfer well.
Deep Kernel Learning (DKL) combines the representational power of neural networks with the reliable uncertainty estimates of Gaussian processes by learning a flexible deep kernel function. The complexity is O(n) at train time and O(1) at test time, compared to O(n^3) at train time and O(n^2) at test time for standard Gaussian processes.
One of the central critiques of Gaussian process regression is that it does not actually learn representations of the data: the kernel function is specified in advance and is not flexible enough to do so. We can address this through deep kernel learning (DKL), which maps the inputs $x_n$ to intermediate values $v_n \in \mathbb{R}^Q$ through a neural network $g_\phi(\cdot)$ parameterized by weights and biases $\phi$. These intermediate values are then used as inputs to the standard kernel, resulting in the effective kernel $k_{DKL}(x, x') = k(g_\phi(x), g_\phi(x'))$.
NLM is essentially Deep Kernel Learning with a linear kernel
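A minimal sketch of the deep kernel, assuming a toy one-hidden-layer network as $g_\phi$ and an RBF base kernel (all sizes and weights here are illustrative); swapping in a linear base kernel recovers the NLM connection above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature network g_phi: x in R^d -> v in R^Q (assumed 1-hidden-layer MLP).
d, h, Q = 2, 16, 3
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, Q)), np.zeros(Q)
g_phi = lambda X: np.tanh(X @ W1 + b1) @ W2 + b2

def k_rbf(A, B, lengthscale=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def k_dkl(X, Xp):
    # Effective kernel: base kernel applied to learned features,
    # k_DKL(x, x') = k(g_phi(x), g_phi(x')).
    return k_rbf(g_phi(X), g_phi(Xp))

def k_nlm(X, Xp):
    # With a linear base kernel this is Bayesian linear regression on the
    # features, i.e. the neural linear model (NLM).
    return g_phi(X) @ g_phi(Xp).T

X = rng.normal(size=(5, d))
print(k_dkl(X, X).shape, k_nlm(X, X).shape)   # (5, 5) (5, 5)
```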
Paper main idea: new measurement of generalization; understand dropout’s effectiveness in improving generalization
New measurement proposed by the paper: weight expansion. With larger weight volume, one can achieve increased generalization in a PAC-Bayesian setting.
Application: Apply weight expansions to dropout. Theoretically and empirically examine that the application of dropout during training “expands” the weight volume.
Definition: weight volume is the normalized determinant of the weight covariance matrix
Intuitively, the more correlated the weights are, the smaller the weight volume and the worse the generalization ability
More orthogonal -> larger weight volume
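A minimal sketch of the weight-volume quantity, assuming "normalized determinant" means dividing out the per-dimension variances, i.e. taking the determinant of the weight correlation matrix (the paper's exact normalization may differ); this choice makes the correlation intuition above explicit.

```python
import numpy as np

def weight_volume(weight_samples):
    """Normalized determinant of the weight covariance matrix.

    Assumption: normalize by the per-weight variances, i.e. take the
    determinant of the correlation matrix. The value is 1 for uncorrelated
    weights and shrinks toward 0 as the weights become more correlated.
    """
    cov = np.cov(weight_samples, rowvar=False)
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    return float(np.linalg.det(corr))

rng = np.random.default_rng(0)
n, d = 5000, 4
uncorrelated = rng.normal(size=(n, d))
correlated = uncorrelated @ np.linalg.cholesky(0.9 * np.ones((d, d)) + 0.1 * np.eye(d)).T
print(weight_volume(uncorrelated))   # close to 1 (larger volume)
print(weight_volume(correlated))     # much smaller (worse generalization, per the paper)
```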
1/31/22:
Todo: replicate experiments; sample from the prior and check that we get the same functions
Main Ideas:
NLMs are deep Bayesian models that produce predictive uncertainties by learning features with a neural network and then performing Bayesian linear regression over these features
Few works have methodically evaluated the predictive uncertainties of NLMs
Traditional training procedures for NLMs underestimate uncertainty on OOD data
Identify underlying reasons
Idea: Adding functional priors and using fVI (note: I saw fPOVI in another paper), we get BNN-like uncertainty using NLMs
Questions/Technicalities:
Prior choice is hard? Why use an uninformative prior when we have other methods
Brings complexity up: quadratic -> cubic in the worst case. Probably should not proceed with this approach.
GPs scale cubically with the number of observations, whereas NLMs scale linearly in the number of observations and cubically only in the basis-function dimensionality, making Bayesian optimization easier while maintaining flexibility and uncertainty (see the sketch below)
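To make the scaling concrete, here is a minimal sketch of the Bayesian linear regression step of an NLM over fixed features: the only matrix to factor is D x D (D = number of basis functions), so the cost is O(n D^2) for the feature products plus O(D^3) for the solve, never O(n^3) as in a GP. All names and sizes are illustrative, not from a specific paper.

```python
import numpy as np

def nlm_posterior(Phi, y, noise_var=0.1, prior_var=1.0):
    """Bayesian linear regression over fixed NN features Phi (n x D).

    Forming Phi.T @ Phi costs O(n D^2); inverting the D x D matrix costs
    O(D^3). Nothing scales cubically in the number of observations n.
    """
    n, D = Phi.shape
    A = Phi.T @ Phi / noise_var + np.eye(D) / prior_var   # D x D posterior precision
    Sigma = np.linalg.inv(A)                              # posterior covariance
    mu = Sigma @ Phi.T @ y / noise_var                    # posterior mean
    return mu, Sigma

def nlm_predict(phi_star, mu, Sigma, noise_var=0.1):
    mean = phi_star @ mu
    var = np.einsum('id,de,ie->i', phi_star, Sigma, phi_star) + noise_var
    return mean, var

rng = np.random.default_rng(0)
n, D = 10_000, 50                      # many observations, few basis functions
Phi = rng.normal(size=(n, D))
y = Phi @ rng.normal(size=D) + 0.3 * rng.normal(size=n)
mu, Sigma = nlm_posterior(Phi, y)
print(nlm_predict(Phi[:3], mu, Sigma))
```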
Related applications:
Applications in reinforcement learning (Riquelme et al., 2018, https://arxiv.org/abs/1802.09127; Azizzadenesheli and Anandkumar, 2019, https://arxiv.org/abs/1802.04412), active learning, and AutoML (Zhou and Precioso, 2019, https://arxiv.org/abs/1904.00577)
Todo (lucy): understand the math?
Paper main ideas:
Three variations of neural linear regression
MAP NL (first train the neural network using MAP estimation, then use the outputs of the last hidden layer as features for Bayesian linear regression; hyperparameters tuned with Bayesian optimization)
Uncertainty is only added at the last layer; MAP training does not learn the features with uncertainty quantification in mind
Regularized NL: learn the features by optimizing the tractable marginal likelihood with respect to the network weights (those prior to the output layer); a sketch of this objective follows the list
Bayesian noise NL
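A minimal sketch of the tractable log marginal likelihood that the Regularized NL variant maximizes with respect to the feature-network weights (the features Phi below stand in for the last-hidden-layer outputs; parameter names are assumptions).

```python
import numpy as np

def nlm_log_marginal_likelihood(Phi, y, noise_var=0.1, prior_var=1.0):
    """log p(y | X) for Bayesian linear regression on features Phi = Phi_theta(X):
    y ~ N(0, prior_var * Phi Phi^T + noise_var * I).

    In Regularized NL this is the training objective for the network weights
    theta that produce Phi. (The naive n x n version below can be rewritten
    via the matrix-inversion lemma to use only D x D factorizations.)
    """
    n = Phi.shape[0]
    K = prior_var * Phi @ Phi.T + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + logdet + n * np.log(2 * np.pi))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 10))
y = Phi @ rng.normal(size=10) + 0.3 * rng.normal(size=100)
print(nlm_log_marginal_likelihood(Phi, y))
```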
MAIN IDEAS:
There is a limited understanding of the effects of depth and width on learned representations
How does varying depth and width affect model hidden representations?
Characteristic block structure in hidden representations of larger capacity models
Implies model capacity is large relative to the size of the training set
Reflects underlying layers preserving and propagating the dominant principal component of their representations
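The block structure in the paper is read off layer-by-layer representation similarity heatmaps; below is a minimal sketch of linear CKA, the similarity measure behind those heatmaps, with toy representations standing in for real layer activations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representations of the same n examples.

    X: (n, d1), Y: (n, d2) activation matrices, centered per feature.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

rng = np.random.default_rng(0)
n = 2000
layer_a = rng.normal(size=(n, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))          # rotation of the features
layer_b = layer_a @ Q + 0.1 * rng.normal(size=(n, 64))  # "next layer" mostly preserving layer_a
layer_c = rng.normal(size=(n, 64))                      # an unrelated representation
print(linear_cka(layer_a, layer_b))                     # close to 1
print(linear_cka(layer_a, layer_c))                     # close to 0
# A "block" in the heatmaps is a contiguous range of layers whose pairwise CKA
# values are all high, i.e. layers propagating essentially the same component.
```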
MAIN IDEAS:
Kernel machines and NNs possess universal function approximation properties
But their ways of choosing the appropriate function class differ
NNs learn representations by adapting their basis functions to the data
Kernel methods use a basis that is not adapted during training
Contrast the random features of approximated kernel machines with the learned features of NNs
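A minimal sketch of such a fixed, non-adapted basis: random Fourier features approximating an RBF kernel (Rahimi and Recht style). The frequencies are sampled once and never trained, in contrast to a network's learned basis; all sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, num_features=500, lengthscale=1.0):
    """Random (not learned) basis approximating the RBF kernel:
    z(x) = sqrt(2/D) * cos(W^T x + b), W ~ N(0, 1/lengthscale^2), b ~ U(0, 2*pi).
    """
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def rbf_kernel(A, B, lengthscale=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

X = rng.normal(size=(6, 3))
Z = random_fourier_features(X)
print(np.abs(Z @ Z.T - rbf_kernel(X, X)).max())   # small; shrinks as num_features grows
# The basis (W, b) is fixed after sampling; a neural network would instead
# adapt these "frequencies" to the data during training.
```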
MAIN IDEAS:
Approximate inference techniques for weight-space priors of BNNs suffer from several drawbacks
The ‘Bayesian last layer’ (BLL) is an alternative BNN approach that learns the feature space for an exact Bayesian linear model with explicit predictive distributions.
Its predictions outside of the data distribution (OOD) are typically overconfident
Overcome this weakness by introducing a functional prior on the model’s derivatives
This method enhances the BLL to Gaussian process-like performance on tasks where calibrated uncertainty is critical: OOD regression, Bayesian optimization and active learning
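A minimal sketch of the overconfidence mechanism described above, assuming saturating tanh random features as a stand-in for a trained BLL/NLM feature map (everything here is illustrative, not the paper's construction): far from the data the features saturate, so the last layer's predictive standard deviation converges to a constant instead of growing with distance, which is the behaviour the functional prior on derivatives is meant to correct.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained feature map: saturating tanh random features.
W = rng.normal(size=(1, 50))
b = rng.normal(size=50)
features = lambda x: np.tanh(x[:, None] @ W + b)

# Fit the Bayesian last layer on in-distribution data x in [-1, 1].
x_train = np.linspace(-1, 1, 100)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=100)
Phi = features(x_train)
noise_var, prior_var = 0.01, 1.0
Sigma = np.linalg.inv(Phi.T @ Phi / noise_var + np.eye(50) / prior_var)
mu = Sigma @ Phi.T @ y_train / noise_var

def predictive_std(x):
    P = features(x)
    return np.sqrt(np.einsum('id,de,ie->i', P, Sigma, P) + noise_var)

print(np.round(predictive_std(np.array([0.0, 10.0, 100.0, 1000.0])), 3))
# Far from the data the tanh features saturate, so the predictive std converges
# to a constant rather than growing with distance from the training set.
```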