# Uncertainty-Aware (UNA) Bases for Deep Bayesian Regression using Multi-Headed Auxiliary Networks

###### tags: `papers`, `uncertainty`, `nlm`

1/31/22: Todo: replicate experiments; sample from the prior and check that we get the same functions.

## Main Ideas
- NLMs are deep Bayesian models: a neural network learns features, then Bayesian linear regression over those features produces predictive uncertainties.
- Few works have methodically evaluated the predictive uncertainties of NLMs.
- Traditional training procedures for NLMs underestimate uncertainty on OOD data.
- The paper identifies the underlying reasons and proposes a novel training framework that yields useful predictive uncertainties.

## Inference Procedures for Standard NLM Models
- **Model**
$$
\begin{aligned}
\mathbf y &\sim \mathcal N (\Phi_\theta \mathbf w, \sigma^2 I) \\
\mathbf w &\sim \mathcal N(0, \alpha I) \\
\Phi_\theta &= [\phi_\theta(\mathbf x_1), \dots, \phi_\theta (\mathbf x_N)]^T
\end{aligned}
$$
where $\phi_\theta$ is parametrized by a NN with weights $\theta$.
- **Inference Procedure**
    1. Learn $\theta$ with some objective (e.g. MLE, MAP, Maximum Marginal Likelihood).
    2. Given $\theta$, infer $p(\mathbf w \mid \mathcal D, \theta)$ analytically.

### MAP
- The most common approach: maximize the likelihood of the observed data with respect to $\theta$ and a point estimate $\tilde{\mathbf w}$ of the last layer's weights (together, $\theta_{Full} = \{\theta, \tilde{\mathbf w}\}$), with a regularization term standing in for the prior:
$$
\mathcal L_{MAP}(\theta_{Full}) = \log \mathcal N (\mathbf y; \Phi_\theta \tilde{\mathbf w}, \sigma^2 I) - \gamma ||\theta_{Full}||_2^2
$$
- Then discard $\tilde{\mathbf w}$ and keep $\theta$.
- QUESTION: why isn't the regularization just on $\mathbf w$? Isn't the prior only on $\mathbf w$, and not on $\theta_{Full}$?

#### Drawbacks
- The regularization prevents a diversity of functions.
- Experimentation shows no variation in data-scarce regions in the prior predictives.
- As a result, posterior predictives show little in-between uncertainty.

### MLE
- MAP, but with $\gamma = 0$.
- Makes sense: it literally maximizes the likelihood of the data.

#### Drawbacks
- Does not discourage functional diversity, but does not encourage it either, so any appearance of diversity is accidental.
- OPINION: not really any good support for this other than empirical evidence.

### Maximum Marginal Likelihood
- Maximize the marginal likelihood with respect to $\theta$, integrating $\mathbf w$ out under its prior (so $\mathbf w$ is detached from $\theta_{Full}$):
$$
\mathcal L_{Marginal}(\theta) = \log \mathbb E_{p(\mathbf w)}[\mathcal N (\mathbf y; \Phi_\theta \mathbf w, \sigma^2 I)] - \gamma ||\theta||_2^2
$$

#### Drawbacks
- Suffers the same issues as MAP when $\gamma > 0$; when $\gamma = 0$, it behaves similarly to MLE.
- For ReLU networks, the feature-map weights can be scaled up arbitrarily while the last-layer weights are scaled down to compensate.

### General Drawbacks of Log-Likelihood-Based Training
- Does not measure the quality of uncertainty in data-poor regions.

### Training Framework: Uncertainty-Aware Bases (UNA)
LUNA is an instantiation of UNA.
1. Feature Training with Diverse and Task-Appropriate Auxiliary Regressors
    - Train a set of auxiliary regressors that share $\theta$ but have different point-estimate weights $\mathbf w_m$, with an objective that trades fit off against diversity (here $\Psi$ collects $\theta$ and the auxiliary regressors' weights); see the sketches at the end of these notes:
$$
\mathcal L_{LUNA}(\Psi) = \mathcal L_{FIT}(\Psi) - \lambda \cdot \mathcal L_{DIVERSE}(\Psi)
$$
2. Bayesian Linear Regression on Features
    - Discard all the auxiliary $\mathbf w_m$ but keep $\theta$; infer the posterior $p(\mathbf w \mid \mathcal D, \theta)$ analytically.
    - Same as step 2 of the traditional inference procedure.
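
#### Sketch: closed-form posterior over the last-layer weights

As a reference for step 2 of the inference procedure (and for UNA's final step), here is a minimal numpy sketch of the standard Bayesian linear regression posterior over the last-layer weights, assuming the prior variance $\alpha$ and noise variance $\sigma^2$ are known; the function names are mine, not the paper's.

```python
import numpy as np

def last_layer_posterior(Phi, y, sigma2, alpha):
    """Closed-form posterior p(w | D, theta) for Bayesian linear regression
    over fixed features Phi = [phi_theta(x_1), ..., phi_theta(x_N)]^T.

    Prior:      w ~ N(0, alpha * I)
    Likelihood: y ~ N(Phi w, sigma2 * I)
    """
    D = Phi.shape[1]
    # Posterior precision: (1/alpha) I + (1/sigma2) Phi^T Phi
    precision = np.eye(D) / alpha + Phi.T @ Phi / sigma2
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / sigma2
    return mean, cov

def posterior_predictive(Phi_star, mean, cov, sigma2):
    """Posterior predictive mean and variance at test features Phi_star."""
    mu = Phi_star @ mean
    var = np.einsum("nd,de,ne->n", Phi_star, cov, Phi_star) + sigma2
    return mu, var
```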
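
#### Sketch: LUNA feature training with auxiliary heads

Below is a hedged PyTorch sketch of the LUNA feature-training step: a shared feature map $\phi_\theta$ with $M$ auxiliary linear heads, trained on an objective of the form $\mathcal L_{FIT} - \lambda \mathcal L_{DIVERSE}$. The concrete fit and diversity terms here (per-head MSE and pairwise prediction disagreement) are illustrative placeholders, since these notes do not reproduce the paper's actual choices; only the objective's structure is taken from above.

```python
import torch
import torch.nn as nn

class LUNAFeatureNet(nn.Module):
    """Shared feature map phi_theta with M auxiliary linear regressors on top.
    Widths and number of heads are illustrative, not the paper's."""

    def __init__(self, in_dim, feat_dim=50, n_aux=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(in_dim, 50), nn.ReLU(),
            nn.Linear(50, feat_dim), nn.ReLU(),
        )
        # M auxiliary heads: shared theta, separate point-estimate weights w_m
        self.aux_heads = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(n_aux)])

    def forward(self, x):
        phi = self.features(x)
        return torch.cat([head(phi) for head in self.aux_heads], dim=1)  # (N, M)

def luna_loss(preds, y, lam=1.0):
    """L_LUNA = L_FIT - lambda * L_DIVERSE (structure from the notes; terms are placeholders).

    L_FIT: average squared error of each auxiliary head against the targets.
    L_DIVERSE: mean pairwise squared difference between head predictions,
    a stand-in for the paper's diversity measure.
    """
    fit = ((preds - y) ** 2).mean()
    diffs = preds.unsqueeze(2) - preds.unsqueeze(1)  # (N, M, M) pairwise disagreement
    diversity = (diffs ** 2).mean()
    return fit - lam * diversity
```

After training on this objective, the auxiliary heads would be discarded and the posterior over a fresh last layer computed on the learned features, e.g. with the `last_layer_posterior` sketch above.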