# Uncertainty-Aware (UNA) Bases for Deep Bayesian Regression using Multi-Headed Auxiliary Networks
###### tags: `papers`, `uncertainty`, `nlm`
1/31/22:
Todo: replicate experiments. sample from the prior and get the same function
## Main Ideas:
- NLMs are deep Bayesian models that use a neural network to learn feature representations and then perform Bayesian linear regression over these features, producing predictive uncertainties.
- Few works have methodically evaluated the predictive uncertainties of NLMs
- Traditional training procedures for NLMs underestimate uncertainty on out-of-distribution (OOD) data
- The paper identifies the underlying reasons
- and proposes a novel training framework (UNA) that produces useful predictive uncertainties
## Inference Procedures for standard NLM Models
- **Model**
$$
\begin{aligned}
\mathbf y &\sim \mathcal N (\Phi_\theta \mathbf w, \sigma^2 I) \\
\mathbf w &\sim \mathcal N(0, \alpha I) \\
\Phi_\theta &= [\phi_\theta(\mathbf x_1), \dots, \phi_\theta (\mathbf x_N)]^T
\end{aligned}
$$
and $\phi_\theta$ is parametrized by a NN with weights $\theta$.
- **Inference Procedures**
1. Learn $\theta$ with some objective (e.g. MLE, MAP, Maximum Marginal Likelihood)
2. Given $\theta$, infer $p(\mathbf w | \mathcal D, \theta)$ analytically.
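A minimal numpy sketch of step 2 (Bayesian linear regression on fixed features), plus prior-function sampling relevant to the todo above. `Phi` is the $N \times D$ feature matrix; all names and defaults here are my own, not the paper's code:

```python
import numpy as np

def blr_posterior(Phi, y, alpha, sigma2):
    """Closed-form posterior p(w | D, theta) for y ~ N(Phi w, sigma2 I), w ~ N(0, alpha I)."""
    D = Phi.shape[1]
    precision = Phi.T @ Phi / sigma2 + np.eye(D) / alpha   # posterior precision
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / sigma2
    return mean, cov

def posterior_predictive(Phi_star, mean, cov, sigma2):
    """Posterior predictive mean and variance at test features Phi_star."""
    mu = Phi_star @ mean
    var = np.einsum('nd,de,ne->n', Phi_star, cov, Phi_star) + sigma2
    return mu, var

def sample_prior_functions(Phi, alpha, n_samples=10, seed=0):
    """Draw noise-free functions from the prior: w ~ N(0, alpha I), f = Phi w."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(alpha), size=(n_samples, Phi.shape[1]))
    return W @ Phi.T
```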
### MAP
- the most common
- maximize the likelihood of the observed data with respect to $\theta$ and a point estimate $\tilde{\mathbf w}$ of the last layer's weights (together, $\theta_{Full}$), plus an L2 regularization term playing the role of the prior (sketched in code after this list)
$$
\mathcal L_{MAP}(\theta_{Full}) = \log \mathcal N (\mathbf y \mid \Phi_\theta \tilde{\mathbf w}, \sigma^2 I) - \gamma ||\theta_{Full}||_2^2
$$
- Then discard $\tilde{\mathbf w}$ and use $\theta$.
- QUESTION: why isn't the regularization just on $\mathbf w$? Isn't the prior only on $\mathbf w$, and not on $\theta_{Full}$?
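A minimal PyTorch sketch of the MAP objective, assuming `phi` is the feature network and `w` the last-layer point estimate (names and hyperparameter values are mine, not the paper's):

```python
import math
import torch

def map_loss(phi, w, x, y, sigma2=0.1, gamma=1e-3):
    """Negative of L_MAP: Gaussian log-likelihood of y under N(Phi_theta(x) w, sigma2 I),
    minus an L2 penalty over all parameters theta_Full = (theta, w_tilde)."""
    features = phi(x)                    # Phi_theta: (N, D)
    pred = features @ w                  # point-estimate last layer: (N,)
    log_lik = (-0.5 * ((y - pred) ** 2 / sigma2 + math.log(2 * math.pi * sigma2))).sum()
    l2 = sum((p ** 2).sum() for p in phi.parameters()) + (w ** 2).sum()
    return -(log_lik - gamma * l2)       # minimize with SGD/Adam over (theta, w)
```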
#### Drawbacks
- The regularization prevents a diversity of functions
- experiments show: the prior predictive exhibits almost no variation in data-scarce regions
- as a result, the posterior predictive shows little in-between uncertainty
### MLE
- MAP, but with $\gamma = 0$.
- Makes sense, literally maximizing the likelihood of the data
#### Drawbacks
- Does not discourage functional diversity, but does not encourage it either, so whether diverse functions appear is essentially down to chance
- OPINION: Not really any good support for this other than empirical evidence
### Maximum Marginal Likelihood
- We maximize the marginal likelihood with respect to $\theta$ alone, integrating $\mathbf w$ out under its prior instead of training it as part of $\theta_{Full}$:
$$
\mathcal L_{Marginal}(\theta) = \log \mathbb E_{p(\mathbf w)}[\mathcal N (\mathbf y \mid \Phi_\theta \mathbf w, \sigma^2 I)] - \gamma ||\theta||_2^2
$$
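Since the prior on $\mathbf w$ is Gaussian, the expectation has a closed form: marginally $\mathbf y \sim \mathcal N(0, \alpha \Phi_\theta \Phi_\theta^T + \sigma^2 I)$. A numpy sketch (names are mine):

```python
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, sigma2):
    """log E_{p(w)}[N(y | Phi w, sigma2 I)] = log N(y | 0, alpha Phi Phi^T + sigma2 I)."""
    N = Phi.shape[0]
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(N)       # marginal covariance of y
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + N * np.log(2 * np.pi))
```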
#### Drawbacks
- Suffers from the same issue as MAP when $\gamma > 0$; when $\gamma = 0$, it behaves similarly to MLE.
- For ReLU networks, earlier-layer weights can be increased arbitrarily while the last layer's weights are decreased to compensate, leaving the network's outputs unchanged.
### General Drawbacks of log-likelihood based training
- the log-likelihood does not measure the quality of uncertainty in data-poor regions
### Training Framework: Uncertainty-Aware Bases (UNA)
LUNA: the paper's concrete instantiation of the UNA framework
1. Feature Training with Diverse and Task-Appropriate Auxiliary Regressors
    - Basically: train many auxiliary regressors ("heads") that share the feature map $\theta$ but each have their own point-estimate weights $\mathbf w_m$, with an objective that trades off fit against diversity (here $\Psi$ collects $\theta$ and all the heads' weights; see the sketch at the end of this section)
$$
\mathcal L_{LUNA}(\Psi) = \mathcal L_{FIT}(\Psi) - \lambda \cdot \mathcal L_{DIVERSE}(\Psi)
$$
2. Bayesian Linear Regression on Features
    - Discard all the auxiliary heads' $\mathbf w_m$ but keep $\theta$; infer the posterior $p(\mathbf w \mid \mathcal D, \theta)$ analytically.
    - Same as in traditional NLM training
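A rough PyTorch sketch of the LUNA objective under a minimization convention ($\mathcal L_{FIT}$ as a negative log-likelihood, $\mathcal L_{DIVERSE}$ as a diversity score). The specific diversity measure below (pairwise cosine similarity of the heads' centered predictions) is a stand-in, not necessarily the paper's measure; all names are mine:

```python
import torch

def luna_loss(phi, heads, x, y, lam=1.0, sigma2=0.1):
    """Sketch of L_LUNA = L_FIT - lambda * L_DIVERSE over M auxiliary regressors
    sharing the feature map phi. Minimization convention and the diversity term
    are assumptions, not the paper's exact formulation."""
    features = phi(x)                                     # Phi_theta: (N, D)
    preds = torch.stack([features @ w for w in heads])    # (M, N), one row per head

    # L_FIT: Gaussian negative log-likelihood, summed over heads and data points
    fit = (0.5 * (y.unsqueeze(0) - preds) ** 2 / sigma2).sum()

    # L_DIVERSE (stand-in): heads that make dissimilar predictions score high
    centered = preds - preds.mean(dim=1, keepdim=True)
    normed = centered / (centered.norm(dim=1, keepdim=True) + 1e-8)
    cos2 = (normed @ normed.T) ** 2                       # (M, M) squared cosine similarities
    M = preds.shape[0]
    off_diag = (cos2.sum() - M) / (M * (M - 1))           # mean over distinct pairs
    diversity = 1.0 - off_diag

    return fit - lam * diversity                          # minimize: fit the data, stay diverse
```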