# How good is the Bayes posterior in deep neural networks really?

## Overview

The motivation for this paper is the empirical observation, made in a series of previous works, that for Bayesian neural network models the heuristic of performing approximate inference with a 'temperized' posterior distribution at 'cold' temperatures $T \ll 1$ often gives much improved predictive performance on a held-out test data set compared to the usually assumed setting $T=1$. Further, improvements over non-Bayesian point-estimate baselines, computed by direct minimisation of a regularised loss function with SGD, are only seen when using such cold posteriors, with the predictive performance of Bayesian inference at $T=1$ generally poorer than the SGD point estimate. The paper empirically evaluates seven proposed hypotheses to explain this phenomenon, specifically:

1. *Inaccurate SDE simulation*: the SDE underlying the stochastic-gradient MCMC approach used to perform approximate inference is poorly simulated.
2. *Biased SG-MCMC*: the lack of a Metropolis accept step in the SG-MCMC method introduces bias.
3. *Minibatch noise*: gradient noise from minibatching causes inaccurate sampling at $T=1$.
4. *Bias-variance tradeoff*: as the posterior uncertainty increases as $T \to 1$, for a fixed number of independent posterior samples the variance of Monte Carlo estimates increases, and the effect of this variance on predictive performance outweighs any decrease in bias.
5. *Dirty likelihood*: the use of additional latent variables / sources of randomness in the neural network model (data augmentation, dropout, batch normalisation) is not correctly accounted for.
6. *Bad prior*: the simplistic priors generally used in Bayesian neural networks are inadequate and potentially unintentionally informative.
7. *Implicit initialisation prior in SGD*: there is a beneficial inductive bias from the initialisation of SGD trajectories that is harmed by SG-MCMC sampling.

The conclusions drawn are somewhat mixed, with the authors claiming their experiments indicate SG-MCMC is 'accurate enough' and not the source of the cold posterior effect, and that the poor choice of priors (hypothesis 6) seems to be a particularly important factor. The 'cold posteriors' are at a high level described as being equivalent to overcounting evidence, though as I will discuss it is not clear how true this characterisation is for the specific models considered, and in fact the main effect seems to be in scaling the prior distribution.

## Data

A supervised learning setting is considered, with a training data set

$$\mathcal{D} = \lbrace (x_n, y_n) \rbrace_{n=1}^N,$$

of $N$ pairs of inputs $x_n \in \mathcal{X}$ and outputs $y_n \in \mathcal{Y}$, and a further test data set

$$\mathcal{D}^* = \lbrace (x^*_n, y^*_n) \rbrace_{n=1}^{N^*},$$

of $N^*$ input-output pairs. The example datasets considered are both classification problems, with each $y_n$ assumed to be a one-of-$C$ encoding of the true label for input $x_n$. Specifically the two datasets are

Dataset  | Input $x_n$     | Output $y_n$ | $C$ | $N$   | $N^*$ |
---------|-----------------|--------------|-----|-------|-------|
CIFAR-10 | 32×32 RGB image | Image class  | 10  | 50000 | 10000 |
IMDB     | Review raw text | +/- rating   | 2   | 20000 | 25000 |

## Model

A generative model (for outputs given inputs) parameterised by $\theta \in \Theta \subseteq \mathbb{R}^P$ is assumed, with the outputs independently distributed given the parameters and inputs

$$ \theta \sim p(\cdot), \qquad y_n \sim \ell(\cdot \,|\, x_n, \theta) \quad \forall n \in 1{:}N. $$

Specifically $\theta$ is taken to be the set of weights and biases of a neural network model (either a 10-layer residual network for CIFAR-10 or a recurrent convolutional network for IMDB) which defines a differentiable function $f : \mathcal{X} \times \Theta \to \mathbb{R}^C$, and the likelihood is then defined as

$$ \log \ell(y_n \,|\, x_n, \theta) = \sum_{c=1}^C y_{n,c} f_c(x_n, \theta) - \log\left(\sum_{c=1}^C \exp(f_c(x_n, \theta))\right), $$

or in neural network parlance a cross-entropy loss function with softmax output activations. As a default the prior on $\theta$ is taken to be standard normal, i.e. $\theta \sim \mathcal{N}(0, \mathbb{I}_P)$.
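To make the likelihood concrete, here is a minimal NumPy sketch of the per-example log-likelihood above; the function name and array layout are my own choices rather than anything from the paper.

```python
import numpy as np

def log_likelihood(f_logits, y_onehot):
    """Per-example log-likelihood log l(y_n | x_n, theta).

    f_logits: array of shape (N, C), the network outputs f(x_n, theta).
    y_onehot: array of shape (N, C), one-of-C encodings of the labels.
    Returns an array of shape (N,) of log-probabilities.
    """
    # log-sum-exp with max subtraction for numerical stability
    m = f_logits.max(axis=1, keepdims=True)
    log_norm = m.squeeze(1) + np.log(np.exp(f_logits - m).sum(axis=1))
    return (y_onehot * f_logits).sum(axis=1) - log_norm
```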
## Posterior distribution and posterior predictive

The density of the posterior distribution on the parameters $\theta$ given data $\mathcal{D}$ then has the form

$$ p_{\mathcal{D}}(\theta) \propto \prod_{n=1}^N \ell(y_n\,|\,x_n,\theta) \, p(\theta) = \exp\left(-U_{\mathcal{D}}(\theta)\right), $$

with the *posterior energy function* (negative log unnormalised posterior density) defined as

$$ U_{\mathcal{D}}(\theta) = -\sum_{n=1}^N \log\ell(y_n\,|\,x_n,\theta) - \log p(\theta). $$

Integrating the observation model over the parameter posterior gives the *posterior predictive* on new input-output pairs

$$ \pi_{\mathcal{D}}(y^* \,|\, x^*) = \int_{\Theta} \ell(y^* \,|\, x^*, \theta)\, p_{\mathcal{D}}(\theta)\,\mathrm{d}\theta. $$

Approximations to this function can then be used to define evaluation metrics on the test set

$$ L_{\mathcal{D},\mathcal{D}^*}[h] = \sum_{n=1}^{N^*} h(\pi_{\mathcal{D}}(y^*_n \,|\, x^*_n)), $$

with $h(u) = \mathbb{1}_{[0.5, 1]}(u)$ corresponding to the test accuracy and $h(u) = -\log(u)$ the test cross-entropy.
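In practice the posterior predictive is approximated by averaging the predicted class probabilities over (approximate) posterior samples. A minimal NumPy sketch of this Monte Carlo estimate and of the two test metrics defined above (function and variable names are illustrative, not from the paper; accuracy is reported here as an average over the test set):

```python
import numpy as np

def monte_carlo_predictive(theta_samples, x_test, predict_probs):
    """Estimate pi_D(. | x*) by averaging predicted class probabilities
    over (approximate) posterior samples of theta.

    predict_probs(x_test, theta) is assumed to return an (N*, C) array of
    softmax output probabilities for the network with parameters theta."""
    probs = np.stack([predict_probs(x_test, theta) for theta in theta_samples])
    return probs.mean(axis=0)  # shape (N*, C)

def test_metrics(pred_probs, y_onehot):
    """Evaluate the two test metrics from the predictive probabilities
    assigned to the true test labels."""
    p_true = (pred_probs * y_onehot).sum(axis=1)  # pi_D(y*_n | x*_n)
    accuracy = (p_true >= 0.5).mean()             # h(u) = indicator of u in [0.5, 1]
    cross_entropy = -np.log(p_true).sum()         # h(u) = -log(u)
    return accuracy, cross_entropy
```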
## Temperized posterior

By introducing a *temperature* parameter $T \in \mathbb{R}_{>0}$, a 'temperized' posterior distribution is defined with density

$$ p_{\mathcal{D},T}(\theta) \propto p_{\mathcal{D}}(\theta)^{\frac{1}{T}} \propto \exp\left(-\frac{U_{\mathcal{D}}(\theta)}{T}\right). $$

This corresponds to the density of the true posterior of a generative model

$$ \theta \sim \mathcal{N}(0, T\,\mathbb{I}_P), \qquad y_n \sim \ell(\cdot \,|\, x_n, \theta)^{\frac{1}{T}} \quad \forall n \in 1{:}N, $$

i.e. the prior standard deviation is scaled by $\sqrt{T}$. It would have been interesting to decouple the effects on the prior and likelihood, by either performing separate experiments with only the likelihood tempered or defining separate temperatures for the prior and the likelihood terms. Similarly, under this characterisation it is natural to consider $T$ as an extra parameter to be inferred, although this may require a careful choice of prior on $T$.

Importantly, for the form of $\ell$ used in the paper the effect of large changes in $T$ is relatively small. For example, consider the binary classification case $C=2$, where it is now assumed that $\mathcal{Y} = \lbrace -1,+1\rbrace$ and $f : \mathcal{X} \times \Theta \to \mathbb{R}$; then $\ell$ can be simplified to a logistic-Bernoulli likelihood, i.e.

$$ \ell(y \,|\, x, \theta) = \sigma(yf(x,\theta)) = \frac{1}{1 + \exp(-y f(x, \theta))}. $$

Raising $\sigma(u)$ to the power $\frac{1}{T}$ leads to minimal change in its 'shape', with the main effect being an apparent shift along the horizontal axis. The derivative of $\sigma(u)^{\frac{1}{T}}$ with respect to $u$, evaluated at the $u$ where $\sigma(u)^{\frac{1}{T}} = 0.5$, is $\left(\frac{2 T}{2^T - 1} + 2 T\right)^{-1} = \frac{1 - 2^{-T}}{2T}$, which is $0.25$ for $T=1$ and tends to $\frac{\log{2}}{2} \approx 0.35$ as $T \to 0$.

It is also unclear whether the corresponding temperized posterior predictive is defined as

$$ \pi_{\mathcal{D},T}(y^* \,|\, x^*) = \int_{\Theta} \ell(y^* \,|\, x^*, \theta)\, p_{\mathcal{D},T}(\theta)\,\mathrm{d}\theta \quad\text{or}\quad \pi_{\mathcal{D},T}(y^* \,|\, x^*) = \int_{\Theta} \ell(y^* \,|\, x^*, \theta)^{\frac{1}{T}}\, p_{\mathcal{D},T}(\theta)\,\mathrm{d}\theta, $$

i.e. whether the tempering is also used to adjust the observation probabilities in the predictions (which would be more consistent with considering tempering as corresponding to a different generative model).

## Approximate inference

Due to the large dataset sizes typically used with neural network models, 'standard' approximate inference methods such as MCMC can be prohibitively costly due to the need to iterate through the full dataset at least once per transition. It is therefore common to use 'approximate approximate inference' methods such as stochastic-gradient MCMC approaches. In the paper a damped second-order Langevin diffusion

$$ \mathrm{d}\theta = M^{-1} m \,\mathrm{d}t,\qquad \mathrm{d}m = -\left(\nabla U_{\mathcal{D}}(\theta) + \gamma m\right)\mathrm{d}t + \sqrt{2\gamma T}M^{\frac{1}{2}}\,\mathrm{d}w, $$

which has the temperized posterior distribution at temperature $T$ as its marginal stationary distribution on $\theta$, is used as the basis for the SG-MCMC method, with $m \in \mathbb{R}^P$ an auxiliary momentum variable, $\gamma > 0$ a damping or friction parameter, $M$ a positive-definite mass or preconditioning matrix and $w$ a $P$-dimensional Wiener process. This diffusion is simulated with a symplectic Euler discretisation, and with the true energy gradient $\nabla U_{\mathcal{D}}$ replaced by an unbiased estimate computed on a minibatch $\mathcal{B} \subseteq \lbrace 1,\ldots, N\rbrace$,

$$ \nabla U_{\mathcal{D}}(\theta) \approx -\frac{N}{|\mathcal{B}|} \sum_{n\in\mathcal{B}} \nabla \log \ell(y_n \,|\, x_n, \theta) - \nabla\log p(\theta). $$

The discretisation of the diffusion (without a Metropolis acceptance correction) and the use of a minibatch gradient estimator each mean that a Markov chain simulated using the discretised dynamic is no longer guaranteed to have the (temperized) posterior as its stationary distribution on $\theta$. The authors claim to control these errors through the use of a layerwise preconditioner $M$ and a cyclical time-stepping scheme, which periodically varies the time step from large to small values, with the chain samples only output at the end of each cycle.
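As a rough illustration, the following NumPy sketch transcribes the update equations above directly: a symplectic Euler step with a diagonal preconditioner, the minibatch gradient estimate, and a cyclical cosine step-size schedule. It is a sketch under my own naming and parameterisation assumptions, not the authors' implementation (which is parameterised somewhat differently in terms of learning rate and momentum decay).

```python
import numpy as np

def minibatch_grad_U(theta, xs, ys, grad_log_lik, grad_log_prior, N):
    """Unbiased minibatch estimate of the posterior energy gradient
    grad U_D(theta) as defined above (xs, ys form the minibatch)."""
    g = np.zeros_like(theta)
    for x, y in zip(xs, ys):
        g -= grad_log_lik(y, x, theta)
    return (N / len(xs)) * g - grad_log_prior(theta)

def sgmcmc_step(theta, m, grad_U_hat, h, gamma, T, M_diag, rng):
    """One symplectic Euler step: momentum update (gradient, friction and
    temperature-scaled noise) followed by a position update with the new
    momentum, using a diagonal preconditioner M_diag."""
    noise = np.sqrt(2 * gamma * T * h) * np.sqrt(M_diag) * rng.standard_normal(theta.shape)
    m = m - h * grad_U_hat - h * gamma * m + noise
    theta = theta + h * m / M_diag
    return theta, m

def cyclical_step_sizes(h_max, steps_per_cycle):
    """Cosine schedule decaying from h_max towards zero within each cycle;
    chain samples would only be stored at the end of a cycle."""
    k = np.arange(steps_per_cycle)
    return 0.5 * h_max * (1 + np.cos(np.pi * k / steps_per_cycle))
```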
## Hypothesis: Inaccurate SDE simulation

The authors propose to use the fact that under the true continuous-time Langevin dynamic at temperature $T$ the auxiliary momentum variables have a known stationary distribution, $m \sim \mathcal{N}(0, T M)$, to define a series of diagnostics that test whether, under the discretised dynamic, the generated chain samples produce empirical expectations consistent with the known true expectations under the stationary distribution. Specifically they consider the test functions $h(m, \theta) = P^{-1} m^{\mathsf{T}} M^{-1} m$ and $h(m, \theta) = P^{-1} \theta^{\mathsf{T}} \nabla U_{\mathcal{D}}(\theta)$, both with true expected value $T$. Through a series of ablation studies they confirm that both the layerwise preconditioning and the cyclical time-stepping improve simulation accuracy under these tests.
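These diagnostics are cheap to compute from stored chain samples. A minimal sketch, assuming a diagonal preconditioner and arrays of sampled momenta, parameters and energy gradients (variable names are mine):

```python
import numpy as np

def kinetic_temperature(m_samples, M_diag):
    """Per-sample estimate of T from h(m, theta) = m^T M^{-1} m / P;
    m_samples has shape (S, P), M_diag is the preconditioner diagonal."""
    P = m_samples.shape[1]
    return (m_samples**2 / M_diag).sum(axis=1) / P

def configurational_temperature(theta_samples, grad_U_samples):
    """Per-sample estimate of T from h(m, theta) = theta^T grad U(theta) / P."""
    P = theta_samples.shape[1]
    return (theta_samples * grad_U_samples).sum(axis=1) / P

# Both estimates should scatter around the target temperature T if the
# discretised dynamic is simulating the temperized posterior accurately.
```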
## Hypothesis: Biased SG-MCMC

To test this hypothesis the authors compare their SG-MCMC method to what they consider a 'gold-standard' MCMC method, namely Hamiltonian Monte Carlo with a fixed integration time. For these experiments they use an idealised simulated-data setup, in which the model used for inference is exactly the same as the model used to generate the data. Specifically they use small densely connected neural networks (multilayer perceptrons) for $f: \mathcal{X} \times \Theta \to \mathbb{R}^3$, with between one and three hidden layers of 10 units, ReLU activations, 5-dimensional inputs $\mathcal{X} = \mathbb{R}^5$ and simulated one-hot outputs with $C=3$. A training set of $N=100$ data points is generated as

$$ \theta \sim p(\cdot), \qquad x_n \sim \mathcal{N}(0, \mathbb{I}_5),\quad y_n \sim \mathrm{Multinomial}(\mathrm{softmax}(f(x_n,\theta))) \quad \forall n \in 1{:}N, $$

with the prior on the parameters specified as $\theta_p \sim \mathcal{N}(0, \sigma_p^2)$, with standard deviation $\sigma_p = 0.05$ for parameter indices corresponding to biases, and a 'He'-scaled $\sigma_p = \sqrt{\frac{2}{\text{fan\_in}}}$ for the parameter indices corresponding to weights, where $\text{fan\_in}$ is the number of units feeding into the relevant layer.

Somewhat confusingly, the authors don't initialise the chains at the true parameters used to generate the data (which represent an exact sample from the posterior, thus eliminating the need to burn in and giving unbiased MCMC estimates), but they do perform some diagnostics to assess how many chain iterations to treat as burn-in, with 5000 burn-in samples found to be sufficient to give stable convergence diagnostics.

For this simulated-data setting both SG-MCMC and HMC **do not** show a cold posterior effect in the test accuracy or test cross-entropy metrics, with the best performance on both metrics for both MCMC methods occurring at $T=1$. They conclude from this that the lack of an accept-reject step in SG-MCMC is not the cause of the cold posterior phenomenon; while their experiments support this for this specific model and synthetic-data setting, it is not entirely clear that the conclusion directly carries over to more realistic models and datasets. A fairly exhaustive grid search of HMC hyperparameters is used, namely step sizes $\epsilon \in \lbrace 0.001, 0.01, 0.1 \rbrace$ and numbers of leapfrog steps per sample $L \in \lbrace 5, 10, 100 \rbrace$, with similar results seen across all hyperparameters tested.

<!-- ## Hypothesis: Minibatch noise -->

<!-- ## Hypothesis: Bias-variance tradeoff -->

## Hypothesis: Dirty likelihood

The authors provide some discussion of the problems inherent in using techniques such as batch normalisation and data augmentation in the neural network model used in the CIFAR-10 example, and show that these methods can be considered as introducing extra auxiliary variables into the model which are not correctly integrated over to give the marginal posterior on $\theta$ when performing approximate inference. They argue that the resulting log-posterior values used can be considered an unbiased estimate of a lower bound on the true marginal log-posterior on $\theta$.

As well as the existing results on the IMDB dataset, which already use a 'clean' likelihood and still show the cold posterior effect, the authors provide further evidence against this hypothesis for the CIFAR-10 case by giving results for an alternative 'clean' model without batch normalisation or data augmentation, which shows the same cold posterior effect.

<!-- ## Hypothesis: Bad prior -->

<!-- ## Hypothesis: Implicit initialisation prior in SGD -->