This paper discusses how BNNs behave as we scale to large models. It examines the "soap-bubble" issue that arises in high-dimensional probability spaces and how MFVI suffers from it. To tackle this issue, the authors propose a new variational posterior approximation in a hyperspherical coordinate system and show that sampling from this posterior overcomes the soap-bubble issue.
One of the properties of high-dimensional spaces is that there is much more volume outside any given neighbourhood than inside it. Betancourt explained this behaviour visually with two intuitive examples. For the first example, consider partitioning the parameter space into equal rectangular intervals, as shown below.
We can observe similar behaviour if we take a spherical view of the parameter space, where the exterior volume grows even faster than the interior in high-dimensional spaces, as shown in the figure.
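To make the rectangular example concrete, the short sketch below (my own illustration of the volume argument, not code from the paper) computes the fraction of a unit hypercube's volume contained in a centred inner cube that leaves a 10% margin on every side:

```python
# Fraction of a unit hypercube's volume in a centred inner cube that leaves
# a 10% margin on every side: (0.8)**d, which vanishes as d grows.
for d in (1, 2, 10, 100):
    inner = 0.8 ** d
    print(f"d={d:3d}  inner volume fraction = {inner:.3e}")
```

Already at $d = 100$ the inner cube holds a vanishingly small fraction of the volume; almost everything lies in the thin outer shell.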
How does this intuition about volumes in high dimensions explain the soap-bubble phenomenon?
Even though the density is highest around the mean, most of the volume lies away from it, so the effective probability mass ends up, in most cases, away from the mean.
From these two intuitive examples, the immediate consequence is that samples drawn from such high-dimensional distributions land far away from the mean, which can result in high variance of gradient estimates.
From the above geometric understanding of the soap bubble, we can see that as the number of dimensions increases, the volume outside a given neighbourhood (say, around the mode) of any geometric shape (a rectangle/cube or circle/sphere and their higher-dimensional variants) dominates the volume around the mode. The neighbourhood around the mode has high density but little volume, while the exterior has large volume but low density; at both extremes (high-volume low-density and low-volume high-density) the probability mass is negligible. In between lies a region, referred to as the "typical set", where density and volume are both appreciable. As the number of parameters increases, this set becomes narrower and moves further away from the centre. This is called the soap-bubble phenomenon.
Now coming to the mathematical understanding, consider a standard Gaussian $\mathcal{N}(0, I_d)$ in $d$ dimensions and a thin shell of thickness $\epsilon$ at radius $r$ from the mean. The probability mass in the shell is approximately

$$p(r)\,\epsilon \;\propto\; r^{d-1} e^{-r^2/2}\,\epsilon,$$

where $r^{d-1}$ comes from the volume of the shell and $e^{-r^2/2}$ from the density. As in the geometric intuition, we see a similar density-volume tradeoff. To be precise, for smaller $r$ the density is high but the volume factor $r^{d-1}$ is negligible, while for larger $r$ the volume is large but the density decays exponentially. The product peaks sharply around $r \approx \sqrt{d}$, so almost all of the mass concentrates in a thin shell at distance $\sqrt{d}$ from the mean.
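This concentration is easy to verify numerically. The sketch below (my own illustration, with assumed variable names) samples from a standard Gaussian and checks that the norms of the samples cluster in a thin shell around $\sqrt{d}$:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    # 10,000 samples from a standard d-dimensional Gaussian
    norms = np.linalg.norm(rng.standard_normal((10_000, d)), axis=1)
    # the norms concentrate in a thin shell around sqrt(d)
    print(f"d={d:6d}  mean norm={norms.mean():8.2f}  "
          f"sqrt(d)={np.sqrt(d):8.2f}  std={norms.std():.3f}")
```

Even though the mode of the joint density sits at the origin, essentially no samples land near it once $d$ is large: the shell's standard deviation stays roughly constant while its radius grows as $\sqrt{d}$.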
Mean-field approximation is one of the most common techniques in variational Bayes for approximating the posterior distribution over the unknown latent variables. In MFVI, we simplify the approximation problem by assuming some structure on the latent variables. To be precise, we assume they are independent, so the posterior approximation factorizes as

$$q(\mathbf{z}) = \prod_{i=1}^{m} q_i(z_i),$$

where each factor $q_i$ is a distribution over the single latent variable $z_i$.
The Evidence Lower Bound (ELBO) objective is derived by applying Jensen's inequality to the log marginal likelihood of the observations.
For a concave function $f$, Jensen's inequality states that $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$. We can use this to derive the ELBO. Since $\log$ is concave,

$$\log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z})\, d\mathbf{z} = \log \mathbb{E}_{q}\!\left[\frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})}\right] \geq \mathbb{E}_{q}\!\left[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z})\right] =: \mathrm{ELBO}(q).$$
The independence assumption on the latent variables lets the ELBO decompose over the factors $q_i$, so each factor can be handled separately. We optimize this ELBO objective using coordinate ascent (where we update each factor $q_j$ in turn while holding the others fixed, until the ELBO converges).
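As a minimal illustration of coordinate ascent (not from the paper), the sketch below runs the mean-field updates for the classic conjugate model $x_i \sim \mathcal{N}(\mu, \tau^{-1})$ with a Normal-Gamma prior and factorized posterior $q(\mu)\,q(\tau)$; the update equations follow Bishop's PRML (Sec. 10.1.3), and all function and variable names are my own:

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Coordinate-ascent VI for x_i ~ N(mu, 1/tau), prior mu|tau ~ N(mu0, 1/(lam0*tau)),
    tau ~ Gamma(a0, b0). Posterior q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n);
    each factor is updated in turn while the other is held fixed."""
    n, xbar = len(x), np.mean(x)
    e_tau = a0 / b0                      # initial guess for E[tau]
    a_n = a0 + (n + 1) / 2               # shape update is fixed across iterations
    for _ in range(iters):
        # update q(mu) given the current E[tau]
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # update q(tau) given the current q(mu)
        e_mu, e_mu2 = mu_n, mu_n**2 + 1.0 / lam_n
        sq = np.sum(x**2) - 2 * e_mu * np.sum(x) + n * e_mu2
        b_n = b0 + 0.5 * (sq + lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0**2))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n
```

On data drawn from $\mathcal{N}(2, 1)$, the updates converge in a few iterations to $\mathbb{E}_q[\mu] \approx 2$ and $\mathbb{E}_q[\tau] = a_n / b_n \approx 1$.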
In this paper, the posterior distribution over the weights is approximated by a multivariate Gaussian distribution. As discussed in the previous section, in high-dimensional settings the samples from this multivariate Gaussian become unrepresentative of the most probable weights, since the volume, and hence the probability mass, is scattered away from the mean. Due to the large distances between samples from the posterior distribution, the gradient of the log-likelihood can suffer from high variance.
The key idea of this paper is to choose a posterior distribution that does not exhibit this "soap-bubble" property. To that end, the paper proposes sampling from a simple and practical hyperspherical distribution, where the direction and the radius of a sample are drawn separately: the direction uniformly on the unit hypersphere, and the radius from a one-dimensional Gaussian.
Sampling from this approximate posterior is not costly, since there is no need for explicit coordinate transformations, and it is almost equivalent to MFVI in terms of computational cost. A weight sample can be written as

$$\mathbf{w} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \frac{\boldsymbol{\epsilon}}{\|\boldsymbol{\epsilon}\|} \cdot r, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I), \quad r \sim \mathcal{N}(0, 1),$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the variational mean and scale parameters. The term $\boldsymbol{\epsilon}/\|\boldsymbol{\epsilon}\|$ is a sample from the uniform distribution on the unit hypersphere, and the scalar $r$ sets the sampled radius, so the distance from the mean no longer grows with the dimension.
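A minimal sketch of this hyperspherical sampler (my own function and variable names; $\mu$ and $\sigma$ are assumed to be the variational parameters), which also checks numerically that the sampled distances from the mean stay small regardless of the dimension:

```python
import numpy as np

def radial_sample(mu, sigma, rng):
    """One sample from a hyperspherical ('radial') posterior sketch:
    direction uniform on the unit sphere, radius drawn from a 1-D Gaussian."""
    eps = rng.standard_normal(mu.shape)
    direction = eps / np.linalg.norm(eps)   # uniform direction on the unit sphere
    r = rng.standard_normal()               # scalar radius ~ N(0, 1)
    return mu + sigma * direction * r

rng = np.random.default_rng(0)
d = 10_000
mu, sigma = np.zeros(d), np.ones(d)
radial = [np.linalg.norm(radial_sample(mu, sigma, rng) - mu) for _ in range(1000)]
mfvi = np.linalg.norm(rng.standard_normal((1000, d)), axis=1)
print(f"radial mean distance from mean: {np.mean(radial):6.2f}")
print(f"MFVI   mean distance from mean: {mfvi.mean():6.2f}")
```

With $d = 10{,}000$, the MFVI samples sit at distance $\approx \sqrt{d} = 100$ from the mean, while the radial samples stay at distance $|r|$, of order one: no soap bubble.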
The Evidence Lower Bound (ELBO) is used as the objective for variational inference, where the expected log-likelihood term is estimated using mini-batches of data and MC integration. The KL divergence between the approximate posterior and the prior can be written as

$$\mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w})\big) = \underbrace{-\mathbb{E}_{q}[\log p(\mathbf{w})]}_{\text{cross-entropy}} \;-\; \underbrace{\big(-\mathbb{E}_{q}[\log q(\mathbf{w})]\big)}_{\text{entropy}}.$$

The cross-entropy term is simply calculated by MC estimation, as the average of the log-probability of the posterior samples under the prior \citep{blundell2015weight}. The entropy term can be computed analytically for this posterior: up to an additive constant, it equals $\sum_i \log \sigma_i$.
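A sketch of the MC cross-entropy estimate (assuming an isotropic Gaussian prior; the function and parameter names are my own):

```python
import numpy as np

def cross_entropy_mc(samples, prior_sigma=1.0):
    """MC estimate of the cross-entropy  -E_q[log p(w)]  under an isotropic
    Gaussian prior p(w) = N(0, prior_sigma^2 I): average the negative log
    prior density over posterior samples of shape (n_samples, d)."""
    d = samples.shape[1]
    log_p = (-0.5 * np.sum(samples**2, axis=1) / prior_sigma**2
             - 0.5 * d * np.log(2 * np.pi * prior_sigma**2))
    return -np.mean(log_p)
```

As a sanity check, if the posterior samples are drawn from the prior itself, the estimate approaches the prior's differential entropy, $\tfrac{d}{2}\log(2\pi e)$.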
The detailed derivation is provided in the appendix of the paper.
Datasets:
Findings: