This paper discusses how BNNs behave as we scale to large models. It examines the "soap-bubble" issue that arises in high-dimensional probability spaces and how MFVI suffers from it. To tackle this issue, the authors propose a new variational posterior approximation in a hyperspherical coordinate system and show that sampling from this posterior overcomes the soap-bubble issue.
One of the properties of high-dimensional spaces is that there is much more volume outside any given neighbourhood than inside it. Betancourt explained this behaviour visually with two intuitive examples. For the first example, consider partitioning the parameter space into equal rectangular intervals, as shown below.
We can observe similar behaviour if we take a spherical view of the parameter space, where the exterior volume grows even faster than the interior in high-dimensional spaces, as shown in the figure.
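To make the rectangular example concrete, the short sketch below (my own illustration of the volume argument, not code from the paper) computes the fraction of a unit hypercube's volume contained in a centred inner cube that leaves a 10% margin on every side:

```python
# Fraction of a unit hypercube's volume in a centred inner cube that leaves
# a 10% margin on every side: (0.8)**d, which vanishes as d grows.
for d in (1, 2, 10, 100):
    inner = 0.8 ** d
    print(f"d={d:3d}  inner volume fraction = {inner:.3e}")
```

Already at $d = 100$ the inner cube holds a vanishingly small fraction of the volume; almost everything lies in the thin outer shell.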
How does this intuition about volumes in high dimensions explain the soap-bubble phenomenon?
Even though the density is highest around the mean, most of the volume lies away from it, so the effective probability mass ends up, in most cases, away from the mean.
From these two intuitive examples, the immediate consequence is that samples drawn from such high-dimensional distributions land far away from the mean, which can result in high variance of gradient estimates.
From the above geometric understanding of the soap bubble, we can see that as the number of dimensions increases, the volume outside a given neighbourhood (say, around the mode) of any geometric shape (a rectangle/cube or circle/sphere and their higher-dimensional variants) dominates the volume around the mode. The neighbourhood around the mode has high density but little volume, while the exterior has large volume but low density; at both extremes (high-volume low-density and low-volume high-density) the probability mass is negligible. In between lies a region, referred to as the "typical set", where density and volume are both appreciable. As the number of parameters increases, this set becomes narrower and moves further away from the centre. This is called the soap-bubble phenomenon.
Now coming to the mathematical understanding, consider a standard Gaussian $\mathcal{N}(0, I_d)$ in $d$ dimensions and a thin shell of thickness $\epsilon$ at radius $r$ from the mean. The probability mass in the shell is approximately

$$p(r)\,\epsilon \;\propto\; r^{d-1} e^{-r^2/2}\,\epsilon,$$

where $r^{d-1}$ comes from the volume of the shell and $e^{-r^2/2}$ from the density. As in the geometric intuition, we see a similar density-volume tradeoff. To be precise, for smaller $r$ the density is high but the volume factor $r^{d-1}$ is negligible, while for larger $r$ the volume is large but the density decays exponentially. The product peaks sharply around $r \approx \sqrt{d}$, so almost all of the mass concentrates in a thin shell at distance $\sqrt{d}$ from the mean.
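This concentration is easy to verify numerically. The sketch below (my own illustration, with assumed variable names) samples from a standard Gaussian and checks that the norms of the samples cluster in a thin shell around $\sqrt{d}$:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    # 10,000 samples from a standard d-dimensional Gaussian
    norms = np.linalg.norm(rng.standard_normal((10_000, d)), axis=1)
    # the norms concentrate in a thin shell around sqrt(d)
    print(f"d={d:6d}  mean norm={norms.mean():8.2f}  "
          f"sqrt(d)={np.sqrt(d):8.2f}  std={norms.std():.3f}")
```

Even though the mode of the joint density sits at the origin, essentially no samples land near it once $d$ is large: the shell's standard deviation stays roughly constant while its radius grows as $\sqrt{d}$.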
Mean-field approximation is one of the most common techniques in variational Bayes for approximating the posterior distribution over the unknown latent variables. In MFVI, we simplify the approximation problem by assuming some structure on the latent variables. To be precise, we assume they are independent, so the posterior approximation factorizes as

$$q(\mathbf{z}) = \prod_{i=1}^{m} q_i(z_i),$$

where each factor $q_i$ is a distribution over the single latent variable $z_i$.
The Evidence Lower Bound (ELBO) objective is derived by applying Jensen's inequality to the log marginal likelihood of the observations.
For a concave function $f$, Jensen's inequality states that $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$. We can use this to derive the ELBO. Since $\log$ is concave,

$$\log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z})\, d\mathbf{z} = \log \mathbb{E}_{q}\!\left[\frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})}\right] \geq \mathbb{E}_{q}\!\left[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z})\right] =: \mathrm{ELBO}(q).$$
The independence assumption on the latent variables lets the ELBO decompose over the factors $q_i$, so each factor can be handled separately. We optimize this ELBO objective using coordinate ascent (where we update each factor $q_j$ in turn while holding the others fixed, until the ELBO converges).
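As a minimal illustration of coordinate ascent (not from the paper), the sketch below runs the mean-field updates for the classic conjugate model $x_i \sim \mathcal{N}(\mu, \tau^{-1})$ with a Normal-Gamma prior and factorized posterior $q(\mu)\,q(\tau)$; the update equations follow Bishop's PRML (Sec. 10.1.3), and all function and variable names are my own:

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Coordinate-ascent VI for x_i ~ N(mu, 1/tau), prior mu|tau ~ N(mu0, 1/(lam0*tau)),
    tau ~ Gamma(a0, b0). Posterior q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n);
    each factor is updated in turn while the other is held fixed."""
    n, xbar = len(x), np.mean(x)
    e_tau = a0 / b0                      # initial guess for E[tau]
    a_n = a0 + (n + 1) / 2               # shape update is fixed across iterations
    for _ in range(iters):
        # update q(mu) given the current E[tau]
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # update q(tau) given the current q(mu)
        e_mu, e_mu2 = mu_n, mu_n**2 + 1.0 / lam_n
        sq = np.sum(x**2) - 2 * e_mu * np.sum(x) + n * e_mu2
        b_n = b0 + 0.5 * (sq + lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0**2))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n
```

On data drawn from $\mathcal{N}(2, 1)$, the updates converge in a few iterations to $\mathbb{E}_q[\mu] \approx 2$ and $\mathbb{E}_q[\tau] = a_n / b_n \approx 1$.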
In this paper, the posterior distribution over the weights is approximated by a multivariate Gaussian distribution. As discussed in the previous section, in high-dimensional settings the samples from this multivariate Gaussian become unrepresentative of the most probable weights, since the volume, and hence the probability mass, is scattered away from the mean. Due to the large distances between samples from the posterior distribution, the gradient of the log-likelihood can suffer from high variance.
The key idea of this paper is to choose a posterior distribution that does not exhibit this "soap-bubble" property. To that end, the paper proposes sampling from a simple and practical hyperspherical distribution, where the direction and the radius of a sample are drawn separately: the direction uniformly on the unit hypersphere, and the radius from a one-dimensional Gaussian.
Sampling from this approximate posterior is not costly, since there is no need for explicit coordinate transformations, and it is almost equivalent to MFVI in terms of computational cost. A weight sample can be written as

$$\mathbf{w} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \frac{\boldsymbol{\epsilon}}{\|\boldsymbol{\epsilon}\|} \cdot r, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I), \quad r \sim \mathcal{N}(0, 1),$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the variational mean and scale parameters. The term $\boldsymbol{\epsilon}/\|\boldsymbol{\epsilon}\|$ is a sample from the uniform distribution on the unit hypersphere, and the scalar $r$ sets the sampled radius, so the distance from the mean no longer grows with the dimension.
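A minimal sketch of this hyperspherical sampler (my own function and variable names; $\mu$ and $\sigma$ are assumed to be the variational parameters), which also checks numerically that the sampled distances from the mean stay small regardless of the dimension:

```python
import numpy as np

def radial_sample(mu, sigma, rng):
    """One sample from a hyperspherical ('radial') posterior sketch:
    direction uniform on the unit sphere, radius drawn from a 1-D Gaussian."""
    eps = rng.standard_normal(mu.shape)
    direction = eps / np.linalg.norm(eps)   # uniform direction on the unit sphere
    r = rng.standard_normal()               # scalar radius ~ N(0, 1)
    return mu + sigma * direction * r

rng = np.random.default_rng(0)
d = 10_000
mu, sigma = np.zeros(d), np.ones(d)
radial = [np.linalg.norm(radial_sample(mu, sigma, rng) - mu) for _ in range(1000)]
mfvi = np.linalg.norm(rng.standard_normal((1000, d)), axis=1)
print(f"radial mean distance from mean: {np.mean(radial):6.2f}")
print(f"MFVI   mean distance from mean: {mfvi.mean():6.2f}")
```

With $d = 10{,}000$, the MFVI samples sit at distance $\approx \sqrt{d} = 100$ from the mean, while the radial samples stay at distance $|r|$, of order one: no soap bubble.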
The Evidence Lower Bound (ELBO) is used as the objective for variational inference, where the expected log-likelihood term is estimated using mini-batches of data and MC integration. The KL divergence between the approximate posterior and the prior can be written as

$$\mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w})\big) = \underbrace{-\mathbb{E}_{q}[\log p(\mathbf{w})]}_{\text{cross-entropy}} \;-\; \underbrace{\big(-\mathbb{E}_{q}[\log q(\mathbf{w})]\big)}_{\text{entropy}}.$$

The cross-entropy term is simply calculated by MC estimation, as the average of the log-probability of the posterior samples under the prior \citep{blundell2015weight}. The entropy term can be computed analytically for this posterior: up to an additive constant, it equals $\sum_i \log \sigma_i$.
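A sketch of the MC cross-entropy estimate (assuming an isotropic Gaussian prior; the function and parameter names are my own):

```python
import numpy as np

def cross_entropy_mc(samples, prior_sigma=1.0):
    """MC estimate of the cross-entropy  -E_q[log p(w)]  under an isotropic
    Gaussian prior p(w) = N(0, prior_sigma^2 I): average the negative log
    prior density over posterior samples of shape (n_samples, d)."""
    d = samples.shape[1]
    log_p = (-0.5 * np.sum(samples**2, axis=1) / prior_sigma**2
             - 0.5 * d * np.log(2 * np.pi * prior_sigma**2))
    return -np.mean(log_p)
```

As a sanity check, if the posterior samples are drawn from the prior itself, the estimate approaches the prior's differential entropy, $\tfrac{d}{2}\log(2\pi e)$.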
The detailed derivation is provided in the appendix of the paper.
Datasets:
Findings: