# Radial Bayesian Neural Networks
#### Author: [Sharath](https://sharathraparthy.github.io/)
## Overview
This paper studies how Bayesian neural networks (BNNs) behave as models scale up. It describes the "soap-bubble" issue that arises in high-dimensional probability spaces and explains how mean-field variational inference (MFVI) suffers from it. To tackle the issue, the authors propose a new variational posterior approximation defined in a hyperspherical coordinate system and show that samples from this posterior avoid the soap-bubble problem.
## The geometry of high dimensional spaces
One property of high-dimensional spaces is that there is much more volume outside any given neighbourhood than inside it. [Betancourt](https://arxiv.org/pdf/1701.02434.pdf) explains this behaviour visually with two intuitive examples. For the first example, consider partitioning the parameter space into equal rectangular intervals, as shown below. ![](https://i.imgur.com/Qxb4oxp.png) As the number of dimensions increases, the fraction of the total volume that lies around the center decreases. When $D$ is very large, it becomes almost negligible compared to the volume of the surrounding neighbourhood.
We observe similar behaviour if we take a spherical view of the parameter space: in high dimensions the exterior volume grows far larger than the interior, as shown in the figure. ![](https://i.imgur.com/ypXl7zG.png)
*How does this intuition about volume in high dimensions explain the soap-bubble phenomenon?*
Even though the density is highest around the mean, most of the volume lies away from the mean, so in most cases the bulk of the probability mass also ends up away from the mean.
The immediate consequence of these two examples is that samples drawn from such high-dimensional distributions lie far from the mean, which can result in high variance of the gradient estimates.
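The two volume arguments above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the choice of three cells per axis and a shell width of 5% of the radius are assumptions made for the example, not values taken from the paper or from Betancourt's article.

```python
def central_cell_fraction(d: int, cells_per_axis: int = 3) -> float:
    """Fraction of a hypercube taken up by the single central cell when
    every axis is split into `cells_per_axis` equal intervals."""
    return 1.0 / cells_per_axis ** d


def outer_shell_fraction(d: int, shell_width: float = 0.05) -> float:
    """Fraction of a unit d-ball's volume lying within `shell_width` of its surface.

    The volume of a ball scales as r**d, so the inner ball of radius
    (1 - shell_width) holds a fraction (1 - shell_width)**d of the total.
    """
    return 1.0 - (1.0 - shell_width) ** d


for d in (1, 2, 10, 100):
    print(d, central_cell_fraction(d), outer_shell_fraction(d))
```

As $D$ grows, the central cell's share shrinks geometrically while the thin outer shell absorbs almost all of the ball's volume.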
## Soap bubble Issue
### Geometric Intuition
From the geometric picture above, we can see that as the number of dimensions increases, the volume outside a given neighbourhood (say, around the mode) of any geometric shape (a rectangle/cube or a circle/sphere and their high-dimensional variants) dominates the volume around the mode. The neighbourhood around the mode, on the other hand, has high density. At these two extremes (high volume with low density, and low volume with high density) the probability mass is negligible. In between lies a region, referred to as the "typical set", where density and volume are both large enough for their product to be significant. As the number of parameters increases, this region becomes narrower and moves further away from the center. This is called the soap-bubble phenomenon.
### Mathematical Intuition
For the mathematical picture, consider a sample $\mathbf{w} \sim \mathcal{N}(0, \mathbf{I})$ in $d$ dimensions (so the mean is $\mu = 0$) and a thin shell of thickness $\eta$ at distance $r$ from the mean. The distance $||\mathbf{w}||$ between the sample and the mean is a $\chi_d$ random variable, and the probability density of finding it in this thin shell is
\begin{align*}
\lim_{\eta \to 0} \frac{1}{\eta}\, p(r \leq ||\mathbf{w}|| \leq r + \eta) &= (2\pi)^{-\frac{d}{2}} e^{-\frac{1}{2} r^2} \cdot S_d(r) \\
S_d(r) &= \frac{2\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})} r^{d-1}
\end{align*}
where $S_d(r)$ is the surface area of the hypersphere of radius $r$ in $d$ dimensions.
As in the geometric intuition, we see a density-volume tradeoff. For small $r$ and large $d$ (the high-dimensional regime), the factor $r^{d-1}$ drives the probability to zero. Similarly, for large $r$, the term $e^{-\frac{1}{2} r^2}$ makes the probability almost zero. The region where the density of the norm is maximal, which lies away from the mode, is called the soap bubble.
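To make the tradeoff concrete, the following sketch evaluates the density of the norm of a standard Gaussian sample (the $\chi_d$ density) on a grid and locates its peak. Setting the derivative of $(d-1)\log r - \frac{1}{2}r^2$ to zero gives a maximum at $r = \sqrt{d-1}$, so the bubble moves outwards as $d$ grows. The function names and the grid resolution are choices made for this illustration.

```python
from math import lgamma, log

import numpy as np


def log_radial_density(r, d):
    """Log density of ||w|| for w ~ N(0, I_d), i.e. the chi distribution
    with d degrees of freedom."""
    r = np.asarray(r, dtype=float)
    log_norm = (1.0 - d / 2.0) * log(2.0) - lgamma(d / 2.0)
    return log_norm + (d - 1) * np.log(r) - 0.5 * r ** 2


def radial_mode(d, grid_size=200_001):
    """Locate the peak of the radial density on a dense grid."""
    r = np.linspace(1e-6, 2.0 * np.sqrt(d) + 5.0, grid_size)
    return float(r[np.argmax(log_radial_density(r, d))])


for d in (2, 10, 100, 1000):
    # Numerical peak next to the analytical value sqrt(d - 1).
    print(d, radial_mode(d), np.sqrt(d - 1))
```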
### Mean Field Variational Inference (MFVI)
The mean-field approximation is one of the most common techniques in variational Bayes for approximating the posterior distribution over unknown latent variables. In MFVI, we simplify the approximation problem by assuming some structure in the latent variables. To be precise, we assume that they are independent, so the posterior approximation is given by
$$
p(\theta \mid X) \approx q(\theta) = \prod_{i=1}^{N} q(\theta_i)
$$
where $q(\theta_i)$ is the variational approximation of a single latent variable. We can also group multiple latent variables together and make the independence assumption across the groups only. This is called the generalised mean-field approximation.
The Evidence Lower Bound (ELBO) objective is derived by applying Jensen's inequality to the log marginal likelihood of the observations.
For a concave function $f(X)$, Jensen's inequality lets us move the expectation from inside the function to outside it, at the cost of an inequality:
\begin{equation}
f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]
\end{equation}
We can now use this to derive the ELBO. Since $\log(\cdot)$ is a concave function, we can bound the log marginal likelihood as
\begin{equation}
\begin{split}
\log(p(X)) &= \log\left(\int p(X, \theta)\, d\theta\right)\\
&= \log \left(\int p(X, \theta) \frac{q(\theta)}{q(\theta)}\, d\theta\right)\\
&= \log \left(\mathbb{E}_q\left[\frac{p(X, \theta)}{q(\theta)} \right] \right)\\
&\geq \mathbb{E}_q\left[\log\left(\frac{p(X, \theta)}{q(\theta)} \right)\right] \\
&= \mathbb{E}_q\left[\log\left(p(X, \theta) \right)\right] - \mathbb{E}_q \left[ \log(q(\theta)) \right]
\end{split}
\end{equation}
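The bound can be verified on a model where everything is available in closed form. The sketch below assumes a toy conjugate model that is not from the paper: $\theta \sim \mathcal{N}(0, 1)$ and $X \mid \theta \sim \mathcal{N}(\theta, 1)$, with a Gaussian $q(\theta) = \mathcal{N}(m, s^2)$. In this model the marginal is $X \sim \mathcal{N}(0, 2)$ and the exact posterior is $\mathcal{N}(X/2, 1/2)$, so the ELBO should equal $\log p(X)$ at that choice of $q$ and fall below it everywhere else.

```python
import math


def log_evidence(x: float) -> float:
    """log p(x) for the toy model theta ~ N(0, 1), x | theta ~ N(theta, 1),
    whose marginal is x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0


def elbo(x: float, m: float, s: float) -> float:
    """ELBO for q(theta) = N(m, s^2), using closed-form Gaussian expectations."""
    # E_q[log p(x | theta)]
    exp_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    # E_q[log p(theta)]
    exp_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (m ** 2 + s ** 2)
    # -E_q[log q(theta)], the entropy of q
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s ** 2)
    return exp_log_lik + exp_log_prior + entropy


x = 1.0
print(log_evidence(x), elbo(x, x / 2.0, math.sqrt(0.5)), elbo(x, 0.0, 1.0))
```

The gap between $\log p(X)$ and the ELBO is the KL divergence from $q$ to the true posterior, which is why the bound is tight only when $q$ matches the posterior.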
The independence assumption on the factors $q(\theta_j)$ simplifies the ELBO considerably. The objective becomes
\begin{equation}
\begin{split}
\mathcal{L} &= \mathbb{E}_q\left[\log\left(p(X_{1:n}, \theta_{1:N}) \right)\right] - \sum_{j=1}^{N}\mathbb{E}_{q_{j}} \left[ \log(q(\theta_{j})) \right]\\
&= \log(p(X_{1:n})) + \sum_{j=1}^{N} \left( \mathbb{E}_{q} [\log (p(\theta_j \mid \theta_{1:(j-1)}, X_{1:n}))] - \mathbb{E}_{q_{j}} \left[ \log(q(\theta_{j})) \right] \right)
\end{split}
\end{equation}
We optimize this ELBO objective using coordinate ascent: we update one factor $q(\theta_j)$ at a time while keeping all the other factors fixed.
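As a minimal illustration of coordinate ascent, the sketch below fits a factorised Gaussian to a correlated two-dimensional Gaussian that stands in for the posterior. This toy target is an assumption made for the example and is not the BNN setting of the paper. For a Gaussian target with precision matrix $\Lambda$, each optimal factor is Gaussian with variance $1/\Lambda_{jj}$, so only the means need to be iterated.

```python
import numpy as np


def cavi_gaussian(mu, cov, n_iters=50):
    """Mean-field coordinate ascent for a 2-D Gaussian target N(mu, cov).

    Returns the means and variances of the two independent Gaussian factors.
    """
    mu = np.asarray(mu, dtype=float)
    prec = np.linalg.inv(np.asarray(cov, dtype=float))
    m = np.zeros(2)  # initial factor means
    for _ in range(n_iters):
        # Update q_1 with q_2 frozen, then q_2 with q_1 frozen.
        m[0] = mu[0] - prec[0, 1] / prec[0, 0] * (m[1] - mu[1])
        m[1] = mu[1] - prec[1, 0] / prec[1, 1] * (m[0] - mu[0])
    variances = 1.0 / np.diag(prec)
    return m, variances


means, variances = cavi_gaussian(mu=[1.0, -1.0], cov=[[1.0, 0.8], [0.8, 1.0]])
print(means, variances)
```

The factor means converge to the true mean, while the factor variances come out smaller than the true marginal variances, a well-known side effect of the independence assumption.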
### Soap Bubble Issue in MFVI
In MFVI for BNNs, the posterior distribution over the weights is approximated by a multivariate Gaussian with independent components. As discussed in the previous section, in high dimensions the samples from this multivariate Gaussian become unrepresentative of the most probable weights, because the volume, and hence the probability mass, is concentrated away from the mean. Because samples from the approximate posterior lie far apart from each other, the gradient of the log-likelihood can suffer from high variance.
## Radial BNN
The key idea of the paper is to choose an approximate posterior that does not exhibit the soap-bubble property. The paper proposes to sample from a simple and practical distribution defined in hyperspherical coordinates, where
1. In the radial dimension: $r \sim \mathcal{N}(0, 1)$
2. In the angular dimensions: a uniform distribution over the hypersphere.
Sampling from this approximate posterior is cheap: no explicit coordinate transformation is needed, and the computational cost is almost the same as for MFVI.
$$
\mathbf{w}_{\text{Radial}} = \pmb{\mu} + \pmb{\sigma} \odot \frac{\pmb{\epsilon}_{\text{MFVI}}}{||\pmb{\epsilon}_{\text{MFVI}}||} \cdot r
$$
where $\pmb{\epsilon}_{\text{MFVI}} \sim \mathcal{N}(0, \mathbf{I})$.
The term $\frac{\pmb{\epsilon}_{\text{MFVI}}}{||\pmb{\epsilon}_{\text{MFVI}}||}$ is equivalent to a sample from the uniform distribution over the hypersphere, because we divide a standard multivariate Gaussian sample by its norm. The only additional computation compared to MFVI is the normalisation and the multiplication by a scalar Gaussian random variable.
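The sampling rule above can be sketched in NumPy as follows. This is a minimal illustration that treats all weights as one flat vector; how the weights are grouped in a real network (for example per layer) is not specified here and is left as an assumption. The helper names `sample_mfvi` and `sample_radial` are invented for this example.

```python
import numpy as np


def sample_mfvi(mu, sigma, rng):
    """Standard mean-field Gaussian sample: w = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


def sample_radial(mu, sigma, rng):
    """Radial sample: w = mu + sigma * (eps / ||eps||) * r."""
    eps = rng.standard_normal(mu.shape)
    direction = eps / np.linalg.norm(eps)  # uniform direction on the hypersphere
    r = rng.standard_normal()              # scalar radius, r ~ N(0, 1)
    return mu + sigma * direction * r


rng = np.random.default_rng(0)
d = 10_000
mu, sigma = np.zeros(d), np.ones(d)

# Distance of one sample from the mean under each scheme.
mfvi_dist = np.linalg.norm(sample_mfvi(mu, sigma, rng) - mu)
radial_dist = np.linalg.norm(sample_radial(mu, sigma, rng) - mu)
print(mfvi_dist, radial_dist)
```

With unit $\pmb{\sigma}$, the MFVI sample sits at a distance close to $\sqrt{d}$ from the mean, whereas the radial sample sits at distance $|r|$, which does not grow with the dimension.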
### Objective
The Evidence Lower Bound (ELBO) is used as the objective for variational inference. Its expected log-likelihood term is estimated using mini-batches of data and Monte Carlo (MC) integration. The remaining term is the KL divergence between the approximate posterior and the prior, which can be written as
\begin{equation}
\begin{split}
KL(q(\mathbf{w}) || p(\mathbf{w})) &= \int q(\mathbf{w})\log(q(\mathbf{w}))d\mathbf{w}\\
&- \int q(\mathbf{w})\log(p(\mathbf{w}))d\mathbf{w}\\
&= \mathcal{L}_\text{entropy} - \mathcal{L}_\text{cross-entropy}
\end{split}
\end{equation}
The cross-entropy term is estimated by MC: it is simply the average log-probability, under the prior, of samples drawn from the approximate posterior (Blundell et al., 2015). The entropy term, $\mathcal{L}_\text{entropy}$, is given by
$$
\mathcal{L}_\text{entropy} = -\sum_i \log(\sigma_i^{(x)}) + \text{const.}
$$
The detailed derivation is provided in the appendix of the paper.
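The two KL terms can be combined into a small estimator. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the prior is taken to be an isotropic Gaussian $\mathcal{N}(0, \texttt{prior\_std}^2 \mathbf{I})$, the weights are treated as one flat vector, and the additive constant in the entropy term is dropped, so the returned value is only meaningful up to that constant (which is enough for optimisation).

```python
import numpy as np


def radial_kl_estimate(mu, sigma, prior_std=1.0, n_samples=16, rng=None):
    """MC estimate of KL(q || p), up to an additive constant, for a radial
    posterior q and an assumed isotropic Gaussian prior N(0, prior_std^2 I)."""
    rng = np.random.default_rng() if rng is None else rng

    # L_entropy = -sum_i log(sigma_i) + const; the constant is dropped here.
    entropy_term = -np.sum(np.log(sigma))

    # L_cross-entropy: average log prior density of posterior samples.
    log_prior = []
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        w = mu + sigma * eps / np.linalg.norm(eps) * rng.standard_normal()
        log_prior.append(
            -0.5 * np.sum((w / prior_std) ** 2)
            - w.size * np.log(prior_std * np.sqrt(2.0 * np.pi))
        )
    cross_entropy_term = np.mean(log_prior)

    return entropy_term - cross_entropy_term


d = 50
kl_near = radial_kl_estimate(np.zeros(d), np.ones(d), rng=np.random.default_rng(0))
kl_far = radial_kl_estimate(np.full(d, 5.0), np.ones(d), rng=np.random.default_rng(0))
print(kl_near, kl_far)
```

Moving the posterior mean away from the prior mean lowers the cross-entropy term and therefore raises the estimate, as expected of a KL penalty.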
## Results
Datasets:
1. Diabetic Retinopathy dataset
Findings:
1. The radial BNN posterior is the most robust to hyperparameter variation.
2. As expected, the variance of the gradient estimates in MFVI explodes as the variance of the weights grows. This is not observed for the radial BNN.
3. As training is carried out for more epochs, the performance of MFVI degrades, whereas the performance of the radial BNN continues to improve.
## [Link to the paper](https://arxiv.org/pdf/1907.00865.pdf)

