# Sharpness avoidance of variational optimization

Let's consider a deterministic, non-convex objective $\ell(\theta)$. Variational Optimization (VO) over this objective minimises

$$
\mathcal{L}(\theta, \Sigma) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \Sigma)}\ell(\theta + \epsilon).
$$

Let's focus on the case when $\Sigma = \sigma^2 I$ for a fixed $\sigma$, thus

$$
\ell_{VO}(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\ell(\theta + \epsilon).
$$

## Sharpness penalty

One can rewrite the VO objective as the original loss plus a term that can be interpreted as a sharpness penalty:

$$
\ell_{VO}(\theta) = \ell(\theta) + \underbrace{\mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\ell(\theta + \epsilon) - \ell(\theta)}_{\text{sharpness penalty}}.
$$

We can see how this penalty relates to the curvature of the loss by using a 2nd order Taylor expansion around $\theta$. Writing $g(\theta)$ for the gradient and $H(\theta)$ for the Hessian, the linear term vanishes in expectation and the quadratic term contributes the trace of the Hessian:

\begin{align}
\ell_{VO}(\theta) &\approx \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)} \left[\ell(\theta) + \epsilon^T g(\theta) + \frac{1}{2}\epsilon^T H(\theta)\epsilon\right]\\
&= \ell(\theta) + \frac{\sigma^2 \operatorname{tr} H(\theta)}{2}.
\end{align}

## Inductive bias due to stochasticity

The VO objective is not typically available in closed form, so in practice one uses a 1-sample Monte Carlo estimate based on the reparametrization trick, yielding the stochastic objective

$$
\ell_{VO}(\theta) \approx \ell(\theta + \sigma \epsilon), \quad \epsilon \sim \mathcal{N}(0, I).
$$

Running stochastic gradient descent with gradients computed this way exhibits an inductive bias which, in this case, further reinforces sharpness avoidance. We can follow the backwards error analysis of [Smith et al](https://arxiv.org/abs/2101.12176) to see that SGD on the 1-sample Monte Carlo estimates of the gradient displays an additional implicit regularisation towards wider minima. In Smith et al, this implicit regularisation is expressed in terms of the trace of the covariance matrix of the stochastic gradients. In the 1-sample VO case, this trace can be approximated by a 1st order Taylor expansion of the noisy gradient, $g(\theta + \sigma\epsilon) \approx g(\theta) + \sigma H(\theta)\epsilon$, giving

$$
\operatorname{tr}\mathbb{V}_{\epsilon}\left[g(\theta + \sigma \epsilon)\right] \approx \sigma^2\|H(\theta)\|^2_F.
$$

Therefore, we can see that the added stochasticity of the gradients may further reinforce the sharpness avoidance of VO.
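
As a quick sanity check on the Taylor-expansion argument, here is a minimal numerical sketch (illustrative code, not part of the original derivation) comparing a Monte Carlo estimate of $\ell_{VO}$ with the prediction $\ell(\theta) + \frac{\sigma^2}{2}\operatorname{tr} H(\theta)$ on a toy non-convex loss $\ell(\theta) = \sum_i \cos(\theta_i)$, whose Hessian is $\operatorname{diag}(-\cos\theta_i)$. All function names and the choice of test loss are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: compare the Monte Carlo estimate of the VO objective
#   E_{eps ~ N(0, sigma^2 I)} l(theta + eps)
# with the second-order prediction  l(theta) + (sigma^2 / 2) * tr H(theta)
# on the toy loss l(theta) = sum_i cos(theta_i), where H(theta) = diag(-cos(theta_i)).

rng = np.random.default_rng(0)

def loss(theta):
    return np.cos(theta).sum()

def hessian_trace(theta):
    return (-np.cos(theta)).sum()

d, sigma = 10, 0.05
theta = rng.normal(size=d)

# Monte Carlo estimate of the VO objective with many perturbation samples.
eps = sigma * rng.normal(size=(200_000, d))
vo_mc = np.cos(theta + eps).sum(axis=1).mean()

# Second-order prediction: the gradient term averages to zero,
# the quadratic term contributes (sigma^2 / 2) * tr H(theta).
vo_taylor = loss(theta) + 0.5 * sigma**2 * hessian_trace(theta)

print(f"MC estimate of l_VO : {vo_mc:.6f}")
print(f"Taylor prediction   : {vo_taylor:.6f}")
print(f"plain loss l(theta) : {loss(theta):.6f}")
```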
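
The gradient-covariance claim can be checked in the same way. The sketch below (again illustrative, using the same toy loss, so $g(\theta) = -\sin(\theta)$ coordinate-wise) compares the trace of the empirical covariance of 1-sample reparametrized gradients with $\sigma^2\|H(\theta)\|_F^2$.

```python
import numpy as np

# Illustrative sketch: check that the trace of the covariance of 1-sample VO
# gradients matches  sigma^2 * ||H(theta)||_F^2  for the toy loss
# l(theta) = sum_i cos(theta_i), with grad = -sin(theta) and H = diag(-cos(theta)).

rng = np.random.default_rng(1)

d, sigma = 10, 0.05
theta = rng.normal(size=d)

# Many 1-sample reparametrized gradients  g(theta + sigma * eps),  eps ~ N(0, I).
eps = rng.normal(size=(200_000, d))
grads = -np.sin(theta + sigma * eps)          # shape (n_samples, d)

# Trace of the empirical covariance = sum of per-coordinate variances.
trace_cov_mc = grads.var(axis=0).sum()

# First-order prediction from  g(theta + sigma * eps) ~= g(theta) + sigma * H(theta) eps.
hess_frob_sq = (np.cos(theta) ** 2).sum()     # ||H(theta)||_F^2 for the diagonal Hessian
trace_cov_pred = sigma**2 * hess_frob_sq

print(f"trace of empirical gradient covariance : {trace_cov_mc:.8f}")
print(f"sigma^2 * ||H||_F^2 prediction         : {trace_cov_pred:.8f}")
```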