---
title: α-SAM
tags: Idea
description: A two-parameter implementation of SAM
---

SAM has a single parameter, $\rho$, that simultaneously controls two things:

* **The strength of the regularization:** larger values of $\rho$ lead to more 'sharpness-avoidance'. This can be seen in the implicit-regularization view of SAM, where the effective objective is $\mathcal{L}(\theta) + \rho \|\nabla_\theta\mathcal{L}(\theta)\|_2$. This suggests that higher values of $\rho$ might be a good thing.
* **The step size,** and thus the validity of the Taylor approximation: higher values of $\rho$ increase the approximation error due to higher-order terms in the Taylor expansion. This suggests that higher values of $\rho$ are a bad thing.

![](https://i.imgur.com/coFbGfB.jpg)

## Decoupling the two things

SAM may work even better if we decouple these two roles from each other. We can do this by introducing a two-parameter version of SAM, in which the gradient is

$$
\mathbf{g}_{\alpha, \rho} = \alpha(\mathbf{g}_\rho - \mathbf{g}_0) + \mathbf{g}_0,
$$

where $\mathbf{g}_\rho$ is the SAM gradient direction, and $\mathbf{g}_0$ is the ordinary gradient evaluated at $\theta$ that we would use in plain gradient descent. It is easy to see that:

* $\alpha=1$ recovers the usual SAM;
* $\alpha=0$ recovers plain gradient descent;
* $\alpha \in (0,1)$ interpolates between the two;
* $\alpha > 1$ increases the implicit regularization of SAM without increasing $\rho$, since $\mathbf{g}_{\alpha, \rho}$ is approximately the gradient of $\mathcal{L}(\theta) + \alpha\rho \|\nabla_\theta\mathcal{L}(\theta)\|_2$;
* $\alpha<0$ gives a sharpness-seeking algorithm that Szilvi calls AntiSAM.

Calculating $\mathbf{g}_{\alpha, \rho}$ is trivial once the SAM update has been calculated.

## Fixing variational SAM

This two-parameter SAM might be particularly useful when we want to learn $\alpha$ or $\rho$, for example in VariationalSAM. One reason VSAM might not work very well is that when the variances are increased, the Taylor approximation breaks down and the gradients become complete garbage. If we learn $\alpha$ but keep $\rho$ fixed, we don't run into this issue.

## Negative $\rho$

Interestingly, $\alpha$-SAM also works when $\rho$ is negative. Consider, for example, the scenario $\rho<0$ and $\alpha=-1$: the $\alpha$-SAM algorithm still performs, approximately, the original SAM update. However, the first step is now downhill rather than uphill, since $\rho$ is negative. A consequence of this is that the "one step uphill, one step downhill" pattern may not really be necessary for the SAM update. Consider a situation where we have taken two steps of plain gradient descent with learning rate $\eta$:

\begin{align}
\theta_1 &= \theta_0 -\eta g(\theta_0)\\
\theta_2 &= \theta_1 -\eta g(\theta_1)
\end{align}

Now, we can observe the following:

\begin{align}
g(\theta_1) - g(\theta_0) &= g(\theta_0 - \eta g(\theta_0)) - g(\theta_0)\\
&\approx -\eta H(\theta_0)g(\theta_0)\\
&= -\frac{\eta}{2}\nabla_\theta \|g(\theta_0)\|_2^2\\
&= -\eta \|g(\theta_0)\|_2 \nabla_\theta \|g(\theta_0)\|_2
\end{align}

So, from just the difference of the last two gradients we can estimate the gradient of $\|g(\theta_0)\|_2$ (or of $\|g(\theta_0)\|^2_2$). We can then step along this direction to strengthen the sharpness-avoiding inductive bias, essentially for free, by reusing the last two gradients.
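A quick numerical sanity check of this approximation, as a sketch on an arbitrary toy non-quadratic loss (the loss, step size, and dimensions below are illustrative choices, not from the note):

```python
import torch

# Toy non-quadratic loss, so the first-order Taylor step is only approximate.
def loss(theta):
    return torch.sum(torch.cos(theta) + 0.1 * theta ** 4)

def grad(theta):
    theta = theta.detach().requires_grad_(True)
    g, = torch.autograd.grad(loss(theta), theta)
    return g

eta = 1e-3
theta0 = torch.randn(10, dtype=torch.double)
theta1 = theta0 - eta * grad(theta0)

# Left-hand side: difference of two consecutive full gradients.
lhs = grad(theta1) - grad(theta0)

# Right-hand side: -(eta/2) * grad_theta ||g(theta_0)||_2^2, via double backprop.
theta0_ = theta0.detach().clone().requires_grad_(True)
g0, = torch.autograd.grad(loss(theta0_), theta0_, create_graph=True)
rhs, = torch.autograd.grad(torch.sum(g0 ** 2), theta0_)
rhs = -0.5 * eta * rhs

print((lhs - rhs).norm() / rhs.norm())  # relative error; shrinks as eta shrinks
```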
This hints at a cheaper version of SAM that always moves forward, constructs an estimate of $\nabla_\theta \|g(\theta_0)\|_2$ from consecutive gradient steps, and uses it to move towards regions with smoother gradients. Something along these lines (a code sketch follows the list of problems below):

\begin{align}
\theta_1 &= \theta_0 -\eta g(\theta_0)\\
\theta_2 &= \theta_1 -\eta g(\theta_1)\\
\theta_3 &= \theta_2 + \frac{\alpha}{\eta \|g(\theta_0)\|_2} \left(g(\theta_1) - g(\theta_0)\right)
\end{align}

### Problems with this algorithm

* **Stochastic gradients:** It is unclear what this algorithm would do if $g(\theta_0)$ and $g(\theta_1)$ were in fact stochastic gradients calculated on different minibatches. Would it still make any sense? What would the inductive bias be then?
* **Non-vanilla SGD:** This assumes that the first two steps are vanilla gradient steps, with no momentum and no variable learning rate. We obviously don't want to restrict ourselves to vanilla GD. Perhaps there is a version of this that still works with Adam. That would be interesting: a combination of SAM and Adam. Saddam?
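For concreteness, here is a minimal sketch of one round of the forward-only update above, written under the assumptions flagged in the list (full-batch gradients, plain GD, no momentum). The function name and default constants are illustrative, not part of the note:

```python
import torch

def forward_only_sam_round(theta, grad, eta=0.01, alpha=0.05):
    """One round of the forward-only variant sketched above.

    `grad` is a full-batch gradient oracle (e.g. the one from the previous
    sketch); `alpha` controls the strength of the extra sharpness-avoiding step.
    """
    g0 = grad(theta)
    theta1 = theta - eta * g0            # first plain GD step
    g1 = grad(theta1)
    theta2 = theta1 - eta * g1           # second plain GD step
    # Reuse the two gradients:  g1 - g0 ≈ -eta * ||g0||_2 * grad ||g(theta_0)||_2,
    # so adding a positive multiple of (g1 - g0) steps towards smaller gradient norm.
    theta3 = theta2 + (alpha / (eta * g0.norm())) * (g1 - g0)
    return theta3
```

With the toy `loss`/`grad` pair from the earlier sketch, `forward_only_sam_round(torch.randn(10, dtype=torch.double), grad)` runs one round; whether this stays sensible with minibatch gradients is exactly the first open problem above.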