# Stochastic Weight Averaging

###### tags: `Neural Networks`

> Notes for the paper: [Averaging Weights Leads to Wider Optima and Better Generalization](https://arxiv.org/pdf/1803.05407.pdf)
> This paper shows that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training.

Table of Contents
[ToC]

## Introduction

The idea of maintaining a running average of the weights traversed by SGD dates back to 1988, but the procedure is not typically used to train neural networks. It is sometimes applied as an exponentially decaying running average in combination with a decaying learning rate (where it is called an exponential moving average), which smooths the trajectory of conventional SGD but does not perform very differently.

However, the paper shows that an equally weighted average of the points traversed by SGD with a cyclical or high constant learning rate, which it calls Stochastic Weight Averaging (SWA), has many surprising and promising features for training deep neural networks, and leads to a better understanding of the geometry of their loss surfaces. Indeed, SWA with cyclical or constant learning rates can be used as a drop-in replacement for standard SGD training of multilayer networks, but with improved generalization and essentially no overhead.

By running SGD with a cyclical or high constant learning rate, we traverse the surface of the set of high-performing solutions, and by averaging we find a more centred solution in a flatter region of the training loss. Further, the training loss for SWA is often slightly worse than for SGD, suggesting that the SWA solution is not a local optimum of the loss. In the title of the paper, "optima" is used in a general sense to mean solutions (converged points of a given procedure), rather than different local minima of the same objective.

### Convergence of SGD and SWA

It has been argued that SGD is more likely to converge to broad local optima than batch gradient methods, which tend to converge to sharp optima, and that the broad optima found by SGD are more likely to have good test performance, even if the training loss is worse than for the sharp optima. However, this claim is disputed in other papers.

:::success
The SWA method is based on averaging multiple points along the trajectory of SGD with cyclical or constant learning rates. At a high level, SWA and Dropout are both at once regularizers and training procedures, motivated to approximate an ensemble.
:::

### Stochastic Weight Averaging

SWA is based on averaging the samples proposed by SGD using a learning rate schedule that allows exploration of the region of weight space corresponding to high-performing networks. In particular, we consider cyclical and constant learning rate schedules.

### Analysis of SGD Trajectories

A cyclical learning rate is used to promote exploration of the loss surface. In each cycle we linearly decrease the learning rate from $\alpha_1$ to $\alpha_2$. The learning rate at iteration $i$ is given by

$$
\alpha(i) = (1 - t(i))\,\alpha_1 + t(i)\,\alpha_2,
$$

$$
t(i) = \frac{1}{c}\big(\mathrm{mod}(i - 1, c) + 1\big).
$$

The base learning rates $\alpha_1 \geq \alpha_2$ and the cycle length $c$ are the hyperparameters of the method. Here, by iteration we mean the processing of one batch of data. We propose to use a discontinuous schedule that jumps directly from the minimum to the maximum learning rate, and does not steadily increase the learning rate as part of the cycle.
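Below is a minimal Python sketch of this discontinuous schedule, for illustration only; the function name and the numeric values in the usage line are placeholders rather than settings taken from the paper.

```python
def cyclical_lr(i, alpha_1, alpha_2, c):
    """Discontinuous cyclical schedule from the formula above.

    i       : 1-based iteration (batch) index
    alpha_1 : base (maximum) learning rate, alpha_1 >= alpha_2
    alpha_2 : minimum learning rate, reached at the end of each cycle
    c       : cycle length in iterations
    """
    t = ((i - 1) % c + 1) / c          # t runs from 1/c up to 1 within a cycle
    return (1 - t) * alpha_1 + t * alpha_2

# Illustrative usage: the rate decays linearly over each cycle of c iterations,
# then jumps straight back up towards alpha_1 at the start of the next cycle.
lrs = [cyclical_lr(i, alpha_1=0.05, alpha_2=0.0005, c=100) for i in range(1, 301)]
```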
We use this more abrupt cycle because, for our purpose, exploration is more important than the accuracy of individual proposals. For even greater exploration, we also consider constant learning rates $\alpha(i) = \alpha_1$.

![](https://i.imgur.com/0AIsMOu.png)

We run SGD with cyclical and constant learning rate schedules starting from a pretrained point for a Preactivation ResNet-164 on CIFAR-100. We then use the first, middle and last point of each of the trajectories to define a 2-dimensional plane in the weight space containing all affine combinations of these points.

![](https://i.imgur.com/J8u9W8y.png)

In the figure above we plot the train loss and test error for points in these planes, and project the other points of the trajectory onto the plane of the plot. Note that the trajectories do not generally lie in the plane of the plot, except for the first, middle and last points, shown by black crosses in the figure. Therefore, for the other points of the trajectories it is not possible to read off the train loss and test error from the plots.

:::success
- The key insight is that both methods explore points close to the periphery of the set of high-performing networks. The visualizations suggest that both methods explore the region of weight space corresponding to DNNs with high accuracy. The main difference between the two approaches is that the individual proposals of SGD with a cyclical learning rate schedule are in general much more accurate than the proposals of fixed-learning-rate SGD. After making a large step, SGD with a cyclical learning rate spends several epochs fine-tuning the resulting point with a decreasing learning rate. SGD with a fixed learning rate, on the other hand, always makes steps of relatively large size, exploring more efficiently than with a cyclical learning rate, but its individual proposals are worse.
- While the train loss and test error surfaces are qualitatively similar, they are not perfectly aligned. The shift between train and test suggests that more robust central points in the set of high-performing networks can lead to better generalization. Indeed, if we average several proposals from the optimization trajectories, we get a more robust point that has substantially better test performance than the individual proposals of SGD, and is essentially centered on the shifted mode of the test error.
:::

### SWA Algorithm

Start with a pretrained model $\hat{w}$. We refer to the number of epochs required to train a given DNN with the conventional training procedure as its training budget, denoted by $B$. Starting from $\hat{w}$, we continue training using a cyclical or constant learning rate schedule. When using a cyclical learning rate, we capture the models $w_i$ that correspond to the minimum values of the learning rate (shown in the first figure). For constant learning rates, we capture models at each epoch. Finally, we average the weights of all the captured networks $w_i$ to obtain the final SWA model. A minimal sketch of this averaging loop is given at the end of these notes.

### Solution Width

Keskar et al. [2017] and Chaudhari et al. [2017] conjecture that the width of a local optimum is related to generalization. The general explanation for the importance of width is that the surfaces of train loss and test error are shifted with respect to each other, and it is thus desirable to converge to the modes of broad optima, which stay approximately optimal under small perturbations.
## Some Important Insights

- First, the train loss and test error plots are indeed substantially shifted, and the point obtained by minimizing the train loss is far from optimal on the test set.
- The loss near sharp optima found by SGD with very large batches is actually flat in most directions, but there exist directions in which these optima are extremely steep. Because of this sharpness, the generalization performance of large-batch optimization is substantially worse than that of solutions found by small-batch SGD.
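As referenced in the SWA Algorithm section, here is a minimal sketch of the weight-averaging loop. Everything in it is illustrative and assumed rather than taken from the paper: `sgd_epoch` is a hypothetical stand-in for one epoch of SGD under the cyclical or constant schedule, and the weights are treated as a plain list of arrays (or floats).

```python
import copy
import random

def swa(w_hat, sgd_epoch, n_epochs, cycle_length=None):
    """Sketch of SWA on top of an existing training loop.

    w_hat        : pretrained weights (list of arrays or floats).
    sgd_epoch    : callable(weights, epoch) -> weights after one more epoch of
                   SGD; hypothetical stand-in for the real training step.
    cycle_length : with a cyclical schedule, capture weights only at the end of
                   each cycle (minimum learning rate); with a constant learning
                   rate (cycle_length=None), capture weights every epoch.
    """
    w = copy.deepcopy(w_hat)
    w_swa = copy.deepcopy(w_hat)   # running average, initialised at w_hat
    n_models = 1                   # number of captured models averaged so far

    for epoch in range(1, n_epochs + 1):
        w = sgd_epoch(w, epoch)
        if cycle_length is None or epoch % cycle_length == 0:
            # Equal-weight running average of all captured points.
            w_swa = [(ws * n_models + wi) / (n_models + 1)
                     for ws, wi in zip(w_swa, w)]
            n_models += 1
    return w_swa

# Toy usage with a one-parameter "model" and a random-walk stand-in for SGD:
rng = random.Random(0)
noisy_epoch = lambda w, epoch: [wi + rng.uniform(-0.1, 0.1) for wi in w]
w_swa = swa(w_hat=[0.0], sgd_epoch=noisy_epoch, n_epochs=20, cycle_length=4)
```

In practice the same running average would be applied to the full set of network parameters inside whatever training framework is being used.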