Visualization of the loss landscape and optimization path of a neural network
Author : Pascal Tikeng
Notations
$\Theta$ : parameters (weights) space.
We will consider $\Theta$ as an affine space, to which we will associate a vector space $\vec{\Theta}$. The elements of $\vec{\Theta}$ will be noted with an arrow on top of them.
$\|\cdot\|$ : Frobenius norm.
$\vec{u}^T A \vec{v}$ : bilinear product, for a matrix $A$ and two vectors $\vec{u}$ and $\vec{v}$. Two special cases:
scalar product $\vec{u}^T \vec{v}$ : for two vectors $\vec{u}$ and $\vec{v}$
quadratic product $\vec{u}^T A \vec{u}$ : for a matrix $A$ and a vector $\vec{u}$
Introduction
In deep learning, the loss function is generally defined from $\Theta$ to $\mathbb{R}$ as $L(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(f_\theta(x), y)\big] + \lambda R(\theta)$, where the function $\ell$ measures how well the neural network $f_\theta$ with parameters $\theta$ predicts the label $y$ of a data sample $x$, $\mathcal{D}$ is the data distribution and $R$ the regularizer ($\lambda \ge 0$). In practice, $\mathcal{D}$ is unknown, and the empirical distribution of a given dataset $\{(x_i, y_i)\}_{i=1}^{N}$ is generally used. Then $\hat{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell(f_\theta(x_i), y_i) + \lambda R(\theta)$, an empirical estimate of $L(\theta)$. The parameters of the model are thus updated during training by a rule of the form $\theta_{t+1} = \theta_t + \Delta\theta_t$, with $g_t = \nabla_\theta \hat{L}(\theta_t)$ the gradient of the loss function at $\theta_t$, $\Delta\theta_t$ the parameter update at time $t$ given the optimization algorithm of choice, $m_t$ the momentum and $v_t$ the velocity. Generic gradient methods (SGD, …) use $\Delta\theta_t = -\eta_t\, g_t$ and generic adaptive gradient methods (AdaGrad, RMSProp, Adam, …) use $\Delta\theta_t = -\eta_t\, m_t / (\sqrt{v_t} + \epsilon)$. By abuse of notation, we will sometimes write $L$ instead of $\hat{L}$.
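To make the two update families concrete, here is a minimal NumPy sketch on a toy quadratic loss $L(\theta) = \frac{1}{2}\|\theta\|^2$; the hyperparameter names (`eta`, `beta1`, `beta2`, `eps`) and the Adam-style form of the adaptive update are illustrative assumptions, not a prescription.

```python
import numpy as np

# Toy illustration of the two update families described above (hypothetical
# quadratic loss; eta, beta1, beta2, eps are generic hyperparameters).
def grad(theta):                       # gradient of L(theta) = 0.5 * ||theta||^2
    return theta

theta_sgd = theta_adam = np.array([1.0, -2.0])
m = v = np.zeros(2)                    # momentum and velocity buffers
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 101):
    # Generic gradient method: the update depends on the gradient g_t only.
    g = grad(theta_sgd)
    theta_sgd = theta_sgd - eta * g

    # Generic adaptive method (Adam-like): the update uses momentum m_t and velocity v_t.
    g = grad(theta_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    theta_adam = theta_adam - eta * m / (np.sqrt(v) + eps)

print(theta_sgd, theta_adam)           # both should approach the minimizer 0
```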
While neural loss functions live in a very high-dimensional space, visualizations are only possible using low-dimensional 1D (line) or 2D (surface) plots. Several methods exist for closing this dimensionality gap. The idea is to choose a linear subspace that maximally preserves the optimization trajectory’s shape, in order to observe patterns that are meaningful in the full parameter space.
Definition: A loss landscape visualization technique is said to be meaningful if it preserves the curvature of the original loss function. If this is the case, any critical point (minima, maxima, saddle points, …) or critical surface (valleys, plateaus, ravines … and other flat regions) of the original loss function is present in its landscape visualization. It is not necessarily a question here of making the critical points/zones appear exactly in the visualization: we can simply define indicators allowing us to detect, on the landscape, the presence of these points/zones in the real function.
Let $g = \nabla_\theta L(\theta)$ be the gradient and $H = \nabla^2_\theta L(\theta)$ the local Hessian matrix of the loss at $\theta$. Near $\theta$, $L(\theta + \vec{\delta}) \approx L(\theta) + \vec{\delta}^T g + \frac{1}{2}\vec{\delta}^T H \vec{\delta}$. That is, for a very small step size $\epsilon$, $L(\theta - \epsilon g) \approx L(\theta) - \epsilon\, g^T g + \frac{1}{2}\epsilon^2\, g^T H g$.
Definition: When $H$ has some large positive eigenvalues (i.e. high-curvature directions) and some eigenvalues close to 0 (i.e. low-curvature directions), gradient descent bounces back and forth in high-curvature directions and makes slow progress in low-curvature directions. We will say that $H$ is ill-conditioned in this case, or that the optimization problem has an ill-conditioned curvature.
Suppose the loss function near $\theta$ has a high condition number, i.e. very small steps cause an increase in the cost function (for example if $\theta$ is a very sharp minimum surrounded by high-loss regions). If $\frac{1}{2}\epsilon^2\, g^T H g > \epsilon\, g^T g$, the optimization problem also becomes ill-conditioned. During training, we can monitor the terms $g^T g$ (square of the gradient norm) and $g^T H g$. If the gradient norm doesn't shrink but $g^T H g$ grows by an order of magnitude, learning can become very slow despite a strong gradient.
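As a rough illustration (toy model, data and loss are placeholders, not from the article), both quantities can be computed at the current point with a Hessian-vector product obtained by double backpropagation, without ever materializing $H$:

```python
import torch

# Sketch: monitor g^T g and g^T H g at the current parameters of a toy model.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)

g_dot_g = sum((g * g).sum() for g in grads)                 # g^T g (squared gradient norm)

# Hessian-vector product H g: differentiate <g, v> with v = g (detached constant).
v = [g.detach() for g in grads]
gv = sum((g * vi).sum() for g, vi in zip(grads, v))
Hv = torch.autograd.grad(gv, params)                         # one tensor per parameter
g_H_g = sum((vi * hvi).sum() for vi, hvi in zip(v, Hv))      # g^T H g

print(float(g_dot_g), float(g_H_g))
```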
Theorem : If the matrix $\frac{\epsilon}{2} H - I$ is positive-definite, the optimization problem is ill-conditioned.
Proof : If $\frac{\epsilon}{2} H - I$ is positive-definite, then for any $g \neq \vec{0}$ we have $g^T\big(\frac{\epsilon}{2} H - I\big)g > 0$, i.e. $\frac{1}{2}\epsilon^2\, g^T H g > \epsilon\, g^T g$: the second-order term of the expansion above dominates the first-order term, which is exactly the ill-conditioning criterion.
Definition: If $\nabla_\theta L(\theta^*) = \vec{0}$, then $\theta^*$ is a critical/stationary point of $L$, and the determinant of $H(\theta^*)$ is equal to the Gaussian curvature of the loss surface considered as a manifold. The eigenvalues of $H(\theta^*)$ are the principal curvatures of the loss function at $\theta^*$, and the eigenvectors are the principal directions of curvature.
If $\nabla_\theta L(\theta^*) = \vec{0}$, $\theta^*$ is a local :
maximum of $L$ if $H(\theta^*)$ is negative definite (all its eigenvalues are negative).
minimum of $L$ if $H(\theta^*)$ is positive definite (all its eigenvalues are positive). Some local minima can be very flat (i.e. there is a large enough neighborhood of $\theta^*$ that contains only local minima) or sharp (the loss function near $\theta^*$ has a high condition number, i.e. very small perturbations of $\theta^*$ cause an increase in $L$).
if $H(\theta^*)$ is indefinite (some eigenvalues are positive and others are negative), $\theta^*$ is a saddle point of $L$.
if $H(\theta^*)$ is singular (there is at least one zero eigenvalue, i.e. $H(\theta^*)^{-1}$ is undefined), we can't conclude, and the point could be any of a minimum, maximum or saddle point.
If the Hessian matrix of $L$ is positive semi-definite at every point of $\Theta$, $L$ is convex and any critical point $\theta^*$ is its global minimum. If it is instead negative semi-definite at every point of $\Theta$, $L$ is concave and $\theta^*$ is its global maximum.
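A small sketch of this second-derivative test, assuming the Hessian at the candidate point can be formed explicitly (only realistic for very small models); the tolerance used to declare an eigenvalue "zero" is an arbitrary choice:

```python
import numpy as np

# Classify a critical point from the eigenvalues of a small, explicit Hessian.
def classify_critical_point(H, tol=1e-8):
    eigvals = np.linalg.eigvalsh(H)            # H is symmetric
    if np.any(np.abs(eigvals) < tol):
        return "degenerate (test inconclusive)"
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -1.0]])))  # saddle point
```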
Many deep models are guaranteed to have an extremely large number of local minima. It has been shown that this is not necessarily a problem: the majority of local minima are of good quality (almost all equivalent in cost to the global minimum). The biggest obstacle to the optimization of $L$ remains the presence of saddle points. In low dimensions, local minima are more common, while in high dimensions, local minima are rare and saddle points more common. Most of the training time is spent traversing flat valleys of the Hessian matrix or circumnavigating tall mountains via an indirect arcing path, and the trajectory through such flat valleys and around such mountains may be long and result in excessive training time [10].
One of the most widely used methods, presented below, is to consider a remarkable point near which we want to study the behavior of the loss, choose a direction compatible with the parameters space (either randomly or deterministically), and perturb the parameters in this direction. The loss is then visualized as a function of the intensity of the perturbation.
1.1) Line
- Linear Interpolation [1]
Given two parameters $\theta_0, \theta_1 \in \Theta$, typically the initial and the final parameters (at the end of optimization), we define the function :
$\theta(\alpha) = (1 - \alpha)\,\theta_0 + \alpha\,\theta_1 = \theta_0 + \alpha\,(\theta_1 - \theta_0)$
Then we plot the loss as a function of $\alpha$ :
$f(\alpha) = L(\theta(\alpha)), \quad \alpha \in A$
Depending on the problem, we can restrict ourselves to the segment between $\theta_0$ and $\theta_1$ with $A = [0, 1]$, or let $\alpha$ range over a larger interval such as $A = [-2, 2]$ in the figure below.
The linear interpolation curves for a decoder-only Transformer model trained on modular addition, with A = [-2, 2]. Left) at the minimizer. Right) when the model starts to converge. Image by the author.
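A hedged sketch of how such a 1D interpolation curve can be computed in PyTorch; the two models, the data and the loss below are toy placeholders, not the setup used for the figure above.

```python
import copy
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Evaluate L((1 - alpha) * theta_0 + alpha * theta_1) for a range of alphas.
def interpolation_curve(model_0, model_1, loss_fn, data, targets, alphas):
    theta_0 = parameters_to_vector(model_0.parameters()).detach()
    theta_1 = parameters_to_vector(model_1.parameters()).detach()
    probe = copy.deepcopy(model_0)            # model whose weights we overwrite
    losses = []
    with torch.no_grad():
        for alpha in alphas:
            theta = (1 - alpha) * theta_0 + alpha * theta_1
            vector_to_parameters(theta, probe.parameters())
            losses.append(loss_fn(probe(data), targets).item())
    return losses

# Toy usage (assumptions, not the author's setup):
m0, m1 = torch.nn.Linear(5, 1), torch.nn.Linear(5, 1)
X, y = torch.randn(64, 5), torch.randn(64, 1)
alphas = torch.linspace(-2.0, 2.0, 41)        # A = [-2, 2] as in the figure
curve = interpolation_curve(m0, m1, torch.nn.functional.mse_loss, X, y, alphas)
```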
- Random Directions [1]
In this case, $\theta(\alpha) = \theta^* + \alpha\,\vec{\delta}$, where $\theta^*$ is a point carefully chosen in $\Theta$ (generally the minimizer of $L$) and $\vec{\delta}$ a direction vector carefully chosen (generally randomly) in $\vec{\Theta}$.
- Bilinear Interpolation [2]
Given four parameters $\theta_1, \theta_2, \theta_3, \theta_4 \in \Theta$, we choose two coefficients and define the interpolation as $\theta(\alpha, \beta) = (1-\alpha)(1-\beta)\,\theta_1 + \alpha(1-\beta)\,\theta_2 + (1-\alpha)\beta\,\theta_3 + \alpha\beta\,\theta_4$, where $\alpha \in [0, 1]$ and $\beta \in [0, 1]$.
- Barycentric Interpolation [2]
Given three parameters $\theta_1$, $\theta_2$ and $\theta_3$; let $\lambda_1, \lambda_2, \lambda_3 \ge 0$ with $\lambda_1 + \lambda_2 + \lambda_3 = 1$. Then, the formulation of the interpolation is $\theta(\lambda_1, \lambda_2, \lambda_3) = \lambda_1\,\theta_1 + \lambda_2\,\theta_2 + \lambda_3\,\theta_3$, where the $\lambda_i$ are the barycentric coordinates.
1.2) Surface
For surfaces, two directions are chosen instead of one, and the perturbation is made along both directions.
- Random Directions [1]
We define the function :
$\theta(\alpha, \beta) = \theta^* + \alpha\,\vec{\delta} + \beta\,\vec{\eta}$
where $\theta^*$ is a point carefully chosen in $\Theta$ (generally the minimizer of $L$) and $\vec{\delta}, \vec{\eta}$ two direction vectors carefully chosen (generally randomly) in $\vec{\Theta}$.
It has been proven that in high-dimensional spaces, randomly chosen vectors tend to be orthogonal to each other: for two independent random unit vectors $\vec{u}, \vec{v} \in \mathbb{R}^n$, $\mathbb{E}[\langle\vec{u}, \vec{v}\rangle] = 0$ and $\mathrm{Var}[\langle\vec{u}, \vec{v}\rangle] = 1/n$ [4]. Also, the expected cosine similarity between Gaussian random vectors in $n$ dimensions is roughly $\sqrt{2/(\pi n)}$ [3].
The loss $f(\alpha, \beta) = L(\theta(\alpha, \beta))$ is then plotted (contour plot, 3D surface plot, …) as a function of $(\alpha, \beta)$ :
2D visualization of the loss surface of ResNet and ResNet-noshort (NS) with different depth. Image from [3]
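Below is a minimal sketch (toy placeholders for the model, loss and data; no direction normalization yet, see section 1.3) of how such a 2D random-directions surface can be evaluated on a grid:

```python
import copy
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Evaluate L(theta* + a * d1 + b * d2) on a grid of (a, b) values.
def loss_surface(model, loss_fn, X, y, alphas, betas):
    theta_star = parameters_to_vector(model.parameters()).detach()
    d1 = torch.randn_like(theta_star)          # two random Gaussian directions
    d2 = torch.randn_like(theta_star)
    probe = copy.deepcopy(model)
    Z = torch.zeros(len(alphas), len(betas))
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(betas):
                vector_to_parameters(theta_star + a * d1 + b * d2, probe.parameters())
                Z[i, j] = loss_fn(probe(X), y)
    return Z                                    # e.g. plt.contour(alphas, betas, Z.T)

# Toy usage (assumed model and data):
m = torch.nn.Linear(5, 1)
X, y = torch.randn(64, 5), torch.randn(64, 1)
Z = loss_surface(m, torch.nn.functional.mse_loss, X, y,
                 torch.linspace(-1, 1, 11), torch.linspace(-1, 1, 11))
```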
- Interpolation
Given three parameters $\theta_0, \theta_1, \theta_2 \in \Theta$ : $\theta(\alpha, \beta) = \theta_0 + \alpha\,(\theta_1 - \theta_0) + \beta\,(\theta_2 - \theta_0)$.
In this kind of case, $\theta_0$ is usually the initial parameter, and $\theta_1, \theta_2$ two parameters obtained after optimization with two different methods (for example with two different optimizers). The goal here is to see the difference between the paths taken by the two methods (a path can be plotted between $\theta_0$ and $\theta_1$, or between $\theta_0$ and $\theta_2$, using one of the methods described below for plotting optimization paths).
Visualization of the loss surface with and without batch-normalization. Image from [2]
- PCA directions
Here PCA is applied to the matrix $M = [\theta_0 - \theta_T; \ldots; \theta_{T-1} - \theta_T]$ (or directly to $[\theta_0; \ldots; \theta_T]$), where $\theta_t$ denotes the model parameters at epoch $t$ and $T$ is the total number of training epochs, and the two most explanatory directions are used as $\vec{\delta}$ and $\vec{\eta}$. That is :
firstly, by calculating the covariance matrix $C$ of $M$ (after centering its rows).
The eigensystem of $C$ is then found, providing a new set of basis vectors and a coordinate system (i.e., a rotation of weight space).
To reduce the dimensionality of the weight space to some value $k$ (here $k = 2$), the $k$ largest eigenvalues and corresponding eigenvectors are chosen and the remaining values discarded.
In practice, PCA is quite often performed on the correlation matrix rather than the covariance matrix. The main reason for this is that if the variables in question are measured in different units, it may be that the variances of some variables dominate the first few principal components, simply because of the units they are measured in (e.g. consider one variable measured in meters together with some other variables measured in millimeters). In the case of neural network weight data, this problem does not arise - the covariance matrix is therefore used for PCA [9].
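A sketch of this PCA step, assuming the trajectory has been flattened into a $(T+1) \times n$ matrix; here the SVD of the centered matrix is used, which is equivalent to the eigendecomposition of its covariance matrix, and centering on the final point $\theta_T$ follows the convention of [3] (an assumption here).

```python
import numpy as np

# Extract the two most explanatory PCA directions from a training trajectory.
# `trajectory` is a (T+1, n) array whose rows are the flattened parameters theta_t.
def pca_directions(trajectory, k=2):
    M = trajectory - trajectory[-1]            # differences theta_t - theta_T
    # Economy SVD of the centered matrix: right singular vectors = PCA directions.
    _, S, Vt = np.linalg.svd(M, full_matrices=False)
    explained = S**2 / np.sum(S**2)            # explained-variance ratios
    return Vt[:k], explained[:k]

traj = np.cumsum(np.random.randn(50, 1000), axis=0)   # fake trajectory, for illustration
(d1, d2), ratios = pca_directions(traj)
```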
1.3) Normalization
[3] : «While the random directions approach to plotting is simple, it fails to capture the intrinsic geometry of loss surfaces, and cannot be used to compare the geometry of two different minimizers or two different networks. This is because of the scale invariance in network weights. When ReLU non-linearities are used, the network remains unchanged if we (for example) multiply the weights in one layer of a network by 10, and divide the next layer by 10. This invariance is even more prominent when batch normalization is used. In this case, the size (i.e., norm) of a filter is irrelevant because the output of each layer is re-scaled during batch normalization. For this reason, a network’s behavior remains unchanged if we re-scale the weights. Note, this scale invariance applies only to rectified networks. Scale invariance prevents us from making meaningful comparisons between plots, unless special precautions are taken. A neural network with large weights may appear to have a smooth and slowly varying loss function; perturbing the weights by one unit will have very little effect on network performance if the weights live on a scale much larger than one. However, if the weights are much smaller than one, then that same unit perturbation may have a catastrophic effect, making the loss function appear quite sensitive to weight perturbations. Keep in mind that neural nets are scale invariant; if the small-parameter and large-parameter networks in this example are equivalent (because one is simply a rescaling of the other), then any apparent differences in the loss function are merely an artifact of scale invariance …»
For these reasons, it is preferable to scale the directions added to the network so that they have the same scale as the network weights. In general, directions (whether deterministic or random) are normalized before they are used. As described above, we begin by producing random Gaussian direction vectors with dimensions compatible with the model parameters (i.e. in $\vec{\Theta}$): one direction $\vec{\delta}$ for a line and two directions $\vec{\delta}, \vec{\eta}$ for a surface. Also, let $\theta^*$ be a point carefully chosen in $\Theta$ (for example the minimizer of $L$).
We will use this 2-layer FC neural network as an example in the following sections. For $x \in \mathbb{R}^d$, let $z_k = \sigma(\vec{w}_k^T x)$ for $k = 1, \ldots, K$, with $\vec{w}_k \in \mathbb{R}^d$ and $W = [\vec{w}_1, \ldots, \vec{w}_K] \in \mathbb{R}^{d \times K}$. Then $f_\theta(x) = \sum_{k=1}^{K} v_k z_k + c$. Let also $v = (v_1, \ldots, v_K) \in \mathbb{R}^K$ and $c \in \mathbb{R}$ be the parameters of the second layer. We fix $v_k = 1$ and $c = 0$ for the sake of simplicity (this type of network is known as a Soft Committee Machine). Then $f_\theta(x) = \sum_{k=1}^{K}\sigma(\vec{w}_k^T x)$ and $\theta = W$.
- No Normalization [3]
The directions are linearly combined with the weights without further processing, as described in the above sections.
- Model-wise Normalization
The directions are normalized at the model level : $\vec{\delta} \leftarrow \frac{\vec{\delta}}{\|\vec{\delta}\|}\,\|\theta^*\|$ (and similarly for $\vec{\eta}$).
- Layer-wise Normalization [2]
The directions are normalized at the layer level so that the direction for each layer has the same norm as the corresponding layer of $\theta^*$ : $\vec{\delta}_l \leftarrow \frac{\vec{\delta}_l}{\|\vec{\delta}_l\|}\,\|\theta^*_l\|$ for each layer $l$ with parameters $\theta^*_l$.
For our example, since we have the parameters of only one layer of our model, this normalization corresponds to the model-wise normalization. If we include the parameters of the second layer ($v$ and $c$), this normalization becomes :
$\vec{\delta}_W \leftarrow \frac{\vec{\delta}_W}{\|\vec{\delta}_W\|}\,\|W\|$ and $\vec{\delta}_{(v,c)} \leftarrow \frac{\vec{\delta}_{(v,c)}}{\|\vec{\delta}_{(v,c)}\|}\,\|(v, c)\|$, with $\vec{\delta}_W$ and $\vec{\delta}_{(v,c)}$ the components of the direction associated with the first and the second layer respectively.
- Weight-wise Normalization
The directions are normalized at the level of each individual weight, so that each entry of the direction has the same magnitude as the corresponding weight of $\theta^*$. For our example, let $\vec{\delta} = [\vec{\delta}_1, \ldots, \vec{\delta}_K] \in \mathbb{R}^{d \times K}$. Then $\delta_{ik} \leftarrow \frac{\delta_{ik}}{|\delta_{ik}|}\,|W_{ik}|$ for all $i$ and $k$.
- Filter-wise Normalization [3]
To remove the above-mentioned scaling effect, the authors of [3] plot the loss functions using filter-wise normalized directions. To obtain such directions for a network with parameters $\theta$, they normalize each filter in each direction to have the same norm as the corresponding filter in $\theta$. In other words, they make the replacement $d_{i,j} \leftarrow \frac{d_{i,j}}{\|d_{i,j}\|}\,\|\theta_{i,j}\|$, where $d_{i,j}$ represents the $j^{\text{th}}$ filter (not the $j^{\text{th}}$ weight) of the $i^{\text{th}}$ layer of the direction $d$. This is not limited to convolutional (Conv) layers but also applies to fully connected (FC) layers: the FC layer is "equivalent" to a Conv layer with a 1×1 output feature map, and the filter corresponds to the weights that generate one neuron [3].
For our example, we will have $\vec{\delta}_k \leftarrow \frac{\vec{\delta}_k}{\|\vec{\delta}_k\|}\,\|\vec{w}_k\|$ for all $k$.
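A sketch of filter-wise normalization in PyTorch, under the assumption that a "filter" is a slice along the first (output) dimension of each weight tensor, as in [3]; zeroing the direction for 1-D parameters (biases, BatchNorm scales) mirrors a common choice in the reference implementation of [3] but is an assumption here.

```python
import torch

# Build one filter-wise normalized random direction for a given model.
def filterwise_normalized_direction(model):
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() <= 1:                      # biases / BN parameters: direction set to 0
            d = torch.zeros_like(p)
        else:
            # per-filter norms, with trailing singleton dims so broadcasting rescales each filter
            d_norm = d.flatten(1).norm(dim=1).view(-1, *([1] * (p.dim() - 1)))
            p_norm = p.flatten(1).norm(dim=1).view(-1, *([1] * (p.dim() - 1)))
            d = d * (p_norm / (d_norm + 1e-10))
        direction.append(d)
    return direction

direction = filterwise_normalized_direction(torch.nn.Linear(5, 3))  # toy usage
```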
2. Optimization Paths [3]
Above, we showed how to visualize the loss landscape of high-dimensional neural network parameters in low dimension. But how can we visualize the trajectory between two weight values encountered by the neural network during optimization? Indeed, random directions fail to capture the variation in optimization trajectories, leading to the misleading appearance of a straight-line path [3]. To capture variation in trajectories, we need to use non-random, carefully chosen directions.
Let $M = [\theta_0 - \theta_T; \ldots; \theta_{T-1} - \theta_T]$, where $\theta_t$ denotes the model parameters at epoch $t$ and $T$ is the total number of training epochs. We can apply PCA to the matrix $M$, select the two most explanatory directions and project the optimization trajectory onto these two directions. This means that if $\vec{\delta}$ and $\vec{\eta}$ are the two direction vectors (in $\vec{\Theta}$) chosen after PCA, we will display on the loss surface the trajectory linking the points $(\alpha_t, \beta_t) = \big(\langle \theta_t - \theta_T, \vec{\delta}\rangle, \langle \theta_t - \theta_T, \vec{\eta}\rangle\big)$ for $t = 0, \ldots, T$. There is no guarantee that the model followed this path during optimization. In fact, it is highly unlikely to have done so, unless the optimization problem is convex [5].
Projected learning trajectories using normalized PCA directions for VGG-9. The left plot in each subfigure uses batch size 128, and the right one uses batch size 8192. WD = weight decay. Image from [3]
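Continuing the PCA sketch above (same assumption of a flattened $(T+1) \times n$ trajectory and unit-norm directions), the projection of the trajectory onto the two directions could look like this:

```python
import numpy as np

# Project each theta_t - theta_T onto two (unit-norm) directions d1, d2 to obtain
# 2D coordinates (a_t, b_t) that can be overlaid on the loss surface they span.
def project_trajectory(trajectory, d1, d2):
    M = trajectory - trajectory[-1]        # displacements from the final point theta_T
    return M @ d1, M @ d2                  # coordinates along d1 and d2

# Toy usage with a fake trajectory and orthonormal directions (illustration only).
traj = np.cumsum(np.random.randn(50, 1000), axis=0)
d1 = np.zeros(1000); d1[0] = 1.0
d2 = np.zeros(1000); d2[1] = 1.0
a, b = project_trajectory(traj, d1, d2)    # plt.plot(a, b) would draw the path
```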
3. Apparent and real convexity
[3] : «We are viewing the loss surface under a dramatic dimensionality reduction, and we need to be careful how we interpret these plots. One way to measure the level of convexity in a loss function is to compute the principle curvatures, which are simply eigenvalues of the Hessian. A truly convex function has non-negative curvatures (a positive semi-definite Hessian), while a non-convex function has negative curvatures. It can be shown that the principle curvatures of a dimensionality reduced plot (with random Gaussian directions) are weighted averages of the principle curvatures of the full-dimensional surface (where the weights are Chi-square random variables). This has several consequences. First of all, if non-convexity is present in the dimensionality reduced plot, then non-convexity must be present in the full-dimensional surface as well. However, apparent convexity in the low-dimensional surface does not mean the high-dimensional function is truly convex. Rather it means that the positive curvatures are dominant (more formally, the mean curvature, or average eigenvalue, is positive).»
The authors of [3] calculate the ratio $\left|\lambda_{\min}/\lambda_{\max}\right|$ at each point of the loss line/surface ($\theta(\alpha)$ for a line, $\theta(\alpha, \beta)$ for a surface), where $\lambda_{\min}$ and $\lambda_{\max}$ are respectively the minimum and the maximum eigenvalues of the Hessian of $L$ at that point. They observe that the convex-looking regions in the surface/line plots correspond to regions with insignificant negative eigenvalues (i.e., there are no major non-convex features that the plot missed), while chaotic regions contain large negative curvatures; and that for convex-looking surfaces, the negative eigenvalues remain extremely small (less than 1% of the size of the positive curvatures) over a large region of the plot.
The quantity $\kappa = \lambda_{\max}/\lambda_{\min}$ is known as the condition number of the Hessian of $L$ at a point. Larger condition numbers imply slower convergence of gradient descent.
Blue color indicates a more convex region (near-zero negative eigenvalues relative to the positive eigenvalues), while yellow indicates significant levels of negative curvature. Image from [3]
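A toy sketch of the $|\lambda_{\min}/\lambda_{\max}|$ diagnostic on a small explicit Hessian; for a real network one would instead estimate the extreme eigenvalues with Hessian-vector products (e.g. a Lanczos routine), which is beyond this sketch.

```python
import numpy as np

# Ratio of extreme Hessian eigenvalues: near 0 means convex-looking,
# large values indicate significant negative curvature.
def eigenvalue_ratio(H):
    eigvals = np.linalg.eigvalsh(H)            # ascending order
    lam_min, lam_max = eigvals[0], eigvals[-1]
    return abs(lam_min) / abs(lam_max)

H = np.array([[4.0, 0.2], [0.2, -0.01]])       # mildly indefinite toy Hessian
print(eigenvalue_ratio(H))
```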
[6] uses the normalized local curvature along the trajectory of the optimizer, given by , as a curvature measure.
Before discussing the convexity problem, let us recall this theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. For all $\theta_0, \theta_1 \in \mathbb{R}^n$ and $\alpha \in [0, 1]$, let $g_{\theta_0, \theta_1}(\alpha) = f\big((1 - \alpha)\,\theta_0 + \alpha\,\theta_1\big)$.
Theorem : $f$ is convex if and only if $g_{\theta_0, \theta_1}$ is convex on $[0, 1]$ for every pair $(\theta_0, \theta_1)$.
Proof: If $f$ is convex, each $g_{\theta_0, \theta_1}$ is the composition of $f$ with an affine map, hence convex. Conversely, if every $g_{\theta_0, \theta_1}$ is convex on $[0, 1]$, then $f\big((1-\alpha)\theta_0 + \alpha\theta_1\big) = g_{\theta_0, \theta_1}\big((1-\alpha)\cdot 0 + \alpha\cdot 1\big) \le (1-\alpha)\, g_{\theta_0, \theta_1}(0) + \alpha\, g_{\theta_0, \theta_1}(1) = (1-\alpha) f(\theta_0) + \alpha f(\theta_1)$, which is the definition of convexity of $f$.
Do you see what I'm getting at? Between 0 and 1, the landscape curves drawn above capture the curvature of the loss in the neighborhood of the point considered, but only in one direction.
In fact, near a given $\theta$, $L(\theta + \vec{\delta}) \approx L(\theta) + \vec{\delta}^T \nabla_\theta L(\theta) + \frac{1}{2}\vec{\delta}^T H(\theta)\,\vec{\delta}$. That is, for a very small $\alpha$, $L(\theta + \alpha\vec{d}) \approx L(\theta) + \alpha\,\vec{d}^T \nabla_\theta L(\theta) + \frac{1}{2}\alpha^2\,\vec{d}^T H(\theta)\,\vec{d}$.
For a unit-norm vector $\vec{d}$ :
$\vec{d}^T \nabla_\theta L(\theta)$ is the $\vec{d}$-directional derivative of $L$ at $\theta$. So if we want to choose a direction of maximum (positive or negative) variation, we can choose $\vec{d} = \pm\frac{\nabla_\theta L(\theta)}{\|\nabla_\theta L(\theta)\|}$, for example, because the directional derivative then reaches its extreme value $\pm\|\nabla_\theta L(\theta)\|$ (by Cauchy-Schwarz).
$\vec{d}^T H(\theta)\,\vec{d}$ is the second $\vec{d}$-directional derivative of $L$ at $\theta$. Since the Hessian matrix encodes the curvature of the function in every direction, $\vec{d}^T H(\theta)\,\vec{d}$ encodes it in the direction $\vec{d}$. Also, the maximum curvature of $L$ at $\theta$ is given by the largest eigenvalue of $H(\theta)$ and is attained in the direction of the corresponding eigenvector. The smallest curvature (the largest negative curvature) of $L$ at $\theta$ is given by the smallest eigenvalue of $H(\theta)$ and is attained in the direction of the corresponding eigenvector.
The first-order approximation of a function at a given point is a linear function that has exactly the same directional derivatives at that point. The second-order approximation of a function at a given point is a quadratic form that has exactly the same directional derivatives and curvature at that point.
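A quick numerical check of these two directional quantities on a toy quadratic (the matrix A and the direction d are arbitrary choices): finite differences along d recover the directional derivative d·∇L and the directional curvature dᵀHd.

```python
import numpy as np

# Toy quadratic L(theta) = 0.5 * theta^T A theta, so grad L = A theta and H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
L = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

theta = np.array([1.0, -1.0])
d = np.array([1.0, 1.0]) / np.sqrt(2)          # unit-norm direction

eps = 1e-5
num_first = (L(theta + eps * d) - L(theta - eps * d)) / (2 * eps)
num_second = (L(theta + eps * d) - 2 * L(theta) + L(theta - eps * d)) / eps**2

print(num_first, d @ grad(theta))              # d-directional derivative
print(num_second, d @ A @ d)                   # directional curvature d^T H d
```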
4. Other methods
4.1 PHATE [7]
4.2 …
5. Can we do better?
Yes. But how?
5.1 Andrews curve and Radar chart
An idea I'm exploring
Andrews plot or Andrews curve is a way to visualize structure in high-dimensional data. Let $\theta = (\theta^{(1)}, \ldots, \theta^{(n)}) \in \Theta$. We can analyze the evolution of the structure of the parameters across epochs by visualizing the functions $f_{\theta_t}(s) = \frac{\theta_t^{(1)}}{\sqrt{2}} + \theta_t^{(2)}\sin(s) + \theta_t^{(3)}\cos(s) + \theta_t^{(4)}\sin(2s) + \theta_t^{(5)}\cos(2s) + \cdots$ for $s \in [-\pi, \pi]$, where $\theta_t$ denotes the model parameters at epoch $t$, and $T$ is the total number of training epochs. Note that $f_\theta(s)$ is the projection of $\theta$ onto the vector $\big(\tfrac{1}{\sqrt{2}}, \sin(s), \cos(s), \sin(2s), \cos(2s), \ldots\big)$. If there is structure in $\theta$, it may be visible in its Andrews curves [8].
The projection onto the circle results in the radar chart.
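A minimal sketch of Andrews curves for a parameter trajectory (toy random parameters; in a real use one would plot one curve per epoch and look for epochs whose curves cluster or drift):

```python
import numpy as np

# Map a parameter vector x to its Andrews curve f_x(s) on a grid of s values,
# using the basis (1/sqrt(2), sin s, cos s, sin 2s, cos 2s, ...).
def andrews_curve(x, s):
    basis = [np.full_like(s, 1 / np.sqrt(2))]
    k = 1
    while len(basis) < len(x):
        basis.append(np.sin(k * s))
        if len(basis) < len(x):
            basis.append(np.cos(k * s))
        k += 1
    return x @ np.vstack(basis)                # f_x(s) = <x, basis(s)>

s = np.linspace(-np.pi, np.pi, 200)
trajectory = np.random.randn(5, 10)            # 5 epochs, 10 (toy) parameters
curves = np.array([andrews_curve(row, s) for row in trajectory])  # one curve per epoch
```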
An introduction to the conjugate gradient method without the agonizing pain
Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization : A Survey
Negative eigenvalues of the Hessian in deep neural networks
Landscape :
Geometry of learning: Visualizing the performance of neural network supervised training methods
[1], [2], [3], [7]
Paths
Visualization of learning in multilayer perceptron networks using principal component analysis
Stuck in a what? Adventures in weight space
Visualizing Deep Network Training Trajectories with PCA
Sharpness and flatness
Flat minima, March 1996
"flat minimum search" search algorithm vs conventional backprop, weight decay and optimal brain surgeon / optimal brain damage.
A flat minimun is a (large) region of connected acceptable minima, where eache weigth withing this region leads to almost identical net function.
On large-batch training for deep learning : Generalization gap and sharp minima
The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by a significant number of large positive eigenvalues of $\nabla^2 L(\theta)$, and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by numerous small eigenvalues of $\nabla^2 L(\theta)$. The authors observe that the loss landscape of deep neural networks is such that large-batch methods are attracted to regions with sharp minimizers and that, unlike small-batch methods, they are unable to escape the basins of attraction of these minimizers.
Sharp minima can generalize for deep nets
Exploring generalization in deep learning
Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data
Others
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
The Loss Surfaces of Multilayer Networks
On the importance of momentum and initialization in deep learning
Exploring loss function topology with cyclical learning rates
Phasemax : Convex phase retrieval via basis pursuit
Taxonomizing local versus global structure in neural network loss landscapes
On the Loss Landscape of Adversarial Training : Identifying Challenges and How to Overcome Them
convergence to minimizers of strongly-convex functions and to stationary points for non-convex functions
Optimization Methods for Large-Scale Machine Learning
saddle-point avoidance
Escaping From Saddle Points – Online Stochastic Gradient for Tensor Decomposition
Gradient descent converges to minimizers
robustness to input data
Train faster, generalize better: Stability of stochastic gradient descent
References
[1] Qualitatively Characterizing Neural Network Optimization Problems, Ian J. Goodfellow & Oriol Vinyals & Andrew M. Saxe, 2015
[2] An empirical analysis of the optimization of deep network loss surfaces, Daniel Jiwoong Im & Michael Tao & Kristin Branson, 2017
[3] Visualizing the Loss Landscape of Neural Nets, Hao Li & Zheng Xu & Gavin Taylor & Christoph Studer & Tom Goldstein, 2018
[6] The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon, Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, Joshua Susskind, 2022