
Visualization of the loss landscape and optimization path of a neural network

Author : Pascal Tikeng

Notations

- $\Theta \subseteq \mathbb{R}^d$ : parameters (weights) space, $|\Theta| = d$
- We will consider $\Theta$ as an affine space to which we will associate a vector space $\vec{\Theta}$ : $\forall (\theta, \theta') \in \Theta^2,\ \vec{\theta\theta'} = \theta' - \theta \in \vec{\Theta}$. The elements of $\vec{\Theta}$ will be noted with an arrow on top of them.
- $I_n = \{1, \dots, n\}$ $\forall n \in \mathbb{N}$
- $\|\cdot\|$ : Frobenius norm
- $\langle \cdot, \cdot, \cdot \rangle$ : bilinear product, $\langle u, A, v \rangle = u^T A v = \sum_{ij} u_i a_{ij} v_j$ for a matrix $A \in \mathbb{R}^{n \times m}$, and two vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$
- $\langle \cdot, \cdot \rangle$ :
    - scalar product : $\langle u, v \rangle = u^T v = \sum_i u_i v_i$ for two vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^n$
    - quadratic product : $\langle A, u \rangle = \langle u, A, u \rangle = u^T A u = \sum_{ij} u_i a_{ij} u_j$ for a matrix $A \in \mathbb{R}^{n \times n}$ and a vector $u \in \mathbb{R}^n$

Introduction

In deep learning, the loss function is generally defined from $\Theta$ to $\mathbb{R}^+$ as $\mathcal{L}(\theta) = \mathbb{E}_{s \sim p}[\ell(s, \theta)] + \lambda R(\theta)$, where the function $\ell(s, \theta)$ measures how well the neural network with parameters $\theta$ predicts the label of a data sample $s$, $p$ is the data distribution and $R(\theta)$ the regularizer ($\lambda \in \mathbb{R}^+$). In practice, $p$ is unknown, and the empirical distribution of a given dataset $\mathcal{D}$ is generally used instead, $\hat{p}(s) = \frac{|\{s' \in \mathcal{D},\ s' = s\}|}{|\mathcal{D}|}$. Then $\mathcal{L}(\theta) \approx \hat{\mathcal{L}}(\theta) = \mathbb{E}_{s \sim \hat{p}}[\ell(s, \theta)] + \lambda R(\theta)$, an empirical estimate of $\mathcal{L}(\theta)$. The parameters of the model are thus updated during training by a rule of the form $\theta_{t+1} = \theta_t - \epsilon_t f(G_t, m_t, V_t)$, with $G_t = \nabla_{\theta_t} \mathcal{L}(\theta) = \frac{\partial \mathcal{L}(\theta)}{\partial \theta}\big|_{\theta = \theta_t} \in \mathbb{R}^d$ the gradient of the loss function at $\theta_t$, $\epsilon_t$ the step size, $f(G_t, m_t, V_t)$ the parameter update at time $t$ given the optimization algorithm of choice, $m_t = g(G_1, \dots, G_t)$ the momentum and $V_t = h(G_1, \dots, G_t)$ the velocity. Generic gradient methods (SGD) use $f(G_t, m_t, V_t) = G_t$, and generic adaptive gradient methods (AdaGrad, RMSProp, Adam, ...) use $f(G_t, m_t, V_t) = \frac{m_t}{\sqrt{V_t} + \epsilon}$. By abuse of notation, we will sometimes write $G_t$ instead of $f(G_t, m_t, V_t)$.
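
To make the notation concrete, here is a minimal sketch on an assumed toy quadratic loss (none of it is code from this post) comparing the plain gradient update $f(G_t, m_t, V_t) = G_t$ with an Adam-like adaptive update $f(G_t, m_t, V_t) = \frac{m_t}{\sqrt{V_t} + \epsilon}$.

```python
# Minimal sketch (assumed toy quadratic loss, not from this post) of the update rule
# theta_{t+1} = theta_t - eps_t * f(G_t, m_t, V_t) for SGD and an Adam-like method.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = A @ A.T + np.eye(5)   # SPD matrix -> convex toy loss
b = rng.standard_normal(5)

def loss(theta):            # L(theta) = 1/2 theta^T A theta - b^T theta
    return 0.5 * theta @ A @ theta - b @ theta

def grad(theta):            # G = A theta - b
    return A @ theta - b

def train(adaptive=False, steps=200, eps=0.02, beta1=0.9, beta2=0.999, adam_eps=1e-8):
    theta = np.zeros(5)
    m = np.zeros(5)         # momentum  m_t = g(G_1, ..., G_t)
    v = np.zeros(5)         # velocity  V_t = h(G_1, ..., G_t)
    for _ in range(steps):
        G = grad(theta)
        m = beta1 * m + (1 - beta1) * G
        v = beta2 * v + (1 - beta2) * G**2
        update = m / (np.sqrt(v) + adam_eps) if adaptive else G   # f(G_t, m_t, V_t)
        theta = theta - eps * update
    return loss(theta)

print("SGD      :", train(adaptive=False))
print("Adam-like:", train(adaptive=True))
```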

While neural loss functions live in a very high-dimensional space, visualizations are only possible using low-dimensional 1D (line) or 2D (surface) plots. Several methods exist for closing this dimensionality gap. The idea is to choose a linear subspace that maximally preserves the optimization trajectory’s shape, in order to observe patterns that are meaningful in the full parameter space.

Definition: A loss landscape visualization technique is said to be meaningful if it preserves the curvature of the original loss function. If this is the case, any critical point (minima, maxima, saddle points, ...) or critical surface (valleys, plateaus, ravines and other flat regions) of the original loss function is present in its landscape visualization. It is not strictly necessary that these critical points/zones appear exactly in the visualization: we can simply define indicators that allow us to detect, on the landscape, the presence of these points/zones in the real function.

Let $\mathcal{L}_t = \mathcal{L}(\theta_t)$ and $H_t = \frac{\partial^2 \mathcal{L}(\theta)}{\partial \theta^2}\big|_{\theta = \theta_t} \in \mathbb{R}^{d \times d}$ the local Hessian matrix of the loss at $\theta_t$. Near $\theta_t$,

$$\mathcal{L}(\theta) \approx \mathcal{L}_t + (\theta - \theta_t)^T G_t + \frac{1}{2} (\theta - \theta_t)^T H_t (\theta - \theta_t) + o(\|\theta - \theta_t\|^2)$$

That is, for a very small step size $\epsilon_t$,

$$\mathcal{L}_{t+1} = \mathcal{L}(\theta_{t+1}) = \mathcal{L}(\theta_t - \epsilon_t G_t) \approx \mathcal{L}_t - \epsilon_t G_t^T G_t + \frac{1}{2} \epsilon_t^2 G_t^T H_t G_t + o(\epsilon_t^2 \|G_t\|^2)$$

Definition: When $H_t$ has some large positive eigenvalues (i.e. high-curvature directions) and some eigenvalues close to $0$ (i.e. low-curvature directions), gradient descent bounces back and forth in high-curvature directions and makes slow progress in low-curvature directions. We will say that $H_t$ is ill-conditioned in this case, or that the optimization problem has an ill-conditioned curvature.

Suppose the loss function near $\theta_t$ has a high condition number, i.e. very small steps cause an increase in the cost function (for example if $\theta_t$ is a very sharp minimum surrounded by high-loss regions). If $-\epsilon_t G_t^T G_t + \frac{1}{2} \epsilon_t^2 G_t^T H_t G_t > 0$, the optimization problem also becomes ill-conditioned. During training, we can monitor the terms $G_t^T G_t$ (the square of the gradient norm) and $G_t^T H_t G_t$. If the gradient norm doesn't shrink but $G_t^T H_t G_t$ grows by orders of magnitude, learning can become very slow despite a strong gradient.
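
These two quantities can be tracked without materializing $H_t$, via a Hessian-vector product. The sketch below is a hedged illustration with an assumed toy model, data and names (not code from this post), using PyTorch autograd.

```python
# Sketch of monitoring G^T G and G^T H G during training without forming the Hessian,
# using a Hessian-vector product (toy model and data, assumed for illustration).
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)

def grad_norm_and_curvature(model, x, y):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = torch.nn.functional.mse_loss(model(x), y)
    # G_t, kept in the graph so we can differentiate again
    grads = torch.autograd.grad(loss, params, create_graph=True)
    gTg = sum((g * g).sum() for g in grads)                # G_t^T G_t
    # H_t v via the Hessian-vector product d(G^T v)/dtheta, with v = G_t (detached)
    gv = sum((g * g.detach()).sum() for g in grads)
    Hg = torch.autograd.grad(gv, params)
    gHg = sum((g.detach() * h).sum() for g, h in zip(grads, Hg))   # G_t^T H_t G_t
    return gTg.item(), gHg.item()

print(grad_norm_and_curvature(model, x, y))
```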

Theorem: If the matrix $\frac{1}{2} \epsilon_t H_t - I_{d \times d}$ is positive-definite, $H_t$ is ill-conditioned.

Proof:

$$\forall x \in \mathbb{R}^d,\ \left\langle \epsilon_t \left( \frac{1}{2} \epsilon_t H_t - I_{d \times d} \right), x \right\rangle = -\epsilon_t x^T x + \frac{1}{2} \epsilon_t^2 x^T H_t x$$

In particular, taking $x = G_t$ gives $-\epsilon_t G_t^T G_t + \frac{1}{2} \epsilon_t^2 G_t^T H_t G_t > 0$, which is exactly the ill-conditioning criterion of the previous paragraph. $\square$

Definition: If $G_t = \vec{0}_{\vec{\Theta}}$, then $\theta_t$ is a critical/stationary point of $\mathcal{L}$, and the determinant $d_t$ of $H_t$ is equal to the Gaussian curvature of the loss surface considered as a manifold. The eigenvalues of $H_t$ are the principal curvatures of the loss function at $\theta_t$, and the eigenvectors are the principal directions of curvature.

- if $d_t > 0$, $\theta_t$ is a local:
    - maximum of $\mathcal{L}$ if $H_t$ is negative definite (all its eigenvalues are negative).
    - minimum of $\mathcal{L}$ if $H_t$ is positive definite (all its eigenvalues are positive). A local minimum can be very flat (i.e. there is a large enough neighborhood of $\theta_t$ that contains only local minima) or sharp (the loss function near $\theta_t$ has a high condition number, i.e. a very small perturbation of $\theta_t$ causes an increase in $\mathcal{L}$).
- if $d_t < 0$ (some eigenvalues are positive and others are negative), $\theta_t$ is a saddle point of $\mathcal{L}$.
- if $d_t = 0$ (there is at least one zero eigenvalue, i.e. $H_t$ is singular), we can't conclude, and the point $\theta_t$ could be a minimum, a maximum or a saddle point.
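
This classification is easy to check numerically. The sketch below (toy Hessians, assumed purely for illustration) classifies a critical point from the eigenvalues of $H_t$.

```python
# Minimal numeric illustration (assumed toy functions, not from the post): classify a
# critical point of L from the eigenvalues of its Hessian.
import numpy as np

def classify(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)            # H is symmetric
    if np.any(np.abs(eig) < tol):
        return "degenerate: test is inconclusive"
    if np.all(eig > 0):
        return "local minimum (H positive definite)"
    if np.all(eig < 0):
        return "local maximum (H negative definite)"
    return "saddle point (mixed signs)"

# L(x, y) = x^2 - y^2 has a critical point at (0, 0) with Hessian diag(2, -2): a saddle.
print(classify(np.diag([2.0, -2.0])))

# L(x, y) = x^2 + 0.01 y^2: a minimum, but with ill-conditioned curvature (ratio 100).
H_min = np.diag([2.0, 0.02])
eig = np.abs(np.linalg.eigvalsh(H_min))
print(classify(H_min), "| condition number:", eig.max() / eig.min())
```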

If the Hessian matrix of $\mathcal{L}$ is positive semi-definite at every point of $\Theta$, $\mathcal{L}$ is convex and any stationary point $\theta_t$ is a global minimum. If it is instead negative semi-definite at every point of $\Theta$, $\mathcal{L}$ is concave and any stationary point $\theta_t$ is a global maximum.

Many deep models are guaranteed to have an extremely large number of local minima. It has been proven that this is not necessarily a problem: in fact, the majority of local minima are of good quality (almost all are equivalent in cost to the global minimum). The biggest obstacle to the optimization of $\mathcal{L}$ remains the presence of saddle points. In low dimensions, local minima are common, while in high dimensions, local minima are rare and saddle points are common. Most of the training time is spent traversing flat valleys of the Hessian matrix or circumnavigating tall mountains via an indirect arcing path, and the trajectory through such flat valleys and around such mountains may be long, resulting in excessive training time [10].

Figures (not shown): plateau (flat region), long narrow ravines, cliff.

1. Loss landscape

1.1) Line

One of the most commonly used methods, presented below, is to consider a point of interest near which we want to study the behavior of the loss, choose a direction compatible with the parameter space (either randomly or deterministically), and perturb the parameters along this direction. The loss is then visualized as a function of the intensity of the perturbation.

- Linear Interpolation [1]

Given two parameters $(\theta_0, \theta_1) \in \Theta^2$, typically the initial and the final parameters (at the end of optimization), we define the function:

$$\theta : A \subseteq \mathbb{R} \to \Theta,\quad \alpha \mapsto \theta(\alpha) = \theta_0 + \alpha \vec{\delta}, \qquad \vec{\delta} = \vec{\theta_0 \theta_1} = \theta_1 - \theta_0$$

Then we plot the loss as a function of $A$:

$$f : A \to \mathbb{R}^+,\quad \alpha \mapsto f(\alpha) = \mathcal{L}(\theta(\alpha))$$

Depending on the problem, we can set $A = \mathbb{R}$ or $A = [-a, a]$ for $a \in \mathbb{R}^+$, or restrict ourselves to the segment $[\theta_0, \theta_1]$ with $A = [0, 1]$.
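
A minimal sketch of this procedure is given below; `model`, the two checkpoints and `data` are placeholders assumed for illustration (none of these names come from the post). It evaluates $f(\alpha) = \mathcal{L}(\theta_0 + \alpha \vec{\delta})$ on a grid of $\alpha$ values.

```python
# Hedged sketch of a 1D linear-interpolation loss curve L(theta_0 + alpha * (theta_1 - theta_0)).
import copy
import torch

def interpolation_curve(model, state_0, state_1, loss_fn, data, alphas):
    """Evaluate f(alpha) = L((1 - alpha) * theta_0 + alpha * theta_1) on a batch."""
    x, y = data
    probe = copy.deepcopy(model)
    losses = []
    with torch.no_grad():
        for alpha in alphas:
            a = float(alpha)
            blended = {k: (1 - a) * state_0[k] + a * state_1[k] for k in state_0}
            probe.load_state_dict(blended)
            losses.append(loss_fn(probe(x), y).item())
    return losses

# Toy usage with a random "final" checkpoint standing in for the trained weights.
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
state_0 = copy.deepcopy(model.state_dict())                                  # theta_0
state_1 = {k: v + 0.5 * torch.randn_like(v) for k, v in state_0.items()}     # theta_1 (placeholder)
data = (torch.randn(128, 10), torch.randn(128, 1))
alphas = [i / 10 - 2.0 for i in range(41)]                                   # A = [-2, 2]
print(interpolation_curve(model, state_0, state_1, torch.nn.functional.mse_loss, data, alphas))
```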


The linear interpolation curves for a decoder-only Transformer model trained on modular addition, with A = [-2, 2]. Left: at the minimizer. Right: when the model starts to converge. Image from me.

- Random Directions [1]

In this case, $\theta(\alpha) = \theta + \alpha \vec{\delta}$ $\forall \alpha \in A$, where $\theta$ is a point carefully chosen in $\Theta$ (generally the minimizer of $\mathcal{L}$) and $\vec{\delta}$ a direction vector carefully chosen (generally randomly) in $\vec{\Theta}$.

- Bilinear Interpolation [2]

Given four parameters $\theta_0, \theta_1, \theta_2, \theta_3$, we choose $\beta \in [0, 1]$ and define the interpolation as $\theta(\alpha) = \beta \phi(\alpha) + (1 - \beta) \varphi(\alpha)$ $\forall \alpha \in A$, where $\phi(\alpha) = \theta_0 + \alpha \vec{\theta_0 \theta_1}$ and $\varphi(\alpha) = \theta_2 + \alpha \vec{\theta_2 \theta_3}$.

- Barycentric Interpolation [2]

Given three parameters $\theta_0, \theta_1, \theta_2$, and $\beta \in [0, 1]$; let $\vec{d_1} = \vec{\theta_0 \theta_1}$ and $\vec{d_2} = \vec{\theta_0 \theta_2}$. The interpolation is then $\theta(\alpha) = \beta \phi(\alpha) + (1 - \beta) \varphi(\alpha)$ $\forall \alpha \in A$, where $\phi(\alpha) = \theta_0 + \alpha \vec{d_1}$ and $\varphi(\alpha) = \theta_0 + \alpha \vec{d_2}$.
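
For concreteness, here is a small sketch of the three interpolation schemes above applied to flattened parameter vectors (placeholder vectors, assumed purely for illustration).

```python
# Small numpy sketch of the linear, bilinear and barycentric interpolation schemes.
import numpy as np

def linear(theta0, theta1, alpha):
    return theta0 + alpha * (theta1 - theta0)

def bilinear(theta0, theta1, theta2, theta3, alpha, beta):
    phi = theta0 + alpha * (theta1 - theta0)
    psi = theta2 + alpha * (theta3 - theta2)
    return beta * phi + (1 - beta) * psi

def barycentric(theta0, theta1, theta2, alpha, beta):
    phi = theta0 + alpha * (theta1 - theta0)
    psi = theta0 + alpha * (theta2 - theta0)
    return beta * phi + (1 - beta) * psi

d = 6
t0, t1, t2, t3 = (np.random.default_rng(k).standard_normal(d) for k in range(4))
print(linear(t0, t1, 0.5))
print(bilinear(t0, t1, t2, t3, 0.5, 0.3))
print(barycentric(t0, t1, t2, 0.5, 0.3))
```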

1.2) Surface

For surfaces, two directions are chosen instead of one, and the perturbation is made along both directions.

- Random Directions [1]

We define the function:

$$\theta : A \times B \subseteq \mathbb{R}^2 \to \Theta,\quad (\alpha, \beta) \mapsto \theta(\alpha, \beta) = \theta + \alpha \vec{\delta} + \beta \vec{\eta}$$

where $\theta$ is a point carefully chosen in $\Theta$ (generally the minimizer of $\mathcal{L}$) and $\vec{\delta}, \vec{\eta}$ two direction vectors carefully chosen (generally randomly) in $\vec{\Theta}$.

It has been proven that in high-dimensional spaces, randomly chosen vectors tend to be orthogonal to each other. For two independent unit vectors $u_d, v_d \in \mathbb{R}^d$, $\lim_{d \to +\infty} P\big(|\langle u_d, v_d \rangle| < \epsilon\big) = 1$ $\forall \epsilon > 0$ [4]. Also, the expected cosine similarity between Gaussian random vectors in $d$ dimensions is roughly $\sqrt{\frac{2}{\pi d}}$ [3].

The loss is then plotted (contour plot, 3D surface plot, ...) as a function of $A \times B$:

$$f : A \times B \to \mathbb{R}^+,\quad (\alpha, \beta) \mapsto f(\alpha, \beta) = \mathcal{L}(\theta(\alpha, \beta))$$
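
A hedged sketch of the 2D version follows, with placeholder model/data names and plain (unnormalized) Gaussian directions; the normalization schemes are discussed in section 1.3.

```python
# Hedged sketch of a 2D random-direction loss surface f(alpha, beta) = L(theta + alpha*delta + beta*eta).
import copy
import torch

def loss_surface(model, loss_fn, data, alphas, betas, seed=0):
    """Evaluate the loss on a (len(alphas), len(betas)) grid around the current weights."""
    x, y = data
    g = torch.Generator().manual_seed(seed)
    center = copy.deepcopy(model.state_dict())                       # the chosen point theta
    delta = {k: torch.randn(v.shape, generator=g) for k, v in center.items()}
    eta   = {k: torch.randn(v.shape, generator=g) for k, v in center.items()}
    probe = copy.deepcopy(model)
    grid = torch.zeros(len(alphas), len(betas))
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(betas):
                probe.load_state_dict({k: center[k] + a * delta[k] + b * eta[k] for k in center})
                grid[i, j] = loss_fn(probe(x), y)
    return grid   # can be fed to matplotlib's contour() / plot_surface()

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
data = (torch.randn(128, 10), torch.randn(128, 1))
alphas = betas = [i / 5 - 1.0 for i in range(11)]                    # A = B = [-1, 1]
print(loss_surface(model, torch.nn.functional.mse_loss, data, alphas, betas).shape)
```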


The loss surfaces of ResNet-56 with/without skip connections, using the filter normalization scheme (see below). Image from [3]


2D visualization of the loss surface of ResNet and ResNet-noshort (NS) with different depth. Image from [3]

- Interpolation

Given three parameters $\theta_0, \theta_1, \theta_2$: $\vec{\delta} = \theta_1 - \theta_0$ and $\vec{\eta} = \theta_2 - \theta_0$.

In this kind of case, $\theta_0$ is usually the initial parameter, and $\theta_1$ and $\theta_2$ two parameters obtained after optimization with two different methods (for example with two different optimizers). The goal here is to see the difference between the paths taken by the two methods (a path can be plotted between $\theta_1$ and $\theta_2$ using one of the methods described below for plotting optimization paths).


Visualization of the loss surface with and without batch-normalization. Image from [2]

- PCA directions

Here PCA is applied to the matrix $M = [\theta_t]_{1 \le t \le T}$ or $M = [\theta_t - \theta_T]_{1 \le t \le T-1}$ (where $\theta_t$ denotes the model parameters at epoch $t$, and $T$ is the total number of training epochs), and the two most explanatory directions are used as $\vec{\delta}$ and $\vec{\eta}$. That is:

- first, compute the $T \times T$ (resp. $(T-1) \times (T-1)$) covariance matrix $\Sigma$ of $M$: $M_t = \theta_t$ (resp. $\theta_t - \theta_T$), $\bar{M} = \frac{1}{T \text{ (resp. } T-1)} \sum_t M_t$, $\Sigma_{i,j} = \frac{1}{T \text{ (resp. } T-1)} \langle M_i - \bar{M}, M_j - \bar{M} \rangle$.
- The eigensystem of $\Sigma$ is then found, providing a new set of basis vectors and a new coordinate system (i.e., a rotation of weight space).
- To reduce the dimensionality of the weight space to some value $d' < d$, the $d'$ largest eigenvalues and corresponding eigenvectors are kept and the remaining $d - d'$ discarded.

In practice, PCA is quite often performed on the correlation matrix rather than the covariance matrix. The main reason for this is that if the variables in question are measured in different units, it may be that the variances of some variables dominate the first few principal components, simply because of the units they are measured in (e.g. consider one variable measured in meters together with some other variables measured in millimeters). In the case of neural network weight data, this problem does not arise - the covariance matrix is therefore used for PCA [9].
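
The sketch below (placeholder snapshots and names, assumed for illustration) extracts two such directions from flattened parameter checkpoints; scikit-learn's `PCA` performs the centering and eigendecomposition internally, which yields the same directions as the covariance-based procedure described above.

```python
# Hedged sketch: extract two PCA directions delta, eta from parameter snapshots theta_1 ... theta_T.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d, T = 1000, 30
# Placeholder trajectory: rows stand in for flattened checkpoints theta_t saved each epoch.
thetas = np.cumsum(rng.standard_normal((T, d)) * 0.01, axis=0)

M = thetas - thetas[-1]                # rows theta_t - theta_T (the last row is zero)
pca = PCA(n_components=2)              # centering + eigendecomposition of the covariance
coords = pca.fit_transform(M)          # coordinates of each epoch in the (delta, eta) plane
delta, eta = pca.components_           # the two most explanatory directions, each in R^d
print("explained variance ratios:", pca.explained_variance_ratio_)
```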

1.3) Normalization

[3] : «While the random directions approach to plotting is simple, it fails to capture the intrinsic geometry of loss surfaces, and cannot be used to compare the geometry of two different minimizers or two different networks. This is because of the scale invariance in network weights. When ReLU non-linearities are used, the network remains unchanged if we (for example) multiply the weights in one layer of a network by 10, and divide the next layer by 10. This invariance is even more prominent when batch normalization is used. In this case, the size (i.e., norm) of a filter is irrelevant because the output of each layer is re-scaled during batch normalization. For this reason, a network’s behavior remains unchanged if we re-scale the weights. Note, this scale invariance applies only to rectified networks.
Scale invariance prevents us from making meaningful comparisons between plots, unless special precautions are taken. A neural network with large weights may appear to have a smooth and slowly varying loss function; perturbing the weights by one unit will have very little effect on network performance if the weights live on a scale much larger than one. However, if the weights are much smaller than one, then that same unit perturbation may have a catastrophic effect, making the loss function appear quite sensitive to weight perturbations. Keep in mind that neural nets are scale invariant; if the small-parameter and large-parameter networks in this example are equivalent (because one is simply a rescaling of the other), then any apparent differences in the loss function are merely an artifact of scale invariance »

For these reasons, it is preferable to scale the directions added to the network so that they have the same scale as the network weights. In general, directions (whether deterministic or random) are normalized before they are used. As described above, we begin by producing $p \in \{1, 2\}$ random Gaussian direction vectors $\{u_k\}_{1 \le k \le p}$ with dimensions compatible with the model parameters $\theta$ (i.e. in $\vec{\Theta}$): $p = 1$ for a line and $p = 2$ for a surface. Also, let $\theta$ be a point carefully chosen in $\Theta$ (for example the minimizer of $\mathcal{L}(\theta)$).

We will use the following two-layer fully connected (FC) neural network as a running example in the next sections. For $x \in \mathbb{R}^n$, let $z = \sigma(a + Wx) \in \mathbb{R}^m$ with $a \in \mathbb{R}^m$ and $W = [w_1, \dots, w_m]^T \in \mathbb{R}^{m \times n}$, $w_i \in \mathbb{R}^n$ $\forall i \in I_m$. Then $z_i = \sigma(\langle x, w_i \rangle + a_i)$ $\forall i \in I_m$. Let also $y = c + \langle z, b \rangle \in \mathbb{R}$ with $c \in \mathbb{R}$ and $b \in \mathbb{R}^m$. We fix $c = 0$ and $b_i = \frac{1}{m}$ $\forall i \in I_m$ for the sake of simplicity (this type of network is known as a Soft Committee Machine). Then $\theta = \{W, a\}$, and a direction $\vec{\theta} \in \vec{\Theta}$ has the same structure, $\vec{\theta} = \{\vec{W}, \vec{a}\}$.

- No Normalization [3]

$\{u_k\}_{1 \le k \le p}$ are linearly combined with the weights $\theta$ without any processing, as described in the sections above.

- Model-wise Normalization

The directions $\{u_k\}_{1 \le k \le p}$ are normalized at the model level: $u_k \leftarrow \frac{u_k}{\|u_k\|} \|\theta\|$ $\forall k \in I_p$.

- Layer-wise Normalization [2]

The directions $\{u_k\}_{1 \le k \le p}$ are normalized at the layer level so that the direction for each layer has the same norm as the corresponding layer of $\theta$: $u_k^i \leftarrow \frac{u_k^i}{\|u_k^i\|} \|\theta_i\|$ $\forall k \in I_p$, for each layer $i$ with parameters $\theta_i$.

For our example, since we have the parameters for only one layer of our model, this normalization corresponds to the model-wise normalization. If we include the parameters of the second layer ($b$ and $c$), this normalization becomes:

- $[W_k, a_k] \leftarrow \frac{[W_k, a_k]}{\|[W_k, a_k]\|} \|[W, a]\|$ $\forall k \in I_p$
- $[b_k, c_k] \leftarrow \frac{[b_k, c_k]}{\|[b_k, c_k]\|} \|[b, c]\|$ $\forall k \in I_p$

with $u_k = \{W_k, a_k, b_k, c_k\}$ and $\theta = \{W, a, b, c\}$.

- Weight-wise Normalization

For our example, let $u_k = \{W_k, a_k\}$. Then $W_k \leftarrow \frac{W_k}{\|W_k\|} \|W\|$ and $a_k \leftarrow \frac{a_k}{\|a_k\|} \|a\|$ for all $k \in I_p$.

- Filter-wise Normalization [3]

To remove the scaling effect mentioned above, the authors of [3] plot the loss functions using filter-wise normalized directions. To obtain such directions for a network with parameters $\theta$, they normalize each filter in each direction $u_k$ to have the same norm as the corresponding filter in $\theta$. In other words, they make the replacement $u_k^{i,j} \leftarrow \frac{u_k^{i,j}}{\|u_k^{i,j}\|} \|\theta_{i,j}\|$, where $u_k^{i,j}$ represents the $j$-th filter (not the $j$-th weight) of the $i$-th layer of $u_k$. This is not limited to convolutional (Conv) layers but also applies to fully connected (FC) layers: an FC layer is "equivalent" to a Conv layer with a $1 \times 1$ output feature map, and the filter corresponds to the weights that generate one neuron [3].

For our example, we will have $w_i^k \leftarrow \frac{w_i^k}{\|w_i^k\|} \|w_i\|$ $\forall i \in I_m$ and $a_j^k \leftarrow \frac{a_j^k}{|a_j^k|} |a_j|$ $\forall j \in I_m$, for all $k \in I_p$.

2. Optimization Paths [3]

Above, we showed how to visualize the loss landscape of high-dimensional neural network parameters in low dimension. But how do we visualize the trajectory between two weight values encountered by the neural network during optimization? Random directions fail to capture the variation in optimization trajectories, leading to the misleading appearance of a straight-line path [3]. To capture variation in trajectories, we need to use non-random, carefully chosen directions.

Let $M = [\theta_t - \theta_T]_{1 \le t \le T-1}$, where $\theta_t$ denotes the model parameters at epoch $t$ and $T$ is the total number of training epochs. We can apply PCA to the matrix $M$, select the two most explanatory directions and project the optimization trajectory onto these two directions. This means that if $\vec{u}$ and $\vec{v}$ are the two direction vectors (in $\vec{\Theta}$) chosen after PCA, we display on the loss surface the trajectory linking the points $\left( \frac{\langle \theta_t, \vec{u} \rangle}{\|\vec{u}\|}, \frac{\langle \theta_t, \vec{v} \rangle}{\|\vec{v}\|} \right)_{1 \le t \le T}$. There is no guarantee that the model followed this path during optimization. In fact, it is highly unlikely to have done so, unless the optimization problem is convex [5].
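
A minimal sketch of this projection, with a placeholder trajectory standing in for saved checkpoints (assumed purely for illustration):

```python
# Hedged sketch of projecting a training trajectory onto its two leading PCA directions.
import numpy as np

rng = np.random.default_rng(0)
T, d = 30, 1000
thetas = np.cumsum(rng.standard_normal((T, d)) * 0.01, axis=0)    # theta_1, ..., theta_T (placeholders)

M = thetas[:-1] - thetas[-1]                 # rows theta_t - theta_T, t = 1..T-1
M_centered = M - M.mean(axis=0)
# Leading right singular vectors of the centered matrix = PCA directions u, v in parameter space.
_, _, Vt = np.linalg.svd(M_centered, full_matrices=False)
u, v = Vt[0], Vt[1]                          # unit-norm directions

# Coordinates of each checkpoint in the (u, v) plane; since ||u|| = ||v|| = 1,
# these are exactly (<theta_t, u>/||u||, <theta_t, v>/||v||).
path = np.stack([thetas @ u, thetas @ v], axis=1)
print(path.shape)                            # (T, 2) -> overlay on the 2D loss surface f(alpha, beta)
```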

Projected learning trajectories use normalized PCA directions for VGG-9. The left plot in each subfigure uses batch size 128, and the right one uses batch size 8192. WD = Weight decay. Image from [3]

3. Apparent and real convexity

[3] : «We are viewing the loss surface under a dramatic dimensionality reduction, and we need to be careful how we interpret these plots. One way to measure the level of convexity in a loss function is to compute the principle curvatures, which are simply eigenvalues of the Hessian. A truly convex function has non-negative curvatures (a positive semi-definite Hessian), while a non-convex function has negative curvatures. It can be shown that the principle curvatures of a dimensionality reduced plot (with random Gaussian directions) are weighted averages of the principle curvatures of the full-dimensional surface (where the weights are Chi-square random variables).
This has several consequences. First of all, if non-convexity is present in the dimensionality reduced plot, then non-convexity must be present in the full-dimensional surface as well. However, apparent convexity in the low-dimensional surface does not mean the high-dimensional function is truly convex. Rather it means that the positive curvatures are dominant (more formally, the mean curvature, or average eigenvalue, is positive).»

The authors of [3] calculate the ratio $\frac{\lambda_{\min}(q)}{\lambda_{\max}(q)}$ at each point $q$ ($\alpha$ for a line, $(\alpha, \beta)$ for a surface) of the loss line/surface, where $\lambda_{\min}(q)$ and $\lambda_{\max}(q)$ are respectively the minimum and maximum eigenvalues of the Hessian of $\mathcal{L}(\theta)$ at $\theta(q)$. They observe that the convex-looking regions in the surface/line plots correspond to regions with insignificant negative eigenvalues (i.e., there are no major non-convex features that the plot missed), while chaotic regions contain large negative curvatures; and that for convex-looking surfaces, the negative eigenvalues remain extremely small (less than 1% of the size of the positive curvatures) over a large region of the plot.

The inverse ratio $\frac{\lambda_{\max}(q)}{\lambda_{\min}(q)}$ is the condition number of the Hessian of $\mathcal{L}(\theta)$ at $\theta(q)$ (when the Hessian is positive definite). Larger condition numbers imply slower convergence of gradient descent.
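
For a model small enough that the full Hessian fits in memory, the $\lambda_{\min}/\lambda_{\max}$ ratio can be evaluated directly along a 1D slice. The sketch below uses an assumed toy network (not the setup of [3]) and `torch.autograd.functional.hessian`.

```python
# Hedged sketch: min/max Hessian eigenvalue ratio at points theta(alpha) along a random line.
import torch

x, y = torch.randn(64, 3), torch.randn(64, 1)
W1, b1 = torch.randn(4, 3), torch.zeros(4)
W2, b2 = torch.randn(1, 4), torch.zeros(1)
theta0 = torch.cat([W1.flatten(), b1, W2.flatten(), b2])   # 21 parameters, flattened
delta = torch.randn_like(theta0)                           # direction of the 1D slice

def loss_at(theta):
    W1 = theta[:12].view(4, 3); b1 = theta[12:16]
    W2 = theta[16:20].view(1, 4); b2 = theta[20:]
    z = torch.tanh(x @ W1.T + b1)
    return torch.nn.functional.mse_loss(z @ W2.T + b2, y)

for alpha in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    theta = theta0 + alpha * delta
    H = torch.autograd.functional.hessian(loss_at, theta)  # full (21, 21) Hessian
    eig = torch.linalg.eigvalsh(H)
    ratio = (eig[0] / eig[-1]).item()
    print(f"alpha={alpha:+.2f}  lambda_min/lambda_max = {ratio:.4f}")
```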

Blue color indicates a more convex region (near-zero negative eigenvalues relative to the positive eigenvalues), while yellow indicates significant levels of negative curvature. Image from [3]



[6] uses the normalized local curvature along the trajectory of the optimizer, given by $\frac{\theta_t^T H_t \theta_t}{\|\theta_t\|^2}$, as a curvature measure.

Before discussing the convexity question, let us recall the following theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. For all $(x, y) \in \mathbb{R}^n \times \mathbb{R}^n$, let $g_{(x,y)} : \alpha \in [0, 1] \mapsto g_{(x,y)}(\alpha) = f(\alpha x + (1 - \alpha) y)$.

Theorem: $f$ convex $\iff$ $g_{(x,y)}$ convex $\forall (x, y) \in \mathbb{R}^n \times \mathbb{R}^n$.

Proof:

- ($\Leftarrow$) If $g_{(x,y)}$ is convex, then $\theta f(x) + (1 - \theta) f(y) = \theta g_{(x,y)}(1) + (1 - \theta) g_{(x,y)}(0) \ge g_{(x,y)}(\theta) = f(\theta x + (1 - \theta) y)$, so $f$ is convex.
- ($\Rightarrow$) If $f$ is convex, then $\theta g_{(x,y)}(\alpha_1) + (1 - \theta) g_{(x,y)}(\alpha_2) = \theta f(\alpha_1 x + (1 - \alpha_1) y) + (1 - \theta) f(\alpha_2 x + (1 - \alpha_2) y) \ge f\big((\theta \alpha_1 + (1 - \theta) \alpha_2) x + (1 - \theta \alpha_1 - (1 - \theta) \alpha_2) y\big) = g_{(x,y)}(\theta \alpha_1 + (1 - \theta) \alpha_2)$, so $g_{(x,y)}$ is convex.

Do you see what I'm getting at? Between 0 and 1, the landscape curves drawn above capture the curvature of $\mathcal{L}$ in the neighborhood of the point considered, but just in one direction.

In fact, near a given $\theta_t$,

$$\mathcal{L}(\theta) \approx \mathcal{L}_t + \langle G_t, \theta - \theta_t \rangle + \frac{1}{2} \langle H_t, \theta - \theta_t \rangle + o(\|\theta - \theta_t\|^2)$$

That is, for very small $\alpha$,

$$f(\alpha) = \mathcal{L}(\theta_t + \alpha \vec{\delta}) \approx \mathcal{L}_t + \alpha \langle G_t, \vec{\delta} \rangle + \frac{1}{2} \alpha^2 \langle H_t, \vec{\delta} \rangle + o(\alpha^2 \|\vec{\delta}\|^2)$$

For a unit-norm vector $\vec{\delta}$:

- $\langle G_t, \vec{\delta} \rangle = \vec{\delta}^T G_t$ is the $\vec{\delta}$-directional derivative of $\mathcal{L}$ at $\theta_t$. So if we want to choose a direction of maximum (positive and negative) variation, we can choose $\vec{\delta} \propto G_t$, for example $\vec{\delta} = \frac{\theta_{t+1} - \theta_t}{\|\theta_{t+1} - \theta_t\|_2}$, because $\theta_{t+1} - \theta_t = -\epsilon_t G_t$.
- $\langle H_t, \vec{\delta} \rangle = \vec{\delta}^T H_t \vec{\delta}$ is the second $\vec{\delta}$-directional derivative of $\mathcal{L}$ at $\theta_t$. Since the Hessian matrix encodes the curvature of the function in every direction, $\langle H_t, \vec{\delta} \rangle$ encodes it in the direction $\vec{\delta}$. Also, the maximum curvature of $\mathcal{L}$ at $\theta_t$ is given by the largest eigenvalue of $H_t$ and is in the direction of the corresponding eigenvector. The smallest curvature (the largest negative curvature) of $\mathcal{L}$ at $\theta_t$ is given by the smallest eigenvalue of $H_t$ and is in the direction of the corresponding eigenvector.

The first-order approximation of a function at a given point is a linear function that has exactly the same directional derivatives at that point. The second-order approximation of a function at a given point is a quadratic form that has exactly the same directional derivatives and curvature at that point.
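
A quick numerical sanity check of these statements on an assumed toy quadratic (not from the post): finite differences of $f(\alpha) = \mathcal{L}(\theta_t + \alpha \vec{\delta})$ recover $\langle G_t, \vec{\delta} \rangle$ and $\langle H_t, \vec{\delta} \rangle$.

```python
# Toy quadratic check: <G, delta> is the slope of f(alpha) = L(theta + alpha*delta) at 0,
# and <H, delta> is its curvature.
import numpy as np

rng = np.random.default_rng(0)
d = 8
A = rng.standard_normal((d, d)); A = 0.5 * (A + A.T)      # symmetric Hessian of the quadratic
b = rng.standard_normal(d)

def L(theta):                     # L(theta) = 1/2 theta^T A theta + b^T theta
    return 0.5 * theta @ A @ theta + b @ theta

theta_t = rng.standard_normal(d)
G = A @ theta_t + b               # gradient at theta_t
delta = rng.standard_normal(d); delta /= np.linalg.norm(delta)   # unit-norm direction

h = 1e-4
f = lambda a: L(theta_t + a * delta)
slope_fd = (f(h) - f(-h)) / (2 * h)            # finite-difference first derivative of f at 0
curv_fd = (f(h) - 2 * f(0) + f(-h)) / h**2     # finite-difference second derivative of f at 0

print(slope_fd, delta @ G)                      # both ~ <G_t, delta>
print(curv_fd, delta @ A @ delta)               # both ~ <H_t, delta>
print("max curvature =", np.linalg.eigvalsh(A)[-1], "(largest eigenvalue of H)")
```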

4. Other methods

4.1 PHATE [7]

4.2

5. Can we do better?

Yes. But how?

5.1 Andrews curve and Radar chart

An idea I'm exploring

Andrews plot or Andrews curve is a way to visualize structure in high-dimensional data. Let $u(\alpha) = \left( \frac{1}{\sqrt{2}},\ \sin(\alpha),\ \cos(\alpha),\ \sin(2\alpha),\ \cos(2\alpha),\ \dots \right) \in \mathbb{R}^d$ $\forall \alpha \in [-\pi, \pi]$. We can analyze the evolution of the structure of the parameters across epochs by visualizing the functions $\{f_t : \alpha \in [-\pi, \pi] \mapsto f_t(\alpha) = \langle \theta_t, u(\alpha) \rangle\}_{1 \le t \le T}$, where $\theta_t$ denotes the model parameters at epoch $t$ and $T$ is the total number of training epochs. Note that $\langle \theta, u(\alpha) \rangle$ is the projection of $\theta$ onto the vector $u(\alpha)$. If there is structure in $\{\theta_t\}_{1 \le t \le T}$, it may be visible in its Andrews curves [8].

The projection onto the circle results in the radar chart.
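
A small sketch of these Andrews curves for placeholder parameter snapshots (assumed toy trajectory, not real checkpoints):

```python
# Hedged sketch of Andrews curves f_t(alpha) = <theta_t, u(alpha)> for parameter snapshots.
import numpy as np
import matplotlib.pyplot as plt

def andrews_basis(alpha, d):
    """u(alpha) = (1/sqrt(2), sin(a), cos(a), sin(2a), cos(2a), ...) truncated to d components."""
    u = np.empty(d)
    u[0] = 1.0 / np.sqrt(2.0)
    for i in range(1, d):
        k = (i + 1) // 2
        u[i] = np.sin(k * alpha) if i % 2 == 1 else np.cos(k * alpha)
    return u

rng = np.random.default_rng(0)
T, d = 10, 50
thetas = np.cumsum(rng.standard_normal((T, d)) * 0.1, axis=0)   # stand-in for epoch checkpoints

alphas = np.linspace(-np.pi, np.pi, 200)
U = np.stack([andrews_basis(a, d) for a in alphas])              # (200, d)
curves = thetas @ U.T                                            # f_t(alpha) for every t, alpha

for t in range(T):
    plt.plot(alphas, curves[t], label=f"epoch {t + 1}")
plt.xlabel(r"$\alpha$"); plt.ylabel(r"$f_t(\alpha)$"); plt.legend(fontsize=6)
plt.show()
```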

5.2

To read

  • Optimization

  • Landscape :

    • Geometry of learning: Visualizing the performance of neural network supervised training methods
    • [1]
    • [2]
    • [3]
    • [7]
  • Paths

    • Visualization of learning in multilayer perceptron networks using principal component analysis
    • Stuck in a what? Adventures in weight space
    • Visualizing Deep Network Training Trajectories with PCA
  • Sharpness and flatness

    • Flat minima, March 1996

      • "flat minimum search" search algorithm vs conventional backprop, weight decay and optimal brain surgeon / optimal brain damage.
      • A flat minimun is a (large) region of connected acceptable minima, where eache weigth
        θ
        withing this region leads to almost identical net function.
    • On large-batch training for deep learning : Generalization gap and sharp minima

      • The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by a significant number of large positive eigenvalues in $H(\theta) = \nabla^2 \mathcal{L}(\theta)$, and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by numerous small eigenvalues of $H(\theta)$. We have observed that the loss landscape of deep neural networks is such that large-batch methods are attracted to regions with sharp minimizers and that, unlike small-batch methods, they are unable to escape the basins of attraction of these minimizers.
    • Sharp minima can generalize for deep nets

    • Exploring generalization in deep learning

    • Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data

  • Others

    • Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
    • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
    • The Loss Surfaces of Multilayer Networks
    • On the importance of momentum and initialization in deep learning
    • Exploring loss function topology with cyclical learning rates
    • Phasemax : Convex phase retrieval via basis pursuit
    • Taxonomizing local versus global structure in neural network loss landscapes
    • On the Loss Landscape of Adversarial Training : Identifying Challenges and How to Overcome Them
  • convergence to minimizers of strongly-convex functions and to stationary points for non-convex functions

    • Optimization Methods for Large-Scale Machine Learning
  • saddle-point avoidance

    • Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition
    • Gradient descent converges to minimizers
  • robustness to input data

    • Train faster, generalize better: Stability of stochastic gradient descent

References

[1] Qualitatively Characterizing Neural Network Optimization Problems, Ian J. Goodfellow, Oriol Vinyals & Andrew M. Saxe, 2015

[2] An empirical analysis of the optimization of deep network loss surfaces, Daniel Jiwoong Im & Michael Tao & Kristin Branson, 2017

[3] Visualizing the Loss Landscape of Neural Nets, Hao Li & Zheng Xu & Gavin Taylor & Christoph Studer & Tom Goldstein, 2018

[4] https://math.stackexchange.com/questions/995623/why-are-randomly-drawn-vectors-nearly-perpendicular-in-high-dimensions/995678

[5] https://github.com/marcellodebernardi/loss-landscapes/blob/master/loss_landscapes/main.py#L49

[6] The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon, Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, Joshua Susskind, 2022

[7] Visualizing High-Dimensional Trajectories on the Loss Landscape of ANNs, https://openreview.net/pdf?id=uhiF-dV99ir

[8] https://en.wikipedia.org/wiki/Andrews_plot

[9] Visualization of Learning in Multi-layer Perceptron Networks using PCA

[10] https://cedar.buffalo.edu/~srihari/CSE676/8.2 NNOptimization.pdf