Visualization of the loss landscape and optimization path of a neural network
===

**Author : Pascal Tikeng**

# Notations

- $\Theta \subseteq \mathbb{R}^d$ : parameters (weights) space, $|\Theta| = d$
- We will consider $\Theta$ as an affine space to which we associate a vector space $\vec{\Theta}$ : $\forall (\theta, \theta') \in \Theta^2, \ \vec{\theta\theta'} = \theta' - \theta \in \vec{\Theta}$. The elements of $\vec{\Theta}$ will be noted with an arrow on top of them.
- $I_n = \{1, \dots, n\} \ \forall n \in \mathbb{N}$
- $\| \cdot \|$ : Frobenius norm
- $\langle \cdot, \cdot, \cdot \rangle$ : bilinear product, $\langle u, A, v \rangle = u^TAv = \sum_{i}\sum_{j} u_ia_{ij}v_j$ for a matrix $A \in \mathbb{R}^{n \times m}$ and two vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$
- $\langle \cdot, \cdot \rangle$ :
    - scalar product : $\langle u, v \rangle = u^Tv = \sum_{i} u_iv_i$ for two vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^n$
    - quadratic product : $\langle A, u \rangle = \langle u, A, u \rangle = u^TAu = \sum_{i}\sum_{j} u_ia_{ij}u_j$ for a matrix $A \in \mathbb{R}^{n \times n}$ and a vector $u \in \mathbb{R}^n$

# Introduction

In deep learning, the loss function is generally defined from $\Theta$ to $\mathbb{R}_{+}$ as $L(\theta) = \mathbb{E}_{s \sim p} [\ell(s, \theta)] + \lambda R(\theta)$, where $\ell(s, \theta)$ measures how well the neural network with parameters $\theta$ predicts the label of a data sample $s$, $p$ is the data distribution and $R(\theta)$ the regularizer ($\lambda \in \mathbb{R}_{+}$). In practice, $p$ is unknown, and the empirical distribution of a given dataset $\mathcal{D}$ is generally used, $\hat{p}(s) = \frac{|\{s' \in \mathcal{D}, \ s' = s\}|}{|\mathcal{D}|}$. Then $L(\theta) \approx \hat{L}(\theta) = \mathbb{E}_{s \sim \hat{p}} [\ell(s, \theta)] + \lambda R(\theta)$, an empirical estimate of $L(\theta)$.

The parameters of the model are updated during training by a rule of the form $\theta_{t+1} = \theta_t - \epsilon_t f(G_t, m_t, V_t)$, where $\epsilon_t$ is the step size, $G_t = \nabla_{\theta_t}L(\theta) = \frac{\partial L(\theta)}{\partial \theta} \vert_{\theta=\theta_t} \in \mathbb{R}^{d}$ is the gradient of the loss function at $\theta_t$, $m_t = g(G_1, \dots, G_t)$ the momentum and $V_t = h(G_1, \dots, G_t)$ the velocity; $f(G_t, m_t, V_t)$ is the parameter update at time $t$ for the optimization algorithm of choice. Generic gradient methods (SGD, ...) use $f(G_t, m_t, V_t) = G_t$, and generic adaptive gradient methods (AdaGrad, RMSProp, Adam, ...) use $f(G_t, m_t, V_t) = \frac{m_t}{\sqrt{V_t}+\epsilon}$. By abuse of notation, we will sometimes write $G_t$ instead of $f(G_t, m_t, V_t)$.

While neural loss functions live in a very high-dimensional space, visualizations are only possible using low-dimensional 1D (line) or 2D (surface) plots. Several methods exist for closing this dimensionality gap. The idea is to choose a linear subspace that maximally preserves the optimization trajectory's shape, in order to observe patterns that are meaningful in the full parameter space.

**Definition**: A loss landscape visualization technique is said to be **meaningful** if it preserves the curvature of the original loss function. If this is the case, any critical point (minima, maxima, saddle points, ...) or critical surface (valleys, plateaus, ravines, ... and other flat regions) of the original loss function is present in its landscape visualization.
The goal is not necessarily to make these critical points/regions appear exactly in the visualization: it is enough to define indicators that allow us to detect, from the landscape, the presence of these points/regions in the true loss function.

Let $L_t = L(\theta_t)$ and $\mathcal{H}_t = \frac{\partial^2 L(\theta)}{\partial \theta^2} \vert_{\theta=\theta_t} \in \mathbb{R}^{d \times d}$ the local Hessian matrix of the loss at $\theta_t$. Near $\theta_t$, $L(\theta) \approx L_t + (\theta - \theta_t)^T G_t + \frac{1}{2} (\theta - \theta_t)^T \mathcal{H}_t (\theta - \theta_t) + o(\|\theta - \theta_t\|^2)$. That is, for a very small step size $\epsilon_t$, $L_{t+1} = L(\theta_{t+1}) = L(\theta_t - \epsilon_t G_t) \approx L_t - \epsilon_t G_t^T G_t + \frac{1}{2} \epsilon_t^2 G_t^T \mathcal{H}_t G_t + o(\epsilon_t^2 \|G_t\|^2)$.

**Definition**: When $\mathcal{H}_t$ has some large positive eigenvalues (i.e. high-curvature directions) and some eigenvalues close to 0 (i.e. low-curvature directions), gradient descent bounces back and forth in high-curvature directions and makes slow progress in low-curvature directions. We will say that $\mathcal{H}_t$ is **ill-conditioned** in this case, or that the optimization problem has an ill-conditioned curvature.

Suppose the loss function near $\theta_t$ has a high condition number, i.e. very small steps cause an increase in the cost function (for example if $\theta_t$ is a very sharp minimum surrounded by high-loss regions). If $- \epsilon_t G_t^T G_t + \frac{1}{2} \epsilon_t^2 G_t^T \mathcal{H}_t G_t > 0$, the gradient step increases the loss and the optimization problem is likewise ill-conditioned. During training, we can monitor the terms $G_t^T G_t$ (the squared gradient norm) and $G_t^T \mathcal{H}_t G_t$ (see the code sketch below). If the gradient norm does not shrink but $G_t^T \mathcal{H}_t G_t$ grows by orders of magnitude, learning can become very slow despite a strong gradient.

**Theorem** : If the matrix $\frac{1}{2} \epsilon_t \mathcal{H}_t - \mathbb{I}_{d \times d}$ is positive-definite, $\mathcal{H}_t$ is ill-conditioned.

**Proof** : $\forall x \in \mathbb{R}^{d}, \ \epsilon_t \langle \frac{1}{2} \epsilon_t \mathcal{H}_t - \mathbb{I}_{d \times d}, x \rangle = - \epsilon_t x^T x + \frac{1}{2} \epsilon_t^2 x^T \mathcal{H}_t x$. In particular, taking $x = G_t$ gives exactly the second-order change in the loss derived above, so positive-definiteness implies that the gradient step increases the loss. $\square$

**Definition**: If $G_t = \vec{0}$, then $\theta_t$ is a **critical/stationary point** of $L$, and the determinant $d_t$ of $\mathcal{H}_t$ is equal to the Gaussian curvature of the loss surface considered as a manifold. The eigenvalues of $\mathcal{H}_t$ are the principal curvatures of the loss function at $\theta_t$, and the eigenvectors are the principal directions of curvature.
- if $d_t > 0$, $\theta_t$ is a **local** :
    - **maximum** of $L$ if $\mathcal{H}_t$ is negative definite (all its eigenvalues are negative).
    - **minimum** of $L$ if $\mathcal{H}_t$ is positive definite (all its eigenvalues are positive). A local minimum can be very **flat** (i.e. there is a large enough neighborhood of $\theta_t$ that contains only local minima) or **sharp** (the loss function near $\theta_t$ has a high condition number, i.e. very small perturbations of $\theta_t$ cause an increase in $L$).
- if $d_t < 0$ (some eigenvalues are positive and others are negative), $\theta_t$ is a **saddle point** of $L$.
- if $d_t = 0$ (there is at least one zero eigenvalue, i.e. $\mathcal{H}_t$ is singular), the test is inconclusive: $\theta_t$ could be a minimum, a maximum or a saddle point.
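The quantities above can be tracked cheaply with a Hessian-vector product, so the $d \times d$ Hessian is never materialized. Below is a minimal sketch, assuming PyTorch; the function name and the training objects in the usage comment are hypothetical.

```python
import torch

def grad_and_curvature_stats(loss, params):
    """Return (G^T G, G^T H G) for the scalar `loss` w.r.t. `params` (list of tensors)."""
    # G_t, kept in the autograd graph so we can differentiate through it
    grads = torch.autograd.grad(loss, params, create_graph=True)
    grad_sq_norm = sum((g * g).sum() for g in grads)                 # G_t^T G_t
    # Hessian-vector product H_t v with v = G_t (detached, treated as a constant)
    hvp = torch.autograd.grad(grads, params,
                              grad_outputs=[g.detach() for g in grads])
    g_H_g = sum((g.detach() * h).sum() for g, h in zip(grads, hvp))  # G_t^T H_t G_t
    return grad_sq_norm.item(), g_H_g.item()

# Usage (hypothetical model / criterion / batch):
#   loss = criterion(model(x), y)
#   gg, gHg = grad_and_curvature_stats(loss, list(model.parameters()))
```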
If the Hessian matrix of $L$ is positive semi-definite at every point of $\Theta$, $L$ is convex and any critical point $\theta_t$ is a **global minimum**. If it is instead negative semi-definite at every point of $\Theta$, $L$ is concave and any critical point $\theta_t$ is a **global maximum**.

Many deep models are guaranteed to have an extremely large number of local minima. It has been shown that this is not necessarily a problem: the majority of local minima are of good quality (almost all are equivalent in cost to the global minimum). The biggest obstacle to the optimization of $L$ remains the presence of saddle points. In low dimensions, local minima are more common, while in high dimensions local minima are rare and saddle points more common. Most of the training time is spent traversing flat valleys of the Hessian or circumnavigating tall mountains via indirect arcing paths, and the trajectory through such flat valleys and around such mountains may be long, resulting in excessive training time [10].

Plateau (flat region) | Long, narrow ravines | Cliff
:-------------------------:|:-------------------------:|:-------------------------:
![](https://i.imgur.com/3JI241h.png) | ![](https://i.imgur.com/6J1XqMJ.png) | ![](https://i.imgur.com/PB3LIRq.png)

# 1. Loss landscape

## 1.1) Line

One of the most used methods, presented below, is to consider a remarkable point near which we want to study the behavior of the loss, choose a direction compatible with the parameter space (either randomly or deterministically), and perturb the parameters along this direction. The loss is then visualized as a function of the intensity of the perturbation.

### - Linear Interpolation [1]

Given two parameters $(\theta_0, \theta_1) \in \Theta^2$, typically the initial and the final parameters (at the end of optimization), we define the function :
\begin{align*}
\theta \colon A \subseteq \mathbb{R} &\to \Theta \\
\alpha &\mapsto \theta(\alpha) = \theta_0 + \alpha \vec{\delta}, \ \vec{\delta} = \vec{\theta_0\theta_1} = \theta_1 - \theta_0
\end{align*}
Then we plot the loss as a function of $A$ :
\begin{align*}
f \colon A &\to \mathbb{R}_{+} \\
\alpha &\mapsto f(\alpha) = L(\theta(\alpha))
\end{align*}
Depending on the problem, we can set $A = \mathbb{R}$ or $A = [-a, a] \text{ for } a \in \mathbb{R}_+^*$, or restrict ourselves to the segment $[\theta_0, \theta_1]$ with $A = [0, 1]$.

<p float="left">
<img src="https://i.imgur.com/Uq0ryrp.png" width="350" />
<img src="https://i.imgur.com/C6e3mR6.png" width="350" />
</p>
<center>The linear interpolation curves for a decoder-only Transformer model trained on modular addition, A = [-2, 2]. Left) at the minimizer, Right) when the model starts to converge. Image by the author.</center>

### - Random Directions [1]

In this case, $\theta(\alpha) = \theta^* + \alpha \vec{\delta} \ \forall \alpha \in A$, where $\theta^*$ is a point carefully chosen in $\Theta$ (generally the minimizer of $L$) and $\vec{\delta}$ a direction vector carefully chosen (generally randomly) in $\vec{\Theta}$. A code sketch covering both this method and linear interpolation is given below.

### - Bilinear Interpolation [2]

Given four parameters $\theta_0, \theta_1, \theta_2, \theta_3$, we choose $\beta \in [0, 1]$ and define the interpolation as $\theta(\alpha) = \beta \phi(\alpha) + (1 - \beta) \varphi(\alpha) \ \forall \alpha \in A$ where $\phi(\alpha) = \theta_0 + \alpha \vec{\theta_0\theta_1}$ and $\varphi(\alpha) = \theta_2 + \alpha \vec{\theta_2\theta_3}$.
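Here is a minimal sketch of the 1D plots above, assuming PyTorch; `get_flat_params`, `set_flat_params`, `line_landscape` and the `eval_loss` callable are hypothetical names, not part of any particular library. It evaluates $f(\alpha) = L(\theta_0 + \alpha \vec{\delta})$, where $\vec{\delta}$ is either $\theta_1 - \theta_0$ (linear interpolation) or a random direction.

```python
import copy
import torch

def get_flat_params(model):
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def set_flat_params(model, flat):
    i = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat[i:i + n].view_as(p))
            i += n

def line_landscape(model, eval_loss, theta_0, delta, alphas):
    """f(alpha) = L(theta_0 + alpha * delta); `eval_loss(model) -> float` evaluates the loss."""
    model = copy.deepcopy(model)   # work on a copy, leave the original weights untouched
    losses = []
    for alpha in alphas:
        set_flat_params(model, theta_0 + alpha * delta)
        losses.append(eval_loss(model))
    return losses

# Linear interpolation between theta_0 (e.g. initial weights) and theta_1 (final weights):
#   delta = get_flat_params(final_model) - get_flat_params(initial_model)
# Random direction around a minimizer theta_star = get_flat_params(model):
#   delta = torch.randn_like(theta_star)
# alphas = torch.linspace(-2, 2, 101); plot alphas vs. line_landscape(...)
```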
### - Barycentric Interpolation [2]

Given three parameters $\theta_0, \theta_1, \theta_2$, and $\beta \in [0, 1]$; let $\vec{d_1} = \vec{\theta_0\theta_1}$ and $\vec{d_2} = \vec{\theta_0\theta_2}$. Then the interpolation is $\theta(\alpha) = \beta \phi(\alpha) + (1 - \beta) \varphi(\alpha) \ \forall \alpha \in A$ where $\phi(\alpha) = \theta_0 + \alpha \vec{d_1}$ and $\varphi(\alpha) = \theta_0 + \alpha \vec{d_2}$.

## 1.2) Surface

For surfaces, two directions are chosen instead of one, and the perturbation is made along both directions.

### - Random Directions [1]

We define the function :
\begin{align*}
\theta \colon A \times B \subseteq \mathbb{R}^2 &\to \Theta \\
(\alpha, \beta) &\mapsto \theta(\alpha, \beta) = \theta^* + \alpha \vec{\delta} + \beta \vec{\eta}
\end{align*}
where $\theta^*$ is a point carefully chosen in $\Theta$ (generally the minimizer of $L$) and $\vec{\delta}, \vec{\eta}$ two direction vectors carefully chosen (generally randomly) in $\vec{\Theta}$. It has been proven that in high-dimensional spaces, randomly chosen vectors tend to be orthogonal to each other: for two independent unit vectors $\vec{u}_d, \vec{v}_d \in \mathbb{R}^d$, $\lim\limits_{d \rightarrow +\infty} P (|\langle \vec{u}_d, \vec{v}_d \rangle| < \epsilon) = 1 \ \forall \epsilon > 0$ [4]. Also, the expected absolute cosine similarity between Gaussian random vectors in $d$ dimensions is roughly $\sqrt{\frac{2}{\pi d}}$ [3].

The loss is then plotted (contour plot, 3D plot, ...) as a function of $A \times B$ (see the code sketch below) :
\begin{align*}
f \colon A \times B &\to \mathbb{R}_{+} \\
(\alpha, \beta) &\mapsto f(\alpha, \beta) = L(\theta(\alpha, \beta))
\end{align*}

![](https://i.imgur.com/YrchSwx.png)
<center>The loss surfaces of ResNet-56 with/without skip connections, using the filter normalization scheme (see below). Image from [3]</center>

![](https://i.imgur.com/e1RkSDM.png)
<center>2D visualization of the loss surface of ResNet and ResNet-noshort (NS) with different depths. Image from [3]</center>

### - Interpolation

Given three parameters $\theta_0, \theta_1, \theta_2$ : $\vec{\delta} = \theta_1 - \theta_0$ and $\vec{\eta} = \theta_2 - \theta_0$. In this kind of case, $\theta_0$ is usually the initial parameter, and $\theta_1$ and $\theta_2$ two parameters obtained after optimization with two different methods (for example with two different optimizers). The goal here is to see the difference between the paths taken by the two methods (a path can be plotted between $\theta_1$ and $\theta_2$ using one of the methods described below for plotting optimization paths).

![](https://i.imgur.com/PDOrgX1.png)
<center>Visualization of the loss surface with and without batch-normalization. Image from [2]</center>
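The following is a minimal sketch of the 2D surface plot, assuming PyTorch; the helper names are hypothetical, and the normalization is a simplified variant of the layer-wise / filter-wise schemes described in Section 1.3 below.

```python
import copy
import torch

def make_normalized_direction(model, mode="filter"):
    """Random Gaussian direction shaped like the model parameters.
    mode="layer": each parameter tensor is rescaled to the norm of the corresponding tensor.
    mode="filter": each slice along dim 0 (one filter / one output neuron) is rescaled
    to the norm of the corresponding slice (simplified variant of the scheme of [3])."""
    direction = []
    for p in model.parameters():
        p = p.detach()
        d = torch.randn_like(p)
        if mode == "layer" or p.dim() <= 1:
            d = d / (d.norm() + 1e-10) * p.norm()
        else:
            for j in range(p.size(0)):
                d[j] = d[j] / (d[j].norm() + 1e-10) * p[j].norm()
        direction.append(d)
    return direction

def surface_landscape(model, eval_loss, alphas, betas, mode="filter"):
    """f(alpha, beta) = L(theta* + alpha * delta + beta * eta) around the current weights."""
    theta_star = [p.detach().clone() for p in model.parameters()]
    delta = make_normalized_direction(model, mode)
    eta = make_normalized_direction(model, mode)
    model = copy.deepcopy(model)   # evaluate on a copy
    grid = []
    for a in alphas:
        row = []
        for b in betas:
            with torch.no_grad():
                for p, t, d, e in zip(model.parameters(), theta_star, delta, eta):
                    p.copy_(t + a * d + b * e)
            row.append(eval_loss(model))
        grid.append(row)
    return grid   # plot e.g. with matplotlib contourf / plot_surface over (alphas, betas)
```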
### - PCA directions

Here, PCA is applied to the matrix $M = [\theta_t]_{1 \le t \le T}$ or $M = [\theta_t - \theta_T]_{1 \le t \le T - 1}$ (where $\theta_t$ denotes the model parameters at epoch $t$, and $T$ is the total number of training epochs) and the two most explanatory directions are used as $\vec{\delta}$ and $\vec{\eta}$. That is :
- first, the $T \times T$ (resp. $(T-1) \times (T-1)$) covariance matrix $\Sigma$ of $M$ is computed : $M_t = \theta_t$ (resp. $\theta_t - \theta_T$), $\bar{M} = \frac{1}{T \ (resp. \ T-1)} \sum_{t} M_t$, $\Sigma_{i,j} = \frac{1}{T \ (resp. \ T-1)} \langle M_i - \bar{M}, M_j - \bar{M} \rangle$.
- The eigensystem of $\Sigma$ is then found, providing a new set of basis vectors and a new coordinate system (i.e., a rotation of the weight space).
- To reduce the dimensionality of the weight space to some value $d' < d$, the $d'$ largest eigenvalues and corresponding eigenvectors are chosen and the remaining $d - d'$ values discarded.

In practice, PCA is quite often performed on the correlation matrix rather than the covariance matrix. The main reason for this is that if the variables in question are measured in different units, the variances of some variables may dominate the first few principal components simply because of the units they are measured in (e.g. consider one variable measured in meters together with some other variables measured in millimeters). In the case of neural network weight data, this problem does not arise, so the covariance matrix is used for PCA [9].

## 1.3) Normalization

[3] : *«While the random directions approach to plotting is simple, it fails to capture the intrinsic geometry of loss surfaces, and cannot be used to compare the geometry of two different minimizers or two different networks. This is because of the scale invariance in network weights. When ReLU non-linearities are used, the network remains unchanged if we (for example) multiply the weights in one layer of a network by 10, and divide the next layer by 10. This invariance is even more prominent when batch normalization is used. In this case, the size (i.e., norm) of a filter is irrelevant because the output of each layer is re-scaled during batch normalization. For this reason, a network’s behavior remains unchanged if we re-scale the weights. Note, this scale invariance applies only to rectified networks. Scale invariance prevents us from making meaningful comparisons between plots, unless special precautions are taken. A neural network with large weights may appear to have a smooth and slowly varying loss function; perturbing the weights by one unit will have very little effect on network performance if the weights live on a scale much larger than one. However, if the weights are much smaller than one, then that same unit perturbation may have a catastrophic effect, making the loss function appear quite sensitive to weight perturbations. Keep in mind that neural nets are scale invariant; if the small-parameter and large-parameter networks in this example are equivalent (because one is simply a rescaling of the other), then any apparent differences in the loss function are merely an artifact of scale invariance ...»*

For these reasons, it is preferable to scale the directions added to the network so that they have the same scale as the network weights. In general, directions (whether deterministic or random) are normalized before they are used. As described above, we begin by producing $p \in \{1, 2\}$ random Gaussian direction vectors $\{\vec{u}^k\}_{1 \le k \le p}$ with dimensions compatible with the model parameters $\theta$ (i.e. in $\vec{\Theta}$) : $p = 1$ for a line and $p = 2$ for a surface. Also, let $\theta^*$ be a point carefully chosen in $\Theta$ (for example the minimizer of $L(\theta)$).

We will use the following 2-layer fully connected (FC) neural network as a running example in the next sections. For $x^T \in \mathbb{R}^n$, let $z^T = \sigma(a + Wx^T) \in \mathbb{R}^m$ for $a \in \mathbb{R}^m$ and $W = [w_1, \dots, w_m]^T \in \mathbb{R}^{m \times n}$, $w_i \in \mathbb{R}^n \ \forall i \in I_m$. Then $z_i = \sigma(\langle x, w_i \rangle + a_i) \ \forall i \in I_m$.
Let also $y = c + z b \in \mathbb{R}$ for $c \in \mathbb{R}$ and $b \in \mathbb{R}^{m}$. We fix $c = 0$ and $b_i = \frac{1}{ \sqrt{{m}}} \ \forall i \in I_m$ for the sake of simplicity (this type of network is known as a Soft Committee Machine). Then $\theta = \{W, a\}$ and $\theta^* = \{W^*, a^*\}$.

### - No Normalization [3]

$\{\vec{u}^k\}_{1 \le k \le p}$ are linearly combined with the weights $\theta^*$ without any processing, as described in the sections above.

### - Model-wise Normalization

The directions $\{\vec{u}^k\}_{1 \le k \le p}$ are normalized at the model level : $\vec{u}^k \leftarrow \frac{\vec{u}^k}{\| \vec{u}^k \|} \| \theta^* \| \ \forall k \in I_p$.

### - Layer-wise Normalization [2]

The directions $\{\vec{u}^k\}_{1 \le k \le p}$ are normalized at the layer level, so that the direction for each layer has the same norm as the corresponding layer of $\theta^*$ : $\vec{u}^k_i \leftarrow \frac{\vec{u}^k_i}{\| \vec{u}^k_i \|} \| \theta^*_i \| \ \forall k \in I_p$ for each layer $i$ with parameter $\theta^*_i$. For our example, since $\theta$ only contains the parameters of one layer, this normalization coincides with the model-wise normalization. If we include the parameters of the second layer ($b$ and $c$), it becomes :
- $[W^k, a^k] \leftarrow \frac{[W^k, a^k]}{\| [W^k, a^k] \|} \| [W^*, a^*] \| \ \forall k \in I_p$
- $[b^k, c^k] \leftarrow \frac{[b^k, c^k]}{\| [b^k, c^k] \|} \| [b^*, c^*] \| \ \forall k \in I_p$

with $\vec{u}^k = \{W^k, a^k, b^k, c^k\}$ and $\theta^* = \{W^*, a^*, b^*, c^*\}$.

### - Weight-wise Normalization

For our example, let $\vec{u}^k = \{W^k, a^k\}$. Then $W^k \leftarrow \frac{W^k}{\| W^k \|} \| W^* \|$ and $a^k \leftarrow \frac{a^k}{\| a^k \|} \| a^* \|$ for all $k \in I_p$.

### - Filter-wise Normalization [3]

To remove the scaling effect mentioned above, the authors of [3] plot the loss functions using filter-wise normalized directions. To obtain such directions for a network with parameters $\theta$, they normalize each filter in each direction $\vec{u}^k$ to have the same norm as the corresponding filter in $\theta$. In other words, they make the replacement $\vec{u}^k_{i,j} \leftarrow \frac{\vec{u}^k_{i,j}}{\| \vec{u}^k_{i,j} \|} \| \theta_{i,j} \|$ where $\vec{u}^k_{i,j}$ represents the $j^{th}$ filter (not the $j^{th}$ weight) of the $i^{th}$ layer of $\vec{u}^k$. This is not limited to convolutional (Conv) layers but also applies to fully connected (FC) layers. The FC layer is "equivalent" to a Conv layer with a $1 \times 1$ output feature map, and the filter corresponds to the weights that generate one neuron [3]. For our example, we will have $w^k_i \leftarrow \frac{w^k_i}{\| w^k_i \|} \| w^*_i \| \ \forall i \in I_m$ and $a^k_j \leftarrow \frac{a^k_j}{| a^k_j |} | a^*_j | \ \forall j \in I_m$ for all $k \in I_p$.

# 2. Optimization Paths [3]

Above, we showed how to visualize the loss landscape of high-dimensional neural network parameters in low dimensions. But how do we visualize the trajectory between two weight values encountered by the neural network during optimization? Random directions fail to capture the variation in optimization trajectories, leading to the misleading appearance of a straight-line path [3]. To capture variation in trajectories, we need to use non-random and carefully chosen directions. Let $M = [\theta_t - \theta_T]_{1 \le t \le T - 1}$ where $\theta_t$ denotes the model parameters at epoch $t$, and $T$ is the total number of training epochs.
We can apply PCA to the matrix $M$, select the two most explanatory directions and project the optimization trajectory onto these two directions. This means that if $\vec{u}$ and $\vec{v}$ are the two direction vectors (in $\vec{\Theta}$) chosen after PCA, we will display on the loss surface the trajectory linking the following points: $\bigg( \frac{\langle \theta_t, \vec{u} \rangle}{\|\vec{u}\|}, \frac{\langle \theta_t, \vec{v} \rangle}{\|\vec{v}\|} \bigg)_{1 \le t \le T}$. There is no guarantee that the model followed this path during optimization. In fact, it is highly unlikely to have done so, unless the optimization problem is convex [5].

![](https://i.imgur.com/jJLodQp.png)
<center>Projected learning trajectories using normalized PCA directions for VGG-9. The left plot in each subfigure uses batch size 128, and the right one uses batch size 8192. WD = weight decay. Image from [3]</center>

# 3. Apparent and real convexity

[3] : *«We are viewing the loss surface under a dramatic dimensionality reduction, and we need to be careful how we interpret these plots. One way to measure the level of convexity in a loss function is to compute the principle curvatures, which are simply eigenvalues of the Hessian. A truly convex function has non-negative curvatures (a positive semi-definite Hessian), while a non-convex function has negative curvatures. It can be shown that the principle curvatures of a dimensionality reduced plot (with random Gaussian directions) are weighted averages of the principle curvatures of the full-dimensional surface (where the weights are Chi-square random variables). This has several consequences. First of all, if non-convexity is present in the dimensionality reduced plot, then non-convexity must be present in the full-dimensional surface as well. However, apparent convexity in the low-dimensional surface does not mean the high-dimensional function is truly convex. Rather it means that the positive curvatures are dominant (more formally, the mean curvature, or average eigenvalue, is positive).»*

The authors of [3] calculate the ratio $\frac{\lambda_{min} (q)}{\lambda_{max} (q)}$ at each point $q$ ($\alpha$ for a line, $(\alpha, \beta)$ for a surface) of the loss line/surface, where $\lambda_{min} (q)$ and $\lambda_{max} (q)$ are respectively the minimum and the maximum eigenvalues of the Hessian of $L(\theta)$ at $\theta(q)$. They observe that the convex-looking regions in the surface/line plots correspond to regions with insignificant negative eigenvalues (i.e., there are no major non-convex features that the plot missed), while chaotic regions contain large negative curvatures; and that for convex-looking surfaces, the negative eigenvalues remain extremely small (less than 1% of the size of the positive curvatures) over a large region of the plot. Note that, for a positive-definite Hessian, the reciprocal ratio $\frac{\lambda_{max} (q)}{\lambda_{min} (q)}$ is the condition number of the Hessian of $L(\theta)$ at $\theta(q)$; larger condition numbers imply slower convergence of gradient descent.

![](https://i.imgur.com/071Alxe.png)
<center>Blue color indicates a more convex region (near-zero negative eigenvalues relative to the positive eigenvalues), while yellow indicates significant levels of negative curvature. Image from [3]</center>

[6] uses the normalized local curvature along the trajectory of the optimizer, given by $\frac{1}{\|\theta_t\|^2} \theta_t^T \mathcal{H}_t \theta_t$, as a curvature measure.
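These extreme eigenvalues can be estimated with Hessian-vector products, so that $\frac{\lambda_{min}(q)}{\lambda_{max}(q)}$ can be mapped over the plot without ever forming the $d \times d$ Hessian. Below is a minimal sketch, assuming PyTorch and using plain power iteration for readability; in practice a Lanczos-type eigensolver (e.g. `scipy.sparse.linalg.eigsh` on a Hessian-vector-product operator) is usually preferred. The function names are mine.

```python
import torch

def extreme_eigenvalues(loss, params, iters=100):
    """Estimate (lambda_min, lambda_max) of the Hessian of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    d = flat_grad.numel()

    def hvp(vec):
        # Hessian-vector product: d/dtheta (G^T vec) = H vec
        hv = torch.autograd.grad(flat_grad @ vec, params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in hv])

    def dominant_eig(shift=0.0):
        # Power iteration on (H - shift * I): returns its eigenvalue of largest magnitude (signed).
        v = torch.randn(d, device=flat_grad.device, dtype=flat_grad.dtype)
        v = v / v.norm()
        lam = 0.0
        for _ in range(iters):
            w = hvp(v) - shift * v
            lam = torch.dot(v, w).item()   # Rayleigh quotient (v has unit norm)
            v = w / (w.norm() + 1e-10)
        return lam

    lam_a = dominant_eig()                     # one end of the spectrum
    lam_b = dominant_eig(shift=lam_a) + lam_a  # the opposite end
    return min(lam_a, lam_b), max(lam_a, lam_b)

# Evaluating this at theta(q) for every grid point q gives the map of
# lambda_min(q) / lambda_max(q) used to color the plots above.
```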
Before discussing the convexity problem, let us recall the following theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. For all $(x, y) \in \mathbb{R}^n \times \mathbb{R}^n$, let $g_{(x, y)} \colon \alpha \in [0, 1] \mapsto g_{(x, y)}(\alpha) = f\Big(\alpha x + (1-\alpha)y\Big)$.

**Theorem** : **$f \text{ convex} \Longleftrightarrow g_{(x, y)} \text{ convex } \forall (x, y) \in \mathbb{R}^n \times \mathbb{R}^n$**.

**Proof**:
* $g_{(x, y)} \text{ convex } \Longrightarrow \theta f(x) + (1-\theta)f(y) = \theta g_{(x, y)}(1) + (1-\theta)g_{(x, y)}(0) \ge g_{(x, y)}(\theta) = f(\theta x + (1-\theta)y)$
* $f \text{ convex } \Longrightarrow \theta g_{(x, y)}(\alpha_1) + (1-\theta)g_{(x, y)}(\alpha_2) = \theta f(\alpha_1 x + (1-\alpha_1)y) + (1-\theta)f(\alpha_2 x + (1-\alpha_2)y) \\ \ge f\Big(\big(\theta \alpha_1 + (1-\theta)\alpha_2\big) x + \big(1 - \theta \alpha_1 - (1-\theta)\alpha_2\big) y\Big) = g_{(x, y)}\big(\theta \alpha_1 + (1-\theta)\alpha_2\big)$

Do you see what I'm getting at? The landscape curves drawn above are exactly such one-dimensional restrictions of $L$: they capture the curvature of $L$ in the neighborhood of the point considered, but only in one direction. In fact, near a given $\theta_t$, $L(\theta) \approx L_t + \langle G_t, \theta - \theta_t \rangle + \frac{1}{2} \langle \mathcal{H}_t, \theta - \theta_t \rangle + o(\|\theta - \theta_t\|^2)$. That is, for very small $\alpha$, $f(\alpha) = L(\theta_t + \alpha \vec{\delta}) \approx L_t + \alpha \langle G_t, \vec{\delta} \rangle + \frac{1}{2} \alpha^2 \langle \mathcal{H}_t, \vec{\delta} \rangle + o(\alpha^2 \|\vec{\delta}\|^2)$. For a unit-norm vector $\vec{\delta}$ :
- $\langle G_t, \vec{\delta} \rangle = \vec{\delta}^T G_t$ is the $\vec{\delta}$-directional derivative of $L$ at $\theta_t$. So if we want to choose a direction of maximum (positive and negative) variation, we can choose $\vec{\delta} \parallel G_t$, for example $\vec{\delta} = \frac{1}{\|\theta_{t+1} - \theta_t\|_2}(\theta_{t+1} - \theta_t)$ because $\theta_{t+1} - \theta_t = -\epsilon_t G_t$.
- $\langle \mathcal{H}_t, \vec{\delta} \rangle = \vec{\delta}^T \mathcal{H}_t \vec{\delta}$ is the second $\vec{\delta}$-directional derivative of $L$ at $\theta_t$. Since the Hessian matrix encodes the curvature of the function in every direction, $\langle \mathcal{H}_t, \vec{\delta} \rangle$ encodes it in the direction $\vec{\delta}$. The maximum curvature of $L$ at $\theta_t$ is given by the largest eigenvalue of $\mathcal{H}_t$ and is attained in the direction of the corresponding eigenvector; the smallest curvature (the most negative curvature) is given by the smallest eigenvalue of $\mathcal{H}_t$ and is attained in the direction of the corresponding eigenvector.

The first-order approximation of a function at a given point is a linear function that has exactly the same directional derivatives at that point. The second-order approximation of a function at a given point is a quadratic form that has exactly the same directional derivatives and curvatures at that point.

# 4. Other methods

## 4.1 PHATE [7]

## 4.2 ...

# 5. Can we do better?

**Yes. But how?**

## 5.1 Andrews curve and Radar chart

*`An idea I'm exploring`*

An Andrews plot or Andrews curve is a way to visualize structure in high-dimensional data. Let $u(\alpha) = \Big({\frac{1}{\sqrt{2}}},\ \sin(\alpha),\ \cos(\alpha),\ \sin(2\alpha),\ \cos(2\alpha),\ \dots \Big) \in \mathbb{R}^d \ \forall \ \alpha \in [-\pi, \pi]$.
We can analyze the evolution of the structure of the parameters across epochs by visualizing the functions $\Big\{f_t : \alpha \in [-\pi, \pi] \mapsto f_t(\alpha) = \langle \theta_t, u(\alpha) \rangle\Big\}_{1 \le t \le T}$ where $\theta_t$ denotes the model parameters at epoch $t$, and $T$ is the total number of training epochs. Note that $\langle \theta, u(\alpha) \rangle$ is the projection of $\theta$ onto the vector $u(\alpha)$. If there is structure in $\{\theta_t\}_{1 \le t \le T}$, it may be visible in its Andrews curves [8]. Projecting onto the circle instead yields the radar chart.

## 5.2 ...

# To read

- Optimization
    - [x] CSC421/2516 Lectures 7-8: Optimization https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/slides/lec07.pdf
    - [x] https://cims.nyu.edu/~cfgranda/pages/OBDA_fall17/notes/convex_optimization.pdf
    - [ ] continuous vs discrete optimization of DNN
    - [ ] An introduction to the conjugate gradient method without the agonizing pain
    - [ ] Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization : A Survey
    - [ ] Negative eigenvalues of the Hessian in deep neural networks
- Landscape :
    - [ ] Geometry of learning: Visualizing the performance of neural network supervised training methods
    - [x] [1]
    - [x] [2]
    - [x] [3]
    - [ ] [7]
- Paths
    - [x] Visualization of learning in multilayer perceptron networks using principal component analysis
    - [x] Stuck in a what? Adventures in weight space
    - [x] Visualizing Deep Network Training Trajectories with PCA
- Sharpness and flatness
    - [x] Flat minima, March 1996
        - "flat minimum search" algorithm vs conventional backprop, weight decay and optimal brain surgeon / optimal brain damage.
        - A flat minimum is a (large) region of connected acceptable minima, where each weight $\theta$ within this region leads to an almost identical net function.
    - [ ] On large-batch training for deep learning : Generalization gap and sharp minima
        - The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by a significant number of large positive eigenvalues in $\mathcal{H}(\theta) = \nabla^2 L(\theta)$, and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by having numerous small eigenvalues of $\mathcal{H}(\theta)$. We have observed that the loss function landscape of deep neural networks is such that large-batch methods are attracted to regions with sharp minimizers and that, unlike small-batch methods, they are unable to escape the basins of attraction of these minimizers.
    - [ ] Sharp minima can generalize for deep nets
    - [ ] Exploring generalization in deep learning
    - [ ] Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data
- Others
    - [ ] Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
    - [ ] Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
    - [ ] The Loss Surfaces of Multilayer Networks
    - [ ] On the importance of momentum and initialization in deep learning
    - [ ] Exploring loss function topology with cyclical learning rates
    - [ ] Phasemax : Convex phase retrieval via basis pursuit
    - [ ] Taxonomizing local versus global structure in neural network loss landscapes
    - [ ] On the Loss Landscape of Adversarial Training : Identifying Challenges and How to Overcome Them
    - convergence to minimizers of strongly-convex functions and to stationary points for non-convex functions
        - [ ] Optimization Methods for Large-Scale Machine Learning
    - saddle-point avoidance
        - [ ] Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition
        - [ ] Gradient descent converges to minimizers
    - robustness to input data
        - [ ] Train faster, generalize better: Stability of stochastic gradient descent

# References

[1] Qualitatively Characterizing Neural Network Optimization Problems, Ian J. Goodfellow & Oriol Vinyals & Andrew M. Saxe, 2015

[2] An Empirical Analysis of the Optimization of Deep Network Loss Surfaces, Daniel Jiwoong Im & Michael Tao & Kristin Branson, 2017

[3] Visualizing the Loss Landscape of Neural Nets, Hao Li & Zheng Xu & Gavin Taylor & Christoph Studer & Tom Goldstein, 2018

[4] https://math.stackexchange.com/questions/995623/why-are-randomly-drawn-vectors-nearly-perpendicular-in-high-dimensions/995678

[5] https://github.com/marcellodebernardi/loss-landscapes/blob/master/loss_landscapes/main.py#L49

[6] The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon, Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, Joshua Susskind, 2022

[7] Visualizing High-Dimensional Trajectories on the Loss-Landscape of ANNs, https://openreview.net/pdf?id=uhiF-dV99ir

[8] https://en.wikipedia.org/wiki/Andrews_plot

[9] Visualization of Learning in Multi-layer Perceptron Networks using PCA

[10] https://cedar.buffalo.edu/~srihari/CSE676/8.2%20NNOptimization.pdf