Pascal Jr. Tikeng Notsawo
Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. We show that the dynamic of grokking goes beyond the $\ell_2$ norm, that is: if there exists a model with a property $P$ (e.g., sparse or low-rank weights) that fits the data, then GD with a small (explicit or implicit) regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) will also result in grokking, provided the number of training samples is large enough. Moreover, the $\ell_2$ norm of the parameters is no longer guaranteed to decrease with generalization when it is not the property sought.

Paper: *Grokking Beyond the Euclidean Norm of Model Parameters*, Pascal Jr. Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau, Forty-Second International Conference on Machine Learning (ICML), 2025. https://arxiv.org/abs/2506.05718

# What is Grokking?

For a prime integer $p=97$, consider the group $\mathcal{S} = \mathbb{Z}/p\mathbb{Z}$ endowed with modular addition, and split the dataset $\mathcal{D} = \{ (\mathbf{x}, (x_1 + x_2)\%p ) :  \mathbf{x} = (x_1, x_2)  \in \mathcal{S}^2 \}$ into two non-empty and disjoint subsets $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{validation}}$ according to a ratio $r_{\text{train}} := |\mathcal{D}_{\text{train}}| / |\mathcal{D}|=40\%$.

Now, consider a multilayer perceptron (MLP) for which the logits are given by
$$\mathbf{y}_{\theta}(x_1, x_2) = \mathbf{b}^{(2)} + \mathbf{W}^{(2)} \phi\left(\mathbf{b}^{(1)} + \mathbf{W}^{(1)} \left( \mathbf{E}_{\langle x_1 \rangle} \circ \mathbf{E}_{\langle x_2 \rangle} \right) \right)$$
where
* $\phi(z) = \max(z, 0)$ is the ReLU activation function
* $\langle x \rangle \in \{0, ..., p-1\}$ stands for the index of the token corresponding to $x \in \mathcal{S}$
* $\mathbf{E} \in \mathbb{R}^{p \times d_1}$ (embedding matrix for all the symbols in $\mathcal{S}$), $\mathbf{W}^{(1)} \in \mathbb{R}^{d_2 \times d_1}$, $\mathbf{b}^{(1)} \in \mathbb{R}^{d_2}$, $\mathbf{W}^{(2)} \in \mathbb{R}^{p \times d_2}$, and $\mathbf{b}^{(2)} \in \mathbb{R}^{p}$ are the learnable parameters, i.e. $\theta$.

Let's train this model using gradient descent with:
* as objective, $f(\theta) = g(\theta) + \beta h(\theta)$, where $h(\theta) = \| \theta \|_2^2$ and $g(\theta)$ is the average of the cross-entropy error on $\mathcal{D}_{\text{train}}$,
$$g(\theta) = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{(\mathbf{x}, y) \in \mathcal{D}_{\text{train}}} \ell \left( \mathbf{y}_{\theta}(\mathbf{x}), y \right)$$
$$\ell \left( \mathbf{y}, i \right)  = - \log \frac{ \exp\left( \mathbf{y}_i \right) }{ \sum_{j} \exp\left( \mathbf{y}_j \right)} \quad \forall \mathbf{y} \in \mathbb{R}^p,  \quad \forall i \in \{0, ..., p-1\}$$
* as optimizer, Adam, with *learning rate* $\alpha=10^{-3}$ and *regularization strength* $\beta = 10^{-6}$.

We let $\theta^{(t)}$ be the parameter at step $t\ge 0$ of training.

![Grokking MLP modular addition](/images/posts/2025-07-06-grokking_beyong_l2/mlp_l2_alpha=0.001_beta=1e-6-1.png "Grokking on modular addition with MLP")
<center>Figure 1 : Grokking on modular addition with MLP</center>
<br>

We can observe from the figure above that the training accuracy becomes perfect after $t_1 \approx 540$ steps, while it takes $t_2 \approx 9150$ steps for the validation accuracy to achieve the same level. This is basically what we call grokking, i.e., a generalization (often sudden) after a long period of overfitting (usually severe).
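For concreteness, here is a minimal PyTorch sketch of this kind of experiment. It is not the exact code behind Figure 1: the hidden widths $d_1 = d_2 = 256$, the choice of an elementwise product for $\circ$, the number of steps, and the use of Adam's `weight_decay` in place of an explicit $\beta \|\theta\|_2^2$ term in the loss are all illustrative assumptions.

```python
# Minimal sketch of the modular-addition experiment above (illustrative only).
# Assumptions not taken from the post: d1 = d2 = 256, the combination "o" of the
# two embeddings is an elementwise product, and Adam's weight_decay stands in for
# the explicit beta * ||theta||_2^2 term (they agree up to a constant factor).
import torch
import torch.nn as nn

torch.manual_seed(0)
p, d1, d2, r_train = 97, 256, 256, 0.4
alpha, beta = 1e-3, 1e-6

# All pairs (x1, x2) with label (x1 + x2) % p, split 40% train / 60% validation.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(r_train * len(pairs))
train_idx, val_idx = perm[:n_train], perm[n_train:]

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.E = nn.Embedding(p, d1)          # E in R^{p x d1}
        self.fc1 = nn.Linear(d1, d2)          # W^(1), b^(1)
        self.fc2 = nn.Linear(d2, p)           # W^(2), b^(2)
    def forward(self, x):
        e = self.E(x[:, 0]) * self.E(x[:, 1])  # assumed elementwise combination
        return self.fc2(torch.relu(self.fc1(e)))

model = MLP()
opt = torch.optim.Adam(model.parameters(), lr=alpha, weight_decay=beta)
loss_fn = nn.CrossEntropyLoss()

for step in range(20_000):                    # full-batch training
    opt.zero_grad()
    loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            acc_tr = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean().item()
            acc_va = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean().item()
        print(f"step {step:6d}  train acc {acc_tr:.3f}  val acc {acc_va:.3f}")
```

With a setup of this kind, the training accuracy typically saturates long before the validation accuracy does; the gap $\Delta t = t_2 - t_1$ between the two is what the rest of the post studies (the exact values of $t_1$ and $t_2$ depend on the assumptions above).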
Grokking was first studied by [4], who trained Transformers on binary operations (addition, multiplication, division, etc.) on $\mathcal{S}$. Now that we agree on what grokking is, let's see how we can explain it. We will review two previous explanations related to ours, highlight their limitations, and offer a more general explanation of the phenomenon based on regularization.

# Why Grokking?

## Goldilocks zone & LU mechanism

The **Goldilocks zone** (Figure 2, left) refers to a spherical shell in weight space (a region at an optimal weight norm $w = \|\theta\|_2 \approx w_c$) where models achieve good generalization. It is "just right":
- If the weight norm is too small ($w<w_c$), the model underfits and struggles to fit the training data.
- If the weight norm is too large ($w>w_c$), the model overfits: training loss is low, but test loss is high.

Thus, generalizing solutions lie near this shell (not too small, not too large), hence the name "Goldilocks zone."

![LU Mechanism](/images/posts/2025-07-06-grokking_beyong_l2/LU.png "LU Mechanism")
<center>Figure 2 : LU mechanism [1]</center>
<br>

The **LU mechanism** (Figure 2, right) describes the mismatch between how training and test losses behave as functions of the **weight norm** $w = \|\theta\|_2$:
* The **training loss** forms an **L-shape**: it decreases quickly and stays near zero for large $w$, because many overfitting solutions exist at high norms.
* The **test loss** forms a **U-shape**: it is minimized at a particular weight norm $w_c$ (the **Goldilocks zone**) and increases for both smaller and larger norms.

According to [1], this mismatch causes **grokking**: with a large initialization $w_0 = \|\theta^{(0)}\|_2 > w_c$ and a small weight decay $\beta$, the model first overfits at step $t_1$ (training loss drops, test loss remains high), then slowly drifts toward $w_c$ due to weight decay, eventually reaching a point where generalization improves dramatically (at step $t_2$).

How is this explanation limited? Consider, for example, a model $\mathbf{y}: \mathbb{R}^p \to \mathbb{R}^d$ defined by $\mathbf{y}(\mathbf{x}) := \mathbf{B} \phi(\mathbf{A}\mathbf{x})$ where $\mathbf{A} \in \mathbb{R}^{r \times p}$ and $\mathbf{B} \in \mathbb{R}^{d \times r}$. Suppose that $\phi$ is positive-$L$-homogeneous for some $L>0$, i.e., for all $\lambda>0$, $\phi(\lambda z) = \lambda^L \phi(z) \ \forall z \in \mathbb{R}$. For example, $\phi(z)=\max(z, 0)$ is positive-$1$-homogeneous, and $\phi(z)=z^2$ is $2$-homogeneous (quadratic activation functions are widely used in the context of grokking and modular arithmetic [5]).
Rescaling the two layers in opposite directions leaves the function unchanged:
$$\forall \lambda>0, \qquad \mathbf{y}_{\lambda}(\mathbf{x}) := \frac{\mathbf{B}}{\lambda^L} \phi(\lambda \mathbf{A}\mathbf{x}) = \mathbf{y}(\mathbf{x}) \quad \forall \mathbf{x} \in \mathbb{R}^{p}$$

Assume without loss of generality (WLOG) that $a:=\|\mathbf{A}\|_2^2>0$ and $b:=\|\mathbf{B}\|_2^2>0$ (otherwise $\mathbf{y}(\mathbf{x})=0$ for all $\mathbf{x}$). The squared $\ell_2$ norm of the parameters of $\mathbf{y}_{\lambda}$ is $w(\lambda) = a \lambda^2 + \frac{b}{\lambda^{2L}}$. The function $w(\lambda)$ decreases from $\lambda=0$ to $\lambda=\lambda_0 := \left(\frac{Lb}{a}\right)^{ \frac{1}{2(L+1)} }$ (obtained by setting $w'(\lambda)=0$), then increases. We have $w(\lambda_0) = \left( L^{ \frac{1}{L+1} } + L^{ -\frac{L}{L+1} } \right) \left(a^L b\right)^{ \frac{1}{L+1} }$ and $\lim_{\lambda \rightarrow 0, +\infty} w(\lambda) = \infty$.

This means that we can arbitrarily increase the norm of the parameters of the model $\mathbf{y}_{\lambda}$ without changing its generalization performance. In other words, the set of solutions that generalize is not located on a spherical shell (with respect to the $\ell_2$ distance) around the origin, and it would be possible to find a solution that generalizes close to the initialization $\theta^{(0)} = \{\mathbf{A}^{(0)}, \mathbf{B}^{(0)}\}$ regardless of how large $\|\theta^{(0)}\|_2^2 = \|\mathbf{A}^{(0)}\|_2^2 + \|\mathbf{B}^{(0)}\|_2^2$ is.

We will show below that in certain cases, due to this large initialization and the use of a small weight decay, we do indeed observe a transition in the generalization error during training driven by a decrease in the norm of the parameters, but that this transition does not correspond to generalization and occurs even when the problem to be solved has no solution that generalizes. This transition is therefore only a mirage, and not grokking.

Moreover, this LU mechanism based solely on the $\ell_2$ norm cannot explain grokking: other regularizations also induce grokking and affect the grokking delay $\Delta t= t_2 - t_1$, and when $\ell_2$ regularization is not used, the $\ell_2$ norm of the model parameters is not monotonic after memorization, and can even increase without harming generalization performance (Figure 3). A more general measure of complexity must be used in place of the $\ell_2$ norm for this LU mechanism to work.

![Grokking with different regularizers](/images/posts/2025-07-06-grokking_beyong_l2/mlp_algorithmic_dataset_scaling_beta_l1_l2_lnuc-1.png "Grokking with different regularizers")
Figure 3: (Top) Training and test accuracy of an MLP trained on modular addition with $\ell_1$ (left), $\ell_2$ (middle), and $\ell_*$ (right) regularization for different values of the regularization strength $\beta$. Smaller values of $\beta$ delay generalization. (Bottom) With $\ell_1$ regularization, the $\ell_2$ norm (middle) increases after generalization, i.e. the model exits the Goldilocks zone but generalizes. $\ell_*$ is the nuclear norm.
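The rescaling argument above is easy to check numerically. The NumPy snippet below (arbitrary illustrative dimensions) verifies that, for ReLU ($L=1$), the rescaled model $(\lambda \mathbf{A}, \mathbf{B}/\lambda)$ computes exactly the same function while its squared parameter norm $w(\lambda) = \lambda^2\|\mathbf{A}\|_2^2 + \|\mathbf{B}\|_2^2/\lambda^2$ can be made arbitrarily large.

```python
# Check that rescaling (A, B) -> (lambda * A, B / lambda) leaves the ReLU network
# unchanged (positive 1-homogeneity) while the squared parameter norm w(lambda)
# blows up for both small and large lambda. Dimensions are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
p, d, r = 20, 5, 8
A = rng.standard_normal((r, p))
B = rng.standard_normal((d, r))
x = rng.standard_normal(p)

relu = lambda z: np.maximum(z, 0.0)
y = B @ relu(A @ x)

for lam in [0.1, 1.0, 10.0, 100.0]:
    y_lam = (B / lam) @ relu(lam * (A @ x))
    w = lam**2 * np.sum(A**2) + np.sum(B**2) / lam**2
    print(f"lambda={lam:6.1f}   ||y_lam - y||_2 = {np.linalg.norm(y_lam - y):.2e}"
          f"   w(lambda) = {w:10.1f}")
```

The function difference stays at machine precision for every $\lambda$, while $w(\lambda)$ grows without bound, which is exactly why a generalizing solution need not sit on a fixed-norm shell.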
<br>

## Grokking as a transition from the kernel to the rich regime

Recall that we have an objective function of the form $f(\theta) = g(\theta) + \frac{\beta}{2} \|\theta\|_2^2$, where $g : \mathbb{R}^p \to [0, \infty)$ is the training error (average loss on the training data). Consider the continuous-time gradient flow starting at some $\theta(0)$,
$$\frac{d\theta(t)}{dt} = - \nabla_\theta f\left(\theta(t)\right)$$

We have the following result from [2] (refer to the paper for the formal version). Let $t_2 = \frac{\log\gamma}{\beta}$, with $\gamma = \|\theta(0)\|_2$ the initialization scale. Assume that:
- $\beta = \Theta(\gamma^{-c})$ for some positive $c = \Theta(1)$;
- the NTK (Neural Tangent Kernel) features at initialization $\left\{ \mathbf{w}^{(0)} (\mathbf{x}) := \nabla_{\theta} \mathbf{y}_{\theta}(\mathbf{x}) \big|_{\theta = \theta^{(0)} } \right\}_{ ( \mathbf{x}, y) \in \mathcal{D}_{\text{train}} }$ are linearly separable (classification setting) or independent (regression setting);
- the model is homogeneous with respect to the parameters, i.e. there exists $L>0$ such that for all $\lambda>0$, $\mathbf{y}_{\lambda \theta}(\mathbf{z}) = \lambda^L \mathbf{y}_{\theta}(\mathbf{z}) \ \forall \mathbf{z}$ (feed-forward neural networks with homogeneous activation functions are homogeneous);
- the initialization scale $\gamma \rightarrow \infty$.

Then, we have for all $\epsilon \in (0, 1)$:
- The solution found by gradient flow at time $t_1 = (1-\epsilon)t_2$
    - represents the same classifier as the max $L^2$-margin linear classifier on the NTK features (classification), i.e. assuming binary classification with labels in $\{-1, +1\}$, $\frac{\theta^{(t_1)}}{\| \theta^{(t_1)} \|_2}$ is the unique unit vector that points in the direction of the unique optimal solution to the following constrained optimization problem:
    $$ \min_{\theta} \ \frac{1}{2} \| \theta \|_2 \text{ s.t. } y \langle \mathbf{w}^{(0)}(\mathbf{x}) , \theta \rangle \ge 1 \quad \forall (\mathbf{x}, y) \in \mathcal{D}_{\text{train}} $$
    - approximates the minimum-norm interpolating solution in the NTK regime (regression), i.e. $\frac{\theta^{(t_1)}}{\| \theta^{(t_1)} \|_2}$ is the unique unit vector that points in the direction of the unique optimal solution to the following constrained optimization problem:
    $$ \min_{\theta} \ \frac{1}{2} \| \theta \|_2 \quad \text{ s.t. } \quad \langle \mathbf{w}^{(0)}(\mathbf{x}), \theta \rangle = y \quad \forall (\mathbf{x}, y) \in \mathcal{D}_{\text{train}} $$
- By continuing training slightly longer, to time $t_2^+ = (1+\epsilon)t_2$, the dynamics leave the NTK (linearized) regime. More precisely, $\frac{\theta^{(t_2^+)}}{\| \theta^{(t_2^+)} \|_2}$ is along the direction of a KKT point of the following constrained optimization problem, now involving the non-linearized model $\mathbf{y}_{\theta}$:
$$ \min_{\theta} \ \frac{1}{2} \| \theta \|_2 \quad \text{ s.t. } \quad \begin{cases} y \, \mathbf{y}_{\theta}(\mathbf{x}) \ge 1 & \text{ (classification) } \\ \mathbf{y}_{\theta}(\mathbf{x}) = y & \text{ (regression)} \end{cases} \quad \forall (\mathbf{x}, y) \in \mathcal{D}_{\text{train}} $$

How is this explanation limited? In certain settings, using only $\ell_2$ regularization under large-scale initialization, there is an abrupt transition in the generalization error during training, driven by changes in the $\ell_2$ norm of the model parameters (Figure 4).
This transition, however, does not result in convergence to an optimal solution and can arise even in cases where the problem has no optimal solution due to insufficient training samples (such as sparse recovery or matrix completion with a number of samples far below the theoretical limit required for optimal recovery). We call this phenomenon **grokking without understanding**.

![Grokking Without Understanding](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_gradient_norm_b_subgradient_large_init_log_for_main_section-1.png "Grokking Without Understanding")
![Grokking Without Understanding](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_gradient_norm_b_subgradient_large_init_for_main_section-1.png "Grokking Without Understanding")
Figure 4 : Grokking without understanding in sparse recovery: $(n, s, N, \alpha, \beta, \|\mathbf{a}^{(0)}\|_2)=(10^2, 5, 30, 10^{-1}, 10^{-5}, 10)$.
<br>

Consider a sparse vector $\mathbf{a}^* \in \mathbb{R}^n$, i.e. $s = \|\mathbf{a}^*\|_0 \ll n$. Given a design matrix $\mathbf{X} = \left[ \mathbf{x}_1, \cdots, \mathbf{x}_N \right]^\top \in \mathbb{R}^{N \times n}$ containing $N$ samples, and the noisy labels $\mathbf{y}^* = \mathbf{X} \mathbf{a}^* + \boldsymbol{\xi}$ with $\boldsymbol{\xi} \in \mathbb{R}^N$ (noise), consider the objective function $f(\mathbf{a}) = g(\mathbf{a}) + \beta \|\mathbf{a}\|_2^2$ with $g(\mathbf{a}) = \frac{1}{2} \| \mathbf{X} \mathbf{a} - \mathbf{y}^* \|_2^2$, and iteratively define
$$\mathbf{a}^{(t+1)} = \mathbf{a}^{(t)} - \alpha \nabla_{\mathbf{a}} f \left( \mathbf{a}^{(t)} \right)$$
with $\alpha>0$. Consider the least squares solution $\hat{\mathbf{a}} := \left( \mathbf{X}^\top \mathbf{X} + \beta \mathbb{I}_n \right)^{\dag} \mathbf{X}^\top \mathbf{y}^*$, and define $\rho_2 := \sigma_{\max} \left( \mathbb{I}_n - \alpha \left( \mathbf{X}^\top \mathbf{X} + \beta \mathbb{I}_n \right) \right)$. Assume the learning rate satisfies $0< \alpha < \frac{2}{\sigma_{\max}(\mathbf{X}^\top \mathbf{X}) + \beta}$. Then (see Theorem 3.6):
- $\| \mathbf{a}^{(t)} - \hat{\mathbf{a}} \|_2 \le \rho_2^{t-1} \| \mathbf{a}^{(1)} - \hat{\mathbf{a}} \|_2\ \forall t \ge 1$. That is, as $t\rightarrow \infty$, $\mathbf{a}^{(t)} \rightarrow \hat{\mathbf{a}}$.
- For $N < n$, $\|\hat{\mathbf{a}} - \mathbf{a}^*\|_2^2 \ge \|\left(\mathbb{I}_n - \mathbf{X}^\top (\mathbf{X} \mathbf{X}^\top)^\dagger \mathbf{X} \right) \mathbf{a}^*\|_2^2$. In particular, if $\mathbf{a}^*$ has a nonzero component orthogonal to the row space of $\mathbf{X}$, then $\|\hat{\mathbf{a}} - \mathbf{a}^*\|_2 > 0$, i.e. $\hat{\mathbf{a}}$ cannot perfectly generalize to $\mathbf{a}^*$.

We will show below that with $\ell_1$ regularization, i.e. $f(\mathbf{a}) = g(\mathbf{a}) + \beta \|\mathbf{a}\|_1$, there is grokking: $\mathbf{a}^{(t)}$ first converges to $\hat{\mathbf{a}} = \left( \mathbf{X}^\top \mathbf{X} \right)^{\dag} \mathbf{X}^\top \mathbf{y}^*$ and minimizes $g$, then takes a non-trivial delay $\Delta t$ to converge to $\mathbf{a}^*$. Since the minimum $\ell_2$ norm solution only gives memorization, the $\ell_2$ norm cannot be used as an indicator of grokking here. Although trivial, this example illustrates our intuition: for grokking by regularization to occur, the regularizer used must be able to enforce an inductive bias toward generalization. We will show this in the next section, and we will see that such regularization can also be implicit (for example, induced by over-parameterizing the model or by selecting the training data appropriately).
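A minimal NumPy sketch of this $\ell_2$-only setting is below. The problem sizes are smaller than those of Figure 4 and are illustrative assumptions; the point is only that gradient descent memorizes quickly while neither the iterate nor its limit $\hat{\mathbf{a}}$ recovers $\mathbf{a}^*$.

```python
# "Grokking without understanding": sparse recovery with l2 regularization only.
# Gradient descent quickly reaches g(a) ~ 0 (memorization), but its limit a_hat
# stays bounded away from the sparse a* when N < n. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, s, N = 50, 3, 15            # ambient dimension, sparsity, number of samples
alpha, beta = 5e-3, 1e-5       # learning rate, regularization strength

a_star = np.zeros(n)
a_star[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
X = rng.standard_normal((N, n))
y = X @ a_star                 # noiseless labels (xi = 0) for simplicity

a = 10.0 * rng.standard_normal(n) / np.sqrt(n)   # ||a^(0)||_2 ~ 10, large init
for t in range(10_000):
    a -= alpha * (X.T @ (X @ a - y) + 2 * beta * a)   # grad of g(a) + beta*||a||_2^2

a_hat = np.linalg.pinv(X.T @ X + beta * np.eye(n)) @ X.T @ y   # limit of the iterates
print("training loss g(a) =", 0.5 * np.linalg.norm(X @ a - y) ** 2)  # ~ 0
print("||a - a*||_2       =", np.linalg.norm(a - a_star))            # still large
print("||a_hat - a*||_2   =", np.linalg.norm(a_hat - a_star))        # > 0
```

Reaching $\hat{\mathbf{a}}$ itself (the slow decay of the component of $\mathbf{a}^{(0)}$ in the null space of $\mathbf{X}$) takes on the order of $1/(\alpha\beta)$ steps, which is the late, mirage-like transition discussed above; even then, $\hat{\mathbf{a}} \ne \mathbf{a}^*$.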
# Grokking Beyond $\ell_2$ Norm

Consider a differentiable function (training loss) $g : \mathbb{R}^p \to [0, \infty)$ and a subdifferentiable function (regularizer) $h : \mathbb{R}^p \to [0, \infty)$. Define the objective function $f(\mathbf{x}) = g(\mathbf{x}) + \beta h(\mathbf{x})$, and the (sub)gradient descent with learning rate $\alpha>0$:
$$
\begin{equation}
\mathbf{x}^{(0)} \in \mathbb{R}^p \text{ and } \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \alpha ( G( \mathbf{x}^{(t)}) + \beta H( \mathbf{x}^{(t)}) ) \quad \forall t \ge 0
\end{equation}
$$
where $G(\mathbf{x}) = \nabla g(\mathbf{x})$ is the gradient of $g$ at $\mathbf{x}$ and $H(\mathbf{x}) \in \partial h(\mathbf{x})$ is any subgradient of $h$ at $\mathbf{x}$. We define $\Theta_f := \argmin_{\mathbf{x} \in \mathbb{R}^p} f(\mathbf{x}) \subset \mathbb{R}^p$ and $f^* := \inf_{\mathbf{x} \in \mathbb{R}^p} f(\mathbf{x})$. Similarly, we define $\Theta_g$ and $g^*$, and assume $g^*=0$ without loss of generality.

Under certain conditions on $g$ at $\mathbf{x}^{(0)}$ (to be specified below), there exist $\alpha_{\max}$ and $\beta_{\max}$ such that for all $\alpha \in (0, \alpha_{\max})$ and $\beta \le \beta_{\max}$, the subgradient descent with $\alpha$ and $\beta$ starting at $\mathbf{x}^{(0)}$ satisfies the following:
* Memorization: the iterates $\mathbf{x}^{(t)}$ initially move (exponentially fast) toward a solution close to the initialization $\mathbf{x}^{(0)}$ that minimizes $g$ (i.e., the kernel solution associated with $g$), reached at step $t_1$.
* Generalization: later in training, when $g$ and its gradient are already small, the influence of $\beta h(\mathbf{x})$ dominates the update and gradually drives the iterates toward low values of $f$ and $h$, until reaching $f(\mathbf{x}^{(t)}) \approx f^*$ and $h(\mathbf{x}^{(t)}) \approx h_g^* := \inf_{\mathbf{x}, g(\mathbf{x}) = 0 } h(\mathbf{x})$, up to errors of order $\mathcal{O}(\alpha\beta^2)$ and $\mathcal{O}(\alpha\beta)$ respectively, within $\Delta t = \Theta(1/\alpha\beta)$ additional steps.

This is Theorem 2.1 of the paper.
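Before turning to the proof, here is the update written out in code. It is a generic sketch: `grad_g` and `subgrad_h` are placeholders for the gradient of $g$ and for any selection of subgradients of $h$ (for instance `np.sign` when $h = \|\cdot\|_1$).

```python
# Generic (sub)gradient descent: x_{t+1} = x_t - alpha * (G(x_t) + beta * H(x_t)).
# grad_g(x) must return the gradient of g at x, and subgrad_h(x) any subgradient
# of h at x. The whole trajectory is returned so the two phases can be inspected.
import numpy as np

def subgradient_descent(grad_g, subgrad_h, x0, alpha, beta, n_steps):
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        x = x - alpha * (grad_g(x) + beta * subgrad_h(x))
        trajectory.append(x.copy())
    return np.array(trajectory)
```

The sparse-recovery and low-rank experiments below are instances of this loop with $g(\mathbf{x}) = \frac12\|\mathbf{X}\mathbf{x} - \mathbf{y}^*\|_2^2$ and $h$ equal to the $\ell_1$ norm or the nuclear norm, respectively.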
We will now attempt to prove this, or at least provide a sketch of a proof. You can skip the proof, since the other sections do not depend on it.

First, let us provide some definitions. Define
$$
\begin{equation}
\begin{split}
\chi(g, \mathbf{x}, r) := \inf_{\mathbf{y} \in B(\mathbf{x}, r), g(\mathbf{y}) \ne 0} \| \nabla g(\mathbf{y}) \|_2^2/g(\mathbf{y}) \quad \forall \mathbf{x} \in \mathbb{R}^p, \quad \forall r >0
\end{split}
\end{equation}
$$
where $B(\mathbf{x}, r) := \left\{ \mathbf{y} \in \mathbb{R}^p \mid \| \mathbf{x} - \mathbf{y} \|_2 \le r \right\}$.

**Definition 1** We call the constant $\chi(g, \mathbf{x}, r)$ the Chatterjee-Łojasiewicz (CL) constant of $g$ at $\mathbf{x}$ with radius $r >0$, and we say that the function $g$ satisfies the $r$-CL inequality at $\mathbf{x}$ (or is $r$-CL at $\mathbf{x}$) when $4g(\mathbf{x}) < r^2 \chi(g, \mathbf{x}, r)$.

No need to worry about this definition for now; we will make it clearer below by relating it to the PL (Polyak-Łojasiewicz) inequality.

**Definition 2** For $L > 0$, we say that a differentiable function $\varphi : \mathbb{R}^p \to \mathbb{R}$ is $L$-smooth if and only if $\nabla \varphi(\mathbf{x})$ is $L$-Lipschitz, i.e. $\| \nabla \varphi(\mathbf{y}) - \nabla \varphi(\mathbf{x}) \|_2 \le L \|\mathbf{y}-\mathbf{x}\|_2 \quad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^p$.

**Lemma 1** Let $\varphi : \mathbb{R}^p \to \mathbb{R}$ be a differentiable $L$-smooth function.
- We have $\varphi(\mathbf{y}) \le \varphi(\mathbf{x}) + (\mathbf{y}-\mathbf{x})^\top \nabla \varphi(\mathbf{x}) + \frac{L}{2} \|\mathbf{y}-\mathbf{x}\|_2^2 \ \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^p$.
- For a non-negative $\varphi$, this implies $\|\nabla \varphi (\mathbf{x}) \|_2^{2} \le 2L \varphi(\mathbf{x}) \ \forall \mathbf{x} \in \mathbb{R}^p$.
- For a twice-differentiable $\varphi$, this is equivalent to saying that for all $\mathbf{x} \in \mathbb{R}^p$, all the eigenvalues of $\nabla^2 \varphi(\mathbf{x})$ are in $[-L, L]$.

See [Smoothness and Gradient Inequalities: A Theoretical Overview](https://hackmd.io/@6LQ4mvRtS4Sc3LHkNEvDXQ/ry5NlKtHeg) for the proofs.
$$
\begin{array}{r}
\blacksquare
\end{array}
$$

In the paper, we assume that there exists $r>0$ such that $g$ is $r$-CL at $\mathbf{x}^{(0)}$, i.e. $4g(\mathbf{x}^{(0)}) < r^2 \chi(g, \mathbf{x}^{(0)}, r)$. We then show that under this condition, $g$ is $L$-smooth on $B(\mathbf{x}^{(0)}, r)$ for a certain $L>0$, and the following hold:
- (i) As long as $\beta \sup_{ H(\mathbf{x}^{(k)}) \in \partial h(\mathbf{x}^{(k)}) } \|H(\mathbf{x}^{(k)})\|_2 \le \| G(\mathbf{x}^{(k)}) \|_2 \quad \forall k < t$, for a certain $t\ge 1$, the following hold for all $k \le t$:
    * $\mathbf{x}^{(k)} \in B(\mathbf{x}^{(0)}, r)$
    * $g(\mathbf{x}^{(k)}) \le (1-\delta)^{k} g(\mathbf{x}^{(0)})$ for a certain $\delta \in (0, 1)$
    * $\|\nabla g(\mathbf{x}^{(k)})\|_2^{2} \le 2 L g(\mathbf{x}^{(k)})$
- (ii) By defining $L_h := \sup_{\mathbf{x} \in B(\mathbf{x}^{(0)},r)} \sup_{ H(\mathbf{x}) \in \partial h(\mathbf{x}) } \|H(\mathbf{x})\|_\infty < \infty$, we have for all $\gamma, \tau \in (0, 1)$,
$$\gamma \| G(\mathbf{x}^{(0)}) \|_2 \ge \tau^{-t} \beta L_h \Longrightarrow \beta \|H(\mathbf{x}^{(k)})\|_2 \le \| G(\mathbf{x}^{(k)}) \|_2 \quad \forall k < t$$

Point (i) tells us that as long as the gradient $G$ of $g$ dominates all the subgradients $H$ of $h$, $g$ decreases geometrically and the iterates remain in the ball $B(\mathbf{x}^{(0)},r)$, while (ii) shows that we can always adjust the hyperparameters ($\beta$, $\gamma$, $\tau$, ...) so that $G = \nabla g$ dominates $H \in \partial h$ up to an arbitrary step. Combining these two points, we show that we can adjust these hyperparameters to allow $g$ (and $G$, since $\|G(\mathbf{x})\|_2^{2} \le 2 L g(\mathbf{x})$ in $B(\mathbf{x}^{(0)},r)$) to achieve arbitrary precision $\epsilon_g = \Omega(\beta^C)$ while remaining in $B(\mathbf{x}^{(0)},r)$, where $C>0$ is a constant.

Since the proofs of (i) and (ii) are quite long, we will impose a more restrictive condition on $g$ for the purposes of this blog post.

**Assumption 1** We assume that $g$ is $\mu$-PL (Polyak-Łojasiewicz) for a certain $\mu>0$, i.e., $2\mu (g(\mathbf{y}) - g^*) \le \| \nabla g(\mathbf{y}) \|_2^2 \quad \forall \mathbf{y} \in \mathbb{R}^p$.
Since $g^*=0$, we have
$$
\begin{equation}
\begin{split}
2\mu g(\mathbf{y}) \le \| \nabla g(\mathbf{y}) \|_2^2 \ \forall \mathbf{y} & \Longrightarrow 2\mu \le \frac{\| \nabla g(\mathbf{y}) \|_2^2}{g(\mathbf{y})} \quad \forall \mathbf{y}, \ g(\mathbf{y}) \ne 0 \\
& \Longrightarrow 2\mu \le \inf_{\mathbf{y} \in B(\mathbf{x}, r), g(\mathbf{y}) \ne 0} \| \nabla g(\mathbf{y}) \|_2^2/g(\mathbf{y}) = \chi(g, \mathbf{x}, r) \quad \forall \mathbf{x} \\
& \Longrightarrow 4g(\mathbf{x}) < r^2 \chi(g, \mathbf{x}, r) \quad \forall \mathbf{x}, \quad \forall r^2 > 2g(\mathbf{x})/\mu
\end{split}
\end{equation}
$$
So the PL inequality implies the CL inequality (for a large enough radius $r$); PL has been shown to hold for wide overparameterized neural networks in a neighborhood of their initialization. The advantage of CL is that, in the paper, we only require it to be satisfied at initialization, unlike the following simple version under PL, which requires the function to satisfy the PL inequality over its entire domain.

**Assumption 2** We assume $g$ is $L$-smooth for some $L > 0$.

**Assumption 3** Define $\delta(\alpha, \beta) := \mu \alpha \cdot \left( ( 2 - \alpha L ) - 2 \beta |1 - \alpha L| - \alpha \beta^2 L \right)$, and assume the learning rate $\alpha>0$ and the regularization strength $\beta\ge 0$ satisfy $0< \alpha < \frac{2}{L}$ and $0 < \delta(\alpha, \beta) < 2$. The feasible set of these constraints can be written $\{ (\alpha, \beta) : 0 < \alpha < \alpha_{\max}, 0 < \beta < \beta_{\max}(\alpha) \}$ for a certain $\alpha_{\max}>0$ and $\beta_{\max}(\alpha)>0$ for all $\alpha \in (0, \alpha_{\max})$.

**Theorem** Under Assumptions 1, 2 and 3, we have for all $t \ge 0$:
$$\| H( \mathbf{x}^{(t)}) \|_2 \le \| G( \mathbf{x}^{(t)}) \|_2 \quad \Longrightarrow \quad g(\mathbf{x}^{(t+1)}) - g^* \le \left( 1 - \delta \right) \left( g(\mathbf{x}^{(t)}) - g^* \right)$$
$$ \| H( \mathbf{x}^{(k)}) \|_2 \le \| G( \mathbf{x}^{(k)}) \|_2 \quad \forall k < t \quad \Longrightarrow \quad g(\mathbf{x}^{(k)}) - g^* \le \left( 1 - \delta \right)^{k} \left( g(\mathbf{x}^{(0)}) - g^* \right) \quad \forall k \le t $$
And since $g$ is assumed $L$-smooth, we have $\|\nabla g(\mathbf{x}^{(t)})\|_2^{2} \le 2 L g(\mathbf{x}^{(t)})$ for all $t\ge 0$.

**Proof** Fix $t \ge 0$.
If $\| H( \mathbf{x}^{(t)}) \|_2 \le \| G( \mathbf{x}^{(t)}) \|_2$, then
$$
\begin{equation}
\begin{split}
g(\mathbf{x}^{(t+1)}) - g^* & \le g(\mathbf{x}^{(t)}) + (\mathbf{x}^{(t+1)}-\mathbf{x}^{(t)})^\top G(\mathbf{x}^{(t)}) + \frac{L}{2} \|\mathbf{x}^{(t+1)}-\mathbf{x}^{(t)}\|_2^2 - g^* \text{ since $g$ has an $L$-Lipschitz continuous gradient} \\
& = g(\mathbf{x}^{(t)}) - \alpha ( G( \mathbf{x}^{(t)}) + \beta H( \mathbf{x}^{(t)}) )^\top G(\mathbf{x}^{(t)}) + \frac{L\alpha^2}{2} \| G( \mathbf{x}^{(t)}) + \beta H( \mathbf{x}^{(t)}) \|_2^2 - g^* \text{ since } \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \alpha ( G( \mathbf{x}^{(t)}) + \beta H( \mathbf{x}^{(t)}) ) \\
& = g(\mathbf{x}^{(t)}) - \alpha \| G( \mathbf{x}^{(t)}) \|_2^2 - \alpha \beta H( \mathbf{x}^{(t)} )^\top G(\mathbf{x}^{(t)}) + \frac{\alpha^2 L}{2} \| G( \mathbf{x}^{(t)}) \|_2^2 + \alpha^2 \beta L \, H( \mathbf{x}^{(t)} )^\top G(\mathbf{x}^{(t)}) + \frac{\alpha^2 \beta^2 L}{2} \| H( \mathbf{x}^{(t)}) \|_2^2 - g^* \\
& = g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \left( 2 - \alpha L \right) \| G( \mathbf{x}^{(t)}) \|_2^2 - \alpha \beta (1 - \alpha L ) H( \mathbf{x}^{(t)} )^\top G(\mathbf{x}^{(t)}) + \frac{\alpha^2 \beta^2 L}{2} \| H( \mathbf{x}^{(t)}) \|_2^2 - g^* \\
& \le g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \left( 2 - \alpha L \right) \| G( \mathbf{x}^{(t)}) \|_2^2 + \alpha \beta |1 - \alpha L| \, \| H( \mathbf{x}^{(t)} ) \|_2 \| G(\mathbf{x}^{(t)}) \|_2 + \frac{\alpha^2 \beta^2 L}{2} \| H( \mathbf{x}^{(t)}) \|_2^2 - g^* \text{ using Cauchy-Schwarz} \\
& \le g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \left( 2 - \alpha L \right) \| G( \mathbf{x}^{(t)}) \|_2^2 + \alpha \beta |1 - \alpha L| \, \| G(\mathbf{x}^{(t)}) \|_2^2 + \frac{\alpha^2 \beta^2 L}{2} \| G( \mathbf{x}^{(t)}) \|_2^2 - g^* \text{ since } \| H( \mathbf{x}^{(t)} ) \|_2 \le \| G( \mathbf{x}^{(t)} ) \|_2 \\
& = g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \left[ \left( 2 - \alpha L \right) - 2 \beta |1 - \alpha L| - \alpha \beta^2 L \right] \| G( \mathbf{x}^{(t)}) \|_2^2 - g^* \\
& = g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \kappa \| \nabla g( \mathbf{x}^{(t)}) \|_2^2 - g^* \text{ with } \kappa := \left( 2 - \alpha L \right) - 2 \beta |1 - \alpha L| - \alpha \beta^2 L > 0 \text{ by Assumption 3, since } \delta(\alpha, \beta) = \mu \alpha \kappa > 0 \\
& \le g(\mathbf{x}^{(t)}) - \alpha \mu \cdot \kappa \cdot \left( g(\mathbf{x}^{(t)}) - g^* \right) - g^* \text{ since $\kappa >0$ and $g$ is $\mu$-PL, i.e. $\| \nabla g(\mathbf{x}^{(t)}) \|_2^2 \ge 2\mu (g(\mathbf{x}^{(t)}) - g^*) $} \\
& = \left(1 - \delta(\alpha, \beta) \right) \left( g(\mathbf{x}^{(t)}) - g^* \right)
\end{split}
\end{equation}
$$

As a consequence, if we have $\| H( \mathbf{x}^{(k)}) \|_2 \le \| G( \mathbf{x}^{(k)}) \|_2$ for all $k < t$, then
* $g(\mathbf{x}^{(1)}) - g^* \le \left( 1 - \delta \right) \left( g(\mathbf{x}^{(0)}) - g^* \right)$
* $g(\mathbf{x}^{(2)}) - g^* \le \left( 1 - \delta \right) \left( g(\mathbf{x}^{(1)}) - g^* \right) \le \left( 1 - \delta \right)^2 \left( g(\mathbf{x}^{(0)}) - g^* \right)$
* $\cdots$
* $g(\mathbf{x}^{(t)}) - g^* \le \left( 1 - \delta \right) \left( g(\mathbf{x}^{(t-1)}) - g^* \right) \le \left( 1 - \delta \right)^{t} \left( g(\mathbf{x}^{(0)}) - g^* \right)$

The inequality $\|\nabla g(\mathbf{x}^{(t)})\|_2^{2} \le 2 L g(\mathbf{x}^{(t)})$ for all $t\ge 0$ comes from Lemma 1.
$$
\begin{array}{r}
\blacksquare
\end{array}
$$

This proves (i).
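As a sanity check, the contraction can be verified numerically on a toy instance where the constants are explicit: take $g(\mathbf{x}) = \frac12\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2$ with a square invertible $\mathbf{A}$ (so that $\mu$ and $L$ are the extreme eigenvalues of $\mathbf{A}^\top\mathbf{A}$ and $g^*=0$) and $h = \|\cdot\|_1$. The instance below is an arbitrary illustrative one.

```python
# Numerical check of the geometric decrease: whenever ||H(x_t)||_2 <= ||G(x_t)||_2,
# we must have g(x_{t+1}) <= (1 - delta) * g(x_t) with delta as in Assumption 3.
import numpy as np

rng = np.random.default_rng(0)
p = 10
A = rng.standard_normal((p, p))
b = rng.standard_normal(p)

evals = np.linalg.eigvalsh(A.T @ A)
mu, L = evals[0], evals[-1]                # PL and smoothness constants of g
alpha, beta = 1.0 / L, 1e-3                # satisfies 0 < alpha < 2 / L
delta = mu * alpha * ((2 - alpha * L) - 2 * beta * abs(1 - alpha * L) - alpha * beta**2 * L)
assert 0 < delta < 2

g = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
G = lambda x: A.T @ (A @ x - b)            # gradient of g
H = np.sign                                # a subgradient of the l1 norm

x = rng.standard_normal(p)
for t in range(200):
    x_next = x - alpha * (G(x) + beta * H(x))
    if np.linalg.norm(H(x)) <= np.linalg.norm(G(x)):
        assert g(x_next) <= (1 - delta) * g(x) + 1e-12   # contraction from the theorem
    x = x_next
print("contraction verified on this instance; final g(x) =", g(x))
```

Once the gradient of $g$ becomes small, the guard condition eventually fails and the $\beta H$ term takes over the update, which is the generalization phase discussed above.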
It should be noted that there is a difference between this result and the one obtained in the paper under CL:
- the regularity of $g$ along the trajectory is a consequence of the theorem under CL, whereas under PL it is assumed in advance;
- CL is only required at initialization, whereas PL is required on all of $\mathbb{R}^p$.

We will not prove (ii) here, but the proof is quite simple. We first prove that (P): if for a certain $t\ge 1$ we have $\beta \sup_{ H(\mathbf{x}^{(k)}) \in \partial h(\mathbf{x}^{(k)}) } \|H(\mathbf{x}^{(k)})\|_2 \le \| G(\mathbf{x}^{(k)}) \|_2$ and $\mathbf{x}^{(k)} \in B(\mathbf{x}^{(0)}, r)$ for all $k < t$, then $\mathbf{x}^{(t)} \in B(\mathbf{x}^{(0)}, r)$. Using (P), we then proceed by induction. Assume the hypothesis of (ii), i.e. $\gamma \| G(\mathbf{x}^{(0)}) \|_2 \ge \tau^{-k} \beta L_h$ for some $k$. Since $\mathbf{x}^{(0)} \in B(\mathbf{x}^{(0)}, r)$, we have $\gamma \| G(\mathbf{x}^{(0)}) \|_2 \ge \frac{\beta L_h}{\tau^{k}} \ge \beta L_h \ge \beta \|H(\mathbf{x}^{(0)})\|_2$. So $\mathbf{x}^{(1)} \in B(\mathbf{x}^{(0)}, r)$ by (P), and hence $\sup_{H \in \partial h(\mathbf{x}^{(1)})} \|H\|_2 \le L_h$. Since $g$ is $C^2$, $\nabla^2 g(\mathbf{x})$ is symmetric for all $\mathbf{x} \in \mathbb{R}^p$, and we thus have for all $\mathbf{x} \in B(\mathbf{x}^{(0)}, r) \subset B(\mathbf{x}^{(0)}, 2r)$, $\sigma_{\max}( \nabla^2 g(\mathbf{x})) \le \sqrt{p} \max_{ij} \left| \left[ \nabla^2 g(\mathbf{x})\right]_{ij} \right| \le \sqrt{p} L_2 = L$, where $L_2$ bounds the entries of $\nabla^2 g$ over $B(\mathbf{x}^{(0)}, 2r)$. So $g$ is $L$-smooth on $B(\mathbf{x}^{(0)}, r)$, i.e.
$$
\begin{equation}
\begin{split}
\| G(\mathbf{x}^{(1)}) - G(\mathbf{x}^{(0)}) \|_2 & \le L \| \mathbf{x}^{(1)} - \mathbf{x}^{(0)} \|_2 \\
& \le \alpha L \| G(\mathbf{x}^{(0)}) + \beta H(\mathbf{x}^{(0)}) \|_2 \\
& \le \alpha L \left( \| G(\mathbf{x}^{(0)})\|_2 + \beta \|H(\mathbf{x}^{(0)})\|_2 \right) \\
& \le (1+\gamma) \alpha L \| G(\mathbf{x}^{(0)})\|_2
\end{split}
\end{equation}
$$
So
$$
\begin{equation}
\begin{split}
\| G(\mathbf{x}^{(1)}) \|_2 & \ge \| G(\mathbf{x}^{(0)}) \|_2 - \| G(\mathbf{x}^{(1)}) -G(\mathbf{x}^{(0)}) \|_2 \text{ (triangle inequality)} \\
& \ge \| G(\mathbf{x}^{(0)}) \|_2 - (1+\gamma) \alpha L \| G(\mathbf{x}^{(0)})\|_2 \text{ (previous inequality)} \\
& = \tau \| G(\mathbf{x}^{(0)}) \|_2 \text{ with } \tau := 1 - (1+\gamma) \alpha L \in (0, 1) \text{ for $\alpha$ small enough} \\
& \ge \frac{\beta L_h}{\gamma\tau^{k-1}} \text{ since } \gamma \| G(\mathbf{x}^{(0)}) \|_2 \ge \frac{\beta L_h}{\tau^{k}}
\end{split}
\end{equation}
$$
We thus have $\gamma \| G(\mathbf{x}^{(1)}) \|_2 \ge \beta L_h \ge \beta \|H(\mathbf{x}^{(1)})\|_2$. So $\mathbf{x}^{(2)} \in B(\mathbf{x}^{(0)}, r)$ by (P), and hence $\sup_{H \in \partial h(\mathbf{x}^{(2)})} \|H\|_2 \le L_h$. Continuing in this way, we prove that $\| G(\mathbf{x}^{(j)}) \|_2 \ge \frac{\beta L_h}{\gamma\tau^{k-j}}$ and $\beta \sup_{H \in \partial h(\mathbf{x}^{(j)})} \|H\|_2 \le \gamma \| G(\mathbf{x}^{(j)}) \|_2 \ \forall \ j \le k$.

# Sparsity

We have
* a sparse vector $\mathbf{a}^* \in \mathbb{R}^n$, i.e. $s = \|\mathbf{a}^*\|_0 \ll n$;
* the labels $\mathbf{y}^* = \mathbf{X} \mathbf{a}^* + \boldsymbol{\xi}$ with $\mathbf{X} \in \mathbb{R}^{N \times n}$ (design matrix) and $\boldsymbol{\xi} \in \mathbb{R}^N$ (noise);
* $g(\mathbf{a}) = \frac{1}{2} \| \mathbf{X} \mathbf{a} - \mathbf{y}^* \|_2^2$ and $f(\mathbf{a}) = g(\mathbf{a}) + \beta \|\mathbf{a}\|_1$.

**Theorem 3.1 and 3.3 (Informal)** Under certain conditions on $\alpha$, $\beta$ and $\boldsymbol{\xi}$:
* (i) Memorization: $\mathbf{a}^{(t)}$ first moves near the least squares solution $\hat{\mathbf{a}} := \left( \mathbf{X}^\top \mathbf{X} \right)^{\dag} \mathbf{X}^\top \mathbf{y}^*$, at step $t_1<\infty$.
* (ii) Generalization: later in training, $\partial\|\mathbf{a}\|_1$ dominates the update, leading to $\| \mathbf{a}^{(t)} \|_1 - \|\mathbf{a}^* \|_1 = \mathcal O (\alpha\beta)$ within on the order of $\frac{\| \hat{\mathbf{a}} - \mathbf{a}^* \|_{\infty}}{\alpha \beta}$ more training steps.

For suitable $\mathbf{X}$, this implies
$$
\begin{equation}
\|\mathbf{a}^{(t)} - \mathbf{a}^*\|_1 = \mathcal{O}(\alpha\beta) \Longleftrightarrow t \ge t_1 + \frac{\| \hat{\mathbf{a}} - \mathbf{a}^* \|_{\infty}}{\alpha \beta}
\end{equation}
$$

![compressed_sensing_gradient_norm_b_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_gradient_norm_b_subgradient-1.png "compressed_sensing_gradient_norm_b_subgradient-1")
Figure 5 : $G(\mathbf{a}^{(t)})$ dominates $\beta H(\mathbf{a}^{(t)})$ until memorization at $t_1$, where $g(\mathbf{a}^{(t_1)})\approx0$. From memorization onward, $\beta H(\mathbf{a}^{(t)})$ dominates and makes $\|\mathbf{a}^{(t)}\|_1$ converge to $\|\mathbf{a}^*\|_1$ at $t_2 = t_1 + \Delta t$, so that $\mathbf{a}^{(t_2)} \approx \mathbf{a}^*$.
<br>

![compressed_sensing_scaling_time_and_error_vs_alpha_and_beta_1_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_scaling_time_and_error_vs_alpha_and_beta_1_subgradient-1.png "compressed_sensing_scaling_time_and_error_vs_alpha_and_beta_1_subgradient-1")
Figure 6 : We can see that $t_2 \propto \| \mathbf{\hat{a}} - \mathbf{a}^*\|_{\infty} / \alpha \beta$ and $\|\mathbf{a}^{(t_2)}-\mathbf{a}^*\|_1 \propto \alpha \beta$, i.e. a smaller $\alpha \beta$ requires a longer time to converge, but converges to a lower generalization error.
<br>

![compressed_sensing_scaling_alpha_and_beta_1_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_scaling_alpha_and_beta_1_subgradient-1.png "compressed_sensing_scaling_alpha_and_beta_1_subgradient-1")
Figure 7 : A smaller $\alpha \beta$ requires a longer time to converge, but converges to a lower generalization error.
<br>
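The two phases of this result can be reproduced with a few lines of NumPy. The sketch below reuses the sparse-recovery setup from the previous section, this time with the $\ell_1$ penalty and the sign subgradient; the sizes and hyperparameters are illustrative and not those used for Figures 5-7.

```python
# Sparse recovery with l1 regularization: fast memorization (g(a) ~ 0), then a slow
# drift toward the sparse a* driven by the sign subgradient (grokking).
# Sizes and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, s, N = 50, 3, 25             # ambient dimension, sparsity, number of samples
alpha, beta = 5e-3, 1e-3        # learning rate, regularization strength

a_star = np.zeros(n)
a_star[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
X = rng.standard_normal((N, n))
y = X @ a_star                  # noiseless labels (xi = 0)

a = np.zeros(n)
checkpoints = {0, 1_000, 10_000, 100_000, 300_000}
for t in range(300_001):
    if t in checkpoints:
        g_val = 0.5 * np.linalg.norm(X @ a - y) ** 2
        err = np.linalg.norm(a - a_star, 1)
        print(f"t={t:7d}   g(a)={g_val:9.2e}   ||a - a*||_1={err:.4f}")
    a -= alpha * (X.T @ (X @ a - y) + beta * np.sign(a))   # G(a) + beta * H(a)
```

The training loss collapses within the first few hundred steps (memorization), while $\|\mathbf{a}^{(t)} - \mathbf{a}^*\|_1$ only becomes small after on the order of $\|\hat{\mathbf{a}} - \mathbf{a}^*\|_\infty / (\alpha\beta)$ further steps, matching the delay predicted above. Replacing the sign subgradient with a subgradient of the nuclear norm (computed from an SVD) gives the analogous low-rank experiment of the next section.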
# Low-rankness

We have
* a low-rank matrix $\mathbf{A}^* \in \mathbb{R}^{n_1 \times n_2}$ of rank $r \ll \min(n_1, n_2)$;
* the labels $\mathbf{y}^* = \mathbf{X} \text{vec}(\mathbf{A}^*) + \boldsymbol{\xi}$ (matrix completion, matrix sensing, etc.);
* $g(\mathbf{A}) = \frac{1}{2} \| \mathbf{X} \text{vec}(\mathbf{A}) - \mathbf{y}^* \|_2^2$;
* $f(\mathbf{A}) = g(\mathbf{A}) + \beta \|\mathbf{A}\|_*$ with $\|\mathbf{A}\|_* = \sum_{i} \sigma_{i}(\mathbf{A})$.

**Theorem 3.4 and 3.5 (Informal)** Under certain conditions on $\alpha$, $\beta$ and $\boldsymbol{\xi}$:
* (i) Memorization: $\text{vec}(\mathbf{A}^{(t)}) \rightarrow \text{vec}(\mathbf{\hat{A}}) := \left( \mathbf{X}^\top \mathbf{X} \right)^{\dag} \mathbf{X}^\top \mathbf{y}^*$ at step $t_1<\infty$.
* (ii) Generalization: after $t_1$, $\| \mathbf{A}^{(t)} \|_* - \|\mathbf{A}^* \|_*$ becomes $\mathcal O (\alpha\beta)$ within on the order of $\frac{\| \hat{\mathbf{A}} - \mathbf{A}^* \|_{2 \to 2}}{\alpha \beta}$ more training steps.

For suitable $\mathbf{X}$, this implies
$$
\begin{equation}
\|\mathbf{A}^{(t)} - \mathbf{A}^*\|_* = \mathcal{O}(\alpha\beta) \Longleftrightarrow t \ge t_1 + \frac{\| \mathbf{\hat{A}} - \mathbf{A}^* \|_{2 \to 2}}{\alpha \beta}
\end{equation}
$$

![matrix-completion_gradient_norm_a_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/matrix-completion_gradient_norm_a_subgradient-1.png "matrix-completion_gradient_norm_a_subgradient-1")
Figure 8 : $G(\mathbf{A}^{(t)})$ dominates $\beta H(\mathbf{A}^{(t)})$ until memorization at $t_1$, where $g(\mathbf{A}^{(t_1)})\approx0$. From memorization onward, $\beta H(\mathbf{A}^{(t)})$ dominates and makes $\|\mathbf{A}^{(t)}\|_*$ converge to $\|\mathbf{A}^*\|_*$ at $t_2 = t_1 + \Delta t$, so that $\mathbf{A}^{(t_2)} \approx \mathbf{A}^*$.
<br>

![matrix-completion_scaling_time_and_error_vs_alpha_and_beta_star_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/matrix-completion_scaling_time_and_error_vs_alpha_and_beta_star_subgradient-1.png "matrix-completion_scaling_time_and_error_vs_alpha_and_beta_star_subgradient-1")
Figure 9 : We can see that $t_2 \propto \| \mathbf{\hat{A}} - \mathbf{A}^*\|_{2 \to 2} / \alpha \beta$ and $\|\mathbf{A}^{(t_2)}-\mathbf{A}^*\|_* \propto \alpha \beta$, i.e. a smaller $\alpha \beta$ requires a longer time to converge, but converges to a lower generalization error.
<br>

# Algorithmic Dataset

![mlp_algorithmic_dataset_scaling_alpha_and_beta_2-1](/images/posts/2025-07-06-grokking_beyong_l2/mlp_algorithmic_dataset_scaling_alpha_and_beta_2-1.png "mlp_algorithmic_dataset_scaling_alpha_and_beta_2-1")
![mlp_algorithmic_dataset_scaling_alpha_and_beta_1-1](/images/posts/2025-07-06-grokking_beyong_l2/mlp_algorithmic_dataset_scaling_alpha_and_beta_1-1.png "mlp_algorithmic_dataset_scaling_alpha_and_beta_1-1")
![mlp_algorithmic_dataset_scaling_alpha_and_beta_nuc_small_plot-1](/images/posts/2025-07-06-grokking_beyong_l2/mlp_algorithmic_dataset_scaling_alpha_and_beta_nuc_small_plot-1.png "mlp_algorithmic_dataset_scaling_alpha_and_beta_nuc_small_plot-1")

# Non-linear Teacher-student

![2layers_nn_scaling_alpha_and_beta_1_small_plot-1](/images/posts/2025-07-06-grokking_beyong_l2/2layers_nn_scaling_alpha_and_beta_1_small_plot-1.png "2layers_nn_scaling_alpha_and_beta_1_small_plot-1")
![2layers_nn_sobolev_scaling_alpha_and_beta_d1_small_plot-1](/images/posts/2025-07-06-grokking_beyong_l2/2layers_nn_sobolev_scaling_alpha_and_beta_d1_small_plot-1.png "2layers_nn_sobolev_scaling_alpha_and_beta_d1_small_plot-1")

# Data Selection

![matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2-1](/images/posts/2025-07-06-grokking_beyong_l2/matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2-1.png "matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2-1")
![matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2_nologx-1](/images/posts/2025-07-06-grokking_beyong_l2/matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2_nologx-1.png "matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2_nologx-1")

# Overparameterization with Depth: Sparse Recovery

![compressed_sensing_scaling_depth_error_vs_N_and_L_small_subgradient_n=100_s=5_test_error-1](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_scaling_depth_error_vs_N_and_L_small_subgradient_n=100_s=5_test_error-1.png "compressed_sensing_scaling_depth_error_vs_N_and_L_small_subgradient_n=100_s=5_test_error-1")

# References

[1] Ziming Liu et al. "Omnigrok: Grokking Beyond Algorithmic Data". In: The Eleventh International Conference on Learning Representations. 2023. URL: https://openreview.net/forum?id=zDiHoIWa0q1.

[2] Kaifeng Lyu et al. Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking. 2024. arXiv: 2311.18817 [cs.LG]. URL: https://arxiv.org/abs/2311.18817.

[3] Pascal Jr Tikeng Notsawo et al. Grokking Beyond the Euclidean Norm of Model Parameters. 2025. arXiv: 2506.05718 [cs.LG]. URL: https://arxiv.org/abs/2506.05718.

[4] Alethea Power et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets". In: arXiv preprint arXiv:2201.02177 (2022).

[5] Andrey Gromov. "Grokking modular arithmetic". In: arXiv preprint arXiv:2301.02679 (2023).
