Pascal Jr. Tikeng Notsawo
Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. We show that the dynamic of grokking goes beyond the $\ell_2$ norm, that is: if there exists a model with a property $P$ (e.g., sparse or low-rank weights) that fits the data, then GD with a small (explicit or implicit) regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) will also result in grokking, provided the number of training samples is large enough. Moreover, the $\ell_2$ norm of the parameters is no longer guaranteed to decrease with generalization when it is not the property sought.

Paper: *Grokking Beyond the Euclidean Norm of Model Parameters*, Pascal Jr. Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau, Forty-Second International Conference on Machine Learning (ICML), 2025. https://arxiv.org/abs/2506.05718

# What is Grokking?

For a prime integer $p=97$, consider the group $\mathcal{S} = \mathbb{Z}/p\mathbb{Z}$ endowed with modular addition, and split the dataset $\mathcal{D} = \{ (\mathbf{x}, (x_1 + x_2)\%p ) :  \mathbf{x} = (x_1, x_2)  \in \mathcal{S}^2 \}$ into two non-empty and disjoint subsets $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{validation}}$ according to a ratio $r_{\text{train}} := |\mathcal{D}_{\text{train}}| / |\mathcal{D}|=40\%$.

Now, consider a multilayer perceptron (MLP) for which the logits are given by
$$\mathbf{y}_{\theta}(x_1, x_2) = \mathbf{b}^{(2)} + \mathbf{W}^{(2)} \phi\left(\mathbf{b}^{(1)} + \mathbf{W}^{(1)} \left( \mathbf{E}_{\langle x_1 \rangle} \circ \mathbf{E}_{\langle x_2 \rangle} \right) \right)$$
where
* $\phi(z) = \max(z, 0)$ is the ReLU activation function
* $\langle x \rangle \in \{0, ..., p-1\}$ stands for the index of the token corresponding to $x \in \mathcal{S}$
* $\mathbf{E} \in \mathbb{R}^{p \times d_1}$ (embedding matrix for all the symbols in $\mathcal{S}$), $\mathbf{W}^{(1)} \in \mathbb{R}^{d_2 \times d_1}$, $\mathbf{b}^{(1)} \in \mathbb{R}^{d_2}$, $\mathbf{W}^{(2)} \in \mathbb{R}^{p \times d_2}$, and $\mathbf{b}^{(2)} \in \mathbb{R}^{p}$ are the learnable parameters, i.e. $\theta$.

Let's train this model using gradient descent with:
* as objective, $f(\theta) = g(\theta) + \beta h(\theta)$, where $h(\theta) = \| \theta \|_2^2$ and $g(\theta)$ is the average of the cross-entropy error on $\mathcal{D}_{\text{train}}$,
$$g(\theta) = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{(\mathbf{x}, y) \in \mathcal{D}_{\text{train}}} \ell \left( \mathbf{y}_{\theta}(\mathbf{x}), y \right)$$
$$\ell \left( \mathbf{y}, i \right)  = - \log \frac{ \exp\left( \mathbf{y}_i \right) }{ \sum_{j} \exp\left( \mathbf{y}_j \right)} \quad \forall \mathbf{y} \in \mathbb{R}^p,  \quad \forall i \in \{0, ..., p-1\}$$
* as optimizer, Adam, with *learning rate* $\alpha=10^{-3}$ and *regularization strength* $\beta = 10^{-6}$.

We let $\theta^{(t)}$ be the parameter at step $t\ge 0$ of training.

![Grokking MLP modular addition](/images/posts/2025-07-06-grokking_beyong_l2/mlp_l2_alpha=0.001_beta=1e-6-1.png "Grokking on modular addition with MLP")
<center>Figure 1 : Grokking on modular addition with MLP</center>
<br>

We can observe from the figure above that the training accuracy becomes perfect after $t_1 \approx 540$ steps, while it takes $t_2 \approx 9150$ steps for the validation accuracy to achieve the same level. This is basically what we call grokking, i.e., a generalization (often sudden) after a long period of overfitting (usually severe).
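For concreteness, here is a minimal PyTorch sketch of this kind of experiment. It is not the exact code behind Figure 1: the hidden widths $d_1 = d_2 = 256$, the choice of an elementwise product for $\circ$, the number of steps, and the use of Adam's `weight_decay` in place of an explicit $\beta \|\theta\|_2^2$ term in the loss are all illustrative assumptions.

```python
# Minimal sketch of the modular-addition experiment above (illustrative only).
# Assumptions not taken from the post: d1 = d2 = 256, the combination "o" of the
# two embeddings is an elementwise product, and Adam's weight_decay stands in for
# the explicit beta * ||theta||_2^2 term (they agree up to a constant factor).
import torch
import torch.nn as nn

torch.manual_seed(0)
p, d1, d2, r_train = 97, 256, 256, 0.4
alpha, beta = 1e-3, 1e-6

# All pairs (x1, x2) with label (x1 + x2) % p, split 40% train / 60% validation.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(r_train * len(pairs))
train_idx, val_idx = perm[:n_train], perm[n_train:]

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.E = nn.Embedding(p, d1)          # E in R^{p x d1}
        self.fc1 = nn.Linear(d1, d2)          # W^(1), b^(1)
        self.fc2 = nn.Linear(d2, p)           # W^(2), b^(2)
    def forward(self, x):
        e = self.E(x[:, 0]) * self.E(x[:, 1])  # assumed elementwise combination
        return self.fc2(torch.relu(self.fc1(e)))

model = MLP()
opt = torch.optim.Adam(model.parameters(), lr=alpha, weight_decay=beta)
loss_fn = nn.CrossEntropyLoss()

for step in range(20_000):                    # full-batch training
    opt.zero_grad()
    loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            acc_tr = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean().item()
            acc_va = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean().item()
        print(f"step {step:6d}  train acc {acc_tr:.3f}  val acc {acc_va:.3f}")
```

With a setup of this kind, the training accuracy typically saturates long before the validation accuracy does; the gap $\Delta t = t_2 - t_1$ between the two is what the rest of the post studies (the exact values of $t_1$ and $t_2$ depend on the assumptions above).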
Grokking was first studied by [4], who trained Transformers on binary operations (addition, multiplication, division, etc.) on $\mathcal{S}$. Now that we agree on what grokking is, let's see how we can explain it. We will review two previous explanations related to ours, highlight their limitations, and offer a more general explanation of the phenomenon based on regularization.

# Why Grokking?

## Goldilocks zone & LU mechanism

The **Goldilocks zone** (Figure 2, left) refers to a spherical shell in weight space (a region at an optimal weight norm $w = \|\theta\|_2 \approx w_c$) where models achieve good generalization. It is "just right":
- If the weight norm is too small ($w<w_c$), the model underfits and struggles to fit the training data.
- If the weight norm is too large ($w>w_c$), the model overfits: training loss is low, but test loss is high.

Thus, generalizing solutions lie near this shell (not too small, not too large), hence the name "Goldilocks zone."

![LU Mechanism](/images/posts/2025-07-06-grokking_beyong_l2/LU.png "LU Mechanism")
<center>Figure 2 : LU mechanism [1]</center>
<br>

The **LU mechanism** (Figure 2, right) describes the mismatch between how training and test losses behave as functions of the **weight norm** $w = \|\theta\|_2$:
* The **training loss** forms an **L-shape**: it decreases quickly and stays near zero for large $w$, because many overfitting solutions exist at high norms.
* The **test loss** forms a **U-shape**: it is minimized at a particular weight norm $w_c$ (the **Goldilocks zone**) and increases for both smaller and larger norms.

According to [1], this mismatch causes **grokking**: with a large initialization $w_0 = \|\theta^{(0)}\|_2 > w_c$ and a small weight decay $\beta$, the model first overfits at step $t_1$ (training loss drops, test loss remains high), then slowly drifts toward $w_c$ due to weight decay, eventually reaching a point where generalization improves dramatically (at step $t_2$).

How is this explanation limited? Consider, for example, a model $\mathbf{y}: \mathbb{R}^p \to \mathbb{R}^d$ defined by $\mathbf{y}(\mathbf{x}) := \mathbf{B} \phi(\mathbf{A}\mathbf{x})$ where $\mathbf{A} \in \mathbb{R}^{r \times p}$ and $\mathbf{B} \in \mathbb{R}^{d \times r}$. Suppose that $\phi$ is positive-$L$-homogeneous for some $L>0$, i.e., for all $\lambda>0$, $\phi(\lambda z) = \lambda^L \phi(z) \ \forall z \in \mathbb{R}$. For example, $\phi(z)=\max(z, 0)$ is positive-$1$-homogeneous, and $\phi(z)=z^2$ is $2$-homogeneous (quadratic activation functions are widely used in the context of grokking and modular arithmetic [5]).
Rescaling the two layers in opposite directions leaves the function unchanged:
$$\forall \lambda>0, \qquad \mathbf{y}_{\lambda}(\mathbf{x}) := \frac{\mathbf{B}}{\lambda^L} \phi(\lambda \mathbf{A}\mathbf{x}) = \mathbf{y}(\mathbf{x}) \quad \forall \mathbf{x} \in \mathbb{R}^{p}$$

Assume without loss of generality (WLOG) that $a:=\|\mathbf{A}\|_2^2>0$ and $b:=\|\mathbf{B}\|_2^2>0$ (otherwise $\mathbf{y}(\mathbf{x})=0$ for all $\mathbf{x}$). The squared $\ell_2$ norm of the parameters of $\mathbf{y}_{\lambda}$ is $w(\lambda) = a \lambda^2 + \frac{b}{\lambda^{2L}}$. The function $w(\lambda)$ decreases from $\lambda=0$ to $\lambda=\lambda_0 := \left(\frac{Lb}{a}\right)^{ \frac{1}{2(L+1)} }$ (obtained by setting $w'(\lambda)=0$), then increases. We have $w(\lambda_0) = \left( L^{ \frac{1}{L+1} } + L^{ -\frac{L}{L+1} } \right) \left(a^L b\right)^{ \frac{1}{L+1} }$ and $\lim_{\lambda \rightarrow 0, +\infty} w(\lambda) = \infty$.

This means that we can arbitrarily increase the norm of the parameters of the model $\mathbf{y}_{\lambda}$ without changing its generalization performance. In other words, the set of solutions that generalize is not located on a spherical shell (with respect to the $\ell_2$ distance) around the origin, and it would be possible to find a solution that generalizes close to the initialization $\theta^{(0)} = \{\mathbf{A}^{(0)}, \mathbf{B}^{(0)}\}$ regardless of how large $\|\theta^{(0)}\|_2^2 = \|\mathbf{A}^{(0)}\|_2^2 + \|\mathbf{B}^{(0)}\|_2^2$ is.

We will show below that in certain cases, due to this large initialization and the use of a small weight decay, we do indeed observe a transition in the generalization error during training driven by a decrease in the norm of the parameters, but that this transition does not correspond to generalization and occurs even when the problem to be solved has no solution that generalizes. This transition is therefore only a mirage, and not grokking.

Moreover, this LU mechanism based solely on the $\ell_2$ norm cannot explain grokking: other regularizations also induce grokking and affect the grokking delay $\Delta t= t_2 - t_1$, and when $\ell_2$ regularization is not used, the $\ell_2$ norm of the model parameters is not monotonic after memorization, and can even increase without harming generalization performance (Figure 3). A more general measure of complexity must be used in place of the $\ell_2$ norm for this LU mechanism to work.

![Grokking with different regularizers](/images/posts/2025-07-06-grokking_beyong_l2/mlp_algorithmic_dataset_scaling_beta_l1_l2_lnuc-1.png "Grokking with different regularizers")
Figure 3: (Top) Training and test accuracy of an MLP trained on modular addition with $\ell_1$ (left), $\ell_2$ (middle), and $\ell_*$ (right) regularization for different values of the regularization strength $\beta$. Smaller values of $\beta$ delay generalization. (Bottom) With $\ell_1$ regularization, the $\ell_2$ norm (middle) increases after generalization, i.e. the model exits the Goldilocks zone but generalizes. $\ell_*$ is the nuclear norm.
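The rescaling argument above is easy to check numerically. The NumPy snippet below (arbitrary illustrative dimensions) verifies that, for ReLU ($L=1$), the rescaled model $(\lambda \mathbf{A}, \mathbf{B}/\lambda)$ computes exactly the same function while its squared parameter norm $w(\lambda) = \lambda^2\|\mathbf{A}\|_2^2 + \|\mathbf{B}\|_2^2/\lambda^2$ can be made arbitrarily large.

```python
# Check that rescaling (A, B) -> (lambda * A, B / lambda) leaves the ReLU network
# unchanged (positive 1-homogeneity) while the squared parameter norm w(lambda)
# blows up for both small and large lambda. Dimensions are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
p, d, r = 20, 5, 8
A = rng.standard_normal((r, p))
B = rng.standard_normal((d, r))
x = rng.standard_normal(p)

relu = lambda z: np.maximum(z, 0.0)
y = B @ relu(A @ x)

for lam in [0.1, 1.0, 10.0, 100.0]:
    y_lam = (B / lam) @ relu(lam * (A @ x))
    w = lam**2 * np.sum(A**2) + np.sum(B**2) / lam**2
    print(f"lambda={lam:6.1f}   ||y_lam - y||_2 = {np.linalg.norm(y_lam - y):.2e}"
          f"   w(lambda) = {w:10.1f}")
```

The function difference stays at machine precision for every $\lambda$, while $w(\lambda)$ grows without bound, which is exactly why a generalizing solution need not sit on a fixed-norm shell.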
<br>

## Grokking as a transition from the kernel to the rich regime

Recall that we have an objective function of the form $f(\theta) = g(\theta) + \frac{\beta}{2} \|\theta\|_2^2$, where $g : \mathbb{R}^p \to [0, \infty)$ is the training error (average loss on the training data). Consider the continuous-time gradient flow starting at some $\theta(0)$,
$$\frac{d\theta(t)}{dt} = - \nabla_\theta f\left(\theta(t)\right)$$

We have the following result from [2] (refer to the paper for the formal version). Let $t_2 = \frac{\log\gamma}{\beta}$, with $\gamma = \|\theta(0)\|_2$ the initialization scale. Assume that:
- $\beta = \Theta(\gamma^{-c})$ for some positive $c = \Theta(1)$;
- the NTK (Neural Tangent Kernel) features at initialization $\left\{ \mathbf{w}^{(0)} (\mathbf{x}) := \nabla_{\theta} \mathbf{y}_{\theta}(\mathbf{x}) \big|_{\theta = \theta^{(0)} } \right\}_{ ( \mathbf{x}, y) \in \mathcal{D}_{\text{train}} }$ are linearly separable (classification setting) or independent (regression setting);
- the model is homogeneous with respect to the parameters, i.e. there exists $L>0$ such that for all $\lambda>0$, $\mathbf{y}_{\lambda \theta}(\mathbf{z}) = \lambda^L \mathbf{y}_{\theta}(\mathbf{z}) \ \forall \mathbf{z}$ (feed-forward neural networks with homogeneous activation functions are homogeneous);
- the initialization scale $\gamma \rightarrow \infty$.

Then, we have for all $\epsilon \in (0, 1)$:
- The solution found by gradient flow at time $t_1 = (1-\epsilon)t_2$
    - represents the same classifier as the max $L^2$-margin linear classifier on the NTK features (classification), i.e. assuming binary classification with labels in $\{-1, +1\}$, $\frac{\theta^{(t_1)}}{\| \theta^{(t_1)} \|_2}$ is the unique unit vector that points in the direction of the unique optimal solution to the following constrained optimization problem:
    $$ \min_{\theta} \ \frac{1}{2} \| \theta \|_2 \text{ s.t. } y \langle \mathbf{w}^{(0)}(\mathbf{x}) , \theta \rangle \ge 1 \quad \forall (\mathbf{x}, y) \in \mathcal{D}_{\text{train}} $$
    - approximates the minimum-norm interpolating solution in the NTK regime (regression), i.e. $\frac{\theta^{(t_1)}}{\| \theta^{(t_1)} \|_2}$ is the unique unit vector that points in the direction of the unique optimal solution to the following constrained optimization problem:
    $$ \min_{\theta} \ \frac{1}{2} \| \theta \|_2 \quad \text{ s.t. } \quad \langle \mathbf{w}^{(0)}(\mathbf{x}), \theta \rangle = y \quad \forall (\mathbf{x}, y) \in \mathcal{D}_{\text{train}} $$
- By continuing training slightly longer, to time $t_2^+ = (1+\epsilon)t_2$, the dynamics leave the NTK (linearized) regime. More precisely, $\frac{\theta^{(t_2^+)}}{\| \theta^{(t_2^+)} \|_2}$ is along the direction of a KKT point of the following constrained optimization problem, now involving the non-linearized model $\mathbf{y}_{\theta}$:
$$ \min_{\theta} \ \frac{1}{2} \| \theta \|_2 \quad \text{ s.t. } \quad \begin{cases} y \, \mathbf{y}_{\theta}(\mathbf{x}) \ge 1 & \text{ (classification) } \\ \mathbf{y}_{\theta}(\mathbf{x}) = y & \text{ (regression)} \end{cases} \quad \forall (\mathbf{x}, y) \in \mathcal{D}_{\text{train}} $$

How is this explanation limited? In certain settings, using only $\ell_2$ regularization under large-scale initialization, there is an abrupt transition in the generalization error during training, driven by changes in the $\ell_2$ norm of the model parameters (Figure 4).
This transition, however, does not result in convergence to an optimal solution and can arise even in cases where the problem has no optimal solution due to insufficient training samples (such as sparse recovery or matrix completion with a number of samples far below the theoretical limit required for optimal recovery). We call this phenomenon **grokking without understanding**.

![Grokking Without Understanding](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_gradient_norm_b_subgradient_large_init_log_for_main_section-1.png "Grokking Without Understanding")
![Grokking Without Understanding](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_gradient_norm_b_subgradient_large_init_for_main_section-1.png "Grokking Without Understanding")
Figure 4 : Grokking without understanding in sparse recovery: $(n, s, N, \alpha, \beta, \|\mathbf{a}^{(0)}\|_2)=(10^2, 5, 30, 10^{-1}, 10^{-5}, 10)$.
<br>

Consider a sparse vector $\mathbf{a}^* \in \mathbb{R}^n$, i.e. $s = \|\mathbf{a}^*\|_0 \ll n$. Given a design matrix $\mathbf{X} = \left[ \mathbf{x}_1, \cdots, \mathbf{x}_N \right]^\top \in \mathbb{R}^{N \times n}$ containing $N$ samples, and the noisy labels $\mathbf{y}^* = \mathbf{X} \mathbf{a}^* + \boldsymbol{\xi}$ with $\boldsymbol{\xi} \in \mathbb{R}^N$ (noise), consider the objective function $f(\mathbf{a}) = g(\mathbf{a}) + \beta \|\mathbf{a}\|_2^2$ with $g(\mathbf{a}) = \frac{1}{2} \| \mathbf{X} \mathbf{a} - \mathbf{y}^* \|_2^2$, and iteratively define
$$\mathbf{a}^{(t+1)} = \mathbf{a}^{(t)} - \alpha \nabla_{\mathbf{a}} f \left( \mathbf{a}^{(t)} \right)$$
with $\alpha>0$. Consider the least squares solution $\hat{\mathbf{a}} := \left( \mathbf{X}^\top \mathbf{X} + \beta \mathbb{I}_n \right)^{\dag} \mathbf{X}^\top \mathbf{y}^*$, and define $\rho_2 := \sigma_{\max} \left( \mathbb{I}_n - \alpha \left( \mathbf{X}^\top \mathbf{X} + \beta \mathbb{I}_n \right) \right)$. Assume the learning rate satisfies $0< \alpha < \frac{2}{\sigma_{\max}(\mathbf{X}^\top \mathbf{X}) + \beta}$. Then (see Theorem 3.6):
- $\| \mathbf{a}^{(t)} - \hat{\mathbf{a}} \|_2 \le \rho_2^{t-1} \| \mathbf{a}^{(1)} - \hat{\mathbf{a}} \|_2\ \forall t \ge 1$. That is, as $t\rightarrow \infty$, $\mathbf{a}^{(t)} \rightarrow \hat{\mathbf{a}}$.
- For $N < n$, $\|\hat{\mathbf{a}} - \mathbf{a}^*\|_2^2 \ge \|\left(\mathbb{I}_n - \mathbf{X}^\top (\mathbf{X} \mathbf{X}^\top)^\dagger \mathbf{X} \right) \mathbf{a}^*\|_2^2$. In particular, if $\mathbf{a}^*$ has a nonzero component orthogonal to the row space of $\mathbf{X}$, then $\|\hat{\mathbf{a}} - \mathbf{a}^*\|_2 > 0$, i.e. $\hat{\mathbf{a}}$ cannot perfectly generalize to $\mathbf{a}^*$.

We will show below that with $\ell_1$ regularization, i.e. $f(\mathbf{a}) = g(\mathbf{a}) + \beta \|\mathbf{a}\|_1$, there is grokking: $\mathbf{a}^{(t)}$ first converges to $\hat{\mathbf{a}} = \left( \mathbf{X}^\top \mathbf{X} \right)^{\dag} \mathbf{X}^\top \mathbf{y}^*$ and minimizes $g$, then takes a non-trivial delay $\Delta t$ to converge to $\mathbf{a}^*$. Since the minimum $\ell_2$ norm solution only gives memorization, the $\ell_2$ norm cannot be used as an indicator of grokking here. Although trivial, this example illustrates our intuition: for grokking by regularization to occur, the regularizer used must be able to enforce an inductive bias toward generalization. We will show this in the next section, and we will see that such regularization can also be implicit (for example, induced by over-parameterizing the model or by selecting the training data appropriately).
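A minimal NumPy sketch of this $\ell_2$-only setting is below. The problem sizes are smaller than those of Figure 4 and are illustrative assumptions; the point is only that gradient descent memorizes quickly while neither the iterate nor its limit $\hat{\mathbf{a}}$ recovers $\mathbf{a}^*$.

```python
# "Grokking without understanding": sparse recovery with l2 regularization only.
# Gradient descent quickly reaches g(a) ~ 0 (memorization), but its limit a_hat
# stays bounded away from the sparse a* when N < n. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, s, N = 50, 3, 15            # ambient dimension, sparsity, number of samples
alpha, beta = 5e-3, 1e-5       # learning rate, regularization strength

a_star = np.zeros(n)
a_star[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
X = rng.standard_normal((N, n))
y = X @ a_star                 # noiseless labels (xi = 0) for simplicity

a = 10.0 * rng.standard_normal(n) / np.sqrt(n)   # ||a^(0)||_2 ~ 10, large init
for t in range(10_000):
    a -= alpha * (X.T @ (X @ a - y) + 2 * beta * a)   # grad of g(a) + beta*||a||_2^2

a_hat = np.linalg.pinv(X.T @ X + beta * np.eye(n)) @ X.T @ y   # limit of the iterates
print("training loss g(a) =", 0.5 * np.linalg.norm(X @ a - y) ** 2)  # ~ 0
print("||a - a*||_2       =", np.linalg.norm(a - a_star))            # still large
print("||a_hat - a*||_2   =", np.linalg.norm(a_hat - a_star))        # > 0
```

Reaching $\hat{\mathbf{a}}$ itself (the slow decay of the component of $\mathbf{a}^{(0)}$ in the null space of $\mathbf{X}$) takes on the order of $1/(\alpha\beta)$ steps, which is the late, mirage-like transition discussed above; even then, $\hat{\mathbf{a}} \ne \mathbf{a}^*$.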
# Grokking Beyond $\ell_2$ Norm

Consider a differentiable function (training loss) $g : \mathbb{R}^p \to [0, \infty)$ and a subdifferentiable function (regularizer) $h : \mathbb{R}^p \to [0, \infty)$. Define the objective function $f(\mathbf{x}) = g(\mathbf{x}) + \beta h(\mathbf{x})$, and the (sub)gradient descent with learning rate $\alpha>0$:
$$
\begin{equation}
\mathbf{x}^{(0)} \in \mathbb{R}^p \text{ and } \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \alpha ( G( \mathbf{x}^{(t)}) + \beta H( \mathbf{x}^{(t)}) ) \quad \forall t \ge 0
\end{equation}
$$
where $G(\mathbf{x}) = \nabla g(\mathbf{x})$ is the gradient of $g$ at $\mathbf{x}$ and $H(\mathbf{x}) \in \partial h(\mathbf{x})$ is any subgradient of $h$ at $\mathbf{x}$. We define $\Theta_f := \argmin_{\mathbf{x} \in \mathbb{R}^p} f(\mathbf{x}) \subset \mathbb{R}^p$ and $f^* := \inf_{\mathbf{x} \in \mathbb{R}^p} f(\mathbf{x})$. Similarly, we define $\Theta_g$ and $g^*$, and assume $g^*=0$ without loss of generality.

Under certain conditions on $g$ at $\mathbf{x}^{(0)}$ (to be specified below), there exist $\alpha_{\max}$ and $\beta_{\max}$ such that for all $\alpha \in (0, \alpha_{\max})$ and $\beta \le \beta_{\max}$, the subgradient descent with $\alpha$ and $\beta$ starting at $\mathbf{x}^{(0)}$ satisfies the following:
* Memorization: the iterates $\mathbf{x}^{(t)}$ initially move (exponentially fast) toward a solution close to the initialization $\mathbf{x}^{(0)}$ that minimizes $g$ (i.e., the kernel solution associated with $g$), reached at step $t_1$.
* Generalization: later in training, when $g$ and its gradient are already small, the influence of $\beta h(\mathbf{x})$ dominates the update and gradually drives the iterates toward low values of $f$ and $h$, until reaching $f(\mathbf{x}^{(t)}) \approx f^*$ and $h(\mathbf{x}^{(t)}) \approx h_g^* := \inf_{\mathbf{x}, g(\mathbf{x}) = 0 } h(\mathbf{x})$, up to errors of order $\mathcal{O}(\alpha\beta^2)$ and $\mathcal{O}(\alpha\beta)$ respectively, within $\Delta t = \Theta(1/\alpha\beta)$ additional steps.

This is Theorem 2.1 of the paper.
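Before turning to the proof, here is the update written out in code. It is a generic sketch: `grad_g` and `subgrad_h` are placeholders for the gradient of $g$ and for any selection of subgradients of $h$ (for instance `np.sign` when $h = \|\cdot\|_1$).

```python
# Generic (sub)gradient descent: x_{t+1} = x_t - alpha * (G(x_t) + beta * H(x_t)).
# grad_g(x) must return the gradient of g at x, and subgrad_h(x) any subgradient
# of h at x. The whole trajectory is returned so the two phases can be inspected.
import numpy as np

def subgradient_descent(grad_g, subgrad_h, x0, alpha, beta, n_steps):
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        x = x - alpha * (grad_g(x) + beta * subgrad_h(x))
        trajectory.append(x.copy())
    return np.array(trajectory)
```

The sparse-recovery and low-rank experiments below are instances of this loop with $g(\mathbf{x}) = \frac12\|\mathbf{X}\mathbf{x} - \mathbf{y}^*\|_2^2$ and $h$ equal to the $\ell_1$ norm or the nuclear norm, respectively.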
We will now attempt to prove this, or at least provide a sketch of a proof. You can skip the proof, since the other sections do not depend on it.

First, let us provide some definitions. Define
$$
\begin{equation}
\begin{split}
\chi(g, \mathbf{x}, r) := \inf_{\mathbf{y} \in B(\mathbf{x}, r), g(\mathbf{y}) \ne 0} \| \nabla g(\mathbf{y}) \|_2^2/g(\mathbf{y}) \quad \forall \mathbf{x} \in \mathbb{R}^p, \quad \forall r >0
\end{split}
\end{equation}
$$
where $B(\mathbf{x}, r) := \left\{ \mathbf{y} \in \mathbb{R}^p \mid \| \mathbf{x} - \mathbf{y} \|_2 \le r \right\}$.

**Definition 1** We call the constant $\chi(g, \mathbf{x}, r)$ the Chatterjee-Łojasiewicz (CL) constant of $g$ at $\mathbf{x}$ with radius $r >0$, and we say that the function $g$ satisfies the $r$-CL inequality at $\mathbf{x}$ (or is $r$-CL at $\mathbf{x}$) when $4g(\mathbf{x}) < r^2 \chi(g, \mathbf{x}, r)$.

No need to worry about this definition for now; we will make it clearer below by relating it to the PL (Polyak-Łojasiewicz) inequality.

**Definition 2** For $L > 0$, we say that a differentiable function $\varphi : \mathbb{R}^p \to \mathbb{R}$ is $L$-smooth if and only if $\nabla \varphi(\mathbf{x})$ is $L$-Lipschitz, i.e. $\| \nabla \varphi(\mathbf{y}) - \nabla \varphi(\mathbf{x}) \|_2 \le L \|\mathbf{y}-\mathbf{x}\|_2 \quad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^p$.

**Lemma 1** Let $\varphi : \mathbb{R}^p \to \mathbb{R}$ be a differentiable $L$-smooth function.
- We have $\varphi(\mathbf{y}) \le \varphi(\mathbf{x}) + (\mathbf{y}-\mathbf{x})^\top \nabla \varphi(\mathbf{x}) + \frac{L}{2} \|\mathbf{y}-\mathbf{x}\|_2^2 \ \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^p$.
- For a non-negative $\varphi$, this implies $\|\nabla \varphi (\mathbf{x}) \|_2^{2} \le 2L \varphi(\mathbf{x}) \ \forall \mathbf{x} \in \mathbb{R}^p$.
- For a twice-differentiable $\varphi$, this is equivalent to saying that for all $\mathbf{x} \in \mathbb{R}^p$, all the eigenvalues of $\nabla^2 \varphi(\mathbf{x})$ are in $[-L, L]$.

See [Smoothness and Gradient Inequalities: A Theoretical Overview](https://hackmd.io/@6LQ4mvRtS4Sc3LHkNEvDXQ/ry5NlKtHeg) for the proofs.
$$
\begin{array}{r}
\blacksquare
\end{array}
$$

In the paper, we assume that there exists $r>0$ such that $g$ is $r$-CL at $\mathbf{x}^{(0)}$, i.e. $4g(\mathbf{x}^{(0)}) < r^2 \chi(g, \mathbf{x}^{(0)}, r)$. We then show that under this condition, $g$ is $L$-smooth on $B(\mathbf{x}^{(0)}, r)$ for a certain $L>0$, and the following hold:
- (i) As long as $\beta \sup_{ H(\mathbf{x}^{(k)}) \in \partial h(\mathbf{x}^{(k)}) } \|H(\mathbf{x}^{(k)})\|_2 \le \| G(\mathbf{x}^{(k)}) \|_2 \quad \forall k < t$, for a certain $t\ge 1$, the following hold for all $k \le t$:
    * $\mathbf{x}^{(k)} \in B(\mathbf{x}^{(0)}, r)$
    * $g(\mathbf{x}^{(k)}) \le (1-\delta)^{k} g(\mathbf{x}^{(0)})$ for a certain $\delta \in (0, 1)$
    * $\|\nabla g(\mathbf{x}^{(k)})\|_2^{2} \le 2 L g(\mathbf{x}^{(k)})$
- (ii) By defining $L_h := \sup_{\mathbf{x} \in B(\mathbf{x}^{(0)},r)} \sup_{ H(\mathbf{x}) \in \partial h(\mathbf{x}) } \|H(\mathbf{x})\|_\infty < \infty$, we have for all $\gamma, \tau \in (0, 1)$,
$$\gamma \| G(\mathbf{x}^{(0)}) \|_2 \ge \tau^{-t} \beta L_h \Longrightarrow \beta \|H(\mathbf{x}^{(k)})\|_2 \le \| G(\mathbf{x}^{(k)}) \|_2 \quad \forall k < t$$

Point (i) tells us that as long as the gradient $G$ of $g$ dominates all the subgradients $H$ of $h$, $g$ decreases geometrically and the iterates remain in the ball $B(\mathbf{x}^{(0)},r)$, while (ii) shows that we can always adjust the hyperparameters ($\beta$, $\gamma$, $\tau$, ...) so that $G = \nabla g$ dominates $H \in \partial h$ up to an arbitrary step. Combining these two points, we show that we can adjust these hyperparameters to allow $g$ (and $G$, since $\|G(\mathbf{x})\|_2^{2} \le 2 L g(\mathbf{x})$ in $B(\mathbf{x}^{(0)},r)$) to achieve arbitrary precision $\epsilon_g = \Omega(\beta^C)$ while remaining in $B(\mathbf{x}^{(0)},r)$, where $C>0$ is a constant.

Since the proofs of (i) and (ii) are quite long, we will impose a more restrictive condition on $g$ for the purposes of this blog post.

**Assumption 1** We assume that $g$ is $\mu$-PL (Polyak-Łojasiewicz) for a certain $\mu>0$, i.e., $2\mu (g(\mathbf{y}) - g^*) \le \| \nabla g(\mathbf{y}) \|_2^2 \quad \forall \mathbf{y} \in \mathbb{R}^p$.
Since $g^*=0$, we have
$$
\begin{equation}
\begin{split}
2\mu g(\mathbf{y}) \le \| \nabla g(\mathbf{y}) \|_2^2 \ \forall \mathbf{y} & \Longrightarrow 2\mu \le \frac{\| \nabla g(\mathbf{y}) \|_2^2}{g(\mathbf{y})} \quad \forall \mathbf{y}, \ g(\mathbf{y}) \ne 0 \\
& \Longrightarrow 2\mu \le \inf_{\mathbf{y} \in B(\mathbf{x}, r), g(\mathbf{y}) \ne 0} \| \nabla g(\mathbf{y}) \|_2^2/g(\mathbf{y}) = \chi(g, \mathbf{x}, r) \quad \forall \mathbf{x} \\
& \Longrightarrow 4g(\mathbf{x}) < r^2 \chi(g, \mathbf{x}, r) \quad \forall \mathbf{x}, \quad \forall r^2 > 2g(\mathbf{x})/\mu
\end{split}
\end{equation}
$$
So the PL inequality implies the CL inequality (for a large enough radius $r$); PL has been shown to hold for wide overparameterized neural networks in a neighborhood of their initialization. The advantage of CL is that, in the paper, we only require it to be satisfied at initialization, unlike the following simple version under PL, which requires the function to satisfy the PL inequality over its entire domain.

**Assumption 2** We assume $g$ is $L$-smooth for some $L > 0$.

**Assumption 3** Define $\delta(\alpha, \beta) := \mu \alpha \cdot \left( ( 2 - \alpha L ) - 2 \beta |1 - \alpha L| - \alpha \beta^2 L \right)$, and assume the learning rate $\alpha>0$ and the regularization strength $\beta\ge 0$ satisfy $0< \alpha < \frac{2}{L}$ and $0 < \delta(\alpha, \beta) < 2$. The feasible set of these constraints can be written $\{ (\alpha, \beta) : 0 < \alpha < \alpha_{\max}, 0 < \beta < \beta_{\max}(\alpha) \}$ for a certain $\alpha_{\max}>0$ and $\beta_{\max}(\alpha)>0$ for all $\alpha \in (0, \alpha_{\max})$.

**Theorem** Under Assumptions 1, 2 and 3, we have for all $t \ge 0$:
$$\| H( \mathbf{x}^{(t)}) \|_2 \le \| G( \mathbf{x}^{(t)}) \|_2 \quad \Longrightarrow \quad g(\mathbf{x}^{(t+1)}) - g^* \le \left( 1 - \delta \right) \left( g(\mathbf{x}^{(t)}) - g^* \right)$$
$$ \| H( \mathbf{x}^{(k)}) \|_2 \le \| G( \mathbf{x}^{(k)}) \|_2 \quad \forall k < t \quad \Longrightarrow \quad g(\mathbf{x}^{(k)}) - g^* \le \left( 1 - \delta \right)^{k} \left( g(\mathbf{x}^{(0)}) - g^* \right) \quad \forall k \le t $$
And since $g$ is assumed $L$-smooth, we have $\|\nabla g(\mathbf{x}^{(t)})\|_2^{2} \le 2 L g(\mathbf{x}^{(t)})$ for all $t\ge 0$.

**Proof** Fix $t \ge 0$.
If $\| H( \mathbf{x}^{(t)}) \|_2 \le \| G( \mathbf{x}^{(t)}) \|_2$, then
$$
\begin{equation}
\begin{split}
g(\mathbf{x}^{(t+1)}) - g^* & \le g(\mathbf{x}^{(t)}) + (\mathbf{x}^{(t+1)}-\mathbf{x}^{(t)})^\top G(\mathbf{x}^{(t)}) + \frac{L}{2} \|\mathbf{x}^{(t+1)}-\mathbf{x}^{(t)}\|_2^2 - g^* \text{ since $g$ has an $L$-Lipschitz continuous gradient} \\
& = g(\mathbf{x}^{(t)}) - \alpha ( G( \mathbf{x}^{(t)}) + \beta H( \mathbf{x}^{(t)}) )^\top G(\mathbf{x}^{(t)}) + \frac{L\alpha^2}{2} \| G( \mathbf{x}^{(t)}) + \beta H( \mathbf{x}^{(t)}) \|_2^2 - g^* \text{ since } \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \alpha ( G( \mathbf{x}^{(t)}) + \beta H( \mathbf{x}^{(t)}) ) \\
& = g(\mathbf{x}^{(t)}) - \alpha \| G( \mathbf{x}^{(t)}) \|_2^2 - \alpha \beta H( \mathbf{x}^{(t)} )^\top G(\mathbf{x}^{(t)}) + \frac{\alpha^2 L}{2} \| G( \mathbf{x}^{(t)}) \|_2^2 + \alpha^2 \beta L \, H( \mathbf{x}^{(t)} )^\top G(\mathbf{x}^{(t)}) + \frac{\alpha^2 \beta^2 L}{2} \| H( \mathbf{x}^{(t)}) \|_2^2 - g^* \\
& = g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \left( 2 - \alpha L \right) \| G( \mathbf{x}^{(t)}) \|_2^2 - \alpha \beta (1 - \alpha L ) H( \mathbf{x}^{(t)} )^\top G(\mathbf{x}^{(t)}) + \frac{\alpha^2 \beta^2 L}{2} \| H( \mathbf{x}^{(t)}) \|_2^2 - g^* \\
& \le g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \left( 2 - \alpha L \right) \| G( \mathbf{x}^{(t)}) \|_2^2 + \alpha \beta |1 - \alpha L| \, \| H( \mathbf{x}^{(t)} ) \|_2 \| G(\mathbf{x}^{(t)}) \|_2 + \frac{\alpha^2 \beta^2 L}{2} \| H( \mathbf{x}^{(t)}) \|_2^2 - g^* \text{ using Cauchy-Schwarz} \\
& \le g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \left( 2 - \alpha L \right) \| G( \mathbf{x}^{(t)}) \|_2^2 + \alpha \beta |1 - \alpha L| \, \| G(\mathbf{x}^{(t)}) \|_2^2 + \frac{\alpha^2 \beta^2 L}{2} \| G( \mathbf{x}^{(t)}) \|_2^2 - g^* \text{ since } \| H( \mathbf{x}^{(t)} ) \|_2 \le \| G( \mathbf{x}^{(t)} ) \|_2 \\
& = g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \left[ \left( 2 - \alpha L \right) - 2 \beta |1 - \alpha L| - \alpha \beta^2 L \right] \| G( \mathbf{x}^{(t)}) \|_2^2 - g^* \\
& = g(\mathbf{x}^{(t)}) - \frac{\alpha}{2} \kappa \| \nabla g( \mathbf{x}^{(t)}) \|_2^2 - g^* \text{ with } \kappa := \left( 2 - \alpha L \right) - 2 \beta |1 - \alpha L| - \alpha \beta^2 L > 0 \text{ by Assumption 3, since } \delta(\alpha, \beta) = \mu \alpha \kappa > 0 \\
& \le g(\mathbf{x}^{(t)}) - \alpha \mu \cdot \kappa \cdot \left( g(\mathbf{x}^{(t)}) - g^* \right) - g^* \text{ since $\kappa >0$ and $g$ is $\mu$-PL, i.e. $\| \nabla g(\mathbf{x}^{(t)}) \|_2^2 \ge 2\mu (g(\mathbf{x}^{(t)}) - g^*) $} \\
& = \left(1 - \delta(\alpha, \beta) \right) \left( g(\mathbf{x}^{(t)}) - g^* \right)
\end{split}
\end{equation}
$$

As a consequence, if we have $\| H( \mathbf{x}^{(k)}) \|_2 \le \| G( \mathbf{x}^{(k)}) \|_2$ for all $k < t$, then
* $g(\mathbf{x}^{(1)}) - g^* \le \left( 1 - \delta \right) \left( g(\mathbf{x}^{(0)}) - g^* \right)$
* $g(\mathbf{x}^{(2)}) - g^* \le \left( 1 - \delta \right) \left( g(\mathbf{x}^{(1)}) - g^* \right) \le \left( 1 - \delta \right)^2 \left( g(\mathbf{x}^{(0)}) - g^* \right)$
* $\cdots$
* $g(\mathbf{x}^{(t)}) - g^* \le \left( 1 - \delta \right) \left( g(\mathbf{x}^{(t-1)}) - g^* \right) \le \left( 1 - \delta \right)^{t} \left( g(\mathbf{x}^{(0)}) - g^* \right)$

The inequality $\|\nabla g(\mathbf{x}^{(t)})\|_2^{2} \le 2 L g(\mathbf{x}^{(t)})$ for all $t\ge 0$ comes from Lemma 1.
$$
\begin{array}{r}
\blacksquare
\end{array}
$$

This proves (i).
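As a sanity check, the contraction can be verified numerically on a toy instance where the constants are explicit: take $g(\mathbf{x}) = \frac12\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2$ with a square invertible $\mathbf{A}$ (so that $\mu$ and $L$ are the extreme eigenvalues of $\mathbf{A}^\top\mathbf{A}$ and $g^*=0$) and $h = \|\cdot\|_1$. The instance below is an arbitrary illustrative one.

```python
# Numerical check of the geometric decrease: whenever ||H(x_t)||_2 <= ||G(x_t)||_2,
# we must have g(x_{t+1}) <= (1 - delta) * g(x_t) with delta as in Assumption 3.
import numpy as np

rng = np.random.default_rng(0)
p = 10
A = rng.standard_normal((p, p))
b = rng.standard_normal(p)

evals = np.linalg.eigvalsh(A.T @ A)
mu, L = evals[0], evals[-1]                # PL and smoothness constants of g
alpha, beta = 1.0 / L, 1e-3                # satisfies 0 < alpha < 2 / L
delta = mu * alpha * ((2 - alpha * L) - 2 * beta * abs(1 - alpha * L) - alpha * beta**2 * L)
assert 0 < delta < 2

g = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
G = lambda x: A.T @ (A @ x - b)            # gradient of g
H = np.sign                                # a subgradient of the l1 norm

x = rng.standard_normal(p)
for t in range(200):
    x_next = x - alpha * (G(x) + beta * H(x))
    if np.linalg.norm(H(x)) <= np.linalg.norm(G(x)):
        assert g(x_next) <= (1 - delta) * g(x) + 1e-12   # contraction from the theorem
    x = x_next
print("contraction verified on this instance; final g(x) =", g(x))
```

Once the gradient of $g$ becomes small, the guard condition eventually fails and the $\beta H$ term takes over the update, which is the generalization phase discussed above.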
It should be noted that there is a difference between this result and the one obtained in the paper under CL:
- the regularity of $g$ along the trajectory is a consequence of the theorem under CL, whereas under PL it is assumed in advance;
- CL is only required at initialization, whereas PL is required on all of $\mathbb{R}^p$.

We will not prove (ii) here, but the proof is quite simple. We first prove that (P): if for a certain $t\ge 1$ we have $\beta \sup_{ H(\mathbf{x}^{(k)}) \in \partial h(\mathbf{x}^{(k)}) } \|H(\mathbf{x}^{(k)})\|_2 \le \| G(\mathbf{x}^{(k)}) \|_2$ and $\mathbf{x}^{(k)} \in B(\mathbf{x}^{(0)}, r)$ for all $k < t$, then $\mathbf{x}^{(t)} \in B(\mathbf{x}^{(0)}, r)$. Using (P), we then proceed by induction. Assume the hypothesis of (ii), i.e. $\gamma \| G(\mathbf{x}^{(0)}) \|_2 \ge \tau^{-k} \beta L_h$ for some $k$. Since $\mathbf{x}^{(0)} \in B(\mathbf{x}^{(0)}, r)$, we have $\gamma \| G(\mathbf{x}^{(0)}) \|_2 \ge \frac{\beta L_h}{\tau^{k}} \ge \beta L_h \ge \beta \|H(\mathbf{x}^{(0)})\|_2$. So $\mathbf{x}^{(1)} \in B(\mathbf{x}^{(0)}, r)$ by (P), and hence $\sup_{H \in \partial h(\mathbf{x}^{(1)})} \|H\|_2 \le L_h$. Since $g$ is $C^2$, $\nabla^2 g(\mathbf{x})$ is symmetric for all $\mathbf{x} \in \mathbb{R}^p$, and we thus have for all $\mathbf{x} \in B(\mathbf{x}^{(0)}, r) \subset B(\mathbf{x}^{(0)}, 2r)$, $\sigma_{\max}( \nabla^2 g(\mathbf{x})) \le \sqrt{p} \max_{ij} \left| \left[ \nabla^2 g(\mathbf{x})\right]_{ij} \right| \le \sqrt{p} L_2 = L$, where $L_2$ bounds the entries of $\nabla^2 g$ over $B(\mathbf{x}^{(0)}, 2r)$. So $g$ is $L$-smooth on $B(\mathbf{x}^{(0)}, r)$, i.e.
$$
\begin{equation}
\begin{split}
\| G(\mathbf{x}^{(1)}) - G(\mathbf{x}^{(0)}) \|_2 & \le L \| \mathbf{x}^{(1)} - \mathbf{x}^{(0)} \|_2 \\
& \le \alpha L \| G(\mathbf{x}^{(0)}) + \beta H(\mathbf{x}^{(0)}) \|_2 \\
& \le \alpha L \left( \| G(\mathbf{x}^{(0)})\|_2 + \beta \|H(\mathbf{x}^{(0)})\|_2 \right) \\
& \le (1+\gamma) \alpha L \| G(\mathbf{x}^{(0)})\|_2
\end{split}
\end{equation}
$$
So
$$
\begin{equation}
\begin{split}
\| G(\mathbf{x}^{(1)}) \|_2 & \ge \| G(\mathbf{x}^{(0)}) \|_2 - \| G(\mathbf{x}^{(1)}) -G(\mathbf{x}^{(0)}) \|_2 \text{ (triangle inequality)} \\
& \ge \| G(\mathbf{x}^{(0)}) \|_2 - (1+\gamma) \alpha L \| G(\mathbf{x}^{(0)})\|_2 \text{ (previous inequality)} \\
& = \tau \| G(\mathbf{x}^{(0)}) \|_2 \text{ with } \tau := 1 - (1+\gamma) \alpha L \in (0, 1) \text{ for $\alpha$ small enough} \\
& \ge \frac{\beta L_h}{\gamma\tau^{k-1}} \text{ since } \gamma \| G(\mathbf{x}^{(0)}) \|_2 \ge \frac{\beta L_h}{\tau^{k}}
\end{split}
\end{equation}
$$
We thus have $\gamma \| G(\mathbf{x}^{(1)}) \|_2 \ge \beta L_h \ge \beta \|H(\mathbf{x}^{(1)})\|_2$. So $\mathbf{x}^{(2)} \in B(\mathbf{x}^{(0)}, r)$ by (P), and hence $\sup_{H \in \partial h(\mathbf{x}^{(2)})} \|H\|_2 \le L_h$. Continuing in this way, we prove that $\| G(\mathbf{x}^{(j)}) \|_2 \ge \frac{\beta L_h}{\gamma\tau^{k-j}}$ and $\beta \sup_{H \in \partial h(\mathbf{x}^{(j)})} \|H\|_2 \le \gamma \| G(\mathbf{x}^{(j)}) \|_2 \ \forall \ j \le k$.

# Sparsity

We have
* a sparse vector $\mathbf{a}^* \in \mathbb{R}^n$, i.e. $s = \|\mathbf{a}^*\|_0 \ll n$;
* the labels $\mathbf{y}^* = \mathbf{X} \mathbf{a}^* + \boldsymbol{\xi}$ with $\mathbf{X} \in \mathbb{R}^{N \times n}$ (design matrix) and $\boldsymbol{\xi} \in \mathbb{R}^N$ (noise);
* $g(\mathbf{a}) = \frac{1}{2} \| \mathbf{X} \mathbf{a} - \mathbf{y}^* \|_2^2$ and $f(\mathbf{a}) = g(\mathbf{a}) + \beta \|\mathbf{a}\|_1$.

**Theorem 3.1 and 3.3 (Informal)** Under certain conditions on $\alpha$, $\beta$ and $\boldsymbol{\xi}$:
* (i) Memorization: $\mathbf{a}^{(t)}$ first moves near the least squares solution $\hat{\mathbf{a}} := \left( \mathbf{X}^\top \mathbf{X} \right)^{\dag} \mathbf{X}^\top \mathbf{y}^*$, at step $t_1<\infty$.
* (ii) Generalization: later in training, $\partial\|\mathbf{a}\|_1$ dominates the update, leading to $\| \mathbf{a}^{(t)} \|_1 - \|\mathbf{a}^* \|_1 = \mathcal O (\alpha\beta)$ within on the order of $\frac{\| \hat{\mathbf{a}} - \mathbf{a}^* \|_{\infty}}{\alpha \beta}$ more training steps.

For suitable $\mathbf{X}$, this implies
$$
\begin{equation}
\|\mathbf{a}^{(t)} - \mathbf{a}^*\|_1 = \mathcal{O}(\alpha\beta) \Longleftrightarrow t \ge t_1 + \frac{\| \hat{\mathbf{a}} - \mathbf{a}^* \|_{\infty}}{\alpha \beta}
\end{equation}
$$

![compressed_sensing_gradient_norm_b_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_gradient_norm_b_subgradient-1.png "compressed_sensing_gradient_norm_b_subgradient-1")
Figure 5 : $G(\mathbf{a}^{(t)})$ dominates $\beta H(\mathbf{a}^{(t)})$ until memorization at $t_1$, where $g(\mathbf{a}^{(t_1)})\approx0$. From memorization onward, $\beta H(\mathbf{a}^{(t)})$ dominates and makes $\|\mathbf{a}^{(t)}\|_1$ converge to $\|\mathbf{a}^*\|_1$ at $t_2 = t_1 + \Delta t$, so that $\mathbf{a}^{(t_2)} \approx \mathbf{a}^*$.
<br>

![compressed_sensing_scaling_time_and_error_vs_alpha_and_beta_1_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_scaling_time_and_error_vs_alpha_and_beta_1_subgradient-1.png "compressed_sensing_scaling_time_and_error_vs_alpha_and_beta_1_subgradient-1")
Figure 6 : We can see that $t_2 \propto \| \mathbf{\hat{a}} - \mathbf{a}^*\|_{\infty} / \alpha \beta$ and $\|\mathbf{a}^{(t_2)}-\mathbf{a}^*\|_1 \propto \alpha \beta$, i.e. a smaller $\alpha \beta$ requires a longer time to converge, but converges to a lower generalization error.
<br>

![compressed_sensing_scaling_alpha_and_beta_1_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_scaling_alpha_and_beta_1_subgradient-1.png "compressed_sensing_scaling_alpha_and_beta_1_subgradient-1")
Figure 7 : A smaller $\alpha \beta$ requires a longer time to converge, but converges to a lower generalization error.
<br>
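The two phases of this result can be reproduced with a few lines of NumPy. The sketch below reuses the sparse-recovery setup from the previous section, this time with the $\ell_1$ penalty and the sign subgradient; the sizes and hyperparameters are illustrative and not those used for Figures 5-7.

```python
# Sparse recovery with l1 regularization: fast memorization (g(a) ~ 0), then a slow
# drift toward the sparse a* driven by the sign subgradient (grokking).
# Sizes and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, s, N = 50, 3, 25             # ambient dimension, sparsity, number of samples
alpha, beta = 5e-3, 1e-3        # learning rate, regularization strength

a_star = np.zeros(n)
a_star[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
X = rng.standard_normal((N, n))
y = X @ a_star                  # noiseless labels (xi = 0)

a = np.zeros(n)
checkpoints = {0, 1_000, 10_000, 100_000, 300_000}
for t in range(300_001):
    if t in checkpoints:
        g_val = 0.5 * np.linalg.norm(X @ a - y) ** 2
        err = np.linalg.norm(a - a_star, 1)
        print(f"t={t:7d}   g(a)={g_val:9.2e}   ||a - a*||_1={err:.4f}")
    a -= alpha * (X.T @ (X @ a - y) + beta * np.sign(a))   # G(a) + beta * H(a)
```

The training loss collapses within the first few hundred steps (memorization), while $\|\mathbf{a}^{(t)} - \mathbf{a}^*\|_1$ only becomes small after on the order of $\|\hat{\mathbf{a}} - \mathbf{a}^*\|_\infty / (\alpha\beta)$ further steps, matching the delay predicted above. Replacing the sign subgradient with a subgradient of the nuclear norm (computed from an SVD) gives the analogous low-rank experiment of the next section.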
# Low-rankness

We have
* a low-rank matrix $\mathbf{A}^* \in \mathbb{R}^{n_1 \times n_2}$ of rank $r \ll \min(n_1, n_2)$;
* the labels $\mathbf{y}^* = \mathbf{X} \text{vec}(\mathbf{A}^*) + \boldsymbol{\xi}$ (matrix completion, matrix sensing, etc.);
* $g(\mathbf{A}) = \frac{1}{2} \| \mathbf{X} \text{vec}(\mathbf{A}) - \mathbf{y}^* \|_2^2$;
* $f(\mathbf{A}) = g(\mathbf{A}) + \beta \|\mathbf{A}\|_*$ with $\|\mathbf{A}\|_* = \sum_{i} \sigma_{i}(\mathbf{A})$.

**Theorem 3.4 and 3.5 (Informal)** Under certain conditions on $\alpha$, $\beta$ and $\boldsymbol{\xi}$:
* (i) Memorization: $\text{vec}(\mathbf{A}^{(t)}) \rightarrow \text{vec}(\mathbf{\hat{A}}) := \left( \mathbf{X}^\top \mathbf{X} \right)^{\dag} \mathbf{X}^\top \mathbf{y}^*$ at step $t_1<\infty$.
* (ii) Generalization: after $t_1$, $\| \mathbf{A}^{(t)} \|_* - \|\mathbf{A}^* \|_*$ becomes $\mathcal O (\alpha\beta)$ within on the order of $\frac{\| \hat{\mathbf{A}} - \mathbf{A}^* \|_{2 \to 2}}{\alpha \beta}$ more training steps.

For suitable $\mathbf{X}$, this implies
$$
\begin{equation}
\|\mathbf{A}^{(t)} - \mathbf{A}^*\|_* = \mathcal{O}(\alpha\beta) \Longleftrightarrow t \ge t_1 + \frac{\| \mathbf{\hat{A}} - \mathbf{A}^* \|_{2 \to 2}}{\alpha \beta}
\end{equation}
$$

![matrix-completion_gradient_norm_a_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/matrix-completion_gradient_norm_a_subgradient-1.png "matrix-completion_gradient_norm_a_subgradient-1")
Figure 8 : $G(\mathbf{A}^{(t)})$ dominates $\beta H(\mathbf{A}^{(t)})$ until memorization at $t_1$, where $g(\mathbf{A}^{(t_1)})\approx0$. From memorization onward, $\beta H(\mathbf{A}^{(t)})$ dominates and makes $\|\mathbf{A}^{(t)}\|_*$ converge to $\|\mathbf{A}^*\|_*$ at $t_2 = t_1 + \Delta t$, so that $\mathbf{A}^{(t_2)} \approx \mathbf{A}^*$.
<br>

![matrix-completion_scaling_time_and_error_vs_alpha_and_beta_star_subgradient-1](/images/posts/2025-07-06-grokking_beyong_l2/matrix-completion_scaling_time_and_error_vs_alpha_and_beta_star_subgradient-1.png "matrix-completion_scaling_time_and_error_vs_alpha_and_beta_star_subgradient-1")
Figure 9 : We can see that $t_2 \propto \| \mathbf{\hat{A}} - \mathbf{A}^*\|_{2 \to 2} / \alpha \beta$ and $\|\mathbf{A}^{(t_2)}-\mathbf{A}^*\|_* \propto \alpha \beta$, i.e. a smaller $\alpha \beta$ requires a longer time to converge, but converges to a lower generalization error.
<br>

# Algorithmic Dataset

![mlp_algorithmic_dataset_scaling_alpha_and_beta_2-1](/images/posts/2025-07-06-grokking_beyong_l2/mlp_algorithmic_dataset_scaling_alpha_and_beta_2-1.png "mlp_algorithmic_dataset_scaling_alpha_and_beta_2-1")
![mlp_algorithmic_dataset_scaling_alpha_and_beta_1-1](/images/posts/2025-07-06-grokking_beyong_l2/mlp_algorithmic_dataset_scaling_alpha_and_beta_1-1.png "mlp_algorithmic_dataset_scaling_alpha_and_beta_1-1")
![mlp_algorithmic_dataset_scaling_alpha_and_beta_nuc_small_plot-1](/images/posts/2025-07-06-grokking_beyong_l2/mlp_algorithmic_dataset_scaling_alpha_and_beta_nuc_small_plot-1.png "mlp_algorithmic_dataset_scaling_alpha_and_beta_nuc_small_plot-1")

# Non-linear Teacher-student

![2layers_nn_scaling_alpha_and_beta_1_small_plot-1](/images/posts/2025-07-06-grokking_beyong_l2/2layers_nn_scaling_alpha_and_beta_1_small_plot-1.png "2layers_nn_scaling_alpha_and_beta_1_small_plot-1")
![2layers_nn_sobolev_scaling_alpha_and_beta_d1_small_plot-1](/images/posts/2025-07-06-grokking_beyong_l2/2layers_nn_sobolev_scaling_alpha_and_beta_d1_small_plot-1.png "2layers_nn_sobolev_scaling_alpha_and_beta_d1_small_plot-1")

# Data Selection

![matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2-1](/images/posts/2025-07-06-grokking_beyong_l2/matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2-1.png "matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2-1")
![matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2_nologx-1](/images/posts/2025-07-06-grokking_beyong_l2/matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2_nologx-1.png "matrix-completion_scaling_N_and_tau_subgradient_n=100_r=2_nologx-1")

# Overparameterization with Depth: Sparse Recovery

![compressed_sensing_scaling_depth_error_vs_N_and_L_small_subgradient_n=100_s=5_test_error-1](/images/posts/2025-07-06-grokking_beyong_l2/compressed_sensing_scaling_depth_error_vs_N_and_L_small_subgradient_n=100_s=5_test_error-1.png "compressed_sensing_scaling_depth_error_vs_N_and_L_small_subgradient_n=100_s=5_test_error-1")

# References

[1] Ziming Liu et al. "Omnigrok: Grokking Beyond Algorithmic Data". In: The Eleventh International Conference on Learning Representations. 2023. URL: https://openreview.net/forum?id=zDiHoIWa0q1.

[2] Kaifeng Lyu et al. Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking. 2024. arXiv: 2311.18817 [cs.LG]. URL: https://arxiv.org/abs/2311.18817.

[3] Pascal Jr Tikeng Notsawo et al. Grokking Beyond the Euclidean Norm of Model Parameters. 2025. arXiv: 2506.05718 [cs.LG]. URL: https://arxiv.org/abs/2506.05718.

[4] Alethea Power et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets". In: arXiv preprint arXiv:2201.02177 (2022).

[5] Andrey Gromov. "Grokking modular arithmetic". In: arXiv preprint arXiv:2301.02679 (2023).
