# Response to Reviewer U2TN (2)
We thank Reviewer U2TN for appreciating the relevance of our work. We now address the main points of the review separately.
**Response to 1): Limit $\lambda \to 0$ for the min-norm interpolator results in Theorem 3.1.** We obtain the asymptotic risk of the min-$\ell_2$-norm interpolator by taking the limit $\lambda \to 0$ of the RHS in Equation (eq). We remark that $m(z)$ is the Stieltjes transform of the Marchenko-Pastur law. Hence, if $\gamma \neq 1$, the limit $\lim_{z \to 0} m(z)$ exists and equals $1/(1-\gamma)$ if $\gamma < 1$ and $1/(\gamma(\gamma - 1))$ if $\gamma > 1$; in particular, it is not infinite, as suggested by Reviewer U2TN. For further details we refer the reader to Corollary 5 in [2]. We added these remarks to Theorem 3.1 and its proof in the revised version of our paper.
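For concreteness, the ridgeless limit used in the proof of Theorem 3.1 can be summarized in display form (restating the values above):
$$
\lim_{z \to 0} m(z) \;=\;
\begin{cases}
\frac{1}{1-\gamma}, & \gamma < 1,\\
\frac{1}{\gamma(\gamma-1)}, & \gamma > 1,
\end{cases}
$$
so the limit is finite whenever $\gamma \neq 1$.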
**Response to 2): Large $\lambda$ in robust logistic ridge regression inducing the zero vector and a large risk.** We point out that for *classification* the risk defined with the 0-1 loss depends only on the direction of the estimator, not on its norm. Therefore, even though a large $\lambda$ shrinks the norm of the estimator towards zero, this does not contradict the good performance the estimator attains in Figure 4b for large $\lambda$.
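To make this explicit, writing the standard 0-1 risk of a linear classifier in generic notation (the same scale invariance holds for its adversarial counterpart, since rescaling $\theta$ by $c > 0$ does not change the sign of $(x+\delta)^\top \theta$ for any perturbation $\delta$):
$$
R_{0\text{-}1}(\theta) \;=\; \mathbb{P}\big(\operatorname{sign}(x^\top \theta) \neq y\big) \;=\; R_{0\text{-}1}(c\,\theta) \quad \text{for all } c > 0,
$$
so shrinking the norm of the estimator while preserving its direction leaves the 0-1 risk unchanged.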
**Response to 3) (i): Why we study regularization with an $\ell_2$-penalty.** We thank Reviewer U2TN for this question, as it made us realize that we need to emphasize the motivation for ridge regularization much more clearly in the revised version of our paper.
In the limit $\lambda \to 0$, ridge ($\ell_2$) regularization yields the robust max-$\ell_2$-margin and min-$\ell_2$-norm interpolators for logistic and linear regression, respectively. The min-$\ell_2$-norm interpolator is also the interpolating estimator obtained from *unregularized gradient descent*.
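As a quick numerical illustration of this well-known equivalence (a minimal NumPy sketch for plain, non-robust linear regression; the dimensions and data are arbitrary and not taken from our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                                  # overparameterized regime: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Min-l2-norm interpolator: theta = X^+ y
theta_min_norm = np.linalg.pinv(X) @ y

# Ridge solution with a vanishingly small penalty
lam = 1e-8
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Unregularized gradient descent on the squared loss, initialized at zero
theta_gd = np.zeros(p)
step = 1.0 / np.linalg.norm(X, ord=2) ** 2      # 1 / sigma_max(X)^2
for _ in range(10_000):
    theta_gd -= step * X.T @ (X @ theta_gd - y)

print(np.linalg.norm(theta_ridge - theta_min_norm))   # close to 0
print(np.linalg.norm(theta_gd - theta_min_norm))      # close to 0
```

Both the small-$\lambda$ ridge solution and the gradient descent iterates (started at zero) converge to the same min-$\ell_2$-norm interpolator.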
While this fact is well known, we also empirically observe that the optimization path of gradient descent and the path of robust logistic regression with decreasing ridge penalty exhibit similar risk curves, i.e., early stopping and ridge regularization yield solutions with similar robust risks. In fact, Rice et al. [3] observe exactly this: early stopping benefits robust generalization.
To better highlight this analogy, we moved the discussion on the similarities between gradient descent and ridge regularization to the main text in the revised version of our paper. In particular, we encourage future work to study the gradient descent path with respect to the robust risk and to provide theoretical evidence for the similarities observed in Figure 8.
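The analogy can already be seen in the simplest (non-robust) linear regression setting, where gradient descent stopped after $t$ steps with step size $\eta$ behaves similarly to ridge regression with $\lambda \approx 1/(\eta t)$. The following sketch uses this heuristic correspondence on synthetic data (parameters are illustrative and not taken from our experiments):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 300, 0.5
theta_star = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = X @ theta_star + sigma * rng.standard_normal(n)

def excess_risk(theta):            # population risk for isotropic features
    return np.sum((theta - theta_star) ** 2)

eta = 1.0 / np.linalg.norm(X, ord=2) ** 2
theta_gd, t = np.zeros(p), 0
for t_stop in [10, 100, 1_000, 10_000]:
    while t < t_stop:              # continue gradient descent up to t_stop steps
        theta_gd -= eta * X.T @ (X @ theta_gd - y)
        t += 1
    lam = 1.0 / (eta * t_stop)     # heuristic early-stopping <-> ridge correspondence
    theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(t_stop, excess_risk(theta_gd), excess_risk(theta_ridge))
```

In this toy example the two risk columns are of comparable magnitude along the entire path; Figure 8 shows the analogous behavior for the robust risk in our setting.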
**Response to 3) (ii): Effect of regularization with an $\ell_1$-penalty.** Nonetheless, it is perfectly reasonable to ask what would happen if we regularized with an $\ell_1$-penalty and compared $\lambda > 0$ with $\lambda \to 0$. We briefly investigated this setting as well and came to the following conclusions:
An explicit $\ell_1$-penalty induces a strong bias towards sparse solutions. Such estimators have been studied for standard linear classification and can even be consistent in the limit $\lambda \to 0$ (see [4] and the references therein). In fact, we observe in our own experiments that using $\ell_1$-regularization instead of $\ell_2$ allows the estimator to achieve a vanishing robust risk, provided that the ground truth is sparse.
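The sparsity-seeking behavior of the $\ell_1$-penalty is easy to see in a toy noiseless, sparse, overparameterized linear model (a minimal scikit-learn sketch measuring standard, not robust, estimation error; the parameters are illustrative and not taken from our experiments):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p, s = 100, 400, 5                        # s-sparse ground truth, p >> n
theta_star = np.zeros(p)
theta_star[:s] = 1.0
X = rng.standard_normal((n, p))
y = X @ theta_star                           # noiseless observations

lasso = Lasso(alpha=0.01, max_iter=50_000).fit(X, y)
ridge = Ridge(alpha=0.01).fit(X, y)

print("nonzeros (l1):", np.count_nonzero(lasso.coef_))                 # few, close to s
print("nonzeros (l2):", np.count_nonzero(np.abs(ridge.coef_) > 1e-8))  # essentially all p
print("error (l1):", np.linalg.norm(lasso.coef_ - theta_star))         # small
print("error (l2):", np.linalg.norm(ridge.coef_ - theta_star))         # large
```

The $\ell_1$-regularized estimator essentially recovers the sparse ground truth, while the $\ell_2$-regularized estimator (close to the min-$\ell_2$-norm interpolator for small $\lambda$) spreads mass over all coordinates and incurs a much larger estimation error.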
Furthermore, an important contribution of our paper is to show that ridge ($\ell_2$) regularization can lead to unexpected benefits even in completely noiseless settings, indicating that its role goes beyond the classical perception of a variance-reduction technique. In contrast, $\ell_1$-regularization induces a bias towards sparse solutions and therefore has a very different effect. Having said that, a detailed study of the role of $\ell_1$-regularization in improving the robust risk would be an interesting direction for future work.
**Response to 4): Suggestions for additional references.** We would like to thank Reviewer U2TN for noting that regularization in linear regression mitigates the peak of the standard risk at the interpolation threshold by reducing the variance in the presence of noise, as established in [1,2,5]. Indeed, it is well known that ridge regularization trades off bias against variance (see the decomposition sketched after the list below), and it is therefore not surprising that explicit ridge regularization mitigates the peak in the double descent curve, which is caused by the large variance. In contrast, as highlighted in the paragraph above, our paper shows the benefits of ridge regularization for the robust risk in
- highly overparameterized settings, where the effect of explicit regularization on the standard risk is negligible, as also shown in [1,2];
- noiseless settings, where the variance is zero and thus the standard risk does not benefit from regularization, even at the interpolation threshold.
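For completeness, the bias-variance decomposition we refer to above reads, in generic notation for a linear model with noise variance $\sigma^2$ and feature covariance $\Sigma$ (stated here only to make the argument explicit):
$$
R(\hat\theta_\lambda) \;=\; \underbrace{\big\|\,\mathbb{E}[\hat\theta_\lambda] - \theta^\star\big\|_{\Sigma}^2}_{\text{bias}^2(\lambda)} \;+\; \underbrace{\operatorname{tr}\!\big(\Sigma \,\mathrm{Cov}(\hat\theta_\lambda)\big)}_{\text{variance}(\lambda)\,\propto\,\sigma^2},
$$
so in noiseless settings ($\sigma^2 = 0$) the variance term vanishes and the standard risk does not benefit from regularization, in line with the second bullet above.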
**Response to 5,6): Typos.** We would like to thank Reviewer U2TN for pointing out the typos. We fixed them in the revised manuscript.
[1]: Optimal Regularization Can Mitigate Double Descent. Preetum Nakkiran, Prayaag Venkat, Sham Kakade, Tengyu Ma. ICLR 2021.\
[2]: Surprises in High-Dimensional Ridgeless Least Squares Interpolation. Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani. arXiv 2018.\
[3]: Overfitting in adversarially robust deep learning. Leslie Rice, Eric Wong, J. Zico Kolter. ICML 2020.\
[4]: AdaBoost and robust one-bit compressed sensing. Geoffrey Chinot, Felix Kuchelmeister, Matthias Löffler, Sara van de Geer. arXiv 2021.\
[5]: Benign overfitting in ridge regression. A. Tsigler, P. L. Bartlett. arXiv 2020.