We thank Reviewer g2yR for appreciating the relevance of our work. Their high-level summary exactly captures the key message we want to convey with our paper. We now address the main points raised in the review separately.

**Response to 1): Optimal $\lambda$ for standard risk in noiseless linear regression.** As correctly noted by the reviewer, the optimal $\lambda_\text{opt}$ is zero for the standard risk in the noiseless case. This is indeed predicted by our theory by plugging $\epsilon = 0$ and $\sigma=0$ into Theorem 3.1. We have added a comment on this in the revised version of our paper.

**Response to 2)(i): Reason for picking $\epsilon = O(1/\sqrt{d})$ vs. larger $\epsilon$ for classification.** We would like to thank Reviewer g2yR for raising this important question, and we appreciate the opportunity to explain this point in more detail, since space constraints prevented us from including a discussion in the main text. The primary reason for choosing $\epsilon = 1/\sqrt{d}$ is that it leads to asymptotic predictions which closely match our experimental simulations for finite $d,n$ run with small attack sizes $\epsilon$ (see Figure 4a and Appendix B for experimental details). We now further argue from a technical perspective why a vanishing $\epsilon = O(1/\sqrt{d})$ is necessary to allow for a non-vanishing robust margin of the estimator, $$\min\_i y\_i \langle x\_i, \theta/\|\theta\|\_2\rangle - \epsilon \| \Pi\_{\perp} \theta/\|\theta\|\_2 \|\_1 .$$ Minimizing the loss function in Equation (23) leads to a tradeoff between maximizing the margin $\min\_i y\_i\langle x\_i, \theta/\|\theta\|\_2 \rangle$ and minimizing the term $\max\_{\| \delta\|\_{\infty} \leq \epsilon} \langle \delta, \theta/\|\theta\|\_2\rangle = \epsilon \|\Pi\_{\perp} \theta\|\_1/ \|\theta\|\_2$. In order to have an asymptotically non-vanishing margin (and thus also a non-vanishing robust margin), the estimator has to align with $n \asymp d$ approximately orthogonal vectors $x_i$. Hence, the $\ell_1$-norm of the resulting estimator $\theta$ grows with $d$; in fact, one can see that the $\ell_1$-norm of the max-$\ell_2$-margin estimator grows at rate $\sqrt{d}$ (this follows from Figure 4a and the definition of the robust risk). At the same time, the margin is non-zero and bounded for any $\theta$, as a consequence of the existence of a finite solution of the optimization problem in Theorem F.2. Finally, by choosing $\epsilon = O(1/\sqrt{d})$, we have $\epsilon \| \Pi_{\perp}\theta/\|\theta\|_2\|_1 = O(1)$, and hence both terms are of the same order.
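For completeness, the following elementary computation makes the scaling explicit (a short sketch using only Hölder duality and Cauchy–Schwarz, written for the projected direction $v := \Pi\_{\perp} \theta/\|\theta\|\_2$): $$\max\_{\|\delta\|\_{\infty} \leq \epsilon} \langle \delta, v\rangle \;=\; \epsilon \|v\|\_1 \;\leq\; \epsilon \sqrt{d}\, \|v\|\_2 \;\leq\; \epsilon \sqrt{d},$$ where the equality is the Hölder duality between the $\ell\_\infty$-ball and the $\ell\_1$-norm, the first inequality is Cauchy–Schwarz, and the last one uses that $\Pi\_{\perp}$ is an orthogonal projection, so $\|v\|\_2 \leq 1$. Since $\|v\|\_1$ indeed grows at rate $\sqrt{d}$ for the max-$\ell\_2$-margin estimator, the upper bound is attained up to constants: a constant $\epsilon$ lets the adversarial penalty grow like $\sqrt{d}$ and dominate the bounded margin, whereas $\epsilon = O(1/\sqrt{d})$ keeps it comparable to the margin.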
**Response to 2)(ii): Effect of large $\epsilon$.** Nevertheless, the question of what happens when $\epsilon$ stays constant as $d \to \infty$ is still interesting. In the case of sparse ground truths, large $\epsilon$ can not only be “tolerated” as shown in [3], but we in fact observe in Figure 7 that larger $\epsilon$ in the noiseless case may even lead to consistent estimation! This is the second reason why we did not discuss this setting: it is another example in which ridge regularization does not benefit robust generalization. We briefly give some intuition for why this might be the case by referring to the expression of the robust loss derived in Equation (23): it is an average of $\ell(y\_i \langle x\_i, \theta \rangle - \epsilon \| \Pi\_{\perp} \theta\|\_1)$, and hence varying $\epsilon$ leads to a tradeoff between maximizing the margins $y\_i \langle x\_i, \theta \rangle$ and minimizing the projected $\ell\_1$-norm $\| \Pi\_{\perp} \theta\|\_1$. In particular, for large perturbations $\epsilon$, although counterintuitive, we can recover the sparse ground truth, since the $\ell\_1$-penalty induces a sparsity bias that aligns with the ground truth. Since the robust max-$\ell\_2$-margin interpolator is then already consistent (i.e., achieves vanishing risks as $d,n \to \infty$), the benefits of ridge regularization are limited. Finally, we would like to mention that, as future work, a thorough non-asymptotic analysis that mathematically captures this intuition would greatly improve our understanding of robust overfitting in noiseless settings.

**Response to 3): $\ell_\infty$- vs. $\ell_2$-perturbations in classification.** Since this is a natural question to ask, we had already investigated $\ell_2$-perturbations for robust logistic regression experimentally and considered them theoretically. However, ridge regularization does not benefit robustness in that scenario: $\ell_2$-perturbations add an $\ell_2$-penalty to the logistic loss, which leads to a shrinkage of the $\ell_2$-norm of the estimator $\theta$. In particular, this results in an effect similar to adding an explicit ridge ($\ell_2$) penalty, and since we do not observe any robust overfitting for standard training (see Figure 10), we also do not expect any benefits of ridge regularization when training with $\ell_2$-perturbations. Our intuition was confirmed by experiments, which we can add to a revised version together with a discussion. Lastly, the proof of Theorem 4.1 can be extended to $\ell_2$-perturbations straightforwardly, but requires a constant $\epsilon$ in order to observe any effect of adversarial training. Note that this stands in contrast to the case of $\ell_{\infty}$-perturbations discussed in the response to question 2). The main reason is that in the case of $\ell_2$-perturbations, $\max_{\| \delta\|_{2} \leq \epsilon} \langle \delta, \theta/\|\theta\|_2\rangle = \epsilon$, and thus the influence of the adversarial perturbations is bounded even for constant attack sizes $\epsilon$ as $d\to \infty$. To give a more detailed answer on how to modify the proof for $\ell\_2$-perturbations: we would essentially only have to replace $\|w\|\_{\infty}$ with $\|w\|\_2$ on line 809 in the proof of Theorem F.1. For the choice $\epsilon = O(1/\sqrt{d})$, the optimal $w \approx 0$, which means that adversarial attacks have asymptotically no influence. On the other hand, for a constant $\epsilon$, the optimal $\|w\|\_2$ is non-zero. In this case, a modification of the argument on lines 819-822 captures the tradeoff between minimizing the adversarial penalty (which has the closed-form expression $\epsilon \|\Pi\_{\perp} \theta\|\_2$ when training with $\ell\_2$-perturbations) and maximizing the margins $y\_i \langle x\_i, \theta \rangle$. Finally, a similar argument also holds for the proof of Theorem F.2.
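To illustrate this contrast numerically, here is a minimal sketch (not the experimental setup from Appendix B; it simply uses a random dense unit vector as a stand-in for the normalized estimator $\theta/\|\theta\|_2$ and omits the projection $\Pi_{\perp}$ for simplicity), comparing the worst-case $\ell_\infty$- and $\ell_2$-perturbation terms for a constant attack size $\epsilon$ as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1  # constant attack size, held fixed as d grows

for d in [100, 1_000, 10_000]:
    # Random dense unit vector as a stand-in for the normalized estimator theta / ||theta||_2.
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)

    # Worst-case inner product <delta, theta> over the two perturbation sets (Hölder duality):
    #   ||delta||_inf <= eps  ->  eps * ||theta||_1   (grows roughly like eps * sqrt(d))
    #   ||delta||_2   <= eps  ->  eps * ||theta||_2   (equals eps for every d)
    linf_term = eps * np.linalg.norm(theta, ord=1)
    l2_term = eps * np.linalg.norm(theta, ord=2)

    print(f"d = {d:6d}   ell_inf term = {linf_term:7.2f}   ell_2 term = {l2_term:.2f}")
```

For $\ell_2$-perturbations the adversarial term stays equal to $\epsilon$ for every $d$, which is precisely why a constant $\epsilon$ is needed to see any effect of adversarial training, whereas for $\ell_\infty$-perturbations the term grows roughly like $\epsilon\sqrt{d}$, in line with the response to question 2).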
**Response to 4): Discussion on additional references.** We would like to thank the reviewer for pointing out these references. Reference [1] is a more accurate citation than [4] for the convergence of the gradient descent path of robust logistic regression. Furthermore, reference [3] studies robust estimators trained with gradient descent and shows that early stopping can lead to robustness to noise. This is a relevant citation, which we have added to the discussion of Figure 8, emphasizing how early stopping prevents the robust logistic regression estimate from overfitting. Lastly, we agree that [2] is another important contribution providing tight bounds for the standard risk of linear ridge regression and min-$\ell_2$-norm interpolators for overparameterized regression. Our focus, however, is on understanding the heavily overparameterized regime, where previous works [4,5] show that ridge regularization is not beneficial compared to min-$\ell_2$-norm interpolation. We would like to refer to the section in our general response highlighting our main contribution and how it differs from other works that study regularized estimators.

**Response to 5): Convergence of gradient descent to robust max-margin.** In our paper, we primarily cite Corollary 3.2 of [6], where the authors point out that a straightforward consequence of [7,8] is that gradient descent minimizing the unregularized robust logistic loss in Equation (8) converges to the robust max-$\ell_2$-margin interpolator from Equation (9). In fact, reference [1], pointed out by the reviewer, explicitly proves that gradient descent converges to the robust max-$\ell_2$-margin solution via their Definition 3.2, Lemma 3.2, and Theorem 3.4. Since [1] provides a much cleaner and more rigorous statement, we have changed the citation in the revised version of our paper.

**Response to 6): Details on the proofs of the theorems.** We have addressed this shortcoming by adding a comment on the methodology used in the proof of Theorem 3.1 to the revised version of our paper.

**Typos.** We would like to thank Reviewer g2yR for pointing out the typos. We fixed them in our revised manuscript.

[1] Implicit Bias of Gradient Descent based Adversarial Training on Separable Data. Yan Li, Ethan X. Fang, Huan Xu, Tuo Zhao. ICLR 2021.\
[2] Benign Overfitting in Ridge Regression. A. Tsigler, P. L. Bartlett. arXiv, 2020.\
[3] Provable Robustness of Adversarial Training for Learning Halfspaces with Noise. Difan Zou, Spencer Frei, Quanquan Gu. ICML 2021.\
[4] Optimal Regularization Can Mitigate Double Descent. Preetum Nakkiran, Prayaag Venkat, Sham Kakade, Tengyu Ma. ICLR 2021.\
[5] Surprises in High-Dimensional Ridgeless Least Squares Interpolation. Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani. arXiv, 2018.\
[6] Precise Statistical Analysis of Classification Accuracies for Adversarial Training. Adel Javanmard, Mahdi Soltanolkotabi. arXiv, 2020.\
[7] Risk and Parameter Convergence of Logistic Regression. Z. Ji, M. Telgarsky. COLT 2019.\
[8] The Implicit Bias of Gradient Descent on Separable Data. D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, N. Srebro. JMLR 2018.