Konstantin Donhauser

@9rgrD--hSxWXW83ccFIBGg

Joined on Mar 10, 2021

  • $f_n = \sum_i (1 - \xi_i v \vert X_i \vert)_+^2$ with $v > 0$, so $f_n \geq \sum_i P(\xi_i = 1 \mid X_i)\,(1 - \vert X_i \vert)_+^2$. By continuity of $P$ we now have that $v^* > 0$ implies there exist $x_1 > c_0$ and $\delta > 0$ such that for all $x \in [x_1, x_1 + \delta]$: $P(\xi \mid x) > c_1$. Hence $f_n \geq \sum_i \mathbb{1}[X_i \in [x_1, x_1 + \delta]]\, c_1 (1 - \vert X_i \vert)_+^2$. What is $\frac{1}{n}\sum_i \mathbb{1}[X_i \in [x_1, x_1 + \delta]]$? It is an average of i.i.d. Bernoulli variables, so we have Bernoulli concentration with an exponential tail, and thus $\frac{1}{n}\sum_i \mathbb{1}[X_i \in [x_1, x_1 + \delta]] \to c_3 > 0$ (a worked version of this concentration step is sketched below).
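To make the last step concrete, here is one standard way to obtain the exponential tail (a sketch; the constants $c_1, c_3$ and the interval are as in the note). Writing $Z_i = \mathbb{1}[X_i \in [x_1, x_1 + \delta]]$, the $Z_i$ are i.i.d. Bernoulli with mean $p := P(X \in [x_1, x_1 + \delta]) > 0$, and Hoeffding's inequality gives
$$P\left( \left| \frac{1}{n}\sum_{i=1}^n Z_i - p \right| \geq t \right) \leq 2 e^{-2 n t^2},$$
so by Borel-Cantelli, $\frac{1}{n}\sum_i Z_i \to p =: c_3 > 0$ almost surely.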
  • We thank Reviewer g2yR for appreciating the relevance of our work. Their high-level summary exactly captures the key message we want to convey with our paper. We now address the main points raised in the review separately. Response to 1): Optimal $\lambda$ for standard risk in noiseless linear regression. As correctly noted by the reviewer, the optimal $\lambda_\text{opt}$ is zero for the standard risk in the noiseless case. This is indeed predicted by our theory by plugging $\epsilon = 0$ and $\sigma = 0$ into Theorem 3.1. We added a comment on this in a revised version of our paper. Response to 2)(i): Reason for picking $\epsilon = O(1/\sqrt{d})$ vs. larger $\epsilon$ for classification. We would like to thank Reviewer g2yR for raising this important question. We appreciate that we can explain this in more detail, since we could not include a discussion in the main text due to space constraints. The primary reason for choosing $\epsilon = 1/\sqrt{d}$ is that this leads to asymptotic predictions which closely match our experimental simulations for finite $d, n$ run with small attack sizes $\epsilon$ (see Figure 4a and Appendix B for experimental details). We now further argue from a technical perspective how a vanishing $\epsilon = O(1/\sqrt{d})$ is necessary to allow non-vanishing robust margins of the estimator (see the norm-ratio calculation below): $$\min_i \langle x_i, \theta/\Vert\theta\Vert_2\rangle - \epsilon \Vert \Pi_{\perp} \theta/\Vert\theta\Vert_2 \Vert_1.$$
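A short heuristic calculation (our addition, not verbatim from the paper) shows why $\epsilon$ must shrink with $d$. By Cauchy-Schwarz, $\Vert v \Vert_1 \leq \sqrt{d}\, \Vert v \Vert_2$ for $v \in \mathbb{R}^d$, so for a unit-norm estimator the penalty term obeys
$$\epsilon \left\Vert \Pi_{\perp} \frac{\theta}{\Vert\theta\Vert_2} \right\Vert_1 \leq \epsilon \sqrt{d},$$
and for a dense direction $\theta$ (whose $\ell_1$-norm is in fact of order $\sqrt{d}$ times its $\ell_2$-norm) the penalty genuinely scales as $\epsilon \sqrt{d}$ while the margin term $\min_i \langle x_i, \theta/\Vert\theta\Vert_2\rangle$ stays of order one. A non-vanishing robust margin therefore requires $\epsilon \sqrt{d} = O(1)$, i.e. $\epsilon = O(1/\sqrt{d})$.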
  • We thank Reviewer kE4t for appreciating the relevance of our work. We now address the main points raised in the review. "[...] to provide a faithful comparison to related work some experimental evidence on real world data might be beneficial": Prior works already provide experimental evidence on real-world data. Our work draws on two directions of related work: 1) the theoretical results that characterize the inductive bias of min-$\ell_2$-norm interpolators (for regression) and max-$\ell_2$-margin interpolators (for classification) in the overparameterized regime; and 2) the empirical observation of the phenomenon of robust overfitting [1]. We assume that Reviewer kE4t refers to the latter, but we would be happy to clarify the relationship to the former line of work as well if desired. Prior work shows experimentally that deep neural networks trained on large image datasets benefit from early stopping when evaluated with the adversarially robust risk [1], or with the worst-case risk among subpopulations [2,3]. A recent work [4] argues that overfitting of the adversarially robust risk may be due to noise in the training data. In our paper, we instead focus on showing that robust overfitting occurs even in settings in which we can reduce label noise to a minimum (see the experiments in Figure 1 for more details). "While the results presented in this paper are useful and interesting, I have some concerns about the form of adversarial perturbations considered. In particular, there is not sufficient motivation why the particular definition of adversarial perturbation is used. It seems as though considering perturbations orthogonal to the ground truth may primarily be well suited to the linear ground truth setting. It is not clear if this is a valid choice for other models or how to generalize such perturbations to different models, especially ones where the data is lower dimensional. It is also unclear why 'adversarial' perturbations are consistent. That is a strong assumption." (A small sketch of a consistent, ground-truth-orthogonal perturbation follows below.)
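For concreteness, here is a minimal sketch (our illustration, not code from the paper) of a consistent perturbation in the linear setting: the perturbation $\delta$ is constrained to be orthogonal to the ground truth $\theta^*$, so the noiseless label $y = \mathrm{sign}(\langle x, \theta^* \rangle)$ of the perturbed point is unchanged. The dimensions and attack size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
theta_star = rng.normal(size=d)   # ground truth direction (illustrative)
x = rng.normal(size=d)            # a clean data point
y = np.sign(x @ theta_star)       # its (noiseless) label

# Consistent perturbation: project a random direction onto the
# orthogonal complement of theta_star, then rescale to norm eps.
eps = 0.5
delta = rng.normal(size=d)
delta -= (delta @ theta_star) / (theta_star @ theta_star) * theta_star
delta *= eps / np.linalg.norm(delta)

x_adv = x + delta
# <x_adv, theta*> equals <x, theta*> up to floating-point error,
# so the ground-truth label of the perturbed point is unchanged.
assert np.sign(x_adv @ theta_star) == y
```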
  • We thank Reviewer yfXT for the feedback on our work. "My main concern is the paper's writing and clarity. I found the flow of the paper lacking and, in general, the message the authors tried to convey did not pass clearly. In addition, there are many sloppy typos throughout the draft. Doing a very thorough revision and improving the writing flow would greatly benefit the paper and make the message much clearer." Typos and clarity of writing: We have made substantial improvements to the exposition of the paper, detailed in the general comments, which we hope convey our message much more clearly. "The observations and conclusions from each of the experiments should be clearly stated." Discussion of the experiments: The submitted paper states the main take-aways of each experiment in the respective figure caption. In addition, we provide a detailed interpretation of each result in the main text. A detailed description of the experimental setting is in the supplementary material. We kindly ask Reviewer yfXT to point out which experiments were unclear in particular, so that we can point to the respective explanation, improve those explanations, and/or provide clarifying remarks.
  • We thank Reviewer U2TN for appreciating the relevance of our work. We now address the main points of the review separately. Response to 1): Limit of $\lambda \to 0$ for min-norm interpolator results in Theorem 3.1. We obtain the asymptotic risk of the min-$\ell_2$-norm interpolator by taking the limit $\lim_{\lambda \to 0}$ of the RHS in Equation (eq). We remark that $m(z)$ is the Stieltjes transform of the Marchenko-Pastur law. Hence, if $\gamma \neq 1$, then the limit $\lim_{z \to 0} m(z)$ exists and equals $1/(1-\gamma)$ if $\gamma < 1$ and $1/(\gamma(\gamma - 1))$ if $\gamma > 1$; in particular, the limit is finite, contrary to Reviewer U2TN's suggestion. For further details we point the reader to Corollary 5 in [2]. We added the above remarks to Theorem 3.1 and its proof in the revised version of our paper. Response to 2): Large $\lambda$ in robust logistic ridge regression inducing the zero vector and large risk. We point out that for classification the risk defined with the 0-1 loss depends only on the direction of the estimator, not on its norm (see the one-line argument below). Therefore, a large $\lambda$ does not contradict the good performance the estimator attains in Figure 4b for large $\lambda$. Response to 3)(i): Why we study regularization with an $\ell_2$-penalty. We thank Reviewer U2TN for this question, as it made us realize that we have to emphasize the motivation for ridge regularization much more clearly in the revised version of our paper. Taking the limit $\lim_{\lambda \to 0}$ of ridge ($\ell_2$) regularization yields the robust max-$\ell_2$-margin and min-$\ell_2$-norm interpolators for logistic and linear regression, respectively. The min-$\ell_2$-norm interpolator is also the interpolating estimator obtained from unregularized gradient descent. While this fact is well known, we also empirically observe that the optimization paths of robust logistic regression with decreasing ridge penalty and of gradient descent exhibit similar risk curves, i.e., early stopping and ridge regularization yield solutions with similar robust risks. In fact, Rice et al. [3] observe exactly that early stopping benefits robust generalization.
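The scale-invariance claim in Response 2) can be stated in one line: for any $c > 0$ and any $x$,
$$\mathrm{sign}(\langle x, c\,\theta \rangle) = \mathrm{sign}(\langle x, \theta \rangle), \quad\text{hence}\quad \mathbb{E}\,\mathbb{1}\{\mathrm{sign}(\langle x, c\,\theta\rangle) \neq y\} = \mathbb{E}\,\mathbb{1}\{\mathrm{sign}(\langle x, \theta\rangle) \neq y\},$$
so shrinking $\Vert\theta\Vert_2$ via a large ridge penalty leaves the 0-1 risk unchanged as long as the direction of the estimator is preserved.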
  • We would like to thank all the reviewers, including the area chair, for their contribution to this conference and for helping us improve the quality of our paper. We briefly summarize the major contributions of this paper and highlight the most prominent changes made in the revised version, which already address some of the concerns of the reviewers. In addition to these changes, we point out that we have substantially improved the clarity of the paper in terms of structure and language by elaborating on some key concepts (e.g. the motivation for consistent/inconsistent perturbations) and removing the typos. Main contribution and relation to other works on overparameterized linear regression that show benefits of regularization around the interpolation threshold. Our paper demonstrates how the robust risk of linear models benefits from explicit ridge regularization in overparameterized settings, even if the data is entirely noiseless. When the goal is to achieve good robust generalization, our results contradict the modern storyline that highly overparameterized models (e.g. $d/n \gg 1$ for linear models) perform best when interpolating the training data, without the need for explicit regularization. To the best of our knowledge, we are the first to provide theoretical evidence that the phenomenon of robust overfitting occurs even in entirely noiseless settings. As noted by the reviewers, a number of recent related works [1,2,3] show that regularization in linear regression mitigates the peak of the standard risk at the interpolation threshold by reducing the variance in the presence of noise. In fact, it is well known that ridge regularization leads to a bias-variance tradeoff, and it is therefore not surprising that explicit ridge regularization helps to mitigate the peak in the double descent curve that is caused by the large variance. In contrast, in our paper we show the benefits of ridge regularization for the robust risk in highly overparameterized settings, where the effect of explicit regularization on the standard risk is negligible (as also shown in [1,2]), and in noiseless settings, where the variance is zero and thus the standard risk does not benefit from regularization, even at the interpolation threshold. Motivation for studying standard training for linear regression and adversarial training for logistic regression. In our paper we study standard trained estimators for regression and adversarially trained estimators for classification. We motivated our choice by adding a short discussion in the revised version of our paper that captures the following argument: Since the goal of this paper is to study the shortcomings of interpolating estimators compared to regularized estimators, we only analyze training algorithms that allow interpolation. For regression, inconsistent adversarial training prevents interpolation (as discussed in Appendix C). On the other hand, consistent adversarial training simply recovers the ground truth for linear regression when training with noiseless samples. In contrast, for linear classification, interpolation is easier to achieve -- it only requires the sign of $\langle x_i, \theta \rangle$ to be the same as the label $y_i$ for all $i$ (see the formal condition below). In particular, when the data is sufficiently high-dimensional, it is possible to find a classifier that interpolates the adversarially perturbed training set.
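For reference (our notation, not verbatim from the paper), the interpolation condition mentioned above can be written compactly: a linear classifier $\theta$ interpolates the training set $\{(x_i, y_i)\}_{i=1}^n$ if and only if
$$y_i \langle x_i, \theta \rangle > 0 \quad \text{for all } i = 1, \dots, n,$$
and, hedging on the exact perturbation model used in the paper, it interpolates the adversarially perturbed training set if and only if $\min_{\delta \in \Delta}\, y_i \langle x_i + \delta, \theta \rangle > 0$ for all $i$, where $\Delta$ denotes the set of admissible perturbations.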
  • We make the induction assumption that for $m \to \infty$: $$f^{(h)} \sim \mathcal{GP}(0, \Sigma^{(h-1)}).$$ First, for the mean we have: $$\mathbb{E}\left[[f^{(h+1)}(x)]_i \mid f^{(h)}(x)\right] = \mathbb{E}\left[\sum_{j=1}^m W^{(h+1)}_{i,j} [g^{(h)}(x)]_j \,\middle|\, f^{(h)}(x)\right] = 0,$$ since the weights $W^{(h+1)}_{i,j}$ are zero-mean and independent of $g^{(h)}(x)$.
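The induction step is completed by the covariance. As a sketch under the standard NNGP assumptions, i.e. $W^{(h+1)}_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1/m)$ and $g^{(h)} = \sigma(f^{(h)})$ for the activation $\sigma$ (the note's normalization may differ):
$$\mathbb{E}\left[[f^{(h+1)}(x)]_i\,[f^{(h+1)}(x')]_i \mid f^{(h)}\right] = \frac{1}{m}\sum_{j=1}^m [g^{(h)}(x)]_j\,[g^{(h)}(x')]_j \xrightarrow{m \to \infty} \mathbb{E}_{f \sim \mathcal{GP}(0, \Sigma^{(h-1)})}\left[\sigma(f(x))\,\sigma(f(x'))\right] =: \Sigma^{(h)}(x, x'),$$
by the law of large numbers, so $f^{(h+1)} \sim \mathcal{GP}(0, \Sigma^{(h)})$ in the limit.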
  • # Hey, I am Alex. Question 1: One intuition is that common asymptotic results rely on $D = \max_i \|x_i\|_2$ to bound the excess risk; however, $D$ grows with $d$. Theorem 1 states that we can approximate $\hat{f}$ with a polynomial function of degree 2. This would suggest that if $f^*$ is not itself such a function, then the empirical optimum will always carry a bias. Question 2: $$k\left(x, x^{\prime}\right) = \sum_{i=0}^{\infty} \frac{1}{i!}\left(x^{T} x^{\prime}\right)^{i} = \exp\left(x^T x^{\prime}\right),$$ i.e., the exponential (dot-product) kernel written as its Taylor series (a quick numerical check follows below).
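A quick numerical sanity check of the series identity above (our illustration; the truncation depth of 30 terms is arbitrary):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)
s = float(x @ xp)

# Truncated Taylor series of the kernel k(x, x') = sum_i (x^T x')^i / i!
series = sum(s**i / math.factorial(i) for i in range(30))

print(series, math.exp(s))  # the two values agree to machine precision
```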
  • Presentation: Write your presentation here if you like (you can also use your iPad if one of you has one). Dummy formula: $$R(f) - R(f^\star) \leq \mathcal{R}(\mathcal{H}) - \gamma$$ Dummy align environment: \begin{align}
  • Presentation: a) The main intuition is that if we have a large margin, we are far away from the decision boundary, and hence we make more confident predictions. Another intuition is based on the bias-variance tradeoff. b) The bound should get tighter as we increase the margin (and the number of data points $n$); a standard bound of this form is sketched below.
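For concreteness, one standard margin bound of this form (stated from the usual Rademacher-complexity analysis; exact constants vary by textbook) is: with probability at least $1 - \delta$, for all $f \in \mathcal{H}$,
$$R(f) \leq \widehat{R}_\gamma(f) + \frac{2\,\mathcal{R}_n(\mathcal{H})}{\gamma} + \sqrt{\frac{\log(1/\delta)}{2n}},$$
where $\widehat{R}_\gamma$ is the empirical margin loss and $\mathcal{R}_n(\mathcal{H})$ the Rademacher complexity; the bound indeed tightens as the margin $\gamma$ and the sample size $n$ grow.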