# **Response to reviewer WNca**
Thank you for taking the time to review our submission. Here are some clarifications regarding your observations. We hope that they will improve your opinion of our work and we kindly ask you to consider the possibility of raising your score.
* **Regarding the practical implications of our work:** Although the main goal of our analysis is to highlight the different generalization behaviors of different regularizers, some practical implications can be drawn from it: it might be a good idea to use nuclear norm rather than Frobenius norm regularization in conjunction with quadratic classifiers, especially in high dimensions when the data is nearly isotropic (a minimal sketch of the two regularized objectives is given right after this list). However, we stress that our paper is not arguing for the use of one regularizer over another; we are simply showing that an observed phenomenon (improved sample complexity) can be explained by regularization.
More generally, our results also *suggest* (but do not imply) two further promising avenues of research: (1) it might be a good idea to try nuclear norm regularization in multiplicative networks [A], which are composed of quadratic layers, and (2) the adaptivity of neural networks to the data might be due to small nuclear norms of the weight matrices (this aligns nicely with the line of work arguing that SGD implicitly minimizes rank, for which the nuclear norm is a proxy). Of course, **these suggestions are not part of the contributions of our paper**. We will clarify this in the conclusion section of the final version.
* **Regarding the choice of \\(\lambda\\) in the experiments:** We set \\(\lambda = 1\\) in the experiments. (This is indicated on line 307, but we can display it more prominently.)
* **Regarding the assumption \\(\|x\|_2^2 \geq c \mathbb{E} \|x\|_2^2 \\) almost surely:** We think you mean the inequality \\(\|x\|_2^2 \leq c \mathbb{E} \|x\|_2^2\\) in Lemma 4 (line 207)? This is only a technical assumption, meaning that the distribution is bounded and that its magnitude concentrates near its mean. It is in particular satisfied for natural images with a fixed average pixel magnitude. By writing it this way, we consider the entire class of bounded distributions whose magnitudes are allowed to scale with dimension; once normalized by the expected norm, the dimension dependence disappears. Indeed, without such an assumption, the empirical covariance is not a good estimator of the true covariance (see Theorem 5.6.1 and Exercise 5.6.5 in [B]).
* **Regarding the interpretation of the experiments, and Figure 1.b**: Our theoretical bounds, like all uniform generalization bounds, control the worst function in the class, so we designed the experiment to find this worst-performing function. We do so to exhibit the intrinsic-dimension dependence of the worst train-test gap and to corroborate our theoretical findings.
We think that the observed decrease in Figure 1.b is not substantial enough to conclude that there is a downward trend: for instance, in the rightmost part of the plot, the generalization gap starts to increase slightly, breaking the downward trajectory. Overall, we believe the trend is *flat*. The main message is that the different regularizations clearly exhibit a _different sensitivity_ to an increase in dimension. The plot is the result of non-convex optimization in increasing dimensions, so some variability is to be expected.
* **Regarding the writing style**:
    * *<u>In Proof of Cor 2 it says "both identities in eq. 3" -- but eq(3) is one inequality!</u>*: We will rephrase this more clearly. What we mean is that one should use the two identities \\(r(\Sigma) \approx d^s \\) and \\( \|\Sigma\|_2 \approx d^{1-s}\\) together with the inequality given in (3) to arrive at the final result. This line should be read as “Plugging the two previously mentioned identities into the right-hand side of inequality (3)...”
    * *<u>bottom of p 8: "try to find" sounds vague - indeed, what if the optimiser doesn't find the function we look for?</u>*: The problem is a non-convex maximization problem; therefore, even if the optimizer does not find the maximum, it still provides a valid lower bound on the worst-case generalization gap.
* *<u>top of p 10: what is mean by "an equal regularisation parameter"?</u>*: This line means that the parameter \\(\lambda\\) has the same numerical value for both regularizations.
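As promised above, here is a minimal sketch of the kind of comparison we have in mind between the two regularized objectives. This is not the paper's code: the synthetic data, the hinge loss, the optimizer, and the names `fit`, `A_nuclear`, `A_frobenius` are placeholder assumptions for illustration. The only difference between the two objectives is whether the penalty on the matrix \\(A\\) is the nuclear norm or the Frobenius norm.

```python
import torch

# Illustrative sketch (not the paper's code): nuclear- vs Frobenius-norm
# regularization for a quadratic classifier f_A(x) = x^T A x.
d, n, lam = 50, 200, 1.0                 # dimension, sample size, lambda = 1 as in the experiments
X = torch.randn(n, d)                    # nearly isotropic synthetic inputs (placeholder)
y = torch.sign(torch.randn(n))           # placeholder +/-1 labels

def fit(penalty: str, steps: int = 500, lr: float = 1e-2) -> torch.Tensor:
    """Minimize hinge loss + lam * ||A|| with the chosen matrix norm."""
    A = (0.01 * torch.randn(d, d)).requires_grad_()
    opt = torch.optim.Adam([A], lr=lr)
    for _ in range(steps):
        margins = y * torch.einsum('ni,ij,nj->n', X, A, X)   # y_k * x_k^T A x_k
        loss = torch.clamp(1.0 - margins, min=0.0).mean()    # hinge loss
        # the regularizer is the only thing that changes between the two runs
        norm = torch.linalg.matrix_norm(A, ord='nuc' if penalty == 'nuclear' else 'fro')
        loss = loss + lam * norm
        opt.zero_grad()
        loss.backward()
        opt.step()
    return A.detach()

A_nuclear, A_frobenius = fit('nuclear'), fit('frobenius')
```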
# **References**
[A] Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, Razvan Pascanu (DeepMind). Multiplicative Interactions and Where to Find Them. ICLR 2020.
[B] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.html
# **Response to reviewer DAu4**
Thank you for taking the time to review our paper. We will make the following changes to improve the presentation.
We will precisely define quadratic classifiers as classifiers based on the sign of a quadratic polynomial. We will also add that by pixel space we mean the space \\([0, 1]^d\\) to which images belong, since they are arrays of bounded pixel values.
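For instance (with notation of our own, and possibly with \\(b\\) and \\(c\\) set to zero depending on the exact setting of the paper), the definition could be written out as
\\[
x \;\mapsto\; \mathrm{sign}\big(x^\top A x + b^\top x + c\big), \qquad A \in \mathbb{R}^{d \times d},\; b \in \mathbb{R}^{d},\; c \in \mathbb{R}.
\\]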
We will also make sure that the definition of _isotropicity_ appears before the term is mentioned in the abstract; precisely, we will rewrite the first sentence as “It has recently been observed that neural networks, unlike kernel methods, enjoy a reduced sample complexity when the distribution is isotropic (i.e., when the covariance matrix is the identity).”
Finally, we will correct the typos that you mention.
# **Response to reviewer FnjC**
Thank you for taking the time to review our submission. We appreciate your comments on the comparison of the trace-norm and the Frobenius-norm.
We included the section on computability for completeness. It can be moved to the appendix.
The links to neural network learning are indeed interesting. As noted in our response to reviewer WNca, our work points towards the following avenues of research. The most direct application of our results is to networks with quadratic activations or multiplicative interactions [A]. The second, more interesting avenue is to show that the adaptivity of neural networks to the data might be due to small nuclear norms of the weight matrices. The implicit regularization effect of SGD, which, at least for linear neural networks, has been argued to be a form of rank minimization, is closer to trace-norm regularization than to Frobenius-norm regularization, because trace-norm minimization is a better proxy for rank minimization. Consequently, it may be this implicit regularization that explains the data adaptivity. These speculations, however, require careful analysis, and we judged them best suited to follow-up research.
We will nonetheless add a few lines detailing the following points. First, we can describe in more detail the empirical observations that studied intrinsic dimension and its influence on generalization. Second, we can write out the derivation showing that a single hidden layer neural network with quadratic activations \\(x \mapsto x^2\\) is exactly a non-convex parametrization of a quadratic classifier. We believe this will clarify our position that our work serves as an important first step in theoretically characterizing why neural networks generalize well.
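For concreteness, the derivation we have in mind is the standard one (the notation \\(m\\), \\(a_i\\), \\(w_i\\) is ours): with quadratic activations, a one-hidden-layer network with \\(m\\) hidden units computes
\\[
f(x) \;=\; \sum_{i=1}^{m} a_i \,(w_i^\top x)^2 \;=\; x^\top \Big( \sum_{i=1}^{m} a_i \, w_i w_i^\top \Big) x \;=\; x^\top M x, \qquad M := \sum_{i=1}^{m} a_i \, w_i w_i^\top,
\\]
so \\(\mathrm{sign}(f(x))\\) is exactly a quadratic classifier, and the non-convexity comes only from parametrizing the matrix \\(M\\) through the factors \\((a_i, w_i)\\).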
Our main message, however, remains that a simple model can exhibit some of the phenomena observed in complicated, difficult-to-analyze settings: regularization can explain the improved sample complexity over kernels.
# **References**
[A] Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, Razvan Pascanu (DeepMind). Multiplicative Interactions and Where to Find Them. ICLR 2020.