# ICLR Reviews
## Global Response
### Experiment Protocol
#### Sample Complexity Experiments
In our first experiment, we fix the patch dimension $d$ at 20 and vary the number of patches $k$ across the range $[10, 15, 20, 25, 30]$. For each $(k,d)$ pair, we plot the sample complexity for both CNNs and LCNs. We evaluate the sample complexity via the following steps:
1. Target Loss Evaluation: We compute the optimal loss based on the ground truth and add a fixed tolerance of $0.03$ to establish the target loss.
2. Determining Sample Range: Through trial and error, we determine that a maximum of $1000$ samples is sufficient for any model across all $k$ values.
3. Binary Search Method: To find the minimum number of samples required to reach the target loss, we perform a binary search over the number of samples. At each step, we run a grid search over the learning rates for the weights $[10^{-1}, 10^{-2}, 10^{-3}]$ and the biases $[10^{-2}, 10^{-3}, 10^{-4}]$, and select the model with the lowest test error (a sketch of this procedure is given after this list).
4. Repetitions for Reliability: We repeat steps (1-3) five times and plot the mean and standard deviation of the resulting sample complexities.
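Below is a minimal sketch of this search procedure. It is not our actual experiment code: `train_and_eval(model_type, n_samples, lr_w, lr_b)` is a hypothetical helper that trains the given model on `n_samples` fresh samples with the given learning rates and returns its test loss, and the binary search assumes the best achievable test loss is monotone in the number of samples.

```python
LR_WEIGHTS = [1e-1, 1e-2, 1e-3]   # learning rates for the weights
LR_BIASES = [1e-2, 1e-3, 1e-4]    # learning rates for the biases
MAX_SAMPLES = 1000                # found sufficient by trial and error (step 2)
TOLERANCE = 0.03                  # fixed tolerance added to the optimal loss (step 1)


def best_test_loss(model_type, n_samples):
    """Grid search over learning rates; return the lowest test loss found."""
    return min(
        train_and_eval(model_type, n_samples, lr_w, lr_b)  # hypothetical helper
        for lr_w in LR_WEIGHTS
        for lr_b in LR_BIASES
    )


def sample_complexity(model_type, optimal_loss):
    """Binary search for the smallest sample size that reaches the target loss."""
    target = optimal_loss + TOLERANCE
    lo, hi = 1, MAX_SAMPLES
    while lo < hi:
        mid = (lo + hi) // 2
        if best_test_loss(model_type, mid) <= target:
            hi = mid        # target reached: try fewer samples
        else:
            lo = mid + 1    # target missed: need more samples
    return lo
```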


Recalling that the theoretical sample complexity is $O(k+d)$ for CNNs and $O(k(k+d))$ (upper bound) and $\Omega(kd)$ (lower bound) for LCNs, our observations align well with these expectations. LCNs exhibit an approximately linear dependence on $k$, while CNNs show only a sublinear dependence on $k$, consistent with the theoretical bounds. This provides evidence for our claim that LCNs need to learn each of the $k$ patches independently, whereas CNNs benefit from weight sharing. Furthermore, LCNs require significantly more samples than CNNs (about 10 to 20 times more), which corresponds to the theoretical multiplicative $k$ factor in LCNs' sample complexity.
---
In our second experiment, we set the number of patches $k$ at 20 and vary the patch dimension $d$ across the range $[10, 15, 20, 25, 30]$. The same steps (1-4) are repeated for this setup.


We observe a linear dependence on $d$ for both CNNs and LCNs, consistent with the theory. As in the first experiment, LCNs require approximately 20 times more samples than CNNs, reflecting the multiplicative $k$ factor in their sample complexity, i.e., the cost LCNs pay for learning each of the $k$ patches independently while CNNs share weights.
---
## Reviewer 1
**Formal Response:**
We genuinely thank the reviewer for their positive evaluation and feel encouraged by their recognition of our work's significance, originality, and clarity. We now address their comments below.
**Comment 1: Can experiments be added that validate the theory?**
Answer:
We agree with the reviewer that empirical validation of our bounds can help us connect them to practice. We have added detailed experimental protocols and results in the supplementary material, and we plan to incorporate them more fully in the next version of our manuscript.
**Comment 2: Much of the theoretical development is relegated to the appendix.**
Answer:
We agree with the reviewer's concern about the placement of our theoretical development. In response, we will include a more comprehensive proof outline in the forthcoming revision of our paper. Additionally, we have provided a concise outline of the proof of Theorem 6.1 in an [official comment](https://openreview.net/forum?id=AfnsTnYphT&noteId=zesOXLpzPT), which may be of interest to the reviewer.
**Comment 3: Notational inconsistency in vector indexing on page 4.**
Answer:
We are thankful to the reviewer for pointing out this notational inconsistency, and we will rectify it in our revised submission.
We sincerely hope that our responses have addressed the concerns raised by the reviewer.
## Reviewer 2
[Link](https://openreview.net/forum?id=AfnsTnYphT&noteId=8RIt8f8W3U)
**Formal Response:**
We are grateful for the reviewer's positive feedback and we address their questions below.
**Comment 1: Can the framework be extended to the ReLU activation function, given its popularity over the LSA activation function?**
Answer:
We note that the LSA activation function can be expressed using two ReLU activations as $\phi_b(x) = \text{ReLU}(x-b) - \text{ReLU}(-x-b)$. This decomposition allows us to adapt our framework to ReLU through the following architectural modifications:
1. We add an additional hidden node for each patch in LCNs and CNNs, and an additional $k$ hidden nodes for FCNs.
2. We include a bias term for each patch.
3. We fix the second layer to $(-1, 1, \dots, -1, 1) \in \mathbb{R}^{2k}$.
The idea is that the two hidden nodes corresponding to each patch learn $\mathbf{w}^\star$ and $-\mathbf{w}^\star$ respectively. Together with the bias and the fixed $\pm 1$ values from the second layer, we can simulate the thresholding effect of the LSA activation function.
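For concreteness, the following NumPy snippet (an illustrative check, not our training code) verifies numerically that a pair of ReLU units with shared bias $b$ and output weights $+1$ and $-1$ reproduces the soft-thresholding behaviour of the LSA activation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def lsa(x, b):
    """Soft-thresholding (LSA) activation."""
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

# Two ReLU units with output weights (+1, -1) and shared bias b
# reproduce the LSA activation exactly.
x = np.linspace(-3, 3, 1001)
b = 0.5
assert np.allclose(relu(x - b) - relu(-x - b), lsa(x, b))
```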
We also note that the LSA activation function, also known as the "soft-thresholding function", is extensively used in high-dimensional sparse recovery problems (Section 18.2 of [Hastie]). Since our task involves recovering the sparse mean vector $\mathbf{w}^\star$, this justifies our use of the LSA activation function.
**Comment 2: Why is the sample complexity of LCNs different in Sections 6 and 7? Is this due to learning $k$ "nearly-independent" and $k$ "independent" subtasks respectively?**
Answer:
Our conjecture is that the gap between the LCN upper bound of $O(k(k+d))$ and the LCN lower bound of $\Omega(kd)$ stems primarily from the computational efficiency of the algorithms employed. The upper bound is derived for a computationally efficient (polynomial-time) equivariant gradient-descent-style algorithm. In contrast, the lower bound is derived from minimax principles and applies to both computationally efficient and inefficient equivariant algorithms. Computationally inefficient algorithms can have lower sample complexity than their efficient counterparts, which possibly explains the difference in sample complexity.
**Comment 3: Are there settings where the sample complexities of FCNs, LCNs, and CNNs are similar?**
Answer:
There are indeed scenarios where the sample complexities of FCNs, LCNs, and CNNs converge. These include:
1. Consider the setting where the signal can only be present in a smaller subset of $\tilde{k} < k$ patches, implying that the label-determining signal is "less mobile". In this case, our analysis can be readily extended to show that the sample complexities of FCNs, LCNs, and CNNs scale approximately as $\tilde{k}kd$, $\tilde{k}d$, and $d$ respectively. Therefore, as $\tilde{k}$ approaches 1, LCNs and CNNs perform similarly, and FCNs' sample complexity decreases despite the number of trainable parameters remaining unchanged.
2. When $d$ is relatively large compared to $k$, that is, when the signal is not too sparse, our bounds show that the advantages of CNNs and LCNs over FCNs become less pronounced. For instance, if we take $k$ to be a constant $c$, then the sample complexities of FCNs, LCNs, and CNNs scale as $c^2 d$, $cd$, and $d$ respectively, so the three models have similar sample efficiency (see the summary after this list).
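For quick reference, and suppressing constants and noise-level factors, the approximate scalings in these two regimes (with $n_{\mathrm{FCN}}$, $n_{\mathrm{LCN}}$, $n_{\mathrm{CNN}}$ denoting the approximate sample complexities) read:
$$
\text{(i) signal confined to } \tilde{k} < k \text{ patches:}\quad n_{\mathrm{FCN}} \approx \tilde{k}kd,\;\; n_{\mathrm{LCN}} \approx \tilde{k}d,\;\; n_{\mathrm{CNN}} \approx d \;\;\xrightarrow{\;\tilde{k}\to 1\;}\;\; kd,\; d,\; d;
$$
$$
\text{(ii) } k = c \text{ constant:}\quad n_{\mathrm{FCN}} \approx c^2 d,\;\; n_{\mathrm{LCN}} \approx cd,\;\; n_{\mathrm{CNN}} \approx d.
$$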
We hope that our explanations addressed the reviewer's inquiries comprehensively.
## Reviewer 3
[Link](https://openreview.net/forum?id=AfnsTnYphT&noteId=OtjifWU7VA)
**Formal Response:**
We thank the reviewer for their valuable comments and we appreciate their insightful feedback. We now take this opportunity to address their queries below.
**Comment 1: Can experiments be added that validate the analysis?**
Answer:
We completely agree with the reviewer's suggestion that empirical validation would strengthen the credibility of our theoretical analysis. We request the reviewer to refer to the newly added supplementary material, which includes the detailed experimental protocol and results. We also plan to incorporate them in the forthcoming revision.
**Comment 2: Can our framework also provide bounds for transformers, given their prevalence in machine learning?**
Answer:
Regarding the potential application of our framework to transformers, we can consider a transformer architecture where each input patch corresponds to an input token. If we assume that the key ($W_K$), query ($W_Q$), and value ($W_V$) matrices are initialized from a standard isotropic Gaussian distribution, then this transformer, trained with SGD, is equivariant to the application of the same $U \in \mathcal{O}(d)$ to each patch of the input. Ignoring the issues of semi-metricness, this leads to an $\Omega(\sigma^2 d)$ lower bound. The upper bound, however, is more challenging, as it requires a gradient-descent analysis of the softmax-based attention mechanism. Nonetheless, we fully agree with the reviewer that this is an important direction for future work.
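To illustrate the equivariance argument only (the single-head layer and dimensions below are hypothetical, not our exact architecture), the NumPy sketch checks that applying the same orthogonal $U$ to every token can be absorbed into the projection matrices without changing the attention output; since $U W$ has the same isotropic Gaussian distribution as $W$, SGD from such an initialization is equivariant in distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 5, 8                        # number of tokens (patches), token dimension

def attention(X, W_Q, W_K, W_V):
    """Single-head softmax attention; rows of X are tokens."""
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ (X @ W_V)

X = rng.normal(size=(k, d))        # input tokens
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

# Random orthogonal matrix U, applied to every token.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Rotating every token by U while rotating the projections by U leaves the
# output unchanged; U @ W has the same distribution as W at isotropic init.
out_rotated = attention(X @ U.T, U @ W_Q, U @ W_K, U @ W_V)
out_original = attention(X, W_Q, W_K, W_V)
assert np.allclose(out_rotated, out_original)
```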
**Comment 3: Misplaced QED symbol after the conclusion**
Answer:
We thank the reviewer for pointing out the typo, and we will fix it in the next revision.
We sincerely hope that our responses clarified the questions raised by the reviewer.
## Reviewer 4
[Link](https://openreview.net/forum?id=AfnsTnYphT&noteId=LU7mrnXNe7)
**Formal Response:**
We thank the reviewer for their thorough and insightful comments, and we address the issues raised below.
**Comment 1: "Novelty in our work over Wang et al."**
Answer:
We agree with the reviewer's observation that we use global equivariance for FCNs, as in Li et al. and Wang et al., and local equivariance for LCNs, following Wang et al. Our work, however, extends these techniques to a more realistic task, which yields justifications for common intuitions as well as practical insights. Moreover, because our task is not rotationally invariant, our novel analysis diverges from the standard information-theoretic approaches used in Li et al. and Wang et al. Additionally, our use of gradient descent offers important theoretical advantages over an ERM analysis: it establishes a separation for computationally efficient equivariant algorithms, and it circumvents potential non-equivariance issues in ERM. We elaborate on these aspects below.
*1. Realistic task and practical insights*
We re-emphasize the realistic nature of the task, which mirrors real-world image classification, where outputs are determined by local signals embedded within uninformative noise. Specifically, we represent the label-determining signals by $\pm \mathbf{w}^\star \in \mathbb{R}^d$ against a backdrop of i.i.d. Gaussian noise with variance $\sigma$, and the $k$ patches represent the possible locations to which the signal can translate.
In this context, our analysis rigorously brings forth the insight that FCNs incur a threefold cost: a multiplicative factor of $k$ due to the need to learn the $k$ patches independently, a cost of $k$ for isolating each patch, and a cost of $d$ for learning the sparse signal. LCNs avoid the isolation cost, while CNNs only incur the cost of $d$ for learning the sparse signal.
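Schematically (an informal summary of the costs described above, with constants and the noise level suppressed):
$$
n_{\mathrm{FCN}} \approx \underbrace{k}_{\text{independent learning}} \times \underbrace{k}_{\text{patch isolation}} \times \underbrace{d}_{\text{sparse signal}}, \qquad n_{\mathrm{LCN}} \approx k \times d, \qquad n_{\mathrm{CNN}} \approx d,
$$
where $n_{\mathrm{FCN}}$, $n_{\mathrm{LCN}}$, and $n_{\mathrm{CNN}}$ denote the approximate sample complexities of the three models.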
Our analysis offers another insight in the setting where the signal may only appear in a smaller subset of $\tilde{k} < k$ patches, that is, when the label-determining signal is "less mobile". Our method can be readily extended to show that the sample complexities of FCNs, LCNs, and CNNs scale approximately as $\tilde{k}kd$, $\tilde{k}d$, and $d$ respectively. As $\tilde{k}$ nears 1, the sample complexities of LCNs and CNNs converge, and the sample complexity of FCNs decreases, even though the number of trainable parameters remains unchanged.
Furthermore, we can also infer from our bounds that when $d$ is relatively large compared to $k$, that is, when the signal is not "very sparse", the advantages of CNNs and LCNs over FCNs become less pronounced. Additionally, our analysis proves the intuitive fact that lower noise levels uniformly enhance the sample efficiency of all models.
In contrast, the tasks defined in Li et al., $f(\mathbf{x}) = \sum_{i=1}^d x_i^2 - \sum_{i=d+1}^{2d} x_i^2$, and in Wang et al., $g(\mathbf{x}) = (\sum_{i=1}^d x_{2i}^2 - x_{2i+1}^2) (\sum_{i=d+1}^{2d} x_{2i}^2 - x_{2i+1}^2)$, do not possess the structure needed to yield the insights mentioned above. Specifically, they cannot show how the sample complexity changes as the degree of locality and translation invariance in the input varies, nor can they characterize the conditions on the input under which differences between CNNs, LCNs, and FCNs appear or vanish.
Furthermore, as highlighted in our paper, the workhorse for the FCN lower bound in both Li et al. (Lemma D.2) and Wang et al. (Appendix H.2) is the quadratic interaction between the first $d$ coordinates and the subsequent $d$ coordinates. This signifies an interaction phenomenon in their tasks, which is distinct from what we capture in our task. While this interaction is an interesting phenomenon, it is not the primary characteristic of the locality and translation invariance found in images. Nonetheless, we view incorporating signal-signal interactions into our framework as a promising avenue for future research.
*2. Technical challenges and significance of our analysis*
Our approach diverges from those used in Li et al. and Wang et al. because the marginal over $\mathcal{X}$ is not rotationally invariant. Specifically, this precludes the use of Benedek-Itai style bounds as in Li et al., and we do not enjoy the semi-metricness of the $l_2$ or $0 \text{-} 1$ loss under an invariant distribution. Recognizing this, Li et al. have previously [noted](https://openreview.net/forum?id=uCY5MuAxcxU&noteId=eEtVx0qv-36) the need for new proof techniques for such distributions as an area for future exploration, and our work takes a first step in this direction.
To address the above challenge, we developed a novel reduction technique that leverages input-independent randomization and equivariance to show that both FCNs and LCNs must learn each patch independently of every other patch. This technique not only aids the computation, but also provides a theoretical justification for the intuition behind the '$k$ cost' associated with learning each patch independently. Although the reduced problem is simpler, it still does not enjoy semi-metricness of the $l_2$ loss. To overcome this issue, we introduce a novel variant of Fano's Lemma for input-independent randomized algorithms, which allows us to derive the lower bounds without requiring semi-metricness to hold across the entire space.
We respectfully disagree with the reviewer's perspective on the significance of our methods. First, our reduction technique through randomization, along with the tailored variant of Fano's Lemma, has broad applicability and, we believe, is of independent mathematical interest. Additionally, given the widespread use of Gaussian Mixture Models (GMMs), our techniques could offer valuable tools for researchers in this area. Lastly, our techniques could serve as a foundation for future extensions of our work, such as incorporating the aforementioned interaction effects.
*3. Benefits of gradient descent analysis over ERM*
We acknowledge the reviewer's observation regarding the simplicity of our gradient descent analysis on a one-hidden-layer neural network with two iterations. However, our choice of gradient descent over an ERM-style argument was deliberate, for two key theoretical reasons:
(1) The gradient descent analysis demonstrates a sample complexity separation between FCNs, LCNs, and CNNs for computationally efficient (polynomial-time) equivariant algorithms. This distinction is crucial because, while a separation may exist for computationally inefficient algorithms, it might disappear under the constraint of computational efficiency. Given that models in practice are trained with efficient algorithms, we believe establishing a separation for efficient algorithms is theoretically important.
(2) It is important to ensure that both the upper and lower bounds are derived for equivariant algorithms. Having lower bounds for equivariant algorithms and upper bounds for non-equivariant ones does not yield a valid sample complexity separation, as non-equivariant algorithms could potentially be more sample-efficient than their equivariant counterparts. Since the equivariance of ERM is not clearly established, and Wang et al. do not discuss this aspect in their work, we believe a gradient descent analysis is more appropriate.
**Comment 2: "Shouldn't the current FCN lower bound hold for any width? Why is the width of the FCN set to $k$?"**
Answer:
Our current lower bound may not trivially hold for any width. Unlike the tasks in Li et al. and Wang et al., the lack of rotational invariance of the marginal over $\mathcal{X}$ means that we cannot assume that the $l_2$ loss function is a semi-metric on the space of all functions. This prevents us from applying the standard Fano's Lemma as in Wang et al. Consequently, we developed a variant of Fano's Lemma that only requires the semi-metric property to hold when two out of three functions are from the set of "hard instances". We had to explicitly establish this semi-metric property for the underlying FCN function class with width equal to $k$. Therefore, we cannot immediately say that the lower bound holds for all widths.
We set the width of the FCNs to $k$ so that, like LCNs and CNNs, FCNs could also benefit from the denoising capability of the $k$ non-linear filters.
**Comment 3: "Why is the norm of $\mathbf{w}_i \le 1$?"**
Answer:
We imposed the constraint $\|\mathbf{w}_i\| \le 1$ primarily to simplify our analysis. We use it to prove the technical requirement of our variant of Fano's Lemma, namely that the semi-metric property should hold when two out of three functions belong to the set of 'hard instances'. We selected the constant $1$ specifically so that the filters can capture the unit-norm signal $\mathbf{w}^\star$. While this choice might appear somewhat artificial, it does not affect the main takeaways of our theory.
**Comment 4: "Can the proof outline for step 1 of theorem 6.1 be elaborated more technically?"**
Answer:
We appreciate the reviewer's request, and we will include a more technical explanation in our revised draft. We now outline Step 1 of the proof, focusing on the key aspects while avoiding overly intricate details.
We denote the DSD distribution by $P$ and its marginal for the $i^{\text{th}}$ patch by $Q_i$. We denote the mean of the positive marginal of each $Q_i$ by $\mathbf{\mu}_i$. The proof begins by noting that the risk of the algorithm, $\bar{\theta}_n$, on $\mathbf{U} \circ P$ is given by the average of the risks of $\bar{\theta}_n$ over the $\mathbf{U} \circ Q_i$. We then use equivariance to show that the risk of $\bar{\theta}_n$ is exactly the same over each $\mathbf{U} \circ Q_i$,
$$
\mathbb{E}_{S^n \sim (\mathbf{U} \circ P)^n} \left[ R\left( \bar{\theta}_n, \mathbf{U} \circ Q_i \right) \right] = \mathbb{E}_{S^n \sim (\mathbf{U} \circ P)^n} \left[ R\left( \bar{\theta}_n, \mathbf{U} \circ Q_j \right) \right].
$$
This implies that even though we are "training" the model using $\mathbf{U} \circ P$, it is enough to "test" the model on $\mathbf{U} \circ Q_1$. Since this holds for all $\mathbf{U}$ and $\bar{\theta}_n$,
$$
\mathbb{E}_{S^n \sim P^n} \left[ R\left( \bar{\theta}_n, P \right) \right] \ge \inf_{\bar{\theta}_n} \sup_{\mathbf{U}} \mathbb{E}_{S^n} \left[ R\left( \bar{\theta}_n, \mathbf{U} \circ Q_1 \right) \right].
$$
We denote $R(\bar{\theta}_n, \mathbf{U} \circ Q_1) = R_1(\bar{\theta}_n)$ for brevity. We now begin a series of reductions. First, observe that the minimax risk can only decrease if the algorithm were instead supplied with patch-separated input data $\{S^{n_i} \sim \mathbf{U} \circ Q_i\}$: the algorithm can simply ignore the patch id, which is strictly extra information. Note that, with high probability, $n_i \approx n/k$,
$$
\ge \inf_{\bar{\theta}_n} \sup_{\mathbf{U}} \mathbb{E}_{\{S^{n_i}\}} \left[ R_1\left( \bar{\theta}_n \right) \right].
$$
Next, we provide the algorithm with $S^{n_1}$ and all the quantities required to internally construct $S^{n_2}, \dots, S^{n_k}$; it is easy to see that this can only reduce the minimax risk. These quantities include the means $\mathbf{U}\mathbf{\mu}_2, \dots, \mathbf{U}\mathbf{\mu}_k$ as well as presampled Gaussian noise vectors, label vectors, and $n_2, \dots, n_k$. We collectively refer to the presampled quantities as $\xi$,
$$
\ge \inf_{\bar{\theta}_n} \sup_{\mathbf{U}} \mathbb{E}_{\{S^{n_1}, \xi\}} \left[ R_1\left( \bar{\theta}_n(\{\mathbf{U}\mathbf{\mu}_2, \dots, \mathbf{U}\mathbf{\mu}_k\}) \right) \right].
$$
The final step is to note that, since $\mathbf{U}\mathbf{\mu}_1, \dots, \mathbf{U}\mathbf{\mu}_k$ are mutually orthogonal, the only information about $\mathbf{U}\mathbf{\mu}_1$ that we obtain from $\mathbf{U}\mathbf{\mu}_2, \dots, \mathbf{U}\mathbf{\mu}_k$ is that the former lies in a subspace of dimension $kd-k+1 \approx kd$. This information is too little and can effectively be ignored,
$$
\ge \inf_{\bar{\theta}_n} \sup_{\mathbf{U}} \mathbb{E}_{S^{n_1}} \left[ R_1\left( \bar{\theta}_n \right) \right].
$$
We have now reduced learning $\mathbf{U} \circ P$ with $n$ samples to learning $\mathbf{U} \circ Q_1$ with $n/k$ samples, which marks the end of Step 1. In Step 2, we show that $n/k \approx \sigma^2kd$, which proves the result.
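For the reviewer's convenience, the chain of reductions in Step 1 can be summarized compactly (suppressing the extra inputs supplied to the algorithm at each stage, and with $n_1 \approx n/k$):
$$
\mathbb{E}_{S^n \sim P^n}\left[ R(\bar{\theta}_n, P) \right] \ge \inf_{\bar{\theta}_n} \sup_{\mathbf{U}} \mathbb{E}_{S^n}\left[ R_1(\bar{\theta}_n) \right] \ge \inf_{\bar{\theta}_n} \sup_{\mathbf{U}} \mathbb{E}_{\{S^{n_i}\}}\left[ R_1(\bar{\theta}_n) \right] \ge \inf_{\bar{\theta}_n} \sup_{\mathbf{U}} \mathbb{E}_{\{S^{n_1}, \xi\}}\left[ R_1(\bar{\theta}_n) \right] \ge \inf_{\bar{\theta}_n} \sup_{\mathbf{U}} \mathbb{E}_{S^{n_1}}\left[ R_1(\bar{\theta}_n) \right].
$$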
We sincerely hope that our responses provided clarity on the points raised by the reviewer, particularly regarding the novelty and significance of our work.