# Response to All Reviewers (a.k.a. Letter to the AC)

We thank all the reviewers for their very constructive comments.

We thank **Reviewer m6sU** for recognizing the novelty of our work: <font color="green">*"The paper provides a novel perspective -- the preconditioning/adaptive perspective -- on in-context learning in transformers, which could potentially lead to new ways of understanding and improving these models. Given the growing interest in understanding ICL from a theoretical perspective, the "preconditioning gradient descent" can be served as an important building block. The analysis tools developed in this paper could be helpful for future research on understanding ICL."*</font>

We also thank **Reviewer k8UL** for appreciating the value of our approach: <font color="green">*"This is a very good paper. I was very happy to read it... I really liked the approach that the authors have taken. It is quite a neat idea to consider sparsity driven constraints... The results resolve some very important questions towards understanding ICL. The paper is very well written and a joy to read... This paper serves as a crucial step towards the goal of understanding ICL, [which] is one of the central problems towards demystifying LLMs."*</font>

Last but not least, we thank **Reviewer vRHd** for appreciating the analysis in our paper: <font color="green">*"The paper provides an interesting analysis that can be useful to understand the intriguing properties of in-context learning... this paper can be useful as a stepping stone for any researcher that wants to further study intriguing properties of in-context learning."*</font>

We begin our response with a summary of our main contributions, as echoed by the reviewers above.

**Primary technical achievements:**

- Inspiring previous works have shown that by setting the weights of a transformer carefully, one can perform ICL for the task of linear regression. We follow up on these works and show that when the transformer is trained over random instances of linear regression, **the global minima (and stationary points) correspond to interesting algorithms**, such as "preconditioned" gradient descent (see the illustrative sketch after this list).
- At a more technical level, as endorsed by the reviewers, we propose analytical techniques to analyze the training loss of the transformer architecture, which led to the follow-up/concurrent works [2] and [3].
- We would also like to underscore the relevance of this work to the community by briefly mentioning concurrent works that were posted shortly after our submission. First, Zhang et al. [2] have shown that under a random initialization, gradient descent converges to the global minimum characterized in this work, contributing to our understanding of training transformers. Moreover, Mahankali et al. [3] establish a similar set of results to ours for the single-layer case. We believe that the question we attempt to answer in this work is very timely and relevant to the community, and will hopefully spark further interest.
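To make the preconditioning viewpoint in the first bullet concrete, here is a minimal illustrative sketch; the notation ($n$, $\hat{L}$, $A$, $\eta$) is generic shorthand and is not meant to reproduce the paper's exact equations. Given in-context examples $(x_1, y_1), \dots, (x_n, y_n)$, consider the least-squares objective and a gradient step with a matrix step size:

$$
\hat{L}(w) = \frac{1}{2n}\sum_{i=1}^{n}\big(y_i - \langle w, x_i\rangle\big)^2,
\qquad
w^{k+1} = w^{k} - A\,\nabla \hat{L}\big(w^{k}\big),
$$

where a positive definite $A$ plays the role of a preconditioner, and plain gradient descent corresponds to $A = \eta I$. Informally, our results show that the attention weights at the global minima implement updates of this form, with $A$ depending on the input covariance.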
We next address a shared concern raised by the reviewers:

**Linear Attention:** As noted by the reviewers, the setting in this paper is theoretical: we analyze transformers with linear attention and no MLP layers. Nevertheless, such networks are expressive enough for solving linear regression, and they have become the focus of theoretical studies in various recent works [1,2,3]. In particular, Lemma 2 proves that these networks are expressive enough to implement a family of optimization methods.

In fact, our results extend---to an extent---to the softmax attention case. In a nutshell, we observe both theoretically and empirically that the softmax case is similar to linear attention (in line with the conclusion of [Appendix A.9] in von Oswald et al.). In particular, under two-headed softmax attention, we observe that the learned algorithm is identical to that of linear attention. Without going into gory details, the key intuition is the linearization trick $\frac{1}{2}(e^x - e^{-x}) \approx x$; a sketch of how the two heads combine is given after the list below. Indeed, in our experiments, we observe that the weights of the two attention heads have approximately opposite signs. We will add this discussion in our final version.

- Already this simple setting suffices to learn interesting algorithms.
- The expressivity is less than that of the model with MLP...
- von Oswald's two attention heads can achieve the same solution as linear attention in this setting... then why not focus on linear attention?
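As a rough illustration of the linearization trick above (our own shorthand, not the paper's notation): let $a$ denote a pre-softmax attention score, and suppose the two heads carry scores of opposite sign, $a$ and $-a$. Ignoring the softmax normalization, their combination satisfies

$$
\tfrac{1}{2}\big(e^{a} - e^{-a}\big) = \sinh(a) = a + \tfrac{a^{3}}{3!} + \cdots \;\approx\; a
\quad \text{for small } a,
$$

so, to first order, the sum of two softmax heads whose key-query weights differ only in sign behaves like a single linear-attention head. This matches the empirical observation above that the two trained heads have weights of approximately opposite sign.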
**Additional Experiments:**

- ICL experiments: number of ICL samples vs. test loss.

---

**References.**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." International Conference on Machine Learning. PMLR, 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).
[3] Mahankali, Arvind, Tatsunori B. Hashimoto, and Tengyu Ma. "One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention." arXiv preprint arXiv:2307.03576 (2023).

---

# Response to m6sU (7 / Confidence 4)

As mentioned in the common response, we conducted the suggested experiment and have shared the results. Thank you for the suggestion.

As noted by von Oswald et al., the network analyzed in this paper is sufficiently expressive to implement gradient descent on the least-squares objective. Hence, one can zero out all parameters of the MLP block and skip this block via the residual connection to implement gradient descent on ridge regression (a short sketch follows below). Thus, in this work we omit the MLP block; this approach was also taken in other recent works in the literature [1,2,3]. Note that ICL of kernel regression requires MLP blocks, as shown in [1,4,5]. Extending our results to kernel regression is an interesting topic for future work. We will add a detailed discussion of this in our final version, as per your suggestion.
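For completeness, here is a one-line sketch of the "zero out the MLP and skip it" point, in generic notation (our paraphrase, not an equation from the paper). With a residual connection around the MLP sub-block, setting all of its parameters to zero turns the sub-block into the identity map:

$$
Z \;\longmapsto\; Z + \mathrm{MLP}(Z) = Z + 0 = Z
\qquad \text{when all MLP parameters are set to } 0,
$$

so each layer reduces to its attention part, and an attention-plus-MLP architecture can realize any algorithm that the attention-only architecture realizes in this setting.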
Thank you for catching the typos; we will correct them in the final version.

Thank you once again for taking the time to review our paper. We sincerely hope that our response to your concerns, as well as the overall response to the other reviewers' concerns, helps assuage your concerns and lets you view this paper in a more favorable light.

**References.**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." International Conference on Machine Learning. PMLR, 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).
[3] Mahankali, Arvind, Tatsunori B. Hashimoto, and Tengyu Ma. "One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention." arXiv preprint arXiv:2307.03576 (2023).
[4] Garg, Shivam, et al. "What can transformers learn in-context? A case study of simple function classes." Advances in Neural Information Processing Systems 35 (2022).
[5] Akyürek, Ekin, et al. "What learning algorithm is in-context learning? Investigations with linear models." arXiv preprint (2022).

---

# Response to k8UL (7 / Confidence 4)

For the linear attention comment, we provide an extended response in the common response section. Thanks for pointing it out!

Regarding the parametrization: there are experimental and theoretical results confirming that optimizing the reparameterized network recovers the same solution characterized in this paper, namely $W_k W_q^\top$ converges to the $Q^*$ characterized in Theorems 1, 2, and 3 when $W_q$ and $W_k$ are square matrices. Experiments in [1] show this convergence for single-layer and multi-layer linear attention. Theoretically, [2] recently proves the global convergence of GD to the optimal solution when $W_q = W_k$ for a single attention layer.

Regarding the L-BFGS comment, we conjecture that various algorithms converge to the same solution characterized in the theorems. This is proven for SGD on single-layer attention by [1]. Furthermore, [1] includes experimental results for AdamW on multi-layer attention without the sparsity constraint.

Moreover, regarding the sparsity pattern, [1] conjectures that similar results hold without the sparsity constraint, based on experimental observations. In fact, in our Section 5, we did our best to relax this technical sparsity assumption required for the theoretical proofs. We will clarify this in the paper.

Thank you once again for taking the time to review our paper. We sincerely hope that our response to your concerns, as well as the overall response to the other reviewers' concerns, helps assuage your concerns and lets you view this paper in a more favorable light.

**References**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." International Conference on Machine Learning, 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).

---

# Response to iy8f (5 / Confidence 2)

Thank you for your constructive comments! We address your comments one by one below.

> Compared to the previous work (von Oswald et al., 2022), the novelty is insufficient except the introduction of non-isotropic data distribution.

Our results in Theorems 1, 2, and 3 are novel even for isotropic inputs. Remarkably, [von Oswald et al., 2022] proves that the recurrence of linear attentions is expressive enough to implement gradient descent. However, such an expressivity result **does not imply anything about whether trained transformers also attain such properties**. Here, we theoretically address the conjecture of [von Oswald et al., 2022], which postulates that the optimization of parameters leads to the implementation of gradient descent.

We establish the connection to the data distribution used for training to motivate the importance of going beyond expressivity. When the data distribution is not isotropic, the network is still expressive enough to implement gradient descent, but the optimal (single-layer) transformer is not the one corresponding to implementing gradient descent. This is proved in Theorem 1. Thus, it is important to analyze the training landscape to characterize the outcome of training.

> This paper simplifies the problem a lot by treating all components as linear. However, for a standard transformer, the softmax and activation functions are two key components and introduce nonlinearity. A discussion on the relationship between linear and standard transformers is missing.

Our problem setting is taken from [1], since we are building on the empirical observations in [1]. While [2] and [3] share the same setting and analyze only a single attention layer, we provide an analysis for multi-layer attention.

- **[MLP block with non-linear activations]** We argue that MLP blocks are not necessary for learning linear models. Zeroing out the weights of the MLP blocks allows skipping these blocks over the residual connections. To implement gradient descent on ridge regression, one can omit the MLP blocks and rely only on the attention.
- **[Softmax]** As noted in the general response, we observe both theoretically and empirically that the softmax case is similar to linear attention (in line with the conclusion of [Appendix A.9] in von Oswald et al.). For two-headed softmax attention, we observe that the learned algorithm is identical to that of linear attention. The key intuition is the linearization trick $\frac{1}{2}(e^x - e^{-x}) \approx x$. We observed that the weights of the two attention heads have approximately opposite signs, confirming that this approximation is effective. We will add this discussion in our final version.

> Although simulations are provided in this paper, it is not sufficient. It would be interesting to know: How performance varies with the number of layers? How a -layer linear transformer performs compared with a direct gradient descent approach with different step sizes or a standard transformer (during training or after sufficiently trained)? ...

To address your concern, we conducted more experiments and share them in the common response. We would also like to highlight that there have already been several **empirical works** in the literature [1,4,5] showing various properties of the models learned for in-context learning. Hence, our main focus was to provide a "**theoretical footing**" for their interesting empirical observations. In particular, we kindly refer the reviewer to the mentioned empirical works for further empirical properties.

> Although $w^*$ is well-defined in the paper, it is not clear how $w$ relates to the model parameters or data.

$w^*$ is a random vector, independent of the data distribution. Since the model is trained over random instances, the parameters of the optimal model are independent of any individual random $w^*$ and depend only on its distribution. We establish the connection between the distribution of $w^*$ and the optimal model in Theorem 1.
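For readers unfamiliar with this setup, the following is a generic rendering of the ICL linear-regression data model from [1,4,5] (the distribution symbols $\mathcal{D}_x$, $\mathcal{D}_w$ are our shorthand, not the paper's notation):

$$
w^{*} \sim \mathcal{D}_{w}, \qquad
x_1, \dots, x_n, x_{\mathrm{query}} \sim \mathcal{D}_{x}, \qquad
y_i = \langle w^{*}, x_i \rangle,
$$

where the transformer receives the prompt $(x_1, y_1, \dots, x_n, y_n, x_{\mathrm{query}})$ and is trained, over fresh random draws of $w^{*}$ and the inputs, to predict $\langle w^{*}, x_{\mathrm{query}}\rangle$. Because the training loss averages over these draws, the trained parameters depend on $w^{*}$ only through its distribution, which is the dependence that Theorem 1 characterizes.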
> Related work is absent in this paper.

We reviewed the related works in the introduction. To address your concern, we will add a broader scope of related works in the final version. We will also include any other related works that the reviewer recommends.

> Presuming you are studying least-squares loss, Eq. (5) may miss a square symbol.

> There seem to be notation typos: in Eq. (9) and in Lemma 2.

Thank you very much for catching these typos and for your comments.

Thank you once again for taking the time to review our paper. We sincerely hope that our response to your concerns, as well as the overall response to the other reviewers' concerns, helps assuage your concerns and lets you view this paper in a more favorable light.

**References**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." International Conference on Machine Learning. PMLR, 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).
[3] Mahankali, Arvind, Tatsunori B. Hashimoto, and Tengyu Ma. "One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention." arXiv preprint arXiv:2307.03576 (2023).
[4] Garg, Shivam, et al. "What can transformers learn in-context? A case study of simple function classes." Advances in Neural Information Processing Systems 35 (2022).
[5] Akyürek, Ekin, et al. "What learning algorithm is in-context learning? Investigations with linear models." arXiv preprint (2022).

---

# Response to vRHd (6 / Confidence 4)

> Generally, the paper restricts itself to a very narrow setting of a specific data that is centered at zero, linear model generating the labels, linear transformer, specific sparsity of the parameter matrix.

We agree with the reviewer that our setting is theoretical. Yet, we note that our setting is not specific to this work and is used in other theoretical studies:

- [1,2,3] use linear attention, since linear attention is expressive enough to implement gradient descent on linear regression.
- [1-5] use centered data and generated labels.
- [1,2,3,5] focus on linear models.

In Section 5, we relax the sparsity constraint. As noted by reviewer `k8UL`: "It is quite a neat idea to consider sparsity driven constraints that permit tractability." Without the minimal sparsity constraint in Eq. 10, the analysis becomes very difficult.

> The transformers that they study are linear with a single head only. While I do appreciate the value of the analysis and I understand that this is just a step of many, it would be great to include a short paragraph outlining what the authors think about generalization of the proposed approach when the assumptions are removed.

- **Multi-head attention.** Here, we show that learning multi-head linear attention reduces to learning a single-head attention. Recall the notation $\text{Attn}_{P,Q}(Z)$ in Eq. 2. Due to the linearity of the attention in $P$ and $Q$, we have
  $$\sum_{i} \text{Attn}_{P_i,Q_i}(Z) = \text{Attn}_{\sum_i P_i,\,\sum_{i} Q_i}(Z).$$
  Thus the reparameterization $P' := \sum_{i} P_i$ and $Q' := \sum_{i} Q_i$ casts the problem as learning a single-head attention.
- **Softmax.** As noted in the general response, we observe both theoretically and empirically that the softmax case is similar to linear attention (in line with the conclusion of [Appendix A.9] in von Oswald et al.). For two-headed softmax attention, we observe that the learned algorithm is identical to that of linear attention. The key intuition is the linearization trick $\frac{1}{2}(e^x - e^{-x}) \approx x$. We observed that the weights of the two attention heads have approximately opposite signs, confirming that this approximation is effective. We will add this discussion in our final version.

<font color="green">*Responding to this is easy, and we should do it inline here, and say where in the paper we talk about this.*</font>

> Also, given the series of the assumptions, I would make sure that they are stated early on in the abstract and introduction.

In the abstract, we narrow our scope to "linear transformers trained over random instances of linear regression". The introduction further clarifies: "Akin to recent work, we too focus on the setting of linear regression encoded via ICL, for which we train a transformer architecture with multiple layers of single-head self-attention without softmax". Linear regression encoded via ICL is introduced in [1,4,5], and it encompasses the setting of generated labels, linear regression, centered input data, and a specific encoding of the regression samples. We will gladly add any further details that are needed.
> I also think that the experimental validation can be slightly improved by considering different choice of layers, dimension of x, conditioning factor of the input etc. It would also be great to have an empirical plot demonstrating how close A is to the Gram matrix as the number of points increase (the results from the end of section 3)

<font color="green">*Here it is best to agree with the reviewer and mention something about more experiments, either in the appendix or ones that we may already be adding to the rebuttal.*</font>

**Questions:**

> 1. The equations for the gradient do not immediately pop out from Eq. 7. I would encourage the authors to expose the equations for the gradient to the reader (maybe after Eq. 5) so that the connection between the gradient and the preconditioner would become more apparent.

We will incorporate your comment to improve readability. Thank you!

> 2. What are the properties of A in sections 3 and 4? Can a transformer find a positive definite matrix through training?

In experiments, we observe that gradient descent converges to a positive definite matrix which depends on the covariance matrix of the inputs (characterized in Theorems 4 and 5). We take steps towards proving this observation by showing that the parameters characterized in Theorems 4 and 5 are stationary points of gradient descent. An interesting follow-up paper proves the convergence of gradient descent to the global optimum (in Theorem 1) for a single-head attention [2]. Yet, the convergence of gradient descent remains an open problem for multi-head attention.

> 3. For the evaluation of the Theorem 4 and 5, it is a bit hard to interpret the numbers, since the used distance metric is close to zero, but not exactly zero. The obtained value of ~0.1-0.2 is quite hard to interpret, apart from the fact that it is smaller than the distance with respect to the identity.

We agree with the reviewer. In Figures 2 and 4, we visualize the outputs (after training) to illustrate the diagonal dominance of the matrices, which matches the results of Theorems 4 and 5. The error can be reduced by using a larger training set.

> Also, the metric that the authors are using involves the minima over the space of scalars. It would be interesting if the authors can actually provide the scalars from Theorem 4 that they have found empirically. Do these scalars remain the same over multiple restarts of the algorithm?

> How come the same matrix is used for both data and the weights (theorem 4 and 5)? It doesn't sound very realistic that the data and the weights use the same covariance matrix.

In Theorem 1, we use different covariance matrices for the data and $w^*$. However, the analysis for multi-layer attention is significantly more difficult when the covariance matrices are independent. Notably, [3] also uses the same covariance structure as that of Theorems 4 and 5.

> Would be great to add more information to Section 4.1 about adaptive coordinate-wise step sizes. If I understood correctly, Theorem 3 talks about the existence of the global minimizer under certain conditions. How exactly does adaptive coordinate-wise step sizes help find this solution?

Here, we do not analyze the convergence of adaptive coordinate-wise step sizes to the global optimum characterized in Theorem 3. Instead, we analyze the structure of a global minimizer of the training objective and provide an algorithmic interpretation for it.
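To unpack the phrase "adaptive coordinate-wise step sizes" (generic notation, not a restatement of Theorem 3, with $\hat{L}$ as in the sketch in the common response above): a coordinate-wise step size is a diagonal preconditioner, so each coordinate of the iterate gets its own learning rate,

$$
w^{k+1} = w^{k} - D\,\nabla \hat{L}\big(w^{k}\big), \qquad
D = \mathrm{diag}(\eta_1, \dots, \eta_d),
$$

i.e., coordinate $j$ is updated as $w^{k+1}_j = w^{k}_j - \eta_j \big[\nabla \hat{L}(w^{k})\big]_j$; in our reading, these per-coordinate step sizes come from the trained weights rather than being fixed a priori. Theorem 3 characterizes the structure of a global minimizer and gives it this algorithmic interpretation; it does not claim that a particular optimizer converges to that minimizer.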
> I would encourage the authors to add a simple table at the end of the introduction clearly stating the assumptions that each section has. As written, it is quite hard to understand, especially when the notation changes from section to section.

This is a good idea and will improve readability. We will add a table at the end of the introduction outlining the assumptions and main results of the different sections.

Small comments:

> How come the authors choose L-BFGS as an optimizer? It is a fairly specialized algorithm that is not often used these days. What properties of L-BFGS are desirable in the settings that authors used it for? Can the same results be achieved with SGD or Adam?

L-BFGS enjoys a considerably faster convergence rate compared to Adam in our setting. [1] has experiments with AdamW without the sparsity constraints of Section 5. [2] proves the global convergence of gradient descent for a single-head attention. We will compare the convergence rates of these algorithms to justify our choice of optimizer.

> It would be important to highlight limitations of the analysis. E.g. same covariance matrix, single-head linear attention...

In the discussion, we included limitations of our analysis, where we talked about the linearity of the attention. We will add further limitations such as the covariance matrix structure and the single-head analysis. We will fix the typos.

Thank you once again for taking the time to review our paper. We sincerely hope that our response to your concerns, as well as the overall response to the other reviewers' concerns, helps assuage your concerns and lets you view this paper in a more favorable light.

**References**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." ICML 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).
[3] Mahankali, Arvind, et al. "One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention." arXiv preprint arXiv:2307.03576 (2023).
[4] Garg, Shivam, et al. "What can transformers learn in-context? A case study of simple function classes." NeurIPS 2022.
[5] Akyürek, Ekin, et al. "What learning algorithm is in-context learning? Investigations with linear models." arXiv preprint (2022).
[6] Schlag, Imanol, Kazuki Irie, and Jürgen Schmidhuber. "Linear transformers are secretly fast weight programmers." ICML 2021.
