# ICML23_LoRA_Asymmetry_Rebuttal
## Letter to Area Chair
## General response for all audience
We would like to express our gratitude to the reviewers for their careful consideration of our work and for providing detailed reviews. We appreciate your critical feedback, which has allowed us to view our work in a new light and to clarify several points, e.g., a new analysis of running time, a discussion through the lens of matrix column spaces, and a deeper dive into initialization. All of these changes will be included in the next version of the draft.
### Summary of improvements and modifications we have made
1. Discussions and experiments regarding the efficiency (R1.a, R1.d)
2. More baseline methods and ablation (R1.c, R3.b)
3. More comprehensive theoretical results and more discussion (R2.a, R2.c, R2.e, R3.h)
4. Experimental details and hyperparameters (R3.c, R4.c)
5. Performance (R1.b, R2.b, R2.c)
6. Additional large scale experiments (R3.a, R4.b)
7. Insight, intuition, novelty (R1.b, R2.d, R4.a)
8. Minor clarifications (R3.e, R3.f, R3.g)
## Response to Reviewer zKBE (R1)
```a. No computational experiments on runtime improvements due to freezing A.```
Thanks for raising this constructive question! We do observe a runtime benefit when freezing the A matrix, even when doubling the rank. This is because freezing A means its gradients do not need to be computed or stored, which reduces the memory footprint for gradients during training.
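To make this concrete, here is a minimal sketch in plain PyTorch (not our actual training code) of a LoRA layer in which A is stored as a frozen buffer, so no gradient or optimizer state is ever allocated for it:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a low-rank update BA, with A frozen as well."""
    def __init__(self, d_in, d_out, r):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # W0, frozen
        # A: fixed random row-orthonormal projection; a buffer never receives gradients
        self.register_buffer("A", torch.linalg.qr(torch.randn(d_in, r)).Q.T)       # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))                                # only B is trained

    def forward(self, x):
        return x @ self.weight.T + (x @ self.A.T) @ self.B.T

layer = LoRALinear(768, 768, r=16)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 768 * 16, i.e. only B
```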
We provide additional experimental results on multiple datasets to illustrate the runtime improvement. Specifically, we compare the *train samples per second* of different PEFT methods on multiple fine-tuning tasks. These numbers are captured through Weights & Biases experiment tracking, and all numbers are averaged over 3 random seeds.
Table 1.a: Train samples per second on GLUE RTE dataset
| Metric | LoRA (r=8) | AdaLoRA (r=8) | \hat{B} (r=8) | \hat{B} (r=16) |
|------------------|------------|---------------|---------------|----------------|
| Train samples /s | 4.71 (.03) | 2.90 (.11) | 7.29 (.16) | 6.28 (.17) |
Table 1.b: Train samples per second on GLUE SST-2 dataset
| Metric | LoRA (r=8) | AdaLoRA (r=8) | \hat{B} (r=8) | \hat{B} (r=16) |
|------------------|--------------|--------------|----------------|----------------|
| Train samples /s | 227.62 (.59) | 88.14 (.19) | 255.45 (13.38) | 265.80 (12.13) |
Table 1.c: Train samples per second on Alpaca dataset
| Metric | LoRA (r=32) | \hat{B} (r=32) | \hat{B} (r=64) |
|------------------|--------------|----------------|----------------|
| Train samples /s | 227.62 (.59) | 288.14 (.19) | 265.80 (12.13) |
```b. While the work offers important insight into the working of LoRA, it does not currently confer performance improvements or specific insights into achieving better performance using low rank adaptation.```
Thank you for raising this concern. Our contribution can be interpreted from the following aspects:
1. When freezing the A matrix and updating only B, the fine-tuned model tends to show more significant improvements on out-of-distribution data. This is shown in our new 5-shot accuracy on the MMLU benchmark (Table 1.d), where we fine-tune the model on the Alpaca instruction-tuning dataset and evaluate it on MMLU. The same trend can also be seen in our Table 5 (and Tables 9 and 10), where the fine-tuned model is evaluated on out-of-distribution data. In summary, our method provides larger improvements on data distributions that differ from the one used for fine-tuning.
2. For the first time, we provide a theoretical justification of this asymmetry in fine-tuning and validate our understanding on multiple large pre-trained models, across both language and image modalities. Our method requires a smaller budget and is not intended to compete with methods that have more parameters or introduce additional mechanisms.
3. Our contributions highlight the importance of the B matrix over A. We assess this hypothesis from both theoretical and empirical perspectives, using quantitative and qualitative results, to ensure that this is a general effect in fine-tuning large models before more sophisticated methods are built on top of it. We believe this understanding will facilitate further applications such as model merging, pruning, and LoRA serving.
Table 1.d 5-shot accuracy on MMLU benchmark
| Method | % Param. | Hums | STEM | Social | Other | Avg |
|----------------|----------|-------|-------|--------|-------|-------|
| Llama-2-7b | 100.00 | 43.98 | 34.11 | 49.08 | 44.31 | 43.14 |
| LoRA (r=32) | 0.24 | 44.59 | 36.50 | 51.81 | 45.75 | 44.76 |
| \hat{B} (r=32) | 0.12 | 44.17 | 36.00 | 46.88 | 45.14 | 45.36 |
| \hat{A} (r=32) | 0.12 | 44.36 | 35.93 | 51.46 | 46.85 | 44.51 |
| \hat{B} (r=64) | 0.24 | **45.10** | **37.65** | **55.08** | **51.08** | **46.46** |
```c. What is the impact of a structured initialization of A (non zero constant, banded etc)? Do correlated columns in A causes performance deterioration?```
Thanks for asking such an insightful question! We conducted a new ablation study to investigate how different choices of the fixed A matrix affect performance. Specifically, we use three initializations: (1) columns linearly dependent on each other, (2) rows linearly dependent on each other, and (3) a banded matrix with bandwidth equal to the rank (a sketch of one possible construction follows Table 1.e). As Table 1.e shows, the model struggles to learn anything when either the columns or the rows of A are correlated, whereas fixing A to a banded matrix yields reasonable performance. This observation further agrees with our theoretical formulation, which requires the fixed A to be orthogonal.
Table 1.e: Different fixed A matrices on the RTE task.
| Fixed A | RTE |
|-------------------|--------------|
| \hat{B} A_1 (dependent columns) | 50.9 (3.13) |
| \hat{B} A_2 (dependent rows) | 52.71 (3.29) |
| \hat{B} Banded A | 83.51 (2.18) |
| LoRA | 84.1 (0.83) |
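For reference, the fixed A matrices in this ablation can be built along the following lines (a sketch of one possible construction; our exact setup may differ in minor details such as scaling):

```python
import torch

d_in, r = 768, 8

# (1) all columns pairwise linearly dependent (every column a multiple of one vector): rank-1 A
A_col_dep = torch.randn(r, 1).expand(r, d_in).clone()

# (2) all rows pairwise linearly dependent (every row a multiple of one vector): rank-1 A
A_row_dep = torch.randn(1, d_in).expand(r, d_in).clone()

# (3) banded A with bandwidth equal to the rank: rows remain linearly independent
A_banded = torch.zeros(r, d_in)
for i in range(r):
    A_banded[i, i:i + r] = torch.randn(r)

# (reference) random row-orthonormal A, as used for our main \hat{B} results
A_orth = torch.linalg.qr(torch.randn(d_in, r)).Q.T
```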
```d. What are the computational runtime improvements of freezing A?```
We have addressed this question in our reply to question (a) above.
We sincerely appreciate your valuable feedback and the opportunity to clarify our contributions. We will include the new discussions and results in our updated version. We hope that our response has addressed your concerns satisfactorily, and we would greatly appreciate it if you could consider raising your assessment. Thank you!
## Response to Reviewer 1L8Q (R2)
```a. The authors' analysis extends trivially in the linear probing setting, but apart from some intuition it does not clarify the non-linear case.```
Please see Section 4.1.2, where we analyze the general case of multilayer nonlinear networks and demonstrate the asymmetry by examining the associated gradients. If anything remains unclear, we would be happy to discuss further.
```b. Reduced memory cost with respect to full rank LoRA only if you don't have to increase the rank while freezing A. I believe the experimental evaluation shows that in fact often the same rank is not sufficient (tables 1 and 4).```
We would like to clarify that freezing the A matrix still brings a significant computational benefit even when doubling the rank, because the gradients of A no longer need to be computed or stored on the GPU. We also observe this phenomenon empirically, as illustrated in Table 2.a below, where updating only B achieves a significantly higher *train samples per second*.
Table 2.a: Train samples per second on Alpaca dataset
| Metric | LoRA (r=32) | \hat{B} (r=32) | \hat{B} (r=64) |
|------------------|--------------|----------------|----------------|
| Train samples /s | 227.62 (.59) | 288.14 (.19) | 265.80 (12.13) |
Regarding performance, our updated 5-shot MMLU results indicate a more significant improvement, as shown in the following table. As we can see, the 5-shot prompt helps the model better leverage the knowledge learned from instruction tuning.
Table 2.b: 5-shot accuracy on MMLU benchmark
| Method | % Param. | Hums | STEM | Social | Other | Avg |
|----------------|----------|-------|-------|--------|-------|-------|
| Llama-2-7b | 100.00 | 43.98 | 34.11 | 49.08 | 44.31 | 43.14 |
| LoRA (r=32) | 0.24 | 44.59 | 36.50 | 51.81 | 45.75 | 44.76 |
| \hat{B} (r=32) | 0.12 | 44.17 | 36.00 | 46.88 | 45.14 | 45.36 |
| \hat{A} (r=32) | 0.12 | 44.36 | 35.93 | 51.46 | 46.85 | 44.51 |
| \hat{B} (r=64) | 0.24 | **45.10** | **37.65** | **55.08** | **51.08** | **46.46** |
```c. Some details about the proofs are missing in the appendix, I believe it would be beneficial to the quality of the paper to add some references or further details.```
Thank you for the reminder. We agree that adding more details to the proofs will strongly improve the presentation. We will augment the proofs; for example, Lemma 1 can be expanded as follows.
The objective is to find matrices $A$ and $B$ that optimize the following loss function:
\begin{equation}
\mathcal{L}(A, B) = \mathbb{E}_{(Y_{targ}, X_{targ})} \left[\|(W_{targ} - W_0)X_{targ} - BAX_{targ}\|^2_2\right]
\end{equation}
\textbf{Case 1: Fixing $A = Q$, with $QQ^T = I$.}
Consider the least squares problem:
\begin{equation}
\min_{B} \|(W_{targ} - W_0)X_{targ} - BQX_{targ}\|^2_2 = \min_{B} \|\hat{Y} - B\hat{X}\|^2_2
\end{equation}
where $\hat{Y} = (W_{targ} - W_0)X_{targ}$ and $\hat{X} = QX_{targ}$. The closed-form solution is:
\begin{equation}
B^* = \hat{Y}\hat{X}^T(\hat{X}\hat{X}^T)^{-1} = (W_{targ} - W_0)X_{targ}X_{targ}^TQ^T(QX_{targ}X_{targ}^TQ^T)^{-1}
\end{equation}
Denote $\Delta = W_{targ} - W_0$ and $\Sigma = \text{Cov}[X_{targ}]$ (assuming $\mathbb{E}[X_{targ}] = 0$). We then have:
\begin{equation}
B^* = \Delta\Sigma Q^T(Q\Sigma Q^T)^{-1}
\end{equation}
In summary, to optimize $\mathcal{L}(A,B)$ when fixing $A=Q$ with $QQ^T=I$, we solve the least squares problem and obtain the closed-form solution $B^*$ in terms of $W_0$, $W_{targ}$, the covariance of $X_{targ}$, and the orthogonal matrix $Q$.
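As a quick numerical sanity check of this closed form (a sketch on synthetic data, not part of the proof), one can compare it against a direct least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, n = 50, 30, 4, 10_000

Delta = rng.standard_normal((d_out, d_in))               # W_targ - W_0
X = rng.standard_normal((d_in, n))                        # zero-mean inputs
Q = np.linalg.qr(rng.standard_normal((d_in, r)))[0].T     # r x d_in with QQ^T = I_r

S = X @ X.T
B_closed = Delta @ S @ Q.T @ np.linalg.inv(Q @ S @ Q.T)   # closed form above

# direct least squares for  min_B || Delta X - B (Q X) ||^2
B_lstsq = np.linalg.lstsq((Q @ X).T, (Delta @ X).T, rcond=None)[0].T

print(np.max(np.abs(B_closed - B_lstsq)))                 # ~1e-12: the two solutions agree
```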
We also reply to another reviewer with more details of the proof in Appendix B.3.
```d. Clearly, the memory advantage of optimizing just one of the two factors of LoRA is there only if you don't have to pay a high price when increasing the rank. In Table 1, for r=8 and frozen A, LoRA is still doing better in the majority of the benchmarks. It's also interesting that LoRA did worse in all benchmarks in which the metric is not accuracy, do you have any explanation for that?```
Thank you for this insightful comment! In fact, r=8 with frozen A improves upon LoRA on 3 of the datasets and on average in Table 1.
Still, we believe the tasks in the GLUE benchmark become relatively simple as increasingly powerful base models are applied, so it gets harder to differentiate between fine-tuning methods. Secondly, as discussed in Section 4.2.2, freezing the A matrix to a random basis yields better generalization.
Regarding classification tasks, one possible explanation is that they are easier to overfit to. Freezing some parameters can be viewed as a form of regularization that helps to prevent overfitting.
```e. I find the asymmetry in the results in the linear regression setting intuitive: when freezing B we are freezing the range of the regression matrix (appendix B.1), and since the regression happens in the codomain it is reasonable that this constraint may pose more issues than fixing the co-range. In this sense, it is clear that if for example range(W_0) = range(W_target), then B can be kept fixed with range(B) \subseteq range(W_0) without impacting the loss, so the role of A is predominant and it depends in some sense on how much the distributions of (X, Y) and (X_targ, Y_targ) are different. I would really appreciate the authors' comment on this and I believe this needs to be discussed in the manuscript.```
We are glad that you find the asymmetry intuitive; thinking about it in terms of matrix range spaces is a nice perspective! It is unclear to us, however, what is meant by "if for example range(W_0) = range(W_target), then B can be kept fixed with range(B) \subseteq range(W_0) without impacting the loss", but we would be happy to discuss further.
From our study, it seems that there might exist an optimal subspace for a given target dataset, and the rank of $\Delta = W_{target} - W_0$ should give the optimal rank for the LoRA update. It is also reasonable to consider the distributions of (X_targ, Y_targ) and (X, Y) as determined by $W_{target}$ and $W_0$, respectively. If we then assume that $W_{target}$ and $W_0$ share exactly the same subspace, it is expected that the span of the learned BA should match that of $W_0$ and $W_{target}$. Please let us know whether this answers your question!
## Response to Reviewer EHB7 (R3)
```a. GLUE results are unconvincing. There is very little difference between the random A and training A except for RTE and COLA. Why is there so little improvement when training with double the rank of A?```
Thank you for raising this point. In Table 1, we first aim to illustrate the **asymmetry** of the A and B matrices: freezing A and updating B consistently performs better than freezing B and updating A, which agrees with our theoretical findings. Secondly, using half of the parameters achieves performance comparable to standard LoRA.
We do not expect our method to significantly surpass other baselines on the GLUE benchmark, where the training and test data come from the same dataset.
On the other hand, we find that freezing the A matrix brings more significant benefits when the fine-tuned model is evaluated on data that differ from the data used for fine-tuning. For example, we present below the 5-shot accuracy on the MMLU benchmark, where the model is fine-tuned on the Alpaca dataset.
In summary, the main takeaways of Table 1 are: (1) the asymmetry of A and B, since freezing A and updating B is better than the reverse; and (2) freezing A and updating B achieves comparable performance to the baselines with a smaller GPU memory budget and faster computation. In addition, freezing A and updating B is particularly better on out-of-distribution (OOD) data, as can be seen in Table 3.a below (MMLU reasoning) as well as Tables 9 and 10 (DomainNet image classification) in our manuscript.
Table 3.a: 5-shot accuracy on MMLU benchmark
| Method | % Param. | Hums | STEM | Social | Other | Avg |
|----------------|----------|-------|-------|--------|-------|-------|
| Llama-2-7b | 100.00 | 43.98 | 34.11 | 49.08 | 44.31 | 43.14 |
| LoRA (r=32) | 0.24 | 44.59 | 36.50 | 51.81 | 45.75 | 44.76 |
| \hat{B} (r=32) | 0.12 | 44.17 | 36.00 | 46.88 | 45.14 | 45.36 |
| \hat{A} (r=32) | 0.12 | 44.36 | 35.93 | 51.46 | 46.85 | 44.51 |
| \hat{B} (r=64) | 0.24 | **45.10** | **37.65** | **55.08** | **51.08** | **46.46** |
```b. Table 3 is missing baselines. Please at least add the full finetuning and the LoRA baselines.```
Thank you for this reminder. We now provide the full fine-tuning and LoRA baselines in addition to our previous Table 3.
| Method | % Param. | Xsum | CNN/DailyMail |
|-------------------------|----------|-----------------------|-----------------------|
| Full FT | 100% | 45.49 / 22.33 / 37.26 | 44.16 / 21.28 / 40.90 |
| LoRA*(r=2) | 0.13% | 42.81 / 19.68 / 34.73 | 43.68 / 20.63 / 40.71 |
| \hat{B}_0 A_{rand} r=16 | 0.44% | 42.91 / 19.61 / 34.64 | 43.65 / 20.62 / 40.72 |
| B_{rand} \hat{A}_0 r=16 | 0.44% | 42.37 / 19.30 / 34.29 | 43.38 / 20.36 / 40.48 |
(* Numbers adopted from Zhang et al., 2023.)
```c. For table 5, experimental details are missing. For example where is LoRA adapted? How many epochs did you train for the supervised fine-tuning and that of LoRA? What hyperparameters are used in training for both settings?```
Thanks for pointing this out. We adopted the following hyperparameters for instruction tuning the Llama-2-7B model.
| learning rate | batch size | # epochs | alpha | dropout | target modules |
|---------------|------------|----------|-------|---------|------------------------------------|
| 2e-4 | 16 | 3 | 2 \times rank | 0.05 | {"k_proj", "q_proj", "classifier"} |
```d. Confused by fig 1. Generally for the LoRA, you want initialization at 0, so B is chosen as 0 and A is random so is it surprising that since B always start from the same initialization, it follows essentially a similar trajectory?```
In Figure 1, we illustrate the following phenomenon: *the learned B matrix is determined more by the target fine-tuning dataset, while the A matrix is determined more by its initialization*. Our theoretical results explain this asymmetry, indicating that it is a property of the intrinsic matrix structure. Motivated by this property, we argue for the benefits of freezing A as a random orthonormal matrix.
1. In the case of Figure 1(a), the B matrices do not necessarily converge to the same solution in the sense of entrywise similarity (due to the randomness in A, as you noted). Recall that we use Canonical Correlation Analysis (CCA) to measure similarity (see Appendix A for details, and the sketch after this list). CCA compares subspaces rather than specific matrices; in other words, two independent runs can produce B matrices that differ entrywise but span similar subspaces.
2. To further show that this trend is due to the matrix structure rather than the initialization, we ran new experiments in which the initializations of A and B are swapped. The figure is available at this anonymous link: [https://anonymous.4open.science/r/Anonymous_ICML2024_rebuttal-676C/figures/swap_initialization.png]. As we can see, although the trend in the similarities of A and B also swaps, the A matrices, despite all being initialized at zero, converge (in the CCA sense) as closely as in the previous setting. This experiment provides a potential reason for choosing the A matrix to be the random one.
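The following is a minimal sketch of the subspace comparison we have in mind (the exact procedure in Appendix A may differ in normalization): two B matrices spanning the same column space get a similarity of 1 even if their entries differ.

```python
import numpy as np

def mean_cca_similarity(M1, M2):
    """Mean canonical correlation between the column spaces of M1 and M2 (d x r)."""
    Q1, _ = np.linalg.qr(M1)                                  # orthonormal basis of span(M1)
    Q2, _ = np.linalg.qr(M2)                                  # orthonormal basis of span(M2)
    return np.linalg.svd(Q1.T @ Q2, compute_uv=False).mean()  # canonical correlations

rng = np.random.default_rng(0)
d, r = 768, 8
B1, B2 = rng.standard_normal((d, r)), rng.standard_normal((d, r))
print(mean_cca_similarity(B1, B1 @ rng.standard_normal((r, r))))  # ~1.0: same subspace, different entries
print(mean_cca_similarity(B1, B2))                                # much lower: independent subspaces
```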
```e. Line 322-324 is weird. "This addition provides more expressive power for the same number of parameters without loss of generalization bounds" What addition?```
Using a larger $r_B$ (the rank when only $B$ is trained) than $r_{BA}$ (the rank when both $A$ and $B$ are trained) is possible for the same number of trainable parameters; the exact increase that keeps the parameter count fixed depends on the sizes of the matrices involved.
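Concretely, for a layer $W \in \mathbb{R}^{d_{out} \times d_{in}}$, matching the trainable-parameter budgets gives
\begin{equation}
r_{BA}(d_{in} + d_{out}) = r_B \, d_{out} \quad \Longrightarrow \quad r_B = r_{BA}\,\frac{d_{in} + d_{out}}{d_{out}},
\end{equation}
i.e., $r_B = 2\,r_{BA}$ for a square layer. This is exactly the budget-matched comparison in our tables (e.g., $\hat{B}$ with $r=64$ vs. LoRA with $r=32$ at the same 0.24% of parameters).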
```f. What is $ \hat{A}_V $```
Here we tried fixing the B / A matrix to be the left / right singular matrix of the pre-trained weight. Specifically, we compute $U, S, V = \mathrm{SVD}(W_0)$, where $W_0$ is the original pretrained weight, and freeze $A = V$ (a sketch follows below). However, we find that this choice does not bring much gain compared to a random orthonormal matrix. In addition, as shown in our new ablation study suggested by Reviewer 1 (Table 1.e), orthonormality appears to be the critical property of the fixed matrix in low-rank adaptation. We will add the missing details in the revised version.
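A sketch of this construction (we keep the top-$r$ singular directions so the shapes match; treat the exact truncation as an implementation detail):

```python
import torch

d_out, d_in, r = 768, 768, 8
W0 = torch.randn(d_out, d_in)                 # stands in for a pretrained weight

U, S, Vh = torch.linalg.svd(W0, full_matrices=False)
A_fixed = Vh[:r]        # \hat{A}_V: r x d_in, rows orthonormal, kept frozen
B_fixed = U[:, :r]      # analogous choice when freezing B instead
```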
```g. Line 363-364, it seems the performance is within noise? Outperform seems to be an exaggeration.```
We agree that the performance of our method on the GLUE benchmark is comparable to the baseline and will rephrase this sentence. However, as stated in our reply to your question (a), we can still observe that freezing A achieves higher average performance than LoRA on 6 out of 7 tasks. Also, we would like to note that the main purpose of our experiments on the GLUE benchmark is to confirm the asymmetry of the A and B matrices.
It is also worth noting that the out-of-distribution performance is significantly better, as shown in our new 5-shot MMLU results (Table 3.a) as well as on DomainBed (Tables 9 and 10).
```h. How is Von Neumann's trace formula used in line 610?```
It turns out that there is a simpler/tighter proof making use of the Moore-Penrose pseudoinverse $A^\dagger$.
Note that since $\Sigma$ is positive semidefinite its symmetric square root exists and we can write
\begin{equation}
\mathrm{Tr} [Q \Sigma \Delta^\top \Delta \Sigma Q^\top (Q \Sigma Q^\top )^{-1}] = \mathrm{Tr}[(\Sigma^{1/2} \Delta^\top \Delta \Sigma^{1/2}) (\Sigma^{1/2} Q^T (Q \Sigma^{1/2} \Sigma^{1/2} Q^T)^{-1} Q \Sigma^{1/2})]
\end{equation}
\begin{equation}
= \mathrm{Tr}[(\Sigma^{1/2} \Delta^\top \Delta \Sigma^{1/2}) ((Q \Sigma^{1/2})^\dagger (Q \Sigma^{1/2})) ].
\end{equation}
Now, it can be shown that (this will be expanded in the revision)
\begin{equation}
\lim_{d/r \rightarrow \infty}\mathbb{E}[(Q \Sigma^{1/2})^\dagger (Q \Sigma^{1/2})] = r \frac{\Sigma}{\mathrm{Tr}[\Sigma]},
\end{equation}
hence
\begin{equation}
\mathbb{E}[III_A] \rightarrow r \frac{\mathrm{Tr}[\Sigma^2 \Delta^\top \Delta]}{\mathrm{Tr}[\Sigma]} = r \frac{\|\Delta \Sigma\|_F^2}{\mathrm{Tr}[\Sigma]}. \qquad (1)
\end{equation}
Recall, on the other hand, that
\begin{equation}
\mathbb{E}[III_B] \rightarrow \frac{r}{d} \mathrm{Tr}[\Delta \Sigma \Delta^\top].
\end{equation}
For brevity in this response, let us suppose that $\Sigma$ has rank $k > r$ and that its smallest nonzero eigenvalue is $\geq \mathrm{Tr}[\Sigma] / d$. Then, revisiting (1) above,
\begin{equation}
r \frac{\|\Delta \Sigma\|_F^2}{\mathrm{Tr}[\Sigma]} \geq \frac{r}{d} \frac{\mathrm{Tr}[\Sigma] \mathrm{Tr}[\Delta \Sigma \Delta^\top] }{\mathrm{Tr}[\Sigma] } \rightarrow \mathbb{E}[III_B]
\end{equation}
and the asymmetry is established.
Note furthermore that the term $III_A$ can be quite large for the right $\Sigma$; in fact, equality in
\begin{equation}
III_A \leq \mathrm{Tr}[\Delta \Sigma \Delta^\top]
\end{equation}
is achievable even with low $r$, which would yield the same performance as full fine-tuning, i.e., a loss of $d_{out} \sigma^2$.
On the other hand, the expected loss when freezing $B$ is always
\begin{equation}
d_{out} \sigma^2 + \mathrm{Tr}[\Delta \Sigma \Delta^\top] \left(1 - \frac{r}{d}\right)
\end{equation}
implying large errors must always persist here when $r$ is small.
We will expand this discussion in the revision and make the proof more readable.
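For completeness, the asymptotic identity for $(Q\Sigma^{1/2})^\dagger(Q\Sigma^{1/2})$ can also be checked numerically. Below is a rough Monte Carlo sketch (synthetic $\Delta$, an assumed well-conditioned $\Sigma$, and finite $d/r$, so only approximate agreement is expected; this is illustrative and not part of the formal argument):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_out, r, trials = 400, 20, 4, 500

evals = rng.uniform(0.5, 1.5, size=d)          # well-conditioned spectrum (assumption)
Sigma, Sig_half = np.diag(evals), np.diag(np.sqrt(evals))
Delta = rng.standard_normal((d_out, d))        # stands in for W_targ - W_0
M = Sig_half @ Delta.T @ Delta @ Sig_half      # Sigma^{1/2} Delta^T Delta Sigma^{1/2}

vals = []
for _ in range(trials):
    Q = np.linalg.qr(rng.standard_normal((d, r)))[0].T   # r x d with QQ^T = I_r
    QS = Q @ Sig_half
    P = np.linalg.pinv(QS) @ QS                          # (Q Sigma^{1/2})^† (Q Sigma^{1/2})
    vals.append(float(np.sum(M * P)))                    # = Tr[M P]; both matrices are symmetric

pred = r * np.linalg.norm(Delta @ Sigma, 'fro') ** 2 / np.trace(Sigma)
print(np.mean(vals), pred)   # Monte Carlo estimate of E[III_A] vs. r ||Delta Sigma||_F^2 / Tr[Sigma]
```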
## Response to Reviewer LHV6 (R4)
```a. Novelty```
We would like to thank Reviewer 4 for the engaging discussion. In fact, we are glad to learn that, after reading our manuscript, Reviewer 4 found our findings, theoretical analysis, and experimental results to agree with their intuitions.
We respectfully disagree with Reviewer 4's assessment of the novelty. We believe that by walking through this line of research in depth, we can better clarify the contribution of this work for Reviewer LHV6 (R4), as well as for any potential reader of our paper.
- We first present the observation of the asymmetry between A and B. As we stated in our related works, **"While nearly all recent studies treat the two matrices asymmetrically..., there is a lack of formal investigation into this asymmetry in low-rank adaptation"**.
- The asymmetry between A and B has not been explored in recent LoRA papers; the term "asymmetry" is not even mentioned. We are the first to state this phenomenon and to investigate it systematically. LoRA [Hu et al., 2022] chooses different initializations for A and B without comparing against a swapped initialization.
- In fact, in the original LoRA paper (the arXiv version) [Hu et al., 2022], when the authors investigate the subspace similarity between random seeds (Section 7.2, page 11), they claim *"a similar analysis can be carried out with B and the left-singular unitary matrices"*. Does this contradict our study? No, because the analysis in [Hu et al., 2022] studies the subspace similarity between different random seeds when fine-tuning on the **same task**, whereas our asymmetry concerns the subspace similarity of A and B learned on **different tasks**. Our study implies the existence of an *optimal subspace* for LoRA adapters.
- One might find the above observation intuitive, since it relates to the assumption that models lie in a "basin" of low loss in parameter space [Ilharco et al., 2022], and that models fine-tuned from the same checkpoint lie in the same basin [Neyshabur et al., 2020].
- One recent work [Hayou et al., 2024], which appeared after the submission of this paper, reaches a similar conclusion by applying different learning rates to A and B. However, our method still differs in terms of fewer active parameters, OOD performance, and the orthonormality of the frozen matrix.
- Reviewer 4 refers to BYOL [Grill, Strub et al., 2020], a self-supervised learning method that has achieved great success. However, the core idea of BYOL is that using a target network that is a moving average of the online network prevents collapse. The BYOL paper does mention that a fixed, randomly initialized network can serve as a target network that prevents collapse, but this is a relatively weak baseline, as shown in Table 5(a) of [Grill, Strub et al., 2020]. Moreover, even in that case, BYOL's model structure is quite different from LoRA's. It would be great if the reviewer could better justify how the BYOL method shades the contribution in the LoRA setting. Alternatively, Reviewer 4 might be thinking of dimensionality reduction with random projections [Bingham, 2001] or compressed sensing [Donoho et al., 2006].
- However, it would be risky to directly apply intuitions from the aforementioned representation-learning settings to an intermediate layer of a transformer model. Consider the following case: for an intermediate linear layer $W_0 \in \mathbb{R}^{d \times d}$, we can always form a low-rank factorization $W_0 \approx QR$ with $Q \in \mathbb{R}^{d \times r}$ and $R \in \mathbb{R}^{r \times d}$; however, it would be problematic to argue that $R$ is a *feature extractor* and $Q$ is a *classifier*. This is also why we analyze the **gradients** in the nonlinear case.
- In fact, a better approach is to assume the existence of an optimal target model $W_t$, **just as we do in our analysis**. In this sense, within each layer, LoRA approximates the error matrix $\Delta = W_t - W_0$ with $BA$, and the well-known Eckart–Young theorem states that the optimal solution is given by the truncated SVD, i.e., a product of orthonormal factors up to a diagonal scaling (see the short sketch after this list). This also explains why we freeze A (or B) to be an orthonormal matrix.
- Although the reviewer finds the theoretical contribution marginal, the first theoretical analysis of LoRA only dates back to ICLR 2024, where [Zeng et al., 2023] establish a theoretical understanding following the Eckart–Young theorem. However, their analysis makes an assumption on the rank of $\Delta = W_t - W_0$, which we relax. We hope these detailed explanations justify our contribution; otherwise, it would be great if the reviewer could let us know why our result is considered marginal, and we would be happy to provide further clarification.
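A minimal illustration of the Eckart–Young point referenced above (synthetic numbers, not our experimental setup): the best rank-r approximation of $\Delta$ in Frobenius norm is its truncated SVD.

```python
import torch

d, r = 512, 8
Delta = torch.randn(d, d)                     # stands in for W_t - W_0

U, S, Vh = torch.linalg.svd(Delta)
B_opt = U[:, :r] * S[:r]                      # d x r
A_opt = Vh[:r]                                # r x d, rows orthonormal
print(torch.linalg.matrix_norm(Delta - B_opt @ A_opt))  # minimal Frobenius error over all rank-r BA
```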
In conclusion, in this work we **first present the asymmetry phenomenon between A and B in LoRA** and **first establish a theoretical analysis of this phenomenon**, which implies that freezing $A$ as a random orthonormal matrix yields better generalization performance at a lower computational cost. Beyond the experimental results, our analysis implies that (1) the B matrix is more important and depends strongly on the data, and (2) given a target fine-tuning dataset, there may exist an optimal subspace for the learned LoRA matrices. We believe these conclusions will benefit various future directions such as LoRA compression, LoRA retrieval, model merging, and other PEFT methods.
```b. More details on how the experiment hyperparameters ```
We are happy to provide the details of the hyperparameter selection for all our experiments.
We adopted the following hyperparameters for GLUE benchmarks.
| Dataset | learning rate | batch size | # epochs | alpha | target module |
|---------|---------------|------------|----------|-------|---------------|
| MNLI | 4e-4 | 16 | 10 | 16 | query, value |
| SST-2 | 2e-4 | 16 | 10 | 16 | query, value |
| MRPC | 2e-4 | 16 | 20 | 16 | query, value |
| CoLA | 4e-4 | 16 | 20 | 16 | query, value |
| QNLI | 2e-4 | 16 | 10 | 16 | query, value |
| RTE | 3e-4 | 16 | 20 | 16 | query, value |
| STS-B | 2e-4 | 16 | 10 | 16 | query, value |
We adopted the following hyperparameters for instruction tuning the Llama-2-7B model (a configuration sketch follows the table). We selected hyperparameters from: learning rate {5e-3, 1e-4, 2e-4, 4e-4}, dropout {0.01, 0.02, 0.05}.
| learning rate | batch size | # epochs | alpha | dropout | target modules |
|---------------|------------|----------|-------|---------|------------------------------------|
| 2e-4 | 16 | 3 | 2 \times rank | 0.05 | {"k_proj", "q_proj", "classifier"} |
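For reproducibility, the table above maps onto a configuration along the following lines. This is a sketch assuming the HuggingFace `peft` library (our actual training scripts may differ), and in the paper the frozen $A$ is additionally replaced by a random orthonormal matrix, which is omitted here:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=64,                                 # rank; alpha = 2 x rank per the table
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["k_proj", "q_proj"],  # the table also lists "classifier", where such a module exists
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, config)

# \hat{B}: keep the A matrices fixed after initialization and train only B
for name, param in model.named_parameters():
    if "lora_A" in name:
        param.requires_grad = False
model.print_trainable_parameters()
```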
```c. More models of the Llama-2 scale```
This is a helpful suggestion. We are running the same experiments on Mistral-7B model and the results can be viewed at the following anonymous link: https://anonymous.4open.science/r/Anonymous_ICML2024_rebuttal-676C
-----
### Reply
We are excited to learn that you are willing to revisit the details of our manuscript, and we appreciate you taking the time to do so! Please don't hesitate to let us know if there is anything else we can clarify!
#### For Table 4.
Regarding the question of hyperparameter tuning for Table 4 (the DomainBed experiments): first of all, we optimize the hyperparameters only for standard LoRA and use the same hyperparameters (learning rate, rank, batch size) for the corresponding freezing methods. This enables a fair comparison of the LoRA training mechanisms.
(1) We fix the model architecture to "google/vit-base-patch16-224-in21k" for all four datasets (VLCS, PACS, OfficeHome, and TerraIncognita). (2) We fix the rank to r=8 for all four tasks. (3) The learning rate is varied over $[10^{-5}, 10^{-3.5}]$. (4) The batch size is fixed to 8 across all datasets. (5) The LoRA dropout is varied over $[10^{-2}, 2.5 \times 10^{-1}]$. (6) We run three random seeds and report the performance of the best *in-domain* checkpoint.
We adopt the codebase from [DomainBed](https://github.com/facebookresearch/DomainBed) and specifically follow the provided hyperparameter search [strategy](https://github.com/facebookresearch/DomainBed/blob/main/domainbed/hparams_registry.py) in DomainBed (see the sketch below). The **asymmetry of A and B** and the OOD generalization performance are robust to different hyperparameters. Individual LoRA variants might be improved on the in-domain training distribution via a more extensive hyperparameter search; however, we think this is not the focus of this work at this stage.
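For clarity, the random-search draws look roughly as follows (a sketch in the spirit of DomainBed's `hparams_registry`; the log-uniform draw for the dropout range is our simplification):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lora_hparams():
    """One random-search draw; ranges follow the ones stated above."""
    return {
        "learning_rate": 10 ** rng.uniform(-5.0, -3.5),           # log-uniform in [1e-5, 10^-3.5]
        "lora_dropout": 10 ** rng.uniform(-2.0, np.log10(0.25)),  # roughly [1e-2, 2.5e-1]
        "batch_size": 8,                                          # fixed across datasets
        "rank": 8,                                                # fixed across tasks
    }

candidates = [sample_lora_hparams() for _ in range(20)]
```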
#### For Table 5, Alpaca instruction finetuned LoRA on MMLU
Since we did not present additional results for Table 4 during the rebuttal, we think Reviewer LHV6 might be interested in the hyperparameter search for our Table 5. As stated in our previous rebuttal note, we adopted the following hyperparameters for instruction tuning the Llama-2-7B model, selected from: learning rate {5e-3, 1e-4, 2e-4, 4e-4}, dropout {0.01, 0.02, 0.05}.
| learning rate | batch size | # epochs | alpha | dropout | target modules |
|---------------|------------|----------|-------|---------|------------------------------------|
| 2e-4 | 16 | 3 | 2 \times rank | 0.05 | {"k_proj", "q_proj", "classifier"} |
We did not retrain the Llama-2-7b model during the rebuttal. The main reason for the better results is the 5-shot prompting used during evaluation; specifically, we use the following self-defined prompt format. Llama-2's report mentions that the 5-shot prompt is critical for MMLU performance.
```python
def process_mcq(sample):
    # Format a single MMLU-style multiple-choice sample into a prompt string.
    prompt = 'Question: ' + sample['input'] + '\n'
    for c in ['A', 'B', 'C', 'D']:
        prompt += '. '.join([c, sample[c]]) + '\n'  # e.g. "A. <option text>"
    prompt += 'Answer:'  # the model continues with the answer letter
    return prompt, sample['target']
```