# ICML23_LoRA_Asymmetry_Rebuttal

## Letter to Area Chair

## General response for all audience

We would like to express our gratitude to the reviewers for their careful consideration of our work and for providing detailed reviews. We appreciate your critical feedback, which has allowed us to view our work in a new light and clarify several points, e.g., a new analysis of running time, a discussion through the lens of matrix column spaces, and a deep dive into initialization. All these changes will be included in the future version of the draft.

### Summary of improvements and modifications we have made

1. Discussions and experiments regarding efficiency (R1.a, R1.d)
2. More baseline methods and ablations (R1.c, R3.b)
3. More comprehensive theoretical results and discussion (R2.a, R2.c, R2.e, R3.h)
4. Experimental details and hyperparameters (R3.c, R4.c)
5. Performance (R1.b, R2.b, R2.c)
6. Additional large-scale experiments (R3.a, R4.b)
7. Insight, intuition, novelty (R1.b, R2.d, R4.a)
8. Minor clarifications (R3.e, R3.f, R3.g)

## Response to Reviewer zKBE (R1)

```a. No computational experiments on runtime improvements due to freezing A.```

Thanks for raising this constructive question! We do observe a computational runtime benefit when freezing the A matrix, even when doubling the rank. This is because freezing A means its gradients no longer need to be computed or stored, so the memory footprint for gradients during training is reduced. We provide additional experimental results on multiple datasets to illustrate the runtime improvement. Specifically, we compare the *train samples per second* of different PEFT methods on multiple fine-tuning tasks. This number is captured through Weights & Biases experiment tracking. All numbers are averaged over 3 random seeds.

Table 1.a: Train samples per second on the GLUE RTE dataset

| Train samples /s | LoRA (r=8) | AdaLoRA (r=8) | \hat{B} (r=8) | \hat{B} (r=16) |
|------------------|------------|---------------|---------------|----------------|
| # Samples        | 4.71 (.03) | 2.90 (.11)    | 7.29 (.16)    | 6.28 (.17)     |

Table 1.b: Train samples per second on the GLUE SST-2 dataset

| Train samples /s | LoRA (r=8)   | AdaLoRA (r=8) | \hat{B} (r=8)  | \hat{B} (r=16) |
|------------------|--------------|---------------|----------------|----------------|
| # Samples        | 227.62 (.59) | 88.14 (.19)   | 255.45 (13.38) | 265.80 (12.13) |

Table 1.c: Train samples per second on the Alpaca dataset

| Train samples /s | LoRA (r=32)  | \hat{B} (r=32) | \hat{B} (r=64) |
|------------------|--------------|----------------|----------------|
| # Samples        | 227.62 (.59) | 288.14 (.19)   | 265.80 (12.13) |
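For concreteness, here is a minimal sketch (ours, not the paper's implementation) of the frozen-A adapter: A is drawn once as a random matrix with orthonormal rows and excluded from the autograd graph, so only B accumulates gradients and optimizer state.

```python
import torch
import torch.nn as nn

class FrozenALoRALinear(nn.Module):
    """Pretrained linear layer plus a low-rank update BA where A is fixed."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pretrained weights stay frozen
        d_out, d_in = base.out_features, base.in_features
        # A: random orthonormal rows (A A^T = I_r); requires_grad=False means
        # no gradient or optimizer state is ever allocated for it.
        Q, _ = torch.linalg.qr(torch.randn(d_in, rank))
        self.A = nn.Parameter(Q.T, requires_grad=False)
        # B: zero-initialized so the adapter starts as a no-op, as in standard LoRA.
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```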
```b. While the work offers important insight into the working of LoRA, it does not currently confer performance improvements or specific insights into achieving better performance using low rank adaptation.```

Thank you for raising this concern. Our contribution can be interpreted from the following aspects:

1. When freezing the A matrix and only updating B, the fine-tuned model tends to show more significant improvements on out-of-distribution data. This is shown in our new 5-shot accuracy on the MMLU benchmark (Table 1.d), where we fine-tune the model on the Alpaca instruction-tuning dataset and evaluate it on the MMLU benchmark. The same trend can be seen in our Table 5 (and Tables 9 and 10), where the fine-tuned model is evaluated on out-of-distribution data. In summary, our method provides better improvements on data distributions that differ from the one used for fine-tuning.
2. For the first time, we provide a theoretical justification for LLM fine-tuning and validate our understanding on multiple large pre-trained models, across both language and image modalities. Our method requires a smaller budget and is not intended to compete with methods that have more parameters or introduce more mechanisms.
3. Our contributions highlight the importance of the B matrix over A. We assess this hypothesis from both theoretical and empirical perspectives, using quantitative and qualitative results, ensuring that this is a general effect in fine-tuning large models before proposing sophisticated methods. We believe this understanding will facilitate further applications such as model merging, pruning, and LoRA serving.

Table 1.d: 5-shot accuracy on the MMLU benchmark

| Method         | % Param. | Hums      | STEM      | Social    | Other     | Avg       |
|----------------|----------|-----------|-----------|-----------|-----------|-----------|
| Llama-2-7b     | 100.00   | 43.98     | 34.11     | 49.08     | 44.31     | 43.14     |
| LoRA (r=32)    | 0.24     | 44.59     | 36.50     | 51.81     | 45.75     | 44.76     |
| \hat{B} (r=32) | 0.12     | 44.17     | 36.00     | 46.88     | 45.14     | 45.36     |
| \hat{A} (r=32) | 0.12     | 44.36     | 35.93     | 51.46     | 46.85     | 44.51     |
| \hat{B} (r=64) | 0.24     | **45.10** | **37.65** | **55.08** | **51.08** | **46.46** |

```c. What is the impact of a structured initialization of A (non zero constant, banded etc)? Do correlated columns in A causes performance deterioration?```

Thanks for asking such an insightful question! We conducted a new ablation study to investigate how different fixed A matrices affect performance. Specifically, we use three initializations: (1) columns dependent on each other, (2) rows dependent on each other, and (3) a banded matrix with bandwidth equal to the rank. As we can see, the model struggles to learn anything when either the columns or the rows of A are correlated, while fixing A to be a banded matrix leads to reasonable performance. This observation further agrees with our theoretical formulation, where we require the fixed A to be orthonormal. Illustrative constructions of these matrices are sketched below the table.

Table 1.e: Different fixed A on the RTE task.

|                            | rte          |
|----------------------------|--------------|
| \hat{B} A_1                | 50.9 (3.13)  |
| \hat{B} A_2                | 52.71 (3.29) |
| \hat{B} Banded A           | 83.51 (2.18) |
| LoRA PUT OURS HERE INSTEAD | 84.1 (0.83)  |
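Below is an illustrative construction (ours; the paper's exact ablation matrices may differ) of the three fixed-A variants: A_1 with mutually dependent columns, A_2 with mutually dependent rows, and a banded A with bandwidth equal to the rank. The first two are rank-1, which is consistent with the failure to learn reported above.

```python
import torch

def fixed_A(kind: str, rank: int, d_in: int) -> torch.Tensor:
    """Illustrative fixed-A constructions for the Table 1.e ablation."""
    if kind == "correlated_columns":
        # A_1: every column is a scaled copy of one vector -> rank 1.
        return torch.randn(rank, 1) @ torch.ones(1, d_in)
    if kind == "correlated_rows":
        # A_2: every row is a scaled copy of one vector -> rank 1.
        return torch.ones(rank, 1) @ torch.randn(1, d_in)
    if kind == "banded":
        # Band of width `rank` starting at the diagonal; full rank r.
        A = torch.zeros(rank, d_in)
        for i in range(rank):
            A[i, i : i + rank] = torch.randn(min(rank, d_in - i))
        return A
    raise ValueError(f"unknown kind: {kind}")
```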
```d. What are the computational runtime improvements of freezing A?```

We have addressed this question in our reply to question (a).

We sincerely appreciate your valuable feedback and the opportunity to clarify our contributions. We will include the new discussions and results in our updated version. We hope that our response has addressed your concerns satisfactorily, and it would be greatly appreciated if you could consider raising your assessment. Thanks a lot!

## Response to Reviewer 1L8Q (R2)

```a. The authors' analysis extends trivially in the linear probing setting, but apart from some intuition it does not clarify the non-linear case.```

Please see Section 4.1.2, where we analyze the general case of multilayer nonlinear networks and demonstrate the asymmetry by looking at the associated gradients. If anything is still unclear, we would be happy to discuss further.

```b. Reduced memory cost with respect to full rank LoRA only if you don't have to increase the rank while freezing A. I believe the experimental evaluation shows that in fact often the same rank is not sufficient (tables 1 and 4).```

We would like to clarify that freezing the A matrix still enjoys a significant computational benefit even when doubling the rank, because freezing A means its gradients do not need to be stored or computed on the GPU. We can also observe this phenomenon experimentally, as illustrated in Table 2.a below, where only updating B yields significantly higher *train samples per second*.

Table 2.a: Train samples per second on the Alpaca dataset

| Train samples /s | LoRA         | \hat{B} (r=32) | \hat{B} (r=64) |
|------------------|--------------|----------------|----------------|
| # Samples        | 227.62 (.59) | 288.14 (.19)   | 265.80 (12.13) |

Regarding performance, our updated 5-shot MMLU results indicate a more significant improvement, as shown in the following table. As we can see, 5-shot prompting helps the model better leverage the knowledge learned from instruction tuning.

Table 2.b: 5-shot accuracy on the MMLU benchmark

| Method         | % Param. | Hums      | STEM      | Social    | Other     | Avg       |
|----------------|----------|-----------|-----------|-----------|-----------|-----------|
| Llama-2-7b     | 100.00   | 43.98     | 34.11     | 49.08     | 44.31     | 43.14     |
| LoRA (r=32)    | 0.24     | 44.59     | 36.50     | 51.81     | 45.75     | 44.76     |
| \hat{B} (r=32) | 0.12     | 44.17     | 36.00     | 46.88     | 45.14     | 45.36     |
| \hat{A} (r=32) | 0.12     | 44.36     | 35.93     | 51.46     | 46.85     | 44.51     |
| \hat{B} (r=64) | 0.24     | **45.10** | **37.65** | **55.08** | **51.08** | **46.46** |

```c. Some details about the proofs are missing in the appendix, I believe it would be beneficial to the quality of the paper to add some references or further details.```

Thank you for reminding us. We agree that adding more details to the proofs will strongly improve our presentation. We will augment our proofs, for example in Lemma 1, as follows. The objective is to find matrices $A$ and $B$ that optimize the following loss function:
\begin{equation}
\mathcal{L}(A, B) = \mathbb{E}_{(Y_{targ}, X_{targ})} \left[\|(W_{targ} - W_0)X_{targ} - BAX_{targ}\|^2_2\right]
\end{equation}

\textbf{Case 1: Fixing $A = Q$, with $QQ^T = I$.} Consider the least squares problem:
\begin{equation}
\min_{B} \|(W_{targ} - W_0)X_{targ} - BQX_{targ}\|^2_2 = \min_{B} \|\hat{Y} - B\hat{X}\|^2_2
\end{equation}
where $\hat{Y} = (W_{targ} - W_0)X_{targ}$ and $\hat{X} = QX_{targ}$. The closed-form solution is:
\begin{equation}
B^* = \hat{Y}\hat{X}^T(\hat{X}\hat{X}^T)^{-1} = (W_{targ} - W_0)X_{targ}X_{targ}^T Q^T (QX_{targ}X_{targ}^T Q^T)^{-1}
\end{equation}
Denote $(W_{targ} - W_0) = \Delta$ and $\Sigma = \text{Cov}[X_{targ}]$ (assuming $\mathbb{E}[X_{targ}] = 0$). We have:
\begin{equation}
B^* = \Delta\Sigma Q^T(Q\Sigma Q^T)^{-1}.
\end{equation}
In summary, to optimize $\mathcal{L}(A,B)$ when fixing $A=Q$ with $QQ^T=I$, we solve the least squares problem to obtain the closed-form solution $B^*$ in terms of $W_0$, $W_{targ}$, the covariance of $X_{targ}$, and the orthonormal matrix $Q$. We also reply to another reviewer with more details of the proof in Appendix B.3.
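As a quick check of this closed form, the sketch below (ours, with a synthetic $\Delta$ and synthetic inputs) verifies numerically that $B^* = \Delta\Sigma Q^T(Q\Sigma Q^T)^{-1}$ coincides with the direct least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 16, 4, 4096
Delta = rng.standard_normal((d, d))          # stands in for W_targ - W_0
X = rng.standard_normal((d, n))              # zero-mean inputs X_targ
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
Q = Q.T                                      # r x d with Q Q^T = I_r

Sigma = X @ X.T / n                          # empirical covariance of X_targ
B_closed = Delta @ Sigma @ Q.T @ np.linalg.inv(Q @ Sigma @ Q.T)

# Direct least squares: min_B ||Delta X - B (Q X)||_F^2
B_lstsq = np.linalg.lstsq((Q @ X).T, (Delta @ X).T, rcond=None)[0].T
assert np.allclose(B_closed, B_lstsq, atol=1e-6)
```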
```d. Clearly, the memory advantage of optimizing just one of the two factors of LoRA is there only if you don't have to pay a high price when increasing the rank. In Table 1, for r=8 and frozen A, LoRA is still doing better in the majority of the benchmarks. It's also interesting that LoRA did worse in all benchmarks in which the metric is not accuracy, do you have any explanation for that?```

Thank you for this insightful comment! In fact, r=8 with frozen A improves upon LoRA on 3 of the datasets and on average in Table 1. Still, we believe the tasks in the GLUE benchmark are relatively simple when ever more powerful base models are applied, so it becomes harder to differentiate fine-tuning methods. Secondly, as discussed in our Section 4.2.2, freezing the A matrix to a random basis yields better generalization. Regarding classification tasks, one possible explanation is that they are easier to overfit to, and freezing some parameters can be viewed as a form of regularization that helps prevent overfitting.

```e. I find the asymmetry in the results in the linear regression setting intuitive: when freezing B we are freezing the range of the regression matrix (appendix B.1), and since the regression happens in the codomain it is reasonable that this constraint may pose more issues than fixing the co-range. In this sense, it is clear that if for example range(W_0) = range(W_target), then B can be kept fixed with range(B) \subseteq range(W_0) without impacting the loss, so the role of A is predominant and it depends in some sense on how much the distributions of (X, Y) and (X_targ, Y_targ) are different. I would really appreciate the authors' comment on this and I believe this needs to be discussed in the manuscript.```

We are glad that the asymmetry is intuitive; thinking about this in terms of matrix range spaces is a nice perspective! It is unclear to us, however, what is meant by "if for example range(W_0) = range(W_target), then B can be kept fixed with range(B) \subseteq range(W_0) without impacting the loss", but we would be happy to discuss further. Through our study, it seems that there might exist an optimal subspace for a target dataset: the rank of the delta $\Delta = W_{target} - W_0$ should give the optimal rank for the LoRA update. It is also reasonable to consider the distributions of (X_targ, Y_targ) and (X, Y) to be determined by $W_{target}$ and $W_0$, respectively. Then, if we assume that $W_{target}$ and $W_0$ share exactly the same subspace, it is to be expected that the span of the learned BA matches that of $W_0$ and $W_{target}$. Please let us know whether this answers your question!
## Response to Reviewer EHB7 (R3)

```a. GLUE results are unconvincing. There is very little difference between the random A and training A except for RTE and COLA. Why is there so little improvement when training with double the rank of A?```

Thank you for raising this point. In Table 1, we first aim to illustrate the **asymmetry** of the A and B matrices: freezing A and updating B consistently outperforms freezing B and updating A, which agrees with our theoretical findings. Secondly, using half of the parameters achieves performance comparable to standard LoRA. We are not expecting our method to significantly surpass other baselines on the GLUE benchmark, where the training and testing data come from the same dataset. On the other hand, we find that freezing the A matrix has more significant benefits when the fine-tuned model is evaluated on data different from that used for fine-tuning. For example, here we present the 5-shot accuracy on the MMLU benchmark, where the model is fine-tuned on the Alpaca dataset. In summary, the main takeaways of Table 1 are: (1) the asymmetry of A and B, since freezing A and updating B is better than the reverse; and (2) freezing A and updating B achieves performance comparable to the baselines with a smaller GPU memory budget and faster computation. In addition, freezing A and updating B is particularly better on out-of-distribution (OOD) data, as can be seen from Table 3.a below (MMLU reasoning) as well as Tables 9 and 10 (DomainNet image classification) in our manuscript.

Table 3.a: 5-shot accuracy on the MMLU benchmark

| Method         | % Param. | Hums      | STEM      | Social    | Other     | Avg       |
|----------------|----------|-----------|-----------|-----------|-----------|-----------|
| Llama-2-7b     | 100.00   | 43.98     | 34.11     | 49.08     | 44.31     | 43.14     |
| LoRA (r=32)    | 0.24     | 44.59     | 36.50     | 51.81     | 45.75     | 44.76     |
| \hat{B} (r=32) | 0.12     | 44.17     | 36.00     | 46.88     | 45.14     | 45.36     |
| \hat{A} (r=32) | 0.12     | 44.36     | 35.93     | 51.46     | 46.85     | 44.51     |
| \hat{B} (r=64) | 0.24     | **45.10** | **37.65** | **55.08** | **51.08** | **46.46** |

```b. Table 3 is missing baselines. Please at least add the full finetuning and the LoRA baselines.```

Thank you for this reminder. We provide the full fine-tuning and LoRA baselines in addition to our previous Table 3.

| Method                    | % Param. | Xsum                  | CNN/DailyMail         |
|---------------------------|----------|-----------------------|-----------------------|
| Full FT                   | 100%     | 45.49 / 22.33 / 37.26 | 44.16 / 21.28 / 40.90 |
| LoRA* (r=2)               | 0.13%    | 42.81 / 19.68 / 34.73 | 43.68 / 20.63 / 40.71 |
| \hat{B}_0 A_{rand} (r=16) | 0.44%    | 42.91 / 19.61 / 34.64 | 43.65 / 20.62 / 40.72 |
| B_{rand} \hat{A}_0 (r=16) | 0.44%    | 42.37 / 19.30 / 34.29 | 43.38 / 20.36 / 40.48 |

(* We adopt these numbers from Zhang et al., 2023.)

```c. For table 5, experimental details are missing. For example where is LoRA adapted? How many epochs did you train for the supervised fine-tuning and that of LoRA? What hyperparameters are used in training for both settings?```

Thanks for pointing this out. We adopted the following hyperparameters for instruction tuning the Llama-2-7B model.

| learning rate | batch size | # epochs | alpha    | dropout | target modules                     |
|---------------|------------|----------|----------|---------|------------------------------------|
| 2e-4          | 16         | 3        | 2 × rank | 0.05    | {"k_proj", "q_proj", "classifier"} |
```d. Confused by fig 1. Generally for the LoRA, you want initialization at 0, so B is chosen as 0 and A is random so is it surprising that since B always start from the same initialization, it follows essentially a similar trajectory?```

In Figure 1, we illustrate the following phenomenon: *the learned B matrix is more determined by the target fine-tuning dataset, while the A matrix is more determined by its initialization*. Our theoretical results explain this asymmetry, indicating that it is a property of the intrinsic matrix structure. Motivated by this property, we argue for the benefits of freezing A as a random orthonormal matrix.

1. In the case of Figure 1(a), the B matrix does not necessarily converge to the same solution in the sense of entrywise similarity (due to the randomness in A, as you noted). Recall that we use Canonical Correlation Analysis (CCA) to measure similarity (see Appendix A for details). CCA compares subspaces rather than specific matrices; in other words, two independent runs can produce entrywise-different B's that nevertheless span similar subspaces. A sketch of this comparison is given after this list.
2. To further show that this trend stems from the matrix structure rather than the initialization, we ran new experiments swapping the initialization. The figure is available at this anonymous link: [https://anonymous.4open.science/r/Anonymous_ICML2024_rebuttal-676C/figures/swap_initialization.png]. As we can see, although the trend of the similarities of A and B also swaps, the A matrices, despite all being initialized at zero, converge as closely as in the previous setting. This experiment provides a further reason to choose A to be random.
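The subspace comparison in point 1 can be summarized with a short sketch (ours; see Appendix A and Kornblith et al., 2019 for the exact metric we use): the singular values of $U_1^\top U_2$, where $U_1, U_2$ are orthonormal bases of the two column spaces, are the cosines of the principal angles between them.

```python
import numpy as np

def subspace_similarity(B1: np.ndarray, B2: np.ndarray) -> float:
    """Mean canonical correlation between the column spaces of B1 and B2.

    Entrywise-different matrices can still score close to 1 if they span
    similar subspaces, which is exactly the effect discussed above."""
    U1, _ = np.linalg.qr(B1)   # orthonormal basis of span(B1)
    U2, _ = np.linalg.qr(B2)
    # Singular values of U1^T U2 are cosines of the principal angles.
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(s.mean())
```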
```e. Line 322-324 is weird. "This addition provides more expressive power for the same number of parameters without loss of generalization bounds" What addition?```

Using a larger $r_B$ than $r_{BA}$ is possible for the same number of parameters (the exact increase needed to keep the parameter count unchanged depends on the sizes of the matrices involved).

```f. What is $\hat{A}_V$```

Here we tried fixing the B / A matrix to be the left / right singular matrix of the pre-trained weight. Specifically, we first compute $U, S, V = \mathrm{SVD}(W_0)$, where $W_0$ is the original pretrained weight, and then freeze $A = V$. However, we find that this choice does not bring much gain over a random orthonormal matrix. Moreover, as shown in the new ablation study suggested by Reviewer 1 (Table 1.e), orthonormality appears to be the critical property of the fixed matrix in low-rank adaptation. We will add the missing details in the revised version.

```g. Line 363-364, it seems the performance is within noise? Outperform seems to be an exaggeration.```

We agree that the performance of our method on the GLUE benchmark is comparable to the baseline and will rephrase this sentence. However, as stated in our reply to your question (a), we can still observe that the average performance of freezing A is higher than LoRA in 6 out of 7 tasks. We would also like to note that the main purpose of our GLUE experiments is to confirm the asymmetry of the A and B matrices.

```h. How is Von Neumann's trace formula used in line 610?```

It turns out that there is a simpler/tighter proof making use of the Moore-Penrose pseudoinverse $A^\dagger$. Note that since $\Sigma$ is positive semidefinite, its symmetric square root exists and we can write
\begin{equation}
\mathrm{Tr}[Q \Sigma \Delta^\top \Delta \Sigma Q^\top (Q \Sigma Q^\top)^{-1}] = \mathrm{Tr}[(\Sigma^{1/2} \Delta^\top \Delta \Sigma^{1/2})(\Sigma^{1/2} Q^\top (Q \Sigma^{1/2} \Sigma^{1/2} Q^\top)^{-1} Q \Sigma^{1/2})]
\end{equation}
\begin{equation}
= \mathrm{Tr}[(\Sigma^{1/2} \Delta^\top \Delta \Sigma^{1/2})((Q \Sigma^{1/2})^\dagger (Q \Sigma^{1/2}))].
\end{equation}
Now, it can be shown that (this will be expanded in the revision)
\begin{equation}
\lim_{d/r \rightarrow \infty}\mathbb{E}[(Q \Sigma^{1/2})^\dagger (Q \Sigma^{1/2})] = r \frac{\Sigma}{\mathrm{Tr}[\Sigma]},
\end{equation}
hence
\begin{equation}
\mathbb{E}[III_A] \rightarrow r \frac{\mathrm{Tr}[\Sigma^2 \Delta^\top \Delta]}{\mathrm{Tr}[\Sigma]} = r \frac{\|\Delta \Sigma\|_F^2}{\mathrm{Tr}[\Sigma]}. \qquad (1)
\end{equation}
Recall on the other hand that
\begin{equation}
\mathbb{E}[III_B] \rightarrow \frac{r}{d} \mathrm{Tr}[\Delta \Sigma \Delta^\top].
\end{equation}
For brevity in the response, let us suppose that $\Sigma$ has rank $k > r$, allowing its smallest nonzero eigenvalue to be $\geq \mathrm{Tr}[\Sigma]/d$. Then revisiting (1) above,
\begin{equation}
r \frac{\|\Delta \Sigma\|_F^2}{\mathrm{Tr}[\Sigma]} \geq \frac{r}{d} \frac{\mathrm{Tr}[\Sigma]\, \mathrm{Tr}[\Delta \Sigma \Delta^\top]}{\mathrm{Tr}[\Sigma]} \rightarrow \mathbb{E}[III_B]
\end{equation}
and the asymmetry is established. Note furthermore that the term $III_A$ can be quite large for the right $\Sigma$; in fact, equality in
\begin{equation}
III_A \leq \mathrm{Tr}[\Delta \Sigma \Delta^\top]
\end{equation}
is achievable even with low $r$, which would yield the same performance as full fine-tuning, i.e. a loss of $d_{out} \sigma^2$. On the other hand, the expected loss when freezing $B$ is always
\begin{equation}
d_{out} \sigma^2 + \mathrm{Tr}[\Delta \Sigma \Delta^\top] \left(1 - \frac{r}{d}\right),
\end{equation}
implying that large errors must always persist when $r$ is small. We will expand this discussion in the revision and make the proof more readable.
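The pseudoinverse limit above can also be checked numerically. Below is a small Monte Carlo sketch (ours, with an arbitrary illustrative spectrum, not from the paper) comparing $\mathbb{E}[(Q\Sigma^{1/2})^\dagger(Q\Sigma^{1/2})]$ against $r\,\Sigma/\mathrm{Tr}[\Sigma]$ for $d \gg r$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, trials = 256, 4, 200
evals = rng.uniform(0.5, 2.0, size=d)        # illustrative PSD spectrum
Sigma = np.diag(evals)
Sig_half = np.diag(np.sqrt(evals))

acc = np.zeros((d, d))
for _ in range(trials):
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    Q = Q.T                                  # r x d with orthonormal rows
    M = Q @ Sig_half                         # Q Sigma^{1/2}
    acc += np.linalg.pinv(M) @ M             # projector onto the row space of M
empirical = acc / trials
predicted = r * Sigma / np.trace(Sigma)
# The relative error shrinks as d/r grows, consistent with the stated limit.
print(np.linalg.norm(empirical - predicted) / np.linalg.norm(predicted))
```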
## Response to Reviewer LHV6 (R4)

```a. Novelty```

We would like to thank Reviewer 4 for providing an engaging discussion. In fact, we are glad to learn that after reading our manuscript, Reviewer 4 found our findings, theoretical analysis, and experimental results to agree with their intuitions. We respectfully disagree with Reviewer 4's claim about the weakness of our novelty. We believe that by walking through this line of research in depth, we can better clarify the contribution of this work for Reviewer LHV6 (R4), as well as any potential reader of our paper.

- We first present the observation of the asymmetry between A and B. As we stated in our related works, **"While nearly all recent studies treat the two matrices asymmetrically..., there is a lack of formal investigation into this asymmetry in low-rank adaptation"**.
- The asymmetry between A and B has not been explored in recent LoRA papers; the term "asymmetry" is not even mentioned. We are the first to state this phenomenon and investigate it systematically. LoRA [Hu et al., 2022] chooses different initializations for A and B, without comparing against a swapped initialization.
- In fact, in the original LoRA paper (the arXiv version) [Hu et al., 2022], when the authors investigate the subspace similarity between random seeds (Section 7.2, page 11), they claim *"a similar analysis can be carried out with B and the left-singular unitary matrices"*. Does this contradict our study? No, because the analysis in [Hu et al., 2022] studies the subspace similarity between different random seeds when fine-tuning on the **same task**, whereas our asymmetry is driven by the subspace similarity between A and B learned on **different tasks**. Our study implies the existence of an *optimal subspace* for LoRA adapters.
- One might find the above observation intuitive, since it relates to the assumption that models lie in a "basin" of low loss in parameter space [Ilharco et al., 2022], and that models fine-tuned from the same checkpoint lie in the same basin [Neyshabur et al., 2020].
- One recent work [Hayou et al., 2024], which appeared after the submission of this paper, reaches a similar conclusion by applying different learning rates to A and B. However, our method still differs in terms of fewer active parameters, OOD performance, and the orthonormality of the matrices.
- Reviewer 4 refers to BYOL [Grill, Strub et al., 2020], a self-supervised learning method that has achieved great success. However, the core idea of BYOL is that using a target network that is the moving average of the online network prevents collapse. The BYOL paper does mention that a fixed, randomly initialized network can serve as a target network that prevents collapse, but this is a relatively weak baseline, as shown in Table 5(a) of [Grill, Strub et al., 2020]. Even in this case, BYOL's model structure is still quite different from LoRA's. It would be great if the reviewer could better justify how the BYOL method diminishes the contribution in the LoRA setting. Alternatively, Reviewer 4 might want to refer to dimensionality reduction with random projections [Bingham, 2001] or compressed sensing [Donoho et al., 2006].
- However, it would be risky to directly apply intuitions from the aforementioned representation learning literature to an intermediate layer of a transformer model. Consider the following case: for an intermediate linear layer $W_0 \in \mathbb{R}^{d \times d}$, we can always obtain a low-rank decomposition $W_0 = QR$, where $Q \in \mathbb{R}^{d \times r}$ and $R \in \mathbb{R}^{r \times d}$. However, it would be problematic to argue that $R$ is a *feature extractor* and $Q$ is a *classifier*. This is also the reason why we analyze the **gradient** in the nonlinear case.
- In fact, a better approach is to assume the existence of an optimal target model $W_t$, **just as we did in our analysis**. In this sense, within each layer, LoRA approximates the error matrix $\Delta = W_t - W_0$ with $BA$, where the well-known Eckart–Young theorem states that the optimal solution is the product of two orthonormal matrices; this also explains why we freeze A and B to be orthonormal matrices.
- Although the reviewer finds the theoretical contribution marginal, the first theoretical analysis of LoRA only dates back to ICLR 2024, where [Zeng et al., 2023] establish a theoretical understanding following the Eckart–Young theorem. However, their analysis assumes the rank of $\Delta = W_t - W_0$, whereas we relax this assumption.

We hope all our detailed explanations justify our contribution.
Otherwise, it would be great if the reviewer could let us know why our results seem marginal, and we will be happy to provide further clarification.

In conclusion, in this work we **first present the asymmetry phenomenon between A and B in LoRA**, and we **first establish a theoretical analysis of this phenomenon**, which implies that freezing the $A$ matrix to be a random orthonormal matrix yields better generalization performance at lower computational cost. Beyond the experimental results, our analysis implies that (1) the B matrix is more important and depends strongly on the data, and (2) given a target fine-tuning dataset, there might exist an optimal subspace for the learned LoRA matrices. We believe these conclusions will benefit various future directions such as LoRA compression, LoRA retrieval, model merging, and further PEFT methods.

```b. More details on how the experiment hyperparameters```

We are happy to provide the details of hyperparameter selection for all our experiments. We adopted the following hyperparameters for the GLUE benchmark.

| Dataset | learning rate | batch size | # epochs | alpha | target module |
|---------|---------------|------------|----------|-------|---------------|
| MNLI    | 4e-4          | 16         | 10       | 16    | query, value  |
| SST-2   | 2e-4          | 16         | 10       | 16    | query, value  |
| MRPC    | 2e-4          | 16         | 20       | 16    | query, value  |
| CoLA    | 4e-4          | 16         | 20       | 16    | query, value  |
| QNLI    | 2e-4          | 16         | 10       | 16    | query, value  |
| RTE     | 3e-4          | 16         | 20       | 16    | query, value  |
| STS-B   | 2e-4          | 16         | 10       | 16    | query, value  |

We adopted the following hyperparameters for instruction tuning the Llama-2-7B model. We selected hyperparameters from: learning rate {5e-3, 1e-4, 2e-4, 4e-4}, dropout {0.01, 0.02, 0.05}.

| learning rate | batch size | # epochs | alpha    | dropout | target modules                     |
|---------------|------------|----------|----------|---------|------------------------------------|
| 2e-4          | 16         | 3        | 2 × rank | 0.05    | {"k_proj", "q_proj", "classifier"} |
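For reference, a minimal sketch (ours, not our exact training script) of how the table above could map onto a Hugging Face `peft` configuration, with the A factors frozen afterwards to mirror our \hat{B} variant:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

rank = 32
config = LoraConfig(
    r=rank,
    lora_alpha=2 * rank,                  # alpha = 2 x rank, as in the table
    lora_dropout=0.05,
    target_modules=["k_proj", "q_proj"],  # attention projections from the table
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)

# \hat{B} variant: freeze every A factor so that only B receives gradients.
for name, param in model.named_parameters():
    if "lora_A" in name:
        param.requires_grad = False
```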
```c. More models of the Llama-2 scale```

This is a helpful suggestion. We are running the same experiments on the Mistral-7B model, and the results can be viewed at the following anonymous link: https://anonymous.4open.science/r/Anonymous_ICML2024_rebuttal-676C

-----

### Reply

We are excited to learn that you are willing to revisit the details of our manuscript, and we appreciate you taking this effort! Please don't hesitate to let us know if you have any other questions that we can clarify!

#### For Table 4

Regarding the question of hyperparameter tuning for Table 4 (DomainBed experiments): first of all, we optimize the hyperparameters only for standard LoRA and use the same hyperparameters (learning rate, rank, batch size) for the corresponding freezing methods. This enables us to fairly compare the mechanisms of LoRA training. (1) We fix the model architecture to "google/vit-base-patch16-224-in21k" for all four datasets (VLCS, PACS, OfficeHome, and TerraIncognita). (2) We fix the rank to r=8 for all four tasks. (3) We vary the learning rate over $[10^{-5}, 10^{-3.5}]$. (4) We fix the batch size to 8 across all datasets. (5) We vary the LoRA dropout over $[10^{-2}, 2.5 \times 10^{-1}]$. (6) We run three different random seeds and report the performance obtained from the best *in-domain* checkpoint. We adopt the codebase from [DomainBed](https://github.com/facebookresearch/DomainBed); specifically, we follow the hyperparameter search [strategy](https://github.com/facebookresearch/DomainBed/blob/main/domainbed/hparams_registry.py) provided in DomainBed. The **asymmetry of A and B** and the OOD generalization performance are robust to different hyperparameters. Different LoRA methods might be improved on the in-domain training distribution via a better hyperparameter search; however, we think this is not the focus of this work at this stage.

#### For Table 5, Alpaca instruction-finetuned LoRA on MMLU

Since we didn't present additional results for Table 4 during the rebuttal, we think Reviewer LHV6 might be interested in the hyperparameter search for our Table 5. As stated in our last rebuttal note, we adopted the following hyperparameters for instruction tuning the Llama-2-7B model. We selected hyperparameters from: learning rate {5e-3, 1e-4, 2e-4, 4e-4}, dropout {0.01, 0.02, 0.05}.

| learning rate | batch size | # epochs | alpha    | dropout | target modules                     |
|---------------|------------|----------|----------|---------|------------------------------------|
| 2e-4          | 16         | 3        | 2 × rank | 0.05    | {"k_proj", "q_proj", "classifier"} |

We did not retrain the Llama-2-7b model during the rebuttal. The main reason for the better results is the 5-shot prompting used during evaluation; Llama-2's report mentions that the 5-shot prompt is critical for MMLU performance. Specifically, we use the following self-defined prompt.

```python
def process_mcq(sample):
    # Format one MMLU multiple-choice sample: the question, the four
    # lettered options, and an 'Answer:' cue for the model to complete.
    prompt = 'Question: ' + sample['input'] + '\n'
    for c in ['A', 'B', 'C', 'D']:
        prompt += '. '.join([c, sample[c]]) + '\n'
    prompt += 'Answer:'
    return prompt, sample['target']
```
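For completeness, a hypothetical sketch (names such as `dev_samples` are ours) of how five solved dev-set exemplars could be concatenated ahead of the test question to form the 5-shot prompt:

```python
def build_5shot_prompt(dev_samples, test_sample):
    # Five solved exemplars followed by the unanswered test question,
    # following the usual MMLU 5-shot convention.
    shots = []
    for s in dev_samples[:5]:
        prompt, answer = process_mcq(s)
        shots.append(prompt + ' ' + answer)
    test_prompt, _ = process_mcq(test_sample)
    return '\n\n'.join(shots + [test_prompt])
```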
